adapt section about domain decomposition from paper

Axel Kohlmeyer
2021-09-03 16:59:41 -04:00
parent 6cf2aa4fbb
commit a98ded7722
7 changed files with 107 additions and 0 deletions


@@ -11,6 +11,7 @@ of time and requests from the LAMMPS user community.
:maxdepth: 1
Developer_org
Developer_parallel
Developer_flow
Developer_write
Developer_notes


@@ -0,0 +1,106 @@
Parallel algorithms
-------------------

LAMMPS is designed from the ground up to run in parallel using the MPI
standard, with data distributed across processes via domain
decomposition.  The parallelization has to be efficient to enable good
strong scaling (i.e. a good speedup for a fixed system size) and good
weak scaling (i.e. the computational cost grows only linearly when the
system is enlarged).  Additional parallelization using GPUs or OpenMP
threads can then be applied within the sub-domain assigned to each MPI
process.
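
As a point of reference, the two scaling regimes can be quantified as
follows (these are the generic textbook definitions, not something
specific to the LAMMPS source code):

.. math::

   S_\text{strong}(P) = \frac{T(N,1)}{T(N,P)}, \qquad
   E_\text{weak}(P) = \frac{T(N,1)}{T(P \cdot N,P)}

where :math:`T(N,P)` is the wall-clock time for a system of :math:`N`
atoms run on :math:`P` MPI processes.  Ideal strong scaling corresponds
to :math:`S_\text{strong}(P) = P`, ideal weak scaling to
:math:`E_\text{weak}(P) = 1`.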

Partitioning
^^^^^^^^^^^^

The underlying spatial decomposition strategy used by LAMMPS for
distributed-memory parallelism is set with the
:doc:`comm_style command <comm_style>` and can be either "brick"
(a regular grid of sub-domains) or "tiled".

.. _domain-decomposition:
.. figure:: img/domain-decomp.png

   LAMMPS domain decomposition

   This figure shows the different kinds of domain decomposition used
   for MPI parallelization: "brick" on the left with an orthogonal (top)
   and a triclinic (bottom) simulation domain, and "tiled" on the right.
   The black lines show the division into sub-domains; the atoms within
   each sub-domain are "owned" by the corresponding MPI process.  The
   green dashed lines indicate how sub-domains are extended with "ghost"
   atoms up to the communication cutoff distance.
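
To illustrate the "ghost" atom concept from the figure, here is a
stand-alone C++ sketch.  It is not taken from the LAMMPS sources (the
type and function names are made up for this example), it assumes an
orthogonal sub-domain, and it ignores periodic images; it only shows
which locally owned atoms fall within the communication cutoff of a
neighboring sub-domain and would therefore be sent to that neighbor as
ghost atoms:

.. code-block:: c++

   #include <vector>

   struct Atom { double x, y, z; };

   // axis-aligned bounds of an orthogonal sub-domain
   struct SubDomain { double lo[3], hi[3]; };

   // Return the indices of owned atoms that lie inside the neighbor's
   // sub-domain extended by the communication cutoff in all directions.
   // These are the atoms that would be copied to the neighbor as ghosts.
   // (Periodic boundary images are ignored in this simplified sketch.)
   std::vector<int> ghost_candidates(const std::vector<Atom> &owned,
                                     const SubDomain &neighbor,
                                     double cutoff)
   {
     std::vector<int> send;
     for (int i = 0; i < (int) owned.size(); ++i) {
       const double p[3] = {owned[i].x, owned[i].y, owned[i].z};
       bool inside = true;
       for (int d = 0; d < 3; ++d)
         if (p[d] < neighbor.lo[d] - cutoff || p[d] > neighbor.hi[d] + cutoff)
           inside = false;
       if (inside) send.push_back(i);
     }
     return send;
   }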

The LAMMPS simulation box is a 3d or 2d volume, which can be orthogonal
or triclinic in shape, as illustrated in the :ref:`domain-decomposition`
figure for the 2d case.  Orthogonal means the box edges are aligned with
the *x*, *y*, *z* Cartesian axes, and the box faces are thus all
rectangular.  Triclinic allows for a more general parallelepiped shape
in which the edges are aligned with three arbitrary vectors and the box
faces are parallelograms.  In each dimension, box faces can be periodic,
or non-periodic with fixed or shrink-wrapped boundaries.  In the fixed
case, atoms which move outside the face are deleted; shrink-wrapped
means the position of the box face adjusts continuously to enclose all
the atoms.
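
For reference, a triclinic cell can be expressed through three edge
vectors.  LAMMPS stores these in a restricted form with tilt factors
*xy*, *xz*, and *yz* (the full conventions are described in the user
part of the manual; the form given below is only a reminder):

.. math::

   \mathbf{a} = (x_{hi} - x_{lo}, 0, 0), \quad
   \mathbf{b} = (xy, y_{hi} - y_{lo}, 0), \quad
   \mathbf{c} = (xz, yz, z_{hi} - z_{lo})

Setting all three tilt factors to zero recovers the orthogonal box.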

For distributed-memory MPI parallelism, the simulation box is spatially
decomposed (partitioned) into non-overlapping sub-domains which fill the
box.  The default partitioning, "brick", is most suitable when the atom
density is roughly uniform, as shown in the left-side images of the
:ref:`domain-decomposition` figure.  The sub-domains form a regular grid
and all sub-domains are identical in size and shape.  Both orthogonal
and triclinic boxes can deform continuously during a simulation, e.g. to
compress a solid or shear a liquid, in which case the processor
sub-domains likewise deform.
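
To make the regular-grid idea concrete, the following stand-alone C++
sketch maps an atom position to the sub-domain that owns it in a
Px-by-Py-by-Pz processor grid.  This is not code from the LAMMPS
sources: the function name, the row-major rank ordering, and the
restriction to an orthogonal, non-deforming box are simplifying
assumptions made only for illustration.

.. code-block:: c++

   #include <algorithm>
   #include <cmath>

   // Map a position inside an orthogonal box [boxlo, boxhi) to the rank
   // owning the corresponding sub-domain of a regular px-by-py-by-pz
   // processor grid.  Simplified illustration only: no triclinic boxes,
   // no periodic wrapping, no shrink-wrapped boundaries.
   int owning_rank(const double pos[3], const double boxlo[3],
                   const double boxhi[3], const int pgrid[3])
   {
     int idx[3];
     for (int d = 0; d < 3; ++d) {
       double frac = (pos[d] - boxlo[d]) / (boxhi[d] - boxlo[d]);
       idx[d] = std::min(pgrid[d] - 1,
                         std::max(0, (int) std::floor(frac * pgrid[d])));
     }
     // row-major ordering of the grid: x varies fastest, then y, then z
     return idx[0] + pgrid[0] * (idx[1] + pgrid[1] * idx[2]);
   }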

For models with non-uniform density, the number of particles per
processor can become imbalanced with the default partitioning.  This
reduces parallel efficiency, since the overall simulation rate is
limited by the slowest processor, i.e. the one with the largest
computational load.  For such models, LAMMPS supports several strategies
to reduce the load imbalance:

- The processor grid decomposition is by default chosen based on the
  shape of the simulation cell and tries to optimize the volume to
  surface ratio of the sub-domains.  This choice can be overridden with
  the :doc:`processors command <processors>`.
- The parallel planes defining the boundaries of the sub-domains can be
  shifted with the :doc:`balance command <balance>`, in addition to
  choosing a more optimal processor grid.
- The recursive bisection algorithm, in combination with the "tiled"
  communication style, can produce a partitioning with equal numbers of
  particles in each sub-domain (a simplified sketch of this algorithm is
  shown after this list).
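
The following stand-alone C++ sketch illustrates the basic idea of
recursive coordinate bisection.  It is not the RCB code used by LAMMPS:
it works on 2d points only, assumes the number of partitions is a power
of two, and uses made-up names.

.. code-block:: c++

   #include <algorithm>
   #include <vector>

   struct Point { double x, y; };

   // Simplified 2d recursive coordinate bisection: split the point range
   // [first,last) along its larger spatial extent so that both halves
   // hold equal numbers of points, and recurse until the requested
   // number of partitions (assumed to be a power of two) is reached.
   // The points are reordered in place; afterwards owner[i] holds the
   // partition id of pts[i].  The caller must size owner to pts.size().
   void rcb(std::vector<Point> &pts, int first, int last, int nparts,
            int partid, std::vector<int> &owner)
   {
     if (nparts == 1) {
       for (int i = first; i < last; ++i) owner[i] = partid;
       return;
     }

     // choose the dimension with the larger extent of this point subset
     using Cmp = bool (*)(const Point &, const Point &);
     Cmp xless = [](const Point &a, const Point &b) { return a.x < b.x; };
     Cmp yless = [](const Point &a, const Point &b) { return a.y < b.y; };
     auto xmm = std::minmax_element(pts.begin() + first, pts.begin() + last, xless);
     auto ymm = std::minmax_element(pts.begin() + first, pts.begin() + last, yless);
     const bool splitx =
       (xmm.second->x - xmm.first->x) >= (ymm.second->y - ymm.first->y);

     // bisect at the median so both halves contain equal numbers of points
     const int mid = first + (last - first) / 2;
     std::nth_element(pts.begin() + first, pts.begin() + mid,
                      pts.begin() + last, splitx ? xless : yless);

     rcb(pts, first, mid, nparts / 2, 2 * partid, owner);
     rcb(pts, mid, last, nparts / 2, 2 * partid + 1, owner);
   }

The actual RCB implementation used by LAMMPS is more general (3d
domains, arbitrary processor counts, optional weighting), but it follows
the same divide-at-the-median idea.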

.. |decomp1| image:: img/decomp-regular.png
   :width: 24%

.. |decomp2| image:: img/decomp-processors.png
   :width: 24%

.. |decomp3| image:: img/decomp-balance.png
   :width: 24%

.. |decomp4| image:: img/decomp-rcb.png
   :width: 24%

|decomp1| |decomp2| |decomp3| |decomp4|

The images above demonstrate the differences for a 2d system with 12 MPI
ranks.  Due to the vacuum in the system, the default decomposition is
unbalanced and leaves several MPI ranks without any atoms (left).  By
forcing a 1x12x1 processor grid, every MPI rank now performs
computations, but the amount of communication between sub-domains is
increased (center left).  With a 2x6x1 processor grid and shifted
sub-domain divisions, the load imbalance is reduced while less
communication between sub-domains is required (center right).  Using
recursive bisection with the "tiled" communication style improves the
decomposition even further (right).

Communication
^^^^^^^^^^^^^

Neighbor lists
^^^^^^^^^^^^^^

Long-range interactions
^^^^^^^^^^^^^^^^^^^^^^^

(Five new binary image files added under doc/src/img/, including
doc/src/img/decomp-rcb.png; binary contents not shown.)