adapt section about domain decomposition from paper

2021-09-03 16:59:41 -04:00
parent 6cf2aa4fbb
commit a98ded7722
7 changed files with 107 additions and 0 deletions
--- a/doc/src/Developer.rst
+++ b/doc/src/Developer.rst
@ -11,6 +11,7 @@ of time and requests from the LAMMPS user community.
   :maxdepth: 1
   Developer_org
   Developer_parallel
   Developer_flow
   Developer_write
   Developer_notes
--- a/doc/src/Developer_parallel.rst
+++ b/doc/src/Developer_parallel.rst
@ -0,0 +1,106 @@
 Parallel algorithms
 -------------------
 LAMMPS is from ground up designed to be running in parallel using the
 MPI standard with distributed data via domain decomposition.  The
 parallelization has to be efficient to enable good strong scaling (=
 good speedup for the same system) and good weak scaling (= the
 computational cost of enlarging the system is linear with the system
 size).  Additional parallelization using GPUs or OpenMP can then be
 applied within the sub-domain assigned to an MPI process.
 Partitioning
 ^^^^^^^^^^^^
 The underlying spatial decomposition strategy used by LAMMPS for
 distributed-memory parallelism is set with the :doc:`comm_style command <comm_style>`
 and can be either "brick" (a regular grid) or "tiled".
 .. _domain-decomposition:
 .. figure:: img/domain-decomp.png
   LAMMPS domain decomposition
   This figure shows the different kinds of domain decomposition used
   for MPI parallelization: "brick" on the left with an orthogonal (top)
   and a triclinic (bottom) simulation domain, and "tiled" on the right.
   The black lines show the division into sub-domains and the contained
   atoms are "owned" by the corresponding MPI process. The green dashed
   lines indicate how sub-domains are extended with "ghost" atoms up
   to the communication cutoff distance.
 The LAMMPS simulation box is a 3d or 2d volume, which can be orthogonal
 or triclinic in shape, as illustrated in the :ref:`domain-decomposition`
 figure for the 2d case.  Orthogonal means the box edges are aligned with
 the *x*, *y*, *z* Cartesian axes, and the box faces are thus all
 rectangular.  Triclinic allows for a more general parallelepiped shape
 in which edges are aligned with three arbitrary vectors and the box
 faces are parallelograms.  In each dimension box faces can be periodic,
 or non-periodic with fixed or shrink-wrapped boundaries.  In the fixed
 case, atoms which move outside the face are deleted; shrink-wrapped
 means the position of the box face adjusts continuously to enclose all
 the atoms.
 For distributed-memory MPI parallelism, the simulation box is spatially
 decomposed (partitioned) into non-overlapping sub-domains which fill the
 box. The default partitioning, "brick", is most suitable when atom
 density is roughly uniform, as shown in the left-side images of the
 :ref:`domain-decomposition` figure.  The sub-domains comprise a regular
 grid and all sub-domains are identical in size and shape.  Both the
 orthogonal and triclinic boxes can deform continuously during a
 simulation, e.g. to compress a solid or shear a liquid, in which case
 the processor sub-domains likewise deform.
 For models with non-uniform density, the number of particles per
 processor can be load-imbalanced with the default partitioning.  This
 reduces parallel efficiency, as the overall simulation rate is limited
 by the slowest processor, i.e. the one with the largest computational
 load.  For such models, LAMMPS supports multiple strategies to reduce
 the load imbalance:
 - The processor grid decomposition is by default based on the simulation
  cell volume and tries to optimize the volume to surface ratio for the sub-domains.
  This can be changed with the :doc:`processors command <processors>`.
 - The parallel planes defining the size of the sub-domains can be shifted
  with the :doc:`balance command <balance>`. Which can be done in addition
  to choosing a more optimal processor grid.
 - The recursive bisectioning algorithm in combination with the "tiled"
  communication style can produce a partitioning with equal numbers of
  particles in each sub-domain.
 .. |decomp1| image:: img/decomp-regular.png
   :width: 24%
 .. |decomp2| image:: img/decomp-processors.png
   :width: 24%
 .. |decomp3| image:: img/decomp-balance.png
   :width: 24%
 .. |decomp4| image:: img/decomp-rcb.png
   :width: 24%
 |decomp1|  |decomp2|  |decomp3|  |decomp4|
 The pictures above demonstrate the differences for a 2d system with 12 MPI ranks.
 Due to the vacuum in the system, the default decomposition is unbalanced
 with several MPI ranks without atoms (left). By forcing a 1x12x1 processor
 grid, every MPI rank does computations now, but the amount of communication
 between sub-domains is increased (center left). With a 2x6x1 processor grid and
 shifting the sub-domain divisions, the load imbalance is also reduced and
 the amount of communication required between sub-domains is less (center right).
 And using the recursive bisectioning leads to further improved decomposition (right).
 Communication
 ^^^^^^^^^^^^^
 Neighbor lists
 ^^^^^^^^^^^^^^
 Long-range interactions
 ^^^^^^^^^^^^^^^^^^^^^^^
--- a/doc/src/img/decomp-balance.png
+++ b/doc/src/img/decomp-balance.png
--- a/doc/src/img/decomp-processors.png
+++ b/doc/src/img/decomp-processors.png
--- a/doc/src/img/decomp-rcb.png
+++ b/doc/src/img/decomp-rcb.png
--- a/doc/src/img/decomp-regular.png
+++ b/doc/src/img/decomp-regular.png
--- a/doc/src/img/domain-decomp.png
+++ b/doc/src/img/domain-decomp.png