diff --git a/doc/src/Developer.rst b/doc/src/Developer.rst
index f54bc4152f..f68007486d 100644
--- a/doc/src/Developer.rst
+++ b/doc/src/Developer.rst
@@ -11,6 +11,7 @@ of time and requests from the LAMMPS user community.
    :maxdepth: 1

    Developer_org
+   Developer_parallel
    Developer_flow
    Developer_write
    Developer_notes
diff --git a/doc/src/Developer_parallel.rst b/doc/src/Developer_parallel.rst
new file mode 100644
index 0000000000..2f27b75bfc
--- /dev/null
+++ b/doc/src/Developer_parallel.rst
@@ -0,0 +1,106 @@
+Parallel algorithms
+-------------------
+
+LAMMPS is designed from the ground up to run in parallel using the MPI
+standard, with data distributed across processes via domain
+decomposition. The parallelization has to be efficient to achieve both
+good strong scaling (good speedup for the same system when using more
+processors) and good weak scaling (the computational cost of enlarging
+the system grows linearly with the system size). Additional
+parallelization using GPUs or OpenMP can then be applied within the
+sub-domain assigned to each MPI process.
+
+
+Partitioning
+^^^^^^^^^^^^
+
+The underlying spatial decomposition strategy used by LAMMPS for
+distributed-memory parallelism is set with the
+:doc:`comm_style command <comm_style>` and can be either "brick"
+(a regular grid) or "tiled".
+
+.. _domain-decomposition:
+.. figure:: img/domain-decomp.png
+
+   LAMMPS domain decomposition
+
+   This figure shows the different kinds of domain decomposition used
+   for MPI parallelization: "brick" on the left with an orthogonal (top)
+   and a triclinic (bottom) simulation domain, and "tiled" on the right.
+   The black lines show the division into sub-domains and the contained
+   atoms are "owned" by the corresponding MPI process. The green dashed
+   lines indicate how sub-domains are extended with "ghost" atoms up
+   to the communication cutoff distance.
+
+The LAMMPS simulation box is a 3d or 2d volume, which can be orthogonal
+or triclinic in shape, as illustrated in the :ref:`domain-decomposition`
+figure for the 2d case. Orthogonal means the box edges are aligned with
+the *x*, *y*, *z* Cartesian axes, and the box faces are thus all
+rectangular. Triclinic allows for a more general parallelepiped shape
+in which edges are aligned with three arbitrary vectors and the box
+faces are parallelograms. In each dimension box faces can be periodic,
+or non-periodic with fixed or shrink-wrapped boundaries. In the fixed
+case, atoms which move outside the face are deleted; shrink-wrapped
+means the position of the box face adjusts continuously to enclose all
+the atoms.
+
+For distributed-memory MPI parallelism, the simulation box is spatially
+decomposed (partitioned) into non-overlapping sub-domains which fill the
+box. The default partitioning, "brick", is most suitable when the atom
+density is roughly uniform, as shown in the left-side images of the
+:ref:`domain-decomposition` figure. The sub-domains comprise a regular
+grid and all sub-domains are identical in size and shape. Both the
+orthogonal and triclinic boxes can deform continuously during a
+simulation, e.g. to compress a solid or shear a liquid, in which case
+the processor sub-domains likewise deform.
+
+For models with non-uniform density, the default partitioning can lead
+to significantly different numbers of particles per processor. This
+load imbalance reduces parallel efficiency, as the overall simulation
+rate is limited by the slowest processor, i.e. the one with the largest
+computational load. For such models, LAMMPS supports multiple
+strategies to reduce the load imbalance:
+
+- By default, the processor grid is chosen based on the dimensions of
+  the simulation cell so that the volume-to-surface ratio of the
+  sub-domains is optimized. This choice can be overridden with the
+  :doc:`processors command <processors>`.
+- The parallel planes defining the boundaries of the sub-domains can be
+  shifted with the :doc:`balance command <balance>`. This can be done
+  in addition to choosing a more suitable processor grid.
+- The recursive bisectioning algorithm, in combination with the "tiled"
+  communication style, can produce a partitioning with an equal number
+  of particles in each sub-domain.
+
+
+.. |decomp1| image:: img/decomp-regular.png
+   :width: 24%
+
+.. |decomp2| image:: img/decomp-processors.png
+   :width: 24%
+
+.. |decomp3| image:: img/decomp-balance.png
+   :width: 24%
+
+.. |decomp4| image:: img/decomp-rcb.png
+   :width: 24%
+
+|decomp1| |decomp2| |decomp3| |decomp4|
+
+The pictures above illustrate the differences for a 2d system with 12
+MPI ranks. Due to the vacuum in the system, the default decomposition
+is unbalanced and leaves several MPI ranks without any atoms (left).
+Forcing a 1x12x1 processor grid assigns computational work to every MPI
+rank, but increases the amount of communication between sub-domains
+(center left). A 2x6x1 processor grid combined with shifting the
+sub-domain divisions also reduces the load imbalance, while requiring
+less communication between sub-domains (center right). Using the
+recursive bisectioning algorithm with the "tiled" communication style
+improves the decomposition even further (right).
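+
+As a rough illustration, the load-balancing strategies shown above can
+be selected with input script commands along the following lines. The
+processor grid matches the 2x6x1 example from the pictures; the
+imbalance thresholds and the number of iterations are only placeholder
+values that would need to be adapted to an actual system:
+
+.. code-block:: LAMMPS
+
+   # request a 2x6x1 grid of sub-domains instead of the default choice;
+   # the processors command must come before the simulation box is created
+   processors     2 6 1
+
+   # shift the sub-domain boundaries in x and y, iterating up to 20 times
+   # or until the max/average load ratio drops below 1.1
+   balance        1.0 shift xy 20 1.1
+
+   # alternative: recursive bisectioning of the particles, which requires
+   # the "tiled" communication style
+   comm_style     tiled
+   balance        1.0 rcb
+
+For systems where the density distribution changes during a run, the
+same balancing styles are also available through the
+:doc:`fix balance <fix_balance>` command, which rebalances the
+sub-domains periodically during a simulation.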
+
+
+Communication
+^^^^^^^^^^^^^
+
+Neighbor lists
+^^^^^^^^^^^^^^
+
+Long-range interactions
+^^^^^^^^^^^^^^^^^^^^^^^
+
diff --git a/doc/src/img/decomp-balance.png b/doc/src/img/decomp-balance.png
new file mode 100644
index 0000000000..7ac72c5114
Binary files /dev/null and b/doc/src/img/decomp-balance.png differ
diff --git a/doc/src/img/decomp-processors.png b/doc/src/img/decomp-processors.png
new file mode 100644
index 0000000000..7fa9e98d6c
Binary files /dev/null and b/doc/src/img/decomp-processors.png differ
diff --git a/doc/src/img/decomp-rcb.png b/doc/src/img/decomp-rcb.png
new file mode 100644
index 0000000000..03ab2a1395
Binary files /dev/null and b/doc/src/img/decomp-rcb.png differ
diff --git a/doc/src/img/decomp-regular.png b/doc/src/img/decomp-regular.png
new file mode 100644
index 0000000000..7607a42737
Binary files /dev/null and b/doc/src/img/decomp-regular.png differ
diff --git a/doc/src/img/domain-decomp.png b/doc/src/img/domain-decomp.png
new file mode 100644
index 0000000000..d1c2f268d6
Binary files /dev/null and b/doc/src/img/domain-decomp.png differ