adapt section about domain decomposition from paper
doc/src/Developer_parallel.rst
@@ -0,0 +1,106 @@

Parallel algorithms
-------------------

LAMMPS is designed from the ground up to run in parallel using the
MPI standard, with data distributed via domain decomposition. The
parallelization has to be efficient to enable both good strong scaling
(= good speedup for the same system when run on more processors) and
good weak scaling (= the computational cost of enlarging the system
grows linearly with the system size). Additional parallelization using
GPUs or OpenMP can then be applied within the sub-domain assigned to
each MPI process.
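
The following is a minimal, self-contained sketch of this hybrid
pattern (an illustration only, not code from the LAMMPS sources): each
MPI rank works on the data it owns, an OpenMP loop adds thread
parallelism within that data, and a reduction combines the per-rank
results.

.. code-block:: c++

   #include <mpi.h>
   #include <vector>
   #include <cstdio>

   int main(int argc, char **argv)
   {
     MPI_Init(&argc, &argv);
     int rank, nprocs;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

     // each MPI rank stores and works on only the atoms of its sub-domain
     std::vector<double> owned(1000, 0.1 * (rank + 1));

     // additional OpenMP thread parallelism *within* the sub-domain
     double local = 0.0;
     #pragma omp parallel for reduction(+:local)
     for (long i = 0; i < (long) owned.size(); ++i)
       local += owned[i];

     // combine the per-rank partial results across all sub-domains
     double global = 0.0;
     MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

     if (rank == 0) printf("total over %d ranks: %g\n", nprocs, global);
     MPI_Finalize();
     return 0;
   }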

Partitioning
^^^^^^^^^^^^

The underlying spatial decomposition strategy used by LAMMPS for
distributed-memory parallelism is set with the :doc:`comm_style command <comm_style>`
and can be either "brick" (a regular grid) or "tiled".

.. _domain-decomposition:
.. figure:: img/domain-decomp.png

   LAMMPS domain decomposition

This figure shows the different kinds of domain decomposition used
for MPI parallelization: "brick" on the left with an orthogonal (top)
and a triclinic (bottom) simulation domain, and "tiled" on the right.
The black lines show the division into sub-domains; the atoms contained
in each sub-domain are "owned" by the corresponding MPI process. The
green dashed lines indicate how the sub-domains are extended with
"ghost" atoms up to the communication cutoff distance.
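
As a simplified illustration of the ghost atom setup (a sketch only,
not the actual LAMMPS exchange code; all names are invented), each rank
can select which of its owned atoms lie within the communication cutoff
of a sub-domain face and thus must be copied to the neighboring rank as
ghost atoms:

.. code-block:: c++

   #include <vector>

   struct Atom { double x, y, z; };

   // Collect indices of owned atoms that lie within "cutoff" of the lower
   // or upper x-face of this rank's sub-domain [sublo, subhi).  These are
   // the atoms that become ghost atoms of the neighboring sub-domains.
   std::vector<int> ghost_candidates_x(const std::vector<Atom> &owned,
                                       double sublo, double subhi,
                                       double cutoff)
   {
     std::vector<int> send_list;
     for (int i = 0; i < (int) owned.size(); ++i) {
       if (owned[i].x < sublo + cutoff || owned[i].x >= subhi - cutoff)
         send_list.push_back(i);
     }
     return send_list;
   }

A full exchange repeats this selection for the other face directions;
processing the dimensions one after another also forwards atoms near
edges and corners to the diagonal neighbor sub-domains.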

The LAMMPS simulation box is a 3d or 2d volume, which can be orthogonal
or triclinic in shape, as illustrated in the :ref:`domain-decomposition`
figure for the 2d case. Orthogonal means the box edges are aligned with
the *x*, *y*, *z* Cartesian axes, and the box faces are thus all
rectangular. Triclinic allows for a more general parallelepiped shape
in which the edges are aligned with three arbitrary vectors and the box
faces are parallelograms. In each dimension the box faces can be
periodic, or non-periodic with fixed or shrink-wrapped boundaries. In
the fixed case, atoms that move outside the face are deleted;
shrink-wrapped means the position of the box face adjusts continuously
to enclose all the atoms.
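
The three boundary treatments can be made concrete with a small sketch
(illustrative only, not LAMMPS code; the function names are invented)
of what each one means for an atom coordinate in one dimension:

.. code-block:: c++

   #include <algorithm>
   #include <cmath>
   #include <limits>

   // Periodic: wrap the coordinate back into the box [lo, hi).
   double wrap_periodic(double x, double lo, double hi)
   {
     const double L = hi - lo;
     return x - L * std::floor((x - lo) / L);
   }

   // Fixed: the face does not move; an atom outside the box is deleted
   // (this sketch merely flags it).
   bool outside_fixed(double x, double lo, double hi)
   {
     return (x < lo || x >= hi);
   }

   // Shrink-wrapped: the face positions are recomputed so that the box
   // encloses all current atom positions.
   void shrink_wrap(const double *x, int n, double &lo, double &hi)
   {
     lo = std::numeric_limits<double>::max();
     hi = std::numeric_limits<double>::lowest();
     for (int i = 0; i < n; ++i) {
       lo = std::min(lo, x[i]);
       hi = std::max(hi, x[i]);
     }
   }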

For distributed-memory MPI parallelism, the simulation box is spatially
decomposed (partitioned) into non-overlapping sub-domains which fill the
box. The default partitioning, "brick", is most suitable when atom
density is roughly uniform, as shown in the left-side images of the
:ref:`domain-decomposition` figure. The sub-domains comprise a regular
grid and all sub-domains are identical in size and shape. Both the
orthogonal and triclinic boxes can deform continuously during a
simulation, e.g. to compress a solid or shear a liquid, in which case
the processor sub-domains likewise deform.
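
A minimal sketch of such a regular "brick" decomposition for an
orthogonal box is shown below. This is not the LAMMPS implementation:
the box size is hard-coded and the processor grid is factored with
MPI_Dims_create rather than with the surface-optimizing heuristic
described below.

.. code-block:: c++

   #include <mpi.h>
   #include <cstdio>

   int main(int argc, char **argv)
   {
     MPI_Init(&argc, &argv);
     int rank, nprocs;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

     // factor the total rank count into a 3d processor grid
     int dims[3] = {0, 0, 0};
     MPI_Dims_create(nprocs, 3, dims);

     // my location in the processor grid (row-major ordering)
     int me[3];
     me[0] = rank / (dims[1] * dims[2]);
     me[1] = (rank / dims[2]) % dims[1];
     me[2] = rank % dims[2];

     // orthogonal box bounds; every sub-domain gets an identical slice
     const double boxlo[3] = {0.0, 0.0, 0.0};
     const double boxhi[3] = {40.0, 40.0, 40.0};
     double sublo[3], subhi[3];
     for (int d = 0; d < 3; ++d) {
       const double dx = (boxhi[d] - boxlo[d]) / dims[d];
       sublo[d] = boxlo[d] + me[d] * dx;
       subhi[d] = boxlo[d] + (me[d] + 1) * dx;
     }

     printf("rank %d owns [%g,%g] x [%g,%g] x [%g,%g]\n", rank,
            sublo[0], subhi[0], sublo[1], subhi[1], sublo[2], subhi[2]);

     MPI_Finalize();
     return 0;
   }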

For models with non-uniform density, the number of particles per
processor can be load-imbalanced with the default partitioning. This
reduces parallel efficiency, as the overall simulation rate is limited
by the slowest processor, i.e. the one with the largest computational
load. For such models, LAMMPS supports multiple strategies to reduce
the load imbalance:

- The processor grid decomposition is by default based on the simulation
  cell volume and tries to optimize the volume to surface ratio of the
  sub-domains. This can be changed with the :doc:`processors command <processors>`.
- The parallel planes defining the boundaries of the sub-domains can be
  shifted with the :doc:`balance command <balance>`. This can be done in
  addition to choosing a more optimal processor grid.
- The recursive bisectioning algorithm in combination with the "tiled"
  communication style can produce a partitioning with equal numbers of
  particles in each sub-domain (see the sketch below).
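
Below is a greatly simplified, serial sketch of recursive coordinate
bisection in 2d (not the parallel RCB code used by the "tiled" style;
all names are invented): at each level the point set is split at the
median along its longer extent, and the two halves are partitioned
recursively until the requested number of parts is reached.

.. code-block:: c++

   #include <algorithm>
   #include <vector>

   struct Point { double x, y; };

   // Assign "nparts" consecutive part ids, starting at "first", to the
   // points in pts[begin..end).  "part" must have the same size as "pts".
   void rcb(std::vector<Point> &pts, int begin, int end,
            int first, int nparts, std::vector<int> &part)
   {
     if (nparts == 1 || end - begin < 2) {
       for (int i = begin; i < end; ++i) part[i] = first;
       return;
     }

     using Cmp = bool (*)(const Point &, const Point &);
     Cmp xless = [](const Point &a, const Point &b) { return a.x < b.x; };
     Cmp yless = [](const Point &a, const Point &b) { return a.y < b.y; };

     // split along the dimension with the larger spatial extent
     auto xr = std::minmax_element(pts.begin() + begin, pts.begin() + end, xless);
     auto yr = std::minmax_element(pts.begin() + begin, pts.begin() + end, yless);
     Cmp cmp = (xr.second->x - xr.first->x >= yr.second->y - yr.first->y)
                 ? xless : yless;

     // each half receives points in proportion to its number of parts
     int nleft = nparts / 2;
     int mid = begin + (int) ((long) (end - begin) * nleft / nparts);
     std::nth_element(pts.begin() + begin, pts.begin() + mid,
                      pts.begin() + end, cmp);

     rcb(pts, begin, mid, first, nleft, part);
     rcb(pts, mid, end, first + nleft, nparts - nleft, part);
   }

Each recursion level corresponds to one set of cutting planes, and the
resulting rectangular tiles are the kind of decomposition shown on the
right-hand side of the figure above.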

.. |decomp1| image:: img/decomp-regular.png
   :width: 24%
.. |decomp2| image:: img/decomp-processors.png
   :width: 24%
.. |decomp3| image:: img/decomp-balance.png
   :width: 24%
.. |decomp4| image:: img/decomp-rcb.png
   :width: 24%

|decomp1| |decomp2| |decomp3| |decomp4|

The pictures above demonstrate the differences for a 2d system with 12 MPI ranks.
Due to the vacuum in the system, the default decomposition is unbalanced,
with several MPI ranks owning no atoms at all (left). Forcing a 1x12x1
processor grid gives every MPI rank some computational work, but increases
the amount of communication between sub-domains (center left). With a 2x6x1
processor grid and shifted sub-domain divisions, the load imbalance is
reduced and less communication between sub-domains is required (center
right). Using recursive bisectioning improves the decomposition even
further (right).
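
The imbalance can be quantified as the ratio of the maximum per-rank
load to the average load, where 1.0 means perfectly balanced. A small
sketch of such a measurement (not LAMMPS code; the load here is simply
the number of owned atoms) could look like this:

.. code-block:: c++

   #include <mpi.h>

   // Load imbalance factor: the most loaded rank divided by the
   // average load.  A value of 1.0 means perfectly balanced.
   double imbalance_factor(int nlocal, MPI_Comm comm)
   {
     int nprocs;
     MPI_Comm_size(comm, &nprocs);

     double load = (double) nlocal;
     double maxload, sumload;
     MPI_Allreduce(&load, &maxload, 1, MPI_DOUBLE, MPI_MAX, comm);
     MPI_Allreduce(&load, &sumload, 1, MPI_DOUBLE, MPI_SUM, comm);

     const double avg = sumload / nprocs;
     return (avg > 0.0) ? maxload / avg : 1.0;
   }

For the vacuum example above, the default grid yields a factor well
above 1.0 because some ranks own no atoms, while the shifted and RCB
decompositions bring it back toward 1.0.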

Communication
^^^^^^^^^^^^^

Neighbor lists
^^^^^^^^^^^^^^

Long-range interactions
^^^^^^^^^^^^^^^^^^^^^^^