Long-range interactions
^^^^^^^^^^^^^^^^^^^^^^^

For charged systems, LAMMPS can compute long-range Coulombic
interactions via the FFT-based particle-particle/particle-mesh (PPPM)
method implemented in :doc:`kspace style pppm and its variants
<kspace_style>`. For this, the Coulombic interactions are partitioned
into short- and long-range components. The short-range portion is
computed in real space as a loop over pairs of charges within a cutoff
distance, using neighbor lists. The long-range portion is computed in
reciprocal space using a kspace style. In the PPPM implementation, the
simulation cell is overlaid with a regular 3d FFT grid and the
computation proceeds in several stages:

a) each atom's point charge is interpolated to nearby FFT grid points,
b) a forward 3d FFT is performed,
c) a convolution operation is performed in reciprocal space,
d) one or more inverse 3d FFTs are performed, and
e) electric field values from grid points near each atom are
   interpolated to compute its forces.

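As a concrete illustration of stage a), here is a self-contained C++
sketch that spreads point charges onto a periodic 3d grid with a simple
2-point (cloud-in-cell) stencil per dimension. It is a minimal,
hypothetical example; the actual PPPM implementation uses configurable
higher-order stencils and LAMMPS' internal distributed grid data
structures.

.. code-block:: c++

   // Minimal sketch: interpolate point charges to a periodic 3d FFT grid
   // using a 2-point (cloud-in-cell) stencil per dimension.  Illustrative
   // only; PPPM in LAMMPS uses higher-order stencils and distributed grids.
   #include <cstdio>
   #include <cmath>
   #include <vector>

   int main() {
     const int nx = 8, ny = 8, nz = 8;              // FFT grid size
     const double lx = 10.0, ly = 10.0, lz = 10.0;  // orthogonal box lengths
     std::vector<double> rho(nx * ny * nz, 0.0);    // charge density grid

     // two example point charges: {x, y, z, q}
     const double atoms[2][4] = {{1.3, 2.7, 9.1, 1.0}, {5.0, 5.0, 5.0, -1.0}};

     for (const auto &a : atoms) {
       // fractional grid coordinates of the charge
       double gx = a[0] / lx * nx, gy = a[1] / ly * ny, gz = a[2] / lz * nz;
       int ix = (int) std::floor(gx), iy = (int) std::floor(gy), iz = (int) std::floor(gz);
       double fx = gx - ix, fy = gy - iy, fz = gz - iz;  // offsets within the cell

       // spread the charge over the 8 surrounding grid points with linear
       // weights; wrap indices for periodic boundaries
       for (int dz = 0; dz <= 1; ++dz)
         for (int dy = 0; dy <= 1; ++dy)
           for (int dx = 0; dx <= 1; ++dx) {
             double w = (dx ? fx : 1.0 - fx) * (dy ? fy : 1.0 - fy) * (dz ? fz : 1.0 - fz);
             int i = (ix + dx) % nx, j = (iy + dy) % ny, k = (iz + dz) % nz;
             rho[(k * ny + j) * nx + i] += a[3] * w;
           }
     }

     // the total grid charge should equal the sum of point charges (here 0.0)
     double qsum = 0.0;
     for (double q : rho) qsum += q;
     std::printf("total charge on grid: %g\n", qsum);
     return 0;
   }
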
For any of the spatial-decomposition partitioning schemes, each
processor owns the brick-shaped portion of FFT grid points contained
within its sub-domain. The two interpolation operations use a stencil
of grid points surrounding each atom. To accommodate the stencil size,
each processor also stores a few layers of ghost grid points
surrounding its brick. Forward and reverse communication of grid point
values is performed similarly to the corresponding :doc:`atom data
communication <Developer_par_comm>`. In this case, electric field
values on owned grid points are sent to neighboring processors to
become ghost point values. Likewise, charge values on ghost points are
sent to neighboring processors and summed into the values on owned
points.

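The sketch below illustrates the reverse communication step for charge
values in a deliberately simplified setting: a 1d decomposition of the
grid along *x* with a single ghost plane per side. Each rank sends the
charge accumulated on its ghost planes to its neighbors, which sum it
into their owned planes; forward communication of electric field values
is the mirror image (owned boundary planes are copied out to become the
neighbors' ghosts). This is a hedged, minimal example, not the LAMMPS
grid communication code, which handles full 3d bricks and multiple
ghost layers.

.. code-block:: c++

   // Minimal sketch of "reverse" grid communication: ghost-plane charge
   // values are sent to the neighboring rank and summed into its owned
   // grid points.  1d decomposition along x, one ghost plane per side,
   // periodic boundaries.  Compile with mpicxx; illustrative only.
   #include <mpi.h>
   #include <vector>
   #include <cstdio>

   int main(int argc, char **argv) {
     MPI_Init(&argc, &argv);
     int rank, nprocs;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

     const int nyz = 8 * 8;        // grid points in one yz plane
     const int nxlocal = 4;        // owned x planes per rank
     // plane 0 and plane nxlocal+1 are ghosts; planes 1..nxlocal are owned
     std::vector<double> rho((nxlocal + 2) * nyz, 0.0);

     // ... charge interpolation would fill owned AND ghost planes here ...
     for (double &v : rho) v = 1.0;   // dummy values for the sketch

     int left  = (rank - 1 + nprocs) % nprocs;
     int right = (rank + 1) % nprocs;
     std::vector<double> recvbuf(nyz);

     // send my right ghost plane to the right neighbor; receive the left
     // neighbor's right ghost plane and sum it into my first owned plane
     MPI_Sendrecv(&rho[(nxlocal + 1) * nyz], nyz, MPI_DOUBLE, right, 0,
                  recvbuf.data(), nyz, MPI_DOUBLE, left, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     for (int i = 0; i < nyz; ++i) rho[1 * nyz + i] += recvbuf[i];

     // send my left ghost plane to the left neighbor; receive the right
     // neighbor's left ghost plane and sum it into my last owned plane
     MPI_Sendrecv(&rho[0], nyz, MPI_DOUBLE, left, 1,
                  recvbuf.data(), nyz, MPI_DOUBLE, right, 1,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     for (int i = 0; i < nyz; ++i) rho[nxlocal * nyz + i] += recvbuf[i];

     if (rank == 0) std::printf("reverse ghost communication done\n");
     MPI_Finalize();
     return 0;
   }
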
For triclinic simulation boxes, the FFT grid planes are parallel to the
box faces, but the mapping of charge and electric field values to/from
grid points is done in reduced coordinates, where the tilted box is
conceptually a unit cube, so that the stencil and FFT operations are
unchanged. However, the FFT grid size required for a given accuracy is
larger for triclinic domains than it is for orthogonal boxes.

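As a worked sketch of that mapping (the symbols here are illustrative,
not LAMMPS variable names): with :math:`\mathbf{h}` the 3x3 matrix of
(possibly tilted) box edge vectors and :math:`\vec{r}_{lo}` the lower
box corner, a particle position :math:`\vec{r}` maps to reduced
coordinates and then to grid indices as

.. math::

   \vec{s} = \mathbf{h}^{-1} \, (\vec{r} - \vec{r}_{lo}), \qquad
   0 \le s_x, s_y, s_z < 1, \qquad
   i_x = \lfloor s_x N_x \rfloor, \;
   i_y = \lfloor s_y N_y \rfloor, \;
   i_z = \lfloor s_z N_z \rfloor

where :math:`N_x \times N_y \times N_z` is the FFT grid size.
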
.. _fft-parallel:

.. figure:: img/fft-decomp-parallel.png
   :align: center

   Parallel FFT in PPPM

   Stages of a parallel FFT for a simulation domain overlaid with an
   8x8x8 3d FFT grid, partitioned across 64 processors. Within each of
   the 4 diagrams, grid cells of the same color are owned by a single
   processor; for simplicity only cells owned by 4 or 8 of the 64
   processors are colored. The two images on the left illustrate
   brick-to-pencil communication. The two images on the right
   illustrate pencil-to-pencil communication, which in this case
   transposes the *y* and *z* dimensions of the grid.

Parallel 3d FFTs require substantial communication relative to their
computational cost. A 3d FFT is implemented as a series of 1d FFTs
along the *x*-, *y*-, and *z*-directions of the FFT grid. Thus the FFT
grid cannot be decomposed in 3 dimensions like the atoms for parallel
processing of the FFTs, but only in 1 dimension (as planes) or 2
dimensions (as pencils). Between these steps the grid must be
transposed, so that the portion of the FFT grid "owned" by each MPI
process is complete in the direction of the 1d FFTs it has to perform.
LAMMPS uses the pencil-decomposition algorithm shown in the
:ref:`fft-parallel` figure.

Initially (far left), each processor owns a brick of same-color grid
cells (actually grid points) contained within its sub-domain. A
brick-to-pencil communication operation converts this layout to 1d
pencils in the *x* dimension (center left). Again, cells of the same
color are owned by the same processor. Each processor can then compute
a 1d FFT on each pencil of data it wholly owns using a call to the
configured FFT library. A pencil-to-pencil communication then converts
this layout to pencils in the *y* dimension (center right), which
effectively transposes the *x* and *y* dimensions of the grid, followed
by 1d FFTs in *y*. A final transpose of pencils from *y* to *z* (far
right), followed by 1d FFTs in *z*, completes the forward FFT. The
data is left in a *z*-pencil layout for the convolution operation. One
or more inverse FFTs then perform the sequence of 1d FFTs and
communication steps in reverse order; the final layout of the resulting
grid values is the same as the initial brick layout.

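The structure of the forward transform -- 1d FFTs along one axis, a
data transpose, more 1d FFTs, and so on -- can be sketched in serial
form as below. A naive :math:`O(n^2)` DFT stands in for the call to the
configured FFT library, and explicit array reorderings stand in for the
brick-to-pencil and pencil-to-pencil communication; this sketches the
data flow only and is not the LAMMPS or fftMPI code.

.. code-block:: c++

   // Serial sketch of a 3d transform performed as three passes of 1d
   // transforms with axis reorderings in between, mirroring the parallel
   // x-pencil -> y-pencil -> z-pencil data flow.  Illustrative only.
   #include <complex>
   #include <vector>
   #include <cstdio>
   #include <cmath>

   using cplx = std::complex<double>;

   // naive 1d DFT of a contiguous run of n values (stand-in for an FFT call)
   static void dft1d(cplx *data, int n) {
     const double pi = std::acos(-1.0);
     std::vector<cplx> out(n);
     for (int k = 0; k < n; ++k)
       for (int j = 0; j < n; ++j)
         out[k] += data[j] * std::polar(1.0, -2.0 * pi * k * j / n);
     for (int k = 0; k < n; ++k) data[k] = out[k];
   }

   // rotate the axes of an n x n x n array stored as a[(slow*n + mid)*n + fast]
   // so that the previously slow axis becomes the new fast (contiguous) axis;
   // in the parallel algorithm this reshuffle is the transpose communication
   static void rotate_axes(std::vector<cplx> &a, int n) {
     std::vector<cplx> b(a.size());
     for (int s = 0; s < n; ++s)
       for (int m = 0; m < n; ++m)
         for (int f = 0; f < n; ++f)
           b[(f * n + s) * n + m] = a[(s * n + m) * n + f];
     a.swap(b);
   }

   int main() {
     const int n = 8;                       // 8x8x8 grid as in the figure
     std::vector<cplx> grid(n * n * n);
     grid[(1 * n + 2) * n + 3] = cplx(1.0, 0.0);   // a single non-zero value

     // forward 3d transform = three 1d passes with a transpose between them;
     // after the last pass the data is left in a z-pencil-like layout
     for (int pass = 0; pass < 3; ++pass) {
       for (int s = 0; s < n; ++s)          // 1d transforms along the
         for (int m = 0; m < n; ++m)        // currently contiguous axis
           dft1d(&grid[(s * n + m) * n], n);
       if (pass < 2) rotate_axes(grid, n);
     }

     std::printf("transformed %d grid points\n", n * n * n);
     return 0;
   }
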
Each communication operation within the FFT (brick-to-pencil,
pencil-to-pencil, or pencil-to-brick) converts one tiling of the 3d
grid to another, where a tiling in this context means an assignment of
a small brick-shaped subset of grid points to each processor, the union
of which comprises the entire grid. The parallel `fftMPI library
<https://lammps.github.io/fftmpi/>`_ written for LAMMPS allows
arbitrary definitions of the tiling, so that an irregular partitioning
of the simulation domain can use it directly. Transforming data from
one tiling to another is implemented in fftMPI using point-to-point
communication, where each processor sends data to a few other
processors, since each tile in the initial tiling overlaps with only a
handful of tiles in the final tiling.

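The point-to-point pattern follows directly from box overlaps: a
processor only needs to exchange data with the owners of tiles in the
new tiling that intersect its current tile. A hedged sketch of that
overlap computation (the tile bounds and struct below are made up for
illustration and are not the fftMPI data structures):

.. code-block:: c++

   // Sketch: determine which tiles of a new tiling overlap a processor's
   // current tile.  Tiles are inclusive index ranges [lo, hi] per
   // dimension.  The owners of overlapping tiles are exactly the ranks
   // this processor must exchange grid data with.  Illustrative only.
   #include <cstdio>
   #include <vector>

   struct Tile { int lo[3], hi[3]; };   // brick-shaped range of grid points

   // intersection of two tiles; returns false if they do not overlap
   static bool overlap(const Tile &a, const Tile &b, Tile &out) {
     for (int d = 0; d < 3; ++d) {
       out.lo[d] = (a.lo[d] > b.lo[d]) ? a.lo[d] : b.lo[d];
       out.hi[d] = (a.hi[d] < b.hi[d]) ? a.hi[d] : b.hi[d];
       if (out.lo[d] > out.hi[d]) return false;
     }
     return true;
   }

   int main() {
     // my 2x2x2 brick of an 8x8x8 grid (as in the 64-processor example)
     Tile mine = {{0, 0, 0}, {1, 1, 1}};

     // a few tiles of the x-pencil tiling: each pencil is 8x1x1
     std::vector<Tile> pencils = {
         {{0, 0, 0}, {7, 0, 0}}, {{0, 1, 0}, {7, 1, 0}},
         {{0, 0, 1}, {7, 0, 1}}, {{0, 1, 1}, {7, 1, 1}},
         {{0, 2, 0}, {7, 2, 0}},   // this one does not overlap my brick
     };

     for (size_t p = 0; p < pencils.size(); ++p) {
       Tile common;
       if (overlap(mine, pencils[p], common))
         std::printf("send my points x %d-%d, y %d-%d, z %d-%d to pencil %zu\n",
                     common.lo[0], common.hi[0], common.lo[1], common.hi[1],
                     common.lo[2], common.hi[2], p);
     }
     return 0;
   }
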
The transformations could also be done using collective communication
across all *P* processors with a single call to ``MPI_Alltoall()``, but
this is typically much slower. However, for the specialized brick and
pencil tilings illustrated in the :ref:`fft-parallel` figure, collective
communication across the entire MPI communicator is not required. In
the example, an :math:`8^3` grid with 512 grid cells is partitioned
across 64 processors; each processor owns a 2x2x2 brick of grid cells.
The initial brick-to-pencil communication (the two left images) only
requires collective communication within subgroups of 4 processors, as
illustrated by the 4 colors. More generally, a brick-to-pencil
communication can be performed by partitioning the *P* processors into
:math:`P^{\frac{2}{3}}` subgroups of :math:`P^{\frac{1}{3}}` processors
each. Each subgroup performs collective communication only within its
subgroup. Similarly, pencil-to-pencil communication can be performed by
partitioning the *P* processors into :math:`P^{\frac{1}{2}}` subgroups
of :math:`P^{\frac{1}{2}}` processors each. This is illustrated in the
figure for the :math:`y \Rightarrow z` communication (the two right
images): an eight-processor subgroup owns the front *yz* plane of data
and performs collective communication within the subgroup to transpose
from a *y*-pencil to a *z*-pencil layout.

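Worked out for the 64-processor example in the figure, the subgroup
sizes are:

.. math::

   P = 64: \quad P^{\frac{2}{3}} = 16 \text{ subgroups of }
   P^{\frac{1}{3}} = 4 \text{ processors for brick} \leftrightarrow
   \text{pencil}

   P^{\frac{1}{2}} = 8 \text{ subgroups of } P^{\frac{1}{2}} = 8
   \text{ processors for pencil} \leftrightarrow \text{pencil}
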
LAMMPS invokes point-to-point communication by default, but also
provides the option of partitioned collective communication, which is
enabled with the :doc:`kspace_modify collective yes <kspace_modify>`
command. In the latter case, the code detects the size of the disjoint
subgroups and partitions the single *P*-size communicator into multiple
smaller communicators, each of which invokes collective communication.
Testing on a large IBM Blue Gene/Q machine at Argonne National
Laboratory showed a significant improvement in FFT performance at large
processor counts; partitioned collective communication was faster than
either point-to-point communication or global collective communication
involving all *P* processors.

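Splitting the global communicator into per-subgroup communicators can
be sketched as follows. The subgroup assignment used here (groups of 4
consecutive ranks, mimicking the 64-processor example) is purely
illustrative and is not the actual LAMMPS or fftMPI code.

.. code-block:: c++

   // Sketch: partition the P-size communicator into disjoint subgroup
   // communicators and perform the data exchange as a collective within
   // each subgroup.  Compile with mpicxx; illustrative only.
   #include <mpi.h>
   #include <vector>
   #include <cstdio>

   int main(int argc, char **argv) {
     MPI_Init(&argc, &argv);
     int rank, nprocs;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

     // ranks with the same "color" end up in the same sub-communicator;
     // here: groups of 4 consecutive ranks (brick-to-pencil subgroups)
     int color = rank / 4;
     MPI_Comm subcomm;
     MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

     int subrank, subsize;
     MPI_Comm_rank(subcomm, &subrank);
     MPI_Comm_size(subcomm, &subsize);

     // collective exchange only within the subgroup: every rank sends one
     // value to every other rank of its subgroup (stand-in for grid data)
     std::vector<int> sendbuf(subsize, rank), recvbuf(subsize);
     MPI_Alltoall(sendbuf.data(), 1, MPI_INT, recvbuf.data(), 1, MPI_INT,
                  subcomm);

     if (subrank == 0)
       std::printf("subgroup %d: %d ranks exchanged data collectively\n",
                   color, subsize);

     MPI_Comm_free(&subcomm);
     MPI_Finalize();
     return 0;
   }
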
Here are some additional details about FFTs for long-range and related
grid/particle operations that LAMMPS supports:

- The fftMPI library allows each grid dimension to be a product of
  small prime factors (2, 3, and 5) and allows any number of processors
  to perform the FFT. The resulting brick and pencil decompositions are
  thus not always as well aligned as in the example above, but the
  sizes of the processor subgroups for the two modes of communication
  (brick/pencil and pencil/pencil) still scale as
  :math:`O(P^{\frac{1}{3}})` and :math:`O(P^{\frac{1}{2}})`,
  respectively. A short sketch of such a grid-size check appears after
  this list.

- For efficiency in performing the 1d FFTs, the grid transpose
  operations illustrated in the :ref:`fft-parallel` figure also involve
  reordering the 3d data so that a different dimension is contiguous in
  memory. This reordering can be done during the packing or unpacking
  of the buffers for MPI communication.

- For large systems, and particularly for a large number of MPI
  processes, the dominant cost of parallel FFTs is often the
  communication, not the computation of the 1d FFTs, even though the
  latter scales as :math:`N \log(N)` in the number of grid points *N*.
  This is because only a 2d decomposition into pencils is possible for
  the FFT grid, while atom data (and their corresponding short-range
  force and energy computations) can be decomposed efficiently in 3d.

  This can be addressed by reducing the number of MPI processes
  involved in the FFT communication through :doc:`hybrid MPI + OpenMP
  parallelization <Speed_omp>`. This uses OpenMP parallelization inside
  the MPI domains; while that may have a lower parallel efficiency, it
  reduces the communication overhead.

  As an alternative, it is also possible to start a
  :ref:`multi-partition <partition>` calculation and then use the
  :doc:`verlet/split integrator <run_style>` to perform the PPPM
  computation on a dedicated, separate partition of MPI processes.
  This uses an integer "1:*p*" mapping of *p* sub-domains of the atom
  decomposition to one sub-domain of the FFT grid decomposition;
  pairwise non-bonded and bonded forces and energies are computed on
  the larger partition, while the PPPM kspace computation runs
  concurrently on the smaller partition.

- LAMMPS also implements PPPM-based solvers for other long-range
  interactions, dipole and dispersion (Lennard-Jones), which can be
  used in conjunction with long-range Coulombics for point charges.

- LAMMPS implements a ``GridComm`` class which overlays the simulation
  domain with a regular grid, partitions it across processors in a
  manner consistent with the processor sub-domains, and provides
  methods for forward and reverse communication of owned and ghost grid
  point values. It is used for PPPM as an FFT grid (as outlined above)
  and also for the MSM algorithm, which uses a cascade of grid sizes
  from fine to coarse to compute long-range Coulombic forces. The
  ``GridComm`` class is also useful for models where continuum fields
  interact with particles. For example, the two-temperature model (TTM)
  defines heat transfer between atoms (particles) and electrons (a
  continuum gas), where spatial variations in the electron temperature
  are computed by finite differences of a discretized heat equation on
  a regular grid. The :doc:`fix ttm/grid <fix_ttm>` command uses the
  ``GridComm`` class internally to perform its grid operations on a
  distributed grid, unlike the original :doc:`fix ttm <fix_ttm>`
  command, which uses a replicated grid.

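Regarding the grid-size constraint mentioned in the first bullet above,
a minimal sketch of checking whether a dimension factors into 2s, 3s,
and 5s (and of bumping it up to the next acceptable size) could look
like this; the function names are illustrative and not part of fftMPI
or LAMMPS:

.. code-block:: c++

   // Sketch: check that an FFT grid dimension factors into powers of
   // 2, 3, and 5 only, and find the next acceptable size if it does not.
   #include <cstdio>
   #include <initializer_list>

   // true if n > 0 has no prime factors other than 2, 3, and 5
   static bool fft_size_ok(int n) {
     if (n <= 0) return false;
     for (int f : {2, 3, 5})
       while (n % f == 0) n /= f;
     return n == 1;
   }

   // smallest m >= n with only 2, 3, and 5 as prime factors
   static int next_fft_size(int n) {
     while (!fft_size_ok(n)) ++n;
     return n;
   }

   int main() {
     for (int n : {36, 100, 121, 128})
       std::printf("%3d -> ok: %s, next valid size: %d\n",
                   n, fft_size_ok(n) ? "yes" : "no", next_fft_size(n));
     return 0;
   }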