break large file into multiple smaller files by section and add toctree
120 doc/src/Developer_par_comm.rst Normal file
@@ -0,0 +1,120 @@
Communication
^^^^^^^^^^^^^

Following the partitioning scheme in use, all per-atom data (atom IDs,
positions, velocities, types, etc.) is distributed across the MPI
processes, which allows LAMMPS to handle very large systems provided it
uses a correspondingly large number of MPI processes. To be able to
compute the short-range interactions, MPI processes need access not only
to the data of atoms they "own" but also to information about atoms from
neighboring sub-domains, referred to in LAMMPS as "ghost" atoms. These
are copies of atoms within the communication cutoff distance that store
the required per-atom data. The green dashed-line boxes in the
:ref:`domain-decomposition` figure illustrate the extended ghost-atom
sub-domain for one processor.

This approach is also used to implement periodic boundary conditions:
atoms that lie within the cutoff distance across a periodic boundary are
also stored as ghost atoms and are taken from the periodic replication
of the sub-domain, which may be the same sub-domain, e.g. when running
in serial. As a consequence, force computation in LAMMPS is not subject
to minimum image conventions and thus cutoffs may be larger than half
the simulation domain.

.. _ghost-atom-comm:
.. figure:: img/ghost-comm.png
   :align: center

   ghost atom communication

   This figure shows the ghost atom communication patterns between
   sub-domains for the "brick" (left) and "tiled" (right) communication
   styles for 2d simulations. The numbers indicate MPI process ranks.
   Here the sub-domains are drawn spatially separated for clarity. The
   dashed-line box is the extended sub-domain of processor 0, which
   includes its ghost atoms. The red- and blue-shaded boxes are the
   regions of communicated ghost atoms.

Efficient communication patterns are needed to update the "ghost" atom
data, since that needs to be done at every MD time step or minimization
step. The diagrams in the :ref:`ghost-atom-comm` figure illustrate how
ghost atom communication is performed in two stages for a 2d simulation
(three in 3d) for both a regular and an irregular partitioning of the
simulation box. For the regular case (left), atoms are exchanged first
in the *x*-direction, then in *y*, with four neighbors in the grid of
processor sub-domains.

In the *x* stage, processor ranks 1 and 2 send owned atoms in their
red-shaded regions to rank 0 (and vice versa). Then in the *y* stage,
ranks 3 and 4 send atoms in their blue-shaded regions to rank 0, which
includes ghost atoms they received in the *x* stage. Rank 0 thus
acquires all its ghost atoms; atoms in the solid blue corner regions
are communicated twice before rank 0 receives them.

For the irregular case (right), the two stages are similar, but a
processor can have more than one neighbor in each direction. In the
*x* stage, MPI ranks 1, 2, and 3 send owned atoms in their red-shaded
regions to rank 0 (and vice versa). These include only atoms between
the lower and upper *y*-boundary of rank 0's sub-domain. In the *y*
stage, ranks 4, 5, and 6 send atoms in their blue-shaded regions to
rank 0. This may include ghost atoms they received in the *x* stage,
but only if they are needed by rank 0 to fill its extended ghost atom
regions in the +/- *y* directions (blue rectangles). Thus in this case,
ranks 5 and 6 do not include ghost atoms they received from each other
(in the *x* stage) in the atoms they send to rank 0. The key point is
that while the pattern of communication is more complex in the irregular
partitioning case, it can still proceed in two stages (three in 3d) via
atom exchanges with only neighboring processors.

When attributes of owned atoms are sent to neighboring processors to
become attributes of their ghost atoms, LAMMPS calls this a "forward"
communication. On timesteps when atoms migrate to new owning processors
and neighbor lists are rebuilt, each processor creates a list of its
owned atoms which are ghost atoms on each of its neighbor processors.
These lists are used to pack per-atom coordinates (for example) into
message buffers in subsequent steps until the next reneighboring.

A "reverse" communication is when computed ghost atom attributes are
sent back to the processor that owns the atom. This is used (for
example) to sum partial forces on ghost atoms into the complete force on
owned atoms. The order of the two stages described in the
:ref:`ghost-atom-comm` figure is inverted, and the same lists of atoms
are used to pack and unpack message buffers with per-atom forces. When
a received buffer is unpacked, the ghost forces are summed into the
owned atom forces. As in forward communication, forces on atoms in the
four blue corners of the diagrams are sent, received, and summed twice
(once at each stage) before the owning processors have the full force.
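
The following is a minimal C++/MPI sketch of these two operations, not
the actual LAMMPS implementation (see the ``Comm`` class and its derived
classes in the source for that); the ``Swap`` structure and all variable
names are hypothetical:

.. code-block:: cpp

   #include <mpi.h>
   #include <vector>

   // Hypothetical per-exchange data: for each swap, the neighbor ranks and
   // the list of local atom indices whose data the receiver needs as ghosts.
   struct Swap {
     int sendproc, recvproc;    // neighbor ranks for this exchange
     std::vector<int> sendlist; // atoms to pack for sendproc
     int nrecv;                 // ghost atoms expected from recvproc
   };

   // Forward communication: owned-atom coordinates x[i] become ghost-atom
   // coordinates on the neighbor; firstrecv[iswap] is where the ghosts
   // received in swap iswap start in the local atom arrays.
   void forward_comm(std::vector<Swap> &swaps, double **x, const int *firstrecv) {
     for (int iswap = 0; iswap < (int)swaps.size(); ++iswap) {
       Swap &s = swaps[iswap];
       std::vector<double> sendbuf, recvbuf(3 * s.nrecv);
       for (int i : s.sendlist) {  // pack coordinates of listed atoms
         sendbuf.push_back(x[i][0]);
         sendbuf.push_back(x[i][1]);
         sendbuf.push_back(x[i][2]);
       }
       MPI_Request req;
       MPI_Irecv(recvbuf.data(), 3 * s.nrecv, MPI_DOUBLE, s.recvproc, 0,
                 MPI_COMM_WORLD, &req);
       MPI_Send(sendbuf.data(), (int)sendbuf.size(), MPI_DOUBLE, s.sendproc,
                0, MPI_COMM_WORLD);
       MPI_Wait(&req, MPI_STATUS_IGNORE);
       for (int m = 0; m < s.nrecv; ++m) {  // unpack into ghost atoms
         int j = firstrecv[iswap] + m;
         x[j][0] = recvbuf[3 * m];
         x[j][1] = recvbuf[3 * m + 1];
         x[j][2] = recvbuf[3 * m + 2];
       }
     }
   }

   // Reverse communication: ghost-atom forces f[j] travel the opposite way,
   // the swaps are traversed in reverse order, and the same sendlist is
   // used to sum the received contributions into the owned atoms.
   void reverse_comm(std::vector<Swap> &swaps, double **f, const int *firstrecv) {
     for (int iswap = (int)swaps.size() - 1; iswap >= 0; --iswap) {
       Swap &s = swaps[iswap];
       std::vector<double> sendbuf(3 * s.nrecv), recvbuf(3 * s.sendlist.size());
       for (int m = 0; m < s.nrecv; ++m) {  // pack ghost forces
         int j = firstrecv[iswap] + m;
         sendbuf[3 * m] = f[j][0];
         sendbuf[3 * m + 1] = f[j][1];
         sendbuf[3 * m + 2] = f[j][2];
       }
       MPI_Request req;
       MPI_Irecv(recvbuf.data(), (int)recvbuf.size(), MPI_DOUBLE, s.sendproc,
                 0, MPI_COMM_WORLD, &req);
       MPI_Send(sendbuf.data(), (int)sendbuf.size(), MPI_DOUBLE, s.recvproc,
                0, MPI_COMM_WORLD);
       MPI_Wait(&req, MPI_STATUS_IGNORE);
       int m = 0;
       for (int i : s.sendlist) {  // sum into owned-atom forces
         f[i][0] += recvbuf[3 * m];
         f[i][1] += recvbuf[3 * m + 1];
         f[i][2] += recvbuf[3 * m + 2];
         ++m;
       }
     }
   }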

These two operations are used in many places within LAMMPS aside from
the exchange of coordinates and forces, for example by manybody
potentials to share intermediate per-atom values, or by rigid-body
integrators to enable each atom in a body to access body properties.
Here are additional details about how these communication operations are
performed in LAMMPS:

- When exchanging data with different processors, forward and reverse
  communication is done using ``MPI_Send()`` and ``MPI_Irecv()`` calls.
  If a processor is "exchanging" atoms with itself, only the pack and
  unpack operations are performed, e.g. to create ghost atoms across
  periodic boundaries when running on a single processor.

- For forward communication of owned atom coordinates, periodic box
  lengths are added and subtracted when the receiving processor is
  across a periodic boundary from the sender (see the sketch after this
  list). There is then no need to apply a minimum image convention when
  calculating distances between atom pairs when building neighbor lists
  or computing forces.

- The cutoff distance for exchanging ghost atoms is typically equal to
  the neighbor cutoff. But it can also be chosen to be longer if
  needed, e.g. half the diameter of a rigid body composed of multiple
  atoms or over 3x the length of a stretched bond for dihedral
  interactions. It can also exceed the periodic box size. For the
  regular communication pattern (left), if the cutoff distance extends
  beyond a neighbor processor's sub-domain, then multiple exchanges are
  performed in the same direction. Each exchange is with the same
  neighbor processor, but buffers are packed/unpacked using a different
  list of atoms. For forward communication, in the first exchange a
  processor sends only owned atoms. In subsequent exchanges, it sends
  ghost atoms received in previous exchanges. For the irregular pattern
  (right), overlaps of a processor's extended ghost-atom sub-domain with
  all other processors in each dimension are detected.
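
The periodic shift mentioned in the second bullet can be sketched as
follows, assuming an orthogonal box; the function name and the per-swap
``shift`` flags are hypothetical:

.. code-block:: cpp

   #include <vector>

   // Pack coordinates for a receiver across a periodic boundary: the box
   // length is added or subtracted once at pack time, so no minimum image
   // test is needed later. shift[d] is +1, 0, or -1 for this swap.
   void pack_with_pbc(const std::vector<int> &sendlist, double **x,
                      const double boxlen[3], const int shift[3],
                      std::vector<double> &buf) {
     buf.clear();
     for (int i : sendlist)
       for (int d = 0; d < 3; ++d)
         buf.push_back(x[i][d] + shift[d] * boxlen[d]);
   }
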
2 doc/src/Developer_par_long.rst Normal file
@@ -0,0 +1,2 @@
Long-range interactions
^^^^^^^^^^^^^^^^^^^^^^^
159 doc/src/Developer_par_neigh.rst Normal file
@@ -0,0 +1,159 @@
Neighbor lists
^^^^^^^^^^^^^^

To compute forces efficiently, each processor creates a Verlet-style
neighbor list which enumerates all pairs of atoms *i,j* (*i* = owned,
*j* = owned or ghost) with separation less than the applicable neighbor
list cutoff distance. In LAMMPS the neighbor lists are stored in a
multiple-page data structure; each page is a contiguous chunk of memory
which stores vectors of neighbor atoms *j* for many *i* atoms. This
allows pages to be incrementally allocated or deallocated in blocks as
needed. Neighbor lists typically consume the most memory of any data
structure in LAMMPS. The neighbor list is rebuilt (from scratch) once
every few timesteps, then used repeatedly each step for force or other
computations. The neighbor cutoff distance is :math:`R_n = R_f +
\Delta_s`, where :math:`R_f` is the (largest) force cutoff defined by
the interatomic potential for computing short-range pairwise or manybody
forces and :math:`\Delta_s` is a "skin" distance that allows the list to
be used for multiple steps, assuming that atoms do not move very far
between consecutive time steps. Typically the code triggers
reneighboring when any atom has moved half the skin distance since the
last reneighboring; this and other options of the neighbor list rebuild
can be adjusted with the :doc:`neigh_modify <neigh_modify>` command.
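
A minimal sketch of that trigger check, assuming positions at the time
of the last rebuild are kept in a hypothetical ``xhold`` array (a
parallel code would additionally combine the per-rank flags with an
``MPI_Allreduce()`` so that all ranks rebuild together):

.. code-block:: cpp

   // Rebuild the neighbor lists once any owned atom has moved more than
   // half the skin distance Delta_s since the last rebuild.
   bool need_reneighbor(int nlocal, double **x, double **xhold, double skin) {
     const double triggersq = 0.25 * skin * skin;  // (skin/2)^2
     for (int i = 0; i < nlocal; ++i) {
       const double dx = x[i][0] - xhold[i][0];
       const double dy = x[i][1] - xhold[i][1];
       const double dz = x[i][2] - xhold[i][2];
       if (dx * dx + dy * dy + dz * dz > triggersq) return true;
     }
     return false;
   }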

On steps when reneighboring is performed, atoms which have moved outside
their owning processor's sub-domain are first migrated to new processors
via communication. Periodic boundary conditions are also (only)
enforced on these steps to ensure each atom is re-assigned to the
correct processor. After migration, the atoms owned by each processor
are stored in a contiguous vector. Periodically, each processor
spatially sorts its owned atoms within this vector to reorder it for
improved cache efficiency in force computations and neighbor list
building. For that, atoms are spatially binned and then reordered so
that atoms in the same bin are adjacent in the vector. Atom sorting can
be disabled or its settings modified with the :doc:`atom_modify
<atom_modify>` command.

.. _neighbor-stencil:
.. figure:: img/neigh-stencil.png
   :align: center

   neighbor list stencils

   A 2d simulation sub-domain (thick black line) and the corresponding
   ghost atom cutoff region (dashed blue line) for both orthogonal
   (left) and triclinic (right) domains. A regular grid of neighbor
   bins (thin lines) overlays the entire simulation domain and need not
   align with sub-domain boundaries; only the portion overlapping the
   augmented sub-domain is shown. In the triclinic case it overlaps the
   bounding box of the tilted rectangle. The blue- and red-shaded bins
   represent a stencil of bins searched to find neighbors of a
   particular atom (black dot).

To build a local neighbor list in linear time, the simulation domain is
overlaid (conceptually) with a regular 3d (or 2d) grid of neighbor bins,
as shown in the :ref:`neighbor-stencil` figure for 2d models and a
single MPI processor's sub-domain. Each processor stores the set of
neighbor bins which overlap its sub-domain extended by the neighbor
cutoff distance :math:`R_n`. As illustrated, the bins need not align
with processor boundaries; an integer number of bins in each dimension
is fit to the size of the entire simulation box.

Most often LAMMPS builds what it calls a "half" neighbor list, where
each *i,j* neighbor pair is stored only once, with either atom *i* or
*j* as the central atom. The build can be done efficiently by using a
pre-computed "stencil" of bins around a central origin bin which
contains the atom whose neighbors are being searched for. A stencil is
simply a list of integer offsets in *x,y,z* of nearby bins surrounding
the origin bin which are close enough to contain any neighbor atom *j*
within a distance :math:`R_n` from any atom *i* in the origin bin. Note
that for a half neighbor list, the stencil can be asymmetric since each
atom needs to store only half its nearby neighbors.

These stencils are illustrated in the figure for a half list and a bin
size of :math:`\frac{1}{2} R_n`. There are 13 red+blue stencil bins in
2d (for the orthogonal case, 15 for triclinic). In 3d there would be
63: 13 in the plane of bins that contains the origin bin and 25 in each
of the two planes above it in the *z* direction (75 for triclinic). The
reason the triclinic stencil has extra bins is that the bins tile the
bounding box of the entire triclinic domain and thus are not periodic
with respect to the simulation box itself. The stencil and logic for
determining which *i,j* pairs to include in the neighbor list are
altered slightly to account for this.
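
A sketch of how such a half-list stencil could be constructed for the
orthogonal 3d case; the helper names are illustrative, not the LAMMPS
API, but the counting logic reproduces the 13 (2d plane) and 63 (3d)
bins quoted above for a bin size of :math:`\frac{1}{2} R_n`:

.. code-block:: cpp

   #include <cmath>
   #include <cstdlib>
   #include <vector>

   struct Offset { int dx, dy, dz; };

   // Minimal 1d distance between the origin bin and a bin i bins away;
   // adjacent bins touch, so it is (|i| - 1) * binsize.
   static double bin_dist(int i, double binsize) {
     return (std::abs(i) <= 1) ? 0.0 : (std::abs(i) - 1) * binsize;
   }

   // Half-list stencil: only bins at or "after" the origin bin in the
   // sweep order, so every i,j pair is found exactly once. Pairs inside
   // the origin bin itself are handled by enumerating only j > i there.
   std::vector<Offset> half_stencil_3d(double binsize, double cutneigh) {
     std::vector<Offset> stencil;
     const int r = (int)std::ceil(cutneigh / binsize);
     const double cutsq = cutneigh * cutneigh;
     for (int k = 0; k <= r; ++k)
       for (int j = -r; j <= r; ++j)
         for (int i = -r; i <= r; ++i) {
           if (k == 0 && (j < 0 || (j == 0 && i < 0))) continue; // lower half
           const double dx = bin_dist(i, binsize);
           const double dy = bin_dist(j, binsize);
           const double dz = bin_dist(k, binsize);
           if (dx * dx + dy * dy + dz * dz <= cutsq)
             stencil.push_back({i, j, k});  // (0,0,0) is pushed first
         }
     return stencil;
   }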

To build a neighbor list, a processor first loops over its "owned" plus
"ghost" atoms and assigns each to a neighbor bin. This uses an integer
vector to create a linked list of atom indices within each bin. It then
performs a triply-nested loop over its owned atoms *i*, the stencil of
bins surrounding atom *i*'s bin, and the *j* atoms in each stencil bin
(including ghost atoms). If the distance :math:`r_{ij} < R_n`, then
atom *j* is added to the vector of atom *i*'s neighbors.
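
The following sketch combines this linked-list binning with the stencil
from the previous sketch; the bin grid is assumed to cover the
sub-domain extended by :math:`R_n` (so offsets from an owned atom's bin
stay in range), ``stencil[0]`` is assumed to be the origin bin, and all
names are again illustrative rather than the LAMMPS API:

.. code-block:: cpp

   #include <vector>

   struct Offset { int dx, dy, dz; };

   void build_half_list(int nlocal, int nall, double **x, const double lo[3],
                        double binsize, int nx, int ny, int nz,
                        const std::vector<Offset> &stencil, double cutsq,
                        std::vector<std::vector<int>> &neigh) {
     // bin owned + ghost atoms: head[b] is the first atom in bin b,
     // next[i] the next atom in atom i's bin (-1 terminates the chain)
     std::vector<int> head(nx * ny * nz, -1), next(nall, -1);
     auto bin3 = [&](const double *xi, int &bx, int &by, int &bz) {
       bx = (int)((xi[0] - lo[0]) / binsize);
       by = (int)((xi[1] - lo[1]) / binsize);
       bz = (int)((xi[2] - lo[2]) / binsize);
     };
     for (int i = 0; i < nall; ++i) {
       int bx, by, bz;
       bin3(x[i], bx, by, bz);
       int b = (bz * ny + by) * nx + bx;
       next[i] = head[b];
       head[b] = i;
     }
     // triply-nested loop: owned atoms, stencil bins, atoms in each bin
     neigh.assign(nlocal, {});
     for (int i = 0; i < nlocal; ++i) {
       int bx, by, bz;
       bin3(x[i], bx, by, bz);
       for (int s = 0; s < (int)stencil.size(); ++s) {
         int b = ((bz + stencil[s].dz) * ny + (by + stencil[s].dy)) * nx +
                 (bx + stencil[s].dx);
         for (int j = head[b]; j >= 0; j = next[j]) {
           if (s == 0 && j <= i) continue;  // origin bin: store pair once
           double dx = x[i][0] - x[j][0];
           double dy = x[i][1] - x[j][1];
           double dz = x[i][2] - x[j][2];
           if (dx * dx + dy * dy + dz * dz < cutsq) neigh[i].push_back(j);
         }
       }
     }
   }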

Here are additional details about the neighbor list build options that
LAMMPS supports:

- The choice of bin size is an option; a size half of :math:`R_n` has
  been found to be optimal for many typical cases. Smaller bins incur
  additional overhead to loop over; larger bins require more distance
  calculations. Note that for smaller bin sizes, the 2d stencil in the
  figure would be more semi-circular in shape (hemispherical in 3d),
  with bins near the corners of the square eliminated due to their
  distance from the origin bin.

- Depending on the interatomic potential(s) and other commands used in
  an input script, multiple neighbor lists and stencils with different
  attributes may be needed. This includes lists with different cutoff
  distances, e.g. for force computation versus occasional diagnostic
  computations such as a radial distribution function, or for the
  r-RESPA time integrator which can partition pairwise forces by
  distance into subsets computed at different time intervals. It
  includes "full" lists (as opposed to half lists) where each *i,j*
  pair appears twice, stored once with *i* and once with *j* as the
  central atom, and which use a larger symmetric stencil. It also
  includes lists with partial enumeration of ghost atom neighbors. The
  full and ghost-atom lists are used by various manybody interatomic
  potentials. Lists may also use different criteria for inclusion of a
  pair interaction. Typically this depends only on the distance between
  the two atoms and the cutoff distance. But for finite-size
  coarse-grained particles with individual diameters (e.g. polydisperse
  granular particles), it can also depend on the diameters of the two
  particles.

- When using :doc:`pair style hybrid <pair_hybrid>`, multiple sub-lists
  of the master neighbor list for the full system need to be generated,
  one for each sub-style, which contain only the *i,j* pairs needed to
  compute interactions between subsets of atoms for the corresponding
  potential. This means not all *i* or *j* atoms owned by a processor
  are included in a particular sub-list.

- Some models use different cutoff lengths for pairwise interactions
  between different kinds of particles, which are stored in a single
  neighbor list. One example is a solvated colloidal system with large
  colloidal particles, where colloid/colloid, colloid/solvent, and
  solvent/solvent interaction cutoffs can be dramatically different.
  Another is a model of polydisperse finite-size granular particles;
  pairs of particles interact only when they are in contact with each
  other. Mixtures with particle size ratios as high as 10-100x may be
  used to model realistic systems. Efficient neighbor list building
  algorithms for these kinds of systems are available in LAMMPS. They
  include a method which uses different stencils for different cutoff
  lengths and trims the stencil to only include bins that straddle the
  cutoff sphere surface. More recently, a method which uses both
  multiple stencils and multiple bin sizes was developed; it builds
  neighbor lists efficiently for systems with particles of any size
  ratio, though other considerations (timestep size, force computations)
  may limit the ability to model systems with huge polydispersity.

- For small and sparse systems, and as a fallback method, LAMMPS also
  supports neighbor list construction without binning by using a full
  :math:`O(N^2)` loop over all *i,j* atom pairs in a sub-domain when
  using the :doc:`neighbor nsq <neighbor>` command.

- Depending on the "pair" setting of the :doc:`newton <newton>` command,
  the "half" neighbor lists may contain **all** pairs of atoms where
  atom *j* is a ghost atom (i.e. when the newton pair setting is
  *off*). With the newton pair *on* setting, a ghost atom *j* is only
  added to the list if its *z* coordinate is larger than that of atom
  *i*, or, if those are equal, its *y* coordinate is larger, or, if
  those are also equal, its *x* coordinate is larger (see the sketch
  below). For a homogeneously dense system, this picks neighbors from a
  same-size sector in always the same direction relative to the "owned"
  atom; it should thus lead to neighbor lists of similar length and
  reduce the chance of a load imbalance.
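
A sketch of that coordinate-based tie-break; a production version would
compare against small numeric tolerances rather than exactly:

.. code-block:: cpp

   // With "newton pair on", accept a pair with ghost atom j only if j lies
   // in the +z (then +y, then +x) half-space relative to owned atom i, so
   // each i,j pair is kept by exactly one processor.
   static bool accept_ghost_pair(const double *xi, const double *xj) {
     if (xj[2] > xi[2]) return true;
     if (xj[2] < xi[2]) return false;
     if (xj[1] > xi[1]) return true;
     if (xj[1] < xi[1]) return false;
     return xj[0] > xi[0];
   }
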
89 doc/src/Developer_par_part.rst Normal file
@@ -0,0 +1,89 @@
Partitioning
^^^^^^^^^^^^

The underlying spatial decomposition strategy used by LAMMPS for
distributed-memory parallelism is set with the :doc:`comm_style command
<comm_style>` and can be either "brick" (a regular grid) or "tiled".

.. _domain-decomposition:
.. figure:: img/domain-decomp.png
   :align: center

   domain decomposition

   This figure shows the different kinds of domain decomposition used
   for MPI parallelization: "brick" with an orthogonal (left) and a
   triclinic (middle) simulation domain, and a "tiled" decomposition
   (right). The black lines show the division into sub-domains, and the
   contained atoms are "owned" by the corresponding MPI process. The
   green dashed lines indicate how sub-domains are extended with "ghost"
   atoms up to the communication cutoff distance.

The LAMMPS simulation box is a 3d or 2d volume, which can be orthogonal
or triclinic in shape, as illustrated in the :ref:`domain-decomposition`
figure for the 2d case. Orthogonal means the box edges are aligned with
the *x*, *y*, *z* Cartesian axes, and the box faces are thus all
rectangular. Triclinic allows for a more general parallelepiped shape
in which edges are aligned with three arbitrary vectors and the box
faces are parallelograms. In each dimension, box faces can be periodic,
or non-periodic with fixed or shrink-wrapped boundaries. In the fixed
case, atoms which move outside the face are deleted; shrink-wrapped
means the position of the box face adjusts continuously to enclose all
the atoms.

For distributed-memory MPI parallelism, the simulation box is spatially
decomposed (partitioned) into non-overlapping sub-domains which fill the
box. The default partitioning, "brick", is most suitable when atom
density is roughly uniform, as shown in the left-side images of the
:ref:`domain-decomposition` figure. The sub-domains comprise a regular
grid and all sub-domains are identical in size and shape. Both the
orthogonal and triclinic boxes can deform continuously during a
simulation, e.g. to compress a solid or shear a liquid, in which case
the processor sub-domains likewise deform.

For models with non-uniform density, the number of particles per
processor can be load-imbalanced with the default partitioning. This
reduces parallel efficiency, as the overall simulation rate is limited
by the slowest processor, i.e. the one with the largest computational
load. For such models, LAMMPS supports multiple strategies to reduce
the load imbalance:

- The processor grid decomposition is by default based on the simulation
  cell volume and tries to optimize the volume-to-surface ratio of the
  sub-domains (see the sketch after this list). This can be changed
  with the :doc:`processors command <processors>`.

- The parallel planes defining the size of the sub-domains can be
  shifted with the :doc:`balance command <balance>`, which can be done
  in addition to choosing a more optimal processor grid.

- The recursive bisectioning algorithm, in combination with the "tiled"
  communication style, can produce a partitioning with equal numbers of
  particles in each sub-domain.
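
The generic part of choosing such a processor grid can be sketched with
plain MPI; the actual LAMMPS heuristic (which also weighs the box edge
lengths to optimize the volume-to-surface ratio) is more elaborate, so
this is only a minimal illustration:

.. code-block:: cpp

   #include <cstdio>
   #include <mpi.h>

   int main(int argc, char **argv) {
     MPI_Init(&argc, &argv);
     int nprocs, me;
     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
     MPI_Comm_rank(MPI_COMM_WORLD, &me);
     // Let MPI factor the rank count into a balanced 3d grid,
     // e.g. 12 ranks -> 3 x 2 x 2 (a 0 entry means "choose for me").
     int dims[3] = {0, 0, 0};
     MPI_Dims_create(nprocs, 3, dims);
     if (me == 0)
       std::printf("processor grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);
     MPI_Finalize();
     return 0;
   }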

.. |decomp1| image:: img/decomp-regular.png
   :width: 24%

.. |decomp2| image:: img/decomp-processors.png
   :width: 24%

.. |decomp3| image:: img/decomp-balance.png
   :width: 24%

.. |decomp4| image:: img/decomp-rcb.png
   :width: 24%

|decomp1| |decomp2| |decomp3| |decomp4|

The pictures above demonstrate different decompositions for a 2d system
with 12 MPI ranks. The atom colors indicate the load imbalance of each
sub-domain, with green being optimal and red the least optimal.

Due to the vacuum in the system, the default decomposition is
unbalanced, with several MPI ranks without atoms (left). By forcing a
1x12x1 processor grid, every MPI rank now performs computations, but the
number of atoms per sub-domain is still uneven, and the thin slice shape
increases the amount of communication between sub-domains (center left).
With a 2x6x1 processor grid and shifted sub-domain divisions, the load
imbalance is further reduced and less communication is required between
sub-domains (center right). Using recursive bisectioning leads to a
further improved decomposition (right).

@@ -10,367 +10,10 @@ size). Additional parallelization using GPUs or OpenMP can then be
 applied within the sub-domain assigned to an MPI process. For clarity,
 most of the following illustrations show the 2d simulation case.
 
-Partitioning
-^^^^^^^^^^^^
-
-The underlying spatial decomposition strategy used by LAMMPS for
-distributed-memory parallelism is set with the :doc:`comm_style command
-<comm_style>` and can be either "brick" (a regular grid) or "tiled".
+.. toctree::
+   :maxdepth: 1
+
+   Developer_par_part
+   Developer_par_comm
+   Developer_par_neigh
+   Developer_par_long