Merge branch 'master' into tersoff-shift

2021-01-11 04:30:11 -05:00
parent 412d1c1b72 102a6eba79
commit cbca189490
509 changed files with 7657 additions and 7183 deletions
--- a/doc/src/Build_extras.rst
+++ b/doc/src/Build_extras.rst
@@ -807,7 +807,7 @@ compiling LAMMPS with Python version 3.6 or later.
      the ``cythonize`` command in case the corresponding .pyx file(s) were
      modified.  You may need to modify ``lib/python/Makefile.lammps``
      if the LAMMPS build fails.
-      To manually enforce building MLIAP with Python support enabled, 
+      To manually enforce building MLIAP with Python support enabled,
      you can add
      ``-DMLIAP_PYTHON`` to the ``LMP_INC`` variable in your machine makefile.
      You may have to manually run the ``cythonize`` command on .pyx file(s)
--- a/doc/src/Build_package.rst
+++ b/doc/src/Build_package.rst
@@ -1,5 +1,4 @@
 Include packages in build
-
 =========================

 In LAMMPS, a package is a group of files that enable a specific set of
--- a/doc/src/Speed_kokkos.rst
+++ b/doc/src/Speed_kokkos.rst
@@ -38,14 +38,14 @@ produce an executable compatible with a specific hardware.
   :class: note

   Kokkos with CUDA currently implicitly assumes that the MPI library is
-   CUDA-aware. This is not always the case, especially when using
+   GPU-aware. This is not always the case, especially when using
   pre-compiled MPI libraries provided by a Linux distribution. This is
   not a problem when using only a single GPU with a single MPI
   rank. When running with multiple MPI ranks, you may see segmentation
-   faults without CUDA-aware MPI support. These can be avoided by adding
-   the flags :doc:`-pk kokkos cuda/aware off <Run_options>` to the
+   faults without GPU-aware MPI support. These can be avoided by adding
+   the flags :doc:`-pk kokkos gpu/aware off <Run_options>` to the
   LAMMPS command line or by using the command :doc:`package kokkos
-   cuda/aware off <package>` in the input file.
+   gpu/aware off <package>` in the input file.

 .. admonition:: AMD GPU support
   :class: note
@@ -242,8 +242,8 @@ case, also packing/unpacking communication buffers on the host may give
 speedup (see the KOKKOS :doc:`package <package>` command). Using CUDA MPS
 is recommended in this scenario.

-Using a CUDA-aware MPI library is highly recommended. CUDA-aware MPI use can be
-avoided by using :doc:`-pk kokkos cuda/aware no <package>`. As above for
+Using a GPU-aware MPI library is highly recommended. GPU-aware MPI use can be
+avoided by using :doc:`-pk kokkos gpu/aware off <package>`. As above for
 multi-core CPUs (and no GPU), if N is the number of physical cores/node,
 then the number of MPI tasks/node should not exceed N.

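For reference, the command-line form of the *gpu/aware off* workaround can be assembled programmatically. The sketch below is a hypothetical helper (`kokkos_cmdargs` is not part of LAMMPS); only the `-k`, `-sf`, and `-pk` switches it emits are real LAMMPS options, and the number of GPUs per node is an assumption about the target machine.

```python
from typing import List


def kokkos_cmdargs(num_gpus: int, gpu_aware: bool) -> List[str]:
    """Build LAMMPS command-line switches for a KOKKOS run on GPUs.

    Hypothetical helper for illustration only; the switches themselves
    (-k, -sf, -pk) are standard LAMMPS command-line options.
    """
    # Enable KOKKOS with num_gpus GPUs per node and apply the "kk" suffix.
    args = ["-k", "on", "g", str(num_gpus), "-sf", "kk"]
    if not gpu_aware:
        # Same effect as "package kokkos gpu/aware off" in the input script;
        # avoids segfaults when the MPI library is not GPU-aware.
        args += ["-pk", "kokkos", "gpu/aware", "off"]
    return args


if __name__ == "__main__":
    print(" ".join(kokkos_cmdargs(2, gpu_aware=False)))
```

The resulting list could, for example, be passed as the `cmdargs` argument of the `lammps` class in the LAMMPS Python module, or joined and appended to an `mpirun ... lmp -in in.script` line.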
--- a/doc/src/compute_orientorder_atom.rst
+++ b/doc/src/compute_orientorder_atom.rst
@@ -115,8 +115,8 @@ The optional keyword *chunksize* is only applicable when using the
 the KOKKOS package and is ignored otherwise. This keyword controls
 the number of atoms in each pass used to compute the bond-orientational
 order parameters and is used to avoid running out of memory. For example
-if there are 4000 atoms in the simulation and the *chunksize*
-is set to 2000, the parameter calculation will be broken up
+if there are 32768 atoms in the simulation and the *chunksize*
+is set to 16384, the parameter calculation will be broken up
 into two passes.

 The value of :math:`Q_l` is set to zero for atoms not in the
@@ -193,7 +193,7 @@ Default

 The option defaults are *cutoff* = pair style cutoff, *nnn* = 12,
 *degrees* = 5 4 6 8 10 12 i.e. :math:`Q_4`, :math:`Q_6`, :math:`Q_8`, :math:`Q_{10}`, and :math:`Q_{12}`,
-*wl* = no, *wl/hat* = no, *components* off, and *chunksize* = 2000
+*wl* = no, *wl/hat* = no, *components* = off, and *chunksize* = 16384

 ----------

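The *chunksize* arithmetic in the hunk above can be sketched as follows, assuming each pass handles at most *chunksize* atoms, so that the number of passes is the ceiling of natoms / chunksize. The function names are hypothetical and this is not the KOKKOS implementation; it only reproduces the worked example (32768 atoms with *chunksize* = 16384 gives two passes).

```python
from typing import List, Tuple


def chunk_passes(natoms: int, chunksize: int) -> int:
    """Number of passes if each pass handles at most chunksize atoms."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    # Integer ceiling division: ceil(natoms / chunksize).
    return (natoms + chunksize - 1) // chunksize


def chunk_ranges(natoms: int, chunksize: int) -> List[Tuple[int, int]]:
    """Half-open [start, end) atom index ranges, one per pass."""
    return [(start, min(start + chunksize, natoms))
            for start in range(0, natoms, chunksize)]


if __name__ == "__main__":
    print(chunk_passes(32768, 16384), chunk_ranges(32768, 16384))
```

The same arithmetic covers the *snap* example further below (8192 atoms with *chunksize* = 4096).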
--- a/doc/src/fix_ave_correlate.rst
+++ b/doc/src/fix_ave_correlate.rst
@@ -93,7 +93,7 @@ from a compute, fix, or variable, then see the :doc:`fix ave/chunk <fix_ave_chun
 :doc:`fix ave/histo <fix_ave_histo>` commands.  If you wish to convert a
 per-atom quantity into a single global value, see the :doc:`compute reduce <compute_reduce>` command.

-The input values must either be all scalars.  What kinds of
+The input values must be all scalars.  What kinds of
 correlations between input values are calculated is determined by the
 *type* keyword as discussed below.

--- a/doc/src/package.rst
+++ b/doc/src/package.rst
@@ -68,7 +68,7 @@ Syntax
           *no_affinity* values = none
       *kokkos* args = keyword value ...
         zero or more keyword/value pairs may be appended
-         keywords = *neigh* or *neigh/qeq* or *neigh/thread* or *newton* or *binsize* or *comm* or *comm/exchange* or *comm/forward* or *comm/reverse* or *cuda/aware* or *pair/only*
+         keywords = *neigh* or *neigh/qeq* or *neigh/thread* or *newton* or *binsize* or *comm* or *comm/exchange* or *comm/forward* or *pair/comm/forward* or *fix/comm/forward* or *comm/reverse* or *gpu/aware* or *pair/only*
           *neigh* value = *full* or *half*
             full = full neighbor list
             half = half neighbor list built in thread-safe manner
@@ -84,16 +84,18 @@ Syntax
           *binsize* value = size
             size = bin size for neighbor list construction (distance units)
           *comm* value = *no* or *host* or *device*
-             use value for comm/exchange and comm/forward and comm/reverse
+             use value for comm/exchange and comm/forward and pair/comm/forward and fix/comm/forward and comm/reverse
           *comm/exchange* value = *no* or *host* or *device*
           *comm/forward* value = *no* or *host* or *device*
+           *pair/comm/forward* value = *no* or *device*
+           *fix/comm/forward* value = *no* or *device*
           *comm/reverse* value = *no* or *host* or *device*
             no = perform communication pack/unpack in non-KOKKOS mode
             host = perform pack/unpack on host (e.g. with OpenMP threading)
             device = perform pack/unpack on device (e.g. on GPU)
-           *cuda/aware* = *off* or *on*
-             off = do not use CUDA-aware MPI
-             on = use CUDA-aware MPI (default)
+           *gpu/aware* = *off* or *on*
+             off = do not use GPU-aware MPI
+             on = use GPU-aware MPI (default)
           *pair/only* = *off* or *on*
             off = use device acceleration (e.g. GPU) for all available styles in the KOKKOS package (default)
             on  = use device acceleration only for pair styles (and host acceleration for others)
@@ -498,7 +500,8 @@ because the GPU is faster at performing pairwise interactions, then this
 rule of thumb may give too large a binsize and the default should be
 overridden with a smaller value.

-The *comm* and *comm/exchange* and *comm/forward* and *comm/reverse*
+The *comm* and *comm/exchange* and *comm/forward* and *pair/comm/forward*
+and *fix/comm/forward* and *comm/reverse*
 keywords determine whether the host or device performs the packing and
 unpacking of data when communicating per-atom data between processors.
 "Exchange" communication happens only on timesteps that neighbor lists
@@ -506,18 +509,22 @@ are rebuilt. The data is only for atoms that migrate to new processors.
 "Forward" communication happens every timestep. "Reverse" communication
 happens every timestep if the *newton* option is on. The data is for
 atom coordinates and any other atom properties that needs to be updated
+for ghost atoms owned by each processor. The *pair/comm/forward* keyword
+controls additional forward communication performed by pair styles, such
+as pair_style eam. The *fix/comm/forward* keyword controls additional
+forward communication performed by fixes, such as fix shake.

-The *comm* keyword is simply a short-cut to set the same value for both
-the *comm/exchange* and *comm/forward* and *comm/reverse* keywords.
+The *comm* keyword is simply a shortcut to set the same value for all
+five of the other communication keywords listed above.

-The value options for all 3 keywords are *no* or *host* or *device*\ . A
+The value options for the keywords are *no* or *host* or *device*\ . A
 value of *no* means to use the standard non-KOKKOS method of
 packing/unpacking data for the communication. A value of *host* means to
 use the host, typically a multi-core CPU, and perform the
 packing/unpacking in parallel with threads. A value of *device* means to
 use the device, typically a GPU, to perform the packing/unpacking
-operation.
+operation. If a value of *host* is used for the *pair/comm/forward* or
+*fix/comm/forward* keyword, it will automatically be changed to *no*,
+since these keywords do not support *host* mode.

 The optimal choice for these keywords depends on the input script and
 the hardware used. The *no* value is useful for verifying that the
@@ -538,18 +545,18 @@ pack/unpack communicated data. When running small systems on a GPU,
 performing the exchange pack/unpack on the host CPU can give speedup
 since it reduces the number of CUDA kernel launches.

-The *cuda/aware* keyword chooses whether CUDA-aware MPI will be used. When
+The *gpu/aware* keyword chooses whether GPU-aware MPI will be used. When
 this keyword is set to *on*\ , buffers in GPU memory are passed directly
 through MPI send/receive calls. This reduces overhead of first copying
-the data to the host CPU. However CUDA-aware MPI is not supported on all
+the data to the host CPU. However, GPU-aware MPI is not supported on all
 systems, which can lead to segmentation faults and would require using a
-value of *off*\ . If LAMMPS can safely detect that CUDA-aware MPI is not
+value of *off*\ . If LAMMPS can safely detect that GPU-aware MPI is not
 available (currently only possible with OpenMPI v2.0.0 or later), then
-the *cuda/aware* keyword is automatically set to *off* by default. When
-the *cuda/aware* keyword is set to *off* while any of the *comm*
+the *gpu/aware* keyword is automatically set to *off* by default. When
+the *gpu/aware* keyword is set to *off* while any of the *comm*
 keywords are set to *device*\ , the value for these *comm* keywords will
 be automatically changed to *no*\ . This setting has no effect if not
-running on GPUs or if using only one MPI rank. CUDA-aware MPI is available
+running on GPUs or if using only one MPI rank. GPU-aware MPI is available
 for OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later) when the
 "MV2_USE_CUDA" environment variable is set to "1", CrayMPI, and IBM
 Spectrum MPI when the "-gpu" flag is used.
@@ -558,7 +565,7 @@ The *pair/only* keyword can change how the KOKKOS suffix "kk" is applied
 when using an accelerator device.  By default device acceleration is
 always used for all available styles.  With *pair/only* set to *on* the
 suffix setting will choose device acceleration only for pair styles and
-run all other force computations concurrently on the host CPU.
+run all other force computations on the host CPU.
 The *comm* flags will also automatically be changed to *no*\ . This can
 result in better performance for certain configurations and system sizes.

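The interplay of the communication keywords described in the package.rst hunks can be summarized as a small model. This is a hypothetical sketch of the documented rules, not LAMMPS source code: it starts from a single *comm* value applied to all five keywords and assumes a multi-rank GPU run, where the *gpu/aware* setting actually takes effect.

```python
from typing import Dict

# The five keywords that the *comm* shortcut sets at once.
COMM_KEYS = ("comm/exchange", "comm/forward", "pair/comm/forward",
             "fix/comm/forward", "comm/reverse")
# Per the package docs, these two accept only "no" or "device".
NO_HOST_MODE = ("pair/comm/forward", "fix/comm/forward")


def resolve_comm(comm: str, gpu_aware: bool = True,
                 pair_only: bool = False) -> Dict[str, str]:
    """Hypothetical model of the documented KOKKOS comm-keyword rules.

    Assumes a multi-rank GPU run, where gpu/aware off has an effect.
    """
    if comm not in ("no", "host", "device"):
        raise ValueError(f"invalid comm value: {comm!r}")
    # The comm shortcut sets the same value for every comm keyword.
    settings = {key: comm for key in COMM_KEYS}
    # host is not supported for pair/fix forward comm: fall back to no.
    settings = {key: ("no" if key in NO_HOST_MODE and val == "host" else val)
                for key, val in settings.items()}
    if not gpu_aware:
        # Without GPU-aware MPI, device pack/unpack is changed to no.
        settings = {key: ("no" if val == "device" else val)
                    for key, val in settings.items()}
    if pair_only:
        # pair/only on also switches all comm flags to no.
        settings = {key: "no" for key in settings}
    return settings


if __name__ == "__main__":
    print(resolve_comm("host"))
```

Applying the rules in this order means *pair/only* on always wins, which matches the statement that the *comm* flags are then changed to *no*.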
@@ -671,8 +678,8 @@ script or via the "-pk intel" :doc:`command-line switch <Run_options>`.

 For the KOKKOS package, the option defaults for GPUs are neigh = full,
 neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default
-value, comm = device, cuda/aware = on. When LAMMPS can safely detect
-that CUDA-aware MPI is not available, the default value of cuda/aware
+value, comm = device, gpu/aware = on. When LAMMPS can safely detect
+that GPU-aware MPI is not available, the default value of gpu/aware
 becomes "off". For CPUs or Xeon Phis, the option defaults are neigh =
 half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The
 option neigh/thread = on when there are 16K atoms or less on an MPI
--- a/doc/src/pair_snap.rst
+++ b/doc/src/pair_snap.rst
@@ -152,7 +152,7 @@ The default values for these keywords are
 * *chemflag* = 0
 * *bnormflag* = 0
 * *wselfallflag* = 0
-* *chunksize* = 2000
+* *chunksize* = 4096

 If *quadraticflag* is set to 1, then the SNAP energy expression includes additional quadratic terms
 that have been shown to increase the overall accuracy of the potential without much increase
@@ -189,8 +189,8 @@ pair style *snap* with the KOKKOS package and is ignored otherwise.
 This keyword controls
 the number of atoms in each pass used to compute the bispectrum
 components and is used to avoid running out of memory. For example
-if there are 4000 atoms in the simulation and the *chunksize*
-is set to 2000, the bispectrum calculation will be broken up
+if there are 8192 atoms in the simulation and the *chunksize*
+is set to 4096, the bispectrum calculation will be broken up
 into two passes.

 Detailed definitions for all the other keywords