Merge remote-tracking branch 'upstream/master'
@ -46,7 +46,7 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
library is CUDA-aware and has support for GPU-direct. This is not
always the case, especially when using pre-compiled MPI libraries
provided by a Linux distribution. This is not a problem when using
@ -207,19 +207,21 @@ supports.

[Running on GPUs:]

Use the "-k" "command-line switch"_Run_options.html to specify the
number of GPUs per node. Typically the -np setting of the mpirun command
should set the number of MPI tasks/node to be equal to the number of
physical GPUs on the node. You can assign multiple MPI tasks to the same
GPU with the KOKKOS package, but this is usually only faster if some
portions of the input script have not been ported to use Kokkos. In this
case, packing/unpacking communication buffers on the host may also give
a speedup (see the KOKKOS "package"_package.html command). Using CUDA
MPS is recommended in this scenario.

Using a CUDA-aware MPI library with support for GPU-direct is highly
recommended. GPU-direct use can be avoided by using
"-pk kokkos gpu/direct no"_package.html. As above for multi-core CPUs
(and no GPU), if N is the number of physical cores/node, then the number
of MPI tasks/node should not exceed N.

-k on g Ng :pre
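
For example, a run on one node with 2 GPUs and 2 MPI tasks might be
launched as follows (a sketch; the executable name lmp_kokkos_cuda_mpi
and the input file in.lj are placeholders for your own build and
script):

mpirun -np 2 lmp_kokkos_cuda_mpi -k on g 2 -sf kk -in in.lj :pre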

@ -64,13 +64,16 @@ args = arguments specific to the style :l
  {no_affinity} values = none
{kokkos} args = keyword value ...
  zero or more keyword/value pairs may be appended
  keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
    {neigh} value = {full} or {half}
      full = full neighbor list
      half = half neighbor list built in thread-safe manner
    {neigh/qeq} value = {full} or {half}
      full = full neighbor list
      half = half neighbor list built in thread-safe manner
    {neigh/thread} value = {off} or {on}
      off = thread only over atoms
      on = thread over both atoms and neighbors
    {newton} = {off} or {on}
      off = set Newton pairwise and bonded flags off
      on = set Newton pairwise and bonded flags on
@ -442,7 +445,19 @@ running on CPUs, a {half} neighbor list is the default because it is
often faster, just as it is for non-accelerated pair styles. Similarly,
the {neigh/qeq} keyword determines how neighbor lists are built for "fix
qeq/reax/kk"_fix_qeq_reax.html. If not explicitly set, the value of
{neigh/qeq} will match {neigh}.

If the {neigh/thread} keyword is set to {off}, then the KOKKOS package
threads only over atoms. However, for small systems, this may not expose
enough parallelism to keep a GPU busy. When this keyword is set to {on},
the KOKKOS package threads over both atoms and neighbors of atoms. When
using {neigh/thread} {on}, a full neighbor list must also be used. Using
{neigh/thread} {on} may be slower for large systems, so this option is
turned on by default only when there are 16K atoms or less owned by an
MPI rank and when using a full neighbor list. Not all KOKKOS-enabled
potentials support this keyword yet; those that do not will only thread
over atoms. Many simple pair-wise potentials such as Lennard-Jones do
support threading over both atoms and neighbors.
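
For example, threading over both atoms and neighbors could be requested
explicitly with a line like this in the input script (a sketch; {neigh}
{full} is included since {neigh/thread} {on} requires a full neighbor
list):

package kokkos neigh full neigh/thread on :pre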

The {newton} keyword sets the Newton flags for pairwise and bonded
interactions to {off} or {on}, the same as the "newton"_newton.html
@ -475,10 +490,10 @@ are rebuilt. The data is only for atoms that migrate to new processors.
"Forward" communication happens every timestep. "Reverse" communication
happens every timestep if the {newton} option is on. The data is for
atom coordinates and any other atom properties that need to be updated
for ghost atoms owned by each processor.

The {comm} keyword is simply a short-cut to set the same value for all
three of the {comm/exchange}, {comm/forward}, and {comm/reverse}
keywords.

The value options for all 3 keywords are {no} or {host} or {device}. A
value of {no} means to use the standard non-KOKKOS method of
@ -486,26 +501,26 @@ packing/unpacking data for the communication. A value of {host} means to
use the host, typically a multi-core CPU, and perform the
packing/unpacking in parallel with threads. A value of {device} means to
use the device, typically a GPU, to perform the packing/unpacking
operation.

The optimal choice for these keywords depends on the input script and
the hardware used. The {no} value is useful for verifying that the
Kokkos-based {host} and {device} values are working correctly. It is the
default when running on CPUs since it is usually the fastest.

When running on CPUs or Xeon Phi, the {host} and {device} values work
identically. When using GPUs, the {device} value is the default since it
will typically be optimal if all of the styles used in your input script
are supported by the KOKKOS package. In this case data can stay on the
GPU for many timesteps without being moved between the host and GPU, if
you use the {device} value. If your script uses styles (e.g. fixes)
which are not yet supported by the KOKKOS package, then data has to be
moved between the host and device anyway, so it is typically faster to
let the host handle communication by using the {host} value. Using
{host} instead of {no} will enable use of multiple threads to
pack/unpack communicated data. When running small systems on a GPU,
performing the exchange pack/unpack on the host CPU can give a speedup
since it reduces the number of CUDA kernel launches.
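
For instance, host-side packing/unpacking for just the exchange phase
could be selected like this (a sketch; the other {comm} phases keep
their defaults):

package kokkos comm/exchange host :pre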

The {gpu/direct} keyword chooses whether GPU-direct will be used. When
this keyword is set to {on}, buffers in GPU memory are passed directly
@ -518,7 +533,8 @@ the {gpu/direct} keyword is automatically set to {off} by default. When
the {gpu/direct} keyword is set to {off} while any of the {comm}
keywords are set to {device}, the value for these {comm} keywords will
be automatically changed to {host}. This setting has no effect if not
running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI.
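
If your MPI library is not CUDA-aware, GPU-direct can be disabled from
the command line, for example (a sketch; executable and input names are
placeholders):

mpirun -np 2 lmp_kokkos_cuda_mpi -k on g 2 -sf kk -pk kokkos gpu/direct no -in in.lj :pre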

:line
@ -630,11 +646,12 @@ neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default
value, comm = device, gpu/direct = on. When LAMMPS can safely detect
that GPU-direct is not available, the default value of gpu/direct
becomes "off". For CPUs or Xeon Phis, the option defaults are neigh =
half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The
option neigh/thread = on when there are 16K atoms or less on an MPI
rank; otherwise it is "off". These settings are made automatically by
the required "-k on" "command-line switch"_Run_options.html. You can
change them by using the package kokkos command in your input script or
via the "-pk kokkos command-line switch"_Run_options.html.
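
For example, several of these defaults could be overridden at launch (a
sketch; the executable name lmp_kokkos_mpi_only and the input file are
placeholders for your own build and script):

mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj :pre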

For the OMP package, the default is Nthreads = 0 and the option
defaults are neigh = yes. These settings are made automatically if