Merge remote-tracking branch 'upstream/master'
@ -46,7 +46,7 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.

NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
library is CUDA-aware and has support for GPU-direct. This is not
always the case, especially when using pre-compiled MPI libraries
provided by a Linux distribution. This is not a problem when using
@ -207,19 +207,21 @@ supports.

[Running on GPUs:]

Use the "-k" "command-line switch"_Run_options.html to specify the
number of GPUs per node. Typically the -np setting of the mpirun command
should set the number of MPI tasks/node to be equal to the number of
physical GPUs on the node. You can assign multiple MPI tasks to the same
GPU with the KOKKOS package, but this is usually only faster if some
portions of the input script have not been ported to use Kokkos. In this
case, packing/unpacking communication buffers on the host may also give
a speedup (see the KOKKOS "package"_package.html command). Using CUDA
MPS is recommended in this scenario.

Using a CUDA-aware MPI library with support for GPU-direct is highly
recommended. GPU-direct use can be avoided by using
"-pk kokkos gpu/direct no"_package.html. As above for multi-core CPUs
(and no GPU), if N is the number of physical cores/node, then the number
of MPI tasks/node should not exceed N.

-k on g Ng :pre
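
For example, a run on one node with 2 GPUs and 2 MPI tasks might be
launched as follows (a sketch; the executable name lmp_kokkos_cuda_mpi
and the input file in.lj are placeholders for your own build and
script):

mpirun -np 2 lmp_kokkos_cuda_mpi -k on g 2 -sf kk -in in.lj :pre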

@ -64,13 +64,16 @@ args = arguments specific to the style :l
  {no_affinity} values = none
{kokkos} args = keyword value ...
  zero or more keyword/value pairs may be appended
  keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
    {neigh} value = {full} or {half}
      full = full neighbor list
      half = half neighbor list built in thread-safe manner
    {neigh/qeq} value = {full} or {half}
      full = full neighbor list
      half = half neighbor list built in thread-safe manner
    {neigh/thread} value = {off} or {on}
      off = thread only over atoms
      on = thread over both atoms and neighbors
    {newton} = {off} or {on}
      off = set Newton pairwise and bonded flags off
      on = set Newton pairwise and bonded flags on
@ -442,7 +445,19 @@ running on CPUs, a {half} neighbor list is the default because it is
often faster, just as it is for non-accelerated pair styles. Similarly,
the {neigh/qeq} keyword determines how neighbor lists are built for "fix
qeq/reax/kk"_fix_qeq_reax.html. If not explicitly set, the value of
{neigh/qeq} will match {neigh}.

If the {neigh/thread} keyword is set to {off}, then the KOKKOS package
threads only over atoms. However, for small systems, this may not expose
enough parallelism to keep a GPU busy. When this keyword is set to {on},
the KOKKOS package threads over both atoms and neighbors of atoms. When
using {neigh/thread} {on}, a full neighbor list must also be used. Using
{neigh/thread} {on} may be slower for large systems, so this option is
turned on by default only when there are 16K atoms or less owned by an
MPI rank and when using a full neighbor list. Not all KOKKOS-enabled
potentials support this keyword yet; those that do not will only thread
over atoms. Many simple pair-wise potentials such as Lennard-Jones do
support threading over both atoms and neighbors.
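
For example, threading over both atoms and neighbors could be requested
explicitly with a line like this in the input script (a sketch; {neigh}
{full} is included since {neigh/thread} {on} requires a full neighbor
list):

package kokkos neigh full neigh/thread on :pre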

The {newton} keyword sets the Newton flags for pairwise and bonded
interactions to {off} or {on}, the same as the "newton"_newton.html
@ -475,10 +490,10 @@ are rebuilt. The data is only for atoms that migrate to new processors.
"Forward" communication happens every timestep. "Reverse" communication
happens every timestep if the {newton} option is on. The data is for
atom coordinates and any other atom properties that need to be updated
for ghost atoms owned by each processor.

The {comm} keyword is simply a short-cut to set the same value for all
three of the {comm/exchange}, {comm/forward}, and {comm/reverse}
keywords.

The value options for all 3 keywords are {no} or {host} or {device}. A
value of {no} means to use the standard non-KOKKOS method of
@ -486,26 +501,26 @@ packing/unpacking data for the communication. A value of {host} means to
use the host, typically a multi-core CPU, and perform the
packing/unpacking in parallel with threads. A value of {device} means to
use the device, typically a GPU, to perform the packing/unpacking
operation.

The optimal choice for these keywords depends on the input script and
the hardware used. The {no} value is useful for verifying that the
Kokkos-based {host} and {device} values are working correctly. It is the
default when running on CPUs since it is usually the fastest.

When running on CPUs or Xeon Phi, the {host} and {device} values work
identically. When using GPUs, the {device} value is the default since it
will typically be optimal if all of the styles used in your input script
are supported by the KOKKOS package. In this case data can stay on the
GPU for many timesteps without being moved between the host and GPU, if
you use the {device} value. If your script uses styles (e.g. fixes)
which are not yet supported by the KOKKOS package, then data has to be
moved between the host and device anyway, so it is typically faster to
let the host handle communication by using the {host} value. Using
{host} instead of {no} will enable use of multiple threads to
pack/unpack communicated data. When running small systems on a GPU,
performing the exchange pack/unpack on the host CPU can give a speedup
since it reduces the number of CUDA kernel launches.
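
For instance, host-side packing/unpacking for just the exchange phase
could be selected like this (a sketch; the other {comm} phases keep
their defaults):

package kokkos comm/exchange host :pre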

The {gpu/direct} keyword chooses whether GPU-direct will be used. When
this keyword is set to {on}, buffers in GPU memory are passed directly
@ -518,7 +533,8 @@ the {gpu/direct} keyword is automatically set to {off} by default. When
the {gpu/direct} keyword is set to {off} while any of the {comm}
keywords are set to {device}, the value for these {comm} keywords will
be automatically changed to {host}. This setting has no effect if not
running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
versions), Mvapich2 1.9 (or later), and CrayMPI.
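
If your MPI library is not CUDA-aware, GPU-direct can be disabled from
the command line, for example (a sketch; executable and input names are
placeholders):

mpirun -np 2 lmp_kokkos_cuda_mpi -k on g 2 -sf kk -pk kokkos gpu/direct no -in in.lj :pre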

:line
@ -630,11 +646,12 @@ neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default
value, comm = device, gpu/direct = on. When LAMMPS can safely detect
that GPU-direct is not available, the default value of gpu/direct
becomes "off". For CPUs or Xeon Phis, the option defaults are neigh =
half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The
option neigh/thread = on when there are 16K atoms or less on an MPI
rank; otherwise it is "off". These settings are made automatically by
the required "-k on" "command-line switch"_Run_options.html. You can
change them by using the package kokkos command in your input script or
via the "-pk kokkos command-line switch"_Run_options.html.
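
For example, several of these defaults could be overridden at launch (a
sketch; the executable name lmp_kokkos_mpi_only and the input file are
placeholders for your own build and script):

mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj :pre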

For the OMP package, the default is Nthreads = 0 and the option
defaults are neigh = yes. These settings are made automatically if