Merge pull request #1580 from stanmoore1/kk_cuda_aware

Fix CUDA-aware MPI issues with KOKKOS package
Axel Kohlmeyer
2019-07-26 15:17:49 -04:00
committed by GitHub
9 changed files with 306 additions and 84 deletions


@@ -46,16 +46,15 @@ software version 7.5 or later must be installed on your system. See
the discussion for the "GPU package"_Speed_gpu.html for details of how
to check and do this.
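One quick way to check this from a shell (a sketch; assumes the CUDA
toolkit is on your PATH):

    nvcc --version    # reported release must be 7.5 or later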
-NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
-library is CUDA-aware and has support for GPU-direct. This is not
-always the case, especially when using pre-compiled MPI libraries
-provided by a Linux distribution. This is not a problem when using
-only a single GPU and a single MPI rank on a desktop. When running
-with multiple MPI ranks, you may see segmentation faults without
-GPU-direct support. These can be avoided by adding the flags "-pk
-kokkos gpu/direct off"_Run_options.html to the LAMMPS command line or
-by using the command "package kokkos gpu/direct off"_package.html in
-the input file.
+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI library
+is CUDA-aware. This is not always the case, especially when using
+pre-compiled MPI libraries provided by a Linux distribution. This is not
+a problem when using only a single GPU with a single MPI rank. When
+running with multiple MPI ranks, you may see segmentation faults without
+CUDA-aware MPI support. These can be avoided by adding the flags "-pk
+kokkos cuda/aware off"_Run_options.html to the LAMMPS command line or by
+using the command "package kokkos cuda/aware off"_package.html in the
+input file.
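For reference, a minimal sketch of the two workarounds named in this
NOTE (the binary name lmp, the rank count, and the input file in.lj
are placeholders):

    # disable CUDA-aware MPI from the command line
    mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos cuda/aware off -in in.lj

    # or equivalently, near the top of the input script
    package kokkos cuda/aware off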
[Building LAMMPS with the KOKKOS package:]
@@ -217,9 +216,8 @@ case, also packing/unpacking communication buffers on the host may give
speedup (see the KOKKOS "package"_package.html command). Using CUDA MPS
is recommended in this scenario.
-Using a CUDA-aware MPI library with
-support for GPU-direct is highly recommended. GPU-direct use can be
-avoided by using "-pk kokkos gpu/direct no"_package.html. As above for
+Using a CUDA-aware MPI library is highly recommended. CUDA-aware MPI use can be
+avoided by using "-pk kokkos cuda/aware no"_package.html. As above for
multi-core CPUs (and no GPU), if N is the number of physical cores/node,
then the number of MPI tasks/node should not exceed N.
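To make the core-count rule concrete (a sketch; 16 physical cores and
2 GPUs per node are assumptions):

    # at most one MPI task per physical core on each node
    mpirun -np 16 lmp -k on g 2 -sf kk -in in.lj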


@@ -64,7 +64,7 @@ args = arguments specific to the style :l
{no_affinity} values = none
{kokkos} args = keyword value ...
zero or more keyword/value pairs may be appended
-keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
+keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {cuda/aware}
{neigh} value = {full} or {half}
full = full neighbor list
half = half neighbor list built in thread-safe manner
@@ -87,9 +87,9 @@ args = arguments specific to the style :l
no = perform communication pack/unpack in non-KOKKOS mode
host = perform pack/unpack on host (e.g. with OpenMP threading)
device = perform pack/unpack on device (e.g. on GPU)
-{gpu/direct} = {off} or {on}
-off = do not use GPU-direct
-on = use GPU-direct (default)
+{cuda/aware} = {off} or {on}
+off = do not use CUDA-aware MPI
+on = use CUDA-aware MPI (default)
{omp} args = Nthreads keyword value ...
Nthread = # of OpenMP threads to associate with each MPI process
zero or more keyword/value pairs may be appended
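A sketch of how these keyword/value pairs combine in one package
command (the chosen values are illustrative only):

    # full neighbor lists, pack/unpack on the device, CUDA-aware MPI off
    package kokkos neigh full comm device cuda/aware off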
@@ -520,19 +520,21 @@ pack/unpack communicated data. When running small systems on a GPU,
performing the exchange pack/unpack on the host CPU can give speedup
since it reduces the number of CUDA kernel launches.
-The {gpu/direct} keyword chooses whether GPU-direct will be used. When
+The {cuda/aware} keyword chooses whether CUDA-aware MPI will be used. When
this keyword is set to {on}, buffers in GPU memory are passed directly
through MPI send/receive calls. This reduces overhead of first copying
-the data to the host CPU. However GPU-direct is not supported on all
+the data to the host CPU. However CUDA-aware MPI is not supported on all
systems, which can lead to segmentation faults and would require using a
-value of {off}. If LAMMPS can safely detect that GPU-direct is not
+value of {off}. If LAMMPS can safely detect that CUDA-aware MPI is not
available (currently only possible with OpenMPI v2.0.0 or later), then
-the {gpu/direct} keyword is automatically set to {off} by default. When
-the {gpu/direct} keyword is set to {off} while any of the {comm}
+the {cuda/aware} keyword is automatically set to {off} by default. When
+the {cuda/aware} keyword is set to {off} while any of the {comm}
keywords are set to {device}, the value for these {comm} keywords will
be automatically changed to {host}. This setting has no effect if not
-running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
-versions), Mvapich2 1.9 (or later), and CrayMPI.
+running on GPUs. CUDA-aware MPI is available for OpenMPI 1.8 (or later
+versions), Mvapich2 1.9 (or later) when the "MV2_USE_CUDA" environment
+variable is set to "1", CrayMPI, and IBM Spectrum MPI when the "-gpu"
+flag is used.
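For reference, minimal launch sketches for the two libraries above that
need an explicit switch (lmp and in.lj are placeholders):

    # MVAPICH2: enable CUDA support via the environment
    export MV2_USE_CUDA=1
    mpirun -np 4 lmp -k on g 4 -sf kk -in in.lj

    # IBM Spectrum MPI: pass the -gpu flag to the launcher
    mpirun -gpu -np 4 lmp -k on g 4 -sf kk -in in.lj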
:line
@@ -641,8 +643,8 @@ switch"_Run_options.html.
For the KOKKOS package, the option defaults for GPUs are neigh = full,
neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default
-value, comm = device, gpu/direct = on. When LAMMPS can safely detect
-that GPU-direct is not available, the default value of gpu/direct
+value, comm = device, cuda/aware = on. When LAMMPS can safely detect
+that CUDA-aware MPI is not available, the default value of cuda/aware
becomes "off". For CPUs or Xeon Phis, the option defaults are neigh =
half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The
option neigh/thread = on when there are 16K atoms or less on an MPI