Merge pull request #1580 from stanmoore1/kk_cuda_aware
Fix CUDA-aware MPI issues with KOKKOS package
@@ -46,16 +46,15 @@ software version 7.5 or later must be installed on your system. See
 the discussion for the "GPU package"_Speed_gpu.html for details of how
 to check and do this.
 
-NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
-library is CUDA-aware and has support for GPU-direct. This is not
-always the case, especially when using pre-compiled MPI libraries
-provided by a Linux distribution. This is not a problem when using
-only a single GPU and a single MPI rank on a desktop. When running
-with multiple MPI ranks, you may see segmentation faults without
-GPU-direct support. These can be avoided by adding the flags "-pk
-kokkos gpu/direct off"_Run_options.html to the LAMMPS command line or
-by using the command "package kokkos gpu/direct off"_package.html in
-the input file.
+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI library
+is CUDA-aware. This is not always the case, especially when using
+pre-compiled MPI libraries provided by a Linux distribution. This is not
+a problem when using only a single GPU with a single MPI rank. When
+running with multiple MPI ranks, you may see segmentation faults without
+CUDA-aware MPI support. These can be avoided by adding the flags "-pk
+kokkos cuda/aware off"_Run_options.html to the LAMMPS command line or by
+using the command "package kokkos cuda/aware off"_package.html in the
+input file.
 
 [Building LAMMPS with the KOKKOS package:]
 
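For illustration, a minimal sketch of the workaround described in the new NOTE (the executable name lmp, the input file in.lj, and the rank/GPU counts are assumptions, not part of the patch):

  # disable CUDA-aware MPI from the command line
  mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos cuda/aware off -in in.lj

  # or equivalently, near the top of the input script
  package kokkos cuda/aware off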
@@ -217,9 +216,8 @@ case, also packing/unpacking communication buffers on the host may give
 speedup (see the KOKKOS "package"_package.html command). Using CUDA MPS
 is recommended in this scenario.
 
-Using a CUDA-aware MPI library with
-support for GPU-direct is highly recommended. GPU-direct use can be
-avoided by using "-pk kokkos gpu/direct no"_package.html. As above for
+Using a CUDA-aware MPI library is highly recommended. CUDA-aware MPI use can be
+avoided by using "-pk kokkos cuda/aware no"_package.html. As above for
 multi-core CPUs (and no GPU), if N is the number of physical cores/node,
 then the number of MPI tasks/node should not exceed N.
 
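Since the text above recommends CUDA MPS when several MPI ranks share a GPU, a hedged sketch of such a run follows (the MPS control commands are the standard NVIDIA ones; the rank count and file names are assumptions):

  # start the CUDA MPS daemon so multiple MPI ranks can share one GPU
  nvidia-cuda-mps-control -d

  # e.g. 4 MPI ranks on a 4-core node sharing a single GPU
  mpirun -np 4 lmp -k on g 1 -sf kk -in in.lj

  # shut the daemon down afterwards
  echo quit | nvidia-cuda-mps-control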
@@ -64,7 +64,7 @@ args = arguments specific to the style :l
 {no_affinity} values = none
 {kokkos} args = keyword value ...
 zero or more keyword/value pairs may be appended
-keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
+keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {cuda/aware}
 {neigh} value = {full} or {half}
 full = full neighbor list
 half = half neighbor list built in thread-safe manner
@@ -87,9 +87,9 @@ args = arguments specific to the style :l
 no = perform communication pack/unpack in non-KOKKOS mode
 host = perform pack/unpack on host (e.g. with OpenMP threading)
 device = perform pack/unpack on device (e.g. on GPU)
-{gpu/direct} = {off} or {on}
-off = do not use GPU-direct
-on = use GPU-direct (default)
+{cuda/aware} = {off} or {on}
+off = do not use CUDA-aware MPI
+on = use CUDA-aware MPI (default)
 {omp} args = Nthreads keyword value ...
 Nthread = # of OpenMP threads to associate with each MPI process
 zero or more keyword/value pairs may be appended
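As a usage sketch of the renamed keyword in context (the other keyword/value choices are arbitrary examples taken from the list above):

  # input-script line combining the renamed keyword with other options
  package kokkos neigh half comm host cuda/aware off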
@@ -520,19 +520,21 @@ pack/unpack communicated data. When running small systems on a GPU,
 performing the exchange pack/unpack on the host CPU can give speedup
 since it reduces the number of CUDA kernel launches.
 
-The {gpu/direct} keyword chooses whether GPU-direct will be used. When
+The {cuda/aware} keyword chooses whether CUDA-aware MPI will be used. When
 this keyword is set to {on}, buffers in GPU memory are passed directly
 through MPI send/receive calls. This reduces overhead of first copying
-the data to the host CPU. However GPU-direct is not supported on all
+the data to the host CPU. However CUDA-aware MPI is not supported on all
 systems, which can lead to segmentation faults and would require using a
-value of {off}. If LAMMPS can safely detect that GPU-direct is not
+value of {off}. If LAMMPS can safely detect that CUDA-aware MPI is not
 available (currently only possible with OpenMPI v2.0.0 or later), then
-the {gpu/direct} keyword is automatically set to {off} by default. When
-the {gpu/direct} keyword is set to {off} while any of the {comm}
+the {cuda/aware} keyword is automatically set to {off} by default. When
+the {cuda/aware} keyword is set to {off} while any of the {comm}
 keywords are set to {device}, the value for these {comm} keywords will
 be automatically changed to {host}. This setting has no effect if not
-running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
-versions), Mvapich2 1.9 (or later), and CrayMPI.
+running on GPUs. CUDA-aware MPI is available for OpenMPI 1.8 (or later
+versions), Mvapich2 1.9 (or later) when the "MV2_USE_CUDA" environment
+variable is set to "1", CrayMPI, and IBM Spectrum MPI when the "-gpu"
+flag is used.
 
 :line
 
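A hedged sketch of launches matching the new wording (the MV2_USE_CUDA variable and the Spectrum MPI -gpu flag come from the paragraph above; the executable name, rank counts, and input file are assumptions):

  # Mvapich2: enable CUDA-aware transfers before launching
  export MV2_USE_CUDA=1
  mpirun -np 4 lmp -k on g 4 -sf kk -in in.lj

  # IBM Spectrum MPI: pass the -gpu flag to the launcher
  mpirun -gpu -np 4 lmp -k on g 4 -sf kk -in in.lj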
@@ -641,8 +643,8 @@ switch"_Run_options.html.
 
 For the KOKKOS package, the option defaults for GPUs are neigh = full,
 neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default
-value, comm = device, gpu/direct = on. When LAMMPS can safely detect
-that GPU-direct is not available, the default value of gpu/direct
+value, comm = device, cuda/aware = on. When LAMMPS can safely detect
+that CUDA-aware MPI is not available, the default value of cuda/aware
 becomes "off". For CPUs or Xeon Phis, the option defaults are neigh =
 half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The
 option neigh/thread = on when there are 16K atoms or less on an MPI
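To show how one of these defaults can be overridden explicitly (a hypothetical command line; only the -pk switch and the keywords are taken from the documentation above):

  # override the GPU defaults for neighbor lists and buffer packing
  mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos neigh half comm host -in in.lj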