Merge pull request #1580 from stanmoore1/kk_cuda_aware
Fix CUDA-aware MPI issues with KOKKOS package
@@ -46,16 +46,15 @@ software version 7.5 or later must be installed on your system. See
 the discussion for the "GPU package"_Speed_gpu.html for details of how
 to check and do this.
 
-NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
-library is CUDA-aware and has support for GPU-direct. This is not
-always the case, especially when using pre-compiled MPI libraries
-provided by a Linux distribution. This is not a problem when using
-only a single GPU and a single MPI rank on a desktop. When running
-with multiple MPI ranks, you may see segmentation faults without
-GPU-direct support. These can be avoided by adding the flags "-pk
-kokkos gpu/direct off"_Run_options.html to the LAMMPS command line or
-by using the command "package kokkos gpu/direct off"_package.html in
-the input file.
+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI library
+is CUDA-aware. This is not always the case, especially when using
+pre-compiled MPI libraries provided by a Linux distribution. This is not
+a problem when using only a single GPU with a single MPI rank. When
+running with multiple MPI ranks, you may see segmentation faults without
+CUDA-aware MPI support. These can be avoided by adding the flags "-pk
+kokkos cuda/aware off"_Run_options.html to the LAMMPS command line or by
+using the command "package kokkos cuda/aware off"_package.html in the
+input file.
 
 [Building LAMMPS with the KOKKOS package:]
 
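For illustration, a minimal sketch of the workaround described in the new NOTE (the executable name lmp, the input file in.lj, and the rank/GPU counts are assumptions, not part of the patch):

  # disable CUDA-aware MPI from the command line
  mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos cuda/aware off -in in.lj

  # or equivalently, near the top of the input script
  package kokkos cuda/aware off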
@@ -217,9 +216,8 @@ case, also packing/unpacking communication buffers on the host may give
 speedup (see the KOKKOS "package"_package.html command). Using CUDA MPS
 is recommended in this scenario.
 
-Using a CUDA-aware MPI library with
-support for GPU-direct is highly recommended. GPU-direct use can be
-avoided by using "-pk kokkos gpu/direct no"_package.html. As above for
+Using a CUDA-aware MPI library is highly recommended. CUDA-aware MPI use can be
+avoided by using "-pk kokkos cuda/aware no"_package.html. As above for
 multi-core CPUs (and no GPU), if N is the number of physical cores/node,
 then the number of MPI tasks/node should not exceed N.
 
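Since the text above recommends CUDA MPS when several MPI ranks share a GPU, a hedged sketch of such a run follows (the MPS control commands are the standard NVIDIA ones; the rank count and file names are assumptions):

  # start the CUDA MPS daemon so multiple MPI ranks can share one GPU
  nvidia-cuda-mps-control -d

  # e.g. 4 MPI ranks on a 4-core node sharing a single GPU
  mpirun -np 4 lmp -k on g 1 -sf kk -in in.lj

  # shut the daemon down afterwards
  echo quit | nvidia-cuda-mps-control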
@@ -64,7 +64,7 @@ args = arguments specific to the style :l
 {no_affinity} values = none
 {kokkos} args = keyword value ...
 zero or more keyword/value pairs may be appended
-keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {gpu/direct}
+keywords = {neigh} or {neigh/qeq} or {neigh/thread} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward} or {comm/reverse} or {cuda/aware}
 {neigh} value = {full} or {half}
 full = full neighbor list
 half = half neighbor list built in thread-safe manner
@@ -87,9 +87,9 @@ args = arguments specific to the style :l
 no = perform communication pack/unpack in non-KOKKOS mode
 host = perform pack/unpack on host (e.g. with OpenMP threading)
 device = perform pack/unpack on device (e.g. on GPU)
-{gpu/direct} = {off} or {on}
-off = do not use GPU-direct
-on = use GPU-direct (default)
+{cuda/aware} = {off} or {on}
+off = do not use CUDA-aware MPI
+on = use CUDA-aware MPI (default)
 {omp} args = Nthreads keyword value ...
 Nthread = # of OpenMP threads to associate with each MPI process
 zero or more keyword/value pairs may be appended
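As a usage sketch of the renamed keyword in context (the other keyword/value choices are arbitrary examples taken from the list above):

  # input-script line combining the renamed keyword with other options
  package kokkos neigh half comm host cuda/aware off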
@@ -520,19 +520,21 @@ pack/unpack communicated data. When running small systems on a GPU,
 performing the exchange pack/unpack on the host CPU can give speedup
 since it reduces the number of CUDA kernel launches.
 
-The {gpu/direct} keyword chooses whether GPU-direct will be used. When
+The {cuda/aware} keyword chooses whether CUDA-aware MPI will be used. When
 this keyword is set to {on}, buffers in GPU memory are passed directly
 through MPI send/receive calls. This reduces overhead of first copying
-the data to the host CPU. However GPU-direct is not supported on all
+the data to the host CPU. However CUDA-aware MPI is not supported on all
 systems, which can lead to segmentation faults and would require using a
-value of {off}. If LAMMPS can safely detect that GPU-direct is not
+value of {off}. If LAMMPS can safely detect that CUDA-aware MPI is not
 available (currently only possible with OpenMPI v2.0.0 or later), then
-the {gpu/direct} keyword is automatically set to {off} by default. When
-the {gpu/direct} keyword is set to {off} while any of the {comm}
+the {cuda/aware} keyword is automatically set to {off} by default. When
+the {cuda/aware} keyword is set to {off} while any of the {comm}
 keywords are set to {device}, the value for these {comm} keywords will
 be automatically changed to {host}. This setting has no effect if not
-running on GPUs. GPU-direct is available for OpenMPI 1.8 (or later
-versions), Mvapich2 1.9 (or later), and CrayMPI.
+running on GPUs. CUDA-aware MPI is available for OpenMPI 1.8 (or later
+versions), Mvapich2 1.9 (or later) when the "MV2_USE_CUDA" environment
+variable is set to "1", CrayMPI, and IBM Spectrum MPI when the "-gpu"
+flag is used.
 
 :line
 
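A hedged sketch of launches matching the new wording (the MV2_USE_CUDA variable and the Spectrum MPI -gpu flag come from the paragraph above; the executable name, rank counts, and input file are assumptions):

  # Mvapich2: enable CUDA-aware transfers before launching
  export MV2_USE_CUDA=1
  mpirun -np 4 lmp -k on g 4 -sf kk -in in.lj

  # IBM Spectrum MPI: pass the -gpu flag to the launcher
  mpirun -gpu -np 4 lmp -k on g 4 -sf kk -in in.lj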
@@ -641,8 +643,8 @@ switch"_Run_options.html.
 
 For the KOKKOS package, the option defaults for GPUs are neigh = full,
 neigh/qeq = full, newton = off, binsize for GPUs = 2x LAMMPS default
-value, comm = device, gpu/direct = on. When LAMMPS can safely detect
-that GPU-direct is not available, the default value of gpu/direct
+value, comm = device, cuda/aware = on. When LAMMPS can safely detect
+that CUDA-aware MPI is not available, the default value of cuda/aware
 becomes "off". For CPUs or Xeon Phis, the option defaults are neigh =
 half, neigh/qeq = half, newton = on, binsize = 0.0, and comm = no. The
 option neigh/thread = on when there are 16K atoms or less on an MPI
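To show how one of these defaults can be overridden explicitly (a hypothetical command line; only the -pk switch and the keywords are taken from the documentation above):

  # override the GPU defaults for neighbor lists and buffer packing
  mpirun -np 2 lmp -k on g 2 -sf kk -pk kokkos neigh half comm host -in in.lj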