add some notes about GPU-direct support requirements to the manual
@@ -96,6 +96,19 @@ software version 7.5 or later must be installed on your system. See
 the discussion for the "GPU package"_Speed_gpu.html for details of how
 to check and do this.

+NOTE: Kokkos with CUDA currently implicitly assumes that the MPI
+library is CUDA-aware and has support for GPU-direct. This is not always
+the case, especially when using pre-compiled MPI libraries provided by
+a Linux distribution. This is not a problem when using only a single
+GPU and a single MPI rank on a desktop. When running with multiple
+MPI ranks, you may see segmentation faults without GPU-direct support.
+Many of those can be avoided by adding the flag '-pk kokkos comm no'
+to the LAMMPS command line or using "package kokkos comm no"_package.html
+in the input file. However, for some KOKKOS-enabled styles such as
+"EAM"_pair_eam.html or "PPPM"_kspace_style.html this workaround does
+not apply, and a GPU-direct enabled MPI library is REQUIRED.
+
+
 Use a C++11 compatible compiler and set KOKKOS_ARCH variable in
 /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi for both GPU and CPU as
 described above. Then do the following:
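The note added above is easier to act on with a concrete check. The following is a sketch assuming Open MPI (other MPI implementations report CUDA-awareness differently); the binary name lmp and the input file in.lj are placeholders, not names from this commit:

```shell
# Report whether this Open MPI build is CUDA-aware; a CUDA-aware build
# prints a line ending in "mpi_built_with_cuda:value:true"
ompi_info --parsable --all | grep mpi_built_with_cuda

# Without GPU-direct support, fall back to host-side communication
# buffers as described in the note (placeholder binary/input names):
mpirun -np 2 lmp -k on g 1 -sf kk -pk kokkos comm no -in in.lj
```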
@@ -262,9 +275,12 @@ the # of physical GPUs on the node. You can assign multiple MPI tasks
 to the same GPU with the KOKKOS package, but this is usually only
 faster if significant portions of the input script have not been
 ported to use Kokkos. Using CUDA MPS is recommended in this
-scenario. As above for multi-core CPUs (and no GPU), if N is the
-number of physical cores/node, then the number of MPI tasks/node
-should not exceed N.
+scenario. Using a CUDA-aware MPI library with support for GPU-direct
+is highly recommended, and for some KOKKOS-enabled styles it is even
+required. Most uses of GPU-direct can be avoided with "-pk kokkos comm no".
+As above for multi-core CPUs (and no GPU), if N is the number of
+physical cores/node, then the number of MPI tasks/node should not
+exceed N.

 -k on g Ng :pre

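To make the run-command guidance above concrete, here is a hedged sketch for a hypothetical node with 16 physical cores and 4 GPUs; lmp and in.lj are placeholder names for the LAMMPS executable and input file:

```shell
# 4 MPI tasks per node, one per GPU -- the usual starting point
mpirun -np 4 lmp -k on g 4 -sf kk -in in.lj

# 16 MPI tasks sharing 4 GPUs; start the CUDA MPS control daemon
# first, as recommended above when oversubscribing GPUs
nvidia-cuda-mps-control -d
mpirun -np 16 lmp -k on g 4 -sf kk -in in.lj
```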
@@ -480,15 +480,16 @@ The value options for all 3 keywords are {no} or {host} or {device}.
 A value of {no} means to use the standard non-KOKKOS method of
 packing/unpacking data for the communication. A value of {host} means
 to use the host, typically a multi-core CPU, and perform the
-packing/unpacking in parallel with threads. A value of {device} means
-to use the device, typically a GPU, to perform the packing/unpacking
-operation.
+packing/unpacking in parallel with threads. A value of {device}
+means to use the device, typically a GPU, to perform the
+packing/unpacking operation.

 The optimal choice for these keywords depends on the input script and
 the hardware used. The {no} value is useful for verifying that the
-Kokkos-based {host} and {device} values are working correctly. It may
-also be the fastest choice when using Kokkos styles in MPI-only mode
-(i.e. with a thread count of 1).
+Kokkos-based {host} and {device} values are working correctly. The {no}
+value should also be used when the MPI library does not support
+GPU-direct. It may also be the fastest choice when using
+Kokkos styles in MPI-only mode (i.e. with a thread count of 1).

 When running on CPUs or Xeon Phi, the {host} and {device} values work
 identically. When using GPUs, the {device} value will typically be
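As a usage sketch for the comm keyword discussed above, with the three values taken from the text (see the "package"_package.html doc page for the full syntax):

```
package kokkos comm no      # standard non-KOKKOS pack/unpack; safe without GPU-direct
package kokkos comm host    # pack/unpack on the host, in parallel with threads
package kokkos comm device  # pack/unpack on the GPU; may require GPU-direct MPI
```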