Merge pull request #2004 from stanmoore1/kk_3.1

Update Kokkos library in LAMMPS to v3.1

Committed by Axel Kohlmeyer via GitHub, 2020-04-24 18:35:53 -04:00
560 changed files with 24838 additions and 15005 deletions

View File

@ -320,11 +320,12 @@ to have an executable that will run on this and newer architectures.
.. note::
If you run Kokkos on a newer GPU architecture than what LAMMPS was
compiled with, there will be a delay during device initialization
since the just-in-time compiler has to recompile all GPU kernels
for the new hardware. This is, however, not possible when LAMMPS
was compiled for NVIDIA GPUs with CC 3.x (Kepler), since GPUs with
CC 5.0 (Maxwell) and newer are not compatible with Kepler binaries.
The settings discussed below have been tested with LAMMPS and are
confirmed to work. Kokkos is an active project with ongoing improvements
@ -343,73 +344,109 @@ be specified in uppercase.
:widths: auto
* - **Arch-ID**
- **HOST or GPU**
- **Description**
* - AMDAVX
- HOST
- AMD 64-bit x86 CPU (AVX 1)
* - EPYC
- HOST
- AMD EPYC Zen class CPU (AVX 2)
* - ARMV80
- HOST
- ARMv8.0 Compatible CPU
* - ARMV81
- HOST
- ARMv8.1 Compatible CPU
* - ARMV8_THUNDERX
- HOST
- ARMv8 Cavium ThunderX CPU
* - ARMV8_THUNDERX2
- HOST
- ARMv8 Cavium ThunderX2 CPU
* - WSM
- HOST
- Intel Westmere CPU (SSE 4.2)
* - SNB
- HOST
- Intel Sandy/Ivy Bridge CPU (AVX 1)
* - HSW
- HOST
- Intel Haswell CPU (AVX 2)
* - BDW
- HOST
- Intel Broadwell Xeon E-class CPU (AVX 2 + transactional mem)
* - SKX
- HOST
- Intel Sky Lake Xeon E-class HPC CPU (AVX512 + transactional mem)
* - KNC
- HOST
- Intel Knights Corner Xeon Phi
* - KNL
- HOST
- Intel Knights Landing Xeon Phi
* - BGQ
- HOST
- IBM Blue Gene/Q CPU
* - POWER7
- HOST
- IBM POWER7 CPU
* - POWER8
- HOST
- IBM POWER8 CPU
* - POWER9
- HOST
- IBM POWER9 CPU
* - KEPLER30
- GPU
- NVIDIA Kepler generation CC 3.0 GPU
* - KEPLER32
- GPU
- NVIDIA Kepler generation CC 3.2 GPU
* - KEPLER35
- GPU
- NVIDIA Kepler generation CC 3.5 GPU
* - KEPLER37
- GPU
- NVIDIA Kepler generation CC 3.7 GPU
* - MAXWELL50
- GPU
- NVIDIA Maxwell generation CC 5.0 GPU
* - MAXWELL52
- GPU
- NVIDIA Maxwell generation CC 5.2 GPU
* - MAXWELL53
- GPU
- NVIDIA Maxwell generation CC 5.3 GPU
* - PASCAL60
- GPU
- NVIDIA Pascal generation CC 6.0 GPU
* - PASCAL61
- GPU
- NVIDIA Pascal generation CC 6.1 GPU
* - VOLTA70
- GPU
- NVIDIA Volta generation CC 7.0 GPU
* - VOLTA72
- GPU
- NVIDIA Volta generation CC 7.2 GPU
* - TURING75
- GPU
- NVIDIA Turing generation CC 7.5 GPU
* - VEGA900
- GPU
- AMD GPU MI25 GFX900
* - VEGA906
- GPU
- AMD GPU MI50/MI60 GFX906
Basic CMake build settings:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
For multicore CPUs using OpenMP, set these variables:
.. code-block:: bash
-D Kokkos_ARCH_HOSTARCH=yes # HOSTARCH = HOST from list above
-D Kokkos_ENABLE_OPENMP=yes
-D BUILD_OMP=yes
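Put together, a complete configuration command might look like the following minimal sketch, which assumes a Haswell-class host CPU (``HSW`` is only an example; substitute the HOST Arch-ID matching your machine):

.. code-block:: bash

# minimal sketch: OpenMP-only KOKKOS build for a Haswell-class host CPU
cmake ../cmake -D PKG_KOKKOS=yes -D Kokkos_ARCH_HSW=yes -D Kokkos_ENABLE_OPENMP=yes -D BUILD_OMP=yes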
@ -427,15 +464,19 @@ For NVIDIA GPUs using CUDA, set these variables:
.. code-block:: bash
-D Kokkos_ARCH_HOSTARCH=yes # HOSTARCH = HOST from list above
-D Kokkos_ARCH_GPUARCH=yes # GPUARCH = GPU from list above
-D Kokkos_ENABLE_CUDA=yes
-D Kokkos_ENABLE_OPENMP=yes
-D CMAKE_CXX_COMPILER=wrapper # wrapper = full path to Cuda nvcc wrapper
This will also enable executing FFTs on the GPU, either via the internal
KISSFFT library or - preferably - with the cuFFT library bundled with
the CUDA toolkit, depending on whether CMake can identify its location.
The *wrapper* value for the ``CMAKE_CXX_COMPILER`` variable is the path
to the CUDA nvcc compiler wrapper provided in the Kokkos library:
``lib/kokkos/bin/nvcc_wrapper``\ . The setting should include the full
path name to the wrapper, e.g.
.. code-block:: bash
@ -455,8 +496,8 @@ common packages enabled, you can do the following:
cmake -C ../cmake/presets/minimal.cmake -C ../cmake/presets/kokkos-cuda.cmake ../cmake
cmake --build .
Basic traditional make settings:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Choose which hardware to support in ``Makefile.machine`` via
``KOKKOS_DEVICES`` and ``KOKKOS_ARCH`` settings. See the
@ -467,7 +508,7 @@ For multicore CPUs using OpenMP:
.. code-block:: make
KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = HOSTARCH # HOSTARCH = HOST from list above
For Intel KNLs using OpenMP:
@ -481,7 +522,8 @@ For NVIDIA GPUs using CUDA:
.. code-block:: make
KOKKOS_DEVICES = Cuda
KOKKOS_ARCH = HOSTARCH,GPUARCH # HOSTARCH = HOST from list above that is hosting the GPU
                               # GPUARCH = GPU from list above
KOKKOS_CUDA_OPTIONS = "enable_lambda"
FFT_INC = -DFFT_CUFFT # enable use of cuFFT (optional)
FFT_LIB = -lcufft # link to cuFFT library
@ -504,6 +546,44 @@ C++ compiler for non-Kokkos, non-CUDA files.
KOKKOS_ABSOLUTE_PATH = $(shell cd $(KOKKOS_PATH); pwd)
CC = mpicxx -cxx=$(KOKKOS_ABSOLUTE_PATH)/config/nvcc_wrapper
Advanced KOKKOS compilation settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are other allowed options when building with the KOKKOS package
that can improve performance or assist in debugging or profiling. Below
are some examples that may be useful in combination with LAMMPS. For
the full list (which keeps changing as the Kokkos package itself evolves),
please consult the Kokkos library documentation.
As an alternative to multi-threading via OpenMP
(``-DKokkos_ENABLE_OPENMP=on`` or ``KOKKOS_DEVICES=OpenMP``) it is also
possible to use POSIX threads directly (``-DKokkos_ENABLE_PTHREAD=on``
or ``KOKKOS_DEVICES=Pthread``). While OpenMP manages the binding of
threads to individual CPU cores or groups of cores through environment
variables, the Pthread back end needs assistance from either the
"hwloc" or the "libnuma" library. To enable their use with CMake:
``-DKokkos_ENABLE_HWLOC=on`` or ``-DKokkos_ENABLE_LIBNUMA=on``; and with
conventional make: ``KOKKOS_USE_TPLS=hwloc`` or
``KOKKOS_USE_TPLS=libnuma``.
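For illustration, a CMake configuration enabling the Pthread back end together with hwloc-based thread binding could look like this sketch (architecture options omitted; add the ones matching your hardware):

.. code-block:: bash

# sketch: Pthread back end with hwloc support for thread binding
cmake ../cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_PTHREAD=on -D Kokkos_ENABLE_HWLOC=on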
The CMake option ``-DKokkos_ENABLE_LIBRT=on`` or the makefile setting
``KOKKOS_USE_TPLS=librt`` enables the use of a more accurate timer
mechanism on many Unix-like platforms for internal profiling.
The CMake option ``-DKokkos_ENABLE_DEBUG=on`` or the makefile setting
``KOKKOS_DEBUG=yes`` enables the printing of useful run-time debugging
information and also enables bounds checking on Kokkos data
structures. As is to be expected, enabling this option negatively
impacts performance and is thus only recommended when developing a
Kokkos-enabled style in LAMMPS.
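For example (a sketch; intended for development builds only):

.. code-block:: bash

# sketch: enable Kokkos debug mode and bounds checking via CMake
cmake ../cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_DEBUG=on
# conventional make build: set in Makefile.machine instead
#   KOKKOS_DEBUG = yes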
The CMake option ``-DKokkos_ENABLE_CUDA_UVM=on`` or the makefile
setting ``KOKKOS_CUDA_OPTIONS=enable_lambda,force_uvm`` enables the
use of CUDA "Unified Virtual Memory" in Kokkos. Please note that
the LAMMPS KOKKOS package must **always** be compiled with the
*enable_lambda* option when using GPUs.
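A possible way to set this is sketched below; it assumes that the CUDA-related settings shown earlier (GPU architecture and the nvcc wrapper compiler) are already in place:

.. code-block:: bash

# sketch: enable CUDA unified virtual memory (UVM) with CMake
cmake ../cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ENABLE_CUDA_UVM=yes
# conventional make: keep lambda support enabled alongside UVM in Makefile.machine
#   KOKKOS_CUDA_OPTIONS = "enable_lambda,force_uvm"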
----------
.. _latte:

View File

@ -9,10 +9,7 @@ different back end languages such as CUDA, OpenMP, or Pthreads. The
Kokkos library also provides data abstractions to adjust (at compile
time) the memory layout of data structures like 2d and 3d arrays to
optimize performance on different hardware. For more information on
Kokkos, see `GitHub <https://github.com/kokkos/kokkos>`_.
The LAMMPS KOKKOS package contains versions of pair, fix, and atom
styles that use data structures and macros provided by the Kokkos
@ -21,7 +18,7 @@ package was developed primarily by Christian Trott (Sandia) and Stan
Moore (Sandia) with contributions of various styles by others,
including Sikandar Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez
(Sandia). For more information on developing using Kokkos abstractions
see the Kokkos `Wiki <https://github.com/kokkos/kokkos/wiki>`_.
Kokkos currently provides support for 3 modes of execution (per MPI
task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP
@ -31,33 +28,30 @@ compatible with specific hardware.
.. note::
Kokkos support within LAMMPS must be built with a C++11 compatible
compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
Clang 3.5.2 or later is required.
.. note::
To build with Kokkos support for NVIDIA GPUs, the NVIDIA CUDA toolkit
software version 9.0 or later must be installed on your system. See
the discussion for the :doc:`GPU package <Speed_gpu>` for details of
how to check and do this.
.. note::
Kokkos with CUDA currently implicitly assumes that the MPI library is
CUDA-aware. This is not always the case, especially when using
pre-compiled MPI libraries provided by a Linux distribution. This is
not a problem when using only a single GPU with a single MPI rank.
When running with multiple MPI ranks, you may see segmentation faults
without CUDA-aware MPI support. These can be avoided by adding the
flags :doc:`-pk kokkos cuda/aware off <Run_options>` to the LAMMPS
command line or by using the command
:doc:`package kokkos cuda/aware off <package>` in the input file.
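For example, a run on two GPUs without CUDA-aware MPI support might be launched like this sketch (the executable and input file names are placeholders):

.. code-block:: bash

# sketch: disable the CUDA-aware MPI code path from the command line
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos cuda/aware off -in in.lj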
Building LAMMPS with the KOKKOS package
"""""""""""""""""""""""""""""""""""""""
See the :ref:`Build extras <kokkos>` doc page for instructions.
Running LAMMPS with the KOKKOS package
""""""""""""""""""""""""""""""""""""""
All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
@ -66,7 +60,8 @@ usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos. E.g. the mpirun command in OpenMPI does this via its -np and
-npernode switches. Ditto for MPICH via -np and -ppn.
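As a quick illustration (executable and input file names are placeholders), launching 8 MPI tasks spread over 2 nodes with OpenMPI could look like:

.. code-block:: bash

# sketch: 2 nodes x 4 MPI tasks per node, 4 OpenMP threads per task
mpirun -np 8 -npernode 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj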
Running on a multi-core CPU
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.
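The following sketch shows typical launch commands for a single 16-core node (the executable and input file names are placeholders):

.. code-block:: bash

# sketch: one 16-core node
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj   # 16 MPI tasks, no threading
mpirun -np 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj     # 4 MPI tasks x 4 OpenMP threads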
@ -142,7 +137,8 @@ atom. When using the Kokkos Serial back end or the OpenMP back end with
a single thread, no duplication or atomic operations are used. For CUDA
and half neighbor lists, the KOKKOS package always uses atomic operations.
CPU Cores, Sockets and Thread Affinity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When using multi-threading, it is important for performance to bind
both MPI tasks to physical cores, and threads to physical cores, so
@ -156,15 +152,16 @@ for your MPI installation), binding can be forced with these flags:
OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ...
For binding threads with KOKKOS OpenMP, use thread affinity environment
variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12
or later) setting the environment variable ``OMP_PROC_BIND=true`` should
be sufficient. In general, for best performance with OpenMP 4.0 or later
set ``OMP_PROC_BIND=spread`` and ``OMP_PLACES=threads``. For binding
threads with the KOKKOS pthreads option, compile LAMMPS with hwloc or
libnuma support enabled as described on the :ref:`extra build options page <kokkos>`.
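For example (a sketch assuming a bash-like shell, the OpenMP back end, and a placeholder executable name):

.. code-block:: bash

# sketch: bind OpenMP threads to cores before launching LAMMPS
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
mpirun -np 2 --bind-to socket --map-by socket lmp_kokkos_omp -k on t 8 -sf kk -in in.lj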
Running on Knights Landing (KNL) Intel Xeon Phi
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is a quick overview of how to use the KOKKOS package for the
Intel Knights Landing (KNL) Xeon Phi:
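A typical launch on a single KNL node might look like this sketch (executable and input file names are placeholders):

.. code-block:: bash

# sketch: 64 MPI tasks with 4 hardware threads each on one KNL node
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj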
@ -222,7 +219,8 @@ threads/task as Nt. The product of these two values should be N, i.e.
them in "native" mode, not "offload" mode like the USER-INTEL package
supports.
Running on GPUs
^^^^^^^^^^^^^^^
Use the "-k" :doc:`command-line switch <Run_options>` to specify the
number of GPUs per node. Typically the -np setting of the mpirun command
@ -257,7 +255,7 @@ one or more nodes, each with two GPUs:
running on GPUs is to use "full" neighbor lists and set the Newton flag
to "off" for both pairwise and bonded interactions, along with threaded
communication. When running on Maxwell or Kepler GPUs, this will
typically be best. For Pascal GPUs and beyond, using "half" neighbor lists and
setting the Newton flag to "on" may be faster. For many pair styles,
setting the neighbor binsize equal to twice the CPU default value will
give speedup, which is the default when running on GPUs. Use the "-pk
@ -270,13 +268,6 @@ one or more nodes, each with two GPUs:
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighbor list, set binsize = neighbor ghost cutoff
.. note::
When using a GPU, you will achieve the best performance if your
@ -293,7 +284,8 @@ one or more nodes, each with two GPUs:
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
However, this will reduce performance and is not recommended for production runs.
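For example (a sketch for a bash-like shell; use only for debugging or profiling runs):

.. code-block:: bash

# sketch: serialize kernel launches to get meaningful per-kernel timings
export CUDA_LAUNCH_BLOCKING=1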
Run with the KOKKOS package by editing an input script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Alternatively, the effect of the "-sf" or "-pk" switches can be
duplicated by adding the :doc:`package kokkos <package>` or :doc:`suffix kk <suffix>` commands to your input script.
@ -316,17 +308,24 @@ You only need to use the :doc:`package kokkos <package>` command if you
wish to change any of its option defaults, as set by the "-k on"
:doc:`command-line switch <Run_options>`.
**Using OpenMP threading and CUDA together:**
With the KOKKOS package, both OpenMP multi-threading and GPUs can be
compiled and used together in a few special cases. In the makefile for
the conventional build, the KOKKOS_DEVICES variable must include both
"Cuda" and "OpenMP", as is the case for ``/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi``.
.. code-block:: bash
KOKKOS_DEVICES=Cuda,OpenMP
When building with CMake, you need to enable both features, as is done
in the ``kokkos-cuda.cmake`` CMake preset file.
.. code-block:: bash
cmake ../cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes
The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using the "-sf kk" in the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
@ -360,7 +359,8 @@ suffix for kspace and bonds, angles, etc. in the input file and the
sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
so CPU/GPU overlap can occur.
Performance to expect
"""""""""""""""""""""
The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
@ -377,52 +377,26 @@ Generally speaking, the following rules of thumb apply:
performance of a KOKKOS style is a bit slower than the USER-OMP
package.
* When running a large number of atoms per GPU, KOKKOS is typically faster
than the GPU package when compiled for double precision. The benefit
of using single or mixed precision with the GPU package depends
significantly on the hardware in use and the simulated system and pair
style.
* When running on Intel hardware, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for x86 hardware (not just
from Intel) and compilation with the Intel compilers. The USER-INTEL
package can also increase the vector length of vector instructions
by switching to single or mixed precision mode.
See the `Benchmark page <https://lammps.sandia.gov/bench.html>`_ of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.
Advanced Kokkos options
"""""""""""""""""""""""
There are other allowed options when building with the KOKKOS package
that can improve performance or assist in debugging or profiling.
They are explained on the :ref:`KOKKOS section of the build extras <kokkos>` doc page.
Restrictions
""""""""""""