Merge pull request #2004 from stanmoore1/kk_3.1

Update Kokkos library in LAMMPS to v3.1

Committed by Axel Kohlmeyer via GitHub, 2020-04-24 18:35:53 -04:00
560 changed files with 24838 additions and 15005 deletions

View File

@ -320,11 +320,12 @@ to have an executable that will run on this and newer architectures.
.. note::
If you run Kokkos on a newer GPU architecture than what LAMMPS was
compiled with, there will be a delay during device initialization
since the just-in-time compiler has to recompile all GPU kernels
for the new hardware. This is, however, not possible when LAMMPS
was compiled for NVIDIA GPUs with CC 3.x (Kepler), since GPUs with
CC 5.0 (Maxwell) and newer are not compatible with Kepler binaries.
The settings discussed below have been tested with LAMMPS and are
confirmed to work. Kokkos is an active project with ongoing improvements
@ -343,73 +344,109 @@ be specified in uppercase.
:widths: auto
* - **Arch-ID**
- **HOST or GPU**
- **Description**
* - AMDAVX
- HOST
- AMD 64-bit x86 CPU (AVX 1)
* - EPYC
- HOST
- AMD EPYC Zen class CPU (AVX 2)
* - ARMV80
- HOST
- ARMv8.0 Compatible CPU
* - ARMV81
- HOST
- ARMv8.1 Compatible CPU
* - ARMV8_THUNDERX
- HOST
- ARMv8 Cavium ThunderX CPU
* - ARMV8_THUNDERX2
- HOST
- ARMv8 Cavium ThunderX2 CPU
* - WSM
- HOST
- Intel Westmere CPU (SSE 4.2)
* - SNB
- HOST
- Intel Sandy/Ivy Bridge CPU (AVX 1)
* - HSW
- HOST
- Intel Haswell CPU (AVX 2)
* - BDW
- HOST
- Intel Broadwell Xeon E-class CPU (AVX 2 + transactional mem)
* - SKX
- HOST
- Intel Sky Lake Xeon E-class HPC CPU (AVX512 + transactional mem)
* - KNC
- HOST
- Intel Knights Corner Xeon Phi
* - KNL
- HOST
- Intel Knights Landing Xeon Phi
* - BGQ
- HOST
- IBM Blue Gene/Q CPU
* - POWER7
- HOST
- IBM POWER7 CPU
* - POWER8
- HOST
- IBM POWER8 CPU
* - POWER9
- HOST
- IBM POWER9 CPU
* - KEPLER30
- GPU
- NVIDIA Kepler generation CC 3.0 GPU
* - KEPLER32
- GPU
- NVIDIA Kepler generation CC 3.2 GPU
* - KEPLER35
- GPU
- NVIDIA Kepler generation CC 3.5 GPU
* - KEPLER37
- GPU
- NVIDIA Kepler generation CC 3.7 GPU
* - MAXWELL50
- GPU
- NVIDIA Maxwell generation CC 5.0 GPU
* - MAXWELL52
- GPU
- NVIDIA Maxwell generation CC 5.2 GPU
* - MAXWELL53
- GPU
- NVIDIA Maxwell generation CC 5.3 GPU
* - PASCAL60
- GPU
- NVIDIA Pascal generation CC 6.0 GPU
* - PASCAL61
- GPU
- NVIDIA Pascal generation CC 6.1 GPU
* - VOLTA70
- GPU
- NVIDIA Volta generation CC 7.0 GPU
* - VOLTA72
- GPU
- NVIDIA Volta generation CC 7.2 GPU
* - TURING75
- GPU
- NVIDIA Turing generation CC 7.5 GPU
* - VEGA900
- GPU
- AMD GPU MI25 GFX900
* - VEGA906
- GPU
- AMD GPU MI50/MI60 GFX906
Basic CMake build settings:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
For multicore CPUs using OpenMP, set these variables:
.. code-block:: bash
-D Kokkos_ARCH_HOSTARCH=yes # HOSTARCH = HOST from list above
-D Kokkos_ENABLE_OPENMP=yes
-D BUILD_OMP=yes
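Put together, a complete configuration command might look like the following minimal sketch, which assumes a Haswell-class host CPU (``HSW`` is only an example; substitute the HOST Arch-ID matching your machine):

.. code-block:: bash

# minimal sketch: OpenMP-only KOKKOS build for a Haswell-class host CPU
cmake ../cmake -D PKG_KOKKOS=yes -D Kokkos_ARCH_HSW=yes -D Kokkos_ENABLE_OPENMP=yes -D BUILD_OMP=yes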
@ -427,15 +464,19 @@ For NVIDIA GPUs using CUDA, set these variables:
.. code-block:: bash
-D Kokkos_ARCH_HOSTARCH=yes # HOSTARCH = HOST from list above
-D Kokkos_ARCH_GPUARCH=yes # GPUARCH = GPU from list above
-D Kokkos_ENABLE_CUDA=yes
-D Kokkos_ENABLE_OPENMP=yes
-D CMAKE_CXX_COMPILER=wrapper # wrapper = full path to Cuda nvcc wrapper
This will also enable executing FFTs on the GPU, either via the internal
KISSFFT library or - preferably - with the cuFFT library bundled with
the CUDA toolkit, depending on whether CMake can identify its location.
The *wrapper* value for the ``CMAKE_CXX_COMPILER`` variable is the path
to the CUDA nvcc compiler wrapper provided in the Kokkos library:
``lib/kokkos/bin/nvcc_wrapper``\ . The setting should include the full
path name to the wrapper, e.g.
.. code-block:: bash
@ -455,8 +496,8 @@ common packages enabled, you can do the following:
cmake -C ../cmake/presets/minimal.cmake -C ../cmake/presets/kokkos-cuda.cmake ../cmake
cmake --build .
Basic traditional make settings:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Choose which hardware to support in ``Makefile.machine`` via
``KOKKOS_DEVICES`` and ``KOKKOS_ARCH`` settings. See the
@ -467,7 +508,7 @@ For multicore CPUs using OpenMP:
.. code-block:: make
KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = HOSTARCH # HOSTARCH = HOST from list above
For Intel KNLs using OpenMP:
@ -481,7 +522,8 @@ For NVIDIA GPUs using CUDA:
.. code-block:: make
KOKKOS_DEVICES = Cuda
KOKKOS_ARCH = HOSTARCH,GPUARCH # HOSTARCH = HOST from list above that is hosting the GPU
                               # GPUARCH = GPU from list above
KOKKOS_CUDA_OPTIONS = "enable_lambda"
FFT_INC = -DFFT_CUFFT # enable use of cuFFT (optional)
FFT_LIB = -lcufft # link to cuFFT library
@ -504,6 +546,44 @@ C++ compiler for non-Kokkos, non-CUDA files.
KOKKOS_ABSOLUTE_PATH = $(shell cd $(KOKKOS_PATH); pwd)
CC = mpicxx -cxx=$(KOKKOS_ABSOLUTE_PATH)/config/nvcc_wrapper
Advanced KOKKOS compilation settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are other allowed options when building with the KOKKOS package
that can improve performance or assist in debugging or profiling. Below
are some examples that may be useful in combination with LAMMPS. For
the full list (which keeps changing as the Kokkos package itself evolves),
please consult the Kokkos library documentation.
As an alternative to multi-threading via OpenMP
(``-DKokkos_ENABLE_OPENMP=on`` or ``KOKKOS_DEVICES=OpenMP``) it is also
possible to use POSIX threads directly (``-DKokkos_ENABLE_PTHREAD=on``
or ``KOKKOS_DEVICES=Pthread``). While OpenMP manages the binding of
threads to individual CPU cores or groups of cores through environment
variables, the Pthread back end needs assistance from either the
"hwloc" or the "libnuma" library. To enable their use with CMake:
``-DKokkos_ENABLE_HWLOC=on`` or ``-DKokkos_ENABLE_LIBNUMA=on``; and with
conventional make: ``KOKKOS_USE_TPLS=hwloc`` or
``KOKKOS_USE_TPLS=libnuma``.
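For illustration, a CMake configuration enabling the Pthread back end together with hwloc-based thread binding could look like this sketch (architecture options omitted; add the ones matching your hardware):

.. code-block:: bash

# sketch: Pthread back end with hwloc support for thread binding
cmake ../cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_PTHREAD=on -D Kokkos_ENABLE_HWLOC=on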
The CMake option ``-DKokkos_ENABLE_LIBRT=on`` or the makefile setting
``KOKKOS_USE_TPLS=librt`` enables the use of a more accurate timer
mechanism on many Unix-like platforms for internal profiling.
The CMake option ``-DKokkos_ENABLE_DEBUG=on`` or the makefile setting
``KOKKOS_DEBUG=yes`` enables the printing of useful run-time debugging
information and also enables bounds checking on Kokkos data
structures. As is to be expected, enabling this option negatively
impacts performance and is thus only recommended when developing a
Kokkos-enabled style in LAMMPS.
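For example (a sketch; intended for development builds only):

.. code-block:: bash

# sketch: enable Kokkos debug mode and bounds checking via CMake
cmake ../cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_DEBUG=on
# conventional make build: set in Makefile.machine instead
#   KOKKOS_DEBUG = yes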
The CMake option ``-DKokkos_ENABLE_CUDA_UVM=on`` or the makefile
setting ``KOKKOS_CUDA_OPTIONS=enable_lambda,force_uvm`` enables the
use of CUDA "Unified Virtual Memory" in Kokkos. Please note that
the LAMMPS KOKKOS package must **always** be compiled with the
*enable_lambda* option when using GPUs.
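A possible way to set this is sketched below; it assumes that the CUDA-related settings shown earlier (GPU architecture and the nvcc wrapper compiler) are already in place:

.. code-block:: bash

# sketch: enable CUDA unified virtual memory (UVM) with CMake
cmake ../cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ENABLE_CUDA_UVM=yes
# conventional make: keep lambda support enabled alongside UVM in Makefile.machine
#   KOKKOS_CUDA_OPTIONS = "enable_lambda,force_uvm"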
----------
.. _latte:

View File

@ -9,10 +9,7 @@ different back end languages such as CUDA, OpenMP, or Pthreads. The
Kokkos library also provides data abstractions to adjust (at compile
time) the memory layout of data structures like 2d and 3d arrays to
optimize performance on different hardware. For more information on
Kokkos, see `GitHub <https://github.com/kokkos/kokkos>`_.
The LAMMPS KOKKOS package contains versions of pair, fix, and atom
styles that use data structures and macros provided by the Kokkos
@ -21,7 +18,7 @@ package was developed primarily by Christian Trott (Sandia) and Stan
Moore (Sandia) with contributions of various styles by others,
including Sikandar Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez
(Sandia). For more information on developing using Kokkos abstractions
see the Kokkos `Wiki <https://github.com/kokkos/kokkos/wiki>`_.
Kokkos currently provides support for 3 modes of execution (per MPI
task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP
@ -31,33 +28,30 @@ compatible with specific hardware.
.. note::
Kokkos support within LAMMPS must be built with a C++11 compatible
compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
Clang 3.5.2 or later is required.
.. note::
To build with Kokkos support for NVIDIA GPUs, the NVIDIA CUDA toolkit
software version 9.0 or later must be installed on your system. See
the discussion for the :doc:`GPU package <Speed_gpu>` for details of
how to check and do this.
.. note::
Kokkos with CUDA currently implicitly assumes that the MPI library is
CUDA-aware. This is not always the case, especially when using
pre-compiled MPI libraries provided by a Linux distribution. This is
not a problem when using only a single GPU with a single MPI rank.
When running with multiple MPI ranks, you may see segmentation faults
without CUDA-aware MPI support. These can be avoided by adding the
flags :doc:`-pk kokkos cuda/aware off <Run_options>` to the LAMMPS
command line or by using the command
:doc:`package kokkos cuda/aware off <package>` in the input file.
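For example, a run on two GPUs without CUDA-aware MPI support might be launched like this sketch (the executable and input file names are placeholders):

.. code-block:: bash

# sketch: disable the CUDA-aware MPI code path from the command line
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos cuda/aware off -in in.lj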
Building LAMMPS with the KOKKOS package
"""""""""""""""""""""""""""""""""""""""
See the :ref:`Build extras <kokkos>` doc page for instructions.
Running LAMMPS with the KOKKOS package
""""""""""""""""""""""""""""""""""""""
All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
@ -66,7 +60,8 @@ usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos. E.g. the mpirun command in OpenMPI does this via its -np and
-npernode switches. Ditto for MPICH via -np and -ppn.
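As a quick illustration (executable and input file names are placeholders), launching 8 MPI tasks spread over 2 nodes with OpenMPI could look like:

.. code-block:: bash

# sketch: 2 nodes x 4 MPI tasks per node, 4 OpenMP threads per task
mpirun -np 8 -npernode 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj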
Running on a multi-core CPU
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.
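The following sketch shows typical launch commands for a single 16-core node (the executable and input file names are placeholders):

.. code-block:: bash

# sketch: one 16-core node
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj   # 16 MPI tasks, no threading
mpirun -np 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj     # 4 MPI tasks x 4 OpenMP threads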
@ -142,7 +137,8 @@ atom. When using the Kokkos Serial back end or the OpenMP back end with
a single thread, no duplication or atomic operations are used. For CUDA
and half neighbor lists, the KOKKOS package always uses atomic operations.
CPU Cores, Sockets and Thread Affinity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When using multi-threading, it is important for performance to bind
both MPI tasks to physical cores, and threads to physical cores, so
@ -156,15 +152,16 @@ for your MPI installation), binding can be forced with these flags:
OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ...
For binding threads with KOKKOS OpenMP, use thread affinity environment
variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12
or later) setting the environment variable ``OMP_PROC_BIND=true`` should
be sufficient. In general, for best performance with OpenMP 4.0 or later
set ``OMP_PROC_BIND=spread`` and ``OMP_PLACES=threads``. For binding
threads with the KOKKOS pthreads option, compile LAMMPS with hwloc or
libnuma support enabled as described on the :ref:`extra build options page <kokkos>`.
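For example (a sketch assuming a bash-like shell, the OpenMP back end, and a placeholder executable name):

.. code-block:: bash

# sketch: bind OpenMP threads to cores before launching LAMMPS
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
mpirun -np 2 --bind-to socket --map-by socket lmp_kokkos_omp -k on t 8 -sf kk -in in.lj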
Running on Knights Landing (KNL) Intel Xeon Phi
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here is a quick overview of how to use the KOKKOS package for the
Intel Knights Landing (KNL) Xeon Phi:
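A typical launch on a single KNL node might look like this sketch (executable and input file names are placeholders):

.. code-block:: bash

# sketch: 64 MPI tasks with 4 hardware threads each on one KNL node
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj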
@ -222,7 +219,8 @@ threads/task as Nt. The product of these two values should be N, i.e.
them in "native" mode, not "offload" mode like the USER-INTEL package
supports.
Running on GPUs
^^^^^^^^^^^^^^^
Use the "-k" :doc:`command-line switch <Run_options>` to specify the
number of GPUs per node. Typically the -np setting of the mpirun command
@ -257,7 +255,7 @@ one or more nodes, each with two GPUs:
running on GPUs is to use "full" neighbor lists and set the Newton flag
to "off" for both pairwise and bonded interactions, along with threaded
communication. When running on Maxwell or Kepler GPUs, this will
typically be best. For Pascal GPUs and beyond, using "half" neighbor lists and
setting the Newton flag to "on" may be faster. For many pair styles,
setting the neighbor binsize equal to twice the CPU default value will
give speedup, which is the default when running on GPUs. Use the "-pk
@ -270,13 +268,6 @@ one or more nodes, each with two GPUs:
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighbor list, set binsize = neighbor ghost cutoff
.. note::
When using a GPU, you will achieve the best performance if your
@ -293,7 +284,8 @@ one or more nodes, each with two GPUs:
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
However, this will reduce performance and is not recommended for production runs.
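For example (a sketch for a bash-like shell; use only for debugging or profiling runs):

.. code-block:: bash

# sketch: serialize kernel launches to get meaningful per-kernel timings
export CUDA_LAUNCH_BLOCKING=1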
Run with the KOKKOS package by editing an input script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Alternatively, the effect of the "-sf" or "-pk" switches can be
duplicated by adding the :doc:`package kokkos <package>` or :doc:`suffix kk <suffix>` commands to your input script.
@ -316,17 +308,24 @@ You only need to use the :doc:`package kokkos <package>` command if you
wish to change any of its option defaults, as set by the "-k on"
:doc:`command-line switch <Run_options>`.
**Using OpenMP threading and CUDA together:**
With the KOKKOS package, both OpenMP multi-threading and GPUs can be
compiled and used together in a few special cases. In the makefile for
the conventional build, the KOKKOS_DEVICES variable must include both
"Cuda" and "OpenMP", as is the case for ``/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi``.
.. code-block:: bash
KOKKOS_DEVICES=Cuda,OpenMP
When building with CMake, you need to enable both features, as is done
in the ``kokkos-cuda.cmake`` CMake preset file.
.. code-block:: bash
cmake ../cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes
The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using the "-sf kk" in the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
@ -360,7 +359,8 @@ suffix for kspace and bonds, angles, etc. in the input file and the
sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
so CPU/GPU overlap can occur.
Performance to expect
"""""""""""""""""""""
The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
@ -377,52 +377,26 @@ Generally speaking, the following rules of thumb apply:
performance of a KOKKOS style is a bit slower than the USER-OMP
package.
* When running a large number of atoms per GPU, KOKKOS is typically faster
than the GPU package when compiled for double precision. The benefit
of using single or mixed precision with the GPU package depends
significantly on the hardware in use and the simulated system and pair
style.
* When running on Intel hardware, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for x86 hardware (not just
from Intel) and compilation with the Intel compilers. The USER-INTEL
package can also increase the vector length of vector instructions
by switching to single or mixed precision mode.
See the `Benchmark page <https://lammps.sandia.gov/bench.html>`_ of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.
Advanced Kokkos options
"""""""""""""""""""""""
There are other allowed options when building with the KOKKOS package
that can improve performance or assist in debugging or profiling.
They are explained on the :ref:`KOKKOS section of the build extras <kokkos>` doc page.
Restrictions
""""""""""""