Merge pull request #2004 from stanmoore1/kk_3.1
Update Kokkos library in LAMMPS to v3.1
@ -320,11 +320,12 @@ to have an executable that will run on this and newer architectures.

.. note::

   NVIDIA GPUs with CC 5.0 (Maxwell) and newer are not compatible with
   CC 3.x (Kepler). If you run Kokkos on a newer architecture than what
   LAMMPS was compiled with, there will be a significant delay during
   device initialization since the just-in-time compiler has to
   recompile the GPU kernel code for the new hardware.
   If you run Kokkos on a newer GPU architecture than what LAMMPS was
   compiled with, there will be a delay during device initialization
   since the just-in-time compiler has to recompile all GPU kernels
   for the new hardware. This is, however, not possible when LAMMPS was
   compiled for NVIDIA GPUs with CC 3.x (Kepler) and is run on GPUs with
   CC 5.0 (Maxwell) or newer, as the two are not compatible.

The settings discussed below have been tested with LAMMPS and are
confirmed to work. Kokkos is an active project with ongoing improvements
@ -343,73 +344,109 @@ be specified in uppercase.
   :widths: auto

   * - **Arch-ID**
     - **HOST or GPU**
     - **Description**
   * - AMDAVX
     - HOST
     - AMD 64-bit x86 CPU (AVX 1)
   * - EPYC
     - HOST
     - AMD EPYC Zen class CPU (AVX 2)
   * - ARMV80
     - HOST
     - ARMv8.0 Compatible CPU
   * - ARMV81
     - HOST
     - ARMv8.1 Compatible CPU
   * - ARMV8_THUNDERX
     - HOST
     - ARMv8 Cavium ThunderX CPU
   * - ARMV8_THUNDERX2
     - HOST
     - ARMv8 Cavium ThunderX2 CPU
   * - WSM
     - HOST
     - Intel Westmere CPU (SSE 4.2)
   * - SNB
     - HOST
     - Intel Sandy/Ivy Bridge CPU (AVX 1)
   * - HSW
     - HOST
     - Intel Haswell CPU (AVX 2)
   * - BDW
     - HOST
     - Intel Broadwell Xeon E-class CPU (AVX 2 + transactional mem)
   * - SKX
     - HOST
     - Intel Sky Lake Xeon E-class HPC CPU (AVX512 + transactional mem)
   * - KNC
     - HOST
     - Intel Knights Corner Xeon Phi
   * - KNL
     - HOST
     - Intel Knights Landing Xeon Phi
   * - BGQ
     - HOST
     - IBM Blue Gene/Q CPU
   * - POWER7
     - HOST
     - IBM POWER7 CPU
   * - POWER8
     - HOST
     - IBM POWER8 CPU
   * - POWER9
     - HOST
     - IBM POWER9 CPU
   * - KEPLER30
     - GPU
     - NVIDIA Kepler generation CC 3.0 GPU
   * - KEPLER32
     - GPU
     - NVIDIA Kepler generation CC 3.2 GPU
   * - KEPLER35
     - GPU
     - NVIDIA Kepler generation CC 3.5 GPU
   * - KEPLER37
     - GPU
     - NVIDIA Kepler generation CC 3.7 GPU
   * - MAXWELL50
     - GPU
     - NVIDIA Maxwell generation CC 5.0 GPU
   * - MAXWELL52
     - GPU
     - NVIDIA Maxwell generation CC 5.2 GPU
   * - MAXWELL53
     - GPU
     - NVIDIA Maxwell generation CC 5.3 GPU
   * - PASCAL60
     - GPU
     - NVIDIA Pascal generation CC 6.0 GPU
   * - PASCAL61
     - GPU
     - NVIDIA Pascal generation CC 6.1 GPU
   * - VOLTA70
     - GPU
     - NVIDIA Volta generation CC 7.0 GPU
   * - VOLTA72
     - GPU
     - NVIDIA Volta generation CC 7.2 GPU
   * - TURING75
     - GPU
     - NVIDIA Turing generation CC 7.5 GPU
   * - VEGA900
     - GPU
     - AMD GPU MI25 GFX900
   * - VEGA906
     - GPU
     - AMD GPU MI50/MI60 GFX906

CMake build settings:
^^^^^^^^^^^^^^^^^^^^^
Basic CMake build settings:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
For multicore CPUs using OpenMP, set these variables:

.. code-block:: bash

   -D Kokkos_ARCH_CPUARCH=yes   # CPUARCH = CPU from list above
   -D Kokkos_ARCH_HOSTARCH=yes  # HOSTARCH = HOST from list above
   -D Kokkos_ENABLE_OPENMP=yes
   -D BUILD_OMP=yes

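As a minimal sketch of how the OpenMP settings above might be combined
into a single configure command, the snippet below assumes a Haswell
host (``HSW`` from the table above), a build directory next to the
``cmake`` folder, and that the package itself is enabled with the usual
``-D PKG_KOKKOS=yes`` switch. The shell variables ``HOST_ARCH`` and
``CMAKE_ARGS`` are illustrative names only, not LAMMPS or Kokkos settings.

.. code-block:: bash

   # Illustrative variable: pick the Arch-ID that matches your CPU.
   HOST_ARCH=HSW
   # Collect the flags from the list above into one string.
   CMAKE_ARGS="-D PKG_KOKKOS=yes -D Kokkos_ARCH_${HOST_ARCH}=yes -D Kokkos_ENABLE_OPENMP=yes -D BUILD_OMP=yes"
   # Print the command for review before running it from the build directory.
   echo "cmake ${CMAKE_ARGS} ../cmake"

Printing the command first makes it easy to check the substituted
architecture flag before starting a long configure and build.
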
@ -427,15 +464,19 @@ For NVIDIA GPUs using CUDA, set these variables:

.. code-block:: bash

   -D Kokkos_ARCH_CPUARCH=yes    # CPUARCH = CPU from list above
   -D Kokkos_ARCH_HOSTARCH=yes   # HOSTARCH = HOST from list above
   -D Kokkos_ARCH_GPUARCH=yes    # GPUARCH = GPU from list above
   -D Kokkos_ENABLE_CUDA=yes
   -D Kokkos_ENABLE_OPENMP=yes
   -D CMAKE_CXX_COMPILER=wrapper # wrapper = full path to Cuda nvcc wrapper

The wrapper value is the Cuda nvcc compiler wrapper provided in the
Kokkos library: ``lib/kokkos/bin/nvcc_wrapper``\ . The setting should
include the full path name to the wrapper, e.g.
This will also enable executing FFTs on the GPU, either via the internal
KISSFFT library, or - by preference - with the cuFFT library bundled
with the CUDA toolkit, depending on whether CMake can identify its
location. The *wrapper* value for ``CMAKE_CXX_COMPILER`` variable is
the path to the CUDA nvcc compiler wrapper provided in the Kokkos
library: ``lib/kokkos/bin/nvcc_wrapper``\ . The setting should include
the full path name to the wrapper, e.g.

.. code-block:: bash

@ -455,8 +496,8 @@ common packages enabled, you can do the following:
cmake -C ../cmake/presets/minimal.cmake -C ../cmake/presets/kokkos-cuda.cmake ../cmake
cmake --build .

Traditional make settings:
^^^^^^^^^^^^^^^^^^^^^^^^^^
Basic traditional make settings:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Choose which hardware to support in ``Makefile.machine`` via
``KOKKOS_DEVICES`` and ``KOKKOS_ARCH`` settings. See the
@ -467,7 +508,7 @@ For multicore CPUs using OpenMP:
.. code-block:: make

   KOKKOS_DEVICES = OpenMP
   KOKKOS_ARCH = CPUARCH   # CPUARCH = CPU from list above
   KOKKOS_ARCH = HOSTARCH  # HOSTARCH = HOST from list above

For Intel KNLs using OpenMP:

@ -481,7 +522,8 @@ For NVIDIA GPUs using CUDA:
.. code-block:: make

   KOKKOS_DEVICES = Cuda
   KOKKOS_ARCH = CPUARCH,GPUARCH   # CPUARCH = CPU from list above that is hosting the GPU
   KOKKOS_ARCH = HOSTARCH,GPUARCH  # HOSTARCH = HOST from list above that is hosting the GPU
   KOKKOS_CUDA_OPTIONS = "enable_lambda"
                                   # GPUARCH = GPU from list above
   FFT_INC = -DFFT_CUFFT           # enable use of cuFFT (optional)
   FFT_LIB = -lcufft               # link to cuFFT library
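
The older text of this documentation states that these variables can
also be given on the make command line instead of being edited into
``Makefile.machine``. A minimal sketch of that route, assuming a Haswell
host with a Volta GPU and the ``kokkos_cuda_mpi`` machine makefile
mentioned elsewhere in this commit; ``HOST_ARCH``, ``GPU_ARCH`` and
``MAKE_CMD`` are illustrative shell variables, not LAMMPS settings:

.. code-block:: bash

   # Illustrative values: replace with the Arch-IDs of your own hardware.
   HOST_ARCH=HSW
   GPU_ARCH=VOLTA70
   # Host and GPU architectures are passed as one comma-separated value.
   MAKE_CMD="make kokkos_cuda_mpi KOKKOS_ARCH=${HOST_ARCH},${GPU_ARCH}"
   # Print the command; it would be run from the LAMMPS src directory.
   echo "${MAKE_CMD}"
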
@ -504,6 +546,44 @@ C++ compiler for non-Kokkos, non-CUDA files.
KOKKOS_ABSOLUTE_PATH = $(shell cd $(KOKKOS_PATH); pwd)
CC = mpicxx -cxx=$(KOKKOS_ABSOLUTE_PATH)/config/nvcc_wrapper


Advanced KOKKOS compilation settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are other allowed options when building with the KOKKOS package
that can improve performance or assist in debugging or profiling. Below
are some examples that may be useful in combination with LAMMPS. For
the full list (which keeps changing as the Kokkos package itself evolves),
please consult the Kokkos library documentation.

As an alternative to using multi-threading via OpenMP
(``-DKokkos_ENABLE_OPENMP=on`` or ``KOKKOS_DEVICES=OpenMP``) it is also
possible to use POSIX threads directly (``-DKokkos_ENABLE_PTHREAD=on``
or ``KOKKOS_DEVICES=Pthread``). While binding of threads to individual
or groups of CPU cores is managed in OpenMP with environment variables,
you need assistance from either the "hwloc" or "libnuma" library for the
Pthread thread parallelization option. To enable use with CMake:
``-DKokkos_ENABLE_HWLOC=on`` or ``-DKokkos_ENABLE_LIBNUMA=on``; and with
conventional make: ``KOKKOS_USE_TPLS=hwloc`` or
``KOKKOS_USE_TPLS=libnuma``.

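A minimal sketch of a CMake configure line for the Pthread variant with
hwloc-based thread binding, using only the flags named in the paragraph
above plus the assumed ``-D PKG_KOKKOS=yes`` package switch;
``THREAD_ARGS`` is an illustrative shell variable:

.. code-block:: bash

   # POSIX threads back end plus hwloc for binding threads to cores.
   THREAD_ARGS="-DKokkos_ENABLE_PTHREAD=on -DKokkos_ENABLE_HWLOC=on"
   # Print the command for review; run it from a build directory next to cmake/.
   echo "cmake -D PKG_KOKKOS=yes ${THREAD_ARGS} ../cmake"

Swapping ``-DKokkos_ENABLE_HWLOC=on`` for ``-DKokkos_ENABLE_LIBNUMA=on``
would select libnuma instead, as described above.
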
The CMake option ``-DKokkos_ENABLE_LIBRT=on`` or the makefile setting
``KOKKOS_USE_TPLS=librt`` enables the use of a more accurate timer
mechanism on many Unix-like platforms for internal profiling.

The CMake option ``-DKokkos_ENABLE_DEBUG=on`` or the makefile setting
``KOKKOS_DEBUG=yes`` enables printing of run-time
debugging information that can be useful. It also enables runtime
bounds checking on Kokkos data structures. As expected, enabling
this option will negatively impact performance and is thus only
recommended when developing a Kokkos-enabled style in LAMMPS.

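For development work, the debug flag could be combined with a CMake
debug build type. This is only a sketch: ``CMAKE_BUILD_TYPE=Debug`` is a
standard CMake setting added here as an assumption, ``-D PKG_KOKKOS=yes``
is assumed to enable the package, and ``DEBUG_ARGS`` is an illustrative
shell variable.

.. code-block:: bash

   # Kokkos run-time debug output and bounds checking (slow; development only).
   DEBUG_ARGS="-DKokkos_ENABLE_DEBUG=on"
   # Print the command for review; run it from a separate debug build directory.
   echo "cmake -D PKG_KOKKOS=yes -D CMAKE_BUILD_TYPE=Debug ${DEBUG_ARGS} ../cmake"
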
The CMake option ``-DKokkos_ENABLE_CUDA_UVM=on`` or the makefile
setting ``KOKKOS_CUDA_OPTIONS=enable_lambda,force_uvm`` enables the
use of CUDA "Unified Virtual Memory" in Kokkos. Please note that
the LAMMPS KOKKOS package must **always** be compiled with the
*enable_lambda* option when using GPUs.

----------

.. _latte:

@ -9,10 +9,7 @@ different back end languages such as CUDA, OpenMP, or Pthreads. The
Kokkos library also provides data abstractions to adjust (at compile
time) the memory layout of data structures like 2d and 3d arrays to
optimize performance on different hardware. For more information on
Kokkos, see `GitHub <https://github.com/kokkos/kokkos>`_. Kokkos is
part of `Trilinos <https://www.trilinos.org/>`_. The Kokkos
library was written primarily by Carter Edwards, Christian Trott, and
Dan Sunderland (all Sandia).
Kokkos, see `GitHub <https://github.com/kokkos/kokkos>`_.

The LAMMPS KOKKOS package contains versions of pair, fix, and atom
styles that use data structures and macros provided by the Kokkos
@ -21,7 +18,7 @@ package was developed primarily by Christian Trott (Sandia) and Stan
Moore (Sandia) with contributions of various styles by others,
including Sikandar Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez
(Sandia). For more information on developing using Kokkos abstractions
see the Kokkos programmers' guide at /lib/kokkos/doc/Kokkos_PG.pdf.
see the Kokkos `Wiki <https://github.com/kokkos/kokkos/wiki>`_.

Kokkos currently provides support for 3 modes of execution (per MPI
task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP
@ -31,33 +28,30 @@ compatible with specific hardware.

.. note::

   Kokkos support within LAMMPS must be built with a C++11 compatible
   compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
   Clang 3.5.2 or later is required.

.. note::

   To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA
   To build with Kokkos support for NVIDIA GPUs, the NVIDIA CUDA toolkit
   software version 9.0 or later must be installed on your system. See
   the discussion for the :doc:`GPU package <Speed_gpu>` for details of how
   to check and do this.
   the discussion for the :doc:`GPU package <Speed_gpu>` for details of
   how to check and do this.

.. note::

   Kokkos with CUDA currently implicitly assumes that the MPI library
   is CUDA-aware. This is not always the case, especially when using
   pre-compiled MPI libraries provided by a Linux distribution. This is not
   a problem when using only a single GPU with a single MPI rank. When
   running with multiple MPI ranks, you may see segmentation faults without
   CUDA-aware MPI support. These can be avoided by adding the flags :doc:`-pk kokkos cuda/aware off <Run_options>` to the LAMMPS command line or by
   using the command :doc:`package kokkos cuda/aware off <package>` in the
   input file.
   Kokkos with CUDA currently implicitly assumes that the MPI library is
   CUDA-aware. This is not always the case, especially when using
   pre-compiled MPI libraries provided by a Linux distribution. This is
   not a problem when using only a single GPU with a single MPI
   rank. When running with multiple MPI ranks, you may see segmentation
   faults without CUDA-aware MPI support. These can be avoided by adding
   the flags :doc:`-pk kokkos cuda/aware off <Run_options>` to the
   LAMMPS command line or by using the command :doc:`package kokkos
   cuda/aware off <package>` in the input file.

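As a sketch of what the command-line variant might look like, the
snippet below adds the ``cuda/aware off`` setting to a two-rank,
two-GPU run. The executable name ``lmp_kokkos_cuda_openmpi`` and the
input file ``in.lj`` are taken from the GPU example later on this page
and are placeholders for your own build and input; ``LMP`` and
``RUN_CMD`` are illustrative shell variables.

.. code-block:: bash

   # Placeholder executable name; substitute the binary you built.
   LMP=lmp_kokkos_cuda_openmpi
   # Two MPI ranks, two GPUs per node, CUDA-aware MPI assumption turned off.
   RUN_CMD="mpirun -np 2 ${LMP} -k on g 2 -sf kk -pk kokkos cuda/aware off -in in.lj"
   # Print the command for review before launching it.
   echo "${RUN_CMD}"
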
**Building LAMMPS with the KOKKOS package:**
Building LAMMPS with the KOKKOS package
"""""""""""""""""""""""""""""""""""""""

See the :ref:`Build extras <kokkos>` doc page for instructions.

**Running LAMMPS with the KOKKOS package:**
Running LAMMPS with the KOKKOS package
""""""""""""""""""""""""""""""""""""""

All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
@ -66,7 +60,8 @@ usual manner via the mpirun or mpiexec commands, and is independent of
Kokkos. E.g. the mpirun command in OpenMPI does this via its -np and
-npernode switches. Ditto for MPICH via -np and -ppn.

**Running on a multi-core CPU:**
Running on a multi-core CPU
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.
@ -142,7 +137,8 @@ atom. When using the Kokkos Serial back end or the OpenMP back end with
a single thread, no duplication or atomic operations are used. For CUDA
and half neighbor lists, the KOKKOS package always uses atomic operations.

**Core and Thread Affinity:**
CPU Cores, Sockets and Thread Affinity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When using multi-threading, it is important for performance to bind
both MPI tasks to physical cores, and threads to physical cores, so
@ -156,15 +152,16 @@ for your MPI installation), binding can be forced with these flags:
OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ...

For binding threads with KOKKOS OpenMP, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. In general, for best
performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and
OMP_PLACES=threads. For binding threads with the KOKKOS pthreads
option, compile LAMMPS with the KOKKOS HWLOC=yes option as described below.
For binding threads with KOKKOS OpenMP, use thread affinity environment
variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12
or later) setting the environment variable ``OMP_PROC_BIND=true`` should
be sufficient. In general, for best performance with OpenMP 4.0 or later
set ``OMP_PROC_BIND=spread`` and ``OMP_PLACES=threads``. For binding
threads with the KOKKOS pthreads option, compile LAMMPS with the hwloc
or libnuma support enabled as described in the :ref:`extra build options page <kokkos>`.

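A minimal sketch that puts the OpenMP 4.0 settings and the OpenMPI
binding flags from above together. The thread count of 8 per MPI rank
and the input file ``in.lj`` are assumptions for illustration, and
``RUN_CMD`` is an illustrative shell variable.

.. code-block:: bash

   # OpenMP 4.0 thread affinity settings recommended above.
   export OMP_PROC_BIND=spread
   export OMP_PLACES=threads
   # Assumed layout: 2 MPI ranks bound to sockets, 8 OpenMP threads per rank.
   RUN_CMD="mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi -k on t 8 -sf kk -in in.lj"
   # Print the command for review before launching it.
   echo "${RUN_CMD}"
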
**Running on Knight's Landing (KNL) Intel Xeon Phi:**
Running on Knight's Landing (KNL) Intel Xeon Phi
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is a quick overview of how to use the KOKKOS package for the
Intel Knight's Landing (KNL) Xeon Phi:
@ -222,7 +219,8 @@ threads/task as Nt. The product of these two values should be N, i.e.
them in "native" mode, not "offload" mode like the USER-INTEL package
supports.

**Running on GPUs:**
Running on GPUs
^^^^^^^^^^^^^^^

Use the "-k" :doc:`command-line switch <Run_options>` to specify the
number of GPUs per node. Typically the -np setting of the mpirun command
@ -257,7 +255,7 @@ one or more nodes, each with two GPUs:
running on GPUs is to use "full" neighbor lists and set the Newton flag
to "off" for both pairwise and bonded interactions, along with threaded
communication. When running on Maxwell or Kepler GPUs, this will
typically be best. For Pascal GPUs, using "half" neighbor lists and
typically be best. For Pascal GPUs and beyond, using "half" neighbor lists and
setting the Newton flag to "on" may be faster. For many pair styles,
setting the neighbor binsize equal to twice the CPU default value will
give speedup, which is the default when running on GPUs. Use the "-pk
@ -270,13 +268,6 @@ one or more nodes, each with two GPUs:

mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighbor list, set binsize = neighbor ghost cutoff

.. note::

   For good performance of the KOKKOS package on GPUs, you must
   have Kepler generation GPUs (or later). The Kokkos library exploits
   texture cache options not supported by Tesla generation GPUs (or
   older).

.. note::

   When using a GPU, you will achieve the best performance if your
@ -293,7 +284,8 @@ one or more nodes, each with two GPUs:
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
However, this will reduce performance and is not recommended for production runs.

**Run with the KOKKOS package by editing an input script:**
Run with the KOKKOS package by editing an input script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Alternatively the effect of the "-sf" or "-pk" switches can be
duplicated by adding the :doc:`package kokkos <package>` or :doc:`suffix kk <suffix>` commands to your input script.
@ -316,17 +308,24 @@ You only need to use the :doc:`package kokkos <package>` command if you
wish to change any of its option defaults, as set by the "-k on"
:doc:`command-line switch <Run_options>`.

**Using OpenMP threading and CUDA together (experimental):**
**Using OpenMP threading and CUDA together:**

With the KOKKOS package, both OpenMP multi-threading and GPUs can be
used together in a few special cases. In the Makefile, the
KOKKOS_DEVICES variable must include both "Cuda" and "OpenMP", as is
the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi
compiled and used together in a few special cases. In the makefile for
the conventional build, the KOKKOS_DEVICES variable must include both
"Cuda" and "OpenMP", as is the case for ``/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi``.

.. code-block:: bash

   KOKKOS_DEVICES=Cuda,OpenMP

When building with CMake you need to enable both features, as is done
in the ``kokkos-cuda.cmake`` CMake preset file.

.. code-block:: bash

   cmake ../cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes

The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using the "-sf kk" in the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
@ -360,7 +359,8 @@ suffix for kspace and bonds, angles, etc. in the input file and the
sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
so CPU/GPU overlap can occur.

**Speed-ups to expect:**
Performance to expect
"""""""""""""""""""""

The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
@ -377,52 +377,26 @@ Generally speaking, the following rules of thumb apply:
performance of a KOKKOS style is a bit slower than the USER-OMP
package.
* When running a large number of atoms per GPU, KOKKOS is typically faster
  than the GPU package.
  than the GPU package when compiled for double precision. The benefit
  of using single or mixed precision with the GPU package depends
  significantly on the hardware in use and the simulated system and pair
  style.
* When running on Intel hardware, KOKKOS is not as fast as
  the USER-INTEL package, which is optimized for that hardware.
  the USER-INTEL package, which is optimized for x86 hardware (not just
  from Intel) and compilation with the Intel compilers. The USER-INTEL
  package also can increase the vector length of vector instructions
  by switching to single or mixed precision mode.

See the `Benchmark page <https://lammps.sandia.gov/bench.html>`_ of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.

**Advanced Kokkos options:**
Advanced Kokkos options
"""""""""""""""""""""""

There are other allowed options when building with the KOKKOS package.
As explained on the :ref:`Build extras <kokkos>` doc page,
they can be set either as variables on the make command line or in
Makefile.machine, or they can be specified as CMake variables. Each
takes a value shown below. The default value is listed, which is set
in the lib/kokkos/Makefile.kokkos file.

* KOKKOS_DEBUG, values = *yes*\ , *no*\ , default = *no*
* KOKKOS_USE_TPLS, values = *hwloc*\ , *librt*\ , *experimental_memkind*, default = *none*
* KOKKOS_CXX_STANDARD, values = *c++11*\ , *c++1z*\ , default = *c++11*
* KOKKOS_OPTIONS, values = *aggressive_vectorization*, *disable_profiling*, default = *none*
* KOKKOS_CUDA_OPTIONS, values = *force_uvm*, *use_ldg*, *rdc*\ , *enable_lambda*, default = *enable_lambda*

KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
provides alternative methods via environment variables for binding
threads to hardware cores. More info on binding threads to cores is
given on the :doc:`Speed omp <Speed_omp>` doc page.

KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
on most Unix platforms. This library is not available on all
platforms.

KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
debugging information that can be useful. It also enables runtime
bounds checking on Kokkos data structures.

KOKKOS_CXX_STANDARD and KOKKOS_OPTIONS are typically not changed when
building LAMMPS.

KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS
package must be compiled with the *enable_lambda* option when using
GPUs.
There are other allowed options when building with the KOKKOS package
that can improve performance or assist in debugging or profiling.
They are explained on the :ref:`KOKKOS section of the build extras <kokkos>` doc page.

Restrictions
""""""""""""
