KOKKOS package
==============

Kokkos is a templated C++ library that provides abstractions to allow
a single implementation of an application kernel (e.g. a pair style)
to run efficiently on different kinds of hardware, such as GPUs, Intel
Xeon Phis, or many-core CPUs. Kokkos maps the C++ kernel onto
different back end languages such as CUDA, OpenMP, or Pthreads. The
Kokkos library also provides data abstractions to adjust (at compile
time) the memory layout of data structures like 2d and 3d arrays to
optimize performance on different hardware. For more information on
Kokkos, see `the Kokkos GitHub page <https://github.com/kokkos/kokkos>`_.

The LAMMPS KOKKOS package contains versions of pair, fix, and atom
styles that use data structures and macros provided by the Kokkos
library, which is included with LAMMPS in /lib/kokkos. The KOKKOS
package was developed primarily by Christian Trott (Sandia) and Stan
Moore (Sandia) with contributions of various styles by others,
including Sikandar Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez
(Sandia). For more information on developing using Kokkos abstractions
see the `Kokkos Wiki <https://github.com/kokkos/kokkos/wiki>`_.

.. note::

   The Kokkos library is under active development, tracking the
   availability of accelerator hardware, and so is the KOKKOS package
   in LAMMPS. This means that only a certain range of Kokkos library
   versions is compatible with a certain range of LAMMPS versions.
   For that reason LAMMPS comes with a bundled version of the Kokkos
   library that has been validated on multiple platforms and may
   contain selected back-ported bug fixes from upstream Kokkos
   versions. While it is possible to build LAMMPS with an external
   version of Kokkos, this is untested and may result in incorrect
   execution or crashes.

Kokkos currently provides full support for 4 modes of execution (per
MPI task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP
(threading for many-core CPUs and Intel Phi), CUDA (for NVIDIA GPUs),
and HIP (for AMD GPUs). Additional modes (e.g. OpenMP target, Intel
data center GPUs) are under development. You choose the mode at build
time to produce an executable compatible with specific hardware.

The following compatibility notes were last updated for LAMMPS
version 23 November 2023 and Kokkos version 4.2.

.. admonition:: C++17 support
   :class: note

   Kokkos requires a compiler that supports the C++17 standard. For
   some compilers, it may be necessary to add a flag to enable C++17
   support; for example, the GNU compiler uses the ``-std=c++17``
   flag. For a list of compilers that have been tested with the Kokkos
   library, see the `requirements document of the Kokkos Wiki
   <https://kokkos.github.io/kokkos-core-wiki/requirements.html>`_.

.. admonition:: NVIDIA CUDA support
   :class: note

   To build with Kokkos support for NVIDIA GPUs, the NVIDIA CUDA
   toolkit software version 11.0 or later must be installed on your
   system. See the discussion for the :doc:`GPU package <Speed_gpu>`
   for details of how to check and do this.

.. admonition:: AMD ROCm (HIP) support
   :class: note

   To build with Kokkos support for AMD GPUs, the AMD ROCm toolkit
   software version 5.2.0 or later must be installed on your system.

.. admonition:: Intel Data Center GPU support
   :class: note

   Support for Kokkos with Intel Data Center GPU accelerators
   (formerly known under the code name "Ponte Vecchio") in LAMMPS is
   still a work in progress. Only a subset of the functionality works
   correctly. Please contact the LAMMPS developers if you run into
   problems.

.. admonition:: CUDA and MPI library compatibility
   :class: note

   Kokkos with CUDA currently implicitly assumes that the MPI library
   is GPU-aware. This is not always the case, especially when using
   pre-compiled MPI libraries provided by a Linux distribution. This
   is not a problem when using only a single GPU with a single MPI
   rank. When running with multiple MPI ranks, however, you may see
   segmentation faults without GPU-aware MPI support. These can be
   avoided by adding the flags :doc:`-pk kokkos gpu/aware off
   <Run_options>` to the LAMMPS command line or by using the command
   :doc:`package kokkos gpu/aware off <package>` in the input file.

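For example, here is a minimal sketch of a run with the GPU-aware MPI
assumption turned off from the command line; the executable and input
file names are placeholders taken from the examples further down this
page.

.. code-block:: bash

   # 2 MPI ranks, 2 GPUs, but tell KOKKOS not to assume GPU-aware MPI
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos gpu/aware off -in in.lj
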
.. admonition:: Using multiple MPI ranks per GPU
   :class: note

   Unlike with the GPU package, there are limited benefits from using
   multiple MPI processes per GPU with KOKKOS. When doing so, however,
   it is **required** to enable CUDA MPS (`Multi-Process Service :: GPU
   Deployment and Management Documentation
   <https://docs.nvidia.com/deploy/mps/index.html>`_) to get acceptable
   performance.

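As an illustration only, one common way to wrap a run that
oversubscribes a single GPU with the MPS control daemon is sketched
below; exact procedures differ between clusters and batch systems, so
treat this as a sketch rather than a recipe.

.. code-block:: bash

   # start the CUDA MPS control daemon (clusters may require a job prolog instead)
   nvidia-cuda-mps-control -d

   # 4 MPI ranks sharing 1 GPU
   mpirun -np 4 lmp_kokkos_cuda_openmpi -k on g 1 -sf kk -in in.lj

   # shut the daemon down again
   echo quit | nvidia-cuda-mps-control
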
Building LAMMPS with the KOKKOS package
"""""""""""""""""""""""""""""""""""""""

See the :ref:`Build extras <kokkos>` page for instructions.

Running LAMMPS with the KOKKOS package
""""""""""""""""""""""""""""""""""""""

All Kokkos operations occur within the context of an individual MPI
task running on a single node of the machine. The total number of MPI
tasks used by LAMMPS (one or multiple per compute node) is set in the
usual manner via the ``mpirun`` or ``mpiexec`` commands, and is
independent of Kokkos. E.g. the mpirun command in OpenMPI does this
via its ``-np`` and ``-npernode`` switches. Ditto for MPICH via
``-np`` and ``-ppn``.

Running on a multicore CPU
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.

.. code-block:: bash

   # 1 node, 16 MPI tasks/node, no multi-threading
   mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj

   # 2 nodes, 1 MPI task/node, 16 threads/task
   mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj

   # 1 node, 2 MPI tasks/node, 8 threads/task
   mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj

   # 8 nodes, 4 MPI tasks/node, 4 threads/task
   mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj

To run using the KOKKOS package, use the ``-k on``, ``-sf kk`` and
``-pk kokkos`` :doc:`command-line switches <Run_options>` in your
``mpirun`` command. You must use the ``-k on`` :doc:`command-line
switch <Run_options>` to enable the KOKKOS package. It takes
additional arguments for hardware settings appropriate to your
system. For OpenMP use:

.. parsed-literal::

   -k on t Nt

The ``t Nt`` option specifies how many OpenMP threads per MPI task to
use on a node. The default is ``Nt`` = 1, which is MPI-only mode. Note
that the product of MPI tasks \* OpenMP threads/task should not exceed
the physical number of cores (on a node), otherwise performance will
suffer. If Hyper-Threading (HT) is enabled, then the product of MPI
tasks \* OpenMP threads/task should not exceed the number of physical
cores \* hardware threads per core. The ``-k on`` switch also issues a
``package kokkos`` command (with no additional arguments) which sets
various KOKKOS options to default values, as discussed on the
:doc:`package <package>` command doc page.

The ``-sf kk`` :doc:`command-line switch <Run_options>` will
automatically append the "/kk" suffix to styles that support it. In
this manner no modification to the input script is needed.
Alternatively, one can run with the KOKKOS package by editing the
input script as described below.

.. note::

   When using a single OpenMP thread, the Kokkos Serial back end (i.e.
   ``Makefile.kokkos_mpi_only``) will give better performance than the
   OpenMP back end (i.e. ``Makefile.kokkos_omp``) because some of the
   overhead to make the code thread-safe is removed.

.. note::

   Use the ``-pk kokkos`` :doc:`command-line switch <Run_options>` to
   change the default :doc:`package kokkos <package>` options. See its
   doc page for details and default settings. Experimenting with its
   options can provide a speed-up for specific calculations. For
   example:

   .. code-block:: bash

      # Newton on, half neighbor list, non-threaded comm
      mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk \
             -pk kokkos newton on neigh half comm no -in in.lj

If the :doc:`newton <newton>` command is used in the input script, it
can also override the Newton flag defaults.

For half neighbor lists and OpenMP, the KOKKOS package uses data
duplication (i.e. thread-private arrays) by default to avoid
thread-level write conflicts in the force arrays (and other data
structures as necessary). Data duplication is typically fastest for
small numbers of threads (i.e. 8 or fewer) but increases the memory
footprint and does not scale to large numbers of threads. An
alternative to data duplication is to use thread-level atomic
operations, which do not require data duplication. The use of atomic
operations can be enforced by compiling LAMMPS with the
``-DLMP_KOKKOS_USE_ATOMICS`` pre-processor flag, as sketched below.
Most but not all Kokkos-enabled pair styles support data duplication.
Alternatively, full neighbor lists avoid the need for duplication or
atomic operations but require more compute operations per atom. When
using the Kokkos Serial back end or the OpenMP back end with a single
thread, no duplication or atomic operations are used. For CUDA and
half neighbor lists, the KOKKOS package always uses atomic operations.

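The following is one possible, untested sketch of how that
pre-processor flag could be passed through a CMake build; the package
and backend options shown are the usual LAMMPS CMake flags, and you
should adjust them to your own configuration.

.. code-block:: bash

   # sketch: build the KOKKOS package with the OpenMP back end and
   # force thread-level atomics instead of data duplication
   cmake -D PKG_KOKKOS=on -D Kokkos_ENABLE_OPENMP=yes \
         -D CMAKE_CXX_FLAGS="-DLMP_KOKKOS_USE_ATOMICS" ../cmake
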
CPU Cores, Sockets and Thread Affinity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When using multi-threading, it is important for performance to bind
both the MPI tasks and the OpenMP threads to physical cores, so they
do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

.. code-block:: bash

   # OpenMPI 1.8
   mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...

   # Mvapich2 2.0
   mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ...

For binding threads with KOKKOS OpenMP, use thread affinity
environment variables to force binding. With OpenMP 3.1 (GCC 4.7 or
later, Intel 12 or later) setting the environment variable
``OMP_PROC_BIND=true`` should be sufficient. In general, for best
performance with OpenMP 4.0 or later set ``OMP_PROC_BIND=spread`` and
``OMP_PLACES=threads``. For binding threads with the KOKKOS pthreads
option, compile LAMMPS with hwloc or libnuma support enabled as
described on the :ref:`extra build options page <kokkos>`.

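For example, a run that combines MPI process binding with the OpenMP
affinity variables mentioned above might look like the following
sketch, reusing the OpenMPI binding flags shown earlier; the
executable and input names are placeholders.

.. code-block:: bash

   # spread and pin OpenMP threads, then bind MPI tasks to sockets
   export OMP_PROC_BIND=spread
   export OMP_PLACES=threads
   mpirun -np 2 --bind-to socket --map-by socket lmp_kokkos_omp -k on t 8 -sf kk -in in.lj
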
Running on Knights Landing (KNL) Intel Xeon Phi
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here is a quick overview of how to use the KOKKOS package for the
Intel Knights Landing (KNL) Xeon Phi:

KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores
are reserved for the OS, and only 64 or 66 cores are used. Each core
has 4 Hyper-Threads, so there are effectively N = 256 (4\*64) or
N = 264 (4\*66) cores to run on. The product of MPI tasks \* OpenMP
threads/task should not exceed this limit, otherwise performance will
suffer. Note that with the KOKKOS package you do not need to specify
how many KNLs there are per node; each KNL is simply treated as
running some number of MPI tasks.

Examples of mpirun commands that follow these rules are shown below.

.. code-block:: bash

   # Running on an Intel KNL node with 68 cores
   # (272 threads/node via 4x hardware threading):

   # 1 node, 64 MPI tasks/node, 4 threads/task
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

   # 1 node, 66 MPI tasks/node, 4 threads/task
   mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

   # 1 node, 32 MPI tasks/node, 8 threads/task
   mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj

   # 8 nodes, 64 MPI tasks/node, 4 threads/task
   mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

The ``-np`` setting of the mpirun command sets the total number of
MPI tasks, and ``-ppn`` sets the number of MPI tasks/node; the ``-k on
t Nt`` command-line switch sets the number of threads/task to ``Nt``.
The product of MPI tasks/node and OpenMP threads/task should be N,
i.e. 256 or 264.

.. note::

   The default for the :doc:`package kokkos <package>` command when
   running on KNL is to use "half" neighbor lists and set the Newton
   flag to "on" for both pairwise and bonded interactions. This will
   typically be best for many-body potentials. For simpler pairwise
   potentials, it may be faster to use a "full" neighbor list with the
   Newton flag set to "off". Use the ``-pk kokkos`` :doc:`command-line
   switch <Run_options>` to change the default :doc:`package kokkos
   <package>` options. See its documentation page for details and
   default settings. Experimenting with its options can provide a
   speed-up for specific calculations. For example:

   .. code-block:: bash

      # Newton on, half neighbor list, threaded comm
      mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm host -in in.reax

      # Newton off, full neighbor list, non-threaded comm
      mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk \
             -pk kokkos newton off neigh full comm no -in in.lj

.. note::

   MPI tasks and threads should be bound to cores as described above
   for CPUs.

.. note::

   To build with Kokkos support for Intel Xeon Phi co-processors such
   as Knights Corner (KNC), your system must be configured to use them
   in "native" mode, not "offload" mode like the INTEL package
   supports.

Running on GPUs
^^^^^^^^^^^^^^^

Use the ``-k`` :doc:`command-line switch <Run_options>` to specify the
number of GPUs per node. Typically the ``-np`` setting of the
``mpirun`` command should set the number of MPI tasks/node to be equal
to the number of physical GPUs on the node. You can assign multiple
MPI tasks to the same GPU with the KOKKOS package, but this is usually
only faster if some portions of the input script have not been ported
to use Kokkos. In this case, packing/unpacking communication buffers
on the host may also give a speedup (see the KOKKOS :doc:`package
<package>` command). Using CUDA MPS is recommended in this scenario.

Using a GPU-aware MPI library is highly recommended. GPU-aware MPI
use can be avoided by using :doc:`-pk kokkos gpu/aware off <package>`.
As above for multicore CPUs (and no GPU), if N is the number of
physical cores/node, then the number of MPI tasks/node should not
exceed N.

.. parsed-literal::

   -k on g Ng

Here are examples of how to use the KOKKOS package for GPUs, assuming
one or more nodes, each with two GPUs:

.. code-block:: bash

   # 1 node, 2 MPI tasks/node, 2 GPUs/node
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj

   # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total)
   mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj

.. note::

   The default for the :doc:`package kokkos <package>` command when
   running on GPUs is to use "full" neighbor lists and set the Newton
   flag to "off" for both pairwise and bonded interactions, along with
   threaded communication. When running on Maxwell or Kepler GPUs,
   this will typically be best. For Pascal GPUs and beyond, using
   "half" neighbor lists and setting the Newton flag to "on" may be
   faster. For many pair styles, setting the neighbor binsize equal to
   twice the CPU default value will give a speedup; this is the
   default when running on GPUs. Use the ``-pk kokkos``
   :doc:`command-line switch <Run_options>` to change the default
   :doc:`package kokkos <package>` options. See its documentation page
   for details and default settings. Experimenting with its options
   can provide a speed-up for specific calculations. For example:

   .. code-block:: bash

      # Newton on, half neighbor list, set binsize = neighbor ghost cutoff
      mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk \
             -pk kokkos newton on neigh half binsize 2.8 -in in.lj

.. note::

   The default binsize for :doc:`atom sorting <atom_modify>` on GPUs
   is equal to the default CPU neighbor binsize (i.e. 2x smaller than
   the default GPU neighbor binsize). When running simple pair-wise
   potentials like Lennard-Jones on GPUs, using a 2x larger binsize
   for atom sorting (equal to the default GPU neighbor binsize) and
   sorting more frequently than the default (e.g. sorting every 100
   time steps instead of every 1000) may improve performance.

.. note::

   When running on GPUs with many MPI ranks (tens of thousands or
   more), creating the atom map (required for molecular systems) on
   the GPU can become very slow or run out of GPU memory, which slows
   down the whole calculation or causes a crash. In that case, use the
   ``-pk kokkos atom/map no`` :doc:`command-line switch <Run_options>`
   or the :doc:`package kokkos atom/map no <package>` command to
   create the atom map on the CPU instead.

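For illustration, the GPU example above with the atom map moved to the
CPU would look like the following sketch; the executable and input
names are placeholders.

.. code-block:: bash

   # 2 MPI tasks, 2 GPUs, build the atom map on the host CPU
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos atom/map no -in in.lj
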
.. note::

   When using a GPU, you will achieve the best performance if your
   input script does not use fix or compute styles that are not yet
   Kokkos-enabled. This allows data to stay on the GPU for multiple
   timesteps, without being copied back to the host CPU. Invoking a
   non-Kokkos fix or compute, or performing I/O for :doc:`thermo
   <thermo_style>` or :doc:`dump <dump>` output will cause data to be
   copied back to the CPU, incurring a performance penalty.

.. note::

   To get an accurate timing breakdown between time spent in pair,
   kspace, etc., you must set the environment variable
   ``CUDA_LAUNCH_BLOCKING=1``. However, this will reduce performance
   and is not recommended for production runs.

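As a sketch, a short diagnostic run with blocking kernel launches
enabled could be set up as follows; this is for analysis only, not for
production.

.. code-block:: bash

   # make CUDA kernel launches blocking so the timing breakdown is accurate
   export CUDA_LAUNCH_BLOCKING=1
   mpirun -np 1 lmp_kokkos_cuda_openmpi -k on g 1 -sf kk -in in.lj
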
Troubleshooting segmentation faults on GPUs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As noted above, KOKKOS by default assumes that the MPI library is
GPU-aware. This is not always the case and can lead to segmentation
faults when using more than one MPI process. Normally, LAMMPS will
print a warning like "*Turning off GPU-aware MPI since it is not
detected*" or stop with an error message like "*Kokkos with
GPU-enabled backend assumes GPU-aware MPI is available*", but you may
also just get a **segmentation fault**. To confirm that a segmentation
fault is caused by this, you can turn off the GPU-aware assumption via
the :doc:`package kokkos command <package>` or the corresponding
command-line flag.

If you still get a segmentation fault, despite running with only one
MPI process or using the command-line flag to turn off the assumption
of a GPU-aware MPI library, then using the CMake compile setting
``-DKokkos_ENABLE_DEBUG=on``, or adding ``KOKKOS_DEBUG=yes`` to your
machine makefile when building with traditional make, will generate
useful output that can be passed to the LAMMPS developers for further
debugging.

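A sketch of a CMake configuration for such a debugging build, reusing
the ``kokkos-cuda.cmake`` preset mentioned later on this page, is
shown below; adjust paths and options to your own setup.

.. code-block:: bash

   # configure a KOKKOS debugging build with CMake (run from a build directory)
   cmake -C ../cmake/presets/kokkos-cuda.cmake -D Kokkos_ENABLE_DEBUG=on ../cmake
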
Troubleshooting memory allocation on GPUs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`Kokkos Tools <https://github.com/kokkos/kokkos-tools/>`_ provides a
set of lightweight profiling and debugging utilities, which interface
with instrumentation hooks (e.g. `space-time-stack
<https://github.com/kokkos/kokkos-tools/tree/develop/profiling/space-time-stack>`_)
built directly into the Kokkos runtime. After compiling the desired
tool as a dynamic library, you then have to set the environment
variable ``KOKKOS_TOOLS_LIBS`` to its path before executing your
LAMMPS Kokkos run. Example:

.. code-block:: bash

   export KOKKOS_TOOLS_LIBS=${HOME}/kokkos-tools/src/tools/memory-events/kp_memory_event.so
   mpirun -np 4 lmp_kokkos_cuda_openmpi -in in.lj -k on g 4 -sf kk

Starting with the NVIDIA Pascal GPU architecture, CUDA supports
`"Unified Virtual Memory" (UVM)
<https://developer.nvidia.com/blog/unified-memory-cuda-beginners/>`_,
which makes it possible to allocate more memory than the GPU possesses
by also using memory on the host CPU; CUDA then transparently moves
data between CPU and GPU as needed. The resulting LAMMPS performance
depends on the `memory access pattern, data residency, and GPU memory
oversubscription
<https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/>`_.
The CMake option ``-DKokkos_ENABLE_CUDA_UVM=on`` or the makefile
setting ``KOKKOS_CUDA_OPTIONS=enable_lambda,force_uvm`` enables using
:ref:`UVM with Kokkos <kokkos>` when compiling LAMMPS.

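For instance, a CMake configuration that turns on UVM could look like
the following sketch, again using the ``kokkos-cuda.cmake`` preset
referenced below; adjust it to your build layout.

.. code-block:: bash

   # sketch: enable CUDA unified virtual memory in a CMake build
   cmake -C ../cmake/presets/kokkos-cuda.cmake -D Kokkos_ENABLE_CUDA_UVM=on ../cmake
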
Run with the KOKKOS package by editing an input script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Alternatively, the effect of the ``-sf`` or ``-pk`` switches can be
duplicated by adding the :doc:`package kokkos <package>` or
:doc:`suffix kk <suffix>` commands to your input script.

The discussion above about building LAMMPS with the KOKKOS package,
using the ``mpirun`` or ``mpiexec`` command, and setting appropriate
thread properties applies here as well.

You must still use the ``-k on`` :doc:`command-line switch
<Run_options>` to enable the KOKKOS package, and specify its
additional arguments for hardware options appropriate to your system,
as documented above.

You can use the :doc:`suffix kk <suffix>` command, or you can
explicitly add the "kk" suffix to individual styles in your input
script, e.g.

.. code-block:: LAMMPS

   pair_style lj/cut/kk 2.5

You only need to use the :doc:`package kokkos <package>` command if
you wish to change any of its option defaults, as set by the ``-k on``
:doc:`command-line switch <Run_options>`.

**Using OpenMP threading and CUDA together:**

With the KOKKOS package, both OpenMP multi-threading and GPUs can be
compiled and used together in a few special cases. In the makefile for
the conventional build, the ``KOKKOS_DEVICES`` variable must include
both "Cuda" and "OpenMP", as is the case in
``/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi``.

.. code-block:: bash

   KOKKOS_DEVICES=Cuda,OpenMP

When building with CMake, you need to enable both features, as is done
in the ``kokkos-cuda.cmake`` CMake preset file.

.. code-block:: bash

   cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes ../cmake

The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using ``-sf kk`` on the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
style in the input script, the Kokkos OpenMP (CPU) version of that
specific style will be used instead. Set the number of OpenMP threads
with ``t Nt`` and the number of GPUs with ``g Ng``:

.. parsed-literal::

   -k on t Nt g Ng

For example, the command to run with 1 GPU and 8 OpenMP threads is
then:

.. code-block:: bash

   mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk

Conversely, if ``-sf kk/host`` is used on the command line and the
"/kk" or "/kk/device" suffix is added to a specific style in your
input script, then only that specific style will run on the GPU while
everything else will run on the CPU in OpenMP mode. Note that the
execution of the CPU and GPU styles will NOT overlap, except for one
special case:

A kspace style and/or molecular topology (bonds, angles, etc.) running
on the host CPU can overlap with a pair style running on the GPU.
First compile with ``--default-stream per-thread`` added to
``CCFLAGS`` in the Kokkos CUDA Makefile. Then explicitly use the
"/kk/host" suffix for kspace and bonds, angles, etc. in the input file
and the "kk" suffix (equal to "kk/device") on the command line. Also
make sure the environment variable ``CUDA_LAUNCH_BLOCKING`` is not set
to "1" so CPU/GPU overlap can occur.

Performance to expect
"""""""""""""""""""""

The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
size.

Generally speaking, the following rules of thumb apply:

* When running on CPUs only, with a single thread per MPI task,
  performance of a KOKKOS style is somewhere between that of the
  standard (un-accelerated) styles (MPI-only mode) and those provided
  by the OPENMP package. However, the difference between all three is
  small (less than 20%).
* When running on CPUs only, with multiple threads per MPI task,
  performance of a KOKKOS style is a bit slower than the OPENMP
  package.
* When running a large number of atoms per GPU, KOKKOS is typically
  faster than the GPU package when compiled for double precision. The
  benefit of using single or mixed precision with the GPU package
  depends significantly on the hardware in use and the simulated
  system and pair style.
* When running on Intel Phi hardware, KOKKOS is not as fast as the
  INTEL package, which is optimized for x86 hardware (not just from
  Intel) and compilation with the Intel compilers. The INTEL package
  also can increase the vector length of vector instructions by
  switching to single or mixed precision mode.
* The KOKKOS package by default assumes that you are using exactly
  one MPI rank per GPU. When trying to use multiple MPI ranks per GPU,
  it is mandatory to enable `CUDA Multi-Process Service (MPS)
  <https://docs.nvidia.com/deploy/mps/index.html>`_ to get good
  performance. In this case it is better to not use all available MPI
  ranks in order to avoid competing with the MPS daemon for CPU
  resources.

See the `Benchmark page <https://www.lammps.org/bench.html>`_ of the
LAMMPS website for performance of the KOKKOS package on different
hardware.

Advanced Kokkos options
"""""""""""""""""""""""

There are other allowed options when building with the KOKKOS package
that can improve performance or assist in debugging or profiling.
They are explained on the :ref:`KOKKOS section of the build extras
<kokkos>` doc page.

Restrictions
""""""""""""

Currently, there are no precision options with the KOKKOS package.
All compilation and computation is performed in double precision.