restore files
This commit is contained in:
2000
doc/src/Packages_details.txt
Normal file
2000
doc/src/Packages_details.txt
Normal file
File diff suppressed because it is too large
Load Diff
381
doc/src/Speed_kokkos.txt
Normal file
381
doc/src/Speed_kokkos.txt
Normal file
@ -0,0 +1,381 @@
|
||||
"Higher level section"_Speed.html - "LAMMPS WWW Site"_lws - "LAMMPS
|
||||
Documentation"_ld - "LAMMPS Commands"_lc :c
|
||||
|
||||
:link(lws,http://lammps.sandia.gov)
|
||||
:link(ld,Manual.html)
|
||||
:link(lc,Commands_all.html)
|
||||
|
||||
:line
|
||||
|
||||
KOKKOS package :h3
|
||||
|
||||
Kokkos is a templated C++ library that provides abstractions to allow
|
||||
a single implementation of an application kernel (e.g. a pair style)
|
||||
to run efficiently on different kinds of hardware, such as GPUs, Intel
|
||||
Xeon Phis, or many-core CPUs. Kokkos maps the C++ kernel onto
|
||||
different backend languages such as CUDA, OpenMP, or Pthreads. The
|
||||
Kokkos library also provides data abstractions to adjust (at compile
|
||||
time) the memory layout of data structures like 2d and 3d arrays to
|
||||
optimize performance on different hardware. For more information on
|
||||
Kokkos, see "Github"_https://github.com/kokkos/kokkos. Kokkos is part
|
||||
of "Trilinos"_http://trilinos.sandia.gov/packages/kokkos. The Kokkos
|
||||
library was written primarily by Carter Edwards, Christian Trott, and
|
||||
Dan Sunderland (all Sandia).
|
||||
|
||||
The LAMMPS KOKKOS package contains versions of pair, fix, and atom
|
||||
styles that use data structures and macros provided by the Kokkos
|
||||
library, which is included with LAMMPS in /lib/kokkos. The KOKKOS
|
||||
package was developed primarily by Christian Trott (Sandia) and Stan
|
||||
Moore (Sandia) with contributions of various styles by others,
|
||||
including Sikandar Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez
|
||||
(Sandia). For more information on developing using Kokkos abstractions
|
||||
see the Kokkos programmers' guide at /lib/kokkos/doc/Kokkos_PG.pdf.
|
||||
|
||||
Kokkos currently provides support for 3 modes of execution (per MPI
|
||||
task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP
|
||||
(threading for many-core CPUs and Intel Phi), and CUDA (for NVIDIA
|
||||
GPUs). You choose the mode at build time to produce an executable
|
||||
compatible with specific hardware.
|
||||
|
||||
NOTE: Kokkos support within LAMMPS must be built with a C++11 compatible
|
||||
compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
|
||||
Clang 3.5.2 or later is required.
|
||||
|
||||
NOTE: To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA
|
||||
software version 7.5 or later must be installed on your system. See
|
||||
the discussion for the "GPU package"_Speed_gpu.html for details of how
|
||||
to check and do this.
|
||||
|
||||
NOTE: Kokkos with CUDA currently implicitly assumes, that the MPI
|
||||
library is CUDA-aware and has support for GPU-direct. This is not
|
||||
always the case, especially when using pre-compiled MPI libraries
|
||||
provided by a Linux distribution. This is not a problem when using
|
||||
only a single GPU and a single MPI rank on a desktop. When running
|
||||
with multiple MPI ranks, you may see segmentation faults without
|
||||
GPU-direct support. These can be avoided by adding the flags "-pk
|
||||
kokkos gpu/direct off"_Run_options.html to the LAMMPS command line or
|
||||
by using the command "package kokkos gpu/direct off"_package.html in
|
||||
the input file.
|
||||
|
||||
[Building LAMMPS with the KOKKOS package:]
|
||||
|
||||
See the "Build extras"_Build_extras.html#kokkos doc page for instructions.
|
||||
|
||||
[Running LAMMPS with the KOKKOS package:]
|
||||
|
||||
All Kokkos operations occur within the context of an individual MPI
|
||||
task running on a single node of the machine. The total number of MPI
|
||||
tasks used by LAMMPS (one or multiple per compute node) is set in the
|
||||
usual manner via the mpirun or mpiexec commands, and is independent of
|
||||
Kokkos. E.g. the mpirun command in OpenMPI does this via its -np and
|
||||
-npernode switches. Ditto for MPICH via -np and -ppn.
|
||||
|
||||
[Running on a multi-core CPU:]
|
||||
|
||||
Here is a quick overview of how to use the KOKKOS package
|
||||
for CPU acceleration, assuming one or more 16-core nodes.
|
||||
|
||||
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj # 1 node, 16 MPI tasks/node, no multi-threading
|
||||
mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj # 2 nodes, 1 MPI task/node, 16 threads/task
|
||||
mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 8 threads/task
|
||||
mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre
|
||||
|
||||
To run using the KOKKOS package, use the "-k on", "-sf kk" and "-pk
|
||||
kokkos" "command-line switches"_Run_options.html in your mpirun
|
||||
command. You must use the "-k on" "command-line
|
||||
switch"_Run_options.html to enable the KOKKOS package. It takes
|
||||
additional arguments for hardware settings appropriate to your system.
|
||||
For OpenMP use:
|
||||
|
||||
-k on t Nt :pre
|
||||
|
||||
The "t Nt" option specifies how many OpenMP threads per MPI task to
|
||||
use with a node. The default is Nt = 1, which is MPI-only mode. Note
|
||||
that the product of MPI tasks * OpenMP threads/task should not exceed
|
||||
the physical number of cores (on a node), otherwise performance will
|
||||
suffer. If hyperthreading is enabled, then the product of MPI tasks *
|
||||
OpenMP threads/task should not exceed the physical number of cores *
|
||||
hardware threads. The "-k on" switch also issues a "package kokkos"
|
||||
command (with no additional arguments) which sets various KOKKOS
|
||||
options to default values, as discussed on the "package"_package.html
|
||||
command doc page.
|
||||
|
||||
The "-sf kk" "command-line switch"_Run_options.html will automatically
|
||||
append the "/kk" suffix to styles that support it. In this manner no
|
||||
modification to the input script is needed. Alternatively, one can run
|
||||
with the KOKKOS package by editing the input script as described
|
||||
below.
|
||||
|
||||
NOTE: The default for the "package kokkos"_package.html command is to
|
||||
use "full" neighbor lists and set the Newton flag to "off" for both
|
||||
pairwise and bonded interactions. However, when running on CPUs, it
|
||||
will typically be faster to use "half" neighbor lists and set the
|
||||
Newton flag to "on", just as is the case for non-accelerated pair
|
||||
styles. It can also be faster to use non-threaded communication. Use
|
||||
the "-pk kokkos" "command-line switch"_Run_options.html to change the
|
||||
default "package kokkos"_package.html options. See its doc page for
|
||||
details and default settings. Experimenting with its options can
|
||||
provide a speed-up for specific calculations. For example:
|
||||
|
||||
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj # Newton on, Half neighbor list, non-threaded comm :pre
|
||||
|
||||
If the "newton"_newton.html command is used in the input
|
||||
script, it can also override the Newton flag defaults.
|
||||
|
||||
[Core and Thread Affinity:]
|
||||
|
||||
When using multi-threading, it is important for performance to bind
|
||||
both MPI tasks to physical cores, and threads to physical cores, so
|
||||
they do not migrate during a simulation.
|
||||
|
||||
If you are not certain MPI tasks are being bound (check the defaults
|
||||
for your MPI installation), binding can be forced with these flags:
|
||||
|
||||
OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
|
||||
Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ... :pre
|
||||
|
||||
For binding threads with KOKKOS OpenMP, use thread affinity
|
||||
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
|
||||
later, intel 12 or later) setting the environment variable
|
||||
OMP_PROC_BIND=true should be sufficient. In general, for best
|
||||
performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and
|
||||
OMP_PLACES=threads. For binding threads with the KOKKOS pthreads
|
||||
option, compile LAMMPS the KOKKOS HWLOC=yes option as described below.
|
||||
|
||||
[Running on Knight's Landing (KNL) Intel Xeon Phi:]
|
||||
|
||||
Here is a quick overview of how to use the KOKKOS package for the
|
||||
Intel Knight's Landing (KNL) Xeon Phi:
|
||||
|
||||
KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores are
|
||||
reserved for the OS, and only 64 or 66 cores are used. Each core has 4
|
||||
hyperthreads,so there are effectively N = 256 (4*64) or N = 264 (4*66)
|
||||
cores to run on. The product of MPI tasks * OpenMP threads/task should
|
||||
not exceed this limit, otherwise performance will suffer. Note that
|
||||
with the KOKKOS package you do not need to specify how many KNLs there
|
||||
are per node; each KNL is simply treated as running some number of MPI
|
||||
tasks.
|
||||
|
||||
Examples of mpirun commands that follow these rules are shown below.
|
||||
|
||||
Intel KNL node with 68 cores (272 threads/node via 4x hardware threading):
|
||||
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 64 MPI tasks/node, 4 threads/task
|
||||
mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 66 MPI tasks/node, 4 threads/task
|
||||
mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj # 1 node, 32 MPI tasks/node, 8 threads/task
|
||||
mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 8 nodes, 64 MPI tasks/node, 4 threads/task :pre
|
||||
|
||||
The -np setting of the mpirun command sets the number of MPI
|
||||
tasks/node. The "-k on t Nt" command-line switch sets the number of
|
||||
threads/task as Nt. The product of these two values should be N, i.e.
|
||||
256 or 264.
|
||||
|
||||
NOTE: The default for the "package kokkos"_package.html command is to
|
||||
use "full" neighbor lists and set the Newton flag to "off" for both
|
||||
pairwise and bonded interactions. When running on KNL, this will
|
||||
typically be best for pair-wise potentials. For manybody potentials,
|
||||
using "half" neighbor lists and setting the Newton flag to "on" may be
|
||||
faster. It can also be faster to use non-threaded communication. Use
|
||||
the "-pk kokkos" "command-line switch"_Run_options.html to change the
|
||||
default "package kokkos"_package.html options. See its doc page for
|
||||
details and default settings. Experimenting with its options can
|
||||
provide a speed-up for specific calculations. For example:
|
||||
|
||||
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj # Newton off, full neighbor list, non-threaded comm
|
||||
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax # Newton on, half neighbor list, non-threaded comm :pre
|
||||
|
||||
NOTE: MPI tasks and threads should be bound to cores as described
|
||||
above for CPUs.
|
||||
|
||||
NOTE: To build with Kokkos support for Intel Xeon Phi coprocessors
|
||||
such as Knight's Corner (KNC), your system must be configured to use
|
||||
them in "native" mode, not "offload" mode like the USER-INTEL package
|
||||
supports.
|
||||
|
||||
[Running on GPUs:]
|
||||
|
||||
Use the "-k" "command-line switch"_Run_options.html to
|
||||
specify the number of GPUs per node. Typically the -np setting of the
|
||||
mpirun command should set the number of MPI tasks/node to be equal to
|
||||
the number of physical GPUs on the node. You can assign multiple MPI
|
||||
tasks to the same GPU with the KOKKOS package, but this is usually
|
||||
only faster if significant portions of the input script have not
|
||||
been ported to use Kokkos. Using CUDA MPS is recommended in this
|
||||
scenario. Using a CUDA-aware MPI library with support for GPU-direct
|
||||
is highly recommended. GPU-direct use can be avoided by using
|
||||
"-pk kokkos gpu/direct no"_package.html.
|
||||
As above for multi-core CPUs (and no GPU), if N is the number of
|
||||
physical cores/node, then the number of MPI tasks/node should not
|
||||
exceed N.
|
||||
|
||||
-k on g Ng :pre
|
||||
|
||||
Here are examples of how to use the KOKKOS package for GPUs, assuming
|
||||
one or more nodes, each with two GPUs:
|
||||
|
||||
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 2 GPUs/node
|
||||
mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre
|
||||
|
||||
NOTE: The default for the "package kokkos"_package.html command is to
|
||||
use "full" neighbor lists and set the Newton flag to "off" for both
|
||||
pairwise and bonded interactions, along with threaded communication.
|
||||
When running on Maxwell or Kepler GPUs, this will typically be
|
||||
best. For Pascal GPUs, using "half" neighbor lists and setting the
|
||||
Newton flag to "on" may be faster. For many pair styles, setting the
|
||||
neighbor binsize equal to the ghost atom cutoff will give speedup.
|
||||
Use the "-pk kokkos" "command-line switch"_Run_options.html to change
|
||||
the default "package kokkos"_package.html options. See its doc page
|
||||
for details and default settings. Experimenting with its options can
|
||||
provide a speed-up for specific calculations. For example:
|
||||
|
||||
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos binsize 2.8 -in in.lj # Set binsize = neighbor ghost cutoff
|
||||
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighborlist, set binsize = neighbor ghost cutoff :pre
|
||||
|
||||
NOTE: For good performance of the KOKKOS package on GPUs, you must
|
||||
have Kepler generation GPUs (or later). The Kokkos library exploits
|
||||
texture cache options not supported by Telsa generation GPUs (or
|
||||
older).
|
||||
|
||||
NOTE: When using a GPU, you will achieve the best performance if your
|
||||
input script does not use fix or compute styles which are not yet
|
||||
Kokkos-enabled. This allows data to stay on the GPU for multiple
|
||||
timesteps, without being copied back to the host CPU. Invoking a
|
||||
non-Kokkos fix or compute, or performing I/O for
|
||||
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
|
||||
to be copied back to the CPU incurring a performance penalty.
|
||||
|
||||
NOTE: To get an accurate timing breakdown between time spend in pair,
|
||||
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
|
||||
However, this will reduce performance and is not recommended for production runs.
|
||||
|
||||
[Run with the KOKKOS package by editing an input script:]
|
||||
|
||||
Alternatively the effect of the "-sf" or "-pk" switches can be
|
||||
duplicated by adding the "package kokkos"_package.html or "suffix
|
||||
kk"_suffix.html commands to your input script.
|
||||
|
||||
The discussion above for building LAMMPS with the KOKKOS package, the
|
||||
mpirun/mpiexec command, and setting appropriate thread are the same.
|
||||
|
||||
You must still use the "-k on" "command-line switch"_Run_options.html
|
||||
to enable the KOKKOS package, and specify its additional arguments for
|
||||
hardware options appropriate to your system, as documented above.
|
||||
|
||||
You can use the "suffix kk"_suffix.html command, or you can explicitly add a
|
||||
"kk" suffix to individual styles in your input script, e.g.
|
||||
|
||||
pair_style lj/cut/kk 2.5 :pre
|
||||
|
||||
You only need to use the "package kokkos"_package.html command if you
|
||||
wish to change any of its option defaults, as set by the "-k on"
|
||||
"command-line switch"_Run_options.html.
|
||||
|
||||
[Using OpenMP threading and CUDA together (experimental):]
|
||||
|
||||
With the KOKKOS package, both OpenMP multi-threading and GPUs can be
|
||||
used together in a few special cases. In the Makefile, the
|
||||
KOKKOS_DEVICES variable must include both "Cuda" and "OpenMP", as is
|
||||
the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi
|
||||
|
||||
KOKKOS_DEVICES=Cuda,OpenMP :pre
|
||||
|
||||
The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
|
||||
using the "-sf kk" in the command line gives the default CUDA version
|
||||
everywhere. However, if the "/kk/host" suffix is added to a specific
|
||||
style in the input script, the Kokkos OpenMP (CPU) version of that
|
||||
specific style will be used instead. Set the number of OpenMP threads
|
||||
as "t Nt" and the number of GPUs as "g Ng"
|
||||
|
||||
-k on t Nt g Ng :pre
|
||||
|
||||
For example, the command to run with 1 GPU and 8 OpenMP threads is then:
|
||||
|
||||
mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk :pre
|
||||
|
||||
Conversely, if the "-sf kk/host" is used in the command line and then
|
||||
the "/kk" or "/kk/device" suffix is added to a specific style in your
|
||||
input script, then only that specific style will run on the GPU while
|
||||
everything else will run on the CPU in OpenMP mode. Note that the
|
||||
execution of the CPU and GPU styles will NOT overlap, except for a
|
||||
special case:
|
||||
|
||||
A kspace style and/or molecular topology (bonds, angles, etc.) running
|
||||
on the host CPU can overlap with a pair style running on the
|
||||
GPU. First compile with "--default-stream per-thread" added to CCFLAGS
|
||||
in the Kokkos CUDA Makefile. Then explicitly use the "/kk/host"
|
||||
suffix for kspace and bonds, angles, etc. in the input file and the
|
||||
"kk" suffix (equal to "kk/device") on the command line. Also make
|
||||
sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
|
||||
so CPU/GPU overlap can occur.
|
||||
|
||||
[Speed-ups to expect:]
|
||||
|
||||
The performance of KOKKOS running in different modes is a function of
|
||||
your hardware, which KOKKOS-enable styles are used, and the problem
|
||||
size.
|
||||
|
||||
Generally speaking, the following rules of thumb apply:
|
||||
|
||||
When running on CPUs only, with a single thread per MPI task,
|
||||
performance of a KOKKOS style is somewhere between the standard
|
||||
(un-accelerated) styles (MPI-only mode), and those provided by the
|
||||
USER-OMP package. However the difference between all 3 is small (less
|
||||
than 20%). :ulb,l
|
||||
|
||||
When running on CPUs only, with multiple threads per MPI task,
|
||||
performance of a KOKKOS style is a bit slower than the USER-OMP
|
||||
package. :l
|
||||
|
||||
When running large number of atoms per GPU, KOKKOS is typically faster
|
||||
than the GPU package. :l
|
||||
|
||||
When running on Intel hardware, KOKKOS is not as fast as
|
||||
the USER-INTEL package, which is optimized for that hardware. :l
|
||||
:ule
|
||||
|
||||
See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
|
||||
LAMMPS web site for performance of the KOKKOS package on different
|
||||
hardware.
|
||||
|
||||
[Advanced Kokkos options:]
|
||||
|
||||
There are other allowed options when building with the KOKKOS package.
|
||||
As explained on the "Build extras"_Build_extras.html#kokkos doc page,
|
||||
they can be set either as variables on the make command line or in
|
||||
Makefile.machine, or they can be specified as CMake variables. Each
|
||||
takes a value shown below. The default value is listed, which is set
|
||||
in the lib/kokkos/Makefile.kokkos file.
|
||||
|
||||
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
|
||||
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, {experimental_memkind}, default = {none}
|
||||
KOKKOS_CXX_STANDARD, values = {c++11}, {c++1z}, default = {c++11}
|
||||
KOKKOS_OPTIONS, values = {aggressive_vectorization}, {disable_profiling}, default = {none}
|
||||
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc}, {enable_lambda}, default = {enable_lambda} :ul
|
||||
|
||||
KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
|
||||
migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
|
||||
used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
|
||||
necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
|
||||
provides alternative methods via environment variables for binding
|
||||
threads to hardware cores. More info on binding threads to cores is
|
||||
given on the "Speed omp"_Speed_omp.html doc page.
|
||||
|
||||
KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
|
||||
on most Unix platforms. This library is not available on all
|
||||
platforms.
|
||||
|
||||
KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
|
||||
within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
|
||||
debugging information that can be useful. It also enables runtime
|
||||
bounds checking on Kokkos data structures.
|
||||
|
||||
KOKKOS_CXX_STANDARD and KOKKOS_OPTIONS are typically not changed when
|
||||
building LAMMPS.
|
||||
|
||||
KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS
|
||||
package must be compiled with the {enable_lambda} option when using
|
||||
GPUs.
|
||||
|
||||
[Restrictions:]
|
||||
|
||||
Currently, there are no precision options with the KOKKOS package. All
|
||||
compilation and computation is performed in double precision.
|
||||
Reference in New Issue
Block a user