docs: update speed section

Richard Berger
2024-08-16 14:09:32 -06:00
parent 1fca4d94d0
commit 9c2c8045cb
6 changed files with 215 additions and 161 deletions

View File

@ -15,7 +15,7 @@ The 5 standard problems are as follow:
#. LJ = atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff (55
   neighbors per atom), NVE integration
#. Chain = bead-spring polymer melt of 100-mer chains, FENE bonds and LJ
   pairwise interactions with a :math:`2^{\frac{1}{6}}` sigma cutoff (5 neighbors per
   atom), NVE integration
#. EAM = metallic solid, Cu EAM potential with 4.95 Angstrom cutoff (45
   neighbors per atom), NVE integration
@ -29,19 +29,19 @@ The 5 standard problems are as follow:
Input files for these 5 problems are provided in the bench directory
of the LAMMPS distribution. Each has 32,000 atoms and runs for 100
timesteps. The size of the problem (number of atoms) can be varied
using command-line switches as described in the ``bench/README`` file.
This is an easy way to test performance and either strong or weak
scalability on your machine.

The bench directory includes a few ``log.*`` files that show performance
of these 5 problems on 1 or 4 cores of a Linux desktop. The ``bench/FERMI``
and ``bench/KEPLER`` directories have input files and scripts and instructions
for running the same (or similar) problems using OpenMP or GPU or Xeon
Phi acceleration options. See the ``README`` files in those directories and the
:doc:`Accelerator packages <Speed_packages>` pages for instructions on how
to build LAMMPS and run on that kind of hardware.

The ``bench/POTENTIALS`` directory has input files which correspond to the
table of results in the
`Potentials <https://www.lammps.org/bench.html#potentials>`_ section of
the Benchmarks web page. So you can also run those test problems on
@ -50,7 +50,7 @@ your machine.
The `billion-atom <https://www.lammps.org/bench.html#billion>`_ section
of the Benchmarks web page has performance data for very large
benchmark runs of simple Lennard-Jones (LJ) models, which use the
``bench/in.lj`` input script.
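For example, a scaled-up run of the LJ benchmark might look like the sketch
below (this assumes the size variables ``x``, ``y``, ``z`` defined in the stock
``bench/in.lj`` script; the binary name and task count are placeholders):

.. code-block:: bash

   # 8x the default atom count (2x in each dimension) on 16 MPI tasks
   mpirun -np 16 lmp_mpi -var x 2 -var y 2 -var z 2 -in bench/in.lj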
----------

View File

@ -38,10 +38,10 @@ to have an NVIDIA GPU and install the corresponding NVIDIA CUDA
toolkit software on your system (this is only tested on Linux
and unsupported on Windows):

* Check if you have an NVIDIA GPU: ``cat /proc/driver/nvidia/gpus/*/information``
* Go to https://developer.nvidia.com/cuda-downloads
* Install a driver and toolkit appropriate for your system (SDK is not necessary)
* Run ``lammps/lib/gpu/nvc_get_devices`` (after building the GPU library, see below) to
  list supported devices and properties

To compile and use this package in OpenCL mode, you currently need
@ -51,7 +51,7 @@ installed. There can be multiple of them for the same or different hardware
(GPUs, CPUs, Accelerators) installed at the same time. OpenCL refers to those
as 'platforms'. The GPU library will try to auto-select the best suitable platform,
but this can be overridden using the platform option of the :doc:`package <package>`
command. Run ``lammps/lib/gpu/ocl_get_devices`` to get a list of available
platforms and devices with a suitable ICD available.

To compile and use this package for Intel GPUs, OpenCL or the Intel oneAPI
@ -63,7 +63,7 @@ provides optimized C++, MPI, and many other libraries and tools. See:
If you do not have a discrete GPU card installed, this package can still provide
significant speedups on some CPUs that include integrated GPUs. Additionally, for
many Macs, OpenCL is already included with the OS and Makefiles are available
in the ``lib/gpu`` directory.

To compile and use this package in HIP mode, you have to have the AMD ROCm
software installed. Versions of ROCm older than 3.5 are currently deprecated
@ -94,31 +94,36 @@ shared by 4 MPI tasks.
The GPU package also has limited support for OpenMP for both
multi-threading and vectorization of routines that are run on the CPUs.
This requires that the GPU library and LAMMPS are built with flags to
enable OpenMP support (e.g. ``-fopenmp``). Some styles for time integration
are also available in the GPU package. These run completely on the CPUs
in full double precision, but exploit multi-threading and vectorization
for faster performance.

Use the ``-sf gpu`` :doc:`command-line switch <Run_options>`, which will
automatically append "gpu" to styles that support it. Use the ``-pk
gpu Ng`` :doc:`command-line switch <Run_options>` to set ``Ng`` = # of
GPUs/node to use. If ``Ng`` is 0, the number is selected automatically as
the number of matching GPUs that have the highest number of compute
cores.

.. code-block:: bash

   # 1 MPI task uses 1 GPU
   lmp_machine -sf gpu -pk gpu 1 -in in.script

   # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
   mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script

   # ditto on 4 16-core nodes
   mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script

Note that if the ``-sf gpu`` switch is used, it also issues a default
:doc:`package gpu 0 <package>` command, which will result in
automatic selection of the number of GPUs to use.

Using the ``-pk`` switch explicitly allows setting the number of
GPUs/node to use and additional options. Its syntax is the same as
the ``package gpu`` command. See the :doc:`package <package>`
command page for details, including the default values used for
all its options if it is not specified.
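For example, explicit package options can be appended after the GPU count, as
in the sketch below (the ``split`` keyword and its value are shown only as an
illustration of passing :doc:`package gpu <package>` options; ``lmp_machine``
is a placeholder binary):

.. code-block:: bash

   # 12 MPI tasks share 2 GPUs, with an explicit fixed split of pair work
   mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 split 0.75 -in in.script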
@ -141,7 +146,7 @@ Use the :doc:`suffix gpu <suffix>` command, or you can explicitly add an
   pair_style lj/cut/gpu 2.5

You must also use the :doc:`package gpu <package>` command to enable the
GPU package, unless the ``-sf gpu`` or ``-pk gpu`` :doc:`command-line switches <Run_options>` were used. It specifies the number of
GPUs/node to use, as well as other options.

**Speed-ups to expect:**

View File

@ -41,7 +41,7 @@ precision mode. Performance improvements are shown compared to
LAMMPS *without using other acceleration packages* as these are
under active development (and subject to performance changes). The
measurements were performed using the input files available in
the ``src/INTEL/TEST`` directory with the provided run script.
These are scalable in size; the results given are with 512K
particles (524K for Liquid Crystal). Most of the simulations are
standard LAMMPS benchmarks (indicated by the filename extension in
@ -56,7 +56,7 @@ Results are speedups obtained on Intel Xeon E5-2697v4 processors
Knights Landing), and Intel Xeon Gold 6148 processors (code-named
Skylake) with "June 2017" LAMMPS built with Intel Parallel Studio
2017 update 2. Results are with 1 MPI task per physical core. See
``src/INTEL/TEST/README`` for the raw simulation rates and
instructions to reproduce.

----------
@ -82,9 +82,9 @@ order of operations compared to LAMMPS without acceleration:
* The *newton* setting applies to all atoms, not just atoms shared
  between MPI tasks
* Vectorization can change the order for adding pairwise forces
* When using the ``-DLMP_USE_MKL_RNG`` define (all included Intel optimized
  makefiles do) at build time, the random number generator for
  dissipative particle dynamics (``pair style dpd/intel``) uses the Mersenne
  Twister generator included in the Intel MKL library (that should be
  more robust than the default Marsaglia random number generator)
@ -106,36 +106,36 @@ LAMMPS should be built with the INTEL package installed.
Simulations should be run with 1 MPI task per physical *core*,
not *hardware thread*\ .

* Edit ``src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi`` as necessary.
* Set the environment variable ``KMP_BLOCKTIME=0``
* ``-pk intel 0 omp $t -sf intel`` added to LAMMPS command-line
* ``$t`` should be 2 for Intel Xeon CPUs and 2 or 4 for Intel Xeon Phi
* For some of the simple 2-body potentials without long-range
  electrostatics, performance and scalability can be better with
  the ``newton off`` setting added to the input script
* For simulations on higher node counts, add ``processors * * * grid
  numa`` to the beginning of the input script for better scalability
* If using ``kspace_style pppm`` in the input script, add
  ``kspace_modify diff ad`` for better performance

For Intel Xeon Phi CPUs:

* Runs should be performed using MCDRAM.

For simulations using ``kspace_style pppm`` on Intel CPUs supporting
AVX-512:

* Add ``kspace_modify diff ad`` to the input script
* The command-line option should be changed to
  ``-pk intel 0 omp $r lrt yes -sf intel`` where ``$r`` is the number of
  threads minus 1.
* Do not use thread affinity (set ``KMP_AFFINITY=none``)
* The ``newton off`` setting may provide better scalability

For Intel Xeon Phi co-processors (Offload):

* Edit ``src/MAKE/OPTIONS/Makefile.intel_co-processor`` as necessary
* ``-pk intel N omp 1`` added to command-line where ``N`` is the number of
  co-processors per node.

----------
@ -209,7 +209,7 @@ See the :ref:`Build extras <intel>` page for
instructions. Some additional details are covered here.

For building with make, several example Makefiles for building with
the Intel compiler are included with LAMMPS in the ``src/MAKE/OPTIONS/``
directory:

.. code-block:: bash
@ -239,35 +239,35 @@ However, if you do not have co-processors on your system, building
without offload support will produce a smaller binary.

The general requirements for Makefiles with the INTEL package
are as follows. When using Intel compilers, ``-restrict`` is required
and ``-qopenmp`` is highly recommended for ``CCFLAGS`` and ``LINKFLAGS``.
``CCFLAGS`` should include ``-DLMP_INTEL_USELRT`` (unless POSIX Threads
are not supported in the build environment) and ``-DLMP_USE_MKL_RNG``
(unless Intel Math Kernel Library (MKL) is not available in the build
environment). For Intel compilers, ``LIB`` should include ``-ltbbmalloc``
or, if the library is not available, ``-DLMP_INTEL_NO_TBB`` can be added
to ``CCFLAGS``. For builds supporting offload, ``-DLMP_INTEL_OFFLOAD`` is
required for ``CCFLAGS`` and ``-qoffload`` is required for ``LINKFLAGS``. Other
recommended ``CCFLAGS`` options for best performance are ``-O2 -fno-alias
-ansi-alias -qoverride-limits -fp-model fast=2 -no-prec-div``.
.. note::

   See the ``src/INTEL/README`` file for additional flags that
   might be needed for best performance on Intel server processors
   code-named "Skylake".

.. note::

   The vectorization and math capabilities can differ depending on
   the CPU. For Intel compilers, the ``-x`` flag specifies the type of
   processor for which to optimize. ``-xHost`` specifies that the compiler
   should build for the processor used for compiling. For Intel Xeon Phi
   x200 series processors, this option is ``-xMIC-AVX512``. For fourth
   generation Intel Xeon (v4/Broadwell) processors, ``-xCORE-AVX2`` should
   be used. For older Intel Xeon processors, ``-xAVX`` will perform best
   in general for the different simulations in LAMMPS. The default
   in most of the example Makefiles is to use ``-xHost``; however, this
   should not be used when cross-compiling.
Running LAMMPS with the INTEL package
@ -304,11 +304,11 @@ almost all cases.
uniform. Unless disabled at build time, affinity for MPI tasks and
OpenMP threads on the host (CPU) will be set by default on the host
*when using offload to a co-processor*\ . In this case, it is unnecessary
to use other methods to control affinity (e.g. ``taskset``, ``numactl``,
``I_MPI_PIN_DOMAIN``, etc.). This can be disabled with the *no_affinity*
option to the :doc:`package intel <package>` command or by disabling the
option at build time (by adding ``-DINTEL_OFFLOAD_NOAFFINITY`` to the
``CCFLAGS`` line of your Makefile). Disabling this option is not
recommended, especially when running on a machine with Intel
Hyper-Threading technology disabled.
@ -316,7 +316,7 @@ Run with the INTEL package from the command line
"""""""""""""""""""""""""""""""""""""""""""""""""""""

To enable INTEL optimizations for all available styles used in
the input script, the ``-sf intel`` :doc:`command-line switch <Run_options>` can be used without any requirement for
editing the input script. This switch will automatically append
"intel" to styles that support it. It also invokes a default command:
:doc:`package intel 1 <package>`. This package command is used to set
@ -329,15 +329,15 @@ will be used with automatic balancing of work between the CPU and the
co-processor.

You can specify different options for the INTEL package by using
the ``-pk intel Nphi`` :doc:`command-line switch <Run_options>` with
keyword/value pairs as specified in the documentation. Here, ``Nphi`` = #
of Xeon Phi co-processors/node (ignored without offload
support). Common options to the INTEL package include *omp* to
override any ``OMP_NUM_THREADS`` setting and specify the number of OpenMP
threads, *mode* to set the floating-point precision mode, and *lrt* to
enable Long-Range Thread mode as described below. See the :doc:`package intel <package>` command for details, including the default values
used for all its options if not specified, and how to set the number
of OpenMP threads via the ``OMP_NUM_THREADS`` environment variable if
desired.

Examples (see documentation for your MPI/Machine for differences in
@ -345,8 +345,13 @@ launching MPI applications):
.. code-block:: bash

   # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads
   mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script

   # Don't use any co-processors that might be available,
   # use 2 OpenMP threads for each task, use double precision
   mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script \
       -pk intel 0 omp 2 mode double

Or run with the INTEL package by editing an input script
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
@ -386,19 +391,19 @@ Long-Range Thread (LRT) mode is an option to the :doc:`package intel <package>`
with SMT. It generates an extra pthread for each MPI task. The thread
is dedicated to performing some of the PPPM calculations and MPI
communications. This feature requires setting the pre-processor flag
``-DLMP_INTEL_USELRT`` in the makefile when compiling LAMMPS. It is unset
in the default makefiles (``Makefile.mpi`` and ``Makefile.serial``) but
it is set in all makefiles tuned for the INTEL package. On Intel
Xeon Phi x200 series CPUs, the LRT feature will likely improve
performance, even on a single node. On Intel Xeon processors, using
this mode might result in better performance when using multiple nodes,
depending on the specific machine configuration. To enable LRT mode,
specify that the number of OpenMP threads is one less than would
normally be used for the run and add the ``lrt yes`` option to the ``-pk``
command-line switch or ``package intel`` command. For example, if a run
would normally perform best with ``-pk intel 0 omp 4``, instead use
``-pk intel 0 omp 3 lrt yes``. When using LRT, you should set the
environment variable ``KMP_AFFINITY=none``. LRT mode is not supported
when using offload.
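For instance, a run that would otherwise use 4 OpenMP threads per MPI task
could be launched in LRT mode as in this sketch (binary name, task count, and
input file are placeholders):

.. code-block:: bash

   # 3 compute threads plus 1 Long-Range Thread per MPI task
   export KMP_AFFINITY=none
   mpirun -np 36 lmp_machine -sf intel -pk intel 0 omp 3 lrt yes -in in.script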
.. note::
@ -411,12 +416,12 @@ Not all styles are supported in the INTEL package. You can mix
the INTEL package with styles from the :doc:`OPT <Speed_opt>`
package or the :doc:`OPENMP package <Speed_omp>`. Of course, this
requires that these packages were installed at build time. This can
be performed automatically by using ``-sf hybrid intel opt`` or ``-sf hybrid
intel omp`` command-line options. Alternatively, the "opt" and "omp"
suffixes can be appended manually in the input script. For the latter,
the :doc:`package omp <package>` command must be in the input script or
the ``-pk omp Nt`` :doc:`command-line switch <Run_options>` must be used
where ``Nt`` is the number of OpenMP threads. The number of OpenMP threads
should not be set differently for the different packages. Note that
the :doc:`suffix hybrid intel omp <suffix>` command can also be used
within the input script to automatically append the "omp" suffix to
@ -436,7 +441,7 @@ alternative to LRT mode and the two cannot be used together.
Currently, when using Intel MPI with Intel Xeon Phi x200 series
CPUs, better performance might be obtained by setting the
environment variable ``I_MPI_SHM_LMT=shm`` for Linux kernels that do
not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
series processors will always perform better using MCDRAM. Please
consult your system documentation for the best approach to specify
@ -515,7 +520,7 @@ per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for :doc:`atom sorting <atom_modify>` is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. All the
available co-processor threads on each Phi will be divided among MPI
tasks, unless the ``tptask`` option of the ``-pk intel`` :doc:`command-line switch <Run_options>` is used to limit the co-processor threads per
MPI task.
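As an illustration, an offload run that caps the co-processor threads per MPI
task might be launched as sketched below (the binary name, task count, and the
limit of 60 threads are placeholders):

.. code-block:: bash

   # 1 co-processor/node, 1 OpenMP thread per task on the host,
   # at most 60 co-processor threads per MPI task
   mpirun -np 24 lmp_machine -sf intel -pk intel 1 omp 1 tptask 60 -in in.script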
Restrictions

View File

@ -48,7 +48,7 @@ version 23 November 2023 and Kokkos version 4.2.
Kokkos requires using a compiler that supports the C++17 standard. For
some compilers, it may be necessary to add a flag to enable C++17 support.
For example, the GNU compiler uses the ``-std=c++17`` flag. For a list of
compilers that have been tested with the Kokkos library, see the
`requirements document of the Kokkos Wiki
<https://kokkos.github.io/kokkos-core-wiki/requirements.html>`_.
@ -111,14 +111,21 @@ for CPU acceleration, assuming one or more 16-core nodes.
.. code-block:: bash

   # 1 node, 16 MPI tasks/node, no multi-threading
   mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj

   # 2 nodes, 1 MPI task/node, 16 threads/task
   mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj

   # 1 node, 2 MPI tasks/node, 8 threads/task
   mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj

   # 8 nodes, 4 MPI tasks/node, 4 threads/task
   mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj

To run using the KOKKOS package, use the ``-k on``, ``-sf kk`` and ``-pk
kokkos`` :doc:`command-line switches <Run_options>` in your ``mpirun``
command. You must use the ``-k on`` :doc:`command-line switch <Run_options>` to enable the KOKKOS package. It takes
additional arguments for hardware settings appropriate to your system.

For OpenMP use:
@ -126,18 +133,18 @@ For OpenMP use:
   -k on t Nt

The ``t Nt`` option specifies how many OpenMP threads per MPI task to
use with a node. The default is ``Nt`` = 1, which is MPI-only mode. Note
that the product of MPI tasks \* OpenMP threads/task should not exceed
the physical number of cores (on a node), otherwise performance will
suffer. If Hyper-Threading (HT) is enabled, then the product of MPI
tasks \* OpenMP threads/task should not exceed the physical number of
cores \* hardware threads. The ``-k on`` switch also issues a
``package kokkos`` command (with no additional arguments) which sets
various KOKKOS options to default values, as discussed on the
:doc:`package <package>` command doc page.
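For instance, on a 16-core node with 2-way Hyper-Threading, a launch that
stays within the 32 available hardware threads could look like this sketch
(binary and input file follow the examples above):

.. code-block:: bash

   # 1 node, 8 MPI tasks x 4 OpenMP threads = 32 hardware threads
   mpirun -np 8 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj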
The "-sf kk" :doc:`command-line switch <Run_options>` will automatically The ``-sf kk`` :doc:`command-line switch <Run_options>` will automatically
append the "/kk" suffix to styles that support it. In this manner no append the "/kk" suffix to styles that support it. In this manner no
modification to the input script is needed. Alternatively, one can run modification to the input script is needed. Alternatively, one can run
with the KOKKOS package by editing the input script as described with the KOKKOS package by editing the input script as described
@ -146,20 +153,22 @@ below.
.. note::

   When using a single OpenMP thread, the Kokkos Serial back end (i.e.
   ``Makefile.kokkos_mpi_only``) will give better performance than the OpenMP
   back end (i.e. ``Makefile.kokkos_omp``) because some of the overhead to make
   the code thread-safe is removed.

.. note::

   Use the ``-pk kokkos`` :doc:`command-line switch <Run_options>` to
   change the default :doc:`package kokkos <package>` options. See its doc
   page for details and default settings. Experimenting with its options
   can provide a speed-up for specific calculations. For example:

   .. code-block:: bash

      # Newton on, Half neighbor list, non-threaded comm
      mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk \
          -pk kokkos newton on neigh half comm no -in in.lj

If the :doc:`newton <newton>` command is used in the input
script, it can also override the Newton flag defaults.
@ -172,7 +181,7 @@ small numbers of threads (i.e. 8 or less) but does increase memory
footprint and is not scalable to large numbers of threads. An
alternative to data duplication is to use thread-level atomic operations
which do not require data duplication. The use of atomic operations can
be enforced by compiling LAMMPS with the ``-DLMP_KOKKOS_USE_ATOMICS``
pre-processor flag. Most but not all Kokkos-enabled pair_styles support
data duplication. Alternatively, full neighbor lists avoid the need for
duplication or atomic operations but require more compute operations per
@ -190,10 +199,13 @@ they do not migrate during a simulation.
If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

.. code-block:: bash

   # OpenMPI 1.8
   mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...

   # Mvapich2 2.0
   mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ...
For binding threads with KOKKOS OpenMP, use thread affinity environment
variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12
@ -222,15 +234,24 @@ Examples of mpirun commands that follow these rules are shown below.
.. code-block:: bash

   # Running on an Intel KNL node with 68 cores
   # (272 threads/node via 4x hardware threading):

   # 1 node, 64 MPI tasks/node, 4 threads/task
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

   # 1 node, 66 MPI tasks/node, 4 threads/task
   mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

   # 1 node, 32 MPI tasks/node, 8 threads/task
   mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj

   # 8 nodes, 64 MPI tasks/node, 4 threads/task
   mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

The ``-np`` setting of the mpirun command sets the number of MPI
tasks/node. The ``-k on t Nt`` command-line switch sets the number of
threads/task as ``Nt``. The product of these two values should be N, i.e.
256 or 264.
.. note::
@ -240,7 +261,7 @@ threads/task as Nt. The product of these two values should be N, i.e.
flag to "on" for both pairwise and bonded interactions. This will
typically be best for many-body potentials. For simpler pairwise
potentials, it may be faster to use a "full" neighbor list with
Newton flag to "off". Use the ``-pk kokkos`` :doc:`command-line switch
<Run_options>` to change the default :doc:`package kokkos <package>`
options. See its documentation page for details and default
settings. Experimenting with its options can provide a speed-up for
@ -248,8 +269,12 @@ threads/task as Nt. The product of these two values should be N, i.e.
.. code-block:: bash

   # Newton on, half neighbor list, threaded comm
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm host -in in.reax

   # Newton off, full neighbor list, non-threaded comm
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk \
       -pk kokkos newton off neigh full comm no -in in.lj
.. note::
@ -266,8 +291,8 @@ threads/task as Nt. The product of these two values should be N, i.e.
Running on GPUs
^^^^^^^^^^^^^^^

Use the ``-k`` :doc:`command-line switch <Run_options>` to specify the
number of GPUs per node. Typically the ``-np`` setting of the ``mpirun`` command
should set the number of MPI tasks/node to be equal to the number of
physical GPUs on the node. You can assign multiple MPI tasks to the same
GPU with the KOKKOS package, but this is usually only faster if some
@ -290,8 +315,11 @@ one or more nodes, each with two GPUs:
.. code-block:: bash

   # 1 node, 2 MPI tasks/node, 2 GPUs/node
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj

   # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total)
   mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj
.. note::
@ -303,7 +331,7 @@ one or more nodes, each with two GPUs:
neighbor lists and setting the Newton flag to "on" may be faster. For
many pair styles, setting the neighbor binsize equal to twice the CPU
default value will give a speedup, which is the default when running on
GPUs. Use the ``-pk kokkos`` :doc:`command-line switch <Run_options>`
to change the default :doc:`package kokkos <package>` options. See
its documentation page for details and default
settings. Experimenting with its options can provide a speed-up for
@ -311,7 +339,9 @@ one or more nodes, each with two GPUs:
.. code-block:: bash

   # Newton on, half neighbor list, set binsize = neighbor ghost cutoff
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk \
       -pk kokkos newton on neigh half binsize 2.8 -in in.lj
.. note::
@ -329,7 +359,7 @@ one or more nodes, each with two GPUs:
more), the creation of the atom map (required for molecular systems)
on the GPU can slow down significantly or run out of GPU memory and
thus slow down the whole calculation or cause a crash. You can use
the ``-pk kokkos atom/map no`` :doc:`command-line switch <Run_options>`
or the :doc:`package kokkos atom/map no <package>` command to create
the atom map on the CPU instead.
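A command line using this setting might look like the sketch below (binary,
GPU count, and input file follow the earlier GPU examples):

.. code-block:: bash

   # build the atom map on the CPU instead of the GPU
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos atom/map no -in in.lj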
@ -346,20 +376,20 @@ one or more nodes, each with two GPUs:
.. note::

   To get an accurate timing breakdown between time spent in pair,
   kspace, etc., you must set the environment variable ``CUDA_LAUNCH_BLOCKING=1``.
   However, this will reduce performance and is not recommended for production runs.
Run with the KOKKOS package by editing an input script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Alternatively the effect of the ``-sf`` or ``-pk`` switches can be
duplicated by adding the :doc:`package kokkos <package>` or :doc:`suffix kk <suffix>` commands to your input script.
The discussion above for building LAMMPS with the KOKKOS package, the
``mpirun`` or ``mpiexec`` command, and setting appropriate thread
properties is the same.

You must still use the ``-k on`` :doc:`command-line switch <Run_options>`
to enable the KOKKOS package, and specify its additional arguments for
hardware options appropriate to your system, as documented above.
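If the input script already contains the :doc:`package kokkos <package>` and
:doc:`suffix kk <suffix>` commands, the command line reduces to the hardware
switch alone, as in this sketch (binary, GPU count, and input file are
illustrative):

.. code-block:: bash

   # no -sf/-pk switches needed: the input script sets the suffix and package options
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -in in.script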
@ -378,7 +408,7 @@ wish to change any of its option defaults, as set by the "-k on"
With the KOKKOS package, both OpenMP multi-threading and GPUs can be
compiled and used together in a few special cases. In the makefile for
the conventional build, the ``KOKKOS_DEVICES`` variable must include both
"Cuda" and "OpenMP", as is the case for ``/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi``.
.. code-block:: bash
@ -390,14 +420,14 @@ in the ``kokkos-cuda.cmake`` CMake preset file.
.. code-block:: bash

   cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes ../cmake

The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using the ``-sf kk`` in the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
style in the input script, the Kokkos OpenMP (CPU) version of that
specific style will be used instead. Set the number of OpenMP threads
as ``t Nt`` and the number of GPUs as ``g Ng``

.. parsed-literal::
@ -409,7 +439,7 @@ For example, the command to run with 1 GPU and 8 OpenMP threads is then:
   mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk

Conversely, if the ``-sf kk/host`` is used in the command line and then
the "/kk" or "/kk/device" suffix is added to a specific style in your
input script, then only that specific style will run on the GPU while
everything else will run on the CPU in OpenMP mode. Note that the
@ -418,11 +448,11 @@ special case:
A kspace style and/or molecular topology (bonds, angles, etc.) running
on the host CPU can overlap with a pair style running on the
GPU. First compile with ``--default-stream per-thread`` added to ``CCFLAGS``
in the Kokkos CUDA Makefile. Then explicitly use the "/kk/host"
suffix for kspace and bonds, angles, etc. in the input file and the
"kk" suffix (equal to "kk/device") on the command line. Also make
sure the environment variable ``CUDA_LAUNCH_BLOCKING`` is not set to "1"
so CPU/GPU overlap can occur.
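A run exploiting this overlap might look like the sketch below; it assumes an
input script in which kspace and the bonded styles carry the "/kk/host"
suffix, with placeholder binary, GPU, and thread counts:

.. code-block:: bash

   # pair style runs on the GPU via the default "kk" suffix from -sf kk;
   # kspace and bonded terms marked /kk/host in the input run on the CPU
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 t 8 -sf kk -in in.script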
Performance to expect

View File

@ -28,32 +28,39 @@ These examples assume one or more 16-core nodes.
.. code-block:: bash

   # 1 MPI task, 16 threads according to OMP_NUM_THREADS
   env OMP_NUM_THREADS=16 lmp_omp -sf omp -in in.script

   # 1 MPI task, no threads, optimized kernels
   lmp_mpi -sf omp -in in.script

   # 4 MPI tasks, 4 threads/task
   mpirun -np 4 lmp_omp -sf omp -pk omp 4 -in in.script

   # 8 nodes, 4 MPI tasks/node, 4 threads/task
   mpirun -np 32 -ppn 4 lmp_omp -sf omp -pk omp 4 -in in.script
The ``mpirun`` or ``mpiexec`` command sets the total number of MPI tasks
used by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its ``-np`` and ``-ppn`` switches. Ditto for OpenMPI via ``-np`` and ``-npernode``.

You need to choose how many OpenMP threads per MPI task will be used
by the OPENMP package. Note that the product of MPI tasks \*
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.

As in the lines above, use the ``-sf omp`` :doc:`command-line switch <Run_options>`, which will automatically append "omp" to
styles that support it. The ``-sf omp`` switch also issues a default
:doc:`package omp 0 <package>` command, which will set the number of
threads per MPI task via the ``OMP_NUM_THREADS`` environment variable.

You can also use the ``-pk omp Nt`` :doc:`command-line switch <Run_options>` to explicitly set ``Nt`` = # of OpenMP threads
per MPI task to use, as well as additional options. Its syntax is the
same as the :doc:`package omp <package>` command whose page gives
details, including the default values used if it is not specified. It
also gives more details on how to set the number of threads via the
``OMP_NUM_THREADS`` environment variable.
Or run with the OPENMP package by editing an input script
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
@ -71,7 +78,7 @@ Use the :doc:`suffix omp <suffix>` command, or you can explicitly add an
You must also use the :doc:`package omp <package>` command to enable the
OPENMP package. When you do this you also specify how many threads
per MPI task to use. The command page explains other options and
how to set the number of threads via the ``OMP_NUM_THREADS`` environment
variable.
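As a sketch, if the input script itself carries the :doc:`package omp <package>`
and :doc:`suffix omp <suffix>` commands, only the MPI launch is needed on the
command line (binary name and task count are illustrative):

.. code-block:: bash

   # in.script contains "package omp 4" and "suffix omp",
   # so no -sf/-pk switches are required here
   mpirun -np 4 lmp_omp -in in.script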
Speed-up to expect

View File

@ -80,23 +80,30 @@ it provides, follow these general steps. Details vary from package to
package and are explained in the individual accelerator doc pages,
listed above:

+-----------------------------------------------------------+---------------------------------------------+
| build the accelerator library                             | only for GPU package                        |
+-----------------------------------------------------------+---------------------------------------------+
| install the accelerator package                           | ``make yes-opt``, ``make yes-intel``, etc   |
+-----------------------------------------------------------+---------------------------------------------+
| add compile/link flags to ``Makefile.machine``            | only for INTEL, KOKKOS, OPENMP,             |
| in ``src/MAKE``                                           | OPT packages                                |
+-----------------------------------------------------------+---------------------------------------------+
| re-build LAMMPS                                           | ``make machine``                            |
+-----------------------------------------------------------+---------------------------------------------+
| prepare and test a regular LAMMPS simulation              | ``lmp_machine -in in.script;``              |
|                                                           | ``mpirun -np 32 lmp_machine -in in.script`` |
+-----------------------------------------------------------+---------------------------------------------+
| enable specific accelerator support via ``-k on``         | only needed for KOKKOS package              |
| :doc:`command-line switch <Run_options>`                  |                                             |
+-----------------------------------------------------------+---------------------------------------------+
| set any needed options for the package via ``-pk``        | only if defaults need to be changed         |
| :doc:`command-line switch <Run_options>` or               |                                             |
| :doc:`package <package>` command                          |                                             |
+-----------------------------------------------------------+---------------------------------------------+
| use accelerated styles in your input via ``-sf``          | ``lmp_machine -in in.script -sf gpu``       |
| :doc:`command-line switch <Run_options>` or               |                                             |
| :doc:`suffix <suffix>` command                            |                                             |
+-----------------------------------------------------------+---------------------------------------------+
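Read top to bottom, these steps might look like the following sketch for the
OPT package (``machine`` and ``lmp_machine`` stand in for your usual make
target and binary):

.. code-block:: bash

   make yes-opt                                      # install the accelerator package
   make machine                                      # re-build LAMMPS
   mpirun -np 32 lmp_machine -in in.script           # test a regular simulation
   mpirun -np 32 lmp_machine -sf opt -in in.script   # use accelerated styles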
Note that the first 4 steps can be done as a single command with
suitable make command invocations. This is discussed on the