docs: update speed section
The 5 standard problems are as follows:

#. LJ = atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff (55
   neighbors per atom), NVE integration

#. Chain = bead-spring polymer melt of 100-mer chains, FENE bonds and LJ
   pairwise interactions with a :math:`2^{\frac{1}{6}}` sigma cutoff (5 neighbors per
   atom), NVE integration

#. EAM = metallic solid, Cu EAM potential with 4.95 Angstrom cutoff (45
   neighbors per atom), NVE integration
Input files for these 5 problems are provided in the bench directory
of the LAMMPS distribution. Each has 32,000 atoms and runs for 100
timesteps. The size of the problem (number of atoms) can be varied
using command-line switches as described in the ``bench/README`` file.
This is an easy way to test performance and either strong or weak
scalability on your machine.

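To make the size scaling concrete, here is a hedged sketch; the ``-v x/y/z`` variable names follow the convention described in ``bench/README``, and ``lmp_mpi`` is a placeholder for whatever your LAMMPS binary is called.

```shell
# Hedged sketch: variable names per bench/README; "lmp_mpi" is a placeholder.
# Each of x, y, z multiplies one box dimension, so the atom count grows as
# the product of the three factors (32,000 atoms at the default size).
SCALE=2                                      # double each box dimension
ATOMS=$(( 32000 * SCALE * SCALE * SCALE ))
echo "mpirun -np 8 lmp_mpi -v x $SCALE -v y $SCALE -v z $SCALE -in in.lj"
echo "total atoms: $ATOMS"
```

Growing the box with the task count (weak scaling) keeps atoms-per-task constant; fixing the box while adding tasks tests strong scaling.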
The bench directory includes a few ``log.*`` files that show performance
of these 5 problems on 1 or 4 cores of a Linux desktop. The ``bench/FERMI``
and ``bench/KEPLER`` directories have input files, scripts, and instructions
for running the same (or similar) problems using OpenMP or GPU or Xeon
Phi acceleration options. See the ``README`` files in those directories and the
:doc:`Accelerator packages <Speed_packages>` pages for instructions on how
to build LAMMPS and run on that kind of hardware.

The ``bench/POTENTIALS`` directory has input files which correspond to the
table of results on the
`Potentials <https://www.lammps.org/bench.html#potentials>`_ section of
the Benchmarks web page. So you can also run those test problems on
your machine.

The `billion-atom <https://www.lammps.org/bench.html#billion>`_ section
of the Benchmarks web page has performance data for very large
benchmark runs of simple Lennard-Jones (LJ) models, which use the
``bench/in.lj`` input script.

----------

to have an NVIDIA GPU and install the corresponding NVIDIA CUDA
toolkit software on your system (this is only tested on Linux
and unsupported on Windows):

* Check if you have an NVIDIA GPU: ``cat /proc/driver/nvidia/gpus/*/information``
* Go to https://developer.nvidia.com/cuda-downloads
* Install a driver and toolkit appropriate for your system (SDK is not necessary)
* Run ``lammps/lib/gpu/nvc_get_devices`` (after building the GPU library, see below) to
  list supported devices and properties

To compile and use this package in OpenCL mode, you currently need
installed. There can be multiple of them for the same or different hardware
(GPUs, CPUs, Accelerators) installed at the same time. OpenCL refers to those
as 'platforms'. The GPU library will try to auto-select the most suitable platform,
but this can be overridden using the platform option of the :doc:`package <package>`
command. Run ``lammps/lib/gpu/ocl_get_devices`` to get a list of available
platforms and devices with a suitable ICD available.

To compile and use this package for Intel GPUs, OpenCL or the Intel oneAPI
provides optimized C++, MPI, and many other libraries and tools. See:

If you do not have a discrete GPU card installed, this package can still provide
significant speedups on some CPUs that include integrated GPUs. Additionally, for
many Macs, OpenCL is already included with the OS and Makefiles are available
in the ``lib/gpu`` directory.

To compile and use this package in HIP mode, you have to have the AMD ROCm
software installed. Versions of ROCm older than 3.5 are currently deprecated
shared by 4 MPI tasks.

The GPU package also has limited support for OpenMP for both
multi-threading and vectorization of routines that are run on the CPUs.
This requires that the GPU library and LAMMPS are built with flags to
enable OpenMP support (e.g. ``-fopenmp``). Some styles for time integration
are also available in the GPU package. These run completely on the CPUs
in full double precision, but exploit multi-threading and vectorization
for faster performance.

Use the ``-sf gpu`` :doc:`command-line switch <Run_options>`, which will
automatically append "gpu" to styles that support it. Use the
``-pk gpu Ng`` :doc:`command-line switch <Run_options>` to set ``Ng`` = # of
GPUs/node to use. If ``Ng`` is 0, the number is selected automatically as
the number of matching GPUs that have the highest number of compute
cores.

.. code-block:: bash

   # 1 MPI task uses 1 GPU
   lmp_machine -sf gpu -pk gpu 1 -in in.script

   # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
   mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script

   # ditto on 4 16-core nodes
   mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script

Note that if the ``-sf gpu`` switch is used, it also issues a default
:doc:`package gpu 0 <package>` command, which will result in
automatic selection of the number of GPUs to use.
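To make the task-to-GPU mapping in the examples above concrete, this small sketch computes how many MPI tasks end up sharing each GPU (plain integer arithmetic, not a LAMMPS feature):

```shell
# Sketch only: tasks per GPU for a given launch.
# Mirrors the "mpirun -np 12 ... -pk gpu 2" example above.
NP=12   # total MPI tasks on the node
NG=2    # value passed to "-pk gpu", GPUs per node
echo "$(( NP / NG )) MPI tasks share each GPU"
```

Oversubscribing a GPU with several MPI tasks is often faster than one task per GPU, as the examples above suggest.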

Using the ``-pk`` switch explicitly allows for setting of the number of
GPUs/node to use and additional options. Its syntax is the same as
the ``package gpu`` command. See the :doc:`package <package>`
command page for details, including the default values used for
all its options if it is not specified.

Use the :doc:`suffix gpu <suffix>` command, or you can explicitly add an

   pair_style lj/cut/gpu 2.5

You must also use the :doc:`package gpu <package>` command to enable the
GPU package, unless the ``-sf gpu`` or ``-pk gpu`` :doc:`command-line switches <Run_options>` were used. It specifies the number of
GPUs/node to use, as well as other options.

**Speed-ups to expect:**

precision mode. Performance improvements are shown compared to
LAMMPS *without using other acceleration packages* as these are
under active development (and subject to performance changes). The
measurements were performed using the input files available in
the ``src/INTEL/TEST`` directory with the provided run script.
These are scalable in size; the results given are with 512K
particles (524K for Liquid Crystal). Most of the simulations are
standard LAMMPS benchmarks (indicated by the filename extension in
Knights Landing), and Intel Xeon Gold 6148 processors (code-named
Skylake) with "June 2017" LAMMPS built with Intel Parallel Studio
2017 update 2. Results are with 1 MPI task per physical core. See
``src/INTEL/TEST/README`` for the raw simulation rates and
instructions to reproduce.

----------

order of operations compared to LAMMPS without acceleration:

* The *newton* setting applies to all atoms, not just atoms shared
  between MPI tasks
* Vectorization can change the order for adding pairwise forces
* When using the ``-DLMP_USE_MKL_RNG`` define (all included Intel-optimized
  makefiles do) at build time, the random number generator for
  dissipative particle dynamics (pair style ``dpd/intel``) uses the Mersenne
  Twister generator included in the Intel MKL library (that should be
  more robust than the default Marsaglia random number generator)
LAMMPS should be built with the INTEL package installed.
Simulations should be run with 1 MPI task per physical *core*,
not *hardware thread*\ .

* Edit ``src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi`` as necessary.
* Set the environment variable ``KMP_BLOCKTIME=0``
* ``-pk intel 0 omp $t -sf intel`` added to LAMMPS command-line
* ``$t`` should be 2 for Intel Xeon CPUs and 2 or 4 for Intel Xeon Phi
* For some of the simple 2-body potentials without long-range
  electrostatics, performance and scalability can be better with
  the ``newton off`` setting added to the input script
* For simulations on higher node counts, add
  ``processors * * * grid numa`` to the beginning of the input script
  for better scalability
* If using ``kspace_style pppm`` in the input script, add
  ``kspace_modify diff ad`` for better performance

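The thread-count guidance in the bullets above can be sketched as a small launcher helper. This is a hypothetical script; only the generated ``-pk intel 0 omp $t -sf intel`` switch itself comes from the documentation above.

```shell
# Hypothetical helper: pick $t per the guidance above.
ARCH=xeon                 # set to "xeon" or "xeon-phi" for your hardware
case "$ARCH" in
  xeon)     t=2 ;;        # 2 threads for Intel Xeon CPUs
  xeon-phi) t=4 ;;        # 2 or 4 for Intel Xeon Phi
esac
echo "-pk intel 0 omp $t -sf intel"
```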
For Intel Xeon Phi CPUs:

* Runs should be performed using MCDRAM.

For simulations using ``kspace_style pppm`` on Intel CPUs supporting
AVX-512:

* Add ``kspace_modify diff ad`` to the input script
* The command-line option should be changed to
  ``-pk intel 0 omp $r lrt yes -sf intel`` where ``$r`` is the number of
  threads minus 1.
* Do not use thread affinity (set ``KMP_AFFINITY=none``)
* The ``newton off`` setting may provide better scalability

For Intel Xeon Phi co-processors (Offload):

* Edit ``src/MAKE/OPTIONS/Makefile.intel_co-processor`` as necessary
* ``-pk intel N omp 1`` added to command-line where ``N`` is the number of
  co-processors per node.

----------

See the :ref:`Build extras <intel>` page for
instructions. Some additional details are covered here.

For building with make, several example Makefiles for building with
the Intel compiler are included with LAMMPS in the ``src/MAKE/OPTIONS/``
directory:

.. code-block:: bash

However, if you do not have co-processors on your system, building
without offload support will produce a smaller binary.

The general requirements for Makefiles with the INTEL package
are as follows. When using Intel compilers, ``-restrict`` is required
and ``-qopenmp`` is highly recommended for ``CCFLAGS`` and ``LINKFLAGS``.
``CCFLAGS`` should include ``-DLMP_INTEL_USELRT`` (unless POSIX Threads
are not supported in the build environment) and ``-DLMP_USE_MKL_RNG``
(unless the Intel Math Kernel Library (MKL) is not available in the build
environment). For Intel compilers, ``LIB`` should include ``-ltbbmalloc``
or, if the library is not available, ``-DLMP_INTEL_NO_TBB`` can be added
to ``CCFLAGS``. For builds supporting offload, ``-DLMP_INTEL_OFFLOAD`` is
required for ``CCFLAGS`` and ``-qoffload`` is required for ``LINKFLAGS``.
Other recommended ``CCFLAGS`` options for best performance are
``-O2 -fno-alias -ansi-alias -qoverride-limits -fp-model fast=2 -no-prec-div``.

.. note::

   See the ``src/INTEL/README`` file for additional flags that
   might be needed for best performance on Intel server processors
   code-named "Skylake".

.. note::

   The vectorization and math capabilities can differ depending on
   the CPU. For Intel compilers, the ``-x`` flag specifies the type of
   processor for which to optimize. ``-xHost`` specifies that the compiler
   should build for the processor used for compiling. For Intel Xeon Phi
   x200 series processors, this option is ``-xMIC-AVX512``. For fourth
   generation Intel Xeon (v4/Broadwell) processors, ``-xCORE-AVX2`` should
   be used. For older Intel Xeon processors, ``-xAVX`` will perform best
   in general for the different simulations in LAMMPS. The default
   in most of the example Makefiles is to use ``-xHost``; however, this
   should not be used when cross-compiling.

Running LAMMPS with the INTEL package
"""""""""""""""""""""""""""""""""""""""

uniform. Unless disabled at build time, affinity for MPI tasks and
OpenMP threads on the host (CPU) will be set by default on the host
*when using offload to a co-processor*\ . In this case, it is unnecessary
to use other methods to control affinity (e.g. ``taskset``, ``numactl``,
``I_MPI_PIN_DOMAIN``, etc.). This can be disabled with the *no_affinity*
option to the :doc:`package intel <package>` command or by disabling the
option at build time (by adding ``-DINTEL_OFFLOAD_NOAFFINITY`` to the
``CCFLAGS`` line of your Makefile). Disabling this option is not
recommended, especially when running on a machine with Intel
Hyper-Threading technology disabled.

Run with the INTEL package from the command line
"""""""""""""""""""""""""""""""""""""""""""""""""""""

To enable INTEL optimizations for all available styles used in
the input script, the ``-sf intel`` :doc:`command-line switch <Run_options>` can be used without any requirement for
editing the input script. This switch will automatically append
"intel" to styles that support it. It also invokes a default command:
:doc:`package intel 1 <package>`. This package command is used to set
will be used with automatic balancing of work between the CPU and the
co-processor.

You can specify different options for the INTEL package by using
the ``-pk intel Nphi`` :doc:`command-line switch <Run_options>` with
keyword/value pairs as specified in the documentation. Here, ``Nphi`` = #
of Xeon Phi co-processors/node (ignored without offload
support). Common options to the INTEL package include *omp* to
override any ``OMP_NUM_THREADS`` setting and specify the number of OpenMP
threads, *mode* to set the floating-point precision mode, and *lrt* to
enable Long-Range Thread mode as described below. See the :doc:`package intel <package>` command for details, including the default values
used for all its options if not specified, and how to set the number
of OpenMP threads via the ``OMP_NUM_THREADS`` environment variable if
desired.

Examples (see documentation for your MPI/Machine for differences in
launching MPI applications):

.. code-block:: bash

   # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads
   mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script

   # Don't use any co-processors that might be available,
   # use 2 OpenMP threads for each task, use double precision
   mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script \
       -pk intel 0 omp 2 mode double

Or run with the INTEL package by editing an input script
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Long-Range Thread (LRT) mode is an option to the :doc:`package intel <package>`
with SMT. It generates an extra pthread for each MPI task. The thread
is dedicated to performing some of the PPPM calculations and MPI
communications. This feature requires setting the pre-processor flag
``-DLMP_INTEL_USELRT`` in the makefile when compiling LAMMPS. It is unset
in the default makefiles (``Makefile.mpi`` and ``Makefile.serial``) but
it is set in all makefiles tuned for the INTEL package. On Intel
Xeon Phi x200 series CPUs, the LRT feature will likely improve
performance, even on a single node. On Intel Xeon processors, using
this mode might result in better performance when using multiple nodes,
depending on the specific machine configuration. To enable LRT mode,
specify that the number of OpenMP threads is one less than would
normally be used for the run and add the ``lrt yes`` option to the ``-pk``
command-line suffix or "package intel" command. For example, if a run
would normally perform best with ``-pk intel 0 omp 4``, instead use
``-pk intel 0 omp 3 lrt yes``. When using LRT, you should set the
environment variable ``KMP_AFFINITY=none``. LRT mode is not supported
when using offload.

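The LRT thread arithmetic described above can be sketched as:

```shell
# Sketch of the LRT thread arithmetic: reserve one hardware thread for
# the extra pthread by using one fewer OpenMP thread than usual.
OMP_NORMAL=4                # thread count that performs best without LRT
r=$(( OMP_NORMAL - 1 ))
echo "-pk intel 0 omp $r lrt yes -sf intel"
echo "export KMP_AFFINITY=none"   # recommended when using LRT
```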
.. note::

Not all styles are supported in the INTEL package. You can mix
the INTEL package with styles from the :doc:`OPT <Speed_opt>`
package or the :doc:`OPENMP package <Speed_omp>`. Of course, this
requires that these packages were installed at build time. This can be
performed automatically by using the ``-sf hybrid intel opt`` or
``-sf hybrid intel omp`` command-line options. Alternatively, the "opt" and "omp"
suffixes can be appended manually in the input script. For the latter,
the :doc:`package omp <package>` command must be in the input script or
the ``-pk omp Nt`` :doc:`command-line switch <Run_options>` must be used
where ``Nt`` is the number of OpenMP threads. The number of OpenMP threads
should not be set differently for the different packages. Note that
the :doc:`suffix hybrid intel omp <suffix>` command can also be used
within the input script to automatically append the "omp" suffix to

alternative to LRT mode and the two cannot be used together.

Currently, when using Intel MPI with Intel Xeon Phi x200 series
CPUs, better performance might be obtained by setting the
environment variable ``I_MPI_SHM_LMT=shm`` for Linux kernels that do
not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
series processors will always perform better using MCDRAM. Please
consult your system documentation for the best approach to specify
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for :doc:`atom sorting <atom_modify>` is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. All the
available co-processor threads on each Phi will be divided among MPI
tasks, unless the ``tptask`` option of the ``-pk intel`` :doc:`command-line switch <Run_options>` is used to limit the co-processor threads per
MPI task.

Restrictions
""""""""""""

version 23 November 2023 and Kokkos version 4.2.

Kokkos requires using a compiler that supports the C++17 standard. For
some compilers, it may be necessary to add a flag to enable C++17 support.
For example, the GNU compiler uses the ``-std=c++17`` flag. For a list of
compilers that have been tested with the Kokkos library, see the
`requirements document of the Kokkos Wiki
<https://kokkos.github.io/kokkos-core-wiki/requirements.html>`_.
for CPU acceleration, assuming one or more 16-core nodes.

.. code-block:: bash

   # 1 node, 16 MPI tasks/node, no multi-threading
   mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj

   # 2 nodes, 1 MPI task/node, 16 threads/task
   mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj

   # 1 node, 2 MPI tasks/node, 8 threads/task
   mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj

   # 8 nodes, 4 MPI tasks/node, 4 threads/task
   mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj

To run using the KOKKOS package, use the ``-k on``, ``-sf kk`` and
``-pk kokkos`` :doc:`command-line switches <Run_options>` in your ``mpirun``
command. You must use the ``-k on`` :doc:`command-line switch <Run_options>` to enable the KOKKOS package. It takes
additional arguments for hardware settings appropriate to your system.
For OpenMP use:

.. code-block:: bash

   -k on t Nt

The ``t Nt`` option specifies how many OpenMP threads per MPI task to
use with a node. The default is ``Nt`` = 1, which is MPI-only mode. Note
that the product of MPI tasks \* OpenMP threads/task should not exceed
the physical number of cores (on a node), otherwise performance will
suffer. If Hyper-Threading (HT) is enabled, then the product of MPI
tasks \* OpenMP threads/task should not exceed the physical number of
cores \* hardware threads. The ``-k on`` switch also issues a
``package kokkos`` command (with no additional arguments) which sets
various KOKKOS options to default values, as discussed on the
:doc:`package <package>` command doc page.

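The tasks-times-threads rule above can be sketched as follows, assuming a 16-core node as in the earlier examples:

```shell
# Sketch: pick Nt so that tasks * threads fills, but does not exceed,
# the physical cores on one node (16 cores assumed here).
CORES=16
TASKS=2
NT=$(( CORES / TASKS ))
echo "mpirun -np $TASKS lmp_kokkos_omp -k on t $NT -sf kk -in in.lj"
```

With HT enabled, ``CORES`` can instead be the core count times the hardware threads per core, per the paragraph above.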
The ``-sf kk`` :doc:`command-line switch <Run_options>` will automatically
append the "/kk" suffix to styles that support it. In this manner no
modification to the input script is needed. Alternatively, one can run
with the KOKKOS package by editing the input script as described
below.

.. note::

   When using a single OpenMP thread, the Kokkos Serial back end (i.e.
   ``Makefile.kokkos_mpi_only``) will give better performance than the OpenMP
   back end (i.e. ``Makefile.kokkos_omp``) because some of the overhead to make
   the code thread-safe is removed.

.. note::

   Use the ``-pk kokkos`` :doc:`command-line switch <Run_options>` to
   change the default :doc:`package kokkos <package>` options. See its doc
   page for details and default settings. Experimenting with its options
   can provide a speed-up for specific calculations. For example:

.. code-block:: bash

   # Newton on, Half neighbor list, non-threaded comm
   mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk \
       -pk kokkos newton on neigh half comm no -in in.lj

If the :doc:`newton <newton>` command is used in the input
script, it can also override the Newton flag defaults.
small numbers of threads (i.e. 8 or less) but does increase memory
footprint and is not scalable to large numbers of threads. An
alternative to data duplication is to use thread-level atomic operations
which do not require data duplication. The use of atomic operations can
be enforced by compiling LAMMPS with the ``-DLMP_KOKKOS_USE_ATOMICS``
pre-processor flag. Most but not all Kokkos-enabled pair_styles support
data duplication. Alternatively, full neighbor lists avoid the need for
duplication or atomic operations but require more compute operations per
they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

.. code-block:: bash

   # OpenMPI 1.8
   mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...

   # Mvapich2 2.0
   mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ...

For binding threads with KOKKOS OpenMP, use thread affinity environment
variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12
@ -222,15 +234,24 @@ Examples of mpirun commands that follow these rules are shown below.

.. code-block:: bash

   # Running on an Intel KNL node with 68 cores (272 threads/node via 4x hardware threading):
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj            # 1 node, 64 MPI tasks/node, 4 threads/task
   mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj            # 1 node, 66 MPI tasks/node, 4 threads/task
   mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj            # 1 node, 32 MPI tasks/node, 8 threads/task
   mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj   # 8 nodes, 64 MPI tasks/node, 4 threads/task
   # Running on an Intel KNL node with 68 cores
   # (272 threads/node via 4x hardware threading):

The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these two values should be N, i.e.
   # 1 node, 64 MPI tasks/node, 4 threads/task
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

   # 1 node, 66 MPI tasks/node, 4 threads/task
   mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

   # 1 node, 32 MPI tasks/node, 8 threads/task
   mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj

   # 8 nodes, 64 MPI tasks/node, 4 threads/task
   mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

The ``-np`` setting of the mpirun command sets the number of MPI
tasks/node. The ``-k on t Nt`` command-line switch sets the number of
threads/task as ``Nt``. The product of these two values should be N, i.e.
256 or 264.
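That arithmetic is easy to sanity-check in the shell before submitting a job. A minimal sketch, using the illustrative counts from the first example above:

```bash
# Illustrative check: MPI tasks/node * threads/task should equal the
# hardware threads used per node (here 64 * 4 = 256 on a KNL node).
TASKS_PER_NODE=64
THREADS_PER_TASK=4
HW_THREADS=256

if [ $((TASKS_PER_NODE * THREADS_PER_TASK)) -eq "$HW_THREADS" ]; then
    echo "layout OK: $((TASKS_PER_NODE * THREADS_PER_TASK)) threads"
else
    echo "layout mismatch"
fi
```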

.. note::
@ -240,7 +261,7 @@ threads/task as Nt. The product of these two values should be N, i.e.
   flag to "on" for both pairwise and bonded interactions. This will
   typically be best for many-body potentials. For simpler pairwise
   potentials, it may be faster to use a "full" neighbor list with
   Newton flag to "off". Use the "-pk kokkos" :doc:`command-line switch
   Newton flag to "off". Use the ``-pk kokkos`` :doc:`command-line switch
   <Run_options>` to change the default :doc:`package kokkos <package>`
   options. See its documentation page for details and default
   settings. Experimenting with its options can provide a speed-up for
@ -248,8 +269,12 @@ threads/task as Nt. The product of these two values should be N, i.e.

.. code-block:: bash

   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm host -in in.reax                      # Newton on, half neighbor list, threaded comm
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton off neigh full comm no -in in.lj    # Newton off, full neighbor list, non-threaded comm
   # Newton on, half neighbor list, threaded comm
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm host -in in.reax

   # Newton off, full neighbor list, non-threaded comm
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk \
     -pk kokkos newton off neigh full comm no -in in.lj

.. note::

@ -266,8 +291,8 @@ threads/task as Nt. The product of these two values should be N, i.e.
Running on GPUs
^^^^^^^^^^^^^^^

Use the "-k" :doc:`command-line switch <Run_options>` to specify the
number of GPUs per node. Typically the -np setting of the mpirun command
Use the ``-k`` :doc:`command-line switch <Run_options>` to specify the
number of GPUs per node. Typically the ``-np`` setting of the ``mpirun`` command
should set the number of MPI tasks/node to be equal to the number of
physical GPUs on the node. You can assign multiple MPI tasks to the same
GPU with the KOKKOS package, but this is usually only faster if some
@ -290,8 +315,11 @@ one or more nodes, each with two GPUs:

.. code-block:: bash

   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj           # 1 node, 2 MPI tasks/node, 2 GPUs/node
   mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj   # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total)
   # 1 node, 2 MPI tasks/node, 2 GPUs/node
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj

   # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total)
   mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj

.. note::

@ -303,7 +331,7 @@ one or more nodes, each with two GPUs:
   neighbor lists and setting the Newton flag to "on" may be faster. For
   many pair styles, setting the neighbor binsize equal to twice the CPU
   default value will give speedup, which is the default when running on
   GPUs. Use the "-pk kokkos" :doc:`command-line switch <Run_options>`
   GPUs. Use the ``-pk kokkos`` :doc:`command-line switch <Run_options>`
   to change the default :doc:`package kokkos <package>` options. See
   its documentation page for details and default
   settings. Experimenting with its options can provide a speed-up for
@ -311,7 +339,9 @@ one or more nodes, each with two GPUs:

.. code-block:: bash

   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj   # Newton on, half neighbor list, set binsize = neighbor ghost cutoff
   # Newton on, half neighbor list, set binsize = neighbor ghost cutoff
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk \
     -pk kokkos newton on neigh half binsize 2.8 -in in.lj

.. note::

@ -329,7 +359,7 @@ one or more nodes, each with two GPUs:
   more), the creation of the atom map (required for molecular systems)
   on the GPU can slow down significantly or run out of GPU memory and
   thus slow down the whole calculation or cause a crash. You can use
   the "-pk kokkos atom/map no" :doc:`command-line switch <Run_options>`
   the ``-pk kokkos atom/map no`` :doc:`command-line switch <Run_options>`
   of the :doc:`package kokkos atom/map no <package>` command to create
   the atom map on the CPU instead.

@ -346,20 +376,20 @@ one or more nodes, each with two GPUs:
.. note::

   To get an accurate timing breakdown between time spent in pair,
   kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
   kspace, etc., you must set the environment variable ``CUDA_LAUNCH_BLOCKING=1``.
   However, this will reduce performance and is not recommended for production runs.
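For example, an illustrative one-off profiling invocation that combines the environment variable with one of the launch lines shown above:

```text
CUDA_LAUNCH_BLOCKING=1 mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj
```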

Run with the KOKKOS package by editing an input script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Alternatively the effect of the "-sf" or "-pk" switches can be
Alternatively the effect of the ``-sf`` or ``-pk`` switches can be
duplicated by adding the :doc:`package kokkos <package>` or :doc:`suffix kk <suffix>` commands to your input script.
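As an illustration, a hypothetical input-script header that mirrors the ``-sf kk -pk kokkos ...`` switches used earlier (the option values are examples only; pick the ones appropriate for your run):

```text
package kokkos neigh half newton on
suffix kk
```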

The discussion above for building LAMMPS with the KOKKOS package, the
``mpirun`` or ``mpiexec`` command, and setting appropriate thread
properties is the same.

You must still use the "-k on" :doc:`command-line switch <Run_options>`
You must still use the ``-k on`` :doc:`command-line switch <Run_options>`
to enable the KOKKOS package, and specify its additional arguments for
hardware options appropriate to your system, as documented above.

@ -378,7 +408,7 @@ wish to change any of its option defaults, as set by the "-k on"

With the KOKKOS package, both OpenMP multi-threading and GPUs can be
compiled and used together in a few special cases. In the makefile for
the conventional build, the KOKKOS_DEVICES variable must include both,
the conventional build, the ``KOKKOS_DEVICES`` variable must include both,
"Cuda" and "OpenMP", as is the case for ``/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi``.

.. code-block:: bash
@ -390,14 +420,14 @@ in the ``kokkos-cuda.cmake`` CMake preset file.

.. code-block:: bash

   cmake ../cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes
   cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes ../cmake

The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using the "-sf kk" in the command line gives the default CUDA version
using the ``-sf kk`` in the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
style in the input script, the Kokkos OpenMP (CPU) version of that
specific style will be used instead. Set the number of OpenMP threads
as "t Nt" and the number of GPUs as "g Ng"
as ``t Nt`` and the number of GPUs as ``g Ng``

.. parsed-literal::

@ -409,7 +439,7 @@ For example, the command to run with 1 GPU and 8 OpenMP threads is then:

   mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk

Conversely, if the "-sf kk/host" is used in the command line and then
Conversely, if the ``-sf kk/host`` is used in the command line and then
the "/kk" or "/kk/device" suffix is added to a specific style in your
input script, then only that specific style will run on the GPU while
everything else will run on the CPU in OpenMP mode. Note that the
@ -418,11 +448,11 @@ special case:

A kspace style and/or molecular topology (bonds, angles, etc.) running
on the host CPU can overlap with a pair style running on the
GPU. First compile with "--default-stream per-thread" added to CCFLAGS
GPU. First compile with ``--default-stream per-thread`` added to ``CCFLAGS``
in the Kokkos CUDA Makefile. Then explicitly use the "/kk/host"
suffix for kspace and bonds, angles, etc. in the input file and the
"kk" suffix (equal to "kk/device") on the command line. Also make
sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
sure the environment variable ``CUDA_LAUNCH_BLOCKING`` is not set to "1"
so CPU/GPU overlap can occur.
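To make the overlap concrete, a hypothetical input-script excerpt that pins kspace to the host while the pair style keeps the device default (the specific styles and parameters are illustrative, not prescriptive):

```text
kspace_style pppm/kk/host 1.0e-4
pair_style lj/cut/kk 2.5
```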

Performance to expect

@ -28,32 +28,39 @@ These examples assume one or more 16-core nodes.

.. code-block:: bash

   env OMP_NUM_THREADS=16 lmp_omp -sf omp -in in.script           # 1 MPI task, 16 threads according to OMP_NUM_THREADS
   lmp_mpi -sf omp -in in.script                                  # 1 MPI task, no threads, optimized kernels
   mpirun -np 4 lmp_omp -sf omp -pk omp 4 -in in.script           # 4 MPI tasks, 4 threads/task
   mpirun -np 32 -ppn 4 lmp_omp -sf omp -pk omp 4 -in in.script   # 8 nodes, 4 MPI tasks/node, 4 threads/task
   # 1 MPI task, 16 threads according to OMP_NUM_THREADS
   env OMP_NUM_THREADS=16 lmp_omp -sf omp -in in.script

   # 1 MPI task, no threads, optimized kernels
   lmp_mpi -sf omp -in in.script

   # 4 MPI tasks, 4 threads/task
   mpirun -np 4 lmp_omp -sf omp -pk omp 4 -in in.script

   # 8 nodes, 4 MPI tasks/node, 4 threads/task
   mpirun -np 32 -ppn 4 lmp_omp -sf omp -pk omp 4 -in in.script

The ``mpirun`` or ``mpiexec`` command sets the total number of MPI tasks
used by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
its ``-np`` and ``-ppn`` switches. Ditto for OpenMPI via ``-np`` and ``-npernode``.

You need to choose how many OpenMP threads per MPI task will be used
by the OPENMP package. Note that the product of MPI tasks \*
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.
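One way to pick the split is to query the node first; a minimal sketch, assuming a Linux node with GNU ``nproc`` (note that ``nproc`` reports logical processing units, which may count hardware threads rather than physical cores, and the 4 tasks/node is an illustrative value):

```bash
# With 4 MPI tasks per node, use at most CORES/4 OpenMP threads per task
CORES=$(nproc)
TASKS=4
echo "threads per task: $((CORES / TASKS))"
```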

As in the lines above, use the "-sf omp" :doc:`command-line switch <Run_options>`, which will automatically append "omp" to
styles that support it. The "-sf omp" switch also issues a default
As in the lines above, use the ``-sf omp`` :doc:`command-line switch <Run_options>`, which will automatically append "omp" to
styles that support it. The ``-sf omp`` switch also issues a default
:doc:`package omp 0 <package>` command, which will set the number of
threads per MPI task via the OMP_NUM_THREADS environment variable.
threads per MPI task via the ``OMP_NUM_THREADS`` environment variable.
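A minimal sketch of that mechanism, exporting the variable by hand before a launch (the value 4 is arbitrary):

```bash
# Request 4 OpenMP threads per MPI task for subsequent launches
export OMP_NUM_THREADS=4
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```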

You can also use the "-pk omp Nt" :doc:`command-line switch <Run_options>`, to explicitly set Nt = # of OpenMP threads
You can also use the ``-pk omp Nt`` :doc:`command-line switch <Run_options>`, to explicitly set ``Nt`` = # of OpenMP threads
per MPI task to use, as well as additional options. Its syntax is the
same as the :doc:`package omp <package>` command whose page gives
details, including the default values used if it is not specified. It
also gives more details on how to set the number of threads via the
OMP_NUM_THREADS environment variable.
``OMP_NUM_THREADS`` environment variable.

Or run with the OPENMP package by editing an input script
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
@ -71,7 +78,7 @@ Use the :doc:`suffix omp <suffix>` command, or you can explicitly add an

You must also use the :doc:`package omp <package>` command to enable the
OPENMP package. When you do this you also specify how many threads
per MPI task to use. The command page explains other options and
how to set the number of threads via the OMP_NUM_THREADS environment
how to set the number of threads via the ``OMP_NUM_THREADS`` environment
variable.

Speed-up to expect

@ -80,23 +80,30 @@ it provides, follow these general steps. Details vary from package to
package and are explained in the individual accelerator doc pages,
listed above:

+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
+-----------------------------------------------------------+---------------------------------------------+
| build the accelerator library                                                                                                    | only for GPU package                                               |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
| install the accelerator package                                                                                                  | make yes-opt, make yes-intel, etc                                  |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
| add compile/link flags to Makefile.machine in src/MAKE                                                                           | only for INTEL, KOKKOS, OPENMP, OPT packages                       |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
| re-build LAMMPS                                                                                                                  | make machine                                                       |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
| prepare and test a regular LAMMPS simulation                                                                                     | lmp_machine -in in.script; mpirun -np 32 lmp_machine -in in.script |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
| enable specific accelerator support via '-k on' :doc:`command-line switch <Run_options>`,                                        | only needed for KOKKOS package                                     |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
| set any needed options for the package via "-pk" :doc:`command-line switch <Run_options>` or :doc:`package <package>` command,  | only if defaults need to be changed                                |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
| use accelerated styles in your input via "-sf" :doc:`command-line switch <Run_options>` or :doc:`suffix <suffix>` command       | lmp_machine -in in.script -sf gpu                                  |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------+
+-----------------------------------------------------------+---------------------------------------------+
| install the accelerator package                           | ``make yes-opt``, ``make yes-intel``, etc   |
+-----------------------------------------------------------+---------------------------------------------+
| add compile/link flags to ``Makefile.machine``            | only for INTEL, KOKKOS, OPENMP,             |
| in ``src/MAKE``                                           | OPT packages                                |
+-----------------------------------------------------------+---------------------------------------------+
| re-build LAMMPS                                           | ``make machine``                            |
+-----------------------------------------------------------+---------------------------------------------+
| prepare and test a regular LAMMPS simulation              | ``lmp_machine -in in.script;``              |
|                                                           | ``mpirun -np 32 lmp_machine -in in.script`` |
+-----------------------------------------------------------+---------------------------------------------+
| enable specific accelerator support via ``-k on``         | only needed for KOKKOS package              |
| :doc:`command-line switch <Run_options>`                  |                                             |
+-----------------------------------------------------------+---------------------------------------------+
| set any needed options for the package via ``-pk``        | only if defaults need to be changed         |
| :doc:`command-line switch <Run_options>` or               |                                             |
| :doc:`package <package>` command                          |                                             |
+-----------------------------------------------------------+---------------------------------------------+
| use accelerated styles in your input via ``-sf``          | ``lmp_machine -in in.script -sf gpu``       |
| :doc:`command-line switch <Run_options>` or               |                                             |
| :doc:`suffix <suffix>` command                            |                                             |
+-----------------------------------------------------------+---------------------------------------------+
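Taken together, the rows above amount to a short command outline (illustrative only; ``machine``, the package name, and the suffix all vary by system):

```text
make yes-opt                          # install the accelerator package
make machine                          # re-build LAMMPS
lmp_machine -in in.script             # prepare and test a regular run
lmp_machine -in in.script -sf gpu     # run with accelerated styles
```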

Note that the first 4 steps can be done as a single command with
suitable make command invocations. This is discussed on the