docs: update speed section
@@ -15,7 +15,7 @@ The 5 standard problems are as follow:
 #. LJ = atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff (55
    neighbors per atom), NVE integration
 #. Chain = bead-spring polymer melt of 100-mer chains, FENE bonds and LJ
-   pairwise interactions with a 2\^(1/6) sigma cutoff (5 neighbors per
+   pairwise interactions with a :math:`2^{\frac{1}{6}}` sigma cutoff (5 neighbors per
    atom), NVE integration
 #. EAM = metallic solid, Cu EAM potential with 4.95 Angstrom cutoff (45
    neighbors per atom), NVE integration
@@ -29,19 +29,19 @@ The 5 standard problems are as follow:
 Input files for these 5 problems are provided in the bench directory
 of the LAMMPS distribution. Each has 32,000 atoms and runs for 100
 timesteps. The size of the problem (number of atoms) can be varied
-using command-line switches as described in the bench/README file.
+using command-line switches as described in the ``bench/README`` file.
 This is an easy way to test performance and either strong or weak
 scalability on your machine.

-The bench directory includes a few log.\* files that show performance
-of these 5 problems on 1 or 4 cores of Linux desktop. The bench/FERMI
-and bench/KEPLER directories have input files and scripts and instructions
+The bench directory includes a few ``log.*`` files that show performance
+of these 5 problems on 1 or 4 cores of a Linux desktop. The ``bench/FERMI``
+and ``bench/KEPLER`` directories have input files and scripts and instructions
 for running the same (or similar) problems using OpenMP or GPU or Xeon
-Phi acceleration options. See the README files in those directories and the
+Phi acceleration options. See the ``README`` files in those directories and the
 :doc:`Accelerator packages <Speed_packages>` pages for instructions on how
 to build LAMMPS and run on that kind of hardware.

-The bench/POTENTIALS directory has input files which correspond to the
+The ``bench/POTENTIALS`` directory has input files which correspond to the
 table of results on the
 `Potentials <https://www.lammps.org/bench.html#potentials>`_ section of
 the Benchmarks web page. So you can also run those test problems on
@@ -50,7 +50,7 @@ your machine.
 The `billion-atom <https://www.lammps.org/bench.html#billion>`_ section
 of the Benchmarks web page has performance data for very large
 benchmark runs of simple Lennard-Jones (LJ) models, which use the
-bench/in.lj input script.
+``bench/in.lj`` input script.

 ----------

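As a concrete illustration of the sizing switches this hunk refers to: the sketch below only assembles a command string (no LAMMPS required); ``lmp_machine`` is a placeholder binary name, and the ``-var x/y/z`` multipliers are the ones described in ``bench/README``.

```shell
# Build a run command that scales the 32,000-atom LJ benchmark.
# SCALE=2 replicates the box 2x in each direction (8x the atoms, i.e.
# a weak-scaling step when the rank count grows by 8x too).
SCALE=2
NP=8
CMD="mpirun -np ${NP} lmp_machine -var x ${SCALE} -var y ${SCALE} -var z ${SCALE} -in bench/in.lj"
echo "${CMD}"
```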
@@ -38,10 +38,10 @@ to have an NVIDIA GPU and install the corresponding NVIDIA CUDA
 toolkit software on your system (this is only tested on Linux
 and unsupported on Windows):

-* Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/gpus/\*/information
+* Check if you have an NVIDIA GPU: ``cat /proc/driver/nvidia/gpus/*/information``
 * Go to https://developer.nvidia.com/cuda-downloads
 * Install a driver and toolkit appropriate for your system (SDK is not necessary)
-* Run lammps/lib/gpu/nvc_get_devices (after building the GPU library, see below) to
+* Run ``lammps/lib/gpu/nvc_get_devices`` (after building the GPU library, see below) to
   list supported devices and properties

 To compile and use this package in OpenCL mode, you currently need
@@ -51,7 +51,7 @@ installed. There can be multiple of them for the same or different hardware
 (GPUs, CPUs, Accelerators) installed at the same time. OpenCL refers to those
 as 'platforms'. The GPU library will try to auto-select the best suitable platform,
 but this can be overridden using the platform option of the :doc:`package <package>`
-command. run lammps/lib/gpu/ocl_get_devices to get a list of available
+command. Run ``lammps/lib/gpu/ocl_get_devices`` to get a list of available
 platforms and devices with a suitable ICD available.

 To compile and use this package for Intel GPUs, OpenCL or the Intel oneAPI
@@ -63,7 +63,7 @@ provides optimized C++, MPI, and many other libraries and tools. See:
 If you do not have a discrete GPU card installed, this package can still provide
 significant speedups on some CPUs that include integrated GPUs. Additionally, for
 many macs, OpenCL is already included with the OS and Makefiles are available
-in the lib/gpu directory.
+in the ``lib/gpu`` directory.

 To compile and use this package in HIP mode, you have to have the AMD ROCm
 software installed. Versions of ROCm older than 3.5 are currently deprecated
@@ -94,31 +94,36 @@ shared by 4 MPI tasks.
 The GPU package also has limited support for OpenMP for both
 multi-threading and vectorization of routines that are run on the CPUs.
 This requires that the GPU library and LAMMPS are built with flags to
-enable OpenMP support (e.g. -fopenmp). Some styles for time integration
+enable OpenMP support (e.g. ``-fopenmp``). Some styles for time integration
 are also available in the GPU package. These run completely on the CPUs
 in full double precision, but exploit multi-threading and vectorization
 for faster performance.

-Use the "-sf gpu" :doc:`command-line switch <Run_options>`, which will
-automatically append "gpu" to styles that support it. Use the "-pk
-gpu Ng" :doc:`command-line switch <Run_options>` to set Ng = # of
-GPUs/node to use. If Ng is 0, the number is selected automatically as
+Use the ``-sf gpu`` :doc:`command-line switch <Run_options>`, which will
+automatically append "gpu" to styles that support it. Use the ``-pk
+gpu Ng`` :doc:`command-line switch <Run_options>` to set ``Ng`` = # of
+GPUs/node to use. If ``Ng`` is 0, the number is selected automatically as
 the number of matching GPUs that have the highest number of compute
 cores.

 .. code-block:: bash

-   lmp_machine -sf gpu -pk gpu 1 -in in.script # 1 MPI task uses 1 GPU
-   mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
-   mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes
+   # 1 MPI task uses 1 GPU
+   lmp_machine -sf gpu -pk gpu 1 -in in.script
+
+   # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
+   mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script
+
+   # ditto on 4 16-core nodes
+   mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script

-Note that if the "-sf gpu" switch is used, it also issues a default
+Note that if the ``-sf gpu`` switch is used, it also issues a default
 :doc:`package gpu 0 <package>` command, which will result in
 automatic selection of the number of GPUs to use.

-Using the "-pk" switch explicitly allows for setting of the number of
+Using the ``-pk`` switch explicitly allows for setting of the number of
 GPUs/node to use and additional options. Its syntax is the same as
-the "package gpu" command. See the :doc:`package <package>`
+the ``package gpu`` command. See the :doc:`package <package>`
 command page for details, including the default values used for
 all its options if it is not specified.

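A minimal sketch of the GPU-sharing arithmetic implied by the examples in this hunk (plain shell, no LAMMPS needed; the variable names are ours):

```shell
# With "-pk gpu Ng", the NP MPI tasks on a node share the Ng GPUs.
NP=12   # MPI tasks per node (e.g. mpirun -np 12 on one node)
NG=2    # value passed as Ng to "-pk gpu"
RANKS_PER_GPU=$(( NP / NG ))
echo "each of the ${NG} GPUs is shared by ${RANKS_PER_GPU} MPI tasks"
```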
@@ -141,7 +146,7 @@ Use the :doc:`suffix gpu <suffix>` command, or you can explicitly add an
    pair_style lj/cut/gpu 2.5

 You must also use the :doc:`package gpu <package>` command to enable the
-GPU package, unless the "-sf gpu" or "-pk gpu" :doc:`command-line switches <Run_options>` were used. It specifies the number of
+GPU package, unless the ``-sf gpu`` or ``-pk gpu`` :doc:`command-line switches <Run_options>` were used. It specifies the number of
 GPUs/node to use, as well as other options.

 **Speed-ups to expect:**
@@ -41,7 +41,7 @@ precision mode. Performance improvements are shown compared to
 LAMMPS *without using other acceleration packages* as these are
 under active development (and subject to performance changes). The
 measurements were performed using the input files available in
-the src/INTEL/TEST directory with the provided run script.
+the ``src/INTEL/TEST`` directory with the provided run script.
 These are scalable in size; the results given are with 512K
 particles (524K for Liquid Crystal). Most of the simulations are
 standard LAMMPS benchmarks (indicated by the filename extension in
@@ -56,7 +56,7 @@ Results are speedups obtained on Intel Xeon E5-2697v4 processors
 Knights Landing), and Intel Xeon Gold 6148 processors (code-named
 Skylake) with "June 2017" LAMMPS built with Intel Parallel Studio
 2017 update 2. Results are with 1 MPI task per physical core. See
-*src/INTEL/TEST/README* for the raw simulation rates and
+``src/INTEL/TEST/README`` for the raw simulation rates and
 instructions to reproduce.

 ----------
@@ -82,9 +82,9 @@ order of operations compared to LAMMPS without acceleration:
 * The *newton* setting applies to all atoms, not just atoms shared
   between MPI tasks
 * Vectorization can change the order for adding pairwise forces
-* When using the -DLMP_USE_MKL_RNG define (all included intel optimized
+* When using the ``-DLMP_USE_MKL_RNG`` define (all included Intel optimized
   makefiles do) at build time, the random number generator for
-  dissipative particle dynamics (pair style dpd/intel) uses the Mersenne
+  dissipative particle dynamics (``pair style dpd/intel``) uses the Mersenne
   Twister generator included in the Intel MKL library (that should be
   more robust than the default Masaglia random number generator)

@@ -106,36 +106,36 @@ LAMMPS should be built with the INTEL package installed.
 Simulations should be run with 1 MPI task per physical *core*,
 not *hardware thread*\ .

-* Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary.
-* Set the environment variable KMP_BLOCKTIME=0
-* "-pk intel 0 omp $t -sf intel" added to LAMMPS command-line
-* $t should be 2 for Intel Xeon CPUs and 2 or 4 for Intel Xeon Phi
+* Edit ``src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi`` as necessary.
+* Set the environment variable ``KMP_BLOCKTIME=0``
+* ``-pk intel 0 omp $t -sf intel`` added to LAMMPS command-line
+* ``$t`` should be 2 for Intel Xeon CPUs and 2 or 4 for Intel Xeon Phi
 * For some of the simple 2-body potentials without long-range
   electrostatics, performance and scalability can be better with
-  the "newton off" setting added to the input script
-* For simulations on higher node counts, add "processors \* \* \* grid
-  numa" to the beginning of the input script for better scalability
-* If using *kspace_style pppm* in the input script, add
-  "kspace_modify diff ad" for better performance
+  the ``newton off`` setting added to the input script
+* For simulations on higher node counts, add ``processors * * * grid
+  numa`` to the beginning of the input script for better scalability
+* If using ``kspace_style pppm`` in the input script, add
+  ``kspace_modify diff ad`` for better performance

 For Intel Xeon Phi CPUs:

 * Runs should be performed using MCDRAM.

-For simulations using *kspace_style pppm* on Intel CPUs supporting
+For simulations using ``kspace_style pppm`` on Intel CPUs supporting
 AVX-512:

-* Add "kspace_modify diff ad" to the input script
+* Add ``kspace_modify diff ad`` to the input script
 * The command-line option should be changed to
-  "-pk intel 0 omp $r lrt yes -sf intel" where $r is the number of
+  ``-pk intel 0 omp $r lrt yes -sf intel`` where ``$r`` is the number of
   threads minus 1.
-* Do not use thread affinity (set KMP_AFFINITY=none)
-* The "newton off" setting may provide better scalability
+* Do not use thread affinity (set ``KMP_AFFINITY=none``)
+* The ``newton off`` setting may provide better scalability

 For Intel Xeon Phi co-processors (Offload):

-* Edit src/MAKE/OPTIONS/Makefile.intel_co-processor as necessary
-* "-pk intel N omp 1" added to command-line where N is the number of
+* Edit ``src/MAKE/OPTIONS/Makefile.intel_co-processor`` as necessary
+* ``-pk intel N omp 1`` added to command-line where ``N`` is the number of
   co-processors per node.

 ----------
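The CPU recommendations in this hunk can be collected into a small wrapper; this is a sketch only (``lmp_machine`` is a placeholder binary name, and ``$t`` follows the rule in the list: 2 for Xeon, 2 or 4 for Xeon Phi):

```shell
# Recommended CPU run settings from the list above, assembled in one place.
export KMP_BLOCKTIME=0   # per the recommendation above
t=2                      # 2 for Intel Xeon; 2 or 4 for Intel Xeon Phi
ARGS="-pk intel 0 omp $t -sf intel"
echo "lmp_machine $ARGS -in in.script"
```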
@@ -209,7 +209,7 @@ See the :ref:`Build extras <intel>` page for
 instructions. Some additional details are covered here.

 For building with make, several example Makefiles for building with
-the Intel compiler are included with LAMMPS in the src/MAKE/OPTIONS/
+the Intel compiler are included with LAMMPS in the ``src/MAKE/OPTIONS/``
 directory:

 .. code-block:: bash
@@ -239,35 +239,35 @@ However, if you do not have co-processors on your system, building
 without offload support will produce a smaller binary.

 The general requirements for Makefiles with the INTEL package
-are as follows. When using Intel compilers, "-restrict" is required
-and "-qopenmp" is highly recommended for CCFLAGS and LINKFLAGS.
-CCFLAGS should include "-DLMP_INTEL_USELRT" (unless POSIX Threads
-are not supported in the build environment) and "-DLMP_USE_MKL_RNG"
+are as follows. When using Intel compilers, ``-restrict`` is required
+and ``-qopenmp`` is highly recommended for ``CCFLAGS`` and ``LINKFLAGS``.
+``CCFLAGS`` should include ``-DLMP_INTEL_USELRT`` (unless POSIX Threads
+are not supported in the build environment) and ``-DLMP_USE_MKL_RNG``
 (unless Intel Math Kernel Library (MKL) is not available in the build
-environment). For Intel compilers, LIB should include "-ltbbmalloc"
-or if the library is not available, "-DLMP_INTEL_NO_TBB" can be added
-to CCFLAGS. For builds supporting offload, "-DLMP_INTEL_OFFLOAD" is
-required for CCFLAGS and "-qoffload" is required for LINKFLAGS. Other
-recommended CCFLAG options for best performance are "-O2 -fno-alias
--ansi-alias -qoverride-limits fp-model fast=2 -no-prec-div".
+environment). For Intel compilers, ``LIB`` should include ``-ltbbmalloc``
+or if the library is not available, ``-DLMP_INTEL_NO_TBB`` can be added
+to ``CCFLAGS``. For builds supporting offload, ``-DLMP_INTEL_OFFLOAD`` is
+required for ``CCFLAGS`` and ``-qoffload`` is required for ``LINKFLAGS``. Other
+recommended ``CCFLAGS`` options for best performance are ``-O2 -fno-alias
+-ansi-alias -qoverride-limits -fp-model fast=2 -no-prec-div``.

 .. note::

-   See the src/INTEL/README file for additional flags that
+   See the ``src/INTEL/README`` file for additional flags that
    might be needed for best performance on Intel server processors
    code-named "Skylake".

 .. note::

    The vectorization and math capabilities can differ depending on
-   the CPU. For Intel compilers, the "-x" flag specifies the type of
-   processor for which to optimize. "-xHost" specifies that the compiler
+   the CPU. For Intel compilers, the ``-x`` flag specifies the type of
+   processor for which to optimize. ``-xHost`` specifies that the compiler
    should build for the processor used for compiling. For Intel Xeon Phi
-   x200 series processors, this option is "-xMIC-AVX512". For fourth
-   generation Intel Xeon (v4/Broadwell) processors, "-xCORE-AVX2" should
-   be used. For older Intel Xeon processors, "-xAVX" will perform best
+   x200 series processors, this option is ``-xMIC-AVX512``. For fourth
+   generation Intel Xeon (v4/Broadwell) processors, ``-xCORE-AVX2`` should
+   be used. For older Intel Xeon processors, ``-xAVX`` will perform best
    in general for the different simulations in LAMMPS. The default
-   in most of the example Makefiles is to use "-xHost", however this
+   in most of the example Makefiles is to use ``-xHost``, however this
    should not be used when cross-compiling.

 Running LAMMPS with the INTEL package
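Pulling the flag requirements from this hunk into one place, a hypothetical Makefile fragment might look like the following; this is a sketch for orientation only, and the tuned files in ``src/MAKE/OPTIONS/`` remain the authoritative starting point:

```make
# Hypothetical excerpt, not a complete Makefile (assumes Intel compilers,
# MKL and TBB available, no offload support).
CCFLAGS   = -O2 -restrict -qopenmp -DLMP_INTEL_USELRT -DLMP_USE_MKL_RNG \
            -fno-alias -ansi-alias -qoverride-limits -fp-model fast=2 -no-prec-div
LINKFLAGS = -qopenmp
LIB       = -ltbbmalloc
```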
@@ -304,11 +304,11 @@ almost all cases.
 uniform. Unless disabled at build time, affinity for MPI tasks and
 OpenMP threads on the host (CPU) will be set by default on the host
 *when using offload to a co-processor*\ . In this case, it is unnecessary
-to use other methods to control affinity (e.g. taskset, numactl,
-I_MPI_PIN_DOMAIN, etc.). This can be disabled with the *no_affinity*
+to use other methods to control affinity (e.g. ``taskset``, ``numactl``,
+``I_MPI_PIN_DOMAIN``, etc.). This can be disabled with the *no_affinity*
 option to the :doc:`package intel <package>` command or by disabling the
-option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the
-CCFLAGS line of your Makefile). Disabling this option is not
+option at build time (by adding ``-DINTEL_OFFLOAD_NOAFFINITY`` to the
+``CCFLAGS`` line of your Makefile). Disabling this option is not
 recommended, especially when running on a machine with Intel
 Hyper-Threading technology disabled.

@@ -316,7 +316,7 @@ Run with the INTEL package from the command line
 """""""""""""""""""""""""""""""""""""""""""""""""""""

 To enable INTEL optimizations for all available styles used in
-the input script, the "-sf intel" :doc:`command-line switch <Run_options>` can be used without any requirement for
+the input script, the ``-sf intel`` :doc:`command-line switch <Run_options>` can be used without any requirement for
 editing the input script. This switch will automatically append
 "intel" to styles that support it. It also invokes a default command:
 :doc:`package intel 1 <package>`. This package command is used to set
@@ -329,15 +329,15 @@ will be used with automatic balancing of work between the CPU and the
 co-processor.

 You can specify different options for the INTEL package by using
-the "-pk intel Nphi" :doc:`command-line switch <Run_options>` with
-keyword/value pairs as specified in the documentation. Here, Nphi = #
+the ``-pk intel Nphi`` :doc:`command-line switch <Run_options>` with
+keyword/value pairs as specified in the documentation. Here, ``Nphi`` = #
 of Xeon Phi co-processors/node (ignored without offload
 support). Common options to the INTEL package include *omp* to
-override any OMP_NUM_THREADS setting and specify the number of OpenMP
+override any ``OMP_NUM_THREADS`` setting and specify the number of OpenMP
 threads, *mode* to set the floating-point precision mode, and *lrt* to
 enable Long-Range Thread mode as described below. See the :doc:`package intel <package>` command for details, including the default values
 used for all its options if not specified, and how to set the number
-of OpenMP threads via the OMP_NUM_THREADS environment variable if
+of OpenMP threads via the ``OMP_NUM_THREADS`` environment variable if
 desired.

 Examples (see documentation for your MPI/Machine for differences in
@@ -345,8 +345,13 @@ launching MPI applications):

 .. code-block:: bash

-   mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads
-   mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double # Don't use any co-processors that might be available, use 2 OpenMP threads for each task, use double precision
+   # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads
+   mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script
+
+   # Don't use any co-processors that might be available,
+   # use 2 OpenMP threads for each task, use double precision
+   mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script \
+     -pk intel 0 omp 2 mode double

 Or run with the INTEL package by editing an input script
 """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
@@ -386,19 +391,19 @@ Long-Range Thread (LRT) mode is an option to the :doc:`package intel <package>`
 with SMT. It generates an extra pthread for each MPI task. The thread
 is dedicated to performing some of the PPPM calculations and MPI
 communications. This feature requires setting the pre-processor flag
--DLMP_INTEL_USELRT in the makefile when compiling LAMMPS. It is unset
-in the default makefiles (\ *Makefile.mpi* and *Makefile.serial*\ ) but
+``-DLMP_INTEL_USELRT`` in the makefile when compiling LAMMPS. It is unset
+in the default makefiles (``Makefile.mpi`` and ``Makefile.serial``) but
 it is set in all makefiles tuned for the INTEL package. On Intel
 Xeon Phi x200 series CPUs, the LRT feature will likely improve
 performance, even on a single node. On Intel Xeon processors, using
 this mode might result in better performance when using multiple nodes,
 depending on the specific machine configuration. To enable LRT mode,
 specify that the number of OpenMP threads is one less than would
-normally be used for the run and add the "lrt yes" option to the "-pk"
+normally be used for the run and add the ``lrt yes`` option to the ``-pk``
 command-line suffix or "package intel" command. For example, if a run
 would normally perform best with "-pk intel 0 omp 4", instead use
-"-pk intel 0 omp 3 lrt yes". When using LRT, you should set the
-environment variable "KMP_AFFINITY=none". LRT mode is not supported
+``-pk intel 0 omp 3 lrt yes``. When using LRT, you should set the
+environment variable ``KMP_AFFINITY=none``. LRT mode is not supported
 when using offload.

 .. note::
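The LRT bookkeeping described in this hunk amounts to reserving one thread for the extra pthread; a tiny sketch (no LAMMPS required; the variable names are ours):

```shell
# If a run would normally use 4 OpenMP threads, LRT mode uses 3 of them
# plus the extra pthread that LRT itself creates per MPI task.
OMP_NORMAL=4
r=$(( OMP_NORMAL - 1 ))
export KMP_AFFINITY=none   # recommended when LRT is enabled
echo "-pk intel 0 omp $r lrt yes -sf intel"
```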
@@ -411,12 +416,12 @@ Not all styles are supported in the INTEL package. You can mix
 the INTEL package with styles from the :doc:`OPT <Speed_opt>`
 package or the :doc:`OPENMP package <Speed_omp>`. Of course, this
 requires that these packages were installed at build time. This can
-performed automatically by using "-sf hybrid intel opt" or "-sf hybrid
-intel omp" command-line options. Alternatively, the "opt" and "omp"
+be performed automatically by using ``-sf hybrid intel opt`` or ``-sf hybrid
+intel omp`` command-line options. Alternatively, the "opt" and "omp"
 suffixes can be appended manually in the input script. For the latter,
 the :doc:`package omp <package>` command must be in the input script or
-the "-pk omp Nt" :doc:`command-line switch <Run_options>` must be used
-where Nt is the number of OpenMP threads. The number of OpenMP threads
+the ``-pk omp Nt`` :doc:`command-line switch <Run_options>` must be used
+where ``Nt`` is the number of OpenMP threads. The number of OpenMP threads
 should not be set differently for the different packages. Note that
 the :doc:`suffix hybrid intel omp <suffix>` command can also be used
 within the input script to automatically append the "omp" suffix to
@@ -436,7 +441,7 @@ alternative to LRT mode and the two cannot be used together.

Currently, when using Intel MPI with Intel Xeon Phi x200 series
CPUs, better performance might be obtained by setting the
environment variable ``I_MPI_SHM_LMT=shm`` for Linux kernels that do
not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
series processors will always perform better using MCDRAM. Please
consult your system documentation for the best approach to specify
@@ -515,7 +520,7 @@ per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for :doc:`atom sorting <atom_modify>` is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. All the
available co-processor threads on each Phi will be divided among MPI
tasks, unless the ``tptask`` option of the ``-pk intel`` :doc:`command-line switch <Run_options>` is used to limit the co-processor threads per
MPI task.

Restrictions
@@ -48,7 +48,7 @@ version 23 November 2023 and Kokkos version 4.2.

Kokkos requires using a compiler that supports the C++17 standard. For
some compilers, it may be necessary to add a flag to enable C++17 support.
For example, the GNU compiler uses the ``-std=c++17`` flag. For a list of
compilers that have been tested with the Kokkos library, see the
`requirements document of the Kokkos Wiki
<https://kokkos.github.io/kokkos-core-wiki/requirements.html>`_.
@@ -111,14 +111,21 @@ for CPU acceleration, assuming one or more 16-core nodes.

.. code-block:: bash

   # 1 node, 16 MPI tasks/node, no multi-threading
   mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj

   # 2 nodes, 1 MPI task/node, 16 threads/task
   mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj

   # 1 node, 2 MPI tasks/node, 8 threads/task
   mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj

   # 8 nodes, 4 MPI tasks/node, 4 threads/task
   mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj

To run using the KOKKOS package, use the ``-k on``, ``-sf kk`` and
``-pk kokkos`` :doc:`command-line switches <Run_options>` in your ``mpirun``
command. You must use the ``-k on`` :doc:`command-line switch <Run_options>` to enable the KOKKOS package. It takes
additional arguments for hardware settings appropriate to your system.
For OpenMP use:

@@ -126,18 +133,18 @@ For OpenMP use:

   -k on t Nt

The ``t Nt`` option specifies how many OpenMP threads per MPI task to
use with a node. The default is ``Nt`` = 1, which is MPI-only mode. Note
that the product of MPI tasks \* OpenMP threads/task should not exceed
the physical number of cores (on a node), otherwise performance will
suffer. If Hyper-Threading (HT) is enabled, then the product of MPI
tasks \* OpenMP threads/task should not exceed the physical number of
cores \* hardware threads. The ``-k on`` switch also issues a
``package kokkos`` command (with no additional arguments) which sets
various KOKKOS options to default values, as discussed on the
:doc:`package <package>` command doc page.

The ``-sf kk`` :doc:`command-line switch <Run_options>` will automatically
append the "/kk" suffix to styles that support it. In this manner no
modification to the input script is needed. Alternatively, one can run
with the KOKKOS package by editing the input script as described
@@ -146,20 +153,22 @@ below.

.. note::

   When using a single OpenMP thread, the Kokkos Serial back end (i.e.
   ``Makefile.kokkos_mpi_only``) will give better performance than the OpenMP
   back end (i.e. ``Makefile.kokkos_omp``) because some of the overhead to make
   the code thread-safe is removed.

.. note::

   Use the ``-pk kokkos`` :doc:`command-line switch <Run_options>` to
   change the default :doc:`package kokkos <package>` options. See its doc
   page for details and default settings. Experimenting with its options
   can provide a speed-up for specific calculations. For example:

   .. code-block:: bash

      # Newton on, half neighbor list, non-threaded comm
      mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk \
          -pk kokkos newton on neigh half comm no -in in.lj

   If the :doc:`newton <newton>` command is used in the input
   script, it can also override the Newton flag defaults.
@@ -172,7 +181,7 @@ small numbers of threads (i.e. 8 or less) but does increase memory
footprint and is not scalable to large numbers of threads. An
alternative to data duplication is to use thread-level atomic operations
which do not require data duplication. The use of atomic operations can
be enforced by compiling LAMMPS with the ``-DLMP_KOKKOS_USE_ATOMICS``
pre-processor flag. Most but not all Kokkos-enabled pair_styles support
data duplication. Alternatively, full neighbor lists avoid the need for
duplication or atomic operations but require more compute operations per
@@ -190,10 +199,13 @@ they do not migrate during a simulation.
If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

.. code-block:: bash

   # OpenMPI 1.8
   mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...

   # Mvapich2 2.0
   mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ...

For binding threads with KOKKOS OpenMP, use thread affinity environment
variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12
@@ -222,15 +234,24 @@ Examples of mpirun commands that follow these rules are shown below.

.. code-block:: bash

   # Running on an Intel KNL node with 68 cores
   # (272 threads/node via 4x hardware threading):

   # 1 node, 64 MPI tasks/node, 4 threads/task
   mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

   # 1 node, 66 MPI tasks/node, 4 threads/task
   mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

   # 1 node, 32 MPI tasks/node, 8 threads/task
   mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj

   # 8 nodes, 64 MPI tasks/node, 4 threads/task
   mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj

The ``-np`` setting of the ``mpirun`` command sets the number of MPI
tasks/node. The ``-k on t Nt`` command-line switch sets the number of
threads/task as ``Nt``. The product of these two values should be N, i.e.
256 or 264.
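
The arithmetic behind these task and thread counts can be spot-checked
directly in the shell before launching; this trivial sketch just
multiplies the values used in the examples above:

.. code-block:: bash

   # tasks/node * threads/task should match the hardware thread count
   echo $(( 64 * 4 ))   # 256
   echo $(( 66 * 4 ))   # 264
   echo $(( 32 * 8 ))   # 256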

.. note::
@@ -240,7 +261,7 @@ threads/task as Nt. The product of these two values should be N, i.e.
   flag to "on" for both pairwise and bonded interactions. This will
   typically be best for many-body potentials. For simpler pairwise
   potentials, it may be faster to use a "full" neighbor list with the
   Newton flag "off". Use the ``-pk kokkos`` :doc:`command-line switch
   <Run_options>` to change the default :doc:`package kokkos <package>`
   options. See its documentation page for details and default
   settings. Experimenting with its options can provide a speed-up for
@@ -248,8 +269,12 @@ threads/task as Nt. The product of these two values should be N, i.e.

   .. code-block:: bash

      # Newton on, half neighbor list, threaded comm
      mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm host -in in.reax

      # Newton off, full neighbor list, non-threaded comm
      mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk \
          -pk kokkos newton off neigh full comm no -in in.lj

.. note::

@@ -266,8 +291,8 @@ threads/task as Nt. The product of these two values should be N, i.e.
Running on GPUs
^^^^^^^^^^^^^^^

Use the ``-k`` :doc:`command-line switch <Run_options>` to specify the
number of GPUs per node. Typically the ``-np`` setting of the ``mpirun`` command
should set the number of MPI tasks/node to be equal to the number of
physical GPUs on the node. You can assign multiple MPI tasks to the same
GPU with the KOKKOS package, but this is usually only faster if some
@@ -290,8 +315,11 @@ one or more nodes, each with two GPUs:

.. code-block:: bash

   # 1 node, 2 MPI tasks/node, 2 GPUs/node
   mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj

   # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total)
   mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj

.. note::

@@ -303,7 +331,7 @@ one or more nodes, each with two GPUs:
   neighbor lists and setting the Newton flag to "on" may be faster. For
   many pair styles, setting the neighbor binsize equal to twice the CPU
   default value will give a speedup, which is the default when running on
   GPUs. Use the ``-pk kokkos`` :doc:`command-line switch <Run_options>`
   to change the default :doc:`package kokkos <package>` options. See
   its documentation page for details and default
   settings. Experimenting with its options can provide a speed-up for
@@ -311,7 +339,9 @@ one or more nodes, each with two GPUs:

   .. code-block:: bash

      # Newton on, half neighbor list, set binsize = neighbor ghost cutoff
      mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk \
          -pk kokkos newton on neigh half binsize 2.8 -in in.lj

.. note::

@@ -329,7 +359,7 @@ one or more nodes, each with two GPUs:
   more), the creation of the atom map (required for molecular systems)
   on the GPU can slow down significantly or run out of GPU memory and
   thus slow down the whole calculation or cause a crash. You can use
   the ``-pk kokkos atom/map no`` :doc:`command-line switch <Run_options>`
   or the :doc:`package kokkos atom/map no <package>` command to create
   the atom map on the CPU instead.
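
   For example, a large GPU run that builds the atom map on the CPU
   might look like the following sketch; the binary name and input file
   are placeholders reused from the examples above:

   .. code-block:: bash

      # build the atom map on the CPU instead of the GPU
      mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk \
          -pk kokkos atom/map no -in in.lj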

@@ -346,20 +376,20 @@ one or more nodes, each with two GPUs:
.. note::

   To get an accurate timing breakdown between time spent in pair,
   kspace, etc., you must set the environment variable ``CUDA_LAUNCH_BLOCKING=1``.
   However, this will reduce performance and is not recommended for production runs.

Run with the KOKKOS package by editing an input script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Alternatively, the effect of the ``-sf`` or ``-pk`` switches can be
duplicated by adding the :doc:`package kokkos <package>` or :doc:`suffix kk <suffix>` commands to your input script.

The discussion above about building LAMMPS with the KOKKOS package, the
``mpirun`` or ``mpiexec`` command, and setting appropriate thread
properties applies here as well.

You must still use the ``-k on`` :doc:`command-line switch <Run_options>`
to enable the KOKKOS package, and specify its additional arguments for
hardware options appropriate to your system, as documented above.

@@ -378,7 +408,7 @@ wish to change any of its option defaults, as set by the "-k on"

With the KOKKOS package, both OpenMP multi-threading and GPUs can be
compiled and used together in a few special cases. In the makefile for
the conventional build, the ``KOKKOS_DEVICES`` variable must include both
"Cuda" and "OpenMP", as is the case for ``/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi``.

.. code-block:: bash
@@ -390,14 +420,14 @@ in the ``kokkos-cuda.cmake`` CMake preset file.

.. code-block:: bash

   cmake -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes ../cmake

The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using ``-sf kk`` on the command line gives the default CUDA version
everywhere. However, if the "/kk/host" suffix is added to a specific
style in the input script, the Kokkos OpenMP (CPU) version of that
specific style will be used instead. Set the number of OpenMP threads
as ``t Nt`` and the number of GPUs as ``g Ng``:

.. parsed-literal::

@@ -409,7 +439,7 @@ For example, the command to run with 1 GPU and 8 OpenMP threads is then:

   mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk

Conversely, if ``-sf kk/host`` is used on the command line and then
the "/kk" or "/kk/device" suffix is added to a specific style in your
input script, then only that specific style will run on the GPU while
everything else will run on the CPU in OpenMP mode. Note that the
@@ -418,11 +448,11 @@ special case:

A kspace style and/or molecular topology (bonds, angles, etc.) running
on the host CPU can overlap with a pair style running on the
GPU. First compile with ``--default-stream per-thread`` added to ``CCFLAGS``
in the Kokkos CUDA Makefile. Then explicitly use the "/kk/host"
suffix for kspace and bonds, angles, etc. in the input file and the
"kk" suffix (equal to "kk/device") on the command line. Also make
sure the environment variable ``CUDA_LAUNCH_BLOCKING`` is not set to "1"
so CPU/GPU overlap can occur.
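
To make the overlap concrete, the input-script side might begin like
this sketch; the ``pppm`` style and its tolerance are illustrative
choices, not values taken from this page:

.. parsed-literal::

   # long-range part on the host CPU; pair part runs on the GPU via -sf kk
   kspace_style pppm/kk/host 1.0e-4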

Performance to expect
^^^^^^^^^^^^^^^^^^^^^
@@ -28,32 +28,39 @@ These examples assume one or more 16-core nodes.

.. code-block:: bash

   # 1 MPI task, 16 threads according to OMP_NUM_THREADS
   env OMP_NUM_THREADS=16 lmp_omp -sf omp -in in.script

   # 1 MPI task, no threads, optimized kernels
   lmp_mpi -sf omp -in in.script

   # 4 MPI tasks, 4 threads/task
   mpirun -np 4 lmp_omp -sf omp -pk omp 4 -in in.script

   # 8 nodes, 4 MPI tasks/node, 4 threads/task
   mpirun -np 32 -ppn 4 lmp_omp -sf omp -pk omp 4 -in in.script

The ``mpirun`` or ``mpiexec`` command sets the total number of MPI tasks
used by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. For example, the ``mpirun`` command in MPICH does
this via its ``-np`` and ``-ppn`` switches; OpenMPI uses ``-np`` and ``-npernode``.
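
Side by side, the two launchers' flags for the same 8-node layout might
look as follows; this is a sketch built only from the switches named
above:

.. code-block:: bash

   # MPICH: 32 tasks total, 4 tasks per node
   mpirun -np 32 -ppn 4 lmp_omp -sf omp -pk omp 4 -in in.script

   # OpenMPI: the same layout, using -npernode
   mpirun -np 32 -npernode 4 lmp_omp -sf omp -pk omp 4 -in in.script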

You need to choose how many OpenMP threads per MPI task will be used
by the OPENMP package. Note that the product of MPI tasks \*
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.

As in the lines above, use the ``-sf omp`` :doc:`command-line switch <Run_options>`, which will automatically append "omp" to
styles that support it. The ``-sf omp`` switch also issues a default
:doc:`package omp 0 <package>` command, which will set the number of
threads per MPI task via the ``OMP_NUM_THREADS`` environment variable.

You can also use the ``-pk omp Nt`` :doc:`command-line switch <Run_options>` to explicitly set ``Nt`` = # of OpenMP threads
per MPI task to use, as well as additional options. Its syntax is the
same as the :doc:`package omp <package>` command whose page gives
details, including the default values used if it is not specified. It
also gives more details on how to set the number of threads via the
``OMP_NUM_THREADS`` environment variable.

Or run with the OPENMP package by editing an input script
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
@@ -71,7 +78,7 @@ Use the :doc:`suffix omp <suffix>` command, or you can explicitly add an
You must also use the :doc:`package omp <package>` command to enable the
OPENMP package. When you do this, you also specify how many threads
per MPI task to use. The command page explains other options and
how to set the number of threads via the ``OMP_NUM_THREADS`` environment
variable.
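
Put together, the input-script route might begin like this sketch,
where ``4`` threads per task is an arbitrary example value:

.. parsed-literal::

   package omp 4    # 4 OpenMP threads per MPI task
   suffix omp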
|
|
||||||
Speed-up to expect
|
Speed-up to expect
|
||||||
|
|||||||
@ -80,23 +80,30 @@ it provides, follow these general steps. Details vary from package to
|
|||||||
package and are explained in the individual accelerator doc pages,
|
package and are explained in the individual accelerator doc pages,
|
||||||
listed above:
|
listed above:
|
||||||

+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| build the accelerator library                                                                                                    | only for GPU package                                                 |
+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| install the accelerator package                                                                                                  | make yes-opt, make yes-intel, etc                                    |
+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| add compile/link flags to Makefile.machine in src/MAKE                                                                           | only for INTEL, KOKKOS, OPENMP, OPT packages                         |
+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| re-build LAMMPS                                                                                                                  | make machine                                                         |
+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| prepare and test a regular LAMMPS simulation                                                                                     | lmp_machine -in in.script; mpirun -np 32 lmp_machine -in in.script   |
+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| enable specific accelerator support via '-k on' :doc:`command-line switch <Run_options>`,                                        | only needed for KOKKOS package                                       |
+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| set any needed options for the package via "-pk" :doc:`command-line switch <Run_options>` or :doc:`package <package>` command,   | only if defaults need to be changed                                  |
+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| use accelerated styles in your input via "-sf" :doc:`command-line switch <Run_options>` or :doc:`suffix <suffix>` command        | lmp_machine -in in.script -sf gpu                                    |
+--------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+

+-----------------------------------------------------------+---------------------------------------------+
| build the accelerator library                             | only for GPU package                        |
+-----------------------------------------------------------+---------------------------------------------+
| install the accelerator package                           | ``make yes-opt``, ``make yes-intel``, etc   |
+-----------------------------------------------------------+---------------------------------------------+
| add compile/link flags to ``Makefile.machine``            | only for INTEL, KOKKOS, OPENMP,             |
| in ``src/MAKE``                                           | OPT packages                                |
+-----------------------------------------------------------+---------------------------------------------+
| re-build LAMMPS                                           | ``make machine``                            |
+-----------------------------------------------------------+---------------------------------------------+
| prepare and test a regular LAMMPS simulation              | ``lmp_machine -in in.script;``              |
|                                                           | ``mpirun -np 32 lmp_machine -in in.script`` |
+-----------------------------------------------------------+---------------------------------------------+
| enable specific accelerator support via ``-k on``         | only needed for KOKKOS package              |
| :doc:`command-line switch <Run_options>`                  |                                             |
+-----------------------------------------------------------+---------------------------------------------+
| set any needed options for the package via ``-pk``        | only if defaults need to be changed         |
| :doc:`command-line switch <Run_options>` or               |                                             |
| :doc:`package <package>` command                          |                                             |
+-----------------------------------------------------------+---------------------------------------------+
| use accelerated styles in your input via ``-sf``          | ``lmp_machine -in in.script -sf gpu``       |
| :doc:`command-line switch <Run_options>` or               |                                             |
| :doc:`suffix <suffix>` command                            |                                             |
+-----------------------------------------------------------+---------------------------------------------+

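The table's steps can be strung together as follows. Since no LAMMPS
build is assumed here, the commands are assembled and printed rather
than executed; the OPT package, the makefile name ``machine``, and
``in.script`` are illustrative placeholders:

```shell
# Build-side steps, run from the LAMMPS src directory:
pkg=opt           # accelerator package to install (OPT as an example)
machine=machine   # Makefile.machine name (placeholder)
build_cmds="make yes-$pkg && make $machine"
echo "$build_cmds"

# Run-side step: request the package's accelerated styles with -sf:
run_cmd="mpirun -np 32 lmp_$machine -in in.script -sf $pkg"
echo "$run_cmd"
```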
Note that the first 4 steps can be done as a single command with
suitable make command invocations. This is discussed on the