documentation corrections, spelling fixes and updates

This commit is contained in:
Axel Kohlmeyer
2021-02-17 18:47:35 -05:00
parent e575c5fa29
commit f367e66aba
4 changed files with 42 additions and 26 deletions

View File

@@ -1,11 +1,14 @@
 GPU package
 ===========

-The GPU package was developed by Mike Brown while at SNL and ORNL
-and his collaborators, particularly Trung Nguyen (now at Northwestern).
-It provides GPU versions of many pair styles and for parts of the
-:doc:`kspace_style pppm <kspace_style>` for long-range Coulombics.
-It has the following general features:
+The GPU package was developed by Mike Brown while at SNL and ORNL (now
+at Intel Corp.) and his collaborators, particularly Trung Nguyen (now at
+Northwestern). Support for AMD GPUs via HIP was added by Vsevolod Nikolskiy
+and coworkers at HSE University.
+
+The GPU package provides GPU versions of many pair styles and for
+parts of the :doc:`kspace_style pppm <kspace_style>` for long-range
+Coulombics. It has the following general features:

 * It is designed to exploit common GPU hardware configurations where one
   or more GPUs are coupled to many cores of one or more multi-core CPUs,
@@ -24,8 +27,9 @@ It has the following general features:
   force vectors.

 * LAMMPS-specific code is in the GPU package. It makes calls to a
   generic GPU library in the lib/gpu directory. This library provides
-  NVIDIA support as well as more general OpenCL support, so that the
-  same functionality is supported on a variety of hardware.
+  either Nvidia support, AMD support, or more general OpenCL support
+  (for Nvidia GPUs, AMD GPUs, Intel GPUs, and multi-core CPUs),
+  so that the same functionality is supported on a variety of hardware.
**Required hardware/software:** **Required hardware/software:**
@@ -89,10 +93,10 @@ shared by 4 MPI tasks.
 The GPU package also has limited support for OpenMP for both
 multi-threading and vectorization of routines that are run on the CPUs.
 This requires that the GPU library and LAMMPS are built with flags to
-enable OpenMP support (e.g. -fopenmp -fopenmp-simd). Some styles for
-time integration are also available in the GPU package. These run
-completely on the CPUs in full double precision, but exploit
-multi-threading and vectorization for faster performance.
+enable OpenMP support (e.g. -fopenmp). Some styles for time integration
+are also available in the GPU package. These run completely on the CPUs
+in full double precision, but exploit multi-threading and vectorization
+for faster performance.

 Use the "-sf gpu" :doc:`command-line switch <Run_options>`, which will
 automatically append "gpu" to styles that support it. Use the "-pk"
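
A typical invocation combining these switches might look as follows (a
minimal sketch; the executable name "lmp" and the input script "in.lj"
are placeholders, while "-sf gpu", "-pk gpu", and "-in" are the
documented switches):

  mpirun -np 4 lmp -sf gpu -pk gpu 2 -in in.lj

Here four MPI tasks per node share two GPUs, and every style with a GPU
variant is automatically given the "gpu" suffix.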
@@ -159,11 +163,11 @@ Likewise, you should experiment with the precision setting for the GPU
 library to see if single or mixed precision will give accurate
 results, since they will typically be faster.

-MPI parallelism typically outperforms OpenMP parallelism, but in same cases
-using fewer MPI tasks and multiple OpenMP threads with the GPU package
-can give better performance. 3-body potentials can often perform better
-with multiple OMP threads because the inter-process communication is
-higher for these styles with the GPU package in order to allow
+MPI parallelism typically outperforms OpenMP parallelism, but in some
+cases using fewer MPI tasks and multiple OpenMP threads with the GPU
+package can give better performance. 3-body potentials can often perform
+better with multiple OMP threads because the inter-process communication
+is higher for these styles with the GPU package in order to allow
 deterministic results.

 **Guidelines for best performance:**
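
For instance, such a hybrid run could use fewer MPI tasks with several
OpenMP threads each (a sketch; the task count, thread count, and input
name "in.sw" are placeholders and assume an OpenMP-enabled build):

  OMP_NUM_THREADS=4 mpirun -np 2 lmp -sf gpu -pk gpu 1 -in in.sw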
@@ -189,6 +193,12 @@ deterministic results.
   :doc:`angle <angle_style>`, :doc:`dihedral <dihedral_style>`,
   :doc:`improper <improper_style>`, and :doc:`long-range <kspace_style>`
   calculations will not be included in the "Pair" time.
+* Since only part of the pppm kspace style is GPU accelerated, it
+  may be faster to use GPU acceleration only for pair styles with
+  long-range electrostatics. See the "pair/only" keyword of the
+  package command for a shortcut to do that. The work between kspace
+  on the CPU and non-bonded interactions on the GPU can then be
+  balanced by adjusting the Coulomb cutoff without loss of accuracy.
 * When the *mode* setting for the package gpu command is force/neigh,
   the time for neighbor list calculations on the GPU will be added into
   the "Pair" time, not the "Neigh" time. An additional breakdown of the

View File

@@ -175,7 +175,7 @@ package.
 The *Ngpu* argument sets the number of GPUs per node. If *Ngpu* is 0
 and no other keywords are specified, GPU or accelerator devices are
-autoselected. In this process, all platforms are searched for
+auto-selected. In this process, all platforms are searched for
 accelerator devices and GPUs are chosen if available. The device with
 the highest number of compute cores is selected. The number of devices
 is increased to be the number of matching accelerators with the same
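
In an input script, this auto-selection is requested by asking for zero
devices, per the description above:

  package gpu 0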
@@ -257,7 +257,8 @@ the other particles.
 The *gpuID* keyword is used to specify the first ID for the GPU or
 other accelerator that LAMMPS will use. For example, if the ID is
 1 and *Ngpu* is 3, GPUs 1-3 will be used. Device IDs should be
-determined from the output of nvc_get_devices or ocl_get_devices
+determined from the output of nvc_get_devices, ocl_get_devices,
+or hip_get_devices
 as provided in the lib/gpu directory. When using OpenCL with
 accelerators that have main memory NUMA, the accelerators can be
 split into smaller virtual accelerators for more efficient use
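
Matching the example in the text, selecting three GPUs starting at
device ID 1 would look like this in an input script (a sketch of the
*gpuID* usage described above):

  package gpu 3 gpuID 1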
@@ -306,13 +307,14 @@ PPPM_MAX_SPLINE.
 CONFIG_ID can be 0. SHUFFLE_AVAIL in {0,1} indicates that inline-PTX
 (NVIDIA) or OpenCL extensions (Intel) should be used for horizontal
-vector operataions. FAST_MATH in {0,1} indicates that OpenCL fast math
-optimizations are used during the build and HW-accelerated
-transcendentals are used when available. THREADS_PER_* give the default
-*tpa* values for ellipsoidal models, styles using charge, and any other
-styles. The BLOCK_* parameters specify the block sizes for various
-kernal calls and the MAX_*SHARED*_ parameters are used to determine the
-amount of local shared memory to use for storing model parameters.
+vector operations. FAST_MATH in {0,1} indicates that OpenCL fast math
+optimizations are used during the build and hardware-accelerated
+transcendental functions are used when available. THREADS_PER_* give the
+default *tpa* values for ellipsoidal models, styles using charge, and
+any other styles. The BLOCK_* parameters specify the block sizes for
+various kernel calls and the MAX_*SHARED*_ parameters are used to
+determine the amount of local shared memory to use for storing model
+parameters.

 For OpenCL, the routines are compiled at runtime for the specified GPU
 or accelerator architecture. The *ocl_args* keyword can be used to
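
For example, extra flags might be handed to the OpenCL compiler roughly
like this (a hypothetical sketch: "-cl-mad-enable" is a standard OpenCL
build option, but the exact value format of *ocl_args* should be checked
against the package command documentation):

  package gpu 1 ocl_args -cl-mad-enable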

View File

@@ -2297,6 +2297,7 @@ omegaz
 Omelyan
 omp
 OMP
+oneAPI
 onelevel
 oneway
 onn
@@ -2528,6 +2529,7 @@ ptm
 PTM
 ptol
 ptr
+PTX
 pu
 purdue
 Purohit

View File

@@ -45,8 +45,10 @@ efficient use with MPI.
 After building the GPU library, for OpenCL:
 ./ocl_get_devices
-and for CUDA
+for CUDA:
 ./nvc_get_devices
+and for ROCm HIP:
+./hip_get_devices

 ------------------------------------------------------------------------------
 QUICK START