documentation corrections, spelling fixes and updates
@ -1,11 +1,14 @@
GPU package
===========

The GPU package was developed by Mike Brown while at SNL and ORNL (now
at Intel Corp.) and his collaborators, particularly Trung Nguyen (now at
Northwestern). Support for AMD GPUs via HIP was added by Vsevolod Nikolskiy
and coworkers at HSE University.

The GPU package provides GPU versions of many pair styles and for
parts of the :doc:`kspace_style pppm <kspace_style>` for long-range
Coulombics. It has the following general features:

* It is designed to exploit common GPU hardware configurations where one
  or more GPUs are coupled to many cores of one or more multi-core CPUs,
@ -24,8 +27,9 @@ It has the following general features:
  force vectors.
* LAMMPS-specific code is in the GPU package. It makes calls to a
  generic GPU library in the lib/gpu directory. This library provides
  either Nvidia support, AMD support, or more general OpenCL support
  (for Nvidia GPUs, AMD GPUs, Intel GPUs, and multi-core CPUs), so
  that the same functionality is supported on a variety of hardware.
**Required hardware/software:**
@ -89,10 +93,10 @@ shared by 4 MPI tasks.

The GPU package also has limited support for OpenMP for both
multi-threading and vectorization of routines that are run on the CPUs.
This requires that the GPU library and LAMMPS are built with flags to
enable OpenMP support (e.g. -fopenmp). Some styles for time integration
are also available in the GPU package. These run completely on the CPUs
in full double precision, but exploit multi-threading and vectorization
for faster performance.
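
One way these build flags might be supplied (a sketch; PKG_GPU and
BUILD_OMP are standard LAMMPS CMake options, but your build setup and
compiler flags may differ):

.. code-block:: bash

   # configure LAMMPS with the GPU package and OpenMP enabled;
   # -fopenmp is passed through the compiler flags
   cmake -D PKG_GPU=on -D BUILD_OMP=yes \
         -D CMAKE_CXX_FLAGS="-fopenmp" ../cmake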

Use the "-sf gpu" :doc:`command-line switch <Run_options>`, which will
automatically append "gpu" to styles that support it. Use the "-pk
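
For example (a sketch; the executable name, task count, GPU count, and
input file are placeholders):

.. code-block:: bash

   # append "gpu" to supported styles and use 2 GPUs per node
   mpirun -np 8 lmp -sf gpu -pk gpu 2 -in in.lj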
@ -159,11 +163,11 @@ Likewise, you should experiment with the precision setting for the GPU
library to see if single or mixed precision will give accurate
results, since they will typically be faster.

MPI parallelism typically outperforms OpenMP parallelism, but in some
cases using fewer MPI tasks and multiple OpenMP threads with the GPU
package can give better performance. 3-body potentials can often perform
better with multiple OMP threads because the inter-process communication
is higher for these styles with the GPU package in order to allow
deterministic results.
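
For example (a sketch; the task and thread counts and the input file
are placeholders), two MPI tasks with four OpenMP threads each might be
launched as:

.. code-block:: bash

   # fewer MPI tasks, multiple OpenMP threads per task
   export OMP_NUM_THREADS=4
   mpirun -np 2 lmp -sf gpu -pk gpu 1 -in in.sw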
**Guidelines for best performance:**
@ -189,6 +193,12 @@ deterministic results.
  :doc:`angle <angle_style>`, :doc:`dihedral <dihedral_style>`,
  :doc:`improper <improper_style>`, and :doc:`long-range <kspace_style>`
  calculations will not be included in the "Pair" time.
* Since only part of the pppm kspace style is GPU accelerated, it
  may be faster to only use GPU acceleration for Pair styles with
  long-range electrostatics. See the "pair/only" keyword of the
  package command for a shortcut to do that. The work between kspace
  on the CPU and non-bonded interactions on the GPU can be balanced
  through adjusting the coulomb cutoff without loss of accuracy.
* When the *mode* setting for the package gpu command is force/neigh,
  the time for neighbor list calculations on the GPU will be added into
  the "Pair" time, not the "Neigh" time. An additional breakdown of the
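
A minimal sketch of the "pair/only" shortcut mentioned above (the *yes*
value is assumed here from the usual package-command keyword
conventions; the input file is a placeholder):

.. code-block:: bash

   # accelerate only pair styles; run kspace (pppm) on the CPU
   lmp -sf gpu -pk gpu 1 pair/only yes -in in.spce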
@ -175,7 +175,7 @@ package.

The *Ngpu* argument sets the number of GPUs per node. If *Ngpu* is 0
and no other keywords are specified, GPU or accelerator devices are
auto-selected. In this process, all platforms are searched for
accelerator devices and GPUs are chosen if available. The device with
the highest number of compute cores is selected. The number of devices
is increased to be the number of matching accelerators with the same
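
For example (a sketch; the input file is a placeholder), auto-selection
is triggered by requesting zero GPUs with no other keywords:

.. code-block:: bash

   # Ngpu = 0 with no other keywords: devices are auto-selected
   lmp -sf gpu -pk gpu 0 -in in.melt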
@ -257,7 +257,8 @@ the other particles.

The *gpuID* keyword is used to specify the first ID for the GPU or
other accelerator that LAMMPS will use. For example, if the ID is
1 and *Ngpu* is 3, GPUs 1-3 will be used. Device IDs should be
determined from the output of nvc_get_devices, ocl_get_devices, or
hip_get_devices as provided in the lib/gpu directory. When using
OpenCL with accelerators that have main memory NUMA, the accelerators
can be split into smaller virtual accelerators for more efficient use
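
The example above (first ID 1, *Ngpu* = 3) might be written as follows
(a sketch; the input file is a placeholder):

.. code-block:: bash

   # use GPUs with device IDs 1, 2, and 3
   lmp -sf gpu -pk gpu 3 gpuID 1 -in in.lj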
@ -306,13 +307,14 @@ PPPM_MAX_SPLINE.

CONFIG_ID can be 0. SHUFFLE_AVAIL in {0,1} indicates that inline-PTX
(NVIDIA) or OpenCL extensions (Intel) should be used for horizontal
vector operations. FAST_MATH in {0,1} indicates that OpenCL fast math
optimizations are used during the build and hardware-accelerated
transcendental functions are used when available. THREADS_PER_* give the
default *tpa* values for ellipsoidal models, styles using charge, and
any other styles. The BLOCK_* parameters specify the block sizes for
various kernel calls and the MAX_*SHARED*_ parameters are used to
determine the amount of local shared memory to use for storing model
parameters.
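
As an illustration only (the parameter names come from the paragraph
above, but the exact spellings of the starred names, the ordering, and
all values here are hypothetical, not taken from an actual lib/gpu
configuration):

.. code-block:: none

   CONFIG_ID            0    # 0 selects the default configuration
   SHUFFLE_AVAIL        1    # inline-PTX (NVIDIA) / OpenCL extensions (Intel)
   FAST_MATH            1    # fast math + HW-accelerated transcendentals
   THREADS_PER_ATOM     4    # default tpa, ordinary styles (hypothetical name)
   THREADS_PER_CHARGE   8    # default tpa, charged styles (hypothetical name)
   BLOCK_PAIR         256    # block size for pair kernels (hypothetical name)
   MAX_SHARED_TYPES    11    # shared-memory budget (hypothetical name)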

For OpenCL, the routines are compiled at runtime for the specified GPU
or accelerator architecture. The *ocl_args* keyword can be used to
@ -2297,6 +2297,7 @@ omegaz
Omelyan
omp
OMP
oneAPI
onelevel
oneway
onn
@ -2528,6 +2529,7 @@ ptm
PTM
ptol
ptr
PTX
pu
purdue
Purohit
@ -45,8 +45,10 @@ efficient use with MPI.

After building the GPU library, for OpenCL:
./ocl_get_devices
for CUDA:
./nvc_get_devices
and for ROCm HIP:
./hip_get_devices

------------------------------------------------------------------------------
QUICK START