documentation corrections, spelling fixes and updates
@ -1,11 +1,14 @@
GPU package
===========

The GPU package was developed by Mike Brown while at SNL and ORNL (now
at Intel Corp.) and his collaborators, particularly Trung Nguyen (now at
Northwestern). Support for AMD GPUs via HIP was added by Vsevolod Nikolskiy
and coworkers at HSE University.

The GPU package provides GPU versions of many pair styles and for
parts of the :doc:`kspace_style pppm <kspace_style>` for long-range
Coulombics. It has the following general features:

* It is designed to exploit common GPU hardware configurations where one
  or more GPUs are coupled to many cores of one or more multi-core CPUs,
@ -24,8 +27,9 @@ It has the following general features:
  force vectors.
* LAMMPS-specific code is in the GPU package. It makes calls to a
  generic GPU library in the lib/gpu directory. This library provides
  either Nvidia support, AMD support, or more general OpenCL support
  (for Nvidia GPUs, AMD GPUs, Intel GPUs, and multi-core CPUs), so
  that the same functionality is supported on a variety of hardware.
**Required hardware/software:**
@ -89,10 +93,10 @@ shared by 4 MPI tasks.

The GPU package also has limited support for OpenMP for both
multi-threading and vectorization of routines that are run on the CPUs.
This requires that the GPU library and LAMMPS are built with flags to
enable OpenMP support (e.g. -fopenmp). Some styles for time integration
are also available in the GPU package. These run completely on the CPUs
in full double precision, but exploit multi-threading and vectorization
for faster performance.
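
One way these build flags might be supplied (a sketch; PKG_GPU and
BUILD_OMP are standard LAMMPS CMake options, but your build setup and
compiler flags may differ):

.. code-block:: bash

   # configure LAMMPS with the GPU package and OpenMP enabled;
   # -fopenmp is passed through the compiler flags
   cmake -D PKG_GPU=on -D BUILD_OMP=yes \
         -D CMAKE_CXX_FLAGS="-fopenmp" ../cmake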

Use the "-sf gpu" :doc:`command-line switch <Run_options>`, which will
automatically append "gpu" to styles that support it. Use the "-pk
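
For example (a sketch; the executable name, task count, GPU count, and
input file are placeholders):

.. code-block:: bash

   # append "gpu" to supported styles and use 2 GPUs per node
   mpirun -np 8 lmp -sf gpu -pk gpu 2 -in in.lj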
@ -159,11 +163,11 @@ Likewise, you should experiment with the precision setting for the GPU
library to see if single or mixed precision will give accurate
results, since they will typically be faster.

MPI parallelism typically outperforms OpenMP parallelism, but in some
cases using fewer MPI tasks and multiple OpenMP threads with the GPU
package can give better performance. 3-body potentials can often perform
better with multiple OMP threads because the inter-process communication
is higher for these styles with the GPU package in order to allow
deterministic results.
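
For example (a sketch; the task and thread counts and the input file
are placeholders), two MPI tasks with four OpenMP threads each might be
launched as:

.. code-block:: bash

   # fewer MPI tasks, multiple OpenMP threads per task
   export OMP_NUM_THREADS=4
   mpirun -np 2 lmp -sf gpu -pk gpu 1 -in in.sw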
**Guidelines for best performance:**
@ -189,6 +193,12 @@ deterministic results.
  :doc:`angle <angle_style>`, :doc:`dihedral <dihedral_style>`,
  :doc:`improper <improper_style>`, and :doc:`long-range <kspace_style>`
  calculations will not be included in the "Pair" time.
* Since only part of the pppm kspace style is GPU accelerated, it
  may be faster to only use GPU acceleration for Pair styles with
  long-range electrostatics. See the "pair/only" keyword of the
  package command for a shortcut to do that. The work between kspace
  on the CPU and non-bonded interactions on the GPU can be balanced
  through adjusting the coulomb cutoff without loss of accuracy.
* When the *mode* setting for the package gpu command is force/neigh,
  the time for neighbor list calculations on the GPU will be added into
  the "Pair" time, not the "Neigh" time. An additional breakdown of the
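
A minimal sketch of the "pair/only" shortcut mentioned above (the *yes*
value is assumed here from the usual package-command keyword
conventions; the input file is a placeholder):

.. code-block:: bash

   # accelerate only pair styles; run kspace (pppm) on the CPU
   lmp -sf gpu -pk gpu 1 pair/only yes -in in.spce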
@ -175,7 +175,7 @@ package.

The *Ngpu* argument sets the number of GPUs per node. If *Ngpu* is 0
and no other keywords are specified, GPU or accelerator devices are
auto-selected. In this process, all platforms are searched for
accelerator devices and GPUs are chosen if available. The device with
the highest number of compute cores is selected. The number of devices
is increased to be the number of matching accelerators with the same
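
For example (a sketch; the input file is a placeholder), auto-selection
is triggered by requesting zero GPUs with no other keywords:

.. code-block:: bash

   # Ngpu = 0 with no other keywords: devices are auto-selected
   lmp -sf gpu -pk gpu 0 -in in.melt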
@ -257,7 +257,8 @@ the other particles.

The *gpuID* keyword is used to specify the first ID for the GPU or
other accelerator that LAMMPS will use. For example, if the ID is
1 and *Ngpu* is 3, GPUs 1-3 will be used. Device IDs should be
determined from the output of nvc_get_devices, ocl_get_devices, or
hip_get_devices as provided in the lib/gpu directory. When using
OpenCL with accelerators that have main memory NUMA, the accelerators
can be split into smaller virtual accelerators for more efficient use
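
The example above (first ID 1, *Ngpu* = 3) might be written as follows
(a sketch; the input file is a placeholder):

.. code-block:: bash

   # use GPUs with device IDs 1, 2, and 3
   lmp -sf gpu -pk gpu 3 gpuID 1 -in in.lj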
@ -306,13 +307,14 @@ PPPM_MAX_SPLINE.

CONFIG_ID can be 0. SHUFFLE_AVAIL in {0,1} indicates that inline-PTX
(NVIDIA) or OpenCL extensions (Intel) should be used for horizontal
vector operations. FAST_MATH in {0,1} indicates that OpenCL fast math
optimizations are used during the build and hardware-accelerated
transcendental functions are used when available. THREADS_PER_* give the
default *tpa* values for ellipsoidal models, styles using charge, and
any other styles. The BLOCK_* parameters specify the block sizes for
various kernel calls and the MAX_*SHARED*_ parameters are used to
determine the amount of local shared memory to use for storing model
parameters.
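
As an illustration only (the parameter names come from the paragraph
above, but the exact spellings of the starred names, the ordering, and
all values here are hypothetical, not taken from an actual lib/gpu
configuration):

.. code-block:: none

   CONFIG_ID            0    # 0 selects the default configuration
   SHUFFLE_AVAIL        1    # inline-PTX (NVIDIA) / OpenCL extensions (Intel)
   FAST_MATH            1    # fast math + HW-accelerated transcendentals
   THREADS_PER_ATOM     4    # default tpa, ordinary styles (hypothetical name)
   THREADS_PER_CHARGE   8    # default tpa, charged styles (hypothetical name)
   BLOCK_PAIR         256    # block size for pair kernels (hypothetical name)
   MAX_SHARED_TYPES    11    # shared-memory budget (hypothetical name)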

For OpenCL, the routines are compiled at runtime for the specified GPU
or accelerator architecture. The *ocl_args* keyword can be used to
@ -2297,6 +2297,7 @@ omegaz
Omelyan
omp
OMP
oneAPI
onelevel
oneway
onn
@ -2528,6 +2529,7 @@ ptm
PTM
ptol
ptr
PTX
pu
purdue
Purohit
@ -45,8 +45,10 @@ efficient use with MPI.

After building the GPU library, for OpenCL:
./ocl_get_devices
for CUDA:
./nvc_get_devices
and for ROCm HIP:
./hip_get_devices

------------------------------------------------------------------------------
QUICK START