From f367e66abafe2cd4bd7bc4d63e25118259612419 Mon Sep 17 00:00:00 2001
From: Axel Kohlmeyer
Date: Wed, 17 Feb 2021 18:47:35 -0500
Subject: [PATCH] documentation corrections, spelling fixes and updates

---
 doc/src/Speed_gpu.rst                       | 42 +++++++++++++--------
 doc/src/package.rst                         | 20 +++++-----
 doc/utils/sphinx-config/false_positives.txt |  2 +
 lib/gpu/README                              |  4 +-
 4 files changed, 42 insertions(+), 26 deletions(-)

diff --git a/doc/src/Speed_gpu.rst b/doc/src/Speed_gpu.rst
index 655f2e1958..709a3ad3bb 100644
--- a/doc/src/Speed_gpu.rst
+++ b/doc/src/Speed_gpu.rst
@@ -1,11 +1,14 @@
 GPU package
 ===========
 
-The GPU package was developed by Mike Brown while at SNL and ORNL
-and his collaborators, particularly Trung Nguyen (now at Northwestern).
-It provides GPU versions of many pair styles and for parts of the
-:doc:`kspace_style pppm <kspace_style>` for long-range Coulombics.
-It has the following general features:
+The GPU package was developed by Mike Brown while at SNL and ORNL (now
+at Intel Corp.) and his collaborators, particularly Trung Nguyen (now at
+Northwestern). Support for AMD GPUs via HIP was added by Vsevolod Nikolskiy
+and coworkers at HSE University.
+
+The GPU package provides GPU versions of many pair styles and for
+parts of the :doc:`kspace_style pppm <kspace_style>` for long-range
+Coulombics. It has the following general features:
 
 * It is designed to exploit common GPU hardware configurations where one
   or more GPUs are coupled to many cores of one or more multi-core CPUs,
@@ -24,8 +27,9 @@ It has the following general features:
   force vectors.
 * LAMMPS-specific code is in the GPU package. It makes calls to a
   generic GPU library in the lib/gpu directory. This library provides
-  NVIDIA support as well as more general OpenCL support, so that the
-  same functionality is supported on a variety of hardware.
+  either Nvidia support, AMD support, or more general OpenCL support
+  (for Nvidia GPUs, AMD GPUs, Intel GPUs, and multi-core CPUs),
+  so that the same functionality is supported on a variety of hardware.
 
 **Required hardware/software:**
 
@@ -89,10 +93,10 @@ shared by 4 MPI tasks.
 The GPU package also has limited support for OpenMP for both
 multi-threading and vectorization of routines that are run on the CPUs.
 This requires that the GPU library and LAMMPS are built with flags to
-enable OpenMP support (e.g. -fopenmp -fopenmp-simd). Some styles for
-time integration are also available in the GPU package. These run
-completely on the CPUs in full double precision, but exploit
-multi-threading and vectorization for faster performance.
+enable OpenMP support (e.g. -fopenmp). Some styles for time integration
+are also available in the GPU package. These run completely on the CPUs
+in full double precision, but exploit multi-threading and vectorization
+for faster performance.
 
 Use the "-sf gpu" :doc:`command-line switch <Run_options>`, which will
 automatically append "gpu" to styles that support it. Use the "-pk
@@ -159,11 +163,11 @@ Likewise, you should experiment with the precision setting for the GPU
 library to see if single or mixed precision will give accurate results,
 since they will typically be faster.
 
-MPI parallelism typically outperforms OpenMP parallelism, but in same cases
-using fewer MPI tasks and multiple OpenMP threads with the GPU package
-can give better performance. 3-body potentials can often perform better
-with multiple OMP threads because the inter-process communication is
-higher for these styles with the GPU package in order to allow
+MPI parallelism typically outperforms OpenMP parallelism, but in some
+cases using fewer MPI tasks and multiple OpenMP threads with the GPU
+package can give better performance. 3-body potentials can often perform
+better with multiple OMP threads because the inter-process communication
+is higher for these styles with the GPU package in order to allow
 deterministic results.
 
 **Guidelines for best performance:**
@@ -189,6 +193,12 @@ deterministic results.
   :doc:`angle <angle_style>`, :doc:`dihedral <dihedral_style>`,
   :doc:`improper <improper_style>`, and :doc:`long-range <kspace_style>`
   calculations will not be included in the "Pair" time.
+* Since only part of the pppm kspace style is GPU accelerated, it
+  may be faster to only use GPU acceleration for Pair styles with
+  long-range electrostatics. See the "pair/only" keyword of the
+  package command for a shortcut to do that. The work between kspace
+  on the CPU and non-bonded interactions on the GPU can be balanced
+  through adjusting the coulomb cutoff without loss of accuracy.
 * When the *mode* setting for the package gpu command is force/neigh,
   the time for neighbor list calculations on the GPU will be added into
   the "Pair" time, not the "Neigh" time. An additional breakdown of the
diff --git a/doc/src/package.rst b/doc/src/package.rst
index a091759214..aea4ba657f 100644
--- a/doc/src/package.rst
+++ b/doc/src/package.rst
@@ -175,7 +175,7 @@ package.
 
 The *Ngpu* argument sets the number of GPUs per node. If *Ngpu* is
 0 and no other keywords are specified, GPU or accelerator devices are
-autoselected. In this process, all platforms are searched for
+auto-selected. In this process, all platforms are searched for
 accelerator devices and GPUs are chosen if available. The device with
 the highest number of compute cores is selected. The number of devices
 is increased to be the number of matching accelerators with the same
@@ -257,7 +257,8 @@ the other particles.
 The *gpuID* keyword is used to specify the first ID for the GPU or
 other accelerator that LAMMPS will use. For example, if the ID is 1
 and *Ngpu* is 3, GPUs 1-3 will be used. Device IDs should be
-determined from the output of nvc_get_devices or ocl_get_devices
+determined from the output of nvc_get_devices, ocl_get_devices,
+or hip_get_devices
 as provided in the lib/gpu directory. When using OpenCL with
 accelerators that have main memory NUMA, the accelerators can be
 split into smaller virtual accelerators for more efficient use
@@ -306,13 +307,14 @@ PPPM_MAX_SPLINE.
 
 CONFIG_ID can be 0. SHUFFLE_AVAIL in {0,1} indicates that inline-PTX
 (NVIDIA) or OpenCL extensions (Intel) should be used for horizontal
-vector operataions. FAST_MATH in {0,1} indicates that OpenCL fast math
-optimizations are used during the build and HW-accelerated
-transcendentals are used when available. THREADS_PER_* give the default
-*tpa* values for ellipsoidal models, styles using charge, and any other
-styles. The BLOCK_* parameters specify the block sizes for various
-kernal calls and the MAX_*SHARED*_ parameters are used to determine the
-amount of local shared memory to use for storing model parameters.
+vector operations. FAST_MATH in {0,1} indicates that OpenCL fast math
+optimizations are used during the build and hardware-accelerated
+transcendental functions are used when available. THREADS_PER_* give the
+default *tpa* values for ellipsoidal models, styles using charge, and
+any other styles. The BLOCK_* parameters specify the block sizes for
+various kernel calls and the MAX_*SHARED*_ parameters are used to
+determine the amount of local shared memory to use for storing model
+parameters.
 
 For OpenCL, the routines are compiled at runtime for the specified
 GPU or accelerator architecture. The *ocl_args* keyword can be used to
diff --git a/doc/utils/sphinx-config/false_positives.txt b/doc/utils/sphinx-config/false_positives.txt
index 9937a98850..982e1fde2a 100644
--- a/doc/utils/sphinx-config/false_positives.txt
+++ b/doc/utils/sphinx-config/false_positives.txt
@@ -2297,6 +2297,7 @@ omegaz
 Omelyan
 omp
 OMP
+oneAPI
 onelevel
 oneway
 onn
@@ -2528,6 +2529,7 @@ ptm
 PTM
 ptol
 ptr
+PTX
 pu
 purdue
 Purohit
diff --git a/lib/gpu/README b/lib/gpu/README
index 28655836f4..dfffe11b81 100644
--- a/lib/gpu/README
+++ b/lib/gpu/README
@@ -45,8 +45,10 @@ efficient use with MPI.
 
 After building the GPU library, for OpenCL:
   ./ocl_get_devices
-and for CUDA
+for CUDA:
   ./nvc_get_devices
+and for ROCm HIP:
+  ./hip_get_devices
 
 ------------------------------------------------------------------------------
 QUICK START
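
For quick reference, here is a command-line sketch of the workflow the patched
documentation describes. The binary name `lmp` and the input file `in.melt` are
placeholders for whatever a given build and simulation use, and the flags assume
a GPU-enabled LAMMPS build:

```shell
# List the accelerator devices visible to the GPU library.
# These helper binaries are built in lib/gpu, as noted in the README above:
./ocl_get_devices    # OpenCL build
./nvc_get_devices    # CUDA build
./hip_get_devices    # ROCm HIP build

# Run with the gpu suffix enabled: "-sf gpu" appends "gpu" to styles that
# support it, "-pk gpu 2" requests 2 GPUs per node, and "pair/only on"
# restricts acceleration to pair styles so pppm kspace stays on the CPU
# (the balancing shortcut described in the Speed_gpu.rst hunk above).
mpirun -np 8 lmp -sf gpu -pk gpu 2 pair/only on -in in.melt
```

With `pair/only on`, the CPU/GPU load balance can then be tuned by adjusting the
coulomb cutoff, as the doc change notes, without loss of accuracy.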