From 4afdf493d7f50a16b26220648efa9d5dab0ca9f3 Mon Sep 17 00:00:00 2001
From: Axel Kohlmeyer <akohlmey@gmail.com>
Date: Wed, 15 Jan 2025 06:05:25 -0500
Subject: [PATCH] integrate and adapt contents from PR #4028 by @alphataubio

---
 doc/src/Speed_kokkos.rst | 80 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 70 insertions(+), 10 deletions(-)
diff --git a/doc/src/Speed_kokkos.rst b/doc/src/Speed_kokkos.rst
index 9f8dcf8340..f8379949a4 100644
--- a/doc/src/Speed_kokkos.rst
+++ b/doc/src/Speed_kokkos.rst
@@ -67,6 +67,14 @@ version 23 November 2023 and Kokkos version 4.2.
    To build with Kokkos support for AMD GPUs, the AMD ROCm toolkit
    software version 5.2.0 or later must be installed on your system.
 
+.. admonition:: Intel Data Center GPU support
+   :class: note
+
+   Support for Kokkos with Intel Data Center GPU accelerators (formerly
+   known under the code name "Ponte Vecchio") in LAMMPS is still a work
+   in progress.  Only a subset of the functionality works correctly.
+   Please contact the LAMMPS developers if you run into problems.
+
 .. admonition:: CUDA and MPI library compatibility
    :class: note
 
@@ -80,13 +88,15 @@ version 23 November 2023 and Kokkos version 4.2.
    LAMMPS command-line or by using the command :doc:`package kokkos
    gpu/aware off <package>` in the input file.
 
-.. admonition:: Intel Data Center GPU support
+.. admonition:: Using multiple MPI ranks per GPU
    :class: note
 
-   Support for Kokkos with Intel Data Center GPU accelerators (formerly
-   known under the code name "Ponte Vecchio") in LAMMPS is still a work
-   in progress.  Only a subset of the functionality works correctly.
-   Please contact the LAMMPS developers if you run into problems.
+   Unlike with the GPU package, there are limited benefits from using
+   multiple MPI processes per GPU with KOKKOS.  If you do this anyway,
+   it is **required** to enable CUDA MPS (`Multi-Process Service :: GPU
+   Deployment and Management Documentation
+   <https://docs.nvidia.com/deploy/mps/index.html>`_) to get acceptable
+   performance.
 
 Building LAMMPS with the KOKKOS package
 """""""""""""""""""""""""""""""""""""""
@@ -365,13 +375,13 @@ one or more nodes, each with two GPUs:
 
 .. note::
 
-   When using a GPU, you will achieve the best performance if your
-   input script does not use fix or compute styles which are not yet
+   When using a GPU, you will achieve the best performance if your input
+   script does not use fix or compute styles which are not yet
    Kokkos-enabled. This allows data to stay on the GPU for multiple
    timesteps, without being copied back to the host CPU. Invoking a
-   non-Kokkos fix or compute, or performing I/O for
-   :doc:`thermo <thermo_style>` or :doc:`dump <dump>` output will cause data
-   to be copied back to the CPU incurring a performance penalty.
+   non-Kokkos fix or compute, or performing I/O for :doc:`thermo
+   <thermo_style>` or :doc:`dump <dump>` output will cause data to be
+   copied back to the CPU, incurring a performance penalty.
 
 .. note::
 
@@ -379,6 +389,56 @@ one or more nodes, each with two GPUs:
    kspace, etc., you must set the environment variable ``CUDA_LAUNCH_BLOCKING=1``.
    However, this will reduce performance and is not recommended for production runs.
 
+Troubleshooting segmentation faults on GPUs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above, KOKKOS by default assumes that the MPI library is
+GPU-aware.  This is not always the case and can lead to segmentation
+faults when using more than one MPI process.  Normally, LAMMPS will
+either print a warning like "*Turning off GPU-aware MPI since it is not
+detected*", stop with an error message like "*Kokkos with GPU-enabled
+backend assumes GPU-aware MPI is available*", or crash with a
+**segmentation fault**.  To confirm that a segmentation fault is caused
+by this, you can turn off the GPU-aware assumption via the :doc:`package
+kokkos command <package>` or the corresponding command-line flag.
+
+If you still get a segmentation fault despite running with only one
+MPI process or after turning off the GPU-aware MPI assumption, then
+rebuilding LAMMPS with the CMake setting
+``-DKokkos_ENABLE_DEBUG=on``, or with ``KOKKOS_DEBUG=yes`` added to
+your machine makefile when building with traditional make, will
+generate useful output that can be passed to the LAMMPS developers
+for further debugging.
+
+Troubleshooting memory allocation on GPUs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+`Kokkos Tools <https://github.com/kokkos/kokkos-tools/>`_ provides a set
+of lightweight profiling and debugging utilities (e.g. `space-time-stack
+<https://github.com/kokkos/kokkos-tools/tree/develop/profiling/space-time-stack>`_),
+which interface with instrumentation hooks built directly into the
+Kokkos runtime.  After compiling a dynamic LAMMPS library, you then have
+to set the environment variable ``KOKKOS_TOOLS_LIBS`` before executing
+your LAMMPS Kokkos run.  Example:
+
+.. code-block:: bash
+
+    export KOKKOS_TOOLS_LIBS=${HOME}/kokkos-tools/src/tools/memory-events/kp_memory_event.so
+    mpirun -np 4 lmp_kokkos_cuda_openmpi -in in.lj -k on g 4 -sf kk
+
+Starting with the NVIDIA Pascal GPU architecture, CUDA supports
+`"Unified Virtual Memory" (UVM)
+<https://developer.nvidia.com/blog/unified-memory-cuda-beginners/>`_,
+which enables allocating more memory than a GPU possesses by also using
+memory on the host CPU.  CUDA then transparently moves data between
+CPU and GPU as needed.  The resulting LAMMPS performance depends
+on `memory access pattern, data residency, and GPU memory
+oversubscription
+<https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/>`_.
+The CMake option ``-DKokkos_ENABLE_CUDA_UVM=on`` or the makefile
+setting ``KOKKOS_CUDA_OPTIONS=enable_lambda,force_uvm`` enables using
+:ref:`UVM with Kokkos <kokkos>` when compiling LAMMPS.
+
 Run with the KOKKOS package by editing an input script
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^