From de0baba124751709afea927b97e2976c1dbd24f1 Mon Sep 17 00:00:00 2001
From: Axel Kohlmeyer
Date: Fri, 27 Dec 2024 03:50:51 -0500
Subject: [PATCH] add updates/corrections, improve formatting

---
 doc/src/Speed_gpu.rst   | 41 ++++++++++++++++++++++++++---------------
 doc/src/Speed_intel.rst | 42 +++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 36 deletions(-)

diff --git a/doc/src/Speed_gpu.rst b/doc/src/Speed_gpu.rst
index 42bd8bf059..c456042452 100644
--- a/doc/src/Speed_gpu.rst
+++ b/doc/src/Speed_gpu.rst
@@ -31,7 +31,8 @@ Coulombics. It has the following general features:
 (for Nvidia GPUs, AMD GPUs, Intel GPUs, and multicore CPUs).
 so that the same functionality is supported on a variety of
 hardware.
-**Required hardware/software:**
+Required hardware/software
+""""""""""""""""""""""""""
 
 To compile and use this package in CUDA mode, you currently
 need to have an NVIDIA GPU and install the corresponding NVIDIA CUDA
@@ -69,12 +70,14 @@
 To compile and use this package in HIP mode, you have to have the
 AMD ROCm software installed. Versions of ROCm older than 3.5 are
 currently deprecated by AMD.
 
-**Building LAMMPS with the GPU package:**
+Building LAMMPS with the GPU package
+""""""""""""""""""""""""""""""""""""
 
 See the :ref:`Build extras <build_extras>` page for instructions.
 
-**Run with the GPU package from the command line:**
+Run with the GPU package from the command line
+""""""""""""""""""""""""""""""""""""""""""""""
 
 The ``mpirun`` or ``mpiexec`` command sets the total number of MPI
 tasks used by LAMMPS (one or multiple per compute node) and the number of MPI
@@ -133,7 +136,8 @@ affect the setting for bonded interactions (LAMMPS default is "on").
 The "off" setting for pairwise interaction is currently required for
 GPU package pair styles.
 
-**Or run with the GPU package by editing an input script:**
+Run with the GPU package by editing an input script
+"""""""""""""""""""""""""""""""""""""""""""""""""""
 
 The discussion above for the ``mpirun`` or ``mpiexec`` command, MPI
 tasks/node, and use of multiple MPI tasks/GPU is the same.
@@ -149,7 +153,8 @@ You must also use the :doc:`package gpu <package>` command to enable the
 GPU package, unless the ``-sf gpu`` or ``-pk gpu`` :doc:`command-line switches <Run_options>` were used.
 It specifies the number of GPUs/node to use, as well as other options.
 
-**Speed-ups to expect:**
+Speed-up to expect
+""""""""""""""""""
 
 The performance of a GPU versus a multicore CPU is a function of your
 hardware, which pair style is used, the number of atoms/GPU, and the
@@ -176,10 +181,13 @@ better with multiple OMP threads because the inter-process
 communication is higher for these styles with the GPU package in order
 to allow deterministic results.
 
-**Guidelines for best performance:**
+Guidelines for best performance
+"""""""""""""""""""""""""""""""
 
-* Using multiple MPI tasks per GPU will often give the best performance,
-  as allowed my most multicore CPU/GPU configurations.
+* Using multiple MPI tasks (2-10) per GPU will often give the best
+  performance, as allowed by most multicore CPU/GPU configurations.
+  Using too many MPI tasks will result in worse performance due to
+  growing overhead.
 * If the number of particles per MPI task is small (e.g. 100s of
   particles), it can be more efficient to run with fewer MPI tasks per
   GPU, even if you do not use all the cores on the compute node.
@@ -199,12 +207,13 @@
 :doc:`angle <angle_style>`, :doc:`dihedral <dihedral_style>`,
 :doc:`improper <improper_style>`, and :doc:`long-range <kspace_style>`
 calculations will not be included in the "Pair" time.
-* Since only part of the pppm kspace style is GPU accelerated, it
-  may be faster to only use GPU acceleration for Pair styles with
-  long-range electrostatics. See the "pair/only" keyword of the
-  package command for a shortcut to do that. The work between kspace
-  on the CPU and non-bonded interactions on the GPU can be balanced
-  through adjusting the coulomb cutoff without loss of accuracy.
+* Since only part of the pppm kspace style is GPU accelerated, it may be
+  faster to use GPU acceleration only for pair styles with long-range
+  electrostatics. See the "pair/only" keyword of the :doc:`package
+  command <package>` for a shortcut to do that. The distribution of
+  work between kspace on the CPU and non-bonded interactions on the GPU
+  can be balanced by adjusting the Coulomb cutoff without loss of
+  accuracy.
 * When the *mode* setting for the package gpu command is force/neigh,
   the time for neighbor list calculations on the GPU will be added into
   the "Pair" time, not the "Neigh" time. An additional breakdown of the
@@ -220,4 +229,6 @@
 Restrictions
 """"""""""""
 
-None.
+When using :doc:`hybrid pair styles <pair_hybrid>`, the neighbor list
+must be generated on the host instead of the GPU, and thus the potential
+GPU acceleration is reduced.
diff --git a/doc/src/Speed_intel.rst b/doc/src/Speed_intel.rst
index dd6c27b4e7..78a88f4407 100644
--- a/doc/src/Speed_intel.rst
+++ b/doc/src/Speed_intel.rst
@@ -1,5 +1,5 @@
 INTEL package
-==================
+=============
 
 The INTEL package is maintained by Mike Brown at Intel
 Corporation. It provides two methods for accelerating simulations,
@@ -13,18 +13,18 @@
 twice, once on the CPU and once with an offload flag. This allows
 LAMMPS to run on the CPU cores and co-processor cores simultaneously.
 
 Currently Available INTEL Styles
-"""""""""""""""""""""""""""""""""""""
+""""""""""""""""""""""""""""""""
 
 * Angle Styles: charmm, harmonic
-* Bond Styles: fene, fourier, harmonic
+* Bond Styles: fene, harmonic
 * Dihedral Styles: charmm, fourier, harmonic, opls
-* Fixes: nve, npt, nvt, nvt/sllod, nve/asphere
+* Fixes: nve, npt, nvt, nvt/sllod, nve/asphere, electrode/conp, electrode/conq, electrode/thermo
 * Improper Styles: cvff, harmonic
 * Pair Styles: airebo, airebo/morse, buck/coul/cut, buck/coul/long,
   buck, dpd, eam, eam/alloy, eam/fs, gayberne, lj/charmm/coul/charmm,
   lj/charmm/coul/long, lj/cut, lj/cut/coul/long, lj/long/coul/long,
-  rebo, sw, tersoff
-* K-Space Styles: pppm, pppm/disp
+  rebo, snap, sw, tersoff
+* K-Space Styles: pppm, pppm/disp, pppm/electrode
 
 .. warning::
@@ -33,7 +33,7 @@
 input requires it, LAMMPS will abort with an error message.
 
 Speed-up to expect
-"""""""""""""""""""
+""""""""""""""""""
 
 The speedup will depend on your simulation, the hardware, which
 styles are used, the number of atoms, and the floating-point
@@ -312,21 +312,21 @@ almost all cases.
 recommended, especially when running on a machine with Intel
 Hyper-Threading technology disabled.
 
 Run with the INTEL package from the command line
-"""""""""""""""""""""""""""""""""""""""""""""""""""""
+""""""""""""""""""""""""""""""""""""""""""""""""
 
-To enable INTEL optimizations for all available styles used in
-the input script, the ``-sf intel`` :doc:`command-line switch <Run_options>` can be used without any requirement for
-editing the input script. This switch will automatically append
-"intel" to styles that support it. It also invokes a default command:
-:doc:`package intel 1 <package>`. This package command is used to set
-options for the INTEL package. The default package command will
-specify that INTEL calculations are performed in mixed precision,
-that the number of OpenMP threads is specified by the OMP_NUM_THREADS
-environment variable, and that if co-processors are present and the
-binary was built with offload support, that 1 co-processor per node
-will be used with automatic balancing of work between the CPU and the
-co-processor.
+To enable INTEL optimizations for all available styles used in the input
+script, the ``-sf intel`` :doc:`command-line switch <Run_options>` can
+be used without editing the input script. This switch will
+automatically append "intel" to styles that support it. It also invokes
+a default command: :doc:`package intel 1 <package>`. This package
+command is used to set options for the INTEL package. The default
+package command will specify that INTEL calculations are performed in
+mixed precision, that the number of OpenMP threads is specified by the
+OMP_NUM_THREADS environment variable, and that, if co-processors are
+present and the binary was built with offload support, 1 co-processor
+per node will be used with automatic balancing of work between the CPU
+and the co-processor.
 
 You can specify different options for the INTEL package by using the
 ``-pk intel Nphi`` :doc:`command-line switch <Run_options>` with
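
As a quick illustration of the GPU package launch commands discussed in
the Speed_gpu.rst changes above, here is a minimal command-line sketch.
It is a hedged example, not a definitive recipe: the binary name
``lmp``, the task and GPU counts, and the input file ``in.script`` are
placeholders, and the last line assumes the "pair/only" package keyword
referenced in the patched text takes on/off values.

.. code-block:: bash

   # one MPI task driving one GPU; -sf gpu appends "gpu" to supported
   # styles, -pk gpu 1 sets the number of GPUs per node
   lmp -sf gpu -pk gpu 1 -in in.script

   # multiple MPI tasks (here 12) on a node sharing 2 GPUs
   mpirun -np 12 lmp -sf gpu -pk gpu 2 -in in.script

   # keep kspace (pppm) on the CPU and accelerate only the pair style
   mpirun -np 12 lmp -sf gpu -pk gpu 2 pair/only on -in in.script

The same effect can be obtained without command-line switches by placing
the corresponding :doc:`package gpu <package>` and :doc:`suffix <suffix>`
commands at the top of the input script, as described in the "editing an
input script" section of the patched page.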
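Along the same lines, a hedged sketch for the INTEL package
command-line usage described in the Speed_intel.rst changes; the task
count, thread count, and file names are illustrative assumptions only.

.. code-block:: bash

   # -sf intel appends "intel" to supported styles and implies the
   # default "package intel 1" (mixed precision, OMP_NUM_THREADS threads)
   mpirun -np 8 lmp -sf intel -in in.script

   # override the default package command: no coprocessors (Nphi = 0)
   # and 4 OpenMP threads per MPI task
   mpirun -np 8 lmp -sf intel -pk intel 0 omp 4 -in in.script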