some rewrite/update of the accelerator comparison page removing outdated info

Comparison of various accelerator packages :h3

NOTE: this section still needs to be re-worked with additional KOKKOS
and USER-INTEL information.

This section compares and contrasts the various accelerator
options, since there are multiple ways to perform OpenMP threading,
run on GPUs, optimize for vector units on CPUs, and run on Intel
Xeon Phi (co-)processors.

All of these packages can accelerate a LAMMPS calculation by taking
advantage of hardware features, but they do it in different ways,
and acceleration is not always guaranteed.

As a consequence, for a particular simulation on specific hardware,
one package may be faster than the other. We give some guidelines
below, but the best way to determine which package is faster for your
input script is to try several of them on your machine and experiment
with the available performance tuning settings. See the benchmarking
section below for examples where this has been done.

[Guidelines for using each package optimally:]

Both the GPU and KOKKOS packages allow you to assign multiple
MPI ranks (= CPU cores) to the same GPU. For the GPU package, this
can lead to a speedup through better utilization of the GPU (by
overlapping computation and data transfer) and more efficient
computation of the non-GPU accelerated parts of LAMMPS through MPI
parallelization, as all system data is maintained and updated on
the host. For KOKKOS, there is little to no benefit from this, due
to its different memory management model, which tries to retain
data on the GPU. An example launch command is sketched after
this list. :ulb,l

The GPU package moves per-atom data (coordinates, forces, and
(optionally) neighbor list data, if not computed on the GPU) between
the CPU and GPU at every timestep. The KOKKOS/CUDA package only does
this on timesteps when a CPU calculation is required (e.g. to invoke
a fix or compute that is non-GPU-ized). Hence, if you can formulate
your input script to only use GPU-ized fixes and computes, and avoid
doing I/O too often (thermo output, dump file snapshots, restart files),
then the data transfer cost of the KOKKOS/CUDA package can be very low,
causing it to run faster than the GPU package. :l

The GPU package is often faster than the KOKKOS/CUDA package when the
number of atoms per GPU is on the smaller side. The crossover point,
in terms of atoms/GPU, at which the KOKKOS/CUDA package becomes faster
depends strongly on the pair style. For example, for a simple Lennard-Jones
system the crossover (in single precision) is often about 50K-100K
atoms per GPU. When performing double precision calculations the
crossover point can be significantly smaller. :l

Both the KOKKOS and GPU packages compute bonded interactions (bonds, angles,
etc) on the CPU. If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster in these
cases. :l

When running LAMMPS with multiple MPI ranks assigned to the same GPU,
performance depends to some extent on the available bandwidth between
the CPUs and the GPU. This can differ significantly based on the
available bus technology, the capabilities of the host CPU and mainboard,
the wiring of the buses, and whether switches are used to increase the
number of available bus slots, or whether GPUs are housed in an external
enclosure. This can become quite complex. :l

To achieve significant acceleration through GPUs, both the KOKKOS and GPU
packages require capable GPUs with fast on-device memory and efficient
data transfer rates. In practice, this means upper mid-range to high-end
(desktop) GPUs. Using lower performance GPUs (e.g. on laptops) may
result in a slowdown instead. :l

For the GPU package, specifically when running in parallel with MPI,
it is often more efficient to exclude the PPPM kspace style from GPU
acceleration and instead run it - concurrently with a GPU accelerated
pair style - on the CPU. This can often be easily achieved by placing
a {suffix off} command before and a {suffix on} command after the
{kspace_style pppm} command, as sketched after this list. :l

The KOKKOS/OpenMP and USER-OMP packages have different thread management
strategies, which should result in USER-OMP being more efficient for a
small number of threads, with increasing overhead as the number of threads
per MPI rank grows. The KOKKOS/OpenMP kernels have less overhead in that
case, but have lower performance with few threads. Example launch commands
for both are sketched after this list. :l

The USER-INTEL package contains many options and settings for achieving
additional performance on Intel hardware (CPU and accelerator cards), but
to unlock this potential, an Intel compiler is required. The package code
will compile with GNU gcc, but it will not be as efficient. A command line
sketch for this package also follows after this list. :l

:ule
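
As an illustration of the first guideline above, a launch command that
assigns multiple MPI ranks to each GPU with the GPU package could look
as follows; this is only a sketch, and the executable name, MPI rank
count, and GPU count are placeholders for your own setup:

mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script  # 12 MPI ranks share 2 GPUs :pre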
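
The PPPM exclusion described above can be sketched in an input script
as follows; the accuracy argument is only an illustrative placeholder:

suffix off
kspace_style pppm 1.0e-4
suffix on :pre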
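
For the thread management comparison, here is a minimal sketch of how
the two OpenMP variants are launched, assuming 4 MPI ranks with 4
threads each and a suitably compiled executable (names are placeholders):

mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script   # USER-OMP, 4 threads per rank
mpirun -np 4 lmp_machine -k on t 4 -sf kk -in in.script    # KOKKOS/OpenMP, 4 threads per rank :pre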
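
Similarly, a possible command line for enabling the USER-INTEL styles
on a CPU-only machine (no coprocessors); the executable name and rank
count are again placeholders:

mpirun -np 8 lmp_machine -sf intel -pk intel 0 -in in.script :pre
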
[Differences between the GPU and KOKKOS packages:]

The GPU package accelerates only pair force, neighbor list, and (parts
of) PPPM calculations. The KOKKOS package attempts to run most of the
calculation on the GPU, but can transparently support non-accelerated
code (with a performance penalty due to having data transfers between
host and GPU). :ulb,l

The GPU package requires neighbor lists to be built on the CPU when using
exclusion lists, hybrid pair styles, or a triclinic simulation box. :l

The GPU package can be compiled for CUDA or OpenCL and thus supports
both Nvidia and AMD GPUs well. On Nvidia hardware, using CUDA typically
results in equal or better performance compared to OpenCL. A build
sketch for the two variants follows after this list. :l

OpenCL in the GPU package does, in principle, also support Intel CPUs and
Intel Xeon Phi, but the native support for those in KOKKOS (or USER-INTEL)
is superior. :l

:ule
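
As a rough sketch of the two build variants of the GPU package library,
assuming the conventional makefiles in the lib/gpu directory of the
LAMMPS source tree (actual makefile names may differ for your platform
and LAMMPS version):

cd lib/gpu
make -f Makefile.linux           # CUDA build
make -f Makefile.linux_opencl    # OpenCL build
cd ../../src
make yes-gpu
make mpi :pre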