some rewrite/update of the accelerator comparison page removing outdated info

Comparison of various accelerator packages :h3

NOTE: this section still needs to be re-worked with additional KOKKOS
and USER-INTEL information.

This section compares and contrasts the various accelerator
options, since there are multiple ways to perform OpenMP threading,
run on GPUs, optimize for vector units on CPUs, and run on Intel
Xeon Phi (co-)processors.

All of these packages can accelerate a LAMMPS calculation by taking
advantage of hardware features, but they do it in different ways,
and acceleration is not always guaranteed.

As a consequence, for a particular simulation on specific hardware,
one package may be faster than the other. We give some guidelines
below, but the best way to determine which package is faster for your
input script is to try several of them on your machine and experiment
with the available performance tuning settings. See the benchmarking
section below for examples where this has been done.

[Guidelines for using each package optimally:]

Both the GPU and KOKKOS packages allow you to assign multiple
MPI ranks (= CPU cores) to the same GPU. For the GPU package, this
can lead to a speedup through better utilization of the GPU (by
overlapping computation and data transfer) and more efficient
computation of the non-GPU accelerated parts of LAMMPS through MPI
parallelization, as all system data is maintained and updated on
the host. For KOKKOS, there is little to no benefit from this, due
to its different memory management model, which tries to retain
data on the GPU. An example launch command is sketched after
this list. :ulb,l

The GPU package moves per-atom data (coordinates, forces, and
(optionally) neighbor list data, if not computed on the GPU) between
the CPU and GPU at every timestep. The KOKKOS/CUDA package only does
this on timesteps when a CPU calculation is required (e.g. to invoke
a fix or compute that is non-GPU-ized). Hence, if you can formulate
your input script to only use GPU-ized fixes and computes, and avoid
doing I/O too often (thermo output, dump file snapshots, restart files),
then the data transfer cost of the KOKKOS/CUDA package can be very low,
causing it to run faster than the GPU package. :l

The GPU package is often faster than the KOKKOS/CUDA package when the
number of atoms per GPU is on the smaller side. The crossover point,
in terms of atoms/GPU, at which the KOKKOS/CUDA package becomes faster
depends strongly on the pair style. For example, for a simple Lennard-Jones
system the crossover (in single precision) is often about 50K-100K
atoms per GPU. When performing double precision calculations the
crossover point can be significantly smaller. :l

Both the KOKKOS and GPU packages compute bonded interactions (bonds, angles,
etc) on the CPU. If the GPU package is running with several MPI processes
assigned to one GPU, the cost of computing the bonded interactions is
spread across more CPUs and hence the GPU package can run faster in these
cases. :l

When running LAMMPS with multiple MPI ranks assigned to the same GPU,
performance depends to some extent on the available bandwidth between
the CPUs and the GPU. This can differ significantly based on the
available bus technology, the capabilities of the host CPU and mainboard,
the wiring of the buses, and whether switches are used to increase the
number of available bus slots, or whether GPUs are housed in an external
enclosure. This can become quite complex. :l

To achieve significant acceleration through GPUs, both the KOKKOS and GPU
packages require capable GPUs with fast on-device memory and efficient
data transfer rates. In practice, this means upper mid-range to high-end
(desktop) GPUs. Using lower performance GPUs (e.g. on laptops) may
result in a slowdown instead. :l

For the GPU package, specifically when running in parallel with MPI,
it is often more efficient to exclude the PPPM kspace style from GPU
acceleration and instead run it - concurrently with a GPU accelerated
pair style - on the CPU. This can often be easily achieved by placing
a {suffix off} command before and a {suffix on} command after the
{kspace_style pppm} command, as sketched after this list. :l

The KOKKOS/OpenMP and USER-OMP packages have different thread management
strategies, which should result in USER-OMP being more efficient for a
small number of threads, with increasing overhead as the number of threads
per MPI rank grows. The KOKKOS/OpenMP kernels have less overhead in that
case, but have lower performance with few threads. Example launch commands
for both are sketched after this list. :l

The USER-INTEL package contains many options and settings for achieving
additional performance on Intel hardware (CPU and accelerator cards), but
to unlock this potential, an Intel compiler is required. The package code
will compile with GNU gcc, but it will not be as efficient. A command line
sketch for this package also follows after this list. :l

:ule
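
As an illustration of the first guideline above, a launch command that
assigns multiple MPI ranks to each GPU with the GPU package could look
as follows; this is only a sketch, and the executable name, MPI rank
count, and GPU count are placeholders for your own setup:

mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script  # 12 MPI ranks share 2 GPUs :pre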
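
The PPPM exclusion described above can be sketched in an input script
as follows; the accuracy argument is only an illustrative placeholder:

suffix off
kspace_style pppm 1.0e-4
suffix on :pre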
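
For the thread management comparison, here is a minimal sketch of how
the two OpenMP variants are launched, assuming 4 MPI ranks with 4
threads each and a suitably compiled executable (names are placeholders):

mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script   # USER-OMP, 4 threads per rank
mpirun -np 4 lmp_machine -k on t 4 -sf kk -in in.script    # KOKKOS/OpenMP, 4 threads per rank :pre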
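
Similarly, a possible command line for enabling the USER-INTEL styles
on a CPU-only machine (no coprocessors); the executable name and rank
count are again placeholders:

mpirun -np 8 lmp_machine -sf intel -pk intel 0 -in in.script :pre
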
[Differences between the GPU and KOKKOS packages:]

The GPU package accelerates only pair force, neighbor list, and (parts
of) PPPM calculations. The KOKKOS package attempts to run most of the
calculation on the GPU, but can transparently support non-accelerated
code (with a performance penalty due to having data transfers between
host and GPU). :ulb,l

The GPU package requires neighbor lists to be built on the CPU when using
exclusion lists, hybrid pair styles, or a triclinic simulation box. :l

The GPU package can be compiled for CUDA or OpenCL and thus supports
both Nvidia and AMD GPUs well. On Nvidia hardware, using CUDA typically
results in equal or better performance compared to OpenCL. A build
sketch for the two variants follows after this list. :l

OpenCL in the GPU package does, in principle, also support Intel CPUs and
Intel Xeon Phi, but the native support for those in KOKKOS (or USER-INTEL)
is superior. :l

:ule
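
As a rough sketch of the two build variants of the GPU package library,
assuming the conventional makefiles in the lib/gpu directory of the
LAMMPS source tree (actual makefile names may differ for your platform
and LAMMPS version):

cd lib/gpu
make -f Makefile.linux           # CUDA build
make -f Makefile.linux_opencl    # OpenCL build
cd ../../src
make yes-gpu
make mpi :pre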