Accelerated versions of various pair styles, fixes, computes, and other commands have been added to LAMMPS. These will typically run faster than the standard non-accelerated versions, provided you have the appropriate hardware on your system.
The accelerated styles have the same name as the standard styles, except that a suffix is appended. Otherwise, the syntax for the command is identical, their functionality is the same, and the numerical results they produce should also be identical, except for precision and round-off issues.
For example, all of these variants of the basic Lennard-Jones pair style exist in LAMMPS:
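With the suffixes described below, that list looks like this (the 2.5 cutoff is just an illustrative value):

pair_style lj/cut 2.5
pair_style lj/cut/opt 2.5
pair_style lj/cut/gpu 2.5
pair_style lj/cut/cuda 2.5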
Assuming you have built LAMMPS with the appropriate package, these styles can be invoked by specifying them explicitly in your input script. Or you can use the -suffix command-line switch to invoke the accelerated versions automatically, without changing your input script. The suffix command allows you to set a suffix explicitly and to turn the command-line switch setting off/on from within your input script. A sketch of both mechanisms follows.
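As a minimal sketch (assuming a build with the GPU package and the lj/cut/gpu style), either of the following will run the accelerated style without editing the pair_style line:

lmp_machine -sf gpu < in.script

or, inside the input script itself:

suffix gpu
pair_style lj/cut 2.5
suffix off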
Styles with an "opt" suffix are part of the OPT package and typically speed up the pairwise calculations of your simulation by 5-25%.
Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA packages, and can be run on NVIDIA GPUs associated with your CPUs. The speed-up due to GPU usage depends on a variety of factors, as discussed below.
To see what styles are currently available in each of the accelerated packages, see this section of the manual. A list of accelerated styles is included in the pair, fix, compute, and kspace sections.
The following sections describe, for each accelerated package, its hardware and software requirements, how to build LAMMPS with it, what is needed in your input script to use it, and tips for getting the best performance.
The final section compares and contrasts the GPU and USER-CUDA packages, since they are both designed to use NVIDIA GPU hardware.
10.1 OPT package

The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.
The procedure for building LAMMPS with the OPT package is simple. It is the same as for any other package which has no additional library dependencies:
make yes-opt
make machine
If your input script uses one of the OPT pair styles, you can run it as follows:
lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
You should see a reduction in the "Pair time" printed out at the end of the run. On most machines and problems, this will typically be a 5 to 20% savings.
10.2 GPU package

The GPU package was developed by Mike Brown at ORNL. It provides GPU versions of several pair styles, as well as GPU-accelerated long-range Coulombics via the PPPM command. It has the following features:
Hardware and software requirements:
To use this package, you need to have specific NVIDIA hardware and install specific NVIDIA CUDA software on your system:
Building LAMMPS with the GPU package:
As with other packages that link with a separately compiled library, you need to first build the GPU library before building LAMMPS itself. General instructions for doing this are in this section of the manual. For this package, do the following, using a Makefile appropriate for your system:
cd lammps/lib/gpu
make -f Makefile.linux
(see further instructions in lammps/lib/gpu/README)
If you are successful, you will produce the file lib/gpu/libgpu.a.
Now you are ready to build LAMMPS with the GPU package installed:
cd lammps/src
make yes-gpu
make machine
Note that the low-level Makefile (e.g. src/MAKE/Makefile.linux) has these settings: gpu_SYSINC, gpu_SYSLIB, gpu_SYSPATH. These need to be set appropriately to include the paths and settings for the CUDA system software on your machine. See src/MAKE/Makefile.g++ for an example.
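For example, with the CUDA toolkit installed under /usr/local/cuda (an assumed location; adjust the paths and library list for your machine), the settings might look like:

gpu_SYSINC =
gpu_SYSLIB = -lcudart -lcuda
gpu_SYSPATH = -L/usr/local/cuda/lib64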
GPU configuration:
When using GPUs, you are restricted to one physical GPU per LAMMPS process, which is an MPI process running (typically) on a single core or processor. Multiple processes can share a single GPU and in many cases it will be more efficient to run with multiple processes per GPU.
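For example, on a node with a single GPU and four CPU cores, all four MPI processes can share the one GPU (a sketch, assuming the -suffix switch is used to select the gpu styles):

mpirun -np 4 lmp_machine -sf gpu < in.script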
Input script requirements:
Additional input script requirements to run styles with a gpu suffix are as follows.
The newton pair setting must be off and the fix gpu command must be used. To invoke specific styles from the GPU package, you can either append "gpu" to the style name (e.g. pair_style lj/cut/gpu), or use the -suffix command-line switch, or use the suffix command.
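Put together, a minimal input-script fragment satisfying these requirements might look like the following (the cutoff and the fix arguments are illustrative; the fix syntax is explained below):

newton off
fix 0 all gpu force/neigh 0 0 1.0
pair_style lj/cut/gpu 2.5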
The fix gpu command controls the GPU selection and initialization steps.
The format for the fix is:
fix fix-ID all gpu mode first last split
where fix-ID is the name for the fix. The gpu fix must be the first fix specified for a given run, otherwise LAMMPS will exit with an error. The gpu fix does not have any effect on runs that do not use GPU acceleration, so there should be no problem specifying the fix first in any input script.
The mode setting can be either "force" or "force/neigh". In the former, the neighbor list calculation is performed on the CPU using the standard LAMMPS routines. In the latter, the neighbor list calculation is performed on the GPU. The GPU neighbor list can be used for better performance; however, it cannot be used with a triclinic box or with hybrid pair styles.
There are cases when it may be more efficient to select the CPU for neighbor list builds. If a non-GPU enabled style (e.g. a fix or compute) requires a neighbor list, it will also be built using CPU routines. Redundant CPU and GPU neighbor list calculations will typically be less efficient.
The first setting is the ID (as reported by lammps/lib/gpu/nvc_get_devices) of the first GPU that will be used on each node. The last setting is the ID of the last GPU that will be used on each node. If you have only one GPU per node, first and last will typically both be 0. Selecting a non-sequential set of GPU IDs (e.g. 0,1,3) is not currently supported.
The split setting is the fraction of particles whose forces, torques, energies, and/or virials will be calculated on the GPU. This can be used to perform CPU and GPU force calculations simultaneously, e.g. on hybrid nodes with a multicore CPU and one or more GPUs. If split is negative, the software will attempt to calculate the optimal fraction automatically every 25 timesteps, based on CPU and GPU timings. Because the GPU speed-ups depend on the number of particles, automatic calculation of the split can be less efficient, but it typically results in loop times within 20% of an optimal fixed split.
As an example, if you have two GPUs per node, 8 CPU cores per node, and would like to run on 4 nodes (32 cores) with dynamic balancing of force calculation across CPU and GPU cores, the fix might be
fix 0 all gpu force/neigh 0 1 -1
In this case, all CPU cores and GPU devices on the nodes would be utilized. Each GPU device would be shared by 4 CPU cores. The CPU cores would perform force calculations for some fraction of the particles at the same time the GPUs performed force calculation for the other particles.
Asynchronous pair computation on GPU and CPU:
The GPU accelerated pair styles can perform pair style force calculations on the GPU at the same time that other force calculations within LAMMPS are being performed on the CPU. These include pair, bond, angle, etc. forces, as well as long-range Coulombic forces. This is enabled by the split setting in the gpu fix, as described above.
With a split setting less than 1.0, a portion of the pair-wise force calculations will also be performed on the CPU. When the CPU finishes its pair style computations (if any), the next LAMMPS force computation (bond, angle, etc.) will begin, possibly before the GPU has finished its pair style computations.
This means that if split is set to 1.0, the CPU will begin the next LAMMPS force computation immediately, since it has no pair-wise work of its own. This can be used to run a hybrid GPU pair style at the same time as a hybrid CPU pair style. In this case, the GPU pair style should be first in the hybrid command in order to perform simultaneous calculations. This also allows bond, angle, dihedral, improper, and long-range force computations to run simultaneously with the GPU pair style. If all CPU force computations complete before the GPU, LAMMPS will block until the GPU has finished before continuing the timestep.
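As a hypothetical illustration (a fragment only; pair_coeff lines are omitted), a hybrid setup that lets the GPU sub-style overlap with a CPU sub-style and the long-range solver could look like this. The "force" mode is used because the GPU neighbor build cannot be combined with hybrid pair styles, as noted above:

fix 0 all gpu force 0 0 1.0
pair_style hybrid lj/cut/gpu 2.5 coul/long 10.0
kspace_style pppm 1.0e-4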
Timing output:
As noted above, GPU accelerated pair styles can perform computations asynchronously with CPU computations. The "Pair" time reported by LAMMPS will be the maximum of the time required to complete the CPU pair style computations and the time required to complete the GPU pair style computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with bond, angle, dihedral, improper, and long-range calculations will not be included in the "Pair" time.
When the mode setting for the gpu fix is force/neigh, the time for neighbor list calculations on the GPU will be added into the "Pair" time, not the "Neigh" time. An additional breakdown of the times required for various tasks on the GPU (data copy, neighbor calculations, force computations, etc.) is output only to the LAMMPS screen output (not to the log file) at the end of each run. These timings represent the total time spent on the GPU for each routine, regardless of asynchronous CPU calculations.
Performance tips:
Because of the large number of cores within each GPU device, it may be more efficient to run on fewer processes per GPU when the number of particles per MPI process is small (hundreds of particles); this can be necessary to keep the GPU cores busy.
See the lammps/lib/gpu/README file for instructions on how to build the LAMMPS gpu library for single, mixed, and double precision. The latter requires that your GPU card support double precision.
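In Makefiles of this vintage, the precision is typically selected by a define inside the lib/gpu Makefile; the macro names below are an assumption on our part, so treat the README as authoritative:

CUDA_PRECISION = -D_SINGLE_SINGLE   (single precision)
CUDA_PRECISION = -D_SINGLE_DOUBLE   (mixed precision)
CUDA_PRECISION = -D_DOUBLE_DOUBLE   (double precision)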
10.3 USER-CUDA package

The USER-CUDA package was developed by Christian Trott at Ilmenau University of Technology in Germany. It provides NVIDIA GPU versions of many pair styles, many fixes, and a few computes, as well as long-range Coulombics via the PPPM command. It has the following features:
Hardware and software requirements:
To use this package, you need to have specific NVIDIA hardware and install specific NVIDIA CUDA software on your system:
Your NVIDIA GPU needs to support Compute Capability 1.3. This list may help you to find out the Compute Capability of your card:
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
Install the NVIDIA CUDA Toolkit, version 3.2 or higher, and the corresponding GPU drivers. The NVIDIA CUDA SDK is not required by the USER-CUDA package, but we recommend installing it; you can then make sure that its sample projects compile without problems.
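A quick way to check the toolkit and driver installation (the SDK path is hypothetical; use wherever you installed it) is:

nvcc --version
nvidia-smi
cd ~/NVIDIA_GPU_Computing_SDK/C
make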
Building LAMMPS with the USER-CUDA package:
As with other packages that link with a separately compiled library, you need to first build the USER-CUDA library before building LAMMPS itself. General instructions for doing this are in this section of the manual. For this package, do the following, using a Makefile appropriate for your system:
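The library is built in lammps/lib/cuda; assuming the options listed below are passed as make variables, a minimal sketch (here double precision for a CC 2.0 card) would be:

cd lammps/lib/cuda
make precision=2 arch=20
(see further instructions in lammps/lib/cuda/README)

The options that can be set when building the library are: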
precision=N to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
arch=M to set the GPU compute capability
  M = 20 for CC 2.0 (GF100/110, e.g. C2050, GTX580, GTX470) (default)
  M = 21 for CC 2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
  M = 13 for CC 1.3 (GT200, e.g. C1060, GTX285)
prec_timer=0/1 to use high-precision timers
  0 = do not use them (default)
  1 = use them (usually only useful on Mac machines)
dbg=0/1 to activate debug mode
  0 = no debug mode (default)
  1 = debug mode (only useful for developers)
cufft=1 to determine usage of the CUDA FFT library
  0 = no CUFFT support (default)
  (in the future, other CUDA-enabled FFT libraries might be supported)
Now you are ready to build LAMMPS with the USER-CUDA package installed:
cd lammps/src
make yes-user-cuda
make machine
Note that the build will reference the lib/cuda/Makefile.common file to extract settings relevant to the LAMMPS build. So it is important that you have first built the cuda library (in lib/cuda) using settings appropriate to your system.
Input script requirements:
Additional input script requirements to run styles with a cuda suffix are as follows.
To invoke specific styles from the USER-CUDA package, you can either append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use the -suffix command-line switch, or use the suffix command. One exception is that the kspace_style pppm/cuda command has to be requested explicitly.
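For example, in an input script (the two lines are independent illustrations and the argument values are placeholders):

pair_style lj/cut/cuda 2.5
kspace_style pppm/cuda 1.0e-4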
To use the USER-CUDA package with its default settings, no additional command is needed in your input script. This is because when LAMMPS starts up, it detects if it has been built with the USER-CUDA package. See the -cuda command-line switch for more details.
To change settings for the USER-CUDA package at run-time, the package cuda command can be used at the beginning of your input script. See the commands doc page for details.
Performance tips:
The USER-CUDA package offers more speed-up relative to CPU performance when the number of atoms per GPU is large, e.g. on the order of tens or hundreds of thousands.
As noted above, this package will continue to run a simulation entirely on the GPU(s) (except for inter-processor MPI communication), for multiple timesteps, until a CPU calculation is required, either by a fix or compute that is non-GPU-ized, or until output is performed (thermo or dump snapshot or restart file). The less often this occurs, the faster your simulation may run.
10.4 Comparison of the GPU and USER-CUDA packages

Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways.
As a consequence, for a specific simulation on particular hardware, one package may be faster than the other. We give guidelines below, but the best way to determine which package is faster for your input script is to try both of them on your machine. See the benchmarking section below for examples where this has been done.
Guidelines for using each package optimally:
Chief differences between the two packages:
Examples:
The LAMMPS distribution has two directories with sample input scripts for the GPU and USER-CUDA packages.
These are files for identical systems, so they can be used to benchmark the performance of both packages on your system.
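A sketch of how such a comparison might be run (the input file name is a placeholder; see the example directories for the actual scripts):

mpirun -np 4 lmp_machine -sf gpu < in.melt
mpirun -np 4 lmp_machine -sf cuda < in.melt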
Benchmark data:
NOTE: We plan to add some benchmark results and plots here for the examples described in the previous section.
Simulations:
1. Lennard Jones
2. Lennard Jones
3. Rhodopsin model
4. Lithium-Phosphate
Hardware:
Workstation:
eStella:
Keeneland: