Accelerated versions of various pair styles, fixes, computes, and other commands have been added to LAMMPS. These will typically run faster than the standard non-accelerated versions, provided you have the appropriate hardware on your system.
The accelerated styles have the same name as the standard styles, except that a suffix is appended. Otherwise, the syntax for the command is identical, their functionality is the same, and the numerical results they produce should also be identical, except for precision and round-off issues.
For example, all of these variants of the basic Lennard-Jones pair style exist in LAMMPS:
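With the suffixes described below, that list looks like this (the 2.5 cutoff is just an illustrative value):

pair_style lj/cut 2.5
pair_style lj/cut/opt 2.5
pair_style lj/cut/gpu 2.5
pair_style lj/cut/cuda 2.5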
Assuming you have built LAMMPS with the appropriate package, these styles can be invoked by specifying them explicitly in your input script. Or you can use the -suffix command-line switch to invoke the accelerated versions automatically, without changing your input script. The suffix command allows you to set a suffix explicitly and to turn the command-line switch setting off/on from within your input script. A sketch of both mechanisms follows.
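As a minimal sketch (assuming a build with the GPU package and the lj/cut/gpu style), either of the following will run the accelerated style without editing the pair_style line:

lmp_machine -sf gpu < in.script

or, inside the input script itself:

suffix gpu
pair_style lj/cut 2.5
suffix off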
Styles with an "opt" suffix are part of the OPT package and typically speed up the pairwise calculations of your simulation by 5-25%.
Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA packages, and can be run on NVIDIA GPUs associated with your CPUs. The speed-up due to GPU usage depends on a variety of factors, as discussed below.
To see what styles are currently available in each of the accelerated packages, see this section of the manual. A list of accelerated styles is included in the pair, fix, compute, and kspace sections.
The following sections describe, for each accelerated package, its hardware and software requirements, how to build LAMMPS with it, what is needed in your input script to use it, and tips for getting the best performance.
The final section compares and contrasts the GPU and USER-CUDA packages, since they are both designed to use NVIDIA GPU hardware.
10.1 OPT package

The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.
The procedure for building LAMMPS with the OPT package is simple. It is the same as for any other package which has no additional library dependencies:
make yes-opt
make machine
If your input script uses one of the OPT pair styles, you can run it as follows:
lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
You should see a reduction in the "Pair time" printed out at the end of the run. On most machines and problems, this will typically be a 5 to 20% savings.
10.2 GPU package

The GPU package was developed by Mike Brown at ORNL. It provides GPU versions of several pair styles, as well as GPU-accelerated long-range Coulombics via the PPPM command. It has the following features:
Hardware and software requirements:
To use this package, you need to have specific NVIDIA hardware and install specific NVIDIA CUDA software on your system:
Building LAMMPS with the GPU package:
As with other packages that link with a separately compiled library, you need to first build the GPU library before building LAMMPS itself. General instructions for doing this are in this section of the manual. For this package, do the following, using a Makefile appropriate for your system:
cd lammps/lib/gpu
make -f Makefile.linux
(see further instructions in lammps/lib/gpu/README)
If you are successful, you will produce the file lib/gpu/libgpu.a.
Now you are ready to build LAMMPS with the GPU package installed:
cd lammps/src
make yes-gpu
make machine
Note that the low-level Makefile (e.g. src/MAKE/Makefile.linux) has these settings: gpu_SYSINC, gpu_SYSLIB, gpu_SYSPATH. These need to be set appropriately to include the paths and settings for the CUDA system software on your machine. See src/MAKE/Makefile.g++ for an example.
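For example, with the CUDA toolkit installed under /usr/local/cuda (an assumed location; adjust the paths and library list for your machine), the settings might look like:

gpu_SYSINC =
gpu_SYSLIB = -lcudart -lcuda
gpu_SYSPATH = -L/usr/local/cuda/lib64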
GPU configuration:
When using GPUs, you are restricted to one physical GPU per LAMMPS process, which is an MPI process running (typically) on a single core or processor. Multiple processes can share a single GPU and in many cases it will be more efficient to run with multiple processes per GPU.
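For example, on a node with a single GPU and four CPU cores, all four MPI processes can share the one GPU (a sketch, assuming the -suffix switch is used to select the gpu styles):

mpirun -np 4 lmp_machine -sf gpu < in.script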
Input script requirements:
Additional input script requirements to run styles with a gpu suffix are as follows.
The newton pair setting must be off and the fix gpu command must be used. To invoke specific styles from the GPU package, you can either append "gpu" to the style name (e.g. pair_style lj/cut/gpu), or use the -suffix command-line switch, or use the suffix command.
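Put together, a minimal input-script fragment satisfying these requirements might look like the following (the cutoff and the fix arguments are illustrative; the fix syntax is explained below):

newton off
fix 0 all gpu force/neigh 0 0 1.0
pair_style lj/cut/gpu 2.5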
The fix gpu command controls the GPU selection and initialization steps.
The format for the fix is:
fix fix-ID all gpu mode first last split
where fix-ID is the name for the fix. The gpu fix must be the first fix specified for a given run, otherwise LAMMPS will exit with an error. The gpu fix does not have any effect on runs that do not use GPU acceleration, so there should be no problem specifying the fix first in any input script.
The mode setting can be either "force" or "force/neigh". In the former, the neighbor list calculation is performed on the CPU using the standard LAMMPS routines. In the latter, the neighbor list calculation is performed on the GPU. The GPU neighbor list can be used for better performance; however, it cannot be used with a triclinic box or with hybrid pair styles.
There are cases when it may be more efficient to select the CPU for neighbor list builds. If a non-GPU enabled style (e.g. a fix or compute) requires a neighbor list, it will also be built using CPU routines. Redundant CPU and GPU neighbor list calculations will typically be less efficient.
The first setting is the ID (as reported by lammps/lib/gpu/nvc_get_devices) of the first GPU that will be used on each node. The last setting is the ID of the last GPU that will be used on each node. If you have only one GPU per node, first and last will typically both be 0. Selecting a non-sequential set of GPU IDs (e.g. 0,1,3) is not currently supported.
The split setting is the fraction of particles whose forces, torques, energies, and/or virials will be calculated on the GPU. This can be used to perform CPU and GPU force calculations simultaneously, e.g. on hybrid nodes with a multicore CPU and one or more GPUs. If split is negative, the software will attempt to calculate the optimal fraction automatically every 25 timesteps, based on CPU and GPU timings. Because the GPU speed-ups depend on the number of particles, automatic calculation of the split can be less efficient, but it typically results in loop times within 20% of an optimal fixed split.
As an example, if you have two GPUs per node, 8 CPU cores per node, and would like to run on 4 nodes (32 cores) with dynamic balancing of force calculation across CPU and GPU cores, the fix might be
fix 0 all gpu force/neigh 0 1 -1
In this case, all CPU cores and GPU devices on the nodes would be utilized. Each GPU device would be shared by 4 CPU cores. The CPU cores would perform force calculations for some fraction of the particles at the same time the GPUs performed force calculation for the other particles.
Asynchronous pair computation on GPU and CPU:
The GPU accelerated pair styles can perform pair style force calculations on the GPU at the same time that other force calculations within LAMMPS are being performed on the CPU. These include pair, bond, angle, etc. forces, as well as long-range Coulombic forces. This is enabled by the split setting in the gpu fix, as described above.
With a split setting less than 1.0, a portion of the pair-wise force calculations will also be performed on the CPU. When the CPU finishes its pair style computations (if any), the next LAMMPS force computation (bond, angle, etc.) will begin, possibly before the GPU has finished its pair style computations.
This means that if split is set to 1.0, the CPU will begin the next LAMMPS force computation immediately, since it has no pair-wise work of its own. This can be used to run a hybrid GPU pair style at the same time as a hybrid CPU pair style. In this case, the GPU pair style should be first in the hybrid command in order to perform simultaneous calculations. This also allows bond, angle, dihedral, improper, and long-range force computations to run simultaneously with the GPU pair style. If all CPU force computations complete before the GPU, LAMMPS will block until the GPU has finished before continuing the timestep.
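As a hypothetical illustration (a fragment only; pair_coeff lines are omitted), a hybrid setup that lets the GPU sub-style overlap with a CPU sub-style and the long-range solver could look like this. The "force" mode is used because the GPU neighbor build cannot be combined with hybrid pair styles, as noted above:

fix 0 all gpu force 0 0 1.0
pair_style hybrid lj/cut/gpu 2.5 coul/long 10.0
kspace_style pppm 1.0e-4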
Timing output:
As noted above, GPU accelerated pair styles can perform computations asynchronously with CPU computations. The "Pair" time reported by LAMMPS will be the maximum of the time required to complete the CPU pair style computations and the time required to complete the GPU pair style computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with bond, angle, dihedral, improper, and long-range calculations will not be included in the "Pair" time.
When the mode setting for the gpu fix is force/neigh, the time for neighbor list calculations on the GPU will be added into the "Pair" time, not the "Neigh" time. An additional breakdown of the times required for various tasks on the GPU (data copy, neighbor calculations, force computations, etc.) is output only to the LAMMPS screen output (not to the log file) at the end of each run. These timings represent the total time spent on the GPU for each routine, regardless of asynchronous CPU calculations.
Performance tips:
Because of the large number of cores within each GPU device, it may be more efficient to run on fewer processes per GPU when the number of particles per MPI process is small (hundreds of particles); this can be necessary to keep the GPU cores busy.
See the lammps/lib/gpu/README file for instructions on how to build the LAMMPS gpu library for single, mixed, and double precision. The latter requires that your GPU card support double precision.
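In Makefiles of this vintage, the precision is typically selected by a define inside the lib/gpu Makefile; the macro names below are an assumption on our part, so treat the README as authoritative:

CUDA_PRECISION = -D_SINGLE_SINGLE   (single precision)
CUDA_PRECISION = -D_SINGLE_DOUBLE   (mixed precision)
CUDA_PRECISION = -D_DOUBLE_DOUBLE   (double precision)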
10.3 USER-CUDA package

The USER-CUDA package was developed by Christian Trott at Ilmenau University of Technology in Germany. It provides NVIDIA GPU versions of many pair styles, many fixes, and a few computes, as well as long-range Coulombics via the PPPM command. It has the following features:
Hardware and software requirements:
To use this package, you need to have specific NVIDIA hardware and install specific NVIDIA CUDA software on your system:
Your NVIDIA GPU needs to support Compute Capability 1.3. This list may help you to find out the Compute Capability of your card:
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
Install the NVIDIA CUDA Toolkit, version 3.2 or higher, and the corresponding GPU drivers. The NVIDIA CUDA SDK is not required by the USER-CUDA package, but we recommend installing it; you can then make sure that its sample projects compile without problems.
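A quick way to check the toolkit and driver installation (the SDK path is hypothetical; use wherever you installed it) is:

nvcc --version
nvidia-smi
cd ~/NVIDIA_GPU_Computing_SDK/C
make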
Building LAMMPS with the USER-CUDA package:
As with other packages that link with a separately compiled library, you need to first build the USER-CUDA library before building LAMMPS itself. General instructions for doing this are in this section of the manual. For this package, do the following, using a Makefile appropriate for your system:
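The library is built in lammps/lib/cuda; assuming the options listed below are passed as make variables, a minimal sketch (here double precision for a CC 2.0 card) would be:

cd lammps/lib/cuda
make precision=2 arch=20
(see further instructions in lammps/lib/cuda/README)

The options that can be set when building the library are: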
precision=N to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
arch=M to set the GPU compute capability
  M = 20 for CC 2.0 (GF100/110, e.g. C2050, GTX580, GTX470) (default)
  M = 21 for CC 2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
  M = 13 for CC 1.3 (GT200, e.g. C1060, GTX285)
prec_timer=0/1 to use high-precision timers
  0 = do not use them (default)
  1 = use them (usually only useful on Mac machines)
dbg=0/1 to activate debug mode
  0 = no debug mode (default)
  1 = debug mode (only useful for developers)
cufft=1 to determine usage of the CUDA FFT library
  0 = no CUFFT support (default)
  (in the future, other CUDA-enabled FFT libraries might be supported)
Now you are ready to build LAMMPS with the USER-CUDA package installed:
cd lammps/src
make yes-user-cuda
make machine
Note that the build will reference the lib/cuda/Makefile.common file to extract settings relevant to the LAMMPS build. So it is important that you have first built the cuda library (in lib/cuda) using settings appropriate to your system.
Input script requirements:
Additional input script requirements to run styles with a cuda suffix are as follows.
To invoke specific styles from the USER-CUDA package, you can either append "cuda" to the style name (e.g. pair_style lj/cut/cuda), or use the -suffix command-line switch, or use the suffix command. One exception is that the kspace_style pppm/cuda command has to be requested explicitly.
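For example, in an input script (the two lines are independent illustrations and the argument values are placeholders):

pair_style lj/cut/cuda 2.5
kspace_style pppm/cuda 1.0e-4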
To use the USER-CUDA package with its default settings, no additional command is needed in your input script. This is because when LAMMPS starts up, it detects if it has been built with the USER-CUDA package. See the -cuda command-line switch for more details.
To change settings for the USER-CUDA package at run-time, the package cuda command can be used at the beginning of your input script. See the commands doc page for details.
Performance tips:
The USER-CUDA package offers more speed-up relative to CPU performance when the number of atoms per GPU is large, e.g. on the order of tens or hundreds of thousands.
As noted above, this package will continue to run a simulation entirely on the GPU(s) (except for inter-processor MPI communication), for multiple timesteps, until a CPU calculation is required, either by a fix or compute that is non-GPU-ized, or until output is performed (thermo or dump snapshot or restart file). The less often this occurs, the faster your simulation may run.
10.4 Comparison of the GPU and USER-CUDA packages

Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways.
As a consequence, for a specific simulation on particular hardware, one package may be faster than the other. We give guidelines below, but the best way to determine which package is faster for your input script is to try both of them on your machine. See the benchmarking section below for examples where this has been done.
Guidelines for using each package optimally:
Chief differences between the two packages:
Examples:
The LAMMPS distribution has two directories with sample input scripts for the GPU and USER-CUDA packages.
These are files for identical systems, so they can be used to benchmark the performance of both packages on your system.
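A sketch of how such a comparison might be run (the input file name is a placeholder; see the example directories for the actual scripts):

mpirun -np 4 lmp_machine -sf gpu < in.melt
mpirun -np 4 lmp_machine -sf cuda < in.melt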
Benchmark data:
NOTE: We plan to add some benchmark results and plots here for the examples described in the previous section.
Simulations:
1. Lennard Jones
2. Lennard Jones
3. Rhodopsin model
4. Lithium-Phosphate
Hardware:
Workstation:
eStella:
Keeneland: