5. USER-CUDA package


The USER-CUDA package was developed by Christian Trott (Sandia) while at U Technology Ilmenau in Germany. It provides NVIDIA GPU versions of many pair styles, many fixes, a few computes, and long-range Coulombics via the PPPM command. It has the following general features:

  • The package is designed to allow an entire LAMMPS calculation, for many timesteps, to run entirely on the GPU (except for inter-processor MPI communication), so that atom-based data (e.g. coordinates, forces) do not have to move back-and-forth between the CPU and GPU.
  • The speed-up advantage of this approach is typically better when the number of atoms per GPU is large.
  • Data will stay on the GPU until a timestep where a non-USER-CUDA fix or compute is invoked. Whenever a non-GPU operation occurs (fix, compute, output), data automatically moves back to the CPU as needed. This may incur a performance penalty, but should otherwise work transparently.
  • Neighbor lists are constructed on the GPU.
  • The package only supports use of a single MPI task, running on a single CPU (core), assigned to each GPU.

Here is a quick overview of how to use the USER-CUDA package:

  • build the library in lib/cuda for your GPU hardware with desired precision
  • include the USER-CUDA package and build LAMMPS
  • use the mpirun command to specify 1 MPI task per GPU (on each node)
  • enable the USER-CUDA package via the “-c on” command-line switch
  • specify the # of GPUs per node
  • use USER-CUDA styles in your input script

The latter two steps can be done using the “-pk cuda” and “-sf cuda” command-line switches respectively. Or the effect of the “-pk” or “-sf” switches can be duplicated by adding the package cuda or suffix cuda commands respectively to your input script.
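For example, these two input-script lines duplicate the effect of the “-pk cuda 2 -sf cuda” command-line switches (a minimal sketch; see the package and suffix command doc pages for the full syntax):

package cuda 2    # same effect as "-pk cuda 2": use 2 GPUs per node
suffix cuda       # same effect as "-sf cuda": append "cuda" to supported styles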


Required hardware/software:


To use this package, you need to have one or more NVIDIA GPUs and install the NVIDIA Cuda software on your system:


Your NVIDIA GPU needs to support Compute Capability 1.3 or higher. This list may help you find out the Compute Capability of your card:


http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units


Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the corresponding GPU drivers. The Nvidia Cuda SDK is not required, but we recommend it also be installed. You can then make sure its sample projects can be compiled without problems.
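As a quick check that the toolkit and driver are installed (these are standard NVIDIA tools, not part of LAMMPS):

nvcc --version    # reports the installed CUDA toolkit version
nvidia-smi        # lists visible GPUs and the loaded driver version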


Building LAMMPS with the USER-CUDA package:


This requires two steps (a,b): build the USER-CUDA library, then build LAMMPS with the USER-CUDA package.


You can do both these steps in one line, using the src/Make.py script, described in Section 2.4 of the manual. Type “Make.py -h” for help. If run from the src directory, this command will create src/lmp_cuda using src/MAKE/Makefile.mpi as the starting Makefile.machine:

Make.py -p cuda -cuda mode=single arch=20 -o cuda -a lib-cuda file mpi

Or you can follow these two (a,b) steps:

(a) Build the USER-CUDA library

The USER-CUDA library is in lammps/lib/cuda. If your CUDA toolkit is not installed in the default system directory /usr/local/cuda, edit the file lib/cuda/Makefile.common accordingly.
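For example, pointing the build at a toolkit installed elsewhere would look roughly like this (the variable name CUDA_INSTALL_PATH is an assumption; check lib/cuda/Makefile.common for the exact name it uses):

# hypothetical edit in lib/cuda/Makefile.common for a non-default location
CUDA_INSTALL_PATH = /opt/cuda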


To build the library with the settings in lib/cuda/Makefile.default, simply type:

make

To set options when the library is built, type “make OPTIONS”, where OPTIONS are one or more of the following. The settings will be written to lib/cuda/Makefile.defaults before the build.

precision=N to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
arch=M to set GPU compute capability
  M = 35 for Kepler GPUs
  M = 20 for CC2.0 (GF100/110, e.g. C2050,GTX580,GTX470) (default)
  M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GF200, e.g. C1060, GTX285)
prec_timer=0/1 to use hi-precision timers
  0 = do not use them (default)
  1 = use them
  this is usually only useful for Mac machines
dbg=0/1 to activate debug mode
  0 = no debug mode (default)
  1 = yes debug mode
  this is only useful for developers
cufft=1 for use of the CUDA FFT library
  0 = no CUFFT support (default)
  in the future other CUDA-enabled FFT libraries might be supported
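For example, to build a double-precision library for Kepler GPUs (the options above can be combined in one command):

make precision=2 arch=35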

If the build is successful, it will produce the files liblammpscuda.a and Makefile.lammps.


Note that if you change any of the options (like precision), you need to re-build the entire library. Do a “make clean” first, followed by “make”.

(b) Build LAMMPS with the USER-CUDA package

cd lammps/src
make yes-user-cuda
make machine

No additional compile/link flags are needed in Makefile.machine.


Note that if you change the USER-CUDA library precision (discussed above) and rebuild the USER-CUDA library, then you also need to re-install the USER-CUDA package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new USER-CUDA library.
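A sketch of the full sequence after switching the library to, e.g., double precision (paths follow the layout used above):

cd lammps/lib/cuda
make clean
make precision=2      # rebuild the library with the new precision
cd ../../src
make yes-user-cuda    # re-install the package files
make machine          # re-build LAMMPS against the new library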


Run with the USER-CUDA package from the command line:


The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command in MPICH does this via its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.


When using the USER-CUDA package, you must use exactly one MPI task per physical GPU.


You must use the “-c on” command-line switch to enable the USER-CUDA package. The “-c on” switch also issues a default package cuda 1 command, which sets various USER-CUDA options to default values, as discussed on the package command doc page.


Use the “-sf cuda” command-line switch, which will automatically append “cuda” to styles that support it. Use the “-pk cuda Ng” command-line switch to set Ng = # of GPUs per node to a different value than the default set by the “-c on” switch (1 GPU), or to change other package cuda options.

lmp_machine -c on -sf cuda -pk cuda 1 -in in.script                       # 1 MPI task uses 1 GPU
mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script          # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script  # ditto on 12 16-core nodes

The syntax for the “-pk” switch is the same as the package cuda command. See the package command doc page for details, including the default values used for all its options if it is not specified.


Note that the default for the package cuda command is to set the Newton flag to “off” for both pairwise and bonded interactions. This typically gives fastest performance. If the newton command is used in the input script, it can override these defaults.
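For example, this input-script line would override those defaults and turn Newton's third law back on for both pairwise and bonded interactions (see the newton command doc page for the full syntax):

newton on    # overrides the "off" default set by the package cuda command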


Or run with the USER-CUDA package by editing an input script:


The discussion above for the mpirun/mpiexec command and the requirement of one MPI task per GPU is the same.


You must still use the “-c on” command-line switch to enable the USER-CUDA package.


Use the suffix cuda command, or you can explicitly add a “cuda” suffix to individual styles in your input script, e.g.

pair_style lj/cut/cuda 2.5

You only need to use the package cuda command if you wish to change any of its option defaults, including the number of GPUs/node (default = 1), as set by the “-c on” command-line switch.


Speed-ups to expect:


The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed).


See the Benchmark page of the LAMMPS web site for performance of the USER-CUDA package on different hardware.


Guidelines for best performance:

  • The USER-CUDA package offers more speed-up relative to CPU performance when the number of atoms per GPU is large, e.g. on the order of tens or hundreds of thousands.
  • As noted above, this package will continue to run a simulation entirely on the GPU(s) (except for inter-processor MPI communication), for multiple timesteps, until a CPU calculation is required, either by a fix or compute that is non-GPU-ized, or until output is performed (thermo or dump snapshot or restart file). The less often this occurs, the faster your simulation will run.

Restrictions


None.
