git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@14950 f3b2605a-c512-4ea7-a41b-209d697bcdaa
223
doc/html/_sources/accelerate_cuda.txt
Normal file
@ -0,0 +1,223 @@
:doc:`Return to Section accelerate overview <Section_accelerate>`

5.USER-CUDA package
-------------------

The USER-CUDA package was developed by Christian Trott (Sandia) while
at the Ilmenau University of Technology in Germany. It provides NVIDIA
GPU versions of many pair styles, many fixes, a few computes, and
long-range Coulombics via the PPPM command. It has the following
general features:

* The package is designed to allow an entire LAMMPS calculation, for
  many timesteps, to run entirely on the GPU (except for inter-processor
  MPI communication), so that atom-based data (e.g. coordinates, forces)
  do not have to move back and forth between the CPU and GPU.
* The speed-up advantage of this approach is typically greater when the
  number of atoms per GPU is large.
* Data will stay on the GPU until a timestep where a non-USER-CUDA fix
  or compute is invoked. Whenever a non-GPU operation occurs (fix,
  compute, output), data automatically moves back to the CPU as needed.
  This may incur a performance penalty, but should otherwise work
  transparently.
* Neighbor lists are constructed on the GPU.
* The package only supports use of a single MPI task, running on a
  single CPU (core), assigned to each GPU.

Here is a quick overview of how to use the USER-CUDA package:

* build the library in lib/cuda for your GPU hardware with the desired precision
* include the USER-CUDA package and build LAMMPS
* use the mpirun command to specify 1 MPI task per GPU (on each node)
* enable the USER-CUDA package via the "-c on" command-line switch
* specify the # of GPUs per node
* use USER-CUDA styles in your input script

The latter two steps can be done using the "-pk cuda" and "-sf cuda"
:ref:`command-line switches <start_7>` respectively. Alternatively,
the effect of the "-pk" or "-sf" switches can be duplicated by adding
the :doc:`package cuda <package>` or :doc:`suffix cuda <suffix>` commands
respectively to your input script.

**Required hardware/software:**

To use this package, you need to have one or more NVIDIA GPUs and
install the NVIDIA CUDA software on your system.

Your NVIDIA GPU needs to support Compute Capability 1.3 or higher. This
list may help you to find out the Compute Capability of your card:

http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

Install the NVIDIA CUDA Toolkit (version 3.2 or higher) and the
corresponding GPU drivers. The NVIDIA CUDA SDK is not required, but
we recommend installing it as well, so that you can check that its
sample projects compile without problems.

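As a rough way to check the software requirement above, the following sketch compares the installed toolkit version against 3.2. It assumes ``nvcc`` and ``nvidia-smi`` are on your PATH and that ``sort -V`` (GNU coreutils) is available; the helper name ``version_ge`` and the variable ``MIN_TOOLKIT`` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 >= version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_TOOLKIT=3.2   # minimum toolkit version stated above

if command -v nvcc >/dev/null 2>&1; then
    # "nvcc --version" prints a line containing "release X.Y"
    found=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')
    if version_ge "$found" "$MIN_TOOLKIT"; then
        echo "CUDA toolkit $found found (>= $MIN_TOOLKIT)"
    else
        echo "CUDA toolkit '$found' is older than $MIN_TOOLKIT or could not be parsed"
    fi
else
    echo "nvcc not found on PATH; is the CUDA toolkit installed?"
fi

# List the GPUs visible to the driver, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi could not query the driver"
fi
```

This only checks the toolkit version; the Compute Capability of the card still has to be looked up, e.g. in the list linked above.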
**Building LAMMPS with the USER-CUDA package:**

This requires two steps (a,b): build the USER-CUDA library, then build
LAMMPS with the USER-CUDA package.

You can do both these steps in one line, using the src/Make.py script,
described in :ref:`Section 2.4 <start_4>` of the manual.
Type "Make.py -h" for help. If run from the src directory, this
command will create src/lmp_cuda using src/MAKE/Makefile.mpi as the
starting Makefile.machine:

.. parsed-literal::

   Make.py -p cuda -cuda mode=single arch=20 -o cuda -a lib-cuda file mpi

Or you can follow these two (a,b) steps:

(a) Build the USER-CUDA library

The USER-CUDA library is in lammps/lib/cuda. If your *CUDA* toolkit
is not installed in the default system directory */usr/local/cuda*, edit
the file *lib/cuda/Makefile.common* accordingly.

To build the library with the settings in lib/cuda/Makefile.defaults,
simply type:

.. parsed-literal::

   make

To set options when the library is built, type "make OPTIONS", where
*OPTIONS* are one or more of the following. The settings will be
written to *lib/cuda/Makefile.defaults* before the build.

.. parsed-literal::

   *precision=N* to set the precision level
     N = 1 for single precision (default)
     N = 2 for double precision
     N = 3 for positions in double precision
     N = 4 for positions and velocities in double precision
   *arch=M* to set GPU compute capability
     M = 35 for Kepler GPUs
     M = 20 for CC2.0 (GF100/110, e.g. C2050, GTX580, GTX470) (default)
     M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
     M = 13 for CC1.3 (GT200, e.g. C1060, GTX285)
   *prec_timer=0/1* to use hi-precision timers
     0 = do not use them (default)
     1 = use them
     this is usually only useful for Mac machines
   *dbg=0/1* to activate debug mode
     0 = no debug mode (default)
     1 = yes debug mode
     this is only useful for developers
   *cufft=1* for use of the CUDA FFT library
     0 = no CUFFT support (default)
     in the future other CUDA-enabled FFT libraries might be supported

If the build is successful, it will produce the files liblammpscuda.a and
Makefile.lammps.

Note that if you change any of the options (like precision), you need
to re-build the entire library. Do a "make clean" first, followed by
"make".

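As an illustration of the "make OPTIONS" form described above, this sketch validates two of the settings against the listed values and prints the resulting commands rather than running them. The variable names ``PRECISION``, ``ARCH`` and ``BUILD_CMD`` are made up for this example; only the ``precision=N`` and ``arch=M`` options come from the table above.

```shell
#!/bin/sh
# Sketch: assemble the "make" command line for lib/cuda from two settings.
# PRECISION, ARCH and BUILD_CMD are illustrative names, not part of LAMMPS.
PRECISION=${PRECISION:-2}   # 1..4, see the precision=N entries above
ARCH=${ARCH:-20}            # 13, 20, 21 or 35, see the arch=M entries above

case "$PRECISION" in
    1|2|3|4) ;;
    *) echo "precision must be 1, 2, 3 or 4" >&2; exit 1 ;;
esac
case "$ARCH" in
    13|20|21|35) ;;
    *) echo "arch must be 13, 20, 21 or 35" >&2; exit 1 ;;
esac

BUILD_CMD="make precision=$PRECISION arch=$ARCH"

# Print the commands instead of running them; run them by hand in
# lammps/lib/cuda once they look right.
echo "cd lammps/lib/cuda"
echo "make clean"        # needed whenever an option changes
echo "$BUILD_CMD"
```

With the defaults shown, the last line printed is ``make precision=2 arch=20``, i.e. a double-precision build for CC2.0 hardware.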
(b) Build LAMMPS with the USER-CUDA package

.. parsed-literal::

   cd lammps/src
   make yes-user-cuda
   make machine

No additional compile/link flags are needed in Makefile.machine.

Note that if you change the USER-CUDA library precision (discussed
above) and rebuild the USER-CUDA library, then you also need to
re-install the USER-CUDA package and re-build LAMMPS, so that all
affected files are re-compiled and linked to the new USER-CUDA
library.

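The re-install step mentioned above might look like the following sketch. It assumes the usual LAMMPS "make no-package" / "make yes-package" targets for removing and re-adding a package; the names ``LAMMPS_SRC``, ``MACHINE``, ``DRY_RUN``, ``run`` and ``reinstall_user_cuda`` are made up for this example. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: re-install USER-CUDA and re-build LAMMPS after the library
# in lib/cuda has been rebuilt with a different precision.
# LAMMPS_SRC, MACHINE and DRY_RUN are illustrative names.
LAMMPS_SRC=${LAMMPS_SRC:-lammps/src}
MACHINE=${MACHINE:-machine}
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

reinstall_user_cuda() {
    run cd "$LAMMPS_SRC" &&
    run make no-user-cuda &&     # remove the package files from src
    run make yes-user-cuda &&    # copy them back in
    run make "$MACHINE"          # re-compile and re-link LAMMPS
}

reinstall_user_cuda
```

Setting ``DRY_RUN=0`` executes the commands instead of printing them.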
**Run with the USER-CUDA package from the command line:**

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

When using the USER-CUDA package, you must use exactly one MPI task
per physical GPU.

You must use the "-c on" :ref:`command-line switch <start_7>` to enable the USER-CUDA package.
The "-c on" switch also issues a default :doc:`package cuda 1 <package>`
command which sets various USER-CUDA options to default values, as
discussed on the :doc:`package <package>` command doc page.

Use the "-sf cuda" :ref:`command-line switch <start_7>`,
which will automatically append "cuda" to styles that support it. Use
the "-pk cuda Ng" :ref:`command-line switch <start_7>` to
set Ng = # of GPUs per node to a different value than the default set
by the "-c on" switch (1 GPU) or change other :doc:`package cuda <package>` options.

.. parsed-literal::

   lmp_machine -c on -sf cuda -pk cuda 1 -in in.script                         # 1 MPI task uses 1 GPU
   mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script            # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
   mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script    # ditto on 12 16-core nodes

The syntax for the "-pk" switch is the same as the "package
cuda" command. See the :doc:`package <package>` command doc page for
details, including the default values used for all its options if it
is not specified.

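Because exactly one MPI task is required per physical GPU, the task count follows directly from the node and GPU counts. The sketch below works this out and assembles an MPICH-style command line matching the 12-node example above. The variable names ``NODES``, ``GPUS_PER_NODE``, ``EXE``, ``INPUT``, ``NP`` and ``CMD`` are made up for this example, and the command is only printed, not run.

```shell
#!/bin/sh
# Sketch: derive the MPI task count from the node and GPU counts.
# NODES, GPUS_PER_NODE, EXE and INPUT are illustrative names.
NODES=${NODES:-12}
GPUS_PER_NODE=${GPUS_PER_NODE:-2}
EXE=${EXE:-lmp_machine}
INPUT=${INPUT:-in.script}

# One MPI task per physical GPU, so total tasks = nodes * GPUs per node.
NP=$((NODES * GPUS_PER_NODE))

# MPICH-style switches (-np, -ppn); with OpenMPI, -npernode takes the
# place of -ppn.
CMD="mpirun -np $NP -ppn $GPUS_PER_NODE $EXE -c on -sf cuda -pk cuda $GPUS_PER_NODE -in $INPUT"
echo "$CMD"
```

Note that the value passed to "-pk cuda" is the number of GPUs per node, not the total number of GPUs.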
Note that the default for the :doc:`package cuda <package>` command is
to set the Newton flag to "off" for both pairwise and bonded
interactions. This typically gives the fastest performance. If the
:doc:`newton <newton>` command is used in the input script, it can
override these defaults.

**Or run with the USER-CUDA package by editing an input script:**

The discussion above for the mpirun/mpiexec command and the requirement
of one MPI task per GPU is the same.

You must still use the "-c on" :ref:`command-line switch <start_7>` to enable the USER-CUDA package.

Use the :doc:`suffix cuda <suffix>` command, or you can explicitly add a
"cuda" suffix to individual styles in your input script, e.g.

.. parsed-literal::

   pair_style lj/cut/cuda 2.5

You only need to use the :doc:`package cuda <package>` command if you
wish to change any of its option defaults, including the number of
GPUs/node (default = 1), as set by the "-c on" :ref:`command-line switch <start_7>`.

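To show how the input-script route fits together, the following sketch writes a short script fragment that uses the package and suffix commands and then prints a matching launch line. The file name ``in.cuda_header`` and the variable ``SCRIPT`` are made up for this example, and the fragment is not a complete LAMMPS input script.

```shell
#!/bin/sh
# Sketch: write a small input-script fragment that selects USER-CUDA styles.
# SCRIPT and the file name in.cuda_header are illustrative.
SCRIPT=${SCRIPT:-${TMPDIR:-/tmp}/in.cuda_header}

cat > "$SCRIPT" <<'EOF'
# request 2 GPUs per node instead of the default of 1
package cuda 2
# append "cuda" to every style that supports it
suffix cuda
# resolved as lj/cut/cuda because of the suffix command
pair_style lj/cut 2.5
EOF

# The package must still be enabled on the command line with "-c on";
# "-sf" and "-pk" are no longer needed because the script sets them.
echo "mpirun -np 2 lmp_machine -c on -in $SCRIPT"
```

Keeping these settings in the script rather than on the command line makes a run easier to reproduce, at the cost of tying the script to one GPU layout.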
**Speed-ups to expect:**

The performance of a GPU versus a multi-core CPU is a function of your
hardware, which pair style is used, the number of atoms/GPU, and the
precision used on the GPU (double, single, mixed).

See the `Benchmark page <http://lammps.sandia.gov/bench.html>`_ of the
LAMMPS web site for performance of the USER-CUDA package on different
hardware.

**Guidelines for best performance:**

* The USER-CUDA package offers more speed-up relative to CPU performance
  when the number of atoms per GPU is large, e.g. on the order of tens
  or hundreds of thousands.
* As noted above, this package will continue to run a simulation
  entirely on the GPU(s) (except for inter-processor MPI communication),
  for multiple timesteps, until a CPU calculation is required, either by
  a fix or compute that is non-GPU-ized, or until output is performed
  (thermo or dump snapshot or restart file). The less often this
  occurs, the faster your simulation will run.

Restrictions
""""""""""""

None.

.. _lws: http://lammps.sandia.gov
.. _ld: Manual.html
.. _lc: Section_commands.html#comm