processing units (GPUs). We plan to add more over time. Currently,
they only support NVIDIA GPU cards. To use them you need to install
certain NVIDIA CUDA software on your system:
</P>
<UL><LI>Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0
<LI>Go to http://www.nvidia.com/object/cuda_get.html
<LI>Install a driver and toolkit appropriate for your system (the SDK is not necessary)
<LI>Follow the instructions in the README in lammps/lib/gpu to build the library.
<LI>Run lammps/lib/gpu/nvc_get_devices to list supported devices and properties
</UL>
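<P>As a sketch, the library build in the step above might look like the
following on a Linux box; the exact Makefile name depends on your
platform, so consult the README in lammps/lib/gpu:
</P>
<PRE>cd lammps/lib/gpu
make -f Makefile.linux
</PRE>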
<H4>GPU configuration
</H4>
<P>When using GPUs, you are restricted to one physical GPU per LAMMPS
process. Multiple processes can share a single GPU and in many cases
it will be more efficient to run with multiple processes per GPU. Any
GPU accelerated style requires that <A HREF = "fix_gpu.html">fix gpu</A> be used in
the input script to select and initialize the GPUs. The format for the
fix is:
</P>
<PRE>fix <I>name</I> all gpu <I>mode</I> <I>first</I> <I>last</I> <I>split</I>
</PRE>
<P>where <I>name</I> is the name for the fix. The gpu fix must be the first
fix specified for a given run, otherwise the program will exit with an
error. The gpu fix will not have any effect on runs that do not use
GPU acceleration; there should be no problem with specifying the fix
first in any input script.
</P>
<P><I>mode</I> can be either "force" or "force/neigh". In the former, neighbor
list calculation is performed on the CPU using the standard LAMMPS
routines. In the latter, the neighbor list calculation is performed on
the GPU. The GPU neighbor list can be used for better performance;
however, it cannot be used with a triclinic box or with
<A HREF = "pair_hybrid.html">hybrid</A> pair styles.
</P>
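<P>For example, on a node with a single GPU (device 0) and no CPU/GPU
split of the force work, the two modes would be selected as follows;
these lines are illustrative, not tied to any particular script:
</P>
<PRE>fix 0 all gpu force 0 0 1.0        # neighbor lists built on the CPU
fix 0 all gpu force/neigh 0 0 1.0  # neighbor lists built on the GPU
</PRE>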
<P>There are cases when it might be more efficient to select the CPU for
neighbor list builds. If a non-GPU enabled style requires a neighbor
list, it will also be built using CPU routines. Redundant CPU and GPU
neighbor list calculations will typically be less efficient.
</P>
<P><I>first</I> is the ID (as reported by lammps/lib/gpu/nvc_get_devices) of
the first GPU that will be used on each node. <I>last</I> is the ID of the
last GPU that will be used on each node. If you have only one GPU per
node, <I>first</I> and <I>last</I> will typically both be 0. Selecting a
non-sequential set of GPU IDs (e.g. 0,1,3) is not currently supported.
</P>
<P><I>split</I> is the fraction of particles whose forces, torques, energies,
and/or virials will be calculated on the GPU. This can be used to
perform CPU and GPU force calculations simultaneously. If <I>split</I> is
negative, the software will attempt to calculate the optimal fraction
automatically every 25 timesteps based on CPU and GPU timings. Because
the GPU speedups are dependent on the number of particles, automatic
calculation of the split can be less efficient, but typically results
in loop times within 20% of an optimal fixed split.
</P>
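<P>For example, with a single GPU per node, a fixed split that computes
70% of the particles on the GPU, versus an automatically chosen split,
would look like this (the 0.7 value is illustrative):
</P>
<PRE>fix 0 all gpu force/neigh 0 0 0.7
fix 0 all gpu force/neigh 0 0 -1
</PRE>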
<P>If you have two GPUs per node, 8 CPU cores per node, and would like to
run on 4 nodes with dynamic balancing of force calculation across CPU
and GPU cores, the fix might be
</P>
<PRE>fix 0 all gpu force/neigh 0 1 -1
</PRE>
<P>with LAMMPS run on 32 processes. In this case, all CPU cores and GPU
devices on the nodes would be utilized. Each GPU device would be
shared by 4 CPU cores. The CPU cores would perform force calculations
for some fraction of the particles at the same time the GPUs performed
force calculation for the other particles.
</P>
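<P>Such a run might be launched as follows, assuming a LAMMPS executable
named lmp_machine (the actual name depends on how LAMMPS was built)
and that your MPI installation or queueing system maps the 32
processes onto the 4 nodes:
</P>
<PRE>mpirun -np 32 lmp_machine &lt; in.script
</PRE>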
<P>Because of the large number of cores on each GPU device, it might be
more efficient to run on fewer processes per GPU when the number of
particles per process is small (hundreds of particles); this can be
necessary to keep the GPU cores busy.
</P>
<H4>GPU input script
</H4>
<P>To use GPU acceleration in LAMMPS, <A HREF = "fix_gpu.html">fix_gpu</A>
must be used to initialize and configure the GPUs for use.
Additionally, GPU-enabled styles must be selected in the input
script. Currently, this is limited to a few <A HREF = "pair_style.html">pair
styles</A> and PPPM. Some GPU-enabled styles have
additional restrictions listed in their documentation.
</P>
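<P>For example, an input script using the GPU-enabled Lennard-Jones pair
style from the GPU package might contain lines like the following (the
cutoff value is illustrative):
</P>
<PRE>fix 0 all gpu force/neigh 0 0 -1
pair_style lj/cut/gpu 2.5
</PRE>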
<H4>GPU asynchronous pair computation
</H4>
<P>The GPU accelerated pair styles can be used to perform pair style
force calculation on the GPU while other calculations are performed on
the CPU. One method to do this is to specify a <I>split</I> in the gpu fix
as described above. In this case, force calculation for the pair
style will also be performed on the CPU.
</P>
<P>When the CPU work in a GPU pair style has finished, the next force
computation will begin, possibly before the GPU has finished. If
<I>split</I> is 1.0 in the gpu fix, the next force computation will begin
almost immediately. This can be used to run a
<A HREF = "pair_hybrid.html">hybrid</A> GPU pair style at the same time as a hybrid
CPU pair style. In this case, the GPU pair style should be first in
the hybrid command in order to perform simultaneous calculations. This
also allows <A HREF = "bond_style.html">bond</A>, <A HREF = "angle_style.html">angle</A>,
<A HREF = "dihedral_style.html">dihedral</A>, <A HREF = "improper_style.html">improper</A>, and
<A HREF = "kspace_style.html">long-range</A> force computations to be run
simultaneously with the GPU pair style. Once all CPU force
computations have completed, the gpu fix will block until the GPU has
finished all work before continuing the run.
</P>
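<P>For example, a hybrid pair style that runs a GPU sub-style
simultaneously with a CPU sub-style might be specified as follows,
with the GPU style listed first (the styles and cutoffs are
illustrative):
</P>
<PRE>pair_style hybrid lj/cut/gpu 2.5 coul/long 10.0
</PRE>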
<H4>GPU timing
</H4>
<P>GPU accelerated pair styles can perform computations asynchronously
with CPU computations. The "Pair" time reported by LAMMPS will be the
maximum of the time required to complete the CPU pair style
computations and the time required to complete the GPU pair style
computations. Any time spent for GPU-enabled pair styles for
computations that run simultaneously with <A HREF = "bond_style.html">bond</A>,
<A HREF = "angle_style.html">angle</A>, <A HREF = "dihedral_style.html">dihedral</A>,
<A HREF = "improper_style.html">improper</A>, and <A HREF = "kspace_style.html">long-range</A>
calculations will not be included in the "Pair" time.
</P>
<P>When <I>mode</I> for the gpu fix is force/neigh, the time for neighbor list
calculations on the GPU will be added into the "Pair" time, not the
"Neigh" time. A breakdown of the times required for various tasks on
the GPU (data copy, neighbor calculations, force computations, etc.)
is output only with the LAMMPS screen output at the end of each
run. These timings represent the total time spent on the GPU for each
routine, regardless of asynchronous CPU calculations.
</P>
<H4>GPU single vs double precision
</H4>
<P>See the lammps/lib/gpu/README file for instructions on how to build
the LAMMPS gpu library for single, mixed, and double precision. The
latter requires that your GPU card supports double precision.
</P>
<HR>