Note that the GPU package always sets the newton pair setting to off.
This is not the case for the USER-CUDA package.

Optional keyword/value pairs can also be specified. Each has a
default value as listed below.

The gpuID keyword allows selection of which GPUs on each node will
be used for a simulation. GPU IDs range from 0 to N-1, where N is the
physical number of GPUs per node. An ID is specified for each of the
Ngpu GPUs being used. For example, if you have three GPUs on a
machine, one of which is used for the X-Server (the GPU with ID 1)
while the others (with IDs 0 and 2) are used for computations, you
would specify:

package cuda 2 gpuID 0 2

The purpose of the gpuID keyword is to allow two (or more)
simulations to be run on one workstation. In that case one could set
the first simulation to use GPU 0 and the second to use GPU 1. This
is not necessary, however, if the GPUs are in what is called compute
exclusive mode. Using that setting, every process will get its own
GPU automatically. This compute exclusive mode can be set as root
using the nvidia-smi tool, which is part of the CUDA installation.
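
For example, on a typical Linux workstation with recent NVIDIA
drivers, exclusive mode could be enabled for GPU 0 with something like
the following sketch (exact flag names may vary by driver version):

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS   # GPU 0; run as root, flags may differ by driver version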

Also note that if the gpuID keyword is not used, the USER-CUDA
package sorts existing GPUs on each node according to their number of
multiprocessors. This way, compute GPUs will be prioritized over
X-Server GPUs.

If the timing keyword is specified, detailed timing information for
various subroutines will be output.

If the test keyword is specified, information for the atom with the
specified atom-ID will be output at several points during each
timestep. This is mainly useful for debugging purposes. Note that the
simulation will slow down dramatically if this option is used.
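
For example, with two GPUs per node, debug output for a hypothetical
atom with atom-ID 100 and timing output (assuming the timing keyword
takes no value, as implied above) could be requested with:

package cuda 2 test 100 timing   # atom-ID 100 is illustrative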

The thread keyword can be used to specify how GPU threads are
assigned work during pair style force evaluation. If the value =
tpa, one thread per atom is used. If the value = bpa, one block
per atom is used. If the value = auto, a short test is performed at
the beginning of each run to determine whether tpa or bpa mode is
faster. The result of this test is output. Since auto is the
default value, it is usually not necessary to use this keyword.
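
For example, to force block-per-atom mode on a node with two GPUs,
rather than relying on the automatic test, one could specify:

package cuda 2 thread bpa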

The gpu style invokes settings associated with the use of the GPU
package.

The Ngpu argument sets the number of GPUs per node. There must be
at least as many MPI tasks per node as GPUs, as set by the mpirun or
mpiexec command. If there are more MPI tasks (per node)
than GPUs, multiple MPI tasks will share each GPU.

Optional keyword/value pairs can also be specified. Each has a
default value as listed below.

The neigh keyword specifies where neighbor lists for pair style
computation will be built. If neigh is yes, which is the default,
neighbor list building is performed on the GPU. If neigh is no,
neighbor list building is performed on the CPU. GPU neighbor list
building currently cannot be used with a triclinic box. GPU neighbor
list building also cannot currently be used with hybrid pair styles.
GPU neighbor lists are not compatible with commands that are not
GPU-enabled. When a non-GPU enabled command requires a neighbor list,
it will also be built on the CPU. In these cases, it will typically
be more efficient to only use CPU neighbor list builds.
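
For example, a run using one GPU per node with a triclinic box or a
hybrid pair style could request CPU neighbor list builds as follows:

package gpu 1 neigh no   # build neighbor lists on the CPU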

The split keyword can be used for load balancing force calculations
between CPU and GPU cores in GPU-enabled pair styles. If 0 < split <
1.0, a fixed fraction of particles is offloaded to the GPU while force
calculation for the other particles occurs simultaneously on the
CPU. If split < 0.0, the optimal fraction (based on CPU and GPU
timings) is calculated every 25 timesteps. If split = 1.0, all
force calculations for GPU accelerated pair styles are performed on
the GPU. In this case, other hybrid pair interactions, bond, angle,
dihedral, improper, and long-range calculations can be performed on
the CPU while the GPU is performing force calculations for the
GPU-enabled pair style. If all CPU force computations complete
before the GPU completes, LAMMPS will block until the GPU has
finished before continuing the timestep.

As an example, if you have two GPUs per node and 8 CPU cores per node,
and would like to run on 4 nodes (32 cores) with dynamic balancing of
force calculation across CPU and GPU cores, you could specify

mpirun -np 32 lmp_machine -sf gpu -in in.script   # launch command
package gpu 2 split -1                            # input script command

In this case, all CPU cores and GPU devices on the nodes would be
utilized. Each GPU device would be shared by 4 CPU cores. The CPU
cores would perform force calculations for some fraction of the
particles at the same time the GPUs performed force calculation for
the other particles.

The gpuID keyword allows selection of which GPUs on each node will
be used for a simulation. The first and last values specify the
GPU IDs to use (from 0 to Ngpu-1). By default, first = 0 and last =
Ngpu-1, so that all GPUs are used, assuming Ngpu is set to the number
of physical GPUs. If you only wish to use a subset, set Ngpu to a
smaller number and first/last to a sub-range of the available GPUs.
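
For example, on a node with three physical GPUs where GPU 0 drives the
X-Server, the remaining two GPUs could be selected as follows:

package gpu 2 gpuID 1 2   # use GPUs 1 and 2, skip GPU 0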

The tpa keyword sets the number of GPU threads per atom used to
perform force calculations. With the default value of 1, the number
of threads will be chosen based on the pair style; however, the value
can be set explicitly with this keyword to fine-tune performance. For
large cutoffs or with a small number of particles per GPU, increasing
the value can improve performance. The number of threads per atom must
be a power of 2 and currently cannot be greater than 32.
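
For example, a single-GPU run with a large cutoff might try 8 threads
per atom (an illustrative value):

package gpu 1 tpa 8   # must be a power of 2, no greater than 32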

The binsize keyword sets the size of bins used to bin atoms in
neighbor list builds. Setting this value is normally not needed; the
optimal value is close to the default, which is set equal to the
cutoff distance for the short range interactions plus the neighbor
skin. Note that this is 2x larger than the default bin size for
neighbor list builds on the CPU. This is because GPUs can perform
efficiently with much larger cutoffs than CPUs. This can be used to
reduce the time required for long-range calculations or in some cases
to eliminate them with pair style models such as coul/wolf or
coul/dsf. For very large cutoffs, it can be more efficient to use
smaller values for binsize in parallel simulations. For example, with
a cutoff of 20*sigma in LJ units and a neighbor skin distance of
sigma, a binsize = 5.25*sigma can be more efficient than the
default.
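
As a sketch of the example just described (LJ units with sigma = 1,
so distances are in units of sigma), the bin size could be set
explicitly as follows:

package gpu 1 binsize 5.25   # illustrative value in LJ distance units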

The device keyword can be used to tune parameters optimized for a
specific accelerator when using OpenCL. For CUDA, the device
keyword is ignored. Currently, the device type is limited to NVIDIA
Kepler, NVIDIA Fermi, AMD Cypress, or a generic device. More devices
may be added later. The default device type can be specified when
building LAMMPS with the GPU library, via settings in the
lib/gpu/Makefile that is used.
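
For example, assuming the device types listed above are selected by
lowercase names such as kepler, an OpenCL build could be tuned as
follows:

package gpu 1 device kepler   # assumes lowercase device-type names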

The intel style invokes settings associated with the use of the
USER-INTEL package. All of its settings except the prec keyword
are ignored if LAMMPS was not built with Xeon Phi coprocessor support
when building with the USER-INTEL package. All of its settings,
including the prec keyword, are applicable if LAMMPS was built with
coprocessor support.

The Nphi argument sets the number of coprocessors per node.

Optional keyword/value pairs can also be specified. Each has a
default value as listed below.

The prec keyword argument determines the precision mode to use for
computing pair style forces, either on the CPU or on the coprocessor,
when using a USER-INTEL supported pair style. It can take a value
of single, mixed (the default), or double. Single means single
precision is used for the entire force calculation. Mixed means
forces between a pair of atoms are computed in single precision, but
accumulated and stored in double precision, including storage of
forces, torques, energies, and virial quantities. Double means
double precision is used for the entire force calculation.
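
For example, to request full double precision with one coprocessor per
node, one could specify:

package intel 1 prec double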

The balance keyword sets the fraction of pair style work offloaded
to the coprocessor, for values between 0.0 and 1.0 inclusive. While
this fraction of work is running on the coprocessor, other
calculations will run on the host, including neighbor and pair
calculations that are not offloaded, angle, bond, dihedral, kspace,
and some MPI communications. If the balance value is set to -1, the
fraction of work is dynamically adjusted automatically throughout the
run. This typically gives performance within 5 to 10 percent of the
optimal fixed fraction.
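
For example, to offload a fixed 60 percent of the pair style work to
one coprocessor per node (an illustrative fraction):

package intel 1 balance 0.6   # 0.6 is an illustrative fraction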

The ghost keyword determines whether or not ghost atoms, i.e. atoms
at the boundaries of processor sub-domains, are offloaded for neighbor
and force calculations. When the value = "no", ghost atoms are not
offloaded. This option can reduce the amount of data transfer with
the coprocessor and can also overlap MPI communication of forces with
computation on the coprocessor when the newton pair
setting is "on". When the value = "yes", ghost atoms are offloaded.
In some cases this can provide better performance, especially if the
balance fraction is high.
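
For example, to keep ghost atoms on the host, so that MPI
communication of forces can overlap with coprocessor computation when
newton pair is on:

package intel 1 ghost no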

The tpc keyword sets the maximum # of threads Ntpc that will
run on each physical core of the coprocessor. The default value is
set to 4, which is the number of hardware threads per core supported
by the current generation Xeon Phi chips.

The tptask keyword sets the maximum # of threads Ntptask that will
be used on the coprocessor for each MPI task. This, along with the
tpc keyword setting, are the only methods for changing the number of
threads used on the coprocessor. The default value is set to 240 =
60*4, which is the maximum # of threads supported by an entire current
generation Xeon Phi chip.
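
For example, to limit the coprocessor to 2 threads per core and 120
threads per MPI task (both illustrative values):

package intel 1 tpc 2 tptask 120   # illustrative thread limits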

The kokkos style invokes settings associated with the use of the
KOKKOS package.

The neigh keyword determines what kinds of neighbor lists are built.

The omp style invokes settings associated with the use of the
USER-OMP package.

The Nthreads argument sets the number of OpenMP threads allocated for
each MPI task. For example, if your system has nodes with dual
quad-core processors, it has a total of 8 cores per node. You could
use two MPI tasks per node (e.g. using the -ppn option of the mpirun
command), and set Nthreads = 4. This would use all 8 cores on each
node. Note that the product of MPI tasks * threads/task should not
exceed the physical number of cores (on a node), otherwise performance
will suffer.
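
As a sketch of that example (assuming a hypothetical 4-node run and an
MPI launcher that accepts the -ppn option), the launch and input
script commands could look like:

mpirun -np 8 -ppn 2 lmp_machine -sf omp -in in.script   # launch command: 2 MPI tasks per node
package omp 4                                           # input script command: 4 threads per task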

Setting Nthreads = 0 instructs LAMMPS to use whatever value is the
default for the given OpenMP environment. This is usually determined
via the OMP_NUM_THREADS environment variable or the compiler
runtime. Note that in most cases the default for OpenMP capable
compilers is to use one thread for each available CPU core when
OMP_NUM_THREADS is not explicitly set, which can lead to poor
performance.

Here are examples of how to set the environment variable when
launching LAMMPS:

env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script

or you can set it permanently in your shell's start-up script.
All three of these examples use a total of 4 CPU cores.

Note that different MPI implementations have different ways of passing
the OMP_NUM_THREADS environment variable to all MPI processes. The
2nd example line above is for MPICH; the 3rd example line with -x is
for OpenMPI. Check your MPI documentation for additional details.

What combination of threads and MPI tasks gives the best performance
is difficult to predict and can depend on many components of your
input. Not all features of LAMMPS support OpenMP threading via the
USER-OMP package, and the parallel efficiency can be very different,
too.

Optional keyword/value pairs can also be specified. Each has a
default value as listed below.

The neigh keyword specifies whether neighbor list building will be
multi-threaded in addition to force calculations. If neigh is set
to no, then neighbor list calculation is performed only by MPI tasks
with no OpenMP threading. If neigh is yes (the default), a
multi-threaded neighbor list build is used. Using neigh = yes is
almost always faster and should produce identical neighbor lists at
the expense of using more memory. Specifically, neighbor list pages
are allocated for all threads at the same time and each thread works
within its own pages.
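
For example, to use 4 threads per MPI task for force calculations but
keep neighbor list builds un-threaded:

package omp 4 neigh no   # 4 threads for forces, serial neighbor builds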

The gpu style of this command can only be invoked if LAMMPS was built
with the GPU package. See the Making
LAMMPS section for more info.

The intel style of this command can only be invoked if LAMMPS was
built with the USER-INTEL package. See the Making
LAMMPS section for more info.

The kk style of this command can only be invoked if LAMMPS was built
with the KOKKOS package. See the Making
LAMMPS section for more info.

The omp style of this command can only be invoked if LAMMPS was built
with the USER-OMP package. See the Making
LAMMPS section for more info.

Related commands:

suffix, "-pk" command-line setting

Default:

To use the USER-CUDA package, the package cuda command must be invoked
explicitly in your input script or via the "-pk cuda" command-line
switch. This will set the # of GPUs/node. The option defaults are
gpuID = 0 to Ngpu-1, timing = not enabled, test = not enabled, and
thread = auto.
For the GPU package, the default is Ngpu = 1 and the option defaults
+are neigh = yes, split = 1.0, gpuID = 0 to Ngpu-1, tpa = 1, binsize =
+pair cutoff + neighbor skin, device = not used. These settings are
+made automatically if the "-sf gpu" command-line
+switch is used. If it is not used, you
+must invoke the package gpu command in your input script or via the
+"-pk gpu" command-line switch.

For the USER-INTEL package, the default is Nphi = 1 and the option
defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240. The
default ghost option is determined by the pair style being used. The
value used is output to the screen in the offload report at the end
of each run. These settings are made automatically if the "-sf intel"
command-line switch is used. If it is not used, you must invoke the
package intel command in your input script or via the "-pk intel"
command-line switch.

The default settings for the KOKKOS package are "package kokkos neigh
full comm/exchange host comm/forward host". This is the case whether
the "-sf kk" command-line switch is used or not.

For the USER-OMP package, the default is Nthreads = 0 and the option
defaults are neigh = yes. These settings are made automatically if
the "-sf omp" command-line switch is used. If it is not used, you
must invoke the package omp command in your input script or via the
"-pk omp" command-line switch.