Syntax:
package style args
  style = cuda or gpu or intel or kokkos or omp
  args = arguments specific to the style
    cuda args = Ngpu keyword value ...
      Ngpu = # of GPUs per node
      zero or more keyword/value pairs may be appended
      keywords = gpuID or timing or test or thread
        gpuID values = gpu1 .. gpuN
          gpu1 .. gpuN = IDs of the Ngpu GPUs to use
        timing values = none
        test values = id
          id = atom-ID of a test particle
        thread value = auto or tpa or bpa
          auto = test whether tpa or bpa is faster
          tpa = one thread per atom
          bpa = one block per atom
    gpu args = Ngpu keyword value ...
      Ngpu = # of GPUs per node
      zero or more keyword/value pairs may be appended
      keywords = neigh or split or gpuID or tpa or binsize or device
        neigh value = yes or no
          yes = neighbor list build on GPU (default)
          no = neighbor list build on CPU
        split value = fraction
          fraction = fraction of atoms assigned to GPU (default = 1.0)
        gpuID values = first last
          first = ID of first GPU to be used on each node
          last = ID of last GPU to be used on each node
        tpa value = Nthreads
          Nthreads = # of GPU threads used per atom
        binsize value = size
          size = bin size for neighbor list construction (distance units)
        device value = device_type
          device_type = kepler or fermi or cypress or generic
    intel args = Nphi keyword value ...
      Nphi = # of coprocessors per node
      zero or more keyword/value pairs may be appended
      keywords = prec or balance or ghost or tpc or tptask
        prec value = single or mixed or double
          single = perform force calculations in single precision
          mixed = perform force calculations in mixed precision
          double = perform force calculations in double precision
        balance value = split
          split = fraction of work to offload to coprocessor, -1 for dynamic
        ghost value = yes or no
          yes = include ghost atoms for offload
          no = do not include ghost atoms for offload
        tpc value = Ntpc
          Ntpc = number of threads to use on each physical core of coprocessor
        tptask value = Ntptask
          Ntptask = max number of threads to use on coprocessor for each MPI task
    kokkos args = keyword value ...
      zero or more keyword/value pairs may be appended
      keywords = neigh or comm or comm/exchange or comm/forward
        neigh value = full or half/thread or half or n2 or full/cluster
          full = full neighbor list
          half/thread = half neighbor list built in thread-safe manner
          half = half neighbor list, not thread-safe, only use when 1 thread/MPI task
          n2 = non-binning neighbor list build, O(N^2) algorithm
          full/cluster = full neighbor list with clustered groups of atoms
        comm value = no or host or device
          use value for both comm/exchange and comm/forward
        comm/exchange value = no or host or device
        comm/forward value = no or host or device
          no = perform communication pack/unpack in non-KOKKOS mode
          host = perform pack/unpack on host (e.g. with OpenMP threading)
          device = perform pack/unpack on device (e.g. on GPU)
    omp args = Nthreads keyword value ...
      Nthreads = # of OpenMP threads to associate with each MPI process
      zero or more keyword/value pairs may be appended
      keywords = neigh
        neigh value = yes or no
          yes = threaded neighbor list build (default)
          no = non-threaded neighbor list build
Examples:
package gpu 1
package gpu 1 split 0.75
package gpu 2 split -1.0
package cuda 2 gpuID 0 2
package cuda 1 test 3948
package kokkos neigh half/thread comm device
package omp 0 neigh no
package omp 4
package intel 1 prec mixed balance -1
Description:
This command invokes package-specific settings for the various accelerator packages available in LAMMPS. Currently the following packages use settings from this command: USER-CUDA, GPU, USER-INTEL, KOKKOS, and USER-OMP.
If this command is specified in an input script, it must be near the top of the script, before the simulation box has been defined. This is because it specifies settings that the accelerator packages use in their initialization, before a simulation is defined.
This command can also be specified from the command-line when launching LAMMPS, using the "-pk" command-line switch. The syntax is exactly the same as when used in an input script.
Note that every accelerator package except the OPT package requires the package command to be specified if the package is to be used in a simulation (LAMMPS can be built with an accelerator package without using it in a particular simulation). However, in all cases, a default version of the command is typically invoked by other accelerator settings.
The USER-CUDA and KOKKOS packages require a "-c on" or "-k on" command-line switch respectively, which invokes a "package cuda" or "package kokkos" command with default settings.
For the GPU, USER-INTEL, and USER-OMP packages, if a "-sf gpu" or "-sf intel" or "-sf omp" command-line switch is used to auto-append accelerator suffixes to various styles in the input script, then those switches also invoke a "package gpu", "package intel", or "package omp" command with default settings.
IMPORTANT NOTE: A package command for a particular style can be invoked multiple times when a simulation is set up, e.g. by the "-c on", "-k on", "-sf", and "-pk" command-line switches, and by using this command in an input script. Each time it is used, all of the style options are set, either to default values or to specified settings. That is, settings from previous invocations do not persist across multiple invocations.
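The reset-on-every-invocation behavior can be pictured with a small Python model. This is only an illustrative sketch, not LAMMPS code: the function apply_package_gpu and the dictionary layout are hypothetical, and the default values are the GPU package defaults listed in the Default section of this page.

```python
# Toy model of "settings do not persist"; not part of LAMMPS.
# Defaults mirror the GPU package defaults documented on this page.
GPU_DEFAULTS = {"neigh": "yes", "split": 1.0, "tpa": 1}


def apply_package_gpu(ngpu, **keywords):
    """Return the settings in effect after one 'package gpu' invocation.

    Every call starts again from the defaults, so options given in an
    earlier call are forgotten unless they are repeated.
    """
    settings = dict(GPU_DEFAULTS)   # start from defaults each time
    settings["Ngpu"] = ngpu
    settings.update(keywords)       # apply only what this call specifies
    return settings


# First invocation, e.g. from "-pk gpu 2 split 0.75" on the command line.
first = apply_package_gpu(2, split=0.75)

# Second invocation, e.g. "package gpu 2 neigh no" in the input script.
# split is back at its default because it was not repeated.
second = apply_package_gpu(2, neigh="no")
```

In this model the second call ends with split = 1.0, not 0.75, which is why any option that must survive has to be restated in the last package command that is processed.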
See the Accelerate section of the manual for more details about using the various accelerator packages for speeding up LAMMPS simulations.
The cuda style invokes settings associated with the use of the USER-CUDA package.
The Ngpu argument sets the number of GPUs per node. There must be exactly one MPI task per GPU, as set by the mpirun or mpiexec command.
Optional keyword/value pairs can also be specified. Each has a default value as listed below.
The gpuID keyword allows selection of which GPUs on each node will be used for a simulation. GPU IDs range from 0 to N-1, where N is the physical number of GPUs per node. An ID is specified for each of the Ngpu GPUs being used. For example, if you have three GPUs on a machine, one of which is used for the X-Server (the GPU with ID 1) while the others (with IDs 0 and 2) are used for computations, you would specify:
package cuda 2 gpuID 0 2
The purpose of the gpuID keyword is to allow two (or more) simulations to be run on one workstation. In that case one could set the first simulation to use GPU 0 and the second to use GPU 1. This is not necessary, however, if the GPUs are in what is called compute exclusive mode. With that setting, every process gets its own GPU automatically. Compute exclusive mode can be set as root using the nvidia-smi tool, which is part of the CUDA installation.
Also note that if the gpuID keyword is not used, the USER-CUDA package sorts the existing GPUs on each node according to their number of multiprocessors. This way, compute GPUs will be prioritized over X-Server GPUs.
If the timing keyword is specified, detailed timing information for various subroutines will be output.
If the test keyword is specified, information for the atom with the specified atom-ID will be output at several points during each timestep. This is mainly useful for debugging purposes. Note that the simulation slows down dramatically if this option is used.
The thread keyword can be used to specify how GPU threads are assigned work during pair style force evaluation. If the value = tpa, one thread per atom is used. If the value = bpa, one block per atom is used. If the value = auto, a short test is performed at the beginning of each run to determine whether tpa or bpa mode is faster. The result of this test is output. Since auto is the default value, it is usually not necessary to use this keyword.
The gpu style invokes settings associated with the use of the GPU package.
The Ngpu argument sets the number of GPUs per node. There must be at least as many MPI tasks per node as GPUs, as set by the mpirun or mpiexec command. If there are more MPI tasks (per node) than GPUs, multiple MPI tasks will share each GPU.
Optional keyword/value pairs can also be specified. Each has a default value as listed below.
The neigh keyword specifies where neighbor lists for pair style computation will be built. If neigh is yes, which is the default, neighbor list building is performed on the GPU. If neigh is no, neighbor list building is performed on the CPU. GPU neighbor list building currently cannot be used with a triclinic box. GPU neighbor list calculation currently cannot be used with hybrid pair styles. GPU neighbor lists are not compatible with commands that are not GPU-enabled. When a non-GPU-enabled command requires a neighbor list, it will also be built on the CPU. In these cases, it will typically be more efficient to only use CPU neighbor list builds.
The split keyword can be used for load balancing force calculations between CPU and GPU cores in GPU-enabled pair styles. If 0 < split < 1.0, a fixed fraction of particles is offloaded to the GPU while force calculation for the other particles occurs simultaneously on the CPU. If split < 0.0, the optimal fraction (based on CPU and GPU timings) is calculated every 25 timesteps. If split = 1.0, all force calculations for GPU-accelerated pair styles are performed on the GPU. In this case, other hybrid pair interactions, bond, angle, dihedral, improper, and long-range calculations can be performed on the CPU while the GPU is performing force calculations for the GPU-enabled pair style. If all CPU force computations complete before the GPU completes, LAMMPS will block until the GPU has finished before continuing the timestep.
As an example, if you have two GPUs per node and 8 CPU cores per node, and would like to run on 4 nodes (32 cores) with dynamic balancing of force calculation across CPU and GPU cores, you could specify
mpirun -np 32 lmp_machine -sf gpu -in in.script    # launch command
package gpu 2 split -1                              # input script command
In this case, all CPU cores and GPU devices on the nodes would be utilized. Each GPU device would be shared by 4 CPU cores. The CPU cores would perform force calculations for some fraction of the particles at the same time the GPUs performed force calculation for the other particles.
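The sharing ratio in this example follows from simple arithmetic, sketched below in Python. The function name mpi_tasks_per_gpu is hypothetical, and the sketch assumes one MPI task is launched per CPU core, as in the example above.

```python
def mpi_tasks_per_gpu(nodes, cores_per_node, gpus_per_node):
    """Return (total MPI tasks, MPI tasks sharing each GPU).

    Assumes one MPI task per CPU core on every node.
    """
    tasks_per_node = cores_per_node
    if tasks_per_node < gpus_per_node:
        # The GPU package needs at least as many MPI tasks per node as GPUs.
        raise ValueError("need at least as many MPI tasks per node as GPUs")
    total_tasks = nodes * tasks_per_node
    return total_tasks, tasks_per_node / gpus_per_node


# 4 nodes, 8 cores and 2 GPUs per node, as in the example above.
total, share = mpi_tasks_per_gpu(nodes=4, cores_per_node=8, gpus_per_node=2)
```

Here total is the value passed to "-np" and share is the number of MPI tasks that use each GPU device.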
The gpuID keyword allows selection of which GPUs on each node will be used for a simulation. The first and last values specify the GPU IDs to use (from 0 to Ngpu-1). By default, first = 0 and last = Ngpu-1, so that all GPUs are used, assuming Ngpu is set to the number of physical GPUs. If you only wish to use a subset, set Ngpu to a smaller number and first/last to a sub-range of the available GPUs.
The tpa keyword sets the number of GPU threads per atom used to perform force calculations. With the default value of 1, the number of threads is chosen based on the pair style; the value can, however, be set explicitly with this keyword to fine-tune performance. For large cutoffs or with a small number of particles per GPU, increasing the value can improve performance. The number of threads per atom must be a power of 2 and currently cannot be greater than 32.
The binsize keyword sets the size of bins used to bin atoms in neighbor list builds. Setting this value is normally not needed; the optimal value is close to the default, which is set equal to the cutoff distance for the short-range interactions plus the neighbor skin. Note that this is 2x larger than the default bin size for neighbor list builds on the CPU. This is because GPUs can perform efficiently with much larger cutoffs than CPUs. Larger cutoffs can be used to reduce the time required for long-range calculations or, in some cases, to eliminate them with pair style models such as coul/wolf or coul/dsf. For very large cutoffs, it can be more efficient to use smaller values for binsize in parallel simulations. For example, with a cutoff of 20*sigma in LJ units and a neighbor skin distance of sigma, a binsize = 5.25*sigma can be more efficient than the default.
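The constraint on the tpa value can be checked before launching a run. The following Python sketch is a hypothetical helper, not part of LAMMPS; it only encodes the rule stated above (a power of 2, no greater than 32).

```python
def is_valid_tpa(nthreads):
    """True if nthreads is a power of 2 between 1 and 32 inclusive."""
    return (
        isinstance(nthreads, int)
        and 1 <= nthreads <= 32
        # A power of 2 has exactly one bit set, so n & (n - 1) is zero.
        and (nthreads & (nthreads - 1)) == 0
    )


# All accepted values in the range 1..64.
valid = [n for n in range(1, 65) if is_valid_tpa(n)]
```

The list valid contains the only values the rule allows: 1, 2, 4, 8, 16, and 32.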
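The arithmetic behind this example can be checked quickly. The following Python sketch is only illustrative: the function name default_gpu_binsize is hypothetical, and sigma is set to 1.0 as in reduced LJ units. It shows that the quoted binsize of 5.25*sigma equals one quarter of the default (cutoff plus skin). The text does not say how that value was chosen, so treat the factor of 4 as an observation rather than a rule.

```python
def default_gpu_binsize(cutoff, skin):
    """Default GPU bin size: short-range cutoff plus neighbor skin."""
    return cutoff + skin


sigma = 1.0                                            # reduced LJ units
default = default_gpu_binsize(20 * sigma, 1 * sigma)   # cutoff + skin
quarter = default / 4                                  # value used in the example
```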
The device keyword can be used to tune parameters optimized for a specific accelerator, when using OpenCL. For CUDA, the device keyword is ignored. Currently, the device type is limited to NVIDIA Kepler, NVIDIA Fermi, AMD Cypress, or a generic device. More devices may be added later. The default device type can be specified when building LAMMPS with the GPU library, via settings in the lib/gpu/Makefile that is used.
The intel style invokes settings associated with the use of the USER-INTEL package. If LAMMPS was built with the USER-INTEL package but without Xeon Phi coprocessor support, all of its settings except the prec keyword are ignored. If LAMMPS was built with coprocessor support, all of its settings, including the prec keyword, are applicable.
The Nphi argument sets the number of coprocessors per node.
Optional keyword/value pairs can also be specified. Each has a default value as listed below.
The prec keyword argument determines the precision mode to use for computing pair style forces, either on the CPU or on the coprocessor, when using a USER-INTEL supported pair style. It can take a value of single, mixed which is the default, or double. Single means single precision is used for the entire force calculation. Mixed means forces between a pair of atoms are computed in single precision, but accumulated and stored in double precision, including storage of forces, torques, energies, and virial quantities. Double means double precision is used for the entire force calculation.
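The difference between the three modes can be illustrated with a small Python sketch. It is not how the USER-INTEL package is implemented; the function names are hypothetical, the repeated value 0.1 merely stands in for many pair contributions, and single precision is emulated by round-tripping values through a 4-byte float with the standard struct module.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate_single(terms):
    """Single mode: terms and the running total are both single precision."""
    total = 0.0
    for t in terms:
        total = to_single(total + to_single(t))
    return total


def accumulate_mixed(terms):
    """Mixed mode: terms are single precision, the running total is double."""
    total = 0.0
    for t in terms:
        total += to_single(t)
    return total


def accumulate_double(terms):
    """Double mode: everything stays in double precision."""
    total = 0.0
    for t in terms:
        total += t
    return total


terms = [0.1] * 10000            # stand-in for many pair contributions
exact = 1000.0
err_single = abs(accumulate_single(terms) - exact)
err_mixed = abs(accumulate_mixed(terms) - exact)
err_double = abs(accumulate_double(terms) - exact)
```

In this toy case the accumulated error grows from double to mixed to single, which is the trade-off the prec keyword exposes: mixed keeps the cheaper single-precision pair arithmetic while avoiding most of the round-off that builds up during accumulation.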
The balance keyword sets the fraction of pair style work offloaded to the coprocessor for split values between 0.0 and 1.0 inclusive. While this fraction of work is running on the coprocessor, other calculations will run on the host, including neighbor and pair calculations that are not offloaded, angle, bond, dihedral, kspace, and some MPI communications. If split is set to -1, the fraction of work is adjusted dynamically throughout the run. This typically gives performance within 5 to 10 percent of the optimal fixed fraction.
The ghost keyword determines whether or not ghost atoms, i.e. atoms at the boundaries of processor sub-domains, are offloaded for neighbor and force calculations. When the value = "no", ghost atoms are not offloaded. This option can reduce the amount of data transfer with the coprocessor and can also overlap MPI communication of forces with computation on the coprocessor when the newton pair setting is "on". When the value = "yes", ghost atoms are offloaded. In some cases this can provide better performance, especially if the balance fraction is high.
The tpc keyword sets the maximum # of threads Ntpc that will run on each physical core of the coprocessor. The default value is set to 4, which is the number of hardware threads per core supported by the current generation Xeon Phi chips.
The tptask keyword sets the maximum # of threads (Ntptask) that will be used on the coprocessor for each MPI task. This, along with the tpc keyword setting, is the only method for changing the number of threads used on the coprocessor. The default value is set to 240 = 60*4, which is the maximum # of threads supported by an entire current generation Xeon Phi chip.
The kokkos style invokes settings associated with the use of the KOKKOS package.
All of the settings are optional keyword/value pairs. Each has a default value as listed below.
The neigh keyword determines how neighbor lists are built. A value of half uses half-neighbor lists, the same as used by most pair styles in LAMMPS. A value of half/thread uses a thread-safe variant of the half-neighbor list. It should be used instead of half when running with more than 1 thread per MPI task on a CPU. A value of n2 uses an O(N^2) algorithm to build the neighbor list without binning, where N = # of atoms on a processor. It is typically slower than the other methods, which use binning.
A value of full uses a full neighbor list and is the default. This performs twice as much computation as the half option; however, that is often a win because it is thread-safe and doesn't require atomic operations in the calculation of pair forces.
A value of full/cluster is an experimental neighbor style, where particles interact with all particles within a small cluster, if at least one of the cluster's particles is within the neighbor cutoff range. This potentially allows for better vectorization on architectures such as the Intel Phi. It also reduces the size of the neighbor list by roughly a factor of the cluster size, thus reducing the total memory footprint considerably.
The comm and comm/exchange and comm/forward keywords determine whether the host or device performs the packing and unpacking of data when communicating per-atom data between processors. "Exchange" communication happens only on timesteps when neighbor lists are rebuilt. The data is only for atoms that migrate to new processors. "Forward" communication happens every timestep. The data is for atom coordinates and any other atom properties that need to be updated for ghost atoms owned by each processor.
The comm keyword is simply a short-cut to set the same value for both the comm/exchange and comm/forward keywords.
The value options for all 3 keywords are no or host or device. A value of no means to use the standard non-KOKKOS method of packing/unpacking data for the communication. A value of host means to use the host, typically a multi-core CPU, and perform the packing/unpacking in parallel with threads. A value of device means to use the device, typically a GPU, to perform the packing/unpacking operation.
The optimal choice for these keywords depends on the input script and the hardware used. The no value is useful for verifying that the Kokkos-based host and device values are working correctly. It may also be the fastest choice when using Kokkos styles in MPI-only mode (i.e. with a thread count of 1).
When running on CPUs or Xeon Phi, the host and device values work identically. When using GPUs, the device value will typically be optimal if all of the styles used in your input script are supported by the KOKKOS package. In this case data can stay on the GPU for many timesteps without being moved between the host and GPU, if you use the device value. This requires that your MPI is able to access GPU memory directly. Currently that is true for OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses styles (e.g. fixes) which are not yet supported by the KOKKOS package, then data has to be moved between the host and device anyway, so it is typically faster to let the host handle communication, by using the host value. Using host instead of no will enable use of multiple threads to pack/unpack communicated data.
The omp style invokes settings associated with the use of the USER-OMP package.
The Nthreads argument sets the number of OpenMP threads allocated for each MPI task. For example, if your system has nodes with dual quad-core processors, it has a total of 8 cores per node. You could use two MPI tasks per node (e.g. using the -ppn option of the mpirun command), and set Nthreads = 4. This would use all 8 cores on each node. Note that the product of MPI tasks * threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.
Setting Nthreads = 0 instructs LAMMPS to use whatever value is the default for the given OpenMP environment. This is usually determined via the OMP_NUM_THREADS environment variable or the compiler runtime. Note that in most cases the default for OpenMP-capable compilers is to use one thread for each available CPU core when OMP_NUM_THREADS is not explicitly set, which can lead to poor performance.
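The rule about the product of MPI tasks and threads can be written as a one-line check. The Python helper below is hypothetical and only restates the condition given above; it does not query the hardware.

```python
def oversubscribed(mpi_tasks_per_node, threads_per_task, cores_per_node):
    """True if MPI tasks * threads/task exceeds the physical cores on a node."""
    return mpi_tasks_per_node * threads_per_task > cores_per_node


# Dual quad-core node (8 cores): 2 MPI tasks with 4 threads each fits exactly.
fits = oversubscribed(2, 4, 8)

# Doubling the thread count to 8 per task would ask for 16 threads on 8 cores.
too_many = oversubscribed(2, 8, 8)
```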
Here are examples of how to set the environment variable when launching LAMMPS:
env OMP_NUM_THREADS=4 lmp_machine -sf omp -in in.script
env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script
mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
or you can set it permanently in your shell's start-up script. All three of these examples use a total of 4 CPU cores.
Note that different MPI implementations have different ways of passing the OMP_NUM_THREADS environment variable to all MPI processes. The 2nd example line above is for MPICH; the 3rd example line with -x is for OpenMPI. Check your MPI documentation for additional details.
Which combination of threads and MPI tasks gives the best performance is difficult to predict and can depend on many components of your input. Not all features of LAMMPS support OpenMP threading via the USER-OMP package, and the parallel efficiency can be very different, too.
Optional keyword/value pairs can also be specified. Each has a default value as listed below.
The neigh keyword specifies whether neighbor list building will be multi-threaded in addition to force calculations. If neigh is set to no, then neighbor list calculation is performed only by MPI tasks with no OpenMP threading. If neigh is yes (the default), a multi-threaded neighbor list build is used. Using neigh = yes is almost always faster and should produce identical neighbor lists at the expense of using more memory. Specifically, neighbor list pages are allocated for all threads at the same time and each thread works within its own pages.
Restrictions:
This command cannot be used after the simulation box is defined by a read_data or create_box command.
The cuda style of this command can only be invoked if LAMMPS was built with the USER-CUDA package. See the Making LAMMPS section for more info.
The gpu style of this command can only be invoked if LAMMPS was built with the GPU package. See the Making LAMMPS section for more info.
The intel style of this command can only be invoked if LAMMPS was built with the USER-INTEL package. See the Making LAMMPS section for more info.
The kokkos style of this command can only be invoked if LAMMPS was built with the KOKKOS package. See the Making LAMMPS section for more info.
The omp style of this command can only be invoked if LAMMPS was built with the USER-OMP package. See the Making LAMMPS section for more info.
Related commands:
suffix, "-pk" command-line setting
Default:
For the USER-CUDA package, the default is Ngpu = 1 and the option defaults are gpuID = 0 to Ngpu-1, timing = not enabled, test = not enabled, and thread = auto. These settings are made automatically by the required "-c on" command-line switch. You can change them by using the package cuda command in your input script or via the "-pk cuda" command-line switch.
For the GPU package, the default is Ngpu = 1 and the option defaults are neigh = yes, split = 1.0, gpuID = 0 to Ngpu-1, tpa = 1, binsize = pair cutoff + neighbor skin, device = not used. These settings are made automatically if the "-sf gpu" command-line switch is used. If it is not used, you must invoke the package gpu command in your input script or via the "-pk gpu" command-line switch.
For the USER-INTEL package, the default is Nphi = 1 and the option defaults are prec = mixed, balance = -1, tpc = 4, tptask = 240. Note that all of these settings, except "prec", are ignored if LAMMPS was not built with Xeon Phi coprocessor support. The default ghost option is determined by the pair style being used. This value is output to the screen in the offload report at the end of each run. These settings are made automatically if the "-sf intel" command-line switch is used. If it is not used, you must invoke the package intel command in your input script or via the "-pk intel" command-line switch.
For the KOKKOS package, the option defaults are neigh = full and comm = host. These settings are made automatically by the required "-k on" command-line switch. You can change them by using the package kokkos command in your input script or via the "-pk kokkos" command-line switch.
For the USER-OMP package, the default is Nthreads = 0 and the option default is neigh = yes. These settings are made automatically if the "-sf omp" command-line switch is used. If it is not used, you must invoke the package omp command in your input script or via the "-pk omp" command-line switch.