This section describes various methods for improving LAMMPS performance for different classes of problems running on different kinds of machines.
5.1 Measuring performance

The Benchmark page of the LAMMPS web site gives performance results for the various accelerator packages discussed in this section for several of the standard LAMMPS benchmarks, as a function of problem size and number of compute nodes, on different hardware platforms.
Before trying to make your simulation run faster, you should understand how it currently performs and where the bottlenecks are.
The best way to do this is to run your system (actual number of atoms) for a modest number of timesteps (say 100 steps) on several different processor counts, including a single processor if possible. Do this for an equilibrated version of your system, so that the 100-step timings are representative of a much longer run. There is typically no need to run for thousands of timesteps to get accurate timings; you can simply extrapolate from short runs.
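For example, a set of such short benchmark runs might look like the following (the executable and input script names are placeholders; the input script would include a "run 100" command):

lmp_machine -in in.myscript                  # serial run on 1 processor
mpirun -np 4 lmp_machine -in in.myscript     # 4 processors
mpirun -np 16 lmp_machine -in in.myscript    # 16 processors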
For the set of runs, look at the timing data printed to the screen and log file at the end of each LAMMPS run. This section of the manual has an overview.
Running on one (or a few) processors should give a good estimate of the serial performance and of which portions of the timestep are taking the most time. Running the same problem on a few different processor counts should give an estimate of parallel scalability, i.e. if the simulation runs 16x faster on 16 processors, it is 100% parallel efficient; if it runs 8x faster on 16 processors, it is 50% efficient.
The most important data to look at in the timing info is the timing breakdown and relative percentages. For example, trying different options for speeding up the long-range solvers will have little impact if they only consume 10% of the run time. If the pairwise time is dominating, you may want to look at GPU or OMP versions of the pair style, as discussed below. Comparing how the percentages change as you increase the processor count gives you a sense of how different operations within the timestep are scaling. Note that if you are running with a Kspace solver, there is additional output on the breakdown of the Kspace time. For PPPM, this includes the fraction spent on FFTs, which can be communication intensive.
Another important detail in the timing info is the histograms of atom counts and neighbor counts. If these vary widely across processors, you have a load-imbalance issue. This often results in inaccurate relative timing data, because processors have to wait when communication occurs for other processors to catch up. Thus the reported times for "Communication" or "Other" may be higher than they really are, due to load imbalance. If this is an issue, you can uncomment the MPI_Barrier() lines in src/timer.cpp, and recompile LAMMPS, to obtain synchronized timings.
NOTE: this section 5.2 is still a work in progress
Here is a list of general ideas for improving simulation performance. Most of them are only applicable to certain models and certain bottlenecks in the current performance, so let the timing data you generate be your guide. It is hard, if not impossible, to predict how much difference these options will make, since it is a function of problem size, number of processors used, and your machine. There is no substitute for identifying performance bottlenecks, and trying out various options.
2-FFT PPPM, also called analytic differentiation or ad PPPM, uses 2 FFTs instead of the 4 FFTs used by the default ik differentiation PPPM. However, 2-FFT PPPM also requires a slightly larger mesh size to achieve the same accuracy as 4-FFT PPPM. For problems where the FFT cost is the performance bottleneck (typically large problems running on many processors), 2-FFT PPPM may be faster than 4-FFT PPPM.
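For example, assuming the input script already uses PPPM for long-range Coulombics, 2-FFT PPPM can be selected via the kspace_modify command; a minimal sketch:

kspace_style pppm 1.0e-4
kspace_modify diff ad     # analytic differentiation (2-FFT) PPPM; "diff ik" is the default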
Staggered PPPM performs calculations using two different meshes, one shifted slightly with respect to the other. This can reduce force aliasing errors and increase the accuracy of the method, but also doubles the amount of work required. For high relative accuracy, using staggered PPPM allows one to halve the mesh size in each dimension as compared to regular PPPM, which can give around a 4x speedup in the kspace time. However, for low relative accuracy, using staggered PPPM gives little benefit and can be up to 2x slower in the kspace time. For example, the rhodopsin benchmark was run on a single processor, and results for kspace time vs. relative accuracy for the different methods are shown in the figure below. For this system, staggered PPPM (using ik differentiation) becomes useful when using a relative accuracy of slightly greater than 1e-5 and above.
IMPORTANT NOTE: Using staggered PPPM may not give the same increase in accuracy of energy and pressure as it does in forces, so some caution must be used if energy and/or pressure are quantities of interest, such as when using a barostat.
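Staggered PPPM is likewise selected via the kspace_modify command; a minimal sketch:

kspace_style pppm 1.0e-5
kspace_modify stagger yes     # turn on staggered PPPM (default is no)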
Accelerated versions of various pair styles, fixes, computes, and other commands have been added to LAMMPS; these will typically run faster than the standard non-accelerated versions. Some require appropriate hardware to be present on your system, e.g. GPUs or Intel Xeon Phi chips.
All of these commands are in packages provided with LAMMPS, as explained here. Currently, there are 6 such accelerator packages in LAMMPS, either as standard or user packages:
| USER-CUDA | for NVIDIA GPUs |
| GPU | for NVIDIA GPUs as well as OpenCL support |
| USER-INTEL | for Intel CPUs and Intel Xeon Phi |
| KOKKOS | for GPUs, Intel Xeon Phi, and OpenMP threading |
| USER-OMP | for OpenMP threading |
| OPT | generic CPU optimizations |
Any accelerated style has the same name as the corresponding standard style, except that a suffix is appended. Otherwise, the syntax for the command that specifies the style is identical, the functionality is the same, and the numerical results it produces should also be the same, except for precision and round-off effects.
For example, all of these styles are accelerated variants of the basic Lennard-Jones pair_style lj/cut:

pair_style lj/cut/cuda
pair_style lj/cut/gpu
pair_style lj/cut/intel
pair_style lj/cut/kk
pair_style lj/cut/omp
pair_style lj/cut/opt
Assuming LAMMPS was built with the appropriate package, a simulation using accelerated styles from the package can be run without modifying your input script, by specifying command-line switches. The details of how to do this vary from package to package and are explained below. There is also a suffix command and a package command that accomplish the same thing and can be used within an input script if preferred. The suffix command allows more precise control of whether an accelerated or unaccelerated version of a style is used at various points within an input script.
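As an illustration (using the "omp" suffix here; the other package suffixes work the same way), either of the following selects accelerated styles without editing the style commands themselves:

lmp_machine -sf omp -in in.script     # via the command-line switch

suffix omp                            # or via the suffix command in the input script
pair_style lj/cut 2.5                 # resolves to lj/cut/omp if available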
To see what styles are currently available in each of the accelerator packages, see Section_commands 5 of the manual. The doc page for an individual command (e.g. pair_style lj/cut or fix nve) also lists any accelerated variants available for that style.
The examples directory has several sub-directories with scripts and README files for using the accelerator packages:
Likewise, the bench directory has FERMI and KEPLER sub-directories with scripts and README files for using all the accelerator packages.
Here is a brief summary of what the various packages provide. Details are in individual sections below.
The following sections explain:
The final section compares and contrasts the USER-CUDA, GPU, and KOKKOS packages, since they all enable use of NVIDIA GPUs.
The OPT package was developed by James Fischer (High Performance Technologies), David Richie, and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.
Here is a quick overview of how to use the OPT package: include the OPT package when building LAMMPS, then use OPT pair styles in your input script.
The last step can be done using the "-sf opt" command-line switch. Or the effect of the "-sf" switch can be duplicated by adding a suffix opt command to your input script.
Required hardware/software:
None.
Building LAMMPS with the OPT package:
Include the package and build LAMMPS:
cd lammps/src
make yes-opt
make machine
No additional compile/link flags are needed in your Makefile.machine in src/MAKE.
Run with the OPT package from the command line:
Use the "-sf opt" command-line switch, which will automatically append "opt" to styles that support it.
lmp_machine -sf opt -in in.script
mpirun -np 4 lmp_machine -sf opt -in in.script
Or run with the OPT package by editing an input script:
Use the suffix opt command, or you can explicitly add an "opt" suffix to individual styles in your input script, e.g.
pair_style lj/cut/opt 2.5
Speed-ups to expect:
You should see a reduction in the "Pair time" value printed at the end of a run. On most machines for reasonable problem sizes, it will be a 5 to 20% savings.
Guidelines for best performance:
None. Just try out an OPT pair style to see how it performs.
Restrictions:
None.
The USER-OMP package was developed by Axel Kohlmeyer at Temple University. It provides multi-threaded versions of most pair styles, nearly all bonded styles (bond, angle, dihedral, improper), several Kspace styles, and a few fix styles. The package currently uses the OpenMP interface for multi-threading.
Here is a quick overview of how to use the USER-OMP package: include the USER-OMP package when building LAMMPS, specify how many OpenMP threads per MPI task to use, and use USER-OMP styles in your input script.
The latter two steps can be done using the "-pk omp" and "-sf omp" command-line switches respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the package omp or suffix omp commands respectively to your input script.
Required hardware/software:
Your compiler must support the OpenMP interface. You should have one or more multi-core CPUs so that multiple threads can be launched by an MPI task running on a CPU.
Building LAMMPS with the USER-OMP package:
Include the package and build LAMMPS:
cd lammps/src
make yes-user-omp
make machine
Your src/MAKE/Makefile.machine needs a flag for OpenMP support in both the CCFLAGS and LINKFLAGS variables. For GNU and Intel compilers, this flag is "-fopenmp". Without this flag the USER-OMP styles will still be compiled and work, but will not support multi-threading.
Run with the USER-OMP package from the command line:
The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command does this via its -np and -ppn switches.
You need to choose how many threads per MPI task will be used by the USER-OMP package. Note that the product of MPI tasks * threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.
Use the "-sf omp" command-line switch, which will automatically append "omp" to styles that support it. Use the "-pk omp Nt" command-line switch, to set Nt = # of OpenMP threads per MPI task to use.
lmp_machine -sf omp -pk omp 16 -in in.script                      # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script          # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 -ppn 4 lmp_machine -sf omp -pk omp 4 -in in.script  # ditto on 8 16-core nodes
Note that if the "-sf omp" switch is used, it also issues a default package omp 0 command, which sets the number of threads per MPI task via the OMP_NUM_THREADS environment variable.
Using the "-pk" switch explicitly allows for direct setting of the number of threads and additional options. Its syntax is the same as the "package omp" command. See the package command doc page for details, including the default values used for all its options if it is not specified, and how to set the number of threads via the OMP_NUM_THREADS environment variable if desired.
Or run with the USER-OMP package by editing an input script:
The discussion above for the mpirun/mpiexec command, MPI tasks/node, and threads/MPI task is the same.
Use the suffix omp command, or you can explicitly add an "omp" suffix to individual styles in your input script, e.g.
pair_style lj/cut/omp 2.5
You must also use the package omp command to enable the USER-OMP package, unless the "-sf omp" or "-pk omp" command-line switches were used. It specifies how many threads per MPI task to use, as well as other options. Its doc page explains how to set the number of threads via an environment variable if desired.
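A minimal input-script sketch that enables the package explicitly:

package omp 4             # 4 OpenMP threads per MPI task
suffix omp
pair_style lj/cut 2.5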
Speed-ups to expect:
Depending on which styles are accelerated, you should look for a reduction in the "Pair time", "Bond time", "KSpace time", and "Loop time" values printed at the end of a run.
You may see a small performance advantage (5 to 20%) when running a USER-OMP style (in serial or parallel) with a single thread per MPI task, versus running standard LAMMPS with its standard (un-accelerated) styles (in serial or all-MPI parallelization with 1 task/core). This is because many of the USER-OMP styles contain similar optimizations to those used in the OPT package, as described above.
With multiple threads/task, the optimal choice of MPI tasks/node and OpenMP threads/task can vary a lot and should always be tested via benchmark runs for a specific simulation running on a specific machine, paying attention to guidelines discussed in the next sub-section.
A description of the multi-threading strategy used in the USER-OMP package and some performance examples are presented here.
Guidelines for best performance:
For many problems on current generation CPUs, running the USER-OMP package with a single thread/task is faster than running with multiple threads/task. This is because the MPI parallelization in LAMMPS is often more efficient than multi-threading as implemented in the USER-OMP package. The parallel efficiency (in a threaded sense) also varies for different USER-OMP styles.
Using multiple threads/task can be more effective under the following circumstances:
Additional performance tips are as follows:
Restrictions:
None.
The GPU package was developed by Mike Brown at ORNL and his collaborators, particularly Trung Nguyen (ORNL). It provides GPU versions of many pair styles, including the 3-body Stillinger-Weber pair style, and of kspace_style pppm for long-range Coulombics. It has the following general features:
Here is a quick overview of how to use the GPU package: build the GPU library and include the GPU package when building LAMMPS, specify how many GPUs per node to use, and use GPU styles in your input script.
The latter two steps can be done using the "-pk gpu" and "-sf gpu" command-line switches respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the package gpu or suffix gpu commands respectively to your input script.
Required hardware/software:
To use this package, you currently need to have an NVIDIA GPU and install the NVIDIA Cuda software on your system:
Building LAMMPS with the GPU package:
This requires two steps (a,b): build the GPU library, then build LAMMPS with the GPU package.
(a) Build the GPU library
The GPU library is in lammps/lib/gpu. Select a Makefile.machine (in lib/gpu) appropriate for your system. You should pay special attention to 3 settings in this makefile.
See lib/gpu/Makefile.linux.double for examples of the ARCH settings for different GPU choices, e.g. Fermi vs Kepler. It also lists the possible precision settings:
CUDA_PREC = -D_SINGLE_SINGLE  # single precision for all calculations
CUDA_PREC = -D_DOUBLE_DOUBLE  # double precision for all calculations
CUDA_PREC = -D_SINGLE_DOUBLE  # accumulation of forces, etc, in double
The last setting is the mixed mode referred to above. Note that your GPU must support double precision to use either the 2nd or 3rd of these settings.
To build the library, type:
make -f Makefile.machine
If successful, it will produce the files libgpu.a and Makefile.lammps.
The latter file has 3 settings that need to be appropriate for the paths and settings for the CUDA system software on your machine. Makefile.lammps is a copy of the file specified by the EXTRAMAKE setting in Makefile.machine. You can change EXTRAMAKE or create your own Makefile.lammps.machine if needed.
Note that to change the precision of the GPU library, you need to re-build the entire library. Do a "clean" first, e.g. "make -f Makefile.linux clean", followed by the make command above.
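For example, after editing the CUDA_PREC setting in Makefile.linux:

cd lammps/lib/gpu
make -f Makefile.linux clean
make -f Makefile.linux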
(b) Build LAMMPS with the GPU package
cd lammps/src
make yes-gpu
make machine
No additional compile/link flags are needed in your Makefile.machine in src/MAKE.
Note that if you change the GPU library precision (discussed above) and rebuild the GPU library, then you also need to re-install the GPU package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new GPU library.
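One way to force this (a sketch; "machine" stands for your usual build target):

cd lammps/src
make no-gpu       # uninstall the GPU package files
make yes-gpu      # re-install them
make machine      # re-build LAMMPS against the new GPU library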
Run with the GPU package from the command line:
The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command does this via its -np and -ppn switches.
When using the GPU package, you cannot assign more than one GPU to a single MPI task. However, multiple MPI tasks can share the same GPU, and in many cases it will be more efficient to run this way. Likewise it may be more efficient to use fewer MPI tasks/node than the available # of CPU cores. Assignment of multiple MPI tasks to a GPU will happen automatically if you create more MPI tasks/node than there are GPUs/node. E.g. with 8 MPI tasks/node and 2 GPUs, each GPU will be shared by 4 MPI tasks.
Use the "-sf gpu" command-line switch, which will automatically append "gpu" to styles that support it. Use the "-pk gpu Ng" command-line switch to set Ng = # of GPUs/node to use.
lmp_machine -sf gpu -pk gpu 1 -in in.script                       # 1 MPI task uses 1 GPU
mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script         # 12 MPI tasks share 2 GPUs on a single 16-core (or whatever) node
mpirun -np 48 -ppn 12 lmp_machine -sf gpu -pk gpu 2 -in in.script # ditto on 4 16-core nodes
Note that if the "-sf gpu" switch is used, it also issues a default package gpu 1 command, which sets the number of GPUs/node to use to 1.
Using the "-pk" switch explicitly allows for direct setting of the number of GPUs/node to use and additional options. Its syntax is the same as same as the "package gpu" command. See the package command doc page for details, including the default values used for all its options if it is not specified.
Or run with the GPU package by editing an input script:
The discussion above for the mpirun/mpiexec command, MPI tasks/node, and use of multiple MPI tasks/GPU is the same.
Use the suffix gpu command, or you can explicitly add a "gpu" suffix to individual styles in your input script, e.g.
pair_style lj/cut/gpu 2.5
You must also use the package gpu command to enable the GPU package, unless the "-sf gpu" or "-pk gpu" command-line switches were used. It specifies the number of GPUs/node to use, as well as other options.
IMPORTANT NOTE: The input script must also use a newton pairwise setting of off in order to use GPU package pair styles. This can be set via the package gpu or newton commands.
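A minimal input-script sketch that satisfies these requirements (assuming 1 GPU per node):

newton off            # GPU pair styles require pairwise newton off
package gpu 1         # use 1 GPU per node
suffix gpu
pair_style lj/cut 2.5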
Speed-ups to expect:
The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed).
See the Benchmark page of the LAMMPS web site for performance of the GPU package on various hardware, including the Titan HPC platform at ORNL.
You should also experiment with how many MPI tasks per GPU to use to give the best performance for your problem and machine. This is also a function of the problem size and the pair style being used. Likewise, you should experiment with the precision setting for the GPU library to see if single or mixed precision will give accurate results, since they will typically be faster.
Guidelines for best performance:
Restrictions:
None.
The USER-CUDA package was developed by Christian Trott (Sandia) while at U Technology Ilmenau in Germany. It provides NVIDIA GPU versions of many pair styles, many fixes, a few computes, and of long-range Coulombics via the PPPM command. It has the following general features:
Here is a quick overview of how to use the USER-CUDA package: build the USER-CUDA library and include the USER-CUDA package when building LAMMPS, specify how many GPUs per node to use, and use USER-CUDA styles in your input script.
The latter two steps can be done using the "-pk cuda" and "-sf cuda" command-line switches respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the package cuda or suffix cuda commands respectively to your input script.
Required hardware/software:
To use this package, you need to have one or more NVIDIA GPUs and install the NVIDIA Cuda software on your system:
Your NVIDIA GPU needs to support Compute Capability 1.3. This list may help you to find out the Compute Capability of your card:
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
Install the Nvidia Cuda Toolkit (version 3.2 or higher) and the corresponding GPU drivers. The Nvidia Cuda SDK is not required, but we recommend it also be installed. You can then make sure its sample projects can be compiled without problems.
Building LAMMPS with the USER-CUDA package:
This requires two steps (a,b): build the USER-CUDA library, then build LAMMPS with the USER-CUDA package.
(a) Build the USER-CUDA library
The USER-CUDA library is in lammps/lib/cuda. If your CUDA toolkit is not installed in the default system directory /usr/local/cuda, edit the file lib/cuda/Makefile.common accordingly.
To set options for the library build, type "make OPTIONS", where OPTIONS are one or more of the following. The settings will be written to the lib/cuda/Makefile.defaults and used when the library is built.
precision=N to set the precision level
  N = 1 for single precision (default)
  N = 2 for double precision
  N = 3 for positions in double precision
  N = 4 for positions and velocities in double precision
arch=M to set the GPU compute capability
  M = 35 for Kepler GPUs
  M = 20 for CC2.0 (GF100/110, e.g. C2050, GTX580, GTX470) (default)
  M = 21 for CC2.1 (GF104/114, e.g. GTX560, GTX460, GTX450)
  M = 13 for CC1.3 (GT200, e.g. C1060, GTX285)
prec_timer=0/1 to use hi-precision timers
  0 = do not use them (default)
  1 = use them
  (usually only useful on Mac machines)
dbg=0/1 to activate debug mode
  0 = no debug mode (default)
  1 = debug mode
  (only useful for developers)
cufft=1 for use of the CUDA FFT library
  0 = no CUFFT support (default)
  (in the future other CUDA-enabled FFT libraries might be supported)
To build the library, simply type:
make
If successful, it will produce the files libcuda.a and Makefile.lammps.
Note that if you change any of the options (like precision), you need to re-build the entire library. Do a "make clean" first, followed by "make".
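For example, to switch the library to double precision for a CC2.0 GPU (option values taken from the list above):

cd lammps/lib/cuda
make clean
make precision=2 arch=20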
(b) Build LAMMPS with the USER-CUDA package
cd lammps/src
make yes-user-cuda
make machine
No additional compile/link flags are needed in your Makefile.machine in src/MAKE.
Note that if you change the USER-CUDA library precision (discussed above) and rebuild the USER-CUDA library, then you also need to re-install the USER-CUDA package and re-build LAMMPS, so that all affected files are re-compiled and linked to the new USER-CUDA library.
Run with the USER-CUDA package from the command line:
The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command does this via its -np and -ppn switches.
When using the USER-CUDA package, you must use exactly one MPI task per physical GPU.
You must use the "-c on" command-line switch to enable the USER-CUDA package.
Use the "-sf cuda" command-line switch, which will automatically append "cuda" to styles that support it. Use the "-pk cuda Ng" command-line switch to set Ng = # of GPUs per node.
lmp_machine -c on -sf cuda -pk cuda 1 -in in.script                      # 1 MPI task uses 1 GPU
mpirun -np 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script         # 2 MPI tasks use 2 GPUs on a single 16-core (or whatever) node
mpirun -np 24 -ppn 2 lmp_machine -c on -sf cuda -pk cuda 2 -in in.script # ditto on 12 16-core nodes
The "-pk" switch must be used (unless the package cuda command is used in the input script) to set the number of GPUs/node to use. It also allows for setting of additional options. Its syntax is the same as same as the "package cuda" command. See the package command doc page for details.
Or run with the USER-CUDA package by editing an input script:
The discussion above for the mpirun/mpiexec command and the requirement of one MPI task per GPU is the same.
You must still use the "-c on" command-line switch to enable the USER-CUDA package.
Use the suffix cuda command, or you can explicitly add a "cuda" suffix to individual styles in your input script, e.g.
pair_style lj/cut/cuda 2.5
You must use the package cuda command to set the number of GPUs/node, unless the "-pk" command-line switch was used. The command also allows for setting of additional options.
Speed-ups to expect:
The performance of a GPU versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/GPU, and the precision used on the GPU (double, single, mixed).
See the Benchmark page of the LAMMPS web site for performance of the USER-CUDA package on different hardware.
Guidelines for best performance:
Restrictions:
None.
The KOKKOS package was developed primarily by Christian Trott (Sandia) with contributions of various styles by others, including Sikandar Mashayak (UIUC). The underlying Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia).
The KOKKOS package contains versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos.
The Kokkos library is part of Trilinos and is a templated C++ library that provides two key abstractions for an application like LAMMPS. First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on different kinds of hardware, such as a GPU, Intel Phi, or many-core chip.
The Kokkos library also provides data abstractions to adjust (at compile time) the memory layout of basic data structures like 2d and 3d arrays and allow the transparent utilization of special hardware load and store operations. Such data structures are used in LAMMPS to store atom coordinates or forces or neighbor lists. The layout is chosen to optimize performance on different platforms. Again this functionality is hidden from the developer, and does not affect how the kernel is coded.
These abstractions are set at build time, when LAMMPS is compiled with the KOKKOS package installed. This is done by selecting a "host" and "device" to build for, compatible with the compute nodes in your machine (one on a desktop machine or 1000s on a supercomputer).
All Kokkos operations occur within the context of an individual MPI task running on a single node of the machine. The total number of MPI tasks used by LAMMPS (one or multiple per compute node) is set in the usual manner via the mpirun or mpiexec commands, and is independent of Kokkos.
Kokkos provides support for two different modes of execution per MPI task. This means that computational tasks (pairwise interactions, neighbor list builds, time integration, etc) can be parallelized for one or the other of the two modes. The first mode is called the "host" and is one or more threads running on one or more physical CPUs (within the node). Currently, both multi-core CPUs and an Intel Phi processor (running in native mode, not offload mode like the USER-INTEL package) are supported. The second mode is called the "device" and is an accelerator chip of some kind. Currently only an NVIDIA GPU is supported. If your compute node does not have a GPU, then there is only one mode of execution, i.e. the host and device are the same.
Here is a quick overview of how to use the KOKKOS package for GPU acceleration: build LAMMPS with the KOKKOS package for your target hardware, enable the package and set its hardware options at run time, and use KOKKOS styles in your input script.
The latter two steps can be done using the "-k on", "-pk kokkos" and "-sf kk" command-line switches respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the package kokkos or suffix kk commands respectively to your input script.
Required hardware/software:
The KOKKOS package can be used to build and run LAMMPS on the following kinds of hardware:
Note that Intel Xeon Phi coprocessors are supported in "native" mode, not "offload" mode like the USER-INTEL package supports.
Only NVIDIA GPUs are currently supported.
IMPORTANT NOTE: For good performance of the KOKKOS package on GPUs, you must have Kepler generation GPUs (or later). The Kokkos library exploits texture cache options not supported by Tesla generation GPUs (or older).
To build the KOKKOS package for GPUs, NVIDIA Cuda software must be installed on your system. See the discussion above for the USER-CUDA and GPU packages for details of how to check and do this.
Building LAMMPS with the KOKKOS package:
Unlike other acceleration packages discussed in this section, the Kokkos library in lib/kokkos does not have to be pre-built before building LAMMPS itself. Instead, options for the Kokkos library are specified at compile time, when LAMMPS itself is built. This can be done in one of two ways, as discussed below.
Here are examples of how to build LAMMPS for the different compute-node configurations listed above.
CPU-only (run all-MPI or with OpenMP threading):
cd lammps/src
make yes-kokkos
make g++ OMP=yes
Intel Xeon Phi:
cd lammps/src
make yes-kokkos
make g++ OMP=yes MIC=yes
CPUs and GPUs:
cd lammps/src
make yes-kokkos
make cuda CUDA=yes
These examples set the KOKKOS-specific OMP, MIC, and CUDA variables on the make command line, which requires a GNU-compatible make command. Try "gmake" if your system's standard make complains.
IMPORTANT NOTE: If you build using make line variables and re-build LAMMPS twice with different KOKKOS options and the *same* target, e.g. g++ in the first two examples above, then you *must* perform a "make clean-all" or "make clean-machine" before each build. This is to force all the KOKKOS-dependent files to be re-compiled with the new options.
You can also hardwire these make variables in the specified machine makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above, with a line like:
MIC = yes
Note that if you build LAMMPS multiple times in this manner, using different KOKKOS options (defined in different machine makefiles), you do not have to worry about doing a "clean" in between. This is because the targets will be different.
IMPORTANT NOTE: The 3rd example above, for a GPU, uses a different machine makefile, in this case src/MAKE/Makefile.cuda, which is included in the LAMMPS distribution. To build the KOKKOS package for a GPU, this makefile must use the NVIDIA "nvcc" compiler. And it must have a CCFLAGS -arch setting that is appropriate for your NVIDIA hardware and installed software. Typical values for -arch are given in Section 2.3.4 of the manual, as well as other settings that must be included in the machine makefile, if you create your own.
There are other allowed options when building with the KOKKOS package. As above, they can be set either as variables on the make command line or in the machine makefile in the src/MAKE directory. See Section 2.3.4 of the manual for details.
IMPORTANT NOTE: Currently, there are no precision options with the KOKKOS package. All compilation and computation is performed in double precision.
Run with the KOKKOS package from the command line:
The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command does this via its -np and -ppn switches.
When using KOKKOS built with host=OMP, you need to choose how many OpenMP threads per MPI task will be used (via the "-k" command-line switch discussed below). Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.
When using the KOKKOS package built with device=CUDA, you must use exactly one MPI task per physical GPU.
When using the KOKKOS package built with host=MIC for Intel Xeon Phi coprocessor support, you need to insure there are one or more MPI tasks per coprocessor, and choose the number of coprocessor threads to use per MPI task (via the "-k" command-line switch discussed below). The product of MPI tasks * coprocessor threads/task should not exceed the maximum number of threads the coprocessor is designed to run, otherwise performance will suffer. This value is 240 for current generation Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. Note that with the KOKKOS package you do not need to specify how many Phi coprocessors there are per node; each coprocessor is simply treated as running some number of MPI tasks.
You must use the "-k on" command-line switch to enable the KOKKOS package. It takes additional arguments for hardware settings appropriate to your system. Those arguments are documented here. The two most commonly used arguments are:
-k on t Nt
-k on g Ng
The "t Nt" option applies to host=OMP (even if device=CUDA) and host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI task to use with a node. For host=MIC, it specifies how many Xeon Phi threads per MPI task to use within a node. The default is Nt = 1. Note that for host=OMP this is effectively MPI-only mode which may be fine. But for host=MIC you will typically end up using far less than all the 240 available threads, which could give very poor performance.
The "g Ng" option applies to device=CUDA. It specifies how many GPUs per compute node to use. The default is 1, so this only needs to be specified is you have 2 or more GPUs per compute node.
The "-k on" switch also issues a default package kokkos neigh full comm host command which sets various KOKKOS options to default values, as discussed on the package command doc page.
Use the "-sf kk" command-line switch, which will automatically append "kk" to styles that support it. Use the "-pk kokkos" command-line switch if you wish to override any of the default values set by the package kokkos command invoked by the "-k on" switch.
host=OMP, dual hex-core nodes (12 threads/node):
mpirun -np 12 lmp_g++ -in in.lj                           # MPI-only mode with no Kokkos
mpirun -np 12 lmp_g++ -k on -sf kk -in in.lj              # MPI-only mode with Kokkos
mpirun -np 1 lmp_g++ -k on t 12 -sf kk -in in.lj          # one MPI task, 12 threads
mpirun -np 2 lmp_g++ -k on t 6 -sf kk -in in.lj           # two MPI tasks, 6 threads/task
mpirun -np 32 -ppn 2 lmp_g++ -k on t 6 -sf kk -in in.lj   # ditto on 16 nodes
host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis
host=OMP, device=CUDA, node = dual hex-core CPUs and a single GPU:
mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj            # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj     # ditto on 4 nodes
host=OMP, device=CUDA, node = dual 8-core CPUs and 2 GPUs:
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj          # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj  # ditto on 16 nodes
Or run with the KOKKOS package by editing an input script:
The discussion above for the mpirun/mpiexec command and setting appropriate thread and GPU values for host=OMP or host=MIC or device=CUDA are the same.
You must still use the "-k on" command-line switch to enable the KOKKOS package, and specify its additional arguments for hardware options appropriate to your system, as documented above.
Use the suffix kk command, or you can explicitly add a "kk" suffix to individual styles in your input script, e.g.
pair_style lj/cut/kk 2.5
You only need to use the package kokkos command if you wish to change any of its option defaults.
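A minimal input-script sketch, using the same option values the "-k on" switch applies by default:

package kokkos neigh full comm host
suffix kk
pair_style lj/cut 2.5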
Speed-ups to expect:
The performance of KOKKOS running in different modes is a function of your hardware, which KOKKOS-enabled styles are used, and the problem size.
Generally speaking, the following rules of thumb apply:
See the Benchmark page of the LAMMPS web site for performance of the KOKKOS package on different hardware.
Guidelines for best performance:
Here are guidelines for using the KOKKOS package on the different hardware configurations listed above.

Many of the guidelines use the package kokkos command. See its doc page for details and default settings. Experimenting with its options can provide a speed-up for specific calculations.
Running on a multi-core CPU:
If N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N, and should typically equal N. Note that the default threads/task is 1, as set by the "t" keyword of the "-k" command-line switch. If you do not change this, no additional parallelism (beyond MPI) will be invoked on the host CPU(s).
You can compare the performance running in different modes:
Examples of mpirun commands in these modes are shown above.
When using KOKKOS to perform multi-threading, it is important for performance to bind both MPI tasks to physical cores, and threads to physical cores, so they do not migrate during a simulation.
If you are not certain MPI tasks are being bound (check the defaults for your MPI installation), binding can be forced with these flags:
OpenMPI 1.8:
mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0:
mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ...
For binding threads with the KOKKOS OMP option, use thread affinity environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, Intel 12 or later) setting the environment variable OMP_PROC_BIND=true should be sufficient. For binding threads with the KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option, as discussed in Section 2.3.4 of the manual.
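For example, with OpenMPI 1.8 and an OpenMP 3.1 capable compiler, binding both MPI tasks and threads might look like this (a sketch assuming a bash-like shell and the executable name used above):

export OMP_PROC_BIND=true
mpirun -np 2 -bind-to socket -map-by socket lmp_g++ -k on t 6 -sf kk -in in.lj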
Running on GPUs:
Insure the -arch setting in the machine makefile you are using, e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software (see this section of the manual for details).
The -np setting of the mpirun command should set the number of MPI tasks/node to be equal to the # of physical GPUs on the node.
Use the "-k" command-line switch to specify the number of GPUs per node, and the number of threads per MPI task. As above for multi-core CPUs (and no GPU), if N is the number of physical cores/node, then the number of MPI tasks/node * number of threads/task should not exceed N. With one GPU (and one MPI task) it may be faster to use less than all the available cores, by setting threads/task to a smaller value. This is because using all the cores on a dual-socket node will incur extra cost to copy memory from the 2nd socket to the GPU.
Examples of mpirun commands that follow these rules are shown above.
IMPORTANT NOTE: When using a GPU, you will achieve the best performance if your input script does not use any fix or compute styles which are not yet Kokkos-enabled. This allows data to stay on the GPU for multiple timesteps, without being copied back to the host CPU. Invoking a non-Kokkos fix or compute, or performing I/O for thermo or dump output will cause data to be copied back to the CPU.
You cannot yet assign multiple MPI tasks to the same GPU with the KOKKOS package. We plan to support this in the future, similar to the GPU package in LAMMPS.
You cannot yet use both the host (multi-threaded) and device (GPU) together to compute pairwise interactions with the KOKKOS package. We hope to support this in the future, similar to the GPU package in LAMMPS.
Running on an Intel Phi:
Kokkos only uses Intel Phi processors in their "native" mode, i.e. not hosted by a CPU.
As illustrated above, build LAMMPS with OMP=yes (the default) and MIC=yes. The latter insures code is correctly compiled for the Intel Phi. The OMP setting means OpenMP will be used for parallelization on the Phi, which is currently the best option within Kokkos. In the future, other options may be added.
Current-generation Intel Phi chips have either 61 or 57 cores. One core should be excluded for running the OS, leaving 60 or 56 cores. Each core is hyperthreaded, so there are effectively N = 240 (4*60) or N = 224 (4*56) cores to run on.
The -np setting of the mpirun command sets the number of MPI tasks/node. The "-k on t Nt" command-line switch sets the number of threads/task as Nt. The product of these 2 values should be N, i.e. 240 or 224. Also, the number of threads/task should be a multiple of 4 so that logical threads from more than one MPI task do not run on the same physical core.
Examples of mpirun commands that follow these rules are shown above.
Restrictions:
As noted above, if using GPUs, the number of MPI tasks per compute node should equal the number of GPUs per compute node. In the future Kokkos will support assigning multiple MPI tasks to a single GPU.
Currently Kokkos does not support AMD GPUs due to limits in the available backend programming models. Specifically, Kokkos requires extensive C++ support from the Kernel language. This is expected to change in the future.
The USER-INTEL package was developed by Mike Brown at Intel Corporation. It provides a capability to accelerate simulations by offloading neighbor list and non-bonded force calculations to Intel(R) Xeon Phi(TM) coprocessors (not native mode like the KOKKOS package). Additionally, it supports running simulations in single, mixed, or double precision with vectorization, even if a coprocessor is not present, i.e. on an Intel(R) CPU. The same C++ code is used for both cases. When offloading to a coprocessor, the routine is run twice, once with an offload flag.
The USER-INTEL package can be used in tandem with the USER-OMP package. This is useful when offloading pair style computations to coprocessors, so that other styles not supported by the USER-INTEL package, e.g. bond, angle, dihedral, improper, and long-range electrostatics, can be run simultaneously in threaded mode on the CPU cores. Since fewer MPI tasks than CPU cores will typically be invoked when running with coprocessors, this enables the extra cores to be utilized for useful computation.
If LAMMPS is built with both the USER-INTEL and USER-OMP packages installed, this mode of operation is made easier to use, because the "-suffix intel" command-line switch or the suffix intel command will both set a second-choice suffix to "omp", so that styles from the USER-OMP package will be used if available, after first testing if a style from the USER-INTEL package is available.
Here is a quick overview of how to use the USER-INTEL package for CPU acceleration: include the USER-INTEL package (and optionally the USER-OMP package) when building LAMMPS, specify how many OpenMP threads per MPI task to use, and use USER-INTEL styles in your input script.
Using the USER-INTEL package to offload work to the Intel(R) Xeon Phi(TM) coprocessor is the same except for these additional steps: build LAMMPS with coprocessor (offload) support, and specify how many coprocessors per node to use.
The latter two steps in the first case and the last step in the coprocessor case can be done using the "-pk omp" and "-sf intel" and "-pk intel" command-line switches respectively. Or the effect of the "-pk" or "-sf" switches can be duplicated by adding the package omp or suffix intel or package intel commands respectively to your input script.
Required hardware/software:
To use the offload option, you must have one or more Intel(R) Xeon Phi(TM) coprocessors.
Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in vectorization or give poor performance.
Use of an Intel C++ compiler is recommended, but not required. The compiler must support the OpenMP interface.
Building LAMMPS with the USER-INTEL package:
Include the package(s) and build LAMMPS:
cd lammps/src
make yes-user-intel
make yes-user-omp (if desired)
make machine
If the USER-OMP package is also installed, you can use styles from both packages, as described below.
Your src/MAKE/Makefile.machine needs a flag for OpenMP support in both the CCFLAGS and LINKFLAGS variables, which is -openmp for Intel compilers. You also need to add -DLAMMPS_MEMALIGN=64 and -restrict to CCFLAGS.
If you are compiling on the same architecture that will be used for the runs, adding the flag -xHost to CCFLAGS will enable vectorization with the Intel(R) compiler.
In order to build with support for an Intel(R) coprocessor, the flag -offload should be added to the LINKFLAGS line and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.
Note that the machine makefiles Makefile.intel and Makefile.intel_offload are included in the src/MAKE directory with options that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not.
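Put together, the relevant lines of such a Makefile.machine might look roughly like the following (an illustrative excerpt for the Intel compiler with offload support, not a complete makefile; see Makefile.intel_offload in src/MAKE for a maintained example):

CCFLAGS =   -O3 -openmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -DLMP_INTEL_OFFLOAD
LINKFLAGS = -O3 -openmp -offload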
If using an Intel compiler, it is recommended that Intel(R) Compiler 2013 SP1 update 1 be used. Newer versions have some performance issues that are being addressed. If using Intel(R) MPI, version 5 or higher is recommended.
Running with the USER-INTEL package from the command line:
The mpirun or mpiexec command sets the total number of MPI tasks used by LAMMPS (one or multiple per compute node) and the number of MPI tasks used per node. E.g. the mpirun command does this via its -np and -ppn switches.
If LAMMPS was also built with the USER-OMP package, you need to choose how many OpenMP threads per MPI task will be used by the USER-OMP package. Note that the product of MPI tasks * OpenMP threads/task should not exceed the physical number of cores (on a node), otherwise performance will suffer.
If LAMMPS was built with coprocessor support for the USER-INTEL package, you need to specify the number of coprocessors/node and the number of threads to use on the coprocessor per MPI task. Note that coprocessor threads (which run on the coprocessor) are totally independent from OpenMP threads (which run on the CPU). The product of MPI tasks * coprocessor threads/task should not exceed the maximum number of threads the coprocessor is designed to run, otherwise performance will suffer. This value is 240 for current generation Xeon Phi(TM) chips, which is 60 physical cores * 4 threads/core. The threads/core value can be set to a smaller value if desired by an option on the package intel command, in which case the maximum number of threads is also reduced.
Use the "-sf intel" command-line switch, which will automatically append "intel" to styles that support it. If a style does not support it, a "omp" suffix is tried next. Use the "-pk omp Nt" command-line switch, to set Nt = # of OpenMP threads per MPI task to use, if LAMMPS was built with the USER-OMP package. Use the "-pk intel Nphi" command-line switch to set Nphi = # of Xeon Phi(TM) coprocessors/node, if LAMMPS was built with coprocessor support.
CPU-only without USER-OMP (but using Intel vectorization on CPU):
lmp_machine -sf intel -in in.script                  # 1 MPI task
mpirun -np 32 lmp_machine -sf intel -in in.script    # 32 MPI tasks on as many nodes as needed (e.g. 2 16-core nodes)
CPU-only with USER-OMP (and Intel vectorization on CPU):
lmp_machine -sf intel -pk intel 16 0 -in in.script               # 1 MPI task on a 16-core node
mpirun -np 4 lmp_machine -sf intel -pk intel 4 0 -in in.script   # 4 MPI tasks each with 4 threads on a single 16-core node
mpirun -np 32 lmp_machine -sf intel -pk intel 4 0 -in in.script  # ditto on 8 16-core nodes
CPUs + Xeon Phi(TM) coprocessors with USER-OMP:
lmp_machine -sf intel -pk intel 16 1 -in in.script # 1 MPI task, 240 threads on 1 coprocessor
mpirun -np 4 lmp_machine -sf intel -pk intel 4 1 tptask 60 -in in.script # 4 MPI tasks each with 4 OpenMP threads on a single 16-core node,
# each MPI task uses 60 threads on 1 coprocessor
mpirun -np 32 -ppn 4 lmp_machine -sf intel -pk intel 4 2 tptask 120 -in in.script # ditto on 8 16-core nodes for MPI tasks and OpenMP threads,
# each MPI task uses 120 threads on one of 2 coprocessors
Note that if the "-sf intel" switch is used, it also issues two default commands: package omp 0 and package intel 1 command. These set the number of OpenMP threads per MPI task via the OMP_NUM_THREADS environment variable, and the number of Xeon Phi(TM) coprocessors/node to 1. The former is ignored if LAMMPS was not built with the USER-OMP package. The latter is ignored is LAMMPS was not built with coprocessor support, except for its optional precision setting.
Using the "-pk omp" switch explicitly allows for direct setting of the number of OpenMP threads per MPI task, and additional options. Using the "-pk intel" switch explicitly allows for direct setting of the number of coprocessors/node, and additional options. The syntax for these two switches is the same as the package omp and package intel commands. See the package command doc page for details, including the default values used for all its options if these switches are not specified, and how to set the number of OpenMP threads via the OMP_NUM_THREADS environment variable if desired.
Or run with the USER-INTEL package by editing an input script:
The discussion above for the mpirun/mpiexec command, MPI tasks/node, OpenMP threads per MPI task, and coprocessor threads per MPI task is the same.
Use the suffix intel command, or you can explicitly add an "intel" suffix to individual styles in your input script, e.g.
pair_style lj/cut/intel 2.5
You must also use the package omp command to enable the USER-OMP package (assuming LAMMPS was built with USER-OMP) unless the "-sf intel" or "-pk omp" command-line switches were used. It specifies how many OpenMP threads per MPI task to use, as well as other options. Its doc page explains how to set the number of threads via an environment variable if desired.
You must also use the package intel command to enable coprocessor support within the USER-INTEL package (assuming LAMMPS was built with coprocessor support) unless the "-sf intel" or "-pk intel" command-line switches were used. It specifies how many coprocessors/node to use, as well as other coprocessor options.
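A minimal input-script sketch combining the two commands (assuming LAMMPS was built with both packages and coprocessor support, and the syntax described above of package omp Nt and package intel Nphi):

package omp 4             # 4 OpenMP threads per MPI task (USER-OMP)
package intel 1           # use 1 coprocessor per node
suffix intel
pair_style lj/cut 2.5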
Speed-ups to expect:
If LAMMPS was not built with coprocessor support when including the USER-INTEL package, then accelerated styles will run on the CPU using vectorization optimizations and the specified precision. This may give a substantial speed-up for a pair style, particularly if mixed or single precision is used.
If LAMMPS was built with coprocessor support, the pair styles will run on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The performance of a Xeon Phi versus a multi-core CPU is a function of your hardware, which pair style is used, the number of atoms/coprocessor, and the precision used on the coprocessor (double, single, mixed).
See the Benchmark page of the LAMMPS web site for performance of the USER-INTEL package on different hardware.
Guidelines for best performance on an Intel(R) Xeon Phi(TM) coprocessor:
Restrictions:
When offloading to a coprocessor, hybrid styles that require skip lists for neighbor builds cannot be offloaded. Using hybrid/overlay is allowed. Only one intel accelerated style may be used with hybrid styles. Special_bonds exclusion lists are not currently supported with offload, however, the same effect can often be accomplished by setting cutoffs for excluded atom types to 0. None of the pair styles in the USER-INTEL package currently support the "inner", "middle", "outer" options for rRESPA integration via the run_style respa command; only the "pair" option is supported.
All 3 of these packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways.
NOTE: this section still needs to be re-worked with additional KOKKOS information.
As a consequence, for a particular simulation on specific hardware, one package may be faster than the others. We give guidelines below, but the best way to determine which package is faster for your input script is to try them on your machine. See the benchmarking section below for examples where this has been done.
Guidelines for using each package optimally:
Differences between the packages:
Examples:
The LAMMPS distribution has two directories with sample input scripts for the GPU and USER-CUDA packages.
These contain input scripts for identical systems, so they can be used to benchmark the performance of both packages on your system.