git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@15251 f3b2605a-c512-4ea7-a41b-209d697bcdaa

This commit is contained in:
sjplimp
2016-07-01 23:31:19 +00:00
parent 88812c44fb
commit 6f3ac03a08
5 changed files with 438 additions and 256 deletions

View File

@ -487,7 +487,7 @@ LAMMPS. If you use LAMMPS results in your published work, please cite
this paper and include a pointer to the "LAMMPS WWW Site"_lws
(http://lammps.sandia.gov):
S. Plimpton, [Fast Parallel Algorithms for Short-Range Molecular
Dynamics], J Comp Phys, 117, 1-19 (1995).
Other papers describing specific algorithms used in LAMMPS are listed

View File

@ -11,247 +11,399 @@
5.3.2 USER-INTEL package :h4
The USER-INTEL package is maintained by Mike Brown at Intel
Corporation. It provides two methods for accelerating simulations,
depending on the hardware you have. The first is acceleration on
Intel CPUs by running in single, mixed, or double precision with
vectorization. The second is acceleration on Intel Xeon Phi
coprocessors via offloading neighbor list and non-bonded force
calculations to the Phi. The same C++ code is used in both cases.
When offloading to a coprocessor from a CPU, the same routine is run
twice, once on the CPU and once with an offload flag. This allows
LAMMPS to run on the CPU cores and coprocessor cores simultaneously.
Note that the USER-INTEL package supports use of the Phi in "offload"
mode, not "native" mode like the "KOKKOS
package"_accelerate_kokkos.html.
[Currently Available USER-INTEL Styles:]
Angle Styles: charmm, harmonic :ulb,l
Bond Styles: fene, harmonic :l
Dihedral Styles: charmm, harmonic, opls :l
Fixes: nve, npt, nvt, nvt/sllod :l
Improper Styles: cvff, harmonic :l
Pair Styles: buck/coul/cut, buck/coul/long, buck, gayberne,
charmm/coul/long, lj/cut, lj/cut/coul/long, sw, tersoff :l
K-Space Styles: pppm :l,ule
[Speed-ups to expect:]
The speedups will depend on your simulation, the hardware, which
styles are used, the number of atoms, and the floating-point
precision mode. Performance improvements are shown compared to
LAMMPS {without using other acceleration packages} as these are
under active development (and subject to performance changes). The
measurements were performed using the input files available in
the src/USER-INTEL/TEST directory. These are scalable in size; the
results given are with 512K particles (524K for Liquid Crystal).
Most of the simulations are standard LAMMPS benchmarks (indicated
by the filename extension in parentheses) with modifications to the
run length and to add a warmup run (for use with offload
benchmarks).
:c,image(JPG/user_intel.png)
Results are speedups obtained on Intel Xeon E5-2697v4 processors
(code-named Broadwell) and Intel Xeon Phi 7250 processors
(code-named Knights Landing) with "18 Jun 2016" LAMMPS built with
Intel Parallel Studio 2016 update 3. Results are with 1 MPI task
per physical core. See {src/USER-INTEL/TEST/README} for the raw
simulation rates and instructions to reproduce.
:line
[Quick Start for Experienced Users:]
LAMMPS should be built with the USER-INTEL package installed.
Simulations should be run with 1 MPI task per physical {core},
not {hardware thread}.
For Intel Xeon CPUs:
Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary. :ulb,l
If using {kspace_style pppm} in the input script, add "neigh_modify binsize 3" and "kspace_modify diff ad" to the input script for better
performance. :l
"-pk intel 0 omp 2 -sf intel" added to LAMMPS command-line :l,ule
For Intel Xeon Phi CPUs for simulations without {kspace_style
pppm} in the input script:
Edit src/MAKE/OPTIONS/Makefile.knl as necessary. :ulb,l
Runs should be performed using MCDRAM. :l
"-pk intel 0 omp 2 -sf intel" {or} "-pk intel 0 omp 4 -sf intel"
should be added to the LAMMPS command-line. Choice for best
performance will depend on the simulation. :l,ule
For Intel Xeon Phi CPUs for simulations with {kspace_style
pppm} in the input script:
Edit src/MAKE/OPTIONS/Makefile.knl as necessary. :ulb,l
Runs should be performed using MCDRAM. :l
Add "neigh_modify binsize 3" to the input script for better
performance. :l
Add "kspace_modify diff ad" to the input script for better
performance. :l
export KMP_AFFINITY=none :l
"-pk intel 0 omp 3 lrt yes -sf intel" or "-pk intel 0 omp 1 lrt yes
-sf intel" added to LAMMPS command-line. Choice for best performance
will depend on the simulation. :l,ule
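Combined, a hypothetical launch on a 68-core Xeon Phi x200 node might
look like the lines below; 3 OpenMP threads plus the LRT thread use
all 4 hardware threads per core, and all counts are illustrative:

export KMP_AFFINITY=none
mpirun -np 68 lmp_knl -sf intel -pk intel 0 omp 3 lrt yes -in in.script :pre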
For Intel Xeon Phi coprocessors (Offload):
Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary. :ulb,l
"-pk intel N omp 1" added to command-line where N is the number of
coprocessors per node. :l,ule
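For instance, with 1 coprocessor and 16 physical cores per node (an
illustrative configuration):

mpirun -np 16 lmp_intel_coprocessor -sf intel -pk intel 1 omp 1 -in in.script :pre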
:line
[Required hardware/software:]
In order to use offload to coprocessors, an Intel Xeon Phi
coprocessor and an Intel compiler are required. For this, the
recommended version of the Intel compiler is 14.0.1.106 or
versions 15.0.2.044 and higher.
Although any compiler can be used with the USER-INTEL package,
currently, vectorization directives are disabled by default when
not using Intel compilers due to lack of standard support and
observations of decreased performance. The OpenMP standard now
supports directives for vectorization and we plan to transition the
code to this standard once it is available in most compilers. We
expect this to allow improved performance and support with other
compilers.
For Intel Xeon Phi x200 series processors (code-named Knights
Landing), there are multiple configuration options for the hardware.
For best performance, we recommend that the MCDRAM is configured in
"Flat" mode and with the cluster mode set to "Quadrant" or "SNC4".
"Cache" mode can also be used, although the performance might be
slightly lower.
[Notes about Simultaneous Multithreading:]
Modern CPUs often support Simultaneous Multithreading (SMT). On
Intel processors, this is called Hyper-Threading (HT) technology.
SMT is hardware support for running multiple threads efficiently on
a single core. {Hardware threads} or {logical cores} are often used
to refer to the number of threads that are supported in hardware.
For example, a two-socket server with Intel Xeon E5-2697v4
processors is described as having 36 cores and 72 threads. This means that 36 MPI processes
or OpenMP threads can run simultaneously on separate cores, but that
up to 72 MPI processes or OpenMP threads can be running on the CPU
without costly operating system context switches.
Molecular dynamics simulations will often run faster when making use
of SMT. If a thread becomes stalled, for example because it is
waiting on data that has not yet arrived from memory, another thread
can start running so that the CPU pipeline is still being used
efficiently. Although benefits can be seen by launching an MPI task
for every hardware thread, for multinode simulations, we recommend
that OpenMP threads are used for SMT instead, either with the
USER-INTEL package, "USER-OMP package"_accelerate_omp.html, or
"KOKKOS package"_accelerate_kokkos.html. In the example above, up
to 36X speedups can be observed by using all 36 physical cores with
LAMMPS. By using all 72 hardware threads, an additional 10-30%
performance gain can be achieved.
The BIOS on many platforms allows SMT to be disabled; however, we do
not recommend this on modern processors, as there is little to no
benefit for any software package in most cases. The operating system
will report every hardware thread as a separate core allowing one to
determine the number of hardware threads available. On Linux systems,
this information can normally be obtained with:
cat /proc/cpuinfo :pre
[Building LAMMPS with the USER-INTEL package:]
The USER-INTEL package must be installed into the source directory:
make yes-user-intel :pre
Several example Makefiles for building with the Intel compiler are
included with LAMMPS in the src/MAKE/OPTIONS/ directory:
Makefile.intel_cpu_intelmpi # Intel Compiler, Intel MPI, No Offload
Makefile.knl # Intel Compiler, Intel MPI, No Offload
Makefile.intel_cpu_mpich # Intel Compiler, MPICH, No Offload
Makefile.intel_cpu_openmpi # Intel Compiler, OpenMPI, No Offload
Makefile.intel_coprocessor # Intel Compiler, Intel MPI, Offload :pre
Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that
it explicitly specifies that vectorization should be for Intel
Xeon Phi x200 processors, making it easier to cross-compile. For
users with recent installations of Intel Parallel Studio, the
process can be as simple as:
make yes-user-intel
source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh
# or psxevars.csh for C-shell
make intel_cpu_intelmpi :pre
Alternatively, the build can be accomplished with the src/Make.py
script, described in "Section 2.4"_Section_start.html#start_4 of the
manual. Type "Make.py -h" for help. For example:
Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi :pre
Note that if you build with support for a Phi coprocessor, the same
binary can be used on nodes with or without coprocessors installed.
However, if you do not have coprocessors on your system, building
without offload support will produce a smaller binary.
If you also build with the USER-OMP package, you can use styles from
both packages, as described below.
The general requirements for Makefiles with the USER-INTEL package
are as follows. "-DLAMMPS_MEMALIGN=64" is required for CCFLAGS. When
using Intel compilers, "-restrict" is required and "-qopenmp" is
highly recommended for CCFLAGS and LINKFLAGS. LIB should include
"-ltbbmalloc". For builds supporting offload, "-DLMP_INTEL_OFFLOAD"
is required for CCFLAGS and "-qoffload" is required for LINKFLAGS.
Other recommended CCFLAGS options for best performance are
"-O2 -fno-alias -ansi-alias -qoverride-limits -fp-model fast=2
-no-prec-div". The Make.py command will add all of these
automatically.
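As a minimal sketch, the relevant Makefile.machine lines might look
like the following for an Intel compiler with Intel MPI; the mpiicpc
wrapper and the exact flag set are assumptions to adapt to your
toolchain, and "-xHost" would be replaced by e.g. "-xMIC-AVX512" when
cross-compiling for Xeon Phi x200 (see the NOTE below):

CC =        mpiicpc
CCFLAGS =   -O2 -qopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -fno-alias -ansi-alias -qoverride-limits -fp-model fast=2 -no-prec-div
LINK =      mpiicpc
LINKFLAGS = -qopenmp -xHost
LIB =       -ltbbmalloc :pre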
NOTE: The vectorization and math capabilities can differ depending on
the CPU. For Intel compilers, the "-x" flag specifies the type of
processor for which to optimize. "-xHost" specifies that the compiler
should build for the processor used for compiling. For Intel Xeon Phi
x200 series processors, this option is "-xMIC-AVX512". For fourth
generation Intel Xeon (v4/Broadwell) processors, "-xCORE-AVX2" should
be used. For older Intel Xeon processors, "-xAVX" will perform best
in general for the different simulations in LAMMPS. The default
in most of the example Makefiles is to use "-xHost"; however, this
should not be used when cross-compiling.
[Running LAMMPS with the USER-INTEL package:]
Running LAMMPS with the USER-INTEL package is similar to normal use
with the exceptions that one should 1) specify that LAMMPS should use
the USER-INTEL package, 2) specify the number of OpenMP threads, and
3) optionally specify the specific LAMMPS styles that should use the
USER-INTEL package. 1) and 2) can be performed from the command-line
or by editing the input script. 3) requires editing the input script.
Advanced performance tuning options are also described below to get
the best performance.
When running on a single node (including runs using offload to a
coprocessor), best performance is normally obtained by using 1 MPI
task per physical core and additional OpenMP threads with SMT. For
Intel Xeon processors, 2 OpenMP threads should be used for SMT.
For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used
(best choice depends on the simulation). In cases where the user
specifies that LRT mode is used (described below), 1 or 3 OpenMP
threads should be used. For multi-node runs, using 1 MPI task per
physical core will often perform best; however, depending on the
machine and scale, users might get better performance by decreasing
the number of MPI tasks and using more OpenMP threads. For
performance, the product of the number of MPI tasks and OpenMP
threads should not exceed the number of available hardware threads in
almost all cases.
NOTE: Setting core affinity is often used to pin MPI tasks and OpenMP
threads to a core or group of cores so that memory access can be
uniform. Unless disabled at build time, affinity for MPI tasks and
OpenMP threads on the host (CPU) will be set by default on the host
{when using offload to a coprocessor}. In this case, it is unnecessary
to use other methods to control affinity (e.g. taskset, numactl,
I_MPI_PIN_DOMAIN, etc.). This can be disabled with the {no_affinity}
option to the "package intel"_package.html command or by disabling the
option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the
CCFLAGS line of your Makefile). Disabling this option is not
recommended, especially when running on a machine with Intel
Hyper-Threading technology disabled.
[Run with the USER-INTEL package from the command line:]
To enable USER-INTEL optimizations for all available styles used in
the input script, the "-sf intel"
"command-line switch"_Section_start.html#start_7 can be used without
any requirement for editing the input script. This switch will
automatically append "intel" to styles that support it. It also
invokes a default command: "package intel 1"_package.html. This
package command is used to set options for the USER-INTEL package.
The default package command will specify that USER-INTEL calculations
are performed in mixed precision, that the number of OpenMP threads
is specified by the OMP_NUM_THREADS environment variable, and that,
if coprocessors are present and the binary was built with offload
support, 1 coprocessor per node will be used with automatic
balancing of work between the CPU and the coprocessor.
You can specify different options for the USER-INTEL package by using
the "-pk intel Nphi" "command-line switch"_Section_start.html#start_7
with keyword/value pairs as specified in the documentation. Here,
Nphi = # of Xeon Phi coprocessors/node (ignored without offload
support). Common options to the USER-INTEL package include {omp} to
override any OMP_NUM_THREADS setting and specify the number of OpenMP
threads, {mode} to set the floating-point precision mode, and
{lrt} to enable Long-Range Thread mode as described below. See the
"package intel"_package.html command for details, including the
default values used for all its options if not specified, and how to
set the number of OpenMP threads via the OMP_NUM_THREADS environment
variable if desired.
Examples (see documentation for your MPI/Machine for differences in
launching MPI applications):
mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads
mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision :pre
[Or run with the USER-INTEL package by editing an input script:]
As an alternative to adding command-line arguments, the input script
can be edited to enable the USER-INTEL package. This requires adding
the "package intel"_package.html command to the top of the input
script. For the second example above, this would be:
package intel 0 omp 2 mode double :pre
To enable the USER-INTEL package only for individual styles, you can
add an "intel" suffix to the individual style, e.g.:
pair_style lj/cut/intel 2.5 :pre
Alternatively, the "suffix intel"_suffix.html command can be added to
the input script to enable USER-INTEL styles for the commands that
follow in the input script.
[Tuning for Performance:]
NOTE: The USER-INTEL package will perform better with modifications
to the input script when "PPPM"_kspace_style.html is used:
"kspace_modify diff ad"_kspace_modify.html and "neigh_modify binsize
3"_neigh_modify.html should be added to the input script.
Long-Range Thread (LRT) mode is an option to the "package
intel"_package.html command that can improve performance when using
"PPPM"_kspace_style.html for long-range electrostatics on processors
with SMT. It generates an extra pthread for each MPI task. The thread
is dedicated to performing some of the PPPM calculations and MPI
communications. On Intel Xeon Phi x200 series CPUs, this will likely
always improve performance, even on a single node. On Intel Xeon
processors, using this mode might result in better performance when
using multiple nodes, depending on the machine. To use this mode,
specify that the number of OpenMP threads is one less than would
normally be used for the run, and add the "lrt yes" option to the
"-pk intel" command-line switch or the "package intel"_package.html
command. For example, if a run
would normally perform best with "-pk intel 0 omp 4", instead use
"-pk intel 0 omp 3 lrt yes". When using LRT, you should set the
environment variable "KMP_AFFINITY=none". LRT mode is not supported
when using offload.
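In input-script form, the equivalent of the command-line flags above
is (thread count illustrative):

package intel 0 omp 3 lrt yes :pre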
Not all styles are supported in the USER-INTEL package. You can mix
the USER-INTEL package with styles from the "OPT"_accelerate_opt.html
package or the "USER-OMP package"_accelerate_omp.html. Of course,
this requires that these packages were installed at build time. This
can be performed automatically by using "-sf hybrid intel opt" or
"-sf hybrid intel omp" command-line options. Alternatively, the "opt"
and "omp" suffixes can be appended manually in the input script. For
the latter, the "package omp"_package.html command must be in the
input script or the "-pk omp Nt" "command-line
switch"_Section_start.html#start_7 must be used where Nt is the
number of OpenMP threads. The number of OpenMP threads should not be
set differently for the different packages. Note that the "suffix
hybrid intel omp"_suffix.html command can also be used within the
input script to automatically append the "omp" suffix to styles when
USER-INTEL styles are not available.
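For example, a run mixing the two packages might look like this
sketch, where the task and thread counts are illustrative:

mpirun -np 8 lmp_machine -sf hybrid intel omp -pk intel 0 omp 4 -pk omp 4 -in in.script :pre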
When running on many nodes, performance might be better when using
fewer OpenMP threads and more MPI tasks. This will depend on the
simulation and the machine. Using the "verlet/split"_run_style.html
run style might also give better performance for simulations with
"PPPM"_kspace_style.html electrostatics. Note that this is an
alternative to LRT mode and the two cannot be used together.
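A sketch of such a two-partition run follows; the 16:4 split is
illustrative, and the "run_style"_run_style.html doc describes the
partitioning rules:

mpirun -np 20 lmp_machine -partition 16 4 -sf intel -in in.script :pre

with the input script containing:

run_style verlet/split :pre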
Currently, when using Intel MPI with Intel Xeon Phi x200 series
CPUs, better performance might be obtained by setting the
environment variable "I_MPI_SHM_LMT=shm" for Linux kernels that do
not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
series processors will always perform better using MCDRAM. Please
consult your system documentation for the best approach to specify
that MPI runs are performed in MCDRAM.
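For example, on a hypothetical flat-mode system where the MCDRAM is
exposed as NUMA node 1 (verify with "numactl -H"; this is
system-dependent), the following might be used:

export I_MPI_SHM_LMT=shm
mpirun -np 68 numactl --preferred=1 lmp_knl -sf intel -in in.script :pre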
[Tuning for Offload Performance:]
The default settings for offload should give good performance.
When using LAMMPS with offload to Intel coprocessors, best performance
will typically be achieved with concurrent calculations performed on
both the CPU and the coprocessor. This is achieved by offloading only
a fraction of the neighbor and pair computations to the coprocessor or
using "hybrid"_pair_hybrid.html pair styles where only one style uses
the "intel" suffix. For simulations with long-range electrostatics or
bond, angle, dihedral, improper calculations, computation and data
transfer to the coprocessor will run concurrently with computations
and MPI communications for these calculations on the host CPU. This
is illustrated in the figure below for the rhodopsin protein benchmark
running on E5-2697v2 processors with an Intel Xeon Phi 7120p
coprocessor. In this plot, the vertical axis is time, and routines
running at the same time are running concurrently on both the host and
the coprocessor.
:c,image(JPG/offload_knc.png)
The fraction of the offloaded work is controlled by the {balance}
keyword in the "package intel"_package.html command. A balance of 0
runs all calculations on the CPU. A balance of 1 runs all
supported calculations on the coprocessor. A balance of 0.5 runs half
of the calculations on the coprocessor. Setting the balance to -1
(the default) will enable dynamic load balancing that continuously
adjusts the fraction of offloaded work throughout the simulation.
Because data transfer cannot be timed, this option typically produces
results within 5 to 10 percent of the optimal fixed balance.
If running short benchmark runs with dynamic load balancing, adding a
short warm-up run (10-20 steps) will allow the load-balancer to find a
near-optimal setting that will carry over to additional runs.
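For example, to instead fix the offload fraction at half (the 0.5 is
illustrative; the dynamic default is usually a good starting point):

package intel 1 balance 0.5 :pre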
The default for the "package intel"_package.html command is to have
all the MPI tasks on a given compute node use a single Xeon Phi
coprocessor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
@ -261,50 +413,35 @@ with 60 cores available for offload and 4 hardware threads per core
each MPI task to use a subset of 10 threads on the coprocessor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
the "package intel"_package.html command. :ulb,l
the "package intel"_package.html command.
The USER-INTEL package has two modes for deciding which atoms will be
handled by the coprocessor. This choice is controlled with the {ghost}
keyword of the "package intel"_package.html command. When set to 0,
ghost atoms (atoms at the borders between MPI tasks) are not offloaded
to the card. This allows for overlap of MPI communication of forces
with computation on the coprocessor when the "newton"_newton.html
setting is "on". The default is dependent on the style being used,
however, better performance may be achieved by setting this option
explictly.
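For example, to explicitly keep ghost atoms on the host (whether this
helps depends on the style and the simulation):

package intel 1 ghost no :pre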
When using offload with CPU Hyper-Threading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is due to the fact that additional threads are generated
internally to handle the asynchronous offload tasks.
If pair computations are being offloaded to an Intel Xeon Phi
coprocessor, a diagnostic line is printed to the screen (not to the
log file), during the setup phase of a run, indicating that offload
mode is being used and indicating the number of coprocessor threads
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for "atom
sorting"_atom_modify.html is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. All the
available coprocessor threads on each Phi will be divided among MPI
tasks, unless the {tptask} option of the "-pk intel" "command-line
switch"_Section_start.html#start_7 is used to limit the coprocessor
threads per MPI task.
[Restrictions:]
@ -319,3 +456,15 @@ the pair styles in the USER-INTEL package currently support the
"inner", "middle", "outer" options for rRESPA integration via the
"run_style respa"_run_style.html command; only the "pair" option is
supported.
[References:]
Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakkar, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., "Optimizing Classical Molecular Dynamics in LAMMPS," in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann. :ulb,l
Brown, W. M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press. :l
Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101. :l,ule

View File

@ -324,8 +324,11 @@ N1, N2, N3 are the number of 1-2, 1-3, 1-4 neighbors respectively of
this atom within the topology of the molecule. See the
"special_bonds"_special_bonds.html doc page for more discussion of
1-2, 1-3, 1-4 neighbors. If this section appears, the Special Bonds
section must also appear.
As explained above, LAMMPS will auto-generate this information if this
section is not specified. If specified, this section will
override what would be auto-generated.
:line
@ -342,9 +345,11 @@ values should be the 1-2 neighbors, the next N2 should be the 1-3
neighbors, the last N3 should be the 1-4 neighbors. No atom ID should
appear more than once. See the "special_bonds"_special_bonds.html doc
page for more discussion of 1-2, 1-3, 1-4 neighbors. If this section
appears, the Special Bond Counts section must also appear.
As explained above, LAMMPS will auto-generate this information if this
section is not specified. If specified, this section will override
what would be auto-generated.
:line

View File

@ -18,7 +18,7 @@ args = list of arguments for a particular style :l
{multi} args = none
{custom} args = list of keywords
possible keywords = step, elapsed, elaplong, dt, time,
cpu, tpcpu, spcpu, cpuremain, part, timeremain,
atoms, temp, press, pe, ke, etotal, enthalpy,
evdwl, ecoul, epair, ebond, eangle, edihed, eimp,
emol, elong, etail,
@ -41,6 +41,7 @@ args = list of arguments for a particular style :l
spcpu = timesteps per CPU second
cpuremain = estimated CPU time remaining in run
part = which partition (0 to Npartition-1) this is
timeremain = remaining time in seconds on timer timeout
atoms = # of atoms
temp = temperature
press = pressure
@ -256,6 +257,14 @@ a filename for output specific to this partition. See "Section_start
7"_Section_start.html#start_7 of the manual for details on running in
multi-partition mode.
The {timeremain} keyword returns the remaining seconds when a
timeout has been configured via the "timer timeout"_timer.html command.
If the timeout timer is inactive, the value of this keyword is 0.0;
if the timer has expired, it is negative. This allows, for example,
exiting a loop cleanly once the timeout has expired:
if "$(timeremain) < 0.0" then "quit 0" :pre
The {fmax} and {fnorm} keywords are useful for monitoring the progress
of an "energy minimization"_minimize.html. The {fmax} keyword
calculates the maximum force in any dimension on any atom in the

View File

@ -31,6 +31,9 @@ timer loop :pre
[Description:]
Select the level of detail at which LAMMPS performs its CPU timings.
Multiple keywords can be specified with the {timer} command. For
keywords that are mutually exclusive, the last one specified takes
effect.
During a simulation run LAMMPS collects information about how much
time is spent in different sections of the code and thus can provide
@ -54,17 +57,37 @@ slow down the simulation. Using the {nosync} setting (which is the
default) turns off this synchronization.
With the {timeout} keyword a walltime limit can be imposed that
affects the "run"_run.html and "minimize"_minimize.html commands. If
the time limit is reached, the run or energy minimization will exit on
the next step or iteration that is a multiple of the {Ncheck} value
specified with the {every} keyword. All subsequent run or minimize
commands in the input script will be skipped until the timeout is
reset or turned off by a new {timer} command. The timeout {elapse}
value can be specified as {off} or {unlimited} to impose no timeout
condition (which is the default). The {elapse} setting can be
specified as a single number for seconds, two numbers separated by a
colon (MM:SS) for minutes and seconds, or as three numbers separated
by colons for hours, minutes, and seconds.
affects the "run"_run.html and "minimize"_minimize.html commands.
This can be convenient when runs have to conform to time limits,
e.g. when running under a batch system and you want to maximize
the utilization of the batch time slot, especially when the time
per timestep varies and it is thus difficult to predict how many
steps a simulation can perform, or for difficult-to-converge
minimizations. The timeout {elapse} value should be somewhat smaller
than the time requested from the batch system, as there is usually
some overhead to launch jobs, and it may be advisable to write
out a restart after terminating a run due to a timeout.
The timeout timer starts when the command is issued. When the time
limit is reached, the run or energy minimization will exit on the
next step or iteration that is a multiple of the {Ncheck} value
which can be set with the {every} keyword. The default is to check
every 10 steps. After the timer timeout has expired, all subsequent
run or minimize commands in the input script will be skipped.
The remaining time or timer status can be accessed with the
"thermo"_thermo_style.html keyword {timeremain}. Its value is zero
if the timeout is inactive (the default setting), negative if the
timeout has expired, and positive if there is time remaining, in
which case the value is the number of seconds remaining.
When the {timeout} keyword is used a second time, the timer is
restarted with a new time limit. The timeout {elapse} value can
be specified as {off} or {unlimited} to impose no timeout condition
(which is the default). The {elapse} setting can be specified as
a single number for seconds, two numbers separated by a colon (MM:SS)
for minutes and seconds, or as three numbers separated by colons for
hours, minutes, and seconds (H:MM:SS).
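As a sketch, a job under, say, a 12-hour batch limit might reserve
some margin for startup and cleanup and save a restart when the run
stops (all values illustrative):

timer timeout 11:30:00 every 100
run 10000000
write_restart restart.timeout :pre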
The {every} keyword sets how frequently during a run or energy
minimization the wall clock will be checked. This check count applies
@ -74,10 +97,6 @@ can slow a calculation down. Checking too infrequently can make the
timeout measurement less accurate, with the run being stopped later
than desired.
NOTE: Using the {full} and {sync} options provides the most detailed
and accurate timing information, but can also have a negative
performance impact due to the overhead of the many required system