5.3.2 USER-INTEL package :h4

The USER-INTEL package is maintained by Mike Brown at Intel
Corporation. It provides two methods for accelerating simulations,
depending on the hardware you have. The first is acceleration on
Intel CPUs by running in single, mixed, or double precision with
vectorization. The second is acceleration on Intel Xeon Phi
coprocessors via offloading neighbor list and non-bonded force
calculations to the Phi. The same C++ code is used in both cases.
When offloading to a coprocessor from a CPU, the same routine is run
twice, once on the CPU and once with an offload flag. This allows
LAMMPS to run on the CPU cores and coprocessor cores simultaneously.

Note that when a Xeon Phi coprocessor is used, the USER-INTEL package
supports it in "offload" mode, not "native" mode like the "KOKKOS
package"_accelerate_kokkos.html.

[Currently Available USER-INTEL Styles:]

Angle Styles: charmm, harmonic :ulb,l
Bond Styles: fene, harmonic :l
Dihedral Styles: charmm, harmonic, opls :l
Fixes: nve, npt, nvt, nvt/sllod :l
Improper Styles: cvff, harmonic :l
Pair Styles: buck/coul/cut, buck/coul/long, buck, gayberne,
charmm/coul/long, lj/cut, lj/cut/coul/long, sw, tersoff :l
K-Space Styles: pppm :l,ule

[Speed-ups to expect:]

The speedups will depend on your simulation, the hardware, which
styles are used, the number of atoms, and the floating-point
precision mode. Performance improvements are shown compared to
LAMMPS {without using other acceleration packages} as these are
under active development (and subject to performance changes). The
measurements were performed using the input files available in
the src/USER-INTEL/TEST directory. These are scalable in size; the
results given are with 512K particles (524K for Liquid Crystal).
Most of the simulations are standard LAMMPS benchmarks (indicated
by the filename extension in parentheses) with modifications to the
run length and to add a warmup run (for use with offload
benchmarks).

:c,image(JPG/user_intel.png)

Results are speedups obtained on Intel Xeon E5-2697v4 processors
(code-named Broadwell) and Intel Xeon Phi 7250 processors
(code-named Knights Landing) with "18 Jun 2016" LAMMPS built with
Intel Parallel Studio 2016 update 3. Results are with 1 MPI task
per physical core. See {src/USER-INTEL/TEST/README} for the raw
simulation rates and instructions to reproduce.

:line

[Quick Start for Experienced Users:]

LAMMPS should be built with the USER-INTEL package installed.
Simulations should be run with 1 MPI task per physical {core},
not {hardware thread}.

For Intel Xeon CPUs:

Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary. :ulb,l
If using {kspace_style pppm} in the input script, add "neigh_modify
binsize 3" and "kspace_modify diff ad" to the input script for better
performance. :l
"-pk intel 0 omp 2 -sf intel" added to the LAMMPS command line (see
the example below) :l,ule
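
The lines below sketch one way to assemble these steps for a
hypothetical 36-core node; the generated binary name
(lmp_intel_cpu_intelmpi, following the usual make naming convention),
the task/thread counts, and the input script name are illustrative
assumptions, not requirements:

make yes-user-intel
make intel_cpu_intelmpi                                                          # uses src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi
mpirun -np 36 lmp_intel_cpu_intelmpi -sf intel -pk intel 0 omp 2 -in in.script   # 1 MPI task/core, 2 OpenMP threads/task (SMT) :pre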

For Intel Xeon Phi CPUs for simulations without {kspace_style
pppm} in the input script:

Edit src/MAKE/OPTIONS/Makefile.knl as necessary. :ulb,l
Runs should be performed using MCDRAM. :l
"-pk intel 0 omp 2 -sf intel" {or} "-pk intel 0 omp 4 -sf intel"
should be added to the LAMMPS command-line. Choice for best
performance will depend on the simulation. :l,ule

For Intel Xeon Phi CPUs for simulations with {kspace_style
pppm} in the input script:

Edit src/MAKE/OPTIONS/Makefile.knl as necessary. :ulb,l
Runs should be performed using MCDRAM. :l
Add "neigh_modify binsize 3" to the input script for better
performance. :l
Add "kspace_modify diff ad" to the input script for better
performance. :l
export KMP_AFFINITY=none :l
"-pk intel 0 omp 3 lrt yes -sf intel" or "-pk intel 0 omp 1 lrt yes
-sf intel" added to the LAMMPS command line (see the example below).
Choice for best performance will depend on the simulation. :l,ule
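
As a sketch of the last item, a run line for a single Xeon Phi 7250
node might look like the following; the 68-core node, the lmp_knl
binary name (from "make knl"), and the input script name are
assumptions for this example only:

export KMP_AFFINITY=none
mpirun -np 68 lmp_knl -sf intel -pk intel 0 omp 3 lrt yes -in in.script   # 1 MPI task/core, 3 OpenMP threads + 1 LRT thread per task :pre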

For Intel Xeon Phi coprocessors (Offload):

Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary. :ulb,l
"-pk intel N omp 1" added to the command line, where N is the number
of coprocessors per node (see the example below). :l,ule
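
For example, with one coprocessor per node the run line could look
like the sketch below; the 16 MPI tasks per node and the
lmp_intel_coprocessor binary name (from "make intel_coprocessor") are
illustrative assumptions:

mpirun -np 16 lmp_intel_coprocessor -sf intel -pk intel 1 omp 1 -in in.script   # offload to 1 coprocessor per node :pre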

:line

[Required hardware/software:]

In order to use offload to coprocessors, an Intel Xeon Phi
coprocessor and an Intel compiler are required. For this, the
recommended version of the Intel compiler is 14.0.1.106 or
versions 15.0.2.044 and higher.

Although any compiler can be used with the USER-INTEL package,
currently, vectorization directives are disabled by default when
not using Intel compilers due to lack of standard support and
observations of decreased performance. The OpenMP standard now
supports directives for vectorization and we plan to transition the
code to this standard once it is available in most compilers. We
expect this to allow improved performance and support with other
compilers.

For Intel Xeon Phi x200 series processors (code-named Knights
Landing), there are multiple configuration options for the hardware.
For best performance, we recommend that the MCDRAM is configured in
"Flat" mode and with the cluster mode set to "Quadrant" or "SNC4".
"Cache" mode can also be used, although the performance might be
slightly lower.

[Notes about Simultaneous Multithreading:]

Modern CPUs often support Simultaneous Multithreading (SMT). On
Intel processors, this is called Hyper-Threading (HT) technology.
SMT is hardware support for running multiple threads efficiently on
a single core. {Hardware threads} or {logical cores} are often used
to refer to the number of threads that are supported in hardware.
For example, the Intel Xeon E5-2697v4 processor is described
as having 36 cores and 72 threads. This means that 36 MPI processes
or OpenMP threads can run simultaneously on separate cores, but that
up to 72 MPI processes or OpenMP threads can be running on the CPU
without costly operating system context switches.

Molecular dynamics simulations will often run faster when making use
of SMT. If a thread becomes stalled, for example because it is
waiting on data that has not yet arrived from memory, another thread
can start running so that the CPU pipeline is still being used
efficiently. Although benefits can be seen by launching an MPI task
for every hardware thread, for multinode simulations, we recommend
that OpenMP threads are used for SMT instead, either with the
USER-INTEL package, the "USER-OMP package"_accelerate_omp.html, or
the "KOKKOS package"_accelerate_kokkos.html. In the example above, up
to 36X speedups can be observed by using all 36 physical cores with
LAMMPS. By using all 72 hardware threads, an additional 10-30%
performance gain can be achieved.

The BIOS on many platforms allows SMT to be disabled, however, we do
not recommend this on modern processors as there is little to no
benefit for any software package in most cases. The operating system
will report every hardware thread as a separate core allowing one to
determine the number of hardware threads available. On Linux systems,
this information can normally be obtained with:

cat /proc/cpuinfo :pre
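
If the lscpu utility (part of util-linux) is installed, a more compact
summary of sockets, cores per socket, and hardware threads per core
can be obtained with, for example:

lscpu | grep -E "Socket|Core|Thread" :pre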

[Building LAMMPS with the USER-INTEL package:]

The USER-INTEL package must be installed into the source directory:

make yes-user-intel :pre

Several example Makefiles for building with the Intel compiler are
included with LAMMPS in the src/MAKE/OPTIONS/ directory:

Makefile.intel_cpu_intelmpi   # Intel Compiler, Intel MPI, No Offload
Makefile.knl                  # Intel Compiler, Intel MPI, No Offload
Makefile.intel_cpu_mpich      # Intel Compiler, MPICH, No Offload
Makefile.intel_cpu_openmpi    # Intel Compiler, OpenMPI, No Offload
Makefile.intel_coprocessor    # Intel Compiler, Intel MPI, Offload :pre

Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that
it explicitly specifies that vectorization should be for Intel
Xeon Phi x200 processors, making it easier to cross-compile. For
users with recent installations of Intel Parallel Studio, the
process can be as simple as:

make yes-user-intel
source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh
# or psxevars.csh for C-shell
make intel_cpu_intelmpi :pre

Alternatively, the build can be accomplished with the src/Make.py
script, described in "Section 2.4"_Section_start.html#start_4 of the
manual. Type "Make.py -h" for help. For an example:

Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi :pre

Note that if you build with support for a Phi coprocessor, the same
binary can be used on nodes with or without coprocessors installed.
However, if you do not have coprocessors on your system, building
without offload support will produce a smaller binary.

If you also build with the USER-OMP package, you can use styles from
both packages, as described below.

The general requirements for Makefiles with the USER-INTEL package
are as follows. "-DLAMMPS_MEMALIGN=64" is required for CCFLAGS. When
using Intel compilers, "-restrict" is required and "-qopenmp" is
highly recommended for CCFLAGS and LINKFLAGS. LIB should include
"-ltbbmalloc". For builds supporting offload, "-DLMP_INTEL_OFFLOAD"
is required for CCFLAGS and "-qoffload" is required for LINKFLAGS.
Other recommended CCFLAG options for best performance are
"-O2 -fno-alias -ansi-alias -qoverride-limits -fp-model fast=2
-no-prec-div". The Make.py command will add all of these
automatically.
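
As a sketch only, the relevant fragment of a Makefile.machine that
meets these requirements could look like the lines below; the
compiler wrapper name (mpiicpc) assumes Intel MPI, and the shipped
Makefiles in src/MAKE/OPTIONS are the authoritative reference:

CC =        mpiicpc
CCFLAGS =   -qopenmp -restrict -ansi-alias -fno-alias -xHost -O2 \
            -DLAMMPS_MEMALIGN=64 -qoverride-limits -fp-model fast=2 -no-prec-div
LINK =      mpiicpc
LINKFLAGS = -qopenmp -xHost -O2
LIB =       -ltbbmalloc :pre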

NOTE: The vectorization and math capabilities can differ depending on
the CPU. For Intel compilers, the "-x" flag specifies the type of
processor for which to optimize. "-xHost" specifies that the compiler
should build for the processor used for compiling. For Intel Xeon Phi
x200 series processors, this option is "-xMIC-AVX512". For fourth
generation Intel Xeon (v4/Broadwell) processors, "-xCORE-AVX2" should
be used. For older Intel Xeon processors, "-xAVX" will perform best
in general for the different simulations in LAMMPS. The default
in most of the example Makefiles is to use "-xHost", however this
should not be used when cross-compiling.

[Running LAMMPS with the USER-INTEL package:]

Running LAMMPS with the USER-INTEL package is similar to normal use
with the exceptions that one should 1) specify that LAMMPS should use
the USER-INTEL package, 2) specify the number of OpenMP threads, and
3) optionally specify the specific LAMMPS styles that should use the
USER-INTEL package. 1) and 2) can be performed from the command line
or by editing the input script. 3) requires editing the input script.
Advanced performance tuning options are also described below to get
the best performance.

When running on a single node (including runs using offload to a
coprocessor), best performance is normally obtained by using 1 MPI
task per physical core and additional OpenMP threads with SMT. For
Intel Xeon processors, 2 OpenMP threads should be used for SMT.
For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used
(best choice depends on the simulation). In cases where the user
specifies that LRT mode is used (described below), 1 or 3 OpenMP
threads should be used. For multi-node runs, using 1 MPI task per
physical core will often perform best, however, depending on the
machine and scale, users might get better performance by decreasing
the number of MPI tasks and using more OpenMP threads. For
performance, the product of the number of MPI tasks and OpenMP
threads should not exceed the number of available hardware threads in
almost all cases.
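
For instance, on a hypothetical node with 36 physical cores and 72
hardware threads, the recommendation above corresponds to something
like:

mpirun -np 36 lmp_machine -sf intel -pk intel 0 omp 2 -in in.script   # 36 tasks x 2 threads = 72 hardware threads :pre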

NOTE: Setting core affinity is often used to pin MPI tasks and OpenMP
threads to a core or group of cores so that memory access can be
uniform. Unless disabled at build time, affinity for MPI tasks and
OpenMP threads on the host (CPU) will be set by default on the host
{when using offload to a coprocessor}. In this case, it is unnecessary
to use other methods to control affinity (e.g. taskset, numactl,
I_MPI_PIN_DOMAIN, etc.). This can be disabled with the {no_affinity}
option to the "package intel"_package.html command or by disabling the
option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the
CCFLAGS line of your Makefile). Disabling this option is not
recommended, especially when running on a machine with Intel
Hyper-Threading technology disabled.

[Run with the USER-INTEL package from the command line:]

To enable USER-INTEL optimizations for all available styles used in
the input script, the "-sf intel"
"command-line switch"_Section_start.html#start_7 can be used without
any requirement for editing the input script. This switch will
automatically append "intel" to styles that support it. It also
invokes a default command: "package intel 1"_package.html. This
package command is used to set options for the USER-INTEL package.
The default package command will specify that USER-INTEL calculations
are performed in mixed precision, that the number of OpenMP threads
is specified by the OMP_NUM_THREADS environment variable, and that
if coprocessors are present and the binary was built with offload
support, that 1 coprocessor per node will be used with automatic
balancing of work between the CPU and the coprocessor.

You can specify different options for the USER-INTEL package by using
the "-pk intel Nphi" "command-line switch"_Section_start.html#start_7
with keyword/value pairs as specified in the documentation. Here,
Nphi = # of Xeon Phi coprocessors/node (ignored without offload
support). Common options to the USER-INTEL package include {omp} to
override any OMP_NUM_THREADS setting and specify the number of OpenMP
threads, {mode} to set the floating-point precision mode, and
{lrt} to enable Long-Range Thread mode as described below. See the
"package intel"_package.html command for details, including the
default values used for all its options if not specified, and how to
set the number of OpenMP threads via the OMP_NUM_THREADS environment
variable if desired.

Examples (see documentation for your MPI/Machine for differences in
launching MPI applications):

mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script                                 # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads
mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double   # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision :pre

[Or run with the USER-INTEL package by editing an input script:]

As an alternative to adding command-line arguments, the input script
can be edited to enable the USER-INTEL package. This requires adding
the "package intel"_package.html command to the top of the input
script. For the second example above, this would be:

package intel 0 omp 2 mode double :pre

To enable the USER-INTEL package only for individual styles, you can
add an "intel" suffix to the individual style, e.g.:

pair_style lj/cut/intel 2.5 :pre

Alternatively, the "suffix intel"_suffix.html command can be added to
the input script to enable USER-INTEL styles for the commands that
follow in the input script.
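
For example, a minimal input-script fragment using this approach
could look as follows; the pair style and cutoff are placeholders:

suffix intel
pair_style lj/cut 2.5 :pre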

[Tuning for Performance:]

NOTE: The USER-INTEL package will perform better with modifications
to the input script when "PPPM"_kspace_style.html is used:
"kspace_modify diff ad"_kspace_modify.html and "neigh_modify binsize
3"_neigh_modify.html should be added to the input script.
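
A sketch of the corresponding input-script lines is shown below; the
kspace accuracy value is a placeholder chosen only for illustration:

kspace_style pppm 1e-4
kspace_modify diff ad
neigh_modify binsize 3 :pre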

Long-Range Thread (LRT) mode is an option to the "package
intel"_package.html command that can improve performance when using
"PPPM"_kspace_style.html for long-range electrostatics on processors
with SMT. It generates an extra pthread for each MPI task. The thread
is dedicated to performing some of the PPPM calculations and MPI
communications. On Intel Xeon Phi x200 series CPUs, this will likely
always improve performance, even on a single node. On Intel Xeon
processors, using this mode might result in better performance when
using multiple nodes, depending on the machine. To use this mode,
specify that the number of OpenMP threads is one less than would
normally be used for the run and add the "lrt yes" option to the "-pk"
command-line suffix or "package intel" command. For example, if a run
would normally perform best with "-pk intel 0 omp 4", instead use
"-pk intel 0 omp 3 lrt yes". When using LRT, you should set the
environment variable "KMP_AFFINITY=none". LRT mode is not supported
when using offload.
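
Putting this together, a run that would otherwise use 4 OpenMP
threads per task could be launched roughly as follows; the task count
and binary name are placeholders:

export KMP_AFFINITY=none
mpirun -np 64 lmp_machine -sf intel -pk intel 0 omp 3 lrt yes -in in.script :pre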

Not all styles are supported in the USER-INTEL package. You can mix
the USER-INTEL package with styles from the "OPT"_accelerate_opt.html
package or the "USER-OMP package"_accelerate_omp.html. Of course,
this requires that these packages were installed at build time. This
can be performed automatically by using the "-sf hybrid intel opt" or
"-sf hybrid intel omp" command-line options. Alternatively, the "opt"
and "omp" suffixes can be appended manually in the input script. For
the latter, the "package omp"_package.html command must be in the
input script or the "-pk omp Nt" "command-line
switch"_Section_start.html#start_7 must be used where Nt is the
number of OpenMP threads. The number of OpenMP threads should not be
set differently for the different packages. Note that the "suffix
hybrid intel omp"_suffix.html command can also be used within the
input script to automatically append the "omp" suffix to styles when
USER-INTEL styles are not available.
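
For example, a run mixing USER-INTEL and USER-OMP styles with 2
OpenMP threads for both packages could be launched roughly as
follows; the node and task counts are placeholders:

mpirun -np 36 lmp_machine -sf hybrid intel omp -pk intel 0 omp 2 -pk omp 2 -in in.script :pre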

When running on many nodes, performance might be better when using
fewer OpenMP threads and more MPI tasks. This will depend on the
simulation and the machine. Using the "verlet/split"_run_style.html
run style might also give better performance for simulations with
"PPPM"_kspace_style.html electrostatics. Note that this is an
alternative to LRT mode and the two cannot be used together.

Currently, when using Intel MPI with Intel Xeon Phi x200 series
CPUs, better performance might be obtained by setting the
environment variable "I_MPI_SHM_LMT=shm" for Linux kernels that do
not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
series processors will always perform better using MCDRAM. Please
consult your system documentation for the best approach to specify
that MPI runs are performed in MCDRAM.
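
As a sketch, on systems where flat-mode MCDRAM is exposed as a
separate NUMA node (often, but not always, node 1; check "numactl
-H"), one common approach is to bind the run to that node. Whether
this is appropriate, and the correct node number, depend on your
system configuration:

export I_MPI_SHM_LMT=shm                                                              # only needed for older Linux kernels, as noted above
mpirun -np 68 numactl --membind=1 lmp_knl -sf intel -pk intel 0 omp 2 -in in.script :pre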

[Tuning for Offload Performance:]

The default settings for offload should give good performance.

When using LAMMPS with offload to Intel coprocessors, best performance
will typically be achieved with concurrent calculations performed on
both the CPU and the coprocessor. This is achieved by offloading only
a fraction of the neighbor and pair computations to the coprocessor or
using "hybrid"_pair_hybrid.html pair styles where only one style uses
the "intel" suffix. For simulations with long-range electrostatics or
bond, angle, dihedral, improper calculations, computation and data
transfer to the coprocessor will run concurrently with computations
and MPI communications for these calculations on the host CPU. This
is illustrated in the figure below for the rhodopsin protein benchmark
running on E5-2697v2 processors with an Intel Xeon Phi 7120p
coprocessor. In this plot, the vertical axis is time and routines
running at the same time are running concurrently on both the host and
the coprocessor.

:c,image(JPG/offload_knc.png)

The fraction of the offloaded work is controlled by the {balance}
keyword in the "package intel"_package.html command. A balance of 0
runs all calculations on the CPU. A balance of 1 runs all
supported calculations on the coprocessor. A balance of 0.5 runs half
of the calculations on the coprocessor. Setting the balance to -1
(the default) will enable dynamic load balancing that continuously
adjusts the fraction of offloaded work throughout the simulation.
Because data transfer cannot be timed, this option typically produces
results within 5 to 10 percent of the optimal fixed balance.
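
For example, to offload roughly two-thirds of the supported
computations to a single coprocessor per node, either form below can
be used; the 0.66 value and the task count are illustrative:

package intel 1 balance 0.66                                                  # in the input script
mpirun -np 16 lmp_machine -sf intel -pk intel 1 balance 0.66 -in in.script    # or on the command line :pre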

If running short benchmark runs with dynamic load balancing, adding a
short warm-up run (10-20 steps) will allow the load-balancer to find a
near-optimal setting that will carry over to additional runs.

The default for the "package intel"_package.html command is to have
all the MPI tasks on a given compute node use a single Xeon Phi
coprocessor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the coprocessor. For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the coprocessor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
the "package intel"_package.html command.

The USER-INTEL package has two modes for deciding which atoms will be
handled by the coprocessor. This choice is controlled with the {ghost}
keyword of the "package intel"_package.html command. When set to 0,
ghost atoms (atoms at the borders between MPI tasks) are not offloaded
to the card. This allows for overlap of MPI communication of forces
with computation on the coprocessor when the "newton"_newton.html
setting is "on". The default is dependent on the style being used,
however, better performance may be achieved by setting this option
explicitly.

When using offload with CPU Hyper-Threading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is due to the fact that additional threads are generated
internally to handle the asynchronous offload tasks.

If pair computations are being offloaded to an Intel Xeon Phi
coprocessor, a diagnostic line is printed to the screen (not to the
log file) during the setup phase of a run, indicating that offload
mode is being used and indicating the number of coprocessor threads
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for "atom
sorting"_atom_modify.html is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. All the
available coprocessor threads on each Phi will be divided among MPI
tasks, unless the {tptask} option of the "-pk intel" "command-line
switch"_Section_start.html#start_7 is used to limit the coprocessor
threads per MPI task.
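
For example, the following limits each MPI task to at most 10
coprocessor threads regardless of how many tasks share the card; the
task count and binary name are placeholders:

mpirun -np 16 lmp_machine -sf intel -pk intel 1 tptask 10 -in in.script :pre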

[Restrictions:]

None of the pair styles in the USER-INTEL package currently support
the "inner", "middle", "outer" options for rRESPA integration via the
"run_style respa"_run_style.html command; only the "pair" option is
supported.

[References:]

Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakker, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., "Optimizing Classical Molecular Dynamics in LAMMPS," in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann. :ulb,l

Brown, W.M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press. :l

Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101. :l,ule