This should print a warning when 2x the bonded interaction cutoff is larger than the other cutoffs, as was the behavior before the performance optimization introduced in commit 2690075405.
this now covers a large set of cases where the variable name can be printed.
it is also complete for the current code, since no more default arguments
are required than are provided.
New directory: tools/doxygen
New file: tools/doxygen/Developer.dox.lammps
New file: tools/doxygen/Doxyfile.lammps
New file: tools/doxygen/doxygen.sh
New file: tools/doxygen/README
The Developer.dox.lammps file contains a slightly revised version of the
Developer.pdf file, adapted to the LAMMPS "doxygen" documentation.
The Doxyfile.lammps file is a first proposal for a LAMMPS "doxygen"
documentation flavor and can be adjusted to specific requirements.
The "doxygen.sh" shell script generates the LAMMPS "doxygen"
documentation.
Detailed instructions can be found in the README file.
This reverts commit 4a3a6b4455.
As it turns out, when using the LAMMPS python wrapper from inside
code using the PYTHON package, the library symbols *are* needed.
Thanks to Richard Berger (@rbberger) for pointing this out.
NEB was not working correctly when using multiple procs
per replica and the keywords last/efirst or last/efirst/middle.
I have corrected this in the enclosed fix_neb.cpp.
I also slightly modified the nudging for this free end so that
it is applied only when the target energy is larger than the
replica's energy. In any case, if the target energy is lower than
the replica's energy, the replica should relax toward the target
energy without adding any nudging.
I also modified the documentation according to this change.
the sphinxcontrib.image extension was broken with Sphinx 1.6.x;
however, Sphinx 1.5.x breaks with newer versions of the multiprocessing module.
So we suspend the thumbnail processing and lift the version lock to Sphinx 1.5.x.
Also, the number of parallel Sphinx tasks can be overridden with SPHINXEXTRA="-j #".
The default is to try to use all local CPU cores.
This includes an example of how to implement fix NVE in Python.
The library interface was extended to provide direct access to atom data using
numpy arrays. No data copies are made and numpy operations directly manipulate
memory of the native code.
To keep this numpy dependency optional, all functions are wrapped into the
lammps.numpy sub-object which is only loaded when accessed.
This was accomplished with several key changes:
1) Modified fix_shardlow's control flow to match fix_shardlow_kokkos so
that random numbers are pulled from the RNGs in exactly the same order.
2) Created random_external_state.h, a simplified version of the Kokkos
random number generator that keeps its state variables external to itself.
Thus it can be used both with and without Kokkos enabled, as long as the
caller stores and passes in the required state variable (a generic sketch
of this pattern is shown below).
3) Replaced all references to random_mars.h and Kokkos_Random.hpp code in
the fix_shardlow* files with calls to the random_external_state.h code,
guaranteeing that fix_shardlow* is using an identical RNG in all cases.
Result: most (56 of 61) of our internal tests now generate the same results
with kokkos turned on or off. Four cases still differ due to what appear
to be vectorization-induced rounding differences, and the fifth case
appears to be something triggered by the kokkos "atom_style hybrid" code.
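For illustration, here is a rough, self-contained sketch of the external-state
RNG idea (the names and the xorshift64 recurrence are placeholders, not the
actual random_external_state.h API): the generator code owns no state of its
own, and the caller allocates and passes in the state word, so the draw order
is controlled entirely by the caller, with or without Kokkos.

  #include <cstdint>
  #include <cstdio>

  typedef uint64_t es_state;   // caller-owned RNG state, e.g. one per atom

  inline void es_init(es_state &state, uint64_t seed) {
    state = seed ? seed : 0x9E3779B97F4A7C15ull;   // avoid the all-zero state
  }

  // one xorshift64 step; the only state is the word the caller passed in
  inline uint64_t es_next(es_state &state) {
    state ^= state << 13;
    state ^= state >> 7;
    state ^= state << 17;
    return state;
  }

  inline double es_uniform(es_state &state) {      // uniform in [0,1)
    return (es_next(state) >> 11) * (1.0 / 9007199254740992.0);
  }

  int main() {
    es_state s;                // the caller stores the state ...
    es_init(s, 12345u);
    for (int k = 0; k < 3; ++k)
      printf("%f\n", es_uniform(s));               // ... and passes it in
    return 0;
  }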
Propagate the efaa4c67 changes to npair_ssa_kokkos from npair_kokkos that
support the new neigh_modify exclude molecule/intra and /inter options.
Note: npair_ssa_kokkos could inherit from npair_kokkos to avoid this kind
of missed change. Unfortunately, inheritance from templated classes is
both tricky and messy, and not worth the complexity in this case, IMHO.
Notable features are the umbrella-integration based free energy estimator for
eABF, and the traditional thermodynamic integration estimator now available
for umbrella sampling, SMD, metadynamics. Also included are several small fixes.
Below is a list of relevant commits in the Colvars repository since the last update.
321d06a 2017-10-10 Add macros to manage colvarscript commands [Giacomo Fiorin]
26c3bec 2017-10-09 Document coming availability of Lepton in LAMMPS [Giacomo Fiorin]
cc8f249 2017-10-04 Clarify that SMP depends on code build [Giacomo Fiorin]
0b2ffac 2017-10-04 Summarize colvar definition options, clarify some details [Giacomo Fiorin]
28002e0 2017-10-01 Separate writing of restart file from other output (e.g. PMFs) [Giacomo Fiorin]
92f7c1d 2017-10-01 Deprecate colvarsTrajAppend [Giacomo Fiorin]
12a707f 2017-09-26 Accurate Jacobian calculation for RMSD variants [Jérôme Hénin]
fe389c9 2017-09-21 Allow subtractAppliedForce with extended-L again [Jérôme Hénin]
c050ce0 2017-09-18 Silence compiler warnings, remove Tabs [Giacomo Fiorin]
cb41905 2017-01-11 Add base class for TI estimator in other biases than ABF [Giacomo Fiorin]
a1bc676 2017-09-14 Avoid writing to unopened traj file [Jérôme Hénin]
b58d8cd 2017-09-08 Function to check for overlapping groups [Jérôme Hénin]
1e5efec 2017-09-07 Check for overlapping groups in coordNum [Jérôme Hénin]
03a61a4 2017-04-06 Add UI-based estimator [fhh2626]
ae43754 2017-08-17 Fix outputCenters parsing [Josh Vermaas]
1619e0e 2017-08-14 Delete static feature arrays in cvm destructor [Jérôme Hénin]
- re-indent to 2 blanks
- white space cleanup
- use force->numeric() and force->inumeric() instead of atof() and atoi()
- include system headers before local LAMMPS headers
- move example folder to examples/USER/misc/
- comment out writing of trajectory files
- reduce run length (for easier testing for regressions)
- record example outputs for 1 and 4 MPI processes
- rename readme.md to README.md for visibility
Adding raw performance numbers for a Skylake Xeon server.
Fixes for using older Intel compilers and compiling without OpenMP.
Fix adds hooks for using USER-INTEL with minimization.
- include the used tricubic functions directly as static functions
- silence compiler warnings
- define f2c.h imported data types directly or use C equivalents
- since the direct LAPACK API was called and not cLAPACK, declare LAPACK interface and depend only on LAPACK
- add proper dependencies
- disable automatic minor version number generation; step the version manually.
- comment out optional spglib functionality by default
with this change, the USER-INTEL package can be installed and
compiled without having to alter makefiles for adding -lpthread.
All "intel optimized" makefiles have been updated to have the
LRT feature enabled. This change will allow us to include the
USER-INTEL package in several automated testing configurations
and thus allows us to detect incompatibilities and compilation issues faster.
The Constant Energy DPD (DPDE) was our primary use case, so only stubs
for the Constant Temperature case were included in Kokkos code so far.
The non-Kokkos version works fine for Constant Temperature DPD.
New function that allows for parallel tempering (replica exchange) MD in LAMMPS in the isothermal-isobaric ensemble (NPT).
It is similar to temper, which works in the canonical (NVT) ensemble.
An example that uses temper_npt is included.
Merge changes thru July 27, 2017 from master 6d0a2286 into USER-DPD_kokkos
Includes 67a0183b which partially reverted 7f9a331c (from May 16, 2017) in USER-DPD,
since SSA neighbor lists use ghost info, so they can't currently be used as "occasional" lists.
The default compiler flags in voro++'s config.mk file do not include
-fPIC, which makes it incompatible with building the shared object for
the python wrapper.
- building into a local directory to replace the existing installation is now the default
- add wrapper function that calls curl in case the Python installation has no SSL support
- have to specify -n flag to avoid wiping out the existing installation
- can specify -p to point to an existing kim-api installation (implies -n)
There were several clean_copy() calls in pair
styles *outside device code*.
They seem to have been left over from an abandoned
effort to copy the Kokkos neighbor list as
a member of the pair style, instead of copying
out the individual views needed.
These leftover clean_copy() calls were setting
pointers to NULL that had not been freed,
leading to large memory leaks.
I've removed the clean_copy() function entirely,
and replaced it with the copymode flag system used
in many other Kokkos objects.
The copymode flag is only set to one in
functors that hold copies of the neighbor list.
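As a minimal standalone sketch of the copymode idiom (class and member names
invented here, not taken from the actual pair styles): the instance that is
about to be shallow-copied into a functor sets copymode, and the destructor of
such a copy becomes a no-op, so it neither frees nor nulls pointers it does
not own.

  #include <cstdio>

  struct PairStyleSketch {
    int copymode = 0;
    double *cutsq = nullptr;

    PairStyleSketch() { cutsq = new double[10]; }

    ~PairStyleSketch() {
      if (copymode) return;    // shallow copies own nothing: do not free
      delete[] cutsq;
    }

    void compute() {
      copymode = 1;                           // mark before the functor copy is made
      PairStyleSketch functor_copy = *this;   // stands in for the device-side copy
      (void) functor_copy;                    // ... Kokkos::parallel_for(..., functor_copy) ...
      copymode = 0;
    }
  };

  int main() {
    PairStyleSketch p;
    p.compute();               // the copy's destructor runs without freeing p's data
    printf("done\n");
    return 0;
  }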
- Moved the particle loop inside a copy of getMixingWeights (getMixingWeightsVect)
and refactored it to improve vectorization.
- Added OMP SIMD and OMP threading directly inside that function but will replace with
kokkos parallel_for and parallel_reduce methods later.
Normally, the gzip process would be pinned to the same core as the
MPI rank 0 process, which makes the pipe stay in one core's cache,
but forces the two processes to fight for that core, slowing things down.
Note: "newton on" still required if using non-kokkos pair styles or fixes.
Non-kokkos pairs/fixes don't expect their half lists with newton off,
which happens if newton is turned off globally by kokkos via commandline.
Note2: Regardless, fix_shardlow* will still use half lists and newton on.
two sort functions with different
names but identical functionality.
making them the same function
until we decide to use a different
algorithm for atoms and ghosts
KOKKOS_LAMBDA doesn't quite work on CUDA;
you have to use LAMMPS_LAMBDA.
Also, if you do use LAMMPS_LAMBDA, you need
to run on the default device type,
i.e. you cannot use lambdas to run on OpenMP
when LAMMPS has been compiled for CUDA.
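A minimal usage sketch, assuming a CUDA build of the KOKKOS package where
kokkos_type.h provides LAMMPS_LAMBDA (the view name, loop body, and function
are invented for illustration):

  #include <Kokkos_Core.hpp>
  #include "kokkos_type.h"     // provides LAMMPS_LAMBDA in LAMMPS' Kokkos package

  // scale per-atom positions; must run on the default (device) execution space
  void scale_positions(Kokkos::View<double*[3]> x, int nlocal, double s)
  {
    Kokkos::parallel_for(nlocal, LAMMPS_LAMBDA(const int i) {
      x(i,0) *= s;
      x(i,1) *= s;
      x(i,2) *= s;
    });
  }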
Add support for lock-free and deterministic use of Random_XorShift*_Pool
by giving state_idx selection and lock responsibility up to the
application. Done by an overload of get_state() that takes state_idx as
an argument that the application guarantees is concurrently unique
and within the range of num_states that the application passed to init().
In other words, this allows the RNG state to be associated with some
application specific index, rather than a runtime arbitrary thread ID,
and thus the application can control which work is performed using
which RNG in a deterministic manner, regardless of which thread
performs the work.
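A usage sketch of that overload (signatures hedged; the functor, views, and the
choice of i % num_states as the index are invented): because the caller picks
the state index, which RNG serves which atom is deterministic regardless of
thread assignment.

  #include <Kokkos_Core.hpp>
  #include <Kokkos_Random.hpp>

  template <class DeviceType>
  struct KickAtoms {
    Kokkos::View<double*, DeviceType> v;
    Kokkos::Random_XorShift64_Pool<DeviceType> rand_pool;
    int num_states;            // what the application passed to the pool's init()

    KOKKOS_INLINE_FUNCTION
    void operator()(const int i) const {
      // deterministic: the state is chosen from i, not from a runtime thread ID,
      // and the application guarantees no two concurrent work items share an index
      auto gen = rand_pool.get_state(i % num_states);
      v(i) += gen.normal();
      rand_pool.free_state(gen);
    }
  };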
Random_XorShift*_Pool<Kokkos::Cuda>::free_state() has two purposes:
1) update the state value kept in the pool
2) unlock the state
For a CUDA host thread, ONLY skip step 2, not both.
SSA atom binning algorithm was adjusted to do as much work as possible in
parallel while preserving deterministic behavior. The final
step is done serially to preserve deterministic behavior.
An alternative would be to sort the contents of the bins so
that they are always in the same order.
ssa_update_dpde() hangs on first use of rand_gen.normal()
Switching to not using a pointer to PairDPDfdtEnergyKokkos's rand_pool
had no noticeable effect.
Eliminates a special case version of a loop just for Subphase 0.
NOTE: pair evaluation order changes, causing numerical differences!
This changed the order that close neighbors of ghosts are processed.
NOTE: pair evaluation order changes, causing numerical differences!
Atom pair processing order is fully planned out in npair_half_bin_newton_ssa.
This makes the SSA neighbor list structure very different; it must not be used by other styles!
Each local atom appears in ilist, numneigh, and firstneigh four times instead of once.
Changes LAMMPS core code that had been previously changed for USER-DPD/SSA:
Removes ssaAIR[] from class Atom as it is now unused.
Removes ndxAIR_ssa[] from class NeighList as it is now unused.
Increases length of ilist[], numneigh[], and firstneigh[] if SSA flag set.
NOTE: pair evaluation order changes, causing numerical differences!
This enables processing neighbors in subphase groups that enforce
a geometrical separation of pairs, allowing greater parallelism
once fix_shardlow (SSA) is converted to Kokkos.
This removes the distinction between pure and impure locals.
Pure and impure locals messed up the directionality of half neighbor lists,
which turns out to be crucial to the approach for SSA with Kokkos.
- Switched from using lambda functions to operator()'s with type tags
in FixRxKokkos (a generic sketch of this pattern follows at the end of
these notes). The lambdas were causing big problems in Cuda with
the memory objects. This required that all referenced views be members
of the FixRxKokkos class.
- Add copymode controls to solve_reactions() to avoid the destructor
freeing pointers carried forward from the copy constructor. Added
the same to FixRX since it's called, too.
- Updated the function prototypes to include the necessary KOKKOS
macros for __host__ and __device__ functions and inlined functions.
- Changed several View definitions to match the disjoint memory spaces
that only come up with Cuda builds.
- Finished porting all scratch arrays to using the StridedArrayType
template.
- Created a single, large Kokkos device array and using that for all
scratch data passed into the StridedArrayType objects.
- Created an Array class that provides strided access via operator[]
w/o needing Kokkos views (a rough standalone sketch follows after the
TODO list below). This was designed to avoid the performance
issues encountered with Views and sub-views throughout the RHS and
ODE solver functions.
- Added the diagnostics performance analysis routine to FixRxKokkos
using Kokkos views.
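For reference, a generic sketch of the tagged-operator() pattern (class, tag,
and view names are invented, not the actual FixRxKokkos kernels): the class
itself is the functor, and each kernel is selected by an empty tag type passed
through the execution policy instead of capturing members in a device lambda.

  #include <Kokkos_Core.hpp>

  template <class DeviceType>
  struct SolveSketch {
    Kokkos::View<double*, DeviceType> conc, rate;

    struct TagZeroRates {};
    struct TagIntegrate {};

    KOKKOS_INLINE_FUNCTION
    void operator()(TagZeroRates, const int i) const {
      rate(i) = 0.0;
    }

    KOKKOS_INLINE_FUNCTION
    void operator()(TagIntegrate, const int i) const {
      conc(i) += 0.001 * rate(i);
    }

    void run(int n) {
      Kokkos::parallel_for(Kokkos::RangePolicy<DeviceType, TagZeroRates>(0, n), *this);
      Kokkos::parallel_for(Kokkos::RangePolicy<DeviceType, TagIntegrate>(0, n), *this);
    }
  };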
TODO:
- Switch to using Kokkos data for the per-iteration scratch data.
How to allocate only enough for each work-unit and not all
iterations? Can the shared-memory scratch memory work for this,
even for large sizes?
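And a rough standalone sketch of the strided-array idea mentioned above (names
and layout are invented): one large scratch allocation is carved into
per-quantity logical arrays, and operator[] applies a fixed stride, avoiding
Kokkos view/subview indexing in the inner ODE loops.

  #include <cstdio>

  template <typename T>
  struct StridedArraySketch {
    T *base;      // points into one big shared scratch allocation
    int stride;   // distance between consecutive logical elements

    StridedArraySketch(T *base_, int stride_) : base(base_), stride(stride_) {}
    T &operator[](int i) const { return base[i * stride]; }
  };

  int main() {
    const int nspecies = 4, nwork = 8;
    double scratch[nspecies * nwork];                    // the single large allocation
    StridedArraySketch<double> y(&scratch[3], nwork);    // work item 3's view of y[0..nspecies-1]
    for (int k = 0; k < nspecies; ++k) y[k] = 0.1 * k;
    printf("y[2] = %g\n", y[2]);
    return 0;
  }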
usage and calls to computeLocalTemperature.
- Created request for kokkos neighbor list for fix and switched to
that neighbor list datatype in computeLocalTemperature.
- Reconfigured pre_force and setup_pre_force to call a common
solve_reactions() method to avoid duplicate code.
TODO:
- Clean-up
- Provide per-problem scratch data within kokkos framework (instead
of C++ new/delete data).
- Added a kokkos version of setup_pre_force that only sets dvector
and then communicates that.
- Converted all for loops to parallel_for's in computeLocalTemperature()
and setup_pre_force.
- Added pack/unpack forward/reverse methods with Kokkos host views.
TODO:
- The Kokkos neighbor list is not working. Need to request a Kokkos
neighbor list in ::init(). Then, replace objects like list->ilist[]
with k_list->d_ilist().
Added kokkos dual-view datatypes used in computeLocalTemperature and
pre_force (e.g., dpdThetaLocal) but still using the original host
pointers for the pack/unpack operations.
TODO:
- The Kokkos neighbor list is not working. Need to request a Kokkos
neighbor list in ::init(). Then, replace objects like list->ilist[]
with k_list->d_ilist().
- Add another template parameter for HALFTHREAD and create (automatic)
atomic view of dpdThetaLocal and sumWeights.
- Add modify/sync comments and replace the host-only pointers in the
pack/unpack methods.
- Added templated computeLocalTemp<>() to FixRxKokkos but still
using the original host data pointers.
- Updated the copy-back to dvector operation to be the same with
RK4 and RKF45 per discussion with J. Larentzos.
TODO:
- Add kokkos data for computeLocalTemp and parallel_for loop.
- Updated the KOKKOS installer to include the fix_rx_kokkos.[cpp,h].
- Updated the USER-DPD version of fix_rx.[cpp,h] to sync with the Kokkos
version. Solves child->parent class dependencies.
- Added kokkos-managed parameter data for the kinetics equations.
- Removed dependencies in rhs() on atom and domain objects.
TODO:
1. Switch to using KOKKOS data for dvector.
2. Port ComputeLocalTemp(...) to Kokkos (needs pairing algorithm).
Initial port of USER-DPD/fix_rx.cpp to KOKKOS/fix_rx_kokkos.cpp.
Using parallel_reduce(...) but still using host-only data.
TODO:
1. Switch to KOKKOS datatypes for sparse-kinetics data; dense
is finished.
2. Switch to using KOKKOS data for dvector.
3. Remove dependencies in rhs(...) on atom. Store those consts
in UserData{} or as member constants.
4. Port ComputeLocalTemp(...) to Kokkos (needs pairing algorithm).
the main bug here is the use of a local
rho_i accumulator which later gets assigned
back to rho[i].
in parallel, atomic additions can happen to
rho[i] while the local accumulator is held;
those atomic additions are lost when
the accumulator is atomically assigned.
we instead initialize the accumulator to zero
and atomically add it back to rho[i].
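A sketch of the corrected pattern (view and function names invented), assuming
other threads may concurrently atomic_add their own pair contributions into
rho(i) while this iteration runs:

  #include <Kokkos_Core.hpp>

  void accumulate_density(Kokkos::View<double*> rho,
                          Kokkos::View<double**> pair_contrib,
                          int nlocal, int maxneigh)
  {
    Kokkos::parallel_for(nlocal, KOKKOS_LAMBDA(const int i) {
      double rho_i = 0.0;                  // NOT seeded from rho(i)
      for (int jj = 0; jj < maxneigh; ++jj)
        rho_i += pair_contrib(i, jj);
      Kokkos::atomic_add(&rho(i), rho_i);  // merge with updates made meanwhile
    });
  }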
one Kokkos kernel was not annotated consistently,
STACKPARAMS was essentially uninitialized and
confused with a local variable,
plus lots of variables were unused in some
of the Kokkos kernels.
During dynamic load balancing, the subdomains will not be uniform so the
bbox size test in USER-DPD/fix_shardlow.cpp may be triggered on only one rank.
Using error->one allows any rank to stop the simulation in this scenario.
Added rcut and bbox information to help in diagnostics.
set(LAMMPS_MEMALIGN "64" CACHE STRING "enables the use of the posix_memalign() call instead of malloc() when large chunks of memory are allocated by LAMMPS")
@ -233,8 +233,8 @@ set any needed options for the package via "-pk" "command-line switch"_Section_s
use accelerated styles in your input via "-sf" "command-line switch"_Section_start.html#start_6 or "suffix"_suffix.html command | lmp_machine -in in.script -sf gpu
:tb(c=2,s=|)
Note that the first 4 steps can be done as a single command with
suitable make command invocations. This is discussed in "Section
4"_Section_packages.html of the manual, and its use is
illustrated in the individual accelerator sections. Typically these
steps only need to be done once, to create an executable that uses one
@ -118,15 +131,17 @@ Package, Description, Doc page, Example, Library
"USER-EFF"_#USER-EFF, electron force field,"pair_style eff/cut"_pair_eff.html, USER/eff, -
"USER-FEP"_#USER-FEP, free energy perturbation,"compute fep"_compute_fep.html, USER/fep, -
"USER-H5MD"_#USER-H5MD, dump output via HDF5,"dump h5md"_dump_h5md.html, -, ext
"USER-INTEL"_#USER-INTEL, optimized Intel CPU and KNL styles,"Section 5.3.2"_accelerate_intel.html, WWW bench, -
"USER-INTEL"_#USER-INTEL, optimized Intel CPU and KNL styles,"Section 5.3.2"_accelerate_intel.html, "Benchmarks"_http://lammps.sandia.gov/bench.html, -
Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.
KOKKOS_DEVICES sets the parallelization method used for Kokkos code
(within LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be
used. KOKKOS_DEVICES=Pthreads means that pthreads will be used.
KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.
If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE
directory must use "nvcc" as its compiler, via its CC setting. For
best performance its CCFLAGS setting should use -O3 and have a
KOKKOS_ARCH setting that matches the compute capability of your NVIDIA
hardware and software installation, e.g. KOKKOS_ARCH=Kepler30. Note
the minimal required compute capability is 2.0, but this will give
significantly reduced performance compared to Kepler generation GPUs
with compute capability 3.x. For the LINK setting, "nvcc" should not
be used; instead use g++ or another compiler suitable for linking C++
applications. Often you will want to use your MPI compiler wrapper
for this setting (i.e. mpicxx). Finally, the lo-level Makefile must
also have a "Compilation rule" for creating *.o files from *.cu files.
See src/Makefile.cuda for an example of a lo-level Makefile with all
of these settings.
KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not
necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
provides alternative methods via environment variables for binding
threads to hardware cores. More info on binding threads to cores is
given in "Section 5.3"_Section_accelerate.html#acc_3.
KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an
Intel Phi processor.
KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
on most Unix platforms. This library is not available on all
platforms.
KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
debugging information that can be useful. It also enables runtime
bounds checking on Kokkos data structures.
KOKKOS_CUDA_OPTIONS are additional options for CUDA.
For more information on Kokkos see the Kokkos programmers' guide here:
/lib/kokkos/doc/Kokkos_PG.pdf.
[Run with the KOKKOS package from the command line:]
The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node. E.g. the mpirun command in MPICH does this via
its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.
When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used (via the "-k" command-line
switch discussed below). Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.
When using the KOKKOS package built with device=CUDA, you must use
exactly one MPI task per physical GPU.
When using the KOKKOS package built with host=MIC for Intel Xeon Phi
coprocessor support you need to ensure there are one or more MPI tasks
per coprocessor, and choose the number of coprocessor threads to use
per MPI task (via the "-k" command-line switch discussed below). The
product of MPI tasks * coprocessor threads/task should not exceed the
maximum number of threads the coprocessor is designed to run,
otherwise performance will suffer. This value is 240 for current
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
threads/core. Note that with the KOKKOS package you do not need to
specify how many Phi coprocessors there are per node; each
coprocessor is simply treated as running some number of MPI tasks.
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj # 1 node, 16 MPI tasks/node, no multi-threading
mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre
To run using the KOKKOS package, use the "-k on", "-sf kk" and "-pk kokkos" "command-line switches"_Section_start.html#start_7 in your mpirun command.
You must use the "-k on" "command-line
switch"_Section_start.html#start_6 to enable the KOKKOS package. It
switch"_Section_start.html#start_7 to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are "documented
here"_Section_start.html#start_7. For OpenMP use:
-k on t Nt :pre
The "t Nt" option specifies how many OpenMP threads per MPI
task to use with a node. The default is Nt = 1, which is MPI-only mode.
Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer. If hyperthreading is enabled, then
the product of MPI tasks * OpenMP threads/task should not exceed the
physical number of cores * hardware threads.
The "-k on" switch also issues a "package kokkos" command (with no
additional arguments) which sets various KOKKOS options to default
values, as discussed on the "package"_package.html command doc page.
Use the "-sf kk" "command-line switch"_Section_start.html#start_6,
which will automatically append "kk" to styles that support it. Use
the "-pk kokkos" "command-line switch"_Section_start.html#start_6 if
you wish to change any of the default "package kokkos"_package.html
optionns set by the "-k on" "command-line
switch"_Section_start.html#start_6.
The "-sf kk" "command-line switch"_Section_start.html#start_7
will automatically append the "/kk" suffix to styles that support it.
In this manner no modification to the input script is needed. Alternatively,
one can run with the KOKKOS package by editing the input script as described below.
NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. However, when running on CPUs, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles. It can also be faster to use non-threaded communication.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:
mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj # Newton on, Half neighbor list, non-threaded comm :pre
If the "newton"_newton.html command is used in the input
script, it can also override the Newton flag defaults.
[Core and Thread Affinity:]
When using multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.
If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:
For binding threads with KOKKOS OpenMP, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. In general, for best performance
with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads.
For binding threads with the
KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option
as described below.
[Running on Knight's Landing (KNL) Intel Xeon Phi:]
Here is a quick overview of how to use the KOKKOS package
for the Intel Knight's Landing (KNL) Xeon Phi:
KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores
are reserved for the OS, and only 64 or 66 cores are used. Each core
has 4 hyperthreads, so there are effectively N = 256 (4*64) or
N = 264 (4*66) cores to run on. The product of MPI tasks * OpenMP threads/task should not exceed this limit,
otherwise performance will suffer. Note that with the KOKKOS package you do not need to
specify how many KNLs there are per node; each
KNL is simply treated as running some number of MPI tasks.
Examples of mpirun commands that follow these rules are shown below.
Intel KNL node with 68 cores (272 threads/node via 4x hardware threading):
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 64 MPI tasks/node, 4 threads/task
mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 66 MPI tasks/node, 4 threads/task
mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj # 1 node, 32 MPI tasks/node, 8 threads/task
mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 8 nodes, 64 MPI tasks/node, 4 threads/task :pre
The -np setting of the mpirun command sets the number of MPI
tasks/node. The "-k on t Nt" command-line switch sets the number of
threads/task as Nt. The product of these two values should be N, i.e.
256 or 264.
NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. When running on KNL, this
will typically be best for pair-wise potentials. For manybody potentials,
using "half" neighbor lists and setting the
Newton flag to "on" may be faster. It can also be faster to use non-threaded communication.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj # Newton off, full neighbor list, non-threaded comm
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax # Newton on, half neighbor list, non-threaded comm :pre
NOTE: MPI tasks and threads should be bound to cores as described above for CPUs.
NOTE: To build with Kokkos support for Intel Xeon Phi coprocessors such as Knight's Corner (KNC), your
system must be configured to use them in "native" mode, not "offload"
mode like the USER-INTEL package supports.
[Running on GPUs:]
Use the "-k" "command-line switch"_Section_commands.html#start_7 to
specify the number of GPUs per node. Typically the -np setting
of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
You can assign multiple MPI tasks to the same GPU with the
KOKKOS package, but this is usually only faster if significant portions
of the input script have not been ported to use Kokkos. Using CUDA MPS
is recommended in this scenario. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node should not exceed N.
-k on g Ng :pre
Here are examples of how to use the KOKKOS package for GPUs,
assuming one or more nodes, each with two GPUs:
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 2 GPUs/node
mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre
NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions, along with threaded communication.
When running on Maxwell or Kepler GPUs, this will typically be best. For Pascal GPUs,
using "half" neighbor lists and setting the
Newton flag to "on" may be faster. For many pair styles, setting the neighbor binsize
equal to the ghost atom cutoff will give speedup.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos binsize 2.8 -in in.lj # Set binsize = neighbor ghost cutoff
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighborlist, set binsize = neighbor ghost cutoff :pre
NOTE: For good performance of the KOKKOS package on GPUs, you must
have Kepler generation GPUs (or later). The Kokkos library exploits
texture cache options not supported by Tesla generation GPUs (or
older).
NOTE: When using a GPU, you will achieve the best performance if your
input script does not use fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU. Invoking a
non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU incurring a performance penalty.
NOTE: To get an accurate timing breakdown between time spent in pair,
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
However, this will reduce performance and is not recommended for production runs.
[Run with the KOKKOS package by editing an input script:]
Alternatively the effect of the "-sf" or "-pk" switches can be
duplicated by adding the "package kokkos"_package.html or "suffix
kk"_suffix.html commands to your input script.
The discussion above for building LAMMPS with the KOKKOS package, the mpirun/mpiexec command, and setting
appropriate numbers of threads is the same.
You must still use the "-k on" "command-line
switch"_Section_start.html#start_6 to enable the KOKKOS package, and
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.
Use the "suffix kk"_suffix.html command, or you can explicitly add a
You can use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.
pair_style lj/cut/kk 2.5 :pre
You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults, as set by the "-k on"
"command-line switch"_Section_start.html#start_6.
"command-line switch"_Section_start.html#start_7.
[Using OpenMP threading and CUDA together (experimental):]
With the KOKKOS package, both OpenMP multi-threading and GPUs can be used
together in a few special cases. In the Makefile, the KOKKOS_DEVICES variable must
include both "Cuda" and "OpenMP", as is the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi
KOKKOS_DEVICES=Cuda,OpenMP :pre
The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using the "-sf kk" in the command line gives the default CUDA version everywhere.
However, if the "/kk/host" suffix is added to a specific style in the input
script, the Kokkos OpenMP (CPU) version of that specific style will be used instead.
Set the number of OpenMP threads as "t Nt" and the number of GPUs as "g Ng"
-k on t Nt g Ng :pre
For example, the command to run with 1 GPU and 8 OpenMP threads is then:
mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk :pre
Conversely, if the "-sf kk/host" is used in the command line and then the
"/kk" or "/kk/device" suffix is added to a specific style in your input script,
then only that specific style will run on the GPU while everything else will
run on the CPU in OpenMP mode. Note that the execution of the CPU and GPU
styles will NOT overlap, except for a special case:
A kspace style and/or molecular topology (bonds, angles, etc.) running on
the host CPU can overlap with a pair style running on the GPU. First compile
with "--default-stream per-thread" added to CCFLAGS in the Kokkos CUDA Makefile.
Then explicitly use the "/kk/host" suffix for kspace and bonds, angles, etc.
in the input file and the "kk" suffix (equal to "kk/device") on the command line.
Also make sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
so CPU/GPU overlap can occur.
[Speed-ups to expect:]
@ -356,7 +363,7 @@ Generally speaking, the following rules of thumb apply:
When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However the difference between all 3 is small (less
than 20%). :ulb,l
When running on CPUs only, with multiple threads per MPI task,
@ -366,7 +373,7 @@ package. :l
When running large number of atoms per GPU, KOKKOS is typically faster
than the GPU package. :l
When running on Intel hardware, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l
:ule
@ -374,123 +381,78 @@ See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.
[Advanced Kokkos options:]
There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in Makefile.machine. This is the full list of options, including
those discussed above. Each takes a value shown below. The
default value is listed, which is set in the
/lib/kokkos/Makefile.kokkos file.