Merge branch 'master' into collected-small-changes

This commit is contained in:
Axel Kohlmeyer
2021-08-27 14:58:42 -04:00
19 changed files with 822 additions and 533 deletions

View File

@ -22,4 +22,5 @@ page.
Build_extras
Build_manual
Build_windows
Build_diskspace
Build_development

View File

@ -0,0 +1,45 @@
Notes for saving disk space when building LAMMPS from source
------------------------------------------------------------
LAMMPS is a large software project with a large number of source files,
extensive documentation, and a large collection of example files.
When downloading LAMMPS by cloning the
`git repository from GitHub <https://github.com/lammps/lammps>`_,
by default the entire commit history since September 2006 is downloaded as well.
Compiling LAMMPS will add the storage requirements of the compiled object
files and libraries to the tally.
In a user account on an HPC cluster with file system quotas, or in other
environments with restricted disk space capacity, it may be necessary to
reduce the storage requirements. Here are some suggestions:
- Create a so-called shallow repository that contains only the last commit
instead of the full project history by using ``git clone git@github.com:lammps/lammps --depth=1 --branch=master``.
This reduces the downloaded size to about half. With ``--depth=1`` it is not possible to check out different
versions/branches of LAMMPS; using ``--depth=1000`` makes multiple recent versions available at little
extra storage cost (the entire git history contained nearly 30,000 commits in fall 2021).
- Download a tar archive from either the `download section on the LAMMPS homepage <https://www.lammps.org/download.html>`_
or from the `LAMMPS releases page on GitHub <https://github.com/lammps/lammps/releases>`_; these archives do not
contain the git history at all.
- Build LAMMPS without the debug flag (remove ``-g`` from the machine makefile or use ``-DCMAKE_BUILD_TYPE=Release``)
or run the ``strip`` command on the LAMMPS executable once debugging is no longer needed. The ``strip`` command
may also be applied to the LAMMPS shared library. The static library may be deleted entirely.
- Delete compiled object files and libraries after copying the LAMMPS executable to a permanent location.
When using the traditional build process, one may use ``make clean-<machine>`` or ``make clean-all``
to delete object files in the src folder. For CMake based builds, one may use ``make clean`` or just
delete the entire build folder.
- The folders containing the documentation tree (doc) and the examples (examples) are not needed to build and
run LAMMPS and can be safely deleted. Some files in the potentials folder are large and may be deleted
if not needed. The largest of those files (occupying about 120 MBytes combined) are only downloaded on
demand, when the corresponding package is installed.
- When using the CMake build procedure, the compilation can be done on (local) scratch storage that does not
count toward the quota. A local scratch file system may offer the additional benefit of speeding up compilation
and linking compared to a networked file system. Also, with CMake (and unlike with the traditional make) it is
possible to compile LAMMPS executables with different settings and packages included from the same source tree,
since all the configuration information is stored in the build folder; it is thus not necessary to keep
multiple copies of the LAMMPS sources. A minimal workflow combining several of these suggestions is sketched below.
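Taken together, a minimal space-saving workflow might look like the following
sketch (the executable name ``lmp`` is the default for a CMake build; the
install location ``~/bin`` is just an example):

.. code-block:: bash

   # shallow clone: only the most recent commit, about half the download size
   git clone --depth=1 --branch=master https://github.com/lammps/lammps
   cd lammps
   # configure and build an optimized executable without debug info
   cmake -S cmake -B build -DCMAKE_BUILD_TYPE=Release
   cmake --build build -j 4
   # keep only the stripped executable, then reclaim the build space
   cp build/lmp ~/bin/
   strip ~/bin/lmp
   rm -rf build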

View File

@ -29,7 +29,7 @@ The following folks deserve special recognition. Many of the packages
they have written are unique for an MD code and LAMMPS would not be as
general-purpose as it is without their expertise and efforts.
* Metin Aktulga (MSU), REAXFF package for C version of ReaxFF
* Metin Aktulga (MSU), REAXFF package for C/C++ version of ReaxFF
* Mike Brown (Intel), GPU and INTEL packages
* Colin Denniston (U Western Ontario), LATBOLTZ package
* Georg Ganzenmuller (EMI), MACHDYN and SPH packages
@ -37,9 +37,10 @@ general-purpose as it is without their expertise and efforts.
* Reese Jones (Sandia) and colleagues, ATC package for atom/continuum coupling
* Christoph Kloss (DCS Computing), LIGGGHTS code for granular materials, built on top of LAMMPS
* Rudra Mukherjee (JPL), POEMS package for articulated rigid body motion
* Trung Nguyen (Northwestern U), GPU and RIGID and BODY packages
* Trung Nguyen (Northwestern U), GPU, RIGID, BODY, and DIELECTRIC packages
* Mike Parks (Sandia), PERI package for Peridynamics
* Roy Pollock (LLNL), Ewald and PPPM solvers
* Julien Tranchida (Sandia), SPIN package
* Christian Trott (Sandia), CUDA and KOKKOS packages
* Ilya Valuev (JIHT), AWPMD package for wave packet MD
* Greg Wagner (Northwestern U), MEAM package for MEAM potential

View File

@ -27,19 +27,19 @@ General features
* distributed memory message-passing parallelism (MPI)
* shared memory multi-threading parallelism (OpenMP)
* spatial decomposition of simulation domain for MPI parallelism
* particle decomposition inside of spatial decomposition for OpenMP parallelism
* particle decomposition inside of spatial decomposition for OpenMP and GPU parallelism
* GPLv2 licensed open-source distribution
* highly portable C++-11
* modular code with most functionality in optional packages
* only depends on MPI library for basic parallel functionality
* only depends on MPI library for basic parallel functionality, MPI stub for serial compilation
* other libraries are optional and only required for specific packages
* GPU (CUDA and OpenCL), Intel Xeon Phi, and OpenMP support for many code features
* GPU (CUDA, OpenCL, HIP, SYCL), Intel Xeon Phi, and OpenMP support for many code features
* easy to extend with new features and functionality
* runs from an input script
* syntax for defining and using variables and formulas
* syntax for looping over runs and breaking out of loops
* run one or multiple simulations simultaneously (in parallel) from one script
* build as library, invoke LAMMPS through library interface or provided Python wrapper
* build as library, invoke LAMMPS through library interface, provided Python wrapper, or SWIG-based wrappers
* couple with other codes: LAMMPS calls other code, other code calls LAMMPS, umbrella code calls both
.. _particle:
@ -58,8 +58,10 @@ Particle and model types
* coarse-grained mesoscale models
* finite-size spherical and ellipsoidal particles
* finite-size line segment (2d) and triangle (3d) particles
* finite-size rounded polygon (2d) and polyhedron (3d) particles
* point dipole particles
* rigid collections of particles
* particles with magnetic spin
* rigid collections of n particles
* hybrid combinations of these
.. _ff:
@ -74,24 +76,28 @@ commands)
* pairwise potentials: Lennard-Jones, Buckingham, Morse, Born-Mayer-Huggins, Yukawa, soft, class 2 (COMPASS), hydrogen bond, tabulated
* charged pairwise potentials: Coulombic, point-dipole
* many-body potentials: EAM, Finnis/Sinclair EAM, modified EAM (MEAM), embedded ion method (EIM), EDIP, ADP, Stillinger-Weber, Tersoff, REBO, AIREBO, ReaxFF, COMB, SNAP, Streitz-Mintmire, 3-body polymorphic
* long-range interactions for charge, point-dipoles, and LJ dispersion: Ewald, Wolf, PPPM (similar to particle-mesh Ewald)
* many-body potentials: EAM, Finnis/Sinclair EAM, modified EAM (MEAM), embedded ion method (EIM), EDIP, ADP, Stillinger-Weber, Tersoff, REBO, AIREBO, ReaxFF, COMB, Streitz-Mintmire, 3-body polymorphic, BOP, Vashishta
* machine learning potentials: SNAP, GAP, ACE, N2P2, RANN, AGNI
* long-range interactions for charge, point-dipoles, and LJ dispersion: Ewald, Wolf, PPPM (similar to particle-mesh Ewald), MSM
* polarization models: :doc:`QEq <fix_qeq>`, :doc:`core/shell model <Howto_coreshell>`, :doc:`Drude dipole model <Howto_drude>`
* charge equilibration (QEq via dynamic, point, shielded, Slater methods)
* coarse-grained potentials: DPD, GayBerne, REsquared, colloidal, DLVO
* mesoscopic potentials: granular, Peridynamics, SPH
* mesoscopic potentials: granular, Peridynamics, SPH, mesoscopic tubular potential (MESONT)
* semi-empirical potentials: multi-ion generalized pseudopotential theory (MGPT), second moment tight binding + QEq (SMTB-Q), density functional tight-binding (LATTE)
* electron force field (eFF, AWPMD)
* bond potentials: harmonic, FENE, Morse, nonlinear, class 2, quartic (breakable)
* angle potentials: harmonic, CHARMM, cosine, cosine/squared, cosine/periodic, class 2 (COMPASS)
* dihedral potentials: harmonic, CHARMM, multi-harmonic, helix, class 2 (COMPASS), OPLS
* improper potentials: harmonic, cvff, umbrella, class 2 (COMPASS)
* bond potentials: harmonic, FENE, Morse, nonlinear, class 2, quartic (breakable), tabulated
* angle potentials: harmonic, CHARMM, cosine, cosine/squared, cosine/periodic, class 2 (COMPASS), tabulated
* dihedral potentials: harmonic, CHARMM, multi-harmonic, helix, class 2 (COMPASS), OPLS, tabulated
* improper potentials: harmonic, cvff, umbrella, class 2 (COMPASS), tabulated
* polymer potentials: all-atom, united-atom, bead-spring, breakable
* water potentials: TIP3P, TIP4P, SPC
* water potentials: TIP3P, TIP4P, SPC, SPC/E and variants
* interlayer potentials for graphene and analogues
* metal-organic framework potentials (QuickFF, MOF-FF)
* implicit solvent potentials: hydrodynamic lubrication, Debye
* force-field compatibility with common CHARMM, AMBER, DREIDING, OPLS, GROMACS, COMPASS options
* access to the `OpenKIM Repository <http://openkim.org>`_ of potentials via :doc:`kim command <kim_commands>`
* hybrid potentials: multiple pair, bond, angle, dihedral, improper potentials can be used in one simulation
* overlaid potentials: superposition of multiple pair potentials
* overlaid potentials: superposition of multiple pair potentials (including many-body) with optional scale factor
.. _create:
@ -124,9 +130,10 @@ Ensembles, constraints, and boundary conditions
* harmonic (umbrella) constraint forces
* rigid body constraints
* SHAKE bond and angle constraints
* Monte Carlo bond breaking, formation, swapping
* motion constraints to manifold surfaces
* Monte Carlo bond breaking, formation, swapping, template-based reaction modeling
* atom/molecule insertion and deletion
* walls of various kinds
* walls of various kinds, static and moving
* non-equilibrium molecular dynamics (NEMD)
* variety of additional boundary conditions and constraints
@ -150,6 +157,7 @@ Diagnostics
^^^^^^^^^^^
* see various flavors of the :doc:`fix <fix>` and :doc:`compute <compute>` commands
* introspection command for system, simulation, and compile time settings and configurations
.. _output:
@ -164,8 +172,9 @@ Output
* parallel I/O of dump and restart files
* per-atom quantities (energy, stress, centro-symmetry parameter, CNA, etc)
* user-defined system-wide (log file) or per-atom (dump file) calculations
* spatial and time averaging of per-atom quantities
* time averaging of system-wide quantities
* custom partitioning (chunks) for binning, and static or dynamic grouping of atoms for analysis
* spatial, time, and per-chunk averaging of per-atom quantities
* time averaging and histogramming of system-wide quantities
* atom snapshots in native, XYZ, XTC, DCD, CFG formats
.. _replica1:
@ -178,7 +187,7 @@ Multi-replica models
* :doc:`parallel replica dynamics <prd>`
* :doc:`temperature accelerated dynamics <tad>`
* :doc:`parallel tempering <temper>`
* :doc:`path-integral MD <fix_pimd>`
* path-integral MD: :doc:`first variant <fix_pimd>`, :doc:`second variant <fix_ipi>`
* multi-walker collective variables with :doc:`Colvars <fix_colvars>` and :doc:`Plumed <fix_plumed>`
.. _prepost:
@ -210,11 +219,12 @@ page for details.
These are LAMMPS capabilities which you may not think of as typical
classical MD options:
* :doc:`static <balance>` and :doc:`dynamic load-balancing <fix_balance>`
* :doc:`static <balance>` and :doc:`dynamic load-balancing <fix_balance>`, optionally with recursive bisectioning decomposition
* :doc:`generalized aspherical particles <Howto_body>`
* :doc:`stochastic rotation dynamics (SRD) <fix_srd>`
* :doc:`real-time visualization and interactive MD <fix_imd>`
* :doc:`real-time visualization and interactive MD <fix_imd>`, :doc:`built-in renderer for images and movies <dump_image>`
* calculate :doc:`virtual diffraction patterns <compute_xrd>`
* calculate :doc:`finite temperature phonon dispersion <fix_phonon>` and the :doc:`dynamical matrix of minimized structures <dynamical_matrix>`
* :doc:`atom-to-continuum coupling <fix_atc>` with finite elements
* coupled rigid body integration via the :doc:`POEMS <fix_poems>` library
* :doc:`QM/MM coupling <fix_qmmm>`

View File

@ -1,40 +1,61 @@
LAMMPS open-source license
--------------------------
LAMMPS is a freely-available open-source code, distributed under the
terms of the `GNU Public License Version 2 <gpl_>`_, which means you can
use or modify the code however you wish for your own purposes, but have
to adhere to certain rules when redistributing it or software derived
GPL version of LAMMPS
^^^^^^^^^^^^^^^^^^^^^
LAMMPS is an open-source code, available free-of-charge, and distributed
under the terms of the `GNU Public License Version 2 <gpl_>`_ (GPLv2),
which means you can use or modify the code however you wish for your own
purposes, but you have to adhere to certain rules when redistributing it -
specifically in binary form - or when distributing software derived
from it or that includes parts of it.
LAMMPS comes with no warranty of any kind. As each source file states
in its header, it is a copyrighted code that is distributed free-of-
charge, under the terms of the `GNU Public License Version 2 <gpl_>`_
(GPLv2). This is often referred to as open-source distribution - see
`www.gnu.org <gnuorg_>`_ or `www.opensource.org <opensource_>`_. The
legal text of the GPL is in the LICENSE file included in the LAMMPS
distribution.
LAMMPS comes with no warranty of any kind.
As each source file states in its header, it is a copyrighted code, and
thus not in the public domain. For more information about open-source
software and open-source distribution, see `www.gnu.org <gnuorg_>`_
or `www.opensource.org <opensource_>`_. The legal text of the GPL as it
applies to LAMMPS is in the LICENSE file included in the LAMMPS distribution.
.. _gpl: https://github.com/lammps/lammps/blob/master/LICENSE
.. _lgpl: https://www.gnu.org/licenses/old-licenses/lgpl-2.1.html
.. _gnuorg: http://www.gnu.org
.. _opensource: http://www.opensource.org
Here is a summary of what the GPL means for LAMMPS users:
Here is a more specific summary of what the GPL means for LAMMPS users:
(1) Anyone is free to use, modify, or extend LAMMPS in any way they
(1) Anyone is free to use, copy, modify, or extend LAMMPS in any way they
choose, including for commercial purposes.
(2) If you **distribute** a modified version of LAMMPS, it must remain
open-source, meaning you distribute **all** of it under the terms of
the GPL. You should clearly annotate such a code as a derivative version
of LAMMPS.
open-source, meaning you are required to distribute **all** of it under
the terms of the GPL. You should clearly annotate such a modified code
as a derivative version of LAMMPS.
(3) If you release any code that includes or uses LAMMPS source code,
then it must also be open-sourced, meaning you distribute it under
the terms of the GPL.
the terms of the GPL. You may write code that interfaces LAMMPS to
a differently licensed library. In that case the code that provides
the interface must be licensed GPL, but not necessarily that library
unless you are distributing binaries that require the library to run.
(4) If you give LAMMPS files to someone else, the GPL LICENSE file and
source file headers (including the copyright and GPL notices) should
remain part of the code.
LGPL version of LAMMPS
^^^^^^^^^^^^^^^^^^^^^^
We occasionally make stable LAMMPS releases available under the `GNU
Lesser General Public License v2.1 <lgpl_>`_. This is on request only and
with non-LGPL compliant files removed. This allows linking non-GPL
compatible software with the (otherwise unmodified) LAMMPS library
or loading it dynamically at runtime. Any **modifications** to
the LAMMPS code, however, even in the LGPL licensed version, must still
be made available under the same open-source terms as LAMMPS itself.

View File

@ -10,24 +10,26 @@ conditions. It can model 2d or 3d systems with only a few particles
up to millions or billions.
LAMMPS can be built and run on a laptop or desktop machine, but is
designed for parallel computers. It will run on any parallel machine
that supports the `MPI <mpi_>`_ message-passing library. This includes
shared-memory boxes and distributed-memory clusters and
supercomputers.
designed for parallel computers. It will run in serial and on any
parallel machine that supports the `MPI <mpi_>`_ message-passing
library. This includes shared-memory boxes and distributed-memory
clusters and supercomputers. Parts of LAMMPS also support
`OpenMP multi-threading <omp_>`_, vectorization, and GPU acceleration.
.. _mpi: https://en.wikipedia.org/wiki/Message_Passing_Interface
.. _lws: https://www.lammps.org
.. _omp: https://www.openmp.org
LAMMPS is written in C++ and requires a compiler that is at least
compatible with the C++-11 standard.
Earlier versions were written in F77 and F90. See the `History page
compatible with the C++-11 standard. Earlier versions were written in
F77, F90, and C++-98. See the `History page
<https://www.lammps.org/history.html>`_ of the website for details. All
versions can be downloaded from the `LAMMPS website <lws_>`_.
versions can be downloaded as source code from the `LAMMPS website
<lws_>`_.
LAMMPS is designed to be easy to modify or extend with new
capabilities, such as new force fields, atom types, boundary
conditions, or diagnostics. See the :doc:`Modify <Modify>` page for
more details.
LAMMPS is designed to be easy to modify or extend with new capabilities,
such as new force fields, atom types, boundary conditions, or
diagnostics. See the :doc:`Modify <Modify>` page for more details.
In the most general sense, LAMMPS integrates Newton's equations of
motion for a collection of interacting particles. A single particle
@ -47,4 +49,5 @@ MPI parallelization to partition the simulation domain into small
sub-domains of equal computational cost, one of which is assigned to
each processor. Processors communicate and store "ghost" atom
information for atoms that border their sub-domain. Multi-threading
parallelization with with particle-decomposition can be used in addition.
parallelization and GPU acceleration with particle decomposition
can be used in addition.

View File

@ -2,12 +2,21 @@ What does a LAMMPS version mean
-------------------------------
The LAMMPS "version" is the date when it was released, such as 1 May
2014. LAMMPS is updated continuously. Whenever we fix a bug or add a
feature, we release it in the next *patch* release, which are
typically made every couple of weeks. Info on patch releases are on
`this website page <https://www.lammps.org/bug.html>`_. Every few
months, the latest patch release is subjected to more thorough testing
and labeled as a *stable* version.
2014. LAMMPS is updated continuously and we aim to keep it working
correctly and reliably at all times. You can follow its development
in a public `git repository on GitHub <https://github.com/lammps/lammps>`_.
Whenever we fix a bug or update or add a feature, it will be merged into
the ``master`` branch of the git repository. When a sufficient number of
changes have accumulated *and* the software passes a set of automated
tests, we release it as the next *patch* release; patch releases are made
every few weeks. Information on patch releases is on `this website page
<https://www.lammps.org/bug.html>`_.
Once or twice a year, only bug fixes and small, non-intrusive changes are
included for a period of time, and the code is subjected to more detailed
and thorough testing than the default automated testing. The latest
patch release after such a period is then labeled as a *stable* version.
Each version of LAMMPS contains all the features and bug-fixes up to
and including its version date.

View File

@ -19,7 +19,7 @@ Syntax
bondmax = length of longest bond in the system (in length units)
tlimit = elapsed CPU time (in seconds)
diskfree = free disk space (in megabytes)
diskfree = free disk space (in MBytes)
v_name = name of :doc:`equal-style variable <variable>`
* operator = "<" or "<=" or ">" or ">=" or "==" or "!=" or "\|\^"
@ -81,7 +81,7 @@ the timer frequently across a large number of processors may be
non-negligible.
The *diskfree* attribute will check for available disk space (in
megabytes) on supported operating systems. By default it will
MBytes) on supported operating systems. By default it will
check the file system of the current working directory. This
can be changed with the optional *path* keyword, which will take
the path to a file or folder on the file system to be checked
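Below is a hypothetical example of how this attribute could be used in an
input script (the check interval, threshold, and path are placeholders):

.. code-block:: LAMMPS

   # stop the run cleanly when the scratch file system has less
   # than 1024 MBytes (1 GByte) of free disk space
   fix stopper all halt 1000 diskfree < 1024 path /scratch/myjob error soft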

View File

@ -128,7 +128,7 @@ spectrum while consumes more memory. With fixed *f_max* and
:math:`\gamma`, *N_f* should be big enough to converge the classical
temperature :math:`T^{cl}` as a function of target quantum bath
temperature. Memory usage per processor could be from 10 to 100
Mbytes.
MBytes.
.. note::

View File

@ -135,7 +135,7 @@ with #) anywhere. Each non-blank non-comment line must contain one
keyword/value pair. The required keywords are *rcutfac* and
*twojmax*\ . Optional keywords are *rfac0*, *rmin0*,
*switchflag*, *bzeroflag*, *quadraticflag*, *chemflag*,
*bnormflag*, *wselfallflag*, and *chunksize*\ .
*bnormflag*, *wselfallflag*, *chunksize*, and *parallelthresh*\ .
The default values for these keywords are
@ -147,7 +147,8 @@ The default values for these keywords are
* *chemflag* = 0
* *bnormflag* = 0
* *wselfallflag* = 0
* *chunksize* = 4096
* *chunksize* = 32768
* *parallelthresh* = 8192
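For illustration, here is how the two GPU tuning keywords would appear as
keyword/value pairs in a hypothetical SNAP parameter file (the values shown
are just the defaults listed above)::

   # GPU tuning keywords; only used by pair style snap with the
   # KOKKOS package, ignored otherwise
   chunksize 32768
   parallelthresh 8192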
If *quadraticflag* is set to 1, then the SNAP energy expression includes
additional quadratic terms that have been shown to increase the overall
@ -188,14 +189,24 @@ corresponding *K*-vector of linear coefficients for element
which must equal the number of unique elements appearing in the LAMMPS
pair_coeff command, to avoid ambiguity in the number of coefficients.
The keyword *chunksize* is only applicable when using the
pair style *snap* with the KOKKOS package and is ignored otherwise.
This keyword controls
The keywords *chunksize* and *parallelthresh* are only applicable when
using the pair style *snap* with the KOKKOS package on GPUs and are
ignored otherwise.
The *chunksize* keyword controls
the number of atoms in each pass used to compute the bispectrum
components and is used to avoid running out of memory. For example,
if there are 8192 atoms in the simulation and the *chunksize*
is set to 4096, the bispectrum calculation will be broken up
into two passes.
into two passes (running on a single GPU).
The *parallelthresh* keyword controls
a crossover threshold for enabling extra parallelism. For
small systems, exposing additional parallelism can be beneficial when
there is otherwise not enough work to fully saturate the GPU threads.
However, the extra parallelism also leads to more divergence
and can hurt performance when the system is already large enough to
saturate the GPU threads. Extra parallelism will therefore only be used
if the *chunksize* (or total number of atoms per GPU) is smaller than
*parallelthresh*.
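The resulting dispatch can be summarized in a small, self-contained C++
sketch (the function names are illustrative stand-ins, not the actual
KOKKOS kernels in LAMMPS):

.. code-block:: c++

   #include <cstdio>

   // illustrative stand-ins for the two GPU code paths described above
   void launch_small_kernels() { std::puts("extra parallelism over j_bend"); }
   void launch_large_kernels() { std::puts("loop over j_bend inside the kernel"); }

   // crossover rule: use the extra-parallelism path only when the per-GPU
   // chunk of atoms is too small to otherwise saturate the GPU threads
   void dispatch(int chunk_size, int parallel_thresh)
   {
     if (chunk_size < parallel_thresh) launch_small_kernels();
     else launch_large_kernels();
   }

   int main()
   {
     dispatch(4096, 8192);  // small system: more parallelism, more divergence
     dispatch(65536, 8192); // large system: already saturated, avoid divergence
     return 0;
   }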
Detailed definitions for all the other keywords
are given on the :doc:`compute sna/atom <compute_sna_atom>` doc page.

View File

@ -1174,6 +1174,7 @@ googletest
Gordan
Goudeau
GPa
GPL
gpu
gpuID
gpus
@ -1689,6 +1690,7 @@ Lett
Leuven
Leven
Lewy
LGPL
lgvdw
Liang
libatc
@ -1889,7 +1891,6 @@ maxX
Mayergoyz
Mayoral
mbt
Mbytes
MBytes
mc
McLachlan

View File

@ -44,7 +44,8 @@ struct TagPairSNAPComputeForce{};
struct TagPairSNAPComputeNeigh{};
struct TagPairSNAPComputeCayleyKlein{};
struct TagPairSNAPPreUi{};
struct TagPairSNAPComputeUi{};
struct TagPairSNAPComputeUiSmall{}; // more parallelism, more divergence
struct TagPairSNAPComputeUiLarge{}; // less parallelism, no divergence
struct TagPairSNAPTransformUi{}; // re-order ulisttot from SoA to AoSoA, zero ylist
struct TagPairSNAPComputeZi{};
struct TagPairSNAPBeta{};
@ -53,7 +54,9 @@ struct TagPairSNAPTransformBi{}; // re-order blist from AoSoA to AoS
struct TagPairSNAPComputeYi{};
struct TagPairSNAPComputeYiWithZlist{};
template<int dir>
struct TagPairSNAPComputeFusedDeidrj{};
struct TagPairSNAPComputeFusedDeidrjSmall{}; // more parallelism, more divergence
template<int dir>
struct TagPairSNAPComputeFusedDeidrjLarge{}; // less parallelism, no divergence
// CPU backend only
struct TagPairSNAPComputeNeighCPU{};
@ -143,7 +146,10 @@ public:
void operator() (TagPairSNAPPreUi,const int iatom_mod, const int j, const int iatom_div) const;
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPComputeUi,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUi>::member_type& team) const;
void operator() (TagPairSNAPComputeUiSmall,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUiSmall>::member_type& team) const;
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPComputeUiLarge,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUiLarge>::member_type& team) const;
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPTransformUi,const int iatom_mod, const int j, const int iatom_div) const;
@ -168,7 +174,11 @@ public:
template<int dir>
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPComputeFusedDeidrj<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrj<dir> >::member_type& team) const;
void operator() (TagPairSNAPComputeFusedDeidrjSmall<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrjSmall<dir> >::member_type& team) const;
template<int dir>
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPComputeFusedDeidrjLarge<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrjLarge<dir> >::member_type& team) const;
// CPU backend only
KOKKOS_INLINE_FUNCTION

View File

@ -341,18 +341,32 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::compute(int eflag_in,
// ComputeUi w/vector parallelism, shared memory, direct atomicAdd into ulisttot
{
// team_size_compute_ui is defined in `pair_snap_kokkos.h`
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
const int tile_size = vector_length * (twojmax + 1);
const int scratch_size = scratch_size_helper<complex>(team_size_compute_ui * tile_size);
if (chunk_size < parallel_thresh)
{
// Version with parallelism over j_bend
// total number of teams needed: (natoms / 32) * (max_neighs) * ("bend" locations)
const int n_teams = chunk_size_div * max_neighs * (twojmax + 1);
const int n_teams_div = (n_teams + team_size_compute_ui - 1) / team_size_compute_ui;
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUi> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUiSmall> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
policy_ui = policy_ui.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeUi",policy_ui,*this);
Kokkos::parallel_for("ComputeUiSmall",policy_ui,*this);
} else {
// Version w/out parallelism over j_bend
// total number of teams needed: (natoms / 32) * (max_neighs)
const int n_teams = chunk_size_div * max_neighs;
const int n_teams_div = (n_teams + team_size_compute_ui - 1) / team_size_compute_ui;
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUiLarge> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
policy_ui = policy_ui.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeUiLarge",policy_ui,*this);
}
}
//TransformUi: un-"fold" ulisttot, zero ylist
@ -412,25 +426,51 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::compute(int eflag_in,
const int tile_size = vector_length * (twojmax + 1);
const int scratch_size = scratch_size_helper<complex>(2 * team_size_compute_fused_deidrj * tile_size);
if (chunk_size < parallel_thresh)
{
// Version with parallelism over j_bend
// total number of teams needed: (natoms / 32) * (max_neighs) * ("bend" locations)
const int n_teams = chunk_size_div * max_neighs * (twojmax + 1);
const int n_teams_div = (n_teams + team_size_compute_fused_deidrj - 1) / team_size_compute_fused_deidrj;
// x direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_x = policy_fused_deidrj_x.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrj<0>",policy_fused_deidrj_x,*this);
Kokkos::parallel_for("ComputeFusedDeidrjSmall<0>",policy_fused_deidrj_x,*this);
// y direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_y = policy_fused_deidrj_y.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrj<1>",policy_fused_deidrj_y,*this);
Kokkos::parallel_for("ComputeFusedDeidrjSmall<1>",policy_fused_deidrj_y,*this);
// z direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_z = policy_fused_deidrj_z.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrj<2>",policy_fused_deidrj_z,*this);
Kokkos::parallel_for("ComputeFusedDeidrjSmall<2>",policy_fused_deidrj_z,*this);
} else {
// Version w/out parallelism over j_bend
// total number of teams needed: (natoms / 32) * (max_neighs)
const int n_teams = chunk_size_div * max_neighs;
const int n_teams_div = (n_teams + team_size_compute_fused_deidrj - 1) / team_size_compute_fused_deidrj;
// x direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_x = policy_fused_deidrj_x.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrjLarge<0>",policy_fused_deidrj_x,*this);
// y direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_y = policy_fused_deidrj_y.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrjLarge<1>",policy_fused_deidrj_y,*this);
// z direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_z = policy_fused_deidrj_z.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrjLarge<2>",policy_fused_deidrj_z,*this);
}
}
#endif // LMP_KOKKOS_GPU
@ -603,13 +643,13 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
const auto idxb = icoeff % idxb_max;
const auto idx_chem = icoeff / idxb_max;
auto bveci = my_sna.blist(idxb, idx_chem, ii);
real_type bveci = my_sna.blist(ii, idx_chem, idxb);
d_beta_pack(iatom_mod,icoeff,iatom_div) += d_coeffi[k]*bveci;
k++;
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
const auto jdxb = jcoeff % idxb_max;
const auto jdx_chem = jcoeff / idxb_max;
real_type bvecj = my_sna.blist(jdxb, jdx_chem, ii);
real_type bvecj = my_sna.blist(ii, jdx_chem, jdxb);
d_beta_pack(iatom_mod,icoeff,iatom_div) += d_coeffi[k]*bvecj;
d_beta_pack(iatom_mod,jcoeff,iatom_div) += d_coeffi[k]*bveci;
k++;
@ -736,7 +776,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUi,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUi>::member_type& team) const {
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUiSmall,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUiSmall>::member_type& team) const {
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
// extract flattened atom_div / neighbor number / bend location
@ -756,11 +796,37 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
const int ninside = d_ninside(ii);
if (jj >= ninside) return;
my_sna.compute_ui(team,iatom_mod, jbend, jj, iatom_div);
my_sna.compute_ui_small(team, iatom_mod, jbend, jj, iatom_div);
});
}
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUiLarge,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUiLarge>::member_type& team) const {
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
// extract flattened atom_div / neighbor number / bend location
int flattened_idx = team.team_rank() + team.league_rank() * team_size_compute_ui;
// extract neighbor index, iatom_div
int iatom_div = flattened_idx / max_neighs; // removed "const" to work around GCC 7 bug
int jj = flattened_idx - iatom_div * max_neighs;
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, vector_length),
[&] (const int iatom_mod) {
const int ii = iatom_mod + vector_length * iatom_div;
if (ii >= chunk_size) return;
const int ninside = d_ninside(ii);
if (jj >= ninside) return;
my_sna.compute_ui_large(team,iatom_mod, jj, iatom_div);
});
}
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPTransformUi,const int iatom_mod, const int idxu, const int iatom_div) const {
@ -861,9 +927,9 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int itriple = 0; itriple < ntriples; itriple++) {
const auto blocal = my_sna.blist_pack(iatom_mod, idxb, itriple, iatom_div);
const real_type blocal = my_sna.blist_pack(iatom_mod, idxb, itriple, iatom_div);
my_sna.blist(idxb, itriple, iatom) = blocal;
my_sna.blist(iatom, itriple, idxb) = blocal;
}
}
@ -871,7 +937,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
template<class DeviceType, typename real_type, int vector_length>
template<int dir>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrj<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrj<dir> >::member_type& team) const {
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrjSmall<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrjSmall<dir> >::member_type& team) const {
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
// extract flattened atom_div / neighbor number / bend location
@ -891,12 +957,38 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
const int ninside = d_ninside(ii);
if (jj >= ninside) return;
my_sna.template compute_fused_deidrj<dir>(team, iatom_mod, jbend, jj, iatom_div);
my_sna.template compute_fused_deidrj_small<dir>(team, iatom_mod, jbend, jj, iatom_div);
});
}
template<class DeviceType, typename real_type, int vector_length>
template<int dir>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrjLarge<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrjLarge<dir> >::member_type& team) const {
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
// extract flattened atom_div / neighbor number / bend location
int flattened_idx = team.team_rank() + team.league_rank() * team_size_compute_fused_deidrj;
// extract neighbor index, iatom_div
int iatom_div = flattened_idx / max_neighs; // removed "const" to work around GCC 7 bug
int jj = flattened_idx - max_neighs * iatom_div;
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, vector_length),
[&] (const int iatom_mod) {
const int ii = iatom_mod + vector_length * iatom_div;
if (ii >= chunk_size) return;
const int ninside = d_ninside(ii);
if (jj >= ninside) return;
my_sna.template compute_fused_deidrj_large<dir>(team, iatom_mod, jj, iatom_div);
});
}
/* ----------------------------------------------------------------------
Begin routines that are unique to the CPU codepath. These do not take
advantage of AoSoA data layouts, but that could be a good point of
@ -925,13 +1017,13 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
const auto idxb = icoeff % idxb_max;
const auto idx_chem = icoeff / idxb_max;
auto bveci = my_sna.blist(idxb,idx_chem,ii);
real_type bveci = my_sna.blist(ii,idx_chem,idxb);
d_beta(icoeff,ii) += d_coeffi[k]*bveci;
k++;
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
const auto jdxb = jcoeff % idxb_max;
const auto jdx_chem = jcoeff / idxb_max;
auto bvecj = my_sna.blist(jdxb,jdx_chem,ii);
real_type bvecj = my_sna.blist(ii,jdx_chem,jdxb);
d_beta(icoeff,ii) += d_coeffi[k]*bvecj;
d_beta(jcoeff,ii) += d_coeffi[k]*bveci;
k++;
@ -1221,7 +1313,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
const auto idxb = icoeff % idxb_max;
const auto idx_chem = icoeff / idxb_max;
evdwl += d_coeffi[icoeff+1]*my_sna.blist(idxb,idx_chem,ii);
evdwl += d_coeffi[icoeff+1]*my_sna.blist(ii,idx_chem,idxb);
}
// quadratic contributions
@ -1230,12 +1322,12 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
const auto idxb = icoeff % idxb_max;
const auto idx_chem = icoeff / idxb_max;
auto bveci = my_sna.blist(idxb,idx_chem,ii);
real_type bveci = my_sna.blist(ii,idx_chem,idxb);
evdwl += 0.5*d_coeffi[k++]*bveci*bveci;
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
auto jdxb = jcoeff % idxb_max;
auto jdx_chem = jcoeff / idxb_max;
auto bvecj = my_sna.blist(jdxb,jdx_chem,ii);
auto bvecj = my_sna.blist(ii,jdx_chem,jdxb);
evdwl += d_coeffi[k++]*bveci*bvecj;
}
}

View File

@ -45,12 +45,12 @@ struct WignerWrapper {
{ ; }
KOKKOS_INLINE_FUNCTION
complex get(const int& ma) {
complex get(const int& ma) const {
return complex(buffer[offset + 2 * vector_length * ma], buffer[offset + vector_length + 2 * vector_length * ma]);
}
KOKKOS_INLINE_FUNCTION
void set(const int& ma, const complex& store) {
void set(const int& ma, const complex& store) const {
buffer[offset + 2 * vector_length * ma] = store.re;
buffer[offset + vector_length + 2 * vector_length * ma] = store.im;
}
@ -122,8 +122,14 @@ inline
void compute_cayley_klein(const int&, const int&, const int&);
KOKKOS_INLINE_FUNCTION
void pre_ui(const int&, const int&, const int&, const int&); // ForceSNAP
// version of the code with parallelism over j_bend
KOKKOS_INLINE_FUNCTION
void compute_ui(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); // ForceSNAP
void compute_ui_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); // ForceSNAP
// version of the code without parallelism over j_bend
KOKKOS_INLINE_FUNCTION
void compute_ui_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int); // ForceSNAP
KOKKOS_INLINE_FUNCTION
void compute_zi(const int&, const int&, const int&); // ForceSNAP
KOKKOS_INLINE_FUNCTION
@ -135,6 +141,35 @@ inline
KOKKOS_INLINE_FUNCTION
void compute_bi(const int&, const int&, const int&); // ForceSNAP
// functions for derivatives, GPU only
// version of the code with parallelism over j_bend
template<int dir>
KOKKOS_INLINE_FUNCTION
void compute_fused_deidrj_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); //ForceSNAP
// version of the code without parallelism over j_bend
template<int dir>
KOKKOS_INLINE_FUNCTION
void compute_fused_deidrj_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int); //ForceSNAP
// core "evaluation" functions that get plugged into "compute" functions
// plugged into compute_ui_small, compute_ui_large
KOKKOS_FORCEINLINE_FUNCTION
void evaluate_ui_jbend(const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&, const int&,
const int&, const int&, const int&);
// plugged into compute_zi, compute_yi
KOKKOS_FORCEINLINE_FUNCTION
complex evaluate_zi(const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&,
const int&, const int&, const int&, const int&, const real_type*);
// plugged into compute_yi, compute_yi_with_zlist
KOKKOS_FORCEINLINE_FUNCTION
real_type evaluate_beta_scaled(const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&,
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &);
// plugged into compute_fused_deidrj_small, compute_fused_deidrj_large
KOKKOS_FORCEINLINE_FUNCTION
real_type evaluate_duidrj_jbend(const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&,
const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&,
const int&, const int&, const int&, const int&);
// functions for bispectrum coefficients, CPU only
KOKKOS_INLINE_FUNCTION
void pre_ui_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team,const int&,const int&); // ForceSNAP
@ -148,11 +183,6 @@ inline
KOKKOS_INLINE_FUNCTION
void compute_bi_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int); // ForceSNAP
// functions for derivatives, GPU only
template<int dir>
KOKKOS_INLINE_FUNCTION
void compute_fused_deidrj(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); //ForceSNAP
// functions for derivatives, CPU only
KOKKOS_INLINE_FUNCTION
void compute_duidrj_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int, int); //ForceSNAP
@ -168,23 +198,6 @@ inline
KOKKOS_INLINE_FUNCTION
void compute_s_dsfac(const real_type, const real_type, real_type&, real_type&); // compute_cayley_klein
static KOKKOS_FORCEINLINE_FUNCTION
void sincos_wrapper(double x, double* sin_, double *cos_) {
#ifdef __SYCL_DEVICE_ONLY__
*sin_ = sycl::sincos(x, cos_);
#else
sincos(x, sin_, cos_);
#endif
}
static KOKKOS_FORCEINLINE_FUNCTION
void sincos_wrapper(float x, float* sin_, float *cos_) {
#ifdef __SYCL_DEVICE_ONLY__
*sin_ = sycl::sincos(x, cos_);
#else
sincosf(x, sin_, cos_);
#endif
}
#ifdef TIMING_INFO
double* timers;
timespec starttime, endtime;
@ -207,7 +220,7 @@ inline
int twojmax, diagonalstyle;
t_sna_3d_ll blist;
t_sna_3d blist;
t_sna_3c_ll ulisttot;
t_sna_3c_ll ulisttot_full; // un-folded ulisttot, cpu only
t_sna_3c_ll zlist;

View File

@ -316,7 +316,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::grow_rij(int newnatom, int
ulist = t_sna_3c_ll(Kokkos::NoInit("sna:ulist"),1,1,1);
zlist = t_sna_3c_ll(Kokkos::NoInit("sna:zlist"),1,1,1);
zlist_pack = t_sna_4c_ll(Kokkos::NoInit("sna:zlist_pack"),vector_length,idxz_max,ndoubles,natom_div);
blist = t_sna_3d_ll(Kokkos::NoInit("sna:blist"),idxb_max,ntriples,natom);
blist = t_sna_3d(Kokkos::NoInit("sna:blist"),natom,ntriples,idxb_max);
blist_pack = t_sna_4d_ll(Kokkos::NoInit("sna:blist_pack"),vector_length,idxb_max,ntriples,natom_div);
ylist = t_sna_3c_ll(Kokkos::NoInit("sna:ylist"),1,1,1);
ylist_pack_re = t_sna_4d_ll(Kokkos::NoInit("sna:ylist_pack_re"),vector_length,idxu_half_max,nelements,natom_div);
@ -337,7 +337,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::grow_rij(int newnatom, int
ulist = t_sna_3c_ll(Kokkos::NoInit("sna:ulist"),idxu_cache_max,natom,nmax);
zlist = t_sna_3c_ll(Kokkos::NoInit("sna:zlist"),idxz_max,ndoubles,natom);
zlist_pack = t_sna_4c_ll(Kokkos::NoInit("sna:zlist_pack"),1,1,1,1);
blist = t_sna_3d_ll(Kokkos::NoInit("sna:blist"),idxb_max,ntriples,natom);
blist = t_sna_3d(Kokkos::NoInit("sna:blist"),natom,ntriples,idxb_max);
blist_pack = t_sna_4d_ll(Kokkos::NoInit("sna:blist_pack"),1,1,1,1);
ylist = t_sna_3c_ll(Kokkos::NoInit("sna:ylist"),idxu_half_max,nelements,natom);
ylist_pack_re = t_sna_4d_ll(Kokkos::NoInit("sna:ylist_pack_re"),1,1,1,1);
@ -365,44 +365,44 @@ KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_cayley_klein(const int& iatom_mod, const int& jnbor, const int& iatom_div)
{
const int iatom = iatom_mod + vector_length * iatom_div;
const auto x = rij(iatom,jnbor,0);
const auto y = rij(iatom,jnbor,1);
const auto z = rij(iatom,jnbor,2);
const auto rsq = x * x + y * y + z * z;
const auto r = sqrt(rsq);
const auto rcut = rcutij(iatom, jnbor);
const auto rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcut - rmin0);
const auto theta0 = (r - rmin0) * rscale0;
real_type sn, cs;
sincos_wrapper(theta0, &sn, &cs);
const real_type x = rij(iatom,jnbor,0);
const real_type y = rij(iatom,jnbor,1);
const real_type z = rij(iatom,jnbor,2);
const real_type rsq = x * x + y * y + z * z;
const real_type r = sqrt(rsq);
const real_type rcut = rcutij(iatom, jnbor);
const real_type rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcut - rmin0);
const real_type theta0 = (r - rmin0) * rscale0;
const real_type sn = sin(theta0);
const real_type cs = cos(theta0);
const real_type z0 = r * cs / sn;
const real_type dz0dr = z0 / r - (r*rscale0) * (rsq + z0 * z0) / rsq;
const auto wj_local = wj(iatom, jnbor);
const real_type wj_local = wj(iatom, jnbor);
real_type sfac, dsfac;
compute_s_dsfac(r, rcut, sfac, dsfac);
sfac *= wj_local;
dsfac *= wj_local;
const auto rinv = static_cast<real_type>(1.0) / r;
const auto ux = x * rinv;
const auto uy = y * rinv;
const auto uz = z * rinv;
const real_type rinv = static_cast<real_type>(1.0) / r;
const real_type ux = x * rinv;
const real_type uy = y * rinv;
const real_type uz = z * rinv;
const auto r0inv = static_cast<real_type>(1.0) / sqrt(r * r + z0 * z0);
const real_type r0inv = static_cast<real_type>(1.0) / sqrt(r * r + z0 * z0);
const complex a = { z0 * r0inv, -z * r0inv };
const complex b = { r0inv * y, -r0inv * x };
const auto dr0invdr = -r0inv * r0inv * r0inv * (r + z0 * dz0dr);
const real_type dr0invdr = -r0inv * r0inv * r0inv * (r + z0 * dz0dr);
const auto dr0invx = dr0invdr * ux;
const auto dr0invy = dr0invdr * uy;
const auto dr0invz = dr0invdr * uz;
const real_type dr0invx = dr0invdr * ux;
const real_type dr0invy = dr0invdr * uy;
const real_type dr0invz = dr0invdr * uz;
const auto dz0x = dz0dr * ux;
const auto dz0y = dz0dr * uy;
const auto dz0z = dz0dr * uz;
const real_type dz0x = dz0dr * ux;
const real_type dz0y = dz0dr * uy;
const real_type dz0z = dz0dr * uz;
const complex dax = { dz0x * r0inv + z0 * dr0invx, -z * dr0invx };
const complex day = { dz0y * r0inv + z0 * dr0invy, -z * dr0invy };
@ -412,9 +412,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_cayley_klein(const
const complex dby = { y * dr0invy + r0inv, -x * dr0invy };
const complex dbz = { y * dr0invz, -x * dr0invz };
const auto dsfacux = dsfac * ux;
const auto dsfacuy = dsfac * uy;
const auto dsfacuz = dsfac * uz;
const real_type dsfacux = dsfac * ux;
const real_type dsfacuy = dsfac * uy;
const real_type dsfacuz = dsfac * uz;
a_pack(iatom_mod,jnbor,iatom_div) = a;
b_pack(iatom_mod,jnbor,iatom_div) = b;
@ -479,17 +479,13 @@ void SNAKokkos<DeviceType, real_type, vector_length>::pre_ui(const int& iatom_mo
accumulating to the total. GPU only.
------------------------------------------------------------------------- */
// Version of the code that exposes additional parallelism by threading over `j_bend` values
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
{
// utot(j,ma,mb) = 0 for all j,ma,ma
// utot(j,ma,ma) = 1 for all j,ma
// for j in neighbors of i:
// compute r0 = (x,y,z,z0)
// utot(j,ma,mb) += u(r0;j,ma,mb) for all j,ma,mb
// get shared memory offset
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
const int tile_size = vector_length * (twojmax + 1);
@ -498,13 +494,12 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
const int scratch_shift = team_rank * tile_size;
// extract and wrap
WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
const WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
// load parameters
const auto a = a_pack(iatom_mod, jnbor, iatom_div);
const auto b = b_pack(iatom_mod, jnbor, iatom_div);
const auto sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
@ -512,12 +507,59 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
// this for loop is here for context --- we expose additional
// parallelism over this loop instead
//for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
evaluate_ui_jbend(ulist_wrapper, a, b, sfac, jelem, iatom_mod, j_bend, iatom_div);
}
// Version of the code that loops over all `j_bend` values which reduces integer arithmetic
// and some amount of load imbalance, at the expense of reducing parallelism
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int jnbor, const int iatom_div)
{
// get shared memory offset
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
const int tile_size = vector_length * (twojmax + 1);
const int team_rank = team.team_rank();
const int scratch_shift = team_rank * tile_size;
// extract and wrap
const WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
// load parameters
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
// we need to "choose" when to bend
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
evaluate_ui_jbend(ulist_wrapper, a, b, sfac, jelem, iatom_mod, j_bend, iatom_div);
}
}
// Core "evaluation" kernel that gets reused in `compute_ui_small` and `compute_ui_large`
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_FORCEINLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::evaluate_ui_jbend(const WignerWrapper<real_type, vector_length>& ulist_wrapper,
const complex& a, const complex& b, const real_type& sfac, const int& jelem,
const int& iatom_mod, const int& j_bend, const int& iatom_div)
{
// utot(j,ma,mb) = 0 for all j,ma,mb
// utot(j,ma,ma) = 1 for all j,ma
// for j in neighbors of i:
// compute r0 = (x,y,z,z0)
// utot(j,ma,mb) += u(r0;j,ma,mb) for all j,ma,mb
// level 0 is just 1.
ulist_wrapper.set(0, complex::one());
// j from before the bend, don't store, mb == 0
// this is "creeping up the side"
for (int j = 1; j <= j_bend; j++) {
constexpr int mb = 0; // intentional for readability, compiler should optimize this out
@ -601,12 +643,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
Kokkos::atomic_add(&(ulisttot_im_pack(iatom_mod, jjup + ma, jelem, iatom_div)), ulist_prev.im * sfac);
}
//} // end of "reference" loop over j_bend
}
/* ----------------------------------------------------------------------
compute Zi by summing over products of Ui,
AoSoA data layout to take advantage of coalescing, avoiding warp
@ -634,47 +672,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_zi(const int& iato
for (int elem1 = 0; elem1 < nelements; elem1++) {
for (int elem2 = 0; elem2 < nelements; elem2++) {
complex ztmp = complex::zero();
int jju1 = idxu_block[j1] + (j1+1)*mb1min;
int jju2 = idxu_block[j2] + (j2+1)*mb2max;
int icgb = mb1min*(j2+1) + mb2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ib = 0; ib < nb; ib++) {
int ma1 = ma1min;
int ma2 = ma2max;
int icga = ma1min*(j2+1) + ma2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ia = 0; ia < na; ia++) {
const auto utot1 = ulisttot_pack(iatom_mod, jju1+ma1, elem1, iatom_div);
const auto utot2 = ulisttot_pack(iatom_mod, jju2+ma2, elem2, iatom_div);
const auto cgcoeff_a = cgblock[icga];
const auto cgcoeff_b = cgblock[icgb];
ztmp.re += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
ztmp.im += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
ma1++;
ma2--;
icga += j2;
} // end loop over ia
jju1 += j1 + 1;
jju2 -= j2 + 1;
icgb += j2;
} // end loop over ib
if (bnorm_flag) {
ztmp.re /= (j + 1);
ztmp.im /= (j + 1);
}
zlist_pack(iatom_mod,jjz,idouble,iatom_div) = ztmp;
zlist_pack(iatom_mod,jjz,idouble,iatom_div) = evaluate_zi(j1, j2, j, ma1min, ma2max, mb1min, mb2max, na, nb, iatom_mod, elem1, elem2, iatom_div, cgblock);
idouble++;
}
@ -721,8 +720,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
const int jju_index = jju+mb*(j+1)+ma;
const int jjz_index = jjz+mb*(j+1)+ma;
if (2*mb == j) return; // I think we can remove this?
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
sumzu_temp += utot.re * zloc.re + utot.im * zloc.im;
}
}
@ -737,8 +736,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
const int jju_index = jju+(mb-1)*(j+1)+(j+1)+ma;
const int jjz_index = jjz+(mb-1)*(j+1)+(j+1)+ma;
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
sumzu_temp += utot.re * zloc.re + utot.im * zloc.im;
}
@ -748,8 +747,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
const int jju_index = jju+(mb-1)*(j+1)+(j+1)+ma;
const int jjz_index = jjz+(mb-1)*(j+1)+(j+1)+ma;
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
sumzu += static_cast<real_type>(0.5) * (utot.re * zloc.re + utot.im * zloc.im);
} // end if jeven
@ -785,7 +784,6 @@ KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod, int jjz, int iatom_div,
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack)
{
real_type betaj;
const int j1 = idxz(jjz, 0);
const int j2 = idxz(jjz, 1);
@ -805,46 +803,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod,
for (int elem1 = 0; elem1 < nelements; elem1++) {
for (int elem2 = 0; elem2 < nelements; elem2++) {
real_type ztmp_r = 0.0;
real_type ztmp_i = 0.0;
int jju1 = idxu_block[j1] + (j1 + 1) * mb1min;
int jju2 = idxu_block[j2] + (j2 + 1) * mb2max;
int icgb = mb1min * (j2 + 1) + mb2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ib = 0; ib < nb; ib++) {
int ma1 = ma1min;
int ma2 = ma2max;
int icga = ma1min*(j2+1) + ma2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ia = 0; ia < na; ia++) {
const auto utot1 = ulisttot_pack(iatom_mod,jju1+ma1,elem1,iatom_div);
const auto utot2 = ulisttot_pack(iatom_mod,jju2+ma2,elem2,iatom_div);
const auto cgcoeff_a = cgblock[icga];
const auto cgcoeff_b = cgblock[icgb];
ztmp_r += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
ztmp_i += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
ma1++;
ma2--;
icga += j2;
} // end loop over ia
jju1 += j1 + 1;
jju2 -= j2 + 1;
icgb += j2;
} // end loop over ib
if (bnorm_flag) {
ztmp_r /= (j + 1);
ztmp_i /= (j + 1);
}
const complex ztmp = evaluate_zi(j1, j2, j, ma1min, ma2max, mb1min, mb2max, na, nb, iatom_mod, elem1, elem2, iatom_div, cgblock);
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
// find right y_list[jju] and beta(iatom,jjb) entries
@ -853,30 +812,11 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod,
// pick out right beta value
for (int elem3 = 0; elem3 < nelements; elem3++) {
if (j >= j1) {
const int jjb = idxb_block(j1, j2, j);
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
if (j1 == j) {
if (j2 == j) betaj = 3 * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else if (j >= j2) {
const int jjb = idxb_block(j, j2, j1);
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
if (j2 == j) betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else {
const int jjb = idxb_block(j2, j, j1);
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
betaj = beta_pack(iatom_mod, itriple, iatom_div);
}
if (!bnorm_flag && j1 > j)
betaj *= (j1 + 1) / (j + 1.0);
const real_type betaj = evaluate_beta_scaled(j1, j2, j, iatom_mod, elem1, elem2, elem3, iatom_div, beta_pack);
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp_r);
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp_i);
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.re);
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.im);
} // end loop over elem3
} // end loop over elem2
} // end loop over elem1
@ -893,7 +833,6 @@ KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_with_zlist(int iatom_mod, int jjz, int iatom_div,
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack)
{
real_type betaj;
const int j1 = idxz(jjz, 0);
const int j2 = idxz(jjz, 1);
const int j = idxz(jjz, 2);
@ -901,49 +840,123 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_with_zlist(int
int idouble = 0;
for (int elem1 = 0; elem1 < nelements; elem1++) {
for (int elem2 = 0; elem2 < nelements; elem2++) {
auto ztmp = zlist_pack(iatom_mod,jjz,idouble,iatom_div);
const complex ztmp = zlist_pack(iatom_mod,jjz,idouble,iatom_div);
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
// find right y_list[jju] and beta(iatom,jjb) entries
// multiply and divide by j+1 factors
// account for multiplicity of 1, 2, or 3
// pick out right beta value
for (int elem3 = 0; elem3 < nelements; elem3++) {
if (j >= j1) {
const int jjb = idxb_block(j1, j2, j);
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
if (j1 == j) {
if (j2 == j) betaj = 3 * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else if (j >= j2) {
const int jjb = idxb_block(j, j2, j1);
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
if (j2 == j) betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else {
const int jjb = idxb_block(j2, j, j1);
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
betaj = beta_pack(iatom_mod, itriple, iatom_div);
}
if (!bnorm_flag && j1 > j)
betaj *= (j1 + 1) / (j + 1.0);
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp.re);
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp.im);
const real_type betaj = evaluate_beta_scaled(j1, j2, j, iatom_mod, elem1, elem2, elem3, iatom_div, beta_pack);
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.re);
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.im);
} // end loop over elem3
idouble++;
} // end loop over elem2
} // end loop over elem1
}
// Core "evaluation" kernel that computes a single zlist value
// which gets used in both `compute_zi` and `compute_yi`
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_FORCEINLINE_FUNCTION
typename SNAKokkos<DeviceType, real_type, vector_length>::complex SNAKokkos<DeviceType, real_type, vector_length>::evaluate_zi(const int& j1, const int& j2, const int& j,
const int& ma1min, const int& ma2max, const int& mb1min, const int& mb2max, const int& na, const int& nb,
const int& iatom_mod, const int& elem1, const int& elem2, const int& iatom_div, const real_type* cgblock) {
complex ztmp = complex::zero();
int jju1 = idxu_block[j1] + (j1+1)*mb1min;
int jju2 = idxu_block[j2] + (j2+1)*mb2max;
int icgb = mb1min*(j2+1) + mb2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ib = 0; ib < nb; ib++) {
int ma1 = ma1min;
int ma2 = ma2max;
int icga = ma1min*(j2+1) + ma2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ia = 0; ia < na; ia++) {
const complex utot1 = ulisttot_pack(iatom_mod, jju1+ma1, elem1, iatom_div);
const complex utot2 = ulisttot_pack(iatom_mod, jju2+ma2, elem2, iatom_div);
const real_type cgcoeff_a = cgblock[icga];
const real_type cgcoeff_b = cgblock[icgb];
ztmp.re += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
ztmp.im += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
ma1++;
ma2--;
icga += j2;
} // end loop over ia
jju1 += j1 + 1;
jju2 -= j2 + 1;
icgb += j2;
} // end loop over ib
if (bnorm_flag) {
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
ztmp.re *= scale;
ztmp.im *= scale;
}
return ztmp;
}
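For reference, the quantity `evaluate_zi` accumulates is, in the notation of the SNAP literature (a sketch with the index bookkeeping simplified; the exact loop bounds follow ma1min/ma2max/mb1min/mb2max above):

  z^{j_1,j_2,j}_{m_a,m_b} = \sum_{m_{a1},\,m_{b1}} C^{j\,m_a}_{j_1 m_{a1}\, j_2 m_{a2}}\; C^{j\,m_b}_{j_1 m_{b1}\, j_2 m_{b2}}\; u^{j_1}_{m_{a1} m_{b1}}\, u^{j_2}_{m_{a2} m_{b2}}, \qquad z \leftarrow z/(j+1) \ \text{if bnorm\_flag}

where the two C factors are the Clebsch-Gordan coefficients streamed in through `cgblock` (cgcoeff_a walking the ma indices, cgcoeff_b the mb indices).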
// Core "evaluation" kernel that extracts and rescales the appropriate `beta` value,
// which gets used in both `compute_yi` and `compute_yi_with_zlist`
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_FORCEINLINE_FUNCTION
typename SNAKokkos<DeviceType, real_type, vector_length>::real_type SNAKokkos<DeviceType, real_type, vector_length>::evaluate_beta_scaled(const int& j1, const int& j2, const int& j,
const int& iatom_mod, const int& elem1, const int& elem2, const int& elem3, const int& iatom_div,
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack) {
real_type betaj = 0;
if (j >= j1) {
const int jjb = idxb_block(j1, j2, j);
const int itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
if (j1 == j) {
if (j2 == j) betaj = static_cast<real_type>(3) * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = static_cast<real_type>(2) * beta_pack(iatom_mod, itriple, iatom_div);
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else if (j >= j2) {
const int jjb = idxb_block(j, j2, j1);
const int itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
if (j2 == j) betaj = static_cast<real_type>(2) * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else {
const int jjb = idxb_block(j2, j, j1);
const int itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
betaj = beta_pack(iatom_mod, itriple, iatom_div);
}
if (!bnorm_flag && j1 > j) {
const real_type scale = static_cast<real_type>(j1 + 1) / static_cast<real_type>(j + 1);
betaj *= scale;
}
return betaj;
}
/* ----------------------------------------------------------------------
Fused calculation of the derivative of Ui w.r.t. atom j
and accumulation into dEidRj. GPU only.
------------------------------------------------------------------------- */
// Version of the code that exposes additional parallelism by threading over `j_bend` values
template<class DeviceType, typename real_type, int vector_length>
template<int dir>
KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
{
// get shared memory offset
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
@ -957,21 +970,76 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
WignerWrapper<real_type, vector_length> dulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
// load parameters
const auto a = a_pack(iatom_mod, jnbor, iatom_div);
const auto b = b_pack(iatom_mod, jnbor, iatom_div);
const auto da = da_pack(iatom_mod, jnbor, iatom_div, dir);
const auto db = db_pack(iatom_mod, jnbor, iatom_div, dir);
const auto sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const auto dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
const complex da = da_pack(iatom_mod, jnbor, iatom_div, dir);
const complex db = db_pack(iatom_mod, jnbor, iatom_div, dir);
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const real_type dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
auto dedr_full_sum = static_cast<real_type>(0.);
// compute the contribution to dedr_full_sum for one "bend" location
const real_type dedr_full_sum = evaluate_duidrj_jbend(ulist_wrapper, a, b, sfac, dulist_wrapper, da, db, dsfacu,
jelem, iatom_mod, j_bend, iatom_div);
// we need to "choose" when to bend
// this for loop is here for context --- we expose additional
// parallelism over this loop instead
//for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
// dedr gets zeroed out at the start of each iteration in compute_cayley_klein
Kokkos::atomic_add(&(dedr(iatom_mod + vector_length * iatom_div, jnbor, dir)), static_cast<real_type>(2.0) * dedr_full_sum);
}
// Version of the code that loops over all `j_bend` values, which reduces integer
// arithmetic and some of the load imbalance, at the expense of less parallelism
template<class DeviceType, typename real_type, int vector_length>
template<int dir>
KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int jnbor, const int iatom_div)
{
// get shared memory offset
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
const int tile_size = vector_length * (twojmax + 1);
const int team_rank = team.team_rank();
const int scratch_shift = team_rank * tile_size;
// extract, wrap shared memory buffer
WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
WignerWrapper<real_type, vector_length> dulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
// load parameters
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
const complex da = da_pack(iatom_mod, jnbor, iatom_div, dir);
const complex db = db_pack(iatom_mod, jnbor, iatom_div, dir);
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const real_type dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
// compute the contributions to dedr_full_sum for all "bend" locations
real_type dedr_full_sum = static_cast<real_type>(0);
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
dedr_full_sum += evaluate_duidrj_jbend(ulist_wrapper, a, b, sfac, dulist_wrapper, da, db, dsfacu,
jelem, iatom_mod, j_bend, iatom_div);
}
// there's one thread per (atom, neighbor) pair, so no need to make this atomic
dedr(iatom_mod + vector_length * iatom_div, jnbor, dir) = static_cast<real_type>(2.0) * dedr_full_sum;
}
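The `_small`/`_large` split leaves the kernel choice to the calling pair style. A minimal, hypothetical host-side heuristic (the function name, its arguments, and the bare comparison are assumptions for illustration, not the actual driver logic):

  // Hypothetical sketch: small work batches benefit from the extra j_bend
  // parallelism of the "_small" kernel; large batches already saturate the
  // device, so the "_large" kernel's coarser threads win by cutting integer
  // arithmetic. The threshold mirrors the new `parallelthresh` keyword.
  inline bool use_jbend_parallel_kernel(int natoms_in_batch, int parallel_thresh) {
    return natoms_in_batch < parallel_thresh;
  }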
// Core "evaluation" kernel that gets reused in `compute_fused_deidrj_small` and
// `compute_fused_deidrj_large`
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_FORCEINLINE_FUNCTION
typename SNAKokkos<DeviceType, real_type, vector_length>::real_type SNAKokkos<DeviceType, real_type, vector_length>::evaluate_duidrj_jbend(const WignerWrapper<real_type, vector_length>& ulist_wrapper, const complex& a, const complex& b, const real_type& sfac,
const WignerWrapper<real_type, vector_length>& dulist_wrapper, const complex& da, const complex& db, const real_type& dsfacu,
const int& jelem, const int& iatom_mod, const int& j_bend, const int& iatom_div) {
real_type dedr_full_sum = static_cast<real_type>(0);
// level 0 is just 1, 0
ulist_wrapper.set(0, complex::one());
@ -1039,7 +1107,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
// grab y_local early
// this will never be the last element of a row, no need to rescale.
auto y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
complex y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
// grab the cached value
const complex ulist_prev = ulist_wrapper.get(ma);
@ -1085,7 +1153,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
for (int ma = 0; ma < j; ma++) {
// grab y_local early
auto y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
complex y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
if (j % 2 == 1 && 2*(mb-1) == j-1) { // double check me...
if (ma == (mb-1)) { y_local = static_cast<real_type>(0.5)*y_local; }
else if (ma > (mb-1)) { y_local.re = static_cast<real_type>(0.); y_local.im = static_cast<real_type>(0.); } // can probably avoid this outright
@ -1100,15 +1168,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
dedr_full_sum += du_prod.re * y_local.re + du_prod.im * y_local.im;
}
//} // end reference loop over j_bend
// dedr gets zeroed out at the start of each iteration in compute_cayley_klein
Kokkos::atomic_add(&(dedr(iatom_mod + vector_length * iatom_div, jnbor, dir)), static_cast<real_type>(2.0) * dedr_full_sum);
return dedr_full_sum;
}
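Schematically, this kernel and its two wrappers accumulate (signs and index ranges as in the code above):

  \frac{\partial E_i}{\partial r^{\alpha}_{ij}} = 2\,\mathrm{Re} \sum_{j,\,m_a,\,m_b} \frac{\partial u^{j}_{m_a m_b}}{\partial r^{\alpha}_{ij}} \left( y^{j}_{m_a m_b} \right)^{*}

noting that `du_prod.re * y_local.re + du_prod.im * y_local.im` is exactly Re(du_prod * conj(y_local)).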
/* ----------------------------------------------------------------------
* CPU routines
* ----------------------------------------------------------------------*/
@ -1238,8 +1301,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_zi_cpu(const int&
} // end loop over ib
if (bnorm_flag) {
zlist(jjz, idouble, iatom).re /= (j+1);
zlist(jjz, idouble, iatom).im /= (j+1);
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
zlist(jjz, idouble, iatom).re *= scale;
zlist(jjz, idouble, iatom).im *= scale;
}
idouble++;
} // end loop over elem2
@ -1268,7 +1332,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi_cpu(const typen
int idouble = 0;
for (int elem1 = 0; elem1 < nelements; elem1++) {
for (int elem2 = 0; elem2 < nelements; elem2++) {
auto jalloy = idouble; // must be non-const to work around gcc compiler bug
int jalloy = idouble; // must be non-const to work around gcc compiler bug
for (int elem3 = 0; elem3 < nelements; elem3++) {
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,idxb_max),
[&] (const int& jjb) {
@ -1331,7 +1395,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi_cpu(const typen
}
}
blist(jjb, itriple, iatom) = sumzu;
blist(iatom, itriple, jjb) = sumzu;
});
});
//} // end loop over j
@ -1410,8 +1474,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_cpu(int iter,
} // end loop over ib
if (bnorm_flag) {
ztmp_i /= j + 1;
ztmp_r /= j + 1;
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
ztmp_i *= scale;
ztmp_r *= scale;
}
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
@ -1424,24 +1489,24 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_cpu(int iter,
if (j >= j1) {
const int jjb = idxb_block(j1, j2, j);
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
const int itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
if (j1 == j) {
if (j2 == j) betaj = 3 * beta(itriple, iatom);
else betaj = 2 * beta(itriple, iatom);
} else betaj = beta(itriple, iatom);
} else if (j >= j2) {
const int jjb = idxb_block(j, j2, j1);
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
const int itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
if (j2 == j) betaj = 2 * beta(itriple, iatom);
else betaj = beta(itriple, iatom);
} else {
const int jjb = idxb_block(j2, j, j1);
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
const int itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
betaj = beta(itriple, iatom);
}
if (!bnorm_flag && j1 > j)
betaj *= (j1 + 1) / (j + 1.0);
betaj *= static_cast<real_type>(j1 + 1) / static_cast<real_type>(j + 1);
Kokkos::atomic_add(&(ylist(jju_half, elem3, iatom).re), betaj*ztmp_r);
Kokkos::atomic_add(&(ylist(jju_half, elem3, iatom).im), betaj*ztmp_i);
@ -1469,9 +1534,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_duidrj_cpu(const t
z = rij(iatom,jnbor,2);
rsq = x * x + y * y + z * z;
r = sqrt(rsq);
auto rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcutij(iatom,jnbor) - rmin0);
real_type rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcutij(iatom,jnbor) - rmin0);
theta0 = (r - rmin0) * rscale0;
sincos_wrapper(theta0, &sn, &cs);
sn = sin(theta0);
cs = cos(theta0);
z0 = r * cs / sn;
dz0dr = z0 / r - (r*rscale0) * (rsq + z0 * z0) / rsq;
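The `z0`/`dz0dr` expressions are easier to audit in closed form; restating the code above (no new math):

  \theta_0 = (r - r_{\min 0})\, r_{\mathrm{scale0}}, \qquad z_0 = r \cot\theta_0, \qquad \frac{dz_0}{dr} = \frac{z_0}{r} - r\, r_{\mathrm{scale0}}\, \frac{r^2 + z_0^2}{r^2} = \frac{z_0}{r} - r\, r_{\mathrm{scale0}} \csc^2\theta_0

since (r^2 + z_0^2)/r^2 = 1 + \cot^2\theta_0 = \csc^2\theta_0.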
@ -1559,7 +1625,7 @@ KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::add_uarraytot(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int iatom, int jnbor,
const real_type& r, const real_type& wj, const real_type& rcut, int jelem)
{
const auto sfac = compute_sfac(r, rcut) * wj;
const real_type sfac = compute_sfac(r, rcut) * wj;
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,twojmax+1),
[&] (const int& j) {
@ -2168,7 +2234,7 @@ real_type SNAKokkos<DeviceType, real_type, vector_length>::compute_sfac(real_typ
if (r <= rmin0) return one;
else if (r > rcut) return zero;
else {
auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
return onehalf * (cos((r - rmin0) * rcutfac) + one);
}
}
@ -2188,7 +2254,7 @@ real_type SNAKokkos<DeviceType, real_type, vector_length>::compute_dsfac(real_ty
if (r <= rmin0) return zero;
else if (r > rcut) return zero;
else {
auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
return -onehalf * sin((r - rmin0) * rcutfac) * rcutfac;
}
}
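Read together, `compute_sfac` and `compute_dsfac` implement the cosine switching function and its radial derivative:

  s(r) = \begin{cases} 1 & r \le r_{\min 0} \\ \tfrac{1}{2}\left[\cos\left(\pi\,\tfrac{r - r_{\min 0}}{r_{\mathrm{cut}} - r_{\min 0}}\right) + 1\right] & r_{\min 0} < r \le r_{\mathrm{cut}} \\ 0 & r > r_{\mathrm{cut}} \end{cases} \qquad s'(r) = -\frac{\pi}{2\,(r_{\mathrm{cut}} - r_{\min 0})}\, \sin\left(\pi\,\tfrac{r - r_{\min 0}}{r_{\mathrm{cut}} - r_{\min 0}}\right)

with s'(r) = 0 outside the switching window.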
@ -2206,9 +2272,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_s_dsfac(const real
if (r <= rmin0) { sfac = one; dsfac = zero; }
else if (r > rcut) { sfac = zero; dsfac = zero; }
else {
const auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
real_type sn, cs;
sincos_wrapper((r - rmin0) * rcutfac, &sn, &cs); // need to create a wrapper
const real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
const real_type theta0 = (r - rmin0) * rcutfac;
const real_type sn = sin(theta0);
const real_type cs = cos(theta0);
sfac = onehalf * (cs + one);
dsfac = -onehalf * sn * rcutfac;

View File

@ -628,7 +628,8 @@ void PairSNAP::read_files(char *coefffilename, char *paramfilename)
chemflag = 0;
bnormflag = 0;
wselfallflag = 0;
chunksize = 4096;
chunksize = 32768;
parallel_thresh = 8192;
// open SNAP parameter file on proc 0
@ -696,6 +697,8 @@ void PairSNAP::read_files(char *coefffilename, char *paramfilename)
wselfallflag = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
else if (keywd == "chunksize")
chunksize = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
else if (keywd == "parallelthresh")
parallel_thresh = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
else
error->all(FLERR,"Unknown parameter '{}' in SNAP "
"parameter file", keywd);

View File

@ -59,7 +59,7 @@ class PairSNAP : public Pair {
double **scale; // for thermodynamic integration
int twojmax, switchflag, bzeroflag, bnormflag;
int chemflag, wselfallflag;
int chunksize;
int chunksize,parallel_thresh;
double rfac0, rmin0, wj1, wj2;
int rcutfacflag, twojmaxflag; // flags for required parameters
int beta_max; // length of beta

View File

@ -20,7 +20,7 @@ charges (dsf and long-range treatment of charges)
out-of-plane angle
See the file doc/drude_tutorial.html for getting started.
See the doc pages for "pair_style buck6d/coul/gauss", "anlge_style class2",
See the doc pages for "pair_style buck6d/coul/gauss", "angle_style class2",
"angle_style cosine/buck6d", and "improper_style inversion/harmonic"
commands to get started. Also see the above mentioned website and
literature for further documentation about the force field.

View File

@ -34,6 +34,7 @@ exclude:
- lib/hdnnp
- lib/kim
- lib/kokkos
- lib/latte
- lib/machdyn
- lib/mdi
- lib/mscg
@ -41,6 +42,7 @@ exclude:
- lib/plumed
- lib/quip
- lib/scafacos
- lib/voronoi
- src/Make.sh
patterns:
- "*.c"