Merge branch 'master' into collected-small-changes
This commit is contained in:
@ -22,4 +22,5 @@ page.
|
||||
Build_extras
|
||||
Build_manual
|
||||
Build_windows
|
||||
Build_diskspace
|
||||
Build_development
|
||||
|
||||
45
doc/src/Build_diskspace.rst
Normal file
45
doc/src/Build_diskspace.rst
Normal file
@ -0,0 +1,45 @@
|
||||
Notes for saving disk space when building LAMMPS from source
|
||||
------------------------------------------------------------
|
||||
|
||||
LAMMPS is a large software project with a large number of source files,
|
||||
extensive documentation, and a large collection of example files.
|
||||
When downloading LAMMPS by cloning the
|
||||
`git repository from GitHub <https://github.com/lammps/lammps>`_ this
|
||||
will by default also download the entire commit history since September 2006.
|
||||
Compiling LAMMPS will add the storage requirements of the compiled object
|
||||
files and libraries to the tally.
|
||||
|
||||
In a user account on an HPC cluster with filesystem quotas or in other
|
||||
environments with restricted disk space capacity it may be needed to
|
||||
reduce the storage requirements. Here are some suggestions:
|
||||
|
||||
- Create a so-called shallow repository by cloning only the last commit
|
||||
instead of the full project history by using ``git clone git@github.com:lammps/lammps --depth=1 --branch=master``.
|
||||
This reduces the downloaded size to about half. With ``--depth=1`` it is not possible to check out different
|
||||
versions/branches of LAMMPS, using ``--depth=1000`` will make multiple recent versions available at little
|
||||
extra storage needs (the entire git history had nearly 30,000 commits in fall 2021).
|
||||
|
||||
- Download a tar archive from either the `download section on the LAMMPS homepage <https://www.lammps.org/download.html>`_
|
||||
or from the `LAMMPS releases page on GitHub <https://github.com/lammps/lammps/releases>`_ these will not
|
||||
contain the git history at all.
|
||||
|
||||
- Build LAMMPS without the debug flag (remove ``-g`` from the machine makefile or use ``-DCMAKE_BUILD_TYPE=Release``)
|
||||
or use the ``strip`` command on the LAMMPS executable when no more debugging would be needed. The strip command
|
||||
may also be applied to the LAMMPS shared library. The static library may be deleted entirely.
|
||||
|
||||
- Delete compiled object files and libraries after copying the LAMMPS executable to a permanent location.
|
||||
When using the traditional build process, one may use ``make clean-<machine>`` or ``make clean-all``
|
||||
to delete object files in the src folder. For CMake based builds, one may use ``make clean`` or just
|
||||
delete the entire build folder.
|
||||
|
||||
- The folders containing the documentation tree (doc), the examples (examples) are not needed to build and
|
||||
run LAMMPS and can be safely deleted. Some files in the potentials folder are large and may be deleted,
|
||||
if not needed. The largest of those files (occupying about 120 MBytes combined) will only be downloaded on
|
||||
demand, when the corresponding package is installed.
|
||||
|
||||
- When using the CMake build procedure, the compilation can be done on a (local) scratch storage that will not
|
||||
count toward the quota. A local scratch file system may offer the additional benefit of speeding up creating
|
||||
object files and linking with libraries compared to a networked file system. Also with CMake (and unlike with
|
||||
the traditional make) it is possible to compile LAMMPS executables with different settings and packages included
|
||||
from the same source tree since all the configuration information is stored in the build folder. So it is
|
||||
not necessary to have multiple copies of LAMMPS.
|
||||
@ -29,7 +29,7 @@ The following folks deserve special recognition. Many of the packages
|
||||
they have written are unique for an MD code and LAMMPS would not be as
|
||||
general-purpose as it is without their expertise and efforts.
|
||||
|
||||
* Metin Aktulga (MSU), REAXFF package for C version of ReaxFF
|
||||
* Metin Aktulga (MSU), REAXFF package for C/C++ version of ReaxFF
|
||||
* Mike Brown (Intel), GPU and INTEL packages
|
||||
* Colin Denniston (U Western Ontario), LATBOLTZ package
|
||||
* Georg Ganzenmuller (EMI), MACHDYN and SPH packages
|
||||
@ -37,9 +37,10 @@ general-purpose as it is without their expertise and efforts.
|
||||
* Reese Jones (Sandia) and colleagues, ATC package for atom/continuum coupling
|
||||
* Christoph Kloss (DCS Computing), LIGGGHTS code for granular materials, built on top of LAMMPS
|
||||
* Rudra Mukherjee (JPL), POEMS package for articulated rigid body motion
|
||||
* Trung Ngyuen (Northwestern U), GPU and RIGID and BODY packages
|
||||
* Trung Ngyuen (Northwestern U), GPU, RIGID, BODY, and DIELECTRIC packages
|
||||
* Mike Parks (Sandia), PERI package for Peridynamics
|
||||
* Roy Pollock (LLNL), Ewald and PPPM solvers
|
||||
* Julien Tranchida (Sandia), SPIN package
|
||||
* Christian Trott (Sandia), CUDA and KOKKOS packages
|
||||
* Ilya Valuev (JIHT), AWPMD package for wave packet MD
|
||||
* Greg Wagner (Northwestern U), MEAM package for MEAM potential
|
||||
|
||||
@ -27,19 +27,19 @@ General features
|
||||
* distributed memory message-passing parallelism (MPI)
|
||||
* shared memory multi-threading parallelism (OpenMP)
|
||||
* spatial decomposition of simulation domain for MPI parallelism
|
||||
* particle decomposition inside of spatial decomposition for OpenMP parallelism
|
||||
* particle decomposition inside of spatial decomposition for OpenMP and GPU parallelism
|
||||
* GPLv2 licensed open-source distribution
|
||||
* highly portable C++-11
|
||||
* modular code with most functionality in optional packages
|
||||
* only depends on MPI library for basic parallel functionality
|
||||
* only depends on MPI library for basic parallel functionality, MPI stub for serial compilation
|
||||
* other libraries are optional and only required for specific packages
|
||||
* GPU (CUDA and OpenCL), Intel Xeon Phi, and OpenMP support for many code features
|
||||
* GPU (CUDA, OpenCL, HIP, SYCL), Intel Xeon Phi, and OpenMP support for many code features
|
||||
* easy to extend with new features and functionality
|
||||
* runs from an input script
|
||||
* syntax for defining and using variables and formulas
|
||||
* syntax for looping over runs and breaking out of loops
|
||||
* run one or multiple simulations simultaneously (in parallel) from one script
|
||||
* build as library, invoke LAMMPS through library interface or provided Python wrapper
|
||||
* build as library, invoke LAMMPS through library interface or provided Python wrapper or SWIG based wrappers
|
||||
* couple with other codes: LAMMPS calls other code, other code calls LAMMPS, umbrella code calls both
|
||||
|
||||
.. _particle:
|
||||
@ -58,8 +58,10 @@ Particle and model types
|
||||
* coarse-grained mesoscale models
|
||||
* finite-size spherical and ellipsoidal particles
|
||||
* finite-size line segment (2d) and triangle (3d) particles
|
||||
* finite-size rounded polygons (2d) and polyhedra (3d) particles
|
||||
* point dipole particles
|
||||
* rigid collections of particles
|
||||
* particles with magnetic spin
|
||||
* rigid collections of n particles
|
||||
* hybrid combinations of these
|
||||
|
||||
.. _ff:
|
||||
@ -74,24 +76,28 @@ commands)
|
||||
|
||||
* pairwise potentials: Lennard-Jones, Buckingham, Morse, Born-Mayer-Huggins, Yukawa, soft, class 2 (COMPASS), hydrogen bond, tabulated
|
||||
* charged pairwise potentials: Coulombic, point-dipole
|
||||
* many-body potentials: EAM, Finnis/Sinclair EAM, modified EAM (MEAM), embedded ion method (EIM), EDIP, ADP, Stillinger-Weber, Tersoff, REBO, AIREBO, ReaxFF, COMB, SNAP, Streitz-Mintmire, 3-body polymorphic
|
||||
* long-range interactions for charge, point-dipoles, and LJ dispersion: Ewald, Wolf, PPPM (similar to particle-mesh Ewald)
|
||||
* many-body potentials: EAM, Finnis/Sinclair EAM, modified EAM (MEAM), embedded ion method (EIM), EDIP, ADP, Stillinger-Weber, Tersoff, REBO, AIREBO, ReaxFF, COMB, Streitz-Mintmire, 3-body polymorphic, BOP, Vashishta
|
||||
* machine learning potentials: SNAP, GAP, ACE, N2P2, RANN, AGNI
|
||||
* long-range interactions for charge, point-dipoles, and LJ dispersion: Ewald, Wolf, PPPM (similar to particle-mesh Ewald), MSM
|
||||
* polarization models: :doc:`QEq <fix_qeq>`, :doc:`core/shell model <Howto_coreshell>`, :doc:`Drude dipole model <Howto_drude>`
|
||||
* charge equilibration (QEq via dynamic, point, shielded, Slater methods)
|
||||
* coarse-grained potentials: DPD, GayBerne, REsquared, colloidal, DLVO
|
||||
* mesoscopic potentials: granular, Peridynamics, SPH
|
||||
* mesoscopic potentials: granular, Peridynamics, SPH, mesoscopic tubular potential (MESONT)
|
||||
* semi-empirical potentials: multi-ion generalized pseudopotential theory (MGPT), second moment tight binding + QEq (SMTB-Q), density functional tight-binding (LATTE)
|
||||
* electron force field (eFF, AWPMD)
|
||||
* bond potentials: harmonic, FENE, Morse, nonlinear, class 2, quartic (breakable)
|
||||
* angle potentials: harmonic, CHARMM, cosine, cosine/squared, cosine/periodic, class 2 (COMPASS)
|
||||
* dihedral potentials: harmonic, CHARMM, multi-harmonic, helix, class 2 (COMPASS), OPLS
|
||||
* improper potentials: harmonic, cvff, umbrella, class 2 (COMPASS)
|
||||
* bond potentials: harmonic, FENE, Morse, nonlinear, class 2, quartic (breakable), tabulated
|
||||
* angle potentials: harmonic, CHARMM, cosine, cosine/squared, cosine/periodic, class 2 (COMPASS), tabulated
|
||||
* dihedral potentials: harmonic, CHARMM, multi-harmonic, helix, class 2 (COMPASS), OPLS, tabulated
|
||||
* improper potentials: harmonic, cvff, umbrella, class 2 (COMPASS), tabulated
|
||||
* polymer potentials: all-atom, united-atom, bead-spring, breakable
|
||||
* water potentials: TIP3P, TIP4P, SPC
|
||||
* water potentials: TIP3P, TIP4P, SPC, SPC/E and variants
|
||||
* interlayer potentials for graphene and analogues
|
||||
* metal-organic framework potentials (QuickFF, MO-FF)
|
||||
* implicit solvent potentials: hydrodynamic lubrication, Debye
|
||||
* force-field compatibility with common CHARMM, AMBER, DREIDING, OPLS, GROMACS, COMPASS options
|
||||
* access to the `OpenKIM Repository <http://openkim.org>`_ of potentials via :doc:`kim command <kim_commands>`
|
||||
* hybrid potentials: multiple pair, bond, angle, dihedral, improper potentials can be used in one simulation
|
||||
* overlaid potentials: superposition of multiple pair potentials
|
||||
* overlaid potentials: superposition of multiple pair potentials (including many-body) with optional scale factor
|
||||
|
||||
.. _create:
|
||||
|
||||
@ -124,9 +130,10 @@ Ensembles, constraints, and boundary conditions
|
||||
* harmonic (umbrella) constraint forces
|
||||
* rigid body constraints
|
||||
* SHAKE bond and angle constraints
|
||||
* Monte Carlo bond breaking, formation, swapping
|
||||
* motion constraints to manifold surfaces
|
||||
* Monte Carlo bond breaking, formation, swapping, template based reaction modeling
|
||||
* atom/molecule insertion and deletion
|
||||
* walls of various kinds
|
||||
* walls of various kinds, static and moving
|
||||
* non-equilibrium molecular dynamics (NEMD)
|
||||
* variety of additional boundary conditions and constraints
|
||||
|
||||
@ -150,6 +157,7 @@ Diagnostics
|
||||
^^^^^^^^^^^
|
||||
|
||||
* see various flavors of the :doc:`fix <fix>` and :doc:`compute <compute>` commands
|
||||
* introspection command for system, simulation, and compile time settings and configurations
|
||||
|
||||
.. _output:
|
||||
|
||||
@ -164,8 +172,9 @@ Output
|
||||
* parallel I/O of dump and restart files
|
||||
* per-atom quantities (energy, stress, centro-symmetry parameter, CNA, etc)
|
||||
* user-defined system-wide (log file) or per-atom (dump file) calculations
|
||||
* spatial and time averaging of per-atom quantities
|
||||
* time averaging of system-wide quantities
|
||||
* custom partitioning (chunks) for binning, and static or dynamic grouping of atoms for analysis
|
||||
* spatial, time, and per-chunk averaging of per-atom quantities
|
||||
* time averaging and histogramming of system-wide quantities
|
||||
* atom snapshots in native, XYZ, XTC, DCD, CFG formats
|
||||
|
||||
.. _replica1:
|
||||
@ -178,7 +187,7 @@ Multi-replica models
|
||||
* :doc:`parallel replica dynamics <prd>`
|
||||
* :doc:`temperature accelerated dynamics <tad>`
|
||||
* :doc:`parallel tempering <temper>`
|
||||
* :doc:`path-integral MD <fix_pimd>`
|
||||
* path-integral MD: `first variant <fix_pimd>`, `second variant <fix_ipi>`
|
||||
* multi-walker collective variables with :doc:`Colvars <fix_colvars>` and :doc:`Plumed <fix_plumed>`
|
||||
|
||||
.. _prepost:
|
||||
@ -210,11 +219,12 @@ page for details.
|
||||
These are LAMMPS capabilities which you may not think of as typical
|
||||
classical MD options:
|
||||
|
||||
* :doc:`static <balance>` and :doc:`dynamic load-balancing <fix_balance>`
|
||||
* :doc:`static <balance>` and :doc:`dynamic load-balancing <fix_balance>`, optional with recursive bisectioning decomposition
|
||||
* :doc:`generalized aspherical particles <Howto_body>`
|
||||
* :doc:`stochastic rotation dynamics (SRD) <fix_srd>`
|
||||
* :doc:`real-time visualization and interactive MD <fix_imd>`
|
||||
* :doc:`real-time visualization and interactive MD <fix_imd>`, :doc:`built-in renderer for images and movies <dump_image>`
|
||||
* calculate :doc:`virtual diffraction patterns <compute_xrd>`
|
||||
* calculate :doc:`finite temperature phonon dispersion <fix_phonon>` and the :doc:`dynamical matrix of minimized structures <dynamical_matrix>`
|
||||
* :doc:`atom-to-continuum coupling <fix_atc>` with finite elements
|
||||
* coupled rigid body integration via the :doc:`POEMS <fix_poems>` library
|
||||
* :doc:`QM/MM coupling <fix_qmmm>`
|
||||
|
||||
@ -1,40 +1,61 @@
|
||||
LAMMPS open-source license
|
||||
--------------------------
|
||||
|
||||
LAMMPS is a freely-available open-source code, distributed under the
|
||||
terms of the `GNU Public License Version 2 <gpl_>`_, which means you can
|
||||
use or modify the code however you wish for your own purposes, but have
|
||||
to adhere to certain rules when redistributing it or software derived
|
||||
GPL version of LAMMPS
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
LAMMPS is an open-source code, available free-of-charge, and distributed
|
||||
under the terms of the `GNU Public License Version 2 <gpl_>`_ (GPLv2),
|
||||
which means you can use or modify the code however you wish for your own
|
||||
purposes, but have to adhere to certain rules when redistributing it -
|
||||
specifically in binary form - or are distributing software derived
|
||||
from it or that includes parts of it.
|
||||
|
||||
LAMMPS comes with no warranty of any kind. As each source file states
|
||||
in its header, it is a copyrighted code that is distributed free-of-
|
||||
charge, under the terms of the `GNU Public License Version 2 <gpl_>`_
|
||||
(GPLv2). This is often referred to as open-source distribution - see
|
||||
`www.gnu.org <gnuorg_>`_ or `www.opensource.org <opensource_>`_. The
|
||||
legal text of the GPL is in the LICENSE file included in the LAMMPS
|
||||
distribution.
|
||||
LAMMPS comes with no warranty of any kind.
|
||||
|
||||
As each source file states in its header, it is a copyrighted code, and
|
||||
thus not in the public domain. For more information about open-source
|
||||
software and open-source distribution, see `www.gnu.org <gnuorg_>`_
|
||||
or `www.opensource.org <opensource_>`_. The legal text of the GPL as it
|
||||
applies to LAMMPS is in the LICENSE file included in the LAMMPS distribution.
|
||||
|
||||
.. _gpl: https://github.com/lammps/lammps/blob/master/LICENSE
|
||||
|
||||
.. _lgpl: https://www.gnu.org/licenses/old-licenses/lgpl-2.1.html
|
||||
|
||||
.. _gnuorg: http://www.gnu.org
|
||||
|
||||
.. _opensource: http://www.opensource.org
|
||||
|
||||
Here is a summary of what the GPL means for LAMMPS users:
|
||||
Here is a more specific summary of what the GPL means for LAMMPS users:
|
||||
|
||||
(1) Anyone is free to use, modify, or extend LAMMPS in any way they
|
||||
(1) Anyone is free to use, copy, modify, or extend LAMMPS in any way they
|
||||
choose, including for commercial purposes.
|
||||
|
||||
(2) If you **distribute** a modified version of LAMMPS, it must remain
|
||||
open-source, meaning you distribute **all** of it under the terms of
|
||||
the GPL. You should clearly annotate such a code as a derivative version
|
||||
of LAMMPS.
|
||||
open-source, meaning you are required to distribute **all** of it under
|
||||
the terms of the GPL. You should clearly annotate such a modified code
|
||||
as a derivative version of LAMMPS.
|
||||
|
||||
(3) If you release any code that includes or uses LAMMPS source code,
|
||||
then it must also be open-sourced, meaning you distribute it under
|
||||
the terms of the GPL.
|
||||
the terms of the GPL. You may write code that interfaces LAMMPS to
|
||||
a differently licensed library. In that case the code that provides
|
||||
the interface must be licensed GPL, but not necessarily that library
|
||||
unless you are distributing binaries that require the library to run.
|
||||
|
||||
(4) If you give LAMMPS files to someone else, the GPL LICENSE file and
|
||||
source file headers (including the copyright and GPL notices) should
|
||||
remain part of the code.
|
||||
|
||||
|
||||
LGPL version of LAMMPS
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
We occasionally make stable LAMMPS releases available under the `GNU
|
||||
Lesser Public License v2.1 <lgpl_>`_. This is on request only and with
|
||||
non-LGPL compliant files removed. This allows uses linking non-GPL
|
||||
compatible software with the (otherwise unmodified) LAMMPS library
|
||||
or loading it dynamically at runtime. Any **modifications** to
|
||||
the LAMMPS code however, even with the LGPL licensed version, must still
|
||||
be made available under the same open source terms as LAMMPS itself.
|
||||
|
||||
@ -10,24 +10,26 @@ conditions. It can model 2d or 3d systems with only a few particles
|
||||
up to millions or billions.
|
||||
|
||||
LAMMPS can be built and run on a laptop or desktop machine, but is
|
||||
designed for parallel computers. It will run on any parallel machine
|
||||
that supports the `MPI <mpi_>`_ message-passing library. This includes
|
||||
shared-memory boxes and distributed-memory clusters and
|
||||
supercomputers.
|
||||
designed for parallel computers. It will run in serial and on any
|
||||
parallel machine that supports the `MPI <mpi_>`_ message-passing
|
||||
library. This includes shared-memory boxes and distributed-memory
|
||||
clusters and supercomputers. Parts of LAMMPS also support
|
||||
`OpenMP multi-threading <omp_>`_, vectorization and GPU acceleration.
|
||||
|
||||
.. _mpi: https://en.wikipedia.org/wiki/Message_Passing_Interface
|
||||
.. _lws: https://www.lammps.org
|
||||
.. _omp: https://www.openmp.org
|
||||
|
||||
LAMMPS is written in C++ and requires a compiler that is at least
|
||||
compatible with the C++-11 standard.
|
||||
Earlier versions were written in F77 and F90. See the `History page
|
||||
compatible with the C++-11 standard. Earlier versions were written in
|
||||
F77, F90, and C++-98. See the `History page
|
||||
<https://www.lammps.org/history.html>`_ of the website for details. All
|
||||
versions can be downloaded from the `LAMMPS website <lws_>`_.
|
||||
versions can be downloaded as source code from the `LAMMPS website
|
||||
<lws_>`_.
|
||||
|
||||
LAMMPS is designed to be easy to modify or extend with new
|
||||
capabilities, such as new force fields, atom types, boundary
|
||||
conditions, or diagnostics. See the :doc:`Modify <Modify>` page for
|
||||
more details.
|
||||
LAMMPS is designed to be easy to modify or extend with new capabilities,
|
||||
such as new force fields, atom types, boundary conditions, or
|
||||
diagnostics. See the :doc:`Modify <Modify>` page for more details.
|
||||
|
||||
In the most general sense, LAMMPS integrates Newton's equations of
|
||||
motion for a collection of interacting particles. A single particle
|
||||
@ -47,4 +49,5 @@ MPI parallelization to partition the simulation domain into small
|
||||
sub-domains of equal computational cost, one of which is assigned to
|
||||
each processor. Processors communicate and store "ghost" atom
|
||||
information for atoms that border their sub-domain. Multi-threading
|
||||
parallelization with with particle-decomposition can be used in addition.
|
||||
parallelization and GPU acceleration with with particle-decomposition
|
||||
can be used in addition.
|
||||
|
||||
@ -2,12 +2,21 @@ What does a LAMMPS version mean
|
||||
-------------------------------
|
||||
|
||||
The LAMMPS "version" is the date when it was released, such as 1 May
|
||||
2014. LAMMPS is updated continuously. Whenever we fix a bug or add a
|
||||
feature, we release it in the next *patch* release, which are
|
||||
typically made every couple of weeks. Info on patch releases are on
|
||||
`this website page <https://www.lammps.org/bug.html>`_. Every few
|
||||
months, the latest patch release is subjected to more thorough testing
|
||||
and labeled as a *stable* version.
|
||||
2014. LAMMPS is updated continuously and we aim to keep it working
|
||||
correctly and reliably at all times. You can follow its development
|
||||
in a public `git repository on GitHub <https://github.com/lammps/lammps>`_.
|
||||
|
||||
Whenever we fix a bug or update or add a feature, it will be merged into
|
||||
the `master` branch of the git repository. When a sufficient number of
|
||||
changes have accumulated *and* the software passes a set of automated
|
||||
tests, we release it in the next *patch* release, which are made every
|
||||
few weeks. Info on patch releases are on `this website page
|
||||
<https://www.lammps.org/bug.html>`_.
|
||||
|
||||
Once or twice a year, only bug fixes and small, non-intrusive changes are
|
||||
included for a period of time, and the code is subjected to more detailed
|
||||
and thorough testing than the default automated testing. The latest
|
||||
patch release after such a period is then labeled as a *stable* version.
|
||||
|
||||
Each version of LAMMPS contains all the features and bug-fixes up to
|
||||
and including its version date.
|
||||
|
||||
@ -19,7 +19,7 @@ Syntax
|
||||
|
||||
bondmax = length of longest bond in the system (in length units)
|
||||
tlimit = elapsed CPU time (in seconds)
|
||||
diskfree = free disk space (in megabytes)
|
||||
diskfree = free disk space (in MBytes)
|
||||
v_name = name of :doc:`equal-style variable <variable>`
|
||||
|
||||
* operator = "<" or "<=" or ">" or ">=" or "==" or "!=" or "\|\^"
|
||||
@ -81,7 +81,7 @@ the timer frequently across a large number of processors may be
|
||||
non-negligible.
|
||||
|
||||
The *diskfree* attribute will check for available disk space (in
|
||||
megabytes) on supported operating systems. By default it will
|
||||
MBytes) on supported operating systems. By default it will
|
||||
check the file system of the current working directory. This
|
||||
can be changed with the optional *path* keyword, which will take
|
||||
the path to a file or folder on the file system to be checked
|
||||
|
||||
@ -128,7 +128,7 @@ spectrum while consumes more memory. With fixed *f_max* and
|
||||
:math:`\gamma`, *N_f* should be big enough to converge the classical
|
||||
temperature :math:`T^{cl}` as a function of target quantum bath
|
||||
temperature. Memory usage per processor could be from 10 to 100
|
||||
Mbytes.
|
||||
MBytes.
|
||||
|
||||
.. note::
|
||||
|
||||
|
||||
@ -135,7 +135,7 @@ with #) anywhere. Each non-blank non-comment line must contain one
|
||||
keyword/value pair. The required keywords are *rcutfac* and
|
||||
*twojmax*\ . Optional keywords are *rfac0*, *rmin0*,
|
||||
*switchflag*, *bzeroflag*, *quadraticflag*, *chemflag*,
|
||||
*bnormflag*, *wselfallflag*, and *chunksize*\ .
|
||||
*bnormflag*, *wselfallflag*, *chunksize*, and *parallelthresh*\ .
|
||||
|
||||
The default values for these keywords are
|
||||
|
||||
@ -147,7 +147,8 @@ The default values for these keywords are
|
||||
* *chemflag* = 0
|
||||
* *bnormflag* = 0
|
||||
* *wselfallflag* = 0
|
||||
* *chunksize* = 4096
|
||||
* *chunksize* = 32768
|
||||
* *parallelthresh* = 8192
|
||||
|
||||
If *quadraticflag* is set to 1, then the SNAP energy expression includes
|
||||
additional quadratic terms that have been shown to increase the overall
|
||||
@ -188,14 +189,24 @@ corresponding *K*-vector of linear coefficients for element
|
||||
which must equal the number of unique elements appearing in the LAMMPS
|
||||
pair_coeff command, to avoid ambiguity in the number of coefficients.
|
||||
|
||||
The keyword *chunksize* is only applicable when using the
|
||||
pair style *snap* with the KOKKOS package and is ignored otherwise.
|
||||
This keyword controls
|
||||
The keywords *chunksize* and *parallelthresh* are only applicable when
|
||||
using the pair style *snap* with the KOKKOS package on GPUs and are
|
||||
ignored otherwise.
|
||||
The *chunksize* keyword controls
|
||||
the number of atoms in each pass used to compute the bispectrum
|
||||
components and is used to avoid running out of memory. For example
|
||||
if there are 8192 atoms in the simulation and the *chunksize*
|
||||
is set to 4096, the bispectrum calculation will be broken up
|
||||
into two passes.
|
||||
into two passes (running on a single GPU).
|
||||
The *parallelthresh* keyword controls
|
||||
a crossover threshold for performing extra parallelism. For
|
||||
small systems, exposing additional parallism can be beneficial when
|
||||
there is not enough work to fully saturate the GPU threads otherwise.
|
||||
However, the extra parallelism also leads to more divergence
|
||||
and can hurt performance when the system is already large enough to
|
||||
saturate the GPU threads. Extra parallelism will be performed if the
|
||||
*chunksize* (or total number of atoms per GPU) is smaller than
|
||||
*parallelthresh*.
|
||||
|
||||
Detailed definitions for all the other keywords
|
||||
are given on the :doc:`compute sna/atom <compute_sna_atom>` doc page.
|
||||
|
||||
@ -1174,6 +1174,7 @@ googletest
|
||||
Gordan
|
||||
Goudeau
|
||||
GPa
|
||||
GPL
|
||||
gpu
|
||||
gpuID
|
||||
gpus
|
||||
@ -1689,6 +1690,7 @@ Lett
|
||||
Leuven
|
||||
Leven
|
||||
Lewy
|
||||
LGPL
|
||||
lgvdw
|
||||
Liang
|
||||
libatc
|
||||
@ -1889,7 +1891,6 @@ maxX
|
||||
Mayergoyz
|
||||
Mayoral
|
||||
mbt
|
||||
Mbytes
|
||||
MBytes
|
||||
mc
|
||||
McLachlan
|
||||
|
||||
@ -44,7 +44,8 @@ struct TagPairSNAPComputeForce{};
|
||||
struct TagPairSNAPComputeNeigh{};
|
||||
struct TagPairSNAPComputeCayleyKlein{};
|
||||
struct TagPairSNAPPreUi{};
|
||||
struct TagPairSNAPComputeUi{};
|
||||
struct TagPairSNAPComputeUiSmall{}; // more parallelism, more divergence
|
||||
struct TagPairSNAPComputeUiLarge{}; // less parallelism, no divergence
|
||||
struct TagPairSNAPTransformUi{}; // re-order ulisttot from SoA to AoSoA, zero ylist
|
||||
struct TagPairSNAPComputeZi{};
|
||||
struct TagPairSNAPBeta{};
|
||||
@ -53,7 +54,9 @@ struct TagPairSNAPTransformBi{}; // re-order blist from AoSoA to AoS
|
||||
struct TagPairSNAPComputeYi{};
|
||||
struct TagPairSNAPComputeYiWithZlist{};
|
||||
template<int dir>
|
||||
struct TagPairSNAPComputeFusedDeidrj{};
|
||||
struct TagPairSNAPComputeFusedDeidrjSmall{}; // more parallelism, more divergence
|
||||
template<int dir>
|
||||
struct TagPairSNAPComputeFusedDeidrjLarge{}; // less parallelism, no divergence
|
||||
|
||||
// CPU backend only
|
||||
struct TagPairSNAPComputeNeighCPU{};
|
||||
@ -143,7 +146,10 @@ public:
|
||||
void operator() (TagPairSNAPPreUi,const int iatom_mod, const int j, const int iatom_div) const;
|
||||
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void operator() (TagPairSNAPComputeUi,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUi>::member_type& team) const;
|
||||
void operator() (TagPairSNAPComputeUiSmall,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUiSmall>::member_type& team) const;
|
||||
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void operator() (TagPairSNAPComputeUiLarge,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUiLarge>::member_type& team) const;
|
||||
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void operator() (TagPairSNAPTransformUi,const int iatom_mod, const int j, const int iatom_div) const;
|
||||
@ -168,7 +174,11 @@ public:
|
||||
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void operator() (TagPairSNAPComputeFusedDeidrj<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrj<dir> >::member_type& team) const;
|
||||
void operator() (TagPairSNAPComputeFusedDeidrjSmall<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrjSmall<dir> >::member_type& team) const;
|
||||
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void operator() (TagPairSNAPComputeFusedDeidrjLarge<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrjLarge<dir> >::member_type& team) const;
|
||||
|
||||
// CPU backend only
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
|
||||
@ -341,18 +341,32 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::compute(int eflag_in,
|
||||
// ComputeUi w/vector parallelism, shared memory, direct atomicAdd into ulisttot
|
||||
{
|
||||
// team_size_compute_ui is defined in `pair_snap_kokkos.h`
|
||||
|
||||
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
|
||||
const int tile_size = vector_length * (twojmax + 1);
|
||||
const int scratch_size = scratch_size_helper<complex>(team_size_compute_ui * tile_size);
|
||||
|
||||
if (chunk_size < parallel_thresh)
|
||||
{
|
||||
// Version with parallelism over j_bend
|
||||
|
||||
// total number of teams needed: (natoms / 32) * (max_neighs) * ("bend" locations)
|
||||
const int n_teams = chunk_size_div * max_neighs * (twojmax + 1);
|
||||
const int n_teams_div = (n_teams + team_size_compute_ui - 1) / team_size_compute_ui;
|
||||
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUi> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUiSmall> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
|
||||
policy_ui = policy_ui.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
|
||||
Kokkos::parallel_for("ComputeUi",policy_ui,*this);
|
||||
Kokkos::parallel_for("ComputeUiSmall",policy_ui,*this);
|
||||
} else {
|
||||
// Version w/out parallelism over j_bend
|
||||
|
||||
// total number of teams needed: (natoms / 32) * (max_neighs)
|
||||
const int n_teams = chunk_size_div * max_neighs;
|
||||
const int n_teams_div = (n_teams + team_size_compute_ui - 1) / team_size_compute_ui;
|
||||
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUiLarge> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
|
||||
policy_ui = policy_ui.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
|
||||
Kokkos::parallel_for("ComputeUiLarge",policy_ui,*this);
|
||||
}
|
||||
}
|
||||
|
||||
//TransformUi: un-"fold" ulisttot, zero ylist
|
||||
@ -412,25 +426,51 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::compute(int eflag_in,
|
||||
const int tile_size = vector_length * (twojmax + 1);
|
||||
const int scratch_size = scratch_size_helper<complex>(2 * team_size_compute_fused_deidrj * tile_size);
|
||||
|
||||
if (chunk_size < parallel_thresh)
|
||||
{
|
||||
// Version with parallelism over j_bend
|
||||
|
||||
// total number of teams needed: (natoms / 32) * (max_neighs) * ("bend" locations)
|
||||
const int n_teams = chunk_size_div * max_neighs * (twojmax + 1);
|
||||
const int n_teams_div = (n_teams + team_size_compute_fused_deidrj - 1) / team_size_compute_fused_deidrj;
|
||||
|
||||
// x direction
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
policy_fused_deidrj_x = policy_fused_deidrj_x.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
|
||||
Kokkos::parallel_for("ComputeFusedDeidrj<0>",policy_fused_deidrj_x,*this);
|
||||
Kokkos::parallel_for("ComputeFusedDeidrjSmall<0>",policy_fused_deidrj_x,*this);
|
||||
|
||||
// y direction
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
policy_fused_deidrj_y = policy_fused_deidrj_y.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
|
||||
Kokkos::parallel_for("ComputeFusedDeidrj<1>",policy_fused_deidrj_y,*this);
|
||||
Kokkos::parallel_for("ComputeFusedDeidrjSmall<1>",policy_fused_deidrj_y,*this);
|
||||
|
||||
// z direction
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
policy_fused_deidrj_z = policy_fused_deidrj_z.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
|
||||
Kokkos::parallel_for("ComputeFusedDeidrj<2>",policy_fused_deidrj_z,*this);
|
||||
Kokkos::parallel_for("ComputeFusedDeidrjSmall<2>",policy_fused_deidrj_z,*this);
|
||||
} else {
|
||||
// Version w/out parallelism over j_bend
|
||||
|
||||
// total number of teams needed: (natoms / 32) * (max_neighs)
|
||||
const int n_teams = chunk_size_div * max_neighs;
|
||||
const int n_teams_div = (n_teams + team_size_compute_fused_deidrj - 1) / team_size_compute_fused_deidrj;
|
||||
|
||||
// x direction
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
policy_fused_deidrj_x = policy_fused_deidrj_x.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
|
||||
Kokkos::parallel_for("ComputeFusedDeidrjLarge<0>",policy_fused_deidrj_x,*this);
|
||||
|
||||
// y direction
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
policy_fused_deidrj_y = policy_fused_deidrj_y.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
|
||||
Kokkos::parallel_for("ComputeFusedDeidrjLarge<1>",policy_fused_deidrj_y,*this);
|
||||
|
||||
// z direction
|
||||
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
|
||||
policy_fused_deidrj_z = policy_fused_deidrj_z.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
|
||||
Kokkos::parallel_for("ComputeFusedDeidrjLarge<2>",policy_fused_deidrj_z,*this);
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
#endif // LMP_KOKKOS_GPU
|
||||
@ -603,13 +643,13 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
|
||||
const auto idxb = icoeff % idxb_max;
|
||||
const auto idx_chem = icoeff / idxb_max;
|
||||
auto bveci = my_sna.blist(idxb, idx_chem, ii);
|
||||
real_type bveci = my_sna.blist(ii, idx_chem, idxb);
|
||||
d_beta_pack(iatom_mod,icoeff,iatom_div) += d_coeffi[k]*bveci;
|
||||
k++;
|
||||
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
|
||||
const auto jdxb = jcoeff % idxb_max;
|
||||
const auto jdx_chem = jcoeff / idxb_max;
|
||||
real_type bvecj = my_sna.blist(jdxb, jdx_chem, ii);
|
||||
real_type bvecj = my_sna.blist(ii, jdx_chem, jdxb);
|
||||
d_beta_pack(iatom_mod,icoeff,iatom_div) += d_coeffi[k]*bvecj;
|
||||
d_beta_pack(iatom_mod,jcoeff,iatom_div) += d_coeffi[k]*bveci;
|
||||
k++;
|
||||
@ -736,7 +776,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUi,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUi>::member_type& team) const {
|
||||
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUiSmall,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUiSmall>::member_type& team) const {
|
||||
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
|
||||
|
||||
// extract flattened atom_div / neighbor number / bend location
|
||||
@ -756,11 +796,37 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
const int ninside = d_ninside(ii);
|
||||
if (jj >= ninside) return;
|
||||
|
||||
my_sna.compute_ui(team,iatom_mod, jbend, jj, iatom_div);
|
||||
my_sna.compute_ui_small(team, iatom_mod, jbend, jj, iatom_div);
|
||||
});
|
||||
|
||||
}
|
||||
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUiLarge,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUiLarge>::member_type& team) const {
|
||||
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
|
||||
|
||||
// extract flattened atom_div / neighbor number / bend location
|
||||
int flattened_idx = team.team_rank() + team.league_rank() * team_size_compute_ui;
|
||||
|
||||
// extract neighbor index, iatom_div
|
||||
int iatom_div = flattened_idx / max_neighs; // removed "const" to work around GCC 7 bug
|
||||
int jj = flattened_idx - iatom_div * max_neighs;
|
||||
|
||||
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, vector_length),
|
||||
[&] (const int iatom_mod) {
|
||||
const int ii = iatom_mod + vector_length * iatom_div;
|
||||
if (ii >= chunk_size) return;
|
||||
|
||||
const int ninside = d_ninside(ii);
|
||||
if (jj >= ninside) return;
|
||||
|
||||
my_sna.compute_ui_large(team,iatom_mod, jj, iatom_div);
|
||||
});
|
||||
|
||||
}
|
||||
|
||||
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPTransformUi,const int iatom_mod, const int idxu, const int iatom_div) const {
|
||||
@ -861,9 +927,9 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
|
||||
for (int itriple = 0; itriple < ntriples; itriple++) {
|
||||
|
||||
const auto blocal = my_sna.blist_pack(iatom_mod, idxb, itriple, iatom_div);
|
||||
const real_type blocal = my_sna.blist_pack(iatom_mod, idxb, itriple, iatom_div);
|
||||
|
||||
my_sna.blist(idxb, itriple, iatom) = blocal;
|
||||
my_sna.blist(iatom, itriple, idxb) = blocal;
|
||||
}
|
||||
|
||||
}
|
||||
@ -871,7 +937,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrj<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrj<dir> >::member_type& team) const {
|
||||
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrjSmall<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrjSmall<dir> >::member_type& team) const {
|
||||
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
|
||||
|
||||
// extract flattened atom_div / neighbor number / bend location
|
||||
@ -891,12 +957,38 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
const int ninside = d_ninside(ii);
|
||||
if (jj >= ninside) return;
|
||||
|
||||
my_sna.template compute_fused_deidrj<dir>(team, iatom_mod, jbend, jj, iatom_div);
|
||||
my_sna.template compute_fused_deidrj_small<dir>(team, iatom_mod, jbend, jj, iatom_div);
|
||||
|
||||
});
|
||||
|
||||
}
|
||||
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrjLarge<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrjLarge<dir> >::member_type& team) const {
|
||||
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
|
||||
|
||||
// extract flattened atom_div / neighbor number / bend location
|
||||
int flattened_idx = team.team_rank() + team.league_rank() * team_size_compute_fused_deidrj;
|
||||
|
||||
// extract neighbor index, iatom_div
|
||||
int iatom_div = flattened_idx / max_neighs; // removed "const" to work around GCC 7 bug
|
||||
int jj = flattened_idx - max_neighs * iatom_div;
|
||||
|
||||
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, vector_length),
|
||||
[&] (const int iatom_mod) {
|
||||
const int ii = iatom_mod + vector_length * iatom_div;
|
||||
if (ii >= chunk_size) return;
|
||||
|
||||
const int ninside = d_ninside(ii);
|
||||
if (jj >= ninside) return;
|
||||
|
||||
my_sna.template compute_fused_deidrj_large<dir>(team, iatom_mod, jj, iatom_div);
|
||||
|
||||
});
|
||||
}
|
||||
|
||||
/* ----------------------------------------------------------------------
|
||||
Begin routines that are unique to the CPU codepath. These do not take
|
||||
advantage of AoSoA data layouts, but that could be a good point of
|
||||
@ -925,13 +1017,13 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
|
||||
const auto idxb = icoeff % idxb_max;
|
||||
const auto idx_chem = icoeff / idxb_max;
|
||||
auto bveci = my_sna.blist(idxb,idx_chem,ii);
|
||||
real_type bveci = my_sna.blist(ii,idx_chem,idxb);
|
||||
d_beta(icoeff,ii) += d_coeffi[k]*bveci;
|
||||
k++;
|
||||
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
|
||||
const auto jdxb = jcoeff % idxb_max;
|
||||
const auto jdx_chem = jcoeff / idxb_max;
|
||||
auto bvecj = my_sna.blist(jdxb,jdx_chem,ii);
|
||||
real_type bvecj = my_sna.blist(ii,jdx_chem,jdxb);
|
||||
d_beta(icoeff,ii) += d_coeffi[k]*bvecj;
|
||||
d_beta(jcoeff,ii) += d_coeffi[k]*bveci;
|
||||
k++;
|
||||
@ -1221,7 +1313,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
|
||||
const auto idxb = icoeff % idxb_max;
|
||||
const auto idx_chem = icoeff / idxb_max;
|
||||
evdwl += d_coeffi[icoeff+1]*my_sna.blist(idxb,idx_chem,ii);
|
||||
evdwl += d_coeffi[icoeff+1]*my_sna.blist(ii,idx_chem,idxb);
|
||||
}
|
||||
|
||||
// quadratic contributions
|
||||
@ -1230,12 +1322,12 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
|
||||
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
|
||||
const auto idxb = icoeff % idxb_max;
|
||||
const auto idx_chem = icoeff / idxb_max;
|
||||
auto bveci = my_sna.blist(idxb,idx_chem,ii);
|
||||
real_type bveci = my_sna.blist(ii,idx_chem,idxb);
|
||||
evdwl += 0.5*d_coeffi[k++]*bveci*bveci;
|
||||
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
|
||||
auto jdxb = jcoeff % idxb_max;
|
||||
auto jdx_chem = jcoeff / idxb_max;
|
||||
auto bvecj = my_sna.blist(jdxb,jdx_chem,ii);
|
||||
auto bvecj = my_sna.blist(ii,jdx_chem,jdxb);
|
||||
evdwl += d_coeffi[k++]*bveci*bvecj;
|
||||
}
|
||||
}
|
||||
|
||||
@ -45,12 +45,12 @@ struct WignerWrapper {
|
||||
{ ; }
|
||||
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
complex get(const int& ma) {
|
||||
complex get(const int& ma) const {
|
||||
return complex(buffer[offset + 2 * vector_length * ma], buffer[offset + vector_length + 2 * vector_length * ma]);
|
||||
}
|
||||
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void set(const int& ma, const complex& store) {
|
||||
void set(const int& ma, const complex& store) const {
|
||||
buffer[offset + 2 * vector_length * ma] = store.re;
|
||||
buffer[offset + vector_length + 2 * vector_length * ma] = store.im;
|
||||
}
|
||||
@ -122,8 +122,14 @@ inline
|
||||
void compute_cayley_klein(const int&, const int&, const int&);
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void pre_ui(const int&, const int&, const int&, const int&); // ForceSNAP
|
||||
|
||||
// version of the code with parallelism over j_bend
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_ui(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); // ForceSNAP
|
||||
void compute_ui_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); // ForceSNAP
|
||||
// version of the code without parallelism over j_bend
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_ui_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int); // ForceSNAP
|
||||
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_zi(const int&, const int&, const int&); // ForceSNAP
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
@ -135,6 +141,35 @@ inline
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_bi(const int&, const int&, const int&); // ForceSNAP
|
||||
|
||||
// functions for derivatives, GPU only
|
||||
// version of the code with parallelism over j_bend
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_fused_deidrj_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); //ForceSNAP
|
||||
// version of the code without parallelism over j_bend
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_fused_deidrj_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int); //ForceSNAP
|
||||
|
||||
// core "evaluation" functions that get plugged into "compute" functions
|
||||
// plugged into compute_ui_small, compute_ui_large
|
||||
KOKKOS_FORCEINLINE_FUNCTION
|
||||
void evaluate_ui_jbend(const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&, const int&,
|
||||
const int&, const int&, const int&);
|
||||
// plugged into compute_zi, compute_yi
|
||||
KOKKOS_FORCEINLINE_FUNCTION
|
||||
complex evaluate_zi(const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&,
|
||||
const int&, const int&, const int&, const int&, const real_type*);
|
||||
// plugged into compute_yi, compute_yi_with_zlist
|
||||
KOKKOS_FORCEINLINE_FUNCTION
|
||||
real_type evaluate_beta_scaled(const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&,
|
||||
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &);
|
||||
// plugged into compute_fused_deidrj_small, compute_fused_deidrj_large
|
||||
KOKKOS_FORCEINLINE_FUNCTION
|
||||
real_type evaluate_duidrj_jbend(const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&,
|
||||
const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&,
|
||||
const int&, const int&, const int&, const int&);
|
||||
|
||||
// functions for bispectrum coefficients, CPU only
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void pre_ui_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team,const int&,const int&); // ForceSNAP
|
||||
@ -148,11 +183,6 @@ inline
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_bi_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int); // ForceSNAP
|
||||
|
||||
// functions for derivatives, GPU only
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_fused_deidrj(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); //ForceSNAP
|
||||
|
||||
// functions for derivatives, CPU only
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_duidrj_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int, int); //ForceSNAP
|
||||
@ -168,23 +198,6 @@ inline
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void compute_s_dsfac(const real_type, const real_type, real_type&, real_type&); // compute_cayley_klein
|
||||
|
||||
static KOKKOS_FORCEINLINE_FUNCTION
|
||||
void sincos_wrapper(double x, double* sin_, double *cos_) {
|
||||
#ifdef __SYCL_DEVICE_ONLY__
|
||||
*sin_ = sycl::sincos(x, cos_);
|
||||
#else
|
||||
sincos(x, sin_, cos_);
|
||||
#endif
|
||||
}
|
||||
static KOKKOS_FORCEINLINE_FUNCTION
|
||||
void sincos_wrapper(float x, float* sin_, float *cos_) {
|
||||
#ifdef __SYCL_DEVICE_ONLY__
|
||||
*sin_ = sycl::sincos(x, cos_);
|
||||
#else
|
||||
sincosf(x, sin_, cos_);
|
||||
#endif
|
||||
}
|
||||
|
||||
#ifdef TIMING_INFO
|
||||
double* timers;
|
||||
timespec starttime, endtime;
|
||||
@ -207,7 +220,7 @@ inline
|
||||
|
||||
int twojmax, diagonalstyle;
|
||||
|
||||
t_sna_3d_ll blist;
|
||||
t_sna_3d blist;
|
||||
t_sna_3c_ll ulisttot;
|
||||
t_sna_3c_ll ulisttot_full; // un-folded ulisttot, cpu only
|
||||
t_sna_3c_ll zlist;
|
||||
|
||||
@ -316,7 +316,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::grow_rij(int newnatom, int
|
||||
ulist = t_sna_3c_ll(Kokkos::NoInit("sna:ulist"),1,1,1);
|
||||
zlist = t_sna_3c_ll(Kokkos::NoInit("sna:zlist"),1,1,1);
|
||||
zlist_pack = t_sna_4c_ll(Kokkos::NoInit("sna:zlist_pack"),vector_length,idxz_max,ndoubles,natom_div);
|
||||
blist = t_sna_3d_ll(Kokkos::NoInit("sna:blist"),idxb_max,ntriples,natom);
|
||||
blist = t_sna_3d(Kokkos::NoInit("sna:blist"),natom,ntriples,idxb_max);
|
||||
blist_pack = t_sna_4d_ll(Kokkos::NoInit("sna:blist_pack"),vector_length,idxb_max,ntriples,natom_div);
|
||||
ylist = t_sna_3c_ll(Kokkos::NoInit("sna:ylist"),1,1,1);
|
||||
ylist_pack_re = t_sna_4d_ll(Kokkos::NoInit("sna:ylist_pack_re"),vector_length,idxu_half_max,nelements,natom_div);
|
||||
@ -337,7 +337,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::grow_rij(int newnatom, int
|
||||
ulist = t_sna_3c_ll(Kokkos::NoInit("sna:ulist"),idxu_cache_max,natom,nmax);
|
||||
zlist = t_sna_3c_ll(Kokkos::NoInit("sna:zlist"),idxz_max,ndoubles,natom);
|
||||
zlist_pack = t_sna_4c_ll(Kokkos::NoInit("sna:zlist_pack"),1,1,1,1);
|
||||
blist = t_sna_3d_ll(Kokkos::NoInit("sna:blist"),idxb_max,ntriples,natom);
|
||||
blist = t_sna_3d(Kokkos::NoInit("sna:blist"),natom,ntriples,idxb_max);
|
||||
blist_pack = t_sna_4d_ll(Kokkos::NoInit("sna:blist_pack"),1,1,1,1);
|
||||
ylist = t_sna_3c_ll(Kokkos::NoInit("sna:ylist"),idxu_half_max,nelements,natom);
|
||||
ylist_pack_re = t_sna_4d_ll(Kokkos::NoInit("sna:ylist_pack_re"),1,1,1,1);
|
||||
@ -365,44 +365,44 @@ KOKKOS_INLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_cayley_klein(const int& iatom_mod, const int& jnbor, const int& iatom_div)
|
||||
{
|
||||
const int iatom = iatom_mod + vector_length * iatom_div;
|
||||
const auto x = rij(iatom,jnbor,0);
|
||||
const auto y = rij(iatom,jnbor,1);
|
||||
const auto z = rij(iatom,jnbor,2);
|
||||
const auto rsq = x * x + y * y + z * z;
|
||||
const auto r = sqrt(rsq);
|
||||
const auto rcut = rcutij(iatom, jnbor);
|
||||
const auto rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcut - rmin0);
|
||||
const auto theta0 = (r - rmin0) * rscale0;
|
||||
real_type sn, cs;
|
||||
sincos_wrapper(theta0, &sn, &cs);
|
||||
const real_type x = rij(iatom,jnbor,0);
|
||||
const real_type y = rij(iatom,jnbor,1);
|
||||
const real_type z = rij(iatom,jnbor,2);
|
||||
const real_type rsq = x * x + y * y + z * z;
|
||||
const real_type r = sqrt(rsq);
|
||||
const real_type rcut = rcutij(iatom, jnbor);
|
||||
const real_type rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcut - rmin0);
|
||||
const real_type theta0 = (r - rmin0) * rscale0;
|
||||
const real_type sn = sin(theta0);
|
||||
const real_type cs = cos(theta0);
|
||||
const real_type z0 = r * cs / sn;
|
||||
const real_type dz0dr = z0 / r - (r*rscale0) * (rsq + z0 * z0) / rsq;
|
||||
|
||||
const auto wj_local = wj(iatom, jnbor);
|
||||
const real_type wj_local = wj(iatom, jnbor);
|
||||
real_type sfac, dsfac;
|
||||
compute_s_dsfac(r, rcut, sfac, dsfac);
|
||||
sfac *= wj_local;
|
||||
dsfac *= wj_local;
|
||||
|
||||
const auto rinv = static_cast<real_type>(1.0) / r;
|
||||
const auto ux = x * rinv;
|
||||
const auto uy = y * rinv;
|
||||
const auto uz = z * rinv;
|
||||
const real_type rinv = static_cast<real_type>(1.0) / r;
|
||||
const real_type ux = x * rinv;
|
||||
const real_type uy = y * rinv;
|
||||
const real_type uz = z * rinv;
|
||||
|
||||
const auto r0inv = static_cast<real_type>(1.0) / sqrt(r * r + z0 * z0);
|
||||
const real_type r0inv = static_cast<real_type>(1.0) / sqrt(r * r + z0 * z0);
|
||||
|
||||
const complex a = { z0 * r0inv, -z * r0inv };
|
||||
const complex b = { r0inv * y, -r0inv * x };
|
||||
|
||||
const auto dr0invdr = -r0inv * r0inv * r0inv * (r + z0 * dz0dr);
|
||||
const real_type dr0invdr = -r0inv * r0inv * r0inv * (r + z0 * dz0dr);
|
||||
|
||||
const auto dr0invx = dr0invdr * ux;
|
||||
const auto dr0invy = dr0invdr * uy;
|
||||
const auto dr0invz = dr0invdr * uz;
|
||||
const real_type dr0invx = dr0invdr * ux;
|
||||
const real_type dr0invy = dr0invdr * uy;
|
||||
const real_type dr0invz = dr0invdr * uz;
|
||||
|
||||
const auto dz0x = dz0dr * ux;
|
||||
const auto dz0y = dz0dr * uy;
|
||||
const auto dz0z = dz0dr * uz;
|
||||
const real_type dz0x = dz0dr * ux;
|
||||
const real_type dz0y = dz0dr * uy;
|
||||
const real_type dz0z = dz0dr * uz;
|
||||
|
||||
const complex dax = { dz0x * r0inv + z0 * dr0invx, -z * dr0invx };
|
||||
const complex day = { dz0y * r0inv + z0 * dr0invy, -z * dr0invy };
|
||||
@ -412,9 +412,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_cayley_klein(const
|
||||
const complex dby = { y * dr0invy + r0inv, -x * dr0invy };
|
||||
const complex dbz = { y * dr0invz, -x * dr0invz };
|
||||
|
||||
const auto dsfacux = dsfac * ux;
|
||||
const auto dsfacuy = dsfac * uy;
|
||||
const auto dsfacuz = dsfac * uz;
|
||||
const real_type dsfacux = dsfac * ux;
|
||||
const real_type dsfacuy = dsfac * uy;
|
||||
const real_type dsfacuz = dsfac * uz;
|
||||
|
||||
a_pack(iatom_mod,jnbor,iatom_div) = a;
|
||||
b_pack(iatom_mod,jnbor,iatom_div) = b;
|
||||
@ -479,17 +479,13 @@ void SNAKokkos<DeviceType, real_type, vector_length>::pre_ui(const int& iatom_mo
|
||||
accumulating to the total. GPU only.
|
||||
------------------------------------------------------------------------- */
|
||||
|
||||
// Version of the code that exposes additional parallelism by threading over `j_bend` values
|
||||
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
|
||||
{
|
||||
|
||||
// utot(j,ma,mb) = 0 for all j,ma,ma
|
||||
// utot(j,ma,ma) = 1 for all j,ma
|
||||
// for j in neighbors of i:
|
||||
// compute r0 = (x,y,z,z0)
|
||||
// utot(j,ma,mb) += u(r0;j,ma,mb) for all j,ma,mb
|
||||
|
||||
// get shared memory offset
|
||||
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
|
||||
const int tile_size = vector_length * (twojmax + 1);
|
||||
@ -498,13 +494,12 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
|
||||
const int scratch_shift = team_rank * tile_size;
|
||||
|
||||
// extract and wrap
|
||||
WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
|
||||
|
||||
const WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
|
||||
|
||||
// load parameters
|
||||
const auto a = a_pack(iatom_mod, jnbor, iatom_div);
|
||||
const auto b = b_pack(iatom_mod, jnbor, iatom_div);
|
||||
const auto sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
|
||||
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
|
||||
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
|
||||
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
|
||||
|
||||
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
|
||||
|
||||
@ -512,12 +507,59 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
|
||||
// this for loop is here for context --- we expose additional
|
||||
// parallelism over this loop instead
|
||||
//for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
|
||||
evaluate_ui_jbend(ulist_wrapper, a, b, sfac, jelem, iatom_mod, j_bend, iatom_div);
|
||||
}
|
||||
|
||||
// Version of the code that loops over all `j_bend` values which reduces integer arithmetic
|
||||
// and some amount of load imbalance, at the expense of reducing parallelism
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int jnbor, const int iatom_div)
|
||||
{
|
||||
// get shared memory offset
|
||||
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
|
||||
const int tile_size = vector_length * (twojmax + 1);
|
||||
|
||||
const int team_rank = team.team_rank();
|
||||
const int scratch_shift = team_rank * tile_size;
|
||||
|
||||
// extract and wrap
|
||||
const WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
|
||||
|
||||
// load parameters
|
||||
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
|
||||
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
|
||||
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
|
||||
|
||||
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
|
||||
|
||||
// we need to "choose" when to bend
|
||||
#ifdef LMP_KK_DEVICE_COMPILE
|
||||
#pragma unroll
|
||||
#endif
|
||||
for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
|
||||
evaluate_ui_jbend(ulist_wrapper, a, b, sfac, jelem, iatom_mod, j_bend, iatom_div);
|
||||
}
|
||||
}
|
||||
|
||||
// Core "evaluation" kernel that gets reused in `compute_ui_small` and `compute_ui_large`
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_FORCEINLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::evaluate_ui_jbend(const WignerWrapper<real_type, vector_length>& ulist_wrapper,
|
||||
const complex& a, const complex& b, const real_type& sfac, const int& jelem,
|
||||
const int& iatom_mod, const int& j_bend, const int& iatom_div)
|
||||
{
|
||||
|
||||
// utot(j,ma,mb) = 0 for all j,ma,ma
|
||||
// utot(j,ma,ma) = 1 for all j,ma
|
||||
// for j in neighbors of i:
|
||||
// compute r0 = (x,y,z,z0)
|
||||
// utot(j,ma,mb) += u(r0;j,ma,mb) for all j,ma,mb
|
||||
|
||||
// level 0 is just 1.
|
||||
ulist_wrapper.set(0, complex::one());
|
||||
|
||||
// j from before the bend, don't store, mb == 0
|
||||
// this is "creeping up the side"
|
||||
for (int j = 1; j <= j_bend; j++) {
|
||||
|
||||
constexpr int mb = 0; // intentional for readability, compiler should optimize this out
|
||||
@ -601,12 +643,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
|
||||
Kokkos::atomic_add(&(ulisttot_im_pack(iatom_mod, jjup + ma, jelem, iatom_div)), ulist_prev.im * sfac);
|
||||
}
|
||||
|
||||
//} // end of "reference" loop over j_bend
|
||||
|
||||
|
||||
}
|
||||
|
||||
|
||||
/* ----------------------------------------------------------------------
|
||||
compute Zi by summing over products of Ui,
|
||||
AoSoA data layout to take advantage of coalescing, avoiding warp
|
||||
@ -634,47 +672,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_zi(const int& iato
|
||||
|
||||
for (int elem1 = 0; elem1 < nelements; elem1++) {
|
||||
for (int elem2 = 0; elem2 < nelements; elem2++) {
|
||||
complex ztmp = complex::zero();
|
||||
|
||||
int jju1 = idxu_block[j1] + (j1+1)*mb1min;
|
||||
int jju2 = idxu_block[j2] + (j2+1)*mb2max;
|
||||
int icgb = mb1min*(j2+1) + mb2max;
|
||||
|
||||
#ifdef LMP_KK_DEVICE_COMPILE
|
||||
#pragma unroll
|
||||
#endif
|
||||
for (int ib = 0; ib < nb; ib++) {
|
||||
|
||||
int ma1 = ma1min;
|
||||
int ma2 = ma2max;
|
||||
int icga = ma1min*(j2+1) + ma2max;
|
||||
|
||||
#ifdef LMP_KK_DEVICE_COMPILE
|
||||
#pragma unroll
|
||||
#endif
|
||||
for (int ia = 0; ia < na; ia++) {
|
||||
const auto utot1 = ulisttot_pack(iatom_mod, jju1+ma1, elem1, iatom_div);
|
||||
const auto utot2 = ulisttot_pack(iatom_mod, jju2+ma2, elem2, iatom_div);
|
||||
const auto cgcoeff_a = cgblock[icga];
|
||||
const auto cgcoeff_b = cgblock[icgb];
|
||||
ztmp.re += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
|
||||
ztmp.im += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
|
||||
ma1++;
|
||||
ma2--;
|
||||
icga += j2;
|
||||
} // end loop over ia
|
||||
|
||||
jju1 += j1 + 1;
|
||||
jju2 -= j2 + 1;
|
||||
icgb += j2;
|
||||
} // end loop over ib
|
||||
|
||||
if (bnorm_flag) {
|
||||
ztmp.re /= (j + 1);
|
||||
ztmp.im /= (j + 1);
|
||||
}
|
||||
|
||||
zlist_pack(iatom_mod,jjz,idouble,iatom_div) = ztmp;
|
||||
zlist_pack(iatom_mod,jjz,idouble,iatom_div) = evaluate_zi(j1, j2, j, ma1min, ma2max, mb1min, mb2max, na, nb, iatom_mod, elem1, elem2, iatom_div, cgblock);
|
||||
|
||||
idouble++;
|
||||
}
|
||||
@ -721,8 +720,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
|
||||
const int jju_index = jju+mb*(j+1)+ma;
|
||||
const int jjz_index = jjz+mb*(j+1)+ma;
|
||||
if (2*mb == j) return; // I think we can remove this?
|
||||
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
|
||||
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
|
||||
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
|
||||
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
|
||||
sumzu_temp += utot.re * zloc.re + utot.im * zloc.im;
|
||||
}
|
||||
}
|
||||
@ -737,8 +736,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
|
||||
const int jju_index = jju+(mb-1)*(j+1)+(j+1)+ma;
|
||||
const int jjz_index = jjz+(mb-1)*(j+1)+(j+1)+ma;
|
||||
|
||||
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
|
||||
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
|
||||
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
|
||||
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
|
||||
sumzu_temp += utot.re * zloc.re + utot.im * zloc.im;
|
||||
|
||||
}
|
||||
@ -748,8 +747,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
|
||||
const int jju_index = jju+(mb-1)*(j+1)+(j+1)+ma;
|
||||
const int jjz_index = jjz+(mb-1)*(j+1)+(j+1)+ma;
|
||||
|
||||
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
|
||||
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
|
||||
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
|
||||
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
|
||||
sumzu += static_cast<real_type>(0.5) * (utot.re * zloc.re + utot.im * zloc.im);
|
||||
} // end if jeven
|
||||
|
||||
@ -785,7 +784,6 @@ KOKKOS_INLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod, int jjz, int iatom_div,
|
||||
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack)
|
||||
{
|
||||
real_type betaj;
|
||||
|
||||
const int j1 = idxz(jjz, 0);
|
||||
const int j2 = idxz(jjz, 1);
|
||||
@ -805,46 +803,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod,
|
||||
for (int elem1 = 0; elem1 < nelements; elem1++) {
|
||||
for (int elem2 = 0; elem2 < nelements; elem2++) {
|
||||
|
||||
real_type ztmp_r = 0.0;
|
||||
real_type ztmp_i = 0.0;
|
||||
|
||||
int jju1 = idxu_block[j1] + (j1 + 1) * mb1min;
|
||||
int jju2 = idxu_block[j2] + (j2 + 1) * mb2max;
|
||||
int icgb = mb1min * (j2 + 1) + mb2max;
|
||||
|
||||
#ifdef LMP_KK_DEVICE_COMPILE
|
||||
#pragma unroll
|
||||
#endif
|
||||
for (int ib = 0; ib < nb; ib++) {
|
||||
|
||||
int ma1 = ma1min;
|
||||
int ma2 = ma2max;
|
||||
int icga = ma1min*(j2+1) + ma2max;
|
||||
|
||||
#ifdef LMP_KK_DEVICE_COMPILE
|
||||
#pragma unroll
|
||||
#endif
|
||||
for (int ia = 0; ia < na; ia++) {
|
||||
const auto utot1 = ulisttot_pack(iatom_mod,jju1+ma1,elem1,iatom_div);
|
||||
const auto utot2 = ulisttot_pack(iatom_mod,jju2+ma2,elem2,iatom_div);
|
||||
const auto cgcoeff_a = cgblock[icga];
|
||||
const auto cgcoeff_b = cgblock[icgb];
|
||||
ztmp_r += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
|
||||
ztmp_i += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
|
||||
ma1++;
|
||||
ma2--;
|
||||
icga += j2;
|
||||
} // end loop over ia
|
||||
|
||||
jju1 += j1 + 1;
|
||||
jju2 -= j2 + 1;
|
||||
icgb += j2;
|
||||
} // end loop over ib
|
||||
|
||||
if (bnorm_flag) {
|
||||
ztmp_r /= (j + 1);
|
||||
ztmp_i /= (j + 1);
|
||||
}
|
||||
const complex ztmp = evaluate_zi(j1, j2, j, ma1min, ma2max, mb1min, mb2max, na, nb, iatom_mod, elem1, elem2, iatom_div, cgblock);
|
||||
|
||||
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
|
||||
// find right y_list[jju] and beta(iatom,jjb) entries
|
||||
@ -853,30 +812,11 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod,
|
||||
|
||||
// pick out right beta value
|
||||
for (int elem3 = 0; elem3 < nelements; elem3++) {
|
||||
if (j >= j1) {
|
||||
const int jjb = idxb_block(j1, j2, j);
|
||||
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
|
||||
if (j1 == j) {
|
||||
if (j2 == j) betaj = 3 * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
else betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else if (j >= j2) {
|
||||
const int jjb = idxb_block(j, j2, j1);
|
||||
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
|
||||
if (j2 == j) betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else {
|
||||
const int jjb = idxb_block(j2, j, j1);
|
||||
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
|
||||
betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
}
|
||||
|
||||
if (!bnorm_flag && j1 > j)
|
||||
betaj *= (j1 + 1) / (j + 1.0);
|
||||
const real_type betaj = evaluate_beta_scaled(j1, j2, j, iatom_mod, elem1, elem2, elem3, iatom_div, beta_pack);
|
||||
|
||||
|
||||
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp_r);
|
||||
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp_i);
|
||||
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.re);
|
||||
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.im);
|
||||
} // end loop over elem3
|
||||
} // end loop over elem2
|
||||
} // end loop over elem1
|
||||
@ -893,7 +833,6 @@ KOKKOS_INLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_with_zlist(int iatom_mod, int jjz, int iatom_div,
|
||||
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack)
|
||||
{
|
||||
real_type betaj;
|
||||
const int j1 = idxz(jjz, 0);
|
||||
const int j2 = idxz(jjz, 1);
|
||||
const int j = idxz(jjz, 2);
|
||||
@ -901,49 +840,123 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_with_zlist(int
|
||||
int idouble = 0;
|
||||
for (int elem1 = 0; elem1 < nelements; elem1++) {
|
||||
for (int elem2 = 0; elem2 < nelements; elem2++) {
|
||||
auto ztmp = zlist_pack(iatom_mod,jjz,idouble,iatom_div);
|
||||
const complex ztmp = zlist_pack(iatom_mod,jjz,idouble,iatom_div);
|
||||
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
|
||||
// find right y_list[jju] and beta(iatom,jjb) entries
|
||||
// multiply and divide by j+1 factors
|
||||
// account for multiplicity of 1, 2, or 3
|
||||
// pick out right beta value
|
||||
for (int elem3 = 0; elem3 < nelements; elem3++) {
|
||||
if (j >= j1) {
|
||||
const int jjb = idxb_block(j1, j2, j);
|
||||
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
|
||||
if (j1 == j) {
|
||||
if (j2 == j) betaj = 3 * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
else betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else if (j >= j2) {
|
||||
const int jjb = idxb_block(j, j2, j1);
|
||||
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
|
||||
if (j2 == j) betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else {
|
||||
const int jjb = idxb_block(j2, j, j1);
|
||||
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
|
||||
betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
}
|
||||
if (!bnorm_flag && j1 > j)
|
||||
betaj *= (j1 + 1) / (j + 1.0);
|
||||
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp.re);
|
||||
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp.im);
|
||||
|
||||
const real_type betaj = evaluate_beta_scaled(j1, j2, j, iatom_mod, elem1, elem2, elem3, iatom_div, beta_pack);
|
||||
|
||||
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.re);
|
||||
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.im);
|
||||
} // end loop over elem3
|
||||
idouble++;
|
||||
} // end loop over elem2
|
||||
} // end loop over elem1
|
||||
}
|
||||
|
||||
// Core "evaluation" kernel that computes a single zlist value
|
||||
// which gets used in both `compute_zi` and `compute_yi`
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_FORCEINLINE_FUNCTION
|
||||
typename SNAKokkos<DeviceType, real_type, vector_length>::complex SNAKokkos<DeviceType, real_type, vector_length>::evaluate_zi(const int& j1, const int& j2, const int& j,
|
||||
const int& ma1min, const int& ma2max, const int& mb1min, const int& mb2max, const int& na, const int& nb,
|
||||
const int& iatom_mod, const int& elem1, const int& elem2, const int& iatom_div, const real_type* cgblock) {
|
||||
|
||||
complex ztmp = complex::zero();
|
||||
|
||||
int jju1 = idxu_block[j1] + (j1+1)*mb1min;
|
||||
int jju2 = idxu_block[j2] + (j2+1)*mb2max;
|
||||
int icgb = mb1min*(j2+1) + mb2max;
|
||||
|
||||
#ifdef LMP_KK_DEVICE_COMPILE
|
||||
#pragma unroll
|
||||
#endif
|
||||
for (int ib = 0; ib < nb; ib++) {
|
||||
|
||||
int ma1 = ma1min;
|
||||
int ma2 = ma2max;
|
||||
int icga = ma1min*(j2+1) + ma2max;
|
||||
|
||||
#ifdef LMP_KK_DEVICE_COMPILE
|
||||
#pragma unroll
|
||||
#endif
|
||||
for (int ia = 0; ia < na; ia++) {
|
||||
const complex utot1 = ulisttot_pack(iatom_mod, jju1+ma1, elem1, iatom_div);
|
||||
const complex utot2 = ulisttot_pack(iatom_mod, jju2+ma2, elem2, iatom_div);
|
||||
const real_type cgcoeff_a = cgblock[icga];
|
||||
const real_type cgcoeff_b = cgblock[icgb];
|
||||
ztmp.re += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
|
||||
ztmp.im += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
|
||||
ma1++;
|
||||
ma2--;
|
||||
icga += j2;
|
||||
} // end loop over ia
|
||||
|
||||
jju1 += j1 + 1;
|
||||
jju2 -= j2 + 1;
|
||||
icgb += j2;
|
||||
} // end loop over ib
|
||||
|
||||
if (bnorm_flag) {
|
||||
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
|
||||
ztmp.re *= scale;
|
||||
ztmp.im *= scale;
|
||||
}
|
||||
|
||||
return ztmp;
|
||||
}
|
||||
|
||||
// Core "evaluation" kernel that extracts and rescales the appropriate `beta` value,
|
||||
// which gets used in both `compute_yi` and `compute_yi_from_zlist
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_FORCEINLINE_FUNCTION
|
||||
typename SNAKokkos<DeviceType, real_type, vector_length>::real_type SNAKokkos<DeviceType, real_type, vector_length>::evaluate_beta_scaled(const int& j1, const int& j2, const int& j,
|
||||
const int& iatom_mod, const int& elem1, const int& elem2, const int& elem3, const int& iatom_div,
|
||||
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack) {
|
||||
|
||||
real_type betaj = 0;
|
||||
|
||||
if (j >= j1) {
|
||||
const int jjb = idxb_block(j1, j2, j);
|
||||
const int itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
|
||||
if (j1 == j) {
|
||||
if (j2 == j) betaj = static_cast<real_type>(3) * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
else betaj = static_cast<real_type>(2) * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else if (j >= j2) {
|
||||
const int jjb = idxb_block(j, j2, j1);
|
||||
const int itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
|
||||
if (j2 == j) betaj = static_cast<real_type>(2) * beta_pack(iatom_mod, itriple, iatom_div);
|
||||
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
} else {
|
||||
const int jjb = idxb_block(j2, j, j1);
|
||||
const int itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
|
||||
betaj = beta_pack(iatom_mod, itriple, iatom_div);
|
||||
}
|
||||
|
||||
if (!bnorm_flag && j1 > j) {
|
||||
const real_type scale = static_cast<real_type>(j1 + 1) / static_cast<real_type>(j + 1);
|
||||
betaj *= scale;
|
||||
}
|
||||
|
||||
return betaj;
|
||||
|
||||
}
|
||||
|
||||
/* ----------------------------------------------------------------------
|
||||
Fused calculation of the derivative of Ui w.r.t. atom j
|
||||
and accumulation into dEidRj. GPU only.
|
||||
------------------------------------------------------------------------- */
|
||||
|
||||
// Version of the code that exposes additional parallelism by threading over `j_bend` values
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
|
||||
{
|
||||
// get shared memory offset
|
||||
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
|
||||
@ -957,21 +970,76 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
|
||||
WignerWrapper<real_type, vector_length> dulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
|
||||
|
||||
// load parameters
|
||||
const auto a = a_pack(iatom_mod, jnbor, iatom_div);
|
||||
const auto b = b_pack(iatom_mod, jnbor, iatom_div);
|
||||
const auto da = da_pack(iatom_mod, jnbor, iatom_div, dir);
|
||||
const auto db = db_pack(iatom_mod, jnbor, iatom_div, dir);
|
||||
const auto sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
|
||||
const auto dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
|
||||
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
|
||||
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
|
||||
const complex da = da_pack(iatom_mod, jnbor, iatom_div, dir);
|
||||
const complex db = db_pack(iatom_mod, jnbor, iatom_div, dir);
|
||||
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
|
||||
const real_type dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
|
||||
|
||||
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
|
||||
|
||||
auto dedr_full_sum = static_cast<real_type>(0.);
|
||||
// compute the contribution to dedr_full_sum for one "bend" location
|
||||
const real_type dedr_full_sum = evaluate_duidrj_jbend(ulist_wrapper, a, b, sfac, dulist_wrapper, da, db, dsfacu,
|
||||
jelem, iatom_mod, j_bend, iatom_div);
|
||||
|
||||
// we need to "choose" when to bend
|
||||
// this for loop is here for context --- we expose additional
|
||||
// parallelism over this loop instead
|
||||
//for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
|
||||
// dedr gets zeroed out at the start of each iteration in compute_cayley_klein
|
||||
Kokkos::atomic_add(&(dedr(iatom_mod + vector_length * iatom_div, jnbor, dir)), static_cast<real_type>(2.0) * dedr_full_sum);
|
||||
|
||||
}
|
||||
|
||||
// Version of the code that loops over all `j_bend` values which reduces integer arithmetic
|
||||
// and some amount of load imbalance, at the expense of reducing parallelism
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
template<int dir>
|
||||
KOKKOS_INLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int jnbor, const int iatom_div)
|
||||
{
|
||||
// get shared memory offset
|
||||
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
|
||||
const int tile_size = vector_length * (twojmax + 1);
|
||||
|
||||
const int team_rank = team.team_rank();
|
||||
const int scratch_shift = team_rank * tile_size;
|
||||
|
||||
// extract, wrap shared memory buffer
|
||||
WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
|
||||
WignerWrapper<real_type, vector_length> dulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
|
||||
|
||||
// load parameters
|
||||
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
|
||||
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
|
||||
const complex da = da_pack(iatom_mod, jnbor, iatom_div, dir);
|
||||
const complex db = db_pack(iatom_mod, jnbor, iatom_div, dir);
|
||||
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
|
||||
const real_type dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
|
||||
|
||||
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
|
||||
|
||||
// compute the contributions to dedr_full_sum for all "bend" locations
|
||||
real_type dedr_full_sum = static_cast<real_type>(0);
|
||||
#ifdef LMP_KK_DEVICE_COMPILE
|
||||
#pragma unroll
|
||||
#endif
|
||||
for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
|
||||
dedr_full_sum += evaluate_duidrj_jbend(ulist_wrapper, a, b, sfac, dulist_wrapper, da, db, dsfacu,
|
||||
jelem, iatom_mod, j_bend, iatom_div);
|
||||
}
|
||||
|
||||
// there's one thread per atom, neighbor pair, so no need to make this atomic
|
||||
dedr(iatom_mod + vector_length * iatom_div, jnbor, dir) = static_cast<real_type>(2.0) * dedr_full_sum;
|
||||
|
||||
}
|
||||
|
||||
// Core "evaluation" kernel that gets reused in `compute_fused_deidrj_small` and
|
||||
// `compute_fused_deidrj_large`
|
||||
template<class DeviceType, typename real_type, int vector_length>
|
||||
KOKKOS_FORCEINLINE_FUNCTION
|
||||
typename SNAKokkos<DeviceType, real_type, vector_length>::real_type SNAKokkos<DeviceType, real_type, vector_length>::evaluate_duidrj_jbend(const WignerWrapper<real_type, vector_length>& ulist_wrapper, const complex& a, const complex& b, const real_type& sfac,
|
||||
const WignerWrapper<real_type, vector_length>& dulist_wrapper, const complex& da, const complex& db, const real_type& dsfacu,
|
||||
const int& jelem, const int& iatom_mod, const int& j_bend, const int& iatom_div) {
|
||||
|
||||
real_type dedr_full_sum = static_cast<real_type>(0);
|
||||
|
||||
// level 0 is just 1, 0
|
||||
ulist_wrapper.set(0, complex::one());
|
||||
@ -1039,7 +1107,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
|
||||
|
||||
// grab y_local early
|
||||
// this will never be the last element of a row, no need to rescale.
|
||||
auto y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
|
||||
complex y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
|
||||
|
||||
// grab the cached value
|
||||
const complex ulist_prev = ulist_wrapper.get(ma);
|
||||
@ -1085,7 +1153,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
|
||||
|
||||
for (int ma = 0; ma < j; ma++) {
|
||||
// grab y_local early
|
||||
auto y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
|
||||
complex y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
|
||||
if (j % 2 == 1 && 2*(mb-1) == j-1) { // double check me...
|
||||
if (ma == (mb-1)) { y_local = static_cast<real_type>(0.5)*y_local; }
|
||||
else if (ma > (mb-1)) { y_local.re = static_cast<real_type>(0.); y_local.im = static_cast<real_type>(0.); } // can probably avoid this outright
|
||||
@ -1100,15 +1168,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
|
||||
dedr_full_sum += du_prod.re * y_local.re + du_prod.im * y_local.im;
|
||||
|
||||
}
|
||||
//} // end reference loop over j_bend
|
||||
|
||||
// dedr gets zeroed out at the start of each iteration in compute_cayley_klein
|
||||
Kokkos::atomic_add(&(dedr(iatom_mod + vector_length * iatom_div, jnbor, dir)), static_cast<real_type>(2.0) * dedr_full_sum);
|
||||
|
||||
return dedr_full_sum;
|
||||
}
|
||||
|
||||
|
||||
|
||||
/* ----------------------------------------------------------------------
|
||||
* CPU routines
|
||||
* ----------------------------------------------------------------------*/
|
||||
@ -1238,8 +1301,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_zi_cpu(const int&
|
||||
} // end loop over ib
|
||||
|
||||
if (bnorm_flag) {
|
||||
zlist(jjz, idouble, iatom).re /= (j+1);
|
||||
zlist(jjz, idouble, iatom).im /= (j+1);
|
||||
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
|
||||
zlist(jjz, idouble, iatom).re *= scale;
|
||||
zlist(jjz, idouble, iatom).im *= scale;
|
||||
}
|
||||
idouble++;
|
||||
} // end loop over elem2
|
||||
@ -1268,7 +1332,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi_cpu(const typen
|
||||
int idouble = 0;
|
||||
for (int elem1 = 0; elem1 < nelements; elem1++) {
|
||||
for (int elem2 = 0; elem2 < nelements; elem2++) {
|
||||
auto jalloy = idouble; // must be non-const to work around gcc compiler bug
|
||||
int jalloy = idouble; // must be non-const to work around gcc compiler bug
|
||||
for (int elem3 = 0; elem3 < nelements; elem3++) {
|
||||
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,idxb_max),
|
||||
[&] (const int& jjb) {
|
||||
@ -1331,7 +1395,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi_cpu(const typen
|
||||
}
|
||||
}
|
||||
|
||||
blist(jjb, itriple, iatom) = sumzu;
|
||||
blist(iatom, itriple, jjb) = sumzu;
|
||||
});
|
||||
});
|
||||
//} // end loop over j
|
||||
@ -1410,8 +1474,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_cpu(int iter,
|
||||
} // end loop over ib
|
||||
|
||||
if (bnorm_flag) {
|
||||
ztmp_i /= j + 1;
|
||||
ztmp_r /= j + 1;
|
||||
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
|
||||
ztmp_i *= scale;
|
||||
ztmp_r *= scale;
|
||||
}
|
||||
|
||||
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
|
||||
@ -1424,24 +1489,24 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_cpu(int iter,
|
||||
|
||||
if (j >= j1) {
|
||||
const int jjb = idxb_block(j1, j2, j);
|
||||
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
|
||||
const int itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
|
||||
if (j1 == j) {
|
||||
if (j2 == j) betaj = 3 * beta(itriple, iatom);
|
||||
else betaj = 2 * beta(itriple, iatom);
|
||||
} else betaj = beta(itriple, iatom);
|
||||
} else if (j >= j2) {
|
||||
const int jjb = idxb_block(j, j2, j1);
|
||||
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
|
||||
const int itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
|
||||
if (j2 == j) betaj = 2 * beta(itriple, iatom);
|
||||
else betaj = beta(itriple, iatom);
|
||||
} else {
|
||||
const int jjb = idxb_block(j2, j, j1);
|
||||
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
|
||||
const int itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
|
||||
betaj = beta(itriple, iatom);
|
||||
}
|
||||
|
||||
if (!bnorm_flag && j1 > j)
|
||||
betaj *= (j1 + 1) / (j + 1.0);
|
||||
betaj *= static_cast<real_type>(j1 + 1) / static_cast<real_type>(j + 1);
|
||||
|
||||
Kokkos::atomic_add(&(ylist(jju_half, elem3, iatom).re), betaj*ztmp_r);
|
||||
Kokkos::atomic_add(&(ylist(jju_half, elem3, iatom).im), betaj*ztmp_i);
|
||||
@ -1469,9 +1534,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_duidrj_cpu(const t
|
||||
z = rij(iatom,jnbor,2);
|
||||
rsq = x * x + y * y + z * z;
|
||||
r = sqrt(rsq);
|
||||
auto rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcutij(iatom,jnbor) - rmin0);
|
||||
real_type rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcutij(iatom,jnbor) - rmin0);
|
||||
theta0 = (r - rmin0) * rscale0;
|
||||
sincos_wrapper(theta0, &sn, &cs);
|
||||
sn = sin(theta0);
|
||||
cs = cos(theta0);
|
||||
z0 = r * cs / sn;
|
||||
dz0dr = z0 / r - (r*rscale0) * (rsq + z0 * z0) / rsq;
|
||||
|
||||
@ -1559,7 +1625,7 @@ KOKKOS_INLINE_FUNCTION
|
||||
void SNAKokkos<DeviceType, real_type, vector_length>::add_uarraytot(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int iatom, int jnbor,
|
||||
const real_type& r, const real_type& wj, const real_type& rcut, int jelem)
|
||||
{
|
||||
const auto sfac = compute_sfac(r, rcut) * wj;
|
||||
const real_type sfac = compute_sfac(r, rcut) * wj;
|
||||
|
||||
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,twojmax+1),
|
||||
[&] (const int& j) {
|
||||
@ -2168,7 +2234,7 @@ real_type SNAKokkos<DeviceType, real_type, vector_length>::compute_sfac(real_typ
|
||||
if (r <= rmin0) return one;
|
||||
else if (r > rcut) return zero;
|
||||
else {
|
||||
auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
|
||||
real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
|
||||
return onehalf * (cos((r - rmin0) * rcutfac) + one);
|
||||
}
|
||||
}
|
||||
@ -2188,7 +2254,7 @@ real_type SNAKokkos<DeviceType, real_type, vector_length>::compute_dsfac(real_ty
|
||||
if (r <= rmin0) return zero;
|
||||
else if (r > rcut) return zero;
|
||||
else {
|
||||
auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
|
||||
real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
|
||||
return -onehalf * sin((r - rmin0) * rcutfac) * rcutfac;
|
||||
}
|
||||
}
|
||||
@ -2206,9 +2272,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_s_dsfac(const real
|
||||
if (r <= rmin0) { sfac = one; dsfac = zero; }
|
||||
else if (r > rcut) { sfac = zero; dsfac = zero; }
|
||||
else {
|
||||
const auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
|
||||
real_type sn, cs;
|
||||
sincos_wrapper((r - rmin0) * rcutfac, &sn, &cs); // need to create a wrapper
|
||||
const real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
|
||||
const real_type theta0 = (r - rmin0) * rcutfac;
|
||||
const real_type sn = sin(theta0);
|
||||
const real_type cs = cos(theta0);
|
||||
sfac = onehalf * (cs + one);
|
||||
dsfac = -onehalf * sn * rcutfac;
|
||||
|
||||
|
||||
@ -628,7 +628,8 @@ void PairSNAP::read_files(char *coefffilename, char *paramfilename)
|
||||
chemflag = 0;
|
||||
bnormflag = 0;
|
||||
wselfallflag = 0;
|
||||
chunksize = 4096;
|
||||
chunksize = 32768;
|
||||
parallel_thresh = 8192;
|
||||
|
||||
// open SNAP parameter file on proc 0
|
||||
|
||||
@ -696,6 +697,8 @@ void PairSNAP::read_files(char *coefffilename, char *paramfilename)
|
||||
wselfallflag = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
|
||||
else if (keywd == "chunksize")
|
||||
chunksize = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
|
||||
else if (keywd == "parallelthresh")
|
||||
parallel_thresh = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
|
||||
else
|
||||
error->all(FLERR,"Unknown parameter '{}' in SNAP "
|
||||
"parameter file", keywd);
|
||||
|
||||
@ -59,7 +59,7 @@ class PairSNAP : public Pair {
|
||||
double **scale; // for thermodynamic integration
|
||||
int twojmax, switchflag, bzeroflag, bnormflag;
|
||||
int chemflag, wselfallflag;
|
||||
int chunksize;
|
||||
int chunksize,parallel_thresh;
|
||||
double rfac0, rmin0, wj1, wj2;
|
||||
int rcutfacflag, twojmaxflag; // flags for required parameters
|
||||
int beta_max; // length of beta
|
||||
|
||||
@ -20,7 +20,7 @@ charges (dsf and long-range treatment of charges)
|
||||
out-of-plane angle
|
||||
|
||||
See the file doc/drude_tutorial.html for getting started.
|
||||
See the doc pages for "pair_style buck6d/coul/gauss", "anlge_style class2",
|
||||
See the doc pages for "pair_style buck6d/coul/gauss", "angle_style class2",
|
||||
"angle_style cosine/buck6d", and "improper_style inversion/harmonic"
|
||||
commands to get started. Also see the above mentioned website and
|
||||
literature for further documentation about the force field.
|
||||
|
||||
@ -34,6 +34,7 @@ exclude:
|
||||
- lib/hdnnp
|
||||
- lib/kim
|
||||
- lib/kokkos
|
||||
- lib/latte
|
||||
- lib/machdyn
|
||||
- lib/mdi
|
||||
- lib/mscg
|
||||
@ -41,6 +42,7 @@ exclude:
|
||||
- lib/plumed
|
||||
- lib/quip
|
||||
- lib/scafacos
|
||||
- lib/voronoi
|
||||
- src/Make.sh
|
||||
patterns:
|
||||
- "*.c"
|
||||
|
||||
Reference in New Issue
Block a user