Merge branch 'master' into collected-small-changes

This commit is contained in:
Axel Kohlmeyer
2021-08-27 14:58:42 -04:00
19 changed files with 822 additions and 533 deletions

View File

@ -22,4 +22,5 @@ page.
Build_extras
Build_manual
Build_windows
Build_diskspace
Build_development

View File

@ -0,0 +1,45 @@
Notes for saving disk space when building LAMMPS from source
------------------------------------------------------------
LAMMPS is a large software project with a large number of source files,
extensive documentation, and a large collection of example files.
When downloading LAMMPS by cloning the
`git repository from GitHub <https://github.com/lammps/lammps>`_,
by default the entire commit history since September 2006 is downloaded as well.
Compiling LAMMPS will add the storage requirements of the compiled object
files and libraries to the tally.
In a user account on an HPC cluster with file system quotas, or in other
environments with restricted disk space capacity, it may be necessary to
reduce the storage requirements. Here are some suggestions:
- Create a so-called shallow repository that contains only the last commit
instead of the full project history by using ``git clone git@github.com:lammps/lammps --depth=1 --branch=master``.
This reduces the downloaded size to about half. With ``--depth=1`` it is not possible to check out different
versions/branches of LAMMPS; using ``--depth=1000`` makes multiple recent versions available at little
extra storage cost (the entire git history contained nearly 30,000 commits in fall 2021).
- Download a tar archive from either the `download section on the LAMMPS homepage <https://www.lammps.org/download.html>`_
or from the `LAMMPS releases page on GitHub <https://github.com/lammps/lammps/releases>`_; these archives do not
contain the git history at all.
- Build LAMMPS without the debug flag (remove ``-g`` from the machine makefile or use ``-DCMAKE_BUILD_TYPE=Release``)
or run the ``strip`` command on the LAMMPS executable once debugging is no longer needed. The ``strip`` command
may also be applied to the LAMMPS shared library. The static library may be deleted entirely.
- Delete compiled object files and libraries after copying the LAMMPS executable to a permanent location.
When using the traditional build process, one may use ``make clean-<machine>`` or ``make clean-all``
to delete object files in the src folder. For CMake based builds, one may use ``make clean`` or just
delete the entire build folder.
- The folders containing the documentation tree (doc) and the examples (examples) are not needed to build and
run LAMMPS and can be safely deleted. Some files in the potentials folder are large and may be deleted
if not needed. The largest of those files (occupying about 120 MBytes combined) are only downloaded on
demand, when the corresponding package is installed.
- When using the CMake build procedure, the compilation can be done on (local) scratch storage that does not
count toward the quota. A local scratch file system may offer the additional benefit of speeding up compilation
and linking compared to a networked file system. Also, with CMake (and unlike with the traditional make) it is
possible to compile LAMMPS executables with different settings and packages included from the same source tree,
since all the configuration information is stored in the build folder; it is thus not necessary to keep
multiple copies of the LAMMPS sources. A minimal workflow combining several of these suggestions is sketched below.
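Taken together, a minimal space-saving workflow might look like the following
sketch (the executable name ``lmp`` is the default for a CMake build; the
install location ``~/bin`` is just an example):

.. code-block:: bash

   # shallow clone: only the most recent commit, about half the download size
   git clone --depth=1 --branch=master https://github.com/lammps/lammps
   cd lammps
   # configure and build an optimized executable without debug info
   cmake -S cmake -B build -DCMAKE_BUILD_TYPE=Release
   cmake --build build -j 4
   # keep only the stripped executable, then reclaim the build space
   cp build/lmp ~/bin/
   strip ~/bin/lmp
   rm -rf build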

View File

@ -29,7 +29,7 @@ The following folks deserve special recognition. Many of the packages
they have written are unique for an MD code and LAMMPS would not be as
general-purpose as it is without their expertise and efforts.
* Metin Aktulga (MSU), REAXFF package for C version of ReaxFF
* Metin Aktulga (MSU), REAXFF package for C/C++ version of ReaxFF
* Mike Brown (Intel), GPU and INTEL packages
* Colin Denniston (U Western Ontario), LATBOLTZ package
* Georg Ganzenmuller (EMI), MACHDYN and SPH packages
@ -37,9 +37,10 @@ general-purpose as it is without their expertise and efforts.
* Reese Jones (Sandia) and colleagues, ATC package for atom/continuum coupling
* Christoph Kloss (DCS Computing), LIGGGHTS code for granular materials, built on top of LAMMPS
* Rudra Mukherjee (JPL), POEMS package for articulated rigid body motion
* Trung Nguyen (Northwestern U), GPU and RIGID and BODY packages
* Trung Nguyen (Northwestern U), GPU, RIGID, BODY, and DIELECTRIC packages
* Mike Parks (Sandia), PERI package for Peridynamics
* Roy Pollock (LLNL), Ewald and PPPM solvers
* Julien Tranchida (Sandia), SPIN package
* Christian Trott (Sandia), CUDA and KOKKOS packages
* Ilya Valuev (JIHT), AWPMD package for wave packet MD
* Greg Wagner (Northwestern U), MEAM package for MEAM potential

View File

@ -27,19 +27,19 @@ General features
* distributed memory message-passing parallelism (MPI)
* shared memory multi-threading parallelism (OpenMP)
* spatial decomposition of simulation domain for MPI parallelism
* particle decomposition inside of spatial decomposition for OpenMP parallelism
* particle decomposition inside of spatial decomposition for OpenMP and GPU parallelism
* GPLv2 licensed open-source distribution
* highly portable C++-11
* modular code with most functionality in optional packages
* only depends on MPI library for basic parallel functionality
* only depends on MPI library for basic parallel functionality, MPI stub for serial compilation
* other libraries are optional and only required for specific packages
* GPU (CUDA and OpenCL), Intel Xeon Phi, and OpenMP support for many code features
* GPU (CUDA, OpenCL, HIP, SYCL), Intel Xeon Phi, and OpenMP support for many code features
* easy to extend with new features and functionality
* runs from an input script
* syntax for defining and using variables and formulas
* syntax for looping over runs and breaking out of loops
* run one or multiple simulations simultaneously (in parallel) from one script
* build as library, invoke LAMMPS through library interface or provided Python wrapper
* build as library, invoke LAMMPS through library interface, provided Python wrapper, or SWIG-based wrappers
* couple with other codes: LAMMPS calls other code, other code calls LAMMPS, umbrella code calls both
.. _particle:
@ -58,8 +58,10 @@ Particle and model types
* coarse-grained mesoscale models
* finite-size spherical and ellipsoidal particles
* finite-size line segment (2d) and triangle (3d) particles
* finite-size rounded polygon (2d) and polyhedron (3d) particles
* point dipole particles
* rigid collections of particles
* particles with magnetic spin
* rigid collections of n particles
* hybrid combinations of these
.. _ff:
@ -74,24 +76,28 @@ commands)
* pairwise potentials: Lennard-Jones, Buckingham, Morse, Born-Mayer-Huggins, Yukawa, soft, class 2 (COMPASS), hydrogen bond, tabulated
* charged pairwise potentials: Coulombic, point-dipole
* many-body potentials: EAM, Finnis/Sinclair EAM, modified EAM (MEAM), embedded ion method (EIM), EDIP, ADP, Stillinger-Weber, Tersoff, REBO, AIREBO, ReaxFF, COMB, SNAP, Streitz-Mintmire, 3-body polymorphic
* long-range interactions for charge, point-dipoles, and LJ dispersion: Ewald, Wolf, PPPM (similar to particle-mesh Ewald)
* many-body potentials: EAM, Finnis/Sinclair EAM, modified EAM (MEAM), embedded ion method (EIM), EDIP, ADP, Stillinger-Weber, Tersoff, REBO, AIREBO, ReaxFF, COMB, Streitz-Mintmire, 3-body polymorphic, BOP, Vashishta
* machine learning potentials: SNAP, GAP, ACE, N2P2, RANN, AGNI
* long-range interactions for charge, point-dipoles, and LJ dispersion: Ewald, Wolf, PPPM (similar to particle-mesh Ewald), MSM
* polarization models: :doc:`QEq <fix_qeq>`, :doc:`core/shell model <Howto_coreshell>`, :doc:`Drude dipole model <Howto_drude>`
* charge equilibration (QEq via dynamic, point, shielded, Slater methods)
* coarse-grained potentials: DPD, GayBerne, REsquared, colloidal, DLVO
* mesoscopic potentials: granular, Peridynamics, SPH
* mesoscopic potentials: granular, Peridynamics, SPH, mesoscopic tubular potential (MESONT)
* semi-empirical potentials: multi-ion generalized pseudopotential theory (MGPT), second moment tight binding + QEq (SMTB-Q), density functional tight-binding (LATTE)
* electron force field (eFF, AWPMD)
* bond potentials: harmonic, FENE, Morse, nonlinear, class 2, quartic (breakable)
* angle potentials: harmonic, CHARMM, cosine, cosine/squared, cosine/periodic, class 2 (COMPASS)
* dihedral potentials: harmonic, CHARMM, multi-harmonic, helix, class 2 (COMPASS), OPLS
* improper potentials: harmonic, cvff, umbrella, class 2 (COMPASS)
* bond potentials: harmonic, FENE, Morse, nonlinear, class 2, quartic (breakable), tabulated
* angle potentials: harmonic, CHARMM, cosine, cosine/squared, cosine/periodic, class 2 (COMPASS), tabulated
* dihedral potentials: harmonic, CHARMM, multi-harmonic, helix, class 2 (COMPASS), OPLS, tabulated
* improper potentials: harmonic, cvff, umbrella, class 2 (COMPASS), tabulated
* polymer potentials: all-atom, united-atom, bead-spring, breakable
* water potentials: TIP3P, TIP4P, SPC
* water potentials: TIP3P, TIP4P, SPC, SPC/E and variants
* interlayer potentials for graphene and analogues
* metal-organic framework potentials (QuickFF, MOF-FF)
* implicit solvent potentials: hydrodynamic lubrication, Debye
* force-field compatibility with common CHARMM, AMBER, DREIDING, OPLS, GROMACS, COMPASS options
* access to the `OpenKIM Repository <http://openkim.org>`_ of potentials via :doc:`kim command <kim_commands>`
* hybrid potentials: multiple pair, bond, angle, dihedral, improper potentials can be used in one simulation
* overlaid potentials: superposition of multiple pair potentials
* overlaid potentials: superposition of multiple pair potentials (including many-body) with optional scale factor
.. _create:
@ -124,9 +130,10 @@ Ensembles, constraints, and boundary conditions
* harmonic (umbrella) constraint forces
* rigid body constraints
* SHAKE bond and angle constraints
* Monte Carlo bond breaking, formation, swapping
* motion constraints to manifold surfaces
* Monte Carlo bond breaking, formation, swapping, template-based reaction modeling
* atom/molecule insertion and deletion
* walls of various kinds
* walls of various kinds, static and moving
* non-equilibrium molecular dynamics (NEMD)
* variety of additional boundary conditions and constraints
@ -150,6 +157,7 @@ Diagnostics
^^^^^^^^^^^
* see various flavors of the :doc:`fix <fix>` and :doc:`compute <compute>` commands
* introspection command for system, simulation, and compile time settings and configurations
.. _output:
@ -164,8 +172,9 @@ Output
* parallel I/O of dump and restart files
* per-atom quantities (energy, stress, centro-symmetry parameter, CNA, etc)
* user-defined system-wide (log file) or per-atom (dump file) calculations
* spatial and time averaging of per-atom quantities
* time averaging of system-wide quantities
* custom partitioning (chunks) for binning, and static or dynamic grouping of atoms for analysis
* spatial, time, and per-chunk averaging of per-atom quantities
* time averaging and histogramming of system-wide quantities
* atom snapshots in native, XYZ, XTC, DCD, CFG formats
.. _replica1:
@ -178,7 +187,7 @@ Multi-replica models
* :doc:`parallel replica dynamics <prd>`
* :doc:`temperature accelerated dynamics <tad>`
* :doc:`parallel tempering <temper>`
* :doc:`path-integral MD <fix_pimd>`
* path-integral MD: :doc:`first variant <fix_pimd>`, :doc:`second variant <fix_ipi>`
* multi-walker collective variables with :doc:`Colvars <fix_colvars>` and :doc:`Plumed <fix_plumed>`
.. _prepost:
@ -210,11 +219,12 @@ page for details.
These are LAMMPS capabilities which you may not think of as typical
classical MD options:
* :doc:`static <balance>` and :doc:`dynamic load-balancing <fix_balance>`
* :doc:`static <balance>` and :doc:`dynamic load-balancing <fix_balance>`, optionally with recursive bisectioning decomposition
* :doc:`generalized aspherical particles <Howto_body>`
* :doc:`stochastic rotation dynamics (SRD) <fix_srd>`
* :doc:`real-time visualization and interactive MD <fix_imd>`
* :doc:`real-time visualization and interactive MD <fix_imd>`, :doc:`built-in renderer for images and movies <dump_image>`
* calculate :doc:`virtual diffraction patterns <compute_xrd>`
* calculate :doc:`finite temperature phonon dispersion <fix_phonon>` and the :doc:`dynamical matrix of minimized structures <dynamical_matrix>`
* :doc:`atom-to-continuum coupling <fix_atc>` with finite elements
* coupled rigid body integration via the :doc:`POEMS <fix_poems>` library
* :doc:`QM/MM coupling <fix_qmmm>`

View File

@ -1,40 +1,61 @@
LAMMPS open-source license
--------------------------
LAMMPS is a freely-available open-source code, distributed under the
terms of the `GNU Public License Version 2 <gpl_>`_, which means you can
use or modify the code however you wish for your own purposes, but have
to adhere to certain rules when redistributing it or software derived
GPL version of LAMMPS
^^^^^^^^^^^^^^^^^^^^^
LAMMPS is an open-source code, available free-of-charge, and distributed
under the terms of the `GNU Public License Version 2 <gpl_>`_ (GPLv2),
which means you can use or modify the code however you wish for your own
purposes, but you have to adhere to certain rules when redistributing it -
specifically in binary form - or when distributing software derived
from it or that includes parts of it.
LAMMPS comes with no warranty of any kind. As each source file states
in its header, it is a copyrighted code that is distributed free-of-
charge, under the terms of the `GNU Public License Version 2 <gpl_>`_
(GPLv2). This is often referred to as open-source distribution - see
`www.gnu.org <gnuorg_>`_ or `www.opensource.org <opensource_>`_. The
legal text of the GPL is in the LICENSE file included in the LAMMPS
distribution.
LAMMPS comes with no warranty of any kind.
As each source file states in its header, it is a copyrighted code, and
thus not in the public domain. For more information about open-source
software and open-source distribution, see `www.gnu.org <gnuorg_>`_
or `www.opensource.org <opensource_>`_. The legal text of the GPL as it
applies to LAMMPS is in the LICENSE file included in the LAMMPS distribution.
.. _gpl: https://github.com/lammps/lammps/blob/master/LICENSE
.. _lgpl: https://www.gnu.org/licenses/old-licenses/lgpl-2.1.html
.. _gnuorg: http://www.gnu.org
.. _opensource: http://www.opensource.org
Here is a summary of what the GPL means for LAMMPS users:
Here is a more specific summary of what the GPL means for LAMMPS users:
(1) Anyone is free to use, modify, or extend LAMMPS in any way they
(1) Anyone is free to use, copy, modify, or extend LAMMPS in any way they
choose, including for commercial purposes.
(2) If you **distribute** a modified version of LAMMPS, it must remain
open-source, meaning you distribute **all** of it under the terms of
the GPL. You should clearly annotate such a code as a derivative version
of LAMMPS.
open-source, meaning you are required to distribute **all** of it under
the terms of the GPL. You should clearly annotate such a modified code
as a derivative version of LAMMPS.
(3) If you release any code that includes or uses LAMMPS source code,
then it must also be open-sourced, meaning you distribute it under
the terms of the GPL.
the terms of the GPL. You may write code that interfaces LAMMPS to
a differently licensed library. In that case the code that provides
the interface must be licensed GPL, but not necessarily that library
unless you are distributing binaries that require the library to run.
(4) If you give LAMMPS files to someone else, the GPL LICENSE file and
source file headers (including the copyright and GPL notices) should
remain part of the code.
LGPL version of LAMMPS
^^^^^^^^^^^^^^^^^^^^^^
We occasionally make stable LAMMPS releases available under the `GNU
Lesser General Public License v2.1 <lgpl_>`_. This is on request only and
with non-LGPL compliant files removed. This allows linking non-GPL
compatible software with the (otherwise unmodified) LAMMPS library
or loading it dynamically at runtime. Any **modifications** to
the LAMMPS code, however, even in the LGPL licensed version, must still
be made available under the same open-source terms as LAMMPS itself.

View File

@ -10,24 +10,26 @@ conditions. It can model 2d or 3d systems with only a few particles
up to millions or billions.
LAMMPS can be built and run on a laptop or desktop machine, but is
designed for parallel computers. It will run on any parallel machine
that supports the `MPI <mpi_>`_ message-passing library. This includes
shared-memory boxes and distributed-memory clusters and
supercomputers.
designed for parallel computers. It will run in serial and on any
parallel machine that supports the `MPI <mpi_>`_ message-passing
library. This includes shared-memory boxes and distributed-memory
clusters and supercomputers. Parts of LAMMPS also support
`OpenMP multi-threading <omp_>`_, vectorization, and GPU acceleration.
.. _mpi: https://en.wikipedia.org/wiki/Message_Passing_Interface
.. _lws: https://www.lammps.org
.. _omp: https://www.openmp.org
LAMMPS is written in C++ and requires a compiler that is at least
compatible with the C++-11 standard.
Earlier versions were written in F77 and F90. See the `History page
compatible with the C++-11 standard. Earlier versions were written in
F77, F90, and C++-98. See the `History page
<https://www.lammps.org/history.html>`_ of the website for details. All
versions can be downloaded from the `LAMMPS website <lws_>`_.
versions can be downloaded as source code from the `LAMMPS website
<lws_>`_.
LAMMPS is designed to be easy to modify or extend with new
capabilities, such as new force fields, atom types, boundary
conditions, or diagnostics. See the :doc:`Modify <Modify>` page for
more details.
LAMMPS is designed to be easy to modify or extend with new capabilities,
such as new force fields, atom types, boundary conditions, or
diagnostics. See the :doc:`Modify <Modify>` page for more details.
In the most general sense, LAMMPS integrates Newton's equations of
motion for a collection of interacting particles. A single particle
@ -47,4 +49,5 @@ MPI parallelization to partition the simulation domain into small
sub-domains of equal computational cost, one of which is assigned to
each processor. Processors communicate and store "ghost" atom
information for atoms that border their sub-domain. Multi-threading
parallelization with with particle-decomposition can be used in addition.
parallelization and GPU acceleration with particle decomposition
can be used in addition.

View File

@ -2,12 +2,21 @@ What does a LAMMPS version mean
-------------------------------
The LAMMPS "version" is the date when it was released, such as 1 May
2014. LAMMPS is updated continuously. Whenever we fix a bug or add a
feature, we release it in the next *patch* release, which are
typically made every couple of weeks. Info on patch releases are on
`this website page <https://www.lammps.org/bug.html>`_. Every few
months, the latest patch release is subjected to more thorough testing
and labeled as a *stable* version.
2014. LAMMPS is updated continuously and we aim to keep it working
correctly and reliably at all times. You can follow its development
in a public `git repository on GitHub <https://github.com/lammps/lammps>`_.
Whenever we fix a bug or update or add a feature, it will be merged into
the ``master`` branch of the git repository. When a sufficient number of
changes have accumulated *and* the software passes a set of automated
tests, we release it as the next *patch* release; patch releases are made
every few weeks. Information on patch releases is on `this website page
<https://www.lammps.org/bug.html>`_.
Once or twice a year, only bug fixes and small, non-intrusive changes are
included for a period of time, and the code is subjected to more detailed
and thorough testing than the default automated testing. The latest
patch release after such a period is then labeled as a *stable* version.
Each version of LAMMPS contains all the features and bug-fixes up to
and including its version date.

View File

@ -19,7 +19,7 @@ Syntax
bondmax = length of longest bond in the system (in length units)
tlimit = elapsed CPU time (in seconds)
diskfree = free disk space (in megabytes)
diskfree = free disk space (in MBytes)
v_name = name of :doc:`equal-style variable <variable>`
* operator = "<" or "<=" or ">" or ">=" or "==" or "!=" or "\|\^"
@ -81,7 +81,7 @@ the timer frequently across a large number of processors may be
non-negligible.
The *diskfree* attribute will check for available disk space (in
megabytes) on supported operating systems. By default it will
MBytes) on supported operating systems. By default it will
check the file system of the current working directory. This
can be changed with the optional *path* keyword, which will take
the path to a file or folder on the file system to be checked
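Below is a hypothetical example of how this attribute could be used in an
input script (the check interval, threshold, and path are placeholders):

.. code-block:: LAMMPS

   # stop the run cleanly when the scratch file system has less
   # than 1024 MBytes (1 GByte) of free disk space
   fix stopper all halt 1000 diskfree < 1024 path /scratch/myjob error soft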

View File

@ -128,7 +128,7 @@ spectrum while consumes more memory. With fixed *f_max* and
:math:`\gamma`, *N_f* should be big enough to converge the classical
temperature :math:`T^{cl}` as a function of target quantum bath
temperature. Memory usage per processor could be from 10 to 100
Mbytes.
MBytes.
.. note::

View File

@ -135,7 +135,7 @@ with #) anywhere. Each non-blank non-comment line must contain one
keyword/value pair. The required keywords are *rcutfac* and
*twojmax*\ . Optional keywords are *rfac0*, *rmin0*,
*switchflag*, *bzeroflag*, *quadraticflag*, *chemflag*,
*bnormflag*, *wselfallflag*, and *chunksize*\ .
*bnormflag*, *wselfallflag*, *chunksize*, and *parallelthresh*\ .
The default values for these keywords are
@ -147,7 +147,8 @@ The default values for these keywords are
* *chemflag* = 0
* *bnormflag* = 0
* *wselfallflag* = 0
* *chunksize* = 4096
* *chunksize* = 32768
* *parallelthresh* = 8192
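For illustration, here is how the two GPU tuning keywords would appear as
keyword/value pairs in a hypothetical SNAP parameter file (the values shown
are just the defaults listed above)::

   # GPU tuning keywords; only used by pair style snap with the
   # KOKKOS package, ignored otherwise
   chunksize 32768
   parallelthresh 8192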
If *quadraticflag* is set to 1, then the SNAP energy expression includes
additional quadratic terms that have been shown to increase the overall
@ -188,14 +189,24 @@ corresponding *K*-vector of linear coefficients for element
which must equal the number of unique elements appearing in the LAMMPS
pair_coeff command, to avoid ambiguity in the number of coefficients.
The keyword *chunksize* is only applicable when using the
pair style *snap* with the KOKKOS package and is ignored otherwise.
This keyword controls
The keywords *chunksize* and *parallelthresh* are only applicable when
using the pair style *snap* with the KOKKOS package on GPUs and are
ignored otherwise.
The *chunksize* keyword controls
the number of atoms in each pass used to compute the bispectrum
components and is used to avoid running out of memory. For example,
if there are 8192 atoms in the simulation and the *chunksize*
is set to 4096, the bispectrum calculation will be broken up
into two passes.
into two passes (running on a single GPU).
The *parallelthresh* keyword controls
a crossover threshold for enabling extra parallelism. For
small systems, exposing additional parallelism can be beneficial when
there is otherwise not enough work to fully saturate the GPU threads.
However, the extra parallelism also leads to more divergence
and can hurt performance when the system is already large enough to
saturate the GPU threads. Extra parallelism will therefore only be used
if the *chunksize* (or total number of atoms per GPU) is smaller than
*parallelthresh*.
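The resulting dispatch can be summarized in a small, self-contained C++
sketch (the function names are illustrative stand-ins, not the actual
KOKKOS kernels in LAMMPS):

.. code-block:: c++

   #include <cstdio>

   // illustrative stand-ins for the two GPU code paths described above
   void launch_small_kernels() { std::puts("extra parallelism over j_bend"); }
   void launch_large_kernels() { std::puts("loop over j_bend inside the kernel"); }

   // crossover rule: use the extra-parallelism path only when the per-GPU
   // chunk of atoms is too small to otherwise saturate the GPU threads
   void dispatch(int chunk_size, int parallel_thresh)
   {
     if (chunk_size < parallel_thresh) launch_small_kernels();
     else launch_large_kernels();
   }

   int main()
   {
     dispatch(4096, 8192);  // small system: more parallelism, more divergence
     dispatch(65536, 8192); // large system: already saturated, avoid divergence
     return 0;
   }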
Detailed definitions for all the other keywords
are given on the :doc:`compute sna/atom <compute_sna_atom>` doc page.

View File

@ -1174,6 +1174,7 @@ googletest
Gordan
Goudeau
GPa
GPL
gpu
gpuID
gpus
@ -1689,6 +1690,7 @@ Lett
Leuven
Leven
Lewy
LGPL
lgvdw
Liang
libatc
@ -1889,7 +1891,6 @@ maxX
Mayergoyz
Mayoral
mbt
Mbytes
MBytes
mc
McLachlan

View File

@ -44,7 +44,8 @@ struct TagPairSNAPComputeForce{};
struct TagPairSNAPComputeNeigh{};
struct TagPairSNAPComputeCayleyKlein{};
struct TagPairSNAPPreUi{};
struct TagPairSNAPComputeUi{};
struct TagPairSNAPComputeUiSmall{}; // more parallelism, more divergence
struct TagPairSNAPComputeUiLarge{}; // less parallelism, no divergence
struct TagPairSNAPTransformUi{}; // re-order ulisttot from SoA to AoSoA, zero ylist
struct TagPairSNAPComputeZi{};
struct TagPairSNAPBeta{};
@ -53,7 +54,9 @@ struct TagPairSNAPTransformBi{}; // re-order blist from AoSoA to AoS
struct TagPairSNAPComputeYi{};
struct TagPairSNAPComputeYiWithZlist{};
template<int dir>
struct TagPairSNAPComputeFusedDeidrj{};
struct TagPairSNAPComputeFusedDeidrjSmall{}; // more parallelism, more divergence
template<int dir>
struct TagPairSNAPComputeFusedDeidrjLarge{}; // less parallelism, no divergence
// CPU backend only
struct TagPairSNAPComputeNeighCPU{};
@ -143,7 +146,10 @@ public:
void operator() (TagPairSNAPPreUi,const int iatom_mod, const int j, const int iatom_div) const;
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPComputeUi,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUi>::member_type& team) const;
void operator() (TagPairSNAPComputeUiSmall,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUiSmall>::member_type& team) const;
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPComputeUiLarge,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeUiLarge>::member_type& team) const;
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPTransformUi,const int iatom_mod, const int j, const int iatom_div) const;
@ -168,7 +174,11 @@ public:
template<int dir>
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPComputeFusedDeidrj<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrj<dir> >::member_type& team) const;
void operator() (TagPairSNAPComputeFusedDeidrjSmall<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrjSmall<dir> >::member_type& team) const;
template<int dir>
KOKKOS_INLINE_FUNCTION
void operator() (TagPairSNAPComputeFusedDeidrjLarge<dir>,const typename Kokkos::TeamPolicy<DeviceType, TagPairSNAPComputeFusedDeidrjLarge<dir> >::member_type& team) const;
// CPU backend only
KOKKOS_INLINE_FUNCTION

View File

@ -341,18 +341,32 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::compute(int eflag_in,
// ComputeUi w/vector parallelism, shared memory, direct atomicAdd into ulisttot
{
// team_size_compute_ui is defined in `pair_snap_kokkos.h`
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
const int tile_size = vector_length * (twojmax + 1);
const int scratch_size = scratch_size_helper<complex>(team_size_compute_ui * tile_size);
if (chunk_size < parallel_thresh)
{
// Version with parallelism over j_bend
// total number of teams needed: (natoms / 32) * (max_neighs) * ("bend" locations)
const int n_teams = chunk_size_div * max_neighs * (twojmax + 1);
const int n_teams_div = (n_teams + team_size_compute_ui - 1) / team_size_compute_ui;
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUi> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUiSmall> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
policy_ui = policy_ui.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeUi",policy_ui,*this);
Kokkos::parallel_for("ComputeUiSmall",policy_ui,*this);
} else {
// Version w/out parallelism over j_bend
// total number of teams needed: (natoms / 32) * (max_neighs)
const int n_teams = chunk_size_div * max_neighs;
const int n_teams_div = (n_teams + team_size_compute_ui - 1) / team_size_compute_ui;
SnapAoSoATeamPolicy<DeviceType, team_size_compute_ui, TagPairSNAPComputeUiLarge> policy_ui(n_teams_div, team_size_compute_ui, vector_length);
policy_ui = policy_ui.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeUiLarge",policy_ui,*this);
}
}
//TransformUi: un-"fold" ulisttot, zero ylist
@ -412,25 +426,51 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::compute(int eflag_in,
const int tile_size = vector_length * (twojmax + 1);
const int scratch_size = scratch_size_helper<complex>(2 * team_size_compute_fused_deidrj * tile_size);
if (chunk_size < parallel_thresh)
{
// Version with parallelism over j_bend
// total number of teams needed: (natoms / 32) * (max_neighs) * ("bend" locations)
const int n_teams = chunk_size_div * max_neighs * (twojmax + 1);
const int n_teams_div = (n_teams + team_size_compute_fused_deidrj - 1) / team_size_compute_fused_deidrj;
// x direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_x = policy_fused_deidrj_x.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrj<0>",policy_fused_deidrj_x,*this);
Kokkos::parallel_for("ComputeFusedDeidrjSmall<0>",policy_fused_deidrj_x,*this);
// y direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_y = policy_fused_deidrj_y.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrj<1>",policy_fused_deidrj_y,*this);
Kokkos::parallel_for("ComputeFusedDeidrjSmall<1>",policy_fused_deidrj_y,*this);
// z direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrj<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjSmall<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_z = policy_fused_deidrj_z.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrj<2>",policy_fused_deidrj_z,*this);
Kokkos::parallel_for("ComputeFusedDeidrjSmall<2>",policy_fused_deidrj_z,*this);
} else {
// Version w/out parallelism over j_bend
// total number of teams needed: (natoms / 32) * (max_neighs)
const int n_teams = chunk_size_div * max_neighs;
const int n_teams_div = (n_teams + team_size_compute_fused_deidrj - 1) / team_size_compute_fused_deidrj;
// x direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<0> > policy_fused_deidrj_x(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_x = policy_fused_deidrj_x.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrjLarge<0>",policy_fused_deidrj_x,*this);
// y direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<1> > policy_fused_deidrj_y(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_y = policy_fused_deidrj_y.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrjLarge<1>",policy_fused_deidrj_y,*this);
// z direction
SnapAoSoATeamPolicy<DeviceType, team_size_compute_fused_deidrj, TagPairSNAPComputeFusedDeidrjLarge<2> > policy_fused_deidrj_z(n_teams_div,team_size_compute_fused_deidrj,vector_length);
policy_fused_deidrj_z = policy_fused_deidrj_z.set_scratch_size(0, Kokkos::PerTeam(scratch_size));
Kokkos::parallel_for("ComputeFusedDeidrjLarge<2>",policy_fused_deidrj_z,*this);
}
}
#endif // LMP_KOKKOS_GPU
@ -603,13 +643,13 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
const auto idxb = icoeff % idxb_max;
const auto idx_chem = icoeff / idxb_max;
auto bveci = my_sna.blist(idxb, idx_chem, ii);
real_type bveci = my_sna.blist(ii, idx_chem, idxb);
d_beta_pack(iatom_mod,icoeff,iatom_div) += d_coeffi[k]*bveci;
k++;
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
const auto jdxb = jcoeff % idxb_max;
const auto jdx_chem = jcoeff / idxb_max;
real_type bvecj = my_sna.blist(jdxb, jdx_chem, ii);
real_type bvecj = my_sna.blist(ii, jdx_chem, jdxb);
d_beta_pack(iatom_mod,icoeff,iatom_div) += d_coeffi[k]*bvecj;
d_beta_pack(iatom_mod,jcoeff,iatom_div) += d_coeffi[k]*bveci;
k++;
@ -736,7 +776,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUi,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUi>::member_type& team) const {
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUiSmall,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUiSmall>::member_type& team) const {
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
// extract flattened atom_div / neighbor number / bend location
@ -756,11 +796,37 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
const int ninside = d_ninside(ii);
if (jj >= ninside) return;
my_sna.compute_ui(team,iatom_mod, jbend, jj, iatom_div);
my_sna.compute_ui_small(team, iatom_mod, jbend, jj, iatom_div);
});
}
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeUiLarge,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeUiLarge>::member_type& team) const {
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
// extract flattened atom_div / neighbor number / bend location
int flattened_idx = team.team_rank() + team.league_rank() * team_size_compute_ui;
// extract neighbor index, iatom_div
int iatom_div = flattened_idx / max_neighs; // removed "const" to work around GCC 7 bug
int jj = flattened_idx - iatom_div * max_neighs;
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, vector_length),
[&] (const int iatom_mod) {
const int ii = iatom_mod + vector_length * iatom_div;
if (ii >= chunk_size) return;
const int ninside = d_ninside(ii);
if (jj >= ninside) return;
my_sna.compute_ui_large(team,iatom_mod, jj, iatom_div);
});
}
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPTransformUi,const int iatom_mod, const int idxu, const int iatom_div) const {
@ -861,9 +927,9 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int itriple = 0; itriple < ntriples; itriple++) {
const auto blocal = my_sna.blist_pack(iatom_mod, idxb, itriple, iatom_div);
const real_type blocal = my_sna.blist_pack(iatom_mod, idxb, itriple, iatom_div);
my_sna.blist(idxb, itriple, iatom) = blocal;
my_sna.blist(iatom, itriple, idxb) = blocal;
}
}
@ -871,7 +937,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
template<class DeviceType, typename real_type, int vector_length>
template<int dir>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrj<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrj<dir> >::member_type& team) const {
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrjSmall<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrjSmall<dir> >::member_type& team) const {
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
// extract flattened atom_div / neighbor number / bend location
@ -891,12 +957,38 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
const int ninside = d_ninside(ii);
if (jj >= ninside) return;
my_sna.template compute_fused_deidrj<dir>(team, iatom_mod, jbend, jj, iatom_div);
my_sna.template compute_fused_deidrj_small<dir>(team, iatom_mod, jbend, jj, iatom_div);
});
}
template<class DeviceType, typename real_type, int vector_length>
template<int dir>
KOKKOS_INLINE_FUNCTION
void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSNAPComputeFusedDeidrjLarge<dir>,const typename Kokkos::TeamPolicy<DeviceType,TagPairSNAPComputeFusedDeidrjLarge<dir> >::member_type& team) const {
SNAKokkos<DeviceType, real_type, vector_length> my_sna = snaKK;
// extract flattened atom_div / neighbor number / bend location
int flattened_idx = team.team_rank() + team.league_rank() * team_size_compute_fused_deidrj;
// extract neighbor index, iatom_div
int iatom_div = flattened_idx / max_neighs; // removed "const" to work around GCC 7 bug
int jj = flattened_idx - max_neighs * iatom_div;
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team, vector_length),
[&] (const int iatom_mod) {
const int ii = iatom_mod + vector_length * iatom_div;
if (ii >= chunk_size) return;
const int ninside = d_ninside(ii);
if (jj >= ninside) return;
my_sna.template compute_fused_deidrj_large<dir>(team, iatom_mod, jj, iatom_div);
});
}
/* ----------------------------------------------------------------------
Begin routines that are unique to the CPU codepath. These do not take
advantage of AoSoA data layouts, but that could be a good point of
@ -925,13 +1017,13 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
const auto idxb = icoeff % idxb_max;
const auto idx_chem = icoeff / idxb_max;
auto bveci = my_sna.blist(idxb,idx_chem,ii);
real_type bveci = my_sna.blist(ii,idx_chem,idxb);
d_beta(icoeff,ii) += d_coeffi[k]*bveci;
k++;
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
const auto jdxb = jcoeff % idxb_max;
const auto jdx_chem = jcoeff / idxb_max;
auto bvecj = my_sna.blist(jdxb,jdx_chem,ii);
real_type bvecj = my_sna.blist(ii,jdx_chem,jdxb);
d_beta(icoeff,ii) += d_coeffi[k]*bvecj;
d_beta(jcoeff,ii) += d_coeffi[k]*bveci;
k++;
@ -1221,7 +1313,7 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
const auto idxb = icoeff % idxb_max;
const auto idx_chem = icoeff / idxb_max;
evdwl += d_coeffi[icoeff+1]*my_sna.blist(idxb,idx_chem,ii);
evdwl += d_coeffi[icoeff+1]*my_sna.blist(ii,idx_chem,idxb);
}
// quadratic contributions
@ -1230,12 +1322,12 @@ void PairSNAPKokkos<DeviceType, real_type, vector_length>::operator() (TagPairSN
for (int icoeff = 0; icoeff < ncoeff; icoeff++) {
const auto idxb = icoeff % idxb_max;
const auto idx_chem = icoeff / idxb_max;
auto bveci = my_sna.blist(idxb,idx_chem,ii);
real_type bveci = my_sna.blist(ii,idx_chem,idxb);
evdwl += 0.5*d_coeffi[k++]*bveci*bveci;
for (int jcoeff = icoeff+1; jcoeff < ncoeff; jcoeff++) {
auto jdxb = jcoeff % idxb_max;
auto jdx_chem = jcoeff / idxb_max;
auto bvecj = my_sna.blist(jdxb,jdx_chem,ii);
auto bvecj = my_sna.blist(ii,jdx_chem,jdxb);
evdwl += d_coeffi[k++]*bveci*bvecj;
}
}

View File

@ -45,12 +45,12 @@ struct WignerWrapper {
{ ; }
KOKKOS_INLINE_FUNCTION
complex get(const int& ma) {
complex get(const int& ma) const {
return complex(buffer[offset + 2 * vector_length * ma], buffer[offset + vector_length + 2 * vector_length * ma]);
}
KOKKOS_INLINE_FUNCTION
void set(const int& ma, const complex& store) {
void set(const int& ma, const complex& store) const {
buffer[offset + 2 * vector_length * ma] = store.re;
buffer[offset + vector_length + 2 * vector_length * ma] = store.im;
}
@ -122,8 +122,14 @@ inline
void compute_cayley_klein(const int&, const int&, const int&);
KOKKOS_INLINE_FUNCTION
void pre_ui(const int&, const int&, const int&, const int&); // ForceSNAP
// version of the code with parallelism over j_bend
KOKKOS_INLINE_FUNCTION
void compute_ui(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); // ForceSNAP
void compute_ui_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); // ForceSNAP
// version of the code without parallelism over j_bend
KOKKOS_INLINE_FUNCTION
void compute_ui_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int); // ForceSNAP
KOKKOS_INLINE_FUNCTION
void compute_zi(const int&, const int&, const int&); // ForceSNAP
KOKKOS_INLINE_FUNCTION
@ -135,6 +141,35 @@ inline
KOKKOS_INLINE_FUNCTION
void compute_bi(const int&, const int&, const int&); // ForceSNAP
// functions for derivatives, GPU only
// version of the code with parallelism over j_bend
template<int dir>
KOKKOS_INLINE_FUNCTION
void compute_fused_deidrj_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); //ForceSNAP
// version of the code without parallelism over j_bend
template<int dir>
KOKKOS_INLINE_FUNCTION
void compute_fused_deidrj_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int); //ForceSNAP
// core "evaluation" functions that get plugged into "compute" functions
// plugged into compute_ui_small, compute_ui_large
KOKKOS_FORCEINLINE_FUNCTION
void evaluate_ui_jbend(const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&, const int&,
const int&, const int&, const int&);
// plugged into compute_zi, compute_yi
KOKKOS_FORCEINLINE_FUNCTION
complex evaluate_zi(const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&,
const int&, const int&, const int&, const int&, const real_type*);
// plugged into compute_yi, compute_yi_with_zlist
KOKKOS_FORCEINLINE_FUNCTION
real_type evaluate_beta_scaled(const int&, const int&, const int&, const int&, const int&, const int&, const int&, const int&,
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &);
// plugged into compute_fused_deidrj_small, compute_fused_deidrj_large
KOKKOS_FORCEINLINE_FUNCTION
real_type evaluate_duidrj_jbend(const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&,
const WignerWrapper<real_type, vector_length>&, const complex&, const complex&, const real_type&,
const int&, const int&, const int&, const int&);
// functions for bispectrum coefficients, CPU only
KOKKOS_INLINE_FUNCTION
void pre_ui_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team,const int&,const int&); // ForceSNAP
@ -148,11 +183,6 @@ inline
KOKKOS_INLINE_FUNCTION
void compute_bi_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int); // ForceSNAP
// functions for derivatives, GPU only
template<int dir>
KOKKOS_INLINE_FUNCTION
void compute_fused_deidrj(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int, const int, const int, const int); //ForceSNAP
// functions for derivatives, CPU only
KOKKOS_INLINE_FUNCTION
void compute_duidrj_cpu(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int, int); //ForceSNAP
@ -168,23 +198,6 @@ inline
KOKKOS_INLINE_FUNCTION
void compute_s_dsfac(const real_type, const real_type, real_type&, real_type&); // compute_cayley_klein
static KOKKOS_FORCEINLINE_FUNCTION
void sincos_wrapper(double x, double* sin_, double *cos_) {
#ifdef __SYCL_DEVICE_ONLY__
*sin_ = sycl::sincos(x, cos_);
#else
sincos(x, sin_, cos_);
#endif
}
static KOKKOS_FORCEINLINE_FUNCTION
void sincos_wrapper(float x, float* sin_, float *cos_) {
#ifdef __SYCL_DEVICE_ONLY__
*sin_ = sycl::sincos(x, cos_);
#else
sincosf(x, sin_, cos_);
#endif
}
#ifdef TIMING_INFO
double* timers;
timespec starttime, endtime;
@ -207,7 +220,7 @@ inline
int twojmax, diagonalstyle;
t_sna_3d_ll blist;
t_sna_3d blist;
t_sna_3c_ll ulisttot;
t_sna_3c_ll ulisttot_full; // un-folded ulisttot, cpu only
t_sna_3c_ll zlist;

View File

@ -316,7 +316,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::grow_rij(int newnatom, int
ulist = t_sna_3c_ll(Kokkos::NoInit("sna:ulist"),1,1,1);
zlist = t_sna_3c_ll(Kokkos::NoInit("sna:zlist"),1,1,1);
zlist_pack = t_sna_4c_ll(Kokkos::NoInit("sna:zlist_pack"),vector_length,idxz_max,ndoubles,natom_div);
blist = t_sna_3d_ll(Kokkos::NoInit("sna:blist"),idxb_max,ntriples,natom);
blist = t_sna_3d(Kokkos::NoInit("sna:blist"),natom,ntriples,idxb_max);
blist_pack = t_sna_4d_ll(Kokkos::NoInit("sna:blist_pack"),vector_length,idxb_max,ntriples,natom_div);
ylist = t_sna_3c_ll(Kokkos::NoInit("sna:ylist"),1,1,1);
ylist_pack_re = t_sna_4d_ll(Kokkos::NoInit("sna:ylist_pack_re"),vector_length,idxu_half_max,nelements,natom_div);
@ -337,7 +337,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::grow_rij(int newnatom, int
ulist = t_sna_3c_ll(Kokkos::NoInit("sna:ulist"),idxu_cache_max,natom,nmax);
zlist = t_sna_3c_ll(Kokkos::NoInit("sna:zlist"),idxz_max,ndoubles,natom);
zlist_pack = t_sna_4c_ll(Kokkos::NoInit("sna:zlist_pack"),1,1,1,1);
blist = t_sna_3d_ll(Kokkos::NoInit("sna:blist"),idxb_max,ntriples,natom);
blist = t_sna_3d(Kokkos::NoInit("sna:blist"),natom,ntriples,idxb_max);
blist_pack = t_sna_4d_ll(Kokkos::NoInit("sna:blist_pack"),1,1,1,1);
ylist = t_sna_3c_ll(Kokkos::NoInit("sna:ylist"),idxu_half_max,nelements,natom);
ylist_pack_re = t_sna_4d_ll(Kokkos::NoInit("sna:ylist_pack_re"),1,1,1,1);
@ -365,44 +365,44 @@ KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_cayley_klein(const int& iatom_mod, const int& jnbor, const int& iatom_div)
{
const int iatom = iatom_mod + vector_length * iatom_div;
const auto x = rij(iatom,jnbor,0);
const auto y = rij(iatom,jnbor,1);
const auto z = rij(iatom,jnbor,2);
const auto rsq = x * x + y * y + z * z;
const auto r = sqrt(rsq);
const auto rcut = rcutij(iatom, jnbor);
const auto rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcut - rmin0);
const auto theta0 = (r - rmin0) * rscale0;
real_type sn, cs;
sincos_wrapper(theta0, &sn, &cs);
const real_type x = rij(iatom,jnbor,0);
const real_type y = rij(iatom,jnbor,1);
const real_type z = rij(iatom,jnbor,2);
const real_type rsq = x * x + y * y + z * z;
const real_type r = sqrt(rsq);
const real_type rcut = rcutij(iatom, jnbor);
const real_type rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcut - rmin0);
const real_type theta0 = (r - rmin0) * rscale0;
const real_type sn = sin(theta0);
const real_type cs = cos(theta0);
const real_type z0 = r * cs / sn;
const real_type dz0dr = z0 / r - (r*rscale0) * (rsq + z0 * z0) / rsq;
const auto wj_local = wj(iatom, jnbor);
const real_type wj_local = wj(iatom, jnbor);
real_type sfac, dsfac;
compute_s_dsfac(r, rcut, sfac, dsfac);
sfac *= wj_local;
dsfac *= wj_local;
const auto rinv = static_cast<real_type>(1.0) / r;
const auto ux = x * rinv;
const auto uy = y * rinv;
const auto uz = z * rinv;
const real_type rinv = static_cast<real_type>(1.0) / r;
const real_type ux = x * rinv;
const real_type uy = y * rinv;
const real_type uz = z * rinv;
const auto r0inv = static_cast<real_type>(1.0) / sqrt(r * r + z0 * z0);
const real_type r0inv = static_cast<real_type>(1.0) / sqrt(r * r + z0 * z0);
const complex a = { z0 * r0inv, -z * r0inv };
const complex b = { r0inv * y, -r0inv * x };
const auto dr0invdr = -r0inv * r0inv * r0inv * (r + z0 * dz0dr);
const real_type dr0invdr = -r0inv * r0inv * r0inv * (r + z0 * dz0dr);
const auto dr0invx = dr0invdr * ux;
const auto dr0invy = dr0invdr * uy;
const auto dr0invz = dr0invdr * uz;
const real_type dr0invx = dr0invdr * ux;
const real_type dr0invy = dr0invdr * uy;
const real_type dr0invz = dr0invdr * uz;
const auto dz0x = dz0dr * ux;
const auto dz0y = dz0dr * uy;
const auto dz0z = dz0dr * uz;
const real_type dz0x = dz0dr * ux;
const real_type dz0y = dz0dr * uy;
const real_type dz0z = dz0dr * uz;
const complex dax = { dz0x * r0inv + z0 * dr0invx, -z * dr0invx };
const complex day = { dz0y * r0inv + z0 * dr0invy, -z * dr0invy };
@ -412,9 +412,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_cayley_klein(const
const complex dby = { y * dr0invy + r0inv, -x * dr0invy };
const complex dbz = { y * dr0invz, -x * dr0invz };
const auto dsfacux = dsfac * ux;
const auto dsfacuy = dsfac * uy;
const auto dsfacuz = dsfac * uz;
const real_type dsfacux = dsfac * ux;
const real_type dsfacuy = dsfac * uy;
const real_type dsfacuz = dsfac * uz;
a_pack(iatom_mod,jnbor,iatom_div) = a;
b_pack(iatom_mod,jnbor,iatom_div) = b;
@ -479,17 +479,13 @@ void SNAKokkos<DeviceType, real_type, vector_length>::pre_ui(const int& iatom_mo
accumulating to the total. GPU only.
------------------------------------------------------------------------- */
// Version of the code that exposes additional parallelism by threading over `j_bend` values
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
{
// utot(j,ma,mb) = 0 for all j,ma,ma
// utot(j,ma,ma) = 1 for all j,ma
// for j in neighbors of i:
// compute r0 = (x,y,z,z0)
// utot(j,ma,mb) += u(r0;j,ma,mb) for all j,ma,mb
// get shared memory offset
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
const int tile_size = vector_length * (twojmax + 1);
@ -498,13 +494,12 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
const int scratch_shift = team_rank * tile_size;
// extract and wrap
WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
const WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
// load parameters
const auto a = a_pack(iatom_mod, jnbor, iatom_div);
const auto b = b_pack(iatom_mod, jnbor, iatom_div);
const auto sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
@ -512,12 +507,59 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
// this for loop is here for context --- we expose additional
// parallelism over this loop instead
//for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
evaluate_ui_jbend(ulist_wrapper, a, b, sfac, jelem, iatom_mod, j_bend, iatom_div);
}
// Version of the code that loops over all `j_bend` values which reduces integer arithmetic
// and some amount of load imbalance, at the expense of reducing parallelism
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int jnbor, const int iatom_div)
{
// get shared memory offset
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
const int tile_size = vector_length * (twojmax + 1);
const int team_rank = team.team_rank();
const int scratch_shift = team_rank * tile_size;
// extract and wrap
const WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
// load parameters
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
// we need to "choose" when to bend
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
evaluate_ui_jbend(ulist_wrapper, a, b, sfac, jelem, iatom_mod, j_bend, iatom_div);
}
}
// Core "evaluation" kernel that gets reused in `compute_ui_small` and `compute_ui_large`
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_FORCEINLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::evaluate_ui_jbend(const WignerWrapper<real_type, vector_length>& ulist_wrapper,
const complex& a, const complex& b, const real_type& sfac, const int& jelem,
const int& iatom_mod, const int& j_bend, const int& iatom_div)
{
// utot(j,ma,mb) = 0 for all j,ma,mb
// utot(j,ma,ma) = 1 for all j,ma
// for j in neighbors of i:
// compute r0 = (x,y,z,z0)
// utot(j,ma,mb) += u(r0;j,ma,mb) for all j,ma,mb
// level 0 is just 1.
ulist_wrapper.set(0, complex::one());
// j from before the bend, don't store, mb == 0
// this is "creeping up the side"
for (int j = 1; j <= j_bend; j++) {
constexpr int mb = 0; // intentional for readability, compiler should optimize this out
@ -601,12 +643,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_ui(const typename
Kokkos::atomic_add(&(ulisttot_im_pack(iatom_mod, jjup + ma, jelem, iatom_div)), ulist_prev.im * sfac);
}
//} // end of "reference" loop over j_bend
}
/* ----------------------------------------------------------------------
compute Zi by summing over products of Ui,
AoSoA data layout to take advantage of coalescing, avoiding warp
@ -634,47 +672,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_zi(const int& iato
for (int elem1 = 0; elem1 < nelements; elem1++) {
for (int elem2 = 0; elem2 < nelements; elem2++) {
complex ztmp = complex::zero();
int jju1 = idxu_block[j1] + (j1+1)*mb1min;
int jju2 = idxu_block[j2] + (j2+1)*mb2max;
int icgb = mb1min*(j2+1) + mb2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ib = 0; ib < nb; ib++) {
int ma1 = ma1min;
int ma2 = ma2max;
int icga = ma1min*(j2+1) + ma2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ia = 0; ia < na; ia++) {
const auto utot1 = ulisttot_pack(iatom_mod, jju1+ma1, elem1, iatom_div);
const auto utot2 = ulisttot_pack(iatom_mod, jju2+ma2, elem2, iatom_div);
const auto cgcoeff_a = cgblock[icga];
const auto cgcoeff_b = cgblock[icgb];
ztmp.re += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
ztmp.im += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
ma1++;
ma2--;
icga += j2;
} // end loop over ia
jju1 += j1 + 1;
jju2 -= j2 + 1;
icgb += j2;
} // end loop over ib
if (bnorm_flag) {
ztmp.re /= (j + 1);
ztmp.im /= (j + 1);
}
zlist_pack(iatom_mod,jjz,idouble,iatom_div) = ztmp;
zlist_pack(iatom_mod,jjz,idouble,iatom_div) = evaluate_zi(j1, j2, j, ma1min, ma2max, mb1min, mb2max, na, nb, iatom_mod, elem1, elem2, iatom_div, cgblock);
idouble++;
}
@ -721,8 +720,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
const int jju_index = jju+mb*(j+1)+ma;
const int jjz_index = jjz+mb*(j+1)+ma;
if (2*mb == j) return; // I think we can remove this?
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
sumzu_temp += utot.re * zloc.re + utot.im * zloc.im;
}
}
@ -737,8 +736,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
const int jju_index = jju+(mb-1)*(j+1)+(j+1)+ma;
const int jjz_index = jjz+(mb-1)*(j+1)+(j+1)+ma;
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
sumzu_temp += utot.re * zloc.re + utot.im * zloc.im;
}
@ -748,8 +747,8 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi(const int& iato
const int jju_index = jju+(mb-1)*(j+1)+(j+1)+ma;
const int jjz_index = jjz+(mb-1)*(j+1)+(j+1)+ma;
const auto utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const auto zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
const complex utot = ulisttot_pack(iatom_mod, jju_index, elem3, iatom_div);
const complex zloc = zlist_pack(iatom_mod, jjz_index, idouble, iatom_div);
sumzu += static_cast<real_type>(0.5) * (utot.re * zloc.re + utot.im * zloc.im);
} // end if jeven
@ -785,7 +784,6 @@ KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod, int jjz, int iatom_div,
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack)
{
real_type betaj;
const int j1 = idxz(jjz, 0);
const int j2 = idxz(jjz, 1);
@ -805,46 +803,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod,
for (int elem1 = 0; elem1 < nelements; elem1++) {
for (int elem2 = 0; elem2 < nelements; elem2++) {
real_type ztmp_r = 0.0;
real_type ztmp_i = 0.0;
int jju1 = idxu_block[j1] + (j1 + 1) * mb1min;
int jju2 = idxu_block[j2] + (j2 + 1) * mb2max;
int icgb = mb1min * (j2 + 1) + mb2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ib = 0; ib < nb; ib++) {
int ma1 = ma1min;
int ma2 = ma2max;
int icga = ma1min*(j2+1) + ma2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ia = 0; ia < na; ia++) {
const auto utot1 = ulisttot_pack(iatom_mod,jju1+ma1,elem1,iatom_div);
const auto utot2 = ulisttot_pack(iatom_mod,jju2+ma2,elem2,iatom_div);
const auto cgcoeff_a = cgblock[icga];
const auto cgcoeff_b = cgblock[icgb];
ztmp_r += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
ztmp_i += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
ma1++;
ma2--;
icga += j2;
} // end loop over ia
jju1 += j1 + 1;
jju2 -= j2 + 1;
icgb += j2;
} // end loop over ib
if (bnorm_flag) {
ztmp_r /= (j + 1);
ztmp_i /= (j + 1);
}
const complex ztmp = evaluate_zi(j1, j2, j, ma1min, ma2max, mb1min, mb2max, na, nb, iatom_mod, elem1, elem2, iatom_div, cgblock);
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
// find right y_list[jju] and beta(iatom,jjb) entries
@ -853,30 +812,11 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi(int iatom_mod,
// pick out right beta value
for (int elem3 = 0; elem3 < nelements; elem3++) {
if (j >= j1) {
const int jjb = idxb_block(j1, j2, j);
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
if (j1 == j) {
if (j2 == j) betaj = 3 * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else if (j >= j2) {
const int jjb = idxb_block(j, j2, j1);
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
if (j2 == j) betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else {
const int jjb = idxb_block(j2, j, j1);
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
betaj = beta_pack(iatom_mod, itriple, iatom_div);
}
if (!bnorm_flag && j1 > j)
betaj *= (j1 + 1) / (j + 1.0);
const real_type betaj = evaluate_beta_scaled(j1, j2, j, iatom_mod, elem1, elem2, elem3, iatom_div, beta_pack);
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp_r);
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp_i);
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.re);
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.im);
} // end loop over elem3
} // end loop over elem2
} // end loop over elem1
@ -893,7 +833,6 @@ KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_with_zlist(int iatom_mod, int jjz, int iatom_div,
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack)
{
real_type betaj;
const int j1 = idxz(jjz, 0);
const int j2 = idxz(jjz, 1);
const int j = idxz(jjz, 2);
@ -901,49 +840,123 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_with_zlist(int
int idouble = 0;
for (int elem1 = 0; elem1 < nelements; elem1++) {
for (int elem2 = 0; elem2 < nelements; elem2++) {
auto ztmp = zlist_pack(iatom_mod,jjz,idouble,iatom_div);
const complex ztmp = zlist_pack(iatom_mod,jjz,idouble,iatom_div);
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
// find right y_list[jju] and beta(iatom,jjb) entries
// multiply and divide by j+1 factors
// account for multiplicity of 1, 2, or 3
// pick out right beta value
for (int elem3 = 0; elem3 < nelements; elem3++) {
if (j >= j1) {
const int jjb = idxb_block(j1, j2, j);
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
if (j1 == j) {
if (j2 == j) betaj = 3 * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else if (j >= j2) {
const int jjb = idxb_block(j, j2, j1);
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
if (j2 == j) betaj = 2 * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else {
const int jjb = idxb_block(j2, j, j1);
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
betaj = beta_pack(iatom_mod, itriple, iatom_div);
}
if (!bnorm_flag && j1 > j)
betaj *= (j1 + 1) / (j + 1.0);
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp.re);
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj*ztmp.im);
const real_type betaj = evaluate_beta_scaled(j1, j2, j, iatom_mod, elem1, elem2, elem3, iatom_div, beta_pack);
Kokkos::atomic_add(&(ylist_pack_re(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.re);
Kokkos::atomic_add(&(ylist_pack_im(iatom_mod, jju_half, elem3, iatom_div)), betaj * ztmp.im);
} // end loop over elem3
idouble++;
} // end loop over elem2
} // end loop over elem1
}
// Core "evaluation" kernel that computes a single zlist value
// which gets used in both `compute_zi` and `compute_yi`
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_FORCEINLINE_FUNCTION
typename SNAKokkos<DeviceType, real_type, vector_length>::complex SNAKokkos<DeviceType, real_type, vector_length>::evaluate_zi(const int& j1, const int& j2, const int& j,
const int& ma1min, const int& ma2max, const int& mb1min, const int& mb2max, const int& na, const int& nb,
const int& iatom_mod, const int& elem1, const int& elem2, const int& iatom_div, const real_type* cgblock) {
complex ztmp = complex::zero();
int jju1 = idxu_block[j1] + (j1+1)*mb1min;
int jju2 = idxu_block[j2] + (j2+1)*mb2max;
int icgb = mb1min*(j2+1) + mb2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ib = 0; ib < nb; ib++) {
int ma1 = ma1min;
int ma2 = ma2max;
int icga = ma1min*(j2+1) + ma2max;
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int ia = 0; ia < na; ia++) {
const complex utot1 = ulisttot_pack(iatom_mod, jju1+ma1, elem1, iatom_div);
const complex utot2 = ulisttot_pack(iatom_mod, jju2+ma2, elem2, iatom_div);
const real_type cgcoeff_a = cgblock[icga];
const real_type cgcoeff_b = cgblock[icgb];
ztmp.re += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.re - utot1.im * utot2.im);
ztmp.im += cgcoeff_a * cgcoeff_b * (utot1.re * utot2.im + utot1.im * utot2.re);
ma1++;
ma2--;
icga += j2;
} // end loop over ia
jju1 += j1 + 1;
jju2 -= j2 + 1;
icgb += j2;
} // end loop over ib
if (bnorm_flag) {
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
ztmp.re *= scale;
ztmp.im *= scale;
}
return ztmp;
}
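For reference, the quantity `evaluate_zi` accumulates is, in the notation of the SNAP literature (a sketch with the index bookkeeping simplified; the exact loop bounds follow ma1min/ma2max/mb1min/mb2max above):

  z^{j_1,j_2,j}_{m_a,m_b} = \sum_{m_{a1},\,m_{b1}} C^{j\,m_a}_{j_1 m_{a1}\, j_2 m_{a2}}\; C^{j\,m_b}_{j_1 m_{b1}\, j_2 m_{b2}}\; u^{j_1}_{m_{a1} m_{b1}}\, u^{j_2}_{m_{a2} m_{b2}}, \qquad z \leftarrow z/(j+1) \ \text{if bnorm\_flag}

where the two C factors are the Clebsch-Gordan coefficients streamed in through `cgblock` (cgcoeff_a walking the ma indices, cgcoeff_b the mb indices).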
// Core "evaluation" kernel that extracts and rescales the appropriate `beta` value,
// which gets used in both `compute_yi` and `compute_yi_with_zlist`
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_FORCEINLINE_FUNCTION
typename SNAKokkos<DeviceType, real_type, vector_length>::real_type SNAKokkos<DeviceType, real_type, vector_length>::evaluate_beta_scaled(const int& j1, const int& j2, const int& j,
const int& iatom_mod, const int& elem1, const int& elem2, const int& elem3, const int& iatom_div,
const Kokkos::View<real_type***, Kokkos::LayoutLeft, DeviceType> &beta_pack) {
real_type betaj = 0;
if (j >= j1) {
const int jjb = idxb_block(j1, j2, j);
const int itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
if (j1 == j) {
if (j2 == j) betaj = static_cast<real_type>(3) * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = static_cast<real_type>(2) * beta_pack(iatom_mod, itriple, iatom_div);
} else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else if (j >= j2) {
const int jjb = idxb_block(j, j2, j1);
const int itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
if (j2 == j) betaj = static_cast<real_type>(2) * beta_pack(iatom_mod, itriple, iatom_div);
else betaj = beta_pack(iatom_mod, itriple, iatom_div);
} else {
const int jjb = idxb_block(j2, j, j1);
const int itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
betaj = beta_pack(iatom_mod, itriple, iatom_div);
}
if (!bnorm_flag && j1 > j) {
const real_type scale = static_cast<real_type>(j1 + 1) / static_cast<real_type>(j + 1);
betaj *= scale;
}
return betaj;
}
/* ----------------------------------------------------------------------
Fused calculation of the derivative of Ui w.r.t. atom j
and accumulation into dEidRj. GPU only.
------------------------------------------------------------------------- */
// Version of the code that exposes additional parallelism by threading over `j_bend` values
template<class DeviceType, typename real_type, int vector_length>
template<int dir>
KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj_small(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int j_bend, const int jnbor, const int iatom_div)
{
// get shared memory offset
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
@ -957,21 +970,76 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
WignerWrapper<real_type, vector_length> dulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
// load parameters
const auto a = a_pack(iatom_mod, jnbor, iatom_div);
const auto b = b_pack(iatom_mod, jnbor, iatom_div);
const auto da = da_pack(iatom_mod, jnbor, iatom_div, dir);
const auto db = db_pack(iatom_mod, jnbor, iatom_div, dir);
const auto sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const auto dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
const complex da = da_pack(iatom_mod, jnbor, iatom_div, dir);
const complex db = db_pack(iatom_mod, jnbor, iatom_div, dir);
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const real_type dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
auto dedr_full_sum = static_cast<real_type>(0.);
// compute the contribution to dedr_full_sum for one "bend" location
const real_type dedr_full_sum = evaluate_duidrj_jbend(ulist_wrapper, a, b, sfac, dulist_wrapper, da, db, dsfacu,
jelem, iatom_mod, j_bend, iatom_div);
// we need to "choose" when to bend
// this for loop is here for context --- we expose additional
// parallelism over this loop instead
//for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
// dedr gets zeroed out at the start of each iteration in compute_cayley_klein
Kokkos::atomic_add(&(dedr(iatom_mod + vector_length * iatom_div, jnbor, dir)), static_cast<real_type>(2.0) * dedr_full_sum);
}
// Version of the code that loops over all `j_bend` values, which reduces integer
// arithmetic and some of the load imbalance, at the expense of less parallelism
template<class DeviceType, typename real_type, int vector_length>
template<int dir>
KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj_large(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, const int iatom_mod, const int jnbor, const int iatom_div)
{
// get shared memory offset
// scratch size: 32 atoms * (twojmax+1) cached values, no double buffer
const int tile_size = vector_length * (twojmax + 1);
const int team_rank = team.team_rank();
const int scratch_shift = team_rank * tile_size;
// extract, wrap shared memory buffer
WignerWrapper<real_type, vector_length> ulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
WignerWrapper<real_type, vector_length> dulist_wrapper((complex*)team.team_shmem().get_shmem(team.team_size() * tile_size * sizeof(complex), 0) + scratch_shift, iatom_mod);
// load parameters
const complex a = a_pack(iatom_mod, jnbor, iatom_div);
const complex b = b_pack(iatom_mod, jnbor, iatom_div);
const complex da = da_pack(iatom_mod, jnbor, iatom_div, dir);
const complex db = db_pack(iatom_mod, jnbor, iatom_div, dir);
const real_type sfac = sfac_pack(iatom_mod, jnbor, iatom_div, 0);
const real_type dsfacu = sfac_pack(iatom_mod, jnbor, iatom_div, dir + 1); // dsfac * u
const int jelem = element(iatom_mod + vector_length * iatom_div, jnbor);
// compute the contributions to dedr_full_sum for all "bend" locations
real_type dedr_full_sum = static_cast<real_type>(0);
#ifdef LMP_KK_DEVICE_COMPILE
#pragma unroll
#endif
for (int j_bend = 0; j_bend <= twojmax; j_bend++) {
dedr_full_sum += evaluate_duidrj_jbend(ulist_wrapper, a, b, sfac, dulist_wrapper, da, db, dsfacu,
jelem, iatom_mod, j_bend, iatom_div);
}
// there's one thread per (atom, neighbor) pair, so no need to make this atomic
dedr(iatom_mod + vector_length * iatom_div, jnbor, dir) = static_cast<real_type>(2.0) * dedr_full_sum;
}
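The `_small`/`_large` split leaves the kernel choice to the calling pair style. A minimal, hypothetical host-side heuristic (the function name, its arguments, and the bare comparison are assumptions for illustration, not the actual driver logic):

  // Hypothetical sketch: small work batches benefit from the extra j_bend
  // parallelism of the "_small" kernel; large batches already saturate the
  // device, so the "_large" kernel's coarser threads win by cutting integer
  // arithmetic. The threshold mirrors the new `parallelthresh` keyword.
  inline bool use_jbend_parallel_kernel(int natoms_in_batch, int parallel_thresh) {
    return natoms_in_batch < parallel_thresh;
  }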
// Core "evaluation" kernel that gets reused in `compute_fused_deidrj_small` and
// `compute_fused_deidrj_large`
template<class DeviceType, typename real_type, int vector_length>
KOKKOS_FORCEINLINE_FUNCTION
typename SNAKokkos<DeviceType, real_type, vector_length>::real_type SNAKokkos<DeviceType, real_type, vector_length>::evaluate_duidrj_jbend(const WignerWrapper<real_type, vector_length>& ulist_wrapper, const complex& a, const complex& b, const real_type& sfac,
const WignerWrapper<real_type, vector_length>& dulist_wrapper, const complex& da, const complex& db, const real_type& dsfacu,
const int& jelem, const int& iatom_mod, const int& j_bend, const int& iatom_div) {
real_type dedr_full_sum = static_cast<real_type>(0);
// level 0 is just 1, 0
ulist_wrapper.set(0, complex::one());
@ -1039,7 +1107,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
// grab y_local early
// this will never be the last element of a row, no need to rescale.
auto y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
complex y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
// grab the cached value
const complex ulist_prev = ulist_wrapper.get(ma);
@ -1085,7 +1153,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
for (int ma = 0; ma < j; ma++) {
// grab y_local early
auto y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
complex y_local = complex(ylist_pack_re(iatom_mod, jjup + ma, jelem, iatom_div), ylist_pack_im(iatom_mod, jjup+ma, jelem, iatom_div));
if (j % 2 == 1 && 2*(mb-1) == j-1) { // double check me...
if (ma == (mb-1)) { y_local = static_cast<real_type>(0.5)*y_local; }
else if (ma > (mb-1)) { y_local.re = static_cast<real_type>(0.); y_local.im = static_cast<real_type>(0.); } // can probably avoid this outright
@ -1100,15 +1168,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_fused_deidrj(const
dedr_full_sum += du_prod.re * y_local.re + du_prod.im * y_local.im;
}
//} // end reference loop over j_bend
// dedr gets zeroed out at the start of each iteration in compute_cayley_klein
Kokkos::atomic_add(&(dedr(iatom_mod + vector_length * iatom_div, jnbor, dir)), static_cast<real_type>(2.0) * dedr_full_sum);
return dedr_full_sum;
}
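Schematically, this kernel and its two wrappers accumulate (signs and index ranges as in the code above):

  \frac{\partial E_i}{\partial r^{\alpha}_{ij}} = 2\,\mathrm{Re} \sum_{j,\,m_a,\,m_b} \frac{\partial u^{j}_{m_a m_b}}{\partial r^{\alpha}_{ij}} \left( y^{j}_{m_a m_b} \right)^{*}

noting that `du_prod.re * y_local.re + du_prod.im * y_local.im` is exactly Re(du_prod * conj(y_local)).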
/* ----------------------------------------------------------------------
* CPU routines
* ----------------------------------------------------------------------*/
@ -1238,8 +1301,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_zi_cpu(const int&
} // end loop over ib
if (bnorm_flag) {
zlist(jjz, idouble, iatom).re /= (j+1);
zlist(jjz, idouble, iatom).im /= (j+1);
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
zlist(jjz, idouble, iatom).re *= scale;
zlist(jjz, idouble, iatom).im *= scale;
}
idouble++;
} // end loop over elem2
@ -1268,7 +1332,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi_cpu(const typen
int idouble = 0;
for (int elem1 = 0; elem1 < nelements; elem1++) {
for (int elem2 = 0; elem2 < nelements; elem2++) {
auto jalloy = idouble; // must be non-const to work around gcc compiler bug
int jalloy = idouble; // must be non-const to work around gcc compiler bug
for (int elem3 = 0; elem3 < nelements; elem3++) {
Kokkos::parallel_for(Kokkos::TeamThreadRange(team,idxb_max),
[&] (const int& jjb) {
@ -1331,7 +1395,7 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_bi_cpu(const typen
}
}
blist(jjb, itriple, iatom) = sumzu;
blist(iatom, itriple, jjb) = sumzu;
});
});
//} // end loop over j
@ -1410,8 +1474,9 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_cpu(int iter,
} // end loop over ib
if (bnorm_flag) {
ztmp_i /= j + 1;
ztmp_r /= j + 1;
const real_type scale = static_cast<real_type>(1) / static_cast<real_type>(j + 1);
ztmp_i *= scale;
ztmp_r *= scale;
}
// apply to z(j1,j2,j,ma,mb) to unique element of y(j)
@ -1424,24 +1489,24 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_yi_cpu(int iter,
if (j >= j1) {
const int jjb = idxb_block(j1, j2, j);
const auto itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
const int itriple = ((elem1 * nelements + elem2) * nelements + elem3) * idxb_max + jjb;
if (j1 == j) {
if (j2 == j) betaj = 3 * beta(itriple, iatom);
else betaj = 2 * beta(itriple, iatom);
} else betaj = beta(itriple, iatom);
} else if (j >= j2) {
const int jjb = idxb_block(j, j2, j1);
const auto itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
const int itriple = ((elem3 * nelements + elem2) * nelements + elem1) * idxb_max + jjb;
if (j2 == j) betaj = 2 * beta(itriple, iatom);
else betaj = beta(itriple, iatom);
} else {
const int jjb = idxb_block(j2, j, j1);
const auto itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
const int itriple = ((elem2 * nelements + elem3) * nelements + elem1) * idxb_max + jjb;
betaj = beta(itriple, iatom);
}
if (!bnorm_flag && j1 > j)
betaj *= (j1 + 1) / (j + 1.0);
betaj *= static_cast<real_type>(j1 + 1) / static_cast<real_type>(j + 1);
Kokkos::atomic_add(&(ylist(jju_half, elem3, iatom).re), betaj*ztmp_r);
Kokkos::atomic_add(&(ylist(jju_half, elem3, iatom).im), betaj*ztmp_i);
@ -1469,9 +1534,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_duidrj_cpu(const t
z = rij(iatom,jnbor,2);
rsq = x * x + y * y + z * z;
r = sqrt(rsq);
auto rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcutij(iatom,jnbor) - rmin0);
real_type rscale0 = rfac0 * static_cast<real_type>(MY_PI) / (rcutij(iatom,jnbor) - rmin0);
theta0 = (r - rmin0) * rscale0;
sincos_wrapper(theta0, &sn, &cs);
sn = sin(theta0);
cs = cos(theta0);
z0 = r * cs / sn;
dz0dr = z0 / r - (r*rscale0) * (rsq + z0 * z0) / rsq;
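The `z0`/`dz0dr` expressions are easier to audit in closed form; restating the code above (no new math):

  \theta_0 = (r - r_{\min 0})\, r_{\mathrm{scale0}}, \qquad z_0 = r \cot\theta_0, \qquad \frac{dz_0}{dr} = \frac{z_0}{r} - r\, r_{\mathrm{scale0}}\, \frac{r^2 + z_0^2}{r^2} = \frac{z_0}{r} - r\, r_{\mathrm{scale0}} \csc^2\theta_0

since (r^2 + z_0^2)/r^2 = 1 + \cot^2\theta_0 = \csc^2\theta_0.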
@ -1559,7 +1625,7 @@ KOKKOS_INLINE_FUNCTION
void SNAKokkos<DeviceType, real_type, vector_length>::add_uarraytot(const typename Kokkos::TeamPolicy<DeviceType>::member_type& team, int iatom, int jnbor,
const real_type& r, const real_type& wj, const real_type& rcut, int jelem)
{
const auto sfac = compute_sfac(r, rcut) * wj;
const real_type sfac = compute_sfac(r, rcut) * wj;
Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,twojmax+1),
[&] (const int& j) {
@ -2168,7 +2234,7 @@ real_type SNAKokkos<DeviceType, real_type, vector_length>::compute_sfac(real_typ
if (r <= rmin0) return one;
else if (r > rcut) return zero;
else {
auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
return onehalf * (cos((r - rmin0) * rcutfac) + one);
}
}
@ -2188,7 +2254,7 @@ real_type SNAKokkos<DeviceType, real_type, vector_length>::compute_dsfac(real_ty
if (r <= rmin0) return zero;
else if (r > rcut) return zero;
else {
auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
return -onehalf * sin((r - rmin0) * rcutfac) * rcutfac;
}
}
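Read together, `compute_sfac` and `compute_dsfac` implement the cosine switching function and its radial derivative:

  s(r) = \begin{cases} 1 & r \le r_{\min 0} \\ \tfrac{1}{2}\left[\cos\left(\pi\,\tfrac{r - r_{\min 0}}{r_{\mathrm{cut}} - r_{\min 0}}\right) + 1\right] & r_{\min 0} < r \le r_{\mathrm{cut}} \\ 0 & r > r_{\mathrm{cut}} \end{cases} \qquad s'(r) = -\frac{\pi}{2\,(r_{\mathrm{cut}} - r_{\min 0})}\, \sin\left(\pi\,\tfrac{r - r_{\min 0}}{r_{\mathrm{cut}} - r_{\min 0}}\right)

with s'(r) = 0 outside the switching window.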
@ -2206,9 +2272,10 @@ void SNAKokkos<DeviceType, real_type, vector_length>::compute_s_dsfac(const real
if (r <= rmin0) { sfac = one; dsfac = zero; }
else if (r > rcut) { sfac = zero; dsfac = zero; }
else {
const auto rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
real_type sn, cs;
sincos_wrapper((r - rmin0) * rcutfac, &sn, &cs); // need to create a wrapper
const real_type rcutfac = static_cast<real_type>(MY_PI) / (rcut - rmin0);
const real_type theta0 = (r - rmin0) * rcutfac;
const real_type sn = sin(theta0);
const real_type cs = cos(theta0);
sfac = onehalf * (cs + one);
dsfac = -onehalf * sn * rcutfac;

View File

@ -628,7 +628,8 @@ void PairSNAP::read_files(char *coefffilename, char *paramfilename)
chemflag = 0;
bnormflag = 0;
wselfallflag = 0;
chunksize = 4096;
chunksize = 32768;
parallel_thresh = 8192;
// open SNAP parameter file on proc 0
@ -696,6 +697,8 @@ void PairSNAP::read_files(char *coefffilename, char *paramfilename)
wselfallflag = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
else if (keywd == "chunksize")
chunksize = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
else if (keywd == "parallelthresh")
parallel_thresh = utils::inumeric(FLERR,keyval.c_str(),false,lmp);
else
error->all(FLERR,"Unknown parameter '{}' in SNAP "
"parameter file", keywd);

View File

@ -59,7 +59,7 @@ class PairSNAP : public Pair {
double **scale; // for thermodynamic integration
int twojmax, switchflag, bzeroflag, bnormflag;
int chemflag, wselfallflag;
int chunksize;
int chunksize,parallel_thresh;
double rfac0, rmin0, wj1, wj2;
int rcutfacflag, twojmaxflag; // flags for required parameters
int beta_max; // length of beta

View File

@ -20,7 +20,7 @@ charges (dsf and long-range treatment of charges)
out-of-plane angle
See the file doc/drude_tutorial.html for getting started.
See the doc pages for "pair_style buck6d/coul/gauss", "anlge_style class2",
See the doc pages for "pair_style buck6d/coul/gauss", "angle_style class2",
"angle_style cosine/buck6d", and "improper_style inversion/harmonic"
commands to get started. Also see the above mentioned website and
literature for further documentation about the force field.

View File

@ -34,6 +34,7 @@ exclude:
- lib/hdnnp
- lib/kim
- lib/kokkos
- lib/latte
- lib/machdyn
- lib/mdi
- lib/mscg
@ -41,6 +42,7 @@ exclude:
- lib/plumed
- lib/quip
- lib/scafacos
- lib/voronoi
- src/Make.sh
patterns:
- "*.c"