Merge branch 'develop' into fix-set-command

@ -29,6 +29,7 @@ OPT.

* :doc:`ave/grid <fix_ave_grid>`
* :doc:`ave/histo <fix_ave_histo>`
* :doc:`ave/histo/weight <fix_ave_histo>`
* :doc:`ave/moments <fix_ave_moments>`
* :doc:`ave/time <fix_ave_time>`
* :doc:`aveforce <fix_aveforce>`
* :doc:`balance <fix_balance>`

@ -29,6 +29,7 @@ Available topics in mostly chronological order are:

- `Rename of fix STORE/PERATOM to fix STORE/ATOM and change of arguments`_
- `Use Output::get_dump_by_id() instead of Output::find_dump()`_
- `Refactored grid communication using Grid3d/Grid2d classes instead of GridComm`_
- `FLERR as first argument to minimum image functions in Domain class`_

----

@ -610,3 +611,47 @@ KSpace solvers which use distributed FFT grids:

- ``src/KSPACE/pppm.cpp``

This change is **required** or else the code will not compile.

FLERR as first argument to minimum image functions in Domain class
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. versionchanged:: TBD

The ``Domain::minimum_image()`` and ``Domain::minimum_image_big()``
functions were changed to take the ``FLERR`` macro as the first argument.
This way the error message indicates *where* the function was called
instead of pointing to the implementation of the function. Example:

Old:

.. code-block:: c++

   double delx1 = x[i1][0] - x[i2][0];
   double dely1 = x[i1][1] - x[i2][1];
   double delz1 = x[i1][2] - x[i2][2];
   domain->minimum_image(delx1, dely1, delz1);
   double r1 = sqrt(delx1 * delx1 + dely1 * dely1 + delz1 * delz1);

   double delx2 = x[i3][0] - x[i2][0];
   double dely2 = x[i3][1] - x[i2][1];
   double delz2 = x[i3][2] - x[i2][2];
   domain->minimum_image_big(delx2, dely2, delz2);
   double r2 = sqrt(delx2 * delx2 + dely2 * dely2 + delz2 * delz2);

New:

.. code-block:: c++

   double delx1 = x[i1][0] - x[i2][0];
   double dely1 = x[i1][1] - x[i2][1];
   double delz1 = x[i1][2] - x[i2][2];
   domain->minimum_image(FLERR, delx1, dely1, delz1);
   double r1 = sqrt(delx1 * delx1 + dely1 * dely1 + delz1 * delz1);

   double delx2 = x[i3][0] - x[i2][0];
   double dely2 = x[i3][1] - x[i2][1];
   double delz2 = x[i3][2] - x[i2][2];
   domain->minimum_image_big(FLERR, delx2, dely2, delz2);
   double r2 = sqrt(delx2 * delx2 + dely2 * dely2 + delz2 * delz2);

This change is **required** or else the code will not compile.

@ -75,15 +75,34 @@ section below for examples where this has been done.

**Differences between the GPU and KOKKOS packages:**

* The GPU package accelerates only pair force, neighbor list, and (parts
  of) PPPM calculations (and runs the remaining force computations on
  the CPU concurrently). The KOKKOS package attempts to run most of the
  calculation on the GPU, but can transparently support non-accelerated
  code (with a performance penalty due to having data transfers between
  host and GPU).
* The list of which styles are accelerated by the GPU or KOKKOS package
  differs, with some overlap.
* The GPU package requires neighbor lists to be built on the CPU when using
  hybrid pair styles, exclusion lists, or a triclinic simulation box.
* The GPU package benefits from running multiple MPI processes (2-8) per
  GPU to parallelize the non-GPU accelerated styles. The KOKKOS package
  usually does not, especially when all parts of the calculation have
  KOKKOS support.
* The GPU package can be compiled for CUDA, HIP, or OpenCL and thus
  supports NVIDIA, AMD, and Intel GPUs well. On NVIDIA or AMD hardware,
  using native CUDA or HIP compilation, respectively, with either GPU or
  KOKKOS results in equal or better performance over OpenCL.
* OpenCL in the GPU package supports NVIDIA, AMD, and Intel GPUs at the
  *same time* and with the *same executable*. KOKKOS currently does not
  support OpenCL.
* The GPU package supports single precision floating point, mixed
  precision floating point, and double precision floating point math on
  the GPU. This must be chosen at compile time. KOKKOS currently only
  supports double precision floating point math. Using single or mixed
  precision (recommended) results in significantly improved performance
  on consumer GPUs for some loss in accuracy (which is rather small with
  mixed precision). Single and mixed precision support for KOKKOS is in
  development (no ETA yet).
* Some pair styles (for example :doc:`snap <pair_snap>`, :doc:`mliap
  <pair_mliap>`, or :doc:`reaxff <pair_reaxff>`) in the KOKKOS package have
  seen extensive optimizations and specializations for GPUs and CPUs.
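
As a hedged illustration (not part of the diff above), the two packages
are typically enabled at run time with command-line switches like the
following; the executable name, process count, and input file name are
placeholders:

.. code-block:: bash

   # GPU package: several MPI ranks sharing one GPU via suffix/package flags
   mpirun -np 4 ./lmp -sf gpu -pk gpu 1 -in in.script
   # KOKKOS package: one MPI rank driving one GPU
   mpirun -np 1 ./lmp -k on g 1 -sf kk -in in.script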

@ -1,16 +1,218 @@

Measuring performance
=====================

Factors that influence performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before trying to make your simulation run faster, you should understand
how it currently performs and where the bottlenecks are. We generally
distinguish between serial performance (how fast can a single process do
the calculations?) and parallel efficiency (how much faster does a
calculation get by using more processes?). There are many factors
affecting both; the lists below discuss some commonly known as well as
some less well-known factors.

Factors affecting serial performance (in no specific order):

* CPU hardware: clock rate, cache sizes, CPU architecture (instructions
  per clock, vectorization support, fused multiply-add support, and more)
* RAM speed and number of channels that the CPU can use to access RAM
* Cooling: CPUs can change their clock rate based on thermal load, so the
  degree of cooling can affect the speed of a CPU. Sometimes even the
  temperature of neighboring compute nodes in a cluster can make a
  difference.
* Compiler optimization: most of LAMMPS is written to be easy to modify,
  and thus compiler optimization can speed up calculations. However, too
  aggressive compiler optimization can produce incorrect results or
  crashes (during compilation or at runtime).
* Source code improvements: styles in the OPT, OPENMP, and INTEL packages
  can be faster than their base implementation due to improved data
  access patterns, cache efficiency, or vectorization. Compiler optimization
  is required to take full advantage of these.
* Number and kind of fixes, computes, or variables used during a simulation,
  especially if they result in collective communication operations
* Pair style cutoffs and system density: calculations get slower the more
  neighbors there are in the neighbor list and thus the more interactions
  that need to be computed. Force fields with pair styles that compute
  interactions between triples or quadruples of atoms or that use embedding
  energies or charge equilibration will need to walk the neighbor lists
  multiple times.
* Neighbor list settings: tradeoff between the neighbor list skin (larger
  skin = more neighbors, more distances to compute before applying the
  cutoff) and the frequency of neighbor list builds (larger skin = fewer
  neighbor list builds); see the sketch after this list.
* Proximity of per-atom data in physical memory for atoms that are
  close in space improves cache efficiency (thus LAMMPS will by default
  sort atoms in local storage accordingly)
* Using r-RESPA multi-timestepping or a SHAKE or RATTLE fix to constrain
  bonds with higher-frequency vibrations may allow a larger (outer) timestep
  and thus fewer force evaluations (usually the most time-consuming step in
  MD) for the same simulated time (with some tradeoff in accuracy).
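
As a sketch of the neighbor list tradeoff mentioned in the list above
(the values are illustrative placeholders, not recommendations):

.. code-block:: LAMMPS

   # larger skin: more neighbors kept per list, but fewer list rebuilds
   neighbor        2.0 bin
   # rebuild only when an atom may have moved more than half the skin
   neigh_modify    delay 0 every 1 check yes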

Factors affecting parallel efficiency (in no specific order):

* Bandwidth and latency of communication between processes. This can vary a
  lot between processes on the same CPU or physical node and processes
  on different physical nodes, and it also varies between different
  communication technologies (like Ethernet, InfiniBand, or other
  high-speed interconnects)
* Frequency and complexity of the communication patterns required
* Number of "work units" (usually correlated with the number of atoms
  and the choice of force field) per MPI process required for one time step
  (if this number becomes too small, the cost of communication becomes
  dominant).
* Choice of parallelization method (MPI-only, OpenMP-only, MPI+OpenMP,
  MPI+GPU, MPI+GPU+OpenMP)
* Algorithmic complexity of the chosen force field (pair-wise vs. many-body
  potential, Ewald vs. PPPM vs. (compensated or smoothed) cutoff-Coulomb)
* Communication cutoff: a larger cutoff results in more ghost atoms and
  thus more data that needs to be communicated
* Frequency of neighbor list builds: during a neighbor list build the
  domain decomposition is updated and the list of ghost atoms rebuilt,
  which requires multiple global communication steps
* FFT-grid settings and number of MPI processes for kspace style PPPM:
  PPPM uses parallel 3d FFTs, whose parallel efficiency drops much faster
  with the number of MPI processes than that of other parts of the force
  computation. Thus using MPI+OpenMP parallelization
  or :doc:`run style verlet/split <run_style>` can improve parallel
  efficiency by limiting the number of MPI processes used for the FFTs.
* Load (im-)balance: LAMMPS' domain decomposition assumes that atoms are
  evenly distributed across the entire simulation box. If there are
  areas of vacuum, this may lead to different amounts of work for
  different MPI processes. Using the :doc:`processors command
  <processors>` to change the spatial decomposition, MPI+OpenMP
  parallelization instead of MPI-only to get larger sub-domains, or the
  (fix) balance command (without or with switching to communication style
  tiled) to change the sub-domain volumes are all methods that
  can help to avoid load imbalances (see the sketch after this list).
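
As a minimal sketch of the last point, dynamic load balancing with
recursive bisectioning could be enabled as follows (the fix ID,
rebalancing interval, and threshold are illustrative, not recommended
values):

.. code-block:: LAMMPS

   # allow non-brick sub-domains so their volumes can adapt to atom density
   comm_style      tiled
   # rebalance every 1000 steps if the max/average imbalance exceeds 1.1
   fix             bal all balance 1000 1.1 rcb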

Examples comparing serial performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before looking at your own input deck(s), you should get some reference
data from a known input so that you know what kind of performance you
should expect from your input. For the following we therefore use the
``in.rhodo.scaled`` input file and ``data.rhodo`` data file from the
``bench`` folder. This is a system of 32000 atoms using the CHARMM force
field and long-range electrostatics running for 100 MD steps. The
performance data is printed at the end of a run and only measures the
performance during propagation, excluding the setup phase.

Running with a single MPI process on an AMD Ryzen Threadripper PRO
9985WX CPU (64 cores, 128 threads, base clock: 3.2GHz, max. clock
5.4GHz, L1/L2/L3 cache 5MB/64MB/256MB, 8 DDR5-6400 memory channels) one
gets the following performance report:

.. code-block::

   Performance: 1.232 ns/day, 19.476 hours/ns, 7.131 timesteps/s, 228.197 katom-step/s
   99.2% CPU use with 1 MPI tasks x 1 OpenMP threads

The %CPU value should be at 100% or very close. Lower values would
be an indication that there are *other* processes also using the same
CPU core and thus invalidating the performance data. The katom-step/s
value is best suited for comparisons, since it is fairly independent
of the system size. The ``in.rhodo.scaled`` input can easily be made
larger through replication in the three dimensions by setting the
variables "x", "y", "z" to values other than 1 from the command line
with the "-var" flag (see the command-line sketch after the list).
Example:

- 32000 atoms: 228.8 katom-step/s
- 64000 atoms: 231.6 katom-step/s
- 128000 atoms: 231.1 katom-step/s
- 256000 atoms: 226.4 katom-step/s
- 864000 atoms: 229.6 katom-step/s
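
A command line for such a replicated run might look like the following
(a sketch; the executable name ``lmp`` and the working directory are
assumptions that depend on your build):

.. code-block:: bash

   # 2x2x2 replication of the 32000 atom system = 256000 atoms, serial run
   ./lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2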

Comparing to an AMD Ryzen 7 7840HS CPU (8 cores, 16 threads, base clock
3.8GHz, max. clock 5.1GHz, L1/L2/L3 cache 512kB/8MB/16MB, 2 DDR5-5600
memory channels), we get similar single core performance (~220
katom-step/s vs. ~230 katom-step/s) due to the similar clock and
architecture:

- 32000 atoms: 219.8 katom-step/s
- 64000 atoms: 222.5 katom-step/s
- 128000 atoms: 216.8 katom-step/s
- 256000 atoms: 221.0 katom-step/s
- 864000 atoms: 221.1 katom-step/s

Switching to an older Intel Xeon E5-2650 v4 CPU (12 cores, 12 threads,
base clock 2.2GHz, max. clock 2.9GHz, L1/L2/L3 cache 64kB/256kB/30MB, 4
DDR4-2400 memory channels) leads to a lower performance of approximately
109 katom-step/s due to differences in architecture and clock. In all
cases, when looking at multiple runs, the katom-step/s property
fluctuates by approximately 1% around the average.

From here on we are looking at the performance for the 256000 atom system
only and change several settings incrementally (a CMake sketch for the
compiler settings follows the list):

#. No compiler optimization GCC (-Og -g): 183.8 katom-step/s
#. Moderate optimization with debug info GCC (-O2 -g): 231.1 katom-step/s
#. Full compiler optimization GCC (-DNDEBUG -O3): 236.0 katom-step/s
#. Aggressive compiler optimization GCC (-O3 -ffast-math -march=native): 239.9 katom-step/s
#. Source code optimization in OPENMP package (1 thread): 266.7 katom-step/s
#. Use *fix nvt* instead of *fix npt* (compute virial only every 50 steps): 272.9 katom-step/s
#. Increase pair style cutoff by 2 :math:`\AA`: 181.2 katom-step/s
#. Use tight PPPM convergence (1.0e-6 instead of 1.0e-4): 161.9 katom-step/s
#. Use Ewald summation instead of PPPM (at 1.0e-4 convergence): 19.9 katom-step/s
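
One possible way to select such compiler flags with a CMake based build
(a sketch; the preset file, build directory layout, and flags are
assumptions and should be adapted to the case you want to reproduce):

.. code-block:: bash

   # configure a build with aggressive GCC optimization flags
   cmake -C ../cmake/presets/basic.cmake \
         -D CMAKE_CXX_FLAGS="-O3 -ffast-math -march=native" ../cmake
   cmake --build . --parallel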

The numbers show that gains from aggressive compiler optimizations are
rather small in LAMMPS; the data access optimizations in the OPENMP (and
OPT) packages are more prominent. On the other hand, using more
accurate force field settings causes, not unexpectedly, a significant
slowdown (to about half the speed). Finally, using regular Ewald
summation causes a massive slowdown due to its poor algorithmic scaling
with system size.

Examples comparing parallel performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The parallel performance comes on top of the serial performance:
using twice as many processors should increase the performance metric
by up to a factor of two. With the number of processors *N*, the
serial performance :math:`p_1`, and the performance for *N* processors
:math:`p_N`, we can define a *parallel efficiency* in percent as follows:

.. math::

   P_{eff} = \frac{p_N}{p_1 \cdot N} \cdot 100\%

For the AMD Ryzen Threadripper PRO 9985WX CPU and the serial
simulation settings of point 6. from above, we get the following
parallel efficiency data for the 256000 atom system:

- 1 MPI task: 273.6 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 530.6 katom-step/s, :math:`P_{eff} = 97\%`
- 4 MPI tasks: 1.021 Matom-step/s, :math:`P_{eff} = 93\%`
- 8 MPI tasks: 1.837 Matom-step/s, :math:`P_{eff} = 84\%`
- 16 MPI tasks: 3.574 Matom-step/s, :math:`P_{eff} = 82\%`
- 32 MPI tasks: 6.479 Matom-step/s, :math:`P_{eff} = 74\%`
- 64 MPI tasks: 9.032 Matom-step/s, :math:`P_{eff} = 52\%`
- 128 MPI tasks: 12.03 Matom-step/s, :math:`P_{eff} = 34\%`

The run with 128 MPI tasks uses CPU cores from hyper-threading.
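
Such runs can be launched with an MPI launcher in the usual way, for
example (a sketch; the launcher, executable name, and any CPU binding
options depend on your installation):

.. code-block:: bash

   # 16 MPI tasks on the 2x2x2 replicated (256000 atom) benchmark
   mpirun -np 16 ./lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2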

For a small system with only 32000 atoms the parallel efficiency
drops off earlier when the number of work units is too small relative
to the communication overhead:

- 1 MPI task: 270.8 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 529.3 katom-step/s, :math:`P_{eff} = 98\%`
- 4 MPI tasks: 989.8 katom-step/s, :math:`P_{eff} = 91\%`
- 8 MPI tasks: 1.832 Matom-step/s, :math:`P_{eff} = 85\%`
- 16 MPI tasks: 3.463 Matom-step/s, :math:`P_{eff} = 80\%`
- 32 MPI tasks: 5.970 Matom-step/s, :math:`P_{eff} = 69\%`
- 64 MPI tasks: 7.477 Matom-step/s, :math:`P_{eff} = 42\%`
- 128 MPI tasks: 8.069 Matom-step/s, :math:`P_{eff} = 23\%`

Measuring performance of your input deck
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The best way to do this is to run your system (actual number of atoms)
for a modest number of timesteps (say 100 steps) on several different
processor counts, including a single processor if possible. Do this for
an equilibrated version of your system, so that the 100-step timings are
representative of a much longer run. There is typically no need to run
for 1000s of timesteps to get accurate timings; you can simply
extrapolate from short runs.

For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run. The

@ -28,12 +230,15 @@ breakdown and relative percentages. For example, trying different
options for speeding up the long-range solvers will have little impact
if they only consume 10% of the run time. If the pairwise time is
dominating, you may want to look at GPU or OMP versions of the pair
style, as discussed below. Comparing how the percentages change as you
increase the processor count gives you a sense of how different
operations within the timestep are scaling. If you are using PPPM as
the Kspace solver, you can turn on additional output with
:doc:`kspace_modify fftbench yes <kspace_modify>`, which measures the
time spent during PPPM on the 3d FFTs; these can be communication
intensive for larger processor counts. This provides an indication of
whether it is worth trying out alternatives to the default FFT settings
for additional performance.
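
For reference, enabling this FFT timing breakdown is a one-line addition
to the input; a minimal sketch (the PPPM accuracy value is just a
placeholder):

.. code-block:: LAMMPS

   kspace_style    pppm 1.0e-4
   # print a separate timing breakdown for the 3d FFTs at the end of a run
   kspace_modify   fftbench yes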

Another important detail in the timing info are the histograms of
atom counts and neighbor counts. If these vary widely across

@ -208,6 +208,7 @@ accelerated styles exist.

* :doc:`ave/grid <fix_ave_grid>` - compute per-grid time-averaged quantities
* :doc:`ave/histo <fix_ave_histo>` - compute/output time-averaged histograms
* :doc:`ave/histo/weight <fix_ave_histo>` - weighted version of fix ave/histo
* :doc:`ave/moments <fix_ave_moments>` - compute moments of scalar quantities
* :doc:`ave/time <fix_ave_time>` - compute/output global time-averaged quantities
* :doc:`aveforce <fix_aveforce>` - add an averaged force to each atom
* :doc:`balance <fix_balance>` - perform dynamic load-balancing

@ -82,10 +82,9 @@ specified values may represent calculations performed by computes and
fixes which store their own "group" definitions.

Each listed value can be the result of a compute or fix or the
evaluation of an equal-style or vector-style variable. The specified
indices can include a wildcard string. See the
:doc:`fix ave/correlate <fix_ave_correlate>` page for details on that.

The *Nevery* and *Nfreq* arguments specify on what time steps the input
values will be used to calculate correlation data and the frequency

doc/src/fix_ave_moments.rst (new file, 296 lines)

@ -0,0 +1,296 @@

.. index:: fix ave/moments

fix ave/moments command
=======================

Syntax
""""""

.. code-block:: LAMMPS

   fix ID group-ID ave/moments Nevery Nrepeat Nfreq value1 value2 ... moment1 moment2 ... keyword args ...

* ID, group-ID are documented in :doc:`fix <fix>` command
* ave/moments = style name of this fix command
* Nevery = use input values every this many time steps
* Nrepeat = # of times to use input values for calculating averages
* Nfreq = calculate averages every this many time steps
* one or more input values can be listed
* value = c_ID or c_ID[I] or f_ID or f_ID[I] or v_name or v_name[I]

  .. parsed-literal::

       c_ID = global scalar calculated by a compute with ID
       c_ID[I] = Ith component of global vector calculated by a compute with ID, I can include wildcard (see below)
       f_ID = global scalar calculated by a fix with ID
       f_ID[I] = Ith component of global vector calculated by a fix with ID, I can include wildcard (see below)
       v_name = value calculated by an equal-style variable with name
       v_name[I] = value calculated by a vector-style variable with name, I can include wildcard (see below)

* one or more moments to compute can be listed
* moment = *mean* or *stddev* or *variance* or *skew* or *kurtosis* (see exact definitions below)
* zero or more keyword/arg pairs may be appended
* keyword = *start* or *history*

  .. parsed-literal::

       *start* args = Nstart
         Nstart = invoke first after this time step
       *history* args = Nrecent
         Nrecent = keep a history of up to Nrecent outputs

Examples
""""""""

.. code-block:: LAMMPS

   fix 1 all ave/moments 1 1000 100 v_volume mean stddev
   fix 1 all ave/moments 1 200 1000 v_volume variance kurtosis history 10

Description
"""""""""""

.. versionadded:: TBD

Using one or more values as input, calculate the moments of the underlying
(population) distributions based on samples collected every few time
steps over a time step window. The definitions of the moments calculated
are given below.

The group specified with this command is ignored. However, note that
specified values may represent calculations performed by computes and
fixes which store their own "group" definitions.

Each listed value can be the result of a :doc:`compute <compute>` or
:doc:`fix <fix>` or the evaluation of an equal-style or vector-style
:doc:`variable <variable>`. In each case, the compute, fix, or variable
must produce a global quantity, not a per-atom or local quantity.
If you wish to spatial- or time-average or histogram per-atom
quantities from a compute, fix, or variable, then see the :doc:`fix
ave/chunk <fix_ave_chunk>`, :doc:`fix ave/atom <fix_ave_atom>`, or
:doc:`fix ave/histo <fix_ave_histo>` commands. If you wish to sum a
per-atom quantity into a single global quantity, see the :doc:`compute
reduce <compute_reduce>` command.

Many :doc:`computes <compute>` and :doc:`fixes <fix>` produce global
quantities. See their doc pages for details. :doc:`Variables <variable>`
of style *equal* and *vector* are the only ones that can be used with
this fix. Variables of style *atom* cannot be used, since they produce
per-atom values.

The input values must all be scalars or vectors with a bracketed term
appended, indicating that the :math:`I^\text{th}` value of the vector is
used.

The result of this fix can be accessed as a vector, containing the
interleaved moments of each input in order. If M moments are requested,
then the moments of input 1 will be the first M values in the vector
output by this fix, the moments of input 2 will be the next M values, etc.
If there are N values, the vector length will be N*M.
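
As a hypothetical sketch of this ordering (the variable names and fix ID
are made up for illustration), two inputs with two moments produce a
vector of length 4, which can be referenced as ``f_moms[1]`` through
``f_moms[4]``:

.. code-block:: LAMMPS

   variable mytemp equal temp
   variable mypress equal press
   # vector: [mean(mytemp), stddev(mytemp), mean(mypress), stddev(mypress)]
   fix moms all ave/moments 10 100 1000 v_mytemp v_mypress mean stddev
   thermo_style custom step f_moms[1] f_moms[2] f_moms[3] f_moms[4]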

----------

For input values from a compute or fix or variable, the bracketed index
I can be specified using a wildcard asterisk with the index to
effectively specify multiple values. This takes the form "\*" or "\*n"
or "m\*" or "m\*n". If :math:`N` is the size of the vector, then an
asterisk with no numeric values means all indices from 1 to :math:`N`.
A leading asterisk means all indices from 1 to n (inclusive). A
trailing asterisk means all indices from n to :math:`N` (inclusive). A
middle asterisk means all indices from m to n (inclusive).

Using a wildcard is the same as if the individual elements of the vector
or cells of the array had been listed one by one. For examples, see the
description of this capability in :doc:`fix ave/time <fix_ave_time>`.

----------

The :math:`N_\text{every}`, :math:`N_\text{repeat}`, and
:math:`N_\text{freq}` arguments specify on what time steps the input
values will be used in order to contribute to the average. The final
statistics are generated on time steps that are a multiple of
:math:`N_\text{freq}`\ . The average is over a window of up to
:math:`N_\text{repeat}` quantities, computed in the preceding portion of
the simulation once every :math:`N_\text{every}` time steps.

.. note::

   Contrary to most fix ave/* commands, it is not required that Nevery *
   Nrepeat <= Nfreq. This is to allow the user to choose the time
   window and number of samples contributing to the output at each
   Nfreq interval.

For example, if :math:`N_\text{freq}=100` and :math:`N_\text{repeat}=5`
(and :math:`N_\text{every}=1`), then on step 100 values from time steps
96, 97, 98, 99, and 100 will be used. The fix does not compute its
inputs on steps that are not required. If :math:`N_\text{freq}=5`,
:math:`N_\text{repeat}=8` and :math:`N_\text{every}=1`, then values
will first be calculated on step 5 from steps 1-5, on step 10 from 3-10,
on step 15 from 8-15 and so on, forming a rolling average over
timesteps that span a time window larger than Nfreq.
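
A fix definition matching the second case would look as follows (a
sketch; ``v_myvar`` stands for a hypothetical equal-style variable):

.. code-block:: LAMMPS

   # Nevery=1, Nrepeat=8, Nfreq=5: output every 5 steps, averaged over up
   # to the 8 most recent samples (a rolling window wider than Nfreq)
   fix roll all ave/moments 1 8 5 v_myvar mean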

----------

If a value begins with "c\_", a compute ID must follow which has been
previously defined in the input script. If no bracketed term is
appended, the global scalar calculated by the compute is used. If a
bracketed term is appended, the Ith element of the global vector
calculated by the compute is used. See the discussion above for how I
can be specified with a wildcard asterisk to effectively specify
multiple values.

If a value begins with "f\_", a fix ID must follow which has been
previously defined in the input script. If no bracketed term is
appended, the global scalar calculated by the fix is used. If a
bracketed term is appended, the Ith element of the global vector
calculated by the fix is used. See the discussion above for how I can
be specified with a wildcard asterisk to effectively specify multiple
values.

Note that some fixes only produce their values on certain time steps,
which must be compatible with *Nevery*, else an error will result.
Users can also write code for their own fix styles and :doc:`add them to
LAMMPS <Modify>`.

If a value begins with "v\_", a variable name must follow which has been
previously defined in the input script. Only equal-style or vector-style
variables can be used, which both produce global values. Vector-style
variables require a bracketed term to specify the Ith element of the
vector calculated by the variable.

Note that variables of style *equal* and *vector* define a formula which
can reference individual atom properties or thermodynamic keywords, or
they can invoke other computes, fixes, or variables when they are
evaluated, so this is a very general means of specifying quantities to
time average.

----------

The moments are output in the order requested in the arguments following
the last input. Any number and order of moments can be specified,
although it does not make much sense to specify the same moment multiple
times. All moments are computed using a correction of the sample estimators
used to obtain unbiased cumulants :math:`k_{1..4}` (see :ref:`(Cramer)
<Cramer1>`). The correction for variance is the standard Bessel
correction. For other moments, see :ref:`(Joanes) <Joanes1>`.

For *mean*, the arithmetic mean :math:`\bar{x} = \frac{1}{n}
\sum_{i=1}^{n} x_i` is calculated.

For *variance*, the Bessel-corrected sample variance :math:`var = k_2 =
\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2` is calculated.

For *stddev*, the Bessel-corrected sample standard deviation
:math:`stddev = \sqrt{k_2}` is calculated.

For *skew*, the adjusted Fisher--Pearson standardized moment :math:`G_1
= \frac{k_3}{k_2^{3/2}} = \frac{k_3}{stddev^3}` is calculated.

For *kurtosis*, the adjusted Fisher--Pearson standardized moment
:math:`G_2 = \frac{k_4}{k_2^2}` is calculated.

----------

Fix invocation and output can be modified by optional keywords.

The *start* keyword specifies that the first computation should be no
earlier than the step number given (but will still occur on a multiple
of *Nfreq*). The default is step 0. Often input values can be 0.0 at
time 0, so setting *start* to a larger value can avoid including a 0.0
in a longer series.

The *history* keyword stores the Nrecent most recent outputs on Nfreq
timesteps, so they can be accessed as global outputs of the fix. Nrecent
must be >= 1. The default is 1, meaning only the most recent output is
accessible. For example, if history 10 is specified and Nfreq = 1000,
then on timestep 20000, the Nfreq outputs from steps 20000, 19000, ...
11000 are available for access. See below for details on how to access
the history values.

For example, this will store the outputs of the previous 10 Nfreq
time steps, i.e. a window of 10000 time steps:

.. code-block:: LAMMPS

   fix 1 all ave/moments 1 200 1000 v_volume mean history 10

The previous results can be accessed as values in a global array output
by this fix. Each column of the array is the vector output of the N-th
preceding Nfreq timestep. For example, assuming a single moment is
calculated, the most recent result corresponding to the third input
value would be accessed as "f_name[3][1]", while "f_name[3][4]" is the
4th most recent, and so on. The current vector output is always the
first column of the array, corresponding to the most recent result.

To illustrate the utility of keeping output history, consider using
this fix in conjunction with :doc:`fix halt <fix_halt>` to stop a run
automatically if a quantity is converged to within some desired tolerance:

.. code-block:: LAMMPS

   variable target equal etot
   fix aveg all ave/moments 1 200 1000 v_target mean stddev history 10
   variable stopcond equal "abs(f_aveg[1]-f_aveg[1][10])<f_aveg[2]"
   fix fhalt all halt 1000 v_stopcond == 1

In this example, every 1000 time steps, the average and standard
deviation of the total energy over the previous 200 time steps are
calculated. If the difference between the most recent and 10-th most
recent average is lower than the most recent standard deviation, the run
is stopped.

----------

Restart, fix_modify, output, run start/stop, minimize info
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

No information about this fix is written to :doc:`binary restart files
<restart>`.

This fix produces a global vector and global array which can be accessed
by various :doc:`output commands <Howto_output>`. The values can be
accessed on any time step, but may not be current.

A global vector is produced with the # of elements = number of moments *
number of inputs. The moments are output in the order given in the fix
definition. An array is produced having # of rows = length of vector
output (with an ordering which matches the vector) and # of columns =
value of *history*. There is always at least one column.

Each element of the global vector or array can be either "intensive" or
"extensive", depending on whether the values contributing to the element
are "intensive" or "extensive". If a compute or fix provides the value
being time averaged, then the compute or fix determines whether the value
is intensive or extensive; see the page for that compute or fix for
further info. Values produced by a variable are treated as intensive.

No parameter of this fix can be used with the *start/stop* keywords of
the :doc:`run <run>` command. This fix is not invoked during
:doc:`energy minimization <minimize>`.

Restrictions
""""""""""""

This fix is part of the EXTRA-FIX package. It is only enabled if
LAMMPS was built with that package. See the :doc:`Build package
<Build_package>` page for more info.

Related commands
""""""""""""""""

:doc:`fix ave/time <fix_ave_time>`

Default
"""""""

The option defaults are history = 1, start = 0.

----------

.. _Cramer1:

**(Cramer)** Cramer, Mathematical Methods of Statistics, Princeton University Press (1946).

.. _Joanes1:

**(Joanes)** Joanes, Gill, The Statistician, 47, 183--189 (1998).

@ -100,6 +100,56 @@ first is assigned to intra-molecular interactions (i.e. both atoms
have the same molecule ID), the second to inter-molecular interactions
(i.e. interacting atoms have different molecule IDs).

.. admonition:: When **NOT** to use a hybrid pair style
   :class: warning

   Using pair style *hybrid* can be very tempting if you need a
   **many-body potential** supporting a mix of elements for which you
   cannot find a potential file that covers *all* of them. Regardless
   of how this is set up, there will be *errors*. The major use case
   where the error is *small* is when the many-body sub-styles are used
   on different objects (for example a slab and a liquid, or a metal and a
   nano-machining work piece). In that case the *mixed* terms
   **should** be provided by a pair-wise additive potential (like
   Lennard-Jones or Morse) to avoid unexpected behavior and reduce
   errors. LAMMPS cannot easily check for this condition and thus will
   accept good and bad choices alike.

   Outside of this, we *strongly* recommend *against* using pair style
   hybrid with many-body potentials for the following reasons:

   1. When trying to combine EAM or MEAM potentials, there is a *large*
      error in the embedding term, since it is computed separately for
      each sub-style only.

   2. When trying to combine many-body potentials like Stillinger-Weber,
      Tersoff, AIREBO, Vashishta, or similar, you have to understand
      that the potential of a sub-style cannot be applied in a pair-wise
      fashion but will need to be applied to multiples of atoms
      (e.g. a Tersoff potential of elements A and B includes the
      interactions A-A, B-B, A-B, A-A-A, A-A-B, A-B-B, A-B-A, B-A-A,
      B-A-B, B-B-A, B-B-B; AIREBO also considers all quadruples of
      atom elements).

   3. When one of the sub-styles uses charge equilibration (= QEq; like
      in ReaxFF or COMB), you get inconsistent QEq behavior: either you
      apply QEq to *all* atoms, but then you are missing the QEq
      parameters for the non-QEq pair style (and it would be
      inconsistent to apply QEq to pair styles that are not
      parameterized for it), or else you have no charges or fixed
      charges interacting with the QEq charges, which also leads to
      inconsistent behavior between the two sub-styles. When attempting
      to use multiple ReaxFF instances to combine different potential
      files, you might be able to work around the QEq limitations, but
      point 2. still applies.

   We understand that it is frustrating to not be able to run simulations
   due to a lack of available potential files, but that does not justify
   combining potentials in a broken way via pair style hybrid. This is
   not what the hybrid pair styles are designed for.
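
As an input-level sketch of the acceptable use case described in the
admonition above (a many-body potential for one material, a pair-wise
additive potential for the other material and for the mixed terms); the
atom types, file name, and coefficients are hypothetical placeholders:

.. code-block:: LAMMPS

   # type 1 = metal (EAM), type 2 = substrate (LJ); mixed 1-2 terms use LJ
   pair_style hybrid eam lj/cut 10.0
   pair_coeff 1 1 eam Cu_u3.eam
   pair_coeff 2 2 lj/cut 0.2 3.0
   pair_coeff 1 2 lj/cut 0.15 2.8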

----------

Here are two examples of hybrid simulations. The *hybrid* style could
be used for a simulation of a metal droplet on a LJ surface. The metal
atoms interact with each other via an *eam* potential, the surface atoms

@ -374,12 +424,11 @@ selected sub-style.

----------

.. note::

   Even though the command name "pair_style" would suggest that these are
   pair-wise interactions, several of the potentials defined via the
   pair_style command in LAMMPS are really many-body potentials, such as
   Tersoff, AIREBO, MEAM, ReaxFF, etc. The way to think about using these
   potentials in a hybrid setting is as follows.

A subset of atom types is assigned to the many-body potential with a
single :doc:`pair_coeff <pair_coeff>` command, using "\* \*" to include