diff --git a/doc/src/balance.txt b/doc/src/balance.txt
index f375efe604..abdb1089da 100644
--- a/doc/src/balance.txt
+++ b/doc/src/balance.txt
@@ -10,7 +10,7 @@ balance command :h3

 [Syntax:]

-balance thresh style args ... keyword value ... :pre
+balance thresh style args ... keyword args ... :pre

 thresh = imbalance threshhold that must be exceeded to perform a re-balance :ulb,l
 one style/arg pair can be used (or multiple for {x},{y},{z}) :l
@@ -32,9 +32,23 @@ style = {x} or {y} or {z} or {shift} or {rcb} :l
     Niter = # of times to iterate within each dimension of dimstr sequence
     stopthresh = stop balancing when this imbalance threshhold is reached
   {rcb} args = none :pre
-zero or more keyword/value pairs may be appended :l
-keyword = {out} :l
-  {out} value = filename
+zero or more keyword/arg pairs may be appended :l
+keyword = {weight} or {out} :l
+  {weight} style args = use weighted particle counts for the balancing
+    {style} = {group} or {neigh} or {time} or {var} or {store}
+      {group} args = Ngroup group1 weight1 group2 weight2 ...
+        Ngroup = number of groups with assigned weights
+        group1, group2, ... = group IDs
+        weight1, weight2, ... = corresponding weight factors
+      {neigh} factor = compute weight based on number of neighbors
+        factor = scaling factor (> 0)
+      {time} factor = compute weight based on time spent computing
+        factor = scaling factor (> 0)
+      {var} name = take weight from atom-style variable
+        name = name of the atom-style variable
+      {store} name = store weight in custom atom property defined by "fix property/atom"_fix_property_atom.html command
+        name = atom property name (without d_ prefix)
+  {out} arg = filename
     filename = write each processor's sub-domain to a file :pre
 :ule

@@ -44,28 +58,41 @@ balance 0.9 x uniform y 0.4 0.5 0.6
 balance 1.2 shift xz 5 1.1
 balance 1.0 shift xz 5 1.1
 balance 1.1 rcb
+balance 1.0 shift x 10 1.1 weight group 2 fast 0.5 slow 2.0
+balance 1.0 shift x 10 1.1 weight time 0.8 weight neigh 0.5 weight store balance
 balance 1.0 shift x 20 1.0 out tmp.balance :pre

 [Description:]

 This command adjusts the size and shape of processor sub-domains
-within the simulation box, to attempt to balance the number of
-particles and thus the computational cost (load) evenly across
-processors. The load balancing is "static" in the sense that this
-command performs the balancing once, before or between simulations.
-The processor sub-domains will then remain static during the
-subsequent run. To perform "dynamic" balancing, see the "fix
+within the simulation box, to attempt to balance the number of atoms
+or particles and thus indirectly the computational cost (load) more
+evenly across processors. The load balancing is "static" in the sense
+that this command performs the balancing once, before or between
+simulations. The processor sub-domains will then remain static during
+the subsequent run. To perform "dynamic" balancing, see the "fix
 balance"_fix_balance.html command, which can adjust processor
 sub-domain sizes and shapes on-the-fly during a "run"_run.html.

-Load-balancing is typically only useful if the particles in the
-simulation box have a spatially-varying density distribution. E.g. a
-model of a vapor/liquid interface, or a solid with an irregular-shaped
-geometry containing void regions. In this case, the LAMMPS default of
+Load-balancing is typically most useful if the particles in the
+simulation box have a spatially-varying density distribution or when
+the computational cost varies significantly between different
+particles. E.g. a model of a vapor/liquid interface, or a solid with
+an irregular-shaped geometry containing void regions, or "hybrid pair
+style simulations"_pair_hybrid.html which combine pair styles with
+different computational cost. In these cases, the LAMMPS default of
 dividing the simulation box volume into a regular-spaced grid of 3d
-bricks, with one equal-volume sub-domain per procesor, may assign very
-different numbers of particles per processor. This can lead to poor
-performance when the simulation is run in parallel.
+bricks, with one equal-volume sub-domain per processor, may assign
+numbers of particles per processor in a way that the computational
+effort varies significantly. This can lead to poor performance when
+the simulation is run in parallel.
+
+The balancing can be performed with or without per-particle weighting.
+With no weighting, the balancing attempts to assign an equal number of
+particles to each processor. With weighting, the balancing attempts
+to assign an equal aggregate weight to each processor, which typically
+means a different number of particles per processor. Details on the
+various weighting options are "given below"_#weighted_balance.

 Note that the "processors"_processors.html command allows some control
 over how the box volume is split across processors. Specifically, for
@@ -78,9 +105,9 @@ sub-domains will still have the same shape and same volume.

 The requested load-balancing operation is only performed if the
 current "imbalance factor" in particles owned by each processor
 exceeds the specified {thresh} parameter. The imbalance factor is
-defined as the maximum number of particles owned by any processor,
-divided by the average number of particles per processor. Thus an
-imbalance factor of 1.0 is perfect balance.
+defined as the maximum number of particles (or weight) owned by any
+processor, divided by the average number of particles (or weight) per
+processor. Thus an imbalance factor of 1.0 is perfect balance.

 As an example, for 10000 particles running on 10 processors, if the
 most heavily loaded processor has 1200 particles, then the factor is
@@ -108,7 +135,7 @@ defined above. But depending on the method a perfect balance (1.0)
 may not be achieved. For example, "grid" methods (defined below) that
 create a logical 3d grid cannot achieve perfect balance for many
 irregular distributions of particles. Likewise, if a portion of the
-system is a perfect lattice, e.g. the intiial system is generated by
+system is a perfect lattice, e.g. the initial system is generated by
 the "create_atoms"_create_atoms.html command, then "grid" methods may
 be unable to achieve exact balance. This is because entire lattice
 planes will be owned or not owned by a single processor.
@@ -134,11 +161,11 @@ The {x}, {y}, {z}, and {shift} styles are "grid" methods which produce
 a logical 3d grid of processors. They operate by changing the cutting
 planes (or lines) between processors in 3d (or 2d), to adjust the
 volume (area in 2d) assigned to each processor, as in the following 2d
-diagram where processor sub-domains are shown and atoms are colored by
-the processor that owns them. The leftmost diagram is the default
-partitioning of the simulation box across processors (one sub-box for
-each of 16 processors); the middle diagram is after a "grid" method
-has been applied.
+diagram where processor sub-domains are shown and particles are
+colored by the processor that owns them. The leftmost diagram is the
+default partitioning of the simulation box across processors (one
+sub-box for each of 16 processors); the middle diagram is after a
+"grid" method has been applied.

 :image(JPG/balance_uniform_small.jpg,JPG/balance_uniform.jpg),image(JPG/balance_nonuniform_small.jpg,JPG/balance_nonuniform.jpg),image(JPG/balance_rcb_small.jpg,JPG/balance_rcb.jpg) :c

@@ -146,8 +173,8 @@ has been applied.

 The {rcb} style is a "tiling" method which does not produce a logical
 3d grid of processors. Rather it tiles the simulation domain with
 rectangular sub-boxes of varying size and shape in an irregular
-fashion so as to have equal numbers of particles in each sub-box, as
-in the rightmost diagram above.
+fashion so as to have equal numbers of particles (or weight) in each
+sub-box, as in the rightmost diagram above.

 The "grid" methods can be used with either of the
 "comm_style"_comm_style.html command options, {brick} or {tiled}. The
@@ -230,7 +257,7 @@ counts do not match the target value for the plane, the position of
 the cut is adjusted to be halfway between a low and high bound. The
 low and high bounds are adjusted on each iteration, using new count
 information, so that they become closer together over time. Thus as
-the recustion progresses, the count of particles on either side of the
+the recursion progresses, the count of particles on either side of the
 plane gets closer to the target value.

 Once the rebalancing is complete and final processor sub-domains
@@ -262,21 +289,129 @@ the longest dimension, leaving one new box on either side of the cut.
 All the processors are also partitioned into 2 groups, half assigned
 to the box on the lower side of the cut, and half to the box on the
 upper side. (If the processor count is odd, one side gets an extra
-processor.) The cut is positioned so that the number of atoms in the
-lower box is exactly the number that the processors assigned to that
-box should own for load balance to be perfect. This also makes load
-balance for the upper box perfect. The positioning is done
-iteratively, by a bisectioning method. Note that counting atoms on
-either side of the cut requires communication between all processors
-at each iteration.
+processor.) The cut is positioned so that the number of particles in
+the lower box is exactly the number that the processors assigned to
+that box should own for load balance to be perfect. This also makes
+load balance for the upper box perfect. The positioning is done
+iteratively, by a bisectioning method. Note that counting particles
+on either side of the cut requires communication between all
+processors at each iteration.

 That is the procedure for the first cut. Subsequent cuts are made
 recursively, in exactly the same manner. The subset of processors
 assigned to each box make a new cut in the longest dimension of that
-box, splitting the box, the subset of processsors, and the atoms in
-the box in two. The recursion continues until every processor is
-assigned a sub-box of the entire simulation domain, and owns the atoms
-in that sub-box.
+box, splitting the box, the subset of processors, and the particles
+in the box in two. The recursion continues until every processor is
+assigned a sub-box of the entire simulation domain, and owns the
+particles in that sub-box.
+
+:line
+
+This sub-section describes how to perform weighted load balancing
+using the {weight} keyword. :link(weighted_balance)
+
+By default, all particles have a weight of 1.0, which means each
+particle is assumed to require the same amount of computation during a
+timestep. There are, however, scenarios where this is not a good
+assumption. Measuring the computational cost for each particle
+accurately would be impractical and slow down the computation.
+Instead, the {weight} keyword implements several ways to influence the
+per-particle weights empirically, using properties that are readily
+available or the user's knowledge of the system. Note that the
+absolute values of the weights are not important; their ratio is what
+is used to assign particles to processors. A particle with a weight
+of 2.5 is assumed to require 5x more computational effort than a
+particle with a weight of 0.5.
+
+Below is a list of possible weight options with a short description of
+their usage and some example scenarios where they might be applicable.
+It is possible to apply multiple weight styles and the weightings they
+induce will be combined through multiplication. Most of the time,
+however, it is sufficient to use just one method.
+
+The {group} weight style assigns weight factors to specified
+"groups"_group.html of particles. The {group} style keyword is
+followed by the number of groups, then pairs of group IDs and the
+corresponding weight factor. If a particle belongs to none of the
+specified groups, its weight is not changed. If it belongs to
+multiple groups, its weight is the product of the weight factors.
+
+This weight style is useful in combination with pair style
+"hybrid"_pair_hybrid.html, e.g. when combining a more costly manybody
+potential with a fast pair-wise potential. It is also useful when
+using "run_style respa"_run_style.html where some portions of the
+system have many bonded interactions and others none. It assumes that
+the computational cost for each group remains constant over time.
+This is a purely empirical weighting, so a series of test runs to tune
+the assigned weight factors for optimal performance is recommended.
+
+The {neigh} weight style assigns a weight to each particle equal to
+its number of neighbors divided by the average number of neighbors
+for all particles. The {factor} setting is then applied as an overall
+scale factor to all the {neigh} weights, which allows tuning of the
+impact of this style. A {factor} smaller than 1.0 (e.g. 0.8) often
+results in the best performance, since the number of neighbors is
+likely to overestimate the ideal weight.
+
+This weight style is useful for systems where there are different
+cutoffs used for different pairs of interactions, or the density
+fluctuates, or a large number of particles are in the vicinity of a
+wall, or a combination of these effects. If a simulation uses
+multiple neighbor lists, this weight style will use the first suitable
+neighbor list it finds. It will not request or compute a new list. A
+warning will be issued if there is no suitable neighbor list available
+or if it is not current, e.g. if the balance command is used before
+any "run"_run.html or "minimize"_minimize.html command, in which case
+the neighbor list may not yet have been built. In this case no
+weights are computed. Inserting a "run 0 post no"_run.html command
+before issuing the {balance} command may be a workaround for this
+case, as it will cause the neighbor list to be built.
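+
+For example, a {neigh}-weighted static balancing could be requested as
+follows; the initial "run 0 post no"_run.html command forces the
+neighbor list to be built first, and the chosen {shift} settings and
+factor of 0.8 are only illustrative:
+
+run 0 post no
+balance 1.1 shift xy 10 1.05 weight neigh 0.8 :pre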
+
+The {time} weight style uses "timer data"_timer.html to estimate a
+weight for each particle. It uses the same information as is used for
+the "MPI task timing breakdown"_Section_start.html#start_8, namely,
+the timings for sections {Pair}, {Bond}, {Kspace}, and {Neigh}. The
+time spent in these sections of the timestep is measured for each MPI
+rank, summed up, then converted into a cost for each MPI rank relative
+to the average cost over all MPI ranks for the same sections. That
+cost is then evenly distributed over all the particles owned by that
+rank. Finally, the {factor} setting is applied as an overall scale
+factor to all the {time} weights as a way to fine-tune the impact of
+this weight style. Good {factor} values to use are typically between
+0.5 and 1.2.
+
+For the {balance} command the timing data is taken from the preceding
+run command, i.e. the timings are for the entire previous run. For
+the {fix balance} command the timing data is for only the timesteps
+since the last balancing operation was performed. If timing
+information for the required sections is not available, e.g. at the
+beginning of a run, or when the "timer"_timer.html command is set to
+either {loop} or {off}, a warning is issued. In this case no weights
+are computed.
+
+This weight style is the most generic one, and should be tried first
+if neither the {group} nor the {neigh} style is easily applicable.
+However, since the computed cost function is averaged over all local
+particles, this weight style may not be highly accurate. This style
+can also be effective as a secondary weight in combination with either
+{group} or {neigh} to offset some of the inaccuracies in either of
+those heuristics.
+
+The {var} weight style assigns per-particle weights by evaluating an
+"atom-style variable"_variable.html specified by {name}. This is
+provided as a more flexible alternative to the {group} weight style,
+allowing the definition of more complex heuristics based on
+information (global and per-atom) available inside LAMMPS. For
+example, atom-style variables can reference the position of a
+particle, its velocity, the volume of its Voronoi cell, etc.
+
+The {store} weight style does not compute a weight factor. Instead it
+stores the current accumulated weights in a custom per-atom property
+specified by {name}. This must be a property defined as {d_name} via
+the "fix property/atom"_fix_property_atom.html command. Note that
+these custom per-atom properties can be output in a "dump"_dump.html
+file, so this is a way to examine, debug, or visualize the
+per-particle weights computed during the load-balancing operation.
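+
+A minimal sketch of this workflow, combining the {time} and {store}
+styles, is shown below; the property name {balweight}, the fix ID, and
+the dump settings are only illustrative:
+
+fix 10 all property/atom d_balweight
+balance 1.0 shift x 10 1.1 weight time 0.8 weight store balweight
+dump 1 all custom 100 tmp.dump id x y z d_balweight :pre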

 :line

@@ -342,6 +477,7 @@ appear in {dimstr} for the {shift} style.

 [Related commands:]

-"processors"_processors.html, "fix balance"_fix_balance.html
+"group"_group.html, "processors"_processors.html,
+"fix balance"_fix_balance.html

 [Default:] none

diff --git a/doc/src/fix_balance.txt b/doc/src/fix_balance.txt
index c997b7c27e..947e420914 100644
--- a/doc/src/fix_balance.txt
+++ b/doc/src/fix_balance.txt
@@ -10,7 +10,7 @@ fix balance command :h3

 [Syntax:]

-fix ID group-ID balance Nfreq thresh style args keyword value ... :pre
+fix ID group-ID balance Nfreq thresh style args keyword args ... :pre

 ID, group-ID are documented in "fix"_fix.html command :ulb,l
 balance = style name of this fix command :l
@@ -21,10 +21,24 @@ style = {shift} or {rcb} :l
     dimstr = sequence of letters containing "x" or "y" or "z", each not more than once
     Niter = # of times to iterate within each dimension of dimstr sequence
     stopthresh = stop balancing when this imbalance threshhold is reached
-  rcb args = none :pre
-zero or more keyword/value pairs may be appended :l
-keyword = {out} :l
-  {out} value = filename
+  {rcb} args = none :pre
+zero or more keyword/arg pairs may be appended :l
+keyword = {weight} or {out} :l
+  {weight} style args = use weighted particle counts for the balancing
+    {style} = {group} or {neigh} or {time} or {var} or {store}
+      {group} args = Ngroup group1 weight1 group2 weight2 ...
+        Ngroup = number of groups with assigned weights
+        group1, group2, ... = group IDs
+        weight1, weight2, ... = corresponding weight factors
+      {neigh} factor = compute weight based on number of neighbors
+        factor = scaling factor (> 0)
+      {time} factor = compute weight based on time spent computing
+        factor = scaling factor (> 0)
+      {var} name = take weight from atom-style variable
+        name = name of the atom-style variable
+      {store} name = store weight in custom atom property defined by "fix property/atom"_fix_property_atom.html command
+        name = atom property name (without d_ prefix)
+  {out} arg = filename
     filename = write each processor's sub-domain to a file, at each re-balancing :pre
 :ule

@@ -32,6 +46,9 @@ keyword = {out} :l

 fix 2 all balance 1000 1.05 shift x 10 1.05
 fix 2 all balance 100 0.9 shift xy 20 1.1 out tmp.balance
+fix 2 all balance 100 0.9 shift xy 20 1.1 weight group 3 substrate 3.0 solvent 1.0 solute 0.8 out tmp.balance
+fix 2 all balance 100 1.0 shift x 10 1.1 weight time 0.8
+fix 2 all balance 100 1.0 shift xy 5 1.1 weight var myweight weight neigh 0.6 weight store allweight
 fix 2 all balance 1000 1.1 rcb :pre

 [Description:]

@@ -44,14 +61,31 @@ rebalancing is performed periodically during the simulation. To
 perform "static" balancing, before or between runs, see the
 "balance"_balance.html command.

-Load-balancing is typically only useful if the particles in the
-simulation box have a spatially-varying density distribution. E.g. a
-model of a vapor/liquid interface, or a solid with an irregular-shaped
-geometry containing void regions. In this case, the LAMMPS default of
-dividing the simulation box volume into a regular-spaced grid of 3d
-bricks, with one equal-volume sub-domain per processor, may assign
-very different numbers of particles per processor. This can lead to
-poor performance when the simulation is run in parallel.
+Load-balancing is typically most useful if the particles in the
+simulation box have a spatially-varying density distribution or
+where the computational cost varies significantly between different
+atoms. E.g. a model of a vapor/liquid interface, or a solid with
+an irregular-shaped geometry containing void regions, or
+"hybrid pair style simulations"_pair_hybrid.html which combine
+pair styles with different computational cost. In these cases, the
+LAMMPS default of dividing the simulation box volume into a
+regular-spaced grid of 3d bricks, with one equal-volume sub-domain
+per processor, may assign numbers of particles per processor in a
+way that the computational effort varies significantly. This can
+lead to poor performance when the simulation is run in parallel.
+
+The balancing can be performed with or without per-particle weighting.
+
+With no weighting, the balancing attempts to assign an equal number of
+particles to each processor. With weighting, the balancing attempts
+to assign an equal weight to each processor, which typically means a
+different number of atoms per processor.
+
+NOTE: The weighting options listed above are documented with the
+"balance"_balance.html command in "this section of the balance
+command"_balance.html#weighted_balance doc page. The section
+describes the various weighting options and gives a few examples of
+how they can be used. The weighting options are the same for both the
+fix balance and "balance"_balance.html commands.

 Note that the "processors"_processors.html command allows some control
 over how the box volume is split across processors. Specifically, for
@@ -64,9 +98,9 @@ sub-domains will still have the same shape and same volume.

 On a particular timestep, a load-balancing operation is only performed
 if the current "imbalance factor" in particles owned by each processor
 exceeds the specified {thresh} parameter. The imbalance factor is
-defined as the maximum number of particles owned by any processor,
-divided by the average number of particles per processor. Thus an
-imbalance factor of 1.0 is perfect balance.
+defined as the maximum number of particles (or weight) owned by any
+processor, divided by the average number of particles (or weight) per
+processor. Thus an imbalance factor of 1.0 is perfect balance.

 As an example, for 10000 particles running on 10 processors, if the
 most heavily loaded processor has 1200 particles, then the factor is
@@ -117,8 +151,8 @@ applied.

 The {rcb} style is a "tiling" method which does not produce a logical
 3d grid of processors. Rather it tiles the simulation domain with
 rectangular sub-boxes of varying size and shape in an irregular
-fashion so as to have equal numbers of particles in each sub-box, as
-in the rightmost diagram above.
+fashion so as to have equal numbers of particles (or weight) in each
+sub-box, as in the rightmost diagram above.

 The "grid" methods can be used with either of the
 "comm_style"_comm_style.html command options, {brick} or {tiled}. The
@@ -139,12 +173,9 @@ from scratch.

 :line

-The {group-ID} is currently ignored. In the future it may be used to
-determine what particles are considered for balancing. Normally it
-would only makes sense to use the {all} group. But in some cases it
-may be useful to balance on a subset of the particles, e.g. when
-modeling large nanoparticles in a background of small solvent
-particles.
+The {group-ID} is ignored. However, the impact of balancing on
+different groups of atoms can be affected by using the {group} weight
+style as described above.

 The {Nfreq} setting determines how often a rebalance is performed. If
 {Nfreq} > 0, then rebalancing will occur every {Nfreq} steps. Each
@@ -225,7 +256,7 @@ than {Niter} and exit early.

 The {rcb} style invokes a "tiled" method for balancing, as described
 above. It performs a recursive coordinate bisectioning (RCB) of the
-simulation domain. The basic idea is as follows.
+simulation domain. The basic idea is as follows.

 The simulation domain is cut into 2 boxes by an axis-aligned cut in
 the longest dimension, leaving one new box on either side of the cut.
@@ -250,10 +281,10 @@ in that sub-box.

 :line

-The {out} keyword writes a text file to the specified {filename} with
-the results of each rebalancing operation. The file contains the
-bounds of the sub-domain for each processor after the balancing
-operation completes. The format of the file is compatible with the
+The {out} keyword writes text to the specified {filename} with the
+results of each rebalancing operation. The file contains the bounds
+of the sub-domain for each processor after the balancing operation
+completes. The format of the file is compatible with the
 "Pizza.py"_pizza {mdump} tool which has support for manipulating and
 visualizing mesh files. An example is shown here for a balancing by 4
 processors for a 2d problem:
@@ -321,8 +352,8 @@ values in the vector are as follows:

 3 = imbalance factor right before the last rebalance was performed :ul

 As explained above, the imbalance factor is the ratio of the maximum
-number of particles on any processor to the average number of
-particles per processor.
+number of particles (or total weight) on any processor to the average
+number of particles (or total weight) per processor.

 These quantities can be accessed by various "output
 commands"_Section_howto.html#howto_15. The scalar and vector values
@@ -336,11 +367,11 @@ minimization"_minimize.html.

 [Restrictions:]

-For 2d simulations, a "z" cannot appear in {dimstr} for the {shift}
-style.
+For 2d simulations, the {z} style cannot be used. Nor can a "z"
+appear in {dimstr} for the {shift} style.

 [Related commands:]

-"processors"_processors.html, "balance"_balance.html
+"group"_group.html, "processors"_processors.html, "balance"_balance.html

 [Default:] none

diff --git a/doc/src/neb.txt b/doc/src/neb.txt
index 0d5838b78a..f7cae7919a 100644
--- a/doc/src/neb.txt
+++ b/doc/src/neb.txt
@@ -48,14 +48,14 @@ follows the discussion in these 3 papers: "(HenkelmanA)"_#HenkelmanA,

 Each replica runs on a partition of one or more processors. Processor
 partitions are defined at run-time using the -partition command-line
-switch; see "Section 2.7"_Section_start.html#start_7 of the
-manual. Note that if you have MPI installed, you can run a
-multi-replica simulation with more replicas (partitions) than you have
-physical processors, e.g you can run a 10-replica simulation on just
-one or two processors. You will simply not get the performance
-speed-up you would see with one or more physical processors per
-replica. See "this section"_Section_howto.html#howto_5 of the manual
-for further discussion.
+switch; see "Section 2.7"_Section_start.html#start_7 of the manual.
+Note that if you have MPI installed, you can run a multi-replica
+simulation with more replicas (partitions) than you have physical
+processors, e.g. you can run a 10-replica simulation on just one or
+two processors. You will simply not get the performance speed-up you
+would see with one or more physical processors per replica. See
+"Section 6.5"_Section_howto.html#howto_5 of the manual for further
+discussion.

 NOTE: The current NEB implementation in LAMMPS only allows there to
 be one processor per replica.

diff --git a/doc/src/prd.txt b/doc/src/prd.txt
index a7c148cd09..d3a3a4562a 100644
--- a/doc/src/prd.txt
+++ b/doc/src/prd.txt
@@ -63,14 +63,14 @@ event to occur.

 Each replica runs on a partition of one or more processors. Processor
 partitions are defined at run-time using the -partition command-line
-switch; see "Section 2.7"_Section_start.html#start_7 of the
-manual. Note that if you have MPI installed, you can run a
-multi-replica simulation with more replicas (partitions) than you have
-physical processors, e.g you can run a 10-replica simulation on one or
-two processors. For PRD, this makes little sense, since this offers
-no effective parallel speed-up in searching for infrequent events. See
-"Section 6.5"_Section_howto.html#howto_5 of the manual for further
-discussion.
+switch; see "Section 2.7"_Section_start.html#start_7 of the manual.
+Note that if you have MPI installed, you can run a multi-replica
+simulation with more replicas (partitions) than you have physical
+processors, e.g. you can run a 10-replica simulation on one or two
+processors. However, for PRD, this makes little sense, since running a
+replica on virtual instead of physical processors offers no effective
+parallel speed-up in searching for infrequent events. See "Section
+6.5"_Section_howto.html#howto_5 of the manual for further discussion.

 When a PRD simulation is performed, it is assumed that each replica
 is running the same model, though LAMMPS does not check for this.
@@ -163,7 +163,7 @@ runs for {N} timesteps.

 If the {time} value is {clock}, then the simulation runs until {N}
 aggregate timesteps across all replicas have elapsed. This aggregate
 time is the "clock" time defined below, which typically advances
 nearly M times faster than the timestepping on a
-single replica.
+single replica, where M is the number of replicas.

 :line

@@ -183,25 +183,26 @@ coincident events, and the replica number of the chosen event.

 The timestep is the usual LAMMPS timestep, except that time does not
 advance during dephasing or quenches, but only during dynamics. Note
-that are two kinds of dynamics in the PRD loop listed above. The
-first is when all replicas are performing independent dynamics,
-waiting for an event to occur. The second is when correlated events
-are being searched for and only one replica is running dynamics.
+that there are two kinds of dynamics in the PRD loop listed above that
+contribute to this timestepping. The first is when all replicas are
+performing independent dynamics, waiting for an event to occur. The
+second is when correlated events are being searched for, but only one
+replica is running dynamics.

-The CPU time is the total processor time since the start of the PRD
-run.
+The CPU time is the total elapsed time on each processor since the
+start of the PRD run.

 The clock is the same as the timestep except that it advances by M
-steps every timestep during the first kind of dynamics when the M
+steps per timestep during the first kind of dynamics when the M
 replicas are running independently. The clock advances by only 1 step
-per timestep during the second kind of dynamics, since only a single
+per timestep during the second kind of dynamics, when only a single
 replica is checking for a correlated event. Thus "clock" time
-represents the aggregate time (in steps) that effectively elapses
+represents the aggregate time (in steps) that has effectively elapsed
 during a PRD simulation on M replicas. If most of the PRD run is
 spent in the second stage of the loop above, searching for infrequent
 events, then the clock will advance nearly M times faster than it
 would if a single replica was running. Note the clock time between
-events will be drawn from p(t).
+successive events should be drawn from p(t).

 The event number is a counter that increments with each event, whether
 it is uncorrelated or correlated.
@@ -212,14 +213,15 @@ replicas are running independently.

 The correlation flag will be 1 when a correlated event occurs during
 the third stage of the loop listed above, i.e. when only one replica
 is running dynamics.

-When more than one replica detects an event at the end of the second
-stage, then one of them is chosen at random. The number of coincident
-events is the number of replicas that detected an event. Normally, we
-expect this value to be 1. If it is often greater than 1, then either
-the number of replicas is too large, or {t_event} is too large.
+When more than one replica detects an event at the end of the same
+event check (every {t_event} steps) during the second stage, then
+one of them is chosen at random. The number of coincident events is
+the number of replicas that detected an event. Normally, this value
+should be 1. If it is often greater than 1, then either the number of
+replicas is too large, or {t_event} is too large.

-The replica number is the ID of the replica (from 0 to M-1) that
-found the event.
+The replica number is the ID of the replica (from 0 to M-1) in which
+the event occurred.

 :line

@@ -286,7 +288,7 @@ This command can only be used if LAMMPS was built with the REPLICA
 package. See the "Making LAMMPS"_Section_start.html#start_3 section
 for more info on packages.

-{N} and {t_correlate} settings must be integer multiples of
+The {N} and {t_correlate} settings must be integer multiples of
 {t_event}.

 Runs restarted from restart file written during a PRD run will not