diff --git a/doc/balance.rst b/doc/balance.rst
new file mode 100644
index 0000000000..7af8a86e9a
--- /dev/null
+++ b/doc/balance.rst
@@ -0,0 +1,566 @@
+.. index:: balance
+
+balance command
+===============
+
+Syntax
+""""""
+
+.. parsed-literal::
+
+   balance thresh style args ... keyword args ...
+
+* thresh = imbalance threshold that must be exceeded to perform a re-balance
+* one style/arg pair can be used (or multiple for *x*\ ,\ *y*\ ,\ *z*\ )
+* style = *x* or *y* or *z* or *shift* or *rcb*
+
+  .. parsed-literal::
+
+       *x* args = *uniform* or Px-1 numbers between 0 and 1
+         *uniform* = evenly spaced cuts between processors in x dimension
+         numbers = Px-1 ascending values between 0 and 1, Px = # of processors in x dimension
+         *x* can be specified together with *y* or *z*
+       *y* args = *uniform* or Py-1 numbers between 0 and 1
+         *uniform* = evenly spaced cuts between processors in y dimension
+         numbers = Py-1 ascending values between 0 and 1, Py = # of processors in y dimension
+         *y* can be specified together with *x* or *z*
+       *z* args = *uniform* or Pz-1 numbers between 0 and 1
+         *uniform* = evenly spaced cuts between processors in z dimension
+         numbers = Pz-1 ascending values between 0 and 1, Pz = # of processors in z dimension
+         *z* can be specified together with *x* or *y*
+       *shift* args = dimstr Niter stopthresh
+         dimstr = sequence of letters containing "x" or "y" or "z", each appearing at most once
+         Niter = # of times to iterate within each dimension of dimstr sequence
+         stopthresh = stop balancing when this imbalance threshold is reached
+       *rcb* args = none
+
+* zero or more keyword/arg pairs may be appended
+* keyword = *weight* or *out*
+
+  .. parsed-literal::
+
+       *weight* style args = use weighted particle counts for the balancing
+         *style* = *group* or *neigh* or *time* or *var* or *store*
+           *group* args = Ngroup group1 weight1 group2 weight2 ...
+             Ngroup = number of groups with assigned weights
+             group1, group2, ... = group IDs
+             weight1, weight2, ... = corresponding weight factors
+           *neigh* factor = compute weight based on number of neighbors
+             factor = scaling factor (> 0)
+           *time* factor = compute weight based on time spent computing
+             factor = scaling factor (> 0)
+           *var* name = take weight from atom-style variable
+             name = name of the atom-style variable
+           *store* name = store weight in custom atom property defined by :doc:`fix property/atom <fix_property_atom>` command
+             name = atom property name (without d\_ prefix)
+       *out* arg = filename
+         filename = write each processor's sub-domain to a file
+
+Examples
+""""""""
+
+.. code-block:: LAMMPS
+
+   balance 0.9 x uniform y 0.4 0.5 0.6
+   balance 1.2 shift xz 5 1.1
+   balance 1.0 shift xz 5 1.1
+   balance 1.1 rcb
+   balance 1.0 shift x 10 1.1 weight group 2 fast 0.5 slow 2.0
+   balance 1.0 shift x 10 1.1 weight time 0.8 weight neigh 0.5 weight store balance
+   balance 1.0 shift x 20 1.0 out tmp.balance
+
+Description
+"""""""""""
+
+This command adjusts the size and shape of processor sub-domains
+within the simulation box, to attempt to balance the number of atoms
+or particles and thus indirectly the computational cost (load) more
+evenly across processors.  The load balancing is "static" in the sense
+that this command performs the balancing once, before or between
+simulations.  The processor sub-domains will then remain static during
+the subsequent run.  To perform "dynamic" balancing, see the
+:doc:`fix balance <fix_balance>` command, which can adjust processor
+sub-domain sizes and shapes on-the-fly during a :doc:`run <run>`.
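+
+As a minimal, hypothetical input sketch (the fix ID *lb* and all
+numeric arguments are illustrative only), the two approaches look
+like this:
+
+.. code-block:: LAMMPS
+
+   # static: re-balance once, before the run starts
+   balance 1.1 shift xy 10 1.05
+   run 10000
+
+   # dynamic: re-check and re-balance every 1000 steps during the run
+   fix lb all balance 1000 1.05 shift xy 10 1.05
+   run 10000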
+
+Load-balancing is typically most useful if the particles in the
+simulation box have a spatially-varying density distribution or when
+the computational cost varies significantly between different
+particles.  Examples are a model of a vapor/liquid interface, a solid
+with an irregular-shaped geometry containing void regions, or
+:doc:`hybrid pair style simulations <pair_hybrid>` which combine pair
+styles with different computational costs.  In these cases, the LAMMPS
+default of dividing the simulation box volume into a regular-spaced
+grid of 3d bricks, with one equal-volume sub-domain per processor, may
+assign very different numbers of particles to the processors, so that
+the computational effort varies significantly between them.  This can
+lead to poor performance when the simulation is run in parallel.
+
+The balancing can be performed with or without per-particle weighting.
+With no weighting, the balancing attempts to assign an equal number of
+particles to each processor.  With weighting, the balancing attempts
+to assign an equal aggregate computational weight to each processor,
+which typically results in different numbers of atoms being assigned
+to each processor.  Details on the various weighting options and
+examples for how they can be used are
+:ref:`given below <weighted_balance>`.
+
+Note that the :doc:`processors <processors>` command allows some
+control over how the box volume is split across processors.
+Specifically, for a Px by Py by Pz grid of processors, it allows
+choice of Px, Py, and Pz, subject to the constraint that Px \* Py \*
+Pz = P, the total number of processors.  This is sufficient to achieve
+good load-balance for some problems on some processor counts.
+However, all the processor sub-domains will still have the same shape
+and same volume.
+
+The requested load-balancing operation is only performed if the
+current "imbalance factor" in particles owned by each processor
+exceeds the specified *thresh* parameter.  The imbalance factor is
+defined as the maximum number of particles (or weight) owned by any
+processor, divided by the average number of particles (or weight) per
+processor.  Thus an imbalance factor of 1.0 is perfect balance.
+
+As an example, for 10000 particles running on 10 processors, if the
+most heavily loaded processor has 1200 particles, then the factor is
+1.2, meaning there is a 20% imbalance.  Note that a re-balance can be
+forced even if the current balance is perfect (1.0) by specifying a
+*thresh* < 1.0.
+
+.. note::
+
+   If a "grid" style is specified while the current partitioning is
+   "tiled", balancing is performed even if the imbalance factor does
+   not exceed the *thresh* parameter.  The meaning of "grid" vs
+   "tiled" is explained below.  This is to allow forcing of the
+   partitioning to "grid" so that the
+   :doc:`comm_style brick <comm_style>` command can then be used to
+   replace a current :doc:`comm_style tiled <comm_style>` setting.
+
+When the balance command completes, it prints statistics about the
+result, including the change in the imbalance factor and the change in
+the maximum number of particles on any processor.  For "grid" methods
+(defined below) that create a logical 3d grid of processors, the
+positions of all cutting planes in each of the 3 dimensions (as
+fractions of the box length) are also printed.
+
+.. note::
+
+   This command attempts to minimize the imbalance factor, as defined
+   above.  But depending on the method, a perfect balance (1.0) may
+   not be achieved.
+   For example, "grid" methods (defined below) that create a logical
+   3d grid cannot achieve perfect balance for many irregular
+   distributions of particles.  Likewise, if a portion of the system
+   is a perfect lattice, e.g. the initial system is generated by the
+   :doc:`create_atoms <create_atoms>` command, then "grid" methods may
+   be unable to achieve exact balance.  This is because entire lattice
+   planes will be owned or not owned by a single processor.
+
+.. note::
+
+   The imbalance factor is also an estimate of the maximum speed-up
+   you can hope to achieve by running a perfectly balanced simulation
+   versus an imbalanced one.  In the example above, the 10000 particle
+   simulation could run up to 20% faster if it were perfectly
+   balanced, versus when imbalanced.  However, computational cost is
+   not strictly proportional to particle count, and changing the
+   relative size and shape of processor sub-domains may lead to
+   additional computational and communication overheads, e.g. in the
+   PPPM solver used via the :doc:`kspace_style <kspace_style>`
+   command.  Thus you should benchmark the run times of a simulation
+   before and after balancing.
+
+----------
+
+The method used to perform a load balance is specified by one of the
+listed styles (or more in the case of *x*\ ,\ *y*\ ,\ *z*\ ), which are
+described in detail below.  There are two kinds of styles.
+
+The *x*\ , *y*\ , *z*\ , and *shift* styles are "grid" methods which
+produce a logical 3d grid of processors.  They operate by changing the
+cutting planes (or lines) between processors in 3d (or 2d), to adjust
+the volume (area in 2d) assigned to each processor, as in the
+following 2d diagram where processor sub-domains are shown and
+particles are colored by the processor that owns them.
+
+.. |balance1| image:: img/balance_uniform.jpg
+   :width: 32%
+
+.. |balance2| image:: img/balance_nonuniform.jpg
+   :width: 32%
+
+.. |balance3| image:: img/balance_rcb.jpg
+   :width: 32%
+
+|balance1| |balance2| |balance3|
+
+The leftmost diagram is the default partitioning of the simulation box
+across processors (one sub-box for each of 16 processors); the middle
+diagram is after a "grid" method has been applied.  The *rcb* style is
+a "tiling" method which does not produce a logical 3d grid of
+processors.  Rather it tiles the simulation domain with rectangular
+sub-boxes of varying size and shape in an irregular fashion so as to
+have equal numbers of particles (or weight) in each sub-box, as in the
+rightmost diagram above.
+
+The "grid" methods can be used with either of the
+:doc:`comm_style <comm_style>` command options, *brick* or *tiled*\ .
+The "tiling" methods can only be used with
+:doc:`comm_style tiled <comm_style>`.  Note that it can be useful to
+use a "grid" method with :doc:`comm_style tiled <comm_style>` to
+return the domain partitioning to a logical 3d grid of processors so
+that "comm_style brick" can afterwards be specified for subsequent
+:doc:`run <run>` commands, as shown in the sketch below.
+
+When a "grid" method is specified, the current domain partitioning can
+be either a logical 3d grid or a tiled partitioning.  In the former
+case, the current logical 3d grid is used as a starting point and
+changes are made to improve the imbalance factor.  In the latter case,
+the tiled partitioning is discarded and a logical 3d grid is created
+with uniform spacing in all dimensions.  This becomes the starting
+point for the balancing operation.
+
+When a "tiling" method is specified, the current domain partitioning
+("grid" or "tiled") is ignored, and a new partitioning is computed
+from scratch.
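+
+For example, the following hedged input sketch (shown out of any
+larger context) switches to a tiled partitioning for the *rcb* style
+and later restores a logical 3d grid so that brick communication can
+be used again:
+
+.. code-block:: LAMMPS
+
+   comm_style tiled
+   balance 1.1 rcb               # "tiling" method: irregular sub-boxes
+
+   # later: re-create a logical 3d grid of processors ...
+   balance 1.0 shift xyz 10 1.0
+   # ... so that brick communication can be restored
+   comm_style brick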
+
+----------
+
+The *x*\ , *y*\ , and *z* styles invoke a "grid" method for balancing,
+as described above.  Note that any or all of these 3 styles can be
+specified together, one after the other, but they cannot be used with
+any other style.  These styles adjust the positions of the cutting
+planes between processor sub-domains in specific dimensions.  Only the
+specified dimensions are altered.
+
+The *uniform* argument spaces the planes evenly, as in the left
+diagrams above.  The *numeric* argument requires listing Ps-1 numbers
+that specify the position of the cutting planes.  This requires
+knowing Ps = Px or Py or Pz = the number of processors assigned by
+LAMMPS to the relevant dimension.  This assignment is made (and the
+Px, Py, Pz values printed out) when the simulation box is created by
+the "create_box" or "read_data" or "read_restart" command and is
+influenced by the settings of the :doc:`processors <processors>`
+command.
+
+Each of the numeric values must be between 0 and 1, and they must be
+listed in ascending order.  They represent the fractional position of
+the cutting plane.  The left (or lower) edge of the box is 0.0, and
+the right (or upper) edge is 1.0.  Neither of these two values is
+specified.  Only the interior Ps-1 positions are specified.  Thus if
+there are 2 processors in the x dimension, you specify a single value
+such as 0.75, which would make the left processor's sub-domain 3x
+larger than the right processor's sub-domain.
+
+----------
+
+The *shift* style invokes a "grid" method for balancing, as
+described above.  It changes the positions of cutting planes between
+processors in an iterative fashion, seeking to reduce the imbalance
+factor, similar to how the :doc:`fix balance shift <fix_balance>`
+command operates.
+
+The *dimstr* argument is a string of characters, each of which must be
+an "x" or "y" or "z".  Each character can appear zero or one time,
+since there is no advantage to balancing on a dimension more than
+once.  You should normally only list dimensions where you expect there
+to be a density variation in the particles.
+
+Balancing proceeds by adjusting the cutting planes in each of the
+dimensions listed in *dimstr*\ , one dimension at a time.  For a single
+dimension, the balancing operation (described below) is iterated up to
+*Niter* times.  After each dimension finishes, the imbalance factor
+is re-computed, and the balancing operation halts if the *stopthresh*
+criterion is met.
+
+A re-balance operation in a single dimension is performed using a
+recursive multisectioning algorithm, where the position of each
+cutting plane (line in 2d) in the dimension is adjusted independently.
+This is similar to a recursive bisectioning for a single value, except
+that the bounds used for each bisectioning take advantage of
+information from neighboring cuts if possible.  At each iteration, the
+count of particles on either side of each plane is tallied.  If the
+counts do not match the target value for the plane, the position of
+the cut is adjusted to be halfway between a low and high bound.  The
+low and high bounds are adjusted on each iteration, using new count
+information, so that they become closer together over time.  Thus as
+the recursion progresses, the count of particles on either side of the
+plane gets closer to the target value.
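+
+As a concrete, hedged illustration of the *shift* arguments discussed
+above (the numeric values are arbitrary):
+
+.. code-block:: LAMMPS
+
+   # dimstr = xz: adjust cuts in x first, then in z
+   # Niter = 20: up to 20 multisection iterations per dimension
+   # stopthresh = 1.02: halt once the imbalance factor drops below 1.02
+   balance 1.0 shift xz 20 1.02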
+
+After the balanced plane positions are determined, if any pair of
+adjacent planes is closer together than the neighbor skin distance
+(as specified by the :doc:`neigh_modify <neigh_modify>` command), then
+the plane positions are shifted to separate them by at least this
+amount.  This is to prevent particles being lost when dynamics are run
+with processor sub-domains that are too narrow in one or more
+dimensions.
+
+Once the re-balancing is complete and the final processor sub-domains
+are assigned, particles are migrated to their new owning processor,
+and the balance procedure ends.
+
+.. note::
+
+   At each re-balance operation, the bisectioning for each cutting
+   plane (line in 2d) typically starts with low and high bounds
+   separated by the extent of a processor's sub-domain in one
+   dimension.  The size of this bracketing region shrinks by 1/2 every
+   iteration.  Thus if *Niter* is specified as 10, the cutting plane
+   will typically be positioned to 1 part in 1000 accuracy (relative
+   to the perfect target position).  For *Niter* = 20, it will be
+   accurate to 1 part in a million.  Thus there is no need to set
+   *Niter* to a large value.  LAMMPS checks whether the threshold
+   accuracy is reached (in a dimension) in fewer iterations than
+   *Niter* and exits the iteration early if so.  However, *Niter*
+   should also not be set too small, since it will take roughly the
+   same number of iterations to converge even if the cutting plane is
+   initially close to the target value.
+
+----------
+
+The *rcb* style invokes a "tiled" method for balancing, as described
+above.  It performs a recursive coordinate bisectioning (RCB) of the
+simulation domain.  The basic idea is as follows.
+
+The simulation domain is cut into 2 boxes by an axis-aligned cut in
+one of the dimensions, leaving one new sub-box on either side of the
+cut.  Which dimension is chosen for the cut depends on the particle
+(weight) distribution within the parent box.  Normally the longest
+dimension of the box is cut, but if all (or most) of the particles are
+at one end of the box, a cut may be performed in another dimension to
+induce sub-boxes that are more cube-ish (3d) or square-ish (2d) in
+shape.
+
+After the cut is made, all the processors are also partitioned into 2
+groups, half assigned to the box on the lower side of the cut, and
+half to the box on the upper side.  (If the processor count is odd,
+one side gets an extra processor.)  The cut is positioned so that the
+number of (weighted) particles in the lower box is exactly the number
+that the processors assigned to that box should own for load balance
+to be perfect.  This also makes load balance for the upper box
+perfect.  The positioning of the cut is done iteratively, by a
+bisectioning method (median search).  Note that counting particles on
+either side of the cut requires communication between all processors
+at each iteration.
+
+That is the procedure for the first cut.  Subsequent cuts are made
+recursively, in exactly the same manner.  The subset of processors
+assigned to each box makes a new cut in one dimension of that box,
+splitting the box, the subset of processors, and the particles in the
+box in two.  The recursion continues until every processor is assigned
+a sub-box of the entire simulation domain and owns the (weighted)
+particles in that sub-box.
+
+----------
+
+.. _weighted_balance:
+
+This sub-section describes how to perform weighted load balancing
+using the *weight* keyword.
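+
+As a hedged example of what is described below, the *group* weight
+style can down-weight cheap particles and up-weight expensive ones.
+The group names and the type assignments here are purely illustrative;
+the weight factors match the example in the Examples section above:
+
+.. code-block:: LAMMPS
+
+   group fast type 1      # particles with a cheap pair-wise potential
+   group slow type 2      # particles with a costly many-body potential
+   balance 1.0 shift x 10 1.1 weight group 2 fast 0.5 slow 2.0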
+
+By default, all particles have a weight of 1.0, which means each
+particle is assumed to require the same amount of computation during a
+timestep.  There are, however, scenarios where this is not a good
+assumption.  Measuring the computational cost for each particle
+accurately would be impractical and slow down the computation.
+Instead, the *weight* keyword implements several ways to set the
+per-particle weights empirically, based on readily available
+properties or on the user's knowledge of the system.  Note that the
+absolute values of the weights are not important; only their relative
+ratios affect which particle is assigned to which processor.  A
+particle with a weight of 2.5 is assumed to require 5x more
+computation than a particle with a weight of 0.5.  For all the options
+below, the weight assigned to a particle must be a positive value; an
+error will be generated if a weight is <= 0.0.
+
+Below is a list of possible weight options with a short description of
+their usage and some example scenarios where they might be applicable.
+It is possible to apply multiple weight flags; the weightings they
+induce will be combined through multiplication.  Most of the time,
+however, it is sufficient to use just one method.
+
+The *group* weight style assigns weight factors to specified
+:doc:`groups <group>` of particles.  The *group* style keyword is
+followed by the number of groups, then pairs of group IDs and the
+corresponding weight factors.  If a particle belongs to none of the
+specified groups, its weight is not changed.  If it belongs to
+multiple groups, its weight is the product of the weight factors.
+
+This weight style is useful in combination with pair style
+:doc:`hybrid <pair_hybrid>`, e.g. when combining a more costly
+many-body potential with a fast pair-wise potential.  It is also
+useful when using :doc:`run_style respa <run_style>` where some
+portions of the system have many bonded interactions and others none.
+It assumes that the computational cost for each group remains constant
+over time.  This is a purely empirical weighting, so a series of test
+runs to tune the assigned weight factors for optimal performance is
+recommended.
+
+The *neigh* weight style assigns the same weight to each particle
+owned by a processor based on the total count of neighbors in the
+neighbor list owned by that processor.  The motivation is that more
+neighbors means a higher computational cost.  The style does not use
+neighbors per atom to assign a unique weight to each atom, because
+that value can vary depending on how the neighbor list is built.
+
+The *factor* setting is applied as an overall scale factor to the
+*neigh* weights which allows adjustment of their impact on the
+balancing operation.  The specified *factor* value must be positive.
+A value > 1.0 will increase the weights so that the ratio of max
+weight to min weight increases by *factor*\ .  A value < 1.0 will
+decrease the weights so that the ratio of max weight to min weight
+decreases by *factor*\ .  In both cases the intermediate weight values
+increase/decrease proportionally as well.  A value = 1.0 has no effect
+on the *neigh* weights.  As a rule of thumb, we have found a *factor*
+of about 0.8 often results in the best performance, since the number
+of neighbors is likely to overestimate the ideal weight.
+
+This weight style is useful for systems where there are different
+cutoffs used for different pairs of interactions, or the density
+fluctuates, or a large number of particles are in the vicinity of a
+wall, or a combination of these effects.  If a simulation uses
+multiple neighbor lists, this weight style will use the first suitable
+neighbor list it finds.  It will not request or compute a new list.  A
+warning will be issued if there is no suitable neighbor list available
+or if it is not current, e.g. if the balance command is used before a
+:doc:`run <run>` or :doc:`minimize <minimize>` command, in which case
+the neighbor list may not yet have been built.  In this case no
+weights are computed.  Inserting a :doc:`run 0 post no <run>` command
+before issuing the *balance* command may be a workaround for this
+case, as it will cause the neighbor list to be built.
+
+The *time* weight style uses :doc:`timer data <timer>` to estimate
+weights.  It assigns the same weight to each particle owned by a
+processor based on the total computational time spent by that
+processor.  See details below on what time window is used.  It uses
+the same timing information as is used for the
+:doc:`MPI task timing breakdown <Run_output>`, namely, for sections
+*Pair*\ , *Bond*\ , *Kspace*\ , and *Neigh*\ .  The time spent in those
+portions of the timestep is measured for each MPI rank, summed, then
+divided by the number of particles owned by that processor.  I.e. the
+weight is an effective CPU time/particle averaged over the particles
+on that processor.
+
+The *factor* setting is applied as an overall scale factor to the
+*time* weights which allows adjustment of their impact on the
+balancing operation.  The specified *factor* value must be positive.
+A value > 1.0 will increase the weights so that the ratio of max
+weight to min weight increases by *factor*\ .  A value < 1.0 will
+decrease the weights so that the ratio of max weight to min weight
+decreases by *factor*\ .  In both cases the intermediate weight values
+increase/decrease proportionally as well.  A value = 1.0 has no effect
+on the *time* weights.  As a rule of thumb, effective values to use
+are typically between 0.5 and 1.2.  Note that the timer quantities
+mentioned above can be affected by communication which occurs in the
+middle of the operations, e.g. pair styles with intermediate exchange
+of data within the force computation, and likewise for KSpace solves.
+
+When using the *time* weight style with the *balance* command, the
+timing data is taken from the preceding run command, i.e. the timings
+are for the entire previous run.  For the *fix balance* command the
+timing data is for only the timesteps since the last balancing
+operation was performed.  If timing information for the required
+sections is not available, e.g. at the beginning of a run, or when the
+:doc:`timer <timer>` command is set to either *loop* or *off*\ , a
+warning is issued.  In this case no weights are computed.
+
+.. note::
+
+   The *time* weight style is the most generic option, and should be
+   tried first, unless the *group* style is easily applicable.
+   However, since the computed cost function is averaged over all
+   particles on a processor, the weights may not be highly accurate.
+   This style can also be effective as a secondary weight in
+   combination with either *group* or *neigh* to offset some of the
+   inaccuracies in either of those heuristics.
+
+The *var* weight style assigns per-particle weights by evaluating an
+:doc:`atom-style variable <variable>` specified by *name*\ .
+This is provided as a more flexible alternative to the *group* weight
+style, allowing the definition of more complex heuristics based on
+information (global and per atom) available inside LAMMPS.  For
+example, atom-style variables can reference the position of a
+particle, its velocity, the volume of its Voronoi cell, etc.
+
+The *store* weight style does not compute a weight factor.  Instead it
+stores the current accumulated weights in a custom per-atom property
+specified by *name*\ .  This must be a property defined as *d_name* via
+the :doc:`fix property/atom <fix_property_atom>` command.  Note that
+these custom per-atom properties can be output in a :doc:`dump <dump>`
+file, so this is a way to examine, debug, or visualize the
+per-particle weights computed during the load-balancing operation.
+
+----------
+
+The *out* keyword writes a text file to the specified *filename* with
+the results of the balancing operation.  The file contains the bounds
+of the sub-domain for each processor after the balancing operation
+completes.  The format of the file is compatible with the
+`Pizza.py <pizza_>`_ *mdump* tool which has support for manipulating
+and visualizing mesh files.  An example is shown here for a 2d problem
+balanced across 4 processors:
+
+.. parsed-literal::
+
+   ITEM: TIMESTEP
+   0
+   ITEM: NUMBER OF NODES
+   16
+   ITEM: BOX BOUNDS
+   0 10
+   0 10
+   0 10
+   ITEM: NODES
+   1 1 0 0 0
+   2 1 5 0 0
+   3 1 5 5 0
+   4 1 0 5 0
+   5 1 5 0 0
+   6 1 10 0 0
+   7 1 10 5 0
+   8 1 5 5 0
+   9 1 0 5 0
+   10 1 5 5 0
+   11 1 5 10 0
+   12 1 10 5 0
+   13 1 5 5 0
+   14 1 10 5 0
+   15 1 10 10 0
+   16 1 5 10 0
+   ITEM: TIMESTEP
+   0
+   ITEM: NUMBER OF SQUARES
+   4
+   ITEM: SQUARES
+   1 1 1 2 3 4
+   2 1 5 6 7 8
+   3 1 9 10 11 12
+   4 1 13 14 15 16
+
+The coordinates of all the vertices are listed in the NODES section, 4
+per processor.  Note that the 4 sub-domains share vertices, so there
+will be duplicate nodes in the list.
+
+The "SQUARES" section lists the node IDs of the 4 vertices in a
+rectangle for each processor (1 to 4).
+
+For a 3d problem, the syntax is similar with 8 vertices listed for
+each processor, instead of 4, and "SQUARES" replaced by "CUBES".
+
+----------
+
+Restrictions
+""""""""""""
+
+For 2d simulations, the *z* style cannot be used.  Nor can a "z"
+appear in *dimstr* for the *shift* style.
+
+Balancing through recursive bisectioning (\ *rcb* style) requires
+:doc:`comm_style tiled <comm_style>`.
+
+Related commands
+""""""""""""""""
+
+:doc:`group <group>`, :doc:`processors <processors>`,
+:doc:`fix balance <fix_balance>`, :doc:`comm_style <comm_style>`
+
+.. _pizza: https://pizza.sandia.gov
+
+Default
+"""""""
+
+none
diff --git a/src/balance.cpp b/src/balance.cpp
index 376f576a1b..3784b9a363 100644
--- a/src/balance.cpp
+++ b/src/balance.cpp
@@ -20,10 +20,11 @@
 
 #include "balance.h"
 
+#include "update.h"
 #include "atom.h"
+#include "neighbor.h"
 #include "comm.h"
 #include "domain.h"
-#include "error.h"
 #include "fix_store.h"
 #include "imbalance.h"
 #include "imbalance_group.h"
@@ -35,16 +36,19 @@
 #include "memory.h"
 #include "modify.h"
 #include "rcb.h"
-#include "update.h"
+#include "error.h"
 
 #include <cmath>
 #include <cstring>
 
 using namespace LAMMPS_NS;
 
+double EPSNEIGH = 1.0e-3;
+
 enum{XYZ,SHIFT,BISECTION};
 enum{NONE,UNIFORM,USER};
 enum{X,Y,Z};
+
 /* ---------------------------------------------------------------------- */
 
 Balance::Balance(LAMMPS *lmp) : Pointers(lmp)
@@ -770,7 +774,7 @@ void Balance::shift_setup(char *str, int nitermax_in, double thresh_in)
 int Balance::shift()
 {
   int i,j,k,m,np;
-  double mycost,totalcost;
+  double mycost,totalcost,boxsize;
   double *split;
 
   // no balancing if no atoms
@@ -790,15 +794,23 @@ int Balance::shift()
 
   // loop over dimensions in balance string
 
+  double *prd = domain->prd;
+
   int niter = 0;
   for (int idim = 0; idim < ndim; idim++) {
 
     // split = ptr to xyz split in Comm
 
-    if (bdim[idim] == X) split = comm->xsplit;
-    else if (bdim[idim] == Y) split = comm->ysplit;
-    else if (bdim[idim] == Z) split = comm->zsplit;
-    else continue;
+    if (bdim[idim] == X) {
+      split = comm->xsplit;
+      boxsize = prd[0];
+    } else if (bdim[idim] == Y) {
+      split = comm->ysplit;
+      boxsize = prd[1];
+    } else if (bdim[idim] == Z) {
+      split = comm->zsplit;
+      boxsize = prd[2];
+    } else continue;
 
     // initial count and sum
 
@@ -903,6 +915,78 @@ int Balance::shift()
       }
     }
 
+    // adjust adjacent splits that are too close (within neigh skin)
+    // do this with minimal adjustment to splits
+
+    double close = (1.0+EPSNEIGH) * neighbor->skin / boxsize;
+    double delta,midpt,start,stop,lbound,ubound,spacing;
+
+    i = 0;
+    while (i < np) {
+      if (split[i+1] - split[i] < close) {
+        j = i+1;
+
+        // I,J = set of consecutive splits that are collectively too close
+        // if can expand set and not become too close to splits I-1 or J+1, do it
+        // else add split I-1 or J+1 to set and try again
+        // delta = size of expanded split set that will satisfy criterion
+
+        while (1) {
+          delta = (j-i) * close;
+          midpt = 0.5 * (split[i]+split[j]);
+          start = midpt - 0.5*delta;
+          stop = midpt + 0.5*delta;
+
+          if (i > 0) lbound = split[i-1] + close;
+          else lbound = 0.0;
+          if (j < np) ubound = split[j+1] - close;
+          else ubound = 1.0;
+
+          // start/stop are within bounds, reset the splits
+
+          if (start >= lbound && stop <= ubound) break;
+
+          // try a shift to either bound, reset the splits if delta fits
+          // these tests change start/stop
+
+          if (start < lbound) {
+            start = lbound;
+            stop = start + delta;
+            if (stop <= ubound) break;
+          } else if (stop > ubound) {
+            stop = ubound;
+            start = stop - delta;
+            if (start >= lbound) break;
+          }
+
+          // delta does not fit between lbound and ubound
+          // exit if can't expand set, else expand set
+          // if can expand in either direction,
+          //   pick new split closest to current midpt of set
+
+          if (i == 0 && j == np) {
+            start = 0.0; stop = 1.0;
+            break;
+          }
+          if (i == 0) j++;
+          else if (j == np) i--;
+          else if (midpt-lbound < ubound-midpt) i--;
+          else j++;
+        }
+
+        // reset all splits between I,J inclusive to be equi-spaced
+
+        spacing = (stop-start) / (j-i);
+        for (m = i; m <= j; m++)
+          split[m] = start + (m-i)*spacing;
+        if (j == np) split[np] = 1.0;
+
+        // continue testing beyond the J split
+
+        i = j+1;
+      } else i++;
+    }
+
     // sanity check on bad duplicate or inverted splits
     // zero or negative width sub-domains will break Comm class
     // should never happen if recursive multisection algorithm is correct
@@ -911,14 +995,6 @@
     for (i = 0; i < np; i++) if (split[i] >= split[i+1]) bad = 1;
     if (bad) error->all(FLERR,"Balance produced bad splits");
 
-    /*
-    if (me == 0) {
-      printf("BAD SPLITS %d %d %d\n",np+1,niter,delta);
-      for (i = 0; i < np+1; i++)
-        printf(" %g",split[i]);
-      printf("\n");
-    }
-    */
 
     // stop at this point in bstr if imbalance factor < threshold
     // this is a true 3d test of particle count per processor