diff --git a/doc/Manual.html b/doc/Manual.html
index e783b2eaa9..39cc7d2f0b 100644
--- a/doc/Manual.html
+++ b/doc/Manual.html
@@ -132,15 +132,21 @@ it gives quick access to documentation for all LAMMPS commands.
This section describes various methods for improving LAMMPS
-performance for different classes of problems running
-on different kinds of machines.
+performance for different classes of problems running on different
+kinds of machines.
-5.1 OPT package
+Before trying to make your simulation run faster, you should
+understand how it currently performs and where the bottlenecks are.
+
+The best way to do this is to run your system (actual number of
+atoms) for a modest number of timesteps (say 100, or a few 100 at
+most) on several different processor counts, including a single
+processor if possible. Do this for an equilibrated version of your
+system, so that the 100-step timings are representative of a much
+longer run. There is typically no need to run for 1000s of timesteps
+to get accurate timings; you can simply extrapolate from short runs.
+
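The extrapolation idea above can be sketched in a few lines. This is a minimal illustration with hypothetical timings (not from any particular benchmark); it simply assumes loop time scales linearly with the number of timesteps:

```python
# Estimate the wall time of a long production run from a short
# 100-step benchmark run (hypothetical numbers, for illustration).

def estimate_runtime(short_steps, short_seconds, target_steps):
    """Linearly extrapolate loop time from a short benchmark run."""
    per_step = short_seconds / short_steps
    return per_step * target_steps

# A 100-step benchmark that took 12.5 s suggests a 1,000,000-step
# production run would need about 125,000 s (~34.7 hours).
seconds = estimate_runtime(100, 12.5, 1_000_000)
print(f"estimated production run: {seconds / 3600.0:.1f} hours")
```

The linearity assumption holds only for an equilibrated system, which is why the short runs should be done on one, as noted above.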
+For the set of runs, look at the timing data printed to the screen
+and log file at the end of each LAMMPS run. This section of the
+manual has an overview.
+
+Running on one (or a few processors) should give a good estimate of
+the serial performance and what portions of the timestep are taking
+the most time. Running the same problem on a few different processor
+counts should give an estimate of parallel scalability. I.e. if the
+simulation runs 16x faster on 16 processors, it's 100% parallel
+efficient; if it runs 8x faster on 16 processors, it's 50% efficient.
+
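The efficiency figures quoted above follow from a simple ratio: measured speedup divided by the ideal (linear) speedup. A small sketch, with hypothetical loop times:

```python
# Parallel efficiency from timings of the same problem on 1 and N
# processors (hypothetical numbers, for illustration).

def parallel_efficiency(t_serial, t_parallel, nprocs):
    """Return parallel efficiency in percent: speedup / ideal speedup."""
    speedup = t_serial / t_parallel
    return 100.0 * speedup / nprocs

# 16x faster on 16 processors -> 100% parallel efficient
print(parallel_efficiency(160.0, 10.0, 16))   # 100.0
# 8x faster on 16 processors -> 50% efficient
print(parallel_efficiency(160.0, 20.0, 16))   # 50.0
```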
+The most important data to look at in the timing info is the timing
+breakdown and relative percentages. For example, trying different
+options for speeding up the long-range solvers will have little
+impact if they only consume 10% of the run time. If the pairwise
+time is dominating, you may want to look at GPU or OMP versions of
+the pair style, as discussed below. Comparing how the percentages
+change as you increase the processor count gives you a sense of how
+different operations within the timestep are scaling. Note that if
+you are running with a Kspace solver, there is additional output on
+the breakdown of the Kspace time. For PPPM, this includes the
+fraction spent on FFTs, which can be communication intensive.
+
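The "little impact" point is just Amdahl's law: accelerating one component of the timestep bounds the whole-run gain by that component's share of the total time. A generic sketch (the fractions are hypothetical, not LAMMPS output):

```python
# Amdahl's-law bound on whole-run speedup when accelerating one
# component of the timestep (e.g. the long-range solver).

def overall_speedup(fraction, component_speedup):
    """Whole-run speedup when `fraction` of the run time is sped up
    by `component_speedup` and the remainder is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# Kspace takes 10% of the run: even a near-infinitely fast solver
# gains at most ~1.11x overall.
print(overall_speedup(0.10, 1e9))
# Pairwise time dominates at 80%: a 4x faster pair style gives a
# 2.5x overall speedup.
print(overall_speedup(0.80, 4.0))   # 2.5
```

This is why the timing breakdown, not raw intuition, should drive which accelerated package you try first.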
+Another important detail in the timing info is the histograms of
+atom counts and neighbor counts. If these vary widely across
+processors, you have a load-imbalance issue. This often results in
+inaccurate relative timing data, because processors have to wait
+when communication occurs for other processors to catch up. Thus
+the reported times for "Communication" or "Other" may be higher than
+they really are, due to load imbalance. If this is an issue, you
+can uncomment the MPI_Barrier() lines in src/timer.cpp, and
+recompile LAMMPS, to obtain synchronized timings.
+
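One simple way to quantify the imbalance from those histograms is the max-to-mean ratio of per-processor atom counts: the slowest processor does roughly that many times the average work per step. A small sketch with made-up counts:

```python
# Quantify load imbalance from per-processor atom counts, as one
# might read off the end-of-run histograms (numbers hypothetical).

def imbalance_factor(counts):
    """Max/mean ratio of per-processor work: 1.0 is perfectly
    balanced; the most loaded processor does `factor` times the
    average work, and the others wait for it at communication."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean

atoms_per_proc = [2500, 2400, 2600, 4500]   # one overloaded processor
print(f"imbalance factor: {imbalance_factor(atoms_per_proc):.2f}")
```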
+Here is a list of general ideas for improving simulation performance.
+Most of them are only applicable to certain models and certain
+bottlenecks in the current performance, so let the timing data you
+initially generate be your guide. It is hard, if not impossible, to
+predict how much difference these options will make, since it is a
+function of your problem and your machine. There is no substitute
+for simply trying them out.
+
+Accelerated versions of various pair_style, fixes, computes, and
+other commands have been added to LAMMPS, which will typically run
+faster than the
@@ -86,9 +165,7 @@ packages, since they are both designed to use NVIDIA GPU hardware.
The OPT package was developed by James Fischer (High Performance
Technologies), David Richie, and Vincent Natoli (Stone Ridge
@@ -115,9 +192,7 @@ to 20% savings.
The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral
@@ -236,9 +311,7 @@ examples are
5.3 GPU package
+
The GPU package was developed by Mike Brown at ORNL. It provides GPU
versions of several pair styles and for long-range Coulombics via the
@@ -266,6 +339,13 @@ NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware.
+NOTE:
+ discuss 3 precisions
+ if change, also have to re-link with LAMMPS
+ always use newton off
+ expt with differing numbers of CPUs vs GPU - can't tell what is fastest
+ give command line switches in examples
+
Hardware and software requirements:
To use this package, you currently need to have specific NVIDIA
@@ -378,9 +458,7 @@ requires that your GPU card support double precision.
The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
@@ -516,7 +594,7 @@ occurs, the faster your simulation will run.
Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
using NVIDIA hardware, but they do it in different ways.
@@ -602,66 +680,4 @@ for the GPU and USER-CUDA packages.
These contain input scripts for identical systems, so they can be used to benchmark the performance of both packages on your system.
-Benchmark data:
-
-NOTE: We plan to add some benchmark results and plots here for the
-examples described in the previous section.
-
-Simulations:
-
-1. Lennard Jones
-
-2. Lennard Jones
-
-3. Rhodopsin model
-
-4. Lihtium-Phosphate
-
-Hardware:
-
-Workstation:
-
-eStella:
-
-Keeneland:
-