diff --git a/doc/Section_accelerate.html b/doc/Section_accelerate.html index 4b4838ba70..b64c88ca40 100644 --- a/doc/Section_accelerate.html +++ b/doc/Section_accelerate.html @@ -190,55 +190,16 @@ from the GPU package, you can either append "gpu" to the style name switch, or use the suffix command.
-The fix gpu command controls the GPU selection and -initialization steps. +
The package gpu command must be used near the beginning +of your script to control the GPU selection and initialization steps. +It also enables asynchronous splitting of force computations between +the CPUs and GPUs.
-The format for the fix is: -
-fix fix-ID all gpu mode first last split --
where fix-ID is the name for the fix. The gpu fix must be the first -fix specified for a given run, otherwise LAMMPS will exit with an -error. The gpu fix does not have any effect on runs that do not use -GPU acceleration, so there should be no problem specifying the fix -first in any input script. -
-The mode setting can be either "force" or "force/neigh". In the -former, neighbor list calculation is performed on the CPU using the -standard LAMMPS routines. In the latter, the neighbor list calculation -is performed on the GPU. The GPU neighbor list can be used for better -performance, however, it cannot not be used with a triclinic box or -with hybrid pair styles. -
-There are cases when it may be more efficient to select the CPU for -neighbor list builds. If a non-GPU enabled style (e.g. a fix or -compute) requires a neighbor list, it will also be built using CPU -routines. Redundant CPU and GPU neighbor list calculations will -typically be less efficient. -
-The first setting is the ID (as reported by -lammps/lib/gpu/nvc_get_devices) of the first GPU that will be used on -each node. The last setting is the ID of the last GPU that will be -used on each node. If you have only one GPU per node, first and -last will typically both be 0. Selecting a non-sequential set of GPU -IDs (e.g. 0,1,3) is not currently supported. -
-The split setting is the fraction of particles whose forces, -torques, energies, and/or virials will be calculated on the GPU. This -can be used to perform CPU and GPU force calculations simultaneously, -e.g. on a hybrid node with a multicore CPU and a GPU(s). If split -is negative, the software will attempt to calculate the optimal -fraction automatically every 25 timesteps based on CPU and GPU -timings. Because the GPU speedups are dependent on the number of -particles, automatic calculation of the split can be less efficient, -but typically results in loop times within 20% of an optimal fixed -split. -
-As an example, if you have two GPUs per node, 8 CPU cores per node, +
As an example, if you have two GPUs per node and 8 CPU cores per node, and would like to run on 4 nodes (32 cores) with dynamic balancing of -force calculation across CPU and GPU cores, the fix might be +force calculation across CPU and GPU cores, you could specify
-fix 0 all gpu force/neigh 0 1 -1 +package gpu force/neigh 0 1 -1In this case, all CPU cores and GPU devices on the nodes would be utilized. Each GPU device would be shared by 4 CPU cores. The CPU @@ -246,39 +207,14 @@ cores would perform force calculations for some fraction of the particles at the same time the GPUs performed force calculation for the other particles.
-Asynchronous pair computation on GPU and CPU -
-The GPU accelerated pair styles can perform pair style force -calculation on the GPU at the same time other force calculations -within LAMMPS are being performed on the CPU. These include pair, -bond, angle, etc forces as well as long-range Coulombic forces. This -is enabled by the split setting in the gpu fix as described above. -
-With a split setting less than 1.0, a portion of the pair-wise force -calculations will also be performed on the CPU. When the CPU finishes -its pair style computations (if any), the next LAMMPS force -computation will begin (bond, angle, etc), possibly before the GPU has -finished its pair style computations. -
-This means that if split is set to 1.0, the GPU will begin the -LAMMPS force computation immediately. This can be used to run a -hybrid GPU pair style at the same time as a hybrid -CPU pair style. In this case, the GPU pair style should be first in -the hybrid command in order to perform simultaneous calculations. This -also allows bond, angle, -dihedral, improper, and -long-range force computations to run -simultaneously with the GPU pair style. If all CPU force computations -complete before the GPU, LAMMPS will block until the GPU has finished -before continuing the timestep. -
Timing output:
-As noted above, GPU accelerated pair styles can perform computations -asynchronously with CPU computations. The "Pair" time reported by -LAMMPS will be the maximum of the time required to complete the CPU -pair style computations and the time required to complete the GPU pair -style computations. Any time spent for GPU-enabled pair styles for +
As described by the package gpu command, GPU +accelerated pair styles can perform computations asynchronously with +CPU computations. The "Pair" time reported by LAMMPS will be the +maximum of the time required to complete the CPU pair style +computations and the time required to complete the GPU pair style +computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with bond, angle, dihedral, improper, and long-range diff --git a/doc/Section_accelerate.txt b/doc/Section_accelerate.txt index c67655f13e..b348fe207d 100644 --- a/doc/Section_accelerate.txt +++ b/doc/Section_accelerate.txt @@ -185,55 +185,16 @@ from the GPU package, you can either append "gpu" to the style name switch"_Section_start.html#2_6, or use the "suffix"_suffix.html command. -The "fix gpu"_fix_gpu.html command controls the GPU selection and -initialization steps. +The "package gpu"_package.html command must be used near the beginning +of your script to control the GPU selection and initialization steps. +It also enables asynchronous splitting of force computations between +the CPUs and GPUs. -The format for the fix is: - -fix fix-ID all gpu {mode} {first} {last} {split} :pre - -where fix-ID is the name for the fix. The gpu fix must be the first -fix specified for a given run, otherwise LAMMPS will exit with an -error. The gpu fix does not have any effect on runs that do not use -GPU acceleration, so there should be no problem specifying the fix -first in any input script. - -The {mode} setting can be either "force" or "force/neigh". In the -former, neighbor list calculation is performed on the CPU using the -standard LAMMPS routines. In the latter, the neighbor list calculation -is performed on the GPU. The GPU neighbor list can be used for better -performance, however, it cannot not be used with a triclinic box or -with "hybrid"_pair_hybrid.html pair styles. - -There are cases when it may be more efficient to select the CPU for -neighbor list builds. 
If a non-GPU enabled style (e.g. a fix or -compute) requires a neighbor list, it will also be built using CPU -routines. Redundant CPU and GPU neighbor list calculations will -typically be less efficient. - -The {first} setting is the ID (as reported by -lammps/lib/gpu/nvc_get_devices) of the first GPU that will be used on -each node. The {last} setting is the ID of the last GPU that will be -used on each node. If you have only one GPU per node, {first} and -{last} will typically both be 0. Selecting a non-sequential set of GPU -IDs (e.g. 0,1,3) is not currently supported. - -The {split} setting is the fraction of particles whose forces, -torques, energies, and/or virials will be calculated on the GPU. This -can be used to perform CPU and GPU force calculations simultaneously, -e.g. on a hybrid node with a multicore CPU and a GPU(s). If {split} -is negative, the software will attempt to calculate the optimal -fraction automatically every 25 timesteps based on CPU and GPU -timings. Because the GPU speedups are dependent on the number of -particles, automatic calculation of the split can be less efficient, -but typically results in loop times within 20% of an optimal fixed -split. - -As an example, if you have two GPUs per node, 8 CPU cores per node, +As an example, if you have two GPUs per node and 8 CPU cores per node, and would like to run on 4 nodes (32 cores) with dynamic balancing of -force calculation across CPU and GPU cores, the fix might be +force calculation across CPU and GPU cores, you could specify -fix 0 all gpu force/neigh 0 1 -1 :pre +package gpu force/neigh 0 1 -1 :pre In this case, all CPU cores and GPU devices on the nodes would be utilized. Each GPU device would be shared by 4 CPU cores. The CPU @@ -241,39 +202,14 @@ cores would perform force calculations for some fraction of the particles at the same time the GPUs performed force calculation for the other particles. 
-[Asynchronous pair computation on GPU and CPU] - -The GPU accelerated pair styles can perform pair style force -calculation on the GPU at the same time other force calculations -within LAMMPS are being performed on the CPU. These include pair, -bond, angle, etc forces as well as long-range Coulombic forces. This -is enabled by the {split} setting in the gpu fix as described above. - -With a {split} setting less than 1.0, a portion of the pair-wise force -calculations will also be performed on the CPU. When the CPU finishes -its pair style computations (if any), the next LAMMPS force -computation will begin (bond, angle, etc), possibly before the GPU has -finished its pair style computations. - -This means that if {split} is set to 1.0, the GPU will begin the -LAMMPS force computation immediately. This can be used to run a -"hybrid"_pair_hybrid.html GPU pair style at the same time as a hybrid -CPU pair style. In this case, the GPU pair style should be first in -the hybrid command in order to perform simultaneous calculations. This -also allows "bond"_bond_style.html, "angle"_angle_style.html, -"dihedral"_dihedral_style.html, "improper"_improper_style.html, and -"long-range"_kspace_style.html force computations to run -simultaneously with the GPU pair style. If all CPU force computations -complete before the GPU, LAMMPS will block until the GPU has finished -before continuing the timestep. - [Timing output:] -As noted above, GPU accelerated pair styles can perform computations -asynchronously with CPU computations. The "Pair" time reported by -LAMMPS will be the maximum of the time required to complete the CPU -pair style computations and the time required to complete the GPU pair -style computations. Any time spent for GPU-enabled pair styles for +As described by the "package gpu"_package.html command, GPU +accelerated pair styles can perform computations asynchronously with +CPU computations. 
The "Pair" time reported by LAMMPS will be the +maximum of the time required to complete the CPU pair style +computations and the time required to complete the GPU pair style +computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with "bond"_bond_style.html, "angle"_angle_style.html, "dihedral"_dihedral_style.html, "improper"_improper_style.html, and "long-range"_kspace_style.html diff --git a/doc/Section_commands.html b/doc/Section_commands.html index 941b2a1de2..0618a5d250 100644 --- a/doc/Section_commands.html +++ b/doc/Section_commands.html @@ -338,15 +338,14 @@ of each style or click on the style itself for a full description:
These are fix styles contributed by users, which can be used if diff --git a/doc/Section_commands.txt b/doc/Section_commands.txt index 3635a753f5..f9b9b1a189 100644 --- a/doc/Section_commands.txt +++ b/doc/Section_commands.txt @@ -418,7 +418,6 @@ of each style or click on the style itself for a full description: "evaporate"_fix_evaporate.html, "external"_fix_external.html, "freeze"_fix_freeze.html, -"gpu"_fix_gpu.html, "gravity"_fix_gravity.html, "heat"_fix_heat.html, "indent"_fix_indent.html, diff --git a/doc/fix_gpu.html b/doc/fix_gpu.html deleted file mode 100644 index d48e510798..0000000000 --- a/doc/fix_gpu.html +++ /dev/null @@ -1,112 +0,0 @@ - -
LAMMPS WWW Site - LAMMPS Documentation - LAMMPS Commands - - - - - - - -
- -fix gpu command -
-Syntax: -
-fix ID group-ID gpu mode first last split --
Examples: -
-fix 0 all gpu force 0 0 1.0 -fix 0 all gpu force 0 0 0.75 -fix 0 all gpu force/neigh 0 0 1.0 -fix 0 all gpu force/neigh 0 1 -1.0 --
Description: -
-Select and initialize GPUs to be used for acceleration and configure -GPU acceleration in LAMMPS. This fix is required in order to use -any style with GPU acceleration. The fix must be the first fix -specified for a run or an error will be generated. The fix will not have an -effect on any LAMMPS computations that do not use GPU acceleration, so there -should not be any problems with specifying this fix first in input scripts. -
-The mode setting specifies where neighbor list calculations will be -performed. If mode is force, neighbor list calculation is performed -on the CPU. If mode is force/neigh, neighbor list calculation is -performed on the GPU. GPU neighbor list calculation currently cannot -be used with a triclinic box. GPU neighbor list calculation currently -cannot be used with hybrid pair styles. GPU -neighbor lists are not compatible with styles that are not -GPU-enabled. When a non-GPU enabled style requires a neighbor list, -it will also be built using CPU routines. In these cases, it will -typically be more efficient to only use CPU neighbor list builds. -
-The first and last settings specify the GPUs that will be used for -simulation. On each node, the GPU IDs in the inclusive range from -first to last will be used. -
-The split setting can be used for load balancing force calculation -work between CPU and GPU cores in GPU-enabled pair styles. If -0<split<1.0, a fixed fraction of particles is offloaded to the GPU -while force calculation for the other particles occurs simulataneously -on the CPU. If split<0, the optimal fraction (based on CPU and GPU -timings) is calculated every 25 timesteps. If split=1.0, all force -calculations for GPU accelerated pair styles are performed on the -GPU. In this case, hybrid, bond, -angle, dihedral, -improper, and long-range -calculations can be performed on the CPU while the GPU is performing -force calculations for the GPU-enabled pair style. -
-In order to use GPU acceleration, a GPU enabled style must be selected -in the input script in addition to this fix. Currently, this is -limited to a few pair styles and the PPPM kspace -style. -
-See this section of the manual for more -details about using the GPU package. -
-Restart, fix_modify, output, run start/stop, minimize info: -
-This fix is part of the "gpu" package. It is only enabled if LAMMPS -was built with that package. See the Making -LAMMPS section for more info. -
-No information about this fix is written to binary restart -files. None of the fix_modify options -are relevant to this fix. -
-No parameter of this fix can be used with the start/stop keywords of -the run command. -
-Restrictions: -
-The fix must be the first fix specified for a given run. The -force/neigh mode should not be used with a triclinic box or -hybrid pair styles. -
-The split setting must be positive when using -hybrid pair styles. -
-Currently, group-ID must be all. -
-Related commands: none -
-Default: none -
- diff --git a/doc/fix_gpu.txt b/doc/fix_gpu.txt deleted file mode 100644 index 6abf729e74..0000000000 --- a/doc/fix_gpu.txt +++ /dev/null @@ -1,102 +0,0 @@ -"LAMMPS WWW Site"_lws - "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c - -:link(lws,http://lammps.sandia.gov) -:link(ld,Manual.html) -:link(lc,Section_commands.html#comm) - -:line - -fix gpu command :h3 - -[Syntax:] - -fix ID group-ID gpu mode first last split :pre - -ID, group-ID are documented in "fix"_fix.html command :ulb,l -gpu = style name of this fix command :l -mode = force or force/neigh :l -first = ID of first GPU to be used on each node :l -last = ID of last GPU to be used on each node :l -split = fraction of particles assigned to the GPU :l -:ule - -[Examples:] - -fix 0 all gpu force 0 0 1.0 -fix 0 all gpu force 0 0 0.75 -fix 0 all gpu force/neigh 0 0 1.0 -fix 0 all gpu force/neigh 0 1 -1.0 :pre - -[Description:] - -Select and initialize GPUs to be used for acceleration and configure -GPU acceleration in LAMMPS. This fix is required in order to use -any style with GPU acceleration. The fix must be the first fix -specified for a run or an error will be generated. The fix will not have an -effect on any LAMMPS computations that do not use GPU acceleration, so there -should not be any problems with specifying this fix first in input scripts. - -The {mode} setting specifies where neighbor list calculations will be -performed. If {mode} is force, neighbor list calculation is performed -on the CPU. If {mode} is force/neigh, neighbor list calculation is -performed on the GPU. GPU neighbor list calculation currently cannot -be used with a triclinic box. GPU neighbor list calculation currently -cannot be used with "hybrid"_pair_hybrid.html pair styles. GPU -neighbor lists are not compatible with styles that are not -GPU-enabled. When a non-GPU enabled style requires a neighbor list, -it will also be built using CPU routines. 
In these cases, it will -typically be more efficient to only use CPU neighbor list builds. - -The {first} and {last} settings specify the GPUs that will be used for -simulation. On each node, the GPU IDs in the inclusive range from -{first} to {last} will be used. - -The {split} setting can be used for load balancing force calculation -work between CPU and GPU cores in GPU-enabled pair styles. If -0<{split}<1.0, a fixed fraction of particles is offloaded to the GPU -while force calculation for the other particles occurs simulataneously -on the CPU. If {split}<0, the optimal fraction (based on CPU and GPU -timings) is calculated every 25 timesteps. If {split}=1.0, all force -calculations for GPU accelerated pair styles are performed on the -GPU. In this case, "hybrid"_pair_hybrid.html, "bond"_bond_style.html, -"angle"_angle_style.html, "dihedral"_dihedral_style.html, -"improper"_improper_style.html, and "long-range"_kspace_style.html -calculations can be performed on the CPU while the GPU is performing -force calculations for the GPU-enabled pair style. - -In order to use GPU acceleration, a GPU enabled style must be selected -in the input script in addition to this fix. Currently, this is -limited to a few "pair styles"_pair_style.html and the PPPM "kspace -style"_kspace_style.html. - -See "this section"_doc/Section_accerate.html of the manual for more -details about using the GPU package. - -[Restart, fix_modify, output, run start/stop, minimize info:] - -This fix is part of the "gpu" package. It is only enabled if LAMMPS -was built with that package. See the "Making -LAMMPS"_Section_start.html#2_3 section for more info. - -No information about this fix is written to "binary restart -files"_restart.html. None of the "fix_modify"_fix_modify.html options -are relevant to this fix. - -No parameter of this fix can be used with the {start/stop} keywords of -the "run"_run.html command. - -[Restrictions:] - -The fix must be the first fix specified for a given run. 
The -force/neigh {mode} should not be used with a triclinic box or -"hybrid"_pair_hybrid.html pair styles. - -The {split} setting must be positive when using -"hybrid"_pair_hybrid.html pair styles. - -Currently, group-ID must be all. - -[Related commands:] none - -[Default:] none - diff --git a/doc/package.html b/doc/package.html index 814340bc81..c1b5b0bebf 100644 --- a/doc/package.html +++ b/doc/package.html @@ -15,39 +15,136 @@package style args-
cuda args = to be determined +
cuda args = to be determined + omp args = Nthreads ++
Nthreads = # of OpenMP threads to associate with each MPI process
Examples:
-package cuda blah +package gpu force 0 0 1.0 +package gpu force 0 0 0.75 +package gpu force/neigh 0 0 1.0 +package gpu force/neigh 0 1 -1.0 +package cuda blah +package omp 4Description:
-This command invokes package-specific settings. Currently only the -USER-CUDA package uses it. +
This command invokes package-specific settings. Currently the +following packages use it: GPU, USER-CUDA, and USER-OMP.
+See this section of the manual for more +details about using these various packages for accelerating +a LAMMPS calculation. +
+
+ +The gpu style invokes options associated with the use of the GPU +package. It allows you to select and initialize GPUs to be used for +acceleration via this package and configure how the GPU acceleration +is performed. These settings are required in order to use any style +with GPU acceleration. +
+The mode setting specifies where neighbor list calculations will be +performed. If mode is force, neighbor list calculation is performed +on the CPU. If mode is force/neigh, neighbor list calculation is +performed on the GPU. GPU neighbor list calculation currently cannot +be used with a triclinic box. GPU neighbor list calculation currently +cannot be used with hybrid pair styles. GPU +neighbor lists are not compatible with styles that are not +GPU-enabled. When a non-GPU enabled style requires a neighbor list, +it will also be built using CPU routines. In these cases, it will +typically be more efficient to only use CPU neighbor list builds. +
+The first and last settings specify the GPUs that will be used for +simulation. On each node, the GPU IDs in the inclusive range from +first to last will be used. +
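As a hypothetical illustration of the first/last range (this is not the GPU package's actual assignment code; the function name and round-robin mapping are assumptions for illustration only), the MPI ranks on a node could share the selected GPUs like this:

```python
# Hypothetical sketch, NOT the GPU package's real logic: one way the
# MPI ranks on a node could share the GPUs in the inclusive range
# first..last.
def assign_gpus(ncores, first, last):
    """Map each of ncores MPI ranks on a node to a GPU ID in [first, last]."""
    gpus = list(range(first, last + 1))
    return [gpus[rank % len(gpus)] for rank in range(ncores)]

# 8 CPU cores per node sharing GPUs 0 and 1: each GPU serves 4 ranks
print(assign_gpus(8, 0, 1))
```

With one GPU per node (first = last = 0), every rank maps to device 0, matching the "both 0" case described above.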
+The split setting can be used for load balancing force calculation
+work between CPU and GPU cores in GPU-enabled pair styles. If 0 <
+split < 1.0, a fixed fraction of particles is offloaded to the GPU
+while force calculation for the other particles occurs simultaneously
+on the CPU. If split < 0, the optimal fraction (based on CPU and GPU
+timings) is calculated every 25 timesteps. If split = 1.0, all force
+calculations for GPU accelerated pair styles are performed on the
+GPU. In this case, hybrid, bond,
+angle, dihedral,
+improper, and long-range
+calculations can be performed on the CPU while the GPU is performing
+force calculations for the GPU-enabled pair style. If all CPU force
+computations complete before the GPU, LAMMPS will block until the GPU
+has finished before continuing the timestep.
+
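The effect of a fixed split can be sketched as simple arithmetic (an illustration only, not LAMMPS internals; the function name is invented for this sketch):

```python
# Illustrative sketch, not LAMMPS code: a fixed "split" value divides
# the per-timestep force work between the GPU and the CPU cores.
def partition(nparticles, split):
    """Return (ngpu, ncpu) particle counts for a fixed 0 < split <= 1."""
    ngpu = int(nparticles * split)  # fraction offloaded to the GPU
    return ngpu, nparticles - ngpu  # remainder computed on the CPU

print(partition(32000, 0.75))
```

A negative split replaces this fixed fraction with one re-estimated from CPU/GPU timings every 25 timesteps, as described above.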
+As an example, if you have two GPUs per node and 8 CPU cores per node, +and would like to run on 4 nodes (32 cores) with dynamic balancing of +force calculation across CPU and GPU cores, you could specify +
+package gpu force/neigh 0 1 -1 ++In this case, all CPU cores and GPU devices on the nodes would be +utilized. Each GPU device would be shared by 4 CPU cores. The CPU +cores would perform force calculations for some fraction of the +particles at the same time the GPUs performed force calculation for +the other particles. +
+
+The cuda style invokes options associated with the use of the -USER-CUDA package. These will be described when the USER-CUDA package -is released with LAMMPS. +USER-CUDA package. These need to be documented.
+
+ +The omp style invokes options associated with the use of the +USER-OMP package. +
+The only setting to make is the number of OpenMP threads to be
+allocated for each MPI process. For example, if your system has nodes
+with dual quad-core processors, it has a total of 8 cores per node.
+You could run MPI on 2 cores on each node (e.g. using options for the
+mpirun command), and set the Nthreads setting to 4. This would
+effectively use all 8 cores on each node, since each MPI process
+spawns 4 threads (one of which runs as part of the MPI process
+itself).
+
+For performance reasons, you should not set Nthreads to more threads +than there are physical cores, but LAMMPS does not check for this. +
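The sizing arithmetic in the example above can be checked directly (the node layout is the assumed dual quad-core case from the text):

```python
# Sketch of the sizing arithmetic described above; the node layout
# (dual quad-core = 8 physical cores) is the example's assumption.
physical_cores = 8
mpi_per_node = 2   # MPI processes started per node, e.g. via mpirun options
nthreads = 4       # the Nthreads value given to "package omp"

# Each MPI process runs nthreads OpenMP threads (one being the MPI
# process itself), so the node runs mpi_per_node * nthreads threads.
busy_cores = mpi_per_node * nthreads
print(busy_cores)

# Stay within the physical core count to avoid oversubscription,
# which LAMMPS does not check for.
assert busy_cores <= physical_cores
```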
+
+Restrictions:
+This command cannot be used after the simulation box is defined by a +read_data or create_box command. +
The cuda style of this command can only be invoked if LAMMPS was built with the USER-CUDA package. See the Making LAMMPS section for more info.
-Obviously, you must have GPU hardware and associated software to build -and use LAMMPS with either the GPU or USER-CUDA packages. +
The gpu style of this command can only be invoked if LAMMPS was built +with the GPU package. See the Making LAMMPS +section for more info.
-Related commands: +
The omp style of this command can only be invoked if LAMMPS was built +with the USER-OMP package. See the Making +LAMMPS section for more info.
-fix gpu +
Related commands: none
Default: none
diff --git a/doc/package.txt b/doc/package.txt
index 0e53f6d23e..6ac414e843 100644
--- a/doc/package.txt
+++ b/doc/package.txt
@@ -12,35 +12,127 @@ package command :h3
package style args :pre
-style = {cuda} :ulb,l
-args = 0 or more args specific to the style :l
- {cuda} args = to be determined :pre
+style = {gpu} or {cuda} or {omp} :ulb,l
+args = arguments specific to the style :l
+ {gpu} args = mode first last split
+ mode = force or force/neigh :l
+ first = ID of first GPU to be used on each node :l
+ last = ID of last GPU to be used on each node :l
+ split = fraction of particles assigned to the GPU :l
+ {cuda} args = to be determined
+ {omp} args = Nthreads :pre
+ Nthreads = # of OpenMP threads to associate with each MPI process :pre
:ule
[Examples:]
-package cuda blah :pre
+package gpu force 0 0 1.0
+package gpu force 0 0 0.75
+package gpu force/neigh 0 0 1.0
+package gpu force/neigh 0 1 -1.0
+package cuda blah
+package omp 4 :pre
[Description:]
-This command invokes package-specific settings. Currently only the
-USER-CUDA package uses it.
+This command invokes package-specific settings. Currently the
+following packages use it: GPU, USER-CUDA, and USER-OMP.
+
+See "this section"_Section_accelerate.html of the manual for more
+details about using these various packages for accelerating
+a LAMMPS calculation.
+
+:line
+
+The {gpu} style invokes options associated with the use of the GPU
+package. It allows you to select and initialize GPUs to be used for
+acceleration via this package and configure how the GPU acceleration
+is performed. These settings are required in order to use any style
+with GPU acceleration.
+
+The {mode} setting specifies where neighbor list calculations will be
+performed. If {mode} is force, neighbor list calculation is performed
+on the CPU. If {mode} is force/neigh, neighbor list calculation is
+performed on the GPU. GPU neighbor list calculation currently cannot
+be used with a triclinic box. 
GPU neighbor list calculation currently
+cannot be used with "hybrid"_pair_hybrid.html pair styles. GPU
+neighbor lists are not compatible with styles that are not
+GPU-enabled. When a non-GPU enabled style requires a neighbor list,
+it will also be built using CPU routines. In these cases, it will
+typically be more efficient to only use CPU neighbor list builds.
+
+The {first} and {last} settings specify the GPUs that will be used for
+simulation. On each node, the GPU IDs in the inclusive range from
+{first} to {last} will be used.
+
+The {split} setting can be used for load balancing force calculation
+work between CPU and GPU cores in GPU-enabled pair styles. If 0 <
+{split} < 1.0, a fixed fraction of particles is offloaded to the GPU
+while force calculation for the other particles occurs simultaneously
+on the CPU. If {split} < 0, the optimal fraction (based on CPU and GPU
+timings) is calculated every 25 timesteps. If {split} = 1.0, all force
+calculations for GPU accelerated pair styles are performed on the
+GPU. In this case, "hybrid"_pair_hybrid.html, "bond"_bond_style.html,
+"angle"_angle_style.html, "dihedral"_dihedral_style.html,
+"improper"_improper_style.html, and "long-range"_kspace_style.html
+calculations can be performed on the CPU while the GPU is performing
+force calculations for the GPU-enabled pair style. If all CPU force
+computations complete before the GPU, LAMMPS will block until the GPU
+has finished before continuing the timestep.
+
+As an example, if you have two GPUs per node and 8 CPU cores per node,
+and would like to run on 4 nodes (32 cores) with dynamic balancing of
+force calculation across CPU and GPU cores, you could specify
+
+package gpu force/neigh 0 1 -1 :pre
+
+In this case, all CPU cores and GPU devices on the nodes would be
+utilized. Each GPU device would be shared by 4 CPU cores. 
The CPU
+cores would perform force calculations for some fraction of the
+particles at the same time the GPUs performed force calculation for
+the other particles.
+
+:line
The {cuda} style invokes options associated with the use of the
-USER-CUDA package. These will be described when the USER-CUDA package
-is released with LAMMPS.
+USER-CUDA package. These need to be documented.
+
+:line
+
+The {omp} style invokes options associated with the use of the
+USER-OMP package.
+
+The only setting to make is the number of OpenMP threads to be
+allocated for each MPI process. For example, if your system has nodes
+with dual quad-core processors, it has a total of 8 cores per node.
+You could run MPI on 2 cores on each node (e.g. using options for the
+mpirun command), and set the {Nthreads} setting to 4. This would
+effectively use all 8 cores on each node, since each MPI process
+spawns 4 threads (one of which runs as part of the MPI process
+itself).
+
+For performance reasons, you should not set {Nthreads} to more threads
+than there are physical cores, but LAMMPS does not check for this.
+
+:line
[Restrictions:]
+This command cannot be used after the simulation box is defined by a
+"read_data"_read_data.html or "create_box"_create_box.html command.
+
The cuda style of this command can only be invoked if LAMMPS was built
with the USER-CUDA package. See the "Making
LAMMPS"_Section_start.html#2_3 section for more info.
-Obviously, you must have GPU hardware and associated software to build
-and use LAMMPS with either the GPU or USER-CUDA packages.
+The gpu style of this command can only be invoked if LAMMPS was built
+with the GPU package. See the "Making LAMMPS"_Section_start.html#2_3
+section for more info.
-[Related commands:]
+The omp style of this command can only be invoked if LAMMPS was built
+with the USER-OMP package. See the "Making
+LAMMPS"_Section_start.html#2_3 section for more info.
-"fix gpu"_fix_gpu.html
+[Related commands:] none
[Default:] none