diff --git a/doc/Manual.html b/doc/Manual.html
index e783b2eaa9..39cc7d2f0b 100644
--- a/doc/Manual.html
+++ b/doc/Manual.html
@@ -132,15 +132,21 @@ it gives quick access to documentation for all LAMMPS commands.
This section describes various methods for improving LAMMPS
-performance for different classes of problems running
-on different kinds of machines.
+performance for different classes of problems running on different
+kinds of machines.
-5.1 OPT package
+Before trying to make your simulation run faster, you should
+understand how it currently performs and where the bottlenecks are.
+
+The best way to do this is to run your system (actual number of
+atoms) for a modest number of timesteps (say 100, or a few 100 at
+most) on several different processor counts, including a single
+processor if possible. Do this for an equilibrated version of your
+system, so that the 100-step timings are representative of a much
+longer run. There is typically no need to run for 1000s of timesteps
+to get accurate timings; you can simply extrapolate from short runs.
+
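The extrapolation idea above can be sketched in a few lines. This is a minimal illustration with hypothetical timings (not from any particular benchmark); it simply assumes loop time scales linearly with the number of timesteps:

```python
# Estimate the wall time of a long production run from a short
# 100-step benchmark run (hypothetical numbers, for illustration).

def estimate_runtime(short_steps, short_seconds, target_steps):
    """Linearly extrapolate loop time from a short benchmark run."""
    per_step = short_seconds / short_steps
    return per_step * target_steps

# A 100-step benchmark that took 12.5 s suggests a 1,000,000-step
# production run would need about 125,000 s (~34.7 hours).
seconds = estimate_runtime(100, 12.5, 1_000_000)
print(f"estimated production run: {seconds / 3600.0:.1f} hours")
```

The linearity assumption holds only for an equilibrated system, which is why the short runs should be done on one, as noted above.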
+For the set of runs, look at the timing data printed to the screen
+and log file at the end of each LAMMPS run. This section of the
+manual has an overview.
+
+Running on one (or a few processors) should give a good estimate of
+the serial performance and what portions of the timestep are taking
+the most time. Running the same problem on a few different processor
+counts should give an estimate of parallel scalability. I.e. if the
+simulation runs 16x faster on 16 processors, it's 100% parallel
+efficient; if it runs 8x faster on 16 processors, it's 50% efficient.
+
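The efficiency figures quoted above follow from a simple ratio: measured speedup divided by the ideal (linear) speedup. A small sketch, with hypothetical loop times:

```python
# Parallel efficiency from timings of the same problem on 1 and N
# processors (hypothetical numbers, for illustration).

def parallel_efficiency(t_serial, t_parallel, nprocs):
    """Return parallel efficiency in percent: speedup / ideal speedup."""
    speedup = t_serial / t_parallel
    return 100.0 * speedup / nprocs

# 16x faster on 16 processors -> 100% parallel efficient
print(parallel_efficiency(160.0, 10.0, 16))   # 100.0
# 8x faster on 16 processors -> 50% efficient
print(parallel_efficiency(160.0, 20.0, 16))   # 50.0
```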
+The most important data to look at in the timing info is the timing
+breakdown and relative percentages. For example, trying different
+options for speeding up the long-range solvers will have little
+impact if they only consume 10% of the run time. If the pairwise
+time is dominating, you may want to look at GPU or OMP versions of
+the pair style, as discussed below. Comparing how the percentages
+change as you increase the processor count gives you a sense of how
+different operations within the timestep are scaling. Note that if
+you are running with a Kspace solver, there is additional output on
+the breakdown of the Kspace time. For PPPM, this includes the
+fraction spent on FFTs, which can be communication intensive.
+
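The "little impact" point is just Amdahl's law: accelerating one component of the timestep bounds the whole-run gain by that component's share of the total time. A generic sketch (the fractions are hypothetical, not LAMMPS output):

```python
# Amdahl's-law bound on whole-run speedup when accelerating one
# component of the timestep (e.g. the long-range solver).

def overall_speedup(fraction, component_speedup):
    """Whole-run speedup when `fraction` of the run time is sped up
    by `component_speedup` and the remainder is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# Kspace takes 10% of the run: even a near-infinitely fast solver
# gains at most ~1.11x overall.
print(overall_speedup(0.10, 1e9))
# Pairwise time dominates at 80%: a 4x faster pair style gives a
# 2.5x overall speedup.
print(overall_speedup(0.80, 4.0))   # 2.5
```

This is why the timing breakdown, not raw intuition, should drive which accelerated package you try first.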
+Another important detail in the timing info is the histograms of
+atom counts and neighbor counts. If these vary widely across
+processors, you have a load-imbalance issue. This often results in
+inaccurate relative timing data, because processors have to wait
+when communication occurs for other processors to catch up. Thus
+the reported times for "Communication" or "Other" may be higher than
+they really are, due to load imbalance. If this is an issue, you
+can uncomment the MPI_Barrier() lines in src/timer.cpp, and
+recompile LAMMPS, to obtain synchronized timings.
+
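One simple way to quantify the imbalance from those histograms is the max-to-mean ratio of per-processor atom counts: the slowest processor does roughly that many times the average work per step. A small sketch with made-up counts:

```python
# Quantify load imbalance from per-processor atom counts, as one
# might read off the end-of-run histograms (numbers hypothetical).

def imbalance_factor(counts):
    """Max/mean ratio of per-processor work: 1.0 is perfectly
    balanced; the most loaded processor does `factor` times the
    average work, and the others wait for it at communication."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean

atoms_per_proc = [2500, 2400, 2600, 4500]   # one overloaded processor
print(f"imbalance factor: {imbalance_factor(atoms_per_proc):.2f}")
```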
+Here is a list of general ideas for improving simulation performance.
+Most of them are only applicable to certain models and certain
+bottlenecks in the current performance, so let the timing data you
+initially generate be your guide. It is hard, if not impossible, to
+predict how much difference these options will make, since it is a
+function of your problem and your machine. There is no substitute
+for simply trying them out.
+
+Accelerated versions of various pair_style, fixes, computes, and
+other commands have been added to LAMMPS, which will typically run
+faster than the
@@ -86,9 +165,7 @@ packages, since they are both designed to use NVIDIA GPU hardware.
The OPT package was developed by James Fischer (High Performance
Technologies), David Richie, and Vincent Natoli (Stone Ridge
@@ -115,9 +192,7 @@ to 20% savings.
The USER-OMP package was developed by Axel Kohlmeyer at Temple
University. It provides multi-threaded versions of most pair styles,
all dihedral
@@ -236,9 +311,7 @@ examples are
5.3 GPU package
+
The GPU package was developed by Mike Brown at ORNL. It provides GPU
versions of several pair styles and for long-range Coulombics via the
@@ -266,6 +339,13 @@ NVIDIA support as well as more general OpenCL support, so that the
same functionality can eventually be supported on a variety of GPU
hardware.
+NOTE:
+ discuss 3 precisions
+ if change, also have to re-link with LAMMPS
+ always use newton off
+ expt with differing numbers of CPUs vs GPU - can't tell what is fastest
+ give command line switches in examples
+
Hardware and software requirements:
To use this package, you currently need to have specific NVIDIA
@@ -378,9 +458,7 @@ requires that your GPU card support double precision.
The USER-CUDA package was developed by Christian Trott at U Technology
Ilmenau in Germany. It provides NVIDIA GPU versions of many pair
@@ -516,7 +594,7 @@ occurs, the faster your simulation will run.
Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation
using NVIDIA hardware, but they do it in different ways.
@@ -602,66 +680,4 @@ for the GPU and USER-CUDA packages.
These contain input scripts for identical systems, so they can be used to benchmark the performance of both packages on your system.
-Benchmark data:
-
-NOTE: We plan to add some benchmark results and plots here for the
-examples described in the previous section.
-
-Simulations:
-
-1. Lennard Jones
-
-2. Lennard Jones
-
-3. Rhodopsin model
-
-4. Lihtium-Phosphate
-
-Hardware:
-
-Workstation:
-
-eStella:
-
-Keeneland:
-