@@ -1,6 +1,9 @@
Measuring performance
=====================

Factors that influence performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before trying to make your simulation run faster, you should understand
how it currently performs and where the bottlenecks are. We generally
distinguish between serial performance (how fast can a single process do
@@ -82,13 +85,130 @@ Factors affecting parallel efficiency (in no specific order):
tiled) to change the sub-domain volumes are all methods that
can help to avoid load imbalances.

The best way to do this is to run your system (with the actual number
of atoms) for a modest number of timesteps (say 100 steps) on several
different processor counts, including a single processor if possible.
Do this for an equilibrated version of your system, so that the
100-step timings are representative of a much longer run. There is
typically no need to run for 1000s of timesteps to get accurate
timings; you can simply extrapolate from short runs.

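A minimal sketch of such a scan is shown below; the MPI-parallel LAMMPS
executable name ``lmp`` and the input file name ``in.myscript`` are
placeholders that need to be adapted to your build and input deck:

.. code-block:: bash

   # run the same short input on several processor counts, one log file each
   for n in 1 2 4 8 16; do
       mpirun -np ${n} lmp -in in.myscript -log log.${n}procs
   done
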
Examples comparing serial performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before looking at your own input deck(s), you should get some reference
data from a known input so that you know what kind of performance you
should expect. For the following we therefore use the
``in.rhodo.scaled`` input file and ``data.rhodo`` data file from the
``bench`` folder. This is a system of 32000 atoms using the CHARMM force
field and long-range electrostatics running for 100 MD steps. The
performance data is printed at the end of a run and only measures the
performance during propagation, excluding the setup phase.

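For example, assuming an MPI-enabled LAMMPS executable named ``lmp``
(the actual binary name depends on how LAMMPS was built), the
single-process reference run can be started from the ``bench`` folder
with:

.. code-block:: bash

   # single MPI process, 32000 atom Rhodopsin benchmark, 100 MD steps
   lmp -in in.rhodo.scaled
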
Running with a single MPI process on an AMD Ryzen Threadripper PRO
9985WX CPU (64 cores, 128 threads, base clock: 3.2GHz, max. clock
5.4GHz, L1/L2/L3 cache 5MB/64MB/256MB, 8 DDR5-6400 memory channels),
one gets the following performance report:

.. code-block::

   Performance: 1.232 ns/day, 19.476 hours/ns, 7.131 timesteps/s, 228.197 katom-step/s
   99.2% CPU use with 1 MPI tasks x 1 OpenMP threads

The %CPU value should be at 100% or very close. Lower values would
be an indication that there are *other* processes also using the same
CPU core and thus invalidating the performance data. The katom-step/s
value is best suited for comparisons, since it is fairly independent
of the system size. The ``in.rhodo.scaled`` input can easily be made
larger through replication in all three dimensions by setting the
variables "x", "y", "z" to values other than 1 from the command line
with the "-var" flag (see the command sketch after the following
list). Example:

- 32000 atoms: 228.8 katom-step/s
- 64000 atoms: 231.6 katom-step/s
- 128000 atoms: 231.1 katom-step/s
- 256000 atoms: 226.4 katom-step/s
- 864000 atoms: 229.6 katom-step/s

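As a sketch of such a scaled-up run (again assuming an executable named
``lmp``; the variable values are just one possible choice), the 256000
atom system corresponds to a 2x2x2 replication of the original 32000
atoms:

.. code-block:: bash

   # replicate the 32000 atom system 2x2x2 -> 256000 atoms
   lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2
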
Comparing to an AMD Ryzen 7 7840HS CPU (8 cores, 16 threads, base clock
3.8GHz, max. clock 5.1GHz, L1/L2/L3 cache 512kB/8MB/16MB, 2 DDR5-5600
memory channels), we get similar single core performance (~220
katom-step/s vs. ~230 katom-step/s) due to the similar clock and
architecture:

- 32000 atoms: 219.8 katom-step/s
- 64000 atoms: 222.5 katom-step/s
- 128000 atoms: 216.8 katom-step/s
- 256000 atoms: 221.0 katom-step/s
- 864000 atoms: 221.1 katom-step/s

Switching to an older Intel Xeon E5-2650 v4 CPU (12 cores, 12 threads,
base clock 2.2GHz, max. clock 2.9GHz, L1/L2/L3 cache 64kB/256kB/30MB, 4
DDR4-2400 memory channels) leads to a lower performance of approximately
109 katom-step/s due to differences in architecture and clock. In all
cases, when looking at multiple runs, the katom-step/s property
fluctuates by approximately 1% around the average.

From here on we look at the performance of the 256000 atom system only
and change several settings incrementally (a build and run sketch
follows the list):

#. No compiler optimization, GCC (``-Og -g``): 183.8 katom-step/s
#. Moderate optimization with debug info, GCC (``-O2 -g``): 231.1 katom-step/s
#. Full compiler optimization, GCC (``-DNDEBUG -O3``): 236.0 katom-step/s
#. Aggressive compiler optimization, GCC (``-O3 -ffast-math -march=native``): 239.9 katom-step/s
#. Source code optimization in the OPENMP package (1 thread): 266.7 katom-step/s
#. Use *fix nvt* instead of *fix npt* (compute virial only every 50 steps): 272.9 katom-step/s
#. Increase pair style cutoff by 2 :math:`\AA`: 181.2 katom-step/s
#. Use tight PPPM convergence (1.0e-6 instead of 1.0e-4): 161.9 katom-step/s
#. Use Ewald summation instead of PPPM (at 1.0e-4 convergence): 19.9 katom-step/s

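As an illustration of how such settings are selected (a sketch only;
the build directory layout, the ``lmp`` binary name, and the exact
flags are assumptions to be adapted to your environment), the compiler
flags from point 4 are chosen when configuring the build, while the
OPENMP package styles from point 5 are enabled at run time via suffix
flags:

.. code-block:: bash

   # configure a build with aggressive compiler flags and the OPENMP package,
   # assuming a fresh build directory inside the LAMMPS source tree
   cmake ../cmake -D CMAKE_CXX_FLAGS="-O3 -ffast-math -march=native" -D PKG_OPENMP=yes
   cmake --build .

   # run the 256000 atom system with OPENMP package styles and a single thread
   lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2 -sf omp -pk omp 1
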
The numbers show that the gains from aggressive compiler optimizations
are rather small in LAMMPS, while the data access optimizations in the
OPENMP (and OPT) packages are more visible. On the other hand, using
more accurate force field settings causes, not unexpectedly, a
significant slowdown. Finally, using regular Ewald summation causes a
massive slowdown due to its poor algorithmic scaling with system size.

Examples comparing parallel performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The parallel performance builds on top of the serial performance.
Using twice as many processors should increase the performance metric
by up to a factor of two. With the number of processors *N*, the
serial performance :math:`p_1`, and the performance with *N* processors
:math:`p_N`, we can define a *parallel efficiency* in percent as follows:

.. math::

   P_{eff} = \frac{p_N}{p_1 \cdot N} \cdot 100\%

For the AMD Ryzen Threadripper PRO 9985WX CPU and the serial
simulation settings of point 6 from above, we get the following
parallel efficiency data for the 256000 atom system:

- 1 MPI task: 273.6 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 530.6 katom-step/s, :math:`P_{eff} = 97\%`
- 4 MPI tasks: 1.021 Matom-step/s, :math:`P_{eff} = 93\%`
- 8 MPI tasks: 1.837 Matom-step/s, :math:`P_{eff} = 84\%`
- 16 MPI tasks: 3.574 Matom-step/s, :math:`P_{eff} = 82\%`
- 32 MPI tasks: 6.479 Matom-step/s, :math:`P_{eff} = 74\%`
- 64 MPI tasks: 9.032 Matom-step/s, :math:`P_{eff} = 52\%`
- 128 MPI tasks: 12.03 Matom-step/s, :math:`P_{eff} = 34\%`

The run with 128 MPI tasks also uses the additional hardware threads
provided by hyper-threading.

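As a worked example, take the 64-task entry of the list above and write
the serial rate of 273.6 katom-step/s as 0.2736 Matom-step/s:

.. math::

   P_{eff} = \frac{9.032}{0.2736 \cdot 64} \cdot 100\% \approx 52\%
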
For a small system with only 32000 atoms, the parallel efficiency
drops off earlier, since the amount of work per processor becomes too
small relative to the communication overhead:

- 1 MPI task: 270.8 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 529.3 katom-step/s, :math:`P_{eff} = 98\%`
- 4 MPI tasks: 989.8 katom-step/s, :math:`P_{eff} = 91\%`
- 8 MPI tasks: 1.832 Matom-step/s, :math:`P_{eff} = 85\%`
- 16 MPI tasks: 3.463 Matom-step/s, :math:`P_{eff} = 80\%`
- 32 MPI tasks: 5.970 Matom-step/s, :math:`P_{eff} = 69\%`
- 64 MPI tasks: 7.477 Matom-step/s, :math:`P_{eff} = 42\%`
- 128 MPI tasks: 8.069 Matom-step/s, :math:`P_{eff} = 23\%`

For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run.