demonstrate how serial and parallel performance can change

Axel Kohlmeyer
2025-05-27 01:15:18 -04:00
parent 06b06fd991
commit f5c51af9bc
2 changed files with 133 additions and 7 deletions


@@ -1,6 +1,9 @@
Measuring performance
=====================

Factors that influence performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before trying to make your simulation run faster, you should understand
how it currently performs and where the bottlenecks are. We generally
distinguish between serial performance (how fast can a single process do
@@ -82,13 +85,130 @@ Factors affecting parallel efficiency (in no specific order):
tiled) to change the sub-domain volumes are all methods that
can help to avoid load imbalances.

Examples comparing serial performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before looking at your own input deck(s), you should get some reference
data from a known input, so that you know what kind of performance to
expect. For the following we therefore use the ``in.rhodo.scaled`` input
file and the ``data.rhodo`` data file from the ``bench`` folder. This is
a system of 32000 atoms using the CHARMM force field and long-range
electrostatics, running for 100 MD steps. The performance summary is
printed at the end of a run; it measures only the propagation phase and
excludes the setup phase.

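Such a report can be produced by running the benchmark input directly
from the ``bench`` folder. The executable name and MPI launcher in the
following sketch are assumptions and depend on how LAMMPS was built and
installed:

.. code-block:: bash

   # Run the reference benchmark with a single MPI task.  The executable
   # name (lmp) and the MPI launcher (mpirun) may differ on your machine.
   cd bench
   mpirun -np 1 lmp -in in.rhodo.scaled
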
Running with a single MPI process on an AMD Ryzen Threadripper PRO
9985WX CPU (64 cores, 128 threads, base clock 3.2GHz, max. clock
5.4GHz, L1/L2/L3 cache 5MB/64MB/256MB, 8 DDR5-6400 memory channels),
one gets the following performance report:

.. code-block::

   Performance: 1.232 ns/day, 19.476 hours/ns, 7.131 timesteps/s, 228.197 katom-step/s
   99.2% CPU use with 1 MPI tasks x 1 OpenMP threads

The %CPU value should be at 100% or very close. Lower values are an
indication that *other* processes are also using the same CPU core and
thus invalidating the performance data. The katom-step/s value is best
suited for comparisons, since it is fairly independent of the system
size. The ``in.rhodo.scaled`` input can easily be made larger through
replication in the three dimensions by setting the variables ``x``,
``y``, and ``z`` to values other than 1 from the command line with the
``-var`` flag (a command-line sketch follows the list below). For
example:

- 32000 atoms: 228.8 katom-step/s
- 64000 atoms: 231.6 katom-step/s
- 128000 atoms: 231.1 katom-step/s
- 256000 atoms: 226.4 katom-step/s
- 864000 atoms: 229.6 katom-step/s
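
The following sketch shows how such a replicated run could be launched;
the executable name and MPI launcher are again assumptions:

.. code-block:: bash

   # Replicate the 32000-atom system 2x2x2 to 256000 atoms.  Executable
   # name and MPI launcher may differ on your machine.
   mpirun -np 1 lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2
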
Comparing to an AMD Ryzen 7 7840HS CPU (8 cores, 16 threads, base clock
3.8GHz, max. clock 5.1GHz, L1/L2/L3 cache 512kB/8MB/16MB, 2 DDR5-5600
memory channels), we get similar single core performance (~220
katom-step/s vs. ~230 katom-step/s) due to the similar clock and
architecture:

- 32000 atoms: 219.8 katom-step/s
- 64000 atoms: 222.5 katom-step/s
- 128000 atoms: 216.8 katom-step/s
- 256000 atoms: 221.0 katom-step/s
- 864000 atoms: 221.1 katom-step/s

Switching to an older Intel Xeon E5-2650 v4 CPU (12 cores, 12 threads,
base clock 2.2GHz, max. clock 2.9GHz, L1/L2/L3 cache 64kB/256kB/30MB, 4
DDR4-2400 memory channels) leads to a lower performance of approximately
109 katom-step/s due to differences in architecture and clock. In all
cases, when looking at multiple runs, the katom-step/s value fluctuates
by approximately 1% around the average.
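
A simple way to gauge this run-to-run fluctuation is to repeat the same
run a few times and collect the performance lines from the log files.
The loop below is only a sketch; the executable name, MPI launcher, and
log file names are assumptions:

.. code-block:: bash

   # Repeat the same benchmark five times and extract the performance
   # summary lines from the individual log files.
   for i in 1 2 3 4 5; do
       mpirun -np 1 lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2 -log log.run$i
   done
   grep -H 'Performance:' log.run*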

From here on, we look at the performance of the 256000-atom system only
and change several settings incrementally:

#. No compiler optimization GCC (-Og -g): 183.8 katom-step/s
#. Moderate optimization with debug info GCC (-O2 -g): 231.1 katom-step/s
#. Full compiler optimization GCC (-DNDEBUG -O3): 236.0 katom-step/s
#. Aggressive compiler optimization GCC (-O3 -ffast-math -march=native): 239.9 katom-step/s
#. Source code optimization in OPENMP package (1 thread): 266.7 katom-step/s
#. Use *fix nvt* instead of *fix npt* (compute virial only every 50 steps): 272.9 katom-step/s
#. Increase pair style cutoff by 2 :math:`\AA`: 181.2 katom-step/s
#. Use tight PPPM convergence (1.0e-6 instead of 1.0e-4): 161.9 katom-step/s
#. Use Ewald summation instead of PPPM (at 1.0e-4 convergence): 19.9 katom-step/s

The numbers show that the gains from aggressive compiler optimizations
are rather small in LAMMPS, while the data access optimizations in the
OPENMP (and OPT) packages are more visible (a command-line sketch for
enabling them follows below). On the other hand, using more accurate
force field settings causes, not unexpectedly, a significant slowdown.
Finally, using regular Ewald summation causes a massive slowdown due to
its poor algorithmic scaling with system size.
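
As an illustration of item 5 above, the styles from the OPENMP package
can be selected at run time with the suffix and package command-line
switches. The executable name and MPI launcher are assumptions:

.. code-block:: bash

   # Use the /omp variants of the styles with one OpenMP thread per MPI
   # task (executable name and MPI launcher may differ on your machine).
   mpirun -np 1 lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2 -sf omp -pk omp 1
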
Examples comparing parallel performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Parallel performance builds on top of serial performance: using twice
as many processors should increase the performance metric by up to a
factor of two. With the number of processors *N*, the serial
performance :math:`p_1`, and the performance on *N* processors
:math:`p_N`, we can define the *parallel efficiency* in percent as
follows:

.. math::

   P_{eff} = \frac{p_N}{p_1 \cdot N} \cdot 100\%

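As an example, a run on 2 MPI tasks that reaches 530.6 katom-step/s
while a single task reaches 273.6 katom-step/s has a parallel
efficiency of :math:`530.6 / (2 \cdot 273.6) \cdot 100\% \approx 97\%`.
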
For the AMD Ryzen Threadripper PRO 9985WX CPU and the serial
simulation settings of point 6 above, we get the following parallel
efficiency data for the 256000-atom system:

- 1 MPI task: 273.6 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 530.6 katom-step/s, :math:`P_{eff} = 97\%`
- 4 MPI tasks: 1.021 Matom-step/s, :math:`P_{eff} = 93\%`
- 8 MPI tasks: 1.837 Matom-step/s, :math:`P_{eff} = 84\%`
- 16 MPI tasks: 3.574 Matom-step/s, :math:`P_{eff} = 82\%`
- 32 MPI tasks: 6.479 Matom-step/s, :math:`P_{eff} = 74\%`
- 64 MPI tasks: 9.032 Matom-step/s, :math:`P_{eff} = 52\%`
- 128 MPI tasks: 12.03 Matom-step/s, :math:`P_{eff} = 34\%`

The run with 128 MPI tasks uses CPU cores from hyper-threading, since
the CPU has only 64 physical cores.

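Such a scaling study is easily scripted. The sketch below assumes the
same hypothetical executable name, MPI launcher, and log file naming as
before:

.. code-block:: bash

   # Run the 256000-atom system on increasing numbers of MPI tasks and
   # collect the performance summary lines afterward.
   for n in 1 2 4 8 16 32 64 128; do
       mpirun -np $n lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2 -log log.$n
   done
   grep -H 'Performance:' log.[0-9]*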

For a small system with only 32000 atoms, the parallel efficiency drops
off earlier, because the amount of work per processor becomes too small
relative to the communication overhead:

- 1 MPI task: 270.8 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 529.3 katom-step/s, :math:`P_{eff} = 98\%`
- 4 MPI tasks: 989.8 katom-step/s, :math:`P_{eff} = 91\%`
- 8 MPI tasks: 1.832 Matom-step/s, :math:`P_{eff} = 85\%`
- 16 MPI tasks: 3.463 Matom-step/s, :math:`P_{eff} = 80\%`
- 32 MPI tasks: 5.970 Matom-step/s, :math:`P_{eff} = 69\%`
- 64 MPI tasks: 7.477 Matom-step/s, :math:`P_{eff} = 42\%`
- 128 MPI tasks: 8.069 Matom-step/s, :math:`P_{eff} = 23\%`

The best way to benchmark your own system is to run it (with the actual
number of atoms) for a modest number of timesteps (say 100 steps) on
several different processor counts, including a single processor if
possible. Do this for an equilibrated version of your system, so that
the 100-step timings are representative of a much longer run. There is
typically no need to run for thousands of timesteps to get accurate
timings; you can simply extrapolate from short runs.

For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run. The