@@ -1,6 +1,9 @@
Measuring performance
=====================

Factors that influence performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before trying to make your simulation run faster, you should understand
how it currently performs and where the bottlenecks are. We generally
distinguish between serial performance (how fast can a single process do
@@ -82,13 +85,130 @@ Factors affecting parallel efficiency (in no specific order):
tiled) to change the sub-domain volumes are all methods that
can help to avoid load imbalances.

The best way to do this is to run your system (with the actual number
of atoms) for a modest number of timesteps (say 100 steps) on several
different processor counts, including a single processor if possible.
Do this for an equilibrated version of your system, so that the
100-step timings are representative of a much longer run. There is
typically no need to run for 1000s of timesteps to get accurate
timings; you can simply extrapolate from short runs.

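A minimal sketch of such a scan is shown below; the MPI-parallel LAMMPS
executable name ``lmp`` and the input file name ``in.myscript`` are
placeholders that need to be adapted to your build and input deck:

.. code-block:: bash

   # run the same short input on several processor counts, one log file each
   for n in 1 2 4 8 16; do
       mpirun -np ${n} lmp -in in.myscript -log log.${n}procs
   done
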
Examples comparing serial performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before looking at your own input deck(s), you should get some reference
data from a known input so that you know what kind of performance you
should expect. For the following we therefore use the
``in.rhodo.scaled`` input file and ``data.rhodo`` data file from the
``bench`` folder. This is a system of 32000 atoms using the CHARMM force
field and long-range electrostatics running for 100 MD steps. The
performance data is printed at the end of a run and only measures the
performance during propagation, excluding the setup phase.

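For example, assuming an MPI-enabled LAMMPS executable named ``lmp``
(the actual binary name depends on how LAMMPS was built), the
single-process reference run can be started from the ``bench`` folder
with:

.. code-block:: bash

   # single MPI process, 32000 atom Rhodopsin benchmark, 100 MD steps
   lmp -in in.rhodo.scaled
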
Running with a single MPI process on an AMD Ryzen Threadripper PRO
9985WX CPU (64 cores, 128 threads, base clock: 3.2GHz, max. clock
5.4GHz, L1/L2/L3 cache 5MB/64MB/256MB, 8 DDR5-6400 memory channels),
one gets the following performance report:

.. code-block::

   Performance: 1.232 ns/day, 19.476 hours/ns, 7.131 timesteps/s, 228.197 katom-step/s
   99.2% CPU use with 1 MPI tasks x 1 OpenMP threads

The %CPU value should be at 100% or very close. Lower values would
be an indication that there are *other* processes also using the same
CPU core and thus invalidating the performance data. The katom-step/s
value is best suited for comparisons, since it is fairly independent
of the system size. The ``in.rhodo.scaled`` input can easily be made
larger through replication in all three dimensions by setting the
variables "x", "y", "z" to values other than 1 from the command line
with the "-var" flag (see the command sketch after the following
list). Example:

- 32000 atoms: 228.8 katom-step/s
- 64000 atoms: 231.6 katom-step/s
- 128000 atoms: 231.1 katom-step/s
- 256000 atoms: 226.4 katom-step/s
- 864000 atoms: 229.6 katom-step/s

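As a sketch of such a scaled-up run (again assuming an executable named
``lmp``; the variable values are just one possible choice), the 256000
atom system corresponds to a 2x2x2 replication of the original 32000
atoms:

.. code-block:: bash

   # replicate the 32000 atom system 2x2x2 -> 256000 atoms
   lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2
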
Comparing to an AMD Ryzen 7 7840HS CPU (8 cores, 16 threads, base clock
3.8GHz, max. clock 5.1GHz, L1/L2/L3 cache 512kB/8MB/16MB, 2 DDR5-5600
memory channels), we get similar single core performance (~220
katom-step/s vs. ~230 katom-step/s) due to the similar clock and
architecture:

- 32000 atoms: 219.8 katom-step/s
- 64000 atoms: 222.5 katom-step/s
- 128000 atoms: 216.8 katom-step/s
- 256000 atoms: 221.0 katom-step/s
- 864000 atoms: 221.1 katom-step/s

Switching to an older Intel Xeon E5-2650 v4 CPU (12 cores, 12 threads,
base clock 2.2GHz, max. clock 2.9GHz, L1/L2/L3 cache 64kB/256kB/30MB, 4
DDR4-2400 memory channels) leads to a lower performance of approximately
109 katom-step/s due to differences in architecture and clock. In all
cases, when looking at multiple runs, the katom-step/s property
fluctuates by approximately 1% around the average.

From here on we look at the performance of the 256000 atom system only
and change several settings incrementally (a build and run sketch
follows the list):

#. No compiler optimization, GCC (``-Og -g``): 183.8 katom-step/s
#. Moderate optimization with debug info, GCC (``-O2 -g``): 231.1 katom-step/s
#. Full compiler optimization, GCC (``-DNDEBUG -O3``): 236.0 katom-step/s
#. Aggressive compiler optimization, GCC (``-O3 -ffast-math -march=native``): 239.9 katom-step/s
#. Source code optimization in the OPENMP package (1 thread): 266.7 katom-step/s
#. Use *fix nvt* instead of *fix npt* (compute virial only every 50 steps): 272.9 katom-step/s
#. Increase pair style cutoff by 2 :math:`\AA`: 181.2 katom-step/s
#. Use tight PPPM convergence (1.0e-6 instead of 1.0e-4): 161.9 katom-step/s
#. Use Ewald summation instead of PPPM (at 1.0e-4 convergence): 19.9 katom-step/s

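As an illustration of how such settings are selected (a sketch only;
the build directory layout, the ``lmp`` binary name, and the exact
flags are assumptions to be adapted to your environment), the compiler
flags from point 4 are chosen when configuring the build, while the
OPENMP package styles from point 5 are enabled at run time via suffix
flags:

.. code-block:: bash

   # configure a build with aggressive compiler flags and the OPENMP package,
   # assuming a fresh build directory inside the LAMMPS source tree
   cmake ../cmake -D CMAKE_CXX_FLAGS="-O3 -ffast-math -march=native" -D PKG_OPENMP=yes
   cmake --build .

   # run the 256000 atom system with OPENMP package styles and a single thread
   lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2 -sf omp -pk omp 1
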
The numbers show that the gains from aggressive compiler optimizations
are rather small in LAMMPS, while the data access optimizations in the
OPENMP (and OPT) packages are more visible. On the other hand, using
more accurate force field settings causes, not unexpectedly, a
significant slowdown. Finally, using regular Ewald summation causes a
massive slowdown due to its poor algorithmic scaling with system size.

Examples comparing parallel performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The parallel performance builds on top of the serial performance.
Using twice as many processors should increase the performance metric
by up to a factor of two. With the number of processors *N*, the
serial performance :math:`p_1`, and the performance with *N* processors
:math:`p_N`, we can define a *parallel efficiency* in percent as follows:

.. math::

   P_{eff} = \frac{p_N}{p_1 \cdot N} \cdot 100\%

For the AMD Ryzen Threadripper PRO 9985WX CPU and the serial
simulation settings of point 6 from above, we get the following
parallel efficiency data for the 256000 atom system:

- 1 MPI task: 273.6 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 530.6 katom-step/s, :math:`P_{eff} = 97\%`
- 4 MPI tasks: 1.021 Matom-step/s, :math:`P_{eff} = 93\%`
- 8 MPI tasks: 1.837 Matom-step/s, :math:`P_{eff} = 84\%`
- 16 MPI tasks: 3.574 Matom-step/s, :math:`P_{eff} = 82\%`
- 32 MPI tasks: 6.479 Matom-step/s, :math:`P_{eff} = 74\%`
- 64 MPI tasks: 9.032 Matom-step/s, :math:`P_{eff} = 52\%`
- 128 MPI tasks: 12.03 Matom-step/s, :math:`P_{eff} = 34\%`

The run with 128 MPI tasks also uses the additional hardware threads
provided by hyper-threading.

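As a worked example, take the 64-task entry of the list above and write
the serial rate of 273.6 katom-step/s as 0.2736 Matom-step/s:

.. math::

   P_{eff} = \frac{9.032}{0.2736 \cdot 64} \cdot 100\% \approx 52\%
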
For a small system with only 32000 atoms, the parallel efficiency
drops off earlier, since the amount of work per processor becomes too
small relative to the communication overhead:

- 1 MPI task: 270.8 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 529.3 katom-step/s, :math:`P_{eff} = 98\%`
- 4 MPI tasks: 989.8 katom-step/s, :math:`P_{eff} = 91\%`
- 8 MPI tasks: 1.832 Matom-step/s, :math:`P_{eff} = 85\%`
- 16 MPI tasks: 3.463 Matom-step/s, :math:`P_{eff} = 80\%`
- 32 MPI tasks: 5.970 Matom-step/s, :math:`P_{eff} = 69\%`
- 64 MPI tasks: 7.477 Matom-step/s, :math:`P_{eff} = 42\%`
- 128 MPI tasks: 8.069 Matom-step/s, :math:`P_{eff} = 23\%`

For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run.