Measuring performance
=====================

Factors that influence performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before trying to make your simulation run faster, you should understand
how it currently performs and where the bottlenecks are. We generally
distinguish between serial performance (how fast can a single process do
the calculations?) and parallel efficiency (how much faster does a
calculation get by using more processes?). There are many factors
affecting both, and the lists below discuss some commonly known as well
as some lesser known ones.

Factors affecting serial performance (in no specific order):

* CPU hardware: clock rate, cache sizes, CPU architecture (instructions
  per clock, vectorization support, fused multiply-add support, and more)
* RAM speed and the number of memory channels the CPU can use to access RAM
* Cooling: CPUs adjust their clock rate based on thermal load, so the
  quality of the cooling can affect the speed of a CPU. Sometimes even
  the temperature of neighboring compute nodes in a cluster can make a
  difference.
* Compiler optimization: most of LAMMPS is written to be easy to modify,
  so compiler optimization can speed up calculations. However, overly
  aggressive compiler optimization can produce incorrect results or
  crashes (during compilation or at runtime).
* Source code improvements: styles in the OPT, OPENMP, and INTEL packages
  can be faster than their base implementations due to improved data
  access patterns, cache efficiency, or vectorization. Compiler
  optimization is required to take full advantage of these.
* Number and kind of fixes, computes, or variables used during a
  simulation, especially if they trigger collective communication
  operations
* Pair style cutoffs and system density: calculations get slower the more
  neighbors are in the neighbor list and thus the more interactions need
  to be computed. Force fields with pair styles that compute interactions
  between triples or quadruples of atoms, or that use embedding energies
  or charge equilibration, need to walk the neighbor lists multiple times.
* Neighbor list settings: there is a tradeoff between the neighbor list
  skin (a larger skin means more neighbors and more distances to compute
  before applying the cutoff) and the frequency of neighbor list builds
  (a larger skin means fewer neighbor list builds); see the sketch after
  this list.
* Proximity of per-atom data in physical memory for atoms that are close
  in space improves cache efficiency (thus LAMMPS by default sorts atoms
  in local storage accordingly)
* Using r-RESPA multi-timestepping or a SHAKE or RATTLE fix to constrain
  bonds with high-frequency vibrations may allow a larger (outer)
  timestep and thus fewer force evaluations (usually the most
  time-consuming step in MD) for the same simulated time (with some
  tradeoff in accuracy).

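The neighbor list tradeoff, for example, can be explored with the
:doc:`neighbor <neighbor>` and :doc:`neigh_modify <neigh_modify>`
commands. A minimal sketch; the values shown are illustrative starting
points, not recommendations:

.. code-block::

   neighbor        2.0 bin                     # 2.0 distance units of skin, binned list builds
   neigh_modify    delay 10 every 1 check yes  # attempt rebuilds only after 10 steps, then only
                                               # when an atom has moved more than half the skin
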
Factors affecting parallel efficiency (in no specific order):

* Bandwidth and latency of communication between processes. This can
  vary a lot between processes on the same CPU or physical node and
  processes on different physical nodes, and it also varies between
  communication technologies (like Ethernet, InfiniBand, or other
  high-speed interconnects)
* Frequency and complexity of the required communication patterns
* Number of "work units" (usually correlated with the number of atoms
  and the choice of force field) per MPI process required for one time
  step (if this number becomes too small, the cost of communication
  becomes dominant)
* Choice of parallelization method (MPI-only, OpenMP-only, MPI+OpenMP,
  MPI+GPU, MPI+GPU+OpenMP)
* Algorithmic complexity of the chosen force field (pair-wise vs.
  many-body potential, Ewald vs. PPPM vs. (compensated or smoothed)
  cutoff-Coulomb)
* Communication cutoff: a larger cutoff results in more ghost atoms and
  thus more data that needs to be communicated
* Frequency of neighbor list builds: during a neighbor list build the
  domain decomposition is updated and the list of ghost atoms rebuilt,
  which requires multiple global communication steps
* FFT-grid settings and number of MPI processes for kspace style PPPM:
  PPPM uses parallel 3d FFTs whose parallel efficiency drops much faster
  with the number of MPI processes than that of other parts of the force
  computation. Thus using MPI+OpenMP parallelization or :doc:`run style
  verlet/split <run_style>` can improve parallel efficiency by limiting
  the number of MPI processes used for the FFTs.
* Load (im-)balance: LAMMPS' domain decomposition assumes that atoms are
  evenly distributed across the entire simulation box. If there are
  areas of vacuum, this may lead to different amounts of work for
  different MPI processes. Using the :doc:`processors command
  <processors>` to change the spatial decomposition, using MPI+OpenMP
  parallelization instead of MPI-only to get larger sub-domains, or
  using the :doc:`balance <balance>` or :doc:`fix balance <fix_balance>`
  commands (without or with switching to communication style *tiled*) to
  change the sub-domain volumes are all methods that can help to avoid
  load imbalance; see the sketch after this list.

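As an illustration of the last point, here is a minimal sketch of
load-balancing commands (the thresholds and the 4x2x1 process grid are
illustrative values, not recommendations):

.. code-block::

   processors      4 2 1                  # explicit 4x2x1 process grid
   balance         1.1 shift xy 10 1.1    # one-time static rebalancing of sub-domain boundaries

   # or rebalance dynamically during the run; the rcb bisection
   # method requires switching to the tiled communication style:
   comm_style      tiled
   fix             lb all balance 1000 1.1 rcb
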
Examples comparing serial performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before looking at your own input deck(s), you should gather reference
data from a known input, so that you know what kind of performance to
expect. For the following we therefore use the ``in.rhodo.scaled``
input file and the ``data.rhodo`` data file from the ``bench`` folder.
This is a system of 32000 atoms using the CHARMM force field and
long-range electrostatics, run for 100 MD steps. The performance data
is printed at the end of a run; it only measures the performance during
propagation and excludes the setup phase.

Running with a single MPI process on an AMD Ryzen Threadripper PRO
9985WX CPU (64 cores, 128 threads, base clock 3.2GHz, max. clock
5.4GHz, L1/L2/L3 cache 5MB/64MB/256MB, 8 DDR5-6400 memory channels) one
gets the following performance report:

.. code-block::

   Performance: 1.232 ns/day, 19.476 hours/ns, 7.131 timesteps/s, 228.197 katom-step/s
   99.2% CPU use with 1 MPI tasks x 1 OpenMP threads

The %CPU value should be at or very close to 100%. Lower values would
be an indication that *other* processes were also using the same CPU
core, which invalidates the performance data. The katom-step/s value is
best suited for comparisons, since it is fairly independent of the
system size. The ``in.rhodo.scaled`` input can easily be made larger
through replication in the three dimensions by setting the variables
"x", "y", and "z" to values other than 1 from the command line with the
``-var`` flag, as shown in the sketch below.

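For example, the following hypothetical invocation (assuming the LAMMPS
executable is called ``lmp``) doubles the box in each dimension,
yielding a 256000-atom system:

.. code-block::

   lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2

The measured rate remains nearly constant with increasing system size:
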
- 32000 atoms: 228.8 katom-step/s
- 64000 atoms: 231.6 katom-step/s
- 128000 atoms: 231.1 katom-step/s
- 256000 atoms: 226.4 katom-step/s
- 864000 atoms: 229.6 katom-step/s

Comparing to an AMD Ryzen 7 7840HS CPU (8 cores, 16 threads, base clock
3.8GHz, max. clock 5.1GHz, L1/L2/L3 cache 512kB/8MB/16MB, 2 DDR5-5600
memory channels), we get similar single-core performance (~220
katom-step/s vs. ~230 katom-step/s) due to the similar clock and
architecture:

- 32000 atoms: 219.8 katom-step/s
- 64000 atoms: 222.5 katom-step/s
- 128000 atoms: 216.8 katom-step/s
- 256000 atoms: 221.0 katom-step/s
- 864000 atoms: 221.1 katom-step/s

Switching to an older Intel Xeon E5-2650 v4 CPU (12 cores, 12 threads,
base clock 2.2GHz, max. clock 2.9GHz, L1/L2/L3 cache 64kB/256kB/30MB, 4
DDR4-2400 memory channels) leads to a lower performance of
approximately 109 katom-step/s due to differences in architecture and
clock. In all cases, when looking at multiple runs, the katom-step/s
value fluctuates by approximately 1% around the average.

From here on, we look at the performance of the 256000-atom system only
and change several settings incrementally:

#. No compiler optimization, GCC (``-Og -g``): 183.8 katom-step/s
#. Moderate optimization with debug info, GCC (``-O2 -g``): 231.1 katom-step/s
#. Full compiler optimization, GCC (``-DNDEBUG -O3``): 236.0 katom-step/s
#. Aggressive compiler optimization, GCC (``-O3 -ffast-math -march=native``): 239.9 katom-step/s
#. Source code optimization in the OPENMP package (1 thread): 266.7 katom-step/s
#. Use *fix nvt* instead of *fix npt* (compute the virial only every 50 steps): 272.9 katom-step/s
#. Increase the pair style cutoff by 2 :math:`\AA`: 181.2 katom-step/s
#. Use tight PPPM convergence (1.0e-6 instead of 1.0e-4): 161.9 katom-step/s
#. Use Ewald summation instead of PPPM (at 1.0e-4 convergence): 19.9 katom-step/s

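The compiler settings in points 1-4 are chosen when configuring the
build. A hypothetical CMake configuration for point 4 (assuming a build
directory next to the ``cmake`` folder of the LAMMPS source tree) and a
run command for point 5 (assuming the OPENMP package was included in
the build and the executable is called ``lmp``) might look like this:

.. code-block::

   # configure with aggressive optimization; -ffast-math may change
   # floating-point results, so validate against a reference run
   cmake -D CMAKE_BUILD_TYPE=Release \
         -D CMAKE_CXX_FLAGS="-O3 -ffast-math -march=native" \
         ../cmake

   # run with OPENMP package styles on a single thread (point 5)
   lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2 -sf omp -pk omp 1
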
The numbers show that the gains from aggressive compiler optimizations
are rather small in LAMMPS, while the data access optimizations in the
OPENMP (and OPT) packages are more prominent. On the other hand, using
more accurate force field settings causes, not unexpectedly, a
significant slowdown (to about half the speed). Finally, using regular
Ewald summation causes a massive slowdown due to its poor algorithmic
scaling with system size.

Examples comparing parallel performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Parallel performance builds on top of serial performance: using twice
as many processors should increase the performance metric by up to a
factor of two. With the number of processors *N*, the serial
performance :math:`p_1`, and the performance on *N* processors
:math:`p_N`, we can define a *parallel efficiency* in percent as
follows:

.. math::

   P_{eff} = \frac{p_N}{p_1 \cdot N} \cdot 100\%

For the AMD Ryzen Threadripper PRO 9985WX CPU and the serial simulation
settings of point 6 from above, we get the following parallel
efficiency data for the 256000-atom system:

- 1 MPI task: 273.6 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 530.6 katom-step/s, :math:`P_{eff} = 97\%`
- 4 MPI tasks: 1.021 Matom-step/s, :math:`P_{eff} = 93\%`
- 8 MPI tasks: 1.837 Matom-step/s, :math:`P_{eff} = 84\%`
- 16 MPI tasks: 3.574 Matom-step/s, :math:`P_{eff} = 82\%`
- 32 MPI tasks: 6.479 Matom-step/s, :math:`P_{eff} = 74\%`
- 64 MPI tasks: 9.032 Matom-step/s, :math:`P_{eff} = 52\%`
- 128 MPI tasks: 12.03 Matom-step/s, :math:`P_{eff} = 34\%`

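As a cross-check of the formula, the entry for 64 MPI tasks follows
from :math:`P_{eff} = \frac{9032}{273.6 \cdot 64} \cdot 100\% \approx
52\%`, with both rates expressed in katom-step/s.
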
The run with 128 MPI tasks uses CPU cores provided by hyper-threading.

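These runs can be reproduced with an invocation like the following
(hypothetical, assuming an MPI launcher called ``mpirun`` and a LAMMPS
executable called ``lmp``):

.. code-block::

   mpirun -np 16 lmp -in in.rhodo.scaled -var x 2 -var y 2 -var z 2
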
For a small system with only 32000 atoms, the parallel efficiency drops
off earlier, since the number of work units per MPI process becomes too
small relative to the communication overhead:

- 1 MPI task: 270.8 katom-step/s, :math:`P_{eff} = 100\%`
- 2 MPI tasks: 529.3 katom-step/s, :math:`P_{eff} = 98\%`
- 4 MPI tasks: 989.8 katom-step/s, :math:`P_{eff} = 91\%`
- 8 MPI tasks: 1.832 Matom-step/s, :math:`P_{eff} = 85\%`
- 16 MPI tasks: 3.463 Matom-step/s, :math:`P_{eff} = 80\%`
- 32 MPI tasks: 5.970 Matom-step/s, :math:`P_{eff} = 69\%`
- 64 MPI tasks: 7.477 Matom-step/s, :math:`P_{eff} = 42\%`
- 128 MPI tasks: 8.069 Matom-step/s, :math:`P_{eff} = 23\%`

Measuring performance of your input deck
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The best way to do this is to run your system (with its actual number
of atoms) for a modest number of timesteps (say 100 steps) on several
different processor counts, including a single processor if possible.
Do this for an equilibrated version of your system, so that the
100-step timings are representative of a much longer run. There is
typically no need to run for thousands of timesteps to get accurate
timings; you can simply extrapolate from short runs.

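A minimal sketch of such a scan (assuming a bash shell, an MPI launcher
called ``mpirun``, an executable called ``lmp``, and a hypothetical
input file ``in.myscript``):

.. code-block::

   # run the same short input on several processor counts,
   # writing a separate log file for each run
   for n in 1 2 4 8 16; do
       mpirun -np $n lmp -in in.myscript -log log.$n
   done
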
For the set of runs, look at the timing data printed to the screen and
log file at the end of each LAMMPS run. The :doc:`screen and logfile
output <Run_output>` page gives an overview.

Running on one (or a few) processors should give a good estimate of the
serial performance and of which portions of the timestep take the most
time. Running the same problem on a few different processor counts
should give an estimate of parallel scalability. That is, if the
simulation runs 16x faster on 16 processors, it is 100% parallel
efficient; if it runs 8x faster on 16 processors, it is 50% efficient.

The most important data to look at in the timing info are the timing
breakdown and its relative percentages. For example, trying different
options for speeding up the long-range solvers will have little impact
if they only consume 10% of the run time. If the pairwise time
dominates, you may want to look at the GPU or OPENMP versions of the
pair style, as discussed below. Comparing how the percentages change as
you increase the processor count gives you a sense of how the different
operations within the timestep are scaling. If you are using PPPM as
the KSpace solver, you can turn on an additional output with
:doc:`kspace_modify fftbench yes <kspace_modify>`, which measures the
time spent during PPPM on the 3d FFTs; these can be communication
intensive for larger processor counts. This provides an indication of
whether it is worth trying alternatives to the default FFT settings for
additional performance.

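A minimal input sketch enabling the FFT benchmark output (the
convergence tolerance shown is the one used by the benchmark input
above; adjust it to your own system):

.. code-block::

   kspace_style    pppm 1.0e-4
   kspace_modify   fftbench yes
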
Another important detail in the timing info are the histograms of atom
counts and neighbor counts. If these vary widely across processors, you
have a load imbalance issue. This often results in inaccurate relative
timing data, because processors have to wait, when communication
occurs, for other processors to catch up. Thus the reported times for
"Communication" or "Other" may be higher than they really are, due to
load imbalance. If this is an issue, you can use the :doc:`timer sync
<timer>` command to obtain synchronized timings.

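A minimal sketch (note that this adds synchronization barriers before
the individual timings and thus can itself slow down the run slightly):

.. code-block::

   timer           sync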