OpenMP Parallelism
^^^^^^^^^^^^^^^^^^

The styles in the INTEL, KOKKOS, and OPENMP packages support OpenMP
thread parallelism, which is predominantly used to distribute loops over
local data and thus follows a parallelization strategy orthogonal to the
decomposition into spatial domains used by the :doc:`MPI partitioning
<Developer_par_part>`. For clarity, this section discusses only the
implementation in the OPENMP package, as it is the simplest. The INTEL
and KOKKOS packages offer additional options and are more complex, since
they support more features and different hardware such as co-processors
or GPUs.

One of the key decisions when implementing the OPENMP package was to
keep the changes to the source code small, so that it would be easier to
maintain the code and keep it in sync with the non-threaded standard
implementation. This is achieved by a) making the OPENMP version a
class derived from the regular version (e.g. ``PairLJCutOMP`` from
``PairLJCut``) and overriding only those methods that are multi-threaded
or need to be modified to support multi-threading (similar to what was
done in the OPT package), b) keeping the structure of the modified code
very similar, so that side-by-side comparisons remain useful, and c)
offloading additional functionality and multi-thread support functions
into three separate classes: ``ThrOMP``, ``ThrData``, and ``FixOMP``.

``ThrOMP`` provides additional, multi-thread aware functionality not
available in the corresponding base class (e.g. ``Pair`` for
``PairLJCutOMP``), such as multi-thread aware variants of the "tally"
functions. These functions are made available through multiple
inheritance, so they must have unique names to avoid ambiguities;
typically ``_thr`` is appended to the name of the function. ``ThrData``
is a class that manages per-thread data structures. It is used instead
of extending the corresponding storage to per-thread arrays in order to
avoid slowdowns due to "false sharing", which occurs when multiple
threads update adjacent elements of an array and thus force CPU cache
lines to be invalidated and re-fetched. Finally, ``FixOMP`` manages the
"multi-thread state", such as settings and access to the per-thread
storage; it is activated by the :doc:`package omp <package>` command.
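
As an illustration of this layout, the following is a schematic sketch
(not the actual LAMMPS headers) of how an OPENMP pair style derives from
both its plain counterpart and ``ThrOMP``; the constructor and method
signatures, and the ``eval()`` helper, are simplified for illustration.

.. code-block:: c++

   // schematic sketch of the OPENMP class layout; signatures simplified
   #include "pair_lj_cut.h"   // plain, non-threaded pair style
   #include "thr_omp.h"       // multi-thread aware helper functionality

   namespace LAMMPS_NS {

   class PairLJCutOMP : public PairLJCut, public ThrOMP {
    public:
     PairLJCutOMP(class LAMMPS *lmp);

     // only methods that need multi-threading are overridden; all
     // other functionality is inherited from PairLJCut unchanged
     void compute(int eflag, int vflag) override;

    private:
     // per-thread worker that processes one chunk of the neighbor list
     // and tallies energy/virial through the "_thr" functions of ThrOMP
     void eval(int ifrom, int ito, ThrData *thr);
   };

   }

In this sketch the ``ThrData`` pointer stands for the access to that
thread's private storage and accumulators described below.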

Avoiding data races
"""""""""""""""""""

A key problem when implementing thread parallelism in an MD code is
avoiding data races when updating accumulated properties like forces,
energies, and stresses. Computed interactions always involve multiple
atoms, so race conditions arise when multiple threads want to update
per-atom data of the same atoms. Five possible strategies were
considered to avoid this:

1. Restructure the code so that no overlapping access is possible when
   computing in parallel, e.g. by breaking lists into multiple parts
   and synchronizing threads in between.
2. Have each thread be "responsible" for a specific group of atoms,
   compute the interactions redundantly (once on every thread that is
   responsible for one of the atoms involved), and have each thread
   update only the properties of its own atoms.
3. Use mutexes around functions and regions of code where the data race
   could happen.
4. Use atomic operations when updating per-atom properties (see the
   sketch after this list).
5. Use replicated per-thread data structures to accumulate data without
   conflicts and then use a reduction to combine those results into the
   data structures used by the regular style.
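
To make the trade-offs concrete, the following is a minimal,
self-contained sketch (not LAMMPS code) of what option 4 looks like for
a pairwise force update: every update becomes an atomic
read-modify-write, which avoids the race but serializes the hottest
part of the inner loop. The array layout and function name are
illustrative only.

.. code-block:: c++

   // option 4: protect every per-atom force update with an atomic
   // operation so that concurrent threads cannot produce a data race
   void add_pair_force(double (*f)[3], int i, int j,
                       double fx, double fy, double fz)
   {
     #pragma omp atomic
     f[i][0] += fx;
     #pragma omp atomic
     f[i][1] += fy;
     #pragma omp atomic
     f[i][2] += fz;

     // Newton's third law: subtract the same force from atom j
     #pragma omp atomic
     f[j][0] -= fx;
     #pragma omp atomic
     f[j][1] -= fy;
     #pragma omp atomic
     f[j][2] -= fz;
   }

A sketch of option 5, the approach actually chosen for the OPENMP
package, is shown further below together with the description of the
per-thread force arrays.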

Option 5 was chosen for the OPENMP package because it retains the
performance of the single-thread case and keeps the code more
maintainable. Option 1 would require extensive code changes,
particularly to the neighbor list code; option 2 would have incurred a
2x or larger performance penalty for the serial case; option 3 causes
significant overhead and would enforce serialization of operations in
inner loops and thus defeat the purpose of multi-threading; option 4
slows down the serial case as well, although not as badly as option 2.
The downside of option 5 is that the overhead of the reduction
operations grows with the number of threads used, so there is a
crossover point beyond which options 2 or 4 would result in faster
execution. This is why option 2, for example, is used in the GPU
package, since a GPU is a processor with a massive number of threads.
However, since the MPI parallelization is generally more effective for
typical MD systems, the expectation is that thread parallelism will only
be used with a smaller number of threads (2-8). At the time of the
implementation, that number was equivalent to the number of CPU cores
per CPU socket on high-end supercomputers.

Thus arrays like the force array are dimensioned to the number of atoms
times the number of threads when OpenMP support is enabled, and inside
the compute functions each thread obtains a pointer to a different chunk
of that array. Similarly, accumulators like the potential energy or the
virial are kept in per-thread instances of the ``ThrData`` class and are
only reduced and stored in their global counterparts at the end of the
force computation.
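
The following is a condensed sketch of this replicated-array scheme,
assuming a force array ``f`` that has been allocated with
``nthreads * nall`` rows of three doubles each; the variable and
function names are illustrative and not the actual LAMMPS identifiers.

.. code-block:: c++

   #include <omp.h>
   #include <cstring>

   void compute_forces(double (*f)[3], int nall, int nthreads)
   {
     #pragma omp parallel
     {
       const int tid = omp_get_thread_num();

       // each thread accumulates into its own chunk of the replicated
       // force array, so no locking or atomics are needed
       double (*fthr)[3] = f + tid * nall;
       std::memset(&fthr[0][0], 0, 3 * sizeof(double) * nall);

       // ... pair loop: accumulate forces for this thread's pairs
       //     into fthr[i][0..2] and fthr[j][0..2] ...

       // wait until all threads have finished their pair loops
       #pragma omp barrier

       // reduction: fold the extra chunks back into the first chunk;
       // the loop over atoms is itself distributed across the threads
       #pragma omp for schedule(static)
       for (int i = 0; i < nall; ++i)
         for (int t = 1; t < nthreads; ++t) {
           f[i][0] += f[t * nall + i][0];
           f[i][1] += f[t * nall + i][1];
           f[i][2] += f[t * nall + i][2];
         }
     }
   }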

Loop scheduling
"""""""""""""""

Multi-thread parallelization is applied by statically distributing
(outer) loops across threads. Typically, this is the loop over the local
atoms *i* when processing *i,j* pairs of atoms from a neighbor list. The
design of the neighbor list code results in atoms having a similar
number of neighbors for homogeneous systems, so load imbalances across
threads are uncommon; they typically occur for systems where the MPI
parallelization would also be unbalanced, which usually has a more
pronounced impact on performance. The same loop scheduling scheme can
also be applied to the reduction operations on per-atom data to reduce
the overhead of the reduction step.
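
A minimal sketch of this static partitioning of the loop over the
``inum`` entries of a neighbor list is shown below; the few lines that
compute the ``[ifrom, ito)`` range per thread stand in for the reusable
setup logic of the actual support classes, and the loop body is elided.

.. code-block:: c++

   #include <omp.h>

   void pair_loop(int inum)
   {
     #pragma omp parallel
     {
       const int tid = omp_get_thread_num();
       const int nthreads = omp_get_num_threads();

       // contiguous, statically assigned range of list entries
       const int idelta = 1 + inum / nthreads;
       const int ifrom = tid * idelta;
       const int ito = (ifrom + idelta > inum) ? inum : ifrom + idelta;

       for (int ii = ifrom; ii < ito; ++ii) {
         // i = ilist[ii]; loop over the neighbors of atom i and
         // accumulate forces into this thread's private chunk
       }
     }
   }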

Neighbor list parallelization
"""""""""""""""""""""""""""""

In addition to the force computations, the generation of the neighbor
lists is parallelized. As explained previously, neighbor lists are built
by looping over "owned" atoms and storing the neighbors in "pages". In
the OPENMP variants of the neighbor list code, each thread operates on a
different chunk of the "owned" atoms and allocates and fills its own set
of pages with neighbor list data. This is achieved by each thread
keeping its own instance of the :cpp:class:`MyPage <LAMMPS_NS::MyPage>`
page allocator class.
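
A schematic sketch of this scheme is shown below. It assumes the
documented interface of the page allocator, where ``vget()`` returns a
pointer to free space in the current page and ``vgot(n)`` marks ``n``
entries of it as used; the function name, the way the per-thread
``MyPage`` instances are passed in, and the omitted cutoff checks are
illustrative simplifications.

.. code-block:: c++

   #include "my_page.h"   // LAMMPS_NS::MyPage
   #include <omp.h>

   using LAMMPS_NS::MyPage;

   // "pages" holds one page allocator instance per thread
   void build_lists(int nlocal, MyPage<int> *pages)
   {
     #pragma omp parallel
     {
       const int tid = omp_get_thread_num();
       const int nthreads = omp_get_num_threads();

       MyPage<int> &ipage = pages[tid];   // thread-private pages
       ipage.reset();

       // each thread handles its own contiguous chunk of the "owned"
       // atoms, using the same static partitioning as the force kernels
       const int idelta = 1 + nlocal / nthreads;
       const int ifrom = tid * idelta;
       const int ito = (ifrom + idelta > nlocal) ? nlocal : ifrom + idelta;

       for (int i = ifrom; i < ito; ++i) {
         int n = 0;
         int *neighptr = ipage.vget();   // space in this thread's page
         // ... test candidate neighbors j of atom i and store them:
         //     neighptr[n++] = j; ...
         ipage.vgot(n);                  // commit n entries
       }
     }
   }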