<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />

<title>7.4.2. INTEL package &mdash; LAMMPS documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/css/theme.css" type="text/css" />
<link rel="stylesheet" href="_static/sphinx-design.min.css" type="text/css" />
<link rel="stylesheet" href="_static/css/lammps.css" type="text/css" />
<link rel="shortcut icon" href="_static/lammps.ico"/>
<link rel="canonical" href="https://docs.lammps.org/Speed_intel.html" />
<!--[if lt IE 9]>
<script src="_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="_static/jquery.js?v=5d32c60e"></script>
<script src="_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script src="_static/documentation_options.js?v=5929fcd5"></script>
<script src="_static/doctools.js?v=9bcbadda"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="_static/design-tabs.js?v=f930bc37"></script>
<script async="async" src="_static/mathjax/es5/tex-mml-chtml.js?v=cadf963e"></script>
<script src="_static/js/theme.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="7.4.3. KOKKOS package" href="Speed_kokkos.html" />
<link rel="prev" title="7.4.1. GPU package" href="Speed_gpu.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="Manual.html">
<img src="_static/lammps-logo.png" class="logo" alt="Logo"/>
</a>
<div class="lammps_version">Version: <b>19 Nov 2024</b></div>
<div class="lammps_release">git info: </div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">User Guide</span></p>
<ul class="current">
<li class="toctree-l1"><a class="reference internal" href="Intro.html">1. Introduction</a></li>
<li class="toctree-l1"><a class="reference internal" href="Install.html">2. Install LAMMPS</a></li>
<li class="toctree-l1"><a class="reference internal" href="Build.html">3. Build LAMMPS</a></li>
<li class="toctree-l1"><a class="reference internal" href="Run_head.html">4. Run LAMMPS</a></li>
<li class="toctree-l1"><a class="reference internal" href="Commands.html">5. Commands</a></li>
<li class="toctree-l1"><a class="reference internal" href="Packages.html">6. Optional packages</a></li>
<li class="toctree-l1 current"><a class="reference internal" href="Speed.html">7. Accelerate performance</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="Speed_bench.html">7.1. Benchmarks</a></li>
<li class="toctree-l2"><a class="reference internal" href="Speed_measure.html">7.2. Measuring performance</a></li>
<li class="toctree-l2"><a class="reference internal" href="Speed_tips.html">7.3. General tips</a></li>
<li class="toctree-l2 current"><a class="reference internal" href="Speed_packages.html">7.4. Accelerator packages</a><ul class="current">
<li class="toctree-l3"><a class="reference internal" href="Speed_gpu.html">7.4.1. GPU package</a></li>
<li class="toctree-l3 current"><a class="current reference internal" href="#">7.4.2. INTEL package</a></li>
<li class="toctree-l3"><a class="reference internal" href="Speed_kokkos.html">7.4.3. KOKKOS package</a></li>
<li class="toctree-l3"><a class="reference internal" href="Speed_omp.html">7.4.4. OPENMP package</a></li>
<li class="toctree-l3"><a class="reference internal" href="Speed_opt.html">7.4.5. OPT package</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="Speed_compare.html">7.5. Comparison of various accelerator packages</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Howto.html">8. Howto discussions</a></li>
<li class="toctree-l1"><a class="reference internal" href="Examples.html">9. Example scripts</a></li>
<li class="toctree-l1"><a class="reference internal" href="Tools.html">10. Auxiliary tools</a></li>
<li class="toctree-l1"><a class="reference internal" href="Errors.html">11. Errors</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Programmer Guide</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="Library.html">1. LAMMPS Library Interfaces</a></li>
<li class="toctree-l1"><a class="reference internal" href="Python_head.html">2. Use Python with LAMMPS</a></li>
<li class="toctree-l1"><a class="reference internal" href="Modify.html">3. Modifying &amp; extending LAMMPS</a></li>
<li class="toctree-l1"><a class="reference internal" href="Developer.html">4. Information for Developers</a></li>
</ul>
<p class="caption" role="heading"><span class="caption-text">Command Reference</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="commands_list.html">Commands</a></li>
<li class="toctree-l1"><a class="reference internal" href="fixes.html">Fix Styles</a></li>
<li class="toctree-l1"><a class="reference internal" href="computes.html">Compute Styles</a></li>
<li class="toctree-l1"><a class="reference internal" href="pairs.html">Pair Styles</a></li>
<li class="toctree-l1"><a class="reference internal" href="bonds.html">Bond Styles</a></li>
<li class="toctree-l1"><a class="reference internal" href="angles.html">Angle Styles</a></li>
<li class="toctree-l1"><a class="reference internal" href="dihedrals.html">Dihedral Styles</a></li>
<li class="toctree-l1"><a class="reference internal" href="impropers.html">Improper Styles</a></li>
<li class="toctree-l1"><a class="reference internal" href="dumps.html">Dump Styles</a></li>
<li class="toctree-l1"><a class="reference internal" href="fix_modify_atc_commands.html">fix_modify AtC commands</a></li>
<li class="toctree-l1"><a class="reference internal" href="Bibliography.html">Bibliography</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="Manual.html">LAMMPS</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content style-external-links">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="Manual.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item"><a href="Speed.html"><span class="section-number">7. </span>Accelerate performance</a></li>
<li class="breadcrumb-item"><a href="Speed_packages.html"><span class="section-number">7.4. </span>Accelerator packages</a></li>
<li class="breadcrumb-item active"><span class="section-number">7.4.2. </span>INTEL package</li>
<li class="wy-breadcrumbs-aside">
<a href="https://www.lammps.org"><img src="_static/lammps-logo.png" width="64" height="16" alt="LAMMPS Homepage"></a> | <a href="Commands_all.html">Commands</a>
</li>
</ul><div class="rst-breadcrumbs-buttons" role="navigation" aria-label="Sequential page navigation">
<a href="Speed_gpu.html" class="btn btn-neutral float-left" title="7.4.1. GPU package" accesskey="p"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="Speed_kokkos.html" class="btn btn-neutral float-right" title="7.4.3. KOKKOS package" accesskey="n">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<p><span class="math notranslate nohighlight">\(\renewcommand{\AA}{\text{Å}}\)</span></p>
<section id="intel-package">
<h1><span class="section-number">7.4.2. </span>INTEL package<a class="headerlink" href="#intel-package" title="Link to this heading"></a></h1>
<p>The INTEL package is maintained by Mike Brown at Intel Corporation. It provides two methods for accelerating simulations, depending on the hardware you have. The first is acceleration on Intel CPUs by running in single, mixed, or double precision with vectorization. The second is acceleration on Intel Xeon Phi co-processors via offloading neighbor list and non-bonded force calculations to the Phi. The same C++ code is used in both cases. When offloading to a co-processor from a CPU, the same routine is run twice, once on the CPU and once with an offload flag. This allows LAMMPS to run on the CPU cores and co-processor cores simultaneously.</p>
<section id="currently-available-intel-styles">
<h2>Currently Available INTEL Styles<a class="headerlink" href="#currently-available-intel-styles" title="Link to this heading"></a></h2>
<ul class="simple">
<li><p>Angle Styles: charmm, harmonic</p></li>
<li><p>Bond Styles: fene, harmonic</p></li>
<li><p>Dihedral Styles: charmm, fourier, harmonic, opls</p></li>
<li><p>Fixes: nve, npt, nvt, nvt/sllod, nve/asphere, electrode/conp, electrode/conq, electrode/thermo</p></li>
<li><p>Improper Styles: cvff, harmonic</p></li>
<li><p>Pair Styles: airebo, airebo/morse, buck/coul/cut, buck/coul/long, buck, dpd, eam, eam/alloy, eam/fs, gayberne, lj/charmm/coul/charmm, lj/charmm/coul/long, lj/cut, lj/cut/coul/long, lj/long/coul/long, rebo, snap, sw, tersoff</p></li>
<li><p>K-Space Styles: pppm, pppm/disp, pppm/electrode</p></li>
</ul>
<div class="admonition warning">
<p class="admonition-title">Warning</p>
<p>None of the styles in the INTEL package currently support computing per-atom stress. If any compute or fix in your input requires it, LAMMPS will abort with an error message.</p>
</div>
</section>
<section id="speed-up-to-expect">
<h2>Speed-up to expect<a class="headerlink" href="#speed-up-to-expect" title="Link to this heading"></a></h2>
<p>The speedup will depend on your simulation, the hardware, which styles are used, the number of atoms, and the floating-point precision mode. Performance improvements are shown compared to LAMMPS <em>without using other acceleration packages</em>, as those are under active development (and subject to performance changes). The measurements were performed using the input files available in the <code class="docutils literal notranslate"><span class="pre">src/INTEL/TEST</span></code> directory with the provided run script. These are scalable in size; the results given are with 512K particles (524K for Liquid Crystal). Most of the simulations are standard LAMMPS benchmarks (indicated by the filename extension in parentheses) with modifications to the run length and to add a warm-up run (for use with offload benchmarks).</p>
<img alt="_images/user_intel.png" class="align-center" src="_images/user_intel.png" />
<p>Results are speedups obtained on Intel Xeon E5-2697v4 processors (code-named Broadwell), Intel Xeon Phi 7250 processors (code-named Knights Landing), and Intel Xeon Gold 6148 processors (code-named Skylake) with "June 2017" LAMMPS built with Intel Parallel Studio 2017 update 2. Results are with 1 MPI task per physical core. See <code class="docutils literal notranslate"><span class="pre">src/INTEL/TEST/README</span></code> for the raw simulation rates and instructions to reproduce.</p>
</section>
<hr class="docutils" />
<section id="accuracy-and-order-of-operations">
<h2>Accuracy and order of operations<a class="headerlink" href="#accuracy-and-order-of-operations" title="Link to this heading"></a></h2>
<p>In most molecular dynamics software, parallelization parameters (# of MPI tasks, OpenMP threads, and vectorization) can change the results due to changing the order of operations with finite-precision calculations. The INTEL package is deterministic. This means that the results should be reproducible from run to run with the <em>same</em> parallel configurations and when using deterministic libraries or library settings (MPI, OpenMP, FFT). However, there are differences in the INTEL package that can change the order of operations compared to LAMMPS without acceleration:</p>
<ul class="simple">
<li><p>Neighbor lists can be created in a different order</p></li>
<li><p>Bins used for sorting atoms can be oriented differently</p></li>
<li><p>The default stencil order for PPPM is 7. By default, LAMMPS will calculate other PPPM parameters to fit the desired accuracy with this order</p></li>
<li><p>The <em>newton</em> setting applies to all atoms, not just atoms shared between MPI tasks</p></li>
<li><p>Vectorization can change the order for adding pairwise forces</p></li>
<li><p>When using the <code class="docutils literal notranslate"><span class="pre">-DLMP_USE_MKL_RNG</span></code> define at build time (all included Intel-optimized makefiles do), the random number generator for dissipative particle dynamics (<code class="docutils literal notranslate"><span class="pre">pair</span> <span class="pre">style</span> <span class="pre">dpd/intel</span></code>) uses the Mersenne Twister generator included in the Intel MKL library (which should be more robust than the default Marsaglia random number generator)</p></li>
</ul>
<p>The precision mode (described below) used with the INTEL package can change the <em>accuracy</em> of the calculations. For the default <em>mixed</em> precision option, calculations between pairs or triplets of atoms are performed in single precision, intended to be within the inherent error of MD simulations. All accumulation is performed in double precision to prevent the error from growing with the number of atoms in the simulation. <em>Single</em> precision mode should not be used without appropriate validation.</p>
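<p>For example, when validating a new system it can be useful to run once in double precision and compare against the default mixed mode; the binary and input file names below are placeholders, and the <em>mode</em> keyword is set via the package command:</p>

```shell
# Run entirely in double precision for validation
# ("lmp_intel_cpu_intelmpi" and "in.script" are example names)
mpirun -np 36 lmp_intel_cpu_intelmpi -pk intel 0 mode double -sf intel -in in.script
```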
</section>
<hr class="docutils" />
<section id="quick-start-for-experienced-users">
<h2>Quick Start for Experienced Users<a class="headerlink" href="#quick-start-for-experienced-users" title="Link to this heading"></a></h2>
<p>LAMMPS should be built with the INTEL package installed. Simulations should be run with 1 MPI task per physical <em>core</em>, not <em>hardware thread</em>.</p>
<ul class="simple">
<li><p>Edit <code class="docutils literal notranslate"><span class="pre">src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi</span></code> as necessary.</p></li>
<li><p>Set the environment variable <code class="docutils literal notranslate"><span class="pre">KMP_BLOCKTIME=0</span></code></p></li>
<li><p>Add <code class="docutils literal notranslate"><span class="pre">-pk</span> <span class="pre">intel</span> <span class="pre">0</span> <span class="pre">omp</span> <span class="pre">$t</span> <span class="pre">-sf</span> <span class="pre">intel</span></code> to the LAMMPS command-line</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">$t</span></code> should be 2 for Intel Xeon CPUs and 2 or 4 for Intel Xeon Phi</p></li>
<li><p>For some of the simple 2-body potentials without long-range electrostatics, performance and scalability can be better with the <code class="docutils literal notranslate"><span class="pre">newton</span> <span class="pre">off</span></code> setting added to the input script</p></li>
<li><p>For simulations on higher node counts, add <code class="docutils literal notranslate"><span class="pre">processors</span> <span class="pre">*</span> <span class="pre">*</span> <span class="pre">*</span> <span class="pre">grid</span> <span class="pre">numa</span></code> to the beginning of the input script for better scalability</p></li>
<li><p>If using <code class="docutils literal notranslate"><span class="pre">kspace_style</span> <span class="pre">pppm</span></code> in the input script, add <code class="docutils literal notranslate"><span class="pre">kspace_modify</span> <span class="pre">diff</span> <span class="pre">ad</span></code> for better performance</p></li>
</ul>
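<p>Putting the items above together, a typical CPU-only launch might look like the following sketch; the core count, binary name, and input file name are illustrative and should be adapted to your machine:</p>

```shell
# 36 physical cores, 2 OpenMP threads per core via SMT ($t = 2)
export KMP_BLOCKTIME=0
export OMP_NUM_THREADS=2
mpirun -np 36 lmp_intel_cpu_intelmpi -pk intel 0 omp 2 -sf intel -in in.script
```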
<p>For Intel Xeon Phi CPUs:</p>
<ul class="simple">
<li><p>Runs should be performed using MCDRAM.</p></li>
</ul>
<p>For simulations using <code class="docutils literal notranslate"><span class="pre">kspace_style</span> <span class="pre">pppm</span></code> on Intel CPUs supporting AVX-512:</p>
<ul class="simple">
<li><p>Add <code class="docutils literal notranslate"><span class="pre">kspace_modify</span> <span class="pre">diff</span> <span class="pre">ad</span></code> to the input script</p></li>
<li><p>Change the command-line option to <code class="docutils literal notranslate"><span class="pre">-pk</span> <span class="pre">intel</span> <span class="pre">0</span> <span class="pre">omp</span> <span class="pre">$r</span> <span class="pre">lrt</span> <span class="pre">yes</span> <span class="pre">-sf</span> <span class="pre">intel</span></code>, where <code class="docutils literal notranslate"><span class="pre">$r</span></code> is the number of threads minus 1.</p></li>
<li><p>Do not use thread affinity (set <code class="docutils literal notranslate"><span class="pre">KMP_AFFINITY=none</span></code>)</p></li>
<li><p>The <code class="docutils literal notranslate"><span class="pre">newton</span> <span class="pre">off</span></code> setting may provide better scalability</p></li>
</ul>
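<p>As a sketch of the LRT recipe above: on a CPU with 2 hardware threads per core, <code class="docutils literal notranslate"><span class="pre">$r</span></code> is 1 (threads minus 1), so a launch could look like the following, with the binary and input file names as placeholders:</p>

```shell
# LRT mode: one compute thread per core plus the reserved LRT thread
export KMP_AFFINITY=none
mpirun -np 36 lmp_intel_cpu_intelmpi -pk intel 0 omp 1 lrt yes -sf intel -in in.script
```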
<p>For Intel Xeon Phi co-processors (Offload):</p>
<ul class="simple">
<li><p>Edit <code class="docutils literal notranslate"><span class="pre">src/MAKE/OPTIONS/Makefile.intel_coprocessor</span></code> as necessary</p></li>
<li><p>Add <code class="docutils literal notranslate"><span class="pre">-pk</span> <span class="pre">intel</span> <span class="pre">N</span> <span class="pre">omp</span> <span class="pre">1</span></code> to the command-line, where <code class="docutils literal notranslate"><span class="pre">N</span></code> is the number of co-processors per node.</p></li>
</ul>
</section>
<hr class="docutils" />
<section id="required-hardware-software">
<h2>Required hardware/software<a class="headerlink" href="#required-hardware-software" title="Link to this heading"></a></h2>
<p>When using Intel compilers, version 16.0 or later is required.</p>
<p>In order to use offload to co-processors, an Intel Xeon Phi co-processor and an Intel compiler are required.</p>
<p>Although any compiler can be used with the INTEL package, vectorization directives are currently disabled by default when not using Intel compilers, due to lack of standard support and observations of decreased performance. The OpenMP standard now supports directives for vectorization, and we plan to transition the code to this standard once it is available in most compilers. We expect this to allow improved performance and support with other compilers.</p>
<p>For Intel Xeon Phi x200 series processors (code-named Knights Landing), there are multiple configuration options for the hardware. For best performance, we recommend that the MCDRAM is configured in "Flat" mode and with the cluster mode set to "Quadrant" or "SNC4". "Cache" mode can also be used, although the performance might be slightly lower.</p>
</section>
<section id="notes-about-simultaneous-multithreading">
<h2>Notes about Simultaneous Multithreading<a class="headerlink" href="#notes-about-simultaneous-multithreading" title="Link to this heading"></a></h2>
<p>Modern CPUs often support Simultaneous Multithreading (SMT). On Intel processors, this is called Hyper-Threading (HT) technology. SMT is hardware support for running multiple threads efficiently on a single core. <em>Hardware threads</em> or <em>logical cores</em> are often used to refer to the number of threads that are supported in hardware. For example, the Intel Xeon E5-2697v4 processor is described as having 36 cores and 72 threads. This means that 36 MPI processes or OpenMP threads can run simultaneously on separate cores, but that up to 72 MPI processes or OpenMP threads can be running on the CPU without costly operating system context switches.</p>
<p>Molecular dynamics simulations will often run faster when making use of SMT. If a thread becomes stalled, for example because it is waiting on data that has not yet arrived from memory, another thread can start running so that the CPU pipeline is still being used efficiently. Although benefits can be seen by launching an MPI task for every hardware thread, for multinode simulations we recommend that OpenMP threads are used for SMT instead, either with the INTEL package, <a class="reference internal" href="Speed_omp.html"><span class="doc">OPENMP package</span></a>, or <a class="reference internal" href="Speed_kokkos.html"><span class="doc">KOKKOS package</span></a>. In the example above, up to 36X speedups can be observed by using all 36 physical cores with LAMMPS. By using all 72 hardware threads, an additional 10-30% performance gain can be achieved.</p>
<p>The BIOS on many platforms allows SMT to be disabled; however, we do not recommend this on modern processors, as there is little to no benefit for any software package in most cases. The operating system will report every hardware thread as a separate core, allowing one to determine the number of hardware threads available. On Linux systems, this information can normally be obtained with:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>cat<span class="w"> </span>/proc/cpuinfo
</pre></div>
</div>
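<p>A more compact summary than <code class="docutils literal notranslate"><span class="pre">/proc/cpuinfo</span></code> can usually be obtained with standard utilities such as <code class="docutils literal notranslate"><span class="pre">nproc</span></code> and <code class="docutils literal notranslate"><span class="pre">lscpu</span></code> (available on most Linux distributions):</p>

```shell
# Total hardware threads visible to the OS
nproc
# Core/thread topology: threads per core, cores per socket, sockets
lscpu | grep -E 'Thread|Core|Socket'
```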
</section>
<section id="building-lammps-with-the-intel-package">
<h2>Building LAMMPS with the INTEL package<a class="headerlink" href="#building-lammps-with-the-intel-package" title="Link to this heading"></a></h2>
<p>See the <a class="reference internal" href="Build_extras.html#intel"><span class="std std-ref">Build extras</span></a> page for instructions. Some additional details are covered here.</p>
<p>For building with make, several example Makefiles for building with the Intel compiler are included with LAMMPS in the <code class="docutils literal notranslate"><span class="pre">src/MAKE/OPTIONS/</span></code> directory:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>Makefile.intel_cpu_intelmpi<span class="w"> </span><span class="c1"># Intel Compiler, Intel MPI, No Offload</span>
Makefile.knl<span class="w"> </span><span class="c1"># Intel Compiler, Intel MPI, No Offload</span>
Makefile.intel_cpu_mpich<span class="w"> </span><span class="c1"># Intel Compiler, MPICH, No Offload</span>
Makefile.intel_cpu_openmpi<span class="w"> </span><span class="c1"># Intel Compiler, OpenMPI, No Offload</span>
Makefile.intel_coprocessor<span class="w"> </span><span class="c1"># Intel Compiler, Intel MPI, Offload</span>
</pre></div>
</div>
<p>Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that it explicitly specifies that vectorization should be for Intel Xeon Phi x200 processors, making it easier to cross-compile. For users with recent installations of Intel Parallel Studio, the process can be as simple as:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>make<span class="w"> </span>yes-intel
<span class="nb">source</span><span class="w"> </span>/opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh
<span class="c1"># or psxevars.csh for C-shell</span>
make<span class="w"> </span>intel_cpu_intelmpi
</pre></div>
</div>
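<p>LAMMPS can also be built with CMake; a minimal sketch is shown below, assuming an Intel compiler is on the PATH (the compiler name is an example, and the Build extras page linked above remains the authoritative reference for the available options):</p>

```shell
# CMake alternative to the traditional make build (a sketch, not a full recipe)
mkdir build && cd build
cmake ../cmake -D PKG_INTEL=yes -D CMAKE_CXX_COMPILER=icpc
make
```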
<p>Note that if you build with support for a Phi co-processor, the same binary can be used on nodes with or without co-processors installed. However, if you do not have co-processors on your system, building without offload support will produce a smaller binary.</p>
<p>The general requirements for Makefiles with the INTEL package are as follows. When using Intel compilers, <code class="docutils literal notranslate"><span class="pre">-restrict</span></code> is required and <code class="docutils literal notranslate"><span class="pre">-qopenmp</span></code> is highly recommended for <code class="docutils literal notranslate"><span class="pre">CCFLAGS</span></code> and <code class="docutils literal notranslate"><span class="pre">LINKFLAGS</span></code>. <code class="docutils literal notranslate"><span class="pre">CCFLAGS</span></code> should include <code class="docutils literal notranslate"><span class="pre">-DLMP_INTEL_USELRT</span></code> (unless POSIX Threads are not supported in the build environment) and <code class="docutils literal notranslate"><span class="pre">-DLMP_USE_MKL_RNG</span></code> (unless the Intel Math Kernel Library (MKL) is not available in the build environment). For Intel compilers, <code class="docutils literal notranslate"><span class="pre">LIB</span></code> should include <code class="docutils literal notranslate"><span class="pre">-ltbbmalloc</span></code>, or if the library is not available, <code class="docutils literal notranslate"><span class="pre">-DLMP_INTEL_NO_TBB</span></code> can be added to <code class="docutils literal notranslate"><span class="pre">CCFLAGS</span></code>. For builds supporting offload, <code class="docutils literal notranslate"><span class="pre">-DLMP_INTEL_OFFLOAD</span></code> is required for <code class="docutils literal notranslate"><span class="pre">CCFLAGS</span></code> and <code class="docutils literal notranslate"><span class="pre">-qoffload</span></code> is required for <code class="docutils literal notranslate"><span class="pre">LINKFLAGS</span></code>. Other recommended <code class="docutils literal notranslate"><span class="pre">CCFLAGS</span></code> options for best performance are <code class="docutils literal notranslate"><span class="pre">-O2</span> <span class="pre">-fno-alias</span> <span class="pre">-ansi-alias</span> <span class="pre">-qoverride-limits</span> <span class="pre">-fp-model</span> <span class="pre">fast=2</span> <span class="pre">-no-prec-div</span></code>.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>See the <code class="docutils literal notranslate"><span class="pre">src/INTEL/README</span></code> file for additional flags that might be needed for best performance on Intel server processors code-named "Skylake".</p>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The vectorization and math capabilities can differ depending on the CPU. For Intel compilers, the <code class="docutils literal notranslate"><span class="pre">-x</span></code> flag specifies the type of processor for which to optimize. <code class="docutils literal notranslate"><span class="pre">-xHost</span></code> specifies that the compiler should build for the processor used for compiling. For Intel Xeon Phi x200 series processors, this option is <code class="docutils literal notranslate"><span class="pre">-xMIC-AVX512</span></code>. For fourth generation Intel Xeon (v4/Broadwell) processors, <code class="docutils literal notranslate"><span class="pre">-xCORE-AVX2</span></code> should be used. For older Intel Xeon processors, <code class="docutils literal notranslate"><span class="pre">-xAVX</span></code> will perform best in general for the different simulations in LAMMPS. The default in most of the example Makefiles is to use <code class="docutils literal notranslate"><span class="pre">-xHost</span></code>; however, this should not be used when cross-compiling.</p>
</div>
</section>
<section id="running-lammps-with-the-intel-package">
<h2>Running LAMMPS with the INTEL package<a class="headerlink" href="#running-lammps-with-the-intel-package" title="Link to this heading"></a></h2>
<p>Running LAMMPS with the INTEL package is similar to normal use, with the exceptions that one should 1) specify that LAMMPS should use the INTEL package, 2) specify the number of OpenMP threads, and 3) optionally specify the specific LAMMPS styles that should use the INTEL package. 1) and 2) can be performed from the command-line or by editing the input script. 3) requires editing the input script. Advanced performance tuning options are also described below to get the best performance.</p>
<p>When running on a single node (including runs using offload to a co-processor), best performance is normally obtained by using 1 MPI task per physical core and additional OpenMP threads with SMT. For Intel Xeon processors, 2 OpenMP threads should be used for SMT. For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used (the best choice depends on the simulation). In cases where the user specifies that LRT mode is used (described below), 1 or 3 OpenMP threads should be used instead. For multi-node runs, using 1 MPI task per physical core will often perform best; however, depending on the machine and scale, users might get better performance by decreasing the number of MPI tasks and using more OpenMP threads. For performance, the product of the number of MPI tasks and OpenMP threads should not exceed the number of available hardware threads in almost all cases.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Setting core affinity is often used to pin MPI tasks and OpenMP threads to a core or group of cores so that memory access can be uniform. Unless disabled at build time, affinity for MPI tasks and OpenMP threads on the host (CPU) will be set by default on the host <em>when using offload to a co-processor</em>. In this case, it is unnecessary to use other methods to control affinity (e.g. <code class="docutils literal notranslate"><span class="pre">taskset</span></code>, <code class="docutils literal notranslate"><span class="pre">numactl</span></code>, <code class="docutils literal notranslate"><span class="pre">I_MPI_PIN_DOMAIN</span></code>, etc.). This can be disabled with the <em>no_affinity</em> option to the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command or by disabling the option at build time (by adding <code class="docutils literal notranslate"><span class="pre">-DINTEL_OFFLOAD_NOAFFINITY</span></code> to the <code class="docutils literal notranslate"><span class="pre">CCFLAGS</span></code> line of your Makefile). Disabling this option is not recommended, especially when running on a machine with Intel Hyper-Threading technology disabled.</p>
</div>
|
|
</section>
<section id="run-with-the-intel-package-from-the-command-line">
<h2>Run with the INTEL package from the command-line<a class="headerlink" href="#run-with-the-intel-package-from-the-command-line" title="Link to this heading"></a></h2>
<p>To enable INTEL optimizations for all available styles used in the input
script, the <code class="docutils literal notranslate"><span class="pre">-sf</span> <span class="pre">intel</span></code> <a class="reference internal" href="Run_options.html"><span class="doc">command-line switch</span></a> can
be used without any requirement for editing the input script. This
switch will automatically append &#8220;intel&#8221; to styles that support it. It
also invokes a default command: <a class="reference internal" href="package.html"><span class="doc">package intel 1</span></a>. This
package command is used to set options for the INTEL package. The
default package command specifies that INTEL calculations are
performed in mixed precision, that the number of OpenMP threads is
given by the OMP_NUM_THREADS environment variable, and that, if
co-processors are present and the binary was built with offload support,
1 co-processor per node will be used with automatic balancing of
work between the CPU and the co-processor.</p>
<p>You can specify different options for the INTEL package by using
the <code class="docutils literal notranslate"><span class="pre">-pk</span> <span class="pre">intel</span> <span class="pre">Nphi</span></code> <a class="reference internal" href="Run_options.html"><span class="doc">command-line switch</span></a> with
keyword/value pairs as specified in the documentation. Here, <code class="docutils literal notranslate"><span class="pre">Nphi</span></code> is the
number of Xeon Phi co-processors per node (ignored without offload
support). Common options to the INTEL package include <em>omp</em> to
override any <code class="docutils literal notranslate"><span class="pre">OMP_NUM_THREADS</span></code> setting and specify the number of OpenMP
threads, <em>mode</em> to set the floating-point precision mode, and <em>lrt</em> to
enable Long-Range Thread mode as described below. See the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command for details, including the default values
used for all its options if not specified, and how to set the number
of OpenMP threads via the <code class="docutils literal notranslate"><span class="pre">OMP_NUM_THREADS</span></code> environment variable if
desired.</p>
<p>Examples (see the documentation for your MPI library and machine for differences in
launching MPI applications):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads</span>
mpirun<span class="w"> </span>-np<span class="w"> </span><span class="m">72</span><span class="w"> </span>-ppn<span class="w"> </span><span class="m">36</span><span class="w"> </span>lmp_machine<span class="w"> </span>-sf<span class="w"> </span>intel<span class="w"> </span>-in<span class="w"> </span><span class="k">in</span>.script

<span class="c1"># Don't use any co-processors that might be available,</span>
<span class="c1"># use 2 OpenMP threads for each task, use double precision</span>
mpirun<span class="w"> </span>-np<span class="w"> </span><span class="m">72</span><span class="w"> </span>-ppn<span class="w"> </span><span class="m">36</span><span class="w"> </span>lmp_machine<span class="w"> </span>-sf<span class="w"> </span>intel<span class="w"> </span>-in<span class="w"> </span><span class="k">in</span>.script<span class="w"> </span><span class="se">\</span>
<span class="w">    </span>-pk<span class="w"> </span>intel<span class="w"> </span><span class="m">0</span><span class="w"> </span>omp<span class="w"> </span><span class="m">2</span><span class="w"> </span>mode<span class="w"> </span>double
</pre></div>
</div>
</section>
<section id="or-run-with-the-intel-package-by-editing-an-input-script">
<h2>Or run with the INTEL package by editing an input script<a class="headerlink" href="#or-run-with-the-intel-package-by-editing-an-input-script" title="Link to this heading"></a></h2>
<p>As an alternative to adding command-line arguments, the input script
can be edited to enable the INTEL package. This requires adding
the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command to the top of the input
script. For the second example above, this would be:</p>
<div class="highlight-LAMMPS notranslate"><div class="highlight"><pre><span></span><span class="k">package</span><span class="w"> </span><span class="n">intel</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="n">omp</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="n">mode</span><span class="w"> </span><span class="n">double</span>
</pre></div>
</div>
<p>To enable the INTEL package only for individual styles, you can
add an &#8220;intel&#8221; suffix to the individual style, e.g.:</p>
<div class="highlight-LAMMPS notranslate"><div class="highlight"><pre><span></span><span class="k">pair_style</span><span class="w"> </span><span class="n">lj</span><span class="o">/</span><span class="n">cut</span><span class="o">/</span><span class="n">intel</span><span class="w"> </span><span class="m">2.5</span>
</pre></div>
</div>
<p>Alternatively, the <a class="reference internal" href="suffix.html"><span class="doc">suffix intel</span></a> command can be added to
the input script to enable INTEL styles for the commands that
follow in the input script.</p>
</section>
<section id="tuning-for-performance">
<h2>Tuning for Performance<a class="headerlink" href="#tuning-for-performance" title="Link to this heading"></a></h2>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The INTEL package will perform better with a modification
to the input script when <a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> is used:
add <a class="reference internal" href="kspace_modify.html"><span class="doc">kspace_modify diff ad</span></a> to the
input script.</p>
</div>
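<p>For example, an input script using PPPM might include the following near the top (a sketch; the accuracy value is illustrative only):</p>
<div class="highlight-LAMMPS notranslate"><div class="highlight"><pre><span></span>kspace_style    pppm 1e-4
kspace_modify   diff ad
</pre></div>
</div>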
<p>Long-Range Thread (LRT) mode is an option to the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command that can improve performance when using
<a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> for long-range electrostatics on processors
with SMT. It generates an extra pthread for each MPI task. The thread
is dedicated to performing some of the PPPM calculations and MPI
communications. This feature requires setting the pre-processor flag
<code class="docutils literal notranslate"><span class="pre">-DLMP_INTEL_USELRT</span></code> in the makefile when compiling LAMMPS. It is unset
in the default makefiles (<code class="docutils literal notranslate"><span class="pre">Makefile.mpi</span></code> and <code class="docutils literal notranslate"><span class="pre">Makefile.serial</span></code>) but
it is set in all makefiles tuned for the INTEL package. On Intel
Xeon Phi x200 series CPUs, the LRT feature will likely improve
performance, even on a single node. On Intel Xeon processors, using
this mode might result in better performance when using multiple nodes,
depending on the specific machine configuration. To enable LRT mode,
specify that the number of OpenMP threads is one less than would
normally be used for the run and add the <code class="docutils literal notranslate"><span class="pre">lrt</span> <span class="pre">yes</span></code> option to the <code class="docutils literal notranslate"><span class="pre">-pk</span></code>
command-line switch or &#8220;package intel&#8221; command. For example, if a run
would normally perform best with &#8220;-pk intel 0 omp 4&#8221;, instead use
<code class="docutils literal notranslate"><span class="pre">-pk</span> <span class="pre">intel</span> <span class="pre">0</span> <span class="pre">omp</span> <span class="pre">3</span> <span class="pre">lrt</span> <span class="pre">yes</span></code>. When using LRT, you should set the
environment variable <code class="docutils literal notranslate"><span class="pre">KMP_AFFINITY=none</span></code>. LRT mode is not supported
when using offload.</p>
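<p>Putting this together, a launch that would otherwise use 4 OpenMP threads per task could enable LRT as follows (a sketch; the binary name, task count, and input script are placeholders):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>export KMP_AFFINITY=none
mpirun -np 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 3 lrt yes
</pre></div>
</div>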
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Changing the <a class="reference internal" href="newton.html"><span class="doc">newton</span></a> setting to off can improve
performance and/or scalability for simple 2-body potentials such as
lj/cut or when using LRT mode on processors supporting AVX-512.</p>
</div>
<p>Not all styles are supported in the INTEL package. You can mix
the INTEL package with styles from the <a class="reference internal" href="Speed_opt.html"><span class="doc">OPT</span></a>
package or the <a class="reference internal" href="Speed_omp.html"><span class="doc">OPENMP package</span></a>. Of course, this
requires that these packages were installed at build time. This can be
done automatically by using the <code class="docutils literal notranslate"><span class="pre">-sf</span> <span class="pre">hybrid</span> <span class="pre">intel</span> <span class="pre">opt</span></code> or <code class="docutils literal notranslate"><span class="pre">-sf</span> <span class="pre">hybrid</span>
<span class="pre">intel</span> <span class="pre">omp</span></code> command-line options. Alternatively, the &#8220;opt&#8221; and &#8220;omp&#8221;
suffixes can be appended manually in the input script. For the latter,
the <a class="reference internal" href="package.html"><span class="doc">package omp</span></a> command must be in the input script or
the <code class="docutils literal notranslate"><span class="pre">-pk</span> <span class="pre">omp</span> <span class="pre">Nt</span></code> <a class="reference internal" href="Run_options.html"><span class="doc">command-line switch</span></a> must be used,
where <code class="docutils literal notranslate"><span class="pre">Nt</span></code> is the number of OpenMP threads. The number of OpenMP threads
should not be set differently for the different packages. Note that
the <a class="reference internal" href="suffix.html"><span class="doc">suffix hybrid intel omp</span></a> command can also be used
within the input script to automatically append the &#8220;omp&#8221; suffix to
styles when INTEL styles are not available.</p>
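<p>As an illustration, the following launch would use INTEL styles where available and fall back to OPENMP styles otherwise (a sketch; the binary and script names are placeholders, and both packages must have been installed at build time):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># 2 OpenMP threads for both the INTEL and OPENMP styles
mpirun -np 36 lmp_machine -sf hybrid intel omp -pk intel 0 omp 2 -pk omp 2 -in in.script
</pre></div>
</div>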
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>For simulations on higher node counts, add <a class="reference internal" href="processors.html"><span class="doc">processors * * * grid numa</span></a> to the beginning of the input script for
better scalability.</p>
</div>
<p>When running on many nodes, performance might be better when using
fewer OpenMP threads and more MPI tasks. This will depend on the
simulation and the machine. Using the <a class="reference internal" href="run_style.html"><span class="doc">verlet/split</span></a>
run style might also give better performance for simulations with
<a class="reference internal" href="kspace_style.html"><span class="doc">PPPM</span></a> electrostatics. Note that this is an
alternative to LRT mode and the two cannot be used together.</p>
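<p>A sketch of using this run style: <em>verlet/split</em> requires running LAMMPS on two partitions (see the <a class="reference internal" href="run_style.html"><span class="doc">run_style</span></a> documentation for the required partition sizes); the partition split and task counts below are illustrative assumptions, not recommendations:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># 12 tasks for pair/bond computations, 4 for PPPM,
# with "run_style verlet/split" in the input script
mpirun -np 16 lmp_machine -partition 12 4 -sf intel -in in.script
</pre></div>
</div>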
<p>Currently, when using Intel MPI with Intel Xeon Phi x200 series
CPUs, better performance might be obtained by setting the
environment variable <code class="docutils literal notranslate"><span class="pre">I_MPI_SHM_LMT=shm</span></code> for Linux kernels that do
not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
series processors will always perform better using MCDRAM. Please
consult your system documentation for the best approach to specify
that MPI runs are performed in MCDRAM.</p>
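<p>On systems where MCDRAM is exposed as a separate NUMA node (flat memory mode), one common approach is to bind or prefer allocations with <code class="docutils literal notranslate"><span class="pre">numactl</span></code>; the NUMA node number below is an assumption and varies by configuration, so check <code class="docutils literal notranslate"><span class="pre">numactl</span> <span class="pre">-H</span></code> first:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># prefer MCDRAM (here assumed to be NUMA node 1) for allocations
mpirun -np 64 numactl --preferred=1 lmp_machine -sf intel -in in.script
</pre></div>
</div>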
</section>
<section id="tuning-for-offload-performance">
<h2>Tuning for Offload Performance<a class="headerlink" href="#tuning-for-offload-performance" title="Link to this heading"></a></h2>
<p>The default settings for offload should give good performance.</p>
<p>When using LAMMPS with offload to Intel co-processors, best performance
will typically be achieved with concurrent calculations performed on
both the CPU and the co-processor. This is achieved by offloading only
a fraction of the neighbor and pair computations to the co-processor or
using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> pair styles where only one style uses
the &#8220;intel&#8221; suffix. For simulations with long-range electrostatics or
bond, angle, dihedral, or improper calculations, computation and data
transfer to the co-processor will run concurrently with computations
and MPI communications for these calculations on the host CPU. This
is illustrated in the figure below for the rhodopsin protein benchmark
running on E5-2697v2 processors with an Intel Xeon Phi 7120p
co-processor. In this plot, the vertical axis is time, and routines
shown at the same time are running concurrently on both the host and
the co-processor.</p>
<img alt="_images/offload_knc.png" class="align-center" src="_images/offload_knc.png" />
<p>The fraction of the offloaded work is controlled by the <em>balance</em>
keyword in the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. A balance of 0
runs all calculations on the CPU. A balance of 1 runs all
supported calculations on the co-processor. A balance of 0.5 runs half
of the calculations on the co-processor. Setting the balance to -1
(the default) will enable dynamic load balancing that continuously
adjusts the fraction of offloaded work throughout the simulation.
Because data transfer cannot be timed, this option typically produces
results within 5 to 10 percent of the optimal fixed balance.</p>
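<p>For example, to fix the offload fraction at one half instead of using dynamic balancing (assuming 1 co-processor per node), the input script could use:</p>
<div class="highlight-LAMMPS notranslate"><div class="highlight"><pre><span></span>package intel 1 balance 0.5
</pre></div>
</div>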
<p>If running short benchmark runs with dynamic load balancing, adding a
short warm-up run (10-20 steps) will allow the load-balancer to find a
near-optimal setting that will carry over to additional runs.</p>
<p>The default for the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command is to have
all the MPI tasks on a given compute node use a single Xeon Phi
co-processor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
available on the co-processor. For example, if your card has 61 cores,
with 60 cores available for offload and 4 hardware threads per core
(240 total threads), running with 24 MPI tasks per node will cause
each MPI task to use a subset of 10 threads on the co-processor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command.</p>
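<p>For instance, the <em>tptask</em> keyword caps the number of co-processor threads per MPI task (the value 8 here is illustrative; assuming 1 co-processor per node):</p>
<div class="highlight-LAMMPS notranslate"><div class="highlight"><pre><span></span>package intel 1 tptask 8
</pre></div>
</div>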
<p>The INTEL package has two modes for deciding which atoms will be
handled by the co-processor. This choice is controlled with the <em>ghost</em>
keyword of the <a class="reference internal" href="package.html"><span class="doc">package intel</span></a> command. When set to 0,
ghost atoms (atoms at the borders between MPI tasks) are not offloaded
to the card. This allows for overlap of MPI communication of forces
with computation on the co-processor when the <a class="reference internal" href="newton.html"><span class="doc">newton</span></a>
setting is &#8220;on&#8221;. The default depends on the style being used;
however, better performance may be achieved by setting this option
explicitly.</p>
<p>When using offload with CPU Hyper-Threading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is because additional threads are generated internally
to handle the asynchronous offload tasks.</p>
<p>If pair computations are being offloaded to an Intel Xeon Phi
co-processor, a diagnostic line is printed to the screen (not to the
log file) during the setup phase of a run, indicating that offload
mode is being used and giving the number of co-processor threads
per MPI task. Additionally, an offload timing summary is printed at
the end of each run. When offloading, the frequency for <a class="reference internal" href="atom_modify.html"><span class="doc">atom sorting</span></a> is changed to 1 so that the per-atom data is
effectively sorted at every rebuild of the neighbor lists. All the
available co-processor threads on each Phi will be divided among MPI
tasks, unless the <code class="docutils literal notranslate"><span class="pre">tptask</span></code> option of the <code class="docutils literal notranslate"><span class="pre">-pk</span> <span class="pre">intel</span></code> <a class="reference internal" href="Run_options.html"><span class="doc">command-line switch</span></a> is used to limit the co-processor threads per
MPI task.</p>
</section>
<section id="restrictions">
<h2>Restrictions<a class="headerlink" href="#restrictions" title="Link to this heading"></a></h2>
<p>When offloading to a co-processor, <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid</span></a> styles
that require skip lists for neighbor builds cannot be offloaded.
Using <a class="reference internal" href="pair_hybrid.html"><span class="doc">hybrid/overlay</span></a> is allowed. Only one intel-accelerated
style may be used with hybrid styles when offloading.
<a class="reference internal" href="special_bonds.html"><span class="doc">Special_bonds</span></a> exclusion lists are not currently
supported with offload; however, the same effect can often be
accomplished by setting cutoffs for excluded atom types to 0. None of
the pair styles in the INTEL package currently support the
&#8220;inner&#8221;, &#8220;middle&#8221;, &#8220;outer&#8221; options for rRESPA integration via the
<a class="reference internal" href="run_style.html"><span class="doc">run_style respa</span></a> command; only the &#8220;pair&#8221; option is
supported.</p>
</section>
</section>
<section id="references">
<h2>References<a class="headerlink" href="#references" title="Link to this heading"></a></h2>
<ul class="simple">
<li><p>Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakkar, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., &#8220;Optimizing Classical Molecular Dynamics in LAMMPS&#8221;, in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann.</p></li>
<li><p>Brown, W.M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. <a class="reference external" href="https://dl.acm.org/citation.cfm?id=3014915">Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency.</a> 2016 High Performance Computing, Networking, Storage and Analysis, SC16: International Conference (pp. 82-95).</p></li>
<li><p>Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101.</p></li>
</ul>
</section>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="Speed_gpu.html" class="btn btn-neutral float-left" title="7.4.1. GPU package" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="Speed_kokkos.html" class="btn btn-neutral float-right" title="7.4.3. KOKKOS package" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>

<hr/>

<div role="contentinfo">
<p>&#169; Copyright 2003-2025 Sandia Corporation.</p>
</div>

Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.

</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
    SphinxRtdTheme.Navigation.enable(false);
});
</script>
</body>
</html>