diff --git a/doc/html/Manual.html b/doc/html/Manual.html index ef091fe1c2..4f54c1450a 100644 --- a/doc/html/Manual.html +++ b/doc/html/Manual.html @@ -134,8 +134,8 @@

LAMMPS-ICMS Documentation

-
-

28 Jun 2016 version

+
+

1 Jul 2016 version

Version info:

diff --git a/doc/html/Section_intro.html b/doc/html/Section_intro.html index 8be864976a..7c6a5c2f0a 100644 --- a/doc/html/Section_intro.html +++ b/doc/html/Section_intro.html @@ -569,7 +569,7 @@ Hierarchical Modeling”.

LAMMPS. If you use LAMMPS results in your published work, please cite this paper and include a pointer to the LAMMPS WWW Site (http://lammps.sandia.gov):

-

S. J. Plimpton, Fast Parallel Algorithms for Short-Range Molecular +

S. Plimpton, Fast Parallel Algorithms for Short-Range Molecular Dynamics, J Comp Phys, 117, 1-19 (1995).

Other papers describing specific algorithms used in LAMMPS are listed under the Citing LAMMPS link of diff --git a/doc/html/_images/offload_knc.png b/doc/html/_images/offload_knc.png new file mode 100644 index 0000000000..0c4028a08d Binary files /dev/null and b/doc/html/_images/offload_knc.png differ diff --git a/doc/html/_images/user_intel.png b/doc/html/_images/user_intel.png new file mode 100644 index 0000000000..0ebb2d1ae0 Binary files /dev/null and b/doc/html/_images/user_intel.png differ diff --git a/doc/html/_sources/Manual.txt b/doc/html/_sources/Manual.txt index 51f1b44072..3310e390d9 100644 --- a/doc/html/_sources/Manual.txt +++ b/doc/html/_sources/Manual.txt @@ -5,8 +5,8 @@ LAMMPS Documentation ==================== -28 Jun 2016 version -------------------- +1 Jul 2016 version +------------------ Version info: ------------- diff --git a/doc/html/_sources/Section_intro.txt b/doc/html/_sources/Section_intro.txt index 7a583e4f3d..2279a01ab8 100644 --- a/doc/html/_sources/Section_intro.txt +++ b/doc/html/_sources/Section_intro.txt @@ -536,7 +536,7 @@ LAMMPS. If you use LAMMPS results in your published work, please cite this paper and include a pointer to the `LAMMPS WWW Site `_ (http://lammps.sandia.gov): -S. J. Plimpton, **Fast Parallel Algorithms for Short-Range Molecular +S. Plimpton, **Fast Parallel Algorithms for Short-Range Molecular Dynamics**\ , J Comp Phys, 117, 1-19 (1995). Other papers describing specific algorithms used in LAMMPS are listed diff --git a/doc/html/_sources/accelerate_intel.txt b/doc/html/_sources/accelerate_intel.txt index 680a1a4e92..c8bee2877e 100644 --- a/doc/html/_sources/accelerate_intel.txt +++ b/doc/html/_sources/accelerate_intel.txt @@ -3,230 +3,252 @@ 5.USER-INTEL package -------------------- -The USER-INTEL package was developed by Mike Brown at Intel +The USER-INTEL package is maintained by Mike Brown at Intel Corporation. It provides two methods for accelerating simulations, depending on the hardware you have. The first is acceleration on -Intel(R) CPUs by running in single, mixed, or double precision with -vectorization. The second is acceleration on Intel(R) Xeon Phi(TM) +Intel CPUs by running in single, mixed, or double precision with +vectorization. The second is acceleration on Intel Xeon Phi coprocessors via offloading neighbor list and non-bonded force calculations to the Phi. The same C++ code is used in both cases. When offloading to a coprocessor from a CPU, the same routine is run -twice, once on the CPU and once with an offload flag. +twice, once on the CPU and once with an offload flag. This allows +LAMMPS to run on the CPU cores and coprocessor cores simulataneously. -Note that the USER-INTEL package supports use of the Phi in "offload" -mode, not "native" mode like the :doc:`KOKKOS package `. +**Currently Available USER-INTEL Styles:** -Also note that the USER-INTEL package can be used in tandem with the -:doc:`USER-OMP package `. This is useful when -offloading pair style computations to the Phi, so that other styles -not supported by the USER-INTEL package, e.g. bond, angle, dihedral, -improper, and long-range electrostatics, can run simultaneously in -threaded mode on the CPU cores. Since less MPI tasks than CPU cores -will typically be invoked when running with coprocessors, this enables -the extra CPU cores to be used for useful computation. 
+* Angle Styles: charmm, harmonic +* Bond Styles: fene, harmonic +* Dihedral Styles: charmm, harmonic, opls +* Fixes: nve, npt, nvt, nvt/sllod +* Improper Styles: cvff, harmonic +* Pair Styles: buck/coul/cut, buck/coul/long, buck, gayberne, + charmm/coul/long, lj/cut, lj/cut/coul/long, sw, tersoff +* K-Space Styles: pppm +**Speed-ups to expect:** -As illustrated below, if LAMMPS is built with both the USER-INTEL and -USER-OMP packages, this dual mode of operation is made easier to use, -via the "-suffix hybrid intel omp" :ref:`command-line switch ` or the :doc:`suffix hybrid intel omp ` command. Both set a second-choice suffix to "omp" so -that styles from the USER-INTEL package will be used if available, -with styles from the USER-OMP package as a second choice. +The speedups will depend on your simulation, the hardware, which +styles are used, the number of atoms, and the floating-point +precision mode. Performance improvements are shown compared to +LAMMPS *without using other acceleration packages* as these are +under active development (and subject to performance changes). The +measurements were performed using the input files available in +the src/USER-INTEL/TEST directory. These are scalable in size; the +results given are with 512K particles (524K for Liquid Crystal). +Most of the simulations are standard LAMMPS benchmarks (indicated +by the filename extension in parenthesis) with modifications to the +run length and to add a warmup run (for use with offload +benchmarks). -Here is a quick overview of how to use the USER-INTEL package for CPU -acceleration, assuming one or more 16-core nodes. More details -follow. +.. image:: JPG/user_intel.png + :align: center -.. parsed-literal:: +Results are speedups obtained on Intel Xeon E5-2697v4 processors +(code-named Broadwell) and Intel Xeon Phi 7250 processors +(code-named Knights Landing) with "18 Jun 2016" LAMMPS built with +Intel Parallel Studio 2016 update 3. Results are with 1 MPI task +per physical core. See *src/USER-INTEL/TEST/README* for the raw +simulation rates and instructions to reproduce. - use an Intel compiler - use these CCFLAGS settings in Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost, -fno-alias, -ansi-alias, -override-limits - use these LINKFLAGS settings in Makefile.machine: -fopenmp, -xHost - make yes-user-intel yes-user-omp # including user-omp is optional - make mpi # build with the USER-INTEL package, if settings (including compiler) added to Makefile.mpi - make intel_cpu # or Makefile.intel_cpu already has settings, uses Intel MPI wrapper - Make.py -v -p intel omp -intel cpu -a file mpich_icc # or one-line build via Make.py for MPICH - Make.py -v -p intel omp -intel cpu -a file ompi_icc # or for OpenMPI - Make.py -v -p intel omp -intel cpu -a file intel_cpu # or for Intel MPI wrapper -.. parsed-literal:: +---------- - lmp_machine -sf intel -pk intel 0 omp 16 -in in.script # 1 node, 1 MPI task/node, 16 threads/task, no USER-OMP - mpirun -np 32 lmp_machine -sf intel -in in.script # 2 nodess, 16 MPI tasks/node, no threads, no USER-OMP - lmp_machine -sf hybrid intel omp -pk intel 0 omp 16 -pk omp 16 -in in.script # 1 node, 1 MPI task/node, 16 threads/task, with USER-OMP - mpirun -np 32 -ppn 4 lmp_machine -sf hybrid intel omp -pk omp 4 -pk omp 4 -in in.script # 8 nodes, 4 MPI tasks/node, 4 threads/task, with USER-OMP -Here is a quick overview of how to use the USER-INTEL package for the -same CPUs as above (16 cores/node), with an additional Xeon Phi(TM) -coprocessor per node. 
More details follow. +**Quick Start for Experienced Users:** -.. parsed-literal:: +LAMMPS should be built with the USER-INTEL package installed. +Simulations should be run with 1 MPI task per physical *core*\ , +not *hardware thread*\ . - Same as above for building, with these additions/changes: - add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in Makefile.machine - add the flag -offload to LINKFLAGS in Makefile.machine - for Make.py change "-intel cpu" to "-intel phi", and "file intel_cpu" to "file intel_phi" +For Intel Xeon CPUs: -.. parsed-literal:: +* Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary. +* If using *kspace_style pppm* in the input script, add "neigh_modify binsize 3" and "kspace_modify diff ad" to the input script for better + performance. +* "-pk intel 0 omp 2 -sf intel" added to LAMMPS command-line +For Intel Xeon Phi CPUs for simulations without *kspace_style +pppm* in the input script + +* Edit src/MAKE/OPTIONS/Makefile.knl as necessary. +* Runs should be performed using MCDRAM. +* "-pk intel 0 omp 2 -sf intel" *or* "-pk intel 0 omp 4 -sf intel" + should be added to the LAMMPS command-line. Choice for best + performance will depend on the simulation. +For Intel Xeon Phi CPUs for simulations with *kspace_style +pppm* in the input script: + +* Edit src/MAKE/OPTIONS/Makefile.knl as necessary. +* Runs should be performed using MCDRAM. +* Add "neigh_modify binsize 3" to the input script for better + performance. +* Add "kspace_modify diff ad" to the input script for better + performance. +* export KMP_AFFINITY=none +* "-pk intel 0 omp 3 lrt yes -sf intel" or "-pk intel 0 omp 1 lrt yes + -sf intel" added to LAMMPS command-line. Choice for best performance + will depend on the simulation. +For Intel Xeon Phi coprocessors (Offload): + +* Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary +* "-pk intel N omp 1" added to command-line where N is the number of + coprocessors per node. + +---------- - mpirun -np 32 lmp_machine -sf intel -pk intel 1 -in in.script # 2 nodes, 16 MPI tasks/node, 240 total threads on coprocessor, no USER-OMP - mpirun -np 16 -ppn 8 lmp_machine -sf intel -pk intel 1 omp 2 -in in.script # 2 nodes, 8 MPI tasks/node, 2 threads/task, 240 total threads on coprocessor, no USER-OMP - mpirun -np 32 -ppn 8 lmp_machine -sf hybrid intel omp -pk intel 1 omp 2 -pk omp 2 -in in.script # 4 nodes, 8 MPI tasks/node, 2 threads/task, 240 total threads on coprocessor, with USER-OMP **Required hardware/software:** -Your compiler must support the OpenMP interface. Use of an Intel(R) -C++ compiler is recommended, but not required. However, g++ will not -recognize some of the settings listed above, so they cannot be used. -Optimizations for vectorization have only been tested with the -Intel(R) compiler. Use of other compilers may not result in -vectorization, or give poor performance. +In order to use offload to coprocessors, an Intel Xeon Phi +coprocessor and an Intel compiler are required. For this, the +recommended version of the Intel compiler is 14.0.1.106 or +versions 15.0.2.044 and higher. -The recommended version of the Intel(R) compiler is 14.0.1.106. -Versions 15.0.1.133 and later are also supported. If using Intel(R) -MPI, versions 15.0.2.044 and later are recommended. +Although any compiler can be used with the USER-INTEL pacakge, +currently, vectorization directives are disabled by default when +not using Intel compilers due to lack of standard support and +observations of decreased performance. 
The OpenMP standard now +supports directives for vectorization and we plan to transition the +code to this standard once it is available in most compilers. We +expect this to allow improved performance and support with other +compilers. -To use the offload option, you must have one or more Intel(R) Xeon -Phi(TM) coprocessors and use an Intel(R) C++ compiler. +For Intel Xeon Phi x200 series processors (code-named Knights +Landing), there are multiple configuration options for the hardware. +For best performance, we recommend that the MCDRAM is configured in +"Flat" mode and with the cluster mode set to "Quadrant" or "SNC4". +"Cache" mode can also be used, although the performance might be +slightly lower. + +**Notes about Simultaneous Multithreading:** + +Modern CPUs often support Simultaneous Multithreading (SMT). On +Intel processors, this is called Hyper-Threading (HT) technology. +SMT is hardware support for running multiple threads efficiently on +a single core. *Hardware threads* or *logical cores* are often used +to refer to the number of threads that are supported in hardware. +For example, the Intel Xeon E5-2697v4 processor is described +as having 36 cores and 72 threads. This means that 36 MPI processes +or OpenMP threads can run simultaneously on separate cores, but that +up to 72 MPI processes or OpenMP threads can be running on the CPU +without costly operating system context switches. + +Molecular dynamics simulations will often run faster when making use +of SMT. If a thread becomes stalled, for example because it is +waiting on data that has not yet arrived from memory, another thread +can start running so that the CPU pipeline is still being used +efficiently. Although benefits can be seen by launching a MPI task +for every hardware thread, for multinode simulations, we recommend +that OpenMP threads are used for SMT instead, either with the +USER-INTEL package, `USER-OMP package `_, or +:doc:`KOKKOS package `. In the example above, up +to 36X speedups can be observed by using all 36 physical cores with +LAMMPS. By using all 72 hardware threads, an additional 10-30% +performance gain can be achieved. + +The BIOS on many platforms allows SMT to be disabled, however, we do +not recommend this on modern processors as there is little to no +benefit for any software package in most cases. The operating system +will report every hardware thread as a separate core allowing one to +determine the number of hardware threads available. On Linux systems, +this information can normally be obtained with: + +.. parsed-literal:: + + cat /proc/cpuinfo **Building LAMMPS with the USER-INTEL package:** -The lines above illustrate how to include/build with the USER-INTEL -package, for either CPU or Phi support, in two steps, using the "make" -command. Or how to do it with one command via the src/Make.py script, -described in :ref:`Section 2.4 ` of the manual. -Type "Make.py -h" for help. Because the mechanism for specifing what -compiler to use (Intel in this case) is different for different MPI -wrappers, 3 versions of the Make.py command are shown. +The USER-INTEL package must be installed into the source directory: + +.. parsed-literal:: + + make yes-user-intel + +Several example Makefiles for building with the Intel compiler are +included with LAMMPS in the src/MAKE/OPTIONS/ directory: + +.. 
parsed-literal:: + + Makefile.intel_cpu_intelmpi # Intel Compiler, Intel MPI, No Offload + Makefile.knl # Intel Compiler, Intel MPI, No Offload + Makefile.intel_cpu_mpich # Intel Compiler, MPICH, No Offload + Makefile.intel_cpu_openpmi # Intel Compiler, OpenMPI, No Offload + Makefile.intel_coprocessor # Intel Compiler, Intel MPI, Offload + +Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that +it explicitly specifies that vectorization should be for Intel +Xeon Phi x200 processors making it easier to cross-compile. For +users with recent installations of Intel Parallel Studio, the +process can be as simple as: + +.. parsed-literal:: + + make yes-user-intel + source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh + # or psxevars.csh for C-shell + make intel_cpu_intelmpi + +Alternatively, the build can be accomplished with the src/Make.py +script, described in :ref:`Section 2.4 ` of the +manual. Type "Make.py -h" for help. For an example: + +.. parsed-literal:: + + Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi Note that if you build with support for a Phi coprocessor, the same binary can be used on nodes with or without coprocessors installed. However, if you do not have coprocessors on your system, building without offload support will produce a smaller binary. -If you also build with the USER-OMP package, you can use styles from -both packages, as described below. +The general requirements for Makefiles with the USER-INTEL package +are as follows. "-DLAMMPS_MEMALIGN=64" is required for CCFLAGS. When +using Intel compilers, "-restrict" is required and "-qopenmp" is +highly recommended for CCFLAGS and LINKFLAGS. LIB should include +"-ltbbmalloc". For builds supporting offload, "-DLMP_INTEL_OFFLOAD" +is required for CCFLAGS and "-qoffload" is required for LINKFLAGS. +Other recommended CCFLAG options for best performance are +"-O2 -fno-alias -ansi-alias -qoverride-limits fp-model fast=2 +-no-prec-div". The Make.py command will add all of these +automatically. -Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must -include "-fopenmp". Likewise, if you use an Intel compiler, the -CCFLAGS setting must include "-restrict". For Phi support, the -"-DLMP_INTEL_OFFLOAD" (CCFLAGS) and "-offload" (LINKFLAGS) settings -are required. The other settings listed above are optional, but will -typically improve performance. The Make.py command will add all of -these automatically. +.. note:: -If you are compiling on the same architecture that will be used for -the runs, adding the flag *-xHost* to CCFLAGS enables vectorization -with the Intel(R) compiler. Otherwise, you must provide the correct -compute node architecture to the -x option (e.g. -xAVX). + The vectorization and math capabilities can differ depending on + the CPU. For Intel compilers, the "-x" flag specifies the type of + processor for which to optimize. "-xHost" specifies that the compiler + should build for the processor used for compiling. For Intel Xeon Phi + x200 series processors, this option is "-xMIC-AVX512". For fourth + generation Intel Xeon (v4/Broadwell) processors, "-xCORE-AVX2" should + be used. For older Intel Xeon processors, "-xAVX" will perform best + in general for the different simulations in LAMMPS. The default + in most of the example Makefiles is to use "-xHost", however this + should not be used when cross-compiling. 
-Example machines makefiles Makefile.intel_cpu and Makefile.intel_phi -are included in the src/MAKE/OPTIONS directory with settings that -perform well with the Intel(R) compiler. The latter has support for -offload to Phi coprocessors; the former does not. +**Running LAMMPS with the USER-INTEL package:** -**Run with the USER-INTEL package from the command line:** +Running LAMMPS with the USER-INTEL package is similar to normal use +with the exceptions that one should 1) specify that LAMMPS should use +the USER-INTEL package, 2) specify the number of OpenMP threads, and +3) optionally specify the specific LAMMPS styles that should use the +USER-INTEL package. 1) and 2) can be performed from the command-line +or by editing the input script. 3) requires editing the input script. +Advanced performance tuning options are also described below to get +the best performance. -The mpirun or mpiexec command sets the total number of MPI tasks used -by LAMMPS (one or multiple per compute node) and the number of MPI -tasks used per node. E.g. the mpirun command in MPICH does this via -its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode. - -If you compute (any portion of) pairwise interactions using USER-INTEL -pair styles on the CPU, or use USER-OMP styles on the CPU, you need to -choose how many OpenMP threads per MPI task to use. If both packages -are used, it must be done for both packages, and the same thread count -value should be used for both. Note that the product of MPI tasks * -threads/task should not exceed the physical number of cores (on a -node), otherwise performance will suffer. - -When using the USER-INTEL package for the Phi, you also need to -specify the number of coprocessor/node and optionally the number of -coprocessor threads per MPI task to use. Note that coprocessor -threads (which run on the coprocessor) are totally independent from -OpenMP threads (which run on the CPU). The default values for the -settings that affect coprocessor threads are typically fine, as -discussed below. - -As in the lines above, use the "-sf intel" or "-sf hybrid intel omp" -:ref:`command-line switch `, which will -automatically append "intel" to styles that support it. In the second -case, "omp" will be appended if an "intel" style does not exist. - -Note that if either switch is used, it also invokes a default command: -:doc:`package intel 1 `. If the "-sf hybrid intel omp" switch -is used, the default USER-OMP command :doc:`package omp 0 ` is -also invoked (if LAMMPS was built with USER-OMP). Both set the number -of OpenMP threads per MPI task via the OMP_NUM_THREADS environment -variable. The first command sets the number of Xeon Phi(TM) -coprocessors/node to 1 (ignored if USER-INTEL is built for CPU-only), -and the precision mode to "mixed" (default value). - -You can also use the "-pk intel Nphi" :ref:`command-line switch ` to explicitly set Nphi = # of Xeon -Phi(TM) coprocessors/node, as well as additional options. Nphi should -be >= 1 if LAMMPS was built with coprocessor support, otherswise Nphi -= 0 for a CPU-only build. All the available coprocessor threads on -each Phi will be divided among MPI tasks, unless the *tptask* option -of the "-pk intel" :ref:`command-line switch ` is -used to limit the coprocessor threads per MPI task. See the :doc:`package intel ` command for details, including the default values -used for all its options if not specified, and how to set the number -of OpenMP threads via the OMP_NUM_THREADS environment variable if -desired. 
- -If LAMMPS was built with the USER-OMP package, you can also use the -"-pk omp Nt" :ref:`command-line switch ` to -explicitly set Nt = # of OpenMP threads per MPI task to use, as well -as additional options. Nt should be the same threads per MPI task as -set for the USER-INTEL package, e.g. via the "-pk intel Nphi omp Nt" -command. Again, see the :doc:`package omp ` command for -details, including the default values used for all its options if not -specified, and how to set the number of OpenMP threads via the -OMP_NUM_THREADS environment variable if desired. - -**Or run with the USER-INTEL package by editing an input script:** - -The discussion above for the mpirun/mpiexec command, MPI tasks/node, -OpenMP threads per MPI task, and coprocessor threads per MPI task is -the same. - -Use the :doc:`suffix intel ` or :doc:`suffix hybrid intel omp ` commands, or you can explicitly add an "intel" or -"omp" suffix to individual styles in your input script, e.g. - -.. parsed-literal:: - - pair_style lj/cut/intel 2.5 - -You must also use the :doc:`package intel ` command, unless the -"-sf intel" or "-pk intel" :ref:`command-line switches ` were used. It specifies how many -coprocessors/node to use, as well as other OpenMP threading and -coprocessor options. The :doc:`package ` doc page explains how -to set the number of OpenMP threads via an environment variable if -desired. - -If LAMMPS was also built with the USER-OMP package, you must also use -the :doc:`package omp ` command to enable that package, unless -the "-sf hybrid intel omp" or "-pk omp" :ref:`command-line switches ` were used. It specifies how many -OpenMP threads per MPI task to use (should be same as the setting for -the USER-INTEL package), as well as other options. Its doc page -explains how to set the number of OpenMP threads via an environment -variable if desired. - -**Speed-ups to expect:** - -If LAMMPS was not built with coprocessor support (CPU only) when -including the USER-INTEL package, then acclerated styles will run on -the CPU using vectorization optimizations and the specified precision. -This may give a substantial speed-up for a pair style, particularly if -mixed or single precision is used. - -If LAMMPS was built with coproccesor support, the pair styles will run -on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The -performance of a Xeon Phi versus a multi-core CPU is a function of -your hardware, which pair style is used, the number of -atoms/coprocessor, and the precision used on the coprocessor (double, -single, mixed). - -See the `Benchmark page `_ of the -LAMMPS web site for performance of the USER-INTEL package on different -hardware. +When running on a single node (including runs using offload to a +coprocessor), best performance is normally obtained by using 1 MPI +task per physical core and additional OpenMP threads with SMT. For +Intel Xeon processors, 2 OpenMP threads should be used for SMT. +For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used +(best choice depends on the simulation). In cases where the user +specifies that LRT mode is used (described below), 1 or 3 OpenMP +threads should be used. For multi-node runs, using 1 MPI task per +physical core will often perform best, however, depending on the +machine and scale, users might get better performance by decreasing +the number of MPI tasks and using more OpenMP threads. For +performance, the product of the number of MPI tasks and OpenMP +threads should not exceed the number of available hardware threads in +almost all cases. 
.. note:: @@ -234,67 +256,202 @@ hardware. threads to a core or group of cores so that memory access can be uniform. Unless disabled at build time, affinity for MPI tasks and OpenMP threads on the host (CPU) will be set by default on the host - when using offload to a coprocessor. In this case, it is unnecessary + *when using offload to a coprocessor*\ . In this case, it is unnecessary to use other methods to control affinity (e.g. taskset, numactl, - I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script with - the *no_affinity* option to the :doc:`package intel ` command - or by disabling the option at build time (by adding - -DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile). - Disabling this option is not recommended, especially when running on a - machine with hyperthreading disabled. + I_MPI_PIN_DOMAIN, etc.). This can be disabled with the *no_affinity* + option to the :doc:`package intel ` command or by disabling the + option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the + CCFLAGS line of your Makefile). Disabling this option is not + recommended, especially when running on a machine with Intel + Hyper-Threading technology disabled. -**Guidelines for best performance on an Intel(R) Xeon Phi(TM) -coprocessor:** +**Run with the USER-INTEL package from the command line:** + +To enable USER-INTEL optimizations for all available styles used in +the input script, the "-sf intel" +:ref:`command-line switch ` can be used without +any requirement for editing the input script. This switch will +automatically append "intel" to styles that support it. It also +invokes a default command: :doc:`package intel 1 `. This +package command is used to set options for the USER-INTEL package. +The default package command will specify that USER-INTEL calculations +are performed in mixed precision, that the number of OpenMP threads +is specified by the OMP_NUM_THREADS environment variable, and that +if coprocessors are present and the binary was built with offload +support, that 1 coprocessor per node will be used with automatic +balancing of work between the CPU and the coprocessor. + +You can specify different options for the USER-INTEL package by using +the "-pk intel Nphi" :ref:`command-line switch ` +with keyword/value pairs as specified in the documentation. Here, +Nphi = # of Xeon Phi coprocessors/node (ignored without offload +support). Common options to the USER-INTEL package include *omp* to +override any OMP_NUM_THREADS setting and specify the number of OpenMP +threads, *mode* to set the floating-point precision mode, and +*lrt* to enable Long-Range Thread mode as described below. See the +:doc:`package intel ` command for details, including the +default values used for all its options if not specified, and how to +set the number of OpenMP threads via the OMP_NUM_THREADS environment +variable if desired. + +Examples (see documentation for your MPI/Machine for differences in +launching MPI applications): + +.. parsed-literal:: + + mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads + mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision + +**Or run with the USER-INTEL package by editing an input script:** + +As an alternative to adding command-line arguments, the input script +can be edited to enable the USER-INTEL package. 
This requires adding +the :doc:`package intel ` command to the top of the input +script. For the second example above, this would be: + +.. parsed-literal:: + + package intel 0 omp 2 mode double + +To enable the USER-INTEL package only for individual styles, you can +add an "intel" suffix to the individual style, e.g.: + +.. parsed-literal:: + + pair_style lj/cut/intel 2.5 + +Alternatively, the :doc:`suffix intel ` command can be added to +the input script to enable USER-INTEL styles for the commands that +follow in the input script. + +**Tuning for Performance:** + +.. note:: + + The USER-INTEL package will perform better with modifications + to the input script when :doc:`PPPM ` is used: + :doc:`kspace_modify diff ad ` and :doc:`neigh_modify binsize 3 ` should be added to the input script. + +Long-Range Thread (LRT) mode is an option to the :doc:`package intel ` command that can improve performance when using +:doc:`PPPM ` for long-range electrostatics on processors +with SMT. It generates an extra pthread for each MPI task. The thread +is dedicated to performing some of the PPPM calculations and MPI +communications. On Intel Xeon Phi x200 series CPUs, this will likely +always improve performance, even on a single node. On Intel Xeon +processors, using this mode might result in better performance when +using multiple nodes, depending on the machine. To use this mode, +specify that the number of OpenMP threads is one less than would +normally be used for the run and add the "lrt yes" option to the "-pk" +command-line suffix or "package intel" command. For example, if a run +would normally perform best with "-pk intel 0 omp 4", instead use +"-pk intel 0 omp 3 lrt yes". When using LRT, you should set the +environment variable "KMP_AFFINITY=none". LRT mode is not supported +when using offload. + +Not all styles are supported in the USER-INTEL package. You can mix +the USER-INTEL package with styles from the :doc:`OPT ` +package or the `USER-OMP package `_. Of course, +this requires that these packages were installed at build time. This +can performed automatically by using "-sf hybrid intel opt" or +"-sf hybrid intel omp" command-line options. Alternatively, the "opt" +and "omp" suffixes can be appended manually in the input script. For +the latter, the :doc:`package omp ` command must be in the +input script or the "-pk omp Nt" :ref:`command-line switch ` must be used where Nt is the +number of OpenMP threads. The number of OpenMP threads should not be +set differently for the different packages. Note that the :doc:`suffix hybrid intel omp ` command can also be used within the +input script to automatically append the "omp" suffix to styles when +USER-INTEL styles are not available. + +When running on many nodes, performance might be better when using +fewer OpenMP threads and more MPI tasks. This will depend on the +simulation and the machine. Using the :doc:`verlet/split ` +run style might also give better performance for simulations with +:doc:`PPPM ` electrostatics. Note that this is an +alternative to LRT mode and the two cannot be used together. + +Currently, when using Intel MPI with Intel Xeon Phi x200 series +CPUs, better performance might be obtained by setting the +environment variable "I_MPI_SHM_LMT=shm" for Linux kernels that do +not yet have full support for AVX-512. Runs on Intel Xeon Phi x200 +series processors will always perform better using MCDRAM. Please +consult your system documentation for the best approach to specify +that MPI runs are performed in MCDRAM. 
+ +**Tuning for Offload Performance:** + +The default settings for offload should give good performance. + +When using LAMMPS with offload to Intel coprocessors, best performance +will typically be achieved with concurrent calculations performed on +both the CPU and the coprocessor. This is achieved by offloading only +a fraction of the neighbor and pair computations to the coprocessor or +using :doc:`hybrid ` pair styles where only one style uses +the "intel" suffix. For simulations with long-range electrostatics or +bond, angle, dihedral, improper calculations, computation and data +transfer to the coprocessor will run concurrently with computations +and MPI communications for these calculations on the host CPU. This +is illustrated in the figure below for the rhodopsin protein benchmark +running on E5-2697v2 processors with a Intel Xeon Phi 7120p +coprocessor. In this plot, the vertical access is time and routines +running at the same time are running concurrently on both the host and +the coprocessor. + +.. image:: JPG/offload_knc.png + :align: center + +The fraction of the offloaded work is controlled by the *balance* +keyword in the :doc:`package intel ` command. A balance of 0 +runs all calculations on the CPU. A balance of 1 runs all +supported calculations on the coprocessor. A balance of 0.5 runs half +of the calculations on the coprocessor. Setting the balance to -1 +(the default) will enable dynamic load balancing that continously +adjusts the fraction of offloaded work throughout the simulation. +Because data transfer cannot be timed, this option typically produces +results within 5 to 10 percent of the optimal fixed balance. + +If running short benchmark runs with dynamic load balancing, adding a +short warm-up run (10-20 steps) will allow the load-balancer to find a +near-optimal setting that will carry over to additional runs. + +The default for the :doc:`package intel ` command is to have +all the MPI tasks on a given compute node use a single Xeon Phi +coprocessor. In general, running with a large number of MPI tasks on +each node will perform best with offload. Each MPI task will +automatically get affinity to a subset of the hardware threads +available on the coprocessor. For example, if your card has 61 cores, +with 60 cores available for offload and 4 hardware threads per core +(240 total threads), running with 24 MPI tasks per node will cause +each MPI task to use a subset of 10 threads on the coprocessor. Fine +tuning of the number of threads to use per MPI task or the number of +threads to use per core can be accomplished with keyword settings of +the :doc:`package intel ` command. + +The USER-INTEL package has two modes for deciding which atoms will be +handled by the coprocessor. This choice is controlled with the *ghost* +keyword of the :doc:`package intel ` command. When set to 0, +ghost atoms (atoms at the borders between MPI tasks) are not offloaded +to the card. This allows for overlap of MPI communication of forces +with computation on the coprocessor when the :doc:`newton ` +setting is "on". The default is dependent on the style being used, +however, better performance may be achieved by setting this option +explictly. + +When using offload with CPU Hyper-Threading disabled, it may help +performance to use fewer MPI tasks and OpenMP threads than available +cores. This is due to the fact that additional threads are generated +internally to handle the asynchronous offload tasks. 
+ +If pair computations are being offloaded to an Intel Xeon Phi +coprocessor, a diagnostic line is printed to the screen (not to the +log file), during the setup phase of a run, indicating that offload +mode is being used and indicating the number of coprocessor threads +per MPI task. Additionally, an offload timing summary is printed at +the end of each run. When offloading, the frequency for :doc:`atom sorting ` is changed to 1 so that the per-atom data is +effectively sorted at every rebuild of the neighbor lists. All the +available coprocessor threads on each Phi will be divided among MPI +tasks, unless the *tptask* option of the "-pk intel" :ref:`command-line switch ` is used to limit the coprocessor +threads per MPI task. -* The default for the :doc:`package intel ` command is to have - all the MPI tasks on a given compute node use a single Xeon Phi(TM) - coprocessor. In general, running with a large number of MPI tasks on - each node will perform best with offload. Each MPI task will - automatically get affinity to a subset of the hardware threads - available on the coprocessor. For example, if your card has 61 cores, - with 60 cores available for offload and 4 hardware threads per core - (240 total threads), running with 24 MPI tasks per node will cause - each MPI task to use a subset of 10 threads on the coprocessor. Fine - tuning of the number of threads to use per MPI task or the number of - threads to use per core can be accomplished with keyword settings of - the :doc:`package intel ` command. -* If desired, only a fraction of the pair style computation can be - offloaded to the coprocessors. This is accomplished by using the - *balance* keyword in the :doc:`package intel ` command. A - balance of 0 runs all calculations on the CPU. A balance of 1 runs - all calculations on the coprocessor. A balance of 0.5 runs half of - the calculations on the coprocessor. Setting the balance to -1 (the - default) will enable dynamic load balancing that continously adjusts - the fraction of offloaded work throughout the simulation. This option - typically produces results within 5 to 10 percent of the optimal fixed - balance. -* When using offload with CPU hyperthreading disabled, it may help - performance to use fewer MPI tasks and OpenMP threads than available - cores. This is due to the fact that additional threads are generated - internally to handle the asynchronous offload tasks. -* If running short benchmark runs with dynamic load balancing, adding a - short warm-up run (10-20 steps) will allow the load-balancer to find a - near-optimal setting that will carry over to additional runs. -* If pair computations are being offloaded to an Intel(R) Xeon Phi(TM) - coprocessor, a diagnostic line is printed to the screen (not to the - log file), during the setup phase of a run, indicating that offload - mode is being used and indicating the number of coprocessor threads - per MPI task. Additionally, an offload timing summary is printed at - the end of each run. When offloading, the frequency for :doc:`atom sorting ` is changed to 1 so that the per-atom data is - effectively sorted at every rebuild of the neighbor lists. -* For simulations with long-range electrostatics or bond, angle, - dihedral, improper calculations, computation and data transfer to the - coprocessor will run concurrently with computations and MPI - communications for these calculations on the host CPU. The USER-INTEL - package has two modes for deciding which atoms will be handled by the - coprocessor. 
This choice is controlled with the *ghost* keyword of - the :doc:`package intel ` command. When set to 0, ghost atoms - (atoms at the borders between MPI tasks) are not offloaded to the - card. This allows for overlap of MPI communication of forces with - computation on the coprocessor when the :doc:`newton ` setting - is "on". The default is dependent on the style being used, however, - better performance may be achieved by setting this option - explictly. Restrictions """""""""""" @@ -311,6 +468,11 @@ the pair styles in the USER-INTEL package currently support the :doc:`run_style respa ` command; only the "pair" option is supported. +**References:** + +* Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakker, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., “Optimizing Classical Molecular Dynamics in LAMMPS,” in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann. +* Brown, W. M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press. +* Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101. .. _lws: http://lammps.sandia.gov .. _ld: Manual.html diff --git a/doc/html/_sources/fix_shardlow.txt b/doc/html/_sources/fix_shardlow.txt index 419dadd4ea..623c7621a2 100644 --- a/doc/html/_sources/fix_shardlow.txt +++ b/doc/html/_sources/fix_shardlow.txt @@ -70,9 +70,6 @@ lengths to be larger than twice the cutoff+skin. Generally, the domain decomposition is dependent on the number of processors requested. -This fix also requires :doc:`atom_style dpd ` to be used -due to shared data structures. - Related commands """""""""""""""" diff --git a/doc/html/_sources/molecule.txt b/doc/html/_sources/molecule.txt index eeccf15d70..dcb8180015 100644 --- a/doc/html/_sources/molecule.txt +++ b/doc/html/_sources/molecule.txt @@ -352,8 +352,11 @@ N1, N2, N3 are the number of 1-2, 1-3, 1-4 neighbors respectively of this atom within the topology of the molecule. See the :doc:`special_bonds ` doc page for more discussion of 1-2, 1-3, 1-4 neighbors. If this section appears, the Special Bonds -section must also appear. If this section is not specied, the -atoms in the molecule will have no special bonds. +section must also appear. + +As explained above, LAMMPS will auto-generate this information if this +section is not specified. If specified, this section will +override what would be auto-generated. ---------- @@ -372,9 +375,11 @@ values should be the 1-2 neighbors, the next N2 should be the 1-3 neighbors, the last N3 should be the 1-4 neighbors. No atom ID should appear more than once. See the :doc:`special_bonds ` doc page for more discussion of 1-2, 1-3, 1-4 neighbors. If this section -appears, the Special Bond Counts section must also appear. If this -section is not specied, the atoms in the molecule will have no special -bonds. +appears, the Special Bond Counts section must also appear. + +As explained above, LAMMPS will auto-generate this information if this +section is not specified. If specified, this section will override +what would be auto-generated. 
---------- diff --git a/doc/html/_sources/pair_dpd_fdt.txt b/doc/html/_sources/pair_dpd_fdt.txt index d665d196af..1c35614fa6 100644 --- a/doc/html/_sources/pair_dpd_fdt.txt +++ b/doc/html/_sources/pair_dpd_fdt.txt @@ -135,10 +135,6 @@ Pair style *dpd/fdt/energy* requires :doc:`atom_style dpd ` to be used in order to properly account for the particle internal energies and temperatures. -Pair style *dpd/fdt* currently also requires -:doc:`atom_style dpd ` to be used in conjunction with -:doc:`fix shardlow ` due to shared data structures. - Related commands """""""""""""""" diff --git a/doc/html/_sources/thermo_style.txt b/doc/html/_sources/thermo_style.txt index 89f59fc0a3..f133028110 100644 --- a/doc/html/_sources/thermo_style.txt +++ b/doc/html/_sources/thermo_style.txt @@ -18,7 +18,7 @@ Syntax *multi* args = none *custom* args = list of keywords possible keywords = step, elapsed, elaplong, dt, time, - cpu, tpcpu, spcpu, cpuremain, part, + cpu, tpcpu, spcpu, cpuremain, part, timeremain, atoms, temp, press, pe, ke, etotal, enthalpy, evdwl, ecoul, epair, ebond, eangle, edihed, eimp, emol, elong, etail, @@ -41,6 +41,7 @@ Syntax spcpu = timesteps per CPU second cpuremain = estimated CPU time remaining in run part = which partition (0 to Npartition-1) this is + timeremain = remaining time in seconds on timer timeout. atoms = # of atoms temp = temperature press = pressure @@ -271,6 +272,16 @@ corresponds to, or for use in a :doc:`variable ` to append to a filename for output specific to this partition. See :ref:`Section_start 7 ` of the manual for details on running in multi-partition mode. +The *timeremain* keyword returns the remaining seconds when a +timeout has been configured via the :doc:`timer timeout ` command. +If the timeout timer is inactive, the value of this keyword is 0.0 and +if the timer is expired, it is negative. This allows for example to exit +loops cleanly, if the timeout is expired with: + +.. parsed-literal:: + + if "$(timeremain) < 0.0" then "quit 0" + The *fmax* and *fnorm* keywords are useful for monitoring the progress of an :doc:`energy minimization `. The *fmax* keyword calculates the maximum force in any dimension on any atom in the diff --git a/doc/html/_sources/timer.txt b/doc/html/_sources/timer.txt index ba1dd478c7..9ea8b08542 100644 --- a/doc/html/_sources/timer.txt +++ b/doc/html/_sources/timer.txt @@ -35,6 +35,9 @@ Description """"""""""" Select the level of detail at which LAMMPS performs its CPU timings. +Multiple keywords can be specified with the *timer* command. For +keywords that are mutually exclusive, the last one specified takes +effect. During a simulation run LAMMPS collects information about how much time is spent in different sections of the code and thus can provide @@ -57,17 +60,37 @@ slow down the simulation. Using the *nosync* setting (which is the default) turns off this synchronization. With the *timeout* keyword a walltime limit can be imposed that -affects the :doc:`run ` and :doc:`minimize ` commands. If -the time limit is reached, the run or energy minimization will exit on -the next step or iteration that is a multiple of the *Ncheck* value -specified with the *every* keyword. All subsequent run or minimize -commands in the input script will be skipped until the timeout is -reset or turned off by a new *timer* command. The timeout *elapse* -value can be specified as *off* or *unlimited* to impose no timeout -condition (which is the default). 
The *elapse* setting can be -specified as a single number for seconds, two numbers separated by a -colon (MM:SS) for minutes and seconds, or as three numbers separated -by colons for hours, minutes, and seconds. +affects the :doc:`run ` and :doc:`minimize ` commands. +This can be convenient when runs have to conform to time limits, +e.g. when running under a batch system and you want to maximize +the utilization of the batch time slot, especially when the time +per timestep varies and it is thus difficult to predict how many +steps a simulation can perform, or for difficult-to-converge +minimizations. The timeout *elapse* value should be somewhat smaller +than the time requested from the batch system, as there is usually +some overhead to launch jobs, and it may be advisable to write +out a restart after terminating a run due to a timeout. + +The timeout timer starts when the command is issued. When the time +limit is reached, the run or energy minimization will exit on the +next step or iteration that is a multiple of the *Ncheck* value, +which can be set with the *every* keyword. The default is to check +every 10 steps. After the timer timeout has expired, all subsequent +run or minimize commands in the input script will be skipped. +The remaining time or timer status can be accessed with the +:doc:`thermo ` variable *timeremain*\ , which is +zero if the timeout is inactive (the default setting), negative +if the timeout has expired, and positive if there is time +remaining, in which case its value is the number of seconds +remaining. + +When the *timeout* keyword is used a second time, the timer is +restarted with a new time limit. The timeout *elapse* value can +be specified as *off* or *unlimited* to impose no timeout condition +(which is the default). The *elapse* setting can be specified as +a single number for seconds, two numbers separated by a colon (MM:SS) +for minutes and seconds, or as three numbers separated by colons for +hours, minutes, and seconds (H:MM:SS). The *every* keyword sets how frequently during a run or energy minimization the wall clock will be checked. This check count applies @@ -76,10 +99,6 @@ can slow a calculation down. Checking too infrequently can make the timeout measurement less accurate, with the run being stopped later than desired. -Multiple keywords can be specified with the *timer* command. For -keywords that are mutually exclusive, the last one specified takes -effect. - .. note:: Using the *full* and *sync* options provides the most detailed diff --git a/doc/html/accelerate_intel.html index 328f558ae3..ad8f8560d4 100644 --- a/doc/html/accelerate_intel.html +++ b/doc/html/accelerate_intel.html @@ -127,212 +127,363 @@

Return to Section accelerate overview

5.USER-INTEL package

-

The USER-INTEL package was developed by Mike Brown at Intel +

The USER-INTEL package is maintained by Mike Brown at Intel Corporation. It provides two methods for accelerating simulations, depending on the hardware you have. The first is acceleration on -Intel(R) CPUs by running in single, mixed, or double precision with -vectorization. The second is acceleration on Intel(R) Xeon Phi(TM) +Intel CPUs by running in single, mixed, or double precision with +vectorization. The second is acceleration on Intel Xeon Phi coprocessors via offloading neighbor list and non-bonded force calculations to the Phi. The same C++ code is used in both cases. When offloading to a coprocessor from a CPU, the same routine is run -twice, once on the CPU and once with an offload flag.

-

Note that the USER-INTEL package supports use of the Phi in “offload” -mode, not “native” mode like the KOKKOS package.

-

Also note that the USER-INTEL package can be used in tandem with the -USER-OMP package. This is useful when -offloading pair style computations to the Phi, so that other styles -not supported by the USER-INTEL package, e.g. bond, angle, dihedral, -improper, and long-range electrostatics, can run simultaneously in -threaded mode on the CPU cores. Since less MPI tasks than CPU cores -will typically be invoked when running with coprocessors, this enables -the extra CPU cores to be used for useful computation.

-

As illustrated below, if LAMMPS is built with both the USER-INTEL and -USER-OMP packages, this dual mode of operation is made easier to use, -via the “-suffix hybrid intel omp” command-line switch or the suffix hybrid intel omp command. Both set a second-choice suffix to “omp” so -that styles from the USER-INTEL package will be used if available, -with styles from the USER-OMP package as a second choice.

-

Here is a quick overview of how to use the USER-INTEL package for CPU -acceleration, assuming one or more 16-core nodes. More details -follow.

-
use an Intel compiler
-use these CCFLAGS settings in Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost, -fno-alias, -ansi-alias, -override-limits
-use these LINKFLAGS settings in Makefile.machine: -fopenmp, -xHost
-make yes-user-intel yes-user-omp     # including user-omp is optional
-make mpi                             # build with the USER-INTEL package, if settings (including compiler) added to Makefile.mpi
-make intel_cpu                       # or Makefile.intel_cpu already has settings, uses Intel MPI wrapper
-Make.py -v -p intel omp -intel cpu -a file mpich_icc   # or one-line build via Make.py for MPICH
-Make.py -v -p intel omp -intel cpu -a file ompi_icc    # or for OpenMPI
-Make.py -v -p intel omp -intel cpu -a file intel_cpu   # or for Intel MPI wrapper
-
-
-
lmp_machine -sf intel -pk intel 0 omp 16 -in in.script    # 1 node, 1 MPI task/node, 16 threads/task, no USER-OMP
-mpirun -np 32 lmp_machine -sf intel -in in.script         # 2 nodess, 16 MPI tasks/node, no threads, no USER-OMP
-lmp_machine -sf hybrid intel omp -pk intel 0 omp 16 -pk omp 16 -in in.script         # 1 node, 1 MPI task/node, 16 threads/task, with USER-OMP
-mpirun -np 32 -ppn 4 lmp_machine -sf hybrid intel omp -pk omp 4 -pk omp 4 -in in.script      # 8 nodes, 4 MPI tasks/node, 4 threads/task, with USER-OMP
-
-
-

Here is a quick overview of how to use the USER-INTEL package for the -same CPUs as above (16 cores/node), with an additional Xeon Phi(TM) -coprocessor per node. More details follow.

-
Same as above for building, with these additions/changes:
-add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in Makefile.machine
-add the flag -offload to LINKFLAGS in Makefile.machine
-for Make.py change "-intel cpu" to "-intel phi", and "file intel_cpu" to "file intel_phi"
-
-
-
mpirun -np 32 lmp_machine -sf intel -pk intel 1 -in in.script                 # 2 nodes, 16 MPI tasks/node, 240 total threads on coprocessor, no USER-OMP
-mpirun -np 16 -ppn 8 lmp_machine -sf intel -pk intel 1 omp 2 -in in.script            # 2 nodes, 8 MPI tasks/node, 2 threads/task, 240 total threads on coprocessor, no USER-OMP
-mpirun -np 32 -ppn 8 lmp_machine -sf hybrid intel omp -pk intel 1 omp 2 -pk omp 2 -in in.script # 4 nodes, 8 MPI tasks/node, 2 threads/task, 240 total threads on coprocessor, with USER-OMP
-
-
+twice, once on the CPU and once with an offload flag. This allows +LAMMPS to run on the CPU cores and coprocessor cores simultaneously.

+

Currently Available USER-INTEL Styles:

+
    +
  • Angle Styles: charmm, harmonic
  • +
  • Bond Styles: fene, harmonic
  • +
  • Dihedral Styles: charmm, harmonic, opls
  • +
  • Fixes: nve, npt, nvt, nvt/sllod
  • +
  • Improper Styles: cvff, harmonic
  • +
  • Pair Styles: buck/coul/cut, buck/coul/long, buck, gayberne, +charmm/coul/long, lj/cut, lj/cut/coul/long, sw, tersoff
  • +
  • K-Space Styles: pppm
  • +
+

Speed-ups to expect:

+

The speedups will depend on your simulation, the hardware, which +styles are used, the number of atoms, and the floating-point +precision mode. Performance improvements are shown compared to +LAMMPS without using other acceleration packages, as these are +under active development (and subject to performance changes). The +measurements were performed using the input files available in +the src/USER-INTEL/TEST directory. These are scalable in size; the +results given are with 512K particles (524K for Liquid Crystal). +Most of the simulations are standard LAMMPS benchmarks (indicated +by the filename extension in parentheses) with modifications to the +run length and to add a warmup run (for use with offload +benchmarks).

+_images/user_intel.png +

Results are speedups obtained on Intel Xeon E5-2697v4 processors +(code-named Broadwell) and Intel Xeon Phi 7250 processors +(code-named Knights Landing) with “18 Jun 2016” LAMMPS built with +Intel Parallel Studio 2016 update 3. Results are with 1 MPI task +per physical core. See src/USER-INTEL/TEST/README for the raw +simulation rates and instructions to reproduce.

+
+

Quick Start for Experienced Users:

+

LAMMPS should be built with the USER-INTEL package installed. +Simulations should be run with 1 MPI task per physical core, +not hardware thread.

+

For Intel Xeon CPUs:

+
    +
  • Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary.
  • +
  • If using kspace_style pppm in the input script, add “neigh_modify binsize 3” and “kspace_modify diff ad” to the input script for better +performance.
  • +
  • “-pk intel 0 omp 2 -sf intel” added to LAMMPS command-line
  • +
+
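For example, the settings above can be combined as follows (a sketch only: a 2-socket node with 36 physical cores and a binary named lmp_intel_cpu_intelmpi are assumed; adjust the task count to your hardware):

mpirun -np 36 lmp_intel_cpu_intelmpi -sf intel -pk intel 0 omp 2 -in in.script   # 1 MPI task per physical core, 2 OpenMP threads per task via SMT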

For Intel Xeon Phi CPUs for simulations without kspace_style +pppm in the input script

+
    +
  • Edit src/MAKE/OPTIONS/Makefile.knl as necessary.
  • +
  • Runs should be performed using MCDRAM.
  • +
  • “-pk intel 0 omp 2 -sf intel” or “-pk intel 0 omp 4 -sf intel” +should be added to the LAMMPS command-line. Choice for best +performance will depend on the simulation.
  • +
+
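A hypothetical single-node example (assuming 68 physical cores, a binary named lmp_knl, and MCDRAM exposed as NUMA node 1 in Flat mode; verify the node numbering on your system):

mpirun -np 68 numactl --membind=1 lmp_knl -sf intel -pk intel 0 omp 4 -in in.script   # allocate in MCDRAM; --preferred=1 allows fallback to DDR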

For Intel Xeon Phi CPUs for simulations with kspace_style +pppm in the input script:

+
    +
  • Edit src/MAKE/OPTIONS/Makefile.knl as necessary.
  • +
  • Runs should be performed using MCDRAM.
  • +
  • Add “neigh_modify binsize 3” to the input script for better +performance.
  • +
  • Add “kspace_modify diff ad” to the input script for better +performance.
  • +
  • export KMP_AFFINITY=none
  • +
  • “-pk intel 0 omp 3 lrt yes -sf intel” or “-pk intel 0 omp 1 lrt yes +-sf intel” added to LAMMPS command-line. Choice for best performance +will depend on the simulation.
  • +
+
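Putting these recommendations together, a sketch of a run command (again assuming 68 physical cores, a binary named lmp_knl, and that the neigh_modify and kspace_modify lines above are already in in.script):

export KMP_AFFINITY=none
mpirun -np 68 lmp_knl -sf intel -pk intel 0 omp 3 lrt yes -in in.script   # 3 OpenMP threads per task plus 1 LRT thread per task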

For Intel Xeon Phi coprocessors (Offload):

+
    +
  • Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary
  • +
  • “-pk intel N omp 1” added to command-line where N is the number of +coprocessors per node.
  • +
+
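For example, with one coprocessor per node and a binary built from Makefile.intel_coprocessor (assumed here to be named lmp_intel_coprocessor):

mpirun -np 24 -ppn 24 lmp_intel_coprocessor -sf intel -pk intel 1 omp 1 -in in.script   # 24 MPI tasks/node sharing 1 coprocessor, work split between host and Phi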

Required hardware/software:

-

Your compiler must support the OpenMP interface. Use of an Intel(R) -C++ compiler is recommended, but not required. However, g++ will not -recognize some of the settings listed above, so they cannot be used. -Optimizations for vectorization have only been tested with the -Intel(R) compiler. Use of other compilers may not result in -vectorization, or give poor performance.

-

The recommended version of the Intel(R) compiler is 14.0.1.106. -Versions 15.0.1.133 and later are also supported. If using Intel(R) -MPI, versions 15.0.2.044 and later are recommended.

-

To use the offload option, you must have one or more Intel(R) Xeon -Phi(TM) coprocessors and use an Intel(R) C++ compiler.

+

In order to use offload to coprocessors, an Intel Xeon Phi +coprocessor and an Intel compiler are required. For this, the +recommended version of the Intel compiler is 14.0.1.106 or +versions 15.0.2.044 and higher.

+

Although any compiler can be used with the USER-INTEL package, +vectorization directives are currently disabled by default when +not using Intel compilers, due to the lack of standard support and to +observations of decreased performance. The OpenMP standard now +supports directives for vectorization, and we plan to transition the +code to this standard once it is available in most compilers. We +expect this to allow improved performance and support with other +compilers.

+

For Intel Xeon Phi x200 series processors (code-named Knights +Landing), there are multiple configuration options for the hardware. +For best performance, we recommend that the MCDRAM is configured in +“Flat” mode and with the cluster mode set to “Quadrant” or “SNC4”. +“Cache” mode can also be used, although the performance might be +slightly lower.
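A quick way to verify how a Knights Landing node is configured (assuming
numactl is installed) is to list the NUMA topology; in the common
Flat/Quadrant setup the MCDRAM typically appears as a CPU-less NUMA node
of about 16 GB:

numactl -H     # Flat/Quadrant: node 0 = DDR plus all CPUs, node 1 = 16 GB MCDRAM with no CPUs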

+

Notes about Simultaneous Multithreading:

+

Modern CPUs often support Simultaneous Multithreading (SMT). On +Intel processors, this is called Hyper-Threading (HT) technology. +SMT is hardware support for running multiple threads efficiently on +a single core. Hardware threads or logical cores are often used +to refer to the number of threads that are supported in hardware. +For example, the Intel Xeon E5-2697v4 processor is described +as having 36 cores and 72 threads. This means that 36 MPI processes +or OpenMP threads can run simultaneously on separate cores, but that +up to 72 MPI processes or OpenMP threads can be running on the CPU +without costly operating system context switches.

+

Molecular dynamics simulations will often run faster when making use +of SMT. If a thread becomes stalled, for example because it is +waiting on data that has not yet arrived from memory, another thread +can start running so that the CPU pipeline is still being used +efficiently. Although benefits can be seen by launching an MPI task +for every hardware thread, for multinode simulations, we recommend +that OpenMP threads be used for SMT instead, either with the +USER-INTEL package, USER-OMP package, or +KOKKOS package. In the example above, up +to 36X speedups can be observed by using all 36 physical cores with +LAMMPS. By using all 72 hardware threads, an additional 10-30% +performance gain can be achieved.

+

The BIOS on many platforms allows SMT to be disabled; however, we do +not recommend this on modern processors, as there is little to no +benefit for any software package in most cases. The operating system +will report every hardware thread as a separate core, allowing one to +determine the number of hardware threads available. On Linux systems, +this information can normally be obtained with:

+
cat /proc/cpuinfo
+
+
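If lscpu is available (an assumption; it ships with util-linux on most
distributions), the same information can be summarized more compactly.
Physical cores = Sockets x Cores per socket; hardware threads =
physical cores x Threads per core.

grep -c ^processor /proc/cpuinfo       # total number of hardware threads
lscpu | grep -E "Socket|Core|Thread"   # sockets, cores per socket, threads per core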

Building LAMMPS with the USER-INTEL package:

-

The lines above illustrate how to include/build with the USER-INTEL -package, for either CPU or Phi support, in two steps, using the “make” -command. Or how to do it with one command via the src/Make.py script, -described in Section 2.4 of the manual. -Type “Make.py -h” for help. Because the mechanism for specifing what -compiler to use (Intel in this case) is different for different MPI -wrappers, 3 versions of the Make.py command are shown.

+

The USER-INTEL package must be installed into the source directory:

+
make yes-user-intel
+
+
+

Several example Makefiles for building with the Intel compiler are +included with LAMMPS in the src/MAKE/OPTIONS/ directory:

+
Makefile.intel_cpu_intelmpi # Intel Compiler, Intel MPI, No Offload
+Makefile.knl                # Intel Compiler, Intel MPI, No Offload
+Makefile.intel_cpu_mpich    # Intel Compiler, MPICH, No Offload
Makefile.intel_cpu_openmpi  # Intel Compiler, OpenMPI, No Offload
+Makefile.intel_coprocessor  # Intel Compiler, Intel MPI, Offload
+
+
+

Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that +it explicitly specifies that vectorization should be for Intel +Xeon Phi x200 processors, making it easier to cross-compile. For +users with recent installations of Intel Parallel Studio, the +process can be as simple as:

+
make yes-user-intel
+source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh
+# or psxevars.csh for C-shell
+make intel_cpu_intelmpi
+
+
+

Alternatively, the build can be accomplished with the src/Make.py +script, described in Section 2.4 of the +manual. Type “Make.py -h” for help. For example:

+
Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi
+
+

Note that if you build with support for a Phi coprocessor, the same binary can be used on nodes with or without coprocessors installed. However, if you do not have coprocessors on your system, building without offload support will produce a smaller binary.

-

If you also build with the USER-OMP package, you can use styles from -both packages, as described below.

-

Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must -include “-fopenmp”. Likewise, if you use an Intel compiler, the -CCFLAGS setting must include “-restrict”. For Phi support, the -“-DLMP_INTEL_OFFLOAD” (CCFLAGS) and “-offload” (LINKFLAGS) settings -are required. The other settings listed above are optional, but will -typically improve performance. The Make.py command will add all of -these automatically.

-

If you are compiling on the same architecture that will be used for -the runs, adding the flag -xHost to CCFLAGS enables vectorization -with the Intel(R) compiler. Otherwise, you must provide the correct -compute node architecture to the -x option (e.g. -xAVX).

-

Example machines makefiles Makefile.intel_cpu and Makefile.intel_phi -are included in the src/MAKE/OPTIONS directory with settings that -perform well with the Intel(R) compiler. The latter has support for -offload to Phi coprocessors; the former does not.

-

Run with the USER-INTEL package from the command line:

-

The mpirun or mpiexec command sets the total number of MPI tasks used -by LAMMPS (one or multiple per compute node) and the number of MPI -tasks used per node. E.g. the mpirun command in MPICH does this via -its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode.

-

If you compute (any portion of) pairwise interactions using USER-INTEL -pair styles on the CPU, or use USER-OMP styles on the CPU, you need to -choose how many OpenMP threads per MPI task to use. If both packages -are used, it must be done for both packages, and the same thread count -value should be used for both. Note that the product of MPI tasks * -threads/task should not exceed the physical number of cores (on a -node), otherwise performance will suffer.

-

When using the USER-INTEL package for the Phi, you also need to -specify the number of coprocessor/node and optionally the number of -coprocessor threads per MPI task to use. Note that coprocessor -threads (which run on the coprocessor) are totally independent from -OpenMP threads (which run on the CPU). The default values for the -settings that affect coprocessor threads are typically fine, as -discussed below.

-

As in the lines above, use the “-sf intel” or “-sf hybrid intel omp” -command-line switch, which will -automatically append “intel” to styles that support it. In the second -case, “omp” will be appended if an “intel” style does not exist.

-

Note that if either switch is used, it also invokes a default command: -package intel 1. If the “-sf hybrid intel omp” switch -is used, the default USER-OMP command package omp 0 is -also invoked (if LAMMPS was built with USER-OMP). Both set the number -of OpenMP threads per MPI task via the OMP_NUM_THREADS environment -variable. The first command sets the number of Xeon Phi(TM) -coprocessors/node to 1 (ignored if USER-INTEL is built for CPU-only), -and the precision mode to “mixed” (default value).

-

You can also use the “-pk intel Nphi” command-line switch to explicitly set Nphi = # of Xeon -Phi(TM) coprocessors/node, as well as additional options. Nphi should -be >= 1 if LAMMPS was built with coprocessor support, otherswise Nphi -= 0 for a CPU-only build. All the available coprocessor threads on -each Phi will be divided among MPI tasks, unless the tptask option -of the “-pk intel” command-line switch is -used to limit the coprocessor threads per MPI task. See the package intel command for details, including the default values -used for all its options if not specified, and how to set the number -of OpenMP threads via the OMP_NUM_THREADS environment variable if -desired.

-

If LAMMPS was built with the USER-OMP package, you can also use the -“-pk omp Nt” command-line switch to -explicitly set Nt = # of OpenMP threads per MPI task to use, as well -as additional options. Nt should be the same threads per MPI task as -set for the USER-INTEL package, e.g. via the “-pk intel Nphi omp Nt” -command. Again, see the package omp command for -details, including the default values used for all its options if not -specified, and how to set the number of OpenMP threads via the -OMP_NUM_THREADS environment variable if desired.

-

Or run with the USER-INTEL package by editing an input script:

-

The discussion above for the mpirun/mpiexec command, MPI tasks/node, -OpenMP threads per MPI task, and coprocessor threads per MPI task is -the same.

-

Use the suffix intel or suffix hybrid intel omp commands, or you can explicitly add an “intel” or -“omp” suffix to individual styles in your input script, e.g.

-
pair_style lj/cut/intel 2.5
-
+

The general requirements for Makefiles with the USER-INTEL package +are as follows. “-DLAMMPS_MEMALIGN=64” is required for CCFLAGS. When +using Intel compilers, “-restrict” is required and “-qopenmp” is +highly recommended for CCFLAGS and LINKFLAGS. LIB should include +“-ltbbmalloc”. For builds supporting offload, “-DLMP_INTEL_OFFLOAD” +is required for CCFLAGS and “-qoffload” is required for LINKFLAGS. +Other recommended CCFLAGS options for best performance are +“-O2 -fno-alias -ansi-alias -qoverride-limits -fp-model fast=2 +-no-prec-div”. The Make.py command will add all of these +automatically.
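As an illustrative sketch only (the compiler wrapper name and exact flag
set are assumptions; the makefiles shipped in src/MAKE/OPTIONS are the
authoritative reference), the relevant variables in a Makefile.machine
for a CPU-only Intel MPI build might look like:

CC =        mpiicpc
CCFLAGS =   -qopenmp -DLAMMPS_MEMALIGN=64 -restrict -xHost -O2 -fno-alias \
            -ansi-alias -qoverride-limits -fp-model fast=2 -no-prec-div
LINK =      mpiicpc
LINKFLAGS = -qopenmp -xHost -O2
LIB =       -ltbbmalloc
# offload builds additionally need -DLMP_INTEL_OFFLOAD in CCFLAGS and -qoffload in LINKFLAGS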

+
+

Note

+

The vectorization and math capabilities can differ depending on +the CPU. For Intel compilers, the “-x” flag specifies the type of +processor for which to optimize. “-xHost” specifies that the compiler +should build for the processor used for compiling. For Intel Xeon Phi +x200 series processors, this option is “-xMIC-AVX512”. For fourth +generation Intel Xeon (v4/Broadwell) processors, “-xCORE-AVX2” should +be used. For older Intel Xeon processors, “-xAVX” will perform best +in general for the different simulations in LAMMPS. The default +in most of the example Makefiles is to use “-xHost”; however, this +should not be used when cross-compiling.

-

You must also use the package intel command, unless the -“-sf intel” or “-pk intel” command-line switches were used. It specifies how many -coprocessors/node to use, as well as other OpenMP threading and -coprocessor options. The package doc page explains how -to set the number of OpenMP threads via an environment variable if -desired.

-

If LAMMPS was also built with the USER-OMP package, you must also use -the package omp command to enable that package, unless -the “-sf hybrid intel omp” or “-pk omp” command-line switches were used. It specifies how many -OpenMP threads per MPI task to use (should be same as the setting for -the USER-INTEL package), as well as other options. Its doc page -explains how to set the number of OpenMP threads via an environment -variable if desired.

-

Speed-ups to expect:

-

If LAMMPS was not built with coprocessor support (CPU only) when -including the USER-INTEL package, then acclerated styles will run on -the CPU using vectorization optimizations and the specified precision. -This may give a substantial speed-up for a pair style, particularly if -mixed or single precision is used.

-

If LAMMPS was built with coproccesor support, the pair styles will run -on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The -performance of a Xeon Phi versus a multi-core CPU is a function of -your hardware, which pair style is used, the number of -atoms/coprocessor, and the precision used on the coprocessor (double, -single, mixed).

-

See the Benchmark page of the -LAMMPS web site for performance of the USER-INTEL package on different -hardware.

+

Running LAMMPS with the USER-INTEL package:

+

Running LAMMPS with the USER-INTEL package is similar to normal use, +with the exceptions that one should 1) specify that LAMMPS use +the USER-INTEL package, 2) specify the number of OpenMP threads, and +3) optionally specify which LAMMPS styles should use the +USER-INTEL package. Steps 1) and 2) can be performed from the command line +or by editing the input script; step 3) requires editing the input script. +Advanced performance tuning options to get +the best performance are also described below.

+

When running on a single node (including runs using offload to a +coprocessor), best performance is normally obtained by using 1 MPI +task per physical core and additional OpenMP threads with SMT. For +Intel Xeon processors, 2 OpenMP threads should be used for SMT. +For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used +(the best choice depends on the simulation). When LRT mode +(described below) is used, 1 or 3 OpenMP +threads should be used instead. For multi-node runs, using 1 MPI task per +physical core will often perform best; however, depending on the +machine and scale, users might get better performance by decreasing +the number of MPI tasks and using more OpenMP threads. For +performance, the product of the number of MPI tasks and OpenMP +threads should not exceed the number of available hardware threads in +almost all cases.
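For instance, on a hypothetical two-socket node with 36 physical cores
and 72 hardware threads, this guidance translates into something like:

mpirun -np 36 lmp_machine -sf intel -pk intel 0 omp 2 -in in.script   # 36 MPI tasks x 2 OpenMP threads = 72 hardware threads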

Note

Setting core affinity is often used to pin MPI tasks and OpenMP threads to a core or group of cores so that memory access can be uniform. Unless disabled at build time, affinity for MPI tasks and OpenMP threads on the host (CPU) will be set by default on the host -when using offload to a coprocessor. In this case, it is unnecessary +when using offload to a coprocessor. In this case, it is unnecessary to use other methods to control affinity (e.g. taskset, numactl, -I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script with -the no_affinity option to the package intel command -or by disabling the option at build time (by adding --DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile). -Disabling this option is not recommended, especially when running on a -machine with hyperthreading disabled.

+I_MPI_PIN_DOMAIN, etc.). This can be disabled with the no_affinity +option to the package intel command or by disabling the +option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the +CCFLAGS line of your Makefile). Disabling this option is not +recommended, especially when running on a machine with Intel +Hyper-Threading technology disabled.

-

Guidelines for best performance on an Intel(R) Xeon Phi(TM) -coprocessor:

-
    -
  • The default for the package intel command is to have -all the MPI tasks on a given compute node use a single Xeon Phi(TM) +

    Run with the USER-INTEL package from the command line:

    +

    To enable USER-INTEL optimizations for all available styles used in +the input script, the “-sf intel” +command-line switch can be used without +any requirement for editing the input script. This switch will +automatically append “intel” to styles that support it. It also +invokes a default command: package intel 1. This +package command is used to set options for the USER-INTEL package. +The default package command will specify that USER-INTEL calculations +are performed in mixed precision, that the number of OpenMP threads +is specified by the OMP_NUM_THREADS environment variable, and that, +if coprocessors are present and the binary was built with offload +support, 1 coprocessor per node will be used with automatic +balancing of work between the CPU and the coprocessor.

    +

    You can specify different options for the USER-INTEL package by using +the “-pk intel Nphi” command-line switch +with keyword/value pairs as specified in the documentation. Here, +Nphi = # of Xeon Phi coprocessors/node (ignored without offload +support). Common options to the USER-INTEL package include omp to +override any OMP_NUM_THREADS setting and specify the number of OpenMP +threads, mode to set the floating-point precision mode, and +lrt to enable Long-Range Thread mode as described below. See the +package intel command for details, including the +default values used for all its options if not specified, and how to +set the number of OpenMP threads via the OMP_NUM_THREADS environment +variable if desired.

    +

    Examples (see the documentation for your MPI/machine for differences in +launching MPI applications):

    +
    mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script                                 # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads
    +mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double   # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision
    +
    +
    +

    Or run with the USER-INTEL package by editing an input script:

    +

    As an alternative to adding command-line arguments, the input script +can be edited to enable the USER-INTEL package. This requires adding +the package intel command to the top of the input +script. For the second example above, this would be:

    +
    package intel 0 omp 2 mode double
    +
    +
    +

    To enable the USER-INTEL package only for individual styles, you can +add an “intel” suffix to the individual style, e.g.:

    +
    pair_style lj/cut/intel 2.5
    +
    +
    +

    Alternatively, the suffix intel command can be added to +the input script to enable USER-INTEL styles for the commands that +follow in the input script.
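    For example, a minimal input-script fragment (the pair style and its
cutoff are placeholders for whatever the script already uses) might read:

    package intel 0 omp 2
    suffix intel
    pair_style lj/cut 2.5        # runs as lj/cut/intel because of the suffix command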

    +

    Tuning for Performance:

    +
    +

    Note

    +

    The USER-INTEL package will perform better with modifications +to the input script when PPPM is used: +kspace_modify diff ad and neigh_modify binsize 3 should be added to the input script.

    +
    +

    Long-Range Thread (LRT) mode is an option to the package intel command that can improve performance when using +PPPM for long-range electrostatics on processors +with SMT. It generates an extra pthread for each MPI task. The thread +is dedicated to performing some of the PPPM calculations and MPI +communications. On Intel Xeon Phi x200 series CPUs, this will likely +always improve performance, even on a single node. On Intel Xeon +processors, using this mode might result in better performance when +using multiple nodes, depending on the machine. To use this mode, +set the number of OpenMP threads to one less than would +normally be used for the run and add the “lrt yes” option to the “-pk” +command-line switch or “package intel” command. For example, if a run +would normally perform best with “-pk intel 0 omp 4”, instead use +“-pk intel 0 omp 3 lrt yes”. When using LRT, you should set the +environment variable “KMP_AFFINITY=none”. LRT mode is not supported +when using offload.
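    As a concrete sketch, if a run on an Intel Xeon Phi x200 node would
normally use 4 OpenMP threads per MPI task (the binary name and the 68
MPI tasks below are placeholders), the corresponding LRT invocation is:

    export KMP_AFFINITY=none
    mpirun -np 68 lmp_knl -sf intel -pk intel 0 omp 3 lrt yes -in in.script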

    +

    Not all styles are supported in the USER-INTEL package. You can mix +the USER-INTEL package with styles from the OPT +package or the USER-OMP package. Of course, +this requires that these packages were installed at build time. This +can be performed automatically by using “-sf hybrid intel opt” or +“-sf hybrid intel omp” command-line options. Alternatively, the “opt” +and “omp” suffixes can be appended manually in the input script. For +the latter, the package omp command must be in the +input script or the “-pk omp Nt” command-line switch must be used where Nt is the +number of OpenMP threads. The number of OpenMP threads should not be +set differently for the different packages. Note that the suffix hybrid intel omp command can also be used within the +input script to automatically append the “omp” suffix to styles when +USER-INTEL styles are not available.
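    For example, assuming LAMMPS was also built with the USER-OMP package
and keeping the same thread count for both packages as described above:

    mpirun -np 36 lmp_machine -sf hybrid intel omp -pk intel 0 omp 2 -pk omp 2 -in in.script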

    +

    When running on many nodes, performance might be better when using +fewer OpenMP threads and more MPI tasks. This will depend on the +simulation and the machine. Using the verlet/split +run style might also give better performance for simulations with +PPPM electrostatics. Note that this is an +alternative to LRT mode and the two cannot be used together.

    +

    Currently, when using Intel MPI with Intel Xeon Phi x200 series +CPUs, better performance might be obtained by setting the +environment variable “I_MPI_SHM_LMT=shm” for Linux kernels that do +not yet have full support for AVX-512. Runs on Intel Xeon Phi x200 +series processors will always perform better using MCDRAM. Please +consult your system documentation for the best approach to specify +that MPI runs are performed in MCDRAM.
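    One possible approach, sketched here under the assumptions that Intel
MPI is used and that the node is configured in Flat/Quadrant mode (where
the MCDRAM is usually exposed as NUMA node 1), is:

    export I_MPI_SHM_LMT=shm
    mpirun -np 68 numactl --preferred=1 lmp_knl -sf intel -pk intel 0 omp 4 -in in.script   # --preferred=1 allocates in MCDRAM and falls back to DDR when it is full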

    +

    Tuning for Offload Performance:

    +

    The default settings for offload should give good performance.

    +

    When using LAMMPS with offload to Intel coprocessors, best performance +will typically be achieved with concurrent calculations performed on +both the CPU and the coprocessor. This is achieved by offloading only +a fraction of the neighbor and pair computations to the coprocessor or +using hybrid pair styles where only one style uses +the “intel” suffix. For simulations with long-range electrostatics or +bond, angle, dihedral, improper calculations, computation and data +transfer to the coprocessor will run concurrently with computations +and MPI communications for these calculations on the host CPU. This +is illustrated in the figure below for the rhodopsin protein benchmark +running on E5-2697v2 processors with an Intel Xeon Phi 7120p +coprocessor. In this plot, the vertical axis is time and routines +running at the same time are running concurrently on both the host and +the coprocessor.

    +_images/offload_knc.png +

    The fraction of the offloaded work is controlled by the balance +keyword in the package intel command. A balance of 0 +runs all calculations on the CPU. A balance of 1 runs all +supported calculations on the coprocessor. A balance of 0.5 runs half +of the calculations on the coprocessor. Setting the balance to -1 +(the default) will enable dynamic load balancing that continuously +adjusts the fraction of offloaded work throughout the simulation. +Because data transfer cannot be timed, this option typically produces +results within 5 to 10 percent of the optimal fixed balance.
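    For example, to offload a fixed 60 percent of the supported work
instead of using dynamic balancing (the 0.6 value, binary name, and task
count are placeholders; the optimal fraction is system- and input-dependent):

    mpirun -np 24 lmp_intel_coprocessor -sf intel -pk intel 1 omp 2 balance 0.6 -in in.script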

    +

    If running short benchmark runs with dynamic load balancing, adding a +short warm-up run (10-20 steps) will allow the load-balancer to find a +near-optimal setting that will carry over to additional runs.
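    A sketch of what this looks like in an input script:

    run 20       # short warm-up so the dynamic load balancer can settle
    run 1000     # timed benchmark run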

    +

    The default for the package intel command is to have +all the MPI tasks on a given compute node use a single Xeon Phi coprocessor. In general, running with a large number of MPI tasks on each node will perform best with offload. Each MPI task will automatically get affinity to a subset of the hardware threads @@ -342,45 +493,30 @@ with 60 cores available for offload and 4 hardware threads per core each MPI task to use a subset of 10 threads on the coprocessor. Fine tuning of the number of threads to use per MPI task or the number of threads to use per core can be accomplished with keyword settings of -the package intel command.

  • -
  • If desired, only a fraction of the pair style computation can be -offloaded to the coprocessors. This is accomplished by using the -balance keyword in the package intel command. A -balance of 0 runs all calculations on the CPU. A balance of 1 runs -all calculations on the coprocessor. A balance of 0.5 runs half of -the calculations on the coprocessor. Setting the balance to -1 (the -default) will enable dynamic load balancing that continously adjusts -the fraction of offloaded work throughout the simulation. This option -typically produces results within 5 to 10 percent of the optimal fixed -balance.
  • -
  • When using offload with CPU hyperthreading disabled, it may help +the package intel command.

    +

    The USER-INTEL package has two modes for deciding which atoms will be +handled by the coprocessor. This choice is controlled with the ghost +keyword of the package intel command. When set to 0, +ghost atoms (atoms at the borders between MPI tasks) are not offloaded +to the card. This allows for overlap of MPI communication of forces +with computation on the coprocessor when the newton +setting is “on”. The default depends on the style being used; +however, better performance may be achieved by setting this option +explicitly.
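    For example, to explicitly disable offloading of ghost atoms (the
accepted values for this keyword are listed on the package command doc
page; the other settings here are placeholders):

    package intel 1 omp 2 ghost no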

    +

    When using offload with CPU Hyper-Threading disabled, it may help performance to use fewer MPI tasks and OpenMP threads than available cores. This is due to the fact that additional threads are generated -internally to handle the asynchronous offload tasks.

  • -
  • If running short benchmark runs with dynamic load balancing, adding a -short warm-up run (10-20 steps) will allow the load-balancer to find a -near-optimal setting that will carry over to additional runs.
  • -
  • If pair computations are being offloaded to an Intel(R) Xeon Phi(TM) +internally to handle the asynchronous offload tasks.

    +

    If pair computations are being offloaded to an Intel Xeon Phi coprocessor, a diagnostic line is printed to the screen (not to the log file), during the setup phase of a run, indicating that offload mode is being used and indicating the number of coprocessor threads per MPI task. Additionally, an offload timing summary is printed at the end of each run. When offloading, the frequency for atom sorting is changed to 1 so that the per-atom data is -effectively sorted at every rebuild of the neighbor lists.

  • -
  • For simulations with long-range electrostatics or bond, angle, -dihedral, improper calculations, computation and data transfer to the -coprocessor will run concurrently with computations and MPI -communications for these calculations on the host CPU. The USER-INTEL -package has two modes for deciding which atoms will be handled by the -coprocessor. This choice is controlled with the ghost keyword of -the package intel command. When set to 0, ghost atoms -(atoms at the borders between MPI tasks) are not offloaded to the -card. This allows for overlap of MPI communication of forces with -computation on the coprocessor when the newton setting -is “on”. The default is dependent on the style being used, however, -better performance may be achieved by setting this option -explictly.
  • -
+effectively sorted at every rebuild of the neighbor lists. All the +available coprocessor threads on each Phi will be divided among MPI +tasks, unless the tptask option of the “-pk intel” command-line switch is used to limit the coprocessor +threads per MPI task.

Restrictions

When offloading to a coprocessor, hybrid styles @@ -394,6 +530,12 @@ the pair styles in the USER-INTEL package currently support the “inner”, “middle”, “outer” options for rRESPA integration via the run_style respa command; only the “pair” option is supported.

+

References:

+
    +
  • Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakkar, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., “Optimizing Classical Molecular Dynamics in LAMMPS,” in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann.
  • +
  • Brown, W. M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press.
  • +
  • Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101.
  • +
diff --git a/doc/html/fix_shardlow.html b/doc/html/fix_shardlow.html index 3f9d81d749..e54353f7eb 100644 --- a/doc/html/fix_shardlow.html +++ b/doc/html/fix_shardlow.html @@ -180,8 +180,6 @@ integration, e.g. -

This fix also requires atom_style dpd to be used -due to shared data structures.

Description

-

Select the level of detail at which LAMMPS performs its CPU timings.

+

Select the level of detail at which LAMMPS performs its CPU timings. +Multiple keywords can be specified with the timer command. For +keywords that are mutually exclusive, the last one specified takes +effect.

During a simulation run LAMMPS collects information about how much time is spent in different sections of the code and thus can provide information for determining performance and load imbalance problems. @@ -174,26 +177,41 @@ call which measures load imbalance more accurately, though it can also slow down the simulation. Using the nosync setting (which is the default) turns off this synchronization.

With the timeout keyword a walltime limit can be imposed that -affects the run and minimize commands. If -the time limit is reached, the run or energy minimization will exit on -the next step or iteration that is a multiple of the Ncheck value -specified with the every keyword. All subsequent run or minimize -commands in the input script will be skipped until the timeout is -reset or turned off by a new timer command. The timeout elapse -value can be specified as off or unlimited to impose no timeout -condition (which is the default). The elapse setting can be -specified as a single number for seconds, two numbers separated by a -colon (MM:SS) for minutes and seconds, or as three numbers separated -by colons for hours, minutes, and seconds.

+affects the run and minimize commands. +This can be convenient when runs have to conform to time limits, +e.g. when running under a batch system and you want to maximize +the utilization of the batch time slot, especially when the time +per timestep varies and it is thus difficult to predict how many +steps a simulation can perform, or for difficult-to-converge +minimizations. The timeout elapse value should be somewhat smaller +than the time requested from the batch system, as there is usually +some overhead to launch jobs, and it may be advisable to write +out a restart after terminating a run due to a timeout.

+

The timeout timer starts when the command is issued. When the time +limit is reached, the run or energy minimization will exit on the +next step or iteration that is a multiple of the Ncheck value, +which can be set with the every keyword. The default is to check +every 10 steps. After the timeout has expired, all subsequent +run or minimize commands in the input script will be skipped. +The remaining time or timer status can be accessed with the +thermo keyword timeremain: it will be +zero if the timeout is inactive (the default setting), +negative if the timeout has expired, and positive if there +is time remaining, in which case its value is +the number of seconds remaining.

+

When the timeout keyword is used a second time, the timer is +restarted with a new time limit. The timeout elapse value can +be specified as off or unlimited to impose no timeout condition +(which is the default). The elapse setting can be specified as +a single number for seconds, two numbers separated by a colon (MM:SS) +for minutes and seconds, or as three numbers separated by colons for +hours, minutes, and seconds (H:MM:SS).
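For example, for a batch job with a 24-hour wall-clock limit, one might
leave some margin for startup and cleanup and write a restart when the
run stops early (the times, step count, and file name are placeholders):

timer timeout 23:30:00 every 100
run 100000000
write_restart restart.timeout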

The every keyword sets how frequently the wall clock will be checked during a run or energy minimization. This check count applies to the outer iterations or time steps during minimizations or r-RESPA runs, respectively. Checking for timeout too often can slow a calculation down. Checking too infrequently can make the timeout measurement less accurate, with the run being stopped later than desired.

-

Multiple keywords can be specified with the timer command. For -keywords that are mutually exclusive, the last one specified takes -effect.

Note

Using the full and sync options provides the most detailed diff --git a/doc/src/JPG/offload_knc.png b/doc/src/JPG/offload_knc.png new file mode 100755 index 0000000000..0c4028a08d Binary files /dev/null and b/doc/src/JPG/offload_knc.png differ diff --git a/doc/src/JPG/user_intel.png b/doc/src/JPG/user_intel.png new file mode 100755 index 0000000000..0ebb2d1ae0 Binary files /dev/null and b/doc/src/JPG/user_intel.png differ diff --git a/doc/src/Manual.txt b/doc/src/Manual.txt index a14acf7a7b..e700ce2c41 100644 --- a/doc/src/Manual.txt +++ b/doc/src/Manual.txt @@ -1,7 +1,7 @@ LAMMPS-ICMS Users Manual - + @@ -21,7 +21,7 @@

LAMMPS-ICMS Documentation :c,h3 -28 Jun 2016 version :c,h4 +1 Jul 2016 version :c,h4 Version info: :h4 diff --git a/doc/src/Section_intro.txt b/doc/src/Section_intro.txt index 81b85f1e87..aa31153386 100644 --- a/doc/src/Section_intro.txt +++ b/doc/src/Section_intro.txt @@ -487,7 +487,7 @@ LAMMPS. If you use LAMMPS results in your published work, please cite this paper and include a pointer to the "LAMMPS WWW Site"_lws (http://lammps.sandia.gov): -S. J. Plimpton, [Fast Parallel Algorithms for Short-Range Molecular +S. Plimpton, [Fast Parallel Algorithms for Short-Range Molecular Dynamics], J Comp Phys, 117, 1-19 (1995). Other papers describing specific algorithms used in LAMMPS are listed diff --git a/doc/src/accelerate_intel.txt b/doc/src/accelerate_intel.txt index 379758f71c..c97b19b67d 100644 --- a/doc/src/accelerate_intel.txt +++ b/doc/src/accelerate_intel.txt @@ -11,247 +11,399 @@ 5.3.2 USER-INTEL package :h4 -The USER-INTEL package was developed by Mike Brown at Intel +The USER-INTEL package is maintained by Mike Brown at Intel Corporation. It provides two methods for accelerating simulations, depending on the hardware you have. The first is acceleration on -Intel(R) CPUs by running in single, mixed, or double precision with -vectorization. The second is acceleration on Intel(R) Xeon Phi(TM) +Intel CPUs by running in single, mixed, or double precision with +vectorization. The second is acceleration on Intel Xeon Phi coprocessors via offloading neighbor list and non-bonded force calculations to the Phi. The same C++ code is used in both cases. When offloading to a coprocessor from a CPU, the same routine is run -twice, once on the CPU and once with an offload flag. +twice, once on the CPU and once with an offload flag. This allows +LAMMPS to run on the CPU cores and coprocessor cores simulataneously. -Note that the USER-INTEL package supports use of the Phi in "offload" -mode, not "native" mode like the "KOKKOS -package"_accelerate_kokkos.html. +[Currently Available USER-INTEL Styles:] -Also note that the USER-INTEL package can be used in tandem with the -"USER-OMP package"_accelerate_omp.html. This is useful when -offloading pair style computations to the Phi, so that other styles -not supported by the USER-INTEL package, e.g. bond, angle, dihedral, -improper, and long-range electrostatics, can run simultaneously in -threaded mode on the CPU cores. Since less MPI tasks than CPU cores -will typically be invoked when running with coprocessors, this enables -the extra CPU cores to be used for useful computation. +Angle Styles: charmm, harmonic :ulb,l +Bond Styles: fene, harmonic :l +Dihedral Styles: charmm, harmonic, opls :l +Fixes: nve, npt, nvt, nvt/sllod :l +Improper Styles: cvff, harmonic :l +Pair Styles: buck/coul/cut, buck/coul/long, buck, gayberne, +charmm/coul/long, lj/cut, lj/cut/coul/long, sw, tersoff :l +K-Space Styles: pppm :l,ule -As illustrated below, if LAMMPS is built with both the USER-INTEL and -USER-OMP packages, this dual mode of operation is made easier to use, -via the "-suffix hybrid intel omp" "command-line -switch"_Section_start.html#start_7 or the "suffix hybrid intel -omp"_suffix.html command. Both set a second-choice suffix to "omp" so -that styles from the USER-INTEL package will be used if available, -with styles from the USER-OMP package as a second choice. +[Speed-ups to expect:] -Here is a quick overview of how to use the USER-INTEL package for CPU -acceleration, assuming one or more 16-core nodes. More details -follow. 
+The speedups will depend on your simulation, the hardware, which +styles are used, the number of atoms, and the floating-point +precision mode. Performance improvements are shown compared to +LAMMPS {without using other acceleration packages} as these are +under active development (and subject to performance changes). The +measurements were performed using the input files available in +the src/USER-INTEL/TEST directory. These are scalable in size; the +results given are with 512K particles (524K for Liquid Crystal). +Most of the simulations are standard LAMMPS benchmarks (indicated +by the filename extension in parenthesis) with modifications to the +run length and to add a warmup run (for use with offload +benchmarks). -use an Intel compiler -use these CCFLAGS settings in Makefile.machine: -fopenmp, -DLAMMPS_MEMALIGN=64, -restrict, -xHost, -fno-alias, -ansi-alias, -override-limits -use these LINKFLAGS settings in Makefile.machine: -fopenmp, -xHost -make yes-user-intel yes-user-omp # including user-omp is optional -make mpi # build with the USER-INTEL package, if settings (including compiler) added to Makefile.mpi -make intel_cpu # or Makefile.intel_cpu already has settings, uses Intel MPI wrapper -Make.py -v -p intel omp -intel cpu -a file mpich_icc # or one-line build via Make.py for MPICH -Make.py -v -p intel omp -intel cpu -a file ompi_icc # or for OpenMPI -Make.py -v -p intel omp -intel cpu -a file intel_cpu # or for Intel MPI wrapper :pre +:c,image(JPG/user_intel.png) -lmp_machine -sf intel -pk intel 0 omp 16 -in in.script # 1 node, 1 MPI task/node, 16 threads/task, no USER-OMP -mpirun -np 32 lmp_machine -sf intel -in in.script # 2 nodess, 16 MPI tasks/node, no threads, no USER-OMP -lmp_machine -sf hybrid intel omp -pk intel 0 omp 16 -pk omp 16 -in in.script # 1 node, 1 MPI task/node, 16 threads/task, with USER-OMP -mpirun -np 32 -ppn 4 lmp_machine -sf hybrid intel omp -pk omp 4 -pk omp 4 -in in.script # 8 nodes, 4 MPI tasks/node, 4 threads/task, with USER-OMP :pre +Results are speedups obtained on Intel Xeon E5-2697v4 processors +(code-named Broadwell) and Intel Xeon Phi 7250 processors +(code-named Knights Landing) with "18 Jun 2016" LAMMPS built with +Intel Parallel Studio 2016 update 3. Results are with 1 MPI task +per physical core. See {src/USER-INTEL/TEST/README} for the raw +simulation rates and instructions to reproduce. -Here is a quick overview of how to use the USER-INTEL package for the -same CPUs as above (16 cores/node), with an additional Xeon Phi(TM) -coprocessor per node. More details follow. +:line -Same as above for building, with these additions/changes: -add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in Makefile.machine -add the flag -offload to LINKFLAGS in Makefile.machine -for Make.py change "-intel cpu" to "-intel phi", and "file intel_cpu" to "file intel_phi" :pre +[Quick Start for Experienced Users:] -mpirun -np 32 lmp_machine -sf intel -pk intel 1 -in in.script # 2 nodes, 16 MPI tasks/node, 240 total threads on coprocessor, no USER-OMP -mpirun -np 16 -ppn 8 lmp_machine -sf intel -pk intel 1 omp 2 -in in.script # 2 nodes, 8 MPI tasks/node, 2 threads/task, 240 total threads on coprocessor, no USER-OMP -mpirun -np 32 -ppn 8 lmp_machine -sf hybrid intel omp -pk intel 1 omp 2 -pk omp 2 -in in.script # 4 nodes, 8 MPI tasks/node, 2 threads/task, 240 total threads on coprocessor, with USER-OMP :pre +LAMMPS should be built with the USER-INTEL package installed. +Simulations should be run with 1 MPI task per physical {core}, +not {hardware thread}. 
+ +For Intel Xeon CPUs: + +Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary. :ulb,l +If using {kspace_style pppm} in the input script, add "neigh_modify binsize 3" and "kspace_modify diff ad" to the input script for better +performance. :l +"-pk intel 0 omp 2 -sf intel" added to LAMMPS command-line :l,ule + +For Intel Xeon Phi CPUs for simulations without {kspace_style +pppm} in the input script : + +Edit src/MAKE/OPTIONS/Makefile.knl as necessary. :ulb,l +Runs should be performed using MCDRAM. :l +"-pk intel 0 omp 2 -sf intel" {or} "-pk intel 0 omp 4 -sf intel" +should be added to the LAMMPS command-line. Choice for best +performance will depend on the simulation. :l,ule + +For Intel Xeon Phi CPUs for simulations with {kspace_style +pppm} in the input script: + +Edit src/MAKE/OPTIONS/Makefile.knl as necessary. :ulb,l +Runs should be performed using MCDRAM. :l +Add "neigh_modify binsize 3" to the input script for better +performance. :l +Add "kspace_modify diff ad" to the input script for better +performance. :l +export KMP_AFFINITY=none :l +"-pk intel 0 omp 3 lrt yes -sf intel" or "-pk intel 0 omp 1 lrt yes +-sf intel" added to LAMMPS command-line. Choice for best performance +will depend on the simulation. :l,ule + +For Intel Xeon Phi coprocessors (Offload): + +Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary :ulb,l +"-pk intel N omp 1" added to command-line where N is the number of +coprocessors per node. :l,ule + +:line [Required hardware/software:] -Your compiler must support the OpenMP interface. Use of an Intel(R) -C++ compiler is recommended, but not required. However, g++ will not -recognize some of the settings listed above, so they cannot be used. -Optimizations for vectorization have only been tested with the -Intel(R) compiler. Use of other compilers may not result in -vectorization, or give poor performance. +In order to use offload to coprocessors, an Intel Xeon Phi +coprocessor and an Intel compiler are required. For this, the +recommended version of the Intel compiler is 14.0.1.106 or +versions 15.0.2.044 and higher. -The recommended version of the Intel(R) compiler is 14.0.1.106. -Versions 15.0.1.133 and later are also supported. If using Intel(R) -MPI, versions 15.0.2.044 and later are recommended. +Although any compiler can be used with the USER-INTEL pacakge, +currently, vectorization directives are disabled by default when +not using Intel compilers due to lack of standard support and +observations of decreased performance. The OpenMP standard now +supports directives for vectorization and we plan to transition the +code to this standard once it is available in most compilers. We +expect this to allow improved performance and support with other +compilers. -To use the offload option, you must have one or more Intel(R) Xeon -Phi(TM) coprocessors and use an Intel(R) C++ compiler. +For Intel Xeon Phi x200 series processors (code-named Knights +Landing), there are multiple configuration options for the hardware. +For best performance, we recommend that the MCDRAM is configured in +"Flat" mode and with the cluster mode set to "Quadrant" or "SNC4". +"Cache" mode can also be used, although the performance might be +slightly lower. + +[Notes about Simultaneous Multithreading:] + +Modern CPUs often support Simultaneous Multithreading (SMT). On +Intel processors, this is called Hyper-Threading (HT) technology. +SMT is hardware support for running multiple threads efficiently on +a single core. 
{Hardware threads} or {logical cores} are often used +to refer to the number of threads that are supported in hardware. +For example, the Intel Xeon E5-2697v4 processor is described +as having 36 cores and 72 threads. This means that 36 MPI processes +or OpenMP threads can run simultaneously on separate cores, but that +up to 72 MPI processes or OpenMP threads can be running on the CPU +without costly operating system context switches. + +Molecular dynamics simulations will often run faster when making use +of SMT. If a thread becomes stalled, for example because it is +waiting on data that has not yet arrived from memory, another thread +can start running so that the CPU pipeline is still being used +efficiently. Although benefits can be seen by launching a MPI task +for every hardware thread, for multinode simulations, we recommend +that OpenMP threads are used for SMT instead, either with the +USER-INTEL package, "USER-OMP package"_accelerate_omp.html", or +"KOKKOS package"_accelerate_kokkos.html. In the example above, up +to 36X speedups can be observed by using all 36 physical cores with +LAMMPS. By using all 72 hardware threads, an additional 10-30% +performance gain can be achieved. + +The BIOS on many platforms allows SMT to be disabled, however, we do +not recommend this on modern processors as there is little to no +benefit for any software package in most cases. The operating system +will report every hardware thread as a separate core allowing one to +determine the number of hardware threads available. On Linux systems, +this information can normally be obtained with: + +cat /proc/cpuinfo :pre [Building LAMMPS with the USER-INTEL package:] -The lines above illustrate how to include/build with the USER-INTEL -package, for either CPU or Phi support, in two steps, using the "make" -command. Or how to do it with one command via the src/Make.py script, -described in "Section 2.4"_Section_start.html#start_4 of the manual. -Type "Make.py -h" for help. Because the mechanism for specifing what -compiler to use (Intel in this case) is different for different MPI -wrappers, 3 versions of the Make.py command are shown. +The USER-INTEL package must be installed into the source directory: + +make yes-user-intel :pre + +Several example Makefiles for building with the Intel compiler are +included with LAMMPS in the src/MAKE/OPTIONS/ directory: + +Makefile.intel_cpu_intelmpi # Intel Compiler, Intel MPI, No Offload +Makefile.knl # Intel Compiler, Intel MPI, No Offload +Makefile.intel_cpu_mpich # Intel Compiler, MPICH, No Offload +Makefile.intel_cpu_openpmi # Intel Compiler, OpenMPI, No Offload +Makefile.intel_coprocessor # Intel Compiler, Intel MPI, Offload :pre + +Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that +it explicitly specifies that vectorization should be for Intel +Xeon Phi x200 processors making it easier to cross-compile. For +users with recent installations of Intel Parallel Studio, the +process can be as simple as: + +make yes-user-intel +source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh +# or psxevars.csh for C-shell +make intel_cpu_intelmpi :pre + +Alternatively, the build can be accomplished with the src/Make.py +script, described in "Section 2.4"_Section_start.html#start_4 of the +manual. Type "Make.py -h" for help. For an example: + +Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi :pre Note that if you build with support for a Phi coprocessor, the same binary can be used on nodes with or without coprocessors installed. 
However, if you do not have coprocessors on your system, building without offload support will produce a smaller binary. -If you also build with the USER-OMP package, you can use styles from -both packages, as described below. +The general requirements for Makefiles with the USER-INTEL package +are as follows. "-DLAMMPS_MEMALIGN=64" is required for CCFLAGS. When +using Intel compilers, "-restrict" is required and "-qopenmp" is +highly recommended for CCFLAGS and LINKFLAGS. LIB should include +"-ltbbmalloc". For builds supporting offload, "-DLMP_INTEL_OFFLOAD" +is required for CCFLAGS and "-qoffload" is required for LINKFLAGS. +Other recommended CCFLAG options for best performance are +"-O2 -fno-alias -ansi-alias -qoverride-limits fp-model fast=2 +-no-prec-div". The Make.py command will add all of these +automatically. -Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must -include "-fopenmp". Likewise, if you use an Intel compiler, the -CCFLAGS setting must include "-restrict". For Phi support, the -"-DLMP_INTEL_OFFLOAD" (CCFLAGS) and "-offload" (LINKFLAGS) settings -are required. The other settings listed above are optional, but will -typically improve performance. The Make.py command will add all of -these automatically. +NOTE: The vectorization and math capabilities can differ depending on +the CPU. For Intel compilers, the "-x" flag specifies the type of +processor for which to optimize. "-xHost" specifies that the compiler +should build for the processor used for compiling. For Intel Xeon Phi +x200 series processors, this option is "-xMIC-AVX512". For fourth +generation Intel Xeon (v4/Broadwell) processors, "-xCORE-AVX2" should +be used. For older Intel Xeon processors, "-xAVX" will perform best +in general for the different simulations in LAMMPS. The default +in most of the example Makefiles is to use "-xHost", however this +should not be used when cross-compiling. + +[Running LAMMPS with the USER-INTEL package:] -If you are compiling on the same architecture that will be used for -the runs, adding the flag {-xHost} to CCFLAGS enables vectorization -with the Intel(R) compiler. Otherwise, you must provide the correct -compute node architecture to the -x option (e.g. -xAVX). +Running LAMMPS with the USER-INTEL package is similar to normal use +with the exceptions that one should 1) specify that LAMMPS should use +the USER-INTEL package, 2) specify the number of OpenMP threads, and +3) optionally specify the specific LAMMPS styles that should use the +USER-INTEL package. 1) and 2) can be performed from the command-line +or by editing the input script. 3) requires editing the input script. +Advanced performance tuning options are also described below to get +the best performance. -Example machines makefiles Makefile.intel_cpu and Makefile.intel_phi -are included in the src/MAKE/OPTIONS directory with settings that -perform well with the Intel(R) compiler. The latter has support for -offload to Phi coprocessors; the former does not. - -[Run with the USER-INTEL package from the command line:] - -The mpirun or mpiexec command sets the total number of MPI tasks used -by LAMMPS (one or multiple per compute node) and the number of MPI -tasks used per node. E.g. the mpirun command in MPICH does this via -its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode. - -If you compute (any portion of) pairwise interactions using USER-INTEL -pair styles on the CPU, or use USER-OMP styles on the CPU, you need to -choose how many OpenMP threads per MPI task to use. 
If both packages -are used, it must be done for both packages, and the same thread count -value should be used for both. Note that the product of MPI tasks * -threads/task should not exceed the physical number of cores (on a -node), otherwise performance will suffer. - -When using the USER-INTEL package for the Phi, you also need to -specify the number of coprocessor/node and optionally the number of -coprocessor threads per MPI task to use. Note that coprocessor -threads (which run on the coprocessor) are totally independent from -OpenMP threads (which run on the CPU). The default values for the -settings that affect coprocessor threads are typically fine, as -discussed below. - -As in the lines above, use the "-sf intel" or "-sf hybrid intel omp" -"command-line switch"_Section_start.html#start_7, which will -automatically append "intel" to styles that support it. In the second -case, "omp" will be appended if an "intel" style does not exist. - -Note that if either switch is used, it also invokes a default command: -"package intel 1"_package.html. If the "-sf hybrid intel omp" switch -is used, the default USER-OMP command "package omp 0"_package.html is -also invoked (if LAMMPS was built with USER-OMP). Both set the number -of OpenMP threads per MPI task via the OMP_NUM_THREADS environment -variable. The first command sets the number of Xeon Phi(TM) -coprocessors/node to 1 (ignored if USER-INTEL is built for CPU-only), -and the precision mode to "mixed" (default value). - -You can also use the "-pk intel Nphi" "command-line -switch"_Section_start.html#start_7 to explicitly set Nphi = # of Xeon -Phi(TM) coprocessors/node, as well as additional options. Nphi should -be >= 1 if LAMMPS was built with coprocessor support, otherswise Nphi -= 0 for a CPU-only build. All the available coprocessor threads on -each Phi will be divided among MPI tasks, unless the {tptask} option -of the "-pk intel" "command-line switch"_Section_start.html#start_7 is -used to limit the coprocessor threads per MPI task. See the "package -intel"_package.html command for details, including the default values -used for all its options if not specified, and how to set the number -of OpenMP threads via the OMP_NUM_THREADS environment variable if -desired. - -If LAMMPS was built with the USER-OMP package, you can also use the -"-pk omp Nt" "command-line switch"_Section_start.html#start_7 to -explicitly set Nt = # of OpenMP threads per MPI task to use, as well -as additional options. Nt should be the same threads per MPI task as -set for the USER-INTEL package, e.g. via the "-pk intel Nphi omp Nt" -command. Again, see the "package omp"_package.html command for -details, including the default values used for all its options if not -specified, and how to set the number of OpenMP threads via the -OMP_NUM_THREADS environment variable if desired. - -[Or run with the USER-INTEL package by editing an input script:] - -The discussion above for the mpirun/mpiexec command, MPI tasks/node, -OpenMP threads per MPI task, and coprocessor threads per MPI task is -the same. - -Use the "suffix intel"_suffix.html or "suffix hybrid intel -omp"_suffix.html commands, or you can explicitly add an "intel" or -"omp" suffix to individual styles in your input script, e.g. - -pair_style lj/cut/intel 2.5 :pre - -You must also use the "package intel"_package.html command, unless the -"-sf intel" or "-pk intel" "command-line -switches"_Section_start.html#start_7 were used. 
It specifies how many -coprocessors/node to use, as well as other OpenMP threading and -coprocessor options. The "package"_package.html doc page explains how -to set the number of OpenMP threads via an environment variable if -desired. - -If LAMMPS was also built with the USER-OMP package, you must also use -the "package omp"_package.html command to enable that package, unless -the "-sf hybrid intel omp" or "-pk omp" "command-line -switches"_Section_start.html#start_7 were used. It specifies how many -OpenMP threads per MPI task to use (should be same as the setting for -the USER-INTEL package), as well as other options. Its doc page -explains how to set the number of OpenMP threads via an environment -variable if desired. - -[Speed-ups to expect:] - -If LAMMPS was not built with coprocessor support (CPU only) when -including the USER-INTEL package, then acclerated styles will run on -the CPU using vectorization optimizations and the specified precision. -This may give a substantial speed-up for a pair style, particularly if -mixed or single precision is used. - -If LAMMPS was built with coproccesor support, the pair styles will run -on one or more Intel(R) Xeon Phi(TM) coprocessors (per node). The -performance of a Xeon Phi versus a multi-core CPU is a function of -your hardware, which pair style is used, the number of -atoms/coprocessor, and the precision used on the coprocessor (double, -single, mixed). - -See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the -LAMMPS web site for performance of the USER-INTEL package on different -hardware. +When running on a single node (including runs using offload to a +coprocessor), best performance is normally obtained by using 1 MPI +task per physical core and additional OpenMP threads with SMT. For +Intel Xeon processors, 2 OpenMP threads should be used for SMT. +For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used +(best choice depends on the simulation). In cases where the user +specifies that LRT mode is used (described below), 1 or 3 OpenMP +threads should be used. For multi-node runs, using 1 MPI task per +physical core will often perform best, however, depending on the +machine and scale, users might get better performance by decreasing +the number of MPI tasks and using more OpenMP threads. For +performance, the product of the number of MPI tasks and OpenMP +threads should not exceed the number of available hardware threads in +almost all cases. NOTE: Setting core affinity is often used to pin MPI tasks and OpenMP threads to a core or group of cores so that memory access can be uniform. Unless disabled at build time, affinity for MPI tasks and OpenMP threads on the host (CPU) will be set by default on the host -when using offload to a coprocessor. In this case, it is unnecessary +{when using offload to a coprocessor}. In this case, it is unnecessary to use other methods to control affinity (e.g. taskset, numactl, -I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script with -the {no_affinity} option to the "package intel"_package.html command -or by disabling the option at build time (by adding --DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile). -Disabling this option is not recommended, especially when running on a -machine with hyperthreading disabled. +I_MPI_PIN_DOMAIN, etc.). This can be disabled with the {no_affinity} +option to the "package intel"_package.html command or by disabling the +option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the +CCFLAGS line of your Makefile). 
Disabling this option is not +recommended, especially when running on a machine with Intel +Hyper-Threading technology disabled. -[Guidelines for best performance on an Intel(R) Xeon Phi(TM) -coprocessor:] +[Run with the USER-INTEL package from the command line:] + +To enable USER-INTEL optimizations for all available styles used in +the input script, the "-sf intel" +"command-line switch"_Section_start.html#start_7 can be used without +any requirement for editing the input script. This switch will +automatically append "intel" to styles that support it. It also +invokes a default command: "package intel 1"_package.html. This +package command is used to set options for the USER-INTEL package. +The default package command will specify that USER-INTEL calculations +are performed in mixed precision, that the number of OpenMP threads +is specified by the OMP_NUM_THREADS environment variable, and that +if coprocessors are present and the binary was built with offload +support, that 1 coprocessor per node will be used with automatic +balancing of work between the CPU and the coprocessor. + +You can specify different options for the USER-INTEL package by using +the "-pk intel Nphi" "command-line switch"_Section_start.html#start_7 +with keyword/value pairs as specified in the documentation. Here, +Nphi = # of Xeon Phi coprocessors/node (ignored without offload +support). Common options to the USER-INTEL package include {omp} to +override any OMP_NUM_THREADS setting and specify the number of OpenMP +threads, {mode} to set the floating-point precision mode, and +{lrt} to enable Long-Range Thread mode as described below. See the +"package intel"_package.html command for details, including the +default values used for all its options if not specified, and how to +set the number of OpenMP threads via the OMP_NUM_THREADS environment +variable if desired. + +Examples (see documentation for your MPI/Machine for differences in +launching MPI applications): + +mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads +mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision :pre + +[Or run with the USER-INTEL package by editing an input script:] + +As an alternative to adding command-line arguments, the input script +can be edited to enable the USER-INTEL package. This requires adding +the "package intel"_package.html command to the top of the input +script. For the second example above, this would be: + +package intel 0 omp 2 mode double :pre + +To enable the USER-INTEL package only for individual styles, you can +add an "intel" suffix to the individual style, e.g.: + +pair_style lj/cut/intel 2.5 :pre + +Alternatively, the "suffix intel"_suffix.html command can be added to +the input script to enable USER-INTEL styles for the commands that +follow in the input script. + +[Tuning for Performance:] + +NOTE: The USER-INTEL package will perform better with modifications +to the input script when "PPPM"_kspace_style.html is used: +"kspace_modify diff ad"_kspace_modify.html and "neigh_modify binsize +3"_neigh_modify.html should be added to the input script. + +Long-Range Thread (LRT) mode is an option to the "package +intel"_package.html command that can improve performance when using +"PPPM"_kspace_style.html for long-range electrostatics on processors +with SMT. It generates an extra pthread for each MPI task. 
The thread
+is dedicated to performing some of the PPPM calculations and MPI
+communications. On Intel Xeon Phi x200 series CPUs, this will likely
+always improve performance, even on a single node. On Intel Xeon
+processors, using this mode might result in better performance when
+using multiple nodes, depending on the machine. To use this mode,
+specify that the number of OpenMP threads is one less than would
+normally be used for the run and add the "lrt yes" option to the "-pk"
+command-line switch or the "package intel" command. For example, if a
+run would normally perform best with "-pk intel 0 omp 4", instead use
+"-pk intel 0 omp 3 lrt yes". When using LRT, you should set the
+environment variable "KMP_AFFINITY=none". LRT mode is not supported
+when using offload.
+
+Not all styles are supported in the USER-INTEL package. You can mix
+the USER-INTEL package with styles from the "OPT"_accelerate_opt.html
+package or the "USER-OMP package"_accelerate_omp.html. Of course,
+this requires that these packages were installed at build time. This
+can be performed automatically by using "-sf hybrid intel opt" or
+"-sf hybrid intel omp" command-line options. Alternatively, the "opt"
+and "omp" suffixes can be appended manually in the input script. For
+the latter, the "package omp"_package.html command must be in the
+input script or the "-pk omp Nt" "command-line
+switch"_Section_start.html#start_7 must be used, where Nt is the
+number of OpenMP threads. The number of OpenMP threads should not be
+set differently for the different packages. Note that the "suffix
+hybrid intel omp"_suffix.html command can also be used within the
+input script to automatically append the "omp" suffix to styles when
+USER-INTEL styles are not available.
+
+When running on many nodes, performance might be better when using
+fewer OpenMP threads and more MPI tasks. This will depend on the
+simulation and the machine. Using the "verlet/split"_run_style.html
+run style might also give better performance for simulations with
+"PPPM"_kspace_style.html electrostatics. Note that this is an
+alternative to LRT mode and the two cannot be used together.
+
+Currently, when using Intel MPI with Intel Xeon Phi x200 series
+CPUs, better performance might be obtained by setting the
+environment variable "I_MPI_SHM_LMT=shm" for Linux kernels that do
+not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
+series processors will always perform better using MCDRAM. Please
+consult your system documentation for the best approach to specify
+that MPI runs are performed in MCDRAM.
+
+[Tuning for Offload Performance:]
+
+The default settings for offload should give good performance.
+
+When using LAMMPS with offload to Intel coprocessors, best performance
+will typically be achieved with concurrent calculations performed on
+both the CPU and the coprocessor. This is achieved by offloading only
+a fraction of the neighbor and pair computations to the coprocessor or
+using "hybrid"_pair_hybrid.html pair styles where only one style uses
+the "intel" suffix. For simulations with long-range electrostatics or
+bond, angle, dihedral, improper calculations, computation and data
+transfer to the coprocessor will run concurrently with computations
+and MPI communications for these calculations on the host CPU. This
+is illustrated in the figure below for the rhodopsin protein benchmark
+running on E5-2697v2 processors with an Intel Xeon Phi 7120p
+coprocessor.
In this plot, the vertical axis is time and routines
+shown at the same time are running concurrently on both the host and
+the coprocessor.
+
+:c,image(JPG/offload_knc.png)
+
+The fraction of the offloaded work is controlled by the {balance}
+keyword in the "package intel"_package.html command. A balance of 0
+runs all calculations on the CPU. A balance of 1 runs all
+supported calculations on the coprocessor. A balance of 0.5 runs half
+of the calculations on the coprocessor. Setting the balance to -1
+(the default) will enable dynamic load balancing that continuously
+adjusts the fraction of offloaded work throughout the simulation.
+Because data transfer cannot be timed, this option typically produces
+results within 5 to 10 percent of the optimal fixed balance.
+
+If running short benchmark runs with dynamic load balancing, adding a
+short warm-up run (10-20 steps) will allow the load-balancer to find a
+near-optimal setting that will carry over to additional runs.

The default for the "package intel"_package.html command is to have
-all the MPI tasks on a given compute node use a single Xeon Phi(TM)
+all the MPI tasks on a given compute node use a single Xeon Phi
coprocessor. In general, running with a large number of MPI tasks on
each node will perform best with offload. Each MPI task will
automatically get affinity to a subset of the hardware threads
@@ -261,50 +413,35 @@ with 60 cores available for offload and 4 hardware threads per core
each MPI task to use a subset of 10 threads on the coprocessor. Fine
tuning of the number of threads to use per MPI task or the number of
threads to use per core can be accomplished with keyword settings of
-the "package intel"_package.html command. :ulb,l
+the "package intel"_package.html command.

-If desired, only a fraction of the pair style computation can be
-offloaded to the coprocessors. This is accomplished by using the
-{balance} keyword in the "package intel"_package.html command. A
-balance of 0 runs all calculations on the CPU. A balance of 1 runs
-all calculations on the coprocessor. A balance of 0.5 runs half of
-the calculations on the coprocessor. Setting the balance to -1 (the
-default) will enable dynamic load balancing that continously adjusts
-the fraction of offloaded work throughout the simulation. This option
-typically produces results within 5 to 10 percent of the optimal fixed
-balance. :l
+The USER-INTEL package has two modes for deciding which atoms will be
+handled by the coprocessor. This choice is controlled with the {ghost}
+keyword of the "package intel"_package.html command. When set to 0,
+ghost atoms (atoms at the borders between MPI tasks) are not offloaded
+to the card. This allows for overlap of MPI communication of forces
+with computation on the coprocessor when the "newton"_newton.html
+setting is "on". The default is dependent on the style being used;
+however, better performance may be achieved by setting this option
+explicitly.

-When using offload with CPU hyperthreading disabled, it may help
+When using offload with CPU Hyper-Threading disabled, it may help
performance to use fewer MPI tasks and OpenMP threads than available
cores. This is due to the fact that additional threads are generated
-internally to handle the asynchronous offload tasks. :l
+internally to handle the asynchronous offload tasks. 
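+
+As a concrete illustration of the offload settings discussed above,
+the launch lines below show how the {balance} and {ghost} keywords of
+the "package intel"_package.html command could be passed via the
+"-pk intel" "command-line switch"_Section_start.html#start_7. The
+task counts and values are illustrative only; the best settings
+depend on your hardware and simulation:
+
+mpirun -np 24 lmp_machine -sf intel -in in.script -pk intel 1 balance 0.6          # offload roughly 60% of the supported work to 1 coprocessor/node
+mpirun -np 24 lmp_machine -sf intel -in in.script -pk intel 1 balance -1 ghost 0   # dynamic load balancing, keep ghost atoms on the host :pre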
-If running short benchmark runs with dynamic load balancing, adding a -short warm-up run (10-20 steps) will allow the load-balancer to find a -near-optimal setting that will carry over to additional runs. :l - -If pair computations are being offloaded to an Intel(R) Xeon Phi(TM) +If pair computations are being offloaded to an Intel Xeon Phi coprocessor, a diagnostic line is printed to the screen (not to the log file), during the setup phase of a run, indicating that offload mode is being used and indicating the number of coprocessor threads per MPI task. Additionally, an offload timing summary is printed at the end of each run. When offloading, the frequency for "atom sorting"_atom_modify.html is changed to 1 so that the per-atom data is -effectively sorted at every rebuild of the neighbor lists. :l - -For simulations with long-range electrostatics or bond, angle, -dihedral, improper calculations, computation and data transfer to the -coprocessor will run concurrently with computations and MPI -communications for these calculations on the host CPU. The USER-INTEL -package has two modes for deciding which atoms will be handled by the -coprocessor. This choice is controlled with the {ghost} keyword of -the "package intel"_package.html command. When set to 0, ghost atoms -(atoms at the borders between MPI tasks) are not offloaded to the -card. This allows for overlap of MPI communication of forces with -computation on the coprocessor when the "newton"_newton.html setting -is "on". The default is dependent on the style being used, however, -better performance may be achieved by setting this option -explictly. :l,ule +effectively sorted at every rebuild of the neighbor lists. All the +available coprocessor threads on each Phi will be divided among MPI +tasks, unless the {tptask} option of the "-pk intel" "command-line +switch"_Section_start.html#start_7 is used to limit the coprocessor +threads per MPI task. [Restrictions:] @@ -319,3 +456,15 @@ the pair styles in the USER-INTEL package currently support the "inner", "middle", "outer" options for rRESPA integration via the "run_style respa"_run_style.html command; only the "pair" option is supported. + +[References:] + +Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakker, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., “Optimizing Classical Molecular Dynamics in LAMMPS,” in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann. :ulb,l + +Brown, W. M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press. :l + +Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101. :l,ule + + + + diff --git a/doc/src/molecule.txt b/doc/src/molecule.txt index 3d18d3c688..05971b5ddc 100644 --- a/doc/src/molecule.txt +++ b/doc/src/molecule.txt @@ -324,8 +324,11 @@ N1, N2, N3 are the number of 1-2, 1-3, 1-4 neighbors respectively of this atom within the topology of the molecule. See the "special_bonds"_special_bonds.html doc page for more discussion of 1-2, 1-3, 1-4 neighbors. If this section appears, the Special Bonds -section must also appear. 
If this section is not specied, the -atoms in the molecule will have no special bonds. +section must also appear. + +As explained above, LAMMPS will auto-generate this information if this +section is not specified. If specified, this section will +override what would be auto-generated. :line @@ -342,9 +345,11 @@ values should be the 1-2 neighbors, the next N2 should be the 1-3 neighbors, the last N3 should be the 1-4 neighbors. No atom ID should appear more than once. See the "special_bonds"_special_bonds.html doc page for more discussion of 1-2, 1-3, 1-4 neighbors. If this section -appears, the Special Bond Counts section must also appear. If this -section is not specied, the atoms in the molecule will have no special -bonds. +appears, the Special Bond Counts section must also appear. + +As explained above, LAMMPS will auto-generate this information if this +section is not specified. If specified, this section will override +what would be auto-generated. :line diff --git a/lib/gpu/Makefile.shannon b/lib/gpu/Makefile.shannon new file mode 100644 index 0000000000..2ddb0d1f04 --- /dev/null +++ b/lib/gpu/Makefile.shannon @@ -0,0 +1,50 @@ +# /* ---------------------------------------------------------------------- +# Generic Linux Makefile for CUDA +# - Change CUDA_ARCH for your GPU +# ------------------------------------------------------------------------- */ + +# which file will be copied to Makefile.lammps + +EXTRAMAKE = Makefile.lammps.standard + +CUDA_HOME = ${CUDA_ROOT} +NVCC = nvcc + +# Kepler CUDA +CUDA_ARCH = -arch=sm_35 +# Tesla CUDA +#CUDA_ARCH = -arch=sm_21 +# newer CUDA +#CUDA_ARCH = -arch=sm_13 +# older CUDA +#CUDA_ARCH = -arch=sm_10 -DCUDA_PRE_THREE + +# this setting should match LAMMPS Makefile +# one of LAMMPS_SMALLBIG (default), LAMMPS_BIGBIG and LAMMPS_SMALLSMALL + +LMP_INC = -DLAMMPS_SMALLBIG + +# precision for GPU calculations +# -D_SINGLE_SINGLE # Single precision for all calculations +# -D_DOUBLE_DOUBLE # Double precision for all calculations +# -D_SINGLE_DOUBLE # Accumulation of forces, etc. 
in double + +CUDA_PRECISION = -D_DOUBLE_DOUBLE + +CUDA_INCLUDE = -I$(CUDA_HOME)/include +CUDA_LIB = -L$(CUDA_HOME)/lib64 +CUDA_OPTS = -DUNIX -O3 -Xptxas -v --use_fast_math + +CUDR_CPP = mpic++ -DMPI_GERYON -DUCL_NO_EXIT -DMPICH_IGNORE_CXX_SEEK +CUDR_OPTS = -O2 # -xHost -no-prec-div -ansi-alias + +BIN_DIR = ./ +OBJ_DIR = ./ +LIB_DIR = ./ +AR = ar +BSH = /bin/sh + +CUDPP_OPT = -DUSE_CUDPP -Icudpp_mini + +include Nvidia.makefile + diff --git a/src/BODY/body_nparticle.h b/src/BODY/body_nparticle.h index df968aac09..46903f9657 100644 --- a/src/BODY/body_nparticle.h +++ b/src/BODY/body_nparticle.h @@ -45,7 +45,6 @@ class BodyNparticle : public Body { private: int *imflag; double **imdata; - }; } diff --git a/src/CLASS2/dihedral_class2.cpp b/src/CLASS2/dihedral_class2.cpp index 9e907f87b7..d18d75b155 100644 --- a/src/CLASS2/dihedral_class2.cpp +++ b/src/CLASS2/dihedral_class2.cpp @@ -662,7 +662,8 @@ void DihedralClass2::coeff(int narg, char **arg) } } else if (strcmp(arg[1],"ebt") == 0) { - if (narg != 10) error->all(FLERR,"Incorrect args for dihedral coefficients"); + if (narg != 10) + error->all(FLERR,"Incorrect args for dihedral coefficients"); double f1_1_one = force->numeric(FLERR,arg[2]); double f2_1_one = force->numeric(FLERR,arg[3]); @@ -687,7 +688,8 @@ void DihedralClass2::coeff(int narg, char **arg) } } else if (strcmp(arg[1],"at") == 0) { - if (narg != 10) error->all(FLERR,"Incorrect args for dihedral coefficients"); + if (narg != 10) + error->all(FLERR,"Incorrect args for dihedral coefficients"); double f1_1_one = force->numeric(FLERR,arg[2]); double f2_1_one = force->numeric(FLERR,arg[3]); @@ -924,8 +926,10 @@ void DihedralClass2::read_restart(FILE *fp) void DihedralClass2::write_data(FILE *fp) { for (int i = 1; i <= atom->ndihedraltypes; i++) - fprintf(fp,"%d %g %g %g %g %g %g\n", - i,k1[i],phi1[i],k2[i],phi2[i],k3[i],phi3[i]); + fprintf(fp,"%d %g %g %g %g %g %g\n",i, + k1[i],phi1[i]*180.0/MY_PI, + k2[i],phi2[i]*180.0/MY_PI, + k3[i],phi3[i]*180.0/MY_PI); fprintf(fp,"\nAngleAngleTorsion Coeffs\n\n"); for (int i = 1; i <= atom->ndihedraltypes; i++) diff --git a/src/USER-INTEL/TEST/README b/src/USER-INTEL/TEST/README index cfc5df31f9..cf14fb3237 100644 --- a/src/USER-INTEL/TEST/README +++ b/src/USER-INTEL/TEST/README @@ -1,35 +1,36 @@ ############################################################################# # Benchmarks # -# in.intel.lj - Atomic fluid (LJ Benchmark) -# in.intel.rhodo - Protein (Rhodopsin Benchmark) -# in.intel.lc - Liquid Crystal w/ Gay-Berne potential -# in.intel.sw - Silicon benchmark with Stillinger-Weber -# in.intel.tersoff - Silicon benchmark with Tersoff -# in.intel.water - Coarse-grain water benchmark using Stillinger-Weber +# in.intel.lj - Atomic fluid (LJ Benchmark) +# in.intel.rhodo - Protein (Rhodopsin Benchmark) +# in.intel.lc - Liquid Crystal w/ Gay-Berne potential +# in.intel.sw - Silicon benchmark with Stillinger-Weber +# in.intel.tersoff - Silicon benchmark with Tersoff +# in.intel.water - Coarse-grain water benchmark using Stillinger-Weber # ############################################################################# ############################################################################# -# Expected Timesteps/second on E5-2697v3 with turbo on and HT enabled +# Expected Timesteps/second with turbo on and HT enabled, LAMMPS 18-Jun-2016 # -# in.intel.lj - 131.943 -# in.intel.rhodo - 8.661 -# in.intel.lc - 14.015 -# in.intel.sw - 103.53 -# in.intel.tersoff - 55.525 -# in.intel.water - 44.079 +# Xeon E5-2697v4 Xeon Phi 7250 +# +# in.intel.lj - 
162.764 179.148 +# in.intel.rhodo - 11.633 13.668 +# in.intel.lc - 19.136 24.863 +# in.intel.sw - 139.048 152.026 +# in.intel.tersoff - 82.663 92.985 +# in.intel.water - 59.838 85.704 # ############################################################################# ############################################################################# -# For Haswell and Broadwell architectures, depending on the compiler version, +# For Haswell (Xeon v3) architectures, depending on the compiler version, # it may give better performance to compile for an AVX target (with -xAVX # compiler option) instead of -xHost or -xCORE-AVX2 for some of the -# workloads due to inefficient code generation for gathers. Aside from -# Tersoff, this will not significantly impact performance because FMA -# sensitive routines will still use AVX2 (MKL and SVML detect the processor -# at runtime) +# workloads. In most cases, FMA sensitive routines will still use AVX2 +# (MKL and SVML detect the processor at runtime). For Broadwell (Xeon v4) +# architectures, -xCORE-AVX2 or -xHost will work best for all. ############################################################################# ############################################################################# @@ -86,3 +87,8 @@ mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 0 -sf intel # To run with USER-INTEL and automatic load balancing to 1 coprocessor ############################################################################# mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 1 -sf intel + +############################################################################# +# If using PPPM (in.intel.rhodo) on Intel Xeon Phi x200 series processors +############################################################################# +mpirun -np $LMP_CORES $LMP_BIN -in $bench -log none -pk intel 0 omp 3 lrt yes -sf intel diff --git a/src/USER-INTEL/TEST/in.intel.rhodo b/src/USER-INTEL/TEST/in.intel.rhodo index 37b0d65d7d..7b3b092607 100644 --- a/src/USER-INTEL/TEST/in.intel.rhodo +++ b/src/USER-INTEL/TEST/in.intel.rhodo @@ -7,7 +7,7 @@ variable n index 0 # Use NUMA Mapping for Multi-Node variable b index 3 # Neighbor binsize variable p index 0 # Use Power Measurement variable c index 0 # 1 to use collectives for PPPM -variable d index 0 # 1 to use 'diff ad' for PPPM +variable d index 1 # 1 to use 'diff ad' for PPPM variable x index 4 variable y index 2 diff --git a/src/USER-INTEL/dihedral_charmm_intel.cpp b/src/USER-INTEL/dihedral_charmm_intel.cpp index 82c5bc77db..7e93e319d9 100644 --- a/src/USER-INTEL/dihedral_charmm_intel.cpp +++ b/src/USER-INTEL/dihedral_charmm_intel.cpp @@ -178,6 +178,11 @@ void DihedralCharmmIntel::eval(const int vflag, } } + #if defined(LMP_SIMD_COMPILER_TEST) + #pragma vector aligned + #pragma simd reduction(+:sedihedral, sevdwl, secoul, sv0, sv1, sv2, \ + sv3, sv4, sv5, spv0, spv1, spv2, spv3, spv4, spv5) + #endif for (int n = nfrom; n < nto; n++) { const int i1 = dihedrallist[n].a; const int i2 = dihedrallist[n].b; @@ -237,6 +242,7 @@ void DihedralCharmmIntel::eval(const int vflag, const flt_t s = rg*rabinv*(ax*vb3x + ay*vb3y + az*vb3z); // error check + #ifndef LMP_SIMD_COMPILER_TEST if (c > PTOLERANCE || c < MTOLERANCE) { int me = comm->me; @@ -258,6 +264,7 @@ void DihedralCharmmIntel::eval(const int vflag, me,x[i4].x,x[i4].y,x[i4].z); } } + #endif if (c > (flt_t)1.0) c = (flt_t)1.0; if (c < (flt_t)-1.0) c = (flt_t)-1.0; @@ -337,6 +344,9 @@ void DihedralCharmmIntel::eval(const int vflag, } + #if defined(LMP_SIMD_COMPILER_TEST) + 
#pragma simdoff + #endif { if (NEWTON_BOND || i2 < nlocal) { f[i2].x += f2x; @@ -413,6 +423,9 @@ void DihedralCharmmIntel::eval(const int vflag, } // apply force to each of 4 atoms + #if defined(LMP_SIMD_COMPILER_TEST) + #pragma simdoff + #endif { if (NEWTON_BOND || i1 < nlocal) { f[i1].x += f1x; @@ -668,7 +681,7 @@ void DihedralCharmmIntel::eval(const int vflag, const SIMD_flt_t tcos_shift = SIMD_gather(nmask, cos_shift, type); const SIMD_flt_t tsin_shift = SIMD_gather(nmask, sin_shift, type); const SIMD_flt_t tk = SIMD_gather(nmask, k, type); - const SIMD_int m = SIMD_gather(nmask, multiplicity, type); + const SIMD_int m = SIMD_gatherz_offset(nmask, multiplicity, type); SIMD_flt_t p(one); SIMD_flt_t ddf1(szero); diff --git a/src/USER-INTEL/intel_simd.h b/src/USER-INTEL/intel_simd.h index 3bc99c790f..ac13f1edfd 100644 --- a/src/USER-INTEL/intel_simd.h +++ b/src/USER-INTEL/intel_simd.h @@ -194,6 +194,37 @@ namespace ip_simd { _MM_SCALE_8); } + template + inline SIMD_int SIMD_gatherz_offset(const SIMD_mask &m, const int *p, + const SIMD_int &i) { + } + + template <> + inline SIMD_int SIMD_gatherz_offset(const SIMD_mask &m, const int *p, + const SIMD_int &i) { + return _mm512_mask_i32gather_epi32( _mm512_set1_epi32(0), m, i, p, + _MM_SCALE_4); + } + + template <> + inline SIMD_int SIMD_gatherz_offset(const SIMD_mask &m, const int *p, + const SIMD_int &i) { + return _mm512_mask_i32gather_epi32( _mm512_set1_epi32(0), m, i, p, + _MM_SCALE_8); + } + + inline SIMD_float SIMD_gatherz(const SIMD_mask &m, const float *p, + const SIMD_int &i) { + return _mm512_mask_i32gather_ps( _mm512_set1_ps((float)0), m, i, p, + _MM_SCALE_4); + } + + inline SIMD_double SIMD_gatherz(const SIMD_mask &m, const double *p, + const SIMD_int &i) { + return _mm512_mask_i32logather_pd( _mm512_set1_pd(0.0), m, i, p, + _MM_SCALE_8); + } + // ------- Store Operations inline void SIMD_store(int *p, const SIMD_int &one) { diff --git a/src/atom_vec_body.cpp b/src/atom_vec_body.cpp index 27fdaa1a7b..30efb33e7b 100644 --- a/src/atom_vec_body.cpp +++ b/src/atom_vec_body.cpp @@ -26,10 +26,6 @@ #include "memory.h" #include "error.h" -// debug -#include "update.h" - - using namespace LAMMPS_NS; /* ---------------------------------------------------------------------- */ @@ -199,9 +195,10 @@ void AtomVecBody::copy(int i, int j, int delflag) // if deleting atom J via delflag and J has bonus data, then delete it if (delflag && body[j] >= 0) { - icp->put(bonus[body[j]].iindex); - dcp->put(bonus[body[j]].dindex); - copy_bonus(nlocal_bonus-1,body[j]); + int k = body[j]; + icp->put(bonus[k].iindex); + dcp->put(bonus[k].dindex); + copy_bonus(nlocal_bonus-1,k); nlocal_bonus--; } diff --git a/src/atom_vec_line.cpp b/src/atom_vec_line.cpp index 0839530e4b..0e534577f3 100644 --- a/src/atom_vec_line.cpp +++ b/src/atom_vec_line.cpp @@ -178,7 +178,7 @@ void AtomVecLine::copy(int i, int j, int delflag) /* ---------------------------------------------------------------------- copy bonus data from I to J, effectively deleting the J entry - also reset ine that points to I to now point to J + also reset line that points to I to now point to J ------------------------------------------------------------------------- */ void AtomVecLine::copy_bonus(int i, int j) @@ -195,6 +195,10 @@ void AtomVecLine::copy_bonus(int i, int j) void AtomVecLine::clear_bonus() { nghost_bonus = 0; + + if (atom->nextra_grow) + for (int iextra = 0; iextra < atom->nextra_grow; iextra++) + modify->fix[atom->extra_grow[iextra]]->clear_bonus(); } /* 
---------------------------------------------------------------------- diff --git a/src/atom_vec_tri.cpp b/src/atom_vec_tri.cpp index 7dc65d5f0f..8ffc39cec3 100644 --- a/src/atom_vec_tri.cpp +++ b/src/atom_vec_tri.cpp @@ -206,6 +206,10 @@ void AtomVecTri::copy_bonus(int i, int j) void AtomVecTri::clear_bonus() { nghost_bonus = 0; + + if (atom->nextra_grow) + for (int iextra = 0; iextra < atom->nextra_grow; iextra++) + modify->fix[atom->extra_grow[iextra]]->clear_bonus(); } /* ---------------------------------------------------------------------- diff --git a/src/fix.h b/src/fix.h index a2aa3782cc..6ebeed26b3 100644 --- a/src/fix.h +++ b/src/fix.h @@ -136,6 +136,7 @@ class Fix : protected Pointers { virtual void set_arrays(int) {} virtual void update_arrays(int, int) {} virtual void set_molecule(int, tagint, int, double *, double *, double *) {} + virtual void clear_bonus() {} virtual int pack_border(int, int *, double *) {return 0;} virtual int unpack_border(int, int, double *) {return 0;} diff --git a/src/my_pool_chunk.h b/src/my_pool_chunk.h index e3a1775c13..61e9e604ca 100644 --- a/src/my_pool_chunk.h +++ b/src/my_pool_chunk.h @@ -30,7 +30,7 @@ inputs: methods: T *get(index) = return ptr/index to unused chunk of size maxchunk T *get(N,index) = return ptr/index to unused chunk of size N - minchunk < N < maxchunk required + minchunk <= N <= maxchunk required put(index) = return indexed chunk to pool (same index returned by get) int size() = return total size of allocated pages in bytes public varaibles: @@ -148,8 +148,10 @@ class MyPoolChunk { } // return indexed chunk to pool via free list + // index = -1 if no allocated chunk void put(int index) { + if (index < 0) return; int ipage = index/chunkperpage; int ibin = whichbin[ipage]; nchunk--; diff --git a/src/neighbor.cpp b/src/neighbor.cpp index 8bd70ee9a8..f82a20acd9 100644 --- a/src/neighbor.cpp +++ b/src/neighbor.cpp @@ -24,8 +24,6 @@ #include "neigh_request.h" #include "atom.h" #include "atom_vec.h" -#include "atom_vec_line.h" -#include "atom_vec_tri.h" #include "comm.h" #include "force.h" #include "pair.h" @@ -37,7 +35,6 @@ #include "update.h" #include "respa.h" #include "output.h" -#include "math_extra.h" #include "citeme.h" #include "memory.h" #include "error.h" @@ -98,8 +95,6 @@ Neighbor::Neighbor(LAMMPS *lmp) : Pointers(lmp) maxhold = 0; xhold = NULL; - line_hold = NULL; - tri_hold = NULL; lastcall = -1; // binning @@ -185,8 +180,6 @@ Neighbor::~Neighbor() delete [] fixchecklist; memory->destroy(xhold); - memory->destroy(line_hold); - memory->destroy(tri_hold); memory->destroy(binhead); memory->destroy(bins); @@ -248,15 +241,6 @@ void Neighbor::init() // ------------------------------------------------------------------ // settings - // linetri_flag = 1/2 if atom style allows for lines/tris - - avec_line = (AtomVecLine *) atom->style_match("line"); - avec_tri = (AtomVecTri *) atom->style_match("tri"); - - linetri_flag = 0; - if (avec_line) linetri_flag = 1; - if (avec_tri) linetri_flag = 2; - // bbox lo/hi = bounding box of entire domain, stored by Domain if (triclinic == 0) { @@ -395,14 +379,6 @@ void Neighbor::init() memory->destroy(xhold); maxhold = 0; xhold = NULL; - - if (linetri_flag == 1) { - memory->destroy(line_hold); - line_hold = NULL; - } else if (linetri_flag == 2) { - memory->destroy(tri_hold); - tri_hold = NULL; - } } if (style == NSQ) { @@ -413,7 +389,6 @@ void Neighbor::init() bins = NULL; // for USER-DPD Shardlow Splitting Algorithm (SSA) - memory->destroy(bins_ssa); memory->destroy(binhead_ssa); 
memory->destroy(gbinhead_ssa); @@ -424,18 +399,11 @@ void Neighbor::init() } // 1st time allocation of xhold and bins - // also line/tri hold if linetri_flag is set if (dist_check) { if (maxhold == 0) { maxhold = atom->nmax; memory->create(xhold,maxhold,3,"neigh:xhold"); - if (linetri_flag) { - if (linetri_flag == 1) - memory->create(line_hold,maxhold,4,"neigh:line_hold"); - else - memory->create(tri_hold,maxhold,9,"neigh:tri_hold"); - } } } @@ -1537,25 +1505,12 @@ int Neighbor::check_distance() if (includegroup) nlocal = atom->nfirst; int flag = 0; - for (int i = 0; i < nlocal; i++) { delx = x[i][0] - xhold[i][0]; dely = x[i][1] - xhold[i][1]; delz = x[i][2] - xhold[i][2]; rsq = delx*delx + dely*dely + delz*delz; - if (rsq > deltasq) { - flag = 1; - break; - } - } - - // if line or tri particles: - // also check distance moved by corner pts - // since rotation could mean corners move when x coord does not - - if (!flag && linetri_flag) { - if (linetri_flag == 1) flag = check_distance_line(deltasq); - else flag = check_distance_tri(deltasq); + if (rsq > deltasq) flag = 1; } int flagall; @@ -1564,154 +1519,10 @@ int Neighbor::check_distance() return flagall; } -/* ---------------------------------------------------------------------- - if any line end pt moved deltasq, return 1 -------------------------------------------------------------------------- */ - -int Neighbor::check_distance_line(double deltasq) -{ - double length,theta,dx,dy,rsq; - double endpts[4]; - - AtomVecLine::Bonus *bonus = avec_line->bonus; - double **x = atom->x; - int *line = atom->line; - int nlocal = atom->nlocal; - - for (int i = 0; i < nlocal; i++) { - if (line[i] < 0) continue; - length = bonus[line[i]].length; - theta = bonus[line[i]].theta; - dx = 0.5*length*cos(theta); - dy = 0.5*length*sin(theta); - endpts[0] = x[i][0] - dx; - endpts[1] = x[i][1] - dy; - endpts[2] = x[i][0] + dx; - endpts[3] = x[i][1] + dy; - - dx = endpts[0] - line_hold[i][0]; - dy = endpts[1] - line_hold[i][1]; - rsq = dx*dx + dy*dy; - if (rsq > deltasq) return 1; - - dx = endpts[2] - line_hold[i][2]; - dy = endpts[3] - line_hold[i][3]; - rsq = dx*dx + dy*dy; - if (rsq > deltasq) return 1; - } - - return 0; -} - -/* ---------------------------------------------------------------------- - compute and store current line end pts in line_hold -------------------------------------------------------------------------- */ - -void Neighbor::calculate_endpts() -{ - double length,theta,dx,dy; - double *endpt; - - AtomVecLine::Bonus *bonus = avec_line->bonus; - double **x = atom->x; - int *line = atom->line; - int nlocal = atom->nlocal; - - for (int i = 0; i < nlocal; i++) { - if (line[i] < 0) continue; - endpt = line_hold[i]; - length = bonus[line[i]].length; - theta = bonus[line[i]].theta; - dx = 0.5*length*cos(theta); - dy = 0.5*length*sin(theta); - endpt[0] = x[i][0] - dx; - endpt[1] = x[i][1] - dy; - endpt[2] = x[i][0] + dx; - endpt[3] = x[i][1] + dy; - } -} - -/* ---------------------------------------------------------------------- - if any tri corner pt moved deltasq, return 1 -------------------------------------------------------------------------- */ - -int Neighbor::check_distance_tri(double deltasq) -{ - int ibonus; - double dx,dy,dz,rsq; - double p[3][3],corner[9]; - - AtomVecTri::Bonus *bonus = avec_tri->bonus; - double **x = atom->x; - int *tri = atom->tri; - int nlocal = atom->nlocal; - - for (int i = 0; i < nlocal; i++) { - if (tri[i] < 0) continue; - ibonus = tri[i]; - MathExtra::quat_to_mat(bonus[ibonus].quat,p); - 
MathExtra::matvec(p,bonus[ibonus].c1,&corner[0]); - MathExtra::add3(x[i],&corner[0],&corner[0]); - MathExtra::matvec(p,bonus[ibonus].c2,&corner[3]); - MathExtra::add3(x[i],&corner[3],&corner[3]); - MathExtra::matvec(p,bonus[ibonus].c3,&corner[6]); - MathExtra::add3(x[i],&corner[6],&corner[6]); - - dx = corner[0] - tri_hold[i][0]; - dy = corner[1] - tri_hold[i][1]; - dz = corner[2] - tri_hold[i][2]; - rsq = dx*dx + dy*dy + dz*dz; - if (rsq > deltasq) return 1; - - dx = corner[3] - tri_hold[i][3]; - dy = corner[4] - tri_hold[i][4]; - dz = corner[5] - tri_hold[i][5]; - rsq = dx*dx + dy*dy + dz*dz; - if (rsq > deltasq) return 1; - - dx = corner[6] - tri_hold[i][6]; - dy = corner[7] - tri_hold[i][7]; - dz = corner[8] - tri_hold[i][8]; - rsq = dx*dx + dy*dy + dz*dz; - if (rsq > deltasq) return 1; - } - - return 0; -} - -/* ---------------------------------------------------------------------- - compute and store current tri corner pts in tri_hold -------------------------------------------------------------------------- */ - -void Neighbor::calculate_corners() -{ - int ibonus; - double p[3][3]; - double *corner; - - AtomVecTri::Bonus *bonus = avec_tri->bonus; - double **x = atom->x; - int *tri = atom->tri; - int nlocal = atom->nlocal; - - for (int i = 0; i < nlocal; i++) { - if (tri[i] < 0) continue; - ibonus = tri[i]; - corner = tri_hold[i]; - MathExtra::quat_to_mat(bonus[ibonus].quat,p); - MathExtra::matvec(p,bonus[ibonus].c1,&corner[0]); - MathExtra::add3(x[i],&corner[0],&corner[0]); - MathExtra::matvec(p,bonus[ibonus].c2,&corner[3]); - MathExtra::add3(x[i],&corner[3],&corner[3]); - MathExtra::matvec(p,bonus[ibonus].c3,&corner[6]); - MathExtra::add3(x[i],&corner[6],&corner[6]); - } -} - /* ---------------------------------------------------------------------- build perpetual neighbor lists called at setup and every few timesteps during run or minimization - topology lists also built if topoflag = 1, USER-CUDA called with tflag = 0 + topology lists also built if topoflag = 1, USER-CUDA calls with topoflag = 0 ------------------------------------------------------------------------- */ void Neighbor::build(int topoflag) @@ -1732,25 +1543,12 @@ void Neighbor::build(int topoflag) maxhold = atom->nmax; memory->destroy(xhold); memory->create(xhold,maxhold,3,"neigh:xhold"); - if (linetri_flag) { - if (linetri_flag == 1) { - memory->destroy(line_hold); - memory->create(line_hold,maxhold,4,"neigh:line_hold"); - } else { - memory->destroy(tri_hold); - memory->create(tri_hold,maxhold,4,"neigh:tri_hold"); - } - } } for (i = 0; i < nlocal; i++) { xhold[i][0] = x[i][0]; xhold[i][1] = x[i][1]; xhold[i][2] = x[i][2]; } - if (linetri_flag) { - if (linetri_flag == 1) calculate_endpts(); - else calculate_corners(); - } if (boxcheck) { if (triclinic == 0) { boxlo_hold[0] = bboxlo[0]; @@ -2406,8 +2204,6 @@ bigint Neighbor::memory_usage() { bigint bytes = 0; bytes += memory->usage(xhold,maxhold,3); - if (linetri_flag == 1) bytes += memory->usage(line_hold,maxhold,4); - if (linetri_flag == 2) bytes += memory->usage(tri_hold,maxhold,9); if (style != NSQ) { bytes += memory->usage(bins,maxbin); @@ -2436,3 +2232,4 @@ int Neighbor::exclude_setting() { return exclude; } + diff --git a/src/neighbor.h b/src/neighbor.h index 167af590ea..b44e0fde00 100644 --- a/src/neighbor.h +++ b/src/neighbor.h @@ -116,12 +116,6 @@ class Neighbor : protected Pointers { double boxlo_hold[3],boxhi_hold[3]; // box size at last neighbor build double corners_hold[8][3]; // box corners at last neighbor build - int linetri_flag; // 1 if lines 
exist, 2 if tris exist - double **line_hold; // line corner pts at last neighbor build - double **tri_hold; // tri corner pts at last neighbor build - class AtomVecLine *avec_line; // used to extract line info - class AtomVecTri *avec_tri; // used to extract tri info - int binatomflag; // bin atoms or not when build neigh list // turned off by build_one() @@ -190,11 +184,6 @@ class Neighbor : protected Pointers { // methods - int check_distance_line(double); // check line move dist since last neigh - int check_distance_tri(double); // check tri move dist since last neigh - void calculate_endpts(); - void calculate_corners(); - void bin_atoms(); // bin all atoms double bin_distance(int, int, int); // distance between binx int coord2bin(double *); // mapping atom coord to a bin diff --git a/src/read_data.cpp b/src/read_data.cpp index 00dd39c02e..9fdc261bd3 100644 --- a/src/read_data.cpp +++ b/src/read_data.cpp @@ -283,6 +283,7 @@ void ReadData::command(int narg, char **arg) } // set up pointer to hold original styles while we replace them with "zero" + Pair *saved_pair = NULL; Bond *saved_bond = NULL; Angle *saved_angle = NULL; diff --git a/src/version.h b/src/version.h index 451b0696ec..7ccaea0ae3 100644 --- a/src/version.h +++ b/src/version.h @@ -1 +1 @@ -#define LAMMPS_VERSION "28 Jun 2016" +#define LAMMPS_VERSION "1 Jul 2016"