diff --git a/doc/Section_accelerate.html b/doc/Section_accelerate.html index 2f55e89e9c..296219f985 100644 --- a/doc/Section_accelerate.html +++ b/doc/Section_accelerate.html @@ -978,45 +978,52 @@ LAMMPS.

The USER-INTEL package was developed by Mike Brown at Intel Corporation. It provides a capability to accelerate simulations by offloading neighbor list and non-bonded force calculations to Intel -coprocessors. Additionally, it supports running simulations in -single, mixed, or double precision with vectorization, even if a -coprocessor is not present. The same C++ code is used for both cases. -When offloading to a coprocessor, the routine is run twice, once with -an offload flag. +coprocessors (Xeon Phi). Additionally, it supports running +simulations in single, mixed, or double precision with vectorization, +even if a coprocessor is not present, i.e. on an Intel CPU. The same +C++ code is used for both cases. When offloading to a coprocessor, +the routine is run twice, once with an offload flag.

-

The USER-INTEL package will work with the USER-OMP package. Specifying -use of the Intel package implicitly includes the OMP package allowing -it to be used for angle, bond, dihedral, and long-range -electrostatics. Using the suffix intel command will use -styles from the Intel package if available; otherwise it will use -styles from the OMP package if available. +

The USER-INTEL package can be used in tandem with the USER-OMP +package. This is useful when a USER-INTEL pair style is used, so that +other styles not supported by the USER-INTEL package, e.g. for bond, +angle, dihedral, improper, and long-range electrostatics can be run +with the USER-OMP package versions. If you have built LAMMPS with +both the USER-INTEL and USER-OMP packages, then this mode of operation +is made easier, because the "-suffix intel" command-line +switch and the suffix +intel command will both set a second-choice suffix to +"omp" so that styles from the USER-OMP package will be used if +available.

Building LAMMPS with the USER-INTEL package:

The procedure for building LAMMPS with the USER-INTEL package is simple. You have to edit your machine-specific makefile to add the flags to enable OpenMP support (-openmp) to both the CCFLAGS and -LINKFLAGS variables. You also need to add -restrict to CCFLAGS. If -you are compiling on the same architecture that will be used for the -runs, adding the flag -xHost will enable vectorization with the -Intel compiler. In order to build with support for an Intel +LINKFLAGS variables. You also need to add -DLAMMPS_MEMALIGN=64 and +-restrict to CCFLAGS. +

+

If you are compiling on the same architecture that will be used for +the runs, adding the flag -xHost will enable vectorization with the +Intel compiler. In order to build with support for an Intel coprocessor, the flag -offload should be added to the LINKFLAGS line and the flag -DLMP_INTEL_OFFLOAD should be added to the CCFLAGS line.

The files src/MAKE/Makefile.intel and src/MAKE/Makefile.intel_offload -are provided with options that perform well with the Intel -compiler. The latter Makefile has support for offload to coprocessors -and the former does not. +are included in the src/MAKE directory with options that perform well +with the Intel compiler. The latter Makefile has support for offload +to coprocessors and the former does not.

It is recommended that Intel Compiler 2013 SP1 update 1 be used for compiling. Newer versions have some performance issues that are being addressed. If using Intel MPI, version 5 or higher is recommended.

The rest of the compilation is the same as for any other package that -has no additional library dependencies: +has no additional library dependencies, e.g.

-
make yes-user-omp yes-user-intel
+
make yes-user-intel yes-user-omp
 make machine 
 

Running an input script: @@ -1032,94 +1039,97 @@ commands, and is independent of the Intel package.

Input script requirements to run using pair styles with an intel suffix are as follows:

-

To invoke specific styles from the Intel package, either append +

To invoke specific styles from the USER-INTEL package, either append "intel" to the style name (e.g. pair_style lj/cut/intel), or use the -suffix command-line switch, or use the suffix command in the input script.

Unless the -suffix intel command-line -switch is used, the package +switch is used, a package intel command must be used near the beginning of the -script. The default precision mode for the Intel package is mixed, -meaning that accumulation is performed in double precision and other -calculations are performed in single precision. In order to use all -single or all double precision, the "package intel" line must be used -in the input script with a "single" or "double" keyword specified. +input script. The default precision mode for the USER-INTEL package +is mixed, meaning that accumulation is performed in double precision +and other calculations are performed in single precision. In order to +use all single or all double precision, the package +intel command must be used in the input script with a +"single" or "double" keyword specified.
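To illustrate why the default mixed mode accumulates in double precision, the Python sketch below sums the same single-precision terms twice: once with a running sum rounded to single precision after every addition, and once with a double-precision running sum. This is only an illustration of the numerical idea, not USER-INTEL code; the function names and the random "per-pair" terms are hypothetical, and single-precision rounding is emulated with the standard struct module.

```python
import math
import random
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate_single(terms):
    """'single' mode sketch: every partial sum is rounded to float32."""
    acc = 0.0
    for t in terms:
        acc = to_float32(acc + t)
    return acc


def accumulate_mixed(terms):
    """'mixed' mode sketch: the terms are float32 values, but the running sum stays in double."""
    acc = 0.0
    for t in terms:
        acc += t
    return acc


random.seed(1)
# Hypothetical per-pair contributions, each already computed in single precision.
terms = [to_float32(random.random()) for _ in range(100_000)]
exact = math.fsum(terms)  # correctly rounded reference sum

err_single = abs(accumulate_single(terms) - exact)
err_mixed = abs(accumulate_mixed(terms) - exact)
print(f"single-precision accumulation error: {err_single:.3e}")
print(f"mixed-precision accumulation error:  {err_mixed:.3e}")
```

With many small contributions, the single-precision running sum loses low-order bits at every step once it grows large, while the double-precision accumulator keeps the error many orders of magnitude smaller. This is the trade-off that mixed mode makes: inexpensive single-precision arithmetic per term, with a double-precision sum.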

Running with an Intel coprocessor:

-

The Intel package supports offload of a fraction of the work to Intel -coprocessors. This is accomplished by setting a balance fraction on -the package intel line. A balance of 0 runs all -calculations on the CPU. A balance of 1 runs all calculations on the -coprocessor. A balance of 0.5 runs half of the calculations on the -coprocessor. Setting the balance to -1 will enable dynamic load -balancing that continously adjusts the fraction of offloaded work -throughout the simulation. This option is typically within 5 to 10 -percent of the optimal fixed balance. By default, using the suffix -command or command-line switch will use offload to a coprocessor with -the balance set to -1. If LAMMPS is built without offload support, -this setting is ignored. +

The USER-INTEL package supports offload of a fraction of the work to +Intel coprocessors (Xeon Phi). This is accomplished by setting a +balance fraction on the package intel command. A +balance of 0 runs all calculations on the CPU. A balance of 1 runs +all calculations on the coprocessor. A balance of 0.5 runs half of +the calculations on the coprocessor. Setting the balance to -1 will +enable dynamic load balancing that continuously adjusts the fraction of +offloaded work throughout the simulation. This option typically +performs within 5 to 10 percent of the optimal fixed balance. +By default, using the suffix command or -suffix +command-line switch will use offload to a +coprocessor with the balance set to -1. If LAMMPS is built without +offload support, this setting is ignored.
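The meaning of a fixed balance fraction, and why a dynamic setting can settle within a short warm-up run, can be sketched in Python. The update rule below is a hypothetical proportional rebalancing scheme written for illustration. It is not the algorithm USER-INTEL uses for balance -1, and the function names and the timing model are assumptions.

```python
def split_work(n_items: int, balance: float) -> tuple[int, int]:
    """Split n_items into (host, coprocessor) shares for a fixed balance in [0, 1].

    balance = 0 keeps everything on the host CPU; balance = 1 offloads everything.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("a fixed balance must lie in [0, 1]")
    offloaded = round(n_items * balance)
    return n_items - offloaded, offloaded


def rebalance(balance: float, t_host: float, t_coproc: float) -> float:
    """Hypothetical update rule: estimate each device's throughput from the last
    step's timings, then return the fraction that would make both finish together.
    Requires 0 < balance < 1 and positive timings."""
    host_rate = (1.0 - balance) / t_host  # fraction of the work per second on the host
    coproc_rate = balance / t_coproc      # fraction of the work per second on the card
    return coproc_rate / (host_rate + coproc_rate)


# Toy timing model (assumption): the coprocessor processes work 3x faster than the host.
HOST_SPEED, COPROC_SPEED = 1.0, 3.0
balance = 0.5
for step in range(3):
    t_host = (1.0 - balance) / HOST_SPEED
    t_coproc = balance / COPROC_SPEED
    balance = rebalance(balance, t_host, t_coproc)

print(split_work(1000, 0.5), round(balance, 3))
```

With noise-free timings this toy rule reaches the optimal fraction (0.75 for a 3x faster card) after one update. Real timings fluctuate from step to step, which is one reason a short warm-up run helps a dynamic balancer settle before the runs that are measured.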

If one is running short benchmark runs with dynamic load balancing, adding a short warm-up run (10-20 steps) will allow the load-balancer -to find a setting that will be carried over to additional runs. +to find a setting that will carry over to additional runs.

The default for the package intel command is to have -all of the MPI tasks on a given compute node use a single -coprocessor. In general, running with a large number of MPI tasks on -each node will perform best with offload. Each MPI task will +all the MPI tasks on a given compute node use a single coprocessor +(Xeon Phi). In general, running with a large number of MPI tasks on +each node will perform best with offload. Each MPI task will automatically get affinity to a subset of the hardware threads -available on the coprocessor. For example, if your card has 61 cores, -with 60 cores available for offload and 4 hardware threads per core, -running with 24 MPI tasks per node will cause each MPI task to use a -subset of 10 threads on the coprocessor. Fine tuning of the number of -threads to use per MPI task or the number of threads to use per core -can be accomplished with keywords to the package intel -command. +available on the coprocessor. For example, if your card has 61 cores, +with 60 cores available for offload and 4 hardware threads per core +(240 total threads), running with 24 MPI tasks per node will cause +each MPI task to use a subset of 10 threads on the coprocessor. Fine +tuning of the number of threads to use per MPI task or the number of +threads to use per core can be accomplished with keywords to the +package intel command.
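The thread count in the example above is an even division of the coprocessor's hardware threads among the MPI tasks on a node. The hypothetical Python helper below reproduces that arithmetic. It assumes an even split, which matches the example but may not describe every combination of task and thread counts.

```python
def coproc_threads_per_task(cores_for_offload: int, threads_per_core: int,
                            mpi_tasks_per_node: int) -> int:
    """Hardware threads per MPI task when all tasks on a node share one coprocessor
    and its threads are divided evenly (an assumption for illustration)."""
    total_threads = cores_for_offload * threads_per_core
    return total_threads // mpi_tasks_per_node


# Numbers from the example: 60 usable cores x 4 threads = 240 threads, 24 MPI tasks.
print(coproc_threads_per_task(60, 4, 24))  # 10
```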

-

If LAMMPS is using offload to a coprocessor, a diagnostic line during -the setup for a run is printed to the screen (not to log files) -indicating that offload is being used and the number of coprocessor -threads per MPI task. Additionally, an offload timing summary is -printed at the end of each run. When using offload, the -sort frequency for atom data is changed to 1 such -that the data is sorted every neighbor build. +

If LAMMPS is using offload to a coprocessor (Xeon Phi), a diagnostic +line during the setup for a run is printed to the screen (not to log +files) indicating that offload is being used and the number of +coprocessor threads per MPI task. Additionally, an offload timing +summary is printed at the end of each run. When using offload, the +sort frequency for atom data is changed to 1 so +that the per-atom data is sorted every neighbor build.

-

In order to use multiple coprocessors on each compute node, the +

To use multiple coprocessors (Xeon Phis) on each compute node, the offload_cards keyword can be specified with the package intel command to specify the number of coprocessors to use.

-

For simulations involving long-range electrostatics or angle, bond, -and dihedral calculations, computation and data transfer to the +

For simulations with long-range electrostatics or bond, angle, +dihedral, and improper calculations, computation and data transfer to the coprocessor will run concurrently with computations and MPI -communications for these routines on the host. The Intel package has -two modes for deciding which atoms will be handled by the coprocessor. -The setting is controlled with the "offload_ghost" option. When set to -0, ghost atoms (atoms at the borders between MPI tasks) are not -offloaded to the card. This allows for overlap of MPI communication of -forces with computation on the coprocessor when the -newton setting is "on". The default is dependent on the -style being used, however, better performance might be achieving by +communications for these routines on the host. The USER-INTEL package +has two modes for deciding which atoms will be handled by the +coprocessor. The setting is controlled with the "offload_ghost" +option. When set to 0, ghost atoms (atoms at the borders between MPI +tasks) are not offloaded to the card. This allows for overlap of MPI +communication of forces with computation on the coprocessor when the +newton setting is "on". The default is dependent on the +style being used; however, better performance might be achieved by setting this explicitly.

In order to control the number of OpenMP threads used on the host, the OMP_NUM_THREADS environment variable should be set. This variable will not influence the number of threads used on the coprocessor. Only the -"package intel" command can be used to control thread counts on the -coprocessor. +package intel command can be used to control thread +counts on the coprocessor.

Restrictions:

When using offload, hybrid styles that require skip lists for neighbor builds cannot be offloaded to the coprocessor. -Using hybrid/overlay is allowed. Only one intel -accelerated style may be used with hybrid styles. Exclusion lists are +Using hybrid/overlay is allowed. Only one intel +accelerated style may be used with hybrid styles. Exclusion lists are not currently supported with offload; however, the same effect can -often be accomplished by setting cutoffs for excluded atom types to -0. None of the pair styles in the USER-OMP package support the -"inner", "middle", "outer" options for r-RESPA integration. +often be accomplished by setting cutoffs for excluded atom types to 0. +None of the pair styles in the USER-OMP package currently support the +"inner", "middle", "outer" options for rRESPA integration via the +run_style respa command.


diff --git a/doc/Section_accelerate.txt b/doc/Section_accelerate.txt index bfeba1dfd6..160d67c108 100644 --- a/doc/Section_accelerate.txt +++ b/doc/Section_accelerate.txt @@ -974,45 +974,52 @@ LAMMPS. The USER-INTEL package was developed by Mike Brown at Intel Corporation. It provides a capability to accelerate simulations by offloading neighbor list and non-bonded force calculations to Intel -coprocessors. Additionally, it supports running simulations in -single, mixed, or double precision with vectorization, even if a -coprocessor is not present. The same C++ code is used for both cases. -When offloading to a coprocessor, the routine is run twice, once with -an offload flag. +coprocessors (Xeon Phi). Additionally, it supports running +simulations in single, mixed, or double precision with vectorization, +even if a coprocessor is not present, i.e. on an Intel CPU. The same +C++ code is used for both cases. When offloading to a coprocessor, +the routine is run twice, once with an offload flag. -The USER-INTEL package will work with the USER-OMP package. Specifying -use of the Intel package implicitly includes the OMP package allowing -it to be used for angle, bond, dihedral, and long-range -electrostatics. Using the "suffix intel"_suffix.html command will use -styles from the Intel package if available; otherwise it will use -styles from the OMP package if available. +The USER-INTEL package can be used in tandem with the USER-OMP +package. This is useful when a USER-INTEL pair style is used, so that +other styles not supported by the USER-INTEL package, e.g. for bond, +angle, dihedral, improper, and long-range electrostatics can be run +with the USER-OMP package versions. 
If you have built LAMMPS with +both the USER-INTEL and USER-OMP packages, then this mode of operation +is made easier, because the "-suffix intel" "command-line +switch"_Section_start.html#start_7 and the "suffix +intel"_suffix.html command will both set a second-choice suffix to +"omp" so that styles from the USER-OMP package will be used if +available. [Building LAMMPS with the USER-INTEL package:] The procedure for building LAMMPS with the USER-INTEL package is simple. You have to edit your machine-specific makefile to add the flags to enable OpenMP support ({-openmp}) to both the CCFLAGS and -LINKFLAGS variables. You also need to add -restrict to CCFLAGS. If -you are compiling on the same architecture that will be used for the -runs, adding the flag {-xHost} will enable vectorization with the -Intel compiler. In order to build with support for an Intel +LINKFLAGS variables. You also need to add -DLAMMPS_MEMALIGN=64 and +-restrict to CCFLAGS. + +If you are compiling on the same architecture that will be used for +the runs, adding the flag {-xHost} will enable vectorization with the +Intel compiler. In order to build with support for an Intel coprocessor, the flag {-offload} should be added to the LINKFLAGS line and the flag {-DLMP_INTEL_OFFLOAD} should be added to the CCFLAGS line. The files src/MAKE/Makefile.intel and src/MAKE/Makefile.intel_offload -are provided with options that perform well with the Intel -compiler. The latter Makefile has support for offload to coprocessors -and the former does not. +are included in the src/MAKE directory with options that perform well +with the Intel compiler. The latter Makefile has support for offload +to coprocessors and the former does not. It is recommended that Intel Compiler 2013 SP1 update 1 be used for compiling. Newer versions have some performance issues that are being addressed. If using Intel MPI, version 5 or higher is recommended.
The rest of the compilation is the same as for any other package that -has no additional library dependencies: +has no additional library dependencies, e.g. -make yes-user-omp yes-user-intel +make yes-user-intel yes-user-omp make machine :pre [Running an input script:] @@ -1028,94 +1035,97 @@ commands, and is independent of the Intel package. Input script requirements to run using pair styles with an {intel} suffix are as follows: -To invoke specific styles from the Intel package, either append +To invoke specific styles from the USER-INTEL package, either append "intel" to the style name (e.g. pair_style lj/cut/intel), or use the "-suffix command-line switch"_Section_start.html#start_7, or use the "suffix"_suffix.html command in the input script. Unless the "-suffix intel command-line -switch"_Section_start.html#start_7 is used, the "package +switch"_Section_start.html#start_7 is used, a "package intel"_package.html command must be used near the beginning of the -script. The default precision mode for the Intel package is {mixed}, -meaning that accumulation is performed in double precision and other -calculations are performed in single precision. In order to use all -single or all double precision, the "package intel" line must be used -in the input script with a "single" or "double" keyword specified. +input script. The default precision mode for the USER-INTEL package +is {mixed}, meaning that accumulation is performed in double precision +and other calculations are performed in single precision. In order to +use all single or all double precision, the "package +intel"_package.html command must be used in the input script with a +"single" or "double" keyword specified. [Running with an Intel coprocessor:] -The Intel package supports offload of a fraction of the work to Intel -coprocessors. This is accomplished by setting a balance fraction on -the "package intel"_package.html line. A balance of 0 runs all -calculations on the CPU.
A balance of 1 runs all calculations on the -coprocessor. A balance of 0.5 runs half of the calculations on the -coprocessor. Setting the balance to -1 will enable dynamic load -balancing that continously adjusts the fraction of offloaded work -throughout the simulation. This option is typically within 5 to 10 -percent of the optimal fixed balance. By default, using the suffix -command or command-line switch will use offload to a coprocessor with -the balance set to -1. If LAMMPS is built without offload support, -this setting is ignored. +The USER-INTEL package supports offload of a fraction of the work to +Intel coprocessors (Xeon Phi). This is accomplished by setting a +balance fraction on the "package intel"_package.html command. A +balance of 0 runs all calculations on the CPU. A balance of 1 runs +all calculations on the coprocessor. A balance of 0.5 runs half of +the calculations on the coprocessor. Setting the balance to -1 will +enable dynamic load balancing that continuously adjusts the fraction of +offloaded work throughout the simulation. This option typically +performs within 5 to 10 percent of the optimal fixed balance. +By default, using the "suffix"_suffix.html command or "-suffix +command-line switch"_Section_start.html#start_7 will use offload to a +coprocessor with the balance set to -1. If LAMMPS is built without +offload support, this setting is ignored. If one is running short benchmark runs with dynamic load balancing, adding a short warm-up run (10-20 steps) will allow the load-balancer -to find a setting that will be carried over to additional runs. +to find a setting that will carry over to additional runs. The default for the "package intel"_package.html command is to have -all of the MPI tasks on a given compute node use a single -coprocessor. In general, running with a large number of MPI tasks on -each node will perform best with offload.
Each MPI task will +all the MPI tasks on a given compute node use a single coprocessor +(Xeon Phi). In general, running with a large number of MPI tasks on +each node will perform best with offload. Each MPI task will automatically get affinity to a subset of the hardware threads -available on the coprocessor. For example, if your card has 61 cores, -with 60 cores available for offload and 4 hardware threads per core, -running with 24 MPI tasks per node will cause each MPI task to use a -subset of 10 threads on the coprocessor. Fine tuning of the number of -threads to use per MPI task or the number of threads to use per core -can be accomplished with keywords to the "package intel"_package.html -command. +available on the coprocessor. For example, if your card has 61 cores, +with 60 cores available for offload and 4 hardware threads per core +(240 total threads), running with 24 MPI tasks per node will cause +each MPI task to use a subset of 10 threads on the coprocessor. Fine +tuning of the number of threads to use per MPI task or the number of +threads to use per core can be accomplished with keywords to the +"package intel"_package.html command. -If LAMMPS is using offload to a coprocessor, a diagnostic line during -the setup for a run is printed to the screen (not to log files) -indicating that offload is being used and the number of coprocessor -threads per MPI task. Additionally, an offload timing summary is -printed at the end of each run. When using offload, the -"sort"_atom_modify.html frequency for atom data is changed to 1 such -that the data is sorted every neighbor build. +If LAMMPS is using offload to a coprocessor (Xeon Phi), a diagnostic +line during the setup for a run is printed to the screen (not to log +files) indicating that offload is being used and the number of +coprocessor threads per MPI task. Additionally, an offload timing +summary is printed at the end of each run. 
When using offload, the +"sort"_atom_modify.html frequency for atom data is changed to 1 so +that the per-atom data is sorted every neighbor build. -In order to use multiple coprocessors on each compute node, the +To use multiple coprocessors (Xeon Phis) on each compute node, the {offload_cards} keyword can be specified with the "package intel"_package.html command to specify the number of coprocessors to use. -For simulations involving long-range electrostatics or angle, bond, -and dihedral calculations, computation and data transfer to the +For simulations with long-range electrostatics or bond, angle, +dihedral, and improper calculations, computation and data transfer to the coprocessor will run concurrently with computations and MPI -communications for these routines on the host. The Intel package has -two modes for deciding which atoms will be handled by the coprocessor. -The setting is controlled with the "offload_ghost" option. When set to -0, ghost atoms (atoms at the borders between MPI tasks) are not -offloaded to the card. This allows for overlap of MPI communication of -forces with computation on the coprocessor when the -"newton"_newton.html setting is "on". The default is dependent on the -style being used, however, better performance might be achieving by +communications for these routines on the host. The USER-INTEL package +has two modes for deciding which atoms will be handled by the +coprocessor. The setting is controlled with the "offload_ghost" +option. When set to 0, ghost atoms (atoms at the borders between MPI +tasks) are not offloaded to the card. This allows for overlap of MPI +communication of forces with computation on the coprocessor when the +"newton"_newton.html setting is "on". The default is dependent on the +style being used; however, better performance might be achieved by setting this explicitly. In order to control the number of OpenMP threads used on the host, the OMP_NUM_THREADS environment variable should be set.
This variable will not influence the number of threads used on the coprocessor. Only the -"package intel" command can be used to control thread counts on the -coprocessor. +"package intel"_package.html command can be used to control thread +counts on the coprocessor. [Restrictions:] When using offload, "hybrid"_pair_hybrid.html styles that require skip lists for neighbor builds cannot be offloaded to the coprocessor. -Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one intel -accelerated style may be used with hybrid styles. Exclusion lists are +Using "hybrid/overlay"_pair_hybrid.html is allowed. Only one intel +accelerated style may be used with hybrid styles. Exclusion lists are not currently supported with offload; however, the same effect can -often be accomplished by setting cutoffs for excluded atom types to -0. None of the pair styles in the USER-OMP package support the -"inner", "middle", "outer" options for r-RESPA integration. +often be accomplished by setting cutoffs for excluded atom types to 0. +None of the pair styles in the USER-OMP package currently support the +"inner", "middle", "outer" options for rRESPA integration via the +"run_style respa"_run_style.html command. :line diff --git a/doc/Section_start.html b/doc/Section_start.html index 566f5f6e00..67da141f95 100644 --- a/doc/Section_start.html +++ b/doc/Section_start.html @@ -1497,8 +1497,9 @@ if desired. default Intel settings, as if the command "package intel * mixed balance -1" were used at the top of your input script. These settings can be changed by using the package intel command in -your script if desired. The intel suffix will attempt to use styles -from the OMP package if they are not present in the Intel package. +your script if desired. If the USER-OMP package is installed, the +intel suffix will make the omp suffix a second choice, if a requested +style is not available in the USER-INTEL package.

For the KOKKOS package, using this command-line switch also invokes the default KOKKOS settings, as if the command "package kokkos neigh @@ -1511,9 +1512,9 @@ default OMP settings, as if the command "package omp *" were used at the top of your input script. These settings can be changed by using the package omp command in your script if desired.

-

The suffix command can also be used set a suffix and it -can also turn off or back on any suffix setting made via the command -line. +

The suffix command can also be used to set a suffix and +it can also turn off or back on any suffix setting made via the +command line.

-var name value1 value2 ... 
 
diff --git a/doc/Section_start.txt b/doc/Section_start.txt index 30baf84667..c7714905a4 100644 --- a/doc/Section_start.txt +++ b/doc/Section_start.txt @@ -1491,8 +1491,9 @@ For the Intel package, using this command-line switch also invokes the default Intel settings, as if the command "package intel * mixed balance -1" were used at the top of your input script. These settings can be changed by using the "package intel"_package.html command in -your script if desired. The intel suffix will attempt to use styles -from the OMP package if they are not present in the Intel package. +your script if desired. If the USER-OMP package is installed, the +intel suffix will make the omp suffix a second choice, if a requested +style is not available in the USER-INTEL package. For the KOKKOS package, using this command-line switch also invokes the default KOKKOS settings, as if the command "package kokkos neigh @@ -1505,9 +1506,9 @@ default OMP settings, as if the command "package omp *" were used at the top of your input script. These settings can be changed by using the "package omp"_package.html command in your script if desired. -The "suffix"_suffix.html command can also be used set a suffix and it -can also turn off or back on any suffix setting made via the command -line. +The "suffix"_suffix.html command can also be used to set a suffix and +it can also turn off or back on any suffix setting made via the +command line. -var name value1 value2 ... :pre diff --git a/doc/suffix.html b/doc/suffix.html index 26b1f176cd..479a9bcd29 100644 --- a/doc/suffix.html +++ b/doc/suffix.html @@ -80,9 +80,9 @@ If the variant version does not exist, the standard version is created.

When using the intel suffix, LAMMPS will first attempt to use a style -with the intel suffix. If this does not exist, a style with the omp -suffix is attempted. If this also does not exist, the style without -any suffix is used. +with the intel suffix. If the USER-OMP package is installed, the +omp suffix will be tried as a second choice if a requested style is +not available in the USER-INTEL package.
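The lookup order described here can be modeled as a short Python function. First comes the intel variant, then the omp variant when USER-OMP is installed, and finally the standard style, following the note that the standard version is created when no variant exists. This is a hypothetical model of the documented behavior, not LAMMPS source code. The function name and the set of available styles are invented for illustration.

```python
def resolve_style(name: str, suffix: str, available: set[str],
                  user_omp_installed: bool) -> str:
    """Return the style that would be instantiated for a requested style name.

    Hypothetical model: try name/suffix first; for the intel suffix, try name/omp
    as a second choice when USER-OMP is installed; otherwise use the plain style.
    """
    candidates = [f"{name}/{suffix}"]
    if suffix == "intel" and user_omp_installed:
        candidates.append(f"{name}/omp")  # second-choice suffix
    for candidate in candidates:
        if candidate in available:
            return candidate
    return name  # the standard (unsuffixed) style


# Invented example: an intel pair style exists, but the bond style only has an omp variant.
available = {"lj/cut", "lj/cut/intel", "lj/cut/omp", "harmonic", "harmonic/omp"}
print(resolve_style("lj/cut", "intel", available, user_omp_installed=True))
print(resolve_style("harmonic", "intel", available, user_omp_installed=True))
print(resolve_style("harmonic", "intel", available, user_omp_installed=False))
```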

If the specified style is off, then any previously specified suffix is temporarily disabled, whether it was specified by a command-line diff --git a/doc/suffix.txt b/doc/suffix.txt index 01e9aae3a8..6309b5fa16 100644 --- a/doc/suffix.txt +++ b/doc/suffix.txt @@ -77,9 +77,9 @@ If the variant version does not exist, the standard version is created. When using the intel suffix, LAMMPS will first attempt to use a style -with the intel suffix. If this does not exist, a style with the omp -suffix is attempted. If this also does not exist, the style without -any suffix is used. +with the intel suffix. If the USER-OMP package is installed, the +omp suffix will be tried as a second choice if a requested style is +not available in the USER-INTEL package. If the specified style is {off}, then any previously specified suffix is temporarily disabled, whether it was specified by a command-line