From 6429e05b7861dd01c05866dfdc670c99189b95dd Mon Sep 17 00:00:00 2001
From: sjplimp
Date: Mon, 22 Dec 2014 22:12:21 +0000
Subject: [PATCH] git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12850 f3b2605a-c512-4ea7-a41b-209d697bcdaa

---
 doc/accelerate_intel.html | 39 +++++++++++++++++++++++++++++----------
 doc/accelerate_intel.txt  | 39 +++++++++++++++++++++++++++++----------
 doc/package.html          | 10 +++++++++-
 doc/package.txt           | 10 +++++++++-
 4 files changed, 76 insertions(+), 22 deletions(-)

diff --git a/doc/accelerate_intel.html b/doc/accelerate_intel.html
index 9802a34d64..2972e989c3 100644
--- a/doc/accelerate_intel.html
+++ b/doc/accelerate_intel.html
@@ -62,8 +62,7 @@ Xeon Phi(TM) coprocessor is the same except for these additional steps:

The latter two steps in the first case and the last step in the coprocessor case can be done using the "-pk intel" and "-sf intel"
@@ -75,7 +74,7 @@ commands respectively to your input script.

Required hardware/software:

To use the offload option, you must have one or more Intel(R) Xeon
-Phi(TM) coprocessors.
+Phi(TM) coprocessors and use an Intel(R) C++ compiler.

Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in
@@ -85,10 +84,18 @@ vectorization or give poor performance.
g++ will not recognize some of the settings, so they cannot be used). The compiler must support the OpenMP interface.

+

+The recommended version of the Intel(R) compiler is 14.0.1.106.
+Versions 15.0.1.133 and later are also supported. If using Intel(R)
+MPI, versions 15.0.2.044 and later are recommended.
+

Building LAMMPS with the USER-INTEL package:

-

-You must choose at build time whether to build for CPU acceleration or
-to use the Xeon Phi in offload mode.
+

+You can choose to build with or without support for offload to an
+Intel(R) Xeon Phi(TM) coprocessor. If you build with support for a
+coprocessor, the same binary can be used on nodes with and without
+coprocessors installed. However, if you do not have coprocessors
+on your system, building without offload support will produce a
+smaller binary.

You can do either in one line, using the src/Make.py script, described in Section 2.4 of the manual. Type
@@ -119,7 +126,9 @@ both the CCFLAGS and LINKFLAGS variables. You also need to add

If you are compiling on the same architecture that will be used for the runs, adding the flag -xHost to CCFLAGS will enable
-vectorization with the Intel(R) compiler.
+vectorization with the Intel(R) compiler. Otherwise, you must
+provide the correct compute node architecture to the -x option
+(e.g. -xAVX).

In order to build with support for an Intel(R) Xeon Phi(TM) coprocessor, the flag -offload should be added to the LINKFLAGS line
@@ -130,10 +139,20 @@ included in the src/MAKE/OPTIONS directory with settings that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not.

-

-If using an Intel compiler, it is recommended that Intel(R) Compiler
-2013 SP1 update 1 be used. Newer versions have some performance
-issues that are being addressed. If using Intel(R) MPI, version 5 or
-higher is recommended.
+

+Notes on CPU and core affinity:
+

+

+Setting core affinity is often used to pin MPI tasks and OpenMP
+threads to a core or group of cores so that memory access can be
+uniform. Unless disabled at build time, affinity for MPI tasks and
+OpenMP threads on the host will be set by default
+when using offload to a coprocessor. In this case, it is unnecessary
+to use other methods to control affinity (e.g. taskset, numactl,
+I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script
+with the no_affinity option to the package intel
+command or by disabling the option at build time (by adding
+-DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile).
+Disabling this option is not recommended, especially when running
+on a machine with hyperthreading disabled.

Running with the USER-INTEL package from the command line:

diff --git a/doc/accelerate_intel.txt b/doc/accelerate_intel.txt
index e85899f189..c0cbafa448 100644
--- a/doc/accelerate_intel.txt
+++ b/doc/accelerate_intel.txt
@@ -59,8 +59,7 @@ Xeon Phi(TM) coprocessor is the same except for these additional steps:
add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
-add the flag -offload to LINKFLAGS in your Makefile.machine
-specify how many coprocessor threads per MPI task to use :ul
+add the flag -offload to LINKFLAGS in your Makefile.machine :ul

The latter two steps in the first case and the last step in the coprocessor case can be done using the "-pk intel" and "-sf intel"
@@ -72,7 +71,7 @@ commands respectively to your input script.

[Required hardware/software:]

To use the offload option, you must have one or more Intel(R) Xeon
-Phi(TM) coprocessors.
+Phi(TM) coprocessors and use an Intel(R) C++ compiler.

Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in
@@ -82,10 +81,18 @@ Use of an Intel C++ compiler is recommended, but not required (though
g++ will not recognize some of the settings, so they cannot be used). The compiler must support the OpenMP interface.
+The recommended version of the Intel(R) compiler is 14.0.1.106.
+Versions 15.0.1.133 and later are also supported. If using Intel(R)
+MPI, versions 15.0.2.044 and later are recommended.
+
[Building LAMMPS with the USER-INTEL package:]

-You must choose at build time whether to build for CPU acceleration or
-to use the Xeon Phi in offload mode.
+You can choose to build with or without support for offload to an
+Intel(R) Xeon Phi(TM) coprocessor. If you build with support for a
+coprocessor, the same binary can be used on nodes with and without
+coprocessors installed. However, if you do not have coprocessors
+on your system, building without offload support will produce a
+smaller binary.

You can do either in one line, using the src/Make.py script, described in "Section 2.4"_Section_start.html#start_4 of the manual. Type
@@ -116,7 +123,9 @@ both the CCFLAGS and LINKFLAGS variables. You also need to add

If you are compiling on the same architecture that will be used for the runs, adding the flag {-xHost} to CCFLAGS will enable
-vectorization with the Intel(R) compiler.
+vectorization with the Intel(R) compiler. Otherwise, you must
+provide the correct compute node architecture to the -x option
+(e.g. -xAVX).

In order to build with support for an Intel(R) Xeon Phi(TM) coprocessor, the flag {-offload} should be added to the LINKFLAGS line
@@ -127,10 +136,20 @@ included in the src/MAKE/OPTIONS directory with settings that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not.

-If using an Intel compiler, it is recommended that Intel(R) Compiler
-2013 SP1 update 1 be used. Newer versions have some performance
-issues that are being addressed. If using Intel(R) MPI, version 5 or
-higher is recommended.
+[Notes on CPU and core affinity:]
+
+Setting core affinity is often used to pin MPI tasks and OpenMP
+threads to a core or group of cores so that memory access can be
+uniform. Unless disabled at build time, affinity for MPI tasks and
+OpenMP threads on the host will be set by default
+when using offload to a coprocessor. In this case, it is unnecessary
+to use other methods to control affinity (e.g. taskset, numactl,
+I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script
+with the {no_affinity} option to the "package intel"_package.html
+command or by disabling the option at build time (by adding
+-DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile).
+Disabling this option is not recommended, especially when running
+on a machine with hyperthreading disabled.

[Running with the USER-INTEL package from the command line:]

diff --git a/doc/package.html b/doc/package.html
index 87e49a5c8f..5c0dd866d8 100644
--- a/doc/package.html
+++ b/doc/package.html
@@ -59,7 +59,7 @@ intel args = NPhi keyword value ...
Nphi = # of coprocessors per node zero or more keyword/value pairs may be appended
- keywords = omp or mode or balance or ghost or tpc or tptask
+ keywords = omp or mode or balance or ghost or tpc or tptask or no_affinity
omp value = Nthreads Nthreads = number of OpenMP threads to use on CPU (default = 0) mode value = single or mixed or double
@@ -75,6 +75,7 @@ Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
tptask value = Ntptask Ntptask = max number of coprocessor threads per MPI task (default = 240)
+ no_affinity values = none
kokkos args = keyword value ... zero or more keyword/value pairs may be appended keywords = neigh or newton or binsize or comm or comm/exchange or comm/forward
@@ -427,6 +428,13 @@ with 16 threads, for a total of 128.

Note that the default settings for tpc and tptask are fine for most problems, regardless of how many MPI tasks you assign to a Phi.

+

+The no_affinity keyword will turn off automatic setting of core
+affinity for MPI tasks and OpenMP threads on the host when using
+offload to a coprocessor. Affinity settings are used when possible
+to prevent MPI tasks and OpenMP threads from being on separate NUMA
+domains and to prevent offload threads from interfering with other
+processes/threads used for LAMMPS.
+

The kokkos style invokes settings associated with the use of the

diff --git a/doc/package.txt b/doc/package.txt
index 51f485f411..d3f8a51c17 100644
--- a/doc/package.txt
+++ b/doc/package.txt
@@ -54,7 +54,7 @@ args = arguments specific to the style :l
{intel} args = NPhi keyword value ... Nphi = # of coprocessors per node zero or more keyword/value pairs may be appended
- keywords = {omp} or {mode} or {balance} or {ghost} or {tpc} or {tptask}
+ keywords = {omp} or {mode} or {balance} or {ghost} or {tpc} or {tptask} or {no_affinity}
{omp} value = Nthreads Nthreads = number of OpenMP threads to use on CPU (default = 0) {mode} value = {single} or {mixed} or {double}
@@ -70,6 +70,7 @@ args = arguments specific to the style :l
Ntpc = max number of coprocessor threads per coprocessor core (default = 4) {tptask} value = Ntptask Ntptask = max number of coprocessor threads per MPI task (default = 240)
+ {no_affinity} values = none
{kokkos} args = keyword value ... zero or more keyword/value pairs may be appended keywords = {neigh} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward}
@@ -421,6 +422,13 @@ with 16 threads, for a total of 128.

Note that the default settings for {tpc} and {tptask} are fine for most problems, regardless of how many MPI tasks you assign to a Phi.

+The {no_affinity} keyword will turn off automatic setting of core
+affinity for MPI tasks and OpenMP threads on the host when using
+offload to a coprocessor. Affinity settings are used when possible
+to prevent MPI tasks and OpenMP threads from being on separate NUMA
+domains and to prevent offload threads from interfering with other
+processes/threads used for LAMMPS.
+
:line

The {kokkos} style invokes settings associated with the use of the
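As a closing usage illustration for the intel keywords documented in both package files above: tpc 4 and tptask 240 repeat the documented defaults, while 2 coprocessors, omp 16, and mode mixed are arbitrary example values:

   package intel 2 omp 16 mode mixed tpc 4 tptask 240 no_affinity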