From 6429e05b7861dd01c05866dfdc670c99189b95dd Mon Sep 17 00:00:00 2001
From: sjplimp
Date: Mon, 22 Dec 2014 22:12:21 +0000
Subject: [PATCH] git-svn-id: svn://svn.icms.temple.edu/lammps-ro/trunk@12850 f3b2605a-c512-4ea7-a41b-209d697bcdaa

---
 doc/accelerate_intel.html | 39 +++++++++++++++++++++++++++++----------
 doc/accelerate_intel.txt  | 39 +++++++++++++++++++++++++++++----------
 doc/package.html          | 10 +++++++++-
 doc/package.txt           | 10 +++++++++-
 4 files changed, 76 insertions(+), 22 deletions(-)

diff --git a/doc/accelerate_intel.html b/doc/accelerate_intel.html
index 9802a34d64..2972e989c3 100644
--- a/doc/accelerate_intel.html
+++ b/doc/accelerate_intel.html
@@ -62,8 +62,7 @@ Xeon Phi(TM) coprocessor is the same except for these additional steps:

The latter two steps in the first case and the last step in the coprocessor case can be done using the "-pk intel" and "-sf intel"
@@ -75,7 +74,7 @@ commands respectively to your input script.

Required hardware/software:

To use the offload option, you must have one or more Intel(R) Xeon
-Phi(TM) coprocessors.
+Phi(TM) coprocessors and use an Intel(R) C++ compiler.

Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in
@@ -85,10 +84,18 @@ vectorization or give poor performance.
g++ will not recognize some of the settings, so they cannot be used). The compiler must support the OpenMP interface.

+

+The recommended version of the Intel(R) compiler is 14.0.1.106.
+Versions 15.0.1.133 and later are also supported. If using Intel(R)
+MPI, versions 15.0.2.044 and later are recommended.
+

Building LAMMPS with the USER-INTEL package:

-

-You must choose at build time whether to build for CPU acceleration or
-to use the Xeon Phi in offload mode.
+

+You can choose to build with or without support for offload to an
+Intel(R) Xeon Phi(TM) coprocessor. If you build with support for a
+coprocessor, the same binary can be used on nodes with and without
+coprocessors installed. However, if you do not have coprocessors
+on your system, building without offload support will produce a
+smaller binary.

You can do either in one line, using the src/Make.py script, described in Section 2.4 of the manual. Type
@@ -119,7 +126,9 @@ both the CCFLAGS and LINKFLAGS variables. You also need to add

If you are compiling on the same architecture that will be used for the runs, adding the flag -xHost to CCFLAGS will enable
-vectorization with the Intel(R) compiler.
+vectorization with the Intel(R) compiler. Otherwise, you must
+provide the correct compute node architecture to the -x option
+(e.g. -xAVX).

In order to build with support for an Intel(R) Xeon Phi(TM) coprocessor, the flag -offload should be added to the LINKFLAGS line
@@ -130,10 +139,20 @@ included in the src/MAKE/OPTIONS directory with settings that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not.

-

-If using an Intel compiler, it is recommended that Intel(R) Compiler
-2013 SP1 update 1 be used. Newer versions have some performance
-issues that are being addressed. If using Intel(R) MPI, version 5 or
-higher is recommended.
+

+Notes on CPU and core affinity:
+

+

+Setting core affinity is often used to pin MPI tasks and OpenMP
+threads to a core or group of cores so that memory access can be
+uniform. Unless disabled at build time, affinity for MPI tasks and
+OpenMP threads on the host will be set by default
+when using offload to a coprocessor. In this case, it is unnecessary
+to use other methods to control affinity (e.g. taskset, numactl,
+I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script
+with the no_affinity option to the package intel
+command or by disabling the option at build time (by adding
+-DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile).
+Disabling this option is not recommended, especially when running
+on a machine with hyperthreading disabled.

Running with the USER-INTEL package from the command line:

diff --git a/doc/accelerate_intel.txt b/doc/accelerate_intel.txt
index e85899f189..c0cbafa448 100644
--- a/doc/accelerate_intel.txt
+++ b/doc/accelerate_intel.txt
@@ -59,8 +59,7 @@ Xeon Phi(TM) coprocessor is the same except for these additional steps:
add the flag -DLMP_INTEL_OFFLOAD to CCFLAGS in your Makefile.machine
-add the flag -offload to LINKFLAGS in your Makefile.machine
-specify how many coprocessor threads per MPI task to use :ul
+add the flag -offload to LINKFLAGS in your Makefile.machine :ul

The latter two steps in the first case and the last step in the coprocessor case can be done using the "-pk intel" and "-sf intel"
@@ -72,7 +71,7 @@ commands respectively to your input script.

[Required hardware/software:]

To use the offload option, you must have one or more Intel(R) Xeon
-Phi(TM) coprocessors.
+Phi(TM) coprocessors and use an Intel(R) C++ compiler.

Optimizations for vectorization have only been tested with the Intel(R) compiler. Use of other compilers may not result in
@@ -82,10 +81,18 @@ Use of an Intel C++ compiler is recommended, but not required (though
g++ will not recognize some of the settings, so they cannot be used). The compiler must support the OpenMP interface.
+The recommended version of the Intel(R) compiler is 14.0.1.106.
+Versions 15.0.1.133 and later are also supported. If using Intel(R)
+MPI, versions 15.0.2.044 and later are recommended.
+
[Building LAMMPS with the USER-INTEL package:]

-You must choose at build time whether to build for CPU acceleration or
-to use the Xeon Phi in offload mode.
+You can choose to build with or without support for offload to an
+Intel(R) Xeon Phi(TM) coprocessor. If you build with support for a
+coprocessor, the same binary can be used on nodes with and without
+coprocessors installed. However, if you do not have coprocessors
+on your system, building without offload support will produce a
+smaller binary.

You can do either in one line, using the src/Make.py script, described in "Section 2.4"_Section_start.html#start_4 of the manual. Type
@@ -116,7 +123,9 @@ both the CCFLAGS and LINKFLAGS variables. You also need to add

If you are compiling on the same architecture that will be used for the runs, adding the flag {-xHost} to CCFLAGS will enable
-vectorization with the Intel(R) compiler.
+vectorization with the Intel(R) compiler. Otherwise, you must
+provide the correct compute node architecture to the -x option
+(e.g. -xAVX).

In order to build with support for an Intel(R) Xeon Phi(TM) coprocessor, the flag {-offload} should be added to the LINKFLAGS line
@@ -127,10 +136,20 @@ included in the src/MAKE/OPTIONS directory with settings that perform well with the Intel(R) compiler. The latter file has support for offload to coprocessors; the former does not.

-If using an Intel compiler, it is recommended that Intel(R) Compiler
-2013 SP1 update 1 be used. Newer versions have some performance
-issues that are being addressed. If using Intel(R) MPI, version 5 or
-higher is recommended.
+[Notes on CPU and core affinity:]
+
+Setting core affinity is often used to pin MPI tasks and OpenMP
+threads to a core or group of cores so that memory access can be
+uniform. Unless disabled at build time, affinity for MPI tasks and
+OpenMP threads on the host will be set by default
+when using offload to a coprocessor. In this case, it is unnecessary
+to use other methods to control affinity (e.g. taskset, numactl,
+I_MPI_PIN_DOMAIN, etc.). This can be disabled in an input script
+with the {no_affinity} option to the "package intel"_package.html
+command or by disabling the option at build time (by adding
+-DINTEL_OFFLOAD_NOAFFINITY to the CCFLAGS line of your Makefile).
+Disabling this option is not recommended, especially when running
+on a machine with hyperthreading disabled.

[Running with the USER-INTEL package from the command line:]

diff --git a/doc/package.html b/doc/package.html
index 87e49a5c8f..5c0dd866d8 100644
--- a/doc/package.html
+++ b/doc/package.html
@@ -59,7 +59,7 @@ intel args = NPhi keyword value ...
Nphi = # of coprocessors per node zero or more keyword/value pairs may be appended
- keywords = omp or mode or balance or ghost or tpc or tptask
+ keywords = omp or mode or balance or ghost or tpc or tptask or no_affinity
omp value = Nthreads Nthreads = number of OpenMP threads to use on CPU (default = 0) mode value = single or mixed or double
@@ -75,6 +75,7 @@ Ntpc = max number of coprocessor threads per coprocessor core (default = 4)
tptask value = Ntptask Ntptask = max number of coprocessor threads per MPI task (default = 240)
+ no_affinity values = none
kokkos args = keyword value ... zero or more keyword/value pairs may be appended keywords = neigh or newton or binsize or comm or comm/exchange or comm/forward
@@ -427,6 +428,13 @@ with 16 threads, for a total of 128.

Note that the default settings for tpc and tptask are fine for most problems, regardless of how many MPI tasks you assign to a Phi.

+

+The no_affinity keyword will turn off automatic setting of core
+affinity for MPI tasks and OpenMP threads on the host when using
+offload to a coprocessor. Affinity settings are used when possible
+to prevent MPI tasks and OpenMP threads from being on separate NUMA
+domains and to prevent offload threads from interfering with other
+processes/threads used for LAMMPS.
+

The kokkos style invokes settings associated with the use of the

diff --git a/doc/package.txt b/doc/package.txt
index 51f485f411..d3f8a51c17 100644
--- a/doc/package.txt
+++ b/doc/package.txt
@@ -54,7 +54,7 @@ args = arguments specific to the style :l
{intel} args = NPhi keyword value ... Nphi = # of coprocessors per node zero or more keyword/value pairs may be appended
- keywords = {omp} or {mode} or {balance} or {ghost} or {tpc} or {tptask}
+ keywords = {omp} or {mode} or {balance} or {ghost} or {tpc} or {tptask} or {no_affinity}
{omp} value = Nthreads Nthreads = number of OpenMP threads to use on CPU (default = 0) {mode} value = {single} or {mixed} or {double}
@@ -70,6 +70,7 @@ args = arguments specific to the style :l
Ntpc = max number of coprocessor threads per coprocessor core (default = 4) {tptask} value = Ntptask Ntptask = max number of coprocessor threads per MPI task (default = 240)
+ {no_affinity} values = none
{kokkos} args = keyword value ... zero or more keyword/value pairs may be appended keywords = {neigh} or {newton} or {binsize} or {comm} or {comm/exchange} or {comm/forward}
@@ -421,6 +422,13 @@ with 16 threads, for a total of 128.

Note that the default settings for {tpc} and {tptask} are fine for most problems, regardless of how many MPI tasks you assign to a Phi.

+The {no_affinity} keyword will turn off automatic setting of core
+affinity for MPI tasks and OpenMP threads on the host when using
+offload to a coprocessor. Affinity settings are used when possible
+to prevent MPI tasks and OpenMP threads from being on separate NUMA
+domains and to prevent offload threads from interfering with other
+processes/threads used for LAMMPS.
+
:line

The {kokkos} style invokes settings associated with the use of the
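As a closing usage illustration for the intel keywords documented in both package files above: tpc 4 and tptask 240 repeat the documented defaults, while 2 coprocessors, omp 16, and mode mixed are arbitrary example values:

   package intel 2 omp 16 mode mixed tpc 4 tptask 240 no_affinity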