diff --git a/doc/Section_accelerate.html b/doc/Section_accelerate.html index 588df1a190..4de4fa10fe 100644 --- a/doc/Section_accelerate.html +++ b/doc/Section_accelerate.html @@ -683,19 +683,20 @@ occurs, the faster your simulation will run. that use data structures and methods and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos.
-Kokkos is a C++ library that provides two key abstractions for an -application like LAMMPS. First, it allows a single implementation of -an application kernel (e.g. a pair style) to run efficiently on -different kinds of hardware (GPU, Intel Phi, many-core chip). +
Kokkos is a C++ library +that provides two key abstractions for an application like LAMMPS. +First, it allows a single implementation of an application kernel +(e.g. a pair style) to run efficiently on different kinds of hardware +(GPU, Intel Phi, many-core chip).
-Second, it adjusts the memory layout of basic data structures like 2d -and 3d arrays specifically for the chosen hardware. These are used in -LAMMPS to store atom coordinates or forces or neighbor lists. The -layout is chosen to optimize performance on different platforms. -Again this operation is hidden from the developer, and does not affect -how the single implementation of the kernel is coded. -
-CT NOTE: Pointer to Kokkos web page??? +
Second, it provides data abstractions to adjust (at compile time) the +memory layout of basic data structures like 2d and 3d arrays and to +allow transparent use of special hardware load and store units. Such +data structures are used in LAMMPS to store atom coordinates, forces, +and neighbor lists. The layout is chosen to optimize performance on +different platforms. Again, this operation is hidden from the +developer and does not affect how the single implementation of the +kernel is coded.
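As a rough illustration of these two abstractions (a minimal sketch against the public Kokkos API, not code taken from LAMMPS), the same kernel written with a Kokkos view and parallel_for can be compiled for an OpenMP, pthreads, or CUDA backend, and the view's memory layout is selected for that backend at compile time:

#include <Kokkos_Core.hpp>

int main(int argc, char **argv) {
  Kokkos::initialize(argc, argv);
  {
    // 2d view of N atom positions; its layout (row- vs column-major)
    // is chosen automatically for the device it is compiled for.
    const int N = 1000;
    Kokkos::View<double*[3]> x("x", N);

    // One kernel implementation; it runs on host threads or on a GPU
    // depending on which Kokkos backend was enabled at build time.
    Kokkos::parallel_for(N, KOKKOS_LAMBDA(const int i) {
      x(i,0) = 0.1*i; x(i,1) = 0.2*i; x(i,2) = 0.3*i;
    });
  }
  Kokkos::finalize();
  return 0;
}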
These abstractions are set at build time, when LAMMPS is compiled with the KOKKOS package installed. This is done by selecting a "host" and @@ -727,9 +728,11 @@ i.e. the host and device are the same.
IMPORTNANT NOTE: Currently, if using GPUs, you should set the number of MPI tasks per compute node to be equal to the number of GPUs per compute node. In the future Kokkos will support assigning one GPU to -multiple MPI tasks or using multiple GPUs per MPI task. +multiple MPI tasks or using multiple GPUs per MPI task. Currently +Kokkos does not support AMD GPUs due to limits in the available +backend programming models (in particular, relatively extensive C++ +support is required for the kernel language). This is expected to +change in the future.
-CT NOTE: what about AMD GPUs running OpenCL? are they supported? +multiple MPI tasks or using multiple GPUs per MPI task. Currently +Kokkos does not support AMD GPUs due to limits in the available +backend programming models (in particular relative extensive C++ +support is required for the Kernel language). This is expected to +change in the future.
Here are several examples of how to build LAMMPS and run a simulation using the KOKKOS package for typical compute node configurations. @@ -857,8 +860,8 @@ communication can provide a speed-up for specific calculations. tasks/node * number of threads/task should not exceed N, and should typically equal N. Note that the default threads/task is 1, as set by the "t" keyword of the -k command-line -switch. If you do not change this, there -will no additional parallelism (beyond MPI) invoked on the host +switch. If you do not change this, no +additional parallelism (beyond MPI) will be invoked on the host CPU(s).
You can compare the performance running in different modes: @@ -878,9 +881,8 @@ software installation. Insure the -arch setting in src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see this section of the manual for details.
-The -np setting of the mpirun command must set the number of MPI -tasks/node to be equal to the # of physical GPUs on the node. CT -NOTE: does LAMMPS enforce this? +
The -np setting of the mpirun command should set the number of MPI +tasks/node to be equal to the # of physical GPUs on the node.
Use the -kokkos command-line switch to specify the number of GPUs per node, and the number of threads per MPI @@ -936,9 +938,19 @@ will be added later. performance to bind the threads to physical cores, so they do not migrate during a simulation. The same is true for MPI tasks, but the default binding rules implemented for various MPI versions, do not -account for thread binding. Thus you should do the following if using -multiple threads per MPI task. CT NOTE: explain what to do. +account for thread binding.
+Thus, if you use more than one thread per MPI task, you should ensure +that MPI tasks are bound to CPU sockets. Furthermore, when using +OpenMP, use the thread affinity environment variables provided by the +OpenMP runtime; when using pthreads, compile with hwloc support. With +OpenMP 3.1 (gcc 4.7 or later, Intel 12 or later), setting the +environment variable OMP_PROC_BIND=true should be sufficient. A +typical mpirun command should set these flags:
+OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ... +Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... +
When using a GPU, you will achieve the best performance if your input script does not use any fix or compute styles which are not yet Kokkos-enabled. This allows data to stay on the GPU for multiple @@ -956,8 +968,6 @@ together to compute pairwise interactions with the KOKKOS package. We hope to support this in the future, similar to the GPU package in LAMMPS.
-CT NOTE: other performance tips?? -
Depending on which flavor of MPI you are running, LAMMPS will look for one of these 3 environment variables
-SLURM_LOCALID (???) CT NOTE: what MPI is this for? +SLURM_LOCALID (various MPI variants compiled with SLURM support) MV2_COMM_WORLD_LOCAL_RANK (Mvapich) OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI)-which are initialized by "mpirun" or "mpiexec". The environment -variable setting for each MPI rank is used to assign a unique GPU ID -to the MPI task. +
which are initialized by the "srun", "mpirun" or "mpiexec" commands. +The environment variable setting for each MPI rank is used to assign a +unique GPU ID to the MPI task.
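Conceptually (an illustrative sketch only, not the code LAMMPS itself executes), the node-local rank read from one of these variables can be mapped onto a GPU ID with the CUDA runtime:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  // Check the environment variables in the same order listed above.
  const char *id = std::getenv("SLURM_LOCALID");
  if (!id) id = std::getenv("MV2_COMM_WORLD_LOCAL_RANK");
  if (!id) id = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK");
  const int local_rank = id ? std::atoi(id) : 0;

  int ngpus = 0;
  cudaGetDeviceCount(&ngpus);
  if (ngpus > 0) cudaSetDevice(local_rank % ngpus);   // one GPU per MPI task

  std::printf("local rank %d on this node uses GPU %d (of %d)\n",
              local_rank, ngpus > 0 ? local_rank % ngpus : -1, ngpus);
  return 0;
}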
threads Nt@@ -1274,13 +1274,20 @@ performing work when Kokkos is executing in OpenMP or pthreads mode. The default is Nt = 1, which essentially runs in MPI-only mode. If there are Np MPI tasks per physical node, you generally want Np*Nt = the number of physical cores per node, to use your available hardware -optimally. +optimally. This also sets the number of threads used by the host when +LAMMPS is compiled with CUDA=yes.numa Nm-CT NOTE: what does numa set, and why use it? -
-Explain. The default is Nm = 1. +
This option is only relevant when using pthreads with hwloc support. +In this case Nm defines the number of NUMA regions (typically sockets) +on a node which will be utilized by a single MPI rank. By default +Nm = 1. If this option is used, the total number of worker threads +per MPI rank is threads*numa. Currently it is almost always better to +assign at least one MPI rank per NUMA region, and leave numa set to +its default value of 1. This is because letting a single process span +multiple NUMA regions induces a significant amount of cross-NUMA data +traffic, which is slow.
-log file
diff --git a/doc/Section_start.txt b/doc/Section_start.txt index 8cdb7e8000..616dfae549 100644 --- a/doc/Section_start.txt +++ b/doc/Section_start.txt @@ -1253,13 +1253,13 @@ Ng = 1 and Ns is not set. Depending on which flavor of MPI you are running, LAMMPS will look for one of these 3 environment variables -SLURM_LOCALID (???) CT NOTE: what MPI is this for? +SLURM_LOCALID (various MPI variants compiled with SLURM support) MV2_COMM_WORLD_LOCAL_RANK (Mvapich) OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI) :pre -which are initialized by "mpirun" or "mpiexec". The environment -variable setting for each MPI rank is used to assign a unique GPU ID -to the MPI task. +which are initialized by the "srun", "mpirun" or "mpiexec" commands. +The environment variable setting for each MPI rank is used to assign a +unique GPU ID to the MPI task. threads Nt :pre @@ -1268,13 +1268,20 @@ performing work when Kokkos is executing in OpenMP or pthreads mode. The default is Nt = 1, which essentially runs in MPI-only mode. If there are Np MPI tasks per physical node, you generally want Np*Nt = the number of physical cores per node, to use your available hardware -optimally. +optimally. This also sets the number of threads used by the host when +LAMMPS is compiled with CUDA=yes. numa Nm :pre -CT NOTE: what does numa set, and why use it? - -Explain. The default is Nm = 1. +This option is only relevant when using pthreads with hwloc support. +In this case Nm defines the number of NUMA regions (typically sockets) +on a node which will be utilized by a single MPI rank. By default +Nm = 1. If this option is used, the total number of worker threads +per MPI rank is threads*numa. Currently it is almost always better to +assign at least one MPI rank per NUMA region, and leave numa set to +its default value of 1. This is because letting a single process span +multiple NUMA regions induces a significant amount of cross-NUMA data +traffic, which is slow. -log file :pre
diff --git a/doc/package.html b/doc/package.html index 7412dabdf9..939fee6ff2 100644 --- a/doc/package.html +++ b/doc/package.html @@ -216,26 +216,24 @@ device type can be specified when building LAMMPS with the GPU library.
-The kk style invokes options associated with the use of the +
The kokkos style invokes options associated with the use of the KOKKOS package.
The neigh keyword determines what kinds of neighbor lists are built. A value of half uses half-neighbor lists, the same as used by most -pair styles in LAMMPS. This is the default when running without -threads on a CPU. A value of half/thread uses a threadsafe variant -of the half-neighbor list. It should be used instead of half when -running with threads on a CPU. A value of full uses a +pair styles in LAMMPS. A value of half/thread uses a threadsafe +variant of the half-neighbor list. It should be used instead of +half when running with threads on a CPU. A value of full uses a full-neighborlist, i.e. f_ij and f_ji are both calculated. This performs twice as much computation as the half option, however that can be a win because it is threadsafe and doesn't require atomic -operations. This is the default when running in threaded mode or on -GPUs. A value of full/cluster is an experimental neighbor style, -where particles interact with all particles within a small cluster, if -at least one of the clusters particles is within the neighbor cutoff -range. This potentially allows for better vectorization on -architectures such as the Intel Phi. If also reduces the size of the -neighbor list by roughly a factor of the cluster size, thus reducing -the total memory footprint considerably. +operations. A value of full/cluster is an experimental neighbor +style, where particles interact with all particles within a small +cluster, if at least one of the cluster's particles is within the +neighbor cutoff range. This potentially allows for better +vectorization on architectures such as the Intel Phi. It also reduces +the size of the neighbor list by roughly a factor of the cluster size, +thus reducing the total memory footprint considerably.
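The difference between the half and full choices can be sketched as follows (hypothetical code: the names f, neighbors, numneigh, and pair_force are placeholders, not LAMMPS internals). With a half list each pair is stored once, so the Newton's-third-law update of atom j can collide with another thread's update and must be atomic; with a full list each thread only writes to its own atom i:

#include <Kokkos_Core.hpp>

// Placeholder pair force (x component only, for brevity).
KOKKOS_INLINE_FUNCTION double pair_force(int i, int j) { return 0.001*(i-j); }

void half_list_forces(Kokkos::View<double*> f, Kokkos::View<int**> neighbors,
                      Kokkos::View<int*> numneigh, int nlocal) {
  Kokkos::parallel_for(nlocal, KOKKOS_LAMBDA(const int i) {
    double fxi = 0.0;
    for (int jj = 0; jj < numneigh(i); jj++) {
      const int j = neighbors(i,jj);
      const double fx = pair_force(i,j);
      fxi += fx;
      Kokkos::atomic_add(&f(j), -fx);   // j may be updated by other threads
    }
    f(i) += fxi;
  });
}

void full_list_forces(Kokkos::View<double*> f, Kokkos::View<int**> neighbors,
                      Kokkos::View<int*> numneigh, int nlocal) {
  Kokkos::parallel_for(nlocal, KOKKOS_LAMBDA(const int i) {
    double fxi = 0.0;
    for (int jj = 0; jj < numneigh(i); jj++)
      fxi += pair_force(i, neighbors(i,jj));   // each pair computed twice
    f(i) = fxi;                                // but no atomics are needed
  });
}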
The comm/exchange and comm/forward keywords determine whether the host or device performs the packing and unpacking of data when @@ -254,26 +252,23 @@ packing/unpacking in parallel with threads. A value of device means to use the device, typically a GPU, to perform the packing/unpacking operation.
-CT NOTE: please read this paragraph, to make sure it is correct: -
The optimal choice for these keywords depends on the input script and the hardware used. The no value is useful for verifying that Kokkos code is working correctly. It may also be the fastest choice when using Kokkos styles in MPI-only mode (i.e. with a thread count of 1). -When running on CPUs or Xeon Phi, the host and device values -should work identically. When using GPUs, the device value will -typically be optimal if all of your styles used in your input script -are supported by the KOKKOS package. In this case data can stay on -the GPU for many timesteps without being moved between the host and -GPU, if you use the device value. This requires that your MPI is -able to access GPU memory directly. Currently that is true for -OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI. -If your script uses styles (e.g. fixes) which are not yet supported by -the KOKKOS package, then data has to be move between the host and -device anyway, so it is typically faster to let the host handle -communication, by using the host value. Using host instead of -no will enable use of multiple threads to pack/unpack communicated -data. +When running on CPUs or Xeon Phi, the host and device values work +identically. When using GPUs, the device value will typically be +optimal if all of the styles used in your input script are supported +by the KOKKOS package. In this case data can stay on the GPU for many +timesteps without being moved between the host and GPU, if you use the +device value. This requires that your MPI is able to access GPU +memory directly. Currently that is true for OpenMPI 1.8 (or later +versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses +styles (e.g. fixes) which are not yet supported by the KOKKOS package, +then data has to be moved between the host and device anyway, so it is +typically faster to let the host handle communication, by using the +host value. Using host instead of no will enable use of +multiple threads to pack/unpack communicated data.
@@ -354,13 +349,10 @@ invoked, to specify default settings for the GPU package. If the command-line switch is not used, then no defaults are set, and you must specify the appropriate package command in your input script. -CT NOTE: is this correct? The above sems to say the -choice of neigh value depends on use of threads or not. -
-The default settings for the KOKKOS package are "package kokkos neigh -full comm/exchange host comm/forward host". This is the case whether -the "-sf kk" command-line switch is used -or not. +
The default settings for the KOKKOS package are "package kokkos neigh +full comm/exchange host comm/forward host". This is the case whether +the "-sf kk" command-line switch is used or +not.
If the "-sf omp" command-line switch is used then it is as if the command "package omp *" were invoked, to diff --git a/doc/package.txt b/doc/package.txt index 58472b14d1..76ce6ef2d1 100644 --- a/doc/package.txt +++ b/doc/package.txt @@ -210,26 +210,24 @@ device type can be specified when building LAMMPS with the GPU library. :line -The {kk} style invokes options associated with the use of the +The {kokkos} style invokes options associated with the use of the KOKKOS package. The {neigh} keyword determines what kinds of neighbor lists are built. A value of {half} uses half-neighbor lists, the same as used by most -pair styles in LAMMPS. This is the default when running without -threads on a CPU. A value of {half/thread} uses a threadsafe variant -of the half-neighbor list. It should be used instead of {half} when -running with threads on a CPU. A value of {full} uses a +pair styles in LAMMPS. A value of {half/thread} uses a threadsafe +variant of the half-neighbor list. It should be used instead of +{half} when running with threads on a CPU. A value of {full} uses a full-neighborlist, i.e. f_ij and f_ji are both calculated. This performs twice as much computation as the {half} option, however that can be a win because it is threadsafe and doesn't require atomic -operations. This is the default when running in threaded mode or on -GPUs. A value of {full/cluster} is an experimental neighbor style, -where particles interact with all particles within a small cluster, if -at least one of the clusters particles is within the neighbor cutoff -range. This potentially allows for better vectorization on -architectures such as the Intel Phi. If also reduces the size of the -neighbor list by roughly a factor of the cluster size, thus reducing -the total memory footprint considerably. +operations. A value of {full/cluster} is an experimental neighbor +style, where particles interact with all particles within a small +cluster, if at least one of the clusters particles is within the +neighbor cutoff range. This potentially allows for better +vectorization on architectures such as the Intel Phi. If also reduces +the size of the neighbor list by roughly a factor of the cluster size, +thus reducing the total memory footprint considerably. The {comm/exchange} and {comm/forward} keywords determine whether the host or device performs the packing and unpacking of data when @@ -248,26 +246,23 @@ packing/unpacking in parallel with threads. A value of {device} means to use the device, typically a GPU, to perform the packing/unpacking operation. -CT NOTE: please read this paragraph, to make sure it is correct: - The optimal choice for these keywords depends on the input script and the hardware used. The {no} value is useful for verifying that Kokkos code is working correctly. It may also be the fastest choice when using Kokkos styles in MPI-only mode (i.e. with a thread count of 1). -When running on CPUs or Xeon Phi, the {host} and {device} values -should work identically. When using GPUs, the {device} value will -typically be optimal if all of your styles used in your input script -are supported by the KOKKOS package. In this case data can stay on -the GPU for many timesteps without being moved between the host and -GPU, if you use the {device} value. This requires that your MPI is -able to access GPU memory directly. Currently that is true for -OpenMPI 1.8 (or later versions), Mvapich2 1.9 (or later), and CrayMPI. -If your script uses styles (e.g. 
fixes) which are not yet supported by -the KOKKOS package, then data has to be move between the host and -device anyway, so it is typically faster to let the host handle -communication, by using the {host} value. Using {host} instead of -{no} will enable use of multiple threads to pack/unpack communicated -data. +When running on CPUs or Xeon Phi, the {host} and {device} values work +identically. When using GPUs, the {device} value will typically be +optimal if all of the styles used in your input script are supported +by the KOKKOS package. In this case data can stay on the GPU for many +timesteps without being moved between the host and GPU, if you use the +{device} value. This requires that your MPI is able to access GPU +memory directly. Currently that is true for OpenMPI 1.8 (or later +versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses +styles (e.g. fixes) which are not yet supported by the KOKKOS package, +then data has to be moved between the host and device anyway, so it is +typically faster to let the host handle communication, by using the +{host} value. Using {host} instead of {no} will enable use of +multiple threads to pack/unpack communicated data. :line @@ -348,13 +343,10 @@ invoked, to specify default settings for the GPU package. If the command-line switch is not used, then no defaults are set, and you must specify the appropriate package command in your input script. -CT NOTE: is this correct? The above sems to say the -choice of neigh value depends on use of threads or not. - -The default settings for the KOKKOS package are "package kokkos neigh -full comm/exchange host comm/forward host". This is the case whether -the "-sf kk" "command-line switch"_Section_start.html#start_7 is used -or not. +The default settings for the KOKKOS package are "package kokkos neigh +full comm/exchange host comm/forward host". This is the case whether +the "-sf kk" "command-line switch"_Section_start.html#start_7 is used +or not. If the "-sf omp" "command-line switch"_Section_start.html#start_7 is used then it is as if the command "package omp *" were invoked, to