diff --git a/bench/GPU/README b/bench/GPU/README index 91dba179ba..a85d6302ba 100644 --- a/bench/GPU/README +++ b/bench/GPU/README @@ -39,9 +39,10 @@ mpirun -np 8 ../lmp_linux_mixed -sf gpu -c off -v g 2 -v x 32 -v y 32 -v z 64 -v The "xyz" settings determine the problem size. The "t" setting determines the number of timesteps. The "np" setting determines how -many CPUs the problem will be run on, and the "g" settings determines -how many GPUs the problem will run on, i.e. 1 or 2 in this case. You -can use more CPUs than GPUs with the GPU package. +many MPI tasks per compute node the problem will run on, and the "g" +setting determines how many GPUs per compute node the problem will run +on, i.e. 1 or 2 in this case. Note that you can use more MPI tasks +than GPUs (both per compute node) with the GPU package. ------------------------------------------------------------------------ @@ -54,7 +55,7 @@ mpirun -np 2 ../lmp_linux_double -sf cuda -v g 2 -v x 32 -v y 64 -v z 64 -v t 10 The "xyz" settings determine the problem size. The "t" setting determines the number of timesteps. The "np" setting determines how -many CPUs the problem will be run on, and the "g" setting determines -how many GPUs the problem will run on, i.e. 1 or 2 in this case. You -should make the number of CPUs and number of GPUs equal for the -USER-CUDA package. +many MPI tasks per compute node the problem will run on, and the "g" +setting determines how many GPUs per compute node the problem will run +on, i.e. 1 or 2 in this case. For the USER-CUDA package, the number +of MPI tasks and GPUs (both per compute node) must be equal. diff --git a/doc/Manual.html b/doc/Manual.html index 6dc80d4376..576b6b0114 100644 --- a/doc/Manual.html +++ b/doc/Manual.html @@ -1,7 +1,7 @@
For example, all of these variants of the basic Lennard-Jones pair -style exist in LAMMPS: +
For example, all of these styles are variants of the basic +Lennard-Jones pair style pair_style lj/cut:
-Assuming you have built LAMMPS with the appropriate package, these styles can be invoked by specifying them explicitly in your input @@ -161,11 +162,17 @@ script. Or you can use the -suffix comma switch to invoke the accelerated versions automatically, without changing your input script. The suffix command allows you to set a suffix explicitly and -to turn off/on the comand-line switch setting, both from within your -input script. +to turn off and back on the command-line switch setting, both from +within your input script.
-Styles with an "opt" suffix are part of the OPT package and typically -speed-up the pairwise calculations of your simulation by 5-25%. +
Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU +packages, and can be run on NVIDIA GPUs associated with your CPUs. +The speed-up due to GPU usage depends on a variety of factors, as +discussed below. +
+Styles with a "kk" suffix are part of the KOKKOS package, and can be +run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends +on a variety of factors, as discussed below.
Styles with an "omp" suffix are part of the USER-OMP package and allow a pair-style to be run in multi-threaded mode using OpenMP. This can @@ -174,26 +181,26 @@ than cores is advantageous, e.g. when running with PPPM so that FFTs are run on fewer MPI processors or when the many MPI tasks would overload the available bandwidth for communication.
-Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA -packages, and can be run on NVIDIA GPUs associated with your CPUs. -The speed-up due to GPU usage depends on a variety of factors, as -discussed below. +
Styles with an "opt" suffix are part of the OPT package and typically +speed-up the pairwise calculations of your simulation by 5-25%.
To see what styles are currently available in each of the accelerated packages, see Section_commands 5 of the manual. A list of accelerated styles is included in the pair, fix, -compute, and kspace sections. +compute, and kspace sections. The doc page for each individual style +(e.g. pair lj/cut or fix nve) will also +list any accelerated variants available for that style.
The following sections explain:
The final section compares and contrasts the GPU and USER-CUDA -packages, since they are both designed to use NVIDIA GPU hardware. +packages, since they are both designed to use NVIDIA hardware.
make yes-opt make machine-
If your input script uses one of the OPT pair styles, -you can run it as follows: +
If your input script uses one of the OPT pair styles, you can run it +as follows:
lmp_machine -sf opt < in.script mpirun -np 4 lmp_machine -sf opt < in.script @@ -226,12 +233,13 @@ to 20% savings.5.5 USER-OMP package
-The USER-OMP package was developed by Axel Kohlmeyer at Temple University. -It provides multi-threaded versions of most pair styles, all dihedral -styles and a few fixes in LAMMPS. The package currently uses the OpenMP -interface which requires using a specific compiler flag in the makefile -to enable multiple threads; without this flag the corresponding pair -styles will still be compiled and work, but do not support multi-threading. +
The USER-OMP package was developed by Axel Kohlmeyer at Temple +University. It provides multi-threaded versions of most pair styles, +all dihedral styles, and a few fixes in LAMMPS. The package currently +uses the OpenMP interface which requires using a specific compiler +flag in the makefile to enable multiple threads; without this flag the +corresponding pair styles will still be compiled and work, but do not +support multi-threading.
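+As an illustrative sketch only (the optimization flags shown are just
+examples), with GNU or Clang compilers the required OpenMP flag is
+-fopenmp, which must appear in both the compile and link flags of your
+machine Makefile; Intel compilers use -openmp or -qopenmp instead:
+
+CCFLAGS =   -g -O3 -fopenmp
+LINKFLAGS = -g -O3 -fopenmp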
Building LAMMPS with the USER-OMP package:
@@ -264,18 +272,19 @@ env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script
The value of the environment variable OMP_NUM_THREADS determines how -many threads per MPI task are launched. All three examples above use -a total of 4 CPU cores. For different MPI implementations the method -to pass the OMP_NUM_THREADS environment variable to all processes is -different. Two different variants, one for MPICH and OpenMPI, respectively -are shown above. Please check the documentation of your MPI installation -for additional details. Alternatively, the value provided by OMP_NUM_THREADS -can be overridded with the package omp command. -Depending on which styles are accelerated in your input, you should -see a reduction in the "Pair time" and/or "Bond time" and "Loop time" -printed out at the end of the run. The optimal ratio of MPI to OpenMP -can vary a lot and should always be confirmed through some benchmark -runs for the current system and on the current machine. +many threads per MPI task are launched. All three examples above use a +total of 4 CPU cores. For different MPI implementations the method to +pass the OMP_NUM_THREADS environment variable to all processes is +different. Two different variants, one for MPICH and OpenMPI, +respectively, are shown above. Please check the documentation of your +MPI installation for additional details. Alternatively, the value +provided by OMP_NUM_THREADS can be overridden with the package +omp command. Depending on which styles are accelerated +in your input, you should see a reduction in the "Pair time" and/or +"Bond time" and "Loop time" printed out at the end of the run. The +optimal ratio of MPI to OpenMP can vary a lot and should always be +confirmed through some benchmark runs for the current system and on +the current machine.
Restrictions:
@@ -293,53 +302,55 @@ On the other hand, in many cases you still want to use the omp version all contain optimizations similar to those in the OPT package, which can result in serial speedup. -Using multi-threading is most effective under the following circumstances: +
Using multi-threading is most effective under the following +circumstances:
-The best parallel efficiency from omp styles is typically -achieved when there is at least one MPI task per physical -processor, i.e. socket or die. +
The best parallel efficiency from omp styles is typically achieved +when there is at least one MPI task per physical processor, +i.e. socket or die.
Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth -requirements is not offset by the gain in CPU utilization -through hyper-threading. +requirements is not offset by the gain in CPU utilization through +hyper-threading.
A description of the multi-threading strategy and some performance -examples are presented here +examples are presented +here
NOTE: - discuss 3 precisions - if change, also have to re-link with LAMMPS - always use newton off - expt with differing numbers of CPUs vs GPU - can't tell what is fastest - give command line switches in examples -
-I am not very clear to the meaning of "Max Mem / Proc" -in the "GPU Time Info (average)". -Is it the maximal of GPU memory used by one CPU core? -
-It is the maximum memory used at one time on the GPU for data storage by -a single MPI process. - Mike -
Hardware and software requirements:
-To use this package, you currently need to have specific NVIDIA -hardware and install specific NVIDIA CUDA software on your system: +
To use this package, you currently need to have an NVIDIA GPU and +install the NVIDIA Cuda software on your system:
-Before building the library, you can set the precision it will use by +editing the CUDA_PREC setting in the Makefile you are using, as +follows: +
+CUDA_PREC = -D_SINGLE_SINGLE # Single precision for all calculations +CUDA_PREC = -D_DOUBLE_DOUBLE # Double precision for all calculations +CUDA_PREC = -D_SINGLE_DOUBLE # Accumulation of forces, etc, in double ++
The last setting is the mixed mode referred to above. Note that your +GPU must support double precision to use either the 2nd or 3rd of +these settings. +
+To build the library, then type:
cd lammps/lib/gpu make -f Makefile.linux @@ -424,41 +439,60 @@ set appropriately to include the paths and settings for the CUDA system software on your machine. See src/MAKE/Makefile.g++ for an example. -GPU configuration +
Also note that if you change the GPU library precision, you need to +re-build the entire library. You should do a "clean" first, +e.g. "make -f Makefile.linux clean". Then you must also re-build +LAMMPS if the library precision has changed, so that it re-links with +the new library.
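+For example, after editing CUDA_PREC in Makefile.linux, a full rebuild
+would look like the following (substitute the Makefile and machine
+target you actually use):
+
+cd lammps/lib/gpu
+make -f Makefile.linux clean
+make -f Makefile.linux
+cd ../../src
+make machine            # re-link LAMMPS; clean first if nothing rebuilds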
-When using GPUs, you are restricted to one physical GPU per LAMMPS -process, which is an MPI process running on a single core or -processor. Multiple MPI processes (CPU cores) can share a single GPU, -and in many cases it will be more efficient to run this way. +
Running an input script:
-Input script requirements: +
The examples/gpu and bench/GPU directories have scripts that can be +run with the GPU package, as well as detailed instructions on how to +run them.
-Additional input script requirements to run pair or PPPM styles with a +
The total number of MPI tasks used by LAMMPS (one or multiple per +compute node) is set in the usual manner via the mpirun or mpiexec +commands, and is independent of the GPU package. +
+When using the GPU package, you cannot assign more than one physical +GPU to an MPI task. However multiple MPI tasks can share the same +GPU, and in many cases it will be more efficient to run this way. +
+Input script requirements to run using pair or PPPM styles with a gpu suffix are as follows:
-
As an example, if you have two GPUs per node and 8 CPU cores per node, -and would like to run on 4 nodes (32 cores) with dynamic balancing of -force calculation across CPU and GPU cores, you could specify +
The default for the package gpu command is to have all +the MPI tasks on the compute node use a single GPU. If you have +multiple GPUs per node, then be sure to create one or more MPI tasks +per GPU, and use the first/last settings in the package +gpu command to include all the GPU IDs on the node. +E.g. first = 0, last = 1, for 2 GPUs. For example, on an 8-core 2-GPU +compute node, if you assign 8 MPI tasks to the node, the following +command in the input script
-package gpu force/neigh 0 1 -1 --
In this case, all CPU cores and GPU devices on the nodes would be -utilized. Each GPU device would be shared by 4 CPU cores. The CPU -cores would perform force calculations for some fraction of the -particles at the same time the GPUs performed force calculation for -the other particles. +
package gpu force/neigh 0 1 -1 +
+would specify that each GPU is shared by 4 MPI tasks. The final -1 will +dynamically balance force calculations across the CPU cores and GPUs. +I.e. each CPU core will perform force calculations for some small +fraction of the particles, while the GPUs perform force +calculations for the majority of the particles.
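+A matching launch command for that 8-core 2-GPU node (the executable
+name is a placeholder) simply requests the 8 MPI tasks in the usual
+way:
+
+mpirun -np 8 lmp_machine -sf gpu -in in.script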
Timing output:
@@ -482,19 +516,30 @@ screen output (not in the log file) at the end of each run. These timings represent total time spent on the GPU for each routine, regardless of asynchronous CPU calculations. +The output section "GPU Time Info (average)" reports "Max Mem / Proc". +This is the maximum memory used at one time on the GPU for data +storage by a single MPI process. +
Performance tips:
-Generally speaking, for best performance, you should use multiple CPUs -per GPU, as provided my most multi-core CPU/GPU configurations. +
You should experiment with how many MPI tasks per GPU to use to see +what gives the best performance for your problem. This is a function +of your problem size and what pair style you are using. Likewise, you +should also experiment with the precision setting for the GPU library +to see if single or mixed precision will give accurate results, since +they will typically be faster.
-Because of the large number of cores within each GPU device, it may be -more efficient to run on fewer processes per GPU when the number of -particles per MPI process is small (100's of particles); this can be -necessary to keep the GPU cores busy. +
Using multiple MPI tasks per GPU will often give the best performance, +as allowed by most multi-core CPU/GPU configurations.
-See the lammps/lib/gpu/README file for instructions on how to build -the GPU library for single, mixed, or double precision. The latter -requires that your GPU card support double precision. +
If the number of particles per MPI task is small (e.g. 100s of +particles), it can be more efficient to run with fewer MPI tasks per +GPU, even if you do not use all the cores on the compute node. +
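+One illustrative way to do this experiment (executable and input names
+are placeholders) is to repeat the same run with different task counts
+on a single node and compare the reported loop times:
+
+mpirun -np 1 lmp_machine -sf gpu -in in.script
+mpirun -np 2 lmp_machine -sf gpu -in in.script
+mpirun -np 4 lmp_machine -sf gpu -in in.script
+mpirun -np 8 lmp_machine -sf gpu -in in.script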
+The Benchmark page of the LAMMPS +web site gives GPU performance on a desktop machine and the Titan HPC +platform at ORNL for several of the LAMMPS benchmarks, as a function +of problem size and number of compute nodes.
The KOKKOS package contains versions of pair, fix, and atom styles +that use data structures, methods, and macros provided by the Kokkos +library, which is included with LAMMPS in lib/kokkos. +
+Kokkos is a C++ library +that provides two key abstractions for an application like LAMMPS. +First, it allows a single implementation of an application kernel +(e.g. a pair style) to run efficiently on different kinds of hardware +(GPU, Intel Phi, many-core chip). +
+Second, it provides data abstractions to adjust (at compile time) the +memory layout of basic data structures like 2d and 3d arrays and allow +the transparent utilization of special hardware load and store units. +Such data structures are used in LAMMPS to store atom coordinates or +forces or neighbor lists. The layout is chosen to optimize +performance on different platforms. Again this operation is hidden +from the developer, and does not affect how the single implementation +of the kernel is coded. +
+These abstractions are set at build time, when LAMMPS is compiled with +the KOKKOS package installed. This is done by selecting a "host" and +"device" to build for, compatible with the compute nodes in your +machine. Note that if you are running on a desktop machine, you +typically have one compute node. On a cluster or supercomputer there +may be dozens or 1000s of compute nodes. The procedure for building +and running with the Kokkos library is the same, no matter how many +nodes you run on. +
+All Kokkos operations occur within the context of an individual MPI +task running on a single node of the machine. The total number of MPI +tasks used by LAMMPS (one or multiple per compute node) is set in the +usual manner via the mpirun or mpiexec commands, and is independent of +Kokkos. +
+Kokkos provides support for one or two modes of execution per MPI +task. This means that some computational tasks (pairwise +interactions, neighbor list builds, time integration, etc) are +parallelized in one or the other of the two modes. The first mode is +called the "host" and is one or more threads running on one or more +physical CPUs (within the node). Currently, both multi-core CPUs and +an Intel Phi processor (running in native mode) are supported. The +second mode is called the "device" and is an accelerator chip of some +kind. Currently only an NVIDIA GPU is supported. If your compute +node does not have a GPU, then there is only one mode of execution, +i.e. the host and device are the same. +
+IMPORTANT NOTE: Currently, if using GPUs, you should set the number +of MPI tasks per compute node to be equal to the number of GPUs per +compute node. In the future Kokkos will support assigning one GPU to +multiple MPI tasks or using multiple GPUs per MPI task. Currently +Kokkos does not support AMD GPUs due to limits in the available +backend programming models (in particular, relatively extensive C++ +support is required for the kernel language). This is expected to +change in the future.
+Here are several examples of how to build LAMMPS and run a simulation +using the KOKKOS package for typical compute node configurations. +Note that the -np setting for the mpirun command in these examples is +for a run on a single node. To scale these examples up to run on a +system with N compute nodes, simply multiply the -np setting by N.
+All the build steps are performed from within the src directory. All +the run steps are performed in the bench directory using the in.lj +input script. It is assumed the LAMMPS executable has been copied to +that directory or whatever directory the runs are being performed in. +Details of the various options are discussed below. +
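+As a minimal end-to-end sketch of that workflow (the g++ target and
+executable name are just examples taken from below):
+
+cd src
+make yes-kokkos
+make g++ OMP=yes
+cp lmp_g++ ../bench
+cd ../bench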
+Compute node(s) = dual hex-core CPUs and no GPU: +
+make yes-kokkos # install the KOKKOS package +make g++ OMP=yes # build with OpenMP, no CUDA ++
mpirun -np 12 lmp_g++ -k off < in.lj # MPI-only mode with no Kokkos +mpirun -np 12 lmp_g++ -sf kk < in.lj # MPI-only mode with Kokkos +mpirun -np 1 lmp_g++ -k on t 12 -sf kk < in.lj # one MPI task, 12 threads +mpirun -np 2 lmp_g++ -k on t 6 -sf kk < in.lj # two MPI tasks, 6 threads/task ++
Compute node(s) = Intel Phi with 61 cores: +
+make yes-kokkos +make g++ OMP=yes MIC=yes # build with OpenMP for Phi ++
mpirun -np 12 lmp_g++ -k on t 20 -sf kk < in.lj # 12*20 = 240 total cores +mpirun -np 15 lmp_g++ -k on t 16 -sf kk < in.lj +mpirun -np 30 lmp_g++ -k on t 8 -sf kk < in.lj +mpirun -np 1 lmp_g++ -k on t 240 -sf kk < in.lj ++
Compute node(s) = dual hex-core CPUs and a single GPU: +
+make yes-kokkos +make cuda CUDA=yes # build for GPU, use src/MAKE/Makefile.cuda ++
mpirun -np 1 lmp_cuda -k on t 6 -sf kk < in.lj ++
Compute node(s) = dual 8-core CPUs and 2 GPUs: +
+make yes-kokkos +make cuda CUDA=yes ++
mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk < in.lj # use both GPUs, one per MPI task ++
Building LAMMPS with the KOKKOS package: +
+A summary of the build process is given here. More details and all +the available make variable options are given in this +section of the manual. +
+From the src directory, type +
+make yes-kokkos ++
to include the KOKKOS package. Then perform a normal LAMMPS build, +with additional make variable specifications to choose the host and +device you will run the resulting executable on, e.g. +
+make g++ OMP=yes +make cuda CUDA=yes ++
As illustrated above, the most important variables to set are OMP, +CUDA, and MIC. The default settings are OMP=yes, CUDA=no, MIC=no. +Setting OMP to yes will use OpenMP for threading on the host, as +well as on the device (if no GPU is present). Setting CUDA to yes +will use one or more GPUs as the device. Setting MIC=yes is necessary +when building for an Intel Phi processor.
Note that to use a GPU, you must use a low-level Makefile, +e.g. src/MAKE/Makefile.cuda as included in the LAMMPS distro, which +uses the NVIDIA "nvcc" compiler. You must check that the CCFLAGS -arch +setting is appropriate for your NVIDIA hardware and installed +software. Typical values for -arch are given in this +section of the manual, as well as other +settings that must be included in the low-level Makefile, if you create +your own.
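+As an illustrative example only, the relevant line in such a Makefile
+might read as follows, where sm_35 targets a Kepler-class GPU and must
+be changed to match your card's compute capability:
+
+CCFLAGS =   -O3 -arch=sm_35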
+Input scripts and use of command-line switches -kokkos and -suffix: +
+To use any Kokkos-enabled style provided in the KOKKOS package, you +must use a Kokkos-enabled atom style. LAMMPS will give an error if +you do not do this. +
+There are two command-line switches relevant to using Kokkos, -k or +-kokkos, and -sf or -suffix. They are described in detail in this +section of the manual. +
+Here are common options to use: +
+-k off : runs an executable built with the KOKKOS package, as if
+Kokkos were not installed.
+
+-sf kk : enables automatic use of Kokkos versions of atom, pair, fix,
+and compute styles if they exist. This can also be done with more
+precise control by using the suffix command or appending "kk" to
+styles within the input script, e.g. "pair_style lj/cut/kk".
+
+-k on t Nt : specifies how many threads per MPI task to use within a
+compute node. For good performance, the product of MPI tasks *
+threads/task should not exceed the number of physical CPU or Intel
+Phi cores.
+
+-k on g Ng : specifies how many GPUs per compute node are available.
+The default is 1, so this should be specified if you have 2 or more
+GPUs per compute node.
+Use of package command options: +
+Using the package kokkos command in an input script +allows choice of options for neighbor lists and communication. See +the package command doc page for details and default +settings. +
+Experimenting with different styles of neighbor lists or inter-node +communication can provide a speed-up for specific calculations. +
+Running on a multi-core CPU: +
+Build with OMP=yes (the default) and CUDA=no (the default). +
+If N is the number of physical cores/node, then the number of MPI +tasks/node * number of threads/task should not exceed N, and should +typically equal N. Note that the default threads/task is 1, as set by +the "t" keyword of the -k command-line +switch. If you do not change this, no +additional parallelism (beyond MPI) will be invoked on the host +CPU(s). +
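+As a worked example for the dual hex-core node used above (N = 12),
+valid combinations include 12 tasks x 1 thread, 2 tasks x 6 threads,
+or 1 task x 12 threads, e.g.
+
+mpirun -np 2 lmp_g++ -k on t 6 -sf kk < in.lj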
+You can compare the performance running in different modes: +
+run with 1 MPI task/node and N threads/task
+run with N MPI tasks/node and 1 thread/task
+run with settings in between these extremes
+Examples of mpirun commands in these modes, for nodes with dual +hex-core CPUs and no GPU, are shown above. +
+Running on GPUs: +
+Build with CUDA=yes, using src/MAKE/Makefile.cuda. Insure the setting +for CUDA_PATH in lib/kokkos/Makefile.lammps is correct for your Cuda +software installation. Insure the -arch setting in +src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see +this section of the manual for details).
+The -np setting of the mpirun command should set the number of MPI +tasks/node to be equal to the # of physical GPUs on the node. +
+Use the -kokkos command-line switch to +specify the number of GPUs per node, and the number of threads per MPI +task. As above for multi-core CPUs (and no GPU), if N is the number +of physical cores/node, then the number of MPI tasks/node * number of +threads/task should not exceed N. With one GPU (and one MPI task) it +may be faster to use less than all the available cores, by setting +threads/task to a smaller value. This is because using all the cores +on a dual-socket node will incur extra cost to copy memory from the +2nd socket to the GPU. +
+Examples of mpirun commands that follow these rules, for nodes with +dual hex-core CPUs and one or two GPUs, are shown above. +
+Running on an Intel Phi: +
+Kokkos only uses Intel Phi processors in their "native" mode, i.e. +not hosted by a CPU. +
+Build with OMP=yes (the default) and MIC=yes. The latter +insures code is correctly compiled for the Intel Phi. The +OMP setting means OpenMP will be used for parallelization +on the Phi, which is currently the best option within +Kokkos. In the future, other options may be added. +
+Current-generation Intel Phi chips have either 61 or 57 cores. One +core should be excluded to run the OS, leaving 60 or 56 cores. Each +core is hyperthreaded, so there are effectively N = 240 (4*60) or N = +224 (4*56) cores to run on. +
+The -np setting of the mpirun command sets the number of MPI +tasks/node. The "-k on t Nt" command-line switch sets the number of +threads/task as Nt. The product of these 2 values should be N, i.e. +240 or 224. Also, the number of threads/task should be a multiple of +4 so that logical threads from more than one MPI task do not run on +the same physical core. +
+Examples of mpirun commands that follow these rules, for Intel Phi +nodes with 61 cores, are shown above. +
+Examples and benchmarks: +
+The examples/kokkos and bench/KOKKOS directories have scripts that can +be run with the KOKKOS package, as well as detailed instructions on +how to run them. +
+IMPORTANT NOTE: the bench/KOKKOS directory does not yet exist. It +will be added later. +
+Additional performance issues: +
+When using threads (OpenMP or pthreads), it is important for +performance to bind the threads to physical cores, so they do not +migrate during a simulation. The same is true for MPI tasks, but the +default binding rules implemented for various MPI versions do not +account for thread binding.
+Thus if you use more than one thread per MPI task, you should insure +MPI tasks are bound to CPU sockets. Furthermore, use thread affinity +environment variables from the OpenMP runtime when using OpenMP and +compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc +4.7 or later, intel 12 or later) setting the environment variable +OMP_PROC_BIND=true should be sufficient. A typical mpirun command +should set these flags: +
+OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ... +Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... ++
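+An illustrative OpenMPI command that combines socket binding with
+OpenMP thread affinity for a dual hex-core node (adjust the task and
+thread counts for your hardware) is:
+
+OMP_PROC_BIND=true mpirun -np 2 -x OMP_PROC_BIND -bind-to socket -map-by socket ./lmp_openmpi -k on t 6 -sf kk -in in.lj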
When using a GPU, you will achieve the best performance if your input +script does not use any fix or compute styles which are not yet +Kokkos-enabled. This allows data to stay on the GPU for multiple +timesteps, without being copied back to the host CPU. Invoking a +non-Kokkos fix or compute, or performing I/O for +thermo or dump output will cause data +to be copied back to the CPU. +
+You cannot yet assign multiple MPI tasks to the same GPU with the +KOKKOS package. We plan to support this in the future, similar to the +GPU package in LAMMPS. +
+You cannot yet use both the host (multi-threaded) and device (GPU) +together to compute pairwise interactions with the KOKKOS package. We +hope to support this in the future, similar to the GPU package in +LAMMPS. +
Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways. diff --git a/doc/Section_accelerate.txt b/doc/Section_accelerate.txt index 1e995d5c72..a4252cf177 100644 --- a/doc/Section_accelerate.txt +++ b/doc/Section_accelerate.txt @@ -21,7 +21,8 @@ kinds of machines. 5.5 "USER-OMP package"_#acc_5 5.6 "GPU package"_#acc_6 5.7 "USER-CUDA package"_#acc_7 -5.8 "Comparison of GPU and USER-CUDA packages"_#acc_8 :all(b) +5.8 "KOKKOS package"_#acc_8 +5.9 "Comparison of GPU and USER-CUDA packages"_#acc_9 :all(b) :line :line @@ -142,14 +143,14 @@ command is identical, their functionality is the same, and the numerical results it produces should also be identical, except for precision and round-off issues. -For example, all of these variants of the basic Lennard-Jones pair -style exist in LAMMPS: +For example, all of these styles are variants of the basic +Lennard-Jones pair style "pair_style lj/cut"_pair_lj.html: -"pair_style lj/cut"_pair_lj.html -"pair_style lj/cut/opt"_pair_lj.html -"pair_style lj/cut/omp"_pair_lj.html +"pair_style lj/cut/cuda"_pair_lj.html "pair_style lj/cut/gpu"_pair_lj.html -"pair_style lj/cut/cuda"_pair_lj.html :ul +"pair_style lj/cut/kk"_pair_lj.html +"pair_style lj/cut/omp"_pair_lj.html +"pair_style lj/cut/opt"_pair_lj.html :ul Assuming you have built LAMMPS with the appropriate package, these styles can be invoked by specifying them explicitly in your input @@ -157,11 +158,17 @@ script. Or you can use the "-suffix command-line switch"_Section_start.html#start_7 to invoke the accelerated versions automatically, without changing your input script. The "suffix"_suffix.html command allows you to set a suffix explicitly and -to turn off/on the comand-line switch setting, both from within your -input script. +to turn off and back on the comand-line switch setting, both from +within your input script. -Styles with an "opt" suffix are part of the OPT package and typically -speed-up the pairwise calculations of your simulation by 5-25%. +Styles with a "cuda" or "gpu" suffix are part of the USER-CUDA or GPU +packages, and can be run on NVIDIA GPUs associated with your CPUs. +The speed-up due to GPU usage depends on a variety of factors, as +discussed below. + +Styles with a "kk" suffix are part of the KOKKOS package, and can be +run using OpenMP, pthreads, or on an NVIDIA GPU. The speed-up depends +on a variety of factors, as discussed below. Styles with an "omp" suffix are part of the USER-OMP package and allow a pair-style to be run in multi-threaded mode using OpenMP. This can @@ -170,26 +177,26 @@ than cores is advantageous, e.g. when running with PPPM so that FFTs are run on fewer MPI processors or when the many MPI tasks would overload the available bandwidth for communication. -Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA -packages, and can be run on NVIDIA GPUs associated with your CPUs. -The speed-up due to GPU usage depends on a variety of factors, as -discussed below. +Styles with an "opt" suffix are part of the OPT package and typically +speed-up the pairwise calculations of your simulation by 5-25%. To see what styles are currently available in each of the accelerated packages, see "Section_commands 5"_Section_commands.html#cmd_5 of the manual. A list of accelerated styles is included in the pair, fix, -compute, and kspace sections. +compute, and kspace sections. The doc page for each indvidual style +(e.g. 
"pair lj/cut"_pair_lj.html or "fix nve"_fix_nve.html) will also +list any accelerated variants available for that style. The following sections explain: what hardware and software the accelerated styles require -how to build LAMMPS with the accelerated packages in place +how to build LAMMPS with the accelerated package in place what changes (if any) are needed in your input scripts guidelines for best performance speed-ups you can expect :ul The final section compares and contrasts the GPU and USER-CUDA -packages, since they are both designed to use NVIDIA GPU hardware. +packages, since they are both designed to use NVIDIA hardware. :line @@ -208,8 +215,8 @@ dependencies: make yes-opt make machine :pre -If your input script uses one of the OPT pair styles, -you can run it as follows: +If your input script uses one of the OPT pair styles, you can run it +as follows: lmp_machine -sf opt -in in.script mpirun -np 4 lmp_machine -sf opt -in in.script :pre @@ -222,12 +229,13 @@ to 20% savings. 5.5 USER-OMP package :h4,link(acc_5) -The USER-OMP package was developed by Axel Kohlmeyer at Temple University. -It provides multi-threaded versions of most pair styles, all dihedral -styles and a few fixes in LAMMPS. The package currently uses the OpenMP -interface which requires using a specific compiler flag in the makefile -to enable multiple threads; without this flag the corresponding pair -styles will still be compiled and work, but do not support multi-threading. +The USER-OMP package was developed by Axel Kohlmeyer at Temple +University. It provides multi-threaded versions of most pair styles, +all dihedral styles, and a few fixes in LAMMPS. The package currently +uses the OpenMP interface which requires using a specific compiler +flag in the makefile to enable multiple threads; without this flag the +corresponding pair styles will still be compiled and work, but do not +support multi-threading. [Building LAMMPS with the USER-OMP package:] @@ -260,18 +268,19 @@ env OMP_NUM_THREADS=2 mpirun -np 2 lmp_machine -sf omp -in in.script mpirun -x OMP_NUM_THREADS=2 -np 2 lmp_machine -sf omp -in in.script :pre The value of the environment variable OMP_NUM_THREADS determines how -many threads per MPI task are launched. All three examples above use -a total of 4 CPU cores. For different MPI implementations the method -to pass the OMP_NUM_THREADS environment variable to all processes is -different. Two different variants, one for MPICH and OpenMPI, respectively -are shown above. Please check the documentation of your MPI installation -for additional details. Alternatively, the value provided by OMP_NUM_THREADS -can be overridded with the "package omp"_package.html command. -Depending on which styles are accelerated in your input, you should -see a reduction in the "Pair time" and/or "Bond time" and "Loop time" -printed out at the end of the run. The optimal ratio of MPI to OpenMP -can vary a lot and should always be confirmed through some benchmark -runs for the current system and on the current machine. +many threads per MPI task are launched. All three examples above use a +total of 4 CPU cores. For different MPI implementations the method to +pass the OMP_NUM_THREADS environment variable to all processes is +different. Two different variants, one for MPICH and OpenMPI, +respectively are shown above. Please check the documentation of your +MPI installation for additional details. Alternatively, the value +provided by OMP_NUM_THREADS can be overridded with the "package +omp"_package.html command. 
Depending on which styles are accelerated +in your input, you should see a reduction in the "Pair time" and/or +"Bond time" and "Loop time" printed out at the end of the run. The +optimal ratio of MPI to OpenMP can vary a lot and should always be +confirmed through some benchmark runs for the current system and on +the current machine. [Restrictions:] @@ -292,53 +301,55 @@ On the other hand, in many cases you still want to use the {omp} version all contain optimizations similar to those in the OPT package, which can result in serial speedup. -Using multi-threading is most effective under the following circumstances: +Using multi-threading is most effective under the following +circumstances: -Individual compute nodes have a significant number of CPU cores -but the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx -(Clovertown) and 54xx (Harpertown) quad core processors. Running -one MPI task per CPU core will result in significant performance -degradation, so that running with 4 or even only 2 MPI tasks per -nodes is faster. Running in hybrid MPI+OpenMP mode will reduce the -inter-node communication bandwidth contention in the same way, -but offers and additional speedup from utilizing the otherwise -idle CPU cores. :ulb,l +Individual compute nodes have a significant number of CPU cores but +the CPU itself has limited memory bandwidth, e.g. Intel Xeon 53xx +(Clovertown) and 54xx (Harpertown) quad core processors. Running one +MPI task per CPU core will result in significant performance +degradation, so that running with 4 or even only 2 MPI tasks per nodes +is faster. Running in hybrid MPI+OpenMP mode will reduce the +inter-node communication bandwidth contention in the same way, but +offers and additional speedup from utilizing the otherwise idle CPU +cores. :ulb,l The interconnect used for MPI communication is not able to provide -sufficient bandwidth for a large number of MPI tasks per node. -This applies for example to running over gigabit ethernet or -on Cray XT4 or XT5 series supercomputers. Same as in the aforementioned -case this effect worsens with using an increasing number of nodes. :l +sufficient bandwidth for a large number of MPI tasks per node. This +applies for example to running over gigabit ethernet or on Cray XT4 or +XT5 series supercomputers. Same as in the aforementioned case this +effect worsens with using an increasing number of nodes. :l -The input is a system that has an inhomogeneous particle density -which cannot be mapped well to the domain decomposition scheme -that LAMMPS employs. While this can be to some degree alleviated -through using the "processors"_processors.html keyword, multi-threading -provides a parallelism that parallelizes over the number of particles -not their distribution in space. :l +The input is a system that has an inhomogeneous particle density which +cannot be mapped well to the domain decomposition scheme that LAMMPS +employs. While this can be to some degree alleviated through using the +"processors"_processors.html keyword, multi-threading provides a +parallelism that parallelizes over the number of particles not their +distribution in space. :l Finally, multi-threaded styles can improve performance when running LAMMPS in "capability mode", i.e. near the point where the MPI -parallelism scales out. This can happen in particular when using -as kspace style for long-range electrostatics. 
Here the scaling -of the kspace style is the performance limiting factor and using -multi-threaded styles allows to operate the kspace style at the -limit of scaling and then increase performance parallelizing -the real space calculations with hybrid MPI+OpenMP. Sometimes -additional speedup can be achived by increasing the real-space -coulomb cutoff and thus reducing the work in the kspace part. :l,ule +parallelism scales out. This can happen in particular when using as +kspace style for long-range electrostatics. Here the scaling of the +kspace style is the performance limiting factor and using +multi-threaded styles allows to operate the kspace style at the limit +of scaling and then increase performance parallelizing the real space +calculations with hybrid MPI+OpenMP. Sometimes additional speedup can +be achived by increasing the real-space coulomb cutoff and thus +reducing the work in the kspace part. :l,ule -The best parallel efficiency from {omp} styles is typically -achieved when there is at least one MPI task per physical -processor, i.e. socket or die. +The best parallel efficiency from {omp} styles is typically achieved +when there is at least one MPI task per physical processor, +i.e. socket or die. Using threads on hyper-threading enabled cores is usually counterproductive, as the cost in additional memory bandwidth -requirements is not offset by the gain in CPU utilization -through hyper-threading. +requirements is not offset by the gain in CPU utilization through +hyper-threading. A description of the multi-threading strategy and some performance -examples are "presented here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1 +examples are "presented +here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1 :line @@ -365,36 +376,23 @@ between processors, runs on the CPU. :l Asynchronous force computations can be performed simultaneously on the CPU(s) and GPU. :l +It allows for GPU computations to be performed in single or double +precision, or in mixed-mode precision. where pairwise forces are +cmoputed in single precision, but accumulated into double-precision +force vectors. :l + LAMMPS-specific code is in the GPU package. It makes calls to a generic GPU library in the lib/gpu directory. This library provides NVIDIA support as well as more general OpenCL support, so that the same functionality can eventually be supported on a variety of GPU hardware. :l,ule - - -NOTE: - discuss 3 precisions - if change, also have to re-link with LAMMPS - always use newton off - expt with differing numbers of CPUs vs GPU - can't tell what is fastest - give command line switches in examples - - -I am not very clear to the meaning of "Max Mem / Proc" -in the "GPU Time Info (average)". -Is it the maximal of GPU memory used by one CPU core? - -It is the maximum memory used at one time on the GPU for data storage by -a single MPI process. 
- Mike - - [Hardware and software requirements:] -To use this package, you currently need to have specific NVIDIA -hardware and install specific NVIDIA CUDA software on your system: +To use this package, you currently need to have an NVIDIA GPU and +install the NVIDIA Cuda software on your system: -Check if you have an NVIDIA card: cat /proc/driver/nvidia/cards/0 +Check if you have an NVIDIA GPU: cat /proc/driver/nvidia/cards/0 Go to http://www.nvidia.com/object/cuda_get.html Install a driver and toolkit appropriate for your system (SDK is not necessary) Follow the instructions in lammps/lib/gpu/README to build the library (see below) @@ -406,8 +404,21 @@ As with other packages that include a separately compiled library, you need to first build the GPU library, before building LAMMPS itself. General instructions for doing this are in "this section"_Section_start.html#start_3 of the manual. For this package, -do the following, using a Makefile in lib/gpu appropriate for your -system: +use a Makefile in lib/gpu appropriate for your system. + +Before building the library, you can set the precision it will use by +editing the CUDA_PREC setting in the Makefile you are using, as +follows: + +CUDA_PREC = -D_SINGLE_SINGLE # Single precision for all calculations +CUDA_PREC = -D_DOUBLE_DOUBLE # Double precision for all calculations +CUDA_PREC = -D_SINGLE_DOUBLE # Accumulation of forces, etc, in double :pre + +The last setting is the mixed mode referred to above. Note that your +GPU must support double precision to use either the 2nd or 3rd of +these settings. + +To build the library, then type: cd lammps/lib/gpu make -f Makefile.linux @@ -427,41 +438,60 @@ set appropriately to include the paths and settings for the CUDA system software on your machine. See src/MAKE/Makefile.g++ for an example. -[GPU configuration] +Also note that if you change the GPU library precision, you need to +re-build the entire library. You should do a "clean" first, +e.g. "make -f Makefile.linux clean". Then you must also re-build +LAMMPS if the library precision has changed, so that it re-links with +the new library. -When using GPUs, you are restricted to one physical GPU per LAMMPS -process, which is an MPI process running on a single core or -processor. Multiple MPI processes (CPU cores) can share a single GPU, -and in many cases it will be more efficient to run this way. +[Running an input script:] -[Input script requirements:] +The examples/gpu and bench/GPU directories have scripts that can be +run with the GPU package, as well as detailed instructions on how to +run them. -Additional input script requirements to run pair or PPPM styles with a +The total number of MPI tasks used by LAMMPS (one or multiple per +compute node) is set in the usual manner via the mpirun or mpiexec +commands, and is independent of the GPU package. + +When using the GPU package, you cannot assign more than one physical +GPU to an MPI task. However multiple MPI tasks can share the same +GPU, and in many cases it will be more efficient to run this way. + +Input script requirements to run using pair or PPPM styles with a {gpu} suffix are as follows: -To invoke specific styles from the GPU package, you can either append -"gpu" to the style name (e.g. pair_style lj/cut/gpu), or use the -"-suffix command-line switch"_Section_start.html#start_7, or use the -"suffix"_suffix.html command. :ulb,l +To invoke specific styles from the GPU package, either append "gpu" to +the style name (e.g. 
pair_style lj/cut/gpu), or use the "-suffix +command-line switch"_Section_start.html#start_7, or use the +"suffix"_suffix.html command in the input script. :ulb,l -The "newton pair"_newton.html setting must be {off}. :l +The "newton pair"_newton.html setting in the input script must be +{off}. :l -The "package gpu"_package.html command must be used near the beginning -of your script to control the GPU selection and initialization -settings. It also has an option to enable asynchronous splitting of -force computations between the CPUs and GPUs. :l,ule +Unless the "-suffix gpu command-line +switch"_Section_start.html#start_7 is used, the "package +gpu"_package.html command must be used near the beginning of the +script to control the GPU selection and initialization settings. It +also has an option to enable asynchronous splitting of force +computations between the CPUs and GPUs. :l,ule -As an example, if you have two GPUs per node and 8 CPU cores per node, -and would like to run on 4 nodes (32 cores) with dynamic balancing of -force calculation across CPU and GPU cores, you could specify +The default for the "package gpu"_package.html command is to have all +the MPI tasks on the compute node use a single GPU. If you have +multiple GPUs per node, then be sure to create one or more MPI tasks +per GPU, and use the first/last settings in the "package +gpu"_package.html command to include all the GPU IDs on the node. +E.g. first = 0, last = 1, for 2 GPUs. For example, on an 8-core 2-GPU +compute node, if you assign 8 MPI tasks to the node, the following +command in the input script -package gpu force/neigh 0 1 -1 :pre +package gpu force/neigh 0 1 -1 -In this case, all CPU cores and GPU devices on the nodes would be -utilized. Each GPU device would be shared by 4 CPU cores. The CPU -cores would perform force calculations for some fraction of the -particles at the same time the GPUs performed force calculation for -the other particles. +would speciy each GPU is shared by 4 MPI tasks. The final -1 will +dynamically balance force calculations across the CPU cores and GPUs. +I.e. each CPU core will perform force calculations for some small +fraction of the particles, at the same time the GPUs perform force +calcaultions for the majority of the particles. [Timing output:] @@ -485,19 +515,30 @@ screen output (not in the log file) at the end of each run. These timings represent total time spent on the GPU for each routine, regardless of asynchronous CPU calculations. +The output section "GPU Time Info (average)" reports "Max Mem / Proc". +This is the maximum memory used at one time on the GPU for data +storage by a single MPI process. + [Performance tips:] -Generally speaking, for best performance, you should use multiple CPUs -per GPU, as provided my most multi-core CPU/GPU configurations. +You should experiment with how many MPI tasks per GPU to use to see +what gives the best performance for your problem. This is a function +of your problem size and what pair style you are using. Likewise, you +should also experiment with the precision setting for the GPU library +to see if single or mixed precision will give accurate results, since +they will typically be faster. -Because of the large number of cores within each GPU device, it may be -more efficient to run on fewer processes per GPU when the number of -particles per MPI process is small (100's of particles); this can be -necessary to keep the GPU cores busy. 
+Using multiple MPI tasks per GPU will often give the best performance, +as allowed my most multi-core CPU/GPU configurations. -See the lammps/lib/gpu/README file for instructions on how to build -the GPU library for single, mixed, or double precision. The latter -requires that your GPU card support double precision. +If the number of particles per MPI task is small (e.g. 100s of +particles), it can be more eefficient to run with fewer MPI tasks per +GPU, even if you do not use all the cores on the compute node. + +The "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS +web site gives GPU performance on a desktop machine and the Titan HPC +platform at ORNL for several of the LAMMPS benchmarks, as a function +of problem size and number of compute nodes. :line @@ -633,10 +674,303 @@ a fix or compute that is non-GPU-ized, or until output is performed (thermo or dump snapshot or restart file). The less often this occurs, the faster your simulation will run. +:line + +5.8 KOKKOS package :h4,link(acc_8) + +The KOKKOS package contains versions of pair, fix, and atom styles +that use data structures and methods and macros provided by the Kokkos +library, which is included with LAMMPS in lib/kokkos. + +"Kokkos"_http://trilinos.sandia.gov/packages/kokkos is a C++ library +that provides two key abstractions for an application like LAMMPS. +First, it allows a single implementation of an application kernel +(e.g. a pair style) to run efficiently on different kinds of hardware +(GPU, Intel Phi, many-core chip). + +Second, it provides data abstractions to adjust (at compile time) the +memory layout of basic data structures like 2d and 3d arrays and allow +the transparent utilization of special hardware load and store units. +Such data structures are used in LAMMPS to store atom coordinates or +forces or neighbor lists. The layout is chosen to optimize +performance on different platforms. Again this operation is hidden +from the developer, and does not affect how the single implementation +of the kernel is coded. + +These abstractions are set at build time, when LAMMPS is compiled with +the KOKKOS package installed. This is done by selecting a "host" and +"device" to build for, compatible with the compute nodes in your +machine. Note that if you are running on a desktop machine, you +typically have one compute node. On a cluster or supercomputer there +may be dozens or 1000s of compute nodes. The procedure for building +and running with the Kokkos library is the same, no matter how many +nodes you run on. + +All Kokkos operations occur within the context of an individual MPI +task running on a single node of the machine. The total number of MPI +tasks used by LAMMPS (one or multiple per compute node) is set in the +usual manner via the mpirun or mpiexec commands, and is independent of +Kokkos. + +Kokkos provides support for one or two modes of execution per MPI +task. This means that some computational tasks (pairwise +interactions, neighbor list builds, time integration, etc) are +parallelized in one or the other of the two modes. The first mode is +called the "host" and is one or more threads running on one or more +physical CPUs (within the node). Currently, both multi-core CPUs and +an Intel Phi processor (running in native mode) are supported. The +second mode is called the "device" and is an accelerator chip of some +kind. Currently only an NVIDIA GPU is supported. If your compute +node does not have a GPU, then there is only one mode of execution, +i.e. 
the host and device are the same. + +IMPORTNANT NOTE: Currently, if using GPUs, you should set the number +of MPI tasks per compute node to be equal to the number of GPUs per +compute node. In the future Kokkos will support assigning one GPU to +multiple MPI tasks or using multiple GPUs per MPI task. Currently +Kokkos does not support AMD GPUs due to limits in the available +backend programming models (in particular relative extensive C++ +support is required for the Kernel language). This is expected to +change in the future. + +Here are several examples of how to build LAMMPS and run a simulation +using the KOKKOS package for typical compute node configurations. +Note that the -np setting for the mpirun command in these examples are +for a run on a single node. To scale these examples up to run on a +system with N compute nodes, simply multiply the -np setting by N. + +All the build steps are performed from within the src directory. All +the run steps are performed in the bench directory using the in.lj +input script. It is assumed the LAMMPS executable has been copied to +that directory or whatever directory the runs are being performed in. +Details of the various options are discussed below. + +[Compute node(s) = dual hex-core CPUs and no GPU:] + +make yes-kokkos # install the KOKKOS package +make g++ OMP=yes # build with OpenMP, no CUDA :pre + +mpirun -np 12 lmp_g++ -k off < in.lj # MPI-only mode with no Kokkos +mpirun -np 12 lmp_g++ -sf kk < in.lj # MPI-only mode with Kokkos +mpirun -np 1 lmp_g++ -k on t 12 -sf kk < in.lj # one MPI task, 12 threads +mpirun -np 2 lmp_g++ -k on t 6 -sf kk < in.lj # two MPI tasks, 6 threads/task :pre + +[Compute node(s) = Intel Phi with 61 cores:] + +make yes-kokkos +make g++ OMP=yes MIC=yes # build with OpenMP for Phi :pre + +mpirun -np 12 lmp_g++ -k on t 20 -sf kk < in.lj # 12*20 = 240 total cores +mpirun -np 15 lmp_g++ -k on t 16 -sf kk < in.lj +mpirun -np 30 lmp_g++ -k on t 8 -sf kk < in.lj +mpirun -np 1 lmp_g++ -k on t 240 -sf kk < in.lj :pre + +[Compute node(s) = dual hex-core CPUs and a single GPU:] + +make yes-kokkos +make cuda CUDA=yes # build for GPU, use src/MAKE/Makefile.cuda :pre + +mpirun -np 1 lmp_cuda -k on t 6 -sf kk < in.lj :pre + +[Compute node(s) = dual 8-core CPUs and 2 GPUs:] + +make yes-kokkos +make cuda CUDA=yes :pre + +mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk < in.lj # use both GPUs, one per MPI task :pre + +[Building LAMMPS with the KOKKOS package:] + +A summary of the build process is given here. More details and all +the available make variable options are given in "this +section"_Section_start.html#start_3_4 of the manual. + +From the src directory, type + +make yes-kokkos :pre + +to include the KOKKOS package. Then perform a normal LAMMPS build, +with additional make variable specifications to choose the host and +device you will run the resulting executable on, e.g. + +make g++ OMP=yes +make cuda CUDA=yes :pre + +As illustrated above, the most important variables to set are OMP, +CUDA, and MIC. The default settings are OMP=yes, CUDA=no, MIC=no +Setting OMP to {yes} will use OpenMP for threading on the host, as +well as on the device (if no GPU is present). Setting CUDA to {yes} +will use one or more GPUs as the device. Setting MIC=yes is necessary +when building for an Intel Phi processor. + +Note that to use a GPU, you must use a lo-level Makefile, +e.g. src/MAKE/Makefile.cuda as included in the LAMMPS distro, which +uses the NVIDA "nvcc" compiler. 
You must check that the CCFLAGS -arch +setting is appropriate for your NVIDIA hardware and installed +software. Typical values for -arch are given in "this +section"_Section_start.html#start_3_4 of the manual, as well as other +settings that must be included in the lo-level Makefile, if you create +your own. + +[Input scripts and use of command-line switches -kokkos and -suffix:] + +To use any Kokkos-enabled style provided in the KOKKOS package, you +must use a Kokkos-enabled atom style. LAMMPS will give an error if +you do not do this. + +There are two command-line switches relevant to using Kokkos, -k or +-kokkos, and -sf or -suffix. They are described in detail in "this +section"_Section_start.html#start_7 of the manual. + +Here are common options to use: + +-k off : runs an executable built with the KOKKOS pacakage, as + if Kokkos were not installed. :ulb,l + +-sf kk : enables automatic use of Kokkos versions of atom, pair, +fix, compute styles if they exist. This can also be done with more +precise control by using the "suffix"_suffix.html command or appending +"kk" to styles within the input script, e.g. "pair_style lj/cut/kk". :l + +-k on t Nt : specifies how many threads per MPI task to use within a + compute node. For good performance, the product of MPI tasks * + threads/task should not exceed the number of physical CPU or Intel + Phi cores. :l + +-k on g Ng : specifies how many GPUs per compute node are available. +The default is 1, so this should be specified is you have 2 or more +GPUs per compute node. :ule,l + +[Use of package command options:] + +Using the "package kokkos"_package.html command in an input script +allows choice of options for neighbor lists and communication. See +the "package"_package.html command doc page for details and default +settings. + +Experimenting with different styles of neighbor lists or inter-node +communication can provide a speed-up for specific calculations. + +[Running on a multi-core CPU:] + +Build with OMP=yes (the default) and CUDA=no (the default). + +If N is the number of physical cores/node, then the number of MPI +tasks/node * number of threads/task should not exceed N, and should +typically equal N. Note that the default threads/task is 1, as set by +the "t" keyword of the -k "command-line +switch"_Section_start.html#start_7. If you do not change this, no +additional parallelism (beyond MPI) will be invoked on the host +CPU(s). + +You can compare the performance running in different modes: + +run with 1 MPI task/node and N threads/task +run with N MPI tasks/node and 1 thread/task +run with settings in between these extremes :ul + +Examples of mpirun commands in these modes, for nodes with dual +hex-core CPUs and no GPU, are shown above. + +[Running on GPUs:] + +Build with CUDA=yes, using src/MAKE/Makefile.cuda. Insure the setting +for CUDA_PATH in lib/kokkos/Makefile.lammps is correct for your Cuda +software installation. Insure the -arch setting in +src/MAKE/Makefile.cuda is correct for your GPU hardware/software (see +"this section"_Section_start.html#start_3_4 of the manual for details. + +The -np setting of the mpirun command should set the number of MPI +tasks/node to be equal to the # of physical GPUs on the node. + +Use the "-kokkos command-line switch"_Section_commands.html#start_7 to +specify the number of GPUs per node, and the number of threads per MPI +task. 
As above for multi-core CPUs (and no GPU), if N is the number
+of physical cores/node, then the number of MPI tasks/node * number of
+threads/task should not exceed N. With one GPU (and one MPI task) it
+may be faster to use fewer than all the available cores, by setting
+threads/task to a smaller value. This is because using all the cores
+on a dual-socket node will incur extra cost to copy memory from the
+2nd socket to the GPU.
+
+Examples of mpirun commands that follow these rules, for nodes with
+dual hex-core CPUs and one or two GPUs, are shown above.
+
+[Running on an Intel Phi:]
+
+Kokkos only uses Intel Phi processors in their "native" mode, i.e.
+not hosted by a CPU.
+
+Build with OMP=yes (the default) and MIC=yes. The latter
+insures code is correctly compiled for the Intel Phi. The
+OMP setting means OpenMP will be used for parallelization
+on the Phi, which is currently the best option within
+Kokkos. In the future, other options may be added.
+
+Current-generation Intel Phi chips have either 61 or 57 cores. One
+core should be excluded to run the OS, leaving 60 or 56 cores. Each
+core is hyperthreaded, so there are effectively N = 240 (4*60) or N =
+224 (4*56) cores to run on.
+
+The -np setting of the mpirun command sets the number of MPI
+tasks/node. The "-k on t Nt" command-line switch sets the number of
+threads/task as Nt. The product of these 2 values should be N, i.e.
+240 or 224. Also, the number of threads/task should be a multiple of
+4 so that logical threads from more than one MPI task do not run on
+the same physical core.
+
+Examples of mpirun commands that follow these rules, for Intel Phi
+nodes with 61 cores, are shown above.
+
+[Examples and benchmarks:]
+
+The examples/kokkos and bench/KOKKOS directories have scripts that can
+be run with the KOKKOS package, as well as detailed instructions on
+how to run them.
+
+IMPORTANT NOTE: the bench/KOKKOS directory does not yet exist. It
+will be added later.
+
+[Additional performance issues:]
+
+When using threads (OpenMP or pthreads), it is important for
+performance to bind the threads to physical cores, so they do not
+migrate during a simulation. The same is true for MPI tasks, but the
+default binding rules implemented for various MPI versions do not
+account for thread binding.
+
+Thus if you use more than one thread per MPI task, you should insure
+MPI tasks are bound to CPU sockets. Furthermore, use thread affinity
+environment variables from the OpenMP runtime when using OpenMP and
+compile with hwloc support when using pthreads. With OpenMP 3.1 (gcc
+4.7 or later, Intel 12 or later), setting the environment variable
+OMP_PROC_BIND=true should be sufficient. A typical mpirun command
+should set these flags:
+
+OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
+Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre
+
+When using a GPU, you will achieve the best performance if your input
+script does not use any fix or compute styles which are not yet
+Kokkos-enabled. This allows data to stay on the GPU for multiple
+timesteps, without being copied back to the host CPU. Invoking a
+non-Kokkos fix or compute, or performing I/O for
+"thermo"_thermo_style.html or "dump"_dump.html output will cause data
+to be copied back to the CPU.
+
+You cannot yet assign multiple MPI tasks to the same GPU with the
+KOKKOS package. We plan to support this in the future, similar to the
+GPU package in LAMMPS.
+ +You cannot yet use both the host (multi-threaded) and device (GPU) +together to compute pairwise interactions with the KOKKOS package. We +hope to support this in the future, similar to the GPU package in +LAMMPS. + :line :line -5.8 Comparison of GPU and USER-CUDA packages :h4,link(acc_8) +5.9 Comparison of GPU and USER-CUDA packages :h4,link(acc_9) Both the GPU and USER-CUDA packages accelerate a LAMMPS calculation using NVIDIA hardware, but they do it in different ways. diff --git a/doc/Section_commands.html b/doc/Section_commands.html index 085743c918..c8f264577b 100644 --- a/doc/Section_commands.html +++ b/doc/Section_commands.html @@ -428,10 +428,10 @@ package.
(3) The KIM package was created by Valeriu Smirichinski, Ryan Elliott, and Ellad Tadmor (U Minn).
+(4) The KOKKOS package was created primarily by Christian Trott +(Sandia). It uses the Kokkos library which was developed by Carter +Edwards, Christian, and collaborators at Sandia. +
The "Doc page" column links to either a portion of the Section_howto of the manual, or an input script command implemented as part of the package. diff --git a/doc/Section_packages.txt b/doc/Section_packages.txt index 700c88b951..7ede067fd8 100644 --- a/doc/Section_packages.txt +++ b/doc/Section_packages.txt @@ -45,15 +45,16 @@ CLASS2, class 2 force fields, -, "pair_style lj/class2"_pair_class2.html, -, - COLLOID, colloidal particles, -, "atom_style colloid"_atom_style.html, colloid, - DIPOLE, point dipole particles, -, "pair_style dipole/cut"_pair_dipole.html, dipole, - FLD, Fast Lubrication Dynamics, Kumar & Bybee & Higdon (1), "pair_style lubricateU"_pair_lubricateU.html, -, - -GPU, GPU-enabled potentials, Mike Brown (ORNL), "Section accelerate"_Section_accelerate.html#acc_6, gpu, lib/gpu +GPU, GPU-enabled styles, Mike Brown (ORNL), "Section accelerate"_Section_accelerate.html#acc_6, gpu, lib/gpu GRANULAR, granular systems, -, "Section_howto"_Section_howto.html#howto_6, pour, - KIM, openKIM potentials, Smirichinski & Elliot & Tadmor (3), "pair_style kim"_pair_kim.html, kim, KIM +KOKKOS, Kokkos-enabled styles, Trott & Edwards (4), "Section_accelerate"_Section_accelerate.html#acc_8, kokkos, lib/kokkos KSPACE, long-range Coulombic solvers, -, "kspace_style"_kspace_style.html, peptide, - MANYBODY, many-body potentials, -, "pair_style tersoff"_pair_tersoff.html, shear, - MEAM, modified EAM potential, Greg Wagner (Sandia), "pair_style meam"_pair_meam.html, meam, lib/meam MC, Monte Carlo options, -, "fix gcmc"_fix_gcmc.html, -, - MOLECULE, molecular system force fields, -, "Section_howto"_Section_howto.html#howto_3, peptide, - -OPT, optimized pair potentials, Fischer & Richie & Natoli (2), "Section accelerate"_Section_accelerate.html#acc_4, -, - +OPT, optimized pair styles, Fischer & Richie & Natoli (2), "Section accelerate"_Section_accelerate.html#acc_4, -, - PERI, Peridynamics models, Mike Parks (Sandia), "pair_style peri"_pair_peri.html, peri, - POEMS, coupled rigid body motion, Rudra Mukherjee (JPL), "fix poems"_fix_poems.html, rigid, lib/poems REAX, ReaxFF potential, Aidan Thompson (Sandia), "pair_style reax"_pair_reax.html, reax, lib/reax @@ -78,6 +79,10 @@ Technolgy). (3) The KIM package was created by Valeriu Smirichinski, Ryan Elliott, and Ellad Tadmor (U Minn). +(4) The KOKKOS package was created primarily by Christian Trott +(Sandia). It uses the Kokkos library which was developed by Carter +Edwards, Christian, and collaborators at Sandia. + The "Doc page" column links to either a portion of the "Section_howto"_Section_howto.html of the manual, or an input script command implemented as part of the package. diff --git a/doc/Section_start.html b/doc/Section_start.html index f8c34710b0..12abe89a3f 100644 --- a/doc/Section_start.html +++ b/doc/Section_start.html @@ -555,7 +555,7 @@ on both a basic build and a customized build with pacakges you select.
Code for some of these auxiliary libraries is included in the LAMMPS distribution under the lib directory. Examples are the USER-ATC and -MEAM packages. Some auxiliary libraries are not included with LAMMPS; +MEAM packages. Some auxiliary libraries are NOT included with LAMMPS; to use the associated package you must download and install the auxiliary library yourself. Examples are the KIM and VORONOI and USER-MOLFILE packages. @@ -699,31 +699,33 @@ that library. Typically this is done by typing something like:
make -f Makefile.g++-
If one of the provided Makefiles is not -appropriate for your system you will need to edit or add one. -Note that all the Makefiles have a setting for EXTRAMAKE at -the top that names a Makefile.lammps.* file. +
If one of the provided Makefiles is not appropriate for your system +you will need to edit or add one. Note that all the Makefiles have a +setting for EXTRAMAKE at the top that specifies a Makefile.lammps.* +file.
-If successful, this will produce 2 files in the lib directory: +
If the library build is successful, it will produce 2 files in the lib +directory:
libpackage.a Makefile.lammps-
The Makefile.lammps file is a copy of the EXTRAMAKE file specified -in the Makefile you used. +
The Makefile.lammps file will be a copy of the EXTRAMAKE file setting +specified in the library Makefile.* you used.
-You MUST insure that the settings in Makefile.lammps are appropriate -for your system. If they are not, the LAMMPS build will fail. +
Note that you must insure that the settings in Makefile.lammps are +appropriate for your system. If they are not, the LAMMPS build will +fail.
-As explained in the lib/package/README files, they are used to specify -additional system libraries and their locations so that LAMMPS can -build with the auxiliary library. For example, if the MEAM or REAX -packages are used, the auxiliary libraries consist of F90 code, build -with a F90 complier. To link that library with LAMMPS (a C++ code) -via whatever C++ compiler LAMMPS is built with, typically requires -additional Fortran-to-C libraries be included in the link. Another -example are the BLAS and LAPACK libraries needed to use the USER-ATC -or USER-AWPMD packages. +
As explained in the lib/package/README files, the settings in
+Makefile.lammps are used to specify additional system libraries and
+their locations so that LAMMPS can build with the auxiliary library.
+For example, if the MEAM or REAX packages are used, the auxiliary
+libraries consist of F90 code, built with a Fortran compiler. To link
+that library with LAMMPS (a C++ code) via whatever C++ compiler LAMMPS
+is built with, typically requires additional Fortran-to-C libraries be
+included in the link. Other examples are the BLAS and LAPACK
+libraries needed to use the USER-ATC or USER-AWPMD packages.
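+As a rough illustration only (the variable-name prefix and paths below
+are examples, not taken verbatim from any one package; check the
+lib/package/README for the exact names your package uses), the settings
+in a Makefile.lammps for a Fortran-based library might look like:
+
+meam_SYSINC =
+meam_SYSLIB = -lgfortran
+meam_SYSPATH = -L/usr/lib64
+
+where the SYSLIB and SYSPATH values must match the Fortran runtime and
+library locations on your machine.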
For libraries without provided source code, see the src/package/Makefile.lammps file for information on where to find the @@ -731,10 +733,105 @@ library and how to build it. E.g. the file src/KIM/Makefile.lammps or src/VORONOI/Makefile.lammps or src/UESR-MOLFILE/Makefile.lammps. These files serve the same purpose as the lib/package/Makefile.lammps files described above. The files have settings needed when LAMMPS is -built to link with the corresponding auxiliary library. Again, you -MUST insure that the settings in src/package/Makefile.lammps are -appropriate for your system and where you installed the auxiliary -library. If they are not, the LAMMPS build will fail. +built to link with the corresponding auxiliary library. +
+Again, you must insure that the settings in +src/package/Makefile.lammps are appropriate for your system and where +you installed the auxiliary library. If they are not, the LAMMPS +build will fail. +
+One package, the KOKKOS package, allows its build options to be +specified by setting variables via the "make" command, rather than by +first building an auxiliary library and editing a Makefile.lammps +file, as discussed in the previous sub-section for other packages. +This is for convenience since it is common to want to experiment with +different Kokkos library options. Using variables enables a direct +re-build of LAMMPS and its Kokkos dependencies, so that a benchmark +test with different Kokkos options can be quickly performed. +
+The syntax for setting make variables is as follows. You must +use a GNU-compatible make command for this to work. Try "gmake" +if your system's standard make complains. +
+make yes-kokkos +make g++ VAR1=value VAR2=value ... ++
The first line installs the KOKKOS package, which only needs to be +done once. The second line builds LAMMPS with src/MAKE/Makefile.g++ +and optionally sets one or more variables that affect the build. Each +variable is specified in upper-case; its value follows an equal sign +with no spaces. The second line can be repeated with different +variable settings, though a "clean" must be done before the rebuild. +Type "make clean" to see options for this operation. +
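+For example, to try two different Kokkos configurations back to back,
+one might do something like the following (the "g++" machine name and
+clean target are only illustrative; "make clean" lists the targets
+available on your system):
+
+make g++ OMP=yes
+make clean-g++
+make g++ OMP=no HWLOC=yes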
+These are the variables that can be specified. Each takes a value of +yes or no. The default value is listed, which is set in the +lib/kokkos/Makefile.lammps file. See this +section for a discussion of what is +meant by "host" and "device" in the Kokkos context. +
+OMP sets the parallelization method used for Kokkos code (within +LAMMPS) that runs on the host. OMP=yes means that OpenMP will be +used. OMP=no means that pthreads will be used. +
+CUDA sets the parallelization method used for Kokkos code (within +LAMMPS) that runs on the device. CUDA=yes means an NVIDIA GPU running +CUDA will be used. CUDA=no means that the OMP=yes or OMP=no setting +will be used for the device as well as the host. +
+If CUDA=yes, then the lo-level Makefile in the src/MAKE directory must
+use "nvcc" as its compiler, via its CC setting. For best performance
+its CCFLAGS setting should use -O3 and have an -arch setting that
+matches the compute capability of your NVIDIA hardware and software
+installation, e.g. -arch=sm_20. Generally Fermi Generation GPUs are
+sm_20, while Kepler generation GPUs are sm_30 or sm_35 and Maxwell
+cards are sm_50. A complete list can be found on
+wikipedia. You can
+also use the deviceQuery tool that comes with the CUDA samples. Note
+the minimum required compute capability is 2.0, but this will give
+significantly reduced performance compared to Kepler generation GPUs
+with compute capability 3.x. For the LINK setting, "nvcc" should not
+be used; instead use g++ or another compiler suitable for linking C++
+applications. Often you will want to use your MPI compiler wrapper
+for this setting (i.e. mpicxx). Finally, the lo-level Makefile must
+also have a "Compilation rule" for creating *.o files from *.cu files.
+See src/MAKE/Makefile.cuda for an example of a lo-level Makefile with all
+of these settings.
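+As a minimal sketch (not the complete Makefile, and with an -arch value
+you must adjust to your own hardware), the relevant lo-level Makefile
+lines might look like this:
+
+CC =       nvcc
+CCFLAGS =  -O3 -arch=sm_35
+LINK =     mpicxx
+
+%.o:%.cu
+	$(CC) $(CCFLAGS) -c $<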
+HWLOC binds threads to hardware cores, so they do not migrate during a +simulation. HWLOC=yes should always be used if running with OMP=no +for pthreads. It is not necessary for OMP=yes for OpenMP, because +OpenMP provides alternative methods via environment variables for +binding threads to hardware cores. More info on binding threads to +cores is given in this section. +
+AVX enables Intel advanced vector extensions when compiling for an +Intel-compatible chip. AVX=yes should only be set if your host +hardware supports AVX. If it does not support it, this will cause a +run-time crash. +
+MIC enables compiler switches needed when compiling for an Intel Phi
+processor.
+LIBRT enables use of a more accurate timer mechanism on most Unix +platforms. This library is not available on all platforms. +
+DEBUG is only useful when developing a Kokkos-enabled style within +LAMMPS. DEBUG=yes enables printing of run-time debugging information +that can be useful. It also enables runtime bounds checking on Kokkos +data structures.
-kokkos on/off keyword/value ... ++
Explicitly enable or disable Kokkos support, as provided by the KOKKOS
+package. If LAMMPS is built with this package, as described above in
+Section 2.3, then by default LAMMPS will run in Kokkos
+mode. If this switch is set to "off", then it will not, even if it
+was built with the KOKKOS package, which means you can run standard
+LAMMPS styles or use styles enhanced by other acceleration packages,
+such as the GPU or USER-CUDA or USER-OMP packages, for testing or
+benchmarking purposes. The only reason to set the switch to "on" is
+to check if LAMMPS was built with the KOKKOS package, since an error
+will be generated if it was not.
+Additional optional keyword/value pairs can be specified which +determine how Kokkos will use the underlying hardware on your +platform. These settings apply to each MPI task you launch via the +"mpirun" or "mpiexec" command. You may choose to run one or more MPI +tasks per physical node. Note that if you are running on a desktop +machine, you typically have one physical node. On a cluster or +supercomputer there may be dozens or 1000s of physical nodes. +
+Either the full word or an abbreviation can be used for the keywords. +Note that the keywords do not use a leading minus sign. I.e. the +keyword is "t", not "-t". Also note that each of the keywords has a +default setting. More explanation as to when to use these options and +what settings to use on different platforms is given in this +section. +
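+As a concrete (purely illustrative) example, the following launches one
+MPI task per GPU on a node with 2 GPUs and 16 cores, using 8 threads
+per task:
+
+mpirun -np 2 ./lmp_cuda -k on g 2 t 8 -sf kk -in in.lj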
+device Nd ++
This option is only relevant if you built LAMMPS with CUDA=yes, you +have more than one GPU per node, and if you are running with only one +MPI task per node. The Nd setting is the ID of the GPU on the node to +run on. By default Nd = 0. If you have multiple GPUs per node, they +have consecutive IDs numbered as 0,1,2,etc. This setting allows you +to launch multiple independent jobs on the node, each with a single +MPI task per node, and assign each job to run on a different GPU. +
+gpus Ng Ns ++
This option is only relevant if you built LAMMPS with CUDA=yes, you +have more than one GPU per node, and you are running with multiple MPI +tasks per node (up to one per GPU). The Ng setting is how many GPUs +you will use. The Ns setting is optional. If set, it is the ID of a +GPU to skip when assigning MPI tasks to GPUs. This may be useful if +your desktop system reserves one GPU to drive the screen and the rest +are intended for computational work like running LAMMPS. By default +Ng = 1 and Ns is not set. +
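+For example, on a hypothetical desktop with 3 GPUs where GPU 0 drives
+the display, the following uses the other two GPUs and skips GPU 0:
+
+mpirun -np 2 ./lmp_cuda -k on gpus 2 0 -sf kk -in in.lj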
+Depending on which flavor of MPI you are running, LAMMPS will look for +one of these 3 environment variables +
+SLURM_LOCALID (various MPI variants compiled with SLURM support) +MV2_COMM_WORLD_LOCAL_RANK (Mvapich) +OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI) ++
which are initialized by the "srun", "mpirun" or "mpiexec" commands. +The environment variable setting for each MPI rank is used to assign a +unique GPU ID to the MPI task. +
+threads Nt ++
This option assigns Nt number of threads to each MPI task for +performing work when Kokkos is executing in OpenMP or pthreads mode. +The default is Nt = 1, which essentially runs in MPI-only mode. If +there are Np MPI tasks per physical node, you generally want Np*Nt = +the number of physical cores per node, to use your available hardware +optimally. This also sets the number of threads used by the host when +LAMMPS is compiled with CUDA=yes. +
+numa Nm ++
This option is only relevant when using pthreads with hwloc support.
+In this case Nm defines the number of NUMA regions (typically sockets)
+on a node which will be utilized by a single MPI rank. By default Nm
+= 1. If this option is used the total number of worker-threads per
+MPI rank is threads*numa. Currently it is almost always better to
+assign at least one MPI rank per NUMA region, and leave numa set to
+its default value of 1. This is because letting a single process span
+multiple NUMA regions induces a significant amount of cross NUMA data
+traffic, which is slow.
-log file
Specify a log file for LAMMPS to write status information to. In @@ -1277,23 +1462,24 @@ multi-partition mode, if the specified file is "none", then no screen output is performed. Option -pscreen will override the name of the partition screen files file.N.
--suffix style
+
+-suffix style args
+
Use variants of various styles if they exist. The specified style can
-be opt, omp, gpu, or cuda. These refer to optional packages that
-LAMMPS can be built with, as described above in Section
-2.3. The "opt" style corrsponds to the OPT package, the
-"omp" style to the USER-OMP package, the "gpu" style to the GPU
-package, and the "cuda" style to the USER-CUDA package.
+be cuda, gpu, kk, omp, or opt. These refer to optional
+packages that LAMMPS can be built with, as described above in Section
+2.3. The "cuda" style corresponds to the USER-CUDA package,
+the "gpu" style to the GPU package, the "kk" style to the KOKKOS
+package, the "opt" style to the OPT package, and the "omp" style to
+the USER-OMP package.
As an example, all of the packages provide a pair_style
-lj/cut variant, with style names lj/cut/opt, lj/cut/omp,
-lj/cut/gpu, or lj/cut/cuda. A variant styles can be specified
-explicitly in your input script, e.g. pair_style lj/cut/gpu. If the
--suffix switch is used, you do not need to modify your input script.
-The specified suffix (opt,omp,gpu,cuda) is automatically appended
-whenever your input script command creates a new
-atom, pair, fix,
+lj/cut variant, with style names lj/cut/cuda,
+lj/cut/gpu, lj/cut/kk, lj/cut/omp, or lj/cut/opt. A variant style
+can be specified explicitly in your input script, e.g. pair_style
+lj/cut/gpu. If the -suffix switch is used, you do not need to modify
+your input script. The specified suffix (cuda,gpu,kk,omp,opt) is
+automatically appended whenever your input script command creates a
+new atom, pair, fix,
compute, or run style. If the variant
version does not exist, the standard version is created.
@@ -1303,13 +1489,20 @@ default GPU settings, as if the command "package gpu force/neigh 0 0 changed by using the package gpu command in your script if desired. +For the KOKKOS package, using this command-line switch also invokes +the default KOKKOS settings, as if the command "package kokkos neigh +full comm/exchange host comm/forward host " were used at the top of +your input script. These settings can be changed by using the +package kokkos command in your script if desired. +
For the OMP package, using this command-line switch also invokes the default OMP settings, as if the command "package omp *" were used at the top of your input script. These settings can be changed by using the package omp command in your script if desired.
-The suffix command can also set a suffix and it can also -turn off/on any suffix setting made via the command line. +
The suffix command can also be used to set a suffix, and it
+can also turn off or back on any suffix setting made via the command
+line.
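+For example, an input script run with "-sf kk" could temporarily force
+the plain version of one style and then restore the suffix (the pair
+style shown is only illustrative):
+
+suffix off
+pair_style lj/cut 2.5     # plain CPU pair style
+suffix on                 # later styles get the kk suffix again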
-var name value1 value2 ...diff --git a/doc/Section_start.txt b/doc/Section_start.txt index c126503b1f..2e8d786809 100644 --- a/doc/Section_start.txt +++ b/doc/Section_start.txt @@ -549,7 +549,7 @@ This section has the following sub-sections: "Package basics"_#start_3_1 "Including/excluding packages"_#start_3_2 "Packages that require extra libraries"_#start_3_3 -"Additional Makefile settings for extra libraries"_#start_3_4 :ul +"Packages that use make variable settings"_#start_3_4 :ul :line @@ -682,7 +682,7 @@ for a list of packages that have auxiliary libraries. Code for some of these auxiliary libraries is included in the LAMMPS distribution under the lib directory. Examples are the USER-ATC and -MEAM packages. Some auxiliary libraries are not included with LAMMPS; +MEAM packages. Some auxiliary libraries are NOT included with LAMMPS; to use the associated package you must download and install the auxiliary library yourself. Examples are the KIM and VORONOI and USER-MOLFILE packages. @@ -693,31 +693,33 @@ that library. Typically this is done by typing something like: make -f Makefile.g++ :pre -If one of the provided Makefiles is not -appropriate for your system you will need to edit or add one. -Note that all the Makefiles have a setting for EXTRAMAKE at -the top that names a Makefile.lammps.* file. +If one of the provided Makefiles is not appropriate for your system +you will need to edit or add one. Note that all the Makefiles have a +setting for EXTRAMAKE at the top that specifies a Makefile.lammps.* +file. -If successful, this will produce 2 files in the lib directory: +If the library build is successful, it will produce 2 files in the lib +directory: libpackage.a Makefile.lammps :pre -The Makefile.lammps file is a copy of the EXTRAMAKE file specified -in the Makefile you used. +The Makefile.lammps file will be a copy of the EXTRAMAKE file setting +specified in the library Makefile.* you used. -You MUST insure that the settings in Makefile.lammps are appropriate -for your system. If they are not, the LAMMPS build will fail. +Note that you must insure that the settings in Makefile.lammps are +appropriate for your system. If they are not, the LAMMPS build will +fail. -As explained in the lib/package/README files, they are used to specify -additional system libraries and their locations so that LAMMPS can -build with the auxiliary library. For example, if the MEAM or REAX -packages are used, the auxiliary libraries consist of F90 code, build -with a F90 complier. To link that library with LAMMPS (a C++ code) -via whatever C++ compiler LAMMPS is built with, typically requires -additional Fortran-to-C libraries be included in the link. Another -example are the BLAS and LAPACK libraries needed to use the USER-ATC -or USER-AWPMD packages. +As explained in the lib/package/README files, the settings in +Makefile.lammps are used to specify additional system libraries and +their locations so that LAMMPS can build with the auxiliary library. +For example, if the MEAM or REAX packages are used, the auxiliary +libraries consist of F90 code, built with a Fortran complier. To link +that library with LAMMPS (a C++ code) via whatever C++ compiler LAMMPS +is built with, typically requires additional Fortran-to-C libraries be +included in the link. Another example are the BLAS and LAPACK +libraries needed to use the USER-ATC or USER-AWPMD packages. 
For libraries without provided source code, see the src/package/Makefile.lammps file for information on where to find the @@ -725,10 +727,105 @@ library and how to build it. E.g. the file src/KIM/Makefile.lammps or src/VORONOI/Makefile.lammps or src/UESR-MOLFILE/Makefile.lammps. These files serve the same purpose as the lib/package/Makefile.lammps files described above. The files have settings needed when LAMMPS is -built to link with the corresponding auxiliary library. Again, you -MUST insure that the settings in src/package/Makefile.lammps are -appropriate for your system and where you installed the auxiliary -library. If they are not, the LAMMPS build will fail. +built to link with the corresponding auxiliary library. + +Again, you must insure that the settings in +src/package/Makefile.lammps are appropriate for your system and where +you installed the auxiliary library. If they are not, the LAMMPS +build will fail. + +:line + +[{Packages that use make variable settings}] :link(start_3_4) + +One package, the KOKKOS package, allows its build options to be +specified by setting variables via the "make" command, rather than by +first building an auxiliary library and editing a Makefile.lammps +file, as discussed in the previous sub-section for other packages. +This is for convenience since it is common to want to experiment with +different Kokkos library options. Using variables enables a direct +re-build of LAMMPS and its Kokkos dependencies, so that a benchmark +test with different Kokkos options can be quickly performed. + +The syntax for setting make variables is as follows. You must +use a GNU-compatible make command for this to work. Try "gmake" +if your system's standard make complains. + +make yes-kokkos +make g++ VAR1=value VAR2=value ... :pre + +The first line installs the KOKKOS package, which only needs to be +done once. The second line builds LAMMPS with src/MAKE/Makefile.g++ +and optionally sets one or more variables that affect the build. Each +variable is specified in upper-case; its value follows an equal sign +with no spaces. The second line can be repeated with different +variable settings, though a "clean" must be done before the rebuild. +Type "make clean" to see options for this operation. + +These are the variables that can be specified. Each takes a value of +{yes} or {no}. The default value is listed, which is set in the +lib/kokkos/Makefile.lammps file. See "this +section"_Section_accelerate.html#acc_8 for a discussion of what is +meant by "host" and "device" in the Kokkos context. + +OMP, default = {yes} +CUDA, default = {no} +HWLOC, default = {no} +AVX, default = {no} +MIC, default = {no} +LIBRT, default = {no} +DEBUG, default = {no} :ul + +OMP sets the parallelization method used for Kokkos code (within +LAMMPS) that runs on the host. OMP=yes means that OpenMP will be +used. OMP=no means that pthreads will be used. + +CUDA sets the parallelization method used for Kokkos code (within +LAMMPS) that runs on the device. CUDA=yes means an NVIDIA GPU running +CUDA will be used. CUDA=no means that the OMP=yes or OMP=no setting +will be used for the device as well as the host. + +If CUDA=yes, then the lo-level Makefile in the src/MAKE directory must +use "nvcc" as its compiler, via its CC setting. For best performance +its CCFLAGS setting should use -O3 and have an -arch setting that +matches the compute capability of your NVIDIA hardware and software +installation, e.g. -arch=sm_20. 
Generally Fermi Generation GPUs are
+sm_20, while Kepler generation GPUs are sm_30 or sm_35 and Maxwell
+cards are sm_50. A complete list can be found on
+"wikipedia"_http://en.wikipedia.org/wiki/CUDA#Supported_GPUs. You can
+also use the deviceQuery tool that comes with the CUDA samples. Note
+the minimum required compute capability is 2.0, but this will give
+significantly reduced performance compared to Kepler generation GPUs
+with compute capability 3.x. For the LINK setting, "nvcc" should not
+be used; instead use g++ or another compiler suitable for linking C++
+applications. Often you will want to use your MPI compiler wrapper
+for this setting (i.e. mpicxx). Finally, the lo-level Makefile must
+also have a "Compilation rule" for creating *.o files from *.cu files.
+See src/MAKE/Makefile.cuda for an example of a lo-level Makefile with all
+of these settings.
+
+HWLOC binds threads to hardware cores, so they do not migrate during a
+simulation. HWLOC=yes should always be used if running with OMP=no
+for pthreads. It is not necessary for OMP=yes for OpenMP, because
+OpenMP provides alternative methods via environment variables for
+binding threads to hardware cores. More info on binding threads to
+cores is given in "this section"_Section_accelerate.html#acc_8.
+
+AVX enables Intel advanced vector extensions when compiling for an
+Intel-compatible chip. AVX=yes should only be set if your host
+hardware supports AVX. If it does not support it, this will cause a
+run-time crash.
+
+MIC enables compiler switches needed when compiling for an Intel Phi
+processor.
+
+LIBRT enables use of a more accurate timer mechanism on most Unix
+platforms. This library is not available on all platforms.
+
+DEBUG is only useful when developing a Kokkos-enabled style within
+LAMMPS. DEBUG=yes enables printing of run-time debugging information
+that can be useful. It also enables runtime bounds checking on Kokkos
+data structures.
:line
@@ -1038,6 +1135,7 @@ letter abbreviation can be used:
-e or -echo
-i or -in
-h or -help
+-k or -kokkos
-l or -log
-nc or -nocite
-p or -partition
@@ -1098,6 +1196,93 @@ want to use was included via the appropriate package at compile time.
LAMMPS will print the info and immediately exit if this switch is
used.
+-kokkos on/off keyword/value ... :pre
+
+Explicitly enable or disable Kokkos support, as provided by the KOKKOS
+package. If LAMMPS is built with this package, as described above in
+"Section 2.3"_#start_3, then by default LAMMPS will run in Kokkos
+mode. If this switch is set to "off", then it will not, even if it
+was built with the KOKKOS package, which means you can run standard
+LAMMPS styles or use styles enhanced by other acceleration packages,
+such as the GPU or USER-CUDA or USER-OMP packages, for testing or
+benchmarking purposes. The only reason to set the switch to "on" is
+to check if LAMMPS was built with the KOKKOS package, since an error
+will be generated if it was not.
+
+Additional optional keyword/value pairs can be specified which
+determine how Kokkos will use the underlying hardware on your
+platform. These settings apply to each MPI task you launch via the
+"mpirun" or "mpiexec" command. You may choose to run one or more MPI
+tasks per physical node. Note that if you are running on a desktop
+machine, you typically have one physical node. On a cluster or
+supercomputer there may be dozens or 1000s of physical nodes.
+
+Either the full word or an abbreviation can be used for the keywords.
+Note that the keywords do not use a leading minus sign. I.e.
the
+keyword is "t", not "-t". Also note that each of the keywords has a
+default setting. More explanation as to when to use these options and
+what settings to use on different platforms is given in "this
+section"_Section_accelerate.html#acc_8.
+
+d or device
+g or gpus
+t or threads
+n or numa :ul
+
+device Nd :pre
+
+This option is only relevant if you built LAMMPS with CUDA=yes, you
+have more than one GPU per node, and if you are running with only one
+MPI task per node. The Nd setting is the ID of the GPU on the node to
+run on. By default Nd = 0. If you have multiple GPUs per node, they
+have consecutive IDs numbered as 0,1,2,etc. This setting allows you
+to launch multiple independent jobs on the node, each with a single
+MPI task per node, and assign each job to run on a different GPU.
+
+gpus Ng Ns :pre
+
+This option is only relevant if you built LAMMPS with CUDA=yes, you
+have more than one GPU per node, and you are running with multiple MPI
+tasks per node (up to one per GPU). The Ng setting is how many GPUs
+you will use. The Ns setting is optional. If set, it is the ID of a
+GPU to skip when assigning MPI tasks to GPUs. This may be useful if
+your desktop system reserves one GPU to drive the screen and the rest
+are intended for computational work like running LAMMPS. By default
+Ng = 1 and Ns is not set.
+
+Depending on which flavor of MPI you are running, LAMMPS will look for
+one of these 3 environment variables
+
+SLURM_LOCALID (various MPI variants compiled with SLURM support)
+MV2_COMM_WORLD_LOCAL_RANK (Mvapich)
+OMPI_COMM_WORLD_LOCAL_RANK (OpenMPI) :pre
+
+which are initialized by the "srun", "mpirun" or "mpiexec" commands.
+The environment variable setting for each MPI rank is used to assign a
+unique GPU ID to the MPI task.
+
+threads Nt :pre
+
+This option assigns Nt number of threads to each MPI task for
+performing work when Kokkos is executing in OpenMP or pthreads mode.
+The default is Nt = 1, which essentially runs in MPI-only mode. If
+there are Np MPI tasks per physical node, you generally want Np*Nt =
+the number of physical cores per node, to use your available hardware
+optimally. This also sets the number of threads used by the host when
+LAMMPS is compiled with CUDA=yes.
+
+numa Nm :pre
+
+This option is only relevant when using pthreads with hwloc support.
+In this case Nm defines the number of NUMA regions (typically sockets)
+on a node which will be utilized by a single MPI rank. By default Nm
+= 1. If this option is used the total number of worker-threads per
+MPI rank is threads*numa. Currently it is almost always better to
+assign at least one MPI rank per NUMA region, and leave numa set to
+its default value of 1. This is because letting a single process span
+multiple NUMA regions induces a significant amount of cross NUMA data
+traffic, which is slow.
+
-log file :pre
Specify a log file for LAMMPS to write status information to. In
@@ -1271,23 +1456,24 @@ multi-partition mode, if the specified file is "none", then no screen
output is performed. Option -pscreen will override the name of the
partition screen files file.N.
--suffix style :pre
+-suffix style args :pre
+
Use variants of various styles if they exist. The specified style can
-be {opt}, {omp}, {gpu}, or {cuda}. These refer to optional packages that
-LAMMPS can be built with, as described above in "Section
-2.3"_#start_3.
The "opt" style corrsponds to the OPT package, the
-"omp" style to the USER-OMP package, the "gpu" style to the GPU
-package, and the "cuda" style to the USER-CUDA package.
+be {cuda}, {gpu}, {kk}, {omp}, or {opt}. These refer to optional
+packages that LAMMPS can be built with, as described above in "Section
+2.3"_#start_3. The "cuda" style corresponds to the USER-CUDA package,
+the "gpu" style to the GPU package, the "kk" style to the KOKKOS
+package, the "opt" style to the OPT package, and the "omp" style to
+the USER-OMP package.
As an example, all of the packages provide a "pair_style
-lj/cut"_pair_lj.html variant, with style names lj/cut/opt, lj/cut/omp,
-lj/cut/gpu, or lj/cut/cuda. A variant styles can be specified
-explicitly in your input script, e.g. pair_style lj/cut/gpu. If the
--suffix switch is used, you do not need to modify your input script.
-The specified suffix (opt,omp,gpu,cuda) is automatically appended
-whenever your input script command creates a new
-"atom"_atom_style.html, "pair"_pair_style.html, "fix"_fix.html,
+lj/cut"_pair_lj.html variant, with style names lj/cut/cuda,
+lj/cut/gpu, lj/cut/kk, lj/cut/omp, or lj/cut/opt. A variant style
+can be specified explicitly in your input script, e.g. pair_style
+lj/cut/gpu. If the -suffix switch is used, you do not need to modify
+your input script. The specified suffix (cuda,gpu,kk,omp,opt) is
+automatically appended whenever your input script command creates a
+new "atom"_atom_style.html, "pair"_pair_style.html, "fix"_fix.html,
"compute"_compute.html, or "run"_run_style.html style. If the variant
version does not exist, the standard version is created.
@@ -1297,13 +1483,20 @@ default GPU settings, as if the command "package gpu force/neigh 0 0
changed by using the "package gpu"_package.html command in your script
if desired.
+For the KOKKOS package, using this command-line switch also invokes
+the default KOKKOS settings, as if the command "package kokkos neigh
+full comm/exchange host comm/forward host " were used at the top of
+your input script. These settings can be changed by using the
+"package kokkos"_package.html command in your script if desired.
+
For the OMP package, using this command-line switch also invokes the
default OMP settings, as if the command "package omp *" were used at
the top of your input script. These settings can be changed by using
the "package omp"_package.html command in your script if desired.
-The "suffix"_suffix.html command can also set a suffix and it can also
-turn off/on any suffix setting made via the command line.
+The "suffix"_suffix.html command can also be used to set a suffix and it
+can also turn off or back on any suffix setting made via the command
+line.
-var name value1 value2 ... :pre
diff --git a/doc/atom_style.html b/doc/atom_style.html
index d7af8e203f..0b606843af 100644
--- a/doc/atom_style.html
+++ b/doc/atom_style.html
@@ -26,11 +26,16 @@
template-ID = ID of molecule template specified in a separate molecule command
hybrid args = list of one or more sub-styles, each with their args
accelerated styles (with same args): +
+Examples:
atom_style atomic atom_style bond atom_style full +atom_style full/cuda atom_style body nparticle 2 10 atom_style hybrid charge bond atom_style hybrid charge body nparticle 2 5 @@ -200,6 +205,31 @@ per-atom basis.LAMMPS can be extended with new atom styles as well as new body styles; see this section.
+
+ +Styles with a cuda or kk suffix are functionally the same as the +corresponding style without the suffix. They have been optimized to +run faster, depending on your available hardware, as discussed in +Section_accelerate of the manual. The +accelerated styles take the same arguments and should produce the same +results, except for round-off and precision issues. +
+Note that other acceleration packages in LAMMPS, specifically the GPU,
+USER-OMP, and OPT packages, do not make use of accelerated atom styles.
+
+These accelerated styles are part of the USER-CUDA and KOKKOS packages,
+respectively. They are only enabled if LAMMPS was built with those
+packages. See the Making LAMMPS section
+for more info.
+
+You can specify the accelerated styles explicitly in your input script +by including their suffix, or you can use the -suffix command-line +switch when you invoke LAMMPS, or you can +use the suffix command in your input script. +
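+For example (an illustrative choice of style), the accelerated variant
+can be requested either explicitly,
+
+atom_style full/cuda
+
+or by keeping "atom_style full" in the script and running with the
+"-sf cuda" command-line switch.
+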
+See Section_accelerate of the manual for +more instructions on how to use the accelerated styles effectively. +
Restrictions:
This command cannot be used after the simulation box is defined by a diff --git a/doc/atom_style.txt b/doc/atom_style.txt index 1819324fd3..8690d30537 100644 --- a/doc/atom_style.txt +++ b/doc/atom_style.txt @@ -24,11 +24,16 @@ style = {angle} or {atomic} or {body} or {bond} or {charge} or {dipole} or \ template-ID = ID of molecule template specified in a separate "molecule"_molecule.html command {hybrid} args = list of one or more sub-styles, each with their args :pre +accelerated styles (with same args): + +style = {angle/cuda} or {atomic/cuda} or {atomic/kokkos} or {charge/cuda} or {full/cuda} :ul + [Examples:] atom_style atomic atom_style bond atom_style full +atom_style full/cuda atom_style body nparticle 2 10 atom_style hybrid charge bond atom_style hybrid charge body nparticle 2 5 @@ -196,6 +201,31 @@ per-atom basis. LAMMPS can be extended with new atom styles as well as new body styles; see "this section"_Section_modify.html. +:line + +Styles with a {cuda} or {kk} suffix are functionally the same as the +corresponding style without the suffix. They have been optimized to +run faster, depending on your available hardware, as discussed in +"Section_accelerate"_Section_accelerate.html of the manual. The +accelerated styles take the same arguments and should produce the same +results, except for round-off and precision issues. + +Note that other acceleration packages in LAMMPS, specifically the GPU, +USER-OMP, and OPT packages do not use of accelerated atom styles. + +These accelerated styles are part of the USER-CUDA and KOKKOS packages +respectively. They are only enabled if LAMMPS was built with those +packages. See the "Making LAMMPS"_Section_start.html#start_3 section +for more info. + +You can specify the accelerated styles explicitly in your input script +by including their suffix, or you can use the "-suffix command-line +switch"_Section_start.html#start_7 when you invoke LAMMPS, or you can +use the "suffix"_suffix.html command in your input script. + +See "Section_accelerate"_Section_accelerate.html of the manual for +more instructions on how to use the accelerated styles effectively. + [Restrictions:] This command cannot be used after the simulation box is defined by a diff --git a/doc/fix_nve.html b/doc/fix_nve.html index e70474fe02..e027559e82 100644 --- a/doc/fix_nve.html +++ b/doc/fix_nve.html @@ -13,6 +13,8 @@
fix nve/cuda command
+fix nve/kk command +
fix nve/omp command
Syntax: @@ -35,17 +37,18 @@ ensemble.
-Styles with a cuda, gpu, omp, or opt suffix are functionally -the same as the corresponding style without the suffix. They have -been optimized to run faster, depending on your available hardware, as -discussed in Section_accelerate of the -manual. The accelerated styles take the same arguments and should -produce the same results, except for round-off and precision issues. +
Styles with a cuda, gpu, kk, omp, or opt suffix are +functionally the same as the corresponding style without the suffix. +They have been optimized to run faster, depending on your available +hardware, as discussed in Section_accelerate +of the manual. The accelerated styles take the same arguments and +should produce the same results, except for round-off and precision +issues.
-These accelerated styles are part of the USER-CUDA, GPU, USER-OMP and OPT -packages, respectively. They are only enabled if LAMMPS was built with -those packages. See the Making LAMMPS -section for more info. +
These accelerated styles are part of the USER-CUDA, GPU, KOKKOS, +USER-OMP and OPT packages, respectively. They are only enabled if +LAMMPS was built with those packages. See the Making +LAMMPS section for more info.
You can specify the accelerated styles explicitly in your input script by including their suffix, or you can use the -suffix command-line diff --git a/doc/fix_nve.txt b/doc/fix_nve.txt index b43a78c628..46f842d371 100644 --- a/doc/fix_nve.txt +++ b/doc/fix_nve.txt @@ -8,6 +8,7 @@ fix nve command :h3 fix nve/cuda command :h3 +fix nve/kk command :h3 fix nve/omp command :h3 [Syntax:] @@ -30,17 +31,18 @@ ensemble. :line -Styles with a {cuda}, {gpu}, {omp}, or {opt} suffix are functionally -the same as the corresponding style without the suffix. They have -been optimized to run faster, depending on your available hardware, as -discussed in "Section_accelerate"_Section_accelerate.html of the -manual. The accelerated styles take the same arguments and should -produce the same results, except for round-off and precision issues. +Styles with a {cuda}, {gpu}, {kk}, {omp}, or {opt} suffix are +functionally the same as the corresponding style without the suffix. +They have been optimized to run faster, depending on your available +hardware, as discussed in "Section_accelerate"_Section_accelerate.html +of the manual. The accelerated styles take the same arguments and +should produce the same results, except for round-off and precision +issues. -These accelerated styles are part of the USER-CUDA, GPU, USER-OMP and OPT -packages, respectively. They are only enabled if LAMMPS was built with -those packages. See the "Making LAMMPS"_Section_start.html#start_3 -section for more info. +These accelerated styles are part of the USER-CUDA, GPU, KOKKOS, +USER-OMP and OPT packages, respectively. They are only enabled if +LAMMPS was built with those packages. See the "Making +LAMMPS"_Section_start.html#start_3 section for more info. You can specify the accelerated styles explicitly in your input script by including their suffix, or you can use the "-suffix command-line diff --git a/doc/fix_rigid.html b/doc/fix_rigid.html index 88ff880c02..aecf8bcfe8 100644 --- a/doc/fix_rigid.html +++ b/doc/fix_rigid.html @@ -764,10 +764,27 @@ of the run command. These fixes are not invoked during LAMMPS was built with that package. See the Making LAMMPS section for more info.
+Assigning a temperature via the velocity create +command to a system with rigid bodies may not have +the desired outcome for two reasons. First, the velocity command can +be invoked before the rigid-body fix is invoked or initialized and the +number of adjusted degrees of freedom (DOFs) is known. Thus it is not +possible to compute the target temperature correctly. Second, the +assigned velocities may be partially canceled when constraints are +first enforced, leading to a different temperature than desired. A +workaround for this is to perform a run 0 command, which +insures all DOFs are accounted for properly, and then rescale the +temperature to the desired value before performing a simulation. For +example: +
+velocity all create 300.0 12345 +run 0 # temperature may not be 300K +velocity all scale 300.0 # now it should be +Related commands:
delete_bonds, neigh_modify -exclude +exclude, fix shake
Default:
diff --git a/doc/fix_rigid.txt b/doc/fix_rigid.txt index 66a6c7bd67..3d0fc4afcc 100644 --- a/doc/fix_rigid.txt +++ b/doc/fix_rigid.txt @@ -746,10 +746,27 @@ These fixes are all part of the RIGID package. It is only enabled if LAMMPS was built with that package. See the "Making LAMMPS"_Section_start.html#start_3 section for more info. +Assigning a temperature via the "velocity create"_velocity.html +command to a system with "rigid bodies"_fix_rigid.html may not have +the desired outcome for two reasons. First, the velocity command can +be invoked before the rigid-body fix is invoked or initialized and the +number of adjusted degrees of freedom (DOFs) is known. Thus it is not +possible to compute the target temperature correctly. Second, the +assigned velocities may be partially canceled when constraints are +first enforced, leading to a different temperature than desired. A +workaround for this is to perform a "run 0"_run.html command, which +insures all DOFs are accounted for properly, and then rescale the +temperature to the desired value before performing a simulation. For +example: + +velocity all create 300.0 12345 +run 0 # temperature may not be 300K +velocity all scale 300.0 # now it should be :pre + [Related commands:] "delete_bonds"_delete_bonds.html, "neigh_modify"_neigh_modify.html -exclude +exclude, "fix shake"_fix_shake.html [Default:] diff --git a/doc/package.html b/doc/package.html index d707037dae..939fee6ff2 100644 --- a/doc/package.html +++ b/doc/package.html @@ -15,24 +15,11 @@package style args-
gpu args = mode first last split keyword value ... - mode = force or force/neigh - first = ID of first GPU to be used on each node - last = ID of last GPU to be used on each node - split = fraction of particles assigned to the GPU - zero or more keyword/value pairs may be appended - keywords = threads_per_atom or cellsize or device - threads_per_atom value = Nthreads - Nthreads = # of GPU threads used per atom - cellsize value = dist - dist = length (distance units) in each dimension for neighbor bins - device value = device_type - device_type = kepler or fermi or cypress or generic - cuda args = keyword value ... +cuda args = keyword value ... one or more keyword/value pairs may be appended keywords = gpu/node or gpu/node/special or timing or test or override/bpa gpu/node value = N @@ -45,6 +32,25 @@ id = atom-ID of a test particle override/bpa values = flag flag = 0 for TpA algorithm, 1 for BpA algorithm + gpu args = mode first last split keyword value ... + mode = force or force/neigh + first = ID of first GPU to be used on each node + last = ID of last GPU to be used on each node + split = fraction of particles assigned to the GPU + zero or more keyword/value pairs may be appended + keywords = threads_per_atom or cellsize or device + threads_per_atom value = Nthreads + Nthreads = # of GPU threads used per atom + cellsize value = dist + dist = length (distance units) in each dimension for neighbor bins + device value = device_type + device_type = kepler or fermi or cypress or generic + kokkos args = keyword value ... + one or more keyword/value pairs may be appended + keywords = neigh or comm/exchange or comm/forward + neigh value = full or half/thread or half or n2 or full/cluster + comm/exchange value = no or host or device + comm/forward value = no or host or device omp args = Nthreads mode Nthreads = # of OpenMP threads to associate with each MPI process mode = force or force/neigh (optional) @@ -59,13 +65,14 @@ package gpu force/neigh 0 0 1.0 package gpu force/neigh 0 1 -1.0 package cuda gpu/node/special 2 0 2 package cuda test 3948 +package kokkos neigh half/thread comm/forward device package omp * force/neigh package omp 4 forceDescription:
This command invokes package-specific settings. Currently the -following packages use it: GPU, USER-CUDA, and USER-OMP. +following packages use it: USER-CUDA, GPU, KOKKOS, and USER-OMP.
To use the accelerated GPU and USER-OMP styles, the use of the package command is required. However, as described in the "Defaults" section @@ -74,9 +81,9 @@ options to enable use of these styles, then default package settings are enabled. In that case you only need to use the package command if you want to change the defaults.
-To use the accelerate USER-CUDA styles, the package command is not -required as defaults are assigned internally. You only need to use -the package command if you want to change the defaults. +
To use the accelerated USER-CUDA and KOKKOS styles, the package +command is not required as defaults are assigned internally. You only +need to use the package command if you want to change the defaults.
See Section_accelerate of the manual for more details about using these various packages for accelerating @@ -84,6 +91,58 @@ LAMMPS calculations.
+The cuda style invokes options associated with the use of the +USER-CUDA package. +
+The gpu/node keyword specifies the number N of GPUs to be used on +each node. An MPI process with rank K will use the GPU (K mod N). +This implies that processes should be assigned with successive ranks +on each node, which is the default with most (or even all) MPI +implementations. The default value for N is 2. +
+The gpu/node/special keyword also specifies the number (N) of GPUs
+to be used on each node, but allows more control over their
+specification. An MPI process with rank K will use the GPU gpu_l
+with l = (K mod N) + 1. This implies that processes should be assigned
+with successive ranks on each node, which is the default with most (or
+even all) MPI implementations. For example, if you have three GPUs on
+a machine, one of which is used for the X-Server (the GPU with the ID
+1) while the others (with IDs 0 and 2) are used for computations, you
+would specify:
+package cuda gpu/node/special 2 0 2
+
+A main purpose of the gpu/node/special option is to allow two (or
+more) simulations to be run on one workstation. In that case one
+would set the first simulation to use GPU 0 and the second to use GPU
+1. This is not necessary, though, if the GPUs are in what is called
+compute exclusive mode. Using that setting, every process will get
+its own GPU automatically. This compute exclusive mode can be set
+as root using the nvidia-smi tool, which is part of the CUDA
+installation.
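+For example, something like the following (exact flags depend on your
+driver version) puts the first GPU into compute exclusive mode:
+
+nvidia-smi -i 0 -c EXCLUSIVE_PROCESS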
+Note that if the gpu/node/special keyword is not used, the USER-CUDA
+package sorts existing GPUs on each node according to their number of
+multiprocessors. This way, compute GPUs will be prioritized over
+X-Server GPUs.
+Use of the timing keyword will output detailed timing information +for various subroutines. +
+The test keyword will output info for the specified atom at
+several points during each time step. This is mainly useful for
+debugging purposes. Note that the simulation will be severely slowed
+down if this option is used.
+The override/bpa keyword can be used to specify which mode is used
+for pair-force evaluation. TpA = one thread per atom; BpA = one block
+per atom. If this keyword is not used, a short test at the beginning of
+each run will determine which method is more effective (the result of
+this test is part of the LAMMPS output). Therefore it is usually not
+necessary to use this keyword.
+
+The gpu style invokes options associated with the use of the GPU package.
@@ -157,55 +216,59 @@ device type can be specified when building LAMMPS with the GPU library.
-The cuda style invokes options associated with the use of the -USER-CUDA package. +
The kokkos style invokes options associated with the use of the +KOKKOS package.
-The gpu/node keyword specifies the number N of GPUs to be used on -each node. An MPI process with rank K will use the GPU (K mod N). -This implies that processes should be assigned with successive ranks -on each node, which is the default with most (or even all) MPI -implementations. The default value for N is 2. +
The neigh keyword determines what kinds of neighbor lists are built.
+A value of half uses half-neighbor lists, the same as used by most
+pair styles in LAMMPS. A value of half/thread uses a threadsafe
+variant of the half-neighbor list. It should be used instead of
+half when running with threads on a CPU. A value of full uses a
+full neighbor list, i.e. f_ij and f_ji are both calculated. This
+performs twice as much computation as the half option; however, that
+can be a win because it is threadsafe and doesn't require atomic
+operations. A value of full/cluster is an experimental neighbor
+style, where particles interact with all particles within a small
+cluster, if at least one of the cluster's particles is within the
+neighbor cutoff range. This potentially allows for better
+vectorization on architectures such as the Intel Phi. It also reduces
+the size of the neighbor list by roughly a factor of the cluster size,
+thus reducing the total memory footprint considerably.
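+For example (illustrative choices, not recommendations for every
+problem), a threaded CPU run and a GPU run might use, respectively:
+
+package kokkos neigh half/thread
+package kokkos neigh full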
-The gpu/node/special keyword also specifies the number (N) of GPUs -to be used on each node, but allows more control over their -specification. An MPI process with rank K will use the GPU gpuI -with l = (K mod N) + 1. This implies that processes should be assigned -with successive ranks on each node, which is the default with most (or -even all) MPI implementations. For example if you have three GPUs on -a machine, one of which is used for the X-Server (the GPU with the ID -1) while the others (with IDs 0 and 2) are used for computations you -would specify: +
The comm/exchange and comm/forward keywords determine whether the
+host or device performs the packing and unpacking of data when
+communicating information between processors. "Exchange"
+communication happens only on timesteps on which neighbor lists are
+rebuilt. The data is only for atoms that migrate to new processors.
+"Forward" communication happens every timestep. The data is for atom
+coordinates and any other atom properties that need to be updated for
+ghost atoms owned by each processor.
-package cuda gpu/node/special 2 0 2 --A main purpose of the gpu/node/special optoin is to allow two (or -more) simulations to be run on one workstation. In that case one -would set the first simulation to use GPU 0 and the second to use GPU -1. This is not necessary though, if the GPUs are in what is called -compute exclusive mode. Using that setting, every process will get -its own GPU automatically. This compute exclusive mode can be set -as root using the nvidia-smi tool which is part of the CUDA -installation. +
The value options for these keywords are no or host or device. +A value of no means to use the standard non-KOKKOS method of +packing/unpacking data for the communication. A value of host means +to use the host, typically a multi-core CPU, and perform the +packing/unpacking in parallel with threads. A value of device means +to use the device, typically a GPU, to perform the packing/unpacking +operation.
-Note that if the gpu/node/special keyword is not used, the USER-CUDA -package sorts existing GPUs on each node according to their number of -multiprocessors. This way, compute GPUs will be priorized over -X-Server GPUs. -
-Use of the timing keyword will output detailed timing information -for various subroutines. -
-The test keyword will output info for the the specified atom at -several points during each time step. This is mainly usefull for -debugging purposes. Note that the simulation will be severly slowed -down if this option is used. -
-The override/bpa keyword can be used to specify which mode is used -for pair-force evaluation. TpA = one thread per atom; BpA = one block -per atom. If this keyword is not used, a short test at the begin of -each run will determine which method is more effective (the result of -this test is part of the LAMMPS output). Therefore it is usually not -necessary to use this keyword. +
The optimal choice for these keywords depends on the input script and +the hardware used. The no value is useful for verifying that Kokkos +code is working correctly. It may also be the fastest choice when +using Kokkos styles in MPI-only mode (i.e. with a thread count of 1). +When running on CPUs or Xeon Phi, the host and device values work +identically. When using GPUs, the device value will typically be +optimal if all of the styles used in your input script are supported +by the KOKKOS package. In this case data can stay on the GPU for many +timesteps without being moved between the host and GPU, if you use the +device value. This requires that your MPI is able to access GPU +memory directly. Currently that is true for OpenMPI 1.8 (or later +versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses +styles (e.g. fixes) which are not yet supported by the KOKKOS package, +then data has to be moved between the host and device anyway, so it is +typically faster to let the host handle communication, by using the +host value. Using host instead of no will enable use of +multiple threads to pack/unpack communicated data.
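For instance, on a GPU machine where every style in the input script has a KOKKOS version, keeping both communication steps on the device is likely the better setting, while a script that also uses unported fixes would fall back to the host; both combinations below are illustrative, not prescriptive:

package kokkos comm/exchange device comm/forward device
package kokkos comm/exchange host comm/forward host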
@@ -262,6 +325,10 @@ LAMMPS section for more info. with the GPU package. See the Making LAMMPS section for more info. +The kk style of this command can only be invoked if LAMMPS was built +with the KOKKOS package. See the Making +LAMMPS section for more info. +
The omp style of this command can only be invoked if LAMMPS was built with the USER-OMP package. See the Making LAMMPS section for more info. @@ -272,15 +339,20 @@ LAMMPS section for more info.
Default:
+The default settings for the USER-CUDA package are "package cuda gpu +2". This is the case whether the "-sf cuda" command-line +switch is used or not. +
If the "-sf gpu" command-line switch is used then it is as if the command "package gpu force/neigh 0 0 1" were invoked, to specify default settings for the GPU package. If the command-line switch is not used, then no defaults are set, and you must specify the appropriate package command in your input script.
-The default settings for the USER CUDA package are "package cuda gpu -2". This is the case whether the "-sf cuda" command-line -switch is used or not. +
The default settings for the KOKKOS package are "package kk neigh full +comm/exchange host comm/forward host". This is the case whether the +"-sf kk" command-line switch is used or +not.
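In practice these defaults are usually picked up via the command-line switches; the runs documented in examples/kokkos/README later in this patch use invocations of roughly this form (executable name and thread count are specific to that example machine):

mpirun -np 2 lmp_cuda -k on t 6 -sf kk < in.kokkos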
If the "-sf omp" command-line switch is used then it is as if the command "package omp *" were invoked, to diff --git a/doc/package.txt b/doc/package.txt index 54f5343138..49b383da6f 100644 --- a/doc/package.txt +++ b/doc/package.txt @@ -12,21 +12,8 @@ package command :h3 package style args :pre -style = {gpu} or {cuda} or {omp} :ulb,l +style = {cuda} or {gpu} or {kokkos} or {omp} :ulb,l args = arguments specific to the style :l - {gpu} args = mode first last split keyword value ... - mode = force or force/neigh - first = ID of first GPU to be used on each node - last = ID of last GPU to be used on each node - split = fraction of particles assigned to the GPU - zero or more keyword/value pairs may be appended - keywords = {threads_per_atom} or {cellsize} or {device} - {threads_per_atom} value = Nthreads - Nthreads = # of GPU threads used per atom - {cellsize} value = dist - dist = length (distance units) in each dimension for neighbor bins - {device} value = device_type - device_type = {kepler} or {fermi} or {cypress} or {phi} or {intel} or {generic} {cuda} args = keyword value ... one or more keyword/value pairs may be appended keywords = {gpu/node} or {gpu/node/special} or {timing} or {test} or {override/bpa} @@ -40,6 +27,25 @@ args = arguments specific to the style :l id = atom-ID of a test particle {override/bpa} values = flag flag = 0 for TpA algorithm, 1 for BpA algorithm + {gpu} args = mode first last split keyword value ... + mode = force or force/neigh + first = ID of first GPU to be used on each node + last = ID of last GPU to be used on each node + split = fraction of particles assigned to the GPU + zero or more keyword/value pairs may be appended + keywords = {threads_per_atom} or {cellsize} or {device} + {threads_per_atom} value = Nthreads + Nthreads = # of GPU threads used per atom + {cellsize} value = dist + dist = length (distance units) in each dimension for neighbor bins + {device} value = device_type + device_type = {kepler} or {fermi} or {cypress} or {phi} or {intel} or {generic} + {kokkos} args = keyword value ... + one or more keyword/value pairs may be appended + keywords = {neigh} or {comm/exchange} or {comm/forward} + {neigh} value = {full} or {half/thread} or {half} or {n2} or {full/cluster} + {comm/exchange} value = {no} or {host} or {device} + {comm/forward} value = {no} or {host} or {device} {omp} args = Nthreads mode Nthreads = # of OpenMP threads to associate with each MPI process mode = force or force/neigh (optional) :pre @@ -53,13 +59,14 @@ package gpu force/neigh 0 0 1.0 package gpu force/neigh 0 1 -1.0 package cuda gpu/node/special 2 0 2 package cuda test 3948 +package kokkos neigh half/thread comm/forward device package omp * force/neigh package omp 4 force :pre [Description:] This command invokes package-specific settings. Currently the -following packages use it: GPU, USER-CUDA, and USER-OMP. +following packages use it: USER-CUDA, GPU, KOKKOS, and USER-OMP. To use the accelerated GPU and USER-OMP styles, the use of the package command is required. However, as described in the "Defaults" section @@ -68,9 +75,9 @@ options"_Section_start.html#start_7 to enable use of these styles, then default package settings are enabled. In that case you only need to use the package command if you want to change the defaults. -To use the accelerate USER-CUDA styles, the package command is not -required as defaults are assigned internally. You only need to use -the package command if you want to change the defaults. 
+To use the accelerated USER-CUDA and KOKKOS styles, the package +command is not required as defaults are assigned internally. You only +need to use the package command if you want to change the defaults. See "Section_accelerate"_Section_accelerate.html of the manual for more details about using these various packages for accelerating @@ -78,6 +85,58 @@ LAMMPS calculations. :line +The {cuda} style invokes options associated with the use of the +USER-CUDA package. + +The {gpu/node} keyword specifies the number {N} of GPUs to be used on +each node. An MPI process with rank {K} will use the GPU (K mod N). +This implies that processes should be assigned with successive ranks +on each node, which is the default with most (or even all) MPI +implementations. The default value for {N} is 2. + +The {gpu/node/special} keyword also specifies the number (N) of GPUs +to be used on each node, but allows more control over their +specification. An MPI process with rank {K} will use the GPU {gpuI} +with l = (K mod N) + 1. This implies that processes should be assigned +with successive ranks on each node, which is the default with most (or +even all) MPI implementations. For example if you have three GPUs on +a machine, one of which is used for the X-Server (the GPU with the ID +1) while the others (with IDs 0 and 2) are used for computations you +would specify: + +package cuda gpu/node/special 2 0 2 :pre + +A main purpose of the {gpu/node/special} option is to allow two (or +more) simulations to be run on one workstation. In that case one +would set the first simulation to use GPU 0 and the second to use GPU +1. This is not necessary though, if the GPUs are in what is called +{compute exclusive} mode. Using that setting, every process will get +its own GPU automatically. This {compute exclusive} mode can be set +as root using the {nvidia-smi} tool which is part of the CUDA +installation. + +Note that if the {gpu/node/special} keyword is not used, the USER-CUDA +package sorts existing GPUs on each node according to their number of +multiprocessors. This way, compute GPUs will be prioritized over +X-Server GPUs. + +Use of the {timing} keyword will output detailed timing information +for various subroutines. + +The {test} keyword will output info for the specified atom at +several points during each time step. This is mainly useful for +debugging purposes. Note that the simulation will be severely slowed +down if this option is used. + +The {override/bpa} keyword can be used to specify which mode is used +for pair-force evaluation. TpA = one thread per atom; BpA = one block +per atom. If this keyword is not used, a short test at the beginning of +each run will determine which method is more effective (the result of +this test is part of the LAMMPS output). Therefore it is usually not +necessary to use this keyword. + +:line + The {gpu} style invokes options associated with the use of the GPU package. @@ -152,55 +211,59 @@ the GPU library. :line -The {cuda} style invokes options associated with the use of the -USER-CUDA package. +The {kokkos} style invokes options associated with the use of the +KOKKOS package. -The {gpu/node} keyword specifies the number {N} of GPUs to be used on -each node. An MPI process with rank {K} will use the GPU (K mod N). -This implies that processes should be assigned with successive ranks -on each node, which is the default with most (or even all) MPI -implementations. The default value for {N} is 2. +The {neigh} keyword determines what kinds of neighbor lists are built.
+A value of {half} uses half-neighbor lists, the same as used by most +pair styles in LAMMPS. A value of {half/thread} uses a threadsafe +variant of the half-neighbor list. It should be used instead of +{half} when running with threads on a CPU. A value of {full} uses a +full-neighborlist, i.e. f_ij and f_ji are both calculated. This +performs twice as much computation as the {half} option, however that +can be a win because it is threadsafe and doesn't require atomic +operations. A value of {full/cluster} is an experimental neighbor +style, where particles interact with all particles within a small +cluster, if at least one of the clusters particles is within the +neighbor cutoff range. This potentially allows for better +vectorization on architectures such as the Intel Phi. If also reduces +the size of the neighbor list by roughly a factor of the cluster size, +thus reducing the total memory footprint considerably. -The {gpu/node/special} keyword also specifies the number (N) of GPUs -to be used on each node, but allows more control over their -specification. An MPI process with rank {K} will use the GPU {gpuI} -with l = (K mod N) + 1. This implies that processes should be assigned -with successive ranks on each node, which is the default with most (or -even all) MPI implementations. For example if you have three GPUs on -a machine, one of which is used for the X-Server (the GPU with the ID -1) while the others (with IDs 0 and 2) are used for computations you -would specify: +The {comm/exchange} and {comm/forward} keywords determine whether the +host or device performs the packing and unpacking of data when +communicating information between processors. "Exchange" +communication happens only on timesteps that neighbor lists are +rebuilt. The data is only for atoms that migrate to new processors. +"Forward" communication happens every timestep. The data is for atom +coordinates and any other atom properties that needs to be updated for +ghost atoms owned by each processor. -package cuda gpu/node/special 2 0 2 :pre +The value options for these keywords are {no} or {host} or {device}. +A value of {no} means to use the standard non-KOKKOS method of +packing/unpacking data for the communication. A value of {host} means +to use the host, typically a multi-core CPU, and perform the +packing/unpacking in parallel with threads. A value of {device} means +to use the device, typically a GPU, to perform the packing/unpacking +operation. -A main purpose of the {gpu/node/special} optoin is to allow two (or -more) simulations to be run on one workstation. In that case one -would set the first simulation to use GPU 0 and the second to use GPU -1. This is not necessary though, if the GPUs are in what is called -{compute exclusive} mode. Using that setting, every process will get -its own GPU automatically. This {compute exclusive} mode can be set -as root using the {nvidia-smi} tool which is part of the CUDA -installation. - -Note that if the {gpu/node/special} keyword is not used, the USER-CUDA -package sorts existing GPUs on each node according to their number of -multiprocessors. This way, compute GPUs will be priorized over -X-Server GPUs. - -Use of the {timing} keyword will output detailed timing information -for various subroutines. - -The {test} keyword will output info for the the specified atom at -several points during each time step. This is mainly usefull for -debugging purposes. Note that the simulation will be severly slowed -down if this option is used. 
- -The {override/bpa} keyword can be used to specify which mode is used -for pair-force evaluation. TpA = one thread per atom; BpA = one block -per atom. If this keyword is not used, a short test at the begin of -each run will determine which method is more effective (the result of -this test is part of the LAMMPS output). Therefore it is usually not -necessary to use this keyword. +The optimal choice for these keywords depends on the input script and +the hardware used. The {no} value is useful for verifying that Kokkos +code is working correctly. It may also be the fastest choice when +using Kokkos styles in MPI-only mode (i.e. with a thread count of 1). +When running on CPUs or Xeon Phi, the {host} and {device} values work +identically. When using GPUs, the {device} value will typically be +optimal if all of your styles used in your input script are supported +by the KOKKOS package. In this case data can stay on the GPU for many +timesteps without being moved between the host and GPU, if you use the +{device} value. This requires that your MPI is able to access GPU +memory directly. Currently that is true for OpenMPI 1.8 (or later +versions), Mvapich2 1.9 (or later), and CrayMPI. If your script uses +styles (e.g. fixes) which are not yet supported by the KOKKOS package, +then data has to be move between the host and device anyway, so it is +typically faster to let the host handle communication, by using the +{host} value. Using {host} instead of {no} will enable use of +multiple threads to pack/unpack communicated data. :line @@ -256,8 +319,10 @@ LAMMPS"_Section_start.html#start_3 section for more info. The gpu style of this command can only be invoked if LAMMPS was built with the GPU package. See the "Making LAMMPS"_Section_start.html#start_3 section for more info. -When using the "r-RESPA run style"_run_style.html, GPU accelerated -styles can only be used on the outermost RESPA level. + +The kk style of this command can only be invoked if LAMMPS was built +with the KOKKOS package. See the "Making +LAMMPS"_Section_start.html#start_3 section for more info. The omp style of this command can only be invoked if LAMMPS was built with the USER-OMP package. See the "Making @@ -269,15 +334,20 @@ LAMMPS"_Section_start.html#start_3 section for more info. [Default:] +The default settings for the USER-CUDA package are "package cuda gpu +2". This is the case whether the "-sf cuda" "command-line +switch"_Section_start.html#start_7 is used or not. + If the "-sf gpu" "command-line switch"_Section_start.html#start_7 is used then it is as if the command "package gpu force/neigh 0 0 1" were invoked, to specify default settings for the GPU package. If the command-line switch is not used, then no defaults are set, and you must specify the appropriate package command in your input script. -The default settings for the USER CUDA package are "package cuda gpu -2". This is the case whether the "-sf cuda" "command-line -switch"_Section_start.html#start_7 is used or not. +The default settings for the KOKKOS package are "package kk neigh full +comm/exchange host comm/forward host". This is the case whether the +"-sf kk" "command-line switch"_Section_start.html#start_7 is used or +not. If the "-sf omp" "command-line switch"_Section_start.html#start_7 is used then it is as if the command "package omp *" were invoked, to diff --git a/doc/pair_lj.html b/doc/pair_lj.html index 767022c000..70eb931a13 100644 --- a/doc/pair_lj.html +++ b/doc/pair_lj.html @@ -17,6 +17,8 @@
pair_style lj/cut/gpu command
+pair_style lj/cut/kk command +
pair_style lj/cut/opt command
pair_style lj/cut/omp command @@ -263,17 +265,18 @@ pair_style command.
-Styles with a cuda, gpu, omp, or opt suffix are functionally -the same as the corresponding style without the suffix. They have -been optimized to run faster, depending on your available hardware, as -discussed in Section_accelerate of the -manual. The accelerated styles take the same arguments and should -produce the same results, except for round-off and precision issues. +
Styles with a cuda, gpu, kk, omp, or opt suffix are +functionally the same as the corresponding style without the suffix. +They have been optimized to run faster, depending on your available +hardware, as discussed in Section_accelerate +of the manual. The accelerated styles take the same arguments and +should produce the same results, except for round-off and precision +issues.
-These accelerated styles are part of the USER-CUDA, GPU, USER-OMP and OPT -packages, respectively. They are only enabled if LAMMPS was built with -those packages. See the Making LAMMPS -section for more info. +
These accelerated styles are part of the USER-CUDA, GPU, KOKKOS, +USER-OMP and OPT packages, respectively. They are only enabled if +LAMMPS was built with those packages. See the Making +LAMMPS section for more info.
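As a small sketch of the explicit form, using the same generic LJ parameters as the in.kokkos example script reproduced later in this patch, the KOKKOS variant can be requested directly:

pair_style lj/cut/kk 2.5
pair_coeff 1 1 1.0 1.0 2.5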
You can specify the accelerated styles explicitly in your input script by including their suffix, or you can use the -suffix command-line diff --git a/doc/pair_lj.txt b/doc/pair_lj.txt index a5613c121b..fed4af04fc 100644 --- a/doc/pair_lj.txt +++ b/doc/pair_lj.txt @@ -10,6 +10,7 @@ pair_style lj/cut command :h3 pair_style lj/cut/cuda command :h3 pair_style lj/cut/experimental/cuda command :h3 pair_style lj/cut/gpu command :h3 +pair_style lj/cut/kk command :h3 pair_style lj/cut/opt command :h3 pair_style lj/cut/omp command :h3 pair_style lj/cut/coul/cut command :h3 @@ -230,17 +231,18 @@ pair_style command. :line -Styles with a {cuda}, {gpu}, {omp}, or {opt} suffix are functionally -the same as the corresponding style without the suffix. They have -been optimized to run faster, depending on your available hardware, as -discussed in "Section_accelerate"_Section_accelerate.html of the -manual. The accelerated styles take the same arguments and should -produce the same results, except for round-off and precision issues. +Styles with a {cuda}, {gpu}, {kk}, {omp}, or {opt} suffix are +functionally the same as the corresponding style without the suffix. +They have been optimized to run faster, depending on your available +hardware, as discussed in "Section_accelerate"_Section_accelerate.html +of the manual. The accelerated styles take the same arguments and +should produce the same results, except for round-off and precision +issues. -These accelerated styles are part of the USER-CUDA, GPU, USER-OMP and OPT -packages, respectively. They are only enabled if LAMMPS was built with -those packages. See the "Making LAMMPS"_Section_start.html#start_3 -section for more info. +These accelerated styles are part of the USER-CUDA, GPU, KOKKOS, +USER-OMP and OPT packages, respectively. They are only enabled if +LAMMPS was built with those packages. See the "Making +LAMMPS"_Section_start.html#start_3 section for more info. You can specify the accelerated styles explicitly in your input script by including their suffix, or you can use the "-suffix command-line diff --git a/doc/pair_table.html b/doc/pair_table.html index c86d12afce..21d9fc9d1f 100644 --- a/doc/pair_table.html +++ b/doc/pair_table.html @@ -13,6 +13,8 @@
pair_style table/gpu command
+pair_style table/kk command +
pair_style table/omp command
Syntax: @@ -200,17 +202,18 @@ one that matches the specified keyword.
-Styles with a cuda, gpu, omp, or opt suffix are functionally -the same as the corresponding style without the suffix. They have -been optimized to run faster, depending on your available hardware, as -discussed in Section_accelerate of the -manual. The accelerated styles take the same arguments and should -produce the same results, except for round-off and precision issues. +
Styles with a cuda, gpu, kk, omp, or opt suffix are +functionally the same as the corresponding style without the suffix. +They have been optimized to run faster, depending on your available +hardware, as discussed in Section_accelerate +of the manual. The accelerated styles take the same arguments and +should produce the same results, except for round-off and precision +issues.
-These accelerated styles are part of the USER-CUDA, GPU, USER-OMP and OPT -packages, respectively. They are only enabled if LAMMPS was built with -those packages. See the Making LAMMPS -section for more info. +
These accelerated styles are part of the USER-CUDA, GPU, KOKKOS, +USER-OMP and OPT packages, respectively. They are only enabled if +LAMMPS was built with those packages. See the Making +LAMMPS section for more info.
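A hedged sketch of requesting the KOKKOS variant explicitly; the interpolation style, number of table points, file name, and section keyword below are placeholders rather than values taken from this document:

pair_style table/kk linear 1000
pair_coeff 1 1 my_table.txt ENTRY1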
suffix style-
- style = off or on or opt or omp or gpu or cuda +
- style = off or on or cuda or gpu or kk or omp or opt
Examples:
suffix off suffix on -suffix gpu +suffix gpu +suffix kk Description:
This command allows you to use variants of various styles if they exist. In that respect it operates the same as the -suffix command-line switch. It also has options -to turn off/on any suffix setting made via the command line. +to turn off or back on any suffix setting made via the command line.
-The specified style can be opt, omp, gpu, or cuda. These refer to -optional packages that LAMMPS can be built with, as described in this -section of the manual. The "opt" style -corrsponds to the OPT package, the "omp" style to the USER-OMP package, -the "gpu" style to the GPU package, and the "cuda" style to the -USER-CUDA package. +
The specified style can be cuda, gpu, kk, omp, or opt. +These refer to optional packages that LAMMPS can be built with, as +described in this section of the manual. +The "cuda" style corresponds to the USER-CUDA package, the "gpu" style +to the GPU package, the "kk" style to the KOKKOS package, the "omp" +style to the USER-OMP package, and the "opt" style to the OPT package.
These are the variants these packages provide:
-
- OPT = a handful of pair styles, cache-optimized for faster CPU -performance +
- USER-CUDA = a collection of atom, pair, fix, compute, and integrate +styles, optimized to run on one or more NVIDIA GPUs + +
- GPU = a handful of pair styles and the PPPM kspace_style, optimized to +run on one or more GPUs or multicore CPU/GPU nodes + +
- KOKKOS = a collection of atom, pair, and fix styles optimized to run +using the Kokkos library on various kinds of hardware, including GPUs +via Cuda and many-core chips via OpenMP or threading.
- USER-OMP = a collection of pair, bond, angle, dihedral, improper, kspace, compute, and fix styles with support for OpenMP multi-threading -
- GPU = a handful of pair styles and the PPPM kspace_style, optimized to -run on one or more GPUs or multicore CPU/GPU nodes - -
- USER-CUDA = a collection of atom, pair, fix, compute, and intergrate -styles, optimized to run on one or more NVIDIA GPUs +
- OPT = a handful of pair styles, cache-optimized for faster CPU +performance
As an example, all of the packages provide a pair_style lj/cut variant, with style names lj/cut/opt, lj/cut/omp, -lj/cut/gpu, or lj/cut/cuda. A variant styles can be specified -explicitly in your input script, e.g. pair_style lj/cut/gpu. If the -suffix command is used with the appropriate style, you do not need to -modify your input script. The specified suffix (opt,omp,gpu,cuda) is -automatically appended whenever your input script command creates a -new atom, pair, -bond, angle, -dihedral, improper, -kspace, fix, compute, or -run style. If the variant version does not exist, -the standard version is created. +lj/cut/gpu, lj/cut/cuda, or lj/cut/kk. A variant style can be +specified explicitly in your input script, e.g. pair_style lj/cut/gpu. +If the suffix command is used with the appropriate style, you do not +need to modify your input script. The specified suffix +(opt,omp,gpu,cuda,kk) is automatically appended whenever your input +script command creates a new atom, +pair, bond, +angle, dihedral, +improper, kspace, +fix, compute, or run style. +If the variant version does not exist, the standard version is +created.
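A minimal sketch of this mechanism, using the generic LJ cutoff that appears elsewhere in this patch: with the suffix set, the plain pair_style line below instantiates lj/cut/kk if it is available and falls back to lj/cut otherwise:

suffix kk
pair_style lj/cut 2.5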
If the specified style is off, then any previously specified suffix is temporarily disabled, whether it was specified by a command-line diff --git a/doc/suffix.txt b/doc/suffix.txt index be2c1c26f8..42675d252c 100644 --- a/doc/suffix.txt +++ b/doc/suffix.txt @@ -12,56 +12,62 @@ suffix command :h3 suffix style :pre -style = {off} or {on} or {opt} or {omp} or {gpu} or {cuda} :ul +style = {off} or {on} or {cuda} or {gpu} or {kk} or {omp} or {opt} :ul [Examples:] suffix off suffix on -suffix gpu :pre +suffix gpu +suffix kk :pre [Description:] This command allows you to use variants of various styles if they exist. In that respect it operates the same as the "-suffix command-line switch"_Section_start.html#start_7. It also has options -to turn off/on any suffix setting made via the command line. +to turn off or back on any suffix setting made via the command line. -The specified style can be {opt}, {omp}, {gpu}, or {cuda}. These refer to -optional packages that LAMMPS can be built with, as described in "this -section of the manual"_Section_start.html#start_3. The "opt" style -corrsponds to the OPT package, the "omp" style to the USER-OMP package, -the "gpu" style to the GPU package, and the "cuda" style to the -USER-CUDA package. +The specified style can be {cuda}, {gpu}, {kk}, {omp}, or {opt}. +These refer to optional packages that LAMMPS can be built with, as +described in "this section of the manual"_Section_start.html#start_3. +The "cuda" style corresponds to the USER-CUDA package, the "gpu" style +to the GPU package, the "kk" style to the KOKKOS package, the "omp" +style to the USER-OMP package, and the "opt" style to the OPT package, These are the variants these packages provide: -OPT = a handful of pair styles, cache-optimized for faster CPU -performance :ulb,l +USER-CUDA = a collection of atom, pair, fix, compute, and intergrate +styles, optimized to run on one or more NVIDIA GPUs :ulb,l + +GPU = a handful of pair styles and the PPPM kspace_style, optimized to +run on one or more GPUs or multicore CPU/GPU nodes :l + +KOKKOS = a collection of atom, pair, and fix styles optimized to run +using the Kokkos library on various kinds of hardware, including GPUs +via Cuda and many-core chips via OpenMP or threading. :l USER-OMP = a collection of pair, bond, angle, dihedral, improper, kspace, compute, and fix styles with support for OpenMP multi-threading :l -GPU = a handful of pair styles and the PPPM kspace_style, optimized to -run on one or more GPUs or multicore CPU/GPU nodes :l - -USER-CUDA = a collection of atom, pair, fix, compute, and intergrate -styles, optimized to run on one or more NVIDIA GPUs :l,ule +OPT = a handful of pair styles, cache-optimized for faster CPU +performance :ule,l As an example, all of the packages provide a "pair_style lj/cut"_pair_lj.html variant, with style names lj/cut/opt, lj/cut/omp, -lj/cut/gpu, or lj/cut/cuda. A variant styles can be specified -explicitly in your input script, e.g. pair_style lj/cut/gpu. If the -suffix command is used with the appropriate style, you do not need to -modify your input script. The specified suffix (opt,omp,gpu,cuda) is -automatically appended whenever your input script command creates a -new "atom"_atom_style.html, "pair"_pair_style.html, -"bond"_bond_style.html, "angle"_angle_style.html, -"dihedral"_dihedral_style.html, "improper"_improper_style.html, -"kspace"_kspace_style.html, "fix"_fix.html, "compute"_compute.html, or -"run"_run_style.html style. If the variant version does not exist, -the standard version is created. 
+lj/cut/gpu, lj/cut/cuda, or lj/cut/kk. A variant style can be +specified explicitly in your input script, e.g. pair_style lj/cut/gpu. +If the suffix command is used with the appropriate style, you do not +need to modify your input script. The specified suffix +(opt,omp,gpu,cuda,kk) is automatically appended whenever your input +script command creates a new "atom"_atom_style.html, +"pair"_pair_style.html, "bond"_bond_style.html, +"angle"_angle_style.html, "dihedral"_dihedral_style.html, +"improper"_improper_style.html, "kspace"_kspace_style.html, +"fix"_fix.html, "compute"_compute.html, or "run"_run_style.html style. +If the variant version does not exist, the standard version is +created. If the specified style is {off}, then any previously specified suffix is temporarily disabled, whether it was specified by a command-line diff --git a/doc/velocity.html b/doc/velocity.html index ac1a172896..53513fbe4e 100644 --- a/doc/velocity.html +++ b/doc/velocity.html @@ -207,11 +207,28 @@ are in units of lattice spacings per time (e.g. spacings/fmsec) and coordinates are in lattice spacings. The lattice command must have been previously used to define the lattice spacing.
-Restrictions: none +
Restrictions:
+Assigning a temperature via the create option to a system with +rigid bodies or SHAKE constraints +may not have the desired outcome for two reasons. First, the velocity +command can be invoked before all of the relevant fixes are created +and initialized and the number of adjusted degrees of freedom (DOFs) +is known. Thus it is not possible to compute the target temperature +correctly. Second, the assigned velocities may be partially canceled +when constraints are first enforced, leading to a different +temperature than desired. A workaround for this is to perform a run +0 command, which insures all DOFs are accounted for +properly, and then rescale the temperature to the desired value before +performing a simulation. For example: +
+velocity all create 300.0 12345 +run 0 # temperature may not be 300K +velocity all scale 300.0 # now it should be +Related commands:
-Default:
diff --git a/doc/velocity.txt b/doc/velocity.txt index 2a606cf552..19bfca633a 100644 --- a/doc/velocity.txt +++ b/doc/velocity.txt @@ -199,11 +199,28 @@ are in units of lattice spacings per time (e.g. spacings/fmsec) and coordinates are in lattice spacings. The "lattice"_lattice.html command must have been previously used to define the lattice spacing. -[Restrictions:] none +[Restrictions:] + +Assigning a temperature via the {create} option to a system with +"rigid bodies"_fix_rigid.html or "SHAKE constraints"_fix_shake.html +may not have the desired outcome for two reasons. First, the velocity +command can be invoked before all of the relevant fixes are created +and initialized and the number of adjusted degrees of freedom (DOFs) +is known. Thus it is not possible to compute the target temperature +correctly. Second, the assigned velocities may be partially canceled +when constraints are first enforced, leading to a different +temperature than desired. A workaround for this is to perform a "run +0"_run.html command, which insures all DOFs are accounted for +properly, and then rescale the temperature to the desired value before +performing a simulation. For example: + +velocity all create 300.0 12345 +run 0 # temperature may not be 300K +velocity all scale 300.0 # now it should be :pre [Related commands:] -"fix shake"_fix_shake.html, "lattice"_lattice.html +"fix rigid"_fix_rigid.html, "fix shake"_fix_shake.html, "lattice"_lattice.html [Default:] diff --git a/examples/README b/examples/README index 33a28a98b4..fb039c5b65 100644 --- a/examples/README +++ b/examples/README @@ -73,6 +73,7 @@ gpu: use of the GPU package for GPU acceleration hugoniostat: Hugoniostat shock dynamics indent: spherical indenter into a 2d solid kim: use of potentials in Knowledge Base for Interatomic Models (KIM) +kokkos: use of the KOKKOS package for multi-threading and GPU acceleration meam: MEAM test for SiC and shear (same as shear examples) melt: rapid melt of 3d LJ system micelle: self-assembly of small lipid-like molecules into 2d bilayers diff --git a/examples/gpu/README b/examples/gpu/README new file mode 100644 index 0000000000..8fb8db00ab --- /dev/null +++ b/examples/gpu/README @@ -0,0 +1,35 @@ +These are input scripts designed for use with the GPU package. + +To run them, you must first build LAMMPS with the GPU package +installed, following the steps explained in Section 2.3 of +doc/Section_start.html and lib/gpu/README. An overview of building +and running LAMMPS with the GPU package is given in Section 5.6 of +doc/Section_accelerate.html. Note that you can choose the precision +at which computations are performed on the GPU in the build process. + +Note that lines such as this in each of the input scripts: + +package gpu force/neigh 0 1 1 + +are set for running on a compute node with 2 GPUs. If you +have a single GPU, you should comment out the line, since +the default is 1 GPU per compute node. + +The scripts can be run in the usual manner: + +lmp_g++ < in.gpu.melt.2.5 +lmp_g++ < in.gpu.melt.5.0 +lmp_g++ < in.gpu.phosphate +lmp_g++ < in.gpu.rhodo + +mpirun -np 4 lmp_g++ < in.gpu.melt.2.5 +mpirun -np 4 lmp_g++ < in.gpu.melt.5.0 +mpirun -np 4 lmp_g++ < in.gpu.phosphate +mpirun -np 4 lmp_g++ < in.gpu.rhodo + +The first set of commmands will run a single MPI task using a single +GPU (even if you have 2 GPUs). + +The second set of commands will run 4 MPI tasks, with 2 MPI tasks per +GPU (if you have 2 GPUs), or 4 MPI tasks per GPU (if you have a single +GPU). 
diff --git a/examples/kokkos/README b/examples/kokkos/README new file mode 100644 index 0000000000..fe0ea4de70 --- /dev/null +++ b/examples/kokkos/README @@ -0,0 +1,42 @@ +The in.kokkos input script is a copy of the bench/in.lj script, +but can be run with the KOKKOS package, + +To run it, you must first build LAMMPS with the KOKKOS package +installed, following the steps explained in Section 2.3.4 of +doc/Section_start.html. An overview of building and running LAMMPS +with the KOKKOS package, for different compute-node hardware on your +machine, is given in Section 5.8 of doc/Section_accelerate.html. + +The example log files included in this directory are for a desktop box +with dual hex-core CPUs and 2 GPUs. + +Two executables were built in the following manner: + +make yes-kokkos +make g++ OMP=yes -> lmp_cpu +make cuda CUDA=yes -> lmp_cuda + +Then the following runs were made. The "->" means that the run +produced log.lammps which was then copied to the named log file. + +* MPI-only runs + +lmp_cpu -k off < in.kokkos -> log.kokkos.date.mpionly.1 +mpirun -np 4 lmp_cpu -k off < in.kokkos -> log.kokkos.date.mpionly.4 + +* OpenMP threaded runs on CPUs only + +lmp_cpu -k on t 1 -sf kk < in.kokkos.half -> log.kokkos.date.cpu.1 +lmp_cpu -k on t 4 -sf kk < in.kokkos -> log.kokkos.date.cpu.4 + +Note that in.kokkos.half was use for one of the runs, which uses the +package command to force the use of half neighbor lists which are +faster when running on just 1 thread. + +* GPU runs on 1 or 2 GPUs + +lmp_cuda -k on t 6 -sf kk < in.kokkos -> log.kokkos.date.gpu.1 +mpirun -np 2 lmp_cuda -k on t 6 -sf kk < in.kokkos -> log.kokkos.date.gpu.2 + +Note that this is a very small problem (32K atoms) to run +on 1 or 2 GPUs. diff --git a/examples/kokkos/in.kokkos b/examples/kokkos/in.kokkos new file mode 100644 index 0000000000..01e12ef8a9 --- /dev/null +++ b/examples/kokkos/in.kokkos @@ -0,0 +1,30 @@ +# 3d Lennard-Jones melt + +variable x index 1 +variable y index 1 +variable z index 1 + +variable xx equal 20*$x +variable yy equal 20*$y +variable zz equal 20*$z + +units lj +atom_style atomic + +lattice fcc 0.8442 +region box block 0 ${xx} 0 ${yy} 0 ${zz} +create_box 1 box +create_atoms 1 box +mass 1 1.0 + +velocity all create 1.44 87287 loop geom + +pair_style lj/cut 2.5 +pair_coeff 1 1 1.0 1.0 2.5 + +neighbor 0.3 bin +neigh_modify delay 0 every 20 check no + +fix 1 all nve + +run 100 diff --git a/examples/kokkos/in.kokkos.half b/examples/kokkos/in.kokkos.half new file mode 100644 index 0000000000..9847d18ef0 --- /dev/null +++ b/examples/kokkos/in.kokkos.half @@ -0,0 +1,32 @@ +# 3d Lennard-Jones melt + +variable x index 1 +variable y index 1 +variable z index 1 + +variable xx equal 20*$x +variable yy equal 20*$y +variable zz equal 20*$z + +package kokkos neigh half + +units lj +atom_style atomic + +lattice fcc 0.8442 +region box block 0 ${xx} 0 ${yy} 0 ${zz} +create_box 1 box +create_atoms 1 box +mass 1 1.0 + +velocity all create 1.44 87287 loop geom + +pair_style lj/cut 2.5 +pair_coeff 1 1 1.0 1.0 2.5 + +neighbor 0.3 bin +neigh_modify delay 0 every 20 check no + +fix 1 all nve + +run 100 diff --git a/examples/kokkos/log.kokkos.1Feb14.cpu.1 b/examples/kokkos/log.kokkos.1Feb14.cpu.1 new file mode 100644 index 0000000000..76c5f5747a --- /dev/null +++ b/examples/kokkos/log.kokkos.1Feb14.cpu.1 @@ -0,0 +1,68 @@ +LAMMPS (27 May 2014) +KOKKOS mode is enabled (../lammps.cpp:468) + using 1 OpenMP thread(s) per MPI task +# 3d Lennard-Jones melt + +variable x index 1 +variable y index 1 +variable z index 1 + +variable 
xx equal 20*$x +variable xx equal 20*1 +variable yy equal 20*$y +variable yy equal 20*1 +variable zz equal 20*$z +variable zz equal 20*1 + +package kokkos neigh half + +units lj +atom_style atomic + +lattice fcc 0.8442 +Lattice spacing in x,y,z = 1.6796 1.6796 1.6796 +region box block 0 ${xx} 0 ${yy} 0 ${zz} +region box block 0 20 0 ${yy} 0 ${zz} +region box block 0 20 0 20 0 ${zz} +region box block 0 20 0 20 0 20 +create_box 1 box +Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919) + 1 by 1 by 1 MPI processor grid +create_atoms 1 box +Created 32000 atoms +mass 1 1.0 + +velocity all create 1.44 87287 loop geom + +pair_style lj/cut 2.5 +pair_coeff 1 1 1.0 1.0 2.5 + +neighbor 0.3 bin +neigh_modify delay 0 every 20 check no + +fix 1 all nve + +run 100 +Memory usage per processor = 7.79551 Mbytes +Step Temp E_pair E_mol TotEng Press + 0 1.44 -6.7733681 0 -4.6134356 -5.0197073 + 100 0.7574531 -5.7585055 0 -4.6223613 0.20726105 +Loop time of 2.29105 on 1 procs (1 MPI x 1 OpenMP) for 100 steps with 32000 atoms + +Pair time (%) = 1.82425 (79.6249) +Neigh time (%) = 0.338632 (14.7806) +Comm time (%) = 0.0366232 (1.59853) +Outpt time (%) = 0.000144005 (0.00628553) +Other time (%) = 0.0914049 (3.98965) + +Nlocal: 32000 ave 32000 max 32000 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +Nghost: 19657 ave 19657 max 19657 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +Neighs: 1.20283e+06 ave 1.20283e+06 max 1.20283e+06 min +Histogram: 1 0 0 0 0 0 0 0 0 0 + +Total # of neighbors = 1202833 +Ave neighs/atom = 37.5885 +Neighbor list builds = 5 +Dangerous builds = 0 diff --git a/examples/kokkos/log.kokkos.1Feb14.cpu.4 b/examples/kokkos/log.kokkos.1Feb14.cpu.4 new file mode 100644 index 0000000000..2b6001025b --- /dev/null +++ b/examples/kokkos/log.kokkos.1Feb14.cpu.4 @@ -0,0 +1,68 @@ +LAMMPS (27 May 2014) +KOKKOS mode is enabled (../lammps.cpp:468) + using 4 OpenMP thread(s) per MPI task +# 3d Lennard-Jones melt + +variable x index 1 +variable y index 1 +variable z index 1 + +variable xx equal 20*$x +variable xx equal 20*1 +variable yy equal 20*$y +variable yy equal 20*1 +variable zz equal 20*$z +variable zz equal 20*1 + +units lj +atom_style atomic + +lattice fcc 0.8442 +Lattice spacing in x,y,z = 1.6796 1.6796 1.6796 +region box block 0 ${xx} 0 ${yy} 0 ${zz} +region box block 0 20 0 ${yy} 0 ${zz} +region box block 0 20 0 20 0 ${zz} +region box block 0 20 0 20 0 20 +create_box 1 box +Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919) + 1 by 1 by 1 MPI processor grid +create_atoms 1 box +Created 32000 atoms +mass 1 1.0 + +velocity all create 1.44 87287 loop geom + +pair_style lj/cut 2.5 +pair_coeff 1 1 1.0 1.0 2.5 + +neighbor 0.3 bin +neigh_modify delay 0 every 20 check no + +fix 1 all nve + +run 100 +Memory usage per processor = 13.2888 Mbytes +Step Temp E_pair E_mol TotEng Press + 0 1.44 -6.7733681 0 -4.6134356 -5.0197073 + 100 0.7574531 -5.7585055 0 -4.6223613 0.20726105 +Loop time of 0.983697 on 4 procs (1 MPI x 4 OpenMP) for 100 steps with 32000 atoms + +Pair time (%) = 0.767155 (77.9869) +Neigh time (%) = 0.14734 (14.9782) +Comm time (%) = 0.041466 (4.21532) +Outpt time (%) = 0.000172138 (0.0174991) +Other time (%) = 0.0275636 (2.80204) + +Nlocal: 32000 ave 32000 max 32000 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +Nghost: 19657 ave 19657 max 19657 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +Neighs: 0 ave 0 max 0 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +FullNghs: 2.40567e+06 ave 2.40567e+06 max 2.40567e+06 min +Histogram: 1 0 0 0 0 0 0 0 0 0 + +Total # of neighbors = 2405666 +Ave neighs/atom = 75.1771 +Neighbor list 
builds = 5 +Dangerous builds = 0 diff --git a/examples/kokkos/log.kokkos.1Feb14.gpu.1 b/examples/kokkos/log.kokkos.1Feb14.gpu.1 new file mode 100644 index 0000000000..8dd9caca4c --- /dev/null +++ b/examples/kokkos/log.kokkos.1Feb14.gpu.1 @@ -0,0 +1,68 @@ +LAMMPS (27 May 2014) +KOKKOS mode is enabled (../lammps.cpp:468) + using 6 OpenMP thread(s) per MPI task +# 3d Lennard-Jones melt + +variable x index 1 +variable y index 1 +variable z index 1 + +variable xx equal 20*$x +variable xx equal 20*1 +variable yy equal 20*$y +variable yy equal 20*1 +variable zz equal 20*$z +variable zz equal 20*1 + +units lj +atom_style atomic + +lattice fcc 0.8442 +Lattice spacing in x,y,z = 1.6796 1.6796 1.6796 +region box block 0 ${xx} 0 ${yy} 0 ${zz} +region box block 0 20 0 ${yy} 0 ${zz} +region box block 0 20 0 20 0 ${zz} +region box block 0 20 0 20 0 20 +create_box 1 box +Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919) + 1 by 1 by 1 MPI processor grid +create_atoms 1 box +Created 32000 atoms +mass 1 1.0 + +velocity all create 1.44 87287 loop geom + +pair_style lj/cut 2.5 +pair_coeff 1 1 1.0 1.0 2.5 + +neighbor 0.3 bin +neigh_modify delay 0 every 20 check no + +fix 1 all nve + +run 100 +Memory usage per processor = 16.9509 Mbytes +Step Temp E_pair E_mol TotEng Press + 0 1.44 -6.7733681 0 -4.6134356 -5.0197073 + 100 0.7574531 -5.7585055 0 -4.6223613 0.20726105 +Loop time of 0.57192 on 6 procs (1 MPI x 6 OpenMP) for 100 steps with 32000 atoms + +Pair time (%) = 0.205416 (35.917) +Neigh time (%) = 0.112468 (19.665) +Comm time (%) = 0.174223 (30.4629) +Outpt time (%) = 0.000159025 (0.0278055) +Other time (%) = 0.0796535 (13.9274) + +Nlocal: 32000 ave 32000 max 32000 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +Nghost: 19657 ave 19657 max 19657 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +Neighs: 0 ave 0 max 0 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +FullNghs: 2.40567e+06 ave 2.40567e+06 max 2.40567e+06 min +Histogram: 1 0 0 0 0 0 0 0 0 0 + +Total # of neighbors = 2405666 +Ave neighs/atom = 75.1771 +Neighbor list builds = 5 +Dangerous builds = 0 diff --git a/examples/kokkos/log.kokkos.1Feb14.gpu.2 b/examples/kokkos/log.kokkos.1Feb14.gpu.2 new file mode 100644 index 0000000000..938485a350 --- /dev/null +++ b/examples/kokkos/log.kokkos.1Feb14.gpu.2 @@ -0,0 +1,68 @@ +LAMMPS (27 May 2014) +KOKKOS mode is enabled (../lammps.cpp:468) + using 6 OpenMP thread(s) per MPI task +# 3d Lennard-Jones melt + +variable x index 1 +variable y index 1 +variable z index 1 + +variable xx equal 20*$x +variable xx equal 20*1 +variable yy equal 20*$y +variable yy equal 20*1 +variable zz equal 20*$z +variable zz equal 20*1 + +units lj +atom_style atomic + +lattice fcc 0.8442 +Lattice spacing in x,y,z = 1.6796 1.6796 1.6796 +region box block 0 ${xx} 0 ${yy} 0 ${zz} +region box block 0 20 0 ${yy} 0 ${zz} +region box block 0 20 0 20 0 ${zz} +region box block 0 20 0 20 0 20 +create_box 1 box +Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919) + 1 by 1 by 2 MPI processor grid +create_atoms 1 box +Created 32000 atoms +mass 1 1.0 + +velocity all create 1.44 87287 loop geom + +pair_style lj/cut 2.5 +pair_coeff 1 1 1.0 1.0 2.5 + +neighbor 0.3 bin +neigh_modify delay 0 every 20 check no + +fix 1 all nve + +run 100 +Memory usage per processor = 8.95027 Mbytes +Step Temp E_pair E_mol TotEng Press + 0 1.44 -6.7733681 0 -4.6134356 -5.0197073 + 100 0.7574531 -5.7585055 0 -4.6223613 0.20726105 +Loop time of 0.689608 on 12 procs (2 MPI x 6 OpenMP) for 100 steps with 32000 atoms + +Pair time (%) = 0.210953 (30.5903) +Neigh time (%) = 0.122991 
(17.8349) +Comm time (%) = 0.25264 (36.6353) +Outpt time (%) = 0.000259042 (0.0375636) +Other time (%) = 0.102765 (14.9019) + +Nlocal: 16000 ave 16001 max 15999 min +Histogram: 1 0 0 0 0 0 0 0 0 1 +Nghost: 13632.5 ave 13635 max 13630 min +Histogram: 1 0 0 0 0 0 0 0 0 1 +Neighs: 0 ave 0 max 0 min +Histogram: 2 0 0 0 0 0 0 0 0 0 +FullNghs: 1.20283e+06 ave 1.20347e+06 max 1.2022e+06 min +Histogram: 1 0 0 0 0 0 0 0 0 1 + +Total # of neighbors = 2405666 +Ave neighs/atom = 75.1771 +Neighbor list builds = 5 +Dangerous builds = 0 diff --git a/examples/kokkos/log.kokkos.1Feb14.mpionly.1 b/examples/kokkos/log.kokkos.1Feb14.mpionly.1 new file mode 100644 index 0000000000..d7763feb76 --- /dev/null +++ b/examples/kokkos/log.kokkos.1Feb14.mpionly.1 @@ -0,0 +1,65 @@ +LAMMPS (27 May 2014) + using 1 OpenMP thread(s) per MPI task +# 3d Lennard-Jones melt + +variable x index 1 +variable y index 1 +variable z index 1 + +variable xx equal 20*$x +variable xx equal 20*1 +variable yy equal 20*$y +variable yy equal 20*1 +variable zz equal 20*$z +variable zz equal 20*1 + +units lj +atom_style atomic + +lattice fcc 0.8442 +Lattice spacing in x,y,z = 1.6796 1.6796 1.6796 +region box block 0 ${xx} 0 ${yy} 0 ${zz} +region box block 0 20 0 ${yy} 0 ${zz} +region box block 0 20 0 20 0 ${zz} +region box block 0 20 0 20 0 20 +create_box 1 box +Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919) + 1 by 1 by 1 MPI processor grid +create_atoms 1 box +Created 32000 atoms +mass 1 1.0 + +velocity all create 1.44 87287 loop geom + +pair_style lj/cut 2.5 +pair_coeff 1 1 1.0 1.0 2.5 + +neighbor 0.3 bin +neigh_modify delay 0 every 20 check no + +fix 1 all nve + +run 100 +Memory usage per processor = 8.21387 Mbytes +Step Temp E_pair E_mol TotEng Press + 0 1.44 -6.7733681 0 -4.6134356 -5.0197073 + 100 0.7574531 -5.7585055 0 -4.6223613 0.20726105 +Loop time of 2.57975 on 1 procs (1 MPI x 1 OpenMP) for 100 steps with 32000 atoms + +Pair time (%) = 2.20959 (85.6512) +Neigh time (%) = 0.269136 (10.4326) +Comm time (%) = 0.0252256 (0.977833) +Outpt time (%) = 0.000126123 (0.00488898) +Other time (%) = 0.0756752 (2.93343) + +Nlocal: 32000 ave 32000 max 32000 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +Nghost: 19657 ave 19657 max 19657 min +Histogram: 1 0 0 0 0 0 0 0 0 0 +Neighs: 1.20283e+06 ave 1.20283e+06 max 1.20283e+06 min +Histogram: 1 0 0 0 0 0 0 0 0 0 + +Total # of neighbors = 1202833 +Ave neighs/atom = 37.5885 +Neighbor list builds = 5 +Dangerous builds = 0 diff --git a/examples/kokkos/log.kokkos.1Feb14.mpionly.4 b/examples/kokkos/log.kokkos.1Feb14.mpionly.4 new file mode 100644 index 0000000000..1838aafd09 --- /dev/null +++ b/examples/kokkos/log.kokkos.1Feb14.mpionly.4 @@ -0,0 +1,65 @@ +LAMMPS (27 May 2014) + using 1 OpenMP thread(s) per MPI task +# 3d Lennard-Jones melt + +variable x index 1 +variable y index 1 +variable z index 1 + +variable xx equal 20*$x +variable xx equal 20*1 +variable yy equal 20*$y +variable yy equal 20*1 +variable zz equal 20*$z +variable zz equal 20*1 + +units lj +atom_style atomic + +lattice fcc 0.8442 +Lattice spacing in x,y,z = 1.6796 1.6796 1.6796 +region box block 0 ${xx} 0 ${yy} 0 ${zz} +region box block 0 20 0 ${yy} 0 ${zz} +region box block 0 20 0 20 0 ${zz} +region box block 0 20 0 20 0 20 +create_box 1 box +Created orthogonal box = (0 0 0) to (33.5919 33.5919 33.5919) + 1 by 2 by 2 MPI processor grid +create_atoms 1 box +Created 32000 atoms +mass 1 1.0 + +velocity all create 1.44 87287 loop geom + +pair_style lj/cut 2.5 +pair_coeff 1 1 1.0 1.0 2.5 + +neighbor 0.3 bin +neigh_modify delay 0 every 
20 check no + +fix 1 all nve + +run 100 +Memory usage per processor = 4.09506 Mbytes +Step Temp E_pair E_mol TotEng Press + 0 1.44 -6.7733681 0 -4.6134356 -5.0197073 + 100 0.7574531 -5.7585055 0 -4.6223613 0.20726105 +Loop time of 0.709072 on 4 procs (4 MPI x 1 OpenMP) for 100 steps with 32000 atoms + +Pair time (%) = 0.574495 (81.0206) +Neigh time (%) = 0.0709588 (10.0073) +Comm time (%) = 0.0474771 (6.69567) +Outpt time (%) = 6.62804e-05 (0.00934748) +Other time (%) = 0.0160753 (2.26708) + +Nlocal: 8000 ave 8037 max 7964 min +Histogram: 2 0 0 0 0 0 0 0 1 1 +Nghost: 9007.5 ave 9050 max 8968 min +Histogram: 1 1 0 0 0 0 0 1 0 1 +Neighs: 300708 ave 305113 max 297203 min +Histogram: 1 0 0 1 1 0 0 0 0 1 + +Total # of neighbors = 1202833 +Ave neighs/atom = 37.5885 +Neighbor list builds = 5 +Dangerous builds = 0 diff --git a/lib/README b/lib/README index 00a69b0b66..d4dac7f26e 100644 --- a/lib/README +++ b/lib/README @@ -19,6 +19,8 @@ cuda NVIDIA GPU routines, USER-CUDA package from Christian Trott (U Tech Ilmenau) gpu general GPU routines, GPU package from Mike Brown (ORNL) +kokkos Kokkos package for GPU and many-core acceleration + from Kokkos development team (Sandia) linalg set of BLAS and LAPACK routines needed by USER-ATC package from Axel Kohlmeyer (Temple U) poems POEMS rigid-body integration package, POEMS package diff --git a/lib/kokkos/Makefile.lammps b/lib/kokkos/Makefile.lammps new file mode 100644 index 0000000000..f9fa37cdfd --- /dev/null +++ b/lib/kokkos/Makefile.lammps @@ -0,0 +1,104 @@ +# Settings that the LAMMPS build will import when this package library is used + +OMP = yes +CUDA = no +HWLOC = no +AVX = no +MIC = no +LIBRT = no +DEBUG = no + +CUDA_PATH = /usr/local/cuda + +KOKKOS_PATH = ../../lib/kokkos +kokkos_SYSINC = -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I../ +SRC_KOKKOS = $(wildcard $(KOKKOS_PATH)/core/src/impl/*.cpp) + +ifeq ($(CUDA), yes) +kokkos_SYSINC += -x cu -DDEVICE=2 -DKOKKOS_HAVE_CUDA +SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.cpp) +SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.cu) +USRLIB += -L$(CUDA_PATH)/lib64 -lcudart -lcuda +ifeq ($(UVM), yes) +kokkos_SYSINC += -DKOKKOS_USE_UVM +endif +else +kokkos_SYSINC += -DDEVICE=1 +endif + +ifeq ($(CUSPARSE), yes) +kokkos_SYSINC += -DKOKKOS_USE_CUSPARSE +USRLIB += -lcusparse +endif + +ifeq ($(CUBLAS), yes) +kokkos_SYSINC += -DKOKKOS_USE_CUBLAS +USRLIB += -lcublas +endif + +ifeq ($(AVX), yes) +ifeq ($(CUDA), yes) +kokkos_SYSINC += -Xcompiler -mavx +else +kokkos_SYSINC += -mavx +endif +LINKFLAGS += -mavx +endif + +ifeq ($(MIC), yes) +kokkos_SYSINC += -mmic +LINKFLAGS += -mmic +endif + +ifeq ($(OMP),yes) +kokkos_SYSINC += -DKOKKOS_HAVE_OPENMP +SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.cpp) +ifeq ($(CUDA), yes) +kokkos_SYSINC += -Xcompiler -fopenmp +else +kokkos_SYSINC += -fopenmp +endif +LINKFLAGS += -fopenmp +else +kokkos_SYSINC += -DKOKKOS_HAVE_PTHREAD +USRLIB += -lpthread +SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.cpp) +endif + +ifeq ($(HWLOC),yes) +kokkos_SYSINC += -DKOKKOS_HAVE_HWLOC -I$(HWLOCPATH)/include +USRLIB += -L$(HWLOCPATH)/lib -lhwloc +endif + +ifeq ($(RED_PREC), yes) +kokkos_SYSINC += --use_fast_math +endif + +ifeq ($(DEBUG), yes) +kokkos_SYSINC += -g -G -DKOKKOS_EXPRESSION_CHECK -DENABLE_TRACEBACK +LINKFLAGS += -g +endif + +ifeq ($(LIBRT),yes) +kokkos_SYSINC += -DKOKKOS_USE_LIBRT -DPREC_TIMER +USRLIB += -lrt +endif + +ifeq ($(CUDALDG), yes) +kokkos_SYSINC += -DKOKKOS_USE_LDG_INTRINSIC +endif + +OBJ_KOKKOS_TMP = 
$(SRC_KOKKOS:.cpp=.o) +OBJ_KOKKOS = $(OBJ_KOKKOS_TMP:.cu=.o) +OBJ_KOKKOS_LINK = $(notdir $(OBJ_KOKKOS)) + +override OBJ += kokkos_depend.o + +libkokkoscore.a: $(OBJ_KOKKOS) + ar cr libkokkoscore.a $(OBJ_KOKKOS_LINK) + +kokkos_depend.o: libkokkoscore.a + touch kokkos_depend.cpp + $(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c kokkos_depend.cpp + +kokkos_SYSLIB = -L./ $(LINKFLAGS) $(USRLIB) diff --git a/lib/kokkos/README b/lib/kokkos/README new file mode 100644 index 0000000000..59f5685bab --- /dev/null +++ b/lib/kokkos/README @@ -0,0 +1,44 @@ +Kokkos library + +Carter Edwards, Christian Trott, Daniel Sunderland +Sandia National Labs + +29 May 2014 +http://trilinos.sandia.gov/packages/kokkos/ + +------------------------- + +This directory has source files from the Kokkos library that LAMMPS +uses when building with its KOKKOS package. The package contains +versions of pair, fix, and atom styles written with Kokkos data +structures and calls to the Kokkos library that should run efficiently +on various kinds of accelerated nodes, including GPU and many-core +chips. + +Kokkos is a C++ library that provides two key abstractions for an +application like LAMMPS. First, it allows a single implementation of +an application kernel (e.g. a pair style) to run efficiently on +different kinds of hardware (GPU, Intel Phi, many-core chip). + +Second, it provides data abstractions to adjust (at compile time) the +memory layout of basic data structures like 2d and 3d arrays and allow +the transparent utilization of special hardware load and store units. +Such data structures are used in LAMMPS to store atom coordinates or +forces or neighbor lists. The layout is chosen to optimize +performance on different platforms. Again this operation is hidden +from the developer, and does not affect how the single implementation +of the kernel is coded. + +To build LAMMPS with Kokkos, you should not need to make any changes +to files in this directory. You can overrided defaults that are set +in Makefile.lammps when building LAMMPS, by defining variables as part +of the make command. Details of the build process with Kokkos are +explained in Section 2.3 of doc/Section_start.html. and in Section 5.9 +of doc/Section_accelerate.html. + +The one exception is that when using Kokkos with NVIDIA GPUs, the +CUDA_PATH setting in Makefile.lammps needs to point to the +installation of the Cuda software on your machine. The normal default +location is /usr/local/cuda. If this is not correct, you need to edit +Makefile.lammps. + diff --git a/lib/kokkos/TPL/cub/block/block_discontinuity.cuh b/lib/kokkos/TPL/cub/block/block_discontinuity.cuh new file mode 100644 index 0000000000..76af003e58 --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_discontinuity.cuh @@ -0,0 +1,587 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * The cub::BlockDiscontinuity class provides [collective](index.html#sec0) methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. + */ + +#pragma once + +#include "../util_type.cuh" +#include "../util_namespace.cuh" + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + +/** + * \brief The BlockDiscontinuity class provides [collective](index.html#sec0) methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block.  + * \ingroup BlockModule + * + * \par Overview + * A set of "head flags" (or "tail flags") is often used to indicate corresponding items + * that differ from their predecessors (or successors). For example, head flags are convenient + * for demarcating disjoint data segments as part of a segmented scan or reduction. + * + * \tparam T The data type to be flagged. + * \tparam BLOCK_THREADS The thread block size in threads. + * + * \par A Simple Example + * \blockcollective{BlockDiscontinuity} + * \par + * The code snippet below illustrates the head flagging of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include+ * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockDiscontinuity for 128 threads on type int + * typedef cub::BlockDiscontinuity BlockDiscontinuity; + * + * // Allocate shared memory for BlockDiscontinuity + * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute head flags for discontinuities in the segment + * int head_flags[4]; + * BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality()); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is + * { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }. + * The corresponding output \p head_flags in those threads will be + * { [1,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }. + * + * \par Performance Considerations + * - Zero bank conflicts for most types. 
+ * + */ +template < + typename T, + int BLOCK_THREADS> +class BlockDiscontinuity +{ +private: + + /****************************************************************************** + * Type definitions + ******************************************************************************/ + + /// Shared memory storage layout type (last element from each thread's input) + typedef T _TempStorage[BLOCK_THREADS]; + + + /****************************************************************************** + * Utility methods + ******************************************************************************/ + + /// Internal storage allocator + __device__ __forceinline__ _TempStorage& PrivateStorage() + { + __shared__ _TempStorage private_storage; + return private_storage; + } + + + /// Specialization for when FlagOp has third index param + template ::HAS_PARAM> + struct ApplyOp + { + // Apply flag operator + static __device__ __forceinline__ bool Flag(FlagOp flag_op, const T &a, const T &b, int idx) + { + return flag_op(a, b, idx); + } + }; + + /// Specialization for when FlagOp does not have a third index param + template + struct ApplyOp + { + // Apply flag operator + static __device__ __forceinline__ bool Flag(FlagOp flag_op, const T &a, const T &b, int idx) + { + return flag_op(a, b); + } + }; + + + /****************************************************************************** + * Thread fields + ******************************************************************************/ + + /// Shared storage reference + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + + +public: + + /// \smemstorage{BlockDiscontinuity} + struct TempStorage : Uninitialized<_TempStorage> {}; + + + /******************************************************************//** + * \name Collective constructors + *********************************************************************/ + //@{ + + /** + * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockDiscontinuity() + : + temp_storage(PrivateStorage()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockDiscontinuity( + TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage + : + temp_storage(temp_storage.Alias()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier + */ + __device__ __forceinline__ BlockDiscontinuity( + int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(PrivateStorage()), + linear_tid(linear_tid) + {} + + + /** + * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. 
+ */ + __device__ __forceinline__ BlockDiscontinuity( + TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage + int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + + + //@} end member group + /******************************************************************//** + * \name Head flag operations + *********************************************************************/ + //@{ + + + /** + * \brief Sets head flags indicating discontinuities between items partitioned across the thread block, for which the first item has no reference and is always flagged. + * + * The flag head_flagsi is set for item + * inputi when + * flag_op(previous-item, inputi) + * returns \p true (where previous-item is either the preceding item + * in the same thread or the last item in the previous thread). + * Furthermore, head_flagsi is always set for + * input>0 in thread0. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates the head-flagging of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockDiscontinuity for 128 threads on type int + * typedef cub::BlockDiscontinuity BlockDiscontinuity; + * + * // Allocate shared memory for BlockDiscontinuity + * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute head flags for discontinuities in the segment + * int head_flags[4]; + * BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality()); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is + * { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }. + * The corresponding output \p head_flags in those threads will be + * { [1,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam FlagT [inferred] The flag type (must be an integer type) + * \tparam FlagOp [inferred] Binary predicate functor type having member T operator()(const T &a, const T &b) or member T operator()(const T &a, const T &b, unsigned int b_index), and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data. + */ + template < + int ITEMS_PER_THREAD, + typename FlagT, + typename FlagOp> + __device__ __forceinline__ void FlagHeads( + FlagT (&head_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity head_flags + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + FlagOp flag_op) ///< [in] Binary boolean flag predicate + { + // Share last item + temp_storage[linear_tid] = input[ITEMS_PER_THREAD - 1]; + + __syncthreads(); + + // Set flag for first item + head_flags[0] = (linear_tid == 0) ? 
+ 1 : // First thread + ApplyOp ::Flag( + flag_op, + temp_storage[linear_tid - 1], + input[0], + linear_tid * ITEMS_PER_THREAD); + + // Set head_flags for remaining items + #pragma unroll + for (int ITEM = 1; ITEM < ITEMS_PER_THREAD; ITEM++) + { + head_flags[ITEM] = ApplyOp ::Flag( + flag_op, + input[ITEM - 1], + input[ITEM], + (linear_tid * ITEMS_PER_THREAD) + ITEM); + } + } + + + /** + * \brief Sets head flags indicating discontinuities between items partitioned across the thread block. + * + * The flag head_flagsi is set for item + * inputi when + * flag_op(previous-item, inputi) + * returns \p true (where previous-item is either the preceding item + * in the same thread or the last item in the previous thread). + * For thread0, item input0 is compared + * against \p tile_predecessor_item. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates the head-flagging of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockDiscontinuity for 128 threads on type int + * typedef cub::BlockDiscontinuity BlockDiscontinuity; + * + * // Allocate shared memory for BlockDiscontinuity + * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Have thread0 obtain the predecessor item for the entire tile + * int tile_predecessor_item; + * if (threadIdx.x == 0) tile_predecessor_item == ... + * + * // Collectively compute head flags for discontinuities in the segment + * int head_flags[4]; + * BlockDiscontinuity(temp_storage).FlagHeads( + * head_flags, thread_data, cub::Inequality(), tile_predecessor_item); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is + * { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }, + * and that \p tile_predecessor_item is \p 0. The corresponding output \p head_flags in those threads will be + * { [0,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam FlagT [inferred] The flag type (must be an integer type) + * \tparam FlagOp [inferred] Binary predicate functor type having member T operator()(const T &a, const T &b) or member T operator()(const T &a, const T &b, unsigned int b_index), and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data. + */ + template < + int ITEMS_PER_THREAD, + typename FlagT, + typename FlagOp> + __device__ __forceinline__ void FlagHeads( + FlagT (&head_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity head_flags + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + FlagOp flag_op, ///< [in] Binary boolean flag predicate + T tile_predecessor_item) ///< [in] [thread0 only] Item with which to compare the first tile item (input0 from thread0). + { + // Share last item + temp_storage[linear_tid] = input[ITEMS_PER_THREAD - 1]; + + __syncthreads(); + + // Set flag for first item + int predecessor = (linear_tid == 0) ? 
+ tile_predecessor_item : // First thread + temp_storage[linear_tid - 1]; + + head_flags[0] = ApplyOp ::Flag( + flag_op, + predecessor, + input[0], + linear_tid * ITEMS_PER_THREAD); + + // Set flag for remaining items + #pragma unroll + for (int ITEM = 1; ITEM < ITEMS_PER_THREAD; ITEM++) + { + head_flags[ITEM] = ApplyOp ::Flag( + flag_op, + input[ITEM - 1], + input[ITEM], + (linear_tid * ITEMS_PER_THREAD) + ITEM); + } + } + + + //@} end member group + /******************************************************************//** + * \name Tail flag operations + *********************************************************************/ + //@{ + + + /** + * \brief Sets tail flags indicating discontinuities between items partitioned across the thread block, for which the last item has no reference and is always flagged. + * + * The flag tail_flagsi is set for item + * inputi when + * flag_op(inputi, next-item) + * returns \p true (where next-item is either the next item + * in the same thread or the first item in the next thread). + * Furthermore, tail_flagsITEMS_PER_THREAD-1 is always + * set for threadBLOCK_THREADS-1. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates the tail-flagging of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockDiscontinuity for 128 threads on type int + * typedef cub::BlockDiscontinuity BlockDiscontinuity; + * + * // Allocate shared memory for BlockDiscontinuity + * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute tail flags for discontinuities in the segment + * int tail_flags[4]; + * BlockDiscontinuity(temp_storage).FlagTails(tail_flags, thread_data, cub::Inequality()); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is + * { [0,0,1,1], [1,1,1,1], [2,3,3,3], ..., [124,125,125,125] }. + * The corresponding output \p tail_flags in those threads will be + * { [0,1,0,0], [0,0,0,1], [1,0,0,...], ..., [1,0,0,1] }. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam FlagT [inferred] The flag type (must be an integer type) + * \tparam FlagOp [inferred] Binary predicate functor type having member T operator()(const T &a, const T &b) or member T operator()(const T &a, const T &b, unsigned int b_index), and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data. + */ + template < + int ITEMS_PER_THREAD, + typename FlagT, + typename FlagOp> + __device__ __forceinline__ void FlagTails( + FlagT (&tail_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity tail_flags + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + FlagOp flag_op) ///< [in] Binary boolean flag predicate + { + // Share first item + temp_storage[linear_tid] = input[0]; + + __syncthreads(); + + // Set flag for last item + tail_flags[ITEMS_PER_THREAD - 1] = (linear_tid == BLOCK_THREADS - 1) ? 
+ 1 : // Last thread + ApplyOp ::Flag( + flag_op, + input[ITEMS_PER_THREAD - 1], + temp_storage[linear_tid + 1], + (linear_tid * ITEMS_PER_THREAD) + (ITEMS_PER_THREAD - 1)); + + // Set flags for remaining items + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD - 1; ITEM++) + { + tail_flags[ITEM] = ApplyOp ::Flag( + flag_op, + input[ITEM], + input[ITEM + 1], + (linear_tid * ITEMS_PER_THREAD) + ITEM); + } + } + + + /** + * \brief Sets tail flags indicating discontinuities between items partitioned across the thread block. + * + * The flag tail_flagsi is set for item + * inputi when + * flag_op(inputi, next-item) + * returns \p true (where next-item is either the next item + * in the same thread or the first item in the next thread). + * For threadBLOCK_THREADS-1, item + * inputITEMS_PER_THREAD-1 is compared + * against \p tile_predecessor_item. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates the tail-flagging of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockDiscontinuity for 128 threads on type int + * typedef cub::BlockDiscontinuity BlockDiscontinuity; + * + * // Allocate shared memory for BlockDiscontinuity + * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Have thread127 obtain the successor item for the entire tile + * int tile_successor_item; + * if (threadIdx.x == 127) tile_successor_item == ... + * + * // Collectively compute tail flags for discontinuities in the segment + * int tail_flags[4]; + * BlockDiscontinuity(temp_storage).FlagTails( + * tail_flags, thread_data, cub::Inequality(), tile_successor_item); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is + * { [0,0,1,1], [1,1,1,1], [2,3,3,3], ..., [124,125,125,125] } + * and that \p tile_successor_item is \p 125. The corresponding output \p tail_flags in those threads will be + * { [0,1,0,0], [0,0,0,1], [1,0,0,...], ..., [1,0,0,0] }. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam FlagT [inferred] The flag type (must be an integer type) + * \tparam FlagOp [inferred] Binary predicate functor type having member T operator()(const T &a, const T &b) or member T operator()(const T &a, const T &b, unsigned int b_index), and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data. + */ + template < + int ITEMS_PER_THREAD, + typename FlagT, + typename FlagOp> + __device__ __forceinline__ void FlagTails( + FlagT (&tail_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity tail_flags + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + FlagOp flag_op, ///< [in] Binary boolean flag predicate + T tile_successor_item) ///< [in] [threadBLOCK_THREADS-1 only] Item with which to compare the last tile item (inputITEMS_PER_THREAD-1 from threadBLOCK_THREADS-1). + { + // Share first item + temp_storage[linear_tid] = input[0]; + + __syncthreads(); + + // Set flag for last item + int successor_item = (linear_tid == BLOCK_THREADS - 1) ? 
+ tile_successor_item : // Last thread + temp_storage[linear_tid + 1]; + + tail_flags[ITEMS_PER_THREAD - 1] = ApplyOp ::Flag( + flag_op, + input[ITEMS_PER_THREAD - 1], + successor_item, + (linear_tid * ITEMS_PER_THREAD) + (ITEMS_PER_THREAD - 1)); + + // Set flags for remaining items + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD - 1; ITEM++) + { + tail_flags[ITEM] = ApplyOp ::Flag( + flag_op, + input[ITEM], + input[ITEM + 1], + (linear_tid * ITEMS_PER_THREAD) + ITEM); + } + } + + //@} end member group + +}; + + +} // CUB namespace +CUB_NS_POSTFIX // Optional outer namespace(s) diff --git a/lib/kokkos/TPL/cub/block/block_exchange.cuh b/lib/kokkos/TPL/cub/block/block_exchange.cuh new file mode 100644 index 0000000000..b7b95343b5 --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_exchange.cuh @@ -0,0 +1,918 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * The cub::BlockExchange class provides [collective](index.html#sec0) methods for rearranging data partitioned across a CUDA thread block. + */ + +#pragma once + +#include "../util_arch.cuh" +#include "../util_macro.cuh" +#include "../util_type.cuh" +#include "../util_namespace.cuh" + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + +/** + * \brief The BlockExchange class provides [collective](index.html#sec0) methods for rearranging data partitioned across a CUDA thread block.  + * \ingroup BlockModule + * + * \par Overview + * It is commonplace for blocks of threads to rearrange data items between + * threads. 
For example, the global memory subsystem prefers access patterns + * where data items are "striped" across threads (where consecutive threads access consecutive items), + * yet most block-wide operations prefer a "blocked" partitioning of items across threads + * (where consecutive items belong to a single thread). + * + * \par + * BlockExchange supports the following types of data exchanges: + * - Transposing between [blocked](index.html#sec5sec4) and [striped](index.html#sec5sec4) arrangements + * - Transposing between [blocked](index.html#sec5sec4) and [warp-striped](index.html#sec5sec4) arrangements + * - Scattering ranked items to a [blocked arrangement](index.html#sec5sec4) + * - Scattering ranked items to a [striped arrangement](index.html#sec5sec4) + * + * \tparam T The data type to be exchanged. + * \tparam BLOCK_THREADS The thread block size in threads. + * \tparam ITEMS_PER_THREAD The number of items partitioned onto each thread. + * \tparam WARP_TIME_SLICING [optional] When \p true, only use enough shared memory for a single warp's worth of tile data, time-slicing the block-wide exchange over multiple synchronized rounds. Yields a smaller memory footprint at the expense of decreased parallelism. (Default: false) + * + * \par A Simple Example + * \blockcollective{BlockExchange} + * \par + * The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement + * of 512 integer items partitioned across 128 threads where each thread owns 4 items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, ...) + * { + * // Specialize BlockExchange for 128 threads owning 4 integer items each + * typedef cub::BlockExchange BlockExchange; + * + * // Allocate shared memory for BlockExchange + * __shared__ typename BlockExchange::TempStorage temp_storage; + * + * // Load a tile of data striped across threads + * int thread_data[4]; + * cub::LoadStriped (threadIdx.x, d_data, thread_data); + * + * // Collectively exchange data into a blocked arrangement across threads + * BlockExchange(temp_storage).StripedToBlocked(thread_data); + * + * \endcode + * \par + * Suppose the set of striped input \p thread_data across the block of threads is + * { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }. + * The corresponding output \p thread_data in those threads will be + * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. + * + * \par Performance Considerations + * - Proper device-specific padding ensures zero bank conflicts for most types. + * + */ +template < + typename T, + int BLOCK_THREADS, + int ITEMS_PER_THREAD, + bool WARP_TIME_SLICING = false> +class BlockExchange +{ +private: + + /****************************************************************************** + * Constants + ******************************************************************************/ + + enum + { + LOG_WARP_THREADS = PtxArchProps::LOG_WARP_THREADS, + WARP_THREADS = 1 << LOG_WARP_THREADS, + WARPS = (BLOCK_THREADS + PtxArchProps::WARP_THREADS - 1) / PtxArchProps::WARP_THREADS, + + LOG_SMEM_BANKS = PtxArchProps::LOG_SMEM_BANKS, + SMEM_BANKS = 1 << LOG_SMEM_BANKS, + + TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD, + + TIME_SLICES = (WARP_TIME_SLICING) ? WARPS : 1, + + TIME_SLICED_THREADS = (WARP_TIME_SLICING) ? 
CUB_MIN(BLOCK_THREADS, WARP_THREADS) : BLOCK_THREADS, + TIME_SLICED_ITEMS = TIME_SLICED_THREADS * ITEMS_PER_THREAD, + + WARP_TIME_SLICED_THREADS = CUB_MIN(BLOCK_THREADS, WARP_THREADS), + WARP_TIME_SLICED_ITEMS = WARP_TIME_SLICED_THREADS * ITEMS_PER_THREAD, + + // Insert padding if the number of items per thread is a power of two + INSERT_PADDING = ((ITEMS_PER_THREAD & (ITEMS_PER_THREAD - 1)) == 0), + PADDING_ITEMS = (INSERT_PADDING) ? (TIME_SLICED_ITEMS >> LOG_SMEM_BANKS) : 0, + }; + + /****************************************************************************** + * Type definitions + ******************************************************************************/ + + /// Shared memory storage layout type + typedef T _TempStorage[TIME_SLICED_ITEMS + PADDING_ITEMS]; + +public: + + /// \smemstorage{BlockExchange} + struct TempStorage : Uninitialized<_TempStorage> {}; + +private: + + + /****************************************************************************** + * Thread fields + ******************************************************************************/ + + /// Shared storage reference + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + int warp_lane; + int warp_id; + int warp_offset; + + + /****************************************************************************** + * Utility methods + ******************************************************************************/ + + /// Internal storage allocator + __device__ __forceinline__ _TempStorage& PrivateStorage() + { + __shared__ _TempStorage private_storage; + return private_storage; + } + + + /** + * Transposes data items from blocked arrangement to striped arrangement. Specialized for no timeslicing. + */ + __device__ __forceinline__ void BlockedToStriped( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between blocked and striped arrangements. + Int2Type time_slicing) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_storage[item_offset] = items[ITEM]; + } + + __syncthreads(); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + items[ITEM] = temp_storage[item_offset]; + } + } + + + /** + * Transposes data items from blocked arrangement to striped arrangement. Specialized for warp-timeslicing. + */ + __device__ __forceinline__ void BlockedToStriped( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between blocked and striped arrangements. 
+ Int2Type time_slicing) + { + T temp_items[ITEMS_PER_THREAD]; + + #pragma unroll + for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++) + { + const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS; + const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS; + + __syncthreads(); + + if (warp_id == SLICE) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_storage[item_offset] = items[ITEM]; + } + } + + __syncthreads(); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + // Read a strip of items + const int STRIP_OFFSET = ITEM * BLOCK_THREADS; + const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS; + + if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET)) + { + int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET; + if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS)) + { + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_items[ITEM] = temp_storage[item_offset]; + } + } + } + } + + // Copy + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = temp_items[ITEM]; + } + } + + + /** + * Transposes data items from blocked arrangement to warp-striped arrangement. Specialized for no timeslicing + */ + __device__ __forceinline__ void BlockedToWarpStriped( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between blocked and warp-striped arrangements. + Int2Type time_slicing) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = warp_offset + ITEM + (warp_lane * ITEMS_PER_THREAD); + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_storage[item_offset] = items[ITEM]; + } + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = warp_offset + (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + items[ITEM] = temp_storage[item_offset]; + } + } + + /** + * Transposes data items from blocked arrangement to warp-striped arrangement. Specialized for warp-timeslicing + */ + __device__ __forceinline__ void BlockedToWarpStriped( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between blocked and warp-striped arrangements. + Int2Type time_slicing) + { + #pragma unroll + for (int SLICE = 0; SLICE < TIME_SLICES; ++SLICE) + { + __syncthreads(); + + if (warp_id == SLICE) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = ITEM + (warp_lane * ITEMS_PER_THREAD); + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_storage[item_offset] = items[ITEM]; + } + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + items[ITEM] = temp_storage[item_offset]; + } + } + } + } + + + /** + * Transposes data items from striped arrangement to blocked arrangement. Specialized for no timeslicing. + */ + __device__ __forceinline__ void StripedToBlocked( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between striped and blocked arrangements. 
+ Int2Type time_slicing) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_storage[item_offset] = items[ITEM]; + } + + __syncthreads(); + + // No timeslicing + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + items[ITEM] = temp_storage[item_offset]; + } + } + + + /** + * Transposes data items from striped arrangement to blocked arrangement. Specialized for warp-timeslicing. + */ + __device__ __forceinline__ void StripedToBlocked( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between striped and blocked arrangements. + Int2Type time_slicing) + { + // Warp time-slicing + T temp_items[ITEMS_PER_THREAD]; + + #pragma unroll + for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++) + { + const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS; + const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS; + + __syncthreads(); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + // Write a strip of items + const int STRIP_OFFSET = ITEM * BLOCK_THREADS; + const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS; + + if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET)) + { + int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET; + if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS)) + { + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_storage[item_offset] = items[ITEM]; + } + } + } + + __syncthreads(); + + if (warp_id == SLICE) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_items[ITEM] = temp_storage[item_offset]; + } + } + } + + // Copy + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = temp_items[ITEM]; + } + } + + + /** + * Transposes data items from warp-striped arrangement to blocked arrangement. Specialized for no timeslicing + */ + __device__ __forceinline__ void WarpStripedToBlocked( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between warp-striped and blocked arrangements. + Int2Type time_slicing) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = warp_offset + (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_storage[item_offset] = items[ITEM]; + } + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = warp_offset + ITEM + (warp_lane * ITEMS_PER_THREAD); + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + items[ITEM] = temp_storage[item_offset]; + } + } + + + /** + * Transposes data items from warp-striped arrangement to blocked arrangement. Specialized for warp-timeslicing + */ + __device__ __forceinline__ void WarpStripedToBlocked( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between warp-striped and blocked arrangements. 
+ Int2Type time_slicing) + { + #pragma unroll + for (int SLICE = 0; SLICE < TIME_SLICES; ++SLICE) + { + __syncthreads(); + + if (warp_id == SLICE) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane; + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_storage[item_offset] = items[ITEM]; + } + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = ITEM + (warp_lane * ITEMS_PER_THREAD); + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + items[ITEM] = temp_storage[item_offset]; + } + } + } + } + + + /** + * Exchanges data items annotated by rank into blocked arrangement. Specialized for no timeslicing. + */ + __device__ __forceinline__ void ScatterToBlocked( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange + int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks + Int2Type time_slicing) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = ranks[ITEM]; + if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); + temp_storage[item_offset] = items[ITEM]; + } + + __syncthreads(); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM; + if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); + items[ITEM] = temp_storage[item_offset]; + } + } + + /** + * Exchanges data items annotated by rank into blocked arrangement. Specialized for warp-timeslicing. + */ + __device__ __forceinline__ void ScatterToBlocked( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange + int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks + Int2Type time_slicing) + { + T temp_items[ITEMS_PER_THREAD]; + + #pragma unroll + for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++) + { + __syncthreads(); + + const int SLICE_OFFSET = TIME_SLICED_ITEMS * SLICE; + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = ranks[ITEM] - SLICE_OFFSET; + if ((item_offset >= 0) && (item_offset < WARP_TIME_SLICED_ITEMS)) + { + if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); + temp_storage[item_offset] = items[ITEM]; + } + } + + __syncthreads(); + + if (warp_id == SLICE) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM; + if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); + temp_items[ITEM] = temp_storage[item_offset]; + } + } + } + + // Copy + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = temp_items[ITEM]; + } + } + + + /** + * Exchanges data items annotated by rank into striped arrangement. Specialized for no timeslicing. 
+ */ + __device__ __forceinline__ void ScatterToStriped( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange + int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks + Int2Type time_slicing) + { + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = ranks[ITEM]; + if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); + temp_storage[item_offset] = items[ITEM]; + } + + __syncthreads(); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid; + if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); + items[ITEM] = temp_storage[item_offset]; + } + } + + + /** + * Exchanges data items annotated by rank into striped arrangement. Specialized for warp-timeslicing. + */ + __device__ __forceinline__ void ScatterToStriped( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange + int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks + Int2Type time_slicing) + { + T temp_items[ITEMS_PER_THREAD]; + + #pragma unroll + for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++) + { + const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS; + const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS; + + __syncthreads(); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + int item_offset = ranks[ITEM] - SLICE_OFFSET; + if ((item_offset >= 0) && (item_offset < WARP_TIME_SLICED_ITEMS)) + { + if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); + temp_storage[item_offset] = items[ITEM]; + } + } + + __syncthreads(); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + // Read a strip of items + const int STRIP_OFFSET = ITEM * BLOCK_THREADS; + const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS; + + if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET)) + { + int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET; + if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS)) + { + if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; + temp_items[ITEM] = temp_storage[item_offset]; + } + } + } + } + + // Copy + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = temp_items[ITEM]; + } + } + + +public: + + /******************************************************************//** + * \name Collective constructors + *********************************************************************/ + //@{ + + /** + * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockExchange() + : + temp_storage(PrivateStorage()), + linear_tid(threadIdx.x), + warp_lane(linear_tid & (WARP_THREADS - 1)), + warp_id(linear_tid >> LOG_WARP_THREADS), + warp_offset(warp_id * WARP_TIME_SLICED_ITEMS) + {} + + + /** + * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. 
+ */ + __device__ __forceinline__ BlockExchange( + TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage + : + temp_storage(temp_storage.Alias()), + linear_tid(threadIdx.x), + warp_lane(linear_tid & (WARP_THREADS - 1)), + warp_id(linear_tid >> LOG_WARP_THREADS), + warp_offset(warp_id * WARP_TIME_SLICED_ITEMS) + {} + + + /** + * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier + */ + __device__ __forceinline__ BlockExchange( + int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(PrivateStorage()), + linear_tid(linear_tid), + warp_lane(linear_tid & (WARP_THREADS - 1)), + warp_id(linear_tid >> LOG_WARP_THREADS), + warp_offset(warp_id * WARP_TIME_SLICED_ITEMS) + {} + + + /** + * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. + */ + __device__ __forceinline__ BlockExchange( + TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage + int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid), + warp_lane(linear_tid & (WARP_THREADS - 1)), + warp_id(linear_tid >> LOG_WARP_THREADS), + warp_offset(warp_id * WARP_TIME_SLICED_ITEMS) + {} + + + //@} end member group + /******************************************************************//** + * \name Structured exchanges + *********************************************************************/ + //@{ + + /** + * \brief Transposes data items from striped arrangement to blocked arrangement. + * + * \smemreuse + * + * The code snippet below illustrates the conversion from a "striped" to a "blocked" arrangement + * of 512 integer items partitioned across 128 threads where each thread owns 4 items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, ...) + * { + * // Specialize BlockExchange for 128 threads owning 4 integer items each + * typedef cub::BlockExchange BlockExchange; + * + * // Allocate shared memory for BlockExchange + * __shared__ typename BlockExchange::TempStorage temp_storage; + * + * // Load a tile of ordered data into a striped arrangement across block threads + * int thread_data[4]; + * cub::LoadStriped (threadIdx.x, d_data, thread_data); + * + * // Collectively exchange data into a blocked arrangement across threads + * BlockExchange(temp_storage).StripedToBlocked(thread_data); + * + * \endcode + * \par + * Suppose the set of striped input \p thread_data across the block of threads is + * { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] } after loading from global memory. + * The corresponding output \p thread_data in those threads will be + * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. + * + */ + __device__ __forceinline__ void StripedToBlocked( + T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between striped and blocked arrangements. + { + StripedToBlocked(items, Int2Type ()); + } + + /** + * \brief Transposes data items from blocked arrangement to striped arrangement. 
+ * + * \smemreuse + * + * The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement + * of 512 integer items partitioned across 128 threads where each thread owns 4 items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, ...) + * { + * // Specialize BlockExchange for 128 threads owning 4 integer items each + * typedef cub::BlockExchange BlockExchange; + * + * // Allocate shared memory for BlockExchange + * __shared__ typename BlockExchange::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively exchange data into a striped arrangement across threads + * BlockExchange(temp_storage).BlockedToStriped(thread_data); + * + * // Store data striped across block threads into an ordered tile + * cub::StoreStriped (threadIdx.x, d_data, thread_data); + * + * \endcode + * \par + * Suppose the set of blocked input \p thread_data across the block of threads is + * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. + * The corresponding output \p thread_data in those threads will be + * { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] } in + * preparation for storing to global memory. + * + */ + __device__ __forceinline__ void BlockedToStriped( + T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between blocked and striped arrangements. + { + BlockedToStriped(items, Int2Type ()); + } + + + /** + * \brief Transposes data items from warp-striped arrangement to blocked arrangement. + * + * \smemreuse + * + * The code snippet below illustrates the conversion from a "warp-striped" to a "blocked" arrangement + * of 512 integer items partitioned across 128 threads where each thread owns 4 items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, ...) + * { + * // Specialize BlockExchange for 128 threads owning 4 integer items each + * typedef cub::BlockExchange BlockExchange; + * + * // Allocate shared memory for BlockExchange + * __shared__ typename BlockExchange::TempStorage temp_storage; + * + * // Load a tile of ordered data into a warp-striped arrangement across warp threads + * int thread_data[4]; + * cub::LoadSWarptriped (threadIdx.x, d_data, thread_data); + * + * // Collectively exchange data into a blocked arrangement across threads + * BlockExchange(temp_storage).WarpStripedToBlocked(thread_data); + * + * \endcode + * \par + * Suppose the set of warp-striped input \p thread_data across the block of threads is + * { [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] } + * after loading from global memory. (The first 128 items are striped across + * the first warp of 32 threads, the second 128 items are striped across the second warp, etc.) + * The corresponding output \p thread_data in those threads will be + * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. + * + */ + __device__ __forceinline__ void WarpStripedToBlocked( + T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between warp-striped and blocked arrangements. + { + WarpStripedToBlocked(items, Int2Type ()); + } + + /** + * \brief Transposes data items from blocked arrangement to warp-striped arrangement. + * + * \smemreuse + * + * The code snippet below illustrates the conversion from a "blocked" to a "warp-striped" arrangement + * of 512 integer items partitioned across 128 threads where each thread owns 4 items. 
+ * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, ...) + * { + * // Specialize BlockExchange for 128 threads owning 4 integer items each + * typedef cub::BlockExchange BlockExchange; + * + * // Allocate shared memory for BlockExchange + * __shared__ typename BlockExchange::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively exchange data into a warp-striped arrangement across threads + * BlockExchange(temp_storage).BlockedToWarpStriped(thread_data); + * + * // Store data striped across warp threads into an ordered tile + * cub::StoreStriped (threadIdx.x, d_data, thread_data); + * + * \endcode + * \par + * Suppose the set of blocked input \p thread_data across the block of threads is + * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. + * The corresponding output \p thread_data in those threads will be + * { [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] } + * in preparation for storing to global memory. (The first 128 items are striped across + * the first warp of 32 threads, the second 128 items are striped across the second warp, etc.) + * + */ + __device__ __forceinline__ void BlockedToWarpStriped( + T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between blocked and warp-striped arrangements. + { + BlockedToWarpStriped(items, Int2Type ()); + } + + + //@} end member group + /******************************************************************//** + * \name Scatter exchanges + *********************************************************************/ + //@{ + + + /** + * \brief Exchanges data items annotated by rank into blocked arrangement. + * + * \smemreuse + */ + __device__ __forceinline__ void ScatterToBlocked( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange + int ranks[ITEMS_PER_THREAD]) ///< [in] Corresponding scatter ranks + { + ScatterToBlocked(items, ranks, Int2Type ()); + } + + + /** + * \brief Exchanges data items annotated by rank into striped arrangement. + * + * \smemreuse + */ + __device__ __forceinline__ void ScatterToStriped( + T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange + int ranks[ITEMS_PER_THREAD]) ///< [in] Corresponding scatter ranks + { + ScatterToStriped(items, ranks, Int2Type ()); + } + + //@} end member group + + +}; + +} // CUB namespace +CUB_NS_POSTFIX // Optional outer namespace(s) + diff --git a/lib/kokkos/TPL/cub/block/block_histogram.cuh b/lib/kokkos/TPL/cub/block/block_histogram.cuh new file mode 100644 index 0000000000..dd346e3954 --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_histogram.cuh @@ -0,0 +1,414 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * The cub::BlockHistogram class provides [collective](index.html#sec0) methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. + */ + +#pragma once + +#include "specializations/block_histogram_sort.cuh" +#include "specializations/block_histogram_atomic.cuh" +#include "../util_arch.cuh" +#include "../util_namespace.cuh" + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + + +/****************************************************************************** + * Algorithmic variants + ******************************************************************************/ + +/** + * \brief BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms. + */ +enum BlockHistogramAlgorithm +{ + + /** + * \par Overview + * Sorting followed by differentiation. Execution is comprised of two phases: + * -# Sort the data using efficient radix sort + * -# Look for "runs" of same-valued keys by detecting discontinuities; the run-lengths are histogram bin counts. + * + * \par Performance Considerations + * Delivers consistent throughput regardless of sample bin distribution. + */ + BLOCK_HISTO_SORT, + + + /** + * \par Overview + * Use atomic addition to update byte counts directly + * + * \par Performance Considerations + * Performance is strongly tied to the hardware implementation of atomic + * addition, and may be significantly degraded for non uniformly-random + * input distributions where many concurrent updates are likely to be + * made to the same bin counter. + */ + BLOCK_HISTO_ATOMIC, +}; + + + +/****************************************************************************** + * Block histogram + ******************************************************************************/ + + +/** + * \brief The BlockHistogram class provides [collective](index.html#sec0) methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.  + * \ingroup BlockModule + * + * \par Overview + * A histogram + * counts the number of observations that fall into each of the disjoint categories (known as bins). + * + * \par + * Optionally, BlockHistogram can be specialized to use different algorithms: + * -# cub::BLOCK_HISTO_SORT. Sorting followed by differentiation. [More...](\ref cub::BlockHistogramAlgorithm) + * -# cub::BLOCK_HISTO_ATOMIC. 
Use atomic addition to update byte counts directly. [More...](\ref cub::BlockHistogramAlgorithm) + * + * \tparam T The sample type being histogrammed (must be castable to an integer bin identifier) + * \tparam BLOCK_THREADS The thread block size in threads + * \tparam ITEMS_PER_THREAD The number of items per thread + * \tparam BINS The number bins within the histogram + * \tparam ALGORITHM [optional] cub::BlockHistogramAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_HISTO_SORT) + * + * \par A Simple Example + * \blockcollective{BlockHistogram} + * \par + * The code snippet below illustrates a 256-bin histogram of 512 integer samples that + * are partitioned across 128 threads where each thread owns 4 samples. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each + * typedef cub::BlockHistogram BlockHistogram; + * + * // Allocate shared memory for BlockHistogram + * __shared__ typename BlockHistogram::TempStorage temp_storage; + * + * // Allocate shared memory for block-wide histogram bin counts + * __shared__ unsigned int smem_histogram[256]; + * + * // Obtain input samples per thread + * unsigned char data[4]; + * ... + * + * // Compute the block-wide histogram + * BlockHistogram(temp_storage).Histogram(data, smem_histogram); + * + * \endcode + * + * \par Performance and Usage Considerations + * - The histogram output can be constructed in shared or global memory + * - See cub::BlockHistogramAlgorithm for performance details regarding algorithmic alternatives + * + */ +template < + typename T, + int BLOCK_THREADS, + int ITEMS_PER_THREAD, + int BINS, + BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT> +class BlockHistogram +{ +private: + + /****************************************************************************** + * Constants and type definitions + ******************************************************************************/ + + /** + * Ensure the template parameterization meets the requirements of the + * targeted device architecture. BLOCK_HISTO_ATOMIC can only be used + * on version SM120 or later. Otherwise BLOCK_HISTO_SORT is used + * regardless. + */ + static const BlockHistogramAlgorithm SAFE_ALGORITHM = + ((ALGORITHM == BLOCK_HISTO_ATOMIC) && (CUB_PTX_ARCH < 120)) ? + BLOCK_HISTO_SORT : + ALGORITHM; + + /// Internal specialization. 
+ typedef typename If<(SAFE_ALGORITHM == BLOCK_HISTO_SORT), + BlockHistogramSort , + BlockHistogramAtomic >::Type InternalBlockHistogram; + + /// Shared memory storage layout type for BlockHistogram + typedef typename InternalBlockHistogram::TempStorage _TempStorage; + + + /****************************************************************************** + * Thread fields + ******************************************************************************/ + + /// Shared storage reference + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + + + /****************************************************************************** + * Utility methods + ******************************************************************************/ + + /// Internal storage allocator + __device__ __forceinline__ _TempStorage& PrivateStorage() + { + __shared__ _TempStorage private_storage; + return private_storage; + } + + +public: + + /// \smemstorage{BlockHistogram} + struct TempStorage : Uninitialized<_TempStorage> {}; + + + /******************************************************************//** + * \name Collective constructors + *********************************************************************/ + //@{ + + /** + * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockHistogram() + : + temp_storage(PrivateStorage()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockHistogram( + TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage + : + temp_storage(temp_storage.Alias()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier + */ + __device__ __forceinline__ BlockHistogram( + int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(PrivateStorage()), + linear_tid(linear_tid) + {} + + + /** + * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. + */ + __device__ __forceinline__ BlockHistogram( + TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage + int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + + //@} end member group + /******************************************************************//** + * \name Histogram operations + *********************************************************************/ + //@{ + + + /** + * \brief Initialize the shared histogram counters to zero. + * + * The code snippet below illustrates a the initialization and update of a + * histogram of 512 integer samples that are partitioned across 128 threads + * where each thread owns 4 samples. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) 
+ * { + * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each + * typedef cub::BlockHistogram BlockHistogram; + * + * // Allocate shared memory for BlockHistogram + * __shared__ typename BlockHistogram::TempStorage temp_storage; + * + * // Allocate shared memory for block-wide histogram bin counts + * __shared__ unsigned int smem_histogram[256]; + * + * // Obtain input samples per thread + * unsigned char thread_samples[4]; + * ... + * + * // Initialize the block-wide histogram + * BlockHistogram(temp_storage).InitHistogram(smem_histogram); + * + * // Update the block-wide histogram + * BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram); + * + * \endcode + * + * \tparam HistoCounter [inferred] Histogram counter type + */ + template + __device__ __forceinline__ void InitHistogram(HistoCounter histogram[BINS]) + { + // Initialize histogram bin counts to zeros + int histo_offset = 0; + + #pragma unroll + for(; histo_offset + BLOCK_THREADS <= BINS; histo_offset += BLOCK_THREADS) + { + histogram[histo_offset + linear_tid] = 0; + } + // Finish up with guarded initialization if necessary + if ((BINS % BLOCK_THREADS != 0) && (histo_offset + linear_tid < BINS)) + { + histogram[histo_offset + linear_tid] = 0; + } + } + + + /** + * \brief Constructs a block-wide histogram in shared/global memory. Each thread contributes an array of input elements. + * + * \smemreuse + * + * The code snippet below illustrates a 256-bin histogram of 512 integer samples that + * are partitioned across 128 threads where each thread owns 4 samples. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each + * typedef cub::BlockHistogram BlockHistogram; + * + * // Allocate shared memory for BlockHistogram + * __shared__ typename BlockHistogram::TempStorage temp_storage; + * + * // Allocate shared memory for block-wide histogram bin counts + * __shared__ unsigned int smem_histogram[256]; + * + * // Obtain input samples per thread + * unsigned char thread_samples[4]; + * ... + * + * // Compute the block-wide histogram + * BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram); + * + * \endcode + * + * \tparam HistoCounter [inferred] Histogram counter type + */ + template < + typename HistoCounter> + __device__ __forceinline__ void Histogram( + T (&items)[ITEMS_PER_THREAD], ///< [in] Calling thread's input values to histogram + HistoCounter histogram[BINS]) ///< [out] Reference to shared/global memory histogram + { + // Initialize histogram bin counts to zeros + InitHistogram(histogram); + + // Composite the histogram + InternalBlockHistogram(temp_storage, linear_tid).Composite(items, histogram); + } + + + + /** + * \brief Updates an existing block-wide histogram in shared/global memory. Each thread composites an array of input elements. + * + * \smemreuse + * + * The code snippet below illustrates a the initialization and update of a + * histogram of 512 integer samples that are partitioned across 128 threads + * where each thread owns 4 samples. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) 
+ * { + * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each + * typedef cub::BlockHistogram BlockHistogram; + * + * // Allocate shared memory for BlockHistogram + * __shared__ typename BlockHistogram::TempStorage temp_storage; + * + * // Allocate shared memory for block-wide histogram bin counts + * __shared__ unsigned int smem_histogram[256]; + * + * // Obtain input samples per thread + * unsigned char thread_samples[4]; + * ... + * + * // Initialize the block-wide histogram + * BlockHistogram(temp_storage).InitHistogram(smem_histogram); + * + * // Update the block-wide histogram + * BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram); + * + * \endcode + * + * \tparam HistoCounter [inferred] Histogram counter type + */ + template < + typename HistoCounter> + __device__ __forceinline__ void Composite( + T (&items)[ITEMS_PER_THREAD], ///< [in] Calling thread's input values to histogram + HistoCounter histogram[BINS]) ///< [out] Reference to shared/global memory histogram + { + InternalBlockHistogram(temp_storage, linear_tid).Composite(items, histogram); + } + +}; + +} // CUB namespace +CUB_NS_POSTFIX // Optional outer namespace(s) + diff --git a/lib/kokkos/TPL/cub/block/block_load.cuh b/lib/kokkos/TPL/cub/block/block_load.cuh new file mode 100644 index 0000000000..e645bcdce9 --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_load.cuh @@ -0,0 +1,1122 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * Operations for reading linear tiles of data into the CUDA thread block. 
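[Editor's sketch] The InitHistogram()/Composite() pair defined at the end of block_histogram.cuh above supports accumulating over several input tiles rather than a single one-shot Histogram() call. A minimal sketch of that pattern, under the same assumptions as before (bundled headers as <cub/cub.cuh>, 128-thread launch, placeholder names); the synchronization points are conservative choices, not prescribed by the header.

#include <cub/cub.cuh>   // assumed include path for the bundled CUB TPL

// Initialize the bins once, then Composite() each tile into the same counts.
__global__ void MultiTileHistogramKernel(const unsigned char *d_samples,
                                         unsigned int *d_histogram,
                                         int num_tiles)
{
    typedef cub::BlockHistogram<unsigned char, 128, 4, 256> BlockHistogram;

    __shared__ typename BlockHistogram::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];

    BlockHistogram histo(temp_storage);
    histo.InitHistogram(smem_histogram);            // zero the bins once

    for (int tile = 0; tile < num_tiles; ++tile)
    {
        // Blocked arrangement within this tile
        unsigned char data[4];
        int tile_base = (blockIdx.x * num_tiles + tile) * 128 * 4;
        for (int i = 0; i < 4; ++i)
            data[i] = d_samples[tile_base + threadIdx.x * 4 + i];

        __syncthreads();                            // previous tile's updates are done
        histo.Composite(data, smem_histogram);      // add this tile's 512 samples
    }

    __syncthreads();
    for (int bin = threadIdx.x; bin < 256; bin += 128)
        d_histogram[blockIdx.x * 256 + bin] = smem_histogram[bin];
}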
+ */ + +#pragma once + +#include + +#include "../util_namespace.cuh" +#include "../util_macro.cuh" +#include "../util_type.cuh" +#include "../util_vector.cuh" +#include "../thread/thread_load.cuh" +#include "block_exchange.cuh" + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + +/** + * \addtogroup IoModule + * @{ + */ + + +/******************************************************************//** + * \name Blocked I/O + *********************************************************************/ +//@{ + + +/** + * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier. + * + * \blocked + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). + */ +template < + PtxLoadModifier MODIFIER, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadBlocked( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load +{ + // Load directly in thread-blocked order + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = ThreadLoad (block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM); + } +} + + +/** + * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier, guarded by range. + * + * \blocked + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). + */ +template < + PtxLoadModifier MODIFIER, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadBlocked( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items) ///< [in] Number of valid items to load +{ + int bounds = valid_items - (linear_tid * ITEMS_PER_THREAD); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + if (ITEM < bounds) + { + items[ITEM] = ThreadLoad (block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM); + } + } +} + + +/** + * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements.. + * + * \blocked + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). 
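[Editor's sketch] The LoadBlocked() overloads above give each thread a contiguous run of ITEMS_PER_THREAD elements; the guarded overload simply skips slots past valid_items. A minimal usage sketch, assuming a 128-thread block, 4 items per thread, and placeholder names (d_in, d_out, num_valid):

#include <cub/cub.cuh>   // assumed include path for the bundled CUB TPL

// Blocked arrangement: thread i owns elements [4*i, 4*i+3] of the block's tile.
__global__ void LoadBlockedKernel(const int *d_in, int num_valid, int *d_out)
{
    const int *block_in = d_in + blockIdx.x * 128 * 4;

    // Pre-initialize so that slots skipped by the guarded load are defined
    int items[4] = {0, 0, 0, 0};

    // Guarded variant: threads whose segment extends past num_valid
    // leave those slots untouched (here, zero).
    cub::LoadBlocked<cub::LOAD_DEFAULT>(threadIdx.x, block_in, items, num_valid);

    int sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += items[i];
    d_out[blockIdx.x * 128 + threadIdx.x] = sum;
}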
+ */ +template < + PtxLoadModifier MODIFIER, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadBlocked( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items, ///< [in] Number of valid items to load + T oob_default) ///< [in] Default value to assign out-of-bound items +{ + int bounds = valid_items - (linear_tid * ITEMS_PER_THREAD); + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = (ITEM < bounds) ? + ThreadLoad (block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM) : + oob_default; + } +} + + + +//@} end member group +/******************************************************************//** + * \name Striped I/O + *********************************************************************/ +//@{ + + +/** + * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier. + * + * \striped + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam BLOCK_THREADS The thread block size in threads + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). + */ +template < + PtxLoadModifier MODIFIER, + int BLOCK_THREADS, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadStriped( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load +{ + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = ThreadLoad (block_itr + (ITEM * BLOCK_THREADS) + linear_tid); + } +} + + +/** + * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier, guarded by range + * + * \striped + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam BLOCK_THREADS The thread block size in threads + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). 
+ */ +template < + PtxLoadModifier MODIFIER, + int BLOCK_THREADS, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadStriped( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items) ///< [in] Number of valid items to load +{ + int bounds = valid_items - linear_tid; + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + if (ITEM * BLOCK_THREADS < bounds) + { + items[ITEM] = ThreadLoad (block_itr + linear_tid + (ITEM * BLOCK_THREADS)); + } + } +} + + +/** + * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements. + * + * \striped + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam BLOCK_THREADS The thread block size in threads + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). + */ +template < + PtxLoadModifier MODIFIER, + int BLOCK_THREADS, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadStriped( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items, ///< [in] Number of valid items to load + T oob_default) ///< [in] Default value to assign out-of-bound items +{ + int bounds = valid_items - linear_tid; + + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = (ITEM * BLOCK_THREADS < bounds) ? + ThreadLoad (block_itr + linear_tid + (ITEM * BLOCK_THREADS)) : + oob_default; + } +} + + + +//@} end member group +/******************************************************************//** + * \name Warp-striped I/O + *********************************************************************/ +//@{ + + +/** + * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier. + * + * \warpstriped + * + * \par Usage Considerations + * The number of threads in the thread block must be a multiple of the architecture's warp size. + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). 
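[Editor's sketch] In the striped arrangement used by LoadStriped() above, thread i reads elements i, i+BLOCK_THREADS, i+2*BLOCK_THREADS, ..., which keeps global loads coalesced no matter how many items each thread owns. A minimal sketch of the range-guarded overload with an out-of-bounds default, assuming 128 threads, 4 items per thread, and placeholder names:

#include <cub/cub.cuh>   // assumed include path for the bundled CUB TPL

__global__ void LoadStripedKernel(const int *d_in, int valid_items, int *d_out)
{
    const int *block_in = d_in + blockIdx.x * 128 * 4;
    int       *block_out = d_out + blockIdx.x * 128 * 4;

    // Thread i receives elements i, i+128, i+256, i+384 of the tile;
    // slots beyond valid_items are filled with -1.
    int items[4];
    cub::LoadStriped<cub::LOAD_DEFAULT, 128>(threadIdx.x, block_in, items,
                                             valid_items, -1);

    // Pair with BlockExchange::StripedToBlocked() if a blocked arrangement
    // is needed afterwards; here the items are stored back striped.
    for (int i = 0; i < 4; ++i)
        block_out[i * 128 + threadIdx.x] = items[i];
}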
+ */ +template < + PtxLoadModifier MODIFIER, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadWarpStriped( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load +{ + int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1); + int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS; + int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD; + + // Load directly in warp-striped order + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = ThreadLoad (block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS)); + } +} + + +/** + * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier, guarded by range + * + * \warpstriped + * + * \par Usage Considerations + * The number of threads in the thread block must be a multiple of the architecture's warp size. + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). + */ +template < + PtxLoadModifier MODIFIER, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadWarpStriped( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items) ///< [in] Number of valid items to load +{ + int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1); + int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS; + int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD; + int bounds = valid_items - warp_offset - tid; + + // Load directly in warp-striped order + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + if ((ITEM * PtxArchProps::WARP_THREADS) < bounds) + { + items[ITEM] = ThreadLoad (block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS)); + } + } +} + + +/** + * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements. + * + * \warpstriped + * + * \par Usage Considerations + * The number of threads in the thread block must be a multiple of the architecture's warp size. + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). 
+ */ +template < + PtxLoadModifier MODIFIER, + typename T, + int ITEMS_PER_THREAD, + typename InputIteratorRA> +__device__ __forceinline__ void LoadWarpStriped( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items, ///< [in] Number of valid items to load + T oob_default) ///< [in] Default value to assign out-of-bound items +{ + int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1); + int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS; + int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD; + int bounds = valid_items - warp_offset - tid; + + // Load directly in warp-striped order + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = ((ITEM * PtxArchProps::WARP_THREADS) < bounds) ? + ThreadLoad (block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS)) : + oob_default; + } +} + + + +//@} end member group +/******************************************************************//** + * \name Blocked, vectorized I/O + *********************************************************************/ +//@{ + +/** + * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier. + * + * \blocked + * + * The input offset (\p block_ptr + \p block_offset) must be quad-item aligned + * + * The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT: + * - \p ITEMS_PER_THREAD is odd + * - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.) + * + * \tparam MODIFIER cub::PtxLoadModifier cache modifier. + * \tparam T [inferred] The data type to load. + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + */ +template < + PtxLoadModifier MODIFIER, + typename T, + int ITEMS_PER_THREAD> +__device__ __forceinline__ void LoadBlockedVectorized( + int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + T *block_ptr, ///< [in] Input pointer for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load +{ + enum + { + // Maximum CUDA vector size is 4 elements + MAX_VEC_SIZE = CUB_MIN(4, ITEMS_PER_THREAD), + + // Vector size must be a power of two and an even divisor of the items per thread + VEC_SIZE = ((((MAX_VEC_SIZE - 1) & MAX_VEC_SIZE) == 0) && ((ITEMS_PER_THREAD % MAX_VEC_SIZE) == 0)) ? 
+ MAX_VEC_SIZE : + 1, + + VECTORS_PER_THREAD = ITEMS_PER_THREAD / VEC_SIZE, + }; + + // Vector type + typedef typename VectorHelper ::Type Vector; + + // Alias local data (use raw_items array here which should get optimized away to prevent conservative PTXAS lmem spilling) + T raw_items[ITEMS_PER_THREAD]; + + // Direct-load using vector types + LoadBlocked ( + linear_tid, + reinterpret_cast (block_ptr), + reinterpret_cast (raw_items)); + + // Copy + #pragma unroll + for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) + { + items[ITEM] = raw_items[ITEM]; + } +} + + +//@} end member group + +/** @} */ // end group IoModule + + + +//----------------------------------------------------------------------------- +// Generic BlockLoad abstraction +//----------------------------------------------------------------------------- + +/** + * \brief cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. + */ +enum BlockLoadAlgorithm +{ + /** + * \par Overview + * + * A [blocked arrangement](index.html#sec5sec4) of data is read + * directly from memory. The thread block reads items in a parallel "raking" fashion: threadi + * reads the ith segment of consecutive elements. + * + * \par Performance Considerations + * - The utilization of memory transactions (coalescing) decreases as the + * access stride between threads increases (i.e., the number items per thread). + */ + BLOCK_LOAD_DIRECT, + + /** + * \par Overview + * + * A [blocked arrangement](index.html#sec5sec4) of data is read directly + * from memory using CUDA's built-in vectorized loads as a coalescing optimization. + * The thread block reads items in a parallel "raking" fashion: threadi uses vector loads to + * read the ith segment of consecutive elements. + * + * For example, ld.global.v4.s32 instructions will be generated when \p T = \p int and \p ITEMS_PER_THREAD > 4. + * + * \par Performance Considerations + * - The utilization of memory transactions (coalescing) remains high until the the + * access stride between threads (i.e., the number items per thread) exceeds the + * maximum vector load width (typically 4 items or 64B, whichever is lower). + * - The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT: + * - \p ITEMS_PER_THREAD is odd + * - The \p InputIteratorRA is not a simple pointer type + * - The block input offset is not quadword-aligned + * - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.) + */ + BLOCK_LOAD_VECTORIZE, + + /** + * \par Overview + * + * A [striped arrangement](index.html#sec5sec4) of data is read + * directly from memory and then is locally transposed into a + * [blocked arrangement](index.html#sec5sec4). The thread block + * reads items in a parallel "strip-mining" fashion: + * threadi reads items having stride \p BLOCK_THREADS + * between them. cub::BlockExchange is then used to locally reorder the items + * into a [blocked arrangement](index.html#sec5sec4). + * + * \par Performance Considerations + * - The utilization of memory transactions (coalescing) remains high regardless + * of items loaded per thread. + * - The local reordering incurs slightly longer latencies and throughput than the + * direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives. 
+ */ + BLOCK_LOAD_TRANSPOSE, + + + /** + * \par Overview + * + * A [warp-striped arrangement](index.html#sec5sec4) of data is read + * directly from memory and then is locally transposed into a + * [blocked arrangement](index.html#sec5sec4). Each warp reads its own + * contiguous segment in a parallel "strip-mining" fashion: lanei + * reads items having stride \p WARP_THREADS between them. cub::BlockExchange + * is then used to locally reorder the items into a + * [blocked arrangement](index.html#sec5sec4). + * + * \par Usage Considerations + * - BLOCK_THREADS must be a multiple of WARP_THREADS + * + * \par Performance Considerations + * - The utilization of memory transactions (coalescing) remains high regardless + * of items loaded per thread. + * - The local reordering incurs slightly longer latencies and throughput than the + * direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives. + */ + BLOCK_LOAD_WARP_TRANSPOSE, +}; + + +/** + * \brief The BlockLoad class provides [collective](index.html#sec0) data movement methods for loading a linear segment of items from memory into a [blocked arrangement](index.html#sec5sec4) across a CUDA thread block.  + * \ingroup BlockModule + * + * \par Overview + * The BlockLoad class provides a single data movement abstraction that can be specialized + * to implement different cub::BlockLoadAlgorithm strategies. This facilitates different + * performance policies for different architectures, data types, granularity sizes, etc. + * + * \par + * Optionally, BlockLoad can be specialized by different data movement strategies: + * -# cub::BLOCK_LOAD_DIRECT. A [blocked arrangement](index.html#sec5sec4) + * of data is read directly from memory. [More...](\ref cub::BlockLoadAlgorithm) + * -# cub::BLOCK_LOAD_VECTORIZE. A [blocked arrangement](index.html#sec5sec4) + * of data is read directly from memory using CUDA's built-in vectorized loads as a + * coalescing optimization. [More...](\ref cub::BlockLoadAlgorithm) + * -# cub::BLOCK_LOAD_TRANSPOSE. A [striped arrangement](index.html#sec5sec4) + * of data is read directly from memory and is then locally transposed into a + * [blocked arrangement](index.html#sec5sec4). [More...](\ref cub::BlockLoadAlgorithm) + * -# cub::BLOCK_LOAD_WARP_TRANSPOSE. A [warp-striped arrangement](index.html#sec5sec4) + * of data is read directly from memory and is then locally transposed into a + * [blocked arrangement](index.html#sec5sec4). [More...](\ref cub::BlockLoadAlgorithm) + * + * \tparam InputIteratorRA The input iterator type (may be a simple pointer type). + * \tparam BLOCK_THREADS The thread block size in threads. + * \tparam ITEMS_PER_THREAD The number of consecutive items partitioned onto each thread. + * \tparam ALGORITHM [optional] cub::BlockLoadAlgorithm tuning policy. default: cub::BLOCK_LOAD_DIRECT. + * \tparam MODIFIER [optional] cub::PtxLoadModifier cache modifier. default: cub::LOAD_DEFAULT. + * \tparam WARP_TIME_SLICING [optional] For transposition-based cub::BlockLoadAlgorithm parameterizations that utilize shared memory: When \p true, only use enough shared memory for a single warp's worth of data, time-slicing the block-wide exchange over multiple synchronized rounds (default: false) + * + * \par A Simple Example + * \blockcollective{BlockLoad} + * \par + * The code snippet below illustrates the loading of a linear + * segment of 512 integers into a "blocked" arrangement across 128 threads where each + * thread owns 4 consecutive items. 
The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE, + * meaning memory references are efficiently coalesced using a warp-striped access + * pattern (after which items are locally reordered among threads). + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, ...) + * { + * // Specialize BlockLoad for 128 threads owning 4 integer items each + * typedef cub::BlockLoad BlockLoad; + * + * // Allocate shared memory for BlockLoad + * __shared__ typename BlockLoad::TempStorage temp_storage; + * + * // Load a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * BlockLoad(temp_storage).Load(d_data, thread_data); + * + * \endcode + * \par + * Suppose the input \p d_data is 0, 1, 2, 3, 4, 5, .... + * The set of \p thread_data across the block of threads in those threads will be + * { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. + * + */ +template < + typename InputIteratorRA, + int BLOCK_THREADS, + int ITEMS_PER_THREAD, + BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, + PtxLoadModifier MODIFIER = LOAD_DEFAULT, + bool WARP_TIME_SLICING = false> +class BlockLoad +{ +private: + + /****************************************************************************** + * Constants and typed definitions + ******************************************************************************/ + + // Data type of input iterator + typedef typename std::iterator_traits ::value_type T; + + + /****************************************************************************** + * Algorithmic variants + ******************************************************************************/ + + /// Load helper + template + struct LoadInternal; + + + /** + * BLOCK_LOAD_DIRECT specialization of load helper + */ + template + struct LoadInternal + { + /// Shared memory storage layout type + typedef NullType TempStorage; + + /// Linear thread-id + int linear_tid; + + /// Constructor + __device__ __forceinline__ LoadInternal( + TempStorage &temp_storage, + int linear_tid) + : + linear_tid(linear_tid) + {} + + /// Load a linear segment of items from memory + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load + { + LoadBlocked (linear_tid, block_itr, items); + } + + /// Load a linear segment of items from memory, guarded by range + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items) ///< [in] Number of valid items to load + { + LoadBlocked (linear_tid, block_itr, items, valid_items); + } + + /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items, ///< [in] Number of valid items to load + T oob_default) ///< [in] Default value to assign out-of-bound items + { + LoadBlocked (linear_tid, block_itr, items, valid_items, oob_default); + } + + }; + + + /** + * BLOCK_LOAD_VECTORIZE specialization of load helper + */ + template + struct LoadInternal + { + /// Shared memory storage layout type + typedef NullType TempStorage; + + /// Linear thread-id + int linear_tid; + + /// Constructor + __device__ __forceinline__ 
LoadInternal( + TempStorage &temp_storage, + int linear_tid) + : + linear_tid(linear_tid) + {} + + /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization) + __device__ __forceinline__ void Load( + T *block_ptr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load + { + LoadBlockedVectorized (linear_tid, block_ptr, items); + } + + /// Load a linear segment of items from memory, specialized for opaque input iterators (skips vectorization) + template < + typename T, + typename _InputIteratorRA> + __device__ __forceinline__ void Load( + _InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load + { + LoadBlocked (linear_tid, block_itr, items); + } + + /// Load a linear segment of items from memory, guarded by range (skips vectorization) + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items) ///< [in] Number of valid items to load + { + LoadBlocked (linear_tid, block_itr, items, valid_items); + } + + /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements (skips vectorization) + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items, ///< [in] Number of valid items to load + T oob_default) ///< [in] Default value to assign out-of-bound items + { + LoadBlocked (linear_tid, block_itr, items, valid_items, oob_default); + } + + }; + + + /** + * BLOCK_LOAD_TRANSPOSE specialization of load helper + */ + template + struct LoadInternal + { + // BlockExchange utility type for keys + typedef BlockExchange BlockExchange; + + /// Shared memory storage layout type + typedef typename BlockExchange::TempStorage _TempStorage; + + /// Alias wrapper allowing storage to be unioned + struct TempStorage : Uninitialized<_TempStorage> {}; + + /// Thread reference to shared storage + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + + /// Constructor + __device__ __forceinline__ LoadInternal( + TempStorage &temp_storage, + int linear_tid) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + /// Load a linear segment of items from memory + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load{ + { + LoadStriped (linear_tid, block_itr, items); + BlockExchange(temp_storage, linear_tid).StripedToBlocked(items); + } + + /// Load a linear segment of items from memory, guarded by range + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items) ///< [in] Number of valid items to load + { + LoadStriped (linear_tid, block_itr, items, valid_items); + BlockExchange(temp_storage, linear_tid).StripedToBlocked(items); + } + + /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread 
block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items, ///< [in] Number of valid items to load + T oob_default) ///< [in] Default value to assign out-of-bound items + { + LoadStriped (linear_tid, block_itr, items, valid_items, oob_default); + BlockExchange(temp_storage, linear_tid).StripedToBlocked(items); + } + + }; + + + /** + * BLOCK_LOAD_WARP_TRANSPOSE specialization of load helper + */ + template + struct LoadInternal + { + enum + { + WARP_THREADS = PtxArchProps::WARP_THREADS + }; + + // Assert BLOCK_THREADS must be a multiple of WARP_THREADS + CUB_STATIC_ASSERT((BLOCK_THREADS % WARP_THREADS == 0), "BLOCK_THREADS must be a multiple of WARP_THREADS"); + + // BlockExchange utility type for keys + typedef BlockExchange BlockExchange; + + /// Shared memory storage layout type + typedef typename BlockExchange::TempStorage _TempStorage; + + /// Alias wrapper allowing storage to be unioned + struct TempStorage : Uninitialized<_TempStorage> {}; + + /// Thread reference to shared storage + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + + /// Constructor + __device__ __forceinline__ LoadInternal( + TempStorage &temp_storage, + int linear_tid) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + /// Load a linear segment of items from memory + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load{ + { + LoadWarpStriped (linear_tid, block_itr, items); + BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items); + } + + /// Load a linear segment of items from memory, guarded by range + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items) ///< [in] Number of valid items to load + { + LoadWarpStriped (linear_tid, block_itr, items, valid_items); + BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items); + } + + + /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items, ///< [in] Number of valid items to load + T oob_default) ///< [in] Default value to assign out-of-bound items + { + LoadWarpStriped (linear_tid, block_itr, items, valid_items, oob_default); + BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items); + } + }; + + + /****************************************************************************** + * Type definitions + ******************************************************************************/ + + /// Internal load implementation to use + typedef LoadInternal InternalLoad; + + + /// Shared memory storage layout type + typedef typename InternalLoad::TempStorage _TempStorage; + + + /****************************************************************************** + * Utility methods + ******************************************************************************/ + + /// Internal storage allocator + __device__ __forceinline__ _TempStorage& PrivateStorage() + { + __shared__ _TempStorage private_storage; + return private_storage; + } + + + 
/****************************************************************************** + * Thread fields + ******************************************************************************/ + + /// Thread reference to shared storage + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + +public: + + /// \smemstorage{BlockLoad} + struct TempStorage : Uninitialized<_TempStorage> {}; + + + /******************************************************************//** + * \name Collective constructors + *********************************************************************/ + //@{ + + /** + * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockLoad() + : + temp_storage(PrivateStorage()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockLoad( + TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage + : + temp_storage(temp_storage.Alias()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier + */ + __device__ __forceinline__ BlockLoad( + int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(PrivateStorage()), + linear_tid(linear_tid) + {} + + + /** + * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. + */ + __device__ __forceinline__ BlockLoad( + TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage + int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + + + //@} end member group + /******************************************************************//** + * \name Data movement + *********************************************************************/ + //@{ + + + /** + * \brief Load a linear segment of items from memory. + * + * \blocked + * + * The code snippet below illustrates the loading of a linear + * segment of 512 integers into a "blocked" arrangement across 128 threads where each + * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE, + * meaning memory references are efficiently coalesced using a warp-striped access + * pattern (after which items are locally reordered among threads). + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, ...) + * { + * // Specialize BlockLoad for 128 threads owning 4 integer items each + * typedef cub::BlockLoad BlockLoad; + * + * // Allocate shared memory for BlockLoad + * __shared__ typename BlockLoad::TempStorage temp_storage; + * + * // Load a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * BlockLoad(temp_storage).Load(d_data, thread_data); + * + * \endcode + * \par + * Suppose the input \p d_data is 0, 1, 2, 3, 4, 5, .... 
+ * The set of \p thread_data across the block of threads in those threads will be + * { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. + * + */ + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load + { + InternalLoad(temp_storage, linear_tid).Load(block_itr, items); + } + + + /** + * \brief Load a linear segment of items from memory, guarded by range. + * + * \blocked + * + * The code snippet below illustrates the guarded loading of a linear + * segment of 512 integers into a "blocked" arrangement across 128 threads where each + * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE, + * meaning memory references are efficiently coalesced using a warp-striped access + * pattern (after which items are locally reordered among threads). + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, int valid_items, ...) + * { + * // Specialize BlockLoad for 128 threads owning 4 integer items each + * typedef cub::BlockLoad BlockLoad; + * + * // Allocate shared memory for BlockLoad + * __shared__ typename BlockLoad::TempStorage temp_storage; + * + * // Load a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * BlockLoad(temp_storage).Load(d_data, thread_data, valid_items); + * + * \endcode + * \par + * Suppose the input \p d_data is 0, 1, 2, 3, 4, 5, 6... and \p valid_items is \p 5. + * The set of \p thread_data across the block of threads in those threads will be + * { [0,1,2,3], [4,?,?,?], ..., [?,?,?,?] }, with only the first two threads + * being unmasked to load portions of valid data (and other items remaining unassigned). + * + */ + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items) ///< [in] Number of valid items to load + { + InternalLoad(temp_storage, linear_tid).Load(block_itr, items, valid_items); + } + + + /** + * \brief Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements + * + * \blocked + * + * The code snippet below illustrates the guarded loading of a linear + * segment of 512 integers into a "blocked" arrangement across 128 threads where each + * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE, + * meaning memory references are efficiently coalesced using a warp-striped access + * pattern (after which items are locally reordered among threads). + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int *d_data, int valid_items, ...) + * { + * // Specialize BlockLoad for 128 threads owning 4 integer items each + * typedef cub::BlockLoad BlockLoad; + * + * // Allocate shared memory for BlockLoad + * __shared__ typename BlockLoad::TempStorage temp_storage; + * + * // Load a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * BlockLoad(temp_storage).Load(d_data, thread_data, valid_items, -1); + * + * \endcode + * \par + * Suppose the input \p d_data is 0, 1, 2, 3, 4, 5, 6..., + * \p valid_items is \p 5, and the out-of-bounds default is \p -1. 
+ * The set of \p thread_data across the block of threads in those threads will be + * { [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }, with only the first two threads + * being unmasked to load portions of valid data (and other items are assigned \p -1) + * + */ + __device__ __forceinline__ void Load( + InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from + T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load + int valid_items, ///< [in] Number of valid items to load + T oob_default) ///< [in] Default value to assign out-of-bound items + { + InternalLoad(temp_storage, linear_tid).Load(block_itr, items, valid_items, oob_default); + } + + + //@} end member group + +}; + + +} // CUB namespace +CUB_NS_POSTFIX // Optional outer namespace(s) + diff --git a/lib/kokkos/TPL/cub/block/block_radix_rank.cuh b/lib/kokkos/TPL/cub/block/block_radix_rank.cuh new file mode 100644 index 0000000000..149a62c65f --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_radix_rank.cuh @@ -0,0 +1,479 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * cub::BlockRadixRank provides operations for ranking unsigned integer types within a CUDA threadblock + */ + +#pragma once + +#include "../util_arch.cuh" +#include "../util_type.cuh" +#include "../thread/thread_reduce.cuh" +#include "../thread/thread_scan.cuh" +#include "../block/block_scan.cuh" +#include "../util_namespace.cuh" + + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + +/** + * \brief BlockRadixRank provides operations for ranking unsigned integer types within a CUDA threadblock. + * \ingroup BlockModule + * + * \par Overview + * Blah... 
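[Editor's sketch] The BlockLoad doc-comment examples above (128 threads, 4 integers each, BLOCK_LOAD_WARP_TRANSPOSE, guarded with a -1 default) lost their angle-bracket template arguments in this diff. A self-contained sketch of the usage those comments describe, assuming the bundled headers as <cub/cub.cuh> and a 128-thread launch; d_data, valid_items, and d_out are placeholders.

#include <cub/cub.cuh>   // assumed include path for the bundled CUB TPL

__global__ void BlockLoadKernel(int *d_data, int valid_items, int *d_out)
{
    // 128 threads, 4 ints each, warp-striped loads transposed to blocked
    typedef cub::BlockLoad<int*, 128, 4,
                           cub::BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;

    __shared__ typename BlockLoad::TempStorage temp_storage;

    // Guarded load with out-of-bounds default of -1
    int thread_data[4];
    BlockLoad(temp_storage).Load(d_data, thread_data, valid_items, -1);

    // thread_data is now a blocked arrangement:
    // thread 0 -> d_data[0..3], thread 1 -> d_data[4..7], ...,
    // with out-of-range slots set to -1.
    for (int i = 0; i < 4; ++i)
        d_out[threadIdx.x * 4 + i] = thread_data[i];
}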
+ * + * \tparam BLOCK_THREADS The thread block size in threads + * \tparam RADIX_BITS [optional] The number of radix bits per digit place (default: 5 bits) + * \tparam MEMOIZE_OUTER_SCAN [optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). See BlockScanAlgorithm::BLOCK_SCAN_RAKING_MEMOIZE for more details. + * \tparam INNER_SCAN_ALGORITHM [optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS) + * \tparam SMEM_CONFIG [optional] Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte) + * + * \par Usage Considerations + * - Keys must be in a form suitable for radix ranking (i.e., unsigned bits). + * - Assumes a [blocked arrangement](index.html#sec5sec4) of elements across threads + * - \smemreuse{BlockRadixRank::TempStorage} + * + * \par Performance Considerations + * + * \par Algorithm + * These parallel radix ranking variants have O(n) work complexity and are implemented in XXX phases: + * -# blah + * -# blah + * + * \par Examples + * \par + * - Example 1: Simple radix rank of 32-bit integer keys + * \code + * #include + * + * template + * __global__ void ExampleKernel(...) + * { + * + * \endcode + */ +template < + int BLOCK_THREADS, + int RADIX_BITS, + bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, + BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, + cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte> +class BlockRadixRank +{ +private: + + /****************************************************************************** + * Type definitions and constants + ******************************************************************************/ + + // Integer type for digit counters (to be packed into words of type PackedCounters) + typedef unsigned short DigitCounter; + + // Integer type for packing DigitCounters into columns of shared memory banks + typedef typename If<(SMEM_CONFIG == cudaSharedMemBankSizeEightByte), + unsigned long long, + unsigned int>::Type PackedCounter; + + enum + { + RADIX_DIGITS = 1 << RADIX_BITS, + + LOG_WARP_THREADS = PtxArchProps::LOG_WARP_THREADS, + WARP_THREADS = 1 << LOG_WARP_THREADS, + WARPS = (BLOCK_THREADS + WARP_THREADS - 1) / WARP_THREADS, + + BYTES_PER_COUNTER = sizeof(DigitCounter), + LOG_BYTES_PER_COUNTER = Log2 ::VALUE, + + PACKING_RATIO = sizeof(PackedCounter) / sizeof(DigitCounter), + LOG_PACKING_RATIO = Log2 ::VALUE, + + LOG_COUNTER_LANES = CUB_MAX((RADIX_BITS - LOG_PACKING_RATIO), 0), // Always at least one lane + COUNTER_LANES = 1 << LOG_COUNTER_LANES, + + // The number of packed counters per thread (plus one for padding) + RAKING_SEGMENT = COUNTER_LANES + 1, + + LOG_SMEM_BANKS = PtxArchProps::LOG_SMEM_BANKS, + SMEM_BANKS = 1 << LOG_SMEM_BANKS, + }; + + + /// BlockScan type + typedef BlockScan BlockScan; + + + /// Shared memory storage layout type for BlockRadixRank + struct _TempStorage + { + // Storage for scanning local ranks + typename BlockScan::TempStorage block_scan; + + union + { + DigitCounter digit_counters[COUNTER_LANES + 1][BLOCK_THREADS][PACKING_RATIO]; + PackedCounter raking_grid[BLOCK_THREADS][RAKING_SEGMENT]; + }; + }; + + + /****************************************************************************** + * Thread fields + ******************************************************************************/ + + /// Shared storage reference + _TempStorage &temp_storage; + + /// Linear thread-id + int 
linear_tid; + + /// Copy of raking segment, promoted to registers + PackedCounter cached_segment[RAKING_SEGMENT]; + + + /****************************************************************************** + * Templated iteration + ******************************************************************************/ + + // General template iteration + template + struct Iterate + { + /** + * Decode keys. Decodes the radix digit from the current digit place + * and increments the thread's corresponding counter in shared + * memory for that digit. + * + * Saves both (1) the prior value of that counter (the key's + * thread-local exclusive prefix sum for that digit), and (2) the shared + * memory offset of the counter (for later use). + */ + template + static __device__ __forceinline__ void DecodeKeys( + BlockRadixRank &cta, // BlockRadixRank instance + UnsignedBits (&keys)[KEYS_PER_THREAD], // Key to decode + DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], // Prefix counter value (out parameter) + DigitCounter* (&digit_counters)[KEYS_PER_THREAD], // Counter smem offset (out parameter) + int current_bit) // The least-significant bit position of the current digit to extract + { + // Add in sub-counter offset + UnsignedBits sub_counter = BFE(keys[COUNT], current_bit + LOG_COUNTER_LANES, LOG_PACKING_RATIO); + + // Add in row offset + UnsignedBits row_offset = BFE(keys[COUNT], current_bit, LOG_COUNTER_LANES); + + // Pointer to smem digit counter + digit_counters[COUNT] = &cta.temp_storage.digit_counters[row_offset][cta.linear_tid][sub_counter]; + + // Load thread-exclusive prefix + thread_prefixes[COUNT] = *digit_counters[COUNT]; + + // Store inclusive prefix + *digit_counters[COUNT] = thread_prefixes[COUNT] + 1; + + // Iterate next key + Iterate ::DecodeKeys(cta, keys, thread_prefixes, digit_counters, current_bit); + } + + + // Termination + template + static __device__ __forceinline__ void UpdateRanks( + int (&ranks)[KEYS_PER_THREAD], // Local ranks (out parameter) + DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], // Prefix counter value + DigitCounter* (&digit_counters)[KEYS_PER_THREAD]) // Counter smem offset + { + // Add in threadblock exclusive prefix + ranks[COUNT] = thread_prefixes[COUNT] + *digit_counters[COUNT]; + + // Iterate next key + Iterate ::UpdateRanks(ranks, thread_prefixes, digit_counters); + } + }; + + + // Termination + template + struct Iterate + { + // DecodeKeys + template + static __device__ __forceinline__ void DecodeKeys( + BlockRadixRank &cta, + UnsignedBits (&keys)[KEYS_PER_THREAD], + DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], + DigitCounter* (&digit_counters)[KEYS_PER_THREAD], + int current_bit) {} + + + // UpdateRanks + template + static __device__ __forceinline__ void UpdateRanks( + int (&ranks)[KEYS_PER_THREAD], + DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], + DigitCounter *(&digit_counters)[KEYS_PER_THREAD]) {} + }; + + + /****************************************************************************** + * Utility methods + ******************************************************************************/ + + /** + * Internal storage allocator + */ + __device__ __forceinline__ _TempStorage& PrivateStorage() + { + __shared__ _TempStorage private_storage; + return private_storage; + } + + + /** + * Performs upsweep raking reduction, returning the aggregate + */ + __device__ __forceinline__ PackedCounter Upsweep() + { + PackedCounter *smem_raking_ptr = temp_storage.raking_grid[linear_tid]; + PackedCounter *raking_ptr; + + if (MEMOIZE_OUTER_SCAN) + { + // Copy data 
into registers + #pragma unroll + for (int i = 0; i < RAKING_SEGMENT; i++) + { + cached_segment[i] = smem_raking_ptr[i]; + } + raking_ptr = cached_segment; + } + else + { + raking_ptr = smem_raking_ptr; + } + + return ThreadReduce (raking_ptr, Sum()); + } + + + /// Performs exclusive downsweep raking scan + __device__ __forceinline__ void ExclusiveDownsweep( + PackedCounter raking_partial) + { + PackedCounter *smem_raking_ptr = temp_storage.raking_grid[linear_tid]; + + PackedCounter *raking_ptr = (MEMOIZE_OUTER_SCAN) ? + cached_segment : + smem_raking_ptr; + + // Exclusive raking downsweep scan + ThreadScanExclusive (raking_ptr, raking_ptr, Sum(), raking_partial); + + if (MEMOIZE_OUTER_SCAN) + { + // Copy data back to smem + #pragma unroll + for (int i = 0; i < RAKING_SEGMENT; i++) + { + smem_raking_ptr[i] = cached_segment[i]; + } + } + } + + + /** + * Reset shared memory digit counters + */ + __device__ __forceinline__ void ResetCounters() + { + // Reset shared memory digit counters + #pragma unroll + for (int LANE = 0; LANE < COUNTER_LANES + 1; LANE++) + { + *((PackedCounter*) temp_storage.digit_counters[LANE][linear_tid]) = 0; + } + } + + + /** + * Scan shared memory digit counters. + */ + __device__ __forceinline__ void ScanCounters() + { + // Upsweep scan + PackedCounter raking_partial = Upsweep(); + + // Compute inclusive sum + PackedCounter inclusive_partial; + PackedCounter packed_aggregate; + BlockScan(temp_storage.block_scan, linear_tid).InclusiveSum(raking_partial, inclusive_partial, packed_aggregate); + + // Propagate totals in packed fields + #pragma unroll + for (int PACKED = 1; PACKED < PACKING_RATIO; PACKED++) + { + inclusive_partial += packed_aggregate << (sizeof(DigitCounter) * 8 * PACKED); + } + + // Downsweep scan with exclusive partial + PackedCounter exclusive_partial = inclusive_partial - raking_partial; + ExclusiveDownsweep(exclusive_partial); + } + +public: + + /// \smemstorage{BlockScan} + struct TempStorage : Uninitialized<_TempStorage> {}; + + + /******************************************************************//** + * \name Collective constructors + *********************************************************************/ + //@{ + + /** + * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockRadixRank() + : + temp_storage(PrivateStorage()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockRadixRank( + TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage + : + temp_storage(temp_storage.Alias()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier + */ + __device__ __forceinline__ BlockRadixRank( + int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(PrivateStorage()), + linear_tid(linear_tid) + {} + + + /** + * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. 
+ */ + __device__ __forceinline__ BlockRadixRank( + TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage + int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + + + //@} end member group + /******************************************************************//** + * \name Raking + *********************************************************************/ + //@{ + + /** + * \brief Rank keys. + */ + template < + typename UnsignedBits, + int KEYS_PER_THREAD> + __device__ __forceinline__ void RankKeys( + UnsignedBits (&keys)[KEYS_PER_THREAD], ///< [in] Keys for this tile + int (&ranks)[KEYS_PER_THREAD], ///< [out] For each key, the local rank within the tile + int current_bit) ///< [in] The least-significant bit position of the current digit to extract + { + DigitCounter thread_prefixes[KEYS_PER_THREAD]; // For each key, the count of previous keys in this tile having the same digit + DigitCounter* digit_counters[KEYS_PER_THREAD]; // For each key, the byte-offset of its corresponding digit counter in smem + + // Reset shared memory digit counters + ResetCounters(); + + // Decode keys and update digit counters + Iterate<0, KEYS_PER_THREAD>::DecodeKeys(*this, keys, thread_prefixes, digit_counters, current_bit); + + __syncthreads(); + + // Scan shared memory counters + ScanCounters(); + + __syncthreads(); + + // Extract the local ranks of each key + Iterate<0, KEYS_PER_THREAD>::UpdateRanks(ranks, thread_prefixes, digit_counters); + } + + + /** + * \brief Rank keys. For the lower \p RADIX_DIGITS threads, digit counts for each digit are provided for the corresponding thread. + */ + template < + typename UnsignedBits, + int KEYS_PER_THREAD> + __device__ __forceinline__ void RankKeys( + UnsignedBits (&keys)[KEYS_PER_THREAD], ///< [in] Keys for this tile + int (&ranks)[KEYS_PER_THREAD], ///< [out] For each key, the local rank within the tile (out parameter) + int current_bit, ///< [in] The least-significant bit position of the current digit to extract + int &inclusive_digit_prefix) ///< [out] The incluisve prefix sum for the digit threadIdx.x + { + // Rank keys + RankKeys(keys, ranks, current_bit); + + // Get the inclusive and exclusive digit totals corresponding to the calling thread. + if ((BLOCK_THREADS == RADIX_DIGITS) || (linear_tid < RADIX_DIGITS)) + { + // Obtain ex/inclusive digit counts. (Unfortunately these all reside in the + // first counter column, resulting in unavoidable bank conflicts.) + int counter_lane = (linear_tid & (COUNTER_LANES - 1)); + int sub_counter = linear_tid >> (LOG_COUNTER_LANES); + inclusive_digit_prefix = temp_storage.digit_counters[counter_lane + 1][0][sub_counter]; + } + } +}; + +} // CUB namespace +CUB_NS_POSTFIX // Optional outer namespace(s) + + diff --git a/lib/kokkos/TPL/cub/block/block_radix_sort.cuh b/lib/kokkos/TPL/cub/block/block_radix_sort.cuh new file mode 100644 index 0000000000..873d401266 --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_radix_sort.cuh @@ -0,0 +1,608 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * The cub::BlockRadixSort class provides [collective](index.html#sec0) methods for radix sorting of items partitioned across a CUDA thread block. + */ + + +#pragma once + +#include "../util_namespace.cuh" +#include "../util_arch.cuh" +#include "../util_type.cuh" +#include "block_exchange.cuh" +#include "block_radix_rank.cuh" + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + +/** + * \brief The cub::BlockRadixSort class provides [collective](index.html#sec0) methods for sorting items partitioned across a CUDA thread block using a radix sorting method.  + * \ingroup BlockModule + * + * \par Overview + * The [radix sorting method](http://en.wikipedia.org/wiki/Radix_sort) arranges + * items into ascending order. It relies upon a positional representation for + * keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, + * characters, etc.) specified from least-significant to most-significant. For a + * given input sequence of keys and a set of rules specifying a total ordering + * of the symbolic alphabet, the radix sorting method produces a lexicographic + * ordering of those keys. + * + * \par + * BlockRadixSort can sort all of the built-in C++ numeric primitive types, e.g.: + * unsigned char, \p int, \p double, etc. Within each key, the implementation treats fixed-length + * bit-sequences of \p RADIX_BITS as radix digit places. Although the direct radix sorting + * method can only be applied to unsigned integral types, BlockRadixSort + * is able to sort signed and floating-point types via simple bit-wise transformations + * that ensure lexicographic key ordering. 
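+ *
+ * \par
+ * As an informal illustration of the kind of bit-wise key transformation
+ * referred to above (a sketch only; the transformations actually used by
+ * BlockRadixSort live in its key-traits machinery, and the helper name here
+ * is hypothetical), an IEEE-754 \p float key can be mapped to unsigned bits
+ * whose ascending unsigned order matches ascending floating-point order:
+ * \code
+ * __device__ __forceinline__ unsigned int TwiddleFloatKey(float key)
+ * {
+ *     unsigned int bits = (unsigned int) __float_as_int(key);
+ *     // Negative values: flip all bits.  Non-negative values: flip the sign bit.
+ *     unsigned int mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
+ *     return bits ^ mask;
+ * }
+ * \endcode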
+ * + * \tparam Key Key type + * \tparam BLOCK_THREADS The thread block size in threads + * \tparam ITEMS_PER_THREAD The number of items per thread + * \tparam Value [optional] Value type (default: cub::NullType) + * \tparam RADIX_BITS [optional] The number of radix bits per digit place (default: 4 bits) + * \tparam MEMOIZE_OUTER_SCAN [optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). + * \tparam INNER_SCAN_ALGORITHM [optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS) + * \tparam SMEM_CONFIG [optional] Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte) + * + * \par A Simple Example + * \blockcollective{BlockRadixSort} + * \par + * The code snippet below illustrates a sort of 512 integer keys that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockRadixSort for 128 threads owning 4 integer items each + * typedef cub::BlockRadixSort BlockRadixSort; + * + * // Allocate shared memory for BlockRadixSort + * __shared__ typename BlockRadixSort::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_keys[4]; + * ... + * + * // Collectively sort the keys + * BlockRadixSort(temp_storage).Sort(thread_keys); + * + * ... + * \endcode + * \par + * Suppose the set of input \p thread_keys across the block of threads is + * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The + * corresponding output \p thread_keys in those threads will be + * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. + * + */ +template < + typename Key, + int BLOCK_THREADS, + int ITEMS_PER_THREAD, + typename Value = NullType, + int RADIX_BITS = 4, + bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? 
true : false, + BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, + cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte> +class BlockRadixSort +{ +private: + + /****************************************************************************** + * Constants and type definitions + ******************************************************************************/ + + // Key traits and unsigned bits type + typedef NumericTraits KeyTraits; + typedef typename KeyTraits::UnsignedBits UnsignedBits; + + /// BlockRadixRank utility type + typedef BlockRadixRank BlockRadixRank; + + /// BlockExchange utility type for keys + typedef BlockExchange BlockExchangeKeys; + + /// BlockExchange utility type for values + typedef BlockExchange BlockExchangeValues; + + /// Shared memory storage layout type + struct _TempStorage + { + union + { + typename BlockRadixRank::TempStorage ranking_storage; + typename BlockExchangeKeys::TempStorage exchange_keys; + typename BlockExchangeValues::TempStorage exchange_values; + }; + }; + + /****************************************************************************** + * Utility methods + ******************************************************************************/ + + /// Internal storage allocator + __device__ __forceinline__ _TempStorage& PrivateStorage() + { + __shared__ _TempStorage private_storage; + return private_storage; + } + + + /****************************************************************************** + * Thread fields + ******************************************************************************/ + + /// Shared storage reference + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + + +public: + + /// \smemstorage{BlockScan} + struct TempStorage : Uninitialized<_TempStorage> {}; + + + /******************************************************************//** + * \name Collective constructors + *********************************************************************/ + //@{ + + /** + * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockRadixSort() + : + temp_storage(PrivateStorage()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockRadixSort( + TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage + : + temp_storage(temp_storage.Alias()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier + */ + __device__ __forceinline__ BlockRadixSort( + int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(PrivateStorage()), + linear_tid(linear_tid) + {} + + + /** + * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. 
+ */ + __device__ __forceinline__ BlockRadixSort( + TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage + int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + + + //@} end member group + /******************************************************************//** + * \name Sorting (blocked arrangements) + *********************************************************************/ + //@{ + + /** + * \brief Performs a block-wide radix sort over a [blocked arrangement](index.html#sec5sec4) of keys. + * + * \smemreuse + * + * The code snippet below illustrates a sort of 512 integer keys that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive keys. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockRadixSort for 128 threads owning 4 integer keys each + * typedef cub::BlockRadixSort BlockRadixSort; + * + * // Allocate shared memory for BlockRadixSort + * __shared__ typename BlockRadixSort::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_keys[4]; + * ... + * + * // Collectively sort the keys + * BlockRadixSort(temp_storage).Sort(thread_keys); + * + * \endcode + * \par + * Suppose the set of input \p thread_keys across the block of threads is + * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. + * The corresponding output \p thread_keys in those threads will be + * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. + */ + __device__ __forceinline__ void Sort( + Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort + int begin_bit = 0, ///< [in] [optional] The beginning (least-significant) bit index needed for key comparison + int end_bit = sizeof(Key) * 8) ///< [in] [optional] The past-the-end (most-significant) bit index needed for key comparison + { + UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] = + reinterpret_cast (keys); + + // Twiddle bits if necessary + #pragma unroll + for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) + { + unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]); + } + + // Radix sorting passes + while (true) + { + // Rank the blocked keys + int ranks[ITEMS_PER_THREAD]; + BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit); + begin_bit += RADIX_BITS; + + __syncthreads(); + + // Exchange keys through shared memory in blocked arrangement + BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks); + + // Quit if done + if (begin_bit >= end_bit) break; + + __syncthreads(); + } + + // Untwiddle bits if necessary + #pragma unroll + for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) + { + unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]); + } + } + + + /** + * \brief Performs a block-wide radix sort across a [blocked arrangement](index.html#sec5sec4) of keys and values. + * + * BlockRadixSort can only accommodate one associated tile of values. To "truck along" + * more than one tile of values, simply perform a key-value sort of the keys paired + * with a temporary value array that enumerates the key indices. 
The reordered indices + * can then be used as a gather-vector for exchanging other associated tile data through + * shared memory. + * + * \smemreuse + * + * The code snippet below illustrates a sort of 512 integer keys and values that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive pairs. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockRadixSort for 128 threads owning 4 integer keys and values each + * typedef cub::BlockRadixSort BlockRadixSort; + * + * // Allocate shared memory for BlockRadixSort + * __shared__ typename BlockRadixSort::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_keys[4]; + * int thread_values[4]; + * ... + * + * // Collectively sort the keys and values among block threads + * BlockRadixSort(temp_storage).Sort(thread_keys, thread_values); + * + * \endcode + * \par + * Suppose the set of input \p thread_keys across the block of threads is + * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The + * corresponding output \p thread_keys in those threads will be + * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. + * + */ + __device__ __forceinline__ void Sort( + Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort + Value (&values)[ITEMS_PER_THREAD], ///< [in-out] Values to sort + int begin_bit = 0, ///< [in] [optional] The beginning (least-significant) bit index needed for key comparison + int end_bit = sizeof(Key) * 8) ///< [in] [optional] The past-the-end (most-significant) bit index needed for key comparison + { + UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] = + reinterpret_cast (keys); + + // Twiddle bits if necessary + #pragma unroll + for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) + { + unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]); + } + + // Radix sorting passes + while (true) + { + // Rank the blocked keys + int ranks[ITEMS_PER_THREAD]; + BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit); + begin_bit += RADIX_BITS; + + __syncthreads(); + + // Exchange keys through shared memory in blocked arrangement + BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks); + + __syncthreads(); + + // Exchange values through shared memory in blocked arrangement + BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToBlocked(values, ranks); + + // Quit if done + if (begin_bit >= end_bit) break; + + __syncthreads(); + } + + // Untwiddle bits if necessary + #pragma unroll + for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) + { + unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]); + } + } + + + //@} end member group + /******************************************************************//** + * \name Sorting (blocked arrangement -> striped arrangement) + *********************************************************************/ + //@{ + + + /** + * \brief Performs a radix sort across a [blocked arrangement](index.html#sec5sec4) of keys, leaving them in a [striped arrangement](index.html#sec5sec4). + * + * \smemreuse + * + * The code snippet below illustrates a sort of 512 integer keys that + * are initially partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive keys. The final partitioning is striped. 
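+ *
+ * \par
+ * Striped output is typically requested so that the sorted tile can be
+ * written back to global memory with coalesced accesses.  A minimal sketch
+ * of such a write-back (the \p d_out pointer and the 128-thread / 4-item
+ * tile shape are assumptions for illustration, not part of this interface):
+ * \code
+ * // After SortBlockedToStriped(), thread t holds tile items t, t+128, t+256, t+384,
+ * // so for each i below, consecutive threads write consecutive global addresses.
+ * #pragma unroll
+ * for (int i = 0; i < 4; ++i)
+ *     d_out[(blockIdx.x * 512) + (i * 128) + threadIdx.x] = thread_keys[i];
+ * \endcode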
+ * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockRadixSort for 128 threads owning 4 integer keys each + * typedef cub::BlockRadixSort BlockRadixSort; + * + * // Allocate shared memory for BlockRadixSort + * __shared__ typename BlockRadixSort::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_keys[4]; + * ... + * + * // Collectively sort the keys + * BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys); + * + * \endcode + * \par + * Suppose the set of input \p thread_keys across the block of threads is + * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The + * corresponding output \p thread_keys in those threads will be + * { [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }. + * + */ + __device__ __forceinline__ void SortBlockedToStriped( + Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort + int begin_bit = 0, ///< [in] [optional] The beginning (least-significant) bit index needed for key comparison + int end_bit = sizeof(Key) * 8) ///< [in] [optional] The past-the-end (most-significant) bit index needed for key comparison + { + UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] = + reinterpret_cast (keys); + + // Twiddle bits if necessary + #pragma unroll + for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) + { + unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]); + } + + // Radix sorting passes + while (true) + { + // Rank the blocked keys + int ranks[ITEMS_PER_THREAD]; + BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit); + begin_bit += RADIX_BITS; + + __syncthreads(); + + // Check if this is the last pass + if (begin_bit >= end_bit) + { + // Last pass exchanges keys through shared memory in striped arrangement + BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToStriped(keys, ranks); + + // Quit + break; + } + + // Exchange keys through shared memory in blocked arrangement + BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks); + + __syncthreads(); + } + + // Untwiddle bits if necessary + #pragma unroll + for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) + { + unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]); + } + } + + + /** + * \brief Performs a radix sort across a [blocked arrangement](index.html#sec5sec4) of keys and values, leaving them in a [striped arrangement](index.html#sec5sec4). + * + * BlockRadixSort can only accommodate one associated tile of values. To "truck along" + * more than one tile of values, simply perform a key-value sort of the keys paired + * with a temporary value array that enumerates the key indices. The reordered indices + * can then be used as a gather-vector for exchanging other associated tile data through + * shared memory. + * + * \smemreuse + * + * The code snippet below illustrates a sort of 512 integer keys and values that + * are initially partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive pairs. The final partitioning is striped. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) 
+ * { + * // Specialize BlockRadixSort for 128 threads owning 4 integer keys and values each + * typedef cub::BlockRadixSort BlockRadixSort; + * + * // Allocate shared memory for BlockRadixSort + * __shared__ typename BlockRadixSort::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_keys[4]; + * int thread_values[4]; + * ... + * + * // Collectively sort the keys and values among block threads + * BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values); + * + * \endcode + * \par + * Suppose the set of input \p thread_keys across the block of threads is + * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The + * corresponding output \p thread_keys in those threads will be + * { [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }. + * + */ + __device__ __forceinline__ void SortBlockedToStriped( + Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort + Value (&values)[ITEMS_PER_THREAD], ///< [in-out] Values to sort + int begin_bit = 0, ///< [in] [optional] The beginning (least-significant) bit index needed for key comparison + int end_bit = sizeof(Key) * 8) ///< [in] [optional] The past-the-end (most-significant) bit index needed for key comparison + { + UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] = + reinterpret_cast (keys); + + // Twiddle bits if necessary + #pragma unroll + for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) + { + unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]); + } + + // Radix sorting passes + while (true) + { + // Rank the blocked keys + int ranks[ITEMS_PER_THREAD]; + BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit); + begin_bit += RADIX_BITS; + + __syncthreads(); + + // Check if this is the last pass + if (begin_bit >= end_bit) + { + // Last pass exchanges keys through shared memory in striped arrangement + BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToStriped(keys, ranks); + + __syncthreads(); + + // Last pass exchanges through shared memory in striped arrangement + BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToStriped(values, ranks); + + // Quit + break; + } + + // Exchange keys through shared memory in blocked arrangement + BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks); + + __syncthreads(); + + // Exchange values through shared memory in blocked arrangement + BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToBlocked(values, ranks); + + __syncthreads(); + } + + // Untwiddle bits if necessary + #pragma unroll + for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) + { + unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]); + } + } + + + //@} end member group + +}; + +} // CUB namespace +CUB_NS_POSTFIX // Optional outer namespace(s) + diff --git a/lib/kokkos/TPL/cub/block/block_raking_layout.cuh b/lib/kokkos/TPL/cub/block/block_raking_layout.cuh new file mode 100644 index 0000000000..878a786cd9 --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_raking_layout.cuh @@ -0,0 +1,145 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * cub::BlockRakingLayout provides a conflict-free shared memory layout abstraction for warp-raking across thread block data. + */ + + +#pragma once + +#include "../util_macro.cuh" +#include "../util_arch.cuh" +#include "../util_namespace.cuh" + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + +/** + * \brief BlockRakingLayout provides a conflict-free shared memory layout abstraction for raking across thread block data.  + * \ingroup BlockModule + * + * \par Overview + * This type facilitates a shared memory usage pattern where a block of CUDA + * threads places elements into shared memory and then reduces the active + * parallelism to one "raking" warp of threads for serially aggregating consecutive + * sequences of shared items. Padding is inserted to eliminate bank conflicts + * (for most data types). + * + * \tparam T The data type to be exchanged. + * \tparam BLOCK_THREADS The thread block size in threads. 
+ * \tparam BLOCK_STRIPS When strip-mining, the number of threadblock-strips per tile + */ +template < + typename T, + int BLOCK_THREADS, + int BLOCK_STRIPS = 1> +struct BlockRakingLayout +{ + //--------------------------------------------------------------------- + // Constants and typedefs + //--------------------------------------------------------------------- + + enum + { + /// The total number of elements that need to be cooperatively reduced + SHARED_ELEMENTS = + BLOCK_THREADS * BLOCK_STRIPS, + + /// Maximum number of warp-synchronous raking threads + MAX_RAKING_THREADS = + CUB_MIN(BLOCK_THREADS, PtxArchProps::WARP_THREADS), + + /// Number of raking elements per warp-synchronous raking thread (rounded up) + SEGMENT_LENGTH = + (SHARED_ELEMENTS + MAX_RAKING_THREADS - 1) / MAX_RAKING_THREADS, + + /// Never use a raking thread that will have no valid data (e.g., when BLOCK_THREADS is 62 and SEGMENT_LENGTH is 2, we should only use 31 raking threads) + RAKING_THREADS = + (SHARED_ELEMENTS + SEGMENT_LENGTH - 1) / SEGMENT_LENGTH, + + /// Pad each segment length with one element if it evenly divides the number of banks + SEGMENT_PADDING = + (PtxArchProps::SMEM_BANKS % SEGMENT_LENGTH == 0) ? 1 : 0, + + /// Total number of elements in the raking grid + GRID_ELEMENTS = + RAKING_THREADS * (SEGMENT_LENGTH + SEGMENT_PADDING), + + /// Whether or not we need bounds checking during raking (the number of reduction elements is not a multiple of the warp size) + UNGUARDED = + (SHARED_ELEMENTS % RAKING_THREADS == 0), + }; + + + /** + * \brief Shared memory storage type + */ + typedef T TempStorage[BlockRakingLayout::GRID_ELEMENTS]; + + + /** + * \brief Returns the location for the calling thread to place data into the grid + */ + static __device__ __forceinline__ T* PlacementPtr( + TempStorage &temp_storage, + int linear_tid, + int block_strip = 0) + { + // Offset for partial + unsigned int offset = (block_strip * BLOCK_THREADS) + linear_tid; + + // Add in one padding element for every segment + if (SEGMENT_PADDING > 0) + { + offset += offset / SEGMENT_LENGTH; + } + + // Incorporating a block of padding partials every shared memory segment + return temp_storage + offset; + } + + + /** + * \brief Returns the location for the calling thread to begin sequential raking + */ + static __device__ __forceinline__ T* RakingPtr( + TempStorage &temp_storage, + int linear_tid) + { + return temp_storage + (linear_tid * (SEGMENT_LENGTH + SEGMENT_PADDING)); + } +}; + +} // CUB namespace +CUB_NS_POSTFIX // Optional outer namespace(s) + diff --git a/lib/kokkos/TPL/cub/block/block_reduce.cuh b/lib/kokkos/TPL/cub/block/block_reduce.cuh new file mode 100644 index 0000000000..ffdff73775 --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_reduce.cuh @@ -0,0 +1,563 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. 
+ * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * The cub::BlockReduce class provides [collective](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block. + */ + +#pragma once + +#include "specializations/block_reduce_raking.cuh" +#include "specializations/block_reduce_warp_reductions.cuh" +#include "../util_type.cuh" +#include "../thread/thread_operators.cuh" +#include "../util_namespace.cuh" + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + + + +/****************************************************************************** + * Algorithmic variants + ******************************************************************************/ + +/** + * BlockReduceAlgorithm enumerates alternative algorithms for parallel + * reduction across a CUDA threadblock. + */ +enum BlockReduceAlgorithm +{ + + /** + * \par Overview + * An efficient "raking" reduction algorithm. Execution is comprised of + * three phases: + * -# Upsweep sequential reduction in registers (if threads contribute more + * than one input each). Each thread then places the partial reduction + * of its item(s) into shared memory. + * -# Upsweep sequential reduction in shared memory. Threads within a + * single warp rake across segments of shared partial reductions. + * -# A warp-synchronous Kogge-Stone style reduction within the raking warp. + * + * \par + * \image html block_reduce.png + * \p BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.+ * + * \par Performance Considerations + * - Although this variant may suffer longer turnaround latencies when the + * GPU is under-occupied, it can often provide higher overall throughput + * across the GPU when suitably occupied. + */ + BLOCK_REDUCE_RAKING, + + + /** + * \par Overview + * A quick "tiled warp-reductions" reduction algorithm. Execution is + * comprised of four phases: + * -# Upsweep sequential reduction in registers (if threads contribute more + * than one input each). Each thread then places the partial reduction + * of its item(s) into shared memory. + * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style + * reduction within each warp. + * -# A propagation phase where the warp reduction outputs in each warp are + * updated with the aggregate from each preceding warp. 
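+ *
+ * \par
+ * The per-warp reductions in the second phase follow the usual
+ * warp-synchronous idiom, sketched below for a full 32-thread warp on
+ * SM30+ hardware (an illustration of the general idea only, not the code
+ * used by the BlockReduceWarpReductions specialization):
+ * \code
+ * // After the loop, lane 0 of the warp holds the warp-wide sum.
+ * int warp_aggregate = thread_partial;
+ * for (int offset = 16; offset > 0; offset /= 2)
+ *     warp_aggregate += __shfl_down(warp_aggregate, offset);
+ * \endcode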
+ *
+ * \par
+ * \image html block_scan_warpscans.png
+ * \p BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.
+ *
+ * \par Performance Considerations
+ * - Although this variant may suffer lower overall throughput across the
+ *   GPU due to a heavy reliance on inefficient warp-reductions, it
+ *   can often provide lower turnaround latencies when the GPU is
+ *   under-occupied.
+ */
+    BLOCK_REDUCE_WARP_REDUCTIONS,
+};
+
+
+/******************************************************************************
+ * Block reduce
+ ******************************************************************************/
+
+/**
+ * \brief The BlockReduce class provides [collective](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block.
+ * \ingroup BlockModule
+ *
+ * \par Overview
+ * A reduction (or fold)
+ * uses a binary combining operator to compute a single aggregate from a list of input elements.
+ *
+ * \par
+ * Optionally, BlockReduce can be specialized by algorithm to accommodate different latency/throughput workload profiles:
+ *   -# cub::BLOCK_REDUCE_RAKING.  An efficient "raking" reduction algorithm. [More...](\ref cub::BlockReduceAlgorithm)
+ *   -# cub::BLOCK_REDUCE_WARP_REDUCTIONS.  A quick "tiled warp-reductions" reduction algorithm. [More...](\ref cub::BlockReduceAlgorithm)
+ *
+ * \tparam T                Data type being reduced
+ * \tparam BLOCK_THREADS    The thread block size in threads
+ * \tparam ALGORITHM        [optional] cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_RAKING)
+ *
+ * \par Performance Considerations
+ * - Very efficient (only one synchronization barrier).
+ * - Zero bank conflicts for most types.
+ * - Computation is slightly more efficient (i.e., having lower instruction overhead) for:
+ *   - Summation (vs. generic reduction)
+ *   - \p BLOCK_THREADS is a multiple of the architecture's warp size
+ *   - Every thread has a valid input (i.e., full vs. partial-tiles)
+ * - See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives
+ *
+ * \par A Simple Example
+ * \blockcollective{BlockReduce}
+ * \par
+ * The code snippet below illustrates a sum reduction of 512 integer items that
+ * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads
+ * where each thread owns 4 consecutive items.
+ * \par
+ * \code
+ * #include <cub/cub.cuh>
+ *
+ * __global__ void ExampleKernel(...)
+ * {
+ *     // Specialize BlockReduce for 128 threads on type int
+ *     typedef cub::BlockReduce<int, 128> BlockReduce;
+ *
+ *     // Allocate shared memory for BlockReduce
+ *     __shared__ typename BlockReduce::TempStorage temp_storage;
+ *
+ *     // Obtain a segment of consecutive items that are blocked across threads
+ *     int thread_data[4];
+ *     ...
+ *
+ *     // Compute the block-wide sum for thread0
+ *     int aggregate = BlockReduce(temp_storage).Sum(thread_data);
+ *
+ * \endcode
+ *
+ */
+template <
+    typename                T,
+    int                     BLOCK_THREADS,
+    BlockReduceAlgorithm    ALGORITHM = BLOCK_REDUCE_RAKING>
+class BlockReduce
+{
+private:
+
+    /******************************************************************************
+     * Constants and typedefs
+     ******************************************************************************/
+
+    /// Internal specialization.
+ typedef typename If<(ALGORITHM == BLOCK_REDUCE_WARP_REDUCTIONS), + BlockReduceWarpReductions , + BlockReduceRaking >::Type InternalBlockReduce; + + /// Shared memory storage layout type for BlockReduce + typedef typename InternalBlockReduce::TempStorage _TempStorage; + + + /****************************************************************************** + * Utility methods + ******************************************************************************/ + + /// Internal storage allocator + __device__ __forceinline__ _TempStorage& PrivateStorage() + { + __shared__ _TempStorage private_storage; + return private_storage; + } + + + /****************************************************************************** + * Thread fields + ******************************************************************************/ + + /// Shared storage reference + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + + +public: + + /// \smemstorage{BlockReduce} + struct TempStorage : Uninitialized<_TempStorage> {}; + + + /******************************************************************//** + * \name Collective constructors + *********************************************************************/ + //@{ + + /** + * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockReduce() + : + temp_storage(PrivateStorage()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockReduce( + TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage + : + temp_storage(temp_storage.Alias()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier + */ + __device__ __forceinline__ BlockReduce( + int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(PrivateStorage()), + linear_tid(linear_tid) + {} + + + /** + * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. + */ + __device__ __forceinline__ BlockReduce( + TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage + int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + + + //@} end member group + /******************************************************************//** + * \name Generic reductions + *********************************************************************/ + //@{ + + + /** + * \brief Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element. + * + * The return value is undefined in threads other than thread0. + * + * Supports non-commutative reduction operators. + * + * \smemreuse + * + * The code snippet below illustrates a max reduction of 128 integer items that + * are partitioned across 128 threads. 
+ * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockReduce for 128 threads on type int + * typedef cub::BlockReduce BlockReduce; + * + * // Allocate shared memory for BlockReduce + * __shared__ typename BlockReduce::TempStorage temp_storage; + * + * // Each thread obtains an input item + * int thread_data; + * ... + * + * // Compute the block-wide max for thread0 + * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max()); + * + * \endcode + * + * \tparam ReductionOp [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) + */ + template + __device__ __forceinline__ T Reduce( + T input, ///< [in] Calling thread's input + ReductionOp reduction_op) ///< [in] Binary reduction operator + { + return InternalBlockReduce(temp_storage, linear_tid).template Reduce (input, BLOCK_THREADS, reduction_op); + } + + + /** + * \brief Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements. + * + * The return value is undefined in threads other than thread0. + * + * Supports non-commutative reduction operators. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates a max reduction of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockReduce for 128 threads on type int + * typedef cub::BlockReduce BlockReduce; + * + * // Allocate shared memory for BlockReduce + * __shared__ typename BlockReduce::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Compute the block-wide max for thread0 + * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max()); + * + * \endcode + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam ReductionOp [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) + */ + template < + int ITEMS_PER_THREAD, + typename ReductionOp> + __device__ __forceinline__ T Reduce( + T (&inputs)[ITEMS_PER_THREAD], ///< [in] Calling thread's input segment + ReductionOp reduction_op) ///< [in] Binary reduction operator + { + // Reduce partials + T partial = ThreadReduce(inputs, reduction_op); + return Reduce(partial, reduction_op); + } + + + /** + * \brief Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first \p num_valid threads each contribute one input element. + * + * The return value is undefined in threads other than thread0. + * + * Supports non-commutative reduction operators. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates a max reduction of a partially-full tile of integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int num_valid, ...) + * { + * // Specialize BlockReduce for 128 threads on type int + * typedef cub::BlockReduce BlockReduce; + * + * // Allocate shared memory for BlockReduce + * __shared__ typename BlockReduce::TempStorage temp_storage; + * + * // Each thread obtains an input item + * int thread_data; + * if (threadIdx.x < num_valid) thread_data = ... 
+ * + * // Compute the block-wide max for thread0 + * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max(), num_valid); + * + * \endcode + * + * \tparam ReductionOp [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) + */ + template + __device__ __forceinline__ T Reduce( + T input, ///< [in] Calling thread's input + ReductionOp reduction_op, ///< [in] Binary reduction operator + int num_valid) ///< [in] Number of threads containing valid elements (may be less than BLOCK_THREADS) + { + // Determine if we scan skip bounds checking + if (num_valid >= BLOCK_THREADS) + { + return InternalBlockReduce(temp_storage, linear_tid).template Reduce (input, num_valid, reduction_op); + } + else + { + return InternalBlockReduce(temp_storage, linear_tid).template Reduce (input, num_valid, reduction_op); + } + } + + + //@} end member group + /******************************************************************//** + * \name Summation reductions + *********************************************************************/ + //@{ + + + /** + * \brief Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element. + * + * The return value is undefined in threads other than thread0. + * + * \smemreuse + * + * The code snippet below illustrates a sum reduction of 128 integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockReduce for 128 threads on type int + * typedef cub::BlockReduce BlockReduce; + * + * // Allocate shared memory for BlockReduce + * __shared__ typename BlockReduce::TempStorage temp_storage; + * + * // Each thread obtains an input item + * int thread_data; + * ... + * + * // Compute the block-wide sum for thread0 + * int aggregate = BlockReduce(temp_storage).Sum(thread_data); + * + * \endcode + * + */ + __device__ __forceinline__ T Sum( + T input) ///< [in] Calling thread's input + { + return InternalBlockReduce(temp_storage, linear_tid).template Sum (input, BLOCK_THREADS); + } + + /** + * \brief Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements. + * + * The return value is undefined in threads other than thread0. + * + * \smemreuse + * + * The code snippet below illustrates a sum reduction of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockReduce for 128 threads on type int + * typedef cub::BlockReduce BlockReduce; + * + * // Allocate shared memory for BlockReduce + * __shared__ typename BlockReduce::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Compute the block-wide sum for thread0 + * int aggregate = BlockReduce(temp_storage).Sum(thread_data); + * + * \endcode + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
+ */ + template + __device__ __forceinline__ T Sum( + T (&inputs)[ITEMS_PER_THREAD]) ///< [in] Calling thread's input segment + { + // Reduce partials + T partial = ThreadReduce(inputs, cub::Sum()); + return Sum(partial); + } + + + /** + * \brief Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first \p num_valid threads each contribute one input element. + * + * The return value is undefined in threads other than thread0. + * + * \smemreuse + * + * The code snippet below illustrates a sum reduction of a partially-full tile of integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(int num_valid, ...) + * { + * // Specialize BlockReduce for 128 threads on type int + * typedef cub::BlockReduce BlockReduce; + * + * // Allocate shared memory for BlockReduce + * __shared__ typename BlockReduce::TempStorage temp_storage; + * + * // Each thread obtains an input item (up to num_items) + * int thread_data; + * if (threadIdx.x < num_valid) + * thread_data = ... + * + * // Compute the block-wide sum for thread0 + * int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid); + * + * \endcode + * + */ + __device__ __forceinline__ T Sum( + T input, ///< [in] Calling thread's input + int num_valid) ///< [in] Number of threads containing valid elements (may be less than BLOCK_THREADS) + { + // Determine if we scan skip bounds checking + if (num_valid >= BLOCK_THREADS) + { + return InternalBlockReduce(temp_storage, linear_tid).template Sum (input, num_valid); + } + else + { + return InternalBlockReduce(temp_storage, linear_tid).template Sum (input, num_valid); + } + } + + + //@} end member group +}; + +} // CUB namespace +CUB_NS_POSTFIX // Optional outer namespace(s) + diff --git a/lib/kokkos/TPL/cub/block/block_scan.cuh b/lib/kokkos/TPL/cub/block/block_scan.cuh new file mode 100644 index 0000000000..1c1a2dac81 --- /dev/null +++ b/lib/kokkos/TPL/cub/block/block_scan.cuh @@ -0,0 +1,2233 @@ +/****************************************************************************** + * Copyright (c) 2011, Duane Merrill. All rights reserved. + * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * * Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * * Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * * Neither the name of the NVIDIA CORPORATION nor the + * names of its contributors may be used to endorse or promote products + * derived from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED + * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY + * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES + * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; + * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + ******************************************************************************/ + +/** + * \file + * The cub::BlockScan class provides [collective](index.html#sec0) methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. + */ + +#pragma once + +#include "specializations/block_scan_raking.cuh" +#include "specializations/block_scan_warp_scans.cuh" +#include "../util_arch.cuh" +#include "../util_type.cuh" +#include "../util_namespace.cuh" + +/// Optional outer namespace(s) +CUB_NS_PREFIX + +/// CUB namespace +namespace cub { + + +/****************************************************************************** + * Algorithmic variants + ******************************************************************************/ + +/** + * \brief BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block. + */ +enum BlockScanAlgorithm +{ + + /** + * \par Overview + * An efficient "raking reduce-then-scan" prefix scan algorithm. Execution is comprised of five phases: + * -# Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory. + * -# Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions. + * -# A warp-synchronous Kogge-Stone style exclusive scan within the raking warp. + * -# Downsweep sequential exclusive scan in shared memory. Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output. + * -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output. + * + * \par + * \image html block_scan_raking.png + * \p BLOCK_SCAN_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.+ * + * \par Performance Considerations + * - Although this variant may suffer longer turnaround latencies when the + * GPU is under-occupied, it can often provide higher overall throughput + * across the GPU when suitably occupied. + */ + BLOCK_SCAN_RAKING, + + + /** + * \par Overview + * Similar to cub::BLOCK_SCAN_RAKING, but with fewer shared memory reads at + * the expense of higher register pressure. Raking threads preserve their + * "upsweep" segment of values in registers while performing warp-synchronous + * scan, allowing the "downsweep" not to re-read them from shared memory. + */ + BLOCK_SCAN_RAKING_MEMOIZE, + + + /** + * \par Overview + * A quick "tiled warpscans" prefix scan algorithm. Execution is comprised of four phases: + * -# Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory. + * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style scan within each warp. 
+ * -# A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp. + * -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output. + * + * \par + * \image html block_scan_warpscans.png + *\p BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.+ * + * \par Performance Considerations + * - Although this variant may suffer lower overall throughput across the + * GPU because due to a heavy reliance on inefficient warpscans, it can + * often provide lower turnaround latencies when the GPU is under-occupied. + */ + BLOCK_SCAN_WARP_SCANS, +}; + + +/****************************************************************************** + * Block scan + ******************************************************************************/ + +/** + * \brief The BlockScan class provides [collective](index.html#sec0) methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.  + * \ingroup BlockModule + * + * \par Overview + * Given a list of input elements and a binary reduction operator, a [prefix scan](http://en.wikipedia.org/wiki/Prefix_sum) + * produces an output list where each element is computed to be the reduction + * of the elements occurring earlier in the input list. Prefix sum + * connotes a prefix scan with the addition operator. The term \em inclusive indicates + * that the ith output reduction incorporates the ith input. + * The term \em exclusive indicates the ith input is not incorporated into + * the ith output reduction. + * + * \par + * Optionally, BlockScan can be specialized by algorithm to accommodate different latency/throughput workload profiles: + * -# cub::BLOCK_SCAN_RAKING. An efficient "raking reduce-then-scan" prefix scan algorithm. [More...](\ref cub::BlockScanAlgorithm) + * -# cub::BLOCK_SCAN_WARP_SCANS. A quick "tiled warpscans" prefix scan algorithm. [More...](\ref cub::BlockScanAlgorithm) + * + * \tparam T Data type being scanned + * \tparam BLOCK_THREADS The thread block size in threads + * \tparam ALGORITHM [optional] cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING) + * + * \par A Simple Example + * \blockcollective{BlockScan} + * \par + * The code snippet below illustrates an exclusive prefix sum of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include+ * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute the block-wide exclusive prefix sum + * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is + * { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. + * The corresponding output \p thread_data in those threads will be + * { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. 
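+ * \par
+ * As a minimal sketch (assuming the cub.cuh umbrella header), the alternative scan
+ * algorithm can also be requested explicitly through the optional third template
+ * parameter:
+ * \par
+ * \code
+ * #include <cub/cub.cuh>
+ *
+ * __global__ void ExampleKernel(...)
+ * {
+ *     // Specialize BlockScan for 128 threads on type int, using the "tiled warpscans" variant
+ *     typedef cub::BlockScan<int, 128, cub::BLOCK_SCAN_WARP_SCANS> BlockScan;
+ *
+ *     // Allocate shared memory for BlockScan
+ *     __shared__ typename BlockScan::TempStorage temp_storage;
+ *
+ *     // Each thread contributes one input item
+ *     int thread_data = threadIdx.x;
+ *
+ *     // Collectively compute the block-wide exclusive prefix sum
+ *     BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);
+ * }
+ * \endcode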
+ * + * \par Performance Considerations + * - Uses special instructions when applicable (e.g., warp \p SHFL) + * - Uses synchronization-free communication between warp lanes when applicable + * - Uses only one or two block-wide synchronization barriers (depending on + * algorithm selection) + * - Zero bank conflicts for most types + * - Computation is slightly more efficient (i.e., having lower instruction overhead) for: + * - Prefix sum variants (vs. generic scan) + * - Exclusive variants (vs. inclusive) + * - \p BLOCK_THREADS is a multiple of the architecture's warp size + * - See cub::BlockScanAlgorithm for performance details regarding algorithmic alternatives + * + */ +template < + typename T, + int BLOCK_THREADS, + BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING> +class BlockScan +{ +private: + + /****************************************************************************** + * Constants and typedefs + ******************************************************************************/ + + /** + * Ensure the template parameterization meets the requirements of the + * specified algorithm. Currently, the BLOCK_SCAN_WARP_SCANS policy + * cannot be used with threadblock sizes not a multiple of the + * architectural warp size. + */ + static const BlockScanAlgorithm SAFE_ALGORITHM = + ((ALGORITHM == BLOCK_SCAN_WARP_SCANS) && (BLOCK_THREADS % PtxArchProps::WARP_THREADS != 0)) ? + BLOCK_SCAN_RAKING : + ALGORITHM; + + /// Internal specialization. + typedef typename If<(SAFE_ALGORITHM == BLOCK_SCAN_WARP_SCANS), + BlockScanWarpScans , + BlockScanRaking >::Type InternalBlockScan; + + + /// Shared memory storage layout type for BlockScan + typedef typename InternalBlockScan::TempStorage _TempStorage; + + + /****************************************************************************** + * Thread fields + ******************************************************************************/ + + /// Shared storage reference + _TempStorage &temp_storage; + + /// Linear thread-id + int linear_tid; + + + /****************************************************************************** + * Utility methods + ******************************************************************************/ + + /// Internal storage allocator + __device__ __forceinline__ _TempStorage& PrivateStorage() + { + __shared__ _TempStorage private_storage; + return private_storage; + } + + +public: + + /// \smemstorage{BlockScan} + struct TempStorage : Uninitialized<_TempStorage> {}; + + + /******************************************************************//** + * \name Collective constructors + *********************************************************************/ + //@{ + + /** + * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockScan() + : + temp_storage(PrivateStorage()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. + */ + __device__ __forceinline__ BlockScan( + TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage + : + temp_storage(temp_storage.Alias()), + linear_tid(threadIdx.x) + {} + + + /** + * \brief Collective constructor using a private static allocation of shared memory as temporary storage. 
Each thread is identified using the supplied linear thread identifier + */ + __device__ __forceinline__ BlockScan( + int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(PrivateStorage()), + linear_tid(linear_tid) + {} + + + /** + * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. + */ + __device__ __forceinline__ BlockScan( + TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage + int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) + : + temp_storage(temp_storage.Alias()), + linear_tid(linear_tid) + {} + + + + //@} end member group + /******************************************************************//** + * \name Exclusive prefix sum operations + *********************************************************************/ + //@{ + + + /** + * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an exclusive prefix sum of 128 integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain input item for each thread + * int thread_data; + * ... + * + * // Collectively compute the block-wide exclusive prefix sum + * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is 1, 1, ..., 1. The + * corresponding output \p thread_data in those threads will be 0, 1, ..., 127. + * + */ + __device__ __forceinline__ void ExclusiveSum( + T input, ///< [in] Calling thread's input item + T &output) ///< [out] Calling thread's output item (may be aliased to \p input) + { + T block_aggregate; + InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an exclusive prefix sum of 128 integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain input item for each thread + * int thread_data; + * ... + * + * // Collectively compute the block-wide exclusive prefix sum + * int block_aggregate; + * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is 1, 1, ..., 1. 
The + * corresponding output \p thread_data in those threads will be 0, 1, ..., 127. + * Furthermore the value \p 128 will be stored in \p block_aggregate for all threads. + * + */ + __device__ __forceinline__ void ExclusiveSum( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + T &block_aggregate) ///< [out] block-wide aggregate reduction of input items + { + InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). + * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. + * The functor will be invoked by the first warp of threads in the block, however only the return value from + * lane0 is applied as the block-wide prefix. Can be stateful. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates a single thread block that progressively + * computes an exclusive prefix sum over multiple "tiles" of input using a + * prefix functor to maintain a running total between block-wide scans. Each tile consists + * of 128 integer items that are partitioned across 128 threads. + * \par + * \code + * #include + * + * // A stateful callback functor that maintains a running prefix to be applied + * // during consecutive scan operations. + * struct BlockPrefixOp + * { + * // Running prefix + * int running_total; + * + * // Constructor + * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} + * + * // Callback operator to be entered by the first warp of threads in the block. + * // Thread-0 is responsible for returning a value for seeding the block-wide scan. + * __device__ int operator()(int block_aggregate) + * { + * int old_prefix = running_total; + * running_total += block_aggregate; + * return old_prefix; + * } + * }; + * + * __global__ void ExampleKernel(int *d_data, int num_items, ...) + * { + * // Specialize BlockScan for 128 threads + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Initialize running total + * BlockPrefixOp prefix_op(0); + * + * // Have the block iterate over segments of items + * for (int block_offset = 0; block_offset < num_items; block_offset += 128) + * { + * // Load a segment of consecutive items that are blocked across threads + * int thread_data = d_data[block_offset]; + * + * // Collectively compute the block-wide exclusive prefix sum + * int block_aggregate; + * BlockScan(temp_storage).ExclusiveSum( + * thread_data, thread_data, block_aggregate, prefix_op); + * __syncthreads(); + * + * // Store scanned items to output segment + * d_data[block_offset] = thread_data; + * } + * \endcode + * \par + * Suppose the input \p d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... + * The corresponding output for the first segment will be 0, 1, ..., 127. 
+ * The output for the second segment will be 128, 129, ..., 255. Furthermore, + * the value \p 128 will be stored in \p block_aggregate for all threads after each scan. + * + * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) + */ + template + __device__ __forceinline__ void ExclusiveSum( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) + BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. + { + InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate, block_prefix_op); + } + + + //@} end member group + /******************************************************************//** + * \name Exclusive prefix sum operations (multiple data per thread) + *********************************************************************/ + //@{ + + + /** + * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an exclusive prefix sum of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute the block-wide exclusive prefix sum + * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The + * corresponding output \p thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + */ + template + __device__ __forceinline__ void ExclusiveSum( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD]) ///< [out] Calling thread's output items (may be aliased to \p input) + { + // Reduce consecutive thread items in registers + Sum scan_op; + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveSum(thread_partial, thread_partial); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an exclusive prefix sum of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. 
+ * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute the block-wide exclusive prefix sum + * int block_aggregate; + * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The + * corresponding output \p thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. + * Furthermore the value \p 512 will be stored in \p block_aggregate for all threads. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + */ + template + __device__ __forceinline__ void ExclusiveSum( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) + T &block_aggregate) ///< [out] block-wide aggregate reduction of input items + { + // Reduce consecutive thread items in registers + Sum scan_op; + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveSum(thread_partial, thread_partial, block_aggregate); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). + * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. + * The functor will be invoked by the first warp of threads in the block, however only the return value from + * lane0 is applied as the block-wide prefix. Can be stateful. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates a single thread block that progressively + * computes an exclusive prefix sum over multiple "tiles" of input using a + * prefix functor to maintain a running total between block-wide scans. Each tile consists + * of 512 integer items that are partitioned in a [blocked arrangement](index.html#sec5sec4) + * across 128 threads where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * // A stateful callback functor that maintains a running prefix to be applied + * // during consecutive scan operations. + * struct BlockPrefixOp + * { + * // Running prefix + * int running_total; + * + * // Constructor + * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} + * + * // Callback operator to be entered by the first warp of threads in the block. + * // Thread-0 is responsible for returning a value for seeding the block-wide scan. 
+ * __device__ int operator()(int block_aggregate) + * { + * int old_prefix = running_total; + * running_total += block_aggregate; + * return old_prefix; + * } + * }; + * + * __global__ void ExampleKernel(int *d_data, int num_items, ...) + * { + * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread + * typedef cub::BlockLoad BlockLoad; + * typedef cub::BlockStore BlockStore; + * typedef cub::BlockScan BlockScan; + * + * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan + * __shared__ union { + * typename BlockLoad::TempStorage load; + * typename BlockScan::TempStorage scan; + * typename BlockStore::TempStorage store; + * } temp_storage; + * + * // Initialize running total + * BlockPrefixOp prefix_op(0); + * + * // Have the block iterate over segments of items + * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) + * { + * // Load a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); + * __syncthreads(); + * + * // Collectively compute the block-wide exclusive prefix sum + * int block_aggregate; + * BlockScan(temp_storage.scan).ExclusiveSum( + * thread_data, thread_data, block_aggregate, prefix_op); + * __syncthreads(); + * + * // Store scanned items to output segment + * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); + * __syncthreads(); + * } + * \endcode + * \par + * Suppose the input \p d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... + * The corresponding output for the first segment will be 0, 1, 2, 3, ..., 510, 511. + * The output for the second segment will be 512, 513, 514, 515, ..., 1022, 1023. Furthermore, + * the value \p 512 will be stored in \p block_aggregate for all threads after each scan. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) + */ + template < + int ITEMS_PER_THREAD, + typename BlockPrefixOp> + __device__ __forceinline__ void ExclusiveSum( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) + T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) + BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. + { + // Reduce consecutive thread items in registers + Sum scan_op; + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveSum(thread_partial, thread_partial, block_aggregate, block_prefix_op); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial); + } + + + + //@} end member group // Inclusive prefix sums + /******************************************************************//** + * \name Exclusive prefix scan operations + *********************************************************************/ + //@{ + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. + * + * Supports non-commutative scan operators. 
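+ * \par
+ * Any functor exposing a member T operator()(const T &a, const T &b) can serve
+ * as the scan operator.  As a minimal sketch (a hypothetical user-defined operator,
+ * shown for illustration alongside the library-supplied cub::Max used below), an
+ * integer maximum could be written as:
+ * \code
+ * // Hypothetical user-defined scan operator returning the larger of two values
+ * struct CustomMax
+ * {
+ *     __device__ __forceinline__ int operator()(const int &a, const int &b) const
+ *     {
+ *         return (b > a) ? b : a;
+ *     }
+ * };
+ * \endcode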
+ * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an exclusive prefix max scan of 128 integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain input item for each thread + * int thread_data; + * ... + * + * // Collectively compute the block-wide exclusive prefix max scan + * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max()); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The + * corresponding output \p thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. + * + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ + template + __device__ __forceinline__ void ExclusiveScan( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + T identity, ///< [in] Identity value + ScanOp scan_op) ///< [in] Binary scan operator + { + T block_aggregate; + InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an exclusive prefix max scan of 128 integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain input item for each thread + * int thread_data; + * ... + * + * // Collectively compute the block-wide exclusive prefix max scan + * int block_aggregate; + * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The + * corresponding output \p thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. + * Furthermore the value \p 126 will be stored in \p block_aggregate for all threads. + * + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ + template + __device__ __forceinline__ void ExclusiveScan( + T input, ///< [in] Calling thread's input items + T &output, ///< [out] Calling thread's output items (may be aliased to \p input) + const T &identity, ///< [in] Identity value + ScanOp scan_op, ///< [in] Binary scan operator + T &block_aggregate) ///< [out] block-wide aggregate reduction of input items + { + InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. 
Each thread contributes one input element. the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). + * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. + * The functor will be invoked by the first warp of threads in the block, however only the return value from + * lane0 is applied as the block-wide prefix. Can be stateful. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates a single thread block that progressively + * computes an exclusive prefix max scan over multiple "tiles" of input using a + * prefix functor to maintain a running total between block-wide scans. Each tile consists + * of 128 integer items that are partitioned across 128 threads. + * \par + * \code + * #include + * + * // A stateful callback functor that maintains a running prefix to be applied + * // during consecutive scan operations. + * struct BlockPrefixOp + * { + * // Running prefix + * int running_total; + * + * // Constructor + * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} + * + * // Callback operator to be entered by the first warp of threads in the block. + * // Thread-0 is responsible for returning a value for seeding the block-wide scan. + * __device__ int operator()(int block_aggregate) + * { + * int old_prefix = running_total; + * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; + * return old_prefix; + * } + * }; + * + * __global__ void ExampleKernel(int *d_data, int num_items, ...) + * { + * // Specialize BlockScan for 128 threads + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Initialize running total + * BlockPrefixOp prefix_op(INT_MIN); + * + * // Have the block iterate over segments of items + * for (int block_offset = 0; block_offset < num_items; block_offset += 128) + * { + * // Load a segment of consecutive items that are blocked across threads + * int thread_data = d_data[block_offset]; + * + * // Collectively compute the block-wide exclusive prefix max scan + * int block_aggregate; + * BlockScan(temp_storage).ExclusiveScan( + * thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate, prefix_op); + * __syncthreads(); + * + * // Store scanned items to output segment + * d_data[block_offset] = thread_data; + * } + * \endcode + * \par + * Suppose the input \p d_data is 0, -1, 2, -3, 4, -5, .... + * The corresponding output for the first segment will be INT_MIN, 0, 0, 2, ..., 124, 126. + * The output for the second segment will be 126, 128, 128, 130, ..., 252, 254. Furthermore, + * \p block_aggregate will be assigned \p 126 in all threads after the first scan, assigned \p 254 after the second + * scan, etc. 
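+ * \par
+ * The prefix functor need not carry running state.  As a sketch (a hypothetical
+ * functor, not one provided by the library), a fixed seed could be applied to
+ * every tile:
+ * \code
+ * // Hypothetical callback functor that seeds each tile's scan with the same prefix
+ * struct FixedPrefixOp
+ * {
+ *     int seed;
+ *
+ *     __device__ FixedPrefixOp(int seed) : seed(seed) {}
+ *
+ *     // Ignores the tile aggregate and always returns the constant seed
+ *     __device__ int operator()(int block_aggregate)
+ *     {
+ *         return seed;
+ *     }
+ * };
+ * \endcode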
+ * + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) + */ + template < + typename ScanOp, + typename BlockPrefixOp> + __device__ __forceinline__ void ExclusiveScan( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + T identity, ///< [in] Identity value + ScanOp scan_op, ///< [in] Binary scan operator + T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) + BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. + { + InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate, block_prefix_op); + } + + + //@} end member group // Inclusive prefix sums + /******************************************************************//** + * \name Exclusive prefix scan operations (multiple data per thread) + *********************************************************************/ + //@{ + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an exclusive prefix max scan of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute the block-wide exclusive prefix max scan + * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max()); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is + * { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. + * The corresponding output \p thread_data in those threads will be + * { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
+ * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ + template < + int ITEMS_PER_THREAD, + typename ScanOp> + __device__ __forceinline__ void ExclusiveScan( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) + const T &identity, ///< [in] Identity value + ScanOp scan_op) ///< [in] Binary scan operator + { + // Reduce consecutive thread items in registers + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveScan(thread_partial, thread_partial, identity, scan_op); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an exclusive prefix max scan of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute the block-wide exclusive prefix max scan + * int block_aggregate; + * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The + * corresponding output \p thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. + * Furthermore the value \p 510 will be stored in \p block_aggregate for all threads. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ + template < + int ITEMS_PER_THREAD, + typename ScanOp> + __device__ __forceinline__ void ExclusiveScan( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) + const T &identity, ///< [in] Identity value + ScanOp scan_op, ///< [in] Binary scan operator + T &block_aggregate) ///< [out] block-wide aggregate reduction of input items + { + // Reduce consecutive thread items in registers + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveScan(thread_partial, thread_partial, identity, scan_op, block_aggregate); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. 
the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). + * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. + * The functor will be invoked by the first warp of threads in the block, however only the return value from + * lane0 is applied as the block-wide prefix. Can be stateful. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates a single thread block that progressively + * computes an exclusive prefix max scan over multiple "tiles" of input using a + * prefix functor to maintain a running total between block-wide scans. Each tile consists + * of 128 integer items that are partitioned across 128 threads. + * \par + * \code + * #include + * + * // A stateful callback functor that maintains a running prefix to be applied + * // during consecutive scan operations. + * struct BlockPrefixOp + * { + * // Running prefix + * int running_total; + * + * // Constructor + * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} + * + * // Callback operator to be entered by the first warp of threads in the block. + * // Thread-0 is responsible for returning a value for seeding the block-wide scan. + * __device__ int operator()(int block_aggregate) + * { + * int old_prefix = running_total; + * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; + * return old_prefix; + * } + * }; + * + * __global__ void ExampleKernel(int *d_data, int num_items, ...) + * { + * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread + * typedef cub::BlockLoad BlockLoad; + * typedef cub::BlockStore BlockStore; + * typedef cub::BlockScan BlockScan; + * + * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan + * __shared__ union { + * typename BlockLoad::TempStorage load; + * typename BlockScan::TempStorage scan; + * typename BlockStore::TempStorage store; + * } temp_storage; + * + * // Initialize running total + * BlockPrefixOp prefix_op(0); + * + * // Have the block iterate over segments of items + * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) + * { + * // Load a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); + * __syncthreads(); + * + * // Collectively compute the block-wide exclusive prefix max scan + * int block_aggregate; + * BlockScan(temp_storage.scan).ExclusiveScan( + * thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate, prefix_op); + * __syncthreads(); + * + * // Store scanned items to output segment + * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); + * __syncthreads(); + * } + * \endcode + * \par + * Suppose the input \p d_data is 0, -1, 2, -3, 4, -5, .... + * The corresponding output for the first segment will be INT_MIN, 0, 0, 2, 2, 4, ..., 508, 510. + * The output for the second segment will be 510, 512, 512, 514, 514, 516, ..., 1020, 1022. 
Furthermore, + * \p block_aggregate will be assigned \p 510 in all threads after the first scan, assigned \p 1022 after the second + * scan, etc. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) + */ + template < + int ITEMS_PER_THREAD, + typename ScanOp, + typename BlockPrefixOp> + __device__ __forceinline__ void ExclusiveScan( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) + T identity, ///< [in] Identity value + ScanOp scan_op, ///< [in] Binary scan operator + T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) + BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. + { + // Reduce consecutive thread items in registers + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveScan(thread_partial, thread_partial, identity, scan_op, block_aggregate, block_prefix_op); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial); + } + + + //@} end member group + +#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document + + /******************************************************************//** + * \name Exclusive prefix scan operations (identityless, single datum per thread) + *********************************************************************/ + //@{ + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. With no identity value, the output computed for thread0 is undefined. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ + template + __device__ __forceinline__ void ExclusiveScan( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + ScanOp scan_op) ///< [in] Binary scan operator + { + T block_aggregate; + InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. With no identity value, the output computed for thread0 is undefined. + * + * Supports non-commutative scan operators. 
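+ * \par
+ * A minimal usage sketch (assuming the BlockScan specialization, \p temp_storage, and
+ * \p thread_data from the documented examples above; the thread0 result is discarded
+ * because it is undefined here):
+ * \code
+ * int exclusive_max;
+ * int block_aggregate;
+ * BlockScan(temp_storage).ExclusiveScan(thread_data, exclusive_max, cub::Max(), block_aggregate);
+ * if (threadIdx.x != 0)
+ * {
+ *     // Only threads other than thread0 obtain a defined exclusive result
+ *     ...
+ * }
+ * \endcode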
+ * + * \blocked + * + * \smemreuse + * + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ + template + __device__ __forceinline__ void ExclusiveScan( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + ScanOp scan_op, ///< [in] Binary scan operator + T &block_aggregate) ///< [out] block-wide aggregate reduction of input items + { + InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). + * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. + * The functor will be invoked by the first warp of threads in the block, however only the return value from + * lane0 is applied as the block-wide prefix. Can be stateful. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) + */ + template < + typename ScanOp, + typename BlockPrefixOp> + __device__ __forceinline__ void ExclusiveScan( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + ScanOp scan_op, ///< [in] Binary scan operator + T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) + BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. + { + InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate, block_prefix_op); + } + + + //@} end member group + /******************************************************************//** + * \name Exclusive prefix scan operations (identityless, multiple data per thread) + *********************************************************************/ + //@{ + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. With no identity value, the output computed for thread0 is undefined. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
+ * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ + template < + int ITEMS_PER_THREAD, + typename ScanOp> + __device__ __forceinline__ void ExclusiveScan( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) + ScanOp scan_op) ///< [in] Binary scan operator + { + // Reduce consecutive thread items in registers + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveScan(thread_partial, thread_partial, scan_op); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial, (linear_tid != 0)); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. With no identity value, the output computed for thread0 is undefined. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. + * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + */ + template < + int ITEMS_PER_THREAD, + typename ScanOp> + __device__ __forceinline__ void ExclusiveScan( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) + ScanOp scan_op, ///< [in] Binary scan operator + T &block_aggregate) ///< [out] block-wide aggregate reduction of input items + { + // Reduce consecutive thread items in registers + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial, (linear_tid != 0)); + } + + + /** + * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). + * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. + * The functor will be invoked by the first warp of threads in the block, however only the return value from + * lane0 is applied as the block-wide prefix. Can be stateful. + * + * Supports non-commutative scan operators. + * + * \blocked + * + * \smemreuse + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
+ * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) + * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) + */ + template < + int ITEMS_PER_THREAD, + typename ScanOp, + typename BlockPrefixOp> + __device__ __forceinline__ void ExclusiveScan( + T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items + T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) + ScanOp scan_op, ///< [in] Binary scan operator + T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) + BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. + { + // Reduce consecutive thread items in registers + T thread_partial = ThreadReduce(input, scan_op); + + // Exclusive threadblock-scan + ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate, block_prefix_op); + + // Exclusive scan in registers with prefix + ThreadScanExclusive(input, output, scan_op, thread_partial); + } + + + //@} end member group + +#endif // DOXYGEN_SHOULD_SKIP_THIS + + /******************************************************************//** + * \name Inclusive prefix sum operations + *********************************************************************/ + //@{ + + + /** + * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an inclusive prefix sum of 128 integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain input item for each thread + * int thread_data; + * ... + * + * // Collectively compute the block-wide inclusive prefix sum + * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is 1, 1, ..., 1. The + * corresponding output \p thread_data in those threads will be 1, 2, ..., 128. + * + */ + __device__ __forceinline__ void InclusiveSum( + T input, ///< [in] Calling thread's input item + T &output) ///< [out] Calling thread's output item (may be aliased to \p input) + { + T block_aggregate; + InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate); + } + + + /** + * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an inclusive prefix sum of 128 integer items that + * are partitioned across 128 threads. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain input item for each thread + * int thread_data; + * ... 
+ * + * // Collectively compute the block-wide inclusive prefix sum + * int block_aggregate; + * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is 1, 1, ..., 1. The + * corresponding output \p thread_data in those threads will be 1, 2, ..., 128. + * Furthermore the value \p 128 will be stored in \p block_aggregate for all threads. + * + */ + __device__ __forceinline__ void InclusiveSum( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + T &block_aggregate) ///< [out] block-wide aggregate reduction of input items + { + InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate); + } + + + + /** + * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. + * + * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). + * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. + * The functor will be invoked by the first warp of threads in the block, however only the return value from + * lane0 is applied as the block-wide prefix. Can be stateful. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates a single thread block that progressively + * computes an inclusive prefix sum over multiple "tiles" of input using a + * prefix functor to maintain a running total between block-wide scans. Each tile consists + * of 128 integer items that are partitioned across 128 threads. + * \par + * \code + * #include + * + * // A stateful callback functor that maintains a running prefix to be applied + * // during consecutive scan operations. + * struct BlockPrefixOp + * { + * // Running prefix + * int running_total; + * + * // Constructor + * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} + * + * // Callback operator to be entered by the first warp of threads in the block. + * // Thread-0 is responsible for returning a value for seeding the block-wide scan. + * __device__ int operator()(int block_aggregate) + * { + * int old_prefix = running_total; + * running_total += block_aggregate; + * return old_prefix; + * } + * }; + * + * __global__ void ExampleKernel(int *d_data, int num_items, ...) 
+ * { + * // Specialize BlockScan for 128 threads + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Initialize running total + * BlockPrefixOp prefix_op(0); + * + * // Have the block iterate over segments of items + * for (int block_offset = 0; block_offset < num_items; block_offset += 128) + * { + * // Load a segment of consecutive items that are blocked across threads + * int thread_data = d_data[block_offset]; + * + * // Collectively compute the block-wide inclusive prefix sum + * int block_aggregate; + * BlockScan(temp_storage).InclusiveSum( + * thread_data, thread_data, block_aggregate, prefix_op); + * __syncthreads(); + * + * // Store scanned items to output segment + * d_data[block_offset] = thread_data; + * } + * \endcode + * \par + * Suppose the input \p d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... + * The corresponding output for the first segment will be 1, 2, ..., 128. + * The output for the second segment will be 129, 130, ..., 256. Furthermore, + * the value \p 128 will be stored in \p block_aggregate for all threads after each scan. + * + * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) + */ + template + __device__ __forceinline__ void InclusiveSum( + T input, ///< [in] Calling thread's input item + T &output, ///< [out] Calling thread's output item (may be aliased to \p input) + T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) + BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. + { + InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate, block_prefix_op); + } + + + //@} end member group + /******************************************************************//** + * \name Inclusive prefix sum operations (multiple data per thread) + *********************************************************************/ + //@{ + + + /** + * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. + * + * \blocked + * + * \smemreuse + * + * The code snippet below illustrates an inclusive prefix sum of 512 integer items that + * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads + * where each thread owns 4 consecutive items. + * \par + * \code + * #include + * + * __global__ void ExampleKernel(...) + * { + * // Specialize BlockScan for 128 threads on type int + * typedef cub::BlockScan BlockScan; + * + * // Allocate shared memory for BlockScan + * __shared__ typename BlockScan::TempStorage temp_storage; + * + * // Obtain a segment of consecutive items that are blocked across threads + * int thread_data[4]; + * ... + * + * // Collectively compute the block-wide inclusive prefix sum + * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data); + * + * \endcode + * \par + * Suppose the set of input \p thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The + * corresponding output \p thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }. + * + * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
+     */
+    template <int ITEMS_PER_THREAD>
+    __device__ __forceinline__ void InclusiveSum(
+        T (&input)[ITEMS_PER_THREAD],   ///< [in] Calling thread's input items
+        T (&output)[ITEMS_PER_THREAD])  ///< [out] Calling thread's output items (may be aliased to \p input)
+    {
+        if (ITEMS_PER_THREAD == 1)
+        {
+            InclusiveSum(input[0], output[0]);
+        }
+        else
+        {
+            // Reduce consecutive thread items in registers
+            Sum scan_op;
+            T thread_partial = ThreadReduce(input, scan_op);
+
+            // Exclusive threadblock-scan
+            ExclusiveSum(thread_partial, thread_partial);
+
+            // Inclusive scan in registers with prefix
+            ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
+        }
+    }
+
+
+    /**
+     * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs.
+     *
+     * \blocked
+     *
+     * \smemreuse
+     *
+     * The code snippet below illustrates an inclusive prefix sum of 512 integer items that
+     * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads
+     * where each thread owns 4 consecutive items.
+     * \par
+     * \code
+     * #include <cub/cub.cuh>
+     *
+     * __global__ void ExampleKernel(...)
+     * {
+     *     // Specialize BlockScan for 128 threads on type int
+     *     typedef cub::BlockScan<int, 128> BlockScan;
+     *
+     *     // Allocate shared memory for BlockScan
+     *     __shared__ typename BlockScan::TempStorage temp_storage;
+     *
+     *     // Obtain a segment of consecutive items that are blocked across threads
+     *     int thread_data[4];
+     *     ...
+     *
+     *     // Collectively compute the block-wide inclusive prefix sum
+     *     int block_aggregate;
+     *     BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate);
+     *
+     * \endcode
+     * \par
+     * Suppose the set of input \p thread_data across the block of threads is
+     * { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The
+     * corresponding output \p thread_data in those threads will be
+     * { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }.
+     * Furthermore the value \p 512 will be stored in \p block_aggregate for all threads.
+     *
+     * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread.
+     */
+    template <int ITEMS_PER_THREAD>
+    __device__ __forceinline__ void InclusiveSum(
+        T (&input)[ITEMS_PER_THREAD],   ///< [in] Calling thread's input items
+        T (&output)[ITEMS_PER_THREAD],  ///< [out] Calling thread's output items (may be aliased to \p input)
+        T &block_aggregate)             ///< [out] block-wide aggregate reduction of input items
+    {
+        if (ITEMS_PER_THREAD == 1)
+        {
+            InclusiveSum(input[0], output[0], block_aggregate);
+        }
+        else
+        {
+            // Reduce consecutive thread items in registers
+            Sum scan_op;
+            T thread_partial = ThreadReduce(input, scan_op);
+
+            // Exclusive threadblock-scan
+            ExclusiveSum(thread_partial, thread_partial, block_aggregate);
+
+            // Inclusive scan in registers with prefix
+            ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0));
+        }
+    }
+
+
+    /**
+     * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
+     *
+     * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate).
+     * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
+     * The functor will be invoked by the first warp of threads in the block; however, only the return value
+     * from lane0 is applied as the block-wide prefix. Can be stateful.
+     *
+     * \blocked
+     *
+     * \smemreuse
+     *
+     * The code snippet below illustrates a single thread block that progressively
+     * computes an inclusive prefix sum over multiple "tiles" of input using a
+     * prefix functor to maintain a running total between block-wide scans. Each tile consists
+     * of 512 integer items that are partitioned in a [blocked arrangement](index.html#sec5sec4)
+     * across 128 threads where each thread owns 4 consecutive items.
+     * \par
+     * \code
+     * #include <cub/cub.cuh>
+     *
+     * // A stateful callback functor that maintains a running prefix to be applied
+     * // during consecutive scan operations.
+     * struct BlockPrefixOp
+     * {
+     *     // Running prefix
+     *     int running_total;
+     *
+     *     // Constructor
+     *     __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
+     *
+     *     // Callback operator to be entered by the first warp of threads in the block.
+     *     // Thread-0 is responsible for returning a value for seeding the block-wide scan.
+     *     __device__ int operator()(int block_aggregate)
+     *     {
+     *         int old_prefix = running_total;
+     *         running_total += block_aggregate;
+     *         return old_prefix;
+     *     }
+     * };
+     *
+     * __global__ void ExampleKernel(int *d_data, int num_items, ...)
+     * {
+     *     // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread
+     *     typedef cub::BlockLoad<int*, 128, 4>  BlockLoad;
+     *     typedef cub::BlockStore<int*, 128, 4> BlockStore;
+     *     typedef cub::BlockScan<int, 128>      BlockScan;
+     *
+     *     // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan
+     *     __shared__ union {
+     *         typename BlockLoad::TempStorage  load;
+     *         typename BlockScan::TempStorage  scan;
+     *         typename BlockStore::TempStorage store;
+     *     } temp_storage;
+     *
+     *     // Initialize running total
+     *     BlockPrefixOp prefix_op(0);
+     *
+     *     // Have the block iterate over segments of items
+     *     for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4)
+     *     {
+     *         // Load a segment of consecutive items that are blocked across threads
+     *         int thread_data[4];
+     *         BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data);
+     *         __syncthreads();
+     *
+     *         // Collectively compute the block-wide inclusive prefix sum
+     *         int block_aggregate;
+     *         BlockScan(temp_storage.scan).InclusiveSum(
+     *             thread_data, thread_data, block_aggregate, prefix_op);
+     *         __syncthreads();
+     *
+     *         // Store scanned items to output segment
+     *         BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data);
+     *         __syncthreads();
+     *     }
+     * \endcode
+     * \par
+     * Suppose the input \p d_data is 1, 1, 1, 1, 1, 1, 1, 1, ....
+     * The corresponding output for the first segment will be 1, 2, 3, 4, ..., 511, 512.
+     * The output for the second segment will be 513, 514, 515, 516, ..., 1023, 1024. Furthermore,
+     * the value \p 512 will be stored in \p block_aggregate for all threads after each scan.
+     *
+     * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread.
+     * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate)
+     */
+    template <
+        int ITEMS_PER_THREAD,
+        typename BlockPrefixOp>
+    __device__ __forceinline__ void InclusiveSum(
+        T (&input)[ITEMS_PER_THREAD],   ///< [in] Calling thread's input items
+        T (&output)[ITEMS_PER_THREAD],  ///< [out] Calling thread's output items (may be aliased to \p input)
+        T &block_aggregate,             ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value)
+        BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs.
+    {
+        if (ITEMS_PER_THREAD == 1)
+        {
+            InclusiveSum(input[0], output[0], block_aggregate, block_prefix_op);
+        }
+        else
+        {
+            // Reduce consecutive thread items in registers
+            Sum scan_op;
+            T thread_partial = ThreadReduce(input, scan_op);
+
+            // Exclusive threadblock-scan
+            ExclusiveSum(thread_partial, thread_partial, block_aggregate, block_prefix_op);
+
+            // Inclusive scan in registers with prefix
+            ThreadScanInclusive(input, output, scan_op, thread_partial);
+        }
+    }
+
+
+    //@} end member group
+    /******************************************************************//**
+     * \name Inclusive prefix scan operations
+     *********************************************************************/
+    //@{
+
+
+    /**
+     * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element.
+     *
+     * Supports non-commutative scan operators.
+     *
+     * \blocked
+     *
+     * \smemreuse
+     *
+     * The code snippet below illustrates an inclusive prefix max scan of 128 integer items that
+     * are partitioned across 128 threads.
+     * \par
+     * \code
+     * #include <cub/cub.cuh>
+     *
+     * __global__ void ExampleKernel(...)
+     * {
+     *     // Specialize BlockScan for 128 threads on type int
+     *     typedef cub::BlockScan<int, 128> BlockScan;
+     *
+     *     // Allocate shared memory for BlockScan
+     *     __shared__ typename BlockScan::TempStorage temp_storage;
+     *
+     *     // Obtain input item for each thread
+     *     int thread_data;
+     *     ...
+     *
+     *     // Collectively compute the block-wide inclusive prefix max scan
+     *     BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max());
+     *
+     * \endcode
+     * \par
+     * Suppose the set of input \p thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The
+     * corresponding output \p thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126.
+     *
+     * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
+     */
+    template <typename ScanOp>
+    __device__ __forceinline__ void InclusiveScan(
+        T input,                        ///< [in] Calling thread's input item
+        T &output,                      ///< [out] Calling thread's output item (may be aliased to \p input)
+        ScanOp scan_op)                 ///< [in] Binary scan operator
+    {
+        T block_aggregate;
+        InclusiveScan(input, output, scan_op, block_aggregate);
+    }
+
+
+    /**
+     * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs.
+     *
+     * Supports non-commutative scan operators.
+     *
+     * \blocked
+     *
+     * \smemreuse
+     *
+     * The code snippet below illustrates an inclusive prefix max scan of 128 integer items that
+     * are partitioned across 128 threads.
+     * \par
+     * \code
+     * #include <cub/cub.cuh>
+     *
+     * __global__ void ExampleKernel(...)
+     * {
+     *     // Specialize BlockScan for 128 threads on type int
+     *     typedef cub::BlockScan<int, 128> BlockScan;
+     *
+     *     // Allocate shared memory for BlockScan
+     *     __shared__ typename BlockScan::TempStorage temp_storage;
+     *
+     *     // Obtain input item for each thread
+     *     int thread_data;
+     *     ...
+     *
+     *     // Collectively compute the block-wide inclusive prefix max scan
+     *     int block_aggregate;
+     *     BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate);
+     *
+     * \endcode
+     * \par
+     * Suppose the set of input \p thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The
+     * corresponding output \p thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126.
+     * Furthermore the value \p 126 will be stored in \p block_aggregate for all threads.
+     *
+     * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b)
+     */
+    template <typename ScanOp>
+    __device__ __forceinline__ void InclusiveScan(
+        T input,                        ///< [in] Calling thread's input item
+        T &output,                      ///< [out] Calling thread's output item (may be aliased to \p input)
+        ScanOp scan_op,                 ///< [in] Binary scan operator
+        T &block_aggregate)             ///< [out] block-wide aggregate reduction of input items
+    {
+        InternalBlockScan(temp_storage, linear_tid).InclusiveScan(input, output, scan_op, block_aggregate);
+    }
+
+
+    /**
+     * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
+     *
+     * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate).
+     * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation.
+     * The functor will be invoked by the first warp of threads in the block; however, only the return value
+     * from lane0 is applied as the block-wide prefix. Can be stateful.
+     *
+     * Supports non-commutative scan operators.
+     *
+     * \blocked
+     *
+     * \smemreuse
+     *
+     * The code snippet below illustrates a single thread block that progressively
+     * computes an inclusive prefix max scan over multiple "tiles" of input using a
+     * prefix functor to maintain a running total between block-wide scans. Each tile consists
+     * of 128 integer items that are partitioned across 128 threads.
+     * \par
+     * \code
+     * #include <cub/cub.cuh>
+     *
+     * // A stateful callback functor that maintains a running prefix to be applied
+     * // during consecutive scan operations.
+     * struct BlockPrefixOp
+     * {
+     *     // Running prefix
+     *     int running_total;
+     *
+     *     // Constructor
+     *     __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}
+     *
+     *     // Callback operator to be entered by the first warp of threads in the block.
+     *     // Thread-0 is responsible for returning a value for seeding the block-wide scan.
+     *     __device__ int operator()(int block_aggregate)
+     *     {
+     *         int old_prefix = running_total;
+     *         running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
+     *         return old_prefix;
+     *     }
+     * };
+     *
+     * __global__ void ExampleKernel(int *d_data, int num_items, ...)
+     * {
+     *     // Specialize BlockScan for 128 threads
+     *     typedef cub::BlockScan