diff --git a/doc/Manual.html b/doc/Manual.html index 6a97ddf305..d11b9de225 100644 --- a/doc/Manual.html +++ b/doc/Manual.html @@ -1,7 +1,7 @@
The KOKKOS package was developed primarily by Christian Trott (Sandia) with contributions of various styles by others, including -Sikandar Mashayak (UIUC). The underlying Kokkos library was written +Sikandar Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia). +The underlying Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia).
@@ -25,7 +26,8 @@ that use data structures and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos. The Kokkos library is part of -Trilinos and is a +Trilinos and can also +be downloaded from Github. Kokkos is a templated C++ library that provides two key abstractions for an application like LAMMPS. First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on @@ -71,10 +73,10 @@ mode.
Here is a quick overview of how to use the KOKKOS package:
-The latter two steps can be done using the "-k on", "-pk kokkos" and "-sf kk" command-line switches @@ -87,10 +89,11 @@ kk commands respectively to your input script.
The KOKKOS package can be used to build and run LAMMPS on the following kinds of hardware:
-Note that Intel Xeon Phi coprocessors are supported in "native" mode, not "offload" mode like the USER-INTEL package supports. @@ -130,19 +133,19 @@ Make.py -p kokkos -kokkos phi -o kokkos_phi file mpi
cd lammps/src make yes-kokkos -make g++ OMP=yes +make g++ KOKKOS_DEVICES=OpenMP
Intel Xeon Phi:
cd lammps/src make yes-kokkos -make g++ OMP=yes MIC=yes +make g++ KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=KNC
CPUs and GPUs:
cd lammps/src make yes-kokkos -make cuda CUDA=yes +make cuda KOKKOS_DEVICES=Cuda
These examples set the KOKKOS-specific KOKKOS_DEVICES and KOKKOS_ARCH variables on the make command line, which requires a GNU-compatible make command. Try @@ -159,7 +162,7 @@ options. makefile, e.g. src/MAKE/Makefile.g++ in the first two examples above, with a line like:
-MIC = yes +KOKKOS_ARCH = KNC Note that if you build LAMMPS multiple times in this manner, using different KOKKOS options (defined in different machine makefiles), you @@ -170,9 +173,9 @@ because the targets will be different. machine makefile, in this case src/MAKE/Makefile.cuda, which is included in the LAMMPS distribution. To build the KOKKOS package for a GPU, this makefile must use the NVIDIA "nvcc" compiler. And it must -have a CCFLAGS -arch setting that is appropriate for your NVIDIA -hardware and installed software. Typical values for -arch are given -in Section 2.3.4 of the manual, as well +have a KOKKOS_ARCH setting that is appropriate for your NVIDIA +hardware and installed software. Typical values for KOKKOS_ARCH are given +below, as well as other settings that must be included in the machine makefile, if you create your own.
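To hardwire several KOKKOS settings at once, the corresponding lines in a machine makefile could look like the following sketch (the values are only illustrative, here for a Xeon Phi build; the full list of variables and allowed values is given below):

KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = KNC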
@@ -183,36 +186,32 @@ double precision. There are other allowed options when building with the KOKKOS package. As above, they can be set either as variables on the make command line or in Makefile.machine. This is the full list of options, including -those discussed above, Each takes a value of yes or no. The +those discussed above. Each takes a value shown below. The default value is listed, which is set in the -lib/kokkos/Makefile.lammps file. +lib/kokkos/Makefile.kokkos file.
-
#Default settings specific options +#Options: force_uvm,use_ldg,rdc +
+OMP sets the parallelization method used for Kokkos code (within -LAMMPS) that runs on the host. OMP=yes means that OpenMP will be -used. OMP=no means that pthreads will be used. +
KOKKOS_DEVICES sets the parallelization method used for Kokkos code (within +LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be +used. KOKKOS_DEVICES=Pthreads means that pthreads will be used. +KOKKOS_DEVICES=Cuda means an NVIDIA GPU running +CUDA will be used.
-CUDA sets the parallelization method used for Kokkos code (within -LAMMPS) that runs on the device. CUDA=yes means an NVIDIA GPU running -CUDA will be used. CUDA=no means that the OMP=yes or OMP=no setting -will be used for the device as well as the host. -
-If CUDA=yes, then the lo-level Makefile in the src/MAKE directory must -use "nvcc" as its compiler, via its CC setting. For best performance -its CCFLAGS setting should use -O3 and have an -arch setting that -matches the compute capability of your NVIDIA hardware and software -installation, e.g. -arch=sm_20. Generally Fermi Generation GPUs are -sm_20, while Kepler generation GPUs are sm_30 or sm_35 and Maxwell -cards are sm_50. A complete list can be found on -wikipedia. You can -also use the deviceQuery tool that comes with the CUDA samples. Note +
If KOKKOS_DEVICES=Cuda, then the low-level Makefile in the src/MAKE +directory must use "nvcc" as its compiler, via its CC setting. For +best performance its CCFLAGS setting should use -O3 and have a +KOKKOS_ARCH setting that matches the compute capability of your NVIDIA +hardware and software installation, e.g. KOKKOS_ARCH=Kepler30. Note the minimal required compute capability is 2.0, but this will give significantly reduced performance compared to Kepler generation GPUs with compute capability 3.x. For the LINK setting, "nvcc" should not @@ -223,28 +222,30 @@ also have a "Compilation rule" for creating *.o files from *.cu files. See src/Makefile.cuda for an example of a low-level Makefile with all of these settings.
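As a sketch (not the verbatim contents of src/MAKE/Makefile.cuda), the GPU-relevant lines of such a machine makefile could look like this, with Kepler30 standing in for whatever value matches your hardware:

CC = nvcc
CCFLAGS = -O3
KOKKOS_DEVICES = Cuda
KOKKOS_ARCH = Kepler30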
-HWLOC binds threads to hardware cores, so they do not migrate during a -simulation. HWLOC=yes should always be used if running with OMP=no -for pthreads. It is not necessary for OMP=yes for OpenMP, because -OpenMP provides alternative methods via environment variables for -binding threads to hardware cores. More info on binding threads to -cores is given in this section. +
KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not +migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be +used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not +necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP +provides alternative methods via environment variables for binding +threads to hardware cores. More info on binding threads to cores is +given in this section.
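For example, a hypothetical pthreads build with hwloc support could be requested as:

make g++ KOKKOS_DEVICES=Pthreads KOKKOS_USE_TPLS=hwloc

Note that lib/kokkos/Makefile.kokkos locates hwloc via a HWLOC_PATH variable, so that may also need to be set if hwloc is not installed in a default location.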
-AVX enables Intel advanced vector extensions when compiling for an -Intel-compatible chip. AVX=yes should only be set if your host -hardware supports AVX. If it does not support it, this will cause a -run-time crash. +
KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an +Intel Phi processor.
-MIC enables compiler switches needed when compling for an Intel Phi -processor. +
KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism +on most Unix platforms. This library is not available on all +platforms.
-LIBRT enables use of a more accurate timer mechanism on most Unix -platforms. This library is not available on all platforms. +
KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style +within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time +debugging information that can be useful. It also enables runtime +bounds checking on Kokkos data structures.
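For example, a hypothetical debug build for development could be requested as:

make g++ KOKKOS_DEVICES=OpenMP KOKKOS_DEBUG=yes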
-DEBUG is only useful when developing a Kokkos-enabled style within -LAMMPS. DEBUG=yes enables printing of run-time debugging information -that can be useful. It also enables runtime bounds checking on Kokkos -data structures. +
KOKKOS_CUDA_OPTIONS are additional options for CUDA. +
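For example, a hypothetical build forcing CUDA unified virtual memory could be requested as:

make cuda KOKKOS_DEVICES=Cuda KOKKOS_CUDA_OPTIONS=force_uvm

Since lib/kokkos/Makefile.kokkos matches each option individually, several options can be combined in one comma-separated value.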
+For more information on Kokkos see the Kokkos programmers' guide here: +/lib/kokkos/doc/Kokkos_PG.pdf.
Run with the KOKKOS package from the command line:
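As a sketch, assuming executables named lmp_g++ and lmp_cuda produced by the builds above and a generic input script in.lj, typical invocations look like:

mpirun -np 16 lmp_g++ -k on t 1 -sf kk -in in.lj
mpirun -np 2 lmp_g++ -k on t 8 -sf kk -in in.lj
mpirun -np 1 lmp_cuda -k on g 1 -sf kk -in in.lj

The "-k on" switch enables the KOKKOS package, the "t" and "g" keywords set the number of OpenMP threads and GPUs to use, and "-sf kk" appends the kk suffix to all supported styles in the input script.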
diff --git a/doc/accelerate_kokkos.txt b/doc/accelerate_kokkos.txt index 5433ff5864..78c2220e86 100644 --- a/doc/accelerate_kokkos.txt +++ b/doc/accelerate_kokkos.txt @@ -13,7 +13,8 @@ The KOKKOS package was developed primaritly by Christian Trott (Sandia) with contributions of various styles by others, including -Sikandar Mashayak (UIUC). The underlying Kokkos library was written +Sikandar Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia). +The underlying Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia). @@ -22,7 +23,8 @@ that use data structures and macros provided by the Kokkos library, which is included with LAMMPS in lib/kokkos. The Kokkos library is part of -"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and is a +"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and can also +be downloaded from "Github"_https://github.com/kokkos/kokkos. Kokkos is a templated C++ library that provides two key abstractions for an application like LAMMPS. First, it allows a single implementation of an application kernel (e.g. a pair style) to run efficiently on @@ -68,10 +70,10 @@ mode. Here is a quick overview of how to use the KOKKOS package: -specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support -include the KOKKOS package and build LAMMPS -enable the KOKKOS package and its hardware options via the "-k on" command-line switch -use KOKKOS styles in your input script :ul +specify variables and settings in your Makefile.machine that enable +OpenMP, GPU, or Phi support include the KOKKOS package and build +LAMMPS enable the KOKKOS package and its hardware options via the "-k +on" command-line switch use KOKKOS styles in your input script :ul The latter two steps can be done using the "-k on", "-pk kokkos" and "-sf kk" "command-line switches"_Section_start.html#start_7 @@ -84,10 +86,11 @@ kk"_suffix.html commands respectively to your input script. The KOKKOS package can be used to build and run LAMMPS on the following kinds of hardware: -CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS styles) -CPU-only: one or a few MPI tasks per node with additional threading via OpenMP -Phi: on one or more Intel Phi coprocessors (per node) -GPU: on the GPUs of a node with additional OpenMP threading on the CPUs :ul +CPU-only: one MPI task per CPU core (MPI-only, but using KOKKOS +styles) CPU-only: one or a few MPI tasks per node with additional +threading via OpenMP Phi: on one or more Intel Phi coprocessors (per +node) GPU: on the GPUs of a node with additional OpenMP threading on +the CPUs :ul Note that Intel Xeon Phi coprocessors are supported in "native" mode, not "offload" mode like the USER-INTEL package supports. @@ -127,19 +130,19 @@ CPU-only (run all-MPI or with OpenMP threading): cd lammps/src make yes-kokkos -make g++ OMP=yes :pre +make g++ KOKKOS_DEVICES=OpenMP :pre Intel Xeon Phi: cd lammps/src make yes-kokkos -make g++ OMP=yes MIC=yes :pre +make g++ KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=KNC :pre CPUs and GPUs: cd lammps/src make yes-kokkos -make cuda CUDA=yes :pre +make cuda KOKKOS_DEVICES=Cuda :pre These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the make command line which requires a GNU-compatible make command. Try @@ -156,7 +159,7 @@ You can also hardwire these make variables in the specified machine makefile, e.g. 
src/MAKE/Makefile.g++ in the first two examples above, with a line like: -MIC = yes :pre +KOKKOS_ARCH = KNC :pre Note that if you build LAMMPS multiple times in this manner, using different KOKKOS options (defined in different machine makefiles), you @@ -167,9 +170,9 @@ IMPORTANT NOTE: The 3rd example above for a GPU, uses a different machine makefile, in this case src/MAKE/Makefile.cuda, which is included in the LAMMPS distribution. To build the KOKKOS package for a GPU, this makefile must use the NVIDA "nvcc" compiler. And it must -have a CCFLAGS -arch setting that is appropriate for your NVIDIA -hardware and installed software. Typical values for -arch are given -in "Section 2.3.4"_Section_start.html#start_3_4 of the manual, as well +have a KOKKOS_ARCH setting that is appropriate for your NVIDIA +hardware and installed software. Typical values for KOKKOS_ARCH are given +below, as well as other settings that must be included in the machine makefile, if you create your own. @@ -180,36 +183,32 @@ double precision. There are other allowed options when building with the KOKKOS package. As above, they can be set either as variables on the make command line or in Makefile.machine. This is the full list of options, including -those discussed above, Each takes a value of {yes} or {no}. The +those discussed above, Each takes a value shown below. The default value is listed, which is set in the -lib/kokkos/Makefile.lammps file. +lib/kokkos/Makefile.kokkos file. -OMP, default = {yes} -CUDA, default = {no} -HWLOC, default = {no} -AVX, default = {no} -MIC, default = {no} -LIBRT, default = {no} -DEBUG, default = {no} :ul +#Default settings specific options +#Options: force_uvm,use_ldg,rdc -OMP sets the parallelization method used for Kokkos code (within -LAMMPS) that runs on the host. OMP=yes means that OpenMP will be -used. OMP=no means that pthreads will be used. +KOKKOS_DEVICES, values = {OpenMP}, {Serial}, {Pthreads}, {Cuda}, default = {OpenMP} +KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler}, {Kepler30}, {Kepler32}, {Kepler35}, +{Kepler37}, {Maxwell}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {ARMv8}, {BGQ}, {Power7}, {Power8}, +default = {none} +KOKKOS_DEBUG, values = {yes}, {no}, default = {no} +KOKKOS_USE_TPLS, values = {hwloc}, {librt}, default = {none} +KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc} :ul -CUDA sets the parallelization method used for Kokkos code (within -LAMMPS) that runs on the device. CUDA=yes means an NVIDIA GPU running -CUDA will be used. CUDA=no means that the OMP=yes or OMP=no setting -will be used for the device as well as the host. +KOKKOS_DEVICE sets the parallelization method used for Kokkos code (within +LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be +used. KOKKOS_DEVICES=Pthreads means that pthreads will be used. +KOKKOS_DEVICES=Cuda means an NVIDIA GPU running +CUDA will be used. -If CUDA=yes, then the lo-level Makefile in the src/MAKE directory must -use "nvcc" as its compiler, via its CC setting. For best performance -its CCFLAGS setting should use -O3 and have an -arch setting that -matches the compute capability of your NVIDIA hardware and software -installation, e.g. -arch=sm_20. Generally Fermi Generation GPUs are -sm_20, while Kepler generation GPUs are sm_30 or sm_35 and Maxwell -cards are sm_50. A complete list can be found on -"wikipedia"_http://en.wikipedia.org/wiki/CUDA#Supported_GPUs. You can -also use the deviceQuery tool that comes with the CUDA samples. 
Note +If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE +directory must use "nvcc" as its compiler, via its CC setting. For +best performance its CCFLAGS setting should use -O3 and have a +KOKKOS_ARCH setting that matches the compute capability of your NVIDIA +hardware and software installation, e.g. KOKKOS_ARCH=Kepler30. Note the minimal required compute capability is 2.0, but this will give signicantly reduced performance compared to Kepler generation GPUs with compute capability 3.x. For the LINK setting, "nvcc" should not @@ -220,28 +219,30 @@ also have a "Compilation rule" for creating *.o files from *.cu files. See src/Makefile.cuda for an example of a lo-level Makefile with all of these settings. -HWLOC binds threads to hardware cores, so they do not migrate during a -simulation. HWLOC=yes should always be used if running with OMP=no -for pthreads. It is not necessary for OMP=yes for OpenMP, because -OpenMP provides alternative methods via environment variables for -binding threads to hardware cores. More info on binding threads to -cores is given in "this section"_Section_accelerate.html#acc_8. +KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not +migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be +used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not +necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP +provides alternative methods via environment variables for binding +threads to hardware cores. More info on binding threads to cores is +given in "this section"_Section_accelerate.html#acc_8. -AVX enables Intel advanced vector extensions when compiling for an -Intel-compatible chip. AVX=yes should only be set if your host -hardware supports AVX. If it does not support it, this will cause a -run-time crash. +KOKKOS_ARCH=KNC enables compiler switches needed when compling for an +Intel Phi processor. -MIC enables compiler switches needed when compling for an Intel Phi -processor. +KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism +on most Unix platforms. This library is not available on all +platforms. -LIBRT enables use of a more accurate timer mechanism on most Unix -platforms. This library is not available on all platforms. +KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style +within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time +debugging information that can be useful. It also enables runtime +bounds checking on Kokkos data structures. -DEBUG is only useful when developing a Kokkos-enabled style within -LAMMPS. DEBUG=yes enables printing of run-time debugging information -that can be useful. It also enables runtime bounds checking on Kokkos -data structures. +KOKKOS_CUDA_OPTIONS are additional options for CUDA. + +For more information on Kokkos see the Kokkos programmers' guide here: +/lib/kokkos/doc/Kokkos_PG.pdf. [Run with the KOKKOS package from the command line:] diff --git a/doc/fix_rigid.html b/doc/fix_rigid.html index 5d7086dfed..3cf8e7fae5 100644 --- a/doc/fix_rigid.html +++ b/doc/fix_rigid.html @@ -502,16 +502,17 @@ written out. See the IMPORTANT NOTE in the next section for details.The infile keyword allows a file of rigid body attributes to be read -in from a file, rather then having LAMMPS compute them. There are 3 +in from a file, rather then having LAMMPS compute them. There are 5 such attributes: the total mass of the rigid body, its center-of-mass -position, and its 6 moments of inertia. 
For rigid bodies consisting -of point particles or non-overlapping finite-size particles, LAMMPS -can compute these values accurately. However, for rigid bodies -consisting of finite-size particles which overlap each other, LAMMPS -will ignore the overlaps when computing these 3 attributes. The -amount of error this induces depends on the amount of overlap. To -avoid this issue, the values can be pre-computed (e.g. using Monte -Carlo integration). +position, its 6 moments of inertia, its center-of-mass velocity, and +the 3 image flags of the center-of-mass position. For rigid bodies +consisting of point particles or non-overlapping finite-size +particles, LAMMPS can compute these values accurately. However, for +rigid bodies consisting of finite-size particles which overlap each +other, LAMMPS will ignore the overlaps when computing these 4 +attributes. The amount of error this induces depends on the amount of +overlap. To avoid this issue, the values can be pre-computed +(e.g. using Monte Carlo integration).
The format of the file is as follows. Note that the file does not have to list attributes for every rigid body integrated by fix rigid. @@ -521,10 +522,10 @@ comment lines starting with "#" which are ignored. The first non-blank, non-comment line should list N = the number of lines to follow. The N successive lines contain the following information:
-ID1 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz -ID2 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz +ID1 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm +ID2 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm ... -IDN masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz +IDN masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm The rigid body IDs are all positive integers. For the single bodystyle, only an ID of 1 can be used. For the group bodystyle, @@ -537,15 +538,26 @@ self-explanatory. The center-of-mass should be consistent with what is calculated for the position of the rigid body with all its atoms unwrapped by their respective image flags. If this produces a center-of-mass that is outside the simulation box, LAMMPS wraps it -back into the box. The 6 moments of inertia (ixx,iyy,izz,ixy,ixz,iyz) -should be the values consistent with the current orientation of the -rigid body around its center of mass. The values are with respect to -the simulation box XYZ axes, not with respect to the prinicpal axes of -the rigid body itself. LAMMPS performs the latter calculation -internally. The (vxcm,vycm,vzcm) values are the velocity of the -center of mass. The (lx,ly,lz) values are the angular momentum of the -body. These last 6 values can simply be set to 0 if you wish the -body to have no initial motion. +back into the box. +
+The 6 moments of inertia (ixx,iyy,izz,ixy,ixz,iyz) should be the +values consistent with the current orientation of the rigid body +around its center of mass. The values are with respect to the +simulation box XYZ axes, not with respect to the principal axes of the +rigid body itself. LAMMPS performs the latter calculation internally. +
+The (vxcm,vycm,vzcm) values are the velocity of the center of mass. +The (lx,ly,lz) values are the angular momentum of the body. The +(vxcm,vycm,vzcm) and (lx,ly,lz) values can simply be set to 0 if you +wish the body to have no initial motion. +
+The (ixcm,iycm,izcm) values are the image flags of the center of mass +of the body. For periodic dimensions, they specify which image of the +simulation box the body is considered to be in. An image of 0 means +it is inside the box as defined. A value of 2 means add 2 box lengths +to get the true value. A value of -1 means subtract 1 box length to +get the true value. LAMMPS updates these flags as the rigid bodies +cross periodic boundaries during the simulation.
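As an illustration with made-up values, an infile for a single rigid body of total mass 250.0, centered at the origin in the primary periodic image, with purely diagonal moments of inertia and no initial motion, could look like:

# rigid body attribute file (hypothetical values)
1
1 250.0 0.0 0.0 0.0 80.0 80.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0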
IMPORTANT NOTE: If you use the infile or mol keywords and write restart files during a simulation, then each time a restart file is diff --git a/doc/fix_rigid.txt b/doc/fix_rigid.txt index 135ea2653a..225b43966d 100644 --- a/doc/fix_rigid.txt +++ b/doc/fix_rigid.txt @@ -484,16 +484,17 @@ written out. See the IMPORTANT NOTE in the next section for details. :line The {infile} keyword allows a file of rigid body attributes to be read -in from a file, rather then having LAMMPS compute them. There are 3 +in from a file, rather then having LAMMPS compute them. There are 5 such attributes: the total mass of the rigid body, its center-of-mass -position, and its 6 moments of inertia. For rigid bodies consisting -of point particles or non-overlapping finite-size particles, LAMMPS -can compute these values accurately. However, for rigid bodies -consisting of finite-size particles which overlap each other, LAMMPS -will ignore the overlaps when computing these 3 attributes. The -amount of error this induces depends on the amount of overlap. To -avoid this issue, the values can be pre-computed (e.g. using Monte -Carlo integration). +position, its 6 moments of inertia, its center-of-mass velocity, and +the 3 image flags of the center-of-mass position. For rigid bodies +consisting of point particles or non-overlapping finite-size +particles, LAMMPS can compute these values accurately. However, for +rigid bodies consisting of finite-size particles which overlap each +other, LAMMPS will ignore the overlaps when computing these 4 +attributes. The amount of error this induces depends on the amount of +overlap. To avoid this issue, the values can be pre-computed +(e.g. using Monte Carlo integration). The format of the file is as follows. Note that the file does not have to list attributes for every rigid body integrated by fix rigid. @@ -503,10 +504,10 @@ comment lines starting with "#" which are ignored. The first non-blank, non-comment line should list N = the number of lines to follow. The N successive lines contain the following information: -ID1 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz -ID2 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz +ID1 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm +ID2 masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm ... -IDN masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz :pre +IDN masstotal xcm ycm zcm ixx iyy izz ixy ixz iyz vxcm vycm vzcm lx ly lz ixcm iycm izcm :pre The rigid body IDs are all positive integers. For the {single} bodystyle, only an ID of 1 can be used. For the {group} bodystyle, @@ -519,15 +520,26 @@ self-explanatory. The center-of-mass should be consistent with what is calculated for the position of the rigid body with all its atoms unwrapped by their respective image flags. If this produces a center-of-mass that is outside the simulation box, LAMMPS wraps it -back into the box. The 6 moments of inertia (ixx,iyy,izz,ixy,ixz,iyz) -should be the values consistent with the current orientation of the -rigid body around its center of mass. The values are with respect to -the simulation box XYZ axes, not with respect to the prinicpal axes of -the rigid body itself. LAMMPS performs the latter calculation -internally. The (vxcm,vycm,vzcm) values are the velocity of the -center of mass. The (lx,ly,lz) values are the angular momentum of the -body. 
These last 6 values can simply be set to 0 if you wish the -body to have no initial motion. +back into the box. + +The 6 moments of inertia (ixx,iyy,izz,ixy,ixz,iyz) should be the +values consistent with the current orientation of the rigid body +around its center of mass. The values are with respect to the +simulation box XYZ axes, not with respect to the prinicpal axes of the +rigid body itself. LAMMPS performs the latter calculation internally. + +The (vxcm,vycm,vzcm) values are the velocity of the center of mass. +The (lx,ly,lz) values are the angular momentum of the body. The +(vxcm,vycm,vzcm) and (lx,ly,lz) values can simply be set to 0 if you +wish the body to have no initial motion. + +The (ixcm,iycm,izcm) values are the image flags of the center of mass +of the body. For periodic dimensions, they specify which image of the +simulation box the body is considered to be in. An image of 0 means +it is inside the box as defined. A value of 2 means add 2 box lengths +to get the true value. A value of -1 means subtract 1 box length to +get the true value. LAMMPS updates these flags as the rigid bodies +cross periodic boundaries during the simulation. IMPORTANT NOTE: If you use the {infile} or {mol} keywords and write restart files during a simulation, then each time a restart file is diff --git a/lib/kokkos/Copyright.txt b/lib/kokkos/Copyright.txt new file mode 100755 index 0000000000..05980758fa --- /dev/null +++ b/lib/kokkos/Copyright.txt @@ -0,0 +1,40 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 2.0 +// Copyright (2014) Sandia Corporation +// +// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation, +// the U.S. Government retains certain rights in this software. +// +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions are +// met: +// +// 1. Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// +// 2. Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// +// 3. Neither the name of the Corporation nor the names of the +// contributors may be used to endorse or promote products derived from +// this software without specific prior written permission. +// +// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING +// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +// +// Questions? Contact H. 
Carter Edwards (hcedwar@sandia.gov) +// +// ************************************************************************ +//@HEADER diff --git a/lib/kokkos/LICENSE b/lib/kokkos/LICENSE new file mode 100755 index 0000000000..05980758fa --- /dev/null +++ b/lib/kokkos/LICENSE @@ -0,0 +1,40 @@ +//@HEADER +// ************************************************************************ +// +// Kokkos v. 2.0 +// Copyright (2014) Sandia Corporation +// +// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation, +// the U.S. Government retains certain rights in this software. +// +// Redistribution and use in source and binary forms, with or without +// modification, are permitted provided that the following conditions are +// met: +// +// 1. Redistributions of source code must retain the above copyright +// notice, this list of conditions and the following disclaimer. +// +// 2. Redistributions in binary form must reproduce the above copyright +// notice, this list of conditions and the following disclaimer in the +// documentation and/or other materials provided with the distribution. +// +// 3. Neither the name of the Corporation nor the names of the +// contributors may be used to endorse or promote products derived from +// this software without specific prior written permission. +// +// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY +// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE +// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING +// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +// +// Questions? Contact H. 
Carter Edwards (hcedwar@sandia.gov) +// +// ************************************************************************ +//@HEADER diff --git a/lib/kokkos/Makefile.kokkos b/lib/kokkos/Makefile.kokkos new file mode 100755 index 0000000000..473039af52 --- /dev/null +++ b/lib/kokkos/Makefile.kokkos @@ -0,0 +1,318 @@ +# Default settings common options + +KOKKOS_PATH=../../lib/kokkos + +#Options: OpenMP,Serial,Pthreads,Cuda +KOKKOS_DEVICES ?= "OpenMP" +#KOKKOS_DEVICES ?= "Pthreads" +#Options: KNC,SNB,HSW,Kepler,Kepler30,Kepler32,Kepler35,Kepler37,Maxwell,Maxwell50,Maxwell52,Maxwell53,ARMv8,BGQ,Power7,Power8 +KOKKOS_ARCH ?= "" +#Options: yes,no +KOKKOS_DEBUG ?= "no" +#Options: hwloc,librt +KOKKOS_USE_TPLS ?= "" + +#Default settings specific options +#Options: force_uvm,use_ldg,rdc +KOKKOS_CUDA_OPTIONS ?= "" + +# Check for general settings + +KOKKOS_CXX_STANDARD ?= "c++11" + +KOKKOS_INTERNAL_ENABLE_DEBUG := $(strip $(shell echo $(KOKKOS_DEBUG) | grep "yes" | wc -l)) +KOKKOS_INTERNAL_ENABLE_PROFILING_COLLECT_KERNEL_DATA := $(strip $(shell echo $(KOKKOS_PROFILING) | grep "kernel_times" | wc -l)) +KOKKOS_INTERNAL_ENABLE_PROFILING_AGGREGATE_MPI := $(strip $(shell echo $(KOKKOS_PROFILING) | grep "aggregate_mpi" | wc -l)) +KOKKOS_INTERNAL_ENABLE_CXX11 := $(strip $(shell echo $(KOKKOS_CXX_STANDARD) | grep "c++11" | wc -l)) + +# Check for external libraries +KOKKOS_INTERNAL_USE_HWLOC := $(strip $(shell echo $(KOKKOS_USE_TPLS) | grep "hwloc" | wc -l)) +KOKKOS_INTERNAL_USE_LIBRT := $(strip $(shell echo $(KOKKOS_USE_TPLS) | grep "librt" | wc -l)) + +# Check for advanced settings +KOKKOS_INTERNAL_CUDA_USE_LDG := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "use_ldg" | wc -l)) +KOKKOS_INTERNAL_CUDA_USE_UVM := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "force_uvm" | wc -l)) +KOKKOS_INTERNAL_CUDA_USE_RELOC := $(strip $(shell echo $(KOKKOS_CUDA_OPTIONS) | grep "rdc" | wc -l)) + +# Check for Kokkos Host Execution Spaces one of which must be on + +KOKKOS_INTERNAL_USE_OPENMP := $(strip $(shell echo $(KOKKOS_DEVICES) | grep OpenMP | wc -l)) +KOKKOS_INTERNAL_USE_PTHREADS := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Pthread | wc -l)) +KOKKOS_INTERNAL_USE_SERIAL := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Serial | wc -l)) + +ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 0) +ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 0) + KOKKOS_INTERNAL_USE_SERIAL := 1 +endif +endif + +KOKKOS_INTERNAL_COMPILER_PGI := $(shell $(CXX) --version | grep PGI | wc -l) + +ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1) + KOKKOS_INTERNAL_OPENMP_FLAG := -mp +else + KOKKOS_INTERNAL_OPENMP_FLAG := -fopenmp +endif + +ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1) + KOKKOS_INTERNAL_CXX11_FLAG := --c++11 +else + KOKKOS_INTERNAL_CXX11_FLAG := --std=c++11 +endif +# Check for other Execution Spaces + +KOKKOS_INTERNAL_USE_CUDA := $(strip $(shell echo $(KOKKOS_DEVICES) | grep Cuda | wc -l)) + +# Check for Kokkos Architecture settings + +#Intel based +KOKKOS_INTERNAL_USE_ARCH_KNC := $(strip $(shell echo $(KOKKOS_ARCH) | grep KNC | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_SNB := $(strip $(shell echo $(KOKKOS_ARCH) | grep SNB | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_HSW := $(strip $(shell echo $(KOKKOS_ARCH) | grep HSW | wc -l)) + +#NVIDIA based +KOKKOS_INTERNAL_USE_ARCH_KEPLER30 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler30 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_KEPLER32 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler32 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_KEPLER35 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler35 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_KEPLER37 
:= $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler37 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_MAXWELL50 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell50 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_MAXWELL52 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell52 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_MAXWELL53 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell53 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_NVIDIA := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_KEPLER30) \ + + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER32) \ + + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER35) \ + + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER37) \ + + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50) \ + + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52) \ + + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53) | bc)) + +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_NVIDIA), 0) +KOKKOS_INTERNAL_USE_ARCH_MAXWELL50 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Maxwell | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_KEPLER35 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Kepler | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_NVIDIA := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_KEPLER30) \ + + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER32) \ + + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER35) \ + + $(KOKKOS_INTERNAL_USE_ARCH_KEPLER37) \ + + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50) \ + + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52) \ + + $(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53) | bc)) +endif + +#ARM based +KOKKOS_INTERNAL_USE_ARCH_ARMV80 := $(strip $(shell echo $(KOKKOS_ARCH) | grep ARMv8 | wc -l)) + +#IBM based +KOKKOS_INTERNAL_USE_ARCH_BGQ := $(strip $(shell echo $(KOKKOS_ARCH) | grep BGQ | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_POWER7 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Power7 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_POWER8 := $(strip $(shell echo $(KOKKOS_ARCH) | grep Power8 | wc -l)) +KOKKOS_INTERNAL_USE_ARCH_IBM := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_BGQ)+$(KOKKOS_INTERNAL_USE_ARCH_POWER7)+$(KOKKOS_INTERNAL_USE_ARCH_POWER8) | bc)) + +#AMD based +KOKKOS_INTERNAL_USE_ARCH_AMDAVX := $(strip $(shell echo $(KOKKOS_ARCH) | grep AMDAVX | wc -l)) + +#Any AVX? +KOKKOS_INTERNAL_USE_ARCH_AVX := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_SNB)+$(KOKKOS_INTERNAL_USE_ARCH_AMDAVX) | bc )) +KOKKOS_INTERNAL_USE_ARCH_AVX2 := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_HSW) | bc )) + +#Incompatible flags? 
+KOKKOS_INTERNAL_USE_ARCH_MULTIHOST := $(strip $(shell echo "$(KOKKOS_INTERNAL_USE_ARCH_AVX)+$(KOKKOS_INTERNAL_USE_ARCH_AVX2)+$(KOKKOS_INTERNAL_USE_ARCH_KNC)+$(KOKKOS_INTERNAL_USE_ARCH_IBM)+$(KOKKOS_INTERNAL_USE_ARCH_AMDAVX)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV80)>1" | bc )) +KOKKOS_INTERNAL_USE_ARCH_MULTIGPU := $(strip $(shell echo "$(KOKKOS_INTERNAL_USE_ARCH_NVIDIA)>1" | bc)) + +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MULTIHOST), 1) + $(error Defined Multiple Host architectures: KOKKOS_ARCH=$(KOKKOS_ARCH) ) +endif +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MULTIGPU), 1) + $(error Defined Multiple GPU architectures: KOKKOS_ARCH=$(KOKKOS_ARCH) ) +endif + +#Generating the list of Flags + +KOKKOS_CPPFLAGS = -I./ -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src +# No warnings: +KOKKOS_CXXFLAGS = +# INTEL and CLANG warnings: +#KOKKOS_CXXFLAGS = -Wall -Wshadow -pedantic -Wsign-compare -Wtype-limits -Wuninitialized +# GCC warnings: +#KOKKOS_CXXFLAGS = -Wall -Wshadow -pedantic -Wsign-compare -Wtype-limits -Wuninitialized -Wignored-qualifiers -Wempty-body -Wclobbered + +KOKKOS_LIBS = -lkokkos +KOKKOS_LDFLAGS = -L$(shell pwd) +KOKKOS_SRC = +KOKKOS_HEADERS = + +#Generating the KokkosCore_config.h file + +tmp := $(shell echo "/* ---------------------------------------------" > KokkosCore_config.tmp) +tmp := $(shell echo "Makefile constructed configuration:" >> KokkosCore_config.tmp) +tmp := $(shell date >> KokkosCore_config.tmp) +tmp := $(shell echo "----------------------------------------------*/" >> KokkosCore_config.tmp) + + +tmp := $(shell echo "/* Execution Spaces */" >> KokkosCore_config.tmp) +ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1) + tmp := $(shell echo '\#define KOKKOS_HAVE_OPENMP 1' >> KokkosCore_config.tmp) +endif + +ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1) + tmp := $(shell echo "\#define KOKKOS_HAVE_PTHREAD 1" >> KokkosCore_config.tmp ) +endif + +ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1) + tmp := $(shell echo "\#define KOKKOS_HAVE_SERIAL 1" >> KokkosCore_config.tmp ) +endif + +ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1) + tmp := $(shell echo "\#define KOKKOS_HAVE_CUDA 1" >> KokkosCore_config.tmp ) +endif + +tmp := $(shell echo "/* General Settings */" >> KokkosCore_config.tmp) +ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX11), 1) + KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX11_FLAG) + tmp := $(shell echo "\#define KOKKOS_HAVE_CXX11 1" >> KokkosCore_config.tmp ) +endif + +ifeq ($(KOKKOS_INTERNAL_ENABLE_DEBUG), 1) +ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1) + KOKKOS_CXXFLAGS += -G +endif + KOKKOS_CXXFLAGS += -g + KOKKOS_LDFLAGS += -g -ldl + tmp := $(shell echo "\#define KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK 1" >> KokkosCore_config.tmp ) + tmp := $(shell echo "\#define KOKKOS_HAVE_DEBUG 1" >> KokkosCore_config.tmp ) +endif + +ifeq ($(KOKKOS_INTERNAL_USE_HWLOC), 1) + KOKKOS_CPPFLAGS += -I$(HWLOC_PATH)/include + KOKKOS_LDFLAGS += -L$(HWLOC_PATH)/lib + KOKKOS_LIBS += -lhwloc + tmp := $(shell echo "\#define KOKKOS_HAVE_HWLOC 1" >> KokkosCore_config.tmp ) +endif + +ifeq ($(KOKKOS_INTERNAL_USE_LIBRT), 1) + tmp := $(shell echo "\#define KOKKOS_USE_LIBRT 1" >> KokkosCore_config.tmp ) + tmp := $(shell echo "\#define PREC_TIMER 1" >> KokkosCore_config.tmp ) + tmp := $(shell echo "\#define KOKKOSP_ENABLE_RTLIB 1" >> KokkosCore_config.tmp ) + KOKKOS_LIBS += -lrt +endif + +tmp := $(shell echo "/* Cuda Settings */" >> KokkosCore_config.tmp) + +ifeq ($(KOKKOS_INTERNAL_CUDA_USE_LDG), 1) + tmp := $(shell echo "\#define KOKKOS_CUDA_USE_LDG_INTRINSIC 1" >> KokkosCore_config.tmp ) +endif + +ifeq 
($(KOKKOS_INTERNAL_CUDA_USE_UVM), 1) + tmp := $(shell echo "\#define KOKKOS_CUDA_USE_UVM 1" >> KokkosCore_config.tmp ) + tmp := $(shell echo "\#define KOKKOS_USE_CUDA_UVM 1" >> KokkosCore_config.tmp ) +endif + +ifeq ($(KOKKOS_INTERNAL_CUDA_USE_RELOC), 1) + tmp := $(shell echo "\#define KOKKOS_CUDA_USE_RELOCATABLE_DEVICE_CODE 1" >> KokkosCore_config.tmp ) + KOKKOS_CXXFLAGS += --relocatable-device-code=true + KOKKOS_LDFLAGS += --relocatable-device-code=true +endif + +#Add Architecture flags + +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX), 1) + KOKKOS_CXXFLAGS += -mavx + KOKKOS_LDFLAGS += -mavx +endif + +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_AVX2), 1) + KOKKOS_CXXFLAGS += -xcore-avx2 + KOKKOS_LDFLAGS += -xcore-avx2 +endif + +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KNC), 1) + KOKKOS_CXXFLAGS += -mmic + KOKKOS_LDFLAGS += -mmic +endif + +ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1) +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER30), 1) + KOKKOS_CXXFLAGS += -arch=sm_30 +endif +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER32), 1) + KOKKOS_CXXFLAGS += -arch=sm_32 +endif +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER35), 1) + KOKKOS_CXXFLAGS += -arch=sm_35 +endif +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_KEPLER37), 1) + KOKKOS_CXXFLAGS += -arch=sm_37 +endif +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL50), 1) + KOKKOS_CXXFLAGS += -arch=sm_50 +endif +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL52), 1) + KOKKOS_CXXFLAGS += -arch=sm_52 +endif +ifeq ($(KOKKOS_INTERNAL_USE_ARCH_MAXWELL53), 1) + KOKKOS_CXXFLAGS += -arch=sm_53 +endif +endif + +KOKKOS_INTERNAL_LS_CONFIG := $(shell ls KokkosCore_config.h) +ifeq ($(KOKKOS_INTERNAL_LS_CONFIG), KokkosCore_config.h) +KOKKOS_INTERNAL_NEW_CONFIG := $(strip $(shell diff KokkosCore_config.h KokkosCore_config.tmp | grep define | wc -l)) +else +KOKKOS_INTERNAL_NEW_CONFIG := 1 +endif + +ifneq ($(KOKKOS_INTERNAL_NEW_CONFIG), 0) + tmp := $(shell cp KokkosCore_config.tmp KokkosCore_config.h) +endif + +KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/*.hpp) +KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp) +KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp) +KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.hpp) +KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/algorithms/src/*.hpp) + +KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/impl/*.cpp) +KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.cpp) + +ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1) + KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.cpp) + KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp) + KOKKOS_LDFLAGS += -L$(CUDA_PATH)/lib64 + KOKKOS_LIBS += -lcudart -lcuda +endif + +ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1) + KOKKOS_LIBS += -lpthread + KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.cpp) + KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.hpp) +endif + +ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1) + KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.cpp) + KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.hpp) + ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1) + KOKKOS_CXXFLAGS += -Xcompiler $(KOKKOS_INTERNAL_OPENMP_FLAG) + else + KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_OPENMP_FLAG) + endif + KOKKOS_LDFLAGS += $(KOKKOS_INTERNAL_OPENMP_FLAG) +endif + + +# Setting up dependencies + +KokkosCore_config.h: + +KOKKOS_CPP_DEPENDS := KokkosCore_config.h $(KOKKOS_HEADERS) + +KOKKOS_OBJ = $(KOKKOS_SRC:.cpp=.o) +KOKKOS_OBJ_LINK = $(notdir $(KOKKOS_OBJ)) + +include $(KOKKOS_PATH)/Makefile.targets + +kokkos-clean: + rm -f $(KOKKOS_OBJ_LINK) KokkosCore_config.h 
KokkosCore_config.tmp libkokkos.a + +libkokkos.a: $(KOKKOS_OBJ_LINK) $(KOKKOS_SRC) $(KOKKOS_HEADERS) + ar cr libkokkos.a $(KOKKOS_OBJ_LINK) + +KOKKOS_LINK_DEPENDS=libkokkos.a diff --git a/lib/kokkos/Makefile.lammps b/lib/kokkos/Makefile.lammps deleted file mode 100755 index 00b55f4f66..0000000000 --- a/lib/kokkos/Makefile.lammps +++ /dev/null @@ -1,171 +0,0 @@ -# This Makefile is intended to be include in an application Makefile. -# It will append the OBJ variable with objects which need to be build for Kokkos. -# It also will produce a KOKKOS_INC and a KOKKOS_LINK variable which must be -# appended to the compile and link flags of the application Makefile. -# Note that you cannot compile and link at the same time! -# If you want to include dependencies (i.e. trigger a rebuild of the application -# object files when Kokkos files change, you can include KOKKOS_HEADERS in your -# dependency list. -# The Makefile uses a number of variables which can be set on the commandline, or -# in the application Makefile prior to including this Makefile. These options set -# certain build options and are explained in the following. - -# Directory path to the Kokkos source directory (this could be the kokkos directory -# in the Trilinos git repository -KOKKOS_PATH ?= ../../lib/kokkos -# Directory paths to libraries potentially used by Kokkos (if the respective options -# are chosen) -CUDA_PATH ?= /usr/local/cuda -HWLOC_PATH ?= /usr/local/hwloc/default - -# Device options: enable Pthreads, OpenMP and/or CUDA device (if none is enabled -# the Serial device will be used) -PTHREADS ?= yes -OMP ?= yes -CUDA ?= no - -# Build for Debug mode: add debug flags and enable boundschecks within Kokkos -DEBUG ?= no - -# Code generation options: use AVX instruction set; build for Xeon Phi (MIC); use -# reduced precision math (sets compiler flags such --fast_math) -AVX ?= no -MIC ?= no -RED_PREC ?=no - -# Optional Libraries: use hwloc for thread affinity; use librt for timers -HWLOC ?= no -LIBRT ?= no - -# CUDA specific options: use UVM (requires CUDA 6+); use LDG loads instead of -# texture fetches; compile for relocatable device code (function pointers) -CUDA_UVM ?= no -CUDA_LDG ?= no -CUDA_RELOC ?= no - -# Settings for replacing generic linear algebra kernels of Kokkos with vendor -# libraries. 
-CUSPARSE ?= no -CUBLAS ?= no - -#Typically nothing should be changed after this point - -KOKKOS_INC = -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src -I$(KOKKOS_PATH)/linalg/src -I../ -DKOKKOS_DONT_INCLUDE_CORE_CONFIG_H - -KOKKOS_HEADERS = $(wildcard $(KOKKOS_PATH)/core/src/*.hpp) -KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp) -KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp) -KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.hpp) -KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/linalg/src/*.hpp) - -SRC_KOKKOS = $(wildcard $(KOKKOS_PATH)/core/src/impl/*.cpp) -SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/containers/src/impl/*.cpp) -KOKKOS_LIB = libkokkoscore.a - -ifeq ($(CUDA), yes) -KOKKOS_INC += -x cu -DKOKKOS_HAVE_CUDA -SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.cpp) -SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.cu) -KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp) -KOKKOS_LINK += -L$(CUDA_PATH)/lib64 -lcudart -lcuda -ifeq ($(CUDA_UVM), yes) -KOKKOS_INC += -DKOKKOS_USE_CUDA_UVM -endif -endif - -ifeq ($(CUSPARSE), yes) -KOKKOS_INC += -DKOKKOS_USE_CUSPARSE -KOKKOS_LIB += -lcusparse -endif - -ifeq ($(CUBLAS), yes) -KOKKOS_INC += -DKOKKOS_USE_CUBLAS -KOKKOS_LIB += -lcublas -endif - -ifeq ($(MIC), yes) -KOKKOS_INC += -mmic -KOKKOS_LINK += -mmic -AVX = no -endif - -ifeq ($(AVX), yes) -ifeq ($(CUDA), yes) -KOKKOS_INC += -Xcompiler -mavx -else -KOKKOS_INC += -mavx -endif -KOKKOS_LINK += -mavx -endif - -ifeq ($(PTHREADS),yes) -KOKKOS_INC += -DKOKKOS_HAVE_PTHREAD -KOKKOS_LIB += -lpthread -SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.cpp) -KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Threads/*.hpp) -endif - -ifeq ($(OMP),yes) -KOKKOS_INC += -DKOKKOS_HAVE_OPENMP -SRC_KOKKOS += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.cpp) -KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/OpenMP/*.hpp) -ifeq ($(CUDA), yes) -KOKKOS_INC += -Xcompiler -fopenmp -KOKKOS_LINK += -Xcompiler -fopenmp -else -KOKKOS_INC += -fopenmp -KOKKOS_LINK += -fopenmp -endif -endif - -ifeq ($(HWLOC),yes) -KOKKOS_INC += -DKOKKOS_HAVE_HWLOC -I$(HWLOC_PATH)/include -KOKKOS_LINK += -L$(HWLOC_PATH)/lib -lhwloc -endif - -ifeq ($(RED_PREC), yes) -KOKKOS_INC += --use_fast_math -endif - -ifeq ($(DEBUG), yes) -ifeq ($(CUDA), yes) -KOKKOS_INC += -G -endif -KOKKOS_INC += -g -DKOKKOS_EXPRESSION_CHECK -DENABLE_TRACEBACK -KOKKOS_LINK += -g -ldl -endif - -ifeq ($(LIBRT),yes) -KOKKOS_INC += -DKOKKOS_USE_LIBRT -DPREC_TIMER -KOKKOS_LIB += -lrt -endif - -ifeq ($(CUDA_LDG), yes) -KOKKOS_INC += -DKOKKOS_USE_LDG_INTRINSIC -endif - -ifeq ($(CUDA), yes) -ifeq ($(CUDA_RELOC), yes) -KOKKOS_INC += -DKOKKOS_CUDA_USE_RELOCATABLE_DEVICE_CODE --relocatable-device-code=true -KOKKOS_LINK += --relocatable-device-code=true -endif -endif - -# Must build with C++11 -KOKKOS_INC += --std=c++11 -DKOKKOS_HAVE_CXX11 - -OBJ_KOKKOS_TMP = $(SRC_KOKKOS:.cpp=.o) -OBJ_KOKKOS = $(OBJ_KOKKOS_TMP:.cu=.o) -OBJ_KOKKOS_LINK = $(notdir $(OBJ_KOKKOS)) - -override OBJ += kokkos_depend.o - -libkokkoscore.a: $(OBJ_KOKKOS) - ar cr libkokkoscore.a $(OBJ_KOKKOS_LINK) - -kokkos_depend.o: libkokkoscore.a - touch kokkos_depend.cpp - $(CC) $(CCFLAGS) $(SHFLAGS) $(EXTRA_INC) -c kokkos_depend.cpp - - -KOKKOS_LINK += -L./ $(KOKKOS_LIB) diff --git a/lib/kokkos/Makefile.targets b/lib/kokkos/Makefile.targets new file mode 100755 index 0000000000..86708ac801 --- /dev/null +++ b/lib/kokkos/Makefile.targets @@ -0,0 +1,50 @@ +Kokkos_UnorderedMap_impl.o: 
$(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/containers/src/impl/Kokkos_UnorderedMap_impl.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/containers/src/impl/Kokkos_UnorderedMap_impl.cpp +Kokkos_AllocationTracker.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_AllocationTracker.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_AllocationTracker.cpp +Kokkos_BasicAllocators.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_BasicAllocators.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_BasicAllocators.cpp +Kokkos_Core.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Core.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Core.cpp +Kokkos_Error.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Error.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Error.cpp +Kokkos_HostSpace.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_HostSpace.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_HostSpace.cpp +Kokkos_hwloc.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_hwloc.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_hwloc.cpp +Kokkos_Serial.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Serial.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Serial.cpp +Kokkos_Serial_TaskPolicy.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Serial_TaskPolicy.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Serial_TaskPolicy.cpp +Kokkos_Shape.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Shape.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Shape.cpp +Kokkos_spinwait.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_spinwait.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_spinwait.cpp +Kokkos_Profiling_Interface.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/Kokkos_Profiling_Interface.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/Kokkos_Profiling_Interface.cpp +KokkosExp_SharedAlloc.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/impl/KokkosExp_SharedAlloc.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/impl/KokkosExp_SharedAlloc.cpp + +ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1) +Kokkos_Cuda_BasicAllocators.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Cuda/Kokkos_Cuda_BasicAllocators.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Cuda/Kokkos_Cuda_BasicAllocators.cpp +Kokkos_Cuda_Impl.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Cuda/Kokkos_Cuda_Impl.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Cuda/Kokkos_Cuda_Impl.cpp +Kokkos_CudaSpace.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Cuda/Kokkos_CudaSpace.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Cuda/Kokkos_CudaSpace.cpp +endif + +ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1) +Kokkos_ThreadsExec_base.o: $(KOKKOS_CPP_DEPENDS) 
$(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec_base.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec_base.cpp +Kokkos_ThreadsExec.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec.cpp +Kokkos_Threads_TaskPolicy.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Threads/Kokkos_Threads_TaskPolicy.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Threads/Kokkos_Threads_TaskPolicy.cpp +endif + +ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1) +Kokkos_OpenMPexec.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/OpenMP/Kokkos_OpenMPexec.cpp + $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/OpenMP/Kokkos_OpenMPexec.cpp +endif + diff --git a/lib/kokkos/README b/lib/kokkos/README index 59f5685bab..f979495bfd 100755 --- a/lib/kokkos/README +++ b/lib/kokkos/README @@ -1,44 +1,97 @@ -Kokkos library +Kokkos implements a programming model in C++ for writing performance portable +applications targeting all major HPC platforms. For that purpose it provides +abstractions for both parallel execution of code and data management. +Kokkos is designed to target complex node architectures with N-level memory +hierarchies and multiple types of execution resources. It currently can use +OpenMP, Pthreads and CUDA as backend programming models. -Carter Edwards, Christian Trott, Daniel Sunderland -Sandia National Labs +The core developers of Kokkos are Carter Edwards and Christian Trott +at the Computer Science Research Institute of the Sandia National +Laboratories. -29 May 2014 -http://trilinos.sandia.gov/packages/kokkos/ +The KokkosP interface and associated tools are developed by the Application +Performance Team and Kokkos core developers at Sandia National Laboratories. -------------------------- +To learn more about Kokkos consider watching one of our presentations: +GTC 2015: + http://on-demand.gputechconf.com/gtc/2015/video/S5166.html + http://on-demand.gputechconf.com/gtc/2015/presentation/S5166-H-Carter-Edwards.pdf -This directory has source files from the Kokkos library that LAMMPS -uses when building with its KOKKOS package. The package contains -versions of pair, fix, and atom styles written with Kokkos data -structures and calls to the Kokkos library that should run efficiently -on various kinds of accelerated nodes, including GPU and many-core -chips. +A programming guide can be found under doc/Kokkos_PG.pdf. This is an initial version +and feedback is greatly appreciated. -Kokkos is a C++ library that provides two key abstractions for an -application like LAMMPS. First, it allows a single implementation of -an application kernel (e.g. a pair style) to run efficiently on -different kinds of hardware (GPU, Intel Phi, many-core chip). +For questions please send an email to +kokkos-users@software.sandia.gov -Second, it provides data abstractions to adjust (at compile time) the -memory layout of basic data structures like 2d and 3d arrays and allow -the transparent utilization of special hardware load and store units. -Such data structures are used in LAMMPS to store atom coordinates or -forces or neighbor lists. The layout is chosen to optimize -performance on different platforms. Again this operation is hidden -from the developer, and does not affect how the single implementation -of the kernel is coded. 
+For non-public questions send an email to +hcedwar(at)sandia.gov and crtrott(at)sandia.gov -To build LAMMPS with Kokkos, you should not need to make any changes -to files in this directory. You can overrided defaults that are set -in Makefile.lammps when building LAMMPS, by defining variables as part -of the make command. Details of the build process with Kokkos are -explained in Section 2.3 of doc/Section_start.html. and in Section 5.9 -of doc/Section_accelerate.html. +============================================================================ +====Requirements============================================================ +============================================================================ + +Primary tested compilers are: + GCC 4.7.2 + GCC 5.1.0 + Intel 14.0.1 + Intel 15.0.1 + Clang 3.7.0 + +Secondary tested compilers are: + CUDA 6.5 + CUDA 7.0 + +Primary tested compiler are passing in release mode +with warnings as errors. We are using the following set +of flags: +GCC: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits + -Wignored-qualifiers -Wempty-body -Wclobbered -Wuninitialized +Intel: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized +Clang: -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized + + +============================================================================ +====Getting started========================================================= +============================================================================ + +In the 'example/tutorial' directory you will find step by step tutorial +examples which explain many of the features of Kokkos. They work with +simple Makefiles. To build with g++ and OpenMP simply type 'make openmp' +in the 'example/tutorial' directory. This will build all examples in the +subfolders. + +============================================================================ +====Running Unit Tests====================================================== +============================================================================ + +To run the unit tests create a build directory and run the following commands + +KOKKOS_PATH/generate_makefile.bash +make build-test +make test + +Run KOKKOS_PATH/generate_makefile.bash --help for more detailed options such as +changing the device type for which to build. + +============================================================================ +====Install the library===================================================== +============================================================================ + +To install Kokkos as a library create a build directory and run the following + +KOKKOS_PATH/generate_makefile.bash --prefix=INSTALL_PATH +make lib +make install + +KOKKOS_PATH/generate_makefile.bash --help for more detailed options such as +changing the device type for which to build. + +============================================================================ +====CMakeFiles============================================================== +============================================================================ + +The CMake files contained in this repository require Tribits and are used +for integration with Trilinos. They do not currently support a standalone +CMake build. -The one exception is that when using Kokkos with NVIDIA GPUs, the -CUDA_PATH setting in Makefile.lammps needs to point to the -installation of the Cuda software on your machine. The normal default -location is /usr/local/cuda. 
If this is not correct, you need to edit -Makefile.lammps. diff --git a/lib/kokkos/TPL/cmake/Dependencies.cmake b/lib/kokkos/TPL/cmake/Dependencies.cmake deleted file mode 100755 index 7ea652bf32..0000000000 --- a/lib/kokkos/TPL/cmake/Dependencies.cmake +++ /dev/null @@ -1,9 +0,0 @@ -SET(LIB_REQUIRED_DEP_PACKAGES) -SET(LIB_OPTIONAL_DEP_PACKAGES) -SET(TEST_REQUIRED_DEP_PACKAGES) -SET(TEST_OPTIONAL_DEP_PACKAGES) -SET(LIB_REQUIRED_DEP_TPLS) -# Only dependency: -SET(LIB_OPTIONAL_DEP_TPLS CUDA) -SET(TEST_REQUIRED_DEP_TPLS ) -SET(TEST_OPTIONAL_DEP_TPLS ) diff --git a/lib/kokkos/TPL/cub/block/block_discontinuity.cuh b/lib/kokkos/TPL/cub/block/block_discontinuity.cuh deleted file mode 100755 index 76af003e58..0000000000 --- a/lib/kokkos/TPL/cub/block/block_discontinuity.cuh +++ /dev/null @@ -1,587 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * The cub::BlockDiscontinuity class provides [collective](index.html#sec0) methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block. - */ - -#pragma once - -#include "../util_type.cuh" -#include "../util_namespace.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - -/** - * \brief The BlockDiscontinuity class provides [collective](index.html#sec0) methods for flagging discontinuities within an ordered set of items partitioned across a CUDA thread block.  - * \ingroup BlockModule - * - * \par Overview - * A set of "head flags" (or "tail flags") is often used to indicate corresponding items - * that differ from their predecessors (or successors). For example, head flags are convenient - * for demarcating disjoint data segments as part of a segmented scan or reduction. - * - * \tparam T The data type to be flagged. 
- * \tparam BLOCK_THREADS The thread block size in threads. - * - * \par A Simple Example - * \blockcollective{BlockDiscontinuity} - * \par - * The code snippet below illustrates the head flagging of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include
- * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockDiscontinuity for 128 threads on type int - * typedef cub::BlockDiscontinuity BlockDiscontinuity; - * - * // Allocate shared memory for BlockDiscontinuity - * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute head flags for discontinuities in the segment - * int head_flags[4]; - * BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality()); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }. - * The corresponding output \p head_flags in those threads will be - * { [1,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }. - * - * \par Performance Considerations - * - Zero bank conflicts for most types. - * - */ -template < - typename T, - int BLOCK_THREADS> -class BlockDiscontinuity -{ -private: - - /****************************************************************************** - * Type definitions - ******************************************************************************/ - - /// Shared memory storage layout type (last element from each thread's input) - typedef T _TempStorage[BLOCK_THREADS]; - - - /****************************************************************************** - * Utility methods - ******************************************************************************/ - - /// Internal storage allocator - __device__ __forceinline__ _TempStorage& PrivateStorage() - { - __shared__ _TempStorage private_storage; - return private_storage; - } - - - /// Specialization for when FlagOp has third index param - template ::HAS_PARAM> - struct ApplyOp - { - // Apply flag operator - static __device__ __forceinline__ bool Flag(FlagOp flag_op, const T &a, const T &b, int idx) - { - return flag_op(a, b, idx); - } - }; - - /// Specialization for when FlagOp does not have a third index param - template - struct ApplyOp - { - // Apply flag operator - static __device__ __forceinline__ bool Flag(FlagOp flag_op, const T &a, const T &b, int idx) - { - return flag_op(a, b); - } - }; - - - /****************************************************************************** - * Thread fields - ******************************************************************************/ - - /// Shared storage reference - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - - -public: - - /// \smemstorage{BlockDiscontinuity} - struct TempStorage : Uninitialized<_TempStorage> {}; - - - /******************************************************************//** - * \name Collective constructors - *********************************************************************/ - //@{ - - /** - * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockDiscontinuity() - : - temp_storage(PrivateStorage()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. 
- */ - __device__ __forceinline__ BlockDiscontinuity( - TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage - : - temp_storage(temp_storage.Alias()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier - */ - __device__ __forceinline__ BlockDiscontinuity( - int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(PrivateStorage()), - linear_tid(linear_tid) - {} - - - /** - * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. - */ - __device__ __forceinline__ BlockDiscontinuity( - TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage - int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - - - //@} end member group - /******************************************************************//** - * \name Head flag operations - *********************************************************************/ - //@{ - - - /** - * \brief Sets head flags indicating discontinuities between items partitioned across the thread block, for which the first item has no reference and is always flagged. - * - * The flag head_flagsi is set for item - * inputi when - * flag_op(previous-item, inputi) - * returns \p true (where previous-item is either the preceding item - * in the same thread or the last item in the previous thread). - * Furthermore, head_flagsi is always set for - * input>0 in thread0. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates the head-flagging of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockDiscontinuity for 128 threads on type int - * typedef cub::BlockDiscontinuity BlockDiscontinuity; - * - * // Allocate shared memory for BlockDiscontinuity - * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute head flags for discontinuities in the segment - * int head_flags[4]; - * BlockDiscontinuity(temp_storage).FlagHeads(head_flags, thread_data, cub::Inequality()); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }. - * The corresponding output \p head_flags in those threads will be - * { [1,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam FlagT [inferred] The flag type (must be an integer type) - * \tparam FlagOp [inferred] Binary predicate functor type having member T operator()(const T &a, const T &b) or member T operator()(const T &a, const T &b, unsigned int b_index), and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data. - */ - template < - int ITEMS_PER_THREAD, - typename FlagT, - typename FlagOp> - __device__ __forceinline__ void FlagHeads( - FlagT (&head_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity head_flags - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - FlagOp flag_op) ///< [in] Binary boolean flag predicate - { - // Share last item - temp_storage[linear_tid] = input[ITEMS_PER_THREAD - 1]; - - __syncthreads(); - - // Set flag for first item - head_flags[0] = (linear_tid == 0) ? - 1 : // First thread - ApplyOp ::Flag( - flag_op, - temp_storage[linear_tid - 1], - input[0], - linear_tid * ITEMS_PER_THREAD); - - // Set head_flags for remaining items - #pragma unroll - for (int ITEM = 1; ITEM < ITEMS_PER_THREAD; ITEM++) - { - head_flags[ITEM] = ApplyOp ::Flag( - flag_op, - input[ITEM - 1], - input[ITEM], - (linear_tid * ITEMS_PER_THREAD) + ITEM); - } - } - - - /** - * \brief Sets head flags indicating discontinuities between items partitioned across the thread block. - * - * The flag head_flagsi is set for item - * inputi when - * flag_op(previous-item, inputi) - * returns \p true (where previous-item is either the preceding item - * in the same thread or the last item in the previous thread). - * For thread0, item input0 is compared - * against \p tile_predecessor_item. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates the head-flagging of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockDiscontinuity for 128 threads on type int - * typedef cub::BlockDiscontinuity BlockDiscontinuity; - * - * // Allocate shared memory for BlockDiscontinuity - * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Have thread0 obtain the predecessor item for the entire tile - * int tile_predecessor_item; - * if (threadIdx.x == 0) tile_predecessor_item == ... - * - * // Collectively compute head flags for discontinuities in the segment - * int head_flags[4]; - * BlockDiscontinuity(temp_storage).FlagHeads( - * head_flags, thread_data, cub::Inequality(), tile_predecessor_item); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [0,0,1,1], [1,1,1,1], [2,3,3,3], [3,4,4,4], ... }, - * and that \p tile_predecessor_item is \p 0. The corresponding output \p head_flags in those threads will be - * { [0,0,1,0], [0,0,0,0], [1,1,0,0], [0,1,0,0], ... }. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam FlagT [inferred] The flag type (must be an integer type) - * \tparam FlagOp [inferred] Binary predicate functor type having member T operator()(const T &a, const T &b) or member T operator()(const T &a, const T &b, unsigned int b_index), and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data. - */ - template < - int ITEMS_PER_THREAD, - typename FlagT, - typename FlagOp> - __device__ __forceinline__ void FlagHeads( - FlagT (&head_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity head_flags - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - FlagOp flag_op, ///< [in] Binary boolean flag predicate - T tile_predecessor_item) ///< [in] [thread0 only] Item with which to compare the first tile item (input0 from thread0). - { - // Share last item - temp_storage[linear_tid] = input[ITEMS_PER_THREAD - 1]; - - __syncthreads(); - - // Set flag for first item - int predecessor = (linear_tid == 0) ? - tile_predecessor_item : // First thread - temp_storage[linear_tid - 1]; - - head_flags[0] = ApplyOp ::Flag( - flag_op, - predecessor, - input[0], - linear_tid * ITEMS_PER_THREAD); - - // Set flag for remaining items - #pragma unroll - for (int ITEM = 1; ITEM < ITEMS_PER_THREAD; ITEM++) - { - head_flags[ITEM] = ApplyOp ::Flag( - flag_op, - input[ITEM - 1], - input[ITEM], - (linear_tid * ITEMS_PER_THREAD) + ITEM); - } - } - - - //@} end member group - /******************************************************************//** - * \name Tail flag operations - *********************************************************************/ - //@{ - - - /** - * \brief Sets tail flags indicating discontinuities between items partitioned across the thread block, for which the last item has no reference and is always flagged. - * - * The flag tail_flagsi is set for item - * inputi when - * flag_op(inputi, next-item) - * returns \p true (where next-item is either the next item - * in the same thread or the first item in the next thread). - * Furthermore, tail_flagsITEMS_PER_THREAD-1 is always - * set for threadBLOCK_THREADS-1. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates the tail-flagging of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockDiscontinuity for 128 threads on type int - * typedef cub::BlockDiscontinuity BlockDiscontinuity; - * - * // Allocate shared memory for BlockDiscontinuity - * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute tail flags for discontinuities in the segment - * int tail_flags[4]; - * BlockDiscontinuity(temp_storage).FlagTails(tail_flags, thread_data, cub::Inequality()); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [0,0,1,1], [1,1,1,1], [2,3,3,3], ..., [124,125,125,125] }. - * The corresponding output \p tail_flags in those threads will be - * { [0,1,0,0], [0,0,0,1], [1,0,0,...], ..., [1,0,0,1] }. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam FlagT [inferred] The flag type (must be an integer type) - * \tparam FlagOp [inferred] Binary predicate functor type having member T operator()(const T &a, const T &b) or member T operator()(const T &a, const T &b, unsigned int b_index), and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data. - */ - template < - int ITEMS_PER_THREAD, - typename FlagT, - typename FlagOp> - __device__ __forceinline__ void FlagTails( - FlagT (&tail_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity tail_flags - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - FlagOp flag_op) ///< [in] Binary boolean flag predicate - { - // Share first item - temp_storage[linear_tid] = input[0]; - - __syncthreads(); - - // Set flag for last item - tail_flags[ITEMS_PER_THREAD - 1] = (linear_tid == BLOCK_THREADS - 1) ? - 1 : // Last thread - ApplyOp ::Flag( - flag_op, - input[ITEMS_PER_THREAD - 1], - temp_storage[linear_tid + 1], - (linear_tid * ITEMS_PER_THREAD) + (ITEMS_PER_THREAD - 1)); - - // Set flags for remaining items - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD - 1; ITEM++) - { - tail_flags[ITEM] = ApplyOp ::Flag( - flag_op, - input[ITEM], - input[ITEM + 1], - (linear_tid * ITEMS_PER_THREAD) + ITEM); - } - } - - - /** - * \brief Sets tail flags indicating discontinuities between items partitioned across the thread block. - * - * The flag tail_flagsi is set for item - * inputi when - * flag_op(inputi, next-item) - * returns \p true (where next-item is either the next item - * in the same thread or the first item in the next thread). - * For threadBLOCK_THREADS-1, item - * inputITEMS_PER_THREAD-1 is compared - * against \p tile_predecessor_item. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates the tail-flagging of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockDiscontinuity for 128 threads on type int - * typedef cub::BlockDiscontinuity BlockDiscontinuity; - * - * // Allocate shared memory for BlockDiscontinuity - * __shared__ typename BlockDiscontinuity::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Have thread127 obtain the successor item for the entire tile - * int tile_successor_item; - * if (threadIdx.x == 127) tile_successor_item == ... - * - * // Collectively compute tail flags for discontinuities in the segment - * int tail_flags[4]; - * BlockDiscontinuity(temp_storage).FlagTails( - * tail_flags, thread_data, cub::Inequality(), tile_successor_item); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [0,0,1,1], [1,1,1,1], [2,3,3,3], ..., [124,125,125,125] } - * and that \p tile_successor_item is \p 125. The corresponding output \p tail_flags in those threads will be - * { [0,1,0,0], [0,0,0,1], [1,0,0,...], ..., [1,0,0,0] }. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam FlagT [inferred] The flag type (must be an integer type) - * \tparam FlagOp [inferred] Binary predicate functor type having member T operator()(const T &a, const T &b) or member T operator()(const T &a, const T &b, unsigned int b_index), and returning \p true if a discontinuity exists between \p a and \p b, otherwise \p false. \p b_index is the rank of b in the aggregate tile of data. - */ - template < - int ITEMS_PER_THREAD, - typename FlagT, - typename FlagOp> - __device__ __forceinline__ void FlagTails( - FlagT (&tail_flags)[ITEMS_PER_THREAD], ///< [out] Calling thread's discontinuity tail_flags - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - FlagOp flag_op, ///< [in] Binary boolean flag predicate - T tile_successor_item) ///< [in] [threadBLOCK_THREADS-1 only] Item with which to compare the last tile item (inputITEMS_PER_THREAD-1 from threadBLOCK_THREADS-1). - { - // Share first item - temp_storage[linear_tid] = input[0]; - - __syncthreads(); - - // Set flag for last item - int successor_item = (linear_tid == BLOCK_THREADS - 1) ? - tile_successor_item : // Last thread - temp_storage[linear_tid + 1]; - - tail_flags[ITEMS_PER_THREAD - 1] = ApplyOp ::Flag( - flag_op, - input[ITEMS_PER_THREAD - 1], - successor_item, - (linear_tid * ITEMS_PER_THREAD) + (ITEMS_PER_THREAD - 1)); - - // Set flags for remaining items - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD - 1; ITEM++) - { - tail_flags[ITEM] = ApplyOp ::Flag( - flag_op, - input[ITEM], - input[ITEM + 1], - (linear_tid * ITEMS_PER_THREAD) + ITEM); - } - } - - //@} end member group - -}; - - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) diff --git a/lib/kokkos/TPL/cub/block/block_exchange.cuh b/lib/kokkos/TPL/cub/block/block_exchange.cuh deleted file mode 100755 index b7b95343b5..0000000000 --- a/lib/kokkos/TPL/cub/block/block_exchange.cuh +++ /dev/null @@ -1,918 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * The cub::BlockExchange class provides [collective](index.html#sec0) methods for rearranging data partitioned across a CUDA thread block. - */ - -#pragma once - -#include "../util_arch.cuh" -#include "../util_macro.cuh" -#include "../util_type.cuh" -#include "../util_namespace.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - -/** - * \brief The BlockExchange class provides [collective](index.html#sec0) methods for rearranging data partitioned across a CUDA thread block.  - * \ingroup BlockModule - * - * \par Overview - * It is commonplace for blocks of threads to rearrange data items between - * threads. For example, the global memory subsystem prefers access patterns - * where data items are "striped" across threads (where consecutive threads access consecutive items), - * yet most block-wide operations prefer a "blocked" partitioning of items across threads - * (where consecutive items belong to a single thread). - * - * \par - * BlockExchange supports the following types of data exchanges: - * - Transposing between [blocked](index.html#sec5sec4) and [striped](index.html#sec5sec4) arrangements - * - Transposing between [blocked](index.html#sec5sec4) and [warp-striped](index.html#sec5sec4) arrangements - * - Scattering ranked items to a [blocked arrangement](index.html#sec5sec4) - * - Scattering ranked items to a [striped arrangement](index.html#sec5sec4) - * - * \tparam T The data type to be exchanged. - * \tparam BLOCK_THREADS The thread block size in threads. - * \tparam ITEMS_PER_THREAD The number of items partitioned onto each thread. - * \tparam WARP_TIME_SLICING [optional] When \p true, only use enough shared memory for a single warp's worth of tile data, time-slicing the block-wide exchange over multiple synchronized rounds. Yields a smaller memory footprint at the expense of decreased parallelism. (Default: false) - * - * \par A Simple Example - * \blockcollective{BlockExchange} - * \par - * The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement - * of 512 integer items partitioned across 128 threads where each thread owns 4 items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, ...) 
- * { - * // Specialize BlockExchange for 128 threads owning 4 integer items each - * typedef cub::BlockExchange BlockExchange; - * - * // Allocate shared memory for BlockExchange - * __shared__ typename BlockExchange::TempStorage temp_storage; - * - * // Load a tile of data striped across threads - * int thread_data[4]; - * cub::LoadStriped (threadIdx.x, d_data, thread_data); - * - * // Collectively exchange data into a blocked arrangement across threads - * BlockExchange(temp_storage).StripedToBlocked(thread_data); - * - * \endcode - * \par - * Suppose the set of striped input \p thread_data across the block of threads is - * { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] }. - * The corresponding output \p thread_data in those threads will be - * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. - * - * \par Performance Considerations - * - Proper device-specific padding ensures zero bank conflicts for most types. - * - */ -template < - typename T, - int BLOCK_THREADS, - int ITEMS_PER_THREAD, - bool WARP_TIME_SLICING = false> -class BlockExchange -{ -private: - - /****************************************************************************** - * Constants - ******************************************************************************/ - - enum - { - LOG_WARP_THREADS = PtxArchProps::LOG_WARP_THREADS, - WARP_THREADS = 1 << LOG_WARP_THREADS, - WARPS = (BLOCK_THREADS + PtxArchProps::WARP_THREADS - 1) / PtxArchProps::WARP_THREADS, - - LOG_SMEM_BANKS = PtxArchProps::LOG_SMEM_BANKS, - SMEM_BANKS = 1 << LOG_SMEM_BANKS, - - TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD, - - TIME_SLICES = (WARP_TIME_SLICING) ? WARPS : 1, - - TIME_SLICED_THREADS = (WARP_TIME_SLICING) ? CUB_MIN(BLOCK_THREADS, WARP_THREADS) : BLOCK_THREADS, - TIME_SLICED_ITEMS = TIME_SLICED_THREADS * ITEMS_PER_THREAD, - - WARP_TIME_SLICED_THREADS = CUB_MIN(BLOCK_THREADS, WARP_THREADS), - WARP_TIME_SLICED_ITEMS = WARP_TIME_SLICED_THREADS * ITEMS_PER_THREAD, - - // Insert padding if the number of items per thread is a power of two - INSERT_PADDING = ((ITEMS_PER_THREAD & (ITEMS_PER_THREAD - 1)) == 0), - PADDING_ITEMS = (INSERT_PADDING) ? (TIME_SLICED_ITEMS >> LOG_SMEM_BANKS) : 0, - }; - - /****************************************************************************** - * Type definitions - ******************************************************************************/ - - /// Shared memory storage layout type - typedef T _TempStorage[TIME_SLICED_ITEMS + PADDING_ITEMS]; - -public: - - /// \smemstorage{BlockExchange} - struct TempStorage : Uninitialized<_TempStorage> {}; - -private: - - - /****************************************************************************** - * Thread fields - ******************************************************************************/ - - /// Shared storage reference - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - int warp_lane; - int warp_id; - int warp_offset; - - - /****************************************************************************** - * Utility methods - ******************************************************************************/ - - /// Internal storage allocator - __device__ __forceinline__ _TempStorage& PrivateStorage() - { - __shared__ _TempStorage private_storage; - return private_storage; - } - - - /** - * Transposes data items from blocked arrangement to striped arrangement. Specialized for no timeslicing. 
- */ - __device__ __forceinline__ void BlockedToStriped( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between blocked and striped arrangements. - Int2Type time_slicing) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_storage[item_offset] = items[ITEM]; - } - - __syncthreads(); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - items[ITEM] = temp_storage[item_offset]; - } - } - - - /** - * Transposes data items from blocked arrangement to striped arrangement. Specialized for warp-timeslicing. - */ - __device__ __forceinline__ void BlockedToStriped( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between blocked and striped arrangements. - Int2Type time_slicing) - { - T temp_items[ITEMS_PER_THREAD]; - - #pragma unroll - for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++) - { - const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS; - const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS; - - __syncthreads(); - - if (warp_id == SLICE) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_storage[item_offset] = items[ITEM]; - } - } - - __syncthreads(); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - // Read a strip of items - const int STRIP_OFFSET = ITEM * BLOCK_THREADS; - const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS; - - if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET)) - { - int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET; - if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS)) - { - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_items[ITEM] = temp_storage[item_offset]; - } - } - } - } - - // Copy - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = temp_items[ITEM]; - } - } - - - /** - * Transposes data items from blocked arrangement to warp-striped arrangement. Specialized for no timeslicing - */ - __device__ __forceinline__ void BlockedToWarpStriped( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between blocked and warp-striped arrangements. - Int2Type time_slicing) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = warp_offset + ITEM + (warp_lane * ITEMS_PER_THREAD); - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_storage[item_offset] = items[ITEM]; - } - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = warp_offset + (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - items[ITEM] = temp_storage[item_offset]; - } - } - - /** - * Transposes data items from blocked arrangement to warp-striped arrangement. Specialized for warp-timeslicing - */ - __device__ __forceinline__ void BlockedToWarpStriped( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between blocked and warp-striped arrangements. 
- Int2Type time_slicing) - { - #pragma unroll - for (int SLICE = 0; SLICE < TIME_SLICES; ++SLICE) - { - __syncthreads(); - - if (warp_id == SLICE) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = ITEM + (warp_lane * ITEMS_PER_THREAD); - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_storage[item_offset] = items[ITEM]; - } - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - items[ITEM] = temp_storage[item_offset]; - } - } - } - } - - - /** - * Transposes data items from striped arrangement to blocked arrangement. Specialized for no timeslicing. - */ - __device__ __forceinline__ void StripedToBlocked( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between striped and blocked arrangements. - Int2Type time_slicing) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_storage[item_offset] = items[ITEM]; - } - - __syncthreads(); - - // No timeslicing - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - items[ITEM] = temp_storage[item_offset]; - } - } - - - /** - * Transposes data items from striped arrangement to blocked arrangement. Specialized for warp-timeslicing. - */ - __device__ __forceinline__ void StripedToBlocked( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between striped and blocked arrangements. - Int2Type time_slicing) - { - // Warp time-slicing - T temp_items[ITEMS_PER_THREAD]; - - #pragma unroll - for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++) - { - const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS; - const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS; - - __syncthreads(); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - // Write a strip of items - const int STRIP_OFFSET = ITEM * BLOCK_THREADS; - const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS; - - if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET)) - { - int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET; - if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS)) - { - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_storage[item_offset] = items[ITEM]; - } - } - } - - __syncthreads(); - - if (warp_id == SLICE) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_items[ITEM] = temp_storage[item_offset]; - } - } - } - - // Copy - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = temp_items[ITEM]; - } - } - - - /** - * Transposes data items from warp-striped arrangement to blocked arrangement. Specialized for no timeslicing - */ - __device__ __forceinline__ void WarpStripedToBlocked( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between warp-striped and blocked arrangements. 
- Int2Type time_slicing) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = warp_offset + (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_storage[item_offset] = items[ITEM]; - } - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = warp_offset + ITEM + (warp_lane * ITEMS_PER_THREAD); - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - items[ITEM] = temp_storage[item_offset]; - } - } - - - /** - * Transposes data items from warp-striped arrangement to blocked arrangement. Specialized for warp-timeslicing - */ - __device__ __forceinline__ void WarpStripedToBlocked( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange, converting between warp-striped and blocked arrangements. - Int2Type time_slicing) - { - #pragma unroll - for (int SLICE = 0; SLICE < TIME_SLICES; ++SLICE) - { - __syncthreads(); - - if (warp_id == SLICE) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = (ITEM * WARP_TIME_SLICED_THREADS) + warp_lane; - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_storage[item_offset] = items[ITEM]; - } - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = ITEM + (warp_lane * ITEMS_PER_THREAD); - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - items[ITEM] = temp_storage[item_offset]; - } - } - } - } - - - /** - * Exchanges data items annotated by rank into blocked arrangement. Specialized for no timeslicing. - */ - __device__ __forceinline__ void ScatterToBlocked( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange - int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks - Int2Type time_slicing) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = ranks[ITEM]; - if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); - temp_storage[item_offset] = items[ITEM]; - } - - __syncthreads(); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = (linear_tid * ITEMS_PER_THREAD) + ITEM; - if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); - items[ITEM] = temp_storage[item_offset]; - } - } - - /** - * Exchanges data items annotated by rank into blocked arrangement. Specialized for warp-timeslicing. 
- */ - __device__ __forceinline__ void ScatterToBlocked( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange - int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks - Int2Type time_slicing) - { - T temp_items[ITEMS_PER_THREAD]; - - #pragma unroll - for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++) - { - __syncthreads(); - - const int SLICE_OFFSET = TIME_SLICED_ITEMS * SLICE; - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = ranks[ITEM] - SLICE_OFFSET; - if ((item_offset >= 0) && (item_offset < WARP_TIME_SLICED_ITEMS)) - { - if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); - temp_storage[item_offset] = items[ITEM]; - } - } - - __syncthreads(); - - if (warp_id == SLICE) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = (warp_lane * ITEMS_PER_THREAD) + ITEM; - if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); - temp_items[ITEM] = temp_storage[item_offset]; - } - } - } - - // Copy - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = temp_items[ITEM]; - } - } - - - /** - * Exchanges data items annotated by rank into striped arrangement. Specialized for no timeslicing. - */ - __device__ __forceinline__ void ScatterToStriped( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange - int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks - Int2Type time_slicing) - { - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = ranks[ITEM]; - if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); - temp_storage[item_offset] = items[ITEM]; - } - - __syncthreads(); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = int(ITEM * BLOCK_THREADS) + linear_tid; - if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); - items[ITEM] = temp_storage[item_offset]; - } - } - - - /** - * Exchanges data items annotated by rank into striped arrangement. Specialized for warp-timeslicing. 
- */ - __device__ __forceinline__ void ScatterToStriped( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange - int ranks[ITEMS_PER_THREAD], ///< [in] Corresponding scatter ranks - Int2Type time_slicing) - { - T temp_items[ITEMS_PER_THREAD]; - - #pragma unroll - for (int SLICE = 0; SLICE < TIME_SLICES; SLICE++) - { - const int SLICE_OFFSET = SLICE * TIME_SLICED_ITEMS; - const int SLICE_OOB = SLICE_OFFSET + TIME_SLICED_ITEMS; - - __syncthreads(); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - int item_offset = ranks[ITEM] - SLICE_OFFSET; - if ((item_offset >= 0) && (item_offset < WARP_TIME_SLICED_ITEMS)) - { - if (INSERT_PADDING) item_offset = SHR_ADD(item_offset, LOG_SMEM_BANKS, item_offset); - temp_storage[item_offset] = items[ITEM]; - } - } - - __syncthreads(); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - // Read a strip of items - const int STRIP_OFFSET = ITEM * BLOCK_THREADS; - const int STRIP_OOB = STRIP_OFFSET + BLOCK_THREADS; - - if ((SLICE_OFFSET < STRIP_OOB) && (SLICE_OOB > STRIP_OFFSET)) - { - int item_offset = STRIP_OFFSET + linear_tid - SLICE_OFFSET; - if ((item_offset >= 0) && (item_offset < TIME_SLICED_ITEMS)) - { - if (INSERT_PADDING) item_offset += item_offset >> LOG_SMEM_BANKS; - temp_items[ITEM] = temp_storage[item_offset]; - } - } - } - } - - // Copy - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = temp_items[ITEM]; - } - } - - -public: - - /******************************************************************//** - * \name Collective constructors - *********************************************************************/ - //@{ - - /** - * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockExchange() - : - temp_storage(PrivateStorage()), - linear_tid(threadIdx.x), - warp_lane(linear_tid & (WARP_THREADS - 1)), - warp_id(linear_tid >> LOG_WARP_THREADS), - warp_offset(warp_id * WARP_TIME_SLICED_ITEMS) - {} - - - /** - * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockExchange( - TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage - : - temp_storage(temp_storage.Alias()), - linear_tid(threadIdx.x), - warp_lane(linear_tid & (WARP_THREADS - 1)), - warp_id(linear_tid >> LOG_WARP_THREADS), - warp_offset(warp_id * WARP_TIME_SLICED_ITEMS) - {} - - - /** - * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier - */ - __device__ __forceinline__ BlockExchange( - int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(PrivateStorage()), - linear_tid(linear_tid), - warp_lane(linear_tid & (WARP_THREADS - 1)), - warp_id(linear_tid >> LOG_WARP_THREADS), - warp_offset(warp_id * WARP_TIME_SLICED_ITEMS) - {} - - - /** - * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. 
- */ - __device__ __forceinline__ BlockExchange( - TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage - int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid), - warp_lane(linear_tid & (WARP_THREADS - 1)), - warp_id(linear_tid >> LOG_WARP_THREADS), - warp_offset(warp_id * WARP_TIME_SLICED_ITEMS) - {} - - - //@} end member group - /******************************************************************//** - * \name Structured exchanges - *********************************************************************/ - //@{ - - /** - * \brief Transposes data items from striped arrangement to blocked arrangement. - * - * \smemreuse - * - * The code snippet below illustrates the conversion from a "striped" to a "blocked" arrangement - * of 512 integer items partitioned across 128 threads where each thread owns 4 items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, ...) - * { - * // Specialize BlockExchange for 128 threads owning 4 integer items each - * typedef cub::BlockExchange BlockExchange; - * - * // Allocate shared memory for BlockExchange - * __shared__ typename BlockExchange::TempStorage temp_storage; - * - * // Load a tile of ordered data into a striped arrangement across block threads - * int thread_data[4]; - * cub::LoadStriped (threadIdx.x, d_data, thread_data); - * - * // Collectively exchange data into a blocked arrangement across threads - * BlockExchange(temp_storage).StripedToBlocked(thread_data); - * - * \endcode - * \par - * Suppose the set of striped input \p thread_data across the block of threads is - * { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] } after loading from global memory. - * The corresponding output \p thread_data in those threads will be - * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. - * - */ - __device__ __forceinline__ void StripedToBlocked( - T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between striped and blocked arrangements. - { - StripedToBlocked(items, Int2Type ()); - } - - /** - * \brief Transposes data items from blocked arrangement to striped arrangement. - * - * \smemreuse - * - * The code snippet below illustrates the conversion from a "blocked" to a "striped" arrangement - * of 512 integer items partitioned across 128 threads where each thread owns 4 items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, ...) - * { - * // Specialize BlockExchange for 128 threads owning 4 integer items each - * typedef cub::BlockExchange BlockExchange; - * - * // Allocate shared memory for BlockExchange - * __shared__ typename BlockExchange::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively exchange data into a striped arrangement across threads - * BlockExchange(temp_storage).BlockedToStriped(thread_data); - * - * // Store data striped across block threads into an ordered tile - * cub::StoreStriped (threadIdx.x, d_data, thread_data); - * - * \endcode - * \par - * Suppose the set of blocked input \p thread_data across the block of threads is - * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. 
- * The corresponding output \p thread_data in those threads will be - * { [0,128,256,384], [1,129,257,385], ..., [127,255,383,511] } in - * preparation for storing to global memory. - * - */ - __device__ __forceinline__ void BlockedToStriped( - T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between blocked and striped arrangements. - { - BlockedToStriped(items, Int2Type ()); - } - - - /** - * \brief Transposes data items from warp-striped arrangement to blocked arrangement. - * - * \smemreuse - * - * The code snippet below illustrates the conversion from a "warp-striped" to a "blocked" arrangement - * of 512 integer items partitioned across 128 threads where each thread owns 4 items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, ...) - * { - * // Specialize BlockExchange for 128 threads owning 4 integer items each - * typedef cub::BlockExchange BlockExchange; - * - * // Allocate shared memory for BlockExchange - * __shared__ typename BlockExchange::TempStorage temp_storage; - * - * // Load a tile of ordered data into a warp-striped arrangement across warp threads - * int thread_data[4]; - * cub::LoadSWarptriped (threadIdx.x, d_data, thread_data); - * - * // Collectively exchange data into a blocked arrangement across threads - * BlockExchange(temp_storage).WarpStripedToBlocked(thread_data); - * - * \endcode - * \par - * Suppose the set of warp-striped input \p thread_data across the block of threads is - * { [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] } - * after loading from global memory. (The first 128 items are striped across - * the first warp of 32 threads, the second 128 items are striped across the second warp, etc.) - * The corresponding output \p thread_data in those threads will be - * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. - * - */ - __device__ __forceinline__ void WarpStripedToBlocked( - T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between warp-striped and blocked arrangements. - { - WarpStripedToBlocked(items, Int2Type ()); - } - - /** - * \brief Transposes data items from blocked arrangement to warp-striped arrangement. - * - * \smemreuse - * - * The code snippet below illustrates the conversion from a "blocked" to a "warp-striped" arrangement - * of 512 integer items partitioned across 128 threads where each thread owns 4 items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, ...) - * { - * // Specialize BlockExchange for 128 threads owning 4 integer items each - * typedef cub::BlockExchange BlockExchange; - * - * // Allocate shared memory for BlockExchange - * __shared__ typename BlockExchange::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively exchange data into a warp-striped arrangement across threads - * BlockExchange(temp_storage).BlockedToWarpStriped(thread_data); - * - * // Store data striped across warp threads into an ordered tile - * cub::StoreStriped (threadIdx.x, d_data, thread_data); - * - * \endcode - * \par - * Suppose the set of blocked input \p thread_data across the block of threads is - * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. - * The corresponding output \p thread_data in those threads will be - * { [0,32,64,96], [1,33,65,97], [2,34,66,98], ..., [415,447,479,511] } - * in preparation for storing to global memory. 
(The first 128 items are striped across - * the first warp of 32 threads, the second 128 items are striped across the second warp, etc.) - * - */ - __device__ __forceinline__ void BlockedToWarpStriped( - T items[ITEMS_PER_THREAD]) ///< [in-out] Items to exchange, converting between blocked and warp-striped arrangements. - { - BlockedToWarpStriped(items, Int2Type ()); - } - - - //@} end member group - /******************************************************************//** - * \name Scatter exchanges - *********************************************************************/ - //@{ - - - /** - * \brief Exchanges data items annotated by rank into blocked arrangement. - * - * \smemreuse - */ - __device__ __forceinline__ void ScatterToBlocked( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange - int ranks[ITEMS_PER_THREAD]) ///< [in] Corresponding scatter ranks - { - ScatterToBlocked(items, ranks, Int2Type ()); - } - - - /** - * \brief Exchanges data items annotated by rank into striped arrangement. - * - * \smemreuse - */ - __device__ __forceinline__ void ScatterToStriped( - T items[ITEMS_PER_THREAD], ///< [in-out] Items to exchange - int ranks[ITEMS_PER_THREAD]) ///< [in] Corresponding scatter ranks - { - ScatterToStriped(items, ranks, Int2Type ()); - } - - //@} end member group - - -}; - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) - diff --git a/lib/kokkos/TPL/cub/block/block_histogram.cuh b/lib/kokkos/TPL/cub/block/block_histogram.cuh deleted file mode 100755 index dd346e3954..0000000000 --- a/lib/kokkos/TPL/cub/block/block_histogram.cuh +++ /dev/null @@ -1,414 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- * - ******************************************************************************/ - -/** - * \file - * The cub::BlockHistogram class provides [collective](index.html#sec0) methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block. - */ - -#pragma once - -#include "specializations/block_histogram_sort.cuh" -#include "specializations/block_histogram_atomic.cuh" -#include "../util_arch.cuh" -#include "../util_namespace.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - - -/****************************************************************************** - * Algorithmic variants - ******************************************************************************/ - -/** - * \brief BlockHistogramAlgorithm enumerates alternative algorithms for the parallel construction of block-wide histograms. - */ -enum BlockHistogramAlgorithm -{ - - /** - * \par Overview - * Sorting followed by differentiation. Execution is comprised of two phases: - * -# Sort the data using efficient radix sort - * -# Look for "runs" of same-valued keys by detecting discontinuities; the run-lengths are histogram bin counts. - * - * \par Performance Considerations - * Delivers consistent throughput regardless of sample bin distribution. - */ - BLOCK_HISTO_SORT, - - - /** - * \par Overview - * Use atomic addition to update byte counts directly - * - * \par Performance Considerations - * Performance is strongly tied to the hardware implementation of atomic - * addition, and may be significantly degraded for non uniformly-random - * input distributions where many concurrent updates are likely to be - * made to the same bin counter. - */ - BLOCK_HISTO_ATOMIC, -}; - - - -/****************************************************************************** - * Block histogram - ******************************************************************************/ - - -/** - * \brief The BlockHistogram class provides [collective](index.html#sec0) methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.  - * \ingroup BlockModule - * - * \par Overview - * A histogram - * counts the number of observations that fall into each of the disjoint categories (known as bins). - * - * \par - * Optionally, BlockHistogram can be specialized to use different algorithms: - * -# cub::BLOCK_HISTO_SORT. Sorting followed by differentiation. [More...](\ref cub::BlockHistogramAlgorithm) - * -# cub::BLOCK_HISTO_ATOMIC. Use atomic addition to update byte counts directly. [More...](\ref cub::BlockHistogramAlgorithm) - * - * \tparam T The sample type being histogrammed (must be castable to an integer bin identifier) - * \tparam BLOCK_THREADS The thread block size in threads - * \tparam ITEMS_PER_THREAD The number of items per thread - * \tparam BINS The number bins within the histogram - * \tparam ALGORITHM [optional] cub::BlockHistogramAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_HISTO_SORT) - * - * \par A Simple Example - * \blockcollective{BlockHistogram} - * \par - * The code snippet below illustrates a 256-bin histogram of 512 integer samples that - * are partitioned across 128 threads where each thread owns 4 samples. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) 
- * { - * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each - * typedef cub::BlockHistogram BlockHistogram; - * - * // Allocate shared memory for BlockHistogram - * __shared__ typename BlockHistogram::TempStorage temp_storage; - * - * // Allocate shared memory for block-wide histogram bin counts - * __shared__ unsigned int smem_histogram[256]; - * - * // Obtain input samples per thread - * unsigned char data[4]; - * ... - * - * // Compute the block-wide histogram - * BlockHistogram(temp_storage).Histogram(data, smem_histogram); - * - * \endcode - * - * \par Performance and Usage Considerations - * - The histogram output can be constructed in shared or global memory - * - See cub::BlockHistogramAlgorithm for performance details regarding algorithmic alternatives - * - */ -template < - typename T, - int BLOCK_THREADS, - int ITEMS_PER_THREAD, - int BINS, - BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT> -class BlockHistogram -{ -private: - - /****************************************************************************** - * Constants and type definitions - ******************************************************************************/ - - /** - * Ensure the template parameterization meets the requirements of the - * targeted device architecture. BLOCK_HISTO_ATOMIC can only be used - * on version SM120 or later. Otherwise BLOCK_HISTO_SORT is used - * regardless. - */ - static const BlockHistogramAlgorithm SAFE_ALGORITHM = - ((ALGORITHM == BLOCK_HISTO_ATOMIC) && (CUB_PTX_ARCH < 120)) ? - BLOCK_HISTO_SORT : - ALGORITHM; - - /// Internal specialization. - typedef typename If<(SAFE_ALGORITHM == BLOCK_HISTO_SORT), - BlockHistogramSort , - BlockHistogramAtomic >::Type InternalBlockHistogram; - - /// Shared memory storage layout type for BlockHistogram - typedef typename InternalBlockHistogram::TempStorage _TempStorage; - - - /****************************************************************************** - * Thread fields - ******************************************************************************/ - - /// Shared storage reference - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - - - /****************************************************************************** - * Utility methods - ******************************************************************************/ - - /// Internal storage allocator - __device__ __forceinline__ _TempStorage& PrivateStorage() - { - __shared__ _TempStorage private_storage; - return private_storage; - } - - -public: - - /// \smemstorage{BlockHistogram} - struct TempStorage : Uninitialized<_TempStorage> {}; - - - /******************************************************************//** - * \name Collective constructors - *********************************************************************/ - //@{ - - /** - * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockHistogram() - : - temp_storage(PrivateStorage()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. 
- */ - __device__ __forceinline__ BlockHistogram( - TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage - : - temp_storage(temp_storage.Alias()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier - */ - __device__ __forceinline__ BlockHistogram( - int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(PrivateStorage()), - linear_tid(linear_tid) - {} - - - /** - * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. - */ - __device__ __forceinline__ BlockHistogram( - TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage - int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - - //@} end member group - /******************************************************************//** - * \name Histogram operations - *********************************************************************/ - //@{ - - - /** - * \brief Initialize the shared histogram counters to zero. - * - * The code snippet below illustrates a the initialization and update of a - * histogram of 512 integer samples that are partitioned across 128 threads - * where each thread owns 4 samples. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each - * typedef cub::BlockHistogram BlockHistogram; - * - * // Allocate shared memory for BlockHistogram - * __shared__ typename BlockHistogram::TempStorage temp_storage; - * - * // Allocate shared memory for block-wide histogram bin counts - * __shared__ unsigned int smem_histogram[256]; - * - * // Obtain input samples per thread - * unsigned char thread_samples[4]; - * ... - * - * // Initialize the block-wide histogram - * BlockHistogram(temp_storage).InitHistogram(smem_histogram); - * - * // Update the block-wide histogram - * BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram); - * - * \endcode - * - * \tparam HistoCounter [inferred] Histogram counter type - */ - template - __device__ __forceinline__ void InitHistogram(HistoCounter histogram[BINS]) - { - // Initialize histogram bin counts to zeros - int histo_offset = 0; - - #pragma unroll - for(; histo_offset + BLOCK_THREADS <= BINS; histo_offset += BLOCK_THREADS) - { - histogram[histo_offset + linear_tid] = 0; - } - // Finish up with guarded initialization if necessary - if ((BINS % BLOCK_THREADS != 0) && (histo_offset + linear_tid < BINS)) - { - histogram[histo_offset + linear_tid] = 0; - } - } - - - /** - * \brief Constructs a block-wide histogram in shared/global memory. Each thread contributes an array of input elements. - * - * \smemreuse - * - * The code snippet below illustrates a 256-bin histogram of 512 integer samples that - * are partitioned across 128 threads where each thread owns 4 samples. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) 
- * { - * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each - * typedef cub::BlockHistogram BlockHistogram; - * - * // Allocate shared memory for BlockHistogram - * __shared__ typename BlockHistogram::TempStorage temp_storage; - * - * // Allocate shared memory for block-wide histogram bin counts - * __shared__ unsigned int smem_histogram[256]; - * - * // Obtain input samples per thread - * unsigned char thread_samples[4]; - * ... - * - * // Compute the block-wide histogram - * BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram); - * - * \endcode - * - * \tparam HistoCounter [inferred] Histogram counter type - */ - template < - typename HistoCounter> - __device__ __forceinline__ void Histogram( - T (&items)[ITEMS_PER_THREAD], ///< [in] Calling thread's input values to histogram - HistoCounter histogram[BINS]) ///< [out] Reference to shared/global memory histogram - { - // Initialize histogram bin counts to zeros - InitHistogram(histogram); - - // Composite the histogram - InternalBlockHistogram(temp_storage, linear_tid).Composite(items, histogram); - } - - - - /** - * \brief Updates an existing block-wide histogram in shared/global memory. Each thread composites an array of input elements. - * - * \smemreuse - * - * The code snippet below illustrates a the initialization and update of a - * histogram of 512 integer samples that are partitioned across 128 threads - * where each thread owns 4 samples. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize a 256-bin BlockHistogram type for 128 threads having 4 character samples each - * typedef cub::BlockHistogram BlockHistogram; - * - * // Allocate shared memory for BlockHistogram - * __shared__ typename BlockHistogram::TempStorage temp_storage; - * - * // Allocate shared memory for block-wide histogram bin counts - * __shared__ unsigned int smem_histogram[256]; - * - * // Obtain input samples per thread - * unsigned char thread_samples[4]; - * ... - * - * // Initialize the block-wide histogram - * BlockHistogram(temp_storage).InitHistogram(smem_histogram); - * - * // Update the block-wide histogram - * BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram); - * - * \endcode - * - * \tparam HistoCounter [inferred] Histogram counter type - */ - template < - typename HistoCounter> - __device__ __forceinline__ void Composite( - T (&items)[ITEMS_PER_THREAD], ///< [in] Calling thread's input values to histogram - HistoCounter histogram[BINS]) ///< [out] Reference to shared/global memory histogram - { - InternalBlockHistogram(temp_storage, linear_tid).Composite(items, histogram); - } - -}; - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) - diff --git a/lib/kokkos/TPL/cub/block/block_load.cuh b/lib/kokkos/TPL/cub/block/block_load.cuh deleted file mode 100755 index e645bcdce9..0000000000 --- a/lib/kokkos/TPL/cub/block/block_load.cuh +++ /dev/null @@ -1,1122 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. 
- * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * Operations for reading linear tiles of data into the CUDA thread block. - */ - -#pragma once - -#include - -#include "../util_namespace.cuh" -#include "../util_macro.cuh" -#include "../util_type.cuh" -#include "../util_vector.cuh" -#include "../thread/thread_load.cuh" -#include "block_exchange.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - -/** - * \addtogroup IoModule - * @{ - */ - - -/******************************************************************//** - * \name Blocked I/O - *********************************************************************/ -//@{ - - -/** - * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier. - * - * \blocked - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). - */ -template < - PtxLoadModifier MODIFIER, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadBlocked( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load -{ - // Load directly in thread-blocked order - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = ThreadLoad (block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM); - } -} - - -/** - * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier, guarded by range. - * - * \blocked - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). - */ -template < - PtxLoadModifier MODIFIER, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadBlocked( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items) ///< [in] Number of valid items to load -{ - int bounds = valid_items - (linear_tid * ITEMS_PER_THREAD); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - if (ITEM < bounds) - { - items[ITEM] = ThreadLoad (block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM); - } - } -} - - -/** - * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements.. - * - * \blocked - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). - */ -template < - PtxLoadModifier MODIFIER, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadBlocked( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items, ///< [in] Number of valid items to load - T oob_default) ///< [in] Default value to assign out-of-bound items -{ - int bounds = valid_items - (linear_tid * ITEMS_PER_THREAD); - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = (ITEM < bounds) ? - ThreadLoad (block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM) : - oob_default; - } -} - - - -//@} end member group -/******************************************************************//** - * \name Striped I/O - *********************************************************************/ -//@{ - - -/** - * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier. - * - * \striped - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam BLOCK_THREADS The thread block size in threads - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). 
- */ -template < - PtxLoadModifier MODIFIER, - int BLOCK_THREADS, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadStriped( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load -{ - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = ThreadLoad (block_itr + (ITEM * BLOCK_THREADS) + linear_tid); - } -} - - -/** - * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier, guarded by range - * - * \striped - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam BLOCK_THREADS The thread block size in threads - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). - */ -template < - PtxLoadModifier MODIFIER, - int BLOCK_THREADS, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadStriped( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items) ///< [in] Number of valid items to load -{ - int bounds = valid_items - linear_tid; - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - if (ITEM * BLOCK_THREADS < bounds) - { - items[ITEM] = ThreadLoad (block_itr + linear_tid + (ITEM * BLOCK_THREADS)); - } - } -} - - -/** - * \brief Load a linear segment of items into a striped arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements. - * - * \striped - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam BLOCK_THREADS The thread block size in threads - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). - */ -template < - PtxLoadModifier MODIFIER, - int BLOCK_THREADS, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadStriped( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items, ///< [in] Number of valid items to load - T oob_default) ///< [in] Default value to assign out-of-bound items -{ - int bounds = valid_items - linear_tid; - - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = (ITEM * BLOCK_THREADS < bounds) ? 
- ThreadLoad (block_itr + linear_tid + (ITEM * BLOCK_THREADS)) : - oob_default; - } -} - - - -//@} end member group -/******************************************************************//** - * \name Warp-striped I/O - *********************************************************************/ -//@{ - - -/** - * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier. - * - * \warpstriped - * - * \par Usage Considerations - * The number of threads in the thread block must be a multiple of the architecture's warp size. - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). - */ -template < - PtxLoadModifier MODIFIER, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadWarpStriped( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load -{ - int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1); - int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS; - int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD; - - // Load directly in warp-striped order - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = ThreadLoad (block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS)); - } -} - - -/** - * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier, guarded by range - * - * \warpstriped - * - * \par Usage Considerations - * The number of threads in the thread block must be a multiple of the architecture's warp size. - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). 
- */ -template < - PtxLoadModifier MODIFIER, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadWarpStriped( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items) ///< [in] Number of valid items to load -{ - int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1); - int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS; - int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD; - int bounds = valid_items - warp_offset - tid; - - // Load directly in warp-striped order - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - if ((ITEM * PtxArchProps::WARP_THREADS) < bounds) - { - items[ITEM] = ThreadLoad (block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS)); - } - } -} - - -/** - * \brief Load a linear segment of items into a warp-striped arrangement across the thread block using the specified cache modifier, guarded by range, with a fall-back assignment of out-of-bound elements. - * - * \warpstriped - * - * \par Usage Considerations - * The number of threads in the thread block must be a multiple of the architecture's warp size. - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam InputIteratorRA [inferred] The random-access iterator type for input (may be a simple pointer type). - */ -template < - PtxLoadModifier MODIFIER, - typename T, - int ITEMS_PER_THREAD, - typename InputIteratorRA> -__device__ __forceinline__ void LoadWarpStriped( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items, ///< [in] Number of valid items to load - T oob_default) ///< [in] Default value to assign out-of-bound items -{ - int tid = linear_tid & (PtxArchProps::WARP_THREADS - 1); - int wid = linear_tid >> PtxArchProps::LOG_WARP_THREADS; - int warp_offset = wid * PtxArchProps::WARP_THREADS * ITEMS_PER_THREAD; - int bounds = valid_items - warp_offset - tid; - - // Load directly in warp-striped order - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = ((ITEM * PtxArchProps::WARP_THREADS) < bounds) ? - ThreadLoad (block_itr + warp_offset + tid + (ITEM * PtxArchProps::WARP_THREADS)) : - oob_default; - } -} - - - -//@} end member group -/******************************************************************//** - * \name Blocked, vectorized I/O - *********************************************************************/ -//@{ - -/** - * \brief Load a linear segment of items into a blocked arrangement across the thread block using the specified cache modifier. 
- * - * \blocked - * - * The input offset (\p block_ptr + \p block_offset) must be quad-item aligned - * - * The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT: - * - \p ITEMS_PER_THREAD is odd - * - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.) - * - * \tparam MODIFIER cub::PtxLoadModifier cache modifier. - * \tparam T [inferred] The data type to load. - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - */ -template < - PtxLoadModifier MODIFIER, - typename T, - int ITEMS_PER_THREAD> -__device__ __forceinline__ void LoadBlockedVectorized( - int linear_tid, ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - T *block_ptr, ///< [in] Input pointer for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load -{ - enum - { - // Maximum CUDA vector size is 4 elements - MAX_VEC_SIZE = CUB_MIN(4, ITEMS_PER_THREAD), - - // Vector size must be a power of two and an even divisor of the items per thread - VEC_SIZE = ((((MAX_VEC_SIZE - 1) & MAX_VEC_SIZE) == 0) && ((ITEMS_PER_THREAD % MAX_VEC_SIZE) == 0)) ? - MAX_VEC_SIZE : - 1, - - VECTORS_PER_THREAD = ITEMS_PER_THREAD / VEC_SIZE, - }; - - // Vector type - typedef typename VectorHelper ::Type Vector; - - // Alias local data (use raw_items array here which should get optimized away to prevent conservative PTXAS lmem spilling) - T raw_items[ITEMS_PER_THREAD]; - - // Direct-load using vector types - LoadBlocked ( - linear_tid, - reinterpret_cast (block_ptr), - reinterpret_cast (raw_items)); - - // Copy - #pragma unroll - for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++) - { - items[ITEM] = raw_items[ITEM]; - } -} - - -//@} end member group - -/** @} */ // end group IoModule - - - -//----------------------------------------------------------------------------- -// Generic BlockLoad abstraction -//----------------------------------------------------------------------------- - -/** - * \brief cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block. - */ -enum BlockLoadAlgorithm -{ - /** - * \par Overview - * - * A [blocked arrangement](index.html#sec5sec4) of data is read - * directly from memory. The thread block reads items in a parallel "raking" fashion: threadi - * reads the ith segment of consecutive elements. - * - * \par Performance Considerations - * - The utilization of memory transactions (coalescing) decreases as the - * access stride between threads increases (i.e., the number items per thread). - */ - BLOCK_LOAD_DIRECT, - - /** - * \par Overview - * - * A [blocked arrangement](index.html#sec5sec4) of data is read directly - * from memory using CUDA's built-in vectorized loads as a coalescing optimization. - * The thread block reads items in a parallel "raking" fashion: threadi uses vector loads to - * read the ith segment of consecutive elements. - * - * For example, ld.global.v4.s32 instructions will be generated when \p T = \p int and \p ITEMS_PER_THREAD > 4. - * - * \par Performance Considerations - * - The utilization of memory transactions (coalescing) remains high until the the - * access stride between threads (i.e., the number items per thread) exceeds the - * maximum vector load width (typically 4 items or 64B, whichever is lower). 
- * - The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT: - * - \p ITEMS_PER_THREAD is odd - * - The \p InputIteratorRA is not a simple pointer type - * - The block input offset is not quadword-aligned - * - The data type \p T is not a built-in primitive or CUDA vector type (e.g., \p short, \p int2, \p double, \p float2, etc.) - */ - BLOCK_LOAD_VECTORIZE, - - /** - * \par Overview - * - * A [striped arrangement](index.html#sec5sec4) of data is read - * directly from memory and then is locally transposed into a - * [blocked arrangement](index.html#sec5sec4). The thread block - * reads items in a parallel "strip-mining" fashion: - * threadi reads items having stride \p BLOCK_THREADS - * between them. cub::BlockExchange is then used to locally reorder the items - * into a [blocked arrangement](index.html#sec5sec4). - * - * \par Performance Considerations - * - The utilization of memory transactions (coalescing) remains high regardless - * of items loaded per thread. - * - The local reordering incurs slightly longer latencies and throughput than the - * direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives. - */ - BLOCK_LOAD_TRANSPOSE, - - - /** - * \par Overview - * - * A [warp-striped arrangement](index.html#sec5sec4) of data is read - * directly from memory and then is locally transposed into a - * [blocked arrangement](index.html#sec5sec4). Each warp reads its own - * contiguous segment in a parallel "strip-mining" fashion: lanei - * reads items having stride \p WARP_THREADS between them. cub::BlockExchange - * is then used to locally reorder the items into a - * [blocked arrangement](index.html#sec5sec4). - * - * \par Usage Considerations - * - BLOCK_THREADS must be a multiple of WARP_THREADS - * - * \par Performance Considerations - * - The utilization of memory transactions (coalescing) remains high regardless - * of items loaded per thread. - * - The local reordering incurs slightly longer latencies and throughput than the - * direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives. - */ - BLOCK_LOAD_WARP_TRANSPOSE, -}; - - -/** - * \brief The BlockLoad class provides [collective](index.html#sec0) data movement methods for loading a linear segment of items from memory into a [blocked arrangement](index.html#sec5sec4) across a CUDA thread block.  - * \ingroup BlockModule - * - * \par Overview - * The BlockLoad class provides a single data movement abstraction that can be specialized - * to implement different cub::BlockLoadAlgorithm strategies. This facilitates different - * performance policies for different architectures, data types, granularity sizes, etc. - * - * \par - * Optionally, BlockLoad can be specialized by different data movement strategies: - * -# cub::BLOCK_LOAD_DIRECT. A [blocked arrangement](index.html#sec5sec4) - * of data is read directly from memory. [More...](\ref cub::BlockLoadAlgorithm) - * -# cub::BLOCK_LOAD_VECTORIZE. A [blocked arrangement](index.html#sec5sec4) - * of data is read directly from memory using CUDA's built-in vectorized loads as a - * coalescing optimization. [More...](\ref cub::BlockLoadAlgorithm) - * -# cub::BLOCK_LOAD_TRANSPOSE. A [striped arrangement](index.html#sec5sec4) - * of data is read directly from memory and is then locally transposed into a - * [blocked arrangement](index.html#sec5sec4). [More...](\ref cub::BlockLoadAlgorithm) - * -# cub::BLOCK_LOAD_WARP_TRANSPOSE. 
A [warp-striped arrangement](index.html#sec5sec4) - * of data is read directly from memory and is then locally transposed into a - * [blocked arrangement](index.html#sec5sec4). [More...](\ref cub::BlockLoadAlgorithm) - * - * \tparam InputIteratorRA The input iterator type (may be a simple pointer type). - * \tparam BLOCK_THREADS The thread block size in threads. - * \tparam ITEMS_PER_THREAD The number of consecutive items partitioned onto each thread. - * \tparam ALGORITHM [optional] cub::BlockLoadAlgorithm tuning policy. default: cub::BLOCK_LOAD_DIRECT. - * \tparam MODIFIER [optional] cub::PtxLoadModifier cache modifier. default: cub::LOAD_DEFAULT. - * \tparam WARP_TIME_SLICING [optional] For transposition-based cub::BlockLoadAlgorithm parameterizations that utilize shared memory: When \p true, only use enough shared memory for a single warp's worth of data, time-slicing the block-wide exchange over multiple synchronized rounds (default: false) - * - * \par A Simple Example - * \blockcollective{BlockLoad} - * \par - * The code snippet below illustrates the loading of a linear - * segment of 512 integers into a "blocked" arrangement across 128 threads where each - * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE, - * meaning memory references are efficiently coalesced using a warp-striped access - * pattern (after which items are locally reordered among threads). - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, ...) - * { - * // Specialize BlockLoad for 128 threads owning 4 integer items each - * typedef cub::BlockLoad BlockLoad; - * - * // Allocate shared memory for BlockLoad - * __shared__ typename BlockLoad::TempStorage temp_storage; - * - * // Load a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * BlockLoad(temp_storage).Load(d_data, thread_data); - * - * \endcode - * \par - * Suppose the input \p d_data is 0, 1, 2, 3, 4, 5, .... - * The set of \p thread_data across the block of threads in those threads will be - * { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. 
- * - */ -template < - typename InputIteratorRA, - int BLOCK_THREADS, - int ITEMS_PER_THREAD, - BlockLoadAlgorithm ALGORITHM = BLOCK_LOAD_DIRECT, - PtxLoadModifier MODIFIER = LOAD_DEFAULT, - bool WARP_TIME_SLICING = false> -class BlockLoad -{ -private: - - /****************************************************************************** - * Constants and typed definitions - ******************************************************************************/ - - // Data type of input iterator - typedef typename std::iterator_traits ::value_type T; - - - /****************************************************************************** - * Algorithmic variants - ******************************************************************************/ - - /// Load helper - template - struct LoadInternal; - - - /** - * BLOCK_LOAD_DIRECT specialization of load helper - */ - template - struct LoadInternal - { - /// Shared memory storage layout type - typedef NullType TempStorage; - - /// Linear thread-id - int linear_tid; - - /// Constructor - __device__ __forceinline__ LoadInternal( - TempStorage &temp_storage, - int linear_tid) - : - linear_tid(linear_tid) - {} - - /// Load a linear segment of items from memory - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load - { - LoadBlocked (linear_tid, block_itr, items); - } - - /// Load a linear segment of items from memory, guarded by range - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items) ///< [in] Number of valid items to load - { - LoadBlocked (linear_tid, block_itr, items, valid_items); - } - - /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items, ///< [in] Number of valid items to load - T oob_default) ///< [in] Default value to assign out-of-bound items - { - LoadBlocked (linear_tid, block_itr, items, valid_items, oob_default); - } - - }; - - - /** - * BLOCK_LOAD_VECTORIZE specialization of load helper - */ - template - struct LoadInternal - { - /// Shared memory storage layout type - typedef NullType TempStorage; - - /// Linear thread-id - int linear_tid; - - /// Constructor - __device__ __forceinline__ LoadInternal( - TempStorage &temp_storage, - int linear_tid) - : - linear_tid(linear_tid) - {} - - /// Load a linear segment of items from memory, specialized for native pointer types (attempts vectorization) - __device__ __forceinline__ void Load( - T *block_ptr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load - { - LoadBlockedVectorized (linear_tid, block_ptr, items); - } - - /// Load a linear segment of items from memory, specialized for opaque input iterators (skips vectorization) - template < - typename T, - typename _InputIteratorRA> - __device__ __forceinline__ void Load( - _InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load - { - LoadBlocked (linear_tid, block_itr, items); - } - - /// Load a linear segment of items from 
memory, guarded by range (skips vectorization) - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items) ///< [in] Number of valid items to load - { - LoadBlocked (linear_tid, block_itr, items, valid_items); - } - - /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements (skips vectorization) - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items, ///< [in] Number of valid items to load - T oob_default) ///< [in] Default value to assign out-of-bound items - { - LoadBlocked (linear_tid, block_itr, items, valid_items, oob_default); - } - - }; - - - /** - * BLOCK_LOAD_TRANSPOSE specialization of load helper - */ - template - struct LoadInternal - { - // BlockExchange utility type for keys - typedef BlockExchange BlockExchange; - - /// Shared memory storage layout type - typedef typename BlockExchange::TempStorage _TempStorage; - - /// Alias wrapper allowing storage to be unioned - struct TempStorage : Uninitialized<_TempStorage> {}; - - /// Thread reference to shared storage - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - - /// Constructor - __device__ __forceinline__ LoadInternal( - TempStorage &temp_storage, - int linear_tid) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - /// Load a linear segment of items from memory - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load{ - { - LoadStriped (linear_tid, block_itr, items); - BlockExchange(temp_storage, linear_tid).StripedToBlocked(items); - } - - /// Load a linear segment of items from memory, guarded by range - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items) ///< [in] Number of valid items to load - { - LoadStriped (linear_tid, block_itr, items, valid_items); - BlockExchange(temp_storage, linear_tid).StripedToBlocked(items); - } - - /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items, ///< [in] Number of valid items to load - T oob_default) ///< [in] Default value to assign out-of-bound items - { - LoadStriped (linear_tid, block_itr, items, valid_items, oob_default); - BlockExchange(temp_storage, linear_tid).StripedToBlocked(items); - } - - }; - - - /** - * BLOCK_LOAD_WARP_TRANSPOSE specialization of load helper - */ - template - struct LoadInternal - { - enum - { - WARP_THREADS = PtxArchProps::WARP_THREADS - }; - - // Assert BLOCK_THREADS must be a multiple of WARP_THREADS - CUB_STATIC_ASSERT((BLOCK_THREADS % WARP_THREADS == 0), "BLOCK_THREADS must be a multiple of WARP_THREADS"); - - // BlockExchange utility type for keys - typedef BlockExchange BlockExchange; - - /// Shared memory storage layout type - typedef typename BlockExchange::TempStorage _TempStorage; 
- - /// Alias wrapper allowing storage to be unioned - struct TempStorage : Uninitialized<_TempStorage> {}; - - /// Thread reference to shared storage - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - - /// Constructor - __device__ __forceinline__ LoadInternal( - TempStorage &temp_storage, - int linear_tid) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - /// Load a linear segment of items from memory - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load{ - { - LoadWarpStriped (linear_tid, block_itr, items); - BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items); - } - - /// Load a linear segment of items from memory, guarded by range - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items) ///< [in] Number of valid items to load - { - LoadWarpStriped (linear_tid, block_itr, items, valid_items); - BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items); - } - - - /// Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items, ///< [in] Number of valid items to load - T oob_default) ///< [in] Default value to assign out-of-bound items - { - LoadWarpStriped (linear_tid, block_itr, items, valid_items, oob_default); - BlockExchange(temp_storage, linear_tid).WarpStripedToBlocked(items); - } - }; - - - /****************************************************************************** - * Type definitions - ******************************************************************************/ - - /// Internal load implementation to use - typedef LoadInternal InternalLoad; - - - /// Shared memory storage layout type - typedef typename InternalLoad::TempStorage _TempStorage; - - - /****************************************************************************** - * Utility methods - ******************************************************************************/ - - /// Internal storage allocator - __device__ __forceinline__ _TempStorage& PrivateStorage() - { - __shared__ _TempStorage private_storage; - return private_storage; - } - - - /****************************************************************************** - * Thread fields - ******************************************************************************/ - - /// Thread reference to shared storage - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - -public: - - /// \smemstorage{BlockLoad} - struct TempStorage : Uninitialized<_TempStorage> {}; - - - /******************************************************************//** - * \name Collective constructors - *********************************************************************/ - //@{ - - /** - * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. 
- */ - __device__ __forceinline__ BlockLoad() - : - temp_storage(PrivateStorage()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockLoad( - TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage - : - temp_storage(temp_storage.Alias()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier - */ - __device__ __forceinline__ BlockLoad( - int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(PrivateStorage()), - linear_tid(linear_tid) - {} - - - /** - * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. - */ - __device__ __forceinline__ BlockLoad( - TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage - int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - - - //@} end member group - /******************************************************************//** - * \name Data movement - *********************************************************************/ - //@{ - - - /** - * \brief Load a linear segment of items from memory. - * - * \blocked - * - * The code snippet below illustrates the loading of a linear - * segment of 512 integers into a "blocked" arrangement across 128 threads where each - * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE, - * meaning memory references are efficiently coalesced using a warp-striped access - * pattern (after which items are locally reordered among threads). - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, ...) - * { - * // Specialize BlockLoad for 128 threads owning 4 integer items each - * typedef cub::BlockLoad BlockLoad; - * - * // Allocate shared memory for BlockLoad - * __shared__ typename BlockLoad::TempStorage temp_storage; - * - * // Load a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * BlockLoad(temp_storage).Load(d_data, thread_data); - * - * \endcode - * \par - * Suppose the input \p d_data is 0, 1, 2, 3, 4, 5, .... - * The set of \p thread_data across the block of threads in those threads will be - * { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. - * - */ - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD]) ///< [out] Data to load - { - InternalLoad(temp_storage, linear_tid).Load(block_itr, items); - } - - - /** - * \brief Load a linear segment of items from memory, guarded by range. - * - * \blocked - * - * The code snippet below illustrates the guarded loading of a linear - * segment of 512 integers into a "blocked" arrangement across 128 threads where each - * thread owns 4 consecutive items. 
The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE, - * meaning memory references are efficiently coalesced using a warp-striped access - * pattern (after which items are locally reordered among threads). - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, int valid_items, ...) - * { - * // Specialize BlockLoad for 128 threads owning 4 integer items each - * typedef cub::BlockLoad BlockLoad; - * - * // Allocate shared memory for BlockLoad - * __shared__ typename BlockLoad::TempStorage temp_storage; - * - * // Load a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * BlockLoad(temp_storage).Load(d_data, thread_data, valid_items); - * - * \endcode - * \par - * Suppose the input \p d_data is 0, 1, 2, 3, 4, 5, 6... and \p valid_items is \p 5. - * The set of \p thread_data across the block of threads in those threads will be - * { [0,1,2,3], [4,?,?,?], ..., [?,?,?,?] }, with only the first two threads - * being unmasked to load portions of valid data (and other items remaining unassigned). - * - */ - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items) ///< [in] Number of valid items to load - { - InternalLoad(temp_storage, linear_tid).Load(block_itr, items, valid_items); - } - - - /** - * \brief Load a linear segment of items from memory, guarded by range, with a fall-back assignment of out-of-bound elements - * - * \blocked - * - * The code snippet below illustrates the guarded loading of a linear - * segment of 512 integers into a "blocked" arrangement across 128 threads where each - * thread owns 4 consecutive items. The load is specialized for \p BLOCK_LOAD_WARP_TRANSPOSE, - * meaning memory references are efficiently coalesced using a warp-striped access - * pattern (after which items are locally reordered among threads). - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int *d_data, int valid_items, ...) - * { - * // Specialize BlockLoad for 128 threads owning 4 integer items each - * typedef cub::BlockLoad BlockLoad; - * - * // Allocate shared memory for BlockLoad - * __shared__ typename BlockLoad::TempStorage temp_storage; - * - * // Load a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * BlockLoad(temp_storage).Load(d_data, thread_data, valid_items, -1); - * - * \endcode - * \par - * Suppose the input \p d_data is 0, 1, 2, 3, 4, 5, 6..., - * \p valid_items is \p 5, and the out-of-bounds default is \p -1. 
- * The set of \p thread_data across the block of threads in those threads will be - * { [0,1,2,3], [4,-1,-1,-1], ..., [-1,-1,-1,-1] }, with only the first two threads - * being unmasked to load portions of valid data (and other items are assigned \p -1) - * - */ - __device__ __forceinline__ void Load( - InputIteratorRA block_itr, ///< [in] The thread block's base input iterator for loading from - T (&items)[ITEMS_PER_THREAD], ///< [out] Data to load - int valid_items, ///< [in] Number of valid items to load - T oob_default) ///< [in] Default value to assign out-of-bound items - { - InternalLoad(temp_storage, linear_tid).Load(block_itr, items, valid_items, oob_default); - } - - - //@} end member group - -}; - - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) - diff --git a/lib/kokkos/TPL/cub/block/block_radix_rank.cuh b/lib/kokkos/TPL/cub/block/block_radix_rank.cuh deleted file mode 100755 index 149a62c65f..0000000000 --- a/lib/kokkos/TPL/cub/block/block_radix_rank.cuh +++ /dev/null @@ -1,479 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * cub::BlockRadixRank provides operations for ranking unsigned integer types within a CUDA threadblock - */ - -#pragma once - -#include "../util_arch.cuh" -#include "../util_type.cuh" -#include "../thread/thread_reduce.cuh" -#include "../thread/thread_scan.cuh" -#include "../block/block_scan.cuh" -#include "../util_namespace.cuh" - - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - -/** - * \brief BlockRadixRank provides operations for ranking unsigned integer types within a CUDA threadblock. - * \ingroup BlockModule - * - * \par Overview - * Blah... 
- * - * \tparam BLOCK_THREADS The thread block size in threads - * \tparam RADIX_BITS [optional] The number of radix bits per digit place (default: 5 bits) - * \tparam MEMOIZE_OUTER_SCAN [optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). See BlockScanAlgorithm::BLOCK_SCAN_RAKING_MEMOIZE for more details. - * \tparam INNER_SCAN_ALGORITHM [optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS) - * \tparam SMEM_CONFIG [optional] Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte) - * - * \par Usage Considerations - * - Keys must be in a form suitable for radix ranking (i.e., unsigned bits). - * - Assumes a [blocked arrangement](index.html#sec5sec4) of elements across threads - * - \smemreuse{BlockRadixRank::TempStorage} - * - * \par Performance Considerations - * - * \par Algorithm - * These parallel radix ranking variants have O(n) work complexity and are implemented in XXX phases: - * -# blah - * -# blah - * - * \par Examples - * \par - * - Example 1: Simple radix rank of 32-bit integer keys - * \code - * #include - * - * template - * __global__ void ExampleKernel(...) - * { - * - * \endcode - */ -template < - int BLOCK_THREADS, - int RADIX_BITS, - bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? true : false, - BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, - cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte> -class BlockRadixRank -{ -private: - - /****************************************************************************** - * Type definitions and constants - ******************************************************************************/ - - // Integer type for digit counters (to be packed into words of type PackedCounters) - typedef unsigned short DigitCounter; - - // Integer type for packing DigitCounters into columns of shared memory banks - typedef typename If<(SMEM_CONFIG == cudaSharedMemBankSizeEightByte), - unsigned long long, - unsigned int>::Type PackedCounter; - - enum - { - RADIX_DIGITS = 1 << RADIX_BITS, - - LOG_WARP_THREADS = PtxArchProps::LOG_WARP_THREADS, - WARP_THREADS = 1 << LOG_WARP_THREADS, - WARPS = (BLOCK_THREADS + WARP_THREADS - 1) / WARP_THREADS, - - BYTES_PER_COUNTER = sizeof(DigitCounter), - LOG_BYTES_PER_COUNTER = Log2 ::VALUE, - - PACKING_RATIO = sizeof(PackedCounter) / sizeof(DigitCounter), - LOG_PACKING_RATIO = Log2 ::VALUE, - - LOG_COUNTER_LANES = CUB_MAX((RADIX_BITS - LOG_PACKING_RATIO), 0), // Always at least one lane - COUNTER_LANES = 1 << LOG_COUNTER_LANES, - - // The number of packed counters per thread (plus one for padding) - RAKING_SEGMENT = COUNTER_LANES + 1, - - LOG_SMEM_BANKS = PtxArchProps::LOG_SMEM_BANKS, - SMEM_BANKS = 1 << LOG_SMEM_BANKS, - }; - - - /// BlockScan type - typedef BlockScan BlockScan; - - - /// Shared memory storage layout type for BlockRadixRank - struct _TempStorage - { - // Storage for scanning local ranks - typename BlockScan::TempStorage block_scan; - - union - { - DigitCounter digit_counters[COUNTER_LANES + 1][BLOCK_THREADS][PACKING_RATIO]; - PackedCounter raking_grid[BLOCK_THREADS][RAKING_SEGMENT]; - }; - }; - - - /****************************************************************************** - * Thread fields - ******************************************************************************/ - - /// Shared storage reference - _TempStorage &temp_storage; - - /// Linear thread-id - int 
linear_tid; - - /// Copy of raking segment, promoted to registers - PackedCounter cached_segment[RAKING_SEGMENT]; - - - /****************************************************************************** - * Templated iteration - ******************************************************************************/ - - // General template iteration - template - struct Iterate - { - /** - * Decode keys. Decodes the radix digit from the current digit place - * and increments the thread's corresponding counter in shared - * memory for that digit. - * - * Saves both (1) the prior value of that counter (the key's - * thread-local exclusive prefix sum for that digit), and (2) the shared - * memory offset of the counter (for later use). - */ - template - static __device__ __forceinline__ void DecodeKeys( - BlockRadixRank &cta, // BlockRadixRank instance - UnsignedBits (&keys)[KEYS_PER_THREAD], // Key to decode - DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], // Prefix counter value (out parameter) - DigitCounter* (&digit_counters)[KEYS_PER_THREAD], // Counter smem offset (out parameter) - int current_bit) // The least-significant bit position of the current digit to extract - { - // Add in sub-counter offset - UnsignedBits sub_counter = BFE(keys[COUNT], current_bit + LOG_COUNTER_LANES, LOG_PACKING_RATIO); - - // Add in row offset - UnsignedBits row_offset = BFE(keys[COUNT], current_bit, LOG_COUNTER_LANES); - - // Pointer to smem digit counter - digit_counters[COUNT] = &cta.temp_storage.digit_counters[row_offset][cta.linear_tid][sub_counter]; - - // Load thread-exclusive prefix - thread_prefixes[COUNT] = *digit_counters[COUNT]; - - // Store inclusive prefix - *digit_counters[COUNT] = thread_prefixes[COUNT] + 1; - - // Iterate next key - Iterate ::DecodeKeys(cta, keys, thread_prefixes, digit_counters, current_bit); - } - - - // Termination - template - static __device__ __forceinline__ void UpdateRanks( - int (&ranks)[KEYS_PER_THREAD], // Local ranks (out parameter) - DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], // Prefix counter value - DigitCounter* (&digit_counters)[KEYS_PER_THREAD]) // Counter smem offset - { - // Add in threadblock exclusive prefix - ranks[COUNT] = thread_prefixes[COUNT] + *digit_counters[COUNT]; - - // Iterate next key - Iterate ::UpdateRanks(ranks, thread_prefixes, digit_counters); - } - }; - - - // Termination - template - struct Iterate - { - // DecodeKeys - template - static __device__ __forceinline__ void DecodeKeys( - BlockRadixRank &cta, - UnsignedBits (&keys)[KEYS_PER_THREAD], - DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], - DigitCounter* (&digit_counters)[KEYS_PER_THREAD], - int current_bit) {} - - - // UpdateRanks - template - static __device__ __forceinline__ void UpdateRanks( - int (&ranks)[KEYS_PER_THREAD], - DigitCounter (&thread_prefixes)[KEYS_PER_THREAD], - DigitCounter *(&digit_counters)[KEYS_PER_THREAD]) {} - }; - - - /****************************************************************************** - * Utility methods - ******************************************************************************/ - - /** - * Internal storage allocator - */ - __device__ __forceinline__ _TempStorage& PrivateStorage() - { - __shared__ _TempStorage private_storage; - return private_storage; - } - - - /** - * Performs upsweep raking reduction, returning the aggregate - */ - __device__ __forceinline__ PackedCounter Upsweep() - { - PackedCounter *smem_raking_ptr = temp_storage.raking_grid[linear_tid]; - PackedCounter *raking_ptr; - - if (MEMOIZE_OUTER_SCAN) - { - // Copy data 
into registers - #pragma unroll - for (int i = 0; i < RAKING_SEGMENT; i++) - { - cached_segment[i] = smem_raking_ptr[i]; - } - raking_ptr = cached_segment; - } - else - { - raking_ptr = smem_raking_ptr; - } - - return ThreadReduce (raking_ptr, Sum()); - } - - - /// Performs exclusive downsweep raking scan - __device__ __forceinline__ void ExclusiveDownsweep( - PackedCounter raking_partial) - { - PackedCounter *smem_raking_ptr = temp_storage.raking_grid[linear_tid]; - - PackedCounter *raking_ptr = (MEMOIZE_OUTER_SCAN) ? - cached_segment : - smem_raking_ptr; - - // Exclusive raking downsweep scan - ThreadScanExclusive (raking_ptr, raking_ptr, Sum(), raking_partial); - - if (MEMOIZE_OUTER_SCAN) - { - // Copy data back to smem - #pragma unroll - for (int i = 0; i < RAKING_SEGMENT; i++) - { - smem_raking_ptr[i] = cached_segment[i]; - } - } - } - - - /** - * Reset shared memory digit counters - */ - __device__ __forceinline__ void ResetCounters() - { - // Reset shared memory digit counters - #pragma unroll - for (int LANE = 0; LANE < COUNTER_LANES + 1; LANE++) - { - *((PackedCounter*) temp_storage.digit_counters[LANE][linear_tid]) = 0; - } - } - - - /** - * Scan shared memory digit counters. - */ - __device__ __forceinline__ void ScanCounters() - { - // Upsweep scan - PackedCounter raking_partial = Upsweep(); - - // Compute inclusive sum - PackedCounter inclusive_partial; - PackedCounter packed_aggregate; - BlockScan(temp_storage.block_scan, linear_tid).InclusiveSum(raking_partial, inclusive_partial, packed_aggregate); - - // Propagate totals in packed fields - #pragma unroll - for (int PACKED = 1; PACKED < PACKING_RATIO; PACKED++) - { - inclusive_partial += packed_aggregate << (sizeof(DigitCounter) * 8 * PACKED); - } - - // Downsweep scan with exclusive partial - PackedCounter exclusive_partial = inclusive_partial - raking_partial; - ExclusiveDownsweep(exclusive_partial); - } - -public: - - /// \smemstorage{BlockScan} - struct TempStorage : Uninitialized<_TempStorage> {}; - - - /******************************************************************//** - * \name Collective constructors - *********************************************************************/ - //@{ - - /** - * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockRadixRank() - : - temp_storage(PrivateStorage()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockRadixRank( - TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage - : - temp_storage(temp_storage.Alias()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier - */ - __device__ __forceinline__ BlockRadixRank( - int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(PrivateStorage()), - linear_tid(linear_tid) - {} - - - /** - * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. 
- */ - __device__ __forceinline__ BlockRadixRank( - TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage - int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - - - //@} end member group - /******************************************************************//** - * \name Raking - *********************************************************************/ - //@{ - - /** - * \brief Rank keys. - */ - template < - typename UnsignedBits, - int KEYS_PER_THREAD> - __device__ __forceinline__ void RankKeys( - UnsignedBits (&keys)[KEYS_PER_THREAD], ///< [in] Keys for this tile - int (&ranks)[KEYS_PER_THREAD], ///< [out] For each key, the local rank within the tile - int current_bit) ///< [in] The least-significant bit position of the current digit to extract - { - DigitCounter thread_prefixes[KEYS_PER_THREAD]; // For each key, the count of previous keys in this tile having the same digit - DigitCounter* digit_counters[KEYS_PER_THREAD]; // For each key, the byte-offset of its corresponding digit counter in smem - - // Reset shared memory digit counters - ResetCounters(); - - // Decode keys and update digit counters - Iterate<0, KEYS_PER_THREAD>::DecodeKeys(*this, keys, thread_prefixes, digit_counters, current_bit); - - __syncthreads(); - - // Scan shared memory counters - ScanCounters(); - - __syncthreads(); - - // Extract the local ranks of each key - Iterate<0, KEYS_PER_THREAD>::UpdateRanks(ranks, thread_prefixes, digit_counters); - } - - - /** - * \brief Rank keys. For the lower \p RADIX_DIGITS threads, digit counts for each digit are provided for the corresponding thread. - */ - template < - typename UnsignedBits, - int KEYS_PER_THREAD> - __device__ __forceinline__ void RankKeys( - UnsignedBits (&keys)[KEYS_PER_THREAD], ///< [in] Keys for this tile - int (&ranks)[KEYS_PER_THREAD], ///< [out] For each key, the local rank within the tile (out parameter) - int current_bit, ///< [in] The least-significant bit position of the current digit to extract - int &inclusive_digit_prefix) ///< [out] The inclusive prefix sum for the digit threadIdx.x - { - // Rank keys - RankKeys(keys, ranks, current_bit); - - // Get the inclusive and exclusive digit totals corresponding to the calling thread. - if ((BLOCK_THREADS == RADIX_DIGITS) || (linear_tid < RADIX_DIGITS)) - { - // Obtain ex/inclusive digit counts. (Unfortunately these all reside in the - // first counter column, resulting in unavoidable bank conflicts.) - int counter_lane = (linear_tid & (COUNTER_LANES - 1)); - int sub_counter = linear_tid >> (LOG_COUNTER_LANES); - inclusive_digit_prefix = temp_storage.digit_counters[counter_lane + 1][0][sub_counter]; - } - } -}; - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) - - diff --git a/lib/kokkos/TPL/cub/block/block_radix_sort.cuh b/lib/kokkos/TPL/cub/block/block_radix_sort.cuh deleted file mode 100755 index 873d401266..0000000000 --- a/lib/kokkos/TPL/cub/block/block_radix_sort.cuh +++ /dev/null @@ -1,608 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved.
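The usage example left unfinished in the BlockRadixRank documentation above can be filled in from the signatures that are present in the class. The following is a minimal sketch, not part of the library or of this diff; the kernel name, the 128-thread block, the 5-bit digit, and the 4 keys per thread are all illustrative choices.

#include <cub/cub.cuh>

// Hypothetical kernel: compute local ranks for 4 unsigned keys per thread
// over the 5-bit digit starting at bit 0.
__global__ void RankExampleKernel(const unsigned int *d_keys, int *d_ranks)
{
    enum { BLOCK_THREADS = 128, RADIX_BITS = 5, KEYS_PER_THREAD = 4,
           TILE_ITEMS = BLOCK_THREADS * KEYS_PER_THREAD };

    typedef cub::BlockRadixRank<BLOCK_THREADS, RADIX_BITS> BlockRadixRankT;
    __shared__ typename BlockRadixRankT::TempStorage temp_storage;

    // Blocked load: each thread owns KEYS_PER_THREAD consecutive keys.
    int block_offset = blockIdx.x * TILE_ITEMS;
    unsigned int keys[KEYS_PER_THREAD];
    for (int i = 0; i < KEYS_PER_THREAD; ++i)
        keys[i] = d_keys[block_offset + threadIdx.x * KEYS_PER_THREAD + i];

    // Each rank is the key's destination slot within the tile for this digit.
    int ranks[KEYS_PER_THREAD];
    BlockRadixRankT(temp_storage).RankKeys(keys, ranks, 0);

    for (int i = 0; i < KEYS_PER_THREAD; ++i)
        d_ranks[block_offset + threadIdx.x * KEYS_PER_THREAD + i] = ranks[i];
}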
- * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * The cub::BlockRadixSort class provides [collective](index.html#sec0) methods for radix sorting of items partitioned across a CUDA thread block. - */ - - -#pragma once - -#include "../util_namespace.cuh" -#include "../util_arch.cuh" -#include "../util_type.cuh" -#include "block_exchange.cuh" -#include "block_radix_rank.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - -/** - * \brief The cub::BlockRadixSort class provides [collective](index.html#sec0) methods for sorting items partitioned across a CUDA thread block using a radix sorting method.  - * \ingroup BlockModule - * - * \par Overview - * The [radix sorting method](http://en.wikipedia.org/wiki/Radix_sort) arranges - * items into ascending order. It relies upon a positional representation for - * keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, - * characters, etc.) specified from least-significant to most-significant. For a - * given input sequence of keys and a set of rules specifying a total ordering - * of the symbolic alphabet, the radix sorting method produces a lexicographic - * ordering of those keys. - * - * \par - * BlockRadixSort can sort all of the built-in C++ numeric primitive types, e.g.: - * unsigned char, \p int, \p double, etc. Within each key, the implementation treats fixed-length - * bit-sequences of \p RADIX_BITS as radix digit places. Although the direct radix sorting - * method can only be applied to unsigned integral types, BlockRadixSort - * is able to sort signed and floating-point types via simple bit-wise transformations - * that ensure lexicographic key ordering. 
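The "simple bit-wise transformations" mentioned above are easy to state concretely. The helpers below illustrate the standard order-preserving mappings for signed integers and IEEE-754 floats; they are a sketch of the idea, not the library's NumericTraits twiddling code, and NaN handling is ignored.

#include <cstring>

// Map a signed int to unsigned bits whose unsigned ordering matches the
// signed ordering: flipping the sign bit shifts INT_MIN..INT_MAX onto 0..UINT_MAX.
__host__ __device__ inline unsigned int TwiddleInt(int key)
{
    return static_cast<unsigned int>(key) ^ 0x80000000u;
}

// Map a float to unsigned bits with the same property: negative values have
// all bits inverted (their raw bit patterns descend as the value grows), while
// non-negative values just get the sign bit set so they sort above negatives.
__host__ __device__ inline unsigned int TwiddleFloat(float key)
{
    unsigned int bits;
    memcpy(&bits, &key, sizeof(bits));
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}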
- * - * \tparam Key Key type - * \tparam BLOCK_THREADS The thread block size in threads - * \tparam ITEMS_PER_THREAD The number of items per thread - * \tparam Value [optional] Value type (default: cub::NullType) - * \tparam RADIX_BITS [optional] The number of radix bits per digit place (default: 4 bits) - * \tparam MEMOIZE_OUTER_SCAN [optional] Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure (default: true for architectures SM35 and newer, false otherwise). - * \tparam INNER_SCAN_ALGORITHM [optional] The cub::BlockScanAlgorithm algorithm to use (default: cub::BLOCK_SCAN_WARP_SCANS) - * \tparam SMEM_CONFIG [optional] Shared memory bank mode (default: \p cudaSharedMemBankSizeFourByte) - * - * \par A Simple Example - * \blockcollective{BlockRadixSort} - * \par - * The code snippet below illustrates a sort of 512 integer keys that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockRadixSort for 128 threads owning 4 integer items each - * typedef cub::BlockRadixSort BlockRadixSort; - * - * // Allocate shared memory for BlockRadixSort - * __shared__ typename BlockRadixSort::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_keys[4]; - * ... - * - * // Collectively sort the keys - * BlockRadixSort(temp_storage).Sort(thread_keys); - * - * ... - * \endcode - * \par - * Suppose the set of input \p thread_keys across the block of threads is - * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The - * corresponding output \p thread_keys in those threads will be - * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. - * - */ -template < - typename Key, - int BLOCK_THREADS, - int ITEMS_PER_THREAD, - typename Value = NullType, - int RADIX_BITS = 4, - bool MEMOIZE_OUTER_SCAN = (CUB_PTX_ARCH >= 350) ? 
true : false, - BlockScanAlgorithm INNER_SCAN_ALGORITHM = BLOCK_SCAN_WARP_SCANS, - cudaSharedMemConfig SMEM_CONFIG = cudaSharedMemBankSizeFourByte> -class BlockRadixSort -{ -private: - - /****************************************************************************** - * Constants and type definitions - ******************************************************************************/ - - // Key traits and unsigned bits type - typedef NumericTraits KeyTraits; - typedef typename KeyTraits::UnsignedBits UnsignedBits; - - /// BlockRadixRank utility type - typedef BlockRadixRank BlockRadixRank; - - /// BlockExchange utility type for keys - typedef BlockExchange BlockExchangeKeys; - - /// BlockExchange utility type for values - typedef BlockExchange BlockExchangeValues; - - /// Shared memory storage layout type - struct _TempStorage - { - union - { - typename BlockRadixRank::TempStorage ranking_storage; - typename BlockExchangeKeys::TempStorage exchange_keys; - typename BlockExchangeValues::TempStorage exchange_values; - }; - }; - - /****************************************************************************** - * Utility methods - ******************************************************************************/ - - /// Internal storage allocator - __device__ __forceinline__ _TempStorage& PrivateStorage() - { - __shared__ _TempStorage private_storage; - return private_storage; - } - - - /****************************************************************************** - * Thread fields - ******************************************************************************/ - - /// Shared storage reference - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - - -public: - - /// \smemstorage{BlockScan} - struct TempStorage : Uninitialized<_TempStorage> {}; - - - /******************************************************************//** - * \name Collective constructors - *********************************************************************/ - //@{ - - /** - * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockRadixSort() - : - temp_storage(PrivateStorage()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockRadixSort( - TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage - : - temp_storage(temp_storage.Alias()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier - */ - __device__ __forceinline__ BlockRadixSort( - int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(PrivateStorage()), - linear_tid(linear_tid) - {} - - - /** - * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. 
- */ - __device__ __forceinline__ BlockRadixSort( - TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage - int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - - - //@} end member group - /******************************************************************//** - * \name Sorting (blocked arrangements) - *********************************************************************/ - //@{ - - /** - * \brief Performs a block-wide radix sort over a [blocked arrangement](index.html#sec5sec4) of keys. - * - * \smemreuse - * - * The code snippet below illustrates a sort of 512 integer keys that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive keys. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockRadixSort for 128 threads owning 4 integer keys each - * typedef cub::BlockRadixSort BlockRadixSort; - * - * // Allocate shared memory for BlockRadixSort - * __shared__ typename BlockRadixSort::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_keys[4]; - * ... - * - * // Collectively sort the keys - * BlockRadixSort(temp_storage).Sort(thread_keys); - * - * \endcode - * \par - * Suppose the set of input \p thread_keys across the block of threads is - * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. - * The corresponding output \p thread_keys in those threads will be - * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. - */ - __device__ __forceinline__ void Sort( - Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort - int begin_bit = 0, ///< [in] [optional] The beginning (least-significant) bit index needed for key comparison - int end_bit = sizeof(Key) * 8) ///< [in] [optional] The past-the-end (most-significant) bit index needed for key comparison - { - UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] = - reinterpret_cast (keys); - - // Twiddle bits if necessary - #pragma unroll - for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) - { - unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]); - } - - // Radix sorting passes - while (true) - { - // Rank the blocked keys - int ranks[ITEMS_PER_THREAD]; - BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit); - begin_bit += RADIX_BITS; - - __syncthreads(); - - // Exchange keys through shared memory in blocked arrangement - BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks); - - // Quit if done - if (begin_bit >= end_bit) break; - - __syncthreads(); - } - - // Untwiddle bits if necessary - #pragma unroll - for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) - { - unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]); - } - } - - - /** - * \brief Performs a block-wide radix sort across a [blocked arrangement](index.html#sec5sec4) of keys and values. - * - * BlockRadixSort can only accommodate one associated tile of values. To "truck along" - * more than one tile of values, simply perform a key-value sort of the keys paired - * with a temporary value array that enumerates the key indices. 
The reordered indices - * can then be used as a gather-vector for exchanging other associated tile data through - * shared memory. - * - * \smemreuse - * - * The code snippet below illustrates a sort of 512 integer keys and values that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive pairs. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockRadixSort for 128 threads owning 4 integer keys and values each - * typedef cub::BlockRadixSort BlockRadixSort; - * - * // Allocate shared memory for BlockRadixSort - * __shared__ typename BlockRadixSort::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_keys[4]; - * int thread_values[4]; - * ... - * - * // Collectively sort the keys and values among block threads - * BlockRadixSort(temp_storage).Sort(thread_keys, thread_values); - * - * \endcode - * \par - * Suppose the set of input \p thread_keys across the block of threads is - * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The - * corresponding output \p thread_keys in those threads will be - * { [0,1,2,3], [4,5,6,7], [8,9,10,11], ..., [508,509,510,511] }. - * - */ - __device__ __forceinline__ void Sort( - Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort - Value (&values)[ITEMS_PER_THREAD], ///< [in-out] Values to sort - int begin_bit = 0, ///< [in] [optional] The beginning (least-significant) bit index needed for key comparison - int end_bit = sizeof(Key) * 8) ///< [in] [optional] The past-the-end (most-significant) bit index needed for key comparison - { - UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] = - reinterpret_cast (keys); - - // Twiddle bits if necessary - #pragma unroll - for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) - { - unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]); - } - - // Radix sorting passes - while (true) - { - // Rank the blocked keys - int ranks[ITEMS_PER_THREAD]; - BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit); - begin_bit += RADIX_BITS; - - __syncthreads(); - - // Exchange keys through shared memory in blocked arrangement - BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks); - - __syncthreads(); - - // Exchange values through shared memory in blocked arrangement - BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToBlocked(values, ranks); - - // Quit if done - if (begin_bit >= end_bit) break; - - __syncthreads(); - } - - // Untwiddle bits if necessary - #pragma unroll - for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) - { - unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]); - } - } - - - //@} end member group - /******************************************************************//** - * \name Sorting (blocked arrangement -> striped arrangement) - *********************************************************************/ - //@{ - - - /** - * \brief Performs a radix sort across a [blocked arrangement](index.html#sec5sec4) of keys, leaving them in a [striped arrangement](index.html#sec5sec4). - * - * \smemreuse - * - * The code snippet below illustrates a sort of 512 integer keys that - * are initially partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive keys. The final partitioning is striped. 
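The "truck along" advice above can be read as the following pattern: sort the keys paired with an enumerated index array, then use the reordered indices as a gather vector for any additional per-item data. This sketch assumes a hypothetical kernel, a 128-thread block, 4 items per thread, and a single float payload array.

#include <cub/cub.cuh>

__global__ void TruckAlongExampleKernel(const int *d_keys, const float *d_payload,
                                        int *d_sorted_keys, float *d_sorted_payload)
{
    enum { BLOCK_THREADS = 128, ITEMS_PER_THREAD = 4,
           TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD };

    typedef cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD, int> BlockRadixSortT;
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;
    __shared__ float s_payload[TILE_ITEMS];

    int block_offset = blockIdx.x * TILE_ITEMS;

    // Blocked load of keys and payload; the "values" are enumerated tile positions.
    int   thread_keys[ITEMS_PER_THREAD];
    int   thread_idx[ITEMS_PER_THREAD];
    float thread_payload[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        int tile_pos      = threadIdx.x * ITEMS_PER_THREAD + i;
        thread_keys[i]    = d_keys[block_offset + tile_pos];
        thread_payload[i] = d_payload[block_offset + tile_pos];
        thread_idx[i]     = tile_pos;
    }

    // Key-value sort: the original indices ride along with the keys.
    BlockRadixSortT(temp_storage).Sort(thread_keys, thread_idx);

    // Stage the original payload in tile order, then gather it into sorted order.
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        s_payload[threadIdx.x * ITEMS_PER_THREAD + i] = thread_payload[i];
    __syncthreads();

    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        int tile_pos = threadIdx.x * ITEMS_PER_THREAD + i;
        d_sorted_keys[block_offset + tile_pos]    = thread_keys[i];
        d_sorted_payload[block_offset + tile_pos] = s_payload[thread_idx[i]];
    }
}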
- * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockRadixSort for 128 threads owning 4 integer keys each - * typedef cub::BlockRadixSort BlockRadixSort; - * - * // Allocate shared memory for BlockRadixSort - * __shared__ typename BlockRadixSort::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_keys[4]; - * ... - * - * // Collectively sort the keys - * BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys); - * - * \endcode - * \par - * Suppose the set of input \p thread_keys across the block of threads is - * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The - * corresponding output \p thread_keys in those threads will be - * { [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }. - * - */ - __device__ __forceinline__ void SortBlockedToStriped( - Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort - int begin_bit = 0, ///< [in] [optional] The beginning (least-significant) bit index needed for key comparison - int end_bit = sizeof(Key) * 8) ///< [in] [optional] The past-the-end (most-significant) bit index needed for key comparison - { - UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] = - reinterpret_cast (keys); - - // Twiddle bits if necessary - #pragma unroll - for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) - { - unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]); - } - - // Radix sorting passes - while (true) - { - // Rank the blocked keys - int ranks[ITEMS_PER_THREAD]; - BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit); - begin_bit += RADIX_BITS; - - __syncthreads(); - - // Check if this is the last pass - if (begin_bit >= end_bit) - { - // Last pass exchanges keys through shared memory in striped arrangement - BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToStriped(keys, ranks); - - // Quit - break; - } - - // Exchange keys through shared memory in blocked arrangement - BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks); - - __syncthreads(); - } - - // Untwiddle bits if necessary - #pragma unroll - for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) - { - unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]); - } - } - - - /** - * \brief Performs a radix sort across a [blocked arrangement](index.html#sec5sec4) of keys and values, leaving them in a [striped arrangement](index.html#sec5sec4). - * - * BlockRadixSort can only accommodate one associated tile of values. To "truck along" - * more than one tile of values, simply perform a key-value sort of the keys paired - * with a temporary value array that enumerates the key indices. The reordered indices - * can then be used as a gather-vector for exchanging other associated tile data through - * shared memory. - * - * \smemreuse - * - * The code snippet below illustrates a sort of 512 integer keys and values that - * are initially partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive pairs. The final partitioning is striped. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) 
- * { - * // Specialize BlockRadixSort for 128 threads owning 4 integer keys and values each - * typedef cub::BlockRadixSort BlockRadixSort; - * - * // Allocate shared memory for BlockRadixSort - * __shared__ typename BlockRadixSort::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_keys[4]; - * int thread_values[4]; - * ... - * - * // Collectively sort the keys and values among block threads - * BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values); - * - * \endcode - * \par - * Suppose the set of input \p thread_keys across the block of threads is - * { [0,511,1,510], [2,509,3,508], [4,507,5,506], ..., [254,257,255,256] }. The - * corresponding output \p thread_keys in those threads will be - * { [0,128,256,384], [1,129,257,385], [2,130,258,386], ..., [127,255,383,511] }. - * - */ - __device__ __forceinline__ void SortBlockedToStriped( - Key (&keys)[ITEMS_PER_THREAD], ///< [in-out] Keys to sort - Value (&values)[ITEMS_PER_THREAD], ///< [in-out] Values to sort - int begin_bit = 0, ///< [in] [optional] The beginning (least-significant) bit index needed for key comparison - int end_bit = sizeof(Key) * 8) ///< [in] [optional] The past-the-end (most-significant) bit index needed for key comparison - { - UnsignedBits (&unsigned_keys)[ITEMS_PER_THREAD] = - reinterpret_cast (keys); - - // Twiddle bits if necessary - #pragma unroll - for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) - { - unsigned_keys[KEY] = KeyTraits::TwiddleIn(unsigned_keys[KEY]); - } - - // Radix sorting passes - while (true) - { - // Rank the blocked keys - int ranks[ITEMS_PER_THREAD]; - BlockRadixRank(temp_storage.ranking_storage, linear_tid).RankKeys(unsigned_keys, ranks, begin_bit); - begin_bit += RADIX_BITS; - - __syncthreads(); - - // Check if this is the last pass - if (begin_bit >= end_bit) - { - // Last pass exchanges keys through shared memory in striped arrangement - BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToStriped(keys, ranks); - - __syncthreads(); - - // Last pass exchanges through shared memory in striped arrangement - BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToStriped(values, ranks); - - // Quit - break; - } - - // Exchange keys through shared memory in blocked arrangement - BlockExchangeKeys(temp_storage.exchange_keys, linear_tid).ScatterToBlocked(keys, ranks); - - __syncthreads(); - - // Exchange values through shared memory in blocked arrangement - BlockExchangeValues(temp_storage.exchange_values, linear_tid).ScatterToBlocked(values, ranks); - - __syncthreads(); - } - - // Untwiddle bits if necessary - #pragma unroll - for (int KEY = 0; KEY < ITEMS_PER_THREAD; KEY++) - { - unsigned_keys[KEY] = KeyTraits::TwiddleOut(unsigned_keys[KEY]); - } - } - - - //@} end member group - -}; - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) - diff --git a/lib/kokkos/TPL/cub/block/block_raking_layout.cuh b/lib/kokkos/TPL/cub/block/block_raking_layout.cuh deleted file mode 100755 index 878a786cd9..0000000000 --- a/lib/kokkos/TPL/cub/block/block_raking_layout.cuh +++ /dev/null @@ -1,145 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. 
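Because the snippets in this copy lost their template arguments, here is a self-contained sketch of the blocked-to-striped variant; the striped output arrangement is what lets the final store be coalesced. The block size, items per thread, and kernel name are illustrative, not taken from the file.

#include <cub/cub.cuh>

__global__ void SortToStripedExampleKernel(const int *d_keys_in, int *d_keys_out)
{
    enum { BLOCK_THREADS = 128, ITEMS_PER_THREAD = 4,
           TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD };

    typedef cub::BlockRadixSort<int, BLOCK_THREADS, ITEMS_PER_THREAD> BlockRadixSortT;
    __shared__ typename BlockRadixSortT::TempStorage temp_storage;

    int block_offset = blockIdx.x * TILE_ITEMS;

    // Blocked load: each thread owns ITEMS_PER_THREAD consecutive keys.
    int thread_keys[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        thread_keys[i] = d_keys_in[block_offset + threadIdx.x * ITEMS_PER_THREAD + i];

    // Sort, leaving the results striped across threads.
    BlockRadixSortT(temp_storage).SortBlockedToStriped(thread_keys);

    // Striped store: consecutive threads write consecutive addresses (coalesced).
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
        d_keys_out[block_offset + i * BLOCK_THREADS + threadIdx.x] = thread_keys[i];
}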
- * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * cub::BlockRakingLayout provides a conflict-free shared memory layout abstraction for warp-raking across thread block data. - */ - - -#pragma once - -#include "../util_macro.cuh" -#include "../util_arch.cuh" -#include "../util_namespace.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - -/** - * \brief BlockRakingLayout provides a conflict-free shared memory layout abstraction for raking across thread block data.  - * \ingroup BlockModule - * - * \par Overview - * This type facilitates a shared memory usage pattern where a block of CUDA - * threads places elements into shared memory and then reduces the active - * parallelism to one "raking" warp of threads for serially aggregating consecutive - * sequences of shared items. Padding is inserted to eliminate bank conflicts - * (for most data types). - * - * \tparam T The data type to be exchanged. - * \tparam BLOCK_THREADS The thread block size in threads. 
- * \tparam BLOCK_STRIPS When strip-mining, the number of threadblock-strips per tile - */ -template < - typename T, - int BLOCK_THREADS, - int BLOCK_STRIPS = 1> -struct BlockRakingLayout -{ - //--------------------------------------------------------------------- - // Constants and typedefs - //--------------------------------------------------------------------- - - enum - { - /// The total number of elements that need to be cooperatively reduced - SHARED_ELEMENTS = - BLOCK_THREADS * BLOCK_STRIPS, - - /// Maximum number of warp-synchronous raking threads - MAX_RAKING_THREADS = - CUB_MIN(BLOCK_THREADS, PtxArchProps::WARP_THREADS), - - /// Number of raking elements per warp-synchronous raking thread (rounded up) - SEGMENT_LENGTH = - (SHARED_ELEMENTS + MAX_RAKING_THREADS - 1) / MAX_RAKING_THREADS, - - /// Never use a raking thread that will have no valid data (e.g., when BLOCK_THREADS is 62 and SEGMENT_LENGTH is 2, we should only use 31 raking threads) - RAKING_THREADS = - (SHARED_ELEMENTS + SEGMENT_LENGTH - 1) / SEGMENT_LENGTH, - - /// Pad each segment length with one element if it evenly divides the number of banks - SEGMENT_PADDING = - (PtxArchProps::SMEM_BANKS % SEGMENT_LENGTH == 0) ? 1 : 0, - - /// Total number of elements in the raking grid - GRID_ELEMENTS = - RAKING_THREADS * (SEGMENT_LENGTH + SEGMENT_PADDING), - - /// Whether or not we need bounds checking during raking (the number of reduction elements is not a multiple of the warp size) - UNGUARDED = - (SHARED_ELEMENTS % RAKING_THREADS == 0), - }; - - - /** - * \brief Shared memory storage type - */ - typedef T TempStorage[BlockRakingLayout::GRID_ELEMENTS]; - - - /** - * \brief Returns the location for the calling thread to place data into the grid - */ - static __device__ __forceinline__ T* PlacementPtr( - TempStorage &temp_storage, - int linear_tid, - int block_strip = 0) - { - // Offset for partial - unsigned int offset = (block_strip * BLOCK_THREADS) + linear_tid; - - // Add in one padding element for every segment - if (SEGMENT_PADDING > 0) - { - offset += offset / SEGMENT_LENGTH; - } - - // Incorporating a block of padding partials every shared memory segment - return temp_storage + offset; - } - - - /** - * \brief Returns the location for the calling thread to begin sequential raking - */ - static __device__ __forceinline__ T* RakingPtr( - TempStorage &temp_storage, - int linear_tid) - { - return temp_storage + (linear_tid * (SEGMENT_LENGTH + SEGMENT_PADDING)); - } -}; - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) - diff --git a/lib/kokkos/TPL/cub/block/block_reduce.cuh b/lib/kokkos/TPL/cub/block/block_reduce.cuh deleted file mode 100755 index ffdff73775..0000000000 --- a/lib/kokkos/TPL/cub/block/block_reduce.cuh +++ /dev/null @@ -1,563 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. 
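The PlacementPtr/RakingPtr pair above is easiest to see in a tiny block-wide sum written directly against BlockRakingLayout. This is only a sketch of the pattern: it assumes one input per thread, a block size that is a multiple of the warp size (so the segments are unguarded), and it finishes with an atomic instead of the warp-synchronous reduction the library's internal specializations use.

#include <cub/cub.cuh>

__global__ void RakingSumExampleKernel(const int *d_in, int *d_block_sums)
{
    enum { BLOCK_THREADS = 128 };   // multiple of the warp size, so no guarding needed
    typedef cub::BlockRakingLayout<int, BLOCK_THREADS> RakingLayout;

    __shared__ typename RakingLayout::TempStorage raking_grid;
    __shared__ int block_sum;

    if (threadIdx.x == 0) block_sum = 0;

    // Phase 1: every thread places one partial into the padded raking grid.
    int partial = d_in[blockIdx.x * BLOCK_THREADS + threadIdx.x];
    *RakingLayout::PlacementPtr(raking_grid, threadIdx.x) = partial;
    __syncthreads();

    // Phase 2: one warp of raking threads serially reduces its segment, then
    // the per-segment sums are combined (atomics keep this sketch short).
    if (threadIdx.x < RakingLayout::RAKING_THREADS)
    {
        int *segment = RakingLayout::RakingPtr(raking_grid, threadIdx.x);
        int sum = segment[0];
        for (int i = 1; i < RakingLayout::SEGMENT_LENGTH; ++i)
            sum += segment[i];
        atomicAdd(&block_sum, sum);
    }
    __syncthreads();

    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = block_sum;
}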
- * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * The cub::BlockReduce class provides [collective](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block. - */ - -#pragma once - -#include "specializations/block_reduce_raking.cuh" -#include "specializations/block_reduce_warp_reductions.cuh" -#include "../util_type.cuh" -#include "../thread/thread_operators.cuh" -#include "../util_namespace.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - - - -/****************************************************************************** - * Algorithmic variants - ******************************************************************************/ - -/** - * BlockReduceAlgorithm enumerates alternative algorithms for parallel - * reduction across a CUDA threadblock. - */ -enum BlockReduceAlgorithm -{ - - /** - * \par Overview - * An efficient "raking" reduction algorithm. Execution is comprised of - * three phases: - * -# Upsweep sequential reduction in registers (if threads contribute more - * than one input each). Each thread then places the partial reduction - * of its item(s) into shared memory. - * -# Upsweep sequential reduction in shared memory. Threads within a - * single warp rake across segments of shared partial reductions. - * -# A warp-synchronous Kogge-Stone style reduction within the raking warp. - * - * \par - * \image html block_reduce.png - * \p BLOCK_REDUCE_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.- * - * \par Performance Considerations - * - Although this variant may suffer longer turnaround latencies when the - * GPU is under-occupied, it can often provide higher overall throughput - * across the GPU when suitably occupied. - */ - BLOCK_REDUCE_RAKING, - - - /** - * \par Overview - * A quick "tiled warp-reductions" reduction algorithm. Execution is - * comprised of four phases: - * -# Upsweep sequential reduction in registers (if threads contribute more - * than one input each). Each thread then places the partial reduction - * of its item(s) into shared memory. - * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style - * reduction within each warp. - * -# A propagation phase where the warp reduction outputs in each warp are - * updated with the aggregate from each preceding warp. 
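Both enumerators are consumed by the BlockReduce class template defined further below, so switching algorithms is a one-line change to the specialization; the int and 128-thread choices here are only illustrative.

#include <cub/cub.cuh>

// Throughput-oriented default (equivalent to specifying BLOCK_REDUCE_RAKING):
typedef cub::BlockReduce<int, 128> BlockReduceDefaultT;

// Latency-oriented alternative for under-occupied kernels:
typedef cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_WARP_REDUCTIONS> BlockReduceWarpT;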
- * - * \par - * \image html block_scan_warpscans.png - *\p BLOCK_REDUCE_WARP_REDUCTIONS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.- * - * \par Performance Considerations - * - Although this variant may suffer lower overall throughput across the - * GPU because due to a heavy reliance on inefficient warp-reductions, it - * can often provide lower turnaround latencies when the GPU is - * under-occupied. - */ - BLOCK_REDUCE_WARP_REDUCTIONS, -}; - - -/****************************************************************************** - * Block reduce - ******************************************************************************/ - -/** - * \brief The BlockReduce class provides [collective](index.html#sec0) methods for computing a parallel reduction of items partitioned across a CUDA thread block.  - * \ingroup BlockModule - * - * \par Overview - * A reduction (or fold) - * uses a binary combining operator to compute a single aggregate from a list of input elements. - * - * \par - * Optionally, BlockReduce can be specialized by algorithm to accommodate different latency/throughput workload profiles: - * -# cub::BLOCK_REDUCE_RAKING. An efficient "raking" reduction algorithm. [More...](\ref cub::BlockReduceAlgorithm) - * -# cub::BLOCK_REDUCE_WARP_REDUCTIONS. A quick "tiled warp-reductions" reduction algorithm. [More...](\ref cub::BlockReduceAlgorithm) - * - * \tparam T Data type being reduced - * \tparam BLOCK_THREADS The thread block size in threads - * \tparam ALGORITHM [optional] cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_RAKING) - * - * \par Performance Considerations - * - Very efficient (only one synchronization barrier). - * - Zero bank conflicts for most types. - * - Computation is slightly more efficient (i.e., having lower instruction overhead) for: - * - Summation (vs. generic reduction) - * - \p BLOCK_THREADS is a multiple of the architecture's warp size - * - Every thread has a valid input (i.e., full vs. partial-tiles) - * - See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives - * - * \par A Simple Example - * \blockcollective{BlockReduce} - * \par - * The code snippet below illustrates a sum reduction of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include- * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockReduce for 128 threads on type int - * typedef cub::BlockReduce BlockReduce; - * - * // Allocate shared memory for BlockReduce - * __shared__ typename BlockReduce::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Compute the block-wide sum for thread0 - * int aggregate = BlockReduce(temp_storage).Sum(thread_data); - * - * \endcode - * - */ -template < - typename T, - int BLOCK_THREADS, - BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_RAKING> -class BlockReduce -{ -private: - - /****************************************************************************** - * Constants and typedefs - ******************************************************************************/ - - /// Internal specialization. 
- typedef typename If<(ALGORITHM == BLOCK_REDUCE_WARP_REDUCTIONS), - BlockReduceWarpReductions , - BlockReduceRaking >::Type InternalBlockReduce; - - /// Shared memory storage layout type for BlockReduce - typedef typename InternalBlockReduce::TempStorage _TempStorage; - - - /****************************************************************************** - * Utility methods - ******************************************************************************/ - - /// Internal storage allocator - __device__ __forceinline__ _TempStorage& PrivateStorage() - { - __shared__ _TempStorage private_storage; - return private_storage; - } - - - /****************************************************************************** - * Thread fields - ******************************************************************************/ - - /// Shared storage reference - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - - -public: - - /// \smemstorage{BlockReduce} - struct TempStorage : Uninitialized<_TempStorage> {}; - - - /******************************************************************//** - * \name Collective constructors - *********************************************************************/ - //@{ - - /** - * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockReduce() - : - temp_storage(PrivateStorage()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockReduce( - TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage - : - temp_storage(temp_storage.Alias()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor using a private static allocation of shared memory as temporary storage. Each thread is identified using the supplied linear thread identifier - */ - __device__ __forceinline__ BlockReduce( - int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(PrivateStorage()), - linear_tid(linear_tid) - {} - - - /** - * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. - */ - __device__ __forceinline__ BlockReduce( - TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage - int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - - - //@} end member group - /******************************************************************//** - * \name Generic reductions - *********************************************************************/ - //@{ - - - /** - * \brief Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element. - * - * The return value is undefined in threads other than thread0. - * - * Supports non-commutative reduction operators. - * - * \smemreuse - * - * The code snippet below illustrates a max reduction of 128 integer items that - * are partitioned across 128 threads. 
- * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockReduce for 128 threads on type int - * typedef cub::BlockReduce BlockReduce; - * - * // Allocate shared memory for BlockReduce - * __shared__ typename BlockReduce::TempStorage temp_storage; - * - * // Each thread obtains an input item - * int thread_data; - * ... - * - * // Compute the block-wide max for thread0 - * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max()); - * - * \endcode - * - * \tparam ReductionOp [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ T Reduce( - T input, ///< [in] Calling thread's input - ReductionOp reduction_op) ///< [in] Binary reduction operator - { - return InternalBlockReduce(temp_storage, linear_tid).template Reduce (input, BLOCK_THREADS, reduction_op); - } - - - /** - * \brief Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements. - * - * The return value is undefined in threads other than thread0. - * - * Supports non-commutative reduction operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a max reduction of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockReduce for 128 threads on type int - * typedef cub::BlockReduce BlockReduce; - * - * // Allocate shared memory for BlockReduce - * __shared__ typename BlockReduce::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Compute the block-wide max for thread0 - * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max()); - * - * \endcode - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam ReductionOp [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) - */ - template < - int ITEMS_PER_THREAD, - typename ReductionOp> - __device__ __forceinline__ T Reduce( - T (&inputs)[ITEMS_PER_THREAD], ///< [in] Calling thread's input segment - ReductionOp reduction_op) ///< [in] Binary reduction operator - { - // Reduce partials - T partial = ThreadReduce(inputs, reduction_op); - return Reduce(partial, reduction_op); - } - - - /** - * \brief Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first \p num_valid threads each contribute one input element. - * - * The return value is undefined in threads other than thread0. - * - * Supports non-commutative reduction operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a max reduction of a partially-full tile of integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int num_valid, ...) - * { - * // Specialize BlockReduce for 128 threads on type int - * typedef cub::BlockReduce BlockReduce; - * - * // Allocate shared memory for BlockReduce - * __shared__ typename BlockReduce::TempStorage temp_storage; - * - * // Each thread obtains an input item - * int thread_data; - * if (threadIdx.x < num_valid) thread_data = ... 
- * - * // Compute the block-wide max for thread0 - * int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max(), num_valid); - * - * \endcode - * - * \tparam ReductionOp [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ T Reduce( - T input, ///< [in] Calling thread's input - ReductionOp reduction_op, ///< [in] Binary reduction operator - int num_valid) ///< [in] Number of threads containing valid elements (may be less than BLOCK_THREADS) - { - // Determine if we scan skip bounds checking - if (num_valid >= BLOCK_THREADS) - { - return InternalBlockReduce(temp_storage, linear_tid).template Reduce (input, num_valid, reduction_op); - } - else - { - return InternalBlockReduce(temp_storage, linear_tid).template Reduce (input, num_valid, reduction_op); - } - } - - - //@} end member group - /******************************************************************//** - * \name Summation reductions - *********************************************************************/ - //@{ - - - /** - * \brief Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element. - * - * The return value is undefined in threads other than thread0. - * - * \smemreuse - * - * The code snippet below illustrates a sum reduction of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockReduce for 128 threads on type int - * typedef cub::BlockReduce BlockReduce; - * - * // Allocate shared memory for BlockReduce - * __shared__ typename BlockReduce::TempStorage temp_storage; - * - * // Each thread obtains an input item - * int thread_data; - * ... - * - * // Compute the block-wide sum for thread0 - * int aggregate = BlockReduce(temp_storage).Sum(thread_data); - * - * \endcode - * - */ - __device__ __forceinline__ T Sum( - T input) ///< [in] Calling thread's input - { - return InternalBlockReduce(temp_storage, linear_tid).template Sum (input, BLOCK_THREADS); - } - - /** - * \brief Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements. - * - * The return value is undefined in threads other than thread0. - * - * \smemreuse - * - * The code snippet below illustrates a sum reduction of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockReduce for 128 threads on type int - * typedef cub::BlockReduce BlockReduce; - * - * // Allocate shared memory for BlockReduce - * __shared__ typename BlockReduce::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Compute the block-wide sum for thread0 - * int aggregate = BlockReduce(temp_storage).Sum(thread_data); - * - * \endcode - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- */ - template - __device__ __forceinline__ T Sum( - T (&inputs)[ITEMS_PER_THREAD]) ///< [in] Calling thread's input segment - { - // Reduce partials - T partial = ThreadReduce(inputs, cub::Sum()); - return Sum(partial); - } - - - /** - * \brief Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first \p num_valid threads each contribute one input element. - * - * The return value is undefined in threads other than thread0. - * - * \smemreuse - * - * The code snippet below illustrates a sum reduction of a partially-full tile of integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(int num_valid, ...) - * { - * // Specialize BlockReduce for 128 threads on type int - * typedef cub::BlockReduce BlockReduce; - * - * // Allocate shared memory for BlockReduce - * __shared__ typename BlockReduce::TempStorage temp_storage; - * - * // Each thread obtains an input item (up to num_items) - * int thread_data; - * if (threadIdx.x < num_valid) - * thread_data = ... - * - * // Compute the block-wide sum for thread0 - * int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid); - * - * \endcode - * - */ - __device__ __forceinline__ T Sum( - T input, ///< [in] Calling thread's input - int num_valid) ///< [in] Number of threads containing valid elements (may be less than BLOCK_THREADS) - { - // Determine if we scan skip bounds checking - if (num_valid >= BLOCK_THREADS) - { - return InternalBlockReduce(temp_storage, linear_tid).template Sum (input, num_valid); - } - else - { - return InternalBlockReduce(temp_storage, linear_tid).template Sum (input, num_valid); - } - } - - - //@} end member group -}; - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) - diff --git a/lib/kokkos/TPL/cub/block/block_scan.cuh b/lib/kokkos/TPL/cub/block/block_scan.cuh deleted file mode 100755 index 1c1a2dac81..0000000000 --- a/lib/kokkos/TPL/cub/block/block_scan.cuh +++ /dev/null @@ -1,2233 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. 
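For completeness, here is a self-contained version of the block-wide sum whose snippets above lost their template arguments in this copy; since only thread0 receives a defined aggregate, it alone writes the per-block result. The sizes, the guard against a short last tile, and the kernel name are illustrative assumptions.

#include <cub/cub.cuh>

__global__ void BlockSumExampleKernel(const int *d_in, int *d_block_sums, int num_items)
{
    enum { BLOCK_THREADS = 128, ITEMS_PER_THREAD = 4,
           TILE_ITEMS = BLOCK_THREADS * ITEMS_PER_THREAD };

    typedef cub::BlockReduce<int, BLOCK_THREADS> BlockReduceT;
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    // Blocked load of this thread's segment, padded with zeros past the end.
    int block_offset = blockIdx.x * TILE_ITEMS;
    int thread_data[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; ++i)
    {
        int idx = block_offset + threadIdx.x * ITEMS_PER_THREAD + i;
        thread_data[i] = (idx < num_items) ? d_in[idx] : 0;
    }

    // Only thread0's return value is defined.
    int aggregate = BlockReduceT(temp_storage).Sum(thread_data);

    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = aggregate;
}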
IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * The cub::BlockScan class provides [collective](index.html#sec0) methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block. - */ - -#pragma once - -#include "specializations/block_scan_raking.cuh" -#include "specializations/block_scan_warp_scans.cuh" -#include "../util_arch.cuh" -#include "../util_type.cuh" -#include "../util_namespace.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - - -/****************************************************************************** - * Algorithmic variants - ******************************************************************************/ - -/** - * \brief BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block. - */ -enum BlockScanAlgorithm -{ - - /** - * \par Overview - * An efficient "raking reduce-then-scan" prefix scan algorithm. Execution is comprised of five phases: - * -# Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory. - * -# Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions. - * -# A warp-synchronous Kogge-Stone style exclusive scan within the raking warp. - * -# Downsweep sequential exclusive scan in shared memory. Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output. - * -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output. - * - * \par - * \image html block_scan_raking.png - * \p BLOCK_SCAN_RAKING data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.- * - * \par Performance Considerations - * - Although this variant may suffer longer turnaround latencies when the - * GPU is under-occupied, it can often provide higher overall throughput - * across the GPU when suitably occupied. - */ - BLOCK_SCAN_RAKING, - - - /** - * \par Overview - * Similar to cub::BLOCK_SCAN_RAKING, but with fewer shared memory reads at - * the expense of higher register pressure. Raking threads preserve their - * "upsweep" segment of values in registers while performing warp-synchronous - * scan, allowing the "downsweep" not to re-read them from shared memory. - */ - BLOCK_SCAN_RAKING_MEMOIZE, - - - /** - * \par Overview - * A quick "tiled warpscans" prefix scan algorithm. Execution is comprised of four phases: - * -# Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory. - * -# Compute a shallow, but inefficient warp-synchronous Kogge-Stone style scan within each warp. 
- * -# A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp. - * -# Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output. - * - * \par - * \image html block_scan_warpscans.png - *\p BLOCK_SCAN_WARP_SCANS data flow for a hypothetical 16-thread threadblock and 4-thread raking warp.- * - * \par Performance Considerations - * - Although this variant may suffer lower overall throughput across the - * GPU because due to a heavy reliance on inefficient warpscans, it can - * often provide lower turnaround latencies when the GPU is under-occupied. - */ - BLOCK_SCAN_WARP_SCANS, -}; - - -/****************************************************************************** - * Block scan - ******************************************************************************/ - -/** - * \brief The BlockScan class provides [collective](index.html#sec0) methods for computing a parallel prefix sum/scan of items partitioned across a CUDA thread block.  - * \ingroup BlockModule - * - * \par Overview - * Given a list of input elements and a binary reduction operator, a [prefix scan](http://en.wikipedia.org/wiki/Prefix_sum) - * produces an output list where each element is computed to be the reduction - * of the elements occurring earlier in the input list. Prefix sum - * connotes a prefix scan with the addition operator. The term \em inclusive indicates - * that the ith output reduction incorporates the ith input. - * The term \em exclusive indicates the ith input is not incorporated into - * the ith output reduction. - * - * \par - * Optionally, BlockScan can be specialized by algorithm to accommodate different latency/throughput workload profiles: - * -# cub::BLOCK_SCAN_RAKING. An efficient "raking reduce-then-scan" prefix scan algorithm. [More...](\ref cub::BlockScanAlgorithm) - * -# cub::BLOCK_SCAN_WARP_SCANS. A quick "tiled warpscans" prefix scan algorithm. [More...](\ref cub::BlockScanAlgorithm) - * - * \tparam T Data type being scanned - * \tparam BLOCK_THREADS The thread block size in threads - * \tparam ALGORITHM [optional] cub::BlockScanAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_SCAN_RAKING) - * - * \par A Simple Example - * \blockcollective{BlockScan} - * \par - * The code snippet below illustrates an exclusive prefix sum of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include- * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute the block-wide exclusive prefix sum - * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. - * The corresponding output \p thread_data in those threads will be - * { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. 
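// A hedged reconstruction of the "Simple Example" above: an exclusive prefix
// sum of 512 ints held in a blocked arrangement, 4 consecutive items per
// thread, with the stripped template arguments restored.  The kernel name and
// the plain-loop input setup (instead of BlockLoad) are illustrative
// assumptions; a 128-thread launch and a 512-element d_data are assumed.
#include <cub/cub.cuh>

__global__ void ExampleExclusiveSumKernel(int *d_data)
{
    // Specialize BlockScan for 128 threads on type int (default raking algorithm)
    typedef cub::BlockScan<int, 128> BlockScan;

    // Allocate shared memory for BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Obtain a segment of 4 consecutive items that are blocked across threads
    int thread_data[4];
    for (int i = 0; i < 4; ++i)
        thread_data[i] = d_data[(threadIdx.x * 4) + i];

    // Collectively compute the block-wide exclusive prefix sum
    BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data);

    // With all-ones input, thread k now holds {4k, 4k+1, 4k+2, 4k+3}
    for (int i = 0; i < 4; ++i)
        d_data[(threadIdx.x * 4) + i] = thread_data[i];
}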
- * - * \par Performance Considerations - * - Uses special instructions when applicable (e.g., warp \p SHFL) - * - Uses synchronization-free communication between warp lanes when applicable - * - Uses only one or two block-wide synchronization barriers (depending on - * algorithm selection) - * - Zero bank conflicts for most types - * - Computation is slightly more efficient (i.e., having lower instruction overhead) for: - * - Prefix sum variants (vs. generic scan) - * - Exclusive variants (vs. inclusive) - * - \p BLOCK_THREADS is a multiple of the architecture's warp size - * - See cub::BlockScanAlgorithm for performance details regarding algorithmic alternatives - * - */ -template < - typename T, - int BLOCK_THREADS, - BlockScanAlgorithm ALGORITHM = BLOCK_SCAN_RAKING> -class BlockScan -{ -private: - - /****************************************************************************** - * Constants and typedefs - ******************************************************************************/ - - /** - * Ensure the template parameterization meets the requirements of the - * specified algorithm. Currently, the BLOCK_SCAN_WARP_SCANS policy - * cannot be used with threadblock sizes not a multiple of the - * architectural warp size. - */ - static const BlockScanAlgorithm SAFE_ALGORITHM = - ((ALGORITHM == BLOCK_SCAN_WARP_SCANS) && (BLOCK_THREADS % PtxArchProps::WARP_THREADS != 0)) ? - BLOCK_SCAN_RAKING : - ALGORITHM; - - /// Internal specialization. - typedef typename If<(SAFE_ALGORITHM == BLOCK_SCAN_WARP_SCANS), - BlockScanWarpScans , - BlockScanRaking >::Type InternalBlockScan; - - - /// Shared memory storage layout type for BlockScan - typedef typename InternalBlockScan::TempStorage _TempStorage; - - - /****************************************************************************** - * Thread fields - ******************************************************************************/ - - /// Shared storage reference - _TempStorage &temp_storage; - - /// Linear thread-id - int linear_tid; - - - /****************************************************************************** - * Utility methods - ******************************************************************************/ - - /// Internal storage allocator - __device__ __forceinline__ _TempStorage& PrivateStorage() - { - __shared__ _TempStorage private_storage; - return private_storage; - } - - -public: - - /// \smemstorage{BlockScan} - struct TempStorage : Uninitialized<_TempStorage> {}; - - - /******************************************************************//** - * \name Collective constructors - *********************************************************************/ - //@{ - - /** - * \brief Collective constructor for 1D thread blocks using a private static allocation of shared memory as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockScan() - : - temp_storage(PrivateStorage()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor for 1D thread blocks using the specified memory allocation as temporary storage. Threads are identified using threadIdx.x. - */ - __device__ __forceinline__ BlockScan( - TempStorage &temp_storage) ///< [in] Reference to memory allocation having layout type TempStorage - : - temp_storage(temp_storage.Alias()), - linear_tid(threadIdx.x) - {} - - - /** - * \brief Collective constructor using a private static allocation of shared memory as temporary storage. 
Each thread is identified using the supplied linear thread identifier - */ - __device__ __forceinline__ BlockScan( - int linear_tid) ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(PrivateStorage()), - linear_tid(linear_tid) - {} - - - /** - * \brief Collective constructor using the specified memory allocation as temporary storage. Each thread is identified using the supplied linear thread identifier. - */ - __device__ __forceinline__ BlockScan( - TempStorage &temp_storage, ///< [in] Reference to memory allocation having layout type TempStorage - int linear_tid) ///< [in] [optional] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks) - : - temp_storage(temp_storage.Alias()), - linear_tid(linear_tid) - {} - - - - //@} end member group - /******************************************************************//** - * \name Exclusive prefix sum operations - *********************************************************************/ - //@{ - - - /** - * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an exclusive prefix sum of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain input item for each thread - * int thread_data; - * ... - * - * // Collectively compute the block-wide exclusive prefix sum - * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is 1, 1, ..., 1. The - * corresponding output \p thread_data in those threads will be 0, 1, ..., 127. - * - */ - __device__ __forceinline__ void ExclusiveSum( - T input, ///< [in] Calling thread's input item - T &output) ///< [out] Calling thread's output item (may be aliased to \p input) - { - T block_aggregate; - InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an exclusive prefix sum of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain input item for each thread - * int thread_data; - * ... - * - * // Collectively compute the block-wide exclusive prefix sum - * int block_aggregate; - * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is 1, 1, ..., 1. 
The - * corresponding output \p thread_data in those threads will be 0, 1, ..., 127. - * Furthermore the value \p 128 will be stored in \p block_aggregate for all threads. - * - */ - __device__ __forceinline__ void ExclusiveSum( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a single thread block that progressively - * computes an exclusive prefix sum over multiple "tiles" of input using a - * prefix functor to maintain a running total between block-wide scans. Each tile consists - * of 128 integer items that are partitioned across 128 threads. - * \par - * \code - * #include - * - * // A stateful callback functor that maintains a running prefix to be applied - * // during consecutive scan operations. - * struct BlockPrefixOp - * { - * // Running prefix - * int running_total; - * - * // Constructor - * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} - * - * // Callback operator to be entered by the first warp of threads in the block. - * // Thread-0 is responsible for returning a value for seeding the block-wide scan. - * __device__ int operator()(int block_aggregate) - * { - * int old_prefix = running_total; - * running_total += block_aggregate; - * return old_prefix; - * } - * }; - * - * __global__ void ExampleKernel(int *d_data, int num_items, ...) - * { - * // Specialize BlockScan for 128 threads - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Initialize running total - * BlockPrefixOp prefix_op(0); - * - * // Have the block iterate over segments of items - * for (int block_offset = 0; block_offset < num_items; block_offset += 128) - * { - * // Load a segment of consecutive items that are blocked across threads - * int thread_data = d_data[block_offset]; - * - * // Collectively compute the block-wide exclusive prefix sum - * int block_aggregate; - * BlockScan(temp_storage).ExclusiveSum( - * thread_data, thread_data, block_aggregate, prefix_op); - * __syncthreads(); - * - * // Store scanned items to output segment - * d_data[block_offset] = thread_data; - * } - * \endcode - * \par - * Suppose the input \p d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... - * The corresponding output for the first segment will be 0, 1, ..., 127. 
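// A hedged reconstruction of the running-prefix example documented above:
// a stateful BlockPrefixOp functor carries a running total between
// consecutive block-wide ExclusiveSum calls over 128-item tiles.  The
// per-thread indexing (block_offset + threadIdx.x) is spelled out here;
// kernel and pointer names are illustrative, a 128-thread launch is assumed,
// and num_items is assumed to be a multiple of 128.
#include <cub/cub.cuh>

// A stateful callback functor that maintains a running prefix to be applied
// during consecutive scan operations.
struct BlockPrefixOp
{
    int running_total;

    __device__ BlockPrefixOp(int running_total) : running_total(running_total) {}

    // Entered by the first warp; the value returned by lane0 seeds the scan.
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

__global__ void ExampleTiledPrefixSumKernel(int *d_data, int num_items)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Initialize the running total to zero
    BlockPrefixOp prefix_op(0);

    // Have the block iterate over 128-item segments of the input
    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        int thread_data = d_data[block_offset + threadIdx.x];

        // Collectively compute the block-wide exclusive prefix sum,
        // seeded by the running prefix
        int block_aggregate;
        BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data,
                                             block_aggregate, prefix_op);
        __syncthreads();

        d_data[block_offset + threadIdx.x] = thread_data;
    }
}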
- * The output for the second segment will be 128, 129, ..., 255. Furthermore, - * the value \p 128 will be stored in \p block_aggregate for all threads after each scan. - * - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template - __device__ __forceinline__ void ExclusiveSum( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - InternalBlockScan(temp_storage, linear_tid).ExclusiveSum(input, output, block_aggregate, block_prefix_op); - } - - - //@} end member group - /******************************************************************//** - * \name Exclusive prefix sum operations (multiple data per thread) - *********************************************************************/ - //@{ - - - /** - * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an exclusive prefix sum of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute the block-wide exclusive prefix sum - * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The - * corresponding output \p thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - */ - template - __device__ __forceinline__ void ExclusiveSum( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD]) ///< [out] Calling thread's output items (may be aliased to \p input) - { - // Reduce consecutive thread items in registers - Sum scan_op; - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveSum(thread_partial, thread_partial); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an exclusive prefix sum of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. 
- * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute the block-wide exclusive prefix sum - * int block_aggregate; - * BlockScan(temp_storage).ExclusiveSum(thread_data, thread_data, block_aggregate); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The - * corresponding output \p thread_data in those threads will be { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. - * Furthermore the value \p 512 will be stored in \p block_aggregate for all threads. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - */ - template - __device__ __forceinline__ void ExclusiveSum( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - // Reduce consecutive thread items in registers - Sum scan_op; - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveSum(thread_partial, thread_partial, block_aggregate); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a single thread block that progressively - * computes an exclusive prefix sum over multiple "tiles" of input using a - * prefix functor to maintain a running total between block-wide scans. Each tile consists - * of 512 integer items that are partitioned in a [blocked arrangement](index.html#sec5sec4) - * across 128 threads where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * // A stateful callback functor that maintains a running prefix to be applied - * // during consecutive scan operations. - * struct BlockPrefixOp - * { - * // Running prefix - * int running_total; - * - * // Constructor - * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} - * - * // Callback operator to be entered by the first warp of threads in the block. - * // Thread-0 is responsible for returning a value for seeding the block-wide scan. 
- * __device__ int operator()(int block_aggregate) - * { - * int old_prefix = running_total; - * running_total += block_aggregate; - * return old_prefix; - * } - * }; - * - * __global__ void ExampleKernel(int *d_data, int num_items, ...) - * { - * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread - * typedef cub::BlockLoad BlockLoad; - * typedef cub::BlockStore BlockStore; - * typedef cub::BlockScan BlockScan; - * - * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan - * __shared__ union { - * typename BlockLoad::TempStorage load; - * typename BlockScan::TempStorage scan; - * typename BlockStore::TempStorage store; - * } temp_storage; - * - * // Initialize running total - * BlockPrefixOp prefix_op(0); - * - * // Have the block iterate over segments of items - * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) - * { - * // Load a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); - * __syncthreads(); - * - * // Collectively compute the block-wide exclusive prefix sum - * int block_aggregate; - * BlockScan(temp_storage.scan).ExclusiveSum( - * thread_data, thread_data, block_aggregate, prefix_op); - * __syncthreads(); - * - * // Store scanned items to output segment - * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); - * __syncthreads(); - * } - * \endcode - * \par - * Suppose the input \p d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... - * The corresponding output for the first segment will be 0, 1, 2, 3, ..., 510, 511. - * The output for the second segment will be 512, 513, 514, 515, ..., 1022, 1023. Furthermore, - * the value \p 512 will be stored in \p block_aggregate for all threads after each scan. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template < - int ITEMS_PER_THREAD, - typename BlockPrefixOp> - __device__ __forceinline__ void ExclusiveSum( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - // Reduce consecutive thread items in registers - Sum scan_op; - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveSum(thread_partial, thread_partial, block_aggregate, block_prefix_op); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial); - } - - - - //@} end member group // Inclusive prefix sums - /******************************************************************//** - * \name Exclusive prefix scan operations - *********************************************************************/ - //@{ - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. - * - * Supports non-commutative scan operators. 
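// A hedged reconstruction of the exclusive prefix max scan described in the
// ExclusiveScan documentation that follows: each thread contributes one int,
// INT_MIN serves as the identity, and cub::Max() is the scan operator.
// Kernel and pointer names are illustrative; a 128-thread launch is assumed.
#include <cub/cub.cuh>
#include <climits>

__global__ void ExampleExclusiveMaxKernel(const int *d_in, int *d_out)
{
    // Specialize BlockScan for 128 threads on type int
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_data = d_in[threadIdx.x];

    // Collectively compute the block-wide exclusive prefix max scan;
    // thread0 receives the identity (INT_MIN)
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data,
                                          INT_MIN, cub::Max());

    d_out[threadIdx.x] = thread_data;
}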
- * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an exclusive prefix max scan of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain input item for each thread - * int thread_data; - * ... - * - * // Collectively compute the block-wide exclusive prefix max scan - * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max()); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The - * corresponding output \p thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. - * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ void ExclusiveScan( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - T identity, ///< [in] Identity value - ScanOp scan_op) ///< [in] Binary scan operator - { - T block_aggregate; - InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an exclusive prefix max scan of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain input item for each thread - * int thread_data; - * ... - * - * // Collectively compute the block-wide exclusive prefix max scan - * int block_aggregate; - * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The - * corresponding output \p thread_data in those threads will be INT_MIN, 0, 0, 2, ..., 124, 126. - * Furthermore the value \p 126 will be stored in \p block_aggregate for all threads. - * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ void ExclusiveScan( - T input, ///< [in] Calling thread's input items - T &output, ///< [out] Calling thread's output items (may be aliased to \p input) - const T &identity, ///< [in] Identity value - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. 
Each thread contributes one input element. the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a single thread block that progressively - * computes an exclusive prefix max scan over multiple "tiles" of input using a - * prefix functor to maintain a running total between block-wide scans. Each tile consists - * of 128 integer items that are partitioned across 128 threads. - * \par - * \code - * #include - * - * // A stateful callback functor that maintains a running prefix to be applied - * // during consecutive scan operations. - * struct BlockPrefixOp - * { - * // Running prefix - * int running_total; - * - * // Constructor - * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} - * - * // Callback operator to be entered by the first warp of threads in the block. - * // Thread-0 is responsible for returning a value for seeding the block-wide scan. - * __device__ int operator()(int block_aggregate) - * { - * int old_prefix = running_total; - * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; - * return old_prefix; - * } - * }; - * - * __global__ void ExampleKernel(int *d_data, int num_items, ...) - * { - * // Specialize BlockScan for 128 threads - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Initialize running total - * BlockPrefixOp prefix_op(INT_MIN); - * - * // Have the block iterate over segments of items - * for (int block_offset = 0; block_offset < num_items; block_offset += 128) - * { - * // Load a segment of consecutive items that are blocked across threads - * int thread_data = d_data[block_offset]; - * - * // Collectively compute the block-wide exclusive prefix max scan - * int block_aggregate; - * BlockScan(temp_storage).ExclusiveScan( - * thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate, prefix_op); - * __syncthreads(); - * - * // Store scanned items to output segment - * d_data[block_offset] = thread_data; - * } - * \endcode - * \par - * Suppose the input \p d_data is 0, -1, 2, -3, 4, -5, .... - * The corresponding output for the first segment will be INT_MIN, 0, 0, 2, ..., 124, 126. - * The output for the second segment will be 126, 128, 128, 130, ..., 252, 254. Furthermore, - * \p block_aggregate will be assigned \p 126 in all threads after the first scan, assigned \p 254 after the second - * scan, etc. 
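// A hedged reconstruction of the tiled exclusive prefix max scan above: the
// same running-prefix pattern as the earlier ExclusiveSum sketch, but the
// functor keeps a running maximum and the scan uses INT_MIN / cub::Max().
// The functor is renamed BlockPrefixMaxOp here purely for illustration;
// a 128-thread launch is assumed and num_items is assumed to be a multiple
// of 128.
#include <cub/cub.cuh>
#include <climits>

// Stateful callback functor carrying a running maximum between tiles.
struct BlockPrefixMaxOp
{
    int running_total;

    __device__ BlockPrefixMaxOp(int running_total) : running_total(running_total) {}

    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix;
        return old_prefix;
    }
};

__global__ void ExampleTiledMaxScanKernel(int *d_data, int num_items)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Initialize the running maximum
    BlockPrefixMaxOp prefix_op(INT_MIN);

    for (int block_offset = 0; block_offset < num_items; block_offset += 128)
    {
        int thread_data = d_data[block_offset + threadIdx.x];

        // Exclusive max scan, seeded by the running maximum of earlier tiles
        int block_aggregate;
        BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN,
                                              cub::Max(), block_aggregate, prefix_op);
        __syncthreads();

        d_data[block_offset + threadIdx.x] = thread_data;
    }
}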
- * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template < - typename ScanOp, - typename BlockPrefixOp> - __device__ __forceinline__ void ExclusiveScan( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - T identity, ///< [in] Identity value - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, identity, scan_op, block_aggregate, block_prefix_op); - } - - - //@} end member group // Inclusive prefix sums - /******************************************************************//** - * \name Exclusive prefix scan operations (multiple data per thread) - *********************************************************************/ - //@{ - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an exclusive prefix max scan of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute the block-wide exclusive prefix max scan - * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max()); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. - * The corresponding output \p thread_data in those threads will be - * { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp> - __device__ __forceinline__ void ExclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - const T &identity, ///< [in] Identity value - ScanOp scan_op) ///< [in] Binary scan operator - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, identity, scan_op); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an exclusive prefix max scan of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute the block-wide exclusive prefix max scan - * int block_aggregate; - * BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The - * corresponding output \p thread_data in those threads will be { [INT_MIN,0,0,2], [2,4,4,6], ..., [506,508,508,510] }. - * Furthermore the value \p 510 will be stored in \p block_aggregate for all threads. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp> - __device__ __forceinline__ void ExclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - const T &identity, ///< [in] Identity value - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, identity, scan_op, block_aggregate); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. 
the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a single thread block that progressively - * computes an exclusive prefix max scan over multiple "tiles" of input using a - * prefix functor to maintain a running total between block-wide scans. Each tile consists - * of 128 integer items that are partitioned across 128 threads. - * \par - * \code - * #include - * - * // A stateful callback functor that maintains a running prefix to be applied - * // during consecutive scan operations. - * struct BlockPrefixOp - * { - * // Running prefix - * int running_total; - * - * // Constructor - * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} - * - * // Callback operator to be entered by the first warp of threads in the block. - * // Thread-0 is responsible for returning a value for seeding the block-wide scan. - * __device__ int operator()(int block_aggregate) - * { - * int old_prefix = running_total; - * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; - * return old_prefix; - * } - * }; - * - * __global__ void ExampleKernel(int *d_data, int num_items, ...) - * { - * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread - * typedef cub::BlockLoad BlockLoad; - * typedef cub::BlockStore BlockStore; - * typedef cub::BlockScan BlockScan; - * - * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan - * __shared__ union { - * typename BlockLoad::TempStorage load; - * typename BlockScan::TempStorage scan; - * typename BlockStore::TempStorage store; - * } temp_storage; - * - * // Initialize running total - * BlockPrefixOp prefix_op(0); - * - * // Have the block iterate over segments of items - * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) - * { - * // Load a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); - * __syncthreads(); - * - * // Collectively compute the block-wide exclusive prefix max scan - * int block_aggregate; - * BlockScan(temp_storage.scan).ExclusiveScan( - * thread_data, thread_data, INT_MIN, cub::Max(), block_aggregate, prefix_op); - * __syncthreads(); - * - * // Store scanned items to output segment - * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); - * __syncthreads(); - * } - * \endcode - * \par - * Suppose the input \p d_data is 0, -1, 2, -3, 4, -5, .... - * The corresponding output for the first segment will be INT_MIN, 0, 0, 2, 2, 4, ..., 508, 510. - * The output for the second segment will be 510, 512, 512, 514, 514, 516, ..., 1020, 1022. 
Furthermore, - * \p block_aggregate will be assigned \p 510 in all threads after the first scan, assigned \p 1022 after the second - * scan, etc. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp, - typename BlockPrefixOp> - __device__ __forceinline__ void ExclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - T identity, ///< [in] Identity value - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, identity, scan_op, block_aggregate, block_prefix_op); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial); - } - - - //@} end member group - -#ifndef DOXYGEN_SHOULD_SKIP_THIS // Do not document - - /******************************************************************//** - * \name Exclusive prefix scan operations (identityless, single datum per thread) - *********************************************************************/ - //@{ - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. With no identity value, the output computed for thread0 is undefined. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ void ExclusiveScan( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - ScanOp scan_op) ///< [in] Binary scan operator - { - T block_aggregate; - InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. With no identity value, the output computed for thread0 is undefined. - * - * Supports non-commutative scan operators. 
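// A hedged sketch of the identityless ExclusiveScan overload defined in this
// (Doxygen-hidden) section: with no identity value, the output computed for
// thread0 is undefined and must be ignored by the caller.  Kernel and pointer
// names are illustrative; a 128-thread launch is assumed.
#include <cub/cub.cuh>

__global__ void ExampleIdentitylessScanKernel(const int *d_in, int *d_out)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int thread_data = d_in[threadIdx.x];

    // Exclusive max scan with no identity: thread0's output is undefined
    BlockScan(temp_storage).ExclusiveScan(thread_data, thread_data, cub::Max());

    if (threadIdx.x > 0)
        d_out[threadIdx.x] = thread_data;
}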
- * - * \blocked - * - * \smemreuse - * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ void ExclusiveScan( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template < - typename ScanOp, - typename BlockPrefixOp> - __device__ __forceinline__ void ExclusiveScan( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - InternalBlockScan(temp_storage, linear_tid).ExclusiveScan(input, output, scan_op, block_aggregate, block_prefix_op); - } - - - //@} end member group - /******************************************************************//** - * \name Exclusive prefix scan operations (identityless, multiple data per thread) - *********************************************************************/ - //@{ - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. With no identity value, the output computed for thread0 is undefined. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp> - __device__ __forceinline__ void ExclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - ScanOp scan_op) ///< [in] Binary scan operator - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, scan_op); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial, (linear_tid != 0)); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. With no identity value, the output computed for thread0 is undefined. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp> - __device__ __forceinline__ void ExclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial, (linear_tid != 0)); - } - - - /** - * \brief Computes an exclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp, - typename BlockPrefixOp> - __device__ __forceinline__ void ExclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate, block_prefix_op); - - // Exclusive scan in registers with prefix - ThreadScanExclusive(input, output, scan_op, thread_partial); - } - - - //@} end member group - -#endif // DOXYGEN_SHOULD_SKIP_THIS - - /******************************************************************//** - * \name Inclusive prefix sum operations - *********************************************************************/ - //@{ - - - /** - * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an inclusive prefix sum of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain input item for each thread - * int thread_data; - * ... - * - * // Collectively compute the block-wide inclusive prefix sum - * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is 1, 1, ..., 1. The - * corresponding output \p thread_data in those threads will be 1, 2, ..., 128. - * - */ - __device__ __forceinline__ void InclusiveSum( - T input, ///< [in] Calling thread's input item - T &output) ///< [out] Calling thread's output item (may be aliased to \p input) - { - T block_aggregate; - InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate); - } - - - /** - * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an inclusive prefix sum of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain input item for each thread - * int thread_data; - * ... 
- * - * // Collectively compute the block-wide inclusive prefix sum - * int block_aggregate; - * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is 1, 1, ..., 1. The - * corresponding output \p thread_data in those threads will be 1, 2, ..., 128. - * Furthermore the value \p 128 will be stored in \p block_aggregate for all threads. - * - */ - __device__ __forceinline__ void InclusiveSum( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate); - } - - - - /** - * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes one input element. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a single thread block that progressively - * computes an inclusive prefix sum over multiple "tiles" of input using a - * prefix functor to maintain a running total between block-wide scans. Each tile consists - * of 128 integer items that are partitioned across 128 threads. - * \par - * \code - * #include - * - * // A stateful callback functor that maintains a running prefix to be applied - * // during consecutive scan operations. - * struct BlockPrefixOp - * { - * // Running prefix - * int running_total; - * - * // Constructor - * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} - * - * // Callback operator to be entered by the first warp of threads in the block. - * // Thread-0 is responsible for returning a value for seeding the block-wide scan. - * __device__ int operator()(int block_aggregate) - * { - * int old_prefix = running_total; - * running_total += block_aggregate; - * return old_prefix; - * } - * }; - * - * __global__ void ExampleKernel(int *d_data, int num_items, ...) 
- * { - * // Specialize BlockScan for 128 threads - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Initialize running total - * BlockPrefixOp prefix_op(0); - * - * // Have the block iterate over segments of items - * for (int block_offset = 0; block_offset < num_items; block_offset += 128) - * { - * // Load a segment of consecutive items that are blocked across threads - * int thread_data = d_data[block_offset]; - * - * // Collectively compute the block-wide inclusive prefix sum - * int block_aggregate; - * BlockScan(temp_storage).InclusiveSum( - * thread_data, thread_data, block_aggregate, prefix_op); - * __syncthreads(); - * - * // Store scanned items to output segment - * d_data[block_offset] = thread_data; - * } - * \endcode - * \par - * Suppose the input \p d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... - * The corresponding output for the first segment will be 1, 2, ..., 128. - * The output for the second segment will be 129, 130, ..., 256. Furthermore, - * the value \p 128 will be stored in \p block_aggregate for all threads after each scan. - * - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template - __device__ __forceinline__ void InclusiveSum( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - InternalBlockScan(temp_storage, linear_tid).InclusiveSum(input, output, block_aggregate, block_prefix_op); - } - - - //@} end member group - /******************************************************************//** - * \name Inclusive prefix sum operations (multiple data per thread) - *********************************************************************/ - //@{ - - - /** - * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an inclusive prefix sum of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute the block-wide inclusive prefix sum - * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The - * corresponding output \p thread_data in those threads will be { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
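As a point of reference for the multi-item InclusiveSum overload documented above, here is a minimal self-contained sketch of a block-wide inclusive prefix sum with cub::BlockScan. The include path, the 128-thread / 4-items-per-thread geometry, and the kernel and variable names are illustrative assumptions, not part of the removed header.

// Sketch (not from the removed file): 128 threads each own 4 consecutive ints.
#include <cuda_runtime.h>
#include <cub/cub.cuh>
#include <cstdio>

__global__ void InclusiveSumKernel(int *d_data)
{
    // Specialize BlockScan for a 1D block of 128 threads on type int
    typedef cub::BlockScan<int, 128> BlockScan;

    // Shared memory needed by BlockScan
    __shared__ typename BlockScan::TempStorage temp_storage;

    // Load a blocked arrangement: thread t owns items [4*t, 4*t+3]
    int thread_data[4];
    for (int i = 0; i < 4; ++i)
        thread_data[i] = d_data[threadIdx.x * 4 + i];

    // Collectively compute the block-wide inclusive prefix sum
    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);

    for (int i = 0; i < 4; ++i)
        d_data[threadIdx.x * 4 + i] = thread_data[i];
}

int main()
{
    const int n = 128 * 4;
    int h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = 1;

    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    InclusiveSumKernel<<<1, 128>>>(d_data);

    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("last element = %d (expected 512)\n", h_data[n - 1]);
    cudaFree(d_data);
    return 0;
}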
- */ - template - __device__ __forceinline__ void InclusiveSum( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD]) ///< [out] Calling thread's output items (may be aliased to \p input) - { - if (ITEMS_PER_THREAD == 1) - { - InclusiveSum(input[0], output[0]); - } - else - { - // Reduce consecutive thread items in registers - Sum scan_op; - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveSum(thread_partial, thread_partial); - - // Inclusive scan in registers with prefix - ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0)); - } - } - - - /** - * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an inclusive prefix sum of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute the block-wide inclusive prefix sum - * int block_aggregate; - * BlockScan(temp_storage).InclusiveSum(thread_data, thread_data, block_aggregate); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [1,1,1,1], [1,1,1,1], ..., [1,1,1,1] }. The - * corresponding output \p thread_data in those threads will be - * { [1,2,3,4], [5,6,7,8], ..., [509,510,511,512] }. - * Furthermore the value \p 512 will be stored in \p block_aggregate for all threads. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ void InclusiveSum( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - if (ITEMS_PER_THREAD == 1) - { - InclusiveSum(input[0], output[0], block_aggregate); - } - else - { - // Reduce consecutive thread items in registers - Sum scan_op; - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveSum(thread_partial, thread_partial, block_aggregate); - - // Inclusive scan in registers with prefix - ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0)); - } - } - - - /** - * \brief Computes an inclusive block-wide prefix scan using addition (+) as the scan operator. Each thread contributes an array of consecutive input elements. Instead of using 0 as the block-wide prefix, the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. 
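The running-prefix mechanism just described can be exercised with a short self-contained kernel; a sketch follows, assuming CUB's documented BlockScan interface. The functor and kernel names, the 128-thread / 4-item tile geometry, and the requirement that num_items be a multiple of 512 are illustrative assumptions; the four-argument InclusiveSum signature matches the overload removed in this diff (later CUB releases use a slightly different prefix-callback interface).

// Sketch (not from the removed file): a stateful callback carries the running
// total between block-wide scans over consecutive tiles of 128*4 items.
#include <cub/cub.cuh>

struct RunningPrefix
{
    int running_total;

    __device__ RunningPrefix(int initial) : running_total(initial) {}

    // Invoked by the first warp; only lane0's return value seeds the block-wide scan
    __device__ int operator()(int block_aggregate)
    {
        int old_prefix = running_total;
        running_total += block_aggregate;
        return old_prefix;
    }
};

// Assumes a single block of 128 threads and num_items a multiple of 512
__global__ void TiledInclusiveSum(int *d_data, int num_items)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    RunningPrefix prefix_op(0);

    for (int tile = 0; tile < num_items; tile += 128 * 4)
    {
        // Load a blocked arrangement of 4 items per thread
        int thread_data[4];
        for (int i = 0; i < 4; ++i)
            thread_data[i] = d_data[tile + threadIdx.x * 4 + i];

        // Scan this tile, seeded by the running prefix from previous tiles
        int block_aggregate;
        BlockScan(temp_storage).InclusiveSum(
            thread_data, thread_data, block_aggregate, prefix_op);
        __syncthreads();  // temp_storage is reused in the next iteration

        for (int i = 0; i < 4; ++i)
            d_data[tile + threadIdx.x * 4 + i] = thread_data[i];
    }
}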
Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a single thread block that progressively - * computes an inclusive prefix sum over multiple "tiles" of input using a - * prefix functor to maintain a running total between block-wide scans. Each tile consists - * of 512 integer items that are partitioned in a [blocked arrangement](index.html#sec5sec4) - * across 128 threads where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * // A stateful callback functor that maintains a running prefix to be applied - * // during consecutive scan operations. - * struct BlockPrefixOp - * { - * // Running prefix - * int running_total; - * - * // Constructor - * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} - * - * // Callback operator to be entered by the first warp of threads in the block. - * // Thread-0 is responsible for returning a value for seeding the block-wide scan. - * __device__ int operator()(int block_aggregate) - * { - * int old_prefix = running_total; - * running_total += block_aggregate; - * return old_prefix; - * } - * }; - * - * __global__ void ExampleKernel(int *d_data, int num_items, ...) - * { - * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread - * typedef cub::BlockLoad BlockLoad; - * typedef cub::BlockStore BlockStore; - * typedef cub::BlockScan BlockScan; - * - * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan - * __shared__ union { - * typename BlockLoad::TempStorage load; - * typename BlockScan::TempStorage scan; - * typename BlockStore::TempStorage store; - * } temp_storage; - * - * // Initialize running total - * BlockPrefixOp prefix_op(0); - * - * // Have the block iterate over segments of items - * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) - * { - * // Load a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); - * __syncthreads(); - * - * // Collectively compute the block-wide inclusive prefix sum - * int block_aggregate; - * BlockScan(temp_storage.scan).IncluisveSum( - * thread_data, thread_data, block_aggregate, prefix_op); - * __syncthreads(); - * - * // Store scanned items to output segment - * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); - * __syncthreads(); - * } - * \endcode - * \par - * Suppose the input \p d_data is 1, 1, 1, 1, 1, 1, 1, 1, .... - * The corresponding output for the first segment will be 1, 2, 3, 4, ..., 511, 512. - * The output for the second segment will be 513, 514, 515, 516, ..., 1023, 1024. Furthermore, - * the value \p 512 will be stored in \p block_aggregate for all threads after each scan. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template < - int ITEMS_PER_THREAD, - typename BlockPrefixOp> - __device__ __forceinline__ void InclusiveSum( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - if (ITEMS_PER_THREAD == 1) - { - InclusiveSum(input[0], output[0], block_aggregate, block_prefix_op); - } - else - { - // Reduce consecutive thread items in registers - Sum scan_op; - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveSum(thread_partial, thread_partial, block_aggregate, block_prefix_op); - - // Inclusive scan in registers with prefix - ThreadScanInclusive(input, output, scan_op, thread_partial); - } - } - - - //@} end member group - /******************************************************************//** - * \name Inclusive prefix scan operations - *********************************************************************/ - //@{ - - - /** - * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an inclusive prefix max scan of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain input item for each thread - * int thread_data; - * ... - * - * // Collectively compute the block-wide inclusive prefix max scan - * BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max()); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The - * corresponding output \p thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126. - * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ void InclusiveScan( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - ScanOp scan_op) ///< [in] Binary scan operator - { - T block_aggregate; - InclusiveScan(input, output, scan_op, block_aggregate); - } - - - /** - * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an inclusive prefix max scan of 128 integer items that - * are partitioned across 128 threads. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) 
- * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain input item for each thread - * int thread_data; - * ... - * - * // Collectively compute the block-wide inclusive prefix max scan - * int block_aggregate; - * BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is 0, -1, 2, -3, ..., 126, -127. The - * corresponding output \p thread_data in those threads will be 0, 0, 2, 2, ..., 126, 126. - * Furthermore the value \p 126 will be stored in \p block_aggregate for all threads. - * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template - __device__ __forceinline__ void InclusiveScan( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - InternalBlockScan(temp_storage, linear_tid).InclusiveScan(input, output, scan_op, block_aggregate); - } - - - /** - * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes one input element. the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a single thread block that progressively - * computes an inclusive prefix max scan over multiple "tiles" of input using a - * prefix functor to maintain a running total between block-wide scans. Each tile consists - * of 128 integer items that are partitioned across 128 threads. - * \par - * \code - * #include - * - * // A stateful callback functor that maintains a running prefix to be applied - * // during consecutive scan operations. - * struct BlockPrefixOp - * { - * // Running prefix - * int running_total; - * - * // Constructor - * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} - * - * // Callback operator to be entered by the first warp of threads in the block. - * // Thread-0 is responsible for returning a value for seeding the block-wide scan. - * __device__ int operator()(int block_aggregate) - * { - * int old_prefix = running_total; - * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; - * return old_prefix; - * } - * }; - * - * __global__ void ExampleKernel(int *d_data, int num_items, ...) 
- * { - * // Specialize BlockScan for 128 threads - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Initialize running total - * BlockPrefixOp prefix_op(INT_MIN); - * - * // Have the block iterate over segments of items - * for (int block_offset = 0; block_offset < num_items; block_offset += 128) - * { - * // Load a segment of consecutive items that are blocked across threads - * int thread_data = d_data[block_offset]; - * - * // Collectively compute the block-wide inclusive prefix max scan - * int block_aggregate; - * BlockScan(temp_storage).InclusiveScan( - * thread_data, thread_data, cub::Max(), block_aggregate, prefix_op); - * __syncthreads(); - * - * // Store scanned items to output segment - * d_data[block_offset] = thread_data; - * } - * \endcode - * \par - * Suppose the input \p d_data is 0, -1, 2, -3, 4, -5, .... - * The corresponding output for the first segment will be 0, 0, 2, 2, ..., 126, 126. - * The output for the second segment will be 128, 128, 130, 130, ..., 254, 254. Furthermore, - * \p block_aggregate will be assigned \p 126 in all threads after the first scan, assigned \p 254 after the second - * scan, etc. - * - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template < - typename ScanOp, - typename BlockPrefixOp> - __device__ __forceinline__ void InclusiveScan( - T input, ///< [in] Calling thread's input item - T &output, ///< [out] Calling thread's output item (may be aliased to \p input) - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. - { - InternalBlockScan(temp_storage, linear_tid).InclusiveScan(input, output, scan_op, block_aggregate, block_prefix_op); - } - - - //@} end member group - /******************************************************************//** - * \name Inclusive prefix scan operations (multiple data per thread) - *********************************************************************/ - //@{ - - - /** - * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an inclusive prefix max scan of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... 
- * - * // Collectively compute the block-wide inclusive prefix max scan - * BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max()); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. The - * corresponding output \p thread_data in those threads will be { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp> - __device__ __forceinline__ void InclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - ScanOp scan_op) ///< [in] Binary scan operator - { - if (ITEMS_PER_THREAD == 1) - { - InclusiveScan(input[0], output[0], scan_op); - } - else - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, scan_op); - - // Inclusive scan in registers with prefix - ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0)); - } - } - - - /** - * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates an inclusive prefix max scan of 512 integer items that - * are partitioned in a [blocked arrangement](index.html#sec5sec4) across 128 threads - * where each thread owns 4 consecutive items. - * \par - * \code - * #include - * - * __global__ void ExampleKernel(...) - * { - * // Specialize BlockScan for 128 threads on type int - * typedef cub::BlockScan BlockScan; - * - * // Allocate shared memory for BlockScan - * __shared__ typename BlockScan::TempStorage temp_storage; - * - * // Obtain a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * ... - * - * // Collectively compute the block-wide inclusive prefix max scan - * int block_aggregate; - * BlockScan(temp_storage).InclusiveScan(thread_data, thread_data, cub::Max(), block_aggregate); - * - * \endcode - * \par - * Suppose the set of input \p thread_data across the block of threads is - * { [0,-1,2,-3], [4,-5,6,-7], ..., [508,-509,510,-511] }. - * The corresponding output \p thread_data in those threads will be - * { [0,0,2,2], [4,4,6,6], ..., [508,508,510,510] }. - * Furthermore the value \p 510 will be stored in \p block_aggregate for all threads. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. 
- * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp> - __device__ __forceinline__ void InclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate) ///< [out] block-wide aggregate reduction of input items - { - if (ITEMS_PER_THREAD == 1) - { - InclusiveScan(input[0], output[0], scan_op, block_aggregate); - } - else - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate); - - // Inclusive scan in registers with prefix - ThreadScanInclusive(input, output, scan_op, thread_partial, (linear_tid != 0)); - } - } - - - /** - * \brief Computes an inclusive block-wide prefix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. the call-back functor \p block_prefix_op is invoked by the first warp in the block, and the value returned by lane0 in that warp is used as the "seed" value that logically prefixes the threadblock's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs. - * - * The \p block_prefix_op functor must implement a member function T operator()(T block_aggregate). - * The functor's input parameter \p block_aggregate is the same value also returned by the scan operation. - * The functor will be invoked by the first warp of threads in the block, however only the return value from - * lane0 is applied as the block-wide prefix. Can be stateful. - * - * Supports non-commutative scan operators. - * - * \blocked - * - * \smemreuse - * - * The code snippet below illustrates a single thread block that progressively - * computes an inclusive prefix max scan over multiple "tiles" of input using a - * prefix functor to maintain a running total between block-wide scans. Each tile consists - * of 128 integer items that are partitioned across 128 threads. - * \par - * \code - * #include - * - * // A stateful callback functor that maintains a running prefix to be applied - * // during consecutive scan operations. - * struct BlockPrefixOp - * { - * // Running prefix - * int running_total; - * - * // Constructor - * __device__ BlockPrefixOp(int running_total) : running_total(running_total) {} - * - * // Callback operator to be entered by the first warp of threads in the block. - * // Thread-0 is responsible for returning a value for seeding the block-wide scan. - * __device__ int operator()(int block_aggregate) - * { - * int old_prefix = running_total; - * running_total = (block_aggregate > old_prefix) ? block_aggregate : old_prefix; - * return old_prefix; - * } - * }; - * - * __global__ void ExampleKernel(int *d_data, int num_items, ...) 
- * { - * // Specialize BlockLoad, BlockStore, and BlockScan for 128 threads, 4 ints per thread - * typedef cub::BlockLoad BlockLoad; - * typedef cub::BlockStore BlockStore; - * typedef cub::BlockScan BlockScan; - * - * // Allocate aliased shared memory for BlockLoad, BlockStore, and BlockScan - * __shared__ union { - * typename BlockLoad::TempStorage load; - * typename BlockScan::TempStorage scan; - * typename BlockStore::TempStorage store; - * } temp_storage; - * - * // Initialize running total - * BlockPrefixOp prefix_op(0); - * - * // Have the block iterate over segments of items - * for (int block_offset = 0; block_offset < num_items; block_offset += 128 * 4) - * { - * // Load a segment of consecutive items that are blocked across threads - * int thread_data[4]; - * BlockLoad(temp_storage.load).Load(d_data + block_offset, thread_data); - * __syncthreads(); - * - * // Collectively compute the block-wide inclusive prefix max scan - * int block_aggregate; - * BlockScan(temp_storage.scan).InclusiveScan( - * thread_data, thread_data, cub::Max(), block_aggregate, prefix_op); - * __syncthreads(); - * - * // Store scanned items to output segment - * BlockStore(temp_storage.store).Store(d_data + block_offset, thread_data); - * __syncthreads(); - * } - * \endcode - * \par - * Suppose the input \p d_data is 0, -1, 2, -3, 4, -5, .... - * The corresponding output for the first segment will be 0, 0, 2, 2, 4, 4, ..., 510, 510. - * The output for the second segment will be 512, 512, 514, 514, 516, 516, ..., 1022, 1022. Furthermore, - * \p block_aggregate will be assigned \p 510 in all threads after the first scan, assigned \p 1022 after the second - * scan, etc. - * - * \tparam ITEMS_PER_THREAD [inferred] The number of consecutive items partitioned onto each thread. - * \tparam ScanOp [inferred] Binary scan operator type having member T operator()(const T &a, const T &b) - * \tparam BlockPrefixOp [inferred] Call-back functor type having member T operator()(T block_aggregate) - */ - template < - int ITEMS_PER_THREAD, - typename ScanOp, - typename BlockPrefixOp> - __device__ __forceinline__ void InclusiveScan( - T (&input)[ITEMS_PER_THREAD], ///< [in] Calling thread's input items - T (&output)[ITEMS_PER_THREAD], ///< [out] Calling thread's output items (may be aliased to \p input) - ScanOp scan_op, ///< [in] Binary scan operator - T &block_aggregate, ///< [out] block-wide aggregate reduction of input items (exclusive of the \p block_prefix_op value) - BlockPrefixOp &block_prefix_op) ///< [in-out] [warp0 only] Call-back functor for specifying a block-wide prefix to be applied to all inputs. 
- { - if (ITEMS_PER_THREAD == 1) - { - InclusiveScan(input[0], output[0], scan_op, block_aggregate, block_prefix_op); - } - else - { - // Reduce consecutive thread items in registers - T thread_partial = ThreadReduce(input, scan_op); - - // Exclusive threadblock-scan - ExclusiveScan(thread_partial, thread_partial, scan_op, block_aggregate, block_prefix_op); - - // Inclusive scan in registers with prefix - ThreadScanInclusive(input, output, scan_op, thread_partial); - } - } - - //@} end member group - - -}; - -} // CUB namespace -CUB_NS_POSTFIX // Optional outer namespace(s) - diff --git a/lib/kokkos/TPL/cub/block/block_store.cuh b/lib/kokkos/TPL/cub/block/block_store.cuh deleted file mode 100755 index fb990de1c7..0000000000 --- a/lib/kokkos/TPL/cub/block/block_store.cuh +++ /dev/null @@ -1,926 +0,0 @@ -/****************************************************************************** - * Copyright (c) 2011, Duane Merrill. All rights reserved. - * Copyright (c) 2011-2013, NVIDIA CORPORATION. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions are met: - * * Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * * Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * * Neither the name of the NVIDIA CORPORATION nor the - * names of its contributors may be used to endorse or promote products - * derived from this software without specific prior written permission. - * - * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED - * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE - * DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY - * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES - * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; - * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND - * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT - * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS - * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - * - ******************************************************************************/ - -/** - * \file - * Operations for writing linear segments of data from the CUDA thread block - */ - -#pragma once - -#include - -#include "../util_namespace.cuh" -#include "../util_macro.cuh" -#include "../util_type.cuh" -#include "../util_vector.cuh" -#include "../thread/thread_store.cuh" -#include "block_exchange.cuh" - -/// Optional outer namespace(s) -CUB_NS_PREFIX - -/// CUB namespace -namespace cub { - -/** - * \addtogroup IoModule - * @{ - */ - - -/******************************************************************//** - * \name Blocked I/O - *********************************************************************/ -//@{ - -/** - * \brief Store a blocked arrangement of items across a thread block into a linear segment of items using the specified cache modifier. - * - * \blocked - * - * \tparam MODIFIER cub::PtxStoreModifier cache modifier. - * \tparam T [inferred] The data type to store. 
- * \tparam ITEMS_PER_THREAD    [inferred] The number of consecutive items partitioned onto each thread.
- * \tparam OutputIteratorRA    [inferred] The random-access iterator type for output (may be a simple pointer type).
- */
-template <
-    PtxStoreModifier    MODIFIER,
-    typename            T,
-    int                 ITEMS_PER_THREAD,
-    typename            OutputIteratorRA>
-__device__ __forceinline__ void StoreBlocked(
-    int                 linear_tid,                 ///< [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + linear_tid for 2D thread blocks)
-    OutputIteratorRA    block_itr,                  ///< [in] The thread block's base output iterator for storing to
-    T                   (&items)[ITEMS_PER_THREAD]) ///< [in] Data to store
-{
-    // Store directly in thread-blocked order
-    #pragma unroll
-    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
-    {
-        ThreadStore<MODIFIER>(block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM, items[ITEM]);
-    }
-}
-
-
-/**
- * \brief Store a blocked arrangement of items across a thread block into a linear segment of items using the specified cache modifier, guarded by range
- *
- * \blocked
- *
- * \tparam MODIFIER             cub::PtxStoreModifier cache modifier.
- * \tparam T