Update Kokkos library in LAMMPS to v2.7.24

Stan Moore
2018-11-12 15:16:26 -07:00
parent 1651a21f92
commit b3f08b38a2
320 changed files with 42934 additions and 1993 deletions

View File

@ -1,5 +1,68 @@
# Change Log
## [2.7.24](https://github.com/kokkos/kokkos/tree/2.7.24) (2018-11-04)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.7.00...2.7.24)
**Implemented enhancements:**
- DualView: Add non-templated functions for sync, need\_sync, view, modify [\#1858](https://github.com/kokkos/kokkos/issues/1858)
- DualView: Avoid needlessly allocating and initializing the modify\_host and modify\_device flag views [\#1831](https://github.com/kokkos/kokkos/issues/1831)
- DualView: Incorrect deduction of "not device type" [\#1659](https://github.com/kokkos/kokkos/issues/1659)
- BuildSystem: Add KOKKOS\_ENABLE\_CXX14 and KOKKOS\_ENABLE\_CXX17 [\#1602](https://github.com/kokkos/kokkos/issues/1602)
- BuildSystem: Installed kokkos\_generated\_settings.cmake contains build directories instead of install directories [\#1838](https://github.com/kokkos/kokkos/issues/1838)
- BuildSystem: KOKKOS\_ARCH: add ticks to printout of improper arch setting [\#1649](https://github.com/kokkos/kokkos/issues/1649)
- BuildSystem: Make core/src/Makefile for Cuda use needed nvcc\_wrapper [\#1296](https://github.com/kokkos/kokkos/issues/1296)
- Build: Support PGI as host compiler for NVCC [\#1828](https://github.com/kokkos/kokkos/issues/1828)
- Build: Many warnings fixed, e.g. [\#1786](https://github.com/kokkos/kokkos/issues/1786)
- Capability: OffsetView with non-zero begin index [\#567](https://github.com/kokkos/kokkos/issues/567)
- Capability: Reductions into device side view [\#1788](https://github.com/kokkos/kokkos/issues/1788)
- Capability: Add max\_size to Kokkos::Array [\#1760](https://github.com/kokkos/kokkos/issues/1760)
- Capability: View Assignment: LayoutStride -\> LayoutLeft and LayoutStride -\> LayoutRight [\#1594](https://github.com/kokkos/kokkos/issues/1594)
- Capability: Atomic function allow implicit conversion of update argument [\#1571](https://github.com/kokkos/kokkos/issues/1571)
- Capability: Add team\_size\_max with tagged functors [\#663](https://github.com/kokkos/kokkos/issues/663)
- Capability: Fix alignment of Views allocated from Kokkos\_ScratchSpace [\#1700](https://github.com/kokkos/kokkos/issues/1700)
- Capability: create\_mirror\_view\_and\_copy for DynRankView [\#1651](https://github.com/kokkos/kokkos/issues/1651)
- Capability: DeepCopy HBWSpace / HostSpace [\#548](https://github.com/kokkos/kokkos/issues/548)
- ROCm: support team vector scan [\#1645](https://github.com/kokkos/kokkos/issues/1645)
- ROCm: Merge from rocm-hackathon2 [\#1636](https://github.com/kokkos/kokkos/issues/1636)
- ROCm: Add ParallelScanWithTotal [\#1611](https://github.com/kokkos/kokkos/issues/1611)
- ROCm: Implement MDRange in ROCm [\#1314](https://github.com/kokkos/kokkos/issues/1314)
- ROCm: Implement Reducers for Nested Parallelism Levels [\#963](https://github.com/kokkos/kokkos/issues/963)
- ROCm: Add asynchronous deep copy [\#959](https://github.com/kokkos/kokkos/issues/959)
- Tests: Memory pool test seems to allocate 8GB [\#1830](https://github.com/kokkos/kokkos/issues/1830)
- Tests: Add unit\_test for team\_broadcast [\#734](https://github.com/kokkos/kokkos/issues/734)
**Fixed bugs:**
- BuildSystem: Makefile.kokkos gets gcc-toolchain wrong if gcc is cached [\#1841](https://github.com/kokkos/kokkos/issues/1841)
- BuildSystem: kokkos\_generated\_settings.cmake placement is inconsistent [\#1771](https://github.com/kokkos/kokkos/issues/1771)
- BuildSystem: Invalid escape sequence \. in kokkos\_functions.cmake [\#1661](https://github.com/kokkos/kokkos/issues/1661)
- BuildSystem: Problem in Kokkos generated cmake file [\#1770](https://github.com/kokkos/kokkos/issues/1770)
- BuildSystem: invalid file names on windows [\#1671](https://github.com/kokkos/kokkos/issues/1671)
- Tests: reducers min/max\_loc test fails randomly due to multiple min values and thus multiple valid locations [\#1681](https://github.com/kokkos/kokkos/issues/1681)
- Tests: cuda.scatterview unit test causes "Bus error" when force\_uvm and enable\_lambda are enabled [\#1852](https://github.com/kokkos/kokkos/issues/1852)
- Tests: cuda.cxx11 unit test fails when force\_uvm and enable\_lambda are enabled [\#1850](https://github.com/kokkos/kokkos/issues/1850)
- Tests: threads.reduce\_device\_view\_range\_policy failing with Cuda/8.0.44 and RDC [\#1836](https://github.com/kokkos/kokkos/issues/1836)
- Build: compile error when compiling Kokkos with hwloc 2.0.1 \(on OSX 10.12.6, with g++ 7.2.0\) [\#1506](https://github.com/kokkos/kokkos/issues/1506)
- Build: dual\_view.view broken with UVM [\#1834](https://github.com/kokkos/kokkos/issues/1834)
- Build: White cuda/9.2 + gcc/7.2 warnings triggering errors [\#1833](https://github.com/kokkos/kokkos/issues/1833)
- Build: warning: enum constant in boolean context [\#1813](https://github.com/kokkos/kokkos/issues/1813)
- Capability: Fix overly conservative max\_team\_size bound [\#1808](https://github.com/kokkos/kokkos/issues/1808)
- DynRankView: Ctors taking ViewAllocateWithoutInitializing broken [\#1783](https://github.com/kokkos/kokkos/issues/1783)
- Cuda: Apollo cuda.team\_broadcast test fail with clang-6.0 [\#1762](https://github.com/kokkos/kokkos/issues/1762)
- Cuda: Clang spurious test failure in impl\_view\_accessible [\#1753](https://github.com/kokkos/kokkos/issues/1753)
- Cuda: Kokkos::complex\<double\> atomic deadlocks with Clang 6 Cuda build with -O0 [\#1752](https://github.com/kokkos/kokkos/issues/1752)
- Cuda: LayoutStride Test fails for UVM as default memory space [\#1688](https://github.com/kokkos/kokkos/issues/1688)
- Cuda: Scan wrong values on Volta [\#1676](https://github.com/kokkos/kokkos/issues/1676)
- Cuda: Kokkos::deep\_copy error with CudaUVM and Kokkos::Serial spaces [\#1652](https://github.com/kokkos/kokkos/issues/1652)
- Cuda: cudaErrorInvalidConfiguration with debug build [\#1647](https://github.com/kokkos/kokkos/issues/1647)
- Cuda: parallel\_for with TeamPolicy::team\_size\_recommended with launch bounds not working -- reported by Daniel Holladay [\#1283](https://github.com/kokkos/kokkos/issues/1283)
- Cuda: Using KOKKOS\_CLASS\_LAMBDA in a class with Kokkos::Random\_XorShift64\_Pool member data [\#1696](https://github.com/kokkos/kokkos/issues/1696)
- Long Build Times on Darwin [\#1721](https://github.com/kokkos/kokkos/issues/1721)
- Capability: Typo in Kokkos\_Sort.hpp - BinOp3D - wrong comparison [\#1720](https://github.com/kokkos/kokkos/issues/1720)
- Buffer overflow in SharedAllocationRecord in Kokkos\_HostSpace.cpp [\#1673](https://github.com/kokkos/kokkos/issues/1673)
- Serial unit test failure [\#1632](https://github.com/kokkos/kokkos/issues/1632)
## [2.7.00](https://github.com/kokkos/kokkos/tree/2.7.00) (2018-05-24)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.6.00...2.7.00)
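Several of the DualView items above land as non-templated convenience calls (see \#1858 and the DualView diff near the end of this commit). A minimal usage sketch, assuming the accessor names from that issue (sync_host/sync_device, modify_device, view_device):

#include <Kokkos_DualView.hpp>

// Sketch only: flags-driven host/device mirroring with the 2.7.24 API.
void scale(Kokkos::DualView<double*> dv, double s) {
  dv.sync_device();           // copies host -> device only if the host side was modified
  auto d = dv.view_device();  // non-templated accessor added in this update
  Kokkos::parallel_for("scale", d.extent(0),
                       KOKKOS_LAMBDA(const int i) { d(i) *= s; });
  dv.modify_device();         // mark the device side as modified
  dv.sync_host();             // bring the result back to the host mirror
}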

View File

@ -34,7 +34,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)
#------------ GENERATE HEADER AND SOURCE FILES -------------------------------
execute_process(
COMMAND ${KOKKOS_SETTINGS} make -f ${KOKKOS_SRC_PATH}/cmake/Makefile.generate_cmake_settings CXX=${CMAKE_CXX_COMPILER} generate_build_settings
COMMAND ${KOKKOS_SETTINGS} make -f ${KOKKOS_SRC_PATH}/cmake/Makefile.generate_cmake_settings CXX=${CMAKE_CXX_COMPILER} PREFIX=${CMAKE_INSTALL_PREFIX} generate_build_settings
WORKING_DIRECTORY "${Kokkos_BINARY_DIR}"
OUTPUT_FILE ${Kokkos_BINARY_DIR}/core_src_make.out
RESULT_VARIABLE GEN_SETTINGS_RESULT
@ -45,6 +45,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)
endif()
include(${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake)
install(FILES ${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake DESTINATION lib/cmake/Kokkos)
install(FILES ${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake DESTINATION ${CMAKE_INSTALL_PREFIX})
string(REPLACE " " ";" KOKKOS_TPL_INCLUDE_DIRS "${KOKKOS_GMAKE_TPL_INCLUDE_DIRS}")
string(REPLACE " " ";" KOKKOS_TPL_LIBRARY_DIRS "${KOKKOS_GMAKE_TPL_LIBRARY_DIRS}")
string(REPLACE " " ";" KOKKOS_TPL_LIBRARY_NAMES "${KOKKOS_GMAKE_TPL_LIBRARY_NAMES}")

View File

@ -1,14 +1,8 @@
# Default settings common options.
#LAMMPS specific settings:
ifndef KOKKOS_PATH
KOKKOS_PATH=../../lib/kokkos
endif
CXXFLAGS=$(CCFLAGS)
# Options: Cuda,ROCm,OpenMP,Pthreads,Qthreads,Serial
KOKKOS_DEVICES ?= "OpenMP"
#KOKKOS_DEVICES ?= "Pthreads"
# Options: Cuda,ROCm,OpenMP,Pthread,Qthreads,Serial
#KOKKOS_DEVICES ?= "OpenMP"
KOKKOS_DEVICES ?= "Pthread"
# Options:
# Intel: KNC,KNL,SNB,HSW,BDW,SKX
# NVIDIA: Kepler,Kepler30,Kepler32,Kepler35,Kepler37,Maxwell,Maxwell50,Maxwell52,Maxwell53,Pascal60,Pascal61,Volta70,Volta72
@ -21,16 +15,17 @@ KOKKOS_ARCH ?= ""
KOKKOS_DEBUG ?= "no"
# Options: hwloc,librt,experimental_memkind
KOKKOS_USE_TPLS ?= ""
# Options: c++11,c++1z
# Options: c++11,c++14,c++1y,c++17,c++1z,c++2a
KOKKOS_CXX_STANDARD ?= "c++11"
# Options: aggressive_vectorization,disable_profiling,disable_deprecated_code,enable_large_mem_tests
KOKKOS_OPTIONS ?= ""
# Option for setting ETI path
KOKKOS_ETI_PATH ?= ${KOKKOS_PATH}/core/src/eti
KOKKOS_CMAKE ?= "no"
# Default settings specific options.
# Options: force_uvm,use_ldg,rdc,enable_lambda
KOKKOS_CUDA_OPTIONS ?= "enable_lambda"
KOKKOS_CUDA_OPTIONS ?= ""
# Return a 1 if a string contains a substring and 0 if not
# Note the search string should be without '"'
@ -41,7 +36,11 @@ kokkos_has_string=$(if $(findstring $2,$1),1,0)
# Check for general settings.
KOKKOS_INTERNAL_ENABLE_DEBUG := $(call kokkos_has_string,$(KOKKOS_DEBUG),yes)
KOKKOS_INTERNAL_ENABLE_CXX11 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++11)
KOKKOS_INTERNAL_ENABLE_CXX14 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++14)
KOKKOS_INTERNAL_ENABLE_CXX1Y := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1y)
KOKKOS_INTERNAL_ENABLE_CXX17 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++17)
KOKKOS_INTERNAL_ENABLE_CXX1Z := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1z)
KOKKOS_INTERNAL_ENABLE_CXX2A := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++2a)
# Check for external libraries.
KOKKOS_INTERNAL_USE_HWLOC := $(call kokkos_has_string,$(KOKKOS_USE_TPLS),hwloc)
@ -110,6 +109,18 @@ KOKKOS_INTERNAL_COMPILER_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_VE
KOKKOS_INTERNAL_COMPILER_APPLE_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),apple-darwin)
KOKKOS_INTERNAL_COMPILER_HCC := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),HCC)
# Check Host Compiler if using NVCC through nvcc_wrapper
ifeq ($(KOKKOS_INTERNAL_COMPILER_NVCC), 1)
KOKKOS_INTERNAL_COMPILER_NVCC_WRAPPER := $(strip $(shell echo $(CXX) | grep nvcc_wrapper | wc -l))
ifeq ($(KOKKOS_INTERNAL_COMPILER_NVCC_WRAPPER), 1)
KOKKOS_CXX_HOST_VERSION := $(strip $(shell $(CXX) $(CXXFLAGS) --host-version 2>&1))
KOKKOS_INTERNAL_COMPILER_PGI := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),PGI)
KOKKOS_INTERNAL_COMPILER_INTEL := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),Intel Corporation)
KOKKOS_INTERNAL_COMPILER_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),clang)
endif
endif
ifeq ($(KOKKOS_INTERNAL_COMPILER_CLANG), 2)
KOKKOS_INTERNAL_COMPILER_CLANG = 1
endif
@ -202,18 +213,34 @@ endif
# Set C++11 flags.
ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
KOKKOS_INTERNAL_CXX11_FLAG := --c++11
KOKKOS_INTERNAL_CXX14_FLAG := --c++14
#KOKKOS_INTERNAL_CXX17_FLAG := --c++17
else
ifeq ($(KOKKOS_INTERNAL_COMPILER_XL), 1)
KOKKOS_INTERNAL_CXX11_FLAG := -std=c++11
#KOKKOS_INTERNAL_CXX14_FLAG := -std=c++14
KOKKOS_INTERNAL_CXX1Y_FLAG := -std=c++1y
#KOKKOS_INTERNAL_CXX17_FLAG := -std=c++17
#KOKKOS_INTERNAL_CXX1Z_FLAG := -std=c++1Z
#KOKKOS_INTERNAL_CXX2A_FLAG := -std=c++2a
else
ifeq ($(KOKKOS_INTERNAL_COMPILER_CRAY), 1)
KOKKOS_INTERNAL_CXX11_FLAG := -hstd=c++11
KOKKOS_INTERNAL_CXX14_FLAG := -hstd=c++14
#KOKKOS_INTERNAL_CXX1Y_FLAG := -hstd=c++1y
#KOKKOS_INTERNAL_CXX17_FLAG := -hstd=c++17
#KOKKOS_INTERNAL_CXX1Z_FLAG := -hstd=c++1z
#KOKKOS_INTERNAL_CXX2A_FLAG := -hstd=c++2a
else
ifeq ($(KOKKOS_INTERNAL_COMPILER_HCC), 1)
KOKKOS_INTERNAL_CXX11_FLAG :=
else
KOKKOS_INTERNAL_CXX11_FLAG := --std=c++11
KOKKOS_INTERNAL_CXX14_FLAG := --std=c++14
KOKKOS_INTERNAL_CXX1Y_FLAG := --std=c++1y
KOKKOS_INTERNAL_CXX17_FLAG := --std=c++17
KOKKOS_INTERNAL_CXX1Z_FLAG := --std=c++1z
KOKKOS_INTERNAL_CXX2A_FLAG := --std=c++2a
endif
endif
endif
@ -336,7 +363,9 @@ endif
#CPPFLAGS is now unused
KOKKOS_CPPFLAGS =
ifneq ($(KOKKOS_CMAKE), yes)
KOKKOS_CXXFLAGS = -I./ -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src -I$(KOKKOS_ETI_PATH)
endif
KOKKOS_TPL_INCLUDE_DIRS =
KOKKOS_TPL_LIBRARY_DIRS =
KOKKOS_TPL_LIBRARY_NAMES =
@ -347,9 +376,11 @@ endif
KOKKOS_LIBS = -ldl
KOKKOS_TPL_LIBRARY_NAMES += dl
ifneq ($(KOKKOS_CMAKE), yes)
KOKKOS_LDFLAGS = -L$(shell pwd)
# CXXLDFLAGS is used together with CXXFLAGS in a combined compile/link command
KOKKOS_CXXLDFLAGS = -L$(shell pwd)
endif
KOKKOS_LINK_FLAGS =
KOKKOS_SRC =
KOKKOS_HEADERS =
@ -377,10 +408,12 @@ tmp := $(call kokkos_append_header,"/* Execution Spaces */")
ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CUDA")
tmp := $(call kokkos_append_header,"\#define KOKKOS_COMPILER_CUDA_VERSION $(KOKKOS_INTERNAL_COMPILER_NVCC_VERSION)")
endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
tmp := $(call kokkos_append_header,'\#define KOKKOS_ENABLE_ROCM')
tmp := $(call kokkos_append_header,'\#define KOKKOS_IMPL_ROCM_CLANG_WORKAROUND 1')
endif
ifeq ($(KOKKOS_INTERNAL_USE_OPENMPTARGET), 1)
@ -438,11 +471,25 @@ ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX11), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX11_FLAG)
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX11")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX14), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX14_FLAG)
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX14")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX1Y), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX1Y_FLAG)
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX14")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX17), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX17_FLAG)
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX17")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX1Z), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX1Z_FLAG)
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX11")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX1Z")
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX17")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX2A), 1)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX2A_FLAG)
tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX20")
endif
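Downstream C++ can branch on the macros emitted above; a minimal sketch (KokkosCore_config.h is pulled in through Kokkos_Macros.hpp):

#include <Kokkos_Macros.hpp>

#if defined(KOKKOS_ENABLE_CXX17) || defined(KOKKOS_ENABLE_CXX20)
// C++17-or-later path (e.g. if constexpr, structured bindings)
#else
// C++11/14 fallback
#endif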
ifeq ($(KOKKOS_INTERNAL_ENABLE_DEBUG), 1)
@ -465,7 +512,9 @@ endif
ifeq ($(KOKKOS_INTERNAL_USE_HWLOC), 1)
ifneq ($(HWLOC_PATH),)
ifneq ($(KOKKOS_CMAKE), yes)
KOKKOS_CXXFLAGS += -I$(HWLOC_PATH)/include
endif
KOKKOS_LDFLAGS += -L$(HWLOC_PATH)/lib
KOKKOS_CXXLDFLAGS += -L$(HWLOC_PATH)/lib
KOKKOS_TPL_INCLUDE_DIRS += $(HWLOC_PATH)/include
@ -484,7 +533,9 @@ endif
ifeq ($(KOKKOS_INTERNAL_USE_MEMKIND), 1)
ifneq ($(MEMKIND_PATH),)
ifneq ($(KOKKOS_CMAKE), yes)
KOKKOS_CXXFLAGS += -I$(MEMKIND_PATH)/include
endif
KOKKOS_LDFLAGS += -L$(MEMKIND_PATH)/lib
KOKKOS_CXXLDFLAGS += -L$(MEMKIND_PATH)/lib
KOKKOS_TPL_INCLUDE_DIRS += $(MEMKIND_PATH)/include
@ -977,7 +1028,9 @@ ifeq ($(KOKKOS_INTERNAL_ENABLE_ETI), 1)
endif
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp)
ifneq ($(CUDA_PATH),)
ifneq ($(KOKKOS_CMAKE), yes)
KOKKOS_CXXFLAGS += -I$(CUDA_PATH)/include
endif
KOKKOS_LDFLAGS += -L$(CUDA_PATH)/lib64
KOKKOS_CXXLDFLAGS += -L$(CUDA_PATH)/lib64
KOKKOS_TPL_INCLUDE_DIRS += $(CUDA_PATH)/include
@ -1032,7 +1085,9 @@ ifeq ($(KOKKOS_INTERNAL_USE_QTHREADS), 1)
KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/Qthreads/*.cpp)
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Qthreads/*.hpp)
ifneq ($(QTHREADS_PATH),)
ifneq ($(KOKKOS_CMAKE), yes)
KOKKOS_CXXFLAGS += -I$(QTHREADS_PATH)/include
endif
KOKKOS_LDFLAGS += -L$(QTHREADS_PATH)/lib
KOKKOS_CXXLDFLAGS += -L$(QTHREADS_PATH)/lib
KOKKOS_TPL_INCLUDE_DIRS += $(QTHREADS_PATH)/include

View File

@ -52,44 +52,47 @@ For specifics see the LICENSE file contained in the repository or distribution.
* GCC 4.8.4
* GCC 4.9.3
* GCC 5.1.0
* GCC 5.3.0
* GCC 5.5.0
* GCC 6.1.0
* GCC 7.2.0
* GCC 7.3.0
* GCC 8.1.0
* Intel 15.0.2
* Intel 16.0.1
* Intel 17.1.043
* Intel 17.0.1
* Intel 17.4.196
* Intel 18.0.128
* Intel 18.2.128
* Clang 3.6.1
* Clang 3.7.1
* Clang 3.8.1
* Clang 3.9.0
* Clang 4.0.0
* Clang 4.0.0 for CUDA (CUDA Toolkit 8.0.44)
* Clang 6.0.0 for CUDA (CUDA Toolkit 9.1)
* PGI 17.10
* NVCC 7.0 for CUDA (with gcc 4.8.4)
* Clang 6.0.0 for CUDA (CUDA Toolkit 9.0)
* Clang 7.0.0 for CUDA (CUDA Toolkit 9.1)
* PGI 18.7
* NVCC 7.5 for CUDA (with gcc 4.8.4)
* NVCC 8.0.44 for CUDA (with gcc 5.3.0)
* NVCC 9.1 for CUDA (with gcc 6.1.0)
### Primary tested compilers on Power 8 are:
* GCC 5.4.0 (OpenMP,Serial)
* IBM XL 13.1.6 (OpenMP, Serial)
* NVCC 8.0.44 for CUDA (with gcc 5.4.0)
* NVCC 9.0.103 for CUDA (with gcc 6.3.0 and XL 13.1.6)
* GCC 6.4.0 (OpenMP,Serial)
* GCC 7.2.0 (OpenMP,Serial)
* IBM XL 16.1.0 (OpenMP, Serial)
* NVCC 9.2.88 for CUDA (with gcc 7.2.0 and XL 16.1.0)
### Primary tested compilers on Intel KNL are:
* GCC 6.2.0
* Intel 16.4.258 (with gcc 4.7.2)
* Intel 17.2.174 (with gcc 4.9.3)
* Intel 18.0.128 (with gcc 4.9.3)
* Intel 18.2.199 (with gcc 4.9.3)
### Primary tested compilers on ARM
* GCC 6.1.0
### Primary tested compilers on ARM (Cavium ThunderX2)
* GCC 7.2.0
* ARM/Clang 18.4.0
### Other compilers working:
* X86:
- Cygwin 2.1.0 64-bit with gcc 4.9.3
- GCC 8.1.0 (not warning free)
### Known non-working combinations:
* Power8:

View File

@ -697,6 +697,7 @@ namespace Kokkos {
typedef Random_XorShift64<DeviceType> generator_type;
typedef DeviceType device_type;
KOKKOS_INLINE_FUNCTION
Random_XorShift64_Pool() {
num_states_ = 0;
}
@ -709,12 +710,14 @@ namespace Kokkos {
#endif
}
KOKKOS_INLINE_FUNCTION
Random_XorShift64_Pool(const Random_XorShift64_Pool& src):
locks_(src.locks_),
state_(src.state_),
num_states_(src.num_states_)
{}
KOKKOS_INLINE_FUNCTION
Random_XorShift64_Pool operator = (const Random_XorShift64_Pool& src) {
locks_ = src.locks_;
state_ = src.state_;
@ -958,6 +961,7 @@ namespace Kokkos {
typedef DeviceType device_type;
KOKKOS_INLINE_FUNCTION
Random_XorShift1024_Pool() {
num_states_ = 0;
}
@ -972,6 +976,7 @@ namespace Kokkos {
#endif
}
KOKKOS_INLINE_FUNCTION
Random_XorShift1024_Pool(const Random_XorShift1024_Pool& src):
locks_(src.locks_),
state_(src.state_),
@ -979,6 +984,7 @@ namespace Kokkos {
num_states_(src.num_states_)
{}
KOKKOS_INLINE_FUNCTION
Random_XorShift1024_Pool operator = (const Random_XorShift1024_Pool& src) {
locks_ = src.locks_;
state_ = src.state_;
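The KOKKOS_INLINE_FUNCTION markers added above make the pool's copy constructor and assignment callable in device code, which is what #1696 (KOKKOS_CLASS_LAMBDA capturing a Random_XorShift64_Pool member) requires. A minimal usage sketch:

#include <Kokkos_Core.hpp>
#include <Kokkos_Random.hpp>

// Sketch only: the pool is captured by value into the lambda, exercising the
// now device-callable copy constructor.
double sum_uniform(const int n) {
  Kokkos::Random_XorShift64_Pool<> pool(12345 /* seed */);
  double sum = 0.0;
  Kokkos::parallel_reduce("draw", n,
      KOKKOS_LAMBDA(const int /*i*/, double& lsum) {
        auto gen = pool.get_state();  // check out a per-thread generator
        lsum += gen.drand();          // uniform double in [0,1)
        pool.free_state(gen);         // return it to the pool
      }, sum);
  return sum;
}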

View File

@ -246,8 +246,8 @@ public:
{
bin_count_atomic = Kokkos::View<int*, Space >("Kokkos::SortImpl::BinSortFunctor::bin_count",bin_op.max_bins());
bin_count_const = bin_count_atomic;
bin_offsets = offset_type("Kokkos::SortImpl::BinSortFunctor::bin_offsets",bin_op.max_bins());
sort_order = offset_type("PermutationVector",range_end-range_begin);
bin_offsets = offset_type(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::bin_offsets"),bin_op.max_bins());
sort_order = offset_type(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sort_order"),range_end-range_begin);
}
BinSort( const_key_view_type keys_
@ -290,7 +290,7 @@ public:
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
scratch_view_type
sorted_values("Scratch",
sorted_values(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sorted_values"),
len,
values.extent(1),
values.extent(2),
@ -301,7 +301,7 @@ public:
values.extent(7));
#else
scratch_view_type
sorted_values("Scratch",
sorted_values(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sorted_values"),
values.rank_dynamic > 0 ? len : KOKKOS_IMPL_CTOR_DEFAULT_ARG,
values.rank_dynamic > 1 ? values.extent(1) : KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
values.rank_dynamic > 2 ? values.extent(2) : KOKKOS_IMPL_CTOR_DEFAULT_ARG,
@ -483,7 +483,7 @@ struct BinOp3D {
if (keys(i1,0)>keys(i2,0)) return true;
else if (keys(i1,0)==keys(i2,0)) {
if (keys(i1,1)>keys(i2,1)) return true;
else if (keys(i1,1)==keys(i2,2)) {
else if (keys(i1,1)==keys(i2,1)) {
if (keys(i1,2)>keys(i2,2)) return true;
}
}
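The one-character fix above (#1720) restores a strict lexicographic comparison: the second key components are now compared with each other instead of component 1 against component 2. Stripped of the View machinery, the intended predicate is (a free-standing sketch, not the library code):

// Hypothetical standalone form of the corrected BinOp3D "greater" test.
bool keys_greater(const double a[3], const double b[3]) {
  if (a[0] > b[0]) return true;
  if (a[0] == b[0]) {
    if (a[1] > b[1]) return true;
    if (a[1] == b[1])      // the 2.7.24 fix: compare y with y, not y with z
      return a[2] > b[2];
  }
  return false;
}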

View File

@ -0,0 +1,41 @@
#Set your Kokkos path to something appropriate
KOKKOS_PATH = ${HOME}/git/kokkos-github-repo
KOKKOS_DEVICES = "Cuda"
KOKKOS_ARCH = "Pascal60"
KOKKOS_CUDA_OPTIONS = enable_lambda
#KOKKOS_DEVICES = "OpenMP"
#KOKKOS_ARCH = "Power8"
SRC = gups-kokkos.cc
default: build
echo "Start Build"
CXXFLAGS = -O3
CXX = ${HOME}/git/kokkos-github-repo/bin/nvcc_wrapper
#CXX = g++
LINK = ${CXX}
LINKFLAGS =
EXE = gups-kokkos
DEPFLAGS = -M
OBJ = $(SRC:.cc=.o)
LIB =
include $(KOKKOS_PATH)/Makefile.kokkos
build: $(EXE)
$(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)
clean: kokkos-clean
rm -f *.o $(EXE)
# Compilation rules
%.o:%.cc $(KOKKOS_CPP_DEPENDS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<

View File

@ -0,0 +1,199 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// ************************************************************************
//@HEADER
*/
#include "Kokkos_Core.hpp"
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <sys/time.h>
#define HLINE "-------------------------------------------------------------\n"
#if defined(KOKKOS_ENABLE_CUDA)
typedef Kokkos::View<int64_t*, Kokkos::CudaSpace>::HostMirror GUPSHostArray;
typedef Kokkos::View<int64_t*, Kokkos::CudaSpace> GUPSDeviceArray;
#else
typedef Kokkos::View<int64_t*, Kokkos::HostSpace>::HostMirror GUPSHostArray;
typedef Kokkos::View<int64_t*, Kokkos::HostSpace> GUPSDeviceArray;
#endif
typedef int GUPSIndex;
double now() {
struct timeval now;
gettimeofday(&now, NULL);
return (double) now.tv_sec + ((double) now.tv_usec * 1.0e-6);
}
void randomize_indices(GUPSHostArray& indices, GUPSDeviceArray& dev_indices, const int64_t dataCount) {
for( GUPSIndex i = 0; i < indices.extent(0); ++i ) {
indices[i] = lrand48() % dataCount;
}
Kokkos::deep_copy(dev_indices, indices);
}
void run_gups(GUPSDeviceArray& indices, GUPSDeviceArray& data, const int64_t datum,
const bool performAtomics) {
if( performAtomics ) {
Kokkos::parallel_for("bench-gups-atomic", indices.extent(0), KOKKOS_LAMBDA(const GUPSIndex i) {
Kokkos::atomic_fetch_xor( &data[indices[i]], datum );
});
} else {
Kokkos::parallel_for("bench-gups-non-atomic", indices.extent(0), KOKKOS_LAMBDA(const GUPSIndex i) {
data[indices[i]] ^= datum;
});
}
Kokkos::fence();
}
int run_benchmark(const GUPSIndex indicesCount, const GUPSIndex dataCount, const int repeats,
const bool useAtomics) {
printf("Reports fastest timing per kernel\n");
printf("Creating Views...\n");
printf("Memory Sizes:\n");
printf("- Elements: %15" PRIu64 " (%12.4f MB)\n", static_cast<uint64_t>(dataCount),
1.0e-6 * ((double) dataCount * (double) sizeof(int64_t)));
printf("- Indices: %15" PRIu64 " (%12.4f MB)\n", static_cast<uint64_t>(indicesCount),
1.0e-6 * ((double) indicesCount * (double) sizeof(int64_t)));
printf(" - Atomics: %15s\n", (useAtomics ? "Yes" : "No") );
printf("Benchmark kernels will be performed for %d iterations.\n", repeats);
printf(HLINE);
GUPSDeviceArray dev_indices("indices", indicesCount);
GUPSDeviceArray dev_data("data", dataCount);
int64_t datum = -1;
GUPSHostArray indices = Kokkos::create_mirror_view(dev_indices);
GUPSHostArray data = Kokkos::create_mirror_view(dev_data);
double gupsTime = 0.0;
printf("Initializing Views...\n");
#if defined(KOKKOS_HAVE_OPENMP)
Kokkos::parallel_for("init-data", Kokkos::RangePolicy<Kokkos::OpenMP>(0, dataCount),
#else
Kokkos::parallel_for("init-data", Kokkos::RangePolicy<Kokkos::Serial>(0, dataCount),
#endif
KOKKOS_LAMBDA(const int i) {
data[i] = 10101010101;
});
#if defined(KOKKOS_HAVE_OPENMP)
Kokkos::parallel_for("init-indices", Kokkos::RangePolicy<Kokkos::OpenMP>(0, indicesCount),
#else
Kokkos::parallel_for("init-indices", Kokkos::RangePolicy<Kokkos::Serial>(0, indicesCount),
#endif
KOKKOS_LAMBDA(const int i) {
indices[i] = 0;
});
Kokkos::deep_copy(dev_data, data);
Kokkos::deep_copy(dev_indices, indices);
double start;
printf("Starting benchmarking...\n");
for( GUPSIndex k = 0; k < repeats; ++k ) {
randomize_indices(indices, dev_indices, data.extent(0));
start = now();
run_gups(dev_indices, dev_data, datum, useAtomics);
gupsTime += now() - start;
}
Kokkos::deep_copy(indices, dev_indices);
Kokkos::deep_copy(data, dev_data);
printf(HLINE);
printf("GUP/s Random: %18.6f\n",
(1.0e-9 * ((double) repeats) * (double) dev_indices.extent(0)) / gupsTime);
printf(HLINE);
return 0;
}
int main(int argc, char* argv[]) {
printf(HLINE);
printf("Kokkos GUPS Benchmark\n");
printf(HLINE);
srand48(1010101);
Kokkos::initialize(argc, argv);
int64_t indices = 8192;
int64_t data = 33554432;
int64_t repeats = 10;
bool useAtomics = false;
for( int i = 1; i < argc; ++i ) {
if( strcmp( argv[i], "--indices" ) == 0 ) {
indices = std::atoll(argv[i+1]);
++i;
} else if( strcmp( argv[i], "--data" ) == 0 ) {
data = std::atoll(argv[i+1]);
++i;
} else if( strcmp( argv[i], "--repeats" ) == 0 ) {
repeats = std::atoll(argv[i+1]);
++i;
} else if( strcmp( argv[i], "--atomics" ) == 0 ) {
useAtomics = true;
}
}
const int rc = run_benchmark(indices, data, repeats, useAtomics);
Kokkos::finalize();
return rc;
}
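With the defaults above (8192 indices, 10 repeats), each timed kernel issues 8192 random 64-bit XOR updates, so the reported rate works out to GUP/s = 1.0e-9 × 10 × 8192 / t_total ≈ 8.2e-5 / t_total; raising --indices and --repeats keeps the measurement from being dominated by launch latency.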

View File

@ -0,0 +1,41 @@
#Set your Kokkos path to something appropriate
KOKKOS_PATH = ${HOME}/git/kokkos-github-repo
#KOKKOS_DEVICES = "Cuda"
#KOKKOS_ARCH = "Pascal60"
#KOKKOS_CUDA_OPTIONS = enable_lambda
KOKKOS_DEVICES = "OpenMP"
KOKKOS_ARCH = "Power8"
SRC = stream-kokkos.cc
default: build
echo "Start Build"
CXXFLAGS = -O3
#CXX = ${HOME}/git/kokkos-github-repo/bin/nvcc_wrapper
CXX = g++
LINK = ${CXX}
LINKFLAGS =
EXE = stream-kokkos
DEPFLAGS = -M
OBJ = $(SRC:.cc=.o)
LIB =
include $(KOKKOS_PATH)/Makefile.kokkos
build: $(EXE)
$(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)
clean: kokkos-clean
rm -f *.o $(EXE)
# Compilation rules
%.o:%.cc $(KOKKOS_CPP_DEPENDS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<

View File

@ -0,0 +1,265 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// ************************************************************************
//@HEADER
*/
#include "Kokkos_Core.hpp"
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <sys/time.h>
#define STREAM_ARRAY_SIZE 100000000
#define STREAM_NTIMES 20
#define HLINE "-------------------------------------------------------------\n"
#if defined(KOKKOS_ENABLE_CUDA)
typedef Kokkos::View<double*, Kokkos::CudaSpace>::HostMirror StreamHostArray;
typedef Kokkos::View<double*, Kokkos::CudaSpace> StreamDeviceArray;
#else
typedef Kokkos::View<double*, Kokkos::HostSpace>::HostMirror StreamHostArray;
typedef Kokkos::View<double*, Kokkos::HostSpace> StreamDeviceArray;
#endif
typedef int StreamIndex;
double now() {
struct timeval now;
gettimeofday(&now, NULL);
return (double) now.tv_sec + ((double) now.tv_usec * 1.0e-6);
}
void perform_copy(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c) {
Kokkos::parallel_for("copy", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
c[i] = a[i];
});
Kokkos::fence();
}
void perform_scale(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c,
const double scalar) {
Kokkos::parallel_for("copy", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
b[i] = scalar * c[i];
});
Kokkos::fence();
}
void perform_add(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c) {
Kokkos::parallel_for("add", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
c[i] = a[i] + b[i];
});
Kokkos::fence();
}
void perform_triad(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c,
const double scalar) {
Kokkos::parallel_for("triad", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
a[i] = b[i] + scalar * c[i];
});
Kokkos::fence();
}
int perform_validation(StreamHostArray& a, StreamHostArray& b, StreamHostArray& c,
const StreamIndex arraySize, const double scalar) {
double ai = 1.0;
double bi = 2.0;
double ci = 0.0;
for( StreamIndex i = 0; i < arraySize; ++i ) {
ci = ai;
bi = scalar * ci;
ci = ai + bi;
ai = bi + scalar * ci;
};
double aError = 0.0;
double bError = 0.0;
double cError = 0.0;
// Accumulate (rather than overwrite) so the per-element averages below are meaningful.
for( StreamIndex i = 0; i < arraySize; ++i ) {
aError += std::abs( a[i] - ai );
bError += std::abs( b[i] - bi );
cError += std::abs( c[i] - ci );
}
double aAvgError = aError / (double) arraySize;
double bAvgError = bError / (double) arraySize;
double cAvgError = cError / (double) arraySize;
const double epsilon = 1.0e-13;
int errorCount = 0;
if( std::abs( aAvgError / ai ) > epsilon ) {
fprintf(stderr, "Error: validation check on View a failed.\n");
errorCount++;
}
if( std::abs( bAvgError / bi ) > epsilon ) {
fprintf(stderr, "Error: validation check on View b failed.\n");
errorCount++;
}
if( std::abs( cAvgError / ci ) > epsilon ) {
fprintf(stderr, "Error: validation check on View c failed.\n");
errorCount++;
}
if( errorCount == 0 ) {
printf("All solutions checked and verified.\n");
}
return errorCount;
}
int run_benchmark() {
printf("Reports fastest timing per kernel\n");
printf("Creating Views...\n");
printf("Memory Sizes:\n");
printf("- Array Size: %" PRIu64 "\n", static_cast<uint64_t>(STREAM_ARRAY_SIZE));
printf("- Per Array: %12.2f MB\n", 1.0e-6 * (double) STREAM_ARRAY_SIZE * (double) sizeof(double));
printf("- Total: %12.2f MB\n", 3.0e-6 * (double) STREAM_ARRAY_SIZE * (double) sizeof(double));
printf("Benchmark kernels will be performed for %d iterations.\n", STREAM_NTIMES);
printf(HLINE);
StreamDeviceArray dev_a("a", STREAM_ARRAY_SIZE);
StreamDeviceArray dev_b("b", STREAM_ARRAY_SIZE);
StreamDeviceArray dev_c("c", STREAM_ARRAY_SIZE);
StreamHostArray a = Kokkos::create_mirror_view(dev_a);
StreamHostArray b = Kokkos::create_mirror_view(dev_b);
StreamHostArray c = Kokkos::create_mirror_view(dev_c);
const double scalar = 3.0;
double copyTime = std::numeric_limits<double>::max();
double scaleTime = std::numeric_limits<double>::max();
double addTime = std::numeric_limits<double>::max();
double triadTime = std::numeric_limits<double>::max();
printf("Initializing Views...\n");
#if defined(KOKKOS_HAVE_OPENMP)
Kokkos::parallel_for("init", Kokkos::RangePolicy<Kokkos::OpenMP>(0, STREAM_ARRAY_SIZE),
#else
Kokkos::parallel_for("init", Kokkos::RangePolicy<Kokkos::Serial>(0, STREAM_ARRAY_SIZE),
#endif
KOKKOS_LAMBDA(const int i) {
a[i] = 1.0;
b[i] = 2.0;
c[i] = 0.0;
});
// Copy the contents of a (on the host) to dev_a (on the device)
Kokkos::deep_copy(dev_a, a);
Kokkos::deep_copy(dev_b, b);
Kokkos::deep_copy(dev_c, c);
double start;
printf("Starting benchmarking...\n");
for( StreamIndex k = 0; k < STREAM_NTIMES; ++k ) {
start = now();
perform_copy(dev_a, dev_b, dev_c);
copyTime = std::min( copyTime, (now() - start) );
start = now();
perform_scale(dev_a, dev_b, dev_c, scalar);
scaleTime = std::min( scaleTime, (now() - start) );
start = now();
perform_add(dev_a, dev_b, dev_c);
addTime = std::min( addTime, (now() - start) );
start = now();
perform_triad(dev_a, dev_b, dev_c, scalar);
triadTime = std::min( triadTime, (now() - start) );
}
Kokkos::deep_copy(a, dev_a);
Kokkos::deep_copy(b, dev_b);
Kokkos::deep_copy(c, dev_c);
printf("Performing validation...\n");
int rc = perform_validation(a, b, c, STREAM_ARRAY_SIZE, scalar);
printf(HLINE);
printf("Copy %11.2f MB/s\n",
( 1.0e-06 * 2.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / copyTime );
printf("Scale %11.2f MB/s\n",
( 1.0e-06 * 2.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / scaleTime );
printf("Add %11.2f MB/s\n",
( 1.0e-06 * 3.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / addTime );
printf("Triad %11.2f MB/s\n",
( 1.0e-06 * 3.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / triadTime );
printf(HLINE);
return rc;
}
int main(int argc, char* argv[]) {
printf(HLINE);
printf("Kokkos STREAM Benchmark\n");
printf(HLINE);
Kokkos::initialize(argc, argv);
const int rc = run_benchmark();
Kokkos::finalize();
return rc;
}
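With STREAM_ARRAY_SIZE = 1e8, each array occupies 8 B × 1e8 = 800 MB; Copy and Scale move two arrays per iteration (1.6 GB) and Add and Triad move three (2.4 GB), which is exactly the 2.0/3.0 factor in the bandwidth printfs above.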

View File

@ -125,18 +125,20 @@ function show_help {
echo " --openmp-ratio=N/D Ratio of the cpuset to use for OpenMP"
echo " Default: 1"
echo " --openmp-places=<Op> Op=threads|cores|sockets. Default: threads"
echo " --no-openmp-proc-bind Set OMP_PROC_BIND to false and unset OMP_PLACES"
echo " --force-openmp-num-threads=N"
echo " --openmp-num-threads=N"
echo " Override logic for selecting OMP_NUM_THREADS"
echo " --force-openmp-proc-bind=<OP>"
echo " --openmp-proc-bind=<OP>"
echo " Override logic for selecting OMP_PROC_BIND"
echo " --no-openmp-nested Set OMP_NESTED to false"
echo " --openmp-nested Set OMP_NESTED to true"
echo " --no-openmp-proc-bind Set OMP_PROC_BIND to false and unset OMP_PLACES"
echo " --output-prefix=<P> Save the output to files of the form"
echo " P.hpcbind.N, P.stdout.N and P.stderr.N where P is "
echo " the prefix and N is the rank (no spaces)"
echo " --output-mode=<Op> How console output should be handled."
echo " Options are all, rank0, and none. Default: rank0"
echo " --lstopo Show bindings in lstopo"
echo " --save-topology=<Xml> Save the topology to the given xml file"
echo " --load-topology=<Xml> Load a previously saved topology from an xml file"
echo " -v|--verbose Print bindings and relevant environment variables"
echo " -h|--help Show this message"
echo ""
@ -189,7 +191,7 @@ HPCBIND_OPENMP_PLACES=${OMP_PLACES:-threads}
declare -i HPCBIND_OPENMP_PROC_BIND=1
HPCBIND_OPENMP_FORCE_NUM_THREADS=""
HPCBIND_OPENMP_FORCE_PROC_BIND=""
declare -i HPCBIND_OPENMP_NESTED=1
declare -i HPCBIND_OPENMP_NESTED=0
declare -i HPCBIND_VERBOSE=0
declare -i HPCBIND_LSTOPO=0
@ -197,6 +199,9 @@ declare -i HPCBIND_LSTOPO=0
HPCBIND_OUTPUT_PREFIX=""
HPCBIND_OUTPUT_MODE="rank0"
HPCBIND_OUTPUT_TOPOLOGY=""
HPCBIND_INPUT_TOPOLOGY=""
declare -i HPCBIND_HAS_COMMAND=0
for i in "$@"; do
@ -276,10 +281,22 @@ for i in "$@"; do
HPCBIND_OPENMP_NESTED=0
shift
;;
--openmp-nested)
HPCBIND_OPENMP_NESTED=1
shift
;;
--output-prefix=*)
HPCBIND_OUTPUT_PREFIX="${i#*=}"
shift
;;
--save-topology=*)
HPCBIND_OUTPUT_TOPOLOGY="${i#*=}"
shift
;;
--load-topology=*)
HPCBIND_INPUT_TOPOLOGY="${i#*=}"
shift
;;
--output-mode=*)
HPCBIND_OUTPUT_MODE="${i#*=}"
#convert to lower case
@ -327,24 +344,37 @@ elif [[ ${HPCBIND_QUEUE_RANK} -eq 0 ]]; then
HPCBIND_TEE=1
fi
# Save the topology to the given xml file
if [[ "${HPCBIND_OUTPUT_TOPOLOGY}" != "" ]]; then
if [[ ${HPCBIND_QUEUE_RANK} -eq 0 ]]; then
lstopo-no-graphics "${HPCBIND_OUTPUT_TOPOLOGY}"
else
lstopo-no-graphics >/dev/null 2>&1
fi
fi
# Load the topology from the given xml file
if [[ "${HPCBIND_INPUT_TOPOLOGY}" != "" ]]; then
if [ -f ${HPCBIND_INPUT_TOPOLOGY} ]; then
export HWLOC_XMLFILE="${HPCBIND_INPUT_TOPOLOGY}"
export HWLOC_THISSYSTEM=1
fi
fi
if [[ "${HPCBIND_OUTPUT_PREFIX}" == "" ]]; then
HPCBIND_LOG=/dev/null
HPCBIND_ERR=/dev/null
HPCBIND_OUT=/dev/null
else
if [[ ${HPCBIND_QUEUE_SIZE} -gt 0 ]]; then
if [[ ${HPCBIND_QUEUE_SIZE} -le 0 ]]; then
HPCBIND_QUEUE_SIZE=1
fi
HPCBIND_STR_QUEUE_SIZE="${HPCBIND_QUEUE_SIZE}"
HPCBIND_STR_QUEUE_RANK=$(printf %0*d ${#HPCBIND_STR_QUEUE_SIZE} ${HPCBIND_QUEUE_RANK})
HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_STR_QUEUE_RANK}"
HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_STR_QUEUE_RANK}"
HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_STR_QUEUE_RANK}"
else
HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_QUEUE_RANK}"
HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_QUEUE_RANK}"
HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_QUEUE_RANK}"
fi
> ${HPCBIND_LOG}
fi
@ -546,6 +576,8 @@ if [[ ${HPCBIND_TEE} -eq 0 || ${HPCBIND_VERBOSE} -eq 0 ]]; then
hostname -s >> ${HPCBIND_LOG}
echo "[HPCBIND]" >> ${HPCBIND_LOG}
echo "${TMP_ENV}" | grep -E "^HPCBIND_" >> ${HPCBIND_LOG}
echo "[HWLOC]" >> ${HPCBIND_LOG}
echo "${TMP_ENV}" | grep -E "^HWLOC_" >> ${HPCBIND_LOG}
echo "[CUDA]" >> ${HPCBIND_LOG}
echo "${TMP_ENV}" | grep -E "^CUDA_" >> ${HPCBIND_LOG}
echo "[OPENMP]" >> ${HPCBIND_LOG}
@ -568,6 +600,8 @@ else
hostname -s > >(tee -a ${HPCBIND_LOG})
echo "[HPCBIND]" > >(tee -a ${HPCBIND_LOG})
echo "${TMP_ENV}" | grep -E "^HPCBIND_" > >(tee -a ${HPCBIND_LOG})
echo "[HWLOC]" > >(tee -a ${HPCBIND_LOG})
echo "${TMP_ENV}" | grep -E "^HWLOC_" > >(tee -a ${HPCBIND_LOG})
echo "[CUDA]" > >(tee -a ${HPCBIND_LOG})
echo "${TMP_ENV}" | grep -E "^CUDA_" > >(tee -a ${HPCBIND_LOG})
echo "[OPENMP]" > >(tee -a ${HPCBIND_LOG})

View File

@ -74,6 +74,9 @@ dry_run=0
host_only=0
host_only_args=""
# Just run version on host compiler
get_host_version=0
# Enable workaround for CUDA 6.5 for pragma ident
replace_pragma_ident=0
@ -93,6 +96,9 @@ depfile_separate=0
depfile_output_arg=""
depfile_target_arg=""
# Option to remove duplicate libraries and object files
remove_duplicate_link_files=0
#echo "Arguments: $# $@"
while [ $# -gt 0 ]
@ -106,10 +112,18 @@ do
--host-only)
host_only=1
;;
#get the host version only
--host-version)
get_host_version=1
;;
#replace '#pragma ident' with '#ident' this is needed to compile OpenMPI due to a configure script bug and a non-standardized behaviour of pragma with macros
--replace-pragma-ident)
replace_pragma_ident=1
;;
#remove duplicate link files
--remove-duplicate-link-files)
remove_duplicate_link_files=1
;;
#handle source files to be compiled as cuda files
*.cpp|*.cxx|*.cc|*.C|*.c++|*.cu)
cpp_files="$cpp_files $1"
@ -124,7 +138,12 @@ do
fi
;;
#Handle shared args (valid for both nvcc and the host compiler)
-D*|-I*|-L*|-l*|-g|--help|--version|-E|-M|-shared)
-D*)
unescape_commas=`echo "$1" | sed -e 's/\\\,/,/g'`
arg=`printf "%q" $unescape_commas`
shared_args="$shared_args $arg"
;;
-I*|-L*|-l*|-g|--help|--version|-E|-M|-shared|-w)
shared_args="$shared_args $1"
;;
#Handle compilation argument
@ -152,7 +171,7 @@ do
shift
;;
#Handle known nvcc args
-gencode*|--dryrun|--verbose|--keep|--keep-dir*|-G|--relocatable-device-code*|-lineinfo|-expt-extended-lambda|--resource-usage|-Xptxas*)
--dryrun|--verbose|--keep|--keep-dir*|-G|--relocatable-device-code*|-lineinfo|-expt-extended-lambda|--resource-usage|-Xptxas*)
cuda_args="$cuda_args $1"
;;
#Handle more known nvcc args
@ -164,8 +183,11 @@ do
cuda_args="$cuda_args $1 $2"
shift
;;
-rdc=*|-maxrregcount*|--maxrregcount*)
cuda_args="$cuda_args $1"
;;
#Handle c++11
--std=c++11|-std=c++11|--std=c++14|-std=c++14|--std=c++1z|-std=c++1z)
--std=c++11|-std=c++11|--std=c++14|-std=c++14|--std=c++1y|-std=c++1y|--std=c++17|-std=c++17|--std=c++1z|-std=c++1z)
if [ $stdcxx_applied -eq 1 ]; then
echo "nvcc_wrapper - *warning* you have set multiple optimization flags (-std=c++1* or --std=c++1*), only the first is used because nvcc can only accept a single std setting"
else
@ -205,6 +227,15 @@ do
fi
shift
;;
#Handle -+ (same as -x c++, specifically used for xl compilers, but mutually exclusive with -x. So replace it with -x c++)
-+)
if [ $first_xcompiler_arg -eq 1 ]; then
xcompiler_args="-x,c++"
first_xcompiler_arg=0
else
xcompiler_args="$xcompiler_args,-x,c++"
fi
;;
#Handle -ccbin (if its not set we can set it to a default value)
-ccbin)
cuda_args="$cuda_args $1 $2"
@ -212,18 +243,39 @@ do
host_compiler=$2
shift
;;
#Handle -arch argument (if its not set use a default
-arch*)
#Handle -arch argument (if its not set use a default) this is the version with = sign
-arch*|-gencode*)
cuda_args="$cuda_args $1"
arch_set=1
;;
#Handle -code argument (if its not set use a default) this is the version with = sign
-code*)
cuda_args="$cuda_args $1"
;;
#Handle -arch argument (if its not set use a default) this is the version without = sign
-arch|-gencode)
cuda_args="$cuda_args $1 $2"
arch_set=1
shift
;;
#Handle -code argument (if its not set use a default) this is the version without = sign
-code)
cuda_args="$cuda_args $1 $2"
shift
;;
#Handle -Xcudafe argument
-Xcudafe)
cuda_args="$cuda_args -Xcudafe $2"
shift
;;
#Handle -Xlinker argument
-Xlinker)
xlinker_args="$xlinker_args -Xlinker $2"
shift
;;
#Handle args that should be sent to the linker
-Wl*)
-Wl,*)
xlinker_args="$xlinker_args -Xlinker ${1:4:${#1}}"
host_linker_args="$host_linker_args ${1:4:${#1}}"
;;
@ -256,6 +308,44 @@ do
shift
done
# Only print host compiler version
if [ $get_host_version -eq 1 ]; then
$host_compiler --version
exit
fi
#Remove duplicate object files
if [ $remove_duplicate_link_files -eq 1 ]; then
for obj in $object_files
do
object_files_reverse="$obj $object_files_reverse"
done
object_files_reverse_clean=""
for obj in $object_files_reverse
do
exists=false
for obj2 in $object_files_reverse_clean
do
if [ "$obj" == "$obj2" ]
then
exists=true
echo "Exists: $obj"
fi
done
if [ "$exists" == "false" ]
then
object_files_reverse_clean="$object_files_reverse_clean $obj"
fi
done
object_files=""
for obj in $object_files_reverse_clean
do
object_files="$obj $object_files"
done
fi
#Add default host compiler if necessary
if [ $ccbin_set -ne 1 ]; then
cuda_args="$cuda_args -ccbin $host_compiler"
@ -328,10 +418,19 @@ fi
#Run compilation command
if [ $host_only -eq 1 ]; then
if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
echo "$host_command"
fi
$host_command
elif [ -n "$nvcc_depfile_command" ]; then
if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
echo "$nvcc_command && $nvcc_depfile_command"
fi
$nvcc_command && $nvcc_depfile_command
else
if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
echo "$nvcc_command"
fi
$nvcc_command
fi
error_code=$?

View File

@ -235,3 +235,7 @@ install(FILES
# Install the export set for use with the install-tree
INSTALL(EXPORT KokkosTargets DESTINATION
"${INSTALL_CMAKE_DIR}")
# build and install pkgconfig file
CONFIGURE_FILE(core/src/kokkos.pc.in kokkos.pc @ONLY)
INSTALL(FILES ${CMAKE_CURRENT_BINARY_DIR}/kokkos.pc DESTINATION lib/pkgconfig)

View File

@ -47,7 +47,7 @@ function(set_kokkos_cxx_compiler)
OUTPUT_VARIABLE INTERNAL_CXX_COMPILER_VERSION
OUTPUT_STRIP_TRAILING_WHITESPACE)
string(REGEX MATCH "[0-9]+\.[0-9]+\.[0-9]+$"
string(REGEX MATCH "[0-9]+\\.[0-9]+\\.[0-9]+$"
INTERNAL_CXX_COMPILER_VERSION ${INTERNAL_CXX_COMPILER_VERSION})
endif()

View File

@ -41,7 +41,6 @@ list(APPEND KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST
foreach(opt ${KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST})
string(TOUPPER ${opt} OPT )
IF(DEFINED Kokkos_ENABLE_${opt})
MESSAGE("Kokkos_ENABLE_${opt} is defined!")
IF(DEFINED KOKKOS_ENABLE_${OPT})
IF(NOT ("${KOKKOS_ENABLE_${OPT}}" STREQUAL "${Kokkos_ENABLE_${opt}}"))
IF(DEFINED KOKKOS_ENABLE_${OPT}_INTERNAL)
@ -59,7 +58,6 @@ foreach(opt ${KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST})
ENDIF()
ELSE()
SET(KOKKOS_INTERNAL_ENABLE_${OPT}_DEFAULT ${Kokkos_ENABLE_${opt}})
MESSAGE("set KOKKOS_INTERNAL_ENABLE_${OPT}_DEFAULT!")
ENDIF()
ENDIF()
endforeach()
@ -81,6 +79,7 @@ list(APPEND KOKKOS_ARCH_LIST
ARMv80 # (HOST) ARMv8.0 Compatible CPU
ARMv81 # (HOST) ARMv8.1 Compatible CPU
ARMv8-ThunderX # (HOST) ARMv8 Cavium ThunderX CPU
ARMv8-TX2 # (HOST) ARMv8 Cavium ThunderX2 CPU
WSM # (HOST) Intel Westmere CPU
SNB # (HOST) Intel Sandy/Ivy Bridge CPUs
HSW # (HOST) Intel Haswell CPUs
@ -123,11 +122,18 @@ list(APPEND KOKKOS_DEVICES_LIST
# List of possible TPLs for Kokkos
# From Makefile.kokkos: Options: hwloc,librt,experimental_memkind
set(KOKKOS_USE_TPLS_LIST)
if(APPLE)
list(APPEND KOKKOS_USE_TPLS_LIST
HWLOC # hwloc
MEMKIND # experimental_memkind
)
else()
list(APPEND KOKKOS_USE_TPLS_LIST
HWLOC # hwloc
LIBRT # librt
MEMKIND # experimental_memkind
)
endif()
# Map of cmake variables to Makefile variables
set(KOKKOS_INTERNAL_HWLOC hwloc)
set(KOKKOS_INTERNAL_LIBRT librt)
@ -172,6 +178,7 @@ set(KOKKOS_INTERNAL_LAMBDA enable_lambda)
set(tmpr "\n ")
string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_ARCH_DOCSTR "${KOKKOS_ARCH_LIST}")
set(KOKKOS_INTERNAL_ARCH_DOCSTR "${tmpr}${KOKKOS_INTERNAL_ARCH_DOCSTR}")
# This would be useful, but we use Foo_ENABLE mechanisms
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_DEVICES_DOCSTR "${KOKKOS_DEVICES_LIST}")
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_USE_TPLS_DOCSTR "${KOKKOS_USE_TPLS_LIST}")
@ -269,7 +276,7 @@ set(KOKKOS_ENABLE_PROFILING_LOAD_PRINT ${KOKKOS_INTERNAL_ENABLE_PROFILING_LOAD_P
set_kokkos_default_default(DEPRECATED_CODE ON)
set(KOKKOS_ENABLE_DEPRECATED_CODE ${KOKKOS_INTERNAL_ENABLE_DEPRECATED_CODE_DEFAULT} CACHE BOOL "Enable deprecated code.")
set_kokkos_default_default(EXPLICIT_INSTANTIATION ON)
set_kokkos_default_default(EXPLICIT_INSTANTIATION OFF)
set(KOKKOS_ENABLE_EXPLICIT_INSTANTIATION ${KOKKOS_INTERNAL_ENABLE_EXPLICIT_INSTANTIATION_DEFAULT} CACHE BOOL "Enable explicit template instantiation.")
#-------------------------------------------------------------------------------

View File

@ -15,16 +15,16 @@
# Ensure that KOKKOS_ARCH is in the ARCH_LIST
if (KOKKOS_ARCH MATCHES ",")
message("-- Detected a comma in: KOKKOS_ARCH=${KOKKOS_ARCH}")
message("-- Detected a comma in: KOKKOS_ARCH=`${KOKKOS_ARCH}`")
message("-- Although we prefer KOKKOS_ARCH to be semicolon-delimited, we do allow")
message("-- comma-delimited values for compatibility with scripts (see github.com/trilinos/Trilinos/issues/2330)")
string(REPLACE "," ";" KOKKOS_ARCH "${KOKKOS_ARCH}")
message("-- Commas were changed to semicolons, now KOKKOS_ARCH=${KOKKOS_ARCH}")
message("-- Commas were changed to semicolons, now KOKKOS_ARCH=`${KOKKOS_ARCH}`")
endif()
foreach(arch ${KOKKOS_ARCH})
list(FIND KOKKOS_ARCH_LIST ${arch} indx)
if (indx EQUAL -1)
message(FATAL_ERROR "${arch} is not an accepted value for KOKKOS_ARCH."
message(FATAL_ERROR "`${arch}` is not an accepted value in KOKKOS_ARCH=`${KOKKOS_ARCH}`."
" Please pick from these choices: ${KOKKOS_INTERNAL_ARCH_DOCSTR}")
endif ()
endforeach()
@ -130,7 +130,8 @@ string(REPLACE ";" ":" KOKKOS_INTERNAL_ADDTOPATH "${addpathl}")
# Set the KOKKOS_SETTINGS String -- this is the primary communication with the
# makefile configuration. See Makefile.kokkos
set(KOKKOS_SETTINGS KOKKOS_SRC_PATH=${KOKKOS_SRC_PATH})
set(KOKKOS_SETTINGS KOKKOS_CMAKE=yes)
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_SRC_PATH=${KOKKOS_SRC_PATH})
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_PATH=${KOKKOS_PATH})
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_INSTALL_PATH=${CMAKE_INSTALL_PREFIX})

View File

@ -241,17 +241,16 @@ elif [ "$MACHINE" = "white" ]; then
BASE_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>"
IBM_MODULE_LIST="<COMPILER_NAME>/xl/<COMPILER_VERSION>"
CUDA_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/5.4.0"
CUDA_MODULE_LIST2="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/6.3.0,ibm/xl/13.1.6"
CUDA_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/6.4.0,ibm/xl/16.1.0"
# Don't do pthread on white.
GCC_BUILD_LIST="OpenMP,Serial,OpenMP_Serial"
# Format: (compiler module-list build-list exe-name warning-flag)
COMPILERS=("gcc/5.4.0 $BASE_MODULE_LIST $IBM_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"ibm/13.1.6 $IBM_MODULE_LIST $IBM_BUILD_LIST xlC $IBM_WARNING_FLAGS"
"cuda/8.0.44 $CUDA_MODULE_LIST $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
"cuda/9.0.103 $CUDA_MODULE_LIST2 $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
"gcc/6.4.0 $BASE_MODULE_LIST $IBM_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"ibm/16.1.0 $IBM_MODULE_LIST $IBM_BUILD_LIST xlC $IBM_WARNING_FLAGS"
"cuda/9.0.103 $CUDA_MODULE_LIST $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
)
if [ -z "$ARCH_FLAG" ]; then
@ -362,7 +361,7 @@ elif [ "$MACHINE" = "apollo" ]; then
"gcc/5.3.0 $BASE_MODULE_LIST "Serial" g++ $GCC_WARNING_FLAGS"
"intel/16.0.1 $BASE_MODULE_LIST "OpenMP" icpc $INTEL_WARNING_FLAGS"
"clang/3.9.0 $BASE_MODULE_LIST "Pthread_Serial" clang++ $CLANG_WARNING_FLAGS"
"clang/6.0 $CLANG_MODULE_LIST "Cuda_Pthread" clang++ $CUDA_WARNING_FLAGS"
"clang/6.0 $CLANG_MODULE_LIST "Cuda_Pthread,OpenMP" clang++ $CUDA_WARNING_FLAGS"
"cuda/9.1 $CUDA_MODULE_LIST "Cuda_OpenMP" $KOKKOS_PATH/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
)
else

View File

@ -96,6 +96,7 @@ template< class DataType ,
class Arg3Type = void>
class DualView : public ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type >
{
template< class , class , class , class > friend class DualView ;
public:
//! \name Typedefs for device types and various Kokkos::View specializations.
//@{
@ -182,8 +183,20 @@ public:
//! \name Counters to keep track of changes ("modified" flags)
//@{
View<unsigned int,LayoutLeft,typename t_host::execution_space> modified_device;
View<unsigned int,LayoutLeft,typename t_host::execution_space> modified_host;
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
protected:
// modified_flags[0] -> host
// modified_flags[1] -> device
typedef View<unsigned int[2],LayoutLeft,Kokkos::HostSpace> t_modified_flags;
t_modified_flags modified_flags;
public:
#else
typedef View<unsigned int[2],LayoutLeft,typename t_host::execution_space> t_modified_flags;
typedef View<unsigned int,LayoutLeft,typename t_host::execution_space> t_modified_flag;
t_modified_flags modified_flags;
t_modified_flag modified_host,modified_device;
#endif
//@}
//! \name Constructors
@ -194,10 +207,14 @@ public:
/// Both device and host View objects are constructed using their
/// default constructors. The "modified" flags are both initialized
/// to "unmodified."
DualView () :
modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device")),
modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{}
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
DualView () = default;
#else
DualView ():modified_flags (t_modified_flags("DualView::modified_flags")) {
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
}
#endif
/// \brief Constructor that allocates View objects on both host and device.
///
@ -219,17 +236,24 @@ public:
const size_t n7 = KOKKOS_IMPL_CTOR_DEFAULT_ARG)
: d_view (label, n0, n1, n2, n3, n4, n5, n6, n7)
, h_view (create_mirror_view (d_view)) // without UVM, host View mirrors
, modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device"))
, modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{}
, modified_flags (t_modified_flags("DualView::modified_flags"))
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
#endif
}
//! Copy constructor (shallow copy)
template<class SS, class LS, class DS, class MS>
DualView (const DualView<SS,LS,DS,MS>& src) :
d_view (src.d_view),
h_view (src.h_view),
modified_device (src.modified_device),
modified_host (src.modified_host)
modified_flags (src.modified_flags)
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
, modified_host(src.modified_host)
, modified_device(src.modified_device)
#endif
{}
//! Subview constructor
@ -241,8 +265,11 @@ public:
)
: d_view( Kokkos::subview( src.d_view , arg0 , args ... ) )
, h_view( Kokkos::subview( src.h_view , arg0 , args ... ) )
, modified_device (src.modified_device)
, modified_flags (src.modified_flags)
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
, modified_host(src.modified_host)
, modified_device(src.modified_device)
#endif
{}
/// \brief Create DualView from existing device and host View objects.
@ -258,8 +285,7 @@ public:
DualView (const t_dev& d_view_, const t_host& h_view_) :
d_view (d_view_),
h_view (h_view_),
modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device")),
modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
modified_flags (t_modified_flags("DualView::modified_flags"))
{
if ( int(d_view.rank) != int(h_view.rank) ||
d_view.extent(0) != h_view.extent(0) ||
@ -281,6 +307,10 @@ public:
d_view.span() != h_view.span() ) {
Kokkos::Impl::throw_runtime_exception("DualView constructed with incompatible views");
}
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
#endif
}
//@}
@ -316,6 +346,30 @@ public:
t_dev,
t_host>::type& view () const
{
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
constexpr bool device_is_memspace = std::is_same<Device,typename Device::memory_space>::value;
constexpr bool device_is_execspace = std::is_same<Device,typename Device::execution_space>::value;
constexpr bool device_exec_is_t_dev_exec = std::is_same<typename Device::execution_space,typename t_dev::execution_space>::value;
constexpr bool device_mem_is_t_dev_mem = std::is_same<typename Device::memory_space,typename t_dev::memory_space>::value;
constexpr bool device_exec_is_t_host_exec = std::is_same<typename Device::execution_space,typename t_host::execution_space>::value;
constexpr bool device_mem_is_t_host_mem = std::is_same<typename Device::memory_space,typename t_host::memory_space>::value;
constexpr bool device_is_t_host_device = std::is_same<typename Device::execution_space,typename t_host::device_type>::value;
constexpr bool device_is_t_dev_device = std::is_same<typename Device::memory_space,typename t_host::device_type>::value;
static_assert(
device_is_t_dev_device || device_is_t_host_device ||
(device_is_memspace && (device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ) ||
(device_is_execspace && (device_exec_is_t_dev_exec || device_exec_is_t_host_exec) ) ||
(
(!device_is_execspace && !device_is_memspace) && (
(device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ||
(device_exec_is_t_dev_exec || device_exec_is_t_host_exec)
)
)
,
"Template parameter to .view() must exactly match one of the DualView's device types or one of the execution or memory spaces");
#endif
return Impl::if_c<
std::is_same<
typename t_dev::memory_space,
@ -324,6 +378,72 @@ public:
t_host >::select (d_view , h_view);
}
KOKKOS_INLINE_FUNCTION
t_host view_host() const {
return h_view;
}
KOKKOS_INLINE_FUNCTION
t_dev view_device() const {
return d_view;
}
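/// Sketch of the three accessors above for some DualView dv (illustrative):
/// \code
///   auto h = dv.view_host();                // t_host, no template argument
///   auto d = dv.view_device();              // t_dev
///   auto v = dv.view<Kokkos::HostSpace>();  // templated form, selected by memory space
/// \endcode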
template<class Device>
static int get_device_side() {
constexpr bool device_is_memspace = std::is_same<Device,typename Device::memory_space>::value;
constexpr bool device_is_execspace = std::is_same<Device,typename Device::execution_space>::value;
constexpr bool device_exec_is_t_dev_exec = std::is_same<typename Device::execution_space,typename t_dev::execution_space>::value;
constexpr bool device_mem_is_t_dev_mem = std::is_same<typename Device::memory_space,typename t_dev::memory_space>::value;
constexpr bool device_exec_is_t_host_exec = std::is_same<typename Device::execution_space,typename t_host::execution_space>::value;
constexpr bool device_mem_is_t_host_mem = std::is_same<typename Device::memory_space,typename t_host::memory_space>::value;
constexpr bool device_is_t_host_device = std::is_same<Device,typename t_host::device_type>::value;
constexpr bool device_is_t_dev_device = std::is_same<Device,typename t_dev::device_type>::value;
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
static_assert(
device_is_t_dev_device || device_is_t_host_device ||
(device_is_memspace && (device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ) ||
(device_is_execspace && (device_exec_is_t_dev_exec || device_exec_is_t_host_exec) ) ||
(
(!device_is_execspace && !device_is_memspace) && (
(device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ||
(device_exec_is_t_dev_exec || device_exec_is_t_host_exec)
)
)
,
"Template parameter to .sync() must exactly match one of the DualView's device types or one of the execution or memory spaces");
#endif
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
int dev = -1;
#else
int dev = 0;
#endif
if(device_is_t_dev_device) dev = 1;
else if(device_is_t_host_device) dev = 0;
else {
if(device_is_memspace) {
if(device_mem_is_t_dev_mem) dev = 1;
if(device_mem_is_t_host_mem) dev = 0;
if(device_mem_is_t_host_mem && device_mem_is_t_dev_mem) dev = -1;
}
if(device_is_execspace) {
if(device_exec_is_t_dev_exec) dev = 1;
if(device_exec_is_t_host_exec) dev = 0;
if(device_exec_is_t_host_exec && device_exec_is_t_dev_exec) dev = -1;
}
if(!device_is_execspace && !device_is_memspace) {
if(device_mem_is_t_dev_mem) dev = 1;
if(device_mem_is_t_host_mem) dev = 0;
if(device_mem_is_t_host_mem && device_mem_is_t_dev_mem) dev = -1;
if(device_exec_is_t_dev_exec) dev = 1;
if(device_exec_is_t_host_exec) dev = 0;
if(device_exec_is_t_host_exec && device_exec_is_t_dev_exec) dev = -1;
}
}
return dev;
}
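/// Return convention of get_device_side<Device>() as implemented above:
///    1 -> Device resolves to the DualView's device side
///    0 -> Device resolves to the host side
///   -1 -> ambiguous (both sides match, e.g. host-only builds) when
///         deprecated code is disabled; the flag-based callers below then
///         touch neither side.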
/// \brief Update data on device or host only if data in the other
/// space has been marked as modified.
///
@ -347,23 +467,20 @@ public:
( std::is_same< Device , int>::value)
, int >::type& = 0)
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value ,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return;
if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
int dev = get_device_side<Device>();
if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
deep_copy (d_view, h_view);
modified_host() = modified_device() = 0;
modified_flags(0) = modified_flags(1) = 0;
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0) { // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
deep_copy (h_view, d_view);
modified_host() = modified_device() = 0;
modified_flags(0) = modified_flags(1) = 0;
}
}
if(std::is_same<typename t_host::memory_space,typename t_dev::memory_space>::value) {
@ -378,46 +495,71 @@ public:
( std::is_same< Device , int>::value)
, int >::type& = 0 )
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
unsigned int,
unsigned int>::select (1, 0);
if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
if(modified_flags.data()==NULL) return;
int dev = get_device_side<Device>();
if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
Impl::throw_runtime_exception("Calling sync on a DualView with a const datatype.");
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0){ // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
Impl::throw_runtime_exception("Calling sync on a DualView with a const datatype.");
}
}
}
void sync_host() {
if( ! std::is_same< typename traits::data_type , typename traits::non_const_data_type>::value )
Impl::throw_runtime_exception("Calling sync_host on a DualView with a const datatype.");
if(modified_flags.data()==NULL) return;
if(modified_flags(1) > modified_flags(0)) {
deep_copy (h_view, d_view);
modified_flags(1) = modified_flags(0) = 0;
}
}
void sync_device() {
if( ! std::is_same< typename traits::data_type , typename traits::non_const_data_type>::value )
Impl::throw_runtime_exception("Calling sync_device on a DualView with a const datatype.");
if(modified_flags.data()==NULL) return;
if(modified_flags(0) > modified_flags(1)) {
deep_copy (d_view, h_view);
modified_flags(1) = modified_flags(0) = 0;
}
}
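/// Usage sketch for the non-templated sync path added here (illustrative):
/// \code
///   Kokkos::DualView<int*> dv("dv", n);
///   dv.modify_device();                       // device data is newest
///   if (dv.need_sync_host()) dv.sync_host();  // deep_copy d_view -> h_view
/// \endcode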
template<class Device>
bool need_sync() const
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value ,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return false;
int dev = get_device_side<Device>();
if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
return true;
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0){ // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
return true;
}
}
return false;
}
inline bool need_sync_host() const {
if(modified_flags.data()==NULL) return false;
return modified_flags(0)<modified_flags(1);
}
inline bool need_sync_device() const {
if(modified_flags.data()==NULL) return false;
return modified_flags(1)<modified_flags(0);
}
/// \brief Mark data as modified on the given device \c Device.
///
/// If \c Device is the same as this DualView's device type, then
@ -425,26 +567,22 @@ public:
/// data as modified.
template<class Device>
void modify () {
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return;
int dev = get_device_side<Device>();
if (dev) { // if Device is the same as DualView's device type
if (dev == 1) { // if Device is the same as DualView's device type
// Increment the device's modified count.
modified_device () = (modified_device () > modified_host () ?
modified_device () : modified_host ()) + 1;
} else { // hopefully Device is the same as DualView's host type
modified_flags(1) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
}
if (dev == 0) { // hopefully Device is the same as DualView's host type
// Increment the host's modified count.
modified_host () = (modified_device () > modified_host () ?
modified_device () : modified_host ()) + 1;
modified_flags(0) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
}
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_host() && modified_device()) {
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
@ -455,6 +593,45 @@ public:
#endif
}
inline void modify_host() {
if(modified_flags.data()!=NULL) {
modified_flags(0) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify_host ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
msg += d_view.label();
msg += "\"\n";
Kokkos::abort(msg.c_str());
}
#endif
}
}
inline void modify_device() {
if(modified_flags.data()!=NULL) {
modified_flags(1) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify_device ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
msg += d_view.label();
msg += "\"\n";
Kokkos::abort(msg.c_str());
}
#endif
}
}
inline void clear_sync_state() {
if(modified_flags.data()!=NULL)
modified_flags(1) = modified_flags(0) = 0;
}
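/// Sketch: clear_sync_state() zeroes both counters, so no sync is pending
/// until one side is marked modified again.
/// \code
///   dv.modify_host();
///   dv.clear_sync_state();
///   assert(!dv.need_sync_device() && !dv.need_sync_host());
/// \endcode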
//@}
//! \name Methods for reallocating or resizing the View objects.
//@{
@ -476,7 +653,10 @@ public:
h_view = create_mirror_view( d_view );
/* Reset dirty flags */
modified_device() = modified_host() = 0;
if(modified_flags.data()==NULL) {
modified_flags = t_modified_flags("DualView::modified_flags");
} else
modified_flags(1) = modified_flags(0) = 0;
}
/// \brief Resize both views, copying old contents into new if necessary.
@ -491,13 +671,16 @@ public:
const size_t n5 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
const size_t n6 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
const size_t n7 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ) {
if(modified_device() >= modified_host()) {
if(modified_flags.data()==NULL) {
modified_flags = t_modified_flags("DualView::modified_flags");
}
if(modified_flags(1) >= modified_flags(0)) {
/* Resize on Device */
::Kokkos::resize(d_view,n0,n1,n2,n3,n4,n5,n6,n7);
h_view = create_mirror_view( d_view );
/* Mark Device copy as modified */
modified_device() = modified_device()+1;
modified_flags(1) = modified_flags(1)+1;
} else {
/* Realloc on Device */
@ -525,7 +708,7 @@ public:
d_view = create_mirror_view( typename t_dev::execution_space(), h_view );
/* Mark Host copy as modified */
modified_host() = modified_host()+1;
modified_flags(0) = modified_flags(0)+1;
}
}
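/// Sketch of the contract implemented above (illustrative): resize preserves
/// the up-to-date side's contents and bumps its flag, while realloc discards
/// contents and leaves both flags at zero.
/// \code
///   dv.resize(2 * n);  // keeps data, marks the resized side modified
///   dv.realloc(n);     // fresh allocation, modified_flags(0) == modified_flags(1) == 0
/// \endcode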
@ -649,7 +832,10 @@ void
deep_copy (DualView<DT,DL,DD,DM> dst, // trust me, this must not be a reference
const DualView<ST,SL,SD,SM>& src )
{
if (src.modified_device () >= src.modified_host ()) {
if(src.modified_flags.data()==NULL || dst.modified_flags.data()==NULL) {
return deep_copy(dst.d_view, src.d_view);
}
if (src.modified_flags(1) >= src.modified_flags(0)) {
deep_copy (dst.d_view, src.d_view);
dst.template modify<typename DualView<DT,DL,DD,DM>::device_type> ();
} else {
@ -666,7 +852,10 @@ deep_copy (const ExecutionSpace& exec ,
DualView<DT,DL,DD,DM> dst, // trust me, this must not be a reference
const DualView<ST,SL,SD,SM>& src )
{
if (src.modified_device () >= src.modified_host ()) {
if(src.modified_flags.data()==NULL || dst.modified_flags.data()==NULL) {
return deep_copy(exec, dst.d_view, src.d_view);
}
if (src.modified_flags(1) >= src.modified_flags(0)) {
deep_copy (exec, dst.d_view, src.d_view);
dst.template modify<typename DualView<DT,DL,DD,DM>::device_type> ();
} else {


@ -384,8 +384,8 @@ public:
// Removed dimension checks...
typedef typename DstType::offset_type dst_offset_type ;
dst.m_map.m_offset = dst_offset_type(std::integral_constant<unsigned,0>() , src.layout() ); //Check this for integer input1 for padding, etc
dst.m_map.m_handle = Kokkos::Impl::ViewDataHandle< DstTraits >::assign( src.m_map.m_handle , src.m_track );
dst.m_map.m_impl_offset = dst_offset_type(std::integral_constant<unsigned,0>() , src.layout() ); //Check this for integer input1 for padding, etc
dst.m_map.m_impl_handle = Kokkos::Impl::ViewDataHandle< DstTraits >::assign( src.m_map.m_impl_handle , src.m_track );
dst.m_track.assign( src.m_track , DstTraits::is_managed );
dst.m_rank = src.Rank ;
}
@ -565,10 +565,14 @@ public:
//----------------------------------------
// Allow specializations to query their specialized map
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::ViewMapping< traits , void > &
implementation_map() const { return m_map ; }
#endif
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::ViewMapping< traits , void > &
impl_map() const { return m_map ; }
//----------------------------------------
@ -624,7 +628,7 @@ public:
reference_type operator()() const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (0 , this->rank(), m_track, m_map) )
return implementation_map().reference();
return impl_map().reference();
//return m_map.reference(0,0,0,0,0,0,0);
}
@ -647,7 +651,7 @@ public:
typename std::enable_if< !std::is_same<typename drvtraits::value_type, typename drvtraits::scalar_array_type>::value && std::is_integral<iType>::value, reference_type>::type
operator[](const iType & i0) const
{
// auto map = implementation_map();
// auto map = impl_map();
const size_t dim_scalar = m_map.dimension_scalar();
const size_t bytes = this->span() / dim_scalar;
@ -785,7 +789,7 @@ public:
reference_type access() const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (0 , this->rank(), m_track, m_map) )
return implementation_map().reference();
return impl_map().reference();
//return m_map.reference(0,0,0,0,0,0,0);
}
@ -1189,8 +1193,7 @@ public:
, const typename traits::array_layout & arg_layout
)
: DynRankView( Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing )
, Impl::DynRankDimTraits<typename traits::specialize>::createLayout(arg_layout)
, arg_layout
)
{}
@ -1205,7 +1208,9 @@ public:
, const size_t arg_N6 =KOKKOS_INVALID_INDEX
, const size_t arg_N7 =KOKKOS_INVALID_INDEX
)
: DynRankView(Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing ), arg_N0, arg_N1, arg_N2, arg_N3, arg_N4, arg_N5, arg_N6, arg_N7 )
: DynRankView(Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing )
, typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3, arg_N4, arg_N5, arg_N6, arg_N7)
)
{}
//----------------------------------------
@ -1445,30 +1450,30 @@ public:
ret_type dst ;
const SubviewExtents< 7 , rank > extents =
ExtentGenerator< Args ... >::generator( src.m_map.m_offset.m_dim , args... ) ;
ExtentGenerator< Args ... >::generator( src.m_map.m_impl_offset.m_dim , args... ) ;
dst_offset_type tempdst( src.m_map.m_offset , extents ) ;
dst_offset_type tempdst( src.m_map.m_impl_offset , extents ) ;
dst.m_track = src.m_track ;
dst.m_map.m_offset.m_dim.N0 = tempdst.m_dim.N0 ;
dst.m_map.m_offset.m_dim.N1 = tempdst.m_dim.N1 ;
dst.m_map.m_offset.m_dim.N2 = tempdst.m_dim.N2 ;
dst.m_map.m_offset.m_dim.N3 = tempdst.m_dim.N3 ;
dst.m_map.m_offset.m_dim.N4 = tempdst.m_dim.N4 ;
dst.m_map.m_offset.m_dim.N5 = tempdst.m_dim.N5 ;
dst.m_map.m_offset.m_dim.N6 = tempdst.m_dim.N6 ;
dst.m_map.m_impl_offset.m_dim.N0 = tempdst.m_dim.N0 ;
dst.m_map.m_impl_offset.m_dim.N1 = tempdst.m_dim.N1 ;
dst.m_map.m_impl_offset.m_dim.N2 = tempdst.m_dim.N2 ;
dst.m_map.m_impl_offset.m_dim.N3 = tempdst.m_dim.N3 ;
dst.m_map.m_impl_offset.m_dim.N4 = tempdst.m_dim.N4 ;
dst.m_map.m_impl_offset.m_dim.N5 = tempdst.m_dim.N5 ;
dst.m_map.m_impl_offset.m_dim.N6 = tempdst.m_dim.N6 ;
dst.m_map.m_offset.m_stride.S0 = tempdst.m_stride.S0 ;
dst.m_map.m_offset.m_stride.S1 = tempdst.m_stride.S1 ;
dst.m_map.m_offset.m_stride.S2 = tempdst.m_stride.S2 ;
dst.m_map.m_offset.m_stride.S3 = tempdst.m_stride.S3 ;
dst.m_map.m_offset.m_stride.S4 = tempdst.m_stride.S4 ;
dst.m_map.m_offset.m_stride.S5 = tempdst.m_stride.S5 ;
dst.m_map.m_offset.m_stride.S6 = tempdst.m_stride.S6 ;
dst.m_map.m_impl_offset.m_stride.S0 = tempdst.m_stride.S0 ;
dst.m_map.m_impl_offset.m_stride.S1 = tempdst.m_stride.S1 ;
dst.m_map.m_impl_offset.m_stride.S2 = tempdst.m_stride.S2 ;
dst.m_map.m_impl_offset.m_stride.S3 = tempdst.m_stride.S3 ;
dst.m_map.m_impl_offset.m_stride.S4 = tempdst.m_stride.S4 ;
dst.m_map.m_impl_offset.m_stride.S5 = tempdst.m_stride.S5 ;
dst.m_map.m_impl_offset.m_stride.S6 = tempdst.m_stride.S6 ;
dst.m_map.m_handle = dst_handle_type( src.m_map.m_handle +
src.m_map.m_offset( extents.domain_offset(0)
dst.m_map.m_impl_handle = dst_handle_type( src.m_map.m_impl_handle +
src.m_map.m_impl_offset( extents.domain_offset(0)
, extents.domain_offset(1)
, extents.domain_offset(2)
, extents.domain_offset(3)
@ -1896,6 +1901,7 @@ inline
typename DynRankView<T,P...>::HostMirror
create_mirror( const DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
! std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
@ -1914,6 +1920,7 @@ inline
typename DynRankView<T,P...>::HostMirror
create_mirror( const DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
@ -1929,7 +1936,11 @@ create_mirror( const DynRankView<T,P...> & src
// Create a mirror in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRVType<Space,T,P ...>::view_type create_mirror(const Space& , const Kokkos::DynRankView<T,P...> & src) {
typename Impl::MirrorDRVType<Space,T,P ...>::view_type
create_mirror(const Space& , const Kokkos::DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value
>::type * = 0) {
return typename Impl::MirrorDRVType<Space,T,P ...>::view_type(src.label(), Impl::reconstructLayout(src.layout(), src.rank()) );
}
@ -1985,6 +1996,29 @@ create_mirror_view(const Space& , const Kokkos::DynRankView<T,P...> & src
return typename Impl::MirrorDRViewType<Space,T,P ...>::view_type(src.label(), Impl::reconstructLayout(src.layout(), src.rank()) );
}
// Create a mirror view and deep_copy in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::DynRankView<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<Impl::MirrorDRViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
(void)name;
return src;
}
// Create a mirror view and deep_copy in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::DynRankView<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<!Impl::MirrorDRViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
using Mirror = typename Impl::MirrorDRViewType<Space,T,P ...>::view_type;
std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror( Kokkos::ViewAllocateWithoutInitializing(label), Impl::reconstructLayout(src.layout(), src.rank()) );
deep_copy(mirror, src);
return mirror;
}
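// Usage sketch for the new DynRankView overloads above (they mirror the
// existing View create_mirror_view_and_copy API; names are illustrative):
//
//   Kokkos::DynRankView<double, Kokkos::HostSpace> a("A", 10, 10);
//   using mem_space = Kokkos::DefaultExecutionSpace::memory_space;
//   auto m = Kokkos::create_mirror_view_and_copy(mem_space(), a);
//   // same memory space: m aliases a (no copy); different space: a new
//   // allocation with a's layout and rank, followed by deep_copy(m, a)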
} //end Kokkos

File diff suppressed because it is too large


@ -47,7 +47,9 @@
#include <string>
#include <vector>
#include <Kokkos_Core.hpp>
#include <Kokkos_View.hpp>
#include <Kokkos_Parallel.hpp>
#include <Kokkos_Parallel_Reduce.hpp>
namespace Kokkos {


@ -86,14 +86,13 @@ public:
vector():DV() {
_size = 0;
_extra_storage = 1.1;
DV::modified_host() = 1;
}
vector(int n, Scalar val=Scalar()):DualView<Scalar*,LayoutLeft,Arg1Type>("Vector",size_t(n*(1.1))) {
_size = n;
_extra_storage = 1.1;
DV::modified_host() = 1;
DV::modified_flags(0) = 1;
assign(n,val);
}
@ -119,16 +118,16 @@ public:
/* Assign value either on host or on device */
if( DV::modified_host() >= DV::modified_device() ) {
if( DV::template need_sync<typename DV::t_dev::device_type>() ) {
set_functor_host f(DV::h_view,val);
parallel_for(n,f);
DV::t_host::execution_space::fence();
DV::modified_host()++;
DV::template modify<typename DV::t_host::device_type>();
} else {
set_functor f(DV::d_view,val);
parallel_for(n,f);
DV::t_dev::execution_space::fence();
DV::modified_device()++;
DV::template modify<typename DV::t_dev::device_type>();
}
}
@ -137,7 +136,8 @@ public:
}
void push_back(Scalar val) {
DV::modified_host()++;
DV::template sync<typename DV::t_host::device_type>();
DV::template modify<typename DV::t_host::device_type>();
if(_size == span()) {
size_t new_size = _size*_extra_storage;
if(new_size == _size) new_size++;
@ -247,10 +247,10 @@ public:
}
void on_host() {
DV::modified_host() = DV::modified_device() + 1;
DV::template modify<typename DV::t_host::device_type>();
}
void on_device() {
DV::modified_device() = DV::modified_host() + 1;
DV::template modify<typename DV::t_dev::device_type>();
}
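// Note on the rewrite above (sketch): the direct flag arithmetic, e.g.
// modified_host() = modified_device() + 1, is replaced by the equivalent
// templated calls
//   DV::template modify<typename DV::t_host::device_type>();  // on_host()
//   DV::template modify<typename DV::t_dev::device_type>();   // on_device()
// modify<> advances the chosen side's counter past the other side's, which
// preserves the ordering the old +1 idiom established.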
void set_overallocation(float extra) {


@ -23,6 +23,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
threads/TestThreads_DynRankViewAPI_rank12345.cpp
threads/TestThreads_DynRankViewAPI_rank67.cpp
threads/TestThreads_ErrorReporter.cpp
threads/TestThreads_OffsetView.cpp
threads/TestThreads_ScatterView.cpp
threads/TestThreads_StaticCrsGraph.cpp
threads/TestThreads_UnorderedMap.cpp
@ -47,6 +48,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
serial/TestSerial_DynRankViewAPI_rank12345.cpp
serial/TestSerial_DynRankViewAPI_rank67.cpp
serial/TestSerial_ErrorReporter.cpp
serial/TestSerial_OffsetView.cpp
serial/TestSerial_ScatterView.cpp
serial/TestSerial_StaticCrsGraph.cpp
serial/TestSerial_UnorderedMap.cpp
@ -71,6 +73,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
openmp/TestOpenMP_DynRankViewAPI_rank12345.cpp
openmp/TestOpenMP_DynRankViewAPI_rank67.cpp
openmp/TestOpenMP_ErrorReporter.cpp
openmp/TestOpenMP_OffsetView.cpp
openmp/TestOpenMP_ScatterView.cpp
openmp/TestOpenMP_StaticCrsGraph.cpp
openmp/TestOpenMP_UnorderedMap.cpp
@ -95,6 +98,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
cuda/TestCuda_DynRankViewAPI_rank12345.cpp
cuda/TestCuda_DynRankViewAPI_rank67.cpp
cuda/TestCuda_ErrorReporter.cpp
cuda/TestCuda_OffsetView.cpp
cuda/TestCuda_ScatterView.cpp
cuda/TestCuda_StaticCrsGraph.cpp
cuda/TestCuda_UnorderedMap.cpp


@ -39,6 +39,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
OBJ_CUDA += TestCuda_DynRankViewAPI_rank12345.o
OBJ_CUDA += TestCuda_DynRankViewAPI_rank67.o
OBJ_CUDA += TestCuda_ErrorReporter.o
OBJ_CUDA += TestCuda_OffsetView.o
OBJ_CUDA += TestCuda_ScatterView.o
OBJ_CUDA += TestCuda_StaticCrsGraph.o
OBJ_CUDA += TestCuda_UnorderedMap.o
@ -57,6 +58,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
OBJ_ROCM += TestROCm_DynRankViewAPI_rank12345.o
OBJ_ROCM += TestROCm_DynRankViewAPI_rank67.o
OBJ_ROCM += TestROCm_ErrorReporter.o
OBJ_ROCM += TestROCm_OffsetView.o
OBJ_ROCM += TestROCm_ScatterView.o
OBJ_ROCM += TestROCm_StaticCrsGraph.o
OBJ_ROCM += TestROCm_UnorderedMap.o
@ -75,6 +77,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
OBJ_THREADS += TestThreads_DynRankViewAPI_rank12345.o
OBJ_THREADS += TestThreads_DynRankViewAPI_rank67.o
OBJ_THREADS += TestThreads_ErrorReporter.o
OBJ_THREADS += TestThreads_OffsetView.o
OBJ_THREADS += TestThreads_ScatterView.o
OBJ_THREADS += TestThreads_StaticCrsGraph.o
OBJ_THREADS += TestThreads_UnorderedMap.o
@ -93,6 +96,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
OBJ_OPENMP += TestOpenMP_DynRankViewAPI_rank12345.o
OBJ_OPENMP += TestOpenMP_DynRankViewAPI_rank67.o
OBJ_OPENMP += TestOpenMP_ErrorReporter.o
OBJ_OPENMP += TestOpenMP_OffsetView.o
OBJ_OPENMP += TestOpenMP_ScatterView.o
OBJ_OPENMP += TestOpenMP_StaticCrsGraph.o
OBJ_OPENMP += TestOpenMP_UnorderedMap.o
@ -111,6 +115,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
OBJ_SERIAL += TestSerial_DynRankViewAPI_rank12345.o
OBJ_SERIAL += TestSerial_DynRankViewAPI_rank67.o
OBJ_SERIAL += TestSerial_ErrorReporter.o
OBJ_SERIAL += TestSerial_OffsetView.o
OBJ_SERIAL += TestSerial_ScatterView.o
OBJ_SERIAL += TestSerial_StaticCrsGraph.o
OBJ_SERIAL += TestSerial_UnorderedMap.o


@ -729,6 +729,7 @@ public:
static void run_tests() {
run_test_resize_realloc();
run_test_mirror();
run_test_mirror_and_copy();
run_test_scalar();
run_test();
run_test_const();
@ -885,6 +886,69 @@ public:
}
}
static void run_test_mirror_and_copy()
{
// LayoutLeft
{
Kokkos::DynRankView< double, Kokkos::LayoutLeft, Kokkos::HostSpace > a_org( "A", 10 );
a_org(5) = 42.0;
Kokkos::DynRankView< double, Kokkos::LayoutLeft, Kokkos::HostSpace > a_h = a_org;
auto a_h2 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_h );
auto a_d = Kokkos::create_mirror_view_and_copy( DeviceType(), a_h );
auto a_h3 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_d );
int equal_ptr_h_h2 = a_h.data() == a_h2.data() ? 1 : 0;
int equal_ptr_h_d = a_h.data() == a_d.data() ? 1 : 0;
int equal_ptr_h2_d = a_h2.data() == a_d.data() ? 1 : 0;
int equal_ptr_h3_d = a_h3.data() == a_d.data() ? 1 : 0;
int is_same_memspace = std::is_same< Kokkos::HostSpace, typename DeviceType::memory_space >::value ? 1 : 0;
ASSERT_EQ( equal_ptr_h_h2, 1 );
ASSERT_EQ( equal_ptr_h_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h2_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h3_d, is_same_memspace );
ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.extent(0), a_h2.extent(0) );
ASSERT_EQ( a_h.extent(0), a_d .extent(0) );
ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.rank(), a_org.rank() );
ASSERT_EQ( a_h.rank(), a_h2.rank() );
ASSERT_EQ( a_h.rank(), a_h3.rank() );
ASSERT_EQ( a_h.rank(), a_d.rank() );
ASSERT_EQ( a_org(5), a_h3(5) );
}
// LayoutRight
{
Kokkos::DynRankView< double, Kokkos::LayoutRight, Kokkos::HostSpace > a_org( "A", 10 );
a_org(5) = 42.0;
Kokkos::DynRankView< double, Kokkos::LayoutRight, Kokkos::HostSpace > a_h = a_org;
auto a_h2 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_h );
auto a_d = Kokkos::create_mirror_view_and_copy( DeviceType(), a_h );
auto a_h3 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_d );
int equal_ptr_h_h2 = a_h.data() == a_h2.data() ? 1 : 0;
int equal_ptr_h_d = a_h.data() == a_d.data() ? 1 : 0;
int equal_ptr_h2_d = a_h2.data() == a_d.data() ? 1 : 0;
int equal_ptr_h3_d = a_h3.data() == a_d.data() ? 1 : 0;
int is_same_memspace = std::is_same< Kokkos::HostSpace, typename DeviceType::memory_space >::value ? 1 : 0;
ASSERT_EQ( equal_ptr_h_h2, 1 );
ASSERT_EQ( equal_ptr_h_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h2_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h3_d, is_same_memspace );
ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.extent(0), a_h2.extent(0) );
ASSERT_EQ( a_h.extent(0), a_d .extent(0) );
ASSERT_EQ( a_h.rank(), a_org.rank() );
ASSERT_EQ( a_h.rank(), a_h2.rank() );
ASSERT_EQ( a_h.rank(), a_h3.rank() );
ASSERT_EQ( a_h.rank(), a_d.rank() );
ASSERT_EQ( a_org(5), a_h3(5) );
}
}
static void run_test_scalar()
{
typedef typename dView0::HostMirror hView0 ; //HostMirror of DynRankView is a DynRankView


@ -0,0 +1,426 @@
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
/*
* FIXME the OffsetView class is really not very well tested.
*/
#ifndef CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_
#define CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_
#include <gtest/gtest.h>
#include <iostream>
#include <cstdlib>
#include <cstdio>
#include <impl/Kokkos_Timer.hpp>
#include <Kokkos_OffsetView.hpp>
#include <KokkosExp_MDRangePolicy.hpp>
using std::endl;
using std::cout;
namespace Test{
template <typename Scalar, typename Device>
void test_offsetview_construction(unsigned int size)
{
typedef Kokkos::Experimental::OffsetView<Scalar**, Device> offset_view_type;
typedef Kokkos::View<Scalar**, Device> view_type;
Kokkos::Experimental::index_list_type range0 = {-1, 3};
Kokkos::Experimental::index_list_type range1 = {-2, 2};
offset_view_type ov("firstOV", range0, range1);
ASSERT_EQ("firstOV", ov.label());
ASSERT_EQ(2, ov.Rank);
ASSERT_EQ(ov.begin(0), -1);
ASSERT_EQ(ov.end(0), 4);
ASSERT_EQ(ov.begin(1), -2);
ASSERT_EQ(ov.end(1), 3);
ASSERT_EQ(ov.extent(0), 5);
ASSERT_EQ(ov.extent(1), 5);
const int ovmin0 = ov.begin(0);
const int ovend0 = ov.end(0);
const int ovmin1 = ov.begin(1);
const int ovend1 = ov.end(1);
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
{
Kokkos::Experimental::OffsetView<Scalar*, Device> offsetV1("OneDOffsetView", range0);
Kokkos::RangePolicy<Device, int> rangePolicy1(offsetV1.begin(0), offsetV1.end(0));
Kokkos::parallel_for(rangePolicy1, KOKKOS_LAMBDA (const int i){
offsetV1(i) = 1;
}
);
Kokkos::fence();
int OVResult = 0;
Kokkos::parallel_reduce(rangePolicy1, KOKKOS_LAMBDA(const int i, int & updateMe){
updateMe += offsetV1(i);
}, OVResult);
Kokkos::fence();
ASSERT_EQ(OVResult, offsetV1.end(0) - offsetV1.begin(0)) << "wrong number of elements summed over the OffsetView.";
}
{ // test deep copy of a const scalar value into the mirror
const int constVal = 6;
typename offset_view_type::HostMirror hostOffsetView =
Kokkos::Experimental::create_mirror_view(ov);
Kokkos::Experimental::deep_copy(hostOffsetView, constVal);
for(int i = hostOffsetView.begin(0); i < hostOffsetView.end(0); ++i) {
for(int j = hostOffsetView.begin(1); j < hostOffsetView.end(1); ++j) {
ASSERT_EQ(hostOffsetView(i,j), constVal) << "Bad data found in OffsetView";
}
}
}
typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, Kokkos::IndexType<int> > range_type;
typedef typename range_type::point_type point_type;
range_type rangePolicy2D(point_type{ {ovmin0, ovmin1 } },
point_type{ { ovend0, ovend1 } });
const int constValue = 9;
Kokkos::parallel_for(rangePolicy2D, KOKKOS_LAMBDA (const int i, const int j) {
ov(i,j) = constValue;
}
);
// test OffsetView to OffsetView-mirror deep copy
typename offset_view_type::HostMirror hostOffsetView =
Kokkos::Experimental::create_mirror_view(ov);
Kokkos::Experimental::deep_copy(hostOffsetView, ov);
for(int i = hostOffsetView.begin(0); i < hostOffsetView.end(0); ++i) {
for(int j = hostOffsetView.begin(1); j < hostOffsetView.end(1); ++j) {
ASSERT_EQ(hostOffsetView(i,j), constValue) << "Bad data found in OffsetView";
}
}
int OVResult = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j);
}, OVResult);
int answer = 0;
for(int i = ov.begin(0); i < ov.end(0); ++i) {
for(int j = ov.begin(1); j < ov.end(1); ++j) {
answer += constValue;
}
}
ASSERT_EQ(OVResult, answer) << "Bad data found in OffsetView";
#endif
{
offset_view_type ovCopy(ov);
ASSERT_EQ(ovCopy==ov, true) <<
"Copy constructor or equivalence operator broken";
}
{
offset_view_type ovAssigned = ov;
ASSERT_EQ(ovAssigned==ov, true) <<
"Assignment operator or equivalence operator broken";
}
{ //construct OffsetView from a View plus begins array
const int extent0 = 100;
const int extent1 = 200;
const int extent2 = 300;
Kokkos::View<Scalar***, Device> view3D("view3D", extent0, extent1, extent2);
Kokkos::deep_copy(view3D, 1);
Kokkos::Array<int64_t,3> begins = {{-10, -20, -30}};
Kokkos::Experimental::OffsetView<Scalar***, Device> offsetView3D(view3D, begins);
typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<3>, Kokkos::IndexType<int64_t> > range3_type;
typedef typename range3_type::point_type point3_type;
range3_type rangePolicy3DZero(point3_type{ {0, 0, 0 } },
point3_type{ { extent0, extent1, extent2 } });
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int view3DSum = 0;
Kokkos::parallel_reduce(rangePolicy3DZero, KOKKOS_LAMBDA(const int i, const int j, int k, int & updateMe){
updateMe += view3D(i, j, k);
}, view3DSum);
range3_type rangePolicy3D(point3_type{ {begins[0], begins[1], begins[2] } },
point3_type{ { begins[0] + extent0, begins[1] + extent1, begins[2] + extent2 } });
int offsetView3DSum = 0;
Kokkos::parallel_reduce(rangePolicy3D, KOKKOS_LAMBDA(const int i, const int j, int k, int & updateMe){
updateMe += offsetView3D(i, j, k);
}, offsetView3DSum);
ASSERT_EQ(view3DSum, offsetView3DSum) << "construction of OffsetView from View and begins array broken.";
#endif
}
view_type viewFromOV = ov.view();
ASSERT_EQ(viewFromOV == ov, true) <<
"OffsetView::view() or equivalence operator View == OffsetView broken";
{
offset_view_type ovFromV(viewFromOV, {-1, -2});
ASSERT_EQ(ovFromV == viewFromOV , true) <<
"Construction of OffsetView from View or equivalence operator OffsetView == View broken";
}
{
offset_view_type ovFromV = viewFromOV;
ASSERT_EQ(ovFromV == viewFromOV , true) <<
"Construction of OffsetView from View by assignment (implicit conversion) or equivalence operator OffsetView == View broken";
}
{// test offsetview to view deep copy
view_type aView("aView", ov.extent(0), ov.extent(1));
Kokkos::Experimental::deep_copy(aView, ov);
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int sum = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j) - aView(i- ov.begin(0), j-ov.begin(1));
}, sum);
ASSERT_EQ(sum, 0) << "deep_copy(view, offsetView) broken.";
#endif
}
{// test view to offsetview deep copy
view_type aView("aView", ov.extent(0), ov.extent(1));
Kokkos::deep_copy(aView, 99);
Kokkos::Experimental::deep_copy(ov, aView);
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int sum = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j) - aView(i- ov.begin(0), j-ov.begin(1));
}, sum);
ASSERT_EQ(sum, 0) << "deep_copy(offsetView, view) broken.";
#endif
}
}
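// A condensed sketch (not part of the test suite) of the OffsetView API
// exercised above; the function name is illustrative and index ranges are
// inclusive {first, last}:
template <typename Device>
void example_offsetview_usage()
{
  Kokkos::Experimental::OffsetView<double*, Device> ov("ov", {-5, 4});  // indices -5..4, extent 10
  auto h = Kokkos::Experimental::create_mirror_view(ov);
  for (int i = h.begin(0); i < h.end(0); ++i) h(i) = i;  // offset indexing on host
  Kokkos::Experimental::deep_copy(ov, h);                // copy mirror back to ov
}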
template <typename Scalar, typename Device>
void test_offsetview_subview(unsigned int size)
{
{//test subview 1
Kokkos::Experimental::OffsetView<Scalar*, Device> sliceMe("offsetToSlice", {-10, 20});
{
auto offsetSubviewa = Kokkos::Experimental::subview(sliceMe, 0);
ASSERT_EQ(offsetSubviewa.Rank, 0) << "subview of offset is broken.";
}
}
{//test subview 2
Kokkos::Experimental::OffsetView<Scalar**, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30});
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(),-2);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
}
{//test subview rank 3
Kokkos::Experimental::OffsetView<Scalar***, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30}, {-30,40});
//slice 1
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe,Kokkos::ALL(),Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe,Kokkos::ALL(), 0,Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe,0, Kokkos::ALL(),Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe,0, Kokkos::ALL(), Kokkos::make_pair(-30, -21));
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
ASSERT_EQ(offsetSubview.begin(0) , -20);
ASSERT_EQ(offsetSubview.end(0) , 31);
ASSERT_EQ(offsetSubview.begin(1) , 0);
ASSERT_EQ(offsetSubview.end(1) , 9);
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, Kokkos::IndexType<int> > range_type;
typedef typename range_type::point_type point_type;
const int b0 = offsetSubview.begin(0);
const int b1 = offsetSubview.begin(1);
const int e0 = offsetSubview.end(0);
const int e1 = offsetSubview.end(1);
range_type rangeP2D(point_type{ {b0, b1 } }, point_type{ { e0, e1} });
Kokkos::parallel_for(rangeP2D, KOKKOS_LAMBDA(const int i, const int j) {
offsetSubview(i,j) = 6;
}
);
int sum = 0;
Kokkos::parallel_reduce(rangeP2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += offsetSubview(i, j);
}, sum);
ASSERT_EQ(sum, 6*(e0-b0)*(e1-b1));
#endif
}
// slice 2
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
}
{//test subview rank 4
Kokkos::Experimental::OffsetView<Scalar****, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30}, {-30,40}, {-40, 50});
//slice 1
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(),Kokkos::ALL(), Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe ,Kokkos::ALL(), 0, Kokkos::ALL(),Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe , 0, Kokkos::ALL(), Kokkos::ALL(), Kokkos::ALL() );
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
// slice 2
auto offsetSubview2a = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview2a.Rank, 2) << "subview of offset is broken.";
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL(), Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
// slice 3
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
}
}
TEST_F( TEST_CATEGORY, offsetview_construction) {
test_offsetview_construction<int,TEST_EXECSPACE>(10);
}
TEST_F( TEST_CATEGORY, offsetview_subview) {
test_offsetview_subview<int,TEST_EXECSPACE>(10);
}
} // namespace Test
#endif /* CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_ */


@ -80,7 +80,9 @@ void test_scatter_view_config(int n)
Kokkos::Experimental::contribute(original_view, scatter_view);
}
#if defined( KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA )
Kokkos::fence();
auto host_view = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), original_view);
Kokkos::fence();
for (typename decltype(host_view)::size_type i = 0; i < host_view.extent(0); ++i) {
auto val0 = host_view(i, 0);
auto val1 = host_view(i, 1);
@ -111,9 +113,6 @@ struct TestDuplicatedScatterView {
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(n);
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterAtomic>(n);
}
};
@ -127,6 +126,16 @@ struct TestDuplicatedScatterView<Kokkos::Cuda> {
};
#endif
#ifdef KOKKOS_ENABLE_ROCM
// disable duplicated instantiation with ROCm until
// UniqueToken can support it
template <>
struct TestDuplicatedScatterView<Kokkos::Experimental::ROCm> {
TestDuplicatedScatterView(int) {
}
};
#endif
template <typename ExecSpace>
void test_scatter_view(int n)
{
@ -142,16 +151,28 @@ void test_scatter_view(int n)
Kokkos::Experimental::ScatterNonDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(n);
}
#ifdef KOKKOS_ENABLE_SERIAL
if (!std::is_same<ExecSpace, Kokkos::Serial>::value) {
#endif
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterNonDuplicated,
Kokkos::Experimental::ScatterAtomic>(n);
#ifdef KOKKOS_ENABLE_SERIAL
}
#endif
TestDuplicatedScatterView<ExecSpace> duptest(n);
}
TEST_F( TEST_CATEGORY, scatterview) {
#ifndef KOKKOS_ENABLE_ROCM
test_scatter_view<TEST_EXECSPACE>(10);
#ifdef KOKKOS_ENABLE_DEBUG
test_scatter_view<TEST_EXECSPACE>(100000);
#else
test_scatter_view<TEST_EXECSPACE>(10000000);
#endif
#endif
}
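// The contribution pattern under test, in sketch form; the ScatterView
// constructor-from-View and access() accessor are assumptions about this
// API version, while contribute() appears above:
//
//   Kokkos::Experimental::ScatterView<double*> scatter(original_view);
//   Kokkos::parallel_for(n, KOKKOS_LAMBDA(int i) {
//     auto access = scatter.access();
//     access(i % original_view.extent(0)) += 1.0;  // race-free accumulation
//   });
//   Kokkos::Experimental::contribute(original_view, scatter);  // fold back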
} // namespace Test


@ -46,6 +46,7 @@
#include <vector>
#include <Kokkos_StaticCrsGraph.hpp>
#include <Kokkos_Core.hpp>
/*--------------------------------------------------------------------------*/
namespace Test {


@ -0,0 +1,47 @@
/* (standard Kokkos v2.0 BSD license header, identical to the one shown above) */
#include<cuda/TestCuda_Category.hpp>
#include<TestOffsetView.hpp>


@ -0,0 +1,47 @@
/* (standard Kokkos v2.0 BSD license header, identical to the one shown above) */
#include<openmp/TestOpenMP_Category.hpp>
#include<TestOffsetView.hpp>


@ -60,6 +60,6 @@ protected:
} // namespace Test
#define TEST_CATEGORY rocm
#define TEST_EXECSPACE Kokkos::ROCm
#define TEST_EXECSPACE Kokkos::Experimental::ROCm
#endif


@ -0,0 +1,46 @@
/* (standard Kokkos v2.0 BSD license header, identical to the one shown above) */
#include<serial/TestSerial_Category.hpp>
#include<TestOffsetView.hpp>


@ -0,0 +1,47 @@
/* (standard Kokkos v2.0 BSD license header, identical to the one shown above) */
#include<threads/TestThreads_Category.hpp>
#include<TestOffsetView.hpp>


@ -108,3 +108,7 @@ else()
endif()
#-----------------------------------------------------------------------------
# build and install pkgconfig file
CONFIGURE_FILE(kokkos.pc.in kokkos.pc @ONLY)
INSTALL(FILES ${CMAKE_CURRENT_BINARY_DIR}/kokkos.pc DESTINATION lib/pkgconfig)


@ -208,7 +208,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {
if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -264,7 +264,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {
if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -321,7 +321,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {
if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -370,7 +370,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {
if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {


@ -453,6 +453,8 @@ SharedAllocationRecord( const Kokkos::CudaSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
// Copy to device memory
Kokkos::Impl::DeepCopy<CudaSpace,HostSpace>( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
@ -491,6 +493,9 @@ SharedAllocationRecord( const Kokkos::CudaUVMSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}
SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
@ -525,6 +530,8 @@ SharedAllocationRecord( const Kokkos::CudaHostPinnedSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}
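// Why the explicit terminator above is needed (standard strncpy behavior):
// strncpy writes no trailing '\0' when the source is at least as long as
// the bound. Sketch:
//   char buf[8];
//   strncpy(buf, "a_rather_long_label", sizeof(buf));  // no '\0' written
//   buf[sizeof(buf) - 1] = '\0';                       // now a valid C string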
//----------------------------------------------------------------------------


@ -689,9 +689,13 @@ Cuda::size_type cuda_internal_multiprocessor_count()
CudaSpace::size_type cuda_internal_maximum_concurrent_block_count()
{
#if defined(KOKKOS_ARCH_KEPLER)
// Compute capability 3.0 through 3.7
enum : int { max_resident_blocks_per_multiprocessor = 16 };
#else
// Compute capability 5.0 through 6.2
enum : int { max_resident_blocks_per_multiprocessor = 32 };
#endif
return CudaInternal::singleton().m_multiProcCount
* max_resident_blocks_per_multiprocessor ;
};
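// Worked example of the bound above (hardware counts for illustration only):
// a 13-SM Kepler K20 allows 13 * 16 = 208 resident blocks, while a 56-SM
// Pascal P100 allows 56 * 32 = 1792.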


@ -52,22 +52,22 @@
namespace Kokkos { namespace Impl {
template<class DriverType, bool Large>
template<class DriverType, class LaunchBounds, bool Large>
struct CudaGetMaxBlockSize;
template<class DriverType, bool Large = (CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>
template<class DriverType, class LaunchBounds>
int cuda_get_max_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
return CudaGetMaxBlockSize<DriverType,Large>::get_block_size(f,vector_length, shmem_extra_block,shmem_extra_thread);
return CudaGetMaxBlockSize<DriverType,LaunchBounds,(CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>::get_block_size(f,vector_length, shmem_extra_block,shmem_extra_thread);
}
template<class DriverType>
struct CudaGetMaxBlockSize<DriverType,true> {
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks;
int blockSize=32;
int blockSize=1024;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
@ -76,8 +76,9 @@ struct CudaGetMaxBlockSize<DriverType,true> {
blockSize,
sharedmem);
while (blockSize<1024 && numBlocks>0) {
blockSize*=2;
if(numBlocks>0) return blockSize;
while (blockSize>32 && numBlocks==0) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
@ -87,19 +88,30 @@ struct CudaGetMaxBlockSize<DriverType,true> {
blockSize,
sharedmem);
}
if(numBlocks>0) return blockSize;
else return blockSize/2;
int blockSizeUpperBound = blockSize*2;
while (blockSize<blockSizeUpperBound && numBlocks>0) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
return blockSize - 32;
}
};
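The rewritten search above no longer grows upward from 32 threads; it starts at the 1024-thread hardware maximum, halves until the occupancy query reports at least one resident block, then probes upward in warp-sized (32-thread) steps and returns the last size that still fit. A self-contained sketch of that control flow, with cudaOccupancyMaxActiveBlocksPerMultiprocessor replaced by a toy occupancy model so the sketch runs on any host:

```cpp
#include <cstdio>
#include <functional>

// blocks_for(blockSize) stands in for the CUDA occupancy query; it returns
// the number of resident blocks per SM for a candidate block size.
int sketch_max_block_size(const std::function<int(int)>& blocks_for) {
  int blockSize = 1024;
  int numBlocks = blocks_for(blockSize);
  if (numBlocks > 0) return blockSize;        // 1024 threads already fit

  while (blockSize > 32 && numBlocks == 0) {  // halve until feasible
    blockSize /= 2;
    numBlocks = blocks_for(blockSize);
  }

  const int upperBound = blockSize * 2;       // probe upward by warps
  while (blockSize < upperBound && numBlocks > 0) {
    blockSize += 32;
    numBlocks = blocks_for(blockSize);
  }
  return blockSize - 32;                      // last feasible size
}

int main() {
  // Toy model: shared-memory use makes anything above 736 threads infeasible.
  auto model = [](int bs) { return bs <= 736 ? 1 : 0; };
  printf("max block size: %d\n", sketch_max_block_size(model));  // 736
}
```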
template<class DriverType>
struct CudaGetMaxBlockSize<DriverType,false> {
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks;
int blockSize=32;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
unsigned int blockSize=1024;
unsigned int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
@ -107,8 +119,9 @@ struct CudaGetMaxBlockSize<DriverType,false> {
blockSize,
sharedmem);
while (blockSize<1024 && numBlocks>0) {
blockSize*=2;
if(numBlocks>0) return blockSize;
while (blockSize>32 && numBlocks==0) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
@ -118,24 +131,121 @@ struct CudaGetMaxBlockSize<DriverType,false> {
blockSize,
sharedmem);
}
if(numBlocks>0) return blockSize;
else return blockSize/2;
unsigned int blockSizeUpperBound = blockSize*2;
while (blockSize<blockSizeUpperBound && numBlocks>0) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
return blockSize - 32;
}
};
template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<MaxThreadsPerBlock,MinBlocksPerSM>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks = 0, oldNumBlocks = 0;
unsigned int blockSize=MaxThreadsPerBlock;
unsigned int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) return blockSize;
while (blockSize>32 && static_cast<unsigned int>(numBlocks)<MinBlocksPerSM) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
unsigned int blockSizeUpperBound = (blockSize*2<MaxThreadsPerBlock?blockSize*2:MaxThreadsPerBlock);
while (blockSize<blockSizeUpperBound && static_cast<unsigned int>(numBlocks)>MinBlocksPerSM) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
oldNumBlocks = numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
if(static_cast<unsigned int>(oldNumBlocks)>=MinBlocksPerSM) return blockSize - 32;
return -1;
}
};
template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<MaxThreadsPerBlock,MinBlocksPerSM>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks = 0, oldNumBlocks = 0;
unsigned int blockSize=MaxThreadsPerBlock;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) return blockSize;
while (blockSize>32 && static_cast<unsigned int>(numBlocks)<MinBlocksPerSM) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
unsigned int blockSizeUpperBound = (blockSize*2<MaxThreadsPerBlock?blockSize*2:MaxThreadsPerBlock);
while (blockSize<blockSizeUpperBound && static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
oldNumBlocks = numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
if(static_cast<unsigned int>(oldNumBlocks)>=MinBlocksPerSM) return blockSize - 32;
return -1;
}
};
template<class DriverType, bool Large>
template<class DriverType, class LaunchBounds, bool Large>
struct CudaGetOptBlockSize;
template<class DriverType, bool Large = (CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>
template<class DriverType, class LaunchBounds>
int cuda_get_opt_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
return CudaGetOptBlockSize<DriverType,Large>::get_block_size(f,vector_length,shmem_extra_block,shmem_extra_thread);
return CudaGetOptBlockSize<DriverType,LaunchBounds,(CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>::get_block_size(f,vector_length,shmem_extra_block,shmem_extra_thread);
}
template<class DriverType>
struct CudaGetOptBlockSize<DriverType,true> {
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds<>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
@ -165,7 +275,7 @@ struct CudaGetOptBlockSize<DriverType,true> {
};
template<class DriverType>
struct CudaGetOptBlockSize<DriverType,false> {
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds<>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
@ -194,6 +304,75 @@ struct CudaGetOptBlockSize<DriverType,false> {
}
};
template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds< MaxThreadsPerBlock, MinBlocksPerSM >,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
int numBlocks;
int sharedmem;
int maxOccupancy=0;
int bestBlockSize=0;
int max_threads_per_block = std::min(MaxThreadsPerBlock,cuda_internal_maximum_warp_count()*CudaTraits::WarpSize);
while(blockSize < max_threads_per_block ) {
blockSize*=2;
// Calculate the occupancy for this candidate block size and check whether it is larger than the largest found so far
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(numBlocks >= int(MinBlocksPerSM) && blockSize<=int(MaxThreadsPerBlock)) {
if(maxOccupancy < numBlocks*blockSize) {
maxOccupancy = numBlocks*blockSize;
bestBlockSize = blockSize;
}
}
}
if(maxOccupancy > 0)
return bestBlockSize;
return -1;
}
};
template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds< MaxThreadsPerBlock, MinBlocksPerSM >,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
int numBlocks;
int sharedmem;
int maxOccupancy=0;
int bestBlockSize=0;
int max_threads_per_block = std::min(MaxThreadsPerBlock,cuda_internal_maximum_warp_count()*CudaTraits::WarpSize);
while(blockSize < max_threads_per_block ) {
blockSize*=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(numBlocks >= int(MinBlocksPerSM) && blockSize<=int(MaxThreadsPerBlock)) {
if(maxOccupancy < numBlocks*blockSize) {
maxOccupancy = numBlocks*blockSize;
bestBlockSize = blockSize;
}
}
}
if(maxOccupancy > 0)
return bestBlockSize;
return -1;
}
};
}} // namespace Kokkos::Impl
#endif // KOKKOS_ENABLE_CUDA
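The CudaGetOptBlockSize specializations above all share one selection rule: among candidate block sizes (doubling from 32 up to the hardware or LaunchBounds limit), pick the one that maximizes numBlocks*blockSize, i.e. resident threads per SM; the LaunchBounds variants additionally reject candidates below MinBlocksPerSM and return -1 when nothing qualifies. A host-side sketch of that rule under an assumed occupancy model:

```cpp
#include <algorithm>
#include <cstdio>

// Toy occupancy model (assumption): resident blocks limited by a
// 2048-thread budget per SM.
int blocks_for(int blockSize) { return 2048 / blockSize; }

int opt_block_size(int MaxThreadsPerBlock, int MinBlocksPerSM,
                   int hardware_max_threads) {
  int blockSize = 16, maxOccupancy = 0, bestBlockSize = -1;
  const int limit = std::min(MaxThreadsPerBlock, hardware_max_threads);
  while (blockSize < limit) {
    blockSize *= 2;  // candidates: 32, 64, ..., limit
    const int numBlocks = blocks_for(blockSize);
    if (numBlocks >= MinBlocksPerSM && blockSize <= MaxThreadsPerBlock &&
        maxOccupancy < numBlocks * blockSize) {  // most resident threads wins
      maxOccupancy = numBlocks * blockSize;
      bestBlockSize = blockSize;
    }
  }
  return bestBlockSize;  // -1 when MinBlocksPerSM can never be met
}

int main() {
  // Ties keep the first (smallest) candidate, as in the code above.
  printf("opt block size: %d\n", opt_block_size(512, 4, 2048));  // 32
}
```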

View File

@ -148,6 +148,9 @@ namespace Kokkos {
namespace Impl {
namespace {
static int lock_array_copied = 0;
inline int eliminate_warning_for_lock_array() {
return lock_array_copied;
}
}
}
}

View File

@ -60,6 +60,7 @@
#include <Cuda/Kokkos_Cuda_Internal.hpp>
#include <Cuda/Kokkos_Cuda_Locks.hpp>
#include <Kokkos_Vectorization.hpp>
#include <Cuda/Kokkos_Cuda_Version_9_8_Compatibility.hpp>
#if defined(KOKKOS_ENABLE_PROFILING)
#include <impl/Kokkos_Profiling_Interface.hpp>
@ -114,6 +115,7 @@ public:
//----------------------------------------
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & functor )
@ -131,7 +133,35 @@ public:
return n ;
}
#endif
template<class FunctorType>
int team_size_max( const FunctorType& f, const ParallelForTag& ) const {
typedef Impl::ParallelFor< FunctorType , TeamPolicy<Properties...> > closure_type;
int block_size = Kokkos::Impl::cuda_get_max_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) );
return block_size/vector_length();
}
template<class FunctorType>
int team_size_max( const FunctorType& f, const ParallelReduceTag& ) const {
typedef Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,TeamPolicyInternal,FunctorType> functor_analysis_type;
typedef typename Impl::ParallelReduceReturnValue<void,typename functor_analysis_type::value_type,FunctorType>::reducer_type reducer_type;
typedef Impl::ParallelReduce< FunctorType , TeamPolicy<Properties...>, reducer_type > closure_type;
typedef Impl::FunctorValueTraits< FunctorType , typename traits::work_tag > functor_value_traits;
int block_size = Kokkos::Impl::cuda_get_max_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) +
((functor_value_traits::StaticValueSize!=0)?0:functor_value_traits::value_size( f )));
// Currently we require Power-of-2 team size for reductions.
int p2 = 1;
while(p2<=block_size) p2*=2;
p2/=2;
return p2/vector_length();
}
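The p2 loop at the end of team_size_max above rounds the occupancy-derived block size down to a power of two before dividing by the vector length, since reductions currently require power-of-two team sizes. The same computation in isolation:

```cpp
#include <cstdio>

// Mirrors the p2 loop in team_size_max(..., ParallelReduceTag):
// largest power of two <= n (assumes n >= 1).
int round_down_pow2(int n) {
  int p2 = 1;
  while (p2 <= n) p2 *= 2;
  return p2 / 2;
}

int main() {
  printf("%d %d %d\n",
         round_down_pow2(1024),  // 1024
         round_down_pow2(700),   // 512
         round_down_pow2(33));   // 32
}
```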
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
static int team_size_recommended( const FunctorType & functor )
{ return team_size_max( functor ); }
@ -143,11 +173,41 @@ public:
if(max<1) max = 1;
return max;
}
#endif
template<class FunctorType>
int team_size_recommended( const FunctorType& f, const ParallelForTag& ) const {
typedef Impl::ParallelFor< FunctorType , TeamPolicy<Properties...> > closure_type;
int block_size = Kokkos::Impl::cuda_get_opt_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double));
return block_size/vector_length();
}
template<class FunctorType>
int team_size_recommended( const FunctorType& f, const ParallelReduceTag& ) const {
typedef Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,TeamPolicyInternal,FunctorType> functor_analysis_type;
typedef typename Impl::ParallelReduceReturnValue<void,typename functor_analysis_type::value_type,FunctorType>::reducer_type reducer_type;
typedef Impl::ParallelReduce< FunctorType , TeamPolicy<Properties...>, reducer_type > closure_type;
typedef Impl::FunctorValueTraits< FunctorType , typename traits::work_tag > functor_value_traits;
int block_size = Kokkos::Impl::cuda_get_opt_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) +
((functor_value_traits::StaticValueSize!=0)?0:functor_value_traits::value_size( f )));
return block_size/vector_length();
}
inline static
int vector_length_max()
{ return Impl::CudaTraits::WarpSize; }
inline static
int scratch_size_max(int level)
{ return (level==0?
1024*40: // 48kB is the max for CUDA, but we need some for team_member.reduce etc.
20*1024*1024); // arbitrarily set to 20MB; on a Volta V100 that gives about 3.2GB for 2 teams per SM
}
//----------------------------------------
inline int vector_length() const { return m_vector_length ; }
@ -419,7 +479,7 @@ public:
void execute() const
{
const typename Policy::index_type nwork = m_policy.end() - m_policy.begin();
const int block_size = Kokkos::Impl::cuda_get_opt_block_size< ParallelFor >( m_functor , 1, 0 , 0 );
const int block_size = Kokkos::Impl::cuda_get_opt_block_size< ParallelFor, LaunchBounds>( m_functor , 1, 0 , 0 );
const dim3 block( 1 , block_size , 1);
const dim3 grid( std::min( typename Policy::index_type(( nwork + block.y - 1 ) / block.y) , typename Policy::index_type(cuda_internal_maximum_grid_count()) ) , 1 , 1);
@ -654,7 +714,7 @@ public:
: m_functor( arg_functor )
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelFor >( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length() )
Kokkos::Impl::cuda_get_opt_block_size< ParallelFor, LaunchBounds >( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
, m_shmem_begin( sizeof(double) * ( m_team_size + 2 ) )
, m_shmem_size( arg_policy.scratch_size(0,m_team_size) + FunctorTeamShmemSize< FunctorType >::value( m_functor , m_team_size ) )
@ -670,7 +730,7 @@ public:
}
if ( int(m_team_size) >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelFor >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelFor, LaunchBounds >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelFor< Cuda > requested too large team size."));
}
@ -725,12 +785,13 @@ public:
const Policy m_policy ;
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
const bool m_result_ptr_device_accessible ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;
// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit)
enum { UseShflReduction = false };//((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
// Some crutch to do function overloading
private:
typedef double DummyShflReductionType;
@ -752,12 +813,12 @@ public:
__device__ inline
void operator() () const {
run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
/* run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
}
__device__ inline
void run(const DummySHMEMReductionType& ) const
{
{*/
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) / sizeof(size_type) );
@ -786,7 +847,8 @@ public:
// This is the final block with the final result at the final threads' location
size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
( m_unified_space ? m_unified_space : m_scratch_space );
if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
@ -798,10 +860,9 @@ public:
}
}
__device__ inline
/* __device__ inline
void run(const DummyShflReductionType&) const
{
value_type value;
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , &value);
// Number of blocks is bounded so that the reduction can be limited to two passes.
@ -832,7 +893,7 @@ public:
*result = value;
}
}
}
}*/
// Determine block size constrained by shared memory:
static inline
@ -863,6 +924,7 @@ public:
CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute
if(!m_result_ptr_device_accessible) {
Cuda::fence();
if ( m_result_ptr ) {
@ -876,6 +938,7 @@ public:
}
}
}
}
else {
if (m_result_ptr) {
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , m_result_ptr );
@ -883,17 +946,18 @@ public:
}
}
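The new m_result_ptr_device_accessible member, initialized in the constructors below, records at construction time whether the result View's memory space is reachable from CudaSpace; when it is, the kernel stores the final value straight into the user's View and execute() skips the fence and host-side copy-back. The accessibility test in isolation, as a sketch (requires a CUDA-enabled Kokkos build; this mirrors the member initializer, it is not a new API):

```cpp
#include <Kokkos_Core.hpp>

// True for Views in CudaSpace or CudaUVMSpace, false for plain HostSpace.
template <class ViewType>
constexpr bool result_is_device_accessible() {
  return Kokkos::Impl::MemorySpaceAccess<
      Kokkos::CudaSpace, typename ViewType::memory_space>::accessible;
}
```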
template< class HostViewType >
template< class ViewType >
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const HostViewType & arg_result
, const ViewType & arg_result
, typename std::enable_if<
Kokkos::is_view< HostViewType >::value
Kokkos::is_view< ViewType >::value
,void*>::type = NULL)
: m_functor( arg_functor )
, m_policy( arg_policy )
, m_reducer( InvalidType() )
, m_result_ptr( arg_result.data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -906,6 +970,7 @@ public:
, m_policy( arg_policy )
, m_reducer( reducer )
, m_result_ptr( reducer.view().data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -953,6 +1018,7 @@ public:
const Policy m_policy ; // used for workrange and nwork
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
const bool m_result_ptr_device_accessible ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;
@ -960,7 +1026,7 @@ public:
typedef typename Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, typename Policy::work_tag, reference_type> DeviceIteratePattern;
// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && (ValueTraits::StaticValueSize!=0)) };
// Some crutch to do function overloading
private:
typedef double DummyShflReductionType;
@ -978,12 +1044,12 @@ public:
inline
__device__
void operator() (void) const {
run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
/* run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
}
__device__ inline
void run(const DummySHMEMReductionType& ) const
{
{*/
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) / sizeof(size_type) );
@ -1007,7 +1073,8 @@ public:
// This is the final block with the final result at the final threads' location
size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
( m_unified_space ? m_unified_space : m_scratch_space );
if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
@ -1019,7 +1086,7 @@ public:
}
}
__device__ inline
/* __device__ inline
void run(const DummyShflReductionType&) const
{
@ -1051,7 +1118,7 @@ public:
}
}
}
*/
// Determine block size constrained by shared memory:
static inline
unsigned local_block_size( const FunctorType & f )
@ -1089,6 +1156,7 @@ public:
CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute
if(!m_result_ptr_device_accessible) {
Cuda::fence();
if ( m_result_ptr ) {
@ -1102,6 +1170,7 @@ public:
}
}
}
}
else {
if (m_result_ptr) {
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , m_result_ptr );
@ -1109,17 +1178,18 @@ public:
}
}
template< class HostViewType >
template< class ViewType >
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const HostViewType & arg_result
, const ViewType & arg_result
, typename std::enable_if<
Kokkos::is_view< HostViewType >::value
Kokkos::is_view< ViewType >::value
,void*>::type = NULL)
: m_functor( arg_functor )
, m_policy( arg_policy )
, m_reducer( InvalidType() )
, m_result_ptr( arg_result.data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -1132,6 +1202,7 @@ public:
, m_policy( arg_policy )
, m_reducer( reducer )
, m_result_ptr( reducer.view().data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -1174,7 +1245,7 @@ public:
typedef FunctorType functor_type ;
typedef Cuda::size_type size_type ;
enum { UseShflReduction = (true && ValueTraits::StaticValueSize) };
enum { UseShflReduction = (true && (ValueTraits::StaticValueSize!=0)) };
private:
typedef double DummyShflReductionType;
@ -1191,6 +1262,7 @@ private:
const FunctorType m_functor ;
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
const bool m_result_ptr_device_accessible ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;
@ -1279,7 +1351,8 @@ public:
// This is the final block with the final result at the final threads' location
size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
( m_unified_space ? m_unified_space : m_scratch_space );
if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
@ -1312,12 +1385,18 @@ public:
, value );
}
pointer_type const result = (pointer_type) (m_unified_space ? m_unified_space : m_scratch_space) ;
pointer_type const result = m_result_ptr_device_accessible? m_result_ptr :
(pointer_type) ( m_unified_space ? m_unified_space : m_scratch_space );
value_type init;
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , &init);
if(Impl::cuda_inter_block_reduction<FunctorType,ValueJoin,WorkTag>
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,blockDim.y)) {
if(
Impl::cuda_inter_block_reduction<FunctorType,ValueJoin,WorkTag>
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,blockDim.y)
//This breaks a test
// Kokkos::Impl::CudaReductionsFunctor<FunctorType,WorkTag,false,true>::scalar_inter_block_reduction(ReducerConditional::select(m_functor , m_reducer) , blockIdx.x , gridDim.x ,
// kokkos_impl_cuda_shared_memory<size_type>() , m_scratch_space , m_scratch_flags)
) {
const unsigned id = threadIdx.y*blockDim.x + threadIdx.x;
if(id==0) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
@ -1331,7 +1410,7 @@ public:
{
const int nwork = m_league_size * m_team_size ;
if ( nwork ) {
const int block_count = UseShflReduction? std::min( m_league_size , size_type(1024) )
const int block_count = UseShflReduction? std::min( m_league_size , size_type(1024*32) )
:std::min( m_league_size , m_team_size );
m_scratch_space = cuda_internal_scratch_space( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) * block_count );
@ -1344,6 +1423,7 @@ public:
CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem_size_total ); // copy to device and execute
if(!m_result_ptr_device_accessible) {
Cuda::fence();
if ( m_result_ptr ) {
@ -1357,6 +1437,7 @@ public:
}
}
}
}
else {
if (m_result_ptr) {
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , m_result_ptr );
@ -1364,16 +1445,17 @@ public:
}
}
template< class HostViewType >
template< class ViewType >
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const HostViewType & arg_result
, const ViewType & arg_result
, typename std::enable_if<
Kokkos::is_view< HostViewType >::value
Kokkos::is_view< ViewType >::value
,void*>::type = NULL)
: m_functor( arg_functor )
, m_reducer( InvalidType() )
, m_result_ptr( arg_result.data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -1383,17 +1465,17 @@ public:
, m_scratch_ptr{NULL,NULL}
, m_scratch_size{
arg_policy.scratch_size(0,( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
), arg_policy.scratch_size(1,( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
)}
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
@ -1430,9 +1512,7 @@ public:
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too much L0 scratch memory"));
}
if ( unsigned(m_team_size) >
unsigned(Kokkos::Impl::cuda_get_max_block_size< ParallelReduce >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
if ( int(m_team_size) > arg_policy.team_size_max(m_functor,ParallelReduceTag()) ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too large team size."));
}
@ -1444,6 +1524,7 @@ public:
: m_functor( arg_functor )
, m_reducer( reducer )
, m_result_ptr( reducer.view().data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -1453,7 +1534,7 @@ public:
, m_scratch_ptr{NULL,NULL}
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
@ -1486,10 +1567,7 @@ public:
CudaTraits::SharedMemoryCapacity < shmem_size_total ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > bad team size"));
}
if ( int(m_team_size) >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelReduce >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
if ( int(m_team_size) > arg_policy.team_size_max(m_functor,ParallelReduceTag()) ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too large team size."));
}
@ -1753,7 +1831,7 @@ public:
// Occupancy calculator assumes whole block.
m_team_size =
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >
( arg_functor
, arg_policy.vector_length()
, arg_policy.team_scratch_size(0)
@ -1970,7 +2048,9 @@ private:
const WorkRange range( m_policy , blockIdx.x , gridDim.x );
for ( typename Policy::member_type iwork_base = range.begin(); iwork_base < range.end() ; iwork_base += blockDim.y ) {
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned MASK=KOKKOS_IMPL_CUDA_ACTIVEMASK;
#endif
const typename Policy::member_type iwork = iwork_base + threadIdx.y ;
__syncthreads(); // Don't overwrite previous iteration values until they are used
@ -1981,7 +2061,11 @@ private:
for ( unsigned i = threadIdx.y ; i < word_count.value ; ++i ) {
shared_data[i + word_count.value] = shared_data[i] = shared_accum[i] ;
}
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(MASK);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); } // Protect against large scan values.
// Call functor to accumulate inclusive scan value for this work item
@ -2189,6 +2273,9 @@ private:
const WorkRange range( m_policy , blockIdx.x , gridDim.x );
for ( typename Policy::member_type iwork_base = range.begin(); iwork_base < range.end() ; iwork_base += blockDim.y ) {
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned MASK=KOKKOS_IMPL_CUDA_ACTIVEMASK;
#endif
const typename Policy::member_type iwork = iwork_base + threadIdx.y ;
@ -2201,6 +2288,11 @@ private:
shared_data[i + word_count.value] = shared_data[i] = shared_accum[i] ;
}
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(MASK);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); } // Protect against large scan values.
// Call functor to accumulate inclusive scan value for this work item

View File

@ -194,8 +194,9 @@ void cuda_shfl_up( T & out , T const & in , int delta ,
*/
template< class ValueType , class JoinOp>
__device__
inline void cuda_intra_warp_reduction( ValueType& result,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_intra_warp_reduction( ValueType& result,
const JoinOp& join,
const uint32_t max_active_thread = blockDim.y) {
@ -214,8 +215,9 @@ inline void cuda_intra_warp_reduction( ValueType& result,
}
template< class ValueType , class JoinOp>
__device__
inline void cuda_inter_warp_reduction( ValueType& value,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_inter_warp_reduction( ValueType& value,
const JoinOp& join,
const int max_active_thread = blockDim.y) {
@ -247,8 +249,9 @@ inline void cuda_inter_warp_reduction( ValueType& value,
}
template< class ValueType , class JoinOp>
__device__
inline void cuda_intra_block_reduction( ValueType& value,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_intra_block_reduction( ValueType& value,
const JoinOp& join,
const int max_active_thread = blockDim.y) {
cuda_intra_warp_reduction(value,join,max_active_thread);
@ -314,31 +317,52 @@ bool cuda_inter_block_reduction( typename FunctorValueTraits< FunctorType , ArgT
if( id + 1 < int(gridDim.x) )
join(value, tmp);
}
int active = KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned int mask = KOKKOS_IMPL_CUDA_ACTIVEMASK;
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
}
}
//The last block has in its thread=0 the global reduction value through "value"
@ -478,31 +502,52 @@ cuda_inter_block_reduction( const ReducerType& reducer,
if( id + 1 < int(gridDim.x) )
reducer.join(value, tmp);
}
int active = KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned int mask = KOKKOS_IMPL_CUDA_ACTIVEMASK;
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
}
}
@ -513,6 +558,213 @@ cuda_inter_block_reduction( const ReducerType& reducer,
#endif
}
template<class FunctorType, class ArgTag, bool DoScan, bool UseShfl>
struct CudaReductionsFunctor;
template<class FunctorType, class ArgTag>
struct CudaReductionsFunctor<FunctorType, ArgTag, false, true> {
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type Scalar;
__device__
static inline void scalar_intra_warp_reduction(
const FunctorType& functor,
Scalar value, // Contribution
const bool skip_vector, // Skip threads if Kokkos vector lanes are not part of the reduction
const int width, // How much of the warp participates
Scalar& result)
{
unsigned mask = width==32?0xffffffff:((1<<width)-1)<<((threadIdx.y*blockDim.x+threadIdx.x)%(32/width))*width;
for(int delta=skip_vector?blockDim.x:1; delta<width; delta*=2) {
Scalar tmp;
cuda_shfl_down(tmp,value,delta,width,mask);
ValueJoin::join( functor , &value, &tmp);
}
cuda_shfl(result,value,0,width,mask);
}
__device__
static inline void scalar_intra_block_reduction(
const FunctorType& functor,
Scalar value,
const bool skip,
Scalar* my_global_team_buffer_element,
const int shared_elements,
Scalar* shared_team_buffer_element) {
const int warp_id = (threadIdx.y*blockDim.x)/32;
Scalar* const my_shared_team_buffer_element =
shared_team_buffer_element + warp_id%shared_elements;
// Warp Level Reduction, ignoring Kokkos vector entries
scalar_intra_warp_reduction(functor,value,skip,32,value);
if(warp_id<shared_elements) {
*my_shared_team_buffer_element=value;
}
// Wait for every warp to be done before using one warp to do final cross warp reduction
__syncthreads();
const int num_warps = blockDim.x*blockDim.y/32;
for(int w = shared_elements; w<num_warps; w+=shared_elements) {
if(warp_id>=w && warp_id<w+shared_elements) {
if((threadIdx.y*blockDim.x + threadIdx.x)%32==0)
ValueJoin::join( functor , my_shared_team_buffer_element, &value);
}
__syncthreads();
}
if( warp_id == 0) {
ValueInit::init( functor , &value );
for(unsigned int i=threadIdx.y*blockDim.x+threadIdx.x; i<blockDim.y*blockDim.x/32; i+=32)
ValueJoin::join( functor , &value,&shared_team_buffer_element[i]);
scalar_intra_warp_reduction(functor,value,false,32,*my_global_team_buffer_element);
}
}
__device__
static inline bool scalar_inter_block_reduction(
const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags ) {
Scalar* const global_team_buffer_element = ((Scalar*) global_data);
Scalar* const my_global_team_buffer_element = global_team_buffer_element + blockIdx.x;
Scalar* shared_team_buffer_elements = ((Scalar*) shared_data);
Scalar value = shared_team_buffer_elements[threadIdx.y];
int shared_elements=blockDim.x*blockDim.y/32;
int global_elements=block_count;
__syncthreads();
scalar_intra_block_reduction(functor,value,true,my_global_team_buffer_element,shared_elements,shared_team_buffer_elements);
__syncthreads();
unsigned int num_teams_done = 0;
if(threadIdx.x + threadIdx.y == 0) {
__threadfence();
num_teams_done = Kokkos::atomic_fetch_add(global_flags,1)+1;
}
bool is_last_block = false;
if(__syncthreads_or(num_teams_done == gridDim.x)) {
is_last_block=true;
*global_flags = 0;
ValueInit::init( functor, &value);
for(int i=threadIdx.y*blockDim.x+threadIdx.x; i<global_elements; i+=blockDim.x*blockDim.y) {
ValueJoin::join( functor , &value,&global_team_buffer_element[i]);
}
scalar_intra_block_reduction(functor,value,false,shared_team_buffer_elements+(blockDim.y-1),shared_elements,shared_team_buffer_elements);
}
return is_last_block;
}
};
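scalar_intra_warp_reduction above is a shuffle-down tree: each step pulls a partner value from delta lanes away, joins it, and doubles delta; lane 0 then holds the warp total, which cuda_shfl broadcasts back. A stand-alone CUDA sketch of the same tree for a full 32-lane warp, using the raw __shfl_down_sync intrinsic (CUDA 9+) instead of the Kokkos wrappers:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level sum via the shuffle-down tree used above (full-warp case).
__global__ void warp_sum(const double* in, double* out) {
  double value = in[threadIdx.x];
  const unsigned mask = 0xffffffffu;             // all 32 lanes participate
  for (int delta = 1; delta < 32; delta *= 2) {  // 1, 2, 4, 8, 16
    value += __shfl_down_sync(mask, value, delta, 32);
  }
  if (threadIdx.x == 0) *out = value;            // lane 0 holds the total
}

int main() {
  double h_in[32], h_out = 0.0, *d_in, *d_out;
  for (int i = 0; i < 32; ++i) h_in[i] = 1.0;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(double));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
  warp_sum<<<1, 32>>>(d_in, d_out);
  cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
  printf("%f\n", h_out);  // 32.0
}
```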
template<class FunctorType, class ArgTag>
struct CudaReductionsFunctor<FunctorType, ArgTag, false, false> {
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type Scalar;
__device__
static inline void scalar_intra_warp_reduction(
const FunctorType& functor,
Scalar* value, // Contribution
const bool skip_vector, // Skip threads if Kokkos vector lanes are not part of the reduction
const int width) // How much of the warp participates
{
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned mask = width==32?0xffffffff:((1<<width)-1)<<((threadIdx.y*blockDim.x+threadIdx.x)%(32/width))*width;
#endif
const int lane_id = (threadIdx.y*blockDim.x+threadIdx.x)%32;
for(int delta=skip_vector?blockDim.x:1; delta<width; delta*=2) {
if(lane_id + delta<32) {
ValueJoin::join( functor , value, value+delta);
}
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(mask);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
}
*value=*(value-lane_id);
}
__device__
static inline void scalar_intra_block_reduction(
const FunctorType& functor,
Scalar value,
const bool skip,
Scalar* result,
const int shared_elements,
Scalar* shared_team_buffer_element) {
const int warp_id = (threadIdx.y*blockDim.x)/32;
Scalar* const my_shared_team_buffer_element =
shared_team_buffer_element + threadIdx.y*blockDim.x+threadIdx.x;
*my_shared_team_buffer_element = value;
// Warp Level Reduction, ignoring Kokkos vector entries
scalar_intra_warp_reduction(functor,my_shared_team_buffer_element,skip,32);
// Wait for every warp to be done before using one warp to do final cross warp reduction
__syncthreads();
if( warp_id == 0) {
const unsigned int delta = (threadIdx.y*blockDim.x+threadIdx.x)*32;
if(delta<blockDim.x*blockDim.y)
*my_shared_team_buffer_element = shared_team_buffer_element[delta];
KOKKOS_IMPL_CUDA_SYNCWARP;
scalar_intra_warp_reduction(functor,my_shared_team_buffer_element,false,blockDim.x*blockDim.y/32);
if(threadIdx.x + threadIdx.y == 0) *result = *shared_team_buffer_element;
}
}
__device__
static inline bool scalar_inter_block_reduction(
const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags ) {
Scalar* const global_team_buffer_element = ((Scalar*) global_data);
Scalar* const my_global_team_buffer_element = global_team_buffer_element + blockIdx.x;
Scalar* shared_team_buffer_elements = ((Scalar*) shared_data);
Scalar value = shared_team_buffer_elements[threadIdx.y];
int shared_elements=blockDim.x*blockDim.y/32;
int global_elements=block_count;
__syncthreads();
scalar_intra_block_reduction(functor,value,true,my_global_team_buffer_element,shared_elements,shared_team_buffer_elements);
__syncthreads();
unsigned int num_teams_done = 0;
if(threadIdx.x + threadIdx.y == 0) {
__threadfence();
num_teams_done = Kokkos::atomic_fetch_add(global_flags,1)+1;
}
bool is_last_block = false;
if(__syncthreads_or(num_teams_done == gridDim.x)) {
is_last_block=true;
*global_flags = 0;
ValueInit::init( functor, &value);
for(int i=threadIdx.y*blockDim.x+threadIdx.x; i<global_elements; i+=blockDim.x*blockDim.y) {
ValueJoin::join( functor , &value,&global_team_buffer_element[i]);
}
scalar_intra_block_reduction(functor,value,false,shared_team_buffer_elements+(blockDim.y-1),shared_elements,shared_team_buffer_elements);
}
return is_last_block;
}
};
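Both specializations finish with the same inter-block handshake in scalar_inter_block_reduction: publish the block's partial with __threadfence(), atomically bump a global counter, and let the one block that observes the final count (detected block-wide via __syncthreads_or) reset the counter and reduce all the partials. A minimal CUDA sketch of just that "last block finishes the job" protocol (sum reduction; names are illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void last_block_sum(double* partials, unsigned* flag, double* out) {
  // Each block contributes one partial result (trivially its block index).
  if (threadIdx.x == 0) partials[blockIdx.x] = double(blockIdx.x);

  unsigned num_done = 0;
  if (threadIdx.x == 0) {
    __threadfence();                      // make the partial visible globally
    num_done = atomicAdd(flag, 1u) + 1u;  // count finished blocks
  }
  // __syncthreads_or is true for every thread of the last block to arrive.
  if (__syncthreads_or(num_done == gridDim.x)) {
    if (threadIdx.x == 0) {
      *flag = 0;                          // reset for the next launch
      double total = 0.0;
      for (unsigned i = 0; i < gridDim.x; ++i) total += partials[i];
      *out = total;
    }
  }
}

int main() {
  const int blocks = 8;
  double *d_partials, *d_out, h_out = 0.0;
  unsigned* d_flag;
  cudaMalloc(&d_partials, blocks * sizeof(double));
  cudaMalloc(&d_flag, sizeof(unsigned));
  cudaMalloc(&d_out, sizeof(double));
  cudaMemset(d_flag, 0, sizeof(unsigned));
  last_block_sum<<<blocks, 32>>>(d_partials, d_flag, d_out);
  cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
  printf("%f\n", h_out);  // 0+1+...+7 = 28.0
}
```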
//----------------------------------------------------------------------------
// See section B.17 of Cuda C Programming Guide Version 3.2
// for discussion of
@ -639,9 +891,10 @@ void cuda_intra_block_reduce_scan( const FunctorType & functor ,
*
* Global reduce result is in the last threads' 'shared_data' location.
*/
template< bool DoScan , class FunctorType , class ArgTag >
__device__
bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
bool cuda_single_inter_block_reduce_scan2( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
@ -655,7 +908,6 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
//typedef typename ValueTraits::reference_type reference_type ;
// '__ffs' = position of the least significant bit set to 1.
// 'blockDim.y' is guaranteed to be a power of two so this
@ -678,12 +930,7 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
size_type * const shared = shared_data + word_count.value * BlockSizeMask ;
size_type * const global = global_data + word_count.value * block_id ;
//#if (__CUDA_ARCH__ < 500)
for ( int i = int(threadIdx.y) ; i < int(word_count.value) ; i += int(blockDim.y) ) { global[i] = shared[i] ; }
//#else
// for ( size_type i = 0 ; i < word_count.value ; i += 1 ) { global[i] = shared[i] ; }
//#endif
}
// Contributing blocks note that their contribution has been completed via an atomic-increment flag
@ -725,6 +972,22 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
return is_last_block ;
}
template< bool DoScan , class FunctorType , class ArgTag >
__device__
bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags )
{
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
if(!DoScan && ValueTraits::StaticValueSize)
return Kokkos::Impl::CudaReductionsFunctor<FunctorType,ArgTag,false,(ValueTraits::StaticValueSize>16)>::scalar_inter_block_reduction(functor,block_id,block_count,shared_data,global_data,global_flags);
else
return cuda_single_inter_block_reduce_scan2<DoScan, FunctorType, ArgTag>(functor, block_id, block_count, shared_data, global_data, global_flags);
}
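The wrapper above dispatches at compile time: pure reductions (!DoScan) over statically sized value types take the new CudaReductionsFunctor path, with the shuffle variant chosen when the value is larger than 16 bytes, and everything else falls back to the renamed original implementation. A hypothetical host-side mirror of that three-way choice (StaticValueSize is approximated here by sizeof, an assumption for the demo only):

```cpp
#include <cstdio>

// Hypothetical mirror of the compile-time dispatch above.
template <bool DoScan, class Value>
const char* pick_path() {
  if (!DoScan)
    return sizeof(Value) > 16 ? "CudaReductionsFunctor (shfl variant)"
                              : "CudaReductionsFunctor (shared-memory variant)";
  return "cuda_single_inter_block_reduce_scan2 (original path)";
}

int main() {
  printf("%s\n", pick_path<false, double>());     // 8 bytes  -> shared memory
  printf("%s\n", pick_path<false, double[4]>());  // 32 bytes -> shfl
  printf("%s\n", pick_path<true, double>());      // scans keep the old path
}
```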
// Size in bytes required for inter block reduce or scan
template< bool DoScan , class FunctorType , class ArgTag >
inline

View File

@ -179,6 +179,29 @@ public:
#endif
}
template<class Closure, class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast( Closure const & f, ValueType & val, const int& thread_id ) const
{
#ifdef __CUDA_ARCH__
f( val );
if ( 1 == blockDim.z ) { // team == block
__syncthreads();
// Wait for all threads to arrive before the shared data write
if ( threadIdx.x == 0u && threadIdx.y == (uint32_t)thread_id ) {
*((ValueType*) m_team_reduce) = val ;
}
__syncthreads(); // Wait for the root thread's write before reading shared data
val = *((ValueType*) m_team_reduce);
}
else { // team <= warp
ValueType tmp( val ); // input might not be a register variable
cuda_shfl( val, tmp, blockDim.x * thread_id, blockDim.x * blockDim.y );
}
#endif
}
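The new overload above first runs the closure on the value, then distributes the source thread's result: through team-shared memory when the team is a full block (the blockDim.z == 1 branch), or a warp shuffle when the team fits inside a warp. A hedged usage sketch (assumes a CUDA-enabled Kokkos build with extended lambdas and a team of at least three threads):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    typedef Kokkos::TeamPolicy<> team_policy;
    typedef team_policy::member_type member_type;
    Kokkos::parallel_for(team_policy(1, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type& team) {
        double val = team.team_rank();
        // Double the value on thread 2, then broadcast it to the whole team.
        team.team_broadcast([](double& x) { x *= 2.0; }, val, 2);
        if (team.team_rank() == 0) printf("broadcast value: %f\n", val); // 4.0
      });
  }
  Kokkos::finalize();
  return 0;
}
```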
//--------------------------------------------------------------------------
/**\brief Reduction across a team
*
@ -200,92 +223,7 @@ public:
team_reduce( ReducerType const & reducer ) const noexcept
{
#ifdef __CUDA_ARCH__
typedef typename ReducerType::value_type value_type ;
value_type tmp( reducer.reference() );
// reduce within the warp using shuffle
const int wx =
( threadIdx.x + blockDim.x * threadIdx.y ) & CudaTraits::WarpIndexMask ;
for ( int i = CudaTraits::WarpSize ; (int)blockDim.x <= ( i >>= 1 ) ; ) {
cuda_shfl_down( reducer.reference() , tmp , i , CudaTraits::WarpSize );
// Root of each vector lane reduces:
if ( 0 == threadIdx.x && wx < i ) {
reducer.join( tmp , reducer.reference() );
}
}
if ( 1 < blockDim.z ) { // team <= warp
// broadcast result from root vector lane of root thread
cuda_shfl( reducer.reference() , tmp
, blockDim.x * threadIdx.y , CudaTraits::WarpSize );
}
else { // team == block
// Reduce across warps using shared memory
// Broadcast result within block
// Number of warps, blockDim.y may not be power of two:
const int nw = ( blockDim.x * blockDim.y + CudaTraits::WarpIndexMask ) >> CudaTraits::WarpIndexShift ;
// Warp index:
const int wy = ( blockDim.x * threadIdx.y ) >> CudaTraits::WarpIndexShift ;
// Number of shared memory entries for the reduction:
int nsh = m_team_reduce_size / sizeof(value_type);
// Using at most one entry per warp:
if ( nw < nsh ) nsh = nw ;
__syncthreads(); // Wait before shared data write
if ( 0 == wx && wy < nsh ) {
((value_type*) m_team_reduce)[wy] = tmp ;
}
// When more warps than shared entries:
for ( int i = nsh ; i < nw ; i += nsh ) {
__syncthreads();
if ( 0 == wx && i <= wy ) {
const int k = wy - i ;
if ( k < nsh ) {
reducer.join( *((value_type*) m_team_reduce + k) , tmp );
}
}
}
__syncthreads();
// One warp performs the inter-warp reduction:
if ( 0 == wy ) {
// Start at power of two covering nsh
for ( int i = 1 << ( 32 - __clz(nsh-1) ) ; ( i >>= 1 ) ; ) {
const int k = wx + i ;
if ( wx < i && k < nsh ) {
reducer.join( ((value_type*)m_team_reduce)[wx]
, ((value_type*)m_team_reduce)[k] );
__threadfence_block();
}
}
}
__syncthreads(); // Wait for reduction
// Broadcast result to all threads
reducer.reference() = *((value_type*)m_team_reduce);
}
cuda_intra_block_reduction(reducer,blockDim.y);
#endif /* #ifdef __CUDA_ARCH__ */
}
@ -801,7 +739,11 @@ void parallel_for
; i += blockDim.x ) {
closure(i);
}
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}
@ -970,7 +912,11 @@ KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0) lambda();
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}
@ -979,7 +925,11 @@ KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0 && threadIdx.y == 0) lambda();
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}
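Each `single` body (and the vector-level parallel_for above) now synchronizes with a mask covering only its own vector-lane group: blockDim.x lanes form one group, and the expression builds a blockDim.x-wide bit field shifted to the group's slot within the 32-lane warp. The mask arithmetic in isolation, runnable on the host (the values are illustrative):

```cpp
#include <cstdio>

// Mirrors the mask expression passed to KOKKOS_IMPL_CUDA_SYNCWARP_MASK above.
unsigned sub_warp_mask(unsigned blockDim_x, unsigned threadIdx_y) {
  if (blockDim_x == 32) return 0xffffffffu;  // whole warp participates
  const unsigned group = threadIdx_y % (32 / blockDim_x);
  return ((1u << blockDim_x) - 1u) << (group * blockDim_x);
}

int main() {
  // Vector length 8: four lane groups per warp.
  for (unsigned y = 0; y < 4; ++y)
    printf("threadIdx.y=%u  mask=0x%08x\n", y, sub_warp_mask(8, y));
  // 0x000000ff, 0x0000ff00, 0x00ff0000, 0xff000000
}
```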

View File

@ -2,9 +2,11 @@
#if defined( __CUDA_ARCH__ )
#if ( CUDA_VERSION < 9000 )
#define KOKKOS_IMPL_CUDA_ACTIVEMASK 0
#define KOKKOS_IMPL_CUDA_SYNCWARP __threadfence_block()
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(x) __threadfence_block()
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK __threadfence_block()
#define KOKKOS_IMPL_CUDA_BALLOT(x) __ballot(x)
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(x) __ballot(x)
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) __shfl(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) __shfl(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) __shfl_up(x,y,z)
@ -12,9 +14,11 @@
#define KOKKOS_IMPL_CUDA_SHFL_DOWN(x,y,z) __shfl_down(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) __shfl_down(x,y,z)
#else
#define KOKKOS_IMPL_CUDA_ACTIVEMASK __activemask()
#define KOKKOS_IMPL_CUDA_SYNCWARP __syncwarp(0xffffffff)
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(m) __syncwarp(m)
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(m) __syncwarp(m);
#define KOKKOS_IMPL_CUDA_BALLOT(x) __ballot_sync(__activemask(),x)
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(m,x) __ballot_sync(m,x)
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) __shfl_sync(0xffffffff,x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) __shfl_sync(m,x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) __shfl_up_sync(0xffffffff,x,y,z)
@ -23,11 +27,16 @@
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) __shfl_down_sync(m,x,y,z)
#endif
#else
#define KOKKOS_IMPL_CUDA_ACTIVEMASK 0
#define KOKKOS_IMPL_CUDA_SYNCWARP
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK
#define KOKKOS_IMPL_CUDA_BALLOT(x) 0
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(x) 0
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_DOWN(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) 0
#endif
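This compatibility header gives each warp intrinsic three spellings: a pre-CUDA-9 form (no independent thread scheduling, so masked sync degrades to __threadfence_block and ballots ignore the mask), a CUDA-9 form with explicit masks, and host-side no-ops. Call sites then follow the #ifdef pattern seen throughout this commit; a condensed device-code sketch (it is assumed, based on the call sites above, that KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK is defined by this header when explicit masks are required):

```cpp
// Device-code sketch of the call-site pattern for these macros.
__device__ inline void sketch_warp_sync() {
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
  unsigned MASK = KOKKOS_IMPL_CUDA_ACTIVEMASK;  // lanes currently active
  KOKKOS_IMPL_CUDA_SYNCWARP_MASK(MASK);         // expands to __syncwarp(MASK)
#else
  KOKKOS_IMPL_CUDA_SYNCWARP_MASK;               // __threadfence_block() / no-op
#endif
}
```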
#if ( CUDA_VERSION >= 9000 ) && (!defined(KOKKOS_COMPILER_CLANG))

View File

@ -279,6 +279,8 @@ public:
KOKKOS_INLINE_FUNCTION
static handle_type assign( value_type * arg_data_ptr, track_type const & arg_tracker )
{
if(arg_data_ptr == NULL) return handle_type();
#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
// Assignment of texture = non-texture requires creation of a texture object
// which can only occur on the host. In addition, 'get_record' is only valid
@ -292,8 +294,7 @@ public:
#if ! defined( KOKKOS_ENABLE_CUDA_LDG_INTRINSIC )
if ( 0 == r ) {
//Kokkos::abort("Cuda const random access View using Cuda texture memory requires Kokkos to allocate the View's memory");
return handle_type();
Kokkos::abort("Cuda const random access View using Cuda texture memory requires Kokkos to allocate the View's memory");
}
#endif

View File

@ -46,6 +46,8 @@
#include <initializer_list>
#include <Kokkos_Layout.hpp>
#include <impl/KokkosExp_Host_IterateTile.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Parallel.hpp>
@ -63,13 +65,15 @@
namespace Kokkos {
// ------------------------------------------------------------------ //
// Moved to Kokkos_Layout.hpp for more general accessibility
/*
enum class Iterate
{
Default, // Default for the device
Left, // Left indices stride fastest
Right, // Right indices stride fastest
};
*/
template <typename ExecSpace>
struct default_outer_direction

View File

@ -45,11 +45,13 @@
#define KOKKOS_ARRAY_HPP
#include <Kokkos_Macros.hpp>
#include <impl/Kokkos_Error.hpp>
#include <type_traits>
#include <algorithm>
#include <limits>
#include <cstddef>
#include <string>
namespace Kokkos {
@ -132,6 +134,7 @@ public:
KOKKOS_INLINE_FUNCTION static constexpr size_type size() { return N ; }
KOKKOS_INLINE_FUNCTION static constexpr bool empty(){ return false ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return N ; }
template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -160,7 +163,7 @@ public:
return & m_internal_implementation_private_member_data[0];
}
#ifdef KOKKOS_ROCM_CLANG_WORKAROUND
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
// Do not default unless move and move-assignment are also defined
KOKKOS_INLINE_FUNCTION
~Array() = default ;
@ -197,6 +200,7 @@ public:
KOKKOS_INLINE_FUNCTION static constexpr size_type size() { return 0 ; }
KOKKOS_INLINE_FUNCTION static constexpr bool empty() { return true ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return 0 ; }
template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -261,6 +265,7 @@ public:
KOKKOS_INLINE_FUNCTION constexpr size_type size() const { return m_size ; }
KOKKOS_INLINE_FUNCTION constexpr bool empty() const { return 0 == m_size ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return m_size ; }
template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -336,6 +341,7 @@ public:
KOKKOS_INLINE_FUNCTION constexpr size_type size() const { return m_size ; }
KOKKOS_INLINE_FUNCTION constexpr bool empty() const { return 0 == m_size ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return m_size ; }
template< typename iType >
KOKKOS_INLINE_FUNCTION
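
The added max_size() rounds out the std::array-style interface of Kokkos::Array: for the fixed-extent specialization it equals size() (and N), while the proxy specializations report their runtime m_size. A short usage sketch (use_array is an illustrative name):

```cpp
#include <Kokkos_Core.hpp>

KOKKOS_INLINE_FUNCTION void use_array() {
  Kokkos::Array<double, 4> a = {{1.0, 2.0, 3.0, 4.0}};
  static_assert(Kokkos::Array<double, 4>::size() == 4, "fixed extent");
  // For the fixed-size specialization max_size() == size() == N:
  const auto cap = a.max_size();
  a[0] = static_cast<double>(cap);
}
```
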

View File

@ -105,7 +105,10 @@ namespace Kokkos {
template< typename T > struct is_ ## CONCEPT { \
private: \
template< typename , typename = std::true_type > struct have : std::false_type {}; \
template< typename U > struct have<U,typename std::is_same<U,typename U:: CONCEPT >::type> : std::true_type {}; \
template< typename U > struct have<U,typename std::is_same< \
typename std::remove_cv<U>::type, \
typename std::remove_cv<typename U:: CONCEPT>::type \
>::type> : std::true_type {}; \
public: \
enum { value = is_ ## CONCEPT::template have<T>::value }; \
};
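
The fix wraps both sides of the comparison in remove_cv, so a cv-qualified T or a concept typedef like `typedef const U CONCEPT;` still matches. A standalone sketch of the detection idiom the macro stamps out, using memory_space as the probed member and a made-up FakeSpace type:

```cpp
#include <type_traits>

template <typename T>
struct is_memory_space {
 private:
  // Primary: no usable T::memory_space member -> false.
  template <typename, typename = std::true_type>
  struct have : std::false_type {};
  // Partial specialization is viable only when U::memory_space exists
  // (SFINAE) and, after stripping cv-qualifiers, names U itself.
  template <typename U>
  struct have<U, typename std::is_same<
                     typename std::remove_cv<U>::type,
                     typename std::remove_cv<typename U::memory_space>::type
                 >::type> : std::true_type {};
 public:
  enum { value = is_memory_space::template have<T>::value };
};

struct FakeSpace { using memory_space = FakeSpace; };
static_assert(is_memory_space<FakeSpace>::value, "detected");
static_assert(!is_memory_space<int>::value, "int has no memory_space");
```
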

View File

@ -453,8 +453,9 @@ template<class ViewTypeA,class ViewTypeB, class Layout, class ExecSpace,typename
struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,2,iType,KOKKOS_IMPL_COMPILING_LIBRARY> {
ViewTypeA a;
ViewTypeB b;
typedef Kokkos::Rank<2,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<2,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;
ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -475,7 +476,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,3,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;
typedef Kokkos::Rank<3,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<3,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;
ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -496,7 +499,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,4,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;
typedef Kokkos::Rank<4,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<4,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;
ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -519,7 +524,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,5,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;
typedef Kokkos::Rank<5,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<5,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;
ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -542,7 +549,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,6,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;
typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;
ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -566,7 +575,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,7,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;
typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;
ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -590,7 +601,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,8,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;
typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;
ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
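
These hunks swap the ad hoc ViewFillLayoutSelector for layout_iterate_type_selector (defined in Kokkos_Layout.hpp later in this commit), so tiled layouts can steer the outer and inner traversal independently. A sketch of the policy the rank-2 specialization now builds (copy2d is illustrative, not part of this diff):

```cpp
#include <Kokkos_Core.hpp>

template <class ExecSpace>
void copy2d(Kokkos::View<double**, Kokkos::LayoutLeft, ExecSpace> a,
            Kokkos::View<const double**, Kokkos::LayoutLeft, ExecSpace> b) {
  using selector = Kokkos::layout_iterate_type_selector<Kokkos::LayoutLeft>;
  // For LayoutLeft both patterns are Iterate::Left, so the copy walks
  // the fastest-varying index innermost.
  using iterate_type =
      Kokkos::Rank<2, selector::outer_iteration_pattern,
                      selector::inner_iteration_pattern>;
  using policy_type =
      Kokkos::MDRangePolicy<ExecSpace, iterate_type, Kokkos::IndexType<int>>;
  Kokkos::parallel_for(
      "copy2d",
      policy_type({0, 0}, {(int)a.extent(0), (int)a.extent(1)}),
      KOKKOS_LAMBDA(int i, int j) { a(i, j) = b(i, j); });
}
```
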
@ -642,7 +655,9 @@ void view_copy(const DstType& dst, const SrcType& src) {
int64_t strides[DstType::Rank+1];
dst.stride(strides);
Kokkos::Iterate iterate;
if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutRight>::value ) {
if ( Kokkos::is_layouttiled<typename DstType::array_layout>::value ) {
iterate = Kokkos::layout_iterate_type_selector<typename DstType::array_layout>::outer_iteration_pattern;
} else if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutRight>::value ) {
iterate = Kokkos::Iterate::Right;
} else if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutLeft>::value ) {
iterate = Kokkos::Iterate::Left;
@ -1243,9 +1258,9 @@ void deep_copy
ViewTypeFlat;
ViewTypeFlat dst_flat(dst.data(),dst.size());
if(dst.span() < std::numeric_limits<int>::max())
if(dst.span() < std::numeric_limits<int>::max()) {
Kokkos::Impl::ViewFill< ViewTypeFlat , Kokkos::LayoutRight, typename ViewType::execution_space, ViewTypeFlat::Rank, int >( dst_flat , value );
else
} else
Kokkos::Impl::ViewFill< ViewTypeFlat , Kokkos::LayoutRight, typename ViewType::execution_space, ViewTypeFlat::Rank, int64_t >( dst_flat , value );
Kokkos::fence();
return;
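
Besides the added braces, the surrounding logic picks the index width from the span: spans that fit in int use 32-bit indexing, which is cheaper on device, and anything larger falls back to int64_t. The selection in isolation:

```cpp
#include <limits>
#include <cstddef>

// Sketch of the index-width choice made above: 32-bit indexing whenever
// the flattened span fits in int, 64-bit otherwise.
inline bool use_32bit_indexing(std::size_t span) {
  return span < static_cast<std::size_t>(std::numeric_limits<int>::max());
}
```
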
@ -1397,7 +1412,6 @@ void deep_copy
enum { SrcExecCanAccessDst =
Kokkos::Impl::SpaceAccessibility< src_execution_space , dst_memory_space >::accessible };
// Checking for Overlapping Views.
dst_value_type* dst_start = dst.data();
dst_value_type* dst_end = dst.data() + dst.span();
@ -1493,7 +1507,7 @@ void deep_copy
Kokkos::fence();
} else {
Kokkos::fence();
Impl::view_copy(typename dst_type::uniform_runtime_nomemspace_type(dst),typename src_type::uniform_runtime_const_nomemspace_type(src));
Impl::view_copy(dst, src);
Kokkos::fence();
}
}
@ -1739,8 +1753,7 @@ void deep_copy
exec_space.fence();
} else {
exec_space.fence();
Impl::view_copy(typename dst_type::uniform_runtime_nomemspace_type(dst),
typename src_type::uniform_runtime_const_nomemspace_type(src));
Impl::view_copy(dst, src);
exec_space.fence();
}
}
@ -1917,4 +1930,213 @@ void realloc( Kokkos::View<T,P...> & v ,
}
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
// Deduce Mirror Types
template<class Space, class T, class ... P>
struct MirrorViewType {
// The incoming view_type
typedef typename Kokkos::View<T,P...> src_view_type;
// The memory space for the mirror view
typedef typename Space::memory_space memory_space;
// Check whether it is the same memory space
enum { is_same_memspace = std::is_same<memory_space,typename src_view_type::memory_space>::value };
// The array_layout
typedef typename src_view_type::array_layout array_layout;
// The data type (we probably want it non-const, since otherwise we can't even deep_copy to it.)
typedef typename src_view_type::non_const_data_type data_type;
// The destination view type if it is not the same memory space
typedef Kokkos::View<data_type,array_layout,Space> dest_view_type;
// If it is the same memory_space return the existing view_type
// This will also keep the unmanaged trait if necessary
typedef typename std::conditional<is_same_memspace,src_view_type,dest_view_type>::type view_type;
};
template<class Space, class T, class ... P>
struct MirrorType {
// The incoming view_type
typedef typename Kokkos::View<T,P...> src_view_type;
// The memory space for the mirror view
typedef typename Space::memory_space memory_space;
// Check whether it is the same memory space
enum { is_same_memspace = std::is_same<memory_space,typename src_view_type::memory_space>::value };
// The array_layout
typedef typename src_view_type::array_layout array_layout;
// The data type (we probably want it non-const, since otherwise we can't even deep_copy to it.)
typedef typename src_view_type::non_const_data_type data_type;
// The destination view type if it is not the same memory space
typedef Kokkos::View<data_type,array_layout,Space> view_type;
};
}
template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror( const Kokkos::View<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
! std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
)
{
typedef View<T,P...> src_type ;
typedef typename src_type::HostMirror dst_type ;
return dst_type( std::string( src.label() ).append("_mirror")
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
, src.extent(0)
, src.extent(1)
, src.extent(2)
, src.extent(3)
, src.extent(4)
, src.extent(5)
, src.extent(6)
, src.extent(7) );
#else
, src.rank_dynamic > 0 ? src.extent(0): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 1 ? src.extent(1): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 2 ? src.extent(2): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 3 ? src.extent(3): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 4 ? src.extent(4): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 5 ? src.extent(5): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 6 ? src.extent(6): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 7 ? src.extent(7): KOKKOS_IMPL_CTOR_DEFAULT_ARG );
#endif
}
template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror( const Kokkos::View<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
)
{
typedef View<T,P...> src_type ;
typedef typename src_type::HostMirror dst_type ;
Kokkos::LayoutStride layout ;
layout.dimension[0] = src.extent(0);
layout.dimension[1] = src.extent(1);
layout.dimension[2] = src.extent(2);
layout.dimension[3] = src.extent(3);
layout.dimension[4] = src.extent(4);
layout.dimension[5] = src.extent(5);
layout.dimension[6] = src.extent(6);
layout.dimension[7] = src.extent(7);
layout.stride[0] = src.stride_0();
layout.stride[1] = src.stride_1();
layout.stride[2] = src.stride_2();
layout.stride[3] = src.stride_3();
layout.stride[4] = src.stride_4();
layout.stride[5] = src.stride_5();
layout.stride[6] = src.stride_6();
layout.stride[7] = src.stride_7();
return dst_type( std::string( src.label() ).append("_mirror") , layout );
}
// Create a mirror in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorType<Space,T,P ...>::view_type
create_mirror(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value
>::type * = 0) {
return typename Impl::MirrorType<Space,T,P ...>::view_type(src.label(),src.layout());
}
template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror_view( const Kokkos::View<T,P...> & src
, typename std::enable_if<(
std::is_same< typename Kokkos::View<T,P...>::memory_space
, typename Kokkos::View<T,P...>::HostMirror::memory_space
>::value
&&
std::is_same< typename Kokkos::View<T,P...>::data_type
, typename Kokkos::View<T,P...>::HostMirror::data_type
>::value
)>::type * = 0
)
{
return src ;
}
template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror_view( const Kokkos::View<T,P...> & src
, typename std::enable_if< ! (
std::is_same< typename Kokkos::View<T,P...>::memory_space
, typename Kokkos::View<T,P...>::HostMirror::memory_space
>::value
&&
std::is_same< typename Kokkos::View<T,P...>::data_type
, typename Kokkos::View<T,P...>::HostMirror::data_type
>::value
)>::type * = 0
)
{
return Kokkos::create_mirror( src );
}
// Create a mirror view in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
return src;
}
// Create a mirror view in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
return typename Impl::MirrorViewType<Space,T,P ...>::view_type(src.label(),src.layout());
}
// Create a mirror view and deep_copy in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
(void)name;
return src;
}
// Create a mirror view and deep_copy in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
using Mirror = typename Impl::MirrorViewType<Space,T,P ...>::view_type;
std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror(ViewAllocateWithoutInitializing(label), src.layout());
deep_copy(mirror, src);
return mirror;
}
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
#endif
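
The block above is the new home of the mirror helpers (moved here from Kokkos_View.hpp, whose copy is deleted later in this commit), and it adds create_mirror_view_and_copy. Typical use, which allocates and deep-copies only when the memory spaces actually differ:

```cpp
#include <Kokkos_Core.hpp>

void read_on_host(Kokkos::View<double*> dev) {  // default device space
  auto host = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), dev);
  // If 'dev' already lives in HostSpace this is the same view (no copy);
  // otherwise it is a freshly allocated, deep-copied mirror. The optional
  // name argument overrides the "<label>" reused from the source view.
  double first = host(0);
  (void)first;
}
```
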

View File

@ -57,6 +57,10 @@
namespace Kokkos {
struct ParallelForTag {};
struct ParallelScanTag {};
struct ParallelReduceTag {};
struct ChunkSize {
int value;
ChunkSize(int value_):value(value_) {}
@ -320,6 +324,10 @@ public:
template< class FunctorType >
static int team_size_recommended( const FunctorType & , const int&);
template<class FunctorType>
int team_size_recommended( const FunctorType & functor , const int vector_length);
//----------------------------------------
/** \brief Construct policy with the given instance of the execution space */
TeamPolicyInternal( const typename traits::execution_space & , int league_size_request , int team_size_request , int vector_length_request = 1 );
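
team_size_max and team_size_recommended now take a dispatch tag, so the query can distinguish parallel_for from parallel_reduce, whose maxima can differ (reductions need extra registers and scratch). A usage sketch against the tag-based overloads shown in this diff (pick_team_size is illustrative):

```cpp
#include <Kokkos_Core.hpp>

template <class Functor>
int pick_team_size(const Kokkos::TeamPolicy<>& policy, const Functor& f) {
  // The tag selects the internal query path:
  const int max_for    = policy.team_size_max(f, Kokkos::ParallelForTag());
  const int max_reduce = policy.team_size_max(f, Kokkos::ParallelReduceTag());
  return max_for < max_reduce ? max_for : max_reduce;
}
```
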

View File

@ -76,6 +76,8 @@ struct LayoutLeft {
size_t dimension[ ARRAY_LAYOUT_MAX_RANK ];
enum { is_extent_constructible = true };
LayoutLeft( LayoutLeft const & ) = default ;
LayoutLeft( LayoutLeft && ) = default ;
LayoutLeft & operator = ( LayoutLeft const & ) = default ;
@ -108,6 +110,8 @@ struct LayoutRight {
size_t dimension[ ARRAY_LAYOUT_MAX_RANK ];
enum { is_extent_constructible = true };
LayoutRight( LayoutRight const & ) = default ;
LayoutRight( LayoutRight && ) = default ;
LayoutRight & operator = ( LayoutRight const & ) = default ;
@ -132,6 +136,8 @@ struct LayoutStride {
size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;
size_t stride[ ARRAY_LAYOUT_MAX_RANK ] ;
enum { is_extent_constructible = false };
LayoutStride( LayoutStride const & ) = default ;
LayoutStride( LayoutStride && ) = default ;
LayoutStride & operator = ( LayoutStride const & ) = default ;
@ -222,6 +228,8 @@ struct LayoutTileLeft {
size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;
enum { is_extent_constructible = true };
LayoutTileLeft( LayoutTileLeft const & ) = default ;
LayoutTileLeft( LayoutTileLeft && ) = default ;
LayoutTileLeft & operator = ( LayoutTileLeft const & ) = default ;
@ -235,6 +243,144 @@ struct LayoutTileLeft {
: dimension { argN0 , argN1 , argN2 , argN3 , argN4 , argN5 , argN6 , argN7 } {}
};
//////////////////////////////////////////////////////////////////////////////////////
enum class Iterate
{
Default,
Left, // Left indices stride fastest
Right // Right indices stride fastest
};
// To check for LayoutTiled
// This is to hide extra compile-time 'identifier' info within the LayoutTiled class by not relying on template specialization to include the ArgN*'s
template < typename LayoutTiledCheck, class Enable = void >
struct is_layouttiled : std::false_type {};
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
template < typename LayoutTiledCheck >
struct is_layouttiled< LayoutTiledCheck, typename std::enable_if<LayoutTiledCheck::is_array_layout_tiled>::type > : std::true_type {};
namespace Experimental {
/// LayoutTiled
// Must have Rank >= 2
template < Kokkos::Iterate OuterP, Kokkos::Iterate InnerP,
unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 = 0, unsigned ArgN3 = 0, unsigned ArgN4 = 0, unsigned ArgN5 = 0, unsigned ArgN6 = 0, unsigned ArgN7 = 0,
bool IsPowerOfTwo =
( Impl::is_integral_power_of_two(ArgN0) &&
Impl::is_integral_power_of_two(ArgN1) &&
(Impl::is_integral_power_of_two(ArgN2) || (ArgN2 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN3) || (ArgN3 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN4) || (ArgN4 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN5) || (ArgN5 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN6) || (ArgN6 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN7) || (ArgN7 == 0) )
)
>
struct LayoutTiled {
static_assert( IsPowerOfTwo
, "LayoutTiled must be given power-of-two tile dimensions" );
#if 0
static_assert( (Impl::is_integral_power_of_two(ArgN0) ) &&
(Impl::is_integral_power_of_two(ArgN1) ) &&
(Impl::is_integral_power_of_two(ArgN2) || (ArgN2 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN3) || (ArgN3 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN4) || (ArgN4 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN5) || (ArgN5 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN6) || (ArgN6 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN7) || (ArgN7 == 0) )
, "LayoutTiled must be given power-of-two tile dimensions" );
#endif
typedef LayoutTiled<OuterP, InnerP, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, IsPowerOfTwo> array_layout ;
static constexpr Iterate outer_pattern = OuterP;
static constexpr Iterate inner_pattern = InnerP;
enum { N0 = ArgN0 };
enum { N1 = ArgN1 };
enum { N2 = ArgN2 };
enum { N3 = ArgN3 };
enum { N4 = ArgN4 };
enum { N5 = ArgN5 };
enum { N6 = ArgN6 };
enum { N7 = ArgN7 };
size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;
enum { is_extent_constructible = true };
LayoutTiled( LayoutTiled const & ) = default ;
LayoutTiled( LayoutTiled && ) = default ;
LayoutTiled & operator = ( LayoutTiled const & ) = default ;
LayoutTiled & operator = ( LayoutTiled && ) = default ;
KOKKOS_INLINE_FUNCTION
explicit constexpr
LayoutTiled( size_t argN0 = 0 , size_t argN1 = 0 , size_t argN2 = 0 , size_t argN3 = 0
, size_t argN4 = 0 , size_t argN5 = 0 , size_t argN6 = 0 , size_t argN7 = 0
)
: dimension { argN0 , argN1 , argN2 , argN3 , argN4 , argN5 , argN6 , argN7 } {}
};
} // namespace Experimental
#endif
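
LayoutTiled encodes the tile extents and both traversal directions in the type; tile dimensions must be powers of two and the rank at least 2. A declaration sketch, assuming the ViewMapping support for tiled layouts that accompanies this header elsewhere in the commit:

```cpp
#include <Kokkos_Core.hpp>

// Rank-2 view with 4x4 tiles, traversed left-fastest both between tiles
// (outer pattern) and within a tile (inner pattern).
using Tiled44 = Kokkos::Experimental::LayoutTiled<
    Kokkos::Iterate::Left, Kokkos::Iterate::Left, 4, 4>;

void make_tiled(int n0, int n1) {
  Kokkos::View<double**, Tiled44> v("tiled", n0, n1);
  (void)v;
}
```
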
// For use with view_copy
template < typename ... Layout >
struct layout_iterate_type_selector {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Default ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Default ;
};
template <>
struct layout_iterate_type_selector< Kokkos::LayoutRight > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};
template <>
struct layout_iterate_type_selector< Kokkos::LayoutLeft > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};
template <>
struct layout_iterate_type_selector< Kokkos::LayoutStride > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Default ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Default ;
};
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Left, Kokkos::Iterate::Left, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};
template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Right, Kokkos::Iterate::Left, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};
template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Left, Kokkos::Iterate::Right, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};
template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Right, Kokkos::Iterate::Right, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};
#endif
} // namespace Kokkos
#endif // #ifndef KOKKOS_LAYOUT_HPP

View File

@ -153,7 +153,7 @@
#else
#define KOKKOS_LAMBDA [=]__host__ __device__
#if defined( KOKKOS_ENABLE_CXX1Z )
#if defined( KOKKOS_ENABLE_CXX17 ) || defined( KOKKOS_ENABLE_CXX20 )
#define KOKKOS_CLASS_LAMBDA [=,*this] __host__ __device__
#endif
#endif
@ -213,7 +213,7 @@
#define KOKKOS_LAMBDA [=]
#endif
#if defined( KOKKOS_ENABLE_CXX1Z ) && !defined( KOKKOS_CLASS_LAMBDA )
#if ( defined( KOKKOS_ENABLE_CXX17 ) || defined( KOKKOS_ENABLE_CXX20 ) ) && !defined( KOKKOS_CLASS_LAMBDA )
#define KOKKOS_CLASS_LAMBDA [=,*this]
#endif
@ -521,6 +521,9 @@
#if defined ( KOKKOS_ENABLE_CUDA )
#if ( 9000 <= CUDA_VERSION )
#define KOKKOS_IMPL_CUDA_VERSION_9_WORKAROUND
#if ( __CUDA_ARCH__ )
#define KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
#endif
#endif
#endif
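
The stale KOKKOS_ENABLE_CXX1Z guard is replaced by the CXX17/CXX20 options added in this release; the [=,*this] capture copies the enclosing object into the lambda so device code never chases a host `this` pointer. Usage sketch, assuming a C++17 build where the macro is defined:

```cpp
#include <Kokkos_Core.hpp>

struct Scaler {
  double factor = 2.0;
  Kokkos::View<double*> data;

  void run() const {
    // *this is captured by value (C++17), so 'data' and 'factor' inside
    // the kernel belong to a device-side copy of the object.
    Kokkos::parallel_for(
        "scale", data.extent(0),
        KOKKOS_CLASS_LAMBDA(const int i) { data(i) = factor * data(i); });
  }
};
```
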

View File

@ -793,7 +793,7 @@ struct ParallelReduceReturnValue<typename std::enable_if<
static return_type return_value(ReturnType& return_val,
const FunctorType& functor) {
#ifdef KOKOOS_ENABLE_DEPRECATED_CODE
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return return_type(return_val,functor.value_count);
#else
if ( is_array<ReturnType>::value )
@ -1002,7 +1002,8 @@ void parallel_reduce(const std::string& label,
typename Impl::enable_if<
Kokkos::Impl::is_execution_policy<PolicyType>::value
>::type * = 0) {
Impl::ParallelReduceAdaptor<PolicyType,FunctorType,const ReturnType>::execute(label,policy,functor,return_value);
ReturnType return_value_impl = return_value;
Impl::ParallelReduceAdaptor<PolicyType,FunctorType,ReturnType>::execute(label,policy,functor,return_value_impl);
}
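
Copying the result into a local first keeps a const-qualified ReturnType argument from leaking into the adaptor's template parameter. The overload in question is the ordinary labeled reduction:

```cpp
#include <Kokkos_Core.hpp>

double sum_squares(int n) {
  double total = 0.0;
  Kokkos::parallel_reduce(
      "sum_squares", n,
      KOKKOS_LAMBDA(const int i, double& partial) {
        partial += double(i) * double(i);
      },
      total);  // reaches the overload above; now safe even if const-qualified
  return total;
}
```
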
template< class PolicyType, class FunctorType, class ReturnType >
@ -1054,6 +1055,9 @@ void parallel_reduce(const std::string& label,
, typename ValueTraits::pointer_type
>::type value_type ;
static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,PolicyType,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");
typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1076,6 +1080,9 @@ void parallel_reduce(const PolicyType& policy,
, typename ValueTraits::pointer_type
>::type value_type ;
static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,PolicyType,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");
typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1096,6 +1103,9 @@ void parallel_reduce(const size_t& policy,
, typename ValueTraits::pointer_type
>::type value_type ;
static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,RangePolicy<>,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");
typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1117,6 +1127,9 @@ void parallel_reduce(const std::string& label,
, typename ValueTraits::pointer_type
>::type value_type ;
static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,RangePolicy<>,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");
typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged

View File

@ -136,6 +136,55 @@ public:
}
}
KOKKOS_INLINE_FUNCTION
void* get_shmem_aligned (const ptrdiff_t size, const ptrdiff_t alignment, int level = -1) const {
if(level == -1)
level = m_default_level;
if(level == 0) {
char* previous = m_iter_L0;
const ptrdiff_t missalign = size_t(m_iter_L0)%alignment;
if(missalign) m_iter_L0 += alignment-missalign;
void* tmp = m_iter_L0 + m_offset * size;
if (m_end_L0 < (m_iter_L0 += size * m_multiplier)) {
m_iter_L0 = previous; // put it back like it was
#ifdef KOKKOS_DEBUG
// mfh 23 Jun 2015: printf call consumes 25 registers
// in a CUDA build, so only print in debug mode. The
// function still returns NULL if not enough memory.
printf ("ScratchMemorySpace<...>::get_shmem: Failed to allocate "
"%ld byte(s); remaining capacity is %ld byte(s)\n", long(size),
long(m_end_L0-m_iter_L0));
#endif // KOKKOS_DEBUG
tmp = 0;
}
return tmp;
} else {
char* previous = m_iter_L1;
const ptrdiff_t missalign = size_t(m_iter_L1)%alignment;
if(missalign) m_iter_L1 += alignment-missalign;
void* tmp = m_iter_L1 + m_offset * size;
if (m_end_L1 < (m_iter_L1 += size * m_multiplier)) {
m_iter_L1 = previous; // put it back like it was
#ifdef KOKKOS_DEBUG
// mfh 23 Jun 2015: printf call consumes 25 registers
// in a CUDA build, so only print in debug mode. The
// function still returns NULL if not enough memory.
printf ("ScratchMemorySpace<...>::get_shmem: Failed to allocate "
"%ld byte(s); remaining capacity is %ld byte(s)\n", long(size),
long(m_end_L1-m_iter_L1));
#endif // KOKKOS_DEBUG
tmp = 0;
}
return tmp;
}
}
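
get_shmem_aligned advances the level-0 or level-1 cursor to the next multiple of the requested alignment before carving out the block, and restores the cursor on failure. The align-up arithmetic in isolation:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the align-up step used above: advance the cursor by
// (alignment - ptr % alignment) whenever it is misaligned.
inline char* align_up(char* p, std::ptrdiff_t alignment) {
  const std::ptrdiff_t rem =
      static_cast<std::ptrdiff_t>(reinterpret_cast<std::uintptr_t>(p)) %
      alignment;
  return rem ? p + (alignment - rem) : p;
}
```
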
template< typename IntType >
KOKKOS_INLINE_FUNCTION
ScratchMemorySpace( void * ptr_L0 , const IntType & size_L0 , void * ptr_L1 = NULL , const IntType & size_L1 = 0)

View File

@ -262,7 +262,7 @@ public:
}
//----------------------------------------
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
static
int team_size_max( const FunctorType & ) { return 1 ; }
@ -274,6 +274,16 @@ public:
template< class FunctorType >
static
int team_size_recommended( const FunctorType & , const int& ) { return 1 ; }
#endif
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const { return 1 ; }
//----------------------------------------
@ -281,6 +291,16 @@ public:
inline int league_size() const { return m_league_size ; }
inline size_t scratch_size(const int& level, int = 0) const { return m_team_scratch_size[level] + m_thread_scratch_size[level]; }
inline static
int vector_length_max()
{ return 1024; } // Use an arbitrarily large number; it is meant as an upper bound on vectorizable length
inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32:
20*1024*1024);
}
/** \brief Specify league size, request team size */
TeamPolicyInternal( execution_space &
, int league_size_request

View File

@ -624,7 +624,6 @@ public:
when_all( Future< A1 , A2 > const arg[] , int narg )
{
using future_type = Future< execution_space > ;
using task_base = Kokkos::Impl::TaskBase< void , void , void > ;
future_type f ;
@ -692,7 +691,6 @@ public:
{
using input_type = decltype( func(0) );
using future_type = Future< execution_space > ;
using task_base = Kokkos::Impl::TaskBase< void , void , void > ;
static_assert( is_future< input_type >::value
, "Functor must return a Kokkos::Future" );

View File

@ -707,10 +707,17 @@ public:
//----------------------------------------
// Allow specializations to query their specialized map
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::ViewMapping< traits , void > &
implementation_map() const { return m_map ; }
#endif
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::ViewMapping< traits , void > &
impl_map() const { return m_map ; }
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::SharedAllocationTracker &
impl_track() const { return m_track ; }
//----------------------------------------
private:
@ -752,7 +759,6 @@ private:
#endif
public:
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class ... Args >
KOKKOS_FORCEINLINE_FUNCTION
@ -793,7 +799,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,args...) )
return m_map.m_handle[ i0 ];
return m_map.m_impl_handle[ i0 ];
}
template< typename I0
@ -809,7 +815,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,args...) )
return m_map.m_handle[ m_map.m_offset.m_stride.S0 * i0 ];
return m_map.m_impl_handle[ m_map.m_impl_offset.m_stride.S0 * i0 ];
}
//------------------------------
@ -839,7 +845,7 @@ public:
operator[]( const I0 & i0 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0) )
return m_map.m_handle[ i0 ];
return m_map.m_impl_handle[ i0 ];
}
template< typename I0 >
@ -853,10 +859,9 @@ public:
operator[]( const I0 & i0 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0) )
return m_map.m_handle[ m_map.m_offset.m_stride.S0 * i0 ];
return m_map.m_impl_handle[ m_map.m_impl_offset.m_stride.S0 * i0 ];
}
template< typename I0 , typename I1
, class ... Args >
KOKKOS_FORCEINLINE_FUNCTION
@ -885,7 +890,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i0 + m_map.m_offset.m_dim.N0 * i1 ];
return m_map.m_impl_handle[ i0 + m_map.m_impl_offset.m_dim.N0 * i1 ];
}
template< typename I0 , typename I1
@ -901,7 +906,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i0 + m_map.m_offset.m_stride * i1 ];
return m_map.m_impl_handle[ i0 + m_map.m_impl_offset.m_stride * i1 ];
}
template< typename I0 , typename I1
@ -917,7 +922,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i1 + m_map.m_offset.m_dim.N1 * i0 ];
return m_map.m_impl_handle[ i1 + m_map.m_impl_offset.m_dim.N1 * i0 ];
}
template< typename I0 , typename I1
@ -933,7 +938,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i1 + m_map.m_offset.m_stride * i0 ];
return m_map.m_impl_handle[ i1 + m_map.m_impl_offset.m_stride * i0 ];
}
template< typename I0 , typename I1
@ -949,8 +954,8 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i0 * m_map.m_offset.m_stride.S0 +
i1 * m_map.m_offset.m_stride.S1 ];
return m_map.m_impl_handle[ i0 * m_map.m_impl_offset.m_stride.S0 +
i1 * m_map.m_impl_offset.m_stride.S1 ];
}
//------------------------------
@ -968,7 +973,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2) ];
}
template< typename I0 , typename I1 , typename I2
@ -1001,7 +1006,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1036,7 +1041,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1073,7 +1078,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1110,7 +1115,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5,i6,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5,i6) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5,i6) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1147,7 +1152,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5,i6,i7,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5,i6,i7) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5,i6,i7) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1206,7 +1211,7 @@ public:
operator()( const I0 & i0 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0) )
return m_map.m_handle[ i0 ];
return m_map.m_impl_handle[ i0 ];
}
template< typename I0 >
@ -1220,7 +1225,7 @@ public:
operator()( const I0 & i0) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0) )
return m_map.m_handle[ m_map.m_offset.m_stride.S0 * i0 ];
return m_map.m_impl_handle[ m_map.m_impl_offset.m_stride.S0 * i0 ];
}
//------------------------------
// Rank 1 operator[]
@ -1249,7 +1254,7 @@ public:
operator[]( const I0 & i0 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0) )
return m_map.m_handle[ i0 ];
return m_map.m_impl_handle[ i0 ];
}
template< typename I0 >
@ -1263,7 +1268,7 @@ public:
operator[]( const I0 & i0 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0) )
return m_map.m_handle[ m_map.m_offset.m_stride.S0 * i0 ];
return m_map.m_impl_handle[ m_map.m_impl_offset.m_stride.S0 * i0 ];
}
@ -1294,7 +1299,7 @@ public:
operator()( const I0 & i0 , const I1 & i1) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1) )
return m_map.m_handle[ i0 + m_map.m_offset.m_dim.N0 * i1 ];
return m_map.m_impl_handle[ i0 + m_map.m_impl_offset.m_dim.N0 * i1 ];
}
template< typename I0 , typename I1>
@ -1308,7 +1313,7 @@ public:
operator()( const I0 & i0 , const I1 & i1) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1) )
return m_map.m_handle[ i0 + m_map.m_offset.m_stride * i1 ];
return m_map.m_impl_handle[ i0 + m_map.m_impl_offset.m_stride * i1 ];
}
template< typename I0 , typename I1 >
@ -1322,7 +1327,7 @@ public:
operator()( const I0 & i0 , const I1 & i1 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1) )
return m_map.m_handle[ i1 + m_map.m_offset.m_dim.N1 * i0 ];
return m_map.m_impl_handle[ i1 + m_map.m_impl_offset.m_dim.N1 * i0 ];
}
template< typename I0 , typename I1 >
@ -1336,7 +1341,7 @@ public:
operator()( const I0 & i0 , const I1 & i1 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1) )
return m_map.m_handle[ i1 + m_map.m_offset.m_stride * i0 ];
return m_map.m_impl_handle[ i1 + m_map.m_impl_offset.m_stride * i0 ];
}
template< typename I0 , typename I1>
@ -1350,8 +1355,8 @@ public:
operator()( const I0 & i0 , const I1 & i1 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1) )
return m_map.m_handle[ i0 * m_map.m_offset.m_stride.S0 +
i1 * m_map.m_offset.m_stride.S1 ];
return m_map.m_impl_handle[ i0 * m_map.m_impl_offset.m_stride.S0 +
i1 * m_map.m_impl_offset.m_stride.S1 ];
}
//------------------------------
@ -1367,7 +1372,7 @@ public:
operator()( const I0 & i0 , const I1 & i1 , const I2 & i2) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2) ];
}
template< typename I0 , typename I1 , typename I2>
@ -1396,7 +1401,7 @@ public:
operator()( const I0 & i0 , const I1 & i1 , const I2 & i2 , const I3 & i3) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3 >
@ -1427,7 +1432,7 @@ public:
, const I4 & i4 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1460,7 +1465,7 @@ public:
, const I4 & i4 , const I5 & i5 ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1493,7 +1498,7 @@ public:
, const I4 & i4 , const I5 & i5 , const I6 & i6) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5,i6) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5,i6) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5,i6) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1526,7 +1531,7 @@ public:
, const I4 & i4 , const I5 & i5 , const I6 & i6 , const I7 & i7) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5,i6,i7) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5,i6,i7) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5,i6,i7) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1545,7 +1550,6 @@ public:
}
#endif
template< class ... Args >
KOKKOS_FORCEINLINE_FUNCTION
typename std::enable_if<( Kokkos::Impl::are_integral<Args...>::value
@ -1585,7 +1589,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,args...) )
return m_map.m_handle[ i0 ];
return m_map.m_impl_handle[ i0 ];
}
template< typename I0
@ -1601,7 +1605,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,args...) )
return m_map.m_handle[ m_map.m_offset.m_stride.S0 * i0 ];
return m_map.m_impl_handle[ m_map.m_impl_offset.m_stride.S0 * i0 ];
}
template< typename I0 , typename I1
@ -1632,7 +1636,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i0 + m_map.m_offset.m_dim.N0 * i1 ];
return m_map.m_impl_handle[ i0 + m_map.m_impl_offset.m_dim.N0 * i1 ];
}
template< typename I0 , typename I1
@ -1648,7 +1652,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i0 + m_map.m_offset.m_stride * i1 ];
return m_map.m_impl_handle[ i0 + m_map.m_impl_offset.m_stride * i1 ];
}
template< typename I0 , typename I1
@ -1664,7 +1668,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i1 + m_map.m_offset.m_dim.N1 * i0 ];
return m_map.m_impl_handle[ i1 + m_map.m_impl_offset.m_dim.N1 * i0 ];
}
template< typename I0 , typename I1
@ -1680,7 +1684,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i1 + m_map.m_offset.m_stride * i0 ];
return m_map.m_impl_handle[ i1 + m_map.m_impl_offset.m_stride * i0 ];
}
template< typename I0 , typename I1
@ -1696,8 +1700,8 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,args...) )
return m_map.m_handle[ i0 * m_map.m_offset.m_stride.S0 +
i1 * m_map.m_offset.m_stride.S1 ];
return m_map.m_impl_handle[ i0 * m_map.m_impl_offset.m_stride.S0 +
i1 * m_map.m_impl_offset.m_stride.S1 ];
}
//------------------------------
@ -1715,7 +1719,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2) ];
}
template< typename I0 , typename I1 , typename I2
@ -1748,7 +1752,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1783,7 +1787,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1820,7 +1824,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1857,7 +1861,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5,i6,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5,i6) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5,i6) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1894,7 +1898,7 @@ public:
, Args ... args ) const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (m_track,m_map,i0,i1,i2,i3,i4,i5,i6,i7,args...) )
return m_map.m_handle[ m_map.m_offset(i0,i1,i2,i3,i4,i5,i6,i7) ];
return m_map.m_impl_handle[ m_map.m_impl_offset(i0,i1,i2,i3,i4,i5,i6,i7) ];
}
template< typename I0 , typename I1 , typename I2 , typename I3
@ -1938,6 +1942,8 @@ public:
KOKKOS_INLINE_FUNCTION
View & operator = ( View && rhs ) { m_track = std::move(rhs.m_track) ; m_map = std::move(rhs.m_map) ; return *this ; }
//----------------------------------------
// Compatible view copy constructor and assignment
// may assign unmanaged from managed.
@ -2206,6 +2212,7 @@ public:
, arg_N4 , arg_N5 , arg_N6 , arg_N7 )
)
{
static_assert ( traits::array_layout::is_extent_constructible , "Layout is not extent constructible. A layout object should be passed instead.\n" );
#ifdef KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST
Impl::runtime_check_rank_host(traits::rank_dynamic, std::is_same<typename traits::specialize,void>::value, arg_N0, arg_N1, arg_N2, arg_N3,
@ -2257,6 +2264,15 @@ public:
#endif
}
template <class Traits>
KOKKOS_INLINE_FUNCTION
View( const track_type & track, const Kokkos::Impl::ViewMapping< Traits , void > &map ) :
m_track(track), m_map()
{
typedef Kokkos::Impl::ViewMapping< traits , Traits , void > Mapping ;
static_assert( Mapping::is_assignable , "Incompatible View copy construction" );
Mapping::assign( m_map , map , track );
}
//----------------------------------------
// Memory span required to wrap these dimensions.
@ -2346,7 +2362,7 @@ public:
static inline
size_t shmem_size( typename traits::array_layout const& arg_layout )
{
return map_type::memory_span( arg_layout );
return map_type::memory_span( arg_layout )+sizeof(typename traits::value_type);
}
explicit KOKKOS_INLINE_FUNCTION
@ -2354,7 +2370,7 @@ public:
, const typename traits::array_layout & arg_layout )
: View( Impl::ViewCtorProp<pointer_type>(
reinterpret_cast<pointer_type>(
arg_space.get_shmem( map_type::memory_span( arg_layout ) ) ) )
arg_space.get_shmem_aligned( map_type::memory_span( arg_layout ), sizeof(typename traits::value_type) ) ) )
, arg_layout )
{}
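
With shmem_size() now reserving one extra value_type of slack and the scratch-memory constructors routing through get_shmem_aligned, per-team scratch views come back properly aligned. A usage sketch (launch and scratch_demo are illustrative names):

```cpp
#include <Kokkos_Core.hpp>

using ScratchVec =
    Kokkos::View<double*,
                 Kokkos::DefaultExecutionSpace::scratch_memory_space,
                 Kokkos::MemoryUnmanaged>;

void launch(int league, int team, int n) {
  using policy_t = Kokkos::TeamPolicy<>;
  // shmem_size() now includes the alignment slack added above.
  const size_t bytes = ScratchVec::shmem_size(n);
  Kokkos::parallel_for(
      "scratch_demo",
      policy_t(league, team).set_scratch_size(0, Kokkos::PerTeam(bytes)),
      KOKKOS_LAMBDA(const policy_t::member_type& member) {
        ScratchVec tmp(member.team_scratch(0), n);  // aligned allocation
        if (member.team_rank() == 0 && n > 0) tmp(0) = 0.0;
      });
}
```
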
@ -2370,11 +2386,11 @@ public:
, const size_t arg_N7 = KOKKOS_IMPL_CTOR_DEFAULT_ARG )
: View( Impl::ViewCtorProp<pointer_type>(
reinterpret_cast<pointer_type>(
arg_space.get_shmem(
arg_space.get_shmem_aligned(
map_type::memory_span(
typename traits::array_layout
( arg_N0 , arg_N1 , arg_N2 , arg_N3
, arg_N4 , arg_N5 , arg_N6 , arg_N7 ) ) ) ) )
, arg_N4 , arg_N5 , arg_N6 , arg_N7 ) ), sizeof(typename traits::value_type) ) ) )
, typename traits::array_layout
( arg_N0 , arg_N1 , arg_N2 , arg_N3
, arg_N4 , arg_N5 , arg_N6 , arg_N7 )
@ -2515,209 +2531,6 @@ void shared_allocation_tracking_enable()
} /* namespace Impl */
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
namespace Kokkos {
namespace Impl {
// Deduce Mirror Types
template<class Space, class T, class ... P>
struct MirrorViewType {
// The incoming view_type
typedef typename Kokkos::View<T,P...> src_view_type;
// The memory space for the mirror view
typedef typename Space::memory_space memory_space;
// Check whether it is the same memory space
enum { is_same_memspace = std::is_same<memory_space,typename src_view_type::memory_space>::value };
// The array_layout
typedef typename src_view_type::array_layout array_layout;
// The data type (we probably want it non-const since otherwise we can't even deep_copy to it.
typedef typename src_view_type::non_const_data_type data_type;
// The destination view type if it is not the same memory space
typedef Kokkos::View<data_type,array_layout,Space> dest_view_type;
// If it is the same memory_space return the existsing view_type
// This will also keep the unmanaged trait if necessary
typedef typename std::conditional<is_same_memspace,src_view_type,dest_view_type>::type view_type;
};
template<class Space, class T, class ... P>
struct MirrorType {
// The incoming view_type
typedef typename Kokkos::View<T,P...> src_view_type;
// The memory space for the mirror view
typedef typename Space::memory_space memory_space;
// Check whether it is the same memory space
enum { is_same_memspace = std::is_same<memory_space,typename src_view_type::memory_space>::value };
// The array_layout
typedef typename src_view_type::array_layout array_layout;
// The data type (we probably want it non-const since otherwise we can't even deep_copy to it.
typedef typename src_view_type::non_const_data_type data_type;
// The destination view type if it is not the same memory space
typedef Kokkos::View<data_type,array_layout,Space> view_type;
};
}
template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror( const Kokkos::View<T,P...> & src
, typename std::enable_if<
! std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
)
{
typedef View<T,P...> src_type ;
typedef typename src_type::HostMirror dst_type ;
return dst_type( std::string( src.label() ).append("_mirror")
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
, src.extent(0)
, src.extent(1)
, src.extent(2)
, src.extent(3)
, src.extent(4)
, src.extent(5)
, src.extent(6)
, src.extent(7) );
#else
, src.rank_dynamic > 0 ? src.extent(0): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 1 ? src.extent(1): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 2 ? src.extent(2): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 3 ? src.extent(3): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 4 ? src.extent(4): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 5 ? src.extent(5): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 6 ? src.extent(6): KOKKOS_IMPL_CTOR_DEFAULT_ARG
, src.rank_dynamic > 7 ? src.extent(7): KOKKOS_IMPL_CTOR_DEFAULT_ARG );
#endif
}
template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror( const Kokkos::View<T,P...> & src
, typename std::enable_if<
std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
)
{
typedef View<T,P...> src_type ;
typedef typename src_type::HostMirror dst_type ;
Kokkos::LayoutStride layout ;
layout.dimension[0] = src.extent(0);
layout.dimension[1] = src.extent(1);
layout.dimension[2] = src.extent(2);
layout.dimension[3] = src.extent(3);
layout.dimension[4] = src.extent(4);
layout.dimension[5] = src.extent(5);
layout.dimension[6] = src.extent(6);
layout.dimension[7] = src.extent(7);
layout.stride[0] = src.stride_0();
layout.stride[1] = src.stride_1();
layout.stride[2] = src.stride_2();
layout.stride[3] = src.stride_3();
layout.stride[4] = src.stride_4();
layout.stride[5] = src.stride_5();
layout.stride[6] = src.stride_6();
layout.stride[7] = src.stride_7();
return dst_type( std::string( src.label() ).append("_mirror") , layout );
}
// Create a mirror in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorType<Space,T,P ...>::view_type create_mirror(const Space& , const Kokkos::View<T,P...> & src) {
return typename Impl::MirrorType<Space,T,P ...>::view_type(src.label(),src.layout());
}
template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror_view( const Kokkos::View<T,P...> & src
, typename std::enable_if<(
std::is_same< typename Kokkos::View<T,P...>::memory_space
, typename Kokkos::View<T,P...>::HostMirror::memory_space
>::value
&&
std::is_same< typename Kokkos::View<T,P...>::data_type
, typename Kokkos::View<T,P...>::HostMirror::data_type
>::value
)>::type * = 0
)
{
return src ;
}
template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror_view( const Kokkos::View<T,P...> & src
, typename std::enable_if< ! (
std::is_same< typename Kokkos::View<T,P...>::memory_space
, typename Kokkos::View<T,P...>::HostMirror::memory_space
>::value
&&
std::is_same< typename Kokkos::View<T,P...>::data_type
, typename Kokkos::View<T,P...>::HostMirror::data_type
>::value
)>::type * = 0
)
{
return Kokkos::create_mirror( src );
}
// Create a mirror view in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
return src;
}
// Create a mirror view in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
return typename Impl::MirrorViewType<Space,T,P ...>::view_type(src.label(),src.layout());
}
// Create a mirror view and deep_copy in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
(void)name;
return src;
}
// Create a mirror view and deep_copy in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
using Mirror = typename Impl::MirrorViewType<Space,T,P ...>::view_type;
std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror(ViewAllocateWithoutInitializing(label), src.layout());
deep_copy(mirror, src);
return mirror;
}
} /* namespace Kokkos */
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------

View File

@ -16,6 +16,7 @@ endif
CXXFLAGS ?= -O3
LINK ?= $(CXX)
LDFLAGS ?=
CP = cp
include $(KOKKOS_PATH)/Makefile.kokkos
include $(KOKKOS_PATH)/core/src/Makefile.generate_header_lists
@ -51,6 +52,11 @@ ifeq ($(KOKKOS_OS),Linux)
endif
ifeq ($(KOKKOS_OS),Darwin)
COPY_FLAG =
# If Homebrew coreutils is installed, its cp will have the -u option
ifneq ("$(wildcard /usr/local/opt/coreutils/libexec/gnubin/cp)","")
CP = /usr/local/opt/coreutils/libexec/gnubin/cp
COPY_FLAG = -u
endif
endif
ifeq ($(KOKKOS_DEBUG),"no")
@ -66,36 +72,38 @@ mkdir:
mkdir -p $(PREFIX)/bin
mkdir -p $(PREFIX)/include
mkdir -p $(PREFIX)/lib
mkdir -p $(PREFIX)/lib/pkgconfig
mkdir -p $(PREFIX)/include/impl
copy-cuda: mkdir
mkdir -p $(PREFIX)/include/Cuda
cp $(COPY_FLAG) $(KOKKOS_HEADERS_CUDA) $(PREFIX)/include/Cuda
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_CUDA) $(PREFIX)/include/Cuda
copy-threads: mkdir
mkdir -p $(PREFIX)/include/Threads
cp $(COPY_FLAG) $(KOKKOS_HEADERS_THREADS) $(PREFIX)/include/Threads
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_THREADS) $(PREFIX)/include/Threads
copy-qthreads: mkdir
mkdir -p $(PREFIX)/include/Qthreads
cp $(COPY_FLAG) $(KOKKOS_HEADERS_QTHREADS) $(PREFIX)/include/Qthreads
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_QTHREADS) $(PREFIX)/include/Qthreads
copy-openmp: mkdir
mkdir -p $(PREFIX)/include/OpenMP
cp $(COPY_FLAG) $(KOKKOS_HEADERS_OPENMP) $(PREFIX)/include/OpenMP
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_OPENMP) $(PREFIX)/include/OpenMP
copy-rocm: mkdir
mkdir -p $(PREFIX)/include/ROCm
cp $(COPY_FLAG) $(KOKKOS_HEADERS_ROCM) $(PREFIX)/include/ROCm
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_ROCM) $(PREFIX)/include/ROCm
install: mkdir $(CONDITIONAL_COPIES) build-lib generate_build_settings
cp $(COPY_FLAG) $(NVCC_WRAPPER) $(PREFIX)/bin
cp $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE) $(PREFIX)/include
cp $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE_IMPL) $(PREFIX)/include/impl
cp $(COPY_FLAG) $(KOKKOS_MAKEFILE) $(PREFIX)
cp $(COPY_FLAG) $(KOKKOS_CMAKEFILE) $(PREFIX)
cp $(COPY_FLAG) libkokkos.a $(PREFIX)/lib
cp $(COPY_FLAG) $(KOKKOS_CONFIG_HEADER) $(PREFIX)/include
$(CP) $(COPY_FLAG) $(NVCC_WRAPPER) $(PREFIX)/bin
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE) $(PREFIX)/include
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE_IMPL) $(PREFIX)/include/impl
$(CP) $(COPY_FLAG) $(KOKKOS_MAKEFILE) $(PREFIX)
$(CP) $(COPY_FLAG) $(KOKKOS_CMAKEFILE) $(PREFIX)
$(CP) $(COPY_FLAG) $(KOKKOS_PKGCONFIG) $(PREFIX)/lib/pkgconfig
$(CP) $(COPY_FLAG) libkokkos.a $(PREFIX)/lib
$(CP) $(COPY_FLAG) $(KOKKOS_CONFIG_HEADER) $(PREFIX)/include
clean: kokkos-clean
rm -f $(KOKKOS_MAKEFILE) $(KOKKOS_CMAKEFILE)
rm -f $(KOKKOS_MAKEFILE) $(KOKKOS_CMAKEFILE) $(KOKKOS_PKGCONFIG)

View File

@ -5,6 +5,7 @@
# These files are generated by this makefile
KOKKOS_MAKEFILE=Makefile.kokkos
KOKKOS_CMAKEFILE=kokkos_generated_settings.cmake
KOKKOS_PKGCONFIG=kokkos.pc
ifeq ($(KOKKOS_DEBUG),"no")
KOKKOS_DEBUG_CMAKE = OFF
@ -33,11 +34,29 @@ kokkos_append_var = $(call kokkos_appendvar_makefile,$1); $(call kokkos_appendva
kokkos_append_var2 = $(call kokkos_appendvar2_makefile,$1); $(call kokkos_appendvar_cmakefile,$1,$2)
kokkos_append_varval = $(call kokkos_appendval_makefile,$1,$2); $(call kokkos_appendval_cmakefile,$1,$2,$3)
kokkos_fixup_sed_impl = sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= $(KOKKOS_CONFIG_HEADER)|= $(PREFIX)/include/$(KOKKOS_CONFIG_HEADER)|g' $1 \
> $1.tmp && mv -f $1.tmp $1
$(KOKKOS_PKGCONFIG): $(KOKKOS_PATH)/core/src/$(KOKKOS_PKGCONFIG).in
@sed -e 's|@CMAKE_INSTALL_PREFIX@|$(PREFIX)|g' \
-e 's|@KOKKOS_CXXFLAGS@|$(patsubst -I%,,$(KOKKOS_CXXFLAGS))|g' \
-e 's|@KOKKOS_EXTRA_LIBS_LIST@|$(KOKKOS_EXTRA_LIBS)|g' \
-e 's|@KOKKOS_LINK_FLAGS@|$(KOKKOS_LINK_FLAGS)|g' \
$< > $@
kokkos_fixup_sed = $(call kokkos_fixup_sed_impl,$(KOKKOS_MAKEFILE)); $(call kokkos_fixup_sed_impl,$(KOKKOS_CMAKEFILE))
#This function should be used for variables whose values are different in GNU Make versus CMake,
#especially lists which are delimited by commas in one case and semicolons in another
kokkos_append_gmakevar = $(call kokkos_appendvar_makefile,$1); $(call kokkos_append_gmakevar_cmakefile,$1,$2)
generate_build_settings: $(KOKKOS_CONFIG_HEADER)
generate_build_settings: $(KOKKOS_CONFIG_HEADER) $(KOKKOS_PKGCONFIG)
@rm -f $(KOKKOS_MAKEFILE)
@rm -f $(KOKKOS_CMAKEFILE)
@$(call kokkos_append_string, "#Global Settings used to generate this library")
@ -68,7 +87,6 @@ generate_build_settings: $(KOKKOS_CONFIG_HEADER)
@$(call kokkos_append_var,KOKKOS_HEADERS_ROCM,'STRING "Kokkos headers ROCm list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_THREADS,'STRING "Kokkos headers Threads list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_QTHREADS,'STRING "Kokkos headers QThreads list"')
@$(call kokkos_append_var,KOKKOS_SRC,'STRING "Kokkos source list"')
@$(call kokkos_append_string,"")
@$(call kokkos_append_string,"#Variables used in application Makefiles")
@$(call kokkos_append_var,KOKKOS_OS,'STRING ""') # This was not in original cmake gen
@ -94,19 +112,11 @@ generate_build_settings: $(KOKKOS_CONFIG_HEADER)
@$(call kokkos_append_makefile,"#Fake kokkos-clean target")
@$(call kokkos_append_makefile,"kokkos-clean:")
@$(call kokkos_append_makefile,"")
@sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= $(KOKKOS_CONFIG_HEADER)|= $(PREFIX)/include/$(KOKKOS_CONFIG_HEADER)|g' $(KOKKOS_MAKEFILE) \
> $(KOKKOS_MAKEFILE).tmp
@mv -f $(KOKKOS_MAKEFILE).tmp $(KOKKOS_MAKEFILE)
@$(call kokkos_fixup_sed)
@$(call kokkos_append_var,KOKKOS_SRC,'STRING "Kokkos source list"')
@$(call kokkos_setvar_cmakefile,KOKKOS_CXX_FLAGS,$(KOKKOS_CXXFLAGS))
@$(call kokkos_setvar_cmakefile,KOKKOS_CPP_FLAGS,$(KOKKOS_CPPFLAGS))
@$(call kokkos_setvar_cmakefile,KOKKOS_LD_FLAGS,$(KOKKOS_LDFLAGS))
@$(call kokkos_setlist_cmakefile,KOKKOS_LIBS_LIST,$(KOKKOS_LIBS))
@$(call kokkos_setlist_cmakefile,KOKKOS_EXTRA_LIBS_LIST,$(KOKKOS_EXTRA_LIBS))
@$(call kokkos_setvar_cmakefile,KOKKOS_LINK_FLAGS,$(KOKKOS_LINK_FLAGS))

View File

@ -103,8 +103,6 @@ public:
void TaskQueueSpecialization< Kokkos::OpenMP >::execute
( TaskQueue< Kokkos::OpenMP > * const queue )
{
using execution_space = Kokkos::OpenMP ;
using queue_type = TaskQueue< execution_space > ;
using task_root_type = TaskBase< void , void , void > ;
using Member = Impl::HostThreadTeamMember< execution_space > ;
@ -213,8 +211,6 @@ void TaskQueueSpecialization< Kokkos::OpenMP >::
iff_single_thread_recursive_execute
( TaskQueue< Kokkos::OpenMP > * const queue )
{
using execution_space = Kokkos::OpenMP ;
using queue_type = TaskQueue< execution_space > ;
using task_root_type = TaskBase< void , void , void > ;
using Member = Impl::HostThreadTeamMember< execution_space > ;

View File

@ -76,14 +76,11 @@ public:
//----------------------------------------
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & ) {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
@ -92,6 +89,47 @@ public:
inline static
int team_size_recommended( const FunctorType & )
{
return traits::execution_space::thread_pool_size(2);
}
template< class FunctorType >
inline static
int team_size_recommended( const FunctorType &, const int& )
{
return traits::execution_space::thread_pool_size(2);
}
#endif
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
@ -99,15 +137,16 @@ public:
#endif
}
template< class FunctorType >
inline static
int team_size_recommended( const FunctorType &, const int& )
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
int vector_length_max()
{ return 1024; } // Arbitrarily large number; meant as a vectorizable length
inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32: // Roughly L1 size
20*1024*1024); // Limit to keep compatibility with CUDA
}
//----------------------------------------
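A hedged sketch of how the new tag-based queries are called; the functor Work and the sizes are illustrative:

struct Work {
  KOKKOS_INLINE_FUNCTION
  void operator()(const Kokkos::TeamPolicy<>::member_type&) const {}
};

void query_team_sizes() {
  Kokkos::TeamPolicy<> policy(100, 1);
  Work f;
  // The tags let parallel_for and parallel_reduce report different limits:
  const int for_max    = policy.team_size_max(f, Kokkos::ParallelForTag());
  const int reduce_max = policy.team_size_max(f, Kokkos::ParallelReduceTag());
  const int rec        = policy.team_size_recommended(f, Kokkos::ParallelForTag());
  Kokkos::parallel_for(Kokkos::TeamPolicy<>(100, rec), f);
  (void)for_max; (void)reduce_max;
}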

View File

@ -160,7 +160,8 @@ SharedAllocationRecord( const Kokkos::Experimental::OpenMPTargetSpace & arg_spac
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set the last element to zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
//TODO DeepCopy
// DeepCopy
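The guard added above follows the usual strncpy idiom; a standalone sketch with an illustrative buffer length:

#include <cstring>

enum { maximum_label_length = 32 };  // illustrative; Kokkos defines its own

void copy_label(char (&dst)[maximum_label_length], const char* src) {
  std::strncpy(dst, src, maximum_label_length);
  // strncpy leaves dst unterminated when src fills the buffer,
  // so force a terminator into the last slot:
  dst[maximum_label_length - 1] = '\0';
}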

View File

@ -44,8 +44,8 @@
#ifndef GUARD_CORE_KOKKOS_ROCM_CONFIG_HPP
#define GUARD_CORE_KOKKOS_ROCM_CONFIG_HPP
#ifndef KOKKOS_ROCM_HAS_WORKAROUNDS
#define KOKKOS_ROCM_HAS_WORKAROUNDS 1
#ifndef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
#define KOKKOS_IMPL_ROCM_CLANG_WORKAROUND 1
#endif
#endif

View File

@ -56,13 +56,13 @@ namespace Impl {
struct ROCmTraits {
// TODO: determine if needed
enum { WavefrontSize = 64 /* 64 */ };
enum { WorkgroupSize = 64 /* 64 */ };
enum { WavefrontIndexMask = 0x001f /* Mask for warpindex */ };
enum { WavefrontIndexShift = 5 /* WarpSize == 1 << WarpShift */ };
enum { WorkgroupSize = 256 /* 256 */ };
enum { WavefrontIndexMask = 0x003f /* Mask for wavefrontindex */ };
enum { WavefrontIndexShift = 6 /* WavefrontSize == 1 << WavefrontShift */ };
enum { SharedMemoryBanks = 32 /* Compute device 2.0 */ };
enum { SharedMemoryCapacity = 0x0C000 /* 48k shared / 16k L1 Cache */ };
enum { SharedMemoryUsage = 0x04000 /* 16k shared / 48k L1 Cache */ };
enum { SharedMemoryBanks = 64 /* GCN */ };
enum { SharedMemoryCapacity = 0x10000 /* 64k shared / 16k L1 Cache */ };
enum { SharedMemoryUsage = 0x04000 /* 16k default usage of 64k shared */ };
enum { UpperBoundExtentCount = 4294967295 /* Hard upper bound */ };
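As a quick consistency check on the updated constants (GCN wavefronts are 64 lanes wide); a sketch, not part of the source:

static_assert(64 == (1 << 6), "WavefrontSize == 1 << WavefrontIndexShift");
static_assert(0x003f == 64 - 1, "WavefrontIndexMask selects the lane in a wavefront");
// For a flat thread index t:
//   lane      = t & 0x003f;   // position within the wavefront
//   wavefront = t >> 6;       // which wavefront t belongs to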
#if 0
@ -84,6 +84,16 @@ size_t rocm_internal_maximum_workgroup_count();
size_t * rocm_internal_scratch_flags( const size_t size );
size_t * rocm_internal_scratch_space( const size_t size );
// This pointer is the start of dynamic shared memory (LDS).
// Dynamic is at the end of LDS and its size must be specified
// in a tile_block specification at kernel launch time.
template< typename T >
KOKKOS_INLINE_FUNCTION
T * kokkos_impl_rocm_shared_memory()
//{ return (T*) hc::get_group_segment_base_pointer() ; }
{ return (T*) hc::get_dynamic_group_segment_base_pointer() ; }
}
} // namespace Kokkos
#define ROCM_SPACE_ATOMIC_MASK 0x1FFFF
@ -249,7 +259,6 @@ struct ROCmParallelLaunch< DriverType
size_t bx = (grid.x > block.x)? block.x : grid.x;
size_t by = (grid.y > block.y)? block.y : grid.y;
size_t bz = (grid.z > block.z)? block.z : grid.z;
hc::parallel_for_each(ext.tile_with_dynamic(bz,by,bx,shmem), [=](const hc::index<3> & idx) [[hc]]

View File

@ -543,20 +543,13 @@ enum { sizeScratchGrain = sizeof(ScratchGrain) };
void rocmMemset( Kokkos::Experimental::ROCm::size_type * ptr , Kokkos::Experimental::ROCm::size_type value , Kokkos::Experimental::ROCm::size_type size)
{
char * mptr = (char * ) ptr;
#if 0
parallel_for_each(hc::extent<1>(size),
/* parallel_for_each(hc::extent<1>(size),
[=, &ptr]
(hc::index<1> idx) __HC__
{
int i = idx[0];
ptr[i] = value;
}).wait();
#else
for (int i= 0; i<size ; i++)
{
mptr[i] = (char) value;
}
#endif
}).wait();*/
}
Kokkos::Experimental::ROCm::size_type *
@ -567,9 +560,9 @@ ROCmInternal::scratch_flags( const Kokkos::Experimental::ROCm::size_type size )
m_scratchFlagsCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::HostSpace , void > Record ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace , void > Record ;
Record * const r = Record::allocate( Kokkos::HostSpace()
Record * const r = Record::allocate( Kokkos::Experimental::ROCmSpace()
, "InternalScratchFlags"
, ( sizeScratchGrain * m_scratchFlagsCount ) );
@ -590,9 +583,9 @@ ROCmInternal::scratch_space( const Kokkos::Experimental::ROCm::size_type size )
m_scratchSpaceCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::HostSpace , void > Record ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace , void > Record ;
Record * const r = Record::allocate( Kokkos::HostSpace()
static Record * const r = Record::allocate( Kokkos::Experimental::ROCmSpace()
, "InternalScratchSpace"
, ( sizeScratchGrain * m_scratchSpaceCount ) );
@ -616,7 +609,7 @@ void ROCmInternal::finalize()
// scratch_lock_array_rocm_space_ptr(false);
// threadid_lock_array_rocm_space_ptr(false);
typedef Kokkos::Impl::SharedAllocationRecord< HostSpace > RecordROCm ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace > RecordROCm ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmHostPinnedSpace > RecordHost ;
RecordROCm::decrement( RecordROCm::get_record( m_scratchFlags ) );

View File

@ -243,6 +243,15 @@ public:
return(max);
}
template< class FunctorType , class PatternTypeTag>
int team_size_max( const FunctorType& functor, PatternTypeTag) {
return 256/vector_length();
}
template< class FunctorType , class PatternTypeTag>
int team_size_recommended( const FunctorType& functor, PatternTypeTag) {
return 128/vector_length();
}
template<class F>
KOKKOS_INLINE_FUNCTION int team_size(const F& f) const { return (m_team_size > 0) ? m_team_size : team_size_recommended(f); }
KOKKOS_INLINE_FUNCTION int team_size() const { return (m_team_size > 0) ? m_team_size : Impl::get_max_tile_thread(); }
@ -261,6 +270,11 @@ public:
return m_thread_scratch_size[level];
}
static int scratch_size_max(int level) {
return level==0 ?
1024*40 : 1024*1024*20;
}
typedef Impl::ROCmTeamMember member_type;
};
@ -487,6 +501,7 @@ public:
#endif
}
m_idx.barrier.wait();
reducer.reference() = buffer[0];
}
/** \brief Intra-team vector reduce
@ -541,19 +556,19 @@ public:
}
template< typename ReducerType >
KOKKOS_INLINE_FUNCTION static
KOKKOS_INLINE_FUNCTION
typename std::enable_if< is_reducer< ReducerType >::value >::type
vector_reduce( ReducerType const & reducer )
vector_reduce( ReducerType const & reducer ) const
{
#ifdef __HCC_ACCELERATOR__
if(blockDim_x == 1) return;
if(m_vector_length == 1) return;
// Intra vector lane shuffle reduction:
typename ReducerType::value_type tmp ( reducer.reference() );
for ( int i = blockDim_x ; ( i >>= 1 ) ; ) {
shfl_down( reducer.reference() , i , blockDim_x );
if ( (int)threadIdx_x < i ) { reducer.join( tmp , reducer.reference() ); }
for ( int i = m_vector_length ; ( i >>= 1 ) ; ) {
reducer.reference() = shfl_down( tmp , i , m_vector_length );
if ( (int)vector_rank() < i ) { reducer.join( tmp , reducer.reference() ); }
}
// Broadcast from root lane to all other lanes.
@ -561,7 +576,7 @@ public:
// because floating point summation is not associative
// and thus different threads could have different results.
shfl( reducer.reference() , 0 , blockDim_x );
reducer.reference() = shfl( tmp , 0 , m_vector_length );
#endif
}
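A host-side model of the shuffle reduction above, with plain addition standing in for the reducer's join; a sketch, not the device code:

#include <vector>

double vector_reduce_model(std::vector<double> lane) {
  const int vl = static_cast<int>(lane.size());  // power-of-two vector length
  for (int i = vl >> 1; i > 0; i >>= 1)
    for (int t = 0; t < i; ++t)
      lane[t] += lane[t + i];   // models shfl_down(value, i) followed by join
  // The device code then broadcasts lane 0 so every lane holds an identical
  // result, since floating-point addition is not associative.
  return lane[0];
}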
@ -847,7 +862,7 @@ public:
hc::extent< 1 > flat_extent( total_size );
hc::tiled_extent< 1 > team_extent = flat_extent.tile(team_size*vector_length);
hc::tiled_extent< 1 > team_extent = flat_extent.tile(vector_length*team_size);
hc::parallel_for_each( team_extent , [=](hc::tiled_index<1> idx) [[hc]]
{
rocm_invoke<typename Policy::work_tag>(f, typename Policy::member_type(idx, league_size, team_size, shared, shared_size, scratch_size0, scratch, scratch_size1,vector_length));
@ -958,6 +973,176 @@ public:
};
//----------------------------------------------------------------------------
template< class FunctorType , class ReducerType, class... Traits >
class ParallelReduce<
FunctorType , Kokkos::MDRangePolicy< Traits... >, ReducerType, Kokkos::Experimental::ROCm >
{
private:
typedef Kokkos::MDRangePolicy< Traits ... > Policy ;
using RP = Policy;
typedef typename Policy::array_index_type array_index_type;
typedef typename Policy::index_type index_type;
typedef typename Policy::work_tag WorkTag ;
typedef typename Policy::member_type Member ;
typedef typename Policy::launch_bounds LaunchBounds;
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;
typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTagFwd > ValueJoin ;
public:
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type value_type ;
typedef typename ValueTraits::reference_type reference_type ;
typedef FunctorType functor_type ;
typedef Kokkos::Experimental::ROCm::size_type size_type ;
// Algorithmic constraints: blockSize is a power of two AND blockDim.y == blockDim.z == 1
const FunctorType m_functor ;
const Policy m_policy ; // used for workrange and nwork
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
value_type * m_scratch_space ;
size_type * m_scratch_flags ;
typedef typename Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, typename Policy::work_tag, reference_type> DeviceIteratePattern;
KOKKOS_INLINE_FUNCTION
void exec_range( reference_type update ) const
{
Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank,Policy,FunctorType,typename Policy::work_tag, reference_type>(m_policy, m_functor, update).exec_range();
}
KOKKOS_INLINE_FUNCTION
void operator()(void) const
{
run();
}
KOKKOS_INLINE_FUNCTION
void run( ) const
{
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(value_type) >
word_count( (ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) )) / sizeof(value_type) );
// pointer to shared data accounts for the reserved space at the start
value_type * const shared = kokkos_impl_rocm_shared_memory<value_type>()
+ 2*sizeof(uint64_t);
{
reference_type value =
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , shared + threadIdx_y * word_count.value );
// Number of blocks is bounded so that the reduction can be limited to two passes.
// Each thread block is given an approximately equal amount of work to perform.
// Accumulate the values for this block.
// The accumulation ordering does not match the final pass, but is arithmetically equivalent.
this-> exec_range( value );
}
// Reduce with final value at blockDim.y - 1 location.
// Problem: non power-of-two blockDim
if ( rocm_single_inter_block_reduce_scan<false,ReducerTypeFwd,WorkTagFwd>(
ReducerConditional::select(m_functor , m_reducer) , blockIdx_x ,
gridDim_x , shared , m_scratch_space , m_scratch_flags ) ) {
// This is the final block with the final result at the final thread's location
value_type * const tshared = shared + ( blockDim_y - 1 ) * word_count.value ;
value_type * const global = m_scratch_space ;
if ( threadIdx_y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , tshared );
// for ( unsigned i = 0 ; i < word_count.value ; i+=blockDim_y ) { global[i] = tshared[i]; }
for ( unsigned i = 0 ; i < word_count.value ; i++ ) { global[i] = tshared[i]; }
}
}
}
// Determine block size constrained by shared memory:
static inline
unsigned local_block_size( const FunctorType & f )
{
unsigned n = ROCmTraits::WavefrontSize * 8 ;
while ( n && ROCmTraits::SharedMemoryCapacity < rocm_single_inter_block_reduce_scan_shmem<false,FunctorType,WorkTag>( f , n ) ) { n >>= 1 ; }
return n ;
}
inline
void execute()
{
const int nwork = m_policy.m_num_tiles;
if ( nwork ) {
int block_size = m_policy.m_prod_tile_dims;
// CONSTRAINT: Algorithm requires block_size >= product of tile dimensions
// Nearest power of two
int exponent_pow_two = std::ceil( std::log2((float)block_size) );
block_size = 1<<(exponent_pow_two);
m_scratch_space = (value_type*)rocm_internal_scratch_space( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) * block_size*nwork /* block_size == max block_count */ );
m_scratch_flags = rocm_internal_scratch_flags( sizeof(size_type) );
const dim3 block( 1 , block_size , 1 );
// Required grid.x <= block.y
const dim3 grid( nwork, block_size , 1 );
const int shmem = rocm_single_inter_block_reduce_scan_shmem<false,FunctorType,WorkTag>( m_functor , block.y );
ROCmParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute
ROCM::fence();
if ( m_result_ptr ) {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,Kokkos::Experimental::ROCmSpace>( m_result_ptr , m_scratch_space , size );
}
}
else {
if (m_result_ptr) {
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , m_result_ptr );
}
}
}
template< class HostViewType >
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const HostViewType & arg_result
, typename std::enable_if<
Kokkos::is_view< HostViewType >::value
,void*>::type = NULL)
: m_functor( arg_functor )
, m_policy( arg_policy )
, m_reducer( InvalidType() )
, m_result_ptr( arg_result.data() )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
{}
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const ReducerType & reducer)
: m_functor( arg_functor )
, m_policy( arg_policy )
, m_reducer( reducer )
, m_result_ptr( reducer.view().data() )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
{}
};
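A hedged usage sketch of the reduction this specialization enables; the view a is illustrative:

double mdrange_sum(const Kokkos::View<double**>& a) {
  const int M = static_cast<int>(a.extent(0));
  const int N = static_cast<int>(a.extent(1));
  double sum = 0.0;
  Kokkos::parallel_reduce(
      Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {M, N}),
      KOKKOS_LAMBDA(const int i, const int j, double& lsum) {
        lsum += a(i, j);
      },
      sum);
  return sum;
}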
//----------------------------------------------------------------------------
template< class FunctorType, class ReducerType, class... Traits >
class ParallelReduce<
FunctorType , Kokkos::TeamPolicy< Traits... >, ReducerType, Kokkos::Experimental::ROCm >
@ -993,7 +1178,13 @@ public:
const int scratch_size1 = policy.scratch_size(1,team_size);
const int total_size = league_size * team_size ;
if(total_size == 0) return;
typedef Kokkos::Impl::FunctorValueInit< FunctorType, typename Policy::work_tag > ValueInit ;
if(total_size==0) {
if (result_view.data()) {
ValueInit::init( f , result_view.data() );
}
return;
}
const int reduce_size = ValueTraits::value_size( f );
const int shared_size = FunctorTeamShmemSize< FunctorType >::value( f , team_size );
@ -1042,7 +1233,16 @@ public:
const int vector_length = policy.vector_length();
const int total_size = league_size * team_size;
if(total_size == 0) return;
typedef Kokkos::Impl::FunctorValueInit< ReducerType, typename Policy::work_tag > ValueInit ;
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value,
FunctorType, ReducerType> ReducerConditional;
if(total_size==0) {
if (reducer.view().data()) {
ValueInit::init( ReducerConditional::select(f,reducer),
reducer.view().data() );
}
return;
}
const int reduce_size = ValueTraits::value_size( f );
const int shared_size = FunctorTeamShmemSize< FunctorType >::value( f , team_size );
@ -1113,6 +1313,39 @@ public:
//----------------------------------------
};
template< class FunctorType , class ReturnType , class... Traits >
class ParallelScanWithTotal< FunctorType , Kokkos::RangePolicy< Traits... >,
ReturnType, Kokkos::Experimental::ROCm >
{
private:
typedef Kokkos::RangePolicy< Traits... > Policy;
typedef typename Policy::work_tag Tag;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType, Tag> ValueTraits;
public:
//----------------------------------------
inline
ParallelScanWithTotal( const FunctorType & f
, const Policy & policy
, ReturnType & arg_returnvalue)
{
const auto len = policy.end()-policy.begin();
if(len==0) return;
scan_enqueue<Tag,ReturnType>(len, f, arg_returnvalue, [](hc::tiled_index<1> idx, int, int) { return idx.global[0]; });
}
KOKKOS_INLINE_FUNCTION
void execute() const {}
//----------------------------------------
};
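This class backs parallel_scan with a trailing return value; a hedged sketch, with views in and out as illustrative names:

void exclusive_scan_with_total(const Kokkos::View<long*>& in,
                               const Kokkos::View<long*>& out,
                               long& total) {
  Kokkos::parallel_scan(
      Kokkos::RangePolicy<>(0, in.extent(0)),
      KOKKOS_LAMBDA(const int i, long& partial, const bool final) {
        if (final) out(i) = partial;  // exclusive prefix sum at position i
        partial += in(i);
      },
      total);  // receives the sum over all entries
}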
template< class FunctorType , class... Traits>
class ParallelScan< FunctorType , Kokkos::TeamPolicy< Traits... >, Kokkos::Experimental::ROCm >
{
@ -1350,22 +1583,17 @@ void parallel_for(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTe
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
typename std::enable_if< ! Kokkos::is_reducer< ValueType >::value >::type
parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
const Lambda & lambda, ValueType& result) {
result = ValueType();
Kokkos::Sum<ValueType> reducer(result);
reducer.init( reducer.reference() );
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
lambda(i,reducer.reference());
}
result = loop_boundaries.thread.team_reduce(result,
Impl::JoinAdd<ValueType>());
// Impl::rocm_intra_workgroup_reduction( loop_boundaries.thread, result,
// Impl::JoinAdd<ValueType>());
// Impl::rocm_inter_workgroup_reduction( loop_boundaries.thread, result,
// Impl::JoinAdd<ValueType>());
loop_boundaries.thread.team_reduce(reducer);
}
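A usage sketch of the non-reducer overload, which now routes through a Kokkos::Sum reducer; matrix a and output r are illustrative:

void row_sums(const Kokkos::View<double**>& a, const Kokkos::View<double*>& r) {
  using member_type = Kokkos::TeamPolicy<>::member_type;
  const int n = static_cast<int>(a.extent(1));
  Kokkos::parallel_for(
      Kokkos::TeamPolicy<>(static_cast<int>(a.extent(0)), Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type& team) {
        double row = 0.0;
        Kokkos::parallel_reduce(
            Kokkos::TeamThreadRange(team, n),
            [&](const int j, double& lsum) { lsum += a(team.league_rank(), j); },
            row);  // joined across the team via the Sum reducer
        if (team.team_rank() == 0) r(team.league_rank()) = row;
      });
}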
/** \brief Inter-thread thread range parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
@ -1374,7 +1602,8 @@ void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROC
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
typename std::enable_if< Kokkos::is_reducer< ReducerType >::value >::type
parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
const Lambda & lambda, ReducerType const & reducer) {
reducer.init( reducer.reference() );
@ -1439,7 +1668,8 @@ void parallel_for(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCm
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
typename std::enable_if< !Kokkos::is_reducer< ValueType >::value >::type
parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
loop_boundaries, const Lambda & lambda, ValueType& result) {
result = ValueType();
@ -1477,7 +1707,8 @@ void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::R
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
typename std::enable_if< Kokkos::is_reducer< ReducerType >::value >::type
parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
loop_boundaries, const Lambda & lambda, ReducerType const & reducer) {
reducer.init( reducer.reference() );
@ -1523,86 +1754,46 @@ void parallel_scan(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROC
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef typename ValueTraits::value_type value_type ;
value_type scan_val = value_type();
#if (__ROCM_ARCH__ >= 800)
// adopt the cuda vector shuffle method
const int VectorLength = loop_boundaries.increment;
int lid = loop_boundaries.thread.lindex();
int vector_rank = lid%VectorLength;
value_type val = value_type();
const int vector_length = loop_boundaries.thread.vector_length();
const int vector_rank = loop_boundaries.thread.vector_rank();
iType loop_bound = ((loop_boundaries.end+VectorLength-1)/VectorLength) * VectorLength;
value_type val ;
for(int _i = vector_rank; _i < loop_bound; _i += VectorLength) {
val = value_type();
if(_i<loop_boundaries.end)
lambda(_i , val , false);
iType end = ((loop_boundaries.end+vector_length-1)/vector_length) * vector_length;
value_type accum = value_type();
value_type tmp = val;
value_type result_i;
for ( int i = vector_rank ; i < end ; i += vector_length ) {
if(vector_rank == 0)
result_i = tmp;
if (VectorLength > 1) {
const value_type tmp2 = shfl_up(tmp, 1,VectorLength);
if(vector_rank > 0)
tmp+=tmp2;
}
if(vector_rank == 1)
result_i = tmp;
if (VectorLength > 3) {
const value_type tmp2 = shfl_up(tmp, 2,VectorLength);
if(vector_rank > 1)
tmp+=tmp2;
}
if ((vector_rank >= 2) &&
(vector_rank < 4))
result_i = tmp;
if (VectorLength > 7) {
const value_type tmp2 = shfl_up(tmp, 4,VectorLength);
if(vector_rank > 3)
tmp+=tmp2;
}
if ((vector_rank >= 4) &&
(vector_rank < 8))
result_i = tmp;
if (VectorLength > 15) {
const value_type tmp2 = shfl_up(tmp, 8,VectorLength);
if(vector_rank > 7)
tmp+=tmp2;
}
if ((vector_rank >= 8) &&
(vector_rank < 16))
result_i = tmp;
if (VectorLength > 31) {
const value_type tmp2 = shfl_up(tmp, 16,VectorLength);
if(vector_rank > 15)
tmp+=tmp2;
}
if ((vector_rank >=16) &&
(vector_rank < 32))
result_i = tmp;
if (VectorLength > 63) {
const value_type tmp2 = shfl_up(tmp, 32,VectorLength);
if(vector_rank > 31)
tmp+=tmp2;
value_type val = 0 ;
// First acquire per-lane contributions:
if ( i < loop_boundaries.end ) lambda( i , val , false );
value_type sval = val ;
// Bottom up inclusive scan in triangular pattern
// where each thread is the root of a reduction tree
// from the zeroth "lane" to itself.
// [t] += [t-1] if t >= 1
// [t] += [t-2] if t >= 2
// [t] += [t-4] if t >= 4
// ...
for ( int j = 1 ; j < vector_length ; j <<= 1 ) {
value_type tmp = 0 ;
tmp = shfl_up(sval , j , vector_length );
if ( j <= vector_rank ) { sval += tmp ; }
}
if (vector_rank >= 32)
result_i = tmp;
// Include accumulation and remove value for exclusive scan:
val = accum + sval - val ;
val = scan_val + result_i - val;
scan_val += shfl(tmp,VectorLength-1,VectorLength);
if(_i<loop_boundaries.end)
lambda(_i , val , true);
}
#else
// for kaveri, call the LDS based thread_scan routine
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,scan_val,true);
}
scan_val = loop_boundaries.thread.team_scan(scan_val);
// Provide exclusive scan value:
if ( i < loop_boundaries.end ) lambda( i , val , true );
#endif
// Accumulate the last value in the inclusive scan:
sval = shfl( sval , vector_length-1 , vector_length);
accum += sval ;
}
}
} // namespace Kokkos
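A host-side model of the triangular shfl_up scan described in the comments above, with addition as the join; a sketch only:

#include <vector>

void inclusive_scan_model(std::vector<double>& lane) {
  const int vl = static_cast<int>(lane.size());  // power-of-two vector length
  for (int j = 1; j < vl; j <<= 1) {
    const std::vector<double> prev(lane);        // snapshot = shfl_up source
    for (int t = j; t < vl; ++t)
      lane[t] += prev[t - j];                    // [t] += [t-j] when t >= j
  }
  // lane[t] now holds the inclusive prefix sum of lane[0..t].
}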

View File

@ -57,7 +57,6 @@
#include <ROCm/Kokkos_ROCm_Tile.hpp>
#include <ROCm/Kokkos_ROCm_Invoke.hpp>
#include <ROCm/Kokkos_ROCm_Join.hpp>
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
namespace Kokkos {
@ -75,7 +74,7 @@ T& reduce_value(T* x, std::false_type) [[hc]]
return *x;
}
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
struct always_true
{
template<class... Ts>
@ -149,7 +148,7 @@ void reduce_enqueue(
// Store the tile result in the global memory.
if (local == 0)
{
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
// Workaround for assigning from LDS memory: std::copy should work
// directly
buffer.action_at(0, [&](T* x)
@ -158,7 +157,7 @@ void reduce_enqueue(
// new ROCM 15 address space changes aren't implemented in std algorithms yet
auto * src = reinterpret_cast<char *>(x);
auto * dest = reinterpret_cast<char *>(result.data()+tile*output_length);
for(int i=0; i<sizeof(T);i++) dest[i] = src[i];
for(int i=0; i<sizeof(T)*output_length;i++) dest[i] = src[i];
#else
// Workaround: copy_if used to avoid memmove
std::copy_if(x, x+output_length, result.data()+tile*output_length, always_true{} );
@ -169,12 +168,10 @@ void reduce_enqueue(
#endif
}
});
if (output_result != nullptr)
ValueInit::init(ReducerConditional::select(f, reducer), output_result);
fut.wait();
copy(result,result_cpu.data());
if (output_result != nullptr) {
for(std::size_t i=0;i<td.num_tiles;i++)

View File

@ -62,6 +62,76 @@
namespace Kokkos {
namespace Impl {
//#if __KALMAR_ACCELERATOR__ == 1
KOKKOS_INLINE_FUNCTION
void __syncthreads() [[hc]]
{
amp_barrier(CLK_LOCAL_MEM_FENCE);
}
#define LT0 ((threadIdx_x+threadIdx_y+threadIdx_z)?0:1)
// returns non-zero if and only if the predicate is non-zero for any thread
// note that syncthreads_or uses the first 64 bits of dynamic group memory.
// this reserved memory must be accounted for everywhere
// that get_dynamic_group_segment_base_pointer is called.
KOKKOS_INLINE_FUNCTION
uint64_t __syncthreads_or(uint64_t pred)
{
uint64_t *shared_var = (uint64_t *)hc::get_dynamic_group_segment_base_pointer();
if(LT0) *shared_var = 0;
amp_barrier(CLK_LOCAL_MEM_FENCE);
#if __KALMAR_ACCELERATOR__ == 1
if (pred) hc::atomic_or_uint64(shared_var,1);
#endif
amp_barrier(CLK_LOCAL_MEM_FENCE);
return (*shared_var);
}
KOKKOS_INLINE_FUNCTION
void __threadfence()
{
amp_barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
}
KOKKOS_INLINE_FUNCTION
void __threadfence_block()
{
amp_barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
}
//#endif
struct ROCm_atomic_CAS {
template<class OP>
KOKKOS_INLINE_FUNCTION
unsigned long operator () (volatile unsigned long * dest, OP &&op){
unsigned long read,compare,val;
compare = *dest;
read = compare;
do {
compare = read;
val = op(compare);
#if __KALMAR_ACCELERATOR__ == 1
hc::atomic_compare_exchange((uint64_t *)dest,&read,val);
#endif
} while (read != compare);
return val;
}
};
template<class OP>
KOKKOS_INLINE_FUNCTION
unsigned long atomic_cas_op (volatile unsigned long * dest, OP &&op) {
ROCm_atomic_CAS cas_op;
return cas_op(dest, std::forward<OP>(op));
}
KOKKOS_INLINE_FUNCTION
unsigned long atomicInc (volatile unsigned long * dest, const unsigned long& val) {
return atomic_cas_op(dest, [=](unsigned long old){return ((old>=val)?0:(old+1));});
}
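atomicInc above follows CUDA-style wrap-around semantics; a scalar model of the value one CAS iteration computes:

unsigned long atomic_inc_model(unsigned long old_val, unsigned long limit) {
  return (old_val >= limit) ? 0 : old_val + 1;  // counts 0..limit, then wraps
}
// With limit = block_count - 1 (as used further below), blocks bump a shared
// counter as they finish, and the last block to arrive detects that it should
// perform the final reduction pass.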
//----------------------------------------------------------------------------
template< typename T >
@ -375,18 +445,7 @@ bool rocm_inter_block_reduction( ROCmTeamMember& team,
#endif
}
#endif
#if 0
//----------------------------------------------------------------------------
// See section B.17 of ROCm C Programming Guide Version 3.2
// for discussion of
// __launch_bounds__(maxThreadsPerBlock,minBlocksPerMultiprocessor)
// function qualifier which could be used to improve performance.
//----------------------------------------------------------------------------
// Maximize shared memory and minimize L1 cache:
// rocmFuncSetCacheConfig(MyKernel, rocmFuncCachePreferShared );
// For 2.0 capability: 48 KB shared and 16 KB L1
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
/*
* Algorithmic constraints:
@ -406,87 +465,105 @@ void rocm_intra_block_reduce_scan( const FunctorType & functor ,
typedef typename ValueTraits::pointer_type pointer_type ;
const unsigned value_count = ValueTraits::value_count( functor );
const unsigned BlockSizeMask = team.team_size() - 1 ;
const unsigned BlockSizeMask = blockDim_y - 1 ;
// Must have power of two thread count
if ( BlockSizeMask & team.team_size() ) { Kokkos::abort("ROCm::rocm_intra_block_scan requires power-of-two blockDim"); }
if ( BlockSizeMask & blockDim_y ) { Kokkos::abort("ROCm::rocm_intra_block_scan requires power-of-two blockDim"); }
#define BLOCK_REDUCE_STEP( R , TD , S ) \
if ( ! ( R & ((1<<(S+1))-1) ) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S)) ); }
if ( ! (( R & ((1<<(S+1))-1) )|(blockDim_y<(1<<(S+1)))) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S)) ); }
#define BLOCK_SCAN_STEP( TD , N , S ) \
if ( N == (1<<S) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S))); }
#define KOKKOS_IMPL_ROCM_SYNCWF __threadfence_block()
const unsigned rtid_intra = team.team_rank() ^ BlockSizeMask ;
const pointer_type tdata_intra = base_data + value_count * team.team_rank() ;
const unsigned rtid_intra = threadIdx_y ^ BlockSizeMask ;
const pointer_type tdata_intra = base_data + value_count * threadIdx_y ;
{ // Intra-workgroup reduction:
{ // Intra-workgroup reduction: min blocksize of 64
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,0)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,1)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,2)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,3)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,4)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,5)
KOKKOS_IMPL_ROCM_SYNCWF;
}
team.team_barrier(); // Wait for all workgroups to reduce
__syncthreads(); // Wait for all workgroups to reduce
{ // Inter-workgroup reduce-scan by a single workgroup to avoid extra synchronizations
const unsigned rtid_inter = ( team.team_rank() ^ BlockSizeMask ) << ROCmTraits::WarpIndexShift ;
if(threadIdx_y < value_count) {
for(int i=blockDim_y-65; i>0; i-= 64)
ValueJoin::join( functor , base_data + (blockDim_y-1)*value_count + threadIdx_y , base_data + i*value_count + threadIdx_y );
}
__syncthreads();
#if 0
const unsigned rtid_inter = ( threadIdx_y ^ BlockSizeMask ) << ROCmTraits::WavefrontIndexShift ;
if ( rtid_inter < blockDim_y ) {
if ( rtid_inter < team.team_size() ) {
const pointer_type tdata_inter = base_data + value_count * ( rtid_inter ^ BlockSizeMask );
//
//
// For ROCm we start with a block size of 64, so the step-5 reduction is
// already done. The remaining steps run only if the block size is > 64, so
// they are left in place until the block size is tuned for performance,
// after which the steps that will never be used can be removed.
// if ( (1<<7) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,7) }
// if ( (1<<8) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,8) }
// if ( (1<<9) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,9) }
if ( (1<<5) < BlockSizeMask ) { BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,5) }
if ( (1<<6) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,6) }
if ( (1<<7) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,7) }
if ( (1<<8) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,8) }
if ( DoScan ) {
int n = ( rtid_inter & 32 ) ? 32 : (
( rtid_inter & 64 ) ? 64 : (
int n = ( rtid_inter & 64 ) ? 64 : (
( rtid_inter & 128 ) ? 128 : (
( rtid_inter & 256 ) ? 256 : 0 )));
( rtid_inter & 256 ) ? 256 : 0 ));
if ( ! ( rtid_inter + n < team.team_size() ) ) n = 0 ;
if ( ! ( rtid_inter + n < blockDim_y ) ) n = 0 ;
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,8)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,7)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,6)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,5)
// __threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,5)
}
}
#endif
}
team.team_barrier(); // Wait for inter-workgroup reduce-scan to complete
__syncthreads(); // Wait for inter-workgroup reduce-scan to complete
if ( DoScan ) {
int n = ( rtid_intra & 1 ) ? 1 : (
( rtid_intra & 2 ) ? 2 : (
( rtid_intra & 4 ) ? 4 : (
( rtid_intra & 8 ) ? 8 : (
( rtid_intra & 16 ) ? 16 : 0 ))));
( rtid_intra & 16 ) ? 16 : (
( rtid_intra & 32 ) ? 32 : 0 )))));
if ( ! ( rtid_intra + n < team.team_size() ) ) n = 0 ;
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
BLOCK_SCAN_STEP(tdata_intra,n,4) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,3) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,2) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,1) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,0) team.team_barrier();
#else
BLOCK_SCAN_STEP(tdata_intra,n,4) __threadfence_block();
if ( ! ( rtid_intra + n < blockDim_y ) ) n = 0 ;
// BLOCK_SCAN_STEP(tdata_intra,n,5) __threadfence_block();
// BLOCK_SCAN_STEP(tdata_intra,n,4) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,3) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,2) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,1) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,0) __threadfence_block();
#endif
}
#undef BLOCK_SCAN_STEP
#undef BLOCK_REDUCE_STEP
#undef KOKKOS_IMPL_ROCM_SYNCWF
}
//----------------------------------------------------------------------------
@ -497,16 +574,18 @@ void rocm_intra_block_reduce_scan( const FunctorType & functor ,
*
* Global reduce result is in the last threads' 'shared_data' location.
*/
using ROCM = Kokkos::Experimental::ROCm ;
template< bool DoScan , class FunctorType , class ArgTag >
KOKKOS_INLINE_FUNCTION
bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
const ROCm::size_type block_id ,
const ROCm::size_type block_count ,
ROCm::size_type * const shared_data ,
ROCm::size_type * const global_data ,
ROCm::size_type * const global_flags )
const ROCM::size_type block_id ,
const ROCM::size_type block_count ,
typename FunctorValueTraits<FunctorType, ArgTag>::value_type * const shared_data ,
typename FunctorValueTraits<FunctorType, ArgTag>::value_type * const global_data ,
ROCM::size_type * const global_flags )
{
typedef ROCm::size_type size_type ;
typedef ROCM::size_type size_type ;
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
@ -517,16 +596,17 @@ bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
typedef typename ValueTraits::value_type value_type ;
// '__ffs' = position of the least significant bit set to 1.
// 'team.team_size()' is guaranteed to be a power of two so this
// blockDim_y is guaranteed to be a power of two so this
// is the integral shift value that can replace an integral divide.
const unsigned BlockSizeShift = __ffs( team.team_size() ) - 1 ;
const unsigned BlockSizeMask = team.team_size() - 1 ;
// const unsigned long BlockSizeShift = __ffs( blockDim_y ) - 1 ;
const unsigned long BlockSizeShift = __lastbit_u32_u32( blockDim_y ) ;
const unsigned long BlockSizeMask = blockDim_y - 1 ;
// Must have power of two thread count
if ( BlockSizeMask & team.team_size() ) { Kokkos::abort("ROCm::rocm_single_inter_block_reduce_scan requires power-of-two blockDim"); }
if ( BlockSizeMask & blockDim_y ) { Kokkos::abort("ROCm::rocm_single_inter_block_reduce_scan requires power-of-two blockDim"); }
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( functor ) / sizeof(size_type) );
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(value_type) >
word_count( ValueTraits::value_size( functor )/ sizeof(value_type) );
// Reduce the accumulation for the entire block.
rocm_intra_block_reduce_scan<false,FunctorType,ArgTag>( functor , pointer_type(shared_data) );
@ -534,54 +614,47 @@ bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
{
// Write accumulation total to global scratch space.
// Accumulation total is the last thread's data.
size_type * const shared = shared_data + word_count.value * BlockSizeMask ;
size_type * const global = global_data + word_count.value * block_id ;
#if (__ROCM_ARCH__ < 500)
for ( size_type i = team.team_rank() ; i < word_count.value ; i += team.team_size() ) { global[i] = shared[i] ; }
#else
for ( size_type i = 0 ; i < word_count.value ; i += 1 ) { global[i] = shared[i] ; }
#endif
value_type * const shared = shared_data +
word_count.value * BlockSizeMask ;
value_type * const global = global_data + word_count.value * block_id ;
for ( int i = int(threadIdx_y) ; i < word_count.value ; i += blockDim_y ) { global[i] = shared[i] ; }
}
// Contributing blocks note that their contribution has been completed via an atomic-increment flag
// If this block is not the last block to contribute to this group then the block is done.
team.team_barrier();
const bool is_last_block =
! team.team_reduce( team.team_rank() ? 0 : ( 1 + atomicInc( global_flags , block_count - 1 ) < block_count ) ,Impl::JoinAdd<ValueType>());
const bool is_last_block =
! __syncthreads_or( threadIdx_y ? 0 : ( 1 + atomicInc( global_flags , block_count - 1 ) < block_count ) );
if ( is_last_block ) {
const size_type b = ( long(block_count) * long(team.team_rank()) ) >> BlockSizeShift ;
const size_type e = ( long(block_count) * long( team.team_rank() + 1 ) ) >> BlockSizeShift ;
const size_type b = ( long(block_count) * long(threadIdx_y )) >> BlockSizeShift ;
const size_type e = ( long(block_count) * long(threadIdx_y + 1 ) ) >> BlockSizeShift ;
{
void * const shared_ptr = shared_data + word_count.value * team.team_rank() ;
reference_type shared_value = ValueInit::init( functor , shared_ptr );
value_type * const shared_ptr = shared_data + word_count.value * threadIdx_y ;
ValueInit::init( functor , shared_ptr );
for ( size_type i = b ; i < e ; ++i ) {
ValueJoin::join( functor , shared_ptr , global_data + word_count.value * i );
}
}
rocm_intra_block_reduce_scan<DoScan,FunctorType,ArgTag>( functor , pointer_type(shared_data) );
if ( DoScan ) {
value_type * const shared_value = shared_data + word_count.value * ( threadIdx_y ? threadIdx_y - 1 : blockDim_y );
size_type * const shared_value = shared_data + word_count.value * ( team.team_rank() ? team.team_rank() - 1 : team.team_size() );
if ( ! team.team_rank() ) { ValueInit::init( functor , shared_value ); }
if ( ! threadIdx_y ) { ValueInit::init( functor , shared_value ); }
// Join previous inclusive scan value to each member
for ( size_type i = b ; i < e ; ++i ) {
size_type * const global_value = global_data + word_count.value * i ;
value_type * const global_value = global_data + word_count.value * i ;
ValueJoin::join( functor , shared_value , global_value );
ValueOps ::copy( functor , global_value , shared_value );
}
}
}
return is_last_block ;
}
@ -592,7 +665,6 @@ unsigned rocm_single_inter_block_reduce_scan_shmem( const FunctorType & functor
{
return ( BlockSize + 2 ) * Impl::FunctorValueTraits< FunctorType , ArgTag >::value_size( functor );
}
#endif
} // namespace Impl
} // namespace Kokkos

View File

@ -98,7 +98,7 @@ void scan_enqueue(
{
auto j = i + d - 1;
auto k = i + d2 - 1;
// join(k, j); // no longer needed with ROCm 1.6
ValueJoin::join(f, &buffer[k], &buffer[j]);
}
}
@ -116,7 +116,7 @@ void scan_enqueue(
auto j = i + d - 1;
auto k = i + d2 - 1;
auto t = buffer[k];
// join(k, j); // no longer needed with ROCm 1.6
ValueJoin::join(f, &buffer[k], &buffer[j]);
buffer[j] = t;
}
@ -127,17 +127,13 @@ void scan_enqueue(
}).wait();
copy(result,result_cpu.data());
// The std::partial_sum was segfaulting, even though this is CPU code.
// if(td.num_tiles>1)
// std::partial_sum(result_cpu.data(), result_cpu.data()+(td.num_tiles-1)*sizeof(value_type), result_cpu.data(), make_join_operator<ValueJoin>(f));
// use this implementation instead.
for(int i=1; i<td.num_tiles; i++)
ValueJoin::join(f, &result_cpu[i], &result_cpu[i-1]);
copy(result_cpu.data(),result);
hc::parallel_for_each(hc::extent<1>(len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
size_t launch_len = (((len - 1) / td.tile_size) + 1) * td.tile_size;
hc::parallel_for_each(hc::extent<1>(launch_len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
{
// const auto local = t_idx.local[0];
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];
@ -145,13 +141,115 @@ void scan_enqueue(
{
auto final_state = scratch[global];
// the join is locking up, at least with 1.6
if (tile != 0) final_state += result[tile-1];
// if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), final_state, true);
}
}).wait();
}
template< class Tag, class ReturnType, class F, class TransformIndex>
void scan_enqueue(
const int len,
const F & f,
ReturnType & return_val,
TransformIndex transform_index)
{
typedef Kokkos::Impl::FunctorValueTraits< F, Tag> ValueTraits;
typedef Kokkos::Impl::FunctorValueInit< F, Tag> ValueInit;
typedef Kokkos::Impl::FunctorValueJoin< F, Tag> ValueJoin;
typedef Kokkos::Impl::FunctorValueOps< F, Tag> ValueOps;
typedef typename ValueTraits::value_type value_type;
typedef typename ValueTraits::pointer_type pointer_type;
typedef typename ValueTraits::reference_type reference_type;
const auto td = get_tile_desc<value_type>(len);
std::vector<value_type> result_cpu(td.num_tiles);
hc::array<value_type> result(td.num_tiles);
hc::array<value_type> scratch(len);
std::vector<ReturnType> total_cpu(1);
hc::array<ReturnType> total(1);
tile_for<value_type>(td, [&,f,len,td](hc::tiled_index<1> t_idx, tile_buffer<value_type> buffer) [[hc]]
{
const auto local = t_idx.local[0];
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];
// Join tile buffer elements
const auto join = [&](std::size_t i, std::size_t j)
{
buffer.action_at(i, j, [&](value_type& x, const value_type& y)
{
ValueJoin::join(f, &x, &y);
});
};
// Copy into tile
buffer.action_at(local, [&](value_type& state)
{
ValueInit::init(f, &state);
if (global < len) rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), state, false);
});
t_idx.barrier.wait();
// Up sweep phase
for(std::size_t d=1;d<buffer.size();d*=2)
{
auto d2 = 2*d;
auto i = local*d2;
if(i<len)
{
auto j = i + d - 1;
auto k = i + d2 - 1;
ValueJoin::join(f, &buffer[k], &buffer[j]);
}
}
t_idx.barrier.wait();
result[tile] = buffer[buffer.size()-1];
buffer[buffer.size()-1] = 0;
// Down sweep phase
for(std::size_t d=buffer.size()/2;d>0;d/=2)
{
auto d2 = 2*d;
auto i = local*d2;
if(i<len)
{
auto j = i + d - 1;
auto k = i + d2 - 1;
auto t = buffer[k];
ValueJoin::join(f, &buffer[k], &buffer[j]);
buffer[j] = t;
}
t_idx.barrier.wait();
}
// Copy tiles into global memory
if (global < len) scratch[global] = buffer[local];
}).wait();
copy(result,result_cpu.data());
for(int i=1; i<td.num_tiles; i++)
ValueJoin::join(f, &result_cpu[i], &result_cpu[i-1]);
copy(result_cpu.data(),result);
size_t launch_len = (((len - 1) / td.tile_size) + 1) * td.tile_size;
hc::parallel_for_each(hc::extent<1>(launch_len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
{
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];
if (global < len)
{
auto final_state = scratch[global];
if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), final_state, true);
if(global==(len-1)) total[0] = final_state;
}
}).wait();
copy(total,total_cpu.data());
return_val = total_cpu[0];
}
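A host model of the per-tile up-sweep/down-sweep (Blelloch) exclusive scan above, with addition as the join and a power-of-two n; a sketch only:

#include <cstddef>

void blelloch_exclusive_scan(double* buf, std::size_t n, double& total) {
  for (std::size_t d = 1; d < n; d *= 2)          // up-sweep: partial sums
    for (std::size_t i = 0; i + 2*d <= n; i += 2*d)
      buf[i + 2*d - 1] += buf[i + d - 1];
  total = buf[n - 1];                             // grand total of the tile
  buf[n - 1] = 0;                                 // identity seeds the root
  for (std::size_t d = n / 2; d > 0; d /= 2)      // down-sweep
    for (std::size_t i = 0; i + 2*d <= n; i += 2*d) {
      const double t = buf[i + 2*d - 1];          // old parent
      buf[i + 2*d - 1] += buf[i + d - 1];         // parent += left child
      buf[i + d - 1] = t;                         // left child = old parent
    }
  // buf now holds the exclusive prefix sums; `total` matches the code above,
  // which captures buffer[size-1] before zeroing it.
}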
} // namespace Impl
} // namespace Kokkos

View File

@ -362,6 +362,8 @@ SharedAllocationRecord( const Kokkos::Experimental::ROCmSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set the last element to zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
// Copy to device memory
Kokkos::Impl::DeepCopy<Kokkos::Experimental::ROCmSpace,HostSpace>( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
@ -399,6 +401,8 @@ SharedAllocationRecord( const Kokkos::Experimental::ROCmHostPinnedSpace & arg_sp
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set the last element to zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}
//----------------------------------------------------------------------------

View File

@ -278,7 +278,7 @@ struct single_action
void action_at(std::size_t i, Action a) [[hc]]
{
auto& value = static_cast<Derived&>(*this)[i];
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
T state = value;
a(state);
value = state;
@ -347,7 +347,7 @@ struct tile_buffer<T[]>
#if defined (ROCM15)
a(value);
#else
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
if (m > get_max_tile_array_size()) return;
T state[get_max_tile_array_size()];
// std::copy(value, value+m, state);
@ -372,7 +372,6 @@ struct tile_buffer<T[]>
#if defined (ROCM15)
a(value);
#else
//#if KOKKOS_ROCM_HAS_WORKAROUNDS
if (m > get_max_tile_array_size()) return;
T state[get_max_tile_array_size()];
// std::copy(value, value+m, state);

View File

@ -175,6 +175,27 @@ public:
#endif
}
template<class Closure, class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast(Closure const & f, ValueType& value, const int& thread_id) const
{
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ }
#else
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(ValueType) < TEAM_REDUCE_SIZE
, ValueType , void >::type type ;
f( value );
if ( m_team_base ) {
type * const local_value = ((type*) m_team_base[0]->scratch_memory());
if(team_rank() == thread_id) *local_value = value;
memory_fence();
team_barrier();
value = *local_value;
}
#endif
}
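A hedged sketch of the closure form of team_broadcast added here; the league size and the lambda are illustrative:

void broadcast_example(const int league) {
  using member_type = Kokkos::TeamPolicy<>::member_type;
  Kokkos::parallel_for(
      Kokkos::TeamPolicy<>(league, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type& team) {
        int v = team.league_rank();
        // Apply the closure to v, then share thread 0's copy with the team:
        team.team_broadcast([](int& x) { x += 1; }, v, 0);
        // every thread in the team now sees the same v
      });
}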
template< typename Type >
KOKKOS_INLINE_FUNCTION
typename std::enable_if< !Kokkos::is_reducer< Type >::value , Type>::type
@ -626,9 +647,32 @@ public:
//----------------------------------------
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & ) {
int pool_size = traits::execution_space::thread_pool_size(1);
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template< class FunctorType >
inline static
int team_size_recommended( const FunctorType & )
{
return traits::execution_space::thread_pool_size(2);
}
template< class FunctorType >
inline static
int team_size_recommended( const FunctorType &, const int& )
{
return traits::execution_space::thread_pool_size(2);
}
#endif
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
@ -637,11 +681,26 @@ public:
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
static int team_size_recommended( const FunctorType & )
{
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
@ -650,15 +709,15 @@ public:
}
template< class FunctorType >
inline static
int team_size_recommended( const FunctorType &, const int& )
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
int vector_length_max()
{ return 1024; } // Arbitrarily large number; meant as a vectorizable length
inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32: // Roughly L1 size
20*1024*1024); // Limit to keep compatibility with CUDA
}
//----------------------------------------

Some files were not shown because too many files have changed in this diff.