Update Kokkos library in LAMMPS to v2.7.24
@@ -1,5 +1,68 @@
# Change Log

## [2.7.24](https://github.com/kokkos/kokkos/tree/2.7.24) (2018-11-04)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.7.00...2.7.24)

**Implemented enhancements:**

- DualView: Add non-templated functions for sync, need\_sync, view, modify \(see the sketch after this list\) [\#1858](https://github.com/kokkos/kokkos/issues/1858)
- DualView: Avoid needlessly allocating and initializing the modify\_host and modify\_device flag views [\#1831](https://github.com/kokkos/kokkos/issues/1831)
- DualView: Incorrect deduction of "not device type" [\#1659](https://github.com/kokkos/kokkos/issues/1659)
- BuildSystem: Add KOKKOS\_ENABLE\_CXX14 and KOKKOS\_ENABLE\_CXX17 [\#1602](https://github.com/kokkos/kokkos/issues/1602)
- BuildSystem: Installed kokkos\_generated\_settings.cmake contains build directories instead of install directories [\#1838](https://github.com/kokkos/kokkos/issues/1838)
- BuildSystem: KOKKOS\_ARCH: add ticks to printout of improper arch setting [\#1649](https://github.com/kokkos/kokkos/issues/1649)
- BuildSystem: Make core/src/Makefile for Cuda use needed nvcc\_wrapper [\#1296](https://github.com/kokkos/kokkos/issues/1296)
- Build: Support PGI as host compiler for NVCC [\#1828](https://github.com/kokkos/kokkos/issues/1828)
- Build: Many warnings fixed, e.g. [\#1786](https://github.com/kokkos/kokkos/issues/1786)
- Capability: OffsetView with non-zero begin index [\#567](https://github.com/kokkos/kokkos/issues/567)
- Capability: Reductions into device side view \(see the sketch after this list\) [\#1788](https://github.com/kokkos/kokkos/issues/1788)
- Capability: Add max\_size to Kokkos::Array [\#1760](https://github.com/kokkos/kokkos/issues/1760)
- Capability: View Assignment: LayoutStride -\> LayoutLeft and LayoutStride -\> LayoutRight [\#1594](https://github.com/kokkos/kokkos/issues/1594)
- Capability: Atomic functions allow implicit conversion of the update argument [\#1571](https://github.com/kokkos/kokkos/issues/1571)
- Capability: Add team\_size\_max with tagged functors [\#663](https://github.com/kokkos/kokkos/issues/663)
- Capability: Fix alignment of views from Kokkos\_ScratchSpace, which should use a different alignment [\#1700](https://github.com/kokkos/kokkos/issues/1700)
- Capability: create\_mirror\_view\_and\_copy for DynRankView [\#1651](https://github.com/kokkos/kokkos/issues/1651)
- Capability: DeepCopy HBWSpace / HostSpace [\#548](https://github.com/kokkos/kokkos/issues/548)
- ROCm: support team vector scan [\#1645](https://github.com/kokkos/kokkos/issues/1645)
- ROCm: Merge from rocm-hackathon2 [\#1636](https://github.com/kokkos/kokkos/issues/1636)
- ROCm: Add ParallelScanWithTotal [\#1611](https://github.com/kokkos/kokkos/issues/1611)
- ROCm: Implement MDRange in ROCm [\#1314](https://github.com/kokkos/kokkos/issues/1314)
- ROCm: Implement Reducers for Nested Parallelism Levels [\#963](https://github.com/kokkos/kokkos/issues/963)
- ROCm: Add asynchronous deep copy [\#959](https://github.com/kokkos/kokkos/issues/959)
- Tests: Memory pool test seems to allocate 8GB [\#1830](https://github.com/kokkos/kokkos/issues/1830)
- Tests: Add unit\_test for team\_broadcast [\#734](https://github.com/kokkos/kokkos/issues/734)

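As a quick illustration of two of the enhancements above (the non-templated DualView interface and reductions into a device-side view), here is a minimal sketch; the view names, extents, and the use of the default execution space are assumptions made for this example, not part of the release notes.

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Sketch only: fill on the host, sync to the device, then reduce into a
// device-resident rank-0 View instead of a host scalar.
void dualview_and_device_reduce(const int n) {
  Kokkos::DualView<double*> dv("dv", n);   // hypothetical data

  auto h = dv.view_host();                 // non-templated accessors
  for (int i = 0; i < n; ++i) h(i) = 1.0;
  dv.modify_host();                        // replaces the templated modify<>()
  dv.sync_device();                        // copies only if need_sync_device()

  // Rank-0 device-side View as the reduction target.
  Kokkos::View<double, Kokkos::DefaultExecutionSpace::memory_space> sum("sum");
  auto d = dv.view_device();
  Kokkos::parallel_reduce(
      "sum", n,
      KOKKOS_LAMBDA(const int i, double& lsum) { lsum += d(i); }, sum);
}
```
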
**Fixed bugs:**

- BuildSystem: Makefile.kokkos gets gcc-toolchain wrong if gcc is cached [\#1841](https://github.com/kokkos/kokkos/issues/1841)
- BuildSystem: kokkos\_generated\_settings.cmake placement is inconsistent [\#1771](https://github.com/kokkos/kokkos/issues/1771)
- BuildSystem: Invalid escape sequence \. in kokkos\_functions.cmake [\#1661](https://github.com/kokkos/kokkos/issues/1661)
- BuildSystem: Problem in Kokkos generated cmake file [\#1770](https://github.com/kokkos/kokkos/issues/1770)
- BuildSystem: invalid file names on windows [\#1671](https://github.com/kokkos/kokkos/issues/1671)
- Tests: reducers min/max\_loc test fails randomly due to multiple min values and thus multiple valid locations [\#1681](https://github.com/kokkos/kokkos/issues/1681)
- Tests: cuda.scatterview unit test causes "Bus error" when force\_uvm and enable\_lambda are enabled [\#1852](https://github.com/kokkos/kokkos/issues/1852)
- Tests: cuda.cxx11 unit test fails when force\_uvm and enable\_lambda are enabled [\#1850](https://github.com/kokkos/kokkos/issues/1850)
- Tests: threads.reduce\_device\_view\_range\_policy failing with Cuda/8.0.44 and RDC [\#1836](https://github.com/kokkos/kokkos/issues/1836)
- Build: compile error when compiling Kokkos with hwloc 2.0.1 \(on OSX 10.12.6, with g++ 7.2.0\) [\#1506](https://github.com/kokkos/kokkos/issues/1506)
- Build: dual\_view.view broken with UVM [\#1834](https://github.com/kokkos/kokkos/issues/1834)
- Build: White \(cuda/9.2 + gcc/7.2\) warnings triggering errors [\#1833](https://github.com/kokkos/kokkos/issues/1833)
- Build: warning: enum constant in boolean context [\#1813](https://github.com/kokkos/kokkos/issues/1813)
- Capability: Fix overly conservative max\_team\_size computation [\#1808](https://github.com/kokkos/kokkos/issues/1808)
- DynRankView: Ctors taking ViewAllocateWithoutInitializing broken \(see the sketch after this list\) [\#1783](https://github.com/kokkos/kokkos/issues/1783)
- Cuda: Apollo cuda.team\_broadcast test fail with clang-6.0 [\#1762](https://github.com/kokkos/kokkos/issues/1762)
- Cuda: Clang spurious test failure in impl\_view\_accessible [\#1753](https://github.com/kokkos/kokkos/issues/1753)
- Cuda: Kokkos::complex\<double\> atomic deadlocks with Clang 6 Cuda build with -O0 [\#1752](https://github.com/kokkos/kokkos/issues/1752)
- Cuda: LayoutStride Test fails for UVM as default memory space [\#1688](https://github.com/kokkos/kokkos/issues/1688)
- Cuda: Scan wrong values on Volta [\#1676](https://github.com/kokkos/kokkos/issues/1676)
- Cuda: Kokkos::deep\_copy error with CudaUVM and Kokkos::Serial spaces [\#1652](https://github.com/kokkos/kokkos/issues/1652)
- Cuda: cudaErrorInvalidConfiguration with debug build [\#1647](https://github.com/kokkos/kokkos/issues/1647)
- Cuda: parallel\_for with TeamPolicy::team\_size\_recommended with launch bounds not working -- reported by Daniel Holladay [\#1283](https://github.com/kokkos/kokkos/issues/1283)
- Cuda: Using KOKKOS\_CLASS\_LAMBDA in a class with Kokkos::Random\_XorShift64\_Pool member data [\#1696](https://github.com/kokkos/kokkos/issues/1696)
- Long Build Times on Darwin [\#1721](https://github.com/kokkos/kokkos/issues/1721)
- Capability: Typo in Kokkos\_Sort.hpp - BinOp3D - wrong comparison [\#1720](https://github.com/kokkos/kokkos/issues/1720)
- Buffer overflow in SharedAllocationRecord in Kokkos\_HostSpace.cpp [\#1673](https://github.com/kokkos/kokkos/issues/1673)
- Serial unit test failure [\#1632](https://github.com/kokkos/kokkos/issues/1632)

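The ViewAllocateWithoutInitializing fix above concerns the same allocation wrapper the Kokkos_Sort.hpp hunks later in this commit switch to. A minimal sketch of its use (names and extents are illustrative only):

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DynRankView.hpp>

// Allocate without the default zero-initialization pass; the contents
// are undefined until written.
void allocate_uninitialized(const int n) {
  Kokkos::View<double*> v(Kokkos::ViewAllocateWithoutInitializing("v"), n);
  Kokkos::DynRankView<double> d(Kokkos::ViewAllocateWithoutInitializing("d"), n, n);
  Kokkos::deep_copy(v, 0.0);  // fill explicitly when needed
}
```
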
## [2.7.00](https://github.com/kokkos/kokkos/tree/2.7.00) (2018-05-24)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.6.00...2.7.00)


@@ -11,7 +11,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)

# Define Project Name if this is a standalone build
IF(NOT DEFINED ${PROJECT_NAME})
  project(Kokkos CXX)
ENDIF()

# Basic initialization (Used in KOKKOS_SETTINGS)
@@ -22,7 +22,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)
include(${KOKKOS_SRC_PATH}/cmake/kokkos_functions.cmake)
set_kokkos_cxx_compiler()
set_kokkos_cxx_standard()


#------------ GET OPTIONS AND KOKKOS_SETTINGS --------------------------------
# Add Kokkos' modules to CMake's module path.
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${Kokkos_SOURCE_DIR}/cmake/Modules/")
@@ -34,7 +34,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)

#------------ GENERATE HEADER AND SOURCE FILES -------------------------------
execute_process(
COMMAND ${KOKKOS_SETTINGS} make -f ${KOKKOS_SRC_PATH}/cmake/Makefile.generate_cmake_settings CXX=${CMAKE_CXX_COMPILER} generate_build_settings
COMMAND ${KOKKOS_SETTINGS} make -f ${KOKKOS_SRC_PATH}/cmake/Makefile.generate_cmake_settings CXX=${CMAKE_CXX_COMPILER} PREFIX=${CMAKE_INSTALL_PREFIX} generate_build_settings
WORKING_DIRECTORY "${Kokkos_BINARY_DIR}"
OUTPUT_FILE ${Kokkos_BINARY_DIR}/core_src_make.out
RESULT_VARIABLE GEN_SETTINGS_RESULT
@@ -45,6 +45,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)
endif()
include(${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake)
install(FILES ${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake DESTINATION lib/cmake/Kokkos)
install(FILES ${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake DESTINATION ${CMAKE_INSTALL_PREFIX})
string(REPLACE " " ";" KOKKOS_TPL_INCLUDE_DIRS "${KOKKOS_GMAKE_TPL_INCLUDE_DIRS}")
string(REPLACE " " ";" KOKKOS_TPL_LIBRARY_DIRS "${KOKKOS_GMAKE_TPL_LIBRARY_DIRS}")
string(REPLACE " " ";" KOKKOS_TPL_LIBRARY_NAMES "${KOKKOS_GMAKE_TPL_LIBRARY_NAMES}")

@@ -1,14 +1,8 @@
# Default settings common options.

#LAMMPS specific settings:
ifndef KOKKOS_PATH
KOKKOS_PATH=../../lib/kokkos
endif
CXXFLAGS=$(CCFLAGS)

# Options: Cuda,ROCm,OpenMP,Pthreads,Qthreads,Serial
KOKKOS_DEVICES ?= "OpenMP"
#KOKKOS_DEVICES ?= "Pthreads"
# Options: Cuda,ROCm,OpenMP,Pthread,Qthreads,Serial
#KOKKOS_DEVICES ?= "OpenMP"
KOKKOS_DEVICES ?= "Pthread"
# Options:
# Intel: KNC,KNL,SNB,HSW,BDW,SKX
# NVIDIA: Kepler,Kepler30,Kepler32,Kepler35,Kepler37,Maxwell,Maxwell50,Maxwell52,Maxwell53,Pascal60,Pascal61,Volta70,Volta72
@@ -21,16 +15,17 @@ KOKKOS_ARCH ?= ""
KOKKOS_DEBUG ?= "no"
# Options: hwloc,librt,experimental_memkind
KOKKOS_USE_TPLS ?= ""
# Options: c++11,c++1z
# Options: c++11,c++14,c++1y,c++17,c++1z,c++2a
KOKKOS_CXX_STANDARD ?= "c++11"
# Options: aggressive_vectorization,disable_profiling,disable_deprecated_code,enable_large_mem_tests
KOKKOS_OPTIONS ?= ""
# Option for setting ETI path
KOKKOS_ETI_PATH ?= ${KOKKOS_PATH}/core/src/eti
KOKKOS_CMAKE ?= "no"

# Default settings specific options.
# Options: force_uvm,use_ldg,rdc,enable_lambda
KOKKOS_CUDA_OPTIONS ?= "enable_lambda"
KOKKOS_CUDA_OPTIONS ?= ""

# Return a 1 if a string contains a substring and 0 if not
# Note the search string should be without '"'
@@ -41,7 +36,11 @@ kokkos_has_string=$(if $(findstring $2,$1),1,0)
# Check for general settings.
KOKKOS_INTERNAL_ENABLE_DEBUG := $(call kokkos_has_string,$(KOKKOS_DEBUG),yes)
KOKKOS_INTERNAL_ENABLE_CXX11 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++11)
KOKKOS_INTERNAL_ENABLE_CXX14 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++14)
KOKKOS_INTERNAL_ENABLE_CXX1Y := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1y)
KOKKOS_INTERNAL_ENABLE_CXX17 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++17)
KOKKOS_INTERNAL_ENABLE_CXX1Z := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1z)
KOKKOS_INTERNAL_ENABLE_CXX2A := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++2a)

# Check for external libraries.
KOKKOS_INTERNAL_USE_HWLOC := $(call kokkos_has_string,$(KOKKOS_USE_TPLS),hwloc)
@@ -110,6 +109,18 @@ KOKKOS_INTERNAL_COMPILER_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_VE
KOKKOS_INTERNAL_COMPILER_APPLE_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),apple-darwin)
KOKKOS_INTERNAL_COMPILER_HCC := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),HCC)

# Check Host Compiler if using NVCC through nvcc_wrapper
ifeq ($(KOKKOS_INTERNAL_COMPILER_NVCC), 1)
  KOKKOS_INTERNAL_COMPILER_NVCC_WRAPPER := $(strip $(shell echo $(CXX) | grep nvcc_wrapper | wc -l))
  ifeq ($(KOKKOS_INTERNAL_COMPILER_NVCC_WRAPPER), 1)

    KOKKOS_CXX_HOST_VERSION := $(strip $(shell $(CXX) $(CXXFLAGS) --host-version 2>&1))
    KOKKOS_INTERNAL_COMPILER_PGI := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),PGI)
    KOKKOS_INTERNAL_COMPILER_INTEL := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),Intel Corporation)
    KOKKOS_INTERNAL_COMPILER_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),clang)
  endif
endif

ifeq ($(KOKKOS_INTERNAL_COMPILER_CLANG), 2)
  KOKKOS_INTERNAL_COMPILER_CLANG = 1
endif
@@ -202,18 +213,34 @@ endif
# Set C++11 flags.
ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
  KOKKOS_INTERNAL_CXX11_FLAG := --c++11
  KOKKOS_INTERNAL_CXX14_FLAG := --c++14
  #KOKKOS_INTERNAL_CXX17_FLAG := --c++17
else
  ifeq ($(KOKKOS_INTERNAL_COMPILER_XL), 1)
    KOKKOS_INTERNAL_CXX11_FLAG := -std=c++11
    #KOKKOS_INTERNAL_CXX14_FLAG := -std=c++14
    KOKKOS_INTERNAL_CXX1Y_FLAG := -std=c++1y
    #KOKKOS_INTERNAL_CXX17_FLAG := -std=c++17
    #KOKKOS_INTERNAL_CXX1Z_FLAG := -std=c++1Z
    #KOKKOS_INTERNAL_CXX2A_FLAG := -std=c++2a
  else
    ifeq ($(KOKKOS_INTERNAL_COMPILER_CRAY), 1)
      KOKKOS_INTERNAL_CXX11_FLAG := -hstd=c++11
      KOKKOS_INTERNAL_CXX14_FLAG := -hstd=c++14
      #KOKKOS_INTERNAL_CXX1Y_FLAG := -hstd=c++1y
      #KOKKOS_INTERNAL_CXX17_FLAG := -hstd=c++17
      #KOKKOS_INTERNAL_CXX1Z_FLAG := -hstd=c++1z
      #KOKKOS_INTERNAL_CXX2A_FLAG := -hstd=c++2a
    else
      ifeq ($(KOKKOS_INTERNAL_COMPILER_HCC), 1)
        KOKKOS_INTERNAL_CXX11_FLAG :=
      else
        KOKKOS_INTERNAL_CXX11_FLAG := --std=c++11
        KOKKOS_INTERNAL_CXX14_FLAG := --std=c++14
        KOKKOS_INTERNAL_CXX1Y_FLAG := --std=c++1y
        KOKKOS_INTERNAL_CXX17_FLAG := --std=c++17
        KOKKOS_INTERNAL_CXX1Z_FLAG := --std=c++1z
        KOKKOS_INTERNAL_CXX2A_FLAG := --std=c++2a
      endif
    endif
  endif
@@ -336,7 +363,9 @@ endif

#CPPFLAGS is now unused
KOKKOS_CPPFLAGS =
KOKKOS_CXXFLAGS = -I./ -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src -I$(KOKKOS_ETI_PATH)
ifneq ($(KOKKOS_CMAKE), yes)
  KOKKOS_CXXFLAGS = -I./ -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src -I$(KOKKOS_ETI_PATH)
endif
KOKKOS_TPL_INCLUDE_DIRS =
KOKKOS_TPL_LIBRARY_DIRS =
KOKKOS_TPL_LIBRARY_NAMES =
@@ -347,9 +376,11 @@ endif

KOKKOS_LIBS = -ldl
KOKKOS_TPL_LIBRARY_NAMES += dl
KOKKOS_LDFLAGS = -L$(shell pwd)
# CXXLDFLAGS is used together with CXXFLAGS in a combined compile/link command
KOKKOS_CXXLDFLAGS = -L$(shell pwd)
ifneq ($(KOKKOS_CMAKE), yes)
  KOKKOS_LDFLAGS = -L$(shell pwd)
  # CXXLDFLAGS is used together with CXXFLAGS in a combined compile/link command
  KOKKOS_CXXLDFLAGS = -L$(shell pwd)
endif
KOKKOS_LINK_FLAGS =
KOKKOS_SRC =
KOKKOS_HEADERS =
@@ -377,10 +408,12 @@ tmp := $(call kokkos_append_header,"/* Execution Spaces */")

ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CUDA")
  tmp := $(call kokkos_append_header,"\#define KOKKOS_COMPILER_CUDA_VERSION $(KOKKOS_INTERNAL_COMPILER_NVCC_VERSION)")
endif

ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
  tmp := $(call kokkos_append_header,'\#define KOKKOS_ENABLE_ROCM')
  tmp := $(call kokkos_append_header,'\#define KOKKOS_IMPL_ROCM_CLANG_WORKAROUND 1')
endif

ifeq ($(KOKKOS_INTERNAL_USE_OPENMPTARGET), 1)
@@ -438,11 +471,25 @@ ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX11), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX11_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX11")
endif

ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX14), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX14_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX14")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX1Y), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX1Y_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX14")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX17), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX17_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX17")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX1Z), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX1Z_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX11")
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX1Z")
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX17")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX2A), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX2A_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX20")
endif

ifeq ($(KOKKOS_INTERNAL_ENABLE_DEBUG), 1)
@@ -465,7 +512,9 @@ endif

ifeq ($(KOKKOS_INTERNAL_USE_HWLOC), 1)
  ifneq ($(HWLOC_PATH),)
    KOKKOS_CXXFLAGS += -I$(HWLOC_PATH)/include
    ifneq ($(KOKKOS_CMAKE), yes)
      KOKKOS_CXXFLAGS += -I$(HWLOC_PATH)/include
    endif
    KOKKOS_LDFLAGS += -L$(HWLOC_PATH)/lib
    KOKKOS_CXXLDFLAGS += -L$(HWLOC_PATH)/lib
    KOKKOS_TPL_INCLUDE_DIRS += $(HWLOC_PATH)/include
@@ -484,7 +533,9 @@ endif

ifeq ($(KOKKOS_INTERNAL_USE_MEMKIND), 1)
  ifneq ($(MEMKIND_PATH),)
    KOKKOS_CXXFLAGS += -I$(MEMKIND_PATH)/include
    ifneq ($(KOKKOS_CMAKE), yes)
      KOKKOS_CXXFLAGS += -I$(MEMKIND_PATH)/include
    endif
    KOKKOS_LDFLAGS += -L$(MEMKIND_PATH)/lib
    KOKKOS_CXXLDFLAGS += -L$(MEMKIND_PATH)/lib
    KOKKOS_TPL_INCLUDE_DIRS += $(MEMKIND_PATH)/include
@@ -977,7 +1028,9 @@ ifeq ($(KOKKOS_INTERNAL_ENABLE_ETI), 1)
endif
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp)
ifneq ($(CUDA_PATH),)
  KOKKOS_CXXFLAGS += -I$(CUDA_PATH)/include
  ifneq ($(KOKKOS_CMAKE), yes)
    KOKKOS_CXXFLAGS += -I$(CUDA_PATH)/include
  endif
  KOKKOS_LDFLAGS += -L$(CUDA_PATH)/lib64
  KOKKOS_CXXLDFLAGS += -L$(CUDA_PATH)/lib64
  KOKKOS_TPL_INCLUDE_DIRS += $(CUDA_PATH)/include
@@ -1032,7 +1085,9 @@ ifeq ($(KOKKOS_INTERNAL_USE_QTHREADS), 1)
KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/Qthreads/*.cpp)
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Qthreads/*.hpp)
ifneq ($(QTHREADS_PATH),)
  KOKKOS_CXXFLAGS += -I$(QTHREADS_PATH)/include
  ifneq ($(KOKKOS_CMAKE), yes)
    KOKKOS_CXXFLAGS += -I$(QTHREADS_PATH)/include
  endif
  KOKKOS_LDFLAGS += -L$(QTHREADS_PATH)/lib
  KOKKOS_CXXLDFLAGS += -L$(QTHREADS_PATH)/lib
  KOKKOS_TPL_INCLUDE_DIRS += $(QTHREADS_PATH)/include

@@ -52,44 +52,47 @@ For specifics see the LICENSE file contained in the repository or distribution.
* GCC 4.8.4
* GCC 4.9.3
* GCC 5.1.0
* GCC 5.3.0
* GCC 5.5.0
* GCC 6.1.0
* GCC 7.2.0
* GCC 7.3.0
* GCC 8.1.0
* Intel 15.0.2
* Intel 16.0.1
* Intel 17.1.043
* Intel 17.0.1
* Intel 17.4.196
* Intel 18.0.128
* Intel 18.2.128
* Clang 3.6.1
* Clang 3.7.1
* Clang 3.8.1
* Clang 3.9.0
* Clang 4.0.0
* Clang 4.0.0 for CUDA (CUDA Toolkit 8.0.44)
* Clang 6.0.0 for CUDA (CUDA Toolkit 9.1)
* PGI 17.10
* NVCC 7.0 for CUDA (with gcc 4.8.4)
* Clang 6.0.0 for CUDA (CUDA Toolkit 9.0)
* Clang 7.0.0 for CUDA (CUDA Toolkit 9.1)
* PGI 18.7
* NVCC 7.5 for CUDA (with gcc 4.8.4)
* NVCC 8.0.44 for CUDA (with gcc 5.3.0)
* NVCC 9.1 for CUDA (with gcc 6.1.0)

### Primary tested compilers on Power 8 are:
* GCC 5.4.0 (OpenMP,Serial)
* IBM XL 13.1.6 (OpenMP, Serial)
* NVCC 8.0.44 for CUDA (with gcc 5.4.0)
* NVCC 9.0.103 for CUDA (with gcc 6.3.0 and XL 13.1.6)
* GCC 6.4.0 (OpenMP,Serial)
* GCC 7.2.0 (OpenMP,Serial)
* IBM XL 16.1.0 (OpenMP, Serial)
* NVCC 9.2.88 for CUDA (with gcc 7.2.0 and XL 16.1.0)

### Primary tested compilers on Intel KNL are:
* GCC 6.2.0
* Intel 16.4.258 (with gcc 4.7.2)
* Intel 17.2.174 (with gcc 4.9.3)
* Intel 18.0.128 (with gcc 4.9.3)
* Intel 18.2.199 (with gcc 4.9.3)

### Primary tested compilers on ARM
* GCC 6.1.0
### Primary tested compilers on ARM (Cavium ThunderX2)
* GCC 7.2.0
* ARM/Clang 18.4.0

### Other compilers working:
* X86:
    - Cygwin 2.1.0 64bit with gcc 4.9.3
    - GCC 8.1.0 (not warning free)

### Known non-working combinations:
* Power8:

@@ -697,6 +697,7 @@ namespace Kokkos {
    typedef Random_XorShift64<DeviceType> generator_type;
    typedef DeviceType device_type;

    KOKKOS_INLINE_FUNCTION
    Random_XorShift64_Pool() {
      num_states_ = 0;
    }
@@ -709,12 +710,14 @@ namespace Kokkos {
#endif
    }

    KOKKOS_INLINE_FUNCTION
    Random_XorShift64_Pool(const Random_XorShift64_Pool& src):
      locks_(src.locks_),
      state_(src.state_),
      num_states_(src.num_states_)
    {}

    KOKKOS_INLINE_FUNCTION
    Random_XorShift64_Pool operator = (const Random_XorShift64_Pool& src) {
      locks_ = src.locks_;
      state_ = src.state_;
@@ -958,6 +961,7 @@ namespace Kokkos {

    typedef DeviceType device_type;

    KOKKOS_INLINE_FUNCTION
    Random_XorShift1024_Pool() {
      num_states_ = 0;
    }
@@ -972,6 +976,7 @@ namespace Kokkos {
#endif
    }

    KOKKOS_INLINE_FUNCTION
    Random_XorShift1024_Pool(const Random_XorShift1024_Pool& src):
      locks_(src.locks_),
      state_(src.state_),
@@ -979,6 +984,7 @@ namespace Kokkos {
      num_states_(src.num_states_)
    {}

    KOKKOS_INLINE_FUNCTION
    Random_XorShift1024_Pool operator = (const Random_XorShift1024_Pool& src) {
      locks_ = src.locks_;
      state_ = src.state_;

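The device_type typedef and the device-callable copy constructors added to the two pools above are what let a pool be captured by value inside a lambda, or inside a class using KOKKOS_CLASS_LAMBDA (see issue #1696 in the changelog). A minimal usage sketch, with an arbitrary seed and a hypothetical output view:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_Random.hpp>

// Each thread checks a generator out of the pool, draws a value, and
// returns the generator so another thread may reuse its state.
void fill_random(Kokkos::View<uint64_t*> out) {
  Kokkos::Random_XorShift64_Pool<> pool(12345 /* seed, arbitrary */);
  Kokkos::parallel_for(
      "fill_random", out.extent(0), KOKKOS_LAMBDA(const int i) {
        auto gen = pool.get_state();
        out(i) = gen.urand64();
        pool.free_state(gen);
      });
}
```
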
@@ -246,8 +246,8 @@ public:
  {
    bin_count_atomic = Kokkos::View<int*, Space >("Kokkos::SortImpl::BinSortFunctor::bin_count",bin_op.max_bins());
    bin_count_const = bin_count_atomic;
    bin_offsets = offset_type("Kokkos::SortImpl::BinSortFunctor::bin_offsets",bin_op.max_bins());
    sort_order = offset_type("PermutationVector",range_end-range_begin);
    bin_offsets = offset_type(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::bin_offsets"),bin_op.max_bins());
    sort_order = offset_type(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sort_order"),range_end-range_begin);
  }

  BinSort( const_key_view_type keys_
@@ -290,7 +290,7 @@ public:

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
    scratch_view_type
      sorted_values("Scratch",
      sorted_values(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sorted_values"),
                    len,
                    values.extent(1),
                    values.extent(2),
@@ -301,7 +301,7 @@ public:
                    values.extent(7));
#else
    scratch_view_type
      sorted_values("Scratch",
      sorted_values(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sorted_values"),
                    values.rank_dynamic > 0 ? len : KOKKOS_IMPL_CTOR_DEFAULT_ARG,
                    values.rank_dynamic > 1 ? values.extent(1) : KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
                    values.rank_dynamic > 2 ? values.extent(2) : KOKKOS_IMPL_CTOR_DEFAULT_ARG,
@@ -483,7 +483,7 @@ struct BinOp3D {
    if (keys(i1,0)>keys(i2,0)) return true;
    else if (keys(i1,0)==keys(i2,0)) {
      if (keys(i1,1)>keys(i2,1)) return true;
      else if (keys(i1,1)==keys(i2,2)) {
      else if (keys(i1,1)==keys(i2,1)) {
        if (keys(i1,2)>keys(i2,2)) return true;
      }
    }

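The BinOp3D hunk above fixes a transposed index in the second tie-break: the old code compared component 1 of one key against component 2 of the other. Restated over plain arrays (a standalone rewrite for illustration, not Kokkos source), with a key pair the old test mis-ordered:

```cpp
// Strict lexicographic "greater" over 3-component keys, as the fixed
// code intends. With a = {0, 2, 9} and b = {0, 2, 5}, the old test
// evaluated a[1] == b[2] (2 == 5, false), so the tie in component 1
// was never broken by component 2 and a was wrongly not placed after b.
inline bool greater3(const double a[3], const double b[3]) {
  if (a[0] > b[0]) return true;
  if (a[0] == b[0]) {
    if (a[1] > b[1]) return true;
    if (a[1] == b[1]) return a[2] > b[2];
  }
  return false;
}
```
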
lib/kokkos/benchmarks/gups/Makefile (new file, 41 lines)
@@ -0,0 +1,41 @@
#Set your Kokkos path to something appropriate
KOKKOS_PATH = ${HOME}/git/kokkos-github-repo
KOKKOS_DEVICES = "Cuda"
KOKKOS_ARCH = "Pascal60"
KOKKOS_CUDA_OPTIONS = enable_lambda
#KOKKOS_DEVICES = "OpenMP"
#KOKKOS_ARCH = "Power8"

SRC = gups-kokkos.cc

default: build
	echo "Start Build"

CXXFLAGS = -O3
CXX = ${HOME}/git/kokkos-github-repo/bin/nvcc_wrapper
#CXX = g++

LINK = ${CXX}

LINKFLAGS =
EXE = gups-kokkos

DEPFLAGS = -M

OBJ = $(SRC:.cc=.o)
LIB =

include $(KOKKOS_PATH)/Makefile.kokkos

build: $(EXE)

$(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
	$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)

clean: kokkos-clean
	rm -f *.o $(EXE)

# Compilation rules

%.o:%.cc $(KOKKOS_CPP_DEPENDS)
	$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<

lib/kokkos/benchmarks/gups/gups-kokkos.cc (new file, 199 lines)
@@ -0,0 +1,199 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// ************************************************************************
//@HEADER
*/

#include "Kokkos_Core.hpp"
|
||||
#include <cstdio>
|
||||
#include <cstdlib>
|
||||
#include <cmath>
|
||||
|
||||
#include <sys/time.h>
|
||||
|
||||
#define HLINE "-------------------------------------------------------------\n"
|
||||
|
||||
#if defined(KOKKOS_ENABLE_CUDA)
|
||||
typedef Kokkos::View<int64_t*, Kokkos::CudaSpace>::HostMirror GUPSHostArray;
|
||||
typedef Kokkos::View<int64_t*, Kokkos::CudaSpace> GUPSDeviceArray;
|
||||
#else
|
||||
typedef Kokkos::View<int64_t*, Kokkos::HostSpace>::HostMirror GUPSHostArray;
|
||||
typedef Kokkos::View<int64_t*, Kokkos::HostSpace> GUPSDeviceArray;
|
||||
#endif
|
||||
|
||||
typedef int GUPSIndex;
|
||||
|
||||
double now() {
|
||||
struct timeval now;
|
||||
gettimeofday(&now, NULL);
|
||||
|
||||
return (double) now.tv_sec + ((double) now.tv_usec * 1.0e-6);
|
||||
}
|
||||
|
||||
void randomize_indices(GUPSHostArray& indices, GUPSDeviceArray& dev_indices, const int64_t dataCount) {
|
||||
for( GUPSIndex i = 0; i < indices.extent(0); ++i ) {
|
||||
indices[i] = lrand48() % dataCount;
|
||||
}
|
||||
|
||||
Kokkos::deep_copy(dev_indices, indices);
|
||||
}
|
||||
|
||||
void run_gups(GUPSDeviceArray& indices, GUPSDeviceArray& data, const int64_t datum,
|
||||
const bool performAtomics) {
|
||||
|
||||
if( performAtomics ) {
|
||||
Kokkos::parallel_for("bench-gups-atomic", indices.extent(0), KOKKOS_LAMBDA(const GUPSIndex i) {
|
||||
Kokkos::atomic_fetch_xor( &data[indices[i]], datum );
|
||||
});
|
||||
} else {
|
||||
Kokkos::parallel_for("bench-gups-non-atomic", indices.extent(0), KOKKOS_LAMBDA(const GUPSIndex i) {
|
||||
data[indices[i]] ^= datum;
|
||||
});
|
||||
}
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
int run_benchmark(const GUPSIndex indicesCount, const GUPSIndex dataCount, const int repeats,
                  const bool useAtomics) {

  printf("Reports fastest timing per kernel\n");
  printf("Creating Views...\n");

  printf("Memory Sizes:\n");
  printf("- Elements: %15" PRIu64 " (%12.4f MB)\n", static_cast<uint64_t>(dataCount),
         1.0e-6 * ((double) dataCount * (double) sizeof(int64_t)));
  printf("- Indices: %15" PRIu64 " (%12.4f MB)\n", static_cast<uint64_t>(indicesCount),
         1.0e-6 * ((double) indicesCount * (double) sizeof(int64_t)));
  printf(" - Atomics: %15s\n", (useAtomics ? "Yes" : "No") );
  printf("Benchmark kernels will be performed for %d iterations.\n", repeats);

  printf(HLINE);

  GUPSDeviceArray dev_indices("indices", indicesCount);
  GUPSDeviceArray dev_data("data", dataCount);
  int64_t datum = -1;

  GUPSHostArray indices = Kokkos::create_mirror_view(dev_indices);
  GUPSHostArray data = Kokkos::create_mirror_view(dev_data);

  double gupsTime = 0.0;

  printf("Initializing Views...\n");

#if defined(KOKKOS_HAVE_OPENMP)
  Kokkos::parallel_for("init-data", Kokkos::RangePolicy<Kokkos::OpenMP>(0, dataCount),
#else
  Kokkos::parallel_for("init-data", Kokkos::RangePolicy<Kokkos::Serial>(0, dataCount),
#endif
    KOKKOS_LAMBDA(const int i) {

    data[i] = 10101010101;
  });

#if defined(KOKKOS_HAVE_OPENMP)
  Kokkos::parallel_for("init-indices", Kokkos::RangePolicy<Kokkos::OpenMP>(0, indicesCount),
#else
  Kokkos::parallel_for("init-indices", Kokkos::RangePolicy<Kokkos::Serial>(0, indicesCount),
#endif
    KOKKOS_LAMBDA(const int i) {

    indices[i] = 0;
  });

  Kokkos::deep_copy(dev_data, data);
  Kokkos::deep_copy(dev_indices, indices);
  double start;

  printf("Starting benchmarking...\n");

  for( GUPSIndex k = 0; k < repeats; ++k ) {
    randomize_indices(indices, dev_indices, data.extent(0));

    start = now();
    run_gups(dev_indices, dev_data, datum, useAtomics);
    gupsTime += now() - start;
  }

  Kokkos::deep_copy(indices, dev_indices);
  Kokkos::deep_copy(data, dev_data);

  printf(HLINE);
  printf("GUP/s Random: %18.6f\n",
         (1.0e-9 * ((double) repeats) * (double) dev_indices.extent(0)) / gupsTime);
  printf(HLINE);

  return 0;
}

int main(int argc, char* argv[]) {

  printf(HLINE);
  printf("Kokkos GUPS Benchmark\n");
  printf(HLINE);

  srand48(1010101);

  Kokkos::initialize(argc, argv);

  int64_t indices = 8192;
  int64_t data = 33554432;
  int64_t repeats = 10;
  bool useAtomics = false;

  for( int i = 1; i < argc; ++i ) {
    if( strcmp( argv[i], "--indices" ) == 0 ) {
      indices = std::atoll(argv[i+1]);
      ++i;
    } else if( strcmp( argv[i], "--data" ) == 0 ) {
      data = std::atoll(argv[i+1]);
      ++i;
    } else if( strcmp( argv[i], "--repeats" ) == 0 ) {
      repeats = std::atoll(argv[i+1]);
      ++i;
    } else if( strcmp( argv[i], "--atomics" ) == 0 ) {
      useAtomics = true;
    }
  }

  const int rc = run_benchmark(indices, data, repeats, useAtomics);

  Kokkos::finalize();

  return rc;
}
lib/kokkos/benchmarks/stream/Makefile (new file, 41 lines)
@@ -0,0 +1,41 @@
#Set your Kokkos path to something appropriate
KOKKOS_PATH = ${HOME}/git/kokkos-github-repo
#KOKKOS_DEVICES = "Cuda"
#KOKKOS_ARCH = "Pascal60"
#KOKKOS_CUDA_OPTIONS = enable_lambda
KOKKOS_DEVICES = "OpenMP"
KOKKOS_ARCH = "Power8"

SRC = stream-kokkos.cc

default: build
	echo "Start Build"

CXXFLAGS = -O3
#CXX = ${HOME}/git/kokkos-github-repo/bin/nvcc_wrapper
CXX = g++

LINK = ${CXX}

LINKFLAGS =
EXE = stream-kokkos

DEPFLAGS = -M

OBJ = $(SRC:.cc=.o)
LIB =

include $(KOKKOS_PATH)/Makefile.kokkos

build: $(EXE)

$(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
	$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)

clean: kokkos-clean
	rm -f *.o $(EXE)

# Compilation rules

%.o:%.cc $(KOKKOS_CPP_DEPENDS)
	$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<

lib/kokkos/benchmarks/stream/stream-kokkos.cc (new file, 265 lines)
@@ -0,0 +1,265 @@
/* //@HEADER ... same Kokkos v. 2.0 (2014 Sandia Corporation) BSD license header as in gups-kokkos.cc above ... //@HEADER */

#include "Kokkos_Core.hpp"
|
||||
#include <cstdio>
|
||||
#include <cstdlib>
|
||||
#include <cmath>
|
||||
|
||||
#include <sys/time.h>
|
||||
|
||||
#define STREAM_ARRAY_SIZE 100000000
|
||||
#define STREAM_NTIMES 20
|
||||
|
||||
#define HLINE "-------------------------------------------------------------\n"
|
||||
|
||||
#if defined(KOKKOS_ENABLE_CUDA)
|
||||
typedef Kokkos::View<double*, Kokkos::CudaSpace>::HostMirror StreamHostArray;
|
||||
typedef Kokkos::View<double*, Kokkos::CudaSpace> StreamDeviceArray;
|
||||
#else
|
||||
typedef Kokkos::View<double*, Kokkos::HostSpace>::HostMirror StreamHostArray;
|
||||
typedef Kokkos::View<double*, Kokkos::HostSpace> StreamDeviceArray;
|
||||
#endif
|
||||
|
||||
typedef int StreamIndex;
|
||||
|
||||
double now() {
|
||||
struct timeval now;
|
||||
gettimeofday(&now, NULL);
|
||||
|
||||
return (double) now.tv_sec + ((double) now.tv_usec * 1.0e-6);
|
||||
}
|
||||
|
||||
void perform_copy(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c) {
|
||||
|
||||
Kokkos::parallel_for("copy", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
|
||||
c[i] = a[i];
|
||||
});
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
void perform_scale(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c,
|
||||
const double scalar) {
|
||||
|
||||
Kokkos::parallel_for("copy", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
|
||||
b[i] = scalar * c[i];
|
||||
});
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
void perform_add(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c) {
|
||||
Kokkos::parallel_for("add", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
|
||||
c[i] = a[i] + b[i];
|
||||
});
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
void perform_triad(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c,
|
||||
const double scalar) {
|
||||
|
||||
Kokkos::parallel_for("triad", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
|
||||
a[i] = b[i] + scalar * c[i];
|
||||
});
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
int perform_validation(StreamHostArray& a, StreamHostArray& b, StreamHostArray& c,
                       const StreamIndex arraySize, const double scalar) {

  double ai = 1.0;
  double bi = 2.0;
  double ci = 0.0;

  for( StreamIndex i = 0; i < arraySize; ++i ) {
    ci = ai;
    bi = scalar * ci;
    ci = ai + bi;
    ai = bi + scalar * ci;
  };

  double aError = 0.0;
  double bError = 0.0;
  double cError = 0.0;

  for( StreamIndex i = 0; i < arraySize; ++i ) {
    aError = std::abs( a[i] - ai );
    bError = std::abs( b[i] - bi );
    cError = std::abs( c[i] - ci );
  }

  double aAvgError = aError / (double) arraySize;
  double bAvgError = bError / (double) arraySize;
  double cAvgError = cError / (double) arraySize;

  const double epsilon = 1.0e-13;
  int errorCount = 0;

  if( std::abs( aAvgError / ai ) > epsilon ) {
    fprintf(stderr, "Error: validation check on View a failed.\n");
    errorCount++;
  }

  if( std::abs( bAvgError / bi ) > epsilon ) {
    fprintf(stderr, "Error: validation check on View b failed.\n");
    errorCount++;
  }

  if( std::abs( cAvgError / ci ) > epsilon ) {
    fprintf(stderr, "Error: validation check on View c failed.\n");
    errorCount++;
  }

  if( errorCount == 0 ) {
    printf("All solutions checked and verified.\n");
  }

  return errorCount;
}

int run_benchmark() {

  printf("Reports fastest timing per kernel\n");
  printf("Creating Views...\n");

  printf("Memory Sizes:\n");
  printf("- Array Size: %" PRIu64 "\n", static_cast<uint64_t>(STREAM_ARRAY_SIZE));
  printf("- Per Array: %12.2f MB\n", 1.0e-6 * (double) STREAM_ARRAY_SIZE * (double) sizeof(double));
  printf("- Total: %12.2f MB\n", 3.0e-6 * (double) STREAM_ARRAY_SIZE * (double) sizeof(double));

  printf("Benchmark kernels will be performed for %d iterations.\n", STREAM_NTIMES);

  printf(HLINE);

  StreamDeviceArray dev_a("a", STREAM_ARRAY_SIZE);
  StreamDeviceArray dev_b("b", STREAM_ARRAY_SIZE);
  StreamDeviceArray dev_c("c", STREAM_ARRAY_SIZE);

  StreamHostArray a = Kokkos::create_mirror_view(dev_a);
  StreamHostArray b = Kokkos::create_mirror_view(dev_b);
  StreamHostArray c = Kokkos::create_mirror_view(dev_c);

  const double scalar = 3.0;

  double copyTime = std::numeric_limits<double>::max();
  double scaleTime = std::numeric_limits<double>::max();
  double addTime = std::numeric_limits<double>::max();
  double triadTime = std::numeric_limits<double>::max();

  printf("Initializing Views...\n");

#if defined(KOKKOS_HAVE_OPENMP)
  Kokkos::parallel_for("init", Kokkos::RangePolicy<Kokkos::OpenMP>(0, STREAM_ARRAY_SIZE),
#else
  Kokkos::parallel_for("init", Kokkos::RangePolicy<Kokkos::Serial>(0, STREAM_ARRAY_SIZE),
#endif
    KOKKOS_LAMBDA(const int i) {

    a[i] = 1.0;
    b[i] = 2.0;
    c[i] = 0.0;
  });

  // Copy contents of a (from the host) to the dev_a (device)
  Kokkos::deep_copy(dev_a, a);
  Kokkos::deep_copy(dev_b, b);
  Kokkos::deep_copy(dev_c, c);

  double start;

  printf("Starting benchmarking...\n");

  for( StreamIndex k = 0; k < STREAM_NTIMES; ++k ) {
    start = now();
    perform_copy(dev_a, dev_b, dev_c);
    copyTime = std::min( copyTime, (now() - start) );

    start = now();
    perform_scale(dev_a, dev_b, dev_c, scalar);
    scaleTime = std::min( scaleTime, (now() - start) );

    start = now();
    perform_add(dev_a, dev_b, dev_c);
    addTime = std::min( addTime, (now() - start) );

    start = now();
    perform_triad(dev_a, dev_b, dev_c, scalar);
    triadTime = std::min( triadTime, (now() - start) );
  }

  Kokkos::deep_copy(a, dev_a);
  Kokkos::deep_copy(b, dev_b);
  Kokkos::deep_copy(c, dev_c);

  printf("Performing validation...\n");
  int rc = perform_validation(a, b, c, STREAM_ARRAY_SIZE, scalar);

  printf(HLINE);

  printf("Copy %11.2f MB/s\n",
         ( 1.0e-06 * 2.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / copyTime );
  printf("Scale %11.2f MB/s\n",
         ( 1.0e-06 * 2.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / scaleTime );
  printf("Add %11.2f MB/s\n",
         ( 1.0e-06 * 3.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / addTime );
  printf("Triad %11.2f MB/s\n",
         ( 1.0e-06 * 3.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / triadTime );

  printf(HLINE);

  return rc;
}

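The bandwidth figures printed above follow the STREAM convention: copy and scale move two arrays per element per iteration, while add and triad move three, and each kernel reports the fastest of its STREAM_NTIMES timings. For triad that works out to

```
MB/s = (3 * sizeof(double) * STREAM_ARRAY_SIZE) / (1.0e6 * triadTime)
```

where triadTime is the minimum measured kernel time in seconds.
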
int main(int argc, char* argv[]) {

  printf(HLINE);
  printf("Kokkos STREAM Benchmark\n");
  printf(HLINE);

  Kokkos::initialize(argc, argv);
  const int rc = run_benchmark();
  Kokkos::finalize();

  return rc;
}
@@ -125,18 +125,20 @@ function show_help {
echo " --openmp-ratio=N/D Ratio of the cpuset to use for OpenMP"
echo " Default: 1"
echo " --openmp-places=<Op> Op=threads|cores|sockets. Default: threads"
echo " --no-openmp-proc-bind Set OMP_PROC_BIND to false and unset OMP_PLACES"
echo " --force-openmp-num-threads=N"
echo " --openmp-num-threads=N"
echo " Override logic for selecting OMP_NUM_THREADS"
echo " --force-openmp-proc-bind=<OP>"
echo " --openmp-proc-bind=<OP>"
echo " Override logic for selecting OMP_PROC_BIND"
echo " --no-openmp-nested Set OMP_NESTED to false"
echo " --openmp-nested Set OMP_NESTED to true"
echo " --no-openmp-proc-bind Set OMP_PROC_BIND to false and unset OMP_PLACES"
echo " --output-prefix=<P> Save the output to files of the form"
echo " P.hpcbind.N, P.stdout.N and P.stderr.N where P is "
echo " the prefix and N is the rank (no spaces)"
echo " --output-mode=<Op> How console output should be handled."
echo " Options are all, rank0, and none. Default: rank0"
echo " --lstopo Show bindings in lstopo"
echo " --save-topology=<Xml> Save the topology to the given xml file"
echo " --load-topology=<Xml> Load a previously saved topology from an xml file"
echo " -v|--verbose Print bindings and relevant environment variables"
echo " -h|--help Show this message"
echo ""
@@ -189,7 +191,7 @@ HPCBIND_OPENMP_PLACES=${OMP_PLACES:-threads}
declare -i HPCBIND_OPENMP_PROC_BIND=1
HPCBIND_OPENMP_FORCE_NUM_THREADS=""
HPCBIND_OPENMP_FORCE_PROC_BIND=""
declare -i HPCBIND_OPENMP_NESTED=1
declare -i HPCBIND_OPENMP_NESTED=0
declare -i HPCBIND_VERBOSE=0

declare -i HPCBIND_LSTOPO=0
@@ -197,6 +199,9 @@ declare -i HPCBIND_LSTOPO=0
HPCBIND_OUTPUT_PREFIX=""
HPCBIND_OUTPUT_MODE="rank0"

HPCBIND_OUTPUT_TOPOLOGY=""
HPCBIND_INPUT_TOPOLOGY=""

declare -i HPCBIND_HAS_COMMAND=0

for i in "$@"; do
@@ -276,10 +281,22 @@ for i in "$@"; do
    HPCBIND_OPENMP_NESTED=0
    shift
    ;;
  --openmp-nested)
    HPCBIND_OPENMP_NESTED=1
    shift
    ;;
  --output-prefix=*)
    HPCBIND_OUTPUT_PREFIX="${i#*=}"
    shift
    ;;
  --save-topology=*)
    HPCBIND_OUTPUT_TOPOLOGY="${i#*=}"
    shift
    ;;
  --load-topology=*)
    HPCBIND_INPUT_TOPOLOGY="${i#*=}"
    shift
    ;;
  --output-mode=*)
    HPCBIND_OUTPUT_MODE="${i#*=}"
    #convert to lower case
@@ -327,24 +344,37 @@ elif [[ ${HPCBIND_QUEUE_RANK} -eq 0 ]]; then
  HPCBIND_TEE=1
fi

# Save the topology to the given xml file
if [[ "${HPCBIND_OUTPUT_TOPOLOGY}" != "" ]]; then
  if [[ ${HPCBIND_QUEUE_RANK} -eq 0 ]]; then
    lstopo-no-graphics "${HPCBIND_OUTPUT_TOPOLOGY}"
  else
    lstopo-no-graphics >/dev/null 2>&1
  fi
fi

# Load the topology from the given xml file
if [[ "${HPCBIND_INPUT_TOPOLOGY}" != "" ]]; then
  if [ -f ${HPCBIND_INPUT_TOPOLOGY} ]; then
    export HWLOC_XMLFILE="${HPCBIND_INPUT_TOPOLOGY}"
    export HWLOC_THISSYSTEM=1
  fi
fi

if [[ "${HPCBIND_OUTPUT_PREFIX}" == "" ]]; then
  HPCBIND_LOG=/dev/null
  HPCBIND_ERR=/dev/null
  HPCBIND_OUT=/dev/null
else
  if [[ ${HPCBIND_QUEUE_SIZE} -gt 0 ]]; then
    HPCBIND_STR_QUEUE_SIZE="${HPCBIND_QUEUE_SIZE}"
    HPCBIND_STR_QUEUE_RANK=$(printf %0*d ${#HPCBIND_STR_QUEUE_SIZE} ${HPCBIND_QUEUE_RANK})

    HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_STR_QUEUE_RANK}"
    HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_STR_QUEUE_RANK}"
    HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_STR_QUEUE_RANK}"
  else
    HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_QUEUE_RANK}"
    HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_QUEUE_RANK}"
    HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_QUEUE_RANK}"
  if [[ ${HPCBIND_QUEUE_SIZE} -le 0 ]]; then
    HPCBIND_QUEUE_SIZE=1
  fi
  HPCBIND_STR_QUEUE_SIZE="${HPCBIND_QUEUE_SIZE}"
  HPCBIND_STR_QUEUE_RANK=$(printf %0*d ${#HPCBIND_STR_QUEUE_SIZE} ${HPCBIND_QUEUE_RANK})

  HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_STR_QUEUE_RANK}"
  HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_STR_QUEUE_RANK}"
  HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_STR_QUEUE_RANK}"
  > ${HPCBIND_LOG}
fi

@@ -546,6 +576,8 @@ if [[ ${HPCBIND_TEE} -eq 0 || ${HPCBIND_VERBOSE} -eq 0 ]]; then
  hostname -s >> ${HPCBIND_LOG}
  echo "[HPCBIND]" >> ${HPCBIND_LOG}
  echo "${TMP_ENV}" | grep -E "^HPCBIND_" >> ${HPCBIND_LOG}
  echo "[HWLOC]" >> ${HPCBIND_LOG}
  echo "${TMP_ENV}" | grep -E "^HWLOC_" >> ${HPCBIND_LOG}
  echo "[CUDA]" >> ${HPCBIND_LOG}
  echo "${TMP_ENV}" | grep -E "^CUDA_" >> ${HPCBIND_LOG}
  echo "[OPENMP]" >> ${HPCBIND_LOG}
@@ -568,6 +600,8 @@ else
  hostname -s > >(tee -a ${HPCBIND_LOG})
  echo "[HPCBIND]" > >(tee -a ${HPCBIND_LOG})
  echo "${TMP_ENV}" | grep -E "^HPCBIND_" > >(tee -a ${HPCBIND_LOG})
  echo "[HWLOC]" > >(tee -a ${HPCBIND_LOG})
  echo "${TMP_ENV}" | grep -E "^HWLOC_" > >(tee -a ${HPCBIND_LOG})
  echo "[CUDA]" > >(tee -a ${HPCBIND_LOG})
  echo "${TMP_ENV}" | grep -E "^CUDA_" > >(tee -a ${HPCBIND_LOG})
  echo "[OPENMP]" > >(tee -a ${HPCBIND_LOG})

@@ -74,6 +74,9 @@ dry_run=0
host_only=0
host_only_args=""

# Just run version on host compiler
get_host_version=0

# Enable workaround for CUDA 6.5 for pragma ident
replace_pragma_ident=0

@@ -93,6 +96,9 @@ depfile_separate=0
depfile_output_arg=""
depfile_target_arg=""

# Option to remove duplicate libraries and object files
remove_duplicate_link_files=0

#echo "Arguments: $# $@"

while [ $# -gt 0 ]
@@ -106,10 +112,18 @@ do
  --host-only)
    host_only=1
    ;;
  #get the host version only
  --host-version)
    get_host_version=1
    ;;
  #replace '#pragma ident' with '#ident' this is needed to compile OpenMPI due to a configure script bug and a non standardized behaviour of pragma with macros
  --replace-pragma-ident)
    replace_pragma_ident=1
    ;;
  #remove duplicate link files
  --remove-duplicate-link-files)
    remove_duplicate_link_files=1
    ;;
  #handle source files to be compiled as cuda files
  *.cpp|*.cxx|*.cc|*.C|*.c++|*.cu)
    cpp_files="$cpp_files $1"
@@ -124,7 +138,12 @@ do
    fi
    ;;
  #Handle shared args (valid for both nvcc and the host compiler)
  -D*|-I*|-L*|-l*|-g|--help|--version|-E|-M|-shared)
  -D*)
    unescape_commas=`echo "$1" | sed -e 's/\\\,/,/g'`
    arg=`printf "%q" $unescape_commas`
    shared_args="$shared_args $arg"
    ;;
  -I*|-L*|-l*|-g|--help|--version|-E|-M|-shared|-w)
    shared_args="$shared_args $1"
    ;;
  #Handle compilation argument
@@ -152,7 +171,7 @@ do
    shift
    ;;
  #Handle known nvcc args
  -gencode*|--dryrun|--verbose|--keep|--keep-dir*|-G|--relocatable-device-code*|-lineinfo|-expt-extended-lambda|--resource-usage|-Xptxas*)
  --dryrun|--verbose|--keep|--keep-dir*|-G|--relocatable-device-code*|-lineinfo|-expt-extended-lambda|--resource-usage|-Xptxas*)
    cuda_args="$cuda_args $1"
    ;;
  #Handle more known nvcc args
@@ -164,8 +183,11 @@ do
    cuda_args="$cuda_args $1 $2"
    shift
    ;;
  -rdc=*|-maxrregcount*|--maxrregcount*)
    cuda_args="$cuda_args $1"
    ;;
  #Handle c++11
  --std=c++11|-std=c++11|--std=c++14|-std=c++14|--std=c++1z|-std=c++1z)
  --std=c++11|-std=c++11|--std=c++14|-std=c++14|--std=c++1y|-std=c++1y|--std=c++17|-std=c++17|--std=c++1z|-std=c++1z)
    if [ $stdcxx_applied -eq 1 ]; then
      echo "nvcc_wrapper - *warning* you have set multiple optimization flags (-std=c++1* or --std=c++1*), only the first is used because nvcc can only accept a single std setting"
    else
@@ -205,6 +227,15 @@ do
    fi
    shift
    ;;
  #Handle -+ (same as -x c++, specifically used for xl compilers, but mutually exclusive with -x. So replace it with -x c++)
  -+)
    if [ $first_xcompiler_arg -eq 1 ]; then
      xcompiler_args="-x,c++"
      first_xcompiler_arg=0
    else
      xcompiler_args="$xcompiler_args,-x,c++"
    fi
    ;;
  #Handle -ccbin (if its not set we can set it to a default value)
  -ccbin)
    cuda_args="$cuda_args $1 $2"
@@ -212,18 +243,39 @@ do
    host_compiler=$2
    shift
    ;;
  #Handle -arch argument (if its not set use a default
  -arch*)

  #Handle -arch argument (if its not set use a default) this is the version with = sign
  -arch*|-gencode*)
    cuda_args="$cuda_args $1"
    arch_set=1
    ;;
  #Handle -code argument (if its not set use a default) this is the version with = sign
  -code*)
    cuda_args="$cuda_args $1"
    ;;
  #Handle -arch argument (if its not set use a default) this is the version without = sign
  -arch|-gencode)
    cuda_args="$cuda_args $1 $2"
    arch_set=1
    shift
    ;;
  #Handle -code argument (if its not set use a default) this is the version without = sign
  -code)
    cuda_args="$cuda_args $1 $2"
    shift
    ;;
  #Handle -Xcudafe argument
  -Xcudafe)
    cuda_args="$cuda_args -Xcudafe $2"
    shift
    ;;
  #Handle -Xlinker argument
  -Xlinker)
    xlinker_args="$xlinker_args -Xlinker $2"
    shift
    ;;
  #Handle args that should be sent to the linker
  -Wl*)
  -Wl,*)
    xlinker_args="$xlinker_args -Xlinker ${1:4:${#1}}"
    host_linker_args="$host_linker_args ${1:4:${#1}}"
    ;;
@@ -256,6 +308,44 @@ do
  shift
done

# Only print host compiler version
if [ $get_host_version -eq 1 ]; then
  $host_compiler --version
  exit
fi

#Remove duplicate object files
if [ $remove_duplicate_link_files -eq 1 ]; then
  for obj in $object_files
  do
    object_files_reverse="$obj $object_files_reverse"
  done

  object_files_reverse_clean=""
  for obj in $object_files_reverse
  do
    exists=false
    for obj2 in $object_files_reverse_clean
    do
      if [ "$obj" == "$obj2" ]
      then
        exists=true
        echo "Exists: $obj"
      fi
    done
    if [ "$exists" == "false" ]
    then
      object_files_reverse_clean="$object_files_reverse_clean $obj"
    fi
  done

  object_files=""
  for obj in $object_files_reverse_clean
  do
    object_files="$obj $object_files"
  done
fi

#Add default host compiler if necessary
if [ $ccbin_set -ne 1 ]; then
  cuda_args="$cuda_args -ccbin $host_compiler"
@@ -328,10 +418,19 @@ fi

#Run compilation command
if [ $host_only -eq 1 ]; then
  if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
    echo "$host_command"
  fi
  $host_command
elif [ -n "$nvcc_depfile_command" ]; then
  if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
    echo "$nvcc_command && $nvcc_depfile_command"
  fi
  $nvcc_command && $nvcc_depfile_command
else
  if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
    echo "$nvcc_command"
  fi
  $nvcc_command
fi
error_code=$?

@@ -235,3 +235,7 @@ install(FILES
# Install the export set for use with the install-tree
INSTALL(EXPORT KokkosTargets DESTINATION
  "${INSTALL_CMAKE_DIR}")

# build and install pkgconfig file
CONFIGURE_FILE(core/src/kokkos.pc.in kokkos.pc @ONLY)
INSTALL(FILES ${CMAKE_CURRENT_BINARY_DIR}/kokkos.pc DESTINATION lib/pkgconfig)

@@ -47,7 +47,7 @@ function(set_kokkos_cxx_compiler)
  OUTPUT_VARIABLE INTERNAL_CXX_COMPILER_VERSION
  OUTPUT_STRIP_TRAILING_WHITESPACE)

string(REGEX MATCH "[0-9]+\.[0-9]+\.[0-9]+$"
string(REGEX MATCH "[0-9]+\\.[0-9]+\\.[0-9]+$"
  INTERNAL_CXX_COMPILER_VERSION ${INTERNAL_CXX_COMPILER_VERSION})
endif()


@@ -41,7 +41,6 @@ list(APPEND KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST
foreach(opt ${KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST})
  string(TOUPPER ${opt} OPT )
  IF(DEFINED Kokkos_ENABLE_${opt})
    MESSAGE("Kokkos_ENABLE_${opt} is defined!")
    IF(DEFINED KOKKOS_ENABLE_${OPT})
      IF(NOT ("${KOKKOS_ENABLE_${OPT}}" STREQUAL "${Kokkos_ENABLE_${opt}}"))
        IF(DEFINED KOKKOS_ENABLE_${OPT}_INTERNAL)
@@ -59,7 +58,6 @@ foreach(opt ${KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST})
      ENDIF()
    ELSE()
      SET(KOKKOS_INTERNAL_ENABLE_${OPT}_DEFAULT ${Kokkos_ENABLE_${opt}})
      MESSAGE("set KOKKOS_INTERNAL_ENABLE_${OPT}_DEFAULT!")
    ENDIF()
  ENDIF()
endforeach()
@ -81,6 +79,7 @@ list(APPEND KOKKOS_ARCH_LIST
|
||||
ARMv80 # (HOST) ARMv8.0 Compatible CPU
|
||||
ARMv81 # (HOST) ARMv8.1 Compatible CPU
|
||||
ARMv8-ThunderX # (HOST) ARMv8 Cavium ThunderX CPU
|
||||
ARMv8-TX2 # (HOST) ARMv8 Cavium ThunderX2 CPU
|
||||
WSM # (HOST) Intel Westmere CPU
|
||||
SNB # (HOST) Intel Sandy/Ivy Bridge CPUs
|
||||
HSW # (HOST) Intel Haswell CPUs
|
||||
@ -123,11 +122,18 @@ list(APPEND KOKKOS_DEVICES_LIST
|
||||
# List of possible TPLs for Kokkos
|
||||
# From Makefile.kokkos: Options: hwloc,librt,experimental_memkind
|
||||
set(KOKKOS_USE_TPLS_LIST)
|
||||
if(APPLE)
|
||||
list(APPEND KOKKOS_USE_TPLS_LIST
|
||||
HWLOC # hwloc
|
||||
MEMKIND # experimental_memkind
|
||||
)
|
||||
else()
|
||||
list(APPEND KOKKOS_USE_TPLS_LIST
|
||||
HWLOC # hwloc
|
||||
LIBRT # librt
|
||||
MEMKIND # experimental_memkind
|
||||
)
|
||||
endif()
|
||||
# Map of cmake variables to Makefile variables
|
||||
set(KOKKOS_INTERNAL_HWLOC hwloc)
|
||||
set(KOKKOS_INTERNAL_LIBRT librt)
|
||||
@ -172,6 +178,7 @@ set(KOKKOS_INTERNAL_LAMBDA enable_lambda)
|
||||
|
||||
set(tmpr "\n ")
|
||||
string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_ARCH_DOCSTR "${KOKKOS_ARCH_LIST}")
|
||||
set(KOKKOS_INTERNAL_ARCH_DOCSTR "${tmpr}${KOKKOS_INTERNAL_ARCH_DOCSTR}")
|
||||
# This would be useful, but we use Foo_ENABLE mechanisms
|
||||
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_DEVICES_DOCSTR "${KOKKOS_DEVICES_LIST}")
|
||||
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_USE_TPLS_DOCSTR "${KOKKOS_USE_TPLS_LIST}")
|
||||
@ -269,7 +276,7 @@ set(KOKKOS_ENABLE_PROFILING_LOAD_PRINT ${KOKKOS_INTERNAL_ENABLE_PROFILING_LOAD_P
|
||||
set_kokkos_default_default(DEPRECATED_CODE ON)
|
||||
set(KOKKOS_ENABLE_DEPRECATED_CODE ${KOKKOS_INTERNAL_ENABLE_DEPRECATED_CODE_DEFAULT} CACHE BOOL "Enable deprecated code.")
|
||||
|
||||
set_kokkos_default_default(EXPLICIT_INSTANTIATION ON)
|
||||
set_kokkos_default_default(EXPLICIT_INSTANTIATION OFF)
|
||||
set(KOKKOS_ENABLE_EXPLICIT_INSTANTIATION ${KOKKOS_INTERNAL_ENABLE_EXPLICIT_INSTANTIATION_DEFAULT} CACHE BOOL "Enable explicit template instantiation.")
|
||||
|
||||
#-------------------------------------------------------------------------------
|
||||
|
||||
@@ -15,16 +15,16 @@

# Ensure that KOKKOS_ARCH is in the ARCH_LIST
if (KOKKOS_ARCH MATCHES ",")
message("-- Detected a comma in: KOKKOS_ARCH=${KOKKOS_ARCH}")
message("-- Detected a comma in: KOKKOS_ARCH=`${KOKKOS_ARCH}`")
message("-- Although we prefer KOKKOS_ARCH to be semicolon-delimited, we do allow")
message("-- comma-delimited values for compatibility with scripts (see github.com/trilinos/Trilinos/issues/2330)")
string(REPLACE "," ";" KOKKOS_ARCH "${KOKKOS_ARCH}")
message("-- Commas were changed to semicolons, now KOKKOS_ARCH=${KOKKOS_ARCH}")
message("-- Commas were changed to semicolons, now KOKKOS_ARCH=`${KOKKOS_ARCH}`")
endif()
foreach(arch ${KOKKOS_ARCH})
list(FIND KOKKOS_ARCH_LIST ${arch} indx)
if (indx EQUAL -1)
message(FATAL_ERROR "${arch} is not an accepted value for KOKKOS_ARCH."
message(FATAL_ERROR "`${arch}` is not an accepted value in KOKKOS_ARCH=`${KOKKOS_ARCH}`."
" Please pick from these choices: ${KOKKOS_INTERNAL_ARCH_DOCSTR}")
endif ()
endforeach()
@@ -130,7 +130,8 @@ string(REPLACE ";" ":" KOKKOS_INTERNAL_ADDTOPATH "${addpathl}")
# Set the KOKKOS_SETTINGS String -- this is the primary communication with the
# makefile configuration. See Makefile.kokkos

set(KOKKOS_SETTINGS KOKKOS_SRC_PATH=${KOKKOS_SRC_PATH})
set(KOKKOS_SETTINGS KOKKOS_CMAKE=yes)
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_SRC_PATH=${KOKKOS_SRC_PATH})
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_PATH=${KOKKOS_PATH})
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_INSTALL_PATH=${CMAKE_INSTALL_PREFIX})

@@ -241,17 +241,16 @@ elif [ "$MACHINE" = "white" ]; then

BASE_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>"
IBM_MODULE_LIST="<COMPILER_NAME>/xl/<COMPILER_VERSION>"
CUDA_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/5.4.0"
CUDA_MODULE_LIST2="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/6.3.0,ibm/xl/13.1.6"
CUDA_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/6.4.0,ibm/xl/16.1.0"

# Don't do pthread on white.
GCC_BUILD_LIST="OpenMP,Serial,OpenMP_Serial"

# Format: (compiler module-list build-list exe-name warning-flag)
COMPILERS=("gcc/5.4.0 $BASE_MODULE_LIST $IBM_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"ibm/13.1.6 $IBM_MODULE_LIST $IBM_BUILD_LIST xlC $IBM_WARNING_FLAGS"
"cuda/8.0.44 $CUDA_MODULE_LIST $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
"cuda/9.0.103 $CUDA_MODULE_LIST2 $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
"gcc/6.4.0 $BASE_MODULE_LIST $IBM_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"ibm/16.1.0 $IBM_MODULE_LIST $IBM_BUILD_LIST xlC $IBM_WARNING_FLAGS"
"cuda/9.0.103 $CUDA_MODULE_LIST $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
)

if [ -z "$ARCH_FLAG" ]; then
@@ -362,7 +361,7 @@ elif [ "$MACHINE" = "apollo" ]; then
"gcc/5.3.0 $BASE_MODULE_LIST "Serial" g++ $GCC_WARNING_FLAGS"
"intel/16.0.1 $BASE_MODULE_LIST "OpenMP" icpc $INTEL_WARNING_FLAGS"
"clang/3.9.0 $BASE_MODULE_LIST "Pthread_Serial" clang++ $CLANG_WARNING_FLAGS"
"clang/6.0 $CLANG_MODULE_LIST "Cuda_Pthread" clang++ $CUDA_WARNING_FLAGS"
"clang/6.0 $CLANG_MODULE_LIST "Cuda_Pthread,OpenMP" clang++ $CUDA_WARNING_FLAGS"
"cuda/9.1 $CUDA_MODULE_LIST "Cuda_OpenMP" $KOKKOS_PATH/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
)
else

@@ -96,6 +96,7 @@ template< class DataType ,
class Arg3Type = void>
class DualView : public ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type >
{
template< class , class , class , class > friend class DualView ;
public:
//! \name Typedefs for device types and various Kokkos::View specializations.
//@{
@@ -182,8 +183,20 @@ public:
//! \name Counters to keep track of changes ("modified" flags)
//@{

View<unsigned int,LayoutLeft,typename t_host::execution_space> modified_device;
View<unsigned int,LayoutLeft,typename t_host::execution_space> modified_host;
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
protected:
// modified_flags[0] -> host
// modified_flags[1] -> device
typedef View<unsigned int[2],LayoutLeft,Kokkos::HostSpace> t_modified_flags;
t_modified_flags modified_flags;

public:
#else
typedef View<unsigned int[2],LayoutLeft,typename t_host::execution_space> t_modified_flags;
typedef View<unsigned int,LayoutLeft,typename t_host::execution_space> t_modified_flag;
t_modified_flags modified_flags;
t_modified_flag modified_host,modified_device;
#endif

//@}
//! \name Constructors
@@ -194,10 +207,14 @@ public:
/// Both device and host View objects are constructed using their
/// default constructors. The "modified" flags are both initialized
/// to "unmodified."
DualView () :
modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device")),
modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{}
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
DualView () = default;
#else
DualView ():modified_flags (t_modified_flags("DualView::modified_flags")) {
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
}
#endif

/// \brief Constructor that allocates View objects on both host and device.
///
@@ -219,17 +236,24 @@ public:
const size_t n7 = KOKKOS_IMPL_CTOR_DEFAULT_ARG)
: d_view (label, n0, n1, n2, n3, n4, n5, n6, n7)
, h_view (create_mirror_view (d_view)) // without UVM, host View mirrors
, modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device"))
, modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{}
, modified_flags (t_modified_flags("DualView::modified_flags"))
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
#endif
}

//! Copy constructor (shallow copy)
template<class SS, class LS, class DS, class MS>
DualView (const DualView<SS,LS,DS,MS>& src) :
d_view (src.d_view),
h_view (src.h_view),
modified_device (src.modified_device),
modified_host (src.modified_host)
modified_flags (src.modified_flags)
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
, modified_host(src.modified_host)
, modified_device(src.modified_device)
#endif
{}

//! Subview constructor
@@ -241,8 +265,11 @@ public:
)
: d_view( Kokkos::subview( src.d_view , arg0 , args ... ) )
, h_view( Kokkos::subview( src.h_view , arg0 , args ... ) )
, modified_device (src.modified_device)
, modified_host (src.modified_host)
, modified_flags (src.modified_flags)
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
, modified_host(src.modified_host)
, modified_device(src.modified_device)
#endif
{}

/// \brief Create DualView from existing device and host View objects.
@@ -258,8 +285,7 @@ public:
DualView (const t_dev& d_view_, const t_host& h_view_) :
d_view (d_view_),
h_view (h_view_),
modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device")),
modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
modified_flags (t_modified_flags("DualView::modified_flags"))
{
if ( int(d_view.rank) != int(h_view.rank) ||
d_view.extent(0) != h_view.extent(0) ||
@@ -281,6 +307,10 @@ public:
d_view.span() != h_view.span() ) {
Kokkos::Impl::throw_runtime_exception("DualView constructed with incompatible views");
}
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
#endif
}

//@}
@@ -316,6 +346,30 @@ public:
t_dev,
t_host>::type& view () const
{
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
constexpr bool device_is_memspace = std::is_same<Device,typename Device::memory_space>::value;
constexpr bool device_is_execspace = std::is_same<Device,typename Device::execution_space>::value;
constexpr bool device_exec_is_t_dev_exec = std::is_same<typename Device::execution_space,typename t_dev::execution_space>::value;
constexpr bool device_mem_is_t_dev_mem = std::is_same<typename Device::memory_space,typename t_dev::memory_space>::value;
constexpr bool device_exec_is_t_host_exec = std::is_same<typename Device::execution_space,typename t_host::execution_space>::value;
constexpr bool device_mem_is_t_host_mem = std::is_same<typename Device::memory_space,typename t_host::memory_space>::value;
constexpr bool device_is_t_host_device = std::is_same<typename Device::execution_space,typename t_host::device_type>::value;
constexpr bool device_is_t_dev_device = std::is_same<typename Device::memory_space,typename t_host::device_type>::value;

static_assert(
device_is_t_dev_device || device_is_t_host_device ||
(device_is_memspace && (device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ) ||
(device_is_execspace && (device_exec_is_t_dev_exec || device_exec_is_t_host_exec) ) ||
(
(!device_is_execspace && !device_is_memspace) && (
(device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ||
(device_exec_is_t_dev_exec || device_exec_is_t_host_exec)
)
)
,
"Template parameter to .view() must exactly match one of the DualView's device types or one of the execution or memory spaces");
#endif

return Impl::if_c<
std::is_same<
typename t_dev::memory_space,
@@ -324,6 +378,72 @@ public:
t_host >::select (d_view , h_view);
}

KOKKOS_INLINE_FUNCTION
t_host view_host() const {
return h_view;
}

KOKKOS_INLINE_FUNCTION
t_dev view_device() const {
return d_view;
}

template<class Device>
static int get_device_side() {
constexpr bool device_is_memspace = std::is_same<Device,typename Device::memory_space>::value;
constexpr bool device_is_execspace = std::is_same<Device,typename Device::execution_space>::value;
constexpr bool device_exec_is_t_dev_exec = std::is_same<typename Device::execution_space,typename t_dev::execution_space>::value;
constexpr bool device_mem_is_t_dev_mem = std::is_same<typename Device::memory_space,typename t_dev::memory_space>::value;
constexpr bool device_exec_is_t_host_exec = std::is_same<typename Device::execution_space,typename t_host::execution_space>::value;
constexpr bool device_mem_is_t_host_mem = std::is_same<typename Device::memory_space,typename t_host::memory_space>::value;
constexpr bool device_is_t_host_device = std::is_same<typename Device::execution_space,typename t_host::device_type>::value;
constexpr bool device_is_t_dev_device = std::is_same<typename Device::memory_space,typename t_host::device_type>::value;

#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
static_assert(
device_is_t_dev_device || device_is_t_host_device ||
(device_is_memspace && (device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ) ||
(device_is_execspace && (device_exec_is_t_dev_exec || device_exec_is_t_host_exec) ) ||
(
(!device_is_execspace && !device_is_memspace) && (
(device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ||
(device_exec_is_t_dev_exec || device_exec_is_t_host_exec)
)
)
,
"Template parameter to .sync() must exactly match one of the DualView's device types or one of the execution or memory spaces");
#endif

#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
int dev = -1;
#else
int dev = 0;
#endif
if(device_is_t_dev_device) dev = 1;
else if(device_is_t_host_device) dev = 0;
else {
if(device_is_memspace) {
if(device_mem_is_t_dev_mem) dev = 1;
if(device_mem_is_t_host_mem) dev = 0;
if(device_mem_is_t_host_mem && device_mem_is_t_dev_mem) dev = -1;
}
if(device_is_execspace) {
if(device_exec_is_t_dev_exec) dev = 1;
if(device_exec_is_t_host_exec) dev = 0;
if(device_exec_is_t_host_exec && device_exec_is_t_dev_exec) dev = -1;
}
if(!device_is_execspace && !device_is_memspace) {
if(device_mem_is_t_dev_mem) dev = 1;
if(device_mem_is_t_host_mem) dev = 0;
if(device_mem_is_t_host_mem && device_mem_is_t_dev_mem) dev = -1;
if(device_exec_is_t_dev_exec) dev = 1;
if(device_exec_is_t_host_exec) dev = 0;
if(device_exec_is_t_host_exec && device_exec_is_t_dev_exec) dev = -1;
}
}
return dev;
}

/// \brief Update data on device or host only if data in the other
/// space has been marked as modified.
///
@@ -347,23 +467,20 @@ public:
( std::is_same< Device , int>::value)
, int >::type& = 0)
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value ,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return;

if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
int dev = get_device_side<Device>();

if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
deep_copy (d_view, h_view);
modified_host() = modified_device() = 0;
modified_flags(0) = modified_flags(1) = 0;
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0) { // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
deep_copy (h_view, d_view);
modified_host() = modified_device() = 0;
modified_flags(0) = modified_flags(1) = 0;
}
}
if(std::is_same<typename t_host::memory_space,typename t_dev::memory_space>::value) {
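
For orientation (not part of the diff): after this rewrite, the templated `modify`/`sync` calls resolve the target side through `get_device_side` and the shared `modified_flags` pair, and they now accept a memory space or an execution space as the template argument, not just a full device type. A minimal usage sketch, assuming a device backend is enabled (on a host-only build both sides alias and the calls degenerate to no-ops); all names and sizes below are illustrative:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Sketch only: fill on the host, then migrate to the device side.
// Assumes Kokkos::initialize() has already been called.
void fill_and_migrate() {
  const size_t n = 100;
  Kokkos::DualView<double*> dv("dv", n);

  // Write through the host view, then record the modification.
  for (size_t i = 0; i < n; ++i) dv.h_view(i) = static_cast<double>(i);
  dv.modify<Kokkos::HostSpace>();            // a memory space is now accepted

  // Copies h_view -> d_view only because the host counter is ahead.
  dv.sync<Kokkos::DefaultExecutionSpace>();  // an execution space works too
}
```
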
@@ -378,46 +495,71 @@ public:
( std::is_same< Device , int>::value)
, int >::type& = 0 )
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
unsigned int,
unsigned int>::select (1, 0);
if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
if(modified_flags.data()==NULL) return;

int dev = get_device_side<Device>();

if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
Impl::throw_runtime_exception("Calling sync on a DualView with a const datatype.");
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0){ // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
Impl::throw_runtime_exception("Calling sync on a DualView with a const datatype.");
}
}
}

void sync_host() {
if( ! std::is_same< typename traits::data_type , typename traits::non_const_data_type>::value )
Impl::throw_runtime_exception("Calling sync_host on a DualView with a const datatype.");
if(modified_flags.data()==NULL) return;
if(modified_flags(1) > modified_flags(0)) {
deep_copy (h_view, d_view);
modified_flags(1) = modified_flags(0) = 0;
}
}

void sync_device() {
if( ! std::is_same< typename traits::data_type , typename traits::non_const_data_type>::value )
Impl::throw_runtime_exception("Calling sync_device on a DualView with a const datatype.");
if(modified_flags.data()==NULL) return;
if(modified_flags(0) > modified_flags(1)) {
deep_copy (d_view, h_view);
modified_flags(1) = modified_flags(0) = 0;
}
}

template<class Device>
bool need_sync() const
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value ,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return false;
int dev = get_device_side<Device>();

if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
return true;
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0){ // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
return true;
}
}
return false;
}

inline bool need_sync_host() const {
if(modified_flags.data()==NULL) return false;
return modified_flags(0)<modified_flags(1);
}

inline bool need_sync_device() const {
if(modified_flags.data()==NULL) return false;
return modified_flags(1)<modified_flags(0);
}

/// \brief Mark data as modified on the given device \c Device.
///
/// If \c Device is the same as this DualView's device type, then
@@ -425,26 +567,22 @@ public:
/// data as modified.
template<class Device>
void modify () {
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return;
int dev = get_device_side<Device>();

if (dev) { // if Device is the same as DualView's device type
if (dev == 1) { // if Device is the same as DualView's device type
// Increment the device's modified count.
modified_device () = (modified_device () > modified_host () ?
modified_device () : modified_host ()) + 1;
} else { // hopefully Device is the same as DualView's host type
modified_flags(1) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
}
if (dev == 0) { // hopefully Device is the same as DualView's host type
// Increment the host's modified count.
modified_host () = (modified_device () > modified_host () ?
modified_device () : modified_host ()) + 1;
modified_flags(0) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
}

#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_host() && modified_device()) {
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
@@ -455,6 +593,45 @@ public:
#endif
}

inline void modify_host() {
if(modified_flags.data()!=NULL) {
modified_flags(0) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify_host ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
msg += d_view.label();
msg += "\"\n";
Kokkos::abort(msg.c_str());
}
#endif
}
}

inline void modify_device() {
if(modified_flags.data()!=NULL) {
modified_flags(1) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify_device ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
msg += d_view.label();
msg += "\"\n";
Kokkos::abort(msg.c_str());
}
#endif
}
}

inline void clear_sync_state() {
if(modified_flags.data()!=NULL)
modified_flags(1) = modified_flags(0) = 0;
}

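The block above is the new non-templated interface from issue #1858. A hypothetical round trip using it; the view name, type, and sizes are illustrative, and on a host-only build the deep copies are harmless self-copies:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Hypothetical round trip with the non-templated calls.
void round_trip() {
  Kokkos::DualView<int*> dv("dv", 10);

  dv.h_view(0) = 42;
  dv.modify_host();            // bump the host modified counter
  if (dv.need_sync_device())   // true here: the host counter is ahead
    dv.sync_device();          // deep_copy h_view -> d_view, reset both flags

  // ... a device kernel writes dv.d_view here ...
  dv.modify_device();
  dv.sync_host();              // copies back only if the device is ahead

  dv.clear_sync_state();       // force both sides to count as in sync
}
```
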
//@}
//! \name Methods for reallocating or resizing the View objects.
//@{
@@ -476,7 +653,10 @@ public:
h_view = create_mirror_view( d_view );

/* Reset dirty flags */
modified_device() = modified_host() = 0;
if(modified_flags.data()==NULL) {
modified_flags = t_modified_flags("DualView::modified_flags");
} else
modified_flags(1) = modified_flags(0) = 0;
}

/// \brief Resize both views, copying old contents into new if necessary.
@@ -491,13 +671,16 @@ public:
const size_t n5 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
const size_t n6 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
const size_t n7 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ) {
if(modified_device() >= modified_host()) {
if(modified_flags.data()==NULL) {
modified_flags = t_modified_flags("DualView::modified_flags");
}
if(modified_flags(1) >= modified_flags(0)) {
/* Resize on Device */
::Kokkos::resize(d_view,n0,n1,n2,n3,n4,n5,n6,n7);
h_view = create_mirror_view( d_view );

/* Mark Device copy as modified */
modified_device() = modified_device()+1;
modified_flags(1) = modified_flags(1)+1;

} else {
/* Realloc on Device */
@@ -525,7 +708,7 @@ public:
d_view = create_mirror_view( typename t_dev::execution_space(), h_view );

/* Mark Host copy as modified */
modified_host() = modified_host()+1;
modified_flags(0) = modified_flags(0)+1;
}
}

@@ -649,7 +832,10 @@ void
deep_copy (DualView<DT,DL,DD,DM> dst, // trust me, this must not be a reference
const DualView<ST,SL,SD,SM>& src )
{
if (src.modified_device () >= src.modified_host ()) {
if(src.modified_flags.data()==NULL || dst.modified_flags.data()==NULL) {
return deep_copy(dst.d_view, src.d_view);
}
if (src.modified_flags(1) >= src.modified_flags(0)) {
deep_copy (dst.d_view, src.d_view);
dst.template modify<typename DualView<DT,DL,DD,DM>::device_type> ();
} else {
@@ -666,7 +852,10 @@ deep_copy (const ExecutionSpace& exec ,
DualView<DT,DL,DD,DM> dst, // trust me, this must not be a reference
const DualView<ST,SL,SD,SM>& src )
{
if (src.modified_device () >= src.modified_host ()) {
if(src.modified_flags.data()==NULL || dst.modified_flags.data()==NULL) {
return deep_copy(exec, dst.d_view, src.d_view);
}
if (src.modified_flags(1) >= src.modified_flags(0)) {
deep_copy (exec, dst.d_view, src.d_view);
dst.template modify<typename DualView<DT,DL,DD,DM>::device_type> ();
} else {

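A sketch of how the rewritten `deep_copy` specialization behaves; the branch comment below follows the flag comparison shown in the hunk, and every name is illustrative rather than taken from the commit:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Illustrative only: which side deep_copy(dst, src) picks.
void copy_dualviews() {
  Kokkos::DualView<float*> src("src", 64);
  Kokkos::DualView<float*> dst("dst", 64);

  src.h_view(3) = 1.5f;
  src.modify_host();    // src's host counter now leads its device counter

  // modified_flags(1) >= modified_flags(0) is false, so the else-branch
  // runs: the host views are copied and dst is marked modified on the host.
  Kokkos::deep_copy(dst, src);
  dst.sync_device();    // propagate to dst's device side if needed
}
```
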
@@ -64,7 +64,7 @@ namespace Impl {
template <typename Specialize>
struct DynRankDimTraits {

enum : size_t{unspecified =KOKKOS_INVALID_INDEX};
enum : size_t{unspecified = KOKKOS_INVALID_INDEX};

// Compute the rank of the view from the nonzero dimension arguments.
KOKKOS_INLINE_FUNCTION
@@ -384,8 +384,8 @@ public:
// Removed dimension checks...

typedef typename DstType::offset_type dst_offset_type ;
dst.m_map.m_offset = dst_offset_type(std::integral_constant<unsigned,0>() , src.layout() ); //Check this for integer input1 for padding, etc
dst.m_map.m_handle = Kokkos::Impl::ViewDataHandle< DstTraits >::assign( src.m_map.m_handle , src.m_track );
dst.m_map.m_impl_offset = dst_offset_type(std::integral_constant<unsigned,0>() , src.layout() ); //Check this for integer input1 for padding, etc
dst.m_map.m_impl_handle = Kokkos::Impl::ViewDataHandle< DstTraits >::assign( src.m_map.m_impl_handle , src.m_track );
dst.m_track.assign( src.m_track , DstTraits::is_managed );
dst.m_rank = src.Rank ;
}
@@ -565,10 +565,14 @@ public:

//----------------------------------------
// Allow specializations to query their specialized map

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::ViewMapping< traits , void > &
implementation_map() const { return m_map ; }
#endif
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::ViewMapping< traits , void > &
impl_map() const { return m_map ; }

//----------------------------------------

@@ -624,7 +628,7 @@ public:
reference_type operator()() const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (0 , this->rank(), m_track, m_map) )
return implementation_map().reference();
return impl_map().reference();
//return m_map.reference(0,0,0,0,0,0,0);
}

@@ -647,7 +651,7 @@ public:
typename std::enable_if< !std::is_same<typename drvtraits::value_type, typename drvtraits::scalar_array_type>::value && std::is_integral<iType>::value, reference_type>::type
operator[](const iType & i0) const
{
// auto map = implementation_map();
// auto map = impl_map();
const size_t dim_scalar = m_map.dimension_scalar();
const size_t bytes = this->span() / dim_scalar;

@@ -785,7 +789,7 @@ public:
reference_type access() const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (0 , this->rank(), m_track, m_map) )
return implementation_map().reference();
return impl_map().reference();
//return m_map.reference(0,0,0,0,0,0,0);
}

@@ -1004,7 +1008,7 @@ public:

//----------------------------------------
// Allocation according to allocation properties and array layout
// unused arg_layout dimensions must be set toKOKKOS_INVALID_INDEX so that rank deduction can properly take place
// unused arg_layout dimensions must be set to KOKKOS_INVALID_INDEX so that rank deduction can properly take place
template< class ... P >
explicit inline
DynRankView( const Kokkos::Impl::ViewCtorProp< P ... > & arg_prop
@@ -1179,7 +1183,7 @@ public:
: DynRankView( Kokkos::Impl::ViewCtorProp< std::string >( arg_label )
, typename traits::array_layout
( arg_N0 , arg_N1 , arg_N2 , arg_N3 , arg_N4 , arg_N5 , arg_N6 , arg_N7 )
)
)
{}

// For backward compatibility
@@ -1189,8 +1193,7 @@ public:
, const typename traits::array_layout & arg_layout
)
: DynRankView( Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing )

, Impl::DynRankDimTraits<typename traits::specialize>::createLayout(arg_layout)
, arg_layout
)
{}

@@ -1205,7 +1208,9 @@ public:
, const size_t arg_N6 =KOKKOS_INVALID_INDEX
, const size_t arg_N7 =KOKKOS_INVALID_INDEX
)
: DynRankView(Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing ), arg_N0, arg_N1, arg_N2, arg_N3, arg_N4, arg_N5, arg_N6, arg_N7 )
: DynRankView(Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing )
, typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3, arg_N4, arg_N5, arg_N6, arg_N7)
)
{}

//----------------------------------------
@@ -1445,30 +1450,30 @@ public:
ret_type dst ;

const SubviewExtents< 7 , rank > extents =
ExtentGenerator< Args ... >::generator( src.m_map.m_offset.m_dim , args... ) ;
ExtentGenerator< Args ... >::generator( src.m_map.m_impl_offset.m_dim , args... ) ;

dst_offset_type tempdst( src.m_map.m_offset , extents ) ;
dst_offset_type tempdst( src.m_map.m_impl_offset , extents ) ;

dst.m_track = src.m_track ;

dst.m_map.m_offset.m_dim.N0 = tempdst.m_dim.N0 ;
dst.m_map.m_offset.m_dim.N1 = tempdst.m_dim.N1 ;
dst.m_map.m_offset.m_dim.N2 = tempdst.m_dim.N2 ;
dst.m_map.m_offset.m_dim.N3 = tempdst.m_dim.N3 ;
dst.m_map.m_offset.m_dim.N4 = tempdst.m_dim.N4 ;
dst.m_map.m_offset.m_dim.N5 = tempdst.m_dim.N5 ;
dst.m_map.m_offset.m_dim.N6 = tempdst.m_dim.N6 ;
dst.m_map.m_impl_offset.m_dim.N0 = tempdst.m_dim.N0 ;
dst.m_map.m_impl_offset.m_dim.N1 = tempdst.m_dim.N1 ;
dst.m_map.m_impl_offset.m_dim.N2 = tempdst.m_dim.N2 ;
dst.m_map.m_impl_offset.m_dim.N3 = tempdst.m_dim.N3 ;
dst.m_map.m_impl_offset.m_dim.N4 = tempdst.m_dim.N4 ;
dst.m_map.m_impl_offset.m_dim.N5 = tempdst.m_dim.N5 ;
dst.m_map.m_impl_offset.m_dim.N6 = tempdst.m_dim.N6 ;

dst.m_map.m_offset.m_stride.S0 = tempdst.m_stride.S0 ;
dst.m_map.m_offset.m_stride.S1 = tempdst.m_stride.S1 ;
dst.m_map.m_offset.m_stride.S2 = tempdst.m_stride.S2 ;
dst.m_map.m_offset.m_stride.S3 = tempdst.m_stride.S3 ;
dst.m_map.m_offset.m_stride.S4 = tempdst.m_stride.S4 ;
dst.m_map.m_offset.m_stride.S5 = tempdst.m_stride.S5 ;
dst.m_map.m_offset.m_stride.S6 = tempdst.m_stride.S6 ;
dst.m_map.m_impl_offset.m_stride.S0 = tempdst.m_stride.S0 ;
dst.m_map.m_impl_offset.m_stride.S1 = tempdst.m_stride.S1 ;
dst.m_map.m_impl_offset.m_stride.S2 = tempdst.m_stride.S2 ;
dst.m_map.m_impl_offset.m_stride.S3 = tempdst.m_stride.S3 ;
dst.m_map.m_impl_offset.m_stride.S4 = tempdst.m_stride.S4 ;
dst.m_map.m_impl_offset.m_stride.S5 = tempdst.m_stride.S5 ;
dst.m_map.m_impl_offset.m_stride.S6 = tempdst.m_stride.S6 ;

dst.m_map.m_handle = dst_handle_type( src.m_map.m_handle +
src.m_map.m_offset( extents.domain_offset(0)
dst.m_map.m_impl_handle = dst_handle_type( src.m_map.m_impl_handle +
src.m_map.m_impl_offset( extents.domain_offset(0)
, extents.domain_offset(1)
, extents.domain_offset(2)
, extents.domain_offset(3)
@@ -1896,6 +1901,7 @@ inline
typename DynRankView<T,P...>::HostMirror
create_mirror( const DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
! std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
@@ -1914,6 +1920,7 @@ inline
typename DynRankView<T,P...>::HostMirror
create_mirror( const DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
@@ -1929,7 +1936,11 @@ create_mirror( const DynRankView<T,P...> & src

// Create a mirror in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRVType<Space,T,P ...>::view_type create_mirror(const Space& , const Kokkos::DynRankView<T,P...> & src) {
typename Impl::MirrorDRVType<Space,T,P ...>::view_type
create_mirror(const Space& , const Kokkos::DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value
>::type * = 0) {
return typename Impl::MirrorDRVType<Space,T,P ...>::view_type(src.label(), Impl::reconstructLayout(src.layout(), src.rank()) );
}

@@ -1985,6 +1996,29 @@ create_mirror_view(const Space& , const Kokkos::DynRankView<T,P...> & src
return typename Impl::MirrorDRViewType<Space,T,P ...>::view_type(src.label(), Impl::reconstructLayout(src.layout(), src.rank()) );
}

// Create a mirror view and deep_copy in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::DynRankView<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<Impl::MirrorDRViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
(void)name;
return src;
}

// Create a mirror view and deep_copy in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::DynRankView<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<!Impl::MirrorDRViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
using Mirror = typename Impl::MirrorDRViewType<Space,T,P ...>::view_type;
std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror( Kokkos::ViewAllocateWithoutInitializing(label), Impl::reconstructLayout(src.layout(), src.rank()) );
deep_copy(mirror, src);
return mirror;
}

} //end Kokkos

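The two overloads above add `create_mirror_view_and_copy` for DynRankView (issue #1651 in the changelog). A hedged usage sketch, with illustrative names; on a host-only build the second call also hits the same-memory-space overload and simply returns the input:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DynRankView.hpp>

// Hedged sketch of the new overloads; names are illustrative.
void mirror_and_copy() {
  Kokkos::DynRankView<double, Kokkos::HostSpace> a("A", 10, 10);  // rank 2
  a(5, 5) = 42.0;

  // Same memory space: returns `a` itself, no allocation and no copy.
  auto a_same = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), a);

  // Different memory space: allocates without initializing, reconstructs
  // the layout and rank, deep_copies, and honors the optional label.
  auto a_dev = Kokkos::create_mirror_view_and_copy(
      Kokkos::DefaultExecutionSpace::memory_space(), a, "A_dev");
}
```
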
lib/kokkos/containers/src/Kokkos_OffsetView.hpp (new file, 1895 lines)
File diff suppressed because it is too large
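
Since the new header's diff is suppressed, here is a small sketch of the capability it adds, OffsetView with a non-zero begin index (changelog issue #567). The API shown matches the unit test that appears further below; everything else is illustrative:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_OffsetView.hpp>

// Sketch of an OffsetView with non-zero begin indices.
void offsetview_basics() {
  // A 2D view indexed over [-1,3] x [-2,2] instead of starting at zero.
  Kokkos::Experimental::OffsetView<double**, Kokkos::HostSpace>
      ov("ov", {-1, 3}, {-2, 2});

  for (int i = ov.begin(0); i < ov.end(0); ++i)   // begin(0) == -1, end(0) == 4
    for (int j = ov.begin(1); j < ov.end(1); ++j)
      ov(i, j) = 10.0 * i + j;                    // negative indices are valid
}
```
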
@@ -47,7 +47,9 @@
#include <string>
#include <vector>

#include <Kokkos_Core.hpp>
#include <Kokkos_View.hpp>
#include <Kokkos_Parallel.hpp>
#include <Kokkos_Parallel_Reduce.hpp>

namespace Kokkos {

@@ -86,14 +86,13 @@ public:
vector():DV() {
_size = 0;
_extra_storage = 1.1;
DV::modified_host() = 1;
}

vector(int n, Scalar val=Scalar()):DualView<Scalar*,LayoutLeft,Arg1Type>("Vector",size_t(n*(1.1))) {
_size = n;
_extra_storage = 1.1;
DV::modified_host() = 1;
DV::modified_flags(0) = 1;

assign(n,val);
}
@@ -119,16 +118,16 @@ public:

/* Assign value either on host or on device */

if( DV::modified_host() >= DV::modified_device() ) {
if( DV::template need_sync<typename DV::t_dev::device_type>() ) {
set_functor_host f(DV::h_view,val);
parallel_for(n,f);
DV::t_host::execution_space::fence();
DV::modified_host()++;
DV::template modify<typename DV::t_host::device_type>();
} else {
set_functor f(DV::d_view,val);
parallel_for(n,f);
DV::t_dev::execution_space::fence();
DV::modified_device()++;
DV::template modify<typename DV::t_dev::device_type>();
}
}

@@ -137,7 +136,8 @@ public:
}

void push_back(Scalar val) {
DV::modified_host()++;
DV::template sync<typename DV::t_host::device_type>();
DV::template modify<typename DV::t_host::device_type>();
if(_size == span()) {
size_t new_size = _size*_extra_storage;
if(new_size == _size) new_size++;
@@ -247,10 +247,10 @@ public:
}

void on_host() {
DV::modified_host() = DV::modified_device() + 1;
DV::template modify<typename DV::t_host::device_type>();
}
void on_device() {
DV::modified_device() = DV::modified_host() + 1;
DV::template modify<typename DV::t_dev::device_type>();
}

void set_overallocation(float extra) {

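The `Kokkos::vector` changes above replace direct counter arithmetic on `modified_host()`/`modified_device()` with the portable `modify`/`sync` calls. A generic migration sketch for user code that did the same thing; the helper name is hypothetical:

```cpp
#include <Kokkos_DualView.hpp>

// Hypothetical helper: old code bumped the counter directly, e.g.
//   dv.modified_host() = dv.modified_device() + 1;
// The portable replacement, as used by Kokkos::vector above, is:
template <class DV>
void mark_host_fresh(DV& dv) {
  dv.template modify<typename DV::t_host::device_type>();
}
```
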
@@ -23,6 +23,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
threads/TestThreads_DynRankViewAPI_rank12345.cpp
threads/TestThreads_DynRankViewAPI_rank67.cpp
threads/TestThreads_ErrorReporter.cpp
threads/TestThreads_OffsetView.cpp
threads/TestThreads_ScatterView.cpp
threads/TestThreads_StaticCrsGraph.cpp
threads/TestThreads_UnorderedMap.cpp
@@ -47,6 +48,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
serial/TestSerial_DynRankViewAPI_rank12345.cpp
serial/TestSerial_DynRankViewAPI_rank67.cpp
serial/TestSerial_ErrorReporter.cpp
serial/TestSerial_OffsetView.cpp
serial/TestSerial_ScatterView.cpp
serial/TestSerial_StaticCrsGraph.cpp
serial/TestSerial_UnorderedMap.cpp
@@ -71,6 +73,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
openmp/TestOpenMP_DynRankViewAPI_rank12345.cpp
openmp/TestOpenMP_DynRankViewAPI_rank67.cpp
openmp/TestOpenMP_ErrorReporter.cpp
openmp/TestOpenMP_OffsetView.cpp
openmp/TestOpenMP_ScatterView.cpp
openmp/TestOpenMP_StaticCrsGraph.cpp
openmp/TestOpenMP_UnorderedMap.cpp
@@ -95,6 +98,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
cuda/TestCuda_DynRankViewAPI_rank12345.cpp
cuda/TestCuda_DynRankViewAPI_rank67.cpp
cuda/TestCuda_ErrorReporter.cpp
cuda/TestCuda_OffsetView.cpp
cuda/TestCuda_ScatterView.cpp
cuda/TestCuda_StaticCrsGraph.cpp
cuda/TestCuda_UnorderedMap.cpp

@@ -39,6 +39,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
OBJ_CUDA += TestCuda_DynRankViewAPI_rank12345.o
OBJ_CUDA += TestCuda_DynRankViewAPI_rank67.o
OBJ_CUDA += TestCuda_ErrorReporter.o
OBJ_CUDA += TestCuda_OffsetView.o
OBJ_CUDA += TestCuda_ScatterView.o
OBJ_CUDA += TestCuda_StaticCrsGraph.o
OBJ_CUDA += TestCuda_UnorderedMap.o
@@ -57,6 +58,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
OBJ_ROCM += TestROCm_DynRankViewAPI_rank12345.o
OBJ_ROCM += TestROCm_DynRankViewAPI_rank67.o
OBJ_ROCM += TestROCm_ErrorReporter.o
OBJ_ROCM += TestROCm_OffsetView.o
OBJ_ROCM += TestROCm_ScatterView.o
OBJ_ROCM += TestROCm_StaticCrsGraph.o
OBJ_ROCM += TestROCm_UnorderedMap.o
@@ -75,6 +77,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
OBJ_THREADS += TestThreads_DynRankViewAPI_rank12345.o
OBJ_THREADS += TestThreads_DynRankViewAPI_rank67.o
OBJ_THREADS += TestThreads_ErrorReporter.o
OBJ_THREADS += TestThreads_OffsetView.o
OBJ_THREADS += TestThreads_ScatterView.o
OBJ_THREADS += TestThreads_StaticCrsGraph.o
OBJ_THREADS += TestThreads_UnorderedMap.o
@@ -93,6 +96,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
OBJ_OPENMP += TestOpenMP_DynRankViewAPI_rank12345.o
OBJ_OPENMP += TestOpenMP_DynRankViewAPI_rank67.o
OBJ_OPENMP += TestOpenMP_ErrorReporter.o
OBJ_OPENMP += TestOpenMP_OffsetView.o
OBJ_OPENMP += TestOpenMP_ScatterView.o
OBJ_OPENMP += TestOpenMP_StaticCrsGraph.o
OBJ_OPENMP += TestOpenMP_UnorderedMap.o
@@ -111,6 +115,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
OBJ_SERIAL += TestSerial_DynRankViewAPI_rank12345.o
OBJ_SERIAL += TestSerial_DynRankViewAPI_rank67.o
OBJ_SERIAL += TestSerial_ErrorReporter.o
OBJ_SERIAL += TestSerial_OffsetView.o
OBJ_SERIAL += TestSerial_ScatterView.o
OBJ_SERIAL += TestSerial_StaticCrsGraph.o
OBJ_SERIAL += TestSerial_UnorderedMap.o

@@ -729,6 +729,7 @@ public:
static void run_tests() {
run_test_resize_realloc();
run_test_mirror();
run_test_mirror_and_copy();
run_test_scalar();
run_test();
run_test_const();
@@ -885,6 +886,69 @@ public:
}
}

static void run_test_mirror_and_copy()
{
// LayoutLeft
{
Kokkos::DynRankView< double, Kokkos::LayoutLeft, Kokkos::HostSpace > a_org( "A", 10 );
a_org(5) = 42.0;
Kokkos::DynRankView< double, Kokkos::LayoutLeft, Kokkos::HostSpace > a_h = a_org;
auto a_h2 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_h );
auto a_d = Kokkos::create_mirror_view_and_copy( DeviceType(), a_h );
auto a_h3 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_d );

int equal_ptr_h_h2 = a_h.data() == a_h2.data() ? 1 : 0;
int equal_ptr_h_d = a_h.data() == a_d.data() ? 1 : 0;
int equal_ptr_h2_d = a_h2.data() == a_d.data() ? 1 : 0;
int equal_ptr_h3_d = a_h3.data() == a_d.data() ? 1 : 0;

int is_same_memspace = std::is_same< Kokkos::HostSpace, typename DeviceType::memory_space >::value ? 1 : 0;
ASSERT_EQ( equal_ptr_h_h2, 1 );
ASSERT_EQ( equal_ptr_h_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h2_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h3_d, is_same_memspace );

ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.extent(0), a_h2.extent(0) );
ASSERT_EQ( a_h.extent(0), a_d .extent(0) );
ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.rank(), a_org.rank() );
ASSERT_EQ( a_h.rank(), a_h2.rank() );
ASSERT_EQ( a_h.rank(), a_h3.rank() );
ASSERT_EQ( a_h.rank(), a_d.rank() );
ASSERT_EQ( a_org(5), a_h3(5) );
}
// LayoutRight
{
Kokkos::DynRankView< double, Kokkos::LayoutRight, Kokkos::HostSpace > a_org( "A", 10 );
a_org(5) = 42.0;
Kokkos::DynRankView< double, Kokkos::LayoutRight, Kokkos::HostSpace > a_h = a_org;
auto a_h2 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_h );
auto a_d = Kokkos::create_mirror_view_and_copy( DeviceType(), a_h );
auto a_h3 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_d );

int equal_ptr_h_h2 = a_h.data() == a_h2.data() ? 1 : 0;
int equal_ptr_h_d = a_h.data() == a_d.data() ? 1 : 0;
int equal_ptr_h2_d = a_h2.data() == a_d.data() ? 1 : 0;
int equal_ptr_h3_d = a_h3.data() == a_d.data() ? 1 : 0;

int is_same_memspace = std::is_same< Kokkos::HostSpace, typename DeviceType::memory_space >::value ? 1 : 0;
ASSERT_EQ( equal_ptr_h_h2, 1 );
ASSERT_EQ( equal_ptr_h_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h2_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h3_d, is_same_memspace );

ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.extent(0), a_h2.extent(0) );
ASSERT_EQ( a_h.extent(0), a_d .extent(0) );
ASSERT_EQ( a_h.rank(), a_org.rank() );
ASSERT_EQ( a_h.rank(), a_h2.rank() );
ASSERT_EQ( a_h.rank(), a_h3.rank() );
ASSERT_EQ( a_h.rank(), a_d.rank() );
ASSERT_EQ( a_org(5), a_h3(5) );
}
}

static void run_test_scalar()
{
typedef typename dView0::HostMirror hView0 ; //HostMirror of DynRankView is a DynRankView

lib/kokkos/containers/unit_tests/TestOffsetView.hpp (new file, 426 lines)
@ -0,0 +1,426 @@
|
||||
//@HEADER
|
||||
// ************************************************************************
|
||||
//
|
||||
// Kokkos v. 2.0
|
||||
// Copyright (2014) Sandia Corporation
|
||||
//
|
||||
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
|
||||
// the U.S. Government retains certain rights in this software.
|
||||
//
|
||||
// Redistribution and use in source and binary forms, with or without
|
||||
// modification, are permitted provided that the following conditions are
|
||||
// met:
|
||||
//
|
||||
// 1. Redistributions of source code must retain the above copyright
|
||||
// notice, this list of conditions and the following disclaimer.
|
||||
//
|
||||
// 2. Redistributions in binary form must reproduce the above copyright
|
||||
// notice, this list of conditions and the following disclaimer in the
|
||||
// documentation and/or other materials provided with the distribution.
|
||||
//
|
||||
// 3. Neither the name of the Corporation nor the names of the
|
||||
// contributors may be used to endorse or promote products derived from
|
||||
// this software without specific prior written permission.
|
||||
//
|
||||
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
|
||||
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
|
||||
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
|
||||
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
|
||||
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
|
||||
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
|
||||
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
|
||||
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
|
||||
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
//
|
||||
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
|
||||
//
|
||||
// ************************************************************************
|
||||
//@HEADER
|
||||
|
||||
/*
|
||||
* FIXME the OffsetView class is really not very well tested.
|
||||
*/
|
||||
#ifndef CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_
|
||||
#define CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_
|
||||
|
||||
|
||||
|
||||
#include <gtest/gtest.h>
|
||||
#include <iostream>
|
||||
#include <cstdlib>
|
||||
#include <cstdio>
|
||||
#include <impl/Kokkos_Timer.hpp>
|
||||
#include <Kokkos_OffsetView.hpp>
|
||||
#include <KokkosExp_MDRangePolicy.hpp>
|
||||
|
||||
using std::endl;
|
||||
using std::cout;
|
||||
|
||||
namespace Test{
|
||||
|
||||
template <typename Scalar, typename Device>
|
||||
void test_offsetview_construction(unsigned int size)
|
||||
{
|
||||
|
||||
typedef Kokkos::Experimental::OffsetView<Scalar**, Device> offset_view_type;
|
||||
typedef Kokkos::View<Scalar**, Device> view_type;
|
||||
|
||||
Kokkos::Experimental::index_list_type range0 = {-1, 3};
|
||||
Kokkos::Experimental::index_list_type range1 = {-2, 2};
|
||||
|
||||
offset_view_type ov("firstOV", range0, range1);
|
||||
|
||||
ASSERT_EQ("firstOV", ov.label());
|
||||
ASSERT_EQ(2, ov.Rank);
|
||||
|
||||
ASSERT_EQ(ov.begin(0), -1);
|
||||
ASSERT_EQ(ov.end(0), 4);
|
||||
|
||||
ASSERT_EQ(ov.begin(1), -2);
|
||||
ASSERT_EQ(ov.end(1), 3);
|
||||
|
||||
ASSERT_EQ(ov.extent(0), 5);
|
||||
ASSERT_EQ(ov.extent(1), 5);
|
||||
|
||||
const int ovmin0 = ov.begin(0);
|
||||
const int ovend0 = ov.end(0);
|
||||
const int ovmin1 = ov.begin(1);
|
||||
const int ovend1 = ov.end(1);
|
||||
|
||||
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
|
||||
{
|
||||
Kokkos::Experimental::OffsetView<Scalar*, Device> offsetV1("OneDOffsetView", range0);
|
||||
|
||||
Kokkos::RangePolicy<Device, int> rangePolicy1(offsetV1.begin(0), offsetV1.end(0));
|
||||
Kokkos::parallel_for(rangePolicy1, KOKKOS_LAMBDA (const int i){
|
||||
offsetV1(i) = 1;
|
||||
}
|
||||
);
|
||||
Kokkos::fence();
|
||||
|
||||
int OVResult = 0;
|
||||
Kokkos::parallel_reduce(rangePolicy1, KOKKOS_LAMBDA(const int i, int & updateMe){
|
||||
updateMe += offsetV1(i);
|
||||
}, OVResult);
|
||||
|
||||
Kokkos::fence();
|
||||
ASSERT_EQ(OVResult, offsetV1.end(0) - offsetV1.begin(0)) << "found wrong number of elements in OffsetView that was summed.";
|
||||
|
||||
}
|
||||
{ //test deep copy of scalar const value into mirro
|
||||
const int constVal = 6;
|
||||
typename offset_view_type::HostMirror hostOffsetView =
|
||||
Kokkos::Experimental::create_mirror_view(ov);
|
||||
|
||||
Kokkos::Experimental::deep_copy(hostOffsetView, constVal);
|
||||
|
||||
for(int i = hostOffsetView.begin(0); i < hostOffsetView.end(0); ++i) {
|
||||
for(int j = hostOffsetView.begin(1); j < hostOffsetView.end(1); ++j) {
|
||||
ASSERT_EQ(hostOffsetView(i,j), constVal) << "Bad data found in OffsetView";
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, Kokkos::IndexType<int> > range_type;
|
||||
typedef typename range_type::point_type point_type;
|
||||
|
||||
range_type rangePolicy2D(point_type{ {ovmin0, ovmin1 } },
|
||||
point_type{ { ovend0, ovend1 } });
|
||||
|
||||
const int constValue = 9;
|
||||
Kokkos::parallel_for(rangePolicy2D, KOKKOS_LAMBDA (const int i, const int j) {
|
||||
ov(i,j) = constValue;
|
||||
}
|
||||
);
|
||||
|
||||
//test offsetview to offsetviewmirror deep copy
|
||||
typename offset_view_type::HostMirror hostOffsetView =
|
||||
Kokkos::Experimental::create_mirror_view(ov);
|
||||
|
||||
Kokkos::Experimental::deep_copy(hostOffsetView, ov);
|
||||
|
||||
for(int i = hostOffsetView.begin(0); i < hostOffsetView.end(0); ++i) {
|
||||
for(int j = hostOffsetView.begin(1); j < hostOffsetView.end(1); ++j) {
|
||||
ASSERT_EQ(hostOffsetView(i,j), constValue) << "Bad data found in OffsetView";
|
}
}

int OVResult = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j);
}, OVResult);

int answer = 0;
for(int i = ov.begin(0); i < ov.end(0); ++i) {
for(int j = ov.begin(1); j < ov.end(1); ++j) {
answer += constValue;
}
}

ASSERT_EQ(OVResult, answer) << "Bad data found in OffsetView";
#endif

{
offset_view_type ovCopy(ov);
ASSERT_EQ(ovCopy==ov, true) <<
"Copy constructor or equivalence operator broken";
}

{
offset_view_type ovAssigned = ov;
ASSERT_EQ(ovAssigned==ov, true) <<
"Assignment operator or equivalence operator broken";
}

{ //construct OffsetView from a View plus begins array
const int extent0 = 100;
const int extent1 = 200;
const int extent2 = 300;
Kokkos::View<Scalar***, Device> view3D("view3D", extent0, extent1, extent2);

Kokkos::deep_copy(view3D, 1);

Kokkos::Array<int64_t,3> begins = {{-10, -20, -30}};
Kokkos::Experimental::OffsetView<Scalar***, Device> offsetView3D(view3D, begins);

typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<3>, Kokkos::IndexType<int64_t> > range3_type;
typedef typename range3_type::point_type point3_type;

range3_type rangePolicy3DZero(point3_type{ {0, 0, 0 } },
point3_type{ { extent0, extent1, extent2 } });

#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int view3DSum = 0;
Kokkos::parallel_reduce(rangePolicy3DZero, KOKKOS_LAMBDA(const int i, const int j, int k, int & updateMe){
updateMe += view3D(i, j, k);
}, view3DSum);

range3_type rangePolicy3D(point3_type{ {begins[0], begins[1], begins[2] } },
point3_type{ { begins[0] + extent0, begins[1] + extent1, begins[2] + extent2 } });
int offsetView3DSum = 0;

Kokkos::parallel_reduce(rangePolicy3D, KOKKOS_LAMBDA(const int i, const int j, int k, int & updateMe){
updateMe += offsetView3D(i, j, k);
}, offsetView3DSum);

ASSERT_EQ(view3DSum, offsetView3DSum) << "construction of OffsetView from View and begins array broken.";
#endif
}
view_type viewFromOV = ov.view();

ASSERT_EQ(viewFromOV == ov, true) <<
"OffsetView::view() or equivalence operator View == OffsetView broken";

{
offset_view_type ovFromV(viewFromOV, {-1, -2});

ASSERT_EQ(ovFromV == viewFromOV , true) <<
"Construction of OffsetView from View or equivalence operator OffsetView == View broken";
}
{
offset_view_type ovFromV = viewFromOV;
ASSERT_EQ(ovFromV == viewFromOV , true) <<
"Construction of OffsetView from View by assignment (implicit conversion) or equivalence operator OffsetView == View broken";
}

{// test offsetview to view deep copy
view_type aView("aView", ov.extent(0), ov.extent(1));
Kokkos::Experimental::deep_copy(aView, ov);

#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int sum = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j) - aView(i - ov.begin(0), j - ov.begin(1));
}, sum);

ASSERT_EQ(sum, 0) << "deep_copy(view, offsetView) broken.";
#endif
}

{// test view to offsetview deep copy
view_type aView("aView", ov.extent(0), ov.extent(1));

Kokkos::deep_copy(aView, 99);
Kokkos::Experimental::deep_copy(ov, aView);

#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int sum = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j) - aView(i - ov.begin(0), j - ov.begin(1));
}, sum);

ASSERT_EQ(sum, 0) << "deep_copy(offsetView, view) broken.";
#endif
}
}

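The construction test above exercises the core OffsetView workflow: build with inclusive {begin, end} index pairs, index with possibly negative native indices, and iterate over [begin(d), end(d)). A minimal standalone sketch of that workflow, using only the API the test exercises (the helper name offsetview_demo is illustrative, and the default execution space is assumed):

#include <Kokkos_Core.hpp>
#include <Kokkos_OffsetView.hpp>

void offsetview_demo() {
  // A 10 x 6 view whose indices run over [-5,4] x [-3,2].
  Kokkos::Experimental::OffsetView<int**, Kokkos::DefaultExecutionSpace>
      ov("ov", {-5, 4}, {-3, 2});

  typedef Kokkos::MDRangePolicy<Kokkos::Rank<2>, Kokkos::IndexType<int> > range_type;
  typedef range_type::point_type point_type;

  const int b0 = ov.begin(0), b1 = ov.begin(1);
  const int e0 = ov.end(0),   e1 = ov.end(1);
  range_type range(point_type{{b0, b1}}, point_type{{e0, e1}});

  Kokkos::parallel_for(range, KOKKOS_LAMBDA(const int i, const int j) {
    ov(i, j) = 1;  // negative indices are legal; the begins are subtracted internally
  });

  int sum = 0;
  Kokkos::parallel_reduce(range, KOKKOS_LAMBDA(const int i, const int j, int& update) {
    update += ov(i, j);
  }, sum);
  // sum == ov.extent(0) * ov.extent(1) == 60
}
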
template <typename Scalar, typename Device>
void test_offsetview_subview(unsigned int size)
{
{//test subview 1
Kokkos::Experimental::OffsetView<Scalar*, Device> sliceMe("offsetToSlice", {-10, 20});
{
auto offsetSubviewa = Kokkos::Experimental::subview(sliceMe, 0);
ASSERT_EQ(offsetSubviewa.Rank, 0) << "subview of offset is broken.";
}

}
{//test subview 2
Kokkos::Experimental::OffsetView<Scalar**, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30});
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), -2);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}

{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
}

{//test subview rank 3

Kokkos::Experimental::OffsetView<Scalar***, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30}, {-30,40});

//slice 1
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}

{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), Kokkos::make_pair(-30, -21));
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";

ASSERT_EQ(offsetSubview.begin(0) , -20);
ASSERT_EQ(offsetSubview.end(0) , 31);
ASSERT_EQ(offsetSubview.begin(1) , 0);
ASSERT_EQ(offsetSubview.end(1) , 9);

#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, Kokkos::IndexType<int> > range_type;
typedef typename range_type::point_type point_type;

const int b0 = offsetSubview.begin(0);
const int b1 = offsetSubview.begin(1);

const int e0 = offsetSubview.end(0);
const int e1 = offsetSubview.end(1);

range_type rangeP2D(point_type{ {b0, b1 } }, point_type{ { e0, e1} });

Kokkos::parallel_for(rangeP2D, KOKKOS_LAMBDA(const int i, const int j) {
offsetSubview(i,j) = 6;
}
);

int sum = 0;
Kokkos::parallel_reduce(rangeP2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += offsetSubview(i, j);
}, sum);

ASSERT_EQ(sum, 6*(e0-b0)*(e1-b1));
#endif
}

// slice 2
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}

{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
}

{//test subview rank 4

Kokkos::Experimental::OffsetView<Scalar****, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30}, {-30,40}, {-40, 50});

//slice 1
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, Kokkos::ALL(), Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), Kokkos::ALL(), Kokkos::ALL() );
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}

// slice 2
auto offsetSubview2a = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview2a.Rank, 2) << "subview of offset is broken.";
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL(), Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
// slice 3
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}

}

}

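The rank-3 asserts above pin down the subview index rules; restated as a sketch (the helper name is hypothetical, the API is exactly what the test uses):

// Kokkos::ALL() preserves a dimension's offset range, an integer argument
// removes the dimension, and a half-open pair slice re-bases the dimension
// to begin at zero.
void offsetview_subview_demo() {
  Kokkos::Experimental::OffsetView<double***, Kokkos::DefaultExecutionSpace>
      ov("ov", {-10, 20}, {-20, 30}, {-30, 40});

  auto s = Kokkos::Experimental::subview(ov, 0, Kokkos::ALL(),
                                         Kokkos::make_pair(-30, -21));
  // s.Rank == 2
  // s.begin(0) == -20 and s.end(0) == 31  (inherited from dimension 1)
  // s.begin(1) == 0   and s.end(1) == 9   (9-entry pair slice, re-based)
}
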
TEST_F( TEST_CATEGORY, offsetview_construction) {
test_offsetview_construction<int,TEST_EXECSPACE>(10);
}
TEST_F( TEST_CATEGORY, offsetview_subview) {
test_offsetview_subview<int,TEST_EXECSPACE>(10);
}

} // namespace Test

#endif /* CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_ */

@ -80,7 +80,9 @@ void test_scatter_view_config(int n)
Kokkos::Experimental::contribute(original_view, scatter_view);
}
#if defined( KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA )
Kokkos::fence();
auto host_view = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), original_view);
Kokkos::fence();
for (typename decltype(host_view)::size_type i = 0; i < host_view.extent(0); ++i) {
auto val0 = host_view(i, 0);
auto val1 = host_view(i, 1);
@ -111,9 +113,6 @@ struct TestDuplicatedScatterView {
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(n);
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterAtomic>(n);
}
};

@ -127,6 +126,16 @@ struct TestDuplicatedScatterView<Kokkos::Cuda> {
};
#endif

#ifdef KOKKOS_ENABLE_ROCM
// disable duplicated instantiation with ROCm until
// UniqueToken can support it
template <>
struct TestDuplicatedScatterView<Kokkos::Experimental::ROCm> {
TestDuplicatedScatterView(int) {
}
};
#endif

template <typename ExecSpace>
void test_scatter_view(int n)
{
@ -142,16 +151,28 @@ void test_scatter_view(int n)
Kokkos::Experimental::ScatterNonDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(n);
}
#ifdef KOKKOS_ENABLE_SERIAL
if (!std::is_same<ExecSpace, Kokkos::Serial>::value) {
#endif
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterNonDuplicated,
Kokkos::Experimental::ScatterAtomic>(n);
#ifdef KOKKOS_ENABLE_SERIAL
}
#endif

TestDuplicatedScatterView<ExecSpace> duptest(n);
}

TEST_F( TEST_CATEGORY, scatterview) {
#ifndef KOKKOS_ENABLE_ROCM
test_scatter_view<TEST_EXECSPACE>(10);
#ifdef KOKKOS_ENABLE_DEBUG
test_scatter_view<TEST_EXECSPACE>(100000);
#else
test_scatter_view<TEST_EXECSPACE>(10000000);
#endif
#endif
}

} // namespace Test

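The pattern these tests drive is: independent threads accumulate into a ScatterView, and contribute() (visible in the first hunk above) folds the potentially duplicated buffers back into the target View. A hedged sketch under the ScatterSum/default duplication settings of this release; the name-and-extent constructor and the access() call reflect this editor's understanding of the ScatterView interface, so treat them as assumptions:

template <typename ExecSpace>
void scatter_histogram(int n) {
  Kokkos::View<double*, ExecSpace> hist("hist", 10);
  Kokkos::Experimental::ScatterView<double*, Kokkos::LayoutRight, ExecSpace>
      scatter("scatter", 10);

  Kokkos::parallel_for(Kokkos::RangePolicy<ExecSpace>(0, n),
                       KOKKOS_LAMBDA(const int i) {
    auto access = scatter.access();  // per-thread accumulation handle
    access(i % 10) += 1.0;           // data-race-free "+=" into a shared bin
  });

  Kokkos::Experimental::contribute(hist, scatter);  // merge buffers into hist
}
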
@ -46,6 +46,7 @@
#include <vector>

#include <Kokkos_StaticCrsGraph.hpp>
#include <Kokkos_Core.hpp>

/*--------------------------------------------------------------------------*/
namespace Test {

@ -0,0 +1,47 @@

/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/

#include<cuda/TestCuda_Category.hpp>
#include<TestOffsetView.hpp>

@ -0,0 +1,47 @@

/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/

#include<openmp/TestOpenMP_Category.hpp>
#include<TestOffsetView.hpp>

@ -60,6 +60,6 @@ protected:
} // namespace Test

#define TEST_CATEGORY rocm
#define TEST_EXECSPACE Kokkos::ROCm
#define TEST_EXECSPACE Kokkos::Experimental::ROCm

#endif

@ -0,0 +1,46 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/

#include<serial/TestSerial_Category.hpp>
#include<TestOffsetView.hpp>

@ -0,0 +1,47 @@

/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/

#include<threads/TestThreads_Category.hpp>
#include<TestOffsetView.hpp>

@ -108,3 +108,7 @@ else()

endif()
#-----------------------------------------------------------------------------

# build and install pkgconfig file
CONFIGURE_FILE(kokkos.pc.in kokkos.pc @ONLY)
INSTALL(FILES ${CMAKE_CURRENT_BINARY_DIR}/kokkos.pc DESTINATION lib/pkgconfig)

@ -208,7 +208,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {

if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -264,7 +264,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {

if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -321,7 +321,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {

if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -370,7 +370,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {

if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {

@ -453,6 +453,8 @@ SharedAllocationRecord( const Kokkos::CudaSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;

// Copy to device memory
Kokkos::Impl::DeepCopy<CudaSpace,HostSpace>( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
@ -491,6 +493,9 @@ SharedAllocationRecord( const Kokkos::CudaUVMSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);

// Set last element zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}

SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
@ -525,6 +530,8 @@ SharedAllocationRecord( const Kokkos::CudaHostPinnedSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}

//----------------------------------------------------------------------------

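The null-termination lines added in each constructor above exist because strncpy does not terminate the destination when the source is at least as long as the buffer. A self-contained illustration of the same guard (the helper name is hypothetical; the library behavior is standard C):

#include <cstring>

void copy_label(char (&dst)[32], const char* src) {
  strncpy(dst, src, sizeof(dst));  // may leave dst without a '\0'
  dst[sizeof(dst) - 1] = '\0';     // force termination, as the hunks above do
}
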
@ -689,9 +689,13 @@ Cuda::size_type cuda_internal_multiprocessor_count()

CudaSpace::size_type cuda_internal_maximum_concurrent_block_count()
{
#if defined(KOKKOS_ARCH_KEPLER)
// Compute capability 3.0 through 3.7
enum : int { max_resident_blocks_per_multiprocessor = 16 };
#else
// Compute capability 5.0 through 6.2
enum : int { max_resident_blocks_per_multiprocessor = 32 };

#endif
return CudaInternal::singleton().m_multiProcCount
* max_resident_blocks_per_multiprocessor ;
};

@ -52,22 +52,22 @@

namespace Kokkos { namespace Impl {

template<class DriverType, bool Large>
template<class DriverType, class LaunchBounds, bool Large>
struct CudaGetMaxBlockSize;

template<class DriverType, bool Large = (CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>
template<class DriverType, class LaunchBounds>
int cuda_get_max_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
return CudaGetMaxBlockSize<DriverType,Large>::get_block_size(f,vector_length, shmem_extra_block,shmem_extra_thread);
return CudaGetMaxBlockSize<DriverType,LaunchBounds,(CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>::get_block_size(f,vector_length, shmem_extra_block,shmem_extra_thread);
}


template<class DriverType>
struct CudaGetMaxBlockSize<DriverType,true> {
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks;
int blockSize=32;
int blockSize=1024;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
@ -76,8 +76,9 @@ struct CudaGetMaxBlockSize<DriverType,true> {
blockSize,
sharedmem);

while (blockSize<1024 && numBlocks>0) {
blockSize*=2;
if(numBlocks>0) return blockSize;
while (blockSize>32 && numBlocks==0) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

@ -87,19 +88,30 @@ struct CudaGetMaxBlockSize<DriverType,true> {
blockSize,
sharedmem);
}
if(numBlocks>0) return blockSize;
else return blockSize/2;
int blockSizeUpperBound = blockSize*2;
while (blockSize<blockSizeUpperBound && numBlocks>0) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
return blockSize - 32;
}
};

template<class DriverType>
struct CudaGetMaxBlockSize<DriverType,false> {
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks;

int blockSize=32;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
unsigned int blockSize=1024;
unsigned int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
@ -107,8 +119,9 @@ struct CudaGetMaxBlockSize<DriverType,false> {
blockSize,
sharedmem);

while (blockSize<1024 && numBlocks>0) {
blockSize*=2;
if(numBlocks>0) return blockSize;
while (blockSize>32 && numBlocks==0) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

@ -118,24 +131,121 @@ struct CudaGetMaxBlockSize<DriverType,false> {
blockSize,
sharedmem);
}
if(numBlocks>0) return blockSize;
else return blockSize/2;
unsigned int blockSizeUpperBound = blockSize*2;
while (blockSize<blockSizeUpperBound && numBlocks>0) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
return blockSize - 32;
}
};

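Both rewritten specializations above share one shape: start at the 1024-thread cap, halve until the occupancy query reports at least one resident block, then probe upward in 32-thread (warp-size) steps and back off one step. A driver-free sketch of just that control flow, with occupancy_for standing in for cudaOccupancyMaxActiveBlocksPerMultiprocessor and the shared-memory recomputation elided:

template <class OccupancyFn>
int max_block_size_sketch(OccupancyFn occupancy_for) {
  int blockSize = 1024;                       // start at the cap
  int numBlocks = occupancy_for(blockSize);
  if (numBlocks > 0) return blockSize;        // the cap already fits

  while (blockSize > 32 && numBlocks == 0) {  // halve until something fits
    blockSize /= 2;
    numBlocks = occupancy_for(blockSize);
  }
  const int upperBound = blockSize * 2;       // refine inside the last octave
  while (blockSize < upperBound && numBlocks > 0) {
    blockSize += 32;
    numBlocks = occupancy_for(blockSize);
  }
  return blockSize - 32;                      // last probe overshot by one warp
}
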
template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<MaxThreadsPerBlock,MinBlocksPerSM>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks = 0, oldNumBlocks = 0;
unsigned int blockSize=MaxThreadsPerBlock;
unsigned int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);

if(static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) return blockSize;

while (blockSize>32 && static_cast<unsigned int>(numBlocks)<MinBlocksPerSM) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
unsigned int blockSizeUpperBound = (blockSize*2<MaxThreadsPerBlock?blockSize*2:MaxThreadsPerBlock);
while (blockSize<blockSizeUpperBound && static_cast<unsigned int>(numBlocks)>MinBlocksPerSM) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
oldNumBlocks = numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
if(static_cast<unsigned int>(oldNumBlocks)>=MinBlocksPerSM) return blockSize - 32;
return -1;
}
};

template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<MaxThreadsPerBlock,MinBlocksPerSM>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks = 0, oldNumBlocks = 0;
unsigned int blockSize=MaxThreadsPerBlock;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) return blockSize;

while (blockSize>32 && static_cast<unsigned int>(numBlocks)<MinBlocksPerSM) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
unsigned int blockSizeUpperBound = (blockSize*2<MaxThreadsPerBlock?blockSize*2:MaxThreadsPerBlock);
while (blockSize<blockSizeUpperBound && static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
oldNumBlocks = numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
if(static_cast<unsigned int>(oldNumBlocks)>=MinBlocksPerSM) return blockSize - 32;
return -1;
}
};



template<class DriverType, bool Large>
template<class DriverType, class LaunchBounds, bool Large>
struct CudaGetOptBlockSize;

template<class DriverType, bool Large = (CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>
template<class DriverType, class LaunchBounds>
int cuda_get_opt_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
return CudaGetOptBlockSize<DriverType,Large>::get_block_size(f,vector_length,shmem_extra_block,shmem_extra_thread);
return CudaGetOptBlockSize<DriverType,LaunchBounds,(CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>::get_block_size(f,vector_length,shmem_extra_block,shmem_extra_thread);
}

template<class DriverType>
struct CudaGetOptBlockSize<DriverType,true> {
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds<>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
@ -165,7 +275,7 @@ struct CudaGetOptBlockSize<DriverType,true> {
};

template<class DriverType>
struct CudaGetOptBlockSize<DriverType,false> {
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds<>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
@ -194,6 +304,75 @@ struct CudaGetOptBlockSize<DriverType,false> {
}
};

template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds< MaxThreadsPerBlock, MinBlocksPerSM >,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
int numBlocks;
int sharedmem;
int maxOccupancy=0;
int bestBlockSize=0;
int max_threads_per_block = std::min(MaxThreadsPerBlock,cuda_internal_maximum_warp_count()*CudaTraits::WarpSize);

while(blockSize < max_threads_per_block ) {
blockSize*=2;

//calculate the occupancy with that optBlockSize and check whether it's larger than the largest one found so far
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(numBlocks >= int(MinBlocksPerSM) && blockSize<=int(MaxThreadsPerBlock)) {
if(maxOccupancy < numBlocks*blockSize) {
maxOccupancy = numBlocks*blockSize;
bestBlockSize = blockSize;
}
}
}
if(maxOccupancy > 0)
return bestBlockSize;
return -1;
}
};

template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds< MaxThreadsPerBlock, MinBlocksPerSM >,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
int numBlocks;
int sharedmem;
int maxOccupancy=0;
int bestBlockSize=0;
int max_threads_per_block = std::min(MaxThreadsPerBlock,cuda_internal_maximum_warp_count()*CudaTraits::WarpSize);

while(blockSize < max_threads_per_block ) {
blockSize*=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(numBlocks >= int(MinBlocksPerSM) && blockSize<=int(MaxThreadsPerBlock)) {
if(maxOccupancy < numBlocks*blockSize) {
maxOccupancy = numBlocks*blockSize;
bestBlockSize = blockSize;
}
}
}
if(maxOccupancy > 0)
return bestBlockSize;
return -1;
}
};

}} // namespace Kokkos::Impl

#endif // KOKKOS_ENABLE_CUDA

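The LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM> specializations above are selected through the policy's launch_bounds trait (see the cuda_get_*_block_size callers later in this diff). A hedged usage sketch; that a TeamPolicy accepts LaunchBounds as a template property is inferred from the traits::launch_bounds uses in this commit:

void launch_bounds_demo(int league_size) {
  typedef Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<256, 4> > policy_type;
  // At most 256 threads per block; at least 4 resident blocks per SM.
  policy_type policy(league_size, Kokkos::AUTO);

  Kokkos::parallel_for(policy,
      KOKKOS_LAMBDA(const policy_type::member_type& team) {
    // team body; the block-size deduction above honors the stated bounds
  });
}
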
@ -148,6 +148,9 @@ namespace Kokkos {
namespace Impl {
namespace {
static int lock_array_copied = 0;
inline int eliminate_warning_for_lock_array() {
return lock_array_copied;
}
}
}
}

@ -60,6 +60,7 @@
#include <Cuda/Kokkos_Cuda_Internal.hpp>
#include <Cuda/Kokkos_Cuda_Locks.hpp>
#include <Kokkos_Vectorization.hpp>
#include <Cuda/Kokkos_Cuda_Version_9_8_Compatibility.hpp>

#if defined(KOKKOS_ENABLE_PROFILING)
#include <impl/Kokkos_Profiling_Interface.hpp>
@ -114,6 +115,7 @@ public:

//----------------------------------------

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & functor )
@ -131,7 +133,35 @@ public:

return n ;
}
#endif

template<class FunctorType>
int team_size_max( const FunctorType& f, const ParallelForTag& ) const {
typedef Impl::ParallelFor< FunctorType , TeamPolicy<Properties...> > closure_type;
int block_size = Kokkos::Impl::cuda_get_max_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) );
return block_size/vector_length();
}

template<class FunctorType>
int team_size_max( const FunctorType& f, const ParallelReduceTag& ) const {
typedef Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,TeamPolicyInternal,FunctorType> functor_analysis_type;
typedef typename Impl::ParallelReduceReturnValue<void,typename functor_analysis_type::value_type,FunctorType>::reducer_type reducer_type;
typedef Impl::ParallelReduce< FunctorType , TeamPolicy<Properties...>, reducer_type > closure_type;
typedef Impl::FunctorValueTraits< FunctorType , typename traits::work_tag > functor_value_traits;

int block_size = Kokkos::Impl::cuda_get_max_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) +
((functor_value_traits::StaticValueSize!=0)?0:functor_value_traits::value_size( f )));

// Currently we require Power-of-2 team size for reductions.
int p2 = 1;
while(p2<=block_size) p2*=2;
p2/=2;
return p2/vector_length();
}

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
static int team_size_recommended( const FunctorType & functor )
{ return team_size_max( functor ); }
@ -143,11 +173,41 @@ public:
if(max<1) max = 1;
return max;
}
#endif

template<class FunctorType>
int team_size_recommended( const FunctorType& f, const ParallelForTag& ) const {
typedef Impl::ParallelFor< FunctorType , TeamPolicy<Properties...> > closure_type;
int block_size = Kokkos::Impl::cuda_get_opt_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double));
return block_size/vector_length();
}

template<class FunctorType>
int team_size_recommended( const FunctorType& f, const ParallelReduceTag& ) const {
typedef Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,TeamPolicyInternal,FunctorType> functor_analysis_type;
typedef typename Impl::ParallelReduceReturnValue<void,typename functor_analysis_type::value_type,FunctorType>::reducer_type reducer_type;
typedef Impl::ParallelReduce< FunctorType , TeamPolicy<Properties...>, reducer_type > closure_type;
typedef Impl::FunctorValueTraits< FunctorType , typename traits::work_tag > functor_value_traits;

int block_size = Kokkos::Impl::cuda_get_opt_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) +
((functor_value_traits::StaticValueSize!=0)?0:functor_value_traits::value_size( f )));
return block_size/vector_length();
}


inline static
int vector_length_max()
{ return Impl::CudaTraits::WarpSize; }

inline static
int scratch_size_max(int level)
{ return (level==0?
1024*40: // 48kB is the max for CUDA, but we need some for team_member.reduce etc.
20*1024*1024); // arbitrarily setting this to 20MB, for a Volta V100 that would give us about 3.2GB for 2 teams per SM
}

//----------------------------------------

inline int vector_length() const { return m_vector_length ; }
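The tag-dispatched queries above supersede the deprecated static team_size_max/team_size_recommended. A sketch of the calling convention (the functor F and helper name are hypothetical; the overloads are exactly the ones added in this hunk):

template <class F>
void run_with_deduced_team_size(const F& f, int league_size) {
  Kokkos::TeamPolicy<Kokkos::Cuda> query_policy(league_size, 1);
  const int team_max = query_policy.team_size_max(f, Kokkos::ParallelForTag());
  const int team_rec = query_policy.team_size_recommended(f, Kokkos::ParallelForTag());

  // Use the recommended size when available, capped by the maximum.
  Kokkos::TeamPolicy<Kokkos::Cuda> policy(league_size,
                                          team_rec > 0 ? team_rec : team_max);
  Kokkos::parallel_for(policy, f);
}
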
@ -419,7 +479,7 @@ public:
void execute() const
{
const typename Policy::index_type nwork = m_policy.end() - m_policy.begin();
const int block_size = Kokkos::Impl::cuda_get_opt_block_size< ParallelFor >( m_functor , 1, 0 , 0 );
const int block_size = Kokkos::Impl::cuda_get_opt_block_size< ParallelFor, LaunchBounds>( m_functor , 1, 0 , 0 );
const dim3 block( 1 , block_size , 1);
const dim3 grid( std::min( typename Policy::index_type(( nwork + block.y - 1 ) / block.y) , typename Policy::index_type(cuda_internal_maximum_grid_count()) ) , 1 , 1);

@ -654,7 +714,7 @@ public:
: m_functor( arg_functor )
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelFor >( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length() )
Kokkos::Impl::cuda_get_opt_block_size< ParallelFor, LaunchBounds >( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
, m_shmem_begin( sizeof(double) * ( m_team_size + 2 ) )
, m_shmem_size( arg_policy.scratch_size(0,m_team_size) + FunctorTeamShmemSize< FunctorType >::value( m_functor , m_team_size ) )
@ -670,7 +730,7 @@ public:
}

if ( int(m_team_size) >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelFor >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelFor, LaunchBounds >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelFor< Cuda > requested too large team size."));
}
@ -725,12 +785,13 @@ public:
const Policy m_policy ;
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
const bool m_result_ptr_device_accessible ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;

// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit)
enum { UseShflReduction = false };//((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
// Some crutch to do function overloading
private:
typedef double DummyShflReductionType;
@ -752,12 +813,12 @@ public:

__device__ inline
void operator() () const {
run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
/* run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
}

__device__ inline
void run(const DummySHMEMReductionType& ) const
{
{*/
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) / sizeof(size_type) );

@ -786,7 +847,8 @@ public:
// This is the final block with the final result at the final threads' location

size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
( m_unified_space ? m_unified_space : m_scratch_space );

if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
@ -798,10 +860,9 @@ public:
}
}

__device__ inline
/* __device__ inline
void run(const DummyShflReductionType&) const
{

value_type value;
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , &value);
// Number of blocks is bounded so that the reduction can be limited to two passes.
@ -832,7 +893,7 @@ public:
*result = value;
}
}
}
}*/

// Determine block size constrained by shared memory:
static inline
@ -863,16 +924,18 @@ public:

CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute

Cuda::fence();
if(!m_result_ptr_device_accessible) {
Cuda::fence();

if ( m_result_ptr ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,CudaSpace>( m_result_ptr , m_scratch_space , size );
if ( m_result_ptr ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,CudaSpace>( m_result_ptr , m_scratch_space , size );
}
}
}
}
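With m_result_ptr_device_accessible, the kernel's final block writes straight into the user's result whenever its memory space is CUDA-accessible, and the fence/copy-back above is skipped. A sketch of triggering that path by reducing into a device-resident View, a capability this commit introduces (the helper name is illustrative):

void reduce_into_device_view(int n) {
  Kokkos::View<double, Kokkos::CudaSpace> result("result");  // rank-0, on device
  Kokkos::parallel_reduce(
      Kokkos::RangePolicy<Kokkos::Cuda>(0, n),
      KOKKOS_LAMBDA(const int i, double& update) { update += double(i); },
      result);  // no host round-trip; the sum lands in CudaSpace
}
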
@ -883,17 +946,18 @@ public:
|
||||
}
|
||||
}
|
||||
|
||||
template< class HostViewType >
|
||||
template< class ViewType >
|
||||
ParallelReduce( const FunctorType & arg_functor
|
||||
, const Policy & arg_policy
|
||||
, const HostViewType & arg_result
|
||||
, const ViewType & arg_result
|
||||
, typename std::enable_if<
|
||||
Kokkos::is_view< HostViewType >::value
|
||||
Kokkos::is_view< ViewType >::value
|
||||
,void*>::type = NULL)
|
||||
: m_functor( arg_functor )
|
||||
, m_policy( arg_policy )
|
||||
, m_reducer( InvalidType() )
|
||||
, m_result_ptr( arg_result.data() )
|
||||
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
|
||||
, m_scratch_space( 0 )
|
||||
, m_scratch_flags( 0 )
|
||||
, m_unified_space( 0 )
|
||||
@ -906,6 +970,7 @@ public:
|
||||
, m_policy( arg_policy )
|
||||
, m_reducer( reducer )
|
||||
, m_result_ptr( reducer.view().data() )
|
||||
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
|
||||
, m_scratch_space( 0 )
|
||||
, m_scratch_flags( 0 )
|
||||
, m_unified_space( 0 )
|
||||
@ -953,6 +1018,7 @@ public:
|
||||
const Policy m_policy ; // used for workrange and nwork
|
||||
const ReducerType m_reducer ;
|
||||
const pointer_type m_result_ptr ;
|
||||
const bool m_result_ptr_device_accessible ;
|
||||
size_type * m_scratch_space ;
|
||||
size_type * m_scratch_flags ;
|
||||
size_type * m_unified_space ;
|
||||
@ -960,7 +1026,7 @@ public:
|
||||
typedef typename Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, typename Policy::work_tag, reference_type> DeviceIteratePattern;
|
||||
|
||||
// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit
|
||||
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
|
||||
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && (ValueTraits::StaticValueSize!=0)) };
|
||||
// Some crutch to do function overloading
|
||||
private:
|
||||
typedef double DummyShflReductionType;
|
||||
@ -978,12 +1044,12 @@ public:
|
||||
inline
|
||||
__device__
|
||||
void operator() (void) const {
|
||||
run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
|
||||
/* run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
|
||||
}
|
||||
|
||||
__device__ inline
|
||||
void run(const DummySHMEMReductionType& ) const
|
||||
{
|
||||
{*/
|
||||
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
|
||||
word_count( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) / sizeof(size_type) );
|
||||
|
||||
@ -1007,7 +1073,8 @@ public:
|
||||
|
||||
// This is the final block with the final result at the final threads' location
|
||||
size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
|
||||
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
|
||||
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
|
||||
( m_unified_space ? m_unified_space : m_scratch_space );
|
||||
|
||||
if ( threadIdx.y == 0 ) {
|
||||
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
|
||||
@ -1019,7 +1086,7 @@ public:
|
||||
}
|
||||
}
|
||||
|
||||
__device__ inline
|
||||
/* __device__ inline
|
||||
void run(const DummyShflReductionType&) const
|
||||
{
|
||||
|
||||
@ -1051,7 +1118,7 @@ public:
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
*/
|
||||
// Determine block size constrained by shared memory:
|
||||
static inline
|
||||
unsigned local_block_size( const FunctorType & f )
|
||||
@ -1089,16 +1156,18 @@ public:
|
||||
|
||||
CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute
|
||||
|
||||
Cuda::fence();
|
||||
if(!m_result_ptr_device_accessible) {
|
||||
Cuda::fence();
|
||||
|
||||
if ( m_result_ptr ) {
|
||||
if ( m_unified_space ) {
|
||||
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
|
||||
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
|
||||
}
|
||||
else {
|
||||
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
|
||||
DeepCopy<HostSpace,CudaSpace>( m_result_ptr , m_scratch_space , size );
|
||||
if ( m_result_ptr ) {
|
||||
if ( m_unified_space ) {
|
||||
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
|
||||
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
|
||||
}
|
||||
else {
|
||||
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
|
||||
DeepCopy<HostSpace,CudaSpace>( m_result_ptr , m_scratch_space , size );
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -1109,17 +1178,18 @@ public:
|
||||
}
|
||||
}
|
||||
|
||||
template< class HostViewType >
|
||||
template< class ViewType >
|
||||
ParallelReduce( const FunctorType & arg_functor
|
||||
, const Policy & arg_policy
|
||||
, const HostViewType & arg_result
|
||||
, const ViewType & arg_result
|
||||
, typename std::enable_if<
|
||||
Kokkos::is_view< HostViewType >::value
|
||||
Kokkos::is_view< ViewType >::value
|
||||
,void*>::type = NULL)
|
||||
: m_functor( arg_functor )
|
||||
, m_policy( arg_policy )
|
||||
, m_reducer( InvalidType() )
|
||||
, m_result_ptr( arg_result.data() )
|
||||
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
|
||||
, m_scratch_space( 0 )
|
||||
, m_scratch_flags( 0 )
|
||||
, m_unified_space( 0 )
|
||||
@ -1132,6 +1202,7 @@ public:
|
||||
, m_policy( arg_policy )
|
||||
, m_reducer( reducer )
|
||||
, m_result_ptr( reducer.view().data() )
|
||||
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
|
||||
, m_scratch_space( 0 )
|
||||
, m_scratch_flags( 0 )
|
||||
, m_unified_space( 0 )
|
||||
@ -1174,7 +1245,7 @@ public:
typedef FunctorType functor_type ;
typedef Cuda::size_type size_type ;

enum { UseShflReduction = (true && ValueTraits::StaticValueSize) };
enum { UseShflReduction = (true && (ValueTraits::StaticValueSize!=0)) };

private:
typedef double DummyShflReductionType;
@ -1191,6 +1262,7 @@ private:
const FunctorType m_functor ;
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
const bool m_result_ptr_device_accessible ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;
@ -1279,7 +1351,8 @@ public:
// This is the final block with the final result at the final threads' location

size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
( m_unified_space ? m_unified_space : m_scratch_space );

if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
@ -1312,12 +1385,18 @@ public:
, value );
}

pointer_type const result = (pointer_type) (m_unified_space ? m_unified_space : m_scratch_space) ;
pointer_type const result = m_result_ptr_device_accessible? m_result_ptr :
(pointer_type) ( m_unified_space ? m_unified_space : m_scratch_space );

value_type init;
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , &init);
if(Impl::cuda_inter_block_reduction<FunctorType,ValueJoin,WorkTag>
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,blockDim.y)) {
if(
Impl::cuda_inter_block_reduction<FunctorType,ValueJoin,WorkTag>
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,blockDim.y)
//This breaks a test
// Kokkos::Impl::CudaReductionsFunctor<FunctorType,WorkTag,false,true>::scalar_inter_block_reduction(ReducerConditional::select(m_functor , m_reducer) , blockIdx.x , gridDim.x ,
// kokkos_impl_cuda_shared_memory<size_type>() , m_scratch_space , m_scratch_flags)
) {
const unsigned id = threadIdx.y*blockDim.x + threadIdx.x;
if(id==0) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
@ -1331,7 +1410,7 @@ public:
{
const int nwork = m_league_size * m_team_size ;
if ( nwork ) {
const int block_count = UseShflReduction? std::min( m_league_size , size_type(1024) )
const int block_count = UseShflReduction? std::min( m_league_size , size_type(1024*32) )
:std::min( m_league_size , m_team_size );

m_scratch_space = cuda_internal_scratch_space( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) * block_count );
@ -1344,16 +1423,18 @@ public:

CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem_size_total ); // copy to device and execute

Cuda::fence();
if(!m_result_ptr_device_accessible) {
Cuda::fence();

if ( m_result_ptr ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,CudaSpace>( m_result_ptr, m_scratch_space, size );
if ( m_result_ptr ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,CudaSpace>( m_result_ptr, m_scratch_space, size );
}
}
}
}
@ -1364,16 +1445,17 @@ public:
}
}

template< class HostViewType >
template< class ViewType >
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const HostViewType & arg_result
, const ViewType & arg_result
, typename std::enable_if<
Kokkos::is_view< HostViewType >::value
Kokkos::is_view< ViewType >::value
,void*>::type = NULL)
: m_functor( arg_functor )
, m_reducer( InvalidType() )
, m_result_ptr( arg_result.data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -1383,17 +1465,17 @@ public:
, m_scratch_ptr{NULL,NULL}
, m_scratch_size{
arg_policy.scratch_size(0,( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
), arg_policy.scratch_size(1,( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
)}
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
@ -1430,9 +1512,7 @@ public:
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too much L0 scratch memory"));
}

if ( unsigned(m_team_size) >
unsigned(Kokkos::Impl::cuda_get_max_block_size< ParallelReduce >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
if ( int(m_team_size) > arg_policy.team_size_max(m_functor,ParallelReduceTag()) ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too large team size."));
}

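The replaced guard shows the new pattern for querying launch limits: ask the policy itself, with a tag selecting the parallel pattern, instead of calling the CUDA occupancy helpers directly. A hedged sketch of the same query from user code (the functor and sizes are illustrative):

  #include <Kokkos_Core.hpp>

  struct SumTeams {
    typedef double value_type ;
    KOKKOS_INLINE_FUNCTION
    void operator()( const Kokkos::TeamPolicy<>::member_type & , double & update ) const
    { update += 1.0 ; }
  };

  void run( const int league_size ) {
    Kokkos::TeamPolicy<> policy( league_size , 1 );
    // Largest team size with which this functor can run as a parallel_reduce:
    const int max_team = policy.team_size_max( SumTeams() , Kokkos::ParallelReduceTag() );

    double total = 0 ;
    Kokkos::parallel_reduce( Kokkos::TeamPolicy<>( league_size , max_team ) , SumTeams() , total );
  }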
@ -1444,6 +1524,7 @@ public:
: m_functor( arg_functor )
, m_reducer( reducer )
, m_result_ptr( reducer.view().data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -1453,7 +1534,7 @@ public:
, m_scratch_ptr{NULL,NULL}
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
@ -1486,10 +1567,7 @@ public:
CudaTraits::SharedMemoryCapacity < shmem_size_total ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > bad team size"));
}

if ( int(m_team_size) >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelReduce >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
if ( int(m_team_size) > arg_policy.team_size_max(m_functor,ParallelReduceTag()) ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too large team size."));
}

@ -1753,7 +1831,7 @@ public:
// Occupancy calculator assumes whole block.

m_team_size =
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >
( arg_functor
, arg_policy.vector_length()
, arg_policy.team_scratch_size(0)
@ -1970,7 +2048,9 @@ private:
const WorkRange range( m_policy , blockIdx.x , gridDim.x );

for ( typename Policy::member_type iwork_base = range.begin(); iwork_base < range.end() ; iwork_base += blockDim.y ) {

#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned MASK=KOKKOS_IMPL_CUDA_ACTIVEMASK;
#endif
const typename Policy::member_type iwork = iwork_base + threadIdx.y ;

__syncthreads(); // Don't overwrite previous iteration values until they are used
@ -1981,7 +2061,11 @@ private:
for ( unsigned i = threadIdx.y ; i < word_count.value ; ++i ) {
shared_data[i + word_count.value] = shared_data[i] = shared_accum[i] ;
}

#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(MASK);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); } // Protect against large scan values.

// Call functor to accumulate inclusive scan value for this work item
@ -2189,6 +2273,9 @@ private:
const WorkRange range( m_policy , blockIdx.x , gridDim.x );

for ( typename Policy::member_type iwork_base = range.begin(); iwork_base < range.end() ; iwork_base += blockDim.y ) {
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned MASK=KOKKOS_IMPL_CUDA_ACTIVEMASK;
#endif

const typename Policy::member_type iwork = iwork_base + threadIdx.y ;

@ -2201,6 +2288,11 @@ private:
shared_data[i + word_count.value] = shared_data[i] = shared_accum[i] ;
}

#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(MASK);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); } // Protect against large scan values.

// Call functor to accumulate inclusive scan value for this work item

@ -194,8 +194,9 @@ void cuda_shfl_up( T & out , T const & in , int delta ,
*/

template< class ValueType , class JoinOp>
__device__
inline void cuda_intra_warp_reduction( ValueType& result,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_intra_warp_reduction( ValueType& result,
const JoinOp& join,
const uint32_t max_active_thread = blockDim.y) {

@ -214,8 +215,9 @@ inline void cuda_intra_warp_reduction( ValueType& result,
}

template< class ValueType , class JoinOp>
__device__
inline void cuda_inter_warp_reduction( ValueType& value,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_inter_warp_reduction( ValueType& value,
const JoinOp& join,
const int max_active_thread = blockDim.y) {

@ -247,8 +249,9 @@ inline void cuda_inter_warp_reduction( ValueType& value,
}

template< class ValueType , class JoinOp>
__device__
inline void cuda_intra_block_reduction( ValueType& value,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_intra_block_reduction( ValueType& value,
const JoinOp& join,
const int max_active_thread = blockDim.y) {
cuda_intra_warp_reduction(value,join,max_active_thread);
@ -314,31 +317,52 @@ bool cuda_inter_block_reduction( typename FunctorValueTraits< FunctorType , ArgT
if( id + 1 < int(gridDim.x) )
join(value, tmp);
}
int active = KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned int mask = KOKKOS_IMPL_CUDA_ACTIVEMASK;
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
}
}
//The last block has in its thread=0 the global reduction value through "value"
@ -478,31 +502,52 @@ cuda_inter_block_reduction( const ReducerType& reducer,
if( id + 1 < int(gridDim.x) )
reducer.join(value, tmp);
}
int active = KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned int mask = KOKKOS_IMPL_CUDA_ACTIVEMASK;
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
}
}

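Both overloads of cuda_inter_block_reduction now capture an explicit lane mask before the masked ballots. The reason is CUDA 9's independent thread scheduling: lanes of a warp are no longer guaranteed to run in lockstep, so sync/ballot intrinsics must be told which lanes participate. A standalone sketch of the idiom in plain CUDA, outside Kokkos (the function name is illustrative):

  __device__ inline int count_active_lanes()
  {
  #if ( CUDA_VERSION >= 9000 )
    // CUDA 9+: capture the set of currently converged lanes first, then pass
    // that mask to the _sync variant of the intrinsic.
    const unsigned mask = __activemask();
    return __popc( __ballot_sync( mask , 1 ) );
  #else
    // Pre-CUDA 9: warps execute in lockstep and the legacy intrinsic is unmasked.
    return __popc( __ballot( 1 ) );
  #endif
  }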
@ -513,6 +558,213 @@ cuda_inter_block_reduction( const ReducerType& reducer,
#endif
}

template<class FunctorType, class ArgTag, bool DoScan, bool UseShfl>
struct CudaReductionsFunctor;

template<class FunctorType, class ArgTag>
struct CudaReductionsFunctor<FunctorType, ArgTag, false, true> {
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type Scalar;

__device__
static inline void scalar_intra_warp_reduction(
const FunctorType& functor,
Scalar value, // Contribution
const bool skip_vector, // Skip threads if Kokkos vector lanes are not part of the reduction
const int width, // How much of the warp participates
Scalar& result)
{
unsigned mask = width==32?0xffffffff:((1<<width)-1)<<((threadIdx.y*blockDim.x+threadIdx.x)%(32/width))*width;
for(int delta=skip_vector?blockDim.x:1; delta<width; delta*=2) {
Scalar tmp;
cuda_shfl_down(tmp,value,delta,width,mask);
ValueJoin::join( functor , &value, &tmp);
}

cuda_shfl(result,value,0,width,mask);
}


__device__
static inline void scalar_intra_block_reduction(
const FunctorType& functor,
Scalar value,
const bool skip,
Scalar* my_global_team_buffer_element,
const int shared_elements,
Scalar* shared_team_buffer_element) {

const int warp_id = (threadIdx.y*blockDim.x)/32;
Scalar* const my_shared_team_buffer_element =
shared_team_buffer_element + warp_id%shared_elements;

// Warp Level Reduction, ignoring Kokkos vector entries
scalar_intra_warp_reduction(functor,value,skip,32,value);

if(warp_id<shared_elements) {
*my_shared_team_buffer_element=value;
}
// Wait for every warp to be done before using one warp to do final cross warp reduction
__syncthreads();

const int num_warps = blockDim.x*blockDim.y/32;
for(int w = shared_elements; w<num_warps; w+=shared_elements) {
if(warp_id>=w && warp_id<w+shared_elements) {
if((threadIdx.y*blockDim.x + threadIdx.x)%32==0)
ValueJoin::join( functor , my_shared_team_buffer_element, &value);
}
__syncthreads();
}


if( warp_id == 0) {
ValueInit::init( functor , &value );
for(unsigned int i=threadIdx.y*blockDim.x+threadIdx.x; i<blockDim.y*blockDim.x/32; i+=32)
ValueJoin::join( functor , &value,&shared_team_buffer_element[i]);
scalar_intra_warp_reduction(functor,value,false,32,*my_global_team_buffer_element);
}
}

__device__
static inline bool scalar_inter_block_reduction(
const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags ) {
Scalar* const global_team_buffer_element = ((Scalar*) global_data);
Scalar* const my_global_team_buffer_element = global_team_buffer_element + blockIdx.x;
Scalar* shared_team_buffer_elements = ((Scalar*) shared_data);
Scalar value = shared_team_buffer_elements[threadIdx.y];
int shared_elements=blockDim.x*blockDim.y/32;
int global_elements=block_count;
__syncthreads();

scalar_intra_block_reduction(functor,value,true,my_global_team_buffer_element,shared_elements,shared_team_buffer_elements);
__syncthreads();
unsigned int num_teams_done = 0;
if(threadIdx.x + threadIdx.y == 0) {
__threadfence();
num_teams_done = Kokkos::atomic_fetch_add(global_flags,1)+1;
}
bool is_last_block = false;
if(__syncthreads_or(num_teams_done == gridDim.x)) {
is_last_block=true;
*global_flags = 0;
ValueInit::init( functor, &value);
for(int i=threadIdx.y*blockDim.x+threadIdx.x; i<global_elements; i+=blockDim.x*blockDim.y) {
ValueJoin::join( functor , &value,&global_team_buffer_element[i]);
}
scalar_intra_block_reduction(functor,value,false,shared_team_buffer_elements+(blockDim.y-1),shared_elements,shared_team_buffer_elements);
}
return is_last_block;
}
};

template<class FunctorType, class ArgTag>
struct CudaReductionsFunctor<FunctorType, ArgTag, false, false> {
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type Scalar;

__device__
static inline void scalar_intra_warp_reduction(
const FunctorType& functor,
Scalar* value, // Contribution
const bool skip_vector, // Skip threads if Kokkos vector lanes are not part of the reduction
const int width) // How much of the warp participates
{
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned mask = width==32?0xffffffff:((1<<width)-1)<<((threadIdx.y*blockDim.x+threadIdx.x)%(32/width))*width;
#endif
const int lane_id = (threadIdx.y*blockDim.x+threadIdx.x)%32;
for(int delta=skip_vector?blockDim.x:1; delta<width; delta*=2) {
if(lane_id + delta<32) {
ValueJoin::join( functor , value, value+delta);
}
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(mask);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
}
*value=*(value-lane_id);
}


__device__
static inline void scalar_intra_block_reduction(
const FunctorType& functor,
Scalar value,
const bool skip,
Scalar* result,
const int shared_elements,
Scalar* shared_team_buffer_element) {

const int warp_id = (threadIdx.y*blockDim.x)/32;
Scalar* const my_shared_team_buffer_element =
shared_team_buffer_element + threadIdx.y*blockDim.x+threadIdx.x;
*my_shared_team_buffer_element = value;
// Warp Level Reduction, ignoring Kokkos vector entries
scalar_intra_warp_reduction(functor,my_shared_team_buffer_element,skip,32);
// Wait for every warp to be done before using one warp to do final cross warp reduction
__syncthreads();

if( warp_id == 0) {
const unsigned int delta = (threadIdx.y*blockDim.x+threadIdx.x)*32;
if(delta<blockDim.x*blockDim.y)
*my_shared_team_buffer_element = shared_team_buffer_element[delta];
KOKKOS_IMPL_CUDA_SYNCWARP;
scalar_intra_warp_reduction(functor,my_shared_team_buffer_element,false,blockDim.x*blockDim.y/32);
if(threadIdx.x + threadIdx.y == 0) *result = *shared_team_buffer_element;
}
}

__device__
static inline bool scalar_inter_block_reduction(
const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags ) {
Scalar* const global_team_buffer_element = ((Scalar*) global_data);
Scalar* const my_global_team_buffer_element = global_team_buffer_element + blockIdx.x;
Scalar* shared_team_buffer_elements = ((Scalar*) shared_data);
Scalar value = shared_team_buffer_elements[threadIdx.y];
int shared_elements=blockDim.x*blockDim.y/32;
int global_elements=block_count;
__syncthreads();

scalar_intra_block_reduction(functor,value,true,my_global_team_buffer_element,shared_elements,shared_team_buffer_elements);
__syncthreads();

unsigned int num_teams_done = 0;
if(threadIdx.x + threadIdx.y == 0) {
__threadfence();
num_teams_done = Kokkos::atomic_fetch_add(global_flags,1)+1;
}
bool is_last_block = false;
if(__syncthreads_or(num_teams_done == gridDim.x)) {
is_last_block=true;
*global_flags = 0;
ValueInit::init( functor, &value);
for(int i=threadIdx.y*blockDim.x+threadIdx.x; i<global_elements; i+=blockDim.x*blockDim.y) {
ValueJoin::join( functor , &value,&global_team_buffer_element[i]);
}
scalar_intra_block_reduction(functor,value,false,shared_team_buffer_elements+(blockDim.y-1),shared_elements,shared_team_buffer_elements);
}
return is_last_block;
}
};
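Both specializations end their scalar_inter_block_reduction with the same termination protocol: every block publishes a partial result, atomically increments a flag, and the one block that observes the final count reduces all partials. A stripped-down sketch of that protocol in plain CUDA (sum of doubles; all names are illustrative, not from this commit):

  __device__ bool finish_grid_sum( double block_sum , double * global_buf ,
                                   unsigned int * flag , double * out )
  {
    if ( threadIdx.x == 0 ) global_buf[ blockIdx.x ] = block_sum ;
    __threadfence();  // make this block's partial result globally visible

    unsigned int done = 0 ;
    if ( threadIdx.x == 0 ) done = atomicAdd( flag , 1u ) + 1 ;

    // True in exactly one block: the one whose increment saw the final count.
    if ( __syncthreads_or( done == gridDim.x ) ) {
      if ( threadIdx.x == 0 ) {
        *flag = 0 ;  // reset so the buffer can be reused by the next launch
        double total = 0 ;
        for ( unsigned int i = 0 ; i < gridDim.x ; ++i ) total += global_buf[i] ;
        *out = total ;
      }
      return true ;
    }
    return false ;
  }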
//----------------------------------------------------------------------------
// See section B.17 of Cuda C Programming Guide Version 3.2
// for discussion of
@ -639,14 +891,15 @@ void cuda_intra_block_reduce_scan( const FunctorType & functor ,
*
* Global reduce result is in the last threads' 'shared_data' location.
*/

template< bool DoScan , class FunctorType , class ArgTag >
__device__
bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags )
bool cuda_single_inter_block_reduce_scan2( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags )
{
typedef Cuda::size_type size_type ;
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
@ -655,7 +908,6 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;

typedef typename ValueTraits::pointer_type pointer_type ;
//typedef typename ValueTraits::reference_type reference_type ;

// '__ffs' = position of the least significant bit set to 1.
// 'blockDim.y' is guaranteed to be a power of two so this
@ -678,12 +930,7 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
size_type * const shared = shared_data + word_count.value * BlockSizeMask ;
size_type * const global = global_data + word_count.value * block_id ;

//#if (__CUDA_ARCH__ < 500)
for ( int i = int(threadIdx.y) ; i < int(word_count.value) ; i += int(blockDim.y) ) { global[i] = shared[i] ; }
//#else
// for ( size_type i = 0 ; i < word_count.value ; i += 1 ) { global[i] = shared[i] ; }
//#endif

}

// Contributing blocks note that their contribution has been completed via an atomic-increment flag
@ -725,6 +972,22 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
return is_last_block ;
}

template< bool DoScan , class FunctorType , class ArgTag >
__device__
bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags )
{
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
if(!DoScan && ValueTraits::StaticValueSize)
return Kokkos::Impl::CudaReductionsFunctor<FunctorType,ArgTag,false,(ValueTraits::StaticValueSize>16)>::scalar_inter_block_reduction(functor,block_id,block_count,shared_data,global_data,global_flags);
else
return cuda_single_inter_block_reduce_scan2<DoScan, FunctorType, ArgTag>(functor, block_id, block_count, shared_data, global_data, global_flags);
}

// Size in bytes required for inter block reduce or scan
template< bool DoScan , class FunctorType , class ArgTag >
inline

@ -160,7 +160,7 @@ public:

template<class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast( ValueType & val, const int& thread_id) const
void team_broadcast( ValueType & val, const int& thread_id ) const
{
#ifdef __CUDA_ARCH__
if ( 1 == blockDim.z ) { // team == block
@ -178,6 +178,29 @@ public:
}
#endif
}

template<class Closure, class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast( Closure const & f, ValueType & val, const int& thread_id ) const
{
#ifdef __CUDA_ARCH__
f( val );

if ( 1 == blockDim.z ) { // team == block
__syncthreads();
// Wait for shared data write until all threads arrive here
if ( threadIdx.x == 0u && threadIdx.y == (uint32_t)thread_id ) {
*((ValueType*) m_team_reduce) = val ;
}
__syncthreads(); // Wait for shared data read until root thread writes
val = *((ValueType*) m_team_reduce);
}
else { // team <= warp
ValueType tmp( val ); // input might not be a register variable
cuda_shfl( val, tmp, blockDim.x * thread_id, blockDim.x * blockDim.y );
}
#endif
}

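The added overload runs a closure on the value before the broadcast, so a transform and a team-wide broadcast become one call. A hedged usage sketch inside a team kernel (assuming the compiler in use accepts a plain device-side lambda here; names are illustrative):

  #include <Kokkos_Core.hpp>

  void broadcast_example( const int nteams , const int team_size ) {
    typedef Kokkos::TeamPolicy<>::member_type member_type ;
    Kokkos::parallel_for( Kokkos::TeamPolicy<>( nteams , team_size ) ,
      KOKKOS_LAMBDA( const member_type & team ) {
        double val = double( team.team_rank() ) ;
        // Every thread applies the closure to its own copy; the copy owned by
        // thread 0 is then broadcast to the whole team.
        team.team_broadcast( [&]( double & x ) { x *= 2.0 ; } , val , 0 );
        // All members now hold thread 0's transformed value, i.e. 0.0 here.
      } );
  }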
//--------------------------------------------------------------------------
/**\brief Reduction across a team
@ -200,92 +223,7 @@ public:
team_reduce( ReducerType const & reducer ) const noexcept
{
#ifdef __CUDA_ARCH__

typedef typename ReducerType::value_type value_type ;

value_type tmp( reducer.reference() );

// reduce within the warp using shuffle

const int wx =
( threadIdx.x + blockDim.x * threadIdx.y ) & CudaTraits::WarpIndexMask ;

for ( int i = CudaTraits::WarpSize ; (int)blockDim.x <= ( i >>= 1 ) ; ) {

cuda_shfl_down( reducer.reference() , tmp , i , CudaTraits::WarpSize );

// Root of each vector lane reduces:
if ( 0 == threadIdx.x && wx < i ) {
reducer.join( tmp , reducer.reference() );
}
}

if ( 1 < blockDim.z ) { // team <= warp
// broadcast result from root vector lane of root thread

cuda_shfl( reducer.reference() , tmp
, blockDim.x * threadIdx.y , CudaTraits::WarpSize );

}
else { // team == block
// Reduce across warps using shared memory
// Broadcast result within block

// Number of warps, blockDim.y may not be power of two:
const int nw = ( blockDim.x * blockDim.y + CudaTraits::WarpIndexMask ) >> CudaTraits::WarpIndexShift ;

// Warp index:
const int wy = ( blockDim.x * threadIdx.y ) >> CudaTraits::WarpIndexShift ;

// Number of shared memory entries for the reduction:
int nsh = m_team_reduce_size / sizeof(value_type);

// Using at most one entry per warp:
if ( nw < nsh ) nsh = nw ;

__syncthreads(); // Wait before shared data write

if ( 0 == wx && wy < nsh ) {
((value_type*) m_team_reduce)[wy] = tmp ;
}

// When more warps than shared entries:
for ( int i = nsh ; i < nw ; i += nsh ) {

__syncthreads();

if ( 0 == wx && i <= wy ) {
const int k = wy - i ;
if ( k < nsh ) {
reducer.join( *((value_type*) m_team_reduce + k) , tmp );
}
}
}

__syncthreads();

// One warp performs the inter-warp reduction:

if ( 0 == wy ) {

// Start at power of two covering nsh

for ( int i = 1 << ( 32 - __clz(nsh-1) ) ; ( i >>= 1 ) ; ) {
const int k = wx + i ;
if ( wx < i && k < nsh ) {
reducer.join( ((value_type*)m_team_reduce)[wx]
, ((value_type*)m_team_reduce)[k] );
__threadfence_block();
}
}
}

__syncthreads(); // Wait for reduction

// Broadcast result to all threads
reducer.reference() = *((value_type*)m_team_reduce);
}

cuda_intra_block_reduction(reducer,blockDim.y);
#endif /* #ifdef __CUDA_ARCH__ */
}

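team_reduce now delegates to the shared cuda_intra_block_reduction path shown earlier instead of keeping its own shuffle-plus-shared-memory implementation. The user-facing call it serves is unchanged; a hedged sketch (reducer and sizes illustrative):

  #include <Kokkos_Core.hpp>

  void team_reduce_example( const int nteams , const int team_size ) {
    typedef Kokkos::TeamPolicy<>::member_type member_type ;
    Kokkos::parallel_for( Kokkos::TeamPolicy<>( nteams , team_size ) ,
      KOKKOS_LAMBDA( const member_type & team ) {
        double my_value = double( team.team_rank() ) ;
        Kokkos::Max<double> reducer( my_value ) ;
        // Every member contributes; afterwards my_value holds the team maximum.
        team.team_reduce( reducer );
      } );
  }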
@ -801,7 +739,11 @@ void parallel_for
; i += blockDim.x ) {
closure(i);
}
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}

@ -970,7 +912,11 @@ KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0) lambda();
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}

@ -979,7 +925,11 @@ KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0 && threadIdx.y == 0) lambda();
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}


@ -2,9 +2,11 @@

#if defined( __CUDA_ARCH__ )
#if ( CUDA_VERSION < 9000 )
#define KOKKOS_IMPL_CUDA_ACTIVEMASK 0
#define KOKKOS_IMPL_CUDA_SYNCWARP __threadfence_block()
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(x) __threadfence_block()
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK __threadfence_block()
#define KOKKOS_IMPL_CUDA_BALLOT(x) __ballot(x)
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(x) __ballot(x)
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) __shfl(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) __shfl(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) __shfl_up(x,y,z)
@ -12,9 +14,11 @@
#define KOKKOS_IMPL_CUDA_SHFL_DOWN(x,y,z) __shfl_down(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) __shfl_down(x,y,z)
#else
#define KOKKOS_IMPL_CUDA_ACTIVEMASK __activemask()
#define KOKKOS_IMPL_CUDA_SYNCWARP __syncwarp(0xffffffff)
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(m) __syncwarp(m)
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(m) __syncwarp(m);
#define KOKKOS_IMPL_CUDA_BALLOT(x) __ballot_sync(__activemask(),x)
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(m,x) __ballot_sync(m,x)
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) __shfl_sync(0xffffffff,x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) __shfl_sync(m,x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) __shfl_up_sync(0xffffffff,x,y,z)
@ -23,11 +27,16 @@
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) __shfl_down_sync(m,x,y,z)
#endif
#else
#define KOKKOS_IMPL_CUDA_ACTIVEMASK 0
#define KOKKOS_IMPL_CUDA_SYNCWARP
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK
#define KOKKOS_IMPL_CUDA_BALLOT(x) 0
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(x) 0
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_DOWN(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) 0
#endif

#if ( CUDA_VERSION >= 9000 ) && (!defined(KOKKOS_COMPILER_CLANG))

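This compatibility header is what lets a single source tree span the CUDA 9 warp-synchronization changes: on older toolkits the masked macros degrade to their legacy counterparts, and in host compilation passes they collapse to no-ops. A hedged sketch of a device helper written against these macros (the helper itself is illustrative, not part of this commit):

  __device__ inline int warp_sum_into_lane0( int value )
  {
  #ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
    const unsigned mask = KOKKOS_IMPL_CUDA_ACTIVEMASK ;  // capture participants once
  #endif
    for ( int delta = 16 ; delta > 0 ; delta >>= 1 ) {
      value += KOKKOS_IMPL_CUDA_SHFL_DOWN( value , delta , 32 );
  #ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
      KOKKOS_IMPL_CUDA_SYNCWARP_MASK(mask);
  #else
      KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
  #endif
    }
    return value ;  // lane 0 ends up holding the sum of all 32 lanes
  }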
@ -279,6 +279,8 @@ public:
KOKKOS_INLINE_FUNCTION
static handle_type assign( value_type * arg_data_ptr, track_type const & arg_tracker )
{
if(arg_data_ptr == NULL) return handle_type();

#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
// Assignment of texture = non-texture requires creation of a texture object
// which can only occur on the host. In addition, 'get_record' is only valid
@ -292,8 +294,7 @@ public:

#if ! defined( KOKKOS_ENABLE_CUDA_LDG_INTRINSIC )
if ( 0 == r ) {
//Kokkos::abort("Cuda const random access View using Cuda texture memory requires Kokkos to allocate the View's memory");
return handle_type();
Kokkos::abort("Cuda const random access View using Cuda texture memory requires Kokkos to allocate the View's memory");
}
#endif


@ -46,6 +46,8 @@

#include <initializer_list>

#include <Kokkos_Layout.hpp>

#include<impl/KokkosExp_Host_IterateTile.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Parallel.hpp>
@ -63,13 +65,15 @@
namespace Kokkos {

// ------------------------------------------------------------------ //

// Moved to Kokkos_Layout.hpp for more general accessibility
/*
enum class Iterate
{
Default, // Default for the device
Left, // Left indices stride fastest
Right, // Right indices stride fastest
};
*/

template <typename ExecSpace>
struct default_outer_direction

@ -45,11 +45,13 @@
#define KOKKOS_ARRAY_HPP

#include <Kokkos_Macros.hpp>
#include <impl/Kokkos_Error.hpp>

#include <type_traits>
#include <algorithm>
#include <limits>
#include <cstddef>
#include <string>

namespace Kokkos {

@ -132,6 +134,7 @@ public:

KOKKOS_INLINE_FUNCTION static constexpr size_type size() { return N ; }
KOKKOS_INLINE_FUNCTION static constexpr bool empty(){ return false ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return N ; }

template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -160,7 +163,7 @@ public:
return & m_internal_implementation_private_member_data[0];
}

#ifdef KOKKOS_ROCM_CLANG_WORKAROUND
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
// Do not default unless move and move-assignment are also defined
KOKKOS_INLINE_FUNCTION
~Array() = default ;
@ -197,6 +200,7 @@ public:

KOKKOS_INLINE_FUNCTION static constexpr size_type size() { return 0 ; }
KOKKOS_INLINE_FUNCTION static constexpr bool empty() { return true ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return 0 ; }

template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -261,6 +265,7 @@ public:

KOKKOS_INLINE_FUNCTION constexpr size_type size() const { return m_size ; }
KOKKOS_INLINE_FUNCTION constexpr bool empty() const { return 0 != m_size ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return m_size ; }

template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -336,6 +341,7 @@ public:

KOKKOS_INLINE_FUNCTION constexpr size_type size() const { return m_size ; }
KOKKOS_INLINE_FUNCTION constexpr bool empty() const { return 0 != m_size ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return m_size ; }

template< typename iType >
KOKKOS_INLINE_FUNCTION

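With max_size() added (changelog item #1760), Kokkos::Array now carries the size/empty/max_size triple of std::array and all three are usable in device code. A hedged sketch (the kernel body is illustrative):

  #include <Kokkos_Core.hpp>
  #include <Kokkos_Array.hpp>

  void array_example() {
    Kokkos::parallel_for( 16 , KOKKOS_LAMBDA( const int i ) {
      Kokkos::Array< double , 4 > a ;
      // For the fixed-size specialization max_size() equals size() equals N:
      for ( size_t j = 0 ; j < a.max_size() ; ++j ) a[j] = double( i ) + double( j ) ;
    } );
  }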
@ -105,7 +105,10 @@ namespace Kokkos {
template< typename T > struct is_ ## CONCEPT { \
private: \
template< typename , typename = std::true_type > struct have : std::false_type {}; \
template< typename U > struct have<U,typename std::is_same<U,typename U:: CONCEPT >::type> : std::true_type {}; \
template< typename U > struct have<U,typename std::is_same< \
typename std::remove_cv<U>::type, \
typename std::remove_cv<typename U:: CONCEPT>::type \
>::type> : std::true_type {}; \
public: \
enum { value = is_ ## CONCEPT::template have<T>::value }; \
};

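The extra remove_cv makes the concept traits also recognize const-qualified types. A self-contained sketch of the same detection idiom (the trait and test type below are illustrative, not Kokkos code):

  #include <type_traits>

  // Detect whether T has a member typedef 'memory_space' that names T itself,
  // ignoring cv-qualification -- the shape the macro above expands to.
  template< typename T > struct is_memory_space_like {
  private:
    template< typename , typename = std::true_type > struct have : std::false_type {};
    template< typename U > struct have< U , typename std::is_same<
        typename std::remove_cv<U>::type ,
        typename std::remove_cv< typename U::memory_space >::type >::type >
      : std::true_type {};
  public:
    enum { value = have<T>::value };
  };

  struct FakeSpace { typedef FakeSpace memory_space ; };

  static_assert(   is_memory_space_like< FakeSpace >::value , "detected" );
  static_assert(   is_memory_space_like< const FakeSpace >::value , "const detected too" );
  static_assert( ! is_memory_space_like< int >::value , "int has no memory_space" );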
@ -453,8 +453,9 @@ template<class ViewTypeA,class ViewTypeB, class Layout, class ExecSpace,typename
struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,2,iType,KOKKOS_IMPL_COMPILING_LIBRARY> {
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<2,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<2,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -475,7 +476,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,3,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<3,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<3,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -496,7 +499,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,4,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<4,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<4,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -519,7 +524,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,5,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<5,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<5,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -542,7 +549,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,6,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -566,7 +575,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,7,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -590,7 +601,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,8,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -642,7 +655,9 @@ void view_copy(const DstType& dst, const SrcType& src) {
int64_t strides[DstType::Rank+1];
dst.stride(strides);
Kokkos::Iterate iterate;
if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutRight>::value ) {
if ( Kokkos::is_layouttiled<typename DstType::array_layout>::value ) {
iterate = Kokkos::layout_iterate_type_selector<typename DstType::array_layout>::outer_iteration_pattern;
} else if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutRight>::value ) {
iterate = Kokkos::Iterate::Right;
} else if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutLeft>::value ) {
iterate = Kokkos::Iterate::Left;
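The ViewCopy functors and view_copy above now take their traversal order from layout_iterate_type_selector, which feeds the Rank<...> arguments of an MDRangePolicy. A hedged sketch of the same mechanism written by hand for a 2-D copy (names illustrative):

  #include <Kokkos_Core.hpp>

  void copy_2d( const Kokkos::View<double**> & dst ,
                const Kokkos::View<const double**> & src ) {
    // Right/Right iteration matches LayoutRight data: the second index is fastest.
    typedef Kokkos::MDRangePolicy<
        Kokkos::Rank< 2 , Kokkos::Iterate::Right , Kokkos::Iterate::Right > > policy_type ;
    Kokkos::parallel_for( policy_type( { 0 , 0 } ,
                                       { (long) dst.extent(0) , (long) dst.extent(1) } ) ,
      KOKKOS_LAMBDA( const int i , const int j ) { dst(i,j) = src(i,j) ; } );
  }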
@ -1243,9 +1258,9 @@ void deep_copy
ViewTypeFlat;

ViewTypeFlat dst_flat(dst.data(),dst.size());
if(dst.span() < std::numeric_limits<int>::max())
if(dst.span() < std::numeric_limits<int>::max()) {
Kokkos::Impl::ViewFill< ViewTypeFlat , Kokkos::LayoutRight, typename ViewType::execution_space, ViewTypeFlat::Rank, int >( dst_flat , value );
else
} else
Kokkos::Impl::ViewFill< ViewTypeFlat , Kokkos::LayoutRight, typename ViewType::execution_space, ViewTypeFlat::Rank, int64_t >( dst_flat , value );
Kokkos::fence();
return;
@ -1397,7 +1412,6 @@ void deep_copy
enum { SrcExecCanAccessDst =
Kokkos::Impl::SpaceAccessibility< src_execution_space , dst_memory_space >::accessible };


// Checking for Overlapping Views.
dst_value_type* dst_start = dst.data();
dst_value_type* dst_end = dst.data() + dst.span();
@ -1493,7 +1507,7 @@ void deep_copy
Kokkos::fence();
} else {
Kokkos::fence();
Impl::view_copy(typename dst_type::uniform_runtime_nomemspace_type(dst),typename src_type::uniform_runtime_const_nomemspace_type(src));
Impl::view_copy(dst, src);
Kokkos::fence();
}
}
@ -1739,8 +1753,7 @@ void deep_copy
exec_space.fence();
} else {
exec_space.fence();
Impl::view_copy(typename dst_type::uniform_runtime_nomemspace_type(dst),
typename src_type::uniform_runtime_const_nomemspace_type(src));
Impl::view_copy(dst, src);
exec_space.fence();
}
}
@ -1917,4 +1930,213 @@ void realloc( Kokkos::View<T,P...> & v ,
}
} /* namespace Kokkos */

//----------------------------------------------------------------------------
//----------------------------------------------------------------------------

namespace Kokkos {
namespace Impl {

// Deduce Mirror Types
template<class Space, class T, class ... P>
struct MirrorViewType {
// The incoming view_type
typedef typename Kokkos::View<T,P...> src_view_type;
// The memory space for the mirror view
typedef typename Space::memory_space memory_space;
// Check whether it is the same memory space
enum { is_same_memspace = std::is_same<memory_space,typename src_view_type::memory_space>::value };
// The array_layout
typedef typename src_view_type::array_layout array_layout;
// The data type (we probably want it non-const since otherwise we can't even deep_copy to it).
typedef typename src_view_type::non_const_data_type data_type;
// The destination view type if it is not the same memory space
typedef Kokkos::View<data_type,array_layout,Space> dest_view_type;
// If it is the same memory_space return the existing view_type
// This will also keep the unmanaged trait if necessary
typedef typename std::conditional<is_same_memspace,src_view_type,dest_view_type>::type view_type;
};

template<class Space, class T, class ... P>
struct MirrorType {
// The incoming view_type
typedef typename Kokkos::View<T,P...> src_view_type;
// The memory space for the mirror view
typedef typename Space::memory_space memory_space;
// Check whether it is the same memory space
enum { is_same_memspace = std::is_same<memory_space,typename src_view_type::memory_space>::value };
// The array_layout
typedef typename src_view_type::array_layout array_layout;
// The data type (we probably want it non-const since otherwise we can't even deep_copy to it).
|
||||
typedef typename src_view_type::non_const_data_type data_type;
|
||||
// The destination view type if it is not the same memory space
|
||||
typedef Kokkos::View<data_type,array_layout,Space> view_type;
|
||||
};
|
||||
|
||||
}
|
||||
|
||||
template< class T , class ... P >
|
||||
inline
|
||||
typename Kokkos::View<T,P...>::HostMirror
|
||||
create_mirror( const Kokkos::View<T,P...> & src
|
||||
, typename std::enable_if<
|
||||
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
|
||||
! std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
|
||||
, Kokkos::LayoutStride >::value
|
||||
>::type * = 0
|
||||
)
|
||||
{
|
||||
typedef View<T,P...> src_type ;
|
||||
typedef typename src_type::HostMirror dst_type ;
|
||||
|
||||
return dst_type( std::string( src.label() ).append("_mirror")
|
||||
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
|
||||
, src.extent(0)
|
||||
, src.extent(1)
|
||||
, src.extent(2)
|
||||
, src.extent(3)
|
||||
, src.extent(4)
|
||||
, src.extent(5)
|
||||
, src.extent(6)
|
||||
, src.extent(7) );
|
||||
#else
|
||||
, src.rank_dynamic > 0 ? src.extent(0): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 1 ? src.extent(1): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 2 ? src.extent(2): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 3 ? src.extent(3): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 4 ? src.extent(4): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 5 ? src.extent(5): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 6 ? src.extent(6): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 7 ? src.extent(7): KOKKOS_IMPL_CTOR_DEFAULT_ARG );
|
||||
#endif
|
||||
}
|
||||
|
||||
template< class T , class ... P >
|
||||
inline
|
||||
typename Kokkos::View<T,P...>::HostMirror
|
||||
create_mirror( const Kokkos::View<T,P...> & src
|
||||
, typename std::enable_if<
|
||||
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
|
||||
std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
|
||||
, Kokkos::LayoutStride >::value
|
||||
>::type * = 0
|
||||
)
|
||||
{
|
||||
typedef View<T,P...> src_type ;
|
||||
typedef typename src_type::HostMirror dst_type ;
|
||||
|
||||
Kokkos::LayoutStride layout ;
|
||||
|
||||
layout.dimension[0] = src.extent(0);
|
||||
layout.dimension[1] = src.extent(1);
|
||||
layout.dimension[2] = src.extent(2);
|
||||
layout.dimension[3] = src.extent(3);
|
||||
layout.dimension[4] = src.extent(4);
|
||||
layout.dimension[5] = src.extent(5);
|
||||
layout.dimension[6] = src.extent(6);
|
||||
layout.dimension[7] = src.extent(7);
|
||||
|
||||
layout.stride[0] = src.stride_0();
|
||||
layout.stride[1] = src.stride_1();
|
||||
layout.stride[2] = src.stride_2();
|
||||
layout.stride[3] = src.stride_3();
layout.stride[4] = src.stride_4();
layout.stride[5] = src.stride_5();
layout.stride[6] = src.stride_6();
layout.stride[7] = src.stride_7();

return dst_type( std::string( src.label() ).append("_mirror") , layout );
}

// Create a mirror in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorType<Space,T,P ...>::view_type
create_mirror(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value
>::type * = 0) {
return typename Impl::MirrorType<Space,T,P ...>::view_type(src.label(),src.layout());
}

template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror_view( const Kokkos::View<T,P...> & src
, typename std::enable_if<(
std::is_same< typename Kokkos::View<T,P...>::memory_space
, typename Kokkos::View<T,P...>::HostMirror::memory_space
>::value
&&
std::is_same< typename Kokkos::View<T,P...>::data_type
, typename Kokkos::View<T,P...>::HostMirror::data_type
>::value
)>::type * = 0
)
{
return src ;
}

template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror_view( const Kokkos::View<T,P...> & src
, typename std::enable_if< ! (
std::is_same< typename Kokkos::View<T,P...>::memory_space
, typename Kokkos::View<T,P...>::HostMirror::memory_space
>::value
&&
std::is_same< typename Kokkos::View<T,P...>::data_type
, typename Kokkos::View<T,P...>::HostMirror::data_type
>::value
)>::type * = 0
)
{
return Kokkos::create_mirror( src );
}

// Create a mirror view in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
return src;
}

// Create a mirror view in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
return typename Impl::MirrorViewType<Space,T,P ...>::view_type(src.label(),src.layout());
}

// Create a mirror view and deep_copy in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
(void)name;
return src;
}

// Create a mirror view and deep_copy in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
using Mirror = typename Impl::MirrorViewType<Space,T,P ...>::view_type;
std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror(ViewAllocateWithoutInitializing(label), src.layout());
deep_copy(mirror, src);
return mirror;
}

} /* namespace Kokkos */
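A minimal usage sketch of the create_mirror_view_and_copy overloads above (the view names and the CudaSpace choice are hypothetical; any device memory space behaves the same way):

    // Sketch: mirror a device view to host and copy its contents.
    Kokkos::View<double*, Kokkos::CudaSpace> d_data( "d_data" , 100 );
    // Same memory space: d_data itself is returned. Different space: an
    // uninitialized mirror is allocated, labeled, deep_copied, then returned.
    auto h_data = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace() , d_data );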
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------

#endif

@ -57,6 +57,10 @@

namespace Kokkos {

struct ParallelForTag {};
struct ParallelScanTag {};
struct ParallelReduceTag {};

struct ChunkSize {
int value;
ChunkSize(int value_):value(value_) {}
@ -320,6 +324,10 @@ public:

template< class FunctorType >
static int team_size_recommended( const FunctorType & , const int&);

template<class FunctorType>
int team_size_recommended( const FunctorType & functor , const int vector_length);

//----------------------------------------
/** \brief Construct policy with the given instance of the execution space */
TeamPolicyInternal( const typename traits::execution_space & , int league_size_request , int team_size_request , int vector_length_request = 1 );

@ -76,6 +76,8 @@ struct LayoutLeft {

size_t dimension[ ARRAY_LAYOUT_MAX_RANK ];

enum { is_extent_constructible = true };

LayoutLeft( LayoutLeft const & ) = default ;
LayoutLeft( LayoutLeft && ) = default ;
LayoutLeft & operator = ( LayoutLeft const & ) = default ;
@ -108,6 +110,8 @@ struct LayoutRight {

size_t dimension[ ARRAY_LAYOUT_MAX_RANK ];

enum { is_extent_constructible = true };

LayoutRight( LayoutRight const & ) = default ;
LayoutRight( LayoutRight && ) = default ;
LayoutRight & operator = ( LayoutRight const & ) = default ;
@ -132,6 +136,8 @@ struct LayoutStride {
size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;
size_t stride[ ARRAY_LAYOUT_MAX_RANK ] ;

enum { is_extent_constructible = false };

LayoutStride( LayoutStride const & ) = default ;
LayoutStride( LayoutStride && ) = default ;
LayoutStride & operator = ( LayoutStride const & ) = default ;
@ -222,6 +228,8 @@ struct LayoutTileLeft {

size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;

enum { is_extent_constructible = true };

LayoutTileLeft( LayoutTileLeft const & ) = default ;
LayoutTileLeft( LayoutTileLeft && ) = default ;
LayoutTileLeft & operator = ( LayoutTileLeft const & ) = default ;
@ -235,6 +243,144 @@ struct LayoutTileLeft {
: dimension { argN0 , argN1 , argN2 , argN3 , argN4 , argN5 , argN6 , argN7 } {}
};

//////////////////////////////////////////////////////////////////////////////////////

enum class Iterate
{
Default,
Left, // Left indices stride fastest
Right // Right indices stride fastest
};

// To check for LayoutTiled
// This is to hide extra compile-time 'identifier' info within the LayoutTiled class by not relying on template specialization to include the ArgN*'s
template < typename LayoutTiledCheck, class Enable = void >
struct is_layouttiled : std::false_type {};

#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
template < typename LayoutTiledCheck >
struct is_layouttiled< LayoutTiledCheck, typename std::enable_if<LayoutTiledCheck::is_array_layout_tiled>::type > : std::true_type {};

namespace Experimental {

/// LayoutTiled
// Must have Rank >= 2
template < Kokkos::Iterate OuterP, Kokkos::Iterate InnerP,
unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 = 0, unsigned ArgN3 = 0, unsigned ArgN4 = 0, unsigned ArgN5 = 0, unsigned ArgN6 = 0, unsigned ArgN7 = 0,
bool IsPowerOfTwo =
( Impl::is_integral_power_of_two(ArgN0) &&
Impl::is_integral_power_of_two(ArgN1) &&
(Impl::is_integral_power_of_two(ArgN2) || (ArgN2 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN3) || (ArgN3 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN4) || (ArgN4 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN5) || (ArgN5 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN6) || (ArgN6 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN7) || (ArgN7 == 0) )
)
>
struct LayoutTiled {

static_assert( IsPowerOfTwo
, "LayoutTiled must be given power-of-two tile dimensions" );

#if 0
static_assert( (Impl::is_integral_power_of_two(ArgN0) ) &&
(Impl::is_integral_power_of_two(ArgN1) ) &&
(Impl::is_integral_power_of_two(ArgN2) || (ArgN2 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN3) || (ArgN3 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN4) || (ArgN4 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN5) || (ArgN5 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN6) || (ArgN6 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN7) || (ArgN7 == 0) )
, "LayoutTiled must be given power-of-two tile dimensions" );
#endif

typedef LayoutTiled<OuterP, InnerP, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, IsPowerOfTwo> array_layout ;
static constexpr Iterate outer_pattern = OuterP;
static constexpr Iterate inner_pattern = InnerP;

enum { N0 = ArgN0 };
enum { N1 = ArgN1 };
enum { N2 = ArgN2 };
enum { N3 = ArgN3 };
enum { N4 = ArgN4 };
enum { N5 = ArgN5 };
enum { N6 = ArgN6 };
enum { N7 = ArgN7 };

size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;

enum { is_extent_constructible = true };

LayoutTiled( LayoutTiled const & ) = default ;
LayoutTiled( LayoutTiled && ) = default ;
LayoutTiled & operator = ( LayoutTiled const & ) = default ;
LayoutTiled & operator = ( LayoutTiled && ) = default ;

KOKKOS_INLINE_FUNCTION
explicit constexpr
LayoutTiled( size_t argN0 = 0 , size_t argN1 = 0 , size_t argN2 = 0 , size_t argN3 = 0
, size_t argN4 = 0 , size_t argN5 = 0 , size_t argN6 = 0 , size_t argN7 = 0
)
: dimension { argN0 , argN1 , argN2 , argN3 , argN4 , argN5 , argN6 , argN7 } {}
};

} // namespace Experimental
#endif
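A sketch of how the new LayoutTiled and is_layouttiled pieces fit together (the alias name is hypothetical, and this assumes LayoutTiled exposes the is_array_layout_tiled member the specialization above keys on; tile extents must be powers of two or the static_assert fires):

    // Rank-2 tiled layout, 4x4 tiles, Left iteration both outside and inside:
    using TiledLayout = Kokkos::Experimental::LayoutTiled<
        Kokkos::Iterate::Left , Kokkos::Iterate::Left , 4 , 4 >;
    static_assert( Kokkos::is_layouttiled< TiledLayout >::value ,
                   "detected via the is_array_layout_tiled member" );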
// For use with view_copy
template < typename ... Layout >
struct layout_iterate_type_selector {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Default ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Default ;
};

template <>
struct layout_iterate_type_selector< Kokkos::LayoutRight > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};

template <>
struct layout_iterate_type_selector< Kokkos::LayoutLeft > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};

template <>
struct layout_iterate_type_selector< Kokkos::LayoutStride > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Default ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Default ;
};

#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Left, Kokkos::Iterate::Left, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};

template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Right, Kokkos::Iterate::Left, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};

template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Left, Kokkos::Iterate::Right, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};

template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Right, Kokkos::Iterate::Right, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};
#endif

} // namespace Kokkos

#endif // #ifndef KOKKOS_LAYOUT_HPP

@ -153,7 +153,7 @@
#else
#define KOKKOS_LAMBDA [=]__host__ __device__

#if defined( KOKKOS_ENABLE_CXX1Z )
#if defined( KOKKOS_ENABLE_CXX17 ) || defined( KOKKOS_ENABLE_CXX20 )
#define KOKKOS_CLASS_LAMBDA [=,*this] __host__ __device__
#endif
#endif
@ -213,7 +213,7 @@
#define KOKKOS_LAMBDA [=]
#endif

#if defined( KOKKOS_ENABLE_CXX1Z ) && !defined( KOKKOS_CLASS_LAMBDA )
#if (defined( KOKKOS_ENABLE_CXX17 ) || defined( KOKKOS_ENABLE_CXX20) )&& !defined( KOKKOS_CLASS_LAMBDA )
#define KOKKOS_CLASS_LAMBDA [=,*this]
#endif
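A short sketch of why the CXX17/CXX20 guard matters: [=,*this] is a C++17 capture that copies the enclosing object into the lambda, so device code never dereferences a host-side this pointer. The functor below is hypothetical:

    struct Scaler {
      double factor = 2.0;
      void apply( int n ) const {
        // KOKKOS_CLASS_LAMBDA expands to [=,*this] (plus __host__ __device__
        // under NVCC), so 'factor' is read from the captured copy of *this.
        Kokkos::parallel_for( n , KOKKOS_CLASS_LAMBDA ( const int i ) {
          (void)( i * factor );
        });
      }
    };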
@ -521,6 +521,9 @@
#if defined ( KOKKOS_ENABLE_CUDA )
#if ( 9000 <= CUDA_VERSION )
#define KOKKOS_IMPL_CUDA_VERSION_9_WORKAROUND
#if ( __CUDA_ARCH__ )
#define KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
#endif
#endif
#endif

@ -793,7 +793,7 @@ struct ParallelReduceReturnValue<typename std::enable_if<

static return_type return_value(ReturnType& return_val,
const FunctorType& functor) {
#ifdef KOKOOS_ENABLE_DEPRECATED_CODE
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return return_type(return_val,functor.value_count);
#else
if ( is_array<ReturnType>::value )
@ -1002,7 +1002,8 @@ void parallel_reduce(const std::string& label,
typename Impl::enable_if<
Kokkos::Impl::is_execution_policy<PolicyType>::value
>::type * = 0) {
Impl::ParallelReduceAdaptor<PolicyType,FunctorType,const ReturnType>::execute(label,policy,functor,return_value);
ReturnType return_value_impl = return_value;
Impl::ParallelReduceAdaptor<PolicyType,FunctorType,ReturnType>::execute(label,policy,functor,return_value_impl);
}
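The hunks below add a static_assert demanding either a result argument or a functor with a final() member. A minimal functor satisfying the final-function path might look like this (a sketch, not part of the commit):

    struct SumWithFinal {
      using value_type = double;
      KOKKOS_INLINE_FUNCTION
      void operator()( const int i , double & update ) const { update += i ; }
      // final() receives the fully reduced value; here it is simply consumed.
      KOKKOS_INLINE_FUNCTION
      void final( double & /*update*/ ) const {}
    };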
template< class PolicyType, class FunctorType, class ReturnType >
@ -1054,6 +1055,9 @@ void parallel_reduce(const std::string& label,
, typename ValueTraits::pointer_type
>::type value_type ;

static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,PolicyType,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");

typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1076,6 +1080,9 @@ void parallel_reduce(const PolicyType& policy,
, typename ValueTraits::pointer_type
>::type value_type ;

static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,PolicyType,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");

typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1096,6 +1103,9 @@ void parallel_reduce(const size_t& policy,
, typename ValueTraits::pointer_type
>::type value_type ;

static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,RangePolicy<>,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");

typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1117,6 +1127,9 @@ void parallel_reduce(const std::string& label,
, typename ValueTraits::pointer_type
>::type value_type ;

static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,RangePolicy<>,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");

typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged

@ -136,6 +136,55 @@ public:
}
}

KOKKOS_INLINE_FUNCTION
void* get_shmem_aligned (const ptrdiff_t size, const ptrdiff_t alignment, int level = -1) const {
if(level == -1)
level = m_default_level;
if(level == 0) {

char* previous = m_iter_L0;
const ptrdiff_t missalign = size_t(m_iter_L0)%alignment;
if(missalign) m_iter_L0 += alignment-missalign;

void* tmp = m_iter_L0 + m_offset * size;
if (m_end_L0 < (m_iter_L0 += size * m_multiplier)) {
m_iter_L0 = previous; // put it back like it was
#ifdef KOKKOS_DEBUG
// mfh 23 Jun 2015: printf call consumes 25 registers
// in a CUDA build, so only print in debug mode. The
// function still returns NULL if not enough memory.
printf ("ScratchMemorySpace<...>::get_shmem: Failed to allocate "
"%ld byte(s); remaining capacity is %ld byte(s)\n", long(size),
long(m_end_L0-m_iter_L0));
#endif // KOKKOS_DEBUG
tmp = 0;
}
return tmp;
} else {

char* previous = m_iter_L1;
const ptrdiff_t missalign = size_t(m_iter_L1)%alignment;
if(missalign) m_iter_L1 += alignment-missalign;

void* tmp = m_iter_L1 + m_offset * size;
if (m_end_L1 < (m_iter_L1 += size * m_multiplier)) {
m_iter_L1 = previous; // put it back like it was
#ifdef KOKKOS_DEBUG
// mfh 23 Jun 2015: printf call consumes 25 registers
// in a CUDA build, so only print in debug mode. The
// function still returns NULL if not enough memory.
printf ("ScratchMemorySpace<...>::get_shmem: Failed to allocate "
"%ld byte(s); remaining capacity is %ld byte(s)\n", long(size),
long(m_end_L1-m_iter_L1));
#endif // KOKKOS_DEBUG
tmp = 0;
}
return tmp;

}
}
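The alignment arithmetic above bumps the scratch cursor to the next multiple of the requested alignment. The same computation as a standalone sketch:

    #include <cstdint>
    #include <cstddef>
    // Advance p to the next multiple of alignment (no-op if already aligned).
    inline char * align_up( char * p , std::ptrdiff_t alignment ) {
      const std::ptrdiff_t misalign =
        static_cast<std::ptrdiff_t>( reinterpret_cast<std::uintptr_t>(p) % alignment );
      return misalign ? p + ( alignment - misalign ) : p;
    }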
template< typename IntType >
KOKKOS_INLINE_FUNCTION
ScratchMemorySpace( void * ptr_L0 , const IntType & size_L0 , void * ptr_L1 = NULL , const IntType & size_L1 = 0)

@ -262,7 +262,7 @@ public:
}

//----------------------------------------

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
static
int team_size_max( const FunctorType & ) { return 1 ; }
@ -274,6 +274,16 @@ public:
template< class FunctorType >
static
int team_size_recommended( const FunctorType & , const int& ) { return 1 ; }
#endif

template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const { return 1 ; }
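Callers query these per-pattern limits by passing the new tag types, e.g. (a usage sketch; 'policy' and 'functor' are hypothetical):

    const int max_for    = policy.team_size_max( functor , Kokkos::ParallelForTag() );
    const int rec_reduce = policy.team_size_recommended( functor , Kokkos::ParallelReduceTag() );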
//----------------------------------------

@ -281,6 +291,16 @@ public:
inline int league_size() const { return m_league_size ; }
inline size_t scratch_size(const int& level, int = 0) const { return m_team_scratch_size[level] + m_thread_scratch_size[level]; }

inline static
int vector_length_max()
{ return 1024; } // Use arbitrary large number, is meant as a vectorizable length

inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32:
20*1024*1024);
}
/** \brief Specify league size, request team size */
TeamPolicyInternal( execution_space &
, int league_size_request

@ -624,7 +624,6 @@ public:
when_all( Future< A1 , A2 > const arg[] , int narg )
{
using future_type = Future< execution_space > ;
using task_base = Kokkos::Impl::TaskBase< void , void , void > ;

future_type f ;

@ -692,7 +691,6 @@ public:
{
using input_type = decltype( func(0) );
using future_type = Future< execution_space > ;
using task_base = Kokkos::Impl::TaskBase< void , void , void > ;

static_assert( is_future< input_type >::value
, "Functor must return a Kokkos::Future" );

File diff suppressed because it is too large
@ -16,6 +16,7 @@ endif

CXXFLAGS ?= -O3
LINK ?= $(CXX)
LDFLAGS ?=
CP = cp

include $(KOKKOS_PATH)/Makefile.kokkos
include $(KOKKOS_PATH)/core/src/Makefile.generate_header_lists
@ -50,7 +51,12 @@ ifeq ($(KOKKOS_OS),Linux)

COPY_FLAG = -u
endif
ifeq ($(KOKKOS_OS),Darwin)
COPY_FLAG =
COPY_FLAG =
# If Homebrew coreutils is installed, its cp will have the -u option
ifneq ("$(wildcard /usr/local/opt/coreutils/libexec/gnubin/cp)","")
CP = /usr/local/opt/coreutils/libexec/gnubin/cp
COPY_FLAG = -u
endif
endif

ifeq ($(KOKKOS_DEBUG),"no")
@ -66,36 +72,38 @@ mkdir:
mkdir -p $(PREFIX)/bin
mkdir -p $(PREFIX)/include
mkdir -p $(PREFIX)/lib
mkdir -p $(PREFIX)/lib/pkgconfig
mkdir -p $(PREFIX)/include/impl

copy-cuda: mkdir
mkdir -p $(PREFIX)/include/Cuda
cp $(COPY_FLAG) $(KOKKOS_HEADERS_CUDA) $(PREFIX)/include/Cuda
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_CUDA) $(PREFIX)/include/Cuda

copy-threads: mkdir
mkdir -p $(PREFIX)/include/Threads
cp $(COPY_FLAG) $(KOKKOS_HEADERS_THREADS) $(PREFIX)/include/Threads
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_THREADS) $(PREFIX)/include/Threads

copy-qthreads: mkdir
mkdir -p $(PREFIX)/include/Qthreads
cp $(COPY_FLAG) $(KOKKOS_HEADERS_QTHREADS) $(PREFIX)/include/Qthreads
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_QTHREADS) $(PREFIX)/include/Qthreads

copy-openmp: mkdir
mkdir -p $(PREFIX)/include/OpenMP
cp $(COPY_FLAG) $(KOKKOS_HEADERS_OPENMP) $(PREFIX)/include/OpenMP
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_OPENMP) $(PREFIX)/include/OpenMP

copy-rocm: mkdir
mkdir -p $(PREFIX)/include/ROCm
cp $(COPY_FLAG) $(KOKKOS_HEADERS_ROCM) $(PREFIX)/include/ROCm
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_ROCM) $(PREFIX)/include/ROCm

install: mkdir $(CONDITIONAL_COPIES) build-lib generate_build_settings
cp $(COPY_FLAG) $(NVCC_WRAPPER) $(PREFIX)/bin
cp $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE) $(PREFIX)/include
cp $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE_IMPL) $(PREFIX)/include/impl
cp $(COPY_FLAG) $(KOKKOS_MAKEFILE) $(PREFIX)
cp $(COPY_FLAG) $(KOKKOS_CMAKEFILE) $(PREFIX)
cp $(COPY_FLAG) libkokkos.a $(PREFIX)/lib
cp $(COPY_FLAG) $(KOKKOS_CONFIG_HEADER) $(PREFIX)/include
$(CP) $(COPY_FLAG) $(NVCC_WRAPPER) $(PREFIX)/bin
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE) $(PREFIX)/include
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE_IMPL) $(PREFIX)/include/impl
$(CP) $(COPY_FLAG) $(KOKKOS_MAKEFILE) $(PREFIX)
$(CP) $(COPY_FLAG) $(KOKKOS_CMAKEFILE) $(PREFIX)
$(CP) $(COPY_FLAG) $(KOKKOS_PKGCONFIG) $(PREFIX)/lib/pkgconfig
$(CP) $(COPY_FLAG) libkokkos.a $(PREFIX)/lib
$(CP) $(COPY_FLAG) $(KOKKOS_CONFIG_HEADER) $(PREFIX)/include

clean: kokkos-clean
rm -f $(KOKKOS_MAKEFILE) $(KOKKOS_CMAKEFILE)
rm -f $(KOKKOS_MAKEFILE) $(KOKKOS_CMAKEFILE) $(KOKKOS_PKGCONFIG)

@ -5,6 +5,7 @@
# These files are generated by this makefile
KOKKOS_MAKEFILE=Makefile.kokkos
KOKKOS_CMAKEFILE=kokkos_generated_settings.cmake
KOKKOS_PKGCONFIG=kokkos.pc

ifeq ($(KOKKOS_DEBUG),"no")
KOKKOS_DEBUG_CMAKE = OFF
@ -33,11 +34,29 @@ kokkos_append_var = $(call kokkos_appendvar_makefile,$1); $(call kokkos_appendva
kokkos_append_var2 = $(call kokkos_appendvar2_makefile,$1); $(call kokkos_appendvar_cmakefile,$1,$2)
kokkos_append_varval = $(call kokkos_appendval_makefile,$1,$2); $(call kokkos_appendval_cmakefile,$1,$2,$3)

kokkos_fixup_sed_impl = sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= $(KOKKOS_CONFIG_HEADER)|= $(PREFIX)/include/$(KOKKOS_CONFIG_HEADER)|g' $1 \
> $1.tmp && mv -f $1.tmp $1

$(KOKKOS_PKGCONFIG): $(KOKKOS_PATH)/core/src/$(KOKKOS_PKGCONFIG).in
@sed -e 's|@CMAKE_INSTALL_PREFIX@|$(PREFIX)|g' \
-e 's|@KOKKOS_CXXFLAGS@|$(patsubst -I%,,$(KOKKOS_CXXFLAGS))|g' \
-e 's|@KOKKOS_EXTRA_LIBS_LIST@|$(KOKKOS_EXTRA_LIBS)|g' \
-e 's|@KOKKOS_LINK_FLAGS@|$(KOKKOS_LINK_FLAGS)|g' \
$< > $@

kokkos_fixup_sed = $(call kokkos_fixup_sed_impl,$(KOKKOS_MAKEFILE)); $(call kokkos_fixup_sed_impl,$(KOKKOS_CMAKEFILE))

#This function should be used for variables whose values are different in GNU Make versus CMake,
#especially lists which are delimited by commas in one case and semicolons in another
kokkos_append_gmakevar = $(call kokkos_appendvar_makefile,$1); $(call kokkos_append_gmakevar_cmakefile,$1,$2)

generate_build_settings: $(KOKKOS_CONFIG_HEADER)
generate_build_settings: $(KOKKOS_CONFIG_HEADER) $(KOKKOS_PKGCONFIG)
@rm -f $(KOKKOS_MAKEFILE)
@rm -f $(KOKKOS_CMAKEFILE)
@$(call kokkos_append_string, "#Global Settings used to generate this library")
@ -68,7 +87,6 @@ generate_build_settings: $(KOKKOS_CONFIG_HEADER)
@$(call kokkos_append_var,KOKKOS_HEADERS_ROCM,'STRING "Kokkos headers ROCm list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_THREADS,'STRING "Kokkos headers Threads list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_QTHREADS,'STRING "Kokkos headers QThreads list"')
@$(call kokkos_append_var,KOKKOS_SRC,'STRING "Kokkos source list"')
@$(call kokkos_append_string,"")
@$(call kokkos_append_string,"#Variables used in application Makefiles")
@$(call kokkos_append_var,KOKKOS_OS,'STRING ""') # This was not in original cmake gen
@ -94,19 +112,11 @@ generate_build_settings: $(KOKKOS_CONFIG_HEADER)
@$(call kokkos_append_makefile,"#Fake kokkos-clean target")
@$(call kokkos_append_makefile,"kokkos-clean:")
@$(call kokkos_append_makefile,"")
@sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= $(KOKKOS_CONFIG_HEADER)|= $(PREFIX)/include/$(KOKKOS_CONFIG_HEADER)|g' $(KOKKOS_MAKEFILE) \
> $(KOKKOS_MAKEFILE).tmp
@mv -f $(KOKKOS_MAKEFILE).tmp $(KOKKOS_MAKEFILE)
@$(call kokkos_fixup_sed)
@$(call kokkos_append_var,KOKKOS_SRC,'STRING "Kokkos source list"')
@$(call kokkos_setvar_cmakefile,KOKKOS_CXX_FLAGS,$(KOKKOS_CXXFLAGS))
@$(call kokkos_setvar_cmakefile,KOKKOS_CPP_FLAGS,$(KOKKOS_CPPFLAGS))
@$(call kokkos_setvar_cmakefile,KOKKOS_LD_FLAGS,$(KOKKOS_LDFLAGS))
@$(call kokkos_setlist_cmakefile,KOKKOS_LIBS_LIST,$(KOKKOS_LIBS))
@$(call kokkos_setlist_cmakefile,KOKKOS_EXTRA_LIBS_LIST,$(KOKKOS_EXTRA_LIBS))
@$(call kokkos_setvar_cmakefile,KOKKOS_LINK_FLAGS,$(KOKKOS_LINK_FLAGS))
@ -103,8 +103,6 @@ public:
void TaskQueueSpecialization< Kokkos::OpenMP >::execute
( TaskQueue< Kokkos::OpenMP > * const queue )
{
using execution_space = Kokkos::OpenMP ;
using queue_type = TaskQueue< execution_space > ;
using task_root_type = TaskBase< void , void , void > ;
using Member = Impl::HostThreadTeamMember< execution_space > ;

@ -213,8 +211,6 @@ void TaskQueueSpecialization< Kokkos::OpenMP >::
iff_single_thread_recursive_execute
( TaskQueue< Kokkos::OpenMP > * const queue )
{
using execution_space = Kokkos::OpenMP ;
using queue_type = TaskQueue< execution_space > ;
using task_root_type = TaskBase< void , void , void > ;
using Member = Impl::HostThreadTeamMember< execution_space > ;

@ -76,14 +76,11 @@ public:

//----------------------------------------

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & ) {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
@ -92,6 +89,47 @@ public:
inline static
int team_size_recommended( const FunctorType & )
{
return traits::execution_space::thread_pool_size(2);
}

template< class FunctorType >
inline static
int team_size_recommended( const FunctorType &, const int& )
{
return traits::execution_space::thread_pool_size(2);
}
#endif

template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
@ -99,16 +137,17 @@ public:
#endif
}

template< class FunctorType >

inline static
int team_size_recommended( const FunctorType &, const int& )
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}
int vector_length_max()
{ return 1024; } // Use arbitrary large number, is meant as a vectorizable length

inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32: // Roughly L1 size
20*1024*1024); // Limit to keep compatibility with CUDA
}

//----------------------------------------
@ -160,7 +160,8 @@ SharedAllocationRecord( const Kokkos::Experimental::OpenMPTargetSpace & arg_spac
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);

// Set last element zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
//TODO DeepCopy
// DeepCopy

@ -44,8 +44,8 @@
#ifndef GUARD_CORE_KOKKOS_ROCM_CONFIG_HPP
#define GUARD_CORE_KOKKOS_ROCM_CONFIG_HPP

#ifndef KOKKOS_ROCM_HAS_WORKAROUNDS
#define KOKKOS_ROCM_HAS_WORKAROUNDS 1
#ifndef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
#define KOKKOS_IMPL_ROCM_CLANG_WORKAROUND 1
#endif

#endif

@ -55,14 +55,14 @@ namespace Impl {

struct ROCmTraits {
// TODO: determine if needed
enum { WavefrontSize = 64 /* 64 */ };
enum { WorkgroupSize = 64 /* 64 */ };
enum { WavefrontIndexMask = 0x001f /* Mask for warpindex */ };
enum { WavefrontIndexShift = 5 /* WarpSize == 1 << WarpShift */ };
enum { WavefrontSize = 64 /* 64 */ };
enum { WorkgroupSize = 256 /* 256 */ };
enum { WavefrontIndexMask = 0x003f /* Mask for wavefrontindex */ };
enum { WavefrontIndexShift = 6 /* WavefrontSize == 1 << WavefrontShift */ };

enum { SharedMemoryBanks = 32 /* Compute device 2.0 */ };
enum { SharedMemoryCapacity = 0x0C000 /* 48k shared / 16k L1 Cache */ };
enum { SharedMemoryUsage = 0x04000 /* 16k shared / 48k L1 Cache */ };
enum { SharedMemoryBanks = 64 /* GCN */ };
enum { SharedMemoryCapacity = 0x10000 /* 64k shared / 16k L1 Cache */ };
enum { SharedMemoryUsage = 0x04000 /* 64k shared / 16k L1 Cache */ };

enum { UpperBoundExtentCount = 4294967295 /* Hard upper bound */ };
#if 0
@ -84,6 +84,16 @@ size_t rocm_internal_maximum_workgroup_count();
size_t * rocm_internal_scratch_flags( const size_t size );
size_t * rocm_internal_scratch_space( const size_t size );

// This pointer is the start of dynamic shared memory (LDS).
// Dynamic is at the end of LDS and its size must be specified
// in a tile_block specification at kernel launch time.
template< typename T >
KOKKOS_INLINE_FUNCTION
T * kokkos_impl_rocm_shared_memory()
//{ return (T*) hc::get_group_segment_base_pointer() ; }
{ return (T*) hc::get_dynamic_group_segment_base_pointer() ; }

}
} // namespace Kokkos
#define ROCM_SPACE_ATOMIC_MASK 0x1FFFF
@ -249,7 +259,6 @@ struct ROCmParallelLaunch< DriverType
size_t bx = (grid.x > block.x)? block.x : grid.x;
size_t by = (grid.y > block.y)? block.y : grid.y;
size_t bz = (grid.z > block.z)? block.z : grid.z;

hc::parallel_for_each(ext.tile_with_dynamic(bz,by,bx,shmem), [=](const hc::index<3> & idx) [[hc]]

@ -543,20 +543,13 @@ enum { sizeScratchGrain = sizeof(ScratchGrain) };
void rocmMemset( Kokkos::Experimental::ROCm::size_type * ptr , Kokkos::Experimental::ROCm::size_type value , Kokkos::Experimental::ROCm::size_type size)
{
char * mptr = (char * ) ptr;
#if 0
parallel_for_each(hc::extent<1>(size),
/* parallel_for_each(hc::extent<1>(size),
[=, &ptr]
(hc::index<1> idx) __HC__
{
int i = idx[0];
ptr[i] = value;
}).wait();
#else
for (int i= 0; i<size ; i++)
{
mptr[i] = (char) value;
}
#endif
}).wait();*/
}

Kokkos::Experimental::ROCm::size_type *
@ -567,9 +560,9 @@ ROCmInternal::scratch_flags( const Kokkos::Experimental::ROCm::size_type size )

m_scratchFlagsCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;

typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::HostSpace , void > Record ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace , void > Record ;

Record * const r = Record::allocate( Kokkos::HostSpace()
Record * const r = Record::allocate( Kokkos::Experimental::ROCmSpace()
, "InternalScratchFlags"
, ( sizeScratchGrain * m_scratchFlagsCount ) );

@ -590,9 +583,9 @@ ROCmInternal::scratch_space( const Kokkos::Experimental::ROCm::size_type size )

m_scratchSpaceCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;

typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::HostSpace , void > Record ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace , void > Record ;

Record * const r = Record::allocate( Kokkos::HostSpace()
static Record * const r = Record::allocate( Kokkos::Experimental::ROCmSpace()
, "InternalScratchSpace"
, ( sizeScratchGrain * m_scratchSpaceCount ) );

@ -616,7 +609,7 @@ void ROCmInternal::finalize()
// scratch_lock_array_rocm_space_ptr(false);
// threadid_lock_array_rocm_space_ptr(false);

typedef Kokkos::Impl::SharedAllocationRecord< HostSpace > RecordROCm ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace > RecordROCm ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmHostPinnedSpace > RecordHost ;

RecordROCm::decrement( RecordROCm::get_record( m_scratchFlags ) );

@ -243,6 +243,15 @@ public:
return(max);
}

template< class FunctorType , class PatternTypeTag>
int team_size_max( const FunctorType& functor, PatternTypeTag) {
return 256/vector_length();
}
template< class FunctorType , class PatternTypeTag>
int team_size_recommended( const FunctorType& functor, PatternTypeTag) {
return 128/vector_length();
}

template<class F>
KOKKOS_INLINE_FUNCTION int team_size(const F& f) const { return (m_team_size > 0) ? m_team_size : team_size_recommended(f); }
KOKKOS_INLINE_FUNCTION int team_size() const { return (m_team_size > 0) ? m_team_size : Impl::get_max_tile_thread(); }
@ -261,6 +270,11 @@ public:
return m_thread_scratch_size[level];
}

static int scratch_size_max(int level) {
return level==0 ?
1024*40 : 1024*1204*20;
}

typedef Impl::ROCmTeamMember member_type;
};

@ -487,6 +501,7 @@ public:
#endif
}
m_idx.barrier.wait();
reducer.reference() = buffer[0];
}

/** \brief Intra-team vector reduce
@ -541,19 +556,19 @@ public:
}

template< typename ReducerType >
KOKKOS_INLINE_FUNCTION static
KOKKOS_INLINE_FUNCTION
typename std::enable_if< is_reducer< ReducerType >::value >::type
vector_reduce( ReducerType const & reducer )
vector_reduce( ReducerType const & reducer ) const
{
#ifdef __HCC_ACCELERATOR__
if(blockDim_x == 1) return;
if(m_vector_length == 1) return;

// Intra vector lane shuffle reduction:
typename ReducerType::value_type tmp ( reducer.reference() );

for ( int i = blockDim_x ; ( i >>= 1 ) ; ) {
shfl_down( reducer.reference() , i , blockDim_x );
if ( (int)threadIdx_x < i ) { reducer.join( tmp , reducer.reference() ); }
for ( int i = m_vector_length ; ( i >>= 1 ) ; ) {
reducer.reference() = shfl_down( tmp , i , m_vector_length );
if ( (int)vector_rank() < i ) { reducer.join( tmp , reducer.reference() ); }
}

// Broadcast from root lane to all other lanes.
@ -561,7 +576,7 @@ public:
// because floating point summation is not associative
// and thus different threads could have different results.

shfl( reducer.reference() , 0 , blockDim_x );
reducer.reference() = shfl( tmp , 0 , m_vector_length );
#endif
}

@ -847,7 +862,7 @@ public:

hc::extent< 1 > flat_extent( total_size );

hc::tiled_extent< 1 > team_extent = flat_extent.tile(team_size*vector_length);
hc::tiled_extent< 1 > team_extent = flat_extent.tile(vector_length*team_size);
hc::parallel_for_each( team_extent , [=](hc::tiled_index<1> idx) [[hc]]
{
rocm_invoke<typename Policy::work_tag>(f, typename Policy::member_type(idx, league_size, team_size, shared, shared_size, scratch_size0, scratch, scratch_size1,vector_length));
@ -958,6 +973,176 @@ public:

};
//----------------------------------------------------------------------------

template< class FunctorType , class ReducerType, class... Traits >
class ParallelReduce<
FunctorType , Kokkos::MDRangePolicy< Traits... >, ReducerType, Kokkos::Experimental::ROCm >
{
private:
typedef Kokkos::MDRangePolicy< Traits ... > Policy ;
using RP = Policy;
typedef typename Policy::array_index_type array_index_type;
typedef typename Policy::index_type index_type;
typedef typename Policy::work_tag WorkTag ;
typedef typename Policy::member_type Member ;
typedef typename Policy::launch_bounds LaunchBounds;

typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;

typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTagFwd > ValueJoin ;

public:

typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type value_type ;
typedef typename ValueTraits::reference_type reference_type ;
typedef FunctorType functor_type ;
typedef Kokkos::Experimental::ROCm::size_type size_type ;

// Algorithmic constraints: blockSize is a power of two AND blockDim.y == blockDim.z == 1

const FunctorType m_functor ;
const Policy m_policy ; // used for workrange and nwork
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
value_type * m_scratch_space ;
size_type * m_scratch_flags ;

typedef typename Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, typename Policy::work_tag, reference_type> DeviceIteratePattern;

KOKKOS_INLINE_FUNCTION
void exec_range( reference_type update ) const
{
Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank,Policy,FunctorType,typename Policy::work_tag, reference_type>(m_policy, m_functor, update).exec_range();
}

KOKKOS_INLINE_FUNCTION
void operator()(void) const
{
run();
}

KOKKOS_INLINE_FUNCTION
void run( ) const
{
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(value_type) >
word_count( (ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) )) / sizeof(value_type) );
// pointer to shared data accounts for the reserved space at the start
value_type * const shared = kokkos_impl_rocm_shared_memory<value_type>()
+ 2*sizeof(uint64_t);

{
reference_type value =
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , shared + threadIdx_y * word_count.value );
// Number of blocks is bounded so that the reduction can be limited to two passes.
// Each thread block is given an approximately equal amount of work to perform.
// Accumulate the values for this block.
// The accumulation ordering does not match the final pass, but is arithmetically equivalent.

this->exec_range( value );
}

// Reduce with final value at blockDim.y - 1 location.
// Problem: non power-of-two blockDim

if ( rocm_single_inter_block_reduce_scan<false,ReducerTypeFwd,WorkTagFwd>(
ReducerConditional::select(m_functor , m_reducer) , blockIdx_x ,
gridDim_x , shared , m_scratch_space , m_scratch_flags ) ) {

// This is the final block with the final result at the final threads' location
value_type * const tshared = shared + ( blockDim_y - 1 ) * word_count.value ;
value_type * const global = m_scratch_space ;

if ( threadIdx_y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , tshared );
// for ( unsigned i = 0 ; i < word_count.value ; i+=blockDim_y ) { global[i] = tshared[i]; }
for ( unsigned i = 0 ; i < word_count.value ; i++ ) { global[i] = tshared[i]; }
}
}
}

// Determine block size constrained by shared memory:
static inline
unsigned local_block_size( const FunctorType & f )
{
unsigned n = ROCmTraits::WavefrontSize * 8 ;
while ( n && ROCmTraits::SharedMemoryCapacity < rocm_single_inter_block_reduce_scan_shmem<false,FunctorType,WorkTag>( f , n ) ) { n >>= 1 ; }
return n ;
}

inline
void execute()
{
const int nwork = m_policy.m_num_tiles;
if ( nwork ) {
int block_size = m_policy.m_prod_tile_dims;
// CONSTRAINT: Algorithm requires block_size >= product of tile dimensions
// Nearest power of two
int exponent_pow_two = std::ceil( std::log2((float)block_size) );
block_size = 1<<(exponent_pow_two);

m_scratch_space = (value_type*)rocm_internal_scratch_space( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) * block_size*nwork /* block_size == max block_count */ );
m_scratch_flags = rocm_internal_scratch_flags( sizeof(size_type) );
const dim3 block( 1 , block_size , 1 );
// Required grid.x <= block.y
const dim3 grid( nwork, block_size , 1 );
const int shmem = rocm_single_inter_block_reduce_scan_shmem<false,FunctorType,WorkTag>( m_functor , block.y );

ROCmParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute

ROCM::fence();

if ( m_result_ptr ) {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,Kokkos::Experimental::ROCmSpace>( m_result_ptr , m_scratch_space , size );
}
}
else {
if (m_result_ptr) {
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , m_result_ptr );
}
}
}
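The nearest-power-of-two rounding used in execute() above, isolated as a standalone sketch:

    #include <cmath>
    // Round n up to the next power of two, e.g. 5 -> 8, 8 -> 8.
    inline int next_pow2( int n ) {
      const int e = static_cast<int>( std::ceil( std::log2( static_cast<float>(n) ) ) );
      return 1 << e;
    }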
template< class HostViewType >
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const HostViewType & arg_result
, typename std::enable_if<
Kokkos::is_view< HostViewType >::value
,void*>::type = NULL)
: m_functor( arg_functor )
, m_policy( arg_policy )
, m_reducer( InvalidType() )
, m_result_ptr( arg_result.data() )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
{}

ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const ReducerType & reducer)
: m_functor( arg_functor )
, m_policy( arg_policy )
, m_reducer( reducer )
, m_result_ptr( reducer.view().data() )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
{}

};
//----------------------------------------------------------------------------

template< class FunctorType, class ReducerType, class... Traits >
class ParallelReduce<
FunctorType , Kokkos::TeamPolicy< Traits... >, ReducerType, Kokkos::Experimental::ROCm >
@ -992,8 +1177,14 @@ public:
const int scratch_size0 = policy.scratch_size(0,team_size);
const int scratch_size1 = policy.scratch_size(1,team_size);
const int total_size = league_size * team_size ;

if(total_size == 0) return;

typedef Kokkos::Impl::FunctorValueInit< FunctorType, typename Policy::work_tag > ValueInit ;
if(total_size==0) {
if (result_view.data()) {
ValueInit::init( f , result_view.data() );
}
return;
}

const int reduce_size = ValueTraits::value_size( f );
const int shared_size = FunctorTeamShmemSize< FunctorType >::value( f , team_size );
@ -1042,7 +1233,16 @@ public:
const int vector_length = policy.vector_length();
const int total_size = league_size * team_size;

if(total_size == 0) return;
typedef Kokkos::Impl::FunctorValueInit< ReducerType, typename Policy::work_tag > ValueInit ;
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value,
FunctorType, ReducerType> ReducerConditional;
if(total_size==0) {
if (reducer.view().data()) {
ValueInit::init( ReducerConditional::select(f,reducer),
reducer.view().data() );
}
return;
}

const int reduce_size = ValueTraits::value_size( f );
const int shared_size = FunctorTeamShmemSize< FunctorType >::value( f , team_size );
@ -1113,6 +1313,39 @@ public:
//----------------------------------------
};

template< class FunctorType , class ReturnType , class... Traits >
class ParallelScanWithTotal< FunctorType , Kokkos::RangePolicy< Traits... >,
ReturnType, Kokkos::Experimental::ROCm >
{
private:

typedef Kokkos::RangePolicy< Traits... > Policy;
typedef typename Policy::work_tag Tag;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType, Tag> ValueTraits;

public:

//----------------------------------------

inline
ParallelScanWithTotal( const FunctorType & f
, const Policy & policy
, ReturnType & arg_returnvalue)
{
const auto len = policy.end()-policy.begin();

if(len==0) return;

scan_enqueue<Tag,ReturnType>(len, f, arg_returnvalue, [](hc::tiled_index<1> idx, int, int) { return idx.global[0]; });
}

KOKKOS_INLINE_FUNCTION
void execute() const {}

//----------------------------------------
};

template< class FunctorType , class... Traits>
class ParallelScan< FunctorType , Kokkos::TeamPolicy< Traits... >, Kokkos::Experimental::ROCm >
{
@ -1350,22 +1583,17 @@ void parallel_for(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTe
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
typename std::enable_if< ! Kokkos::is_reducer< ValueType >::value >::type
parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
const Lambda & lambda, ValueType& result) {

result = ValueType();
Kokkos::Sum<ValueType> reducer(result);
reducer.init( reducer.reference() );

for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
lambda(i,reducer.reference());
}
result = loop_boundaries.thread.team_reduce(result,
Impl::JoinAdd<ValueType>());
// Impl::rocm_intra_workgroup_reduction( loop_boundaries.thread, result,
// Impl::JoinAdd<ValueType>());
// Impl::rocm_inter_workgroup_reduction( loop_boundaries.thread, result,
// Impl::JoinAdd<ValueType>());
loop_boundaries.thread.team_reduce(reducer);
}
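With the reducer-based rewrite above, the plain-value overload now funnels through Kokkos::Sum. A caller-side sketch (hypothetical names; 'team' is a team member handle inside a team policy kernel):

    double sum = 0;
    // Plain-value form, internally wrapped in Kokkos::Sum:
    Kokkos::parallel_reduce( Kokkos::TeamThreadRange( team , n ) ,
      [=] ( const int i , double & val ) { val += 1.0 ; } , sum );
    // Equivalent explicit-reducer form:
    Kokkos::parallel_reduce( Kokkos::TeamThreadRange( team , n ) ,
      [=] ( const int i , double & val ) { val += 1.0 ; } , Kokkos::Sum<double>(sum) );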
/** \brief Inter-thread thread range parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
@ -1374,7 +1602,8 @@ void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROC
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
typename std::enable_if< Kokkos::is_reducer< ReducerType >::value >::type
parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
const Lambda & lambda, ReducerType const & reducer) {
reducer.init( reducer.reference() );

@ -1439,7 +1668,8 @@ void parallel_for(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCm
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
typename std::enable_if< !Kokkos::is_reducer< ValueType >::value >::type
parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
loop_boundaries, const Lambda & lambda, ValueType& result) {
result = ValueType();

@ -1477,7 +1707,8 @@ void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::R
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
typename std::enable_if< Kokkos::is_reducer< ReducerType >::value >::type
parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
loop_boundaries, const Lambda & lambda, ReducerType const & reducer) {
reducer.init( reducer.reference() );

@ -1523,86 +1754,46 @@ void parallel_scan(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROC
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef typename ValueTraits::value_type value_type ;

value_type scan_val = value_type();
#if (__ROCM_ARCH__ >= 800)
// adopt the cuda vector shuffle method
const int VectorLength = loop_boundaries.increment;
int lid = loop_boundaries.thread.lindex();
int vector_rank = lid%VectorLength;
value_type val = value_type();
const int vector_length = loop_boundaries.thread.vector_length();
const int vector_rank = loop_boundaries.thread.vector_rank();

iType loop_bound = ((loop_boundaries.end+VectorLength-1)/VectorLength) * VectorLength;
value_type val ;
for(int _i = vector_rank; _i < loop_bound; _i += VectorLength) {
val = value_type();
if(_i<loop_boundaries.end)
lambda(_i , val , false);
iType end = ((loop_boundaries.end+vector_length-1)/vector_length) * vector_length;
value_type accum = value_type();

value_type tmp = val;
value_type result_i;
for ( int i = vector_rank ; i < end ; i += vector_length ) {

if(vector_rank == 0)
result_i = tmp;
if (VectorLength > 1) {
const value_type tmp2 = shfl_up(tmp, 1,VectorLength);
if(vector_rank > 0)
tmp+=tmp2;
}
if(vector_rank == 1)
result_i = tmp;
if (VectorLength > 3) {
const value_type tmp2 = shfl_up(tmp, 2,VectorLength);
if(vector_rank > 1)
tmp+=tmp2;
}
if ((vector_rank >= 2) &&
(vector_rank < 4))
result_i = tmp;
if (VectorLength > 7) {
const value_type tmp2 = shfl_up(tmp, 4,VectorLength);
if(vector_rank > 3)
tmp+=tmp2;
}
if ((vector_rank >= 4) &&
(vector_rank < 8))
result_i = tmp;
if (VectorLength > 15) {
const value_type tmp2 = shfl_up(tmp, 8,VectorLength);
if(vector_rank > 7)
tmp+=tmp2;
}
if ((vector_rank >= 8) &&
(vector_rank < 16))
result_i = tmp;
if (VectorLength > 31) {
const value_type tmp2 = shfl_up(tmp, 16,VectorLength);
if(vector_rank > 15)
tmp+=tmp2;
}
if ((vector_rank >=16) &&
(vector_rank < 32))
result_i = tmp;
if (VectorLength > 63) {
const value_type tmp2 = shfl_up(tmp, 32,VectorLength);
if(vector_rank > 31)
tmp+=tmp2;
value_type val = 0 ;

// First acquire per-lane contributions:
if ( i < loop_boundaries.end ) lambda( i , val , false );

value_type sval = val ;

// Bottom up inclusive scan in triangular pattern
// where each thread is the root of a reduction tree
// from the zeroth "lane" to itself.
// [t] += [t-1] if t >= 1
// [t] += [t-2] if t >= 2
// [t] += [t-4] if t >= 4
// ...

for ( int j = 1 ; j < vector_length ; j <<= 1 ) {
value_type tmp = 0 ;
tmp = shfl_up(sval , j , vector_length );
if ( j <= vector_rank ) { sval += tmp ; }
}

if (vector_rank >= 32)
result_i = tmp;
// Include accumulation and remove value for exclusive scan:
val = accum + sval - val ;

val = scan_val + result_i - val;
scan_val += shfl(tmp,VectorLength-1,VectorLength);
if(_i<loop_boundaries.end)
lambda(_i , val , true);
// Provide exclusive scan value:
if ( i < loop_boundaries.end ) lambda( i , val , true );

// Accumulate the last value in the inclusive scan:
sval = shfl( sval , vector_length-1 , vector_length);
accum += sval ;
}
#else
// for kaveri, call the LDS based thread_scan routine
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,scan_val,true);
}
scan_val = loop_boundaries.thread.team_scan(scan_val);

#endif
}
|
||||
|
||||
} // namespace Kokkos
|
||||
|
||||
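The loop introduced above collapses the old unrolled shfl_up ladder into one triangular pass: at stride j, each lane adds the value held j lanes below it. A minimal host-side sketch of that pattern, with the shuffle replaced by a read from a snapshot array (illustrative only, not the Kokkos API):

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int vector_length = 8;                 // one "vector" of lanes
  std::vector<int> sval = {3, 1, 4, 1, 5, 9, 2, 6};

  // [t] += [t-j] for j = 1, 2, 4, ...; shfl_up becomes an array read.
  for (int j = 1; j < vector_length; j <<= 1) {
    std::vector<int> prev = sval;              // what shfl_up would deliver
    for (int t = 0; t < vector_length; ++t)
      if (j <= t) sval[t] += prev[t - j];
  }

  for (int t = 0; t < vector_length; ++t)
    std::printf("%d ", sval[t]);               // 3 4 8 9 14 23 25 31
  std::printf("\n");
}
```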
@@ -57,7 +57,6 @@
#include <ROCm/Kokkos_ROCm_Tile.hpp>
#include <ROCm/Kokkos_ROCm_Invoke.hpp>
#include <ROCm/Kokkos_ROCm_Join.hpp>

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

namespace Kokkos {
@@ -75,7 +74,7 @@ T& reduce_value(T* x, std::false_type) [[hc]]
return *x;
}

#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
struct always_true
{
template<class... Ts>
@@ -149,7 +148,7 @@ void reduce_enqueue(
// Store the tile result in the global memory.
if (local == 0)
{
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
// Workaround for assigning from LDS memory: std::copy should work
// directly
buffer.action_at(0, [&](T* x)
@@ -158,7 +157,7 @@ void reduce_enqueue(
// new ROCM 15 address space changes aren't implemented in std algorithms yet
auto * src = reinterpret_cast<char *>(x);
auto * dest = reinterpret_cast<char *>(result.data()+tile*output_length);
for(int i=0; i<sizeof(T);i++) dest[i] = src[i];
for(int i=0; i<sizeof(T)*output_length;i++) dest[i] = src[i];
#else
// Workaround: copy_if used to avoid memmove
std::copy_if(x, x+output_length, result.data()+tile*output_length, always_true{} );
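Both branches above work around assigning out of LDS memory: the clang path copies the result byte by byte through char pointers, while the other path funnels the copy through std::copy_if with a predicate that always returns true, so the library never dispatches to memmove. A host-side sketch of that predicate trick (types and values are placeholders, not the Kokkos ones):

```cpp
#include <algorithm>
#include <cstdio>

// A predicate that accepts everything forces element-by-element copying.
struct always_true_pred {
  template <class... Ts>
  bool operator()(Ts&&...) const { return true; }
};

int main() {
  double src[4] = {1.0, 2.0, 3.0, 4.0};
  double dst[4] = {};
  std::copy_if(src, src + 4, dst, always_true_pred{});
  for (double v : dst) std::printf("%g ", v);  // 1 2 3 4
  std::printf("\n");
}
```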
@@ -169,12 +168,10 @@ void reduce_enqueue(

#endif
}

});
if (output_result != nullptr)
ValueInit::init(ReducerConditional::select(f, reducer), output_result);
fut.wait();

copy(result,result_cpu.data());
if (output_result != nullptr) {
for(std::size_t i=0;i<td.num_tiles;i++)
@@ -62,6 +62,76 @@
namespace Kokkos {
namespace Impl {

//#if __KALMAR_ACCELERATOR__ == 1
KOKKOS_INLINE_FUNCTION
void __syncthreads() [[hc]]
{
amp_barrier(CLK_LOCAL_MEM_FENCE);
}

#define LT0 ((threadIdx_x+threadIdx_y+threadIdx_z)?0:1)

// returns non-zero if and only if the predicate is non-zero for any thread
// note that syncthreads_or uses the first 64 bits of dynamic group memory.
// this reserved memory must be accounted for everywhere
// that get_dynamic_group_segment_base_pointer is called.
KOKKOS_INLINE_FUNCTION
uint64_t __syncthreads_or(uint64_t pred)
{
uint64_t *shared_var = (uint64_t *)hc::get_dynamic_group_segment_base_pointer();
if(LT0) *shared_var = 0;
amp_barrier(CLK_LOCAL_MEM_FENCE);
#if __KALMAR_ACCELERATOR__ == 1
if (pred) hc::atomic_or_uint64(shared_var,1);
#endif
amp_barrier(CLK_LOCAL_MEM_FENCE);
return (*shared_var);
}
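__syncthreads_or follows a clear/accumulate/read protocol around two barriers: thread 0 zeroes the shared word, every thread with a non-zero predicate ORs a bit in, and after the second barrier all threads read the same answer. A sequential host sketch of those three phases, where the loop stands in for the workgroup's threads and the barriers fall between the loops (illustrative only):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int team_size = 4;
  uint64_t pred[team_size] = {0, 0, 1, 0};  // per-thread predicates
  uint64_t shared_var = 0;                  // thread 0 clears it; barrier

  // Accumulate phase: hc::atomic_or_uint64 on the device; then a barrier.
  for (int t = 0; t < team_size; ++t)
    if (pred[t]) shared_var |= 1;

  // Read phase: every thread sees the same result.
  std::printf("any predicate set: %llu\n", (unsigned long long)shared_var);
}
```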
KOKKOS_INLINE_FUNCTION
void __threadfence()
{
amp_barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
}

KOKKOS_INLINE_FUNCTION
void __threadfence_block()
{
amp_barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
}
//#endif
struct ROCm_atomic_CAS {
template<class OP>
KOKKOS_INLINE_FUNCTION
unsigned long operator () (volatile unsigned long * dest, OP &&op){
unsigned long read,compare,val;
compare = *dest;
read = compare;
do {
compare = read;
val = op(compare);
#if __KALMAR_ACCELERATOR__ == 1
hc::atomic_compare_exchange((uint64_t *)dest,&read,val);
#endif
} while (read != compare);
return val;
}
};

template<class OP>
KOKKOS_INLINE_FUNCTION
unsigned long atomic_cas_op (volatile unsigned long * dest, OP &&op) {
ROCm_atomic_CAS cas_op;
return cas_op(dest, std::forward<OP>(op));
}

KOKKOS_INLINE_FUNCTION
unsigned long atomicInc (volatile unsigned long * dest, const unsigned long& val) {
return atomic_cas_op(dest, [=](unsigned long old){return ((old>=val)?0:(old+1));});
}

//----------------------------------------------------------------------------

template< typename T >
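ROCm_atomic_CAS retries a compare-exchange until the word was not changed by another thread between the read and the update, and atomicInc builds a CUDA-style wrap-around increment on top of it (back to 0 once `val` is reached, otherwise old+1). A host sketch of the same retry loop using std::atomic in place of hc::atomic_compare_exchange (illustrative, not the Kokkos code; note it mirrors the version above in returning the new value):

```cpp
#include <atomic>
#include <cstdio>

// Retry op(snapshot) until no concurrent writer invalidated the snapshot.
template <class Op>
unsigned long atomic_cas_op_sketch(std::atomic<unsigned long>& dest, Op&& op) {
  unsigned long expected = dest.load();
  unsigned long desired;
  do {
    desired = op(expected);  // compute the update from the snapshot
    // on failure, compare_exchange_weak refreshes `expected` for the retry
  } while (!dest.compare_exchange_weak(expected, desired));
  return desired;
}

int main() {
  std::atomic<unsigned long> counter{41};
  const unsigned long val = 100;  // wrap threshold, as in atomicInc above
  unsigned long r = atomic_cas_op_sketch(
      counter, [=](unsigned long old) { return old >= val ? 0ul : old + 1; });
  std::printf("%lu\n", r);  // prints 42
}
```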
@@ -375,18 +445,7 @@ bool rocm_inter_block_reduction( ROCmTeamMember& team,
#endif
}
#endif
#if 0

//----------------------------------------------------------------------------
// See section B.17 of ROCm C Programming Guide Version 3.2
// for discussion of
// __launch_bounds__(maxThreadsPerBlock,minBlocksPerMultiprocessor)
// function qualifier which could be used to improve performance.
//----------------------------------------------------------------------------
// Maximize shared memory and minimize L1 cache:
//   rocmFuncSetCacheConfig(MyKernel, rocmFuncCachePreferShared );
// For 2.0 capability: 48 KB shared and 16 KB L1
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
/*
 *  Algorithmic constraints:
@@ -406,87 +465,105 @@ void rocm_intra_block_reduce_scan( const FunctorType & functor ,
typedef typename ValueTraits::pointer_type pointer_type ;

const unsigned value_count = ValueTraits::value_count( functor );
const unsigned BlockSizeMask = team.team_size() - 1 ;
const unsigned BlockSizeMask = blockDim_y - 1 ;

// Must have power of two thread count

if ( BlockSizeMask & team.team_size() ) { Kokkos::abort("ROCm::rocm_intra_block_scan requires power-of-two blockDim"); }
if ( BlockSizeMask & blockDim_y ) { Kokkos::abort("ROCm::rocm_intra_block_scan requires power-of-two blockDim"); }

#define BLOCK_REDUCE_STEP( R , TD , S ) \
if ( ! ( R & ((1<<(S+1))-1) ) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S)) ); }
if ( ! (( R & ((1<<(S+1))-1) )|(blockDim_y<(1<<(S+1)))) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S)) ); }

#define BLOCK_SCAN_STEP( TD , N , S ) \
if ( N == (1<<S) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S))); }
#define KOKKOS_IMPL_ROCM_SYNCWF __threadfence_block()

const unsigned rtid_intra = team.team_rank() ^ BlockSizeMask ;
const pointer_type tdata_intra = base_data + value_count * team.team_rank() ;
const unsigned rtid_intra = threadIdx_y ^ BlockSizeMask ;
const pointer_type tdata_intra = base_data + value_count * threadIdx_y ;

{ // Intra-workgroup reduction:
{ // Intra-workgroup reduction: min blocksize of 64
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,0)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,1)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,2)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,3)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,4)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,5)
KOKKOS_IMPL_ROCM_SYNCWF;
}

team.team_barrier(); // Wait for all workgroups to reduce
__syncthreads(); // Wait for all workgroups to reduce

{ // Inter-workgroup reduce-scan by a single workgroup to avoid extra synchronizations
const unsigned rtid_inter = ( team.team_rank() ^ BlockSizeMask ) << ROCmTraits::WarpIndexShift ;
if(threadIdx_y < value_count) {
for(int i=blockDim_y-65; i>0; i-= 64)
ValueJoin::join( functor , base_data + (blockDim_y-1)*value_count + threadIdx_y , base_data + i*value_count + threadIdx_y );
}
__syncthreads();
#if 0
const unsigned rtid_inter = ( threadIdx_y ^ BlockSizeMask ) << ROCmTraits::WavefrontIndexShift ;

if ( rtid_inter < blockDim_y ) {

if ( rtid_inter < team.team_size() ) {

const pointer_type tdata_inter = base_data + value_count * ( rtid_inter ^ BlockSizeMask );
//
// remove these comments
// for rocm, we start with a block size of 64, so step 5 is already done.
// The remaining steps are only done if block size is > 64, so we leave them
// in place until we tune blocksize for performance, then remove the ones
// that will never be used.
// if ( (1<<6) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,6) }
// if ( (1<<7) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,7) }
// if ( (1<<8) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,8) }
// if ( (1<<9) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,9) }

if ( (1<<5) < BlockSizeMask ) { BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,5) }
if ( (1<<6) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,6) }
if ( (1<<7) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,7) }
if ( (1<<8) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,8) }

if ( DoScan ) {

int n = ( rtid_inter &  32 ) ?  32 : (
        ( rtid_inter &  64 ) ?  64 : (
int n = ( rtid_inter &  64 ) ?  64 : (
        ( rtid_inter & 128 ) ? 128 : (
        ( rtid_inter & 256 ) ? 256 : 0 )));
        ( rtid_inter & 256 ) ? 256 : 0 ));

if ( ! ( rtid_inter + n < team.team_size() ) ) n = 0 ;
if ( ! ( rtid_inter + n < blockDim_y ) ) n = 0 ;

__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,8)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,7)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,6)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,5)
// __threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,5)
}
}
#endif
}

team.team_barrier(); // Wait for inter-workgroup reduce-scan to complete
__syncthreads(); // Wait for inter-workgroup reduce-scan to complete

if ( DoScan ) {
int n = ( rtid_intra &  1 ) ?  1 : (
        ( rtid_intra &  2 ) ?  2 : (
        ( rtid_intra &  4 ) ?  4 : (
        ( rtid_intra &  8 ) ?  8 : (
        ( rtid_intra & 16 ) ? 16 : 0 ))));
        ( rtid_intra & 16 ) ? 16 : (
        ( rtid_intra & 32 ) ? 32 : 0 )))));

if ( ! ( rtid_intra + n < team.team_size() ) ) n = 0 ;
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
BLOCK_SCAN_STEP(tdata_intra,n,4) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,3) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,2) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,1) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,0) team.team_barrier();
#else
BLOCK_SCAN_STEP(tdata_intra,n,4) __threadfence_block();
if ( ! ( rtid_intra + n < blockDim_y ) ) n = 0 ;

// BLOCK_SCAN_STEP(tdata_intra,n,5) __threadfence_block();
// BLOCK_SCAN_STEP(tdata_intra,n,4) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,3) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,2) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,1) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,0) __threadfence_block();
#endif
}

#undef BLOCK_SCAN_STEP
#undef BLOCK_REDUCE_STEP
#undef KOKKOS_IMPL_ROCM_SYNCWF
}

//----------------------------------------------------------------------------
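The BLOCK_REDUCE_STEP ladder realizes a binary reduction tree over a power-of-two workgroup: at step S, the threads whose bit-reversed rank has the low S+1 bits clear pull in the partial result 2^S slots below them, so the block total lands in the last slot. A host sketch of that indexing with value_count fixed to 1 and join specialized to addition (illustrative; the device code interleaves fences between the steps):

```cpp
#include <cstdio>

int main() {
  const int block_dim = 8;                  // power of two, as asserted above
  int data[block_dim] = {1, 2, 3, 4, 5, 6, 7, 8};

  for (int S = 0; (1 << S) < block_dim; ++S) {
    for (int t = 0; t < block_dim; ++t) {
      const int rtid = t ^ (block_dim - 1);  // reversed rank, as in the code
      if (!(rtid & ((1 << (S + 1)) - 1)))
        data[t] += data[t - (1 << S)];       // join with the slot 2^S below
      // a real workgroup needs a fence between steps (KOKKOS_IMPL_ROCM_SYNCWF)
    }
  }
  std::printf("total = %d\n", data[block_dim - 1]);  // prints total = 36
}
```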
@@ -497,16 +574,18 @@ void rocm_intra_block_reduce_scan( const FunctorType & functor ,
 *
 *  Global reduce result is in the last threads' 'shared_data' location.
 */
using ROCM = Kokkos::Experimental::ROCm ;

template< bool DoScan , class FunctorType , class ArgTag >
KOKKOS_INLINE_FUNCTION
bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
const ROCm::size_type   block_id ,
const ROCm::size_type   block_count ,
ROCm::size_type * const shared_data ,
ROCm::size_type * const global_data ,
ROCm::size_type * const global_flags )
const ROCM::size_type   block_id ,
const ROCM::size_type   block_count ,
typename FunctorValueTraits<FunctorType, ArgTag>::value_type * const shared_data ,
typename FunctorValueTraits<FunctorType, ArgTag>::value_type * const global_data ,
ROCM::size_type * const global_flags )
{
typedef ROCm::size_type size_type ;
typedef ROCM::size_type size_type ;
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
@@ -517,16 +596,17 @@ bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
typedef typename ValueTraits::value_type value_type ;

// '__ffs' = position of the least significant bit set to 1.
// 'team.team_size()' is guaranteed to be a power of two so this
// blockDim_y is guaranteed to be a power of two so this
// is the integral shift value that can replace an integral divide.
const unsigned BlockSizeShift = __ffs( team.team_size() ) - 1 ;
const unsigned BlockSizeMask = team.team_size() - 1 ;
// const unsigned long BlockSizeShift = __ffs( blockDim_y ) - 1 ;
const unsigned long BlockSizeShift = __lastbit_u32_u32( blockDim_y ) ;
const unsigned long BlockSizeMask = blockDim_y - 1 ;

// Must have power of two thread count
if ( BlockSizeMask & team.team_size() ) { Kokkos::abort("ROCm::rocm_single_inter_block_reduce_scan requires power-of-two blockDim"); }
if ( BlockSizeMask & blockDim_y ) { Kokkos::abort("ROCm::rocm_single_inter_block_reduce_scan requires power-of-two blockDim"); }

const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( functor ) / sizeof(size_type) );
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(value_type) >
word_count( ValueTraits::value_size( functor )/ sizeof(value_type) );

// Reduce the accumulation for the entire block.
rocm_intra_block_reduce_scan<false,FunctorType,ArgTag>( functor , pointer_type(shared_data) );
@@ -534,54 +614,47 @@ bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
{
// Write accumulation total to global scratch space.
// Accumulation total is the last thread's data.
size_type * const shared = shared_data + word_count.value * BlockSizeMask ;
size_type * const global = global_data + word_count.value * block_id ;

#if (__ROCM_ARCH__ < 500)
for ( size_type i = team.team_rank() ; i < word_count.value ; i += team.team_size() ) { global[i] = shared[i] ; }
#else
for ( size_type i = 0 ; i < word_count.value ; i += 1 ) { global[i] = shared[i] ; }
#endif
value_type * const shared = shared_data +
word_count.value * BlockSizeMask ;
value_type * const global = global_data + word_count.value * block_id ;

for ( int i = int(threadIdx_y) ; i < word_count.value ; i += blockDim_y ) { global[i] = shared[i] ; }
}

// Contributing blocks note that their contribution has been completed via an atomic-increment flag
// If this block is not the last block to contribute to this group then the block is done.
team.team_barrier();

const bool is_last_block =
! team.team_reduce( team.team_rank() ? 0 : ( 1 + atomicInc( global_flags , block_count - 1 ) < block_count ) ,Impl::JoinAdd<ValueType>());

! __syncthreads_or( threadIdx_y ? 0 : ( 1 + atomicInc( global_flags , block_count - 1 ) < block_count ) );
if ( is_last_block ) {

const size_type b = ( long(block_count) * long(team.team_rank()) ) >> BlockSizeShift ;
const size_type e = ( long(block_count) * long( team.team_rank() + 1 ) ) >> BlockSizeShift ;
const size_type b = ( long(block_count) * long(threadIdx_y )) >> BlockSizeShift ;
const size_type e = ( long(block_count) * long(threadIdx_y + 1 ) ) >> BlockSizeShift ;

{
void * const shared_ptr = shared_data + word_count.value * team.team_rank() ;
reference_type shared_value = ValueInit::init( functor , shared_ptr );
value_type * const shared_ptr = shared_data + word_count.value * threadIdx_y ;
ValueInit::init( functor , shared_ptr );

for ( size_type i = b ; i < e ; ++i ) {
ValueJoin::join( functor , shared_ptr , global_data + word_count.value * i );
}
}

rocm_intra_block_reduce_scan<DoScan,FunctorType,ArgTag>( functor , pointer_type(shared_data) );

if ( DoScan ) {
value_type * const shared_value = shared_data + word_count.value * ( threadIdx_y ? threadIdx_y - 1 : blockDim_y );

size_type * const shared_value = shared_data + word_count.value * ( team.team_rank() ? team.team_rank() - 1 : team.team_size() );

if ( ! team.team_rank() ) { ValueInit::init( functor , shared_value ); }
if ( ! threadIdx_y ) { ValueInit::init( functor , shared_value ); }

// Join previous inclusive scan value to each member
for ( size_type i = b ; i < e ; ++i ) {
size_type * const global_value = global_data + word_count.value * i ;
value_type * const global_value = global_data + word_count.value * i ;
ValueJoin::join( functor , shared_value , global_value );
ValueOps ::copy( functor , global_value , shared_value );
}
}
}

return is_last_block ;
}
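The is_last_block test above swaps a team_reduce for __syncthreads_or: thread 0 of every block draws a ticket from global_flags, and only the block whose ticket completes the count proceeds to fold all per-block partials together. A plain C++ sketch of that ticket idea, with fetch_add standing in for the wrap-around atomicInc (names are illustrative):

```cpp
#include <atomic>
#include <cstdio>

std::atomic<unsigned> global_flags{0};

// Returns true only for the last of `block_count` arrivals.
bool last_to_arrive(unsigned block_count) {
  unsigned ticket = global_flags.fetch_add(1);  // 0-based arrival order
  return ticket + 1 == block_count;
}

int main() {
  const unsigned block_count = 4;
  for (unsigned b = 0; b < block_count; ++b)
    std::printf("block %u last=%d\n", b, last_to_arrive(block_count));
}
```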
@@ -592,7 +665,6 @@ unsigned rocm_single_inter_block_reduce_scan_shmem( const FunctorType & functor
{
return ( BlockSize + 2 ) * Impl::FunctorValueTraits< FunctorType , ArgTag >::value_size( functor );
}
#endif

} // namespace Impl
} // namespace Kokkos
@@ -98,7 +98,7 @@ void scan_enqueue(
{
auto j = i + d - 1;
auto k = i + d2 - 1;
// join(k, j);  // no longer needed with ROCm 1.6

ValueJoin::join(f, &buffer[k], &buffer[j]);
}
}
@@ -116,7 +116,7 @@ void scan_enqueue(
auto j = i + d - 1;
auto k = i + d2 - 1;
auto t = buffer[k];
// join(k, j);  // no longer needed with ROCm 1.6

ValueJoin::join(f, &buffer[k], &buffer[j]);
buffer[j] = t;
}
@@ -127,17 +127,13 @@ void scan_enqueue(
}).wait();
copy(result,result_cpu.data());

// std::partial_sum was segfaulting, despite this being CPU code:
// if(td.num_tiles>1)
//   std::partial_sum(result_cpu.data(), result_cpu.data()+(td.num_tiles-1)*sizeof(value_type), result_cpu.data(), make_join_operator<ValueJoin>(f));
// use this implementation instead.
for(int i=1; i<td.num_tiles; i++)
ValueJoin::join(f, &result_cpu[i], &result_cpu[i-1]);

copy(result_cpu.data(),result);
hc::parallel_for_each(hc::extent<1>(len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
size_t launch_len = (((len - 1) / td.tile_size) + 1) * td.tile_size;
hc::parallel_for_each(hc::extent<1>(launch_len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
{
// const auto local = t_idx.local[0];
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];
@@ -145,13 +141,115 @@ void scan_enqueue(
{
auto final_state = scratch[global];

// the join is locking up, at least with 1.6
if (tile != 0) final_state += result[tile-1];
// if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), final_state, true);
}
}).wait();
}

template< class Tag, class ReturnType, class F, class TransformIndex>
void scan_enqueue(
const int len,
const F & f,
ReturnType & return_val,
TransformIndex transform_index)
{
typedef Kokkos::Impl::FunctorValueTraits< F, Tag> ValueTraits;
typedef Kokkos::Impl::FunctorValueInit< F, Tag> ValueInit;
typedef Kokkos::Impl::FunctorValueJoin< F, Tag> ValueJoin;
typedef Kokkos::Impl::FunctorValueOps< F, Tag> ValueOps;

typedef typename ValueTraits::value_type value_type;
typedef typename ValueTraits::pointer_type pointer_type;
typedef typename ValueTraits::reference_type reference_type;

const auto td = get_tile_desc<value_type>(len);
std::vector<value_type> result_cpu(td.num_tiles);
hc::array<value_type> result(td.num_tiles);
hc::array<value_type> scratch(len);
std::vector<ReturnType> total_cpu(1);
hc::array<ReturnType> total(1);

tile_for<value_type>(td, [&,f,len,td](hc::tiled_index<1> t_idx, tile_buffer<value_type> buffer) [[hc]]
{
const auto local = t_idx.local[0];
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];

// Join tile buffer elements
const auto join = [&](std::size_t i, std::size_t j)
{
buffer.action_at(i, j, [&](value_type& x, const value_type& y)
{
ValueJoin::join(f, &x, &y);
});
};

// Copy into tile
buffer.action_at(local, [&](value_type& state)
{
ValueInit::init(f, &state);
if (global < len) rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), state, false);
});
t_idx.barrier.wait();
// Up sweep phase
for(std::size_t d=1;d<buffer.size();d*=2)
{
auto d2 = 2*d;
auto i = local*d2;
if(i<len)
{
auto j = i + d - 1;
auto k = i + d2 - 1;
ValueJoin::join(f, &buffer[k], &buffer[j]);
}
}
t_idx.barrier.wait();

result[tile] = buffer[buffer.size()-1];
buffer[buffer.size()-1] = 0;
// Down sweep phase
for(std::size_t d=buffer.size()/2;d>0;d/=2)
{
auto d2 = 2*d;
auto i = local*d2;
if(i<len)
{
auto j = i + d - 1;
auto k = i + d2 - 1;
auto t = buffer[k];
ValueJoin::join(f, &buffer[k], &buffer[j]);
buffer[j] = t;
}
t_idx.barrier.wait();
}
// Copy tiles into global memory
if (global < len) scratch[global] = buffer[local];
}).wait();
copy(result,result_cpu.data());

for(int i=1; i<td.num_tiles; i++)
ValueJoin::join(f, &result_cpu[i], &result_cpu[i-1]);

copy(result_cpu.data(),result);
size_t launch_len = (((len - 1) / td.tile_size) + 1) * td.tile_size;
hc::parallel_for_each(hc::extent<1>(launch_len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
{
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];

if (global < len)
{
auto final_state = scratch[global];

if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), final_state, true);
if(global==(len-1)) total[0] = final_state;
}
}).wait();
copy(total,total_cpu.data());
return_val = total_cpu[0];
}

} // namespace Impl
} // namespace Kokkos
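The tile kernel above is a classic work-efficient (Blelloch) exclusive scan: the up-sweep builds a sum tree inside the tile buffer, the tile total is exported to result[] and the root slot is zeroed, and the down-sweep rotates partial sums back down the tree. A host sketch over one power-of-two tile, with ValueJoin::join specialized to integer addition (illustrative only, not the kernel itself):

```cpp
#include <cstdio>

int main() {
  const int n = 8;
  int buffer[n] = {3, 1, 7, 0, 4, 1, 6, 3};

  // Up-sweep: build partial sums at stride 2*d.
  for (int d = 1; d < n; d *= 2)
    for (int i = 0; i < n; i += 2 * d)
      buffer[i + 2 * d - 1] += buffer[i + d - 1];

  int total = buffer[n - 1];  // the tile total, exported to result[] above
  buffer[n - 1] = 0;          // identity seeds the exclusive scan

  // Down-sweep: rotate partial sums back down the tree.
  for (int d = n / 2; d > 0; d /= 2)
    for (int i = 0; i < n; i += 2 * d) {
      int t = buffer[i + 2 * d - 1];
      buffer[i + 2 * d - 1] += buffer[i + d - 1];
      buffer[i + d - 1] = t;
    }

  for (int v : buffer) std::printf("%d ", v);  // 0 3 4 11 11 15 16 22
  std::printf("| total %d\n", total);          // total 25
}
```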
@@ -362,6 +362,8 @@ SharedAllocationRecord( const Kokkos::Experimental::ROCmSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;

// Copy to device memory
Kokkos::Impl::DeepCopy<Kokkos::Experimental::ROCmSpace,HostSpace>( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
@@ -399,6 +401,8 @@ SharedAllocationRecord( const Kokkos::Experimental::ROCmHostPinnedSpace & arg_sp
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}

//----------------------------------------------------------------------------
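Both hunks add the same fix: strncpy does not null-terminate when the source string is at least as long as the buffer, so the last byte is forced to zero after the copy. A minimal sketch of the failure mode and the fix (buffer length chosen for illustration):

```cpp
#include <cstdio>
#include <cstring>

int main() {
  const int maximum_label_length = 8;       // stand-in for the header field
  char m_label[maximum_label_length];
  const char* arg_label = "a_rather_long_view_label";

  std::strncpy(m_label, arg_label, maximum_label_length);
  m_label[maximum_label_length - 1] = '\0'; // guard against overlong names

  std::printf("%s\n", m_label);             // prints the truncated "a_rathe"
}
```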
@@ -278,7 +278,7 @@ struct single_action
void action_at(std::size_t i, Action a) [[hc]]
{
auto& value = static_cast<Derived&>(*this)[i];
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
T state = value;
a(state);
value = state;
@@ -347,7 +347,7 @@ struct tile_buffer<T[]>
#if defined (ROCM15)
a(value);
#else
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
if (m > get_max_tile_array_size()) return;
T state[get_max_tile_array_size()];
// std::copy(value, value+m, state);
@@ -372,7 +372,6 @@ struct tile_buffer<T[]>
#if defined (ROCM15)
a(value);
#else
//#if KOKKOS_ROCM_HAS_WORKAROUNDS
if (m > get_max_tile_array_size()) return;
T state[get_max_tile_array_size()];
// std::copy(value, value+m, state);
@@ -175,6 +175,27 @@ public:
#endif
}

template<class Closure, class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast(Closure const & f, ValueType& value, const int& thread_id) const
{
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ }
#else
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(ValueType) < TEAM_REDUCE_SIZE
, ValueType , void >::type type ;
f( value );
if ( m_team_base ) {
type * const local_value = ((type*) m_team_base[0]->scratch_memory());
if(team_rank() == thread_id) *local_value = value;
memory_fence();
team_barrier();
value = *local_value;
}
#endif
}

template< typename Type >
KOKKOS_INLINE_FUNCTION
typename std::enable_if< !Kokkos::is_reducer< Type >::value , Type>::type
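The new team_broadcast applies the closure to each thread's value, publishes the value held by `thread_id` through the team's scratch memory, and lets every thread read it back after a fence and a barrier. A sequential host sketch of that data movement, where the loop plays the team (illustrative only, not the Kokkos implementation):

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int team_size = 4, thread_id = 2;      // broadcast from rank 2
  std::vector<int> value = {10, 20, 30, 40};   // each rank's private value
  int shared_slot = 0;                         // stands in for scratch_memory()

  auto f = [](int& v) { v *= 2; };             // closure applied before sharing
  for (int rank = 0; rank < team_size; ++rank) {
    f(value[rank]);
    if (rank == thread_id) shared_slot = value[rank];
  }
  // ... memory_fence() + team_barrier() would go here on the device ...
  for (int rank = 0; rank < team_size; ++rank) value[rank] = shared_slot;

  for (int v : value) std::printf("%d ", v);   // 60 60 60 60
  std::printf("\n");
}
```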
@@ -626,39 +647,77 @@ public:

//----------------------------------------

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & ) {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}

int pool_size = traits::execution_space::thread_pool_size(1);
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}

template< class FunctorType >
static int team_size_recommended( const FunctorType & )
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}

inline static
int team_size_recommended( const FunctorType & )
{
return traits::execution_space::thread_pool_size(2);
}

template< class FunctorType >
inline static
int team_size_recommended( const FunctorType &, const int& )
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
{
return traits::execution_space::thread_pool_size(2);
}
#endif

template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}

inline static
int vector_length_max()
{ return 1024; } // Use an arbitrarily large number; this is meant as a vectorizable length

inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32:       // Roughly L1 size
20*1024*1024); // Limit to keep compatibility with CUDA
}

//----------------------------------------
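All four tagged queries added above reduce to the same rule: take the thread pool size for the given task level and clamp it by the host backend's structural team limit. A tiny sketch of that clamp with assumed numbers (the real values come from thread_pool_size / impl_thread_pool_size and HostThreadTeamData::max_team_members):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  int pool_size = 16;          // e.g. impl_thread_pool_size(1); value assumed
  int max_host_team_size = 8;  // HostThreadTeamData::max_team_members; assumed
  int team_size_max = std::min(pool_size, max_host_team_size);
  std::printf("team_size_max = %d\n", team_size_max);  // 8
}
```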