Update Kokkos library in LAMMPS to v3.3.0

Stan Gerald Moore
2020-12-22 08:52:37 -07:00
parent b36363e0fb
commit eea14c55a9
927 changed files with 18603 additions and 46876 deletions

View File

@ -65,10 +65,15 @@ which activates the OpenMP backend. All of the options controlling device backen
## Spack ## Spack
An alternative to manually building with CMake is to use the Spack package manager. An alternative to manually building with CMake is to use the Spack package manager.
To do so, download the `kokkos-spack` git repo and add to the package list: Make sure you have downloaded [Spack](https://github.com/spack/spack).
The easiest way to configure the Spack environment is:
````bash ````bash
> spack repo add $path-to-kokkos-spack > source spack/share/spack/setup-env.sh
```` ````
with other scripts available for other shells.
You can display information about how to install packages with:
````bash
> spack info kokkos
````
A basic installation would be done as: A basic installation would be done as:
````bash ````bash
> spack install kokkos > spack install kokkos
@ -178,8 +183,8 @@ Options can be enabled by specifying `-DKokkos_ENABLE_X`.
## Other Options ## Other Options
* Kokkos_CXX_STANDARD * Kokkos_CXX_STANDARD
* The C++ standard for Kokkos to use: c++11, c++14, c++17, or c++20. This should be given in CMake style as 11, 14, 17, or 20. * The C++ standard for Kokkos to use: c++14, c++17, or c++20. This should be given in CMake style as 14, 17, or 20.
* STRING Default: 11 * STRING Default: 14
## Third-party Libraries (TPLs) ## Third-party Libraries (TPLs)
The following options control enabling TPLs: The following options control enabling TPLs:

View File

@ -1,5 +1,104 @@
# Change Log # Change Log
## [3.3.00](https://github.com/kokkos/kokkos/tree/3.3.00) (2020-12-16)
[Full Changelog](https://github.com/kokkos/kokkos/compare/3.2.01...3.3.00)
**Features:**
- Require C++14 as minimum C++ standard. C++17 and C++20 are supported too.
- HIP backend is nearly feature complete. Kokkos Dynamic Task Graphs are missing.
- Major update for OpenMPTarget: many capabilities now work. For details contact us.
- Added DPC++/SYCL backend: primary capabilities are working.
- Added Kokkos Graph API analogous to CUDA Graphs.
- Added parallel_scan support with TeamThreadRange [\#3536](https://github.com/kokkos/kokkos/pull/#3536)
- Added Logical Memory Spaces [\#3546](https://github.com/kokkos/kokkos/pull/#3546)
- Added initial half precision support [\#3439](https://github.com/kokkos/kokkos/pull/#3439)
- Experimental feature: control cuda occupancy [\#3379](https://github.com/kokkos/kokkos/pull/#3379)
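The team-level scan added in \#3536 above can be sketched as follows. This is a minimal illustration assuming the standard TeamPolicy/TeamThreadRange interface; the view names and sizes are invented for the example, not taken from this diff:
````cpp
#include <Kokkos_Core.hpp>

// Each team computes an inclusive prefix sum over one row of `data` using
// parallel_scan with a TeamThreadRange (the capability listed above).
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nteams = 4, n = 128;
    Kokkos::View<int**> data("data", nteams, n);
    Kokkos::View<int**> scan("scan", nteams, n);
    Kokkos::deep_copy(data, 1);  // every prefix sum becomes 1, 2, 3, ...

    using policy_type = Kokkos::TeamPolicy<>;
    Kokkos::parallel_for(
        "row_scans", policy_type(nteams, Kokkos::AUTO),
        KOKKOS_LAMBDA(const policy_type::member_type& team) {
          const int row = team.league_rank();
          Kokkos::parallel_scan(
              Kokkos::TeamThreadRange(team, n),
              [=](const int i, int& partial, const bool final_pass) {
                partial += data(row, i);
                if (final_pass) scan(row, i) = partial;  // inclusive scan
              });
        });
  }
  Kokkos::finalize();
}
````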
**Implemented enhancements Backends and Archs:**
- Add A64FX and Fujitsu compiler support [\#3614](https://github.com/kokkos/kokkos/pull/#3614)
- Adding support for AMD gfx908 architecture [\#3375](https://github.com/kokkos/kokkos/pull/#3375)
- SYCL parallel\_for MDRangePolicy [\#3583](https://github.com/kokkos/kokkos/pull/#3583)
- SYCL add parallel\_scan [\#3577](https://github.com/kokkos/kokkos/pull/#3577)
- SYCL custom reductions [\#3544](https://github.com/kokkos/kokkos/pull/#3544)
- SYCL Enable container unit tests [\#3550](https://github.com/kokkos/kokkos/pull/#3550)
- SYCL feature level 5 [\#3480](https://github.com/kokkos/kokkos/pull/#3480)
- SYCL Feature level 4 (parallel\_for) [\#3474](https://github.com/kokkos/kokkos/pull/#3474)
- SYCL feature level 3 [\#3451](https://github.com/kokkos/kokkos/pull/#3451)
- SYCL feature level 2 [\#3447](https://github.com/kokkos/kokkos/pull/#3447)
- OpenMPTarget: Hierarchical reduction for + operator on scalars [\#3504](https://github.com/kokkos/kokkos/pull/#3504)
- OpenMPTarget hierarchical [\#3411](https://github.com/kokkos/kokkos/pull/#3411)
- HIP Add Impl::atomic\_[store,load] [\#3440](https://github.com/kokkos/kokkos/pull/#3440)
- HIP enable global lock arrays [\#3418](https://github.com/kokkos/kokkos/pull/#3418)
- HIP Implement multiple occupancy paths for various HIP kernel launchers [\#3366](https://github.com/kokkos/kokkos/pull/#3366)
**Implemented enhancements Policies:**
- MDRangePolicy: Let it be semiregular [\#3494](https://github.com/kokkos/kokkos/pull/#3494)
- MDRangePolicy: Check narrowing conversion in construction [\#3527](https://github.com/kokkos/kokkos/pull/#3527)
- MDRangePolicy: CombinedReducers support [\#3395](https://github.com/kokkos/kokkos/pull/#3395)
- Kokkos Graph: Interface and Default Implementation [\#3362](https://github.com/kokkos/kokkos/pull/#3362)
- Kokkos Graph: add Cuda Graph implementation [\#3369](https://github.com/kokkos/kokkos/pull/#3369)
- TeamPolicy: implemented autotuning of team sizes and vector lengths [\#3206](https://github.com/kokkos/kokkos/pull/#3206)
- RangePolicy: Initialize all data members in default constructor [\#3509](https://github.com/kokkos/kokkos/pull/#3509)
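For context on the MDRangePolicy entries above, here is a minimal, illustrative use of the policy they refine; the view and extents below are assumptions for the example only:
````cpp
#include <Kokkos_Core.hpp>

// Construct a 2-D iteration space and run a parallel_for over it.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 64, M = 32;
    Kokkos::View<double**> a("a", N, M);
    // Bounds are given as initializer lists: [0,N) x [0,M).
    Kokkos::MDRangePolicy<Kokkos::Rank<2>> policy({0, 0}, {N, M});
    Kokkos::parallel_for(
        "fill", policy, KOKKOS_LAMBDA(const int i, const int j) {
          a(i, j) = i * M + j;
        });
  }
  Kokkos::finalize();
}
````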
**Implemented enhancements BuildSystem:**
- Auto-generate core test files for all backends [\#3488](https://github.com/kokkos/kokkos/pull/#3488)
- Avoid rewriting test files when calling cmake [\#3548](https://github.com/kokkos/kokkos/pull/#3548)
- RULE\_LAUNCH\_COMPILE and RULE\_LAUNCH\_LINK system for nvcc\_wrapper [\#3136](https://github.com/kokkos/kokkos/pull/#3136)
- Adding -include as a known argument to nvcc\_wrapper [\#3434](https://github.com/kokkos/kokkos/pull/#3434)
- Install hpcbind script [\#3402](https://github.com/kokkos/kokkos/pull/#3402)
- cmake/kokkos\_tribits.cmake: add parsing for args [\#3457](https://github.com/kokkos/kokkos/pull/#3457)
**Implemented enhancements Tools:**
- Changed namespacing of Kokkos::Tools::Impl::Impl::tune\_policy [\#3455](https://github.com/kokkos/kokkos/pull/#3455)
- Delegate to an impl allocate/deallocate method to allow specifying a SpaceHandle for MemorySpaces [\#3530](https://github.com/kokkos/kokkos/pull/#3530)
- Use the Kokkos Profiling interface rather than the Impl interface [\#3518](https://github.com/kokkos/kokkos/pull/#3518)
- Runtime option for tuning [\#3459](https://github.com/kokkos/kokkos/pull/#3459)
- Dual View Tool Events [\#3326](https://github.com/kokkos/kokkos/pull/#3326)
**Implemented enhancements Other:**
- Abort on errors instead of just printing [\#3528](https://github.com/kokkos/kokkos/pull/#3528)
- Enable C++14 macros unconditionally [\#3449](https://github.com/kokkos/kokkos/pull/#3449)
- Make ViewMapping trivially copyable [\#3436](https://github.com/kokkos/kokkos/pull/#3436)
- Rename struct ViewMapping to class [\#3435](https://github.com/kokkos/kokkos/pull/#3435)
- Replace enums in Kokkos\_ViewMapping.hpp (removes -Wextra) [\#3422](https://github.com/kokkos/kokkos/pull/#3422)
- Use bool for enums representing bools [\#3416](https://github.com/kokkos/kokkos/pull/#3416)
- Fence active instead of default execution space instances [\#3388](https://github.com/kokkos/kokkos/pull/#3388)
- Refactor parallel\_reduce fence usage [\#3359](https://github.com/kokkos/kokkos/pull/#3359)
- Moved Space EBO helpers to Kokkos\_EBO [\#3357](https://github.com/kokkos/kokkos/pull/#3357)
- Add remove\_cvref type trait [\#3340](https://github.com/kokkos/kokkos/pull/#3340)
- Adding identity type traits and update definition of identity\_t alias [\#3339](https://github.com/kokkos/kokkos/pull/#3339)
- Add is\_specialization\_of type trait [\#3338](https://github.com/kokkos/kokkos/pull/#3338)
- Make ScratchMemorySpace semi-regular [\#3309](https://github.com/kokkos/kokkos/pull/#3309)
- Optimize min/max atomics with early exit on no-op case [\#3265](https://github.com/kokkos/kokkos/pull/#3265)
- Refactor Backend Development [\#2941](https://github.com/kokkos/kokkos/pull/#2941)
**Fixed bugs:**
- Fixup MDRangePolicy construction from Kokkos arrays [\#3591](https://github.com/kokkos/kokkos/pull/#3591)
- Add atomic functions for unsigned long long using gcc built-in [\#3588](https://github.com/kokkos/kokkos/pull/#3588)
- Fixup silent pointless comparison with zero in checked\_narrow\_cast (compiler workaround) [\#3566](https://github.com/kokkos/kokkos/pull/#3566)
- Fixes for ROCm 3.9 [\#3565](https://github.com/kokkos/kokkos/pull/#3565)
- Fix windows build issues which crept in for the CUDA build [\#3532](https://github.com/kokkos/kokkos/pull/#3532)
- HIP Fix atomics of large data types and clean up lock arrays [\#3529](https://github.com/kokkos/kokkos/pull/#3529)
- Pthreads fix exception resulting from 0 grain size [\#3510](https://github.com/kokkos/kokkos/pull/#3510)
- Fixup do not require atomic operation to be default constructible [\#3503](https://github.com/kokkos/kokkos/pull/#3503)
- Fix race condition in HIP backend [\#3467](https://github.com/kokkos/kokkos/pull/#3467)
- Replace KOKKOS\_DEBUG with KOKKOS\_ENABLE\_DEBUG [\#3458](https://github.com/kokkos/kokkos/pull/#3458)
- Fix multi-stream team scratch space definition for HIP [\#3398](https://github.com/kokkos/kokkos/pull/#3398)
- HIP fix template deduction [\#3393](https://github.com/kokkos/kokkos/pull/#3393)
- Fix compiling with HIP and C++17 [\#3390](https://github.com/kokkos/kokkos/pull/#3390)
- Fix sigFPE in HIP blocksize deduction [\#3378](https://github.com/kokkos/kokkos/pull/#3378)
- Type alias change: replace CS with CTS to avoid conflicts with NVSHMEM [\#3348](https://github.com/kokkos/kokkos/pull/#3348)
- Clang compilation of CUDA backend on Windows [\#3345](https://github.com/kokkos/kokkos/pull/#3345)
- Fix HBW support [\#3343](https://github.com/kokkos/kokkos/pull/#3343)
- Added missing fences to unique token [\#3260](https://github.com/kokkos/kokkos/pull/#3260)
**Incompatibilities:**
- Remove unused utilities (forward, move, and expand\_variadic) from Kokkos::Impl [\#3535](https://github.com/kokkos/kokkos/pull/#3535)
- Remove unused traits [\#3534](https://github.com/kokkos/kokkos/pull/#3534)
- HIP: Remove old HCC code [\#3301](https://github.com/kokkos/kokkos/pull/#3301)
- Prepare for deprecation of ViewAllocateWithoutInitializing [\#3264](https://github.com/kokkos/kokkos/pull/#3264)
- Remove ROCm backend [\#3148](https://github.com/kokkos/kokkos/pull/#3148)
## [3.2.01](https://github.com/kokkos/kokkos/tree/3.2.01) (2020-11-17) ## [3.2.01](https://github.com/kokkos/kokkos/tree/3.2.01) (2020-11-17)
[Full Changelog](https://github.com/kokkos/kokkos/compare/3.2.00...3.2.01) [Full Changelog](https://github.com/kokkos/kokkos/compare/3.2.00...3.2.01)
@ -36,37 +135,31 @@
- Windows Cuda support [\#3018](https://github.com/kokkos/kokkos/issues/3018) - Windows Cuda support [\#3018](https://github.com/kokkos/kokkos/issues/3018)
- Pass `-Wext-lambda-captures-this` to NVCC when support for `__host__ __device__` lambda is enabled from CUDA 11 [\#3241](https://github.com/kokkos/kokkos/issues/3241) - Pass `-Wext-lambda-captures-this` to NVCC when support for `__host__ __device__` lambda is enabled from CUDA 11 [\#3241](https://github.com/kokkos/kokkos/issues/3241)
- Use explicit staging buffer for constant memory kernel launches and cleanup host/device synchronization [\#3234](https://github.com/kokkos/kokkos/issues/3234) - Use explicit staging buffer for constant memory kernel launches and cleanup host/device synchronization [\#3234](https://github.com/kokkos/kokkos/issues/3234)
- Various fixup to policies including making TeamPolicy default constructible and making RangePolicy and TeamPolicy assignable 1: [\#3202](https://github.com/kokkos/kokkos/issues/3202) - Various fixup to policies including making TeamPolicy default constructible and making RangePolicy and TeamPolicy assignable: [\#3202](https://github.com/kokkos/kokkos/issues/3202) , [\#3203](https://github.com/kokkos/kokkos/issues/3203) , [\#3196](https://github.com/kokkos/kokkos/issues/3196)
- Various fixup to policies including making TeamPolicy default constructible and making RangePolicy and TeamPolicy assignable 2: [\#3203](https://github.com/kokkos/kokkos/issues/3203)
- Various fixup to policies including making TeamPolicy default constructible and making RangePolicy and TeamPolicy assignable 3: [\#3196](https://github.com/kokkos/kokkos/issues/3196)
- Annotations for `DefaultExecutionSpace` and `DefaultHostExecutionSpace` to use in static analysis [\#3189](https://github.com/kokkos/kokkos/issues/3189) - Annotations for `DefaultExecutionSpace` and `DefaultHostExecutionSpace` to use in static analysis [\#3189](https://github.com/kokkos/kokkos/issues/3189)
- Add documentation on using Spack to install Kokkos and developing packages that depend on Kokkos [\#3187](https://github.com/kokkos/kokkos/issues/3187) - Add documentation on using Spack to install Kokkos and developing packages that depend on Kokkos [\#3187](https://github.com/kokkos/kokkos/issues/3187)
- Improve support for nvcc\_wrapper with exotic host compiler [\#3186](https://github.com/kokkos/kokkos/issues/3186)
- Add OpenMPTarget backend flags for NVC++ compiler [\#3185](https://github.com/kokkos/kokkos/issues/3185) - Add OpenMPTarget backend flags for NVC++ compiler [\#3185](https://github.com/kokkos/kokkos/issues/3185)
- Move deep\_copy/create\_mirror\_view on Experimental::OffsetView into Kokkos:: namespace [\#3166](https://github.com/kokkos/kokkos/issues/3166) - Move deep\_copy/create\_mirror\_view on Experimental::OffsetView into Kokkos:: namespace [\#3166](https://github.com/kokkos/kokkos/issues/3166)
- Allow for larger block size in HIP [\#3165](https://github.com/kokkos/kokkos/issues/3165) - Allow for larger block size in HIP [\#3165](https://github.com/kokkos/kokkos/issues/3165)
- View: Added names of Views to the different View initialize/free kernels [\#3159](https://github.com/kokkos/kokkos/issues/3159) - View: Added names of Views to the different View initialize/free kernels [\#3159](https://github.com/kokkos/kokkos/issues/3159)
- Cuda: Caching cudaFunctorAttributes and whether L1/Shmem prefer was set [\#3151](https://github.com/kokkos/kokkos/issues/3151) - Cuda: Caching cudaFunctorAttributes and whether L1/Shmem prefer was set [\#3151](https://github.com/kokkos/kokkos/issues/3151)
- BuildSystem: Provide an explicit default CMAKE\_BUILD\_TYPE [\#3131](https://github.com/kokkos/kokkos/issues/3131) - BuildSystem: Improved performance in default configuration by defaulting to Release build [\#3131](https://github.com/kokkos/kokkos/issues/3131)
- Cuda: Update CUDA occupancy calculation [\#3124](https://github.com/kokkos/kokkos/issues/3124) - Cuda: Update CUDA occupancy calculation [\#3124](https://github.com/kokkos/kokkos/issues/3124)
- Vector: Adding data() to Vector [\#3123](https://github.com/kokkos/kokkos/issues/3123) - Vector: Adding data() to Vector [\#3123](https://github.com/kokkos/kokkos/issues/3123)
- BuildSystem: Add CUDA Ampere configuration support [\#3122](https://github.com/kokkos/kokkos/issues/3122) - BuildSystem: Add CUDA Ampere configuration support [\#3122](https://github.com/kokkos/kokkos/issues/3122)
- General: Apply [[noreturn]] to Kokkos::abort when applicable [\#3106](https://github.com/kokkos/kokkos/issues/3106) - General: Apply [[noreturn]] to Kokkos::abort when applicable [\#3106](https://github.com/kokkos/kokkos/issues/3106)
- TeamPolicy: Validate storage level argument passed to TeamPolicy::set\_scratch\_size() [\#3098](https://github.com/kokkos/kokkos/issues/3098) - TeamPolicy: Validate storage level argument passed to TeamPolicy::set\_scratch\_size() [\#3098](https://github.com/kokkos/kokkos/issues/3098)
- nvcc\_wrapper: send --cudart to nvcc instead of host compiler [\#3092](https://github.com/kokkos/kokkos/issues/3092)
- BuildSystem: Make kokkos\_has\_string() function in Makefile.kokkos case insensitive [\#3091](https://github.com/kokkos/kokkos/issues/3091) - BuildSystem: Make kokkos\_has\_string() function in Makefile.kokkos case insensitive [\#3091](https://github.com/kokkos/kokkos/issues/3091)
- Modify KOKKOS\_FUNCTION macro for clang-tidy analysis [\#3087](https://github.com/kokkos/kokkos/issues/3087) - Modify KOKKOS\_FUNCTION macro for clang-tidy analysis [\#3087](https://github.com/kokkos/kokkos/issues/3087)
- Move allocation profiling to allocate/deallocate calls [\#3084](https://github.com/kokkos/kokkos/issues/3084) - Move allocation profiling to allocate/deallocate calls [\#3084](https://github.com/kokkos/kokkos/issues/3084)
- BuildSystem: FATAL\_ERROR when attempting in-source build [\#3082](https://github.com/kokkos/kokkos/issues/3082) - BuildSystem: FATAL\_ERROR when attempting in-source build [\#3082](https://github.com/kokkos/kokkos/issues/3082)
- Change enums in ScatterView to types [\#3076](https://github.com/kokkos/kokkos/issues/3076) - Change enums in ScatterView to types [\#3076](https://github.com/kokkos/kokkos/issues/3076)
- HIP: Changes for new compiler/runtime [\#3067](https://github.com/kokkos/kokkos/issues/3067) - HIP: Changes for new compiler/runtime [\#3067](https://github.com/kokkos/kokkos/issues/3067)
- Extract and use get\_gpu [\#3061](https://github.com/kokkos/kokkos/issues/3061) - Extract and use get\_gpu [\#3061](https://github.com/kokkos/kokkos/issues/3061) , [\#3048](https://github.com/kokkos/kokkos/issues/3048)
- Extract and use get\_gpu [\#3048](https://github.com/kokkos/kokkos/issues/3048)
- Add is\_allocated to View-like containers [\#3059](https://github.com/kokkos/kokkos/issues/3059) - Add is\_allocated to View-like containers [\#3059](https://github.com/kokkos/kokkos/issues/3059)
- Combined reducers for scalar references [\#3052](https://github.com/kokkos/kokkos/issues/3052) - Combined reducers for scalar references [\#3052](https://github.com/kokkos/kokkos/issues/3052)
- Add configurable capacity for UniqueToken [\#3051](https://github.com/kokkos/kokkos/issues/3051) - Add configurable capacity for UniqueToken [\#3051](https://github.com/kokkos/kokkos/issues/3051)
- Add installation testing [\#3034](https://github.com/kokkos/kokkos/issues/3034) - Add installation testing [\#3034](https://github.com/kokkos/kokkos/issues/3034)
- BuildSystem: Add -expt-relaxed-constexpr flag to nvcc\_wrapper [\#3021](https://github.com/kokkos/kokkos/issues/3021)
- HIP: Add UniqueToken [\#3020](https://github.com/kokkos/kokkos/issues/3020) - HIP: Add UniqueToken [\#3020](https://github.com/kokkos/kokkos/issues/3020)
- Autodetect number of devices [\#3013](https://github.com/kokkos/kokkos/issues/3013) - Autodetect number of devices [\#3013](https://github.com/kokkos/kokkos/issues/3013)
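As a rough illustration of the UniqueToken usage touched above (configurable capacity, HIP support), the sketch below acquires a unique slot per concurrent caller; the view name and loop bounds are assumptions for the example:
````cpp
#include <Kokkos_Core.hpp>

// Each iteration acquires a unique slot id, updates a per-slot accumulator
// without atomics, and releases the id back to the pool of tokens.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using exec = Kokkos::DefaultExecutionSpace;
    Kokkos::Experimental::UniqueToken<exec> token;
    Kokkos::View<double*> partial("partial", token.size());

    Kokkos::parallel_for(
        "accumulate", Kokkos::RangePolicy<exec>(0, 1 << 20),
        KOKKOS_LAMBDA(const int i) {
          const int id = token.acquire();  // unique among concurrent callers
          partial(id) += 1.0;              // exclusive access while held
          token.release(id);
        });
  }
  Kokkos::finalize();
}
````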
@ -82,11 +175,13 @@
- ScatterView: fix for OpenmpTarget remove inheritance from reducers [\#3162](https://github.com/kokkos/kokkos/issues/3162) - ScatterView: fix for OpenmpTarget remove inheritance from reducers [\#3162](https://github.com/kokkos/kokkos/issues/3162)
- BuildSystem: Set OpenMP flags according to host compiler [\#3127](https://github.com/kokkos/kokkos/issues/3127) - BuildSystem: Set OpenMP flags according to host compiler [\#3127](https://github.com/kokkos/kokkos/issues/3127)
- OpenMP: Fix logic for nested omp in partition\_master bug [\#3101](https://github.com/kokkos/kokkos/issues/3101) - OpenMP: Fix logic for nested omp in partition\_master bug [\#3101](https://github.com/kokkos/kokkos/issues/3101)
- nvcc\_wrapper: send --cudart to nvcc instead of host compiler [\#3092](https://github.com/kokkos/kokkos/issues/3092)
- BuildSystem: Fixes for Cuda/11 and c++17 [\#3085](https://github.com/kokkos/kokkos/issues/3085) - BuildSystem: Fixes for Cuda/11 and c++17 [\#3085](https://github.com/kokkos/kokkos/issues/3085)
- HIP: Fix print\_configuration [\#3080](https://github.com/kokkos/kokkos/issues/3080) - HIP: Fix print\_configuration [\#3080](https://github.com/kokkos/kokkos/issues/3080)
- Conditionally define get\_gpu [\#3072](https://github.com/kokkos/kokkos/issues/3072) - Conditionally define get\_gpu [\#3072](https://github.com/kokkos/kokkos/issues/3072)
- Fix bounds for ranges in random number generator [\#3069](https://github.com/kokkos/kokkos/issues/3069) - Fix bounds for ranges in random number generator [\#3069](https://github.com/kokkos/kokkos/issues/3069)
- Fix Cuda minor arch check [\#3035](https://github.com/kokkos/kokkos/issues/3035) - Fix Cuda minor arch check [\#3035](https://github.com/kokkos/kokkos/issues/3035)
- BuildSystem: Add -expt-relaxed-constexpr flag to nvcc\_wrapper [\#3021](https://github.com/kokkos/kokkos/issues/3021)
**Incompatibilities:** **Incompatibilities:**

View File

@ -111,8 +111,8 @@ ENDIF()
set(Kokkos_VERSION_MAJOR 3) set(Kokkos_VERSION_MAJOR 3)
set(Kokkos_VERSION_MINOR 2) set(Kokkos_VERSION_MINOR 3)
set(Kokkos_VERSION_PATCH 1) set(Kokkos_VERSION_PATCH 0)
set(Kokkos_VERSION "${Kokkos_VERSION_MAJOR}.${Kokkos_VERSION_MINOR}.${Kokkos_VERSION_PATCH}") set(Kokkos_VERSION "${Kokkos_VERSION_MAJOR}.${Kokkos_VERSION_MINOR}.${Kokkos_VERSION_PATCH}")
math(EXPR KOKKOS_VERSION "${Kokkos_VERSION_MAJOR} * 10000 + ${Kokkos_VERSION_MINOR} * 100 + ${Kokkos_VERSION_PATCH}") math(EXPR KOKKOS_VERSION "${Kokkos_VERSION_MAJOR} * 10000 + ${Kokkos_VERSION_MINOR} * 100 + ${Kokkos_VERSION_PATCH}")
@ -139,13 +139,15 @@ ENDIF()
# I really wish these were regular variables # I really wish these were regular variables
# but scoping issues can make it difficult # but scoping issues can make it difficult
GLOBAL_SET(KOKKOS_COMPILE_OPTIONS) GLOBAL_SET(KOKKOS_COMPILE_OPTIONS)
GLOBAL_SET(KOKKOS_LINK_OPTIONS) GLOBAL_SET(KOKKOS_LINK_OPTIONS -DKOKKOS_DEPENDENCE)
GLOBAL_SET(KOKKOS_CUDA_OPTIONS) GLOBAL_SET(KOKKOS_CUDA_OPTIONS)
GLOBAL_SET(KOKKOS_CUDAFE_OPTIONS) GLOBAL_SET(KOKKOS_CUDAFE_OPTIONS)
GLOBAL_SET(KOKKOS_XCOMPILER_OPTIONS) GLOBAL_SET(KOKKOS_XCOMPILER_OPTIONS)
# We need to append text here for making sure TPLs # We need to append text here for making sure TPLs
# we import are available for an installed Kokkos # we import are available for an installed Kokkos
GLOBAL_SET(KOKKOS_TPL_EXPORTS) GLOBAL_SET(KOKKOS_TPL_EXPORTS)
# this could probably be scoped to project
GLOBAL_SET(KOKKOS_COMPILE_DEFINITIONS KOKKOS_DEPENDENCE)
# Include a set of Kokkos-specific wrapper functions that # Include a set of Kokkos-specific wrapper functions that
# will either call raw CMake or TriBITS # will either call raw CMake or TriBITS
@ -191,8 +193,6 @@ ELSE()
SET(KOKKOS_IS_SUBDIRECTORY FALSE) SET(KOKKOS_IS_SUBDIRECTORY FALSE)
ENDIF() ENDIF()
#------------------------------------------------------------------------------ #------------------------------------------------------------------------------
# #
# A) Forward declare the package so that certain options are also defined for # A) Forward declare the package so that certain options are also defined for
@ -253,9 +253,7 @@ KOKKOS_PROCESS_SUBPACKAGES()
KOKKOS_PACKAGE_DEF() KOKKOS_PACKAGE_DEF()
KOKKOS_EXCLUDE_AUTOTOOLS_FILES() KOKKOS_EXCLUDE_AUTOTOOLS_FILES()
KOKKOS_PACKAGE_POSTPROCESS() KOKKOS_PACKAGE_POSTPROCESS()
KOKKOS_CONFIGURE_CORE()
#We are ready to configure the header
CONFIGURE_FILE(cmake/KokkosCore_config.h.in KokkosCore_config.h @ONLY)
IF (NOT KOKKOS_HAS_TRILINOS AND NOT Kokkos_INSTALL_TESTING) IF (NOT KOKKOS_HAS_TRILINOS AND NOT Kokkos_INSTALL_TESTING)
ADD_LIBRARY(kokkos INTERFACE) ADD_LIBRARY(kokkos INTERFACE)
@ -272,7 +270,10 @@ INCLUDE(${KOKKOS_SRC_PATH}/cmake/kokkos_install.cmake)
# executables also need nvcc_wrapper. Thus, we need to install it. # executables also need nvcc_wrapper. Thus, we need to install it.
# If the argument of DESTINATION is a relative path, CMake computes it # If the argument of DESTINATION is a relative path, CMake computes it
# as relative to ${CMAKE_INSTALL_PATH}. # as relative to ${CMAKE_INSTALL_PATH}.
INSTALL(PROGRAMS ${CMAKE_CURRENT_SOURCE_DIR}/bin/nvcc_wrapper DESTINATION ${CMAKE_INSTALL_BINDIR}) # KOKKOS_INSTALL_ADDITIONAL_FILES will install nvcc wrapper and other generated
# files
KOKKOS_INSTALL_ADDITIONAL_FILES()
# Finally - if we are a subproject - make sure the enabled devices are visible # Finally - if we are a subproject - make sure the enabled devices are visible
IF (HAS_PARENT) IF (HAS_PARENT)

View File

@ -11,27 +11,27 @@ CXXFLAGS += $(SHFLAGS)
endif endif
KOKKOS_VERSION_MAJOR = 3 KOKKOS_VERSION_MAJOR = 3
KOKKOS_VERSION_MINOR = 2 KOKKOS_VERSION_MINOR = 3
KOKKOS_VERSION_PATCH = 1 KOKKOS_VERSION_PATCH = 0
KOKKOS_VERSION = $(shell echo $(KOKKOS_VERSION_MAJOR)*10000+$(KOKKOS_VERSION_MINOR)*100+$(KOKKOS_VERSION_PATCH) | bc) KOKKOS_VERSION = $(shell echo $(KOKKOS_VERSION_MAJOR)*10000+$(KOKKOS_VERSION_MINOR)*100+$(KOKKOS_VERSION_PATCH) | bc)
# Options: Cuda,HIP,ROCm,OpenMP,Pthread,Serial # Options: Cuda,HIP,OpenMP,Pthread,Serial
KOKKOS_DEVICES ?= "OpenMP" KOKKOS_DEVICES ?= "OpenMP"
#KOKKOS_DEVICES ?= "Pthread" #KOKKOS_DEVICES ?= "Pthread"
# Options: # Options:
# Intel: KNC,KNL,SNB,HSW,BDW,SKX # Intel: KNC,KNL,SNB,HSW,BDW,SKX
# NVIDIA: Kepler,Kepler30,Kepler32,Kepler35,Kepler37,Maxwell,Maxwell50,Maxwell52,Maxwell53,Pascal60,Pascal61,Volta70,Volta72,Turing75,Ampere80 # NVIDIA: Kepler,Kepler30,Kepler32,Kepler35,Kepler37,Maxwell,Maxwell50,Maxwell52,Maxwell53,Pascal60,Pascal61,Volta70,Volta72,Turing75,Ampere80
# ARM: ARMv80,ARMv81,ARMv8-ThunderX,ARMv8-TX2 # ARM: ARMv80,ARMv81,ARMv8-ThunderX,ARMv8-TX2,A64FX
# IBM: BGQ,Power7,Power8,Power9 # IBM: BGQ,Power7,Power8,Power9
# AMD-GPUS: Vega900,Vega906 # AMD-GPUS: Vega900,Vega906,Vega908
# AMD-CPUS: AMDAVX,Zen,Zen2 # AMD-CPUS: AMDAVX,Zen,Zen2
KOKKOS_ARCH ?= "" KOKKOS_ARCH ?= ""
# Options: yes,no # Options: yes,no
KOKKOS_DEBUG ?= "no" KOKKOS_DEBUG ?= "no"
# Options: hwloc,librt,experimental_memkind # Options: hwloc,librt,experimental_memkind
KOKKOS_USE_TPLS ?= "" KOKKOS_USE_TPLS ?= ""
# Options: c++11,c++14,c++1y,c++17,c++1z,c++2a # Options: c++14,c++1y,c++17,c++1z,c++2a
KOKKOS_CXX_STANDARD ?= "c++11" KOKKOS_CXX_STANDARD ?= "c++14"
# Options: aggressive_vectorization,disable_profiling,enable_large_mem_tests,disable_complex_align # Options: aggressive_vectorization,disable_profiling,enable_large_mem_tests,disable_complex_align
KOKKOS_OPTIONS ?= "" KOKKOS_OPTIONS ?= ""
KOKKOS_CMAKE ?= "no" KOKKOS_CMAKE ?= "no"
@ -66,7 +66,6 @@ kokkos_path_exists=$(if $(wildcard $1),1,0)
# Check for general settings # Check for general settings
KOKKOS_INTERNAL_ENABLE_DEBUG := $(call kokkos_has_string,$(KOKKOS_DEBUG),yes) KOKKOS_INTERNAL_ENABLE_DEBUG := $(call kokkos_has_string,$(KOKKOS_DEBUG),yes)
KOKKOS_INTERNAL_ENABLE_CXX11 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++11)
KOKKOS_INTERNAL_ENABLE_CXX14 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++14) KOKKOS_INTERNAL_ENABLE_CXX14 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++14)
KOKKOS_INTERNAL_ENABLE_CXX1Y := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1y) KOKKOS_INTERNAL_ENABLE_CXX1Y := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1y)
KOKKOS_INTERNAL_ENABLE_CXX17 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++17) KOKKOS_INTERNAL_ENABLE_CXX17 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++17)
@ -279,14 +278,12 @@ else
endif endif
endif endif
# Set C++11 flags. # Set C++ version flags.
ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1) ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
KOKKOS_INTERNAL_CXX11_FLAG := --c++11
KOKKOS_INTERNAL_CXX14_FLAG := --c++14 KOKKOS_INTERNAL_CXX14_FLAG := --c++14
KOKKOS_INTERNAL_CXX17_FLAG := --c++17 KOKKOS_INTERNAL_CXX17_FLAG := --c++17
else else
ifeq ($(KOKKOS_INTERNAL_COMPILER_XL), 1) ifeq ($(KOKKOS_INTERNAL_COMPILER_XL), 1)
KOKKOS_INTERNAL_CXX11_FLAG := -std=c++11
KOKKOS_INTERNAL_CXX14_FLAG := -std=c++14 KOKKOS_INTERNAL_CXX14_FLAG := -std=c++14
KOKKOS_INTERNAL_CXX1Y_FLAG := -std=c++1y KOKKOS_INTERNAL_CXX1Y_FLAG := -std=c++1y
#KOKKOS_INTERNAL_CXX17_FLAG := -std=c++17 #KOKKOS_INTERNAL_CXX17_FLAG := -std=c++17
@ -294,17 +291,12 @@ else
#KOKKOS_INTERNAL_CXX2A_FLAG := -std=c++2a #KOKKOS_INTERNAL_CXX2A_FLAG := -std=c++2a
else else
ifeq ($(KOKKOS_INTERNAL_COMPILER_CRAY), 1) ifeq ($(KOKKOS_INTERNAL_COMPILER_CRAY), 1)
KOKKOS_INTERNAL_CXX11_FLAG := -hstd=c++11
KOKKOS_INTERNAL_CXX14_FLAG := -hstd=c++14 KOKKOS_INTERNAL_CXX14_FLAG := -hstd=c++14
#KOKKOS_INTERNAL_CXX1Y_FLAG := -hstd=c++1y #KOKKOS_INTERNAL_CXX1Y_FLAG := -hstd=c++1y
#KOKKOS_INTERNAL_CXX17_FLAG := -hstd=c++17 #KOKKOS_INTERNAL_CXX17_FLAG := -hstd=c++17
#KOKKOS_INTERNAL_CXX1Z_FLAG := -hstd=c++1z #KOKKOS_INTERNAL_CXX1Z_FLAG := -hstd=c++1z
#KOKKOS_INTERNAL_CXX2A_FLAG := -hstd=c++2a #KOKKOS_INTERNAL_CXX2A_FLAG := -hstd=c++2a
else else
ifeq ($(KOKKOS_INTERNAL_COMPILER_HCC), 1)
KOKKOS_INTERNAL_CXX11_FLAG :=
else
KOKKOS_INTERNAL_CXX11_FLAG := --std=c++11
KOKKOS_INTERNAL_CXX14_FLAG := --std=c++14 KOKKOS_INTERNAL_CXX14_FLAG := --std=c++14
KOKKOS_INTERNAL_CXX1Y_FLAG := --std=c++1y KOKKOS_INTERNAL_CXX1Y_FLAG := --std=c++1y
KOKKOS_INTERNAL_CXX17_FLAG := --std=c++17 KOKKOS_INTERNAL_CXX17_FLAG := --std=c++17
@ -313,7 +305,6 @@ else
endif endif
endif endif
endif endif
endif
# Check for Kokkos Architecture settings. # Check for Kokkos Architecture settings.
@ -377,7 +368,8 @@ KOKKOS_INTERNAL_USE_ARCH_ARMV80 := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv8
KOKKOS_INTERNAL_USE_ARCH_ARMV81 := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv81) KOKKOS_INTERNAL_USE_ARCH_ARMV81 := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv81)
KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv8-ThunderX) KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv8-ThunderX)
KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX2 := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv8-TX2) KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX2 := $(call kokkos_has_string,$(KOKKOS_ARCH),ARMv8-TX2)
KOKKOS_INTERNAL_USE_ARCH_ARM := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_ARMV80)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV81)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX2) | bc)) KOKKOS_INTERNAL_USE_ARCH_A64FX := $(call kokkos_has_string,$(KOKKOS_ARCH),A64FX)
KOKKOS_INTERNAL_USE_ARCH_ARM := $(strip $(shell echo $(KOKKOS_INTERNAL_USE_ARCH_ARMV80)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV81)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX)+$(KOKKOS_INTERNAL_USE_ARCH_ARMV8_THUNDERX2)+$(KOKKOS_INTERNAL_USE_ARCH_A64FX) | bc))
# IBM based. # IBM based.
KOKKOS_INTERNAL_USE_ARCH_BGQ := $(call kokkos_has_string,$(KOKKOS_ARCH),BGQ) KOKKOS_INTERNAL_USE_ARCH_BGQ := $(call kokkos_has_string,$(KOKKOS_ARCH),BGQ)
@ -392,6 +384,7 @@ KOKKOS_INTERNAL_USE_ARCH_ZEN2 := $(call kokkos_has_string,$(KOKKOS_ARCH),Zen2)
KOKKOS_INTERNAL_USE_ARCH_ZEN := $(call kokkos_has_string,$(KOKKOS_ARCH),Zen) KOKKOS_INTERNAL_USE_ARCH_ZEN := $(call kokkos_has_string,$(KOKKOS_ARCH),Zen)
KOKKOS_INTERNAL_USE_ARCH_VEGA900 := $(call kokkos_has_string,$(KOKKOS_ARCH),Vega900) KOKKOS_INTERNAL_USE_ARCH_VEGA900 := $(call kokkos_has_string,$(KOKKOS_ARCH),Vega900)
KOKKOS_INTERNAL_USE_ARCH_VEGA906 := $(call kokkos_has_string,$(KOKKOS_ARCH),Vega906) KOKKOS_INTERNAL_USE_ARCH_VEGA906 := $(call kokkos_has_string,$(KOKKOS_ARCH),Vega906)
KOKKOS_INTERNAL_USE_ARCH_VEGA908 := $(call kokkos_has_string,$(KOKKOS_ARCH),Vega908)
# Any AVX? # Any AVX?
KOKKOS_INTERNAL_USE_ARCH_SSE42 := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_WSM)) KOKKOS_INTERNAL_USE_ARCH_SSE42 := $(shell expr $(KOKKOS_INTERNAL_USE_ARCH_WSM))
@ -459,7 +452,6 @@ H := \#
# Do not append first line # Do not append first line
tmp := $(shell echo "/* ---------------------------------------------" > KokkosCore_config.tmp) tmp := $(shell echo "/* ---------------------------------------------" > KokkosCore_config.tmp)
tmp := $(call kokkos_append_header,"Makefile constructed configuration:") tmp := $(call kokkos_append_header,"Makefile constructed configuration:")
tmp := $(call kokkos_append_header,"$(shell date)")
tmp := $(call kokkos_append_header,"----------------------------------------------*/") tmp := $(call kokkos_append_header,"----------------------------------------------*/")
tmp := $(call kokkos_append_header,'$H''if !defined(KOKKOS_MACROS_HPP) || defined(KOKKOS_CORE_CONFIG_H)') tmp := $(call kokkos_append_header,'$H''if !defined(KOKKOS_MACROS_HPP) || defined(KOKKOS_CORE_CONFIG_H)')
@ -479,10 +471,6 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
tmp := $(call kokkos_append_header,"$H""define KOKKOS_COMPILER_CUDA_VERSION $(KOKKOS_INTERNAL_COMPILER_NVCC_VERSION)") tmp := $(call kokkos_append_header,"$H""define KOKKOS_COMPILER_CUDA_VERSION $(KOKKOS_INTERNAL_COMPILER_NVCC_VERSION)")
endif endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
tmp := $(call kokkos_append_header,'$H''define KOKKOS_ENABLE_ROCM')
tmp := $(call kokkos_append_header,'$H''define KOKKOS_IMPL_ROCM_CLANG_WORKAROUND 1')
endif
ifeq ($(KOKKOS_INTERNAL_USE_HIP), 1) ifeq ($(KOKKOS_INTERNAL_USE_HIP), 1)
tmp := $(call kokkos_append_header,'$H''define KOKKOS_ENABLE_HIP') tmp := $(call kokkos_append_header,'$H''define KOKKOS_ENABLE_HIP')
endif endif
@ -542,12 +530,6 @@ endif
#only add the c++ standard flags if this is not CMake #only add the c++ standard flags if this is not CMake
tmp := $(call kokkos_append_header,"/* General Settings */") tmp := $(call kokkos_append_header,"/* General Settings */")
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX11), 1)
ifneq ($(KOKKOS_STANDALONE_CMAKE), yes)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX11_FLAG)
endif
tmp := $(call kokkos_append_header,"$H""define KOKKOS_ENABLE_CXX11")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX14), 1) ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX14), 1)
ifneq ($(KOKKOS_STANDALONE_CMAKE), yes) ifneq ($(KOKKOS_STANDALONE_CMAKE), yes)
KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX14_FLAG) KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX14_FLAG)
@ -765,6 +747,13 @@ ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ARMV81), 1)
endif endif
endif endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_A64FX), 1)
tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_A64FX")
KOKKOS_CXXFLAGS += -march=armv8.2-a+sve
KOKKOS_LDFLAGS += -march=armv8.2-a+sve
endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ZEN), 1) ifeq ($(KOKKOS_INTERNAL_USE_ARCH_ZEN), 1)
tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_AMD_ZEN") tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_AMD_ZEN")
tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_AMD_AVX2") tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_AMD_AVX2")
@ -1143,6 +1132,12 @@ ifeq ($(KOKKOS_INTERNAL_USE_HIP), 1)
tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_VEGA906") tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_VEGA906")
KOKKOS_INTERNAL_HIP_ARCH_FLAG := --amdgpu-target=gfx906 KOKKOS_INTERNAL_HIP_ARCH_FLAG := --amdgpu-target=gfx906
endif endif
ifeq ($(KOKKOS_INTERNAL_USE_ARCH_VEGA908), 1)
tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_HIP 908")
tmp := $(call kokkos_append_header,"$H""define KOKKOS_ARCH_VEGA908")
KOKKOS_INTERNAL_HIP_ARCH_FLAG := --amdgpu-target=gfx908
endif
KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/HIP/*.cpp) KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/HIP/*.cpp)
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/HIP/*.hpp) KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/HIP/*.hpp)
@ -1173,6 +1168,55 @@ ifneq ($(KOKKOS_INTERNAL_NEW_CONFIG), 0)
tmp := $(shell cp KokkosCore_config.tmp KokkosCore_config.h) tmp := $(shell cp KokkosCore_config.tmp KokkosCore_config.h)
endif endif
# Functions for generating config header file
kokkos_start_config_header = $(shell sed 's~@INCLUDE_NEXT_FILE@~~g' $(KOKKOS_PATH)/cmake/KokkosCore_Config_HeaderSet.in > $1)
kokkos_update_config_header = $(shell sed 's~@HEADER_GUARD_TAG@~$1~g' $2 > $3)
kokkos_append_config_header = $(shell echo $1 >> $2)
tmp := $(call kokkos_start_config_header, "KokkosCore_Config_FwdBackend.tmp")
tmp := $(call kokkos_start_config_header, "KokkosCore_Config_SetupBackend.tmp")
tmp := $(call kokkos_start_config_header, "KokkosCore_Config_DeclareBackend.tmp")
tmp := $(call kokkos_start_config_header, "KokkosCore_Config_PostInclude.tmp")
tmp := $(call kokkos_update_config_header, KOKKOS_FWD_HPP_, "KokkosCore_Config_FwdBackend.tmp", "KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_update_config_header, KOKKOS_SETUP_HPP_, "KokkosCore_Config_SetupBackend.tmp", "KokkosCore_Config_SetupBackend.hpp")
tmp := $(call kokkos_update_config_header, KOKKOS_DECLARE_HPP_, "KokkosCore_Config_DeclareBackend.tmp", "KokkosCore_Config_DeclareBackend.hpp")
tmp := $(call kokkos_update_config_header, KOKKOS_POST_INCLUDE_HPP_, "KokkosCore_Config_PostInclude.tmp", "KokkosCore_Config_PostInclude.hpp")
ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
tmp := $(call kokkos_append_config_header,"\#include <fwd/Kokkos_Fwd_CUDA.hpp>","KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <decl/Kokkos_Declare_CUDA.hpp>","KokkosCore_Config_DeclareBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <setup/Kokkos_Setup_Cuda.hpp>","KokkosCore_Config_SetupBackend.hpp")
ifeq ($(KOKKOS_INTERNAL_CUDA_USE_UVM), 1)
else
endif
endif
ifeq ($(KOKKOS_INTERNAL_USE_OPENMPTARGET), 1)
tmp := $(call kokkos_append_config_header,"\#include <fwd/Kokkos_Fwd_OPENMPTARGET.hpp>","KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <decl/Kokkos_Declare_OPENMPTARGET.hpp>","KokkosCore_Config_DeclareBackend.hpp")
endif
ifeq ($(KOKKOS_INTERNAL_USE_HIP), 1)
tmp := $(call kokkos_append_config_header,"\#include <fwd/Kokkos_Fwd_HIP.hpp>","KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <decl/Kokkos_Declare_HIP.hpp>","KokkosCore_Config_DeclareBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <setup/Kokkos_Setup_HIP.hpp>","KokkosCore_Config_SetupBackend.hpp")
endif
ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
tmp := $(call kokkos_append_config_header,"\#include <fwd/Kokkos_Fwd_OPENMP.hpp>","KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <decl/Kokkos_Declare_OPENMP.hpp>","KokkosCore_Config_DeclareBackend.hpp")
endif
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
tmp := $(call kokkos_append_config_header,"\#include <fwd/Kokkos_Fwd_THREADS.hpp>","KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <decl/Kokkos_Declare_THREADS.hpp>","KokkosCore_Config_DeclareBackend.hpp")
endif
ifeq ($(KOKKOS_INTERNAL_USE_HPX), 1)
tmp := $(call kokkos_append_config_header,"\#include <fwd/Kokkos_Fwd_HPX.hpp>","KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <decl/Kokkos_Declare_HPX.hpp>","KokkosCore_Config_DeclareBackend.hpp")
endif
ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
tmp := $(call kokkos_append_config_header,"\#include <fwd/Kokkos_Fwd_SERIAL.hpp>","KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <decl/Kokkos_Declare_SERIAL.hpp>","KokkosCore_Config_DeclareBackend.hpp")
endif
ifeq ($(KOKKOS_INTERNAL_USE_MEMKIND), 1)
tmp := $(call kokkos_append_config_header,"\#include <fwd/Kokkos_Fwd_HBWSpace.hpp>","KokkosCore_Config_FwdBackend.hpp")
tmp := $(call kokkos_append_config_header,"\#include <decl/Kokkos_Declare_HBWSpace.hpp>","KokkosCore_Config_DeclareBackend.hpp")
endif
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/*.hpp) KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/*.hpp)
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp) KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/impl/*.hpp)
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp) KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/containers/src/*.hpp)
@ -1290,7 +1334,7 @@ ifneq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
endif endif
# With Cygwin functions such as fdopen and fileno are not defined # With Cygwin functions such as fdopen and fileno are not defined
# when strict ansi is enabled. strict ansi gets enabled with --std=c++11 # when strict ansi is enabled. strict ansi gets enabled with --std=c++14
# though. So we hard undefine it here. Not sure if that has any bad side effects # though. So we hard undefine it here. Not sure if that has any bad side effects
# This is needed for gtest actually, not for Kokkos itself! # This is needed for gtest actually, not for Kokkos itself!
ifeq ($(KOKKOS_INTERNAL_OS_CYGWIN), 1) ifeq ($(KOKKOS_INTERNAL_OS_CYGWIN), 1)
@ -1313,7 +1357,9 @@ KOKKOS_OBJ_LINK = $(notdir $(KOKKOS_OBJ))
include $(KOKKOS_PATH)/Makefile.targets include $(KOKKOS_PATH)/Makefile.targets
kokkos-clean: kokkos-clean:
rm -f $(KOKKOS_OBJ_LINK) KokkosCore_config.h KokkosCore_config.tmp libkokkos.a rm -f $(KOKKOS_OBJ_LINK) KokkosCore_config.h KokkosCore_config.tmp libkokkos.a KokkosCore_Config_SetupBackend.hpp \
KokkosCore_Config_FwdBackend.hpp KokkosCore_Config_DeclareBackend.hpp KokkosCore_Config_DeclareBackend.tmp \
KokkosCore_Config_FwdBackend.tmp KokkosCore_Config_PostInclude.hpp KokkosCore_Config_PostInclude.tmp KokkosCore_Config_SetupBackend.tmp
libkokkos.a: $(KOKKOS_OBJ_LINK) $(KOKKOS_SRC) $(KOKKOS_HEADERS) libkokkos.a: $(KOKKOS_OBJ_LINK) $(KOKKOS_SRC) $(KOKKOS_HEADERS)
ar cr libkokkos.a $(KOKKOS_OBJ_LINK) ar cr libkokkos.a $(KOKKOS_OBJ_LINK)

View File

@ -53,23 +53,10 @@ Kokkos_HIP_Space.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Space.cpp $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Space.cpp
Kokkos_HIP_Instance.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Instance.cpp Kokkos_HIP_Instance.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Instance.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Instance.cpp $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Instance.cpp
Kokkos_HIP_KernelLaunch.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_KernelLaunch.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_KernelLaunch.cpp
Kokkos_HIP_Locks.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Locks.cpp Kokkos_HIP_Locks.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Locks.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Locks.cpp $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/HIP/Kokkos_HIP_Locks.cpp
endif endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
Kokkos_ROCm_Exec.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/ROCm/Kokkos_ROCm_Exec.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/ROCm/Kokkos_ROCm_Exec.cpp
Kokkos_ROCm_Space.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/ROCm/Kokkos_ROCm_Space.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/ROCm/Kokkos_ROCm_Space.cpp
Kokkos_ROCm_Task.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/ROCm/Kokkos_ROCm_Task.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/ROCm/Kokkos_ROCm_Task.cpp
Kokkos_ROCm_Impl.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/ROCm/Kokkos_ROCm_Impl.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/ROCm/Kokkos_ROCm_Impl.cpp
endif
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1) ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
Kokkos_ThreadsExec_base.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec_base.cpp Kokkos_ThreadsExec_base.o: $(KOKKOS_CPP_DEPENDS) $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec_base.cpp
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec_base.cpp $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) -c $(KOKKOS_PATH)/core/src/Threads/Kokkos_ThreadsExec_base.cpp

View File

@ -54,24 +54,16 @@ For specifics see the LICENSE file contained in the repository or distribution.
# Requirements # Requirements
### Primary tested compilers on X86 are: ### Primary tested compilers on X86 are:
* GCC 4.8.4 * GCC 5.3.0
* GCC 4.9.3
* GCC 5.1.0
* GCC 5.4.0 * GCC 5.4.0
* GCC 5.5.0 * GCC 5.5.0
* GCC 6.1.0 * GCC 6.1.0
* GCC 7.2.0 * GCC 7.2.0
* GCC 7.3.0 * GCC 7.3.0
* GCC 8.1.0 * GCC 8.1.0
* Intel 15.0.2
* Intel 16.0.1
* Intel 17.0.1 * Intel 17.0.1
* Intel 17.4.196 * Intel 17.4.196
* Intel 18.2.128 * Intel 18.2.128
* Clang 3.6.1
* Clang 3.7.1
* Clang 3.8.1
* Clang 3.9.0
* Clang 4.0.0 * Clang 4.0.0
* Clang 6.0.0 for CUDA (CUDA Toolkit 9.0) * Clang 6.0.0 for CUDA (CUDA Toolkit 9.0)
* Clang 7.0.0 for CUDA (CUDA Toolkit 9.1) * Clang 7.0.0 for CUDA (CUDA Toolkit 9.1)
@ -81,6 +73,7 @@ For specifics see the LICENSE file contained in the repository or distribution.
* NVCC 9.2 for CUDA (with gcc 7.2.0) * NVCC 9.2 for CUDA (with gcc 7.2.0)
* NVCC 10.0 for CUDA (with gcc 7.4.0) * NVCC 10.0 for CUDA (with gcc 7.4.0)
* NVCC 10.1 for CUDA (with gcc 7.4.0) * NVCC 10.1 for CUDA (with gcc 7.4.0)
* NVCC 11.0 for CUDA (with gcc 8.4.0)
### Primary tested compilers on Power 8 are: ### Primary tested compilers on Power 8 are:
* GCC 6.4.0 (OpenMP,Serial) * GCC 6.4.0 (OpenMP,Serial)
@ -89,9 +82,8 @@ For specifics see the LICENSE file contained in the repository or distribution.
* NVCC 9.2.88 for CUDA (with gcc 7.2.0 and XL 16.1.0) * NVCC 9.2.88 for CUDA (with gcc 7.2.0 and XL 16.1.0)
### Primary tested compilers on Intel KNL are: ### Primary tested compilers on Intel KNL are:
* Intel 16.4.258 (with gcc 4.7.2) * Intel 17.2.174 (with gcc 6.2.0 and 6.4.0)
* Intel 17.2.174 (with gcc 4.9.3) * Intel 18.2.199 (with gcc 6.2.0 and 6.4.0)
* Intel 18.2.199 (with gcc 4.9.3)
### Primary tested compilers on ARM (Cavium ThunderX2) ### Primary tested compilers on ARM (Cavium ThunderX2)
* GCC 7.2.0 * GCC 7.2.0

View File

@ -806,7 +806,7 @@ class Random_XorShift64 {
const double V = 2.0 * drand() - 1.0; const double V = 2.0 * drand() - 1.0;
S = U * U + V * V; S = U * U + V * V;
} }
return U * std::sqrt(-2.0 * log(S) / S); return U * std::sqrt(-2.0 * std::log(S) / S);
} }
KOKKOS_INLINE_FUNCTION KOKKOS_INLINE_FUNCTION
@ -1042,7 +1042,7 @@ class Random_XorShift1024 {
const double V = 2.0 * drand() - 1.0; const double V = 2.0 * drand() - 1.0;
S = U * U + V * V; S = U * U + V * V;
} }
return U * std::sqrt(-2.0 * log(S) / S); return U * std::sqrt(-2.0 * std::log(S) / S);
} }
KOKKOS_INLINE_FUNCTION KOKKOS_INLINE_FUNCTION
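For reference, a minimal sketch of how this generator's normal() is typically used through a pool; the pool type, seed, and view below are illustrative assumptions, not taken from this diff:
````cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_Random.hpp>

// Fill a view with standard-normal draws; normal() uses the polar method
// whose std::log fix is shown above.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 16;
    Kokkos::View<double*> x("x", n);
    Kokkos::Random_XorShift64_Pool<> pool(/*seed=*/12345);

    Kokkos::parallel_for(
        "fill_normal", n, KOKKOS_LAMBDA(const int i) {
          auto gen = pool.get_state();  // grab a per-thread generator
          x(i) = gen.normal();          // standard normal draw
          pool.free_state(gen);         // return it to the pool
        });
  }
  Kokkos::finalize();
}
````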

View File

@ -222,11 +222,11 @@ class BinSort {
"Kokkos::SortImpl::BinSortFunctor::bin_count", bin_op.max_bins()); "Kokkos::SortImpl::BinSortFunctor::bin_count", bin_op.max_bins());
bin_count_const = bin_count_atomic; bin_count_const = bin_count_atomic;
bin_offsets = bin_offsets =
offset_type(ViewAllocateWithoutInitializing( offset_type(view_alloc(WithoutInitializing,
"Kokkos::SortImpl::BinSortFunctor::bin_offsets"), "Kokkos::SortImpl::BinSortFunctor::bin_offsets"),
bin_op.max_bins()); bin_op.max_bins());
sort_order = sort_order =
offset_type(ViewAllocateWithoutInitializing( offset_type(view_alloc(WithoutInitializing,
"Kokkos::SortImpl::BinSortFunctor::sort_order"), "Kokkos::SortImpl::BinSortFunctor::sort_order"),
range_end - range_begin); range_end - range_begin);
} }
@ -279,7 +279,7 @@ class BinSort {
} }
scratch_view_type sorted_values( scratch_view_type sorted_values(
ViewAllocateWithoutInitializing( view_alloc(WithoutInitializing,
"Kokkos::SortImpl::BinSortFunctor::sorted_values"), "Kokkos::SortImpl::BinSortFunctor::sorted_values"),
values.rank_dynamic > 0 ? len : KOKKOS_IMPL_CTOR_DEFAULT_ARG, values.rank_dynamic > 0 ? len : KOKKOS_IMPL_CTOR_DEFAULT_ARG,
values.rank_dynamic > 1 ? values.extent(1) values.rank_dynamic > 1 ? values.extent(1)
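The allocation-property change applied throughout this hunk, shown in isolation as a hedged sketch (the label and extent below are illustrative): `ViewAllocateWithoutInitializing("label")` is replaced by the equivalent `view_alloc(WithoutInitializing, "label")`.
````cpp
#include <Kokkos_Core.hpp>

// Allocate a View without running the initializing kernel, then fill it later.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1000;
    Kokkos::View<int*> offsets(
        Kokkos::view_alloc(Kokkos::WithoutInitializing, "bin_offsets"), n);
    Kokkos::deep_copy(offsets, 0);  // contents are undefined until filled
  }
  Kokkos::finalize();
}
````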

View File

@ -24,7 +24,7 @@ KOKKOS_ADD_TEST_LIBRARY(
# avoid deprecation warnings from MSVC # avoid deprecation warnings from MSVC
TARGET_COMPILE_DEFINITIONS(kokkosalgorithms_gtest PUBLIC GTEST_HAS_TR1_TUPLE=0 GTEST_HAS_PTHREAD=0) TARGET_COMPILE_DEFINITIONS(kokkosalgorithms_gtest PUBLIC GTEST_HAS_TR1_TUPLE=0 GTEST_HAS_PTHREAD=0)
IF(NOT (Kokkos_ENABLE_CUDA AND WIN32)) IF((NOT (Kokkos_ENABLE_CUDA AND WIN32)) AND (NOT ("${KOKKOS_CXX_COMPILER_ID}" STREQUAL "Fujitsu")))
TARGET_COMPILE_FEATURES(kokkosalgorithms_gtest PUBLIC cxx_std_11) TARGET_COMPILE_FEATURES(kokkosalgorithms_gtest PUBLIC cxx_std_11)
ENDIF() ENDIF()

View File

@ -31,10 +31,10 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
TEST_TARGETS += test-cuda TEST_TARGETS += test-cuda
endif endif
ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1) ifeq ($(KOKKOS_INTERNAL_USE_HIP), 1)
OBJ_ROCM = TestROCm.o UnitTestMain.o gtest-all.o OBJ_HIP = TestHIP.o UnitTestMain.o gtest-all.o
TARGETS += KokkosAlgorithms_UnitTest_ROCm TARGETS += KokkosAlgorithms_UnitTest_HIP
TEST_TARGETS += test-rocm TEST_TARGETS += test-hip
endif endif
ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1) ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
@ -64,8 +64,8 @@ endif
KokkosAlgorithms_UnitTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS) KokkosAlgorithms_UnitTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosAlgorithms_UnitTest_Cuda $(LINK) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosAlgorithms_UnitTest_Cuda
KokkosAlgorithms_UnitTest_ROCm: $(OBJ_ROCM) $(KOKKOS_LINK_DEPENDS) KokkosAlgorithms_UnitTest_HIP: $(OBJ_HIP) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(EXTRA_PATH) $(OBJ_ROCM) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosAlgorithms_UnitTest_ROCm $(LINK) $(EXTRA_PATH) $(OBJ_HIP) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosAlgorithms_UnitTest_HIP
KokkosAlgorithms_UnitTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS) KokkosAlgorithms_UnitTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosAlgorithms_UnitTest_Threads $(LINK) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosAlgorithms_UnitTest_Threads
@ -82,8 +82,8 @@ KokkosAlgorithms_UnitTest_Serial: $(OBJ_SERIAL) $(KOKKOS_LINK_DEPENDS)
test-cuda: KokkosAlgorithms_UnitTest_Cuda test-cuda: KokkosAlgorithms_UnitTest_Cuda
./KokkosAlgorithms_UnitTest_Cuda ./KokkosAlgorithms_UnitTest_Cuda
test-rocm: KokkosAlgorithms_UnitTest_ROCm test-hip: KokkosAlgorithms_UnitTest_HIP
./KokkosAlgorithms_UnitTest_ROCm ./KokkosAlgorithms_UnitTest_HIP
test-threads: KokkosAlgorithms_UnitTest_Threads test-threads: KokkosAlgorithms_UnitTest_Threads
./KokkosAlgorithms_UnitTest_Threads ./KokkosAlgorithms_UnitTest_Threads

View File

@ -1,31 +1,38 @@
KOKKOS_PATH = ${HOME}/kokkos KOKKOS_DEVICES=Cuda
KOKKOS_DEVICES = "OpenMP" KOKKOS_CUDA_OPTIONS=enable_lambda
KOKKOS_ARCH = "SNB" KOKKOS_ARCH = "SNB,Volta70"
EXE_NAME = "test"
SRC = $(wildcard *.cpp)
MAKEFILE_PATH := $(subst Makefile,,$(abspath $(lastword $(MAKEFILE_LIST))))
ifndef KOKKOS_PATH
KOKKOS_PATH = $(MAKEFILE_PATH)../..
endif
SRC = $(wildcard $(MAKEFILE_PATH)*.cpp)
HEADERS = $(wildcard $(MAKEFILE_PATH)*.hpp)
vpath %.cpp $(sort $(dir $(SRC)))
default: build default: build
echo "Start Build" echo "Start Build"
ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES))) ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES)))
CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper
EXE = ${EXE_NAME}.cuda EXE = atomic_perf.cuda
KOKKOS_CUDA_OPTIONS = "enable_lambda"
else else
CXX = g++ CXX = g++
EXE = ${EXE_NAME}.host EXE = atomic_perf.exe
endif endif
CXXFLAGS = -O3 CXXFLAGS ?= -O3 -g
override CXXFLAGS += -I$(MAKEFILE_PATH)
LINK = ${CXX}
LINKFLAGS = -O3
DEPFLAGS = -M DEPFLAGS = -M
LINK = ${CXX}
LINKFLAGS =
OBJ = $(SRC:.cpp=.o) OBJ = $(notdir $(SRC:.cpp=.o))
LIB = LIB =
include $(KOKKOS_PATH)/Makefile.kokkos include $(KOKKOS_PATH)/Makefile.kokkos
@ -36,9 +43,9 @@ $(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE) $(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)
clean: kokkos-clean clean: kokkos-clean
rm -f *.o *.cuda *.host rm -f *.o atomic_perf.cuda atomic_perf.exe
# Compilation rules # Compilation rules
%.o:%.cpp $(KOKKOS_CPP_DEPENDS) %.o:%.cpp $(KOKKOS_CPP_DEPENDS) $(HEADERS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< -o $(notdir $@)

View File

@ -9,7 +9,7 @@ if [[ ${USE_CUDA} > 0 ]]; then
BAF_EXE=bytes_and_flops.cuda BAF_EXE=bytes_and_flops.cuda
TEAM_SIZE=256 TEAM_SIZE=256
else else
BAF_EXE=bytes_and_flops.host BAF_EXE=bytes_and_flops.exe
TEAM_SIZE=1 TEAM_SIZE=1
fi fi

View File

@ -1,6 +1,6 @@
KOKKOS_DEVICES=Cuda KOKKOS_DEVICES=Cuda
KOKKOS_CUDA_OPTIONS=enable_lambda KOKKOS_CUDA_OPTIONS=enable_lambda
KOKKOS_ARCH = "SNB,Kepler35" KOKKOS_ARCH = "SNB,Volta70"
MAKEFILE_PATH := $(subst Makefile,,$(abspath $(lastword $(MAKEFILE_LIST)))) MAKEFILE_PATH := $(subst Makefile,,$(abspath $(lastword $(MAKEFILE_LIST))))
@ -22,7 +22,7 @@ CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper
EXE = bytes_and_flops.cuda EXE = bytes_and_flops.cuda
else else
CXX = g++ CXX = g++
EXE = bytes_and_flops.host EXE = bytes_and_flops.exe
endif endif
CXXFLAGS ?= -O3 -g CXXFLAGS ?= -O3 -g

View File

@ -1,7 +1,18 @@
KOKKOS_PATH = ${HOME}/kokkos
SRC = $(wildcard *.cpp)
KOKKOS_DEVICES=Cuda KOKKOS_DEVICES=Cuda
KOKKOS_CUDA_OPTIONS=enable_lambda KOKKOS_CUDA_OPTIONS=enable_lambda
KOKKOS_ARCH = "SNB,Volta70"
MAKEFILE_PATH := $(subst Makefile,,$(abspath $(lastword $(MAKEFILE_LIST))))
ifndef KOKKOS_PATH
KOKKOS_PATH = $(MAKEFILE_PATH)../..
endif
SRC = $(wildcard $(MAKEFILE_PATH)*.cpp)
HEADERS = $(wildcard $(MAKEFILE_PATH)*.hpp)
vpath %.cpp $(sort $(dir $(SRC)))
default: build default: build
echo "Start Build" echo "Start Build"
@ -9,36 +20,32 @@ default: build
ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES))) ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES)))
CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper
EXE = gather.cuda EXE = gather.cuda
KOKKOS_DEVICES = "Cuda,OpenMP"
KOKKOS_ARCH = "SNB,Kepler35"
else else
CXX = g++ CXX = g++
EXE = gather.host EXE = gather.exe
KOKKOS_DEVICES = "OpenMP"
KOKKOS_ARCH = "SNB"
endif endif
CXXFLAGS = -O3 -g CXXFLAGS ?= -O3 -g
override CXXFLAGS += -I$(MAKEFILE_PATH)
DEPFLAGS = -M DEPFLAGS = -M
LINK = ${CXX} LINK = ${CXX}
LINKFLAGS = LINKFLAGS =
OBJ = $(SRC:.cpp=.o) OBJ = $(notdir $(SRC:.cpp=.o))
LIB = LIB =
include $(KOKKOS_PATH)/Makefile.kokkos include $(KOKKOS_PATH)/Makefile.kokkos
$(warning ${KOKKOS_CPPFLAGS})
build: $(EXE) build: $(EXE)
$(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS) $(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE) $(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)
clean: kokkos-clean clean: kokkos-clean
rm -f *.o *.cuda *.host rm -f *.o gather.cuda gather.exe
# Compilation rules # Compilation rules
%.o:%.cpp $(KOKKOS_CPP_DEPENDS) gather_unroll.hpp gather.hpp %.o:%.cpp $(KOKKOS_CPP_DEPENDS) $(HEADERS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< -o $(notdir $@)

View File

@ -1,28 +1,38 @@
#Set your Kokkos path to something appropriate KOKKOS_DEVICES=Cuda
KOKKOS_PATH = ${HOME}/git/kokkos-github-repo
KOKKOS_DEVICES = "Cuda"
KOKKOS_ARCH = "Pascal60"
KOKKOS_CUDA_OPTIONS=enable_lambda KOKKOS_CUDA_OPTIONS=enable_lambda
#KOKKOS_DEVICES = "OpenMP" KOKKOS_ARCH = "SNB,Volta70"
#KOKKOS_ARCH = "Power8"
SRC = gups-kokkos.cc
MAKEFILE_PATH := $(subst Makefile,,$(abspath $(lastword $(MAKEFILE_LIST))))
ifndef KOKKOS_PATH
KOKKOS_PATH = $(MAKEFILE_PATH)../..
endif
SRC = $(wildcard $(MAKEFILE_PATH)*.cpp)
HEADERS = $(wildcard $(MAKEFILE_PATH)*.hpp)
vpath %.cpp $(sort $(dir $(SRC)))
default: build default: build
echo "Start Build" echo "Start Build"
CXXFLAGS = -O3 ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES)))
CXX = ${HOME}/git/kokkos-github-repo/bin/nvcc_wrapper CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper
#CXX = g++ EXE = gups.cuda
else
CXX = g++
EXE = gups.exe
endif
LINK = ${CXX} CXXFLAGS ?= -O3 -g
override CXXFLAGS += -I$(MAKEFILE_PATH)
LINKFLAGS =
EXE = gups-kokkos
DEPFLAGS = -M DEPFLAGS = -M
LINK = ${CXX}
LINKFLAGS =
OBJ = $(SRC:.cc=.o) OBJ = $(notdir $(SRC:.cpp=.o))
LIB = LIB =
include $(KOKKOS_PATH)/Makefile.kokkos include $(KOKKOS_PATH)/Makefile.kokkos
@ -33,9 +43,9 @@ $(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE) $(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)
clean: kokkos-clean clean: kokkos-clean
rm -f *.o $(EXE) rm -f *.o gups.cuda gups.exe
# Compilation rules # Compilation rules
%.o:%.cc $(KOKKOS_CPP_DEPENDS) %.o:%.cpp $(KOKKOS_CPP_DEPENDS) $(HEADERS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< -o $(notdir $@)

View File

@ -1,31 +1,38 @@
KOKKOS_PATH = ../.. KOKKOS_DEVICES=Cuda
SRC = $(wildcard *.cpp) KOKKOS_CUDA_OPTIONS=enable_lambda
KOKKOS_ARCH = "SNB,Volta70"
MAKEFILE_PATH := $(subst Makefile,,$(abspath $(lastword $(MAKEFILE_LIST))))
ifndef KOKKOS_PATH
KOKKOS_PATH = $(MAKEFILE_PATH)../..
endif
SRC = $(wildcard $(MAKEFILE_PATH)*.cpp)
HEADERS = $(wildcard $(MAKEFILE_PATH)*.hpp)
vpath %.cpp $(sort $(dir $(SRC)))
default: build default: build
echo "Start Build" echo "Start Build"
ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES))) ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES)))
CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper
CXXFLAGS = -O3 -g EXE = policy_perf.cuda
LINK = ${CXX}
LINKFLAGS =
EXE = policy_performance.cuda
KOKKOS_DEVICES = "Cuda,OpenMP"
KOKKOS_ARCH = "SNB,Kepler35"
KOKKOS_CUDA_OPTIONS+=enable_lambda
else else
CXX = g++ CXX = g++
CXXFLAGS = -O3 -g -Wall -Werror EXE = policy_perf.exe
LINK = ${CXX}
LINKFLAGS =
EXE = policy_performance.host
KOKKOS_DEVICES = "OpenMP"
KOKKOS_ARCH = "SNB"
endif endif
DEPFLAGS = -M CXXFLAGS ?= -O3 -g
override CXXFLAGS += -I$(MAKEFILE_PATH)
OBJ = $(SRC:.cpp=.o) DEPFLAGS = -M
LINK = ${CXX}
LINKFLAGS =
OBJ = $(notdir $(SRC:.cpp=.o))
LIB = LIB =
include $(KOKKOS_PATH)/Makefile.kokkos include $(KOKKOS_PATH)/Makefile.kokkos
@ -36,9 +43,9 @@ $(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE) $(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)
clean: kokkos-clean clean: kokkos-clean
rm -f *.o *.cuda *.host rm -f *.o policy_perf.cuda policy_perf.exe
# Compilation rules # Compilation rules
%.o:%.cpp $(KOKKOS_CPP_DEPENDS) main.cpp policy_perf_test.hpp %.o:%.cpp $(KOKKOS_CPP_DEPENDS) $(HEADERS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< -o $(notdir $@)

View File

@ -146,11 +146,11 @@ int main(int argc, char* argv[]) {
// Call a 'warmup' test with 1 repeat - this will initialize the corresponding // Call a 'warmup' test with 1 repeat - this will initialize the corresponding
// view appropriately for test and should obey first-touch etc Second call to // view appropriately for test and should obey first-touch etc Second call to
// test is the one we actually care about and time // test is the one we actually care about and time
view_type_1d v_1(Kokkos::ViewAllocateWithoutInitializing("v_1"), view_type_1d v_1(Kokkos::view_alloc(Kokkos::WithoutInitializing, "v_1"),
team_range * team_size); team_range * team_size);
view_type_2d v_2(Kokkos::ViewAllocateWithoutInitializing("v_2"), view_type_2d v_2(Kokkos::view_alloc(Kokkos::WithoutInitializing, "v_2"),
team_range * team_size, thread_range); team_range * team_size, thread_range);
view_type_3d v_3(Kokkos::ViewAllocateWithoutInitializing("v_3"), view_type_3d v_3(Kokkos::view_alloc(Kokkos::WithoutInitializing, "v_3"),
team_range * team_size, thread_range, vector_range); team_range * team_size, thread_range, vector_range);
double result_computed = 0.0; double result_computed = 0.0;

View File

@ -1,28 +1,38 @@
#Set your Kokkos path to something appropriate KOKKOS_DEVICES=Cuda
KOKKOS_PATH = ${HOME}/git/kokkos-github-repo KOKKOS_CUDA_OPTIONS=enable_lambda
#KOKKOS_DEVICES = "Cuda" KOKKOS_ARCH = "SNB,Volta70"
#KOKKOS_ARCH = "Pascal60"
#KOKKOS_CUDA_OPTIONS = enable_lambda
KOKKOS_DEVICES = "OpenMP"
KOKKOS_ARCH = "Power8"
SRC = stream-kokkos.cc
MAKEFILE_PATH := $(subst Makefile,,$(abspath $(lastword $(MAKEFILE_LIST))))
ifndef KOKKOS_PATH
KOKKOS_PATH = $(MAKEFILE_PATH)../..
endif
SRC = $(wildcard $(MAKEFILE_PATH)*.cpp)
HEADERS = $(wildcard $(MAKEFILE_PATH)*.hpp)
vpath %.cpp $(sort $(dir $(SRC)))
default: build default: build
echo "Start Build" echo "Start Build"
CXXFLAGS = -O3 ifneq (,$(findstring Cuda,$(KOKKOS_DEVICES)))
#CXX = ${HOME}/git/kokkos-github-repo/bin/nvcc_wrapper CXX = ${KOKKOS_PATH}/bin/nvcc_wrapper
EXE = stream.cuda
else
CXX = g++ CXX = g++
EXE = stream.exe
endif
LINK = ${CXX} CXXFLAGS ?= -O3 -g
override CXXFLAGS += -I$(MAKEFILE_PATH)
LINKFLAGS =
EXE = stream-kokkos
DEPFLAGS = -M DEPFLAGS = -M
LINK = ${CXX}
LINKFLAGS =
OBJ = $(SRC:.cc=.o) OBJ = $(notdir $(SRC:.cpp=.o))
LIB = LIB =
include $(KOKKOS_PATH)/Makefile.kokkos include $(KOKKOS_PATH)/Makefile.kokkos
@ -33,9 +43,9 @@ $(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE) $(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)
clean: kokkos-clean clean: kokkos-clean
rm -f *.o $(EXE) rm -f *.o stream.cuda stream.exe
# Compilation rules # Compilation rules
%.o:%.cc $(KOKKOS_CPP_DEPENDS) %.o:%.cpp $(KOKKOS_CPP_DEPENDS) $(HEADERS)
$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< $(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $< -o $(notdir $@)

View File

@ -0,0 +1,87 @@
#!/bin/bash -e
#
# This script allows CMAKE_CXX_COMPILER to be a standard
# C++ compiler and Kokkos sets RULE_LAUNCH_COMPILE and
# RULE_LAUNCH_LINK in CMake so that all compiler and link
# commands are prefixed with this script followed by the
# C++ compiler. Thus if $1 == $2 then we know the command
# was intended for the C++ compiler and we discard both
# $1 and $2 and redirect the command to NVCC_WRAPPER.
# If $1 != $2 then we know that the command was not intended
# for the C++ compiler and we just discard $1 and launch
# the original command. Examples of when $2 will not equal
# $1 are 'ar', 'cmake', etc. during the linking phase
#
# check the arguments for the KOKKOS_DEPENDENCE compiler definition
KOKKOS_DEPENDENCE=0
for i in ${@}
do
if [ -n "$(echo ${i} | grep 'KOKKOS_DEPENDENCE$')" ]; then
KOKKOS_DEPENDENCE=1
break
fi
done
# if C++ is not passed, someone is probably trying to invoke it directly
if [ -z "${1}" ]; then
echo -e "\n${BASH_SOURCE[0]} was invoked without the C++ compiler as the first argument."
echo "This script is not indended to be directly invoked by any mechanism other"
echo -e "than through a RULE_LAUNCH_COMPILE or RULE_LAUNCH_LINK property set in CMake\n"
exit 1
fi
# if there aren't two args, this isn't necessarily invalid, just a bit strange
if [ -z "${2}" ]; then exit 0; fi
# store the expected C++ compiler
CXX_COMPILER=${1}
# remove the expected C++ compiler from the arguments
shift
# after the above shift, $1 is now the exe for the compile or link command, e.g.
# kokkos_launch_compiler g++ gcc -c file.c -o file.o
# becomes:
# kokkos_launch_compiler gcc -c file.c -o file.o
# Check to see if the executable is the C++ compiler and if it is not, then
# just execute the command.
#
# Summary:
# kokkos_launch_compiler g++ gcc -c file.c -o file.o
# results in this command being executed:
# gcc -c file.c -o file.o
# and
# kokkos_launch_compiler g++ g++ -c file.cpp -o file.o
# results in this command being executed:
# nvcc_wrapper -c file.cpp -o file.o
if [[ "${KOKKOS_DEPENDENCE}" -eq "0" || "${CXX_COMPILER}" != "${1}" ]]; then
# the command does not depend on Kokkos so just execute the command w/o re-directing to nvcc_wrapper
eval $@
else
# the executable is the C++ compiler, so we need to re-direct to nvcc_wrapper
# find the nvcc_wrapper from the same build/install
NVCC_WRAPPER="$(dirname ${BASH_SOURCE[0]})/nvcc_wrapper"
if [ -z "${NVCC_WRAPPER}" ]; then
echo -e "\nError: nvcc_wrapper not found in $(dirname ${BASH_SOURCE[0]}).\n"
exit 1
fi
# set default nvcc wrapper compiler if not specified
: ${NVCC_WRAPPER_DEFAULT_COMPILER:=${CXX_COMPILER}}
export NVCC_WRAPPER_DEFAULT_COMPILER
# calling itself will cause an infinitely long build
if [ "${NVCC_WRAPPER}" = "${NVCC_WRAPPER_DEFAULT_COMPILER}" ]; then
echo -e "\nError: NVCC_WRAPPER == NVCC_WRAPPER_DEFAULT_COMPILER. Terminating to avoid infinite loop!\n"
exit 1
fi
# discard the compiler from the command
shift
# execute nvcc_wrapper
${NVCC_WRAPPER} $@
fi

View File

@ -90,7 +90,12 @@ replace_pragma_ident=0
# Mark first host compiler argument # Mark first host compiler argument
first_xcompiler_arg=1 first_xcompiler_arg=1
# Allow for setting temp dir without setting TMPDIR in parent (see https://docs.olcf.ornl.gov/systems/summit_user_guide.html#setting-tmpdir-causes-jsm-jsrun-errors-job-state-flip-flop)
if [[ -z ${NVCC_WRAPPER_TMPDIR+x} ]]; then
temp_dir=${TMPDIR:-/tmp} temp_dir=${TMPDIR:-/tmp}
else
temp_dir=${NVCC_WRAPPER_TMPDIR}
fi
# optimization flag added as a command-line argument # optimization flag added as a command-line argument
optimization_flag="" optimization_flag=""
@ -194,7 +199,7 @@ do
cuda_args="$cuda_args $1" cuda_args="$cuda_args $1"
;; ;;
#Handle known nvcc args that have an argument #Handle known nvcc args that have an argument
-rdc|-maxrregcount|--default-stream|-Xnvlink|--fmad|-cudart|--cudart) -rdc|-maxrregcount|--default-stream|-Xnvlink|--fmad|-cudart|--cudart|-include)
cuda_args="$cuda_args $1 $2" cuda_args="$cuda_args $1 $2"
shift shift
;; ;;

View File

@ -1,3 +1,9 @@
# No need for policy push/pop. CMake also manages a new entry for scripts
# loaded by include() and find_package() commands except when invoked with
# the NO_POLICY_SCOPE option
# CMP0057 + NEW -> IN_LIST operator in IF(...)
CMAKE_POLICY(SET CMP0057 NEW)
# Compute paths # Compute paths
@PACKAGE_INIT@ @PACKAGE_INIT@
@ -12,3 +18,18 @@ GET_FILENAME_COMPONENT(Kokkos_CMAKE_DIR "${CMAKE_CURRENT_LIST_FILE}" PATH)
INCLUDE("${Kokkos_CMAKE_DIR}/KokkosTargets.cmake") INCLUDE("${Kokkos_CMAKE_DIR}/KokkosTargets.cmake")
INCLUDE("${Kokkos_CMAKE_DIR}/KokkosConfigCommon.cmake") INCLUDE("${Kokkos_CMAKE_DIR}/KokkosConfigCommon.cmake")
UNSET(Kokkos_CMAKE_DIR) UNSET(Kokkos_CMAKE_DIR)
# if the CUDA backend was enabled and the separable_compilation component was NOT requested, e.g.
# find_package(Kokkos) rather than find_package(Kokkos COMPONENTS separable_compilation),
# then we set the RULE_LAUNCH_COMPILE and RULE_LAUNCH_LINK globally below
IF(@Kokkos_ENABLE_CUDA@ AND NOT "separable_compilation" IN_LIST Kokkos_FIND_COMPONENTS)
# run test to see if CMAKE_CXX_COMPILER=nvcc_wrapper
kokkos_compiler_is_nvcc(IS_NVCC ${CMAKE_CXX_COMPILER})
# if not nvcc_wrapper, use RULE_LAUNCH_COMPILE and RULE_LAUNCH_LINK
IF(NOT IS_NVCC AND NOT CMAKE_CXX_COMPILER_ID STREQUAL Clang AND
(NOT DEFINED Kokkos_LAUNCH_COMPILER OR Kokkos_LAUNCH_COMPILER))
MESSAGE(STATUS "kokkos_launch_compiler is enabled globally. C++ compiler commands with -DKOKKOS_DEPENDENCE will be redirected to nvcc_wrapper")
kokkos_compilation(GLOBAL)
ENDIF()
UNSET(IS_NVCC) # be mindful of the environment, pollution is bad
ENDIF()
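# A downstream sketch (hypothetical project and target names): request the
# separable_compilation component to skip the global hook above and scope the
# nvcc_wrapper redirection yourself, e.g.
#   find_package(Kokkos REQUIRED COMPONENTS separable_compilation)
#   add_executable(my_app main.cpp)
#   target_link_libraries(my_app PRIVATE Kokkos::kokkos)
#   kokkos_compilation(TARGET my_app)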

View File

@ -89,3 +89,73 @@ function(kokkos_check)
set(${KOKKOS_CHECK_RETURN_VALUE} ${KOKKOS_CHECK_SUCCESS} PARENT_SCOPE) set(${KOKKOS_CHECK_RETURN_VALUE} ${KOKKOS_CHECK_SUCCESS} PARENT_SCOPE)
endif() endif()
endfunction() endfunction()
# this function is provided to easily select which files use nvcc_wrapper:
#
# GLOBAL --> all files
# TARGET --> all files in a target
# SOURCE --> specific source files
# DIRECTORY --> all files in directory
# PROJECT --> all files/targets in a project/subproject
#
FUNCTION(kokkos_compilation)
CMAKE_PARSE_ARGUMENTS(COMP "GLOBAL;PROJECT" "" "DIRECTORY;TARGET;SOURCE" ${ARGN})
# search relative first and then absolute
SET(_HINTS "${CMAKE_CURRENT_LIST_DIR}/../.." "@CMAKE_INSTALL_PREFIX@")
# find kokkos_launch_compiler
FIND_PROGRAM(Kokkos_COMPILE_LAUNCHER
NAMES kokkos_launch_compiler
HINTS ${_HINTS}
PATHS ${_HINTS}
PATH_SUFFIXES bin)
IF(NOT Kokkos_COMPILE_LAUNCHER)
MESSAGE(FATAL_ERROR "Kokkos could not find 'kokkos_launch_compiler'. Please set '-DKokkos_COMPILE_LAUNCHER=/path/to/launcher'")
ENDIF()
IF(COMP_GLOBAL)
# if global, don't bother setting others
SET_PROPERTY(GLOBAL PROPERTY RULE_LAUNCH_COMPILE "${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER}")
SET_PROPERTY(GLOBAL PROPERTY RULE_LAUNCH_LINK "${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER}")
ELSE()
FOREACH(_TYPE PROJECT DIRECTORY TARGET SOURCE)
# make project/subproject scoping easy, e.g. KokkosCompilation(PROJECT) after project(...)
IF("${_TYPE}" STREQUAL "PROJECT" AND COMP_${_TYPE})
LIST(APPEND COMP_DIRECTORY ${PROJECT_SOURCE_DIR})
UNSET(COMP_${_TYPE})
ENDIF()
# set the properties if defined
IF(COMP_${_TYPE})
# MESSAGE(STATUS "Using nvcc_wrapper :: ${_TYPE} :: ${COMP_${_TYPE}}")
SET_PROPERTY(${_TYPE} ${COMP_${_TYPE}} PROPERTY RULE_LAUNCH_COMPILE "${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER}")
SET_PROPERTY(${_TYPE} ${COMP_${_TYPE}} PROPERTY RULE_LAUNCH_LINK "${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER}")
ENDIF()
ENDFOREACH()
ENDIF()
ENDFUNCTION()
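# Usage sketch (hypothetical target, source, and directory names):
#   kokkos_compilation(GLOBAL)                               # every compile/link command
#   kokkos_compilation(PROJECT)                              # current project/subproject
#   kokkos_compilation(TARGET my_lib my_exe)                 # specific targets
#   kokkos_compilation(SOURCE kernels.cpp)                   # specific source files
#   kokkos_compilation(DIRECTORY ${PROJECT_SOURCE_DIR}/src)  # all files in a directory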
# A test to check whether a downstream project set the C++ compiler to NVCC or not
# this is called only when Kokkos was installed with Kokkos_ENABLE_CUDA=ON
FUNCTION(kokkos_compiler_is_nvcc VAR COMPILER)
# Check if the compiler is nvcc (which really means nvcc_wrapper).
EXECUTE_PROCESS(COMMAND ${COMPILER} ${ARGN} --version
OUTPUT_VARIABLE INTERNAL_COMPILER_VERSION
OUTPUT_STRIP_TRAILING_WHITESPACE
RESULT_VARIABLE RET)
# something went wrong
IF(RET GREATER 0)
SET(${VAR} false PARENT_SCOPE)
ELSE()
STRING(REPLACE "\n" " - " INTERNAL_COMPILER_VERSION_ONE_LINE ${INTERNAL_COMPILER_VERSION} )
STRING(FIND ${INTERNAL_COMPILER_VERSION_ONE_LINE} "nvcc" INTERNAL_COMPILER_VERSION_CONTAINS_NVCC)
STRING(REGEX REPLACE "^ +" "" INTERNAL_HAVE_COMPILER_NVCC "${INTERNAL_HAVE_COMPILER_NVCC}")
IF(${INTERNAL_COMPILER_VERSION_CONTAINS_NVCC} GREATER -1)
SET(${VAR} true PARENT_SCOPE)
ELSE()
SET(${VAR} false PARENT_SCOPE)
ENDIF()
ENDIF()
ENDFUNCTION()
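# Usage sketch: query whether a compiler is (wrapping) nvcc and branch on the result
#   kokkos_compiler_is_nvcc(IS_NVCC ${CMAKE_CXX_COMPILER})
#   IF(NOT IS_NVCC)
#     MESSAGE(STATUS "CMAKE_CXX_COMPILER is not nvcc_wrapper")
#   ENDIF()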

View File

@ -1,4 +1,3 @@
/* /*
//@HEADER //@HEADER
// ************************************************************************ // ************************************************************************
@ -42,6 +41,9 @@
// ************************************************************************ // ************************************************************************
//@HEADER //@HEADER
*/ */
#ifndef @HEADER_GUARD_TAG@
#define @HEADER_GUARD_TAG@
#include <cuda/TestCuda_Category.hpp> @INCLUDE_NEXT_FILE@
#include <TestAtomicViews.hpp>
#endif

View File

@ -21,6 +21,7 @@
#cmakedefine KOKKOS_ENABLE_HPX #cmakedefine KOKKOS_ENABLE_HPX
#cmakedefine KOKKOS_ENABLE_MEMKIND #cmakedefine KOKKOS_ENABLE_MEMKIND
#cmakedefine KOKKOS_ENABLE_LIBRT #cmakedefine KOKKOS_ENABLE_LIBRT
#cmakedefine KOKKOS_ENABLE_SYCL
#ifndef __CUDA_ARCH__ #ifndef __CUDA_ARCH__
#cmakedefine KOKKOS_ENABLE_TM #cmakedefine KOKKOS_ENABLE_TM
@ -31,7 +32,6 @@
#endif #endif
/* General Settings */ /* General Settings */
#cmakedefine KOKKOS_ENABLE_CXX11
#cmakedefine KOKKOS_ENABLE_CXX14 #cmakedefine KOKKOS_ENABLE_CXX14
#cmakedefine KOKKOS_ENABLE_CXX17 #cmakedefine KOKKOS_ENABLE_CXX17
#cmakedefine KOKKOS_ENABLE_CXX20 #cmakedefine KOKKOS_ENABLE_CXX20
@ -58,7 +58,7 @@
/* TPL Settings */ /* TPL Settings */
#cmakedefine KOKKOS_ENABLE_HWLOC #cmakedefine KOKKOS_ENABLE_HWLOC
#cmakedefine KOKKOS_USE_LIBRT #cmakedefine KOKKOS_USE_LIBRT
#cmakedefine KOKKOS_ENABLE_HWBSPACE #cmakedefine KOKKOS_ENABLE_HBWSPACE
#cmakedefine KOKKOS_ENABLE_LIBDL #cmakedefine KOKKOS_ENABLE_LIBDL
#cmakedefine KOKKOS_IMPL_CUDA_CLANG_WORKAROUND #cmakedefine KOKKOS_IMPL_CUDA_CLANG_WORKAROUND

View File

@ -73,20 +73,20 @@ Compiler features are more fine-grained and require conflicting requests to be r
Suppose I have Suppose I have
```` ````
add_library(A a.cpp) add_library(A a.cpp)
target_compile_features(A PUBLIC cxx_std_11) target_compile_features(A PUBLIC cxx_std_14)
```` ````
then another target then another target
```` ````
add_library(B b.cpp) add_library(B b.cpp)
target_compile_features(B PUBLIC cxx_std_14) target_compile_features(B PUBLIC cxx_std_17)
target_link_libraries(A B) target_link_libraries(A B)
```` ````
I have requested two different features. I have requested two different features.
CMake understands the requests and knows that `cxx_std_11` is a subset of `cxx_std_14`. CMake understands the requests and knows that `cxx_std_14` is a subset of `cxx_std_17`.
CMake then picks C++14 for library `B`. CMake then picks C++17 for library `B`.
CMake would not have been able to do feature resolution if we had directly done: CMake would not have been able to do feature resolution if we had directly done:
```` ````
target_compile_options(A PUBLIC -std=c++11) target_compile_options(A PUBLIC -std=c++14)
```` ````
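By contrast, staying with compile features keeps the resolution automatic. A minimal sketch, extending the example above with a hypothetical third library `C`:
````
add_library(C c.cpp)
# C itself only asks for C++14 ...
target_compile_features(C PUBLIC cxx_std_14)
# ... but linking against B (which requests cxx_std_17) lets CMake
# resolve the stricter standard for C automatically.
target_link_libraries(C PUBLIC B)
````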
### Adding Kokkos Options ### Adding Kokkos Options

View File

@ -1,14 +1,16 @@
# @HEADER # @HEADER
# ************************************************************************ # ************************************************************************
# #
# Trilinos: An Object-Oriented Solver Framework # Kokkos v. 3.0
# Copyright (2001) Sandia Corporation # Copyright (2020) National Technology & Engineering
# Solutions of Sandia, LLC (NTESS).
# #
# Under the terms of Contract DE-NA0003525 with NTESS,
# the U.S. Government retains certain rights in this software.
# #
# Copyright (2001) Sandia Corporation. Under the terms of Contract # Redistribution and use in source and binary forms, with or without
# DE-AC04-94AL85000, there is a non-exclusive license for use of this # modification, are permitted provided that the following conditions are
# work by or on behalf of the U.S. Government. Export of this program # met:
# may require a license from the United States Government.
# #
# 1. Redistributions of source code must retain the above copyright # 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer. # notice, this list of conditions and the following disclaimer.
@ -21,10 +23,10 @@
# contributors may be used to endorse or promote products derived from # contributors may be used to endorse or promote products derived from
# this software without specific prior written permission. # this software without specific prior written permission.
# #
# THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY # THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
@ -33,22 +35,7 @@
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# #
# NOTICE: The United States Government is granted for itself and others # Questions? Contact Christian R. Trott (crtrott@sandia.gov)
# acting on its behalf a paid-up, nonexclusive, irrevocable worldwide
# license in this data to reproduce, prepare derivative works, and
# perform publicly and display publicly. Beginning five (5) years from
# July 25, 2001, the United States Government is granted for itself and
# others acting on its behalf a paid-up, nonexclusive, irrevocable
# worldwide license in this data to reproduce, prepare derivative works,
# distribute copies to the public, perform publicly and display
# publicly, and to permit others to do so.
#
# NEITHER THE UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT
# OF ENERGY, NOR SANDIA CORPORATION, NOR ANY OF THEIR EMPLOYEES, MAKES
# ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR
# RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY
# INFORMATION, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS
# THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS.
# #
# ************************************************************************ # ************************************************************************
# @HEADER # @HEADER

View File

@ -1,14 +1,16 @@
# @HEADER # @HEADER
# ************************************************************************ # ************************************************************************
# #
# Trilinos: An Object-Oriented Solver Framework # Kokkos v. 3.0
# Copyright (2001) Sandia Corporation # Copyright (2020) National Technology & Engineering
# Solutions of Sandia, LLC (NTESS).
# #
# Under the terms of Contract DE-NA0003525 with NTESS,
# the U.S. Government retains certain rights in this software.
# #
# Copyright (2001) Sandia Corporation. Under the terms of Contract # Redistribution and use in source and binary forms, with or without
# DE-AC04-94AL85000, there is a non-exclusive license for use of this # modification, are permitted provided that the following conditions are
# work by or on behalf of the U.S. Government. Export of this program # met:
# may require a license from the United States Government.
# #
# 1. Redistributions of source code must retain the above copyright # 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer. # notice, this list of conditions and the following disclaimer.
@ -21,10 +23,10 @@
# contributors may be used to endorse or promote products derived from # contributors may be used to endorse or promote products derived from
# this software without specific prior written permission. # this software without specific prior written permission.
# #
# THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY # THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
@ -33,22 +35,7 @@
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# #
# NOTICE: The United States Government is granted for itself and others # Questions? Contact Christian R. Trott (crtrott@sandia.gov)
# acting on its behalf a paid-up, nonexclusive, irrevocable worldwide
# license in this data to reproduce, prepare derivative works, and
# perform publicly and display publicly. Beginning five (5) years from
# July 25, 2001, the United States Government is granted for itself and
# others acting on its behalf a paid-up, nonexclusive, irrevocable
# worldwide license in this data to reproduce, prepare derivative works,
# distribute copies to the public, perform publicly and display
# publicly, and to permit others to do so.
#
# NEITHER THE UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT
# OF ENERGY, NOR SANDIA CORPORATION, NOR ANY OF THEIR EMPLOYEES, MAKES
# ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR
# RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY
# INFORMATION, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS
# THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS.
# #
# ************************************************************************ # ************************************************************************
# @HEADER # @HEADER

View File

@ -1,14 +1,16 @@
# @HEADER # @HEADER
# ************************************************************************ # ************************************************************************
# #
# Trilinos: An Object-Oriented Solver Framework # Kokkos v. 3.0
# Copyright (2001) Sandia Corporation # Copyright (2020) National Technology & Engineering
# Solutions of Sandia, LLC (NTESS).
# #
# Under the terms of Contract DE-NA0003525 with NTESS,
# the U.S. Government retains certain rights in this software.
# #
# Copyright (2001) Sandia Corporation. Under the terms of Contract # Redistribution and use in source and binary forms, with or without
# DE-AC04-94AL85000, there is a non-exclusive license for use of this # modification, are permitted provided that the following conditions are
# work by or on behalf of the U.S. Government. Export of this program # met:
# may require a license from the United States Government.
# #
# 1. Redistributions of source code must retain the above copyright # 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer. # notice, this list of conditions and the following disclaimer.
@ -21,10 +23,10 @@
# contributors may be used to endorse or promote products derived from # contributors may be used to endorse or promote products derived from
# this software without specific prior written permission. # this software without specific prior written permission.
# #
# THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY # THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
@ -33,22 +35,7 @@
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# #
# NOTICE: The United States Government is granted for itself and others # Questions? Contact Christian R. Trott (crtrott@sandia.gov)
# acting on its behalf a paid-up, nonexclusive, irrevocable worldwide
# license in this data to reproduce, prepare derivative works, and
# perform publicly and display publicly. Beginning five (5) years from
# July 25, 2001, the United States Government is granted for itself and
# others acting on its behalf a paid-up, nonexclusive, irrevocable
# worldwide license in this data to reproduce, prepare derivative works,
# distribute copies to the public, perform publicly and display
# publicly, and to permit others to do so.
#
# NEITHER THE UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT
# OF ENERGY, NOR SANDIA CORPORATION, NOR ANY OF THEIR EMPLOYEES, MAKES
# ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR
# RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY
# INFORMATION, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS
# THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS.
# #
# ************************************************************************ # ************************************************************************
# @HEADER # @HEADER

View File

@ -1,14 +1,16 @@
# @HEADER # @HEADER
# ************************************************************************ # ************************************************************************
# #
# Trilinos: An Object-Oriented Solver Framework # Kokkos v. 3.0
# Copyright (2001) Sandia Corporation # Copyright (2020) National Technology & Engineering
# Solutions of Sandia, LLC (NTESS).
# #
# Under the terms of Contract DE-NA0003525 with NTESS,
# the U.S. Government retains certain rights in this software.
# #
# Copyright (2001) Sandia Corporation. Under the terms of Contract # Redistribution and use in source and binary forms, with or without
# DE-AC04-94AL85000, there is a non-exclusive license for use of this # modification, are permitted provided that the following conditions are
# work by or on behalf of the U.S. Government. Export of this program # met:
# may require a license from the United States Government.
# #
# 1. Redistributions of source code must retain the above copyright # 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer. # notice, this list of conditions and the following disclaimer.
@ -21,10 +23,10 @@
# contributors may be used to endorse or promote products derived from # contributors may be used to endorse or promote products derived from
# this software without specific prior written permission. # this software without specific prior written permission.
# #
# THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY # THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
@ -33,22 +35,7 @@
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# #
# NOTICE: The United States Government is granted for itself and others # Questions? Contact Christian R. Trott (crtrott@sandia.gov)
# acting on its behalf a paid-up, nonexclusive, irrevocable worldwide
# license in this data to reproduce, prepare derivative works, and
# perform publicly and display publicly. Beginning five (5) years from
# July 25, 2001, the United States Government is granted for itself and
# others acting on its behalf a paid-up, nonexclusive, irrevocable
# worldwide license in this data to reproduce, prepare derivative works,
# distribute copies to the public, perform publicly and display
# publicly, and to permit others to do so.
#
# NEITHER THE UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT
# OF ENERGY, NOR SANDIA CORPORATION, NOR ANY OF THEIR EMPLOYEES, MAKES
# ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR
# RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY
# INFORMATION, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS
# THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS.
# #
# ************************************************************************ # ************************************************************************
# @HEADER # @HEADER

View File

@ -38,12 +38,6 @@ MACRO(GLOBAL_SET VARNAME)
SET(${VARNAME} ${ARGN} CACHE INTERNAL "" FORCE) SET(${VARNAME} ${ARGN} CACHE INTERNAL "" FORCE)
ENDMACRO() ENDMACRO()
FUNCTION(VERIFY_EMPTY CONTEXT)
if(${ARGN})
MESSAGE(FATAL_ERROR "Kokkos does not support all of Tribits. Unhandled arguments in ${CONTEXT}:\n${ARGN}")
endif()
ENDFUNCTION()
MACRO(PREPEND_GLOBAL_SET VARNAME) MACRO(PREPEND_GLOBAL_SET VARNAME)
ASSERT_DEFINED(${VARNAME}) ASSERT_DEFINED(${VARNAME})
GLOBAL_SET(${VARNAME} ${ARGN} ${${VARNAME}}) GLOBAL_SET(${VARNAME} ${ARGN} ${${VARNAME}})
@ -89,7 +83,7 @@ FUNCTION(KOKKOS_ADD_TEST)
CMAKE_PARSE_ARGUMENTS(TEST CMAKE_PARSE_ARGUMENTS(TEST
"" ""
"EXE;NAME;TOOL" "EXE;NAME;TOOL"
"" "ARGS"
${ARGN}) ${ARGN})
IF(TEST_EXE) IF(TEST_EXE)
SET(EXE_ROOT ${TEST_EXE}) SET(EXE_ROOT ${TEST_EXE})
@ -102,6 +96,7 @@ FUNCTION(KOKKOS_ADD_TEST)
NAME ${TEST_NAME} NAME ${TEST_NAME}
COMM serial mpi COMM serial mpi
NUM_MPI_PROCS 1 NUM_MPI_PROCS 1
ARGS ${TEST_ARGS}
${TEST_UNPARSED_ARGUMENTS} ${TEST_UNPARSED_ARGUMENTS}
ADDED_TESTS_NAMES_OUT ALL_TESTS_ADDED ADDED_TESTS_NAMES_OUT ALL_TESTS_ADDED
) )
@ -110,18 +105,25 @@ FUNCTION(KOKKOS_ADD_TEST)
SET(TEST_NAME ${PACKAGE_NAME}_${TEST_NAME}) SET(TEST_NAME ${PACKAGE_NAME}_${TEST_NAME})
SET(EXE ${PACKAGE_NAME}_${EXE_ROOT}) SET(EXE ${PACKAGE_NAME}_${EXE_ROOT})
# The function TRIBITS_ADD_TEST() has a CATEGORIES argument that defaults
# to BASIC. If a project elects to only enable tests marked as PERFORMANCE,
# the test won't actually be added and attempting to set a property on it below
# will yield an error.
if(TARGET ${EXE})
if(TEST_TOOL) if(TEST_TOOL)
add_dependencies(${EXE} ${TEST_TOOL}) #make sure the exe has to build the tool add_dependencies(${EXE} ${TEST_TOOL}) #make sure the exe has to build the tool
foreach(TEST_ADDED ${ALL_TESTS_ADDED}) foreach(TEST_ADDED ${ALL_TESTS_ADDED})
set_property(TEST ${TEST_ADDED} APPEND PROPERTY ENVIRONMENT "KOKKOS_PROFILE_LIBRARY=$<TARGET_FILE:${TEST_TOOL}>") set_property(TEST ${TEST_ADDED} APPEND PROPERTY ENVIRONMENT "KOKKOS_PROFILE_LIBRARY=$<TARGET_FILE:${TEST_TOOL}>")
endforeach() endforeach()
endif() endif()
endif()
else() else()
CMAKE_PARSE_ARGUMENTS(TEST CMAKE_PARSE_ARGUMENTS(TEST
"WILL_FAIL" "WILL_FAIL"
"FAIL_REGULAR_EXPRESSION;PASS_REGULAR_EXPRESSION;EXE;NAME;TOOL" "FAIL_REGULAR_EXPRESSION;PASS_REGULAR_EXPRESSION;EXE;NAME;TOOL"
"CATEGORIES;CMD_ARGS" "CATEGORIES;ARGS"
${ARGN}) ${ARGN})
SET(TESTS_ADDED)
# To match Tribits, we should always be receiving # To match Tribits, we should always be receiving
# the root names of exes/libs # the root names of exes/libs
IF(TEST_EXE) IF(TEST_EXE)
@ -133,11 +135,32 @@ FUNCTION(KOKKOS_ADD_TEST)
# These should be the full target name # These should be the full target name
SET(TEST_NAME ${PACKAGE_NAME}_${TEST_NAME}) SET(TEST_NAME ${PACKAGE_NAME}_${TEST_NAME})
SET(EXE ${PACKAGE_NAME}_${EXE_ROOT}) SET(EXE ${PACKAGE_NAME}_${EXE_ROOT})
IF (TEST_ARGS)
SET(TEST_NUMBER 0)
FOREACH (ARG_STR ${TEST_ARGS})
# This is passed as a single string blob to match TriBITS behavior
# We need this to be turned into a list
STRING(REPLACE " " ";" ARG_STR_LIST ${ARG_STR})
IF(WIN32) IF(WIN32)
ADD_TEST(NAME ${TEST_NAME} WORKING_DIRECTORY ${LIBRARY_OUTPUT_PATH} COMMAND ${EXE}${CMAKE_EXECUTABLE_SUFFIX} ${TEST_CMD_ARGS}) ADD_TEST(NAME ${TEST_NAME}${TEST_NUMBER} WORKING_DIRECTORY ${LIBRARY_OUTPUT_PATH}
COMMAND ${EXE}${CMAKE_EXECUTABLE_SUFFIX} ${ARG_STR_LIST})
ELSE() ELSE()
ADD_TEST(NAME ${TEST_NAME} COMMAND ${EXE} ${TEST_CMD_ARGS}) ADD_TEST(NAME ${TEST_NAME}${TEST_NUMBER} COMMAND ${EXE} ${ARG_STR_LIST})
ENDIF() ENDIF()
LIST(APPEND TESTS_ADDED "${TEST_NAME}${TEST_NUMBER}")
MATH(EXPR TEST_NUMBER "${TEST_NUMBER} + 1")
ENDFOREACH()
ELSE()
IF(WIN32)
ADD_TEST(NAME ${TEST_NAME} WORKING_DIRECTORY ${LIBRARY_OUTPUT_PATH}
COMMAND ${EXE}${CMAKE_EXECUTABLE_SUFFIX})
ELSE()
ADD_TEST(NAME ${TEST_NAME} COMMAND ${EXE})
ENDIF()
LIST(APPEND TESTS_ADDED "${TEST_NAME}")
ENDIF()
FOREACH(TEST_NAME ${TESTS_ADDED})
IF(TEST_WILL_FAIL) IF(TEST_WILL_FAIL)
SET_TESTS_PROPERTIES(${TEST_NAME} PROPERTIES WILL_FAIL ${TEST_WILL_FAIL}) SET_TESTS_PROPERTIES(${TEST_NAME} PROPERTIES WILL_FAIL ${TEST_WILL_FAIL})
ENDIF() ENDIF()
@ -151,6 +174,7 @@ FUNCTION(KOKKOS_ADD_TEST)
add_dependencies(${EXE} ${TEST_TOOL}) #make sure the exe has to build the tool add_dependencies(${EXE} ${TEST_TOOL}) #make sure the exe has to build the tool
set_property(TEST ${TEST_NAME} APPEND_STRING PROPERTY ENVIRONMENT "KOKKOS_PROFILE_LIBRARY=$<TARGET_FILE:${TEST_TOOL}>") set_property(TEST ${TEST_NAME} APPEND_STRING PROPERTY ENVIRONMENT "KOKKOS_PROFILE_LIBRARY=$<TARGET_FILE:${TEST_TOOL}>")
endif() endif()
ENDFOREACH()
VERIFY_EMPTY(KOKKOS_ADD_TEST ${TEST_UNPARSED_ARGUMENTS}) VERIFY_EMPTY(KOKKOS_ADD_TEST ${TEST_UNPARSED_ARGUMENTS})
endif() endif()
ENDFUNCTION() ENDFUNCTION()
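# Usage sketch (hypothetical names, not part of this file): each entry in ARGS is a
# single space-separated string and yields its own numbered test, e.g.
#   KOKKOS_ADD_TEST(NAME PerfTest_Example EXE PerformanceTest_Example
#                   ARGS "--benchmark 1" "--benchmark 2 --size 1000")
# registers ${PACKAGE_NAME}_PerfTest_Example0 and ${PACKAGE_NAME}_PerfTest_Example1.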

View File

@ -3,7 +3,7 @@ FUNCTION(kokkos_set_intel_flags full_standard int_standard)
STRING(TOLOWER ${full_standard} FULL_LC_STANDARD) STRING(TOLOWER ${full_standard} FULL_LC_STANDARD)
STRING(TOLOWER ${int_standard} INT_LC_STANDARD) STRING(TOLOWER ${int_standard} INT_LC_STANDARD)
# The following three blocks of code were copied from # The following three blocks of code were copied from
# /Modules/Compiler/Intel-CXX.cmake from CMake 3.7.2 and then modified. # /Modules/Compiler/Intel-CXX.cmake from CMake 3.18.1 and then modified.
IF(CMAKE_CXX_SIMULATE_ID STREQUAL MSVC) IF(CMAKE_CXX_SIMULATE_ID STREQUAL MSVC)
SET(_std -Qstd) SET(_std -Qstd)
SET(_ext c++) SET(_ext c++)
@ -11,20 +11,8 @@ FUNCTION(kokkos_set_intel_flags full_standard int_standard)
SET(_std -std) SET(_std -std)
SET(_ext gnu++) SET(_ext gnu++)
ENDIF() ENDIF()
IF(NOT KOKKOS_CXX_STANDARD STREQUAL 11 AND NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 15.0.2)
#There is no gnu++14 value supported; figure out what to do.
SET(KOKKOS_CXX_STANDARD_FLAG "${_std}=c++${FULL_LC_STANDARD}" PARENT_SCOPE) SET(KOKKOS_CXX_STANDARD_FLAG "${_std}=c++${FULL_LC_STANDARD}" PARENT_SCOPE)
SET(KOKKOS_CXX_INTERMEDIATE_STANDARD_FLAG "${_std}=c++${INT_LC_STANDARD}" PARENT_SCOPE) SET(KOKKOS_CXX_INTERMDIATE_STANDARD_FLAG "${_std}=${_ext}${INT_LC_STANDARD}" PARENT_SCOPE)
ELSEIF(KOKKOS_CXX_STANDARD STREQUAL 11 AND NOT CMAKE_CXX_COMPILER_VERSION VERSION_LESS 13.0)
IF (CMAKE_CXX_EXTENSIONS)
SET(KOKKOS_CXX_STANDARD_FLAG "${_std}=${_ext}c++11" PARENT_SCOPE)
ELSE()
SET(KOKKOS_CXX_STANDARD_FLAG "${_std}=c++11" PARENT_SCOPE)
ENDIF()
ELSE()
MESSAGE(FATAL_ERROR "Intel compiler version too low - need 13.0 for C++11 and 15.0 for C++14")
ENDIF()
ENDFUNCTION() ENDFUNCTION()

View File

@ -35,6 +35,7 @@ KOKKOS_ARCH_OPTION(ARMV80 HOST "ARMv8.0 Compatible CPU")
KOKKOS_ARCH_OPTION(ARMV81 HOST "ARMv8.1 Compatible CPU") KOKKOS_ARCH_OPTION(ARMV81 HOST "ARMv8.1 Compatible CPU")
KOKKOS_ARCH_OPTION(ARMV8_THUNDERX HOST "ARMv8 Cavium ThunderX CPU") KOKKOS_ARCH_OPTION(ARMV8_THUNDERX HOST "ARMv8 Cavium ThunderX CPU")
KOKKOS_ARCH_OPTION(ARMV8_THUNDERX2 HOST "ARMv8 Cavium ThunderX2 CPU") KOKKOS_ARCH_OPTION(ARMV8_THUNDERX2 HOST "ARMv8 Cavium ThunderX2 CPU")
KOKKOS_ARCH_OPTION(A64FX HOST "ARMv8.2 with SVE Support")
KOKKOS_ARCH_OPTION(WSM HOST "Intel Westmere CPU") KOKKOS_ARCH_OPTION(WSM HOST "Intel Westmere CPU")
KOKKOS_ARCH_OPTION(SNB HOST "Intel Sandy/Ivy Bridge CPUs") KOKKOS_ARCH_OPTION(SNB HOST "Intel Sandy/Ivy Bridge CPUs")
KOKKOS_ARCH_OPTION(HSW HOST "Intel Haswell CPUs") KOKKOS_ARCH_OPTION(HSW HOST "Intel Haswell CPUs")
@ -63,6 +64,7 @@ KOKKOS_ARCH_OPTION(ZEN HOST "AMD Zen architecture")
KOKKOS_ARCH_OPTION(ZEN2 HOST "AMD Zen2 architecture") KOKKOS_ARCH_OPTION(ZEN2 HOST "AMD Zen2 architecture")
KOKKOS_ARCH_OPTION(VEGA900 GPU "AMD GPU MI25 GFX900") KOKKOS_ARCH_OPTION(VEGA900 GPU "AMD GPU MI25 GFX900")
KOKKOS_ARCH_OPTION(VEGA906 GPU "AMD GPU MI50/MI60 GFX906") KOKKOS_ARCH_OPTION(VEGA906 GPU "AMD GPU MI50/MI60 GFX906")
KOKKOS_ARCH_OPTION(VEGA908 GPU "AMD GPU")
KOKKOS_ARCH_OPTION(INTEL_GEN GPU "Intel GPUs Gen9+") KOKKOS_ARCH_OPTION(INTEL_GEN GPU "Intel GPUs Gen9+")
@ -72,6 +74,11 @@ IF(KOKKOS_ENABLE_COMPILER_WARNINGS)
"-Wall" "-Wunused-parameter" "-Wshadow" "-pedantic" "-Wall" "-Wunused-parameter" "-Wshadow" "-pedantic"
"-Wsign-compare" "-Wtype-limits" "-Wuninitialized") "-Wsign-compare" "-Wtype-limits" "-Wuninitialized")
# OpenMPTarget compilers give erroneous warnings about sign comparison in loops
IF(KOKKOS_ENABLE_OPENMPTARGET)
LIST(REMOVE_ITEM COMMON_WARNINGS "-Wsign-compare")
ENDIF()
SET(GNU_WARNINGS "-Wempty-body" "-Wclobbered" "-Wignored-qualifiers" SET(GNU_WARNINGS "-Wempty-body" "-Wclobbered" "-Wignored-qualifiers"
${COMMON_WARNINGS}) ${COMMON_WARNINGS})
@ -106,6 +113,12 @@ ENDIF()
IF (KOKKOS_CXX_COMPILER_ID STREQUAL Clang) IF (KOKKOS_CXX_COMPILER_ID STREQUAL Clang)
SET(CUDA_ARCH_FLAG "--cuda-gpu-arch") SET(CUDA_ARCH_FLAG "--cuda-gpu-arch")
GLOBAL_APPEND(KOKKOS_CUDA_OPTIONS -x cuda) GLOBAL_APPEND(KOKKOS_CUDA_OPTIONS -x cuda)
# Kokkos_CUDA_DIR has priority over CUDAToolkit_BIN_DIR
IF (Kokkos_CUDA_DIR)
GLOBAL_APPEND(KOKKOS_CUDA_OPTIONS --cuda-path=${Kokkos_CUDA_DIR})
ELSEIF(CUDAToolkit_BIN_DIR)
GLOBAL_APPEND(KOKKOS_CUDA_OPTIONS --cuda-path=${CUDAToolkit_BIN_DIR}/..)
ENDIF()
IF (KOKKOS_ENABLE_CUDA) IF (KOKKOS_ENABLE_CUDA)
SET(KOKKOS_IMPL_CUDA_CLANG_WORKAROUND ON CACHE BOOL "enable CUDA Clang workarounds" FORCE) SET(KOKKOS_IMPL_CUDA_CLANG_WORKAROUND ON CACHE BOOL "enable CUDA Clang workarounds" FORCE)
ENDIF() ENDIF()
@ -167,6 +180,12 @@ IF (KOKKOS_ARCH_ARMV8_THUNDERX2)
) )
ENDIF() ENDIF()
IF (KOKKOS_ARCH_A64FX)
COMPILER_SPECIFIC_FLAGS(
DEFAULT -march=armv8.2-a+sve
)
ENDIF()
IF (KOKKOS_ARCH_ZEN) IF (KOKKOS_ARCH_ZEN)
COMPILER_SPECIFIC_FLAGS( COMPILER_SPECIFIC_FLAGS(
Intel -mavx2 Intel -mavx2
@ -327,6 +346,16 @@ IF (Kokkos_ENABLE_HIP)
ENDIF() ENDIF()
IF (Kokkos_ENABLE_SYCL)
COMPILER_SPECIFIC_FLAGS(
DEFAULT -fsycl
)
COMPILER_SPECIFIC_OPTIONS(
DEFAULT -fsycl-unnamed-lambda
)
ENDIF()
SET(CUDA_ARCH_ALREADY_SPECIFIED "") SET(CUDA_ARCH_ALREADY_SPECIFIED "")
FUNCTION(CHECK_CUDA_ARCH ARCH FLAG) FUNCTION(CHECK_CUDA_ARCH ARCH FLAG)
IF(KOKKOS_ARCH_${ARCH}) IF(KOKKOS_ARCH_${ARCH})
@ -392,6 +421,7 @@ ENDFUNCTION()
#to the corresponding flag name if ON #to the corresponding flag name if ON
CHECK_AMDGPU_ARCH(VEGA900 gfx900) # Radeon Instinct MI25 CHECK_AMDGPU_ARCH(VEGA900 gfx900) # Radeon Instinct MI25
CHECK_AMDGPU_ARCH(VEGA906 gfx906) # Radeon Instinct MI50 and MI60 CHECK_AMDGPU_ARCH(VEGA906 gfx906) # Radeon Instinct MI50 and MI60
CHECK_AMDGPU_ARCH(VEGA908 gfx908)
IF(KOKKOS_ENABLE_HIP AND NOT AMDGPU_ARCH_ALREADY_SPECIFIED) IF(KOKKOS_ENABLE_HIP AND NOT AMDGPU_ARCH_ALREADY_SPECIFIED)
MESSAGE(SEND_ERROR "HIP enabled but no AMD GPU architecture currently enabled. " MESSAGE(SEND_ERROR "HIP enabled but no AMD GPU architecture currently enabled. "
@ -477,35 +507,53 @@ ENDIF()
#CMake verbose is kind of pointless #CMake verbose is kind of pointless
#Let's just always print things #Let's just always print things
MESSAGE(STATUS "Execution Spaces:") MESSAGE(STATUS "Built-in Execution Spaces:")
FOREACH (_BACKEND CUDA OPENMPTARGET HIP) FOREACH (_BACKEND Cuda OpenMPTarget HIP SYCL)
IF(KOKKOS_ENABLE_${_BACKEND}) STRING(TOUPPER ${_BACKEND} UC_BACKEND)
IF(KOKKOS_ENABLE_${UC_BACKEND})
IF(_DEVICE_PARALLEL) IF(_DEVICE_PARALLEL)
MESSAGE(FATAL_ERROR "Multiple device parallel execution spaces are not allowed! " MESSAGE(FATAL_ERROR "Multiple device parallel execution spaces are not allowed! "
"Trying to enable execution space ${_BACKEND}, " "Trying to enable execution space ${_BACKEND}, "
"but execution space ${_DEVICE_PARALLEL} is already enabled. " "but execution space ${_DEVICE_PARALLEL} is already enabled. "
"Remove the CMakeCache.txt file and re-configure.") "Remove the CMakeCache.txt file and re-configure.")
ENDIF() ENDIF()
SET(_DEVICE_PARALLEL ${_BACKEND}) IF (${_BACKEND} STREQUAL "Cuda")
IF(KOKKOS_ENABLE_CUDA_UVM)
SET(_DEFAULT_DEVICE_MEMSPACE "Kokkos::${_BACKEND}UVMSpace")
ELSE()
SET(_DEFAULT_DEVICE_MEMSPACE "Kokkos::${_BACKEND}Space")
ENDIF()
SET(_DEVICE_PARALLEL "Kokkos::${_BACKEND}")
ELSE()
SET(_DEFAULT_DEVICE_MEMSPACE "Kokkos::Experimental::${_BACKEND}Space")
SET(_DEVICE_PARALLEL "Kokkos::Experimental::${_BACKEND}")
ENDIF()
ENDIF() ENDIF()
ENDFOREACH() ENDFOREACH()
IF(NOT _DEVICE_PARALLEL) IF(NOT _DEVICE_PARALLEL)
SET(_DEVICE_PARALLEL "NONE") SET(_DEVICE_PARALLEL "NoTypeDefined")
SET(_DEFAULT_DEVICE_MEMSPACE "NoTypeDefined")
ENDIF() ENDIF()
MESSAGE(STATUS " Device Parallel: ${_DEVICE_PARALLEL}") MESSAGE(STATUS " Device Parallel: ${_DEVICE_PARALLEL}")
UNSET(_DEVICE_PARALLEL) IF(KOKKOS_ENABLE_PTHREAD)
SET(KOKKOS_ENABLE_THREADS ON)
ENDIF()
FOREACH (_BACKEND OpenMP Threads HPX)
FOREACH (_BACKEND OPENMP PTHREAD HPX) STRING(TOUPPER ${_BACKEND} UC_BACKEND)
IF(KOKKOS_ENABLE_${_BACKEND}) IF(KOKKOS_ENABLE_${UC_BACKEND})
IF(_HOST_PARALLEL) IF(_HOST_PARALLEL)
MESSAGE(FATAL_ERROR "Multiple host parallel execution spaces are not allowed! " MESSAGE(FATAL_ERROR "Multiple host parallel execution spaces are not allowed! "
"Trying to enable execution space ${_BACKEND}, " "Trying to enable execution space ${_BACKEND}, "
"but execution space ${_HOST_PARALLEL} is already enabled. " "but execution space ${_HOST_PARALLEL} is already enabled. "
"Remove the CMakeCache.txt file and re-configure.") "Remove the CMakeCache.txt file and re-configure.")
ENDIF() ENDIF()
SET(_HOST_PARALLEL ${_BACKEND}) IF (${_BACKEND} STREQUAL "HPX")
SET(_HOST_PARALLEL "Kokkos::Experimental::${_BACKEND}")
ELSE()
SET(_HOST_PARALLEL "Kokkos::${_BACKEND}")
ENDIF()
ENDIF() ENDIF()
ENDFOREACH() ENDFOREACH()
@ -515,14 +563,11 @@ IF(NOT _HOST_PARALLEL AND NOT KOKKOS_ENABLE_SERIAL)
"and Kokkos_ENABLE_SERIAL=OFF.") "and Kokkos_ENABLE_SERIAL=OFF.")
ENDIF() ENDIF()
IF(NOT _HOST_PARALLEL) IF(_HOST_PARALLEL)
SET(_HOST_PARALLEL "NONE")
ENDIF()
MESSAGE(STATUS " Host Parallel: ${_HOST_PARALLEL}") MESSAGE(STATUS " Host Parallel: ${_HOST_PARALLEL}")
UNSET(_HOST_PARALLEL) ELSE()
SET(_HOST_PARALLEL "NoTypeDefined")
IF(KOKKOS_ENABLE_PTHREAD) MESSAGE(STATUS " Host Parallel: NoTypeDefined")
SET(KOKKOS_ENABLE_THREADS ON)
ENDIF() ENDIF()
IF(KOKKOS_ENABLE_SERIAL) IF(KOKKOS_ENABLE_SERIAL)

View File

@ -4,24 +4,42 @@ SET(KOKKOS_CXX_COMPILER ${CMAKE_CXX_COMPILER})
SET(KOKKOS_CXX_COMPILER_ID ${CMAKE_CXX_COMPILER_ID}) SET(KOKKOS_CXX_COMPILER_ID ${CMAKE_CXX_COMPILER_ID})
SET(KOKKOS_CXX_COMPILER_VERSION ${CMAKE_CXX_COMPILER_VERSION}) SET(KOKKOS_CXX_COMPILER_VERSION ${CMAKE_CXX_COMPILER_VERSION})
IF(Kokkos_ENABLE_CUDA) MACRO(kokkos_internal_have_compiler_nvcc)
# Check if the compiler is nvcc (which really means nvcc_wrapper). # Check if the compiler is nvcc (which really means nvcc_wrapper).
EXECUTE_PROCESS(COMMAND ${CMAKE_CXX_COMPILER} --version EXECUTE_PROCESS(COMMAND ${ARGN} --version
OUTPUT_VARIABLE INTERNAL_COMPILER_VERSION OUTPUT_VARIABLE INTERNAL_COMPILER_VERSION
OUTPUT_STRIP_TRAILING_WHITESPACE) OUTPUT_STRIP_TRAILING_WHITESPACE)
STRING(REPLACE "\n" " - " INTERNAL_COMPILER_VERSION_ONE_LINE ${INTERNAL_COMPILER_VERSION} ) STRING(REPLACE "\n" " - " INTERNAL_COMPILER_VERSION_ONE_LINE ${INTERNAL_COMPILER_VERSION} )
STRING(FIND ${INTERNAL_COMPILER_VERSION_ONE_LINE} "nvcc" INTERNAL_COMPILER_VERSION_CONTAINS_NVCC) STRING(FIND ${INTERNAL_COMPILER_VERSION_ONE_LINE} "nvcc" INTERNAL_COMPILER_VERSION_CONTAINS_NVCC)
STRING(REGEX REPLACE "^ +" "" INTERNAL_HAVE_COMPILER_NVCC "${INTERNAL_HAVE_COMPILER_NVCC}")
STRING(REGEX REPLACE "^ +" ""
INTERNAL_HAVE_COMPILER_NVCC "${INTERNAL_HAVE_COMPILER_NVCC}")
IF(${INTERNAL_COMPILER_VERSION_CONTAINS_NVCC} GREATER -1) IF(${INTERNAL_COMPILER_VERSION_CONTAINS_NVCC} GREATER -1)
SET(INTERNAL_HAVE_COMPILER_NVCC true) SET(INTERNAL_HAVE_COMPILER_NVCC true)
ELSE() ELSE()
SET(INTERNAL_HAVE_COMPILER_NVCC false) SET(INTERNAL_HAVE_COMPILER_NVCC false)
ENDIF() ENDIF()
ENDMACRO()
IF(Kokkos_ENABLE_CUDA)
# find kokkos_launch_compiler
FIND_PROGRAM(Kokkos_COMPILE_LAUNCHER
NAMES kokkos_launch_compiler
HINTS ${PROJECT_SOURCE_DIR}
PATHS ${PROJECT_SOURCE_DIR}
PATH_SUFFIXES bin)
# check if compiler was set to nvcc_wrapper
kokkos_internal_have_compiler_nvcc(${CMAKE_CXX_COMPILER})
# if launcher was found and nvcc_wrapper was not specified as
# compiler, set to use launcher. Will ensure CMAKE_CXX_COMPILER
# is replaced by nvcc_wrapper
IF(Kokkos_COMPILE_LAUNCHER AND NOT INTERNAL_HAVE_COMPILER_NVCC AND NOT KOKKOS_CXX_COMPILER_ID STREQUAL Clang)
# the first argument to launcher is always the C++ compiler defined by cmake
# if the second argument matches the C++ compiler, it forwards the rest of the
# args to nvcc_wrapper
kokkos_internal_have_compiler_nvcc(
${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_COMPILER} -DKOKKOS_DEPENDENCE)
SET(INTERNAL_USE_COMPILER_LAUNCHER true)
ENDIF()
ENDIF() ENDIF()
IF(INTERNAL_HAVE_COMPILER_NVCC) IF(INTERNAL_HAVE_COMPILER_NVCC)
@ -36,6 +54,35 @@ IF(INTERNAL_HAVE_COMPILER_NVCC)
STRING(SUBSTRING ${TEMP_CXX_COMPILER_VERSION} 1 -1 TEMP_CXX_COMPILER_VERSION) STRING(SUBSTRING ${TEMP_CXX_COMPILER_VERSION} 1 -1 TEMP_CXX_COMPILER_VERSION)
SET(KOKKOS_CXX_COMPILER_VERSION ${TEMP_CXX_COMPILER_VERSION} CACHE STRING INTERNAL FORCE) SET(KOKKOS_CXX_COMPILER_VERSION ${TEMP_CXX_COMPILER_VERSION} CACHE STRING INTERNAL FORCE)
MESSAGE(STATUS "Compiler Version: ${KOKKOS_CXX_COMPILER_VERSION}") MESSAGE(STATUS "Compiler Version: ${KOKKOS_CXX_COMPILER_VERSION}")
IF(INTERNAL_USE_COMPILER_LAUNCHER)
IF(Kokkos_LAUNCH_COMPILER_INFO)
GET_FILENAME_COMPONENT(BASE_COMPILER_NAME ${CMAKE_CXX_COMPILER} NAME)
# does not have STATUS intentionally
MESSAGE("")
MESSAGE("Kokkos_LAUNCH_COMPILER_INFO (${Kokkos_COMPILE_LAUNCHER}):")
MESSAGE(" - Kokkos + CUDA backend requires the C++ files to be compiled as CUDA code.")
MESSAGE(" - kokkos_launch_compiler permits CMAKE_CXX_COMPILER to be set to a traditional C++ compiler when Kokkos_ENABLE_CUDA=ON")
MESSAGE(" by prefixing all the compile and link commands with the path to the script + CMAKE_CXX_COMPILER (${CMAKE_CXX_COMPILER}).")
MESSAGE(" - If any of the compile or link commands have CMAKE_CXX_COMPILER as the first argument, it replaces CMAKE_CXX_COMPILER with nvcc_wrapper.")
MESSAGE(" - If the compile or link command is not CMAKE_CXX_COMPILER, it just executes the command.")
MESSAGE(" - If using ccache, set CMAKE_CXX_COMPILER to nvcc_wrapper explicitly.")
MESSAGE(" - kokkos_compiler_launcher is available to downstream projects as well.")
MESSAGE(" - If CMAKE_CXX_COMPILER=nvcc_wrapper, all legacy behavior will be preserved during 'find_package(Kokkos)'")
MESSAGE(" - If CMAKE_CXX_COMPILER is not nvcc_wrapper, 'find_package(Kokkos)' will apply 'kokkos_compilation(GLOBAL)' unless separable compilation is enabled")
MESSAGE(" - This can be disabled via '-DKokkos_LAUNCH_COMPILER=OFF'")
MESSAGE(" - Use 'find_package(Kokkos COMPONENTS separable_compilation)' to enable separable compilation")
MESSAGE(" - Separable compilation allows you to control the scope of where the compiler transformation behavior (${BASE_COMPILER_NAME} -> nvcc_wrapper) is applied")
MESSAGE(" - The compiler transformation can be applied on a per-project, per-directory, per-target, and/or per-source-file basis")
MESSAGE(" - 'kokkos_compilation(PROJECT)' will apply the compiler transformation to all targets in a project/subproject")
MESSAGE(" - 'kokkos_compilation(TARGET <TARGET> [<TARGETS>...])' will apply the compiler transformation to the specified target(s)")
MESSAGE(" - 'kokkos_compilation(SOURCE <SOURCE> [<SOURCES>...])' will apply the compiler transformation to the specified source file(s)")
MESSAGE(" - 'kokkos_compilation(DIRECTORY <DIR> [<DIRS>...])' will apply the compiler transformation to the specified directories")
MESSAGE("")
ELSE()
MESSAGE(STATUS "kokkos_launch_compiler (${Kokkos_COMPILE_LAUNCHER}) is enabled... Set Kokkos_LAUNCH_COMPILER_INFO=ON for more info.")
ENDIF()
kokkos_compilation(GLOBAL)
ENDIF()
ENDIF() ENDIF()
IF(Kokkos_ENABLE_HIP) IF(Kokkos_ENABLE_HIP)
@ -90,38 +137,49 @@ IF(KOKKOS_CXX_COMPILER_ID STREQUAL Cray OR KOKKOS_CLANG_IS_CRAY)
ENDIF() ENDIF()
ENDIF() ENDIF()
IF(KOKKOS_CXX_COMPILER_ID STREQUAL Fujitsu)
# Set Fujitsu's compiler version, which is not detected by CMake
EXECUTE_PROCESS(COMMAND ${CMAKE_CXX_COMPILER} --version
OUTPUT_VARIABLE INTERNAL_CXX_COMPILER_VERSION
OUTPUT_STRIP_TRAILING_WHITESPACE)
STRING(REGEX MATCH "[0-9]+\\.[0-9]+\\.[0-9]+"
TEMP_CXX_COMPILER_VERSION ${INTERNAL_CXX_COMPILER_VERSION})
SET(KOKKOS_CXX_COMPILER_VERSION ${TEMP_CXX_COMPILER_VERSION} CACHE STRING INTERNAL FORCE)
ENDIF()
# Enforce the minimum compilers supported by Kokkos. # Enforce the minimum compilers supported by Kokkos.
SET(KOKKOS_MESSAGE_TEXT "Compiler not supported by Kokkos. Required compiler versions:") SET(KOKKOS_MESSAGE_TEXT "Compiler not supported by Kokkos. Required compiler versions:")
SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n Clang 3.5.2 or higher") SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n Clang 4.0.0 or higher")
SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n GCC 4.8.4 or higher") SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n GCC 5.3.0 or higher")
SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n Intel 15.0.2 or higher") SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n Intel 17.0.0 or higher")
SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n NVCC 9.0.69 or higher") SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n NVCC 9.2.88 or higher")
SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n HIPCC 3.5.0 or higher") SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n HIPCC 3.8.0 or higher")
SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n PGI 17.1 or higher\n") SET(KOKKOS_MESSAGE_TEXT "${KOKKOS_MESSAGE_TEXT}\n PGI 17.4 or higher\n")
IF(KOKKOS_CXX_COMPILER_ID STREQUAL Clang) IF(KOKKOS_CXX_COMPILER_ID STREQUAL Clang)
IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 3.5.2) IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 4.0.0)
MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}") MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
ENDIF() ENDIF()
ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL GNU) ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL GNU)
IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 4.8.4) IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 5.3.0)
MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}") MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
ENDIF() ENDIF()
ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL Intel) ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL Intel)
IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 15.0.2) IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 17.0.0)
MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}") MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
ENDIF() ENDIF()
ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA) ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA)
IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 9.0.69) IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 9.2.88)
MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}") MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
ENDIF() ENDIF()
SET(CMAKE_CXX_EXTENSIONS OFF CACHE BOOL "Kokkos turns off CXX extensions" FORCE) SET(CMAKE_CXX_EXTENSIONS OFF CACHE BOOL "Kokkos turns off CXX extensions" FORCE)
ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL HIP) ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL HIP)
IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 3.5.0) IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 3.8.0)
MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}") MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
ENDIF() ENDIF()
ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL PGI) ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL PGI)
IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 17.1) IF(KOKKOS_CXX_COMPILER_VERSION VERSION_LESS 17.4)
MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}") MESSAGE(FATAL_ERROR "${KOKKOS_MESSAGE_TEXT}")
ENDIF() ENDIF()
ENDIF() ENDIF()

View File

@ -1,4 +1,4 @@
IF(KOKKOS_CXX_COMPILER_ID STREQUAL Clang AND KOKKOS_ENABLE_OPENMP AND NOT KOKKOS_CLANG_IS_CRAY AND NOT "x${CMAKE_CXX_SIMULATE_ID}" STREQUAL "xMSVC") IF(KOKKOS_CXX_COMPILER_ID STREQUAL Clang AND KOKKOS_ENABLE_OPENMP AND NOT KOKKOS_CLANG_IS_CRAY AND NOT KOKKOS_COMPILER_CLANG_MSVC)
# The clang "version" doesn't actually tell you what runtimes and tools # The clang "version" doesn't actually tell you what runtimes and tools
# were built into Clang. We should therefore make sure that libomp # were built into Clang. We should therefore make sure that libomp
# was actually built into Clang. Otherwise the user will get nonsensical # was actually built into Clang. Otherwise the user will get nonsensical

View File

@ -25,6 +25,18 @@ IF (KOKKOS_ENABLE_PTHREAD)
SET(KOKKOS_ENABLE_THREADS ON) SET(KOKKOS_ENABLE_THREADS ON)
ENDIF() ENDIF()
# detect clang++ / cl / clang-cl clashes
IF (CMAKE_CXX_COMPILER_ID STREQUAL Clang AND "x${CMAKE_CXX_SIMULATE_ID}" STREQUAL "xMSVC")
# this specific test requires CMake >= 3.15
IF ("x${CMAKE_CXX_COMPILER_FRONTEND_VARIANT}" STREQUAL "xGNU")
# use pure clang++ instead of clang-cl
SET(KOKKOS_COMPILER_CLANG_MSVC OFF)
ELSE()
# it defaults to clang-cl
SET(KOKKOS_COMPILER_CLANG_MSVC ON)
ENDIF()
ENDIF()
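# Illustration (assumed configure commands) of how the two LLVM drivers are classified:
#   cmake -DCMAKE_CXX_COMPILER=clang++ ...   -> FRONTEND_VARIANT "GNU"  -> KOKKOS_COMPILER_CLANG_MSVC=OFF
#   cmake -DCMAKE_CXX_COMPILER=clang-cl ...  -> FRONTEND_VARIANT "MSVC" -> KOKKOS_COMPILER_CLANG_MSVC=ON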
IF(Trilinos_ENABLE_Kokkos AND Trilinos_ENABLE_OpenMP) IF(Trilinos_ENABLE_Kokkos AND Trilinos_ENABLE_OpenMP)
SET(OMP_DEFAULT ON) SET(OMP_DEFAULT ON)
ELSE() ELSE()
@ -39,13 +51,16 @@ IF(KOKKOS_ENABLE_OPENMP)
IF(KOKKOS_CLANG_IS_INTEL) IF(KOKKOS_CLANG_IS_INTEL)
SET(ClangOpenMPFlag -fiopenmp) SET(ClangOpenMPFlag -fiopenmp)
ENDIF() ENDIF()
IF(KOKKOS_CXX_COMPILER_ID STREQUAL Clang AND "x${CMAKE_CXX_SIMULATE_ID}" STREQUAL "xMSVC") IF(KOKKOS_COMPILER_CLANG_MSVC)
#expression /openmp yields error, so add a specific Clang flag #for clang-cl expression /openmp yields an error, so directly add the specific Clang flag
COMPILER_SPECIFIC_OPTIONS(Clang /clang:-fopenmp) SET(ClangOpenMPFlag /clang:-fopenmp=libomp)
#link omp library from LLVM lib dir ENDIF()
IF(WIN32 AND CMAKE_CXX_COMPILER_ID STREQUAL Clang)
#link omp library from LLVM lib dir, no matter if it is clang-cl or clang++
get_filename_component(LLVM_BIN_DIR ${CMAKE_CXX_COMPILER_AR} DIRECTORY) get_filename_component(LLVM_BIN_DIR ${CMAKE_CXX_COMPILER_AR} DIRECTORY)
COMPILER_SPECIFIC_LIBS(Clang "${LLVM_BIN_DIR}/../lib/libomp.lib") COMPILER_SPECIFIC_LIBS(Clang "${LLVM_BIN_DIR}/../lib/libomp.lib")
ELSEIF(KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA) ENDIF()
IF(KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA)
COMPILER_SPECIFIC_FLAGS( COMPILER_SPECIFIC_FLAGS(
COMPILER_ID KOKKOS_CXX_HOST_COMPILER_ID COMPILER_ID KOKKOS_CXX_HOST_COMPILER_ID
Clang -Xcompiler ${ClangOpenMPFlag} Clang -Xcompiler ${ClangOpenMPFlag}
@ -105,9 +120,11 @@ KOKKOS_DEVICE_OPTION(CUDA ${CUDA_DEFAULT} DEVICE "Whether to build CUDA backend"
IF (KOKKOS_ENABLE_CUDA) IF (KOKKOS_ENABLE_CUDA)
GLOBAL_SET(KOKKOS_DONT_ALLOW_EXTENSIONS "CUDA enabled") GLOBAL_SET(KOKKOS_DONT_ALLOW_EXTENSIONS "CUDA enabled")
IF(WIN32) IF(WIN32 AND NOT KOKKOS_CXX_COMPILER_ID STREQUAL Clang)
GLOBAL_APPEND(KOKKOS_COMPILE_OPTIONS -x cu) GLOBAL_APPEND(KOKKOS_COMPILE_OPTIONS -x cu)
ENDIF() ENDIF()
## Cuda has extra setup requirements, turn on Kokkos_Setup_Cuda.hpp in macros
LIST(APPEND DEVICE_SETUP_LIST Cuda)
ENDIF() ENDIF()
# We want this to default to OFF for cache reasons, but if no # We want this to default to OFF for cache reasons, but if no
@ -128,3 +145,10 @@ KOKKOS_DEVICE_OPTION(SERIAL ${SERIAL_DEFAULT} HOST "Whether to build serial back
KOKKOS_DEVICE_OPTION(HPX OFF HOST "Whether to build HPX backend (experimental)") KOKKOS_DEVICE_OPTION(HPX OFF HOST "Whether to build HPX backend (experimental)")
KOKKOS_DEVICE_OPTION(HIP OFF DEVICE "Whether to build HIP backend") KOKKOS_DEVICE_OPTION(HIP OFF DEVICE "Whether to build HIP backend")
## HIP has extra setup requirements, turn on Kokkos_Setup_HIP.hpp in macros
IF (KOKKOS_ENABLE_HIP)
LIST(APPEND DEVICE_SETUP_LIST HIP)
ENDIF()
KOKKOS_DEVICE_OPTION(SYCL OFF DEVICE "Whether to build SYCL backend")

View File

@ -154,13 +154,13 @@ MACRO(kokkos_export_imported_tpl NAME)
KOKKOS_APPEND_CONFIG_LINE("SET_TARGET_PROPERTIES(${NAME} PROPERTIES") KOKKOS_APPEND_CONFIG_LINE("SET_TARGET_PROPERTIES(${NAME} PROPERTIES")
GET_TARGET_PROPERTY(TPL_LIBRARY ${NAME} IMPORTED_LOCATION) GET_TARGET_PROPERTY(TPL_LIBRARY ${NAME} IMPORTED_LOCATION)
IF(TPL_LIBRARY) IF(TPL_LIBRARY)
KOKKOS_APPEND_CONFIG_LINE("IMPORTED_LOCATION ${TPL_LIBRARY}") KOKKOS_APPEND_CONFIG_LINE("IMPORTED_LOCATION \"${TPL_LIBRARY}\"")
ENDIF() ENDIF()
ENDIF() ENDIF()
GET_TARGET_PROPERTY(TPL_INCLUDES ${NAME} INTERFACE_INCLUDE_DIRECTORIES) GET_TARGET_PROPERTY(TPL_INCLUDES ${NAME} INTERFACE_INCLUDE_DIRECTORIES)
IF(TPL_INCLUDES) IF(TPL_INCLUDES)
KOKKOS_APPEND_CONFIG_LINE("INTERFACE_INCLUDE_DIRECTORIES ${TPL_INCLUDES}") KOKKOS_APPEND_CONFIG_LINE("INTERFACE_INCLUDE_DIRECTORIES \"${TPL_INCLUDES}\"")
ENDIF() ENDIF()
GET_TARGET_PROPERTY(TPL_COMPILE_OPTIONS ${NAME} INTERFACE_COMPILE_OPTIONS) GET_TARGET_PROPERTY(TPL_COMPILE_OPTIONS ${NAME} INTERFACE_COMPILE_OPTIONS)
@ -178,7 +178,7 @@ MACRO(kokkos_export_imported_tpl NAME)
GET_TARGET_PROPERTY(TPL_LINK_LIBRARIES ${NAME} INTERFACE_LINK_LIBRARIES) GET_TARGET_PROPERTY(TPL_LINK_LIBRARIES ${NAME} INTERFACE_LINK_LIBRARIES)
IF(TPL_LINK_LIBRARIES) IF(TPL_LINK_LIBRARIES)
KOKKOS_APPEND_CONFIG_LINE("INTERFACE_LINK_LIBRARIES ${TPL_LINK_LIBRARIES}") KOKKOS_APPEND_CONFIG_LINE("INTERFACE_LINK_LIBRARIES \"${TPL_LINK_LIBRARIES}\"")
ENDIF() ENDIF()
KOKKOS_APPEND_CONFIG_LINE(")") KOKKOS_APPEND_CONFIG_LINE(")")
KOKKOS_APPEND_CONFIG_LINE("ENDIF()") KOKKOS_APPEND_CONFIG_LINE("ENDIF()")
@ -770,7 +770,7 @@ FUNCTION(kokkos_link_tpl TARGET)
ENDFUNCTION() ENDFUNCTION()
FUNCTION(COMPILER_SPECIFIC_OPTIONS_HELPER) FUNCTION(COMPILER_SPECIFIC_OPTIONS_HELPER)
SET(COMPILERS NVIDIA PGI XL DEFAULT Cray Intel Clang AppleClang IntelClang GNU HIP) SET(COMPILERS NVIDIA PGI XL DEFAULT Cray Intel Clang AppleClang IntelClang GNU HIP Fujitsu)
CMAKE_PARSE_ARGUMENTS( CMAKE_PARSE_ARGUMENTS(
PARSE PARSE
"LINK_OPTIONS;COMPILE_OPTIONS;COMPILE_DEFINITIONS;LINK_LIBRARIES" "LINK_OPTIONS;COMPILE_OPTIONS;COMPILE_DEFINITIONS;LINK_LIBRARIES"
@ -844,7 +844,6 @@ ENDFUNCTION(COMPILER_SPECIFIC_DEFS)
FUNCTION(COMPILER_SPECIFIC_LIBS) FUNCTION(COMPILER_SPECIFIC_LIBS)
COMPILER_SPECIFIC_OPTIONS_HELPER(${ARGN} LINK_LIBRARIES) COMPILER_SPECIFIC_OPTIONS_HELPER(${ARGN} LINK_LIBRARIES)
ENDFUNCTION(COMPILER_SPECIFIC_LIBS) ENDFUNCTION(COMPILER_SPECIFIC_LIBS)
# Given a list of the form # Given a list of the form
# key1;value1;key2;value2,... # key1;value1;key2;value2,...
# Create a list of all keys in a variable named ${KEY_LIST_NAME} # Create a list of all keys in a variable named ${KEY_LIST_NAME}
@ -877,3 +876,114 @@ FUNCTION(KOKKOS_CHECK_DEPRECATED_OPTIONS)
ENDIF() ENDIF()
ENDFOREACH() ENDFOREACH()
ENDFUNCTION() ENDFUNCTION()
# this function checks whether the current CXX compiler supports building CUDA
FUNCTION(kokkos_cxx_compiler_cuda_test _VAR)
# don't run this test every time
IF(DEFINED ${_VAR})
RETURN()
ENDIF()
FILE(WRITE ${PROJECT_BINARY_DIR}/compile_tests/compiles_cuda.cpp
"
#include <cuda.h>
#include <cstdlib>
__global__
void kernel(int sz, double* data)
{
auto _beg = blockIdx.x * blockDim.x + threadIdx.x;
for(int i = _beg; i < sz; ++i)
data[i] += static_cast<double>(i);
}
int main()
{
double* data = nullptr;
int blocks = 64;
int grids = 64;
auto ret = cudaMalloc(&data, blocks * grids * sizeof(double));
if(ret != cudaSuccess)
return EXIT_FAILURE;
kernel<<<grids, blocks>>>(blocks * grids, data);
cudaDeviceSynchronize();
return EXIT_SUCCESS;
}
")
TRY_COMPILE(_RET
${PROJECT_BINARY_DIR}/compile_tests
SOURCES ${PROJECT_BINARY_DIR}/compile_tests/compiles_cuda.cpp)
SET(${_VAR} ${_RET} CACHE STRING "CXX compiler supports building CUDA")
ENDFUNCTION()
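# Illustrative usage sketch (for documentation only, not taken from the upstream file):
# the probe caches its result, so repeated calls are cheap; this simply reports the outcome.
kokkos_cxx_compiler_cuda_test(Kokkos_CXX_COMPILER_COMPILES_CUDA)
IF(Kokkos_CXX_COMPILER_COMPILES_CUDA)
  MESSAGE(STATUS "C++ compiler can compile CUDA sources directly; kokkos_launch_compiler is not required")
ELSE()
  MESSAGE(STATUS "C++ compiler cannot compile CUDA directly; kokkos_compilation() can route builds through kokkos_launch_compiler")
ENDIF()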
# this function is provided to easily select which files use nvcc_wrapper:
#
# GLOBAL --> all files
# TARGET --> all files in a target
# SOURCE --> specific source files
# DIRECTORY --> all files in directory
# PROJECT --> all files/targets in a project/subproject
#
FUNCTION(kokkos_compilation)
# check whether the compiler already supports building CUDA
KOKKOS_CXX_COMPILER_CUDA_TEST(Kokkos_CXX_COMPILER_COMPILES_CUDA)
# if the compiler can already compile CUDA on its own, the launcher is not needed, so just return
IF(Kokkos_CXX_COMPILER_COMPILES_CUDA)
RETURN()
ENDIF()
CMAKE_PARSE_ARGUMENTS(COMP "GLOBAL;PROJECT" "" "DIRECTORY;TARGET;SOURCE" ${ARGN})
# find kokkos_launch_compiler
FIND_PROGRAM(Kokkos_COMPILE_LAUNCHER
NAMES kokkos_launch_compiler
HINTS ${PROJECT_SOURCE_DIR}
PATHS ${PROJECT_SOURCE_DIR}
PATH_SUFFIXES bin)
IF(NOT Kokkos_COMPILE_LAUNCHER)
MESSAGE(FATAL_ERROR "Kokkos could not find 'kokkos_launch_compiler'. Please set '-DKokkos_COMPILE_LAUNCHER=/path/to/launcher'")
ENDIF()
IF(COMP_GLOBAL)
# if global, don't bother setting others
SET_PROPERTY(GLOBAL PROPERTY RULE_LAUNCH_COMPILE "${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER}")
SET_PROPERTY(GLOBAL PROPERTY RULE_LAUNCH_LINK "${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER}")
ELSE()
FOREACH(_TYPE PROJECT DIRECTORY TARGET SOURCE)
# make project/subproject scoping easy, e.g. KokkosCompilation(PROJECT) after project(...)
IF("${_TYPE}" STREQUAL "PROJECT" AND COMP_${_TYPE})
LIST(APPEND COMP_DIRECTORY ${PROJECT_SOURCE_DIR})
UNSET(COMP_${_TYPE})
ENDIF()
# set the properties if defined
IF(COMP_${_TYPE})
# MESSAGE(STATUS "Using nvcc_wrapper :: ${_TYPE} :: ${COMP_${_TYPE}}")
SET_PROPERTY(${_TYPE} ${COMP_${_TYPE}} PROPERTY RULE_LAUNCH_COMPILE "${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER}")
SET_PROPERTY(${_TYPE} ${COMP_${_TYPE}} PROPERTY RULE_LAUNCH_LINK "${Kokkos_COMPILE_LAUNCHER} ${CMAKE_CXX_COMPILER}")
ENDIF()
ENDFOREACH()
ENDIF()
ENDFUNCTION()
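# Illustrative usage sketch ('my_app' and 'kernels.cpp' are hypothetical names used
# purely for documentation); typical calls look like:
#   kokkos_compilation(GLOBAL)               # every compile/link rule in the build
#   kokkos_compilation(TARGET my_app)        # only the sources of one target
#   kokkos_compilation(SOURCE kernels.cpp)   # a single translation unit
#   kokkos_compilation(PROJECT)              # all targets in the current project/subproject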
## KOKKOS_CONFIG_HEADER - parse the data list, which is a list of backend names,
##                        and create the output config header file; used for
##                        creating dynamic include files based on enabled backends
##
## SRC_FILE is the input template file
## TARGET_FILE is the output file
## HEADER_GUARD is the text used in the include header guard
## HEADER_PREFIX is the prefix used with each include (i.e. fwd, decl, setup)
## DATA_LIST is the list of backends to include in the generated file
FUNCTION(KOKKOS_CONFIG_HEADER SRC_FILE TARGET_FILE HEADER_GUARD HEADER_PREFIX DATA_LIST)
SET(HEADER_GUARD_TAG "${HEADER_GUARD}_HPP_")
CONFIGURE_FILE(cmake/${SRC_FILE} ${PROJECT_BINARY_DIR}/temp/${TARGET_FILE}.work COPYONLY)
FOREACH( BACKEND_NAME ${DATA_LIST} )
SET(INCLUDE_NEXT_FILE "#include <${HEADER_PREFIX}_${BACKEND_NAME}.hpp>
\@INCLUDE_NEXT_FILE\@")
CONFIGURE_FILE(${PROJECT_BINARY_DIR}/temp/${TARGET_FILE}.work ${PROJECT_BINARY_DIR}/temp/${TARGET_FILE}.work @ONLY)
ENDFOREACH()
SET(INCLUDE_NEXT_FILE "" )
CONFIGURE_FILE(${PROJECT_BINARY_DIR}/temp/${TARGET_FILE}.work ${TARGET_FILE} @ONLY)
ENDFUNCTION()
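# Worked example (illustrative, derived from the function above): with HEADER_PREFIX
# "decl/Kokkos_Declare" and DATA_LIST "SERIAL;CUDA", each loop iteration replaces
# @INCLUDE_NEXT_FILE@ with one include plus a fresh placeholder, so the generated
# header ends up containing
#   #include <decl/Kokkos_Declare_SERIAL.hpp>
#   #include <decl/Kokkos_Declare_CUDA.hpp>
# and the final CONFIGURE_FILE call clears the trailing placeholder.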

View File

@ -1,19 +1,17 @@
# From CMake 3.10 documentation # From CMake 3.10 documentation
#This can run at any time #This can run at any time
KOKKOS_OPTION(CXX_STANDARD "" STRING "The C++ standard for Kokkos to use: 11, 14, 17, or 20. If empty, this will default to CMAKE_CXX_STANDARD. If both CMAKE_CXX_STANDARD and Kokkos_CXX_STANDARD are empty, this will default to 11") KOKKOS_OPTION(CXX_STANDARD "" STRING "The C++ standard for Kokkos to use: 14, 17, or 20. If empty, this will default to CMAKE_CXX_STANDARD. If both CMAKE_CXX_STANDARD and Kokkos_CXX_STANDARD are empty, this will default to 14")
# Set CXX standard flags # Set CXX standard flags
SET(KOKKOS_ENABLE_CXX11 OFF)
SET(KOKKOS_ENABLE_CXX14 OFF) SET(KOKKOS_ENABLE_CXX14 OFF)
SET(KOKKOS_ENABLE_CXX17 OFF) SET(KOKKOS_ENABLE_CXX17 OFF)
SET(KOKKOS_ENABLE_CXX20 OFF) SET(KOKKOS_ENABLE_CXX20 OFF)
IF (KOKKOS_CXX_STANDARD) IF (KOKKOS_CXX_STANDARD)
IF (${KOKKOS_CXX_STANDARD} STREQUAL "c++98") IF (${KOKKOS_CXX_STANDARD} STREQUAL "c++98")
MESSAGE(FATAL_ERROR "Kokkos no longer supports C++98 - minimum C++11") MESSAGE(FATAL_ERROR "Kokkos no longer supports C++98 - minimum C++14")
ELSEIF (${KOKKOS_CXX_STANDARD} STREQUAL "c++11") ELSEIF (${KOKKOS_CXX_STANDARD} STREQUAL "c++11")
MESSAGE(WARNING "Deprecated Kokkos C++ standard set as 'c++11'. Use '11' instead.") MESSAGE(FATAL_ERROR "Kokkos no longer supports C++11 - minimum C++14")
SET(KOKKOS_CXX_STANDARD "11")
ELSEIF(${KOKKOS_CXX_STANDARD} STREQUAL "c++14") ELSEIF(${KOKKOS_CXX_STANDARD} STREQUAL "c++14")
MESSAGE(WARNING "Deprecated Kokkos C++ standard set as 'c++14'. Use '14' instead.") MESSAGE(WARNING "Deprecated Kokkos C++ standard set as 'c++14'. Use '14' instead.")
SET(KOKKOS_CXX_STANDARD "14") SET(KOKKOS_CXX_STANDARD "14")
@ -33,8 +31,8 @@ IF (KOKKOS_CXX_STANDARD)
ENDIF() ENDIF()
IF (NOT KOKKOS_CXX_STANDARD AND NOT CMAKE_CXX_STANDARD) IF (NOT KOKKOS_CXX_STANDARD AND NOT CMAKE_CXX_STANDARD)
MESSAGE(STATUS "Setting default Kokkos CXX standard to 11") MESSAGE(STATUS "Setting default Kokkos CXX standard to 14")
SET(KOKKOS_CXX_STANDARD "11") SET(KOKKOS_CXX_STANDARD "14")
ELSEIF(NOT KOKKOS_CXX_STANDARD) ELSEIF(NOT KOKKOS_CXX_STANDARD)
MESSAGE(STATUS "Setting default Kokkos CXX standard to ${CMAKE_CXX_STANDARD}") MESSAGE(STATUS "Setting default Kokkos CXX standard to ${CMAKE_CXX_STANDARD}")
SET(KOKKOS_CXX_STANDARD ${CMAKE_CXX_STANDARD}) SET(KOKKOS_CXX_STANDARD ${CMAKE_CXX_STANDARD})

View File

@ -29,7 +29,7 @@ FUNCTION(kokkos_set_cxx_standard_feature standard)
ELSEIF(NOT KOKKOS_USE_CXX_EXTENSIONS AND ${STANDARD_NAME}) ELSEIF(NOT KOKKOS_USE_CXX_EXTENSIONS AND ${STANDARD_NAME})
MESSAGE(STATUS "Using ${${STANDARD_NAME}} for C++${standard} standard as feature") MESSAGE(STATUS "Using ${${STANDARD_NAME}} for C++${standard} standard as feature")
IF (KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA AND (KOKKOS_CXX_HOST_COMPILER_ID STREQUAL GNU OR KOKKOS_CXX_HOST_COMPILER_ID STREQUAL Clang)) IF (KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA AND (KOKKOS_CXX_HOST_COMPILER_ID STREQUAL GNU OR KOKKOS_CXX_HOST_COMPILER_ID STREQUAL Clang))
SET(SUPPORTED_NVCC_FLAGS "-std=c++11;-std=c++14;-std=c++17") SET(SUPPORTED_NVCC_FLAGS "-std=c++14;-std=c++17")
IF (NOT ${${STANDARD_NAME}} IN_LIST SUPPORTED_NVCC_FLAGS) IF (NOT ${${STANDARD_NAME}} IN_LIST SUPPORTED_NVCC_FLAGS)
MESSAGE(FATAL_ERROR "CMake wants to use ${${STANDARD_NAME}} which is not supported by NVCC. Using a more recent host compiler or a more recent CMake version might help.") MESSAGE(FATAL_ERROR "CMake wants to use ${${STANDARD_NAME}} which is not supported by NVCC. Using a more recent host compiler or a more recent CMake version might help.")
ENDIF() ENDIF()
@ -42,13 +42,16 @@ FUNCTION(kokkos_set_cxx_standard_feature standard)
ELSEIF((KOKKOS_CXX_COMPILER_ID STREQUAL "NVIDIA") AND WIN32) ELSEIF((KOKKOS_CXX_COMPILER_ID STREQUAL "NVIDIA") AND WIN32)
MESSAGE(STATUS "Using no flag for C++${standard} standard as feature") MESSAGE(STATUS "Using no flag for C++${standard} standard as feature")
GLOBAL_SET(KOKKOS_CXX_STANDARD_FEATURE "") GLOBAL_SET(KOKKOS_CXX_STANDARD_FEATURE "")
ELSEIF((KOKKOS_CXX_COMPILER_ID STREQUAL "Fujitsu"))
MESSAGE(STATUS "Using no flag for C++${standard} standard as feature")
GLOBAL_SET(KOKKOS_CXX_STANDARD_FEATURE "")
ELSE() ELSE()
#nope, we can't do anything here #nope, we can't do anything here
MESSAGE(WARNING "C++${standard} is not supported as a compiler feature. We will choose custom flags for now, but this behavior has been deprecated. Please open an issue at https://github.com/kokkos/kokkos/issues reporting that ${KOKKOS_CXX_COMPILER_ID} ${KOKKOS_CXX_COMPILER_VERSION} failed for ${KOKKOS_CXX_STANDARD}, preferrably including your CMake command.") MESSAGE(WARNING "C++${standard} is not supported as a compiler feature. We will choose custom flags for now, but this behavior has been deprecated. Please open an issue at https://github.com/kokkos/kokkos/issues reporting that ${KOKKOS_CXX_COMPILER_ID} ${KOKKOS_CXX_COMPILER_VERSION} failed for ${KOKKOS_CXX_STANDARD}, preferably including your CMake command.")
GLOBAL_SET(KOKKOS_CXX_STANDARD_FEATURE "") GLOBAL_SET(KOKKOS_CXX_STANDARD_FEATURE "")
ENDIF() ENDIF()
IF(NOT WIN32) IF((NOT WIN32) AND (NOT ("${KOKKOS_CXX_COMPILER_ID}" STREQUAL "Fujitsu")))
IF(NOT ${FEATURE_NAME} IN_LIST CMAKE_CXX_COMPILE_FEATURES) IF(NOT ${FEATURE_NAME} IN_LIST CMAKE_CXX_COMPILE_FEATURES)
MESSAGE(FATAL_ERROR "Compiler ${KOKKOS_CXX_COMPILER_ID} should support ${FEATURE_NAME}, but CMake reports feature not supported") MESSAGE(FATAL_ERROR "Compiler ${KOKKOS_CXX_COMPILER_ID} should support ${FEATURE_NAME}, but CMake reports feature not supported")
ENDIF() ENDIF()
@ -65,11 +68,7 @@ IF (KOKKOS_CXX_STANDARD AND CMAKE_CXX_STANDARD)
ENDIF() ENDIF()
IF (KOKKOS_CXX_STANDARD STREQUAL "11" ) IF(KOKKOS_CXX_STANDARD STREQUAL "14")
kokkos_set_cxx_standard_feature(11)
SET(KOKKOS_ENABLE_CXX11 ON)
SET(KOKKOS_CXX_INTERMEDIATE_STANDARD "11")
ELSEIF(KOKKOS_CXX_STANDARD STREQUAL "14")
kokkos_set_cxx_standard_feature(14) kokkos_set_cxx_standard_feature(14)
SET(KOKKOS_CXX_INTERMEDIATE_STANDARD "1Y") SET(KOKKOS_CXX_INTERMEDIATE_STANDARD "1Y")
SET(KOKKOS_ENABLE_CXX14 ON) SET(KOKKOS_ENABLE_CXX14 ON)
@ -81,21 +80,21 @@ ELSEIF(KOKKOS_CXX_STANDARD STREQUAL "20")
kokkos_set_cxx_standard_feature(20) kokkos_set_cxx_standard_feature(20)
SET(KOKKOS_CXX_INTERMEDIATE_STANDARD "2A") SET(KOKKOS_CXX_INTERMEDIATE_STANDARD "2A")
SET(KOKKOS_ENABLE_CXX20 ON) SET(KOKKOS_ENABLE_CXX20 ON)
ELSEIF(KOKKOS_CXX_STANDARD STREQUAL "98") ELSEIF(KOKKOS_CXX_STANDARD STREQUAL "98" OR KOKKOS_CXX_STANDARD STREQUAL "11")
MESSAGE(FATAL_ERROR "Kokkos requires C++11 or newer!") MESSAGE(FATAL_ERROR "Kokkos requires C++14 or newer!")
ELSE() ELSE()
MESSAGE(FATAL_ERROR "Unknown C++ standard ${KOKKOS_CXX_STANDARD} - must be 11, 14, 17, or 20") MESSAGE(FATAL_ERROR "Unknown C++ standard ${KOKKOS_CXX_STANDARD} - must be 14, 17, or 20")
ENDIF() ENDIF()
# Enforce that extensions are turned off for nvcc_wrapper. # Enforce that extensions are turned off for nvcc_wrapper.
# For compiling CUDA code using nvcc_wrapper, we will use the host compiler's # For compiling CUDA code using nvcc_wrapper, we will use the host compiler's
# flags for turning on C++11. Since for compiler ID and versioning purposes # flags for turning on C++14. Since for compiler ID and versioning purposes
# CMake recognizes the host compiler when calling nvcc_wrapper, this just # CMake recognizes the host compiler when calling nvcc_wrapper, this just
# works. Both NVCC and nvcc_wrapper only recognize '-std=c++11' which means # works. Both NVCC and nvcc_wrapper only recognize '-std=c++14' which means
# that we can only use host compilers for CUDA builds that use those flags. # that we can only use host compilers for CUDA builds that use those flags.
# It also means that extensions (gnu++11) can't be turned on for CUDA builds. # It also means that extensions (gnu++14) can't be turned on for CUDA builds.
IF(KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA) IF(KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA)
IF(NOT DEFINED CMAKE_CXX_EXTENSIONS) IF(NOT DEFINED CMAKE_CXX_EXTENSIONS)
@ -117,7 +116,7 @@ IF(KOKKOS_ENABLE_CUDA)
MESSAGE(FATAL_ERROR "Compiling CUDA code with clang doesn't support C++ extensions. Set -DCMAKE_CXX_EXTENSIONS=OFF") MESSAGE(FATAL_ERROR "Compiling CUDA code with clang doesn't support C++ extensions. Set -DCMAKE_CXX_EXTENSIONS=OFF")
ENDIF() ENDIF()
ELSEIF(NOT KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA) ELSEIF(NOT KOKKOS_CXX_COMPILER_ID STREQUAL NVIDIA)
MESSAGE(FATAL_ERROR "Invalid compiler for CUDA. The compiler must be nvcc_wrapper or Clang, but compiler ID was ${KOKKOS_CXX_COMPILER_ID}") MESSAGE(FATAL_ERROR "Invalid compiler for CUDA. The compiler must be nvcc_wrapper or Clang or use kokkos_launch_compiler, but compiler ID was ${KOKKOS_CXX_COMPILER_ID}")
ENDIF() ENDIF()
ENDIF() ENDIF()

View File

@ -76,3 +76,7 @@ STRING(REPLACE ";" "\n" KOKKOS_TPL_EXPORT_TEMP "${KOKKOS_TPL_EXPORTS}")
#Convert to a regular variable #Convert to a regular variable
UNSET(KOKKOS_TPL_EXPORTS CACHE) UNSET(KOKKOS_TPL_EXPORTS CACHE)
SET(KOKKOS_TPL_EXPORTS ${KOKKOS_TPL_EXPORT_TEMP}) SET(KOKKOS_TPL_EXPORTS ${KOKKOS_TPL_EXPORT_TEMP})
IF (KOKKOS_ENABLE_MEMKIND)
SET(KOKKOS_ENABLE_HBWSPACE ON)
LIST(APPEND KOKKOS_MEMSPACE_LIST HBWSpace)
ENDIF()

View File

@ -6,6 +6,12 @@ INCLUDE(GNUInstallDirs)
MESSAGE(STATUS "The project name is: ${PROJECT_NAME}") MESSAGE(STATUS "The project name is: ${PROJECT_NAME}")
FUNCTION(VERIFY_EMPTY CONTEXT)
if(${ARGN})
MESSAGE(FATAL_ERROR "Kokkos does not support all of Tribits. Unhandled arguments in ${CONTEXT}:\n${ARGN}")
endif()
ENDFUNCTION()
#Leave this here for now - but only do for tribits #Leave this here for now - but only do for tribits
#This breaks the standalone CMake #This breaks the standalone CMake
IF (KOKKOS_HAS_TRILINOS) IF (KOKKOS_HAS_TRILINOS)
@ -135,28 +141,37 @@ FUNCTION(KOKKOS_ADD_EXECUTABLE ROOT_NAME)
ENDFUNCTION() ENDFUNCTION()
FUNCTION(KOKKOS_ADD_EXECUTABLE_AND_TEST ROOT_NAME) FUNCTION(KOKKOS_ADD_EXECUTABLE_AND_TEST ROOT_NAME)
IF (KOKKOS_HAS_TRILINOS)
TRIBITS_ADD_EXECUTABLE_AND_TEST(
${ROOT_NAME}
TESTONLYLIBS kokkos_gtest
${ARGN}
NUM_MPI_PROCS 1
COMM serial mpi
FAIL_REGULAR_EXPRESSION " FAILED "
)
ELSE()
CMAKE_PARSE_ARGUMENTS(PARSE CMAKE_PARSE_ARGUMENTS(PARSE
"" ""
"" ""
"SOURCES;CATEGORIES" "SOURCES;CATEGORIES;ARGS"
${ARGN}) ${ARGN})
VERIFY_EMPTY(KOKKOS_ADD_EXECUTABLE_AND_TEST ${PARSE_UNPARSED_ARGUMENTS}) VERIFY_EMPTY(KOKKOS_ADD_EXECUTABLE_AND_TEST ${PARSE_UNPARSED_ARGUMENTS})
IF (KOKKOS_HAS_TRILINOS)
IF(DEFINED PARSE_ARGS)
STRING(REPLACE ";" " " PARSE_ARGS "${PARSE_ARGS}")
ENDIF()
TRIBITS_ADD_EXECUTABLE_AND_TEST(
${ROOT_NAME}
SOURCES ${PARSE_SOURCES}
TESTONLYLIBS kokkos_gtest
NUM_MPI_PROCS 1
COMM serial mpi
ARGS ${PARSE_ARGS}
CATEGORIES ${PARSE_CATEGORIES}
SOURCES ${PARSE_SOURCES}
FAIL_REGULAR_EXPRESSION " FAILED "
ARGS ${PARSE_ARGS}
)
ELSE()
KOKKOS_ADD_TEST_EXECUTABLE(${ROOT_NAME} KOKKOS_ADD_TEST_EXECUTABLE(${ROOT_NAME}
SOURCES ${PARSE_SOURCES} SOURCES ${PARSE_SOURCES}
) )
KOKKOS_ADD_TEST(NAME ${ROOT_NAME} KOKKOS_ADD_TEST(NAME ${ROOT_NAME}
EXE ${ROOT_NAME} EXE ${ROOT_NAME}
FAIL_REGULAR_EXPRESSION " FAILED " FAIL_REGULAR_EXPRESSION " FAILED "
ARGS ${PARSE_ARGS}
) )
ENDIF() ENDIF()
ENDFUNCTION() ENDFUNCTION()
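# Illustrative usage sketch, mirroring the containers performance tests elsewhere in
# this change:
#   KOKKOS_ADD_EXECUTABLE_AND_TEST(
#     PerformanceTest_OpenMP
#     SOURCES TestMain.cpp TestOpenMP.cpp
#   )
# Under TriBITS this forwards to TRIBITS_ADD_EXECUTABLE_AND_TEST; otherwise it creates
# the executable and registers the test directly.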
@ -219,6 +234,7 @@ MACRO(KOKKOS_ADD_TEST_EXECUTABLE ROOT_NAME)
${PARSE_UNPARSED_ARGUMENTS} ${PARSE_UNPARSED_ARGUMENTS}
TESTONLYLIBS kokkos_gtest TESTONLYLIBS kokkos_gtest
) )
SET(EXE_NAME ${PACKAGE_NAME}_${ROOT_NAME})
ENDMACRO() ENDMACRO()
MACRO(KOKKOS_PACKAGE_POSTPROCESS) MACRO(KOKKOS_PACKAGE_POSTPROCESS)
@ -227,6 +243,79 @@ MACRO(KOKKOS_PACKAGE_POSTPROCESS)
endif() endif()
ENDMACRO() ENDMACRO()
## KOKKOS_CONFIGURE_CORE Configure/Generate header files for core content based
## on enabled backends.
## KOKKOS_FWD is the forward declare set
## KOKKOS_SETUP is included in Kokkos_Macros.hpp and provides backend-specific setup includes/defines
## KOKKOS_DECLARE is the declaration set
## KOKKOS_POST_INCLUDE is included at the end of Kokkos_Core.hpp
MACRO(KOKKOS_CONFIGURE_CORE)
SET(FWD_BACKEND_LIST)
FOREACH(MEMSPACE ${KOKKOS_MEMSPACE_LIST})
LIST(APPEND FWD_BACKEND_LIST ${MEMSPACE})
ENDFOREACH()
FOREACH(BACKEND_ ${KOKKOS_ENABLED_DEVICES})
IF( ${BACKEND_} STREQUAL "PTHREAD")
LIST(APPEND FWD_BACKEND_LIST THREADS)
ELSE()
LIST(APPEND FWD_BACKEND_LIST ${BACKEND_})
ENDIF()
ENDFOREACH()
MESSAGE(STATUS "Kokkos Devices: ${KOKKOS_ENABLED_DEVICES}, Kokkos Backends: ${FWD_BACKEND_LIST}")
KOKKOS_CONFIG_HEADER( KokkosCore_Config_HeaderSet.in KokkosCore_Config_FwdBackend.hpp "KOKKOS_FWD" "fwd/Kokkos_Fwd" "${FWD_BACKEND_LIST}")
KOKKOS_CONFIG_HEADER( KokkosCore_Config_HeaderSet.in KokkosCore_Config_SetupBackend.hpp "KOKKOS_SETUP" "setup/Kokkos_Setup" "${DEVICE_SETUP_LIST}")
KOKKOS_CONFIG_HEADER( KokkosCore_Config_HeaderSet.in KokkosCore_Config_DeclareBackend.hpp "KOKKOS_DECLARE" "decl/Kokkos_Declare" "${FWD_BACKEND_LIST}")
KOKKOS_CONFIG_HEADER( KokkosCore_Config_HeaderSet.in KokkosCore_Config_PostInclude.hpp "KOKKOS_POST_INCLUDE" "Kokkos_Post_Include" "${KOKKOS_BACKEND_POST_INCLUDE_LIST}")
SET(_DEFAULT_HOST_MEMSPACE "::Kokkos::HostSpace")
KOKKOS_OPTION(DEFAULT_DEVICE_MEMORY_SPACE "" STRING "Override default device memory space")
KOKKOS_OPTION(DEFAULT_HOST_MEMORY_SPACE "" STRING "Override default host memory space")
KOKKOS_OPTION(DEFAULT_DEVICE_EXECUTION_SPACE "" STRING "Override default device execution space")
KOKKOS_OPTION(DEFAULT_HOST_PARALLEL_EXECUTION_SPACE "" STRING "Override default host parallel execution space")
IF (NOT Kokkos_DEFAULT_DEVICE_EXECUTION_SPACE STREQUAL "")
SET(_DEVICE_PARALLEL ${Kokkos_DEFAULT_DEVICE_EXECUTION_SPACE})
MESSAGE(STATUS "Override default device execution space: ${_DEVICE_PARALLEL}")
SET(KOKKOS_DEVICE_SPACE_ACTIVE ON)
ELSE()
IF (_DEVICE_PARALLEL STREQUAL "NoTypeDefined")
SET(KOKKOS_DEVICE_SPACE_ACTIVE OFF)
ELSE()
SET(KOKKOS_DEVICE_SPACE_ACTIVE ON)
ENDIF()
ENDIF()
IF (NOT Kokkos_DEFAULT_HOST_PARALLEL_EXECUTION_SPACE STREQUAL "")
SET(_HOST_PARALLEL ${Kokkos_DEFAULT_HOST_PARALLEL_EXECUTION_SPACE})
MESSAGE(STATUS "Override default host parallel execution space: ${_HOST_PARALLEL}")
SET(KOKKOS_HOSTPARALLEL_SPACE_ACTIVE ON)
ELSE()
IF (_HOST_PARALLEL STREQUAL "NoTypeDefined")
SET(KOKKOS_HOSTPARALLEL_SPACE_ACTIVE OFF)
ELSE()
SET(KOKKOS_HOSTPARALLEL_SPACE_ACTIVE ON)
ENDIF()
ENDIF()
#We are ready to configure the header
CONFIGURE_FILE(cmake/KokkosCore_config.h.in KokkosCore_config.h @ONLY)
ENDMACRO()
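# Worked example (illustrative): with, say, SERIAL and CUDA enabled, KOKKOS_CONFIGURE_CORE
# generates KokkosCore_Config_FwdBackend.hpp, KokkosCore_Config_SetupBackend.hpp,
# KokkosCore_Config_DeclareBackend.hpp and KokkosCore_Config_PostInclude.hpp; the declare
# header then pulls in decl/Kokkos_Declare_SERIAL.hpp and decl/Kokkos_Declare_CUDA.hpp.
# The default-space overrides are ordinary cache options, e.g.
#   -DKokkos_DEFAULT_DEVICE_EXECUTION_SPACE=Kokkos::Cuda
# (the value is shown purely for illustration).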
## KOKKOS_INSTALL_ADDITIONAL_FILES - instruct CMake to install files to the target destination.
## Includes generated header files, scripts such as nvcc_wrapper and hpcbind,
## as well as other files provided through plugins.
MACRO(KOKKOS_INSTALL_ADDITIONAL_FILES)
# kokkos_launch_compiler is used by Kokkos to prefix compiler commands so that they forward to nvcc_wrapper
INSTALL(PROGRAMS
"${CMAKE_CURRENT_SOURCE_DIR}/bin/nvcc_wrapper"
"${CMAKE_CURRENT_SOURCE_DIR}/bin/hpcbind"
"${CMAKE_CURRENT_SOURCE_DIR}/bin/kokkos_launch_compiler"
DESTINATION ${CMAKE_INSTALL_BINDIR})
INSTALL(FILES
"${CMAKE_CURRENT_BINARY_DIR}/KokkosCore_config.h"
"${CMAKE_CURRENT_BINARY_DIR}/KokkosCore_Config_FwdBackend.hpp"
"${CMAKE_CURRENT_BINARY_DIR}/KokkosCore_Config_SetupBackend.hpp"
"${CMAKE_CURRENT_BINARY_DIR}/KokkosCore_Config_DeclareBackend.hpp"
"${CMAKE_CURRENT_BINARY_DIR}/KokkosCore_Config_PostInclude.hpp"
DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
ENDMACRO()
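# Illustrative note: with the default GNUInstallDirs layout the rules above place, e.g.,
#   ${CMAKE_INSTALL_PREFIX}/bin/nvcc_wrapper
#   ${CMAKE_INSTALL_PREFIX}/include/KokkosCore_config.h
# in the install tree (the exact subdirectories follow CMAKE_INSTALL_BINDIR and
# CMAKE_INSTALL_INCLUDEDIR).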
FUNCTION(KOKKOS_SET_LIBRARY_PROPERTIES LIBRARY_NAME) FUNCTION(KOKKOS_SET_LIBRARY_PROPERTIES LIBRARY_NAME)
CMAKE_PARSE_ARGUMENTS(PARSE CMAKE_PARSE_ARGUMENTS(PARSE
"PLAIN_STYLE" "PLAIN_STYLE"

View File

@ -1,14 +1,16 @@
# @HEADER # @HEADER
# ************************************************************************ # ************************************************************************
# #
# Trilinos: An Object-Oriented Solver Framework # Kokkos v. 3.0
# Copyright (2001) Sandia Corporation # Copyright (2020) National Technology & Engineering
# Solutions of Sandia, LLC (NTESS).
# #
# Under the terms of Contract DE-NA0003525 with NTESS,
# the U.S. Government retains certain rights in this software.
# #
# Copyright (2001) Sandia Corporation. Under the terms of Contract # Redistribution and use in source and binary forms, with or without
# DE-AC04-94AL85000, there is a non-exclusive license for use of this # modification, are permitted provided that the following conditions are
# work by or on behalf of the U.S. Government. Export of this program # met:
# may require a license from the United States Government.
# #
# 1. Redistributions of source code must retain the above copyright # 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer. # notice, this list of conditions and the following disclaimer.
@ -21,10 +23,10 @@
# contributors may be used to endorse or promote products derived from # contributors may be used to endorse or promote products derived from
# this software without specific prior written permission. # this software without specific prior written permission.
# #
# THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY # THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
@ -33,22 +35,7 @@
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# #
# NOTICE: The United States Government is granted for itself and others # Questions? Contact Christian R. Trott (crtrott@sandia.gov)
# acting on its behalf a paid-up, nonexclusive, irrevocable worldwide
# license in this data to reproduce, prepare derivative works, and
# perform publicly and display publicly. Beginning five (5) years from
# July 25, 2001, the United States Government is granted for itself and
# others acting on its behalf a paid-up, nonexclusive, irrevocable
# worldwide license in this data to reproduce, prepare derivative works,
# distribute copies to the public, perform publicly and display
# publicly, and to permit others to do so.
#
# NEITHER THE UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT
# OF ENERGY, NOR SANDIA CORPORATION, NOR ANY OF THEIR EMPLOYEES, MAKES
# ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR
# RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY
# INFORMATION, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS
# THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS.
# #
# ************************************************************************ # ************************************************************************
# @HEADER # @HEADER

View File

@ -1,14 +1,16 @@
# @HEADER # @HEADER
# ************************************************************************ # ************************************************************************
# #
# Trilinos: An Object-Oriented Solver Framework # Kokkos v. 3.0
# Copyright (2001) Sandia Corporation # Copyright (2020) National Technology & Engineering
# Solutions of Sandia, LLC (NTESS).
# #
# Under the terms of Contract DE-NA0003525 with NTESS,
# the U.S. Government retains certain rights in this software.
# #
# Copyright (2001) Sandia Corporation. Under the terms of Contract # Redistribution and use in source and binary forms, with or without
# DE-AC04-94AL85000, there is a non-exclusive license for use of this # modification, are permitted provided that the following conditions are
# work by or on behalf of the U.S. Government. Export of this program # met:
# may require a license from the United States Government.
# #
# 1. Redistributions of source code must retain the above copyright # 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer. # notice, this list of conditions and the following disclaimer.
@ -21,10 +23,10 @@
# contributors may be used to endorse or promote products derived from # contributors may be used to endorse or promote products derived from
# this software without specific prior written permission. # this software without specific prior written permission.
# #
# THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY # THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
@ -33,22 +35,7 @@
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# #
# NOTICE: The United States Government is granted for itself and others # Questions? Contact Christian R. Trott (crtrott@sandia.gov)
# acting on its behalf a paid-up, nonexclusive, irrevocable worldwide
# license in this data to reproduce, prepare derivative works, and
# perform publicly and display publicly. Beginning five (5) years from
# July 25, 2001, the United States Government is granted for itself and
# others acting on its behalf a paid-up, nonexclusive, irrevocable
# worldwide license in this data to reproduce, prepare derivative works,
# distribute copies to the public, perform publicly and display
# publicly, and to permit others to do so.
#
# NEITHER THE UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT
# OF ENERGY, NOR SANDIA CORPORATION, NOR ANY OF THEIR EMPLOYEES, MAKES
# ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR
# RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY
# INFORMATION, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS
# THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS.
# #
# ************************************************************************ # ************************************************************************
# @HEADER # @HEADER

View File

@ -1,14 +1,16 @@
# @HEADER # @HEADER
# ************************************************************************ # ************************************************************************
# #
# Trilinos: An Object-Oriented Solver Framework # Kokkos v. 3.0
# Copyright (2001) Sandia Corporation # Copyright (2020) National Technology & Engineering
# Solutions of Sandia, LLC (NTESS).
# #
# Under the terms of Contract DE-NA0003525 with NTESS,
# the U.S. Government retains certain rights in this software.
# #
# Copyright (2001) Sandia Corporation. Under the terms of Contract # Redistribution and use in source and binary forms, with or without
# DE-AC04-94AL85000, there is a non-exclusive license for use of this # modification, are permitted provided that the following conditions are
# work by or on behalf of the U.S. Government. Export of this program # met:
# may require a license from the United States Government.
# #
# 1. Redistributions of source code must retain the above copyright # 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer. # notice, this list of conditions and the following disclaimer.
@ -21,10 +23,10 @@
# contributors may be used to endorse or promote products derived from # contributors may be used to endorse or promote products derived from
# this software without specific prior written permission. # this software without specific prior written permission.
# #
# THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY # THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
@ -33,22 +35,7 @@
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# #
# NOTICE: The United States Government is granted for itself and others # Questions? Contact Christian R. Trott (crtrott@sandia.gov)
# acting on its behalf a paid-up, nonexclusive, irrevocable worldwide
# license in this data to reproduce, prepare derivative works, and
# perform publicly and display publicly. Beginning five (5) years from
# July 25, 2001, the United States Government is granted for itself and
# others acting on its behalf a paid-up, nonexclusive, irrevocable
# worldwide license in this data to reproduce, prepare derivative works,
# distribute copies to the public, perform publicly and display
# publicly, and to permit others to do so.
#
# NEITHER THE UNITED STATES GOVERNMENT, NOR THE UNITED STATES DEPARTMENT
# OF ENERGY, NOR SANDIA CORPORATION, NOR ANY OF THEIR EMPLOYEES, MAKES
# ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR
# RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY
# INFORMATION, APPARATUS, PRODUCT, OR PROCESS DISCLOSED, OR REPRESENTS
# THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS.
# #
# ************************************************************************ # ************************************************************************
# @HEADER # @HEADER

View File

@ -3,44 +3,26 @@ KOKKOS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR})
KOKKOS_INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR}) KOKKOS_INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR})
KOKKOS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/../src ) KOKKOS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/../src )
IF(Kokkos_ENABLE_CUDA) foreach(Tag Threads;OpenMP;Cuda;HPX;HIP)
SET(SOURCES # Because there is always an exception to the rule
if(Tag STREQUAL "Threads")
set(DEVICE "PTHREAD")
else()
string(TOUPPER ${Tag} DEVICE)
endif()
string(TOLOWER ${Tag} dir)
if(Kokkos_ENABLE_${DEVICE})
message(STATUS "Sources Test${Tag}.cpp")
set(SOURCES
TestMain.cpp TestMain.cpp
TestCuda.cpp Test${Tag}.cpp
) )
KOKKOS_ADD_EXECUTABLE_AND_TEST( PerformanceTest_Cuda KOKKOS_ADD_EXECUTABLE_AND_TEST(
PerformanceTest_${Tag}
SOURCES ${SOURCES} SOURCES ${SOURCES}
) )
ENDIF() endif()
endforeach()
IF(Kokkos_ENABLE_PTHREAD)
SET(SOURCES
TestMain.cpp
TestThreads.cpp
)
KOKKOS_ADD_EXECUTABLE_AND_TEST( PerformanceTest_Threads
SOURCES ${SOURCES}
)
ENDIF()
IF(Kokkos_ENABLE_OPENMP)
SET(SOURCES
TestMain.cpp
TestOpenMP.cpp
)
KOKKOS_ADD_EXECUTABLE_AND_TEST( PerformanceTest_OpenMP
SOURCES ${SOURCES}
)
ENDIF()
IF(Kokkos_ENABLE_HPX)
SET(SOURCES
TestMain.cpp
TestHPX.cpp
)
KOKKOS_ADD_EXECUTABLE_AND_TEST( PerformanceTest_HPX
SOURCES ${SOURCES}
)
ENDIF()
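# Illustrative note: the loop above generates the same executables the removed
# per-backend blocks spelled out by hand; the only irregular case is Threads, whose
# enable flag is Kokkos_ENABLE_PTHREAD, e.g.
#   Kokkos_ENABLE_PTHREAD=ON  ->  PerformanceTest_Threads built from TestMain.cpp TestThreads.cpp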

View File

@ -58,8 +58,8 @@ endif
KokkosContainers_PerformanceTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS) KokkosContainers_PerformanceTest_Cuda: $(OBJ_CUDA) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_Cuda $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_CUDA) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_Cuda
KokkosContainers_PerformanceTest_ROCm: $(OBJ_ROCM) $(KOKKOS_LINK_DEPENDS) KokkosContainers_PerformanceTest_HIP: $(OBJ_HIP) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_ROCM) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_ROCm $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_HIP) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_HIP
KokkosContainers_PerformanceTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS) KokkosContainers_PerformanceTest_Threads: $(OBJ_THREADS) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_Threads $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_THREADS) $(KOKKOS_LIBS) $(LIB) -o KokkosContainers_PerformanceTest_Threads
@ -73,8 +73,8 @@ KokkosContainers_PerformanceTest_HPX: $(OBJ_HPX) $(KOKKOS_LINK_DEPENDS)
test-cuda: KokkosContainers_PerformanceTest_Cuda test-cuda: KokkosContainers_PerformanceTest_Cuda
./KokkosContainers_PerformanceTest_Cuda ./KokkosContainers_PerformanceTest_Cuda
test-rocm: KokkosContainers_PerformanceTest_ROCm test-hip: KokkosContainers_PerformanceTest_HIP
./KokkosContainers_PerformanceTest_ROCm ./KokkosContainers_PerformanceTest_HIP
test-threads: KokkosContainers_PerformanceTest_Threads test-threads: KokkosContainers_PerformanceTest_Threads
./KokkosContainers_PerformanceTest_Threads ./KokkosContainers_PerformanceTest_Threads

View File

@ -43,7 +43,6 @@
*/ */
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if defined(KOKKOS_ENABLE_CUDA)
#include <cstdint> #include <cstdint>
#include <string> #include <string>
@ -66,23 +65,13 @@
namespace Performance { namespace Performance {
class cuda : public ::testing::Test { TEST(TEST_CATEGORY, dynrankview_perf) {
protected:
static void SetUpTestCase() {
std::cout << std::setprecision(5) << std::scientific;
Kokkos::InitArguments args(-1, -1, 0);
Kokkos::initialize(args);
}
static void TearDownTestCase() { Kokkos::finalize(); }
};
TEST_F(cuda, dynrankview_perf) {
std::cout << "Cuda" << std::endl; std::cout << "Cuda" << std::endl;
std::cout << " DynRankView vs View: Initialization Only " << std::endl; std::cout << " DynRankView vs View: Initialization Only " << std::endl;
test_dynrankview_op_perf<Kokkos::Cuda>(40960); test_dynrankview_op_perf<Kokkos::Cuda>(40960);
} }
TEST_F(cuda, global_2_local) { TEST(TEST_CATEGORY, global_2_local) {
std::cout << "Cuda" << std::endl; std::cout << "Cuda" << std::endl;
std::cout << "size, create, generate, fill, find" << std::endl; std::cout << "size, create, generate, fill, find" << std::endl;
for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size; for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size;
@ -90,15 +79,12 @@ TEST_F(cuda, global_2_local) {
test_global_to_local_ids<Kokkos::Cuda>(i); test_global_to_local_ids<Kokkos::Cuda>(i);
} }
TEST_F(cuda, unordered_map_performance_near) { TEST(TEST_CATEGORY, unordered_map_performance_near) {
Perf::run_performance_tests<Kokkos::Cuda, true>("cuda-near"); Perf::run_performance_tests<Kokkos::Cuda, true>("cuda-near");
} }
TEST_F(cuda, unordered_map_performance_far) { TEST(TEST_CATEGORY, unordered_map_performance_far) {
Perf::run_performance_tests<Kokkos::Cuda, false>("cuda-far"); Perf::run_performance_tests<Kokkos::Cuda, false>("cuda-far");
} }
} // namespace Performance } // namespace Performance
#else
void KOKKOS_CONTAINERS_PERFORMANCE_TESTS_TESTCUDA_PREVENT_EMPTY_LINK_ERROR() {}
#endif /* #if defined( KOKKOS_ENABLE_CUDA ) */

View File

@ -43,7 +43,6 @@
*/ */
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if defined(KOKKOS_ENABLE_ROCM)
#include <cstdint> #include <cstdint>
#include <string> #include <string>
@ -66,46 +65,26 @@
namespace Performance { namespace Performance {
class rocm : public ::testing::Test { TEST(TEST_CATEGORY, dynrankview_perf) {
protected: std::cout << "HIP" << std::endl;
static void SetUpTestCase() {
std::cout << std::setprecision(5) << std::scientific;
Kokkos::HostSpace::execution_space::initialize();
Kokkos::Experimental::ROCm::initialize(
Kokkos::Experimental::ROCm::SelectDevice(0));
}
static void TearDownTestCase() {
Kokkos::Experimental::ROCm::finalize();
Kokkos::HostSpace::execution_space::finalize();
}
};
#if 0
// issue 1089
TEST_F( rocm, dynrankview_perf )
{
std::cout << "ROCm" << std::endl;
std::cout << " DynRankView vs View: Initialization Only " << std::endl; std::cout << " DynRankView vs View: Initialization Only " << std::endl;
test_dynrankview_op_perf<Kokkos::Experimental::ROCm>( 40960 ); test_dynrankview_op_perf<Kokkos::Experimental::HIP>(40960);
} }
TEST_F( rocm, global_2_local) TEST(TEST_CATEGORY, global_2_local) {
{ std::cout << "HIP" << std::endl;
std::cout << "ROCm" << std::endl;
std::cout << "size, create, generate, fill, find" << std::endl; std::cout << "size, create, generate, fill, find" << std::endl;
for (unsigned i=Performance::begin_id_size; i<=Performance::end_id_size; i *= Performance::id_step) for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size;
test_global_to_local_ids<Kokkos::Experimental::ROCm>(i); i *= Performance::id_step)
test_global_to_local_ids<Kokkos::Experimental::HIP>(i);
} }
#endif TEST(TEST_CATEGORY, unordered_map_performance_near) {
TEST_F(rocm, unordered_map_performance_near) { Perf::run_performance_tests<Kokkos::Experimental::HIP, true>("hip-near");
Perf::run_performance_tests<Kokkos::Experimental::ROCm, true>("rocm-near");
} }
TEST_F(rocm, unordered_map_performance_far) { TEST(TEST_CATEGORY, unordered_map_performance_far) {
Perf::run_performance_tests<Kokkos::Experimental::ROCm, false>("rocm-far"); Perf::run_performance_tests<Kokkos::Experimental::HIP, false>("hip-far");
} }
} // namespace Performance } // namespace Performance
#else
void KOKKOS_CONTAINERS_PERFORMANCE_TESTS_TESTROCM_PREVENT_EMPTY_LINK_ERROR() {}
#endif /* #if defined( KOKKOS_ENABLE_ROCM ) */

View File

@ -43,7 +43,6 @@
*/ */
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if defined(KOKKOS_ENABLE_HPX)
#include <gtest/gtest.h> #include <gtest/gtest.h>
@ -64,25 +63,13 @@
namespace Performance { namespace Performance {
class hpx : public ::testing::Test { TEST(TEST_CATEGORY, dynrankview_perf) {
protected:
static void SetUpTestCase() {
std::cout << std::setprecision(5) << std::scientific;
Kokkos::initialize();
Kokkos::print_configuration(std::cout);
}
static void TearDownTestCase() { Kokkos::finalize(); }
};
TEST_F(hpx, dynrankview_perf) {
std::cout << "HPX" << std::endl; std::cout << "HPX" << std::endl;
std::cout << " DynRankView vs View: Initialization Only " << std::endl; std::cout << " DynRankView vs View: Initialization Only " << std::endl;
test_dynrankview_op_perf<Kokkos::Experimental::HPX>(8192); test_dynrankview_op_perf<Kokkos::Experimental::HPX>(8192);
} }
TEST_F(hpx, global_2_local) { TEST(TEST_CATEGORY, global_2_local) {
std::cout << "HPX" << std::endl; std::cout << "HPX" << std::endl;
std::cout << "size, create, generate, fill, find" << std::endl; std::cout << "size, create, generate, fill, find" << std::endl;
for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size; for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size;
@ -90,7 +77,7 @@ TEST_F(hpx, global_2_local) {
test_global_to_local_ids<Kokkos::Experimental::HPX>(i); test_global_to_local_ids<Kokkos::Experimental::HPX>(i);
} }
TEST_F(hpx, unordered_map_performance_near) { TEST(TEST_CATEGORY, unordered_map_performance_near) {
unsigned num_hpx = 4; unsigned num_hpx = 4;
std::ostringstream base_file_name; std::ostringstream base_file_name;
base_file_name << "hpx-" << num_hpx << "-near"; base_file_name << "hpx-" << num_hpx << "-near";
@ -98,7 +85,7 @@ TEST_F(hpx, unordered_map_performance_near) {
base_file_name.str()); base_file_name.str());
} }
TEST_F(hpx, unordered_map_performance_far) { TEST(TEST_CATEGORY, unordered_map_performance_far) {
unsigned num_hpx = 4; unsigned num_hpx = 4;
std::ostringstream base_file_name; std::ostringstream base_file_name;
base_file_name << "hpx-" << num_hpx << "-far"; base_file_name << "hpx-" << num_hpx << "-far";
@ -106,7 +93,7 @@ TEST_F(hpx, unordered_map_performance_far) {
base_file_name.str()); base_file_name.str());
} }
TEST_F(hpx, scatter_view) { TEST(TEST_CATEGORY, scatter_view) {
std::cout << "ScatterView data-duplicated test:\n"; std::cout << "ScatterView data-duplicated test:\n";
Perf::test_scatter_view<Kokkos::Experimental::HPX, Kokkos::LayoutRight, Perf::test_scatter_view<Kokkos::Experimental::HPX, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated, Kokkos::Experimental::ScatterDuplicated,
@ -119,6 +106,3 @@ TEST_F(hpx, scatter_view) {
} }
} // namespace Performance } // namespace Performance
#else
void KOKKOS_CONTAINERS_PERFORMANCE_TESTS_TESTHPX_PREVENT_EMPTY_LINK_ERROR() {}
#endif

View File

@ -45,9 +45,13 @@
#include <gtest/gtest.h> #include <gtest/gtest.h>
#include <cstdlib> #include <cstdlib>
#include <Kokkos_Macros.hpp> #include <Kokkos_Core.hpp>
int main(int argc, char *argv[]) { int main(int argc, char *argv[]) {
Kokkos::initialize(argc, argv);
::testing::InitGoogleTest(&argc, argv); ::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
int result = RUN_ALL_TESTS();
Kokkos::finalize();
return result;
} }

View File

@ -43,7 +43,6 @@
*/ */
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if defined(KOKKOS_ENABLE_OPENMP)
#include <gtest/gtest.h> #include <gtest/gtest.h>
@ -64,25 +63,13 @@
namespace Performance { namespace Performance {
class openmp : public ::testing::Test { TEST(TEST_CATEGORY, dynrankview_perf) {
protected:
static void SetUpTestCase() {
std::cout << std::setprecision(5) << std::scientific;
Kokkos::initialize();
Kokkos::OpenMP::print_configuration(std::cout);
}
static void TearDownTestCase() { Kokkos::finalize(); }
};
TEST_F(openmp, dynrankview_perf) {
std::cout << "OpenMP" << std::endl; std::cout << "OpenMP" << std::endl;
std::cout << " DynRankView vs View: Initialization Only " << std::endl; std::cout << " DynRankView vs View: Initialization Only " << std::endl;
test_dynrankview_op_perf<Kokkos::OpenMP>(8192); test_dynrankview_op_perf<Kokkos::OpenMP>(8192);
} }
TEST_F(openmp, global_2_local) { TEST(TEST_CATEGORY, global_2_local) {
std::cout << "OpenMP" << std::endl; std::cout << "OpenMP" << std::endl;
std::cout << "size, create, generate, fill, find" << std::endl; std::cout << "size, create, generate, fill, find" << std::endl;
for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size; for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size;
@ -90,7 +77,7 @@ TEST_F(openmp, global_2_local) {
test_global_to_local_ids<Kokkos::OpenMP>(i); test_global_to_local_ids<Kokkos::OpenMP>(i);
} }
TEST_F(openmp, unordered_map_performance_near) { TEST(TEST_CATEGORY, unordered_map_performance_near) {
unsigned num_openmp = 4; unsigned num_openmp = 4;
if (Kokkos::hwloc::available()) { if (Kokkos::hwloc::available()) {
num_openmp = Kokkos::hwloc::get_available_numa_count() * num_openmp = Kokkos::hwloc::get_available_numa_count() *
@ -102,7 +89,7 @@ TEST_F(openmp, unordered_map_performance_near) {
Perf::run_performance_tests<Kokkos::OpenMP, true>(base_file_name.str()); Perf::run_performance_tests<Kokkos::OpenMP, true>(base_file_name.str());
} }
TEST_F(openmp, unordered_map_performance_far) { TEST(TEST_CATEGORY, unordered_map_performance_far) {
unsigned num_openmp = 4; unsigned num_openmp = 4;
if (Kokkos::hwloc::available()) { if (Kokkos::hwloc::available()) {
num_openmp = Kokkos::hwloc::get_available_numa_count() * num_openmp = Kokkos::hwloc::get_available_numa_count() *
@ -114,7 +101,7 @@ TEST_F(openmp, unordered_map_performance_far) {
Perf::run_performance_tests<Kokkos::OpenMP, false>(base_file_name.str()); Perf::run_performance_tests<Kokkos::OpenMP, false>(base_file_name.str());
} }
TEST_F(openmp, scatter_view) { TEST(TEST_CATEGORY, scatter_view) {
std::cout << "ScatterView data-duplicated test:\n"; std::cout << "ScatterView data-duplicated test:\n";
Perf::test_scatter_view<Kokkos::OpenMP, Kokkos::LayoutRight, Perf::test_scatter_view<Kokkos::OpenMP, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated, Kokkos::Experimental::ScatterDuplicated,
@ -127,7 +114,3 @@ TEST_F(openmp, scatter_view) {
} }
} // namespace Performance } // namespace Performance
#else
void KOKKOS_CONTAINERS_PERFORMANCE_TESTS_TESTOPENMP_PREVENT_EMPTY_LINK_ERROR() {
}
#endif

View File

@ -43,7 +43,6 @@
*/ */
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if defined(KOKKOS_ENABLE_THREADS)
#include <gtest/gtest.h> #include <gtest/gtest.h>
@ -65,34 +64,13 @@
namespace Performance { namespace Performance {
class threads : public ::testing::Test { TEST(threads, dynrankview_perf) {
protected:
static void SetUpTestCase() {
std::cout << std::setprecision(5) << std::scientific;
unsigned num_threads = 4;
if (Kokkos::hwloc::available()) {
num_threads = Kokkos::hwloc::get_available_numa_count() *
Kokkos::hwloc::get_available_cores_per_numa() *
Kokkos::hwloc::get_available_threads_per_core();
}
std::cout << "Threads: " << num_threads << std::endl;
Kokkos::initialize(Kokkos::InitArguments(num_threads));
}
static void TearDownTestCase() { Kokkos::finalize(); }
};
TEST_F(threads, dynrankview_perf) {
std::cout << "Threads" << std::endl; std::cout << "Threads" << std::endl;
std::cout << " DynRankView vs View: Initialization Only " << std::endl; std::cout << " DynRankView vs View: Initialization Only " << std::endl;
test_dynrankview_op_perf<Kokkos::Threads>(8192); test_dynrankview_op_perf<Kokkos::Threads>(8192);
} }
TEST_F(threads, global_2_local) { TEST(threads, global_2_local) {
std::cout << "Threads" << std::endl; std::cout << "Threads" << std::endl;
std::cout << "size, create, generate, fill, find" << std::endl; std::cout << "size, create, generate, fill, find" << std::endl;
for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size; for (unsigned i = Performance::begin_id_size; i <= Performance::end_id_size;
@ -100,7 +78,7 @@ TEST_F(threads, global_2_local) {
test_global_to_local_ids<Kokkos::Threads>(i); test_global_to_local_ids<Kokkos::Threads>(i);
} }
TEST_F(threads, unordered_map_performance_near) { TEST(threads, unordered_map_performance_near) {
unsigned num_threads = 4; unsigned num_threads = 4;
if (Kokkos::hwloc::available()) { if (Kokkos::hwloc::available()) {
num_threads = Kokkos::hwloc::get_available_numa_count() * num_threads = Kokkos::hwloc::get_available_numa_count() *
@ -112,7 +90,7 @@ TEST_F(threads, unordered_map_performance_near) {
Perf::run_performance_tests<Kokkos::Threads, true>(base_file_name.str()); Perf::run_performance_tests<Kokkos::Threads, true>(base_file_name.str());
} }
TEST_F(threads, unordered_map_performance_far) { TEST(threads, unordered_map_performance_far) {
unsigned num_threads = 4; unsigned num_threads = 4;
if (Kokkos::hwloc::available()) { if (Kokkos::hwloc::available()) {
num_threads = Kokkos::hwloc::get_available_numa_count() * num_threads = Kokkos::hwloc::get_available_numa_count() *
@ -125,8 +103,3 @@ TEST_F(threads, unordered_map_performance_far) {
} }
} // namespace Performance } // namespace Performance
#else
void KOKKOS_CONTAINERS_PERFORMANCE_TESTS_TESTTHREADS_PREVENT_EMPTY_LINK_ERROR() {
}
#endif

View File

@ -74,7 +74,7 @@ template <typename Device>
class Bitset { class Bitset {
public: public:
using execution_space = Device; using execution_space = Device;
using size_type = unsigned; using size_type = unsigned int;
enum { BIT_SCAN_REVERSE = 1u }; enum { BIT_SCAN_REVERSE = 1u };
enum { MOVE_HINT_BACKWARD = 2u }; enum { MOVE_HINT_BACKWARD = 2u };
@ -309,7 +309,7 @@ template <typename Device>
class ConstBitset { class ConstBitset {
public: public:
using execution_space = Device; using execution_space = Device;
using size_type = unsigned; using size_type = unsigned int;
private: private:
enum { block_size = static_cast<unsigned>(sizeof(unsigned) * CHAR_BIT) }; enum { block_size = static_cast<unsigned>(sizeof(unsigned) * CHAR_BIT) };

View File

@ -162,7 +162,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
/// \brief The type of a const, random-access View host mirror of /// \brief The type of a const, random-access View host mirror of
/// \c t_dev_const_randomread. /// \c t_dev_const_randomread.
using t_host_const_randomread_um = using t_host_const_randomread_um =
typename t_dev_const_randomread::HostMirror; typename t_dev_const_randomread_um::HostMirror;
//@} //@}
//! \name Counters to keep track of changes ("modified" flags) //! \name Counters to keep track of changes ("modified" flags)
@ -245,21 +245,6 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
h_view(create_mirror_view(d_view)) // without UVM, host View mirrors h_view(create_mirror_view(d_view)) // without UVM, host View mirrors
{} {}
explicit inline DualView(const ViewAllocateWithoutInitializing& arg_prop,
const size_t arg_N0 = KOKKOS_IMPL_CTOR_DEFAULT_ARG,
const size_t arg_N1 = KOKKOS_IMPL_CTOR_DEFAULT_ARG,
const size_t arg_N2 = KOKKOS_IMPL_CTOR_DEFAULT_ARG,
const size_t arg_N3 = KOKKOS_IMPL_CTOR_DEFAULT_ARG,
const size_t arg_N4 = KOKKOS_IMPL_CTOR_DEFAULT_ARG,
const size_t arg_N5 = KOKKOS_IMPL_CTOR_DEFAULT_ARG,
const size_t arg_N6 = KOKKOS_IMPL_CTOR_DEFAULT_ARG,
const size_t arg_N7 = KOKKOS_IMPL_CTOR_DEFAULT_ARG)
: DualView(Impl::ViewCtorProp<std::string,
Kokkos::Impl::WithoutInitializing_t>(
arg_prop.label, Kokkos::WithoutInitializing),
arg_N0, arg_N1, arg_N2, arg_N3, arg_N4, arg_N5, arg_N6,
arg_N7) {}
//! Copy constructor (shallow copy) //! Copy constructor (shallow copy)
template <class SS, class LS, class DS, class MS> template <class SS, class LS, class DS, class MS>
DualView(const DualView<SS, LS, DS, MS>& src) DualView(const DualView<SS, LS, DS, MS>& src)
@ -457,7 +442,21 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
} }
return dev; return dev;
} }
static constexpr const int view_header_size = 128;
void impl_report_host_sync() const noexcept {
Kokkos::Tools::syncDualView(
h_view.label(),
reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(h_view.data()) -
view_header_size),
false);
}
void impl_report_device_sync() const noexcept {
Kokkos::Tools::syncDualView(
d_view.label(),
reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(d_view.data()) -
view_header_size),
true);
}
/// \brief Update data on device or host only if data in the other /// \brief Update data on device or host only if data in the other
/// space has been marked as modified. /// space has been marked as modified.
/// ///
@ -499,6 +498,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
deep_copy(d_view, h_view); deep_copy(d_view, h_view);
modified_flags(0) = modified_flags(1) = 0; modified_flags(0) = modified_flags(1) = 0;
impl_report_device_sync();
} }
} }
if (dev == 0) { // hopefully Device is the same as DualView's host type if (dev == 0) { // hopefully Device is the same as DualView's host type
@ -515,6 +515,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
deep_copy(h_view, d_view); deep_copy(h_view, d_view);
modified_flags(0) = modified_flags(1) = 0; modified_flags(0) = modified_flags(1) = 0;
impl_report_host_sync();
} }
} }
if (std::is_same<typename t_host::memory_space, if (std::is_same<typename t_host::memory_space,
@ -539,12 +540,14 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
Impl::throw_runtime_exception( Impl::throw_runtime_exception(
"Calling sync on a DualView with a const datatype."); "Calling sync on a DualView with a const datatype.");
} }
impl_report_device_sync();
} }
if (dev == 0) { // hopefully Device is the same as DualView's host type if (dev == 0) { // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) { if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
Impl::throw_runtime_exception( Impl::throw_runtime_exception(
"Calling sync on a DualView with a const datatype."); "Calling sync on a DualView with a const datatype.");
} }
impl_report_host_sync();
} }
} }
@ -567,6 +570,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
deep_copy(h_view, d_view); deep_copy(h_view, d_view);
modified_flags(1) = modified_flags(0) = 0; modified_flags(1) = modified_flags(0) = 0;
impl_report_host_sync();
} }
} }
@ -589,6 +593,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
deep_copy(d_view, h_view); deep_copy(d_view, h_view);
modified_flags(1) = modified_flags(0) = 0; modified_flags(1) = modified_flags(0) = 0;
impl_report_device_sync();
} }
} }
@ -619,7 +624,20 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
if (modified_flags.data() == nullptr) return false; if (modified_flags.data() == nullptr) return false;
return modified_flags(1) < modified_flags(0); return modified_flags(1) < modified_flags(0);
} }
void impl_report_device_modification() {
Kokkos::Tools::modifyDualView(
d_view.label(),
reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(d_view.data()) -
view_header_size),
true);
}
void impl_report_host_modification() {
Kokkos::Tools::modifyDualView(
h_view.label(),
reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(h_view.data()) -
view_header_size),
false);
}
/// \brief Mark data as modified on the given device \c Device. /// \brief Mark data as modified on the given device \c Device.
/// ///
/// If \c Device is the same as this DualView's device type, then /// If \c Device is the same as this DualView's device type, then
@ -636,6 +654,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
(modified_flags(1) > modified_flags(0) ? modified_flags(1) (modified_flags(1) > modified_flags(0) ? modified_flags(1)
: modified_flags(0)) + : modified_flags(0)) +
1; 1;
impl_report_device_modification();
} }
if (dev == 0) { // hopefully Device is the same as DualView's host type if (dev == 0) { // hopefully Device is the same as DualView's host type
// Increment the host's modified count. // Increment the host's modified count.
@ -643,6 +662,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
(modified_flags(1) > modified_flags(0) ? modified_flags(1) (modified_flags(1) > modified_flags(0) ? modified_flags(1)
: modified_flags(0)) + : modified_flags(0)) +
1; 1;
impl_report_host_modification();
} }
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK #ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
@ -663,6 +683,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
(modified_flags(1) > modified_flags(0) ? modified_flags(1) (modified_flags(1) > modified_flags(0) ? modified_flags(1)
: modified_flags(0)) + : modified_flags(0)) +
1; 1;
impl_report_host_modification();
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK #ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_flags(0) && modified_flags(1)) { if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify_host ERROR: "; std::string msg = "Kokkos::DualView::modify_host ERROR: ";
@ -682,6 +703,7 @@ class DualView : public ViewTraits<DataType, Arg1Type, Arg2Type, Arg3Type> {
(modified_flags(1) > modified_flags(0) ? modified_flags(1) (modified_flags(1) > modified_flags(0) ? modified_flags(1)
: modified_flags(0)) + : modified_flags(0)) +
1; 1;
impl_report_device_modification();
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK #ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_flags(0) && modified_flags(1)) { if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify_device ERROR: "; std::string msg = "Kokkos::DualView::modify_device ERROR: ";

View File

@ -245,10 +245,13 @@ KOKKOS_INLINE_FUNCTION bool dyn_rank_view_verify_operator_bounds(
return (size_t(i) < map.extent(R)) && return (size_t(i) < map.extent(R)) &&
dyn_rank_view_verify_operator_bounds<R + 1>(rank, map, args...); dyn_rank_view_verify_operator_bounds<R + 1>(rank, map, args...);
} else if (i != 0) { } else if (i != 0) {
// FIXME_SYCL SYCL doesn't allow printf in kernels
#ifndef KOKKOS_ENABLE_SYCL
printf( printf(
"DynRankView Debug Bounds Checking Error: at rank %u\n Extra " "DynRankView Debug Bounds Checking Error: at rank %u\n Extra "
"arguments beyond the rank must be zero \n", "arguments beyond the rank must be zero \n",
R); R);
#endif
return (false) && return (false) &&
dyn_rank_view_verify_operator_bounds<R + 1>(rank, map, args...); dyn_rank_view_verify_operator_bounds<R + 1>(rank, map, args...);
} else { } else {
@ -1264,33 +1267,6 @@ class DynRankView : public ViewTraits<DataType, Properties...> {
typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3, typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3,
arg_N4, arg_N5, arg_N6, arg_N7)) {} arg_N4, arg_N5, arg_N6, arg_N7)) {}
// For backward compatibility
// NDE This ctor does not take ViewCtorProp argument - should not use
// alternative createLayout call
explicit inline DynRankView(const ViewAllocateWithoutInitializing& arg_prop,
const typename traits::array_layout& arg_layout)
: DynRankView(
Kokkos::Impl::ViewCtorProp<std::string,
Kokkos::Impl::WithoutInitializing_t>(
arg_prop.label, Kokkos::WithoutInitializing),
arg_layout) {}
explicit inline DynRankView(const ViewAllocateWithoutInitializing& arg_prop,
const size_t arg_N0 = KOKKOS_INVALID_INDEX,
const size_t arg_N1 = KOKKOS_INVALID_INDEX,
const size_t arg_N2 = KOKKOS_INVALID_INDEX,
const size_t arg_N3 = KOKKOS_INVALID_INDEX,
const size_t arg_N4 = KOKKOS_INVALID_INDEX,
const size_t arg_N5 = KOKKOS_INVALID_INDEX,
const size_t arg_N6 = KOKKOS_INVALID_INDEX,
const size_t arg_N7 = KOKKOS_INVALID_INDEX)
: DynRankView(
Kokkos::Impl::ViewCtorProp<std::string,
Kokkos::Impl::WithoutInitializing_t>(
arg_prop.label, Kokkos::WithoutInitializing),
typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3,
arg_N4, arg_N5, arg_N6, arg_N7)) {}
//---------------------------------------- //----------------------------------------
// Memory span required to wrap these dimensions. // Memory span required to wrap these dimensions.
static constexpr size_t required_allocation_size( static constexpr size_t required_allocation_size(
@ -1401,7 +1377,7 @@ struct DynRankSubviewTag {};
namespace Impl { namespace Impl {
template <class SrcTraits, class... Args> template <class SrcTraits, class... Args>
struct ViewMapping< class ViewMapping<
typename std::enable_if< typename std::enable_if<
(std::is_same<typename SrcTraits::specialize, void>::value && (std::is_same<typename SrcTraits::specialize, void>::value &&
(std::is_same<typename SrcTraits::array_layout, (std::is_same<typename SrcTraits::array_layout,
@ -2052,7 +2028,7 @@ create_mirror_view_and_copy(
nullptr) { nullptr) {
using Mirror = typename Impl::MirrorDRViewType<Space, T, P...>::view_type; using Mirror = typename Impl::MirrorDRViewType<Space, T, P...>::view_type;
std::string label = name.empty() ? src.label() : name; std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror(Kokkos::ViewAllocateWithoutInitializing(label), auto mirror = Mirror(view_alloc(WithoutInitializing, label),
Impl::reconstructLayout(src.layout(), src.rank())); Impl::reconstructLayout(src.layout(), src.rank()));
deep_copy(mirror, src); deep_copy(mirror, src);
return mirror; return mirror;
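This hunk, like several others in the commit (ScatterView, UnorderedMap, and the unit and performance tests further down), swaps the `Kokkos::ViewAllocateWithoutInitializing(label)` allocation property for the composable `view_alloc(WithoutInitializing, label)` spelling. A minimal sketch of the two forms side by side (illustrative labels, not taken from the diff):
````cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    // Older spelling, still accepted at the time of this commit:
    Kokkos::View<double*> a(Kokkos::ViewAllocateWithoutInitializing("a"), 10);

    // Spelling the library and tests migrate to; it composes with other
    // allocation properties (execution spaces, memory spaces, ...):
    Kokkos::View<double*> b(Kokkos::view_alloc(Kokkos::WithoutInitializing, "b"), 10);

    // Either way the contents are undefined until explicitly written.
    Kokkos::deep_copy(a, 0.0);
    Kokkos::deep_copy(b, 0.0);
  }
  return 0;
}
````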

View File

@ -1940,7 +1940,7 @@ create_mirror(
const Kokkos::Experimental::OffsetView<T, P...>& src, const Kokkos::Experimental::OffsetView<T, P...>& src,
typename std::enable_if< typename std::enable_if<
!std::is_same<typename Kokkos::ViewTraits<T, P...>::array_layout, !std::is_same<typename Kokkos::ViewTraits<T, P...>::array_layout,
Kokkos::LayoutStride>::value>::type* = 0) { Kokkos::LayoutStride>::value>::type* = nullptr) {
using src_type = Experimental::OffsetView<T, P...>; using src_type = Experimental::OffsetView<T, P...>;
using dst_type = typename src_type::HostMirror; using dst_type = typename src_type::HostMirror;
@ -1960,7 +1960,7 @@ create_mirror(
const Kokkos::Experimental::OffsetView<T, P...>& src, const Kokkos::Experimental::OffsetView<T, P...>& src,
typename std::enable_if< typename std::enable_if<
std::is_same<typename Kokkos::ViewTraits<T, P...>::array_layout, std::is_same<typename Kokkos::ViewTraits<T, P...>::array_layout,
Kokkos::LayoutStride>::value>::type* = 0) { Kokkos::LayoutStride>::value>::type* = nullptr) {
using src_type = Experimental::OffsetView<T, P...>; using src_type = Experimental::OffsetView<T, P...>;
using dst_type = typename src_type::HostMirror; using dst_type = typename src_type::HostMirror;
@ -2028,7 +2028,7 @@ create_mirror_view(
std::is_same< std::is_same<
typename Kokkos::Experimental::OffsetView<T, P...>::data_type, typename Kokkos::Experimental::OffsetView<T, P...>::data_type,
typename Kokkos::Experimental::OffsetView< typename Kokkos::Experimental::OffsetView<
T, P...>::HostMirror::data_type>::value)>::type* = 0) { T, P...>::HostMirror::data_type>::value)>::type* = nullptr) {
return Kokkos::create_mirror(src); return Kokkos::create_mirror(src);
} }
@ -2038,7 +2038,7 @@ typename Kokkos::Impl::MirrorOffsetViewType<Space, T, P...>::view_type
create_mirror_view(const Space&, create_mirror_view(const Space&,
const Kokkos::Experimental::OffsetView<T, P...>& src, const Kokkos::Experimental::OffsetView<T, P...>& src,
typename std::enable_if<Impl::MirrorOffsetViewType< typename std::enable_if<Impl::MirrorOffsetViewType<
Space, T, P...>::is_same_memspace>::type* = 0) { Space, T, P...>::is_same_memspace>::type* = nullptr) {
return src; return src;
} }
@ -2048,7 +2048,7 @@ typename Kokkos::Impl::MirrorOffsetViewType<Space, T, P...>::view_type
create_mirror_view(const Space&, create_mirror_view(const Space&,
const Kokkos::Experimental::OffsetView<T, P...>& src, const Kokkos::Experimental::OffsetView<T, P...>& src,
typename std::enable_if<!Impl::MirrorOffsetViewType< typename std::enable_if<!Impl::MirrorOffsetViewType<
Space, T, P...>::is_same_memspace>::type* = 0) { Space, T, P...>::is_same_memspace>::type* = nullptr) {
return typename Kokkos::Impl::MirrorOffsetViewType<Space, T, P...>::view_type( return typename Kokkos::Impl::MirrorOffsetViewType<Space, T, P...>::view_type(
src.label(), src.layout(), src.label(), src.layout(),
{src.begin(0), src.begin(1), src.begin(2), src.begin(3), src.begin(4), {src.begin(0), src.begin(1), src.begin(2), src.begin(3), src.begin(4),
@ -2063,7 +2063,7 @@ create_mirror_view(const Space&,
// , std::string const& name = "" // , std::string const& name = ""
// , typename // , typename
// std::enable_if<Impl::MirrorViewType<Space,T,P // std::enable_if<Impl::MirrorViewType<Space,T,P
// ...>::is_same_memspace>::type* = 0 ) { // ...>::is_same_memspace>::type* = nullptr) {
// (void)name; // (void)name;
// return src; // return src;
// } // }
@ -2076,11 +2076,11 @@ create_mirror_view(const Space&,
// , std::string const& name = "" // , std::string const& name = ""
// , typename // , typename
// std::enable_if<!Impl::MirrorViewType<Space,T,P // std::enable_if<!Impl::MirrorViewType<Space,T,P
// ...>::is_same_memspace>::type* = 0 ) { // ...>::is_same_memspace>::type* = nullptr) {
// using Mirror = typename // using Mirror = typename
// Kokkos::Experimental::Impl::MirrorViewType<Space,T,P ...>::view_type; // Kokkos::Experimental::Impl::MirrorViewType<Space,T,P ...>::view_type;
// std::string label = name.empty() ? src.label() : name; // std::string label = name.empty() ? src.label() : name;
// auto mirror = Mirror(ViewAllocateWithoutInitializing(label), src.layout(), // auto mirror = Mirror(view_alloc(WithoutInitializing, label), src.layout(),
// { src.begin(0), src.begin(1), src.begin(2), // { src.begin(0), src.begin(1), src.begin(2),
// src.begin(3), src.begin(4), // src.begin(3), src.begin(4),
// src.begin(5), src.begin(6), src.begin(7) }); // src.begin(5), src.begin(6), src.begin(7) });

View File

@ -206,6 +206,23 @@ struct DefaultContribution<Kokkos::Experimental::HIP,
}; };
#endif #endif
#ifdef KOKKOS_ENABLE_SYCL
template <>
struct DefaultDuplication<Kokkos::Experimental::SYCL> {
using type = Kokkos::Experimental::ScatterNonDuplicated;
};
template <>
struct DefaultContribution<Kokkos::Experimental::SYCL,
Kokkos::Experimental::ScatterNonDuplicated> {
using type = Kokkos::Experimental::ScatterAtomic;
};
template <>
struct DefaultContribution<Kokkos::Experimental::SYCL,
Kokkos::Experimental::ScatterDuplicated> {
using type = Kokkos::Experimental::ScatterAtomic;
};
#endif
// FIXME All these scatter values need overhaul: // FIXME All these scatter values need overhaul:
// - like should they be copyable at all? // - like should they be copyable at all?
// - what is the internal handle type // - what is the internal handle type
@ -636,19 +653,10 @@ struct ReduceDuplicatesBase {
size_t stride_in, size_t start_in, size_t n_in, size_t stride_in, size_t start_in, size_t n_in,
std::string const& name) std::string const& name)
: src(src_in), dst(dest_in), stride(stride_in), start(start_in), n(n_in) { : src(src_in), dst(dest_in), stride(stride_in), start(start_in), n(n_in) {
uint64_t kpID = 0; parallel_for(
if (Kokkos::Profiling::profileLibraryLoaded()) { std::string("Kokkos::ScatterView::ReduceDuplicates [") + name + "]",
Kokkos::Profiling::beginParallelFor(std::string("reduce_") + name, 0, RangePolicy<ExecSpace, size_t>(0, stride),
&kpID); static_cast<Derived const&>(*this));
}
using policy_type = RangePolicy<ExecSpace, size_t>;
using closure_type = Kokkos::Impl::ParallelFor<Derived, policy_type>;
const closure_type closure(*(static_cast<Derived*>(this)),
policy_type(0, stride));
closure.execute();
if (Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::endParallelFor(kpID);
}
} }
}; };
@ -682,19 +690,10 @@ struct ResetDuplicatesBase {
ResetDuplicatesBase(ValueType* data_in, size_t size_in, ResetDuplicatesBase(ValueType* data_in, size_t size_in,
std::string const& name) std::string const& name)
: data(data_in) { : data(data_in) {
uint64_t kpID = 0; parallel_for(
if (Kokkos::Profiling::profileLibraryLoaded()) { std::string("Kokkos::ScatterView::ResetDuplicates [") + name + "]",
Kokkos::Profiling::beginParallelFor(std::string("reduce_") + name, 0, RangePolicy<ExecSpace, size_t>(0, size_in),
&kpID); static_cast<Derived const&>(*this));
}
using policy_type = RangePolicy<ExecSpace, size_t>;
using closure_type = Kokkos::Impl::ParallelFor<Derived, policy_type>;
const closure_type closure(*(static_cast<Derived*>(this)),
policy_type(0, size_in));
closure.execute();
if (Kokkos::Profiling::profileLibraryLoaded()) {
Kokkos::Profiling::endParallelFor(kpID);
}
} }
}; };
@ -931,8 +930,8 @@ class ScatterView<DataType, Kokkos::LayoutRight, DeviceType, Op,
ScatterView(View<RT, RP...> const& original_view) ScatterView(View<RT, RP...> const& original_view)
: unique_token(), : unique_token(),
internal_view( internal_view(
Kokkos::ViewAllocateWithoutInitializing(std::string("duplicated_") + view_alloc(WithoutInitializing,
original_view.label()), std::string("duplicated_") + original_view.label()),
unique_token.size(), unique_token.size(),
original_view.rank_dynamic > 0 ? original_view.extent(0) original_view.rank_dynamic > 0 ? original_view.extent(0)
: KOKKOS_IMPL_CTOR_DEFAULT_ARG, : KOKKOS_IMPL_CTOR_DEFAULT_ARG,
@ -955,7 +954,7 @@ class ScatterView<DataType, Kokkos::LayoutRight, DeviceType, Op,
template <typename... Dims> template <typename... Dims>
ScatterView(std::string const& name, Dims... dims) ScatterView(std::string const& name, Dims... dims)
: internal_view(Kokkos::ViewAllocateWithoutInitializing(name), : internal_view(view_alloc(WithoutInitializing, name),
unique_token.size(), dims...) { unique_token.size(), dims...) {
reset(); reset();
} }
@ -1094,8 +1093,8 @@ class ScatterView<DataType, Kokkos::LayoutLeft, DeviceType, Op,
KOKKOS_IMPL_CTOR_DEFAULT_ARG}; KOKKOS_IMPL_CTOR_DEFAULT_ARG};
arg_N[internal_view_type::rank - 1] = unique_token.size(); arg_N[internal_view_type::rank - 1] = unique_token.size();
internal_view = internal_view_type( internal_view = internal_view_type(
Kokkos::ViewAllocateWithoutInitializing(std::string("duplicated_") + view_alloc(WithoutInitializing,
original_view.label()), std::string("duplicated_") + original_view.label()),
arg_N[0], arg_N[1], arg_N[2], arg_N[3], arg_N[4], arg_N[5], arg_N[6], arg_N[0], arg_N[1], arg_N[2], arg_N[3], arg_N[4], arg_N[5], arg_N[6],
arg_N[7]); arg_N[7]);
reset(); reset();
@ -1121,9 +1120,9 @@ class ScatterView<DataType, Kokkos::LayoutLeft, DeviceType, Op,
KOKKOS_IMPL_CTOR_DEFAULT_ARG}; KOKKOS_IMPL_CTOR_DEFAULT_ARG};
Kokkos::Impl::Experimental::args_to_array(arg_N, 0, dims...); Kokkos::Impl::Experimental::args_to_array(arg_N, 0, dims...);
arg_N[internal_view_type::rank - 1] = unique_token.size(); arg_N[internal_view_type::rank - 1] = unique_token.size();
internal_view = internal_view_type( internal_view = internal_view_type(view_alloc(WithoutInitializing, name),
Kokkos::ViewAllocateWithoutInitializing(name), arg_N[0], arg_N[1], arg_N[0], arg_N[1], arg_N[2], arg_N[3],
arg_N[2], arg_N[3], arg_N[4], arg_N[5], arg_N[6], arg_N[7]); arg_N[4], arg_N[5], arg_N[6], arg_N[7]);
reset(); reset();
} }
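Two changes land in the ScatterView hunks: SYCL gets default duplication/contribution traits (non-duplicated with atomic contributions), and the internal ReduceDuplicates/ResetDuplicates functors are now launched through an ordinary labeled `parallel_for` instead of hand-rolled `Kokkos::Profiling::beginParallelFor`/`endParallelFor` calls, so they appear to tools as `Kokkos::ScatterView::ReduceDuplicates [<label>]` and `Kokkos::ScatterView::ResetDuplicates [<label>]`. A minimal usage sketch that drives those kernels (names and sizes are illustrative, not from the diff):
````cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_ScatterView.hpp>

int main(int argc, char* argv[]) {
  Kokkos::ScopeGuard guard(argc, argv);
  {
    Kokkos::View<double*> totals("totals", 10);
    // On backends that default to duplication (host backends), this allocates
    // per-thread copies labeled "duplicated_totals" via
    // view_alloc(WithoutInitializing, ...) and initializes them with the
    // ResetDuplicates kernel.
    Kokkos::Experimental::ScatterView<double*> scatter(totals);

    Kokkos::parallel_for(
        "accumulate", 1000, KOKKOS_LAMBDA(const int i) {
          auto access = scatter.access();
          access(i % 10) += 1.0;  // ScatterSum is the default operation
        });

    // Folds any duplicates back into 'totals'; with duplication enabled this
    // runs the "Kokkos::ScatterView::ReduceDuplicates [...]" parallel_for.
    Kokkos::Experimental::contribute(totals, scatter);
  }
  return 0;
}
````
On backends whose default is non-duplicated atomic contributions (CUDA, HIP, and now SYCL) the contribute step has little or nothing to reduce; on host backends the duplicated copies are reduced by the newly labeled kernel.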

View File

@ -306,9 +306,9 @@ class UnorderedMap {
m_equal_to(equal_to), m_equal_to(equal_to),
m_size(), m_size(),
m_available_indexes(calculate_capacity(capacity_hint)), m_available_indexes(calculate_capacity(capacity_hint)),
m_hash_lists(ViewAllocateWithoutInitializing("UnorderedMap hash list"), m_hash_lists(view_alloc(WithoutInitializing, "UnorderedMap hash list"),
Impl::find_hash_size(capacity())), Impl::find_hash_size(capacity())),
m_next_index(ViewAllocateWithoutInitializing("UnorderedMap next index"), m_next_index(view_alloc(WithoutInitializing, "UnorderedMap next index"),
capacity() + 1) // +1 so that the *_at functions can capacity() + 1) // +1 so that the *_at functions can
// always return a valid reference // always return a valid reference
, ,
@ -540,7 +540,10 @@ class UnorderedMap {
// Previously claimed an unused entry that was not inserted. // Previously claimed an unused entry that was not inserted.
// Release this unused entry immediately. // Release this unused entry immediately.
if (!m_available_indexes.reset(new_index)) { if (!m_available_indexes.reset(new_index)) {
// FIXME_SYCL SYCL doesn't allow printf in kernels
#ifndef KOKKOS_ENABLE_SYCL
printf("Unable to free existing\n"); printf("Unable to free existing\n");
#endif
} }
} }
@ -729,16 +732,16 @@ class UnorderedMap {
tmp.m_size = src.size(); tmp.m_size = src.size();
tmp.m_available_indexes = bitset_type(src.capacity()); tmp.m_available_indexes = bitset_type(src.capacity());
tmp.m_hash_lists = size_type_view( tmp.m_hash_lists = size_type_view(
ViewAllocateWithoutInitializing("UnorderedMap hash list"), view_alloc(WithoutInitializing, "UnorderedMap hash list"),
src.m_hash_lists.extent(0)); src.m_hash_lists.extent(0));
tmp.m_next_index = size_type_view( tmp.m_next_index = size_type_view(
ViewAllocateWithoutInitializing("UnorderedMap next index"), view_alloc(WithoutInitializing, "UnorderedMap next index"),
src.m_next_index.extent(0)); src.m_next_index.extent(0));
tmp.m_keys = tmp.m_keys =
key_type_view(ViewAllocateWithoutInitializing("UnorderedMap keys"), key_type_view(view_alloc(WithoutInitializing, "UnorderedMap keys"),
src.m_keys.extent(0)); src.m_keys.extent(0));
tmp.m_values = value_type_view( tmp.m_values = value_type_view(
ViewAllocateWithoutInitializing("UnorderedMap values"), view_alloc(WithoutInitializing, "UnorderedMap values"),
src.m_values.extent(0)); src.m_values.extent(0));
tmp.m_scalars = scalars_view("UnorderedMap scalars"); tmp.m_scalars = scalars_view("UnorderedMap scalars");

View File

@ -3,7 +3,7 @@ KOKKOS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR})
KOKKOS_INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR}) KOKKOS_INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR})
KOKKOS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/../src ) KOKKOS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/../src )
foreach(Tag Threads;Serial;OpenMP;HPX;Cuda;HIP) foreach(Tag Threads;Serial;OpenMP;HPX;Cuda;HIP;SYCL)
# Because there is always an exception to the rule # Because there is always an exception to the rule
if(Tag STREQUAL "Threads") if(Tag STREQUAL "Threads")
set(DEVICE "PTHREAD") set(DEVICE "PTHREAD")
@ -31,13 +31,21 @@ foreach(Tag Threads;Serial;OpenMP;HPX;Cuda;HIP)
Vector Vector
ViewCtorPropEmbeddedDim ViewCtorPropEmbeddedDim
) )
# Write to a temporary intermediate file and call configure_file to avoid
# updating timestamps triggering unnecessary rebuilds on subsequent cmake runs.
set(file ${dir}/Test${Tag}_${Name}.cpp) set(file ${dir}/Test${Tag}_${Name}.cpp)
file(WRITE ${file} file(WRITE ${dir}/dummy.cpp
"#include <Test${Tag}_Category.hpp>\n" "#include <Test${Tag}_Category.hpp>\n"
"#include <Test${Name}.hpp>\n" "#include <Test${Name}.hpp>\n"
) )
configure_file(${dir}/dummy.cpp ${file})
list(APPEND UnitTestSources ${file}) list(APPEND UnitTestSources ${file})
endforeach() endforeach()
list(REMOVE_ITEM UnitTestSources
${CMAKE_CURRENT_BINARY_DIR}/sycl/TestSYCL_Bitset.cpp
${CMAKE_CURRENT_BINARY_DIR}/sycl/TestSYCL_ScatterView.cpp
${CMAKE_CURRENT_BINARY_DIR}/sycl/TestSYCL_UnorderedMap.cpp
)
KOKKOS_ADD_EXECUTABLE_AND_TEST(UnitTest_${Tag} SOURCES ${UnitTestSources}) KOKKOS_ADD_EXECUTABLE_AND_TEST(UnitTest_${Tag} SOURCES ${UnitTestSources})
endif() endif()
endforeach() endforeach()

View File

@ -7,7 +7,7 @@ vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/openmp
vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/hpx vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/hpx
vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/serial vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/serial
vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/threads vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/threads
vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/rocm vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/hip
vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/cuda vpath %.cpp ${KOKKOS_PATH}/containers/unit_tests/cuda
vpath %.cpp ${CURDIR} vpath %.cpp ${CURDIR}
default: build_all default: build_all

View File

@ -108,7 +108,7 @@ struct test_dualview_combinations {
if (with_init) { if (with_init) {
a = ViewType("A", n, m); a = ViewType("A", n, m);
} else { } else {
a = ViewType(Kokkos::ViewAllocateWithoutInitializing("A"), n, m); a = ViewType(Kokkos::view_alloc(Kokkos::WithoutInitializing, "A"), n, m);
} }
Kokkos::deep_copy(a.d_view, 1); Kokkos::deep_copy(a.d_view, 1);
@ -404,14 +404,19 @@ void test_dualview_resize() {
Impl::test_dualview_resize<Scalar, Device>(); Impl::test_dualview_resize<Scalar, Device>();
} }
// FIXME_SYCL requires MDRange policy
#ifndef KOKKOS_ENABLE_SYCL
TEST(TEST_CATEGORY, dualview_combination) { TEST(TEST_CATEGORY, dualview_combination) {
test_dualview_combinations<int, TEST_EXECSPACE>(10, true); test_dualview_combinations<int, TEST_EXECSPACE>(10, true);
} }
#endif
TEST(TEST_CATEGORY, dualview_alloc) { TEST(TEST_CATEGORY, dualview_alloc) {
test_dualview_alloc<int, TEST_EXECSPACE>(10); test_dualview_alloc<int, TEST_EXECSPACE>(10);
} }
// FIXME_SYCL requires MDRange policy
#ifndef KOKKOS_ENABLE_SYCL
TEST(TEST_CATEGORY, dualview_combinations_without_init) { TEST(TEST_CATEGORY, dualview_combinations_without_init) {
test_dualview_combinations<int, TEST_EXECSPACE>(10, false); test_dualview_combinations<int, TEST_EXECSPACE>(10, false);
} }
@ -428,6 +433,7 @@ TEST(TEST_CATEGORY, dualview_realloc) {
TEST(TEST_CATEGORY, dualview_resize) { TEST(TEST_CATEGORY, dualview_resize) {
test_dualview_resize<int, TEST_EXECSPACE>(); test_dualview_resize<int, TEST_EXECSPACE>();
} }
#endif
} // namespace Test } // namespace Test

View File

@ -1063,8 +1063,8 @@ class TestDynViewAPI {
(void)thing; (void)thing;
} }
dView0 d_uninitialized(Kokkos::ViewAllocateWithoutInitializing("uninit"), dView0 d_uninitialized(
10, 20); Kokkos::view_alloc(Kokkos::WithoutInitializing, "uninit"), 10, 20);
ASSERT_TRUE(d_uninitialized.data() != nullptr); ASSERT_TRUE(d_uninitialized.data() != nullptr);
ASSERT_EQ(d_uninitialized.rank(), 2); ASSERT_EQ(d_uninitialized.rank(), 2);
ASSERT_EQ(d_uninitialized.extent(0), 10); ASSERT_EQ(d_uninitialized.extent(0), 10);
@ -1532,7 +1532,7 @@ class TestDynViewAPI {
ASSERT_EQ(ds5.extent(5), ds5plus.extent(5)); ASSERT_EQ(ds5.extent(5), ds5plus.extent(5));
#if (!defined(KOKKOS_ENABLE_CUDA) || defined(KOKKOS_ENABLE_CUDA_UVM)) && \ #if (!defined(KOKKOS_ENABLE_CUDA) || defined(KOKKOS_ENABLE_CUDA_UVM)) && \
!defined(KOKKOS_ENABLE_HIP) !defined(KOKKOS_ENABLE_HIP) && !defined(KOKKOS_ENABLE_SYCL)
ASSERT_EQ(&ds5(1, 1, 1, 1, 0) - &ds5plus(1, 1, 1, 1, 0), 0); ASSERT_EQ(&ds5(1, 1, 1, 1, 0) - &ds5plus(1, 1, 1, 1, 0), 0);
ASSERT_EQ(&ds5(1, 1, 1, 1, 0, 0) - &ds5plus(1, 1, 1, 1, 0, 0), ASSERT_EQ(&ds5(1, 1, 1, 1, 0, 0) - &ds5plus(1, 1, 1, 1, 0, 0),
0); // passing argument to rank beyond the view's rank is allowed 0); // passing argument to rank beyond the view's rank is allowed

View File

@ -243,6 +243,8 @@ struct TestDynamicView {
} }
}; };
// FIXME_SYCL needs resize_serial
#ifndef KOKKOS_ENABLE_SYCL
TEST(TEST_CATEGORY, dynamic_view) { TEST(TEST_CATEGORY, dynamic_view) {
using TestDynView = TestDynamicView<double, TEST_EXECSPACE>; using TestDynView = TestDynamicView<double, TEST_EXECSPACE>;
@ -250,6 +252,7 @@ TEST(TEST_CATEGORY, dynamic_view) {
TestDynView::run(100000 + 100 * i); TestDynView::run(100000 + 100 * i);
} }
} }
#endif
} // namespace Test } // namespace Test

View File

@ -95,10 +95,6 @@ void test_offsetview_construction() {
ASSERT_EQ(ov.extent(1), 5); ASSERT_EQ(ov.extent(1), 5);
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
const int ovmin0 = ov.begin(0);
const int ovend0 = ov.end(0);
const int ovmin1 = ov.begin(1);
const int ovend1 = ov.end(1);
{ {
Kokkos::Experimental::OffsetView<Scalar*, Device> offsetV1("OneDOffsetView", Kokkos::Experimental::OffsetView<Scalar*, Device> offsetV1("OneDOffsetView",
range0); range0);
@ -134,6 +130,13 @@ void test_offsetview_construction() {
} }
} }
// FIXME_SYCL requires MDRange policy
#ifndef KOKKOS_ENABLE_SYCL
const int ovmin0 = ov.begin(0);
const int ovend0 = ov.end(0);
const int ovmin1 = ov.begin(1);
const int ovend1 = ov.end(1);
using range_type = using range_type =
Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, Kokkos::IndexType<int> >; Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, Kokkos::IndexType<int> >;
using point_type = typename range_type::point_type; using point_type = typename range_type::point_type;
@ -175,6 +178,7 @@ void test_offsetview_construction() {
} }
ASSERT_EQ(OVResult, answer) << "Bad data found in OffsetView"; ASSERT_EQ(OVResult, answer) << "Bad data found in OffsetView";
#endif
#endif #endif
{ {
@ -211,6 +215,8 @@ void test_offsetview_construction() {
point3_type{{extent0, extent1, extent2}}); point3_type{{extent0, extent1, extent2}});
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
// FIXME_SYCL requires MDRange policy
#ifndef KOKKOS_ENABLE_SYCL

int view3DSum = 0; int view3DSum = 0;
Kokkos::parallel_reduce( Kokkos::parallel_reduce(
rangePolicy3DZero, rangePolicy3DZero,
@ -233,6 +239,7 @@ void test_offsetview_construction() {
ASSERT_EQ(view3DSum, offsetView3DSum) ASSERT_EQ(view3DSum, offsetView3DSum)
<< "construction of OffsetView from View and begins array broken."; << "construction of OffsetView from View and begins array broken.";
#endif
#endif #endif
} }
view_type viewFromOV = ov.view(); view_type viewFromOV = ov.view();
@ -259,6 +266,8 @@ void test_offsetview_construction() {
Kokkos::deep_copy(aView, ov); Kokkos::deep_copy(aView, ov);
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
// FIXME_SYCL requires MDRange policy
#ifndef KOKKOS_ENABLE_SYCL
int sum = 0; int sum = 0;
Kokkos::parallel_reduce( Kokkos::parallel_reduce(
rangePolicy2D, rangePolicy2D,
@ -268,6 +277,7 @@ void test_offsetview_construction() {
sum); sum);
ASSERT_EQ(sum, 0) << "deep_copy(view, offsetView) broken."; ASSERT_EQ(sum, 0) << "deep_copy(view, offsetView) broken.";
#endif
#endif #endif
} }
@ -278,6 +288,8 @@ void test_offsetview_construction() {
Kokkos::deep_copy(ov, aView); Kokkos::deep_copy(ov, aView);
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
// FIXME_SYCL requires MDRange policy
#ifndef KOKKOS_ENABLE_SYCL
int sum = 0; int sum = 0;
Kokkos::parallel_reduce( Kokkos::parallel_reduce(
rangePolicy2D, rangePolicy2D,
@ -287,6 +299,7 @@ void test_offsetview_construction() {
sum); sum);
ASSERT_EQ(sum, 0) << "deep_copy(offsetView, view) broken."; ASSERT_EQ(sum, 0) << "deep_copy(offsetView, view) broken.";
#endif
#endif #endif
} }
} }
@ -458,6 +471,8 @@ void test_offsetview_subview() {
ASSERT_EQ(offsetSubview.end(1), 9); ASSERT_EQ(offsetSubview.end(1), 9);
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
// FIXME_SYCL requires MDRange policy
#ifndef KOKKOS_ENABLE_SYCL
using range_type = Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, using range_type = Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>,
Kokkos::IndexType<int> >; Kokkos::IndexType<int> >;
using point_type = typename range_type::point_type; using point_type = typename range_type::point_type;
@ -483,6 +498,7 @@ void test_offsetview_subview() {
sum); sum);
ASSERT_EQ(sum, 6 * (e0 - b0) * (e1 - b1)); ASSERT_EQ(sum, 6 * (e0 - b0) * (e1 - b1));
#endif
#endif #endif
} }
@ -685,9 +701,12 @@ void test_offsetview_offsets_rank3() {
} }
#endif #endif
// FIXME_SYCL needs MDRangePolicy
#ifndef KOKKOS_ENABLE_SYCL
TEST(TEST_CATEGORY, offsetview_construction) { TEST(TEST_CATEGORY, offsetview_construction) {
test_offsetview_construction<int, TEST_EXECSPACE>(); test_offsetview_construction<int, TEST_EXECSPACE>();
} }
#endif
TEST(TEST_CATEGORY, offsetview_unmanaged_construction) { TEST(TEST_CATEGORY, offsetview_unmanaged_construction) {
test_offsetview_unmanaged_construction<int, TEST_EXECSPACE>(); test_offsetview_unmanaged_construction<int, TEST_EXECSPACE>();

View File

@ -0,0 +1,51 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 3.0
// Copyright (2020) National Technology & Engineering
// Solutions of Sandia, LLC (NTESS).
//
// Under the terms of Contract DE-NA0003525 with NTESS,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_TEST_SYCL_HPP
#define KOKKOS_TEST_SYCL_HPP
#define TEST_CATEGORY sycl
#define TEST_EXECSPACE Kokkos::Experimental::SYCL
#endif

View File

@ -583,18 +583,9 @@ struct TestDuplicatedScatterView<
}; };
#endif #endif
#ifdef KOKKOS_ENABLE_ROCM
// disable duplicated instantiation with ROCm until
// UniqueToken can support it
template <typename ScatterType>
struct TestDuplicatedScatterView<Kokkos::Experimental::ROCm, ScatterType> {
TestDuplicatedScatterView(int) {}
};
#endif
template <typename DeviceType, typename ScatterType, template <typename DeviceType, typename ScatterType,
typename NumberType = double> typename NumberType = double>
void test_scatter_view(int n) { void test_scatter_view(int64_t n) {
using execution_space = typename DeviceType::execution_space; using execution_space = typename DeviceType::execution_space;
// no atomics or duplication is only sensible if the execution space // no atomics or duplication is only sensible if the execution space
@ -630,7 +621,7 @@ void test_scatter_view(int n) {
constexpr std::size_t bytes_per_value = sizeof(NumberType) * 12; constexpr std::size_t bytes_per_value = sizeof(NumberType) * 12;
std::size_t const maximum_allowed_copy_values = std::size_t const maximum_allowed_copy_values =
maximum_allowed_copy_bytes / bytes_per_value; maximum_allowed_copy_bytes / bytes_per_value;
n = std::min(n, int(maximum_allowed_copy_values)); n = std::min(n, int64_t(maximum_allowed_copy_values));
// if the default is duplicated, this needs to follow the limit // if the default is duplicated, this needs to follow the limit
{ {
@ -683,32 +674,40 @@ TEST(TEST_CATEGORY, scatterview_devicetype) {
test_scatter_view<device_type, Kokkos::Experimental::ScatterMin>(10); test_scatter_view<device_type, Kokkos::Experimental::ScatterMin>(10);
test_scatter_view<device_type, Kokkos::Experimental::ScatterMax>(10); test_scatter_view<device_type, Kokkos::Experimental::ScatterMax>(10);
#if defined(KOKKOS_ENABLE_CUDA) || defined(KOKKOS_ENABLE_HIP)
#ifdef KOKKOS_ENABLE_CUDA #ifdef KOKKOS_ENABLE_CUDA
if (std::is_same<TEST_EXECSPACE, Kokkos::Cuda>::value) { using device_execution_space = Kokkos::Cuda;
using cuda_device_type = Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>; using device_memory_space = Kokkos::CudaSpace;
test_scatter_view<cuda_device_type, Kokkos::Experimental::ScatterSum, using host_accessible_space = Kokkos::CudaUVMSpace;
#else
using device_execution_space = Kokkos::Experimental::HIP;
using device_memory_space = Kokkos::Experimental::HIPSpace;
using host_accessible_space = Kokkos::Experimental::HIPHostPinnedSpace;
#endif
if (std::is_same<TEST_EXECSPACE, device_execution_space>::value) {
using device_device_type =
Kokkos::Device<device_execution_space, device_memory_space>;
test_scatter_view<device_device_type, Kokkos::Experimental::ScatterSum,
double>(10); double>(10);
test_scatter_view<cuda_device_type, Kokkos::Experimental::ScatterSum, test_scatter_view<device_device_type, Kokkos::Experimental::ScatterSum,
unsigned int>(10); unsigned int>(10);
test_scatter_view<cuda_device_type, Kokkos::Experimental::ScatterProd>(10); test_scatter_view<device_device_type, Kokkos::Experimental::ScatterProd>(
test_scatter_view<cuda_device_type, Kokkos::Experimental::ScatterMin>(10); 10);
test_scatter_view<cuda_device_type, Kokkos::Experimental::ScatterMax>(10); test_scatter_view<device_device_type, Kokkos::Experimental::ScatterMin>(10);
using cudauvm_device_type = test_scatter_view<device_device_type, Kokkos::Experimental::ScatterMax>(10);
Kokkos::Device<Kokkos::Cuda, Kokkos::CudaUVMSpace>; using host_device_type =
test_scatter_view<cudauvm_device_type, Kokkos::Experimental::ScatterSum, Kokkos::Device<device_execution_space, host_accessible_space>;
test_scatter_view<host_device_type, Kokkos::Experimental::ScatterSum,
double>(10); double>(10);
test_scatter_view<cudauvm_device_type, Kokkos::Experimental::ScatterSum, test_scatter_view<host_device_type, Kokkos::Experimental::ScatterSum,
unsigned int>(10); unsigned int>(10);
test_scatter_view<cudauvm_device_type, Kokkos::Experimental::ScatterProd>( test_scatter_view<host_device_type, Kokkos::Experimental::ScatterProd>(10);
10); test_scatter_view<host_device_type, Kokkos::Experimental::ScatterMin>(10);
test_scatter_view<cudauvm_device_type, Kokkos::Experimental::ScatterMin>( test_scatter_view<host_device_type, Kokkos::Experimental::ScatterMax>(10);
10);
test_scatter_view<cudauvm_device_type, Kokkos::Experimental::ScatterMax>(
10);
} }
#endif #endif
} }
} // namespace Test } // namespace Test
#endif // KOKKOS_TEST_UNORDERED_MAP_HPP #endif // KOKKOS_TEST_SCATTER_VIEW_HPP
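The test refactor above replaces the CUDA-only `cuda_device_type`/`cudauvm_device_type` pair with backend-agnostic aliases so the same block also covers HIP, pairing the device execution space with its native memory space and with a host-accessible one. Since the side-by-side diff is hard to read, here is a header-style sketch of the resulting pattern (a restatement, not the literal test code):
````cpp
#include <Kokkos_Core.hpp>

// One set of names covers both GPU backends, so every ScatterView
// instantiation is exercised on a device-resident and on a host-accessible
// memory space.
#if defined(KOKKOS_ENABLE_CUDA)
using device_execution_space = Kokkos::Cuda;
using device_memory_space    = Kokkos::CudaSpace;
using host_accessible_space  = Kokkos::CudaUVMSpace;
#elif defined(KOKKOS_ENABLE_HIP)
using device_execution_space = Kokkos::Experimental::HIP;
using device_memory_space    = Kokkos::Experimental::HIPSpace;
using host_accessible_space  = Kokkos::Experimental::HIPHostPinnedSpace;
#endif

#if defined(KOKKOS_ENABLE_CUDA) || defined(KOKKOS_ENABLE_HIP)
using device_device_type =
    Kokkos::Device<device_execution_space, device_memory_space>;
using host_device_type =
    Kokkos::Device<device_execution_space, host_accessible_space>;

// Views on host_device_type run kernels on the GPU but stay dereferenceable
// from host code, which lets the test check results without extra copies.
static_assert(Kokkos::SpaceAccessibility<Kokkos::DefaultHostExecutionSpace,
                                         host_accessible_space>::accessible,
              "host code must be able to read the host-accessible space");
#endif
````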

View File

@ -200,8 +200,7 @@ void run_test_graph3(size_t B, size_t N) {
for (size_t i = 0; i < B; i++) { for (size_t i = 0; i < B; i++) {
size_t ne = 0; size_t ne = 0;
for (size_t j = hx.row_block_offsets(i); j < hx.row_block_offsets(i + 1); for (auto j = hx.row_block_offsets(i); j < hx.row_block_offsets(i + 1); j++)
j++)
ne += hx.row_map(j + 1) - hx.row_map(j) + C; ne += hx.row_map(j + 1) - hx.row_map(j) + C;
ASSERT_FALSE( ASSERT_FALSE(
@ -212,7 +211,7 @@ void run_test_graph3(size_t B, size_t N) {
template <class Space> template <class Space>
void run_test_graph4() { void run_test_graph4() {
using ordinal_type = unsigned; using ordinal_type = unsigned int;
using layout_type = Kokkos::LayoutRight; using layout_type = Kokkos::LayoutRight;
using space_type = Space; using space_type = Space;
using memory_traits_type = Kokkos::MemoryUnmanaged; using memory_traits_type = Kokkos::MemoryUnmanaged;
@ -286,7 +285,10 @@ void run_test_graph4() {
TEST(TEST_CATEGORY, staticcrsgraph) { TEST(TEST_CATEGORY, staticcrsgraph) {
TestStaticCrsGraph::run_test_graph<TEST_EXECSPACE>(); TestStaticCrsGraph::run_test_graph<TEST_EXECSPACE>();
// FIXME_SYCL requires MDRangePolicy
#ifndef KOKKOS_ENABLE_SYCL
TestStaticCrsGraph::run_test_graph2<TEST_EXECSPACE>(); TestStaticCrsGraph::run_test_graph2<TEST_EXECSPACE>();
#endif
TestStaticCrsGraph::run_test_graph3<TEST_EXECSPACE>(1, 0); TestStaticCrsGraph::run_test_graph3<TEST_EXECSPACE>(1, 0);
TestStaticCrsGraph::run_test_graph3<TEST_EXECSPACE>(1, 1000); TestStaticCrsGraph::run_test_graph3<TEST_EXECSPACE>(1, 1000);
TestStaticCrsGraph::run_test_graph3<TEST_EXECSPACE>(1, 10000); TestStaticCrsGraph::run_test_graph3<TEST_EXECSPACE>(1, 10000);

View File

@ -78,7 +78,7 @@ struct test_vector_insert {
// Looks like some std::vector implementations do not have the restriction // Looks like some std::vector implementations do not have the restriction
// right on the overload taking three iterators, and thus the following call // right on the overload taking three iterators, and thus the following call
// will hit that overload and then fail to compile. // will hit that overload and then fail to compile.
#if defined(KOKKOS_COMPILER_INTEL) && (1700 > KOKKOS_COMPILER_INTEL) #if defined(KOKKOS_COMPILER_INTEL)
// And at least GCC 4.8.4 doesn't implement vector insert correct for C++11 // And at least GCC 4.8.4 doesn't implement vector insert correct for C++11
// Return type is void ... // Return type is void ...
#if (__GNUC__ < 5) #if (__GNUC__ < 5)
@ -104,7 +104,7 @@ struct test_vector_insert {
// Looks like some std::vector implementations do not have the restriction // Looks like some std::vector implementations do not have the restriction
// right on the overload taking three iterators, and thus the following call // right on the overload taking three iterators, and thus the following call
// will hit that overload and then fail to compile. // will hit that overload and then fail to compile.
#if defined(KOKKOS_COMPILER_INTEL) && (1700 > KOKKOS_COMPILER_INTEL) #if defined(KOKKOS_COMPILER_INTEL)
b.insert(b.begin(), typename Vector::size_type(7), 9); b.insert(b.begin(), typename Vector::size_type(7), 9);
#else #else
b.insert(b.begin(), 7, 9); b.insert(b.begin(), 7, 9);
@ -125,7 +125,7 @@ struct test_vector_insert {
// Testing insert at end via all three function interfaces // Testing insert at end via all three function interfaces
a.insert(a.end(), 11); a.insert(a.end(), 11);
#if defined(KOKKOS_COMPILER_INTEL) && (1700 > KOKKOS_COMPILER_INTEL) #if defined(KOKKOS_COMPILER_INTEL)
a.insert(a.end(), typename Vector::size_type(2), 12); a.insert(a.end(), typename Vector::size_type(2), 12);
#else #else
a.insert(a.end(), 2, 12); a.insert(a.end(), 2, 12);

View File

@ -100,6 +100,5 @@
// TODO: No longer options in Kokkos. Need to be removed. // TODO: No longer options in Kokkos. Need to be removed.
#cmakedefine KOKKOS_USING_DEPRECATED_VIEW #cmakedefine KOKKOS_USING_DEPRECATED_VIEW
#cmakedefine KOKKOS_ENABLE_CXX11
#endif // !defined(KOKKOS_FOR_SIERRA) #endif // !defined(KOKKOS_FOR_SIERRA)

View File

@ -48,17 +48,10 @@ SET(SOURCES
PerfTest_ViewResize_8.cpp PerfTest_ViewResize_8.cpp
) )
IF(Kokkos_ENABLE_HIP)
# FIXME HIP requires TeamPolicy
LIST(REMOVE_ITEM SOURCES
PerfTest_CustomReduction.cpp
PerfTest_ExecSpacePartitioning.cpp
)
ENDIF()
IF(Kokkos_ENABLE_OPENMPTARGET) IF(Kokkos_ENABLE_OPENMPTARGET)
# FIXME OPENMPTARGET requires TeamPolicy Reductions and Custom Reduction # FIXME OPENMPTARGET requires TeamPolicy Reductions and Custom Reduction
LIST(REMOVE_ITEM SOURCES LIST(REMOVE_ITEM SOURCES
PerfTestGramSchmidt.cpp
PerfTest_CustomReduction.cpp PerfTest_CustomReduction.cpp
PerfTest_ExecSpacePartitioning.cpp PerfTest_ExecSpacePartitioning.cpp
) )
@ -75,7 +68,8 @@ KOKKOS_INCLUDE_DIRECTORIES(${CMAKE_CURRENT_BINARY_DIR})
KOKKOS_INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR}) KOKKOS_INCLUDE_DIRECTORIES(REQUIRED_DURING_INSTALLATION_TESTING ${CMAKE_CURRENT_SOURCE_DIR})
# This test currently times out for MSVC # This test currently times out for MSVC
IF(NOT KOKKOS_CXX_COMPILER_ID STREQUAL "MSVC") # FIXME_SYCL these tests don't compile yet (require parallel_for).
IF(NOT KOKKOS_CXX_COMPILER_ID STREQUAL "MSVC" AND NOT Kokkos_ENABLE_SYCL)
KOKKOS_ADD_EXECUTABLE_AND_TEST( KOKKOS_ADD_EXECUTABLE_AND_TEST(
PerfTestExec PerfTestExec
SOURCES ${SOURCES} SOURCES ${SOURCES}
@ -83,17 +77,28 @@ IF(NOT KOKKOS_CXX_COMPILER_ID STREQUAL "MSVC")
) )
ENDIF() ENDIF()
# FIXME_SYCL
IF(NOT Kokkos_ENABLE_SYCL)
KOKKOS_ADD_EXECUTABLE_AND_TEST( KOKKOS_ADD_EXECUTABLE_AND_TEST(
PerformanceTest_Atomic PerformanceTest_Atomic
SOURCES test_atomic.cpp SOURCES test_atomic.cpp
CATEGORIES PERFORMANCE CATEGORIES PERFORMANCE
) )
IF(NOT KOKKOS_ENABLE_CUDA OR KOKKOS_ENABLE_CUDA_LAMBDA)
KOKKOS_ADD_EXECUTABLE_AND_TEST(
PerformanceTest_Atomic_MinMax
SOURCES test_atomic_minmax_simple.cpp
CATEGORIES PERFORMANCE
)
ENDIF()
KOKKOS_ADD_EXECUTABLE_AND_TEST( KOKKOS_ADD_EXECUTABLE_AND_TEST(
PerformanceTest_Mempool PerformanceTest_Mempool
SOURCES test_mempool.cpp SOURCES test_mempool.cpp
CATEGORIES PERFORMANCE CATEGORIES PERFORMANCE
) )
ENDIF()
IF(NOT Kokkos_ENABLE_OPENMPTARGET) IF(NOT Kokkos_ENABLE_OPENMPTARGET)
# FIXME OPENMPTARGET needs tasking # FIXME OPENMPTARGET needs tasking

View File

@ -65,6 +65,12 @@ TEST_TARGETS += test-taskdag
# #
OBJ_ATOMICS_MINMAX = test_atomic_minmax_simple.o
TARGETS += KokkosCore_PerformanceTest_Atomics_MinMax
TEST_TARGETS += test-atomic-minmax
#
KokkosCore_PerformanceTest: $(OBJ_PERF) $(KOKKOS_LINK_DEPENDS) KokkosCore_PerformanceTest: $(OBJ_PERF) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(EXTRA_PATH) $(OBJ_PERF) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosCore_PerformanceTest $(LINK) $(EXTRA_PATH) $(OBJ_PERF) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosCore_PerformanceTest
@ -77,6 +83,9 @@ KokkosCore_PerformanceTest_Mempool: $(OBJ_MEMPOOL) $(KOKKOS_LINK_DEPENDS)
KokkosCore_PerformanceTest_TaskDAG: $(OBJ_TASKDAG) $(KOKKOS_LINK_DEPENDS) KokkosCore_PerformanceTest_TaskDAG: $(OBJ_TASKDAG) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_TASKDAG) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_PerformanceTest_TaskDAG $(LINK) $(KOKKOS_LDFLAGS) $(LDFLAGS) $(EXTRA_PATH) $(OBJ_TASKDAG) $(KOKKOS_LIBS) $(LIB) -o KokkosCore_PerformanceTest_TaskDAG
KokkosCore_PerformanceTest_Atomics_MinMax: $(OBJ_ATOMICS_MINMAX) $(KOKKOS_LINK_DEPENDS)
$(LINK) $(EXTRA_PATH) $(OBJ_ATOMICS_MINMAX) $(KOKKOS_LIBS) $(LIB) $(KOKKOS_LDFLAGS) $(LDFLAGS) -o KokkosCore_PerformanceTest_Atomics_MinMax
test-performance: KokkosCore_PerformanceTest test-performance: KokkosCore_PerformanceTest
./KokkosCore_PerformanceTest ./KokkosCore_PerformanceTest
@ -89,6 +98,9 @@ test-mempool: KokkosCore_PerformanceTest_Mempool
test-taskdag: KokkosCore_PerformanceTest_TaskDAG test-taskdag: KokkosCore_PerformanceTest_TaskDAG
./KokkosCore_PerformanceTest_TaskDAG ./KokkosCore_PerformanceTest_TaskDAG
test-atomic-minmax: KokkosCore_PerformanceTest_Atomics_MinMax
./KokkosCore_PerformanceTest_Atomics_MinMax
build_all: $(TARGETS) build_all: $(TARGETS)
test: $(TEST_TARGETS) test: $(TEST_TARGETS)

View File

@ -120,7 +120,7 @@ void run_resizeview_tests123(int N, int R) {
Kokkos::Timer timer; Kokkos::Timer timer;
for (int r = 0; r < R; r++) { for (int r = 0; r < R; r++) {
Kokkos::View<double*, Layout> a1( Kokkos::View<double*, Layout> a1(
Kokkos::ViewAllocateWithoutInitializing("A1"), int(N8 * 1.1)); Kokkos::view_alloc(Kokkos::WithoutInitializing, "A1"), int(N8 * 1.1));
double* a1_ptr = a1.data(); double* a1_ptr = a1.data();
Kokkos::parallel_for( Kokkos::parallel_for(
N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; }); N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; });
@ -201,7 +201,7 @@ void run_resizeview_tests45(int N, int R) {
Kokkos::Timer timer; Kokkos::Timer timer;
for (int r = 0; r < R; r++) { for (int r = 0; r < R; r++) {
Kokkos::View<double*, Layout> a1( Kokkos::View<double*, Layout> a1(
Kokkos::ViewAllocateWithoutInitializing("A1"), int(N8 * 1.1)); Kokkos::view_alloc(Kokkos::WithoutInitializing, "A1"), int(N8 * 1.1));
double* a1_ptr = a1.data(); double* a1_ptr = a1.data();
Kokkos::parallel_for( Kokkos::parallel_for(
N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; }); N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; });
@ -258,7 +258,7 @@ void run_resizeview_tests6(int N, int R) {
Kokkos::Timer timer; Kokkos::Timer timer;
for (int r = 0; r < R; r++) { for (int r = 0; r < R; r++) {
Kokkos::View<double*, Layout> a1( Kokkos::View<double*, Layout> a1(
Kokkos::ViewAllocateWithoutInitializing("A1"), int(N8 * 1.1)); Kokkos::view_alloc(Kokkos::WithoutInitializing, "A1"), int(N8 * 1.1));
double* a1_ptr = a1.data(); double* a1_ptr = a1.data();
Kokkos::parallel_for( Kokkos::parallel_for(
N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; }); N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; });
@ -311,7 +311,7 @@ void run_resizeview_tests7(int N, int R) {
Kokkos::Timer timer; Kokkos::Timer timer;
for (int r = 0; r < R; r++) { for (int r = 0; r < R; r++) {
Kokkos::View<double*, Layout> a1( Kokkos::View<double*, Layout> a1(
Kokkos::ViewAllocateWithoutInitializing("A1"), int(N8 * 1.1)); Kokkos::view_alloc(Kokkos::WithoutInitializing, "A1"), int(N8 * 1.1));
double* a1_ptr = a1.data(); double* a1_ptr = a1.data();
Kokkos::parallel_for( Kokkos::parallel_for(
N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; }); N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; });
@ -366,7 +366,7 @@ void run_resizeview_tests8(int N, int R) {
Kokkos::Timer timer; Kokkos::Timer timer;
for (int r = 0; r < R; r++) { for (int r = 0; r < R; r++) {
Kokkos::View<double*, Layout> a1( Kokkos::View<double*, Layout> a1(
Kokkos::ViewAllocateWithoutInitializing("A1"), int(N8 * 1.1)); Kokkos::view_alloc(Kokkos::WithoutInitializing, "A1"), int(N8 * 1.1));
double* a1_ptr = a1.data(); double* a1_ptr = a1.data();
Kokkos::parallel_for( Kokkos::parallel_for(
N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; }); N8, KOKKOS_LAMBDA(const int& i) { a1_ptr[i] = a_ptr[i]; });

View File

@ -0,0 +1,244 @@
// export OMP_PROC_BIND=spread ; export OMP_PLACES=threads
// c++ -O2 -g -DNDEBUG -fopenmp
// ../core/perf_test/test_atomic_minmax_simple.cpp -I../core/src/ -I. -o
// test_atomic_minmax_simple.x containers/src/libkokkoscontainers.a
// core/src/libkokkoscore.a -ldl && OMP_NUM_THREADS=1
// ./test_atomic_minmax_simple.x 10000000
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <typeinfo>
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Timer.hpp>
using exec_space = Kokkos::DefaultExecutionSpace;
template <typename T>
void test(const int length) {
Kokkos::Impl::Timer timer;
using vector = Kokkos::View<T*, exec_space>;
vector inp("input", length);
T max = std::numeric_limits<T>::max();
T min = std::numeric_limits<T>::lowest();
// input is max values - all min atomics will replace
{
Kokkos::parallel_for(
length, KOKKOS_LAMBDA(const int i) { inp(i) = max; });
Kokkos::fence();
timer.reset();
Kokkos::parallel_for(
length, KOKKOS_LAMBDA(const int i) {
(void)Kokkos::atomic_fetch_min(&(inp(i)), (T)i);
});
Kokkos::fence();
double time = timer.seconds();
int errors(0);
Kokkos::parallel_reduce(
length,
KOKKOS_LAMBDA(const int i, int& inner) { inner += (inp(i) != (T)i); },
errors);
Kokkos::fence();
if (errors) {
std::cerr << "Error in 100% min replacements: " << errors << std::endl;
std::cerr << "inp(0)=" << inp(0) << std::endl;
}
std::cout << "Time for 100% min replacements: " << time << std::endl;
}
// input is min values - all max atomics will replace
{
Kokkos::parallel_for(
length, KOKKOS_LAMBDA(const int i) { inp(i) = min; });
Kokkos::fence();
timer.reset();
Kokkos::parallel_for(
length, KOKKOS_LAMBDA(const int i) {
(void)Kokkos::atomic_max_fetch(&(inp(i)), (T)i);
});
Kokkos::fence();
double time = timer.seconds();
int errors(0);
Kokkos::parallel_reduce(
length,
KOKKOS_LAMBDA(const int i, int& inner) { inner += (inp(i) != (T)i); },
errors);
Kokkos::fence();
if (errors) {
std::cerr << "Error in 100% max replacements: " << errors << std::endl;
std::cerr << "inp(0)=" << inp(0) << std::endl;
}
std::cout << "Time for 100% max replacements: " << time << std::endl;
}
// input is max values - all max atomics will early exit
{
Kokkos::parallel_for(
length, KOKKOS_LAMBDA(const int i) { inp(i) = max; });
Kokkos::fence();
timer.reset();
Kokkos::parallel_for(
length, KOKKOS_LAMBDA(const int i) {
(void)Kokkos::atomic_max_fetch(&(inp(i)), (T)i);
});
Kokkos::fence();
double time = timer.seconds();
int errors(0);
Kokkos::parallel_reduce(
length,
KOKKOS_LAMBDA(const int i, int& inner) {
T ref = max;
inner += (inp(i) != ref);
},
errors);
Kokkos::fence();
if (errors) {
std::cerr << "Error in 100% max early exits: " << errors << std::endl;
std::cerr << "inp(0)=" << inp(0) << std::endl;
}
std::cout << "Time for 100% max early exits: " << time << std::endl;
}
// input is min values - all min atomics will early exit
{
Kokkos::parallel_for(
length, KOKKOS_LAMBDA(const int i) { inp(i) = min; });
Kokkos::fence();
timer.reset();
Kokkos::parallel_for(
length, KOKKOS_LAMBDA(const int i) {
(void)Kokkos::atomic_min_fetch(&(inp(i)), (T)i);
});
Kokkos::fence();
double time = timer.seconds();
int errors(0);
Kokkos::parallel_reduce(
length,
KOKKOS_LAMBDA(const int i, int& inner) {
T ref = min;
inner += (inp(i) != ref);
},
errors);
Kokkos::fence();
if (errors) {
std::cerr << "Error in 100% min early exits: " << errors << std::endl;
std::cerr << "inp(0)=" << inp(0) << std::endl;
if (length > 9) std::cout << "inp(9)=" << inp(9) << std::endl;
}
std::cout << "Time for 100% min early exits: " << time << std::endl;
}
// limit iterations for contentious test, takes ~50x longer for same length
auto con_length = length / 5;
// input is min values - some max atomics will replace
{
Kokkos::parallel_for(
1, KOKKOS_LAMBDA(const int i) { inp(i) = min; });
Kokkos::fence();
T current(0);
timer.reset();
Kokkos::parallel_reduce(
con_length,
KOKKOS_LAMBDA(const int i, T& inner) {
inner = Kokkos::atomic_max_fetch(&(inp(0)), inner + 1);
if (i == con_length - 1) {
Kokkos::atomic_max_fetch(&(inp(0)), max);
inner = max;
}
},
Kokkos::Max<T>(current));
Kokkos::fence();
double time = timer.seconds();
if (current < max) {
std::cerr << "Error in contentious max replacements: " << std::endl;
std::cerr << "final=" << current << " inp(0)=" << inp(0) << " max=" << max
<< std::endl;
}
std::cout << "Time for contentious max " << con_length
<< " replacements: " << time << std::endl;
}
// input is max values - some min atomics will replace
{
Kokkos::parallel_for(
1, KOKKOS_LAMBDA(const int i) { inp(i) = max; });
Kokkos::fence();
timer.reset();
T current(100000000);
Kokkos::parallel_reduce(
con_length,
KOKKOS_LAMBDA(const int i, T& inner) {
inner = Kokkos::atomic_min_fetch(&(inp(0)), inner - 1);
if (i == con_length - 1) {
Kokkos::atomic_min_fetch(&(inp(0)), min);
inner = min;
}
},
Kokkos::Min<T>(current));
Kokkos::fence();
double time = timer.seconds();
if (current > min) {
std::cerr << "Error in contentious min replacements: " << std::endl;
std::cerr << "final=" << current << " inp(0)=" << inp(0) << " min=" << min
<< std::endl;
}
std::cout << "Time for contentious min " << con_length
<< " replacements: " << time << std::endl;
}
}
int main(int argc, char* argv[]) {
Kokkos::initialize(argc, argv);
{
int length = 1000000;
if (argc == 2) {
length = std::stoi(argv[1]);
}
if (length < 1) {
throw std::invalid_argument("");
}
std::cout << "================ int" << std::endl;
test<int>(length);
std::cout << "================ long" << std::endl;
test<long>(length);
std::cout << "================ long long" << std::endl;
test<long long>(length);
std::cout << "================ unsigned int" << std::endl;
test<unsigned int>(length);
std::cout << "================ unsigned long" << std::endl;
test<unsigned long>(length);
std::cout << "================ unsigned long long" << std::endl;
test<unsigned long long>(length);
std::cout << "================ float" << std::endl;
test<float>(length);
std::cout << "================ double" << std::endl;
test<double>(length);
}
Kokkos::finalize();
return 0;
}

View File

@ -19,10 +19,6 @@ SET(KOKKOS_CORE_HEADERS)
APPEND_GLOB(KOKKOS_CORE_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/*.hpp) APPEND_GLOB(KOKKOS_CORE_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/*.hpp)
APPEND_GLOB(KOKKOS_CORE_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/impl/*.hpp) APPEND_GLOB(KOKKOS_CORE_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/impl/*.hpp)
IF (KOKKOS_ENABLE_ROCM)
APPEND_GLOB(KOKKOS_CORE_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/ROCm/*.cpp)
ENDIF()
IF (KOKKOS_ENABLE_CUDA) IF (KOKKOS_ENABLE_CUDA)
APPEND_GLOB(KOKKOS_CORE_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/Cuda/*.cpp) APPEND_GLOB(KOKKOS_CORE_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/Cuda/*.cpp)
APPEND_GLOB(KOKKOS_CORE_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/Cuda/*.hpp) APPEND_GLOB(KOKKOS_CORE_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/Cuda/*.hpp)
@ -64,6 +60,11 @@ ELSE()
LIST(REMOVE_ITEM KOKKOS_CORE_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/impl/Kokkos_Serial_task.cpp) LIST(REMOVE_ITEM KOKKOS_CORE_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/impl/Kokkos_Serial_task.cpp)
ENDIF() ENDIF()
IF (KOKKOS_ENABLE_SYCL)
APPEND_GLOB(KOKKOS_CORE_SRCS ${CMAKE_CURRENT_SOURCE_DIR}/SYCL/*.cpp)
APPEND_GLOB(KOKKOS_CORE_HEADERS ${CMAKE_CURRENT_SOURCE_DIR}/SYCL/*.hpp)
ENDIF()
KOKKOS_ADD_LIBRARY( KOKKOS_ADD_LIBRARY(
kokkoscore kokkoscore
SOURCES ${KOKKOS_CORE_SRCS} SOURCES ${KOKKOS_CORE_SRCS}

File diff suppressed because it is too large

View File

@ -146,9 +146,9 @@ void CudaSpace::access_error(const void *const) {
bool CudaUVMSpace::available() { bool CudaUVMSpace::available() {
#if defined(CUDA_VERSION) && !defined(__APPLE__) #if defined(CUDA_VERSION) && !defined(__APPLE__)
enum { UVM_available = true }; enum : bool { UVM_available = true };
#else #else
enum { UVM_available = false }; enum : bool { UVM_available = false };
#endif #endif
return UVM_available; return UVM_available;
} }
@ -201,8 +201,15 @@ CudaHostPinnedSpace::CudaHostPinnedSpace() {}
void *CudaSpace::allocate(const size_t arg_alloc_size) const { void *CudaSpace::allocate(const size_t arg_alloc_size) const {
return allocate("[unlabeled]", arg_alloc_size); return allocate("[unlabeled]", arg_alloc_size);
} }
void *CudaSpace::allocate(const char *arg_label, const size_t arg_alloc_size, void *CudaSpace::allocate(const char *arg_label, const size_t arg_alloc_size,
const size_t arg_logical_size) const { const size_t arg_logical_size) const {
return impl_allocate(arg_label, arg_alloc_size, arg_logical_size);
}
void *CudaSpace::impl_allocate(
const char *arg_label, const size_t arg_alloc_size,
const size_t arg_logical_size,
const Kokkos::Tools::SpaceHandle arg_handle) const {
void *ptr = nullptr; void *ptr = nullptr;
auto error_code = cudaMalloc(&ptr, arg_alloc_size); auto error_code = cudaMalloc(&ptr, arg_alloc_size);
@ -219,9 +226,7 @@ void *CudaSpace::allocate(const char *arg_label, const size_t arg_alloc_size,
if (Kokkos::Profiling::profileLibraryLoaded()) { if (Kokkos::Profiling::profileLibraryLoaded()) {
const size_t reported_size = const size_t reported_size =
(arg_logical_size > 0) ? arg_logical_size : arg_alloc_size; (arg_logical_size > 0) ? arg_logical_size : arg_alloc_size;
Kokkos::Profiling::allocateData( Kokkos::Profiling::allocateData(arg_handle, arg_label, ptr, reported_size);
Kokkos::Profiling::make_space_handle(name()), arg_label, ptr,
reported_size);
} }
return ptr; return ptr;
} }
@ -231,6 +236,12 @@ void *CudaUVMSpace::allocate(const size_t arg_alloc_size) const {
} }
void *CudaUVMSpace::allocate(const char *arg_label, const size_t arg_alloc_size, void *CudaUVMSpace::allocate(const char *arg_label, const size_t arg_alloc_size,
const size_t arg_logical_size) const { const size_t arg_logical_size) const {
return impl_allocate(arg_label, arg_alloc_size, arg_logical_size);
}
void *CudaUVMSpace::impl_allocate(
const char *arg_label, const size_t arg_alloc_size,
const size_t arg_logical_size,
const Kokkos::Tools::SpaceHandle arg_handle) const {
void *ptr = nullptr; void *ptr = nullptr;
Cuda::impl_static_fence(); Cuda::impl_static_fence();
@ -260,19 +271,22 @@ void *CudaUVMSpace::allocate(const char *arg_label, const size_t arg_alloc_size,
if (Kokkos::Profiling::profileLibraryLoaded()) { if (Kokkos::Profiling::profileLibraryLoaded()) {
const size_t reported_size = const size_t reported_size =
(arg_logical_size > 0) ? arg_logical_size : arg_alloc_size; (arg_logical_size > 0) ? arg_logical_size : arg_alloc_size;
Kokkos::Profiling::allocateData( Kokkos::Profiling::allocateData(arg_handle, arg_label, ptr, reported_size);
Kokkos::Profiling::make_space_handle(name()), arg_label, ptr,
reported_size);
} }
return ptr; return ptr;
} }
void *CudaHostPinnedSpace::allocate(const size_t arg_alloc_size) const { void *CudaHostPinnedSpace::allocate(const size_t arg_alloc_size) const {
return allocate("[unlabeled]", arg_alloc_size); return allocate("[unlabeled]", arg_alloc_size);
} }
void *CudaHostPinnedSpace::allocate(const char *arg_label, void *CudaHostPinnedSpace::allocate(const char *arg_label,
const size_t arg_alloc_size, const size_t arg_alloc_size,
const size_t arg_logical_size) const { const size_t arg_logical_size) const {
return impl_allocate(arg_label, arg_alloc_size, arg_logical_size);
}
void *CudaHostPinnedSpace::impl_allocate(
const char *arg_label, const size_t arg_alloc_size,
const size_t arg_logical_size,
const Kokkos::Tools::SpaceHandle arg_handle) const {
void *ptr = nullptr; void *ptr = nullptr;
auto error_code = cudaHostAlloc(&ptr, arg_alloc_size, cudaHostAllocDefault); auto error_code = cudaHostAlloc(&ptr, arg_alloc_size, cudaHostAllocDefault);
@ -288,9 +302,7 @@ void *CudaHostPinnedSpace::allocate(const char *arg_label,
if (Kokkos::Profiling::profileLibraryLoaded()) { if (Kokkos::Profiling::profileLibraryLoaded()) {
const size_t reported_size = const size_t reported_size =
(arg_logical_size > 0) ? arg_logical_size : arg_alloc_size; (arg_logical_size > 0) ? arg_logical_size : arg_alloc_size;
Kokkos::Profiling::allocateData( Kokkos::Profiling::allocateData(arg_handle, arg_label, ptr, reported_size);
Kokkos::Profiling::make_space_handle(name()), arg_label, ptr,
reported_size);
} }
return ptr; return ptr;
} }
@ -304,11 +316,16 @@ void CudaSpace::deallocate(void *const arg_alloc_ptr,
void CudaSpace::deallocate(const char *arg_label, void *const arg_alloc_ptr, void CudaSpace::deallocate(const char *arg_label, void *const arg_alloc_ptr,
const size_t arg_alloc_size, const size_t arg_alloc_size,
const size_t arg_logical_size) const { const size_t arg_logical_size) const {
impl_deallocate(arg_label, arg_alloc_ptr, arg_alloc_size, arg_logical_size);
}
void CudaSpace::impl_deallocate(
const char *arg_label, void *const arg_alloc_ptr,
const size_t arg_alloc_size, const size_t arg_logical_size,
const Kokkos::Tools::SpaceHandle arg_handle) const {
if (Kokkos::Profiling::profileLibraryLoaded()) { if (Kokkos::Profiling::profileLibraryLoaded()) {
const size_t reported_size = const size_t reported_size =
(arg_logical_size > 0) ? arg_logical_size : arg_alloc_size; (arg_logical_size > 0) ? arg_logical_size : arg_alloc_size;
Kokkos::Profiling::deallocateData( Kokkos::Profiling::deallocateData(arg_handle, arg_label, arg_alloc_ptr,
Kokkos::Profiling::make_space_handle(name()), arg_label, arg_alloc_ptr,
reported_size); reported_size);
} }
@ -327,12 +344,20 @@ void CudaUVMSpace::deallocate(const char *arg_label, void *const arg_alloc_ptr,
, ,
const size_t arg_logical_size) const { const size_t arg_logical_size) const {
impl_deallocate(arg_label, arg_alloc_ptr, arg_alloc_size, arg_logical_size);
}
void CudaUVMSpace::impl_deallocate(
const char *arg_label, void *const arg_alloc_ptr,
const size_t arg_alloc_size
,
const size_t arg_logical_size,
const Kokkos::Tools::SpaceHandle arg_handle) const {
Cuda::impl_static_fence(); Cuda::impl_static_fence();
if (Kokkos::Profiling::profileLibraryLoaded()) { if (Kokkos::Profiling::profileLibraryLoaded()) {
const size_t reported_size = const size_t reported_size =
(arg_logical_size > 0) ? arg_logical_size : arg_alloc_size; (arg_logical_size > 0) ? arg_logical_size : arg_alloc_size;
Kokkos::Profiling::deallocateData( Kokkos::Profiling::deallocateData(arg_handle, arg_label, arg_alloc_ptr,
Kokkos::Profiling::make_space_handle(name()), arg_label, arg_alloc_ptr,
reported_size); reported_size);
} }
try { try {
@ -349,16 +374,21 @@ void CudaHostPinnedSpace::deallocate(void *const arg_alloc_ptr,
const size_t arg_alloc_size) const { const size_t arg_alloc_size) const {
deallocate("[unlabeled]", arg_alloc_ptr, arg_alloc_size); deallocate("[unlabeled]", arg_alloc_ptr, arg_alloc_size);
} }
void CudaHostPinnedSpace::deallocate(const char *arg_label, void CudaHostPinnedSpace::deallocate(const char *arg_label,
void *const arg_alloc_ptr, void *const arg_alloc_ptr,
const size_t arg_alloc_size, const size_t arg_alloc_size,
const size_t arg_logical_size) const { const size_t arg_logical_size) const {
impl_deallocate(arg_label, arg_alloc_ptr, arg_alloc_size, arg_logical_size);
}
void CudaHostPinnedSpace::impl_deallocate(
const char *arg_label, void *const arg_alloc_ptr,
const size_t arg_alloc_size, const size_t arg_logical_size,
const Kokkos::Tools::SpaceHandle arg_handle) const {
if (Kokkos::Profiling::profileLibraryLoaded()) { if (Kokkos::Profiling::profileLibraryLoaded()) {
const size_t reported_size = const size_t reported_size =
(arg_logical_size > 0) ? arg_logical_size : arg_alloc_size; (arg_logical_size > 0) ? arg_logical_size : arg_alloc_size;
Kokkos::Profiling::deallocateData( Kokkos::Profiling::deallocateData(arg_handle, arg_label, arg_alloc_ptr,
Kokkos::Profiling::make_space_handle(name()), arg_label, arg_alloc_ptr,
reported_size); reported_size);
} }
try { try {
@ -375,7 +405,7 @@ void CudaHostPinnedSpace::deallocate(const char *arg_label,
namespace Kokkos { namespace Kokkos {
namespace Impl { namespace Impl {
#ifdef KOKKOS_DEBUG #ifdef KOKKOS_ENABLE_DEBUG
SharedAllocationRecord<void, void> SharedAllocationRecord<void, void>
SharedAllocationRecord<Kokkos::CudaSpace, void>::s_root_record; SharedAllocationRecord<Kokkos::CudaSpace, void>::s_root_record;
@ -551,7 +581,7 @@ SharedAllocationRecord<Kokkos::CudaSpace, void>::SharedAllocationRecord(
// Pass through allocated [ SharedAllocationHeader , user_memory ] // Pass through allocated [ SharedAllocationHeader , user_memory ]
// Pass through deallocation function // Pass through deallocation function
: SharedAllocationRecord<void, void>( : SharedAllocationRecord<void, void>(
#ifdef KOKKOS_DEBUG #ifdef KOKKOS_ENABLE_DEBUG
&SharedAllocationRecord<Kokkos::CudaSpace, void>::s_root_record, &SharedAllocationRecord<Kokkos::CudaSpace, void>::s_root_record,
#endif #endif
Impl::checked_allocation_with_header(arg_space, arg_label, Impl::checked_allocation_with_header(arg_space, arg_label,
@ -582,7 +612,7 @@ SharedAllocationRecord<Kokkos::CudaUVMSpace, void>::SharedAllocationRecord(
// Pass through allocated [ SharedAllocationHeader , user_memory ] // Pass through allocated [ SharedAllocationHeader , user_memory ]
// Pass through deallocation function // Pass through deallocation function
: SharedAllocationRecord<void, void>( : SharedAllocationRecord<void, void>(
#ifdef KOKKOS_DEBUG #ifdef KOKKOS_ENABLE_DEBUG
&SharedAllocationRecord<Kokkos::CudaUVMSpace, void>::s_root_record, &SharedAllocationRecord<Kokkos::CudaUVMSpace, void>::s_root_record,
#endif #endif
Impl::checked_allocation_with_header(arg_space, arg_label, Impl::checked_allocation_with_header(arg_space, arg_label,
@ -610,7 +640,7 @@ SharedAllocationRecord<Kokkos::CudaHostPinnedSpace, void>::
// Pass through allocated [ SharedAllocationHeader , user_memory ] // Pass through allocated [ SharedAllocationHeader , user_memory ]
// Pass through deallocation function // Pass through deallocation function
: SharedAllocationRecord<void, void>( : SharedAllocationRecord<void, void>(
#ifdef KOKKOS_DEBUG #ifdef KOKKOS_ENABLE_DEBUG
&SharedAllocationRecord<Kokkos::CudaHostPinnedSpace, &SharedAllocationRecord<Kokkos::CudaHostPinnedSpace,
void>::s_root_record, void>::s_root_record,
#endif #endif
@ -830,7 +860,7 @@ void SharedAllocationRecord<Kokkos::CudaSpace, void>::print_records(
std::ostream &s, const Kokkos::CudaSpace &, bool detail) { std::ostream &s, const Kokkos::CudaSpace &, bool detail) {
(void)s; (void)s;
(void)detail; (void)detail;
#ifdef KOKKOS_DEBUG #ifdef KOKKOS_ENABLE_DEBUG
SharedAllocationRecord<void, void> *r = &s_root_record; SharedAllocationRecord<void, void> *r = &s_root_record;
char buffer[256]; char buffer[256];
@ -896,7 +926,7 @@ void SharedAllocationRecord<Kokkos::CudaSpace, void>::print_records(
#else #else
Kokkos::Impl::throw_runtime_exception( Kokkos::Impl::throw_runtime_exception(
"SharedAllocationHeader<CudaSpace>::print_records only works with " "SharedAllocationHeader<CudaSpace>::print_records only works with "
"KOKKOS_DEBUG enabled"); "KOKKOS_ENABLE_DEBUG enabled");
#endif #endif
} }
@ -904,13 +934,13 @@ void SharedAllocationRecord<Kokkos::CudaUVMSpace, void>::print_records(
std::ostream &s, const Kokkos::CudaUVMSpace &, bool detail) { std::ostream &s, const Kokkos::CudaUVMSpace &, bool detail) {
(void)s; (void)s;
(void)detail; (void)detail;
#ifdef KOKKOS_DEBUG #ifdef KOKKOS_ENABLE_DEBUG
SharedAllocationRecord<void, void>::print_host_accessible_records( SharedAllocationRecord<void, void>::print_host_accessible_records(
s, "CudaUVM", &s_root_record, detail); s, "CudaUVM", &s_root_record, detail);
#else #else
Kokkos::Impl::throw_runtime_exception( Kokkos::Impl::throw_runtime_exception(
"SharedAllocationHeader<CudaSpace>::print_records only works with " "SharedAllocationHeader<CudaSpace>::print_records only works with "
"KOKKOS_DEBUG enabled"); "KOKKOS_ENABLE_DEBUG enabled");
#endif #endif
} }
@ -918,13 +948,13 @@ void SharedAllocationRecord<Kokkos::CudaHostPinnedSpace, void>::print_records(
std::ostream &s, const Kokkos::CudaHostPinnedSpace &, bool detail) { std::ostream &s, const Kokkos::CudaHostPinnedSpace &, bool detail) {
(void)s; (void)s;
(void)detail; (void)detail;
#ifdef KOKKOS_DEBUG #ifdef KOKKOS_ENABLE_DEBUG
SharedAllocationRecord<void, void>::print_host_accessible_records( SharedAllocationRecord<void, void>::print_host_accessible_records(
s, "CudaHostPinned", &s_root_record, detail); s, "CudaHostPinned", &s_root_record, detail);
#else #else
Kokkos::Impl::throw_runtime_exception( Kokkos::Impl::throw_runtime_exception(
"SharedAllocationHeader<CudaSpace>::print_records only works with " "SharedAllocationHeader<CudaSpace>::print_records only works with "
"KOKKOS_DEBUG enabled"); "KOKKOS_ENABLE_DEBUG enabled");
#endif #endif
} }
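
The changes above split each `allocate`/`deallocate` into a thin public entry point and an `impl_*` overload that carries a `Kokkos::Tools::SpaceHandle`, so the profiling callbacks receive the handle directly instead of rebuilding it from `name()` at every call site. A hypothetical sketch of the same forwarding pattern on a toy space (not a Kokkos class) is shown below; `ToySpace`, `SpaceHandle{"Toy"}`, and the printed message are illustrative stand-ins only.
````cpp
#include <cstdio>
#include <cstdlib>

struct SpaceHandle {  // stand-in for Kokkos::Tools::SpaceHandle
  const char* name;
};

class ToySpace {
 public:
  // Public entry point: forward once, attaching the handle, mirroring how
  // CudaSpace::allocate now delegates to CudaSpace::impl_allocate.
  void* allocate(const char* label, std::size_t size,
                 std::size_t logical_size = 0) const {
    return impl_allocate(label, size, logical_size, SpaceHandle{"Toy"});
  }

 private:
  void* impl_allocate(const char* label, std::size_t size,
                      std::size_t logical_size, SpaceHandle handle) const {
    void* ptr = std::malloc(size);
    std::size_t const reported = (logical_size > 0) ? logical_size : size;
    // A profiling hook would receive `handle` directly at this point.
    std::printf("allocateData(%s, %s, %p, %zu)\n", handle.name, label, ptr,
                reported);
    return ptr;
  }
};
````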

View File

@ -198,6 +198,39 @@ int cuda_get_opt_block_size(const CudaInternal* cuda_instance,
LaunchBounds{}); LaunchBounds{});
} }
// Assuming cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1)
// NOTE these numbers can be obtained in several ways:
// * One option is to download the CUDA Occupancy Calculator spreadsheet, select
// "Compute Capability" first and check what is the smallest "Shared Memory
// Size Config" that is available. The "Shared Memory Per Multiprocessor" in
// bytes is then to be found below in the summary.
// * Another option would be to look for the information in the "Tuning
// Guide(s)" of the CUDA Toolkit Documentation for each GPU architecture, in
// the "Shared Memory" section (more tedious)
inline size_t get_shmem_per_sm_prefer_l1(cudaDeviceProp const& properties) {
int const compute_capability = properties.major * 10 + properties.minor;
return [compute_capability]() {
switch (compute_capability) {
case 30:
case 32:
case 35: return 16;
case 37: return 80;
case 50:
case 53:
case 60:
case 62: return 64;
case 52:
case 61: return 96;
case 70:
case 80: return 8;
case 75: return 32;
default:
Kokkos::Impl::throw_runtime_exception(
"Unknown device in cuda block size deduction");
}
return 0;
}() * 1024;
}
} // namespace Impl } // namespace Impl
} // namespace Kokkos } // namespace Kokkos
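
The helper above maps a compute capability to the shared memory per SM available when the kernel prefers L1, returned in bytes. A hedged usage sketch follows; `get_shmem_per_sm_prefer_l1` lives in the internal `Kokkos::Impl` namespace, so this assumes the defining header is already included (as it is inside the CUDA backend) and is for illustration only.
````cpp
#include <cuda_runtime_api.h>
#include <cstdio>

// Print the prefer-L1 shared memory budget for a given CUDA device.
void print_prefer_l1_shmem(int device) {
  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, device);
  size_t const shmem = Kokkos::Impl::get_shmem_per_sm_prefer_l1(props);
  std::printf("CC %d.%d: %zu bytes of shared memory per SM (prefer L1)\n",
              props.major, props.minor, shmem);
}
````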

View File

@ -0,0 +1,210 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 3.0
// Copyright (2020) National Technology & Engineering
// Solutions of Sandia, LLC (NTESS).
//
// Under the terms of Contract DE-NA0003525 with NTESS,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_KOKKOS_CUDA_GRAPHNODEKERNEL_IMPL_HPP
#define KOKKOS_KOKKOS_CUDA_GRAPHNODEKERNEL_IMPL_HPP
#include <Kokkos_Macros.hpp>
#if defined(KOKKOS_ENABLE_CUDA) && defined(KOKKOS_CUDA_ENABLE_GRAPHS)
#include <Kokkos_Graph_fwd.hpp>
#include <impl/Kokkos_GraphImpl.hpp> // GraphAccess needs to be complete
#include <impl/Kokkos_SharedAlloc.hpp> // SharedAllocationRecord
#include <Kokkos_Parallel.hpp>
#include <Kokkos_Parallel_Reduce.hpp>
#include <Kokkos_PointerOwnership.hpp>
#include <Kokkos_Cuda.hpp>
#include <cuda_runtime_api.h>
namespace Kokkos {
namespace Impl {
template <class PolicyType, class Functor, class PatternTag, class... Args>
class GraphNodeKernelImpl<Kokkos::Cuda, PolicyType, Functor, PatternTag,
Args...>
: public PatternImplSpecializationFromTag<PatternTag, Functor, PolicyType,
Args..., Kokkos::Cuda>::type {
private:
using base_t =
typename PatternImplSpecializationFromTag<PatternTag, Functor, PolicyType,
Args..., Kokkos::Cuda>::type;
using size_type = Kokkos::Cuda::size_type;
// These are really functioning as optional references, though I'm not sure
// that the cudaGraph_t one needs to be since it's a pointer under the
// covers and we're not modifying it
Kokkos::ObservingRawPtr<const cudaGraph_t> m_graph_ptr = nullptr;
Kokkos::ObservingRawPtr<cudaGraphNode_t> m_graph_node_ptr = nullptr;
// Note: owned pointer to CudaSpace memory (used for global memory launches),
// which we're responsible for deallocating, but not responsible for calling
// its destructor.
using Record = Kokkos::Impl::SharedAllocationRecord<Kokkos::CudaSpace, void>;
// Basically, we have to make this mutable for the same reasons that the
// global kernel buffers in the Cuda instance are mutable...
mutable Kokkos::OwningRawPtr<base_t> m_driver_storage = nullptr;
public:
using Policy = PolicyType;
using graph_kernel = GraphNodeKernelImpl;
// TODO Ensure the execution space of the graph is the same as the one
// attached to the policy?
// TODO @graph kernel name info propagation
template <class PolicyDeduced, class... ArgsDeduced>
GraphNodeKernelImpl(std::string, Kokkos::Cuda const&, Functor arg_functor,
PolicyDeduced&& arg_policy, ArgsDeduced&&... args)
// This is super ugly, but it works for now and is the most minimal change
// to the codebase for now...
: base_t(std::move(arg_functor), (PolicyDeduced &&) arg_policy,
(ArgsDeduced &&) args...) {}
// FIXME @graph Forward through the instance once that works in the backends
template <class PolicyDeduced>
GraphNodeKernelImpl(Kokkos::Cuda const& ex, Functor arg_functor,
PolicyDeduced&& arg_policy)
: GraphNodeKernelImpl("", ex, std::move(arg_functor),
(PolicyDeduced &&) arg_policy) {}
~GraphNodeKernelImpl() {
if (m_driver_storage) {
// We should be the only owner, but this is still the easiest way to
// allocate and deallocate aligned memory for these sorts of things
Record::decrement(Record::get_record(m_driver_storage));
}
}
void set_cuda_graph_ptr(cudaGraph_t* arg_graph_ptr) {
m_graph_ptr = arg_graph_ptr;
}
void set_cuda_graph_node_ptr(cudaGraphNode_t* arg_node_ptr) {
m_graph_node_ptr = arg_node_ptr;
}
cudaGraphNode_t* get_cuda_graph_node_ptr() const { return m_graph_node_ptr; }
cudaGraph_t const* get_cuda_graph_ptr() const { return m_graph_ptr; }
Kokkos::ObservingRawPtr<base_t> allocate_driver_memory_buffer() const {
KOKKOS_EXPECTS(m_driver_storage == nullptr)
auto* record = Record::allocate(
Kokkos::CudaSpace{}, "GraphNodeKernel global memory functor storage",
sizeof(base_t));
Record::increment(record);
m_driver_storage = reinterpret_cast<base_t*>(record->data());
KOKKOS_ENSURES(m_driver_storage != nullptr)
return m_driver_storage;
}
};
struct CudaGraphNodeAggregateKernel {
using graph_kernel = CudaGraphNodeAggregateKernel;
// Aggregates don't need a policy, but for the purposes of checking the static
// assertions about graph kernels, a trivial one is defined here.
struct Policy {
using is_graph_kernel = std::true_type;
};
};
template <class KernelType,
class Tag =
typename PatternTagFromImplSpecialization<KernelType>::type>
struct get_graph_node_kernel_type
: identity<GraphNodeKernelImpl<Kokkos::Cuda, typename KernelType::Policy,
typename KernelType::functor_type, Tag>> {};
template <class KernelType>
struct get_graph_node_kernel_type<KernelType, Kokkos::ParallelReduceTag>
: identity<GraphNodeKernelImpl<Kokkos::Cuda, typename KernelType::Policy,
typename KernelType::functor_type,
Kokkos::ParallelReduceTag,
typename KernelType::reducer_type>> {};
//==============================================================================
// <editor-fold desc="get_cuda_graph_*() helper functions"> {{{1
template <class KernelType>
auto* allocate_driver_storage_for_kernel(KernelType const& kernel) {
using graph_node_kernel_t =
typename get_graph_node_kernel_type<KernelType>::type;
auto const& kernel_as_graph_kernel =
static_cast<graph_node_kernel_t const&>(kernel);
// TODO @graphs we need to somehow indicate the need for a fence in the
// destructor of the GraphImpl object (so that we don't have to
// just always do it)
return kernel_as_graph_kernel.allocate_driver_memory_buffer();
}
template <class KernelType>
auto const& get_cuda_graph_from_kernel(KernelType const& kernel) {
using graph_node_kernel_t =
typename get_graph_node_kernel_type<KernelType>::type;
auto const& kernel_as_graph_kernel =
static_cast<graph_node_kernel_t const&>(kernel);
cudaGraph_t const* graph_ptr = kernel_as_graph_kernel.get_cuda_graph_ptr();
KOKKOS_EXPECTS(graph_ptr != nullptr);
return *graph_ptr;
}
template <class KernelType>
auto& get_cuda_graph_node_from_kernel(KernelType const& kernel) {
using graph_node_kernel_t =
typename get_graph_node_kernel_type<KernelType>::type;
auto const& kernel_as_graph_kernel =
static_cast<graph_node_kernel_t const&>(kernel);
auto* graph_node_ptr = kernel_as_graph_kernel.get_cuda_graph_node_ptr();
KOKKOS_EXPECTS(graph_node_ptr != nullptr);
return *graph_node_ptr;
}
// </editor-fold> end get_cuda_graph_*() helper functions }}}1
//==============================================================================
} // end namespace Impl
} // end namespace Kokkos
#endif // defined(KOKKOS_ENABLE_CUDA)
#endif // KOKKOS_KOKKOS_CUDA_GRAPHNODEKERNEL_IMPL_HPP
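
These `GraphNodeKernelImpl` specializations are what the public Experimental Graph API dispatches to on the CUDA backend: each node created through the API owns one of these kernels and, on execution, records itself into the underlying `cudaGraph_t`. A hedged sketch of driving that public API is shown below; the header name, `create_graph`, `then_parallel_for`, and `submit` follow the Experimental interface added in this release, and the view label, sizes, and lambda bodies are illustrative.
````cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_Graph.hpp>

void run_graph() {
  Kokkos::View<double*> x("x", 1000);
  auto graph = Kokkos::Experimental::create_graph([&](auto root) {
    // Each then_parallel_for becomes one GraphNodeKernelImpl that inserts a
    // kernel node into the backing cudaGraph_t when the graph is built.
    auto fill = root.then_parallel_for(
        1000, KOKKOS_LAMBDA(const int i) { x(i) = 1.0 * i; });
    fill.then_parallel_for(
        1000, KOKKOS_LAMBDA(const int i) { x(i) *= 2.0; });
  });
  graph.submit();  // instantiates the CUDA graph on first use, then launches it
  Kokkos::fence();
}
````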

View File

@ -42,85 +42,62 @@
//@HEADER //@HEADER
*/ */
#include <type_traits> #ifndef KOKKOS_KOKKOS_CUDA_GRAPHNODE_IMPL_HPP
#define KOKKOS_KOKKOS_CUDA_GRAPHNODE_IMPL_HPP
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if !defined(KOKKOS_ROCM_INVOKE_H) #if defined(KOKKOS_ENABLE_CUDA) && defined(KOKKOS_CUDA_ENABLE_GRAPHS)
#define KOKKOS_ROCM_INVOKE_H
#include <Kokkos_Graph_fwd.hpp>
#include <impl/Kokkos_GraphImpl.hpp> // GraphAccess needs to be complete
#include <Kokkos_Cuda.hpp>
#include <cuda_runtime_api.h>
namespace Kokkos { namespace Kokkos {
namespace Impl { namespace Impl {
template <class Tag, class F, class... Ts, template <>
typename std::enable_if<(!std::is_void<Tag>()), int>::type = 0> struct GraphNodeBackendSpecificDetails<Kokkos::Cuda> {
KOKKOS_INLINE_FUNCTION void rocm_invoke(F&& f, Ts&&... xs) { cudaGraphNode_t node = nullptr;
f(Tag(), static_cast<Ts&&>(xs)...);
}
template <class Tag, class F, class... Ts, //----------------------------------------------------------------------------
typename std::enable_if<(std::is_void<Tag>()), int>::type = 0> // <editor-fold desc="Ctors, destructor, and assignment"> {{{2
KOKKOS_INLINE_FUNCTION void rocm_invoke(F&& f, Ts&&... xs) {
f(static_cast<Ts&&>(xs)...);
}
template <class F, class Tag = void> explicit GraphNodeBackendSpecificDetails() = default;
struct rocm_invoke_fn {
F* f;
rocm_invoke_fn(F& f_) : f(&f_) {}
template <class... Ts> explicit GraphNodeBackendSpecificDetails(
KOKKOS_INLINE_FUNCTION void operator()(Ts&&... xs) const { _graph_node_is_root_ctor_tag) noexcept {}
rocm_invoke<Tag>(*f, static_cast<Ts&&>(xs)...);
} // </editor-fold> end Ctors, destructor, and assignment }}}2
//----------------------------------------------------------------------------
}; };
template <class Tag, class F> template <class Kernel, class PredecessorRef>
KOKKOS_INLINE_FUNCTION rocm_invoke_fn<F, Tag> make_rocm_invoke_fn(F& f) { struct GraphNodeBackendDetailsBeforeTypeErasure<Kokkos::Cuda, Kernel,
return {f}; PredecessorRef> {
} protected:
//----------------------------------------------------------------------------
// <editor-fold desc="ctors, destructor, and assignment"> {{{2
template <class T> GraphNodeBackendDetailsBeforeTypeErasure(
KOKKOS_INLINE_FUNCTION T& rocm_unwrap(T& x) { Kokkos::Cuda const&, Kernel&, PredecessorRef const&,
return x; GraphNodeBackendSpecificDetails<Kokkos::Cuda>&) noexcept {}
}
template <class T> GraphNodeBackendDetailsBeforeTypeErasure(
KOKKOS_INLINE_FUNCTION T& rocm_unwrap(std::reference_wrapper<T> x) { Kokkos::Cuda const&, _graph_node_is_root_ctor_tag,
return x; GraphNodeBackendSpecificDetails<Kokkos::Cuda>&) noexcept {}
}
template <class F, class T> // </editor-fold> end ctors, destructor, and assignment }}}2
struct rocm_capture_fn { //----------------------------------------------------------------------------
F f;
T data;
KOKKOS_INLINE_FUNCTION rocm_capture_fn(F f_, T x) : f(f_), data(x) {}
template <class... Ts>
KOKKOS_INLINE_FUNCTION void operator()(Ts&&... xs) const {
f(rocm_unwrap(data), static_cast<Ts&&>(xs)...);
}
}; };
template <class F, class T> } // end namespace Impl
KOKKOS_INLINE_FUNCTION rocm_capture_fn<F, T> rocm_capture(F f, T x) { } // end namespace Kokkos
return {f, x};
}
template <class F, class T, class U, class... Ts> #include <Cuda/Kokkos_Cuda_GraphNodeKernel.hpp>
KOKKOS_INLINE_FUNCTION auto rocm_capture(F f, T x, U y, Ts... xs)
-> decltype(rocm_capture(rocm_capture(f, x), y, xs...)) {
return rocm_capture(rocm_capture(f, x), y, xs...);
}
struct rocm_apply_op { #endif // defined(KOKKOS_ENABLE_CUDA)
template <class F, class... Ts> #endif // KOKKOS_KOKKOS_CUDA_GRAPHNODE_IMPL_HPP
KOKKOS_INLINE_FUNCTION void operator()(F&& f, Ts&&... xs) const {
f(static_cast<Ts&&>(xs)...);
}
};
} // namespace Impl
} // namespace Kokkos
#endif

View File

@ -0,0 +1,219 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 3.0
// Copyright (2020) National Technology & Engineering
// Solutions of Sandia, LLC (NTESS).
//
// Under the terms of Contract DE-NA0003525 with NTESS,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_KOKKOS_CUDA_GRAPH_IMPL_HPP
#define KOKKOS_KOKKOS_CUDA_GRAPH_IMPL_HPP
#include <Kokkos_Macros.hpp>
#if defined(KOKKOS_ENABLE_CUDA) && defined(KOKKOS_CUDA_ENABLE_GRAPHS)
#include <Kokkos_Graph_fwd.hpp>
#include <impl/Kokkos_GraphImpl.hpp> // GraphAccess needs to be complete
// GraphNodeImpl needs to be complete because GraphImpl here is a full
// specialization and not just a partial one
#include <impl/Kokkos_GraphNodeImpl.hpp>
#include <Cuda/Kokkos_Cuda_GraphNode_Impl.hpp>
#include <Kokkos_Cuda.hpp>
#include <cuda_runtime_api.h>
namespace Kokkos {
namespace Impl {
template <>
struct GraphImpl<Kokkos::Cuda> {
public:
using execution_space = Kokkos::Cuda;
private:
execution_space m_execution_space;
cudaGraph_t m_graph = nullptr;
cudaGraphExec_t m_graph_exec = nullptr;
using cuda_graph_flags_t = unsigned int;
using node_details_t = GraphNodeBackendSpecificDetails<Kokkos::Cuda>;
void _instantiate_graph() {
constexpr size_t error_log_size = 256;
cudaGraphNode_t error_node = nullptr;
char error_log[error_log_size];
CUDA_SAFE_CALL(cudaGraphInstantiate(&m_graph_exec, m_graph, &error_node,
error_log, error_log_size));
// TODO @graphs print out errors
}
public:
using root_node_impl_t =
GraphNodeImpl<Kokkos::Cuda, Kokkos::Experimental::TypeErasedTag,
Kokkos::Experimental::TypeErasedTag>;
using aggregate_kernel_impl_t = CudaGraphNodeAggregateKernel;
using aggregate_node_impl_t =
GraphNodeImpl<Kokkos::Cuda, aggregate_kernel_impl_t,
Kokkos::Experimental::TypeErasedTag>;
// Not moveable or copyable; it spends its whole life as a shared_ptr in the
// Graph object
GraphImpl() = delete;
GraphImpl(GraphImpl const&) = delete;
GraphImpl(GraphImpl&&) = delete;
GraphImpl& operator=(GraphImpl const&) = delete;
GraphImpl& operator=(GraphImpl&&) = delete;
~GraphImpl() {
// TODO @graphs we need to somehow indicate the need for a fence in the
// destructor of the GraphImpl object (so that we don't have to
// just always do it)
m_execution_space.fence();
KOKKOS_EXPECTS(bool(m_graph))
if (bool(m_graph_exec)) {
CUDA_SAFE_CALL(cudaGraphExecDestroy(m_graph_exec));
}
CUDA_SAFE_CALL(cudaGraphDestroy(m_graph));
};
explicit GraphImpl(Kokkos::Cuda arg_instance)
: m_execution_space(std::move(arg_instance)) {
CUDA_SAFE_CALL(cudaGraphCreate(&m_graph, cuda_graph_flags_t{0}));
}
void add_node(std::shared_ptr<aggregate_node_impl_t> const& arg_node_ptr) {
// All of the predecessors are just added as normal, so all we need to
// do here is add an empty node
CUDA_SAFE_CALL(cudaGraphAddEmptyNode(&(arg_node_ptr->node_details_t::node),
m_graph,
/* dependencies = */ nullptr,
/* numDependencies = */ 0));
}
template <class NodeImpl>
// requires NodeImplPtr is a shared_ptr to specialization of GraphNodeImpl
// Also requires that the kernel has the graph node tag in its policy
void add_node(std::shared_ptr<NodeImpl> const& arg_node_ptr) {
static_assert(
NodeImpl::kernel_type::Policy::is_graph_kernel::value,
"Something has gone horribly wrong, but it's too complicated to "
"explain here. Buy Daisy a coffee and she'll explain it to you.");
KOKKOS_EXPECTS(bool(arg_node_ptr));
// The Kernel launch from the execute() method has been shimmed to insert
// the node into the graph
auto& kernel = arg_node_ptr->get_kernel();
// note: using arg_node_ptr->node_details_t::node caused an ICE in NVCC 10.1
auto& cuda_node = static_cast<node_details_t*>(arg_node_ptr.get())->node;
KOKKOS_EXPECTS(!bool(cuda_node));
kernel.set_cuda_graph_ptr(&m_graph);
kernel.set_cuda_graph_node_ptr(&cuda_node);
kernel.execute();
KOKKOS_ENSURES(bool(cuda_node));
}
template <class NodeImplPtr, class PredecessorRef>
// requires PredecessorRef is a specialization of GraphNodeRef that has
// already been added to this graph and NodeImpl is a specialization of
// GraphNodeImpl that has already been added to this graph.
void add_predecessor(NodeImplPtr arg_node_ptr, PredecessorRef arg_pred_ref) {
KOKKOS_EXPECTS(bool(arg_node_ptr))
auto pred_ptr = GraphAccess::get_node_ptr(arg_pred_ref);
KOKKOS_EXPECTS(bool(pred_ptr))
// clang-format off
// NOTE const-qualifiers below are commented out because of an API break
// from CUDA 10.0 to CUDA 10.1
// cudaGraphAddDependencies(cudaGraph_t, cudaGraphNode_t*, cudaGraphNode_t*, size_t)
// cudaGraphAddDependencies(cudaGraph_t, const cudaGraphNode_t*, const cudaGraphNode_t*, size_t)
// clang-format on
auto /*const*/& pred_cuda_node = pred_ptr->node_details_t::node;
KOKKOS_EXPECTS(bool(pred_cuda_node))
auto /*const*/& cuda_node = arg_node_ptr->node_details_t::node;
KOKKOS_EXPECTS(bool(cuda_node))
CUDA_SAFE_CALL(
cudaGraphAddDependencies(m_graph, &pred_cuda_node, &cuda_node, 1));
}
void submit() {
if (!bool(m_graph_exec)) {
_instantiate_graph();
}
CUDA_SAFE_CALL(
cudaGraphLaunch(m_graph_exec, m_execution_space.cuda_stream()));
}
execution_space const& get_execution_space() const noexcept {
return m_execution_space;
}
auto create_root_node_ptr() {
KOKKOS_EXPECTS(bool(m_graph))
KOKKOS_EXPECTS(!bool(m_graph_exec))
auto rv = std::make_shared<root_node_impl_t>(
get_execution_space(), _graph_node_is_root_ctor_tag{});
CUDA_SAFE_CALL(cudaGraphAddEmptyNode(&(rv->node_details_t::node), m_graph,
/* dependencies = */ nullptr,
/* numDependencies = */ 0));
KOKKOS_ENSURES(bool(rv->node_details_t::node))
return rv;
}
template <class... PredecessorRefs>
// See requirements/expectations in GraphBuilder
auto create_aggregate_ptr(PredecessorRefs&&...) {
// The attachment to predecessors, which is all we really need, happens
// in the generic layer, which calls through to add_predecessor for
// each predecessor ref, so all we need to do here is create the (trivial)
// aggregate node.
return std::make_shared<aggregate_node_impl_t>(
m_execution_space, _graph_node_kernel_ctor_tag{},
aggregate_kernel_impl_t{});
}
};
} // end namespace Impl
} // end namespace Kokkos
#endif // defined(KOKKOS_ENABLE_CUDA)
#endif // KOKKOS_KOKKOS_CUDA_GRAPH_IMPL_HPP
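
For readers unfamiliar with the CUDA calls that `GraphImpl<Kokkos::Cuda>` wraps, the condensed sketch below walks the same sequence directly: create a graph, add two empty nodes with a dependency, instantiate, launch on a stream, and destroy. Error checking is omitted for brevity; the real code routes every call through `CUDA_SAFE_CALL`.
````cpp
#include <cuda_runtime_api.h>

void raw_cuda_graph_demo(cudaStream_t stream) {
  cudaGraph_t graph = nullptr;
  cudaGraphCreate(&graph, 0);

  cudaGraphNode_t root = nullptr, child = nullptr;
  cudaGraphAddEmptyNode(&root, graph, /*dependencies=*/nullptr,
                        /*numDependencies=*/0);
  cudaGraphAddEmptyNode(&child, graph, nullptr, 0);
  cudaGraphAddDependencies(graph, &root, &child, 1);  // child runs after root

  cudaGraphExec_t exec = nullptr;
  cudaGraphNode_t error_node = nullptr;
  char error_log[256];
  cudaGraphInstantiate(&exec, graph, &error_node, error_log, sizeof error_log);

  cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
}
````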

View File

@ -0,0 +1,710 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 3.0
// Copyright (2020) National Technology & Engineering
// Solutions of Sandia, LLC (NTESS).
//
// Under the terms of Contract DE-NA0003525 with NTESS,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY NTESS "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NTESS OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/
#ifndef KOKKOS_CUDA_HALF_HPP_
#define KOKKOS_CUDA_HALF_HPP_
#include <Kokkos_Macros.hpp>
#ifdef KOKKOS_ENABLE_CUDA
#if !(defined(KOKKOS_COMPILER_CLANG) && KOKKOS_COMPILER_CLANG < 900) && \
!(defined(KOKKOS_ARCH_KEPLER) || defined(KOKKOS_ARCH_MAXWELL50) || \
defined(KOKKOS_ARCH_MAXWELL52))
#include <cuda_fp16.h>
#ifndef KOKKOS_IMPL_HALF_TYPE_DEFINED
// Make sure no one else tries to define half_t
#define KOKKOS_IMPL_HALF_TYPE_DEFINED
namespace Kokkos {
namespace Impl {
struct half_impl_t {
using type = __half;
};
} // namespace Impl
namespace Experimental {
// Forward declarations
class half_t;
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(float val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(bool val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(double val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(short val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(int val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(long val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(long long val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned short val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned int val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned long val);
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned long long val);
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, float>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, bool>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, double>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, short>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, int>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, long>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, long long>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned short>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, unsigned int>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned long>::value, T>
cast_from_half(half_t);
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned long long>::value, T>
cast_from_half(half_t);
class half_t {
public:
using impl_type = Kokkos::Impl::half_impl_t::type;
private:
impl_type val;
public:
KOKKOS_FUNCTION
half_t() : val(0.0F) {}
// Don't support implicit conversion back to impl_type.
// impl_type is a storage only type on host.
KOKKOS_FUNCTION
explicit operator impl_type() const { return val; }
KOKKOS_FUNCTION
explicit operator float() const { return cast_from_half<float>(*this); }
KOKKOS_FUNCTION
explicit operator bool() const { return cast_from_half<bool>(*this); }
KOKKOS_FUNCTION
explicit operator double() const { return cast_from_half<double>(*this); }
KOKKOS_FUNCTION
explicit operator short() const { return cast_from_half<short>(*this); }
KOKKOS_FUNCTION
explicit operator int() const { return cast_from_half<int>(*this); }
KOKKOS_FUNCTION
explicit operator long() const { return cast_from_half<long>(*this); }
KOKKOS_FUNCTION
explicit operator long long() const {
return cast_from_half<long long>(*this);
}
KOKKOS_FUNCTION
explicit operator unsigned short() const {
return cast_from_half<unsigned short>(*this);
}
KOKKOS_FUNCTION
explicit operator unsigned int() const {
return cast_from_half<unsigned int>(*this);
}
KOKKOS_FUNCTION
explicit operator unsigned long() const {
return cast_from_half<unsigned long>(*this);
}
KOKKOS_FUNCTION
explicit operator unsigned long long() const {
return cast_from_half<unsigned long long>(*this);
}
KOKKOS_FUNCTION
half_t(impl_type rhs) : val(rhs) {}
KOKKOS_FUNCTION
explicit half_t(float rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(bool rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(double rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(short rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(int rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(long rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(long long rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(unsigned short rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(unsigned int rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(unsigned long rhs) : val(cast_to_half(rhs).val) {}
KOKKOS_FUNCTION
explicit half_t(unsigned long long rhs) : val(cast_to_half(rhs).val) {}
// Unary operators
KOKKOS_FUNCTION
half_t operator+() const {
half_t tmp = *this;
#ifdef __CUDA_ARCH__
tmp.val = +tmp.val;
#else
tmp.val = __float2half(+__half2float(tmp.val));
#endif
return tmp;
}
KOKKOS_FUNCTION
half_t operator-() const {
half_t tmp = *this;
#ifdef __CUDA_ARCH__
tmp.val = -tmp.val;
#else
tmp.val = __float2half(-__half2float(tmp.val));
#endif
return tmp;
}
// Prefix operators
KOKKOS_FUNCTION
half_t& operator++() {
#ifdef __CUDA_ARCH__
++val;
#else
float tmp = __half2float(val);
++tmp;
val = __float2half(tmp);
#endif
return *this;
}
KOKKOS_FUNCTION
half_t& operator--() {
#ifdef __CUDA_ARCH__
--val;
#else
float tmp = __half2float(val);
--tmp;
val = __float2half(tmp);
#endif
return *this;
}
// Postfix operators
KOKKOS_FUNCTION
half_t operator++(int) {
half_t tmp = *this;
operator++();
return tmp;
}
KOKKOS_FUNCTION
half_t operator--(int) {
half_t tmp = *this;
operator--();
return tmp;
}
// Binary operators
KOKKOS_FUNCTION
half_t& operator=(impl_type rhs) {
val = rhs;
return *this;
}
template <class T>
KOKKOS_FUNCTION half_t& operator=(T rhs) {
val = cast_to_half(rhs).val;
return *this;
}
// Compound operators
KOKKOS_FUNCTION
half_t& operator+=(half_t rhs) {
#ifdef __CUDA_ARCH__
val += rhs.val;
#else
val = __float2half(__half2float(val) + __half2float(rhs.val));
#endif
return *this;
}
KOKKOS_FUNCTION
half_t& operator-=(half_t rhs) {
#ifdef __CUDA_ARCH__
val -= rhs.val;
#else
val = __float2half(__half2float(val) - __half2float(rhs.val));
#endif
return *this;
}
KOKKOS_FUNCTION
half_t& operator*=(half_t rhs) {
#ifdef __CUDA_ARCH__
val *= rhs.val;
#else
val = __float2half(__half2float(val) * __half2float(rhs.val));
#endif
return *this;
}
KOKKOS_FUNCTION
half_t& operator/=(half_t rhs) {
#ifdef __CUDA_ARCH__
val /= rhs.val;
#else
val = __float2half(__half2float(val) / __half2float(rhs.val));
#endif
return *this;
}
// Binary Arithmetic
KOKKOS_FUNCTION
half_t friend operator+(half_t lhs, half_t rhs) {
#ifdef __CUDA_ARCH__
lhs.val += rhs.val;
#else
lhs.val = __float2half(__half2float(lhs.val) + __half2float(rhs.val));
#endif
return lhs;
}
KOKKOS_FUNCTION
half_t friend operator-(half_t lhs, half_t rhs) {
#ifdef __CUDA_ARCH__
lhs.val -= rhs.val;
#else
lhs.val = __float2half(__half2float(lhs.val) - __half2float(rhs.val));
#endif
return lhs;
}
KOKKOS_FUNCTION
half_t friend operator*(half_t lhs, half_t rhs) {
#ifdef __CUDA_ARCH__
lhs.val *= rhs.val;
#else
lhs.val = __float2half(__half2float(lhs.val) * __half2float(rhs.val));
#endif
return lhs;
}
KOKKOS_FUNCTION
half_t friend operator/(half_t lhs, half_t rhs) {
#ifdef __CUDA_ARCH__
lhs.val /= rhs.val;
#else
lhs.val = __float2half(__half2float(lhs.val) / __half2float(rhs.val));
#endif
return lhs;
}
// Logical operators
KOKKOS_FUNCTION
bool operator!() const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(!val);
#else
return !__half2float(val);
#endif
}
// NOTE: Loses short-circuit evaluation
KOKKOS_FUNCTION
bool operator&&(half_t rhs) const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(val && rhs.val);
#else
return __half2float(val) && __half2float(rhs.val);
#endif
}
// NOTE: Loses short-circuit evaluation
KOKKOS_FUNCTION
bool operator||(half_t rhs) const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(val || rhs.val);
#else
return __half2float(val) || __half2float(rhs.val);
#endif
}
// Comparison operators
KOKKOS_FUNCTION
bool operator==(half_t rhs) const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(val == rhs.val);
#else
return __half2float(val) == __half2float(rhs.val);
#endif
}
KOKKOS_FUNCTION
bool operator!=(half_t rhs) const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(val != rhs.val);
#else
return __half2float(val) != __half2float(rhs.val);
#endif
}
KOKKOS_FUNCTION
bool operator<(half_t rhs) const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(val < rhs.val);
#else
return __half2float(val) < __half2float(rhs.val);
#endif
}
KOKKOS_FUNCTION
bool operator>(half_t rhs) const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(val > rhs.val);
#else
return __half2float(val) > __half2float(rhs.val);
#endif
}
KOKKOS_FUNCTION
bool operator<=(half_t rhs) const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(val <= rhs.val);
#else
return __half2float(val) <= __half2float(rhs.val);
#endif
}
KOKKOS_FUNCTION
bool operator>=(half_t rhs) const {
#ifdef __CUDA_ARCH__
return static_cast<bool>(val >= rhs.val);
#else
return __half2float(val) >= __half2float(rhs.val);
#endif
}
};
// CUDA before 11.1 only has the half <-> float conversions marked host device
// So we will largely convert to float on the host for conversion
// But still call the correct functions on the device
#if (CUDA_VERSION < 11100)
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(half_t val) { return val; }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(float val) { return half_t(__float2half(val)); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(bool val) { return cast_to_half(static_cast<float>(val)); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(double val) {
// double2half was only introduced in CUDA 11 too
return half_t(__float2half(static_cast<float>(val)));
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(short val) {
#ifdef __CUDA_ARCH__
return half_t(__short2half_rn(val));
#else
return half_t(__float2half(static_cast<float>(val)));
#endif
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned short val) {
#ifdef __CUDA_ARCH__
return half_t(__ushort2half_rn(val));
#else
return half_t(__float2half(static_cast<float>(val)));
#endif
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(int val) {
#ifdef __CUDA_ARCH__
return half_t(__int2half_rn(val));
#else
return half_t(__float2half(static_cast<float>(val)));
#endif
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned int val) {
#ifdef __CUDA_ARCH__
return half_t(__uint2half_rn(val));
#else
return half_t(__float2half(static_cast<float>(val)));
#endif
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(long long val) {
#ifdef __CUDA_ARCH__
return half_t(__ll2half_rn(val));
#else
return half_t(__float2half(static_cast<float>(val)));
#endif
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned long long val) {
#ifdef __CUDA_ARCH__
return half_t(__ull2half_rn(val));
#else
return half_t(__float2half(static_cast<float>(val)));
#endif
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(long val) {
return cast_to_half(static_cast<long long>(val));
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned long val) {
return cast_to_half(static_cast<unsigned long long>(val));
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, float>::value, T>
cast_from_half(half_t val) {
return __half2float(half_t::impl_type(val));
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, bool>::value, T>
cast_from_half(half_t val) {
return static_cast<T>(cast_from_half<float>(val));
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, double>::value, T>
cast_from_half(half_t val) {
return static_cast<T>(__half2float(half_t::impl_type(val)));
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, short>::value, T>
cast_from_half(half_t val) {
#ifdef __CUDA_ARCH__
return __half2short_rz(half_t::impl_type(val));
#else
return static_cast<T>(__half2float(half_t::impl_type(val)));
#endif
}
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned short>::value, T>
cast_from_half(half_t val) {
#ifdef __CUDA_ARCH__
return __half2ushort_rz(half_t::impl_type(val));
#else
return static_cast<T>(__half2float(half_t::impl_type(val)));
#endif
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, int>::value, T>
cast_from_half(half_t val) {
#ifdef __CUDA_ARCH__
return __half2int_rz(half_t::impl_type(val));
#else
return static_cast<T>(__half2float(half_t::impl_type(val)));
#endif
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, unsigned>::value, T>
cast_from_half(half_t val) {
#ifdef __CUDA_ARCH__
return __half2uint_rz(half_t::impl_type(val));
#else
return static_cast<T>(__half2float(half_t::impl_type(val)));
#endif
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, long long>::value, T>
cast_from_half(half_t val) {
#ifdef __CUDA_ARCH__
return __half2ll_rz(half_t::impl_type(val));
#else
return static_cast<T>(__half2float(half_t::impl_type(val)));
#endif
}
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned long long>::value, T>
cast_from_half(half_t val) {
#ifdef __CUDA_ARCH__
return __half2ull_rz(half_t::impl_type(val));
#else
return static_cast<T>(__half2float(half_t::impl_type(val)));
#endif
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, long>::value, T>
cast_from_half(half_t val) {
return static_cast<T>(cast_from_half<long long>(val));
}
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned long>::value, T>
cast_from_half(half_t val) {
return static_cast<T>(cast_from_half<unsigned long long>(val));
}
#else  // CUDA 11.1 and newer versions follow
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(float val) { return __float2half(val); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(double val) { return __double2half(val); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(short val) { return __short2half_rn(val); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned short val) { return __ushort2half_rn(val); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(int val) { return __int2half_rn(val); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned int val) { return __uint2half_rn(val); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(long long val) { return __ll2half_rn(val); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned long long val) { return __ull2half_rn(val); }
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(long val) {
return cast_to_half(static_cast<long long>(val));
}
KOKKOS_INLINE_FUNCTION
half_t cast_to_half(unsigned long val) {
return cast_to_half(static_cast<unsigned long long>(val));
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, float>::value, T>
cast_from_half(half_t val) {
return __half2float(val);
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, double>::value, T>
cast_from_half(half_t val) {
return __half2double(val);
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, short>::value, T>
cast_from_half(half_t val) {
return __half2short_rz(val);
}
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned short>::value, T>
cast_from_half(half_t val) {
return __half2ushort_rz(val);
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, int>::value, T>
cast_from_half(half_t val) {
return __half2int_rz(val);
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, unsigned int>::value, T>
cast_from_half(half_t val) {
return __half2uint_rz(val);
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, long long>::value, T>
cast_from_half(half_t val) {
return __half2ll_rz(val);
}
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned long long>::value, T>
cast_from_half(half_t val) {
return __half2ull_rz(val);
}
template <class T>
KOKKOS_INLINE_FUNCTION std::enable_if_t<std::is_same<T, long>::value, T>
cast_from_half(half_t val) {
return static_cast<T>(cast_from_half<long long>(val));
}
template <class T>
KOKKOS_INLINE_FUNCTION
std::enable_if_t<std::is_same<T, unsigned long>::value, T>
cast_from_half(half_t val) {
return static_cast<T>(cast_from_half<unsigned long long>(val));
}
#endif
} // namespace Experimental
} // namespace Kokkos
#endif // KOKKOS_IMPL_HALF_TYPE_DEFINED
#endif // KOKKOS_ENABLE_CUDA
#endif // Disables for half_t on cuda:
// Clang/8||KEPLER30||KEPLER32||KEPLER37||MAXWELL50||MAXWELL52
#endif
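
A short usage sketch for the `half_t` type defined above follows: values are constructed through the explicit `cast_to_half` overloads, combined with the overloaded arithmetic operators, and read back with `cast_from_half`. It assumes half support is compiled in for the target architecture (the guards above disable it on Kepler, Maxwell 5.0/5.2, and old Clang); the view label, sizes, and kernel names are illustrative.
````cpp
#include <Kokkos_Core.hpp>

void half_demo() {
  using Kokkos::Experimental::half_t;
  using Kokkos::Experimental::cast_to_half;
  using Kokkos::Experimental::cast_from_half;

  Kokkos::View<half_t*> a("a", 100);
  Kokkos::parallel_for(
      "fill_half", 100, KOKKOS_LAMBDA(const int i) {
        // All conversions to and from half_t are explicit by design.
        a(i) = cast_to_half(0.5f) + cast_to_half(static_cast<float>(i));
      });
  Kokkos::fence();

  double sum = 0.0;
  Kokkos::parallel_reduce(
      "sum_half", 100,
      KOKKOS_LAMBDA(const int i, double& lsum) {
        lsum += cast_from_half<double>(a(i));
      },
      sum);
}
````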

View File

@ -132,7 +132,7 @@ int cuda_kernel_arch() {
bool cuda_launch_blocking() { bool cuda_launch_blocking() {
const char *env = getenv("CUDA_LAUNCH_BLOCKING"); const char *env = getenv("CUDA_LAUNCH_BLOCKING");
if (env == 0) return false; if (env == nullptr) return false;
return std::stoi(env); return std::stoi(env);
} }
@ -509,14 +509,14 @@ void CudaInternal::initialize(int cuda_device_id, cudaStream_t stream) {
const char *env_force_device_alloc = const char *env_force_device_alloc =
getenv("CUDA_MANAGED_FORCE_DEVICE_ALLOC"); getenv("CUDA_MANAGED_FORCE_DEVICE_ALLOC");
bool force_device_alloc; bool force_device_alloc;
if (env_force_device_alloc == 0) if (env_force_device_alloc == nullptr)
force_device_alloc = false; force_device_alloc = false;
else else
force_device_alloc = std::stoi(env_force_device_alloc) != 0; force_device_alloc = std::stoi(env_force_device_alloc) != 0;
const char *env_visible_devices = getenv("CUDA_VISIBLE_DEVICES"); const char *env_visible_devices = getenv("CUDA_VISIBLE_DEVICES");
bool visible_devices_one = true; bool visible_devices_one = true;
if (env_visible_devices == 0) visible_devices_one = false; if (env_visible_devices == nullptr) visible_devices_one = false;
if (Kokkos::show_warnings() && if (Kokkos::show_warnings() &&
(!visible_devices_one && !force_device_alloc)) { (!visible_devices_one && !force_device_alloc)) {
@ -893,6 +893,92 @@ const cudaDeviceProp &Cuda::cuda_device_prop() const {
return m_space_instance->m_deviceProp; return m_space_instance->m_deviceProp;
} }
namespace Impl {
int get_gpu(const InitArguments &args);
int g_cuda_space_factory_initialized =
initialize_space_factory<CudaSpaceInitializer>("150_Cuda");
void CudaSpaceInitializer::initialize(const InitArguments &args) {
int use_gpu = get_gpu(args);
if (std::is_same<Kokkos::Cuda, Kokkos::DefaultExecutionSpace>::value ||
0 < use_gpu) {
if (use_gpu > -1) {
Kokkos::Cuda::impl_initialize(Kokkos::Cuda::SelectDevice(use_gpu));
} else {
Kokkos::Cuda::impl_initialize();
}
}
}
void CudaSpaceInitializer::finalize(bool all_spaces) {
if ((std::is_same<Kokkos::Cuda, Kokkos::DefaultExecutionSpace>::value ||
all_spaces) &&
Kokkos::Cuda::impl_is_initialized()) {
Kokkos::Cuda::impl_finalize();
}
}
void CudaSpaceInitializer::fence() { Kokkos::Cuda::impl_static_fence(); }
void CudaSpaceInitializer::print_configuration(std::ostream &msg,
const bool detail) {
msg << "Device Execution Space:" << std::endl;
msg << " KOKKOS_ENABLE_CUDA: ";
msg << "yes" << std::endl;
msg << "Cuda Atomics:" << std::endl;
msg << " KOKKOS_ENABLE_CUDA_ATOMICS: ";
#ifdef KOKKOS_ENABLE_CUDA_ATOMICS
msg << "yes" << std::endl;
#else
msg << "no" << std::endl;
#endif
msg << "Cuda Options:" << std::endl;
msg << " KOKKOS_ENABLE_CUDA_LAMBDA: ";
#ifdef KOKKOS_ENABLE_CUDA_LAMBDA
msg << "yes" << std::endl;
#else
msg << "no" << std::endl;
#endif
msg << " KOKKOS_ENABLE_CUDA_LDG_INTRINSIC: ";
#ifdef KOKKOS_ENABLE_CUDA_LDG_INTRINSIC
msg << "yes" << std::endl;
#else
msg << "no" << std::endl;
#endif
msg << " KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: ";
#ifdef KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE
msg << "yes" << std::endl;
#else
msg << "no" << std::endl;
#endif
msg << " KOKKOS_ENABLE_CUDA_UVM: ";
#ifdef KOKKOS_ENABLE_CUDA_UVM
msg << "yes" << std::endl;
#else
msg << "no" << std::endl;
#endif
msg << " KOKKOS_ENABLE_CUSPARSE: ";
#ifdef KOKKOS_ENABLE_CUSPARSE
msg << "yes" << std::endl;
#else
msg << "no" << std::endl;
#endif
msg << " KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA: ";
#ifdef KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA
msg << "yes" << std::endl;
#else
msg << "no" << std::endl;
#endif
msg << "\nCuda Runtime Configuration:" << std::endl;
Cuda::print_configuration(msg, detail);
}
} // namespace Impl
} // namespace Kokkos } // namespace Kokkos
namespace Kokkos { namespace Kokkos {

View File

@ -34,7 +34,9 @@ struct CudaTraits {
enum : CudaSpace::size_type { enum : CudaSpace::size_type {
KernelArgumentLimit = 0x001000 /* 4k bytes */ KernelArgumentLimit = 0x001000 /* 4k bytes */
}; };
enum : CudaSpace::size_type {
MaxHierarchicalParallelism = 1024 /* team_size * vector_length */
};
using ConstantGlobalBufferType = using ConstantGlobalBufferType =
unsigned long[ConstantMemoryUsage / sizeof(unsigned long)]; unsigned long[ConstantMemoryUsage / sizeof(unsigned long)];

View File

@ -48,20 +48,23 @@
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#ifdef KOKKOS_ENABLE_CUDA #ifdef KOKKOS_ENABLE_CUDA
#include <mutex>
#include <string> #include <string>
#include <cstdint> #include <cstdint>
#include <cmath>
#include <Kokkos_Parallel.hpp> #include <Kokkos_Parallel.hpp>
#include <impl/Kokkos_Error.hpp> #include <impl/Kokkos_Error.hpp>
#include <Cuda/Kokkos_Cuda_abort.hpp> #include <Cuda/Kokkos_Cuda_abort.hpp>
#include <Cuda/Kokkos_Cuda_Error.hpp> #include <Cuda/Kokkos_Cuda_Error.hpp>
#include <Cuda/Kokkos_Cuda_Locks.hpp> #include <Cuda/Kokkos_Cuda_Locks.hpp>
#include <Cuda/Kokkos_Cuda_Instance.hpp> #include <Cuda/Kokkos_Cuda_Instance.hpp>
#include <impl/Kokkos_GraphImpl_fwd.hpp>
#include <Cuda/Kokkos_Cuda_GraphNodeKernel.hpp>
#include <Cuda/Kokkos_Cuda_BlockSize_Deduction.hpp>
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
#if defined(__CUDACC__)
/** \brief Access to constant memory on the device */ /** \brief Access to constant memory on the device */
#ifdef KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE #ifdef KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE
@ -140,28 +143,84 @@ __global__ __launch_bounds__(
driver->operator()(); driver->operator()();
} }
template <class DriverType> //==============================================================================
__global__ static void cuda_parallel_launch_constant_or_global_memory( // <editor-fold desc="Some helper functions for launch code readability"> {{{1
const DriverType* driver_ptr) {
const DriverType& driver =
driver_ptr != nullptr
? *driver_ptr
: *((const DriverType*)kokkos_impl_cuda_constant_memory_buffer);
driver(); inline bool is_empty_launch(dim3 const& grid, dim3 const& block) {
return (grid.x == 0) || ((block.x * block.y * block.z) == 0);
} }
template <class DriverType, unsigned int maxTperB, unsigned int minBperSM> inline void check_shmem_request(CudaInternal const* cuda_instance, int shmem) {
__global__ if (cuda_instance->m_maxShmemPerBlock < shmem) {
__launch_bounds__(maxTperB, minBperSM) static void cuda_parallel_launch_constant_or_global_memory( Kokkos::Impl::throw_runtime_exception(
const DriverType* driver_ptr) { std::string("CudaParallelLaunch (or graph node creation) FAILED: shared"
const DriverType& driver = " memory request is too large"));
driver_ptr != nullptr
? *driver_ptr
: *((const DriverType*)kokkos_impl_cuda_constant_memory_buffer);
driver();
} }
}
template <class KernelFuncPtr>
inline void configure_shmem_preference(KernelFuncPtr const& func,
bool prefer_shmem) {
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
auto set_cache_config = [&] {
CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
func,
(prefer_shmem ? cudaFuncCachePreferShared : cudaFuncCachePreferL1)));
return prefer_shmem;
};
static bool cache_config_preference_cached = set_cache_config();
if (cache_config_preference_cached != prefer_shmem) {
cache_config_preference_cached = set_cache_config();
}
#else
// Use the parameters so we don't get a warning
(void)func;
(void)prefer_shmem;
#endif
}
template <class Policy>
std::enable_if_t<Policy::experimental_contains_desired_occupancy>
modify_launch_configuration_if_desired_occupancy_is_specified(
Policy const& policy, cudaDeviceProp const& properties,
cudaFuncAttributes const& attributes, dim3 const& block, int& shmem,
bool& prefer_shmem) {
int const block_size = block.x * block.y * block.z;
int const desired_occupancy = policy.impl_get_desired_occupancy().value();
size_t const shmem_per_sm_prefer_l1 = get_shmem_per_sm_prefer_l1(properties);
size_t const static_shmem = attributes.sharedSizeBytes;
// round to nearest integer and avoid division by zero
int active_blocks = std::max(
1, static_cast<int>(std::round(
static_cast<double>(properties.maxThreadsPerMultiProcessor) /
block_size * desired_occupancy / 100)));
int const dynamic_shmem =
shmem_per_sm_prefer_l1 / active_blocks - static_shmem;
if (dynamic_shmem > shmem) {
shmem = dynamic_shmem;
prefer_shmem = false;
}
}
template <class Policy>
std::enable_if_t<!Policy::experimental_contains_desired_occupancy>
modify_launch_configuration_if_desired_occupancy_is_specified(
Policy const&, cudaDeviceProp const&, cudaFuncAttributes const&,
dim3 const& /*block*/, int& /*shmem*/, bool& /*prefer_shmem*/) {}
// </editor-fold> end Some helper functions for launch code readability }}}1
//==============================================================================
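
The occupancy helper above trades shared-memory carveout for residency: it computes how many blocks per SM correspond to the requested occupancy and then asks for enough dynamic shared memory that no more than that many blocks fit. A standalone numeric sketch of that arithmetic, using made-up device numbers in place of the `cudaDeviceProp` / `cudaFuncAttributes` values:
````c++
#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
  // Stand-ins for properties.maxThreadsPerMultiProcessor, the block size,
  // the policy's desired occupancy, the prefer-L1 shared memory per SM, and
  // the kernel's static shared memory -- all illustrative values.
  int const max_threads_per_sm = 2048;
  int const block_size         = 256;
  int const desired_occupancy  = 50;  // percent
  int const shmem_per_sm       = 96 * 1024;
  int const static_shmem       = 1024;
  int shmem                    = 4096;  // dynamic shared memory requested so far

  int const active_blocks = std::max(
      1, static_cast<int>(std::round(
             static_cast<double>(max_threads_per_sm) / block_size *
             desired_occupancy / 100)));        // 2048/256 * 0.5 = 4 blocks/SM
  int const dynamic_shmem = shmem_per_sm / active_blocks - static_shmem;  // 23552

  if (dynamic_shmem > shmem) shmem = dynamic_shmem;  // caps residency at 4 blocks
  std::printf("active_blocks = %d, shmem = %d bytes\n", active_blocks, shmem);
  return 0;
}
````
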
//==============================================================================
// <editor-fold desc="DeduceCudaLaunchMechanism"> {{{2
// Use local memory up to ConstantMemoryUseThreshold
// Use global memory above ConstantMemoryUsage
// In between use ConstantMemory
template <class DriverType> template <class DriverType>
struct DeduceCudaLaunchMechanism { struct DeduceCudaLaunchMechanism {
@ -217,428 +276,430 @@ struct DeduceCudaLaunchMechanism {
: Experimental::CudaLaunchMechanism::GlobalMemory) : Experimental::CudaLaunchMechanism::GlobalMemory)
: (default_launch_mechanism)); : (default_launch_mechanism));
}; };
// Use local memory up to ConstantMemoryUseThreshold
// Use global memory above ConstantMemoryUsage // </editor-fold> end DeduceCudaLaunchMechanism }}}2
// In between use ConstantMemory //==============================================================================
//==============================================================================
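
As the comments above describe, `DeduceCudaLaunchMechanism` picks the launch path from the functor size: local memory for small functors, constant memory up to the constant-buffer capacity, and global memory beyond that. A simplified, self-contained sketch of that size-based selection; the 512-byte small-functor threshold is a hypothetical stand-in for `ConstantMemoryUseThreshold`, and the real trait also weighs launch bounds and functor properties:
````c++
#include <cstddef>

enum class LaunchVia { LocalMemory, ConstantMemory, GlobalMemory };

// Hypothetical stand-ins for CudaTraits::ConstantMemoryUseThreshold and
// CudaTraits::ConstantMemoryUsage (the 32kB constant buffer).
constexpr std::size_t small_functor_threshold = 512;
constexpr std::size_t constant_buffer_size    = 32 * 1024;

constexpr LaunchVia choose_launch_mechanism(std::size_t functor_bytes) {
  return functor_bytes <= small_functor_threshold ? LaunchVia::LocalMemory
         : functor_bytes <= constant_buffer_size  ? LaunchVia::ConstantMemory
                                                  : LaunchVia::GlobalMemory;
}

static_assert(choose_launch_mechanism(64) == LaunchVia::LocalMemory, "");
static_assert(choose_launch_mechanism(8 * 1024) == LaunchVia::ConstantMemory, "");
static_assert(choose_launch_mechanism(64 * 1024) == LaunchVia::GlobalMemory, "");
````
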
// <editor-fold desc="CudaParallelLaunchKernelInvoker"> {{{1
// Base classes that summarize the differences between the different launch
// mechanisms
template <class DriverType, class LaunchBounds,
Experimental::CudaLaunchMechanism LaunchMechanism>
struct CudaParallelLaunchKernelFunc;
template <class DriverType, class LaunchBounds,
Experimental::CudaLaunchMechanism LaunchMechanism>
struct CudaParallelLaunchKernelInvoker;
//------------------------------------------------------------------------------
// <editor-fold desc="Local memory"> {{{2
template <class DriverType, unsigned int MaxThreadsPerBlock,
unsigned int MinBlocksPerSM>
struct CudaParallelLaunchKernelFunc<
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>,
Experimental::CudaLaunchMechanism::LocalMemory> {
static std::decay_t<decltype(cuda_parallel_launch_local_memory<
DriverType, MaxThreadsPerBlock, MinBlocksPerSM>)>
get_kernel_func() {
return cuda_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>;
}
};
template <class DriverType>
struct CudaParallelLaunchKernelFunc<
DriverType, Kokkos::LaunchBounds<0, 0>,
Experimental::CudaLaunchMechanism::LocalMemory> {
static std::decay_t<decltype(cuda_parallel_launch_local_memory<DriverType>)>
get_kernel_func() {
return cuda_parallel_launch_local_memory<DriverType>;
}
};
//------------------------------------------------------------------------------
template <class DriverType, class LaunchBounds>
struct CudaParallelLaunchKernelInvoker<
DriverType, LaunchBounds, Experimental::CudaLaunchMechanism::LocalMemory>
: CudaParallelLaunchKernelFunc<
DriverType, LaunchBounds,
Experimental::CudaLaunchMechanism::LocalMemory> {
using base_t = CudaParallelLaunchKernelFunc<
DriverType, LaunchBounds, Experimental::CudaLaunchMechanism::LocalMemory>;
static_assert(sizeof(DriverType) < CudaTraits::KernelArgumentLimit,
"Kokkos Error: Requested CudaLaunchLocalMemory with a Functor "
"larger than 4096 bytes.");
static void invoke_kernel(DriverType const& driver, dim3 const& grid,
dim3 const& block, int shmem,
CudaInternal const* cuda_instance) {
(base_t::
get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>(
driver);
}
#ifdef KOKKOS_CUDA_ENABLE_GRAPHS
inline static void create_parallel_launch_graph_node(
DriverType const& driver, dim3 const& grid, dim3 const& block, int shmem,
CudaInternal const* cuda_instance, bool prefer_shmem) {
//----------------------------------------
auto const& graph = Impl::get_cuda_graph_from_kernel(driver);
KOKKOS_EXPECTS(bool(graph));
auto& graph_node = Impl::get_cuda_graph_node_from_kernel(driver);
// Expect node not yet initialized
KOKKOS_EXPECTS(!bool(graph_node));
if (!Impl::is_empty_launch(grid, block)) {
Impl::check_shmem_request(cuda_instance, shmem);
Impl::configure_shmem_preference(base_t::get_kernel_func(), prefer_shmem);
void const* args[] = {&driver};
cudaKernelNodeParams params = {};
params.blockDim = block;
params.gridDim = grid;
params.sharedMemBytes = shmem;
params.func = (void*)base_t::get_kernel_func();
params.kernelParams = (void**)args;
params.extra = nullptr;
CUDA_SAFE_CALL(cudaGraphAddKernelNode(
&graph_node, graph, /* dependencies = */ nullptr,
/* numDependencies = */ 0, &params));
} else {
// We still need an empty node for the dependency structure
CUDA_SAFE_CALL(cudaGraphAddEmptyNode(&graph_node, graph,
/* dependencies = */ nullptr,
/* numDependencies = */ 0));
}
KOKKOS_ENSURES(bool(graph_node))
}
#endif
};
// </editor-fold> end local memory }}}2
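
The `create_parallel_launch_graph_node` path above is only reached when a kernel is added to a Kokkos graph rather than launched immediately. A minimal sketch of driving it through the new Graph API, assuming the `Kokkos::Experimental::create_graph` / `then_parallel_for` / `submit` interface introduced in this release and a build with `KOKKOS_CUDA_ENABLE_GRAPHS` active:
````c++
#include <Kokkos_Core.hpp>
#include <Kokkos_Graph.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<int*, Kokkos::CudaSpace> data("data", 1 << 20);

    auto graph = Kokkos::Experimental::create_graph(
        Kokkos::Cuda{}, [&](auto root) {
          // Kernels added here are captured as graph nodes, i.e. they go
          // through create_parallel_launch_graph_node instead of launch_kernel.
          root.then_parallel_for(
              data.extent(0), KOKKOS_LAMBDA(int i) { data(i) = i; });
        });

    graph.submit();  // replay the captured cudaGraph on the Cuda instance
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
````
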
//------------------------------------------------------------------------------
//------------------------------------------------------------------------------
// <editor-fold desc="Global Memory"> {{{2
template <class DriverType, unsigned int MaxThreadsPerBlock,
unsigned int MinBlocksPerSM>
struct CudaParallelLaunchKernelFunc<
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>,
Experimental::CudaLaunchMechanism::GlobalMemory> {
static void* get_kernel_func() {
return cuda_parallel_launch_global_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>;
}
};
template <class DriverType>
struct CudaParallelLaunchKernelFunc<
DriverType, Kokkos::LaunchBounds<0, 0>,
Experimental::CudaLaunchMechanism::GlobalMemory> {
static std::decay_t<decltype(cuda_parallel_launch_global_memory<DriverType>)>
get_kernel_func() {
return cuda_parallel_launch_global_memory<DriverType>;
}
};
//------------------------------------------------------------------------------
template <class DriverType, class LaunchBounds>
struct CudaParallelLaunchKernelInvoker<
DriverType, LaunchBounds, Experimental::CudaLaunchMechanism::GlobalMemory>
: CudaParallelLaunchKernelFunc<
DriverType, LaunchBounds,
Experimental::CudaLaunchMechanism::GlobalMemory> {
using base_t = CudaParallelLaunchKernelFunc<
DriverType, LaunchBounds,
Experimental::CudaLaunchMechanism::GlobalMemory>;
static void invoke_kernel(DriverType const& driver, dim3 const& grid,
dim3 const& block, int shmem,
CudaInternal const* cuda_instance) {
DriverType* driver_ptr = reinterpret_cast<DriverType*>(
cuda_instance->scratch_functor(sizeof(DriverType)));
cudaMemcpyAsync(driver_ptr, &driver, sizeof(DriverType), cudaMemcpyDefault,
cuda_instance->m_stream);
(base_t::
get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>(
driver_ptr);
}
#ifdef KOKKOS_CUDA_ENABLE_GRAPHS
inline static void create_parallel_launch_graph_node(
DriverType const& driver, dim3 const& grid, dim3 const& block, int shmem,
CudaInternal const* cuda_instance, bool prefer_shmem) {
//----------------------------------------
auto const& graph = Impl::get_cuda_graph_from_kernel(driver);
KOKKOS_EXPECTS(bool(graph));
auto& graph_node = Impl::get_cuda_graph_node_from_kernel(driver);
// Expect node not yet initialized
KOKKOS_EXPECTS(!bool(graph_node));
if (!Impl::is_empty_launch(grid, block)) {
Impl::check_shmem_request(cuda_instance, shmem);
Impl::configure_shmem_preference(base_t::get_kernel_func(), prefer_shmem);
auto* driver_ptr = Impl::allocate_driver_storage_for_kernel(driver);
// Unlike in the non-graph case, we can get away with doing an async copy
// here because the `DriverType` instance is held in the GraphNodeImpl
// which is guaranteed to be alive until the graph instance itself is
// destroyed, where there should be a fence ensuring that the allocation
// associated with this kernel on the device side isn't deleted.
cudaMemcpyAsync(driver_ptr, &driver, sizeof(DriverType),
cudaMemcpyDefault, cuda_instance->m_stream);
void const* args[] = {&driver_ptr};
cudaKernelNodeParams params = {};
params.blockDim = block;
params.gridDim = grid;
params.sharedMemBytes = shmem;
params.func = (void*)base_t::get_kernel_func();
params.kernelParams = (void**)args;
params.extra = nullptr;
CUDA_SAFE_CALL(cudaGraphAddKernelNode(
&graph_node, graph, /* dependencies = */ nullptr,
/* numDependencies = */ 0, &params));
} else {
// We still need an empty node for the dependency structure
CUDA_SAFE_CALL(cudaGraphAddEmptyNode(&graph_node, graph,
/* dependencies = */ nullptr,
/* numDependencies = */ 0));
}
KOKKOS_ENSURES(bool(graph_node))
}
#endif
};
// </editor-fold> end Global Memory }}}2
//------------------------------------------------------------------------------
//------------------------------------------------------------------------------
// <editor-fold desc="Constant Memory"> {{{2
template <class DriverType, unsigned int MaxThreadsPerBlock,
unsigned int MinBlocksPerSM>
struct CudaParallelLaunchKernelFunc<
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>,
Experimental::CudaLaunchMechanism::ConstantMemory> {
static std::decay_t<decltype(cuda_parallel_launch_constant_memory<
DriverType, MaxThreadsPerBlock, MinBlocksPerSM>)>
get_kernel_func() {
return cuda_parallel_launch_constant_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>;
}
};
template <class DriverType>
struct CudaParallelLaunchKernelFunc<
DriverType, Kokkos::LaunchBounds<0, 0>,
Experimental::CudaLaunchMechanism::ConstantMemory> {
static std::decay_t<
decltype(cuda_parallel_launch_constant_memory<DriverType>)>
get_kernel_func() {
return cuda_parallel_launch_constant_memory<DriverType>;
}
};
//------------------------------------------------------------------------------
template <class DriverType, class LaunchBounds>
struct CudaParallelLaunchKernelInvoker<
DriverType, LaunchBounds, Experimental::CudaLaunchMechanism::ConstantMemory>
: CudaParallelLaunchKernelFunc<
DriverType, LaunchBounds,
Experimental::CudaLaunchMechanism::ConstantMemory> {
using base_t = CudaParallelLaunchKernelFunc<
DriverType, LaunchBounds,
Experimental::CudaLaunchMechanism::ConstantMemory>;
static_assert(sizeof(DriverType) < CudaTraits::ConstantMemoryUsage,
"Kokkos Error: Requested CudaLaunchConstantMemory with a "
"Functor larger than 32kB.");
static void invoke_kernel(DriverType const& driver, dim3 const& grid,
dim3 const& block, int shmem,
CudaInternal const* cuda_instance) {
// Wait until the previous kernel that uses the constant buffer is done
CUDA_SAFE_CALL(cudaEventSynchronize(cuda_instance->constantMemReusable));
// Copy functor (synchronously) to staging buffer in pinned host memory
unsigned long* staging = cuda_instance->constantMemHostStaging;
memcpy(staging, &driver, sizeof(DriverType));
// Copy functor asynchronously from there to constant memory on the device
cudaMemcpyToSymbolAsync(kokkos_impl_cuda_constant_memory_buffer, staging,
sizeof(DriverType), 0, cudaMemcpyHostToDevice,
cudaStream_t(cuda_instance->m_stream));
// Invoke the driver function on the device
(base_t::
get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>();
// Record an event that says when the constant buffer can be reused
CUDA_SAFE_CALL(cudaEventRecord(cuda_instance->constantMemReusable,
cudaStream_t(cuda_instance->m_stream)));
}
#ifdef KOKKOS_CUDA_ENABLE_GRAPHS
inline static void create_parallel_launch_graph_node(
DriverType const& driver, dim3 const& grid, dim3 const& block, int shmem,
CudaInternal const* cuda_instance, bool prefer_shmem) {
// Just use global memory; coordinating through events to share constant
// memory with the non-graph interface is not really reasonable since
// events don't work with Graphs directly, and this would anyway require
// a much more complicated structure that finds previous nodes in the
// dependency structure of the graph and creates an implicit dependence
// based on the need for constant memory (which we would then have to
// somehow go and prove was not creating a dependency cycle, and I don't
// even know if there's an efficient way to do that, let alone in the
// structure we currently have).
using global_launch_impl_t = CudaParallelLaunchKernelInvoker<
DriverType, LaunchBounds,
Experimental::CudaLaunchMechanism::GlobalMemory>;
global_launch_impl_t::create_parallel_launch_graph_node(
driver, grid, block, shmem, cuda_instance, prefer_shmem);
}
#endif
};
// </editor-fold> end Constant Memory }}}2
//------------------------------------------------------------------------------
// </editor-fold> end CudaParallelLaunchKernelInvoker }}}1
//==============================================================================
//==============================================================================
// <editor-fold desc="CudaParallelLaunchImpl"> {{{1
template <class DriverType, class LaunchBounds,
Experimental::CudaLaunchMechanism LaunchMechanism>
struct CudaParallelLaunchImpl;
template <class DriverType, unsigned int MaxThreadsPerBlock,
unsigned int MinBlocksPerSM,
Experimental::CudaLaunchMechanism LaunchMechanism>
struct CudaParallelLaunchImpl<
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>,
LaunchMechanism>
: CudaParallelLaunchKernelInvoker<
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>,
LaunchMechanism> {
using base_t = CudaParallelLaunchKernelInvoker<
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>,
LaunchMechanism>;
inline static void launch_kernel(const DriverType& driver, const dim3& grid,
const dim3& block, int shmem,
const CudaInternal* cuda_instance,
bool prefer_shmem) {
if (!Impl::is_empty_launch(grid, block)) {
// Prevent multiple threads from simultaneously setting the cache
// configuration preference and launching the same kernel
static std::mutex mutex;
std::lock_guard<std::mutex> lock(mutex);
Impl::check_shmem_request(cuda_instance, shmem);
// If a desired occupancy is specified, we compute how much shared memory
// to ask for to achieve that occupancy, assuming that the cache
// configuration is `cudaFuncCachePreferL1`. If the amount of dynamic
// shared memory computed this way is larger than `shmem`, we overwrite
// `shmem` and set `prefer_shmem` to `false`.
modify_launch_configuration_if_desired_occupancy_is_specified(
driver.get_policy(), cuda_instance->m_deviceProp,
get_cuda_func_attributes(), block, shmem, prefer_shmem);
Impl::configure_shmem_preference(base_t::get_kernel_func(), prefer_shmem);
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
// Invoke the driver function on the device
base_t::invoke_kernel(driver, grid, block, shmem, cuda_instance);
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK)
CUDA_SAFE_CALL(cudaGetLastError());
cuda_instance->fence();
#endif
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
// A race condition inside cudaFuncGetAttributes when the same address is
// given requires using a local variable as input instead of a static one.
// Rely on static variable initialization to make sure only one thread
// executes the code and that the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(
cudaFuncGetAttributes(&attr_tmp, base_t::get_kernel_func()));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
// </editor-fold> end CudaParallelLaunchImpl }}}1
//==============================================================================
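
`get_cuda_func_attributes` above leans on C++ "magic statics": the local static's initializer runs exactly once even under concurrent callers, which both avoids the `cudaFuncGetAttributes` race and caches the result. A CUDA-free sketch of the same idiom:
````c++
#include <cstdio>

struct QueryResult {
  int value;
};

QueryResult const& cached_query() {
  auto compute = []() -> QueryResult {
    std::puts("computed exactly once, even with concurrent callers");
    return QueryResult{42};
  };
  // Initialization of a local static is serialized by the runtime (C++11).
  static QueryResult const result = compute();
  return result;
}

int main() {
  std::printf("%d\n", cached_query().value);
  std::printf("%d\n", cached_query().value);  // served from the cached value
  return 0;
}
````
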
//==============================================================================
// <editor-fold desc="CudaParallelLaunch"> {{{1
template <class DriverType, class LaunchBounds = Kokkos::LaunchBounds<>, template <class DriverType, class LaunchBounds = Kokkos::LaunchBounds<>,
Experimental::CudaLaunchMechanism LaunchMechanism = Experimental::CudaLaunchMechanism LaunchMechanism =
DeduceCudaLaunchMechanism<DriverType>::launch_mechanism> DeduceCudaLaunchMechanism<DriverType>::launch_mechanism,
bool DoGraph = DriverType::Policy::is_graph_kernel::value
#ifndef KOKKOS_CUDA_ENABLE_GRAPHS
&& false
#endif
>
struct CudaParallelLaunch; struct CudaParallelLaunch;
template <class DriverType, unsigned int MaxThreadsPerBlock, // General launch mechanism
unsigned int MinBlocksPerSM> template <class DriverType, class LaunchBounds,
struct CudaParallelLaunch< Experimental::CudaLaunchMechanism LaunchMechanism>
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>, struct CudaParallelLaunch<DriverType, LaunchBounds, LaunchMechanism,
Experimental::CudaLaunchMechanism::ConstantMemory> { /* DoGraph = */ false>
static_assert(sizeof(DriverType) < CudaTraits::ConstantMemoryUsage, : CudaParallelLaunchImpl<DriverType, LaunchBounds, LaunchMechanism> {
"Kokkos Error: Requested CudaLaunchConstantMemory with a " using base_t =
"Functor larger than 32kB."); CudaParallelLaunchImpl<DriverType, LaunchBounds, LaunchMechanism>;
inline CudaParallelLaunch(const DriverType& driver, const dim3& grid, template <class... Args>
const dim3& block, const int shmem, CudaParallelLaunch(Args&&... args) {
const CudaInternal* cuda_instance, base_t::launch_kernel((Args &&) args...);
const bool prefer_shmem) {
if ((grid.x != 0) && ((block.x * block.y * block.z) != 0)) {
if (cuda_instance->m_maxShmemPerBlock < shmem) {
Kokkos::Impl::throw_runtime_exception(std::string(
"CudaParallelLaunch FAILED: shared memory request is too large"));
}
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
static bool cache_config_set = false;
if (!cache_config_set) {
CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
cuda_parallel_launch_constant_memory<
DriverType, MaxThreadsPerBlock, MinBlocksPerSM>,
(prefer_shmem ? cudaFuncCachePreferShared
: cudaFuncCachePreferL1)));
cache_config_set = true;
}
}
#else
(void)prefer_shmem;
#endif
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
// Wait until the previous kernel that uses the constant buffer is done
CUDA_SAFE_CALL(cudaEventSynchronize(cuda_instance->constantMemReusable));
// Copy functor (synchronously) to staging buffer in pinned host memory
unsigned long* staging = cuda_instance->constantMemHostStaging;
memcpy(staging, &driver, sizeof(DriverType));
// Copy functor asynchronously from there to constant memory on the device
cudaMemcpyToSymbolAsync(kokkos_impl_cuda_constant_memory_buffer, staging,
sizeof(DriverType), 0, cudaMemcpyHostToDevice,
cudaStream_t(cuda_instance->m_stream));
// Invoke the driver function on the device
cuda_parallel_launch_constant_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>
<<<grid, block, shmem, cuda_instance->m_stream>>>();
// Record an event that says when the constant buffer can be reused
CUDA_SAFE_CALL(cudaEventRecord(cuda_instance->constantMemReusable,
cudaStream_t(cuda_instance->m_stream)));
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK)
CUDA_SAFE_CALL(cudaGetLastError());
Kokkos::Cuda().fence();
#endif
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
// Race condition inside of cudaFuncGetAttributes if the same address is
// given requires using a local variable as input instead of a static Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr_tmp,
cuda_parallel_launch_constant_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
} }
}; };
template <class DriverType> #ifdef KOKKOS_CUDA_ENABLE_GRAPHS
struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>, // Launch mechanism for creating graph nodes
Experimental::CudaLaunchMechanism::ConstantMemory> { template <class DriverType, class LaunchBounds,
static_assert(sizeof(DriverType) < CudaTraits::ConstantMemoryUsage, Experimental::CudaLaunchMechanism LaunchMechanism>
"Kokkos Error: Requested CudaLaunchConstantMemory with a " struct CudaParallelLaunch<DriverType, LaunchBounds, LaunchMechanism,
"Functor larger than 32kB."); /* DoGraph = */ true>
inline CudaParallelLaunch(const DriverType& driver, const dim3& grid, : CudaParallelLaunchImpl<DriverType, LaunchBounds, LaunchMechanism> {
const dim3& block, const int shmem, using base_t =
const CudaInternal* cuda_instance, CudaParallelLaunchImpl<DriverType, LaunchBounds, LaunchMechanism>;
const bool prefer_shmem) { template <class... Args>
if ((grid.x != 0) && ((block.x * block.y * block.z) != 0)) { CudaParallelLaunch(Args&&... args) {
if (cuda_instance->m_maxShmemPerBlock < shmem) { base_t::create_parallel_launch_graph_node((Args &&) args...);
Kokkos::Impl::throw_runtime_exception(std::string(
"CudaParallelLaunch FAILED: shared memory request is too large"));
}
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
static bool cache_config_set = false;
if (!cache_config_set) {
CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
cuda_parallel_launch_constant_memory<DriverType>,
(prefer_shmem ? cudaFuncCachePreferShared
: cudaFuncCachePreferL1)));
cache_config_set = true;
}
}
#else
(void)prefer_shmem;
#endif
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
// Wait until the previous kernel that uses the constant buffer is done
CUDA_SAFE_CALL(cudaEventSynchronize(cuda_instance->constantMemReusable));
// Copy functor (synchronously) to staging buffer in pinned host memory
unsigned long* staging = cuda_instance->constantMemHostStaging;
memcpy(staging, &driver, sizeof(DriverType));
// Copy functor asynchronously from there to constant memory on the device
cudaMemcpyToSymbolAsync(kokkos_impl_cuda_constant_memory_buffer, staging,
sizeof(DriverType), 0, cudaMemcpyHostToDevice,
cudaStream_t(cuda_instance->m_stream));
// Invoke the driver function on the device
cuda_parallel_launch_constant_memory<DriverType>
<<<grid, block, shmem, cuda_instance->m_stream>>>();
// Record an event that says when the constant buffer can be reused
CUDA_SAFE_CALL(cudaEventRecord(cuda_instance->constantMemReusable,
cudaStream_t(cuda_instance->m_stream)));
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK)
CUDA_SAFE_CALL(cudaGetLastError());
Kokkos::Cuda().fence();
#endif
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
// Race condition inside of cudaFuncGetAttributes if the same address is
// given requires using a local variable as input instead of a static Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr_tmp, cuda_parallel_launch_constant_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
} }
}; };
template <class DriverType, unsigned int MaxThreadsPerBlock,
unsigned int MinBlocksPerSM>
struct CudaParallelLaunch<
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>,
Experimental::CudaLaunchMechanism::LocalMemory> {
static_assert(sizeof(DriverType) < CudaTraits::KernelArgumentLimit,
"Kokkos Error: Requested CudaLaunchLocalMemory with a Functor "
"larger than 4096 bytes.");
inline CudaParallelLaunch(const DriverType& driver, const dim3& grid,
const dim3& block, const int shmem,
const CudaInternal* cuda_instance,
const bool prefer_shmem) {
if ((grid.x != 0) && ((block.x * block.y * block.z) != 0)) {
if (cuda_instance->m_maxShmemPerBlock < shmem) {
Kokkos::Impl::throw_runtime_exception(std::string(
"CudaParallelLaunch FAILED: shared memory request is too large"));
}
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
static bool cache_config_set = false;
if (!cache_config_set) {
CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
cuda_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>,
(prefer_shmem ? cudaFuncCachePreferShared
: cudaFuncCachePreferL1)));
cache_config_set = true;
}
}
#else
(void)prefer_shmem;
#endif #endif
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE(); // </editor-fold> end CudaParallelLaunch }}}1
//==============================================================================
// Invoke the driver function on the device
cuda_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>
<<<grid, block, shmem, cuda_instance->m_stream>>>(driver);
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK)
CUDA_SAFE_CALL(cudaGetLastError());
Kokkos::Cuda().fence();
#endif
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
// Race condition inside of cudaFuncGetAttributes if the same address is
// given requires using a local variable as input instead of a static Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr_tmp,
cuda_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
template <class DriverType>
struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
Experimental::CudaLaunchMechanism::LocalMemory> {
static_assert(sizeof(DriverType) < CudaTraits::KernelArgumentLimit,
"Kokkos Error: Requested CudaLaunchLocalMemory with a Functor "
"larger than 4096 bytes.");
inline CudaParallelLaunch(const DriverType& driver, const dim3& grid,
const dim3& block, const int shmem,
const CudaInternal* cuda_instance,
const bool prefer_shmem) {
if ((grid.x != 0) && ((block.x * block.y * block.z) != 0)) {
if (cuda_instance->m_maxShmemPerBlock < shmem) {
Kokkos::Impl::throw_runtime_exception(std::string(
"CudaParallelLaunch FAILED: shared memory request is too large"));
}
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
static bool cache_config_set = false;
if (!cache_config_set) {
CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
cuda_parallel_launch_local_memory<DriverType>,
(prefer_shmem ? cudaFuncCachePreferShared
: cudaFuncCachePreferL1)));
cache_config_set = true;
}
}
#else
(void)prefer_shmem;
#endif
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
// Invoke the driver function on the device
cuda_parallel_launch_local_memory<DriverType>
<<<grid, block, shmem, cuda_instance->m_stream>>>(driver);
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK)
CUDA_SAFE_CALL(cudaGetLastError());
Kokkos::Cuda().fence();
#endif
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
// Race condition inside of cudaFuncGetAttributes if the same address is
// given requires using a local variable as input instead of a static Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr_tmp, cuda_parallel_launch_local_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
template <class DriverType, unsigned int MaxThreadsPerBlock,
unsigned int MinBlocksPerSM>
struct CudaParallelLaunch<
DriverType, Kokkos::LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM>,
Experimental::CudaLaunchMechanism::GlobalMemory> {
inline CudaParallelLaunch(const DriverType& driver, const dim3& grid,
const dim3& block, const int shmem,
CudaInternal* cuda_instance,
const bool prefer_shmem) {
if ((grid.x != 0) && ((block.x * block.y * block.z) != 0)) {
if (cuda_instance->m_maxShmemPerBlock < shmem) {
Kokkos::Impl::throw_runtime_exception(std::string(
"CudaParallelLaunch FAILED: shared memory request is too large"));
}
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
static bool cache_config_set = false;
if (!cache_config_set) {
CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
cuda_parallel_launch_global_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>,
(prefer_shmem ? cudaFuncCachePreferShared
: cudaFuncCachePreferL1)));
cache_config_set = true;
}
}
#else
(void)prefer_shmem;
#endif
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
DriverType* driver_ptr = nullptr;
driver_ptr = reinterpret_cast<DriverType*>(
cuda_instance->scratch_functor(sizeof(DriverType)));
cudaMemcpyAsync(driver_ptr, &driver, sizeof(DriverType),
cudaMemcpyDefault, cuda_instance->m_stream);
// Invoke the driver function on the device
cuda_parallel_launch_global_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>
<<<grid, block, shmem, cuda_instance->m_stream>>>(driver_ptr);
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK)
CUDA_SAFE_CALL(cudaGetLastError());
Kokkos::Cuda().fence();
#endif
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
// Race condition inside of cudaFuncGetAttributes if the same address is
// given requires using a local variable as input instead of a static Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr_tmp,
cuda_parallel_launch_global_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
template <class DriverType>
struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
Experimental::CudaLaunchMechanism::GlobalMemory> {
inline CudaParallelLaunch(const DriverType& driver, const dim3& grid,
const dim3& block, const int shmem,
CudaInternal* cuda_instance,
const bool prefer_shmem) {
if ((grid.x != 0) && ((block.x * block.y * block.z) != 0)) {
if (cuda_instance->m_maxShmemPerBlock < shmem) {
Kokkos::Impl::throw_runtime_exception(std::string(
"CudaParallelLaunch FAILED: shared memory request is too large"));
}
#ifndef KOKKOS_ARCH_KEPLER
// On Kepler the L1 has no benefit since it doesn't cache reads
else {
static bool cache_config_set = false;
if (!cache_config_set) {
CUDA_SAFE_CALL(cudaFuncSetCacheConfig(
cuda_parallel_launch_global_memory<DriverType>,
(prefer_shmem ? cudaFuncCachePreferShared
: cudaFuncCachePreferL1)));
cache_config_set = true;
}
}
#else
(void)prefer_shmem;
#endif
KOKKOS_ENSURE_CUDA_LOCK_ARRAYS_ON_DEVICE();
DriverType* driver_ptr = nullptr;
driver_ptr = reinterpret_cast<DriverType*>(
cuda_instance->scratch_functor(sizeof(DriverType)));
cudaMemcpyAsync(driver_ptr, &driver, sizeof(DriverType),
cudaMemcpyDefault, cuda_instance->m_stream);
cuda_parallel_launch_global_memory<DriverType>
<<<grid, block, shmem, cuda_instance->m_stream>>>(driver_ptr);
#if defined(KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK)
CUDA_SAFE_CALL(cudaGetLastError());
Kokkos::Cuda().fence();
#endif
}
}
static cudaFuncAttributes get_cuda_func_attributes() {
// Race condition inside of cudaFuncGetAttributes if the same address is
// given requires using a local variable as input instead of a static Rely
// on static variable initialization to make sure only one thread executes
// the code and the result is visible.
auto wrap_get_attributes = []() -> cudaFuncAttributes {
cudaFuncAttributes attr_tmp;
CUDA_SAFE_CALL(cudaFuncGetAttributes(
&attr_tmp, cuda_parallel_launch_global_memory<DriverType>));
return attr_tmp;
};
static cudaFuncAttributes attr = wrap_get_attributes();
return attr;
}
};
//----------------------------------------------------------------------------
} // namespace Impl } // namespace Impl
} // namespace Kokkos } // namespace Kokkos
@ -646,6 +707,5 @@ struct CudaParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
#endif /* defined( __CUDACC__ ) */
#endif /* defined( KOKKOS_ENABLE_CUDA ) */ #endif /* defined( KOKKOS_ENABLE_CUDA ) */
#endif /* #ifndef KOKKOS_CUDAEXEC_HPP */ #endif /* #ifndef KOKKOS_CUDAEXEC_HPP */

View File

@ -42,13 +42,10 @@
//@HEADER //@HEADER
*/ */
#include <Kokkos_Macros.hpp> #include <Kokkos_Core.hpp>
#ifdef KOKKOS_ENABLE_CUDA #ifdef KOKKOS_ENABLE_CUDA
#include <Cuda/Kokkos_Cuda_Locks.hpp> #include <Cuda/Kokkos_Cuda_Locks.hpp>
#include <Cuda/Kokkos_Cuda_Error.hpp> #include <Cuda/Kokkos_Cuda_Error.hpp>
#include <Kokkos_Cuda.hpp>
#ifdef KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE #ifdef KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE
namespace Kokkos { namespace Kokkos {

View File

@ -81,8 +81,6 @@ void finalize_host_cuda_lock_arrays();
} // namespace Impl } // namespace Impl
} // namespace Kokkos } // namespace Kokkos
#if defined(__CUDACC__)
namespace Kokkos { namespace Kokkos {
namespace Impl { namespace Impl {
@ -173,8 +171,6 @@ inline int eliminate_warning_for_lock_array() { return lock_array_copied; }
KOKKOS_COPY_CUDA_LOCK_ARRAYS_TO_DEVICE() KOKKOS_COPY_CUDA_LOCK_ARRAYS_TO_DEVICE()
#endif #endif
#endif /* defined( __CUDACC__ ) */
#endif /* defined( KOKKOS_ENABLE_CUDA ) */ #endif /* defined( KOKKOS_ENABLE_CUDA ) */
#endif /* #ifndef KOKKOS_CUDA_LOCKS_HPP */ #endif /* #ifndef KOKKOS_CUDA_LOCKS_HPP */

View File

@ -46,7 +46,7 @@
#define KOKKOS_CUDA_PARALLEL_HPP #define KOKKOS_CUDA_PARALLEL_HPP
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if defined(__CUDACC__) && defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA)
#include <algorithm> #include <algorithm>
#include <string> #include <string>
@ -99,6 +99,8 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
int m_team_scratch_size[2]; int m_team_scratch_size[2];
int m_thread_scratch_size[2]; int m_thread_scratch_size[2];
int m_chunk_size; int m_chunk_size;
bool m_tune_team;
bool m_tune_vector;
public: public:
//! Execution space of this execution policy //! Execution space of this execution policy
@ -115,6 +117,8 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
m_thread_scratch_size[1] = p.m_thread_scratch_size[1]; m_thread_scratch_size[1] = p.m_thread_scratch_size[1];
m_chunk_size = p.m_chunk_size; m_chunk_size = p.m_chunk_size;
m_space = p.m_space; m_space = p.m_space;
m_tune_team = p.m_tune_team;
m_tune_vector = p.m_tune_vector;
} }
//---------------------------------------- //----------------------------------------
@ -130,10 +134,10 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
Kokkos::Impl::cuda_get_max_block_size<FunctorType, Kokkos::Impl::cuda_get_max_block_size<FunctorType,
typename traits::launch_bounds>( typename traits::launch_bounds>(
space().impl_internal_space_instance(), attr, f, space().impl_internal_space_instance(), attr, f,
(size_t)vector_length(), (size_t)impl_vector_length(),
(size_t)team_scratch_size(0) + 2 * sizeof(double), (size_t)team_scratch_size(0) + 2 * sizeof(double),
(size_t)thread_scratch_size(0) + sizeof(double)); (size_t)thread_scratch_size(0) + sizeof(double));
return block_size / vector_length(); return block_size / impl_vector_length();
} }
template <class FunctorType> template <class FunctorType>
@ -171,10 +175,10 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
Kokkos::Impl::cuda_get_opt_block_size<FunctorType, Kokkos::Impl::cuda_get_opt_block_size<FunctorType,
typename traits::launch_bounds>( typename traits::launch_bounds>(
space().impl_internal_space_instance(), attr, f, space().impl_internal_space_instance(), attr, f,
(size_t)vector_length(), (size_t)impl_vector_length(),
(size_t)team_scratch_size(0) + 2 * sizeof(double), (size_t)team_scratch_size(0) + 2 * sizeof(double),
(size_t)thread_scratch_size(0) + sizeof(double)); (size_t)thread_scratch_size(0) + sizeof(double));
return block_size / vector_length(); return block_size / impl_vector_length();
} }
template <class FunctorType> template <class FunctorType>
@ -234,9 +238,18 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
//---------------------------------------- //----------------------------------------
inline int vector_length() const { return m_vector_length; } KOKKOS_DEPRECATED inline int vector_length() const {
return impl_vector_length();
}
inline int impl_vector_length() const { return m_vector_length; }
inline int team_size() const { return m_team_size; } inline int team_size() const { return m_team_size; }
inline int league_size() const { return m_league_size; } inline int league_size() const { return m_league_size; }
inline bool impl_auto_team_size() const { return m_tune_team; }
inline bool impl_auto_vector_length() const { return m_tune_vector; }
inline void impl_set_team_size(size_t team_size) { m_team_size = team_size; }
inline void impl_set_vector_length(size_t vector_length) {
m_vector_length = vector_length;
}
inline int scratch_size(int level, int team_size_ = -1) const { inline int scratch_size(int level, int team_size_ = -1) const {
if (team_size_ < 0) team_size_ = m_team_size; if (team_size_ < 0) team_size_ = m_team_size;
return m_team_scratch_size[level] + return m_team_scratch_size[level] +
@ -258,18 +271,25 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
m_vector_length(0), m_vector_length(0),
m_team_scratch_size{0, 0}, m_team_scratch_size{0, 0},
m_thread_scratch_size{0, 0}, m_thread_scratch_size{0, 0},
m_chunk_size(32) {} m_chunk_size(Impl::CudaTraits::WarpSize),
m_tune_team(false),
m_tune_vector(false) {}
/** \brief Specify league size, request team size */ /** \brief Specify league size, specify team size, specify vector length */
TeamPolicyInternal(const execution_space space_, int league_size_, TeamPolicyInternal(const execution_space space_, int league_size_,
int team_size_request, int vector_length_request = 1) int team_size_request, int vector_length_request = 1)
: m_space(space_), : m_space(space_),
m_league_size(league_size_), m_league_size(league_size_),
m_team_size(team_size_request), m_team_size(team_size_request),
m_vector_length(verify_requested_vector_length(vector_length_request)), m_vector_length(
(vector_length_request > 0)
? verify_requested_vector_length(vector_length_request)
: verify_requested_vector_length(1)),
m_team_scratch_size{0, 0}, m_team_scratch_size{0, 0},
m_thread_scratch_size{0, 0}, m_thread_scratch_size{0, 0},
m_chunk_size(32) { m_chunk_size(Impl::CudaTraits::WarpSize),
m_tune_team(bool(team_size_request <= 0)),
m_tune_vector(bool(vector_length_request <= 0)) {
// Make sure league size is permissible // Make sure league size is permissible
if (league_size_ >= int(Impl::cuda_internal_maximum_grid_count())) if (league_size_ >= int(Impl::cuda_internal_maximum_grid_count()))
Impl::throw_runtime_exception( Impl::throw_runtime_exception(
@ -277,72 +297,56 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
"space."); "space.");
// Make sure total block size is permissible // Make sure total block size is permissible
if (m_team_size * m_vector_length > 1024) { if (m_team_size * m_vector_length >
int(Impl::CudaTraits::MaxHierarchicalParallelism)) {
Impl::throw_runtime_exception( Impl::throw_runtime_exception(
std::string("Kokkos::TeamPolicy< Cuda > the team size is too large. " std::string("Kokkos::TeamPolicy< Cuda > the team size is too large. "
"Team size x vector length must be smaller than 1024.")); "Team size x vector length must be smaller than 1024."));
} }
} }
/** \brief Specify league size, request team size */ /** \brief Specify league size, request team size, specify vector length */
TeamPolicyInternal(const execution_space space_, int league_size_, TeamPolicyInternal(const execution_space space_, int league_size_,
const Kokkos::AUTO_t& /* team_size_request */ const Kokkos::AUTO_t& /* team_size_request */
, ,
int vector_length_request = 1) int vector_length_request = 1)
: m_space(space_), : TeamPolicyInternal(space_, league_size_, -1, vector_length_request) {}
m_league_size(league_size_),
m_team_size(-1), /** \brief Specify league size, request team size and vector length */
m_vector_length(verify_requested_vector_length(vector_length_request)), TeamPolicyInternal(const execution_space space_, int league_size_,
m_team_scratch_size{0, 0}, const Kokkos::AUTO_t& /* team_size_request */,
m_thread_scratch_size{0, 0}, const Kokkos::AUTO_t& /* vector_length_request */
m_chunk_size(32) { )
// Make sure league size is permissible : TeamPolicyInternal(space_, league_size_, -1, -1) {}
if (league_size_ >= int(Impl::cuda_internal_maximum_grid_count()))
Impl::throw_runtime_exception( /** \brief Specify league size, specify team size, request vector length */
"Requested too large league_size for TeamPolicy on Cuda execution " TeamPolicyInternal(const execution_space space_, int league_size_,
"space."); int team_size_request, const Kokkos::AUTO_t&)
} : TeamPolicyInternal(space_, league_size_, team_size_request, -1) {}
TeamPolicyInternal(int league_size_, int team_size_request, TeamPolicyInternal(int league_size_, int team_size_request,
int vector_length_request = 1) int vector_length_request = 1)
: m_space(typename traits::execution_space()), : TeamPolicyInternal(typename traits::execution_space(), league_size_,
m_league_size(league_size_), team_size_request, vector_length_request) {}
m_team_size(team_size_request),
m_vector_length(verify_requested_vector_length(vector_length_request)),
m_team_scratch_size{0, 0},
m_thread_scratch_size{0, 0},
m_chunk_size(32) {
// Make sure league size is permissible
if (league_size_ >= int(Impl::cuda_internal_maximum_grid_count()))
Impl::throw_runtime_exception(
"Requested too large league_size for TeamPolicy on Cuda execution "
"space.");
// Make sure total block size is permissible TeamPolicyInternal(int league_size_, const Kokkos::AUTO_t& team_size_request,
if (m_team_size * m_vector_length > 1024) {
Impl::throw_runtime_exception(
std::string("Kokkos::TeamPolicy< Cuda > the team size is too large. "
"Team size x vector length must be smaller than 1024."));
}
}
TeamPolicyInternal(int league_size_,
const Kokkos::AUTO_t& /* team_size_request */
,
int vector_length_request = 1) int vector_length_request = 1)
: m_space(typename traits::execution_space()), : TeamPolicyInternal(typename traits::execution_space(), league_size_,
m_league_size(league_size_), team_size_request, vector_length_request)
m_team_size(-1),
m_vector_length(verify_requested_vector_length(vector_length_request)), {}
m_team_scratch_size{0, 0},
m_thread_scratch_size{0, 0}, /** \brief Specify league size, request team size */
m_chunk_size(32) { TeamPolicyInternal(int league_size_, const Kokkos::AUTO_t& team_size_request,
// Make sure league size is permissible const Kokkos::AUTO_t& vector_length_request)
if (league_size_ >= int(Impl::cuda_internal_maximum_grid_count())) : TeamPolicyInternal(typename traits::execution_space(), league_size_,
Impl::throw_runtime_exception( team_size_request, vector_length_request) {}
"Requested too large league_size for TeamPolicy on Cuda execution "
"space."); /** \brief Specify league size, request team size */
} TeamPolicyInternal(int league_size_, int team_size_request,
const Kokkos::AUTO_t& vector_length_request)
: TeamPolicyInternal(typename traits::execution_space(), league_size_,
team_size_request, vector_length_request) {}
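
With the delegating constructors above, both the team size and the vector length can now be requested as `Kokkos::AUTO` (tracked via `m_tune_team` / `m_tune_vector`). A short usage sketch against the public `TeamPolicy` interface, assuming a CUDA-enabled build:
````c++
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using policy_t = Kokkos::TeamPolicy<Kokkos::Cuda>;
    int const league_size = 128;

    policy_t fixed(league_size, 128, 4);                         // explicit team and vector
    policy_t auto_team(league_size, Kokkos::AUTO, 4);            // tune team size only
    policy_t auto_both(league_size, Kokkos::AUTO, Kokkos::AUTO); // tune team and vector

    Kokkos::parallel_for(
        "team_loop", auto_both,
        KOKKOS_LAMBDA(policy_t::member_type const& team) {
          (void)team.league_rank();  // per-team work would go here
        });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
````
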
inline int chunk_size() const { return m_chunk_size; } inline int chunk_size() const { return m_chunk_size; }
@ -394,7 +398,7 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
get_cuda_func_attributes(); get_cuda_func_attributes();
const int block_size = std::forward<BlockSizeCallable>(block_size_callable)( const int block_size = std::forward<BlockSizeCallable>(block_size_callable)(
space().impl_internal_space_instance(), attr, f, space().impl_internal_space_instance(), attr, f,
(size_t)vector_length(), (size_t)impl_vector_length(),
(size_t)team_scratch_size(0) + 2 * sizeof(double), (size_t)team_scratch_size(0) + 2 * sizeof(double),
(size_t)thread_scratch_size(0) + sizeof(double) + (size_t)thread_scratch_size(0) + sizeof(double) +
((functor_value_traits::StaticValueSize != 0) ((functor_value_traits::StaticValueSize != 0)
@ -406,7 +410,7 @@ class TeamPolicyInternal<Kokkos::Cuda, Properties...>
int p2 = 1; int p2 = 1;
while (p2 <= block_size) p2 *= 2; while (p2 <= block_size) p2 *= 2;
p2 /= 2; p2 /= 2;
return p2 / vector_length(); return p2 / impl_vector_length();
} }
template <class ClosureType, class FunctorType> template <class ClosureType, class FunctorType>
@ -468,6 +472,8 @@ class ParallelFor<FunctorType, Kokkos::RangePolicy<Traits...>, Kokkos::Cuda> {
public: public:
using functor_type = FunctorType; using functor_type = FunctorType;
Policy const& get_policy() const { return m_policy; }
inline __device__ void operator()(void) const { inline __device__ void operator()(void) const {
const Member work_stride = blockDim.y * gridDim.x; const Member work_stride = blockDim.y * gridDim.x;
const Member work_end = m_policy.end(); const Member work_end = m_policy.end();
@ -519,6 +525,7 @@ template <class FunctorType, class... Traits>
class ParallelFor<FunctorType, Kokkos::MDRangePolicy<Traits...>, Kokkos::Cuda> { class ParallelFor<FunctorType, Kokkos::MDRangePolicy<Traits...>, Kokkos::Cuda> {
public: public:
using Policy = Kokkos::MDRangePolicy<Traits...>; using Policy = Kokkos::MDRangePolicy<Traits...>;
using functor_type = FunctorType;
private: private:
using RP = Policy; using RP = Policy;
@ -530,10 +537,11 @@ class ParallelFor<FunctorType, Kokkos::MDRangePolicy<Traits...>, Kokkos::Cuda> {
const Policy m_rp; const Policy m_rp;
public: public:
Policy const& get_policy() const { return m_rp; }
inline __device__ void operator()(void) const { inline __device__ void operator()(void) const {
Kokkos::Impl::Refactor::DeviceIterateTile<Policy::rank, Policy, FunctorType, Kokkos::Impl::DeviceIterateTile<Policy::rank, Policy, FunctorType,
typename Policy::work_tag>( typename Policy::work_tag>(m_rp, m_functor)
m_rp, m_functor)
.exec_range(); .exec_range();
} }
@ -621,8 +629,7 @@ class ParallelFor<FunctorType, Kokkos::MDRangePolicy<Traits...>, Kokkos::Cuda> {
*this, grid, block, 0, m_rp.space().impl_internal_space_instance(), *this, grid, block, 0, m_rp.space().impl_internal_space_instance(),
false); false);
} else { } else {
printf("Kokkos::MDRange Error: Exceeded rank bounds with Cuda\n"); Kokkos::abort("Kokkos::MDRange Error: Exceeded rank bounds with Cuda\n");
Kokkos::abort("Aborting");
} }
} // end execute } // end execute
@ -636,7 +643,7 @@ template <class FunctorType, class... Properties>
class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>, class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>,
Kokkos::Cuda> { Kokkos::Cuda> {
public: public:
using Policy = TeamPolicyInternal<Kokkos::Cuda, Properties...>; using Policy = TeamPolicy<Properties...>;
private: private:
using Member = typename Policy::member_type; using Member = typename Policy::member_type;
@ -680,6 +687,8 @@ class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>,
} }
public: public:
Policy const& get_policy() const { return m_policy; }
__device__ inline void operator()(void) const { __device__ inline void operator()(void) const {
// Iterate this block through the league // Iterate this block through the league
int64_t threadid = 0; int64_t threadid = 0;
@ -749,7 +758,7 @@ class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_policy(arg_policy), m_policy(arg_policy),
m_league_size(arg_policy.league_size()), m_league_size(arg_policy.league_size()),
m_team_size(arg_policy.team_size()), m_team_size(arg_policy.team_size()),
m_vector_size(arg_policy.vector_length()) { m_vector_size(arg_policy.impl_vector_length()) {
cudaFuncAttributes attr = cudaFuncAttributes attr =
CudaParallelLaunch<ParallelFor, CudaParallelLaunch<ParallelFor,
LaunchBounds>::get_cuda_func_attributes(); LaunchBounds>::get_cuda_func_attributes();
@ -796,10 +805,10 @@ class ParallelFor<FunctorType, Kokkos::TeamPolicy<Properties...>,
if (int(m_team_size) > if (int(m_team_size) >
int(Kokkos::Impl::cuda_get_max_block_size<FunctorType, LaunchBounds>( int(Kokkos::Impl::cuda_get_max_block_size<FunctorType, LaunchBounds>(
m_policy.space().impl_internal_space_instance(), attr, m_policy.space().impl_internal_space_instance(), attr,
arg_functor, arg_policy.vector_length(), arg_functor, arg_policy.impl_vector_length(),
arg_policy.team_scratch_size(0), arg_policy.team_scratch_size(0),
arg_policy.thread_scratch_size(0)) / arg_policy.thread_scratch_size(0)) /
arg_policy.vector_length())) { arg_policy.impl_vector_length())) {
Kokkos::Impl::throw_runtime_exception(std::string( Kokkos::Impl::throw_runtime_exception(std::string(
"Kokkos::Impl::ParallelFor< Cuda > requested too large team size.")); "Kokkos::Impl::ParallelFor< Cuda > requested too large team size."));
} }
@ -847,6 +856,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
using functor_type = FunctorType; using functor_type = FunctorType;
using size_type = Kokkos::Cuda::size_type; using size_type = Kokkos::Cuda::size_type;
using index_type = typename Policy::index_type; using index_type = typename Policy::index_type;
using reducer_type = ReducerType;
// Algorithmic constraints: blockSize is a power of two AND blockDim.y == // Algorithmic constraints: blockSize is a power of two AND blockDim.y ==
// blockDim.z == 1 // blockDim.z == 1
@ -873,6 +883,8 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
using DummySHMEMReductionType = int; using DummySHMEMReductionType = int;
public: public:
Policy const& get_policy() const { return m_policy; }
// Make the exec_range calls call to Reduce::DeviceIterateTile // Make the exec_range calls call to Reduce::DeviceIterateTile
template <class TagType> template <class TagType>
__device__ inline __device__ inline
@ -949,7 +961,12 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) { for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) {
global[i] = shared[i]; global[i] = shared[i];
} }
} else if (cuda_single_inter_block_reduce_scan<false, ReducerTypeFwd, // return ;
}
if (m_policy.begin() != m_policy.end()) {
{
if (cuda_single_inter_block_reduce_scan<false, ReducerTypeFwd,
WorkTagFwd>( WorkTagFwd>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x, ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(), gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
@ -957,7 +974,8 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
// This is the final block with the final result at the final threads' // This is the final block with the final result at the final threads'
// location // location
size_type* const shared = kokkos_impl_cuda_shared_memory<size_type>() + size_type* const shared =
kokkos_impl_cuda_shared_memory<size_type>() +
(blockDim.y - 1) * word_count.value; (blockDim.y - 1) * word_count.value;
size_type* const global = size_type* const global =
m_result_ptr_device_accessible m_result_ptr_device_accessible
@ -973,12 +991,14 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
__syncthreads(); __syncthreads();
} }
for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) { for (unsigned i = threadIdx.y; i < word_count.value;
i += blockDim.y) {
global[i] = shared[i]; global[i] = shared[i];
} }
} }
} }
}
}
/* __device__ inline /* __device__ inline
void run(const DummyShflReductionType&) const void run(const DummyShflReductionType&) const
{ {
@ -1055,6 +1075,9 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value || const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value || ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible || !m_result_ptr_host_accessible ||
#ifdef KOKKOS_CUDA_ENABLE_GRAPHS
Policy::is_graph_kernel::value ||
#endif
!std::is_same<ReducerType, InvalidType>::value; !std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) { if ((nwork > 0) || need_device_set) {
const int block_size = local_block_size(m_functor); const int block_size = local_block_size(m_functor);
@ -1077,6 +1100,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
dim3 grid(std::min(int(block.y), int((nwork + block.y - 1) / block.y)), 1, dim3 grid(std::min(int(block.y), int((nwork + block.y - 1) / block.y)), 1,
1); 1);
// TODO @graph We need to effectively insert this into the graph
const int shmem = const int shmem =
UseShflReduction UseShflReduction
? 0 ? 0
@ -1117,6 +1141,7 @@ class ParallelReduce<FunctorType, Kokkos::RangePolicy<Traits...>, ReducerType,
} }
} else { } else {
if (m_result_ptr) { if (m_result_ptr) {
// TODO @graph We need to effectively insert this into the graph
ValueInit::init(ReducerConditional::select(m_functor, m_reducer), ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr); m_result_ptr);
} }
@ -1195,6 +1220,7 @@ class ParallelReduce<FunctorType, Kokkos::MDRangePolicy<Traits...>, ReducerType,
using reference_type = typename ValueTraits::reference_type; using reference_type = typename ValueTraits::reference_type;
using functor_type = FunctorType; using functor_type = FunctorType;
using size_type = Cuda::size_type; using size_type = Cuda::size_type;
using reducer_type = ReducerType;
// Algorithmic constraints: blockSize is a power of two AND blockDim.y == // Algorithmic constraints: blockSize is a power of two AND blockDim.y ==
// blockDim.z == 1 // blockDim.z == 1
@ -1214,16 +1240,16 @@ class ParallelReduce<FunctorType, Kokkos::MDRangePolicy<Traits...>, ReducerType,
// Shall we use the shfl based reduction or not (only use it for static sized // Shall we use the shfl based reduction or not (only use it for static sized
// types of more than 128bit // types of more than 128bit
enum { static constexpr bool UseShflReduction = false;
UseShflReduction = ((sizeof(value_type) > 2 * sizeof(double)) && //((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize)
(ValueTraits::StaticValueSize != 0))
};
// Some crutch to do function overloading // Some crutch to do function overloading
private: private:
using DummyShflReductionType = double; using DummyShflReductionType = double;
using DummySHMEMReductionType = int; using DummySHMEMReductionType = int;
public: public:
Policy const& get_policy() const { return m_policy; }
inline __device__ void exec_range(reference_type update) const { inline __device__ void exec_range(reference_type update) const {
Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType,
typename Policy::work_tag, typename Policy::work_tag,
@ -1390,6 +1416,7 @@ class ParallelReduce<FunctorType, Kokkos::MDRangePolicy<Traits...>, ReducerType,
// Required grid.x <= block.y // Required grid.x <= block.y
const dim3 grid(std::min(int(block.y), int(nwork)), 1, 1); const dim3 grid(std::min(int(block.y), int(nwork)), 1, 1);
// TODO @graph We need to effectively insert this in to the graph
const int shmem = const int shmem =
UseShflReduction UseShflReduction
? 0 ? 0
@ -1403,7 +1430,7 @@ class ParallelReduce<FunctorType, Kokkos::MDRangePolicy<Traits...>, ReducerType,
false); // copy to device and execute false); // copy to device and execute
if (!m_result_ptr_device_accessible) { if (!m_result_ptr_device_accessible) {
Cuda().fence(); m_policy.space().fence();
if (m_result_ptr) { if (m_result_ptr) {
if (m_unified_space) { if (m_unified_space) {
@ -1421,6 +1448,7 @@ class ParallelReduce<FunctorType, Kokkos::MDRangePolicy<Traits...>, ReducerType,
} }
} else { } else {
if (m_result_ptr) { if (m_result_ptr) {
// TODO @graph We need to effectively insert this in to the graph
ValueInit::init(ReducerConditional::select(m_functor, m_reducer), ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr); m_result_ptr);
} }
@ -1464,7 +1492,7 @@ template <class FunctorType, class ReducerType, class... Properties>
class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>, class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
ReducerType, Kokkos::Cuda> { ReducerType, Kokkos::Cuda> {
public: public:
using Policy = TeamPolicyInternal<Kokkos::Cuda, Properties...>; using Policy = TeamPolicy<Properties...>;
private: private:
using Member = typename Policy::member_type; using Member = typename Policy::member_type;
@ -1491,8 +1519,11 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
public: public:
using functor_type = FunctorType; using functor_type = FunctorType;
using size_type = Cuda::size_type; using size_type = Cuda::size_type;
using reducer_type = ReducerType;
enum { UseShflReduction = (true && (ValueTraits::StaticValueSize != 0)) }; enum : bool {
UseShflReduction = (true && (ValueTraits::StaticValueSize != 0))
};
private: private:
using DummyShflReductionType = double; using DummyShflReductionType = double;
@ -1539,6 +1570,8 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
} }
public: public:
Policy const& get_policy() const { return m_policy; }
__device__ inline void operator()() const { __device__ inline void operator()() const {
int64_t threadid = 0; int64_t threadid = 0;
if (m_scratch_size[1] > 0) { if (m_scratch_size[1] > 0) {
@ -1631,7 +1664,10 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) { for (unsigned i = threadIdx.y; i < word_count.value; i += blockDim.y) {
global[i] = shared[i]; global[i] = shared[i];
} }
} else if (cuda_single_inter_block_reduce_scan<false, FunctorType, WorkTag>( }
if (m_league_size != 0) {
if (cuda_single_inter_block_reduce_scan<false, FunctorType, WorkTag>(
ReducerConditional::select(m_functor, m_reducer), blockIdx.x, ReducerConditional::select(m_functor, m_reducer), blockIdx.x,
gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(), gridDim.x, kokkos_impl_cuda_shared_memory<size_type>(),
m_scratch_space, m_scratch_flags)) { m_scratch_space, m_scratch_flags)) {
@ -1659,6 +1695,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
} }
} }
} }
}
__device__ inline void run(const DummyShflReductionType&, __device__ inline void run(const DummyShflReductionType&,
const int& threadid) const { const int& threadid) const {
@ -1717,6 +1754,9 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value || const bool need_device_set = ReduceFunctorHasInit<FunctorType>::value ||
ReduceFunctorHasFinal<FunctorType>::value || ReduceFunctorHasFinal<FunctorType>::value ||
!m_result_ptr_host_accessible || !m_result_ptr_host_accessible ||
#ifdef KOKKOS_CUDA_ENABLE_GRAPHS
Policy::is_graph_kernel::value ||
#endif
!std::is_same<ReducerType, InvalidType>::value; !std::is_same<ReducerType, InvalidType>::value;
if ((nwork > 0) || need_device_set) { if ((nwork > 0) || need_device_set) {
const int block_count = const int block_count =
@ -1770,6 +1810,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
} }
} else { } else {
if (m_result_ptr) { if (m_result_ptr) {
// TODO @graph We need to effectively insert this in to the graph
ValueInit::init(ReducerConditional::select(m_functor, m_reducer), ValueInit::init(ReducerConditional::select(m_functor, m_reducer),
m_result_ptr); m_result_ptr);
} }
@ -1800,7 +1841,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_scratch_ptr{nullptr, nullptr}, m_scratch_ptr{nullptr, nullptr},
m_league_size(arg_policy.league_size()), m_league_size(arg_policy.league_size()),
m_team_size(arg_policy.team_size()), m_team_size(arg_policy.team_size()),
m_vector_size(arg_policy.vector_length()) { m_vector_size(arg_policy.impl_vector_length()) {
cudaFuncAttributes attr = cudaFuncAttributes attr =
CudaParallelLaunch<ParallelReduce, CudaParallelLaunch<ParallelReduce,
LaunchBounds>::get_cuda_func_attributes(); LaunchBounds>::get_cuda_func_attributes();
@ -1838,7 +1879,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
// The global parallel_reduce does not support vector_length other than 1 at // The global parallel_reduce does not support vector_length other than 1 at
// the moment // the moment
if ((arg_policy.vector_length() > 1) && !UseShflReduction) if ((arg_policy.impl_vector_length() > 1) && !UseShflReduction)
Impl::throw_runtime_exception( Impl::throw_runtime_exception(
"Kokkos::parallel_reduce with a TeamPolicy using a vector length of " "Kokkos::parallel_reduce with a TeamPolicy using a vector length of "
"greater than 1 is not currently supported for CUDA for dynamic " "greater than 1 is not currently supported for CUDA for dynamic "
@ -1899,7 +1940,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
m_scratch_ptr{nullptr, nullptr}, m_scratch_ptr{nullptr, nullptr},
m_league_size(arg_policy.league_size()), m_league_size(arg_policy.league_size()),
m_team_size(arg_policy.team_size()), m_team_size(arg_policy.team_size()),
m_vector_size(arg_policy.vector_length()) { m_vector_size(arg_policy.impl_vector_length()) {
cudaFuncAttributes attr = cudaFuncAttributes attr =
CudaParallelLaunch<ParallelReduce, CudaParallelLaunch<ParallelReduce,
LaunchBounds>::get_cuda_func_attributes(); LaunchBounds>::get_cuda_func_attributes();
@ -1936,7 +1977,7 @@ class ParallelReduce<FunctorType, Kokkos::TeamPolicy<Properties...>,
// The global parallel_reduce does not support vector_length other than 1 at // The global parallel_reduce does not support vector_length other than 1 at
// the moment // the moment
if ((arg_policy.vector_length() > 1) && !UseShflReduction) if ((arg_policy.impl_vector_length() > 1) && !UseShflReduction)
Impl::throw_runtime_exception( Impl::throw_runtime_exception(
"Kokkos::parallel_reduce with a TeamPolicy using a vector length of " "Kokkos::parallel_reduce with a TeamPolicy using a vector length of "
"greater than 1 is not currently supported for CUDA for dynamic " "greater than 1 is not currently supported for CUDA for dynamic "
@ -2150,6 +2191,8 @@ class ParallelScan<FunctorType, Kokkos::RangePolicy<Traits...>, Kokkos::Cuda> {
} }
public: public:
Policy const& get_policy() const { return m_policy; }
//---------------------------------------- //----------------------------------------
__device__ inline void operator()(void) const { __device__ inline void operator()(void) const {
@ -2440,6 +2483,8 @@ class ParallelScanWithTotal<FunctorType, Kokkos::RangePolicy<Traits...>,
} }
public: public:
Policy const& get_policy() const { return m_policy; }
//---------------------------------------- //----------------------------------------
__device__ inline void operator()(void) const { __device__ inline void operator()(void) const {
@ -2799,5 +2844,5 @@ struct ParallelReduceFunctorType<FunctorTypeIn, ExecPolicy, ValueType, Cuda> {
} // namespace Kokkos } // namespace Kokkos
#endif /* defined( __CUDACC__ ) */ #endif /* defined(KOKKOS_ENABLE_CUDA) */
#endif /* #ifndef KOKKOS_CUDA_PARALLEL_HPP */ #endif /* #ifndef KOKKOS_CUDA_PARALLEL_HPP */

View File

@ -46,7 +46,7 @@
#define KOKKOS_CUDA_REDUCESCAN_HPP #define KOKKOS_CUDA_REDUCESCAN_HPP
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if defined(__CUDACC__) && defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA)
#include <utility> #include <utility>
@ -983,5 +983,5 @@ inline unsigned cuda_single_inter_block_reduce_scan_shmem(
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
#endif /* #if defined( __CUDACC__ ) */ #endif /* #if defined(KOKKOS_ENABLE_CUDA) */
#endif /* KOKKOS_CUDA_REDUCESCAN_HPP */ #endif /* KOKKOS_CUDA_REDUCESCAN_HPP */

View File

@ -390,7 +390,7 @@ class TaskQueueSpecializationConstrained<
((int*)&task_ptr)[0] = KOKKOS_IMPL_CUDA_SHFL(((int*)&task_ptr)[0], 0, 32); ((int*)&task_ptr)[0] = KOKKOS_IMPL_CUDA_SHFL(((int*)&task_ptr)[0], 0, 32);
((int*)&task_ptr)[1] = KOKKOS_IMPL_CUDA_SHFL(((int*)&task_ptr)[1], 0, 32); ((int*)&task_ptr)[1] = KOKKOS_IMPL_CUDA_SHFL(((int*)&task_ptr)[1], 0, 32);
#if defined(KOKKOS_DEBUG) #if defined(KOKKOS_ENABLE_DEBUG)
KOKKOS_IMPL_CUDA_SYNCWARP_OR_RETURN("TaskQueue CUDA task_ptr"); KOKKOS_IMPL_CUDA_SYNCWARP_OR_RETURN("TaskQueue CUDA task_ptr");
#endif #endif
@ -799,7 +799,6 @@ namespace Kokkos {
* i=0..N-1. * i=0..N-1.
* *
* The range i=0..N-1 is mapped to all threads of the calling thread team. * The range i=0..N-1 is mapped to all threads of the calling thread team.
* This functionality requires C++11 support.
*/ */
template <typename iType, class Lambda, class Scheduler> template <typename iType, class Lambda, class Scheduler>
KOKKOS_INLINE_FUNCTION void parallel_for( KOKKOS_INLINE_FUNCTION void parallel_for(
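The overload documented above maps the inner range across the threads of the calling team (here in its task-scheduler form). As a hedged sketch only, the same nested pattern with an ordinary `Kokkos::TeamPolicy` member is shown below; the view `d` and its extents are illustrative assumptions, not part of this diff.

````c++
#include <Kokkos_Core.hpp>

// Hypothetical helper: each team fills one row of `d`; the inner
// parallel_for maps the columns across the calling team's threads.
void fill_rows(Kokkos::View<int**> d) {
  using policy_t  = Kokkos::TeamPolicy<>;
  const int nrows = d.extent(0);
  const int ncols = d.extent(1);
  Kokkos::parallel_for(
      "fill_rows", policy_t(nrows, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy_t::member_type& team) {
        const int row = team.league_rank();
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, ncols),
                             [=](const int col) { d(row, col) = row + col; });
      });
}
````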

View File

@ -50,7 +50,7 @@
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
/* only compile this file if CUDA is enabled for Kokkos */ /* only compile this file if CUDA is enabled for Kokkos */
#if defined(__CUDACC__) && defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA)
#include <utility> #include <utility>
#include <Kokkos_Parallel.hpp> #include <Kokkos_Parallel.hpp>
@ -290,7 +290,7 @@ class CudaTeamMember {
*/ */
template <typename Type> template <typename Type>
KOKKOS_INLINE_FUNCTION Type team_scan(const Type& value) const { KOKKOS_INLINE_FUNCTION Type team_scan(const Type& value) const {
return this->template team_scan<Type>(value, 0); return this->template team_scan<Type>(value, nullptr);
} }
//---------------------------------------- //----------------------------------------
@ -935,6 +935,54 @@ KOKKOS_INLINE_FUNCTION
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
/** \brief Inter-thread parallel exclusive prefix sum.
*
* Executes closure(iType i, ValueType & val, bool final) for each i=[0..N)
*
* The range [0..N) is mapped to each rank in the team (whose global rank is
* less than N) and a scan operation is performed. The last call to closure has
* final == true.
*/
// This is the same code as in HIP and largely the same as in OpenMPTarget
template <typename iType, typename FunctorType>
KOKKOS_INLINE_FUNCTION void parallel_scan(
const Impl::TeamThreadRangeBoundariesStruct<iType, Impl::CudaTeamMember>&
loop_bounds,
const FunctorType& lambda) {
// Extract value_type from lambda
using value_type = typename Kokkos::Impl::FunctorAnalysis<
Kokkos::Impl::FunctorPatternInterface::SCAN, void,
FunctorType>::value_type;
const auto start = loop_bounds.start;
const auto end = loop_bounds.end;
auto& member = loop_bounds.member;
const auto team_size = member.team_size();
const auto team_rank = member.team_rank();
const auto nchunk = (end - start + team_size - 1) / team_size;
value_type accum = 0;
// each team has to process one or more chunks of the prefix scan
for (iType i = 0; i < nchunk; ++i) {
auto ii = start + i * team_size + team_rank;
// local accumulation for this chunk
value_type local_accum = 0;
// user updates value with prefix value
if (ii < loop_bounds.end) lambda(ii, local_accum, false);
// perform team scan
local_accum = member.team_scan(local_accum);
// add this block's accum to the total accumulation
auto val = accum + local_accum;
// user updates their data with total accumulation
if (ii < loop_bounds.end) lambda(ii, val, true);
// the last value needs to be propagated to the next chunk
if (team_rank == team_size - 1) accum = val;
// broadcast the last value to the rest of the team
member.team_broadcast(accum, team_size - 1);
}
}
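A hedged usage sketch of the TeamThreadRange `parallel_scan` added above: an exclusive prefix sum over each team's row of a view. The helper name and the two-dimensional views are illustrative assumptions, not part of this diff.

````c++
#include <Kokkos_Core.hpp>

// Hypothetical helper: exclusive scan of each row of `data` into `scanned`,
// using the TeamThreadRange parallel_scan overload added above.
void row_exclusive_scan(Kokkos::View<int**, Kokkos::CudaSpace> data,
                        Kokkos::View<int**, Kokkos::CudaSpace> scanned) {
  using team_policy = Kokkos::TeamPolicy<Kokkos::Cuda>;
  const int nteams  = data.extent(0);
  const int n       = data.extent(1);
  Kokkos::parallel_for(
      "row_exclusive_scan", team_policy(nteams, Kokkos::AUTO),
      KOKKOS_LAMBDA(const team_policy::member_type& team) {
        const int row = team.league_rank();
        Kokkos::parallel_scan(
            Kokkos::TeamThreadRange(team, n),
            [=](const int i, int& partial, const bool final) {
              if (final) scanned(row, i) = partial;  // write exclusive prefix
              partial += data(row, i);               // then add own value
            });
      });
}
````

Writing the partial value before accumulating in the `final` pass is what makes the result exclusive rather than inclusive.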
//----------------------------------------------------------------------------
/** \brief Intra-thread vector parallel exclusive prefix sum. /** \brief Intra-thread vector parallel exclusive prefix sum.
* *
* Executes closure(iType i, ValueType & val, bool final) for each i=[0..N) * Executes closure(iType i, ValueType & val, bool final) for each i=[0..N)
@ -1089,6 +1137,6 @@ KOKKOS_INLINE_FUNCTION void single(
} // namespace Kokkos } // namespace Kokkos
#endif /* defined( __CUDACC__ ) */ #endif /* defined(KOKKOS_ENABLE_CUDA) */
#endif /* #ifndef KOKKOS_CUDA_TEAM_HPP */ #endif /* #ifndef KOKKOS_CUDA_TEAM_HPP */

View File

@ -77,6 +77,8 @@ class ParallelFor<FunctorType, Kokkos::WorkGraphPolicy<Traits...>,
} }
public: public:
Policy const& get_policy() const { return m_policy; }
__device__ inline void operator()() const noexcept { __device__ inline void operator()() const noexcept {
if (0 == (threadIdx.y % 16)) { if (0 == (threadIdx.y % 16)) {
// Spin until COMPLETED_TOKEN. // Spin until COMPLETED_TOKEN.

View File

@ -48,7 +48,7 @@
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
#include <Kokkos_Macros.hpp> #include <Kokkos_Macros.hpp>
#if defined(__CUDACC__) && defined(KOKKOS_ENABLE_CUDA) #if defined(KOKKOS_ENABLE_CUDA)
#include <cuda.h> #include <cuda.h>
@ -97,5 +97,5 @@ __device__ inline void cuda_abort(const char *const message) {
} // namespace Kokkos } // namespace Kokkos
#else #else
void KOKKOS_CORE_SRC_CUDA_ABORT_PREVENT_LINK_ERROR() {} void KOKKOS_CORE_SRC_CUDA_ABORT_PREVENT_LINK_ERROR() {}
#endif /* #if defined(__CUDACC__) && defined( KOKKOS_ENABLE_CUDA ) */ #endif /* #if defined( KOKKOS_ENABLE_CUDA ) */
#endif /* #ifndef KOKKOS_CUDA_ABORT_HPP */ #endif /* #ifndef KOKKOS_CUDA_ABORT_HPP */

View File

@ -45,6 +45,10 @@
#ifndef KOKKOS_HIP_ATOMIC_HPP #ifndef KOKKOS_HIP_ATOMIC_HPP
#define KOKKOS_HIP_ATOMIC_HPP #define KOKKOS_HIP_ATOMIC_HPP
#include <impl/Kokkos_Atomic_Memory_Order.hpp>
#include <impl/Kokkos_Memory_Fence.hpp>
#include <HIP/Kokkos_HIP_Locks.hpp>
#if defined(KOKKOS_ENABLE_HIP_ATOMICS) #if defined(KOKKOS_ENABLE_HIP_ATOMICS)
namespace Kokkos { namespace Kokkos {
// HIP can do: // HIP can do:
@ -103,19 +107,16 @@ atomic_exchange(volatile T *const dest,
typename std::enable_if<sizeof(T) != sizeof(int) && typename std::enable_if<sizeof(T) != sizeof(int) &&
sizeof(T) != sizeof(long long), sizeof(T) != sizeof(long long),
const T>::type &val) { const T>::type &val) {
// FIXME_HIP
Kokkos::abort("atomic_exchange not implemented for large types.\n");
T return_val; T return_val;
int done = 0; int done = 0;
unsigned int active = __ballot(1); unsigned int active = __ballot(1);
unsigned int done_active = 0; unsigned int done_active = 0;
while (active != done_active) { while (active != done_active) {
if (!done) { if (!done) {
// if (Impl::lock_address_hip_space((void*)dest)) if (Impl::lock_address_hip_space((void *)dest)) {
{
return_val = *dest; return_val = *dest;
*dest = val; *dest = val;
// Impl::unlock_address_hip_space((void*)dest); Impl::unlock_address_hip_space((void *)dest);
done = 1; done = 1;
} }
} }
@ -215,19 +216,16 @@ __inline__ __device__ T atomic_compare_exchange(
typename std::enable_if<sizeof(T) != sizeof(int) && typename std::enable_if<sizeof(T) != sizeof(int) &&
sizeof(T) != sizeof(long long), sizeof(T) != sizeof(long long),
const T>::type &val) { const T>::type &val) {
// FIXME_HIP
Kokkos::abort("atomic_compare_exchange not implemented for large types.\n");
T return_val; T return_val;
int done = 0; int done = 0;
unsigned int active = __ballot(1); unsigned int active = __ballot(1);
unsigned int done_active = 0; unsigned int done_active = 0;
while (active != done_active) { while (active != done_active) {
if (!done) { if (!done) {
// if (Impl::lock_address_hip_space((void*)dest)) if (Impl::lock_address_hip_space((void *)dest)) {
{
return_val = *dest; return_val = *dest;
if (return_val == compare) *dest = val; if (return_val == compare) *dest = val;
// Impl::unlock_address_hip_space((void*)dest); Impl::unlock_address_hip_space((void *)dest);
done = 1; done = 1;
} }
} }
@ -350,19 +348,16 @@ atomic_fetch_add(volatile T *dest,
typename std::enable_if<sizeof(T) != sizeof(int) && typename std::enable_if<sizeof(T) != sizeof(int) &&
sizeof(T) != sizeof(long long), sizeof(T) != sizeof(long long),
const T &>::type val) { const T &>::type val) {
// FIXME_HIP
Kokkos::abort("atomic_fetch_add not implemented for large types.\n");
T return_val; T return_val;
int done = 0; int done = 0;
unsigned int active = __ballot(1); unsigned int active = __ballot(1);
unsigned int done_active = 0; unsigned int done_active = 0;
while (active != done_active) { while (active != done_active) {
if (!done) { if (!done) {
// if(Kokkos::Impl::lock_address_hip_space((void *)dest)) if (Kokkos::Impl::lock_address_hip_space((void *)dest)) {
{
return_val = *dest; return_val = *dest;
*dest = return_val + val; *dest = return_val + val;
// Kokkos::Impl::unlock_address_hip_space((void *)dest); Kokkos::Impl::unlock_address_hip_space((void *)dest);
done = 1; done = 1;
} }
} }
@ -513,19 +508,16 @@ atomic_fetch_sub(volatile T *const dest,
typename std::enable_if<sizeof(T) != sizeof(int) && typename std::enable_if<sizeof(T) != sizeof(int) &&
sizeof(T) != sizeof(long long), sizeof(T) != sizeof(long long),
const T>::type &val) { const T>::type &val) {
// FIXME_HIP
Kokkos::abort("atomic_fetch_sub not implemented for large types.\n");
T return_val; T return_val;
int done = 0; int done = 0;
unsigned int active = __ballot(1); unsigned int active = __ballot(1);
unsigned int done_active = 0; unsigned int done_active = 0;
while (active != done_active) { while (active != done_active) {
if (!done) { if (!done) {
/*if (Impl::lock_address_hip_space((void*)dest)) */ if (Impl::lock_address_hip_space((void *)dest)) {
{
return_val = *dest; return_val = *dest;
*dest = return_val - val; *dest = return_val - val;
// Impl::unlock_address_hip_space((void*)dest); Impl::unlock_address_hip_space((void *)dest);
done = 1; done = 1;
} }
} }
@ -569,6 +561,62 @@ __inline__ __device__ unsigned long long int atomic_fetch_and(
unsigned long long int const val) { unsigned long long int const val) {
return atomicAnd(const_cast<unsigned long long int *>(dest), val); return atomicAnd(const_cast<unsigned long long int *>(dest), val);
} }
namespace Impl {
template <typename T>
__inline__ __device__ void _atomic_store(T *ptr, T val,
memory_order_relaxed_t) {
(void)atomic_exchange(ptr, val);
}
template <typename T>
__inline__ __device__ void _atomic_store(T *ptr, T val,
memory_order_seq_cst_t) {
memory_fence();
atomic_store(ptr, val, memory_order_relaxed);
memory_fence();
}
template <typename T>
__inline__ __device__ void _atomic_store(T *ptr, T val,
memory_order_release_t) {
memory_fence();
atomic_store(ptr, val, memory_order_relaxed);
}
template <typename T>
__inline__ __device__ void _atomic_store(T *ptr, T val) {
atomic_store(ptr, val, memory_order_relaxed);
}
template <typename T>
__inline__ __device__ T _atomic_load(T *ptr, memory_order_relaxed_t) {
T dummy{};
return atomic_compare_exchange(ptr, dummy, dummy);
}
template <typename T>
__inline__ __device__ T _atomic_load(T *ptr, memory_order_seq_cst_t) {
memory_fence();
T rv = atomic_load(ptr, memory_order_relaxed);
memory_fence();
return rv;
}
template <typename T>
__inline__ __device__ T _atomic_load(T *ptr, memory_order_acquire_t) {
T rv = atomic_load(ptr, memory_order_relaxed);
memory_fence();
return rv;
}
template <typename T>
__inline__ __device__ T _atomic_load(T *ptr) {
return atomic_load(ptr, memory_order_relaxed);
}
} // namespace Impl
} // namespace Kokkos } // namespace Kokkos
#endif #endif

View File

@ -55,6 +55,26 @@
namespace Kokkos { namespace Kokkos {
namespace Experimental { namespace Experimental {
namespace Impl { namespace Impl {
template <typename DriverType, bool, int MaxThreadsPerBlock, int MinBlocksPerSM>
void hipOccupancy(int *numBlocks, int blockSize, int sharedmem) {
// FIXME_HIP - currently the "constant" path is unimplemented.
// we should look at whether it's functional, and
// perform some simple scaling studies to see when /
// if the constant launcher outperforms the current
// pass by pointer shared launcher
HIP_SAFE_CALL(hipOccupancyMaxActiveBlocksPerMultiprocessor(
numBlocks,
hip_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>,
blockSize, sharedmem));
}
template <typename DriverType, bool constant>
void hipOccupancy(int *numBlocks, int blockSize, int sharedmem) {
hipOccupancy<DriverType, constant, HIPTraits::MaxThreadsPerBlock, 1>(
numBlocks, blockSize, sharedmem);
}
template <typename DriverType, typename LaunchBounds, bool Large> template <typename DriverType, typename LaunchBounds, bool Large>
struct HIPGetMaxBlockSize; struct HIPGetMaxBlockSize;
@ -78,31 +98,26 @@ int hip_internal_get_block_size(const F &condition_check,
const int min_blocks_per_sm = const int min_blocks_per_sm =
LaunchBounds::minBperSM == 0 ? 1 : LaunchBounds::minBperSM; LaunchBounds::minBperSM == 0 ? 1 : LaunchBounds::minBperSM;
const int max_threads_per_block = LaunchBounds::maxTperB == 0 const int max_threads_per_block = LaunchBounds::maxTperB == 0
? hip_instance->m_maxThreadsPerBlock ? HIPTraits::MaxThreadsPerBlock
: LaunchBounds::maxTperB; : LaunchBounds::maxTperB;
const int regs_per_wavefront = attr.numRegs; const int regs_per_wavefront = std::max(attr.numRegs, 1);
const int regs_per_sm = hip_instance->m_regsPerSM; const int regs_per_sm = hip_instance->m_regsPerSM;
const int shmem_per_sm = hip_instance->m_shmemPerSM; const int shmem_per_sm = hip_instance->m_shmemPerSM;
const int max_shmem_per_block = hip_instance->m_maxShmemPerBlock; const int max_shmem_per_block = hip_instance->m_maxShmemPerBlock;
const int max_blocks_per_sm = hip_instance->m_maxBlocksPerSM; const int max_blocks_per_sm = hip_instance->m_maxBlocksPerSM;
const int max_threads_per_sm = hip_instance->m_maxThreadsPerSM; const int max_threads_per_sm = hip_instance->m_maxThreadsPerSM;
// FIXME_HIP this is broken in 3.5, but should be in 3.6
#if (HIP_VERSION_MAJOR > 3 || HIP_VERSION_MINOR > 5 || \
HIP_VERSION_PATCH >= 20226)
int block_size = std::min(attr.maxThreadsPerBlock, max_threads_per_block);
#else
int block_size = max_threads_per_block; int block_size = max_threads_per_block;
#endif
KOKKOS_ASSERT(block_size > 0); KOKKOS_ASSERT(block_size > 0);
const int blocks_per_warp =
(block_size + HIPTraits::WarpSize - 1) / HIPTraits::WarpSize;
int functor_shmem = ::Kokkos::Impl::FunctorTeamShmemSize<FunctorType>::value( int functor_shmem = ::Kokkos::Impl::FunctorTeamShmemSize<FunctorType>::value(
f, block_size / vector_length); f, block_size / vector_length);
int total_shmem = shmem_block + shmem_thread * (block_size / vector_length) + int total_shmem = shmem_block + shmem_thread * (block_size / vector_length) +
functor_shmem + attr.sharedSizeBytes; functor_shmem + attr.sharedSizeBytes;
int max_blocks_regs = int max_blocks_regs = regs_per_sm / (regs_per_wavefront * blocks_per_warp);
regs_per_sm / (regs_per_wavefront * (block_size / HIPTraits::WarpSize));
int max_blocks_shmem = int max_blocks_shmem =
(total_shmem < max_shmem_per_block) (total_shmem < max_shmem_per_block)
? (total_shmem > 0 ? shmem_per_sm / total_shmem : max_blocks_regs) ? (total_shmem > 0 ? shmem_per_sm / total_shmem : max_blocks_regs)
@ -113,7 +128,8 @@ int hip_internal_get_block_size(const F &condition_check,
blocks_per_sm = max_threads_per_sm / block_size; blocks_per_sm = max_threads_per_sm / block_size;
threads_per_sm = blocks_per_sm * block_size; threads_per_sm = blocks_per_sm * block_size;
} }
int opt_block_size = (blocks_per_sm >= min_blocks_per_sm) ? block_size : 0; int opt_block_size =
(blocks_per_sm >= min_blocks_per_sm) ? block_size : min_blocks_per_sm;
int opt_threads_per_sm = threads_per_sm; int opt_threads_per_sm = threads_per_sm;
// printf("BlockSizeMax: %i Shmem: %i %i %i %i Regs: %i %i Blocks: %i %i // printf("BlockSizeMax: %i Shmem: %i %i %i %i Regs: %i %i Blocks: %i %i
// Achieved: %i %i Opt: %i %i\n",block_size, // Achieved: %i %i Opt: %i %i\n",block_size,
@ -126,8 +142,7 @@ int hip_internal_get_block_size(const F &condition_check,
f, block_size / vector_length); f, block_size / vector_length);
total_shmem = shmem_block + shmem_thread * (block_size / vector_length) + total_shmem = shmem_block + shmem_thread * (block_size / vector_length) +
functor_shmem + attr.sharedSizeBytes; functor_shmem + attr.sharedSizeBytes;
max_blocks_regs = max_blocks_regs = regs_per_sm / (regs_per_wavefront * blocks_per_warp);
regs_per_sm / (regs_per_wavefront * (block_size / HIPTraits::WarpSize));
max_blocks_shmem = max_blocks_shmem =
(total_shmem < max_shmem_per_block) (total_shmem < max_shmem_per_block)
? (total_shmem > 0 ? shmem_per_sm / total_shmem : max_blocks_regs) ? (total_shmem > 0 ? shmem_per_sm / total_shmem : max_blocks_regs)
@ -163,28 +178,21 @@ int hip_get_max_block_size(const HIPInternal *hip_instance,
[](int x) { return x == 0; }, hip_instance, attr, f, vector_length, [](int x) { return x == 0; }, hip_instance, attr, f, vector_length,
shmem_block, shmem_thread); shmem_block, shmem_thread);
} }
template <typename DriverType> template <typename DriverType, class LaunchBounds>
struct HIPGetMaxBlockSize<DriverType, Kokkos::LaunchBounds<>, true> { struct HIPGetMaxBlockSize<DriverType, LaunchBounds, true> {
static int get_block_size(typename DriverType::functor_type const &f, static int get_block_size(typename DriverType::functor_type const &f,
size_t const vector_length, size_t const vector_length,
size_t const shmem_extra_block, size_t const shmem_extra_block,
size_t const shmem_extra_thread) { size_t const shmem_extra_thread) {
// FIXME_HIP -- remove this once the API change becomes mature int numBlocks = 0;
#if !defined(__HIP__) int blockSize = LaunchBounds::maxTperB == 0 ? 1024 : LaunchBounds::maxTperB;
using blocktype = unsigned int;
#else
using blocktype = int;
#endif
blocktype numBlocks = 0;
int blockSize = 1024;
int sharedmem = int sharedmem =
shmem_extra_block + shmem_extra_thread * (blockSize / vector_length) + shmem_extra_block + shmem_extra_thread * (blockSize / vector_length) +
::Kokkos::Impl::FunctorTeamShmemSize< ::Kokkos::Impl::FunctorTeamShmemSize<
typename DriverType::functor_type>::value(f, blockSize / typename DriverType::functor_type>::value(f, blockSize /
vector_length); vector_length);
hipOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks, hip_parallel_launch_constant_memory<DriverType>, blockSize, hipOccupancy<DriverType, true>(&numBlocks, blockSize, sharedmem);
sharedmem);
if (numBlocks > 0) return blockSize; if (numBlocks > 0) return blockSize;
while (blockSize > HIPTraits::WarpSize && numBlocks == 0) { while (blockSize > HIPTraits::WarpSize && numBlocks == 0) {
@ -195,9 +203,7 @@ struct HIPGetMaxBlockSize<DriverType, Kokkos::LaunchBounds<>, true> {
typename DriverType::functor_type>::value(f, blockSize / typename DriverType::functor_type>::value(f, blockSize /
vector_length); vector_length);
hipOccupancyMaxActiveBlocksPerMultiprocessor( hipOccupancy<DriverType, true>(&numBlocks, blockSize, sharedmem);
&numBlocks, hip_parallel_launch_constant_memory<DriverType>,
blockSize, sharedmem);
} }
int blockSizeUpperBound = blockSize * 2; int blockSizeUpperBound = blockSize * 2;
while (blockSize < blockSizeUpperBound && numBlocks > 0) { while (blockSize < blockSizeUpperBound && numBlocks > 0) {
@ -208,9 +214,7 @@ struct HIPGetMaxBlockSize<DriverType, Kokkos::LaunchBounds<>, true> {
typename DriverType::functor_type>::value(f, blockSize / typename DriverType::functor_type>::value(f, blockSize /
vector_length); vector_length);
hipOccupancyMaxActiveBlocksPerMultiprocessor( hipOccupancy<DriverType, true>(&numBlocks, blockSize, sharedmem);
&numBlocks, hip_parallel_launch_constant_memory<DriverType>,
blockSize, sharedmem);
} }
return blockSize - HIPTraits::WarpSize; return blockSize - HIPTraits::WarpSize;
} }
@ -255,7 +259,7 @@ struct HIPGetOptBlockSize<DriverType, Kokkos::LaunchBounds<0, 0>, true> {
int maxOccupancy = 0; int maxOccupancy = 0;
int bestBlockSize = 0; int bestBlockSize = 0;
while (blockSize < 1024) { while (blockSize < HIPTraits::MaxThreadsPerBlock) {
blockSize *= 2; blockSize *= 2;
// calculate the occupancy with that optBlockSize and check whether its // calculate the occupancy with that optBlockSize and check whether its
@ -265,9 +269,7 @@ struct HIPGetOptBlockSize<DriverType, Kokkos::LaunchBounds<0, 0>, true> {
::Kokkos::Impl::FunctorTeamShmemSize< ::Kokkos::Impl::FunctorTeamShmemSize<
typename DriverType::functor_type>::value(f, blockSize / typename DriverType::functor_type>::value(f, blockSize /
vector_length); vector_length);
hipOccupancyMaxActiveBlocksPerMultiprocessor( hipOccupancy<DriverType, true>(&numBlocks, blockSize, sharedmem);
&numBlocks, hip_parallel_launch_constant_memory<DriverType>,
blockSize, sharedmem);
if (maxOccupancy < numBlocks * blockSize) { if (maxOccupancy < numBlocks * blockSize) {
maxOccupancy = numBlocks * blockSize; maxOccupancy = numBlocks * blockSize;
bestBlockSize = blockSize; bestBlockSize = blockSize;
@ -289,7 +291,7 @@ struct HIPGetOptBlockSize<DriverType, Kokkos::LaunchBounds<0, 0>, false> {
int maxOccupancy = 0; int maxOccupancy = 0;
int bestBlockSize = 0; int bestBlockSize = 0;
while (blockSize < 1024) { while (blockSize < HIPTraits::MaxThreadsPerBlock) {
blockSize *= 2; blockSize *= 2;
sharedmem = sharedmem =
shmem_extra_block + shmem_extra_thread * (blockSize / vector_length) + shmem_extra_block + shmem_extra_thread * (blockSize / vector_length) +
@ -297,9 +299,7 @@ struct HIPGetOptBlockSize<DriverType, Kokkos::LaunchBounds<0, 0>, false> {
typename DriverType::functor_type>::value(f, blockSize / typename DriverType::functor_type>::value(f, blockSize /
vector_length); vector_length);
hipOccupancyMaxActiveBlocksPerMultiprocessor( hipOccupancy<DriverType, false>(&numBlocks, blockSize, sharedmem);
&numBlocks, hip_parallel_launch_local_memory<DriverType>, blockSize,
sharedmem);
if (maxOccupancy < numBlocks * blockSize) { if (maxOccupancy < numBlocks * blockSize) {
maxOccupancy = numBlocks * blockSize; maxOccupancy = numBlocks * blockSize;
@ -340,11 +340,8 @@ struct HIPGetOptBlockSize<
::Kokkos::Impl::FunctorTeamShmemSize< ::Kokkos::Impl::FunctorTeamShmemSize<
typename DriverType::functor_type>::value(f, blockSize / typename DriverType::functor_type>::value(f, blockSize /
vector_length); vector_length);
hipOccupancyMaxActiveBlocksPerMultiprocessor( hipOccupancy<DriverType, true, MaxThreadsPerBlock, MinBlocksPerSM>(
&numBlocks, &numBlocks, blockSize, sharedmem);
hip_parallel_launch_constant_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>,
blockSize, sharedmem);
if (numBlocks >= static_cast<int>(MinBlocksPerSM) && if (numBlocks >= static_cast<int>(MinBlocksPerSM) &&
blockSize <= static_cast<int>(MaxThreadsPerBlock)) { blockSize <= static_cast<int>(MaxThreadsPerBlock)) {
if (maxOccupancy < numBlocks * blockSize) { if (maxOccupancy < numBlocks * blockSize) {
@ -384,11 +381,8 @@ struct HIPGetOptBlockSize<
typename DriverType::functor_type>::value(f, blockSize / typename DriverType::functor_type>::value(f, blockSize /
vector_length); vector_length);
hipOccupancyMaxActiveBlocksPerMultiprocessor( hipOccupancy<DriverType, false, MaxThreadsPerBlock, MinBlocksPerSM>(
&numBlocks, &numBlocks, blockSize, sharedmem);
hip_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>,
blockSize, sharedmem);
if (numBlocks >= int(MinBlocksPerSM) && if (numBlocks >= int(MinBlocksPerSM) &&
blockSize <= int(MaxThreadsPerBlock)) { blockSize <= int(MaxThreadsPerBlock)) {
if (maxOccupancy < numBlocks * blockSize) { if (maxOccupancy < numBlocks * blockSize) {

View File

@ -56,10 +56,10 @@ namespace Kokkos {
namespace Impl { namespace Impl {
void hip_internal_error_throw(hipError_t e, const char* name, void hip_internal_error_throw(hipError_t e, const char* name,
const char* file = NULL, const int line = 0); const char* file = nullptr, const int line = 0);
inline void hip_internal_safe_call(hipError_t e, const char* name, inline void hip_internal_safe_call(hipError_t e, const char* name,
const char* file = NULL, const char* file = nullptr,
const int line = 0) { const int line = 0) {
if (hipSuccess != e) { if (hipSuccess != e) {
hip_internal_error_throw(e, name, file, line); hip_internal_error_throw(e, name, file, line);

View File

@ -114,7 +114,7 @@ void HIPInternal::print_configuration(std::ostream &s) const {
<< (dev_info.m_hipProp[i].major) << "." << dev_info.m_hipProp[i].minor << (dev_info.m_hipProp[i].major) << "." << dev_info.m_hipProp[i].minor
<< ", Total Global Memory: " << ", Total Global Memory: "
<< ::Kokkos::Impl::human_memory_size(dev_info.m_hipProp[i].totalGlobalMem) << ::Kokkos::Impl::human_memory_size(dev_info.m_hipProp[i].totalGlobalMem)
<< ", Shared Memory per Wavefront: " << ", Shared Memory per Block: "
<< ::Kokkos::Impl::human_memory_size( << ::Kokkos::Impl::human_memory_size(
dev_info.m_hipProp[i].sharedMemPerBlock); dev_info.m_hipProp[i].sharedMemPerBlock);
if (m_hipDev == i) s << " : Selected"; if (m_hipDev == i) s << " : Selected";
@ -140,10 +140,10 @@ HIPInternal::~HIPInternal() {
m_maxShmemPerBlock = 0; m_maxShmemPerBlock = 0;
m_scratchSpaceCount = 0; m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0; m_scratchFlagsCount = 0;
m_scratchSpace = 0; m_scratchSpace = nullptr;
m_scratchFlags = 0; m_scratchFlags = nullptr;
m_scratchConcurrentBitset = nullptr; m_scratchConcurrentBitset = nullptr;
m_stream = 0; m_stream = nullptr;
} }
int HIPInternal::verify_is_initialized(const char *const label) const { int HIPInternal::verify_is_initialized(const char *const label) const {
@ -183,7 +183,7 @@ void HIPInternal::initialize(int hip_device_id, hipStream_t stream) {
const HIPInternalDevices &dev_info = HIPInternalDevices::singleton(); const HIPInternalDevices &dev_info = HIPInternalDevices::singleton();
const bool ok_init = 0 == m_scratchSpace || 0 == m_scratchFlags; const bool ok_init = nullptr == m_scratchSpace || nullptr == m_scratchFlags;
// Need at least a GPU device // Need at least a GPU device
const bool ok_id = const bool ok_id =
@ -195,9 +195,11 @@ void HIPInternal::initialize(int hip_device_id, hipStream_t stream) {
m_hipDev = hip_device_id; m_hipDev = hip_device_id;
m_deviceProp = hipProp; m_deviceProp = hipProp;
hipSetDevice(m_hipDev); HIP_SAFE_CALL(hipSetDevice(m_hipDev));
m_stream = stream; m_stream = stream;
m_team_scratch_current_size = 0;
m_team_scratch_ptr = nullptr;
// number of multiprocessors // number of multiprocessors
m_multiProcCount = hipProp.multiProcessorCount; m_multiProcCount = hipProp.multiProcessorCount;
@ -216,14 +218,19 @@ void HIPInternal::initialize(int hip_device_id, hipStream_t stream) {
m_maxBlock = hipProp.maxGridSize[0]; m_maxBlock = hipProp.maxGridSize[0];
// theoretically, we can get 40 WFs / CU, but can only sustain 32 // theoretically, we can get 40 WFs / CU, but can only sustain 32
// see
// https://github.com/ROCm-Developer-Tools/HIP/blob/a0b5dfd625d99af7e288629747b40dd057183173/vdi/hip_platform.cpp#L742
m_maxBlocksPerSM = 32; m_maxBlocksPerSM = 32;
// FIXME_HIP - Nick to implement this upstream // FIXME_HIP - Nick to implement this upstream
m_regsPerSM = 262144 / 32; // Register count comes from Sec. 2.2. "Data Sharing" of the
// Vega 7nm ISA document (see the diagram)
// https://developer.amd.com/wp-content/resources/Vega_7nm_Shader_ISA.pdf
// VGPRS = 4 (SIMD/CU) * 256 VGPR/SIMD * 64 registers / VGPR =
// 65536 VGPR/CU
m_regsPerSM = 65536;
m_shmemPerSM = hipProp.maxSharedMemoryPerMultiProcessor; m_shmemPerSM = hipProp.maxSharedMemoryPerMultiProcessor;
m_maxShmemPerBlock = hipProp.sharedMemPerBlock; m_maxShmemPerBlock = hipProp.sharedMemPerBlock;
m_maxThreadsPerSM = m_maxBlocksPerSM * HIPTraits::WarpSize; m_maxThreadsPerSM = m_maxBlocksPerSM * HIPTraits::WarpSize;
m_maxThreadsPerBlock = hipProp.maxThreadsPerBlock;
//---------------------------------- //----------------------------------
// Multiblock reduction uses scratch flags for counters // Multiblock reduction uses scratch flags for counters
// and scratch space for partial reduction values. // and scratch space for partial reduction values.
@ -277,8 +284,7 @@ void HIPInternal::initialize(int hip_device_id, hipStream_t stream) {
} }
// Init the array used for arbitrarily sized atomics // Init the array used for arbitrarily sized atomics
// FIXME_HIP uncomment this when global variable works if (m_stream == nullptr) ::Kokkos::Impl::initialize_host_hip_lock_arrays();
// if (m_stream == 0) ::Kokkos::Impl::initialize_host_hip_lock_arrays();
} }
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
@ -327,18 +333,35 @@ Kokkos::Experimental::HIP::size_type *HIPInternal::scratch_flags(
m_scratchFlags = reinterpret_cast<size_type *>(r->data()); m_scratchFlags = reinterpret_cast<size_type *>(r->data());
hipMemset(m_scratchFlags, 0, m_scratchFlagsCount * sizeScratchGrain); HIP_SAFE_CALL(
hipMemset(m_scratchFlags, 0, m_scratchFlagsCount * sizeScratchGrain));
} }
return m_scratchFlags; return m_scratchFlags;
} }
void *HIPInternal::resize_team_scratch_space(std::int64_t bytes,
bool force_shrink) {
if (m_team_scratch_current_size == 0) {
m_team_scratch_current_size = bytes;
m_team_scratch_ptr = Kokkos::kokkos_malloc<Kokkos::Experimental::HIPSpace>(
"HIPSpace::ScratchMemory", m_team_scratch_current_size);
}
if ((bytes > m_team_scratch_current_size) ||
((bytes < m_team_scratch_current_size) && (force_shrink))) {
m_team_scratch_current_size = bytes;
m_team_scratch_ptr = Kokkos::kokkos_realloc<Kokkos::Experimental::HIPSpace>(
m_team_scratch_ptr, m_team_scratch_current_size);
}
return m_team_scratch_ptr;
}
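For context, the level-1 team scratch that this helper grows (and, on request, shrinks) is what users ask for through `TeamPolicy`. A minimal hedged sketch of that request path follows; the helper name, `nteams`, and the per-team byte count are illustrative assumptions, not part of this diff.

````c++
#include <Kokkos_Core.hpp>

// Hypothetical helper: request per-team level-1 scratch; on this backend the
// underlying allocation is managed by HIPInternal::resize_team_scratch_space().
void use_team_scratch(int nteams) {
  using policy_t          = Kokkos::TeamPolicy<Kokkos::Experimental::HIP>;
  const int scratch_bytes = 1024 * static_cast<int>(sizeof(double));

  policy_t policy(nteams, Kokkos::AUTO);
  policy.set_scratch_size(1, Kokkos::PerTeam(scratch_bytes));

  Kokkos::parallel_for(
      policy, KOKKOS_LAMBDA(const policy_t::member_type& team) {
        // carve a per-team buffer out of the level-1 scratch allocation
        double* buf = static_cast<double*>(
            team.team_scratch(1).get_shmem(scratch_bytes));
        buf[team.team_rank()] = static_cast<double>(team.league_rank());
      });
}
````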
//---------------------------------------------------------------------------- //----------------------------------------------------------------------------
void HIPInternal::finalize() { void HIPInternal::finalize() {
HIP().fence(); this->fence();
was_finalized = true; was_finalized = true;
if (0 != m_scratchSpace || 0 != m_scratchFlags) { if (nullptr != m_scratchSpace || nullptr != m_scratchFlags) {
using RecordHIP = using RecordHIP =
Kokkos::Impl::SharedAllocationRecord<Kokkos::Experimental::HIPSpace>; Kokkos::Impl::SharedAllocationRecord<Kokkos::Experimental::HIPSpace>;
@ -346,6 +369,9 @@ void HIPInternal::finalize() {
RecordHIP::decrement(RecordHIP::get_record(m_scratchSpace)); RecordHIP::decrement(RecordHIP::get_record(m_scratchSpace));
RecordHIP::decrement(RecordHIP::get_record(m_scratchConcurrentBitset)); RecordHIP::decrement(RecordHIP::get_record(m_scratchConcurrentBitset));
if (m_team_scratch_current_size > 0)
Kokkos::kokkos_free<Kokkos::Experimental::HIPSpace>(m_team_scratch_ptr);
m_hipDev = -1; m_hipDev = -1;
m_hipArch = -1; m_hipArch = -1;
m_multiProcCount = 0; m_multiProcCount = 0;
@ -355,10 +381,12 @@ void HIPInternal::finalize() {
m_maxShmemPerBlock = 0; m_maxShmemPerBlock = 0;
m_scratchSpaceCount = 0; m_scratchSpaceCount = 0;
m_scratchFlagsCount = 0; m_scratchFlagsCount = 0;
m_scratchSpace = 0; m_scratchSpace = nullptr;
m_scratchFlags = 0; m_scratchFlags = nullptr;
m_scratchConcurrentBitset = nullptr; m_scratchConcurrentBitset = nullptr;
m_stream = 0; m_stream = nullptr;
m_team_scratch_current_size = 0;
m_team_scratch_ptr = nullptr;
} }
} }

View File

@ -57,6 +57,8 @@ struct HIPTraits {
static int constexpr WarpSize = 64; static int constexpr WarpSize = 64;
static int constexpr WarpIndexMask = 0x003f; /* hexadecimal for 63 */ static int constexpr WarpIndexMask = 0x003f; /* hexadecimal for 63 */
static int constexpr WarpIndexShift = 6; /* WarpSize == 1 << WarpShift*/ static int constexpr WarpIndexShift = 6; /* WarpSize == 1 << WarpShift*/
static int constexpr MaxThreadsPerBlock =
1024; // FIXME_HIP -- assumed constant for now
static int constexpr ConstantMemoryUsage = 0x008000; /* 32k bytes */ static int constexpr ConstantMemoryUsage = 0x008000; /* 32k bytes */
static int constexpr ConstantMemoryUseThreshold = 0x000200; /* 512 bytes */ static int constexpr ConstantMemoryUseThreshold = 0x000200; /* 512 bytes */
@ -92,9 +94,11 @@ class HIPInternal {
int m_shmemPerSM; int m_shmemPerSM;
int m_maxShmemPerBlock; int m_maxShmemPerBlock;
int m_maxThreadsPerSM; int m_maxThreadsPerSM;
int m_maxThreadsPerBlock;
// Scratch Spaces for Reductions
size_type m_scratchSpaceCount; size_type m_scratchSpaceCount;
size_type m_scratchFlagsCount; size_type m_scratchFlagsCount;
size_type *m_scratchSpace; size_type *m_scratchSpace;
size_type *m_scratchFlags; size_type *m_scratchFlags;
uint32_t *m_scratchConcurrentBitset = nullptr; uint32_t *m_scratchConcurrentBitset = nullptr;
@ -103,6 +107,10 @@ class HIPInternal {
hipStream_t m_stream; hipStream_t m_stream;
// Team Scratch Level 1 Space
mutable int64_t m_team_scratch_current_size;
mutable void *m_team_scratch_ptr;
bool was_finalized = false; bool was_finalized = false;
static HIPInternal &singleton(); static HIPInternal &singleton();
@ -113,7 +121,7 @@ class HIPInternal {
return m_hipDev >= 0; return m_hipDev >= 0;
} // 0 != m_scratchSpace && 0 != m_scratchFlags ; } } // 0 != m_scratchSpace && 0 != m_scratchFlags ; }
void initialize(int hip_device_id, hipStream_t stream = 0); void initialize(int hip_device_id, hipStream_t stream = nullptr);
void finalize(); void finalize();
void print_configuration(std::ostream &) const; void print_configuration(std::ostream &) const;
@ -132,15 +140,21 @@ class HIPInternal {
m_shmemPerSM(0), m_shmemPerSM(0),
m_maxShmemPerBlock(0), m_maxShmemPerBlock(0),
m_maxThreadsPerSM(0), m_maxThreadsPerSM(0),
m_maxThreadsPerBlock(0),
m_scratchSpaceCount(0), m_scratchSpaceCount(0),
m_scratchFlagsCount(0), m_scratchFlagsCount(0),
m_scratchSpace(0), m_scratchSpace(nullptr),
m_scratchFlags(0), m_scratchFlags(nullptr),
m_stream(0) {} m_stream(nullptr),
m_team_scratch_current_size(0),
m_team_scratch_ptr(nullptr) {}
// Resizing of reduction related scratch spaces
size_type *scratch_space(const size_type size); size_type *scratch_space(const size_type size);
size_type *scratch_flags(const size_type size); size_type *scratch_flags(const size_type size);
// Resizing of team level 1 scratch
void *resize_team_scratch_space(std::int64_t bytes,
bool force_shrink = false);
}; };
} // namespace Impl } // namespace Impl

View File

@ -64,7 +64,7 @@ namespace Kokkos {
namespace Experimental { namespace Experimental {
template <typename T> template <typename T>
inline __device__ T *kokkos_impl_hip_shared_memory() { inline __device__ T *kokkos_impl_hip_shared_memory() {
extern __shared__ HIPSpace::size_type sh[]; HIP_DYNAMIC_SHARED(HIPSpace::size_type, sh);
return (T *)sh; return (T *)sh;
} }
} // namespace Experimental } // namespace Experimental
@ -74,18 +74,17 @@ namespace Kokkos {
namespace Experimental { namespace Experimental {
namespace Impl { namespace Impl {
void *hip_resize_scratch_space(std::int64_t bytes, bool force_shrink = false);
template <typename DriverType> template <typename DriverType>
__global__ static void hip_parallel_launch_constant_memory() { __global__ static void hip_parallel_launch_constant_memory() {
// cannot use global constants in HCC const DriverType &driver = *(reinterpret_cast<const DriverType *>(
#ifdef __HCC__ kokkos_impl_hip_constant_memory_buffer));
__device__ __constant__ unsigned long kokkos_impl_hip_constant_memory_buffer driver();
[Kokkos::Experimental::Impl::HIPTraits::ConstantMemoryUsage / }
sizeof(unsigned long)];
#endif
const DriverType *const driver = (reinterpret_cast<const DriverType *>( template <typename DriverType, unsigned int maxTperB, unsigned int minBperSM>
__global__ __launch_bounds__(
maxTperB, minBperSM) static void hip_parallel_launch_constant_memory() {
const DriverType &driver = *(reinterpret_cast<const DriverType *>(
kokkos_impl_hip_constant_memory_buffer)); kokkos_impl_hip_constant_memory_buffer));
driver->operator()(); driver->operator()();
@ -147,6 +146,8 @@ struct HIPParallelLaunch<
"HIPParallelLaunch FAILED: shared memory request is too large"); "HIPParallelLaunch FAILED: shared memory request is too large");
} }
KOKKOS_ENSURE_HIP_LOCK_ARRAYS_ON_DEVICE();
// FIXME_HIP -- there is currently an error copying (some) structs // FIXME_HIP -- there is currently an error copying (some) structs
// by value to the device in HIP-Clang / VDI // by value to the device in HIP-Clang / VDI
// As a workaround, we can malloc the DriverType and explicitly copy over. // As a workaround, we can malloc the DriverType and explicitly copy over.
@ -169,12 +170,15 @@ struct HIPParallelLaunch<
} }
static hipFuncAttributes get_hip_func_attributes() { static hipFuncAttributes get_hip_func_attributes() {
static hipFuncAttributes attr = []() {
hipFuncAttributes attr; hipFuncAttributes attr;
hipFuncGetAttributes( HIP_SAFE_CALL(hipFuncGetAttributes(
&attr, &attr,
reinterpret_cast<void const *>( reinterpret_cast<void const *>(
hip_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock, hip_parallel_launch_local_memory<DriverType, MaxThreadsPerBlock,
MinBlocksPerSM>)); MinBlocksPerSM>)));
return attr;
}();
return attr; return attr;
} }
}; };
@ -192,6 +196,8 @@ struct HIPParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
"HIPParallelLaunch FAILED: shared memory request is too large")); "HIPParallelLaunch FAILED: shared memory request is too large"));
} }
KOKKOS_ENSURE_HIP_LOCK_ARRAYS_ON_DEVICE();
// Invoke the driver function on the device // Invoke the driver function on the device
// FIXME_HIP -- see note about struct copy by value above // FIXME_HIP -- see note about struct copy by value above
@ -212,10 +218,13 @@ struct HIPParallelLaunch<DriverType, Kokkos::LaunchBounds<0, 0>,
} }
static hipFuncAttributes get_hip_func_attributes() { static hipFuncAttributes get_hip_func_attributes() {
static hipFuncAttributes attr = []() {
hipFuncAttributes attr; hipFuncAttributes attr;
hipFuncGetAttributes( HIP_SAFE_CALL(hipFuncGetAttributes(
&attr, reinterpret_cast<void *>( &attr, reinterpret_cast<void const *>(
&hip_parallel_launch_local_memory<DriverType, 1024, 1>)); hip_parallel_launch_local_memory<DriverType, 1024, 1>)));
return attr;
}();
return attr; return attr;
} }
}; };
