Update Kokkos library in LAMMPS to v2.7.24
@@ -1,5 +1,68 @@
# Change Log

## [2.7.24](https://github.com/kokkos/kokkos/tree/2.7.24) (2018-11-04)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.7.00...2.7.24)

**Implemented enhancements:**

- DualView: Add non-templated functions for sync, need\_sync, view, modify \(see the sketch after this list\) [\#1858](https://github.com/kokkos/kokkos/issues/1858)
- DualView: Avoid needlessly allocating and initializing the modify\_host and modify\_device flag views [\#1831](https://github.com/kokkos/kokkos/issues/1831)
- DualView: Incorrect deduction of "not device type" [\#1659](https://github.com/kokkos/kokkos/issues/1659)
- BuildSystem: Add KOKKOS\_ENABLE\_CXX14 and KOKKOS\_ENABLE\_CXX17 [\#1602](https://github.com/kokkos/kokkos/issues/1602)
- BuildSystem: Installed kokkos\_generated\_settings.cmake contains build directories instead of install directories [\#1838](https://github.com/kokkos/kokkos/issues/1838)
- BuildSystem: KOKKOS\_ARCH: add ticks to printout of improper arch setting [\#1649](https://github.com/kokkos/kokkos/issues/1649)
- BuildSystem: Make core/src/Makefile for Cuda use needed nvcc\_wrapper [\#1296](https://github.com/kokkos/kokkos/issues/1296)
- Build: Support PGI as host compiler for NVCC [\#1828](https://github.com/kokkos/kokkos/issues/1828)
- Build: Many warnings fixed, e.g. [\#1786](https://github.com/kokkos/kokkos/issues/1786)
- Capability: OffsetView with non-zero begin index [\#567](https://github.com/kokkos/kokkos/issues/567)
- Capability: Reductions into device side view \(see the sketch after this list\) [\#1788](https://github.com/kokkos/kokkos/issues/1788)
- Capability: Add max\_size to Kokkos::Array [\#1760](https://github.com/kokkos/kokkos/issues/1760)
- Capability: View Assignment: LayoutStride -\> LayoutLeft and LayoutStride -\> LayoutRight [\#1594](https://github.com/kokkos/kokkos/issues/1594)
- Capability: Atomic functions allow implicit conversion of the update argument [\#1571](https://github.com/kokkos/kokkos/issues/1571)
- Capability: Add team\_size\_max with tagged functors [\#663](https://github.com/kokkos/kokkos/issues/663)
- Capability: Fix alignment of views from Kokkos\_ScratchSpace, which should use a different alignment [\#1700](https://github.com/kokkos/kokkos/issues/1700)
- Capability: create\_mirror\_view\_and\_copy for DynRankView [\#1651](https://github.com/kokkos/kokkos/issues/1651)
- Capability: DeepCopy HBWSpace / HostSpace [\#548](https://github.com/kokkos/kokkos/issues/548)
- ROCm: support team vector scan [\#1645](https://github.com/kokkos/kokkos/issues/1645)
- ROCm: Merge from rocm-hackathon2 [\#1636](https://github.com/kokkos/kokkos/issues/1636)
- ROCm: Add ParallelScanWithTotal [\#1611](https://github.com/kokkos/kokkos/issues/1611)
- ROCm: Implement MDRange in ROCm [\#1314](https://github.com/kokkos/kokkos/issues/1314)
- ROCm: Implement Reducers for Nested Parallelism Levels [\#963](https://github.com/kokkos/kokkos/issues/963)
- ROCm: Add asynchronous deep copy [\#959](https://github.com/kokkos/kokkos/issues/959)
- Tests: Memory pool test seems to allocate 8GB [\#1830](https://github.com/kokkos/kokkos/issues/1830)
- Tests: Add unit\_test for team\_broadcast [\#734](https://github.com/kokkos/kokkos/issues/734)

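As a quick illustration of two of the enhancements above (the non-templated DualView interface and reductions into a device-side view), here is a minimal sketch; the view names, extents, and the use of the default execution space are assumptions made for this example, not part of the release notes.

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Sketch only: fill on the host, sync to the device, then reduce into a
// device-resident rank-0 View instead of a host scalar.
void dualview_and_device_reduce(const int n) {
  Kokkos::DualView<double*> dv("dv", n);   // hypothetical data

  auto h = dv.view_host();                 // non-templated accessors
  for (int i = 0; i < n; ++i) h(i) = 1.0;
  dv.modify_host();                        // replaces the templated modify<>()
  dv.sync_device();                        // copies only if need_sync_device()

  // Rank-0 device-side View as the reduction target.
  Kokkos::View<double, Kokkos::DefaultExecutionSpace::memory_space> sum("sum");
  auto d = dv.view_device();
  Kokkos::parallel_reduce(
      "sum", n,
      KOKKOS_LAMBDA(const int i, double& lsum) { lsum += d(i); }, sum);
}
```
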
**Fixed bugs:**

- BuildSystem: Makefile.kokkos gets gcc-toolchain wrong if gcc is cached [\#1841](https://github.com/kokkos/kokkos/issues/1841)
- BuildSystem: kokkos\_generated\_settings.cmake placement is inconsistent [\#1771](https://github.com/kokkos/kokkos/issues/1771)
- BuildSystem: Invalid escape sequence \. in kokkos\_functions.cmake [\#1661](https://github.com/kokkos/kokkos/issues/1661)
- BuildSystem: Problem in Kokkos generated cmake file [\#1770](https://github.com/kokkos/kokkos/issues/1770)
- BuildSystem: invalid file names on windows [\#1671](https://github.com/kokkos/kokkos/issues/1671)
- Tests: reducers min/max\_loc test fails randomly due to multiple min values and thus multiple valid locations [\#1681](https://github.com/kokkos/kokkos/issues/1681)
- Tests: cuda.scatterview unit test causes "Bus error" when force\_uvm and enable\_lambda are enabled [\#1852](https://github.com/kokkos/kokkos/issues/1852)
- Tests: cuda.cxx11 unit test fails when force\_uvm and enable\_lambda are enabled [\#1850](https://github.com/kokkos/kokkos/issues/1850)
- Tests: threads.reduce\_device\_view\_range\_policy failing with Cuda/8.0.44 and RDC [\#1836](https://github.com/kokkos/kokkos/issues/1836)
- Build: compile error when compiling Kokkos with hwloc 2.0.1 \(on OSX 10.12.6, with g++ 7.2.0\) [\#1506](https://github.com/kokkos/kokkos/issues/1506)
- Build: dual\_view.view broken with UVM [\#1834](https://github.com/kokkos/kokkos/issues/1834)
- Build: White \(cuda/9.2 + gcc/7.2\) warnings triggering errors [\#1833](https://github.com/kokkos/kokkos/issues/1833)
- Build: warning: enum constant in boolean context [\#1813](https://github.com/kokkos/kokkos/issues/1813)
- Capability: Fix overly conservative max\_team\_size computation [\#1808](https://github.com/kokkos/kokkos/issues/1808)
- DynRankView: Ctors taking ViewAllocateWithoutInitializing broken \(see the sketch after this list\) [\#1783](https://github.com/kokkos/kokkos/issues/1783)
- Cuda: Apollo cuda.team\_broadcast test fail with clang-6.0 [\#1762](https://github.com/kokkos/kokkos/issues/1762)
- Cuda: Clang spurious test failure in impl\_view\_accessible [\#1753](https://github.com/kokkos/kokkos/issues/1753)
- Cuda: Kokkos::complex\<double\> atomic deadlocks with Clang 6 Cuda build with -O0 [\#1752](https://github.com/kokkos/kokkos/issues/1752)
- Cuda: LayoutStride Test fails for UVM as default memory space [\#1688](https://github.com/kokkos/kokkos/issues/1688)
- Cuda: Scan wrong values on Volta [\#1676](https://github.com/kokkos/kokkos/issues/1676)
- Cuda: Kokkos::deep\_copy error with CudaUVM and Kokkos::Serial spaces [\#1652](https://github.com/kokkos/kokkos/issues/1652)
- Cuda: cudaErrorInvalidConfiguration with debug build [\#1647](https://github.com/kokkos/kokkos/issues/1647)
- Cuda: parallel\_for with TeamPolicy::team\_size\_recommended with launch bounds not working -- reported by Daniel Holladay [\#1283](https://github.com/kokkos/kokkos/issues/1283)
- Cuda: Using KOKKOS\_CLASS\_LAMBDA in a class with Kokkos::Random\_XorShift64\_Pool member data [\#1696](https://github.com/kokkos/kokkos/issues/1696)
- Long Build Times on Darwin [\#1721](https://github.com/kokkos/kokkos/issues/1721)
- Capability: Typo in Kokkos\_Sort.hpp - BinOp3D - wrong comparison [\#1720](https://github.com/kokkos/kokkos/issues/1720)
- Buffer overflow in SharedAllocationRecord in Kokkos\_HostSpace.cpp [\#1673](https://github.com/kokkos/kokkos/issues/1673)
- Serial unit test failure [\#1632](https://github.com/kokkos/kokkos/issues/1632)

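The ViewAllocateWithoutInitializing fix above concerns the same allocation wrapper the Kokkos_Sort.hpp hunks later in this commit switch to. A minimal sketch of its use (names and extents are illustrative only):

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DynRankView.hpp>

// Allocate without the default zero-initialization pass; the contents
// are undefined until written.
void allocate_uninitialized(const int n) {
  Kokkos::View<double*> v(Kokkos::ViewAllocateWithoutInitializing("v"), n);
  Kokkos::DynRankView<double> d(Kokkos::ViewAllocateWithoutInitializing("d"), n, n);
  Kokkos::deep_copy(v, 0.0);  // fill explicitly when needed
}
```
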
## [2.7.00](https://github.com/kokkos/kokkos/tree/2.7.00) (2018-05-24)
[Full Changelog](https://github.com/kokkos/kokkos/compare/2.6.00...2.7.00)


@@ -11,7 +11,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)

# Define Project Name if this is a standalone build
IF(NOT DEFINED ${PROJECT_NAME})
  project(Kokkos CXX)
ENDIF()

# Basic initialization (Used in KOKKOS_SETTINGS)
@@ -22,7 +22,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)
include(${KOKKOS_SRC_PATH}/cmake/kokkos_functions.cmake)
set_kokkos_cxx_compiler()
set_kokkos_cxx_standard()


#------------ GET OPTIONS AND KOKKOS_SETTINGS --------------------------------
# Add Kokkos' modules to CMake's module path.
set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${Kokkos_SOURCE_DIR}/cmake/Modules/")
@@ -34,7 +34,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)

#------------ GENERATE HEADER AND SOURCE FILES -------------------------------
execute_process(
COMMAND ${KOKKOS_SETTINGS} make -f ${KOKKOS_SRC_PATH}/cmake/Makefile.generate_cmake_settings CXX=${CMAKE_CXX_COMPILER} generate_build_settings
COMMAND ${KOKKOS_SETTINGS} make -f ${KOKKOS_SRC_PATH}/cmake/Makefile.generate_cmake_settings CXX=${CMAKE_CXX_COMPILER} PREFIX=${CMAKE_INSTALL_PREFIX} generate_build_settings
WORKING_DIRECTORY "${Kokkos_BINARY_DIR}"
OUTPUT_FILE ${Kokkos_BINARY_DIR}/core_src_make.out
RESULT_VARIABLE GEN_SETTINGS_RESULT
@@ -45,6 +45,7 @@ IF(NOT KOKKOS_HAS_TRILINOS)
endif()
include(${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake)
install(FILES ${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake DESTINATION lib/cmake/Kokkos)
install(FILES ${Kokkos_BINARY_DIR}/kokkos_generated_settings.cmake DESTINATION ${CMAKE_INSTALL_PREFIX})
string(REPLACE " " ";" KOKKOS_TPL_INCLUDE_DIRS "${KOKKOS_GMAKE_TPL_INCLUDE_DIRS}")
string(REPLACE " " ";" KOKKOS_TPL_LIBRARY_DIRS "${KOKKOS_GMAKE_TPL_LIBRARY_DIRS}")
string(REPLACE " " ";" KOKKOS_TPL_LIBRARY_NAMES "${KOKKOS_GMAKE_TPL_LIBRARY_NAMES}")

@@ -1,14 +1,8 @@
# Default settings common options.

#LAMMPS specific settings:
ifndef KOKKOS_PATH
KOKKOS_PATH=../../lib/kokkos
endif
CXXFLAGS=$(CCFLAGS)

# Options: Cuda,ROCm,OpenMP,Pthreads,Qthreads,Serial
KOKKOS_DEVICES ?= "OpenMP"
#KOKKOS_DEVICES ?= "Pthreads"
# Options: Cuda,ROCm,OpenMP,Pthread,Qthreads,Serial
#KOKKOS_DEVICES ?= "OpenMP"
KOKKOS_DEVICES ?= "Pthread"
# Options:
# Intel: KNC,KNL,SNB,HSW,BDW,SKX
# NVIDIA: Kepler,Kepler30,Kepler32,Kepler35,Kepler37,Maxwell,Maxwell50,Maxwell52,Maxwell53,Pascal60,Pascal61,Volta70,Volta72
@@ -21,16 +15,17 @@ KOKKOS_ARCH ?= ""
KOKKOS_DEBUG ?= "no"
# Options: hwloc,librt,experimental_memkind
KOKKOS_USE_TPLS ?= ""
# Options: c++11,c++1z
# Options: c++11,c++14,c++1y,c++17,c++1z,c++2a
KOKKOS_CXX_STANDARD ?= "c++11"
# Options: aggressive_vectorization,disable_profiling,disable_deprecated_code,enable_large_mem_tests
KOKKOS_OPTIONS ?= ""
# Option for setting ETI path
KOKKOS_ETI_PATH ?= ${KOKKOS_PATH}/core/src/eti
KOKKOS_CMAKE ?= "no"

# Default settings specific options.
# Options: force_uvm,use_ldg,rdc,enable_lambda
KOKKOS_CUDA_OPTIONS ?= "enable_lambda"
KOKKOS_CUDA_OPTIONS ?= ""

# Return a 1 if a string contains a substring and 0 if not
# Note the search string should be without '"'
@@ -41,7 +36,11 @@ kokkos_has_string=$(if $(findstring $2,$1),1,0)
# Check for general settings.
KOKKOS_INTERNAL_ENABLE_DEBUG := $(call kokkos_has_string,$(KOKKOS_DEBUG),yes)
KOKKOS_INTERNAL_ENABLE_CXX11 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++11)
KOKKOS_INTERNAL_ENABLE_CXX14 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++14)
KOKKOS_INTERNAL_ENABLE_CXX1Y := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1y)
KOKKOS_INTERNAL_ENABLE_CXX17 := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++17)
KOKKOS_INTERNAL_ENABLE_CXX1Z := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++1z)
KOKKOS_INTERNAL_ENABLE_CXX2A := $(call kokkos_has_string,$(KOKKOS_CXX_STANDARD),c++2a)

# Check for external libraries.
KOKKOS_INTERNAL_USE_HWLOC := $(call kokkos_has_string,$(KOKKOS_USE_TPLS),hwloc)
@@ -110,6 +109,18 @@ KOKKOS_INTERNAL_COMPILER_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_VE
KOKKOS_INTERNAL_COMPILER_APPLE_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),apple-darwin)
KOKKOS_INTERNAL_COMPILER_HCC := $(call kokkos_has_string,$(KOKKOS_CXX_VERSION),HCC)

# Check Host Compiler if using NVCC through nvcc_wrapper
ifeq ($(KOKKOS_INTERNAL_COMPILER_NVCC), 1)
  KOKKOS_INTERNAL_COMPILER_NVCC_WRAPPER := $(strip $(shell echo $(CXX) | grep nvcc_wrapper | wc -l))
  ifeq ($(KOKKOS_INTERNAL_COMPILER_NVCC_WRAPPER), 1)

    KOKKOS_CXX_HOST_VERSION := $(strip $(shell $(CXX) $(CXXFLAGS) --host-version 2>&1))
    KOKKOS_INTERNAL_COMPILER_PGI := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),PGI)
    KOKKOS_INTERNAL_COMPILER_INTEL := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),Intel Corporation)
    KOKKOS_INTERNAL_COMPILER_CLANG := $(call kokkos_has_string,$(KOKKOS_CXX_HOST_VERSION),clang)
  endif
endif

ifeq ($(KOKKOS_INTERNAL_COMPILER_CLANG), 2)
  KOKKOS_INTERNAL_COMPILER_CLANG = 1
endif
@@ -202,18 +213,34 @@ endif
# Set C++11 flags.
ifeq ($(KOKKOS_INTERNAL_COMPILER_PGI), 1)
  KOKKOS_INTERNAL_CXX11_FLAG := --c++11
  KOKKOS_INTERNAL_CXX14_FLAG := --c++14
  #KOKKOS_INTERNAL_CXX17_FLAG := --c++17
else
  ifeq ($(KOKKOS_INTERNAL_COMPILER_XL), 1)
    KOKKOS_INTERNAL_CXX11_FLAG := -std=c++11
    #KOKKOS_INTERNAL_CXX14_FLAG := -std=c++14
    KOKKOS_INTERNAL_CXX1Y_FLAG := -std=c++1y
    #KOKKOS_INTERNAL_CXX17_FLAG := -std=c++17
    #KOKKOS_INTERNAL_CXX1Z_FLAG := -std=c++1Z
    #KOKKOS_INTERNAL_CXX2A_FLAG := -std=c++2a
  else
    ifeq ($(KOKKOS_INTERNAL_COMPILER_CRAY), 1)
      KOKKOS_INTERNAL_CXX11_FLAG := -hstd=c++11
      KOKKOS_INTERNAL_CXX14_FLAG := -hstd=c++14
      #KOKKOS_INTERNAL_CXX1Y_FLAG := -hstd=c++1y
      #KOKKOS_INTERNAL_CXX17_FLAG := -hstd=c++17
      #KOKKOS_INTERNAL_CXX1Z_FLAG := -hstd=c++1z
      #KOKKOS_INTERNAL_CXX2A_FLAG := -hstd=c++2a
    else
      ifeq ($(KOKKOS_INTERNAL_COMPILER_HCC), 1)
        KOKKOS_INTERNAL_CXX11_FLAG :=
      else
        KOKKOS_INTERNAL_CXX11_FLAG := --std=c++11
        KOKKOS_INTERNAL_CXX14_FLAG := --std=c++14
        KOKKOS_INTERNAL_CXX1Y_FLAG := --std=c++1y
        KOKKOS_INTERNAL_CXX17_FLAG := --std=c++17
        KOKKOS_INTERNAL_CXX1Z_FLAG := --std=c++1z
        KOKKOS_INTERNAL_CXX2A_FLAG := --std=c++2a
      endif
    endif
  endif
@@ -336,7 +363,9 @@ endif

#CPPFLAGS is now unused
KOKKOS_CPPFLAGS =
KOKKOS_CXXFLAGS = -I./ -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src -I$(KOKKOS_ETI_PATH)
ifneq ($(KOKKOS_CMAKE), yes)
  KOKKOS_CXXFLAGS = -I./ -I$(KOKKOS_PATH)/core/src -I$(KOKKOS_PATH)/containers/src -I$(KOKKOS_PATH)/algorithms/src -I$(KOKKOS_ETI_PATH)
endif
KOKKOS_TPL_INCLUDE_DIRS =
KOKKOS_TPL_LIBRARY_DIRS =
KOKKOS_TPL_LIBRARY_NAMES =
@@ -347,9 +376,11 @@ endif

KOKKOS_LIBS = -ldl
KOKKOS_TPL_LIBRARY_NAMES += dl
KOKKOS_LDFLAGS = -L$(shell pwd)
# CXXLDFLAGS is used together with CXXFLAGS in a combined compile/link command
KOKKOS_CXXLDFLAGS = -L$(shell pwd)
ifneq ($(KOKKOS_CMAKE), yes)
  KOKKOS_LDFLAGS = -L$(shell pwd)
  # CXXLDFLAGS is used together with CXXFLAGS in a combined compile/link command
  KOKKOS_CXXLDFLAGS = -L$(shell pwd)
endif
KOKKOS_LINK_FLAGS =
KOKKOS_SRC =
KOKKOS_HEADERS =
@@ -377,10 +408,12 @@ tmp := $(call kokkos_append_header,"/* Execution Spaces */")

ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CUDA")
  tmp := $(call kokkos_append_header,"\#define KOKKOS_COMPILER_CUDA_VERSION $(KOKKOS_INTERNAL_COMPILER_NVCC_VERSION)")
endif

ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
  tmp := $(call kokkos_append_header,'\#define KOKKOS_ENABLE_ROCM')
  tmp := $(call kokkos_append_header,'\#define KOKKOS_IMPL_ROCM_CLANG_WORKAROUND 1')
endif

ifeq ($(KOKKOS_INTERNAL_USE_OPENMPTARGET), 1)
@@ -438,11 +471,25 @@ ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX11), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX11_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX11")
endif

ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX14), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX14_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX14")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX1Y), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX1Y_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX14")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX17), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX17_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX17")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX1Z), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX1Z_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX11")
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX1Z")
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX17")
endif
ifeq ($(KOKKOS_INTERNAL_ENABLE_CXX2A), 1)
  KOKKOS_CXXFLAGS += $(KOKKOS_INTERNAL_CXX2A_FLAG)
  tmp := $(call kokkos_append_header,"\#define KOKKOS_ENABLE_CXX20")
endif

ifeq ($(KOKKOS_INTERNAL_ENABLE_DEBUG), 1)
@@ -465,7 +512,9 @@ endif

ifeq ($(KOKKOS_INTERNAL_USE_HWLOC), 1)
  ifneq ($(HWLOC_PATH),)
    KOKKOS_CXXFLAGS += -I$(HWLOC_PATH)/include
    ifneq ($(KOKKOS_CMAKE), yes)
      KOKKOS_CXXFLAGS += -I$(HWLOC_PATH)/include
    endif
    KOKKOS_LDFLAGS += -L$(HWLOC_PATH)/lib
    KOKKOS_CXXLDFLAGS += -L$(HWLOC_PATH)/lib
    KOKKOS_TPL_INCLUDE_DIRS += $(HWLOC_PATH)/include
@@ -484,7 +533,9 @@ endif

ifeq ($(KOKKOS_INTERNAL_USE_MEMKIND), 1)
  ifneq ($(MEMKIND_PATH),)
    KOKKOS_CXXFLAGS += -I$(MEMKIND_PATH)/include
    ifneq ($(KOKKOS_CMAKE), yes)
      KOKKOS_CXXFLAGS += -I$(MEMKIND_PATH)/include
    endif
    KOKKOS_LDFLAGS += -L$(MEMKIND_PATH)/lib
    KOKKOS_CXXLDFLAGS += -L$(MEMKIND_PATH)/lib
    KOKKOS_TPL_INCLUDE_DIRS += $(MEMKIND_PATH)/include
@@ -977,7 +1028,9 @@ ifeq ($(KOKKOS_INTERNAL_ENABLE_ETI), 1)
endif
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Cuda/*.hpp)
ifneq ($(CUDA_PATH),)
  KOKKOS_CXXFLAGS += -I$(CUDA_PATH)/include
  ifneq ($(KOKKOS_CMAKE), yes)
    KOKKOS_CXXFLAGS += -I$(CUDA_PATH)/include
  endif
  KOKKOS_LDFLAGS += -L$(CUDA_PATH)/lib64
  KOKKOS_CXXLDFLAGS += -L$(CUDA_PATH)/lib64
  KOKKOS_TPL_INCLUDE_DIRS += $(CUDA_PATH)/include
@@ -1032,7 +1085,9 @@ ifeq ($(KOKKOS_INTERNAL_USE_QTHREADS), 1)
KOKKOS_SRC += $(wildcard $(KOKKOS_PATH)/core/src/Qthreads/*.cpp)
KOKKOS_HEADERS += $(wildcard $(KOKKOS_PATH)/core/src/Qthreads/*.hpp)
ifneq ($(QTHREADS_PATH),)
  KOKKOS_CXXFLAGS += -I$(QTHREADS_PATH)/include
  ifneq ($(KOKKOS_CMAKE), yes)
    KOKKOS_CXXFLAGS += -I$(QTHREADS_PATH)/include
  endif
  KOKKOS_LDFLAGS += -L$(QTHREADS_PATH)/lib
  KOKKOS_CXXLDFLAGS += -L$(QTHREADS_PATH)/lib
  KOKKOS_TPL_INCLUDE_DIRS += $(QTHREADS_PATH)/include

@@ -52,44 +52,47 @@ For specifics see the LICENSE file contained in the repository or distribution.
* GCC 4.8.4
* GCC 4.9.3
* GCC 5.1.0
* GCC 5.3.0
* GCC 5.5.0
* GCC 6.1.0
* GCC 7.2.0
* GCC 7.3.0
* GCC 8.1.0
* Intel 15.0.2
* Intel 16.0.1
* Intel 17.1.043
* Intel 17.0.1
* Intel 17.4.196
* Intel 18.0.128
* Intel 18.2.128
* Clang 3.6.1
* Clang 3.7.1
* Clang 3.8.1
* Clang 3.9.0
* Clang 4.0.0
* Clang 4.0.0 for CUDA (CUDA Toolkit 8.0.44)
* Clang 6.0.0 for CUDA (CUDA Toolkit 9.1)
* PGI 17.10
* NVCC 7.0 for CUDA (with gcc 4.8.4)
* Clang 6.0.0 for CUDA (CUDA Toolkit 9.0)
* Clang 7.0.0 for CUDA (CUDA Toolkit 9.1)
* PGI 18.7
* NVCC 7.5 for CUDA (with gcc 4.8.4)
* NVCC 8.0.44 for CUDA (with gcc 5.3.0)
* NVCC 9.1 for CUDA (with gcc 6.1.0)

### Primary tested compilers on Power 8 are:
* GCC 5.4.0 (OpenMP,Serial)
* IBM XL 13.1.6 (OpenMP, Serial)
* NVCC 8.0.44 for CUDA (with gcc 5.4.0)
* NVCC 9.0.103 for CUDA (with gcc 6.3.0 and XL 13.1.6)
* GCC 6.4.0 (OpenMP,Serial)
* GCC 7.2.0 (OpenMP,Serial)
* IBM XL 16.1.0 (OpenMP, Serial)
* NVCC 9.2.88 for CUDA (with gcc 7.2.0 and XL 16.1.0)

### Primary tested compilers on Intel KNL are:
* GCC 6.2.0
* Intel 16.4.258 (with gcc 4.7.2)
* Intel 17.2.174 (with gcc 4.9.3)
* Intel 18.0.128 (with gcc 4.9.3)
* Intel 18.2.199 (with gcc 4.9.3)

### Primary tested compilers on ARM
* GCC 6.1.0
### Primary tested compilers on ARM (Cavium ThunderX2)
* GCC 7.2.0
* ARM/Clang 18.4.0

### Other compilers working:
* X86:
    - Cygwin 2.1.0 64bit with gcc 4.9.3
    - GCC 8.1.0 (not warning free)

### Known non-working combinations:
* Power8:

@@ -697,6 +697,7 @@ namespace Kokkos {
    typedef Random_XorShift64<DeviceType> generator_type;
    typedef DeviceType device_type;

    KOKKOS_INLINE_FUNCTION
    Random_XorShift64_Pool() {
      num_states_ = 0;
    }
@@ -709,12 +710,14 @@ namespace Kokkos {
#endif
    }

    KOKKOS_INLINE_FUNCTION
    Random_XorShift64_Pool(const Random_XorShift64_Pool& src):
      locks_(src.locks_),
      state_(src.state_),
      num_states_(src.num_states_)
    {}

    KOKKOS_INLINE_FUNCTION
    Random_XorShift64_Pool operator = (const Random_XorShift64_Pool& src) {
      locks_ = src.locks_;
      state_ = src.state_;
@@ -958,6 +961,7 @@ namespace Kokkos {

    typedef DeviceType device_type;

    KOKKOS_INLINE_FUNCTION
    Random_XorShift1024_Pool() {
      num_states_ = 0;
    }
@@ -972,6 +976,7 @@ namespace Kokkos {
#endif
    }

    KOKKOS_INLINE_FUNCTION
    Random_XorShift1024_Pool(const Random_XorShift1024_Pool& src):
      locks_(src.locks_),
      state_(src.state_),
@@ -979,6 +984,7 @@ namespace Kokkos {
      num_states_(src.num_states_)
    {}

    KOKKOS_INLINE_FUNCTION
    Random_XorShift1024_Pool operator = (const Random_XorShift1024_Pool& src) {
      locks_ = src.locks_;
      state_ = src.state_;

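The device_type typedef and the device-callable copy constructors added to the two pools above are what let a pool be captured by value inside a lambda, or inside a class using KOKKOS_CLASS_LAMBDA (see issue #1696 in the changelog). A minimal usage sketch, with an arbitrary seed and a hypothetical output view:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_Random.hpp>

// Each thread checks a generator out of the pool, draws a value, and
// returns the generator so another thread may reuse its state.
void fill_random(Kokkos::View<uint64_t*> out) {
  Kokkos::Random_XorShift64_Pool<> pool(12345 /* seed, arbitrary */);
  Kokkos::parallel_for(
      "fill_random", out.extent(0), KOKKOS_LAMBDA(const int i) {
        auto gen = pool.get_state();
        out(i) = gen.urand64();
        pool.free_state(gen);
      });
}
```
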
@@ -246,8 +246,8 @@ public:
  {
    bin_count_atomic = Kokkos::View<int*, Space >("Kokkos::SortImpl::BinSortFunctor::bin_count",bin_op.max_bins());
    bin_count_const = bin_count_atomic;
    bin_offsets = offset_type("Kokkos::SortImpl::BinSortFunctor::bin_offsets",bin_op.max_bins());
    sort_order = offset_type("PermutationVector",range_end-range_begin);
    bin_offsets = offset_type(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::bin_offsets"),bin_op.max_bins());
    sort_order = offset_type(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sort_order"),range_end-range_begin);
  }

  BinSort( const_key_view_type keys_
@@ -290,7 +290,7 @@ public:

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
    scratch_view_type
      sorted_values("Scratch",
      sorted_values(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sorted_values"),
                    len,
                    values.extent(1),
                    values.extent(2),
@@ -301,7 +301,7 @@ public:
                    values.extent(7));
#else
    scratch_view_type
      sorted_values("Scratch",
      sorted_values(ViewAllocateWithoutInitializing("Kokkos::SortImpl::BinSortFunctor::sorted_values"),
                    values.rank_dynamic > 0 ? len : KOKKOS_IMPL_CTOR_DEFAULT_ARG,
                    values.rank_dynamic > 1 ? values.extent(1) : KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
                    values.rank_dynamic > 2 ? values.extent(2) : KOKKOS_IMPL_CTOR_DEFAULT_ARG,
@@ -483,7 +483,7 @@ struct BinOp3D {
    if (keys(i1,0)>keys(i2,0)) return true;
    else if (keys(i1,0)==keys(i2,0)) {
      if (keys(i1,1)>keys(i2,1)) return true;
      else if (keys(i1,1)==keys(i2,2)) {
      else if (keys(i1,1)==keys(i2,1)) {
        if (keys(i1,2)>keys(i2,2)) return true;
      }
    }

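The BinOp3D hunk above fixes a transposed index in the second tie-break: the old code compared component 1 of one key against component 2 of the other. Restated over plain arrays (a standalone rewrite for illustration, not Kokkos source), with a key pair the old test mis-ordered:

```cpp
// Strict lexicographic "greater" over 3-component keys, as the fixed
// code intends. With a = {0, 2, 9} and b = {0, 2, 5}, the old test
// evaluated a[1] == b[2] (2 == 5, false), so the tie in component 1
// was never broken by component 2 and a was wrongly not placed after b.
inline bool greater3(const double a[3], const double b[3]) {
  if (a[0] > b[0]) return true;
  if (a[0] == b[0]) {
    if (a[1] > b[1]) return true;
    if (a[1] == b[1]) return a[2] > b[2];
  }
  return false;
}
```
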
lib/kokkos/benchmarks/gups/Makefile (new file, 41 lines)
@@ -0,0 +1,41 @@
#Set your Kokkos path to something appropriate
KOKKOS_PATH = ${HOME}/git/kokkos-github-repo
KOKKOS_DEVICES = "Cuda"
KOKKOS_ARCH = "Pascal60"
KOKKOS_CUDA_OPTIONS = enable_lambda
#KOKKOS_DEVICES = "OpenMP"
#KOKKOS_ARCH = "Power8"

SRC = gups-kokkos.cc

default: build
	echo "Start Build"

CXXFLAGS = -O3
CXX = ${HOME}/git/kokkos-github-repo/bin/nvcc_wrapper
#CXX = g++

LINK = ${CXX}

LINKFLAGS =
EXE = gups-kokkos

DEPFLAGS = -M

OBJ = $(SRC:.cc=.o)
LIB =

include $(KOKKOS_PATH)/Makefile.kokkos

build: $(EXE)

$(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
	$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)

clean: kokkos-clean
	rm -f *.o $(EXE)

# Compilation rules

%.o:%.cc $(KOKKOS_CPP_DEPENDS)
	$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<

lib/kokkos/benchmarks/gups/gups-kokkos.cc (new file, 199 lines)
@@ -0,0 +1,199 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// ************************************************************************
//@HEADER
*/

#include "Kokkos_Core.hpp"
|
||||
#include <cstdio>
|
||||
#include <cstdlib>
|
||||
#include <cmath>
|
||||
|
||||
#include <sys/time.h>
|
||||
|
||||
#define HLINE "-------------------------------------------------------------\n"
|
||||
|
||||
#if defined(KOKKOS_ENABLE_CUDA)
|
||||
typedef Kokkos::View<int64_t*, Kokkos::CudaSpace>::HostMirror GUPSHostArray;
|
||||
typedef Kokkos::View<int64_t*, Kokkos::CudaSpace> GUPSDeviceArray;
|
||||
#else
|
||||
typedef Kokkos::View<int64_t*, Kokkos::HostSpace>::HostMirror GUPSHostArray;
|
||||
typedef Kokkos::View<int64_t*, Kokkos::HostSpace> GUPSDeviceArray;
|
||||
#endif
|
||||
|
||||
typedef int GUPSIndex;
|
||||
|
||||
double now() {
|
||||
struct timeval now;
|
||||
gettimeofday(&now, NULL);
|
||||
|
||||
return (double) now.tv_sec + ((double) now.tv_usec * 1.0e-6);
|
||||
}
|
||||
|
||||
void randomize_indices(GUPSHostArray& indices, GUPSDeviceArray& dev_indices, const int64_t dataCount) {
|
||||
for( GUPSIndex i = 0; i < indices.extent(0); ++i ) {
|
||||
indices[i] = lrand48() % dataCount;
|
||||
}
|
||||
|
||||
Kokkos::deep_copy(dev_indices, indices);
|
||||
}
|
||||
|
||||
void run_gups(GUPSDeviceArray& indices, GUPSDeviceArray& data, const int64_t datum,
|
||||
const bool performAtomics) {
|
||||
|
||||
if( performAtomics ) {
|
||||
Kokkos::parallel_for("bench-gups-atomic", indices.extent(0), KOKKOS_LAMBDA(const GUPSIndex i) {
|
||||
Kokkos::atomic_fetch_xor( &data[indices[i]], datum );
|
||||
});
|
||||
} else {
|
||||
Kokkos::parallel_for("bench-gups-non-atomic", indices.extent(0), KOKKOS_LAMBDA(const GUPSIndex i) {
|
||||
data[indices[i]] ^= datum;
|
||||
});
|
||||
}
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
int run_benchmark(const GUPSIndex indicesCount, const GUPSIndex dataCount, const int repeats,
                  const bool useAtomics) {

  printf("Reports fastest timing per kernel\n");
  printf("Creating Views...\n");

  printf("Memory Sizes:\n");
  printf("- Elements: %15" PRIu64 " (%12.4f MB)\n", static_cast<uint64_t>(dataCount),
         1.0e-6 * ((double) dataCount * (double) sizeof(int64_t)));
  printf("- Indices: %15" PRIu64 " (%12.4f MB)\n", static_cast<uint64_t>(indicesCount),
         1.0e-6 * ((double) indicesCount * (double) sizeof(int64_t)));
  printf(" - Atomics: %15s\n", (useAtomics ? "Yes" : "No") );
  printf("Benchmark kernels will be performed for %d iterations.\n", repeats);

  printf(HLINE);

  GUPSDeviceArray dev_indices("indices", indicesCount);
  GUPSDeviceArray dev_data("data", dataCount);
  int64_t datum = -1;

  GUPSHostArray indices = Kokkos::create_mirror_view(dev_indices);
  GUPSHostArray data = Kokkos::create_mirror_view(dev_data);

  double gupsTime = 0.0;

  printf("Initializing Views...\n");

#if defined(KOKKOS_HAVE_OPENMP)
  Kokkos::parallel_for("init-data", Kokkos::RangePolicy<Kokkos::OpenMP>(0, dataCount),
#else
  Kokkos::parallel_for("init-data", Kokkos::RangePolicy<Kokkos::Serial>(0, dataCount),
#endif
    KOKKOS_LAMBDA(const int i) {

    data[i] = 10101010101;
  });

#if defined(KOKKOS_HAVE_OPENMP)
  Kokkos::parallel_for("init-indices", Kokkos::RangePolicy<Kokkos::OpenMP>(0, indicesCount),
#else
  Kokkos::parallel_for("init-indices", Kokkos::RangePolicy<Kokkos::Serial>(0, indicesCount),
#endif
    KOKKOS_LAMBDA(const int i) {

    indices[i] = 0;
  });

  Kokkos::deep_copy(dev_data, data);
  Kokkos::deep_copy(dev_indices, indices);
  double start;

  printf("Starting benchmarking...\n");

  for( GUPSIndex k = 0; k < repeats; ++k ) {
    randomize_indices(indices, dev_indices, data.extent(0));

    start = now();
    run_gups(dev_indices, dev_data, datum, useAtomics);
    gupsTime += now() - start;
  }

  Kokkos::deep_copy(indices, dev_indices);
  Kokkos::deep_copy(data, dev_data);

  printf(HLINE);
  printf("GUP/s Random: %18.6f\n",
         (1.0e-9 * ((double) repeats) * (double) dev_indices.extent(0)) / gupsTime);
  printf(HLINE);

  return 0;
}

int main(int argc, char* argv[]) {

  printf(HLINE);
  printf("Kokkos GUPS Benchmark\n");
  printf(HLINE);

  srand48(1010101);

  Kokkos::initialize(argc, argv);

  int64_t indices = 8192;
  int64_t data = 33554432;
  int64_t repeats = 10;
  bool useAtomics = false;

  for( int i = 1; i < argc; ++i ) {
    if( strcmp( argv[i], "--indices" ) == 0 ) {
      indices = std::atoll(argv[i+1]);
      ++i;
    } else if( strcmp( argv[i], "--data" ) == 0 ) {
      data = std::atoll(argv[i+1]);
      ++i;
    } else if( strcmp( argv[i], "--repeats" ) == 0 ) {
      repeats = std::atoll(argv[i+1]);
      ++i;
    } else if( strcmp( argv[i], "--atomics" ) == 0 ) {
      useAtomics = true;
    }
  }

  const int rc = run_benchmark(indices, data, repeats, useAtomics);

  Kokkos::finalize();

  return rc;
}
lib/kokkos/benchmarks/stream/Makefile (new file, 41 lines)
@@ -0,0 +1,41 @@
#Set your Kokkos path to something appropriate
KOKKOS_PATH = ${HOME}/git/kokkos-github-repo
#KOKKOS_DEVICES = "Cuda"
#KOKKOS_ARCH = "Pascal60"
#KOKKOS_CUDA_OPTIONS = enable_lambda
KOKKOS_DEVICES = "OpenMP"
KOKKOS_ARCH = "Power8"

SRC = stream-kokkos.cc

default: build
	echo "Start Build"

CXXFLAGS = -O3
#CXX = ${HOME}/git/kokkos-github-repo/bin/nvcc_wrapper
CXX = g++

LINK = ${CXX}

LINKFLAGS =
EXE = stream-kokkos

DEPFLAGS = -M

OBJ = $(SRC:.cc=.o)
LIB =

include $(KOKKOS_PATH)/Makefile.kokkos

build: $(EXE)

$(EXE): $(OBJ) $(KOKKOS_LINK_DEPENDS)
	$(LINK) $(KOKKOS_LDFLAGS) $(LINKFLAGS) $(EXTRA_PATH) $(OBJ) $(KOKKOS_LIBS) $(LIB) -o $(EXE)

clean: kokkos-clean
	rm -f *.o $(EXE)

# Compilation rules

%.o:%.cc $(KOKKOS_CPP_DEPENDS)
	$(CXX) $(KOKKOS_CPPFLAGS) $(KOKKOS_CXXFLAGS) $(CXXFLAGS) $(EXTRA_INC) -c $<

lib/kokkos/benchmarks/stream/stream-kokkos.cc (new file, 265 lines)
@@ -0,0 +1,265 @@
/* //@HEADER ... same Kokkos v. 2.0 (2014 Sandia Corporation) BSD license header as in gups-kokkos.cc above ... //@HEADER */

#include "Kokkos_Core.hpp"
|
||||
#include <cstdio>
|
||||
#include <cstdlib>
|
||||
#include <cmath>
|
||||
|
||||
#include <sys/time.h>
|
||||
|
||||
#define STREAM_ARRAY_SIZE 100000000
|
||||
#define STREAM_NTIMES 20
|
||||
|
||||
#define HLINE "-------------------------------------------------------------\n"
|
||||
|
||||
#if defined(KOKKOS_ENABLE_CUDA)
|
||||
typedef Kokkos::View<double*, Kokkos::CudaSpace>::HostMirror StreamHostArray;
|
||||
typedef Kokkos::View<double*, Kokkos::CudaSpace> StreamDeviceArray;
|
||||
#else
|
||||
typedef Kokkos::View<double*, Kokkos::HostSpace>::HostMirror StreamHostArray;
|
||||
typedef Kokkos::View<double*, Kokkos::HostSpace> StreamDeviceArray;
|
||||
#endif
|
||||
|
||||
typedef int StreamIndex;
|
||||
|
||||
double now() {
|
||||
struct timeval now;
|
||||
gettimeofday(&now, NULL);
|
||||
|
||||
return (double) now.tv_sec + ((double) now.tv_usec * 1.0e-6);
|
||||
}
|
||||
|
||||
void perform_copy(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c) {
|
||||
|
||||
Kokkos::parallel_for("copy", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
|
||||
c[i] = a[i];
|
||||
});
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
void perform_scale(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c,
|
||||
const double scalar) {
|
||||
|
||||
Kokkos::parallel_for("copy", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
|
||||
b[i] = scalar * c[i];
|
||||
});
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
void perform_add(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c) {
|
||||
Kokkos::parallel_for("add", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
|
||||
c[i] = a[i] + b[i];
|
||||
});
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
void perform_triad(StreamDeviceArray& a, StreamDeviceArray& b, StreamDeviceArray& c,
|
||||
const double scalar) {
|
||||
|
||||
Kokkos::parallel_for("triad", a.extent(0), KOKKOS_LAMBDA(const StreamIndex i) {
|
||||
a[i] = b[i] + scalar * c[i];
|
||||
});
|
||||
|
||||
Kokkos::fence();
|
||||
}
|
||||
|
||||
int perform_validation(StreamHostArray& a, StreamHostArray& b, StreamHostArray& c,
                       const StreamIndex arraySize, const double scalar) {

  double ai = 1.0;
  double bi = 2.0;
  double ci = 0.0;

  for( StreamIndex i = 0; i < arraySize; ++i ) {
    ci = ai;
    bi = scalar * ci;
    ci = ai + bi;
    ai = bi + scalar * ci;
  };

  double aError = 0.0;
  double bError = 0.0;
  double cError = 0.0;

  for( StreamIndex i = 0; i < arraySize; ++i ) {
    aError = std::abs( a[i] - ai );
    bError = std::abs( b[i] - bi );
    cError = std::abs( c[i] - ci );
  }

  double aAvgError = aError / (double) arraySize;
  double bAvgError = bError / (double) arraySize;
  double cAvgError = cError / (double) arraySize;

  const double epsilon = 1.0e-13;
  int errorCount = 0;

  if( std::abs( aAvgError / ai ) > epsilon ) {
    fprintf(stderr, "Error: validation check on View a failed.\n");
    errorCount++;
  }

  if( std::abs( bAvgError / bi ) > epsilon ) {
    fprintf(stderr, "Error: validation check on View b failed.\n");
    errorCount++;
  }

  if( std::abs( cAvgError / ci ) > epsilon ) {
    fprintf(stderr, "Error: validation check on View c failed.\n");
    errorCount++;
  }

  if( errorCount == 0 ) {
    printf("All solutions checked and verified.\n");
  }

  return errorCount;
}

int run_benchmark() {

  printf("Reports fastest timing per kernel\n");
  printf("Creating Views...\n");

  printf("Memory Sizes:\n");
  printf("- Array Size: %" PRIu64 "\n", static_cast<uint64_t>(STREAM_ARRAY_SIZE));
  printf("- Per Array: %12.2f MB\n", 1.0e-6 * (double) STREAM_ARRAY_SIZE * (double) sizeof(double));
  printf("- Total: %12.2f MB\n", 3.0e-6 * (double) STREAM_ARRAY_SIZE * (double) sizeof(double));

  printf("Benchmark kernels will be performed for %d iterations.\n", STREAM_NTIMES);

  printf(HLINE);

  StreamDeviceArray dev_a("a", STREAM_ARRAY_SIZE);
  StreamDeviceArray dev_b("b", STREAM_ARRAY_SIZE);
  StreamDeviceArray dev_c("c", STREAM_ARRAY_SIZE);

  StreamHostArray a = Kokkos::create_mirror_view(dev_a);
  StreamHostArray b = Kokkos::create_mirror_view(dev_b);
  StreamHostArray c = Kokkos::create_mirror_view(dev_c);

  const double scalar = 3.0;

  double copyTime = std::numeric_limits<double>::max();
  double scaleTime = std::numeric_limits<double>::max();
  double addTime = std::numeric_limits<double>::max();
  double triadTime = std::numeric_limits<double>::max();

  printf("Initializing Views...\n");

#if defined(KOKKOS_HAVE_OPENMP)
  Kokkos::parallel_for("init", Kokkos::RangePolicy<Kokkos::OpenMP>(0, STREAM_ARRAY_SIZE),
#else
  Kokkos::parallel_for("init", Kokkos::RangePolicy<Kokkos::Serial>(0, STREAM_ARRAY_SIZE),
#endif
    KOKKOS_LAMBDA(const int i) {

    a[i] = 1.0;
    b[i] = 2.0;
    c[i] = 0.0;
  });

  // Copy contents of a (from the host) to the dev_a (device)
  Kokkos::deep_copy(dev_a, a);
  Kokkos::deep_copy(dev_b, b);
  Kokkos::deep_copy(dev_c, c);

  double start;

  printf("Starting benchmarking...\n");

  for( StreamIndex k = 0; k < STREAM_NTIMES; ++k ) {
    start = now();
    perform_copy(dev_a, dev_b, dev_c);
    copyTime = std::min( copyTime, (now() - start) );

    start = now();
    perform_scale(dev_a, dev_b, dev_c, scalar);
    scaleTime = std::min( scaleTime, (now() - start) );

    start = now();
    perform_add(dev_a, dev_b, dev_c);
    addTime = std::min( addTime, (now() - start) );

    start = now();
    perform_triad(dev_a, dev_b, dev_c, scalar);
    triadTime = std::min( triadTime, (now() - start) );
  }

  Kokkos::deep_copy(a, dev_a);
  Kokkos::deep_copy(b, dev_b);
  Kokkos::deep_copy(c, dev_c);

  printf("Performing validation...\n");
  int rc = perform_validation(a, b, c, STREAM_ARRAY_SIZE, scalar);

  printf(HLINE);

  printf("Copy %11.2f MB/s\n",
         ( 1.0e-06 * 2.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / copyTime );
  printf("Scale %11.2f MB/s\n",
         ( 1.0e-06 * 2.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / scaleTime );
  printf("Add %11.2f MB/s\n",
         ( 1.0e-06 * 3.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / addTime );
  printf("Triad %11.2f MB/s\n",
         ( 1.0e-06 * 3.0 * (double) sizeof(double) * (double) STREAM_ARRAY_SIZE) / triadTime );

  printf(HLINE);

  return rc;
}

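The bandwidth figures printed above follow the STREAM convention: copy and scale move two arrays per element per iteration, while add and triad move three, and each kernel reports the fastest of its STREAM_NTIMES timings. For triad that works out to

```
MB/s = (3 * sizeof(double) * STREAM_ARRAY_SIZE) / (1.0e6 * triadTime)
```

where triadTime is the minimum measured kernel time in seconds.
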
int main(int argc, char* argv[]) {

  printf(HLINE);
  printf("Kokkos STREAM Benchmark\n");
  printf(HLINE);

  Kokkos::initialize(argc, argv);
  const int rc = run_benchmark();
  Kokkos::finalize();

  return rc;
}
@@ -125,18 +125,20 @@ function show_help {
echo " --openmp-ratio=N/D Ratio of the cpuset to use for OpenMP"
echo " Default: 1"
echo " --openmp-places=<Op> Op=threads|cores|sockets. Default: threads"
echo " --no-openmp-proc-bind Set OMP_PROC_BIND to false and unset OMP_PLACES"
echo " --force-openmp-num-threads=N"
echo " --openmp-num-threads=N"
echo " Override logic for selecting OMP_NUM_THREADS"
echo " --force-openmp-proc-bind=<OP>"
echo " --openmp-proc-bind=<OP>"
echo " Override logic for selecting OMP_PROC_BIND"
echo " --no-openmp-nested Set OMP_NESTED to false"
echo " --openmp-nested Set OMP_NESTED to true"
echo " --no-openmp-proc-bind Set OMP_PROC_BIND to false and unset OMP_PLACES"
echo " --output-prefix=<P> Save the output to files of the form"
echo " P.hpcbind.N, P.stdout.N and P.stderr.N where P is "
echo " the prefix and N is the rank (no spaces)"
echo " --output-mode=<Op> How console output should be handled."
echo " Options are all, rank0, and none. Default: rank0"
echo " --lstopo Show bindings in lstopo"
echo " --save-topology=<Xml> Save the topology to the given xml file"
echo " --load-topology=<Xml> Load a previously saved topology from an xml file"
echo " -v|--verbose Print bindings and relevant environment variables"
echo " -h|--help Show this message"
echo ""
@@ -189,7 +191,7 @@ HPCBIND_OPENMP_PLACES=${OMP_PLACES:-threads}
declare -i HPCBIND_OPENMP_PROC_BIND=1
HPCBIND_OPENMP_FORCE_NUM_THREADS=""
HPCBIND_OPENMP_FORCE_PROC_BIND=""
declare -i HPCBIND_OPENMP_NESTED=1
declare -i HPCBIND_OPENMP_NESTED=0
declare -i HPCBIND_VERBOSE=0

declare -i HPCBIND_LSTOPO=0
@@ -197,6 +199,9 @@ declare -i HPCBIND_LSTOPO=0
HPCBIND_OUTPUT_PREFIX=""
HPCBIND_OUTPUT_MODE="rank0"

HPCBIND_OUTPUT_TOPOLOGY=""
HPCBIND_INPUT_TOPOLOGY=""

declare -i HPCBIND_HAS_COMMAND=0

for i in "$@"; do
@@ -276,10 +281,22 @@ for i in "$@"; do
    HPCBIND_OPENMP_NESTED=0
    shift
    ;;
  --openmp-nested)
    HPCBIND_OPENMP_NESTED=1
    shift
    ;;
  --output-prefix=*)
    HPCBIND_OUTPUT_PREFIX="${i#*=}"
    shift
    ;;
  --save-topology=*)
    HPCBIND_OUTPUT_TOPOLOGY="${i#*=}"
    shift
    ;;
  --load-topology=*)
    HPCBIND_INPUT_TOPOLOGY="${i#*=}"
    shift
    ;;
  --output-mode=*)
    HPCBIND_OUTPUT_MODE="${i#*=}"
    #convert to lower case
@@ -327,24 +344,37 @@ elif [[ ${HPCBIND_QUEUE_RANK} -eq 0 ]]; then
  HPCBIND_TEE=1
fi

# Save the topology to the given xml file
if [[ "${HPCBIND_OUTPUT_TOPOLOGY}" != "" ]]; then
  if [[ ${HPCBIND_QUEUE_RANK} -eq 0 ]]; then
    lstopo-no-graphics "${HPCBIND_OUTPUT_TOPOLOGY}"
  else
    lstopo-no-graphics >/dev/null 2>&1
  fi
fi

# Load the topology from the given xml file
if [[ "${HPCBIND_INPUT_TOPOLOGY}" != "" ]]; then
  if [ -f ${HPCBIND_INPUT_TOPOLOGY} ]; then
    export HWLOC_XMLFILE="${HPCBIND_INPUT_TOPOLOGY}"
    export HWLOC_THISSYSTEM=1
  fi
fi

if [[ "${HPCBIND_OUTPUT_PREFIX}" == "" ]]; then
  HPCBIND_LOG=/dev/null
  HPCBIND_ERR=/dev/null
  HPCBIND_OUT=/dev/null
else
  if [[ ${HPCBIND_QUEUE_SIZE} -gt 0 ]]; then
    HPCBIND_STR_QUEUE_SIZE="${HPCBIND_QUEUE_SIZE}"
    HPCBIND_STR_QUEUE_RANK=$(printf %0*d ${#HPCBIND_STR_QUEUE_SIZE} ${HPCBIND_QUEUE_RANK})

    HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_STR_QUEUE_RANK}"
    HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_STR_QUEUE_RANK}"
    HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_STR_QUEUE_RANK}"
  else
    HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_QUEUE_RANK}"
    HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_QUEUE_RANK}"
    HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_QUEUE_RANK}"
  if [[ ${HPCBIND_QUEUE_SIZE} -le 0 ]]; then
    HPCBIND_QUEUE_SIZE=1
  fi
  HPCBIND_STR_QUEUE_SIZE="${HPCBIND_QUEUE_SIZE}"
  HPCBIND_STR_QUEUE_RANK=$(printf %0*d ${#HPCBIND_STR_QUEUE_SIZE} ${HPCBIND_QUEUE_RANK})

  HPCBIND_LOG="${HPCBIND_OUTPUT_PREFIX}.hpcbind.${HPCBIND_STR_QUEUE_RANK}"
  HPCBIND_ERR="${HPCBIND_OUTPUT_PREFIX}.stderr.${HPCBIND_STR_QUEUE_RANK}"
  HPCBIND_OUT="${HPCBIND_OUTPUT_PREFIX}.stdout.${HPCBIND_STR_QUEUE_RANK}"
  > ${HPCBIND_LOG}
fi

@@ -546,6 +576,8 @@ if [[ ${HPCBIND_TEE} -eq 0 || ${HPCBIND_VERBOSE} -eq 0 ]]; then
  hostname -s >> ${HPCBIND_LOG}
  echo "[HPCBIND]" >> ${HPCBIND_LOG}
  echo "${TMP_ENV}" | grep -E "^HPCBIND_" >> ${HPCBIND_LOG}
  echo "[HWLOC]" >> ${HPCBIND_LOG}
  echo "${TMP_ENV}" | grep -E "^HWLOC_" >> ${HPCBIND_LOG}
  echo "[CUDA]" >> ${HPCBIND_LOG}
  echo "${TMP_ENV}" | grep -E "^CUDA_" >> ${HPCBIND_LOG}
  echo "[OPENMP]" >> ${HPCBIND_LOG}
@@ -568,6 +600,8 @@ else
  hostname -s > >(tee -a ${HPCBIND_LOG})
  echo "[HPCBIND]" > >(tee -a ${HPCBIND_LOG})
  echo "${TMP_ENV}" | grep -E "^HPCBIND_" > >(tee -a ${HPCBIND_LOG})
  echo "[HWLOC]" > >(tee -a ${HPCBIND_LOG})
  echo "${TMP_ENV}" | grep -E "^HWLOC_" > >(tee -a ${HPCBIND_LOG})
  echo "[CUDA]" > >(tee -a ${HPCBIND_LOG})
  echo "${TMP_ENV}" | grep -E "^CUDA_" > >(tee -a ${HPCBIND_LOG})
  echo "[OPENMP]" > >(tee -a ${HPCBIND_LOG})

@@ -74,6 +74,9 @@ dry_run=0
host_only=0
host_only_args=""

# Just run version on host compiler
get_host_version=0

# Enable workaround for CUDA 6.5 for pragma ident
replace_pragma_ident=0

@@ -93,6 +96,9 @@ depfile_separate=0
depfile_output_arg=""
depfile_target_arg=""

# Option to remove duplicate libraries and object files
remove_duplicate_link_files=0

#echo "Arguments: $# $@"

while [ $# -gt 0 ]
@@ -106,10 +112,18 @@ do
  --host-only)
    host_only=1
    ;;
  #get the host version only
  --host-version)
    get_host_version=1
    ;;
  #replace '#pragma ident' with '#ident' this is needed to compile OpenMPI due to a configure script bug and a non standardized behaviour of pragma with macros
  --replace-pragma-ident)
    replace_pragma_ident=1
    ;;
  #remove duplicate link files
  --remove-duplicate-link-files)
    remove_duplicate_link_files=1
    ;;
  #handle source files to be compiled as cuda files
  *.cpp|*.cxx|*.cc|*.C|*.c++|*.cu)
    cpp_files="$cpp_files $1"
@@ -124,7 +138,12 @@ do
    fi
    ;;
  #Handle shared args (valid for both nvcc and the host compiler)
  -D*|-I*|-L*|-l*|-g|--help|--version|-E|-M|-shared)
  -D*)
    unescape_commas=`echo "$1" | sed -e 's/\\\,/,/g'`
    arg=`printf "%q" $unescape_commas`
    shared_args="$shared_args $arg"
    ;;
  -I*|-L*|-l*|-g|--help|--version|-E|-M|-shared|-w)
    shared_args="$shared_args $1"
    ;;
  #Handle compilation argument
@@ -152,7 +171,7 @@ do
    shift
    ;;
  #Handle known nvcc args
  -gencode*|--dryrun|--verbose|--keep|--keep-dir*|-G|--relocatable-device-code*|-lineinfo|-expt-extended-lambda|--resource-usage|-Xptxas*)
  --dryrun|--verbose|--keep|--keep-dir*|-G|--relocatable-device-code*|-lineinfo|-expt-extended-lambda|--resource-usage|-Xptxas*)
    cuda_args="$cuda_args $1"
    ;;
  #Handle more known nvcc args
@@ -164,8 +183,11 @@ do
    cuda_args="$cuda_args $1 $2"
    shift
    ;;
  -rdc=*|-maxrregcount*|--maxrregcount*)
    cuda_args="$cuda_args $1"
    ;;
  #Handle c++11
  --std=c++11|-std=c++11|--std=c++14|-std=c++14|--std=c++1z|-std=c++1z)
  --std=c++11|-std=c++11|--std=c++14|-std=c++14|--std=c++1y|-std=c++1y|--std=c++17|-std=c++17|--std=c++1z|-std=c++1z)
    if [ $stdcxx_applied -eq 1 ]; then
      echo "nvcc_wrapper - *warning* you have set multiple optimization flags (-std=c++1* or --std=c++1*), only the first is used because nvcc can only accept a single std setting"
    else
@@ -205,6 +227,15 @@ do
    fi
    shift
    ;;
  #Handle -+ (same as -x c++, specifically used for xl compilers, but mutually exclusive with -x. So replace it with -x c++)
  -+)
    if [ $first_xcompiler_arg -eq 1 ]; then
      xcompiler_args="-x,c++"
      first_xcompiler_arg=0
    else
      xcompiler_args="$xcompiler_args,-x,c++"
    fi
    ;;
  #Handle -ccbin (if its not set we can set it to a default value)
  -ccbin)
    cuda_args="$cuda_args $1 $2"
@@ -212,18 +243,39 @@ do
    host_compiler=$2
    shift
    ;;
  #Handle -arch argument (if its not set use a default
  -arch*)

  #Handle -arch argument (if its not set use a default) this is the version with = sign
  -arch*|-gencode*)
    cuda_args="$cuda_args $1"
    arch_set=1
    ;;
  #Handle -code argument (if its not set use a default) this is the version with = sign
  -code*)
    cuda_args="$cuda_args $1"
    ;;
  #Handle -arch argument (if its not set use a default) this is the version without = sign
  -arch|-gencode)
    cuda_args="$cuda_args $1 $2"
    arch_set=1
    shift
    ;;
  #Handle -code argument (if its not set use a default) this is the version without = sign
  -code)
    cuda_args="$cuda_args $1 $2"
    shift
    ;;
  #Handle -Xcudafe argument
  -Xcudafe)
    cuda_args="$cuda_args -Xcudafe $2"
    shift
    ;;
  #Handle -Xlinker argument
  -Xlinker)
    xlinker_args="$xlinker_args -Xlinker $2"
    shift
    ;;
  #Handle args that should be sent to the linker
  -Wl*)
  -Wl,*)
    xlinker_args="$xlinker_args -Xlinker ${1:4:${#1}}"
    host_linker_args="$host_linker_args ${1:4:${#1}}"
    ;;
@@ -256,6 +308,44 @@ do
  shift
done

# Only print host compiler version
if [ $get_host_version -eq 1 ]; then
  $host_compiler --version
  exit
fi

#Remove duplicate object files
if [ $remove_duplicate_link_files -eq 1 ]; then
  for obj in $object_files
  do
    object_files_reverse="$obj $object_files_reverse"
  done

  object_files_reverse_clean=""
  for obj in $object_files_reverse
  do
    exists=false
    for obj2 in $object_files_reverse_clean
    do
      if [ "$obj" == "$obj2" ]
      then
        exists=true
        echo "Exists: $obj"
      fi
    done
    if [ "$exists" == "false" ]
    then
      object_files_reverse_clean="$object_files_reverse_clean $obj"
    fi
  done

  object_files=""
  for obj in $object_files_reverse_clean
  do
    object_files="$obj $object_files"
  done
fi

#Add default host compiler if necessary
if [ $ccbin_set -ne 1 ]; then
  cuda_args="$cuda_args -ccbin $host_compiler"
@@ -328,10 +418,19 @@ fi

#Run compilation command
if [ $host_only -eq 1 ]; then
  if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
    echo "$host_command"
  fi
  $host_command
elif [ -n "$nvcc_depfile_command" ]; then
  if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
    echo "$nvcc_command && $nvcc_depfile_command"
  fi
  $nvcc_command && $nvcc_depfile_command
else
  if [ "$NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN" == "1" ] ; then
    echo "$nvcc_command"
  fi
  $nvcc_command
fi
error_code=$?

@@ -235,3 +235,7 @@ install(FILES
# Install the export set for use with the install-tree
INSTALL(EXPORT KokkosTargets DESTINATION
  "${INSTALL_CMAKE_DIR}")

# build and install pkgconfig file
CONFIGURE_FILE(core/src/kokkos.pc.in kokkos.pc @ONLY)
INSTALL(FILES ${CMAKE_CURRENT_BINARY_DIR}/kokkos.pc DESTINATION lib/pkgconfig)

@@ -47,7 +47,7 @@ function(set_kokkos_cxx_compiler)
  OUTPUT_VARIABLE INTERNAL_CXX_COMPILER_VERSION
  OUTPUT_STRIP_TRAILING_WHITESPACE)

string(REGEX MATCH "[0-9]+\.[0-9]+\.[0-9]+$"
string(REGEX MATCH "[0-9]+\\.[0-9]+\\.[0-9]+$"
  INTERNAL_CXX_COMPILER_VERSION ${INTERNAL_CXX_COMPILER_VERSION})
endif()


@@ -41,7 +41,6 @@ list(APPEND KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST
foreach(opt ${KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST})
  string(TOUPPER ${opt} OPT )
  IF(DEFINED Kokkos_ENABLE_${opt})
    MESSAGE("Kokkos_ENABLE_${opt} is defined!")
    IF(DEFINED KOKKOS_ENABLE_${OPT})
      IF(NOT ("${KOKKOS_ENABLE_${OPT}}" STREQUAL "${Kokkos_ENABLE_${opt}}"))
        IF(DEFINED KOKKOS_ENABLE_${OPT}_INTERNAL)
@@ -59,7 +58,6 @@ foreach(opt ${KOKKOS_INTERNAL_ENABLE_OPTIONS_LIST})
      ENDIF()
    ELSE()
      SET(KOKKOS_INTERNAL_ENABLE_${OPT}_DEFAULT ${Kokkos_ENABLE_${opt}})
      MESSAGE("set KOKKOS_INTERNAL_ENABLE_${OPT}_DEFAULT!")
    ENDIF()
  ENDIF()
endforeach()
@ -81,6 +79,7 @@ list(APPEND KOKKOS_ARCH_LIST
|
||||
ARMv80 # (HOST) ARMv8.0 Compatible CPU
|
||||
ARMv81 # (HOST) ARMv8.1 Compatible CPU
|
||||
ARMv8-ThunderX # (HOST) ARMv8 Cavium ThunderX CPU
|
||||
ARMv8-TX2 # (HOST) ARMv8 Cavium ThunderX2 CPU
|
||||
WSM # (HOST) Intel Westmere CPU
|
||||
SNB # (HOST) Intel Sandy/Ivy Bridge CPUs
|
||||
HSW # (HOST) Intel Haswell CPUs
|
||||
@ -123,11 +122,18 @@ list(APPEND KOKKOS_DEVICES_LIST
|
||||
# List of possible TPLs for Kokkos
|
||||
# From Makefile.kokkos: Options: hwloc,librt,experimental_memkind
|
||||
set(KOKKOS_USE_TPLS_LIST)
|
||||
if(APPLE)
|
||||
list(APPEND KOKKOS_USE_TPLS_LIST
|
||||
HWLOC # hwloc
|
||||
MEMKIND # experimental_memkind
|
||||
)
|
||||
else()
|
||||
list(APPEND KOKKOS_USE_TPLS_LIST
|
||||
HWLOC # hwloc
|
||||
LIBRT # librt
|
||||
MEMKIND # experimental_memkind
|
||||
)
|
||||
endif()
|
||||
# Map of cmake variables to Makefile variables
|
||||
set(KOKKOS_INTERNAL_HWLOC hwloc)
|
||||
set(KOKKOS_INTERNAL_LIBRT librt)
|
||||
@ -172,6 +178,7 @@ set(KOKKOS_INTERNAL_LAMBDA enable_lambda)
|
||||
|
||||
set(tmpr "\n ")
|
||||
string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_ARCH_DOCSTR "${KOKKOS_ARCH_LIST}")
|
||||
set(KOKKOS_INTERNAL_ARCH_DOCSTR "${tmpr}${KOKKOS_INTERNAL_ARCH_DOCSTR}")
|
||||
# This would be useful, but we use Foo_ENABLE mechanisms
|
||||
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_DEVICES_DOCSTR "${KOKKOS_DEVICES_LIST}")
|
||||
#string(REPLACE ";" ${tmpr} KOKKOS_INTERNAL_USE_TPLS_DOCSTR "${KOKKOS_USE_TPLS_LIST}")
|
||||
@ -269,7 +276,7 @@ set(KOKKOS_ENABLE_PROFILING_LOAD_PRINT ${KOKKOS_INTERNAL_ENABLE_PROFILING_LOAD_P
|
||||
set_kokkos_default_default(DEPRECATED_CODE ON)
|
||||
set(KOKKOS_ENABLE_DEPRECATED_CODE ${KOKKOS_INTERNAL_ENABLE_DEPRECATED_CODE_DEFAULT} CACHE BOOL "Enable deprecated code.")
|
||||
|
||||
set_kokkos_default_default(EXPLICIT_INSTANTIATION ON)
|
||||
set_kokkos_default_default(EXPLICIT_INSTANTIATION OFF)
|
||||
set(KOKKOS_ENABLE_EXPLICIT_INSTANTIATION ${KOKKOS_INTERNAL_ENABLE_EXPLICIT_INSTANTIATION_DEFAULT} CACHE BOOL "Enable explicit template instantiation.")
|
||||
|
||||
#-------------------------------------------------------------------------------
|
||||
|
||||
@@ -15,16 +15,16 @@

# Ensure that KOKKOS_ARCH is in the ARCH_LIST
if (KOKKOS_ARCH MATCHES ",")
message("-- Detected a comma in: KOKKOS_ARCH=${KOKKOS_ARCH}")
message("-- Detected a comma in: KOKKOS_ARCH=`${KOKKOS_ARCH}`")
message("-- Although we prefer KOKKOS_ARCH to be semicolon-delimited, we do allow")
message("-- comma-delimited values for compatibility with scripts (see github.com/trilinos/Trilinos/issues/2330)")
string(REPLACE "," ";" KOKKOS_ARCH "${KOKKOS_ARCH}")
message("-- Commas were changed to semicolons, now KOKKOS_ARCH=${KOKKOS_ARCH}")
message("-- Commas were changed to semicolons, now KOKKOS_ARCH=`${KOKKOS_ARCH}`")
endif()
foreach(arch ${KOKKOS_ARCH})
list(FIND KOKKOS_ARCH_LIST ${arch} indx)
if (indx EQUAL -1)
message(FATAL_ERROR "${arch} is not an accepted value for KOKKOS_ARCH."
message(FATAL_ERROR "`${arch}` is not an accepted value in KOKKOS_ARCH=`${KOKKOS_ARCH}`."
" Please pick from these choices: ${KOKKOS_INTERNAL_ARCH_DOCSTR}")
endif ()
endforeach()
@@ -130,7 +130,8 @@ string(REPLACE ";" ":" KOKKOS_INTERNAL_ADDTOPATH "${addpathl}")
# Set the KOKKOS_SETTINGS String -- this is the primary communication with the
# makefile configuration. See Makefile.kokkos

set(KOKKOS_SETTINGS KOKKOS_SRC_PATH=${KOKKOS_SRC_PATH})
set(KOKKOS_SETTINGS KOKKOS_CMAKE=yes)
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_SRC_PATH=${KOKKOS_SRC_PATH})
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_PATH=${KOKKOS_PATH})
set(KOKKOS_SETTINGS ${KOKKOS_SETTINGS} KOKKOS_INSTALL_PATH=${CMAKE_INSTALL_PREFIX})

@@ -241,17 +241,16 @@ elif [ "$MACHINE" = "white" ]; then

BASE_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>"
IBM_MODULE_LIST="<COMPILER_NAME>/xl/<COMPILER_VERSION>"
CUDA_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/5.4.0"
CUDA_MODULE_LIST2="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/6.3.0,ibm/xl/13.1.6"
CUDA_MODULE_LIST="<COMPILER_NAME>/<COMPILER_VERSION>,gcc/6.4.0,ibm/xl/16.1.0"

# Don't do pthread on white.
GCC_BUILD_LIST="OpenMP,Serial,OpenMP_Serial"

# Format: (compiler module-list build-list exe-name warning-flag)
COMPILERS=("gcc/5.4.0 $BASE_MODULE_LIST $IBM_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"ibm/13.1.6 $IBM_MODULE_LIST $IBM_BUILD_LIST xlC $IBM_WARNING_FLAGS"
"cuda/8.0.44 $CUDA_MODULE_LIST $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
"cuda/9.0.103 $CUDA_MODULE_LIST2 $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
"gcc/6.4.0 $BASE_MODULE_LIST $IBM_BUILD_LIST g++ $GCC_WARNING_FLAGS"
"ibm/16.1.0 $IBM_MODULE_LIST $IBM_BUILD_LIST xlC $IBM_WARNING_FLAGS"
"cuda/9.0.103 $CUDA_MODULE_LIST $CUDA_IBM_BUILD_LIST ${KOKKOS_PATH}/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
)

if [ -z "$ARCH_FLAG" ]; then
@@ -362,7 +361,7 @@ elif [ "$MACHINE" = "apollo" ]; then
"gcc/5.3.0 $BASE_MODULE_LIST "Serial" g++ $GCC_WARNING_FLAGS"
"intel/16.0.1 $BASE_MODULE_LIST "OpenMP" icpc $INTEL_WARNING_FLAGS"
"clang/3.9.0 $BASE_MODULE_LIST "Pthread_Serial" clang++ $CLANG_WARNING_FLAGS"
"clang/6.0 $CLANG_MODULE_LIST "Cuda_Pthread" clang++ $CUDA_WARNING_FLAGS"
"clang/6.0 $CLANG_MODULE_LIST "Cuda_Pthread,OpenMP" clang++ $CUDA_WARNING_FLAGS"
"cuda/9.1 $CUDA_MODULE_LIST "Cuda_OpenMP" $KOKKOS_PATH/bin/nvcc_wrapper $CUDA_WARNING_FLAGS"
)
else

@@ -96,6 +96,7 @@ template< class DataType ,
class Arg3Type = void>
class DualView : public ViewTraits< DataType , Arg1Type , Arg2Type, Arg3Type >
{
template< class , class , class , class > friend class DualView ;
public:
//! \name Typedefs for device types and various Kokkos::View specializations.
//@{
@@ -182,8 +183,20 @@ public:
//! \name Counters to keep track of changes ("modified" flags)
//@{

View<unsigned int,LayoutLeft,typename t_host::execution_space> modified_device;
View<unsigned int,LayoutLeft,typename t_host::execution_space> modified_host;
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
protected:
// modified_flags[0] -> host
// modified_flags[1] -> device
typedef View<unsigned int[2],LayoutLeft,Kokkos::HostSpace> t_modified_flags;
t_modified_flags modified_flags;

public:
#else
typedef View<unsigned int[2],LayoutLeft,typename t_host::execution_space> t_modified_flags;
typedef View<unsigned int,LayoutLeft,typename t_host::execution_space> t_modified_flag;
t_modified_flags modified_flags;
t_modified_flag modified_host,modified_device;
#endif

//@}
//! \name Constructors
@@ -194,10 +207,14 @@ public:
/// Both device and host View objects are constructed using their
/// default constructors. The "modified" flags are both initialized
/// to "unmodified."
DualView () :
modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device")),
modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{}
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
DualView () = default;
#else
DualView ():modified_flags (t_modified_flags("DualView::modified_flags")) {
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
}
#endif

/// \brief Constructor that allocates View objects on both host and device.
///
@@ -219,17 +236,24 @@ public:
const size_t n7 = KOKKOS_IMPL_CTOR_DEFAULT_ARG)
: d_view (label, n0, n1, n2, n3, n4, n5, n6, n7)
, h_view (create_mirror_view (d_view)) // without UVM, host View mirrors
, modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device"))
, modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
{}
, modified_flags (t_modified_flags("DualView::modified_flags"))
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
#endif
}

//! Copy constructor (shallow copy)
template<class SS, class LS, class DS, class MS>
DualView (const DualView<SS,LS,DS,MS>& src) :
d_view (src.d_view),
h_view (src.h_view),
modified_device (src.modified_device),
modified_host (src.modified_host)
modified_flags (src.modified_flags)
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
, modified_host(src.modified_host)
, modified_device(src.modified_device)
#endif
{}

//! Subview constructor
@@ -241,8 +265,11 @@ public:
)
: d_view( Kokkos::subview( src.d_view , arg0 , args ... ) )
, h_view( Kokkos::subview( src.h_view , arg0 , args ... ) )
, modified_device (src.modified_device)
, modified_host (src.modified_host)
, modified_flags (src.modified_flags)
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
, modified_host(src.modified_host)
, modified_device(src.modified_device)
#endif
{}

/// \brief Create DualView from existing device and host View objects.
@@ -258,8 +285,7 @@ public:
DualView (const t_dev& d_view_, const t_host& h_view_) :
d_view (d_view_),
h_view (h_view_),
modified_device (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_device")),
modified_host (View<unsigned int,LayoutLeft,typename t_host::execution_space> ("DualView::modified_host"))
modified_flags (t_modified_flags("DualView::modified_flags"))
{
if ( int(d_view.rank) != int(h_view.rank) ||
d_view.extent(0) != h_view.extent(0) ||
@@ -281,6 +307,10 @@ public:
d_view.span() != h_view.span() ) {
Kokkos::Impl::throw_runtime_exception("DualView constructed with incompatible views");
}
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
modified_host = t_modified_flag(modified_flags,0);
modified_device = t_modified_flag(modified_flags,1);
#endif
}

//@}
@@ -316,6 +346,30 @@ public:
t_dev,
t_host>::type& view () const
{
#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
constexpr bool device_is_memspace = std::is_same<Device,typename Device::memory_space>::value;
constexpr bool device_is_execspace = std::is_same<Device,typename Device::execution_space>::value;
constexpr bool device_exec_is_t_dev_exec = std::is_same<typename Device::execution_space,typename t_dev::execution_space>::value;
constexpr bool device_mem_is_t_dev_mem = std::is_same<typename Device::memory_space,typename t_dev::memory_space>::value;
constexpr bool device_exec_is_t_host_exec = std::is_same<typename Device::execution_space,typename t_host::execution_space>::value;
constexpr bool device_mem_is_t_host_mem = std::is_same<typename Device::memory_space,typename t_host::memory_space>::value;
constexpr bool device_is_t_host_device = std::is_same<typename Device::execution_space,typename t_host::device_type>::value;
constexpr bool device_is_t_dev_device = std::is_same<typename Device::memory_space,typename t_host::device_type>::value;

static_assert(
device_is_t_dev_device || device_is_t_host_device ||
(device_is_memspace && (device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ) ||
(device_is_execspace && (device_exec_is_t_dev_exec || device_exec_is_t_host_exec) ) ||
(
(!device_is_execspace && !device_is_memspace) && (
(device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ||
(device_exec_is_t_dev_exec || device_exec_is_t_host_exec)
)
)
,
"Template parameter to .view() must exactly match one of the DualView's device types or one of the execution or memory spaces");
#endif

return Impl::if_c<
std::is_same<
typename t_dev::memory_space,
@@ -324,6 +378,72 @@ public:
t_host >::select (d_view , h_view);
}

KOKKOS_INLINE_FUNCTION
t_host view_host() const {
return h_view;
}

KOKKOS_INLINE_FUNCTION
t_dev view_device() const {
return d_view;
}

template<class Device>
static int get_device_side() {
constexpr bool device_is_memspace = std::is_same<Device,typename Device::memory_space>::value;
constexpr bool device_is_execspace = std::is_same<Device,typename Device::execution_space>::value;
constexpr bool device_exec_is_t_dev_exec = std::is_same<typename Device::execution_space,typename t_dev::execution_space>::value;
constexpr bool device_mem_is_t_dev_mem = std::is_same<typename Device::memory_space,typename t_dev::memory_space>::value;
constexpr bool device_exec_is_t_host_exec = std::is_same<typename Device::execution_space,typename t_host::execution_space>::value;
constexpr bool device_mem_is_t_host_mem = std::is_same<typename Device::memory_space,typename t_host::memory_space>::value;
constexpr bool device_is_t_host_device = std::is_same<typename Device::execution_space,typename t_host::device_type>::value;
constexpr bool device_is_t_dev_device = std::is_same<typename Device::memory_space,typename t_host::device_type>::value;

#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
static_assert(
device_is_t_dev_device || device_is_t_host_device ||
(device_is_memspace && (device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ) ||
(device_is_execspace && (device_exec_is_t_dev_exec || device_exec_is_t_host_exec) ) ||
(
(!device_is_execspace && !device_is_memspace) && (
(device_mem_is_t_dev_mem || device_mem_is_t_host_mem) ||
(device_exec_is_t_dev_exec || device_exec_is_t_host_exec)
)
)
,
"Template parameter to .sync() must exactly match one of the DualView's device types or one of the execution or memory spaces");
#endif

#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
int dev = -1;
#else
int dev = 0;
#endif
if(device_is_t_dev_device) dev = 1;
else if(device_is_t_host_device) dev = 0;
else {
if(device_is_memspace) {
if(device_mem_is_t_dev_mem) dev = 1;
if(device_mem_is_t_host_mem) dev = 0;
if(device_mem_is_t_host_mem && device_mem_is_t_dev_mem) dev = -1;
}
if(device_is_execspace) {
if(device_exec_is_t_dev_exec) dev = 1;
if(device_exec_is_t_host_exec) dev = 0;
if(device_exec_is_t_host_exec && device_exec_is_t_dev_exec) dev = -1;
}
if(!device_is_execspace && !device_is_memspace) {
if(device_mem_is_t_dev_mem) dev = 1;
if(device_mem_is_t_host_mem) dev = 0;
if(device_mem_is_t_host_mem && device_mem_is_t_dev_mem) dev = -1;
if(device_exec_is_t_dev_exec) dev = 1;
if(device_exec_is_t_host_exec) dev = 0;
if(device_exec_is_t_host_exec && device_exec_is_t_dev_exec) dev = -1;
}
}
return dev;
}

/// \brief Update data on device or host only if data in the other
/// space has been marked as modified.
///
@@ -347,23 +467,20 @@ public:
( std::is_same< Device , int>::value)
, int >::type& = 0)
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value ,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return;

if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
int dev = get_device_side<Device>();

if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
deep_copy (d_view, h_view);
modified_host() = modified_device() = 0;
modified_flags(0) = modified_flags(1) = 0;
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0) { // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
deep_copy (h_view, d_view);
modified_host() = modified_device() = 0;
modified_flags(0) = modified_flags(1) = 0;
}
}
if(std::is_same<typename t_host::memory_space,typename t_dev::memory_space>::value) {
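
For orientation (not part of the diff): after this rewrite, the templated `modify`/`sync` calls resolve the target side through `get_device_side` and the shared `modified_flags` pair, and they now accept a memory space or an execution space as the template argument, not just a full device type. A minimal usage sketch, assuming a device backend is enabled (on a host-only build both sides alias and the calls degenerate to no-ops); all names and sizes below are illustrative:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Sketch only: fill on the host, then migrate to the device side.
// Assumes Kokkos::initialize() has already been called.
void fill_and_migrate() {
  const size_t n = 100;
  Kokkos::DualView<double*> dv("dv", n);

  // Write through the host view, then record the modification.
  for (size_t i = 0; i < n; ++i) dv.h_view(i) = static_cast<double>(i);
  dv.modify<Kokkos::HostSpace>();            // a memory space is now accepted

  // Copies h_view -> d_view only because the host counter is ahead.
  dv.sync<Kokkos::DefaultExecutionSpace>();  // an execution space works too
}
```
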
@@ -378,46 +495,71 @@ public:
( std::is_same< Device , int>::value)
, int >::type& = 0 )
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
unsigned int,
unsigned int>::select (1, 0);
if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
if(modified_flags.data()==NULL) return;

int dev = get_device_side<Device>();

if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
Impl::throw_runtime_exception("Calling sync on a DualView with a const datatype.");
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0){ // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
Impl::throw_runtime_exception("Calling sync on a DualView with a const datatype.");
}
}
}

void sync_host() {
if( ! std::is_same< typename traits::data_type , typename traits::non_const_data_type>::value )
Impl::throw_runtime_exception("Calling sync_host on a DualView with a const datatype.");
if(modified_flags.data()==NULL) return;
if(modified_flags(1) > modified_flags(0)) {
deep_copy (h_view, d_view);
modified_flags(1) = modified_flags(0) = 0;
}
}

void sync_device() {
if( ! std::is_same< typename traits::data_type , typename traits::non_const_data_type>::value )
Impl::throw_runtime_exception("Calling sync_device on a DualView with a const datatype.");
if(modified_flags.data()==NULL) return;
if(modified_flags(0) > modified_flags(1)) {
deep_copy (d_view, h_view);
modified_flags(1) = modified_flags(0) = 0;
}
}

template<class Device>
bool need_sync() const
{
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value ,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return false;
int dev = get_device_side<Device>();

if (dev) { // if Device is the same as DualView's device type
if ((modified_host () > 0) && (modified_host () >= modified_device ())) {
if (dev == 1) { // if Device is the same as DualView's device type
if ((modified_flags(0) > 0) && (modified_flags(0) >= modified_flags(1))) {
return true;
}
} else { // hopefully Device is the same as DualView's host type
if ((modified_device () > 0) && (modified_device () >= modified_host ())) {
}
if (dev == 0){ // hopefully Device is the same as DualView's host type
if ((modified_flags(1) > 0) && (modified_flags(1) >= modified_flags(0))) {
return true;
}
}
return false;
}

inline bool need_sync_host() const {
if(modified_flags.data()==NULL) return false;
return modified_flags(0)<modified_flags(1);
}

inline bool need_sync_device() const {
if(modified_flags.data()==NULL) return false;
return modified_flags(1)<modified_flags(0);
}

/// \brief Mark data as modified on the given device \c Device.
///
/// If \c Device is the same as this DualView's device type, then
@@ -425,26 +567,22 @@ public:
/// data as modified.
template<class Device>
void modify () {
const unsigned int dev =
Impl::if_c<
std::is_same<
typename t_dev::memory_space,
typename Device::memory_space>::value,
unsigned int,
unsigned int>::select (1, 0);
if(modified_flags.data()==NULL) return;
int dev = get_device_side<Device>();

if (dev) { // if Device is the same as DualView's device type
if (dev == 1) { // if Device is the same as DualView's device type
// Increment the device's modified count.
modified_device () = (modified_device () > modified_host () ?
modified_device () : modified_host ()) + 1;
} else { // hopefully Device is the same as DualView's host type
modified_flags(1) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
}
if (dev == 0) { // hopefully Device is the same as DualView's host type
// Increment the host's modified count.
modified_host () = (modified_device () > modified_host () ?
modified_device () : modified_host ()) + 1;
modified_flags(0) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
}

#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_host() && modified_device()) {
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
@@ -455,6 +593,45 @@ public:
#endif
}

inline void modify_host() {
if(modified_flags.data()!=NULL) {
modified_flags(0) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify_host ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
msg += d_view.label();
msg += "\"\n";
Kokkos::abort(msg.c_str());
}
#endif
}
}

inline void modify_device() {
if(modified_flags.data()!=NULL) {
modified_flags(1) = (modified_flags(1) > modified_flags(0) ?
modified_flags(1) : modified_flags(0)) + 1;
#ifdef KOKKOS_ENABLE_DEBUG_DUALVIEW_MODIFY_CHECK
if (modified_flags(0) && modified_flags(1)) {
std::string msg = "Kokkos::DualView::modify_device ERROR: ";
msg += "Concurrent modification of host and device views ";
msg += "in DualView \"";
msg += d_view.label();
msg += "\"\n";
Kokkos::abort(msg.c_str());
}
#endif
}
}

inline void clear_sync_state() {
if(modified_flags.data()!=NULL)
modified_flags(1) = modified_flags(0) = 0;
}

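The block above is the new non-templated interface from issue #1858. A hypothetical round trip using it; the view name, type, and sizes are illustrative, and on a host-only build the deep copies are harmless self-copies:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Hypothetical round trip with the non-templated calls.
void round_trip() {
  Kokkos::DualView<int*> dv("dv", 10);

  dv.h_view(0) = 42;
  dv.modify_host();            // bump the host modified counter
  if (dv.need_sync_device())   // true here: the host counter is ahead
    dv.sync_device();          // deep_copy h_view -> d_view, reset both flags

  // ... a device kernel writes dv.d_view here ...
  dv.modify_device();
  dv.sync_host();              // copies back only if the device is ahead

  dv.clear_sync_state();       // force both sides to count as in sync
}
```
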
//@}
//! \name Methods for reallocating or resizing the View objects.
//@{
@@ -476,7 +653,10 @@ public:
h_view = create_mirror_view( d_view );

/* Reset dirty flags */
modified_device() = modified_host() = 0;
if(modified_flags.data()==NULL) {
modified_flags = t_modified_flags("DualView::modified_flags");
} else
modified_flags(1) = modified_flags(0) = 0;
}

/// \brief Resize both views, copying old contents into new if necessary.
@@ -491,13 +671,16 @@ public:
const size_t n5 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
const size_t n6 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ,
const size_t n7 = KOKKOS_IMPL_CTOR_DEFAULT_ARG ) {
if(modified_device() >= modified_host()) {
if(modified_flags.data()==NULL) {
modified_flags = t_modified_flags("DualView::modified_flags");
}
if(modified_flags(1) >= modified_flags(0)) {
/* Resize on Device */
::Kokkos::resize(d_view,n0,n1,n2,n3,n4,n5,n6,n7);
h_view = create_mirror_view( d_view );

/* Mark Device copy as modified */
modified_device() = modified_device()+1;
modified_flags(1) = modified_flags(1)+1;

} else {
/* Realloc on Device */
@@ -525,7 +708,7 @@ public:
d_view = create_mirror_view( typename t_dev::execution_space(), h_view );

/* Mark Host copy as modified */
modified_host() = modified_host()+1;
modified_flags(0) = modified_flags(0)+1;
}
}

@@ -649,7 +832,10 @@ void
deep_copy (DualView<DT,DL,DD,DM> dst, // trust me, this must not be a reference
const DualView<ST,SL,SD,SM>& src )
{
if (src.modified_device () >= src.modified_host ()) {
if(src.modified_flags.data()==NULL || dst.modified_flags.data()==NULL) {
return deep_copy(dst.d_view, src.d_view);
}
if (src.modified_flags(1) >= src.modified_flags(0)) {
deep_copy (dst.d_view, src.d_view);
dst.template modify<typename DualView<DT,DL,DD,DM>::device_type> ();
} else {
@@ -666,7 +852,10 @@ deep_copy (const ExecutionSpace& exec ,
DualView<DT,DL,DD,DM> dst, // trust me, this must not be a reference
const DualView<ST,SL,SD,SM>& src )
{
if (src.modified_device () >= src.modified_host ()) {
if(src.modified_flags.data()==NULL || dst.modified_flags.data()==NULL) {
return deep_copy(exec, dst.d_view, src.d_view);
}
if (src.modified_flags(1) >= src.modified_flags(0)) {
deep_copy (exec, dst.d_view, src.d_view);
dst.template modify<typename DualView<DT,DL,DD,DM>::device_type> ();
} else {

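A sketch of how the rewritten `deep_copy` specialization behaves; the branch comment below follows the flag comparison shown in the hunk, and every name is illustrative rather than taken from the commit:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

// Illustrative only: which side deep_copy(dst, src) picks.
void copy_dualviews() {
  Kokkos::DualView<float*> src("src", 64);
  Kokkos::DualView<float*> dst("dst", 64);

  src.h_view(3) = 1.5f;
  src.modify_host();    // src's host counter now leads its device counter

  // modified_flags(1) >= modified_flags(0) is false, so the else-branch
  // runs: the host views are copied and dst is marked modified on the host.
  Kokkos::deep_copy(dst, src);
  dst.sync_device();    // propagate to dst's device side if needed
}
```
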
@@ -64,7 +64,7 @@ namespace Impl {
template <typename Specialize>
struct DynRankDimTraits {

enum : size_t{unspecified =KOKKOS_INVALID_INDEX};
enum : size_t{unspecified = KOKKOS_INVALID_INDEX};

// Compute the rank of the view from the nonzero dimension arguments.
KOKKOS_INLINE_FUNCTION
@@ -384,8 +384,8 @@ public:
// Removed dimension checks...

typedef typename DstType::offset_type dst_offset_type ;
dst.m_map.m_offset = dst_offset_type(std::integral_constant<unsigned,0>() , src.layout() ); //Check this for integer input1 for padding, etc
dst.m_map.m_handle = Kokkos::Impl::ViewDataHandle< DstTraits >::assign( src.m_map.m_handle , src.m_track );
dst.m_map.m_impl_offset = dst_offset_type(std::integral_constant<unsigned,0>() , src.layout() ); //Check this for integer input1 for padding, etc
dst.m_map.m_impl_handle = Kokkos::Impl::ViewDataHandle< DstTraits >::assign( src.m_map.m_impl_handle , src.m_track );
dst.m_track.assign( src.m_track , DstTraits::is_managed );
dst.m_rank = src.Rank ;
}
@@ -565,10 +565,14 @@ public:

//----------------------------------------
// Allow specializations to query their specialized map

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::ViewMapping< traits , void > &
implementation_map() const { return m_map ; }
#endif
KOKKOS_INLINE_FUNCTION
const Kokkos::Impl::ViewMapping< traits , void > &
impl_map() const { return m_map ; }

//----------------------------------------

@@ -624,7 +628,7 @@ public:
reference_type operator()() const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (0 , this->rank(), m_track, m_map) )
return implementation_map().reference();
return impl_map().reference();
//return m_map.reference(0,0,0,0,0,0,0);
}

@@ -647,7 +651,7 @@ public:
typename std::enable_if< !std::is_same<typename drvtraits::value_type, typename drvtraits::scalar_array_type>::value && std::is_integral<iType>::value, reference_type>::type
operator[](const iType & i0) const
{
// auto map = implementation_map();
// auto map = impl_map();
const size_t dim_scalar = m_map.dimension_scalar();
const size_t bytes = this->span() / dim_scalar;

@@ -785,7 +789,7 @@ public:
reference_type access() const
{
KOKKOS_IMPL_VIEW_OPERATOR_VERIFY( (0 , this->rank(), m_track, m_map) )
return implementation_map().reference();
return impl_map().reference();
//return m_map.reference(0,0,0,0,0,0,0);
}

@@ -1004,7 +1008,7 @@ public:

//----------------------------------------
// Allocation according to allocation properties and array layout
// unused arg_layout dimensions must be set toKOKKOS_INVALID_INDEX so that rank deduction can properly take place
// unused arg_layout dimensions must be set to KOKKOS_INVALID_INDEX so that rank deduction can properly take place
template< class ... P >
explicit inline
DynRankView( const Kokkos::Impl::ViewCtorProp< P ... > & arg_prop
@@ -1179,7 +1183,7 @@ public:
: DynRankView( Kokkos::Impl::ViewCtorProp< std::string >( arg_label )
, typename traits::array_layout
( arg_N0 , arg_N1 , arg_N2 , arg_N3 , arg_N4 , arg_N5 , arg_N6 , arg_N7 )
)
)
{}

// For backward compatibility
@@ -1189,8 +1193,7 @@ public:
, const typename traits::array_layout & arg_layout
)
: DynRankView( Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing )

, Impl::DynRankDimTraits<typename traits::specialize>::createLayout(arg_layout)
, arg_layout
)
{}

@@ -1205,7 +1208,9 @@ public:
, const size_t arg_N6 =KOKKOS_INVALID_INDEX
, const size_t arg_N7 =KOKKOS_INVALID_INDEX
)
: DynRankView(Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing ), arg_N0, arg_N1, arg_N2, arg_N3, arg_N4, arg_N5, arg_N6, arg_N7 )
: DynRankView(Kokkos::Impl::ViewCtorProp< std::string , Kokkos::Impl::WithoutInitializing_t >( arg_prop.label , Kokkos::WithoutInitializing )
, typename traits::array_layout(arg_N0, arg_N1, arg_N2, arg_N3, arg_N4, arg_N5, arg_N6, arg_N7)
)
{}

//----------------------------------------
@@ -1445,30 +1450,30 @@ public:
ret_type dst ;

const SubviewExtents< 7 , rank > extents =
ExtentGenerator< Args ... >::generator( src.m_map.m_offset.m_dim , args... ) ;
ExtentGenerator< Args ... >::generator( src.m_map.m_impl_offset.m_dim , args... ) ;

dst_offset_type tempdst( src.m_map.m_offset , extents ) ;
dst_offset_type tempdst( src.m_map.m_impl_offset , extents ) ;

dst.m_track = src.m_track ;

dst.m_map.m_offset.m_dim.N0 = tempdst.m_dim.N0 ;
dst.m_map.m_offset.m_dim.N1 = tempdst.m_dim.N1 ;
dst.m_map.m_offset.m_dim.N2 = tempdst.m_dim.N2 ;
dst.m_map.m_offset.m_dim.N3 = tempdst.m_dim.N3 ;
dst.m_map.m_offset.m_dim.N4 = tempdst.m_dim.N4 ;
dst.m_map.m_offset.m_dim.N5 = tempdst.m_dim.N5 ;
dst.m_map.m_offset.m_dim.N6 = tempdst.m_dim.N6 ;
dst.m_map.m_impl_offset.m_dim.N0 = tempdst.m_dim.N0 ;
dst.m_map.m_impl_offset.m_dim.N1 = tempdst.m_dim.N1 ;
dst.m_map.m_impl_offset.m_dim.N2 = tempdst.m_dim.N2 ;
dst.m_map.m_impl_offset.m_dim.N3 = tempdst.m_dim.N3 ;
dst.m_map.m_impl_offset.m_dim.N4 = tempdst.m_dim.N4 ;
dst.m_map.m_impl_offset.m_dim.N5 = tempdst.m_dim.N5 ;
dst.m_map.m_impl_offset.m_dim.N6 = tempdst.m_dim.N6 ;

dst.m_map.m_offset.m_stride.S0 = tempdst.m_stride.S0 ;
dst.m_map.m_offset.m_stride.S1 = tempdst.m_stride.S1 ;
dst.m_map.m_offset.m_stride.S2 = tempdst.m_stride.S2 ;
dst.m_map.m_offset.m_stride.S3 = tempdst.m_stride.S3 ;
dst.m_map.m_offset.m_stride.S4 = tempdst.m_stride.S4 ;
dst.m_map.m_offset.m_stride.S5 = tempdst.m_stride.S5 ;
dst.m_map.m_offset.m_stride.S6 = tempdst.m_stride.S6 ;
dst.m_map.m_impl_offset.m_stride.S0 = tempdst.m_stride.S0 ;
dst.m_map.m_impl_offset.m_stride.S1 = tempdst.m_stride.S1 ;
dst.m_map.m_impl_offset.m_stride.S2 = tempdst.m_stride.S2 ;
dst.m_map.m_impl_offset.m_stride.S3 = tempdst.m_stride.S3 ;
dst.m_map.m_impl_offset.m_stride.S4 = tempdst.m_stride.S4 ;
dst.m_map.m_impl_offset.m_stride.S5 = tempdst.m_stride.S5 ;
dst.m_map.m_impl_offset.m_stride.S6 = tempdst.m_stride.S6 ;

dst.m_map.m_handle = dst_handle_type( src.m_map.m_handle +
src.m_map.m_offset( extents.domain_offset(0)
dst.m_map.m_impl_handle = dst_handle_type( src.m_map.m_impl_handle +
src.m_map.m_impl_offset( extents.domain_offset(0)
, extents.domain_offset(1)
, extents.domain_offset(2)
, extents.domain_offset(3)
@@ -1896,6 +1901,7 @@ inline
typename DynRankView<T,P...>::HostMirror
create_mirror( const DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
! std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
@@ -1914,6 +1920,7 @@ inline
typename DynRankView<T,P...>::HostMirror
create_mirror( const DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
, Kokkos::LayoutStride >::value
>::type * = 0
@@ -1929,7 +1936,11 @@ create_mirror( const DynRankView<T,P...> & src

// Create a mirror in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRVType<Space,T,P ...>::view_type create_mirror(const Space& , const Kokkos::DynRankView<T,P...> & src) {
typename Impl::MirrorDRVType<Space,T,P ...>::view_type
create_mirror(const Space& , const Kokkos::DynRankView<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value
>::type * = 0) {
return typename Impl::MirrorDRVType<Space,T,P ...>::view_type(src.label(), Impl::reconstructLayout(src.layout(), src.rank()) );
}

@@ -1985,6 +1996,29 @@ create_mirror_view(const Space& , const Kokkos::DynRankView<T,P...> & src
return typename Impl::MirrorDRViewType<Space,T,P ...>::view_type(src.label(), Impl::reconstructLayout(src.layout(), src.rank()) );
}

// Create a mirror view and deep_copy in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::DynRankView<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<Impl::MirrorDRViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
(void)name;
return src;
}

// Create a mirror view and deep_copy in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorDRViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::DynRankView<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<!Impl::MirrorDRViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
using Mirror = typename Impl::MirrorDRViewType<Space,T,P ...>::view_type;
std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror( Kokkos::ViewAllocateWithoutInitializing(label), Impl::reconstructLayout(src.layout(), src.rank()) );
deep_copy(mirror, src);
return mirror;
}

} //end Kokkos

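The two overloads above add `create_mirror_view_and_copy` for DynRankView (issue #1651 in the changelog). A hedged usage sketch, with illustrative names; on a host-only build the second call also hits the same-memory-space overload and simply returns the input:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DynRankView.hpp>

// Hedged sketch of the new overloads; names are illustrative.
void mirror_and_copy() {
  Kokkos::DynRankView<double, Kokkos::HostSpace> a("A", 10, 10);  // rank 2
  a(5, 5) = 42.0;

  // Same memory space: returns `a` itself, no allocation and no copy.
  auto a_same = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), a);

  // Different memory space: allocates without initializing, reconstructs
  // the layout and rank, deep_copies, and honors the optional label.
  auto a_dev = Kokkos::create_mirror_view_and_copy(
      Kokkos::DefaultExecutionSpace::memory_space(), a, "A_dev");
}
```
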
lib/kokkos/containers/src/Kokkos_OffsetView.hpp (new file, 1895 lines)
File diff suppressed because it is too large
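
Since the new header's diff is suppressed, here is a small sketch of the capability it adds, OffsetView with a non-zero begin index (changelog issue #567). The API shown matches the unit test that appears further below; everything else is illustrative:

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_OffsetView.hpp>

// Sketch of an OffsetView with non-zero begin indices.
void offsetview_basics() {
  // A 2D view indexed over [-1,3] x [-2,2] instead of starting at zero.
  Kokkos::Experimental::OffsetView<double**, Kokkos::HostSpace>
      ov("ov", {-1, 3}, {-2, 2});

  for (int i = ov.begin(0); i < ov.end(0); ++i)   // begin(0) == -1, end(0) == 4
    for (int j = ov.begin(1); j < ov.end(1); ++j)
      ov(i, j) = 10.0 * i + j;                    // negative indices are valid
}
```
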
@@ -47,7 +47,9 @@
#include <string>
#include <vector>

#include <Kokkos_Core.hpp>
#include <Kokkos_View.hpp>
#include <Kokkos_Parallel.hpp>
#include <Kokkos_Parallel_Reduce.hpp>

namespace Kokkos {

@@ -86,14 +86,13 @@ public:
vector():DV() {
_size = 0;
_extra_storage = 1.1;
DV::modified_host() = 1;
}

vector(int n, Scalar val=Scalar()):DualView<Scalar*,LayoutLeft,Arg1Type>("Vector",size_t(n*(1.1))) {
_size = n;
_extra_storage = 1.1;
DV::modified_host() = 1;
DV::modified_flags(0) = 1;

assign(n,val);
}
@@ -119,16 +118,16 @@ public:

/* Assign value either on host or on device */

if( DV::modified_host() >= DV::modified_device() ) {
if( DV::template need_sync<typename DV::t_dev::device_type>() ) {
set_functor_host f(DV::h_view,val);
parallel_for(n,f);
DV::t_host::execution_space::fence();
DV::modified_host()++;
DV::template modify<typename DV::t_host::device_type>();
} else {
set_functor f(DV::d_view,val);
parallel_for(n,f);
DV::t_dev::execution_space::fence();
DV::modified_device()++;
DV::template modify<typename DV::t_dev::device_type>();
}
}

@@ -137,7 +136,8 @@ public:
}

void push_back(Scalar val) {
DV::modified_host()++;
DV::template sync<typename DV::t_host::device_type>();
DV::template modify<typename DV::t_host::device_type>();
if(_size == span()) {
size_t new_size = _size*_extra_storage;
if(new_size == _size) new_size++;
@@ -247,10 +247,10 @@ public:
}

void on_host() {
DV::modified_host() = DV::modified_device() + 1;
DV::template modify<typename DV::t_host::device_type>();
}
void on_device() {
DV::modified_device() = DV::modified_host() + 1;
DV::template modify<typename DV::t_dev::device_type>();
}

void set_overallocation(float extra) {

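The `Kokkos::vector` changes above replace direct counter arithmetic on `modified_host()`/`modified_device()` with the portable `modify`/`sync` calls. A generic migration sketch for user code that did the same thing; the helper name is hypothetical:

```cpp
#include <Kokkos_DualView.hpp>

// Hypothetical helper: old code bumped the counter directly, e.g.
//   dv.modified_host() = dv.modified_device() + 1;
// The portable replacement, as used by Kokkos::vector above, is:
template <class DV>
void mark_host_fresh(DV& dv) {
  dv.template modify<typename DV::t_host::device_type>();
}
```
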
@@ -23,6 +23,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
threads/TestThreads_DynRankViewAPI_rank12345.cpp
threads/TestThreads_DynRankViewAPI_rank67.cpp
threads/TestThreads_ErrorReporter.cpp
threads/TestThreads_OffsetView.cpp
threads/TestThreads_ScatterView.cpp
threads/TestThreads_StaticCrsGraph.cpp
threads/TestThreads_UnorderedMap.cpp
@@ -47,6 +48,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
serial/TestSerial_DynRankViewAPI_rank12345.cpp
serial/TestSerial_DynRankViewAPI_rank67.cpp
serial/TestSerial_ErrorReporter.cpp
serial/TestSerial_OffsetView.cpp
serial/TestSerial_ScatterView.cpp
serial/TestSerial_StaticCrsGraph.cpp
serial/TestSerial_UnorderedMap.cpp
@@ -71,6 +73,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
openmp/TestOpenMP_DynRankViewAPI_rank12345.cpp
openmp/TestOpenMP_DynRankViewAPI_rank67.cpp
openmp/TestOpenMP_ErrorReporter.cpp
openmp/TestOpenMP_OffsetView.cpp
openmp/TestOpenMP_ScatterView.cpp
openmp/TestOpenMP_StaticCrsGraph.cpp
openmp/TestOpenMP_UnorderedMap.cpp
@@ -95,6 +98,7 @@ TRIBITS_ADD_EXECUTABLE_AND_TEST(
cuda/TestCuda_DynRankViewAPI_rank12345.cpp
cuda/TestCuda_DynRankViewAPI_rank67.cpp
cuda/TestCuda_ErrorReporter.cpp
cuda/TestCuda_OffsetView.cpp
cuda/TestCuda_ScatterView.cpp
cuda/TestCuda_StaticCrsGraph.cpp
cuda/TestCuda_UnorderedMap.cpp

@@ -39,6 +39,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_CUDA), 1)
OBJ_CUDA += TestCuda_DynRankViewAPI_rank12345.o
OBJ_CUDA += TestCuda_DynRankViewAPI_rank67.o
OBJ_CUDA += TestCuda_ErrorReporter.o
OBJ_CUDA += TestCuda_OffsetView.o
OBJ_CUDA += TestCuda_ScatterView.o
OBJ_CUDA += TestCuda_StaticCrsGraph.o
OBJ_CUDA += TestCuda_UnorderedMap.o
@@ -57,6 +58,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_ROCM), 1)
OBJ_ROCM += TestROCm_DynRankViewAPI_rank12345.o
OBJ_ROCM += TestROCm_DynRankViewAPI_rank67.o
OBJ_ROCM += TestROCm_ErrorReporter.o
OBJ_ROCM += TestROCm_OffsetView.o
OBJ_ROCM += TestROCm_ScatterView.o
OBJ_ROCM += TestROCm_StaticCrsGraph.o
OBJ_ROCM += TestROCm_UnorderedMap.o
@@ -75,6 +77,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_PTHREADS), 1)
OBJ_THREADS += TestThreads_DynRankViewAPI_rank12345.o
OBJ_THREADS += TestThreads_DynRankViewAPI_rank67.o
OBJ_THREADS += TestThreads_ErrorReporter.o
OBJ_THREADS += TestThreads_OffsetView.o
OBJ_THREADS += TestThreads_ScatterView.o
OBJ_THREADS += TestThreads_StaticCrsGraph.o
OBJ_THREADS += TestThreads_UnorderedMap.o
@@ -93,6 +96,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_OPENMP), 1)
OBJ_OPENMP += TestOpenMP_DynRankViewAPI_rank12345.o
OBJ_OPENMP += TestOpenMP_DynRankViewAPI_rank67.o
OBJ_OPENMP += TestOpenMP_ErrorReporter.o
OBJ_OPENMP += TestOpenMP_OffsetView.o
OBJ_OPENMP += TestOpenMP_ScatterView.o
OBJ_OPENMP += TestOpenMP_StaticCrsGraph.o
OBJ_OPENMP += TestOpenMP_UnorderedMap.o
@@ -111,6 +115,7 @@ ifeq ($(KOKKOS_INTERNAL_USE_SERIAL), 1)
OBJ_SERIAL += TestSerial_DynRankViewAPI_rank12345.o
OBJ_SERIAL += TestSerial_DynRankViewAPI_rank67.o
OBJ_SERIAL += TestSerial_ErrorReporter.o
OBJ_SERIAL += TestSerial_OffsetView.o
OBJ_SERIAL += TestSerial_ScatterView.o
OBJ_SERIAL += TestSerial_StaticCrsGraph.o
OBJ_SERIAL += TestSerial_UnorderedMap.o

@@ -729,6 +729,7 @@ public:
static void run_tests() {
run_test_resize_realloc();
run_test_mirror();
run_test_mirror_and_copy();
run_test_scalar();
run_test();
run_test_const();
@@ -885,6 +886,69 @@ public:
}
}

static void run_test_mirror_and_copy()
{
// LayoutLeft
{
Kokkos::DynRankView< double, Kokkos::LayoutLeft, Kokkos::HostSpace > a_org( "A", 10 );
a_org(5) = 42.0;
Kokkos::DynRankView< double, Kokkos::LayoutLeft, Kokkos::HostSpace > a_h = a_org;
auto a_h2 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_h );
auto a_d = Kokkos::create_mirror_view_and_copy( DeviceType(), a_h );
auto a_h3 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_d );

int equal_ptr_h_h2 = a_h.data() == a_h2.data() ? 1 : 0;
int equal_ptr_h_d = a_h.data() == a_d.data() ? 1 : 0;
int equal_ptr_h2_d = a_h2.data() == a_d.data() ? 1 : 0;
int equal_ptr_h3_d = a_h3.data() == a_d.data() ? 1 : 0;

int is_same_memspace = std::is_same< Kokkos::HostSpace, typename DeviceType::memory_space >::value ? 1 : 0;
ASSERT_EQ( equal_ptr_h_h2, 1 );
ASSERT_EQ( equal_ptr_h_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h2_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h3_d, is_same_memspace );

ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.extent(0), a_h2.extent(0) );
ASSERT_EQ( a_h.extent(0), a_d .extent(0) );
ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.rank(), a_org.rank() );
ASSERT_EQ( a_h.rank(), a_h2.rank() );
ASSERT_EQ( a_h.rank(), a_h3.rank() );
ASSERT_EQ( a_h.rank(), a_d.rank() );
ASSERT_EQ( a_org(5), a_h3(5) );
}
// LayoutRight
{
Kokkos::DynRankView< double, Kokkos::LayoutRight, Kokkos::HostSpace > a_org( "A", 10 );
a_org(5) = 42.0;
Kokkos::DynRankView< double, Kokkos::LayoutRight, Kokkos::HostSpace > a_h = a_org;
auto a_h2 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_h );
auto a_d = Kokkos::create_mirror_view_and_copy( DeviceType(), a_h );
auto a_h3 = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace(), a_d );

int equal_ptr_h_h2 = a_h.data() == a_h2.data() ? 1 : 0;
int equal_ptr_h_d = a_h.data() == a_d.data() ? 1 : 0;
int equal_ptr_h2_d = a_h2.data() == a_d.data() ? 1 : 0;
int equal_ptr_h3_d = a_h3.data() == a_d.data() ? 1 : 0;

int is_same_memspace = std::is_same< Kokkos::HostSpace, typename DeviceType::memory_space >::value ? 1 : 0;
ASSERT_EQ( equal_ptr_h_h2, 1 );
ASSERT_EQ( equal_ptr_h_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h2_d, is_same_memspace );
ASSERT_EQ( equal_ptr_h3_d, is_same_memspace );

ASSERT_EQ( a_h.extent(0), a_h3.extent(0) );
ASSERT_EQ( a_h.extent(0), a_h2.extent(0) );
ASSERT_EQ( a_h.extent(0), a_d .extent(0) );
ASSERT_EQ( a_h.rank(), a_org.rank() );
ASSERT_EQ( a_h.rank(), a_h2.rank() );
ASSERT_EQ( a_h.rank(), a_h3.rank() );
ASSERT_EQ( a_h.rank(), a_d.rank() );
ASSERT_EQ( a_org(5), a_h3(5) );
}
}

static void run_test_scalar()
{
typedef typename dView0::HostMirror hView0 ; //HostMirror of DynRankView is a DynRankView

lib/kokkos/containers/unit_tests/TestOffsetView.hpp (new file, 426 lines)
@ -0,0 +1,426 @@
|
||||
//@HEADER
|
||||
// ************************************************************************
|
||||
//
|
||||
// Kokkos v. 2.0
|
||||
// Copyright (2014) Sandia Corporation
|
||||
//
|
||||
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
|
||||
// the U.S. Government retains certain rights in this software.
|
||||
//
|
||||
// Redistribution and use in source and binary forms, with or without
|
||||
// modification, are permitted provided that the following conditions are
|
||||
// met:
|
||||
//
|
||||
// 1. Redistributions of source code must retain the above copyright
|
||||
// notice, this list of conditions and the following disclaimer.
|
||||
//
|
||||
// 2. Redistributions in binary form must reproduce the above copyright
|
||||
// notice, this list of conditions and the following disclaimer in the
|
||||
// documentation and/or other materials provided with the distribution.
|
||||
//
|
||||
// 3. Neither the name of the Corporation nor the names of the
|
||||
// contributors may be used to endorse or promote products derived from
|
||||
// this software without specific prior written permission.
|
||||
//
|
||||
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
|
||||
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
|
||||
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
|
||||
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
|
||||
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
|
||||
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
|
||||
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
|
||||
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
|
||||
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
//
|
||||
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
|
||||
//
|
||||
// ************************************************************************
|
||||
//@HEADER
|
||||
|
||||
/*
|
||||
* FIXME the OffsetView class is really not very well tested.
|
||||
*/
|
||||
#ifndef CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_
|
||||
#define CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_
|
||||
|
||||
|
||||
|
||||
#include <gtest/gtest.h>
|
||||
#include <iostream>
|
||||
#include <cstdlib>
|
||||
#include <cstdio>
|
||||
#include <impl/Kokkos_Timer.hpp>
|
||||
#include <Kokkos_OffsetView.hpp>
|
||||
#include <KokkosExp_MDRangePolicy.hpp>
|
||||
|
||||
using std::endl;
|
||||
using std::cout;
|
||||
|
||||
namespace Test{
|
||||
|
||||
template <typename Scalar, typename Device>
|
||||
void test_offsetview_construction(unsigned int size)
|
||||
{
|
||||
|
||||
typedef Kokkos::Experimental::OffsetView<Scalar**, Device> offset_view_type;
|
||||
typedef Kokkos::View<Scalar**, Device> view_type;
|
||||
|
||||
Kokkos::Experimental::index_list_type range0 = {-1, 3};
|
||||
Kokkos::Experimental::index_list_type range1 = {-2, 2};
|
||||
|
||||
offset_view_type ov("firstOV", range0, range1);
|
||||
|
||||
ASSERT_EQ("firstOV", ov.label());
|
||||
ASSERT_EQ(2, ov.Rank);
|
||||
|
||||
ASSERT_EQ(ov.begin(0), -1);
|
||||
ASSERT_EQ(ov.end(0), 4);
|
||||
|
||||
ASSERT_EQ(ov.begin(1), -2);
|
||||
ASSERT_EQ(ov.end(1), 3);
|
||||
|
||||
ASSERT_EQ(ov.extent(0), 5);
|
||||
ASSERT_EQ(ov.extent(1), 5);
|
||||
|
||||
const int ovmin0 = ov.begin(0);
|
||||
const int ovend0 = ov.end(0);
|
||||
const int ovmin1 = ov.begin(1);
|
||||
const int ovend1 = ov.end(1);
|
||||
|
||||
#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
|
||||
{
|
||||
Kokkos::Experimental::OffsetView<Scalar*, Device> offsetV1("OneDOffsetView", range0);
|
||||
|
||||
Kokkos::RangePolicy<Device, int> rangePolicy1(offsetV1.begin(0), offsetV1.end(0));
|
||||
Kokkos::parallel_for(rangePolicy1, KOKKOS_LAMBDA (const int i){
|
||||
offsetV1(i) = 1;
|
||||
}
|
||||
);
|
||||
Kokkos::fence();
|
||||
|
||||
int OVResult = 0;
|
||||
Kokkos::parallel_reduce(rangePolicy1, KOKKOS_LAMBDA(const int i, int & updateMe){
|
||||
updateMe += offsetV1(i);
|
||||
}, OVResult);
|
||||
|
||||
Kokkos::fence();
|
||||
ASSERT_EQ(OVResult, offsetV1.end(0) - offsetV1.begin(0)) << "found wrong number of elements in OffsetView that was summed.";
|
||||
|
||||
}
|
||||
{ //test deep copy of scalar const value into mirro
|
||||
const int constVal = 6;
|
||||
typename offset_view_type::HostMirror hostOffsetView =
|
||||
Kokkos::Experimental::create_mirror_view(ov);
|
||||
|
||||
Kokkos::Experimental::deep_copy(hostOffsetView, constVal);
|
||||
|
||||
for(int i = hostOffsetView.begin(0); i < hostOffsetView.end(0); ++i) {
|
||||
for(int j = hostOffsetView.begin(1); j < hostOffsetView.end(1); ++j) {
|
||||
ASSERT_EQ(hostOffsetView(i,j), constVal) << "Bad data found in OffsetView";
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, Kokkos::IndexType<int> > range_type;
|
||||
typedef typename range_type::point_type point_type;
|
||||
|
||||
range_type rangePolicy2D(point_type{ {ovmin0, ovmin1 } },
|
||||
point_type{ { ovend0, ovend1 } });
|
||||
|
||||
const int constValue = 9;
|
||||
Kokkos::parallel_for(rangePolicy2D, KOKKOS_LAMBDA (const int i, const int j) {
|
||||
ov(i,j) = constValue;
|
||||
}
|
||||
);
|
||||
|
||||
//test offsetview to offsetviewmirror deep copy
|
||||
typename offset_view_type::HostMirror hostOffsetView =
|
||||
Kokkos::Experimental::create_mirror_view(ov);
|
||||
|
||||
Kokkos::Experimental::deep_copy(hostOffsetView, ov);
|
||||
|
||||
for(int i = hostOffsetView.begin(0); i < hostOffsetView.end(0); ++i) {
|
||||
for(int j = hostOffsetView.begin(1); j < hostOffsetView.end(1); ++j) {
|
||||
ASSERT_EQ(hostOffsetView(i,j), constValue) << "Bad data found in OffsetView";
|
}
}

int OVResult = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j);
}, OVResult);

int answer = 0;
for(int i = ov.begin(0); i < ov.end(0); ++i) {
for(int j = ov.begin(1); j < ov.end(1); ++j) {
answer += constValue;
}
}

ASSERT_EQ(OVResult, answer) << "Bad data found in OffsetView";
#endif

{
offset_view_type ovCopy(ov);
ASSERT_EQ(ovCopy==ov, true) <<
"Copy constructor or equivalence operator broken";
}

{
offset_view_type ovAssigned = ov;
ASSERT_EQ(ovAssigned==ov, true) <<
"Assignment operator or equivalence operator broken";
}

{ //construct OffsetView from a View plus begins array
const int extent0 = 100;
const int extent1 = 200;
const int extent2 = 300;
Kokkos::View<Scalar***, Device> view3D("view3D", extent0, extent1, extent2);

Kokkos::deep_copy(view3D, 1);

Kokkos::Array<int64_t,3> begins = {{-10, -20, -30}};
Kokkos::Experimental::OffsetView<Scalar***, Device> offsetView3D(view3D, begins);

typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<3>, Kokkos::IndexType<int64_t> > range3_type;
typedef typename range3_type::point_type point3_type;

range3_type rangePolicy3DZero(point3_type{ {0, 0, 0 } },
point3_type{ { extent0, extent1, extent2 } });

#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int view3DSum = 0;
Kokkos::parallel_reduce(rangePolicy3DZero, KOKKOS_LAMBDA(const int i, const int j, int k, int & updateMe){
updateMe += view3D(i, j, k);
}, view3DSum);

range3_type rangePolicy3D(point3_type{ {begins[0], begins[1], begins[2] } },
point3_type{ { begins[0] + extent0, begins[1] + extent1, begins[2] + extent2 } });
int offsetView3DSum = 0;

Kokkos::parallel_reduce(rangePolicy3D, KOKKOS_LAMBDA(const int i, const int j, int k, int & updateMe){
updateMe += offsetView3D(i, j, k);
}, offsetView3DSum);

ASSERT_EQ(view3DSum, offsetView3DSum) << "construction of OffsetView from View and begins array broken.";
#endif
}
view_type viewFromOV = ov.view();

ASSERT_EQ(viewFromOV == ov, true) <<
"OffsetView::view() or equivalence operator View == OffsetView broken";

{
offset_view_type ovFromV(viewFromOV, {-1, -2});

ASSERT_EQ(ovFromV == viewFromOV , true) <<
"Construction of OffsetView from View or equivalence operator OffsetView == View broken";
}
{
offset_view_type ovFromV = viewFromOV;
ASSERT_EQ(ovFromV == viewFromOV , true) <<
"Construction of OffsetView from View by assignment (implicit conversion) or equivalence operator OffsetView == View broken";
}

{// test offsetview to view deep copy
view_type aView("aView", ov.extent(0), ov.extent(1));
Kokkos::Experimental::deep_copy(aView, ov);

#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int sum = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j) - aView(i - ov.begin(0), j - ov.begin(1));
}, sum);

ASSERT_EQ(sum, 0) << "deep_copy(view, offsetView) broken.";
#endif
}

{// test view to offsetview deep copy
view_type aView("aView", ov.extent(0), ov.extent(1));

Kokkos::deep_copy(aView, 99);
Kokkos::Experimental::deep_copy(ov, aView);

#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
int sum = 0;
Kokkos::parallel_reduce(rangePolicy2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += ov(i, j) - aView(i - ov.begin(0), j - ov.begin(1));
}, sum);

ASSERT_EQ(sum, 0) << "deep_copy(offsetView, view) broken.";
#endif
}
}

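The construction test above exercises the core OffsetView workflow: build with inclusive {begin, end} index pairs, index with possibly negative native indices, and iterate over [begin(d), end(d)). A minimal standalone sketch of that workflow, using only the API the test exercises (the helper name offsetview_demo is illustrative, and the default execution space is assumed):

#include <Kokkos_Core.hpp>
#include <Kokkos_OffsetView.hpp>

void offsetview_demo() {
  // A 10 x 6 view whose indices run over [-5,4] x [-3,2].
  Kokkos::Experimental::OffsetView<int**, Kokkos::DefaultExecutionSpace>
      ov("ov", {-5, 4}, {-3, 2});

  typedef Kokkos::MDRangePolicy<Kokkos::Rank<2>, Kokkos::IndexType<int> > range_type;
  typedef range_type::point_type point_type;

  const int b0 = ov.begin(0), b1 = ov.begin(1);
  const int e0 = ov.end(0),   e1 = ov.end(1);
  range_type range(point_type{{b0, b1}}, point_type{{e0, e1}});

  Kokkos::parallel_for(range, KOKKOS_LAMBDA(const int i, const int j) {
    ov(i, j) = 1;  // negative indices are legal; the begins are subtracted internally
  });

  int sum = 0;
  Kokkos::parallel_reduce(range, KOKKOS_LAMBDA(const int i, const int j, int& update) {
    update += ov(i, j);
  }, sum);
  // sum == ov.extent(0) * ov.extent(1) == 60
}
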
template <typename Scalar, typename Device>
void test_offsetview_subview(unsigned int size)
{
{//test subview 1
Kokkos::Experimental::OffsetView<Scalar*, Device> sliceMe("offsetToSlice", {-10, 20});
{
auto offsetSubviewa = Kokkos::Experimental::subview(sliceMe, 0);
ASSERT_EQ(offsetSubviewa.Rank, 0) << "subview of offset is broken.";
}

}
{//test subview 2
Kokkos::Experimental::OffsetView<Scalar**, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30});
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), -2);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}

{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
}

{//test subview rank 3

Kokkos::Experimental::OffsetView<Scalar***, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30}, {-30,40});

//slice 1
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}

{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), Kokkos::make_pair(-30, -21));
ASSERT_EQ(offsetSubview.Rank, 2) << "subview of offset is broken.";

ASSERT_EQ(offsetSubview.begin(0) , -20);
ASSERT_EQ(offsetSubview.end(0) , 31);
ASSERT_EQ(offsetSubview.begin(1) , 0);
ASSERT_EQ(offsetSubview.end(1) , 9);

#if defined(KOKKOS_ENABLE_CUDA_LAMBDA) || !defined(KOKKOS_ENABLE_CUDA)
typedef Kokkos::MDRangePolicy<Device, Kokkos::Rank<2>, Kokkos::IndexType<int> > range_type;
typedef typename range_type::point_type point_type;

const int b0 = offsetSubview.begin(0);
const int b1 = offsetSubview.begin(1);

const int e0 = offsetSubview.end(0);
const int e1 = offsetSubview.end(1);

range_type rangeP2D(point_type{ {b0, b1 } }, point_type{ { e0, e1} });

Kokkos::parallel_for(rangeP2D, KOKKOS_LAMBDA(const int i, const int j) {
offsetSubview(i,j) = 6;
}
);

int sum = 0;
Kokkos::parallel_reduce(rangeP2D, KOKKOS_LAMBDA(const int i, const int j, int & updateMe){
updateMe += offsetSubview(i, j);
}, sum);

ASSERT_EQ(sum, 6*(e0-b0)*(e1-b1));
#endif
}

// slice 2
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}

{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
}

{//test subview rank 4

Kokkos::Experimental::OffsetView<Scalar****, Device> sliceMe("offsetToSlice", {-10,20}, {-20,30}, {-30,40}, {-40, 50});

//slice 1
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, Kokkos::ALL(), Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), Kokkos::ALL(), Kokkos::ALL() );
ASSERT_EQ(offsetSubview.Rank, 3) << "subview of offset is broken.";
}

// slice 2
auto offsetSubview2a = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview2a.Rank, 2) << "subview of offset is broken.";
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
{
auto offsetSubview2b = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL(), Kokkos::ALL());
ASSERT_EQ(offsetSubview2b.Rank, 2) << "subview of offset is broken.";
}
// slice 3
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, Kokkos::ALL(), 0, 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, Kokkos::ALL(), 0, 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, Kokkos::ALL(), 0);
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}
{
auto offsetSubview = Kokkos::Experimental::subview(sliceMe, 0, 0, 0, Kokkos::ALL());
ASSERT_EQ(offsetSubview.Rank, 1) << "subview of offset is broken.";
}

}

}

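The rank-3 asserts above pin down the subview index rules; restated as a sketch (the helper name is hypothetical, the API is exactly what the test uses):

// Kokkos::ALL() preserves a dimension's offset range, an integer argument
// removes the dimension, and a half-open pair slice re-bases the dimension
// to begin at zero.
void offsetview_subview_demo() {
  Kokkos::Experimental::OffsetView<double***, Kokkos::DefaultExecutionSpace>
      ov("ov", {-10, 20}, {-20, 30}, {-30, 40});

  auto s = Kokkos::Experimental::subview(ov, 0, Kokkos::ALL(),
                                         Kokkos::make_pair(-30, -21));
  // s.Rank == 2
  // s.begin(0) == -20 and s.end(0) == 31  (inherited from dimension 1)
  // s.begin(1) == 0   and s.end(1) == 9   (9-entry pair slice, re-based)
}
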
TEST_F( TEST_CATEGORY, offsetview_construction) {
test_offsetview_construction<int,TEST_EXECSPACE>(10);
}
TEST_F( TEST_CATEGORY, offsetview_subview) {
test_offsetview_subview<int,TEST_EXECSPACE>(10);
}

} // namespace Test

#endif /* CONTAINERS_UNIT_TESTS_TESTOFFSETVIEW_HPP_ */

@ -80,7 +80,9 @@ void test_scatter_view_config(int n)
Kokkos::Experimental::contribute(original_view, scatter_view);
}
#if defined( KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA )
Kokkos::fence();
auto host_view = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), original_view);
Kokkos::fence();
for (typename decltype(host_view)::size_type i = 0; i < host_view.extent(0); ++i) {
auto val0 = host_view(i, 0);
auto val1 = host_view(i, 1);
@ -111,9 +113,6 @@ struct TestDuplicatedScatterView {
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(n);
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterDuplicated,
Kokkos::Experimental::ScatterAtomic>(n);
}
};

@ -127,6 +126,16 @@ struct TestDuplicatedScatterView<Kokkos::Cuda> {
};
#endif

#ifdef KOKKOS_ENABLE_ROCM
// disable duplicated instantiation with ROCm until
// UniqueToken can support it
template <>
struct TestDuplicatedScatterView<Kokkos::Experimental::ROCm> {
TestDuplicatedScatterView(int) {
}
};
#endif

template <typename ExecSpace>
void test_scatter_view(int n)
{
@ -142,16 +151,28 @@ void test_scatter_view(int n)
Kokkos::Experimental::ScatterNonDuplicated,
Kokkos::Experimental::ScatterNonAtomic>(n);
}
#ifdef KOKKOS_ENABLE_SERIAL
if (!std::is_same<ExecSpace, Kokkos::Serial>::value) {
#endif
test_scatter_view_config<ExecSpace, Kokkos::LayoutRight,
Kokkos::Experimental::ScatterNonDuplicated,
Kokkos::Experimental::ScatterAtomic>(n);
#ifdef KOKKOS_ENABLE_SERIAL
}
#endif

TestDuplicatedScatterView<ExecSpace> duptest(n);
}

TEST_F( TEST_CATEGORY, scatterview) {
#ifndef KOKKOS_ENABLE_ROCM
test_scatter_view<TEST_EXECSPACE>(10);
#ifdef KOKKOS_ENABLE_DEBUG
test_scatter_view<TEST_EXECSPACE>(100000);
#else
test_scatter_view<TEST_EXECSPACE>(10000000);
#endif
#endif
}

} // namespace Test

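The pattern these tests drive is: independent threads accumulate into a ScatterView, and contribute() (visible in the first hunk above) folds the potentially duplicated buffers back into the target View. A hedged sketch under the ScatterSum/default duplication settings of this release; the name-and-extent constructor and the access() call reflect this editor's understanding of the ScatterView interface, so treat them as assumptions:

template <typename ExecSpace>
void scatter_histogram(int n) {
  Kokkos::View<double*, ExecSpace> hist("hist", 10);
  Kokkos::Experimental::ScatterView<double*, Kokkos::LayoutRight, ExecSpace>
      scatter("scatter", 10);

  Kokkos::parallel_for(Kokkos::RangePolicy<ExecSpace>(0, n),
                       KOKKOS_LAMBDA(const int i) {
    auto access = scatter.access();  // per-thread accumulation handle
    access(i % 10) += 1.0;           // data-race-free "+=" into a shared bin
  });

  Kokkos::Experimental::contribute(hist, scatter);  // merge buffers into hist
}
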
@ -46,6 +46,7 @@
#include <vector>

#include <Kokkos_StaticCrsGraph.hpp>
#include <Kokkos_Core.hpp>

/*--------------------------------------------------------------------------*/
namespace Test {

@ -0,0 +1,47 @@

/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/

#include<cuda/TestCuda_Category.hpp>
#include<TestOffsetView.hpp>

@ -0,0 +1,47 @@

/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/

#include<openmp/TestOpenMP_Category.hpp>
#include<TestOffsetView.hpp>

@ -60,6 +60,6 @@ protected:
} // namespace Test

#define TEST_CATEGORY rocm
#define TEST_EXECSPACE Kokkos::ROCm
#define TEST_EXECSPACE Kokkos::Experimental::ROCm

#endif

@ -0,0 +1,46 @@
/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/

#include<serial/TestSerial_Category.hpp>
#include<TestOffsetView.hpp>

@ -0,0 +1,47 @@

/*
//@HEADER
// ************************************************************************
//
// Kokkos v. 2.0
// Copyright (2014) Sandia Corporation
//
// Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation,
// the U.S. Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// 1. Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
//
// 2. Redistributions in binary form must reproduce the above copyright
// notice, this list of conditions and the following disclaimer in the
// documentation and/or other materials provided with the distribution.
//
// 3. Neither the name of the Corporation nor the names of the
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY SANDIA CORPORATION "AS IS" AND ANY
// EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION OR THE
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact Christian R. Trott (crtrott@sandia.gov)
//
// ************************************************************************
//@HEADER
*/

#include<threads/TestThreads_Category.hpp>
#include<TestOffsetView.hpp>

@ -108,3 +108,7 @@ else()

endif()
#-----------------------------------------------------------------------------

# build and install pkgconfig file
CONFIGURE_FILE(kokkos.pc.in kokkos.pc @ONLY)
INSTALL(FILES ${CMAKE_CURRENT_BINARY_DIR}/kokkos.pc DESTINATION lib/pkgconfig)

@ -208,7 +208,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {

if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -264,7 +264,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {

if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -321,7 +321,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {

if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {
@ -370,7 +370,7 @@ struct CudaParallelLaunch< DriverType
, const int shmem
, const cudaStream_t stream = 0 )
{
if ( grid.x && ( block.x * block.y * block.z ) ) {
if ( (grid.x != 0) && ( ( block.x * block.y * block.z ) != 0 ) ) {

if ( sizeof( Kokkos::Impl::CudaTraits::ConstantGlobalBufferType ) <
sizeof( DriverType ) ) {

@ -453,6 +453,8 @@ SharedAllocationRecord( const Kokkos::CudaSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;

// Copy to device memory
Kokkos::Impl::DeepCopy<CudaSpace,HostSpace>( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
@ -491,6 +493,9 @@ SharedAllocationRecord( const Kokkos::CudaUVMSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);

// Set last element zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}

SharedAllocationRecord< Kokkos::CudaHostPinnedSpace , void >::
@ -525,6 +530,8 @@ SharedAllocationRecord( const Kokkos::CudaHostPinnedSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}

//----------------------------------------------------------------------------

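The null-termination lines added in each constructor above exist because strncpy does not terminate the destination when the source is at least as long as the buffer. A self-contained illustration of the same guard (the helper name is hypothetical; the library behavior is standard C):

#include <cstring>

void copy_label(char (&dst)[32], const char* src) {
  strncpy(dst, src, sizeof(dst));  // may leave dst without a '\0'
  dst[sizeof(dst) - 1] = '\0';     // force termination, as the hunks above do
}
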
@ -689,9 +689,13 @@ Cuda::size_type cuda_internal_multiprocessor_count()

CudaSpace::size_type cuda_internal_maximum_concurrent_block_count()
{
#if defined(KOKKOS_ARCH_KEPLER)
// Compute capability 3.0 through 3.7
enum : int { max_resident_blocks_per_multiprocessor = 16 };
#else
// Compute capability 5.0 through 6.2
enum : int { max_resident_blocks_per_multiprocessor = 32 };

#endif
return CudaInternal::singleton().m_multiProcCount
* max_resident_blocks_per_multiprocessor ;
};

@ -52,22 +52,22 @@

namespace Kokkos { namespace Impl {

template<class DriverType, bool Large>
template<class DriverType, class LaunchBounds, bool Large>
struct CudaGetMaxBlockSize;

template<class DriverType, bool Large = (CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>
template<class DriverType, class LaunchBounds>
int cuda_get_max_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
return CudaGetMaxBlockSize<DriverType,Large>::get_block_size(f,vector_length, shmem_extra_block,shmem_extra_thread);
return CudaGetMaxBlockSize<DriverType,LaunchBounds,(CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>::get_block_size(f,vector_length, shmem_extra_block,shmem_extra_thread);
}


template<class DriverType>
struct CudaGetMaxBlockSize<DriverType,true> {
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks;
int blockSize=32;
int blockSize=1024;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
@ -76,8 +76,9 @@ struct CudaGetMaxBlockSize<DriverType,true> {
blockSize,
sharedmem);

while (blockSize<1024 && numBlocks>0) {
blockSize*=2;
if(numBlocks>0) return blockSize;
while (blockSize>32 && numBlocks==0) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

@ -87,19 +88,30 @@ struct CudaGetMaxBlockSize<DriverType,true> {
blockSize,
sharedmem);
}
if(numBlocks>0) return blockSize;
else return blockSize/2;
int blockSizeUpperBound = blockSize*2;
while (blockSize<blockSizeUpperBound && numBlocks>0) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
return blockSize - 32;
}
};

template<class DriverType>
struct CudaGetMaxBlockSize<DriverType,false> {
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks;

int blockSize=32;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
unsigned int blockSize=1024;
unsigned int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
@ -107,8 +119,9 @@ struct CudaGetMaxBlockSize<DriverType,false> {
blockSize,
sharedmem);

while (blockSize<1024 && numBlocks>0) {
blockSize*=2;
if(numBlocks>0) return blockSize;
while (blockSize>32 && numBlocks==0) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

@ -118,24 +131,121 @@ struct CudaGetMaxBlockSize<DriverType,false> {
blockSize,
sharedmem);
}
if(numBlocks>0) return blockSize;
else return blockSize/2;
unsigned int blockSizeUpperBound = blockSize*2;
while (blockSize<blockSizeUpperBound && numBlocks>0) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
return blockSize - 32;
}
};

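Both rewritten specializations above share one shape: start at the 1024-thread cap, halve until the occupancy query reports at least one resident block, then probe upward in 32-thread (warp-size) steps and back off one step. A driver-free sketch of just that control flow, with occupancy_for standing in for cudaOccupancyMaxActiveBlocksPerMultiprocessor and the shared-memory recomputation elided:

template <class OccupancyFn>
int max_block_size_sketch(OccupancyFn occupancy_for) {
  int blockSize = 1024;                       // start at the cap
  int numBlocks = occupancy_for(blockSize);
  if (numBlocks > 0) return blockSize;        // the cap already fits

  while (blockSize > 32 && numBlocks == 0) {  // halve until something fits
    blockSize /= 2;
    numBlocks = occupancy_for(blockSize);
  }
  const int upperBound = blockSize * 2;       // refine inside the last octave
  while (blockSize < upperBound && numBlocks > 0) {
    blockSize += 32;
    numBlocks = occupancy_for(blockSize);
  }
  return blockSize - 32;                      // last probe overshot by one warp
}
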
template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<MaxThreadsPerBlock,MinBlocksPerSM>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks = 0, oldNumBlocks = 0;
unsigned int blockSize=MaxThreadsPerBlock;
unsigned int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);

if(static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) return blockSize;

while (blockSize>32 && static_cast<unsigned int>(numBlocks)<MinBlocksPerSM) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
unsigned int blockSizeUpperBound = (blockSize*2<MaxThreadsPerBlock?blockSize*2:MaxThreadsPerBlock);
while (blockSize<blockSizeUpperBound && static_cast<unsigned int>(numBlocks)>MinBlocksPerSM) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
oldNumBlocks = numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType>,
blockSize,
sharedmem);
}
if(static_cast<unsigned int>(oldNumBlocks)>=MinBlocksPerSM) return blockSize - 32;
return -1;
}
};

template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetMaxBlockSize<DriverType,Kokkos::LaunchBounds<MaxThreadsPerBlock,MinBlocksPerSM>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int numBlocks = 0, oldNumBlocks = 0;
unsigned int blockSize=MaxThreadsPerBlock;
int sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) return blockSize;

while (blockSize>32 && static_cast<unsigned int>(numBlocks)<MinBlocksPerSM) {
blockSize/=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
unsigned int blockSizeUpperBound = (blockSize*2<MaxThreadsPerBlock?blockSize*2:MaxThreadsPerBlock);
while (blockSize<blockSizeUpperBound && static_cast<unsigned int>(numBlocks)>=MinBlocksPerSM) {
blockSize+=32;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
oldNumBlocks = numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType>,
blockSize,
sharedmem);
}
if(static_cast<unsigned int>(oldNumBlocks)>=MinBlocksPerSM) return blockSize - 32;
return -1;
}
};



template<class DriverType, bool Large>
template<class DriverType, class LaunchBounds, bool Large>
struct CudaGetOptBlockSize;

template<class DriverType, bool Large = (CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>
template<class DriverType, class LaunchBounds>
int cuda_get_opt_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
return CudaGetOptBlockSize<DriverType,Large>::get_block_size(f,vector_length,shmem_extra_block,shmem_extra_thread);
return CudaGetOptBlockSize<DriverType,LaunchBounds,(CudaTraits::ConstantMemoryUseThreshold < sizeof(DriverType))>::get_block_size(f,vector_length,shmem_extra_block,shmem_extra_thread);
}

template<class DriverType>
struct CudaGetOptBlockSize<DriverType,true> {
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds<>,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
@ -165,7 +275,7 @@ struct CudaGetOptBlockSize<DriverType,true> {
};

template<class DriverType>
struct CudaGetOptBlockSize<DriverType,false> {
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds<>,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
@ -194,6 +304,75 @@ struct CudaGetOptBlockSize<DriverType,false> {
}
};

template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds< MaxThreadsPerBlock, MinBlocksPerSM >,true> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
int numBlocks;
int sharedmem;
int maxOccupancy=0;
int bestBlockSize=0;
int max_threads_per_block = std::min(MaxThreadsPerBlock,cuda_internal_maximum_warp_count()*CudaTraits::WarpSize);

while(blockSize < max_threads_per_block ) {
blockSize*=2;

//calculate the occupancy with that optBlockSize and check whether it's larger than the largest one found so far
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_constant_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(numBlocks >= int(MinBlocksPerSM) && blockSize<=int(MaxThreadsPerBlock)) {
if(maxOccupancy < numBlocks*blockSize) {
maxOccupancy = numBlocks*blockSize;
bestBlockSize = blockSize;
}
}
}
if(maxOccupancy > 0)
return bestBlockSize;
return -1;
}
};

template<class DriverType, unsigned int MaxThreadsPerBlock, unsigned int MinBlocksPerSM>
struct CudaGetOptBlockSize<DriverType,Kokkos::LaunchBounds< MaxThreadsPerBlock, MinBlocksPerSM >,false> {
static int get_block_size(const typename DriverType::functor_type & f, const size_t vector_length,
const size_t shmem_extra_block, const size_t shmem_extra_thread) {
int blockSize=16;
int numBlocks;
int sharedmem;
int maxOccupancy=0;
int bestBlockSize=0;
int max_threads_per_block = std::min(MaxThreadsPerBlock,cuda_internal_maximum_warp_count()*CudaTraits::WarpSize);

while(blockSize < max_threads_per_block ) {
blockSize*=2;
sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
FunctorTeamShmemSize< typename DriverType::functor_type >::value( f , blockSize/vector_length );

cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&numBlocks,
cuda_parallel_launch_local_memory<DriverType,MaxThreadsPerBlock,MinBlocksPerSM>,
blockSize,
sharedmem);
if(numBlocks >= int(MinBlocksPerSM) && blockSize<=int(MaxThreadsPerBlock)) {
if(maxOccupancy < numBlocks*blockSize) {
maxOccupancy = numBlocks*blockSize;
bestBlockSize = blockSize;
}
}
}
if(maxOccupancy > 0)
return bestBlockSize;
return -1;
}
};

}} // namespace Kokkos::Impl

#endif // KOKKOS_ENABLE_CUDA

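The LaunchBounds<MaxThreadsPerBlock, MinBlocksPerSM> specializations above are selected through the policy's launch_bounds trait (see the cuda_get_*_block_size callers later in this diff). A hedged usage sketch; that a TeamPolicy accepts LaunchBounds as a template property is inferred from the traits::launch_bounds uses in this commit:

void launch_bounds_demo(int league_size) {
  typedef Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<256, 4> > policy_type;
  // At most 256 threads per block; at least 4 resident blocks per SM.
  policy_type policy(league_size, Kokkos::AUTO);

  Kokkos::parallel_for(policy,
      KOKKOS_LAMBDA(const policy_type::member_type& team) {
    // team body; the block-size deduction above honors the stated bounds
  });
}
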
@ -148,6 +148,9 @@ namespace Kokkos {
namespace Impl {
namespace {
static int lock_array_copied = 0;
inline int eliminate_warning_for_lock_array() {
return lock_array_copied;
}
}
}
}

@ -60,6 +60,7 @@
#include <Cuda/Kokkos_Cuda_Internal.hpp>
#include <Cuda/Kokkos_Cuda_Locks.hpp>
#include <Kokkos_Vectorization.hpp>
#include <Cuda/Kokkos_Cuda_Version_9_8_Compatibility.hpp>

#if defined(KOKKOS_ENABLE_PROFILING)
#include <impl/Kokkos_Profiling_Interface.hpp>
@ -114,6 +115,7 @@ public:

//----------------------------------------

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & functor )
@ -131,7 +133,35 @@ public:

return n ;
}
#endif

template<class FunctorType>
int team_size_max( const FunctorType& f, const ParallelForTag& ) const {
typedef Impl::ParallelFor< FunctorType , TeamPolicy<Properties...> > closure_type;
int block_size = Kokkos::Impl::cuda_get_max_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) );
return block_size/vector_length();
}

template<class FunctorType>
int team_size_max( const FunctorType& f, const ParallelReduceTag& ) const {
typedef Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,TeamPolicyInternal,FunctorType> functor_analysis_type;
typedef typename Impl::ParallelReduceReturnValue<void,typename functor_analysis_type::value_type,FunctorType>::reducer_type reducer_type;
typedef Impl::ParallelReduce< FunctorType , TeamPolicy<Properties...>, reducer_type > closure_type;
typedef Impl::FunctorValueTraits< FunctorType , typename traits::work_tag > functor_value_traits;

int block_size = Kokkos::Impl::cuda_get_max_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) +
((functor_value_traits::StaticValueSize!=0)?0:functor_value_traits::value_size( f )));

// Currently we require Power-of-2 team size for reductions.
int p2 = 1;
while(p2<=block_size) p2*=2;
p2/=2;
return p2/vector_length();
}

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
static int team_size_recommended( const FunctorType & functor )
{ return team_size_max( functor ); }
@ -143,11 +173,41 @@ public:
if(max<1) max = 1;
return max;
}
#endif

template<class FunctorType>
int team_size_recommended( const FunctorType& f, const ParallelForTag& ) const {
typedef Impl::ParallelFor< FunctorType , TeamPolicy<Properties...> > closure_type;
int block_size = Kokkos::Impl::cuda_get_opt_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double));
return block_size/vector_length();
}

template<class FunctorType>
int team_size_recommended( const FunctorType& f, const ParallelReduceTag& ) const {
typedef Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,TeamPolicyInternal,FunctorType> functor_analysis_type;
typedef typename Impl::ParallelReduceReturnValue<void,typename functor_analysis_type::value_type,FunctorType>::reducer_type reducer_type;
typedef Impl::ParallelReduce< FunctorType , TeamPolicy<Properties...>, reducer_type > closure_type;
typedef Impl::FunctorValueTraits< FunctorType , typename traits::work_tag > functor_value_traits;

int block_size = Kokkos::Impl::cuda_get_opt_block_size< closure_type, typename traits::launch_bounds >( f ,(size_t) vector_length(),
(size_t) team_scratch_size(0) + 2*sizeof(double), (size_t) thread_scratch_size(0) + sizeof(double) +
((functor_value_traits::StaticValueSize!=0)?0:functor_value_traits::value_size( f )));
return block_size/vector_length();
}


inline static
int vector_length_max()
{ return Impl::CudaTraits::WarpSize; }

inline static
int scratch_size_max(int level)
{ return (level==0?
1024*40: // 48kB is the max for CUDA, but we need some for team_member.reduce etc.
20*1024*1024); // arbitrarily setting this to 20MB, for a Volta V100 that would give us about 3.2GB for 2 teams per SM
}

//----------------------------------------

inline int vector_length() const { return m_vector_length ; }
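The tag-dispatched queries above supersede the deprecated static team_size_max/team_size_recommended. A sketch of the calling convention (the functor F and helper name are hypothetical; the overloads are exactly the ones added in this hunk):

template <class F>
void run_with_deduced_team_size(const F& f, int league_size) {
  Kokkos::TeamPolicy<Kokkos::Cuda> query_policy(league_size, 1);
  const int team_max = query_policy.team_size_max(f, Kokkos::ParallelForTag());
  const int team_rec = query_policy.team_size_recommended(f, Kokkos::ParallelForTag());

  // Use the recommended size when available, capped by the maximum.
  Kokkos::TeamPolicy<Kokkos::Cuda> policy(league_size,
                                          team_rec > 0 ? team_rec : team_max);
  Kokkos::parallel_for(policy, f);
}
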
@ -419,7 +479,7 @@ public:
void execute() const
{
const typename Policy::index_type nwork = m_policy.end() - m_policy.begin();
const int block_size = Kokkos::Impl::cuda_get_opt_block_size< ParallelFor >( m_functor , 1, 0 , 0 );
const int block_size = Kokkos::Impl::cuda_get_opt_block_size< ParallelFor, LaunchBounds>( m_functor , 1, 0 , 0 );
const dim3 block( 1 , block_size , 1);
const dim3 grid( std::min( typename Policy::index_type(( nwork + block.y - 1 ) / block.y) , typename Policy::index_type(cuda_internal_maximum_grid_count()) ) , 1 , 1);

@ -654,7 +714,7 @@ public:
: m_functor( arg_functor )
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelFor >( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length() )
Kokkos::Impl::cuda_get_opt_block_size< ParallelFor, LaunchBounds >( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
, m_shmem_begin( sizeof(double) * ( m_team_size + 2 ) )
, m_shmem_size( arg_policy.scratch_size(0,m_team_size) + FunctorTeamShmemSize< FunctorType >::value( m_functor , m_team_size ) )
@ -670,7 +730,7 @@ public:
}

if ( int(m_team_size) >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelFor >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelFor, LaunchBounds >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelFor< Cuda > requested too large team size."));
}
@ -725,12 +785,13 @@ public:
const Policy m_policy ;
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
const bool m_result_ptr_device_accessible ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;

// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit)
enum { UseShflReduction = false };//((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
// Some crutch to do function overloading
private:
typedef double DummyShflReductionType;
@ -752,12 +813,12 @@ public:

__device__ inline
void operator() () const {
run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
/* run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
}

__device__ inline
void run(const DummySHMEMReductionType& ) const
{
{*/
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) / sizeof(size_type) );

@ -786,7 +847,8 @@ public:
// This is the final block with the final result at the final threads' location

size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
( m_unified_space ? m_unified_space : m_scratch_space );

if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
@ -798,10 +860,9 @@ public:
}
}

__device__ inline
/* __device__ inline
void run(const DummyShflReductionType&) const
{

value_type value;
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , &value);
// Number of blocks is bounded so that the reduction can be limited to two passes.
@ -832,7 +893,7 @@ public:
*result = value;
}
}
}
}*/

// Determine block size constrained by shared memory:
static inline
@ -863,16 +924,18 @@ public:

CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute

Cuda::fence();
if(!m_result_ptr_device_accessible) {
Cuda::fence();

if ( m_result_ptr ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,CudaSpace>( m_result_ptr , m_scratch_space , size );
if ( m_result_ptr ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,CudaSpace>( m_result_ptr , m_scratch_space , size );
}
}
}
}
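With m_result_ptr_device_accessible, the kernel's final block writes straight into the user's result whenever its memory space is CUDA-accessible, and the fence/copy-back above is skipped. A sketch of triggering that path by reducing into a device-resident View, a capability this commit introduces (the helper name is illustrative):

void reduce_into_device_view(int n) {
  Kokkos::View<double, Kokkos::CudaSpace> result("result");  // rank-0, on device
  Kokkos::parallel_reduce(
      Kokkos::RangePolicy<Kokkos::Cuda>(0, n),
      KOKKOS_LAMBDA(const int i, double& update) { update += double(i); },
      result);  // no host round-trip; the sum lands in CudaSpace
}
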
@ -883,17 +946,18 @@ public:
|
||||
}
|
||||
}
|
||||
|
||||
template< class HostViewType >
|
||||
template< class ViewType >
|
||||
ParallelReduce( const FunctorType & arg_functor
|
||||
, const Policy & arg_policy
|
||||
, const HostViewType & arg_result
|
||||
, const ViewType & arg_result
|
||||
, typename std::enable_if<
|
||||
Kokkos::is_view< HostViewType >::value
|
||||
Kokkos::is_view< ViewType >::value
|
||||
,void*>::type = NULL)
|
||||
: m_functor( arg_functor )
|
||||
, m_policy( arg_policy )
|
||||
, m_reducer( InvalidType() )
|
||||
, m_result_ptr( arg_result.data() )
|
||||
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
|
||||
, m_scratch_space( 0 )
|
||||
, m_scratch_flags( 0 )
|
||||
, m_unified_space( 0 )
|
||||
@ -906,6 +970,7 @@ public:
|
||||
, m_policy( arg_policy )
|
||||
, m_reducer( reducer )
|
||||
, m_result_ptr( reducer.view().data() )
|
||||
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
|
||||
, m_scratch_space( 0 )
|
||||
, m_scratch_flags( 0 )
|
||||
, m_unified_space( 0 )
|
||||
@ -953,6 +1018,7 @@ public:
|
||||
const Policy m_policy ; // used for workrange and nwork
|
||||
const ReducerType m_reducer ;
|
||||
const pointer_type m_result_ptr ;
|
||||
const bool m_result_ptr_device_accessible ;
|
||||
size_type * m_scratch_space ;
|
||||
size_type * m_scratch_flags ;
|
||||
size_type * m_unified_space ;
|
||||
@ -960,7 +1026,7 @@ public:
|
||||
typedef typename Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, typename Policy::work_tag, reference_type> DeviceIteratePattern;
|
||||
|
||||
// Shall we use the shfl based reduction or not (only use it for static sized types of more than 128bit
|
||||
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && ValueTraits::StaticValueSize) };
|
||||
enum { UseShflReduction = ((sizeof(value_type)>2*sizeof(double)) && (ValueTraits::StaticValueSize!=0)) };
|
||||
// Some crutch to do function overloading
|
||||
private:
|
||||
typedef double DummyShflReductionType;
|
||||
@ -978,12 +1044,12 @@ public:
|
||||
inline
|
||||
__device__
|
||||
void operator() (void) const {
|
||||
run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
|
||||
/* run(Kokkos::Impl::if_c<UseShflReduction, DummyShflReductionType, DummySHMEMReductionType>::select(1,1.0) );
|
||||
}
|
||||
|
||||
__device__ inline
|
||||
void run(const DummySHMEMReductionType& ) const
|
||||
{
|
||||
{*/
|
||||
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
|
||||
word_count( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) / sizeof(size_type) );
|
||||
|
||||
@ -1007,7 +1073,8 @@ public:
|
||||
|
||||
// This is the final block with the final result at the final threads' location
|
||||
size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
|
||||
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
|
||||
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
|
||||
( m_unified_space ? m_unified_space : m_scratch_space );
|
||||
|
||||
if ( threadIdx.y == 0 ) {
|
||||
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
|
||||
@ -1019,7 +1086,7 @@ public:
|
||||
}
|
||||
}
|
||||
|
||||
__device__ inline
|
||||
/* __device__ inline
|
||||
void run(const DummyShflReductionType&) const
|
||||
{
|
||||
|
||||
@ -1051,7 +1118,7 @@ public:
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
*/
|
||||
// Determine block size constrained by shared memory:
|
||||
static inline
|
||||
unsigned local_block_size( const FunctorType & f )
|
||||
@ -1089,16 +1156,18 @@ public:
|
||||
|
||||
CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute
|
||||
|
||||
Cuda::fence();
|
||||
if(!m_result_ptr_device_accessible) {
|
||||
Cuda::fence();
|
||||
|
||||
if ( m_result_ptr ) {
|
||||
if ( m_unified_space ) {
|
||||
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
|
||||
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
|
||||
}
|
||||
else {
|
||||
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
|
||||
DeepCopy<HostSpace,CudaSpace>( m_result_ptr , m_scratch_space , size );
|
||||
if ( m_result_ptr ) {
|
||||
if ( m_unified_space ) {
|
||||
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
|
||||
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
|
||||
}
|
||||
else {
|
||||
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
|
||||
DeepCopy<HostSpace,CudaSpace>( m_result_ptr , m_scratch_space , size );
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -1109,17 +1178,18 @@ public:
|
||||
}
|
||||
}
|
||||
|
||||
template< class HostViewType >
|
||||
template< class ViewType >
|
||||
ParallelReduce( const FunctorType & arg_functor
|
||||
, const Policy & arg_policy
|
||||
, const HostViewType & arg_result
|
||||
, const ViewType & arg_result
|
||||
, typename std::enable_if<
|
||||
Kokkos::is_view< HostViewType >::value
|
||||
Kokkos::is_view< ViewType >::value
|
||||
,void*>::type = NULL)
|
||||
: m_functor( arg_functor )
|
||||
, m_policy( arg_policy )
|
||||
, m_reducer( InvalidType() )
|
||||
, m_result_ptr( arg_result.data() )
|
||||
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
|
||||
, m_scratch_space( 0 )
|
||||
, m_scratch_flags( 0 )
|
||||
, m_unified_space( 0 )
|
||||
@ -1132,6 +1202,7 @@ public:
|
||||
, m_policy( arg_policy )
|
||||
, m_reducer( reducer )
|
||||
, m_result_ptr( reducer.view().data() )
|
||||
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
|
||||
, m_scratch_space( 0 )
|
||||
, m_scratch_flags( 0 )
|
||||
, m_unified_space( 0 )
|
||||
@ -1174,7 +1245,7 @@ public:
typedef FunctorType functor_type ;
typedef Cuda::size_type size_type ;

enum { UseShflReduction = (true && ValueTraits::StaticValueSize) };
enum { UseShflReduction = (true && (ValueTraits::StaticValueSize!=0)) };

private:
typedef double DummyShflReductionType;
@ -1191,6 +1262,7 @@ private:
const FunctorType m_functor ;
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
const bool m_result_ptr_device_accessible ;
size_type * m_scratch_space ;
size_type * m_scratch_flags ;
size_type * m_unified_space ;
@ -1279,7 +1351,8 @@ public:
// This is the final block with the final result at the final threads' location

size_type * const shared = kokkos_impl_cuda_shared_memory<size_type>() + ( blockDim.y - 1 ) * word_count.value ;
size_type * const global = m_unified_space ? m_unified_space : m_scratch_space ;
size_type * const global = m_result_ptr_device_accessible? reinterpret_cast<size_type*>(m_result_ptr) :
( m_unified_space ? m_unified_space : m_scratch_space );

if ( threadIdx.y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , shared );
@ -1312,12 +1385,18 @@ public:
, value );
}

pointer_type const result = (pointer_type) (m_unified_space ? m_unified_space : m_scratch_space) ;
pointer_type const result = m_result_ptr_device_accessible? m_result_ptr :
(pointer_type) ( m_unified_space ? m_unified_space : m_scratch_space );

value_type init;
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , &init);
if(Impl::cuda_inter_block_reduction<FunctorType,ValueJoin,WorkTag>
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,blockDim.y)) {
if(
Impl::cuda_inter_block_reduction<FunctorType,ValueJoin,WorkTag>
(value,init,ValueJoin(ReducerConditional::select(m_functor , m_reducer)),m_scratch_space,result,m_scratch_flags,blockDim.y)
//This breaks a test
// Kokkos::Impl::CudaReductionsFunctor<FunctorType,WorkTag,false,true>::scalar_inter_block_reduction(ReducerConditional::select(m_functor , m_reducer) , blockIdx.x , gridDim.x ,
// kokkos_impl_cuda_shared_memory<size_type>() , m_scratch_space , m_scratch_flags)
) {
const unsigned id = threadIdx.y*blockDim.x + threadIdx.x;
if(id==0) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , (void*) &value );
@ -1331,7 +1410,7 @@ public:
{
const int nwork = m_league_size * m_team_size ;
if ( nwork ) {
const int block_count = UseShflReduction? std::min( m_league_size , size_type(1024) )
const int block_count = UseShflReduction? std::min( m_league_size , size_type(1024*32) )
:std::min( m_league_size , m_team_size );

m_scratch_space = cuda_internal_scratch_space( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) * block_count );
@ -1344,16 +1423,18 @@ public:

CudaParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem_size_total ); // copy to device and execute

Cuda::fence();
if(!m_result_ptr_device_accessible) {
Cuda::fence();

if ( m_result_ptr ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,CudaSpace>( m_result_ptr, m_scratch_space, size );
if ( m_result_ptr ) {
if ( m_unified_space ) {
const int count = ValueTraits::value_count( ReducerConditional::select(m_functor , m_reducer) );
for ( int i = 0 ; i < count ; ++i ) { m_result_ptr[i] = pointer_type(m_unified_space)[i] ; }
}
else {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,CudaSpace>( m_result_ptr, m_scratch_space, size );
}
}
}
}
@ -1364,16 +1445,17 @@ public:
}
}

template< class HostViewType >
template< class ViewType >
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const HostViewType & arg_result
, const ViewType & arg_result
, typename std::enable_if<
Kokkos::is_view< HostViewType >::value
Kokkos::is_view< ViewType >::value
,void*>::type = NULL)
: m_functor( arg_functor )
, m_reducer( InvalidType() )
, m_result_ptr( arg_result.data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ViewType::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -1383,17 +1465,17 @@ public:
, m_scratch_ptr{NULL,NULL}
, m_scratch_size{
arg_policy.scratch_size(0,( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
), arg_policy.scratch_size(1,( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
)}
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
@ -1430,9 +1512,7 @@ public:
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too much L0 scratch memory"));
}

if ( unsigned(m_team_size) >
unsigned(Kokkos::Impl::cuda_get_max_block_size< ParallelReduce >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
if ( int(m_team_size) > arg_policy.team_size_max(m_functor,ParallelReduceTag()) ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too large team size."));
}

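The replaced guard shows the new pattern for querying launch limits: ask the policy itself, with a tag selecting the parallel pattern, instead of calling the CUDA occupancy helpers directly. A hedged sketch of the same query from user code (the functor and sizes are illustrative):

  #include <Kokkos_Core.hpp>

  struct SumTeams {
    typedef double value_type ;
    KOKKOS_INLINE_FUNCTION
    void operator()( const Kokkos::TeamPolicy<>::member_type & , double & update ) const
    { update += 1.0 ; }
  };

  void run( const int league_size ) {
    Kokkos::TeamPolicy<> policy( league_size , 1 );
    // Largest team size with which this functor can run as a parallel_reduce:
    const int max_team = policy.team_size_max( SumTeams() , Kokkos::ParallelReduceTag() );

    double total = 0 ;
    Kokkos::parallel_reduce( Kokkos::TeamPolicy<>( league_size , max_team ) , SumTeams() , total );
  }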
@ -1444,6 +1524,7 @@ public:
: m_functor( arg_functor )
, m_reducer( reducer )
, m_result_ptr( reducer.view().data() )
, m_result_ptr_device_accessible(MemorySpaceAccess< Kokkos::CudaSpace , typename ReducerType::result_view_type::memory_space>::accessible )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
, m_unified_space( 0 )
@ -1453,7 +1534,7 @@ public:
, m_scratch_ptr{NULL,NULL}
, m_league_size( arg_policy.league_size() )
, m_team_size( 0 <= arg_policy.team_size() ? arg_policy.team_size() :
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >( arg_functor , arg_policy.vector_length(),
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >( arg_functor , arg_policy.vector_length(),
arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) /
arg_policy.vector_length() )
, m_vector_size( arg_policy.vector_length() )
@ -1486,10 +1567,7 @@ public:
CudaTraits::SharedMemoryCapacity < shmem_size_total ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > bad team size"));
}

if ( int(m_team_size) >
int(Kokkos::Impl::cuda_get_max_block_size< ParallelReduce >
( arg_functor , arg_policy.vector_length(), arg_policy.team_scratch_size(0),arg_policy.thread_scratch_size(0) ) / arg_policy.vector_length())) {
if ( int(m_team_size) > arg_policy.team_size_max(m_functor,ParallelReduceTag()) ) {
Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelReduce< Cuda > requested too large team size."));
}

@ -1753,7 +1831,7 @@ public:
// Occupancy calculator assumes whole block.

m_team_size =
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce >
Kokkos::Impl::cuda_get_opt_block_size< ParallelReduce, LaunchBounds >
( arg_functor
, arg_policy.vector_length()
, arg_policy.team_scratch_size(0)
@ -1970,7 +2048,9 @@ private:
const WorkRange range( m_policy , blockIdx.x , gridDim.x );

for ( typename Policy::member_type iwork_base = range.begin(); iwork_base < range.end() ; iwork_base += blockDim.y ) {

#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned MASK=KOKKOS_IMPL_CUDA_ACTIVEMASK;
#endif
const typename Policy::member_type iwork = iwork_base + threadIdx.y ;

__syncthreads(); // Don't overwrite previous iteration values until they are used
@ -1981,7 +2061,11 @@ private:
for ( unsigned i = threadIdx.y ; i < word_count.value ; ++i ) {
shared_data[i + word_count.value] = shared_data[i] = shared_accum[i] ;
}

#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(MASK);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); } // Protect against large scan values.

// Call functor to accumulate inclusive scan value for this work item
@ -2189,6 +2273,9 @@ private:
const WorkRange range( m_policy , blockIdx.x , gridDim.x );

for ( typename Policy::member_type iwork_base = range.begin(); iwork_base < range.end() ; iwork_base += blockDim.y ) {
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned MASK=KOKKOS_IMPL_CUDA_ACTIVEMASK;
#endif

const typename Policy::member_type iwork = iwork_base + threadIdx.y ;

@ -2201,6 +2288,11 @@ private:
shared_data[i + word_count.value] = shared_data[i] = shared_accum[i] ;
}

#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(MASK);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
if ( CudaTraits::WarpSize < word_count.value ) { __syncthreads(); } // Protect against large scan values.

// Call functor to accumulate inclusive scan value for this work item

@ -194,8 +194,9 @@ void cuda_shfl_up( T & out , T const & in , int delta ,
*/

template< class ValueType , class JoinOp>
__device__
inline void cuda_intra_warp_reduction( ValueType& result,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_intra_warp_reduction( ValueType& result,
const JoinOp& join,
const uint32_t max_active_thread = blockDim.y) {

@ -214,8 +215,9 @@ inline void cuda_intra_warp_reduction( ValueType& result,
}

template< class ValueType , class JoinOp>
__device__
inline void cuda_inter_warp_reduction( ValueType& value,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_inter_warp_reduction( ValueType& value,
const JoinOp& join,
const int max_active_thread = blockDim.y) {

@ -247,8 +249,9 @@ inline void cuda_inter_warp_reduction( ValueType& value,
}

template< class ValueType , class JoinOp>
__device__
inline void cuda_intra_block_reduction( ValueType& value,
__device__ inline
typename std::enable_if< !Kokkos::is_reducer<ValueType>::value >::type
cuda_intra_block_reduction( ValueType& value,
const JoinOp& join,
const int max_active_thread = blockDim.y) {
cuda_intra_warp_reduction(value,join,max_active_thread);
@ -314,31 +317,52 @@ bool cuda_inter_block_reduction( typename FunctorValueTraits< FunctorType , ArgT
if( id + 1 < int(gridDim.x) )
join(value, tmp);
}
int active = KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned int mask = KOKKOS_IMPL_CUDA_ACTIVEMASK;
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < int(gridDim.x) )
join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
}
}
//The last block has in its thread=0 the global reduction value through "value"
@ -478,31 +502,52 @@ cuda_inter_block_reduction( const ReducerType& reducer,
if( id + 1 < int(gridDim.x) )
reducer.join(value, tmp);
}
int active = KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned int mask = KOKKOS_IMPL_CUDA_ACTIVEMASK;
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
int active = KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 2) {
value_type tmp = Kokkos::shfl_down(value, 2,32);
if( id + 2 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 4) {
value_type tmp = Kokkos::shfl_down(value, 4,32);
if( id + 4 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 8) {
value_type tmp = Kokkos::shfl_down(value, 8,32);
if( id + 8 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
if (int(blockDim.x*blockDim.y) > 16) {
value_type tmp = Kokkos::shfl_down(value, 16,32);
if( id + 16 < int(gridDim.x) )
reducer.join(value, tmp);
}
active += KOKKOS_IMPL_CUDA_BALLOT(1);
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(mask,1);
#else
active += KOKKOS_IMPL_CUDA_BALLOT_MASK(1);
#endif
}
}

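Both overloads of cuda_inter_block_reduction now capture an explicit lane mask before the masked ballots. The reason is CUDA 9's independent thread scheduling: lanes of a warp are no longer guaranteed to run in lockstep, so sync/ballot intrinsics must be told which lanes participate. A standalone sketch of the idiom in plain CUDA, outside Kokkos (the function name is illustrative):

  __device__ inline int count_active_lanes()
  {
  #if ( CUDA_VERSION >= 9000 )
    // CUDA 9+: capture the set of currently converged lanes first, then pass
    // that mask to the _sync variant of the intrinsic.
    const unsigned mask = __activemask();
    return __popc( __ballot_sync( mask , 1 ) );
  #else
    // Pre-CUDA 9: warps execute in lockstep and the legacy intrinsic is unmasked.
    return __popc( __ballot( 1 ) );
  #endif
  }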
@ -513,6 +558,213 @@ cuda_inter_block_reduction( const ReducerType& reducer,
#endif
}

template<class FunctorType, class ArgTag, bool DoScan, bool UseShfl>
struct CudaReductionsFunctor;

template<class FunctorType, class ArgTag>
struct CudaReductionsFunctor<FunctorType, ArgTag, false, true> {
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type Scalar;

__device__
static inline void scalar_intra_warp_reduction(
const FunctorType& functor,
Scalar value, // Contribution
const bool skip_vector, // Skip threads if Kokkos vector lanes are not part of the reduction
const int width, // How much of the warp participates
Scalar& result)
{
unsigned mask = width==32?0xffffffff:((1<<width)-1)<<((threadIdx.y*blockDim.x+threadIdx.x)%(32/width))*width;
for(int delta=skip_vector?blockDim.x:1; delta<width; delta*=2) {
Scalar tmp;
cuda_shfl_down(tmp,value,delta,width,mask);
ValueJoin::join( functor , &value, &tmp);
}

cuda_shfl(result,value,0,width,mask);
}


__device__
static inline void scalar_intra_block_reduction(
const FunctorType& functor,
Scalar value,
const bool skip,
Scalar* my_global_team_buffer_element,
const int shared_elements,
Scalar* shared_team_buffer_element) {

const int warp_id = (threadIdx.y*blockDim.x)/32;
Scalar* const my_shared_team_buffer_element =
shared_team_buffer_element + warp_id%shared_elements;

// Warp Level Reduction, ignoring Kokkos vector entries
scalar_intra_warp_reduction(functor,value,skip,32,value);

if(warp_id<shared_elements) {
*my_shared_team_buffer_element=value;
}
// Wait for every warp to be done before using one warp to do final cross warp reduction
__syncthreads();

const int num_warps = blockDim.x*blockDim.y/32;
for(int w = shared_elements; w<num_warps; w+=shared_elements) {
if(warp_id>=w && warp_id<w+shared_elements) {
if((threadIdx.y*blockDim.x + threadIdx.x)%32==0)
ValueJoin::join( functor , my_shared_team_buffer_element, &value);
}
__syncthreads();
}


if( warp_id == 0) {
ValueInit::init( functor , &value );
for(unsigned int i=threadIdx.y*blockDim.x+threadIdx.x; i<blockDim.y*blockDim.x/32; i+=32)
ValueJoin::join( functor , &value,&shared_team_buffer_element[i]);
scalar_intra_warp_reduction(functor,value,false,32,*my_global_team_buffer_element);
}
}

__device__
static inline bool scalar_inter_block_reduction(
const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags ) {
Scalar* const global_team_buffer_element = ((Scalar*) global_data);
Scalar* const my_global_team_buffer_element = global_team_buffer_element + blockIdx.x;
Scalar* shared_team_buffer_elements = ((Scalar*) shared_data);
Scalar value = shared_team_buffer_elements[threadIdx.y];
int shared_elements=blockDim.x*blockDim.y/32;
int global_elements=block_count;
__syncthreads();

scalar_intra_block_reduction(functor,value,true,my_global_team_buffer_element,shared_elements,shared_team_buffer_elements);
__syncthreads();
unsigned int num_teams_done = 0;
if(threadIdx.x + threadIdx.y == 0) {
__threadfence();
num_teams_done = Kokkos::atomic_fetch_add(global_flags,1)+1;
}
bool is_last_block = false;
if(__syncthreads_or(num_teams_done == gridDim.x)) {
is_last_block=true;
*global_flags = 0;
ValueInit::init( functor, &value);
for(int i=threadIdx.y*blockDim.x+threadIdx.x; i<global_elements; i+=blockDim.x*blockDim.y) {
ValueJoin::join( functor , &value,&global_team_buffer_element[i]);
}
scalar_intra_block_reduction(functor,value,false,shared_team_buffer_elements+(blockDim.y-1),shared_elements,shared_team_buffer_elements);
}
return is_last_block;
}
};

template<class FunctorType, class ArgTag>
struct CudaReductionsFunctor<FunctorType, ArgTag, false, false> {
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;
typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type Scalar;

__device__
static inline void scalar_intra_warp_reduction(
const FunctorType& functor,
Scalar* value, // Contribution
const bool skip_vector, // Skip threads if Kokkos vector lanes are not part of the reduction
const int width) // How much of the warp participates
{
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
unsigned mask = width==32?0xffffffff:((1<<width)-1)<<((threadIdx.y*blockDim.x+threadIdx.x)%(32/width))*width;
#endif
const int lane_id = (threadIdx.y*blockDim.x+threadIdx.x)%32;
for(int delta=skip_vector?blockDim.x:1; delta<width; delta*=2) {
if(lane_id + delta<32) {
ValueJoin::join( functor , value, value+delta);
}
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(mask);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
}
*value=*(value-lane_id);
}


__device__
static inline void scalar_intra_block_reduction(
const FunctorType& functor,
Scalar value,
const bool skip,
Scalar* result,
const int shared_elements,
Scalar* shared_team_buffer_element) {

const int warp_id = (threadIdx.y*blockDim.x)/32;
Scalar* const my_shared_team_buffer_element =
shared_team_buffer_element + threadIdx.y*blockDim.x+threadIdx.x;
*my_shared_team_buffer_element = value;
// Warp Level Reduction, ignoring Kokkos vector entries
scalar_intra_warp_reduction(functor,my_shared_team_buffer_element,skip,32);
// Wait for every warp to be done before using one warp to do final cross warp reduction
__syncthreads();

if( warp_id == 0) {
const unsigned int delta = (threadIdx.y*blockDim.x+threadIdx.x)*32;
if(delta<blockDim.x*blockDim.y)
*my_shared_team_buffer_element = shared_team_buffer_element[delta];
KOKKOS_IMPL_CUDA_SYNCWARP;
scalar_intra_warp_reduction(functor,my_shared_team_buffer_element,false,blockDim.x*blockDim.y/32);
if(threadIdx.x + threadIdx.y == 0) *result = *shared_team_buffer_element;
}
}

__device__
static inline bool scalar_inter_block_reduction(
const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags ) {
Scalar* const global_team_buffer_element = ((Scalar*) global_data);
Scalar* const my_global_team_buffer_element = global_team_buffer_element + blockIdx.x;
Scalar* shared_team_buffer_elements = ((Scalar*) shared_data);
Scalar value = shared_team_buffer_elements[threadIdx.y];
int shared_elements=blockDim.x*blockDim.y/32;
int global_elements=block_count;
__syncthreads();

scalar_intra_block_reduction(functor,value,true,my_global_team_buffer_element,shared_elements,shared_team_buffer_elements);
__syncthreads();

unsigned int num_teams_done = 0;
if(threadIdx.x + threadIdx.y == 0) {
__threadfence();
num_teams_done = Kokkos::atomic_fetch_add(global_flags,1)+1;
}
bool is_last_block = false;
if(__syncthreads_or(num_teams_done == gridDim.x)) {
is_last_block=true;
*global_flags = 0;
ValueInit::init( functor, &value);
for(int i=threadIdx.y*blockDim.x+threadIdx.x; i<global_elements; i+=blockDim.x*blockDim.y) {
ValueJoin::join( functor , &value,&global_team_buffer_element[i]);
}
scalar_intra_block_reduction(functor,value,false,shared_team_buffer_elements+(blockDim.y-1),shared_elements,shared_team_buffer_elements);
}
return is_last_block;
}
};
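Both specializations end their scalar_inter_block_reduction with the same termination protocol: every block publishes a partial result, atomically increments a flag, and the one block that observes the final count reduces all partials. A stripped-down sketch of that protocol in plain CUDA (sum of doubles; all names are illustrative, not from this commit):

  __device__ bool finish_grid_sum( double block_sum , double * global_buf ,
                                   unsigned int * flag , double * out )
  {
    if ( threadIdx.x == 0 ) global_buf[ blockIdx.x ] = block_sum ;
    __threadfence();  // make this block's partial result globally visible

    unsigned int done = 0 ;
    if ( threadIdx.x == 0 ) done = atomicAdd( flag , 1u ) + 1 ;

    // True in exactly one block: the one whose increment saw the final count.
    if ( __syncthreads_or( done == gridDim.x ) ) {
      if ( threadIdx.x == 0 ) {
        *flag = 0 ;  // reset so the buffer can be reused by the next launch
        double total = 0 ;
        for ( unsigned int i = 0 ; i < gridDim.x ; ++i ) total += global_buf[i] ;
        *out = total ;
      }
      return true ;
    }
    return false ;
  }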
//----------------------------------------------------------------------------
// See section B.17 of Cuda C Programming Guide Version 3.2
// for discussion of
@ -639,14 +891,15 @@ void cuda_intra_block_reduce_scan( const FunctorType & functor ,
*
* Global reduce result is in the last threads' 'shared_data' location.
*/

template< bool DoScan , class FunctorType , class ArgTag >
__device__
bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags )
bool cuda_single_inter_block_reduce_scan2( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags )
{
typedef Cuda::size_type size_type ;
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
@ -655,7 +908,6 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
typedef FunctorValueOps< FunctorType , ArgTag > ValueOps ;

typedef typename ValueTraits::pointer_type pointer_type ;
//typedef typename ValueTraits::reference_type reference_type ;

// '__ffs' = position of the least significant bit set to 1.
// 'blockDim.y' is guaranteed to be a power of two so this
@ -678,12 +930,7 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
size_type * const shared = shared_data + word_count.value * BlockSizeMask ;
size_type * const global = global_data + word_count.value * block_id ;

//#if (__CUDA_ARCH__ < 500)
for ( int i = int(threadIdx.y) ; i < int(word_count.value) ; i += int(blockDim.y) ) { global[i] = shared[i] ; }
//#else
// for ( size_type i = 0 ; i < word_count.value ; i += 1 ) { global[i] = shared[i] ; }
//#endif

}

// Contributing blocks note that their contribution has been completed via an atomic-increment flag
@ -725,6 +972,22 @@ bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
return is_last_block ;
}

template< bool DoScan , class FunctorType , class ArgTag >
__device__
bool cuda_single_inter_block_reduce_scan( const FunctorType & functor ,
const Cuda::size_type block_id ,
const Cuda::size_type block_count ,
Cuda::size_type * const shared_data ,
Cuda::size_type * const global_data ,
Cuda::size_type * const global_flags )
{
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
if(!DoScan && ValueTraits::StaticValueSize)
return Kokkos::Impl::CudaReductionsFunctor<FunctorType,ArgTag,false,(ValueTraits::StaticValueSize>16)>::scalar_inter_block_reduction(functor,block_id,block_count,shared_data,global_data,global_flags);
else
return cuda_single_inter_block_reduce_scan2<DoScan, FunctorType, ArgTag>(functor, block_id, block_count, shared_data, global_data, global_flags);
}

// Size in bytes required for inter block reduce or scan
template< bool DoScan , class FunctorType , class ArgTag >
inline

@ -160,7 +160,7 @@ public:

template<class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast( ValueType & val, const int& thread_id) const
void team_broadcast( ValueType & val, const int& thread_id ) const
{
#ifdef __CUDA_ARCH__
if ( 1 == blockDim.z ) { // team == block
@ -178,6 +178,29 @@ public:
}
#endif
}

template<class Closure, class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast( Closure const & f, ValueType & val, const int& thread_id ) const
{
#ifdef __CUDA_ARCH__
f( val );

if ( 1 == blockDim.z ) { // team == block
__syncthreads();
// Wait for shared data write until all threads arrive here
if ( threadIdx.x == 0u && threadIdx.y == (uint32_t)thread_id ) {
*((ValueType*) m_team_reduce) = val ;
}
__syncthreads(); // Wait for shared data read until root thread writes
val = *((ValueType*) m_team_reduce);
}
else { // team <= warp
ValueType tmp( val ); // input might not be a register variable
cuda_shfl( val, tmp, blockDim.x * thread_id, blockDim.x * blockDim.y );
}
#endif
}

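The added overload runs a closure on the value before the broadcast, so a transform and a team-wide broadcast become one call. A hedged usage sketch inside a team kernel (assuming the compiler in use accepts a plain device-side lambda here; names are illustrative):

  #include <Kokkos_Core.hpp>

  void broadcast_example( const int nteams , const int team_size ) {
    typedef Kokkos::TeamPolicy<>::member_type member_type ;
    Kokkos::parallel_for( Kokkos::TeamPolicy<>( nteams , team_size ) ,
      KOKKOS_LAMBDA( const member_type & team ) {
        double val = double( team.team_rank() ) ;
        // Every thread applies the closure to its own copy; the copy owned by
        // thread 0 is then broadcast to the whole team.
        team.team_broadcast( [&]( double & x ) { x *= 2.0 ; } , val , 0 );
        // All members now hold thread 0's transformed value, i.e. 0.0 here.
      } );
  }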
//--------------------------------------------------------------------------
/**\brief Reduction across a team
@ -200,92 +223,7 @@ public:
team_reduce( ReducerType const & reducer ) const noexcept
{
#ifdef __CUDA_ARCH__

typedef typename ReducerType::value_type value_type ;

value_type tmp( reducer.reference() );

// reduce within the warp using shuffle

const int wx =
( threadIdx.x + blockDim.x * threadIdx.y ) & CudaTraits::WarpIndexMask ;

for ( int i = CudaTraits::WarpSize ; (int)blockDim.x <= ( i >>= 1 ) ; ) {

cuda_shfl_down( reducer.reference() , tmp , i , CudaTraits::WarpSize );

// Root of each vector lane reduces:
if ( 0 == threadIdx.x && wx < i ) {
reducer.join( tmp , reducer.reference() );
}
}

if ( 1 < blockDim.z ) { // team <= warp
// broadcast result from root vector lane of root thread

cuda_shfl( reducer.reference() , tmp
, blockDim.x * threadIdx.y , CudaTraits::WarpSize );

}
else { // team == block
// Reduce across warps using shared memory
// Broadcast result within block

// Number of warps, blockDim.y may not be power of two:
const int nw = ( blockDim.x * blockDim.y + CudaTraits::WarpIndexMask ) >> CudaTraits::WarpIndexShift ;

// Warp index:
const int wy = ( blockDim.x * threadIdx.y ) >> CudaTraits::WarpIndexShift ;

// Number of shared memory entries for the reduction:
int nsh = m_team_reduce_size / sizeof(value_type);

// Using at most one entry per warp:
if ( nw < nsh ) nsh = nw ;

__syncthreads(); // Wait before shared data write

if ( 0 == wx && wy < nsh ) {
((value_type*) m_team_reduce)[wy] = tmp ;
}

// When more warps than shared entries:
for ( int i = nsh ; i < nw ; i += nsh ) {

__syncthreads();

if ( 0 == wx && i <= wy ) {
const int k = wy - i ;
if ( k < nsh ) {
reducer.join( *((value_type*) m_team_reduce + k) , tmp );
}
}
}

__syncthreads();

// One warp performs the inter-warp reduction:

if ( 0 == wy ) {

// Start at power of two covering nsh

for ( int i = 1 << ( 32 - __clz(nsh-1) ) ; ( i >>= 1 ) ; ) {
const int k = wx + i ;
if ( wx < i && k < nsh ) {
reducer.join( ((value_type*)m_team_reduce)[wx]
, ((value_type*)m_team_reduce)[k] );
__threadfence_block();
}
}
}

__syncthreads(); // Wait for reduction

// Broadcast result to all threads
reducer.reference() = *((value_type*)m_team_reduce);
}

cuda_intra_block_reduction(reducer,blockDim.y);
#endif /* #ifdef __CUDA_ARCH__ */
}

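team_reduce now delegates to the shared cuda_intra_block_reduction path shown earlier instead of keeping its own shuffle-plus-shared-memory implementation. The user-facing call it serves is unchanged; a hedged sketch (reducer and sizes illustrative):

  #include <Kokkos_Core.hpp>

  void team_reduce_example( const int nteams , const int team_size ) {
    typedef Kokkos::TeamPolicy<>::member_type member_type ;
    Kokkos::parallel_for( Kokkos::TeamPolicy<>( nteams , team_size ) ,
      KOKKOS_LAMBDA( const member_type & team ) {
        double my_value = double( team.team_rank() ) ;
        Kokkos::Max<double> reducer( my_value ) ;
        // Every member contributes; afterwards my_value holds the team maximum.
        team.team_reduce( reducer );
      } );
  }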
@ -801,7 +739,11 @@ void parallel_for
; i += blockDim.x ) {
closure(i);
}
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}

@ -970,7 +912,11 @@ KOKKOS_INLINE_FUNCTION
void single(const Impl::VectorSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0) lambda();
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}

@ -979,7 +925,11 @@ KOKKOS_INLINE_FUNCTION
void single(const Impl::ThreadSingleStruct<Impl::CudaTeamMember>& , const FunctorType& lambda) {
#ifdef __CUDA_ARCH__
if(threadIdx.x == 0 && threadIdx.y == 0) lambda();
#ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
KOKKOS_IMPL_CUDA_SYNCWARP_MASK(blockDim.x==32?0xffffffff:((1<<blockDim.x)-1)<<(threadIdx.y%(32/blockDim.x))*blockDim.x);
#else
KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
#endif
#endif
}


@ -2,9 +2,11 @@

#if defined( __CUDA_ARCH__ )
#if ( CUDA_VERSION < 9000 )
#define KOKKOS_IMPL_CUDA_ACTIVEMASK 0
#define KOKKOS_IMPL_CUDA_SYNCWARP __threadfence_block()
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(x) __threadfence_block()
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK __threadfence_block()
#define KOKKOS_IMPL_CUDA_BALLOT(x) __ballot(x)
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(x) __ballot(x)
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) __shfl(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) __shfl(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) __shfl_up(x,y,z)
@ -12,9 +14,11 @@
#define KOKKOS_IMPL_CUDA_SHFL_DOWN(x,y,z) __shfl_down(x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) __shfl_down(x,y,z)
#else
#define KOKKOS_IMPL_CUDA_ACTIVEMASK __activemask()
#define KOKKOS_IMPL_CUDA_SYNCWARP __syncwarp(0xffffffff)
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(m) __syncwarp(m)
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK(m) __syncwarp(m);
#define KOKKOS_IMPL_CUDA_BALLOT(x) __ballot_sync(__activemask(),x)
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(m,x) __ballot_sync(m,x)
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) __shfl_sync(0xffffffff,x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) __shfl_sync(m,x,y,z)
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) __shfl_up_sync(0xffffffff,x,y,z)
@ -23,11 +27,16 @@
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) __shfl_down_sync(m,x,y,z)
#endif
#else
#define KOKKOS_IMPL_CUDA_ACTIVEMASK 0
#define KOKKOS_IMPL_CUDA_SYNCWARP
#define KOKKOS_IMPL_CUDA_SYNCWARP_MASK
#define KOKKOS_IMPL_CUDA_BALLOT(x) 0
#define KOKKOS_IMPL_CUDA_BALLOT_MASK(x) 0
#define KOKKOS_IMPL_CUDA_SHFL(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_MASK(m,x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_UP(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_DOWN(x,y,z) 0
#define KOKKOS_IMPL_CUDA_SHFL_DOWN_MASK(m,x,y,z) 0
#endif

#if ( CUDA_VERSION >= 9000 ) && (!defined(KOKKOS_COMPILER_CLANG))

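This compatibility header is what lets a single source tree span the CUDA 9 warp-synchronization changes: on older toolkits the masked macros degrade to their legacy counterparts, and in host compilation passes they collapse to no-ops. A hedged sketch of a device helper written against these macros (the helper itself is illustrative, not part of this commit):

  __device__ inline int warp_sum_into_lane0( int value )
  {
  #ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
    const unsigned mask = KOKKOS_IMPL_CUDA_ACTIVEMASK ;  // capture participants once
  #endif
    for ( int delta = 16 ; delta > 0 ; delta >>= 1 ) {
      value += KOKKOS_IMPL_CUDA_SHFL_DOWN( value , delta , 32 );
  #ifdef KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
      KOKKOS_IMPL_CUDA_SYNCWARP_MASK(mask);
  #else
      KOKKOS_IMPL_CUDA_SYNCWARP_MASK;
  #endif
    }
    return value ;  // lane 0 ends up holding the sum of all 32 lanes
  }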
@ -279,6 +279,8 @@ public:
KOKKOS_INLINE_FUNCTION
static handle_type assign( value_type * arg_data_ptr, track_type const & arg_tracker )
{
if(arg_data_ptr == NULL) return handle_type();

#if defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
// Assignment of texture = non-texture requires creation of a texture object
// which can only occur on the host. In addition, 'get_record' is only valid
@ -292,8 +294,7 @@ public:

#if ! defined( KOKKOS_ENABLE_CUDA_LDG_INTRINSIC )
if ( 0 == r ) {
//Kokkos::abort("Cuda const random access View using Cuda texture memory requires Kokkos to allocate the View's memory");
return handle_type();
Kokkos::abort("Cuda const random access View using Cuda texture memory requires Kokkos to allocate the View's memory");
}
#endif


@ -46,6 +46,8 @@

#include <initializer_list>

#include <Kokkos_Layout.hpp>

#include<impl/KokkosExp_Host_IterateTile.hpp>
#include <Kokkos_ExecPolicy.hpp>
#include <Kokkos_Parallel.hpp>
@ -63,13 +65,15 @@
namespace Kokkos {

// ------------------------------------------------------------------ //

// Moved to Kokkos_Layout.hpp for more general accessibility
/*
enum class Iterate
{
Default, // Default for the device
Left, // Left indices stride fastest
Right, // Right indices stride fastest
};
*/

template <typename ExecSpace>
struct default_outer_direction

@ -45,11 +45,13 @@
#define KOKKOS_ARRAY_HPP

#include <Kokkos_Macros.hpp>
#include <impl/Kokkos_Error.hpp>

#include <type_traits>
#include <algorithm>
#include <limits>
#include <cstddef>
#include <string>

namespace Kokkos {

@ -132,6 +134,7 @@ public:

KOKKOS_INLINE_FUNCTION static constexpr size_type size() { return N ; }
KOKKOS_INLINE_FUNCTION static constexpr bool empty(){ return false ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return N ; }

template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -160,7 +163,7 @@ public:
return & m_internal_implementation_private_member_data[0];
}

#ifdef KOKKOS_ROCM_CLANG_WORKAROUND
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
// Do not default unless move and move-assignment are also defined
KOKKOS_INLINE_FUNCTION
~Array() = default ;
@ -197,6 +200,7 @@ public:

KOKKOS_INLINE_FUNCTION static constexpr size_type size() { return 0 ; }
KOKKOS_INLINE_FUNCTION static constexpr bool empty() { return true ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return 0 ; }

template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -261,6 +265,7 @@ public:

KOKKOS_INLINE_FUNCTION constexpr size_type size() const { return m_size ; }
KOKKOS_INLINE_FUNCTION constexpr bool empty() const { return 0 != m_size ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return m_size ; }

template< typename iType >
KOKKOS_INLINE_FUNCTION
@ -336,6 +341,7 @@ public:

KOKKOS_INLINE_FUNCTION constexpr size_type size() const { return m_size ; }
KOKKOS_INLINE_FUNCTION constexpr bool empty() const { return 0 != m_size ; }
KOKKOS_INLINE_FUNCTION constexpr size_type max_size() const { return m_size ; }

template< typename iType >
KOKKOS_INLINE_FUNCTION

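With max_size() added (changelog item #1760), Kokkos::Array now carries the size/empty/max_size triple of std::array and all three are usable in device code. A hedged sketch (the kernel body is illustrative):

  #include <Kokkos_Core.hpp>
  #include <Kokkos_Array.hpp>

  void array_example() {
    Kokkos::parallel_for( 16 , KOKKOS_LAMBDA( const int i ) {
      Kokkos::Array< double , 4 > a ;
      // For the fixed-size specialization max_size() equals size() equals N:
      for ( size_t j = 0 ; j < a.max_size() ; ++j ) a[j] = double( i ) + double( j ) ;
    } );
  }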
@ -105,7 +105,10 @@ namespace Kokkos {
template< typename T > struct is_ ## CONCEPT { \
private: \
template< typename , typename = std::true_type > struct have : std::false_type {}; \
template< typename U > struct have<U,typename std::is_same<U,typename U:: CONCEPT >::type> : std::true_type {}; \
template< typename U > struct have<U,typename std::is_same< \
typename std::remove_cv<U>::type, \
typename std::remove_cv<typename U:: CONCEPT>::type \
>::type> : std::true_type {}; \
public: \
enum { value = is_ ## CONCEPT::template have<T>::value }; \
};

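The extra remove_cv makes the concept traits also recognize const-qualified types. A self-contained sketch of the same detection idiom (the trait and test type below are illustrative, not Kokkos code):

  #include <type_traits>

  // Detect whether T has a member typedef 'memory_space' that names T itself,
  // ignoring cv-qualification -- the shape the macro above expands to.
  template< typename T > struct is_memory_space_like {
  private:
    template< typename , typename = std::true_type > struct have : std::false_type {};
    template< typename U > struct have< U , typename std::is_same<
        typename std::remove_cv<U>::type ,
        typename std::remove_cv< typename U::memory_space >::type >::type >
      : std::true_type {};
  public:
    enum { value = have<T>::value };
  };

  struct FakeSpace { typedef FakeSpace memory_space ; };

  static_assert(   is_memory_space_like< FakeSpace >::value , "detected" );
  static_assert(   is_memory_space_like< const FakeSpace >::value , "const detected too" );
  static_assert( ! is_memory_space_like< int >::value , "int has no memory_space" );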
@ -453,8 +453,9 @@ template<class ViewTypeA,class ViewTypeB, class Layout, class ExecSpace,typename
struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,2,iType,KOKKOS_IMPL_COMPILING_LIBRARY> {
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<2,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<2,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -475,7 +476,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,3,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<3,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<3,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -496,7 +499,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,4,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<4,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<4,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -519,7 +524,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,5,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<5,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<5,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -542,7 +549,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,6,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -566,7 +575,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,7,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -590,7 +601,9 @@ struct ViewCopy<ViewTypeA,ViewTypeB,Layout,ExecSpace,8,iType,KOKKOS_IMPL_COMPILI
ViewTypeA a;
ViewTypeB b;

typedef Kokkos::Rank<6,ViewFillLayoutSelector<Layout>::iterate,ViewFillLayoutSelector<Layout>::iterate> iterate_type;
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::outer_iteration_pattern;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::layout_iterate_type_selector<Layout>::inner_iteration_pattern;
typedef Kokkos::Rank<6,outer_iteration_pattern,inner_iteration_pattern> iterate_type;
typedef Kokkos::MDRangePolicy<ExecSpace,iterate_type,Kokkos::IndexType<iType>> policy_type;

ViewCopy(const ViewTypeA& a_, const ViewTypeB& b_):a(a_),b(b_) {
@ -642,7 +655,9 @@ void view_copy(const DstType& dst, const SrcType& src) {
int64_t strides[DstType::Rank+1];
dst.stride(strides);
Kokkos::Iterate iterate;
if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutRight>::value ) {
if ( Kokkos::is_layouttiled<typename DstType::array_layout>::value ) {
iterate = Kokkos::layout_iterate_type_selector<typename DstType::array_layout>::outer_iteration_pattern;
} else if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutRight>::value ) {
iterate = Kokkos::Iterate::Right;
} else if ( std::is_same<typename DstType::array_layout,Kokkos::LayoutLeft>::value ) {
iterate = Kokkos::Iterate::Left;
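The ViewCopy functors and view_copy above now take their traversal order from layout_iterate_type_selector, which feeds the Rank<...> arguments of an MDRangePolicy. A hedged sketch of the same mechanism written by hand for a 2-D copy (names illustrative):

  #include <Kokkos_Core.hpp>

  void copy_2d( const Kokkos::View<double**> & dst ,
                const Kokkos::View<const double**> & src ) {
    // Right/Right iteration matches LayoutRight data: the second index is fastest.
    typedef Kokkos::MDRangePolicy<
        Kokkos::Rank< 2 , Kokkos::Iterate::Right , Kokkos::Iterate::Right > > policy_type ;
    Kokkos::parallel_for( policy_type( { 0 , 0 } ,
                                       { (long) dst.extent(0) , (long) dst.extent(1) } ) ,
      KOKKOS_LAMBDA( const int i , const int j ) { dst(i,j) = src(i,j) ; } );
  }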
@ -1243,9 +1258,9 @@ void deep_copy
ViewTypeFlat;

ViewTypeFlat dst_flat(dst.data(),dst.size());
if(dst.span() < std::numeric_limits<int>::max())
if(dst.span() < std::numeric_limits<int>::max()) {
Kokkos::Impl::ViewFill< ViewTypeFlat , Kokkos::LayoutRight, typename ViewType::execution_space, ViewTypeFlat::Rank, int >( dst_flat , value );
else
} else
Kokkos::Impl::ViewFill< ViewTypeFlat , Kokkos::LayoutRight, typename ViewType::execution_space, ViewTypeFlat::Rank, int64_t >( dst_flat , value );
Kokkos::fence();
return;
@ -1397,7 +1412,6 @@ void deep_copy
enum { SrcExecCanAccessDst =
Kokkos::Impl::SpaceAccessibility< src_execution_space , dst_memory_space >::accessible };


// Checking for Overlapping Views.
dst_value_type* dst_start = dst.data();
dst_value_type* dst_end = dst.data() + dst.span();
@ -1493,7 +1507,7 @@ void deep_copy
Kokkos::fence();
} else {
Kokkos::fence();
Impl::view_copy(typename dst_type::uniform_runtime_nomemspace_type(dst),typename src_type::uniform_runtime_const_nomemspace_type(src));
Impl::view_copy(dst, src);
Kokkos::fence();
}
}
@ -1739,8 +1753,7 @@ void deep_copy
exec_space.fence();
} else {
exec_space.fence();
Impl::view_copy(typename dst_type::uniform_runtime_nomemspace_type(dst),
typename src_type::uniform_runtime_const_nomemspace_type(src));
Impl::view_copy(dst, src);
exec_space.fence();
}
}
@ -1917,4 +1930,213 @@ void realloc( Kokkos::View<T,P...> & v ,
}
} /* namespace Kokkos */

//----------------------------------------------------------------------------
//----------------------------------------------------------------------------

namespace Kokkos {
namespace Impl {

// Deduce Mirror Types
template<class Space, class T, class ... P>
struct MirrorViewType {
// The incoming view_type
typedef typename Kokkos::View<T,P...> src_view_type;
// The memory space for the mirror view
typedef typename Space::memory_space memory_space;
// Check whether it is the same memory space
enum { is_same_memspace = std::is_same<memory_space,typename src_view_type::memory_space>::value };
// The array_layout
typedef typename src_view_type::array_layout array_layout;
// The data type (we probably want it non-const since otherwise we can't even deep_copy to it).
typedef typename src_view_type::non_const_data_type data_type;
// The destination view type if it is not the same memory space
typedef Kokkos::View<data_type,array_layout,Space> dest_view_type;
// If it is the same memory_space return the existing view_type
// This will also keep the unmanaged trait if necessary
typedef typename std::conditional<is_same_memspace,src_view_type,dest_view_type>::type view_type;
};

template<class Space, class T, class ... P>
struct MirrorType {
// The incoming view_type
typedef typename Kokkos::View<T,P...> src_view_type;
// The memory space for the mirror view
typedef typename Space::memory_space memory_space;
// Check whether it is the same memory space
enum { is_same_memspace = std::is_same<memory_space,typename src_view_type::memory_space>::value };
// The array_layout
typedef typename src_view_type::array_layout array_layout;
// The data type (we probably want it non-const since otherwise we can't even deep_copy to it).
|
||||
typedef typename src_view_type::non_const_data_type data_type;
|
||||
// The destination view type if it is not the same memory space
|
||||
typedef Kokkos::View<data_type,array_layout,Space> view_type;
|
||||
};
|
||||
|
||||
}
|
||||
|
||||
template< class T , class ... P >
|
||||
inline
|
||||
typename Kokkos::View<T,P...>::HostMirror
|
||||
create_mirror( const Kokkos::View<T,P...> & src
|
||||
, typename std::enable_if<
|
||||
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
|
||||
! std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
|
||||
, Kokkos::LayoutStride >::value
|
||||
>::type * = 0
|
||||
)
|
||||
{
|
||||
typedef View<T,P...> src_type ;
|
||||
typedef typename src_type::HostMirror dst_type ;
|
||||
|
||||
return dst_type( std::string( src.label() ).append("_mirror")
|
||||
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
|
||||
, src.extent(0)
|
||||
, src.extent(1)
|
||||
, src.extent(2)
|
||||
, src.extent(3)
|
||||
, src.extent(4)
|
||||
, src.extent(5)
|
||||
, src.extent(6)
|
||||
, src.extent(7) );
|
||||
#else
|
||||
, src.rank_dynamic > 0 ? src.extent(0): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 1 ? src.extent(1): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 2 ? src.extent(2): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 3 ? src.extent(3): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 4 ? src.extent(4): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 5 ? src.extent(5): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 6 ? src.extent(6): KOKKOS_IMPL_CTOR_DEFAULT_ARG
|
||||
, src.rank_dynamic > 7 ? src.extent(7): KOKKOS_IMPL_CTOR_DEFAULT_ARG );
|
||||
#endif
|
||||
}
|
||||
|
||||
template< class T , class ... P >
|
||||
inline
|
||||
typename Kokkos::View<T,P...>::HostMirror
|
||||
create_mirror( const Kokkos::View<T,P...> & src
|
||||
, typename std::enable_if<
|
||||
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value &&
|
||||
std::is_same< typename Kokkos::ViewTraits<T,P...>::array_layout
|
||||
, Kokkos::LayoutStride >::value
|
||||
>::type * = 0
|
||||
)
|
||||
{
|
||||
typedef View<T,P...> src_type ;
|
||||
typedef typename src_type::HostMirror dst_type ;
|
||||
|
||||
Kokkos::LayoutStride layout ;
|
||||
|
||||
layout.dimension[0] = src.extent(0);
|
||||
layout.dimension[1] = src.extent(1);
|
||||
layout.dimension[2] = src.extent(2);
|
||||
layout.dimension[3] = src.extent(3);
|
||||
layout.dimension[4] = src.extent(4);
|
||||
layout.dimension[5] = src.extent(5);
|
||||
layout.dimension[6] = src.extent(6);
|
||||
layout.dimension[7] = src.extent(7);
|
||||
|
||||
layout.stride[0] = src.stride_0();
|
||||
layout.stride[1] = src.stride_1();
|
||||
layout.stride[2] = src.stride_2();
|
||||
layout.stride[3] = src.stride_3();
layout.stride[4] = src.stride_4();
layout.stride[5] = src.stride_5();
layout.stride[6] = src.stride_6();
layout.stride[7] = src.stride_7();

return dst_type( std::string( src.label() ).append("_mirror") , layout );
}

// Create a mirror in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorType<Space,T,P ...>::view_type
create_mirror(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<
std::is_same< typename ViewTraits<T,P...>::specialize , void >::value
>::type * = 0) {
return typename Impl::MirrorType<Space,T,P ...>::view_type(src.label(),src.layout());
}

template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror_view( const Kokkos::View<T,P...> & src
, typename std::enable_if<(
std::is_same< typename Kokkos::View<T,P...>::memory_space
, typename Kokkos::View<T,P...>::HostMirror::memory_space
>::value
&&
std::is_same< typename Kokkos::View<T,P...>::data_type
, typename Kokkos::View<T,P...>::HostMirror::data_type
>::value
)>::type * = 0
)
{
return src ;
}

template< class T , class ... P >
inline
typename Kokkos::View<T,P...>::HostMirror
create_mirror_view( const Kokkos::View<T,P...> & src
, typename std::enable_if< ! (
std::is_same< typename Kokkos::View<T,P...>::memory_space
, typename Kokkos::View<T,P...>::HostMirror::memory_space
>::value
&&
std::is_same< typename Kokkos::View<T,P...>::data_type
, typename Kokkos::View<T,P...>::HostMirror::data_type
>::value
)>::type * = 0
)
{
return Kokkos::create_mirror( src );
}

// Create a mirror view in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
return src;
}

// Create a mirror view in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view(const Space& , const Kokkos::View<T,P...> & src
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
return typename Impl::MirrorViewType<Space,T,P ...>::view_type(src.label(),src.layout());
}

// Create a mirror view and deep_copy in a new space (specialization for same space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
(void)name;
return src;
}

// Create a mirror view and deep_copy in a new space (specialization for different space)
template<class Space, class T, class ... P>
typename Impl::MirrorViewType<Space,T,P ...>::view_type
create_mirror_view_and_copy(const Space& , const Kokkos::View<T,P...> & src
, std::string const& name = ""
, typename std::enable_if<!Impl::MirrorViewType<Space,T,P ...>::is_same_memspace>::type* = 0 ) {
using Mirror = typename Impl::MirrorViewType<Space,T,P ...>::view_type;
std::string label = name.empty() ? src.label() : name;
auto mirror = Mirror(ViewAllocateWithoutInitializing(label), src.layout());
deep_copy(mirror, src);
return mirror;
}

} /* namespace Kokkos */
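A minimal usage sketch of the create_mirror_view_and_copy overloads above (the view names and the CudaSpace choice are hypothetical; any device memory space behaves the same way):

    // Sketch: mirror a device view to host and copy its contents.
    Kokkos::View<double*, Kokkos::CudaSpace> d_data( "d_data" , 100 );
    // Same memory space: d_data itself is returned. Different space: an
    // uninitialized mirror is allocated, labeled, deep_copied, then returned.
    auto h_data = Kokkos::create_mirror_view_and_copy( Kokkos::HostSpace() , d_data );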
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------

#endif

@ -57,6 +57,10 @@

namespace Kokkos {

struct ParallelForTag {};
struct ParallelScanTag {};
struct ParallelReduceTag {};

struct ChunkSize {
int value;
ChunkSize(int value_):value(value_) {}
@ -320,6 +324,10 @@ public:

template< class FunctorType >
static int team_size_recommended( const FunctorType & , const int&);

template<class FunctorType>
int team_size_recommended( const FunctorType & functor , const int vector_length);

//----------------------------------------
/** \brief Construct policy with the given instance of the execution space */
TeamPolicyInternal( const typename traits::execution_space & , int league_size_request , int team_size_request , int vector_length_request = 1 );

@ -76,6 +76,8 @@ struct LayoutLeft {

size_t dimension[ ARRAY_LAYOUT_MAX_RANK ];

enum { is_extent_constructible = true };

LayoutLeft( LayoutLeft const & ) = default ;
LayoutLeft( LayoutLeft && ) = default ;
LayoutLeft & operator = ( LayoutLeft const & ) = default ;
@ -108,6 +110,8 @@ struct LayoutRight {

size_t dimension[ ARRAY_LAYOUT_MAX_RANK ];

enum { is_extent_constructible = true };

LayoutRight( LayoutRight const & ) = default ;
LayoutRight( LayoutRight && ) = default ;
LayoutRight & operator = ( LayoutRight const & ) = default ;
@ -132,6 +136,8 @@ struct LayoutStride {
size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;
size_t stride[ ARRAY_LAYOUT_MAX_RANK ] ;

enum { is_extent_constructible = false };

LayoutStride( LayoutStride const & ) = default ;
LayoutStride( LayoutStride && ) = default ;
LayoutStride & operator = ( LayoutStride const & ) = default ;
@ -222,6 +228,8 @@ struct LayoutTileLeft {

size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;

enum { is_extent_constructible = true };

LayoutTileLeft( LayoutTileLeft const & ) = default ;
LayoutTileLeft( LayoutTileLeft && ) = default ;
LayoutTileLeft & operator = ( LayoutTileLeft const & ) = default ;
@ -235,6 +243,144 @@ struct LayoutTileLeft {
: dimension { argN0 , argN1 , argN2 , argN3 , argN4 , argN5 , argN6 , argN7 } {}
};

//////////////////////////////////////////////////////////////////////////////////////

enum class Iterate
{
Default,
Left, // Left indices stride fastest
Right // Right indices stride fastest
};

// To check for LayoutTiled
// This is to hide extra compile-time 'identifier' info within the LayoutTiled class by not relying on template specialization to include the ArgN*'s
template < typename LayoutTiledCheck, class Enable = void >
struct is_layouttiled : std::false_type {};

#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
template < typename LayoutTiledCheck >
struct is_layouttiled< LayoutTiledCheck, typename std::enable_if<LayoutTiledCheck::is_array_layout_tiled>::type > : std::true_type {};

namespace Experimental {

/// LayoutTiled
// Must have Rank >= 2
template < Kokkos::Iterate OuterP, Kokkos::Iterate InnerP,
unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 = 0, unsigned ArgN3 = 0, unsigned ArgN4 = 0, unsigned ArgN5 = 0, unsigned ArgN6 = 0, unsigned ArgN7 = 0,
bool IsPowerOfTwo =
( Impl::is_integral_power_of_two(ArgN0) &&
Impl::is_integral_power_of_two(ArgN1) &&
(Impl::is_integral_power_of_two(ArgN2) || (ArgN2 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN3) || (ArgN3 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN4) || (ArgN4 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN5) || (ArgN5 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN6) || (ArgN6 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN7) || (ArgN7 == 0) )
)
>
struct LayoutTiled {

static_assert( IsPowerOfTwo
, "LayoutTiled must be given power-of-two tile dimensions" );

#if 0
static_assert( (Impl::is_integral_power_of_two(ArgN0) ) &&
(Impl::is_integral_power_of_two(ArgN1) ) &&
(Impl::is_integral_power_of_two(ArgN2) || (ArgN2 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN3) || (ArgN3 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN4) || (ArgN4 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN5) || (ArgN5 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN6) || (ArgN6 == 0) ) &&
(Impl::is_integral_power_of_two(ArgN7) || (ArgN7 == 0) )
, "LayoutTiled must be given power-of-two tile dimensions" );
#endif

typedef LayoutTiled<OuterP, InnerP, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, IsPowerOfTwo> array_layout ;
static constexpr Iterate outer_pattern = OuterP;
static constexpr Iterate inner_pattern = InnerP;

enum { N0 = ArgN0 };
enum { N1 = ArgN1 };
enum { N2 = ArgN2 };
enum { N3 = ArgN3 };
enum { N4 = ArgN4 };
enum { N5 = ArgN5 };
enum { N6 = ArgN6 };
enum { N7 = ArgN7 };

size_t dimension[ ARRAY_LAYOUT_MAX_RANK ] ;

enum { is_extent_constructible = true };

LayoutTiled( LayoutTiled const & ) = default ;
LayoutTiled( LayoutTiled && ) = default ;
LayoutTiled & operator = ( LayoutTiled const & ) = default ;
LayoutTiled & operator = ( LayoutTiled && ) = default ;

KOKKOS_INLINE_FUNCTION
explicit constexpr
LayoutTiled( size_t argN0 = 0 , size_t argN1 = 0 , size_t argN2 = 0 , size_t argN3 = 0
, size_t argN4 = 0 , size_t argN5 = 0 , size_t argN6 = 0 , size_t argN7 = 0
)
: dimension { argN0 , argN1 , argN2 , argN3 , argN4 , argN5 , argN6 , argN7 } {}
};

} // namespace Experimental
#endif
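A sketch of how the new LayoutTiled and is_layouttiled pieces fit together (the alias name is hypothetical, and this assumes LayoutTiled exposes the is_array_layout_tiled member the specialization above keys on; tile extents must be powers of two or the static_assert fires):

    // Rank-2 tiled layout, 4x4 tiles, Left iteration both outside and inside:
    using TiledLayout = Kokkos::Experimental::LayoutTiled<
        Kokkos::Iterate::Left , Kokkos::Iterate::Left , 4 , 4 >;
    static_assert( Kokkos::is_layouttiled< TiledLayout >::value ,
                   "detected via the is_array_layout_tiled member" );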
// For use with view_copy
template < typename ... Layout >
struct layout_iterate_type_selector {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Default ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Default ;
};

template <>
struct layout_iterate_type_selector< Kokkos::LayoutRight > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};

template <>
struct layout_iterate_type_selector< Kokkos::LayoutLeft > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};

template <>
struct layout_iterate_type_selector< Kokkos::LayoutStride > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Default ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Default ;
};

#ifndef KOKKOS_ENABLE_DEPRECATED_CODE
template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Left, Kokkos::Iterate::Left, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};

template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Right, Kokkos::Iterate::Left, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Left ;
};

template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Left, Kokkos::Iterate::Right, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Left ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};

template < unsigned ArgN0 , unsigned ArgN1 , unsigned ArgN2 , unsigned ArgN3 , unsigned ArgN4 , unsigned ArgN5 , unsigned ArgN6 , unsigned ArgN7 >
struct layout_iterate_type_selector< Kokkos::Experimental::LayoutTiled<Kokkos::Iterate::Right, Kokkos::Iterate::Right, ArgN0, ArgN1, ArgN2, ArgN3, ArgN4, ArgN5, ArgN6, ArgN7, true> > {
static const Kokkos::Iterate outer_iteration_pattern = Kokkos::Iterate::Right ;
static const Kokkos::Iterate inner_iteration_pattern = Kokkos::Iterate::Right ;
};
#endif

} // namespace Kokkos

#endif // #ifndef KOKKOS_LAYOUT_HPP

@ -153,7 +153,7 @@
#else
#define KOKKOS_LAMBDA [=]__host__ __device__

#if defined( KOKKOS_ENABLE_CXX1Z )
#if defined( KOKKOS_ENABLE_CXX17 ) || defined( KOKKOS_ENABLE_CXX20 )
#define KOKKOS_CLASS_LAMBDA [=,*this] __host__ __device__
#endif
#endif
@ -213,7 +213,7 @@
#define KOKKOS_LAMBDA [=]
#endif

#if defined( KOKKOS_ENABLE_CXX1Z ) && !defined( KOKKOS_CLASS_LAMBDA )
#if (defined( KOKKOS_ENABLE_CXX17 ) || defined( KOKKOS_ENABLE_CXX20) )&& !defined( KOKKOS_CLASS_LAMBDA )
#define KOKKOS_CLASS_LAMBDA [=,*this]
#endif
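A short sketch of why the CXX17/CXX20 guard matters: [=,*this] is a C++17 capture that copies the enclosing object into the lambda, so device code never dereferences a host-side this pointer. The functor below is hypothetical:

    struct Scaler {
      double factor = 2.0;
      void apply( int n ) const {
        // KOKKOS_CLASS_LAMBDA expands to [=,*this] (plus __host__ __device__
        // under NVCC), so 'factor' is read from the captured copy of *this.
        Kokkos::parallel_for( n , KOKKOS_CLASS_LAMBDA ( const int i ) {
          (void)( i * factor );
        });
      }
    };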
@ -521,6 +521,9 @@
#if defined ( KOKKOS_ENABLE_CUDA )
#if ( 9000 <= CUDA_VERSION )
#define KOKKOS_IMPL_CUDA_VERSION_9_WORKAROUND
#if ( __CUDA_ARCH__ )
#define KOKKOS_IMPL_CUDA_SYNCWARP_NEEDS_MASK
#endif
#endif
#endif

@ -793,7 +793,7 @@ struct ParallelReduceReturnValue<typename std::enable_if<

static return_type return_value(ReturnType& return_val,
const FunctorType& functor) {
#ifdef KOKOOS_ENABLE_DEPRECATED_CODE
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return return_type(return_val,functor.value_count);
#else
if ( is_array<ReturnType>::value )
@ -1002,7 +1002,8 @@ void parallel_reduce(const std::string& label,
typename Impl::enable_if<
Kokkos::Impl::is_execution_policy<PolicyType>::value
>::type * = 0) {
Impl::ParallelReduceAdaptor<PolicyType,FunctorType,const ReturnType>::execute(label,policy,functor,return_value);
ReturnType return_value_impl = return_value;
Impl::ParallelReduceAdaptor<PolicyType,FunctorType,ReturnType>::execute(label,policy,functor,return_value_impl);
}
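The hunks below add a static_assert demanding either a result argument or a functor with a final() member. A minimal functor satisfying the final-function path might look like this (a sketch, not part of the commit):

    struct SumWithFinal {
      using value_type = double;
      KOKKOS_INLINE_FUNCTION
      void operator()( const int i , double & update ) const { update += i ; }
      // final() receives the fully reduced value; here it is simply consumed.
      KOKKOS_INLINE_FUNCTION
      void final( double & /*update*/ ) const {}
    };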
template< class PolicyType, class FunctorType, class ReturnType >
@ -1054,6 +1055,9 @@ void parallel_reduce(const std::string& label,
, typename ValueTraits::pointer_type
>::type value_type ;

static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,PolicyType,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");

typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1076,6 +1080,9 @@ void parallel_reduce(const PolicyType& policy,
, typename ValueTraits::pointer_type
>::type value_type ;

static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,PolicyType,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");

typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1096,6 +1103,9 @@ void parallel_reduce(const size_t& policy,
, typename ValueTraits::pointer_type
>::type value_type ;

static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,RangePolicy<>,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");

typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged
@ -1117,6 +1127,9 @@ void parallel_reduce(const std::string& label,
, typename ValueTraits::pointer_type
>::type value_type ;

static_assert(Impl::FunctorAnalysis<Impl::FunctorPatternInterface::REDUCE,RangePolicy<>,FunctorType>::
has_final_member_function,"Calling parallel_reduce without either return value or final function.");

typedef Kokkos::View< value_type
, Kokkos::HostSpace
, Kokkos::MemoryUnmanaged

@ -136,6 +136,55 @@ public:
}
}

KOKKOS_INLINE_FUNCTION
void* get_shmem_aligned (const ptrdiff_t size, const ptrdiff_t alignment, int level = -1) const {
if(level == -1)
level = m_default_level;
if(level == 0) {

char* previous = m_iter_L0;
const ptrdiff_t missalign = size_t(m_iter_L0)%alignment;
if(missalign) m_iter_L0 += alignment-missalign;

void* tmp = m_iter_L0 + m_offset * size;
if (m_end_L0 < (m_iter_L0 += size * m_multiplier)) {
m_iter_L0 = previous; // put it back like it was
#ifdef KOKKOS_DEBUG
// mfh 23 Jun 2015: printf call consumes 25 registers
// in a CUDA build, so only print in debug mode. The
// function still returns NULL if not enough memory.
printf ("ScratchMemorySpace<...>::get_shmem: Failed to allocate "
"%ld byte(s); remaining capacity is %ld byte(s)\n", long(size),
long(m_end_L0-m_iter_L0));
#endif // KOKKOS_DEBUG
tmp = 0;
}
return tmp;
} else {

char* previous = m_iter_L1;
const ptrdiff_t missalign = size_t(m_iter_L1)%alignment;
if(missalign) m_iter_L1 += alignment-missalign;

void* tmp = m_iter_L1 + m_offset * size;
if (m_end_L1 < (m_iter_L1 += size * m_multiplier)) {
m_iter_L1 = previous; // put it back like it was
#ifdef KOKKOS_DEBUG
// mfh 23 Jun 2015: printf call consumes 25 registers
// in a CUDA build, so only print in debug mode. The
// function still returns NULL if not enough memory.
printf ("ScratchMemorySpace<...>::get_shmem: Failed to allocate "
"%ld byte(s); remaining capacity is %ld byte(s)\n", long(size),
long(m_end_L1-m_iter_L1));
#endif // KOKKOS_DEBUG
tmp = 0;
}
return tmp;

}
}
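The alignment arithmetic above bumps the scratch cursor to the next multiple of the requested alignment. The same computation as a standalone sketch:

    #include <cstdint>
    #include <cstddef>
    // Advance p to the next multiple of alignment (no-op if already aligned).
    inline char * align_up( char * p , std::ptrdiff_t alignment ) {
      const std::ptrdiff_t misalign =
        static_cast<std::ptrdiff_t>( reinterpret_cast<std::uintptr_t>(p) % alignment );
      return misalign ? p + ( alignment - misalign ) : p;
    }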
template< typename IntType >
KOKKOS_INLINE_FUNCTION
ScratchMemorySpace( void * ptr_L0 , const IntType & size_L0 , void * ptr_L1 = NULL , const IntType & size_L1 = 0)

@ -262,7 +262,7 @@ public:
}

//----------------------------------------

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
static
int team_size_max( const FunctorType & ) { return 1 ; }
@ -274,6 +274,16 @@ public:
template< class FunctorType >
static
int team_size_recommended( const FunctorType & , const int& ) { return 1 ; }
#endif

template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const { return 1 ; }
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const { return 1 ; }
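Callers query these per-pattern limits by passing the new tag types, e.g. (a usage sketch; 'policy' and 'functor' are hypothetical):

    const int max_for    = policy.team_size_max( functor , Kokkos::ParallelForTag() );
    const int rec_reduce = policy.team_size_recommended( functor , Kokkos::ParallelReduceTag() );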
//----------------------------------------

@ -281,6 +291,16 @@ public:
inline int league_size() const { return m_league_size ; }
inline size_t scratch_size(const int& level, int = 0) const { return m_team_scratch_size[level] + m_thread_scratch_size[level]; }

inline static
int vector_length_max()
{ return 1024; } // Use arbitrary large number, is meant as a vectorizable length

inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32:
20*1024*1024);
}
/** \brief Specify league size, request team size */
TeamPolicyInternal( execution_space &
, int league_size_request

@ -624,7 +624,6 @@ public:
when_all( Future< A1 , A2 > const arg[] , int narg )
{
using future_type = Future< execution_space > ;
using task_base = Kokkos::Impl::TaskBase< void , void , void > ;

future_type f ;

@ -692,7 +691,6 @@ public:
{
using input_type = decltype( func(0) );
using future_type = Future< execution_space > ;
using task_base = Kokkos::Impl::TaskBase< void , void , void > ;

static_assert( is_future< input_type >::value
, "Functor must return a Kokkos::Future" );

File diff suppressed because it is too large
@ -16,6 +16,7 @@ endif

CXXFLAGS ?= -O3
LINK ?= $(CXX)
LDFLAGS ?=
CP = cp

include $(KOKKOS_PATH)/Makefile.kokkos
include $(KOKKOS_PATH)/core/src/Makefile.generate_header_lists
@ -50,7 +51,12 @@ ifeq ($(KOKKOS_OS),Linux)

COPY_FLAG = -u
endif
ifeq ($(KOKKOS_OS),Darwin)
COPY_FLAG =
COPY_FLAG =
# If Homebrew coreutils is installed, its cp will have the -u option
ifneq ("$(wildcard /usr/local/opt/coreutils/libexec/gnubin/cp)","")
CP = /usr/local/opt/coreutils/libexec/gnubin/cp
COPY_FLAG = -u
endif
endif

ifeq ($(KOKKOS_DEBUG),"no")
@ -66,36 +72,38 @@ mkdir:
mkdir -p $(PREFIX)/bin
mkdir -p $(PREFIX)/include
mkdir -p $(PREFIX)/lib
mkdir -p $(PREFIX)/lib/pkgconfig
mkdir -p $(PREFIX)/include/impl

copy-cuda: mkdir
mkdir -p $(PREFIX)/include/Cuda
cp $(COPY_FLAG) $(KOKKOS_HEADERS_CUDA) $(PREFIX)/include/Cuda
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_CUDA) $(PREFIX)/include/Cuda

copy-threads: mkdir
mkdir -p $(PREFIX)/include/Threads
cp $(COPY_FLAG) $(KOKKOS_HEADERS_THREADS) $(PREFIX)/include/Threads
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_THREADS) $(PREFIX)/include/Threads

copy-qthreads: mkdir
mkdir -p $(PREFIX)/include/Qthreads
cp $(COPY_FLAG) $(KOKKOS_HEADERS_QTHREADS) $(PREFIX)/include/Qthreads
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_QTHREADS) $(PREFIX)/include/Qthreads

copy-openmp: mkdir
mkdir -p $(PREFIX)/include/OpenMP
cp $(COPY_FLAG) $(KOKKOS_HEADERS_OPENMP) $(PREFIX)/include/OpenMP
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_OPENMP) $(PREFIX)/include/OpenMP

copy-rocm: mkdir
mkdir -p $(PREFIX)/include/ROCm
cp $(COPY_FLAG) $(KOKKOS_HEADERS_ROCM) $(PREFIX)/include/ROCm
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_ROCM) $(PREFIX)/include/ROCm

install: mkdir $(CONDITIONAL_COPIES) build-lib generate_build_settings
cp $(COPY_FLAG) $(NVCC_WRAPPER) $(PREFIX)/bin
cp $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE) $(PREFIX)/include
cp $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE_IMPL) $(PREFIX)/include/impl
cp $(COPY_FLAG) $(KOKKOS_MAKEFILE) $(PREFIX)
cp $(COPY_FLAG) $(KOKKOS_CMAKEFILE) $(PREFIX)
cp $(COPY_FLAG) libkokkos.a $(PREFIX)/lib
cp $(COPY_FLAG) $(KOKKOS_CONFIG_HEADER) $(PREFIX)/include
$(CP) $(COPY_FLAG) $(NVCC_WRAPPER) $(PREFIX)/bin
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE) $(PREFIX)/include
$(CP) $(COPY_FLAG) $(KOKKOS_HEADERS_INCLUDE_IMPL) $(PREFIX)/include/impl
$(CP) $(COPY_FLAG) $(KOKKOS_MAKEFILE) $(PREFIX)
$(CP) $(COPY_FLAG) $(KOKKOS_CMAKEFILE) $(PREFIX)
$(CP) $(COPY_FLAG) $(KOKKOS_PKGCONFIG) $(PREFIX)/lib/pkgconfig
$(CP) $(COPY_FLAG) libkokkos.a $(PREFIX)/lib
$(CP) $(COPY_FLAG) $(KOKKOS_CONFIG_HEADER) $(PREFIX)/include

clean: kokkos-clean
rm -f $(KOKKOS_MAKEFILE) $(KOKKOS_CMAKEFILE)
rm -f $(KOKKOS_MAKEFILE) $(KOKKOS_CMAKEFILE) $(KOKKOS_PKGCONFIG)

@ -5,6 +5,7 @@
# These files are generated by this makefile
KOKKOS_MAKEFILE=Makefile.kokkos
KOKKOS_CMAKEFILE=kokkos_generated_settings.cmake
KOKKOS_PKGCONFIG=kokkos.pc

ifeq ($(KOKKOS_DEBUG),"no")
KOKKOS_DEBUG_CMAKE = OFF
@ -33,11 +34,29 @@ kokkos_append_var = $(call kokkos_appendvar_makefile,$1); $(call kokkos_appendva
kokkos_append_var2 = $(call kokkos_appendvar2_makefile,$1); $(call kokkos_appendvar_cmakefile,$1,$2)
kokkos_append_varval = $(call kokkos_appendval_makefile,$1,$2); $(call kokkos_appendval_cmakefile,$1,$2,$3)

kokkos_fixup_sed_impl = sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= $(KOKKOS_CONFIG_HEADER)|= $(PREFIX)/include/$(KOKKOS_CONFIG_HEADER)|g' $1 \
> $1.tmp && mv -f $1.tmp $1

$(KOKKOS_PKGCONFIG): $(KOKKOS_PATH)/core/src/$(KOKKOS_PKGCONFIG).in
@sed -e 's|@CMAKE_INSTALL_PREFIX@|$(PREFIX)|g' \
-e 's|@KOKKOS_CXXFLAGS@|$(patsubst -I%,,$(KOKKOS_CXXFLAGS))|g' \
-e 's|@KOKKOS_EXTRA_LIBS_LIST@|$(KOKKOS_EXTRA_LIBS)|g' \
-e 's|@KOKKOS_LINK_FLAGS@|$(KOKKOS_LINK_FLAGS)|g' \
$< > $@

kokkos_fixup_sed = $(call kokkos_fixup_sed_impl,$(KOKKOS_MAKEFILE)); $(call kokkos_fixup_sed_impl,$(KOKKOS_CMAKEFILE))

#This function should be used for variables whose values are different in GNU Make versus CMake,
#especially lists which are delimited by commas in one case and semicolons in another
kokkos_append_gmakevar = $(call kokkos_appendvar_makefile,$1); $(call kokkos_append_gmakevar_cmakefile,$1,$2)

generate_build_settings: $(KOKKOS_CONFIG_HEADER)
generate_build_settings: $(KOKKOS_CONFIG_HEADER) $(KOKKOS_PKGCONFIG)
@rm -f $(KOKKOS_MAKEFILE)
@rm -f $(KOKKOS_CMAKEFILE)
@$(call kokkos_append_string, "#Global Settings used to generate this library")
@ -68,7 +87,6 @@ generate_build_settings: $(KOKKOS_CONFIG_HEADER)
@$(call kokkos_append_var,KOKKOS_HEADERS_ROCM,'STRING "Kokkos headers ROCm list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_THREADS,'STRING "Kokkos headers Threads list"')
@$(call kokkos_append_var,KOKKOS_HEADERS_QTHREADS,'STRING "Kokkos headers QThreads list"')
@$(call kokkos_append_var,KOKKOS_SRC,'STRING "Kokkos source list"')
@$(call kokkos_append_string,"")
@$(call kokkos_append_string,"#Variables used in application Makefiles")
@$(call kokkos_append_var,KOKKOS_OS,'STRING ""') # This was not in original cmake gen
@ -94,19 +112,11 @@ generate_build_settings: $(KOKKOS_CONFIG_HEADER)
@$(call kokkos_append_makefile,"#Fake kokkos-clean target")
@$(call kokkos_append_makefile,"kokkos-clean:")
@$(call kokkos_append_makefile,"")
@sed \
-e 's|$(KOKKOS_PATH)/core/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/containers/src|$(PREFIX)/include|g' \
-e 's|$(KOKKOS_PATH)/algorithms/src|$(PREFIX)/include|g' \
-e 's|-L$(PWD)|-L$(PREFIX)/lib|g' \
-e 's|= libkokkos.a|= $(PREFIX)/lib/libkokkos.a|g' \
-e 's|= $(KOKKOS_CONFIG_HEADER)|= $(PREFIX)/include/$(KOKKOS_CONFIG_HEADER)|g' $(KOKKOS_MAKEFILE) \
> $(KOKKOS_MAKEFILE).tmp
@mv -f $(KOKKOS_MAKEFILE).tmp $(KOKKOS_MAKEFILE)
@$(call kokkos_fixup_sed)
@$(call kokkos_append_var,KOKKOS_SRC,'STRING "Kokkos source list"')
@$(call kokkos_setvar_cmakefile,KOKKOS_CXX_FLAGS,$(KOKKOS_CXXFLAGS))
@$(call kokkos_setvar_cmakefile,KOKKOS_CPP_FLAGS,$(KOKKOS_CPPFLAGS))
@$(call kokkos_setvar_cmakefile,KOKKOS_LD_FLAGS,$(KOKKOS_LDFLAGS))
@$(call kokkos_setlist_cmakefile,KOKKOS_LIBS_LIST,$(KOKKOS_LIBS))
@$(call kokkos_setlist_cmakefile,KOKKOS_EXTRA_LIBS_LIST,$(KOKKOS_EXTRA_LIBS))
@$(call kokkos_setvar_cmakefile,KOKKOS_LINK_FLAGS,$(KOKKOS_LINK_FLAGS))
@ -103,8 +103,6 @@ public:
void TaskQueueSpecialization< Kokkos::OpenMP >::execute
( TaskQueue< Kokkos::OpenMP > * const queue )
{
using execution_space = Kokkos::OpenMP ;
using queue_type = TaskQueue< execution_space > ;
using task_root_type = TaskBase< void , void , void > ;
using Member = Impl::HostThreadTeamMember< execution_space > ;

@ -213,8 +211,6 @@ void TaskQueueSpecialization< Kokkos::OpenMP >::
iff_single_thread_recursive_execute
( TaskQueue< Kokkos::OpenMP > * const queue )
{
using execution_space = Kokkos::OpenMP ;
using queue_type = TaskQueue< execution_space > ;
using task_root_type = TaskBase< void , void , void > ;
using Member = Impl::HostThreadTeamMember< execution_space > ;

@ -76,14 +76,11 @@ public:

//----------------------------------------

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & ) {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
@ -92,6 +89,47 @@ public:
inline static
int team_size_recommended( const FunctorType & )
{
return traits::execution_space::thread_pool_size(2);
}

template< class FunctorType >
inline static
int team_size_recommended( const FunctorType &, const int& )
{
return traits::execution_space::thread_pool_size(2);
}
#endif

template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
@ -99,16 +137,17 @@ public:
#endif
}

template< class FunctorType >

inline static
int team_size_recommended( const FunctorType &, const int& )
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}
int vector_length_max()
{ return 1024; } // Use arbitrary large number, is meant as a vectorizable length

inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32: // Roughly L1 size
20*1024*1024); // Limit to keep compatibility with CUDA
}

//----------------------------------------
@ -160,7 +160,8 @@ SharedAllocationRecord( const Kokkos::Experimental::OpenMPTargetSpace & arg_spac
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);

// Set last element zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
//TODO DeepCopy
// DeepCopy

@ -44,8 +44,8 @@
#ifndef GUARD_CORE_KOKKOS_ROCM_CONFIG_HPP
#define GUARD_CORE_KOKKOS_ROCM_CONFIG_HPP

#ifndef KOKKOS_ROCM_HAS_WORKAROUNDS
#define KOKKOS_ROCM_HAS_WORKAROUNDS 1
#ifndef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
#define KOKKOS_IMPL_ROCM_CLANG_WORKAROUND 1
#endif

#endif

@ -55,14 +55,14 @@ namespace Impl {

struct ROCmTraits {
// TODO: determine if needed
enum { WavefrontSize = 64 /* 64 */ };
enum { WorkgroupSize = 64 /* 64 */ };
enum { WavefrontIndexMask = 0x001f /* Mask for warpindex */ };
enum { WavefrontIndexShift = 5 /* WarpSize == 1 << WarpShift */ };
enum { WavefrontSize = 64 /* 64 */ };
enum { WorkgroupSize = 256 /* 256 */ };
enum { WavefrontIndexMask = 0x003f /* Mask for wavefrontindex */ };
enum { WavefrontIndexShift = 6 /* WavefrontSize == 1 << WavefrontShift */ };

enum { SharedMemoryBanks = 32 /* Compute device 2.0 */ };
enum { SharedMemoryCapacity = 0x0C000 /* 48k shared / 16k L1 Cache */ };
enum { SharedMemoryUsage = 0x04000 /* 16k shared / 48k L1 Cache */ };
enum { SharedMemoryBanks = 64 /* GCN */ };
enum { SharedMemoryCapacity = 0x10000 /* 64k shared / 16k L1 Cache */ };
enum { SharedMemoryUsage = 0x04000 /* 64k shared / 16k L1 Cache */ };

enum { UpperBoundExtentCount = 4294967295 /* Hard upper bound */ };
#if 0
@ -84,6 +84,16 @@ size_t rocm_internal_maximum_workgroup_count();
size_t * rocm_internal_scratch_flags( const size_t size );
size_t * rocm_internal_scratch_space( const size_t size );

// This pointer is the start of dynamic shared memory (LDS).
// Dynamic is at the end of LDS and its size must be specified
// in a tile_block specification at kernel launch time.
template< typename T >
KOKKOS_INLINE_FUNCTION
T * kokkos_impl_rocm_shared_memory()
//{ return (T*) hc::get_group_segment_base_pointer() ; }
{ return (T*) hc::get_dynamic_group_segment_base_pointer() ; }

}
} // namespace Kokkos
#define ROCM_SPACE_ATOMIC_MASK 0x1FFFF
@ -249,7 +259,6 @@ struct ROCmParallelLaunch< DriverType
size_t bx = (grid.x > block.x)? block.x : grid.x;
size_t by = (grid.y > block.y)? block.y : grid.y;
size_t bz = (grid.z > block.z)? block.z : grid.z;

hc::parallel_for_each(ext.tile_with_dynamic(bz,by,bx,shmem), [=](const hc::index<3> & idx) [[hc]]

@ -543,20 +543,13 @@ enum { sizeScratchGrain = sizeof(ScratchGrain) };
void rocmMemset( Kokkos::Experimental::ROCm::size_type * ptr , Kokkos::Experimental::ROCm::size_type value , Kokkos::Experimental::ROCm::size_type size)
{
char * mptr = (char * ) ptr;
#if 0
parallel_for_each(hc::extent<1>(size),
/* parallel_for_each(hc::extent<1>(size),
[=, &ptr]
(hc::index<1> idx) __HC__
{
int i = idx[0];
ptr[i] = value;
}).wait();
#else
for (int i= 0; i<size ; i++)
{
mptr[i] = (char) value;
}
#endif
}).wait();*/
}

Kokkos::Experimental::ROCm::size_type *
@ -567,9 +560,9 @@ ROCmInternal::scratch_flags( const Kokkos::Experimental::ROCm::size_type size )

m_scratchFlagsCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;

typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::HostSpace , void > Record ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace , void > Record ;

Record * const r = Record::allocate( Kokkos::HostSpace()
Record * const r = Record::allocate( Kokkos::Experimental::ROCmSpace()
, "InternalScratchFlags"
, ( sizeScratchGrain * m_scratchFlagsCount ) );

@ -590,9 +583,9 @@ ROCmInternal::scratch_space( const Kokkos::Experimental::ROCm::size_type size )

m_scratchSpaceCount = ( size + sizeScratchGrain - 1 ) / sizeScratchGrain ;

typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::HostSpace , void > Record ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace , void > Record ;

Record * const r = Record::allocate( Kokkos::HostSpace()
static Record * const r = Record::allocate( Kokkos::Experimental::ROCmSpace()
, "InternalScratchSpace"
, ( sizeScratchGrain * m_scratchSpaceCount ) );

@ -616,7 +609,7 @@ void ROCmInternal::finalize()
// scratch_lock_array_rocm_space_ptr(false);
// threadid_lock_array_rocm_space_ptr(false);

typedef Kokkos::Impl::SharedAllocationRecord< HostSpace > RecordROCm ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmSpace > RecordROCm ;
typedef Kokkos::Impl::SharedAllocationRecord< Kokkos::Experimental::ROCmHostPinnedSpace > RecordHost ;

RecordROCm::decrement( RecordROCm::get_record( m_scratchFlags ) );

@ -243,6 +243,15 @@ public:
return(max);
}

template< class FunctorType , class PatternTypeTag>
int team_size_max( const FunctorType& functor, PatternTypeTag) {
return 256/vector_length();
}
template< class FunctorType , class PatternTypeTag>
int team_size_recommended( const FunctorType& functor, PatternTypeTag) {
return 128/vector_length();
}

template<class F>
KOKKOS_INLINE_FUNCTION int team_size(const F& f) const { return (m_team_size > 0) ? m_team_size : team_size_recommended(f); }
KOKKOS_INLINE_FUNCTION int team_size() const { return (m_team_size > 0) ? m_team_size : Impl::get_max_tile_thread(); }
@ -261,6 +270,11 @@ public:
return m_thread_scratch_size[level];
}

static int scratch_size_max(int level) {
return level==0 ?
1024*40 : 1024*1204*20;
}

typedef Impl::ROCmTeamMember member_type;
};

@ -487,6 +501,7 @@ public:
#endif
}
m_idx.barrier.wait();
reducer.reference() = buffer[0];
}

/** \brief Intra-team vector reduce
@ -541,19 +556,19 @@ public:
}

template< typename ReducerType >
KOKKOS_INLINE_FUNCTION static
KOKKOS_INLINE_FUNCTION
typename std::enable_if< is_reducer< ReducerType >::value >::type
vector_reduce( ReducerType const & reducer )
vector_reduce( ReducerType const & reducer ) const
{
#ifdef __HCC_ACCELERATOR__
if(blockDim_x == 1) return;
if(m_vector_length == 1) return;

// Intra vector lane shuffle reduction:
typename ReducerType::value_type tmp ( reducer.reference() );

for ( int i = blockDim_x ; ( i >>= 1 ) ; ) {
shfl_down( reducer.reference() , i , blockDim_x );
if ( (int)threadIdx_x < i ) { reducer.join( tmp , reducer.reference() ); }
for ( int i = m_vector_length ; ( i >>= 1 ) ; ) {
reducer.reference() = shfl_down( tmp , i , m_vector_length );
if ( (int)vector_rank() < i ) { reducer.join( tmp , reducer.reference() ); }
}

// Broadcast from root lane to all other lanes.
@ -561,7 +576,7 @@ public:
// because floating point summation is not associative
// and thus different threads could have different results.

shfl( reducer.reference() , 0 , blockDim_x );
reducer.reference() = shfl( tmp , 0 , m_vector_length );
#endif
}

@ -847,7 +862,7 @@ public:

hc::extent< 1 > flat_extent( total_size );

hc::tiled_extent< 1 > team_extent = flat_extent.tile(team_size*vector_length);
hc::tiled_extent< 1 > team_extent = flat_extent.tile(vector_length*team_size);
hc::parallel_for_each( team_extent , [=](hc::tiled_index<1> idx) [[hc]]
{
rocm_invoke<typename Policy::work_tag>(f, typename Policy::member_type(idx, league_size, team_size, shared, shared_size, scratch_size0, scratch, scratch_size1,vector_length));
@ -958,6 +973,176 @@ public:

};
//----------------------------------------------------------------------------

template< class FunctorType , class ReducerType, class... Traits >
class ParallelReduce<
FunctorType , Kokkos::MDRangePolicy< Traits... >, ReducerType, Kokkos::Experimental::ROCm >
{
private:
typedef Kokkos::MDRangePolicy< Traits ... > Policy ;
using RP = Policy;
typedef typename Policy::array_index_type array_index_type;
typedef typename Policy::index_type index_type;
typedef typename Policy::work_tag WorkTag ;
typedef typename Policy::member_type Member ;
typedef typename Policy::launch_bounds LaunchBounds;

typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, FunctorType, ReducerType> ReducerConditional;
typedef typename ReducerConditional::type ReducerTypeFwd;
typedef typename Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value, WorkTag, void>::type WorkTagFwd;

typedef Kokkos::Impl::FunctorValueTraits< ReducerTypeFwd, WorkTagFwd > ValueTraits ;
typedef Kokkos::Impl::FunctorValueInit< ReducerTypeFwd, WorkTagFwd > ValueInit ;
typedef Kokkos::Impl::FunctorValueJoin< ReducerTypeFwd, WorkTagFwd > ValueJoin ;

public:

typedef typename ValueTraits::pointer_type pointer_type ;
typedef typename ValueTraits::value_type value_type ;
typedef typename ValueTraits::reference_type reference_type ;
typedef FunctorType functor_type ;
typedef Kokkos::Experimental::ROCm::size_type size_type ;

// Algorithmic constraints: blockSize is a power of two AND blockDim.y == blockDim.z == 1

const FunctorType m_functor ;
const Policy m_policy ; // used for workrange and nwork
const ReducerType m_reducer ;
const pointer_type m_result_ptr ;
value_type * m_scratch_space ;
size_type * m_scratch_flags ;

typedef typename Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank, Policy, FunctorType, typename Policy::work_tag, reference_type> DeviceIteratePattern;

KOKKOS_INLINE_FUNCTION
void exec_range( reference_type update ) const
{
Kokkos::Impl::Reduce::DeviceIterateTile<Policy::rank,Policy,FunctorType,typename Policy::work_tag, reference_type>(m_policy, m_functor, update).exec_range();
}

KOKKOS_INLINE_FUNCTION
void operator()(void) const
{
run();
}

KOKKOS_INLINE_FUNCTION
void run( ) const
{
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(value_type) >
word_count( (ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) )) / sizeof(value_type) );
// pointer to shared data accounts for the reserved space at the start
value_type * const shared = kokkos_impl_rocm_shared_memory<value_type>()
+ 2*sizeof(uint64_t);

{
reference_type value =
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , shared + threadIdx_y * word_count.value );
// Number of blocks is bounded so that the reduction can be limited to two passes.
// Each thread block is given an approximately equal amount of work to perform.
// Accumulate the values for this block.
// The accumulation ordering does not match the final pass, but is arithmetically equivalent.

this->exec_range( value );
}

// Reduce with final value at blockDim.y - 1 location.
// Problem: non power-of-two blockDim

if ( rocm_single_inter_block_reduce_scan<false,ReducerTypeFwd,WorkTagFwd>(
ReducerConditional::select(m_functor , m_reducer) , blockIdx_x ,
gridDim_x , shared , m_scratch_space , m_scratch_flags ) ) {

// This is the final block with the final result at the final threads' location
value_type * const tshared = shared + ( blockDim_y - 1 ) * word_count.value ;
value_type * const global = m_scratch_space ;

if ( threadIdx_y == 0 ) {
Kokkos::Impl::FunctorFinal< ReducerTypeFwd , WorkTagFwd >::final( ReducerConditional::select(m_functor , m_reducer) , tshared );
// for ( unsigned i = 0 ; i < word_count.value ; i+=blockDim_y ) { global[i] = tshared[i]; }
for ( unsigned i = 0 ; i < word_count.value ; i++ ) { global[i] = tshared[i]; }
}
}
}

// Determine block size constrained by shared memory:
static inline
unsigned local_block_size( const FunctorType & f )
{
unsigned n = ROCmTraits::WavefrontSize * 8 ;
while ( n && ROCmTraits::SharedMemoryCapacity < rocm_single_inter_block_reduce_scan_shmem<false,FunctorType,WorkTag>( f , n ) ) { n >>= 1 ; }
return n ;
}

inline
void execute()
{
const int nwork = m_policy.m_num_tiles;
if ( nwork ) {
int block_size = m_policy.m_prod_tile_dims;
// CONSTRAINT: Algorithm requires block_size >= product of tile dimensions
// Nearest power of two
int exponent_pow_two = std::ceil( std::log2((float)block_size) );
block_size = 1<<(exponent_pow_two);

m_scratch_space = (value_type*)rocm_internal_scratch_space( ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) ) * block_size*nwork /* block_size == max block_count */ );
m_scratch_flags = rocm_internal_scratch_flags( sizeof(size_type) );
const dim3 block( 1 , block_size , 1 );
// Required grid.x <= block.y
const dim3 grid( nwork, block_size , 1 );
const int shmem = rocm_single_inter_block_reduce_scan_shmem<false,FunctorType,WorkTag>( m_functor , block.y );

ROCmParallelLaunch< ParallelReduce, LaunchBounds >( *this, grid, block, shmem ); // copy to device and execute

ROCM::fence();

if ( m_result_ptr ) {
const int size = ValueTraits::value_size( ReducerConditional::select(m_functor , m_reducer) );
DeepCopy<HostSpace,Kokkos::Experimental::ROCmSpace>( m_result_ptr , m_scratch_space , size );
}
}
else {
if (m_result_ptr) {
ValueInit::init( ReducerConditional::select(m_functor , m_reducer) , m_result_ptr );
}
}
}
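The nearest-power-of-two rounding used in execute() above, isolated as a standalone sketch:

    #include <cmath>
    // Round n up to the next power of two, e.g. 5 -> 8, 8 -> 8.
    inline int next_pow2( int n ) {
      const int e = static_cast<int>( std::ceil( std::log2( static_cast<float>(n) ) ) );
      return 1 << e;
    }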
template< class HostViewType >
ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const HostViewType & arg_result
, typename std::enable_if<
Kokkos::is_view< HostViewType >::value
,void*>::type = NULL)
: m_functor( arg_functor )
, m_policy( arg_policy )
, m_reducer( InvalidType() )
, m_result_ptr( arg_result.data() )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
{}

ParallelReduce( const FunctorType & arg_functor
, const Policy & arg_policy
, const ReducerType & reducer)
: m_functor( arg_functor )
, m_policy( arg_policy )
, m_reducer( reducer )
, m_result_ptr( reducer.view().data() )
, m_scratch_space( 0 )
, m_scratch_flags( 0 )
{}

};
//----------------------------------------------------------------------------

template< class FunctorType, class ReducerType, class... Traits >
class ParallelReduce<
FunctorType , Kokkos::TeamPolicy< Traits... >, ReducerType, Kokkos::Experimental::ROCm >
@ -992,8 +1177,14 @@ public:
const int scratch_size0 = policy.scratch_size(0,team_size);
const int scratch_size1 = policy.scratch_size(1,team_size);
const int total_size = league_size * team_size ;

if(total_size == 0) return;

typedef Kokkos::Impl::FunctorValueInit< FunctorType, typename Policy::work_tag > ValueInit ;
if(total_size==0) {
if (result_view.data()) {
ValueInit::init( f , result_view.data() );
}
return;
}

const int reduce_size = ValueTraits::value_size( f );
const int shared_size = FunctorTeamShmemSize< FunctorType >::value( f , team_size );
@ -1042,7 +1233,16 @@ public:
const int vector_length = policy.vector_length();
const int total_size = league_size * team_size;

if(total_size == 0) return;
typedef Kokkos::Impl::FunctorValueInit< ReducerType, typename Policy::work_tag > ValueInit ;
typedef Kokkos::Impl::if_c< std::is_same<InvalidType,ReducerType>::value,
FunctorType, ReducerType> ReducerConditional;
if(total_size==0) {
if (reducer.view().data()) {
ValueInit::init( ReducerConditional::select(f,reducer),
reducer.view().data() );
}
return;
}

const int reduce_size = ValueTraits::value_size( f );
const int shared_size = FunctorTeamShmemSize< FunctorType >::value( f , team_size );
@ -1113,6 +1313,39 @@ public:
//----------------------------------------
};

template< class FunctorType , class ReturnType , class... Traits >
class ParallelScanWithTotal< FunctorType , Kokkos::RangePolicy< Traits... >,
ReturnType, Kokkos::Experimental::ROCm >
{
private:

typedef Kokkos::RangePolicy< Traits... > Policy;
typedef typename Policy::work_tag Tag;
typedef Kokkos::Impl::FunctorValueTraits< FunctorType, Tag> ValueTraits;

public:

//----------------------------------------

inline
ParallelScanWithTotal( const FunctorType & f
, const Policy & policy
, ReturnType & arg_returnvalue)
{
const auto len = policy.end()-policy.begin();

if(len==0) return;

scan_enqueue<Tag,ReturnType>(len, f, arg_returnvalue, [](hc::tiled_index<1> idx, int, int) { return idx.global[0]; });
}

KOKKOS_INLINE_FUNCTION
void execute() const {}

//----------------------------------------
};

template< class FunctorType , class... Traits>
class ParallelScan< FunctorType , Kokkos::TeamPolicy< Traits... >, Kokkos::Experimental::ROCm >
{
@ -1350,22 +1583,17 @@ void parallel_for(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTe
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
typename std::enable_if< ! Kokkos::is_reducer< ValueType >::value >::type
parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
const Lambda & lambda, ValueType& result) {

result = ValueType();
Kokkos::Sum<ValueType> reducer(result);
reducer.init( reducer.reference() );

for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
ValueType tmp = ValueType();
lambda(i,tmp);
result+=tmp;
lambda(i,reducer.reference());
}
result = loop_boundaries.thread.team_reduce(result,
Impl::JoinAdd<ValueType>());
// Impl::rocm_intra_workgroup_reduction( loop_boundaries.thread, result,
// Impl::JoinAdd<ValueType>());
// Impl::rocm_inter_workgroup_reduction( loop_boundaries.thread, result,
// Impl::JoinAdd<ValueType>());
loop_boundaries.thread.team_reduce(reducer);
}
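With the reducer-based rewrite above, the plain-value overload now funnels through Kokkos::Sum. A caller-side sketch (hypothetical names; 'team' is a team member handle inside a team policy kernel):

    double sum = 0;
    // Plain-value form, internally wrapped in Kokkos::Sum:
    Kokkos::parallel_reduce( Kokkos::TeamThreadRange( team , n ) ,
      [=] ( const int i , double & val ) { val += 1.0 ; } , sum );
    // Equivalent explicit-reducer form:
    Kokkos::parallel_reduce( Kokkos::TeamThreadRange( team , n ) ,
      [=] ( const int i , double & val ) { val += 1.0 ; } , Kokkos::Sum<double>(sum) );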
/** \brief Inter-thread thread range parallel_reduce. Executes lambda(iType i, ValueType & val) for each i=0..N-1.
@ -1374,7 +1602,8 @@ void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROC
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
typename std::enable_if< Kokkos::is_reducer< ReducerType >::value >::type
parallel_reduce(const Impl::TeamThreadRangeBoundariesStruct<iType,Impl::ROCmTeamMember>& loop_boundaries,
const Lambda & lambda, ReducerType const & reducer) {
reducer.init( reducer.reference() );

@ -1439,7 +1668,8 @@ void parallel_for(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCm
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ValueType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
typename std::enable_if< !Kokkos::is_reducer< ValueType >::value >::type
parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
loop_boundaries, const Lambda & lambda, ValueType& result) {
result = ValueType();

@ -1477,7 +1707,8 @@ void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::R
* val is performed and put into result. This functionality requires C++11 support.*/
template< typename iType, class Lambda, typename ReducerType >
KOKKOS_INLINE_FUNCTION
void parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
typename std::enable_if< Kokkos::is_reducer< ReducerType >::value >::type
parallel_reduce(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROCmTeamMember >&
loop_boundaries, const Lambda & lambda, ReducerType const & reducer) {
reducer.init( reducer.reference() );

@ -1523,86 +1754,46 @@ void parallel_scan(const Impl::ThreadVectorRangeBoundariesStruct<iType,Impl::ROC
typedef Kokkos::Impl::FunctorValueTraits< FunctorType , void > ValueTraits ;
typedef typename ValueTraits::value_type value_type ;

value_type scan_val = value_type();
#if (__ROCM_ARCH__ >= 800)
// adopt the cuda vector shuffle method
const int VectorLength = loop_boundaries.increment;
int lid = loop_boundaries.thread.lindex();
int vector_rank = lid%VectorLength;
value_type val = value_type();
const int vector_length = loop_boundaries.thread.vector_length();
const int vector_rank = loop_boundaries.thread.vector_rank();

iType loop_bound = ((loop_boundaries.end+VectorLength-1)/VectorLength) * VectorLength;
value_type val ;
for(int _i = vector_rank; _i < loop_bound; _i += VectorLength) {
val = value_type();
if(_i<loop_boundaries.end)
lambda(_i , val , false);
iType end = ((loop_boundaries.end+vector_length-1)/vector_length) * vector_length;
value_type accum = value_type();

value_type tmp = val;
value_type result_i;
for ( int i = vector_rank ; i < end ; i += vector_length ) {

if(vector_rank == 0)
result_i = tmp;
if (VectorLength > 1) {
const value_type tmp2 = shfl_up(tmp, 1,VectorLength);
if(vector_rank > 0)
tmp+=tmp2;
}
if(vector_rank == 1)
result_i = tmp;
if (VectorLength > 3) {
const value_type tmp2 = shfl_up(tmp, 2,VectorLength);
if(vector_rank > 1)
tmp+=tmp2;
}
if ((vector_rank >= 2) &&
(vector_rank < 4))
result_i = tmp;
if (VectorLength > 7) {
const value_type tmp2 = shfl_up(tmp, 4,VectorLength);
if(vector_rank > 3)
tmp+=tmp2;
}
if ((vector_rank >= 4) &&
(vector_rank < 8))
result_i = tmp;
if (VectorLength > 15) {
const value_type tmp2 = shfl_up(tmp, 8,VectorLength);
if(vector_rank > 7)
tmp+=tmp2;
}
if ((vector_rank >= 8) &&
(vector_rank < 16))
result_i = tmp;
if (VectorLength > 31) {
const value_type tmp2 = shfl_up(tmp, 16,VectorLength);
if(vector_rank > 15)
tmp+=tmp2;
}
if ((vector_rank >=16) &&
(vector_rank < 32))
result_i = tmp;
if (VectorLength > 63) {
const value_type tmp2 = shfl_up(tmp, 32,VectorLength);
if(vector_rank > 31)
tmp+=tmp2;
value_type val = 0 ;

// First acquire per-lane contributions:
if ( i < loop_boundaries.end ) lambda( i , val , false );

value_type sval = val ;

// Bottom up inclusive scan in triangular pattern
// where each thread is the root of a reduction tree
// from the zeroth "lane" to itself.
// [t] += [t-1] if t >= 1
// [t] += [t-2] if t >= 2
// [t] += [t-4] if t >= 4
// ...

for ( int j = 1 ; j < vector_length ; j <<= 1 ) {
value_type tmp = 0 ;
tmp = shfl_up(sval , j , vector_length );
if ( j <= vector_rank ) { sval += tmp ; }
}

if (vector_rank >= 32)
result_i = tmp;
// Include accumulation and remove value for exclusive scan:
val = accum + sval - val ;

val = scan_val + result_i - val;
scan_val += shfl(tmp,VectorLength-1,VectorLength);
if(_i<loop_boundaries.end)
lambda(_i , val , true);
// Provide exclusive scan value:
if ( i < loop_boundaries.end ) lambda( i , val , true );

// Accumulate the last value in the inclusive scan:
sval = shfl( sval , vector_length-1 , vector_length);
accum += sval ;
}
#else
// for kaveri, call the LDS based thread_scan routine
for( iType i = loop_boundaries.start; i < loop_boundaries.end; i+=loop_boundaries.increment) {
lambda(i,scan_val,true);
}
scan_val = loop_boundaries.thread.team_scan(scan_val);

#endif
}
|
||||
|
||||
} // namespace Kokkos
|
||||
|
||||
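The loop introduced above collapses the old unrolled shfl_up ladder into one triangular pass: at stride j, each lane adds the value held j lanes below it. A minimal host-side sketch of that pattern, with the shuffle replaced by a read from a snapshot array (illustrative only, not the Kokkos API):

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int vector_length = 8;                 // one "vector" of lanes
  std::vector<int> sval = {3, 1, 4, 1, 5, 9, 2, 6};

  // [t] += [t-j] for j = 1, 2, 4, ...; shfl_up becomes an array read.
  for (int j = 1; j < vector_length; j <<= 1) {
    std::vector<int> prev = sval;              // what shfl_up would deliver
    for (int t = 0; t < vector_length; ++t)
      if (j <= t) sval[t] += prev[t - j];
  }

  for (int t = 0; t < vector_length; ++t)
    std::printf("%d ", sval[t]);               // 3 4 8 9 14 23 25 31
  std::printf("\n");
}
```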
@@ -57,7 +57,6 @@
#include <ROCm/Kokkos_ROCm_Tile.hpp>
#include <ROCm/Kokkos_ROCm_Invoke.hpp>
#include <ROCm/Kokkos_ROCm_Join.hpp>

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

namespace Kokkos {
@@ -75,7 +74,7 @@ T& reduce_value(T* x, std::false_type) [[hc]]
return *x;
}

#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
struct always_true
{
template<class... Ts>
@@ -149,7 +148,7 @@ void reduce_enqueue(
// Store the tile result in the global memory.
if (local == 0)
{
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
// Workaround for assigning from LDS memory: std::copy should work
// directly
buffer.action_at(0, [&](T* x)
@@ -158,7 +157,7 @@ void reduce_enqueue(
// new ROCM 15 address space changes aren't implemented in std algorithms yet
auto * src = reinterpret_cast<char *>(x);
auto * dest = reinterpret_cast<char *>(result.data()+tile*output_length);
for(int i=0; i<sizeof(T);i++) dest[i] = src[i];
for(int i=0; i<sizeof(T)*output_length;i++) dest[i] = src[i];
#else
// Workaround: copy_if used to avoid memmove
std::copy_if(x, x+output_length, result.data()+tile*output_length, always_true{} );
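Both branches above work around assigning out of LDS memory: the clang path copies the result byte by byte through char pointers, while the other path funnels the copy through std::copy_if with a predicate that always returns true, so the library never dispatches to memmove. A host-side sketch of that predicate trick (types and values are placeholders, not the Kokkos ones):

```cpp
#include <algorithm>
#include <cstdio>

// A predicate that accepts everything forces element-by-element copying.
struct always_true_pred {
  template <class... Ts>
  bool operator()(Ts&&...) const { return true; }
};

int main() {
  double src[4] = {1.0, 2.0, 3.0, 4.0};
  double dst[4] = {};
  std::copy_if(src, src + 4, dst, always_true_pred{});
  for (double v : dst) std::printf("%g ", v);  // 1 2 3 4
  std::printf("\n");
}
```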
@@ -169,12 +168,10 @@ void reduce_enqueue(

#endif
}

});
if (output_result != nullptr)
ValueInit::init(ReducerConditional::select(f, reducer), output_result);
fut.wait();

copy(result,result_cpu.data());
if (output_result != nullptr) {
for(std::size_t i=0;i<td.num_tiles;i++)
@@ -62,6 +62,76 @@
namespace Kokkos {
namespace Impl {

//#if __KALMAR_ACCELERATOR__ == 1
KOKKOS_INLINE_FUNCTION
void __syncthreads() [[hc]]
{
amp_barrier(CLK_LOCAL_MEM_FENCE);
}

#define LT0 ((threadIdx_x+threadIdx_y+threadIdx_z)?0:1)

// returns non-zero if and only if the predicate is non-zero for any thread
// note that syncthreads_or uses the first 64 bits of dynamic group memory.
// this reserved memory must be accounted for everywhere
// that get_dynamic_group_segment_base_pointer is called.
KOKKOS_INLINE_FUNCTION
uint64_t __syncthreads_or(uint64_t pred)
{
uint64_t *shared_var = (uint64_t *)hc::get_dynamic_group_segment_base_pointer();
if(LT0) *shared_var = 0;
amp_barrier(CLK_LOCAL_MEM_FENCE);
#if __KALMAR_ACCELERATOR__ == 1
if (pred) hc::atomic_or_uint64(shared_var,1);
#endif
amp_barrier(CLK_LOCAL_MEM_FENCE);
return (*shared_var);
}
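__syncthreads_or follows a clear/accumulate/read protocol around two barriers: thread 0 zeroes the shared word, every thread with a non-zero predicate ORs a bit in, and after the second barrier all threads read the same answer. A sequential host sketch of those three phases, where the loop stands in for the workgroup's threads and the barriers fall between the loops (illustrative only):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int team_size = 4;
  uint64_t pred[team_size] = {0, 0, 1, 0};  // per-thread predicates
  uint64_t shared_var = 0;                  // thread 0 clears it; barrier

  // Accumulate phase: hc::atomic_or_uint64 on the device; then a barrier.
  for (int t = 0; t < team_size; ++t)
    if (pred[t]) shared_var |= 1;

  // Read phase: every thread sees the same result.
  std::printf("any predicate set: %llu\n", (unsigned long long)shared_var);
}
```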
KOKKOS_INLINE_FUNCTION
void __threadfence()
{
amp_barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
}

KOKKOS_INLINE_FUNCTION
void __threadfence_block()
{
amp_barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
}
//#endif
struct ROCm_atomic_CAS {
template<class OP>
KOKKOS_INLINE_FUNCTION
unsigned long operator () (volatile unsigned long * dest, OP &&op){
unsigned long read,compare,val;
compare = *dest;
read = compare;
do {
compare = read;
val = op(compare);
#if __KALMAR_ACCELERATOR__ == 1
hc::atomic_compare_exchange((uint64_t *)dest,&read,val);
#endif
} while (read != compare);
return val;
}
};

template<class OP>
KOKKOS_INLINE_FUNCTION
unsigned long atomic_cas_op (volatile unsigned long * dest, OP &&op) {
ROCm_atomic_CAS cas_op;
return cas_op(dest, std::forward<OP>(op));
}

KOKKOS_INLINE_FUNCTION
unsigned long atomicInc (volatile unsigned long * dest, const unsigned long& val) {
return atomic_cas_op(dest, [=](unsigned long old){return ((old>=val)?0:(old+1));});
}

//----------------------------------------------------------------------------

template< typename T >
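ROCm_atomic_CAS retries a compare-exchange until the word was not changed by another thread between the read and the update, and atomicInc builds a CUDA-style wrap-around increment on top of it (back to 0 once `val` is reached, otherwise old+1). A host sketch of the same retry loop using std::atomic in place of hc::atomic_compare_exchange (illustrative, not the Kokkos code; note it mirrors the version above in returning the new value):

```cpp
#include <atomic>
#include <cstdio>

// Retry op(snapshot) until no concurrent writer invalidated the snapshot.
template <class Op>
unsigned long atomic_cas_op_sketch(std::atomic<unsigned long>& dest, Op&& op) {
  unsigned long expected = dest.load();
  unsigned long desired;
  do {
    desired = op(expected);  // compute the update from the snapshot
    // on failure, compare_exchange_weak refreshes `expected` for the retry
  } while (!dest.compare_exchange_weak(expected, desired));
  return desired;
}

int main() {
  std::atomic<unsigned long> counter{41};
  const unsigned long val = 100;  // wrap threshold, as in atomicInc above
  unsigned long r = atomic_cas_op_sketch(
      counter, [=](unsigned long old) { return old >= val ? 0ul : old + 1; });
  std::printf("%lu\n", r);  // prints 42
}
```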
@@ -375,18 +445,7 @@ bool rocm_inter_block_reduction( ROCmTeamMember& team,
#endif
}
#endif
#if 0

//----------------------------------------------------------------------------
// See section B.17 of ROCm C Programming Guide Version 3.2
// for discussion of
// __launch_bounds__(maxThreadsPerBlock,minBlocksPerMultiprocessor)
// function qualifier which could be used to improve performance.
//----------------------------------------------------------------------------
// Maximize shared memory and minimize L1 cache:
//   rocmFuncSetCacheConfig(MyKernel, rocmFuncCachePreferShared );
// For 2.0 capability: 48 KB shared and 16 KB L1
//----------------------------------------------------------------------------
//----------------------------------------------------------------------------
/*
 *  Algorithmic constraints:
@@ -406,87 +465,105 @@ void rocm_intra_block_reduce_scan( const FunctorType & functor ,
typedef typename ValueTraits::pointer_type pointer_type ;

const unsigned value_count = ValueTraits::value_count( functor );
const unsigned BlockSizeMask = team.team_size() - 1 ;
const unsigned BlockSizeMask = blockDim_y - 1 ;

// Must have power of two thread count

if ( BlockSizeMask & team.team_size() ) { Kokkos::abort("ROCm::rocm_intra_block_scan requires power-of-two blockDim"); }
if ( BlockSizeMask & blockDim_y ) { Kokkos::abort("ROCm::rocm_intra_block_scan requires power-of-two blockDim"); }

#define BLOCK_REDUCE_STEP( R , TD , S ) \
if ( ! ( R & ((1<<(S+1))-1) ) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S)) ); }
if ( ! (( R & ((1<<(S+1))-1) )|(blockDim_y<(1<<(S+1)))) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S)) ); }

#define BLOCK_SCAN_STEP( TD , N , S ) \
if ( N == (1<<S) ) { ValueJoin::join( functor , TD , (TD - (value_count<<S))); }
#define KOKKOS_IMPL_ROCM_SYNCWF __threadfence_block()

const unsigned rtid_intra = team.team_rank() ^ BlockSizeMask ;
const pointer_type tdata_intra = base_data + value_count * team.team_rank() ;
const unsigned rtid_intra = threadIdx_y ^ BlockSizeMask ;
const pointer_type tdata_intra = base_data + value_count * threadIdx_y ;

{ // Intra-workgroup reduction:
{ // Intra-workgroup reduction: min blocksize of 64
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,0)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,1)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,2)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,3)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,4)
KOKKOS_IMPL_ROCM_SYNCWF;
BLOCK_REDUCE_STEP(rtid_intra,tdata_intra,5)
KOKKOS_IMPL_ROCM_SYNCWF;
}

team.team_barrier(); // Wait for all workgroups to reduce
__syncthreads(); // Wait for all workgroups to reduce

{ // Inter-workgroup reduce-scan by a single workgroup to avoid extra synchronizations
const unsigned rtid_inter = ( team.team_rank() ^ BlockSizeMask ) << ROCmTraits::WarpIndexShift ;
if(threadIdx_y < value_count) {
for(int i=blockDim_y-65; i>0; i-= 64)
ValueJoin::join( functor , base_data + (blockDim_y-1)*value_count + threadIdx_y , base_data + i*value_count + threadIdx_y );
}
__syncthreads();
#if 0
const unsigned rtid_inter = ( threadIdx_y ^ BlockSizeMask ) << ROCmTraits::WavefrontIndexShift ;

if ( rtid_inter < blockDim_y ) {

if ( rtid_inter < team.team_size() ) {

const pointer_type tdata_inter = base_data + value_count * ( rtid_inter ^ BlockSizeMask );
//
// remove these comments
// for rocm, we start with a block size of 64, so step 5 is already done.
// The remaining steps are only done if block size is > 64, so we leave them
// in place until we tune blocksize for performance, then remove the ones
// that will never be used.
// if ( (1<<6) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,6) }
// if ( (1<<7) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,7) }
// if ( (1<<8) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,8) }
// if ( (1<<9) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,9) }

if ( (1<<5) < BlockSizeMask ) { BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,5) }
if ( (1<<6) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,6) }
if ( (1<<7) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,7) }
if ( (1<<8) < BlockSizeMask ) { __threadfence_block(); BLOCK_REDUCE_STEP(rtid_inter,tdata_inter,8) }

if ( DoScan ) {

int n = ( rtid_inter &  32 ) ?  32 : (
        ( rtid_inter &  64 ) ?  64 : (
int n = ( rtid_inter &  64 ) ?  64 : (
        ( rtid_inter & 128 ) ? 128 : (
        ( rtid_inter & 256 ) ? 256 : 0 )));
        ( rtid_inter & 256 ) ? 256 : 0 ));

if ( ! ( rtid_inter + n < team.team_size() ) ) n = 0 ;
if ( ! ( rtid_inter + n < blockDim_y ) ) n = 0 ;

__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,8)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,7)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,6)
__threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,5)
// __threadfence_block(); BLOCK_SCAN_STEP(tdata_inter,n,5)
}
}
#endif
}

team.team_barrier(); // Wait for inter-workgroup reduce-scan to complete
__syncthreads(); // Wait for inter-workgroup reduce-scan to complete

if ( DoScan ) {
int n = ( rtid_intra &  1 ) ?  1 : (
        ( rtid_intra &  2 ) ?  2 : (
        ( rtid_intra &  4 ) ?  4 : (
        ( rtid_intra &  8 ) ?  8 : (
        ( rtid_intra & 16 ) ? 16 : 0 ))));
        ( rtid_intra & 16 ) ? 16 : (
        ( rtid_intra & 32 ) ? 32 : 0 )))));

if ( ! ( rtid_intra + n < team.team_size() ) ) n = 0 ;
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
BLOCK_SCAN_STEP(tdata_intra,n,4) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,3) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,2) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,1) team.team_barrier();//__threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,0) team.team_barrier();
#else
BLOCK_SCAN_STEP(tdata_intra,n,4) __threadfence_block();
if ( ! ( rtid_intra + n < blockDim_y ) ) n = 0 ;

// BLOCK_SCAN_STEP(tdata_intra,n,5) __threadfence_block();
// BLOCK_SCAN_STEP(tdata_intra,n,4) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,3) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,2) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,1) __threadfence_block();
BLOCK_SCAN_STEP(tdata_intra,n,0) __threadfence_block();
#endif
}

#undef BLOCK_SCAN_STEP
#undef BLOCK_REDUCE_STEP
#undef KOKKOS_IMPL_ROCM_SYNCWF
}

//----------------------------------------------------------------------------
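The BLOCK_REDUCE_STEP ladder realizes a binary reduction tree over a power-of-two workgroup: at step S, the threads whose bit-reversed rank has the low S+1 bits clear pull in the partial result 2^S slots below them, so the block total lands in the last slot. A host sketch of that indexing with value_count fixed to 1 and join specialized to addition (illustrative; the device code interleaves fences between the steps):

```cpp
#include <cstdio>

int main() {
  const int block_dim = 8;                  // power of two, as asserted above
  int data[block_dim] = {1, 2, 3, 4, 5, 6, 7, 8};

  for (int S = 0; (1 << S) < block_dim; ++S) {
    for (int t = 0; t < block_dim; ++t) {
      const int rtid = t ^ (block_dim - 1);  // reversed rank, as in the code
      if (!(rtid & ((1 << (S + 1)) - 1)))
        data[t] += data[t - (1 << S)];       // join with the slot 2^S below
      // a real workgroup needs a fence between steps (KOKKOS_IMPL_ROCM_SYNCWF)
    }
  }
  std::printf("total = %d\n", data[block_dim - 1]);  // prints total = 36
}
```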
@@ -497,16 +574,18 @@ void rocm_intra_block_reduce_scan( const FunctorType & functor ,
 *
 *  Global reduce result is in the last threads' 'shared_data' location.
 */
using ROCM = Kokkos::Experimental::ROCm ;

template< bool DoScan , class FunctorType , class ArgTag >
KOKKOS_INLINE_FUNCTION
bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
const ROCm::size_type   block_id ,
const ROCm::size_type   block_count ,
ROCm::size_type * const shared_data ,
ROCm::size_type * const global_data ,
ROCm::size_type * const global_flags )
const ROCM::size_type   block_id ,
const ROCM::size_type   block_count ,
typename FunctorValueTraits<FunctorType, ArgTag>::value_type * const shared_data ,
typename FunctorValueTraits<FunctorType, ArgTag>::value_type * const global_data ,
ROCM::size_type * const global_flags )
{
typedef ROCm::size_type size_type ;
typedef ROCM::size_type size_type ;
typedef FunctorValueTraits< FunctorType , ArgTag > ValueTraits ;
typedef FunctorValueJoin< FunctorType , ArgTag > ValueJoin ;
typedef FunctorValueInit< FunctorType , ArgTag > ValueInit ;
@@ -517,16 +596,17 @@ bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
typedef typename ValueTraits::value_type value_type ;

// '__ffs' = position of the least significant bit set to 1.
// 'team.team_size()' is guaranteed to be a power of two so this
// blockDim_y is guaranteed to be a power of two so this
// is the integral shift value that can replace an integral divide.
const unsigned BlockSizeShift = __ffs( team.team_size() ) - 1 ;
const unsigned BlockSizeMask = team.team_size() - 1 ;
// const unsigned long BlockSizeShift = __ffs( blockDim_y ) - 1 ;
const unsigned long BlockSizeShift = __lastbit_u32_u32( blockDim_y ) ;
const unsigned long BlockSizeMask = blockDim_y - 1 ;

// Must have power of two thread count
if ( BlockSizeMask & team.team_size() ) { Kokkos::abort("ROCm::rocm_single_inter_block_reduce_scan requires power-of-two blockDim"); }
if ( BlockSizeMask & blockDim_y ) { Kokkos::abort("ROCm::rocm_single_inter_block_reduce_scan requires power-of-two blockDim"); }

const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(size_type) >
word_count( ValueTraits::value_size( functor ) / sizeof(size_type) );
const integral_nonzero_constant< size_type , ValueTraits::StaticValueSize / sizeof(value_type) >
word_count( ValueTraits::value_size( functor )/ sizeof(value_type) );

// Reduce the accumulation for the entire block.
rocm_intra_block_reduce_scan<false,FunctorType,ArgTag>( functor , pointer_type(shared_data) );
@@ -534,54 +614,47 @@ bool rocm_single_inter_block_reduce_scan( const FunctorType & functor ,
{
// Write accumulation total to global scratch space.
// Accumulation total is the last thread's data.
size_type * const shared = shared_data + word_count.value * BlockSizeMask ;
size_type * const global = global_data + word_count.value * block_id ;

#if (__ROCM_ARCH__ < 500)
for ( size_type i = team.team_rank() ; i < word_count.value ; i += team.team_size() ) { global[i] = shared[i] ; }
#else
for ( size_type i = 0 ; i < word_count.value ; i += 1 ) { global[i] = shared[i] ; }
#endif
value_type * const shared = shared_data +
word_count.value * BlockSizeMask ;
value_type * const global = global_data + word_count.value * block_id ;

for ( int i = int(threadIdx_y) ; i < word_count.value ; i += blockDim_y ) { global[i] = shared[i] ; }
}

// Contributing blocks note that their contribution has been completed via an atomic-increment flag
// If this block is not the last block to contribute to this group then the block is done.
team.team_barrier();

const bool is_last_block =
! team.team_reduce( team.team_rank() ? 0 : ( 1 + atomicInc( global_flags , block_count - 1 ) < block_count ) ,Impl::JoinAdd<ValueType>());

! __syncthreads_or( threadIdx_y ? 0 : ( 1 + atomicInc( global_flags , block_count - 1 ) < block_count ) );
if ( is_last_block ) {

const size_type b = ( long(block_count) * long(team.team_rank()) ) >> BlockSizeShift ;
const size_type e = ( long(block_count) * long( team.team_rank() + 1 ) ) >> BlockSizeShift ;
const size_type b = ( long(block_count) * long(threadIdx_y )) >> BlockSizeShift ;
const size_type e = ( long(block_count) * long(threadIdx_y + 1 ) ) >> BlockSizeShift ;

{
void * const shared_ptr = shared_data + word_count.value * team.team_rank() ;
reference_type shared_value = ValueInit::init( functor , shared_ptr );
value_type * const shared_ptr = shared_data + word_count.value * threadIdx_y ;
ValueInit::init( functor , shared_ptr );

for ( size_type i = b ; i < e ; ++i ) {
ValueJoin::join( functor , shared_ptr , global_data + word_count.value * i );
}
}

rocm_intra_block_reduce_scan<DoScan,FunctorType,ArgTag>( functor , pointer_type(shared_data) );

if ( DoScan ) {
value_type * const shared_value = shared_data + word_count.value * ( threadIdx_y ? threadIdx_y - 1 : blockDim_y );

size_type * const shared_value = shared_data + word_count.value * ( team.team_rank() ? team.team_rank() - 1 : team.team_size() );

if ( ! team.team_rank() ) { ValueInit::init( functor , shared_value ); }
if ( ! threadIdx_y ) { ValueInit::init( functor , shared_value ); }

// Join previous inclusive scan value to each member
for ( size_type i = b ; i < e ; ++i ) {
size_type * const global_value = global_data + word_count.value * i ;
value_type * const global_value = global_data + word_count.value * i ;
ValueJoin::join( functor , shared_value , global_value );
ValueOps ::copy( functor , global_value , shared_value );
}
}
}

return is_last_block ;
}
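The is_last_block test above swaps a team_reduce for __syncthreads_or: thread 0 of every block draws a ticket from global_flags, and only the block whose ticket completes the count proceeds to fold all per-block partials together. A plain C++ sketch of that ticket idea, with fetch_add standing in for the wrap-around atomicInc (names are illustrative):

```cpp
#include <atomic>
#include <cstdio>

std::atomic<unsigned> global_flags{0};

// Returns true only for the last of `block_count` arrivals.
bool last_to_arrive(unsigned block_count) {
  unsigned ticket = global_flags.fetch_add(1);  // 0-based arrival order
  return ticket + 1 == block_count;
}

int main() {
  const unsigned block_count = 4;
  for (unsigned b = 0; b < block_count; ++b)
    std::printf("block %u last=%d\n", b, last_to_arrive(block_count));
}
```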
@@ -592,7 +665,6 @@ unsigned rocm_single_inter_block_reduce_scan_shmem( const FunctorType & functor
{
return ( BlockSize + 2 ) * Impl::FunctorValueTraits< FunctorType , ArgTag >::value_size( functor );
}
#endif

} // namespace Impl
} // namespace Kokkos
@@ -98,7 +98,7 @@ void scan_enqueue(
{
auto j = i + d - 1;
auto k = i + d2 - 1;
// join(k, j);  // no longer needed with ROCm 1.6

ValueJoin::join(f, &buffer[k], &buffer[j]);
}
}
@@ -116,7 +116,7 @@ void scan_enqueue(
auto j = i + d - 1;
auto k = i + d2 - 1;
auto t = buffer[k];
// join(k, j);  // no longer needed with ROCm 1.6

ValueJoin::join(f, &buffer[k], &buffer[j]);
buffer[j] = t;
}
@@ -127,17 +127,13 @@ void scan_enqueue(
}).wait();
copy(result,result_cpu.data());

// std::partial_sum was segfaulting, despite this being CPU code:
// if(td.num_tiles>1)
//   std::partial_sum(result_cpu.data(), result_cpu.data()+(td.num_tiles-1)*sizeof(value_type), result_cpu.data(), make_join_operator<ValueJoin>(f));
// use this implementation instead.
for(int i=1; i<td.num_tiles; i++)
ValueJoin::join(f, &result_cpu[i], &result_cpu[i-1]);

copy(result_cpu.data(),result);
hc::parallel_for_each(hc::extent<1>(len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
size_t launch_len = (((len - 1) / td.tile_size) + 1) * td.tile_size;
hc::parallel_for_each(hc::extent<1>(launch_len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
{
// const auto local = t_idx.local[0];
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];
@@ -145,13 +141,115 @@ void scan_enqueue(
{
auto final_state = scratch[global];

// the join is locking up, at least with 1.6
if (tile != 0) final_state += result[tile-1];
// if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), final_state, true);
}
}).wait();
}

template< class Tag, class ReturnType, class F, class TransformIndex>
void scan_enqueue(
const int len,
const F & f,
ReturnType & return_val,
TransformIndex transform_index)
{
typedef Kokkos::Impl::FunctorValueTraits< F, Tag> ValueTraits;
typedef Kokkos::Impl::FunctorValueInit< F, Tag> ValueInit;
typedef Kokkos::Impl::FunctorValueJoin< F, Tag> ValueJoin;
typedef Kokkos::Impl::FunctorValueOps< F, Tag> ValueOps;

typedef typename ValueTraits::value_type value_type;
typedef typename ValueTraits::pointer_type pointer_type;
typedef typename ValueTraits::reference_type reference_type;

const auto td = get_tile_desc<value_type>(len);
std::vector<value_type> result_cpu(td.num_tiles);
hc::array<value_type> result(td.num_tiles);
hc::array<value_type> scratch(len);
std::vector<ReturnType> total_cpu(1);
hc::array<ReturnType> total(1);

tile_for<value_type>(td, [&,f,len,td](hc::tiled_index<1> t_idx, tile_buffer<value_type> buffer) [[hc]]
{
const auto local = t_idx.local[0];
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];

// Join tile buffer elements
const auto join = [&](std::size_t i, std::size_t j)
{
buffer.action_at(i, j, [&](value_type& x, const value_type& y)
{
ValueJoin::join(f, &x, &y);
});
};

// Copy into tile
buffer.action_at(local, [&](value_type& state)
{
ValueInit::init(f, &state);
if (global < len) rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), state, false);
});
t_idx.barrier.wait();
// Up sweep phase
for(std::size_t d=1;d<buffer.size();d*=2)
{
auto d2 = 2*d;
auto i = local*d2;
if(i<len)
{
auto j = i + d - 1;
auto k = i + d2 - 1;
ValueJoin::join(f, &buffer[k], &buffer[j]);
}
}
t_idx.barrier.wait();

result[tile] = buffer[buffer.size()-1];
buffer[buffer.size()-1] = 0;
// Down sweep phase
for(std::size_t d=buffer.size()/2;d>0;d/=2)
{
auto d2 = 2*d;
auto i = local*d2;
if(i<len)
{
auto j = i + d - 1;
auto k = i + d2 - 1;
auto t = buffer[k];
ValueJoin::join(f, &buffer[k], &buffer[j]);
buffer[j] = t;
}
t_idx.barrier.wait();
}
// Copy tiles into global memory
if (global < len) scratch[global] = buffer[local];
}).wait();
copy(result,result_cpu.data());

for(int i=1; i<td.num_tiles; i++)
ValueJoin::join(f, &result_cpu[i], &result_cpu[i-1]);

copy(result_cpu.data(),result);
size_t launch_len = (((len - 1) / td.tile_size) + 1) * td.tile_size;
hc::parallel_for_each(hc::extent<1>(launch_len).tile(td.tile_size), [&,f,len,td](hc::tiled_index<1> t_idx) [[hc]]
{
const auto global = t_idx.global[0];
const auto tile = t_idx.tile[0];

if (global < len)
{
auto final_state = scratch[global];

if (tile != 0) ValueJoin::join(f, &final_state, &result[tile-1]);
rocm_invoke<Tag>(f, transform_index(t_idx, td.tile_size, td.num_tiles), final_state, true);
if(global==(len-1)) total[0] = final_state;
}
}).wait();
copy(total,total_cpu.data());
return_val = total_cpu[0];
}

} // namespace Impl
} // namespace Kokkos
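The tile kernel above is a classic work-efficient (Blelloch) exclusive scan: the up-sweep builds a sum tree inside the tile buffer, the tile total is exported to result[] and the root slot is zeroed, and the down-sweep rotates partial sums back down the tree. A host sketch over one power-of-two tile, with ValueJoin::join specialized to integer addition (illustrative only, not the kernel itself):

```cpp
#include <cstdio>

int main() {
  const int n = 8;
  int buffer[n] = {3, 1, 7, 0, 4, 1, 6, 3};

  // Up-sweep: build partial sums at stride 2*d.
  for (int d = 1; d < n; d *= 2)
    for (int i = 0; i < n; i += 2 * d)
      buffer[i + 2 * d - 1] += buffer[i + d - 1];

  int total = buffer[n - 1];  // the tile total, exported to result[] above
  buffer[n - 1] = 0;          // identity seeds the exclusive scan

  // Down-sweep: rotate partial sums back down the tree.
  for (int d = n / 2; d > 0; d /= 2)
    for (int i = 0; i < n; i += 2 * d) {
      int t = buffer[i + 2 * d - 1];
      buffer[i + 2 * d - 1] += buffer[i + d - 1];
      buffer[i + d - 1] = t;
    }

  for (int v : buffer) std::printf("%d ", v);  // 0 3 4 11 11 15 16 22
  std::printf("| total %d\n", total);          // total 25
}
```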
@@ -362,6 +362,8 @@ SharedAllocationRecord( const Kokkos::Experimental::ROCmSpace & arg_space
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
header.m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;

// Copy to device memory
Kokkos::Impl::DeepCopy<Kokkos::Experimental::ROCmSpace,HostSpace>( RecordBase::m_alloc_ptr , & header , sizeof(SharedAllocationHeader) );
@@ -399,6 +401,8 @@ SharedAllocationRecord( const Kokkos::Experimental::ROCmHostPinnedSpace & arg_sp
, arg_label.c_str()
, SharedAllocationHeader::maximum_label_length
);
// Set last element zero, in case c_str is too long
RecordBase::m_alloc_ptr->m_label[SharedAllocationHeader::maximum_label_length - 1] = (char) 0;
}

//----------------------------------------------------------------------------
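Both hunks add the same fix: strncpy does not null-terminate when the source string is at least as long as the buffer, so the last byte is forced to zero after the copy. A minimal sketch of the failure mode and the fix (buffer length chosen for illustration):

```cpp
#include <cstdio>
#include <cstring>

int main() {
  const int maximum_label_length = 8;       // stand-in for the header field
  char m_label[maximum_label_length];
  const char* arg_label = "a_rather_long_view_label";

  std::strncpy(m_label, arg_label, maximum_label_length);
  m_label[maximum_label_length - 1] = '\0'; // guard against overlong names

  std::printf("%s\n", m_label);             // prints the truncated "a_rathe"
}
```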
@@ -278,7 +278,7 @@ struct single_action
void action_at(std::size_t i, Action a) [[hc]]
{
auto& value = static_cast<Derived&>(*this)[i];
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
T state = value;
a(state);
value = state;
@@ -347,7 +347,7 @@ struct tile_buffer<T[]>
#if defined (ROCM15)
a(value);
#else
#if KOKKOS_ROCM_HAS_WORKAROUNDS
#ifdef KOKKOS_IMPL_ROCM_CLANG_WORKAROUND
if (m > get_max_tile_array_size()) return;
T state[get_max_tile_array_size()];
// std::copy(value, value+m, state);
@@ -372,7 +372,6 @@ struct tile_buffer<T[]>
#if defined (ROCM15)
a(value);
#else
//#if KOKKOS_ROCM_HAS_WORKAROUNDS
if (m > get_max_tile_array_size()) return;
T state[get_max_tile_array_size()];
// std::copy(value, value+m, state);
@@ -175,6 +175,27 @@ public:
#endif
}

template<class Closure, class ValueType>
KOKKOS_INLINE_FUNCTION
void team_broadcast(Closure const & f, ValueType& value, const int& thread_id) const
{
#if ! defined( KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_HOST )
{ }
#else
// Make sure there is enough scratch space:
typedef typename if_c< sizeof(ValueType) < TEAM_REDUCE_SIZE
, ValueType , void >::type type ;
f( value );
if ( m_team_base ) {
type * const local_value = ((type*) m_team_base[0]->scratch_memory());
if(team_rank() == thread_id) *local_value = value;
memory_fence();
team_barrier();
value = *local_value;
}
#endif
}

template< typename Type >
KOKKOS_INLINE_FUNCTION
typename std::enable_if< !Kokkos::is_reducer< Type >::value , Type>::type
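The new team_broadcast applies the closure to each thread's value, publishes the value held by `thread_id` through the team's scratch memory, and lets every thread read it back after a fence and a barrier. A sequential host sketch of that data movement, where the loop plays the team (illustrative only, not the Kokkos implementation):

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int team_size = 4, thread_id = 2;      // broadcast from rank 2
  std::vector<int> value = {10, 20, 30, 40};   // each rank's private value
  int shared_slot = 0;                         // stands in for scratch_memory()

  auto f = [](int& v) { v *= 2; };             // closure applied before sharing
  for (int rank = 0; rank < team_size; ++rank) {
    f(value[rank]);
    if (rank == thread_id) shared_slot = value[rank];
  }
  // ... memory_fence() + team_barrier() would go here on the device ...
  for (int rank = 0; rank < team_size; ++rank) value[rank] = shared_slot;

  for (int v : value) std::printf("%d ", v);   // 60 60 60 60
  std::printf("\n");
}
```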
@@ -626,39 +647,77 @@ public:

//----------------------------------------

#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
template< class FunctorType >
inline static
int team_size_max( const FunctorType & ) {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}

int pool_size = traits::execution_space::thread_pool_size(1);
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}

template< class FunctorType >
static int team_size_recommended( const FunctorType & )
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}

inline static
int team_size_recommended( const FunctorType & )
{
return traits::execution_space::thread_pool_size(2);
}

template< class FunctorType >
inline static
int team_size_recommended( const FunctorType &, const int& )
{
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
{
return traits::execution_space::thread_pool_size(2);
}
#endif

template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_max( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
int pool_size = traits::execution_space::thread_pool_size(1);
#else
int pool_size = traits::execution_space::impl_thread_pool_size(1);
#endif
int max_host_team_size = Impl::HostThreadTeamData::max_team_members;
return pool_size<max_host_team_size?pool_size:max_host_team_size;
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelForTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}
template<class FunctorType>
int team_size_recommended( const FunctorType&, const ParallelReduceTag& ) const {
#ifdef KOKKOS_ENABLE_DEPRECATED_CODE
return traits::execution_space::thread_pool_size(2);
#else
return traits::execution_space::impl_thread_pool_size(2);
#endif
}

inline static
int vector_length_max()
{ return 1024; } // Use an arbitrarily large number; this is meant as a vectorizable length

inline static
int scratch_size_max(int level)
{ return (level==0?
1024*32:       // Roughly L1 size
20*1024*1024); // Limit to keep compatibility with CUDA
}

//----------------------------------------
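All four tagged queries added above reduce to the same rule: take the thread pool size for the given task level and clamp it by the host backend's structural team limit. A tiny sketch of that clamp with assumed numbers (the real values come from thread_pool_size / impl_thread_pool_size and HostThreadTeamData::max_team_members):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  int pool_size = 16;          // e.g. impl_thread_pool_size(1); value assumed
  int max_host_team_size = 8;  // HostThreadTeamData::max_team_members; assumed
  int team_size_max = std::min(pool_size, max_host_team_size);
  std::printf("team_size_max = %d\n", team_size_max);  // 8
}
```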