
Kokkos Tutorial

Jeff Miles, Christian Trott

Sandia National Laboratories

Online April 21-24, 2020

Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and
Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S.
Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
SAND2019-1055814
Prerequisites for Tutorial Exercises

Knowledge of C++: class constructors, member variables, member functions, member operators, template arguments

Using your own ${HOME}


I Git
I GCC 4.8.4 (or newer) OR Intel 15 (or newer) OR Clang 3.5.2 (or newer)
I CUDA nvcc 9.0 (or newer) AND NVIDIA compute capability 3.0 (or newer)
I git clone https://github.com/kokkos/kokkos
into ${HOME}/Kokkos/kokkos
I git clone https://github.com/kokkos/kokkos-tutorials
into ${HOME}/Kokkos/kokkos-tutorials
Slides are in
${HOME}/Kokkos/kokkos-tutorials/Intro-Full/Slides
Exercises are in
${HOME}/Kokkos/kokkos-tutorials/Intro-Full/Exercises
Exercises’ makefiles look for ${HOME}/Kokkos/kokkos

Online April 21-24, 2020 2/192


Prerequisites for Tutorial Exercises

Online Resources:
I https://github.com/kokkos: Primary Kokkos GitHub Organization
I https://github.com/kokkos/kokkos-tutorials/blob/master/Intro-Full/Slides/KokkosTutorial_ORNL20.pdf: These slides
I https://github.com/kokkos/kokkos/wiki: Wiki including API reference
I https://github.com/kokkos/kokkos-tutorials/issues/28: Instructions to get a cloud instance with a GPU
I https://kokkosteam.slack.com: Slack channel for Kokkos

Online April 21-24, 2020 3/192


Tutorial Objectives

Kokkos’ basic capabilities:


I Simple 1D data parallel computational patterns
I Deciding where code is run and where data is placed
I Managing data access patterns for performance portability
Kokkos’ advanced capabilities:
I Thread safety, thread scalability, and atomic operations
I Hierarchical patterns for maximizing parallelism
Kokkos’ advanced capabilities not covered today:
I Multidimensional data parallelism
I Dynamic directed acyclic graph of tasks pattern
I Numerous plugin points for extensibility

Online April 21-24, 2020 4/192


Tutorial Takeaways

I Kokkos enables Single Source Performance Portable


Codes
I Simple things stay simple - it is not much more complicated
than OpenMP
I Advanced performance optimizing capabilities easier to
use with Kokkos than e.g. CUDA
I Kokkos provides data abstractions critical for performance
portability not available in OpenMP or OpenACC
Controlling data access patterns is key for obtaining
performance

Online April 21-24, 2020 5/192


Operating assumptions (0)

Assume you are here because:


I Want to use all HPC node architectures; including GPUs
I Are familiar with C++
I Want GPU programming to be easier
I Would like portability, as long as it doesn’t hurt performance
Helpful for understanding nuances:
I Are familiar with data parallelism
I Are familiar with OpenMP
I Are familiar with GPU architecture and CUDA

Online April 21-24, 2020 6/192


Operating assumptions (1)

Target machine:

[Figure: schematic of a heterogeneous node: NUMA domains of cores with on-package memory, DRAM, and NVRAM, connected by a network-on-chip; an accelerator with its own on-package memory; and an external interconnect to the external network.]
Online April 21-24, 2020 7/192


Important Point: Performance Portability

Important Point
There’s a difference between portability and
performance portability.

Example: implementations may target particular architectures and


may not be thread scalable.
(e.g., locks on CPU won’t scale to 100,000 threads on GPU)

Online April 21-24, 2020 8/192


Important Point: Performance Portability

Important Point
There’s a difference between portability and
performance portability.

Example: implementations may target particular architectures and


may not be thread scalable.
(e.g., locks on CPU won’t scale to 100,000 threads on GPU)
Goal: write one implementation which:
I compiles and runs on multiple architectures,
I obtains performant memory access patterns across
architectures,
I can leverage architecture-specific features where possible.

Online April 21-24, 2020 8/192


Important Point: Performance Portability

Important Point
There’s a difference between portability and
performance portability.

Example: implementations may target particular architectures and


may not be thread scalable.
(e.g., locks on CPU won’t scale to 100,000 threads on GPU)
Goal: write one implementation which:
I compiles and runs on multiple architectures,
I obtains performant memory access patterns across
architectures,
I can leverage architecture-specific features where possible.

Kokkos: performance portability across manycore architectures.

Online April 21-24, 2020 8/192


Concepts for threaded data
parallelism
Learning objectives:
I Terminology of pattern, policy, and body.
I The data layout problem.

Online April 21-24, 2020 9/192


Concepts: Patterns, Policies, and Bodies

for ( element = 0; element < numElements ; ++ element ) {


total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

Online April 21-24, 2020 10/192


Concepts: Patterns, Policies, and Bodies

(Pattern: the for loop; Policy: the iteration range; Body: the loop body)

for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

Terminology:
I Pattern: structure of the computations
for, reduction, scan, task-graph, ...
I Execution Policy: how computations are executed
static scheduling, dynamic scheduling, thread teams, ...
I Computational Body: code which performs each unit of
work; e.g., the loop body
⇒ The pattern and policy drive the computational body.
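For concreteness, here is a minimal sketch (not from the slides) of how the same three concepts show up in a Kokkos dispatch, assuming the arrays have already been placed in Kokkos Views and using constructs (parallel_for, RangePolicy, KOKKOS_LAMBDA) that are introduced later in this tutorial:

// Pattern: parallel_for
// Policy:  RangePolicy<>(0, numElements)  -- the iteration range
// Body:    the lambda run for each element index
Kokkos::parallel_for("ElementDots",
  Kokkos::RangePolicy<>(0, numElements),
  KOKKOS_LAMBDA(const int64_t element) {
    double total = 0;
    for (int qp = 0; qp < numQPs; ++qp) {
      total += dot(left(element, qp), right(element, qp));
    }
    elementValues(element) = total;
  });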

Online April 21-24, 2020 10/192


Threading “Parallel for”

What if we want to thread the loop?

for ( element = 0; element < numElements ; ++ element ) {


total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

Online April 21-24, 2020 11/192


Threading “Parallel for”

What if we want to thread the loop?


# pragma omp parallel for
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

(Change the execution policy from “serial” to “parallel.”)

Online April 21-24, 2020 11/192


Threading “Parallel for”

What if we want to thread the loop?


# pragma omp parallel for
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

(Change the execution policy from “serial” to “parallel.”)

OpenMP is simple for parallelizing loops on multi-core CPUs,


but what if we then want to do this on other architectures?
Intel PHI and NVIDIA GPU and AMD GPU and ...

Online April 21-24, 2020 11/192


“Parallel for” on a GPU via pragmas
Option 1: OpenMP 4.5
# pragma omp target data map (...)
# pragma omp teams num_teams (...) num_threads (...) private (...)
# pragma omp distribute
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
# pragma omp parallel for
for ( qp = 0; qp < numQPs ; ++ qp )
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
elementValues [ element ] = total ;
}

Online April 21-24, 2020 12/192


“Parallel for” on a GPU via pragmas
Option 1: OpenMP 4.5
# pragma omp target data map (...)
# pragma omp teams num_teams (...) num_threads (...) private (...)
# pragma omp distribute
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
# pragma omp parallel for
for ( qp = 0; qp < numQPs ; ++ qp )
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
elementValues [ element ] = total ;
}

Option 2: OpenACC
# pragma acc parallel copy (...) num_gangs (...) vector_length (...)
# pragma acc loop gang vector
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp )
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
elementValues [ element ] = total ;
}
Online April 21-24, 2020 12/192
Portable, but not performance portable

A standard thread parallel programming model


may give you portable parallel execution
if it is supported on the target architecture.

But what about performance?

Online April 21-24, 2020 13/192


Portable, but not performance portable

A standard thread parallel programming model


may give you portable parallel execution
if it is supported on the target architecture.

But what about performance?

Performance depends upon the computation’s


memory access pattern.

Online April 21-24, 2020 13/192


Problem: memory access pattern
# pragma something , opencl , etc .
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
for ( i = 0; i < vectorSize ; ++ i ) {
total +=
left [ element * numQPs * vectorSize +
qp * vectorSize + i ] *
right [ element * numQPs * vectorSize +
qp * vectorSize + i ];
}
}
elementValues [ element ] = total ;
}

Online April 21-24, 2020 14/192


Problem: memory access pattern
# pragma something , opencl , etc .
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
for ( i = 0; i < vectorSize ; ++ i ) {
total +=
left [ element * numQPs * vectorSize +
qp * vectorSize + i ] *
right [ element * numQPs * vectorSize +
qp * vectorSize + i ];
}
}
elementValues [ element ] = total ;
}
Memory access pattern problem: CPU data layout reduces GPU
performance by more than 10X.

Online April 21-24, 2020 14/192


Problem: memory access pattern
# pragma something , opencl , etc .
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
for ( i = 0; i < vectorSize ; ++ i ) {
total +=
left [ element * numQPs * vectorSize +
qp * vectorSize + i ] *
right [ element * numQPs * vectorSize +
qp * vectorSize + i ];
}
}
elementValues [ element ] = total ;
}
Memory access pattern problem: CPU data layout reduces GPU
performance by more than 10X.
Important Point
For performance the memory access pattern
must depend on the architecture.
Online April 21-24, 2020 14/192
Kokkos overview

How does Kokkos address performance portability?

Kokkos is a productive, portable, performant, shared-memory


programming model.
I is a C++ library, not a new language or language extension.
I supports clear, concise, thread-scalable parallel patterns.
I lets you write algorithms once and run on many architectures
e.g. multi-core CPU, GPUs, Xeon Phi, ...
I minimizes the amount of architecture-specific
implementation details users must know.
I solves the data layout problem by using multi-dimensional
arrays with architecture-dependent layouts

Online April 21-24, 2020 15/192


Data parallel patterns
Learning objectives:
I How computational bodies are passed to the Kokkos runtime.
I How work is mapped to cores.
I The difference between parallel_for and parallel_reduce.
I Start parallelizing a simple example.

Online April 21-24, 2020 16/192


Using Kokkos for data parallel patterns (0)

Data parallel patterns and work


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce (... data ...);
}

Kokkos maps work to cores

Online April 21-24, 2020 17/192


Using Kokkos for data parallel patterns (0)

Data parallel patterns and work


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce (... data ...);
}

Kokkos maps work to cores


I each iteration of a computational body is a unit of work.
I an iteration index identifies a particular unit of work.
I an iteration range identifies a total amount of work.

Online April 21-24, 2020 17/192


Using Kokkos for data parallel patterns (0)

Data parallel patterns and work


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce (... data ...);
}

Kokkos maps work to cores


I each iteration of a computational body is a unit of work.
I an iteration index identifies a particular unit of work.
I an iteration range identifies a total amount of work.

Important concept: Work mapping


You give an iteration range and computational body (kernel)
to Kokkos, Kokkos maps iteration indices to cores and then
runs the computational body on those cores.

Online April 21-24, 2020 17/192


Using Kokkos for data parallel patterns (2)

How are computational bodies given to Kokkos?

Online April 21-24, 2020 18/192


Using Kokkos for data parallel patterns (2)

How are computational bodies given to Kokkos?


As functors or function objects, a common pattern in C++.

Online April 21-24, 2020 18/192


Using Kokkos for data parallel patterns (2)

How are computational bodies given to Kokkos?


As functors or function objects, a common pattern in C++.

Quick review, a functor is a function with data. Example:


struct ParallelFunctor {
...
void operator ()( a work assignment ) const {
/* ... computational body ... */
}
...
};

Online April 21-24, 2020 18/192


Using Kokkos for data parallel patterns (3)

How is work assigned to functor operators?

Online April 21-24, 2020 19/192


Using Kokkos for data parallel patterns (3)

How is work assigned to functor operators?


A total amount of work items is given to a Kokkos pattern,
ParallelFunctor functor ;
Kokkos :: parallel_for ( numberOfIterations , functor );

Online April 21-24, 2020 19/192


Using Kokkos for data parallel patterns (3)

How is work assigned to functor operators?


A total amount of work items is given to a Kokkos pattern,
ParallelFunctor functor ;
Kokkos :: parallel_for ( numberOfIterations , functor );

and work items are assigned to functors one-by-one:


struct Functor {
void operator ()( const int64_t index ) const {...}
}

Online April 21-24, 2020 19/192


Using Kokkos for data parallel patterns (3)

How is work assigned to functor operators?


A total amount of work items is given to a Kokkos pattern,
ParallelFunctor functor ;
Kokkos :: parallel_for ( numberOfIterations , functor );

and work items are assigned to functors one-by-one:


struct Functor {
void operator ()( const int64_t index ) const {...}
}

Warning: concurrency and order


Concurrency and ordering of parallel iterations is not guaranteed
by the Kokkos runtime.

Online April 21-24, 2020 19/192


Using Kokkos for data parallel patterns (4)

How is data passed to computational bodies?


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( ... data ... );
}

struct AtomForceFunctor {
...
void operator ()( const int64_t atomIndex ) const {
atomForces [ atomIndex ] = calculateForce ( ... data ... );
}
...
}

Online April 21-24, 2020 20/192


Using Kokkos for data parallel patterns (4)

How is data passed to computational bodies?


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( ... data ... );
}

struct AtomForceFunctor {
...
void operator ()( const int64_t atomIndex ) const {
atomForces [ atomIndex ] = calculateForce ( ... data ... );
}
...
}

How does the body access the data?


Important concept
A parallel functor body must have access to all the data it needs
through the functor’s data members.

Online April 21-24, 2020 20/192


Using Kokkos for data parallel patterns (5)
Putting it all together: the complete functor:
struct AtomForceFunctor {
ForceType _atomForces ;
AtomDataType _atomData ;
AtomForceFunctor ( /* args */ ) {...}
void operator ()( const int64_t atomIndex ) const {
_atomForces [ atomIndex ] = calculateForce ( _atomData );
}
};

Online April 21-24, 2020 21/192


Using Kokkos for data parallel patterns (5)
Putting it all together: the complete functor:
struct AtomForceFunctor {
ForceType _atomForces ;
AtomDataType _atomData ;
AtomForceFunctor ( /* args */ ) {...}
void operator ()( const int64_t atomIndex ) const {
_atomForces [ atomIndex ] = calculateForce ( _atomData );
}
};

Q/ How would we reproduce serial execution with this functor?


Serial

for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ){


atomForces [ atomIndex ] = calculateForce ( data );
}

Online April 21-24, 2020 21/192


Using Kokkos for data parallel patterns (5)
Putting it all together: the complete functor:
struct AtomForceFunctor {
ForceType _atomForces ;
AtomDataType _atomData ;
AtomForceFunctor ( /* args */ ) {...}
void operator ()( const int64_t atomIndex ) const {
_atomForces [ atomIndex ] = calculateForce ( _atomData );
}
};

Q/ How would we reproduce serial execution with this functor?


Serial

for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ){


atomForces [ atomIndex ] = calculateForce ( data );
}
Functor

AtomForceFunctor functor ( atomForces , data );


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ){
functor ( atomIndex );
}

Online April 21-24, 2020 21/192


Using Kokkos for data parallel patterns (6)

The complete picture (using functors):


1. Defining the functor (operator+data):
struct AtomForceFunctor {
ForceType _atomForces ;
AtomDataType _atomData ;

AtomForceFunctor ( ForceType atomForces , AtomDataType data ) :
_atomForces ( atomForces ) , _atomData ( data ) {}

void operator ()( const int64_t atomIndex ) const {
_atomForces [ atomIndex ] = calculateForce ( _atomData );
}
};

2. Executing in parallel with Kokkos pattern:

AtomForceFunctor functor ( atomForces , data );
Kokkos :: parallel_for ( numberOfAtoms , functor );

Online April 21-24, 2020 22/192


Using Kokkos for data parallel patterns (7)

Functors are tedious ⇒ C++11 Lambdas are concise


atomForces already exists
data already exists
Kokkos :: parallel_for ( numberOfAtoms ,
[=] ( const int64_t atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( data );
}
);

Online April 21-24, 2020 23/192


Using Kokkos for data parallel patterns (7)

Functors are tedious ⇒ C++11 Lambdas are concise


atomForces already exists
data already exists
Kokkos :: parallel_for ( numberOfAtoms ,
[=] ( const int64_t atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( data );
}
);

A lambda is not magic, it is the compiler auto-generating a


functor for you.

Online April 21-24, 2020 23/192


Using Kokkos for data parallel patterns (7)

Functors are tedious ⇒ C++11 Lambdas are concise


atomForces already exists
data already exists
Kokkos :: parallel_for ( numberOfAtoms ,
[=] ( const int64_t atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( data );
}
);

A lambda is not magic, it is the compiler auto-generating a


functor for you.

Warning: Lambda capture and C++ containers


For portability to GPU a lambda must capture by value [=].
Don’t capture containers (e.g., std::vector) by value because it will
copy the container’s entire contents.

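A minimal sketch (not from the slides) of why this matters: a Kokkos View captured by value copies only a small handle (pointer plus metadata), while a std::vector captured by value copies every element, and its data would not be usable from device code anyway. The names below are illustrative.

Kokkos::View<double*> forces("forces", numberOfAtoms);  // cheap to capture: shallow copy
std::vector<double> hostData(numberOfAtoms);            // expensive to capture: deep copy, host-only

Kokkos::parallel_for(numberOfAtoms, [=] (const int64_t i) {
  forces(i) = 2.0 * i;      // fine: the captured View aliases the original allocation
  // hostData[i] = ...;     // avoid: [=] copies the whole vector into the closure
});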
Online April 21-24, 2020 23/192


parallel_for examples

How does this compare to OpenMP?


Serial

for ( int64_t i = 0; i < N ; ++ i ) {


/* loop body */
}
OpenMP

# pragma omp parallel for
for ( int64_t i = 0; i < N ; ++ i ) {
/* loop body */
}

Kokkos

parallel_for ( N , [=] ( const int64_t i ) {
/* loop body */
});

Important concept
Simple Kokkos usage is no more conceptually difficult than
OpenMP; the annotations just go in different places.
Online April 21-24, 2020 24/192
Scalar integration (0)

Riemann-sum-style numerical integration:


y = ∫_{lower}^{upper} function(x) dx

(Riemann sum figure: Wikipedia)

Online April 21-24, 2020 25/192


Scalar integration (0)

Riemann-sum-style numerical integration:


y = ∫_{lower}^{upper} function(x) dx

(Riemann sum figure: Wikipedia)

double totalIntegral = 0;
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
const double x =
lower + ( i / numberOfIntervals ) * ( upper - lower );
const double thisIntervalsContribution = function ( x );
totalIntegral += thisIntervalsContribution ;
}
totalIntegral *= dx ;

Online April 21-24, 2020 25/192


Scalar integration (0)

Riemann-sum-style numerical integration:


y = ∫_{lower}^{upper} function(x) dx

(Riemann sum figure: Wikipedia)

double totalIntegral = 0;
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
const double x =
lower + ( i / numberOfIntervals ) * ( upper - lower );
const double thisIntervalsContribution = function ( x );
totalIntegral += thisIntervalsContribution ;
}
totalIntegral *= dx ;

How do we parallelize it? Correctly?

Online April 21-24, 2020 25/192


Scalar integration (0)

Riemann-sum-style numerical integration:


y = ∫_{lower}^{upper} function(x) dx

(Riemann sum figure: Wikipedia)

Pattern? Policy? Body?

double totalIntegral = 0;
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
const double x =
lower + ( i / numberOfIntervals ) * ( upper - lower );
const double thisIntervalsContribution = function ( x );
totalIntegral += thisIntervalsContribution ;
}
totalIntegral *= dx ;

How do we parallelize it? Correctly?

Online April 21-24, 2020 25/192


Scalar integration (1)

An (incorrect) attempt:
double totalIntegral = 0;
Kokkos :: parallel_for ( numberOfIntervals ,
[=] ( const int64_t index ) {
const double x =
lower + ( index / numberOfIntervals ) * ( upper - lower );
totalIntegral += function ( x );
});
totalIntegral *= dx ;

First problem: compiler error; cannot increment totalIntegral


(lambdas capture by value and are treated as const!)

Online April 21-24, 2020 26/192


Scalar integration (2)
An (incorrect) solution to the (incorrect) attempt:
double totalIntegral = 0;
double * totalIntegralPointer = & totalIntegral ;
Kokkos :: parallel_for ( numberOfIntervals ,
[=] ( const int64_t index ) {
const double x =
lower + ( index / numberOfIntervals ) * ( upper - lower );
* totalIntegralPointer += function ( x );
});
totalIntegral *= dx ;

Online April 21-24, 2020 27/192


Scalar integration (2)
An (incorrect) solution to the (incorrect) attempt:
double totalIntegral = 0;
double * totalIntegralPointer = & totalIntegral ;
Kokkos :: parallel_for ( numberOfIntervals ,
[=] ( const int64_t index ) {
const double x =
lower + ( index / numberOfIntervals ) * ( upper - lower );
* totalIntegralPointer += function ( x );
});
totalIntegral *= dx ;

Second problem: race condition


step   thread 0    thread 1
0      load
1      increment   load
2      write       increment
3                  write
Online April 21-24, 2020 27/192
Scalar integration (3)

Root problem: we’re using the wrong pattern, for instead of


reduction

Online April 21-24, 2020 28/192


Scalar integration (3)

Root problem: we’re using the wrong pattern, for instead of


reduction

Important concept: Reduction


Reductions combine the results contributed by parallel work.

Online April 21-24, 2020 28/192


Scalar integration (3)

Root problem: we’re using the wrong pattern, for instead of


reduction

Important concept: Reduction


Reductions combine the results contributed by parallel work.

How would we do this with OpenMP?


double finalReducedValue = 0;
# pragma omp parallel for reduction ( +: finalReducedValue )
for ( int64_t i = 0; i < N ; ++ i ) {
finalReducedValue += ...
}

Online April 21-24, 2020 28/192


Scalar integration (3)

Root problem: we’re using the wrong pattern, for instead of


reduction

Important concept: Reduction


Reductions combine the results contributed by parallel work.

How would we do this with OpenMP?


double finalReducedValue = 0;
# pragma omp parallel for reduction ( +: finalReducedValue )
for ( int64_t i = 0; i < N ; ++ i ) {
finalReducedValue += ...
}

How will we do this with Kokkos?


double finalReducedValue = 0;
parallel_reduce ( N , functor , finalReducedValue );

Online April 21-24, 2020 28/192


Scalar integration (4)

Example: Scalar integration


OpenMP

double totalIntegral = 0;
# pragma omp parallel for reduction ( +: totalIntegral )
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
totalIntegral += function (...);
}

Kokkos

double totalIntegral = 0;
parallel_reduce ( numberOfIntervals ,
[=] ( const int64_t i , double & valueToUpdate ) {
valueToUpdate += function (...);
},
totalIntegral );

I The operator takes two arguments: a work index and a value


to update.
I The second argument is a thread-private value that is
managed by Kokkos; it is not the final reduced value.
Online April 21-24, 2020 29/192
Scalar integration (5)

Warning: Parallelism is NOT free


Dispatching (launching) parallel work has non-negligible cost.

Online April 21-24, 2020 30/192


Scalar integration (5)

Warning: Parallelism is NOT free


Dispatching (launching) parallel work has non-negligible cost.
Simplistic data-parallel performance model: Time = α + β∗N/P
I α = dispatch overhead
I β = time for a unit of work
I N = number of units of work
I P = available concurrency

Online April 21-24, 2020 30/192


Scalar integration (5)

Warning: Parallelism is NOT free


Dispatching (launching) parallel work has non-negligible cost.
Simplistic data-parallel performance model: Time = α + β∗N/P
I α = dispatch overhead
I β = time for a unit of work
I N = number of units of work
I P = available concurrency
 
Speedup = P ÷ (1 + α∗P/(β∗N))
I Should have α∗P ≪ β∗N
I All runtimes strive to minimize launch overhead α
I Find more parallelism to increase N
I Merge (fuse) parallel operations to increase β
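To make the model concrete, an illustrative plug-in of numbers (not from the slides): with dispatch overhead α = 10 µs, β = 100 ns per work item, and P = 1000 available threads, α∗P/(β∗N) = 1 when N = α∗P/β = 100,000, so at 10^5 work items the speedup is only P/2; it approaches P only once N is much larger than that.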
Online April 21-24, 2020 30/192
Scalar integration (6)
 
Results: illustrates the simple speedup model, Speedup = P ÷ (1 + α∗P/(β∗N))

[Figure: Kokkos speedup over serial for scalar integration (log scale): speedup over serial [-] vs. number of intervals [-], for Kokkos Cuda Pascal60, Kokkos OpenMP HSW, Kokkos OpenMP KNL, Native OpenMP KNL, and unity.]

Online April 21-24, 2020 31/192


Naming your kernels
Always name your kernels!
Giving unique names to each kernel is immensely helpful for
debugging and profiling. You will regret it if you don’t!

I Non-nested parallel patterns can take an optional string


argument.
I The label doesn’t need to be unique, but it is helpful.
I Anything convertible to ”const std::string”
I Used by profiling and debugging tools (see Profiling Tutorial)
Example:
double totalIntegral = 0;
p aral lel _r ed uc e ( " Reduction " , n u m b e r O f I n t e r v a l s ,
[=] ( const int64_t i , double & valueToUpdate ) {
valueToUpdate += function (...);
},
totalIntegral );

Online April 21-24, 2020 32/192


Recurring Exercise: Inner Product

Exercise: Inner product < y , A ∗ x >

Details:
I y is Nx1, A is NxM, x is Mx1
I We’ll use this exercise throughout the tutorial

Online April 21-24, 2020 33/192


Exercise #1: include, initialize, finalize Kokkos

The first step in using Kokkos is to include, initialize, and finalize:


# include < Kokkos_Core . hpp >
int main ( int argc , char ** argv ) {
/* ... do any necessary setup ( e . g . , initialize MPI ) ... */
Kokkos :: initialize ( argc , argv );
{
/* ... do computations ... */
}
Kokkos :: finalize ();
return 0;
}

(Optional) Command-line arguments:


--kokkos-threads=INT   total number of threads (or threads within NUMA region)
--kokkos-numa=INT      number of NUMA regions
--kokkos-device=INT    device (GPU) ID to use

Online April 21-24, 2020 34/192


Exercise #1: Inner Product, Flat Parallelism on the CPU

Exercise: Inner product < y , A ∗ x >

Details:
I Location: Intro-Full/Exercises/01/Begin/
I Look for comments labeled with “EXERCISE”
I Need to include, initialize, and finalize Kokkos library
I Parallelize loops with parallel_for or parallel_reduce
I Use lambdas instead of functors for computational bodies.
I For now, this will only use the CPU.
Online April 21-24, 2020 35/192
Exercise #1: logistics
Compiling for CPU
# gcc using OpenMP ( default ) and Serial back - ends ,
# ( optional ) change non - default arch with KOKKOS_ARCH
make -j KOKKOS_DEVICES=OpenMP,Serial KOKKOS_ARCH=...

Running on CPU with OpenMP back-end


# Set OpenMP affinity
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread OMP_PLACES=threads
# Print example command line options :
./01_Exercise.host -h
# Run with defaults on CPU
./01_Exercise.host
# Run larger problem
./01_Exercise.host -S 26

Things to try:
I Vary problem size with command-line arg -S s
I Vary number of rows with command-line arg -N n
I Num rows = 2^n, num cols = 2^m, total size = 2^s == 2^(n+m)
Online April 21-24, 2020 36/192
Exercise #1 results
[Figure: <y,Ax> Exercise 01, fixed size: bandwidth (GB/s) vs. number of rows (N) for HSW, KNL, and KNL (HBM).]
Online April 21-24, 2020 37/192
Basic capabilities we haven’t covered

I Customizing parallel_reduce data type and reduction operator
e.g., minimum, maximum, ...
I parallel_scan pattern for exclusive and inclusive prefix sum (sketched below)
I Using tag dispatch interface to allow non-trivial functors to
have multiple “operator()” functions.
very useful in large, complex applications

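As a taste of the parallel_scan pattern listed above, a minimal sketch (not an exercise in this tutorial; it uses Kokkos::View, introduced in the next section). The lambda receives the index, a running partial sum, and a flag that is true on the final pass, when results may be written:

Kokkos::View<int64_t*> data("data", N), prefix("prefix", N);
// Exclusive prefix sum: prefix(i) = data(0) + ... + data(i-1)
Kokkos::parallel_scan("PrefixSum", N,
  KOKKOS_LAMBDA(const int64_t i, int64_t& update, const bool final) {
    if (final) prefix(i) = update;   // write before accumulating -> exclusive scan
    update += data(i);
  });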
Online April 21-24, 2020 38/192


Section Summary

I Simple usage is similar to OpenMP, advanced features are


also straightforward
I Three common data-parallel patterns are parallel_for,
parallel_reduce, and parallel_scan.
I A parallel computation is characterized by its pattern, policy,
and body.
I User provides computational bodies as functors or lambdas
which handle a single work item.

Online April 21-24, 2020 39/192


Views
Learning objectives:
I Motivation behind the View abstraction.
I Key View concepts and template parameters.
I The View life cycle.

Online April 21-24, 2020 40/192


View motivation

Example: running daxpy on the GPU:


Lambda

double * x = new double [ N ]; // also y


parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {
y [ i ] = a * x [ i ] + y [ i ];
});

Functor

struct Functor {
double * _x , * _y , _a ;
void operator ()( const int64_t i ) {
_y [ i ] = _a * _x [ i ] + _y [ i ];
}
};

Online April 21-24, 2020 41/192


View motivation

Example: running daxpy on the GPU:


Lambda

double * x = new double [ N ]; // also y


parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {
y [ i ] = a * x [ i ] + y [ i ];
});

Functor

struct Functor {
double * _x , * _y , _a ;
void operator ()( const int64_t i ) {
_y [ i ] = _a * _x [ i ] + _y [ i ];
}
};

Problem: x and y reside in CPU memory.

Online April 21-24, 2020 41/192


View motivation

Example: running daxpy on the GPU:


Lambda

double * x = new double [ N ]; // also y


parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {
y [ i ] = a * x [ i ] + y [ i ];
});

Functor

struct Functor {
double * _x , * _y , _a ;
void operator ()( const int64_t i ) {
_y [ i ] = _a * _x [ i ] + _y [ i ];
}
};

Problem: x and y reside in CPU memory.


Solution: We need a way of storing data (multidimensional arrays)
which can be communicated to an accelerator (GPU).
⇒ Views
Online April 21-24, 2020 41/192
Views (0)

View abstraction
I A lightweight C++ class with a pointer to array data and a
little meta-data,
I that is templated on the data type (and other things).

High-level example of Views for daxpy using lambda:


View < double * , ... > x (...) , y (...);
... populate x , y ...

parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {


// Views x and y are captured by value ( copy )
y ( i ) = a * x ( i ) + y ( i );
});

Online April 21-24, 2020 42/192


Views (0)

View abstraction
I A lightweight C++ class with a pointer to array data and a
little meta-data,
I that is templated on the data type (and other things).

High-level example of Views for daxpy using lambda:


View < double * , ... > x (...) , y (...);
... populate x , y ...

parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {


// Views x and y are captured by value ( copy )
y ( i ) = a * x ( i ) + y ( i );
});

Important point
Views are like pointers, so copy them in your functors.

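A sketch of what that looks like for the daxpy example from the View-motivation slides (illustrative, not from the slides): the functor stores the Views by value, which copies only their small handles, and both copies refer to the same array data. (KOKKOS_INLINE_FUNCTION is explained in the Execution Spaces section.)

struct DaxpyFunctor {
  Kokkos::View<double*> _x, _y;   // held by value: shallow, pointer-like
  double _a;
  DaxpyFunctor(Kokkos::View<double*> x, Kokkos::View<double*> y, double a)
    : _x(x), _y(y), _a(a) {}
  KOKKOS_INLINE_FUNCTION
  void operator()(const int64_t i) const { _y(i) = _a * _x(i) + _y(i); }
};

Kokkos::parallel_for("DAXPY", N, DaxpyFunctor(x, y, a));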
Online April 21-24, 2020 42/192


Views (1)
View overview:
I Multi-dimensional array of 0 or more dimensions
scalar (0), vector (1), matrix (2), etc.
I Number of dimensions (rank) is fixed at compile-time.
I Arrays are rectangular, not ragged.
I Sizes of dimensions set at compile-time or runtime.
e.g., 2x20, 50x50, etc.
I Access elements via ”(...)” operator.

Online April 21-24, 2020 43/192


Views (1)
View overview:
I Multi-dimensional array of 0 or more dimensions
scalar (0), vector (1), matrix (2), etc.
I Number of dimensions (rank) is fixed at compile-time.
I Arrays are rectangular, not ragged.
I Sizes of dimensions set at compile-time or runtime.
e.g., 2x20, 50x50, etc.
I Access elements via ”(...)” operator.
Example:
View < double *** > data ( " label " , N0 , N1 , N2 ); //3 run, 0 compile
View < double **[ N2 ] > data ( " label " , N0 , N1 ); //2 run, 1 compile
View < double *[ N1 ][ N2 ] > data ( " label " , N0 ); //1 run, 2 compile
View < double [ N0 ][ N1 ][ N2 ] > data ( " label " ); //0 run, 3 compile
// Access
data (i ,j , k ) = 5.3;
Note: runtime-sized dimensions must come first.
Online April 21-24, 2020 43/192
Views (2)

View life cycle:


I Allocations only happen when explicitly specified.
i.e., there are no hidden allocations.
I Copy construction and assignment are shallow (like pointers).
so, you pass Views by value, not by reference
I Reference counting is used for automatic deallocation.
I They behave like std::shared_ptr

Online April 21-24, 2020 44/192


Views (2)

View life cycle:


I Allocations only happen when explicitly specified.
i.e., there are no hidden allocations.
I Copy construction and assignment are shallow (like pointers).
so, you pass Views by value, not by reference
I Reference counting is used for automatic deallocation.
I They behave like std::shared_ptr
Example:
View < double *[5] > a ( " a " , N0 ) , b ( " b " , N0 );
a = b;
View < double ** > c ( b );
a (0 ,2) = 1;
b (0 ,2) = 2; What gets printed?
c (0 ,2) = 3;
print a (0 ,2)

Online April 21-24, 2020 44/192


Views (2)

View life cycle:


I Allocations only happen when explicitly specified.
i.e., there are no hidden allocations.
I Copy construction and assignment are shallow (like pointers).
so, you pass Views by value, not by reference
I Reference counting is used for automatic deallocation.
I They behave like std::shared_ptr
Example:
View < double *[5] > a ( " a " , N0 ) , b ( " b " , N0 );
a = b;
View < double ** > c ( b );
a (0 ,2) = 1;
b (0 ,2) = 2; What gets printed?
c (0 ,2) = 3;
print a (0 ,2) // prints 3.0: after the shallow assignment and copy construction, a, b, and c all reference the same allocation

Online April 21-24, 2020 44/192


Views (3)

View Properties:
I Accessing a View’s sizes is done via its extent(dim) function.
Static extents can additionally be accessed via
static extent(dim).
I You can retrieve a raw pointer via its data() function.
I The label can be accessed via label().
Example:
View < double *[5] > a ( " A " , N0 );
assert ( a . extent (0)== N0 );
assert ( a . extent (1)==5);
static_assert ( a . static_extent (1)==5);
assert ( a . data ()!= nullptr );
assert ( std :: string ( " A " . compare ( a . label ())==0);

Online April 21-24, 2020 45/192


Exercise #2: Inner Product, Flat Parallelism on the CPU, with Views

I Location: Intro-Full/Exercises/02/Begin/
I Assignment: Change data storage from arrays to Views.
I Compile and run on CPU, and then on GPU with UVM

make -j KOKKOS_DEVICES=OpenMP   # CPU - only using OpenMP
make -j KOKKOS_DEVICES=Cuda     # GPU - note UVM in Makefile
# Run exercise
./02_Exercise.host -S 26
./02_Exercise.cuda -S 26
# Note the warnings , set appropriate environment variables

I Vary problem size: -S #


I Vary number of rows: -N #
I Vary repeats: -nrepeat #
I Compare performance of CPU vs GPU

Online April 21-24, 2020 46/192


Advanced features we haven’t covered

I Memory space in which view’s data resides; covered next.


I deep copy view’s data; covered later.
Note: Kokkos never hides a deep copy of data.
I Layout of multidimensional array; covered later.
I Memory traits; covered later.
I Subview: Generating a view that is a “slice” of other
multidimensional array view; covered later.

Online April 21-24, 2020 47/192


Execution and Memory spaces

Execution and Memory Spaces


Learning objectives:
I Heterogeneous nodes and the space abstractions.
I How to control where parallel bodies are run, execution
space.
I How to control where view data resides, memory space.
I How to avoid illegal memory accesses and manage data
movement.
I The need for Kokkos::initialize and finalize.
I Where to use Kokkos annotation macros for portability.

Online April 21-24, 2020 48/192


Execution spaces (1)
Execution Space
a homogeneous set of cores and an execution mechanism
(i.e., “place to run code”)

[Figure: schematic of a heterogeneous node: NUMA domains of cores with on-package memory, DRAM, and NVRAM, connected by a network-on-chip; an accelerator with its own on-package memory; and an external interconnect to the external network.]

Execution spaces: Serial, Threads, OpenMP, Cuda, HIP, ...


Online April 21-24, 2020 49/192
Execution spaces (2)

Host

MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);

Parallel

Kokkos :: parallel_for ( " MyKernel " , numberOfSomethings ,
[=] ( const int64_t somethingIndex ) {
const double y = ...;
// do something interesting
}
);

Online April 21-24, 2020 50/192


Execution spaces (2)

Host

MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);

Parallel

Kokkos :: parallel_for ( " MyKernel " , numberOfSomethings ,
[=] ( const int64_t somethingIndex ) {
const double y = ...;
// do something interesting
}
);

I Where will Host code be run? CPU? GPU?


⇒ Always in the host process

Online April 21-24, 2020 50/192


Execution spaces (2)

Host

MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);

Parallel

Kokkos :: parallel_for ( " MyKernel " , numberOfSomethings ,
[=] ( const int64_t somethingIndex ) {
const double y = ...;
// do something interesting
}
);

I Where will Host code be run? CPU? GPU?


⇒ Always in the host process
I Where will Parallel code be run? CPU? GPU?
⇒ The default execution space

Online April 21-24, 2020 50/192


Execution spaces (2)

Host

MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);

Parallel

Kokkos :: parallel_for ( " MyKernel " , numberOfSomethings ,
[=] ( const int64_t somethingIndex ) {
const double y = ...;
// do something interesting
}
);

I Where will Host code be run? CPU? GPU?


⇒ Always in the host process
I Where will Parallel code be run? CPU? GPU?
⇒ The default execution space
I How do I control where the Parallel body is executed?
Changing the default execution space (at compilation),
or specifying an execution space in the policy.

Online April 21-24, 2020 50/192


Execution spaces (3)
Changing the parallel execution space:
parallel_for ( " Label " ,
Custom

RangePolicy < E x e c u t i o n S p a c e >(0 , n u m b e r O f I n t e r v a l s ) ,


[=] ( const int64_t i ) {
/* ... body ... */
});

parallel_for ( " Label " ,


Default

n u m b e r O f I n t e r v a l s , // == RangePolicy < >(0 , n u m b e r O f I n t e r v a l s )


[=] ( const int64_t i ) {
/* ... body ... */
});

Online April 21-24, 2020 51/192


Execution spaces (3)
Changing the parallel execution space:
parallel_for ( " Label " ,
Custom

RangePolicy < E x e c u t i o n S p a c e >(0 , n u m b e r O f I n t e r v a l s ) ,


[=] ( const int64_t i ) {
/* ... body ... */
});

parallel_for ( " Label " ,


Default

n u m b e r O f I n t e r v a l s , // == RangePolicy < >(0 , n u m b e r O f I n t e r v a l s )


[=] ( const int64_t i ) {
/* ... body ... */
});

Requirements for enabling execution spaces:


I Kokkos must be compiled with the execution spaces enabled.
I Execution spaces must be initialized (and finalized).
I Functions must be marked with a macro for non-CPU spaces.
I Lambdas must be marked with a macro for non-CPU spaces.
Online April 21-24, 2020 51/192
Execution spaces (5)

Kokkos function and lambda portability annotation macros:


Function annotation with KOKKOS_INLINE_FUNCTION macro
struct ParallelFunctor {
KOKKOS_INLINE_FUNCTION
double helperFunction ( const int64_t s ) const {...}
KOKKOS_INLINE_FUNCTION
void operator ()( const int64_t index ) const {
helperFunction ( index );
}
}
// Where Kokkos defines :
#define KOKKOS_INLINE_FUNCTION inline                      /* #if CPU-only */
#define KOKKOS_INLINE_FUNCTION inline __device__ __host__  /* #if CPU+Cuda */

Online April 21-24, 2020 52/192


Execution spaces (5)

Kokkos function and lambda portability annotation macros:


Function annotation with KOKKOS_INLINE_FUNCTION macro
struct ParallelFunctor {
KOKKOS_INLINE_FUNCTION
double helperFunction ( const int64_t s ) const {...}
KOKKOS_INLINE_FUNCTION
void operator ()( const int64_t index ) const {
helperFunction ( index );
}
}
// Where Kokkos defines :
#define KOKKOS_INLINE_FUNCTION inline                      /* #if CPU-only */
#define KOKKOS_INLINE_FUNCTION inline __device__ __host__  /* #if CPU+Cuda */

Lambda annotation with KOKKOS_LAMBDA macro (requires CUDA 8.0)

Kokkos :: parallel_for ( " Label " , numberOfIterations ,
KOKKOS_LAMBDA ( const int64_t index ) {...});

// Where Kokkos defines :
#define KOKKOS_LAMBDA [=]              /* #if CPU-only */
#define KOKKOS_LAMBDA [=] __device__   /* #if CPU+Cuda */

Online April 21-24, 2020 52/192


Memory Space Motivation

Memory space motivating example: summing an array


View < double * > data ( " data " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
data ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < SomeExampleExecutionSpace >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += data ( index );
},
sum );

Online April 21-24, 2020 53/192


Memory Space Motivation

Memory space motivating example: summing an array


View < double * > data ( " data " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
data ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < SomeExampleExecutionSpace >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += data ( index );
},
sum );

Question: Where is the data stored? GPU memory? CPU


memory? Both?

Online April 21-24, 2020 53/192




Memory Space Motivation

Memory space motivating example: summing an array


View < double * > data ( " data " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
data ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < SomeExampleExecutionSpace >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += data ( index );
},
sum );

Question: Where is the data stored? GPU memory? CPU


memory? Both?

⇒ Memory Spaces
Online April 21-24, 2020 53/192
Memory spaces (0)

Memory space:
explicitly-manageable memory resource
(i.e., “place to put data”)

[Figure: schematic of a heterogeneous node: NUMA domains of cores with on-package memory, DRAM, and NVRAM, connected by a network-on-chip; an accelerator with its own on-package memory; and an external interconnect to the external network.]

Online April 21-24, 2020 54/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

Online April 21-24, 2020 55/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

I View<double***,Memory Space> data(...);

Online April 21-24, 2020 55/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

I View<double***,Memory Space> data(...);


I Available memory spaces:
HostSpace, CudaSpace, CudaUVMSpace, ... more

Online April 21-24, 2020 55/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

I View<double***,Memory Space> data(...);


I Available memory spaces:
HostSpace, CudaSpace, CudaUVMSpace, ... more
I Each execution space has a default memory space, which is
used if Space provided is actually an execution space

Online April 21-24, 2020 55/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

I View<double***,Memory Space> data(...);


I Available memory spaces:
HostSpace, CudaSpace, CudaUVMSpace, ... more
I Each execution space has a default memory space, which is
used if Space provided is actually an execution space
I If no Space is provided, the view’s data resides in the default
memory space of the default execution space.

Online April 21-24, 2020 55/192


Memory spaces (2)
Example: HostSpace
View < double ** , HostSpace > hostView (... constructor arguments ...);

Online April 21-24, 2020 56/192


Memory spaces (2)
Example: HostSpace
View < double ** , HostSpace > hostView (... constructor arguments ...);

Example: CudaSpace
View < double ** , CudaSpace > view (... constructor arguments ...);

Online April 21-24, 2020 56/192


Execution and Memory spaces (0)

Anatomy of a kernel launch:

1. User declares views, allocating.
2. User instantiates a functor with views.
3. User launches parallel_something:
I Functor is copied to the device.
I Kernel is run.
I Copy of functor on the device is released.

# define KL KOKKOS_LAMBDA
View < int * , Cuda > dev (...);
parallel_for ( " Label " ,N ,
KL ( int i ) {
dev ( i ) = ...;
});

Note: no deep copies of array data are performed;


views are like pointers.

Online April 21-24, 2020 57/192


Execution and Memory spaces (1)

Example: one view

# define KL KOKKOS_LAMBDA
View < int * , Cuda > dev ;
parallel_for ( " Label " ,N ,
KL ( int i ) {
dev ( i ) = ...;
});

Online April 21-24, 2020 58/192


Execution and Memory spaces (2)

Example: two views

# define KL KOKKOS_LAMBDA
View < int * , Cuda > dev ;
View < int * , Host > host ;
parallel_for ( " Label " ,N ,
KL ( int i ) {
dev ( i ) = ...;
host ( i ) = ...;
});

Online April 21-24, 2020 59/192




Execution and Memory spaces (3)

Example (redux): summing an array with the GPU


(failed) Attempt 1: View lives in CudaSpace
View < double * , CudaSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );
},
sum );

Online April 21-24, 2020 60/192


Execution and Memory spaces (3)

Example (redux): summing an array with the GPU


(failed) Attempt 1: View lives in CudaSpace
View < double * , CudaSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...   // <- fault
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );
},
sum );

Online April 21-24, 2020 60/192


Execution and Memory spaces (4)

Example (redux): summing an array with the GPU


(failed) Attempt 2: View lives in HostSpace
View < double * , HostSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );
},
sum );

Online April 21-24, 2020 61/192


Execution and Memory spaces (4)

Example (redux): summing an array with the GPU


(failed) Attempt 2: View lives in HostSpace
View < double * , HostSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );   // <- illegal access
},
sum );

Online April 21-24, 2020 61/192


Execution and Memory spaces (4)

Example (redux): summing an array with the GPU


(failed) Attempt 2: View lives in HostSpace
View < double * , HostSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );   // <- illegal access
},
sum );
What’s the solution?
I CudaUVMSpace
I CudaHostPinnedSpace (skipping)
I Mirroring

Online April 21-24, 2020 61/192


Execution and Memory spaces (5)

CudaUVMSpace

# define KL KOKKOS_LAMBDA
View < double * ,
CudaUVMSpace> array ;
array = ... from file ...
double sum = 0;
p ar allel _r ed u ce ( " Label " , N ,
KL ( int i ,
double & d ) {
d += array ( i );
},
sum );

Cuda runtime automatically handles data movement,


at a performance hit.

Online April 21-24, 2020 62/192


Views, Spaces, and Mirrors

Important concept: Mirrors


Mirrors are views of equivalent arrays residing in possibly different
memory spaces.

Online April 21-24, 2020 63/192


Views, Spaces, and Mirrors

Important concept: Mirrors


Mirrors are views of equivalent arrays residing in possibly different
memory spaces.

Mirroring schematic
typedef Kokkos :: View < double ** , Space > ViewType ;
ViewType view (...);
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

Online April 21-24, 2020 63/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

3. Populate hostView on the host (from file, etc.).

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

3. Populate hostView on the host (from file, etc.).


4. Deep copy hostView’s array to view’s array.
Kokkos :: deep_copy ( view , hostView );

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

3. Populate hostView on the host (from file, etc.).


4. Deep copy hostView’s array to view’s array.
Kokkos :: deep_copy ( view , hostView );

5. Launch a kernel processing the view’s array.


Kokkos :: parallel_for ( " Label " ,
RangePolicy < Space >(0 , size ) ,
KOKKOS_LAMBDA (...) { use and change view });

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

3. Populate hostView on the host (from file, etc.).


4. Deep copy hostView’s array to view’s array.
Kokkos :: deep_copy ( view , hostView );

5. Launch a kernel processing the view’s array.


Kokkos :: parallel_for ( " Label " ,
RangePolicy < Space >(0 , size ) ,
KOKKOS_LAMBDA (...) { use and change view });

6. If needed, deep copy the view’s updated array back to the


hostView’s array to write file, etc.
Kokkos :: deep_copy ( hostView , view );
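Putting the six steps together, a minimal sketch (illustrative; the file reading is elided, and Space stands for an execution space such as Cuda, so the View uses that space's default memory space):

using ViewType = Kokkos::View<double*, Space>;
ViewType view("view", size);                                        // 1. allocate in Space
ViewType::HostMirror hostView = Kokkos::create_mirror_view(view);   // 2. host mirror

for (int64_t i = 0; i < size; ++i) { hostView(i) = /* ... read from file ... */ 0.0; }  // 3.

Kokkos::deep_copy(view, hostView);                                  // 4. host -> device

Kokkos::parallel_for("Scale", Kokkos::RangePolicy<Space>(0, size),  // 5. kernel on Space
  KOKKOS_LAMBDA(const int64_t i) { view(i) *= 2.0; });

Kokkos::deep_copy(hostView, view);                                  // 6. device -> host (if needed)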
Online April 21-24, 2020 64/192
Mirrors of Views in HostSpace

What if the View is in HostSpace too? Does it make a copy?


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view ( " test " , 10);
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

I create_mirror_view allocates data only if the host process
cannot access view’s data, otherwise hostView references the
same data.
I create_mirror always allocates data.
I Reminder: Kokkos never performs a hidden deep copy.

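A sketch contrasting the two for a view that already lives in HostSpace (illustrative):

Kokkos::View<double*, Kokkos::HostSpace> h("h", 10);

// create_mirror_view: the host can already access h, so no allocation; m1 aliases h's data.
auto m1 = Kokkos::create_mirror_view(h);   // m1.data() == h.data()

// create_mirror: always allocates a fresh host copy.
auto m2 = Kokkos::create_mirror(h);        // m2.data() != h.data()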
Online April 21-24, 2020 65/192


Exercise #3: Flat Parallelism on the GPU, Views and Host Mirrors
Details:
I Location: Intro-Full/Exercises/03/Begin/
I Add HostMirror Views and deep copy
I Make sure you use the correct view in initialization and Kernel

# Compile for CPU


make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU ( we do not need UVM anymore )
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./03_Exercise.cuda -S 26

Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Change number of repeats (-nrepeat ...)
I Compare behavior of CPU vs GPU

Online April 21-24, 2020 66/192


View and Spaces Section Summary

I Data is stored in Views that are “pointers” to


multi-dimensional arrays residing in memory spaces.
I Views abstract away platform-dependent allocation,
(automatic) deallocation, and access.
I Heterogeneous nodes have one or more memory spaces.
I Mirroring is used for performant access to views in host and
device memory.
I Heterogeneous nodes have one or more execution spaces.
I You control where parallel code is run by a template
parameter on the execution policy, or by compile-time
selection of the default execution space.

Online April 21-24, 2020 67/192


Managing memory access patterns
for performance portability
Learning objectives:
I How the View’s Layout parameter controls data layout.
I How memory access patterns result from Kokkos mapping
parallel work indices and layout of multidimensional array data
I Why memory access patterns and layouts have such a
performance impact (caching and coalescing).
I See a concrete example of the performance of various memory
configurations.

Online April 21-24, 2020 68/192


Example: inner product (0)
Kokkos::parallel_reduce("Label",
  RangePolicy<ExecutionSpace>(0, N),
  KOKKOS_LAMBDA (const size_t row, double& valueToUpdate) {
    double thisRowsSum = 0;
    for (size_t entry = 0; entry < M; ++entry) {
      thisRowsSum += A(row, entry) * x(entry);
    }
    valueToUpdate += y(row) * thisRowsSum;
  }, result);

Driving question: How should A be laid out in memory?


Online April 21-24, 2020 69/192
Example: inner product (1)

Layout is the mapping of multi-index to memory:

LayoutLeft
in 2D, “column-major”

LayoutRight
in 2D, “row-major”

Online April 21-24, 2020 70/192


Layout

Important concept: Layout


Every View has a multidimensional array Layout set at
compile-time.

View<double***, Layout, Space> name(...);

I Most-common layouts are LayoutLeft and LayoutRight.


LayoutLeft: left-most index is stride 1.
LayoutRight: right-most index is stride 1.
I If no layout specified, default for that memory space is used.
LayoutLeft for CudaSpace, LayoutRight for HostSpace.
I Layouts are extensible: ≈ 50 lines
I Advanced layouts: LayoutStride, LayoutTiled, ...

Online April 21-24, 2020 71/192
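A quick sketch of what the two common layouts mean in terms of strides (the extents here are
arbitrary):

Kokkos::View<double**, Kokkos::LayoutLeft>  L("L", 4, 5);
Kokkos::View<double**, Kokkos::LayoutRight> R("R", 4, 5);

// LayoutLeft: left-most index is stride 1
//   L.stride(0) == 1, L.stride(1) == 4
// LayoutRight: right-most index is stride 1
//   R.stride(0) == 5, R.stride(1) == 1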


Exercise #4: Inner Product, Flat Parallelism

Details:
I Location: Intro-Full/Exercises/04/Begin/
I Replace "N" in parallel dispatch with RangePolicy<ExecSpace>
I Add MemSpace to all Views and Layout to A
I Experiment with the combinations of ExecSpace, Layout to view
performance
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Change number of repeats (-nrepeat ...)
I Compare behavior of CPU vs GPU
I Compare using UVM vs not using UVM on GPUs
I Check what happens if MemSpace and ExecSpace do not match.

Online April 21-24, 2020 72/192


Exercise #4: Inner Product, Flat Parallelism
[Figure: <y|Ax> Exercise 04 (Layout), bandwidth (GB/s) vs. number of rows (N) at fixed
problem size, for KNL (Xeon Phi 68c), HSW (dual Xeon Haswell 2x16c), and Pascal60
(NVIDIA GPU), each with LayoutLeft and LayoutRight. Annotation: Why?]

Online April 21-24, 2020 73/192


Caching and coalescing (0)
Thread independence:
operator()(const size_t index, double& valueToUpdate) {
  const double d = _data(index);
  valueToUpdate += d;
}

Question: once a thread reads d, does it need to wait?


I CPU threads are independent.
  i.e., threads may execute at any rate.
I GPU threads benefit (NVIDIA Volta) or must synchronize
  (AMD) in groups.
  i.e., threads in groups can/must execute instructions
  together.
  In particular, all threads in a group (warp or wavefront) must
  finish their loads before any thread can move on.
So, how many cache lines must be fetched before threads can
move on?
Online April 21-24, 2020 74/192
Caching and coalescing (1)

CPUs: few (independent) cores with separate caches:

GPUs: many (synchronized) cores with a shared cache:

Online April 21-24, 2020 75/192


Caching and coalescing (2)

Important point
For performance, accesses to views in HostSpace must be cached,
while access to views in CudaSpace must be coalesced.

Caching: if thread t’s current access is at position i,


thread t’s next access should be at position i+1.
Coalescing: if thread t’s current access is at position i,
thread t+1’s current access should be at position i+1.
Warning
Uncoalesced access on GPUs and non-cached loads on CPUs
greatly reduce performance (can be >10X)

Online April 21-24, 2020 76/192


Mapping indices to cores (0)

Consider the array summation example:


View<double*, Space> data("data", size);
... populate data ...

double sum = 0;
Kokkos::parallel_reduce("Label",
  RangePolicy<Space>(0, size),
  KOKKOS_LAMBDA (const size_t index, double& valueToUpdate) {
    valueToUpdate += data(index);
  },
  sum);

Question: is this cached (for OpenMP) and coalesced (for Cuda)?


Given P threads, which indices do we want thread 0 to handle?
  Contiguous: 0, 1, 2, ..., N/P     (CPU)
  Strided:    0, N/P, 2*N/P, ...    (GPU)
Why?
Online April 21-24, 2020 77/192
Mapping indices to cores (1)

Iterating for the execution space:


operator()(const size_t index, double& valueToUpdate) {
  const double d = _data(index);
  valueToUpdate += d;
}

As users we don’t control how indices are mapped to threads, so


how do we achieve good memory access?

Important point
Kokkos maps indices to cores in contiguous chunks on CPU
execution spaces, and strided for Cuda.

Online April 21-24, 2020 78/192


Mapping indices to cores (2)

Rule of Thumb
Kokkos index mapping and default layouts provide efficient access
if iteration indices correspond to the first index of array.

Example:
View<double***, ...> view(...);
...
Kokkos::parallel_for("Label", ...,
  KOKKOS_LAMBDA (const size_t workIndex) {
    ...
    view(..., ..., workIndex) = ...;
    view(..., workIndex, ...) = ...;
    view(workIndex, ..., ...) = ...;
  });
...

Online April 21-24, 2020 79/192


Example: inner product (2)

Important point
Performant memory access is achieved by Kokkos mapping parallel
work indices and multidimensional array layout appropriately for
the architecture.

Analysis: row-major (LayoutRight)

I HostSpace: cached (good)


I CudaSpace: uncoalesced (bad)
Online April 21-24, 2020 80/192
Example: inner product (3)

Important point
Performant memory access is achieved by Kokkos mapping parallel
work indices and multidimensional array layout optimally for the
architecture.

Analysis: column-major (LayoutLeft)

I HostSpace: uncached (bad)


I CudaSpace: coalesced (good)
Online April 21-24, 2020 81/192
Example: inner product (4)

Analysis: Kokkos architecture-dependent


View<double**, ExecutionSpace> A(N, M);
parallel_for(RangePolicy<ExecutionSpace>(0, N),
  ... thisRowsSum += A(j, i) * x(i);

(a) OpenMP (b) Cuda


I HostSpace: cached (good)
I CudaSpace: coalesced (good)

Online April 21-24, 2020 82/192


Example: inner product (5)
[Figure: <y|Ax> Exercise 04 (Layout), bandwidth (GB/s) vs. number of rows (N) at fixed
size, for KNL, HSW, and Pascal60 with LayoutLeft and LayoutRight. Annotations mark the
coalesced, cached, uncoalesced, and uncached regimes.]
Online April 21-24, 2020 83/192


Memory Access Pattern Summary

I Every View has a Layout set at compile-time through a


template parameter.
I LayoutRight and LayoutLeft are most common.
I Views in HostSpace default to LayoutRight and Views in
CudaSpace default to LayoutLeft.
I Layouts are extensible and flexible.
I For performance, memory access patterns must result in
caching on a CPU and coalescing on a GPU.
I Kokkos maps parallel work indices and multidimensional array
layout for performance portable memory access patterns.
I There is nothing in OpenMP, OpenACC, or OpenCL to manage
layouts.
⇒ You’ll need multiple versions of code or pay the
performance penalty.
Online April 21-24, 2020 84/192
DualView

DualView
Learning objectives:
I Motivation and Value Added.
I Usage.
I Exercises.

Online April 21-24, 2020 85/192


DualView(0)

Motivation and Value-added

I DualView was designed to help transition codes to Kokkos.

I DualView simplifies the task of managing data movement


between memory spaces, e.g., host and device.

I When converting a typical app to use Kokkos, there is usually


no holistic view of such data transfers.

Online April 21-24, 2020 86/192


DualView(1)

[Diagram: a device View and a host MirrorView connected by deep_copy.]

Without DualView, could use MirrorViews, but


I deep copies are expensive, use sparingly
I do I need a deep copy here?
I where is the most recent data?
I is data on the host or device stale?
I was code modified upstream? is data here now stale, but not
in previous version?
Online April 21-24, 2020 87/192
DualView: Usage

I DualView bundles two views, e.g. a host View and a device View.
I DualView::modify<MemorySpace>() marks the data as modified on the given MemorySpace.
I DualView::sync<MemorySpace>() deep copies the data to the given MemorySpace only if the
  two memory spaces are not in sync.
I DualView relies on calls to modify() to determine if data actually needs to be copied during
  a call to sync().
I sync() does nothing if there is a single memory space, so they are efficient to use all the time.

[Diagram: a DualView bundling a device View and a host View.]

There is no automatic tracking of data freshness:


I you must tell Kokkos when data has been modified on a
memory space.
I If you mark data as modified when you modify it, then Kokkos
will know if it needs to move data
Online April 21-24, 2020 88/192
DualView: Usage(1)

DualView bundles two views, a Host View and a Device View

I Data members for the two views
  DualView::t_host h_view
  DualView::t_dev  d_view

I Retrieve data members
  t_host view_host();
  t_dev  view_device();

I Mark data as modified
  void modify_host();
  void modify_device();

Online April 21-24, 2020 89/192


DualView: Usage(2)

DualView bundles two views, a Host View and a Device View

I Sync data in a direction if not in sync
  void sync_host();
  void sync_device();

I Check sync status
  void need_sync_host();
  void need_sync_device();

Online April 21-24, 2020 90/192
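A minimal sketch of the typical modify/sync workflow using these members (the view name, size,
and kernel are illustrative):

#include <Kokkos_DualView.hpp>

Kokkos::DualView<double*> dv("dv", n);

// Fill on the host, then mark the host side as modified.
auto h = dv.view_host();
for (int i = 0; i < n; ++i) h(i) = 1.0 * i;
dv.modify_host();

// Sync before using the device side (deep copies only if the two sides differ).
dv.sync_device();
auto d = dv.view_device();
Kokkos::parallel_for("scale", n,
  KOKKOS_LAMBDA (const int i) { d(i) *= 2.0; });
dv.modify_device();

// Sync back before reading on the host again.
dv.sync_host();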


DualView: Usage in generic context
DualView has templated functions for generic use in
templated code
I Retrieve data members
  template <class Space>
  auto view();

I Mark data as modified
  template <class Space>
  void modify();

I Sync data in a direction if not in sync
  template <class Space>
  void sync();

I Check sync status
  template <class Space>
  void need_sync();

Online April 21-24, 2020 91/192


Exercise - DualView

Details:
I Location: Intro-Full/Exercises/dualview/Begin/
I Modify or create a new compute enthalpy function in
dual view exercise.cpp to:
I 1. Take (dual)views as arguments
I 2. Call modify() and/or sync() when appropriate for the dual
views
I 3. Runs the kernel on host or device execution spaces

# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU (we do not need UVM anymore)
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./dualview.cuda -S 26

Online April 21-24, 2020 92/192


MDRangePolicy

Tightly Nested Loops with


MDRangePolicy
Learning objectives:
I Demonstrate usage of the MDRangePolicy with tightly nested
loops.
I Syntax - Required and optional settings
I Code demo and example

Online April 21-24, 2020 93/192


MDRangePolicy (0)
Motivating example: Consider the nested for loops:
for (int i = 0; i < Ni; ++i)
  for (int j = 0; j < Nj; ++j)
    for (int k = 0; k < Nk; ++k)
      some_init_fcn(i, j, k);

Based on Kokkos lessons thus far, you might parallelize this as


Kokkos::parallel_for(Ni,
  KOKKOS_LAMBDA (const int i) {
    for (int j = 0; j < Nj; ++j)
      for (int k = 0; k < Nk; ++k)
        some_init_fcn(i, j, k);
  }
);

I This only parallelizes along one dimension, leaving potential


parallelism unexploited.
I What if Ni is too small to amortize the cost of constructing a
parallel region, but Ni*Nj*Nk makes it worthwhile?
Online April 21-24, 2020 94/192
MDRangePolicy (1)

Solution: Use an MDRangePolicy


for (int i = 0; i < Ni; ++i)
  for (int j = 0; j < Nj; ++j)
    for (int k = 0; k < Nk; ++k)
      some_init_fcn(i, j, k);

Instead, use the MDRangePolicy with the parallel_for


Kokkos::parallel_for(Kokkos::MDRangePolicy<Kokkos::Rank<3>>
    ({0, 0, 0}, {Ni, Nj, Nk}),
  KOKKOS_LAMBDA (int i, int j, int k) {
    some_init_fcn(i, j, k);
  }
);

Online April 21-24, 2020 95/192


MDRangePolicy API(0)

Required Template Parameters to MDRangePolicy


Kokkos::Rank<N, IterateOuter, IterateInner>
I N: (Required) the rank of the index space (limited from 2 to 6)
I IterateOuter (Optional) iteration pattern between tiles
I Options: Iterate::Left, Iterate::Right, Iterate::Default
I IterateInner (Optional) iteration pattern within tiles
I Options: Iterate::Left, Iterate::Right, Iterate::Default

Online April 21-24, 2020 96/192


MDRangePolicy API(1)

Optional Template Parameters


ExecutionSpace
I Options: Serial, OpenMP, Threads, Cuda

Schedule < Options >


I Options: Static, Dynamic

IndexType < Options >


I Options: int, long, etc

WorkTag
I Options: SomeClass

MDRangePolicy<Rank<2, OP, IP>, OpenMP, Schedule<Static>,
              IndexType<int>> mdr_policy;

Online April 21-24, 2020 97/192


MDRangePolicy API(2)
Policy Arguments
BeginList
I Initializer List or Kokkos::Array (Required): rank arguments for
starts of index space
I Example Rank 2: {b0,b1}

EndList
I Initializer List or Kokkos::Array (Required): rank arguments for
ends of index space
I Example Rank 2: {e0,e1}

TileDimList
I Initializer List or Kokkos::Array (Optional): rank arguments for
dimension of tiles
I Example Rank 2: {t0,t1}

mdr_policy({b0, b1}, {e0, e1}, {t0, t1});
Online April 21-24, 2020 98/192
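Putting the pieces together, a sketch of a rank-2 policy with explicit iteration patterns and tile
sizes (the extents, tile dimensions, and the view A are illustrative):

using policy_t = Kokkos::MDRangePolicy<
    Kokkos::Rank<2, Kokkos::Iterate::Right, Kokkos::Iterate::Right>>;

policy_t mdr_policy({0, 0}, {N0, N1}, {32, 4});  // begins, ends, tile dims

Kokkos::parallel_for("init_A", mdr_policy,
  KOKKOS_LAMBDA (const int i, const int j) {
    A(i, j) = 1.0 * (i + j);
  });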
Exercise - mdrange: Initialize multi-dim views with MDRangePolicy

Details:
I Location: Intro-Full/Exercises/mdrange/Begin/
I This begins with the Solution of 02
I Initialize the device Views x and y directly on the device using a
parallel for and RangePolicy
I Initialize the device View matrix A directly on the device using a
parallel for and MDRangePolicy

# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU (we do not need UVM anymore)
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./mdrange_exercise.cuda -S 26

Online April 21-24, 2020 99/192


Exercise - mdrange: Initialize multi-dim views with MDRangePolicy

Things to try:
I Name the kernels - pass a string as the first argument of the parallel
pattern
I Try changing the iteration patterns for the tiles in the
MDRangePolicy, notice differences in performance

Online April 21-24, 2020 100/192


Subviews

Subviews: Taking ’slices’ of


Views
Learning objectives:
I Introduce Kokkos::subview - basic capabilities and syntax
I Suggested usage and practices

Online April 21-24, 2020 101/192


Subviews (0)

Subview description:
I A subview is a ’slice’ of a View that behaves as a View
I Same syntax as a View - access data using (multi-)index entries
I The ’slice’ and original View point to the same data - no extra
memory allocation or copying
I Can be constructed on host or within a kernel (no allocation
of memory occurs)
I Similar capability as provided by Matlab, Fortran, Python, etc.
using ’colon’ notation

Online April 21-24, 2020 102/192


Subviews (1)

Introductory Usage Demo:


Begin with a View:
Kokkos::View<double***> v("v", N0, N1, N2);

Say we want a 2-dimensional slice at an index i0 in the first
dimension - that is, in Matlab/Fortran/Python notation:

slicei0 = v(i0, :, :);

This can be accomplished in Kokkos using a subview as follows:

auto slicei0 =
  Kokkos::subview(v, i0, Kokkos::ALL, Kokkos::ALL);

auto slicei0 =
  Kokkos::subview(v, i0, std::make_pair(0, v.extent(1)),
                         std::make_pair(0, v.extent(2)));
// extent(N) returns the size of dimension N of the View

Online April 21-24, 2020 103/192


Subviews (2)

Syntax:
Kokkos::subview(Kokkos::View<...> view,
                arg0,
                ...)
I view: First argument to the subview is the view of which a slice will
be taken
I argN: Slice info for rank N - provide same number of arguments as
rank
I Options for argN:
I index: integral type single value
I partial-range: std::pair or Kokkos::pair of integral types to
provide sub-range of a rank’s range [0,N)
I full-range: use Kokkos::ALL rather than providing the full
range as a pair

Online April 21-24, 2020 104/192


Subviews (3)

Suggested usage:
I Use ’auto’ to determine the return type of a subview
I A subview can help with encapsulation - e.g. can pass into
functions expecting a lower-dimensional View
I Use Kokkos::pair for partial ranges if subview created within a
kernel
I Avoid usage if very few data accesses will be made to the
subview
I Construction of subview costs 20-40 operations

Online April 21-24, 2020 105/192
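A short sketch of creating subviews inside a kernel, using Kokkos::pair for a partial range
(the rank-2 view A, the rank-1 view y, and the extents N and M are assumptions for illustration):

Kokkos::parallel_for("rows", N, KOKKOS_LAMBDA (const int j) {
  auto row       = Kokkos::subview(A, j, Kokkos::ALL);                     // full row j
  auto firstHalf = Kokkos::subview(A, j, Kokkos::pair<int, int>(0, M / 2)); // partial range

  double s = 0;
  for (int i = 0; i < (int)firstHalf.extent(0); ++i) s += firstHalf(i);
  y(j) = s + row(M - 1);
});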


Exercise - Subviews: Basic usage

Details:
I Location: Intro-Full/Exercises/subview/Begin/
I This begins with the Solution of 04
I In the parallel reduce kernel, create a subview for row j of view A
I Use this subview when computing A(j,:)*x(:) rather than the matrix
A
# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU (we do not need UVM anymore)
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./subview_exercise.cuda -S 26

Online April 21-24, 2020 106/192


Thread safety and
atomic operations
Learning objectives:
I Understand that coordination techniques for low-count CPU
threading are not scalable.
I Understand how atomics can parallelize the scatter-add
pattern.
I Gain performance intuition for atomics on the CPU and
GPU, for different data types and contention rates.

Online April 21-24, 2020 107/192


Examples: Histogram

Histogram kernel:
parallel_for(N, KOKKOS_LAMBDA (const size_t index) {
  const Something value = ...;
  const size_t bucketIndex = computeBucketIndex(value);
  ++_histogram(bucketIndex);
});

Problem: Multiple threads may try to write to the same location.

Solution strategies:
I Locks: not feasible on GPU
I Thread-private copies:
not thread-scalable
I Atomics

http://www.farmaceuticas.com.br/tag/graficos/
Online April 21-24, 2020 108/192
Atomics

Atomics: the portable and thread-scalable solution


parallel_for(N, KOKKOS_LAMBDA (const size_t index) {
  const Something value = ...;
  const int bucketIndex = computeBucketIndex(value);
  Kokkos::atomic_add(&_histogram(bucketIndex), 1);
});

I Atomics are the only scalable solution to thread safety.


I Locks are not portable.
I Data replication is not thread scalable.

Online April 21-24, 2020 109/192


Performance of atomics (0)

How expensive are atomics?

Thought experiment: scalar integration


operator()(const unsigned int intervalIndex,
           double& valueToUpdate) const {
  double contribution = function(...);
  valueToUpdate += contribution;
}

Idea: what if we instead do this with parallel_for and atomics?

operator()(const unsigned int intervalIndex) const {
  const double contribution = function(...);
  Kokkos::atomic_add(&globalSum, contribution);
}

How much of a performance penalty is incurred?

Online April 21-24, 2020 110/192


Performance of atomics (1)

Two costs: (independent) work and coordination.


p ar al le l _r ed u ce ( numberOfIntervals ,
KOKKOS_LAMBDA ( const unsigned int intervalIndex ,
double & valueToUpdate ) {
valueToUpdate += function (...);
} , totalIntegral );

Experimental setup
operator ()( const unsigned int index ) const {
Kokkos :: atomic_add (& globalSums [ index % atomicStride ] , 1);
}

I This is the most extreme case: all coordination and no work.


I Contention is captured by the atomicStride.
atomicStride → 1 ⇒ Scalar integration (bad)
atomicStride → large ⇒ Independent (good)

Online April 21-24, 2020 111/192


Performance of atomics (2)
Atomics performance: 1 million adds, no work per kernel

[Figure: slowdown from atomics (log10 speedup over independent) vs. contention (log10),
summary for 1 million adds, mod, 0 pows, for CUDA, OpenMP, and Phi backends with double,
float, and integer types. Low(?) penalty for low contention; high penalty for high contention.]

Online April 21-24, 2020 112/192
Performance of atomics (3)
Atomics performance: 1 million adds, some work per kernel

[Figure: slowdown from atomics vs. contention, summary for 1 million adds, mod, 2 pows,
same backends and data types. No penalty for low contention; high penalty for high contention.]

Online April 21-24, 2020 113/192
Performance of atomics (4)
Atomics performance: 1 million adds, lots of work per kernel

[Figure: slowdown from atomics vs. contention, summary for 1 million adds, mod, 5 pows,
same backends and data types. No penalty for low contention; high penalty for high contention.]

Online April 21-24, 2020 114/192
Advanced features

Atomics on arbitrary types:


I Atomic operations work if the corresponding operator exists,
i.e., atomic add works on any data type with “+”.
I Atomic exchange works on any data type.
// Assign *dest to val, return former value of *dest
template <typename T>
T atomic_exchange(T* dest, T val);
// If *dest == comp then assign *dest to val.
// Return true if it succeeds.
template <typename T>
bool atomic_compare_exchange_strong(T* dest, T comp, T val);

Online April 21-24, 2020 115/192
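As a sketch of how the compare-exchange primitive is typically used, here is a hand-rolled
atomic "max" built from a CAS loop (Kokkos also ships dedicated atomic max operations; this
only illustrates the idiom):

KOKKOS_INLINE_FUNCTION
void atomic_max_via_cas(double* dest, double val) {
  double old = *dest;
  while (old < val) {
    // Try to install val; succeeds only if *dest still equals old.
    if (Kokkos::atomic_compare_exchange_strong(dest, old, val)) break;
    old = *dest;  // another thread won the race; re-read and retry
  }
}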


Memory traits

Slight detour: View memory traits:


I Beyond a Layout and Space, Views can have memory traits.
I Memory traits either provide convenience or allow for certain
hardware-specific optimizations to be performed.
Example: If all accesses to a View will be atomic, use the Atomic
memory trait:
View<double**, Layout, Space,
     MemoryTraits<Atomic>> forces(...);

Many memory traits exist or are experimental, including Read,


Write, ReadWrite, ReadOnce (non-temporal), Contiguous, and
RandomAccess.

Online April 21-24, 2020 116/192


RandomAccess memory trait
Example: RandomAccess memory trait:
On GPUs, there is a special pathway for fast read-only, random
access, originally designed for textures.
How to access texture memory via CUDA:

How to access texture memory via Kokkos:


View<const double***, Layout, Space,
     MemoryTraits<RandomAccess>> name(...);
Online April 21-24, 2020 117/192
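In practice the read-only RandomAccess view is usually obtained by assigning an existing view
to a const view carrying the trait, e.g. (a sketch; the view name and size are illustrative):

Kokkos::View<double*> data("data", n);
// Same allocation; reads may now go through the fast read-only path on GPUs:
Kokkos::View<const double*, Kokkos::MemoryTraits<Kokkos::RandomAccess>> data_ro = data;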
Scatter Contribute (1)

Histogram generation is an example of the Scatter Contribute


pattern.
I Like a reduction but with many results.
I Number of results scales with number of inputs.
I Each result gets contributions from a small number of
  inputs/iterations.
I Uses an inputs-to-results map, not its inverse.
Examples:
I Particles contributing to neighbors forces.
I Cells contributing forces to nodes.
I Computing histograms.
I Computing a density grid from point source contributions.

Online April 21-24, 2020 118/192


Scatter Contribute (2)
There are two useful algorithms:
I Atomics: thread-scalable but depends on atomic
performance.
I Data Replication: every thread owns a copy of the output,
not thread-scalable but good for low (< 16) threads count
architectures.

Important Capability: ScatterView


ScatterView can transparently switch between Atomic and Data
Replication based scatter algorithms.

I Abstracts over scatter contribute algorithms.


I Compile time choice with backend-specific defaults.
I Only limited number of operations are supported.
I Part of Kokkos Containers.
Online April 21-24, 2020 119/192
Scatter Contribute (3)
Example:
// Begin with a normal View
Kokkos::View<double*> results("results", N);
// Create a scatter view wrapping the original view
Kokkos::Experimental::ScatterView<double*> scatter(results);
// Reset contributions if necessary
scatter.reset();
// Start parallel operation
Kokkos::parallel_for("ScatterAlg", M,
  KOKKOS_LAMBDA (int i) {
    // Get the accessor - e.g. the thread-specific copy
    // or an atomic view of the data.
    auto access = scatter.access();

    for (int j = 0; j < num_neighs(i); j++) {
      // Get the destination index
      int neigh = neighbors(i, j);
      // Add the contribution
      access(neigh) += contribution(i, neigh);
    }
  });
// Combine the results - no-op if ScatterView was using atomics internally
Kokkos::Experimental::contribute(results, scatter);
Online April 21-24, 2020 120/192
Exercise ScatterView

I Location: Intro-Full/Exercises/scatter view/Begin/


I Assignment: Convert scatter view loop to use ScatterView.
I Compile and run on both CPU and GPU

make -j KOKKOS_DEVICES=OpenMP # CPU - only using OpenMP

make -j KOKKOS_DEVICES=Cuda   # GPU - note UVM in Makefile
# Run exercise
./scatterview.host
./scatterview.cuda
# Note the warnings, set appropriate environment variables

I Compare performance on CPU of the three variants


I Compare performance on GPU of the two variants
I Vary problem size: first and second optional argument

Online April 21-24, 2020 121/192


Section Summary

I Atomics are the only thread-scalable solution to thread safety.


I Locks or data replication are not portable or scalable
I Atomic performance depends on ratio of independent work
and atomic operations.
I With more work, there is a lower performance penalty, because
of increased opportunity to interleave work and atomic.
I The Atomic memory trait can be used to make all accesses
to a view atomic.
I The cost of atomics can be negligible:
I CPU ideal: contiguous access, integer types
I GPU ideal: scattered access, 32-bit types
I Many programs with the scatter-add pattern can be
thread-scalably parallelized using atomics without much
modification.

Online April 21-24, 2020 122/192


Hierarchical parallelism
Finding and exploiting more parallelism in your computations.

Learning objectives:
I Similarities and differences between outer and inner levels of
parallelism
I Thread teams (league of teams of threads)
I Performance improvement with well-coordinated teams

Online April 21-24, 2020 123/192


Example: inner product (0)

(Flat parallel) Kernel:


Kokkos :: p ar a ll el _r e du ce ( " yAx " ,N ,
KOKKOS_LAMBDA ( const int row , double & valueToUpdate ) {
double thisRowsSum = 0;
for ( int col = 0; col < M ; ++ col ) {
thisRowsSum += A ( row , col ) * x ( col );
}
valueToUpdate += y ( row ) * thisRowsSum ;
} , result );

Problem: What if we don’t have


enough rows to saturate the GPU?
Solutions?
I Atomics
I Thread teams

Online April 21-24, 2020 124/192


Example: inner product (1)

Atomics kernel:
Kokkos :: parallel_for ( " yAx " , N ,
KOKKOS_LAMBDA ( const size_t index ) {
const int row = extractRow ( index );
const int col = extractCol ( index );
atomic_add (& result , A ( row , col ) * x ( col ));
});

Problem: Poor performance

Online April 21-24, 2020 125/192


Example: inner product (2)

Doing each individual row with atomics is like doing scalar


integration with atomics.

Instead, you could envision doing a large number of


parallel reduce kernels.
for each row {
  Functor functor(row, ...);
  parallel_reduce(M, functor);
}

This is an example of hierarchical work.


Important concept: Hierarchical parallelism
Algorithms that exhibit hierarchical structure can exploit
hierarchical parallelism with thread teams.

Online April 21-24, 2020 126/192


Example: inner product (3)

Important concept: Thread team


A collection of threads which are guaranteed to be executing
concurrently and can synchronize.
High-level strategy:
1. Do one parallel launch of N teams of M threads.
2. Each thread performs one entry in the row.
3. The threads within teams perform a reduction.
4. The thread teams perform a reduction.

Online April 21-24, 2020 127/192


Example: inner product (4)

The final hierarchical parallel kernel:


p ar al le l _r ed u ce ( " yAx " ,
team_policy (N , Kokkos :: AUTO ) ,

KOKKOS_LAMBDA ( const member_type & teamMember , double & update )


int row = teamMember . league_rank ();

double thisRowsSum = 0;
p aral le l _r ed uc e ( T e am Th re a dR an ge ( teamMember , M ) ,
[=] ( int col , double & innerUpdate ) {
innerUpdate += A ( row , col ) * x ( col );
} , thisRowsSum );

if ( teamMember . team_rank () == 0) {
update += y ( row ) * thisRowsSum ;
}
} , result );

Online April 21-24, 2020 128/192


TeamPolicy (0)

Important point
Using teams is changing the execution policy.

“Flat parallelism” uses RangePolicy:


We specify a total amount of work.
// total work = N
parallel_for("Label",
  RangePolicy<ExecutionSpace>(0, N), functor);

“Hierarchical parallelism” uses TeamPolicy:


We specify a team size and a number of teams.
// total work = numberOfTeams * teamSize
parallel_for("Label",
  TeamPolicy<ExecutionSpace>(numberOfTeams, teamSize), functor);

Online April 21-24, 2020 129/192


TeamPolicy (1)

Important point
When using teams, functor operators receive a team member.

typedef typename TeamPolicy<ExecSpace>::member_type member_type;

void operator()(const member_type& teamMember) {
  // Which team am I on?
  const unsigned int leagueRank = teamMember.league_rank();
  // Which thread am I on this team?
  const unsigned int teamRank = teamMember.team_rank();
}

Warning
There may be more (or fewer) team members than pieces of your
algorithm’s work per team

Online April 21-24, 2020 130/192


TeamThreadRange (0)

First attempt at exercise:


operator()(member_type& teamMember) {
  const size_t row = teamMember.league_rank();
  const size_t col = teamMember.team_rank();
  atomic_add(&result, y(row) * A(row, col) * x(col));
}

I When team size ≠ number of columns, how are units of work
  mapped to team's member threads? Is the mapping
  architecture-dependent?
I atomic_add performs badly under high contention; how can
  team's member threads performantly cooperate for a nested
  reduction?
Online April 21-24, 2020 131/192
TeamThreadRange (1)

We shouldn’t be hard-coding the work mapping...


operator()(member_type& teamMember, double& update) {
  const int row = teamMember.league_rank();
  double thisRowsSum;
  "do a reduction"("over M columns",
    [=] (const int col) {
      thisRowsSum += A(row, col) * x(col);
    });
  if (teamMember.team_rank() == 0) {
    update += y(row) * thisRowsSum;
  }
}

If this were a parallel execution,
we'd use Kokkos::parallel_reduce.
Key idea: this is a parallel execution.
⇒ Nested parallel patterns
Online April 21-24, 2020 132/192
TeamThreadRange (2)
TeamThreadRange:
operator()(const member_type& teamMember, double& update) {
  const int row = teamMember.league_rank();
  double thisRowsSum;
  parallel_reduce(TeamThreadRange(teamMember, M),
    [=] (const int col, double& thisRowsPartialSum) {
      thisRowsPartialSum += A(row, col) * x(col);
    }, thisRowsSum);
  if (teamMember.team_rank() == 0) {
    update += y(row) * thisRowsSum;
  }
}

I The mapping of work indices to threads is


architecture-dependent.
I The amount of work given to the TeamThreadRange need
not be a multiple of the team size.
I Intrateam reduction handled by Kokkos.
Online April 21-24, 2020 133/192
Nested parallelism
Anatomy of nested parallelism:
para llel_out er ( " Label " ,
TeamPolicy < ExecutionSpace >( numberOfTeams , teamSize ) ,
KOKKOS_LAMBDA ( const member_type & teamMember [ , . . . ] ) {
/* beginning of outer body */
para llel_inn er (
T ea mT hr e ad Ra ng e ( teamMember , t h i s T e a m s R a n g e S i z e ) ,
[=] ( const unsigned int i n d e xW i t h i n B a t c h [ , . . . ] ) {
/* inner body */
} [ , . . . ] );
/* end of outer body */
} [ , . . . ] );

I parallel_outer and parallel_inner may be any


combination of for, reduce, or scan.
I The inner lambda may capture by reference, but
capture-by-value is recommended.
I The policy of the inner lambda is always a TeamThreadRange.
I TeamThreadRange cannot be nested.
Online April 21-24, 2020 134/192
What should the team size be?
In practice, you can let Kokkos decide:
parallel_something(
  TeamPolicy<ExecutionSpace>(numberOfTeams, Kokkos::AUTO),
  /* functor */);

GPUs
I Special hardware available for coordination within a team.
I Within a team 32 (NVIDIA) or 64 (AMD) threads execute
“lock step.”
I Maximum team size: 1024; Recommended team size:
128/256
Intel Xeon Phi:
I Recommended team size: # hyperthreads per core
I Hyperthreads share the entire cache hierarchy;
  a well-coordinated team avoids cache-thrashing
Online April 21-24, 2020 135/192
Exercise #5: Inner Product, Hierarchical Parallelism
Details:
I Location: Intro-Full/Exercises/05/
I Replace RangePolicy<Space> with TeamPolicy<Space>
I Use AUTO for team size
I Make the inner loop a parallel reduce with TeamThreadRange
policy
I Experiment with the combinations of Layout, Space, N to view
performance
I Hint: what should the layout of A be?
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 4 for very non-square matrices
I Compare behavior of CPU vs GPU
Online April 21-24, 2020 136/192
Reminder, Exercise #4 with Flat Parallelism
[Figure: Exercise 04 (flat parallelism) results repeated for comparison: bandwidth (GB/s)
vs. number of rows (N) at fixed size for KNL, HSW, and Pascal60 with LayoutLeft and
LayoutRight, annotated with the coalesced / cached / uncoalesced / uncached regimes.]
Online April 21-24, 2020 137/192


Exercise #5: Inner Product, Hierarchical Parallelism
[Figure: <y|Ax> Exercise 05 (Layout/Teams), bandwidth (GB/s) vs. number of rows (N) at
fixed size for KNL, HSW, and Pascal60 with LayoutLeft and LayoutRight. Annotations mark
the coalesced and cached regimes.]
Online April 21-24, 2020 138/192


Three-level parallelism (0)

Exposing Vector Level Parallelism


I Optional third level in the hierarchy: ThreadVectorRange
I Can be used for parallel for, parallel reduce, or
parallel scan.
I Maps to vectorizable loop on CPUs or (sub-)warp level
parallelism on GPUs.
I Enabled with a runtime vector length argument to
TeamPolicy
I There is no explicit access to a vector lane ID.
I Depending on the backend the full global parallel region has
active vector lanes.
I TeamVectorRange uses both thread and vector parallelism.

Online April 21-24, 2020 139/192


Three-level parallelism (1)
Anatomy of nested parallelism:
para llel_out er ( " Label " ,
TeamPolicy < >( numberOfTeams , teamSize , vectorLength ) ,
KOKKOS_LAMBDA ( const member_type & teamMember [ , . . . ] ) {
/* beginning of outer body */
p ar al le l _m id dl e (
T ea mT hr e ad Ra ng e ( teamMember , t h i s T e a m s R a n g e S i z e ) ,
[=] ( const int i n d e x W i t h i n B a t c h [ , . . . ] ) {
/* begin middle body */
para llel_inn er (
T h r e a d V e c t o r R a n g e ( teamMember , t h i s V e c t o r R a n g e S i z e ) ,
[=] ( const int i n d e x V e c t o r R a n g e [ , . . . ] ) {
/* inner body */
}[ , ....);
/∗ end m i d d l e body ∗/
} [ , ...] ) ;
parallel middle (
TeamVectorRange ( teamMember , s o m e S i z e ) ,
[=] ( const i n t indexTeamVector [ , . . . ] ) {
/∗ n e s t e d body ∗/
}[ , . . . ] ) ;
/∗ end o f o u t e r body ∗/
} [ , ...] ) ;
Online April 21-24, 2020 140/192
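A sketch of the <y|Ax> kernel with all three levels, in the spirit of Exercise #6 (it assumes a
rank-3 view A(E,N,M), rank-2 views x(E,M) and y(E,N), sizes E, N, M, a double result, and an
illustrative vector length of 8):

Kokkos::parallel_reduce("yAx_3L",
  Kokkos::TeamPolicy<>(E, Kokkos::AUTO, 8),
  KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& team, double& update) {
    const int e = team.league_rank();
    double thisElementsSum = 0;
    Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, N),
      [=] (const int row, double& rowUpdate) {
        double thisRowsSum = 0;
        Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team, M),
          [=] (const int col, double& colUpdate) {
            colUpdate += A(e, row, col) * x(e, col);
          }, thisRowsSum);
        Kokkos::single(Kokkos::PerThread(team), [&] () {
          rowUpdate += y(e, row) * thisRowsSum;
        });
      }, thisElementsSum);
    Kokkos::single(Kokkos::PerTeam(team), [&] () {
      update += thisElementsSum;
    });
  }, result);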
Sum sanity checks (0)

Question: What will the value of totalSum be?


int totalSum = 0;
parallel_reduce("Sum", RangePolicy<>(0, numberOfThreads),
  KOKKOS_LAMBDA (size_t& index, int& partialSum) {
    int thisThreadsSum = 0;
    for (int i = 0; i < 10; ++i) {
      ++thisThreadsSum;
    }
    partialSum += thisThreadsSum;
  }, totalSum);

totalSum = numberOfThreads * 10

Online April 21-24, 2020 141/192


Sum sanity checks (1)

Question: What will the value of totalSum be?


int totalSum = 0;
parallel_reduce("Sum", TeamPolicy<>(numberOfTeams, team_size),
  KOKKOS_LAMBDA (member_type& teamMember, int& partialSum) {
    int thisThreadsSum = 0;
    for (int i = 0; i < 10; ++i) {
      ++thisThreadsSum;
    }
    partialSum += thisThreadsSum;
  }, totalSum);

totalSum = numberOfTeams * team_size * 10

Online April 21-24, 2020 142/192


Sum sanity checks (2)

Question: What will the value of totalSum be?


int totalSum = 0;
parallel_reduce("Sum", TeamPolicy<>(numberOfTeams, team_size),
  KOKKOS_LAMBDA (member_type& teamMember, int& partialSum) {
    int thisTeamsSum = 0;
    parallel_reduce(TeamThreadRange(teamMember, team_size),
      [=] (const int index, int& thisTeamsPartialSum) {
        int thisThreadsSum = 0;
        for (int i = 0; i < 10; ++i) {
          ++thisThreadsSum;
        }
        thisTeamsPartialSum += thisThreadsSum;
      }, thisTeamsSum);
    partialSum += thisTeamsSum;
  }, totalSum);

totalSum = numberOfTeams * team_size * team_size * 10

Online April 21-24, 2020 143/192


Restricting Execution: single pattern

The single pattern can be used to restrict execution


I Like parallel patterns it takes a policy, a lambda, and
optionally a broadcast argument.
I Two policies: PerTeam and PerThread.
I Equivalent to OpenMP single directive with nowait
// Restrict to once per thread
single(PerThread(teamMember), [&] () {
  // code
});

// Restrict to once per team with broadcast

int broadcastedValue = 0;
single(PerTeam(teamMember), [&] (int& broadcastedValue_local) {
  broadcastedValue_local = special value assigned by one;
}, broadcastedValue);
// Now everyone has the special value

Online April 21-24, 2020 144/192


Exercise #6: Three-Level Parallelism

The previous example was extended with an outer loop over


“Elements” to expose a third natural layer of parallelism.

Details:
I Location: Intro-Full/Exercises/06/
I Use the single policy instead of checking team rank
I Parallelize all three loop levels.
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 5 for very non-square matrices
I Compare behavior of CPU vs GPU

Online April 21-24, 2020 145/192


Exercise #6: Three-Level Parallelism

[Figure: <y|Ax> Exercise 06 (Three-Level Parallelism), bandwidth (GB/s) vs. number of
rows (N) at fixed size, comparing the two-level ("2L ... Begin") and three-level ("3L")
versions on KNL, HSW, and Pascal60.]
Online April 21-24, 2020 146/192
Section Summary

I Hierarchical work can be parallelized via hierarchical


parallelism.
I Hierarchical parallelism is leveraged using thread teams
launched with a TeamPolicy.
I Team “worksets” are processed by a team in nested
parallel_for (or reduce or scan) calls with a
TeamThreadRange and ThreadVectorRange policy.
I Execution can be restricted to a subset of the team with the
single pattern using either a PerTeam or PerThread policy.
I Teams can be used to reduce contention for global resources
even in “flat” algorithms.

Online April 21-24, 2020 147/192


Scratch memory
Learning objectives:
I Understand concept of team and thread private scratch
pads
I Understand how scratch memory can reduce global memory
accesses
I Recognize when to use scratch memory
I Understand how to use scratch memory and when barriers
are necessary

Online April 21-24, 2020 148/192


Types of Scratch Space Uses
Two Levels of Scratch Space
I Level 0 is limited in size but fast.
I Level 1 allows larger allocations but is equivalent to High
Bandwidth Memory in latency and bandwidth.
Team or Thread private memory
I Typically used for per work-item temporary storage.
I Advantage over pre-allocated memory is aggregate size scales
with number of threads, not number of work-items.
Manually Managed Cache
I Explicitly cache frequently used data.
I Exposes hardware specific on-core scratch space (e.g. NVIDIA
GPU Shared Memory).
Now: Discuss the Manually Managed Cache use case.

Online April 21-24, 2020 149/192


Example: contractDataFieldScalar (1)

One slice of contractDataFieldScalar:

for ( qp = 0; qp < numberOfQPs ; ++ qp ) {


total = 0;
for ( i = 0; i < vectorSize ; ++ i ) {
total += A ( qp , i ) * B ( i );
}
result ( qp ) = total ;
}

Online April 21-24, 2020 150/192


Example: contractDataFieldScalar (2)
contractDataFieldScalar:

for ( element = 0; element < n u m b e r O f E l e m e n t s ; ++ element ) {


for ( qp = 0; qp < numberOfQPs ; ++ qp ) {
total = 0;
for ( i = 0; i < vectorSize ; ++ i ) {
total += A ( element , qp , i ) * B ( element , i );
}
result ( element , qp ) = total ;
}
}

Online April 21-24, 2020 151/192


Example: contractDataFieldScalar (3)

Parallelization approaches:
I Each thread handles an element.
Threads: numberOfElements
I Each thread handles a qp.
Threads: numberOfElements * numberOfQPs
I Each thread handles an i.
Threads: numElements * numQPs * vectorSize
Requires a parallel reduce.

Online April 21-24, 2020 152/192


Example: contractDataFieldScalar (4)

Flat kernel: Each thread handles a quadrature point


operator() (int index) {
  int element = extractElementFromIndex(index);
  int qp = extractQPFromIndex(index);
  double total = 0;
  for (int i = 0; i < vectorSize; ++i) {
    total += A(element, qp, i) * B(element, i);
  }
  result(element, qp) = total;
}
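
The two extract helpers above stand in for the usual flattening of an (element, qp) pair into a single range index; presumably (an assumption, they are not shown on the slide) they reduce to integer division and modulo:

// Assumed mapping for a flat RangePolicy of length
// numberOfElements * numberOfQPs.
KOKKOS_INLINE_FUNCTION
int extractElementFromIndex(const int index) const { return index / numberOfQPs; }

KOKKOS_INLINE_FUNCTION
int extractQPFromIndex(const int index) const { return index % numberOfQPs; }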

Online April 21-24, 2020 153/192


Example: contractDataFieldScalar (6)

Teams kernel: Each team handles an element


operator() (member_type teamMember) {
  int element = teamMember.league_rank();
  parallel_for (
    TeamThreadRange(teamMember, numberOfQPs),
    [=] (int qp) {
      double total = 0;
      for (int i = 0; i < vectorSize; ++i) {
        total += A(element, qp, i) * B(element, i);
      }
      result(element, qp) = total;
    });
}

No real advantage (yet)
Online April 21-24, 2020 154/192
Scratch memory (0)
Each team has access to a “scratch pad”.

Online April 21-24, 2020 155/192


Scratch memory (1)

Scratch memory (scratch pad) as manual cache:


I Accessing data in (level 0) scratch memory is (usually) much
faster than global memory.
I GPUs have separate, dedicated, small, low-latency scratch
memories (NOT subject to coalescing requirements).
I CPUs don’t have special hardware, but programming with
scratch memory results in cache-aware memory access
patterns.
I Roughly, it’s like a user-managed L1 cache.

Important concept
When members of a team read the same data multiple times, it’s
better to load the data into scratch memory and read from there.

Online April 21-24, 2020 156/192


Scratch memory (2)

Scratch memory for temporary per work-item storage:


I Scenario: Algorithm requires temporary workspace of size W.
I Without scratch memory: pre-allocate space for N
work-items of size N x W.
I With scratch memory: Kokkos pre-allocates space for each
Team or Thread of size T x W.
I PerThread and PerTeam scratch can be used concurrently.
I Level 0 and Level 1 scratch memory can be used concurrently.

Important concept
If an algorithm requires temporary workspace for each work-item,
then use Kokkos’ scratch memory.
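
A minimal sketch of that use case, assuming each thread needs a workspace of vectorSize doubles (the names mirror the following slide; everything else is a placeholder):

using ScratchPadView =
    Kokkos::View<double*, ExecutionSpace::scratch_memory_space,
                 Kokkos::MemoryUnmanaged>;
size_t bytes = ScratchPadView::shmem_size(vectorSize);

Kokkos::parallel_for(
  Kokkos::TeamPolicy<ExecutionSpace>(numberOfTeams, teamSize)
      .set_scratch_size(0, Kokkos::PerThread(bytes)),
  KOKKOS_LAMBDA(const member_type& teamMember) {
    // Every thread gets its own level-0 workspace of vectorSize doubles.
    ScratchPadView workspace(teamMember.thread_scratch(0), vectorSize);
    // ... use workspace as per work-item temporary storage ...
  });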

Online April 21-24, 2020 157/192


Scratch memory (3)
To use scratch memory, you need to:
1. Tell Kokkos how much scratch memory you’ll need.
2. Make scratch memory views inside your kernels.
TeamPolicy<ExecutionSpace> policy(numberOfTeams, teamSize);

// Define a scratch memory view type
typedef View<double*, ExecutionSpace::scratch_memory_space,
             MemoryUnmanaged> ScratchPadView;
// Compute how much scratch memory (in bytes) is needed
size_t bytes = ScratchPadView::shmem_size(vectorSize);

// Tell the policy how much scratch memory is needed
int level = 0;
parallel_for (policy.set_scratch_size(level, PerTeam(bytes)),
  KOKKOS_LAMBDA (const member_type& teamMember) {

  // Create a view from the pre-existing scratch memory
  ScratchPadView scratch(teamMember.team_scratch(level),
                         vectorSize);
});
Online April 21-24, 2020 158/192
Example: contractDataFieldScalar (7)

Kernel outline for teams with scratch memory:


operator() (member_type teamMember) {
  ScratchPadView scratch(teamMember.team_scratch(0),
                         vectorSize);

  // TODO: load slice of B into scratch

  parallel_for (
    TeamThreadRange(teamMember, numberOfQPs),
    [=] (int qp) {
      double total = 0;
      for (int i = 0; i < vectorSize; ++i) {
        total += A(element, qp, i) * scratch(i);
      }
      result(element, qp) = total;
    });
}

Online April 21-24, 2020 159/192


Example: contractDataFieldScalar (8)
How to populate the scratch memory?
I One thread loads it all? Serial
if (teamMember.team_rank() == 0) {
  for (int i = 0; i < vectorSize; ++i) {
    scratch(i) = B(element, i);
  }
}

I Each thread loads one entry? teamSize != vectorSize

scratch(team_rank) = B(element, team_rank);
I TeamThreadRange or ThreadVectorRange
parallel_for (
  ThreadVectorRange(teamMember, vectorSize),
  [=] (int i) {
    scratch(i) = B(element, i);
  });

Online April 21-24, 2020 160/192


Example: contractDataFieldScalar (9)
(incomplete) Kernel for teams with scratch memory:
operator() (member_type teamMember) {
  ScratchPadView scratch(...);

  parallel_for (ThreadVectorRange(teamMember, vectorSize),
    [=] (int i) {
      scratch(i) = B(element, i);
    });
  // TODO: fix a problem at this location

  parallel_for (TeamThreadRange(teamMember, numberOfQPs),
    [=] (int qp) {
      double total = 0;
      for (int i = 0; i < vectorSize; ++i) {
        total += A(element, qp, i) * scratch(i);
      }
      result(element, qp) = total;
    });
}
Problem: threads may start to use scratch before all threads are
done loading.
Online April 21-24, 2020 161/192
Example: contractDataFieldScalar (10)
Kernel for teams with scratch memory:
operator() (member_type teamMember) {
  ScratchPadView scratch(...);

  parallel_for (ThreadVectorRange(teamMember, vectorSize),
    [=] (int i) {
      scratch(i) = B(element, i);
    });
  teamMember.team_barrier();

  parallel_for (TeamThreadRange(teamMember, numberOfQPs),
    [=] (int qp) {
      double total = 0;
      for (int i = 0; i < vectorSize; ++i) {
        total += A(element, qp, i) * scratch(i);
      }
      result(element, qp) = total;
    });
}

Online April 21-24, 2020 162/192


Exercise #7: Scratch Memory

Use Scratch Memory to explicitly cache the x-vector for each


element.

Details:
I Location: Intro-Full/Exercises/07/
I Create a scratch view
I Fill the scratch view in parallel using a TeamThreadRange or
ThreadVectorRange
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 6
I Compare behavior of CPU vs GPU

Online April 21-24, 2020 163/192


Exercise #7: Scratch Memory

[Figure: Exercise 07 (Scratch Memory), fixed problem size. Bandwidth (GB/s)
versus number of rows (N), comparing the Exercise 06 and Exercise 07 kernels
on KNL (Xeon Phi 68c), HSW (dual Xeon Haswell 2x16c), and Pascal60 (NVIDIA GPU).]
Online April 21-24, 2020 164/192
Scratch Memory: API Details

Allocating scratch in different levels:


int level = 1; // valid values 0, 1
policy.set_scratch_size(level, PerTeam(bytes));

Using PerThread, PerTeam or both:

policy.set_scratch_size(level, PerTeam(bytes));
policy.set_scratch_size(level, PerThread(bytes));
policy.set_scratch_size(level, PerTeam(bytes1),
                        PerThread(bytes2));

Using both levels of scratch:

policy.set_scratch_size(0, PerTeam(bytes0))
      .set_scratch_size(1, PerThread(bytes1));

Note: set_scratch_size() returns a new policy instance; it
does not modify the existing one.
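
A hedged sketch combining both ideas (the byte counts, element counts, and view type are placeholders):

// Request fast level-0 team scratch and larger level-1 per-thread scratch.
auto policy = Kokkos::TeamPolicy<>(numberOfTeams, teamSize)
                  .set_scratch_size(0, Kokkos::PerTeam(bytes_L0))
                  .set_scratch_size(1, Kokkos::PerThread(bytes_L1));

Kokkos::parallel_for(policy, KOKKOS_LAMBDA(const member_type& teamMember) {
  ScratchPadView fast (teamMember.team_scratch(0),   fastCount);
  ScratchPadView large(teamMember.thread_scratch(1), largeCount);
  // ... use fast as the team-shared cache, large as per-thread workspace ...
});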

Online April 21-24, 2020 165/192


Section Summary

I Scratch Memory can be used with the TeamPolicy to
provide thread- or team-private memory.
I Use case: per work-item temporary storage or manual caching.
I Scratch memory exposes on-chip user managed caches (e.g.
on NVIDIA GPUs)
I The size must be determined before launching a kernel.
I Two levels are available: large/slow and small/fast.

Online April 21-24, 2020 166/192


Task parallelism
Fine-grained dependent execution.

Learning objectives:
I Basic interface for fine-grained tasking in Kokkos
I How to express dynamic dependency structures in Kokkos
tasking
I When to use Kokkos tasking

Online April 21-24, 2020 167/192


Task Parallelism Looks Like Data Parallelism

Recall that data parallel code is composed of a pattern, a policy,


and a functor
Kokkos :: parallel_for (
Kokkos :: RangePolicy < >( exec_space , 0 , N ) ,
SomeFunctor ()
);

Task parallel code similarly has a pattern, a policy, and a functor


Kokkos :: task_spawn (
Kokkos :: TaskSingle ( scheduler , TaskPriority :: High ) ,
SomeFunctor ()
);

Online April 21-24, 2020 168/192


What does a task functor look like?

struct MyTask {
using value_type = double ;
template < class TeamMember >
KOKKOS_INLINE_FUNCTION
void operator ()( TeamMember & member , double & result );
};

I Tell Kokkos what the value type of your task’s output is.
I Take a team member argument, analogous to the team
member passed in by Kokkos::TeamPolicy in hierarchical
parallelism
I The output is expressed by assigning to a parameter, similar
to Kokkos::parallel_reduce

Online April 21-24, 2020 169/192


What policies does Kokkos tasking provide?

I Kokkos::TaskSingle()
I Run the task with a single worker thread
I Kokkos::TaskTeam()
I Run the task with all of the threads in a team
I Think of it like being inside of a parallel_for with a
TeamPolicy
I Both policies take a scheduler, an optional predecessor, and an
optional priority (more on schedulers and predecessors later)
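
A small sketch of both policies in use (the functors are placeholders; the predecessor form mirrors the TaskSingle(scheduler, future) usage shown on the aggregate-predecessors slide later, so treat it as illustrative):

// Spawn a single-thread task, then a team task that waits on it.
auto f_single = Kokkos::task_spawn(
    Kokkos::TaskSingle(scheduler), SomeSingleFunctor());

// Every thread of a team enters SomeTeamFunctor's operator(),
// analogous to a TeamPolicy kernel body.
auto f_team = Kokkos::task_spawn(
    Kokkos::TaskTeam(scheduler, f_single), SomeTeamFunctor());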

Online April 21-24, 2020 170/192


What patterns does Kokkos tasking provide?

I Kokkos::task_spawn()
I Kokkos::host_spawn() (same thing, but from host code)
I Kokkos::respawn()
I Argument order is backwards; policy comes second!
I First argument is ‘this‘ always (not ‘*this‘)
I task_spawn() and host_spawn() return a Kokkos::Future
representing the completion of the task (see next slide), which
can be used as a predecessor to another operation.

Online April 21-24, 2020 171/192


How do futures and dependencies work?
struct MyTask {
  using value_type = double;
  Kokkos::Future<double, Kokkos::DefaultExecutionSpace> dep;
  int depth;
  KOKKOS_INLINE_FUNCTION MyTask(int d) : depth(d) {}
  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member, double& result) {
    if (depth == 1) result = 3.14;
    else if (dep.is_null()) {
      dep =
        Kokkos::task_spawn(
          Kokkos::TaskSingle(member.scheduler()),
          MyTask(depth - 1)
        );
      Kokkos::respawn(this, dep);
    }
    else {
      result = depth * dep.get();
    }
  }
};

Online April 21-24, 2020 172/192


The Scheduler Abstraction

template <class Scheduler>
struct MyTask {
  using value_type = double;
  Kokkos::BasicFuture<double, Scheduler> dep;
  int depth;
  KOKKOS_INLINE_FUNCTION MyTask(int d) : depth(d) {}
  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member, double& result);
};

Available Schedulers:
I TaskScheduler<ExecSpace>
I TaskSchedulerMultiple<ExecSpace>
I ChaseLevTaskScheduler<ExecSpace>

Online April 21-24, 2020 173/192


Spawning from the host

using execution_space  = Kokkos::DefaultExecutionSpace;
using scheduler_type   = Kokkos::TaskScheduler<execution_space>;
using memory_space     = scheduler_type::memory_space;
using memory_pool_type = scheduler_type::memory_pool;
size_t memory_pool_size = 1 << 22;

auto scheduler =
  scheduler_type(memory_pool_type(memory_pool_size));

Kokkos::BasicFuture<double, scheduler_type> result =
  Kokkos::host_spawn(
    Kokkos::TaskSingle(scheduler),
    MyTask<scheduler_type>(10)
  );
Kokkos::wait(scheduler);
printf("Result is %f", result.get());

Online April 21-24, 2020 174/192


Things to Keep in Mind

I Tasks always run to completion


I There is no way to wait or block inside of a task
I future.get() does not block!
I Tasks that do not respawn themselves are complete
I The value in the result parameter is made available through
future.get() to any dependent tasks.
I The second argument to respawn can only be either a
predecessor (future) or a scheduler, not a proper execution
policy
I We are fixing this to provide a more consistent overload in the
next release.
I Tasks can only have one predecessor (at a time)
I Use scheduler.when_all() to aggregate predecessors (see
next slide)

Online April 21-24, 2020 175/192


Aggregate Predecessors

using void_future =
Kokkos :: BasicFuture < void , scheduler_type >;
auto f1 =
Kokkos :: task_spawn ( Kokkos :: TaskSingle ( scheduler ) , X {});
auto f2 =
Kokkos :: task_spawn ( Kokkos :: TaskSingle ( scheduler ) , Y {});
void_future f_array [] = { f1 , f2 };
void_future f_12 = scheduler . when_all ( f_array , 2);
auto f3 =
Kokkos :: task_spawn (
Kokkos :: TaskSingle ( scheduler , f_12 ) , FuncXY {}
);

I To create an aggregate Future, use scheduler.when_all()
I scheduler.when_all() always returns a void future.
I (Also, any future is implicitly convertible to a void future of
the same Scheduler type)

Online April 21-24, 2020 176/192


Exercise #8: Fibonacci

Formula:
F_N = F_{N-1} + F_{N-2}, with F_0 = 0 and F_1 = 1

Serial algorithm:
int fib (int n) {
  if (n < 2) return n;
  else {
    return fib(n-1) + fib(n-2);
  }
}

Details:
I Location: Intro-Full/Exercises/08
I Implement the FibonacciTask task functor recursively
I Spawn the root task from the host and wait for the scheduler
to make it ready
Hints:
I Do the F_{N-1} and F_{N-2} subproblems in separate tasks
I Use a scheduler.when_all() to wait on the subproblems
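One possible shape of the solution, hedged (the value_type, member names, and exact spawning strategy are assumptions; the exercise skeleton may differ):

template <class Scheduler>
struct FibonacciTask {
  using value_type  = long;
  using future_type = Kokkos::BasicFuture<long, Scheduler>;
  int n;
  future_type f_1, f_2;  // F(n-1) and F(n-2) subproblems

  KOKKOS_INLINE_FUNCTION FibonacciTask(int n_) : n(n_) {}

  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member, long& result) {
    if (n < 2) { result = n; }
    else if (f_1.is_null()) {
      // First pass: spawn both subproblems, then respawn behind them.
      f_1 = Kokkos::task_spawn(Kokkos::TaskSingle(member.scheduler()),
                               FibonacciTask(n - 1));
      f_2 = Kokkos::task_spawn(Kokkos::TaskSingle(member.scheduler()),
                               FibonacciTask(n - 2));
      Kokkos::BasicFuture<void, Scheduler> deps[] = {f_1, f_2};
      auto all = member.scheduler().when_all(deps, 2);
      Kokkos::respawn(this, all);
    } else {
      // Second pass: both subproblems are complete.
      result = f_1.get() + f_2.get();
    }
  }
};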
Online April 21-24, 2020 177/192
SIMD
Portable vector intrinsic types.

Learning objectives:
I How to use SIMD types to improve vectorization.
I SIMD Types as an alternative to ThreadVector loops.
I SIMD Types to achieve outer loop vectorization.

Online April 21-24, 2020 178/192


Vectorization In Kokkos

So far there were two options for achieving vectorization:


I Hope For The Best: Kokkos semantics make loops
inherently vectorizable; sometimes the compiler even figures it
out.
I Hierarchical Parallelism: TeamVectorRange and
ThreadVectorRange help the compiler with hints such as
#pragma ivdep or #pragma omp simd.

These strategies do run into limits though:


I Compilers often do not vectorize loops on their own.
I An optimal vectorization strategy would require outer-loop
vectorization.
I Vectorization with TeamVectorRange sometimes requires
artificially introducing an additional loop level.

Online April 21-24, 2020 179/192


Outer-Loop Vectorization
A simple scenario calling for outer-loop vectorization:
for ( int i =0; i < N ; i ++) {
// expect K to be small odd 1 ,3 ,5 ,7 for physics reasons
for ( int k =0; k < K ; k ++) b ( i ) += a (i , k );
}
Vectorizing the K-loop is not profitable:
I It is a short reduction.
I Remainders will eat up much time.

Using ThreadVectorRange is cumbersome and requires splitting
the N-loop:
parallel_for ( " VectorLoop " , TeamPolicy < >(0 , N /V , V ) ,
KOKKOS_LAMBDA ( const team_t & team ) {
int i = team . league_rank () * V ;
for ( int k =0; k < K ; k ++)
parallel_for ( T h r e a d V e c t o r R a n g e ( team , V ) , [&]( int ii ) {
b ( i + ii ) += a ( i + ii , k );
});
});
Online April 21-24, 2020 180/192
SIMD Types

To help with this situation, and (particularly in the past) to work
around the lack of auto-vectorizing compilers, SIMD types have
been invented. They:
I Are short vectors of scalars.
I Have operators such as += so one can use them like scalars.
I Are compile time sized.
I Usually map directly to hardware vector instructions.

Important concept: SIMD Type


A SIMD variable is a short vector which acts like a scalar.
Using such a simd type one can simply achieve outer-loop
vectorization by using arrays of simd and dividing the loop range
by its size.
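
A hedged sketch of the earlier K-loop example written this way (the concrete simd type anticipates the library introduced on the following slides; assume N is a multiple of the vector width):

using simd_t = simd::simd<double, simd::simd_abi::native>;
constexpr int V = simd_t::size();

// Views now hold packs: each entry of b_packed is V consecutive b-values.
Kokkos::View<simd_t*>  b_packed("B", N / V);
Kokkos::View<simd_t**> a_packed("A", N / V, K);

Kokkos::parallel_for("OuterVec", N / V, KOKKOS_LAMBDA(const int i) {
  for (int k = 0; k < K; ++k)
    b_packed(i) += a_packed(i, k);  // one vector operation per k
});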

Online April 21-24, 2020 181/192


C++23? SIMD

The ISO C++ standard has a Technical Specification for simd (in
parallelism v2):
template < class T , class Abi >
class simd {
public :
using value_type = T ;
using reference = /* impl defined */ ;
using abi_type = Abi ;
static constexpr size_t size ();
void copy_from ( T const * , aligned_tag );
void copy_to ( T * , aligned_tag ) const ;
T & operator [] ( size_t );
// Element wise operators
};

// Element Wise non - member operators

Online April 21-24, 2020 182/192


C++23? SIMD ABI

One interesting innovation here is the Abi parameter allowing for


different, hardware specific, implementations.
The most important in the proposal are:
I scalar: single element type.
I fixed size< N >: stores N elements.
I max fixed size< T >: stores maximum number of elements
for T.
I native: best fit for hardware.

But std::experimental::simd is not in the standard yet, and


doesn’t support GPUs ...
It also has other problems making it insufficient for our codes ...

Online April 21-24, 2020 183/192


Kokkos SIMD
Just at Sandia we had at least 5 different SIMD types in use.
A unification effort was started with the goal of:
I Match the proposed std::simd API as far as possible.
I Support GPUs.
I Can be used stand-alone or in conjunction with Kokkos.
I Replaces all current implementations at Sandia for SIMD.

We now have an implementation developed by Dan Ibanez, which


is close to meeting all of those criteria:
I For now available at
https://github.com/kokkos/simd-math.
I Considered Experimental, but supports X86, ARM, Power,
NVIDIA GPUs.
I Will be integrated into Kokkos in the next two months.
Online April 21-24, 2020 184/192
Exercise #9: Simple SIMD usage.
Details:
I Location: Intro-Full/Exercises/09/Begin/
I Include the simd.hpp header.
I Change the data type of the views to use
simd::simd<double, simd::simd_abi::native>.
I Create an unmanaged View<double*> of results using the
data() function for the final reduction.

# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./simd.cuda
Things to try:
I Vary problem size (-N ...; -K ...)
I Compare behavior of scalar vs vectorized on CPU and GPU
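
For the unmanaged-view step in the Details above, one hedged possibility (simd_t is the alias from the first bullet; the cast and extents are assumptions about the exercise skeleton):

// View 'results' holds simd packs; alias its storage as plain doubles.
Kokkos::View<double*, Kokkos::MemoryTraits<Kokkos::Unmanaged>>
  results_scalar(reinterpret_cast<double*>(results.data()),
                 results.extent(0) * simd_t::size());

double total = 0.0;
Kokkos::parallel_reduce("FinalSum", results_scalar.extent(0),
  KOKKOS_LAMBDA(const int i, double& lsum) { lsum += results_scalar(i); },
  total);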
Online April 21-24, 2020 185/192
The GPU SIMD Problem

The above exercise used a scalar simd type on the GPU.


Why wouldn’t we use a fixed_size ABI instead?
I Using a fixed_size ABI will create a scalar of size N in each
CUDA thread!
I Loading a fixed_size variable from memory would result in
uncoalesced access.
I If you have correct layouts you get outer-loop vectorization
implicitly on GPUs.
But what if you really want to use warp-level parallelization for
SIMD types?
We need two SIMD types: a storage type and a temporary type!

Online April 21-24, 2020 186/192


cuda_warp ABI
Important concept: simd::storage_type
Every simd<T,ABI> has an associated storage_type typedef.

To help with the GPU issue we split types between storage types
used for Views, and temporary variables.
I Most simd::simd types will just have the same storage type.
I simd<T,cuda_warp<N>> will use warp-level parallelism.
I simd<T,cuda_warp<N>>::storage_type is different though!
I Used in conjunction with TeamPolicy.
using simd_t = simd::simd<T, simd::simd_abi::cuda_warp<V>>;
using simd_storage_t = simd_t::storage_type;
View<simd_storage_t**> data("D", N, M); // will hold N * M * V Ts
parallel_for ("Loop", TeamPolicy<>(N, M, V),
  KOKKOS_LAMBDA (const team_t& team) {
    int i = team.league_rank();
    parallel_for (TeamThreadRange(team, M), [&](int j) {
      data(i, j) = 2.0 * simd_t(data(i, j));
    });
  });

Online April 21-24, 2020 187/192
Exercise #10: SIMD storage usage.
Details:
I Location: Intro-Full/Exercises/10/Begin/
I Include the simd.hpp header.
I Change the data type of the views to use
simd::simd<double, simd::simd_abi::cuda_warp<32>>::storage_type.
I Create an unmanaged View<double*> of results using the
data() function for the final reduction.
I Inside the lambda, use
simd::simd<double, simd::simd_abi::cuda_warp<32>> as the
scalar type.

# Compile for GPU
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./simd.cuda
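
A hedged sketch of the type changes the exercise asks for (the view shape, extent name, and kernel body are placeholders that mirror the cuda_warp slide, not the actual skeleton):

using simd_t      = simd::simd<double, simd::simd_abi::cuda_warp<32>>;
using storage_t   = simd_t::storage_type;
using member_type = Kokkos::TeamPolicy<>::member_type;

// Views store storage_t; each entry represents 32 doubles handled by a warp.
Kokkos::View<storage_t*> results("R", numberOfPacks);

Kokkos::parallel_for("Kernel",
  Kokkos::TeamPolicy<>(numberOfPacks, 1, 32),
  KOKKOS_LAMBDA(const member_type& team) {
    const int i = team.league_rank();
    simd_t tmp(results(i));   // temporary type does the warp-level work
    results(i) = 2.0 * tmp;   // store back through storage_type
  });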

Online April 21-24, 2020 188/192


Advanced SIMD Capabilities

Kokkos SIMD supports math operations:


I Common stuff like abs,sqrt,exp, ...

It also supports masking:


using simd_t = simd<double, simd_abi::native>;
using simd_mask_t = simd_t::mask_type;

simd_t threshold(100.0), a(a(i));

simd_mask_t is_smaller = threshold < a;
simd_t only_smaller = choose(is_smaller, a, threshold);

Online April 21-24, 2020 189/192


SIMD Summary

I SIMD types help vectorize code.


I In particular for outer-loop vectorization.
I There are storage and temporary types.
I Masking is supported too.
I Currently considered experimental at
https://github.com/Kokkos/simd-math: please try it out
and provide feedback.
I Will move into Kokkos proper likely in the next release.

Online April 21-24, 2020 190/192


Conclusion

Kokkos advanced capabilities NOT covered today


I Directed acyclic graph (DAG) of tasks pattern
I Dynamic graph of heterogeneous tasks (maximum flexibility)
I Static graph of homogeneous task (low overhead)
I Portable, thread scalable memory pool
I Plugging in customized multidimensional array data layout
e.g., arbitrarily strided, hierarchical tiling

Online April 21-24, 2020 191/192


Conclusion: Takeaways

I For portability: OpenMP, OpenACC, ... or Kokkos.


I Only Kokkos obtains performant memory access patterns via
architecture-aware arrays and work mapping.
i.e., not just portable, performance portable.
I With Kokkos, simple things stay simple (parallel-for, etc.).
i.e., it’s no more difficult than OpenMP.
I Advanced performance-optimizing patterns are simpler
with Kokkos than with native versions.
i.e., you’re not missing out on advanced features.
I full day tutorial only

Online April 21-24, 2020 192/192
