
Kokkos Tutorial

Jeff Miles, Christian Trott

Sandia National Laboratories

Online April 21-24, 2020

Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and
Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S.
Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
SAND2019-1055814
Prerequisites for Tutorial Exercises

Knowledge of C++: class constructors, member variables, member functions, member operators, template arguments

Using your own ${HOME}


I Git
I GCC 4.8.4 (or newer) OR Intel 15 (or newer) OR Clang 3.5.2 (or newer)
I CUDA nvcc 9.0 (or newer) AND NVIDIA compute capability 3.0 (or newer)
I git clone https://github.com/kokkos/kokkos
into ${HOME}/Kokkos/kokkos
I git clone https://github.com/kokkos/kokkos-tutorials
into ${HOME}/Kokkos/kokkos-tutorials
Slides are in
${HOME}/Kokkos/kokkos-tutorials/Intro-Full/Slides
Exercises are in
${HOME}/Kokkos/kokkos-tutorials/Intro-Full/Exercises
Exercises’ makefiles look for ${HOME}/Kokkos/kokkos

Online April 21-24, 2020 2/192


Prerequisites for Tutorial Exercises

Online Resources:
I https://github.com/kokkos: Primary Kokkos GitHub Organization
I https://github.com/kokkos/kokkos-tutorials/blob/master/Intro-Full/Slides/KokkosTutorial_ORNL20.pdf: These slides
I https://github.com/kokkos/kokkos/wiki: Wiki including API reference
I https://github.com/kokkos/kokkos-tutorials/issues/28: Instructions to get a cloud instance with a GPU
I https://kokkosteam.slack.com: Slack channel for Kokkos

Online April 21-24, 2020 3/192


Tutorial Objectives

Kokkos’ basic capabilities:


I Simple 1D data parallel computational patterns
I Deciding where code is run and where data is placed
I Managing data access patterns for performance portability
Kokkos’ advanced capabilities:
I Thread safety, thread scalability, and atomic operations
I Hierarchical patterns for maximizing parallelism
Kokkos’ advanced capabilities not covered today:
I Multidimensional data parallelism
I Dynamic directed acyclic graph of tasks pattern
I Numerous plugin points for extensibility

Online April 21-24, 2020 4/192


Tutorial Takeaways

I Kokkos enables Single Source Performance Portable


Codes
I Simple things stay simple - it is not much more complicated
than OpenMP
I Advanced performance optimizing capabilities easier to
use with Kokkos than e.g. CUDA
I Kokkos provides data abstractions critical for performance
portability not available in OpenMP or OpenACC
Controlling data access patterns is key for obtaining
performance

Online April 21-24, 2020 5/192


Operating assumptions (0)

Assume you are here because:


I Want to use all HPC node architectures; including GPUs
I Are familiar with C++
I Want GPU programming to be easier
I Would like portability, as long as it doesn’t hurt performance
Helpful for understanding nuances:
I Are familiar with data parallelism
I Are familiar with OpenMP
I Are familiar with GPU architecture and CUDA

Online April 21-24, 2020 6/192


Operating assumptions (1)

Target machine:

[Figure: schematic of a heterogeneous node: NUMA domains of cores with on-package memory, DRAM, and NVRAM, connected by a network-on-chip; an accelerator with its own on-package memory; and an external interconnect to the external network.]
Online April 21-24, 2020 7/192


Important Point: Performance Portability

Important Point
There’s a difference between portability and
performance portability.

Example: implementations may target particular architectures and


may not be thread scalable.
(e.g., locks on CPU won’t scale to 100,000 threads on GPU)

Online April 21-24, 2020 8/192


Important Point: Performance Portability

Important Point
There’s a difference between portability and
performance portability.

Example: implementations may target particular architectures and


may not be thread scalable.
(e.g., locks on CPU won’t scale to 100,000 threads on GPU)
Goal: write one implementation which:
I compiles and runs on multiple architectures,
I obtains performant memory access patterns across
architectures,
I can leverage architecture-specific features where possible.

Online April 21-24, 2020 8/192


Important Point: Performance Portability

Important Point
There’s a difference between portability and
performance portability.

Example: implementations may target particular architectures and


may not be thread scalable.
(e.g., locks on CPU won’t scale to 100,000 threads on GPU)
Goal: write one implementation which:
I compiles and runs on multiple architectures,
I obtains performant memory access patterns across
architectures,
I can leverage architecture-specific features where possible.

Kokkos: performance portability across manycore architectures.

Online April 21-24, 2020 8/192


Concepts for threaded data
parallelism
Learning objectives:
I Terminology of pattern, policy, and body.
I The data layout problem.

Online April 21-24, 2020 9/192


Concepts: Patterns, Policies, and Bodies

for ( element = 0; element < numElements ; ++ element ) {


total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

Online April 21-24, 2020 10/192


Concepts: Patterns, Policies, and Bodies

(Pattern: the for loop; Policy: the iteration range; Body: the loop body)

for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

Terminology:
I Pattern: structure of the computations
for, reduction, scan, task-graph, ...
I Execution Policy: how computations are executed
static scheduling, dynamic scheduling, thread teams, ...
I Computational Body: code which performs each unit of
work; e.g., the loop body
⇒ The pattern and policy drive the computational body.
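For concreteness, here is a minimal sketch (not from the slides) of how the same three concepts show up in a Kokkos dispatch, assuming the arrays have already been placed in Kokkos Views and using constructs (parallel_for, RangePolicy, KOKKOS_LAMBDA) that are introduced later in this tutorial:

// Pattern: parallel_for
// Policy:  RangePolicy<>(0, numElements)  -- the iteration range
// Body:    the lambda run for each element index
Kokkos::parallel_for("ElementDots",
  Kokkos::RangePolicy<>(0, numElements),
  KOKKOS_LAMBDA(const int64_t element) {
    double total = 0;
    for (int qp = 0; qp < numQPs; ++qp) {
      total += dot(left(element, qp), right(element, qp));
    }
    elementValues(element) = total;
  });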

Online April 21-24, 2020 10/192


Threading “Parallel for”

What if we want to thread the loop?

for ( element = 0; element < numElements ; ++ element ) {


total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

Online April 21-24, 2020 11/192


Threading “Parallel for”

What if we want to thread the loop?


# pragma omp parallel for
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

(Change the execution policy from “serial” to “parallel.”)

Online April 21-24, 2020 11/192


Threading “Parallel for”

What if we want to thread the loop?


# pragma omp parallel for
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
}
elementValues [ element ] = total ;
}

(Change the execution policy from “serial” to “parallel.”)

OpenMP is simple for parallelizing loops on multi-core CPUs,


but what if we then want to do this on other architectures?
Intel PHI and NVIDIA GPU and AMD GPU and ...

Online April 21-24, 2020 11/192


“Parallel for” on a GPU via pragmas
Option 1: OpenMP 4.5
# pragma omp target data map (...)
# pragma omp teams num_teams (...) num_threads (...) private (...)
# pragma omp distribute
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
# pragma omp parallel for
for ( qp = 0; qp < numQPs ; ++ qp )
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
elementValues [ element ] = total ;
}

Online April 21-24, 2020 12/192


“Parallel for” on a GPU via pragmas
Option 1: OpenMP 4.5
# pragma omp target data map (...)
# pragma omp teams num_teams (...) num_threads (...) private (...)
# pragma omp distribute
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
# pragma omp parallel for
for ( qp = 0; qp < numQPs ; ++ qp )
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
elementValues [ element ] = total ;
}

Option 2: OpenACC
# pragma acc parallel copy (...) num_gangs (...) vector_length (...)
# pragma acc loop gang vector
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp )
total += dot ( left [ element ][ qp ] , right [ element ][ qp ]);
elementValues [ element ] = total ;
}
Online April 21-24, 2020 12/192
Portable, but not performance portable

A standard thread parallel programming model


may give you portable parallel execution
if it is supported on the target architecture.

But what about performance?

Online April 21-24, 2020 13/192


Portable, but not performance portable

A standard thread parallel programming model


may give you portable parallel execution
if it is supported on the target architecture.

But what about performance?

Performance depends upon the computation’s


memory access pattern.

Online April 21-24, 2020 13/192


Problem: memory access pattern
# pragma something , opencl , etc .
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
for ( i = 0; i < vectorSize ; ++ i ) {
total +=
left [ element * numQPs * vectorSize +
qp * vectorSize + i ] *
right [ element * numQPs * vectorSize +
qp * vectorSize + i ];
}
}
elementValues [ element ] = total ;
}

Online April 21-24, 2020 14/192


Problem: memory access pattern
# pragma something , opencl , etc .
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
for ( i = 0; i < vectorSize ; ++ i ) {
total +=
left [ element * numQPs * vectorSize +
qp * vectorSize + i ] *
right [ element * numQPs * vectorSize +
qp * vectorSize + i ];
}
}
elementValues [ element ] = total ;
}
Memory access pattern problem: CPU data layout reduces GPU
performance by more than 10X.

Online April 21-24, 2020 14/192


Problem: memory access pattern
# pragma something , opencl , etc .
for ( element = 0; element < numElements ; ++ element ) {
total = 0;
for ( qp = 0; qp < numQPs ; ++ qp ) {
for ( i = 0; i < vectorSize ; ++ i ) {
total +=
left [ element * numQPs * vectorSize +
qp * vectorSize + i ] *
right [ element * numQPs * vectorSize +
qp * vectorSize + i ];
}
}
elementValues [ element ] = total ;
}
Memory access pattern problem: CPU data layout reduces GPU
performance by more than 10X.
Important Point
For performance the memory access pattern
must depend on the architecture.
Online April 21-24, 2020 14/192
Kokkos overview

How does Kokkos address performance portability?

Kokkos is a productive, portable, performant, shared-memory


programming model.
I is a C++ library, not a new language or language extension.
I supports clear, concise, thread-scalable parallel patterns.
I lets you write algorithms once and run on many architectures
e.g. multi-core CPU, GPUs, Xeon Phi, ...
I minimizes the amount of architecture-specific
implementation details users must know.
I solves the data layout problem by using multi-dimensional
arrays with architecture-dependent layouts

Online April 21-24, 2020 15/192


Data parallel patterns
Learning objectives:
I How computational bodies are passed to the Kokkos runtime.
I How work is mapped to cores.
I The difference between parallel_for and parallel_reduce.
I Start parallelizing a simple example.

Online April 21-24, 2020 16/192


Using Kokkos for data parallel patterns (0)

Data parallel patterns and work


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce (... data ...);
}

Kokkos maps work to cores

Online April 21-24, 2020 17/192


Using Kokkos for data parallel patterns (0)

Data parallel patterns and work


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce (... data ...);
}

Kokkos maps work to cores


I each iteration of a computational body is a unit of work.
I an iteration index identifies a particular unit of work.
I an iteration range identifies a total amount of work.

Online April 21-24, 2020 17/192


Using Kokkos for data parallel patterns (0)

Data parallel patterns and work


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce (... data ...);
}

Kokkos maps work to cores


I each iteration of a computational body is a unit of work.
I an iteration index identifies a particular unit of work.
I an iteration range identifies a total amount of work.

Important concept: Work mapping


You give an iteration range and computational body (kernel)
to Kokkos, Kokkos maps iteration indices to cores and then
runs the computational body on those cores.

Online April 21-24, 2020 17/192


Using Kokkos for data parallel patterns (2)

How are computational bodies given to Kokkos?

Online April 21-24, 2020 18/192


Using Kokkos for data parallel patterns (2)

How are computational bodies given to Kokkos?


As functors or function objects, a common pattern in C++.

Online April 21-24, 2020 18/192


Using Kokkos for data parallel patterns (2)

How are computational bodies given to Kokkos?


As functors or function objects, a common pattern in C++.

Quick review, a functor is a function with data. Example:


struct ParallelFunctor {
...
void operator ()( a work assignment ) const {
/* ... computational body ... */
}
...
};

Online April 21-24, 2020 18/192


Using Kokkos for data parallel patterns (3)

How is work assigned to functor operators?

Online April 21-24, 2020 19/192


Using Kokkos for data parallel patterns (3)

How is work assigned to functor operators?


A total amount of work items is given to a Kokkos pattern,
ParallelFunctor functor ;
Kokkos :: parallel_for ( numberOfIterations , functor );

Online April 21-24, 2020 19/192


Using Kokkos for data parallel patterns (3)

How is work assigned to functor operators?


A total amount of work items is given to a Kokkos pattern,
ParallelFunctor functor ;
Kokkos :: parallel_for ( numberOfIterations , functor );

and work items are assigned to functors one-by-one:


struct Functor {
void operator ()( const int64_t index ) const {...}
}

Online April 21-24, 2020 19/192


Using Kokkos for data parallel patterns (3)

How is work assigned to functor operators?


A total amount of work items is given to a Kokkos pattern,
ParallelFunctor functor ;
Kokkos :: parallel_for ( numberOfIterations , functor );

and work items are assigned to functors one-by-one:


struct Functor {
void operator ()( const int64_t index ) const {...}
}

Warning: concurrency and order


Concurrency and ordering of parallel iterations is not guaranteed
by the Kokkos runtime.

Online April 21-24, 2020 19/192


Using Kokkos for data parallel patterns (4)

How is data passed to computational bodies?


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( ... data ... );
}

struct AtomForceFunctor {
...
void operator ()( const int64_t atomIndex ) const {
atomForces [ atomIndex ] = calculateForce ( ... data ... );
}
...
}

Online April 21-24, 2020 20/192


Using Kokkos for data parallel patterns (4)

How is data passed to computational bodies?


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( ... data ... );
}

struct AtomForceFunctor {
...
void operator ()( const int64_t atomIndex ) const {
atomForces [ atomIndex ] = calculateForce ( ... data ... );
}
...
}

How does the body access the data?


Important concept
A parallel functor body must have access to all the data it needs
through the functor’s data members.

Online April 21-24, 2020 20/192


Using Kokkos for data parallel patterns (5)
Putting it all together: the complete functor:
struct AtomForceFunctor {
ForceType _atomForces ;
AtomDataType _atomData ;
AtomForceFunctor ( /* args */ ) {...}
void operator ()( const int64_t atomIndex ) const {
_atomForces [ atomIndex ] = calculateForce ( _atomData );
}
};

Online April 21-24, 2020 21/192


Using Kokkos for data parallel patterns (5)
Putting it all together: the complete functor:
struct AtomForceFunctor {
ForceType _atomForces ;
AtomDataType _atomData ;
AtomForceFunctor ( /* args */ ) {...}
void operator ()( const int64_t atomIndex ) const {
_atomForces [ atomIndex ] = calculateForce ( _atomData );
}
};

Q/ How would we reproduce serial execution with this functor?


Serial

for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ){


atomForces [ atomIndex ] = calculateForce ( data );
}

Online April 21-24, 2020 21/192


Using Kokkos for data parallel patterns (5)
Putting it all together: the complete functor:
struct AtomForceFunctor {
ForceType _atomForces ;
AtomDataType _atomData ;
AtomForceFunctor ( /* args */ ) {...}
void operator ()( const int64_t atomIndex ) const {
_atomForces [ atomIndex ] = calculateForce ( _atomData );
}
};

Q/ How would we reproduce serial execution with this functor?


Serial

for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ){


atomForces [ atomIndex ] = calculateForce ( data );
}
Functor

AtomForceFunctor functor ( atomForces , data );


for ( atomIndex = 0; atomIndex < numberOfAtoms ; ++ atomIndex ){
functor ( atomIndex );
}

Online April 21-24, 2020 21/192


Using Kokkos for data parallel patterns (6)

The complete picture (using functors):


1. Defining the functor (operator+data):
struct AtomForceFunctor {
ForceType _atomForces ;
AtomDataType _atomData ;

AtomForceFunctor ( ForceType atomForces , AtomDataType data ) :
_atomForces ( atomForces ) , _atomData ( data ) {}

void operator ()( const int64_t atomIndex ) const {
_atomForces [ atomIndex ] = calculateForce ( _atomData );
}
};

2. Executing in parallel with Kokkos pattern:

AtomForceFunctor functor ( atomForces , data );
Kokkos :: parallel_for ( numberOfAtoms , functor );

Online April 21-24, 2020 22/192


Using Kokkos for data parallel patterns (7)

Functors are tedious ⇒ C++11 Lambdas are concise


atomForces already exists
data already exists
Kokkos :: parallel_for ( numberOfAtoms ,
[=] ( const int64_t atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( data );
}
);

Online April 21-24, 2020 23/192


Using Kokkos for data parallel patterns (7)

Functors are tedious ⇒ C++11 Lambdas are concise


atomForces already exists
data already exists
Kokkos :: parallel_for ( numberOfAtoms ,
[=] ( const int64_t atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( data );
}
);

A lambda is not magic, it is the compiler auto-generating a


functor for you.

Online April 21-24, 2020 23/192


Using Kokkos for data parallel patterns (7)

Functors are tedious ⇒ C++11 Lambdas are concise


atomForces already exists
data already exists
Kokkos :: parallel_for ( numberOfAtoms ,
[=] ( const int64_t atomIndex ) {
atomForces [ atomIndex ] = calculateForce ( data );
}
);

A lambda is not magic, it is the compiler auto-generating a


functor for you.

Warning: Lambda capture and C++ containers


For portability to GPU a lambda must capture by value [=].
Don’t capture containers (e.g., std::vector) by value because it will
copy the container’s entire contents.

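A minimal sketch (not from the slides) of why this matters: a Kokkos View captured by value copies only a small handle (pointer plus metadata), while a std::vector captured by value copies every element, and its data would not be usable from device code anyway. The names below are illustrative.

Kokkos::View<double*> forces("forces", numberOfAtoms);  // cheap to capture: shallow copy
std::vector<double> hostData(numberOfAtoms);            // expensive to capture: deep copy, host-only

Kokkos::parallel_for(numberOfAtoms, [=] (const int64_t i) {
  forces(i) = 2.0 * i;      // fine: the captured View aliases the original allocation
  // hostData[i] = ...;     // avoid: [=] copies the whole vector into the closure
});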
Online April 21-24, 2020 23/192


parallel_for examples

How does this compare to OpenMP?


Serial

for ( int64_t i = 0; i < N ; ++ i ) {


/* loop body */
}
OpenMP

# pragma omp parallel for
for ( int64_t i = 0; i < N ; ++ i ) {
/* loop body */
}

Kokkos

parallel_for ( N , [=] ( const int64_t i ) {
/* loop body */
});

Important concept
Simple Kokkos usage is no more conceptually difficult than
OpenMP; the annotations just go in different places.
Online April 21-24, 2020 24/192
Scalar integration (0)

Riemann-sum-style numerical integration:


y = ∫_{lower}^{upper} function(x) dx

(Riemann sum figure: Wikipedia)

Online April 21-24, 2020 25/192


Scalar integration (0)

Riemann-sum-style numerical integration:


y = ∫_{lower}^{upper} function(x) dx

(Riemann sum figure: Wikipedia)

double totalIntegral = 0;
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
const double x =
lower + ( i / numberOfIntervals ) * ( upper - lower );
const double thisIntervalsContribution = function ( x );
totalIntegral += thisIntervalsContribution ;
}
totalIntegral *= dx ;

Online April 21-24, 2020 25/192


Scalar integration (0)

Riemann-sum-style numerical integration:


y = ∫_{lower}^{upper} function(x) dx

(Riemann sum figure: Wikipedia)

double totalIntegral = 0;
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
const double x =
lower + ( i / numberOfIntervals ) * ( upper - lower );
const double thisIntervalsContribution = function ( x );
totalIntegral += thisIntervalsContribution ;
}
totalIntegral *= dx ;

How do we parallelize it? Correctly?

Online April 21-24, 2020 25/192


Scalar integration (0)

Riemann-sum-style numerical integration:


y = ∫_{lower}^{upper} function(x) dx

(Riemann sum figure: Wikipedia)

Pattern? Policy? Body?

double totalIntegral = 0;
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
const double x =
lower + ( i / numberOfIntervals ) * ( upper - lower );
const double thisIntervalsContribution = function ( x );
totalIntegral += thisIntervalsContribution ;
}
totalIntegral *= dx ;

How do we parallelize it? Correctly?

Online April 21-24, 2020 25/192


Scalar integration (1)

An (incorrect) attempt:
double totalIntegral = 0;
Kokkos :: parallel_for ( numberOfIntervals ,
[=] ( const int64_t index ) {
const double x =
lower + ( index / numberOfIntervals ) * ( upper - lower );
totalIntegral += function ( x );
});
totalIntegral *= dx ;

First problem: compiler error; cannot increment totalIntegral


(lambdas capture by value and are treated as const!)

Online April 21-24, 2020 26/192


Scalar integration (2)
An (incorrect) solution to the (incorrect) attempt:
double totalIntegral = 0;
double * totalIntegralPointer = & totalIntegral ;
Kokkos :: parallel_for ( numberOfIntervals ,
[=] ( const int64_t index ) {
const double x =
lower + ( index / numberOfIntervals ) * ( upper - lower );
* totalIntegralPointer += function ( x );
});
totalIntegral *= dx ;

Online April 21-24, 2020 27/192


Scalar integration (2)
An (incorrect) solution to the (incorrect) attempt:
double totalIntegral = 0;
double * totalIntegralPointer = & totalIntegral ;
Kokkos :: parallel_for ( numberOfIntervals ,
[=] ( const int64_t index ) {
const double x =
lower + ( index / numberOfIntervals ) * ( upper - lower );
* totalIntegralPointer += function ( x );
});
totalIntegral *= dx ;

Second problem: race condition


step   thread 0    thread 1
0      load
1      increment   load
2      write       increment
3                  write
Online April 21-24, 2020 27/192
Scalar integration (3)

Root problem: we’re using the wrong pattern, for instead of


reduction

Online April 21-24, 2020 28/192


Scalar integration (3)

Root problem: we’re using the wrong pattern, for instead of


reduction

Important concept: Reduction


Reductions combine the results contributed by parallel work.

Online April 21-24, 2020 28/192


Scalar integration (3)

Root problem: we’re using the wrong pattern, for instead of


reduction

Important concept: Reduction


Reductions combine the results contributed by parallel work.

How would we do this with OpenMP?


double finalReducedValue = 0;
# pragma omp parallel for reduction ( +: finalReducedValue )
for ( int64_t i = 0; i < N ; ++ i ) {
finalReducedValue += ...
}

Online April 21-24, 2020 28/192


Scalar integration (3)

Root problem: we’re using the wrong pattern, for instead of


reduction

Important concept: Reduction


Reductions combine the results contributed by parallel work.

How would we do this with OpenMP?


double finalReducedValue = 0;
# pragma omp parallel for reduction ( +: finalReducedValue )
for ( int64_t i = 0; i < N ; ++ i ) {
finalReducedValue += ...
}

How will we do this with Kokkos?


double finalReducedValue = 0;
parallel_reduce ( N , functor , finalReducedValue );

Online April 21-24, 2020 28/192


Scalar integration (4)

Example: Scalar integration


OpenMP

double totalIntegral = 0;
# pragma omp parallel for reduction ( +: totalIntegral )
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
totalIntegral += function (...);
}

Kokkos

double totalIntegral = 0;
parallel_reduce ( numberOfIntervals ,
[=] ( const int64_t i , double & valueToUpdate ) {
valueToUpdate += function (...);
},
totalIntegral );

I The operator takes two arguments: a work index and a value


to update.
I The second argument is a thread-private value that is
managed by Kokkos; it is not the final reduced value.
Online April 21-24, 2020 29/192
Scalar integration (5)

Warning: Parallelism is NOT free


Dispatching (launching) parallel work has non-negligible cost.

Online April 21-24, 2020 30/192


Scalar integration (5)

Warning: Parallelism is NOT free


Dispatching (launching) parallel work has non-negligible cost.
Simplistic data-parallel performance model: Time = α + β∗N/P
I α = dispatch overhead
I β = time for a unit of work
I N = number of units of work
I P = available concurrency

Online April 21-24, 2020 30/192


Scalar integration (5)

Warning: Parallelism is NOT free


Dispatching (launching) parallel work has non-negligible cost.
Simplistic data-parallel performance model: Time = α + β∗N/P
I α = dispatch overhead
I β = time for a unit of work
I N = number of units of work
I P = available concurrency
 
Speedup = P ÷ (1 + α∗P/(β∗N))
I Should have α∗P ≪ β∗N
I All runtimes strive to minimize launch overhead α
I Find more parallelism to increase N
I Merge (fuse) parallel operations to increase β
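To make the model concrete, an illustrative plug-in of numbers (not from the slides): with dispatch overhead α = 10 µs, β = 100 ns per work item, and P = 1000 available threads, α∗P/(β∗N) = 1 when N = α∗P/β = 100,000, so at 10^5 work items the speedup is only P/2; it approaches P only once N is much larger than that.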
Online April 21-24, 2020 30/192
Scalar integration (6)
 
Results: illustrates the simple speedup model, Speedup = P ÷ (1 + α∗P/(β∗N))

[Figure: Kokkos speedup over serial for scalar integration (log scale): speedup over serial [-] vs. number of intervals [-], for Kokkos Cuda Pascal60, Kokkos OpenMP HSW, Kokkos OpenMP KNL, Native OpenMP KNL, and unity.]

Online April 21-24, 2020 31/192


Naming your kernels
Always name your kernels!
Giving unique names to each kernel is immensely helpful for
debugging and profiling. You will regret it if you don’t!

I Non-nested parallel patterns can take an optional string


argument.
I The label doesn’t need to be unique, but it is helpful.
I Anything convertible to ”const std::string”
I Used by profiling and debugging tools (see Profiling Tutorial)
Example:
double totalIntegral = 0;
p aral lel _r ed uc e ( " Reduction " , n u m b e r O f I n t e r v a l s ,
[=] ( const int64_t i , double & valueToUpdate ) {
valueToUpdate += function (...);
},
totalIntegral );

Online April 21-24, 2020 32/192


Recurring Exercise: Inner Product

Exercise: Inner product < y , A ∗ x >

Details:
I y is Nx1, A is NxM, x is Mx1
I We’ll use this exercise throughout the tutorial

Online April 21-24, 2020 33/192


Exercise #1: include, initialize, finalize Kokkos

The first step in using Kokkos is to include, initialize, and finalize:


# include < Kokkos_Core . hpp >
int main ( int argc , char ** argv ) {
/* ... do any necessary setup ( e . g . , initialize MPI ) ... */
Kokkos :: initialize ( argc , argv );
{
/* ... do computations ... */
}
Kokkos :: finalize ();
return 0;
}

(Optional) Command-line arguments:


--kokkos-threads=INT   total number of threads (or threads within NUMA region)
--kokkos-numa=INT      number of NUMA regions
--kokkos-device=INT    device (GPU) ID to use

Online April 21-24, 2020 34/192


Exercise #1: Inner Product, Flat Parallelism on the CPU

Exercise: Inner product < y , A ∗ x >

Details:
I Location: Intro-Full/Exercises/01/Begin/
I Look for comments labeled with “EXERCISE”
I Need to include, initialize, and finalize Kokkos library
I Parallelize loops with parallel_for or parallel_reduce
I Use lambdas instead of functors for computational bodies.
I For now, this will only use the CPU.
Online April 21-24, 2020 35/192
Exercise #1: logistics
Compiling for CPU
# gcc using OpenMP ( default ) and Serial back - ends ,
# ( optional ) change non - default arch with KOKKOS_ARCH
make -j KOKKOS_DEVICES=OpenMP,Serial KOKKOS_ARCH=...

Running on CPU with OpenMP back-end


# Set OpenMP affinity
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread OMP_PLACES=threads
# Print example command line options :
./01_Exercise.host -h
# Run with defaults on CPU
./01_Exercise.host
# Run larger problem
./01_Exercise.host -S 26

Things to try:
I Vary problem size with command-line arg -S s
I Vary number of rows with command-line arg -N n
I Num rows = 2^n, num cols = 2^m, total size = 2^s == 2^(n+m)
Online April 21-24, 2020 36/192
Exercise #1 results
[Figure: <y,Ax> Exercise 01, fixed size: bandwidth (GB/s) vs. number of rows (N) for HSW, KNL, and KNL (HBM).]
Online April 21-24, 2020 37/192
Basic capabilities we haven’t covered

I Customizing parallel_reduce data type and reduction operator
e.g., minimum, maximum, ...
I parallel_scan pattern for exclusive and inclusive prefix sum (sketched below)
I Using tag dispatch interface to allow non-trivial functors to
have multiple “operator()” functions.
very useful in large, complex applications

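As a taste of the parallel_scan pattern listed above, a minimal sketch (not an exercise in this tutorial; it uses Kokkos::View, introduced in the next section). The lambda receives the index, a running partial sum, and a flag that is true on the final pass, when results may be written:

Kokkos::View<int64_t*> data("data", N), prefix("prefix", N);
// Exclusive prefix sum: prefix(i) = data(0) + ... + data(i-1)
Kokkos::parallel_scan("PrefixSum", N,
  KOKKOS_LAMBDA(const int64_t i, int64_t& update, const bool final) {
    if (final) prefix(i) = update;   // write before accumulating -> exclusive scan
    update += data(i);
  });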
Online April 21-24, 2020 38/192


Section Summary

I Simple usage is similar to OpenMP, advanced features are


also straightforward
I Three common data-parallel patterns are parallel_for,
parallel_reduce, and parallel_scan.
I A parallel computation is characterized by its pattern, policy,
and body.
I User provides computational bodies as functors or lambdas
which handle a single work item.

Online April 21-24, 2020 39/192


Views
Learning objectives:
I Motivation behind the View abstraction.
I Key View concepts and template parameters.
I The View life cycle.

Online April 21-24, 2020 40/192


View motivation

Example: running daxpy on the GPU:


Lambda

double * x = new double [ N ]; // also y


parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {
y [ i ] = a * x [ i ] + y [ i ];
});

Functor

struct Functor {
double * _x , * _y , _a ;
void operator ()( const int64_t i ) {
_y [ i ] = _a * _x [ i ] + _y [ i ];
}
};

Online April 21-24, 2020 41/192


View motivation

Example: running daxpy on the GPU:


Lambda

double * x = new double [ N ]; // also y


parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {
y [ i ] = a * x [ i ] + y [ i ];
});

Functor

struct Functor {
double * _x , * _y , _a ;
void operator ()( const int64_t i ) {
_y [ i ] = _a * _x [ i ] + _y [ i ];
}
};

Problem: x and y reside in CPU memory.

Online April 21-24, 2020 41/192


View motivation

Example: running daxpy on the GPU:


Lambda

double * x = new double [ N ]; // also y


parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {
y [ i ] = a * x [ i ] + y [ i ];
});

Functor

struct Functor {
double * _x , * _y , _a ;
void operator ()( const int64_t i ) {
_y [ i ] = _a * _x [ i ] + _y [ i ];
}
};

Problem: x and y reside in CPU memory.


Solution: We need a way of storing data (multidimensional arrays)
which can be communicated to an accelerator (GPU).
⇒ Views
Online April 21-24, 2020 41/192
Views (0)

View abstraction
I A lightweight C++ class with a pointer to array data and a
little meta-data,
I that is templated on the data type (and other things).

High-level example of Views for daxpy using lambda:


View < double * , ... > x (...) , y (...);
... populate x , y ...

parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {


// Views x and y are captured by value ( copy )
y ( i ) = a * x ( i ) + y ( i );
});

Online April 21-24, 2020 42/192


Views (0)

View abstraction
I A lightweight C++ class with a pointer to array data and a
little meta-data,
I that is templated on the data type (and other things).

High-level example of Views for daxpy using lambda:


View < double * , ... > x (...) , y (...);
... populate x , y ...

parallel_for ( " DAXPY " ,N , [=] ( const int64_t i ) {


// Views x and y are captured by value ( copy )
y ( i ) = a * x ( i ) + y ( i );
});

Important point
Views are like pointers, so copy them in your functors.

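A sketch of what that looks like for the daxpy example from the View-motivation slides (illustrative, not from the slides): the functor stores the Views by value, which copies only their small handles, and both copies refer to the same array data. (KOKKOS_INLINE_FUNCTION is explained in the Execution Spaces section.)

struct DaxpyFunctor {
  Kokkos::View<double*> _x, _y;   // held by value: shallow, pointer-like
  double _a;
  DaxpyFunctor(Kokkos::View<double*> x, Kokkos::View<double*> y, double a)
    : _x(x), _y(y), _a(a) {}
  KOKKOS_INLINE_FUNCTION
  void operator()(const int64_t i) const { _y(i) = _a * _x(i) + _y(i); }
};

Kokkos::parallel_for("DAXPY", N, DaxpyFunctor(x, y, a));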
Online April 21-24, 2020 42/192


Views (1)
View overview:
I Multi-dimensional array of 0 or more dimensions
scalar (0), vector (1), matrix (2), etc.
I Number of dimensions (rank) is fixed at compile-time.
I Arrays are rectangular, not ragged.
I Sizes of dimensions set at compile-time or runtime.
e.g., 2x20, 50x50, etc.
I Access elements via ”(...)” operator.

Online April 21-24, 2020 43/192


Views (1)
View overview:
I Multi-dimensional array of 0 or more dimensions
scalar (0), vector (1), matrix (2), etc.
I Number of dimensions (rank) is fixed at compile-time.
I Arrays are rectangular, not ragged.
I Sizes of dimensions set at compile-time or runtime.
e.g., 2x20, 50x50, etc.
I Access elements via ”(...)” operator.
Example:
View < double *** > data ( " label " , N0 , N1 , N2 ); //3 run, 0 compile
View < double **[ N2 ] > data ( " label " , N0 , N1 ); //2 run, 1 compile
View < double *[ N1 ][ N2 ] > data ( " label " , N0 ); //1 run, 2 compile
View < double [ N0 ][ N1 ][ N2 ] > data ( " label " ); //0 run, 3 compile
// Access
data (i ,j , k ) = 5.3;
Note: runtime-sized dimensions must come first.
Online April 21-24, 2020 43/192
Views (2)

View life cycle:


I Allocations only happen when explicitly specified.
i.e., there are no hidden allocations.
I Copy construction and assignment are shallow (like pointers).
so, you pass Views by value, not by reference
I Reference counting is used for automatic deallocation.
I They behave like std::shared_ptr

Online April 21-24, 2020 44/192


Views (2)

View life cycle:


I Allocations only happen when explicitly specified.
i.e., there are no hidden allocations.
I Copy construction and assignment are shallow (like pointers).
so, you pass Views by value, not by reference
I Reference counting is used for automatic deallocation.
I They behave like std::shared_ptr
Example:
View < double *[5] > a ( " a " , N0 ) , b ( " b " , N0 );
a = b;
View < double ** > c ( b );
a (0 ,2) = 1;
b (0 ,2) = 2; What gets printed?
c (0 ,2) = 3;
print a (0 ,2)

Online April 21-24, 2020 44/192


Views (2)

View life cycle:


I Allocations only happen when explicitly specified.
i.e., there are no hidden allocations.
I Copy construction and assignment are shallow (like pointers).
so, you pass Views by value, not by reference
I Reference counting is used for automatic deallocation.
I They behave like std::shared_ptr
Example:
View < double *[5] > a ( " a " , N0 ) , b ( " b " , N0 );
a = b;
View < double ** > c ( b );
a (0 ,2) = 1;
b (0 ,2) = 2; What gets printed?
c (0 ,2) = 3;
print a (0 ,2) // prints 3.0: after the shallow assignment and copy construction, a, b, and c all reference the same allocation

Online April 21-24, 2020 44/192


Views (3)

View Properties:
I Accessing a View’s sizes is done via its extent(dim) function.
Static extents can additionally be accessed via
static extent(dim).
I You can retrieve a raw pointer via its data() function.
I The label can be accessed via label().
Example:
View < double *[5] > a ( " A " , N0 );
assert ( a . extent (0)== N0 );
assert ( a . extent (1)==5);
static_assert ( a . static_extent (1)==5);
assert ( a . data ()!= nullptr );
assert ( std :: string ( " A " . compare ( a . label ())==0);

Online April 21-24, 2020 45/192


Exercise #2: Inner Product, Flat Parallelism on the CPU, with Views

I Location: Intro-Full/Exercises/02/Begin/
I Assignment: Change data storage from arrays to Views.
I Compile and run on CPU, and then on GPU with UVM

make -j KOKKOS_DEVICES=OpenMP   # CPU - only using OpenMP
make -j KOKKOS_DEVICES=Cuda     # GPU - note UVM in Makefile
# Run exercise
./02_Exercise.host -S 26
./02_Exercise.cuda -S 26
# Note the warnings , set appropriate environment variables

I Vary problem size: -S #


I Vary number of rows: -N #
I Vary repeats: -nrepeat #
I Compare performance of CPU vs GPU

Online April 21-24, 2020 46/192


Advanced features we haven’t covered

I Memory space in which view’s data resides; covered next.


I deep copy view’s data; covered later.
Note: Kokkos never hides a deep copy of data.
I Layout of multidimensional array; covered later.
I Memory traits; covered later.
I Subview: Generating a view that is a “slice” of other
multidimensional array view; covered later.

Online April 21-24, 2020 47/192


Execution and Memory spaces

Execution and Memory Spaces


Learning objectives:
I Heterogeneous nodes and the space abstractions.
I How to control where parallel bodies are run, execution
space.
I How to control where view data resides, memory space.
I How to avoid illegal memory accesses and manage data
movement.
I The need for Kokkos::initialize and finalize.
I Where to use Kokkos annotation macros for portability.

Online April 21-24, 2020 48/192


Execution spaces (1)
Execution Space
a homogeneous set of cores and an execution mechanism
(i.e., “place to run code”)

[Figure: schematic of a heterogeneous node: NUMA domains of cores with on-package memory, DRAM, and NVRAM, connected by a network-on-chip; an accelerator with its own on-package memory; and an external interconnect to the external network.]

Execution spaces: Serial, Threads, OpenMP, Cuda, HIP, ...


Online April 21-24, 2020 49/192
Execution spaces (2)

Host

MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);

Parallel

Kokkos :: parallel_for ( " MyKernel " , numberOfSomethings ,
[=] ( const int64_t somethingIndex ) {
const double y = ...;
// do something interesting
}
);

Online April 21-24, 2020 50/192


Execution spaces (2)

Host

MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);

Parallel

Kokkos :: parallel_for ( " MyKernel " , numberOfSomethings ,
[=] ( const int64_t somethingIndex ) {
const double y = ...;
// do something interesting
}
);

I Where will Host code be run? CPU? GPU?


⇒ Always in the host process

Online April 21-24, 2020 50/192


Execution spaces (2)

Host

MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);

Parallel

Kokkos :: parallel_for ( " MyKernel " , numberOfSomethings ,
[=] ( const int64_t somethingIndex ) {
const double y = ...;
// do something interesting
}
);

I Where will Host code be run? CPU? GPU?


⇒ Always in the host process
I Where will Parallel code be run? CPU? GPU?
⇒ The default execution space

Online April 21-24, 2020 50/192


Execution spaces (2)

Host

MPI_Reduce (...);
FILE * file = fopen (...);
runANormalFunction (... data ...);

Parallel

Kokkos :: parallel_for ( " MyKernel " , numberOfSomethings ,
[=] ( const int64_t somethingIndex ) {
const double y = ...;
// do something interesting
}
);

I Where will Host code be run? CPU? GPU?


⇒ Always in the host process
I Where will Parallel code be run? CPU? GPU?
⇒ The default execution space
I How do I control where the Parallel body is executed?
Changing the default execution space (at compilation),
or specifying an execution space in the policy.

Online April 21-24, 2020 50/192


Execution spaces (3)
Changing the parallel execution space:
parallel_for ( " Label " ,
Custom

RangePolicy < E x e c u t i o n S p a c e >(0 , n u m b e r O f I n t e r v a l s ) ,


[=] ( const int64_t i ) {
/* ... body ... */
});

parallel_for ( " Label " ,


Default

n u m b e r O f I n t e r v a l s , // == RangePolicy < >(0 , n u m b e r O f I n t e r v a l s )


[=] ( const int64_t i ) {
/* ... body ... */
});

Online April 21-24, 2020 51/192


Execution spaces (3)
Changing the parallel execution space:
parallel_for ( " Label " ,
Custom

RangePolicy < E x e c u t i o n S p a c e >(0 , n u m b e r O f I n t e r v a l s ) ,


[=] ( const int64_t i ) {
/* ... body ... */
});

parallel_for ( " Label " ,


Default

n u m b e r O f I n t e r v a l s , // == RangePolicy < >(0 , n u m b e r O f I n t e r v a l s )


[=] ( const int64_t i ) {
/* ... body ... */
});

Requirements for enabling execution spaces:


I Kokkos must be compiled with the execution spaces enabled.
I Execution spaces must be initialized (and finalized).
I Functions must be marked with a macro for non-CPU spaces.
I Lambdas must be marked with a macro for non-CPU spaces.
Online April 21-24, 2020 51/192
Execution spaces (5)

Kokkos function and lambda portability annotation macros:


Function annotation with KOKKOS_INLINE_FUNCTION macro
struct ParallelFunctor {
KOKKOS_INLINE_FUNCTION
double helperFunction ( const int64_t s ) const {...}
KOKKOS_INLINE_FUNCTION
void operator ()( const int64_t index ) const {
helperFunction ( index );
}
}
// Where Kokkos defines :
#define KOKKOS_INLINE_FUNCTION inline                      /* #if CPU-only */
#define KOKKOS_INLINE_FUNCTION inline __device__ __host__  /* #if CPU+Cuda */

Online April 21-24, 2020 52/192


Execution spaces (5)

Kokkos function and lambda portability annotation macros:


Function annotation with KOKKOS_INLINE_FUNCTION macro
struct ParallelFunctor {
KOKKOS_INLINE_FUNCTION
double helperFunction ( const int64_t s ) const {...}
KOKKOS_INLINE_FUNCTION
void operator ()( const int64_t index ) const {
helperFunction ( index );
}
}
// Where Kokkos defines :
#define KOKKOS_INLINE_FUNCTION inline                      /* #if CPU-only */
#define KOKKOS_INLINE_FUNCTION inline __device__ __host__  /* #if CPU+Cuda */

Lambda annotation with KOKKOS_LAMBDA macro (requires CUDA 8.0)

Kokkos :: parallel_for ( " Label " , numberOfIterations ,
KOKKOS_LAMBDA ( const int64_t index ) {...});

// Where Kokkos defines :
#define KOKKOS_LAMBDA [=]              /* #if CPU-only */
#define KOKKOS_LAMBDA [=] __device__   /* #if CPU+Cuda */

Online April 21-24, 2020 52/192


Memory Space Motivation

Memory space motivating example: summing an array


View < double * > data ( " data " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
data ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < SomeExampleExecutionSpace >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += data ( index );
},
sum );

Online April 21-24, 2020 53/192


Memory Space Motivation

Memory space motivating example: summing an array


View < double * > data ( " data " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
data ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < SomeExampleExecutionSpace >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += data ( index );
},
sum );

Question: Where is the data stored? GPU memory? CPU


memory? Both?

Online April 21-24, 2020 53/192




Memory Space Motivation

Memory space motivating example: summing an array


View < double * > data ( " data " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
data ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < SomeExampleExecutionSpace >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += data ( index );
},
sum );

Question: Where is the data stored? GPU memory? CPU


memory? Both?

⇒ Memory Spaces
Online April 21-24, 2020 53/192
Memory spaces (0)

Memory space:
explicitly-manageable memory resource
(i.e., “place to put data”)

[Figure: schematic of a heterogeneous node: NUMA domains of cores with on-package memory, DRAM, and NVRAM, connected by a network-on-chip; an accelerator with its own on-package memory; and an external interconnect to the external network.]

Online April 21-24, 2020 54/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

Online April 21-24, 2020 55/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

I View<double***,Memory Space> data(...);

Online April 21-24, 2020 55/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

I View<double***,Memory Space> data(...);


I Available memory spaces:
HostSpace, CudaSpace, CudaUVMSpace, ... more

Online April 21-24, 2020 55/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

I View<double***,Memory Space> data(...);


I Available memory spaces:
HostSpace, CudaSpace, CudaUVMSpace, ... more
I Each execution space has a default memory space, which is
used if Space provided is actually an execution space

Online April 21-24, 2020 55/192


Memory spaces (1)

Important concept: Memory spaces


Every view stores its data in a memory space set at compile time.

I View<double***,Memory Space> data(...);


I Available memory spaces:
HostSpace, CudaSpace, CudaUVMSpace, ... more
I Each execution space has a default memory space, which is
used if Space provided is actually an execution space
I If no Space is provided, the view’s data resides in the default
memory space of the default execution space.

Online April 21-24, 2020 55/192


Memory spaces (2)
Example: HostSpace
View < double ** , HostSpace > hostView (... constructor arguments ...);

Online April 21-24, 2020 56/192


Memory spaces (2)
Example: HostSpace
View < double ** , HostSpace > hostView (... constructor arguments ...);

Example: CudaSpace
View < double ** , CudaSpace > view (... constructor arguments ...);

Online April 21-24, 2020 56/192


Execution and Memory spaces (0)

Anatomy of a kernel launch:

1. User declares views, allocating.
2. User instantiates a functor with views.
3. User launches parallel_something:
I Functor is copied to the device.
I Kernel is run.
I Copy of functor on the device is released.

# define KL KOKKOS_LAMBDA
View < int * , Cuda > dev (...);
parallel_for ( " Label " ,N ,
KL ( int i ) {
dev ( i ) = ...;
});

Note: no deep copies of array data are performed;


views are like pointers.

Online April 21-24, 2020 57/192


Execution and Memory spaces (1)

Example: one view

# define KL KOKKOS_LAMBDA
View < int * , Cuda > dev ;
parallel_for ( " Label " ,N ,
KL ( int i ) {
dev ( i ) = ...;
});

Online April 21-24, 2020 58/192


Execution and Memory spaces (2)

Example: two views

# define KL KOKKOS_LAMBDA
View < int * , Cuda > dev ;
View < int * , Host > host ;
parallel_for ( " Label " ,N ,
KL ( int i ) {
dev ( i ) = ...;
host ( i ) = ...;
});

Online April 21-24, 2020 59/192




Execution and Memory spaces (3)

Example (redux): summing an array with the GPU


(failed) Attempt 1: View lives in CudaSpace
View < double * , CudaSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );
},
sum );

Online April 21-24, 2020 60/192


Execution and Memory spaces (3)

Example (redux): summing an array with the GPU


(failed) Attempt 1: View lives in CudaSpace
View < double * , CudaSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...   // <- fault
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );
},
sum );

Online April 21-24, 2020 60/192


Execution and Memory spaces (4)

Example (redux): summing an array with the GPU


(failed) Attempt 2: View lives in HostSpace
View < double * , HostSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );
},
sum );

Online April 21-24, 2020 61/192


Execution and Memory spaces (4)

Example (redux): summing an array with the GPU


(failed) Attempt 2: View lives in HostSpace
View < double * , HostSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );   // <- illegal access
},
sum );

Online April 21-24, 2020 61/192


Execution and Memory spaces (4)

Example (redux): summing an array with the GPU


(failed) Attempt 2: View lives in HostSpace
View < double * , HostSpace > array ( " array " , size );
for ( int64_t i = 0; i < size ; ++ i ) {
array ( i ) = ... read from file ...
}

double sum = 0;
Kokkos :: p ar a ll el _r e du ce ( " Label " ,
RangePolicy < Cuda >(0 , size ) ,
KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
valueToUpdate += array ( index );   // <- illegal access
},
sum );
What’s the solution?
I CudaUVMSpace
I CudaHostPinnedSpace (skipping)
I Mirroring

Online April 21-24, 2020 61/192


Execution and Memory spaces (5)

CudaUVMSpace

# define KL KOKKOS_LAMBDA
View < double * ,
CudaUVMSpace> array ;
array = ... from file ...
double sum = 0;
p ar allel _r ed u ce ( " Label " , N ,
KL ( int i ,
double & d ) {
d += array ( i );
},
sum );

Cuda runtime automatically handles data movement,


at a performance hit.

Online April 21-24, 2020 62/192


Views, Spaces, and Mirrors

Important concept: Mirrors


Mirrors are views of equivalent arrays residing in possibly different
memory spaces.

Online April 21-24, 2020 63/192


Views, Spaces, and Mirrors

Important concept: Mirrors


Mirrors are views of equivalent arrays residing in possibly different
memory spaces.

Mirroring schematic
typedef Kokkos :: View < double ** , Space > ViewType ;
ViewType view (...);
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

Online April 21-24, 2020 63/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

3. Populate hostView on the host (from file, etc.).

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

3. Populate hostView on the host (from file, etc.).


4. Deep copy hostView’s array to view’s array.
Kokkos :: deep_copy ( view , hostView );

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

3. Populate hostView on the host (from file, etc.).


4. Deep copy hostView’s array to view’s array.
Kokkos :: deep_copy ( view , hostView );

5. Launch a kernel processing the view’s array.


Kokkos :: parallel_for ( " Label " ,
RangePolicy < Space >(0 , size ) ,
KOKKOS_LAMBDA (...) { use and change view });

Online April 21-24, 2020 64/192


Mirroring pattern

1. Create a view’s array in some memory space.


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view (...);

2. Create hostView, a mirror of the view’s array residing in the


host memory space.
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

3. Populate hostView on the host (from file, etc.).


4. Deep copy hostView’s array to view’s array.
Kokkos :: deep_copy ( view , hostView );

5. Launch a kernel processing the view’s array.


Kokkos :: parallel_for ( " Label " ,
RangePolicy < Space >(0 , size ) ,
KOKKOS_LAMBDA (...) { use and change view });

6. If needed, deep copy the view’s updated array back to the


hostView’s array to write file, etc.
Kokkos :: deep_copy ( hostView , view );
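Putting the six steps together, a minimal sketch (illustrative; the file reading is elided, and Space stands for an execution space such as Cuda, so the View uses that space's default memory space):

using ViewType = Kokkos::View<double*, Space>;
ViewType view("view", size);                                        // 1. allocate in Space
ViewType::HostMirror hostView = Kokkos::create_mirror_view(view);   // 2. host mirror

for (int64_t i = 0; i < size; ++i) { hostView(i) = /* ... read from file ... */ 0.0; }  // 3.

Kokkos::deep_copy(view, hostView);                                  // 4. host -> device

Kokkos::parallel_for("Scale", Kokkos::RangePolicy<Space>(0, size),  // 5. kernel on Space
  KOKKOS_LAMBDA(const int64_t i) { view(i) *= 2.0; });

Kokkos::deep_copy(hostView, view);                                  // 6. device -> host (if needed)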
Online April 21-24, 2020 64/192
Mirrors of Views in HostSpace

What if the View is in HostSpace too? Does it make a copy?


typedef Kokkos :: View < double * , Space > ViewType ;
ViewType view ( " test " , 10);
ViewType :: HostMirror hostView =
Kokkos :: create_mirror_view ( view );

I create_mirror_view allocates data only if the host process
cannot access view’s data, otherwise hostView references the
same data.
I create_mirror always allocates data.
I Reminder: Kokkos never performs a hidden deep copy.

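A sketch contrasting the two for a view that already lives in HostSpace (illustrative):

Kokkos::View<double*, Kokkos::HostSpace> h("h", 10);

// create_mirror_view: the host can already access h, so no allocation; m1 aliases h's data.
auto m1 = Kokkos::create_mirror_view(h);   // m1.data() == h.data()

// create_mirror: always allocates a fresh host copy.
auto m2 = Kokkos::create_mirror(h);        // m2.data() != h.data()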
Online April 21-24, 2020 65/192


Exercise #3: Flat Parallelism on the GPU, Views and Host Mirrors
Details:
I Location: Intro-Full/Exercises/03/Begin/
I Add HostMirror Views and deep copy
I Make sure you use the correct view in initialization and Kernel

# Compile for CPU


make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU ( we do not need UVM anymore )
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./03_Exercise.cuda -S 26

Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Change number of repeats (-nrepeat ...)
I Compare behavior of CPU vs GPU

Online April 21-24, 2020 66/192


View and Spaces Section Summary

I Data is stored in Views that are “pointers” to


multi-dimensional arrays residing in memory spaces.
I Views abstract away platform-dependent allocation,
(automatic) deallocation, and access.
I Heterogeneous nodes have one or more memory spaces.
I Mirroring is used for performant access to views in host and
device memory.
I Heterogeneous nodes have one or more execution spaces.
I You control where parallel code is run by a template
parameter on the execution policy, or by compile-time
selection of the default execution space.

Online April 21-24, 2020 67/192


Managing memory access patterns
for performance portability
Learning objectives:
I How the View’s Layout parameter controls data layout.
I How memory access patterns result from Kokkos mapping
parallel work indices and layout of multidimensional array data
I Why memory access patterns and layouts have such a
performance impact (caching and coalescing).
I See a concrete example of the performance of various memory
configurations.

Online April 21-24, 2020 68/192


Example: inner product (0)
Kokkos::parallel_reduce("Label",
  RangePolicy<ExecutionSpace>(0, N),
  KOKKOS_LAMBDA (const size_t row, double& valueToUpdate) {
    double thisRowsSum = 0;
    for (size_t entry = 0; entry < M; ++entry) {
      thisRowsSum += A(row, entry) * x(entry);
    }
    valueToUpdate += y(row) * thisRowsSum;
  }, result);

Driving question: How should A be laid out in memory?


Online April 21-24, 2020 69/192
Example: inner product (1)

Layout is the mapping of multi-index to memory:

LayoutLeft
in 2D, “column-major”

LayoutRight
in 2D, “row-major”

Online April 21-24, 2020 70/192


Layout

Important concept: Layout


Every View has a multidimensional array Layout set at
compile-time.

View<double***, Layout, Space> name(...);

I Most-common layouts are LayoutLeft and LayoutRight.


LayoutLeft: left-most index is stride 1.
LayoutRight: right-most index is stride 1.
I If no layout specified, default for that memory space is used.
LayoutLeft for CudaSpace, LayoutRight for HostSpace.
I Layouts are extensible: ≈ 50 lines
I Advanced layouts: LayoutStride, LayoutTiled, ...

Online April 21-24, 2020 71/192
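A quick sketch of what the two common layouts mean in terms of strides (the extents here are
arbitrary):

Kokkos::View<double**, Kokkos::LayoutLeft>  L("L", 4, 5);
Kokkos::View<double**, Kokkos::LayoutRight> R("R", 4, 5);

// LayoutLeft: left-most index is stride 1
//   L.stride(0) == 1, L.stride(1) == 4
// LayoutRight: right-most index is stride 1
//   R.stride(0) == 5, R.stride(1) == 1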


Exercise #4: Inner Product, Flat Parallelism

Details:
I Location: Intro-Full/Exercises/04/Begin/
I Replace "N" in parallel dispatch with RangePolicy<ExecSpace>
I Add MemSpace to all Views and Layout to A
I Experiment with the combinations of ExecSpace, Layout to view
performance
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Change number of repeats (-nrepeat ...)
I Compare behavior of CPU vs GPU
I Compare using UVM vs not using UVM on GPUs
I Check what happens if MemSpace and ExecSpace do not match.

Online April 21-24, 2020 72/192


Exercise #4: Inner Product, Flat Parallelism
[Figure: <y|Ax> Exercise 04 (Layout), bandwidth (GB/s) vs. number of rows (N) at fixed
problem size, for KNL (Xeon Phi 68c), HSW (dual Xeon Haswell 2x16c), and Pascal60
(NVIDIA GPU), each with LayoutLeft and LayoutRight. Annotation: Why?]

Online April 21-24, 2020 73/192


Caching and coalescing (0)
Thread independence:
operator()(const size_t index, double& valueToUpdate) {
  const double d = _data(index);
  valueToUpdate += d;
}

Question: once a thread reads d, does it need to wait?


I CPU threads are independent.
  i.e., threads may execute at any rate.
I GPU threads benefit (NVIDIA Volta) or must synchronize
  (AMD) in groups.
  i.e., threads in groups can/must execute instructions
  together.
  In particular, all threads in a group (warp or wavefront) must
  finish their loads before any thread can move on.
So, how many cache lines must be fetched before threads can
move on?
Online April 21-24, 2020 74/192
Caching and coalescing (1)

CPUs: few (independent) cores with separate caches:

GPUs: many (synchronized) cores with a shared cache:

Online April 21-24, 2020 75/192


Caching and coalescing (2)

Important point
For performance, accesses to views in HostSpace must be cached,
while access to views in CudaSpace must be coalesced.

Caching: if thread t’s current access is at position i,


thread t’s next access should be at position i+1.
Coalescing: if thread t’s current access is at position i,
thread t+1’s current access should be at position i+1.
Warning
Uncoalesced access on GPUs and non-cached loads on CPUs
greatly reduce performance (can be >10X)

Online April 21-24, 2020 76/192


Mapping indices to cores (0)

Consider the array summation example:


View<double*, Space> data("data", size);
... populate data ...

double sum = 0;
Kokkos::parallel_reduce("Label",
  RangePolicy<Space>(0, size),
  KOKKOS_LAMBDA (const size_t index, double& valueToUpdate) {
    valueToUpdate += data(index);
  },
  sum);

Question: is this cached (for OpenMP) and coalesced (for Cuda)?


Given P threads, which indices do we want thread 0 to handle?
  Contiguous: 0, 1, 2, ..., N/P     (CPU)
  Strided:    0, N/P, 2*N/P, ...    (GPU)
Why?
Online April 21-24, 2020 77/192
Mapping indices to cores (1)

Iterating for the execution space:


operator()(const size_t index, double& valueToUpdate) {
  const double d = _data(index);
  valueToUpdate += d;
}

As users we don’t control how indices are mapped to threads, so


how do we achieve good memory access?

Important point
Kokkos maps indices to cores in contiguous chunks on CPU
execution spaces, and strided for Cuda.

Online April 21-24, 2020 78/192


Mapping indices to cores (2)

Rule of Thumb
Kokkos index mapping and default layouts provide efficient access
if iteration indices correspond to the first index of array.

Example:
View<double***, ...> view(...);
...
Kokkos::parallel_for("Label", ...,
  KOKKOS_LAMBDA (const size_t workIndex) {
    ...
    view(..., ..., workIndex) = ...;
    view(..., workIndex, ...) = ...;
    view(workIndex, ..., ...) = ...;
  });
...

Online April 21-24, 2020 79/192


Example: inner product (2)

Important point
Performant memory access is achieved by Kokkos mapping parallel
work indices and multidimensional array layout appropriately for
the architecture.

Analysis: row-major (LayoutRight)

I HostSpace: cached (good)


I CudaSpace: uncoalesced (bad)
Online April 21-24, 2020 80/192
Example: inner product (3)

Important point
Performant memory access is achieved by Kokkos mapping parallel
work indices and multidimensional array layout optimally for the
architecture.

Analysis: column-major (LayoutLeft)

I HostSpace: uncached (bad)


I CudaSpace: coalesced (good)
Online April 21-24, 2020 81/192
Example: inner product (4)

Analysis: Kokkos architecture-dependent


View<double**, ExecutionSpace> A(N, M);
parallel_for(RangePolicy<ExecutionSpace>(0, N),
  ... thisRowsSum += A(j, i) * x(i);

(a) OpenMP (b) Cuda


I HostSpace: cached (good)
I CudaSpace: coalesced (good)

Online April 21-24, 2020 82/192


Example: inner product (5)
[Figure: <y|Ax> Exercise 04 (Layout), bandwidth (GB/s) vs. number of rows (N) at fixed
size, for KNL, HSW, and Pascal60 with LayoutLeft and LayoutRight. Annotations mark the
coalesced, cached, uncoalesced, and uncached regimes.]
Online April 21-24, 2020 83/192


Memory Access Pattern Summary

I Every View has a Layout set at compile-time through a


template parameter.
I LayoutRight and LayoutLeft are most common.
I Views in HostSpace default to LayoutRight and Views in
CudaSpace default to LayoutLeft.
I Layouts are extensible and flexible.
I For performance, memory access patterns must result in
caching on a CPU and coalescing on a GPU.
I Kokkos maps parallel work indices and multidimensional array
layout for performance portable memory access patterns.
I There is nothing in OpenMP, OpenACC, or OpenCL to manage
layouts.
⇒ You’ll need multiple versions of code or pay the
performance penalty.
Online April 21-24, 2020 84/192
DualView

DualView
Learning objectives:
I Motivation and Value Added.
I Usage.
I Exercises.

Online April 21-24, 2020 85/192


DualView(0)

Motivation and Value-added

I DualView was designed to help transition codes to Kokkos.

I DualView simplifies the task of managing data movement


between memory spaces, e.g., host and device.

I When converting a typical app to use Kokkos, there is usually


no holistic view of such data transfers.

Online April 21-24, 2020 86/192


DualView(1)

[Diagram: a device View and a host MirrorView connected by deep_copy.]

Without DualView, could use MirrorViews, but


I deep copies are expensive, use sparingly
I do I need a deep copy here?
I where is the most recent data?
I is data on the host or device stale?
I was code modified upstream? is data here now stale, but not
in previous version?
Online April 21-24, 2020 87/192
DualView: Usage

I DualView bundles two views, e.g. a host View and a device View.
I DualView::modify<MemorySpace>() marks the data as modified on the given MemorySpace.
I DualView::sync<MemorySpace>() deep copies the data to the given MemorySpace only if the
  two memory spaces are not in sync.
I DualView relies on calls to modify() to determine if data actually needs to be copied during
  a call to sync().
I sync() does nothing if there is a single memory space, so they are efficient to use all the time.

[Diagram: a DualView bundling a device View and a host View.]

There is no automatic tracking of data freshness:


I you must tell Kokkos when data has been modified on a
memory space.
I If you mark data as modified when you modify it, then Kokkos
will know if it needs to move data
Online April 21-24, 2020 88/192
DualView: Usage(1)

DualView bundles two views, a Host View and a Device View

I Data members for the two views
  DualView::t_host h_view
  DualView::t_dev  d_view

I Retrieve data members
  t_host view_host();
  t_dev  view_device();

I Mark data as modified
  void modify_host();
  void modify_device();

Online April 21-24, 2020 89/192


DualView: Usage(2)

DualView bundles two views, a Host View and a Device View

I Sync data in a direction if not in sync
  void sync_host();
  void sync_device();

I Check sync status
  void need_sync_host();
  void need_sync_device();

Online April 21-24, 2020 90/192
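A minimal sketch of the typical modify/sync workflow using these members (the view name, size,
and kernel are illustrative):

#include <Kokkos_DualView.hpp>

Kokkos::DualView<double*> dv("dv", n);

// Fill on the host, then mark the host side as modified.
auto h = dv.view_host();
for (int i = 0; i < n; ++i) h(i) = 1.0 * i;
dv.modify_host();

// Sync before using the device side (deep copies only if the two sides differ).
dv.sync_device();
auto d = dv.view_device();
Kokkos::parallel_for("scale", n,
  KOKKOS_LAMBDA (const int i) { d(i) *= 2.0; });
dv.modify_device();

// Sync back before reading on the host again.
dv.sync_host();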


DualView: Usage in generic context
DualView has templated functions for generic use in
templated code
I Retrieve data members
  template <class Space>
  auto view();

I Mark data as modified
  template <class Space>
  void modify();

I Sync data in a direction if not in sync
  template <class Space>
  void sync();

I Check sync status
  template <class Space>
  void need_sync();

Online April 21-24, 2020 91/192


Exercise - DualView

Details:
I Location: Intro-Full/Exercises/dualview/Begin/
I Modify or create a new compute enthalpy function in
dual view exercise.cpp to:
I 1. Take (dual)views as arguments
I 2. Call modify() and/or sync() when appropriate for the dual
views
I 3. Runs the kernel on host or device execution spaces

# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU (we do not need UVM anymore)
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./dualview.cuda -S 26

Online April 21-24, 2020 92/192


MDRangePolicy

Tightly Nested Loops with


MDRangePolicy
Learning objectives:
I Demonstrate usage of the MDRangePolicy with tightly nested
loops.
I Syntax - Required and optional settings
I Code demo and example

Online April 21-24, 2020 93/192


MDRangePolicy (0)
Motivating example: Consider the nested for loops:
for (int i = 0; i < Ni; ++i)
  for (int j = 0; j < Nj; ++j)
    for (int k = 0; k < Nk; ++k)
      some_init_fcn(i, j, k);

Based on Kokkos lessons thus far, you might parallelize this as


Kokkos::parallel_for(Ni,
  KOKKOS_LAMBDA (const int i) {
    for (int j = 0; j < Nj; ++j)
      for (int k = 0; k < Nk; ++k)
        some_init_fcn(i, j, k);
  }
);

I This only parallelizes along one dimension, leaving potential


parallelism unexploited.
I What if Ni is too small to amortize the cost of constructing a
parallel region, but Ni*Nj*Nk makes it worthwhile?
Online April 21-24, 2020 94/192
MDRangePolicy (1)

Solution: Use an MDRangePolicy


for (int i = 0; i < Ni; ++i)
  for (int j = 0; j < Nj; ++j)
    for (int k = 0; k < Nk; ++k)
      some_init_fcn(i, j, k);

Instead, use the MDRangePolicy with the parallel_for


Kokkos::parallel_for(Kokkos::MDRangePolicy<Kokkos::Rank<3>>
    ({0, 0, 0}, {Ni, Nj, Nk}),
  KOKKOS_LAMBDA (int i, int j, int k) {
    some_init_fcn(i, j, k);
  }
);

Online April 21-24, 2020 95/192


MDRangePolicy API(0)

Required Template Parameters to MDRangePolicy


Kokkos::Rank<N, IterateOuter, IterateInner>
I N: (Required) the rank of the index space (limited from 2 to 6)
I IterateOuter (Optional) iteration pattern between tiles
I Options: Iterate::Left, Iterate::Right, Iterate::Default
I IterateInner (Optional) iteration pattern within tiles
I Options: Iterate::Left, Iterate::Right, Iterate::Default

Online April 21-24, 2020 96/192


MDRangePolicy API(1)

Optional Template Parameters


ExecutionSpace
I Options: Serial, OpenMP, Threads, Cuda

Schedule < Options >


I Options: Static, Dynamic

IndexType < Options >


I Options: int, long, etc

WorkTag
I Options: SomeClass

MDRangePolicy<Rank<2, OP, IP>, OpenMP, Schedule<Static>,
              IndexType<int>> mdr_policy;

Online April 21-24, 2020 97/192


MDRangePolicy API(2)
Policy Arguments
BeginList
I Initializer List or Kokkos::Array (Required): rank arguments for
starts of index space
I Example Rank 2: {b0,b1}

EndList
I Initializer List or Kokkos::Array (Required): rank arguments for
ends of index space
I Example Rank 2: {e0,e1}

TileDimList
I Initializer List or Kokkos::Array (Optional): rank arguments for
dimension of tiles
I Example Rank 2: {t0,t1}

mdr_policy({b0, b1}, {e0, e1}, {t0, t1});
Online April 21-24, 2020 98/192
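Putting the pieces together, a sketch of a rank-2 policy with explicit iteration patterns and tile
sizes (the extents, tile dimensions, and the view A are illustrative):

using policy_t = Kokkos::MDRangePolicy<
    Kokkos::Rank<2, Kokkos::Iterate::Right, Kokkos::Iterate::Right>>;

policy_t mdr_policy({0, 0}, {N0, N1}, {32, 4});  // begins, ends, tile dims

Kokkos::parallel_for("init_A", mdr_policy,
  KOKKOS_LAMBDA (const int i, const int j) {
    A(i, j) = 1.0 * (i + j);
  });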
Exercise - mdrange: Initialize multi-dim views with MDRangePolicy

Details:
I Location: Intro-Full/Exercises/mdrange/Begin/
I This begins with the Solution of 02
I Initialize the device Views x and y directly on the device using a
parallel for and RangePolicy
I Initialize the device View matrix A directly on the device using a
parallel for and MDRangePolicy

# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU (we do not need UVM anymore)
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./mdrange_exercise.cuda -S 26

Online April 21-24, 2020 99/192


Exercise - mdrange: Initialize multi-dim views with MDRangePolicy

Things to try:
I Name the kernels - pass a string as the first argument of the parallel
pattern
I Try changing the iteration patterns for the tiles in the
MDRangePolicy, notice differences in performance

Online April 21-24, 2020 100/192


Subviews

Subviews: Taking ’slices’ of


Views
Learning objectives:
I Introduce Kokkos::subview - basic capabilities and syntax
I Suggested usage and practices

Online April 21-24, 2020 101/192


Subviews (0)

Subview description:
I A subview is a ’slice’ of a View that behaves as a View
I Same syntax as a View - access data using (multi-)index entries
I The ’slice’ and original View point to the same data - no extra
memory allocation or copying
I Can be constructed on host or within a kernel (no allocation
of memory occurs)
I Similar capability as provided by Matlab, Fortran, Python, etc.
using ’colon’ notation

Online April 21-24, 2020 102/192


Subviews (1)

Introductory Usage Demo:


Begin with a View:
Kokkos::View<double***> v("v", N0, N1, N2);

Say we want a 2-dimensional slice at an index i0 in the first
dimension - that is, in Matlab/Fortran/Python notation:

slicei0 = v(i0, :, :);

This can be accomplished in Kokkos using a subview as follows:

auto slicei0 =
  Kokkos::subview(v, i0, Kokkos::ALL, Kokkos::ALL);

auto slicei0 =
  Kokkos::subview(v, i0, std::make_pair(0, v.extent(1)),
                         std::make_pair(0, v.extent(2)));
// extent(N) returns the size of dimension N of the View

Online April 21-24, 2020 103/192


Subviews (2)

Syntax:
Kokkos::subview(Kokkos::View<...> view,
                arg0,
                ...)
I view: First argument to the subview is the view of which a slice will
be taken
I argN: Slice info for rank N - provide same number of arguments as
rank
I Options for argN:
I index: integral type single value
I partial-range: std::pair or Kokkos::pair of integral types to
provide sub-range of a rank’s range [0,N)
I full-range: use Kokkos::ALL rather than providing the full
range as a pair

Online April 21-24, 2020 104/192


Subviews (3)

Suggested usage:
I Use ’auto’ to determine the return type of a subview
I A subview can help with encapsulation - e.g. can pass into
functions expecting a lower-dimensional View
I Use Kokkos::pair for partial ranges if subview created within a
kernel
I Avoid usage if very few data accesses will be made to the
subview
I Construction of subview costs 20-40 operations

Online April 21-24, 2020 105/192
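A short sketch of creating subviews inside a kernel, using Kokkos::pair for a partial range
(the rank-2 view A, the rank-1 view y, and the extents N and M are assumptions for illustration):

Kokkos::parallel_for("rows", N, KOKKOS_LAMBDA (const int j) {
  auto row       = Kokkos::subview(A, j, Kokkos::ALL);                     // full row j
  auto firstHalf = Kokkos::subview(A, j, Kokkos::pair<int, int>(0, M / 2)); // partial range

  double s = 0;
  for (int i = 0; i < (int)firstHalf.extent(0); ++i) s += firstHalf(i);
  y(j) = s + row(M - 1);
});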


Exercise - Subviews: Basic usage

Details:
I Location: Intro-Full/Exercises/subview/Begin/
I This begins with the Solution of 04
I In the parallel reduce kernel, create a subview for row j of view A
I Use this subview when computing A(j,:)*x(:) rather than the matrix
A
# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU (we do not need UVM anymore)
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./subview_exercise.cuda -S 26

Online April 21-24, 2020 106/192


Thread safety and
atomic operations
Learning objectives:
I Understand that coordination techniques for low-count CPU
threading are not scalable.
I Understand how atomics can parallelize the scatter-add
pattern.
I Gain performance intuition for atomics on the CPU and
GPU, for different data types and contention rates.

Online April 21-24, 2020 107/192


Examples: Histogram

Histogram kernel:
parallel_for(N, KOKKOS_LAMBDA (const size_t index) {
  const Something value = ...;
  const size_t bucketIndex = computeBucketIndex(value);
  ++_histogram(bucketIndex);
});

Problem: Multiple threads may try to write to the same location.

Solution strategies:
I Locks: not feasible on GPU
I Thread-private copies:
not thread-scalable
I Atomics

http://www.farmaceuticas.com.br/tag/graficos/
Online April 21-24, 2020 108/192
Atomics

Atomics: the portable and thread-scalable solution


parallel_for(N, KOKKOS_LAMBDA (const size_t index) {
  const Something value = ...;
  const int bucketIndex = computeBucketIndex(value);
  Kokkos::atomic_add(&_histogram(bucketIndex), 1);
});

I Atomics are the only scalable solution to thread safety.


I Locks are not portable.
I Data replication is not thread scalable.

Online April 21-24, 2020 109/192


Performance of atomics (0)

How expensive are atomics?

Thought experiment: scalar integration


operator()(const unsigned int intervalIndex,
           double& valueToUpdate) const {
  double contribution = function(...);
  valueToUpdate += contribution;
}

Idea: what if we instead do this with parallel_for and atomics?

operator()(const unsigned int intervalIndex) const {
  const double contribution = function(...);
  Kokkos::atomic_add(&globalSum, contribution);
}

How much of a performance penalty is incurred?

Online April 21-24, 2020 110/192


Performance of atomics (1)

Two costs: (independent) work and coordination.


p ar al le l _r ed u ce ( numberOfIntervals ,
KOKKOS_LAMBDA ( const unsigned int intervalIndex ,
double & valueToUpdate ) {
valueToUpdate += function (...);
} , totalIntegral );

Experimental setup
operator ()( const unsigned int index ) const {
Kokkos :: atomic_add (& globalSums [ index % atomicStride ] , 1);
}

I This is the most extreme case: all coordination and no work.


I Contention is captured by the atomicStride.
atomicStride → 1 ⇒ Scalar integration (bad)
atomicStride → large ⇒ Independent (good)

Online April 21-24, 2020 111/192


Performance of atomics (2)
Atomics performance: 1 million adds, no work per kernel

[Figure: slowdown from atomics (log10 speedup over independent) vs. contention (log10),
summary for 1 million adds, mod, 0 pows, for CUDA, OpenMP, and Phi backends with double,
float, and integer types. Low(?) penalty for low contention; high penalty for high contention.]

Online April 21-24, 2020 112/192
Performance of atomics (3)
Atomics performance: 1 million adds, some work per kernel

[Figure: slowdown from atomics vs. contention, summary for 1 million adds, mod, 2 pows,
same backends and data types. No penalty for low contention; high penalty for high contention.]

Online April 21-24, 2020 113/192
Performance of atomics (4)
Atomics performance: 1 million adds, lots of work per kernel

[Figure: slowdown from atomics vs. contention, summary for 1 million adds, mod, 5 pows,
same backends and data types. No penalty for low contention; high penalty for high contention.]

Online April 21-24, 2020 114/192
Advanced features

Atomics on arbitrary types:


I Atomic operations work if the corresponding operator exists,
i.e., atomic add works on any data type with “+”.
I Atomic exchange works on any data type.
// Assign *dest to val, return former value of *dest
template <typename T>
T atomic_exchange(T* dest, T val);
// If *dest == comp then assign *dest to val.
// Return true if it succeeds.
template <typename T>
bool atomic_compare_exchange_strong(T* dest, T comp, T val);

Online April 21-24, 2020 115/192
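As a sketch of how the compare-exchange primitive is typically used, here is a hand-rolled
atomic "max" built from a CAS loop (Kokkos also ships dedicated atomic max operations; this
only illustrates the idiom):

KOKKOS_INLINE_FUNCTION
void atomic_max_via_cas(double* dest, double val) {
  double old = *dest;
  while (old < val) {
    // Try to install val; succeeds only if *dest still equals old.
    if (Kokkos::atomic_compare_exchange_strong(dest, old, val)) break;
    old = *dest;  // another thread won the race; re-read and retry
  }
}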


Memory traits

Slight detour: View memory traits:


I Beyond a Layout and Space, Views can have memory traits.
I Memory traits either provide convenience or allow for certain
hardware-specific optimizations to be performed.
Example: If all accesses to a View will be atomic, use the Atomic
memory trait:
View<double**, Layout, Space,
     MemoryTraits<Atomic>> forces(...);

Many memory traits exist or are experimental, including Read,


Write, ReadWrite, ReadOnce (non-temporal), Contiguous, and
RandomAccess.

Online April 21-24, 2020 116/192


RandomAccess memory trait
Example: RandomAccess memory trait:
On GPUs, there is a special pathway for fast read-only, random
access, originally designed for textures.
How to access texture memory via CUDA:

How to access texture memory via Kokkos:


View<const double***, Layout, Space,
     MemoryTraits<RandomAccess>> name(...);
Online April 21-24, 2020 117/192
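In practice the read-only RandomAccess view is usually obtained by assigning an existing view
to a const view carrying the trait, e.g. (a sketch; the view name and size are illustrative):

Kokkos::View<double*> data("data", n);
// Same allocation; reads may now go through the fast read-only path on GPUs:
Kokkos::View<const double*, Kokkos::MemoryTraits<Kokkos::RandomAccess>> data_ro = data;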
Scatter Contribute (1)

Histogram generation is an example of the Scatter Contribute


pattern.
I Like a reduction but with many results.
I Number of results scales with number of inputs.
I Each result gets contributions from a small number of
  inputs/iterations.
I Uses an inputs-to-results map, not its inverse.
Examples:
I Particles contributing to neighbors forces.
I Cells contributing forces to nodes.
I Computing histograms.
I Computing a density grid from point source contributions.

Online April 21-24, 2020 118/192


Scatter Contribute (2)
There are two useful algorithms:
I Atomics: thread-scalable but depends on atomic
performance.
I Data Replication: every thread owns a copy of the output,
not thread-scalable but good for low (< 16) threads count
architectures.

Important Capability: ScatterView


ScatterView can transparently switch between Atomic and Data
Replication based scatter algorithms.

I Abstracts over scatter contribute algorithms.


I Compile time choice with backend-specific defaults.
I Only limited number of operations are supported.
I Part of Kokkos Containers.
Online April 21-24, 2020 119/192
Scatter Contribute (3)
Example:
// Begin with a normal View
Kokkos::View<double*> results("results", N);
// Create a scatter view wrapping the original view
Kokkos::Experimental::ScatterView<double*> scatter(results);
// Reset contributions if necessary
scatter.reset();
// Start parallel operation
Kokkos::parallel_for("ScatterAlg", M,
  KOKKOS_LAMBDA (int i) {
    // Get the accessor - e.g. the thread-specific copy
    // or an atomic view of the data.
    auto access = scatter.access();

    for (int j = 0; j < num_neighs(i); j++) {
      // Get the destination index
      int neigh = neighbors(i, j);
      // Add the contribution
      access(neigh) += contribution(i, neigh);
    }
  });
// Combine the results - no-op if ScatterView was using atomics internally
Kokkos::Experimental::contribute(results, scatter);
Online April 21-24, 2020 120/192
Exercise ScatterView

I Location: Intro-Full/Exercises/scatter view/Begin/


I Assignment: Convert scatter view loop to use ScatterView.
I Compile and run on both CPU and GPU

make -j KOKKOS_DEVICES=OpenMP # CPU - only using OpenMP

make -j KOKKOS_DEVICES=Cuda   # GPU - note UVM in Makefile
# Run exercise
./scatterview.host
./scatterview.cuda
# Note the warnings, set appropriate environment variables

I Compare performance on CPU of the three variants


I Compare performance on GPU of the two variants
I Vary problem size: first and second optional argument

Online April 21-24, 2020 121/192


Section Summary

I Atomics are the only thread-scalable solution to thread safety.


I Locks or data replication are not portable or scalable
I Atomic performance depends on ratio of independent work
and atomic operations.
I With more work, there is a lower performance penalty, because
of increased opportunity to interleave work and atomic.
I The Atomic memory trait can be used to make all accesses
to a view atomic.
I The cost of atomics can be negligible:
I CPU ideal: contiguous access, integer types
I GPU ideal: scattered access, 32-bit types
I Many programs with the scatter-add pattern can be
thread-scalably parallelized using atomics without much
modification.

Online April 21-24, 2020 122/192


Hierarchical parallelism
Finding and exploiting more parallelism in your computations.

Learning objectives:
I Similarities and differences between outer and inner levels of
parallelism
I Thread teams (league of teams of threads)
I Performance improvement with well-coordinated teams

Online April 21-24, 2020 123/192


Example: inner product (0)

(Flat parallel) Kernel:


Kokkos :: p ar a ll el _r e du ce ( " yAx " ,N ,
KOKKOS_LAMBDA ( const int row , double & valueToUpdate ) {
double thisRowsSum = 0;
for ( int col = 0; col < M ; ++ col ) {
thisRowsSum += A ( row , col ) * x ( col );
}
valueToUpdate += y ( row ) * thisRowsSum ;
} , result );

Problem: What if we don’t have


enough rows to saturate the GPU?
Solutions?
I Atomics
I Thread teams

Online April 21-24, 2020 124/192


Example: inner product (1)

Atomics kernel:
Kokkos :: parallel_for ( " yAx " , N ,
KOKKOS_LAMBDA ( const size_t index ) {
const int row = extractRow ( index );
const int col = extractCol ( index );
atomic_add (& result , A ( row , col ) * x ( col ));
});

Problem: Poor performance

Online April 21-24, 2020 125/192


Example: inner product (2)

Doing each individual row with atomics is like doing scalar


integration with atomics.

Instead, you could envision doing a large number of


parallel reduce kernels.
for each row {
  Functor functor(row, ...);
  parallel_reduce(M, functor);
}

This is an example of hierarchical work.


Important concept: Hierarchical parallelism
Algorithms that exhibit hierarchical structure can exploit
hierarchical parallelism with thread teams.

Online April 21-24, 2020 126/192


Example: inner product (3)

Important concept: Thread team


A collection of threads which are guaranteed to be executing
concurrently and can synchronize.
High-level strategy:
1. Do one parallel launch of N teams of M threads.
2. Each thread performs one entry in the row.
3. The threads within teams perform a reduction.
4. The thread teams perform a reduction.

Online April 21-24, 2020 127/192


Example: inner product (4)

The final hierarchical parallel kernel:


p ar al le l _r ed u ce ( " yAx " ,
team_policy (N , Kokkos :: AUTO ) ,

KOKKOS_LAMBDA ( const member_type & teamMember , double & update )


int row = teamMember . league_rank ();

double thisRowsSum = 0;
p aral le l _r ed uc e ( T e am Th re a dR an ge ( teamMember , M ) ,
[=] ( int col , double & innerUpdate ) {
innerUpdate += A ( row , col ) * x ( col );
} , thisRowsSum );

if ( teamMember . team_rank () == 0) {
update += y ( row ) * thisRowsSum ;
}
} , result );

Online April 21-24, 2020 128/192


TeamPolicy (0)

Important point
Using teams is changing the execution policy.

“Flat parallelism” uses RangePolicy:


We specify a total amount of work.
// total work = N
parallel_for("Label",
  RangePolicy<ExecutionSpace>(0, N), functor);

“Hierarchical parallelism” uses TeamPolicy:


We specify a team size and a number of teams.
// total work = numberOfTeams * teamSize
parallel_for("Label",
  TeamPolicy<ExecutionSpace>(numberOfTeams, teamSize), functor);

Online April 21-24, 2020 129/192


TeamPolicy (1)

Important point
When using teams, functor operators receive a team member.

typedef typename TeamPolicy<ExecSpace>::member_type member_type;

void operator()(const member_type& teamMember) {
  // Which team am I on?
  const unsigned int leagueRank = teamMember.league_rank();
  // Which thread am I on this team?
  const unsigned int teamRank = teamMember.team_rank();
}

Warning
There may be more (or fewer) team members than pieces of your
algorithm’s work per team

Online April 21-24, 2020 130/192


TeamThreadRange (0)

First attempt at exercise:


operator()(member_type& teamMember) {
  const size_t row = teamMember.league_rank();
  const size_t col = teamMember.team_rank();
  atomic_add(&result, y(row) * A(row, col) * x(col));
}

I When team size ≠ number of columns, how are units of work
  mapped to team's member threads? Is the mapping
  architecture-dependent?
I atomic_add performs badly under high contention; how can
  team's member threads performantly cooperate for a nested
  reduction?
Online April 21-24, 2020 131/192
TeamThreadRange (1)

We shouldn’t be hard-coding the work mapping...


operator()(member_type& teamMember, double& update) {
  const int row = teamMember.league_rank();
  double thisRowsSum;
  "do a reduction"("over M columns",
    [=] (const int col) {
      thisRowsSum += A(row, col) * x(col);
    });
  if (teamMember.team_rank() == 0) {
    update += y(row) * thisRowsSum;
  }
}

If this were a parallel execution,
we'd use Kokkos::parallel_reduce.
Key idea: this is a parallel execution.
⇒ Nested parallel patterns
Online April 21-24, 2020 132/192
TeamThreadRange (2)
TeamThreadRange:
operator()(const member_type& teamMember, double& update) {
  const int row = teamMember.league_rank();
  double thisRowsSum;
  parallel_reduce(TeamThreadRange(teamMember, M),
    [=] (const int col, double& thisRowsPartialSum) {
      thisRowsPartialSum += A(row, col) * x(col);
    }, thisRowsSum);
  if (teamMember.team_rank() == 0) {
    update += y(row) * thisRowsSum;
  }
}

I The mapping of work indices to threads is


architecture-dependent.
I The amount of work given to the TeamThreadRange need
not be a multiple of the team size.
I Intrateam reduction handled by Kokkos.
Online April 21-24, 2020 133/192
Nested parallelism
Anatomy of nested parallelism:
para llel_out er ( " Label " ,
TeamPolicy < ExecutionSpace >( numberOfTeams , teamSize ) ,
KOKKOS_LAMBDA ( const member_type & teamMember [ , . . . ] ) {
/* beginning of outer body */
para llel_inn er (
T ea mT hr e ad Ra ng e ( teamMember , t h i s T e a m s R a n g e S i z e ) ,
[=] ( const unsigned int i n d e xW i t h i n B a t c h [ , . . . ] ) {
/* inner body */
} [ , . . . ] );
/* end of outer body */
} [ , . . . ] );

I parallel_outer and parallel_inner may be any


combination of for, reduce, or scan.
I The inner lambda may capture by reference, but
capture-by-value is recommended.
I The policy of the inner lambda is always a TeamThreadRange.
I TeamThreadRange cannot be nested.
Online April 21-24, 2020 134/192
What should the team size be?
In practice, you can let Kokkos decide:
parallel_something(
  TeamPolicy<ExecutionSpace>(numberOfTeams, Kokkos::AUTO),
  /* functor */);

GPUs
I Special hardware available for coordination within a team.
I Within a team 32 (NVIDIA) or 64 (AMD) threads execute
“lock step.”
I Maximum team size: 1024; Recommended team size:
128/256
Intel Xeon Phi:
I Recommended team size: # hyperthreads per core
I Hyperthreads share the entire cache hierarchy;
  a well-coordinated team avoids cache-thrashing
Online April 21-24, 2020 135/192
Exercise #5: Inner Product, Hierarchical Parallelism
Details:
I Location: Intro-Full/Exercises/05/
I Replace RangePolicy<Space> with TeamPolicy<Space>
I Use AUTO for team size
I Make the inner loop a parallel reduce with TeamThreadRange
policy
I Experiment with the combinations of Layout, Space, N to view
performance
I Hint: what should the layout of A be?
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 4 for very non-square matrices
I Compare behavior of CPU vs GPU
Online April 21-24, 2020 136/192
Reminder, Exercise #4 with Flat Parallelism
[Figure: Exercise 04 (flat parallelism) results repeated for comparison: bandwidth (GB/s)
vs. number of rows (N) at fixed size for KNL, HSW, and Pascal60 with LayoutLeft and
LayoutRight, annotated with the coalesced / cached / uncoalesced / uncached regimes.]
Online April 21-24, 2020 137/192


Exercise #5: Inner Product, Hierarchical Parallelism
[Figure: <y|Ax> Exercise 05 (Layout/Teams), bandwidth (GB/s) vs. number of rows (N) at
fixed size for KNL, HSW, and Pascal60 with LayoutLeft and LayoutRight. Annotations mark
the coalesced and cached regimes.]
Online April 21-24, 2020 138/192


Three-level parallelism (0)

Exposing Vector Level Parallelism


I Optional third level in the hierarchy: ThreadVectorRange
I Can be used for parallel for, parallel reduce, or
parallel scan.
I Maps to vectorizable loop on CPUs or (sub-)warp level
parallelism on GPUs.
I Enabled with a runtime vector length argument to
TeamPolicy
I There is no explicit access to a vector lane ID.
I Depending on the backend the full global parallel region has
active vector lanes.
I TeamVectorRange uses both thread and vector parallelism.

Online April 21-24, 2020 139/192


Three-level parallelism (1)
Anatomy of nested parallelism:
para llel_out er ( " Label " ,
TeamPolicy < >( numberOfTeams , teamSize , vectorLength ) ,
KOKKOS_LAMBDA ( const member_type & teamMember [ , . . . ] ) {
/* beginning of outer body */
p ar al le l _m id dl e (
T ea mT hr e ad Ra ng e ( teamMember , t h i s T e a m s R a n g e S i z e ) ,
[=] ( const int i n d e x W i t h i n B a t c h [ , . . . ] ) {
/* begin middle body */
para llel_inn er (
T h r e a d V e c t o r R a n g e ( teamMember , t h i s V e c t o r R a n g e S i z e ) ,
[=] ( const int i n d e x V e c t o r R a n g e [ , . . . ] ) {
/* inner body */
}[ , ....);
/∗ end m i d d l e body ∗/
} [ , ...] ) ;
parallel middle (
TeamVectorRange ( teamMember , s o m e S i z e ) ,
[=] ( const i n t indexTeamVector [ , . . . ] ) {
/∗ n e s t e d body ∗/
}[ , . . . ] ) ;
/∗ end o f o u t e r body ∗/
} [ , ...] ) ;
Online April 21-24, 2020 140/192
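A sketch of the <y|Ax> kernel with all three levels, in the spirit of Exercise #6 (it assumes a
rank-3 view A(E,N,M), rank-2 views x(E,M) and y(E,N), sizes E, N, M, a double result, and an
illustrative vector length of 8):

Kokkos::parallel_reduce("yAx_3L",
  Kokkos::TeamPolicy<>(E, Kokkos::AUTO, 8),
  KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& team, double& update) {
    const int e = team.league_rank();
    double thisElementsSum = 0;
    Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, N),
      [=] (const int row, double& rowUpdate) {
        double thisRowsSum = 0;
        Kokkos::parallel_reduce(Kokkos::ThreadVectorRange(team, M),
          [=] (const int col, double& colUpdate) {
            colUpdate += A(e, row, col) * x(e, col);
          }, thisRowsSum);
        Kokkos::single(Kokkos::PerThread(team), [&] () {
          rowUpdate += y(e, row) * thisRowsSum;
        });
      }, thisElementsSum);
    Kokkos::single(Kokkos::PerTeam(team), [&] () {
      update += thisElementsSum;
    });
  }, result);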
Sum sanity checks (0)

Question: What will the value of totalSum be?


int totalSum = 0;
parallel_reduce("Sum", RangePolicy<>(0, numberOfThreads),
  KOKKOS_LAMBDA (size_t& index, int& partialSum) {
    int thisThreadsSum = 0;
    for (int i = 0; i < 10; ++i) {
      ++thisThreadsSum;
    }
    partialSum += thisThreadsSum;
  }, totalSum);

totalSum = numberOfThreads * 10

Online April 21-24, 2020 141/192


Sum sanity checks (1)

Question: What will the value of totalSum be?


int totalSum = 0;
parallel_reduce("Sum", TeamPolicy<>(numberOfTeams, team_size),
  KOKKOS_LAMBDA (member_type& teamMember, int& partialSum) {
    int thisThreadsSum = 0;
    for (int i = 0; i < 10; ++i) {
      ++thisThreadsSum;
    }
    partialSum += thisThreadsSum;
  }, totalSum);

totalSum = numberOfTeams * team_size * 10

Online April 21-24, 2020 142/192


Sum sanity checks (2)

Question: What will the value of totalSum be?


int totalSum = 0;
parallel_reduce("Sum", TeamPolicy<>(numberOfTeams, team_size),
  KOKKOS_LAMBDA (member_type& teamMember, int& partialSum) {
    int thisTeamsSum = 0;
    parallel_reduce(TeamThreadRange(teamMember, team_size),
      [=] (const int index, int& thisTeamsPartialSum) {
        int thisThreadsSum = 0;
        for (int i = 0; i < 10; ++i) {
          ++thisThreadsSum;
        }
        thisTeamsPartialSum += thisThreadsSum;
      }, thisTeamsSum);
    partialSum += thisTeamsSum;
  }, totalSum);

totalSum = numberOfTeams * team_size * team_size * 10

Online April 21-24, 2020 143/192


Restricting Execution: single pattern

The single pattern can be used to restrict execution


I Like parallel patterns it takes a policy, a lambda, and
optionally a broadcast argument.
I Two policies: PerTeam and PerThread.
I Equivalent to OpenMP single directive with nowait
// Restrict to once per thread
single(PerThread(teamMember), [&] () {
  // code
});

// Restrict to once per team with broadcast

int broadcastedValue = 0;
single(PerTeam(teamMember), [&] (int& broadcastedValue_local) {
  broadcastedValue_local = special value assigned by one;
}, broadcastedValue);
// Now everyone has the special value

Online April 21-24, 2020 144/192


Exercise #6: Three-Level Parallelism

The previous example was extended with an outer loop over


“Elements” to expose a third natural layer of parallelism.

Details:
I Location: Intro-Full/Exercises/06/
I Use the single policy instead of checking team rank
I Parallelize all three loop levels.
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 5 for very non-square matrices
I Compare behavior of CPU vs GPU

Online April 21-24, 2020 145/192


Exercise #6: Three-Level Parallelism

[Figure: <y|Ax> Exercise 06 (Three-Level Parallelism), bandwidth (GB/s) vs. number of
rows (N) at fixed size, comparing the two-level ("2L ... Begin") and three-level ("3L")
versions on KNL, HSW, and Pascal60.]
Online April 21-24, 2020 146/192
Section Summary

I Hierarchical work can be parallelized via hierarchical


parallelism.
I Hierarchical parallelism is leveraged using thread teams
launched with a TeamPolicy.
I Team “worksets” are processed by a team in nested
parallel_for (or reduce or scan) calls with a
TeamThreadRange and ThreadVectorRange policy.
I Execution can be restricted to a subset of the team with the
single pattern using either a PerTeam or PerThread policy.
I Teams can be used to reduce contention for global resources
even in “flat” algorithms.

Online April 21-24, 2020 147/192


Scratch memory
Learning objectives:
I Understand concept of team and thread private scratch
pads
I Understand how scratch memory can reduce global memory
accesses
I Recognize when to use scratch memory
I Understand how to use scratch memory and when barriers
are necessary

Online April 21-24, 2020 148/192


Types of Scratch Space Uses
Two Levels of Scratch Space
I Level 0 is limited in size but fast.
I Level 1 allows larger allocations but is equivalent to High
Bandwidth Memory in latency and bandwidth.
Team or Thread private memory
I Typically used for per work-item temporary storage.
I Advantage over pre-allocated memory is aggregate size scales
with number of threads, not number of work-items.
Manually Managed Cache
I Explicitly cache frequently used data.
I Exposes hardware specific on-core scratch space (e.g. NVIDIA
GPU Shared Memory).
Now: Discuss the Manually Managed Cache use case.

Online April 21-24, 2020 149/192


Example: contractDataFieldScalar (1)

One slice of contractDataFieldScalar:

for ( qp = 0; qp < numberOfQPs ; ++ qp ) {


total = 0;
for ( i = 0; i < vectorSize ; ++ i ) {
total += A ( qp , i ) * B ( i );
}
result ( qp ) = total ;
}

Online April 21-24, 2020 150/192


Example: contractDataFieldScalar (2)
contractDataFieldScalar:

for ( element = 0; element < n u m b e r O f E l e m e n t s ; ++ element ) {


for ( qp = 0; qp < numberOfQPs ; ++ qp ) {
total = 0;
for ( i = 0; i < vectorSize ; ++ i ) {
total += A ( element , qp , i ) * B ( element , i );
}
result ( element , qp ) = total ;
}
}

Online April 21-24, 2020 151/192


Example: contractDataFieldScalar (3)

Parallelization approaches:
I Each thread handles an element.
Threads: numberOfElements
I Each thread handles a qp.
Threads: numberOfElements * numberOfQPs
I Each thread handles an i.
Threads: numElements * numQPs * vectorSize
Requires a parallel reduce.

Online April 21-24, 2020 152/192


Example: contractDataFieldScalar (4)

Flat kernel: Each thread handles a quadrature point


operator() (int index) {
  int element = extractElementFromIndex(index);
  int qp = extractQPFromIndex(index);
  double total = 0;
  for (int i = 0; i < vectorSize; ++i) {
    total += A(element, qp, i) * B(element, i);
  }
  result(element, qp) = total;
}
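
The two extract helpers above stand in for the usual flattening of an (element, qp) pair into a single range index; presumably (an assumption, they are not shown on the slide) they reduce to integer division and modulo:

// Assumed mapping for a flat RangePolicy of length
// numberOfElements * numberOfQPs.
KOKKOS_INLINE_FUNCTION
int extractElementFromIndex(const int index) const { return index / numberOfQPs; }

KOKKOS_INLINE_FUNCTION
int extractQPFromIndex(const int index) const { return index % numberOfQPs; }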

Online April 21-24, 2020 153/192


Example: contractDataFieldScalar (6)

Teams kernel: Each team handles an element


operator() (member_type teamMember) {
  int element = teamMember.league_rank();
  parallel_for (
    TeamThreadRange(teamMember, numberOfQPs),
    [=] (int qp) {
      double total = 0;
      for (int i = 0; i < vectorSize; ++i) {
        total += A(element, qp, i) * B(element, i);
      }
      result(element, qp) = total;
    });
}

No real advantage (yet)
Online April 21-24, 2020 154/192
Scratch memory (0)
Each team has access to a “scratch pad”.

Online April 21-24, 2020 155/192


Scratch memory (1)

Scratch memory (scratch pad) as manual cache:


I Accessing data in (level 0) scratch memory is (usually) much
faster than global memory.
I GPUs have separate, dedicated, small, low-latency scratch
memories (NOT subject to coalescing requirements).
I CPUs don’t have special hardware, but programming with
scratch memory results in cache-aware memory access
patterns.
I Roughly, it’s like a user-managed L1 cache.

Important concept
When members of a team read the same data multiple times, it’s
better to load the data into scratch memory and read from there.

Online April 21-24, 2020 156/192


Scratch memory (2)

Scratch memory for temporary per work-item storage:


I Scenario: Algorithm requires temporary workspace of size W.
I Without scratch memory: pre-allocate space for N
work-items of size N x W.
I With scratch memory: Kokkos pre-allocates space for each
Team or Thread of size T x W.
I PerThread and PerTeam scratch can be used concurrently.
I Level 0 and Level 1 scratch memory can be used concurrently.

Important concept
If an algorithm requires temporary workspace for each work-item,
then use Kokkos’ scratch memory.
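
A minimal sketch of that use case, assuming each thread needs a workspace of vectorSize doubles (the names mirror the following slide; everything else is a placeholder):

using ScratchPadView =
    Kokkos::View<double*, ExecutionSpace::scratch_memory_space,
                 Kokkos::MemoryUnmanaged>;
size_t bytes = ScratchPadView::shmem_size(vectorSize);

Kokkos::parallel_for(
  Kokkos::TeamPolicy<ExecutionSpace>(numberOfTeams, teamSize)
      .set_scratch_size(0, Kokkos::PerThread(bytes)),
  KOKKOS_LAMBDA(const member_type& teamMember) {
    // Every thread gets its own level-0 workspace of vectorSize doubles.
    ScratchPadView workspace(teamMember.thread_scratch(0), vectorSize);
    // ... use workspace as per work-item temporary storage ...
  });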

Online April 21-24, 2020 157/192


Scratch memory (3)
To use scratch memory, you need to:
1. Tell Kokkos how much scratch memory you’ll need.
2. Make scratch memory views inside your kernels.
TeamPolicy<ExecutionSpace> policy(numberOfTeams, teamSize);

// Define a scratch memory view type
typedef View<double*, ExecutionSpace::scratch_memory_space,
             MemoryUnmanaged> ScratchPadView;
// Compute how much scratch memory (in bytes) is needed
size_t bytes = ScratchPadView::shmem_size(vectorSize);

// Tell the policy how much scratch memory is needed
int level = 0;
parallel_for (policy.set_scratch_size(level, PerTeam(bytes)),
  KOKKOS_LAMBDA (const member_type& teamMember) {

  // Create a view from the pre-existing scratch memory
  ScratchPadView scratch(teamMember.team_scratch(level),
                         vectorSize);
});
Online April 21-24, 2020 158/192
Example: contractDataFieldScalar (7)

Kernel outline for teams with scratch memory:


operator() (member_type teamMember) {
  ScratchPadView scratch(teamMember.team_scratch(0),
                         vectorSize);

  // TODO: load slice of B into scratch

  parallel_for (
    TeamThreadRange(teamMember, numberOfQPs),
    [=] (int qp) {
      double total = 0;
      for (int i = 0; i < vectorSize; ++i) {
        total += A(element, qp, i) * scratch(i);
      }
      result(element, qp) = total;
    });
}

Online April 21-24, 2020 159/192


Example: contractDataFieldScalar (8)
How to populate the scratch memory?
I One thread loads it all? Serial
if (teamMember.team_rank() == 0) {
  for (int i = 0; i < vectorSize; ++i) {
    scratch(i) = B(element, i);
  }
}

I Each thread loads one entry? teamSize != vectorSize

scratch(team_rank) = B(element, team_rank);
I TeamThreadRange or ThreadVectorRange
parallel_for (
  ThreadVectorRange(teamMember, vectorSize),
  [=] (int i) {
    scratch(i) = B(element, i);
  });

Online April 21-24, 2020 160/192


Example: contractDataFieldScalar (9)
(incomplete) Kernel for teams with scratch memory:
operator() (member_type teamMember) {
  ScratchPadView scratch(...);

  parallel_for (ThreadVectorRange(teamMember, vectorSize),
    [=] (int i) {
      scratch(i) = B(element, i);
    });
  // TODO: fix a problem at this location

  parallel_for (TeamThreadRange(teamMember, numberOfQPs),
    [=] (int qp) {
      double total = 0;
      for (int i = 0; i < vectorSize; ++i) {
        total += A(element, qp, i) * scratch(i);
      }
      result(element, qp) = total;
    });
}
Problem: threads may start to use scratch before all threads are
done loading.
Online April 21-24, 2020 161/192
Example: contractDataFieldScalar (10)
Kernel for teams with scratch memory:
operator() (member_type teamMember) {
  ScratchPadView scratch(...);

  parallel_for (ThreadVectorRange(teamMember, vectorSize),
    [=] (int i) {
      scratch(i) = B(element, i);
    });
  teamMember.team_barrier();

  parallel_for (TeamThreadRange(teamMember, numberOfQPs),
    [=] (int qp) {
      double total = 0;
      for (int i = 0; i < vectorSize; ++i) {
        total += A(element, qp, i) * scratch(i);
      }
      result(element, qp) = total;
    });
}

Online April 21-24, 2020 162/192


Exercise #7: Scratch Memory

Use Scratch Memory to explicitly cache the x-vector for each


element.

Details:
I Location: Intro-Full/Exercises/07/
I Create a scratch view
I Fill the scratch view in parallel using a TeamThreadRange or
ThreadVectorRange
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 6
I Compare behavior of CPU vs GPU

Online April 21-24, 2020 163/192


Exercise #7: Scratch Memory

[Figure: Exercise 07 (Scratch Memory), fixed problem size. Bandwidth (GB/s)
versus number of rows (N), comparing the Exercise 06 and Exercise 07 kernels
on KNL (Xeon Phi 68c), HSW (dual Xeon Haswell 2x16c), and Pascal60 (NVIDIA GPU).]
Online April 21-24, 2020 164/192
Scratch Memory: API Details

Allocating scratch in different levels:


int level = 1; // valid values 0, 1
policy.set_scratch_size(level, PerTeam(bytes));

Using PerThread, PerTeam or both:

policy.set_scratch_size(level, PerTeam(bytes));
policy.set_scratch_size(level, PerThread(bytes));
policy.set_scratch_size(level, PerTeam(bytes1),
                        PerThread(bytes2));

Using both levels of scratch:

policy.set_scratch_size(0, PerTeam(bytes0))
      .set_scratch_size(1, PerThread(bytes1));

Note: set_scratch_size() returns a new policy instance; it
does not modify the existing one.
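
A hedged sketch combining both ideas (the byte counts, element counts, and view type are placeholders):

// Request fast level-0 team scratch and larger level-1 per-thread scratch.
auto policy = Kokkos::TeamPolicy<>(numberOfTeams, teamSize)
                  .set_scratch_size(0, Kokkos::PerTeam(bytes_L0))
                  .set_scratch_size(1, Kokkos::PerThread(bytes_L1));

Kokkos::parallel_for(policy, KOKKOS_LAMBDA(const member_type& teamMember) {
  ScratchPadView fast (teamMember.team_scratch(0),   fastCount);
  ScratchPadView large(teamMember.thread_scratch(1), largeCount);
  // ... use fast as the team-shared cache, large as per-thread workspace ...
});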

Online April 21-24, 2020 165/192


Section Summary

I Scratch Memory can be used with the TeamPolicy to
provide thread- or team-private memory.
I Use case: per work-item temporary storage or manual caching.
I Scratch memory exposes on-chip user managed caches (e.g.
on NVIDIA GPUs)
I The size must be determined before launching a kernel.
I Two levels are available: large/slow and small/fast.

Online April 21-24, 2020 166/192


Task parallelism
Fine-grained dependent execution.

Learning objectives:
I Basic interface for fine-grained tasking in Kokkos
I How to express dynamic dependency structures in Kokkos
tasking
I When to use Kokkos tasking

Online April 21-24, 2020 167/192


Task Parallelism Looks Like Data Parallelism

Recall that data parallel code is composed of a pattern, a policy,


and a functor
Kokkos :: parallel_for (
Kokkos :: RangePolicy < >( exec_space , 0 , N ) ,
SomeFunctor ()
);

Task parallel code similarly has a pattern, a policy, and a functor


Kokkos :: task_spawn (
Kokkos :: TaskSingle ( scheduler , TaskPriority :: High ) ,
SomeFunctor ()
);

Online April 21-24, 2020 168/192


What does a task functor look like?

struct MyTask {
using value_type = double ;
template < class TeamMember >
KOKKOS_INLINE_FUNCTION
void operator ()( TeamMember & member , double & result );
};

I Tell Kokkos what the value type of your task’s output is.
I Take a team member argument, analogous to the team
member passed in by Kokkos::TeamPolicy in hierarchical
parallelism
I The output is expressed by assigning to a parameter, similar
to Kokkos::parallel_reduce

Online April 21-24, 2020 169/192


What policies does Kokkos tasking provide?

I Kokkos::TaskSingle()
I Run the task with a single worker thread
I Kokkos::TaskTeam()
I Run the task with all of the threads in a team
I Think of it like being inside of a parallel_for with a
TeamPolicy
I Both policies take a scheduler, an optional predecessor, and an
optional priority (more on schedulers and predecessors later)
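
A small sketch of both policies in use (the functors are placeholders; the predecessor form mirrors the TaskSingle(scheduler, future) usage shown on the aggregate-predecessors slide later, so treat it as illustrative):

// Spawn a single-thread task, then a team task that waits on it.
auto f_single = Kokkos::task_spawn(
    Kokkos::TaskSingle(scheduler), SomeSingleFunctor());

// Every thread of a team enters SomeTeamFunctor's operator(),
// analogous to a TeamPolicy kernel body.
auto f_team = Kokkos::task_spawn(
    Kokkos::TaskTeam(scheduler, f_single), SomeTeamFunctor());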

Online April 21-24, 2020 170/192


What patterns does Kokkos tasking provide?

I Kokkos::task_spawn()
I Kokkos::host_spawn() (same thing, but from host code)
I Kokkos::respawn()
I Argument order is backwards; policy comes second!
I First argument is ‘this‘ always (not ‘*this‘)
I task_spawn() and host_spawn() return a Kokkos::Future
representing the completion of the task (see next slide), which
can be used as a predecessor to another operation.

Online April 21-24, 2020 171/192


How do futures and dependencies work?
struct MyTask {
  using value_type = double;
  Kokkos::Future<double, Kokkos::DefaultExecutionSpace> dep;
  int depth;
  KOKKOS_INLINE_FUNCTION MyTask(int d) : depth(d) {}
  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member, double& result) {
    if (depth == 1) result = 3.14;
    else if (dep.is_null()) {
      dep =
        Kokkos::task_spawn(
          Kokkos::TaskSingle(member.scheduler()),
          MyTask(depth - 1)
        );
      Kokkos::respawn(this, dep);
    }
    else {
      result = depth * dep.get();
    }
  }
};

Online April 21-24, 2020 172/192


The Scheduler Abstraction

template <class Scheduler>
struct MyTask {
  using value_type = double;
  Kokkos::BasicFuture<double, Scheduler> dep;
  int depth;
  KOKKOS_INLINE_FUNCTION MyTask(int d) : depth(d) {}
  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member, double& result);
};

Available Schedulers:
I TaskScheduler<ExecSpace>
I TaskSchedulerMultiple<ExecSpace>
I ChaseLevTaskScheduler<ExecSpace>

Online April 21-24, 2020 173/192


Spawning from the host

using execution_space  = Kokkos::DefaultExecutionSpace;
using scheduler_type   = Kokkos::TaskScheduler<execution_space>;
using memory_space     = scheduler_type::memory_space;
using memory_pool_type = scheduler_type::memory_pool;
size_t memory_pool_size = 1 << 22;

auto scheduler =
  scheduler_type(memory_pool_type(memory_pool_size));

Kokkos::BasicFuture<double, scheduler_type> result =
  Kokkos::host_spawn(
    Kokkos::TaskSingle(scheduler),
    MyTask<scheduler_type>(10)
  );
Kokkos::wait(scheduler);
printf("Result is %f", result.get());

Online April 21-24, 2020 174/192


Things to Keep in Mind

I Tasks always run to completion


I There is no way to wait or block inside of a task
I future.get() does not block!
I Tasks that do not respawn themselves are complete
I The value in the result parameter is made available through
future.get() to any dependent tasks.
I The second argument to respawn can only be either a
predecessor (future) or a scheduler, not a proper execution
policy
I We are fixing this to provide a more consistent overload in the
next release.
I Tasks can only have one predecessor (at a time)
I Use scheduler.when_all() to aggregate predecessors (see
next slide)

Online April 21-24, 2020 175/192


Aggregate Predecessors

using void_future =
Kokkos :: BasicFuture < void , scheduler_type >;
auto f1 =
Kokkos :: task_spawn ( Kokkos :: TaskSingle ( scheduler ) , X {});
auto f2 =
Kokkos :: task_spawn ( Kokkos :: TaskSingle ( scheduler ) , Y {});
void_future f_array [] = { f1 , f2 };
void_future f_12 = scheduler . when_all ( f_array , 2);
auto f3 =
Kokkos :: task_spawn (
Kokkos :: TaskSingle ( scheduler , f_12 ) , FuncXY {}
);

I To create an aggregate Future, use scheduler.when_all()
I scheduler.when_all() always returns a void future.
I (Also, any future is implicitly convertible to a void future of
the same Scheduler type)

Online April 21-24, 2020 176/192


Exercise #8: Fibonacci

Formula:
F_N = F_{N-1} + F_{N-2}, with F_0 = 0 and F_1 = 1

Serial algorithm:
int fib (int n) {
  if (n < 2) return n;
  else {
    return fib(n-1) + fib(n-2);
  }
}

Details:
I Location: Intro-Full/Exercises/08
I Implement the FibonacciTask task functor recursively
I Spawn the root task from the host and wait for the scheduler
to make it ready
Hints:
I Do the F_{N-1} and F_{N-2} subproblems in separate tasks
I Use a scheduler.when_all() to wait on the subproblems
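One possible shape of the solution, hedged (the value_type, member names, and exact spawning strategy are assumptions; the exercise skeleton may differ):

template <class Scheduler>
struct FibonacciTask {
  using value_type  = long;
  using future_type = Kokkos::BasicFuture<long, Scheduler>;
  int n;
  future_type f_1, f_2;  // F(n-1) and F(n-2) subproblems

  KOKKOS_INLINE_FUNCTION FibonacciTask(int n_) : n(n_) {}

  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION
  void operator()(TeamMember& member, long& result) {
    if (n < 2) { result = n; }
    else if (f_1.is_null()) {
      // First pass: spawn both subproblems, then respawn behind them.
      f_1 = Kokkos::task_spawn(Kokkos::TaskSingle(member.scheduler()),
                               FibonacciTask(n - 1));
      f_2 = Kokkos::task_spawn(Kokkos::TaskSingle(member.scheduler()),
                               FibonacciTask(n - 2));
      Kokkos::BasicFuture<void, Scheduler> deps[] = {f_1, f_2};
      auto all = member.scheduler().when_all(deps, 2);
      Kokkos::respawn(this, all);
    } else {
      // Second pass: both subproblems are complete.
      result = f_1.get() + f_2.get();
    }
  }
};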
Online April 21-24, 2020 177/192
SIMD
Portable vector intrinsic types.

Learning objectives:
I How to use SIMD types to improve vectorization.
I SIMD Types as an alternative to ThreadVector loops.
I SIMD Types to achieve outer loop vectorization.

Online April 21-24, 2020 178/192


Vectorization In Kokkos

So far there were two options for achieving vectorization:


I Hope For The Best: Kokkos semantics make loops
inherently vectorizable; sometimes the compiler even figures it
out.
I Hierarchical Parallelism: TeamVectorRange and
ThreadVectorRange help the compiler with hints such as
#pragma ivdep or #pragma omp simd.

These strategies do run into limits though:


I Compilers often do not vectorize loops on their own.
I An optimal vectorization strategy would require outer-loop
vectorization.
I Vectorization with TeamVectorRange sometimes requires
artificially introducing an additional loop level.

Online April 21-24, 2020 179/192


Outer-Loop Vectorization
A simple scenario calling for outer-loop vectorization:
for ( int i =0; i < N ; i ++) {
// expect K to be small odd 1 ,3 ,5 ,7 for physics reasons
for ( int k =0; k < K ; k ++) b ( i ) += a (i , k );
}
Vectorizing the K-loop is not profitable:
I It is a short reduction.
I Remainders will eat up much time.

Using ThreadVectorRange is cumbersome and requires splitting
the N-loop:
parallel_for ( " VectorLoop " , TeamPolicy < >(0 , N /V , V ) ,
KOKKOS_LAMBDA ( const team_t & team ) {
int i = team . league_rank () * V ;
for ( int k =0; k < K ; k ++)
parallel_for ( T h r e a d V e c t o r R a n g e ( team , V ) , [&]( int ii ) {
b ( i + ii ) += a ( i + ii , k );
});
});
Online April 21-24, 2020 180/192
SIMD Types

To help with this situation, and (particularly in the past) to work
around the lack of auto-vectorizing compilers, SIMD types have
been invented. They:
I Are short vectors of scalars.
I Have operators such as += so one can use them like scalars.
I Are compile time sized.
I Usually map directly to hardware vector instructions.

Important concept: SIMD Type


A SIMD variable is a short vector which acts like a scalar.
Using such a simd type one can simply achieve outer-loop
vectorization by using arrays of simd and dividing the loop range
by its size.
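
A hedged sketch of the earlier K-loop example written this way (the concrete simd type anticipates the library introduced on the following slides; assume N is a multiple of the vector width):

using simd_t = simd::simd<double, simd::simd_abi::native>;
constexpr int V = simd_t::size();

// Views now hold packs: each entry of b_packed is V consecutive b-values.
Kokkos::View<simd_t*>  b_packed("B", N / V);
Kokkos::View<simd_t**> a_packed("A", N / V, K);

Kokkos::parallel_for("OuterVec", N / V, KOKKOS_LAMBDA(const int i) {
  for (int k = 0; k < K; ++k)
    b_packed(i) += a_packed(i, k);  // one vector operation per k
});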

Online April 21-24, 2020 181/192


C++23? SIMD

The ISO C++ standard has a Technical Specification for simd (in
parallelism v2):
template < class T , class Abi >
class simd {
public :
using value_type = T ;
using reference = /* impl defined */ ;
using abi_type = Abi ;
static constexpr size_t size ();
void copy_from ( T const * , aligned_tag );
void copy_to ( T * , aligned_tag ) const ;
T & operator [] ( size_t );
// Element wise operators
};

// Element Wise non - member operators

Online April 21-24, 2020 182/192


C++23? SIMD ABI

One interesting innovation here is the Abi parameter allowing for


different, hardware specific, implementations.
The most important in the proposal are:
I scalar: single element type.
I fixed size< N >: stores N elements.
I max fixed size< T >: stores maximum number of elements
for T.
I native: best fit for hardware.

But std::experimental::simd is not in the standard yet, and


doesn’t support GPUs ...
It also has other problems making it insufficient for our codes ...

Online April 21-24, 2020 183/192


Kokkos SIMD
Just at Sandia we had at least 5 different SIMD types in use.
A unification effort was started with the goal of:
I Match the proposed std::simd API as far as possible.
I Support GPUs.
I Can be used stand-alone or in conjunction with Kokkos.
I Replaces all current implementations at Sandia for SIMD.

We now have an implementation developed by Dan Ibanez, which


is close to meeting all of those criteria:
I For now available at
https://github.com/kokkos/simd-math.
I Considered Experimental, but supports X86, ARM, Power,
NVIDIA GPUs.
I Will be integrated into Kokkos in the next two months.
Online April 21-24, 2020 184/192
Exercise #9: Simple SIMD usage.
Details:
I Location: Intro-Full/Exercises/09/Begin/
I Include the simd.hpp header.
I Change the data type of the views to use
simd::simd<double, simd::simd_abi::native>.
I Create an unmanaged View<double*> of results using the
data() function for the final reduction.

# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./simd.cuda
Things to try:
I Vary problem size (-N ...; -K ...)
I Compare behavior of scalar vs vectorized on CPU and GPU
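
For the unmanaged-view step in the Details above, one hedged possibility (simd_t is the alias from the first bullet; the cast and extents are assumptions about the exercise skeleton):

// View 'results' holds simd packs; alias its storage as plain doubles.
Kokkos::View<double*, Kokkos::MemoryTraits<Kokkos::Unmanaged>>
  results_scalar(reinterpret_cast<double*>(results.data()),
                 results.extent(0) * simd_t::size());

double total = 0.0;
Kokkos::parallel_reduce("FinalSum", results_scalar.extent(0),
  KOKKOS_LAMBDA(const int i, double& lsum) { lsum += results_scalar(i); },
  total);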
Online April 21-24, 2020 185/192
The GPU SIMD Problem

The above exercise used a scalar simd type on the GPU.


Why wouldn’t we use a fixed_size ABI instead?
I Using a fixed_size ABI will create a scalar of size N in each
CUDA thread!
I Loading a fixed_size variable from memory would result in
uncoalesced access.
I If you have correct layouts you get outer-loop vectorization
implicitly on GPUs.
But what if you really want to use warp-level parallelization for
SIMD types?
We need two SIMD types: a storage type and a temporary type!

Online April 21-24, 2020 186/192


cuda_warp ABI
Important concept: simd::storage_type
Every simd<T,ABI> has an associated storage_type typedef.

To help with the GPU issue we split types between storage types
used for Views, and temporary variables.
I Most simd::simd types will just have the same storage type.
I simd<T,cuda_warp<N>> will use warp-level parallelism.
I simd<T,cuda_warp<N>>::storage_type is different though!
I Used in conjunction with TeamPolicy.
using simd_t = simd::simd<T, simd::simd_abi::cuda_warp<V>>;
using simd_storage_t = simd_t::storage_type;
View<simd_storage_t**> data("D", N, M); // will hold N * M * V Ts
parallel_for ("Loop", TeamPolicy<>(N, M, V),
  KOKKOS_LAMBDA (const team_t& team) {
    int i = team.league_rank();
    parallel_for (TeamThreadRange(team, M), [&](int j) {
      data(i, j) = 2.0 * simd_t(data(i, j));
    });
  });

Online April 21-24, 2020 187/192
Exercise #10: SIMD storage usage.
Details:
I Location: Intro-Full/Exercises/10/Begin/
I Include the simd.hpp header.
I Change the data type of the views to use
simd::simd<double, simd::simd_abi::cuda_warp<32>>::storage_type.
I Create an unmanaged View<double*> of results using the
data() function for the final reduction.
I Inside the lambda, use
simd::simd<double, simd::simd_abi::cuda_warp<32>> as the
scalar type.

# Compile for GPU
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./simd.cuda
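
A hedged sketch of the type changes the exercise asks for (the view shape, extent name, and kernel body are placeholders that mirror the cuda_warp slide, not the actual skeleton):

using simd_t      = simd::simd<double, simd::simd_abi::cuda_warp<32>>;
using storage_t   = simd_t::storage_type;
using member_type = Kokkos::TeamPolicy<>::member_type;

// Views store storage_t; each entry represents 32 doubles handled by a warp.
Kokkos::View<storage_t*> results("R", numberOfPacks);

Kokkos::parallel_for("Kernel",
  Kokkos::TeamPolicy<>(numberOfPacks, 1, 32),
  KOKKOS_LAMBDA(const member_type& team) {
    const int i = team.league_rank();
    simd_t tmp(results(i));   // temporary type does the warp-level work
    results(i) = 2.0 * tmp;   // store back through storage_type
  });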

Online April 21-24, 2020 188/192


Advanced SIMD Capabilities

Kokkos SIMD supports math operations:


I Common stuff like abs,sqrt,exp, ...

It also supports masking:


using simd_t = simd<double, simd_abi::native>;
using simd_mask_t = simd_t::mask_type;

simd_t threshold(100.0), a(a(i));

simd_mask_t is_smaller = threshold < a;
simd_t only_smaller = choose(is_smaller, a, threshold);

Online April 21-24, 2020 189/192


SIMD Summary

I SIMD types help vectorize code.


I In particular for outer-loop vectorization.
I There are storage and temporary types.
I Masking is supported too.
I Currently considered experimental at
https://github.com/Kokkos/simd-math: please try it out
and provide feedback.
I Will move into Kokkos proper likely in the next release.

Online April 21-24, 2020 190/192


Conclusion

Kokkos advanced capabilities NOT covered today


I Directed acyclic graph (DAG) of tasks pattern
I Dynamic graph of heterogeneous tasks (maximum flexibility)
I Static graph of homogeneous task (low overhead)
I Portable, thread scalable memory pool
I Plugging in customized multidimensional array data layout
e.g., arbitrarily strided, hierarchical tiling

Online April 21-24, 2020 191/192


Conclusion: Takeaways

I For portability: OpenMP, OpenACC, ... or Kokkos.


I Only Kokkos obtains performant memory access patterns via
architecture-aware arrays and work mapping.
i.e., not just portable, performance portable.
I With Kokkos, simple things stay simple (parallel-for, etc.).
i.e., it’s no more difficult than OpenMP.
I Advanced performance-optimizing patterns are simpler
with Kokkos than with native versions.
i.e., you’re not missing out on advanced features.
I full day tutorial only

Online April 21-24, 2020 192/192
