KokkosTutorial ORNL20
Sandia National Laboratories
Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and
Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S.
Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.
SAND2019-1055814
Prerequisites for Tutorial Exercises
Online Resources:
I https://github.com/kokkos: Primary Kokkos GitHub Organization
I https://github.com/kokkos/kokkos-tutorials/blob/master/Intro-Full/Slides/KokkosTutorial_ORNL20.pdf: These slides
I https://github.com/kokkos/kokkos/wiki: Wiki including API reference
I https://github.com/kokkos/kokkos-tutorials/issues/28: Instructions to get a cloud instance with GPU
I https://kokkosteam.slack.com: Slack workspace for Kokkos
Target machine:
[Diagram: a heterogeneous node — multicore CPUs in NUMA domains with DRAM and on-package memory, an accelerator with its own on-package memory, NVRAM, a network-on-chip, and an external interconnect to the network.]
Important Point
There's a difference between portability and performance portability.
Pattern, Policy, and Body
for ( element = 0; element < numElements ; ++ element ) {   // pattern + policy
  total = 0;
  for ( qp = 0; qp < numQPs ; ++ qp ) {                      // body: everything inside
    total += dot( left[element][qp] , right[element][qp] );
  }
  elementValues[element] = total;
}
Terminology:
I Pattern: structure of the computations
for, reduction, scan, task-graph, ...
I Execution Policy: how computations are executed
static scheduling, dynamic scheduling, thread teams, ...
I Computational Body: code which performs each unit of
work; e.g., the loop body
⇒ The pattern and policy drive the computational body.
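As an aside, these three pieces become explicit once we reach Kokkos later in the tutorial; a minimal sketch, assuming the same numElements, numQPs, dot, left, right, and elementValues as above:

Kokkos::parallel_for(                        // pattern: for
    Kokkos::RangePolicy<>(0, numElements),   // policy: how/where iterations run
    KOKKOS_LAMBDA(const int element) {       // body: one unit of work
      double total = 0;
      for (int qp = 0; qp < numQPs; ++qp)
        total += dot(left[element][qp], right[element][qp]);
      elementValues[element] = total;
    });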
Option 2: OpenACC
#pragma acc parallel copy(...) num_gangs(...) vector_length(...)
#pragma acc loop gang vector
for ( element = 0; element < numElements ; ++ element ) {
  total = 0;
  for ( qp = 0; qp < numQPs ; ++ qp )
    total += dot( left[element][qp] , right[element][qp] );
  elementValues[element] = total;
}
Portable, but not performance portable
struct AtomForceFunctor {
  ...
  void operator()(const int64_t atomIndex) const {
    atomForces[atomIndex] = calculateForce(... data ...);
  }
  ...
};
Important concept
Simple Kokkos usage is no more conceptually difficult than
OpenMP; the annotations just go in different places.
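A hedged sketch of that point, using an illustrative saxpy loop (n, a, x, y are not from the slides):

// OpenMP: the annotation is a pragma above the loop
#pragma omp parallel for
for (int64_t i = 0; i < n; ++i)
  y[i] = a * x[i] + y[i];

// Kokkos: the pattern and policy wrap the same body
Kokkos::parallel_for(n,
  KOKKOS_LAMBDA(const int64_t i) {
    y[i] = a * x[i] + y[i];
  });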
Scalar integration (0)
double totalIntegral = 0;
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
  const double x =
      lower + ( double(i) / numberOfIntervals ) * ( upper - lower );  // cast avoids integer division
  const double thisIntervalsContribution = function( x );
  totalIntegral += thisIntervalsContribution;
}
totalIntegral *= dx;
Pattern? A for-loop. Policy? Serial iteration over [0, numberOfIntervals).
Body? Everything inside the loop: compute x and accumulate into totalIntegral.
An (incorrect) attempt:
double totalIntegral = 0;
Kokkos::parallel_for( numberOfIntervals ,
  [=] ( const int64_t index ) {
    const double x =
        lower + ( double(index) / numberOfIntervals ) * ( upper - lower );
    totalIntegral += function( x );  // broken: concurrent, unreduced updates
  });
totalIntegral *= dx;
double totalIntegral = 0;
#pragma omp parallel for reduction(+: totalIntegral)
for ( int64_t i = 0; i < numberOfIntervals ; ++ i ) {
  totalIntegral += function(...);
}

double totalIntegral = 0;
Kokkos::parallel_reduce( numberOfIntervals ,
  [=] ( const int64_t i , double & valueToUpdate ) {
    valueToUpdate += function(...);
  },
  totalIntegral );
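Putting the pieces together, a minimal self-contained sketch of the corrected scalar integration; the integrand f(x) = x*x and the interval count are illustrative, not from the exercise:

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int64_t numberOfIntervals = 1000000;
    const double lower = 0.0, upper = 1.0;
    const double dx = (upper - lower) / numberOfIntervals;

    double totalIntegral = 0;
    Kokkos::parallel_reduce("integrate", numberOfIntervals,
      KOKKOS_LAMBDA(const int64_t i, double& valueToUpdate) {
        const double x =
            lower + (double(i) / numberOfIntervals) * (upper - lower);
        valueToUpdate += x * x;  // illustrative integrand
      },
      totalIntegral);
    totalIntegral *= dx;
    std::printf("integral = %f\n", totalIntegral);  // ~ 1/3
  }
  Kokkos::finalize();
}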
[Figure: scalar integration results plotted against the number of intervals (10^2 to 10^8, log-log axes).]
Details:
I y is Nx1, A is NxM, x is Mx1
I We’ll use this exercise throughout the tutorial
Details:
I Location: Intro-Full/Exercises/01/Begin/
I Look for comments labeled with “EXERCISE”
I Need to include, initialize, and finalize Kokkos library
I Parallelize loops with parallel_for or parallel_reduce
I Use lambdas instead of functors for computational bodies.
I For now, this will only use the CPU.
Exercise #1: logistics
Compiling for CPU
# gcc using OpenMP (default) and Serial back-ends,
# (optional) change non-default arch with KOKKOS_ARCH
make -j KOKKOS_DEVICES=OpenMP,Serial KOKKOS_ARCH=...
Things to try:
I Vary problem size with command-line arg -S s
I Vary number of rows with command-line arg -N n
I Num rows = 2^n, num cols = 2^m, total size = 2^s == 2^(n+m)
Exercise #1 results
[Figure: <y,Ax> Exercise 01, fixed size — bandwidth (GB/s, 0 to 350) vs. number of rows N (1 to 1x10^9) for HSW, KNL, and KNL (HBM).]
Basic capabilities we haven’t covered
struct Functor {
  double *_x, *_y, _a;
  void operator()(const int64_t i) const {
    _y[i] = _a * _x[i] + _y[i];
  }
};
View abstraction
I A lightweight C++ class with a pointer to array data and a
little meta-data,
I that is templated on the data type (and other things).
Important point
Views are like pointers, so copy them in your functors.
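A small sketch of what "like pointers" means in practice (names are illustrative): copies are shallow and reference-counted, so copying a View into a functor is cheap and both copies alias the same allocation.

Kokkos::View<double*> a("A", 10);
Kokkos::View<double*> b = a;    // shallow copy: b aliases a's data
b(0) = 3.14;                    // visible through a(0) as well
assert(a.data() == b.data());   // same underlying pointer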
View Properties:
I Accessing a View's sizes is done via its extent(dim) function.
Static extents can additionally be accessed via static_extent(dim).
I You can retrieve a raw pointer via its data() function.
I The label can be accessed via label().
Example:
View<double*[5]> a("A", N0);
assert(a.extent(0) == N0);
assert(a.extent(1) == 5);
static_assert(a.static_extent(1) == 5);
assert(a.data() != nullptr);
assert(std::string("A").compare(a.label()) == 0);
I Location: Intro-Full/Exercises/02/Begin/
I Assignment: Change data storage from arrays to Views.
I Compile and run on CPU, and then on GPU with UVM
[Diagram: a heterogeneous node — multicore CPUs in NUMA domains with DRAM and on-package memory, an accelerator with its own on-package memory, NVRAM, a network-on-chip, and an external interconnect to the network.]
MPI_Reduce (...);   // executes on the Host
// Where Kokkos defines:
#define KOKKOS_LAMBDA [=]                /* #if CPU-only */
#define KOKKOS_LAMBDA [=] __device__     /* #if CPU+Cuda */
double sum = 0;
Kokkos::parallel_reduce( "Label" ,
  RangePolicy<SomeExampleExecutionSpace>(0, size) ,
  KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
    valueToUpdate += data( index );
  },
  sum );
⇒ Memory Spaces
Memory spaces (0)
Memory space:
explicitly-manageable memory resource
(i.e., “place to put data”)
[Diagram: the same heterogeneous node, highlighting its distinct memory resources — DRAM, on-package memory, NVRAM, and accelerator memory.]
Example: CudaSpace
View<double**, CudaSpace> view(... constructor arguments ...);
#define KL KOKKOS_LAMBDA
View<int*, Cuda> dev;
parallel_for( "Label" , N ,
  KL ( int i ) {
    dev( i ) = ...;
  });
#define KL KOKKOS_LAMBDA
View<int*, Cuda> dev;
View<int*, Host> host;
parallel_for( "Label" , N ,
  KL ( int i ) {
    dev( i )  = ...;
    host( i ) = ...;  // problem: host data accessed from the Cuda execution space
  });
double sum = 0;
Kokkos::parallel_reduce( "Label" ,
  RangePolicy<Cuda>(0, size) ,
  KOKKOS_LAMBDA ( const int64_t index , double & valueToUpdate ) {
    valueToUpdate += array( index );  // illegal access
  },
  sum );
What's the solution?
I CudaUVMSpace
I CudaHostPinnedSpace (skipping)
I Mirroring
CudaUVMSpace
#define KL KOKKOS_LAMBDA
View<double*, CudaUVMSpace> array;
array = ... from file ...
double sum = 0;
parallel_reduce( "Label" , N ,
  KL ( int i , double & d ) {
    d += array( i );
  },
  sum );
Mirroring schematic
typedef Kokkos::View<double**, Space> ViewType;
ViewType view(...);
ViewType::HostMirror hostView =
    Kokkos::create_mirror_view( view );
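A hedged sketch of the full mirroring workflow built from the schematic above (Space, N, and M are placeholders):

Kokkos::View<double**, Space> view("V", N, M);
auto hostView = Kokkos::create_mirror_view(view);

for (size_t i = 0; i < hostView.extent(0); ++i)      // 1. fill on the host
  for (size_t j = 0; j < hostView.extent(1); ++j)
    hostView(i, j) = 1.0;

Kokkos::deep_copy(view, hostView);   // 2. host -> device (no-op if they alias)

// 3. ... run kernels on view in Space ...

Kokkos::deep_copy(hostView, view);   // 4. device -> host before reading results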
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Change number of repeats (-nrepeat ...)
I Compare behavior of CPU vs GPU
LayoutLeft
in 2D, “column-major”
LayoutRight
in 2D, “row-major”
Details:
I Location: Intro-Full/Exercises/04/Begin/
I Replace "N" in the parallel dispatch with RangePolicy<ExecSpace>
I Add MemSpace to all Views and Layout to A
I Experiment with the combinations of ExecSpace, Layout to view
performance
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Change number of repeats (-nrepeat ...)
I Compare behavior of CPU vs GPU
I Compare using UVM vs not using UVM on GPUs
I Check what happens if MemSpace and ExecSpace do not match.
[Figure: Exercise 04 bandwidth (GB/s, 0 to 600) vs. number of rows (1 to 1x10^9) for HSW, KNL, and Pascal60 with LayoutLeft vs. LayoutRight; annotated "Why?".]
Important point
For performance, accesses to Views in HostSpace must be cached,
while accesses to Views in CudaSpace must be coalesced.
double sum = 0;
Kokkos::parallel_reduce( "Label" ,
  RangePolicy<Space>(0, size) ,
  KOKKOS_LAMBDA ( const size_t index , double & valueToUpdate ) {
    valueToUpdate += data( index );
  },
  sum );
Important point
Kokkos maps indices to cores in contiguous chunks on CPU
execution spaces, and strided for Cuda.
Rule of Thumb
Kokkos index mapping and default layouts provide efficient access
if iteration indices correspond to the first index of the array.
Example:
View<double***, ...> view(...);
...
Kokkos::parallel_for( "Label" , ... ,
  KOKKOS_LAMBDA ( const size_t workIndex ) {
    ...
    view( ... , ... , workIndex ) = ... ;
    view( ... , workIndex , ... ) = ... ;
    view( workIndex , ... , ... ) = ... ;  // good: workIndex is the first index
  });
...
...
Important point
Performant memory access is achieved by Kokkos mapping parallel
work indices and multidimensional array layout appropriately for
the architecture.
[Figure: the same Exercise 04 plot annotated with access regimes — coalesced vs. uncoalesced on the GPU, cached vs. uncached on the CPUs.]
DualView
Learning objectives:
I Motivation and Value Added.
I Usage.
I Exercises.
[Diagram: deep_copy moves data between a device View and its host MirrorView; a DualView bundles a device View and a host View into one object.]
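A hedged sketch of the DualView modify/sync protocol (member names per recent Kokkos releases; n is illustrative):

Kokkos::DualView<double*> dv("dv", n);

auto h = dv.view_host();
for (size_t i = 0; i < n; ++i) h(i) = 1.0;   // write on the host
dv.modify_host();                            // mark the host side as modified

dv.sync_device();   // copies host -> device only because host was marked

auto d = dv.view_device();
Kokkos::parallel_for("scale", n, KOKKOS_LAMBDA(const int i) {
  d(i) *= 2.0;
});
dv.modify_device();  // device side is now the newer one
dv.sync_host();      // copy back before reading on the host again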
Details:
I Location: Intro-Full/Exercises/dualview/Begin/
I Modify or create a new compute_enthalpy function in
dual_view_exercise.cpp to:
  1. Take (dual) views as arguments
  2. Call modify() and/or sync() when appropriate for the dual views
  3. Run the kernel on host or device execution spaces
WorkTag
I Options: SomeClass
BeginList
I Initializer list or Kokkos::Array (required): rank arguments for the
beginnings of the index space
I Example, rank 2: {b0,b1}
EndList
I Initializer list or Kokkos::Array (required): rank arguments for the
ends of the index space
I Example, rank 2: {e0,e1}
TileDimList
I Initializer list or Kokkos::Array (optional): rank arguments for the
dimensions of the tiles
I Example, rank 2: {t0,t1}
mdrpolicy( {b0,b1} , {e0,e1} , {t0,t1} );
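A hedged sketch of constructing such a policy and using it to initialize a 2D View (N, M, and the 4x4 tile are illustrative):

Kokkos::View<double**> A("A", N, M);
Kokkos::MDRangePolicy<Kokkos::Rank<2>> mdrpolicy({0, 0}, {N, M}, {4, 4});
Kokkos::parallel_for("init_A", mdrpolicy,
  KOKKOS_LAMBDA(const int i, const int j) {
    A(i, j) = 1.0 * (i + j);   // illustrative initialization
  });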
Exercise - mdrange: Initialize multi-dim views with MDRangePolicy
Details:
I Location: Intro-Full/Exercises/mdrange/Begin/
I This begins with the Solution of 02
I Initialize the device Views x and y directly on the device using a
parallel_for and RangePolicy
I Initialize the device View matrix A directly on the device using a
parallel_for and MDRangePolicy
Things to try:
I Name the kernels - pass a string as the first argument of the parallel
pattern
I Try changing the iteration patterns for the tiles in the
MDRangePolicy, notice differences in performance
Subview description:
I A subview is a 'slice' of a View that behaves as a View
I Same syntax as a View - access data using (multi-)index entries
I The 'slice' and the original View point to the same data - no extra
memory allocation or copying
I Can be constructed on host or within a kernel (no allocation
of memory occurs)
I Similar capability as provided by Matlab, Fortran, Python, etc.
using 'colon' notation
auto slicei0 =
    Kokkos::subview( v , i0 , std::make_pair(0, v.extent(1)) ,
                               std::make_pair(0, v.extent(2)) );
// extent(N) returns the size of dimension N of the View
Syntax:
Kokkos::subview( Kokkos::View<...> view ,
                 arg0 ,
                 ... )
I view: First argument to the subview is the view of which a slice will
be taken
I argN: Slice info for rank N - provide same number of arguments as
rank
I Options for argN:
I index: integral type single value
I partial-range: std::pair or Kokkos::pair of integral types to
provide sub-range of a rank’s range [0,N)
I full-range: use Kokkos::ALL rather than providing the full
range as a pair
Suggested usage:
I Use ’auto’ to determine the return type of a subview
I A subview can help with encapsulation - e.g. it can be passed into
functions expecting a lower-dimensional View (see the sketch after this list)
I Use Kokkos::pair for partial ranges if subview created within a
kernel
I Avoid usage if very few data accesses will be made to the
subview
I Construction of subview costs 20-40 operations
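For instance, a hedged sketch of taking a row slice inside a kernel and treating it as a 1D View (A, x, y, N, M, and result echo the running <y,Ax> example):

Kokkos::parallel_reduce("yAx", N,
  KOKKOS_LAMBDA(const int j, double& update) {
    auto rowA = Kokkos::subview(A, j, Kokkos::ALL());  // A(j,:), no allocation
    double sum = 0;
    for (int i = 0; i < M; ++i)
      sum += rowA(i) * x(i);
    update += y(j) * sum;
  },
  result);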
Details:
I Location: Intro-Full/Exercises/subview/Begin/
I This begins with the Solution of 04
I In the parallel_reduce kernel, create a subview for row j of View A
I Use this subview when computing A(j,:)*x(:) rather than the matrix A
# Compile for CPU
make -j KOKKOS_DEVICES=OpenMP
# Compile for GPU (we do not need UVM anymore)
make -j KOKKOS_DEVICES=Cuda
# Run on GPU
./subview_exercise.cuda -S 26
Examples: Histogram
Histogram kernel:
parallel_for( N , KOKKOS_LAMBDA ( const size_t index ) {
  const Something value = ...;
  const size_t bucketIndex = computeBucketIndex( value );
  ++_histogram( bucketIndex );  // race: concurrent increments of the same bucket
});
Solution strategies:
I Locks: not feasible on GPU
I Thread-private copies:
not thread-scalable
I Atomics
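For the atomics strategy, a hedged sketch of the fixed kernel: each increment becomes an atomic read-modify-write, so concurrent updates to the same bucket are safe (the elided value computation stays as in the slide):

parallel_for( N , KOKKOS_LAMBDA ( const size_t index ) {
  const Something value = ...;
  const size_t bucketIndex = computeBucketIndex( value );
  Kokkos::atomic_increment( &_histogram( bucketIndex ) );
});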
Atomics
Experimental setup
operator()( const unsigned int index ) const {
  Kokkos::atomic_add( &globalSums[ index % atomicStride ] , 1 );
}
[Figure: atomics slowdown (log scale) vs. log10(contention) for cuda, omp, and phi back-ends with double, uint64_t/size_t, float, and unsigned types.]
Performance of atomics (2)
Atomics performance: 1 million adds, no work per kernel
[Figure: same axes as above; high penalty for high contention.]
Performance of atomics (3)
Atomics performance: 1 million adds, some work per kernel
[Figure: same axes as above; high penalty for high contention.]
Performance of atomics (4)
Atomics performance: 1 million adds, lots of work per kernel
[Figure: same axes as above; high penalty for high contention.]
Advanced features
Learning objectives:
I Similarities and differences between outer and inner levels of
parallelism
I Thread teams (league of teams of threads)
I Performance improvement with well-coordinated teams
Atomics kernel:
Kokkos::parallel_for( "yAx" , N ,
  KOKKOS_LAMBDA ( const size_t index ) {
    const int row = extractRow( index );
    const int col = extractCol( index );
    atomic_add( &result , A( row , col ) * x( col ) );
  });
double thisRowsSum = 0;
parallel_reduce( TeamThreadRange( teamMember , M ) ,
  [=] ( int col , double & innerUpdate ) {
    innerUpdate += A( row , col ) * x( col );
  } , thisRowsSum );
if ( teamMember.team_rank() == 0 ) {
  update += y( row ) * thisRowsSum;
}
} , result );
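For context, a hedged sketch of the complete nested pattern the fragment above belongs to (A is N x M; names echo the running example):

using team_policy = Kokkos::TeamPolicy<>;
using member_type = team_policy::member_type;

double result = 0;
Kokkos::parallel_reduce("yAx", team_policy(N, Kokkos::AUTO),
  KOKKOS_LAMBDA(const member_type& teamMember, double& update) {
    const int row = teamMember.league_rank();   // one team per row
    double thisRowsSum = 0;
    Kokkos::parallel_reduce(Kokkos::TeamThreadRange(teamMember, M),
      [=](const int col, double& innerUpdate) {
        innerUpdate += A(row, col) * x(col);
      }, thisRowsSum);
    if (teamMember.team_rank() == 0) {
      update += y(row) * thisRowsSum;
    }
  }, result);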
Important point
Using teams is changing the execution policy.
Important point
When using teams, functor operators receive a team member.
Warning
There may be more (or fewer) team members than pieces of your
algorithm’s work per team
GPUs
I Special hardware available for coordination within a team.
I Within a team, 32 (NVIDIA) or 64 (AMD) threads execute in
"lock step."
I Maximum team size: 1024; recommended team size: 128/256
Intel Xeon Phi:
I Recommended team size: # hyperthreads per core
I Hyperthreads share entire cache hierarchy
a well-coordinated team avoids cache-thrashing
Exercise #5: Inner Product, Hierarchical Parallelism
Details:
I Location: Intro-Full/Exercises/05/
I Replace RangePolicy<Space> with TeamPolicy<Space>
I Use AUTO for team size
I Make the inner loop a parallel_reduce with TeamThreadRange
policy
I Experiment with the combinations of Layout, Space, N to view
performance
I Hint: what should the layout of A be?
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 4 for very non-square matrices
I Compare behavior of CPU vs GPU
Reminder, Exercise #4 with Flat Parallelism
<y|Ax> Exercise 04 (Layout) Fixed Size
KNL: Xeon Phi 68c HSW: Dual Xeon Haswell 2x16c Pascal60: Nvidia GPU
[Figure: bandwidth (GB/s, 0 to 600) vs. number of rows for HSW, KNL, and Pascal60, LayoutLeft vs. LayoutRight, annotated coalesced/cached/uncoalesced/uncached.]
[Figure: Exercise 05 results — bandwidth (GB/s, 0 to 600) vs. number of rows for HSW, KNL, and Pascal60, LayoutLeft vs. LayoutRight, annotated coalesced and cached.]
totalSum = numberOfThreads * 10
Details:
I Location: Intro-Full/Exercises/06/
I Use the single policy instead of checking the team rank (see the sketch after this list)
I Parallelize all three loop levels.
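A hedged sketch of the single-policy hint: Kokkos::single runs its body once per team (PerTeam) or once per thread (PerThread), replacing the manual rank check from Exercise 5:

// before: if (teamMember.team_rank() == 0) { update += y(row) * thisRowsSum; }
Kokkos::single(Kokkos::PerTeam(teamMember), [&]() {
  update += y(row) * thisRowsSum;
});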
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 5 for very non-square matrices
I Compare behavior of CPU vs GPU
[Figure: bandwidth (GB/s, 0 to 450) vs. number of rows (1 to 1x10^6) comparing two-level (2L, Begin) and three-level (3L) parallelism on HSW, KNL, and Pascal60.]
Section Summary
Parallelization approaches:
I Each thread handles an element.
Threads: numberOfElements
I Each thread handles a qp.
Threads: numberOfElements * numberOfQPs
I Each thread handles an i.
Threads: numElements * numQPs * vectorSize
Requires a parallel_reduce.
Important concept
When members of a team read the same data multiple times, it’s
better to load the data into scratch memory and read from there.
Important concept
If an algorithm requires temporary workspace for each work-item,
then use Kokkos’ scratch memory.
parallel_for(
  TeamThreadRange( teamMember , numberOfQPs ) ,
  [=] ( int qp ) {
    double total = 0;
    for ( int i = 0; i < vectorSize ; ++ i ) {
      total += A( element , qp , i ) * scratch( i );
    }
    result( element , qp ) = total;
  });
}
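A hedged sketch of how such a scratch view can be set up (level-0, per-team scratch; names and sizes are illustrative and the actual exercise may differ):

using ExecSpace   = Kokkos::DefaultExecutionSpace;
using member_type = Kokkos::TeamPolicy<ExecSpace>::member_type;
using ScratchView = Kokkos::View<double*, ExecSpace::scratch_memory_space,
                                 Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

const size_t bytes = ScratchView::shmem_size(vectorSize);  // per-team request
Kokkos::parallel_for("scratch",
  Kokkos::TeamPolicy<ExecSpace>(numberOfElements, Kokkos::AUTO)
      .set_scratch_size(0, Kokkos::PerTeam(bytes)),
  KOKKOS_LAMBDA(const member_type& teamMember) {
    ScratchView scratch(teamMember.team_scratch(0), vectorSize);
    // ... fill scratch cooperatively, teamMember.team_barrier(), then use ...
  });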
Details:
I Location: Intro-Full/Exercises/07/
I Create a scratch view
I Fill the scratch view in parallel using a TeamThreadRange or
ThreadVectorRange
Things to try:
I Vary problem size and number of rows (-S ...; -N ...)
I Compare behavior with Exercise 6
I Compare behavior of CPU vs GPU
[Figure: bandwidth (GB/s, 0 to 600) vs. number of rows (1 to 1x10^6) comparing Exercise 06 and Exercise 07 on HSW, KNL, and Pascal60.]
Scratch Memory: API Details
Learning objectives:
I Basic interface for fine-grained tasking in Kokkos
I How to express dynamic dependency structures in Kokkos
tasking
I When to use Kokkos tasking
struct MyTask {
  using value_type = double;
  template <class TeamMember>
  KOKKOS_INLINE_FUNCTION
  void operator()( TeamMember & member , double & result );
};
I Tell Kokkos what the value type of your task's output is.
I Take a team member argument, analogous to the team member
passed in by Kokkos::TeamPolicy in hierarchical parallelism.
I The output is expressed by assigning to a parameter, similar to
Kokkos::parallel_reduce.
I Kokkos::TaskSingle()
I Run the task with a single worker thread
I Kokkos::TaskTeam()
I Run the task with all of the threads in a team
I Think of it like being inside of a parallel_for with a
TeamPolicy
I Both policies take a scheduler, an optional predecessor, and an
optional priority (more on schedulers and predecessors later)
I Kokkos::task_spawn()
I Kokkos::host_spawn() (same thing, but from host code)
I Kokkos::respawn()
I Argument order is backwards; the policy comes second!
I The first argument is always 'this' (not '*this')
I task_spawn() and host_spawn() return a Kokkos::Future
representing the completion of the task (see next slide), which
can be used as a predecessor to another operation.
Available Schedulers:
I TaskScheduler<ExecSpace>
I TaskSchedulerMultiple<ExecSpace>
I ChaseLevTaskScheduler<ExecSpace>
using execution_space  = Kokkos::DefaultExecutionSpace;
using scheduler_type   = Kokkos::TaskScheduler<execution_space>;
using memory_space     = scheduler_type::memory_space;
using memory_pool_type = scheduler_type::memory_pool;
size_t memory_pool_size = 1 << 22;
auto scheduler =
    scheduler_type( memory_pool_type( memory_pool_size ) );
using void_future =
    Kokkos::BasicFuture<void, scheduler_type>;
auto f1 =
    Kokkos::task_spawn( Kokkos::TaskSingle( scheduler ) , X{} );
auto f2 =
    Kokkos::task_spawn( Kokkos::TaskSingle( scheduler ) , Y{} );
void_future f_array[] = { f1 , f2 };
void_future f_12 = scheduler.when_all( f_array , 2 );
auto f3 =
    Kokkos::task_spawn(
        Kokkos::TaskSingle( scheduler , f_12 ) , FuncXY{}
    );
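To block host code until the spawned tasks have finished, the usual call is Kokkos::wait on the scheduler (a hedged sketch; reading a future's value afterwards is shown as an assumption):

Kokkos::wait( scheduler );   // drain the scheduler from host code
// f3.get();                 // hypothetical: read f3's result only after the wait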
Details:
I Location: Intro-Full/Exercises/08
I Implement the FibonacciTask task functor recursively
I Spawn the root task from the host and wait for the scheduler
to make it ready
Hints:
I Do the F(N-1) and F(N-2) subproblems in separate tasks
I Use scheduler.when_all() to wait on the subproblems
SIMD
Portable vector intrinsic types.
Learning objectives:
I How to use SIMD types to improve vectorization.
I SIMD Types as an alternative to ThreadVector loops.
I SIMD Types to achieve outer loop vectorization.
To help with this situation, and (particularly in the past) to make up
for the lack of auto-vectorizing compilers, SIMD types were invented.
They:
I Are short vectors of scalars.
I Have operators such as += so one can use them like scalars.
I Are compile time sized.
I Usually map directly to hardware vector instructions.
The ISO C++ standard has a Technical Specification for simd (in
parallelism v2):
template <class T, class Abi>
class simd {
public:
  using value_type = T;
  using reference  = /* impl defined */;
  using abi_type   = Abi;
  static constexpr size_t size();
  void copy_from( T const * , aligned_tag );
  void copy_to( T * , aligned_tag ) const;
  T & operator[]( size_t );
  // Element-wise operators
};
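A hedged sketch of using such a type for an axpy, following the interface above rather than any specific library (simd, Abi, aligned_tag, and the data pointers are assumptions):

using simd_t = simd<double, Abi>;
for (size_t i = 0; i < n; i += simd_t::size()) {
  simd_t xv, yv;
  xv.copy_from(x + i, aligned_tag{});   // vector load
  yv.copy_from(y + i, aligned_tag{});
  yv = yv + simd_t(a) * xv;             // operators work like scalars
  yv.copy_to(y + i, aligned_tag{});     // vector store
}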
To help with the GPU issue we split types between storage types
used for Views, and temporary variables.
I Most simd::simd types will just have the same storage type.
I simd<T, cuda_warp<N>> will use warp-level parallelism.
I simd<T, cuda_warp<N>>::storage_type is different, though!
I Used in conjunction with TeamPolicy.
using simd_t = simd::simd<T, simd::simd_abi::cuda_warp<V>>;
using simd_storage_t = simd_t::storage_type;
View<simd_storage_t**> data( "D" , N , M );  // will hold N*M*V Ts
parallel_for( "Loop" , TeamPolicy<>( N , M , V ) ,
  KOKKOS_LAMBDA ( const team_t & team ) {
    int i = team.league_rank();
    parallel_for( TeamThreadRange( team , M ) , [&]( int j ) {
      data( i , j ) = 2.0 * simd_t( data( i , j ) );
    });
  });
Exercise #10: SIMD storage usage.
Exercise #10: SIMD storage usage.
Details:
I Location: Intro-Full/Exercises/10/Begin/
I Include the simd.hpp header.
I Change the data type of the Views to use
simd::simd<double, simd::simd_abi::cuda_warp<32>>::storage_type.
I Create an unmanaged View<double*> of results using the
data() function for the final reduction.
I Inside the lambda, use
simd::simd<double, simd::simd_abi::cuda_warp<32>> as the scalar type.