Threads and Memory
Frauke Sprengel
Outline
1 Thread Management
2 Error Handling and Debugging
3 Memory Hierarchy
4 Guidelines
Outline
1 Thread Management
  Thread Coordinates
  SPMD Architecture
  Synchronization of Threads
2 Error Handling and Debugging
3 Memory Hierarchy
4 Guidelines
Thread Coordinates
Logical Coordinates
Physical Coordinates
SPMD Architecture
The following tables list the line numbers that are processed at a given time
by concurrent threads for different input values; a sketch of a matching function is given after the tables.
Due to its conditional section, the function is not suitable for processing by a
SIMD architecture.
MIMD (multicore):

  arr[]     -12    0   24   36
  Thread     T0   T1   T2   T3
  Line no.    2    2    2    2
              3    3    3    3
              5    5    4    4
              6    7    7    7
              7    .    .    .

SPMD (GPU):

  arr[]     -12    0   24   36
  Thread     T0   T1   T2   T3
  Line no.    2    2    2    2
              3    3    3    3
              .    .    4    4
              5    5    .    .
              6    .    .    .
              7    7    7    7

( . = no instruction executed: the thread is suspended or has already finished)
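The line numbers can be read against a function of roughly the following shape. This is a hedged reconstruction, since the original function is not reproduced in this extract; the branch conditions and computed values are assumptions chosen only to match the tables.

__device__ float f(float x) {      // line 1
    float y = 0.f;                 // line 2: executed by all threads
    if (x > 0.f)                   // line 3: branch condition, evaluated by all
        y = sqrtf(x);              // line 4: taken by T2, T3 (24, 36)
    else if (x < 0.f)              // line 5: evaluated by T0, T1 (-12, 0)
        y = -x;                    // line 6: taken by T0 only (-12)
    return y;                      // line 7: executed by all threads
}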
SPMD Architecture
It can be seen that in the SPMD architecture only one instruction is executed
at a given time.
Threads whose execution path does not include this instruction are suspended.
This means that the SPMD architecture offers the flexibility of diverging
threads at the cost of reduced performance in conditional sections.
Synchronization of Threads
Avoid the use of __syncthreads() inside divergent code: if some threads of a block never reach the barrier, the block deadlocks. Instead, hoist the barrier out of the conditional sections, as in the following loop.
// round the iteration count up to a multiple of the block size, so that
// every thread executes the same number of loop iterations
unsigned int imax = blockDim.x * ((nelements + blockDim.x - 1) / blockDim.x);

for (int i = threadIdx.x; i < imax; i += blockDim.x) {

    if (i < nelements) {
        ...
    }

    // reached by every thread of the block, even those with i >= nelements
    __syncthreads();

    if (i < nelements) {
        ...
    }
}
Outline
1 Thread Management
2 Error Handling and Debugging
  Error Handling via Return Codes
  Error Handling in Kernel Launches
  Debugging Using Console Output
  Debugging Using cuda-gdb
3 Memory Hierarchy
4 Guidelines
Error Handling via Return Codes
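The slide's original example is not reproduced in this extract. A minimal sketch of the usual pattern, with an arbitrary allocation as the call to be checked:

#include <cstdio>

float* devPtr;
// most CUDA runtime functions return a cudaError_t status code
cudaError_t err = cudaMalloc(&devPtr, 256 * sizeof(float));
if (err != cudaSuccess) {
    // cudaGetErrorString() converts the code into a readable message
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
}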
Error Handling in Kernel Launches
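Again, the original example is not reproduced here. A hedged sketch of the standard pattern: kernel launches return no status code themselves, so errors are queried afterwards (myKernel, dimGrid, and dimBlock are placeholders):

myKernel<<<dimGrid, dimBlock>>>();
// detects launch errors, e.g., an invalid launch configuration
cudaError_t errLaunch = cudaGetLastError();
// errors occurring during kernel execution only surface after synchronization
cudaError_t errSync = cudaDeviceSynchronize();
if (errLaunch != cudaSuccess)
    fprintf(stderr, "launch: %s\n", cudaGetErrorString(errLaunch));
if (errSync != cudaSuccess)
    fprintf(stderr, "execution: %s\n", cudaGetErrorString(errSync));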
Debugging Using Console Output
For devices of compute capability 2.0 (Fermi architecture) and higher, CUDA
provides a printf() function that can be used in device code to print
information to the standard output stream.
__global__
void myKernel() {
    // argument order matches the format string: thread index, then block index
    printf("This is thread %d in block %d.\n",
           threadIdx.x, blockIdx.x);
}
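A usage note: device-side printf() output is buffered and only flushed to the host at synchronization points, so a typical (hedged, arbitrary configuration) launch sequence looks like this:

myKernel<<<2, 4>>>();     // launch configuration arbitrary
cudaDeviceSynchronize();  // flushes the device printf buffer to stdout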
Debugging Using cuda-gdb
cuda-gdb is an extension of the GNU debugger gdb under Linux, which can be
used for devices of compute capability 1.1 and higher. It allows setting
breakpoints in CUDA kernels, access to GPU memory, and information on device
usage by threads.
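As a hedged usage sketch (the file name is a placeholder): the program is compiled with debug information for both host and device code and then started under the debugger:

nvcc -g -G myapp.cu -o myapp   # -g: host debug info, -G: device debug info
cuda-gdb ./myapp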
Commands (Selection)
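The original command table is not reproduced in this extract. A hedged selection of standard gdb/cuda-gdb commands for orientation (not necessarily the slide's original list):

break myKernel        # set a breakpoint at a kernel
run / continue        # start / resume execution
next / step           # step over / into a statement
print expr            # inspect a variable or expression
info cuda kernels     # list the kernels active on the device
info cuda threads     # list device threads and their current locations
cuda block 1,0 thread 3,0,0   # switch the debugger focus to a given thread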
Outline
1 Thread Management
2 Error Handling and Debugging
3 Memory Hierarchy
  CUDA Memory Types
  Matrix Multiplication Example
4 Guidelines
CUDA Memory Hierarchy
CUDA Memory Types
Device memory: global device memory (i.e., the GPU's graphics memory),
accessible by all threads of a kernel and by host code (via cudaMemcpy()).

Constant cache: cache memory of each multiprocessor used for constant
values, read-only for all threads within a block.

Texture cache: cache memory of each multiprocessor used for textures,
read-only for all threads within a block.
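As a hedged illustration of how constant memory is used through this cache (the variable names are placeholders): constants are declared with the __constant__ specifier and filled from the host via cudaMemcpyToSymbol():

__constant__ float filterCoeffs[16];  // constant memory, cached per multiprocessor

// host code: copy the coefficients to the constant memory symbol
float hostCoeffs[16] = {};  // fill with the actual coefficient values
cudaMemcpyToSymbol(filterCoeffs, hostCoeffs, sizeof(hostCoeffs));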
Matrix Multiplication Example
$C = A \cdot B$, with $A, B, C \in \mathbb{R}^{n \times n}$:

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}, \qquad i, j \in \{1, \dots, n\}.$$
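A small worked instance of the formula (numbers chosen arbitrarily):

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
= \begin{pmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\
                  3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{pmatrix}
= \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$$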
The problem is obviously of complexity $O(n^3)$, which results in three nested
loops in the straightforward CPU implementation.
CPU Implementation of Matrix Multiplication
// straightforward O(n^3) triple loop; matrices stored row-major in 1D arrays
void multMatricesCPU(float* matrixC,
        const float* matrixA, const float* matrixB,
        int size) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            float multVal = 0.f;
            for (int k = 0; k < size; k++) {
                multVal += matrixA[i * size + k] *
                           matrixB[k * size + j];
            }
            matrixC[i * size + j] = multVal;
        }
    }
}
GPU Implementation of Matrix Multiplication Using Device Memory
In the first approach, all matrices are stored in global device memory. Since
the matrices are kept in 1D arrays and processed by multiple blocks, index
computations with block and row-size offsets (strides) are required.
The following code examples closely follow Kirk and Hwu (2010) and the CUDA C
Programming Guide (NVIDIA 2015a).
Note that the variable names for indices in the code do not match the
notation in the figures.
Kernel Using Global Device Memory
__global__
void multMatricesGPUKernel(float* dest,
        const float* srcA, const float* srcB) {
    // matrices, grid, and blocks are square
    int size = gridDim.x * blockDim.x;
    int rowIdx = blockIdx.y * blockDim.y + threadIdx.y;
    int colIdx = blockIdx.x * blockDim.x + threadIdx.x;
    float multVal = 0.f;
    // each thread computes one element of the result matrix
    for (int k = 0; k < size; k++) {
        multVal += srcA[rowIdx * size + k] *
                   srcB[k * size + colIdx];
    }
    dest[rowIdx * size + colIdx] = multVal;
}
Kernel Launch (Using Global Device Memory)
const int BLOCK_SIZE = 16;
int size = 256;   // matrix size, assumed to be a multiple of BLOCK_SIZE
// define matrices, cudaMalloc(), cudaMemcpy() ...
int gridSize = size / BLOCK_SIZE;
dim3 dimGrid(gridSize, gridSize);
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
multMatricesGPUKernel<<<dimGrid, dimBlock>>>(
    deviceMemPtrDest, deviceMemPtrSrcA, deviceMemPtrSrcB);
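The host-side setup elided in the comment above follows the usual allocate/copy pattern. A hedged sketch, reusing the pointer names from the launch (hostA, hostB, hostC are placeholder host arrays assumed to be allocated and filled):

size_t bytes = size * size * sizeof(float);
float *deviceMemPtrDest, *deviceMemPtrSrcA, *deviceMemPtrSrcB;
cudaMalloc(&deviceMemPtrDest, bytes);
cudaMalloc(&deviceMemPtrSrcA, bytes);
cudaMalloc(&deviceMemPtrSrcB, bytes);
// copy the input matrices to the device
cudaMemcpy(deviceMemPtrSrcA, hostA, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(deviceMemPtrSrcB, hostB, bytes, cudaMemcpyHostToDevice);
// ... launch the kernel ..., then copy the result back
cudaMemcpy(hostC, deviceMemPtrDest, bytes, cudaMemcpyDeviceToHost);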
Performance (Using Global Device Memory)
GPU Implementation Using Shared Memory
The second (better) approach is to use shared memory to store the matrices
to be multiplied.
Shared memory is smaller than device memory and accessible only by the
threads within a single block. Thus only the required sub-matrices (matrix
blocks) of the given block size are copied to shared memory. The blocks
within a given row of matrix A and the corresponding column of matrix B are
multiplied one after another.
Variables in shared memory are declared by means of the specifier
__shared__. To ensure that the copying is finished before the multiplication
starts (and vice versa), the threads within a block have to be synchronized
via __syncthreads(). Multiplying sub-matrices requires further index
computations within a block.
Kernel Using Shared Memory
// the original declared the block size at run time, but statically declared
// shared memory arrays require a compile-time constant size
#define BLOCK_SIZE 16

__global__
void multMatricesGPUKernelShared(float* dest,
        const float* srcA, const float* srcB) {
    int size = gridDim.x * blockDim.x;
    int gridSize = gridDim.x;
    // global indices
    int rowIdx = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int colIdx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    // local indices per block
    int rowIdxBlock = threadIdx.y;
    int colIdxBlock = threadIdx.x;
    // allocate shared memory for matrix blocks
    __shared__ float srcAShared[BLOCK_SIZE * BLOCK_SIZE];
    __shared__ float srcBShared[BLOCK_SIZE * BLOCK_SIZE];
Kernel Using Shared Memory (contd.)
    float multVal = 0.f;
    // loop over the matrix blocks of a row of A / a column of B
    for (int l = 0; l < gridSize; l++) {
        // copy matrix blocks concurrently to shared memory
        srcAShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] =
            srcA[rowIdx * size + l * BLOCK_SIZE + colIdxBlock];
        srcBShared[rowIdxBlock * BLOCK_SIZE + colIdxBlock] =
            srcB[(l * BLOCK_SIZE + rowIdxBlock) * size + colIdx];
        // wait until all threads of the block have finished copying
        __syncthreads();
Kernel Using Shared Memory (contd.)
        // multiply the blocks in shared memory
        for (int k = 0; k < BLOCK_SIZE; k++) {
            multVal += srcAShared[rowIdxBlock * BLOCK_SIZE + k] *
                       srcBShared[k * BLOCK_SIZE + colIdxBlock];
        }
        // wait before the next iteration overwrites the shared blocks
        __syncthreads();
    }
    dest[rowIdx * size + colIdx] = multVal;
}
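A usage note: the kernel is launched with the same grid and block configuration as the device-memory version; BLOCK_SIZE in the kernel must match the block dimensions passed at launch (an assumption carried over from the launch code above).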
Performance
Outline
1 Thread Management
2 Error Handling and Debugging
3 Memory Hierarchy
4 Guidelines
Guidelines