
Visual Computing – GPU Computing

Threads and Memory

Frauke Sprengel
Outline
1 Thread Management

2 Error Handling and Debugging

3 Memory Hierarchy

4 Guidelines

Outline
1 Thread Management
Thread Coordinates
SPMD Architecture
Synchronization of Threads

2 Error Handling and Debugging

3 Memory Hierarchy

4 Guidelines

Thread Coordinates

A given thread can be specified by either logical or physical coordinates.
All coordinate values are counted starting at zero. Thread coordinates can be
queried in the cuda-gdb debugger (see below) and are used to switch the focus
to a specific thread.

Logical Coordinates

kernel (specified by number) CUDA function; can be launched concurrently on
multiple devices (i.e., multiple GPUs) via different grids
grid (specified by number) 1D, 2D, or 3D layout of blocks connected to a
single device
block (specified by 2 numbers) 1D, 2D, or 3D layout of threads connected to a
single streaming multiprocessor
thread (specified by 3 numbers) thread running on a streaming processor
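
A minimal sketch of how these logical coordinates appear in device code (the
kernel addOne and the parameter nelements are hypothetical): the built-in
variables blockIdx, blockDim, and threadIdx are combined into a global index.

__global__
void addOne(float *data, int nelements) {
    // global 1D index of this thread: block offset plus thread offset
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nelements) {    // guard against threads beyond the array end
        data[i] += 1.f;
    }
}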

Logical Coordinates (cont.)

A maximum of 1024 threads per block is allowed (512 on devices before compute
capability 2.0).
The exact value for a specific GPU can be queried at runtime via
cudaGetDeviceProperties() (cf. CUDA reference and deviceQuery.cpp from the
NVIDIA GPU Computing SDK).
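
A minimal sketch of such a query (the surrounding program is hypothetical; the
fields maxThreadsPerBlock and warpSize are members of cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("warp size: %d\n", prop.warpSize);
    return 0;
}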

Physical Coordinates

device (specified by number) device (i.e., GPU)
sm (specified by number) streaming multiprocessor
warp (specified by number) group of lanes on a single multiprocessor
lane (specified by number) thread running within a warp on a streaming
processor

A warp consists of 32 lanes (i.e., threads).
The exact warp size can be queried at runtime via cudaGetDeviceProperties()
(cf. CUDA reference).

SPMD Architecture

As mentioned before, the CUDA SPMD architecture (single program, multiple
data) allows conditional code sections (such as if clauses) within kernels.

1 int sign(const int arr[], int i) {
2     int result = 0;
3     if (arr[i] > 0)
4         result = 1;
5     else if (arr[i] < 0)
6         result = -1;
7     return result;
8 }

SPMD Architecture
The following tables list the line numbers that are processed at a given time
by concurrent threads for different input values.
Due to its conditional section the function is not suitable for processing by a
SIMD architecture.
MIMD (multicore):
arr[]       -12    0   24   36
Thread       T0   T1   T2   T3
Line no.      2    2    2    2
              3    3    3    3
              5    5    4    4
              6    7    7    7
              7    -    -    -

SPMD (GPU):
arr[]       -12    0   24   36
Thread       T0   T1   T2   T3
Line no.      2    2    2    2
              3    3    3    3
              -    -    4    4
              5    5    -    -
              6    -    -    -
              7    7    7    7

SPMD Architecture

In the SPMD architecture, only one instruction is executed at a given time
across the warp. Threads whose control flow does not include this instruction
are suspended. The SPMD architecture thus offers the flexibility of diverging
threads at the cost of reduced performance in conditional sections.

Synchronization of Threads

cudaDeviceSynchronize() (from CUDA 4.0) or cudaThreadSynchronize()
(deprecated, use up to CUDA 3.2) (host code):
Wait until all device tasks (such as kernel executions) have finished.

__syncthreads() (device code):
Synchronize the threads within one block by means of a barrier, i.e., suspend
the execution of a thread until all threads within the same block have reached
the given point.
Note that it is not possible to synchronize threads in different blocks.
There is also a variant of synchronization based on events.
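
A minimal host-side sketch (the kernel name myKernel, its argument devPtr, and
the launch configuration are placeholders):

myKernel<<<dimGrid, dimBlock>>>(devPtr);   // launch is asynchronous
cudaDeviceSynchronize();                   // block the host until the kernel has finished
// now it is safe to read back results, e.g. via cudaMemcpy()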

Synchronization of Threads
Avoid calling __syncthreads() inside divergent code. In the example below, the
loop bound imax is rounded up to a multiple of blockDim.x, so every thread of
the block executes the same number of loop iterations and therefore reaches
each __syncthreads() call.

unsigned int imax = blockDim.x * ((nelements
    + blockDim.x - 1) / blockDim.x);

for (int i = threadIdx.x; i < imax; i += blockDim.x) {

    if (i < nelements) {
        ...
    }

    __syncthreads();

    if (i < nelements) {
        ...
    }
}

Outline
1 Thread Management

2 Error Handling and Debugging


Return Codes
Debugging Using Console Output
Debugging Using cuda-gdb

3 Memory Hierarchy

4 Guidelines

Error Handling via Return Codes

Error check via the return code of CUDA library functions:

cudaError_t err = cudaMalloc(...);
if (err != cudaSuccess) {
    cerr << "CUDA error: "
         << cudaGetErrorString(err) << endl;
}

Error Handling in Kernel Launches

CUDA kernels (declared __global__) must have return type void.
Error check via cudaGetLastError():

cudaGetLastError();                    // clear any previous error
myKernel<<<dimGrid, dimBlock>>>(...);
cudaError_t err = cudaGetLastError();  // error of the kernel launch
if (err != cudaSuccess) {
    cerr << "CUDA kernel error: "
         << cudaGetErrorString(err) << endl;
}
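
A common convenience pattern (not from the original slides; the macro name
CUDA_CHECK and the variables devPtr and nbytes are hypothetical) wraps this
check so that every CUDA call can be tested with one line:

#define CUDA_CHECK(call)                                     \
    do {                                                     \
        cudaError_t err_ = (call);                           \
        if (err_ != cudaSuccess) {                           \
            cerr << "CUDA error at " << __FILE__ << ":"      \
                 << __LINE__ << ": "                         \
                 << cudaGetErrorString(err_) << endl;        \
        }                                                    \
    } while (0)

// usage:
CUDA_CHECK(cudaMalloc(&devPtr, nbytes));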

Debugging Using Console Output

For devices from compute capability 2.0 (Fermi architecture), CUDA provides a
printf() function that can be used in device code to print information to the
standard output stream.

__global__
void myKernel() {
    printf("This is thread %d in block %d.\n",
           threadIdx.x, blockIdx.x);
}
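
Note that device-side printf() output is buffered; a host-side synchronization
(for example, as sketched below with an arbitrary launch configuration) is
needed before the output is guaranteed to appear:

myKernel<<<4, 32>>>();       // example launch configuration
cudaDeviceSynchronize();     // flushes the buffered device-side printf() output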

Debugging Using cuda-gdb

cuda-gdb is an extension of the GNU debugger gdb under Linux, which can be
used for devices from compute capability 1.1. It allows breakpoints in CUDA
kernels, access to GPU memory, and information on device usage by threads.

Debugging Using cuda-gdb

• Compilation of CUDA code: nvcc -g -G
• cuda-gdb can be used with DDD (graphical front-end)
• Restriction: X11 cannot be run on the GPU that cuda-gdb is working on.
  Possible solutions:
  • Work on a dual-GPU system for debugging.
  • Use cuda-gdb in console mode.
• Alternative: debugging and profiling (time, memory) in Nsight (from
  compute capability 3.5 and CUDA 5.5; same problem with X11 (on Linux) or
  Aqua (Mac))

Commands (Selection)

info cuda system information on device type, compute capability, number of
streaming multiprocessors, warps per multiprocessor, lanes per warp, etc.
info cuda device/sm/warp/lane/kernels information on the current state of the
device/sm/. . . which is currently in focus
cuda kernel grid block thread information on / switch focus to a thread by
logical coordinates
cuda device sm warp lane information on / switch focus to a thread by physical
coordinates

Commands (Selection)

break set a breakpoint specified by function name or line number
run run the program
cont continue the program up to the next breakpoint
step step one command, entering functions (except for device functions, as
they are inlined)
next go to the next command, not entering functions;
step and next advance all threads within the warp in focus (use cont and
breakpoints to advance all threads of a kernel)
print print current values of variables (device, shared, local, or built-in
variables like threadIdx)

Outline
1 Thread Management

2 Error Handling and Debugging

3 Memory Hierarchy
CUDA Memory Types
Matrix Multiplication Example

4 Guidelines

CUDA Memory Hierarchy

An appropriate usage of the different types of CUDA memory is crucial for the
efficient execution of kernels.
We will demonstrate this by means of a matrix multiplication example.

CUDA Memory Types

Figure 3.1: CUDA Programming Guide Version 2.3 (2009)


CUDA Memory Types

Device memory Global device memory (i. e., the GPU’s graphics memory),
accessible by all threads of a kernel and by host code (via
cudaMemcpy())
Constant cache Cache memory of each multiprocessor used for constant
values, accessible (read-only) by all threads within a block
Texture cache Cache memory of each multiprocessor used for textures,
accessible (read-only) by all threads within a block

CUDA Memory Types

Shared memory Shared memory of each multiprocessor, accessible by all
threads within one block
Registers Registers of each streaming processor for automatic (local)
variables within a kernel function, accessible by a single thread
Local memory Global device memory used for automatic (local) array
variables within a kernel function, accessible by a single thread
→ slow, should be avoided
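
An illustrative sketch of this distinction (the kernel memoryKinds and its
variables are hypothetical, and the compiler makes the final placement
decision):

__global__
void memoryKinds(float *out) {
    float sum = 0.f;   // automatic scalar variable: typically held in a register
    float buf[64];     // automatic array variable: typically spilled to local
                       // memory (i.e., global device memory) -> slow
    for (int k = 0; k < 64; k++) {
        buf[k] = k * 0.5f;
        sum += buf[k];
    }
    out[threadIdx.x] = sum;
}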

Matrix Multiplication Example

As a (standard) example, we implement a concurrent matrix-matrix
multiplication in CUDA. To keep things simple (and concentrate on the CUDA
aspects of the problem), we restrict ourselves to square matrices, i.e.,

$$C = A \cdot B, \qquad A, B, C \in \mathbb{R}^{n \times n},$$
$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}, \qquad i, j \in \{1, \dots, n\}.$$

The problem is of complexity $O(n^3)$, which results in three nested loops in
the straightforward CPU implementation.

CPU Implementation of Matrix Multiplication
void multMatricesCPU(float *matrixC,
        const float *matrixA, const float *matrixB,
        int size) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            float multVal = 0.f;
            for (int k = 0; k < size; k++) {
                multVal += matrixA[i * size + k] *
                           matrixB[k * size + j];
            }
            matrixC[i * size + j] = multVal;
        }
    }
}

GPU Implementation of Matrix Multiplication

For a GPU implementation of matrix multiplication, the computation of the
c_{ij} can be performed concurrently. The two outer loops of the CPU
implementation are replaced by distributing the problem to different threads
within a 2D block (with x representing the column and y the row).
Doing so, one soon reaches the maximum number of 1024 threads per block,
which limits the matrix size to n = 32 for a single block.
Therefore the problem has to be divided into smaller sub-problems, in our
case into sub-matrices (matrix blocks) which can be distributed to different
blocks within a 2D grid.

GPU Implementation of Matrix Multiplication Using Device Memory

In the first approach, all matrices are stored in global device memory. Using
different blocks and 1D arrays requires index computations with block and row
size offsets (or strides).
The following code examples closely follow Kirk and Hwu (2010) and the CUDA C
Programming Guide (NVIDIA 2015a).
Note that the variable names for indices in the code do not match the
notation in the figures.

GPU Implementation Using Device Memory

Figure 3.2: CUDA Programming Guide (NVIDIA 2015a)

Kernel Using Global Device Memory

__global__
void multMatricesGPUKernel(float *dest,
        const float *srcA, const float *srcB) {
    int size = gridDim.x * blockDim.x;
    int rowIdx = blockIdx.y * blockDim.y + threadIdx.y;
    int colIdx = blockIdx.x * blockDim.x + threadIdx.x;
    float multVal = 0.f;
    for (int k = 0; k < size; k++) {
        multVal += srcA[rowIdx * size + k] *
                   srcB[k * size + colIdx];
    }
    dest[rowIdx * size + colIdx] = multVal;
}

Kernel Launch (Using Global Device Memory)

const int BLOCK_SIZE = 16;
int size = 256;  // matrix size
// define matrices, cudaMalloc(), cudaMemcpy() ...
int gridSize = size / BLOCK_SIZE;
dim3 dimGrid(gridSize, gridSize);
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
multMatricesGPUKernel<<<dimGrid, dimBlock>>>(
    deviceMemPtrDest, deviceMemPtrSrcA, deviceMemPtrSrcB);
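
A minimal sketch of the host code elided by the comment above (the host
arrays hostA, hostB, hostC are hypothetical; error checks are omitted for
brevity):

size_t nbytes = size * size * sizeof(float);
float *deviceMemPtrSrcA, *deviceMemPtrSrcB, *deviceMemPtrDest;

cudaMalloc(&deviceMemPtrSrcA, nbytes);
cudaMalloc(&deviceMemPtrSrcB, nbytes);
cudaMalloc(&deviceMemPtrDest, nbytes);

// copy the input matrices to global device memory
cudaMemcpy(deviceMemPtrSrcA, hostA, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(deviceMemPtrSrcB, hostB, nbytes, cudaMemcpyHostToDevice);

multMatricesGPUKernel<<<dimGrid, dimBlock>>>(
    deviceMemPtrDest, deviceMemPtrSrcA, deviceMemPtrSrcB);

// copy the result back and release the device memory
cudaMemcpy(hostC, deviceMemPtrDest, nbytes, cudaMemcpyDeviceToHost);
cudaFree(deviceMemPtrSrcA);
cudaFree(deviceMemPtrSrcB);
cudaFree(deviceMemPtrDest);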

Performance (Using Global Device Memory)

Matrix size   CPU [ms]   GPU [ms]
 64                2         15
128               15         22
256              115         68
512              928        387

• Release mode (with optimization) has been used for compilation.
• The runtime is averaged over 10 function calls.
• The memory transfer time between CPU and GPU is included.

GPU Implementation Using Shared Memory
The second (better) approach is to use shared memory to store the matrices
to be multiplied.
Shared memory is smaller than device memory and accessible only by the
threads within a single block. Thus only the required sub-matrices (or matrix
blocks) of the given block size are copied to shared memory. The blocks
within a given row of matrix A and the appropriate column of matrix B are
multiplied one after the other, accumulating the partial results.
Variables in shared memory are declared by means of the specifier
__shared__. In order to ensure that the copying is finished before the
multiplication starts (and vice versa), the threads within a block have to be
synchronized via __syncthreads(). Multiplying sub-matrices requires further
index computations within a block.

GPU Implementation Using Shared Memory

Figure 3.3: CUDA Programming Guide (NVIDIA 2015a)

Kernel Using Shared Memory
__global__
void multMatricesGPUKernelShared(float *dest,
        const float *srcA, const float *srcB) {
    int size = gridDim.x * blockDim.x;
    int gridSize = gridDim.x;
    int blockSize = blockDim.x;
    // global indices
    int rowIdx = blockIdx.y * blockSize + threadIdx.y;
    int colIdx = blockIdx.x * blockSize + threadIdx.x;
    // local indices per block
    int rowIdxBlock = threadIdx.y;
    int colIdxBlock = threadIdx.x;
    // allocate shared memory for the matrix blocks
    // (a static shared array needs a compile-time size, here BLOCK_SIZE = 16)
    __shared__ float srcAShared[BLOCK_SIZE * BLOCK_SIZE];
    __shared__ float srcBShared[BLOCK_SIZE * BLOCK_SIZE];

Kernel Using Shared Memory (contd.)

    float multVal = 0.f;
    // loop over matrix blocks
    for (int l = 0; l < gridSize; l++) {
        // copy matrix blocks concurrently to shared memory
        srcAShared[rowIdxBlock * blockSize + colIdxBlock] =
            srcA[rowIdx * size + l * blockSize
                 + colIdxBlock];
        srcBShared[rowIdxBlock * blockSize + colIdxBlock] =
            srcB[(l * blockSize + rowIdxBlock) * size
                 + colIdx];
        __syncthreads();

Kernel Using Shared Memory (contd.)

        // multiply the blocks in shared memory
        for (int k = 0; k < blockSize; k++) {
            multVal +=
                srcAShared[rowIdxBlock * blockSize + k] *
                srcBShared[k * blockSize + colIdxBlock];
        }
        __syncthreads();
    }
    dest[rowIdx * size + colIdx] = multVal;
}

Performance

Matrix size   CPU [ms]   GPU [ms]      GPU [ms]
                         device mem.   shared mem.
 64                2         15            13
128               15         22            15
256              115         68            17
512              928        387            36

• Release mode (with optimization) has been used for compilation.
• The runtime is averaged over 10 function calls.
• The memory transfer time between CPU and GPU is included.

Outline
1 Thread Management

2 Error Handling and Debugging

3 Memory Hierarchy

4 Guidelines

Guidelines

• Using CUDA is most effective for large problems, e.g., multiplication of
  matrices larger than 128 × 128.
• The problem has to be divided into sub-problems in order to utilize all
  multiprocessors, e.g., matrix blocks of size 16 × 16.
• Use shared memory for variables that are accessed frequently by different
  threads within one block.
• Use automatic (local) variables for frequent access by a single thread.
• Avoid automatic (local) array variables.
• Device memory allocation and de-allocation via cudaMalloc() and cudaFree()
  are expensive operations, so device memory should be reused (see the sketch
  below).
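
A minimal sketch of the last guideline (the kernel processData, the buffer
names, and the loop are hypothetical): allocate the device buffer once, reuse
it for repeated kernel launches, and free it only at the end.

float *devBuffer;
cudaMalloc(&devBuffer, nbytes);          // allocate once

for (int iter = 0; iter < niterations; iter++) {
    cudaMemcpy(devBuffer, hostData, nbytes, cudaMemcpyHostToDevice);
    processData<<<dimGrid, dimBlock>>>(devBuffer);
    cudaMemcpy(hostData, devBuffer, nbytes, cudaMemcpyDeviceToHost);
}

cudaFree(devBuffer);                     // free once at the end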
