
Visual Computing – GPU Computing

Threads and Memory

Frauke Sprengel
Outline
1 Thread Management

2 Error Handling and Debugging

3 Memory Hierarchy

4 Guidelines

Outline
1 Thread Management
Thread Coordinates
SPMD Architecture
Synchronization of Threads

2 Error Handling and Debugging

3 Memory Hierarchy

4 Guidelines

Thread Coordinates

A given thread can be specified by either logical or physical coordinates.
All coordinate values are counted starting at zero. Thread coordinates can be
queried in the cuda-gdb debugger (see below) and are used to switch the focus
to a specific thread.

Logical Coordinates

kernel (specified by number) CUDA function; can be launched concurrently on
multiple devices (i.e., multiple GPUs) via different grids
grid (specified by number) 1D, 2D, or 3D layout of blocks connected to a
single device
block (specified by 2 numbers) 1D, 2D, or 3D layout of threads connected to a
single streaming multiprocessor
thread (specified by 3 numbers) thread running on a streaming processor
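
A minimal sketch of how these logical coordinates appear in device code (the
kernel addOne and the parameter nelements are hypothetical): the built-in
variables blockIdx, blockDim, and threadIdx are combined into a global index.

__global__
void addOne(float *data, int nelements) {
    // global 1D index of this thread: block offset plus thread offset
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nelements) {    // guard against threads beyond the array end
        data[i] += 1.f;
    }
}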

Logical Coordinates (cont.)

A maximum of 1024 threads per block is allowed (512 on devices before compute
capability 2.0).
The exact value for a specific GPU can be queried at runtime via
cudaGetDeviceProperties() (cf. CUDA reference and deviceQuery.cpp from the
NVIDIA GPU Computing SDK).
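
A minimal sketch of such a query (the surrounding program is hypothetical; the
fields maxThreadsPerBlock and warpSize are members of cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("warp size: %d\n", prop.warpSize);
    return 0;
}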

Physical Coordinates

device (specified by number) device (i.e., GPU)
sm (specified by number) streaming multiprocessor
warp (specified by number) group of lanes on a single multiprocessor
lane (specified by number) thread running within a warp on a streaming
processor

A warp consists of 32 lanes (i.e., threads).
The exact warp size can be queried at runtime via cudaGetDeviceProperties()
(cf. CUDA reference).

SPMD Architecture

As mentioned before, the CUDA SPMD architecture (single program, multiple
data) allows conditional code sections (such as if clauses) within kernels.

1 int sign(const int arr[], int i) {
2     int result = 0;
3     if (arr[i] > 0)
4         result = 1;
5     else if (arr[i] < 0)
6         result = -1;
7     return result;
8 }

SPMD Architecture
The following tables list the line numbers that are processed at a given time
by concurrent threads for different input values.
Due to its conditional section the function is not suitable for processing by a
SIMD architecture.
MIMD (multicore):
arr[]       -12    0   24   36
Thread       T0   T1   T2   T3
Line no.      2    2    2    2
              3    3    3    3
              5    5    4    4
              6    7    7    7
              7    -    -    -

SPMD (GPU):
arr[]       -12    0   24   36
Thread       T0   T1   T2   T3
Line no.      2    2    2    2
              3    3    3    3
              -    -    4    4
              5    5    -    -
              6    -    -    -
              7    7    7    7

SPMD Architecture

In the SPMD architecture, only one instruction is executed at a given time
across the warp. Threads whose control flow does not include this instruction
are suspended. The SPMD architecture thus offers the flexibility of diverging
threads at the cost of reduced performance in conditional sections.

Synchronization of Threads

cudaDeviceSynchronize() (from CUDA 4.0) or cudaThreadSynchronize()
(deprecated, use up to CUDA 3.2) (host code):
Wait until all device tasks (such as kernel executions) have finished.

__syncthreads() (device code):
Synchronize the threads within one block by means of a barrier, i.e., suspend
the execution of a thread until all threads within the same block have reached
the given point.
Note that it is not possible to synchronize threads in different blocks.
There is also a variant of synchronization based on events.
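
A minimal host-side sketch (the kernel name myKernel, its argument devPtr, and
the launch configuration are placeholders):

myKernel<<<dimGrid, dimBlock>>>(devPtr);   // launch is asynchronous
cudaDeviceSynchronize();                   // block the host until the kernel has finished
// now it is safe to read back results, e.g. via cudaMemcpy()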

Synchronization of Threads
Avoid calling __syncthreads() inside divergent code. In the example below, the
loop bound imax is rounded up to a multiple of blockDim.x, so every thread of
the block executes the same number of loop iterations and therefore reaches
each __syncthreads() call.

unsigned int imax = blockDim.x * ((nelements
    + blockDim.x - 1) / blockDim.x);

for (int i = threadIdx.x; i < imax; i += blockDim.x) {

    if (i < nelements) {
        ...
    }

    __syncthreads();

    if (i < nelements) {
        ...
    }
}

Outline
1 Thread Management

2 Error Handling and Debugging


Return Codes
Debugging Using Console Output
Debugging Using cuda-gdb

3 Memory Hierarchy

4 Guidelines

Error Handling via Return Codes

Error check via the return code of CUDA library functions:

cudaError_t err = cudaMalloc(...);
if (err != cudaSuccess) {
    cerr << "CUDA error: "
         << cudaGetErrorString(err) << endl;
}

Error Handling in Kernel Launches

CUDA kernels (declared __global__) must have return type void.
Error check via cudaGetLastError():

cudaGetLastError();                    // clear any previous error
myKernel<<<dimGrid, dimBlock>>>(...);
cudaError_t err = cudaGetLastError();  // error of the kernel launch
if (err != cudaSuccess) {
    cerr << "CUDA kernel error: "
         << cudaGetErrorString(err) << endl;
}
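
A common convenience pattern (not from the original slides; the macro name
CUDA_CHECK and the variables devPtr and nbytes are hypothetical) wraps this
check so that every CUDA call can be tested with one line:

#define CUDA_CHECK(call)                                     \
    do {                                                     \
        cudaError_t err_ = (call);                           \
        if (err_ != cudaSuccess) {                           \
            cerr << "CUDA error at " << __FILE__ << ":"      \
                 << __LINE__ << ": "                         \
                 << cudaGetErrorString(err_) << endl;        \
        }                                                    \
    } while (0)

// usage:
CUDA_CHECK(cudaMalloc(&devPtr, nbytes));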

Debugging Using Console Output

For devices from compute capability 2.0 (Fermi architecture), CUDA provides a
printf() function that can be used in device code to print information to the
standard output stream.

__global__
void myKernel() {
    printf("This is thread %d in block %d.\n",
           threadIdx.x, blockIdx.x);
}
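
Note that device-side printf() output is buffered; a host-side synchronization
(for example, as sketched below with an arbitrary launch configuration) is
needed before the output is guaranteed to appear:

myKernel<<<4, 32>>>();       // example launch configuration
cudaDeviceSynchronize();     // flushes the buffered device-side printf() output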

Debugging Using cuda-gdb

cuda-gdb is an extension of the GNU debugger gdb under Linux, which can be
used for devices from compute capability 1.1. It allows breakpoints in CUDA
kernels, access to GPU memory, and information on device usage by threads.

Debugging Using cuda-gdb

• Compilation of CUDA code: nvcc -g -G
• cuda-gdb can be used with DDD (graphical front-end)
• Restriction: X11 cannot be run on the GPU that cuda-gdb is working on.
  Possible solutions:
  • Work on a dual-GPU system for debugging.
  • Use cuda-gdb in console mode.
• Alternative: debugging and profiling (time, memory) in Nsight (from
  compute capability 3.5 and CUDA 5.5; same problem with X11 (on Linux) or
  Aqua (Mac))

Commands (Selection)

info cuda system information on device type, compute capability, number of
streaming multiprocessors, warps per multiprocessor, lanes per warp, etc.
info cuda device/sm/warp/lane/kernels information on the current state of the
device/sm/. . . which is currently in focus
cuda kernel grid block thread information on / switch focus to a thread by
logical coordinates
cuda device sm warp lane information on / switch focus to a thread by physical
coordinates

Commands (Selection)

break set a breakpoint specified by function name or line number
run run the program
cont continue the program up to the next breakpoint
step step one command, entering functions (except for device functions, as
they are inlined)
next go to the next command, not entering functions;
step and next advance all threads within the warp in focus (use cont and
breakpoints to advance all threads of a kernel)
print print current values of variables (device, shared, local, or built-in
variables like threadIdx)

Outline
1 Thread Management

2 Error Handling and Debugging

3 Memory Hierarchy
CUDA Memory Types
Matrix Multiplication Example

4 Guidelines

CUDA Memory Hierarchy

An appropriate usage of the different types of CUDA memory is crucial for the
efficient execution of kernels.
We will demonstrate this by means of a matrix multiplication example.

CUDA Memory Types

Figure 3.1: CUDA Programming Guide Version 2.3 (2009)


CUDA Memory Types

Device memory Global device memory (i. e., the GPU’s graphics memory),
accessible by all threads of a kernel and by host code (via
cudaMemcpy())
Constant cache Cache memory of each multiprocessor used for constant
values, accessible (read-only) by all threads within a block
Texture cache Cache memory of each multiprocessor used for textures,
accessible (read-only) by all threads within a block

CUDA Memory Types

Shared memory Shared memory of each multiprocessor, accessible by all
threads within one block
Registers Registers of each streaming processor for automatic (local)
variables within a kernel function, accessible by a single thread
Local memory Global device memory used for automatic (local) array
variables within a kernel function, accessible by a single thread
→ slow, should be avoided
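
An illustrative sketch of this distinction (the kernel memoryKinds and its
variables are hypothetical, and the compiler makes the final placement
decision):

__global__
void memoryKinds(float *out) {
    float sum = 0.f;   // automatic scalar variable: typically held in a register
    float buf[64];     // automatic array variable: typically spilled to local
                       // memory (i.e., global device memory) -> slow
    for (int k = 0; k < 64; k++) {
        buf[k] = k * 0.5f;
        sum += buf[k];
    }
    out[threadIdx.x] = sum;
}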

Matrix Multiplication Example

As a (standard) example, we implement a concurrent matrix-matrix
multiplication in CUDA. To keep things simple (and concentrate on the CUDA
aspects of the problem), we restrict ourselves to square matrices, i.e.,

$$C = A \cdot B, \qquad A, B, C \in \mathbb{R}^{n \times n},$$
$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}, \qquad i, j \in \{1, \dots, n\}.$$

The problem is of complexity $O(n^3)$, which results in three nested loops in
the straightforward CPU implementation.

CPU Implementation of Matrix Multiplication
void multMatricesCPU(float *matrixC,
        const float *matrixA, const float *matrixB,
        int size) {
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            float multVal = 0.f;
            for (int k = 0; k < size; k++) {
                multVal += matrixA[i * size + k] *
                           matrixB[k * size + j];
            }
            matrixC[i * size + j] = multVal;
        }
    }
}

GPU Implementation of Matrix Multiplication

For a GPU implementation of matrix multiplication, the computation of the
c_{ij} can be performed concurrently. The two outer loops of the CPU
implementation are replaced by distributing the problem to different threads
within a 2D block (with x representing the column and y the row).
Doing so, one soon reaches the maximum number of 1024 threads per block,
which limits the matrix size to n = 32 for a single block.
Therefore the problem has to be divided into smaller sub-problems, in our
case into sub-matrices (matrix blocks) which can be distributed to different
blocks within a 2D grid.

GPU Implementation of Matrix Multiplication Using Device Memory

In the first approach, all matrices are stored in global device memory. Using
different blocks and 1D arrays requires index computations with block and row
size offsets (or strides).
The following code examples closely follow Kirk and Hwu (2010) and the CUDA C
Programming Guide (NVIDIA 2015a).
Note that the variable names for indices in the code do not match the
notation in the figures.

GPU Implementation Using Device Memory

Figure 3.2: CUDA Programming Guide (NVIDIA 2015a)

Kernel Using Global Device Memory

__global__
void multMatricesGPUKernel(float *dest,
        const float *srcA, const float *srcB) {
    int size = gridDim.x * blockDim.x;
    int rowIdx = blockIdx.y * blockDim.y + threadIdx.y;
    int colIdx = blockIdx.x * blockDim.x + threadIdx.x;
    float multVal = 0.f;
    for (int k = 0; k < size; k++) {
        multVal += srcA[rowIdx * size + k] *
                   srcB[k * size + colIdx];
    }
    dest[rowIdx * size + colIdx] = multVal;
}

Kernel Launch (Using Global Device Memory)

const int BLOCK_SIZE = 16;
int size = 256;  // matrix size
// define matrices, cudaMalloc(), cudaMemcpy() ...
int gridSize = size / BLOCK_SIZE;
dim3 dimGrid(gridSize, gridSize);
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
multMatricesGPUKernel<<<dimGrid, dimBlock>>>(
    deviceMemPtrDest, deviceMemPtrSrcA, deviceMemPtrSrcB);
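
A minimal sketch of the host code elided by the comment above (the host
arrays hostA, hostB, hostC are hypothetical; error checks are omitted for
brevity):

size_t nbytes = size * size * sizeof(float);
float *deviceMemPtrSrcA, *deviceMemPtrSrcB, *deviceMemPtrDest;

cudaMalloc(&deviceMemPtrSrcA, nbytes);
cudaMalloc(&deviceMemPtrSrcB, nbytes);
cudaMalloc(&deviceMemPtrDest, nbytes);

// copy the input matrices to global device memory
cudaMemcpy(deviceMemPtrSrcA, hostA, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(deviceMemPtrSrcB, hostB, nbytes, cudaMemcpyHostToDevice);

multMatricesGPUKernel<<<dimGrid, dimBlock>>>(
    deviceMemPtrDest, deviceMemPtrSrcA, deviceMemPtrSrcB);

// copy the result back and release the device memory
cudaMemcpy(hostC, deviceMemPtrDest, nbytes, cudaMemcpyDeviceToHost);
cudaFree(deviceMemPtrSrcA);
cudaFree(deviceMemPtrSrcB);
cudaFree(deviceMemPtrDest);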

Performance (Using Global Device Memory)

Matrix size   CPU [ms]   GPU [ms]
 64                2         15
128               15         22
256              115         68
512              928        387

• Release mode (with optimization) has been used for compilation.
• The runtime is averaged over 10 function calls.
• The memory transfer time between CPU and GPU is included.

GPU Implementation Using Shared Memory
The second (better) approach is to use shared memory to store the matrices
to be multiplied.
Shared memory is smaller than device memory and accessible only by the
threads within a single block. Thus only the required sub-matrices (or matrix
blocks) of the given block size are copied to shared memory. The blocks
within a given row of matrix A and the appropriate column of matrix B are
multiplied one after the other, accumulating the partial results.
Variables in shared memory are declared by means of the specifier
__shared__. In order to ensure that the copying is finished before the
multiplication starts (and vice versa), the threads within a block have to be
synchronized via __syncthreads(). Multiplying sub-matrices requires further
index computations within a block.

GPU Implementation Using Shared Memory

Figure 3.3: CUDA Programming Guide (NVIDIA 2015a)

Kernel Using Shared Memory
__global__
void multMatricesGPUKernelShared(float *dest,
        const float *srcA, const float *srcB) {
    int size = gridDim.x * blockDim.x;
    int gridSize = gridDim.x;
    int blockSize = blockDim.x;
    // global indices
    int rowIdx = blockIdx.y * blockSize + threadIdx.y;
    int colIdx = blockIdx.x * blockSize + threadIdx.x;
    // local indices per block
    int rowIdxBlock = threadIdx.y;
    int colIdxBlock = threadIdx.x;
    // allocate shared memory for the matrix blocks
    // (a static shared array needs a compile-time size, here BLOCK_SIZE = 16)
    __shared__ float srcAShared[BLOCK_SIZE * BLOCK_SIZE];
    __shared__ float srcBShared[BLOCK_SIZE * BLOCK_SIZE];

Kernel Using Shared Memory (contd.)

    float multVal = 0.f;
    // loop over matrix blocks
    for (int l = 0; l < gridSize; l++) {
        // copy matrix blocks concurrently to shared memory
        srcAShared[rowIdxBlock * blockSize + colIdxBlock] =
            srcA[rowIdx * size + l * blockSize
                 + colIdxBlock];
        srcBShared[rowIdxBlock * blockSize + colIdxBlock] =
            srcB[(l * blockSize + rowIdxBlock) * size
                 + colIdx];
        __syncthreads();

Kernel Using Shared Memory (contd.)

        // multiply the blocks in shared memory
        for (int k = 0; k < blockSize; k++) {
            multVal +=
                srcAShared[rowIdxBlock * blockSize + k] *
                srcBShared[k * blockSize + colIdxBlock];
        }
        __syncthreads();
    }
    dest[rowIdx * size + colIdx] = multVal;
}

Performance

Matrix size   CPU [ms]   GPU [ms]      GPU [ms]
                         device mem.   shared mem.
 64                2         15            13
128               15         22            15
256              115         68            17
512              928        387            36

• Release mode (with optimization) has been used for compilation.
• The runtime is averaged over 10 function calls.
• The memory transfer time between CPU and GPU is included.

Outline
1 Thread Management

2 Error Handling and Debugging

3 Memory Hierarchy

4 Guidelines

Guidelines

• Using CUDA is most effective for large problems, e.g., multiplication of
  matrices larger than 128 × 128.
• The problem has to be divided into sub-problems in order to utilize all
  multiprocessors, e.g., matrix blocks of size 16 × 16.
• Use shared memory for variables that are accessed frequently by different
  threads within one block.
• Use automatic (local) variables for frequent access by a single thread.
• Avoid automatic (local) array variables.
• Device memory allocation and de-allocation via cudaMalloc() and cudaFree()
  are expensive operations, so device memory should be reused (see the sketch
  below).
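
A minimal sketch of the last guideline (the kernel processData, the buffer
names, and the loop are hypothetical): allocate the device buffer once, reuse
it for repeated kernel launches, and free it only at the end.

float *devBuffer;
cudaMalloc(&devBuffer, nbytes);          // allocate once

for (int iter = 0; iter < niterations; iter++) {
    cudaMemcpy(devBuffer, hostData, nbytes, cudaMemcpyHostToDevice);
    processData<<<dimGrid, dimBlock>>>(devBuffer);
    cudaMemcpy(hostData, devBuffer, nbytes, cudaMemcpyDeviceToHost);
}

cudaFree(devBuffer);                     // free once at the end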
