Lecture 12: GPU Programming
Assignment 4
• Consists of two programming assignments
• Concurrency
• GPU programming
• Requires a computer with a CUDA/OpenCL/DirectCompute compatible GPU
• Due Jun 07
• We have no final exams
GPU Resources
• Download the CUDA Toolkit from the NVIDIA developer website
Acknowledgments
• Slides and material from
• Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)
Why GPU Programming
• More processing power + higher memory bandwidth
• Typical quad-core CPU (for comparison):
  • 4 cores (CPU 0-3), 2x HyperThreaded, 4-float-wide SIMD, ~3 GHz
  • 48-96 GFLOPS
  • 64 kB L1 cache per core, shared L2 cache
  • ~20 GB/s to memory
  • ~$200, ~200 W
Current GPU (vs. CPU)
• CPU: ~50 GFLOPS, 4-6 GB of CPU RAM, ~10 GB/s to memory
• GPU: ~1 TFLOP, ~1 GB of GPU RAM, ~100 GB/s to memory
• CPU <-> GPU transfer: ~1 GB/s
• All values are approximate
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
• User kicks off batches of threads on the GPU
• GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
• Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
A CUDA Program
1. Host performs some CPU computation
2. Host copies input data to the device
3. Host instructs the device to execute a kernel
4. Device executes the kernel and produces results
5. Host copies the results back from the device
6. Goto step 1
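A minimal host-side sketch of this flow (illustrative only; the kernel name, data, and sizes below are placeholders, not code from the lecture):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, int n)      // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    const size_t size = n * sizeof(float);
    float* h_data = (float*)malloc(size);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;          // 1. host computation

    float* d_data;
    cudaMalloc(&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // 2. copy input to device

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);           // 3.-4. device executes the kernel

    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // 5. copy results back

    cudaFree(d_data);
    free(h_data);
    return 0;                                                   // 6. repeat as needed
}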
Threads Organization
• Kernel threads = Grid of Thread Blocks
• Thread Block = array of threads (1D, 2D, or 3D)
• Simplifies memory addressing
• All threads run the same kernel (thread) program
(Figure: the host launches a kernel as Grid 1 on the device; the grid contains Blocks (0,0), (1,0), (0,1), (1,1), and each block contains Threads (0,0), (1,0), ...)
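A small sketch of how these indices are used inside a kernel (the kernel and array names here are illustrative):

#include <cuda_runtime.h>

// Each thread derives a unique (x, y) coordinate from the built-in block and
// thread indices; this is what simplifies addressing of 2D data
__global__ void indexDemo(int* out, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * pitch + x] = x + y;            // row-major addressing of a 2D array
}

int main()
{
    dim3 dimBlock(16, 16);                 // 2D thread block (1D or 3D also possible)
    dim3 dimGrid(2, 2);                    // 2D grid of blocks -> 32 x 32 threads total
    int* d_out;
    cudaMalloc(&d_out, 32 * 32 * sizeof(int));
    indexDemo<<<dimGrid, dimBlock>>>(d_out, 32);
    cudaFree(d_out);
    return 0;
}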
CUDA Device Memory Spaces
• Global memory: read/write, per grid
• Constant memory: read only, per grid
• Texture memory: read only, per grid
• The host can transfer data to/from global, constant, and texture memory
Memory Access Speeds
• Shared memory: on chip -> fast
• Global memory: not cached -> slow
• Constant memory: cached -> fast if good reuse
• Texture memory: cached -> fast if good reuse
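A sketch of how these memory spaces appear in CUDA C (names are illustrative; texture memory, not shown, is read through texture fetch functions):

__constant__ float coeff[16];                    // constant memory: read-only per grid, cached

__global__ void memDemo(float* gdata, int n)     // gdata points into global memory: read/write, not cached on G80
{
    __shared__ float tile[256];                  // shared memory: on-chip, per block, fast
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = gdata[i];     // stage a global value in shared memory
    __syncthreads();
    if (i < n) gdata[i] = tile[threadIdx.x] * coeff[0];   // reuse a cached constant value
}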
Matrix Multiplication
• P = M * N, all of size WIDTH x WIDTH
• Simple strategy
  • One thread calculates one element of P
  • M and N are loaded WIDTH times from global memory
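For reference, the sequential computation that the simple strategy parallelizes; each GPU thread takes over one (row, col) iteration of the two outer loops (a sketch, row-major layout assumed):

void MatrixMulOnHost(const float* M, const float* N, float* P, int Width)
{
    for (int row = 0; row < Width; ++row)
        for (int col = 0; col < Width; ++col) {
            float sum = 0;
            for (int k = 0; k < Width; ++k)
                sum += M[row * Width + k] * N[k * Width + col];   // dot product of one row and one column
            P[row * Width + col] = sum;
        }
}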
Host code: allocate device memory and copy the input matrices

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
cudaMalloc(&Pd, size);
// call kernel
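Put together, the host side looks roughly like this (the copy-back and cleanup steps are not shown on the slides but follow the same API):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // copy input matrix M
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);   // copy input matrix N
    cudaMalloc(&Pd, size);                             // allocate space for the result

    // call kernel (launch configuration shown below)

    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);   // copy the result back to the host
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);          // release device memory
}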
Kernel Invocation

MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
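The launch configuration that goes in place of the "// call kernel" comment, for the single-block version discussed next (a sketch):

dim3 dimGrid(1, 1);                 // one thread block in the grid
dim3 dimBlock(width, width);        // WIDTH x WIDTH threads, one per element of Pd
MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);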
Kernel Code (fragments from the slides)

// short forms:
tx = threadIdx.x;
ty = threadIdx.y;

// inner loop body and result write:
... Nd[k*width + tx];
Pd[ty*width + tx] = r;
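Assembled from the fragments above, the complete single-block kernel would look roughly like this (a sketch; variable names follow the slides):

__global__ void MatrixMul(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float r = 0;
    for (int k = 0; k < width; ++k)
        r += Md[ty * width + k] * Nd[k * width + tx];   // row ty of Md times column tx of Nd

    Pd[ty * width + tx] = r;                            // each thread writes one element of Pd
}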
Only One Thread Block Used
• One block of threads computes matrix Pd
  • Each thread computes one element of Pd
• Each thread
  • Loads a row of matrix Md
  • Loads a column of matrix Nd
  • Performs one multiply and one addition for each pair of Md and Nd elements
• Compute to off-chip memory access ratio is close to 1:1 (not very high)
• Size of matrix is limited by the number of threads allowed in a thread block
Thread Block Assignment to SMs
• Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
  • Up to 8 blocks per SM, as resources allow
  • An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  • The SM maintains thread/block id #s
  • The SM manages/schedules thread execution
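These per-SM limits vary with the GPU generation and can be queried at run time, e.g. (a sketch using the CUDA runtime API):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // properties of device 0
    printf("SMs:               %d\n", prop.multiProcessorCount);
    printf("warp size:         %d\n", prop.warpSize);
    printf("max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads/SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("shared mem/block:  %zu bytes\n", (size_t)prop.sharedMemPerBlock);
    return 0;
}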
Thread Scheduling: Warps
• Each block is divided into 32-thread warps
  • Warps are the scheduling units in an SM
• Example: an SM holding 3 blocks of 256 threads each
  • 256/32 = 8 warps per block
  • There are 8 * 3 = 24 warps in the SM
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
  • Warps whose next instruction has its operands ready for consumption are eligible for execution
  • Eligible warps are selected for execution based on a prioritized scheduling policy
  • All threads in a warp execute the same instruction when selected
• Example schedule over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96
Block Granularity Considerations
• For 8x8 blocks, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, because each SM can only take up to 8 blocks, only 512 threads will go into each SM!
• For 16x16 blocks, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
• For 32x32 blocks, we have 1024 threads per block. Not even one block fits into an SM!
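The same arithmetic as a small helper (assuming the G80 limits above: 768 threads/SM, 8 blocks/SM, 512 threads/block):

#include <cstdio>

int residentThreadsPerSM(int threadsPerBlock)
{
    const int maxThreadsPerSM = 768, maxBlocksPerSM = 8, maxThreadsPerBlock = 512;
    if (threadsPerBlock > maxThreadsPerBlock) return 0;       // the block does not fit at all
    int blocks = maxThreadsPerSM / threadsPerBlock;           // limited by the thread budget
    if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM;     // limited by the block slots
    return blocks * threadsPerBlock;
}

int main()
{
    printf("8x8   blocks: %d threads per SM\n", residentThreadsPerSM(8 * 8));     // 512
    printf("16x16 blocks: %d threads per SM\n", residentThreadsPerSM(16 * 16));   // 768
    printf("32x32 blocks: %d threads per SM\n", residentThreadsPerSM(32 * 32));   // 0
    return 0;
}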
Idea: Use Shared Memory to Reuse Global Memory Data
• Each element of M and N is read by WIDTH threads
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
• Tiled algorithms
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
(Figure: Pd divided into TILE_WIDTH x TILE_WIDTH sub-matrices; block (bx, by) computes the sub-matrix Pdsub, thread (tx, ty) computes one of its elements)
Every Md and Nd element is used exactly twice in generating a 2x2 tile of P:

P0,0 (thread0,0): M0,0*N0,0 + M1,0*N0,1 + M2,0*N0,2 + M3,0*N0,3
P1,0 (thread1,0): M0,0*N1,0 + M1,0*N1,1 + M2,0*N1,2 + M3,0*N1,3
P0,1 (thread0,1): M0,1*N0,0 + M1,1*N0,1 + M2,1*N0,2 + M3,1*N0,3
P1,1 (thread1,1): M0,1*N1,0 + M1,1*N1,1 + M2,1*N1,2 + M3,1*N1,3

Each Md element and each Nd element appears in exactly two of the four sums.
Breaking Md and Nd into Tiles
• Break up the inner product loop of each thread into phases
• At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase into shared memory
• Everyone accesses the Md and Nd elements from shared memory during the phase
(Figure: the 4x4 matrices Md, Nd, and Pd divided into 2x2 tiles)
Tiled Kernel (skeleton)

__global__
void Tiled(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  float Pvalue = 0;
  // compute Pvalue
  Pd[Row*Width + Col] = Pvalue;
}
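One way to fill in the "// compute Pvalue" part (a sketch assuming Width is a multiple of TILE_WIDTH; Row and Col are derived from the block and thread indices):

#define TILE_WIDTH 16

__global__ void Tiled(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {               // one iteration per phase/tile
        Mds[ty][tx] = Md[Row * Width + m * TILE_WIDTH + tx];     // each thread loads one Md element
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];   // ... and one Nd element
        __syncthreads();                                         // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];                   // work out of shared memory
        __syncthreads();                                         // wait before overwriting the tile
    }
    Pd[Row * Width + Col] = Pvalue;
}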
• With 16x16 tiles, each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
• Memory bandwidth is no longer a limiting factor
Tiled Multiply
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub
(Figure: block indices (bx, by) and thread indices (tx, ty); m indexes the tiles/phases, k indexes within a tile)
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
• The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
Shared Memory: Strided Access and Bank Conflicts
• Example: each thread reads shared[s * threadIdx.x], e.g. with s = 3
• This is only bank-conflict-free if s shares no common factor with the number of banks (16 on G80), i.e. if s is odd
(Figure: Thread 0 -> Bank 0, Thread 1 -> Bank 1, Thread 2 -> Bank 2, ..., Thread 15 -> Bank 15)
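A sketch of the access pattern in question (illustrative names; 16 banks assumed, as on G80):

__global__ void strideDemo(float* out, int s)
{
    __shared__ float shared[1024];
    int tid = threadIdx.x;
    shared[tid] = (float)tid;                 // the values are irrelevant, the access pattern matters
    __syncthreads();

    float a = shared[tid];                    // stride 1: thread i hits bank (i mod 16) -> conflict-free
    float b = shared[tid * s];                // stride s: conflict-free only if s is odd
                                              // (s = 2 -> 2-way conflicts, s = 16 -> 16-way conflicts)
    out[tid] = a + b;                         // assumes blockDim.x * s <= 1024
}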