Lecture-12-GPU-Programming

GPU Programming 1

GPU PROGRAMMING
GPU Programming 2

Assignment 4
• Consists of two programming assignments
• Concurrency
• GPU programming
• Requires a computer with a CUDA/OpenCL/DirectCompute compatible
GPU
• Due Jun 07
• We have no final exams
GPU Programming 3

GPU Resources
• Download CUDA toolkit from the web

• Very good textbook:
  • Programming Massively Parallel Processors
  • Wen-mei Hwu and David Kirk
  • Available at http://courses.engr.illinois.edu/ece498/al/Syllabus.html
GPU Programming 4

Acknowledgments
• Slides and material from
• Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)
Why GPU Programming
• More processing power + higher memory bandwidth
• GPU in every PC and workstation – massive volume and potential impact
GPU Programming 5
Current CPU

• 4 cores, each 4-float-wide SIMD
• 3 GHz, 48-96 GFLOPS
• 2x HyperThreaded
• 64 kB L1 cache per core
• 20 GB/s to memory
• ~$200, ~200 W

(Figure: four cores, CPU 0 – CPU 3, sharing an L2 cache)
Current GPU

• 32 cores, each 32-float-wide SIMD
• 1 GHz, ~1 TFLOP
• 32x “HyperThreaded”
• 64 kB L1 cache per core
• 150 GB/s to memory
• ~$200, ~200 W

(Figure: an array of SIMD cores sharing an L2 cache)
GPU Programming

Bandwidth and Capacity

• CPU: ~50 GFLOPS; CPU RAM: 4-6 GB; ~10 GB/s between CPU and CPU RAM
• GPU: ~1 TFLOP; GPU RAM: ~1 GB; ~100 GB/s between GPU and GPU RAM
• ~1 GB/s between CPU and GPU
• All values are approximate

GPU Programming 8
GPU Programming 9

CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
• User kicks off batches of threads on the GPU
• GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
• Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
GPU Programming 10

Languages with Similar Capabilities


• CUDA
• OpenCL
• DirectCompute

• You are free to use any of the above for assignment 4


• I will focus on CUDA for the rest of the lecture
• Same abstractions present in all three with different (and
confusing) names
GPU Programming 11

CUDA Programming Model:


• The GPU = compute device that:
• Is a coprocessor to the CPU or host
• Has its own DRAM (device memory)
• Runs many threads in parallel
• GPU program = kernel
• Differences between GPU and CPU threads
• GPU threads are extremely lightweight
• Very little creation overhead
• GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
GPU Programming 12

A CUDA Program
1. Host performs some CPU computation
2. Host copies input data into the device
3. Host instructs the device to execute a kernel
4. Device executes the kernel and produces results
5. Host copies the results back from the device
6. Goto step 1
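
To make the flow concrete, here is a minimal sketch of steps 1-6; the kernel name `scale`, the array size, and the launch configuration are illustrative assumptions, not part of the lecture.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;           // each thread handles one element
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float* h = (float*)malloc(bytes);                    // 1. host-side computation/setup
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float* d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // 2. copy input to the device

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);         // 3. instruct the device to run a kernel
    cudaDeviceSynchronize();                             // 4. device executes it and produces results

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);     // 5. copy the results back
    printf("h[10] = %f\n", h[10]);

    cudaFree(d); free(h);                                // 6. repeat from step 1 as needed
    return 0;
}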
GPU Programming 13

CUDA Kernel is a SPMD program

• SPMD = Single Program Multiple Data

• All threads run the same code

• Each thread uses its id to
  • Operate on different memory addresses
  • Make control decisions

Kernel:
  i = input[tid];
  o = f(i);
  output[tid] = o;

CUDA Kernel is a SPMD program

• SPMD = Single Program Multiple Data

• All threads run the same code

• Each thread uses its id to
  • Operate on different memory addresses
  • Make control decisions

• Difference with SIMD
  • Threads can execute different control flow …
  • At a performance cost

Kernel:
  i = input[tid];
  if(i%2 == 0)
    o = f(i);
  else
    o = g(i);
  output[tid] = o;
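
As a concrete, compilable version of the pseudocode above, here is a minimal sketch; the kernel name `spmd_example` and the functions `f`/`g` (trivial arithmetic here) are illustrative assumptions.

// Hypothetical f and g so the sketch is self-contained.
__device__ float f(float x) { return 2.0f * x; }
__device__ float g(float x) { return x + 1.0f; }

// Every thread runs this same code; tid selects the element it works on
// and drives its control decisions.
__global__ void spmd_example(const float* input, float* output, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // unique id per thread
    if (tid >= n) return;                              // guard for a partial last block

    float i = input[tid];
    float o = ((int)i % 2 == 0) ? f(i) : g(i);         // data-dependent control flow
    output[tid] = o;
}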
GPU Programming 15

Threads Organization

• Kernel threads = Grid of Thread Blocks (1D or 2D)

• Thread Block = Array of Threads (1D, 2D, or 3D)

• Simplifies memory addressing for multidimensional data

(Figure: the host launches Kernel 1 and Kernel 2 on the device; each launch creates a grid of blocks, e.g. Block (0,0) … Block (1,1), and each block contains an array of threads, e.g. Thread (0,0) … Thread (1,1))
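
A minimal sketch of how a kernel typically turns its block and thread indices into a 2D position (and a flat address) in a matrix; the kernel name and the row-major layout are assumptions for illustration.

// Each thread computes one (row, col) position from its block and thread ids.
__global__ void index_2d_example(float* out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x runs across columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y runs across rows
    if (row < height && col < width) {
        out[row * width + col] = (float)(row * width + col);  // row-major flat index
    }
}

// Example launch: a 2D grid of 2D blocks covering a width x height matrix.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// index_2d_example<<<grid, block>>>(d_out, width, height);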
GPU Programming 17

Threads within a Block

• Execute in lock step
• Can share memory
• Can synchronize with each other

(Figure: a CUDA Thread Block with thread ids 0, 1, 2, 3, …, m, all running the same thread program. Courtesy: John Nickolls, NVIDIA)


CUDA Function Declarations

                                     Executed on the:   Only callable from the:
__device__ float DeviceFunc()        device             device
__global__ void KernelFunc()         device             host
__host__ float HostFunc()            host               host

• __global__ defines a kernel function
  • Must return void
• __device__ and __host__ can be used together
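
A small sketch of the three qualifiers in use, including __device__ and __host__ combined on one helper; the function names are illustrative assumptions.

// Callable from device code only.
__device__ float device_only(float x) { return x * x; }

// Compiled for both host and device, callable from either side.
__host__ __device__ float square_plus_one(float x) { return x * x + 1.0f; }

// A kernel: runs on the device, launched from the host, must return void.
__global__ void kernel_func(float* data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) data[tid] = device_only(data[tid]) + square_plus_one(data[tid]);
}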

GPU Programming 18
GPU Programming 19

CUDA Function Declarations (cont.)

• __device__ functions cannot have their address taken
• For functions executed on the device:
  • No recursion
  • No static variable declarations inside the function
  • No variable number of arguments
GPU Programming 20

Putting it all together

__global__ void KernelFunc(…);

dim3 DimGrid(100, 50);
dim3 DimBlock(4, 8, 8);

KernelFunc<<< DimGrid, DimBlock >>>(...);


GPU Programming 21

CUDA Memory Model

• Registers
  • Read/write per thread
• Local memory
  • Read/write per thread
• Shared memory
  • Read/write per block
• Global memory
  • Read/write per grid
• Constant memory
  • Read only, per grid
• Texture memory
  • Read only, per grid

(Figure: a grid of blocks; each block has its own shared memory and per-thread registers; the host and all threads access global, constant, and texture memory)
GPU Programming 22

Memory Access Efficiency

• Registers
  • Fast
• Local memory
  • Not cached -> Slow
  • Registers spill into local memory
• Shared memory
  • On chip -> Fast
• Global memory
  • Not cached -> Slow
• Constant memory
  • Cached – Fast if good reuse
• Texture memory
  • Cached – Fast if good reuse
GPU Programming 23

CUDA Variable Type Qualifiers

Variable declaration                         Memory     Scope    Lifetime
__device__ __local__ int LocalVar;           local      thread   thread
__device__ __shared__ int SharedVar;         shared     block    block
__device__ int GlobalVar;                    global     grid     application
__device__ __constant__ int ConstantVar;     constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__

• Automatic variables without any qualifier reside in a register
  • Except arrays that reside in local memory
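
A brief sketch showing where each kind of variable might appear in practice; the names and sizes are illustrative assumptions.

// Per-grid, read-only from kernels; lives in constant memory.
__constant__ float coeffs[4];

// Per-grid, read/write; lives in global memory for the whole application.
__device__ int global_counter;

__global__ void qualifier_example(const float* in, float* out, int n) {
    // Per-block, read/write; shared by all threads of the block (launch with blockDim.x <= 256).
    __shared__ float tile[256];

    // Automatic variables without a qualifier live in registers.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;   // guarded load into shared memory
    __syncthreads();                                  // every thread reaches the barrier

    if (tid < n)
        out[tid] = coeffs[0] * tile[threadIdx.x];     // read-only coefficient from constant memory
}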
GPU Programming 24

Variable Type Restrictions


• Pointers can only point to memory allocated or
declared in global memory:
• Allocated in the host and passed to the kernel:
__global__ void KernelFunc(float* ptr)
• Obtained as the address of a global variable:
float* ptr = &GlobalVar;
GPU Programming 25

Simple Example: Matrix Multiplication


GPU Programming 26

Matrix Multiplication
• P = M * N of size WIDTH x WIDTH
• Simple strategy
  • One thread calculates one element of P
  • M and N are loaded WIDTH times from global memory

(Figure: matrices M, N, and P, each WIDTH x WIDTH)
GPU Programming 27


GPU Matrix Multiplication: Host

float *M, *N, *P;          // host matrices (allocated and initialized elsewhere)
float *Md, *Nd, *Pd;       // device matrices
int width;
int size = width * width * sizeof(float);

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

cudaMalloc(&Pd, size);

// call kernel

cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
GPU Programming 31

GPU Matrix Multiplication: Host

• How many threads do we need?

(Figure: matrices M, N, and P, each WIDTH x WIDTH)
GPU Programming 32

GPU Matrix Multiplication: Host

dim3 dimGrid(1,1);
dim3 dimBlock(width, width);

MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

(Figure: matrices M, N, and P, each WIDTH x WIDTH; one thread per element of P)
GPU Programming 33

GPU Matrix Multiplication: Kernel

__global__ void MatrixMul(
    float* Md, float* Nd,
    float* Pd, int width)
{
  Pd[ty*width + tx] = …
}

short forms:
  tx = threadIdx.x;
  ty = threadIdx.y;

(Figure: thread (tx, ty) computes element Pd[ty*width + tx] from row ty of Md and column tx of Nd)
GPU Programming 34

GPU Matrix Multiplication: Kernel

__global__ void MatrixMul(…){
  float r = 0;
  for (int k = 0; k < width; k++){
    r += Md[ty*width + k] *
         Nd[k*width + tx];
  }
  Pd[ty*width + tx] = r;
}

(Figure: thread (tx, ty) walks across row ty of Md and down column tx of Nd)
Only One Thread Block Used
• One Block of threads computes matrix Pd
  • Each thread computes one element of Pd
• Each thread
  • Loads a row of matrix Md
  • Loads a column of matrix Nd
  • Performs one multiply and addition for each pair of Md and Nd elements
  • Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

(Figure: Grid 1 with a single Block 1; thread (2, 2) computes one element of Pd, e.g. row (3, 2, 5, 4) of Md times column (2, 4, 2, 6) of Nd = 48)
GPU Programming 35
GPU Programming 36

How about performance on G80?

• All threads access global memory for their input matrix elements
• Compute: 346.5 GFLOPS
• Memory bandwidth: 86.4 GB/s

(Figure: the CUDA memory model; all matrix accesses go to global memory)
GPU Programming 37

How about performance on G80?

• All threads access global memory for their input matrix elements
  • Two memory accesses (8 bytes) per floating point multiply-add
  • 4 bytes of memory traffic per FLOP
  • 4 * 346.5 = 1386 GB/s required to achieve peak FLOP rating
  • 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

(Figure: the CUDA memory model; all matrix accesses go to global memory)
G80 Example: Executing Thread Blocks

• Threads are assigned to Streaming Multiprocessors in block granularity
  • Up to 8 blocks to each SM as resources allow
  • SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  • SM maintains thread/block id #s
  • SM manages/schedules thread execution

(Figure: blocks of threads t0 … tm assigned to SM 0 and SM 1, each with its own instruction unit, SPs, and shared memory)
GPU Programming 38
GPU Programming 39

G80 Example: Thread Scheduling

• Each Block is executed as 32-thread Warps
  – Warps are the scheduling units in an SM

• If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in the SM?
  – Each Block is divided into 256/32 = 8 Warps
  – There are 8 * 3 = 24 Warps

(Figure: Block 1 and Block 2 split into warps of threads t0 … t31; a Streaming Multiprocessor with instruction L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs)
SM Warp Scheduling

• SM hardware implements zero-overhead Warp scheduling
  • Warps whose next instruction has its operands ready for consumption are eligible for execution
  • Eligible Warps are selected for execution on a prioritized scheduling policy
  • All threads in a Warp execute the same instruction when selected

(Figure: the SM warp scheduler issuing, over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96)
GPU Programming 41
GPU Programming 42

G80 Block Granularity Considerations


• For Matrix Multiplication using multiple blocks, should I use
8X8, 16X16 or 32X32 blocks?
• Each SM can take max 8 blocks and max 768 threads
GPU Programming 43

G80 Block Granularity Considerations


• For Matrix Multiplication using multiple blocks, should I use
8X8, 16X16 or 32X32 blocks?

• For 8X8, we have 64 threads per Block. Since each SM can take up to 768
threads, that works out to 12 Blocks. However, each SM can only take up to 8
Blocks, so only 512 threads will go into each SM!

• For 16X16, we have 256 threads per Block. Since each SM can take up to
768 threads, it can take up to 3 Blocks and achieve full capacity unless
other resource considerations overrule.

• For 32X32, we have 1024 threads per Block. Not even one can fit into an
SM!
GPU Programming 44

A Common Programming Strategy


• Global memory resides in device memory (DRAM) - much
slower access than shared memory
• So, a profitable way of performing computation on the
device is to tile data to take advantage of fast shared
memory:
• Partition data into subsets that fit into shared memory
• Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using
multiple threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each
thread can efficiently multi-pass over any data element
• Copying results from shared memory to global memory
GPU Programming 45

A Common Programming Strategy (Cont.)


• Constant memory also resides in device memory
(DRAM) - much slower access than shared memory
• But… cached!
• Highly efficient access for read-only data
• Carefully divide data according to access patterns
• R/Only  constant memory (very fast if in cache)
• R/W shared within Block  shared memory (very fast)
• R/W within each thread  registers (very fast)
• R/W inputs/results  global memory (very slow)
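
As one concrete illustration of the R/Only case, a sketch of placing read-only filter coefficients in constant memory; the names and the cudaMemcpyToSymbol call are assumptions for illustration, not taken from the lecture.

// Read-only data used by every thread: a good fit for constant memory.
#define FILTER_SIZE 5
__constant__ float d_filter[FILTER_SIZE];

__global__ void apply_filter(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - FILTER_SIZE) {
        float acc = 0.0f;                       // per-thread accumulator: a register
        for (int k = 0; k < FILTER_SIZE; ++k)
            acc += d_filter[k] * in[i + k];     // constant memory is cached, so reuse is cheap
        out[i] = acc;                           // inputs/results live in global memory
    }
}

// Host side: copy the coefficients into the constant-memory symbol.
// float h_filter[FILTER_SIZE] = {0.1f, 0.2f, 0.4f, 0.2f, 0.1f};
// cudaMemcpyToSymbol(d_filter, h_filter, sizeof(h_filter));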
GPU Programming 46

Idea: Use Shared Memory to reuse global memory data

• Each input element is read by WIDTH threads
• Load each element into Shared Memory and have several threads use the local version to reduce the memory bandwidth
• Tiled algorithms

(Figure: matrices M, N, and P; thread (tx, ty) of P reuses elements of M and N that other threads also read)
GPU Programming 47
Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

(Figure: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by), thread (tx, ty) computes one element of the sub-matrix Pdsub at (bx, by))
GPU Programming 48

A Small Example: 2X2 Tiling of P

(Figure: the column strip Nd0,0 … Nd1,3 and the row strip Md0,0 … Md3,1 that are needed to produce the 2x2 tile Pd0,0, Pd1,0, Pd0,1, Pd1,1 of the 4x4 result Pd)


Every Md and Nd Element is used exactly twice in generating a 2X2 tile of P

Access order   P0,0 thread0,0   P1,0 thread1,0   P0,1 thread0,1   P1,1 thread1,1
               M0,0 * N0,0      M0,0 * N1,0      M0,1 * N0,0      M0,1 * N1,0
               M1,0 * N0,1      M1,0 * N1,1      M1,1 * N0,1      M1,1 * N1,1
               M2,0 * N0,2      M2,0 * N1,2      M2,1 * N0,2      M2,1 * N1,2
               M3,0 * N0,3      M3,0 * N1,3      M3,1 * N0,3      M3,1 * N1,3
GPU Programming 50
Breaking Md and Nd into Tiles

• Break up the inner product loop of each thread into phases
• At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase into shared memory
• Everyone accesses the Md and Nd elements from the shared memory during the phase

(Figure: the 4x4 Md, Nd, and Pd matrices from the 2x2 tiling example, with the tiles loaded phase by phase)
GPU Programming
GPU Programming 53

Tiled Kernel

__global__
void Tiled(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;

  float Pvalue = 0;
  // compute Pvalue
  Pd[Row*Width + Col] = Pvalue;
}
GPU Programming 54

Tiled Kernel: Computing Pvalue

//…
float Pvalue = 0;
// Loop over the Md and Nd tiles required
for (int m = 0; m < Width/TILE_WIDTH; ++m) {
  // Collaborative loading of Md and Nd tiles
  Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
  Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
  __syncthreads();

  for (int k = 0; k < TILE_WIDTH; ++k)
    Pvalue += Mds[ty][k] * Nds[k][tx];
  __syncthreads();
}
Pd[Row*Width + Col] = Pvalue;
//…
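
Putting the two fragments together, a complete version of the tiled kernel might look like the sketch below (assuming Width is a multiple of TILE_WIDTH, as in the lecture's examples).

#define TILE_WIDTH 16

// Tiled matrix multiply: Pd = Md * Nd, all Width x Width, Width % TILE_WIDTH == 0.
__global__ void Tiled(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Row and column of the Pd element this thread works on.
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute this element.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Collaborative loading of one Md tile and one Nd tile into shared memory.
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();                       // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx]; // inner product over this tile
        __syncthreads();                       // wait before overwriting the tiles
    }
    Pd[Row * Width + Col] = Pvalue;
}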
GPU Programming 55

CUDA Code – Kernel Execution Configuration

// Setup the execution configuration


dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH,
Width / TILE_WIDTH);
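
For completeness, a sketch of launching the tiled kernel above with this configuration; it assumes the device pointers Md, Nd, Pd from the earlier host code and an int Width matching the kernel parameter.

// Launch the tiled kernel with the execution configuration above.
Tiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
cudaDeviceSynchronize();   // optional: wait for completion before timing or copying back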
GPU Programming 56

First-order Size Considerations in G80


• Each thread block should have many threads
• TILE_WIDTH of 16 gives 16*16 = 256 threads

• There should be many thread blocks


• A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks
• TILE_WIDTH of 16 gives each SM 3 blocks, 768 threads (full capacity)

• Each thread block performs 2*256 = 512 float loads from global
memory for 256 * (2*16) = 8,192 mul/add operations.
• Memory bandwidth no longer a limiting factor
GPU Programming 57
Tiled Multiply

• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
• Each thread computes one element of Pdsub

(Figure: the same tiling picture as before; in phase m, block (bx, by) loads tile m of Md and tile m of Nd and accumulates over k into Pdsub)
GPU Programming 58

G80 Shared Memory and Threading


• Each SM in G80 has 16KB shared memory
• SM size is implementation dependent!
• For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
• The shared memory can potentially have up to 8 Thread Blocks actively executing
• This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
• The threading model limits the number of thread blocks to 3 so shared memory is not the
limiting factor here
• The next TILE_WIDTH 32 would lead to 2*32*32*4B= 8KB shared memory usage
per thread block, allowing only up to two thread blocks active at the same time

• Using 16x16 tiling, we reduce the accesses to the global memory by a factor
of 16
• The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
GPU Programming 59

Parallel Memory Architecture

• In a parallel machine, many threads access memory
  • Therefore, memory is divided into banks
  • Essential to achieve high bandwidth

• Each bank can service one address per cycle
• A memory can service as many simultaneous accesses as it has banks

• Multiple simultaneous accesses to a bank result in a bank conflict
  • Conflicting accesses are serialized

(Figure: shared memory organized as Bank 0, Bank 1, …, Bank 15)
GPU Programming 60

Bank Addressing Examples

• No Bank Conflicts
  • Linear addressing, stride == 1
• No Bank Conflicts
  • Random 1:1 Permutation

(Figure: in both cases, threads 0 … 15 map one-to-one onto banks 0 … 15)


GPU Programming 61

Bank Addressing Examples

• 2-way Bank Conflicts
  • Linear addressing, stride == 2
• 8-way Bank Conflicts
  • Linear addressing, stride == 8

(Figure: with stride 2, pairs of threads land on the same bank; with stride 8, eight threads land on the same bank)
GPU Programming 62

How addresses map to banks on G80


• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive
banks
• G80 has 16 banks
• So bank = address % 16
• Same as the size of a half-warp
• No bank conflicts between different half-warps, only within a
single half-warp
GPU Programming 63

Shared memory bank conflicts


• Shared memory is as fast as registers if there are no bank
conflicts

• The fast case:


• If all threads of a half-warp access different banks, there is no bank
conflict
• If all threads of a half-warp access the identical address, there is no
bank conflict (broadcast)
• The slow case:
• Bank Conflict: multiple threads in the same half-warp access the same
bank
• Must serialize the accesses
• Cost = max # of simultaneous accesses to a single bank
GPU Programming 64

Linear Addressing

• Given:

  __shared__ float shared[256];
  float foo = shared[baseIndex + s * threadIdx.x];

• This is only bank-conflict-free if s shares no common factors with the number of banks
  • 16 on G80, so s must be odd

(Figure: with s=1, threads 0 … 15 map to banks 0 … 15 with no conflicts; with s=3, the mapping is still a 1:1 permutation of the banks)
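
A small sketch contrasting a conflict-free stride with a conflicting one; the kernel name and array sizes are assumptions for illustration.

// On G80, shared memory has 16 banks of 32-bit words: bank = word index % 16.
__global__ void stride_example(float* out) {
    __shared__ float shared[512];
    int tid = threadIdx.x;

    shared[tid] = (float)tid;
    __syncthreads();

    // Stride 1 (odd): the 16 threads of a half-warp hit 16 different banks -> no conflict.
    float fast = shared[1 * tid];

    // Stride 2 (even): threads 0 and 8 of a half-warp hit the same bank -> 2-way conflict.
    float slow = shared[2 * tid];

    out[tid] = fast + slow;
}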
GPU Programming 65

Control Flow Instructions


• Main performance concern with branching is divergence
• Threads within a single warp take different paths
• Different execution paths are serialized in G80
• The control paths taken by the threads in a warp are traversed one at a
time until there are no more.
• A common case: avoid divergence when branch condition is a
function of thread ID
• Example with divergence:
• If (threadIdx.x > 2) { }
• This creates two different control paths for threads in a block
• Branch granularity < warp size; threads 0, 1 and 2 follow a different path
than the rest of the threads in the first warp
• Example without divergence:
• If (threadIdx.x / WARP_SIZE > 2) { }
• Also creates two different control paths for threads in a block
• Branch granularity is a whole multiple of warp size; all threads in any
given warp follow the same path
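
A short sketch of the two patterns side by side; the kernel name and the WARP_SIZE constant are assumptions for illustration.

#define WARP_SIZE 32

__global__ void divergence_example(float* out) {
    int tid = threadIdx.x;

    // Divergent: threads 0-2 and threads 3-31 of the first warp take different paths,
    // so that warp executes both paths one after the other.
    if (tid > 2)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;

    // Non-divergent: the condition changes only at warp-size granularity,
    // so every thread in a given warp takes the same path.
    if (tid / WARP_SIZE > 2)
        out[tid] += 10.0f;
    else
        out[tid] += 20.0f;
}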
