Lecture 12: GPU Programming
Assignment 4
• Consists of two programming assignments
• Concurrency
• GPU programming
• Requires a computer with a CUDA/OpenCL/DirectCompute compatible GPU
• Due Jun 07
• We have no final exams
GPU Resources
• Download the CUDA Toolkit from the NVIDIA developer website
Acknowledgments
• Slides and material from
• Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)
Why GPU Programming
• More processing power + higher memory bandwidth
• Typical quad-core CPU (for comparison):
  • 4 cores (CPU 0-3), 2x HyperThreaded, 4-float-wide SIMD, ~3 GHz
  • 48-96 GFLOPS
  • 64 kB L1 cache per core, shared L2 cache
  • ~20 GB/s to memory
  • ~$200, ~200 W
Current GPU (vs. CPU)
• CPU: ~50 GFLOPS, 4-6 GB of CPU RAM, ~10 GB/s to memory
• GPU: ~1 TFLOP, ~1 GB of GPU RAM, ~100 GB/s to memory
• CPU <-> GPU transfer: ~1 GB/s
• All values are approximate
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
• User kicks off batches of threads on the GPU
• GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
• Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
A CUDA Program
1. Host performs some CPU computation
2. Host copies input data to the device
3. Host instructs the device to execute a kernel
4. Device executes the kernel and produces results
5. Host copies the results back from the device
6. Goto step 1
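A minimal host-side sketch of this flow (illustrative only; the kernel name, data, and sizes below are placeholders, not code from the lecture):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, int n)      // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    const size_t size = n * sizeof(float);
    float* h_data = (float*)malloc(size);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;          // 1. host computation

    float* d_data;
    cudaMalloc(&d_data, size);
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // 2. copy input to device

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);           // 3.-4. device executes the kernel

    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // 5. copy results back

    cudaFree(d_data);
    free(h_data);
    return 0;                                                   // 6. repeat as needed
}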
Threads Organization
• Kernel threads = Grid of Thread Blocks
• Thread Block = array of threads (1D, 2D, or 3D)
• Simplifies memory addressing
• All threads run the same kernel (thread) program
(Figure: the host launches a kernel as Grid 1 on the device; the grid contains Blocks (0,0), (1,0), (0,1), (1,1), and each block contains Threads (0,0), (1,0), ...)
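A small sketch of how these indices are used inside a kernel (the kernel and array names here are illustrative):

#include <cuda_runtime.h>

// Each thread derives a unique (x, y) coordinate from the built-in block and
// thread indices; this is what simplifies addressing of 2D data
__global__ void indexDemo(int* out, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * pitch + x] = x + y;            // row-major addressing of a 2D array
}

int main()
{
    dim3 dimBlock(16, 16);                 // 2D thread block (1D or 3D also possible)
    dim3 dimGrid(2, 2);                    // 2D grid of blocks -> 32 x 32 threads total
    int* d_out;
    cudaMalloc(&d_out, 32 * 32 * sizeof(int));
    indexDemo<<<dimGrid, dimBlock>>>(d_out, 32);
    cudaFree(d_out);
    return 0;
}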
CUDA Device Memory Spaces
• Global memory: read/write, per grid
• Constant memory: read only, per grid
• Texture memory: read only, per grid
• The host can transfer data to/from global, constant, and texture memory
Memory Access Speeds
• Shared memory: on chip -> fast
• Global memory: not cached -> slow
• Constant memory: cached -> fast if good reuse
• Texture memory: cached -> fast if good reuse
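A sketch of how these memory spaces appear in CUDA C (names are illustrative; texture memory, not shown, is read through texture fetch functions):

__constant__ float coeff[16];                    // constant memory: read-only per grid, cached

__global__ void memDemo(float* gdata, int n)     // gdata points into global memory: read/write, not cached on G80
{
    __shared__ float tile[256];                  // shared memory: on-chip, per block, fast
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = gdata[i];     // stage a global value in shared memory
    __syncthreads();
    if (i < n) gdata[i] = tile[threadIdx.x] * coeff[0];   // reuse a cached constant value
}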
Matrix Multiplication
• P = M * N, all of size WIDTH x WIDTH
• Simple strategy
  • One thread calculates one element of P
  • M and N are loaded WIDTH times from global memory
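For reference, the sequential computation that the simple strategy parallelizes; each GPU thread takes over one (row, col) iteration of the two outer loops (a sketch, row-major layout assumed):

void MatrixMulOnHost(const float* M, const float* N, float* P, int Width)
{
    for (int row = 0; row < Width; ++row)
        for (int col = 0; col < Width; ++col) {
            float sum = 0;
            for (int k = 0; k < Width; ++k)
                sum += M[row * Width + k] * N[k * Width + col];   // dot product of one row and one column
            P[row * Width + col] = sum;
        }
}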
Host code: allocate device memory and copy the input matrices

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
cudaMalloc(&Pd, size);
// call kernel
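Put together, the host side looks roughly like this (the copy-back and cleanup steps are not shown on the slides but follow the same API):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // copy input matrix M
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);   // copy input matrix N
    cudaMalloc(&Pd, size);                             // allocate space for the result

    // call kernel (launch configuration shown below)

    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);   // copy the result back to the host
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);          // release device memory
}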
Kernel Invocation

MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
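The launch configuration that goes in place of the "// call kernel" comment, for the single-block version discussed next (a sketch):

dim3 dimGrid(1, 1);                 // one thread block in the grid
dim3 dimBlock(width, width);        // WIDTH x WIDTH threads, one per element of Pd
MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);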
Kernel Code (fragments from the slides)

// short forms:
tx = threadIdx.x;
ty = threadIdx.y;

// inner loop body and result write:
... Nd[k*width + tx];
Pd[ty*width + tx] = r;
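Assembled from the fragments above, the complete single-block kernel would look roughly like this (a sketch; variable names follow the slides):

__global__ void MatrixMul(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float r = 0;
    for (int k = 0; k < width; ++k)
        r += Md[ty * width + k] * Nd[k * width + tx];   // row ty of Md times column tx of Nd

    Pd[ty * width + tx] = r;                            // each thread writes one element of Pd
}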
Only One Thread Block Used
• One block of threads computes matrix Pd
  • Each thread computes one element of Pd
• Each thread
  • Loads a row of matrix Md
  • Loads a column of matrix Nd
  • Performs one multiply and one addition for each pair of Md and Nd elements
• Compute to off-chip memory access ratio is close to 1:1 (not very high)
• Size of matrix is limited by the number of threads allowed in a thread block
Thread Block Assignment to SMs
• Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
  • Up to 8 blocks per SM, as resources allow
  • An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  • The SM maintains thread/block id #s
  • The SM manages/schedules thread execution
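These per-SM limits vary with the GPU generation and can be queried at run time, e.g. (a sketch using the CUDA runtime API):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // properties of device 0
    printf("SMs:               %d\n", prop.multiProcessorCount);
    printf("warp size:         %d\n", prop.warpSize);
    printf("max threads/block: %d\n", prop.maxThreadsPerBlock);
    printf("max threads/SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("shared mem/block:  %zu bytes\n", (size_t)prop.sharedMemPerBlock);
    return 0;
}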
Thread Scheduling: Warps
• Each block is divided into 32-thread warps
  • Warps are the scheduling units in an SM
• Example: an SM holding 3 blocks of 256 threads each
  • 256/32 = 8 warps per block
  • There are 8 * 3 = 24 warps in the SM
SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
  • Warps whose next instruction has its operands ready for consumption are eligible for execution
  • Eligible warps are selected for execution based on a prioritized scheduling policy
  • All threads in a warp execute the same instruction when selected
• Example schedule over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96
Block Granularity Considerations
• For 8x8 blocks, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, because each SM can only take up to 8 blocks, only 512 threads will go into each SM!
• For 16x16 blocks, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
• For 32x32 blocks, we have 1024 threads per block. Not even one block fits into an SM!
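The same arithmetic as a small helper (assuming the G80 limits above: 768 threads/SM, 8 blocks/SM, 512 threads/block):

#include <cstdio>

int residentThreadsPerSM(int threadsPerBlock)
{
    const int maxThreadsPerSM = 768, maxBlocksPerSM = 8, maxThreadsPerBlock = 512;
    if (threadsPerBlock > maxThreadsPerBlock) return 0;       // the block does not fit at all
    int blocks = maxThreadsPerSM / threadsPerBlock;           // limited by the thread budget
    if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM;     // limited by the block slots
    return blocks * threadsPerBlock;
}

int main()
{
    printf("8x8   blocks: %d threads per SM\n", residentThreadsPerSM(8 * 8));     // 512
    printf("16x16 blocks: %d threads per SM\n", residentThreadsPerSM(16 * 16));   // 768
    printf("32x32 blocks: %d threads per SM\n", residentThreadsPerSM(32 * 32));   // 0
    return 0;
}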
Idea: Use Shared Memory to Reuse Global Memory Data
• Each element of M and N is read by WIDTH threads
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
• Tiled algorithms
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
(Figure: Pd divided into TILE_WIDTH x TILE_WIDTH sub-matrices; block (bx, by) computes the sub-matrix Pdsub, thread (tx, ty) computes one of its elements)
Every Md and Nd element is used exactly twice in generating a 2x2 tile of P:

P0,0 (thread0,0): M0,0*N0,0 + M1,0*N0,1 + M2,0*N0,2 + M3,0*N0,3
P1,0 (thread1,0): M0,0*N1,0 + M1,0*N1,1 + M2,0*N1,2 + M3,0*N1,3
P0,1 (thread0,1): M0,1*N0,0 + M1,1*N0,1 + M2,1*N0,2 + M3,1*N0,3
P1,1 (thread1,1): M0,1*N1,0 + M1,1*N1,1 + M2,1*N1,2 + M3,1*N1,3

Each Md element and each Nd element appears in exactly two of the four sums.
Breaking Md and Nd into Tiles
• Break up the inner product loop of each thread into phases
• At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase into shared memory
• Everyone accesses the Md and Nd elements from shared memory during the phase
(Figure: the 4x4 matrices Md, Nd, and Pd divided into 2x2 tiles)
Tiled Kernel (skeleton)

__global__
void Tiled(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  float Pvalue = 0;
  // compute Pvalue
  Pd[Row*Width + Col] = Pvalue;
}
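One way to fill in the "// compute Pvalue" part (a sketch assuming Width is a multiple of TILE_WIDTH; Row and Col are derived from the block and thread indices):

#define TILE_WIDTH 16

__global__ void Tiled(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {               // one iteration per phase/tile
        Mds[ty][tx] = Md[Row * Width + m * TILE_WIDTH + tx];     // each thread loads one Md element
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];   // ... and one Nd element
        __syncthreads();                                         // wait until the whole tile is loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];                   // work out of shared memory
        __syncthreads();                                         // wait before overwriting the tile
    }
    Pd[Row * Width + Col] = Pvalue;
}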
• With 16x16 tiles, each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
• Memory bandwidth is no longer a limiting factor
Tiled Multiply
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub
(Figure: block indices (bx, by) and thread indices (tx, ty); m indexes the tiles/phases, k indexes within a tile)
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
• The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
Shared Memory: Strided Access and Bank Conflicts
• Example: each thread reads shared[s * threadIdx.x], e.g. with s = 3
• This is only bank-conflict-free if s shares no common factor with the number of banks (16 on G80), i.e. if s is odd
(Figure: Thread 0 -> Bank 0, Thread 1 -> Bank 1, Thread 2 -> Bank 2, ..., Thread 15 -> Bank 15)
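A sketch of the access pattern in question (illustrative names; 16 banks assumed, as on G80):

__global__ void strideDemo(float* out, int s)
{
    __shared__ float shared[1024];
    int tid = threadIdx.x;
    shared[tid] = (float)tid;                 // the values are irrelevant, the access pattern matters
    __syncthreads();

    float a = shared[tid];                    // stride 1: thread i hits bank (i mod 16) -> conflict-free
    float b = shared[tid * s];                // stride s: conflict-free only if s is odd
                                              // (s = 2 -> 2-way conflicts, s = 16 -> 16-way conflicts)
    out[tid] = a + b;                         // assumes blockDim.x * s <= 1024
}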