Threads
Klaus Mueller

Every thread runs the same kernel code; the thread ID selects which data element it works on:

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …

[Figure: many threads shown side by side, each executing this same code fragment on its own threadID]
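A minimal CUDA sketch of this pattern; the names input, output, func, and n are placeholders taken from the figure, and the launch shown in the comment is one common choice, not something prescribed by the slides:

    __device__ float func(float x) { return 2.0f * x; }        // placeholder per-element function

    __global__ void apply_func(const float* input, float* output, int n)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
        if (threadID < n) {                                     // guard threads past the end of the data
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }

    // Launch with enough 256-thread blocks to cover n elements:
    //   apply_func<<<(n + 255) / 256, 256>>>(d_input, d_output, n);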
Threads are well suited to processing multidimensional data
– Image processing
– Solving PDEs on volumes

[Figure 3.2: An example of CUDA thread organization. A grid of blocks; Block (1, 1) is expanded to show its threads, e.g. Thread (0,0,0) … Thread (3,0,1). Courtesy: NVIDIA]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Memory Model Overview
• Global memory
  – Main means of communicating R/W data between host and device
  – Allocated and freed from the host:

    cudaMalloc((void**)&Md, size);
    cudaFree(Md);
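A short usage sketch around these two calls, with error checking added; the 64 x 64 size is just an illustration:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int Width = 64;
        size_t size = Width * Width * sizeof(float);
        float* Md = NULL;

        // Allocate Width x Width floats in device global memory
        cudaError_t err = cudaMalloc((void**)&Md, size);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // ... launch kernels that read and write Md ...

        cudaFree(Md);   // release the device allocation
        return 0;
    }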
CUDA Host-Device Data Transfer
• cudaMemcpy()
  – memory data transfer
  – requires four parameters:
    pointer to destination, pointer to source, number of bytes copied, type of transfer
  – transfer types:
    Host to Host, Host to Device, Device to Host, Device to Device
• Asynchronous transfer
CUDA Host-Device Data Transfer (cont.)
• Code example:
  – Transfer a 64 * 64 single-precision float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
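The transfer calls this example refers to, sketched out with the M, Md, and size described above:

    int size = 64 * 64 * sizeof(float);

    // Host to device: copy M (host) into Md (device)
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    // Device to host: copy Md (device) back into M (host)
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);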
Square Matrix Multiplication Example
• P = M * N, all of size WIDTH x WIDTH
• Without tiling:
  – One thread calculates one element of P
  – M and N are loaded WIDTH times from global memory

[Figure: matrices M, N, and P, each WIDTH x WIDTH]
Memory Layout of a Matrix in C
• C stores matrices in row-major order, so a 4 x 4 matrix M is laid out linearly as
  M0,0 M1,0 M2,0 M3,0  M0,1 M1,1 M2,1 M3,1  M0,2 M1,2 M2,2 M3,2  M0,3 M1,3 M2,3 M3,3
• Element Mx,y (column x, row y) is therefore accessed as M[y * Width + x]
Step 1: Matrix Multiplication, a Simple Host Version in C

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
Step 2: Input Matrix Data Transfer (Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    …
    1. // Allocate and load M, N into device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
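Presumably the host function continues with allocating Pd, invoking the kernel, copying the result back, and freeing the device matrices; a sketch consistent with the code above:

    // Allocate P on the device
    cudaMalloc(&Pd, size);

    2. // Kernel invocation code (shown below)
    …

    3. // Read P back from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}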
The kernel itself: each thread computes one element of Pd, using a single thread block of Width x Width threads.

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue accumulates the Pd element computed by this thread (tx = threadIdx.x, ty = threadIdx.y)
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
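A sketch of the matching invocation: one block of Width x Width threads, so this only works while Width x Width stays within the per-block thread limit:

    // One block; each of the Width x Width threads computes one Pd element
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);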
• Have each 2D thread block compute a (TILE_WIDTH)² tile of Pd
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

You still need to put a loop around the kernel call for cases where
WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
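A sketch of that launch configuration for the multi-block kernel shown further below, assuming WIDTH is a multiple of TILE_WIDTH:

    #define TILE_WIDTH 16

    // 2D grid of (Width/TILE_WIDTH)^2 blocks, each TILE_WIDTH x TILE_WIDTH threads
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);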
Matrix Multiplication Using Multiple Blocks
• Break Pd up into tiles
• Each block calculates one tile
  – Each thread calculates one element
  – Block size equals tile size

[Figure: Md, Nd, and Pd, each WIDTH x WIDTH. Block (bx, by) computes the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub, and thread (tx, ty) within it computes one element. A small example shows a 2 x 2 grid of blocks, Block(0,0), Block(1,0), Block(0,1), Block(1,1), covering a 4 x 4 Nd and Pd.]
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Row and column of the Pd element computed by this thread
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
    Pd[Row * Width + Col] = Pvalue;
}
CUDA Thread Block
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares the block:
  – Block size: 1 to 512 concurrent threads
  – Block shape: 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within the block
  – The thread program uses the thread id to select work and address shared data
• Each block can execute in any order relative to other blocks

[Figure: a thread block drawn as thread ids 0, 1, 2, 3, … m, all running the same thread program]
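A small sketch of a thread program using its thread id to select work and to address shared data; the kernel name, BLOCK_SIZE, and the doubling operation are illustrative only:

    #define BLOCK_SIZE 256

    // Assumes a launch with BLOCK_SIZE threads per block and input/output of matching length
    __global__ void scale_block(const float* input, float* output)
    {
        __shared__ float tile[BLOCK_SIZE];                   // shared data, one slot per thread

        int tid = threadIdx.x;                               // id within the block
        int gid = blockIdx.x * blockDim.x + threadIdx.x;     // id within the whole grid

        tile[tid] = input[gid];    // each thread loads "its" element into shared memory
        __syncthreads();           // wait until the whole block has loaded

        output[gid] = 2.0f * tile[tid];
    }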
G80 Example: Executing Thread Blocks
t0 t1 t2 … tm SM 0 SM 1 t0 t1 t2 … tm
MT IU MT IU
Blocks
SP SP
– An implementation decision, … … …
not part of the CUDA
programming model
– Warps are scheduling units Streaming Multiprocessor
in SM Instruction L1
256/32 = 8 Warps
SP SP
– There are 8 * 3 = 24 Warps
[Figure: warp scheduling timeline. When a warp stalls (TB1 W1, then TB2 W1, then TB3 W2), the SM switches to another ready warp to hide the latency]
G80 Block Granularity Considerations
• For matrix multiplication with multiple blocks, should we use 8x8, 16x16, or 32x32 blocks?
  – For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM! This leads to under-utilization (bad for latency hiding).
  – For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
  – For 32x32, we have 1024 threads per block. Not even one block fits into an SM!
Number of Threads
• Threads in a warp run in lock-step on the SM
  – if one thread is delayed by a memory load, all threads in its warp must wait
  – a new warp is scheduled in the meantime
  – so it is good to have more threads per block (more warps available to hide latency)
  – however, there is a limit on the number of threads per SM: 768, 1024, 1536, or 2048, depending on compute capability
  – this is a function of the maximum number of warps (24, 32, 48, or 64) of 32 threads each
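Rather than hard-coding these limits, they can be queried at runtime; a small sketch using standard CUDA runtime attributes:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int device = 0;
        int threadsPerSM = 0, warpSize = 0, threadsPerBlock = 0;

        // Per-SM and per-block limits of the current device
        cudaDeviceGetAttribute(&threadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, device);
        cudaDeviceGetAttribute(&warpSize, cudaDevAttrWarpSize, device);
        cudaDeviceGetAttribute(&threadsPerBlock, cudaDevAttrMaxThreadsPerBlock, device);

        printf("max threads per SM:    %d\n", threadsPerSM);
        printf("warp size:             %d\n", warpSize);
        printf("max warps per SM:      %d\n", threadsPerSM / warpSize);
        printf("max threads per block: %d\n", threadsPerBlock);
        return 0;
    }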
Number of Blocks
• The workload of threads is not always uniform
  – all threads of a block must complete before a new block can be scheduled in its place
  – if the slow thread is part of a large block, the idle time is high
  – so it is better to have smaller blocks
  – however, there is a limit on the number of blocks per SM (8 or fewer, depending on compute capability)
GPU Utilization
• The goal is to allocate as many threads per SM as the maximum limit allows
• This has to take into account:
  – the maximum number of blocks per SM
  – the warp granularity: the SM's warp limit should divide evenly by the warps per block, so no warp slots are left unused
• Whether 100% utilization can be reached therefore depends on the compute capability and on the number of threads per block
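A sketch of checking this with the CUDA occupancy API; the kernel and block size here are placeholders:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data) { }   // placeholder kernel

    int main()
    {
        int blockSize = 256;        // threads per block to evaluate
        int blocksPerSM = 0, threadsPerSM = 0;

        // How many blocks of this size can be resident per SM for this kernel?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, blockSize, 0);
        cudaDeviceGetAttribute(&threadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, 0);

        float utilization = 100.0f * blocksPerSM * blockSize / threadsPerSM;
        printf("resident blocks per SM: %d, thread utilization: %.0f%%\n", blocksPerSM, utilization);
        return 0;
    }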
GPU Utilization
[Table from Shane Cook, "CUDA Programming": achievable utilization for various threads-per-block choices across compute capabilities]
Practical Example
• Histogram computation
  (example and figures from Shane Cook, "CUDA Programming")
GPU Algorithm 1
• Not overly fast
• Why?
  – each thread only fetches 1 byte
  – so a half warp fetches only 16 bytes
  – while the maximum supported memory transaction is 128 bytes
  – hence memory bandwidth is heavily underused
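A sketch consistent with that description (256-bin byte histogram, one thread per input byte, atomicAdd on global bins); this is an illustration, not Cook's listing:

    // Naive byte-per-thread histogram: each thread fetches one byte and does one global atomicAdd
    __global__ void histogram_naive(const unsigned char* data, unsigned int* bins, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            unsigned char value = data[i];     // a single-byte fetch per thread
            atomicAdd(&bins[value], 1u);       // heavily contended atomic on global memory
        }
    }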
GPU Algorithm 2
(code listing from Shane Cook, "CUDA Programming")
• In fact, it achieves zero speedup
  – no improvement in memory bandwidth
  – devices of higher compute capability already coalesce these accesses well
  – need to look for another culprit
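The bullets above suggest the second version targeted the fetch width; a plausible sketch, assuming it reads the input as 32-bit words so each thread handles 4 bytes per load (again an illustration, not Cook's code):

    // Word-fetching version: one 32-bit load per thread, still atomics on global bins
    __global__ void histogram_word(const unsigned int* data, unsigned int* bins, int n_words)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_words) {
            unsigned int w = data[i];                   // 4 input bytes in a single fetch
            atomicAdd(&bins[(w >>  0) & 0xFF], 1u);
            atomicAdd(&bins[(w >>  8) & 0xFF], 1u);
            atomicAdd(&bins[(w >> 16) & 0xFF], 1u);
            atomicAdd(&bins[(w >> 24) & 0xFF], 1u);     // contention on the global bins remains
        }
    }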
GPU Algorithm 3
(code listing from Shane Cook, "CUDA Programming")
• Results in a 6-fold speedup
• But we could still reduce global memory traffic
• What can we do?
[Performance figures for GPU Algorithm 3 from Shane Cook, "CUDA Programming"]
GPU Algorithm 3
• Not much growth in bandwidth after N = 32, due to other factors impeding growth (the atomic adds, in this case)
  (Shane Cook, "CUDA Programming")
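One common way to cut global memory traffic in a histogram, consistent with the role the atomic adds and the per-thread workload N play above, is to accumulate per-block counts in shared memory and let each thread process N input words before a single per-block merge into the global bins. A sketch under those assumptions (not Cook's listing):

    #define BINS 256

    // Per-block shared-memory histogram; each thread consumes N words of input,
    // so the BINS global atomic adds happen only once per block.
    __global__ void histogram_shared_n(const unsigned int* data, unsigned int* bins,
                                       int n_words, int N)
    {
        __shared__ unsigned int local_bins[BINS];
        for (int b = threadIdx.x; b < BINS; b += blockDim.x)
            local_bins[b] = 0;                                  // clear the per-block bins
        __syncthreads();

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;                    // keep accesses coalesced
        for (int j = 0; j < N; ++j, idx += stride) {
            if (idx < n_words) {
                unsigned int w = data[idx];                     // 4 input bytes per fetch
                atomicAdd(&local_bins[(w >>  0) & 0xFF], 1u);   // cheap shared-memory atomics
                atomicAdd(&local_bins[(w >>  8) & 0xFF], 1u);
                atomicAdd(&local_bins[(w >> 16) & 0xFF], 1u);
                atomicAdd(&local_bins[(w >> 24) & 0xFF], 1u);
            }
        }
        __syncthreads();

        for (int b = threadIdx.x; b < BINS; b += blockDim.x)
            atomicAdd(&bins[b], local_bins[b]);                 // one global add per bin per block
    }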