GPU Computing 2
LECTURE 02 - CUDA PROGRAMMING
Holger Fröning
[email protected]
Institute of Computer Engineering
Ruprecht-Karls University of Heidelberg
With material from D. Kirk, W. Hwu (“Programming Massively Parallel Processors”)
DIE SHOTS - CPU OR GPU?
[Die shots, annotated: PCIe and QPI links, memory controller, system request queue, caching agents and router on the CPU; command processor and 4x setup pipeline on the GPU]
CUDA & GPU - OVERVIEW
NVIDIA CUDA
  Compute kernel as C program
  Explicit data- and thread-level parallelism
  Computing, not graphics processing
Host communication
Memory hierarchy
  Host memory
  GPU (device) memory
  GPU on-chip memory (later)
More HW details exposed
  Use of pointers
  Load/store architecture
  Barrier synchronization of thread blocks
[Figure: system architecture - CPU socket (96 GFLOPS DP; cores, system request queue) with north bridge and memory interface to host memory (60 GB/s); system and peripheral interfaces (16 GB/s each) via an I/O bridge to the GPU (1,165 GFLOPS DP, 3,494 GFLOPS SP) with its own memory interface (288 GB/s) to GPU memory]
G80 ARCHITECTURE FOR GRAPHICS PROCESSING
[Figure: G80 in graphics mode - host, thread processor, streaming processors (SP), texture units (TF), L1/L2 caches, frame buffer partitions (FB)]
G80 ARCHITECTURE FOR GENERAL-PURPOSE PROCESSING
SM = Streaming Multiprocessor
[Figure: G80 in compute mode - host, input assembler, thread execution manager; 8 SMs, each with a parallel data cache (shared memory); texture units; global memory]
CUDA PROGRAMMING MODEL
PROGRAMMING MODEL
A CUDA program consists of a CPU part and a GPU part
  CPU part: parts of the program with no or little parallelism
  GPU part: highly parallel parts, SPMD-style ("kernels")
Concurrent execution
  Non-blocking kernel execution: the host continues while the GPU computes
  Explicit synchronization (see the sketch below)
C extension with three main abstractions
  1. Hierarchy of threads
  2. Shared memory
  3. Barrier synchronization
Exploiting parallelism
  Inner loops: fine-grain data-level parallelism (DLP) and thread-level parallelism (TLP) -> threads
  Kernels
[Figure: execution timeline alternating between serial CPU phases and parallel GPU kernels]
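A minimal sketch of this host/device split (kernel name, problem size, and launch configuration are illustrative assumptions, not from the slides):

__global__ void scale ( float *v, float a, int n )   // GPU part: SPMD kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) v[i] *= a;
}

int main ()                                          // CPU part: serial control flow
{
    int n = 1 << 20;
    float *d_v;
    cudaMalloc ( (void**) &d_v, n * sizeof ( float ) );
    scale <<< ( n + 255 ) / 256, 256 >>> ( d_v, 2.0f, n );  // non-blocking launch
    cudaDeviceSynchronize ();                        // explicit synchronization
    cudaFree ( d_v );
    return 0;
}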
KERNEL LAUNCH
Kernels: N-fold execution by N threads
  __global__
Execution:
  kernel <<< numBlocks, threadsPerBlock >>> ( args )
Unique ID: threadIdx.{x,y,z}
  Control flow for SPMD programs
  Memory access orchestration

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock ( N, N );
    matAdd <<< 1, dimBlock >>> ( A, B, C );
}
KERNEL LAUNCH
Each thread block has up to 3 dimensions
  "Block" in the following
The dimensions of a thread block are limited
  512 x 512 x 64 -> 1024 x 1024 x 64
  GPU-dependent

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
KERNEL LAUNCH
Super-fine grained: one thread computes one element
Integer division "/" rounds down, so add (block size - 1) to round up!
  E.g. N=50: grid size = (50+16-1)/16 = 4.0625 => 4

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if ( i < N && j < N )
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock ( 16, 16 );                              // block size
    dim3 dimGrid ( ( N + dimBlock.x - 1 ) / dimBlock.x,    // grid size
                   ( N + dimBlock.y - 1 ) / dimBlock.y );
    matAdd <<< dimGrid, dimBlock >>> ( A, B, C );
}
THREAD HIERARCHY
Thread hierarchy
Grid of thread blocks
Blocks of equal size
Given problem size N, how should the parameters threads per block and blocks per grid be chosen?
Recommendations wrt block count
  > 2x number of SMs
  Optimal: 100 - 1000 (max. 64k-1)
Recommendations wrt threads/block
  Required concurrency for latency tolerance vs. resources per thread
  (a configuration sketch follows below)
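A sketch of choosing a configuration along these lines for a 1D problem of size n (device 0, the kernel name, and the block size of 256 are illustrative assumptions):

cudaDeviceProp prop;
cudaGetDeviceProperties ( &prop, 0 );          // query device 0
int threadsPerBlock = 256;                     // a multiple of the warp size
int blocks = ( n + threadsPerBlock - 1 ) / threadsPerBlock;
// recommendation from above: at least 2x the SM count to keep all SMs busy;
// the kernel then needs a bounds check, since excess threads may exist
if ( blocks < 2 * prop.multiProcessorCount )
    blocks = 2 * prop.multiProcessorCount;
kernel <<< blocks, threadsPerBlock >>> ( d_data, n );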
THREAD COMMUNICATION
Only threads within the same block can interact
  Barrier synchronization
Exception: global memory
  Very weak coherence & consistency guarantees
(A minimal interaction sketch follows below.)
[Figure: host and device with grids of thread blocks; Block (1,1) expanded into its threads, indexed (x,y,z) from (0,0,0) to (3,0,1)]
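A minimal sketch of intra-block interaction through shared memory and a barrier (the kernel name and the fixed block size of 256 are illustrative assumptions):

__global__ void reverseWithinBlock ( float *data )
{
    __shared__ float tmp[256];                 // visible to all threads of the block
    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tmp[t] = data[base + t];
    __syncthreads ();                          // barrier: all writes now visible
    data[base + t] = tmp[blockDim.x - 1 - t];  // safe read of another thread's write
}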
MEMORY HIERARCHY – GLOBAL MEMORY
Global memory
  Communication between host and device
  Accessible from all threads (R/W)
  High latency
  Lifetime exceeds thread lifetime
  Sensitive to fine-grained accesses
Allocation
  cudaMalloc ( &dmem, size );
Deallocation
  cudaFree ( dmem );
Data transfer (blocking)
  cudaMemcpy ( dst, src, size, transfer_type );
  cudaMemcpyAsync ( … )
[Figure: grid of thread blocks, each with shared memory and per-thread registers; host and device global memory]
MEMORY HIERARCHY – GLOBAL MEMORY
// Allocate GPU memory; annotate variable scope (d... device, h... host)!
float *dmem;
cudaMalloc ( (void**) &dmem, N*sizeof ( float ) );

// Allocate CPU memory
float *hmem = (float*) malloc ( N*sizeof ( float ) );

// Transfer data from host to device
cudaMemcpy ( dmem, hmem, N*sizeof ( float ), cudaMemcpyHostToDevice );

// Do calculations; kernels take only references to device memory
kernel1 <<< numBlocks, numThreadsPerBlock >>> ( dmem, N );
...
kernel2 <<< numBlocks, numThreadsPerBlock >>> ( dmem, N );

// Transfer data from device to host
cudaMemcpy ( hmem, dmem, N*sizeof ( float ), cudaMemcpyDeviceToHost );
MEMORY HIERARCHY – SHARED MEMORY
On-chip memory
Lifetime: thread block lifetime
Access costs in the best case equal register access
[Figure: grid of thread blocks, each block with its own shared memory; registers per thread]

Declaration              Memory                            Accessible by
__device__ float var;    global memory (device memory)     device/host program
__constant__ float var;  constant memory (device memory)   device/host program
__shared__ float var;    shared memory (on-chip)           threads of a thread block
texture <float> ref;     texture memory (device memory)    device/host program
Vector types
char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1,
short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3,
uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4,
float1, float2, float3, float4, double2
Derived from basic types (int, float, …)
Dimension type: dim3
  Based on uint3
  Unspecified components are initialized to 1 (small sketch below)
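A small usage sketch (the kernel name is an illustrative assumption):

dim3 grid ( 8, 8 );       // z unspecified -> initialized to 1: 8 x 8 x 1 blocks
dim3 block ( 16, 16 );    // 16 x 16 x 1 threads per block
kernel <<< grid, block >>> ( d_data );
float4 v = make_float4 ( 1.f, 2.f, 3.f, 4.f );  // vector type with fields .x .y .z .w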
FUNCTION DECLARATION

                                    Executed on    Callable from
__device__ float deviceFunc()       device         device
__global__ void  kernelFunc()       device         host
__host__   float hostFunc()         host           host
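A short sketch combining the three qualifiers (function names are illustrative assumptions):

__device__ float square ( float x )            // executed on device, callable from device
{
    return x * x;
}

__global__ void squareAll ( float *v, int n )  // executed on device, callable from host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) v[i] = square ( v[i] );
}

__host__ void launch ( float *d_v, int n )     // executed on host, callable from host
{
    squareAll <<< ( n + 255 ) / 256, 256 >>> ( d_v, n );
}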
BRIEF PROPERTY SURVEY (DEVICEQUERY)

                                  GeForce GTX 480        Tesla K20c           RTX 2080 Ti
CC revision                       2.0                    3.5                  7.5
Total global memory [bytes]       1.5G                   5G                   11G
Multiprocessors                   15                     13                   68
Cores                             480                    2496                 4352
Total constant memory [bytes]     64k                    64k                  64k
Shared memory per block [bytes]   48k                    48k                  48k
Registers per block               32k                    64k                  64k
Warp size                         32                     32                   32
Max threads per block             1k                     1k                   1k
Max dimension of a block          1k x 1k x 64           1k x 1k x 64         1k x 1k x 64
Max dimension of a grid           65535 x 65535 x 65535  2G x 65535 x 65535   2G x 65535 x 65535
Max memory pitch [bytes]          2G                     2G                   2G
Clock rate [GHz]                  1.4                    0.7                  1.54
Concurrent copy and execution     Y (1 copy engine)      Y (2 copy engines)   Y (3 copy engines)
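These values come from the deviceQuery sample; a stripped-down sketch of querying a few of them directly:

#include <cstdio>
#include <cuda_runtime.h>

int main ()
{
    int count;
    cudaGetDeviceCount ( &count );
    for ( int d = 0; d < count; ++d ) {
        cudaDeviceProp p;
        cudaGetDeviceProperties ( &p, d );
        printf ( "%s: CC %d.%d, %zu MB global memory, %d SMs, warp size %d, max %d threads/block\n",
                 p.name, p.major, p.minor, p.totalGlobalMem >> 20,
                 p.multiProcessorCount, p.warpSize, p.maxThreadsPerBlock );
    }
    return 0;
}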
CUDA EXAMPLE: SAXPY
SAXPY EXAMPLE
SAXPY EXAMPLE
// kernel function, serial CPU version
void saxpy_serial ( int n, float alpha, float *x, float *y )
{
    int i;
    for ( i = 0; i < n; i++ ) {
        y[i] = alpha * x[i] + y[i];
    }
}
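For comparison, a minimal CUDA version of the same computation (a sketch; the block size of 256 is an illustrative assumption):

// kernel function (GPU): one thread computes one element
__global__ void saxpy_parallel ( int n, float alpha, float *x, float *y )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )
        y[i] = alpha * x[i] + y[i];
}

// launch with enough blocks to cover all n elements, e.g.:
// saxpy_parallel <<< ( n + 255 ) / 256, 256 >>> ( n, 2.0f, d_x, d_y );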
INITIAL PERFORMANCE
[Chart: SAXPY performance, Tesla K10 (PCIe 2.0 x16) vs. Intel Xeon E5 (single-threaded), for problem sizes up to 64k elements. Huge advantage for the GPU when no data movements are required; pageable vs. pinned host memory makes a substantial difference for transfers.]
PINNED MEMORY
Replace malloc with cudaMallocHost
Significant reduction of data movement costs
Pinned memory is a scarce resource!
float *h_x;
float *h_y;
float *d_x;
float *d_y;
if (USE_PINNED_MEMORY) {
cudaMallocHost ( (void**) &h_x, N*sizeof(float) );
cudaMallocHost ( (void**) &h_y, N*sizeof(float) );
} else {
h_x = (float*) malloc ( N*sizeof(float) );
h_y = (float*) malloc ( N*sizeof(float) );
}
cudaMalloc ( (void**)&d_x, N*sizeof(float) );
cudaMalloc ( (void**)&d_y, N*sizeof(float) );
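The matching cleanup (a sketch): pinned host memory must be released with cudaFreeHost rather than free.

if (USE_PINNED_MEMORY) {
    cudaFreeHost ( h_x );    // counterpart of cudaMallocHost
    cudaFreeHost ( h_y );
} else {
    free ( h_x );
    free ( h_y );
}
cudaFree ( d_x );            // device memory
cudaFree ( d_y );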
COMMON ERRORS
CUDA Error: the launch timed out and was terminated
  -> Stop X11
CUDA Error: unspecified launch failure
  -> Typically a segmentation fault
CUDA Error: invalid configuration argument
  -> Too many threads per block, or too many resources per thread (shared memory, register count)
Compile problem:
  mmult.cu(171): error: identifier "__eh_curr_region" is undefined
  -> Non-static shared memory; use static allocation of shared memory
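None of these runtime errors surface unless return codes are checked. A minimal checking helper (a sketch, not from the slides):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print location and message on failure.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf ( stderr, "CUDA Error at %s:%d: %s\n",             \
                      __FILE__, __LINE__, cudaGetErrorString(err_) );  \
            exit ( EXIT_FAILURE );                                     \
        }                                                              \
    } while (0)

// usage:
//   CUDA_CHECK ( cudaMalloc ( (void**) &d_x, size ) );
//   kernel <<< grid, block >>> ( d_x );
//   CUDA_CHECK ( cudaGetLastError () );   // catches launch failures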
SUMMARY
Introduction to CUDA
  Pretty unusual concept compared to CPU programming
  Once understood: pretty easy programming model
[Figure: system architecture diagram, as in the overview slide]