GPU Computing 3
PROGRAMMING
LECTURE 03 - BASIC ARCHITECTURE
Holger Fröning
[email protected]
Institute of Computer Engineering
Ruprecht-Karls University of Heidelberg
GK110 - ARCHITECTURE
Up to 15 SMX, 6 MCs, L2 cache, PCIe 3.0, CC 3.5
GK110 - ARCHITECTURE
Per SMX:
192 SP units
64 DP units
32 load/store units
32 special function units
4 warp schedulers
Optimized for performance/watt -> reduced clock frequency
Remember Pollack's rule
BULK-SYNCHRONOUS PARALLEL
REMINDER: BULK-SYNCHRONOUS PARALLEL
In 1990, Valiant already described GPU computing pretty well
Superstep: compute, communicate, synchronize
Parallel slackness: # of virtual processors v, # of physical processors p
  v = 1: not viable
  v = p: unpromising wrt optimality
  v >> p: leverage slack to schedule and pipeline computation and communication efficiently
Extremely scalable, bad for unstructured parallelism
Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990
REMINDER: VECTOR ISAS
Compact: single instruction defines N operations
  Amortizes the cost of instruction fetch/decode/issue
  Also reduces the frequency of branches
Parallel: N operations are (data) parallel
  No dependencies
  No need for complex hardware to detect parallelism (similar to VLIW)
  Can execute in parallel assuming N parallel data paths
Expressive: memory operations describe patterns
  Continuous or regular memory access pattern
  Can prefetch or accelerate using wide/multi-banked memory
  Can amortize high latency for 1st element over large sequential pattern
[Figure: 4x SIMD example - one instruction stream feeding four processing units (PUs) that operate on a shared data pool]
OUR VIEW OF A GPU
Software view: a programmable many-core scalar architecture
  Huge amount of scalar threads to exploit parallel slackness, operates in lock-step
  SIMT: single instruction, multiple threads
GPU collaborative computing
  One thread per output element (see the sketch below)
  PCAM: A == form thread blocks, ignore M
  Schedulers exploit parallel slack
GPU collaborative memory access
  One thread per data element
  MCs highly optimized to exploit concurrency -> coalescing issues
-> If you do something on a GPU, do it collaboratively with all threads
[Figure: GPU overview - compute units and memory controllers (MCs), each MC connected to off-chip GDDR]
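A minimal sketch of this collaborative pattern (kernel name, sizes and launch configuration are illustrative, not from the lecture): every thread produces exactly one output element, and neighboring threads touch neighboring data elements so the memory controllers can coalesce the accesses.

// One thread per output element: c[i] = a[i] + b[i]
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard threads beyond the array
        c[i] = a[i] + b[i];                          // neighboring threads -> neighboring addresses
}

// Launch enough thread blocks to cover all n elements (parallel slackness):
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);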
(GLOBAL) MEMORY SUBSYSTEM
GK110 – MEMORY HIERARCHY
Registers at thread level
  64k registers per thread block
  Registers/thread depends on run-time configuration
  Max. 255 registers/thread
Shared memory / L1$ at block level
  Variable sizes: shared memory 16-48kB, L1$ 16-48kB
  L1$ can serve for register spilling
  L1$ not coherent, write-invalidate
  Compiler controlled RO L1$ (read-only data cache, 48kB)
L2$ / GDDR at device level
  GDDR (off-chip, 6GB): ~400-600 cycles access latency
  L2$ (1.5MB) as victim cache for all upper units, write-back
  Purpose: reducing contention
Host memory (off-device): multiple TBs
[Figure: memory hierarchy - per-thread registers, per-block shared memory / L1$ / read-only cache, device-wide L2$ and GDDR shared by multiple kernels, host memory off-device]
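A hedged sketch of how these levels appear in CUDA source (names and sizes are illustrative): automatic scalars usually live in registers, __shared__ arrays in the per-block shared memory, and pointer arguments refer to global memory in GDDR backed by the L2$.

// Assumes a launch with 256 threads per block, e.g. hierarchy_demo<<<blocks, 256>>>(in, out);
__global__ void hierarchy_demo(const float *in, float *out)   // in/out: global memory (GDDR/L2$)
{
    __shared__ float tile[256];                       // shared memory, one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];                                  // x: register, private to this thread
    tile[threadIdx.x] = x;                            // stage the value in shared memory
    __syncthreads();                                  // block-level barrier before reuse
    out[i] = tile[threadIdx.x] * 2.0f;                // write back to global memory
}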
LOCAL MEMORY
Local memory: part of global memory, but thread-local
Register spilling: when SM runs out of resources
  Limited register count per thread
  Limited total number of registers
  LM is used if the source code exceeds these limits (see the sketch below)
Local because each thread has its private area
Differences from global memory
  Stores are cached in L1$
  Addressing is resolved by compiler
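An illustrative sketch (array size and names are made up): a per-thread array indexed with a value unknown at compile time usually cannot be kept in registers and is placed in local memory instead; compiling with nvcc --ptxas-options=-v reports the resulting register and local memory (lmem) usage per kernel.

__global__ void local_mem_demo(const int *idx, float *out)
{
    float buf[64];                       // per-thread array; likely spilled to local memory
    for (int i = 0; i < 64; ++i)
        buf[i] = (float)i;
    int j = idx[threadIdx.x] & 63;       // index unknown to the compiler
    out[threadIdx.x] = buf[j];           // dynamic access needs addressable (local) storage
}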
HOST MEMORY
Pinned/unpinned host memory
  Unpinned host memory: possibility of demand paging -> staging buffers
  Pinned host memory: autonomous device access possible
    cudaMemcpy
    GPU DMA engine(s)
Zero copy (CC >= 2.0)
  GPU threads can operate on pinned host memory
[Figure: CPU socket (96 GFLOPS DP) with north bridge to host memory (60GB/s), system and peripheral interfaces (16GB/s each) via the IO bridge to the GPU, GPU to GDDR (288GB/s)]
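A sketch of the two host-memory flavors (buffer names and sizes are illustrative, error checking omitted): cudaMallocHost provides pinned memory that the DMA engines can transfer autonomously, while cudaHostAlloc with cudaHostAllocMapped plus cudaHostGetDevicePointer enables zero-copy access by GPU threads.

cudaSetDeviceFlags(cudaDeviceMapHost);                         // enable mapped pinned memory (before context creation)

size_t bytes = 1 << 20;
float *h_pinned, *d_buf;
cudaMalloc((void **)&d_buf, bytes);
cudaMallocHost((void **)&h_pinned, bytes);                     // pinned (page-locked) host memory
cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);    // autonomous DMA transfer

float *h_mapped, *d_mapped;
cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped); // pinned + mapped into GPU address space
cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);     // device-side alias for zero copy
// kernel<<<grid, block>>>(d_mapped);                          // GPU threads access host memory directly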
HOST MEMORY & STREAMS
Stream: sequence of operations performed in-order
  cudaMemcpy
  Kernel launch
Default stream: id=0
Overlap computation with data movement (see the sketch below)
  Latency hiding
  Only applicable for divisible work
  Most suited for compute-bound workloads
See also zero-copy for initial data movements
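A hedged sketch of such overlap with two streams (stream count, chunking and the kernel name process are assumptions): asynchronous copies in one stream overlap with kernel execution in the other, provided the host buffers h_in/h_out are pinned.

cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&s[i]);

size_t chunk = n / 2;                                // split the work into independent halves
for (int i = 0; i < 2; ++i) {
    size_t off = i * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);   // copy chunk i in stream i
    process<<<grid, block, 0, s[i]>>>(d_in + off, d_out + off);  // overlaps with the other stream's copy
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s[i]);
}
cudaDeviceSynchronize();                             // wait until both streams are done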
GLOBAL MEMORY - COALESCING
High bandwidth, high latency
Coalesced access
  Combine fine-grain accesses by multiple threads into single GDDR operations (such requests have a certain granularity)
  Coalesced thread access should match a multiple of L1/L2 cache line sizes
  Kepler cache line sizes: L1: 128B, L2: 32B
Misaligned accesses
  One warp is scheduled, but accesses misaligned addresses
GPUs use caches for access coalescing
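Two illustrative access patterns (kernel names are made up): in the coalesced version consecutive threads read consecutive addresses, so a warp's 32 x 4B loads fall into a single 128B line; in the strided version each warp touches many cache lines and wastes bandwidth.

__global__ void read_coalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                                  // whole warp hits one contiguous 128B region
}

__global__ void read_strided(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];                         // warp scatters over many cache lines
}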
GLOBAL MEMORY – ACCESS PENALTIES
Offset: constant shift of access pattern
  data[addr+offset]
Penalty: fetch 5 cache lines instead of 4
  -> 4/5 of max. bandwidth
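A sketch of the offset experiment (kernel name is made up): with offset = 0 a warp's 128B request maps onto four aligned 32B L2 segments; a misaligned offset makes the same request straddle a fifth segment, which is where the 5-instead-of-4 cache-line fetch and the roughly 4/5 of peak bandwidth come from.

__global__ void read_offset(const float *in, float *out, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i + offset];             // offset not a multiple of 32 floats (128B) -> misaligned requests
}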
CUDA THREAD SCHEDULING
EXAMPLE FOR HARDWARE MULTI-THREADING (G80)
4 warp contexts, max. 1 being executed simultaneously
32 SIMD ALUs
  Explicit 32x SIMD instructions
  32 ALUs execute a single SIMD instruction
Register file (RF) is shared among contexts
  One register entry (vector) has 32 words (each 32bit)
  RF: 16 entries -> max. of 4 registers/warp
Simplifying assumptions
  Each memory access blocks execution for 50 cycles
  A memory access occurs every 20 cycles
[Figure: 32 SIMD ALUs (lanes 0-31), shared register file (16 entries), four thread warp contexts T0-T3]
EXAMPLE FOR HARDWARE MULTI-THREADING (G80)
Each memory access blocks execution for 50 cycles (texture memory)
A memory access occurs every 20 cycles
-> While one warp waits 50 cycles, three other warps can each execute for 20 cycles, so 4 thread warps are required for full utilization
-> Per thread warp 32 entities = 128 entities
[Figure: execution timeline over cycles 0-80, warps T0-T3 alternating between exec, waiting and stall so that the ALUs stay busy]
THREAD SCHEDULING (KEPLER)
Fetch one instruction per cycle (from I$)
Determine dependencies (operands)
Prioritized round-robin scheduling; priority: warp age
Scheduler broadcasts the instruction to all 32 threads in a warp
[Figure: warp scheduler with two instruction dispatch units, issuing up to two independent instructions per cycle from eligible warps (e.g. Warp 4 Instruction 14/15, Warp 3 Instruction 42) over time]
THREAD SCHEDULING - SCOREBOARD
Scoreboard: hardware table that tracks
  Instructions (fetched, issued, executed)
  Resources/functional units (occupation)
  Dependencies (operands)
  Outputs (modified registers)
Tracks all operands of all instructions in the instruction buffer
Any thread can proceed until the scoreboard prevents issue
OOO execution among warps
Unfeasible without warp abstraction (32x fewer issue slots required)
Scoreboard: old concept from the 1960s (wikipedia.org)
THREAD SCHEDULING – BRANCH DIVERGENCE
Scheduler broadcasts the instruction to all 32 threads in a warp
  Dedicated control paths
Branch divergence problem -> write-masks
In kernel1, one thread of every warp takes the expensive path, so every warp must execute both sides of the branch; in kernel2 the threads of each warp take the same side (threads 0-31 form one warp), so no warp diverges.

__global__ void kernel1 (…)              __global__ void kernel2 (…)
{                                        {
  id = threadIdx.x;                        id = threadIdx.x;
  if ( id % 32 == 0 )                      if ( id < 32 )
    out = complex_function_call();           out = complex_function_call();
  else                                     else
    out = 0;                                 out = 0;
}                                        }
SUMMARY
SUMMARY
GPUs have manually-controlled, rather flat memory hierarchies
  CPUs = deep memory hierarchy
Caches in GPUs not used to reduce latency, but to reduce memory contention and to coalesce accesses
Parallel slackness as in BSP
  Latency hiding & scalability
Instruction stream == thread warp, != single thread (as for CPUs)
Global memory subsystem
  Fully featured memory subsystem, including virtual addresses, MMU and TLB
Performance issues
  Latency hiding: insufficient number of threads
  Too many threads: register spilling
  Coalescing issues (global memory): stride and offset
  Branch divergence
BONUS: ADVANCED MEMORY ANALYSIS
POINTER CHASING: MEMORY/CACHE ANALYSIS
[Figure: pointer-chasing latency plot for a GeForce 8800 GTX @ 1350MHz, annotated as follows]
  128MB @ 8MB stride, no overhead: 16-entry, fully-associative TLB
  Saturation @ 32B stride for L1 & L2: cache line size = 32B
  L1 cache latency up to 5kB size (latency increase for 5.5kB)
  768kB @ 32kB stride, latency reverts to L2 cache: 24-way set-associative (ambiguous: or 6 replicated 4-way L2s)
Source: Vasily Volkov, James W. Demmel: LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs, LAPACK Working Note 202
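For reference, a minimal sketch of the pointer-chasing idea behind such measurements (setup and timing omitted, names illustrative): a single thread walks a linked index array, so every load depends on the previous one and the time per hop exposes the latency of whichever level of the hierarchy the chosen footprint and stride hit.

__global__ void chase(const unsigned int *next, int hops, unsigned int *sink)
{
    unsigned int p = 0;
    for (int i = 0; i < hops; ++i)
        p = next[p];                     // each load depends on the previous one -> pure latency
    *sink = p;                           // keep the compiler from optimizing the loop away
}
// Host side: fill next[] so that element i points stride elements ahead (wrapping at the chosen
// array size), launch chase<<<1, 1>>>(...), and divide the elapsed time by hops.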