
GPU COMPUTING - ARCHITECTURE + PROGRAMMING
LECTURE 03 - BASIC ARCHITECTURE
Holger Fröning
[email protected]
Institute of Computer Engineering
Ruprecht-Karls University of Heidelberg
GK110 - ARCHITECTURE
Up to 15 SMX (streaming multiprocessors), 6 memory controllers (MCs), L2 cache, PCIe 3.0, Compute Capability (CC) 3.5
GK110 - ARCHITECTURE

Per SMX:
192 SP units
64 DP units
32 load/store units
32 special function units
4 warp schedulers
Optimized for performance/watt -> reduced clock frequency
Remember Pollack's rule: performance grows only roughly with the square root of core complexity
BULK-SYNCHRONOUS PARALLEL
REMINDER: BULK-SYNCHRONOUS PARALLEL
In 1990, Valiant already described GPU computing pretty well
Superstep: compute, communicate, synchronize
Parallel slackness: # of virtual processors v, # of physical processors p
v = 1: not viable
v = p: unpromising wrt optimality
v >> p: leverage slack to schedule and pipeline computation and communication efficiently
Extremely scalable, bad for unstructured parallelism
Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990
REMINDER: VECTOR ISAS
Compact: single instruction defines N operations
Amortizes the cost of instruction fetch/decode/issue
Also reduces the frequency of branches
Parallel: N operations are (data) parallel
No dependencies
No need for complex hardware to detect parallelism (similar to VLIW)
Can execute in parallel assuming N parallel data paths
Expressive: memory operations describe patterns
Continuous or regular memory access pattern
Can prefetch or accelerate using wide/multi-banked memory
Can amortize high latency for 1st element over large sequential pattern
[Figure: 4x SIMD example - one instruction stream feeding 4 processing units (PUs) that operate on a shared data pool]
OUR VIEW OF A GPU
Software view: a programmable many-core scalar architecture
Huge number of scalar threads to exploit parallel slackness, operates in lock-step
SIMT: single instruction, multiple threads
IT'S AN (ALMOST) PERFECT INCARNATION OF THE BSP MODEL

Hardware view: a programmable multi-core vector architecture
Illusion of scalar threads: hardware packs them into compound units
SIMD: single instruction, multiple data
IT'S A VECTOR ARCHITECTURE THAT HIDES ITS VECTOR UNITS
THE BEAUTY OF SIMPLICITY
GPU Computing & CUDA
Thread-collective computation and memory accesses
SIMT - Single Instruction, Multiple Threads
GPU collaborative computing
One thread per output element
PCAM (Partition, Communicate, Agglomerate, Map): A == form thread blocks, ignore M
Schedulers exploit parallel slack
GPU collaborative memory access
One thread per data element
MCs (memory controllers) highly optimized to exploit concurrency -> coalescing issues
-> If you do something on a GPU, do it collaboratively with all threads
[Figure: output data set mapped to compute and memory; memory controllers (MC) connect to GDDR]
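A minimal sketch of the one-thread-per-output-element pattern (kernel name, element type, and launch configuration are illustrative, not taken from the slides):

// One thread per output element: thread i computes exactly c[i].
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may be slightly larger than n
        c[i] = a[i] + b[i];
}

// Launch: forming thread blocks is the "A" (agglomerate) step; the warp
// schedulers then exploit the parallel slack across all these threads.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);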
(GLOBAL) MEMORY SUBSYSTEM
GK110 – MEMORY HIERARCHY
Registers at thread level
64k/thread block
Registers/thread depends on run-time configuration
Max. 255 registers/thread
Shared memory / L1$ at block level
Variable sizes: shared memory 16-48kB, L1$ 16-48kB, read-only data cache 48kB
L1$ can serve for register spilling
L1$ not coherent, write-invalidate
Compiler-controlled read-only (RO) L1$
L2$ / GDDR at device level, shared by multiple kernels
L2$ (1.5MB) as victim cache for all upper units, write-back; purpose: reducing contention
GDDR (off-chip, 6GB): ~400-600 cycles access latency
Host memory (off-device): multiple TBs
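A small sketch (not from the slides) that queries some of these per-device limits at run time via cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);                        // device 0
    printf("registers per block:  %d\n",        p.regsPerBlock);
    printf("shared mem per block: %zu bytes\n", p.sharedMemPerBlock);
    printf("L2 cache size:        %d bytes\n",  p.l2CacheSize);
    printf("global memory:        %zu bytes\n", p.totalGlobalMem);
    return 0;
}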
LOCAL MEMORY
Local memory: part of global memory, but thread-local
Register spilling: when the SM runs out of resources
Limited register count per thread
Limited total number of registers
Local memory is used if the source code exceeds these limits
Local because each thread has its private area
Differences from global memory
Stores are cached in L1$
Addressing is resolved by the compiler
Store always happens before load
Per thread: move data from GM to LM (stores), subsequent load accesses
HOST MEMORY
Pinned/unpinned host memory
Unpinned host memory: possibility of demand paging -> staging buffers
Pinned host memory: autonomous device access possible
cudaMemcpy
Zero copy (CC >= 2.0)
GPU threads can operate on pinned host memory
For initial shared memory fills, etc.
[Figure: CPU socket (96 GFLOPS DP) with 60GB/s host memory interface, 16GB/s system interface via north bridge / IO bridge; GPU (1,165 GFLOPS DP, 3,494 GFLOPS SP) with 288GB/s GPU memory interface and 16GB/s peripheral interface]
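A hedged sketch of pinned host memory and an explicit host-to-device copy (buffer names and sizes are illustrative):

#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;
    float *h_pinned, *d_buf;

    cudaMallocHost((void **)&h_pinned, bytes);  // pinned (page-locked): no demand paging,
                                                // the GPU's DMA engine can access it directly
    cudaMalloc((void **)&d_buf, bytes);

    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);   // copy via the DMA engine

    // Zero copy (CC >= 2.0): allocate with cudaHostAlloc(..., cudaHostAllocMapped) and
    // use cudaHostGetDevicePointer() so GPU threads can read host memory directly.

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}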
HOST MEMORY & CUDAMEMCPY

HOST MEMORY & STREAMS
Stream: sequence of operations performed in-order
cudaMemcpy
Kernel launch
Default stream: id=0
Overlap computation with data movement
Latency hiding
Only applicable for divisible work
Most suited for compute-bound workloads
See also zero-copy for initial data movements
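A minimal sketch of overlapping copies and kernels with two streams (kernel, chunking, and sizes are assumptions, not taken from the slides):

#include <cuda_runtime.h>

__global__ void scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int nChunks = 4, chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(float);
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void **)&h_in,  nChunks * chunkBytes);   // pinned: required for async overlap
    cudaMallocHost((void **)&h_out, nChunks * chunkBytes);
    cudaMalloc((void **)&d_in,  nChunks * chunkBytes);
    cudaMalloc((void **)&d_out, nChunks * chunkBytes);

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    // While one chunk computes in one stream, the next chunk is copied in the other.
    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t st = s[c % 2];
        size_t off = (size_t)c * chunkElems;
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes, cudaMemcpyHostToDevice, st);
        scaleKernel<<<(chunkElems + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes, cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d_in); cudaFree(d_out); cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}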
GLOBAL MEMORY - COALESCING
High bandwidth, high latency
Coalesced access
Combine fine-grain accesses by multiple threads into single GDDR operations (such requests have a certain granularity)
Coalesced thread access should match a multiple of L1/L2 cache line sizes
Kepler cache line sizes: L1: 128B, L2: 32B
Misaligned accesses
One warp is scheduled, but accesses misaligned addresses
GPUs use caches for access coalescing
GLOBAL MEMORY – ACCESS PENALTIES
Offset: constant shift of access pattern
data[addr+offset]
Penalty: fetch 5 cache lines instead of 4
4/5 of max. bandwidth
Example: 8 elements offset, 4B per element
Source: NVIDIA, CUDA C Best Practices Guide
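A sketch in the spirit of the Best Practices Guide's offset experiment (kernel and parameter names are illustrative):

// Each thread copies one element, shifted by a constant offset.
// offset = 0 gives aligned, fully coalesced transactions; a non-zero offset
// makes every warp touch one extra cache line, cutting effective bandwidth.
__global__ void offsetCopy(float *odata, const float *idata, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[i] = idata[i];
}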


GLOBAL MEMORY – ACCESS PENALTIES
Stride: access only every nth address
data[addr*stride]
Stride of 2: 50% load/store efficiency
Worsens with larger strides
Source: NVIDIA, CUDA C Best Practices Guide
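And the corresponding strided pattern (again an illustrative sketch, not code from the slides):

// With stride 2, each warp loads twice as many cache lines as it uses:
// 50% load/store efficiency, getting worse as the stride grows.
__global__ void strideCopy(float *odata, const float *idata, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[i] = idata[i];
}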


GLOBAL MEMORY – ACCESS PENALTIES
Main problem: thread scheduling does not result in coalesced accesses
Solution: manually control data movement in the memory hierarchy
Caches = transparent, implicit hierarchy
Scratchpad (shared memory) = opaque, explicit hierarchy
Collaborative loads from global memory to shared memory (see the sketch below)
Common case: one thread is not moving the data it requires (at least not immediately)
One of the GPU's main advantages is memory bandwidth: coalescing is of utmost importance!
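A minimal sketch of a collaborative (coalesced) load into shared memory; the tile size and kernel name are illustrative assumptions:

#define TILE 256

// Each thread performs one coalesced load; after the barrier, any thread may
// consume an element that a *different* thread fetched (here: reverse the tile).
// Assumes the array length is a multiple of TILE for simplicity.
__global__ void reverseTile(const float *in, float *out)
{
    __shared__ float tile[TILE];

    int g = blockIdx.x * TILE + threadIdx.x;   // consecutive addresses per warp -> coalesced
    tile[threadIdx.x] = in[g];                 // collaborative load into shared memory
    __syncthreads();                           // tile now visible to the whole block

    out[g] = tile[TILE - 1 - threadIdx.x];     // use data another thread loaded
}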
CUDA THREAD SCHEDULING

Foundation of latency tolerance


LATENCY TOLERANCE TECHNIQUES
Latency tolerance techniques (after Culler/Singh/Gupta):

Relaxed consistency models
  Types of latency tolerated: write (blocking-read processors); read and write (dynamically scheduled processors)
  Software requirements: labeling synchronization operations
  Extra hardware support: little
  Supported in commercial systems: yes

Prefetching
  Types of latency tolerated: write, read
  Software requirements: predictability
  Extra hardware support: little
  Supported in commercial systems: yes

Multi-threading
  Types of latency tolerated: write, read, synchronization
  Software requirements: explicit extra concurrency
  Extra hardware support: substantial
  Supported in commercial systems: yes

Block data transfer
  Types of latency tolerated: write, read
  Software requirements: identifying and orchestrating block transfers
  Extra hardware support: not in the processor, but in the memory system
  Supported in commercial systems: (yes)

David E. Culler, Jaswinder Pal Singh, Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
THREAD SCHEDULING
Up to 1k threads per block
One block executes on one SM
Kepler: one SM = 192 SP + 64 DP units
Each thread block is divided in warps of 32 threads
Implementation decision, not part of CUDA
Warps are the units for the scheduler
Example: 4 blocks being executed on one SM, each block 1k threads. How many warps? (4 x 1024 / 32 = 128 warps)

Scheduler:
1. Select one thread block to execute, allocate resources (registers, etc.) as required
2. Select one out of the 32 warps of this block for instruction fetch and execution
3. Repeat until all resources are utilized
4. Upon warp stalling, select another warp for IF and EX
5. Deallocate resources after all warps have finished (non-preemptive)
THREAD SCHEDULING
Fine-grained multi-threading (FGMT)
Switch context (i.e., warp) every cycle
A warp that has the operands ready for its next instruction is ready for execution
All threads in a warp execute the same instruction
Goal of FGMT: latency hiding
Global memory access latency: ~400-600 cycles
Sufficient number of warps can keep all functional units busy
Warp count for maximum utilization depends on computational intensity
[Figure: scheduling timeline - warps (W) of thread blocks (TB) TB1-TB3 are interleaved over time; when a warp stalls (e.g. TB1 W1), another ready warp (e.g. TB2 W1) is selected for issue]
EXAMPLE FOR HARDWARE MULTI-THREADING (G80)
4 warp contexts, max. 1 being executed simultaneously
32 SIMD ALUs
Explicit 32x SIMD instructions
32 ALUs execute a single SIMD instruction
Register file (RF) is shared among contexts
One register entry (vector) has 32 words (each 32bit)
RF: 16 entries -> max. of 4 registers/warp
Simplifying assumptions
Each memory access blocks execution for 50 cycles
A memory access occurs every 20 cycles
[Figure: 32 SIMD ALUs (0..31), register file with 16 entries (0..15), 4 thread warp contexts T0-T3]
EXAMPLE FOR HARDWARE MULTI-THREADING (G80)
Each memory access blocks execution for 50 cycles (texture memory)
A memory access occurs every 20 cycles
-> 4 thread warps required for full utilization
-> Per thread warp 32 entities = 128 entities
[Figure: execution timeline over cycles 0-80 for warp contexts T0-T3, alternating between exec (20 cycles), stall (50 cycles), and waiting]
THREAD SCHEDULING (KEPLER)
Fetch one instruction per cycle (from I$)
Determine dependencies (operands)
Scoreboard checks if dependencies are resolved
Prevent data hazards
Issue: select one warp based on prioritized round-robin
Priority: warp age
Scheduler broadcasts the instruction to all 32 threads in a warp
[Figure: warp scheduler with two instruction dispatch units issuing instruction pairs per cycle over time, e.g. Warp 4 Instruction 12/13, Warp 12 Instruction 94/95, Warp 7 Instruction 0/1, ..., Warp 4 Instruction 14/15, Warp 3 Instruction 42]
THREAD SCHEDULING - SCOREBOARD
Scoreboard: hardware table that tracks
Instructions (fetched, issued, executed)
Resources/Functional units (occupation)
Dependencies (operands)
Outputs (modified registers)
Tracks all operands of all instructions in the instruction buffer
Any thread can proceed until scoreboard prevents issue
OOO execution among warps
Infeasible without the warp abstraction (32x fewer issue slots required)
Scoreboard: old concept from the 1960s
Separate computation and memory resources
CDC6600 (https://en.wikipedia.org/wiki/CDC_6600)
Enabler of OOO execution for CPUs
THREAD SCHEDULING – BRANCH DIVERGENCE
Scheduler broadcasts the instruction to all 32 threads in a warp
Dedicated control paths
Branch divergence problem
-> Write-masks
__global__ void kernel1 (…)
{
    id = threadIdx.x;

    if ( id % 32 == 0 )
        out = complex_function_call();
    else
        out = 0;
}

__global__ void kernel2 (…)
{
    id = threadIdx.x;

    if ( id < 32 )
        out = complex_function_call();
    else
        out = 0;
}
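The difference matters for the write-masks: in kernel1 every warp diverges, because only one lane per warp (id % 32 == 0) takes the expensive call while the other 31 lanes idle behind the mask. In kernel2 the branch granularity matches the warp size: the first warp (id < 32) takes the call as a whole and all other warps uniformly take the else path, so no warp diverges internally.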
SUMMARY
SUMMARY
GPUs have manually-controlled, rather flat memory hierarchies
CPUs = deep memory hierarchy
Caches in GPUs not used to reduce latency, but to reduce memory contention and to coalesce accesses
Parallel slackness as in BSP
Latency hiding & scalability
Instruction stream == thread warp, != single thread (as for CPUs)
Global memory subsystem
Fully featured memory subsystem, including virtual addresses, MMU and TLB
Performance issues
Latency hiding: insufficient number of threads
Too many threads: register spilling
Coalescing issues (global memory): stride and offset
Branch divergence
BONUS: ADVANCED MEMORY ANALYSIS
POINTER CHASING: MEMORY/CACHE ANALYSIS

[Figure: pointer-chasing latency measurements, GeForce 8800 GTX @ 1350MHz]
Source: Vasily Volkov, James W. Demmel: LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs, LAPACK Working Note 202
POINTER CHASING: MEMORY/CACHE ANALYSIS
Observations from the latency plot (GeForce 8800 GTX @ 1350MHz):
@32B, L1 & L2 saturation: cache line size = 32B
L1 cache latency, 5kB size (latency increase for 5.5kB)
20kB@1kB, L1 cache latency reverts: 20-way set-associative
768kB@32kB, L2 cache latency reverts: 24-way set-associative (ambiguous: or 6 replicated 4-way L2s)
@512kB: saturation of TLB misses: page size = 512kB
128MB latency increase: TLB presence
128MB@8MB stride, no overhead: 16-entry, fully-associative TLB
Source: Vasily Volkov, James W. Demmel: LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs, LAPACK Working Note 202
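For reference, a hedged sketch of a pointer-chasing microbenchmark in the spirit of Volkov/Demmel (array layout, iteration count, and device-side timing with clock64() are assumptions, not their actual code):

#include <cstdio>
#include <cuda_runtime.h>

// One thread chases pointers through an array laid out with a fixed stride.
// The dependent loads serialize, so the average time per step approximates
// the access latency for this (array size, stride) combination.
__global__ void chase(const unsigned int *next, int steps,
                      unsigned int *sink, long long *cycles)
{
    unsigned int p = 0;
    long long start = clock64();
    for (int i = 0; i < steps; ++i)
        p = next[p];                       // dependent load: cannot be overlapped
    long long stop = clock64();
    *sink = p;                             // keep the chain from being optimized away
    *cycles = stop - start;
}

int main()
{
    const int elems  = 1 << 20;            // array size (4MB of unsigned int)
    const int stride = 32;                 // stride in elements
    const int steps  = 1 << 16;

    unsigned int *h = (unsigned int *)malloc(elems * sizeof(unsigned int));
    for (int i = 0; i < elems; ++i)
        h[i] = (i + stride) % elems;       // each element points to the next one

    unsigned int *d_next, *d_sink;
    long long *d_cycles, cycles;
    cudaMalloc((void **)&d_next, elems * sizeof(unsigned int));
    cudaMalloc((void **)&d_sink, sizeof(unsigned int));
    cudaMalloc((void **)&d_cycles, sizeof(long long));
    cudaMemcpy(d_next, h, elems * sizeof(unsigned int), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_next, steps, d_sink, d_cycles);   // a single thread suffices
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);

    printf("avg latency: %.1f cycles/access\n", (double)cycles / steps);

    cudaFree(d_next); cudaFree(d_sink); cudaFree(d_cycles); free(h);
    return 0;
}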
