Hardware
• AMD HyperTransport™ Technology bus replaces the Front-Side Bus architecture
• HyperTransport™ similarities to PCIe:
– Packet based, switching network
– Dedicated links for both directions
• Shown in 4-socket configuration, 8 GB/sec per link
• Northbridge/HyperTransport™ is on die
• Glueless logic
– to DDR, DDR2 memory
– PCI-X/PCIe bridges (usually implemented in Southbridge)
A Typical Motherboard (mATX form factor)
CUDA Refresher
• Grids of Blocks
• Blocks of Threads
• Conventional Processors
– Latency optimized
– ILP
– Caches 99% hit rate
• GPU
– Caches: 90% hit rate or less, so not a good option
– Throughput optimized
– ILP + TLP
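A minimal sketch of the grid/block/thread hierarchy above (kernel and array names are illustrative, not from the slides):

// Each thread computes one element; blockIdx/blockDim/threadIdx
// encode the grid-of-blocks, block-of-threads hierarchy.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Grid of blocks, blocks of threads: 256 threads per block here.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}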
GT200 Architecture Overview
Terminology
• SPA
– Streaming Processor Array
• TPC
– Texture Processor Cluster
– 3 SM + TEX
• SM
– Streaming Multiprocessor (8 SP)
– Multi-threaded processor core
– Fundamental processing unit for CUDA thread block
• SP
– Streaming Processor
– Scalar ALU for a single CUDA thread
Thread Processing Cluster
(Figure: TPC containing 3 SMs and a shared TEX unit)
Stream Multiprocessor Overview
• Streaming Multiprocessor (SM)
– 8 Streaming Processors (SP)
– 2 Special Function Units (SFU)
– 1 Double-Precision FP Unit (DPU)
– Instruction L1 and Data L1 caches
• Multi-threaded instruction fetch/dispatch
• 16 KB shared memory
• DRAM texture and memory access
Thread Life
• Grid is launched on the SPA
– Blocks are distributed to SMs as Cooperative Thread Arrays (CTAs)
• Break Blocks into warps
• Allocate resources
– Registers, Shared Mem, Barriers
• Then allocate for execution
(Figure: host launches a grid on the device; blocks map onto SMs)
Stream Multiprocessors Execute Blocks
• Threads are assigned to SMs at Block granularity
– Up to 8 Blocks to each SM, as resources allow
– An SM in G200 can take up to 1K threads
• Could be 256 (threads/block) * 4 blocks
• Or 128 (threads/block) * 8 blocks, etc.
• Threads run concurrently
– SM assigns/maintains thread id #s
– SM manages/schedules thread execution
(Figure: SM 0 with MT issue unit, SPs, and shared memory; texture fetch through Texture L1 and L2 to memory)
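A small host-side sketch of the resource arithmetic above; the 8-block and 1K-thread per-SM limits are from the slide, and the helper name is illustrative:

#include <cstdio>

// How many blocks of a given size fit under the G200 per-SM limits
// quoted above: at most 8 blocks and at most 1024 threads per SM.
int blocksPerSM(int threadsPerBlock) {
    const int maxThreadsPerSM = 1024;
    const int maxBlocksPerSM  = 8;
    int byThreads = maxThreadsPerSM / threadsPerBlock;
    return byThreads < maxBlocksPerSM ? byThreads : maxBlocksPerSM;
}

int main() {
    printf("256 threads/block -> %d blocks/SM\n", blocksPerSM(256)); // 4
    printf("128 threads/block -> %d blocks/SM\n", blocksPerSM(128)); // 8
    return 0;
}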
Thread Scheduling and Execution
• Each Thread Block is divided into 32-thread Warps
– This is an implementation decision, not part of the CUDA programming model
• Warp: primitive scheduling unit
– All threads in a warp execute the same instruction
(Figure: warps (t0 … t31) of Block 1 and Block 2 feeding the SM's instruction fetch/dispatch, SPs, and DPU)
Warp Scheduling
• SM hardware implements zero-overhead Warp scheduling
– Warps whose next instruction has its operands ready for consumption are eligible for execution
– Eligible Warps are selected for execution on a prioritized scheduling policy
– All threads in a Warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G200
(Figure: SM multithreaded warp scheduler interleaving, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96)
How many warps are there?
• If 3 blocks are assigned to an SM and each Block has
256 threads, how many Warps are there in an SM?
– Each Block is divided into 256/32 = 8 Warps
– There are 8 * 3 = 24 Warps
– At any point in time, only one of the 24 Warps will be
selected for instruction fetch and execution.
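A minimal sketch of the same arithmetic; the block count and block size are the example's numbers:

#include <cstdio>

int main() {
    const int warpSize        = 32;
    const int threadsPerBlock = 256;
    const int blocksPerSM     = 3;

    int warpsPerBlock = threadsPerBlock / warpSize;   // 256 / 32 = 8
    int warpsPerSM    = warpsPerBlock * blocksPerSM;  // 8 * 3 = 24

    printf("%d warps per block, %d warps resident on the SM\n",
           warpsPerBlock, warpsPerSM);
    return 0;
}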
Warp Scheduling: Hiding Thread Stalls
(Figure: execution switches between thread blocks as warps stall — TB1 W1 stalls, TB2 W1 runs, then TB3 W2, …)
• For a 32x32 block we would have 1024 threads per Block; not even one such block can fit into an SM
Stream Multiprocessor Detail
Scalar Units
• 32 bit ALU and Multiply-Add
• IEEE Single-Precision Floating-Point
• Integer
• Latency is 4 cycles
• FP: NaN, Denormals become signed 0.
• Round to nearest even
Special Function Units
• Transcendental function evaluation and per-
pixel attribute interpolation
• Function evaluator:
– rcp, rsqrt, log2, exp2, sin, cos approximations
– Uses quadratic interpolation based on
Enhanced Minimax Approximation
– 1 scalar result per cycle
• Latency is 16 cycles
• Some are synthesized: 32 cycles or so
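A small sketch using CUDA's fast-math intrinsics, which are evaluated by the SFU approximations named above (the kernel name is illustrative):

// __sinf, __expf, and rsqrtf are hardware approximations, trading
// accuracy for speed compared with the full-precision library calls.
__global__ void sfuDemo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = __sinf(x) + __expf(x) * rsqrtf(x + 1.0f);
    }
}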
Memory System Goals
• High bandwidth
• As much parallelism as possible
• Wide: 512 pins in G200 / many DRAM chips
• Fast signalling: maximize data rate per pin
• Maximize utilization
– Multiple bins of memory requests
– Coalesce requests to be as wide as possible
– Goal: use every cycle to transfer from/to memory
• Compression: lossless and lossy
• Caches where it makes sense; kept small
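A hedged sketch of what coalescing means at the kernel level: consecutive threads touching consecutive addresses let the hardware merge a warp's accesses into wide DRAM transfers (array and kernel names are illustrative):

// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// one contiguous, aligned segment serviced by a few wide transactions.
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, scattering a warp's
// loads across many DRAM locations and wasting bus cycles.
__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}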
DRAM considerations
• Multiple banks per chip
– 4-8 typical
• 2^N rows
– 16K typical
• 2^M cols
– 8K typical
• Timing constraints
– ~10 cycles to open a row
– 4 cycles within a row
• DDR
– 1 GHz --> 2 Gbit/s per pin
– 32-bit interface --> 8 bytes per clock
• GPU to memory: many traffic generators
– No correlation if greedy scheduling
– Separate heaps / coalesce accesses
• Longer latency
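Using the slide's own numbers, a back-of-the-envelope peak-bandwidth estimate for the 512-pin G200 interface (a rough sketch, not a datasheet figure):

#include <cstdio>

int main() {
    // From the slides: 512 data pins, DDR at 1 GHz -> 2 Gbit/s per pin.
    const double pins        = 512.0;
    const double gbitPerPin  = 2.0;                  // Gbit/s per pin
    const double gbitPerSec  = pins * gbitPerPin;    // 1024 Gbit/s
    const double gbytePerSec = gbitPerSec / 8.0;     // 128 GB/s peak

    printf("Peak DRAM bandwidth: %.0f GB/s\n", gbytePerSec);
    return 0;
}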
Parallelism in the Memory System
• Local Memory: per-thread
– Private per thread
– Auto variables, register spill
• Shared Memory: per-Block
– Shared by threads of the same block
– Inter-thread communication
• Global Memory: per-application
– Shared by all threads
– Inter-Grid communication
(Figure: thread ↔ local memory, block ↔ shared memory, sequential grids in time ↔ global memory)
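A minimal sketch showing the three scopes above in one kernel (names are illustrative; assumes a grid of at most 1024 blocks and an input of at least blockDim elements):

__device__ float d_result[1024];          // global memory: visible to all threads and grids

__global__ void scopesDemo(const float *in) {
    __shared__ float tile[256];           // shared memory: one copy per block

    float x = in[threadIdx.x];            // automatic variable: per-thread
                                          // (registers, spilling to local memory)

    tile[threadIdx.x] = x;                // communicate within the block
    __syncthreads();

    // Thread 0 of each block publishes a per-block value to global memory.
    if (threadIdx.x == 0)
        d_result[blockIdx.x] = tile[0];
}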
SM Memory Architecture
• Threads in a Block share data & results
– In Memory and Shared Memory
– Synchronize at barrier instruction
– Provides 4 operands/clock
• Cached on chip
– L1 per SM
(Figure: per-SM registers (R), constant cache (C$), and Shared Memory, banks 0–15)
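A small sketch of the share-then-synchronize pattern above: stage data in shared memory, barrier, then read a neighbour's result (names are illustrative; launch with at most 256 threads per block):

// Each block stages a tile in shared memory, synchronizes at the
// barrier, then every thread reads a value written by its neighbour.
__global__ void neighbourSum(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // barrier: tile is now complete

    int next = (threadIdx.x + 1) % blockDim.x;
    if (i < n) out[i] = tile[threadIdx.x] + tile[next];
}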
Bank Addressing Examples
(Figure: thread-to-bank mappings for shared[baseIndex + threadIdx.x] — threads 0–15 mapping onto banks 0–15)
Data types and bank conflicts
• This has no conflicts if the type of shared is 32-bit:
foo = shared[baseIndex + threadIdx.x];
• But not if the data type is smaller
– 2-way bank conflicts:
__shared__ short shared[];
foo = shared[baseIndex + threadIdx.x];
(Figure: threads 0–15 mapping onto banks 0–15)
Structs and Bank Conflicts
• Struct assignments compile into as many memory accesses as there are struct members:
struct vector { float x, y, z; };
struct myType {
    float f;
    int c;
};
__shared__ struct vector vectors[64];
__shared__ struct myType myTypes[64];
• This has no bank conflicts for vector; the struct size is 3 words
– 3 accesses per thread, contiguous banks (no common factor with 16)
struct vector v = vectors[baseIndex + threadIdx.x];
Common Bank Conflict Patterns (1D)
• Interleaved loads, two elements per thread:
int tid = threadIdx.x;
shared[2*tid] = global[2*tid];
shared[2*tid+1] = global[2*tid+1];
– This pattern helps where cache-line effects matter, but not in shared memory usage, where there are no cache line effects, only banking effects (here, 2-way bank conflicts)
A Better Array Access Pattern
• Each thread loads one element in every consecutive group of blockDim elements:
shared[tid] = global[tid];
shared[tid + blockDim.x] = global[tid + blockDim.x];
(Figure: threads 0–15 mapping conflict-free onto banks 0–15)
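A hedged sketch contrasting the two staging patterns as complete kernels (kernel and buffer names are illustrative; assumes 16 banks, a 256-thread block, and at least 512 input elements, as in the slides):

// Interleaved staging: stride-2 indexing -> threads in a half-warp
// collide pairwise on the 16 shared-memory banks (2-way conflicts).
__global__ void stageInterleaved(const float *global, float *out) {
    __shared__ float shared[512];
    int tid = threadIdx.x;
    shared[2*tid]     = global[2*tid];
    shared[2*tid + 1] = global[2*tid + 1];
    __syncthreads();
    out[tid] = shared[2*tid] + shared[2*tid + 1];
}

// Blocked staging: unit-stride indexing -> each thread in a half-warp
// hits a distinct bank, so both stores are conflict-free.
__global__ void stageBlocked(const float *global, float *out) {
    __shared__ float shared[512];
    int tid = threadIdx.x;
    shared[tid]              = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];
    __syncthreads();
    out[tid] = shared[tid] + shared[tid + blockDim.x];
}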
Vector Reduction with Bank Conflicts
(Figure: reduction over array elements 0–11)
No Bank Conflicts
(Figure: reduction over array elements 0–19)
Common Bank Conflict Patterns (2D)
• Operating on a 2D array of floats in shared memory
– e.g., image processing
• Example: 16x16 block
– Each thread processes a row
– So threads in a block access the elements in each column simultaneously (example: row 1 in purple)
– 16-way bank conflicts: rows all start at bank 0
(Figure: bank indices without padding — every row of the 16x16 tile begins at bank 0)
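One standard fix for this pattern is to pad each row by one element so consecutive rows start in different banks; a minimal sketch under that assumption (tile and kernel names are illustrative; launch with one 16-thread block):

// A 16-thread block; thread r processes row r of a 16x16 tile.
// With rows padded to 17 floats, element [r][i] lives in bank
// (r*17 + i) % 16 = (r + i) % 16, so when the 16 threads read column i
// simultaneously they hit 16 distinct banks. With 16-float rows they
// would all hit bank i: a 16-way conflict.
__global__ void rowSums(const float *in, float *out, int width) {
    __shared__ float tile[16][17];        // 16x16 data + 1 padding column
    int r = threadIdx.x;

    for (int i = 0; i < 16; ++i)          // load this thread's row
        tile[r][i] = in[r * width + i];
    __syncthreads();

    float sum = 0.0f;
    for (int i = 0; i < 16; ++i)          // step i: the 16 threads together
        sum += tile[r][i];                // read one element from each row
    out[r] = sum;
}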
Load/Store (Memory read/write) Clustering/Batching
• Use LD to hide LD latency (non-dependent LD ops only)
– Use same thread to help hide its own latency
• Instead of:
– LD 0 (long latency)
– Dependent MATH 0
– LD 1 (long latency)
– Dependent MATH 1
• Do:
– LD 0 (long latency)
– LD 1 (long latency - hidden)
– MATH 0
– MATH 1
• Compiler handles this!
– But you must have enough non-dependent LDs and Math ops
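A hedged sketch of the reordering described above at the source level (the compiler typically does this on its own when the loads are independent; names are illustrative):

// Instead of load -> dependent math -> load -> dependent math,
// issue both independent loads first so the second load's latency
// overlaps the first one's.
__global__ void clusteredLoads(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];          // LD 0 (long latency)
        float y = b[i];          // LD 1 issued before MATH 0, latency hidden
        float m0 = x * x + 1.0f; // MATH 0, dependent on LD 0
        float m1 = y * y + 2.0f; // MATH 1, dependent on LD 1
        out[i] = m0 + m1;
    }
}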