GPU Architecture
P J Narayanan
Centre for Visual Information Technology
IIIT, Hyderabad
PPoPP Tutorial on
GPU Architecture, Programming and Performance Models
GPU: Evolution
• Graphics: a few hundred triangles/vertices map to a few hundred thousand pixels
• Process pixels in parallel: do the same thing on a large number of different items
• Data-parallel model: parallelism provided by the data
– Thousands to millions of data elements
– Same program/instruction on all of them → good performance (a minimal sketch follows below)
• Graphics rendering, image/signal processing, matrix manipulation, FFT, etc.
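As a concrete illustration of the data-parallel model, here is a minimal CUDA sketch (not from the original slides): the same kernel runs on a million elements, one thread per element. The SAXPY operation and the launch configuration are illustrative choices.

#include <cstdio>
#include <cuda_runtime.h>

// Data-parallel model: the same instruction stream runs on every
// element; parallelism comes from the data itself.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard the ragged tail
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;                // a million data elements
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));  // real code would copy data in
    cudaMemset(y, 0, n * sizeof(float));
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // one thread per element
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
    return 0;
}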
What do GPUs do?
• GPU implements the graphics pipeline consisting of:
– Vertex transformations
• Compute camera coordinates, lighting
– Geometry processing
• Primitive-wide properties
– Rasterizing polygons to pixels
• Find pixels falling on each polygon
– Processing the pixels
• Texture lookup, shading, Z-values
– Writing to the framebuffer
• Colour, Z-value
• Computationally intensive

[Figure: the graphics pipeline: Vertex Processing → Geometry Processing → Rasterization → Pixel Processing → Framebuffer → Image]
Streaming Multiprocessor (SM)
• 8 Streaming Processors (SP)
• 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
– 1 to 512 threads active
– Shared instruction fetch per 32 threads
– Covers latency of texture/memory loads
• 30+ GFLOPS
• 16K registers
• 16 KB shared memory

[Figure: SM block diagram: Instruction L1 and Data L1 caches, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs, TEX unit]
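These per-SM figures need not be hard-coded: a minimal sketch that queries them from the CUDA runtime (the numbers printed will of course vary with the device):

#include <cstdio>
#include <cuda_runtime.h>

// Query the hardware limits discussed above from the runtime.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // device 0
    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Registers per block:   %d\n", prop.regsPerBlock);
    printf("Shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}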
Threads, Warps, Blocks
• 32 threads in a Warp or a scheduling group
– Only <32 when there are fewer than 32 total threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a
single SM
• G80 has 16 SMs; the GTX 280 has 30 SMs
• At least 16 Blocks required to “fill” the device
• More is better
– If resources (registers, thread space, shared memory)
allow, more than 1 Block can occupy each SM
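A small sketch of how this hierarchy maps onto indices: each thread can compute its warp and lane from threadIdx. (Device-side printf is assumed here, which needs hardware newer than the G80/GT200 discussed in these slides; the 16×128 launch is chosen to put one block on each SM of a 16-SM G80.)

#include <cstdio>

// Make the thread -> warp -> block hierarchy visible: each warp's
// lane 0 reports which block and warp it belongs to.
__global__ void whereAmI()
{
    int lane = threadIdx.x % 32;   // position within the 32-thread warp
    int warp = threadIdx.x / 32;   // which of the block's warps
    if (lane == 0)
        printf("block %d, warp %d (threads %d-%d)\n",
               blockIdx.x, warp, warp * 32, warp * 32 + 31);
}

int main()
{
    whereAmI<<<16, 128>>>();   // 16 blocks of 4 warps each
    cudaDeviceSynchronize();
    return 0;
}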
Memory Spaces

[Figure: the memory spaces visible to a Grid, including Constant Memory and Texture Memory; access costs for each space are listed below]
Memory Access Times
• Register – dedicated HW – single cycle
• Shared Memory – dedicated HW – single cycle
• Local Memory – DRAM, no cache – *slow* (400-500 cycles)
• Global Memory – DRAM, no cache – *slow* (400-500 cycles)
• Constant Memory – DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached
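The usual response to these costs is to stage data through shared memory: pay the 400-500-cycle global load once per element, then reuse it at single-cycle cost. A minimal sketch, using a 3-point average as a stand-in computation:

#define BLOCK 256

// Each block loads its slice of `in` (plus a one-element halo) into
// shared memory once; every thread then reads its neighbours from the
// single-cycle on-chip copy instead of going back to DRAM.
__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)                          // left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)             // right halo
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                               // tile fully populated

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

Launched as smooth3<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n), each input element is fetched from DRAM once but read three times.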
Thread Scheduling/Execution
• Each Thread Block consists of 32-thread warps (32 is the warp size on current hardware)
• The warp is the scheduling unit: its threads execute together, so divergent paths within a warp run sequentially
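A sketch of what warp-level scheduling means in practice: the shape of the branch condition decides whether a warp is serialized. (These two toy kernels are illustrative, not from the slides.)

// Divergence is a per-warp affair. In `divergent`, every warp has
// threads on both sides of the branch, so the two paths execute
// sequentially. In `uniform`, the branch is aligned to 32-thread
// warp boundaries, so each warp takes a single path.
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) x[i] *= 2.0f;          // half of every warp
    else            x[i] += 1.0f;          // the other half: serialized
}

__global__ void uniform(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;   // whole warps take one side
    else                   x[i] += 1.0f;
}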
Processors, Memory
• Nvidia GTX 280: 240 Streaming Processors, grouped into 30 Streaming Multiprocessors
– One instruction sequencer per SM
– 16 KB of on-chip shared memory per SM
– 16K 32-bit registers per SM
– Single-clock access to registers and shared memory
• 1 GB of common, off-chip global memory
– 130 GB/s of theoretical peak memory bandwidth
– High memory access latency: 300-500 cycles
– 128-byte, 64-byte, or 32-byte memory transactions (see the coalescing sketch below)
• 10 special texture access units to the same global memory; the 30 SMs are grouped into 10 Texture Processor Clusters
• 1.3 GHz clock, 933 GFLOPS peak
• Integer and single-precision float operations in one clock cycle
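Those transaction sizes are why access patterns matter: when the 32 threads of a warp touch consecutive addresses, the hardware merges the warp's loads into a few wide transactions, while a large stride degenerates into one transaction per thread. A minimal illustrative sketch:

// `coalesced`: each warp reads one contiguous segment, served by a
// handful of 32/64/128-byte transactions.
// `strided`: the warp's accesses are scattered across DRAM, costing
// roughly one transaction per thread.
__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];     // consecutive threads, consecutive addresses
}

__global__ void strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];     // large stride defeats coalescing
}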
Thinking Data-Parallel
• Launch N data locations, each of which runs a kernel of code
• Data follows a domain of computation
• Each invocation of the kernel is aware of its location loc within the domain
– Can access different data elements using loc
– May also perform different computations
• Variations of SIMD processing (see the sketch below):
– Abstain from a compute step: if ( f(loc) ) then … else …
• Divergence can result in serialization
– Autonomous addressing for gather: a := b[ f(loc) ]
– Autonomous addressing for scatter: a[ g(loc) ] := b
• The GPGPU model supports gather but not scatter
– Operation autonomy: beyond SIMD
• GPU hardware uses it for graphics, but it is not exposed to users
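A sketch of autonomous addressing in CUDA, which exposes both gather and scatter (unlike the older GPGPU model noted above). Here f and g are arbitrary index maps, passed as precomputed arrays purely for illustration:

// Each invocation knows its location `loc` in the domain and computes
// its own read address (gather) and write address (scatter).
__global__ void gatherScatter(float *a, const float *b,
                              const int *f, const int *g, int n)
{
    int loc = blockIdx.x * blockDim.x + threadIdx.x;
    if (loc < n) {
        float v = b[f[loc]];   // gather:  a := b[ f(loc) ]
        a[g[loc]] = v;         // scatter: a[ g(loc) ] := b
    }
}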
Data-Parallel Primitives
• Deep knowledge of the architecture is needed to get high performance
– Use primitives to build other algorithms
– Primitives implemented efficiently on the architecture by experts
• reduce, scan, segmented scan: aggregate or progressive results from distributed data (a scan sketch follows below)
• split, sort: ordering distributed info

[Figure: example on the input 1 3 2 0 6 2 5 2 4: Add Reduce yields 25; Scan (prefix sum) yields 0 1 4 6 6 12 14 19 21; Segmented Scan with segment flags 1 0 0 1 0 0 0 1 0; Split]
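In practice these primitives come from expert-tuned libraries (e.g. CUDPP). Still, a minimal single-block Hillis-Steele scan shows the idea, run here on the slide's example input; this is the simple O(n log n) formulation, not the work-efficient version a library would use:

#include <cstdio>
#include <cuda_runtime.h>

#define N 9   // length of the example input above

// Exclusive prefix sum of one small array in a single block.
// Loading data[t-1] (identity 0 for t == 0) turns the inclusive
// Hillis-Steele scan into an exclusive one.
__global__ void exclusiveScan(float *data)
{
    __shared__ float buf[2][N];   // double buffer to avoid overwrites
    int t = threadIdx.x;
    int cur = 0;
    buf[cur][t] = (t > 0) ? data[t - 1] : 0.0f;
    __syncthreads();
    for (int offset = 1; offset < N; offset *= 2) {
        int nxt = 1 - cur;
        buf[nxt][t] = buf[cur][t] +
                      ((t >= offset) ? buf[cur][t - offset] : 0.0f);
        cur = nxt;
        __syncthreads();   // everyone finished this round before the next
    }
    data[t] = buf[cur][t];
}

int main()
{
    float h[N] = {1, 3, 2, 0, 6, 2, 5, 2, 4};   // the slide's input
    float *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    exclusiveScan<<<1, N>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        printf("%g ", h[i]);    // prints: 0 1 4 6 6 12 14 19 21
    printf("\n");
    cudaFree(d);
    return 0;
}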
Graph Algorithms
• Not the prototypical data-parallel application; an irregular application
• Source of data-parallelism: the data structure (adjacency matrix or adjacency list)
• A 2D domain of V² elements or a 1D domain of E elements
• A thread processes each edge in parallel; combine the results (a sketch follows below)
• And more …

[Figure: adjacency matrix (V × V, indexed by vertices) and adjacency list (E entries; take the first entry for each vertex u)]
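A sketch of the one-thread-per-edge pattern. The edge-list layout (src[e], dst[e]) is an illustrative assumption, not necessarily the adjacency list above; each thread processes one edge, and per-vertex results are combined with an atomic:

// One thread per edge: here each thread bumps the out-degree of its
// edge's source vertex; atomicAdd combines results across threads.
__global__ void outDegrees(const int *src, int *deg, int numEdges)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < numEdges)
        atomicAdd(&deg[src[e]], 1);
}

The deg array must be zero-initialised (e.g. with cudaMemset) before the launch.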
Thank you!