TDCI Arch
Agenda
Evolution of GPUs
Computing Revolution
Stream Processing
Architecture details of modern GPUs
Evolution of GPUs
(1995-1999)
1995 NV1
1997 Riva 128 (NV3), DX3
1998 Riva TNT (NV4), DX5
32 bit color, 24 bit Z, 8 bit stencil
Dual texture, bilinear filtering
2 pixels per clock (ppc)
Faster TNT
128b memory interface
32 MB memory
The chip that would not die
Virtua Fighter
(SEGA Corporation)
NV1
50K triangles/sec
1M pixel ops/sec
1M transistors
16-bit color
Nearest filtering
1995
Evolution of GPUs
(Fixed Function)
2x Anisotropic filtering
Trilinear filtering
DXT texture compression
4 ppc
Term GPU introduced
Deus Ex
(Eidos/Ion Storm)
NV10
15M triangles/sec
480M pixel ops/sec
23M transistors
32-bit color
Trilinear filtering
1999
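[Figure: combiner stage block diagram. RGB portion: input mappings feed inputs A, B, C, D from the input RGB registers; the RGB function computes A op1 B, C op2 D, and AB op3 CD; results pass through RGB scale/bias to the next combiner's RGB registers. Alpha portion: input mappings feed inputs A, B, C from the input alpha and blue registers; the alpha function computes AB, CD, and AB op4 CD; results pass through alpha scale/bias to the next combiner's alpha registers.]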
Evolution of GPUs
(Shader Model 1.0)
GeForce 3 (NV20)
NV2A Xbox GPU
DirectX 8.0
Vertex and Pixel Shaders
3D Textures
Hardware Shadow Maps
8x Anisotropic filtering
Multisample AA (MSAA)
4 ppc
Ragnarok Online
(Atari/Gravity)
NV20
100M triangles/sec
1G pixel ops/sec
57M transistors
Vertex/Pixel shaders
MSAA
2001
Evolution of GPUs
(Shader Model 2.0)
GeForce FX Series (NV3x)
DirectX 9.0
Floating Point and Long Vertex and Pixel Shaders
Shader Model 2.0
256 vertex ops
32 tex + 64 arith pixel ops
Shading Languages
Dawn Demo
(NVIDIA)
NV30
200M triangles/sec
2G pixel ops/sec
125M transistors
Shader Model 2.0a
2003
Evolution of GPUs
(Shader Model 3.0)
NV40
600M triangles/sec
12.8G pixel ops/sec
220M transistors
Shader Model 3.0
Rotated Grid MSAA
16x Aniso, SLI
2004
Evolution of GPUs
(Shader Model 4.0)
GeForce 8 Series (G8x)
DirectX 10.0
Crysis
(EA/Crytek)
G80
Unified Shader Cores w/ Stream Processors
681M transistors
Shader Model 4.0
8x MSAA, CSAA
2006
As Of Today
Hellgate: London 2005-2006 Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc.
Pixel
ROP
Memory
Shaders in Direct3D
DirectX 9: Vertex Shader, Pixel Shader
DirectX 10: Vertex Shader, Geometry Shader, Pixel Shader
DirectX 11: Vertex Shader, Hull Shader, Domain Shader, Geometry Shader, Pixel Shader, Compute Shader
Observation: All of these shaders require the same basic functionality: texturing (or data loads) and math ops.
Unified Pipeline
[Figure: unified pipeline. Vertex, Geometry (new in DX10), Pixel, Physics, Compute (CUDA, DX11 Compute, OpenCL), and Future workloads share one Texture + Floating Point Processor, backed by ROP and Memory.]
Why Unify?
Non-unified architecture:
Heavy geometry workload: vertex shader busy, pixel shader hardware idle. Perf = 4
Heavy pixel workload: pixel shader busy, vertex shader hardware idle. Perf = 8
Unbalanced and inefficient utilization in a non-unified architecture
Why Unify?
Unified architecture:
Heavy geometry workload: unified shader runs the vertex workload plus the pixel work. Perf = 11
Heavy pixel workload: unified shader runs the pixel workload plus the vertex work. Perf = 11
Optimal utilization in a unified architecture
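The Perf figures on the two "Why Unify?" slides can be reproduced with a very small model. This is only a sketch under stated assumptions: the unit counts (4 dedicated vertex units, 8 dedicated pixel units, 12 unified units, and 1 unit spent on the lighter stage) are not given on the slides; they are hypothetical values chosen because they yield 4, 8, and 11. The function names are likewise illustrative.

```python
# Toy model of the "Why Unify?" slides.
# ASSUMPTION: all unit counts below are hypothetical; the slides only
# state the resulting Perf values (4, 8, 11).

VERTEX_UNITS = 4      # assumed dedicated vertex shader units
PIXEL_UNITS = 8       # assumed dedicated pixel shader units
UNIFIED_UNITS = VERTEX_UNITS + PIXEL_UNITS  # same silicon, one shared pool


def split_pool_perf(dominant_stage):
    """Non-unified: only the units built for the dominant stage do useful work."""
    return VERTEX_UNITS if dominant_stage == "vertex" else PIXEL_UNITS


def unified_pool_perf(minority_units=1):
    """Unified: every unit not needed by the lighter stage serves the dominant one."""
    return UNIFIED_UNITS - minority_units


heavy_geometry_split = split_pool_perf("vertex")
heavy_pixel_split = split_pool_perf("pixel")
heavy_geometry_unified = unified_pool_perf()
heavy_pixel_unified = unified_pool_perf()

print("split pool:  ", heavy_geometry_split, heavy_pixel_split)
print("unified pool:", heavy_geometry_unified, heavy_pixel_unified)
```

In this model the unified pool gives the same result for either workload, because no unit is tied to one stage.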
Utilization of a 4-wide vector unit, by components used:
4 components: MAD r2.xyzw, r0.xyzw, r1.xyzw (100% utilization)
3 components: DP3 r2.w, r0.xyz, r1.xyz (75%)
2 components: MUL r2.xy, r0.xy, r1.xy (50%)
1 component: ADD r2.w, r0.x, r1.x (25%)
Cannot co-issue (the ADD reads the r2.w written by the DP3):
3 components: DP3 r2.w, r0.xyz, r1.xyz
1 component: ADD r2.w, r0.w, r2.w
Vector/VLIW architecture: more compiler work required
G8x, GT200: scalar, always 100% efficient, simple to compile
Up to 2x effective throughput advantage relative to vector
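The utilization percentages above are the number of components an instruction uses divided by the vector width. A minimal sketch of that arithmetic follows. The instruction list is taken from the slide, while the equal weighting of the four instructions in the average is an assumption made purely for illustration.

```python
# Lane utilization of a 4-wide vector ALU versus scalar ALUs.
# ASSUMPTION: the four slide instructions are weighted equally in the
# average; real shaders have a different instruction mix.

VECTOR_WIDTH = 4

# (instruction text from the slide, number of components it uses)
INSTRUCTIONS = [
    ("MAD r2.xyzw, r0.xyzw, r1.xyzw", 4),
    ("DP3 r2.w, r0.xyz, r1.xyz", 3),
    ("MUL r2.xy, r0.xy, r1.xy", 2),
    ("ADD r2.w, r0.x, r1.x", 1),
]


def vector_utilization(components_used, width=VECTOR_WIDTH):
    """Fraction of vector lanes doing useful work for one instruction."""
    return components_used / width


utilizations = [vector_utilization(n) for _, n in INSTRUCTIONS]

# One issue slot per instruction on the vector unit: 4 slots * 4 lanes.
lane_slots_spent = VECTOR_WIDTH * len(INSTRUCTIONS)
useful_lane_slots = sum(n for _, n in INSTRUCTIONS)
average_vector_utilization = useful_lane_slots / lane_slots_spent

# Scalar ALUs spend exactly one slot per component, so none are wasted.
scalar_advantage = lane_slots_spent / useful_lane_slots

for (text, _), u in zip(INSTRUCTIONS, utilizations):
    print(f"{text:32s} {u:.0%}")
print(f"average vector utilization: {average_vector_utilization:.1%}")
print(f"scalar throughput advantage: {scalar_advantage:.1f}x")
```

For this particular mix the scalar advantage is 1.6x, which is consistent with the slide's "up to 2x" claim.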
8800GTX
Conclusion
Build a unified architecture with scalar cores where all shader operations are done on the same processors
Stream Processing
Latency (1)
GPUs are designed for tasks that can tolerate latency
Example: Graphics in a game (simplified scenario):
CPU: Generate Frame 0 | Generate Frame 1 | Generate Frame 2
GPU: Idle             | Render Frame 0   | Render Frame 1
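The simplified game scenario above is a two-stage pipeline: the GPU renders a frame one time slot after the CPU generates it. The following sketch builds that schedule. The function name and the assumption that both stages take exactly one slot per frame are illustrative, not from the slides.

```python
# Two-stage CPU/GPU frame pipeline from the simplified game scenario.
# ASSUMPTION: generating and rendering a frame each take one time slot.

def pipeline_schedule(num_slots):
    """Return (cpu_task, gpu_task) pairs, one pair per time slot."""
    schedule = []
    for slot in range(num_slots):
        cpu_task = f"Generate Frame {slot}"
        # The GPU has nothing to render until the CPU finishes frame 0.
        gpu_task = "Idle" if slot == 0 else f"Render Frame {slot - 1}"
        schedule.append((cpu_task, gpu_task))
    return schedule


schedule = pipeline_schedule(3)
for slot, (cpu_task, gpu_task) in enumerate(schedule):
    print(f"slot {slot}: CPU = {cpu_task:17s} GPU = {gpu_task}")
```

Each frame reaches the screen one slot late (latency), yet after the first slot one frame completes per slot (throughput), which is why this kind of workload tolerates latency well.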
Latency (2)
CPUs are designed to minimize latency
Example: Mouse or keyboard input
[Figure: CPU and GPU block diagrams side by side, with labels Control, Cache, ALU, and DRAM.]
Stream Processing
What we just described:
Given a (typically large) set of data (stream)
Run the same series of operations (kernel or shader) on all of the data (SIMD)
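The stream processing model described above can be sketched in a few lines. This is a sequential stand-in, not GPU code: the loop visits elements one at a time, whereas the hardware would run many elements in parallel. The names run_kernel and scale_and_bias are hypothetical.

```python
# Stream processing sketch: one kernel applied to every element of a stream.
# ASSUMPTION: run_kernel and scale_and_bias are illustrative names; the
# sequential loop stands in for hardware that processes elements in parallel.

def run_kernel(kernel, stream):
    """Apply the same kernel to each element; elements are independent."""
    return [kernel(element) for element in stream]


def scale_and_bias(x):
    """Example kernel: the same math ops for every input element."""
    return 2.0 * x + 1.0


stream = [0.0, 1.0, 2.0, 3.0]
result = run_kernel(scale_and_bias, stream)
print(result)
```

Because the kernel never reads another element's result, the work can be split across any number of processors without changing the code.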
To Summarize
GPUs use stream processing to achieve high throughput
Additionally:
Threads managed by hardware
You are not required to write code for each thread and manage them yourself
Easier to increase parallelism by adding more processors
[Figure: thread processor array. Groups of SP units with TF and L1 blocks, connected to L2 and FB partitions.]
[Figure: GPU block diagram. System Memory and Host Interface; Input Assemble; Viewport / Clip / Setup / Raster / ZCull; Vertex, Geometry, Pixel, and Compute Work Distribution; Interconnection Network; eight ROP / L2 / DRAM partitions.]
[Figure: TPC detail, two identical blocks. Each block: I-Cache, MT Issue, C-Cache, eight SPs, two SFUs, one DP unit, Shared Memory.]
PS Branching Efficiency
[Chart: efficiency axis from 0% to 120%; second axis with ticks from 2 to 16; annotation: 48 pixel coherence.]
Conclusion:
G80 and GT200 Streaming Processor Architecture
Executing in blocks maximally exploits data parallelism
Minimize incoherent memory access
Adding more ALUs yields better performance
[Figure: logical graphics pipeline. CPU, Vertex Assembly, Vertex Shader, Geometry Shader, Rasterizer, Pixel Shader, Blending, with Texture and Framebuffer.]
[Figure: pipeline stages mapped onto hardware blocks. Stage labels: IA, VS, GS, SO, SETUP, RASTERIZER, PS, ROP. Hardware labels: GEOM, SHADER (Unified Shading Units), TEXTURE, FB.]
Texturing
G80
Every 2 SMs connect to a TMU (Texture Mapping Unit)
8 TMUs in total
GT200
Every 3 SMs connect to a TMU
10 TMUs in total
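The SM totals implied by these ratios follow from a single multiplication. The sketch below works that out. The figure of 8 SPs per SM is an assumption read off the SM diagram labels elsewhere in this deck, not something this slide states, and the function name is illustrative.

```python
# SM and SP totals implied by the TMU ratios on this slide.
# ASSUMPTION: SPS_PER_SM = 8 is read off the SM diagram labels in this
# deck; this slide itself only gives the TMU counts and SM-per-TMU ratios.

SPS_PER_SM = 8


def sm_count(tmus, sms_per_tmu):
    """Total SMs when every TMU serves a fixed number of SMs."""
    return tmus * sms_per_tmu


g80_sms = sm_count(tmus=8, sms_per_tmu=2)
gt200_sms = sm_count(tmus=10, sms_per_tmu=3)

g80_sps = g80_sms * SPS_PER_SM
gt200_sps = gt200_sms * SPS_PER_SM

print(f"G80:   {g80_sms} SMs, {g80_sps} SPs")
print(f"GT200: {gt200_sms} SMs, {gt200_sps} SPs")
```

GT200 has more math units per TMU than G80, so its ratio of math to texture throughput is higher.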
[Figure: texture unit with TA and TF blocks and L1 / L2 caches, plus a timeline interleaving Math and Tex operations.]
Fragment Data
MSAA
CSAA
AA with HDR
Color & Z Compression (2x)
[Figure: Frame Buffer Partition with Z Comp, C Comp, Z ROP, and C ROP blocks.]
Z-Cull
Early-Z
[Figure: Z-Cull and Early-Z, each labeled "Pixel Killed", shown alongside Shader, Texture, Z-Check, and Memory.]
3DMark05 Futuremark
Anti-Aliasing Fundamentals
Pixels in an image need to represent finite areas from the scene description, not just infinitely small sample points
Ideal anti-aliasing would be a smooth convolution over the pixel area
Too expensive for general scenes
Anti-Aliasing
Anti-aliasing properly accounts for the contribution of all the primitives that intersect a pixel
Supersampling
Store a color and depth value for each sub-sample at every pixel. Calculate unique color and depth values at every sub-sample for every pixel of every drawn triangle
Strengths
Robust image quality by brute force
Weaknesses
Inefficient: expensive pixel shaders and texture fetches are executed for every sub-sample; wasteful because color results within a pixel are nearly identical
Weaknesses
Memory footprint N times larger than 1x
Expensive to extend to 8x quality and beyond
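Supersampling as described above can be sketched by shading every sub-sample of a pixel and averaging the results. This is a minimal illustration, not any GPU's actual method. The regular sub-sample grid, the function names, and the test scene (a vertical edge at x = 0.3 with white on the left) are all assumptions.

```python
# Supersampling sketch: shade every sub-sample, then average per pixel.
# ASSUMPTION: regular n x n sub-sample grid and a made-up scene (vertical
# edge at x = 0.3); real hardware uses other sample patterns.

def supersample_pixel(shade, px, py, n):
    """Average n*n shader results on a regular grid inside pixel (px, py).

    Returns (color, shader_invocations) to make the cost visible.
    """
    total = 0.0
    invocations = 0
    for j in range(n):
        for i in range(n):
            x = px + (i + 0.5) / n   # sub-sample center, x
            y = py + (j + 0.5) / n   # sub-sample center, y
            total += shade(x, y)
            invocations += 1
    return total / (n * n), invocations


def edge_scene(x, y):
    """Hypothetical scene: white (1.0) left of x = 0.3, black (0.0) elsewhere."""
    return 1.0 if x < 0.3 else 0.0


color_1x, cost_1x = supersample_pixel(edge_scene, 0, 0, 1)
color_4x, cost_4x = supersample_pixel(edge_scene, 0, 0, 2)
color_16x, cost_16x = supersample_pixel(edge_scene, 0, 0, 4)

print(color_1x, cost_1x)    # one sample at the pixel center misses the edge
print(color_4x, cost_4x)
print(color_16x, cost_16x)  # closer to the true coverage of 0.3
```

The shader runs once per sub-sample, so both shading cost and stored samples grow linearly with the sub-sample count, which is the inefficiency the slide points out.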
[Table: samples per AA mode. Columns: 4x, 8x, 8xQ, 16x, 16xQ. Rows: Texture/Shader Samples, Stored Color/Z Samples, Coverage Samples. Surviving values: 16, 16.]
Half Life 2
NVIDIA 4x
NVIDIA 16xQ
FEAR - Monolith
FEAR 4x
FEAR - Monolith
FEAR 16x
FEAR - Monolith
Questions?