GeForce 7800 GTX
Board Details
SLI connector
Single-slot cooling
sVideo TV out
DVI x 2
16x PCI-Express
256 MB / 256-bit GDDR3 at 600 MHz
8 pieces of 8Mx32
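The memory figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a worked example; it assumes the 600 MHz memory clock moves data on both clock edges (double data rate), which the slide does not state explicitly.

```python
# Worked arithmetic for the board memory figures quoted on the slide.
CHIPS = 8                     # "8 pieces"
WORDS_PER_CHIP = 8 * 2**20    # "8M" words per chip
BITS_PER_WORD = 32            # "x32" data width per chip

# Total capacity in MB and total bus width in bits.
capacity_mb = CHIPS * WORDS_PER_CHIP * BITS_PER_WORD // 8 // 2**20
bus_width_bits = CHIPS * BITS_PER_WORD

MEMORY_CLOCK_HZ = 600e6
TRANSFERS_PER_CLOCK = 2       # assumption: double data rate

# Peak bandwidth in GB/s (10^9 bytes per second).
bandwidth_gb_s = MEMORY_CLOCK_HZ * TRANSFERS_PER_CLOCK * bus_width_bits / 8 / 1e9

print(capacity_mb, "MB,", bus_width_bits, "bit,", round(bandwidth_gb_s, 1), "GB/s")
```

Under that assumption the eight chips account for both the 256 MB capacity and the 256-bit interface.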
GeForce 7800 GTX
GPU Details
302 million transistors
430 MHz core clock
256-bit memory interface
Notable Functionality
• Non-power-of-two textures with mipmaps
• Floating-point (fp16) blending and filtering
• sRGB color space texture filtering and frame buffer blending
• Vertex textures
• 16x anisotropic texture filtering
• Dynamic vertex and fragment branching
• Double-rate depth/stencil-only rendering
• Early depth/stencil culling
• Transparency antialiasing
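Why sRGB-aware filtering and blending matter can be shown with two texels. This is a minimal sketch using the standard sRGB transfer function; the function names are illustrative and are not taken from any GPU API.

```python
def srgb_to_linear(c):
    """Decode an sRGB-encoded value in [0, 1] to linear light."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(l):
    """Encode a linear-light value in [0, 1] back to sRGB."""
    return 12.92 * l if l <= 0.0031308 else 1.055 * l ** (1 / 2.4) - 0.055

def filter_naive(a, b):
    """Average the encoded values directly (not sRGB-aware)."""
    return (a + b) / 2

def filter_srgb_aware(a, b):
    """Decode, average in linear light, then re-encode."""
    return linear_to_srgb((srgb_to_linear(a) + srgb_to_linear(b)) / 2)

black, white = 0.0, 1.0
naive = filter_naive(black, white)
aware = filter_srgb_aware(black, white)
print(naive, round(aware, 3))
```

The sRGB-aware result is noticeably brighter than the naive average, which is why filtering and blending in hardware need to know about the color space.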
GeForce 7800 GTX
Parallelism
8 Vertex Engines
Z-Cull, Triangle Setup/Raster
Shader Instruction Dispatch
24 Fragment Shaders
Fragment Crossbar
16 Raster Operation Pipelines
4 Memory Partitions
GeForce Graphics Pipeline
Separate dedicated units
Pipeline diagram: CPU → Vertex Engine → Setup → Raster → Fragment Shader → Raster Ops → Frame Buffer (side units: Z Cull, Texture)
GeForce Graphics Pipeline
Vertex Engine
Vertex pulling
Vector floating-point instructions
Dynamic branching
Vertex texture
Vertex stream frequency
GeForce Graphics Pipeline
Setup
Prepare triangle for rasterization
215M triangles/sec setup
GeForce Graphics Pipeline
Raster
Compute coverage
Points, lines, and triangles
Rotated grid multisampling
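One way to see the benefit of a rotated grid is to count how many distinct coverage levels a near-vertical or near-horizontal edge can produce. The sample positions below are illustrative assumptions; the slides do not give the hardware's actual offsets.

```python
# Sample offsets inside a unit pixel (both patterns are assumed for illustration).
ORDERED_GRID = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
ROTATED_GRID = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def distinct_positions(samples, axis):
    """Number of distinct sample coordinates along one axis (0 = x, 1 = y)."""
    return len({s[axis] for s in samples})

def covered_fraction(samples, edge_x):
    """Fraction of samples lying left of a vertical edge at x = edge_x."""
    return sum(1 for x, _ in samples if x < edge_x) / len(samples)

for name, grid in (("ordered", ORDERED_GRID), ("rotated", ROTATED_GRID)):
    print(name, distinct_positions(grid, 0), covered_fraction(grid, 0.3))
```

With four distinct x and y positions, the rotated pattern gives finer coverage steps for the edges where aliasing is most visible.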
GeForce Graphics Pipeline
Z Cull
Discard fragments early based on Z
Up to 64 pixels/clock
Multisampled: 256 samples/clock
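The idea of discarding fragments early can be sketched as a coarse per-tile depth test. The slides do not describe the hardware algorithm, so the tile size, the class name, and the "less than" depth test below are all assumptions made for illustration.

```python
TILE = 8  # hypothetical tile edge in pixels (8 x 8 = 64 pixels)

class ZCullTile:
    """Keeps a conservative farthest depth for one screen tile."""

    def __init__(self):
        self.depth = [1.0] * (TILE * TILE)  # per-pixel depth, cleared to the far plane
        self.max_depth = 1.0                # farthest depth currently stored in the tile

    def try_reject(self, nearest_incoming_z):
        """True if every incoming fragment would fail a 'less than' depth test."""
        return nearest_incoming_z >= self.max_depth

    def write(self, z_values):
        """Per-pixel depth test and update; returns how many fragments passed."""
        passed = 0
        for i, z in enumerate(z_values):
            if z < self.depth[i]:
                self.depth[i] = z
                passed += 1
        self.max_depth = max(self.depth)
        return passed

tile = ZCullTile()
passed_near = tile.write([0.3] * (TILE * TILE))  # a near surface covers the tile
rejected = tile.try_reject(0.7)                  # farther geometry is culled as a block
kept = tile.try_reject(0.1)                      # nearer geometry must still be shaded
print(passed_near, rejected, kept)
```

Note that the quoted 256 samples/clock equals 64 pixels/clock times 4 samples per pixel, which is consistent with 4x multisampling.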
GeForce Graphics Pipeline
Fragment Shader
User-programmed fragment coloring
Dynamic branching
Long shaders
Multiple render targets
fp16 and fp32 vectors
GeForce Graphics Pipeline
Texture
fp16 and sRGB filtering
16x anisotropic filtering
Non-power-of-two mipmapping
Shadow maps, cube maps, and 3D
Floating-point textures
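Non-power-of-two mipmapping needs a rule for halving odd dimensions. The sketch below assumes the round-down convention used by OpenGL for non-power-of-two textures; the function name is illustrative.

```python
def mip_chain(width, height):
    """List the (width, height) of every mipmap level down to 1 x 1."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        width = max(1, width // 2)    # round down, but never below one texel
        height = max(1, height // 2)
        levels.append((width, height))
    return levels

chain = mip_chain(640, 480)
for level, size in enumerate(chain):
    print(level, size)
```

Odd sizes such as 15 texels drop to 7 at the next level, so texels no longer map 2-to-1 between levels as they do for power-of-two textures.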
GeForce Graphics Pipeline
Raster Operations
2x and 4x multisampling
fp16 and sRGB blending
Multiple render targets
Color and depth compression
Double-speed depth/stencil only
Single GeForce 7800
Vertex Unit
Diagram blocks: Primitive Assembly + Attribute Processing; Vertex Texture Fetch; FP32 Scalar Unit; FP32 Vector Unit; Texture Cache; Branch Unit; Viewport Processing; To Setup
Vertex Processing Engine
• MIMD Architecture
• Dual Issue
• Low-penalty branching
• Shader Model 3.0
• 32 vector registers
• 512 static instructions per program
• Indexed input and output registers
Vertex Texture Fetch
• Non-stalling
• Up to 4 texture units
• Unlimited fetches
• Mipmapping, no filtering
Vertex Texturing Example
Flat tessellated mesh + height field texture → Vertex Program → Displaced mesh
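The displacement idea can be sketched on the CPU: each vertex of a flat grid looks up a height and moves along the up axis. This is an illustrative stand-in for a vertex program, not shader code; the function names, the tiny height field, and the scale factor are assumptions. Point sampling is used because vertex texture fetch on this hardware offers no filtering.

```python
def fetch_height(texture, u, v):
    """Point-sample a height field at normalized (u, v), without filtering."""
    rows, cols = len(texture), len(texture[0])
    x = min(int(u * cols), cols - 1)
    y = min(int(v * rows), rows - 1)
    return texture[y][x]

def vertex_program(position, uv, texture, scale):
    """Displace one vertex along +Y by the fetched height."""
    x, y, z = position
    return (x, y + scale * fetch_height(texture, uv[0], uv[1]), z)

def flat_mesh(n):
    """(n + 1) x (n + 1) vertices of a flat unit grid in the XZ plane, with UVs."""
    return [((i / n, 0.0, j / n), (i / n, j / n))
            for j in range(n + 1) for i in range(n + 1)]

HEIGHT_FIELD = [[0.0, 0.5],
                [0.5, 1.0]]   # hypothetical 2 x 2 height texture

displaced = [vertex_program(p, uv, HEIGHT_FIELD, scale=2.0)
             for p, uv in flat_mesh(4)]
print(displaced[0], displaced[-1])
```

The mesh topology never changes; only the vertex positions do, which is why the same flat mesh can be reused with different height textures.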
Vertex Textures for Dynamic
Displacement Mapping
Without Vertex Textures With Vertex Textures
Images used with permission from Pacific Fighters. © 2004 Developed by 1C:Maddox Games.
All rights reserved. © 2004 Ubi Soft Entertainment.
Vertex Textures to Drive
Particle Systems
◼ Render-to-texture
Simulation runs in floating-point frame buffer, also usable as texture
◼ Vertex textures
Determines particle location with vertex texture fetch
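The two steps can be sketched as a ping-pong loop: a simulation pass writes new particle state into one buffer while reading the other, and a vertex pass then reads positions from the result. Plain Python lists stand in for floating-point textures here, and the physics (gravity with a bounce at y = 0) is an assumption chosen only to make the example concrete.

```python
def simulation_pass(src, dt, gravity=-9.8):
    """Render-to-texture step: read each texel of src, write the updated texel."""
    dst = []
    for x, y, z, vy in src:       # one texel per particle: position + vertical velocity
        vy += gravity * dt
        y += vy * dt
        if y < 0.0:               # bounce off the ground plane
            y, vy = -y, -vy
        dst.append((x, y, z, vy))
    return dst

def vertex_pass(state_texture):
    """Vertex i fetches texel i to decide where particle i is drawn."""
    return [(x, y, z) for x, y, z, _ in state_texture]

# Two buffers: one is read as a texture while the other is the render target.
buffers = [[(0.0, 1.0, 0.0, 0.0), (2.0, 3.0, 1.0, 0.0)], None]
for frame in range(3):
    src, dst = frame % 2, 1 - frame % 2
    buffers[dst] = simulation_pass(buffers[src], dt=0.1)
positions = vertex_pass(buffers[dst])
print(positions)
```

The particle data never has to return to the CPU between the simulation and the draw.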
Single GeForce 7800
Fragment Shader Pipeline
Diagram blocks: Input Fragment Data; Texture Data; Texture Processor; FP32 Shader Unit 1; FP32 Shader Unit 2; Texture Cache; Branch Processor; Fixed-function Fog Unit; Output Shaded Fragments
Texture Processor
16 texture units
1 texture fetch at full speed
Bilinear or tri-linear filtering
16x anisotropic filtering
Floating-point (fp16) texture filtering
Shader Unit 1
4 MULs + RCP
Dual Issue
Texture address calculation
Fast fp16 normalize
Free: negate, abs, condition codes
Shader Unit 2
4 MADs or DP4
Dual Issue
Free: negate, abs, condition codes
Operations Per Fragment Shader Pass
Shader Unit 1: 4 components, 1 op / component, 4 ops / fragment per pass
or 1 texture / fragment at full speed per pass
Shader Unit 2: 4 components, 1 op / component, 4 ops / fragment per pass
Total: 8 operations / fragment per pass
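Combining the per-pass figure with the unit count and clock from the earlier slides gives a rough peak rate. This is only back-of-the-envelope arithmetic; it assumes every fragment shader completes one pass per clock and counts each component operation once.

```python
COMPONENTS = 4
OPS_PER_COMPONENT = 1
SHADER_UNITS = 2
ops_per_fragment_pass = COMPONENTS * OPS_PER_COMPONENT * SHADER_UNITS

FRAGMENT_SHADERS = 24    # from the Parallelism slide
CORE_CLOCK_HZ = 430e6    # from the GPU Details slide

# Assumption: one pass per fragment shader per clock.
peak_component_ops_per_s = ops_per_fragment_pass * FRAGMENT_SHADERS * CORE_CLOCK_HZ
print(ops_per_fragment_pass, round(peak_component_ops_per_s / 1e9, 2), "G component ops/s")
```

Real shaders will fall short of this whenever a pass cannot fill both shader units or stalls on texture fetches.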
Fragment Shader
Component Co-issue
◼ Use 4 components various ways
RGBA all together
RGB and A
RG and BA
◼ Both shader units
◼ Two operations per shader unit
Shader Unit 1 (R G B A): Operation 1, Operation 2
Shader Unit 2 (R G B A): Operation 3, Operation 4
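Co-issue can be pictured as splitting the four components of a register between two operations issued together. The sketch below models that split in software; the `co_issue` helper and the particular operations are hypothetical and chosen only for illustration.

```python
def co_issue(pixel, split, op_a, op_b):
    """Apply op_a to the first `split` components and op_b to the remaining ones."""
    return (tuple(op_a(c) for c in pixel[:split])
            + tuple(op_b(c) for c in pixel[split:]))

pixel = (0.2, 0.4, 0.6, 0.8)  # R, G, B, A

# Shader unit 1: a 3 + 1 split (RGB and A), operations 1 and 2.
after_unit1 = co_issue(pixel, 3, lambda c: c * 0.5, lambda a: a)

# Shader unit 2: a 2 + 2 split, operations 3 and 4.
after_unit2 = co_issue(after_unit1, 2, lambda c: c + 1.0, lambda c: 1.0 - c)

print(after_unit1, after_unit2)
```

Four different operations touch the pixel in a single pass, two per shader unit.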
Single GeForce 7800
Raster Operations Pipeline
Diagram blocks: Input Shaded Fragment Data; Pixel Crossbar Interconnect; Multisample Antialiasing; Depth Compression and Color Compression; Depth Raster Operations and Color Raster Operations; Memory Frame Buffer Partition
Functionality
• OpenEXR floating-point blending
• sRGB blending
• 4x rotated grid multisampling
• Lossless color and depth compression
• Multiple render targets
GeForce 7800
Transparency Antialiasing
Image comparison: conventional 4x antialiasing with alpha-tested content vs. transparency antialiasing with alpha-tested content
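The difference can be sketched for a single pixel crossed by an alpha-tested edge. The slides do not describe the hardware algorithm; the version below assumes a supersampling-style approach in which the alpha test runs once per sample rather than once per pixel, and the sample positions and alpha function are made up for illustration.

```python
ALPHA_REF = 0.5
# Assumed rotated-grid sample offsets inside the pixel.
SAMPLES = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]

def alpha_at(x, y):
    """Hypothetical alpha texture: opaque left of x = 0.3, transparent elsewhere."""
    return 1.0 if x < 0.3 else 0.0

def conventional_coverage(px, py):
    """Alpha test evaluated once at the pixel center: all or nothing."""
    return 1.0 if alpha_at(px + 0.5, py + 0.5) >= ALPHA_REF else 0.0

def transparency_coverage(px, py):
    """Alpha test evaluated at every sample: fractional coverage."""
    passed = sum(1 for sx, sy in SAMPLES
                 if alpha_at(px + sx, py + sy) >= ALPHA_REF)
    return passed / len(SAMPLES)

print(conventional_coverage(0, 0), transparency_coverage(0, 0))
```

Multisampling alone cannot help here, because the edge comes from the alpha test inside the triangle rather than from the triangle's own boundary.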
Scalable Link Interface (SLI)
◼ Gang two GeForce 6600, 6800, or 7800
graphics boards together
Can almost double your performance
Pictured: two 6800 Ultras joined by the SLI connector
SLI Rendering Modes
◼ Split Frame Rendering (SFR)
One GPU renders top of screen; other renders the bottom
Scales fragment processing but not vertex processing
◼ Alternate Frame Rendering (AFR)
Scales both vertex and fragment processing
Adds frame latency
Rendering must be free of CPU synchronization
◼ SLI Antialiasing: SLI8x and SLI16x
Better antialiasing quality rather than performance
Each card renders with slightly different sub-pixel offset
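The three modes differ mainly in how work is assigned to the two GPUs. The helpers below are a minimal sketch with hypothetical names; the fixed half-and-half split and the plain average used to combine the antialiasing images are simplifying assumptions.

```python
def sfr_gpu(scanline, height, split=0.5):
    """Split Frame Rendering: GPU 0 takes the top of the screen, GPU 1 the bottom."""
    return 0 if scanline < height * split else 1

def afr_gpu(frame):
    """Alternate Frame Rendering: even frames on GPU 0, odd frames on GPU 1."""
    return frame % 2

def sli_aa_resolve(image_a, image_b):
    """SLI antialiasing: average two renders made with different sub-pixel offsets."""
    return [(a + b) / 2 for a, b in zip(image_a, image_b)]

print(sfr_gpu(100, 1080), sfr_gpu(900, 1080))
print([afr_gpu(f) for f in range(4)])
print(sli_aa_resolve([0.0, 1.0], [1.0, 1.0]))
```

In the split-frame case both GPUs still process every vertex, which is why that mode scales fragment work but not vertex work.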
Current High-end “Fermi” GPU
◼ Current high-end graphics card
◼ 512 graphics “cores”
◼ 1.5 GB memory
◼ System power: 600W
◼ OpenGL 4.2 / DirectX 11 functionality
High-level “Fermi” Architecture
◼ GF100
◼ Four Graphics Processing Clusters (GPCs)
Each is a self-contained graphics pipeline
Smaller chips have fewer GPCs
◼ Shared L2 cache
◼ 6 Memory Controllers
1.5 GB
Inside Each
Graphics Processing Cluster
◼ Raster engine
◼ Four SMs
Streaming Multiprocessor
◼ Texture fetch resources
◼ Tessellation and vertex processing resources
Polymorph Engine
Streaming
Multiprocessor (SM)
◼ Multi-processor execution unit
32 scalar processor cores
Warp is a unit of thread execution of up to 32 threads
◼ Two workloads
Graphics
◼ Vertex shader
◼ Tessellation
◼ Geometry shader
◼ Fragment shader
Compute
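The core count and the warp size fit together with simple arithmetic. The sketch below multiplies out the hierarchy from these slides and shows how a thread count rounds up to whole warps; the helper names are illustrative.

```python
import math

GPCS = 4            # Graphics Processing Clusters
SMS_PER_GPC = 4     # Streaming Multiprocessors per cluster
CORES_PER_SM = 32   # scalar processor cores per SM
WARP_SIZE = 32      # threads per warp

total_cores = GPCS * SMS_PER_GPC * CORES_PER_SM

def warps_needed(thread_count):
    """Threads are scheduled in whole warps, so round up."""
    return math.ceil(thread_count / WARP_SIZE)

def idle_lanes(thread_count):
    """Lanes left unused in the final, partially filled warp."""
    return warps_needed(thread_count) * WARP_SIZE - thread_count

print(total_cores, warps_needed(100), idle_lanes(100))
```

The product matches the 512 graphics "cores" quoted for the full chip.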
OpenGL Pipeline Programmable
Domains run on Unified Hardware
◼ Unified Streaming Processor Array (SPA) architecture
means same capabilities for all domains
Plus tessellation + compute (not shown below)
Pipeline diagram: GPU Front End → Vertex Assembly → Vertex Program → Primitive Assembly → Primitive Program → Clipping, Setup, and Rasterization → Fragment Program → Raster Operations
The Vertex, Primitive, and Fragment Programs can be unified hardware!
Memory Interface: Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access
Dual Warp Scheduling
32 threads launch!
Shader or CUDA Core,
Same Unit but Two Personalities
◼ Execution unit
Scalar floating-point
Scalar integer
Levels of Caching in Fermi GPU
◼ 12 KB L1 Texture cache
Per texture unit
◼ SM 64 KB cache
Split into dedicated 16 KB or 48 KB Load/Store cache
Shared memory 48 KB or 16 KB
◼ L2 unifies the texture cache, raster operation cache, and internal buffering of the prior generation
768 KB
Read / write
Fully coherent
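The configurable SM storage amounts to choosing between two splits of the same 64 KB. The helper below is a hypothetical sketch of how one might pick a split from a kernel's shared memory needs; the selection policy is an assumption, not something described on the slide.

```python
SM_STORAGE_KB = 64
# (load/store cache KB, shared memory KB): the two splits listed on the slide.
SPLITS = [(48, 16), (16, 48)]

def pick_split(shared_kb_needed):
    """Prefer the larger cache unless the kernel needs more than 16 KB of shared memory."""
    if shared_kb_needed > 48:
        raise ValueError("request exceeds the largest shared memory configuration")
    return (16, 48) if shared_kb_needed > 16 else (48, 16)

for need in (8, 32):
    cache_kb, shared_kb = pick_split(need)
    print(need, "KB needed ->", cache_kb, "KB cache,", shared_kb, "KB shared")
```

Either way the two parts always add up to the same 64 KB per SM.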
Cache Use Strategies
in Fermi GPU
◼ Pipeline stages can communicate efficiently through
GPU’s L1 and L2 caches
Buffering between stages stays all on chip
Only vertex, texel, and pixel read/writes need to go to DRAM
Vertex and Tessellation
Processing Tasks
◼ Fixed-function graphics engines
Pull attributes and assemble vertex
Manage tessellation control and domain shader evaluation
Viewport transform
Attribute setup of plane equations for rasterization
Stream out vertices into buffers
Rasterization Tasks
◼ Turns primitives into fragments
Computes edge equations
Two-stage rasterization
◼ Coarse raster finds tiles the primitive could be in
◼ Fine raster evaluates sample positions within tiles
Zcull efficiently eliminates occluded fragments
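The edge-equation and two-stage ideas can be sketched together. This is a simplified software model, not the hardware algorithm: the coarse stage here is just a bounding-box test over tiles, one sample per pixel is taken at the pixel center, and all names and sizes are assumptions.

```python
def edge(a, b, p):
    """Edge equation: non-negative when p is on the inner side of edge a -> b (CCW)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def inside(tri, p):
    a, b, c = tri
    return edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0

def coarse_raster(tri, width, height, tile):
    """Conservatively list tiles the primitive could be in (bounding-box test)."""
    xs, ys = [v[0] for v in tri], [v[1] for v in tri]
    return [(tx, ty)
            for ty in range(0, height, tile)
            for tx in range(0, width, tile)
            if tx < max(xs) and tx + tile > min(xs)
            and ty < max(ys) and ty + tile > min(ys)]

def fine_raster(tri, tiles, tile):
    """Evaluate the edge equations at each pixel center inside the candidate tiles."""
    return [(x, y)
            for tx, ty in tiles
            for y in range(ty, ty + tile)
            for x in range(tx, tx + tile)
            if inside(tri, (x + 0.5, y + 0.5))]

TRIANGLE = ((1.0, 1.0), (7.0, 1.0), (1.0, 7.0))  # counter-clockwise
tiles = coarse_raster(TRIANGLE, 16, 16, 4)
fragments = fine_raster(TRIANGLE, tiles, 4)
print(len(tiles), "of 16 tiles kept,", len(fragments), "fragments")
```

The coarse stage lets the fine stage skip most of the screen before any per-sample work is done.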
Base Input Mesh
Apply Phong Tessellation
Add Displacement Mapping
From Metro 2033, © THQ and 4A Games
GPUs as Compute Nodes
◼ Architecture of GPU has evolved into a high-performance, high-bandwidth compute node
Pictured form factors: small form factor compute; integrated CPU-GPU servers & blades; OEM CPU server + compute 1U; workstations with 2 to 4 Tesla GPUs
Compute Programming Model
◼ Cooperative Thread Array (CTA)
Single Program, Multiple Data
Organized in 1D, 2D, or 3D
◼ Programming APIs
CUDA, OpenCL, DirectCompute
◼ APIs + language = parallel processing system
OpenGL or Direct3D through shaders
◼ Cg, HLSL, GLSL
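The single-program, multiple-data model can be imitated on the CPU: every thread runs the same function and uses its CTA index and thread index to find its own data. The `launch` helper below is a hypothetical, sequential stand-in for a real parallel launch and does not mirror any specific API.

```python
from itertools import product

def launch(grid_dim, cta_dim, program):
    """Run `program` once per thread of every CTA (sequentially, for illustration)."""
    for cta_id in product(*(range(n) for n in grid_dim)):
        for thread_id in product(*(range(n) for n in cta_dim)):
            program(cta_id, thread_id, cta_dim)

WIDTH, HEIGHT = 8, 4
image = [[None] * WIDTH for _ in range(HEIGHT)]

def fill(cta_id, thread_id, cta_dim):
    """Same program for every thread; only the computed (x, y) differs."""
    x = cta_id[0] * cta_dim[0] + thread_id[0]
    y = cta_id[1] * cta_dim[1] + thread_id[1]
    image[y][x] = y * WIDTH + x

# A 2 x 2 grid of CTAs, each organized as 4 x 2 threads, covers the 8 x 4 image.
launch(grid_dim=(2, 2), cta_dim=(4, 2), program=fill)
print(image)
```

The same indexing pattern extends to 1D or 3D by changing the length of the dimension tuples.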