
HOW THINGS WORK

How GPUs Work

David Luebke, NVIDIA Research
Greg Humphreys, University of Virginia
Computer, February 2007, p. 96

GPUs have moved away from the traditional fixed-function 3D graphics pipeline toward a flexible general-purpose computational engine.

In the early 1990s, ubiquitous interactive 3D graphics was still the stuff of science fiction. By the end of the decade, nearly every new computer contained a graphics processing unit (GPU) dedicated to providing a high-performance, visually rich, interactive 3D experience.

This dramatic shift was the inevitable consequence of consumer demand for videogames, advances in manufacturing technology, and the exploitation of the inherent parallelism in the feed-forward graphics pipeline. Today, the raw computational power of a GPU dwarfs that of the most powerful CPU, and the gap is steadily widening.

Furthermore, GPUs have moved away from the traditional fixed-function 3D graphics pipeline toward a flexible general-purpose computational engine. Today, GPUs can implement many parallel algorithms directly using graphics hardware. Well-suited algorithms that leverage all the underlying computational horsepower often achieve tremendous speedups. Truly, the GPU is the first widely deployed commodity desktop parallel computer.

THE GRAPHICS PIPELINE

The task of any 3D graphics system is to synthesize an image from a description of a scene, 60 times per second for real-time graphics such as videogames. This scene contains the geometric primitives to be viewed as well as descriptions of the lights illuminating the scene, the way that each object reflects light, and the viewer’s position and orientation.

GPU designers traditionally have expressed this image-synthesis process as a hardware pipeline of specialized stages. Here, we provide a high-level overview of the classic graphics pipeline; our goal is to highlight those aspects of the real-time rendering calculation that allow graphics application developers to exploit modern GPUs as general-purpose parallel computation engines.

Pipeline input

Most real-time graphics systems assume that everything is made of triangles, and they first carve up any more complex shapes, such as quadrilaterals or curved surface patches, into triangles. The developer uses a computer graphics library (such as OpenGL or Direct3D) to provide each triangle to the graphics pipeline one vertex at a time; the GPU assembles vertices into triangles as needed.

Model transformations

A GPU can specify each logical object in a scene in its own locally defined coordinate system, which is convenient for objects that are naturally defined hierarchically. This convenience comes at a price: before rendering, the GPU must first transform all objects into a common coordinate system. To ensure that triangles aren’t warped or twisted into curved shapes, this transformation is limited to simple affine operations such as rotations, translations, scalings, and the like.

As the “Homogeneous Coordinates” sidebar explains, by representing each vertex in homogeneous coordinates, the graphics system can perform the entire hierarchy of transformations simultaneously with a single matrix-vector multiply. The need for efficient hardware to perform floating-point vector arithmetic for millions of vertices each second has helped drive the GPU parallel-computing revolution.

The output of this stage of the pipeline is a stream of triangles, all expressed in a common 3D coordinate system in which the viewer is located at the origin, and the direction of view is aligned with the z-axis.

Lighting

Once each triangle is in a global coordinate system, the GPU can compute its color based on the lights in the scene. As an example, we describe the calculations for a single-point light source (imagine a very small lightbulb). The GPU handles multiple lights by summing the contributions of each individual light. The traditional graphics pipeline supports the Phong lighting equation (B-T. Phong, “Illumination for Computer-Generated Images,” Comm. ACM, June 1975, pp. 311-317), a phenomenological appearance model that approximates the look of plastic. These materials combine a dull diffuse base with a shiny specular highlight. The Phong lighting equation gives the output color

$$C = K_d L_i (N \cdot L) + K_s L_i (R \cdot V)^s.$$

Table 1 defines each term in the equation. The mathematics here isn’t as important as the computation’s structure; to evaluate this equation efficiently, GPUs must again operate directly on vectors. In this case, we repeatedly evaluate the dot product of two vectors, performing a four-component multiply-and-add operation.

Camera simulation

The graphics pipeline next projects each colored 3D triangle onto the virtual camera’s film plane. Like the model transformations, the GPU does this using matrix-vector multiplication, again leveraging efficient vector operations in hardware. This stage’s output is a stream of triangles in screen coordinates, ready to be turned into pixels.

Homogeneous Coordinates

Points in three dimensions are typically represented as a triple (x, y, z). In computer graphics, however, it’s frequently useful to add a fourth coordinate, w, to the point representation. To convert a point to this new representation, we set w = 1. To recover the original point, we apply the transformation (x, y, z, w) → (x/w, y/w, z/w).

Although at first glance this might seem like needless complexity, it has several significant advantages. As a simple example, we can use the otherwise undefined point (x, y, z, 0) to represent the direction vector (x, y, z). With this unified representation for points and vectors in place, we can also perform several useful transformations such as simple matrix-vector multiplies that would otherwise be impossible. For example, the multiplication

$$
\begin{bmatrix}
1 & 0 & 0 & \Delta x \\
0 & 1 & 0 & \Delta y \\
0 & 0 & 1 & \Delta z \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}
$$

can accomplish translation by an amount Δx, Δy, Δz.

Furthermore, these matrices can encode useful nonlinear transformations such as perspective foreshortening.
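
The homogeneous-coordinate operations described in the sidebar can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (mat_vec, mat_mul, translation, to_cartesian) are hypothetical, and the perspective matrix is a simplified pinhole projection chosen for the example, not a matrix given in the article. The sketch shows that a translation moves a point (w = 1) but leaves a direction (w = 0) unchanged, that a chain of transformations collapses into one matrix, and that dividing by w produces foreshortening.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def mat_mul(a, b):
    """Compose two 4x4 matrices so that mat_mul(a, b) applies b first, then a."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def translation(dx, dy, dz):
    """Translation matrix from the sidebar."""
    return [
        [1.0, 0.0, 0.0, dx],
        [0.0, 1.0, 0.0, dy],
        [0.0, 0.0, 1.0, dz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def to_cartesian(v):
    """Recover (x/w, y/w, z/w) from a homogeneous point with w != 0."""
    x, y, z, w = v
    return (x / w, y / w, z / w)


point = [1.0, 2.0, 3.0, 1.0]      # w = 1 marks a position
direction = [1.0, 2.0, 3.0, 0.0]  # w = 0 marks a direction

t = translation(10.0, 0.0, -5.0)
moved_point = mat_vec(t, point)          # the position is shifted
moved_direction = mat_vec(t, direction)  # the direction is unaffected

# A hierarchy of transformations collapses into a single matrix.
combined = mat_mul(translation(1.0, 0.0, 0.0), translation(0.0, 2.0, 0.0))
origin_moved = mat_vec(combined, [0.0, 0.0, 0.0, 1.0])

# Assumed simplified pinhole projection: copies z into w, so the divide
# by w shrinks x and y in proportion to distance.
perspective = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
near = to_cartesian(mat_vec(perspective, [2.0, 4.0, 2.0, 1.0]))
far = to_cartesian(mat_vec(perspective, [2.0, 4.0, 4.0, 1.0]))
```

Here the point with the same x and y but twice the depth lands at half the screen-space offset, which is the foreshortening effect the sidebar mentions.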

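The Phong lighting equation from the Lighting section can likewise be evaluated with a short sketch. Several details here are assumptions that the article does not spell out: the function names are hypothetical, colors are RGB triples, the reflected vector is computed as R = 2(N · L)N − L, and negative dot products are clamped to zero so that surfaces facing away from the light receive no contribution. Multiple lights are handled by summing, as the article describes.

```python
import math


def dot(a, b):
    """Dot product: the multiply-and-add at the core of the equation."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a nonzero vector to unit length."""
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def reflect(l, n):
    """Mirror unit vector-to-light l about unit normal n (assumed formula)."""
    d = dot(n, l)
    return tuple(2.0 * d * nc - lc for nc, lc in zip(n, l))


def phong(kd, ks, li, n, l, v, s):
    """C = Kd*Li*(N.L) + Ks*Li*(R.V)^s for one light, per color channel."""
    n, l, v = normalize(n), normalize(l), normalize(v)
    r = reflect(l, n)
    diffuse = max(dot(n, l), 0.0)        # clamping is an assumption
    specular = max(dot(r, v), 0.0) ** s  # s is the "shininess" exponent
    return tuple(kd_c * li_c * diffuse + ks_c * li_c * specular
                 for kd_c, ks_c, li_c in zip(kd, ks, li))


def shade(kd, ks, lights, n, v, s):
    """Sum the contribution of each (light color, vector to light) pair."""
    total = [0.0, 0.0, 0.0]
    for li, l in lights:
        contribution = phong(kd, ks, li, n, l, v, s)
        total = [t + c for t, c in zip(total, contribution)]
    return tuple(total)


kd = (0.5, 0.0, 0.0)      # dull red diffuse base
ks = (0.25, 0.25, 0.25)   # gray specular highlight
white = (1.0, 1.0, 1.0)
up = (0.0, 0.0, 1.0)

# Light and viewer both directly above the surface: full diffuse and specular.
head_on = phong(kd, ks, white, up, up, up, 32)

# Light directly behind the surface: no contribution after clamping.
behind = phong(kd, ks, white, up, (0.0, 0.0, -1.0), up, 32)

# Two identical lights simply double the result.
two_lights = shade(kd, ks, [(white, up), (white, up)], up, up, 32)
```

Each call reduces to a handful of dot products followed by multiply-and-add operations, which is the computational structure the article singles out.
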
Table 1. Phong lighting equation terms.

Term   Meaning
K_d    Diffuse color
L_i    Light color
N      Surface normal
L      Vector to light
K_s    Specular color
R      Reflected light vector
V      Vector to camera
s      “Shininess”

Rasterization

Each visible screen-space triangle overlaps some pixels on the display; determining these pixels is called rasterization. GPU designers have incorporated many rasterization algorithms over the years, which all exploit one crucial observation: Each pixel can be treated independently from all other pixels. Therefore, the machine can handle all pixels in parallel; indeed, some exotic machines have had a processor for each pixel. This inherent independence has led GPU designers to build increasingly parallel sets of pipelines.

Texturing

The actual color of each pixel can be taken directly from the lighting calculations, but for added realism, images called textures are often draped over the geometry to give the illusion of detail. GPUs store these textures in high-speed memory, which each pixel calculation must access to determine or modify that pixel’s color. In practice, the GPU might require multiple texture accesses per pixel to mitigate visual artifacts that can result when textures appear either smaller or larger on screen than their native resolution. Because the access pattern to texture memory is typically very regular (nearby pixels tend to access nearby texture image locations), specialized cache designs help hide the latency of memory accesses.

Hidden surfaces

In most scenes, some objects obscure other objects. If each pixel were simply written to display memory, the most recently submitted triangle would appear to be in front. Thus, correct hidden surface removal would require sorting all triangles from back to front for each view, an expensive operation that isn’t even always possible for all scenes.

All modern GPUs provide a depth buffer, a region of memory that stores the distance from each pixel to the viewer. Before writing to the display, the GPU compares a pixel’s distance to the distance of the pixel that’s already present, and it updates the display memory only if the new pixel is closer.

THE GRAPHICS PIPELINE, EVOLVED

GPUs have evolved from a hardwired implementation of the graphics pipeline to a programmable computational substrate that can support it. Fixed-function units for transforming vertices and texturing pixels have been subsumed by a unified grid of processors, or shaders, that can perform these tasks and much more. This evolution has taken place over several generations by gradually replacing individual pipeline stages with increasingly programmable units. For example, the NVIDIA GeForce 3, launched in February 2001, introduced programmable vertex shaders. These shaders provide units that the programmer can use for performing matrix-vector multiplication, exponentiation, and square root calculations, as

February 2007 97

well as a short default program that uses these units to perform vertex transformation and lighting.

GeForce 3 also introduced limited reconfigurability into pixel processing, exposing the texturing hardware’s functionality as a set of register combiners that could achieve novel visual effects such as the “soap-bubble” look demonstrated in Figure 1. Subsequent GPUs introduced increased flexibility, adding support for longer programs, more registers, and control-flow primitives such as branches, loops, and subroutines.

Figure 1. Programmable shading. The introduction of programmable shading in 2001 led to several visual effects not previously possible, such as this simulation of refractive chromatic dispersion for a “soap bubble” effect.

The ATI Radeon 9700 (July 2002) and NVIDIA GeForce FX (January 2003) replaced the often awkward register combiners with fully programmable pixel shaders. NVIDIA’s latest chip, the GeForce 8800 (November 2006), adds programmability to the primitive assembly stage, allowing developers to control how they construct triangles from transformed vertices. As Figure 2 shows, modern GPUs achieve stunning visual realism.

Figure 2. Unprecedented visual realism. Modern GPUs can use programmable shading to achieve near-cinematic realism, as this interactive demonstration shows, featuring actress Adrianne Curry on an NVIDIA GeForce 8800 GTX.

Increases in precision have accompanied increases in programmability. The traditional graphics pipeline provided only 8-bit integers per color channel, allowing values ranging from 0 to 255. The ATI Radeon 9700 increased the representable range of color to 24-bit floating point, and NVIDIA’s GeForce FX followed with both 16-bit and 32-bit floating point. Both vendors have announced plans to support 64-bit double-precision floating point in upcoming chips.

To keep up with the relentless demand for graphics performance, GPUs have aggressively embraced parallel design. GPUs have long used four-wide vector registers much like Intel’s Streaming SIMD Extensions (SSE) instruction sets now provide on Intel CPUs. The number of such four-wide processors executing in parallel has increased as well, from only four on GeForce FX to 16 on GeForce 6800 (April 2004) to 24 on GeForce 7800 (May 2005). The GeForce 8800 actually includes 128 scalar shader processors that also run on a special shader clock at 2.5 times the clock rate (relative to pixel output) of former chips, so the computational performance might be considered equivalent to 128 × 2.5/4 = 80 four-wide pixel shaders.

UNIFIED SHADERS

The latest step in the evolution from hardwired pipeline to flexible computational fabric is the introduction of

unified shaders. Unified shaders were first realized in the ATI Xenos chip for the Xbox 360 game console, and NVIDIA introduced them to PCs with the GeForce 8800 chip.

Instead of separate custom processors for vertex shaders, geometry shaders, and pixel shaders, a unified shader architecture provides one large grid of data-parallel floating-point processors general enough to run all these shader workloads. As Figure 3 shows, vertices, triangles, and pixels recirculate through the grid rather than flowing through a pipeline with stages of fixed width.

Figure 3. Graphics pipeline evolution. The NVIDIA GeForce 8800 GPU replaces the traditional graphics pipeline with a unified shader architecture in which vertices, triangles, and pixels recirculate through a set of programmable processors. The flexibility and computational power of these processors invites their use for general-purpose computing tasks. (Diagram labels: 3D geometric primitives; GPU; programmable unified processors running vertex, geometry, pixel, and compute programs; rasterization; hidden surface removal; GPU memory (DRAM); final image.)

This configuration leads to better overall utilization because demand for the various shaders varies greatly between applications, and indeed even within a single frame of one application. For example, a videogame might begin an image by using large triangles to draw the sky and distant terrain. This quickly saturates the pixel shaders in a traditional pipeline, while leaving the vertex shaders mostly idle. One millisecond later, the game might use highly detailed geometry to draw intricate characters and objects. This behavior will swamp the vertex shaders and leave the pixel shaders mostly idle.

These dramatic oscillations in resource demands in a single image present a load-balancing nightmare for the game designer and can also vary unpredictably as the players’ viewpoint and actions change. A unified shader architecture, on the other hand, can allocate a varying percentage of its pool of processors to each shader type.

For this example, a GeForce 8800 might use 90 percent of its 128 processors as pixel shaders and 10 percent as vertex shaders while drawing the sky, then reverse that ratio when drawing a distant character’s geometry. The net result is a flexible parallel architecture that improves GPU utilization and provides much greater flexibility for game designers.

GPGPU

The highly parallel workload of real-time computer graphics demands extremely high arithmetic throughput and streaming memory bandwidth but tolerates considerable latency in an individual computation since final images are only displayed every 16 milliseconds. These workload characteristics have shaped the underlying GPU architecture: Whereas CPUs are optimized for low latency, GPUs are optimized for high throughput.

The raw computational horsepower of GPUs is staggering: A single GeForce 8800 chip achieves a sustained 330 billion floating-point operations per second (Gflops) on simple benchmarks (http://graphics.stanford.edu/projects/gpubench). The ever-increasing power, programmability, and precision of GPUs have motivated a great deal of research on general-purpose computation on graphics hardware, or GPGPU for short. GPGPU researchers and developers use the GPU as a computational coprocessor rather than as an image-synthesis device.

The GPU’s specialized architecture isn’t well suited to every algorithm. Many applications are inherently serial and are characterized by incoherent and unpredictable memory access. Nonetheless, many important problems require significant computational resources, mapping well to the GPU’s many-core arithmetic intensity, or they require streaming through large quantities of data, mapping well to the GPU’s streaming memory subsystem.

Porting a judiciously chosen algorithm to the GPU often produces speedups of five to 20 times over mature, optimized CPU codes running on state-of-the-art CPUs, and speedups of more than 100 times have been reported for some algorithms that map especially well.

Notable GPGPU success stories include Stanford University’s Folding@home project, which uses spare cycles that users around the world donate to study protein folding (http://folding.stanford.edu). A new GPU-accelerated Folding@home client contributed 28,000 Gflops in the month after its October 2006 release, more than 18 percent of the total Gflops that CPU clients contributed running on Microsoft Windows since October 2000.

In another GPGPU success story, researchers at the University of North Carolina and Microsoft used GPU-based code to win the 2006 Indy PennySort category of the TeraSort competition, a sorting benchmark testing price/performance for database


operations (http://gamma.cs.unc.edu/GPUTERASORT). Closer to home for the GPU business, the HavokFX product uses GPGPU techniques to accelerate tenfold the physics calculations used to add realistic behavior to objects in computer games (www.havok.com).

Modern GPUs could be seen as the first generation of commodity data-parallel processors. Their tremendous computational capacity and rapid growth curve, far outstripping traditional CPUs, highlight the advantages of domain-specialized data-parallel computing.

We can expect increased programmability and generality from future GPU architectures, but not without limit; neither vendors nor users want to sacrifice the specialized architecture that made GPUs successful in the first place. Today, GPU developers need new high-level programming models for massively multithreaded parallel computation, a problem soon to impact multicore CPU vendors as well.

Can GPU vendors, graphics developers, and the GPGPU research community build on their success with commodity parallel computing to transcend their computer graphics roots and develop the computational idioms, techniques, and frameworks for the desktop parallel computing environment of the future? ■

David Luebke is a research scientist at NVIDIA Research. Contact him at [email protected].

Greg Humphreys is a faculty member in the Computer Science Department at the University of Virginia. Contact him at [email protected].

Computer welcomes your submissions to this bimonthly column. For additional information, or to suggest topics that you would like to see explained, contact column editor Alf Weaver at weaver@cs.virginia.edu.
