Developing Efficient Graphics Software
Course Organizer
Keith Cok
SGI
Course Speakers
Keith Cok
Roger Corron
Bob Kuehne
Thomas True
SGI
Abstract
A common misconception in modern computing is that making slow software run faster requires a bigger, faster computer. That approach is expensive and often unworkable. A more feasible and cost-effective approach to improving software performance is to measure the software's current performance, and then optimize the software to meet the anticipated graphics and system performance targets. This course discusses the techniques and principles involved in this approach to application development and optimization, with particular emphasis on practical software development.
The course begins with a general discussion of the interaction among the CPU, bus, memory, and graphics subsystem, which provides the background necessary to understand software optimization techniques. After this discussion of architecture fundamentals, we present the methods used to detect performance bottlenecks and to measure graphics and system performance. Next, we discuss general optimization techniques for the C and C++ languages. Finally, we give an overview of current application-level architectures and algorithms for reducing graphics and general system overhead.
Preface
Course Schedule
1:30 PM Introduction
2:55 PM Break
About the Speakers
Roger Corron
MTS SGI
1 Cabot Road, Hudson, MA 01749
[email protected]
Roger Corron is a Member of Technical Staff in SGI Custom Engineering, where he develops custom system solutions. Previously, Roger worked in SGI Applications Engineering, helping software developers optimize their applications' graphics performance. Before joining SGI, he worked at Matra Datavision, optimizing and porting solid modeling software to several different graphical APIs. His interests include low-level graphics APIs and extracting maximum performance from computer systems. Roger received his BS in Electrical and Computer Engineering from Clarkson University.
Bob Kuehne
MTS SGI
39001 West Twelve Mile Road, Farmington Hills, MI 48331
[email protected]
Bob Kuehne is a Member of Technical Staff at SGI. He currently assists software developers and vendors
in the CAD/CAE industries in developing products to most effectively utilize the underlying hardware. His
interests include object-oriented graphics toolkits, software design methodologies, creative use of graph-
ics hardware, and human/computer interaction techniques. Prior to joining SGI, he worked for Deneb
Robotics developing software for virtual reality applications. Bob received his BS and MS in Mechan-
ical Engineering from Iowa State University and performed research on assembly techniques in virtual
environments.
Thomas True
MTS SGI
1600 Amphitheatre Pkwy., Mountain View, CA 94043
[email protected]
Thomas True is a Member of Technical Staff at SGI where he currently works assisting software de-
velopers in tuning their graphics applications. His primary areas of interest include low-level graphics
system software, graphics APIs, user interaction, digital media, rendering, and animation. He received
a BS in Computer Science from the Rochester Institute of Technology and an MS in Computer Science
from Brown University where he completed his thesis on volume warping under the direction of Dr. John
Hughes and Dr. Andries van Dam. He presented this research at IEEE Visualization ’92. Prior to joining
SGI, Thomas developed graphics system software at Digital Equipment Corporation.
Acknowledgments
This course is based on our experience with real applications outside of SGI or in conjunction with part-
nerships through SGI Applications Consulting. We thank all of the graphics software developers and
researchers who are pushing the envelope in graphics technology; without them there would be no content
for this course.
We also thank our management, David Campbell, Brian Thatch, Janet Matsuda, and Keith Seto, for giving us the opportunity to develop this course, and the course reviewers who gave us much-needed feedback.
We gratefully acknowledge Alan Commike for his work on an earlier version of these course notes presented at SIGGRAPH 99. We also gratefully acknowledge both Alan and Pam Thuman-Commike for their work on the proofreading, figures, and LaTeX formatting.
Contents
Abstract iii
Preface v
Course Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
About the Speakers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Course Resources On the Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1 Course Introduction 1
3.2.6 Rasterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.7 Texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.8 Fog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.9 Antialiasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.10 Per-Fragment Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Image Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Draw Pixels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.2 Texture Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 Read Pixels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.4 Copy Pixels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.5 Bitmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Hardware Fast Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Conclusion 105
Glossary 107
Bibliography 113
List of Figures
List of Tables
Section 1
Course Introduction
This course was developed for graphics software developers interested in building interactive graphics applications that perform well. The course is not targeted at a specific class of graphics applications, such as visual simulation or CAD, but instead focuses on the general elements required for highly interactive 2D and 3D applications.
The course begins by discussing hardware systems, including CPU, bus, and memory. The course then
covers graphics devices, theoretical and realized throughput, graphics hardware categorization, hardware
bottlenecks, graphics performance characterization, and techniques to improve performance. Next, the
course discusses application profile analysis, and compiler and language performance issues (for C and
C++). The course then progresses into a discussion of application graphics rendering strategies, frame-
works, and concepts for high-performance interactive applications.
This course is founded on the premise that creating high-performance graphics applications is a difficult
problem that can be addressed through careful thought given to hardware and software systems interaction.
The course presents a variety of techniques and methodologies for developing, analyzing, and optimizing
graphics applications performance.
Section 2
2.1 Computer System Hardware
2.1.1 Overview
To understand why a graphics application is slow, you must first determine if the graphics are actually
slow, or if the bottleneck lies elsewhere in the system. In either case, it’s important to understand both
the code and the target system on which the code is running, how the two interact, and the strengths and
weaknesses of the system.
In this section, hardware, software, and their interaction are discussed with a specific emphasis on
graphics applications and graphics hardware. Also discussed is the process an application goes through
to get data to the graphics hardware. Additionally, concepts for maximizing application performance are
discussed throughout this section.
and subsystems upon which it runs. A useful metaphor for this balance (and diversion from the topic
of computer hardware) is the Chinese concept of yin and yang. Quoting from the Skeptics Dictionary
(http://skepdic.com/yinyang.html):
According to traditional Chinese philosophy, yin and yang are the two primal cosmic princi-
ples of the universe. Yin (Mandarin for moon) is the passive, female principle. Yang (Man-
darin for sun) is the active, masculine principle. According to legend, the Chinese emperor Fu
Hsi claimed that the best state for everything in the universe is a state of harmony represented
by a balance of yin and yang.
Although the ideas behind yin and yang do not map exactly to the main goal of application tuning, the
basic concept of balance is key. If the re-purposing of this ancient Chinese philosophy can be forgiven, the
goal in tuning an application is to obtain harmony, a state of blissful balancing of application load across
the hardware provided in a computer. Throughout the remainder of this course, the yin/yang symbol
appears in the margin to denote a section of interest that discusses harmonious application balance. A
consequence of trying to obtain balanced hardware usage is the need to understand how that hardware
operates so that an application can best take advantage of it.
Another icon used throughout the course is the winged foot of Mercury. This icon indicates an explicit
performance hint or suggestion. The goal of this course is not, however, to give explicit hints, but to
encourage overall understanding of an application and its interaction with the computer on which it runs.
Therefore, scanning the course for these icons and following the hint without understanding the surround-
ing concepts and content will not be of much value. Furthermore, much larger performance increases can
be obtained by implementing the concept, as opposed to implementing a specific suggestion. Take care to
understand why a particular suggestion is given, where it will work, and most importantly, the context of
the section surrounding the suggestion.
The architecture of a specific computer system is important to consider when designing software for that system. Specifically, it's important to consider which subsystems an application interacts with and how that interaction occurs. A computer contains several distinct systems, each using some interconnect fabric or "glue" (shown as a single block in Figure 2.1) to communicate with the others. Understanding this fabric, and where devices live on it, is extremely important both for determining where application bottlenecks occur and for avoiding bottlenecks when designing new software systems.
[Figure 2.1: Computer system components: CPU, memory, graphics, disk, and miscellaneous I/O devices connected through interconnect "glue" (ASIC).]
Interconnect fabrics vary dramatically from system to system. On low-end systems, the fabric is often a bus that all devices share through some hardware arbitration mechanism. On other systems, the fabric is a point-to-point peering connection, which allows individual devices to communicate with preallocated, guaranteed bandwidth. In still other systems, some devices might live on a bus while others live on a peered interface. The differences in application performance among these types of systems can be dramatic, depending on how an application uses the various components.
Because the focus of this course is on writing graphics applications, understanding the specifics of how
graphics hardware interfaces with CPU, memory, and disk is of special importance. A diverse mix of
computer systems exists on which an application might be run. This diversity ranges from systems with
a shared-bus (PCI) with local texture and framebuffer, to systems with a dedicated bus to the graphics
(AGP) with some local texture cache, some main memory texture cache, and local framebuffer, to systems
on a dedicated bus with all texture and framebuffer allocated from main memory (SGI O2). Each of these
architectures has certain advantages and disadvantages, but an application cannot be expected to fully
realize the performance of these platforms without consideration of the differences among them.
As a concrete example of these differences, let’s examine shared-bus systems. Graphics systems using a
shared-bus architecture share bandwidth with other devices on that bus. This sharing impacts applications
attempting to transfer large amounts of data to or from the graphics pipe while other devices are using
the bus. Large texture downloads, framebuffer readbacks, or other high-bandwidth uses of the graphics
hardware are likely to encounter bottlenecks as other parts of the system utilize the bus. (A complete
description of the different types of graphics accelerator hardware strategies appears in later sections of
the course notes.) Regardless of the type of system being used, the key to high-performance applications
is to fully utilize the entire system, balancing the workload among all the components needed so that more
application work can be performed more quickly.
2.1.4 CPU
Figure 2.2 depicts a simplistic CPU to illustrate the lengthy path that application data must travel before it
is useful. In this figure, main memory lives on the far side of all the caches, and data must be successively
cached down to the registers before it can be operated on by the CPU. This means that keeping often-used
data localized in memory is a very good idea, as it can improve cache efficiencies dramatically. In fact,
the premise behind caches is that data near the current data being operated upon is much more likely to be
needed next. This design criterion means that data locality affects performance, because access to cache
memory is significantly faster than to main memory.
Application data transfers to the graphics hardware that avoid pushing data through the CPU can signif-
icantly improve performance. Graphics structures such as OpenGL display lists (other graphics APIs have
other nomenclature for this concept) can often be pre-compiled into a state such that a glCallList() simply
transfers the display list directly from main memory to the graphics hardware, using a technique such as
direct memory access. This technique allows large amounts of graphics data to be rendered without any
complex calculations occurring on that data at run time.
Figure 2.3: Approximate data latencies and capacities of typical system components, ranging from small, fast registers to mammoth, slow disk.
Now that a few typical latencies and bandwidths have been discussed, how do the two interact? When
transferring data from one piece of hardware to another, both measures are important. Latency is most
often a factor when many operations are being performed, each with a latency that is large relative to
length of the overall operation. Latency is critical when accessing memory, for example, as the access
times for portions of main memory are approximately an order of magnitude slower than those of cache
memories.
A hypothetical graphics device illustrates the effects that latencies can have on a running program. Assume a system consisting of a data source (memory) and a data sink (graphics) where the bandwidth between source and sink is 1 MB/s and the latency is 100 ms. The hypothetical application programming interface (API) in this example provides a call that blocks (is synchronous) while downloading a texture. Transferring a single 100-MB texture (assuming no other delays in retrieving the data) then takes 100 seconds; adding the 100-ms (0.1-second) latency brings the total to 100.1 seconds. However, if 100 1-MB textures are downloaded instead, the transfer time per texture is 1 second, for a total of 100 seconds, but the 0.1-second latency is now paid once per texture, adding a cumulative 10 seconds and bringing the total transfer time up 10%, to 110 seconds. A developer aware of this issue could design around it, for example by packing many small sub-textures into one large texture to avoid many small data transfers. Though contrived, this example illustrates that latency can be an issue affecting application performance, and that developers must be aware of hardware latencies so their effects (the latencies, not the developers) can be minimized.
2.1.6 Memory
Previous sections have described the effects of latencies and bandwidths on hypothetical activities. This
section of the course discusses memory hierarchies and how applications interact with data within memory.
This section describes how memory hierarchies work in general, but many details are beyond the scope of
this course, such as instruction vs. data caches, details behind cache mappings (direct, n-way associative,
etc.), translation look-aside buffers, and many others.
Virtual Memory
Most current operating systems work under a memory scheme known as virtual memory. Virtual memory
is a method of managing memory that allows applications access to data storage space sized significantly
larger than the amount of physical RAM in a system. Addressing schemes vary, but 32-bit applications
can typically address >1 GB of memory when only a small fraction of that is physically available. Vir-
tual memory systems perform this task through managing a list of active memory segments known as
pages. For details behind this operation, and that of many computer systems, see Principles of Computer
Architecture [41] or a good introductory computer architecture book for elaboration.
Pages of memory are blocks of address space of a fixed size. Memory address space is simply a hardware mapping of all available memory locations to a numbering scheme. A simplistic mapping for a 16-byte memory system might have valid memory addresses of 0x00 to 0x0F. The size of a page varies from system to system but is typically constant on a specific running system. However, many hardware systems allow the page size to be changed, and some operating systems expose it as a dynamically tunable parameter. Knowing the page size and page boundaries for the specific system on which an application is running can be very useful, as will be explained more fully in a moment. Specific page sizes, and the functions to retrieve the page size and page boundaries, vary by operating system. Pages are
important structures to understand because they are used as the coarsest level of data caching that occurs
in virtual memory systems.
As an application uses memory and address space for code and data storage, more and more pages of that address space are allocated and used. Eventually, more pages are in use than are available in physical system RAM. At that point, the virtual memory manager moves some of the application's infrequently used pages from main memory to disk. This process is known as paging. Each time a page of memory is requested, the memory manager checks whether it already resides in main memory. If it does, no action is required. If it does not, the memory manager checks whether space is available in RAM for that page; if not, a page of resident data must first be written out to disk. In either case, the desired page is then read from disk into an available page location in RAM. When an application pages, disk I/O occurs, impacting both the application and overall system performance. Because maintaining the integrity of a running application is essential, the paging process operates in a fairly resource-intensive fashion to ensure that data is properly preserved. Because of these constraints, keeping data in as few pages as possible is important for high-performance applications. Applications that use very large datasets, which cause the system to page, may benefit from implementing their own data paging strategy: application-specific paging can be written to be much more efficient than the general OS paging mechanism.
Figure 2.4 shows a hypothetical application with an address space ranging from page 0 to page n, and
a system with many physical pages of RAM. In this example application, pages 0 - 9 are active, or have
data of some sort stored in them by the application, and pages 0 and 1 are physically resident in RAM. For
this example, the memory manager has decreed that only two pages can be used by the application, so any
application data that resides on pages other than the two in RAM are paged to and from disk.
[Figure 2.4: An application address space of pages 0 through n, with only pages 0 and 1 resident in physical RAM.]
If the application in this example needs to retrieve vertex data from each of the 10 pages in use by
the application, then each page must be cached into RAM. This process will likely require eight paging
operations, which can be quite expensive given that disk access is several orders of magnitude slower than RAM access. However, if the application could rearrange the data so that it all resided on one page, no paging by the virtual memory manager would be required, and access times for this data would improve dramatically. This property of one piece of data residing "close" to another in memory is known as data locality. If data locality can be improved, by storing frequently used data in adjacent memory, performance may improve as well. Understanding data access patterns is key to understanding and improving data locality.
When data is resident on pages in main memory, it must then be transferred to the CPU (see Figure 2.2)
in order for operations to be performed on it. The process by which data is copied from main memory into
cache memory is similar to the process by which data is paged into main memory. As memory locations
are required by the operating program, they must be copied ultimately into the registers. Figure 2.5 shows
the data arrangement of cache lines in pages and both caches.
To get data to the registers, active data must first be cached into the L2 and then the L1 cache. Data is transferred from pages in main memory to the L2 cache in chunks known as cache lines. A cache line is a linear block of address space of a system-dependent size; L2 cache lines are typically 32 to 128 bytes long. As data is required by the CPU, data from the L2 cache must be copied into the faster level-1 (L1) cache, in lines that are again of a system-dependent size, typically around 32 bytes. Finally, the actual data required from within the L1 cache is copied into the registers and operated upon by the CPU. This is the physical path through which data must flow to be operated on by the CPU.
Figure 2.5: Cache line structure. Shown are pages of memory composed of multiple L2 cache lines; L2 cache composed of multiple L1 cache lines; and L1 cache composed of individual bytes.
The process by which requested data is copied into the registers is important because the consequences
of its action are one of the primary factors limiting application performance. As data is needed by the CPU,
controlling circuitry checks to see if that data is in the registers. If the data is not immediately available,
the controller checks the L1 cache for the data. If again not available, the controller checks the L2 cache.
Finally, if the data is still not available, a cache line containing the required data is copied from a page in
main memory (assuming that the page is already resident in RAM, and not paged to disk) and propagated
through L2 and L1 cache, ultimately depositing the requested data in a register. This process is depicted
in Figure 2.6, which shows the data request procedure as a flow chart.
[Figure 2.6: Data request flow chart. If the word is in a register, compute; otherwise check the L1 cache, then the L2 cache, then pages in main memory, fetching the word, an L1 cache line, or an L2 cache line respectively; if the containing page is not resident, it is first retrieved from disk.]
Though this discussion of memory and how it works is straightforward, the relevance to application
performance may not be immediately clear. Data locality, or packing frequently used data near other
frequently used data in memory, is the ultimate point of any discussion of how memory works. Keeping
data closer together keeps data in faster and faster memories in the memory hierarchy. Conversely, data
that is widely dispersed in memory is accessed through slower layers in the memory hierarchy. The effects
of data locality are best demonstrated through two examples.
In these examples, an operation is being performed in the CPU that requires 2 bytes of data, each in
a register. The computer on which this operation is running has the following access times: L1 cache,
1 ns; L2 cache, 10 ns; main memory, 100 ns. These access times are the largest contributors to overall
data access time. In the first example, the 2 bytes of data are resident on two different pages of memory,
so for each data to be accessed, a cache line must be copied from main memory into the cache. Thus, to
access main memory, it takes 100 ns + the L2 cache access time (10 ns) + the L1 cache time (1 ns), or 111
ns for each data byte to be copied from main memory to a register. Therefore, for the first example, the
total time to prepare memory for the operation to occur is 222 ns. Note that in this example, the two bytes
are the data of interest, but complete L2 cache lines containing the bytes of interest are copied from main
memory, and L1 cache lines containing the bytes are copied from L2 cache to L1 cache. Finally a word
containing each byte of interest is copied to each register location.
In the second example, both data bytes live on the same page in memory and on the same L2 cache line (though far enough apart that they don't fit on the same L1 cache line). Setting up this operation takes much less time than in the first example: 100 ns to access the main memory page once and copy the line to L2 cache, two 10-ns accesses to the L2 cache to copy each byte to a different L1 cache line, and two 1-ns accesses to the L1 cache to load the registers. The total time to prepare the operation is thus 122 ns, nearly half the previous example's overall time. As these examples show, keeping data localized can clearly benefit application performance. Keep this cache effect in mind when designing graphics data structures to hold objects to be rendered, state information, visibility lists, and so on. Some simple changes in data structure organization can gain a few frames per second in the application frame rate.
Another example of how data locality can greatly benefit a graphics application is the graphics construct known as a vertex array. Vertex arrays allow graphics data to be used efficiently by the CPU for transformation, lighting, and so on. This efficiency is primarily due to the fact that vertex arrays are arranged contiguously in memory, so subsequent accesses to vertex data are likely to be found in a cache. For example, if a hypothetical L2 cache uses 128-byte lines, then a vertex of four 32-bit floats occupies only 16 bytes of a single cache line, allowing fast access to each component. However, because most applications do more than render flat-shaded triangles, these vertices need normals too. If one large contiguous array is allocated in memory for the vertices, another for the normals, another for the colors, and so on, it's possible, depending on the L2 cache implementation, that these arrays will map to overlapping cache lines and still incur trips to main memory on access. A solution to this problem is the interleaved vertex array. In this case, vertex, normal, and color data are arrayed one after another in memory; in a 128-byte cache line implementation, all three are then quite likely to live in non-overlapping L2 cache at once, improving performance.
A number of techniques exist for mitigating the effects of cache on data access performance; however,
these techniques are more adequately addressed in later sections of this course, which discuss language
and code optimizations.
Understanding the path through which data must flow to get to the CPU is key because of the latencies
involved in accessing data from various memory caches. Keeping data packed close together in memory
ensures the likelihood of subsequent data accesses occurring from memory already resident in cache, and,
therefore, the algorithms operating on that data will be much faster.
Graphics Pipeline
The process of rendering interactive graphics can best be described as a series of distinct operations per-
formed on a set of input data. This data, often referred to as a primitive, typically takes the form of
triangles, triangle strips, image data, points, and lines. Each primitive enters the process as a set of vertex
data in a world coordinate system, and leaves as a set of pixels in the framebuffer. This set of stages, which
performs this transformation, is known collectively as the graphics pipeline (Figure 2.7).
When implemented using special-purpose dedicated hardware, the graphics pipeline can conceptually
be reduced to five basic stages[8].
Generation The process of creating the actual graphics data to be rendered, and organizing it into a
graphics data structure. Generation includes all the work done by an application on the CPU prior
to the point at which it’s ready to render.
Traversal The process of walking through the internal graphics data structures and passing the appropri-
ate data to the graphics API.
Typically, this stage of the rendering process is not implemented in dedicated hardware. Immediate
mode graphics requires flexible traversal algorithms that are much easier to perform in software
on the host CPU. Retained mode graphics, such as OpenGL display-lists, can be implemented in
hardware and then are part of the traversal phase.
Transformation The process of mapping graphics primitives from world-space coordinates into eye-
space, performing lighting and shading calculations, mapping the eye-space coordinates to clip-
space, clipping these coordinates, and projecting the final result into screen-space.
Graphics subsystems with hardware support for this stage do not always accelerate all geometric
operations. Often, there is a limited number of paths that are fully implemented in hardware. For
example, some machines may only accelerate geometric operations involving one infinite light,
others may not accelerate lights at all. Some hardware accelerators may have dedicated ASICs that
transform geometric data faster for triangle strips of even lengths rather than odd (due to parallelism
in the geometry engines). Understanding which operations in this portion of the graphics pipeline
are performed in hardware, and to what degree, is critical for building fast graphics applications.
These operations are known as fast paths. Determination of hardware fast paths is discussed in
Sections 2.2.3 and 3.4.
Rasterization The process of drawing the screen-space primitives into the framebuffer, performing screen-
space shading, and per-pixel operations. Per-pixel operations performed in this phase include texture
lookups and depth, alpha, and stencil tests. Following this stage in the pipeline, there remain only
fragments, or pixels with a variety of associated data such as depth, color, alpha, and texture.
Any (or all) rasterization operations can be incorporated into hardware, but very frequently, only a
limited subset actually are. Reasons for this limitation are many, including cost, complexity, chip
(die) space, target market applicability, and CPU speed. Some hardware may accelerate textures
only of certain formats (ABGR and not RGBA), whereas others may not accelerate texture at all,
targeting instead markets such as CAD where texture is (as of yet, relatively) unimportant. It is im-
portant to know what is and is not implemented in hardware to construct a well-performing graphics
application.
Display The scanning process that transfers pixel data in the framebuffer to the display monitor.
This stage is always implemented in dedicated hardware, which provides a constant refresh rate (for
example, 60 Hz, 75 Hz).
Figure 2.8 shows how these five stages overlay onto the original graphics pipeline. These five stages
can be used to build a useful taxonomy that classifies graphics subsystems according to hardware imple-
mentation. This taxonomy, its mapping onto hardware, and consequent performance implications are the
subject of the next section.
GTXR-D The sole function of a GTXR-D type graphics subsystem is to regularly update the screen at
the set refresh frequency by scanning the pixel values from video memory to a display monitor. All
other rendering stages are performed on the host CPU.
GTX-RD GTX-RD type graphics subsystems have a rendering engine that implements the scan conver-
sion of screen-space objects (points, lines, and polygons) into video memory and performs screen-
space shading and other pixel operations (depth testing, stencil testing, etc.). Transformation and
lighting are still performed on the host CPU.
GT-XRD GT-XRD type graphics subsystems go one step beyond GTX-RD with the addition of one or
more transform engines that implement in hardware the transformation from object-space to eye-
space, eye-space lighting and shading, and the subsequent transformation to screen-space. In this
case, the CPU is left to simply generate and traverse the graphics data structures sending the object-
space data to the graphics subsystem.
G-TXRD Graphics subsystems of type G-TXRD are rare because of the overwhelming demand for im-
mediate mode graphics. Moving the traversal stage from the host CPU into dedicated hardware
imposes strict rules on user interaction, which is unacceptable in most environments. Because there
are very few such systems, we will not discuss them further here.
Maximizing application performance on a particular type of graphics subsystem requires first an under-
standing of which portions of the graphics pipeline are used by an application, and second, which portions
of the pipeline are implemented in dedicated graphics hardware. Keep both points in mind when authoring
an application.
load on the CPU. This reduces the risk that an application will become CPU-bound. Currently, AGP offers
an exclusive 512-MB/s or 1024-MB/s transfer path between system memory and graphics.
Another approach is the Unified Memory Architecture (UMA). In UMA machines, the graphics subsystem has no dedicated video memory; the framebuffer shares system memory with the rest of the machine, and a memory controller arbitrates the flow of data between the CPU, memory, and graphics.
A comparison of the various architectures can be seen in Figure 2.9. An analysis of how different
graphics hardware implementations affect overall application performance can be found in Section 4.3.4.
Figure 2.9: Schematic of system interconnection architectures: (A) PCI, (B) AGP, and (C) UMA.
which render, transfer internally across the system’s bus or hub, and then are recombined on a fifth pipe.
Processes are threaded (or even forked) across multiple CPUs, and functions executed directly through
standard programming language bindings. In an SSI system, data is shared either implicitly, as occurs with threaded programs, or explicitly, as occurs with forked programs using shared-memory arenas. In both cases, the data resides within a single logical memory subsystem, allowing easy and direct access to data across multiple processes and threads.
The key difference between systems of this type and clusters is the bandwidth and latency of interfaces
among graphics pipes. In particular, in systems of this sort, sharing data from main memory to/from
individual graphics pipelines is both high bandwidth and low latency.
The defining feature of COW systems, typically, is cost. COWs are often systems with much less integrated
hardware, and more off-the-shelf components. Though systems of this sort can potentially have high-
performance graphics, often they have lower cost, lower quality graphics. In COW systems, applications
must be explicitly aware of the differences among individual systems, or nodes, in the cluster, as well as
the individual system capabilities, and the link performance and topology of the system connections. An
example of one configuration is found in Figure 2.11. Programming interfaces are explicitly parallel, or
happen through an abstraction layer such as a message-passing interface. Examples of these interfaces
include OpenMP [6] and MPI [4], although many others exist. Another technique is to distribute objects using an object layer such as CORBA [1]. Link connection and topology are key
factors in constructing and using a cluster in both determining the amount of data that can be distributed to
all nodes (within the application per-frame time constraints) and determining the latency involved (through
number-of-hops in the topology) in transferring that data.
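This budget arithmetic can be made concrete with a small C sketch. The function names, rates, and latencies below are hypothetical illustrations, not measurements of any particular system: per-frame transfer time is estimated as the per-hop latency times the hop count, plus the data size divided by the usable bandwidth.

```c
#include <stdbool.h>

/* Estimate the time (in seconds) to push one frame's worth of shared
 * data to a node over a given link.  latency_s is the per-hop link
 * latency, hops is the number of switch hops in the topology, and
 * bandwidth_bps is the usable link bandwidth in bytes per second. */
static double transfer_time(double latency_s, int hops,
                            double bytes, double bandwidth_bps)
{
    return latency_s * hops + bytes / bandwidth_bps;
}

/* True if the transfer fits inside the per-frame time budget. */
static bool fits_frame_budget(double latency_s, int hops,
                              double bytes, double bandwidth_bps,
                              double frame_budget_s)
{
    return transfer_time(latency_s, hops, bytes, bandwidth_bps)
           <= frame_budget_s;
}
```

For example, moving 1 MB over a 2-hop link with 1 ms per-hop latency at 100 MB/s costs about 12 ms, which fits a 60 Hz frame budget of roughly 16.7 ms; the same transfer would not fit a 120 Hz budget.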
First, it’s necessary to understand and choose an appropriate problem decomposition. Among the choices for problem decomposition are image-space, time-based, geometry-based, and depth-based. Each of these techniques involves understanding the desired output display configuration and configuring the set of image pipelines to produce those images.
In image-space decomposition, the configuration may consist of either a single large image subdivided
into a set of smaller sub-images, or a set of abutting images, perhaps not even in the same plane, such as
in a CAVE. A second decomposition is geometry-based. In this configuration, each pipe views the entire
view-volume, and each pipe receives some portion of the geometry to render. One approach might simply be to send each pipe 1/number-of-pipes of the objects; however, this partitioning is too simplistic. If the pipes have different capabilities (geometric or fill rates), the goal should be to balance their workloads so that no pipe is oversubscribed and no pipe sits idle. A uniform 1/number-of-pipes partitioning is therefore unlikely to scale well; the actual data must be examined to divide the workload equitably among the pipes in a system. The images from each pipe are then
recombined along with their depth values on the final pipeline.
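One way to divide the workload more equitably is to weight the partition by each pipe's measured throughput rather than splitting uniformly. The following C sketch illustrates the idea; the weights (relative triangles per second per pipe) and function name are hypothetical.

```c
/* Assign a triangle count to each pipe in proportion to its measured
 * throughput, instead of a naive 1/number-of-pipes split. */
static void partition_by_capability(long total_tris,
                                    const double *weights, int num_pipes,
                                    long *tris_per_pipe)
{
    double total_weight = 0.0;
    for (int i = 0; i < num_pipes; ++i)
        total_weight += weights[i];

    long assigned = 0;
    for (int i = 0; i < num_pipes; ++i) {
        tris_per_pipe[i] = (long)(total_tris * (weights[i] / total_weight));
        assigned += tris_per_pipe[i];
    }

    /* Give any rounding remainder to the fastest pipe. */
    int fastest = 0;
    for (int i = 1; i < num_pipes; ++i)
        if (weights[i] > weights[fastest])
            fastest = i;
    tris_per_pipe[fastest] += total_tris - assigned;
}
```

With weights {1, 1, 2} and 4000 triangles, the third (twice as fast) pipe receives 2000 triangles and the other two 1000 each. In practice the weights would come from measuring each pipe, as discussed in the text.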
Another decomposition, often used in simulators, is time-based. In this decomposition, each subsequent frame is rendered on a different pipe, and the results are displayed sequentially on the output device (or pipe). In time-based decompositions, the pipes are arranged in a ring buffer; each pipe, once finished, begins working on the next available frame. For non-interactive graphics applications, this technique is often used to render the frames of animations.
Yet another decomposition is depth-space tiling. In this decomposition, each pipe handles the same screen-space area, but each renders a specific depth section of the database (which itself is sorted by depth). For example, on a three-pipe system, each pipe would create a view frustum covering one-third of the total depth. This requires that the screen-space depth data of each piece of geometry be computed.
This differs from the image-space decomposition described previously in which the database is divided
into eye-space sections. Each pipe renders its own piece of the entire geometry, and the rendered sets of
geometry are combined into the final image.
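The per-pipe frusta in a depth-space tiling can be derived by splitting the view depth range into equal slices, one per pipe. A minimal C sketch (function and parameter names are illustrative):

```c
/* Split the view depth range [near_plane, far_plane] into equal
 * slices, one per pipe; pipe i renders the geometry that falls
 * between *slice_near and *slice_far. */
static void depth_slice(double near_plane, double far_plane,
                        int num_pipes, int pipe,
                        double *slice_near, double *slice_far)
{
    double step = (far_plane - near_plane) / num_pipes;
    *slice_near = near_plane + step * pipe;
    *slice_far  = near_plane + step * (pipe + 1);
}
```

On a three-pipe system viewing depths 1 to 10, the middle pipe would render the slice from 4 to 7. A real system might instead size the slices by geometry density so each pipe gets comparable work.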
If enough resources are available, combining these techniques can yield very interesting and scalable results. For example, an application might use a time-based decomposition and then subdivide each frame spatially. Decomposition combinations such as these are extraordinarily powerful but require a significant investment in software architecture to utilize multiple-pipe systems effectively.
Image-space Decomposition
/* each pipe gets a section of the image-space view-volume. sort
* so only that section of data goes to each pipe. */
sort_geometry_by_pipe();
render_individual_pipe_data( pipe_num );
/* OpenGL: glBegin/End */
set_graphics_context_to_window_on_pipe( output_pipe );
/* OpenGL/glX: glXMakeCurrent( output_pipe ); */
Depth-based Decomposition
/* each pipe renders a depth slice of the database; the slices are
 * recomposited with depth on the output pipe. */
sort_geometry_by_pipe();
render_individual_pipe_data( pipe_num );
/* OpenGL: glBegin/End */
set_graphics_context_to_window_on_pipe( output_pipe );
/* OpenGL/glX: glXMakeCurrent( output_pipe ); */
Geometry-based Decomposition
divide_geometry_among_pipes();
render_individual_pipe_data( pipe_num );
/* OpenGL: glBegin/End */
barrier_wait_for_all_pipes_to_finish();
set_graphics_context_to_window_on_pipe( output_pipe );
/* OpenGL/glX: glXMakeCurrent( output_pipe ); */
Time-based Decomposition
/* each pipe gets the ’next’ frame. the next frame is computed by
 * either continuously sampling input devices, or by extrapolating
 * along some smoothed previous ’n’ input steps. in either case,
 * each subsequent new view is rendered on another pipe, then
 * displayed back on the main pipe. */
set_view( new_view_matrix );
render_all_data();
set_graphics_context_to_window_on_pipe( output_pipe );
/* OpenGL/glX: glXMakeCurrent( output_pipe ); */
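The ring-buffer arrangement described above amounts to a simple modular assignment of frames to pipes. A tiny C sketch (the function name is illustrative):

```c
/* In a time-based decomposition the pipes form a ring buffer:
 * frame f is rendered on pipe f mod num_pipes while the other
 * pipes work on the frames before and after it. */
static int pipe_for_frame(long frame, int num_pipes)
{
    return (int)(frame % num_pipes);
}
```

On a three-pipe system, frames 0, 1, 2, 3, 4, ... land on pipes 0, 1, 2, 0, 1, ..., so each pipe has three frame times in which to finish its frame.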
Whatever the technique chosen, the resultant images are each rendered using a different graphics pipe and then recomposited. Techniques for recompositing include, in the case of wall or CAVE configurations, simply projecting the images on surfaces that physically abut one another, creating the illusion of a seamless image. The second, more challenging technique involves rendering images on a number of pipes, then capturing those pixels (and potentially depth information) and sending them back to the final graphics pipe, where they are recomposited and sent to the display device. Examples of all these techniques are shown in Figure 2.12.
The above description brings us to the second and third key points to effectively utilizing a multiple-pipe
system. The second key factor is to understand the system architecture. In some architectures, bandwidth
may not be available to pass back image sections to a final pipe for recompositing. Or, potentially, if
bandwidth is available, the latencies involved may be too long for a copy to occur per-frame. For example,
in a COW, latencies may be on the order of several milliseconds, but in an SSI system, latencies may be
on the order of several microseconds. In clusters, a good strategy is therefore to transmit synchronous data across the network fabric as infrequently as possible. This also implies that, for COWs, depth-space compositing can be difficult, again because of latency, especially in interactive applications. A good technique in COWs is instead to project the
resultant displays to recomposite the image. Similarly, in an SSI system, where latencies are lower, it’s
much more feasible to transmit portions of the resultant image to a single pipe for recompositing.
The last key factor in using a multiple-pipe architecture is to understand the differences among individ-
ual systems, or nodes, within a system. When doing depth-space compositing or large-image reconstruction, performing the final step on a node with additional resources makes sense, as the pixel demands on that node will be greater. More generally, balancing the load among systems in either a COW
or an SSI system is essential to maximizing the performance of the overall system. Both geometric and
fill requirements should be balanced for each individual node and pipe within a system so that each pipe
is kept busy, but only for as long as the time constraints on interactivity allow.
2.2. GRAPHICS HARDWARE SPECIFICATIONS Developing Efficient Graphics Software
example, if a polygon at some far distance in the framebuffer is first drawn and then another is drawn in front of it, the second polygon will be drawn completely, overwriting part of the farther polygon. Pixels in which the two polygons intersect will have been written to, or filled, twice. The phenomenon of writing the same framebuffer fragments multiple times yields a measurement known as the depth complexity of a scene. Depth complexity is an average measurement of how many times a single image pixel has been filled prior to display. Applications with high depth complexity are often fill-limited. Fill rate is often a bottleneck for application domains such as flight simulation.
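The depth-complexity measurement defined above is just the mean of a per-pixel write count, which could be gathered with a stencil-style counter. A small C sketch (names are illustrative, not a specific API):

```c
/* Depth complexity: the average number of times each pixel in the
 * frame was written.  fill_counts holds one write count per pixel. */
static double depth_complexity(const unsigned *fill_counts, int num_pixels)
{
    unsigned long long total = 0;
    for (int i = 0; i < num_pixels; ++i)
        total += fill_counts[i];
    return (double)total / (double)num_pixels;
}
```

A scene whose pixels were written {1, 2, 3, 2} times has a depth complexity of 2.0: on average, every pixel was filled twice before display.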
Polygon rate is a measure of the speed at which polygons can be processed by the graphics pipeline.
Polygon rates are reported as the number of triangles able to be drawn per second. Polygon rates, like
fill rates, are almost meaningless without additional supporting information. Check hardware information
carefully for specifics about whether or not these triangles were lit or unlit, textured or untextured, the
pixel size of the triangles, etc. Polygon rates are often bottlenecks in application domains such as CAD
and manufacturing simulation.
Fill rates directly correspond to the rasterization phase of the pipeline referred to in Figure 2.8, whereas
polygon rates directly correspond to the transformation phase of the pipeline. Because these are the two
phases of the pipeline most commonly implemented in hardware, achieving a good balance between
these two is essential to good application performance. For example, if a graphics vendor claims a high
polygon rate, but low fill rate, the card will be of little use in the flight simulation space because of the
type of graphics data typically used in flight simulations. Flight simulations typically draw relatively few
polygons, but most are large, and are textured and fogged, and often overlap (think of trees in front of
buildings in front of layers of ground terrain), thus having high depth-complexity. In another example, if
graphics hardware claims a high fill rate, but low polygon rate, it will likely be a poor CAD performer.
CAD applications typically draw many small polygons without using much fill capacity (no texture, no
fog, no per-pixel effects). In either application scenario, CAD or flight-simulation, some performance of
the graphics hardware is often underutilized, and if more fully utilized, more complex or more detailed
scenes could be rendered. Balance is key.
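The balance argument can be stated numerically: a frame is limited by whichever takes longer, transforming the polygons or filling the pixels. The sketch below is a rough first-order model; the rates and scene statistics are illustrative placeholders, not vendor data.

```c
/* Rough per-frame cost estimate from published rates: the frame time
 * is bounded below by the slower of the geometry and fill stages. */
static double frame_time(double num_tris, double tris_per_sec,
                         double pixels_filled, double pixels_per_sec)
{
    double geom_time = num_tris / tris_per_sec;
    double fill_time = pixels_filled / pixels_per_sec;
    return geom_time > fill_time ? geom_time : fill_time;
}
```

A CAD-like scene of one million small triangles on a card rated at ten million triangles per second is geometry-limited at roughly 0.1 s per frame, regardless of how high the fill rate is; a flight-simulation scene with few polygons but heavy overdraw would be dominated by the fill term instead.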
Examining the details behind the reported fill rate and polygon rate numbers can yield information
about whether or not an application will be able to perform up to these published standards. However,
even armed with all this information, there are still many variables on which hardware vendors do not
provide data, which affect an application’s performance. Ultimately, to measure the real performance of a
system’s graphics, you must test the hardware.
– Understand its capabilities for data I/O, cache and CPU architecture, and primarily the data
paths to and through the graphics hardware.
2.3. HARDWARE CONCLUSION Developing Efficient Graphics Software
• Ensure maximum performance of a computing system by using provided hardware features and
extensions. Graphics system performance can be most dramatically affected through use of vendor-
provided extensions.
– Use run-time queries to determine which extensions are available and then use them fully.
– Use both the fill and transformation portions of the graphics hardware to maximize use of the
available resources.
– Balance the workload among all system components.
Though hardware systems are continually improving in performance, features, and overall capability,
applications and users of those applications are similarly increasing their demands and workloads. It is
essential for application writers to understand the capabilities of their target hardware platforms in order to
meet these demands. Understanding these capabilities and writing software with hardware in mind affords
the best possible performance across a wide variety of computing architectures.
Section 3
3.1 Introduction
To achieve the highest performance from a computer graphics application, you must maintain a delicate balance between the requirements of the software application and the capabilities of the system hardware. An out-of-balance system does not perform optimally.
To achieve this balance, an understanding of the underlying system hardware is required, specifically
the hardware that implements the graphics rendering pipeline. This has become even more important in
recent years as graphics architectures continue to evolve at a rapid pace with different parts of the graphics
rendering pipeline implemented in dedicated hardware within the graphics subsystem.
Equally important is an understanding of the software side of the equation, not only the application
software and the requirements that it places on the computer system hardware, but also any software that
complements the graphics hardware to implement the complete graphics rendering pipeline.
This section builds on Section 2.1.7, which introduced the five basic stages of the graphics rendering pipeline. Here, we examine the Transformation and Rasterization stages in detail. These stages form the core of the pipeline and are typically implemented using special-purpose hardware within the graphics subsystem.
We’ve chosen an OpenGL™ implementation of the graphics rendering pipeline as a basis for this discussion, as it is available on many computer platforms and as a result easily understood by a general
audience. Figure 3.1 illustrates how the Transformation and Rasterization stages are broken into functional
blocks. It is at this level that we will examine the functionality of the graphics rendering pipeline and how
it is implemented as a combination of hardware and software.
The implementation of the graphics rendering pipeline presented here is for instructional purposes only
and should not serve as a complete and definitive guide. The intent here is to provide a collection of
useful information to be considered by the application developer when evaluating the performance of a
graphics application. For more specific information on the OpenGL™ graphics rendering pipeline, see more detailed reference documentation [42]. To see how this pipeline is actually implemented in software, check the OpenGL™ Sample Implementation [5] or Mesa™ [3].
Except in the case of extremely low-end graphics adapters, which implement most of the graphics ren-
dering pipeline in software, the graphics rendering pipeline is typically implemented in hardware by one
or more special-purpose ASICs. One ASIC or group of ASICs is commonly referred to as the rendering
or rasterization engine. This engine excels at 2D screen coordinate integer-based computations and as
such implements the functionality of the Rasterization stage. Mid- to high-level graphics accelerator cards
feature an additional ASIC or set of ASICs for geometry processing, commonly referred to as a transform
engine. The transform engine is like a special-purpose CPU that efficiently performs the floating-point
3.2. GEOMETRY PATH Developing Efficient Graphics Software
calculations required in matrix multiplication and the evaluation of plane equations, the Transformation
functionality within the graphics rendering pipeline.
The rendering process follows a set of paths through the rendering pipeline. Understanding how these
paths are implemented and function within the graphics pipeline provides insight for application design
and performance tuning.
3.2.4 Lighting
The lighting stage computes RGBA color values for each vertex in a scene. As a result, lighting is the most computationally complex stage of the 3D rendering pipeline. Before exploring some specific details that should be considered in the context of an application, it’s beneficial to review the basic lighting calculations.
The lighting stage computes RGBA color values by applying a light model equation to a set of input
parameters. These input parameters include the following:
• Vertex coordinates
• Vertex normal
• Material characteristics
The color for each vertex is then calculated as the sum of the vertex material emission, the global
ambient light scaled by the vertex ambient property, and the attenuated ambient, diffuse and specular
contributions from all light sources. Examination of each term in the lighting equation demonstrates how
the values of particular parameters can influence work performed by the graphics hardware.
Material Emission
The material emission is simply the RGB value giving the color of an object that is not attributed to any external light source:

\[ color_{emissive} \qquad (3.1) \]
The attenuation factor (att) is calculated in relation to the distance (d) between the light’s position and the vertex as

\[ att = \frac{1}{att_{constant} + att_{linear}\,d + att_{quadratic}\,d^2} \qquad (3.4) \]
except in the case of a directional light where the attenuation factor is simply 1.0. In this case, the light
rays entering the scene are parallel, because a directional light is considered to be infinitely far away from
the scene.
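Equation 3.4 and its directional-light special case translate directly into code. The following C sketch is illustrative; the parameter names stand for the constant, linear, and quadratic attenuation coefficients.

```c
/* Attenuation factor from Equation 3.4.  A directional (infinite)
 * light gets a constant factor of 1.0, since its rays are parallel. */
static double attenuation(int is_directional, double d,
                          double kc, double kl, double kq)
{
    if (is_directional)
        return 1.0;
    return 1.0 / (kc + kl * d + kq * d * d);
}
```

For example, with kc = 1.0, kl = 0.5, kq = 0.25 at distance d = 2, the factor is 1/(1 + 1 + 1) = 1/3; note that this divide and two multiplies must be evaluated per vertex for every non-directional light.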
The spotlight effect (spot) for a light source evaluates to one of three possible values. When the light is a spotlight and the vertex falls within its illumination cone, the spot effect is the dot product of the unit vector \(v = (v_x, v_y, v_z)\) that points from the spotlight to the vertex and the vector \(d = (d_x, d_y, d_z)\) that describes the spotlight’s direction. This dot product varies with the cosine of the angle between the two vectors, so vertices directly in line receive maximum illumination, whereas vertices off the axis receive less illumination as the cosine of the angle gets smaller.

\[ spot = \begin{cases} 1, & \text{light is not a spotlight} \\ 0, & \text{light is a spotlight but the vertex is outside the illumination cone} \\ v \cdot d, & \text{otherwise} \end{cases} \]
The ambient term (\(color_{ambient}\)) is the ambient color of the light source scaled by the ambient material property:

\[ color_{ambient} = ambient_{light} \times ambient_{material} \]
The diffuse term (\(color_{diffuse}\)) is calculated by scaling the diffuse color of the light source by the diffuse material property and then scaling the result by the dot product of the unit vector \((L_x, L_y, L_z)\) that points from the vertex to the light and the normal vector \((n_x, n_y, n_z)\). If the dot product is less than or equal to 0, the light is on the wrong side of the surface and there is no diffuse component contribution at that vertex.
The final term is the specular term (\(color_{specular}\)). Before calculating the specular term, it must be determined whether there is a specular component contribution at the vertex. Once again, as in the diffuse case, the dot product between L and n is used to account for the position of the vertex in relation to the direction of the light source. If the result of the dot product is less than or equal to 0, the light is on the wrong side of the surface and there is no specular component contribution at that vertex.
If the dot product is positive, then there is a specular component contribution. This contribution is calculated using the normalized sum \(s = (s_x, s_y, s_z)\) of the two unit vectors that point from the vertex to the light position and from the vertex to the viewpoint. In the case that the viewpoint is not considered local to the scene, the second vector is simply (0, 0, 1). The complete equation for calculating the specular component contribution is

\[ color_{specular} = (\max(s \cdot n,\, 0))^{shininess} \times specular_{light} \times specular_{material} \qquad (3.7) \]
With all the various terms calculated, the complete equation for calculating the vertex color given n
light sources can be expressed as
\[ color = color_{emissive} + color_{global} + \sum_{i=0}^{n-1} (att_i)(spot_i)\,[\,color_{ambient} + color_{diffuse} + color_{specular}\,]_i \qquad (3.8) \]
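For a single color channel, Equation 3.8 reduces to a short loop over the lights. In the C sketch below, the per-light terms are assumed to be precomputed; the struct and function names are illustrative.

```c
/* Precomputed per-light terms for one color channel. */
struct light_terms {
    double att, spot;                     /* Equations 3.4 and the spot term */
    double ambient, diffuse, specular;    /* per-light contributions */
};

/* Per-vertex color for one channel, following Equation 3.8: emissive
 * plus global ambient plus the attenuated, spotlight-scaled sum of
 * the ambient, diffuse, and specular contributions of each light. */
static double vertex_color(double emissive, double global_ambient,
                           const struct light_terms *lights, int n)
{
    double c = emissive + global_ambient;
    for (int i = 0; i < n; ++i)
        c += lights[i].att * lights[i].spot *
             (lights[i].ambient + lights[i].diffuse + lights[i].specular);
    return c;
}
```

The loop makes the cost model explicit: every additional enabled light adds another attenuated, spotlight-scaled sum per vertex per channel, which is why lighting dominates the geometry path.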
One final wrinkle to consider is that some graphics systems simply sum the ambient, diffuse, specular, and emissive contributions as in Equation 3.8. In the case of texture mapping, however, this can produce muted or undesirable specular highlights, because textures are applied after lighting. To eliminate this unwanted effect, the specular component can be computed separately, in effect causing two color values to be calculated per vertex, as in the following equations.
\[ color_{primary} = color_{emissive} + color_{global} + \sum_{i=0}^{n-1} (att_i)(spot_i)\,[\,color_{ambient} + color_{diffuse}\,]_i \qquad (3.9) \]

\[ color_{secondary} = \sum_{i=0}^{n-1} (att_i)(spot_i)\,[\,color_{specular}\,]_i \qquad (3.10) \]
In this case, during texture mapping, \(color_{primary}\) is combined with the texture color. The specular color (\(color_{secondary}\)) is then added to the result to produce more visible and realistic specular highlights.
Hardware support for lighting varies widely among graphics vendors. A graphics subsystem with a geometry engine will generally implement lighting calculations in hardware, but the extent to which such lighting calculations are performed depends heavily on the sophistication of the geometry processor and the types of lighting in the scene. Some vendors even go as far as implementing a lighting engine separate from the transform engine.
Some graphics subsystems will only implement infinite (also known as directional) lights in hardware,
while others will also implement local lights. In the case of infinite or directional light sources, the light
rays are considered parallel and as a result, the attenuation factor in Equation 3.4 is simply 1.0. This sim-
plifies the overall lighting model equation by removing the per-vertex calculation of the attenuation factor.
In OpenGL™, infinite or directional lights are specified with the fourth coordinate of GL_POSITION set to 0.0.
The specification of a local viewer also adds to the complexity of the lighting model equation. When a local viewer is specified, the calculation of the specular term in Equation 3.7 requires that the angle between the viewpoint and each object be calculated. With an infinite viewer, this angle is not used, producing slightly less realistic results at a reduced computational cost.
The number of lights actually implemented in hardware also varies widely among graphics subsystems. Lights in excess of the number implemented in hardware will be handled in software on the host CPU. If vendor-supplied documentation does not specify the hardware implementation, do some testing as outlined in Section 2.2.3 to determine the hardware support for lighting. If performance decreases linearly as additional lights are added, then lighting is implemented in software. If performance decreases only slightly up to a point where it falls off dramatically, then lights are implemented in hardware, with the number of hardware-supported lights given by the point at which performance drops off.
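The test described above can be automated: render with 1, 2, ..., n lights, record the frame rate, and look for the knee in the curve. The heuristic and its threshold below are illustrative, not a standard benchmark.

```c
/* fps[i] is the measured frame rate with i+1 lights enabled.  A
 * roughly linear decline suggests software lighting; a sharp drop
 * after some count suggests hardware support up to that count.
 * Returns the index of the first measurement whose frame rate fell
 * by more than drop_ratio relative to the previous one, or 0 if no
 * such knee exists. */
static int lighting_knee(const double *fps, int n, double drop_ratio)
{
    for (int i = 1; i < n; ++i)
        if (fps[i] < fps[i - 1] * (1.0 - drop_ratio))
            return i;
    return 0;
}
```

With measurements {60, 59, 58, 30, 15} and a 25% threshold, the knee is at index 3 (the fourth light), suggesting three hardware lights; a gently declining curve returns 0, suggesting software lighting throughout.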
A graphics card without geometry acceleration will not implement lighting. This forces all lighting
calculations to be performed in software on the host CPU. Independent of hardware or software imple-
mentation, typical lighting models are approximations that do not take into consideration effects such as
shadows or objects that reflect or radiate light. More sophisticated lighting models must be implemented
in software within the application or by the creative use of texture mapping.
Because lighting is the most compute-intensive part of the rendering pipeline, lighting code within the
graphics subsystems is typically tuned by the vendor to provide the best performance for the most common
cases. Lighting code is often the first code tuned, because tuning the lighting code offers the most return on
investment or bang for the buck. One aspect of this tuning is that values that do not change from vertex to
vertex are typically cached to reduce the per-vertex computational overhead. As such, changing properties
unnecessarily between vertices can force unnecessary calculations and impact performance. One example
in OpenGL™ is the use of glColorMaterial(), which specifies that material parameters track the vertex color. When glColorMaterial() is used, calculations that utilize material parameters must be recalculated between vertices.
Vertices that satisfy

\[ \begin{pmatrix} A & B & C & D \end{pmatrix} \begin{pmatrix} x_e \\ y_e \\ z_e \\ w_e \end{pmatrix} \ge 0 \]

lie in the half-space defined by the plane equation. Vertices that do not satisfy this condition do not lie in the half-space and are clipped away.
Note that when objects are clipped, while some vertices are clipped away, new vertices are added where an object intersects the clip volume. As a result, there can be a net increase, a net decrease, or no change in the number of vertices processed in subsequent stages of the pipeline. When new vertices are added, the color, texture, and depth values at those vertices are calculated by interpolating between the values at the original vertices.
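The interpolation of attribute values at a new clip vertex is a simple linear blend along the clipped edge. A minimal C sketch (function names are illustrative):

```c
/* Parametric intersection of the edge (p, q) with a clip plane,
 * given the signed plane-equation values at each endpoint.  The
 * endpoints must straddle the plane (opposite signs). */
static double intersection_t(double dist_p, double dist_q)
{
    return dist_p / (dist_p - dist_q);
}

/* Interpolate one attribute (a color channel, texture coordinate,
 * or depth value) between the edge's original endpoints at t. */
static double interpolate_attribute(double a, double b, double t)
{
    return a + t * (b - a);
}
```

For an edge whose endpoints evaluate to +1 and -1 against the plane, the intersection lies at t = 0.5, so each attribute of the new vertex is the midpoint of the endpoint values.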
An optimization performed in some implementations is to clip against the application-defined clip planes before the lighting stage. In cases where a significant amount of geometry may be clipped away, this significantly reduces the work required during the lighting stage, at the expense of having to track clipped vertices to access the per-vertex data required by the lighting model.
Perspective Transform
The perspective transform maps graphics primitives from eye space into clip space by multiplying the
xe , ye , ze , and we coordinates for each vertex in eye space by the 4 x 4 projection matrix P . Again, the
perspective transform, like the model-view transform, is a floating-point operation handled efficiently by
the geometry engine within the graphics subsystem hardware.
(xc, yc, zc, wc) = P (xe, ye, ze, we)
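The transform above amounts to a 4×4 matrix-vector multiply, sketched here in C (names are illustrative; the geometry engine performs the same arithmetic in hardware):

```c
#include <assert.h>

/* Multiply a column vector of eye coordinates (xe, ye, ze, we) by a
 * 4x4 projection matrix P (stored row-major here) to obtain clip
 * coordinates (xc, yc, zc, wc). */
static void project(const float P[4][4], const float eye[4], float clip[4])
{
    for (int r = 0; r < 4; ++r) {
        clip[r] = 0.0f;
        for (int c = 0; c < 4; ++c)
            clip[r] += P[r][c] * eye[c];
    }
}
```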
Clipping
Once in clip space, objects are clipped against the view volume specified by
−wc ≤ xc ≤ wc
−wc ≤ yc ≤ wc
−wc ≤ zc ≤ wc
As in the case for application-defined clip planes in Section 3.2.5, clipping is performed at this point
using the same method used previously. In this case, however, the plane equations are the six plane
equations that define the view volume.
Clipping can also be implemented by combining clipping against the application-defined clip planes and
view-volume clipping into a single operation. In this case, the clip volume and resulting plane equations
become the intersection of the half-spaces defined by the application-defined clip planes with the view
volume.
Perspective Division
Perspective division divides the xc , yc , and zc coordinates by wc to map each of the clip space coordinates
into normalized device coordinates xd , yd , zd .
(xd, yd, zd) = (xc/wc, yc/wc, zc/wc)
Viewport Transform
Finally, the viewport transform uses the viewport parameters of the application to map each of the nor-
malized device coordinates into window coordinates. Given a viewport’s width (w), height (h), and center
(ox , oy ), the window coordinates (xw , yw , zw ) for a vertex are calculated
xw = (w/2) xd + ox
yw = (h/2) yd + oy
zw = ((f − n)/2) zd + (n + f)/2
where n and f represent the near and far clipping planes specified by the application.
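Taken together, perspective division and the viewport transform map a clip-space vertex to window coordinates. A simplified C version follows (names are ours; real implementations fold these steps into the geometry hardware):

```c
#include <assert.h>

/* Map clip coordinates to window coordinates: perspective division to
 * normalized device coordinates, then the viewport transform using
 * viewport width w, height h, center (ox, oy), and the near/far depth
 * range [n, f]. */
static void to_window(const float clip[4], float w, float h,
                      float ox, float oy, float n, float f, float win[3])
{
    float xd = clip[0] / clip[3];   /* perspective division */
    float yd = clip[1] / clip[3];
    float zd = clip[2] / clip[3];
    win[0] = (w * 0.5f) * xd + ox;  /* viewport transform */
    win[1] = (h * 0.5f) * yd + oy;
    win[2] = ((f - n) * 0.5f) * zd + (n + f) * 0.5f;
}
```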
3.2.6 Rasterization
Rasterization is the process of converting a geometric primitive into an image. Rasterization determines
which pixels are occupied by a primitive and then calculates and assigns the appropriate color, depth, and
stencil values for each pixel. Graphics subsystems with dedicated rasterization hardware, also known as
rendering engines, perform this operation in hardware. However, there are typically special cases where a
software implementation must be used because a specific operation or rendering mode is not implemented
in hardware. As described in Section 3.4, this is known as falling off the fast path.
The process of rasterization can be broken down into two steps. The first step determines which pixels
in a 2D integer grid in window coordinates are occupied by a primitive. The second step assigns a color
and depth value to each pixel. Each pixel and its associated data is called a fragment.
Nearly all graphics subsystems today implement rasterization and at least some of the subsequent per-
fragment operations in dedicated hardware. It’s at this stage that primitives are matched to those supported
natively by the rendering engine and sequences of microcode-like rendering instructions are executed by
the rendering engine to perform rasterization and all subsequent per-fragment operations before a prim-
itive is displayed in the framebuffer. Rendering engines perform rasterization by executing a series of
rendering instructions. These instructions are commonly implemented in such a way that all subsequent
per-fragment operations required to render a primitive in the framebuffer are chained together and per-
formed in sequence with rasterization. However, rasterization is presented here with subsequent stages
and per-fragment operations discussed later.
Rasterization engines may not support all primitives natively. For example, triangle strips may
actually be rasterized by duplicating vertices and rasterizing individual triangles, and quads may be
rendered as two triangles. Non-native primitives are supported transparently to applications, except for
the speed at which these primitives are rasterized.
Points are typically rasterized by truncating their xw and yw window coordinates to integers. The result-
ing (x, y) then specify an address on the integer grid. Hardware that does not support points with a width
greater than 1.0 grid units may convert such points into quads.
Lines are rasterized by rendering instructions that typically implement Bresenham’s classic algorithm [18].
Support for lines with a width greater than 1.0 is implementation-dependent, with some hardware rendering
wide lines as quads, whereas other hardware might fall back to a software rendering path.
Polygon rasterization includes the rasterization of polygons, independent triangles, triangle strips,
triangle fans, and quadrilaterals. The first step in polygon rasterization is back-face culling. If culling is enabled,
only polygons that are determined to be front facing are rasterized. Because rasterization engines typically
operate on spans, higher order polygons are typically decomposed into triangles for easy rasterization us-
ing a scanline algorithm that linearly interpolates triangles along edges and linearly interpolates data across
horizontal spans.
Other rasterization problems that may cause an application to force software rasterization include these:
• Stippling
• Antialiasing (See Section 3.2.9)
• Additional polygon modes (edge flags, wireframe, point)
3.2.7 Texture
When texturing is enabled, a post-rasterization texture application stage is added to the basic 3D geometry
path. This new stage is combined with a 2D path, which acts on the image data to place it in texture
memory prior to application. The 2D texture path is described in Section 3.3.
When texture mapping is enabled the rasterizer will, when executing the scanline algorithm to rasterize
the triangle, interpolate between texture coordinates to compute an offset into texture memory. This texture
data is then used to color the fragment. This coloring occurs in different ways depending on the texturing
mode. For example, if a "replace" mode is selected, the texture color value is used as the color for
a particular fragment. If "blend" mode is selected, the texture color value is blended with the
triangle color value to determine the final fragment color.
3.2.8 Fog
Fog is often also referred to as depth cueing. When a fog operation is enabled, objects appear
to fade into the distance. The fogged color is computed by blending an incoming fragment’s post-texturing
color with the fog color, using a blending factor f computed in one of three ways:
f = e^(−density·z) (3.11)

f = e^(−(density·z)^2) (3.12)

f = (end − z)/(end − start) (3.13)
and z is the eye-coordinate distance between the viewpoint and the fragment center. Depending on the
graphics pipeline implementation, f may be computed at each fragment, or computed only at vertices and
interpolated. The fog operation is typically not implemented as a test and, therefore, fragments that are
completely fogged are not discarded.
3.2.9 Antialiasing
Antialiasing, the process of removing the “jaggies” from an object, if enabled, is one stage in the
rendering pipeline where your mileage will vary depending on the graphics pipeline implementation. An-
tialiasing at this stage is performed on a per-primitive basis.
In RGBA mode, antialiasing is implemented by multiplying the alpha values for a fragment by the
percentage of coverage of that pixel by a primitive. This alpha value is then used during the blending per-
fragment operation stage of the pipeline, to blend the fragment’s color with the color of the corresponding
pixel already in the framebuffer. In CI mode, the least significant bits of the color index for a pixel are set
according to the percentage of coverage. All ones represents complete coverage, and all zeros represents
no coverage.
Antialiasing of primitives other than lines of width 1.0 (wide lines, triangles, and quads) is often
implemented in software, even on relatively high-end graphics systems with dedicated rendering engines.
This is because the procedure for computing coverage values for these fragment types is difficult to
implement in dedicated hardware. Therefore, enabling antialiasing for primitives other than lines can
have a significant impact on performance. However, the same effect can be achieved by using multisampling
or another full-scene antialiasing technique such as the accumulation buffer [25] or oversampling, as
described in Section 7.2.3.
Scissor Test
The scissor test determines whether the fragment at location (xw, yw) lies within the rectangle defined by
left, bottom, width, and height. If left ≤ xw < left + width and bottom ≤ yw < bottom + height, then
the scissor test passes. This test is a simple inside/outside test commonly implemented by rasterization
hardware.
Alpha Test
The alpha test compares the alpha value of the current fragment against a constant value. Fragments which
do not pass the test are discarded. Again, this test is commonly implemented by rasterization hardware.
Stencil Test
The stencil test compares the value in the stencil buffer at location (xw , yw ) with a constant and discards
fragments that do not pass the test. The performance of this test is slow on systems that do not have a
hardware stencil buffer.
Depth Test
The depth test compares the depth value of an incoming fragment at (xw, yw) with the value currently
stored in the depth buffer at that location. If the test fails, the incoming fragment is discarded. This test
is slow when a hardware z-buffer is not available.
Blending
The blending operation combines the R, G, B, and A values of an incoming fragment with the R, G, B,
and A values stored at the fragment’s (xw, yw) location within the framebuffer. The blending
equations are implemented by rasterization hardware.
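As an illustration, the most common blending equation, source alpha over destination, can be sketched in C (the function name is ours; hardware supports many other blend functions):

```c
#include <assert.h>

/* Classic "over" blend: combine an incoming fragment color (src) with
 * the color already in the framebuffer (dst), weighted by the
 * fragment's alpha.  Equivalent to OpenGL's
 * glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA). */
static void blend_over(const float src[4], const float dst[4], float out[4])
{
    float a = src[3];
    for (int i = 0; i < 4; ++i)
        out[i] = src[i] * a + dst[i] * (1.0f - a);
}
```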
Dithering
Dithering is not typically implemented by rasterization hardware and should be avoided for optimum
performance.
Logic Op
The logic op test performs a logical operation between the color of the incoming fragment and the color
in the framebuffer at the fragment’s location. The result replaces the values in the framebuffer
at the fragment’s (x, y) coordinates. Logic operations are implemented by the rasterization engine.
these paths, the stages through which data passes along each path, and how these paths connect to the
same rasterization backend used by the 3D geometry path.
Unlike the 3D path, in which data moves in a single direction from memory to the framebuffer, data in
the image paths can move in the reverse direction as well. This “readback” from the framebuffer back to
graphics or system memory is a primary raison d’être for the imaging side of the graphics pipeline.
Unpack Pixels
This stage serves to convert the data from the format in which it is stored in memory, into an internal stor-
age format according to parameters supplied by the application. This stage also performs byte-swapping
and data-alignment operations.
For best application performance, pixel data should be stored within the application in a format that is
native to the graphics hardware. This minimizes the unpacking required and, as a result, improves
performance. Image data in a format native to the graphics hardware is typically transferred from
system memory to graphics memory using a DMA operation; non-native data requires conversion by the host
CPU.
3.3. IMAGE PATHS Developing Efficient Graphics Software
Imaging Subset
A new feature extension in OpenGL™ 1.2 is the Imaging Subset, a collection of additional pipeline stages
that provides pixel processing capabilities. Figure 3.5 illustrates how these additional stages fit within the
legacy 2D pipeline.
Each stage of the Imaging Subset operates on pixel data that results from the unpack stage in the
pipeline. Varying degrees of each function are typically implemented in hardware within the graphics
subsystem.
Color Table Lookup Color table lookup is used to replace a pixel’s color value with a color value previ-
ously defined in a lookup table (LUT). Typically, color LUTs can be used in three places within the
imaging pipeline:
Convolution The convolution stage implements pixel filter kernels that replace each pixel in an image by
some weighted average of it and its neighboring pixels. This functionality is useful for sharpening
or blurring, and other image filtering operations. As the number of pixels used in the average in-
creases, thus increasing the size of the convolution filter kernel, the number of required calculations
increases. This additional complexity makes hardware implementation of convolution extremely difficult,
except for small filter kernels. Hardware convolution implementations typically support only 3x3 or
5x5 kernels, although some higher-end graphics hardware can support kernels of up to 7x7. Vendor
documentation should indicate the maximum filter kernel
size supported by a graphics subsystem. If it does not, write a simple test program that performs
convolution and compare the results for different kernel sizes. A sudden increase in the time re-
quired to perform the convolution is a good indication that the maximum allowable hardware kernel
size has been exceeded.
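The core of such a test program is a straightforward direct convolution, sketched here in C as a CPU reference for a single-channel image (names are ours; a real test would drive the graphics API's convolution path and time it for several kernel sizes):

```c
#include <assert.h>

/* Apply a 3x3 convolution kernel to the interior pixels of a
 * single-channel W x H image.  Edge pixels are left untouched for
 * brevity; real implementations pick an edge policy. */
static void convolve3x3(const float *src, float *dst, int W, int H,
                        const float k[3][3])
{
    for (int y = 1; y < H - 1; ++y)
        for (int x = 1; x < W - 1; ++x) {
            float s = 0.0f;
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i)
                    s += k[j + 1][i + 1] * src[(y + j) * W + (x + i)];
            dst[y * W + x] = s;
        }
}
```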
Color Matrix This stage performs color matrix operations and linear transformation on pixel values. The
functionality is used to perform color space conversions within the graphics subsystems without
using the CPU. However, because operations are performed with fewer bits of precision in the
graphics subsystem than they would be on the host CPU, the precision may not be of
acceptable quality for certain applications.
Histogram The histogram function calculates the histogram for a specified pixel rectangle.
Minmax Given a pixel rectangle, the minmax function computes the minimum and maximum pixel com-
ponent values for a particular pixel rectangle. Rendering engines can be programmed to efficiently
implement this functionality.
Be aware that hardware which was not designed for OpenGL™ 1.2 may implement few, if any, of
these operations in dedicated hardware.
is to simply draw an object to the framebuffer, but it is quite common, in fact, to read rendered images
from the framebuffer back into system memory for storage or additional graphics processing.
In the Read Pixels case, pixel data that has been previously rasterized to the framebuffer is read and
passed through the same Pixel Transfer Operations and Imaging Subset stages described above. The pixel
data is then packed as specified by the application and placed into system or graphics memory.
Historically, the focus of graphics hardware architecture has been on high-performance geometry
processing and pixel fill rates at the expense of other operations, specifically the ability to read pixel
data back from the framebuffer. As a consequence, the read pixel operation has often been neglected
and performs poorly on many graphics subsystems. A common problem on AGP-based graphics subsystems
is that data transfers to the graphics subsystem at full AGP speed, with data moved on both the leading
and trailing edges of the clock, but data is read back from the graphics subsystem at only half that rate
because the interface on the graphics card can send data only on the leading edges.
Another common problem is that pixel read operations break the inherent pipelining in a graphics
subsystem: the read may occur before the graphics hardware has finished rendering. This is solved by
ensuring that all drawing is complete before attempting the read back. Another solution is to time the
read back to take place while the application is performing other CPU operations.
3.4. HARDWARE FAST PATHS Developing Efficient Graphics Software
3.3.5 Bitmap
Figure 3.9 illustrates one final 2D rendering path, the path for bitmap operations. This path is similar to
that for the Draw Pixels case, but bitmap data is obtained from system memory without Pixel
Transfer or Imaging Operations being performed; the bitmap data passes directly to rasterization.
This path is typically implemented in a graphics subsystem as a special case of the Draw Pixels path
in which the data is a single bit per pixel. Bitmap rendering is often slow because the hardware was
designed to render pixels as a combination of color components, and the packed one-bit-per-pixel
layout of a bitmap does not match that design.
when using a test program not to introduce other overhead that may invalidate the results. When targeting
more than one platform, use a least-common-denominator approach to stay within the intersection of
the different hardware fast paths, if possible. If graphics state and modes are forcing an application off
a fast path, change the code, within the constraints of the application, to more fully exercise the rendering
features of the graphics hardware.
3.5 Conclusion
The graphics rendering pipeline consists of geometry and imaging paths implemented as a combination
of hardware and software. Graphics operations not implemented in graphics hardware are performed
in software executing on the host CPU. This section described the operations of the 2D and 3D paths
through the graphics pipeline and the stages within these paths. Effective performance analysis and tuning
of a graphics application can be better achieved with a complete understanding of these paths and their
potential performance implications.
Section 4
The application tuning process can best be described as consisting of four non-exclusive stages as shown
in Figure 4.1. The first phase quantifies performance to compare how an application performs against the
ideal system performance. The second phase examines how system configuration impacts performance.
The third phase analyzes the graphics subsystem implementation and usage to determine whether an
application is CPU-bound or graphics-bound. The fourth and final stage focuses on bottleneck
elimination.
Before digging in and examining each stage of the process in detail, it should be stressed that the process
described here is iterative and is never really complete. When a bottleneck or application performance
problem has been identified and addressed, the tuning process should start anew in search of the next
performance bottleneck. Code changes as well as hardware changes can cause performance bottlenecks
to shift among the different stages of the rendering process and also shift between the CPU and graphics
subsystem. As a result, performance tuning is an ongoing process.
Application type plays a large role in determining the graphics demands on a system. Is the application
a 3D modeling application using a large amount of graphics primitives with complex lighting and texture
mapping, an imaging application performing mostly 2D pixel-based operations, or a scientific visualiza-
tion application that might render large amounts of geometry and texture? A good place to start is to know
the application space.
4.1. QUANTIFY: CHARACTERIZE AND COMPARE Developing Efficient Graphics Software
Primitive Types
Determine the primitive types (triangles, quads, pixel rectangles, etc.) being used by the application and
if there is a predominant primitive type. Identify if the primitives are generally 2D or 3D and if they are
rendered individually or as strips. Primitives passed to the graphics hardware as strips use inherently less
bandwidth, which is important during the analysis process. The easiest way to determine this information
is to examine the source code and the graphics API calls.
Primitive Counts
Determine the average number of primitives rendered per frame by instrumenting the code to count the
number of primitives between buffer swaps or screen updates. For primitives sent in lists, report the
number of lists and the number of primitives per list. Add instrumentation such that it can be enabled
and disabled easily with an environment variable or compiler flag. Consider enabling and using run-
time instrumentation to load-balance as application hardware utilization changes. Instrumentation also
provides a chance to examine the graphics code to determine how the primitives are being packaged and
sent to the graphics hardware. Later in this section, you will learn about tools to trace per-frame primitive
information.
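A minimal version of such instrumentation might look like the following C sketch (all names, including the APP_GFX_STATS environment variable, are hypothetical):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-frame primitive counters, enabled at startup via an
 * environment variable so the instrumentation costs almost nothing
 * when disabled. */
static int stats_enabled;
static long tri_count, quad_count;

static void stats_init(void)
{
    stats_enabled = getenv("APP_GFX_STATS") != NULL;
}

static void count_tris(int n)  { if (stats_enabled) tri_count += n; }
static void count_quads(int n) { if (stats_enabled) quad_count += n; }

/* Called at each buffer swap: report and reset the per-frame counts. */
static void stats_frame(void)
{
    if (!stats_enabled)
        return;
    fprintf(stderr, "frame: %ld tris, %ld quads\n", tri_count, quad_count);
    tri_count = quad_count = 0;
}
```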
When gathering primitive counts and other data, it is important to use the application and exercise code
paths as a real user would. The work process that a bona fide user encounters day in and day out is the
most useful to consider. It is also important to exercise multiple code paths when gathering data about
performance.
After determining the number of primitives, calculate the amount of per-primitive data that must be
transferred to the rendering pipeline. This exercise can be a revelation, inspiring thought about bandwidth
saving alternatives. For example, consider the worst case as illustrated in Figure 4.2. To render a triangle
with per-vertex color, normal and texture data requires 56 bytes of data per vertex, 168 bytes per triangle.
Rendering the three triangles individually requires 504 bytes of data (Figure 4.2A); rendering the triangles
as a strip only requires 280 bytes of data (Figure 4.2A), which saves 224 bytes. In a real application, this
Figure 4.2: Worst case per-vertex data for triangles. (A) Shown are three triangles, each vertex containing
position (XYZW), color (RGBA), normal (XYZ), and texture (STR). Rendering a single triangle requires
56 bytes of data per vertex, resulting in a total of 168 bytes of data. The set of triangles therefore requires
504 bytes of data. (B) The same triangles from A are now combined into a triangle strip. Each vertex
still requires 56 bytes of data, but because only 5 vertices are used, the total amount of data is 280 bytes,
saving 224 bytes.
savings increases dramatically. For example, rendering 5000 independent triangles would require 820 KB
of data. Combining the triangles into a single strip would require only 273 KB of data, roughly
one-third as much.
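The arithmetic above is easy to capture in code. This C sketch (names are ours) reproduces the byte counts for the 56-byte vertex of Figure 4.2:

```c
#include <assert.h>

/* Per-vertex payload from Figure 4.2: position XYZW + color RGBA +
 * normal XYZ + texture STR = (4 + 4 + 3 + 3) floats = 56 bytes. */
enum { BYTES_PER_VERTEX = 14 * 4 };

/* Independent triangles send 3 vertices per triangle. */
static long bytes_independent(long ntris)
{
    return ntris * 3 * BYTES_PER_VERTEX;
}

/* A triangle strip sends ntris + 2 vertices total. */
static long bytes_strip(long ntris)
{
    return (ntris + 2) * BYTES_PER_VERTEX;
}
```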
Lighting Requirements
Lighting requirements are a critical consideration in order to fully quantify the graphics requirements of
an application:
• Number of light sources
• Local or infinite light sources
• Lighting model
• Local or non-local viewpoint
• If both sides of polygons are lit
Lighting information is easily discovered by looking at the graphics API calls in the application source
code.
All the listed lighting variables affect the number and complexity of calculations that must be performed
in the lighting equations. For example, using local lights requires the calculation of an attenuation factor,
which makes them more expensive than using infinite light sources. Furthermore, the use of a local
viewpoint is more costly, because the direction between the viewpoint and each vertex must be calculated.
With an infinite viewer, the direction between each vertex and the viewpoint remains constant. When
two-sided lighting is used, lighting is done twice, once for the front face of a polygon and a second time
for the back face. Review section 3.2.4 for more information about how different lighting parameters can
change the computation complexity and performance of the lighting model equation.
4.2. EXAMINE THE SYSTEM CONFIGURATION Developing Efficient Graphics Software
Frame Rate
Measure the frame rate to determine the currently attainable number of frames rendered per second. The
best way to determine frames per second is to add instrumentation code to the application that counts
the number of buffer swaps or screen updates per second. Be aware that swaps may occur at screen-
refresh boundaries and single-buffer mode can eliminate some potential measurement artifacts here, as
described in more detail in Section 2.2.3. Some systems provide hooks (and tools, such as osview) into
the hardware, which can measure framebuffer swaps for any application.
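A simple frame-rate counter driven from the buffer-swap point might look like this C sketch (the function name is ours; in practice the timestamps would come from a monotonic clock such as clock_gettime):

```c
#include <assert.h>

/* Call once per buffer swap with the current wall-clock time in
 * seconds.  Returns the frames-per-second figure once at least one
 * second has elapsed, or -1.0 while still accumulating. */
static double fps_on_swap(double now_seconds)
{
    static double t0 = -1.0;
    static long frames;

    frames++;
    if (t0 < 0.0) {               /* first call: start the interval */
        t0 = now_seconds;
        frames = 0;
        return -1.0;
    }
    if (now_seconds - t0 >= 1.0) {
        double fps = frames / (now_seconds - t0);
        t0 = now_seconds;         /* reset for the next interval */
        frames = 0;
        return fps;
    }
    return -1.0;
}
```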
4.2.1 Resources
Memory
Insufficient memory in a system can cause excessive paging. Understand the memory requirements of
your application and measure them against the available memory in the system. Large amounts of disk
activity while an application is running indicate memory page swapping, a symptom of insufficient
physical memory or inefficient application memory usage. Swapping memory pages to disk negatively
impacts performance. Try to keep data small and in-cache as much as possible by creating and using
small, tightly packed, and efficient data structures. Large models and databases add to the overall memory
footprint of an application.
Consider how system memory is used to store graphics data. Some systems implement a UMA where
the framebuffer resides in system memory, and other systems might use AGP where some textures and
most graphics data are stored in system memory before a high-speed transfer to the framebuffer. These
two approaches to graphics hardware can affect performance in different ways.
In a UMA system, a set amount of system memory is reserved for the framebuffer at boot time. This
memory is not available to application programs and is never released. The performance advantage of
this approach is that graphics data can be rendered directly into the framebuffer, which removes the cost
of the additional copy from system memory to dedicated video memory, as found in more traditional
hardware. One caveat of this approach is that this memory is never available to an application. As a result,
if this effective loss of system memory is not taken into account by boosting the physical system memory
accordingly when configuring the system, an application that fits nicely on a traditional system may swap
on a UMA system.
On a system built around AGP, system memory is used to hold graphics data, but this memory is not
reserved for the framebuffer and can be allocated and freed as necessary so that it may be used by the
application. The use of system memory provides an application with space for textures and other graphics
data that otherwise would not fit in dedicated graphics memory. Copying data from system memory to
video memory is implemented as a DMA over AGP. One disadvantage of AGP texturing is that memory
access to non-resident textures requires a full fetch from main memory, with all the attendant performance
implications of main-memory access.
Know the memory access times and bus speeds of the system. Examine these in respect to the amount
of data that the application is moving around when rendering. Consider if an application’s optimal data
transfer per unit time will exceed that which can be provided by the memory and bus. No matter how
fast the CPUs in a system, the overall performance in some application domains will be limited by the
speed of the buses on which the CPUs sit. For example, in current Intel memory-controller-based
workstations, overall performance is governed by the front-side bus between the CPU and main memory.
Disk
Consider how the disk subsystem might affect the graphics performance of an application. In addition to
the type of disk (IDE, SCSI, fibre channel, etc.), consider the actual location of the disks and the application
requirements. Streaming video to the screen from a disk that cannot supply the data fast enough is
physically impossible, regardless of the speed of the graphics hardware. Store data and textures on local
disks, as fetching data across a network can be a significant bottleneck. Choose disks with the lowest
latencies and seek times.
Once again, the disk requirements vary greatly by application, so use appropriate disk resources for the
specific application.
4.2.2 Configuration
Display
Ensure that the latest driver from the graphics hardware manufacturer is installed on the system before
examining the display configuration. Manufacturers are constantly fixing bugs and tuning their drivers and
the latest driver will typically perform best. If it is unclear if a new driver will offer the best performance,
compare the performance between the new and old drivers. Use the driver that offers the best performance
for the application.
Almost all combinations of operating systems and window systems provide methods for setting the
configuration of the graphics display. This functionality dictates how the windowing system uses the
graphics hardware and consequently how an application uses the graphics hardware. Consider how the
current active display configuration relates to the actual hardware in the graphics subsystem as described in
Section 2.1.7. The display configuration should be set to take full advantage of the features implemented
in hardware and necessary for the application.
When display information is queried by an application, the windowing system passes the display ca-
pability information back to the application. An improperly configured display impacts performance by
forcing operations to be performed in software on the host CPU — operations that could have been per-
formed by the graphics hardware — which effectively forces an application off the fast path. Therefore it is
important to confirm that display properties are set properly within the window system before considering
the display properties available to an application. More often than not, poor performance or some aspect
of it can be attributed to a poorly configured display that does not take full advantage of hardware features.
There are often a number of visuals that match an application’s needs, and it is important to understand
the performance of the selected visual as it may not be the best performing or most feature rich.
Once the display is configured properly, it is the responsibility of the application to ensure that it is using
an appropriate configuration for the underlying graphics hardware. One way to do this is to have
the application run some simple benchmark tests at startup that exercise frequently used functionality,
and use the results of these tests to decide on an optimum display configuration. The following are
important display parameters to consider.
Pixel Formats / Visuals The pixel formats/visuals available dictate the color depth and the availability of
auxiliary buffers such as depth and stencil. Determine how the available pixel formats or visuals
compare with those required by an application. Have a fall-back strategy if the application can’t
get the desired pixel format. For example, if the display is configured such that there are no pixel
formats or visuals available with destination alpha, an application that draws alpha-blended shapes
forces the graphics driver to perform alpha-blending in software. A fallback for this scenario might
be to use stippled-alpha rather than blended-alpha.
Color Buffer Choose a visual that matches your application’s needs for color precision. For example,
a system may support visuals with 12 bits of precision available per color-component, but may
have no alpha planes available in this configuration. A second consideration is to choose visuals
that match, and only just match, the requirements for the application. Visuals with more precision
per-pixel induce extra fill work, and can potentially be a bottleneck.
Screen Resolution The screen resolution determines the number of pixels that must be filled for a given
frame. Determine the optimal screen resolution for an application. An application may run faster
at 1024×768 than at 1280×1024 because there are fewer pixels to fill. However, using a lower
resolution sacrifices visual quality, which may not be an acceptable trade-off.
Depth Buffer The depth buffer configuration indicates the resolution of the Z-buffer. Determine how the
resolution of the configured Z-buffer compares to the requirements of the application. Using a visual
or pixel format that does not support a hardware Z-buffer forces depth testing to be performed in
software. The actual resolution of the Z-buffer is important as well. Too many bits of precision
increase the fill overhead per pixel, whereas too few bits of precision create visual artifacts known
as Z-fighting or flimmering.
Auxiliary Buffers Several auxiliary buffers typically exist for a particular visual, and selection of a visual
with appropriate auxiliary buffers is essential. Typical additional buffers available include stencil,
accumulation, and stereo. Certain combinations of auxiliary buffers may force the rendering driver
off the fast path.
Buffer Swap Characteristics Determine if buffer swaps are tied to the vertical retrace of the graphics
display. If so, an application that can render a frame faster than the screen refresh rate (normally 60
Hz or 75 Hz) stalls to wait for a vertical retrace and buffer swap to complete. This anomaly is called
frame rate quantization and is described in more detail in Section 7.3.2. Many hardware graphics
drivers now let users disconnect buffer swaps from the vertical retrace to improve performance by
allowing an application to render to the back buffer as quickly as possible. Be aware that enabling
this disconnect may introduce unacceptable tearing in the display.
Network
The network can also play a role in the performance of an interactive graphics application. Use caution
when loading data and textures from a remote file system during rendering as network traffic and latencies
will affect performance. Also, consider what else might be happening on the network to cause a system
“hiccup” that would impact performance. For example, something as simple as receiving an e-mail, doing
a DNS lookup, or redrawing an animated GIF on a web page consumes CPU cycles that would otherwise
have been devoted to the application. Another issue to consider is remote rendering. Is all data and
rendering being performed locally, or are remote machines being used to augment the CPU processing
requirements? If so, understand the capabilities of all systems in a remote-rendering scenario and the
available bandwidth between them.
4.3.2 CPU-Bound
When the graphics subsystem processes data in the FIFO faster than the CPU can place new data into
the FIFO, the FIFO empties, which causes the graphics hardware to stall waiting for data to render. In
this case, an application is CPU-bound because the overall performance of the application is governed by
how fast the CPU can process data to be rendered. Here, the balance between the stages of the rendering
pipeline done in hardware and in software is such that all available CPU cycles are consumed preparing
data to be rendered while additional unused bandwidth may be available in the graphics subsystem. An
application in this state can also be described as being host-limited. In this scenario, the CPU is running
at 100% utilization, while the graphics subsystem is running at less than 100% utilization and may even
be idle.
4.3.3 Graphics-Bound
If the graphics subsystem is processing data slower than the FIFO is being filled, the FIFO will issue an
interrupt causing the CPU to stall until sufficient space is available in the FIFO so that it can continue
sending data. This condition is known as a pipeline stall. The implications of stalling the pipeline are that
the application processing stops as well, awaiting a time when the hardware can again begin processing
data. An application in this state is graphics-bound such that the overall performance is governed by how
fast the graphics hardware can render the data that the CPU is sending it. A graphics application that is
not CPU-bound is graphics-bound. A graphics-bound application can be either fill-limited or geometry-
limited. In this situation, the graphics subsystem is running at 100% utilization, while the CPU is running
at less than 100% utilization.
Fill-Limited
A fill-limited application is limited by the speed at which pixels can be updated in the framebuffer, which
is common in applications that draw large polygons. In the context of the graphics pipeline as described
in Section 3, an application that is fill limited is limited by the speed at which rasterization and subsequent
pipeline stages can be executed. The fill limit, specified in megapixels/s, is determined by the rasterization
capabilities of the graphics accelerator card. As the fill limit is reached, consider increasing application
geometry load to improve the balance between fill and geometry operations.
Geometry-Limited
An application that is geometry-limited is limited by the speed at which vertices can be lit, transformed,
and clipped. Programs containing large amounts of geometry, or geometric primitives that are highly
tessellated can easily become geometry-limited or transform-limited. An application that is transform-limited
is limited by the speed at which the per-vertex and primitive assembly operations can be performed.
The geometry limit is determined by both the CPU and the graphics hardware. The limit depends on
the hardware capabilities of the graphics subsystem and where the geometric calculations are performed.
In this case, consider reducing geometry calculations to improve the balance between fill and geometry
operations.
Figure 4.3: GTXR-D configuration — the traverse, generate, transform (Xform), and rasterize stages all run on the host CPU; only display is handled by dedicated hardware.
GTXR-D As shown in Figure 4.3 all rendering stages are performed on the host CPU. If the CPU renders
pixel values into the framebuffer faster than the screen is refreshed, and swap buffer calls are tied
to the vertical retrace of the screen, the CPU stalls before rendering the next frame. In this case, an
application is graphics-bound. However, since the CPU is calculating all other rendering stages, on
this type of hardware, an application is much more likely to be CPU-bound.
Ultimately, CPU speed, scene complexity, and monitor refresh rate dictate the balancing point. If an
application proves to be graphics-bound, increase the monitor refresh rate or disconnect buffer swaps
from the vertical retrace. For a CPU-bound application, reduce the scene complexity by eliminating
geometry. Another option is the use of more efficient graphics algorithms for rasterization, depth,
stencil testing, etc. Also, consider potential code optimizations described in Sections 5 and 6.
As more and more parts of the graphics rendering pipeline as described in Section 3 have been
implemented in special-purpose hardware, this type of graphics system is becoming less prevalent
in 3D graphics workstations and 3D game systems. For this reason, it is not discussed in more detail.
GTX-RD As shown in Figure 4.4, screen-space operations are performed in hardware, whereas geometric
operations are performed on the host CPU. If the CPU can generate and send the screen-space prim-
itives and the associated rendering commands faster than the rendering subsystem can process them,
the graphics FIFO fills causing the CPU to stall. When this happens, the application is graphics-
bound. However, with this type of graphics subsystem, it’s much more likely that an application will
be CPU-bound as the CPU is required to perform the approximately 100 single-precision, floating-
point per-vertex operations required to transform, light, clip test, project, and map vertices from
object-space to screen-space [9].
Figure 4.4: GTX-RD configuration — traverse, generate, and transform (Xform) run on the host CPU; rasterize and display are handled by the graphics board.

Figure 4.5: GT-XRD configuration — traverse and generate run on the host CPU; transform (Xform), rasterize, and display are handled by the graphics board.
Whether graphics-bound or CPU-bound, there is a balance between the CPU speed, the scene
complexity, and the graphics hardware. For an application that is CPU-limited, reduce the number of
calculations required by reducing the scene complexity. If an application is graphics-bound, in addi-
tion to the options given above for the GTXR-D adapter, one way to improve performance would be
to use a smaller graphics window, reducing the scan conversion area. Another improvement would
be to reduce depth complexity and the number of times a pixel is drawn. This is accomplished by
not rendering objects that are occluded in the final image (see Section 7.2).
GT-XRD As shown in Figure 4.5, even though much of the rendering burden has been removed from the
CPU, an application can still be CPU-bound if the CPU is totally consumed packaging and sending
down data to render. A more common scenario is the graphics FIFO fills up causing the CPU to stall
while the graphics subsystem performs all the transformation, lighting, and rasterization.
An application that is CPU-bound might be performing some calculations that could be performed
more efficiently on the specifically designed graphics hardware. If so, offload these tasks to the
graphics subsystem. For example, ensure that the application is using a lighting model implemented
in the graphics hardware. Another example is to use the graphics hardware to perform matrix oper-
ations as GT-XRD hardware is designed to efficiently perform matrix multiplication operations. Do
this by specifying these operations with the graphics API, and let the dedicated graphics hardware
do the required calculations, thereby freeing the CPU to perform other tasks.
For graphics-bound applications, consider moving some of the eye-space lighting or shading calcu-
lations back to the host CPU, or packaging the data into formats that are more easily processed by
the graphics subsystem. Try using display lists or compiled vertex arrays to limit setup time required
by the graphics hardware. Also, reducing lighting requirements may reduce the computational
complexity of the lighting calculations.
• Shrink the graphics window. If the framerate improves, then the application is fill-limited as the
overall performance is limited by the time required to update the graphics window. Shrinking the
graphics window effectively shrinks the viewport, which shrinks the size of primitives and reduces the
fill requirements. Before doing this test, ensure that the behavior of the application does not change.
Some applications change their behavior and send down fewer polygons when the graphics window
is made smaller. This behavior invalidates the test, as not only are the fill requirements reduced,
but the geometry requirements are reduced as well.
• Reduce geometry processing requirements. Render using fewer/no lights, material properties, pixel
transfers, clip planes, etc., to reduce the geometry processing demands on the system. If the frame
rate improves and the graphics subsystem is responsible for geometry processing, then an application
is graphics-bound. But, if lighting and geometric processing is performed on the host, then an
increase in frame rate in this case is typical of an application which is CPU-limited.
• Remove all graphics API calls. This establishes a theoretical upper limit on the performance of an
application. The quickest way to do this is to build and link with a stub library. If after removing all
the graphics calls the performance of the application does not improve, the bottleneck is clearly not
the graphics system. The bottleneck is the application code and in either the generation or traversal
phases. Keep this stub library in your bag of tricks for further use.
• Use a system monitoring tool to trace unexpected and excessive amounts of CPU activity. This is a
sure sign that an application has fallen off the fast path and has become CPU-bound doing software
rendering. Often, a simple state change can cause this. This is actually a common problem on
GTX-RD and GT-XRD subsystems where not all rendering modes are implemented in hardware.
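The stub-library trick mentioned above can be sketched in a few lines of C. This is a hypothetical fragment, not from the course notes: empty definitions are linked in place of the real GL library so all rendering work disappears, establishing the application's non-graphics performance ceiling. Only a few representative entry points are shown (a real stub library covers every GL call the application makes), and the call counter exists only so this sketch can be exercised; a production stub would do literally nothing.

```c
/* Hypothetical OpenGL stub library: empty entry points linked in place of
 * the real GL library. If performance does not improve when rendering is
 * stubbed out, the bottleneck is in the generation or traversal phases,
 * not the graphics subsystem. */
typedef float GLfloat;
typedef unsigned int GLenum;

static long stub_calls = 0;   /* instrumentation for this sketch only */

void glBegin(GLenum mode)                        { (void)mode; ++stub_calls; }
void glVertex3f(GLfloat x, GLfloat y, GLfloat z) { (void)x; (void)y; (void)z; ++stub_calls; }
void glNormal3f(GLfloat x, GLfloat y, GLfloat z) { (void)x; (void)y; (void)z; ++stub_calls; }
void glEnd(void)                                 { ++stub_calls; }

long stub_call_count(void) { return stub_calls; }
```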
Figure 4.6 shows how to combine these techniques into a comprehensive graphics performance analysis
procedure. Follow this procedure as a first step in the analysis of the graphics subsystem performance.
4.4 Bottleneck Elimination
Bottlenecks are not limited to the graphics subsystem and can occur in all parts of the system and arise
from a number of causes. Listed below are a few common causes of bottlenecks within an application
categorized by the subsystem in which they occur. All the bottlenecks listed here affect graphics rendering
performance, regardless of the subsystem within which they occur.
4.4.1 Graphics
Bottlenecks are most common in the graphics subsystem. Where, when, and how severe a bottleneck is
depends largely on the combination of hardware and software that implements the graphics pipeline and
how the application utilizes the graphics pipeline. Since the graphics pipeline implementation is a fixed
and unchangeable entity, bottleneck elimination in this case should focus on changing the application to
better utilize the graphics subsystem.
Data Formats
Pixel and texture data that is not in a format native to the graphics hardware must be reprocessed by the
graphics driver to repackage it into a native format before it can be rendered. An example of this might be
the conversion of RGB data to RGBA. This increased rendering overhead could create a bottleneck within
the system. A list of native data formats can be found in the graphics hardware documentation.
State Changes
The graphics subsystem is a state machine that is set up for rendering a particular primitive according
to the settings of that machine. Changing state adds rendering overhead as the rendering pipeline must
be revalidated after each state change before rendering can occur. Excessive state changing can cause a
bottleneck in the graphics subsystem when more time is actually spent validating the state than rendering.
To avoid unnecessary state changes, organize data so that primitives with similar if not identical char-
acteristics are rendered sequentially (without differing data in-between). Avoid redundant state calls and
cache important state information within the application. For example, one way to sort data may be by
• Transform
• Lighting model (one vs. two side, local vs. infinite, etc.)
• Texture
• Material
• Color
However, exercise caution, and don’t blindly pick a sorting methodology. Measure the relative expense
of each state change and hierarchically order the sort accordingly. Also, refer to vendor documentation for
hints as to the relative costs of various state changes.
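A state-sorted traversal along these lines can be sketched as follows. The record layout and the cost ordering (texture assumed most expensive, then material) are assumptions for illustration, not part of the course notes; measure the actual costs on the target hardware before fixing a sort order.

```c
/* State sorting sketch: sort draw records by their most expensive state
 * first so that primitives sharing state render back-to-back, minimizing
 * the state changes (and pipeline revalidations) issued per frame. */
#include <stdlib.h>

typedef struct {
    unsigned texture;   /* texture object id (assumed most expensive) */
    unsigned material;  /* material index    (assumed cheaper)        */
    /* ... per-primitive vertex data ... */
} DrawRecord;

static int by_state(const void *a, const void *b)
{
    const DrawRecord *ra = (const DrawRecord *)a;
    const DrawRecord *rb = (const DrawRecord *)b;
    if (ra->texture != rb->texture)
        return ra->texture < rb->texture ? -1 : 1;
    if (ra->material != rb->material)
        return ra->material < rb->material ? -1 : 1;
    return 0;
}

/* Sort, then count the state changes the traversal would issue. */
int state_changes(DrawRecord *recs, int n)
{
    int changes = 0, i;
    qsort(recs, (size_t)n, sizeof recs[0], by_state);
    for (i = 0; i < n; ++i) {
        if (i == 0 || recs[i].texture  != recs[i - 1].texture)  ++changes;
        if (i == 0 || recs[i].material != recs[i - 1].material) ++changes;
    }
    return changes;
}
```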
Pipeline Queries
The graphics subsystem is optimized to receive graphics data and attribute information from an application
and render the resulting primitives according to the current state settings. Avoid querying the pipeline for
state information as this will break the inherent pipelining and cause the graphics subsystem to stall.
Cache important state information within the application.
Texture Paging
Textures that do not fit in texture cache on a graphics subsystem must be transferred to the graphics
subsystem prior to rendering. Traditional PCI bus-based graphics subsystems have limited local graphics
memory in which to hold texture data. Such systems, therefore, are required to cache textures from system
memory over the 132 MB/s shared PCI bus. In this scenario, loading and using textures that do not fit
in the local texture cache can be a bottleneck. The AGP architecture solves this problem by providing
a high-speed dedicated bus for the transfer of texture data from system memory to graphics memory.
UMA systems also provide support for large textures by implementing the framebuffer directly in system
memory. In the case of UMA, no copy of the texture data is required for rendering.
One solution to reduce texture paging in OpenGL is to use the texture LOD extension to reduce
the resolution and subsequently the texture size until a texture fits into texture memory. When texture
paging is unavoidable, amortize the cost of loading textures by calling glAreTexturesResident()
to determine which textures are resident in texture memory and then rendering all objects which require
the resident textures.
Use texture objects to optimize rendering and texture management. Texture objects are persistent. As
such, they can be stored in on-board texture memory which may prevent a texture from being downloaded
to the rasterizer every frame. If texture objects are not an option, consider encapsulating texture commands
into a display list. Avoid unnecessary switching between textures by rendering primitives that use the same
texture together.
Lighting

Use one-sided rather than two-sided lighting where possible by setting GL_LIGHT_MODEL_TWO_SIDE to
GL_FALSE in glLightModel. To use one-sided lighting effectively, all normals must be consistent
with respect to the geometry. In other words, all normals must point either “out” or “in.”
Ensure that all lights, and the state characteristics of those lights are required and add to the overall
visual quality of the scene. For example, using directional or infinite light sources rather than local lights
removes the per-vertex calculation of the attenuation factor in Equation 3.4. In OpenGL, infinite or
directional lights are specified with the fourth coordinate of GL_POSITION set to 0.0. Remove lights that
don’t add to the visual clarity of the scene. Each additional light requires evaluation of the lighting model
equation at each primitive for flat shading and each vertex for Gouraud shading.
Normalization
Normalization within the graphics rendering pipeline can create a bottleneck on the CPU or in the graphics
hardware depending upon where such calculations are performed for a given implementation. Avoid nor-
mal recalculation by the graphics rendering pipeline by ensuring that all normals are normalized within the
application prior to specification to the graphics subsystem. When all normals are guaranteed to be
normalized by the application, disable automatic renormalization in OpenGL by disabling GL_NORMALIZE.
Using rasterization operations that your application does not require increases rendering overhead and
creates a rasterization bottleneck. Rasterization operations such as texture, fog, antialiasing, and other
per-fragment operations (blending, depth, stencil, scissor, logic operations, and dithering) as discussed in
Section 3.2.10 could be unnecessary for an application and should be disabled when appropriate. Bot-
tlenecks of this type can occur in either the graphics subsystem or on the host CPU, depending on the
rendering pipeline split between hardware and software.
Examine application code to ensure that all rendering states enabled are required to achieve the resulting
images. Turn off unused features and attributes when they have no visible effect. For example, depth
testing can be turned off when it is not required. In a visual simulation application, draw background
objects such as the sky and ground with the depth buffer disabled then enable the depth test for foreground
objects such as mountains, trees, buildings, etc. In another example, if low-quality texturing is acceptable,
use only bilinear filtering instead of trilinear.
Geometry
Processing large amounts of geometry (such as lighting and transformations) can cause a bottleneck even
with hardware-accelerated graphics subsystems. In every system, there is a point where the system be-
comes geometry-bound, where the system cannot transform and light the amount of geometry specified
at satisfactory frame-rates. Lessen geometric requirements — if you can’t see it, don’t draw it. See
Section 7.2 to learn about various culling techniques.
Consider using billboards to replace complex geometry as described in Section 7.2.3. Textures can
also be used to implement approximate per-pixel lighting models for hardware that does not support
per-pixel lighting. More generally, think of textures as simply one-, two- or three-dimensional lookup
tables. Texture coordinates can be used to extract any specific data point within texture space and apply
that point’s properties to a vertex. This broadens the usefulness of textures but requires some thought to
see immediately how to apply it within an application. See [35] for further description and ideas.
Depth Complexity
Consider how many times the same pixels are filled. Avoid drawing small or occluded polygons by culling
unseen or insignificant geometry as described in Section 7.2.
Function Overhead
A common cause of bottlenecks is function call overhead on the transfer rate between the host and graph-
ics. While some systems may have a host interface that uses look-up tables for graphics API subroutines
and DMA to transfer data between the CPU and graphics, most systems do not and require a function
call for each graphics API call. Function call overhead is not negligible, because the system must save
the current state, push the arguments on the stack, jump to the new program location, return and restore
the original state. Using many small calls, such as glVertex3f, instead of batching data with aggregate
mechanisms, such as vertex arrays, can cause the CPU to do excessive work and create a bottleneck on the
host, leaving the graphics subsystem underutilized.
Avoiding these types of bottlenecks is quite simple. Use primitive strips to reduce the raw amount of
data sent to the pipe. Use aggregate calls such as vertex arrays and display lists to reduce function-call
overhead. Use vector arguments instead of individual vector elements in function calls to reduce the data
copies on the stack. Another way to reduce call overhead is to eliminate function calls which set state
to the same value as is already current. Don’t send state information that has not changed to the graphics
subsystem.
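Eliminating redundant state calls can be done with a small shadowing wrapper. The sketch below is hypothetical (not from the notes); the function pointer stands in for a real, expensive API call such as glShadeModel, and the application keeps the shadow copy rather than querying the pipeline.

```c
/* Redundant-state elimination: shadow each state value in the application
 * and call into the graphics API only when the value actually changes. */
typedef void (*set_state_fn)(int value);

typedef struct {
    int value;          /* shadow copy of the current API state */
    int valid;          /* 0 until the first real call is made  */
    set_state_fn set;   /* the underlying (expensive) API call  */
} CachedState;

/* Returns 1 if the API was actually called, 0 if the call was skipped. */
int cached_set(CachedState *s, int value)
{
    if (s->valid && s->value == value)
        return 0;                   /* unchanged: skip the API call */
    s->set(value);
    s->value = value;
    s->valid = 1;
    return 1;
}

/* Stand-in for an expensive driver call, counting invocations (for this
 * sketch only). */
static int driver_calls = 0;
static void fake_api_set(int value) { (void)value; ++driver_calls; }
```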
Vertex Formats
When using vertex arrays, consider using interleaved and precompiled vertex arrays. Interleaved arrays
allow for multiple arrays to be specified with a single function call. Using interleaved arrays also specifies
that the data is tightly packed and can be accessed in one piece. When the data is tightly packed, the
graphics subsystem can make assumptions about the layout thereby reducing required pointer calculations
during traversal. When precompiled arrays are used, data can be transferred from host memory to graphics
using DMA.
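The tightly packed layout described above can be expressed as a single struct. The sketch below mirrors OpenGL's GL_C4F_N3F_V3F interleaved-array format (color, then normal, then position per vertex); one stride walks all three attributes, which is what lets the subsystem skip per-attribute pointer arithmetic.

```c
/* Interleaved vertex layout matching OpenGL's GL_C4F_N3F_V3F format:
 * all attributes for a vertex are contiguous, so a single stride
 * (the struct size) steps from one vertex to the next. */
#include <stddef.h>

typedef struct {
    float color[4];    /* C4F */
    float normal[3];   /* N3F */
    float pos[3];      /* V3F */
} VertexC4N3V3;

/* The stride between consecutive vertices is just the struct size. */
size_t vertex_stride(void) { return sizeof(VertexC4N3V3); }
```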
Consider using a display list for static geometry that is drawn many times. However, within a display
list, for optimal performance, don’t replace a single instantiation of an object with multiple copies. Also,
be careful not to make display lists excessively small. In this case, the overhead to traverse the display
list may outweigh the time savings over immediate mode rendering. One final caveat with display lists is
to understand how nested display lists may create memory fragmentation and caching problems that will
impact performance.
be reduced to 1.
4.4.3 Memory
Inefficient storage of graphics data within memory and inefficient memory management in general can
cause a bottleneck in the memory system.
Memory Allocation
Memory allocation requires a system call and an expensive kernel context switch from user mode to system
mode. As a result, allocating memory within the rendering loop causes rendering to stall until the
system call returns and user mode state is restored. Allocate all memory for graphics primitives before
beginning the rendering loop to prevent stalls of this type.
Data Copies
Making local copies of data consumes CPU cycles that could otherwise be used for graphics or other
data processing within the application. Avoid making local copies of per-vertex data for API calls. For
example, don’t copy individual X, Y, and Z coordinates into a vector to make a graphics API call when
the coordinates can be sent down individually.
Memory Bandwidth
Each transfer of data from memory to graphics requires overhead and system bus traffic. Amortize this
overhead and maximize data bandwidth by organizing per-vertex data to allow use of vertex arrays. Vertex
array code is optimized to efficiently step through memory to obtain the per-vertex data and transfer
it efficiently to the graphics hardware. Data in precompiled vertex arrays can be transferred from host
memory to graphics using DMA. Display lists may also be a solution to reduce bus traffic on platforms
where display list data is cached in local graphics memory.
In the case of textures, combine multiple small textures into one large texture as a mosaic, changing the
texture coordinates accordingly to map into the larger super-texture. This maximizes the amount of texture
data downloaded for the fixed overhead cost of the operation. Also when using textures, consider using
glTexSubImage* to redefine a subregion of an existing texture. This optimizes the use of available
memory bandwidth by downloading only the portion of a texture that has changed and not the whole larger
texture.
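The coordinate change for a texture mosaic is a simple rescale. In the sketch below (not from the notes), a sub-texture occupying a rectangle of the super-texture has its original [0, 1] coordinates mapped into that rectangle:

```c
/* Texture mosaic remapping: a small texture placed at (u0, v0) with size
 * (du, dv) inside a larger super-texture has its [0,1] coordinates
 * rescaled into that sub-region. */
typedef struct {
    float u0, v0;   /* lower-left corner of the sub-texture in the atlas */
    float du, dv;   /* size of the sub-texture, in atlas coordinates     */
} AtlasRegion;

void remap_texcoord(const AtlasRegion *r, float u, float v,
                    float *out_u, float *out_v)
{
    *out_u = r->u0 + u * r->du;
    *out_v = r->v0 + v * r->dv;
}
```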
Memory Fragmentation
Sparsely packed data causes memory fragmentation and, as a result, poor cache behavior. Avoid memory
fragmentation by allocating memory for per-vertex data from a preallocated pool. This reduces expensive
memory paging operations when traversing graphics data.
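A preallocated pool of this kind can be sketched as a bump allocator over one up-front block (a hypothetical fragment, not from the notes; a fuller version would thread freed chunks onto a free list):

```c
/* Preallocated vertex pool: one allocation made before the render loop is
 * carved into fixed-size chunks, so per-vertex data stays densely packed
 * and no malloc happens while rendering. */
#include <stdlib.h>

typedef struct {
    unsigned char *base;  /* block allocated before the render loop */
    size_t chunk;         /* fixed chunk size in bytes              */
    size_t count;         /* total chunks in the pool               */
    size_t next;          /* next unallocated chunk                 */
} VertexPool;

int pool_init(VertexPool *p, size_t chunk, size_t count)
{
    p->base  = (unsigned char *)malloc(chunk * count);
    p->chunk = chunk;
    p->count = count;
    p->next  = 0;
    return p->base != NULL;
}

void *pool_alloc(VertexPool *p)
{
    if (p->next == p->count)
        return NULL;                       /* pool exhausted */
    return p->base + p->chunk * p->next++;
}

void pool_destroy(VertexPool *p) { free(p->base); p->base = NULL; }
```

Because successive allocations are adjacent, traversing the per-vertex data touches consecutive cache lines rather than scattered pages.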
Figure 4.8 demonstrates how the use of triangle strips and vertex arrays can reduce memory fragmen-
tation and maximize the use of available memory bandwidth. In this example, rendering 3 triangles as
independent triangles requires 9 vertices and 504 bytes of memory while using triangle strips, or a vertex
array to render the same 3 triangles requires only 280 bytes of data. This reduces the required memory
bandwidth by 45%. In the vertex array case, vertex data is contiguous in memory thereby reducing page
faults and subsequent memory paging as the data is traversed.
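The arithmetic behind this comparison is worth making explicit (56 bytes per vertex is the figure's assumption): three independent triangles need 3 vertices each, while a strip of n triangles needs only n + 2 vertices.

```c
/* Memory cost of independent triangles vs. a triangle strip,
 * per the Figure 4.8 comparison. */
#include <stddef.h>

size_t independent_bytes(size_t tris, size_t vertex_size)
{
    return 3 * tris * vertex_size;           /* 3 vertices per triangle */
}

size_t strip_bytes(size_t tris, size_t vertex_size)
{
    return (tris + 2) * vertex_size;  /* first tri: 3 verts, then 1 each */
}
```

For 3 triangles at 56 bytes per vertex this gives 504 versus 280 bytes, the roughly 45% savings cited above, and the gap widens as strips get longer.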
4.4.4 CPU
Another common place to uncover system bottlenecks is the host CPU. This is especially true on systems
that implement a large part of the graphics rendering pipeline in software. In this case, the most common
bottleneck will be geometry processing as the CPU performs all transform and lighting calculations. To
remedy this situation, follow the suggestions under Geometry in Section 4.4.1.
4.4.5 Disk
The inefficient storage and loading of data from disk into memory at run time can cause the file system to
become a bottleneck. Ensure that texture and program data are stored locally and that the disk can handle
the transfer requirements (for example, video streaming requires a disk system that can transfer the data
fast enough to maintain the frame rate).
4.5 Use System Tools to Look Deeper
Typically, the rendering loop in an application is executed per-frame, so analysis of a single frame can be
applied to all frames. Do this by examining all the API calls between buffer swaps or screen updates. Be
on the lookout for repeated calls to set graphics state and rendering modes between primitives. Tools such
as OpenGL debug (see Figure 4.9), APIMON (see Figure 4.10), and ZapDB provide these capabilities.
• Interrupt Time
A large percentage of time spent servicing hardware interrupts can indicate that a system is graphics-
bound as the graphics hardware interrupts the CPU to prevent graphics FIFO overflow.
• Page Faults
A large number of page faults, indicating that a process is referring to a large number of virtual
memory pages that are not currently in its working set in main memory, could signal a memory
locality problem.
• Disk Activity
A large amount of disk activity indicates that an application is exceeding the physical memory of a
machine and is paging.
• Network Activity
A large amount of network activity indicates that a system is being bombarded with network packets.
Servicing such activity steals CPU cycles from application performance.
Because tools differ by platform, it is impossible to adequately describe them here. More detail is pre-
sented in the next section but, in general, a developer should know the tools available on their development
platform.
Figure 4.9: Sample output from ogldebug, an OpenGL tracing tool. (A) Call count output from a running
OpenGL application. (B) A call history trace from the same OpenGL application. (C) Primitive count
output from the same OpenGL application.
4.6 Conclusion
Tuning a graphics application to take advantage of the underlying hardware is an iterative process. First,
basic understanding of the graphics hardware is necessary, followed by analysis of its capabilities, pro-
filing of the application, and subsequent code changes to achieve better performance. One key concept
in graphics tuning is to try to attain a balance among the various components involved in the rendering
cycle. Balancing workload among CPU, transformation hardware, and rasterization hardware is essential
to maximize performance of a graphics application. Applying the tuning procedures and tips described in
this section to a graphics application will yield a more complete understanding of the graphics pipeline,
application usage of that pipeline, and, after tuning, better utilization of that pipeline for faster application
performance.
Section 5
By this point in the course, overall system graphics performance has been characterized, tuned, and can
perform at an acceptable level. The next step is to profile the code, which simply means using system
tools to identify the slow parts of application software. These tools insert extra counters in the executed
assembly code that track sections of software as they are executed. Analyzing the data output from these
counters reveals where the code bottlenecks lie. The data generated from these counters can measure many
aspects of application performance such as the number of times a line of code was executed or the number
of CPU cycles taken to execute sections of code. You can use the results of this analysis to rewrite, or
tune, slow sections of software, once the slow parts are identified.
your system calls. In a well-balanced application, the system-time is a fraction of the user-time.1 Not all
system function calls are expensive, of course, but understanding the effects of a system call is important
before using it. Similarly, libraries or utilities, which in turn execute system calls, for example, memory
allocation functions, need to be handled carefully. Don't allocate memory in time-critical graphics code.
Although this is elementary, it might not be obvious that other utilities (for example, some GUI functions)
may themselves allocate memory; understand the work being done by the libraries in an application, and
use caution if these calls are in a tight loop.
Some computer systems have a FIFO queue in between the host system and the graphics system to
smooth transfer of data (see Section 4.3 for more details.) The queue can force a CPU stall if it becomes too
full, thus stalling program execution. The state of the queue (stalled, full, or empty) during intense graphics
activity can tell you if the host is flooding the graphics pipeline. Use tools described in Section 4.1.1 to
gather data about the FIFO usage.
Additionally, some systems and CPUs report the amount of swapping or the number of cache misses occurring while the program executes. If swap activity is high, more physical memory or better use of the existing memory is needed. If cache misses are high, the data needs to be packed more tightly in memory, and an analysis and rewrite of the code is necessary to determine where to do that tuning. Although newer chips tend to have larger caches, a larger cache only temporarily mitigates poor cache usage: it is far too easy to write code that thrashes even the largest cache. Use profiling to identify the offending code and rewrite it.
Profiling results depend heavily on the chosen data set and usage scenario. Choose the data set that best represents typical customer data, and run the software being profiled the way a customer would run it. A poor choice of data or usage while profiling leads to optimizations that are not particularly relevant. Another consideration is that instrumented code can take significantly longer to execute. Running the instrumented executable produces a data file with timing results that can then be interpreted, as shown in the example below.
Step 1: Instrument the executable.

    % instrument foo.exe

Step 2: Run the instrumented executable on carefully chosen data.

    % instrumented.foo.exe -args

Step 3: Analyze the results using a profiling tool such as the Unix prof tool.

    % prof foo.exe.datafile
Consider the example foo.exe shown in Figure 5.2. This example has two functions of interest, old_loop and new_loop, which add and print the sum of all the values in the array x. A third function, setup_data, is used only to set up the data; we ignore it for now. The function old_loop (Figure 5.2A) is the original function prior to profiling, and the second function, new_loop (Figure 5.2B), is the improved function resulting from application tuning.
A: Code the old way

    #define NUM 1024
    19: void old_loop() {
    20:     sum = 0;
    21:     for (i = 0; i < NUM; i++)
    22:         sum += x[i];
    23:     printf("sum = %f\n", sum);
    24: }

B: Code the new way

    27: void new_loop() {
    28:     sum = 0;
    29:     ii = NUM % 4;
    30:     for (i = 0; i < ii; i++)
    31:         sum += x[i];
    32:     for (i = ii; i < NUM; i += 4) {
    33:         sum += x[i];
    34:         sum += x[i+1];
    35:         sum += x[i+2];
    36:         sum += x[i+3];
    37:     }
    38:     printf("sum = %f\n", sum);
    39: }

Figure 5.2: Code of foo.exe for the profiling example. (A) Original function old_loop. (B) Improved function new_loop with the loop unrolled.
What does the analysis tell us about this code segment? Figure 5.3 shows the output for the test run. The function old_loop took 6168 cycles to complete. Now the fun begins: analyzing why the code is "slow" and how to make it better. How could it be rewritten to run faster? Notice that old_loop (Figure 5.2A) is essentially one large loop and nothing else. Unrolling the loop yields the function new_loop, shown in Figure 5.2B. (More about loop unrolling in Section 6.4.4.) Re-profiling the new executable shows (Figure 5.3B) that new_loop takes only 4625 cycles, a savings of 25%.
Figure 5.3: Results of profiling. (A) The basic profiling block of the original code. Shown is the function
list in descending order by ideal time. (B) Profiling block of the modified code. Shown is the function list
in descending order by ideal time. (C) Line analysis for both original and modified code. Shown is the
line list in descending order by time.
In addition to the time each function takes, the analysis reports which lines of code execute most often. The second part of the report (Figure 5.3C) provides that data. (For simplicity, old_loop and new_loop are included in the same file in this example, and each is called once.) Note that lines 21 and 22 of old_loop were each executed 1024 times, which makes sense given how the code was written. Approximately 2 cycles per iteration went to loop overhead and 4 cycles per iteration to the loop body. In new_loop, the loop body took 4615 cycles (978 + 3 × 968) to execute, a little more than in old_loop (4096). However, the loop overhead dropped from 2061 cycles (old_loop) to 733 (new_loop) because the loop test and increment execute fewer times. This is the primary source of savings from the loop-unrolling optimization.
How does this savings compare on other systems? The functions old_loop and new_loop were combined into one file, compiled under Visual C++, and run on an Intel CPU. The results (Figure 5.4) show that new_loop beats old_loop by about 40 percent.
Figure 5.4: Profile comparison of new_loop and old_loop using Visual C++ on an Intel CPU.
5.4 Conclusion
Code profiling is critical for optimal application speed. Code profiling tools make it relatively simple to
gain a basic understanding of how well different parts of the application software execute. Profiling also
gives a glimpse into the effects of instruction and data caching. This is done by comparing the basic block
results against profiling data from a statistical sampling profile.
While profiling the application is easy, it can be difficult to find a code change that yields better per-
formance. It is somewhat of an art. Initially, it may take several iterations for software changes to realize
performance gains. Tenacity is key here. The next section discusses some common C and C++ perfor-
mance techniques that will help.
Figure 5.5: Example PC-sampling profile showing memory latency. (A) Code for three functions that traverse an array, each traversing the indices in a different order. (B) Report showing basic block analysis. (C) Report showing PC-sampling analysis.
Section 6
A good developer writes good code based on abilities honed throughout the educational process and work experience. However, a good software developer should also make effective use of the available tools, such as compilers, linkers, and debuggers. This section shows how effective use of a compiler can greatly increase the overall performance of graphics software, and it concludes by addressing a variety of language considerations for writing software in C or C++.
Compilers on different platforms come with different levels of quality and kinds of optimization. Study the compiler documentation carefully for insight into how certain optimizations perform and how they change the generated code. Code compiled with debug information often exhibits different bugs than code compiled with optimizations.
Discovering and working with optimizations can be well worth the effort. Consider the well-known Dhrystone benchmark as an example. The benchmark measures how many iterations (or loops) of a specific code fragment can be executed in a given time; more loops mean faster code. In Figure 6.1, the unoptimized benchmark achieves 239,700 loops. With the first level of optimization, 496,353 loops are achieved in the allotted time. Better yet, with the highest level of optimization plus tuning for a specific computer, 1,023,234 loops are achieved, nearly four times faster than the original benchmark.
One common complaint about compiler optimizations is that they break the application code. Most often this is due to a problem in the code, not in the compiler: perhaps an inherently incorrect statement, code that does not adhere to the C or C++ standard, or an implicit dependence on some dubious practice. It is true, however, that optimizations may lead to different mathematical results, because rearranged lines of code can change arithmetic round-off. The author must make the final decision about each optimization, carefully weighing its advantages and disadvantages.
A final word on debugging code: never ship a final product with debugging enabled (it has happened!). Debug code is much slower than optimized code and can be used to reverse-engineer the software, which may launch a premature entry into the open-source arena. Always check that executables and libraries are stripped before shipping.
• void initializeList(); allocates a number of list elements and prepares them for use by
the application.
• void destroyListElement( list * ); returns the specified element to the pool of ele-
ments, and marks that new element as available for redistribution by the pool.
The same behavior can be achieved much more seamlessly in C++ with class constructors and an overloaded operator new. A pooling scheme like the one described above is much better than a simple malloc-based approach because it greatly increases the likelihood that list elements will reside next to each other in the cache. It does not ensure that elements will be in cache, but it increases the probability that they will.
A key trade-off in this kind of memory management is the work and space devoted to managing the pool. One issue to consider is how much pre-allocation to do. If too many elements are allocated, the application's overall memory requirements increase, though performance may improve. If too few are allocated, then when the store of pre-allocated elements is exhausted, another segment must be allocated, which takes time when the application expected only a simple create, and which reintroduces the memory fragmentation the pool was meant to avoid in the first place. Again, it is important to consider the balance of work in an application. Improving cache behavior definitely improves application performance if data access is an important and time-consuming task. However, pursue the changes that most affect the application being tuned, so if
the application does not use linked lists, time invested in improving the cache behavior of lists will not be particularly useful. Memory management techniques such as pooling matter most for data types that are used in large numbers and are frequently allocated and deallocated. Consider memory allocation and usage patterns for the data structures most commonly used by an application, and spend the tuning effort there.
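A minimal free-list pool in the spirit of the interface sketched above might look like the following; the element type, pool size, and createListElement name are hypothetical, and a production version would grow the pool rather than return NULL when exhausted:

```c
#include <stddef.h>

typedef struct list_elem {
    struct list_elem *next;
    int               key;
} list_elem;

#define POOL_SIZE 1024

/* One contiguous slab: neighboring elements share cache lines. */
static list_elem  pool[POOL_SIZE];
static list_elem *free_list;

/* Chain every slab entry onto the free list once, up front. */
void initializeList(void)
{
    for (int i = 0; i < POOL_SIZE - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_SIZE - 1].next = NULL;
    free_list = &pool[0];
}

/* Hand out the head of the free list: no malloc in the hot path. */
list_elem *createListElement(void)
{
    list_elem *e = free_list;
    if (e)
        free_list = e->next;   /* NULL when the pool is exhausted */
    return e;
}

/* Return an element to the pool for redistribution. */
void destroyListElement(list_elem *e)
{
    e->next = free_list;
    free_list = e;
}
```

Successive creates hand back adjacent slab entries, which is exactly the cache-locality property the text describes.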
A:

    struct str {
        struct str *next;
        struct str *prev;
        large_type  foo;   /* lots of user data structures */
        int         key;   /* not cached until explicitly referenced */
    };
    struct str *ptr;

B:

    struct str {
        struct str *next;
        struct str *prev;
        int         key;   /* likely to be cached in already */
        large_type  foo;   /* lots of user data structures */
    };
Figure 6.1: Example of how data structure choice affects performance. (A) Typical linked list data struc-
ture with the reference locator, key, not cached with the next or previous pointers. (B) Modified
version of linked list in A with key relocated to be cached with the next and previous pointers.
The data structure can easily be rearranged, as shown in Figure 6.1B, so that when next or prev is referenced, key is likely to be cached in as well. Because next, prev, and key are probably only a few bytes each, they should all fit in most cache lines, so the reference to key avoids a cache miss. This optimization improves cache behavior only for a single record at a time, because the large foo structure still sits between successive next pointers; when traversing the list, it is likely that only a single next pointer is brought into cache per lookup. Allocating the foo data outside the list and storing a pointer to foo inside the list makes searching much more cache friendly, with the data itself only a pointer dereference away. Naturally, the cache line size affects the effectiveness of this optimization.
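The out-of-line variant described above can be sketched as follows; the payload type and the find routine are illustrative, not from the notes. Searching touches only the compact nodes, and the bulky payload is dereferenced only on a hit:

```c
#include <stddef.h>

/* Stand-in for the bulky user data. */
typedef struct { char blob[4096]; } large_type;

/* Payload stored out of line: nodes stay small, so many
 * next/prev/key triples fit in each cache line during a search. */
typedef struct node {
    struct node *next;
    struct node *prev;
    int          key;
    large_type  *foo;   /* one pointer instead of 4 KB inline */
} node;

/* Walks only the small nodes; foo is dereferenced only when
 * the key matches. */
large_type *find(node *head, int key)
{
    for (node *n = head; n != NULL; n = n->next)
        if (n->key == key)
            return n->foo;
    return NULL;
}
```

The node is a few dozen bytes regardless of how large the payload grows, which is what keeps traversal cache friendly.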
Figure 6.2: Example of how data structure packing affects memory size. (A) A non-packed data structure foo. (B) Memory space used by the foo data structure, illustrating the wasted memory. Because the data structure is not packed, each 8-bit character wastes 16 bits, for a total space of 5 words. (C) Packed version of the data structure shown in A. (D) Memory space used by the packed foo data structure. The packing allows all three characters to be placed in the same word, resulting in only 8 bits of wasted memory and a total space of 3 words.
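The effect described in Figure 6.2 can be observed directly with sizeof. This sketch assumes a typical ABI with 4-byte, 4-byte-aligned ints; the field names are hypothetical:

```c
/* Characters interleaved with ints force the compiler to pad each
 * char out to the next int alignment boundary. */
struct foo_padded {
    char a;
    int  x;
    char b;
    int  y;
    char c;
};

/* Grouping the ints first lets all three chars share one word. */
struct foo_packed {
    int  x;
    int  y;
    char a;
    char b;
    char c;
};
```

On such an ABI the padded layout occupies five 4-byte words and the grouped layout three, matching the figure; the exact sizes are implementation defined, but the grouped layout is never larger.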
As with application profiling, choosing representative data is the most important factor in the relevance of these tools. If sample data is chosen poorly, the rearrangement of procedures in the executable might actually be slower for a more common usage scenario. Contact specific hardware vendors for more information about their tools.
Figure 6.3: Example of loop unrolling. (A) Original function old_loop. (B) Improved function new_loop with the loop unrolled.
In the example shown in Figure 5.3, the code segment took 6168 cycles to complete its work. By reducing the loop overhead relative to the amount of work accomplished, the improved code took 4891 cycles, a savings of approximately 25%. Of course, the value of NUM and the choice of an unrolling factor other than 4 affect the total savings achieved.
Which loops are good candidates for unrolling? "Fat" loops, those that complete a lot of work relative to the overhead, are poor candidates, as are loops with small iteration counts, where the total savings is likely to be negligible. Loops containing function calls should also be skipped, because the calls are likely to be far more expensive than the loop overhead. Note, however, that there are drawbacks to loop unrolling. First, it adds visual clutter and complexity to the code, because the loop operations are duplicated. Second, as code is duplicated, loop unrolling can
increase the code size. Finally, the compiler may already perform loop unrolling as an optimization, in which case the compiler's work may obviate the effects of a manual unroll.
6.4.5 Arrays
Large data arrays may exhibit poor cache behavior when a loop strides through the data. For example, in image processing, where array sizes are often large, it is frequently more efficient to break the array into smaller sub-arrays sized to reside within the L1 or L2 cache. This technique is often called cache blocking. As another example, consider a loop that walks down the columns of an array. If each row is aligned such that elements along the row axis are cached in with each access, then walking down a column caches in a new row of data on every loop iteration. If the array is instead accessed across rows, the data is already in cache and is accessed much more quickly. Try swapping row and column access in large array manipulations to see if performance improves. Different compilers place arrays differently in memory, so verify how the specific compiler being used allocates array memory. As Section 2.1.5 points out, accessing array elements from cache is far faster than accessing them from main memory.
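The row-versus-column effect can be sketched as follows. C stores arrays in row-major order, so both functions below compute the same sum, but only the first walks memory sequentially; the array size is an arbitrary illustration:

```c
#define ROWS 512
#define COLS 512

static float a[ROWS][COLS];

/* Stride-1 walk: each cache line is fully consumed before the
 * next one is fetched. */
float sum_by_rows(void)
{
    float s = 0.0f;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Stride of COLS floats: every access may land in a different
 * cache line, causing a miss per element on large arrays. */
float sum_by_cols(void)
{
    float s = 0.0f;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}
```

The results are identical; only the memory access pattern, and therefore the running time on large arrays, differs.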
Figure 6.4: Example of optimization using temporary variables. (A) Original code. (B) Optimized version.
Figure 6.5 demonstrates how, within a function, a temporary variable, tmp, can replace repeated stores through the pointer np into the global array newPnt. References through global pointers may prevent the compiler from keeping intermediate values in registers and can incur caching penalties. The substitution results in better caching behavior and increased performance, up to a 50% faster loop with some compilers.
A:

    float *c1, *c2, *c3, *c4,
          *op, *np;
    c1 = m;     c2 = m + 4;
    c3 = m + 8; c4 = m + 12;
    for (j = 0, np = newPnt; j < 4; ++j)
    {
        op = oldPnt;
        *np   = *op++ * *c1++;
        *np  += *op++ * *c2++;
        *np  += *op++ * *c3++;
        *np++ += *op * *c4++;
    }

B:

    float *c1, *c2, *c3, *c4,
          *op, *np, tmp;
    c1 = m;     c2 = m + 4;
    c3 = m + 8; c4 = m + 12;
    for (j = 0, np = newPnt; j < 4; ++j)
    {
        op = oldPnt;
        tmp  = *op++ * *c1++;
        tmp += *op++ * *c2++;
        tmp += *op++ * *c3++;
        *np++ = tmp + (*op * *c4++);
    }
Figure 6.5: Example of optimization using temporary variables with a function. (A) Original code. (B)
Optimized version.
In C and C++, pointers are used to reference and perform various data operations on sections of memory. If two pointers point to potentially overlapping regions of memory, those pointers are said to be aliased [14]. To be safe, the compiler must assume that any two pointers with the potential to overlap are aliased, which can severely restrict its ability to optimize their use by reordering or parallelizing the code. However, if the compiler knows that two pointers can never overlap (never be aliased), significant optimization is possible.
Consider the code example from Cook [14] (Figure 6.6A). This code is excerpted from an audio appli-
cation, but the problems of aliasing are common to graphics applications as well. In this example, p1 may
point to memory that overlaps memory referenced by p2. Therefore, any store through p1 can potentially
affect memory pointed to by p2. This prevents the compiler from taking advantage of instruction pipelin-
ing or parallelism inherent in the CPU. Loop unrolling may help here, but there is an even simpler solution
in this case.
Ideally, the compiler would recognize on its own when pointers cannot alias and optimize accordingly, but this is unrealistic in any large software project, and standard C provided no way for the programmer to declare which pointers may be aliased and which may not. To solve this problem, the Numerical Extensions Group/X3J11.1 proposed a new C keyword, restrict [43], which asserts that a pointer is the sole means of access to its data, that is, that it is not aliased. (The const keyword in C and C++ provides a related capability by telling the compiler that certain data will not be modified.) Using restrict, the code in Figure 6.6A can be rewritten as shown in Figure 6.6B. Cook [14] reports a 300% performance improvement with this technique over the original code in his example. Moreover, adding the keyword and recompiling is a much simpler and faster change than unrolling the loop.
Figure 6.6: An example of pointer aliasing. (A) Function with pointer aliasing. (B) Revised function using
the restrict keyword to optimize pointer aliasing.
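Figure 6.6 itself is not reproduced in this text, so here is a minimal sketch of the same transformation, loosely modeled on Cook's audio example; the function and parameter names are hypothetical. restrict requires a C99 (or later) compiler:

```c
/* (A) The compiler must assume out, in1, and in2 may overlap, so it
 * cannot safely reorder or overlap the loads and stores. */
void mix(float *out, const float *in1, const float *in2, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in1[i] + in2[i];
}

/* (B) restrict promises each pointer is the only access path to its
 * data for the life of the call, freeing the optimizer to pipeline
 * and vectorize the loop. */
void mix_restrict(float * restrict out,
                  const float * restrict in1,
                  const float * restrict in2, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in1[i] + in2[i];
}
```

Both versions compute the same result when the arrays genuinely do not overlap; passing overlapping arrays to the restrict version is undefined behavior, which is precisely the promise that enables the optimization.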
• Use the const keyword wherever possible to ensure that writes to read-only objects are detected
at compile time. Some compilers can also perform some optimizations on const objects to avoid
aliasing.
• Understand how temporary objects are created. As objects are converted from one type to another (through type conversion and coercion), temporary copies can be created, invoking constructor code and allocating extra memory. Compilers can sometimes warn about this issue.
• Understand what overloaded operators exist for objects in an application. Overloaded operators
offer another path into user-written code that can be of arbitrary complexity. Despite the visual
readability of overloading an operator to perform vector addition, for example, problems can occur when the operand types differ and the compiler attempts to reconcile them through type conversion and coercion, incurring the costs associated with temporary objects.
• Inline small functions as a compiler hint wherever possible. Inlining can replace a function call with in-place code, speeding execution.
• Understand how a compiler behaves when using C++ keywords such as inline, mutable, and
volatile. Use of these keywords can affect how data is accessed and how compiler optimization
is performed.
• Profile how run-time type identification (RTTI) performs on the systems on which an application
will run. In some cases, adopting an application-specific type methodology may be more efficient,
even though RTTI is part of the ANSI standard.
6.5.4 Templates
Templates are another C++ language feature allowing high levels of code reuse. Templates preserve type safety while allowing the same code to operate on multiple data types; an efficient implementation of an operation need be written only once. Templates can be difficult to debug, so one practical approach is to implement the code first as a concrete class and templatize it only after it has been debugged. Another route to efficient template usage is to use commercial libraries or the Standard Template Library (STL1), now part of the ANSI language specification. Extensive use of templates may cause code expansion because of the techniques compilers use to instantiate template code; read the compiler documentation to learn how templates are instantiated on a particular system.
1. The Standard Template Library: http://www.cs.rpi.edu/~musser/stl.html
Section 7
7.1 Introduction
In this course, you have learned some tools and techniques to determine how well an application is run-
ning and how to improve performance. Although tuning the individual parts of an application increases
performance, tuning can only go so far. The metaphor for this section is as follows: “The most highly
tuned bubble sort in the world is still a bubble sort and will be left in the dust by any decent quicksort
implementation.” The goal of this section is to describe some additional techniques for improving appli-
cation performance and show how these techniques can be combined with knowledge of the application
domain and system architecture to produce high-performance applications.
Each application is written to solve a specific domain problem, and each problem domain comes with a
set of requirements to which the application must adhere. These requirements differ sometimes drastically
among domains. For example, a visual simulation application might be required to run at a 30-Hz or even
60-Hz constant frame rate; the frame rate in a scientific visualization application might be measured not in
frames per second, but seconds per frame; and an interactive modeling application might require a delicate
balance between interactive user response and image quality. There are many more domains, each with
its own set of requirements. An application writer needs to look at these requirements to determine how
the application as a whole fits together to solve the user’s problem. Furthermore, these requirements are
usually not mutually exclusive: an application typically need not maximize a high constant frame rate or scene fidelity alone, but must strike a balance between the two.
This section covers idioms that are used to increase perceived graphics performance, and application-
level architectures that use these idioms to achieve the best possible application performance. This section
primarily emphasizes interactive applications. Therefore, many of the techniques described do not fit
well into an application where the end result is only a generated image, but rather are appropriate for
applications where the goal is user-interactivity in generating images.
7.2 Idioms
idiom: The syntactical, grammatical, or structural form peculiar to a language [56].
The language of the computer is very specific — one misplaced symbol, and the computer no longer
does what is expected. When that language is used for a graphics application, similar, although not as
catastrophic, results can follow. For example, an application might not meet the needs of the users if it is
not architected properly. There are many idioms that help in architecting a graphics application, and these
generally take the form of reducing the information that must be rendered. The basic premise of these idioms is that an application need only render what the user sees, and only in as much detail as the user can perceive. This may seem obvious, but precious few applications effectively apply all of the techniques described here.
The following sections outline some useful idioms for reducing the information that needs to be rendered
(culling), reducing the complexity of the information that gets rendered (level of detail), and reducing the
amount of data that has to be transferred at a given stage of the pipeline (caching).
Effective use of these idioms reduces both the geometry load and the pixel-fill load of an application, enabling it to render much more complex scenes in a shorter amount of time. Unfortunately, this effective speedup can introduce a feedback loop that causes swings in frame rate. Reducing the graphics load increases the effective frame rate, which in turn reduces the time available for non-rendering tasks such as culling and level-of-detail selection; doing those tasks less well adds geometry load back to the graphics system, and so on around the loop. Therefore, when using culling and multiple levels of detail, it is also necessary to have a frame-rate control mechanism that balances the graphics and CPU load.
7.2.1 Caching
Caching is the well-known technique of locally storing data that is expensive to recompute or to fetch from remote storage. In a graphics application, caching reduces data transfer by storing graphics information at one stage of the graphics pipeline so that it does not have to be regenerated or retransmitted. Applied well, caching can minimize data generation, accelerate traversal, and possibly avoid rendering altogether.
A display list is a data structure that stores graphics commands in a format optimized for fast traver-
sal and transfer to the graphics system. Display lists may be provided by the graphics vendor or may
be implemented within an application. Vendor-supplied display lists optimize traversal by precompiling
graphics API calls into graphics commands and data structures in a format native to the graphics system.
This format is aligned for rapid transfer to the graphics hardware by the CPU and may, depending on the
system, be transferred by DMA. If your graphics vendor does not provide native display lists, it is often
advantageous to implement a display list within your application. For example, if your application edits
and displays NURBS or other parametric surfaces, a display list can store the surface tessellations as triangle strips, removing the need to retessellate. Both types of display list can contain meta-information, such as bounding boxes, to enable other optimizations. Because display list generation and editing take time,
display lists are best for caching static geometry that will be displayed more than changed. Display lists
are stored in system memory, and their memory requirements need to be balanced against the performance
acceleration they supply. In most cases, display lists provide a useful performance boost at a reasonable
cost.
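An application-level display list can be as simple as a cached tessellation guarded by a dirty flag; the structure and names below are hypothetical, not from a vendor API, and the tessellation step is a stand-in for any expensive geometry generation:

```c
#include <stdlib.h>

typedef struct {
    float *tris;     /* cached triangle-strip vertices */
    int    n_tris;
    int    dirty;    /* set whenever the surface is edited */
} surf_cache;

static int tessellations = 0;  /* counts real retessellations */

/* Stand-in for an expensive step: tessellate into n triangles
 * (9 floats each: three xyz vertices). */
static float *tessellate(int n)
{
    tessellations++;
    return calloc((size_t)n * 9, sizeof(float));
}

/* Retessellate only when the surface has changed; otherwise reuse
 * the cached data, just as a display list replays compiled commands
 * instead of re-issuing API calls. */
float *get_tris(surf_cache *c, int n)
{
    if (c->dirty || c->tris == NULL) {
        free(c->tris);
        c->tris   = tessellate(n);
        c->n_tris = n;
        c->dirty  = 0;
    }
    return c->tris;
}
```

Repeated draws of an unedited surface hit the cache and pay nothing; an edit sets the dirty flag and triggers exactly one retessellation on the next draw.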
Many applications roam through a large database stored on disk. The database may contain geometry, or
in other cases, image or volume data formatted as texture. It can be worthwhile to organize the database
spatially, and to create a cache for the data most likely to be displayed next. If multithreading is used, the cache can be loaded by prefetching instead of on demand.
For imaging and volume visualization applications, the data is easy to organize as tiles or bricks, with
the nearest spatial neighbor implicit in the data definition. These applications are particularly amenable to
data caching and prefetching. Applications that display 3D geometry are harder to organize in this way,
because the natural hierarchy created by the user is not always as spatially coherent as in the imaging or
volume cases.
7.2.2 Culling
One of the most effective ways of improving graphics rendering performance of a scene is to not render
all the objects in that scene. Culling is the process of determining which objects in a scene need to be
drawn and which objects can safely be elided. In other words, the objects of the scene that can safely be
elided are those that are not visible in the final rendered scene. This concept has fostered years of research
work [20, 58, 57, 12, 23] and many useful techniques.
The premise behind culling is to determine if a geometric object needs to be drawn before actually
drawing it. Therefore, the first step is to define the objects to be tested. In most cases, it is not compu-
tationally feasible to test the actual, perhaps very complex, geometric object, so a simpler representation
of the object is used, the bounding volume. This representation can take the form of a bounding sphere, a
bounding box, or even a more complex bounding convex hull.
A bounding sphere is a point and a radius, defined to completely encompass the extents of the geometry
that it represents. A bounding sphere is very fast and efficient to test against, but not very accurate in
determining the extents of the object. Bounding sphere extents are fairly accurate when the dimensions of
an object are similar. For example, box-shaped objects such as buildings, cars, and engines are usually well
represented by bounding spheres. However, bounding spheres are a poor representation in many cases,
particularly when a single dimension is much larger than another. For example, the bounding sphere of an
elongated object in a scene is much larger than the true extents of the object. Such objects include pens,
trees, missiles, and railroad cars that are not particularly well represented by bounding spheres.
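As a sketch of the kind of test a bounding sphere makes cheap, here is a sphere-versus-plane cull against one half-space; a full frustum cull repeats this for each of the six planes. The types and names are illustrative, and the plane is stored as a unit normal n and offset d with n·p + d ≥ 0 meaning inside:

```c
typedef struct { float x, y, z; } vec3;
typedef struct { vec3 c; float r; } sphere;  /* center, radius */
typedef struct { vec3 n; float d; } plane;   /* unit normal, offset */

static float dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* True when the sphere lies entirely outside the plane's half-space:
 * everything the sphere bounds can safely be elided. A sphere that
 * merely straddles the plane must still be drawn (or descended into,
 * in a hierarchy). */
int sphere_culled(sphere s, plane p)
{
    return dot(p.n, s.c) + p.d < -s.r;
}
```

The whole test is one dot product and a compare, which is why sphere tests are run first and tighter (but costlier) bounding-box tests are reserved for the levels where accuracy matters.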
Significant efficiency is gained by grouping objects spatially and testing the bounding sphere of the
larger group instead of testing each individual object in that group. For this to be effective, the geometry
for the scene needs to be grouped hierarchically with bounding sphere information determined at the
lowest levels and propagated up the tree. A bounding sphere test of a large group of geometry can quickly
determine that none of its contained geometry needs to be tested, thus avoiding the test of each geometric
object.
The process of recursively testing a bounding sphere and, if needed, the child geometry contained in the
bounding sphere can continue all the way down to individual geometric objects. You can use bounding
boxes of the actual geometry when you need a more accurate test of the geometric extents. The level at
which the bounding sphere test stops and the point at which bounding box tests are started can be based
on the amount of time allotted to culling the scene or set to a fixed threshold. The cull time needs to be
balanced with the draw time. A very accurate cull that takes more time than the allotted frame time is not
very useful. On the other hand, an early termination of the cull that causes excess geometry to be drawn
slows down the overall frame rate.
Bounding boxes also suffer from some of the same problems as bounding spheres. In particular, a poorly
oriented bounding box has the same problems as a bounding sphere representing an elongated object —
poor representation of an object leading to inaccurate culling. Iones et al. have published a paper on the
determination of the optimal bounding-box orientation [33].
Backface Culling
Manifold surfaces always have some polygons that are facing the viewer and others that are facing away
from the viewer. Polygons that are facing away from the viewer are not visible and do not need to be
rendered. The process of determining which polygons are frontfacing (visible), which are backfacing (not
visible), and eliding those that are backfacing is called backface culling [58]. Backface culling is done on
a per-object, and sometimes per-primitive, basis.
OpenGL performs backface culling as the first step of rasterization, after clipping, transformation, and
lighting (see section 3.2.6). This only eliminates rasterization, which is not helpful for applications bound
by transformation and lighting. In GTX-RD graphics systems, it can therefore be worthwhile to perform
backface culling explicitly, before transformation takes place.
A simple approach to calculating the face of a polygon is to take the dot product of the polygon normal
and a ray from the camera (or eye-point). If the dot product is negative, the polygon is facing toward the
user and needs to be drawn. If the dot product is positive, the polygon is facing away from the user and can
safely be elided. One point regarding dot products that needs attention is the meaning of the dot product
sign. When the user is inside the object, the meaning of the positive and negative dot product is reversed.
The possibility of the eye point entering an object needs to be handled in all cases where the direction
of the normal is important, such as lighting. Backface culling adds an additional case to the handling of
flipped normals.
Before implementing your own backface culling, test your application performance and check your
vendor’s documentation. OpenGL backface culling may be adequate, and if not, your vendor may provide
an extension to perform camera-space backface culling.
Occlusion Culling
A more complex form of culling, occlusion culling, is the process of determining which objects within
the view frustum are visible. Only objects that are not blocked by other objects (or seen through them)
from the current viewpoint appear in the final rendered scene. Objects that block the view of others are
known as occluders, and the objects they hide are known as occludees. The goal of an occlusion culling
algorithm is to determine an optimal set of occluders; objects hidden by that set can safely be elided, and
only the remaining visible objects need to be drawn.
The key to an effective occlusion culling algorithm is to determine which objects in a scene are oc-
cluders. In many cases, you can use the information available in the application domain as a means to
help determine the occluders. In domains such as architectural walkthroughs or certain classes of games,
the world is naturally made up of cells and portals between the cells. In this case, you can use a cell &
portal [54] culling algorithm to make a map of the visibility between cells. Only cells visible from the
current cell need to be rendered.
When knowledge about the underlying spatial organization does not lead to the use of a specialized
algorithm to determine occluders, you can use a general occlusion algorithm [59, 23]. One method of
occlusion culling is to use the hierarchical bounding-box or bounding-sphere information in conjunction
with a typical hardware depth buffer. The scene is sorted in a rough front to back ordering, and all
geometry in the scene is marked as a possible occluder, meaning that all geometry needs to be drawn. The
depth sort is necessary to take advantage of the natural visibility effects where a closer object generally
obstructs the view of a further object. The bounds of each object are rendered in turn, and the depth
buffer is compared to the previous depth buffer. If the depth buffer changes between drawing one object
and the next, the object is visible and is not occluded. If the depth buffer did not change, the object is
not visible and can safely be elided. Some hardware can feed back depth-buffer hit information efficiently,
without requiring a read of the full depth buffer. When implementing an occlusion culling algorithm,
check with your hardware vendor for extensions that support efficient occlusion tests.
More detail on occlusion culling can be found in Zhang [57], which covers occlusion culling back-
ground material and an extensive algorithm for choosing the optimal occlusion set.
Contribution Culling
Another area where you can use culling is to elide objects that are small enough not to be noticed if they
are missing from the scene. This form of culling, called contribution culling [57], makes a binary decision
to draw or not draw an object depending on its pixel coverage in screen space. An object that only occupies
a few pixels in screen space can be safely elided with very little impact on the overall scene. Examples
where contribution culling can be applied include objects that are a large distance from the eye, such as
trees when flying at altitude in a flight-simulator, or objects that are very small in comparison to the entire
scene, such as bolts on an engine when designing a truck. Contribution culling can also assist occluder
selection for occlusion culling, since objects with low pixel coverage are not good occluders.
The screen-space size of an object can be determined either computationally or in a preliminary rendering
pass. In either method, the bounding representation is used instead of the actual geometry associated
with the object. The hardware may be able to feed back pixel-coverage information more easily and faster
than a computational approach or a straightforward graphics-language implementation, so check with
your hardware vendor when implementing a contribution culling algorithm.
the smaller visible window with an appropriate (and fast) kernel. When compared to blending, over-
sampling trades off framebuffer use against the CPU and main memory requirements of a depth sort.
When compared to accumulation, oversampling has a fraction of the traversal and transformation
requirements, comparable rasterization requirements, and a higher framebuffer requirement.
Accelerated panning and accelerated dynamics utilize frame coherence. Each frame of the output image
contains much of the data from the previous frame, so much so that it is worthwhile to cache it and rerender
only a small part of the image. Oversampled antialiasing substitutes a fast 2D operation requiring a large
amount of framebuffer for 3D operations that conserve framebuffer but use large amounts of time. The
substitution of texture for geometry also substitutes fast 2D operations for slow 3D operations. There
may be other ways to accelerate your application. Is your geometry laid out in such a way that some
objects will always make better occluders than other objects? (For example, sheet metal vs. rivets.) Do
you typically draw large arrays of coplanar or parallel surfaces that can be backface removed together at
a small computational cost? Step back and cast a critical eye at your application, the problems it solves,
and its usage.
the old is considered a hard change in the scene. In other words, as the user transitions from one LOD to
another, the transition is noticed by the user as a “popping” effect. You can minimize this effect by using
softer methods of LOD transitions such as geometry morphing (geomorph) or blending. A good LOD
implementation should present few visual artifacts to the user.
Creating the LOD objects is only part of the full LOD idiom. To effectively use multiple LOD objects
in a scene, you must determine the correct LOD for each object. Properly determining the correct LOD
can greatly increase frame rate [20, 45, 49]. The LOD can be based not only on the distance from the eye
but also on the cost of rendering the object and the perceived importance within the scene [20]. In many
cases, the geometry can be totally replaced by a textured image [49], thereby reducing the geometry load
down to a single polygon.
Height Fields
Heckbert and Garland further subdivide height-field LOD algorithms into six subclasses: regular grid methods [36, 32],
hierarchical subdivision methods [53, 46, 15], feature methods [51], refinement methods [19, 28, 44, 21],
decimation methods [37, 47], and optimal methods [7]. Many of these algorithms are computationally
expensive and, therefore, can only be used to preprocess the LODs that are used during rendering. These
preprocessed LODs are generally not sufficient for an interactive application where the user controls the
eye-point and viewing parameters. This is especially true for surfaces that are very large, as in terrain
models, where a single LOD is not sufficient over the whole surface. In fully interactive applications, the
LOD across the height field needs to be what Hoppe refers to as “view-dependent” [30]. That is, the LOD
across the height field varies as the eye-point and view frustum changes. This entails a real-time algorithm
with a continuously variable LOD allowing more detail close to the eye-point and less further away.
The number of algorithms that allow view-dependent, real-time height field LOD calculations is small.
Lindstrom et al. [38] define five properties that an effective height-field LOD algorithm should satisfy:
• At any instant, the mesh geometry and the components that describe it should be directly and ef-
ficiently queryable, allowing for surface following and fast spatial indexing of both polygons and
vertices.
• Dynamic changes to the geometry of the mesh, leading to recomputation of surface parameters or
geometry, should not significantly impact the performance of the system.
• High frequency data such as localized convexities and concavities, and local changes to the geome-
try, should not have a widespread global effect on the complexity of the model.
• Small changes to the view parameters (for example, viewpoint, view direction, field of view) should
lead only to small changes in complexity in order to minimize uncertainties in prediction and allow
maintenance of (near) constant frame rates.
• The algorithm should provide a means of bounding the loss in image quality incurred by the ap-
proximated geometry of the mesh. That is, there should exist a consistent and direct relationship
between the input parameters to the LOD algorithm and the resulting image quality.
A single algorithm that fulfills all of these properties, runs in real time, and handles very large surfaces
is difficult to achieve. The IRIS Performer [45] library’s Active Surface Definition (ASD), Lindstrom’s [38]
algorithm, and Hoppe’s view-dependent progressive mesh [31] are some examples of algorithms that fulfill
all properties. These algorithms depend on a hierarchical surface definition but take different approaches to
achieve a similar result. Lindstrom and Hoppe work with the original height field, breaking the surface into
LOD blocks. They simplify each block with a continuous LOD function based on eye position, height, and
an error tolerance. The ASD algorithm starts with a triangulated irregular network (TIN) and precomputes
the LOD blocks. Lindstrom works with the entire surface but limits the maximum size that can be rendered
to what can fit in memory. In addition, even though the LOD is continuous, Lindstrom does not geomorph
the surface when changing from one level to another, which can cause a noticeable popping effect. In
contrast, ASD and Hoppe store the hierarchical LOD blocks on disk and load the appropriate block as
needed, dependent on the viewer velocity and direction. This allows an infinite surface to be convincingly
rendered. Furthermore, both ASD and Hoppe geomorph the vertices as the LOD level changes. This
allows a very smooth-looking surface representation even when the error tolerance becomes high.
which LODs need to be used, and without it the graphics pipeline may be under-utilized or overloaded.
A target frame rate sets a bound on the minimum frame rate; without one, the frame rate is unbounded,
and an application can become arbitrarily slow.
Often, application developers who have not incorporated frame-rate control into their applications
rationalize the decision by saying that they always want the fastest frame rate, and hence do not need to
set a target frame rate. This viewpoint is countered by the fact that a frame-rate control mechanism, combined
with LODs, allows the fastest frame rate to be increased by using less complex LODs. For example, if
an application is running slower than the target frame rate, it can decrease the LOD complexity, thereby
reducing the geometry load on the system and increasing the overall frame rate. Without a target frame
rate and associated frame-control mechanism, increasing the frame rate cannot happen reliably. Adjusting
the geometry load based on the difference between the current and target frame rates is known as stress
management. Stress is a multiplier, calculated from this difference, that is incorporated into the LOD selection function.
One method of determining which LODs to render is to determine the cost in frame time it takes to render
each object and the benefit of having that object at a certain LOD level.
Funkhouser et al. [20] define cost and benefit functions for each object in a scene. The cost of rendering
an object O at level of detail L with rendering method R is defined as Cost(O, L, R), and the benefit of
having object O in the scene is defined as Benefit(O, L, R). Therefore, to determine the LOD levels for
all objects in a scene S, maximize

    Σ_S Benefit(O, L, R)

subject to

    Σ_S Cost(O, L, R) ≤ TargetFrameTime.
Generating the cost functions can be done experimentally as the application starts by running a small
benchmark to determine the rendering cost. This benchmark can render some of the basic graphics prim-
itives in different sizes using multiple graphics states to determine the characteristics of the underlying
system. The cost of rendering certain primitives is useful not only for LOD control, but also for the gen-
eral case of determining some of the fast paths on given hardware. Of course, though the benchmark is not
a substitute for detailed system analysis, you can use it to fine-tune for a particular platform. It is up to the
application writer to determine which modes and rendering types are fastest separately and in combination
for a particular platform and to code those into the benchmark.
The Benefit function is a heuristic based on rasterized object size, accuracy of the LOD model compared
to the original, importance in the scene, position or focus in the scene, perceived motion of the object
in the scene, and hysteresis through frame-to-frame coherence. Unfortunately, optimizing the above for
all objects in the scene is NP-complete and therefore too computationally expensive to attempt for any
real data set size. Funkhouser et al. use a greedy approximation algorithm to select the objects with the
highest Benefit/Cost ratio. They take advantage of frame-to-frame coherency to incrementally update
the LOD for each object, starting with the LOD from the previous frame. The Benefit and Cost functions
can be simplified to reduce the computational complexity of calculating the LODs. This computational
complexity can become the overriding frame-time factor for complex scenes, because the LOD calcula-
tions increase with the number of objects in the scene. As when using LODs to reduce geometry load,
it is necessary to measure the computational load and reduce computation when the calculations begin to
take more time than the rendering.
Using a predictive model such as the one described above, you can control the frame rate with higher accuracy
than with purely static or feedback methods. The accuracy of the predictions is highly dependent on
the accuracy of the Cost function. To minimize the divergence between the actual frame rate and the
calculated cost, you can introduce a stress factor to artificially increase the LOD levels as the graphics
load increases. This is a feedback loop dependent on the true frame rate.
Using Billboards
Another approach to controlling the level of geometric detail in a scene is to substitute an impostor or a
billboard for the real geometry [45, 40, 49, 50]. In this idiom, the geometry is pre-rendered into a texture
and texture mapped onto a single polygon or simple polygon mesh during rendering. This is an advanced
form of the texture for geometry substitution described in section 7.2.3. IRIS Performer [45] has a built-
in billboard (sometimes known as a sprite) data type that can be used explicitly. The billboard follows the
eye-point with two or three degrees of freedom, so that it appears to the user as if the original geometry
were being rendered. Billboards are used extensively for trees, buildings, and other static scene objects.
Shade et al. [49] create a BSP tree of the scene and render it using a two-pass algorithm. The first pass
caches images of the nodes and uses a cost function and error metric to determine the projected lifespan
of the image and the cost to simply render the geometry. The projected lifespan of the image alleviates the
problem of the algorithm trying to cache only the top-level node. A second pass renders the BSP nodes
back to front using either geometry or the cached images. This algorithm works well for sparsely occluded
scenes. In dense scenes, the parallax due to the perspective projection shortens the lifetime of the image
cache, making the technique less effective.
Sillion et al. [50] take a similar approach, but instead of using only textures mapped to simple polygons,
they create a simplified 3D mesh to go along with the texture image. The 3D mesh is created through
feature extraction on the image followed by a re-projection into 3D space with the use of the depth buffer.
This 3D mesh has a much longer lifetime than 2D texture techniques, but at the expense of much higher
computational complexity in the creation of the image cache.
7.3.1 Multithreading
Multithreading is used here as the general ability to have more than one thread of control sharing a work
load for a single application. These threads run concurrently on multiprocessor machines or are scheduled
in some manner on single-processor machines. Threads also may all reside within the same address space
or may be split across separate exclusive address spaces. In a cluster of workstations, threads will execute
on separate machines and communicate by a message passing interface. The mechanism of thread control
is not as important as the need to use multiple threads within an application.
Even when using only a single processor, multithreading can still improve application performance.
Additional threads can accomplish work while the main thread is waiting for something to happen, which
is quite often. Examples include the main thread waiting for a graphics operation to complete before
issuing another command; blocking in an I/O call while waiting for an I/O operation to complete; or waiting
for memory to be copied from main memory into the caches. In addition, when multiple processors are
available, the threads can run free on those processors and not have to wait for the main thread to stall or
to context swap in order to get work done.
Multiple threads can be used for many of the computational tasks in deciding what to draw, such as LOD
control, culling, and intersection testing. Threads can be used to page data to and from disk or to pipeline
the rendering across multiple frames. Again, an added benefit comes when running the application on
multiprocessing machines. In this case, the rendering thread can spend 100% of its time rendering while
the other threads are dedicated to their tasks 100% of the time.
There are a few issues associated with using multiple threads. The primary concern becomes data
exclusion and data synchronization. When multiple threads are acting on the same data, only one thread
can be changing the data at a time. That change then needs to be propagated to all other threads so they
see the same consistent view of the data. It is possible to use standard thread locking mechanisms such as
semaphores and mutexes to minimize these multiprocessing data management issues. This approach is not
optimal, because as the number of objects in the scene increases, the corresponding locking overhead also
increases. A more elaborate approach based on multiple memory buffers is described in [45]. Another
issue is the time consumed by thread creation. It may be worthwhile to cache and reuse threads instead of
creating and destroying them freely. As in all other aspects of graphics, performance measurement is an
essential part of threaded architecture design.
Threads can be used in a pipelined fashion or in a parallel fashion for rendering. In many cases, it is
useful to combine the two techniques for the greatest performance benefit. In a pipelined renderer, each
stage of the pipeline works on an independent frame with its own view of the data. Here the latency is
increased by the number of stages in the pipeline, but the throughput is also increased. Parallel concurrent
processes all work on the same frame at the same time, perhaps using multiple hardware graphics
pipelines (see section 2.1.10). The synchronization overhead is higher, but latency is reduced. A combination
of the two approaches can have a pipelined renderer with asynchronous concurrent threads handling non-
frame-critical aspects of the application such as I/O. The target system architecture determines what is
possible, whereas the application requirements determine what is useful. The following are some areas
where a separate thread can work either as a stage in a pipeline or as a parallel concurrent thread.
Culling
The process of culling decides which geometric objects need to be drawn and which geometric objects
can be safely elided from the scene (see section 7.2). Culling is traditionally done early in the rendering
process to reduce the amount of data that later stages need to process. As one of the first stages in a
multithreaded application, the culler thread can traverse the scene doing view-frustum, backface, contribution,
and occlusion culling. Each of these culling algorithms can be done in a pipelined fashion spread over
multiple threads. The resulting output of the culling threads can be incorporated into a new second-stage
scene structure, which is passed to the remaining parts of the application.
be pipelined with early results from the culling stage to prevent calculation of LOD values for objects that
are not rendered.
Intersection
Most applications do more than just render and allow the user to interact with the scene. This interaction
entails calculating intersections either on an object-to-object basis or as a ray cast from a viewing position
to an object. An intersection thread can be run concurrently with LOD calculations to generate a hit list
that is passed to the application before rendering.
I/O
In applications where all of the data is generally not visible simultaneously, it is beneficial to load only the
portion of the data that is currently being used. Complex visual simulations and architectural walkthroughs
are two of the many types of applications that have large databases where the data is paged off the disk as
the user moves through the world. As the user approaches an area where the data has not yet been loaded,
the required data is read off the disk or a network interface to be ready to use when the user arrives at
the new area. One or more asynchronous threads are generally allocated to I/O operations such as paging
database data from external storage or tracking information from input devices. These threads can be
asynchronous because they do not need to complete in order to generate data for the next frame of the
rendering process. An additional benefit of an asynchronous I/O thread is that an application is not tied
to the variable read rates inherent in disk, network, or other external interfaces. The maximum frame rate
of an application is gated by the I/O device when I/O is done in-line as part of the rendering loop. This is
especially apparent with input devices whose high data latency places a bound on the frame rate.
Because I/O threads are asynchronous and may not have completed their operation before the data they
are responsible for is needed, the application needs to have a fallback to replace the missing data. Database
paging operations can first bring in small, low-resolution data that is quick to read to ensure there is some
data ready to be rendered if needed. Similarly, missing tracking information can simply reuse previous data
or interpolate where the new position may be based on the previous heading, velocity, and acceleration.
blocking can potentially be used to either spend more time rendering using more complex geometry and
more resource-hungry rendering modes, or to use the extra time to run application-specific calculations.
Another use for this time in a multithreaded application is to synchronize the threads and update shared
data. In general, the time between a buffer swap and the next iteration of the render loop should be used
by an application to do additional work in anticipation of the subsequent frame.
Level of Detail
Changing between appropriate LODs for a given object should be almost invisible to a user. When LOD
levels are artificially changed due to the need to increase frame rate, users begin to notice changes in the
scene. Here, frame rate and image quality need to be balanced. Similarly, if a proper blend or morph
between two LOD levels is not done, the switch between the two LODs is apparent and distracting. In
either case, the use of LODs is important for an application. Memory considerations for generating LODs
should be a concern only for very memory-conscious applications. If memory becomes a concern, consider
paging the LOD levels from disk when needed.
Mipmapping
Textures can be pre-filtered into multiple power-of-two levels, forming a pyramid of texture levels. During
texture interpolation, the two best mipmap levels are chosen, and texel values are interpolated between
those levels. This process reduces texturing complexity when the ratio of screen space to texture dimension
gets very small. Interpolation between smaller levels produces a better image at the cost of memory to
store the texture levels and a possible performance hit on some graphics systems that do not have hardware
support for mipmapping. The memory overhead of mipmapping is minimal, adding only one-third of
the original image size. This overhead is usually outweighed by the increase in image quality and
performance on hardware that accelerates mipmapping.
Paging
For very large databases or other types of applications working with large data sets, all of the data does
not have to be loaded up-front. An application should be able to roam through an infinitely large scene if
supplied with an infinitely large disk array.
Bounding Information
One of the easiest to use and most beneficial pieces of information to store in the scene graph is bounding
information for objects in a scene. Both bounding spheres and bounding boxes may be stored, each used
where appropriate.
Pre-Calculations
Many times objects in a scene have static transformations associated with them, for example, wheels
of a car are always positioned relative to the center of the car, offset by some transform. These extra
transformations can quickly add up with complex scenes. A pass through the scene graph can be done
before rendering begins to collapse static transformations by recalculating the vertices of the objects,
physically moving the vertices to their transformed locations. You can do similar concatenations for other
states in the scene, namely rendering modes, colors, and even pre-calculating lighting in some situations.
State Changes
State changes are generally an expensive operation for most graphics systems. It is best to try to render all
items with the same state in order to minimize the number of times state needs to be changed in a scene.
Rendering a geometric checkerboard goes much faster by rendering all black squares first, followed by all
white squares, instead of rendering alternate black and white squares. If each object is able to keep track of
the state settings it is using, then sorting the scene by state becomes possible and rendering more efficient.
This sorting creates lists of renderable items that have multiple levels of sorting, from most expensive to
least expensive.
For debugging purposes, it is useful to know what is actually being drawn, especially when trying to fix
a fill-limited or geometry-limited application to see how the state changes affect what is actually rendered.
Besides timing information, the depth complexity of a scene should be viewable as an image of the depth
buffer to see how many times each pixel is filled. This is a measure of how well the culling process is
performing. It is also useful to be able to turn off certain modes to see their effect. For example, turning
off texturing or drawing the scene in wire frame can be useful for debugging.
Conclusion
Graphics software that truly runs efficiently on a computer system is built on three foundations. These
foundations, of course, rely on knowledge of how software, graphics function calls, and the computer
system interact with each other.
Because many graphics applications spend much of their time processing information not directly re-
lated to calling any graphics API, the first foundation is based on well-written application software. This
software is distinct from that used to call any graphics API, but is instead used to process data, take user
input, or store data, etc. Delays in the execution of this part of the code decrease overall performance.
Fortunately, a host of tools are available that can clearly define any existing inefficiency in the application
software.
The second foundation rests on an efficient graphics structure and how that structure interplays with
the system hardware. Graphics API calls can be implemented poorly, and no amount of code analysis or
restructuring will change that fact. Fortunately, most graphics hardware suppliers provide key pointers
that demonstrate how to improve graphics API and hardware interaction.
Unfortunately, well-written code and graphics function calls do not make up for a poor choice of
graphics algorithms. Efficient algorithms, then, are the third foundation on which graphics performance
rests. As pointed out in the course, a poor algorithm can effectively kill any performance gained by
clever coding or graphics hacks. Fortunately, SIGGRAPH conferences are replete with examples of
efficient algorithms, and some of them are captured here.
Creating high-performance graphics software can be difficult. The purchase of a bigger-faster-cheaper
computer may be a solution, but this is a temporary solution that doesn’t fit many situations. It’s far
easier – and cheaper in the long run – to examine how the software and system interact, and modify the
application software accordingly. This effort can be one of the most challenging and satisfying aspects of
developing efficient graphics software.
Glossary
Application Programming Interface: A collection of functions and data that together define an interface
to a programming library.
ASIC: Application Specific Integrated Circuit. Examples of ASICs include chips that perform texture-
mapping, lighting calculations, or geometric transformations.
Asynchronous: An event or operation that is not synchronized. Asynchronous function calls are those
that can occur at any time and do not wait for other operations to complete before returning.
Bandwidth: A measure of the amount of data per time unit that can be transmitted to a device.
Basic Block: A section of code that has one entry and one exit.
Basic Block Counting: Indicates how many times each section of code has been executed (revealing hot
spots), regardless of how long individual instructions take.
Billboard: A texture, or multiple textures, that represent complex geometry. The texture is mapped to a
single polygon that follows the eye-point.
Binary Space Partitioning: Usually referred to as a BSP tree. This is a data structure that represents a
recursive, hierarchical subdivision of space. The tree can be traversed to quickly find the locations
of items in a scene.
Block: The process of not allowing the controlling program to proceed any further in its current thread of
execution until the device being communicated with is finished with its operation.
Bounding Box: The extents of an object defined by the smallest box that fits around the object. A bound-
ing box can be axis-aligned or oriented in some way to better fit the object extents.
Bounding Sphere: The extents of an object defined by the smallest sphere that fits around the object.
Bounding Volume: The extents of an object or group of objects. This can be defined using a bounding
box, bounding sphere, or other method.
Cache Blocking: Memory reference optimization that reorders the memory accesses in a loop nest so
that data are worked on in small neighborhoods that fit in cache. Also known as tiling.
Contribution Culling: A binary decision to draw or not draw an object depending on its pixel coverage
in screen space.
Culling: The process of determining which objects in a scene need to be drawn and which objects can
safely be elided.
Data Locality: The property of data residing ’near’ other data in memory. One way to achieve data
locality is to use a vertex array, which stores vertices linearly in memory; subsequently accessed
vertices are then adjacent to just-used vertices and likely to have better cache behavior.
Database: The application one buys from Oracle or Sybase. Also, the store of data that can be rendered.
Usually used in the visual simulation domains.
Depth Complexity: The measure of how many times a single pixel on the screen is filled. Depth com-
plexity can be reduced by using Culling.
Direct Memory Access: A way for a piece of hardware in a system to bypass the CPU and read directly
from memory. This is generally faster than PIO, but there is a constant setup time that makes
DMA useful only for large data transfers.
FIFO Buffer: A mechanism designed to mitigate the effects of the differing rates of graphics data gener-
ation and graphics data processing.
Fill Rate: A measure of the speed at which pixels can be drawn into the frame buffer. Fill rates are
reported as a number of pixels able to be drawn per second.
Full-in: A geometric object that lies fully inside the view frustum.
Full-out: A geometric object that lies fully outside the view frustum.
Fragment: An OpenGL rasterized piece of geometry or image data that contains coordinate, color, and
depth information.
Frustum Culling: Removing all geometry that lies outside of the frustum.
Generation: All of the work done by an application prior to the point at which it’s nearly ready to render.
Graphics Pipeline: The stages through which a primitive is operated upon to transform it into an image.
Height Field: A mapping of a data value to a height relative to the image plane. One common mapping
is to take a grid of elevation data (terrain) and map it to a triangulated surface.
Hysteresis: Minimizing the effect of a changing scene to keep a constant frame rate.
Inlining: The technique of replacing the call to a function with an in-place copy of the function’s contents.
Interprocedural Analysis: The process of rearranging code within one function based on knowledge of
another function’s code and structure.
Latency: A measure of the amount of time it takes to fully transfer a single unit of data to a device.
Level of Detail: Alternate representations of geometric objects where successive levels have less geomet-
ric complexity.
Microcode: Instructions which implement the instruction set of a processing unit. Typically composed
of bit fields which control specific low-level processor operations. Several microcode instructions
or microinstructions are required to decode and implement higher-level operations.
Native data formats: Data formatted in the same fashion as used internally by a graphics subsystem.
Pixels, vertices, normals, and a number of other basic data types have preferred, or native, data
formats. Example: AGBR may be native but RGBA may not.
Occlusion Culling: Determination of which objects are visible from the current viewpoint, so that objects hidden behind other objects can be elided.
Paging: Copying data from one device to another, usually between disk and memory.
Pixel: A picture element. All the bits at location (x, y) in all the bitplanes of the framebuffer that form the
single pixel (x, y). In OpenGL window coordinates, a pixel corresponds to a 1.0 x 1.0 screen area.
Polygon Rate: A measure of the speed at which polygons can be processed by the graphics pipeline.
Polygon rates are reported as the number of triangles able to be drawn per second.
Primitive: Basic graphic input data such as triangles, triangle strips, pixmaps, points, and lines.
Program Counter Profiling: Uses statistical callstack or program counter (PC) sampling to determine
how many cycles or how much CPU time is spent on each line of code.
Programmed I/O: Transferring data from one device in a system to another by having the CPU read
from the first and write to the second. See DMA for another approach.
Scene Graph: The data structure that holds the items that will be rendered.
Single-System-Image: A collection of graphics pipes within a system used to produce a single computa-
tional or graphical result via traditional programming models.
Span: Segment of a scanline inside a polygon upon which a scanline algorithm operates to rasterize a
primitive.
Stall: A condition where further progress cannot be made due to the unavailability of a required resource.
Static Scene: A scene that needs to be of high quality but not interactive.
Stress Factor: A computed value for a scene such that the further the scene falls behind its target
frame rate, the higher the stress factor becomes.
Synchronous: The opposite of asynchronous. Synchronous function calls are those that do not return un-
til they have finished performing whatever action is requested of them. For example, a synchronous
texture download function waits until the texture has been completely downloaded before return-
ing, while an asynchronous download function simply queues the texture for download and returns
immediately.
Tearing: The effect that happens when rendering is not synchronized to the monitor refresh rate in
single-buffered mode. Parts of more than one frame can be visible at once, giving a “tearing” look to
a moving scene.
Transformation: Usually used as the process of multiplying a vertex by a matrix thereby changing the
location of the vertex in space.
Traversal: The portion of an application that walks through internal data structures to extract data and
issue specific graphics API calls (in OpenGL, calls such as glBegin(), glVertex3f(), and glEnable()).
Virtual Memory: Addressing memory space that is larger than the physical memory on a system.
Word: The “natural” data size of a specific computer. 64-bit computers operate on 64-bit words; 32-bit
computers operate on 32-bit words.