
SPECIAL SECTION: COMPUTATIONAL SCIENCE

SURVEYS

Three-dimensional computer graphics architecture

Tulika Mitra* and Tzi-cker Chiueh
Computer Science Department, State University of New York at Stony Brook, Stony Brook, NY 11794-4400, USA

Three-dimensional (3D) computer graphics hardware has emerged as an integral part of mainstream desktop PC systems. The aim of this paper is to describe the 3D graphics architecture at a level accessible to the general computational science community. We start with the generic 3D graphics rendering algorithm, the computational requirements of each of its steps, and the basic architectural features of 3D graphics processors. Then we survey the architectural features that have been implemented in or proposed for state-of-the-art graphics processors at the processor and system levels to enable faster and higher-quality 3D graphics rendering. Finally, we describe a taxonomy of parallel 3D rendering algorithms that accelerate the performance of 3D graphics using parallel processing.

1. Introduction

UNTIL recently, real-time three-dimensional (3D) computer graphics was available only in very high-end machines from Silicon Graphics Inc. In the last few years, however, the PC industry has seen an unprecedented growth of cost-effective 3D graphics accelerators. Because a significant amount of industrial research effort has been invested in powerful 3D graphics cards, it is predicted that the performance of these accelerators will surpass the performance of SGI machines by the year 2001 (ref. 1). 3D graphics applications place a stringent demand on the processing power and on the data transfer bandwidth of the memory subsystem and interconnecting buses. The growing importance of 3D graphics applications has motivated CPU vendors to add new instructions to the existing instruction set architecture, and to develop higher-bandwidth memory and system buses. In fact, the data-intensive nature of 3D applications has been one of the primary motivations behind the introduction of advanced Dynamic Random Access Memory (DRAM) architectures for host memory and for the local memory on graphics cards.

In this article, we start with the basic steps required to render a polygon-based 3D graphics model and their associated computation and bandwidth requirements. Then we examine the major design issues in generating photo-realistic images on desktop machines in real time, and the architectural innovations that attempt to address these problems. Finally, we present a taxonomy of parallel rendering algorithms, which use parallel processing hardware to render extremely complicated 3D graphics models.

2. 3D graphics pipeline

Polygon-based 3D graphics rendering is the process of converting the geometric description of a 3D model (or a virtual world) to a photo-realistic two-dimensional image (a 2D array of picture elements or pixels) that can be displayed on a computer monitor. Each pixel represents a colour value consisting of red, green, and blue (RGB) components. The sequence of steps involved in this conversion forms the 3D graphics pipeline, each stage of which can be implemented either in hardware or software.

The input to the 3D graphics pipeline is a virtual world created by application programmers. This world/scene consists of a mathematical representation of a set of objects, their positions relative to each other, an optional set of light sources, together with a viewpoint that provides a camera angle into the virtual world. Objects or primitives are typically represented by a set of triangles for ease of implementation. The description of the 3D model is passed to the 3D graphics engine through a standard Application Programmer Interface (API) such as OpenGL2 or Direct 3D3.

The 3D graphics pipeline itself consists of two distinct stages: geometric transformation and rasterization. The geometric transformation stage maps triangles from a 3D coordinate system (object space) to a 2D coordinate system (image space) by performing a series of transformations. The computation in this stage is mostly floating-point intensive, involving linear algebraic operations such as matrix multiplication and dot products. The rasterization stage converts transformed triangles into pixel values to be shown on the computer screen. This stage involves mostly integer arithmetic, such as simple additions and comparisons. An excellent reference to the 3D graphics pipeline can be found in Foley et al.4.

*For correspondence. (e-mail: [email protected])
838 CURRENT SCIENCE, VOL. 78, NO. 7, 10 APRIL 2000

2.1 Geometric transformation

At the input of the geometric transformation stage, each triangle consists of three vertex coordinates, vertex normals and other attributes such as colour. For ease of manipulation, vertices are represented in homogeneous
coordinates, which are quadruples of the form {x, y, z, w}, where in most cases w is 1. (The tuple {x/w, y/w, z/w} is the Cartesian coordinate of the homogeneous point.) The geometric transformation stage applies a sequence of operations on the vertices of the triangle. Figure 1 shows the geometric transformation part of a typical 3D graphics pipeline, which consists of the following stages:

2.1.1 Model and viewing transformation: Modelling transformation positions primitives with respect to each other, and the viewing transformation orients the resulting set of primitives to the user viewpoint. These two transformations can be combined into a single multiplication of the homogeneous vertex coordinate by a 4 × 4 matrix, which is implemented as 16 floating point multiplications and 12 floating point additions. Lighting calculation, in addition, requires the transformation of the vertex normal by a 3 × 3 inverse transformation matrix, which costs 9 floating point multiplications and 6 floating point additions.

2.1.2 Lighting: This stage evaluates the colour of the vertices given the direction of light, the vertex position, the surface-normal vector and the material characteristics of an object's surface. We will consider here only the most popular shading model, called Gouraud shading, which interpolates the colour of the three vertices across the surface. Evaluating the colour of a vertex requires a variable amount of computation depending on the number of light sources and the material properties. We assume the simplest case of a single light at infinite distance, and a material with only ambient and diffuse coefficients. This lighting model calculates the following equation for each R, G, B component:

Cdiffuse × Clight × (N ⋅ L) + Areflection × Alight,

where Clight and Cdiffuse are the light source intensity and diffuse reflection coefficient; Alight and Areflection are the ambient light intensity and ambient light coefficient; and (N ⋅ L) is the dot product of the surface-normal vector and the light direction vector. (N ⋅ L) is calculated only once. However, the rest of the equation must be calculated independently for the R, G and B components of each vertex. This requires a total of (3 + 3 × 3 = 12) multiplications and (2 + 3 × 1 = 5) additions per vertex.

2.1.3 Projection transformation: This transformation projects objects onto the screen. There are two types of projections: (1) orthographic projection, which keeps the original size of 3D objects and hence is useful for architectural and computer-aided design; (2) perspective projection, which produces more realistic images by making distant objects appear smaller. Each of these transformations again involves a 4 × 4 matrix multiplication. However, as most entries in these matrices are zero, a careful implementation requires only 6 multiplications and 3 additions.

2.1.4 Clipping: The application programmer defines a 3D viewing frustum such that only the primitives within the frustum are projected onto the screen. This step removes the objects that are outside the viewable area. The algorithm requires one floating point comparison per view-boundary plane, and thus 6 comparisons per vertex. If a triangle is partially clipped, then the algorithm must calculate the position of the new vertices at the intersection of the triangle edge and the view-boundary plane. The number of such operations performed depends on the actual number of triangles that cross the view-boundary planes, which varies from one viewpoint to another. Hence, we will not take this cost into account in our computation requirement calculation.

2.1.5 Perspective division: If a perspective transformation is applied to a homogeneous vertex, then the w value no longer remains equal to 1. This stage divides x, y and z by w to convert the vertex to Cartesian coordinates.

2.1.6 Viewport mapping: This step performs the final scaling and translation to map the vertices from the projected coordinate system to the actual viewport on the computer screen. Each vertex component is scaled by an independent scale factor and offset by an independent offset, i.e. 3 floating point multiplications and 3 floating point additions.

The total computation requirement to perform geometry transformation per vertex is then 46 multiplications, 29 additions, 3 divisions, and 6 comparisons. Modern processors can execute floating point addition, subtraction, comparison, and multiplication operations quite fast using pipelined execution units. The floating point division operation, however, is not usually pipelined, and can take as long as 50 floating point additions' worth of time. The total floating point operation requirement for a single vertex transformation is then around 130.

Figure 1. Geometry transformation stage of a 3D graphics pipeline.
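The per-vertex arithmetic accounted for above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper; the function and parameter names are our own, and the viewport convention (mapping the [-1, 1] range onto the screen) is an assumption.

```python
# Sketch of the per-vertex work in the geometric transformation stage:
# model/view transform, diffuse + ambient lighting, perspective division
# and viewport mapping. Names and conventions are illustrative only.

def mat4_mul_vec4(m, v):
    # One 4 x 4 matrix-vector product: 16 multiplications, 12 additions,
    # as counted in section 2.1.1.
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def transform_vertex(mv, vertex, normal, light_dir,
                     c_diff, c_light, a_refl, a_light, width, height):
    # Model and viewing transformation on a homogeneous vertex {x, y, z, w}
    x, y, z, w = mat4_mul_vec4(mv, list(vertex) + [1.0])
    # Lighting: Cdiffuse * Clight * (N . L) + Areflection * Alight,
    # with (N . L) computed once and the rest done per R, G, B channel
    n_dot_l = max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
    colour = [c_diff[i] * c_light[i] * n_dot_l + a_refl[i] * a_light[i]
              for i in range(3)]
    # Perspective division: {x/w, y/w, z/w} (3 divisions)
    sx, sy, sz = x / w, y / w, z / w
    # Viewport mapping: independent scale and offset per component
    return (sx * 0.5 * width + 0.5 * width,
            sy * 0.5 * height + 0.5 * height, sz), colour
```

With an identity model-view matrix and a 640 × 480 viewport, a vertex at the origin lands at the viewport centre (320, 240).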
Today, even a modest scene requires around 1 million vertex transformations per second to achieve a rate of 30 frames per second. This translates to 130 MFlops (million floating point operations) per second. Today's PCs have sufficient floating point computation power and therefore typically perform the geometric transformation stage in the main CPU.

2.2 Rasterization

The rasterization stage comprises two steps. The scan conversion step decomposes a triangle into a set of pixels, and calculates the attributes of each pixel, such as colour, depth, alpha, and texture coordinates. The pixel processing step performs texture mapping, depth test and alpha blending for individual pixels. Figure 2 shows the rasterization stage of the graphics pipeline.

Figure 2. Rasterization stage of a 3D graphics pipeline.

There are two distinct mechanisms that are quite popular for the scan conversion step: the linear interpolation algorithm and the linear edge function algorithm. In linear interpolation-based algorithms4, the triangle set-up step first computes the slopes, with respect to the X-axis, of all the attributes along each edge of the triangle. Next, the edge processing step iterates along the edges and computes the two end points of a horizontal pixel segment, called a span. Finally, the span processing step iterates along each span and computes the attributes of each pixel on the span through linear interpolation (Figure 3).

Figure 3. Scan conversion of a triangle using linear interpolation algorithm.

In linear edge function-based algorithms5, each edge of the triangle is defined by a linear edge function. The triangle is scan converted by evaluating, at each pixel's centre, the function for all edges, and processing only those pixels that are inside all the edges. The attributes are also computed from the linear functions. Typically, the traversal of a triangle proceeds down from a starting point, and moves outward from the centre line6. The centre line shifts to the left or right until the traversal steps outside of the triangle (Figure 4 a). To achieve parallelism, the triangle may be traversed one pixelstamp at a time, rather than pixel by pixel6. A pixelstamp is an array of pixels of dimension X × Y. Evaluation of the edge functions for all the pixels within a pixelstamp can start in parallel, and only the qualified pixels are sent to the pixel processing stage. Triangle traversal visits all pixelstamps that are completely or partially inside the triangle (Figure 4 b).

The rasterization stage also includes texture mapping, which is a crucial and widely used technique that wraps a 2D texture image on the surface of a 3D object to emulate the visual effects of complex 3D geometric details, such as a wooden surface, a tiled wall, etc. Each vertex of a texture-mapped triangle comes with a texture coordinate that defines the part of the texture map to be applied (refer to Figure 5). These texture coordinates are interpolated across the triangle surface via scan conversion. The most popular texture mapping implementation is based on mip-mapping7 (Figure 6), which pre-calculates multiple reduced-resolution versions of a texture image. Each resolution level corresponds to a particular depth. Coarser (finer) resolution levels are used for farther (closer) objects. For a 3D object at a given depth, the mip-mapping algorithm chooses a pair of adjacent resolution levels of the texture image, and performs weighted filtering of 8 texels (texture pixels) from these two resolution levels. This tri-linear filtering eliminates visual discontinuities when different mip-map levels are applied on the same object.

Before a pixel is written to the frame buffer, the rendering engine needs to check whether that triangle is actually visible at that pixel, i.e. that no other triangle overlaps that pixel making it invisible. This is known as hidden surface removal for opaque objects. The number of overlapping triangles for a pixel is called the depth complexity of the pixel. The majority of graphics accelerators achieve hidden surface removal using a depth/Z buffer, which is an array with the same dimension as the frame buffer. After a triangle is scan-converted into a set of


Figure 4. Scan conversion of a triangle using linear edge function.

Figure 5. Texture mapping of a triangle. X, Y represent the coordinates of the triangle in image space. U, V represent the coordinates of the triangle in texture space and are known as the texture coordinates.

Figure 6. Mip-map and tri-linear filtering. Each level in the mip-map represents a reduced resolution texture image from the previous mip-map. Tri-linear filtering performs weighted filtering of 8 texels, 4 from the lower resolution level and 4 from the higher resolution level.

pixels, each pixel goes through the depth test. This test compares the depth value of the current pixel against the depth value of the pixel at the corresponding X–Y coordinate of the frame buffer. If the new value is smaller, the current pixel is closer to the viewpoint than the old pixel, and therefore the depth and colour values of the current pixel replace the old values. Otherwise, the new pixel values are discarded. For transparent objects, the colours of the old and new pixels are composited according to their transparency, or alpha value. This composition is known as alpha blending and requires another buffer, called the alpha buffer, for storing the alpha values.

The rasterization stage is quite compute and memory intensive. Let us consider a frame buffer with resolution 1280 × 1024 and an average depth complexity for a scene of about 3. Assuming 32-bit pixels and 30 frames/sec, the frame buffer read bandwidth requirement will be 1280 × 1024 × 3 × 30 × 4 = 472 MB/sec. Similarly, the rendering engine would require a Z-buffer read bandwidth of 472 MB/sec assuming a 32-bit Z-buffer. If a pixel passes the depth test, both the colour and depth information have to be written back. Assuming 50% of the pixels pass the depth test, the bandwidth requirements are 236 MB/sec for frame buffer write and 225 MB/sec for Z-buffer write. Finally, each pixel requires 8 texels to perform tri-linear filtering. Even assuming an aggressive texture cache that stores the recently accessed texels, around 2.5 texel accesses per pixel are required, which translates to 1280 × 1024 × 3 × 30 × 2.5 × 2 = 590 MB/sec of texture memory bandwidth (assuming each texel is 16-bit).

Figures 7 and 8 show the triangle and pixel processing requirement per frame for the Viewperf benchmarks8, and Figure 9 shows the texture bandwidth requirement for some sample applications. A more detailed analysis of the compute and bandwidth requirements of the rendering stages for different real-world applications can be found in refs 9–11.

Theoretically, the entire 3D graphics pipeline can be implemented in software. The geometry transformation stage is extremely floating point intensive, which was beyond the capability of general purpose processors even a few years ago. Today, however, with processors having a peak performance of around 400 MFlops/sec, the host CPU is capable of handling the load. The pixel-related rasterization operations, on the other hand, require tremendous memory bandwidth to process around 100 million pixels/sec.
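The bandwidth arithmetic above can be reproduced directly from the stated assumptions (1280 × 1024 resolution, depth complexity 3, 30 frames/sec, 32-bit colour and Z, 16-bit texels, 2.5 texel accesses per pixel). A sketch; the variable names are ours, not the paper's.

```python
# Reproduces the rasterization bandwidth estimates given in the text.
W, H, DEPTH_COMPLEXITY, FPS = 1280, 1024, 3, 30
PIXEL_BYTES = 4          # 32-bit colour; each Z-buffer entry is also 32-bit
TEXEL_BYTES = 2          # 16-bit texels
TEXELS_PER_PIXEL = 2.5   # effective accesses with an aggressive texture cache

pixels_per_sec = W * H * DEPTH_COMPLEXITY * FPS

fb_read_mb  = pixels_per_sec * PIXEL_BYTES / 1e6       # ~472 MB/sec
z_read_mb   = fb_read_mb                               # same formula for Z reads
fb_write_mb = fb_read_mb * 0.5   # if 50% of the pixels pass the depth test
tex_read_mb = pixels_per_sec * TEXELS_PER_PIXEL * TEXEL_BYTES / 1e6  # ~590 MB/sec
```

Rounding to whole MB/sec recovers the 472, 236 and 590 MB/sec figures quoted above.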



It is imperative that a separate hardware accelerator be dedicated to rasterization. Hence, two distinct classes of graphics architectures have been implemented: (1) combined geometry processor and rasterizer, the prime examples being RealityEngine12 and InfiniteReality13 from SGI; (2) host CPU-based geometry processing and a dedicated hardware accelerator for rasterization. Almost all of today's low-end 3D graphics accelerators belong to the second class. In this case, the transformed geometry (vertex position, colour, and texture coordinates), as well as the texture images, are transferred over a high-speed system bus such as PCI (Peripheral Component Interconnect) to the rasterization hardware accelerator. A major design issue for rasterization-only graphics accelerators is how to use the system bus bandwidth efficiently.

3. Architectural innovations

To scale up the performance of the generic 3D graphics architecture described in the previous section, the following architectural issues need to be resolved:

• Although in theory state-of-the-art processors seem to have sufficient raw floating-point computation power to support geometric transformation at interactive frame rates, in practice the CPUs lag behind the rasterization performance of the 3D graphics cards. Therefore higher floating-point performance is essential to achieve faster frame rates with better rendering quality.

• The data transfer bandwidth between the CPU, which performs geometric transformation, and the 3D graphics card, which performs rasterization, plays a crucial role in the extent to which the entire 3D graphics pipeline can be sped up. The heavy use of texture maps in modern 3D applications further exacerbates the bandwidth problem.

• The memory access performance of the scan conversion process has a dominating impact on the overall rasterization performance. Improving the rasterization algorithm's data access locality is pivotal to the graphics card's performance.

The following subsections describe architectural techniques that have been proposed and implemented to address these issues.

Figure 7. Total number of triangles processed by the rasterization engine at different frames or viewangles for various 3D applications.

Figure 8. Total number of pixels processed by the rasterization engine at different frames or viewangles for various 3D applications.

Figure 9. Total texture memory bandwidth in MBytes for different frames.

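The first two issues can be quantified from figures already used in this paper: roughly 130 floating point operations per transformed vertex (section 2.1) and roughly 32 bytes of geometry data per vertex (section 3.2). A hedged sketch; the function and variable names are ours.

```python
# Geometry compute load and vertex traffic toward the graphics card for a
# given vertex rate, using ~130 flops/vertex (section 2.1) and
# ~32 bytes/vertex (section 3.2). Illustrative only.
def geometry_load(vertices_per_sec, flops_per_vertex=130, bytes_per_vertex=32):
    mflops = vertices_per_sec * flops_per_vertex / 1e6
    bus_mb_per_sec = vertices_per_sec * bytes_per_vertex / 1e6
    return mflops, bus_mb_per_sec
```

For a 1 million vertex/sec workload this gives 130 MFlops of geometry work on the CPU and 32 MB/sec of vertex traffic over the system bus, matching the figures quoted in sections 2.1 and 3.2.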



3.1 Streaming SIMD extensions to instruction set

Many current microprocessors have added Single Instruction Multiple Data (SIMD) type instructions to accelerate integer processing for media applications such as audio, video and image processing. These include Intel's Multimedia Extensions (MMX), HP's Multimedia Architectural Extensions (MAX-2), Sun Microsystems' Visual Instruction Set (VIS), etc. However, the geometry processing stage of the 3D graphics pipeline is based on floating point data types. To exploit the parallelism in the geometry processing stage, Intel, AMD and others have recently added floating point SIMD instructions14,15 to the instruction set. The main idea behind these extensions is that geometry processing requires 32-bit floating point data types, whereas the floating point paths (registers and ALUs) are 64 bits wide in most modern processors. Because vertex processing is inherently parallelizable, SIMD instructions allow two vertex-processing operations to be performed simultaneously using a single floating-point instruction, with each vertex using half of the 64-bit data path. Yang, Sano and Lebeck16 showed that SIMD instructions can improve geometry transformation performance by 20 to 30%.

3.2 Accelerated graphics port

Figure 10 shows a high-level view of the components of a PC desktop system. It consists of the processor, main memory, the north bridge, PCI-based devices and various interconnecting buses. The north bridge contains the memory controller and provides connections among the different system components. The main processor fetches the 3D model from main memory, performs geometry transformation, and writes the result back to main memory. The graphics accelerator sitting on the PCI bus uses DMA (Direct Memory Access) to retrieve that data from the main memory and then performs rasterization. One major bottleneck of PC-based systems is the transfer bandwidth over the PCI bus, which connects the system memory to the local memory of the graphics accelerator1. The CPU needs to transfer geometry data, graphics commands as well as texture data to the graphics accelerator. Typically, the geometry information associated with a vertex is about 32 bytes1, including the vertex coordinates, colour, and texture coordinates, i.e. 32 MB/sec for 1 million vertices. This information crosses the processor bus twice (once for reading and once for writing in the geometry transformation stage), the PCI bus once (transferring data to the graphics card), and the memory bus three times (in all the above cases). In addition, a large amount of texture data needs to be transferred over the PCI bus as well. The peak bandwidth of a 32-bit, 33-MHz PCI bus is 132 MB/sec, which is still not quite sufficient. To solve this problem, Intel introduced a new bus specification called Accelerated Graphics Port (AGP)17. AGP connects the graphics accelerator exclusively to the main memory subsystem (refer to Figure 11). AGP has four main advantages over PCI:

1. Reduction of load on PCI: The primary advantage of AGP is that it eliminates the graphics-related bandwidth requirement from the PCI bus by transferring data from the main memory to the graphics card over a dedicated bus.

2. Higher peak bandwidth: AGP 2X (32-bit data path at 66 MHz) transfers data on both edges of the clock, thereby achieving a peak bandwidth of 528 MB/sec. AGP 4X has a bandwidth of 1 GB/sec.

3. Higher sustainable bandwidth: AGP supports pipelining of requests, i.e. overlapping the access time of request n with the issue of requests n + 1, n + 2 and so on. It also supports sidebanding, which provides extra address lines to issue new requests while the main data/address lines are transferring the data corresponding to previous requests. These two features make it more likely for AGP to achieve a sustained bandwidth much closer to its peak bandwidth.

4. Direct memory execute: The amount of local memory present in the graphics accelerator is limited. However, to obtain more realistic images, applications use more and more high-resolution textures, all of which cannot fit into the local memory. Hence, the graphics driver needs to perform texture memory management that keeps track of the textures present in the local memory and downloads the required textures before they are used. This can introduce significant latency, as the rendering engine waits for the complete mip-map of the texture image to be downloaded over the PCI/AGP bus.

Figure 10. High-level view of the components of a PCI-based graphics subsystem.
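The peak-bandwidth figures quoted for PCI and AGP follow from bus width, clock rate and transfers per clock. A sketch of the arithmetic; the function name is ours.

```python
# Peak bus bandwidth in MB/sec:
# (bus width in bytes) x (clock in MHz) x (transfers per clock).
def peak_mb_per_sec(bus_bits, clock_mhz, transfers_per_clock=1):
    return bus_bits // 8 * clock_mhz * transfers_per_clock

pci    = peak_mb_per_sec(32, 33)     # 132 MB/sec, shared with other PCI devices
agp_2x = peak_mb_per_sec(32, 66, 2)  # 528 MB/sec: both edges of a 66 MHz clock
agp_4x = peak_mb_per_sec(32, 66, 4)  # 1056 MB/sec, i.e. about 1 GB/sec
```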



AGP provided a new feature called direct memory execute (DIME) that allows the graphics accelerator to directly address main system memory over the AGP bus. A translation table in the AGP controller, similar to the virtual-to-physical address translation table in the CPU, allows non-consecutive memory pages to appear as a single contiguous address space to the accelerator. This way the graphics accelerator can cache the heavily used textures in the local memory, and access the comparatively little-used ones directly from the system memory.

Figure 11. High-level view of the components of an AGP-based graphics subsystem.

3.3 Bucket rendering

Traditional rendering requires random access to the entire frame buffer, and it is not very cost-effective to provide a large high-bandwidth frame buffer. An interesting architectural idea that addresses this problem is bucket rendering. Bucket rendering is a technique in which the screen space is partitioned into tiles (also called chunks), and all the primitives of the scene are sorted into buckets, where each bucket contains the primitives that intersect the corresponding tile. This architecture renders the scene one tile/bucket at a time, thereby reusing the Z-buffer, the alpha-buffer as well as the other necessary buffers for storing the results of intermediate rendering. At the end, all the tiles are collected together to form the final image. Bucket rendering has been implemented in Pixel-Planes 5 (ref. 18), PixelFlow19, Talisman20 and the commercially available PowerVR from NEC/VideoLogic. The main advantages of bucket rendering are the following:

• Since only one tile's worth of rasterization buffer is required, as opposed to a full-screen buffer, it is possible to use more bits per buffer entry to support more advanced rendering techniques such as oversampling or anti-aliasing, which rasterizes each pixel at a higher resolution and then down-samples the result to the required resolution.

• The tiled architecture matches very well the emerging embedded DRAM process, which can provide small on-chip memory and high memory access bandwidth.

The main disadvantages of this architecture are:

1. It requires an additional pipeline stage to sort triangles into buckets, thus increasing the total rendering latency.

2. Redundant work is performed because large primitives may overlap with multiple tiles.

3.4 Composited image layers: Talisman

Microsoft introduced the Talisman architecture in 1996; it comprised several independent ideas. However, the key distinguishing feature of Talisman is composited image layer rendering20, which exploits frame-to-frame coherence for the first time. In traditional architectures, all the primitives are rendered in each frame even though there is a great deal of coherence between consecutive frames. Instead, Talisman renders each primitive on a separate image surface. All the image surfaces are then composited together to form the final image. In the next frame, the image for a primitive is transformed in the screen space, given the transformation matrices in the object space. If the error introduced by the image-space transformation is below a threshold, the transformed image can be used as the final result of rendering. This architecture relies on the fact that image-space transformation is much less expensive than object-space transformation, and that image layer composition can be performed more efficiently. The main disadvantages of this architecture are its complexity and gate count, and its incompatibility with traditional APIs like OpenGL. As a result, no commercial attempt has been made so far to implement the Talisman architecture.

4. Parallel architecture

The 3D graphics pipeline is computation intensive, but is quite amenable to parallel implementation both in the object space as well as in the image space. Exploiting the graphics pipeline's parallelism can significantly reduce the total polygon rendering time. A considerable amount of research effort has been invested so far to design and implement various efficient parallel polygon rendering engines. In this section, we briefly describe different classes


Because the fundamental issue in 3D rendering is sorting the geometric primitives with respect to a given viewpoint, the parallelization strategies for polygon rendering can be classified as sort-first, sort-middle and sort-last, depending on where the sorting operation is performed19. The three architectures are illustrated in Figure 12.

Figure 12. Sort-first, sort-middle, and sort-last parallel rendering architectures. The main difference between the architectures is where the distribution/sorting of primitives takes place. G represents the geometric transformation engine and R represents the rasterization engine.

In the sort-first strategy21, the image space is partitioned into regions and each processor is responsible for all the rendering calculations (both geometry and rasterization) in the region to which it is assigned. The screen-space bounding box of each 3D primitive is calculated by performing just enough transformations. Every 3D primitive is then distributed to the processors responsible for the image regions with which its bounding box overlaps; one primitive can be sent to multiple processors. From this point on, the set of 3D primitives in each processor goes through geometric transformation and rasterization completely independently of primitives in other processors. Finally, the image regions from the processors are simply combined to form the final rendered image. The sort-first architecture has received the least attention so far because of the load-imbalance problem in the transformation and rasterization stages. However, as Mueller21 pointed out, the sort-first architecture can easily exploit frame-to-frame coherence, and he proposed a new adaptive algorithm to achieve better load balancing.

In the sort-middle strategy22, the image space is again partitioned and each processor is responsible for one image region. 3D primitives are first transformed and then distributed to different processors based on the transformed X and Y coordinates of the primitives. Again, a primitive is sent to multiple processors if it crosses image-region boundaries. After distribution, each processor performs rasterization on the transformed primitives independently of one another to produce a sub-image for the associated image region. The sub-images are then combined to form the final projection image. Sort-middle seems to be the most natural architecture and has been implemented both in hardware and in software; both InfiniteReality13 and RealityEngine Graphics12 use the sort-middle strategy. The main disadvantages of the sort-middle architecture are the load imbalance in the rasterization stage and the communication cost due to the redistribution of primitives after transformation. Crockett and Orloff22 proposed a static scan-line-based scheme for image-space partitioning. Whitman23 suggested adaptive load-balancing schemes for the sort-middle architecture, while Ellsworth24 took advantage of frame-to-frame coherence to achieve better load balancing.

The sort-last strategy partitions the 3D input model at the beginning of the rendering pipeline without taking into account the viewpoint or object coordinates, performs geometric transformation and rasterization on each partition independently to produce a partial image, and finally composites the partial images according to the depth value of each image pixel. Because of its simplicity, the sort-last approach has been implemented in several systems, including PixelFlow19 from the University of North Carolina, which uses a high-speed combining network to composite sub-images. The performance of the sort-last strategy depends critically on the composition stage, and various methods have been proposed to perform it. The simplest method is to send the sub-images to a single compositing processor19. Other proposed schemes are binary-tree composition25, binary-swap composition26,27 and parallel pipeline composition28. Mitra and Chiueh29 showed that all previously proposed sub-image compositing methods can be unified in a single framework.

In general, in sort-last, a processor sends all the pixels of the relevant image space to another processor; this is known as the sort-last-full technique30. Cox and Hanrahan31 pointed out that it is sufficient to send only the 'active' pixels of the image space, which is termed sort-last-sparse. The trade-off between the two methods is the communication overhead versus the extra processing required to encode the 'active' pixels.

Until recently, all parallel rendering engines were implemented either as dedicated ASICs, such as RealityEngine and InfiniteReality, or on massively parallel message-passing or distributed shared-memory machines such as the Intel Paragon. Currently, however, advances in processor and graphics-accelerator technology, as well as the emergence of gigabit local network technology such as Myrinet, have made it possible to implement high-performance 3D graphics engines on a cluster of workstations, each of which is equipped with a low-cost 3D graphics card32,33. The basic parallelization strategies remain the same for these architectures; however, the loosely coupled network topology may require different kinds of load-balancing and composition algorithms.
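The sort-last composition step, merging partial images by per-pixel depth comparison, can be sketched as follows. This is a serial reduction over full sub-images (the sort-last-full case); the (depth, color) pixel representation and all names are illustrative assumptions, not the interface of any of the cited systems.

```python
# Sketch of sort-last depth compositing: each processor produces a full-screen
# partial image of (depth, color) pairs; the composite keeps, at every pixel,
# the sample with the smallest depth (closest to the viewer).

INF = float("inf")
BACKGROUND = (INF, None)  # empty pixel: infinitely far, no color

def composite(partial_images):
    """Serially composite a list of partial images (each a list of (z, color))."""
    width = len(partial_images[0])
    result = [BACKGROUND] * width
    for image in partial_images:
        # per pixel, keep whichever sample is nearer (smaller z)
        result = [min(a, b, key=lambda p: p[0]) for a, b in zip(result, image)]
    return result

# Two processors rendered different primitives into 4-pixel partial images.
img_a = [(2.0, "red"), (INF, None), (5.0, "red"), (1.0, "red")]
img_b = [(3.0, "blue"), (4.0, "blue"), (INF, None), (0.5, "blue")]
print(composite([img_a, img_b]))
# → [(2.0, 'red'), (4.0, 'blue'), (5.0, 'red'), (0.5, 'blue')]
```

Binary-tree, binary-swap and parallel-pipeline composition differ only in how this pairwise depth-merge is scheduled across processors, which determines the communication pattern and load balance of the composition stage.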
5. Conclusion

A unique characteristic of 3D graphics applications is that there is no end to the addition of new features to the standard graphics pipeline. Unlike microprocessors, 3D graphics requires both advances in performance, i.e. more triangles and more pixels per second, and new and improved techniques that deliver more realistic images and cinematic effects. Engineering and scientific 3D applications, such as Computer Aided Design (CAD) and Computational Fluid Dynamics (CFD), as well as entertainment applications, such as computer games and animated movies, all require higher-quality rendered images at a faster rate, placing an increasing demand on the triangle and pixel rates. Therefore, we expect that 3D graphics architecture will remain a challenging field in the foreseeable future, with abundant room for further algorithmic and architectural innovation.
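The scale of the triangle- and pixel-rate demand can be made concrete with back-of-envelope arithmetic. The frame rate, model complexity, resolution and depth complexity below are assumed figures for illustration, not numbers from the article.

```python
# Back-of-envelope rendering-rate arithmetic: sustaining an interactive frame
# rate over a complex model multiplies into very large per-second rates.
frames_per_second = 30
triangles_per_frame = 1_000_000   # assumed model complexity
pixels = 1280 * 1024              # assumed display resolution
depth_complexity = 3              # assumed average overdraw per pixel

triangle_rate = frames_per_second * triangles_per_frame
pixel_rate = frames_per_second * pixels * depth_complexity
print(triangle_rate)  # → 30000000 triangles/s
print(pixel_rate)     # → 117964800 pixel writes/s
```

Even these modest assumptions demand tens of millions of transformed triangles and over a hundred million rasterized pixels per second, which is why the parallel architectures of Section 4 matter.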
1. Kirk, D., in Proc. of the 13th ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware (keynote address), http://www.merl.com/hwws98/presentations/kirk/index.html, 1998, pp. 11–13.
2. Neider, J., Davis, T. and Woo, M., OpenGL Programming Guide, Addison-Wesley, 1993.
3. Microsoft Corporation, http://www.microsoft.com/directx/overview/d3d/default.asp, 1996.
4. Foley, J. D., van Dam, A., Feiner, S. K., Hughes, J. F. and Phillips, R. L., Computer Graphics: Principles and Practice, Addison-Wesley, 1990, 2nd edn.
5. Fuchs, H. et al., in Proc. of the 12th Annual ACM Conference on Computer Graphics (SIGGRAPH), 1985.
6. Pineda, J., in Proc. of the 15th Annual ACM Conference on Computer Graphics (SIGGRAPH), 1988, pp. 17–20.
7. Williams, L., in Proc. of the 10th Annual ACM Conference on Computer Graphics (SIGGRAPH), 1983.
8. OpenGL Performance Characterization Group, http://www.spec.org/gpc/opc.static/opcview.htm.
9. Dunwoody, J. C. and Linton, M. A., in Proc. of the ACM Symposium on Interactive 3D Graphics, 1990, pp. 155–163.
10. Chiueh, T. and Lin, W., in Proc. of the 12th ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, 1997, pp. 17–24.
11. Mitra, T. and Chiueh, T., in Proc. of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), 1999, pp. 62–71.
12. Akeley, K., in Proc. of the 20th Annual ACM Conference on Computer Graphics (SIGGRAPH), 1993, pp. 109–116.
13. Montrym, J. S., Baum, D. R., Dignam, D. L. and Migdal, C. J., in Proc. of the 24th Annual ACM Conference on Computer Graphics (SIGGRAPH), 1997, pp. 293–302.
14. Intel Corporation, http://developer.intel.com/design/PentiumIII/manuals/, 1999.
15. Advanced Micro Devices, Inc., http://www.amd.com/products/cpg/3dnow/inside.html.
16. Yang, C., Sano, B. and Lebeck, A. R., in Proc. of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, 1998, pp. 14–24.
17. Intel Corporation, http://www.intel.com/technology/agp/agp_index.htm, 1998.
18. Fuchs, H. et al., in Proc. of the 16th Annual ACM Conference on Computer Graphics (SIGGRAPH), 1989, pp. 79–88.
19. Molnar, S., Eyles, J. and Poulton, J., in Proc. of the 19th Annual ACM Conference on Computer Graphics (SIGGRAPH), 1992, pp. 231–240.
20. Torborg, J. and Kajiya, J. T., in Proc. of the 23rd Annual ACM Conference on Computer Graphics (SIGGRAPH), 1996, pp. 353–363.
21. Mueller, C., in Proc. of the ACM Symposium on Interactive 3D Graphics, 1995, pp. 75–84.
22. Crockett, T. W. and Orloff, T., IEEE Parallel Distrib. Technol.: Syst. Appl., 1994, 2, 17–28.
23. Whitman, S., IEEE Comput. Graphics Appl., 1994, 14, 41–48.
24. Ellsworth, D., IEEE Comput. Graphics Appl., 1994, 14, 33–40.
25. Shaw, C., Green, M. and Schaeffer, J., Advances in Computer Graphics Hardware III, 1991.
26. Ma, K., Painter, J. S., Hansen, C. D. and Krogh, M. F., IEEE Comput. Graphics Appl., 1994, 14, 59–68.
27. Karia, R. J., in Proc. of the IEEE Scalable High Performance Computing Conference, 1994, pp. 252–258.
28. Lee, T., Raghavendra, C. S. and Nicholas, J. B., IEEE Trans. Vis. Comput. Graphics, 1996, 2, 202–217.
29. Mitra, T. and Chiueh, T., in Proc. of the 6th IEEE International Conference on Parallel and Distributed Systems, 1998.
30. Molnar, S., Cox, M., Ellsworth, D. and Fuchs, H., IEEE Comput. Graphics Appl., 1994, 14, 23–32.
31. Cox, M. and Hanrahan, P., IEEE Parallel Distrib. Technol.: Syst. Appl., 1994, 2.
32. Samanta, R. et al., in Proc. of the 14th ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, 1999, pp. 107–116.
33. Experimental Computer Systems Lab, Department of Computer Science, SUNY at Stony Brook, http://www.ecsl.cs.sunysb.edu/sunder.html.
