    ExtractNormalizedEigenvectors(R)
}
DetermineCompressionMatrix()
{
    CalculateRotationMatrix();
    For Every Vertex
        v' = R^-1 * v
        LowerRange = Minimum( LowerRange, v' )
        UpperRange = Maximum( UpperRange, v' )
    O = TranslationMatrix( LowerRange )
    S = ScaleMatrix( UpperRange - LowerRange )
    C = R^-1 * O^-1 * S^-1
}
CompressionTransform( originalVal )
{
    v = originalVal
    v'' = C * v
    return FloatToInteger( v'' )
}
Decompression
We transform the incoming values by the decompression matrix. This has the effect of trans-
forming the compressed vector from compression space back into normal space.
Constants
{
    D = C^-1
}
Decompress( compressedVal )
{
return (IntegerToFloat( compressedVal ) * D)
}
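To make the pseudocode concrete, here is a minimal CPU-side sketch of building C and D using the D3DX math helpers. The function name and the decision to fold the integer-range scale in separately are illustrative assumptions, not code from the original text; it follows the row-vector (v * M) convention used by D3DX:

#include <d3dx8math.h>

// Build the compression matrix C = R^-1 * O^-1 * S^-1 and the decompression
// matrix D = C^-1 from an already-computed OBB rotation R and the per-axis
// LowerRange/UpperRange found in the loop above.
void BuildCompressionMatrices(const D3DXMATRIX& R,
                              const D3DXVECTOR3& lower,
                              const D3DXVECTOR3& upper,
                              D3DXMATRIX* pC,
                              D3DXMATRIX* pD)
{
    D3DXMATRIX Rinv, Oinv, Sinv;
    D3DXVECTOR3 extent = upper - lower;

    D3DXMatrixInverse(&Rinv, NULL, &R);                                    // R^-1
    D3DXMatrixTranslation(&Oinv, -lower.x, -lower.y, -lower.z);            // O^-1
    D3DXMatrixScaling(&Sinv, 1.0f/extent.x, 1.0f/extent.y, 1.0f/extent.z); // S^-1

    // v'' = v * C maps every vertex into the 0..1 compression space
    D3DXMatrixMultiply(pC, &Rinv, &Oinv);
    D3DXMatrixMultiply(pC, pC, &Sinv);

    // D undoes C; the scale/bias for the integer data type (e.g., SHORT4)
    // is folded in on top of this when the final decompression matrix is built.
    D3DXMatrixInverse(pD, NULL, pC);
}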
Practical Example
This compression method is ideal for position data, except for huge objects. Using 16-bit data
types gives 65,535 values across the longest distance of the object; in most cases, there will be
no visible loss in quality compared with the original float version. We have to compensate for the
data type range by changing the scale and translation matrix (same as scaled offset) and then
transpose the matrix (as usual for vertex shader matrices). This gives us a 25% savings in ver-
tex size for four cycles of vertex code.
Compression transform shader example:
; v1 = position in the range -32767.0 to 32768.0 (integer only)
; c0-c3 = Transpose of the decompression matrix
m4x4 r0, v1, c0 ; decompress input vertex
Using the same data stream from the earlier example, we can now see what the vertex shader
might look like.
Compressed vertex example:
D3DVSD_STREAM(0),
D3DVSD_REG( 0, D3DVSDT_SHORT4), // position
D3DVSD_REG( 1, D3DVSDT_UBYTE4), // normal
D3DVSD_REG( 2, D3DVSDT_SHORT2), // texture coordinate
D3DVSD_END()
Compressed vertex shader example:
; v1.xyz = position in the range -32767.0 to 32768.0 (integer only)
; v2.xyz = normal in the range 0.0 to 255.0 (integer only)
; v3.xy = uv in the range -32767.0 to 32768.0 (integer only)
; c0-c3 = Transpose of the world view projection matrix
; c4-c7 = Transpose of the decompression matrix
; c8 = <1.0/255.0, 1.0/255.0, 1.0/255.0, 1.0/255.0>
; c9 = <u scale / 65535.0, v scale / 65535.0, u offset + 0.5, v offset + 0.5>
m4x4 r0, v1, c4 ; decompress compressed position
mul r1, v2, c8 ; multiply compressed normal by 1/255
mad r2.xy, v3.xy, c9.xy, c9.zw ; multiply uv by scale and add offset
; now the normal vertex shader code, this example just transforms
; the position and copies the normal and uv coordinate to the rasterizer
m4x4 oPos, r0, c0 ; transform position into HCLIP space
mov oT0, r2 ; copy uv into texture coordinate set 0
mov oT1, r1 ; copy normal into texture coordinate set 1
For an extra six cycles of vertex shader code, we have reduced vertex data size by 50%.
Optimizations
We can usually eliminate most of the decompression instructions, reducing the shader execu-
tion time to roughly the same as the uncompressed version.
The first optimization is noting that anywhere we are already doing a 4x4 matrix trans-
form, we can get any of the previous compressions for free. By incorporating the
decompression step into the transform matrix, the decompression will occur when the matrix
vector multiply is performed (this works because all the compressions involve linear trans-
forms). This is particularly important for position; we always do a 4x4 matrix transform (to
take us from local space into HCLIP space) so we can decompress any of the above for free!
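For example, the application can fold the decompression matrix into the matrix it was already going to upload. A minimal sketch, assuming a D3DX-based setup and the c0-c3 slot used by the shaders in this article (the function name is illustrative):

#include <d3d8.h>
#include <d3dx8math.h>

void SetDecompressionWVP(IDirect3DDevice8* pDevice,
                         const D3DXMATRIX& decompression,   // D = C^-1
                         const D3DXMATRIX& worldViewProj)
{
    D3DXMATRIX combined, transposed;
    // decompress first, then transform: combined = D * WorldViewProj
    D3DXMatrixMultiply(&combined, &decompression, &worldViewProj);
    // transpose as usual for vertex shader matrices used with m4x4/dp4
    D3DXMatrixTranspose(&transposed, &combined);
    pDevice->SetVertexShaderConstant(0, &transposed, 4);    // c0-c3
}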
Another optimization is the usual vertex shader rule of never moving a temporary register
into an output register. Wherever you can, simply write the result directly into the output register.
Anywhere you use quantized D3DVSDT_UBYTE4, you might as well use
D3DVSDT_D3DCOLOR instead, as it automatically does the divide by 255.0, and swizzling is
free in the vertex shader. This also applies the other way; if you are scaling a
D3DVSDT_D3DCOLOR by 255.0 (and your device supports both), use
D3DVSDT_UBYTE4.
Practical Example
By applying these three rules to the same data as before, we achieve 50% savings of memory
and bandwidth for zero cycles.
Optimized compressed vertex example:
D3DVSD_STREAM(0),
D3DVSD_REG( D3DVSDE_POSITION, D3DVSDT_SHORT4),
D3DVSD_REG( D3DVSDE_NORMAL, D3DVSDT_D3DCOLOR),
D3DVSD_REG( D3DVSDE_TEXCOORD0, D3DVSDT_SHORT2),
D3DVSD_END()
Optimized compressed vertex shader example:
; v1.xyz = position in the range -32767.0 to 32768.0 (integer only)
; v2.xyz = normal in the range 0.0 to 255.0 (integer only) (swizzled)
; v3.xy = uv in the range -32767.0 to 32768.0 (integer only)
; c0-c3 = Transpose of the decompression world view projection matrix
; c4 = <u scale / 65535.0, v scale / 65535.0, u offset + 0.5, v offset + 0.5>
m4x4 oPos, v1, c0 ; decompress and transform position
mad oT0.xy, v3, c4.xy, c4.zw ; multiply uv by scale and add offset
mov oT1.xyz, v2.zyx ; swizzle and output normal
Advanced Compression
There are several methods that extend and improve on the basic techniques described in the
previous section for special situations.
Multiple Compression Transforms
The main reason that you may not be able to use a compression transform for position is if the
object is huge but still has fine details (these are classic objects like large buildings, huge space
ships, or terrain). This technique trades vertex constant space for an increase in precision
across the object. It doesn't mix with palette skinning very well, but for static data it can
increase precision significantly.
Essentially, it breaks the object up into separate compressed areas and each vertex picks
which area it belongs to. The main issue is making sure the compressor doesn't cause gaps in
the object.
We load the address register with the index of the matrix to use, which we store in a spare
vertex component (position w is usually spare with compressed data).
Note: This is the same technique used to speed up rendering of lots of small
objects (things like cubes): Every vertex is treated as if it were skinned with a sin-
gle index and a weight of 1.0. This allows a single render call to have multiple
local matrices. We just use it to select compression matrices rather than local
space matrices.
Compression
The object is broken into chunks beforehand and the standard compressor is applied to each
chunk, storing the chunk number in the vertex.
The exact method of breaking the object into chunks will be very data-dependent. A sim-
ple method might be to subdivide the object into equally spaced chunks; another might be to
choose an area of high detail.
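For the equally spaced option, a vertex might be assigned to a chunk as in the sketch below (hypothetical helper; a real compressor would also have to verify that vertices shared across chunk boundaries do not open up gaps):

// Assign a vertex to one of numChunks equally spaced slabs along x.
// The standard compressor is then run once per chunk, and chunkNum is
// stored in the spare position w component.
int ChunkOf(float x, float minX, float maxX, int numChunks)
{
    float t = (x - minX) / (maxX - minX);          // 0..1 across the object
    int chunk = (int)(t * numChunks);
    if (chunk >= numChunks) chunk = numChunks - 1; // keep the far edge in range
    if (chunk < 0) chunk = 0;
    return chunk;
}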
MultipleCompressionTransforms
{
BreakObjectIntoChunks()
For Each Chunk
DetermineCompressionMatrix()
compressionMatrixArray[chunkNum] = C
For Each Vertex v in Chunk
{
v" = C * v
Store FloatToInteger(v") && chunkNum
}
}
Decompression
The only change from the standard compression transform decompression is to use the stored
chunkNum to index into the compression matrix first.
Constants
{
    D[MAX_CHUNKS] = C^-1[MAX_CHUNKS]
}
Decompress( compressedVal, chunkNum )
{
return (IntegerToFloat( compressedVal ) * D[chunkNum] )
}
Practical Example
This compression method is good for position data, even for large objects, but isn't usable for
skinned or multi-matrix objects. Each matrix takes four constants, so for equally spaced com-
pression matrices, it quickly fills up constant space. The only extra cost is a single instruction
to fill the index register.
Multiple compression transform shader example:
; v1.xyz = position in the range -32767.0 to 32768.0 (integer only)
; v1.w = (compression matrix index * 4) in the range 0 to MAX_CHUNKS*4
; c0 = < 1.f, ?, ?, ?>
; array of decompression matrices starting at c1 (four constants per chunk)
; i.e., c1-c4 = Transpose of the 1st decompression matrix
; c5-c8 = Transpose of the 2nd decompression matrix
mov a0.x, v1.w ; choose which decompression matrix to use
mov r0.xyz, v1.xyz ; for replacing w
mov r0.w, c0.x ; set w =1
m4x4 r0, r0, c[a0.x + 1] ; decompress input vertex using selected matrix
Sliding Compression Transform
Rather than pack more data into the already used vector components, this time we put the extra
dummy padding to good use. By encoding displacement values in the padding, we displace
the entire compression matrix, allowing each vertex to slide the compression space to a better
place for it.
The number and size of the spare padding components ultimately decide how much extra precision is
available. Vertex shaders are not very good at extracting multiple numbers from a single padding
value, so it is best not to; instead, we can split values in the compressor and reconstruct the original
value in the vertex shader.
Vertices store the offset from the local grid origin and store the grid number in the padding. This
example has six times more precision along both world axes than the usual compression trans-
form. The grid is aligned with the compression space, thus preserving maximum precision.
Another way to look at it is that you break the range into discrete chunks and they are treated
separately until decompression.
Compression
The first calculation is to decide how much extra precision you can get out of the padding.
There is a different execution cost depending on how or what you have to extract, and this is
likely to be the most important factor. The ideal situation is three padding values with enough
range. The worst is to extract three values from a single padding. A typical case is one short
pad and one byte pad (from a position and a normal); for this case, we can rearrange the data
so we have three 8-bit values for extra precision.
In all cases, the extra precision has to be factored into the compressor. This can
be done by dividing by a range that produces values greater than 1.f and then taking the integer
components as the displacement scalars and the fractional remainders as the compression values. The
other method is to treat the slide as adding directly to the precision.
Figure 3: Sliding compression
space example
Sliding Compression Transform
{
    CalculateRotationMatrix()
    For Each Vertex v
    {
        v' = R^-1 * v
        LowerRange = Minimum( LowerRange, v' )
        UpperRange = Maximum( UpperRange, v' )
    }
    O = TranslationMatrix( LowerRange )
    S = ScaleMatrix( UpperRange - LowerRange )
    S = S / size of slide data-type
    C = R^-1 * O^-1 * S^-1
    For Each Vertex v
    {
        v'' = C * v
        c = frac( v'' )
        g = floor( v'' )
        store FloatToInteger( c )
        store PackPadding( g )
    }
}
The packing into a spare short and a byte is done by changing the four shorts into two lots of 4
bytes. The space used is the same, but we can access it differently.
Original stream definition:
D3DVSD_REG(0, D3DVSDT_SHORT4),
D3DVSD_REG(1, D3DVSDT_UBYTE4),
New stream definition:
D3DVSD_REG( 0, D3DVSDT_UBYTE4),
D3DVSD_REG( 1, D3DVSDT_UBYTE4),
D3DVSD_REG( 2, D3DVSDT_UBYTE4),
We place the lower 8 bits of each position component in register 0 and the upper 8 bits in regis-
ter 1. This leaves the w component of each register for our slide value. One thing to watch for
is the change from signed 16-bit values to unsigned 16-bit values caused by the split; this
means the decompression matrix may have to change a little.
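One way to do the split on the CPU is sketched below (a hypothetical helper, written so that the (upper * 255.0) + lower recombination used by the decompression shader in the next section recovers the value exactly; the struct mirrors the three UBYTE4 registers of the new stream definition):

// Each compressed component is split so that value = upper*255 + lower,
// leaving the w of each UBYTE4 register free for the per-axis slide value.
struct SlidingPackedVertex
{
    unsigned char lo[3]; unsigned char slideX;   // register 0: lower parts + x slide
    unsigned char hi[3]; unsigned char slideY;   // register 1: upper parts + y slide
    unsigned char n[3];  unsigned char slideZ;   // register 2: normal + z slide
};

void SplitComponent(unsigned int value,          // quantized component, 0..65279
                    unsigned char* pLo, unsigned char* pHi)
{
    *pHi = (unsigned char)(value / 255);         // recombined in the shader as hi*255
    *pLo = (unsigned char)(value % 255);         // ... + lo
}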
Decompression
To decompress, we recombine the position, get the slide values, and scale this by the grid size,
which is then added to the compressed value.
Constants
{
    D = C^-1
GridSize = size of data-type
S = <255.0, 255.0, 255.0, 0.0>
}
Decompress( lowerVal, upperVal, slideVal )
{
compressedVal = (IntegerToFloat(upperVal) * S) + IntegerToFloat( lowerVal )
return ( (compressedVal + (slideVal * GridSize)) * D )
}
; v1.xyz = lower half of position in the range 0 to 255.0 (integer only)
; v1.w = x slide
; v2.xyz = upper half of position in the range 0 to 255.0 (integer only)
; v2.w = y slide
; v3.xyz = normal in the range 0 to 255.0 (integer only)
; v3.w = z slide
; c1 = <255.0, 255.0, 255, 0.0>
; c2 = <65535.0, 1.0, 0.0, 0.0>
mov r1.xyz, v1.xyz ; due to 1 vertex register per instruction
mad r0.xyz, v2.xyz, c1.xyz, r1.xyz ; r0.xyz = position (upper * 255.0) + lower
mad r0.x, v1.w, c2.x, r0.x ; (x slide * 65535.0) + position x
mad r0.y, v2.w, c2.x, r0.y ; (y slide * 65535.0) + position y
mad r0.z, v3.w, c2.x, r0.z ; (z slide * 65535.0) + position z
mov r0.w, c2.y ; set w = 1
; now decompress using a compression matrix
A cost of six cycles has improved our precision 255 times in all axes by reusing the padding
bytes. This effectively gives us 24 bits across the entire object, matching the precision (but not
the range) of the original floating-point version.
Displacement Maps and Vector Fields
Displacement maps (and to a lesser degree their superset, vector fields) are generally consid-
ered to be a good thing. It's likely that future hardware and APIs will support this primitive
directly, but by using vertex shaders, we can do a lot today.
First some background. What are displacement maps and vector fields? A displacement
map is a texture with each texel representing distance along the surface normal. If that sounds a
lot like a height field, it's because displacement maps are the superset of height fields (height
fields always displace along the world up direction). Vector fields store a vector at each texel,
with each component representing the distance along a basis vector.
Basic Displacement Maps
Figure 4: Displacement map example
We have a supporting plane which is constant for the entire displacement map and the dis-
placement along the plane normal at every vertex. For true hardware support, that would be all
that is needed, but we need more information.
To do displacement maps in a vertex shader requires an index list, whereas true displace-
ment map hardware could deduce the index list from the rectilinear shape. Since the index data
is constant for all displacement maps (only one index list per displacement map size is
needed), the overhead is very small.
We also have to store the displacement along the u and v directions, as we can't carry infor-
mation from one vertex to another. This is constant across all displacement maps of the same
size, and can be passed in through a different stream. This makes the vector fields and dis-
placement maps the same for a vertex shader program; the only difference is whether the u and
v displacement is constant and stored in a separate stream (displacement maps) or arbitrarily
changes per vertex (vector fields).
We can't automatically compute the normal and tangent (we would need per-primitive
information), so we have to send them along with a displacement value (if needed; world space
per-pixel lighting removes the need).
The easiest way (conceptually) is to store in vertex constants the three basis vectors repre-
senting the vector field. One basis vector is along the u edge, another the v edge, and the last is
the normal. We also need to store maximum distance along each vector and a translation vector
to the <0.0, 0.0, 0.0> corner of the displacement map.
For each vertex, we take the displacement along each vector and multiply it by the appropriate
scale and basis vector. We then add them to the translation vector to get our displaced point.
Basis displacement map/vector field shader example:
; v1.x = displacement along w basis vector (normal) (dW)
; v2.xy = displacements along u and v basis vectors (dU and dV)
; c0.xyz = U basis vector (normalized) (U)
; c1.xyz = V basis vector (normalized) (V)
; c2.xyz = W basis vector (normalized) (W)
; c3.xyz = translation to <0.0, 0.0, 0.0> corner of map (T)
; c4.xyz = U V W scale
; c5-c8 = world view projection matrix
; multiply normalized vectors by their scales
mul r0.xyz, c0.xyz, c4.x ; scale U vector
mul r1.xyz, c1.xyz, c4.y ; scale V vector
mul r2.xyz, c2.xyz, c4.z ; scale W vector
; generate displaced point
mul r0.xyz, v2.x, r0.xyz ; r0 = dU * U
Figure 5: Displacement vector diagram
mad r0.xyz, v2.y, r1.xyz, r0.xyz ; r0 = dV * V + dU * U
mad r0.xyz, v1.x, r2.xyz, r0.xyz ; r0 = dW * W + dV * V + dU * U
add r0.xyz, r0.xyz, c3.xyz ; r0 = T + dW * W + dV * V + dU * U
; standard vector shader code
m4x4 oPos, r0, c5 ; transform position into HCLIP space
Hopefully, you can see this is just an alternative formulation of the compression matrix method
(a rotate, scale, and translation transform). Rather than optimize it in the same way
(pre-multiply the transform matrix by the compression transform), we follow it through the
other way, moving the displacement vector into HCLIP space.
Entering Hyperspace
Moving the displacement vector into HCLIP space involves understanding four-dimensional
space. By using this 4D hyperspace, we can calculate a displacement vector that works directly
in HCLIP space.
It sounds more difficult than it really is. HCLIP space is the space before perspective divi-
sion. The non-linear transform of perspective is stored in the fourth dimension, and after the
vertex shader is finished (and clipping has finished), the usual three dimensions will be divided
by w.
The most important rule is that everything works as it does in 3D, just with an extra com-
ponent. We treat it as we would any other coordinate space; it just has an extra axis.
All we have to do is transform (before entry into the vertex shader) the basis and translation
vectors into HCLIP space and use them as before (only now with four components); the displaced
point is then generated directly in HCLIP space, so we don't need to transform it again. We
also pre-multiply the basis scales before transforming the basis vectors into HCLIP space.
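A sketch of the CPU-side setup under those rules follows (names assumed; the basis vectors are transformed as directions with w = 0, the corner translation as a point with w = 1):

#include <d3d8.h>
#include <d3dx8math.h>

void SetHClipBasis(IDirect3DDevice8* pDevice,
                   const D3DXVECTOR3& U, const D3DXVECTOR3& V, const D3DXVECTOR3& W,
                   const D3DXVECTOR3& corner, const D3DXVECTOR3& scale,
                   const D3DXMATRIX& worldViewProj)
{
    D3DXVECTOR4 c[4];
    // pre-multiply the basis scales, then treat the vectors as directions (w = 0)
    D3DXVECTOR4 u(U.x*scale.x, U.y*scale.x, U.z*scale.x, 0.0f);
    D3DXVECTOR4 v(V.x*scale.y, V.y*scale.y, V.z*scale.y, 0.0f);
    D3DXVECTOR4 w(W.x*scale.z, W.y*scale.z, W.z*scale.z, 0.0f);
    D3DXVec4Transform(&c[0], &u, &worldViewProj);
    D3DXVec4Transform(&c[1], &v, &worldViewProj);
    D3DXVec4Transform(&c[2], &w, &worldViewProj);
    D3DXVec3Transform(&c[3], &corner, &worldViewProj);   // point: implicit w = 1
    pDevice->SetVertexShaderConstant(0, c, 4);            // c0-c3 for the shader below
}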
HCLIP displacement map/vector field shader example:
; v1.x = displacement along w basis vector (normal) (dW)
; v2.xy = displacements along u and v basis vectors (dU and dV)
; c0.xyzw = Scaled U basis vector in HCLIP space (U)
; c1.xyzw = Scaled V basis vector in HCLIP space (V)
; c2.xyzw = Scaled W basis vector in HCLIP space (W)
; c3.xyzw = translation to <0.0, 0.0, 0.0> corner of map in HCLIP space (T)
; generate displaced point in HCLIP space
mul r0.xyzw, v2.x, c0.xyzw ; r0 = dU * U
mad r0.xyzw, v2.y, c1.xyzw, r0.xyzw ; r0 = dV * V + dU * U
mad r0.xyzw, v1.x, c2.xyzw, r0.xyzw ; r0 = dW * W + dV * V + dU * U
add oPos, r0.xyzw, c3.xyzw ; r0 = T + dW * W + dV * V + dU * U
This is just a matrix transform. The only difference between the m4x4 (four dp4s) and the mul/mad
methods is that m4x4 uses the transpose of the matrix, whereas mul/mad uses the matrix directly.
Clearly there is no instruction saving (if we could have two constants per instruction, we could
get down to three instructions) in using vertex shaders for displacement mapping over the compres-
sion transform, but we can reuse our index buffers and our u and v displacements and make savings
if we have lots of displacement maps.
This allows us to get down to only 4 bytes (either D3DVSDT_UBYTE4 or
D3DVSDT_SHORT2) for every displacement position past the first map. For vector fields, the only sav-
ings we make over compression transform is reusing the index buffers.
Conclusion
I've demonstrated several methods for reducing the size of vertex data. The best method obvi-
ously depends on your particular situation, but you can expect a 50% reduction in size quite
easily. That's good in many ways, as models will load faster (especially over slow paths, like
modems), memory bandwidth is conserved, and memory footprint is kept low.
Acknowledgments
S. Gottschalk, M.C. Lin, and D. Manocha, "OBBTree: A Hierarchical Structure for Rapid Interference Detection," http://citeseer.nj.nec.com/gottschalk96obbtree.html.
C. Gotsman, "Compression of 3D Mesh Geometry," http://www.cs.technion.ac.il/~gotsman/primus/Documents/primus-lecture-slides.pdf.
Tom Forsyth, WGDC and Meltdown 2001, "Displaced Subdivision."
Oscar Cooper, general sanity checking.
Shadow Volume Extrusion Using a Vertex Shader
Chris Brennan
Introduction
The shadow volume technique of rendering real-time shadows involves drawing geometry that
represents the volume of space that bounds what is in the shadow cast by an object. To calcu-
late if a pixel being drawn is in or out of a shadow, shoot a ray from the eye through the
shadow volume toward the point on the rendered object and count the number of times the ray
enters and exits the shadow volume. If the ray has entered more times than exited, the pixel
being drawn is in shadow. The stencil buffer can be used to emulate this by rendering the back
sides of the shadow volume triangles while incrementing the stencil buffer, followed by the
front sides of the triangles, which decrement it. If the final result adds up to where it started,
then you have entered and exited the shadow an equal number of times and are therefore out-
side the shadow; otherwise, you are inside the shadow. The next step is rendering a light pass
that is masked out by a stencil test.
There are several other very different algorithms for doing shadows [Haines01]. Stencil
shadow volumes have their benefits and drawbacks compared to other shadowing algorithms
like depth buffers. The most important trade-off is that while shadow volumes have infinite
precision and no artifacts, they have a hard edge and an uncertain render complexity, depend-
ing on object shape complexity and the viewer and light positions. Previous major drawbacks
to shadow volumes were the CPU power required to compute the shadow geometry and the
requirement that character animation must be done on the CPU so that a proper shadow geom-
etry could be generated, but a clever vertex shader combined with some preprocessing removes
the need for all CPU computations and therefore allows the GPU to do all the character anima-
tion. A brief comparison of CPU and GPU complexity and their drawbacks can be found in
[Dietrich].
Another historical complexity of shadow volumes that has been solved is what to do if the
viewer is inside the shadow. The problem arises that since the viewer starts in shadow, the
stencil count begins off by one. Many solutions have been proposed [Haines02], and many are
very computationally intensive, but a simple solution exists. Instead of incrementing and
decrementing the stencil buffer with the visible portion of the shadow volume, modify the sten-
cil buffer when the volume is hidden by another surface by setting the depth test to fail. This
sounds counterintuitive, but what it does is exactly the same thing, except it counts how many
times the ray from the eye to the pixel exits and enters the shadow volume after the visible
point of the pixel. It still tests to see if the pixel is inside or outside of the shadow volume, but
it eliminates the issues with testing to see if the viewer starts in shadow. It does, however,
emphasize the need to make sure that all shadow volumes are complete and closed as opposed
to previous algorithms, which did not require geometry to cap the front or back of the volume.
Creating Shadow Volumes
The time-consuming part of the algorithm is the detection of all the silhouette edges. These are
normally found by taking the dot product of the light vector with each of the edge's two neigh-
boring face normals. If one dot product is positive (toward the light) and one is negative (away
from the light), then it is a silhouette edge. For each silhouette edge, create planes extending
from the edge away from the light creating the minimum geometry needed for the shadow vol-
ume. Unfortunately, not only is it expensive to iterate across all of the edges, but it is also
expensive to upload the new geometry to the video card every frame.
However, hardware vertex shaders can be used to do this work on the GPU. The general
idea is to create geometry that can be modified by a vertex shader to properly create the
shadow volume with any light position so that the geometry can reside on the GPU. At initial-
ization or preprocess time, for each edge of the original object geometry, add a quad that has
two sides consisting of copies of the original edge and two opposite sides of zero length. The
pseudocode for this is as follows:
For each face
Calculate face normal
Create 3 new vertices for this face and face normal
Insert the face into the draw list
For each edge of face
If (edge has been seen before)
Insert degenerate quad into draw list
Remove edge from checklist
Else
Insert edge into a checklist
If (any edges are left in checklist)
flag an error because the geometry is not a closed volume.
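A C++ sketch of this preprocessing step is shown below. It assumes the new vertices are created three per face (face f contributes new vertices f*3+0 through f*3+2, each carrying that face's normal), and the exact quad winding will depend on your conventions; the names are illustrative:

#include <map>
#include <utility>
#include <vector>

typedef std::pair<int, int> OrigEdge;                       // sorted original-mesh indices
static OrigEdge EdgeKey(int a, int b) { return (a < b) ? OrigEdge(a, b) : OrigEdge(b, a); }

// Builds the index list for the shadow volume geometry: every face plus one
// degenerate quad (two triangles) per shared edge. Returns false if the mesh
// is not a closed volume.
bool BuildShadowIndexList(const std::vector<int>& src,      // original indices, 3 per face
                          std::vector<int>& out)            // indices into the new vertices
{
    std::map<OrigEdge, std::pair<int, int> > open;          // edge -> first sighting (newA, newB)
    int numFaces = (int)src.size() / 3;
    for (int f = 0; f < numFaces; ++f)
    {
        int base = f * 3;
        out.push_back(base + 0);                            // the face itself caps the volume
        out.push_back(base + 1);
        out.push_back(base + 2);
        for (int e = 0; e < 3; ++e)
        {
            int a = base + e, b = base + (e + 1) % 3;
            OrigEdge k = EdgeKey(src[a], src[b]);
            std::map<OrigEdge, std::pair<int, int> >::iterator it = open.find(k);
            if (it == open.end())
            {
                open[k] = std::make_pair(a, b);             // first time this edge is seen
            }
            else
            {
                // second sighting: emit the degenerate quad joining the two copies
                int a0 = it->second.first, b0 = it->second.second;
                out.push_back(a0); out.push_back(b0); out.push_back(b);
                out.push_back(b);  out.push_back(b0); out.push_back(a);
                open.erase(it);
            }
        }
    }
    return open.empty();                                    // leftovers => not a closed volume
}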
Figure 1 shows the geometry with the quads inserted and spread apart slightly so that they can
be seen. These quads will make up the extruded edges of the shadow volume when required.
The original geometry is still present and is used to cap the extruded edges on the front and
back to complete the volume. After the quad insertion, each vertex neighbors only one of the
original faces and should include that face's normal. When rendering the volume, each vertex's
face normal is dotted with the light vector. If the result is negative, the face is facing away
from the light and should therefore be pushed out to the outer extent of the light along the light
vector. Otherwise, it stays exactly where the original geometry lies.
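In C++ terms, the per-vertex decision amounts to the following sketch (illustrative names; the actual implementation is the vertex shader in the effect file below):

#include <d3dx8math.h>

// Push a vertex out to the light's range if its stored face normal points away
// from the light; otherwise leave it where the original geometry lies.
D3DXVECTOR3 ExtrudeVertex(const D3DXVECTOR3& p, const D3DXVECTOR3& faceNormal,
                          const D3DXVECTOR3& lightPos, float lightRange)
{
    D3DXVECTOR3 toPoint = p - lightPos;                     // ray from light to the point
    float len = D3DXVec3Length(&toPoint);
    if (D3DXVec3Dot(&faceNormal, &toPoint) <= 0.0f)
        return p;                                           // facing the light: cap stays put
    float extrude = (lightRange > len) ? (lightRange - len) : 0.0f;
    return p + toPoint * (extrude / len);                   // slide along the light ray
}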
After this pass is completed, the light pass is rendered using a stencil test to knock out the pix-
els that are in shadow.
The shadowed room application with source code can be found on the companion CD and at
http://www.ati.com/na/pages/resource_centre/dev_rel/sdk/RadeonSDK/Html/Samples/Direct-
3D/RadeonShadowShader.html.
Figure 1: Illustration of invisible quads inserted into
the original geometry to create a new static geome-
try to be used with a shadow volume vertex shader
Figure 2: Shadow volumes after being extruded away
from the light
Figure 3: The final pass completes the
effect.
Effect File Code
matrix mWVP;
matrix mWV;
matrix mP;
matrix mWVt;
vector cShd;
vector pVL;
vertexshader vShd =
decl
{
stream 0;
float v0[3]; // Position
float v1[3]; // FaceNormal
}
asm
{
; Constants:
; 16..19 - Composite World*View*Proj Matrix
; 20..23 - Composite World*View Matrix
; 24..27 - Projection Matrix
; 28..31 - Inv Trans World*View
; 90 - {light range, debug visualization amount, z near, z far}
; 91 - View Space Light Position
vs.1.0
def c0, 0,0,0,1
; View Space
m4x4 r0, v0, c20 ; World*View Transform of point P (pP)
m3x3 r3, v1, c28 ; World*View Transform of normal (vN)
sub r1, r0, c91 ; Ray from light to the point (vLP)
dp3 r11.x, r1, r1 ; length^2
rsq r11.y, r11.x ; 1/length
mul r1, r1, r11.y ; normalized
rcp r11.y, r11.y ; length
sub r11.z, c90.x, r11.y ; light.Range - len(vLP)
max r11.z, r11.z, c0.x ; extrusion length = clamp0(light.Range -
; len(vLP))
dp3 r10.z, r3, r1 ; vLP dot vN
slt r10.x, r10.z, c0.x ; if (vLP.vN < 0) (is pointing away from light)
mad r2, r1, r11.z, r0 ; extrude along vLP
; Projected Space
m4x4 r3, r2, c24 ; Projected extruded position
m4x4 r0, v0, c16 ; World*View*Proj Transform of original position
; Choose final result
sub r10.y, c0.w, r10.x ; !(vLP.vN >= 0)
mul r1, r3, r10.y
mad oPos, r0, r10.x, r1
};
technique ShadowVolumes
{
pass P0
{
vertexshader = <vShd>;
VertexShaderConstant[16] = <mWVP>;
VertexShaderConstant[20] = <mWV>;
VertexShaderConstant[24] = <mP>;
VertexShaderConstant[28] = <mWVt>;
VertexShaderConstant[90] = <cShd>;
VertexShaderConstant[91] = <pVL>;
ColorWriteEnable = 0;
ZFunc = Less;
ZWriteEnable = False;
StencilEnable = True;
StencilFunc = Always;
StencilMask = 0xffffffff;
StencilWriteMask = 0xffffffff;
CullMode = CCW;
StencilZFail = IncrSat;
}
pass P1
{
CullMode = CW;
StencilZFail = DecrSat;
}
}
Using Shadow Volumes with Character Animation
The technique as described is for statically shaped objects and does not include characters that
are skinned or tweened. The biggest advantage of doing shadow volume extrusion in a vertex
shader is that the volumes can exist as static vertex buffers on the GPU and the updated geom-
etry does not have to be uploaded every frame. Therefore, this technique needs to be extended
to work on animated characters as well (see David Gosselin's "Character Animation with
Direct3D Vertex Shaders" article in this book). Otherwise, if the animation of the shadow vol-
ume were to be done on the CPU, it would be possible to do the optimized shadow volume
creation at the same time.
The most straightforward approach is to copy the vertex animation data to the shadow vol-
ume geometry and skin and/or tween the face normal as you would the vertex normal.
Unfortunately, the face normals need to be very accurate and consistent across all vertices of a
face; otherwise, objects will be split and extruded in inappropriate places, resulting in incorrect
shadow volumes.
These artifacts are the result of the extrusion happening across a face of the original geometry
as opposed to the inserted quads along the face edges. This is caused by the fact that each ver-
tex has different weights for skinning, which yield a different face normal for each of the three
vertices of a face. When that face becomes close to being a silhouette edge, it may have one or
two vertices of the face moved away from the light, while the other one stays behind.
The ideal solution is to animate the positions and regenerate the face normals. However,
generating face normals requires vertex neighbor information that is not normally available in
a vertex shader. One possible solution is to make each vertex contain position information of
its two neighbors, animate the vertex position as well as its two neighbors, and recalculate the
face normal with a cross product. The regular vertex position data would need to be stored and
animated three times. This can sometimes be very expensive, depending on the animation
scheme and the size of the models.
An inexpensive way to fix the variation in face normals across a face is to calculate skin-
ning weights per face in addition to per vertex and use the face weights for the face normal.
This can be done by averaging all of the vertex weights or by extracting them directly from the
original art. Using the same weight for each vertex of a face guarantees that the shadow vol-
ume can only be extruded along the edge quads.
By using face weights for the face normals, the previously seen artifacts are not visible.
This technique can be seen in the ATI Island demos on the companion CD, or at
http://www.ati.com/na/pages/resource_centre/dev_rel/demos.html.
Figure 4: Artifacts caused by skinned
face normals. Notice the notch in the
shadow of the shoulder.
Figure 5: Animated shadow volumes
with face normals skinned by face
weights
References
[Dietrich] Sim Dietrich, "Practical Priority Buffer Shadows," Game Programming Gems II,
Mark DeLoura, Ed. (Charles River Media, 2001), p. 482.
[Haines01] Eric Haines and Tomas Möller, "Real-Time Shadows," GDC 2001 Proceedings,
http://www.gdconf.com/archives/proceedings/2001/haines.pdf.
[Haines02] Eric Haines and Tomas Akenine-Möller, Real-Time Rendering, 2nd edition (A.K.
Peters Ltd., 2002).
Character Animation with Direct3D Vertex Shaders
David Gosselin
Introduction
With the introduction of vertex shaders to modern graphics hardware, we are able to move a
large portion of the character animation processing from the CPU to the graphics hardware.
Two common character animation techniques are tweening (also known as morphing) and skin-
ning (also called skeletal animation). This article describes how both of these techniques and
some related variations can be performed using vertex shaders. We will also cover how to com-
bine animation with the per-pixel lighting techniques described in the articles by Wolfgang
Engel in Part 1. We will conclude with a discussion of some geometry decompression tech-
niques, which allow an application to minimize memory usage and memory bandwidth.
Tweening
Tweening is the technique of linearly interpolating between two (or more) key frames of ani-
mation. This technique is used in many popular games like Quake. The artwork consists of a
character or object in a few poses spaced relatively close together in time. For example, the
figure below shows a few frames of an animation stored as key frames.
Figure 1: An object animated with tweening
In order to perform this type of animation, the object's position and normal data are stored for
each frame of the animation. At run time, two frames are selected that represent the current
state of the animation. The vertex buffers containing these two frames are loaded as separate
streams of data. Typically, setting up the vertex streams looks something like:
d3dDevice->SetStreamSource (0, frame0VertexBuffer, frame0Stride);
d3dDevice->SetStreamSource (1, frame1VertexBuffer, frame1Stride);
Additionally, a tween factor is computed at run time. The tween factor represents the animation
time between the two frames, with the value 0.0 corresponding to the first frame and the value
1.0 corresponding to the second frame. This value is used to perform the linear interpolation of
the position data. One way to compute this value is shown below:
float t = time;
if (t < 0)
{
    t = (float)fmod ((float)(-t), TotalAnimationTime);
    t = TotalAnimationTime - t;
}
float tween = (float)fmod ((float64)t, TotalAnimationTime);
tween /= TotalAnimationTime;
tween *= (NumberOfAnimationFrames - 1);
which = (int32)floor ((float64)tween);
tween -= which;
The tween value and any lighting constants you may be using are loaded into the constant
store. The vertex shader code then only needs to implement the following equation in order to
perform the interpolation between the two frames:
A*(1 - tween) + B*tween
Generally, the vertex shader should also multiply by the concatenated world, view, and projec-
tion matrix. The following vertex shader shows an implementation of tweening:
; position frame 0 in v0
; position frame 1 in v14
; tween in c0.x
; World/View/Projection matrix in c12-c15
; Figure out 1-tween constant
; use the 1.0 from position's w
sub r0, v0.wwww, c0
; Compute the tweened position
mul r1, v0, r0.xxxx
mad r1, v14, c0.xxxx, r1
; Multiply by the view/projection and pass it along
m4x4 oPos, r1, c12
To save some vertex shader instructions, the value of (1 - tween factor) can be computed by
the application. Both of these values can be loaded into the constant store, in this case as part
of the same constant vector. The z and w components of this constant register are also good
places to stick handy constants like 1.0, 0.0, 2.0, 0.5, etc., if you happen to need them. The
resulting vertex shader code looks like the following:
; position frame 0 in v0
; normal frame 0 in v3
; position frame 1 in v14
; normal frame 1 in v15
; tween in c0.x
; 1 - tween in c0.y
; View/Projection matrix in c12-c15
; Compute the tweened position
mul r1, v0, c0.yyyy
mad r1, v14, c0.xxxx, r1
; Multiply by the view/projection and pass it along
m4x4 oPos, r1, c12
In this section, we have explored an efficient tweening algorithm that can be easily computed
within a vertex shader. It has the advantages of being very quick to compute and working well
with higher order surface tessellation schemes such as N-Patches. Unfortunately, tweening has
the downside of requiring quite a bit of data to be stored in memory for each animation.
Another popular technique, which addresses the memory usage issue, is skinning. Skinning
requires more vertex shader instructions but less data per frame of animation. We will explore
skinning in more detail in the following section.
Skinning
Skinning is the process of blending the contributions
from several matrices in order to find the final vertex
position. Typically, the character modeling tool
describes the matrices using a series of bones. Fig-
ure 3 shows what this looks like in 3D Studio Max.
The diamond shapes in Figure 2 represent the
various control matrices. The inner triangles represent
the bones. In this tool, the artist can move the
joints, and the models position is updated using
inverse kinematics. During export of this data from
Max, the triangles are grouped according to which
matrices affect their position. Further preprocessing
generates the vertex buffers used by the Direct3D
API. The vertices created from the mesh have bone
weights, which represent the amount of influence
from the corresponding matrices (bones). These
weights are used to blend the contributions of the cor-
responding matrices according to the following
equation:
Final Position = SUM (matrix[n]*position*weight[n])
Before the advent of vertex shading hardware, these computations took place on the CPU (typ-
ically referred to as "software skinning"). This approach has the benefit of being able to
compute accurate normals after skinning a model, but typically it is fairly costly in terms of the
amount of computation required. Some earlier graphics hardware had fixed-function hardware
dedicated to performing these computations. The user specified a Flexible Vertex Format
Figure 2: Skeleton of a skinned charac-
ter with overlayed polygon mesh
(FVF), which contained the skinning weights and loaded two to four numbered world matrices
to be blended. With vertex shaders, programmers have the ability to tailor the type of skinning
to their own applications. In this section, we will explore two different kinds of skinning: a
four-matrix skinned approach, which mirrors the functionality available from the fixed func-
tion pipeline, and a paletted approach, which allows for greater batching of drawing calls.
The following vertex shader shows how to perform four-matrix skinning. One common
technique to reduce the amount of per-vertex data is to only store three vertex weights and
compute a fourth one in the vertex shader by assuming that the weights sum to 1.0. At run
time, the matrices for a particular group are loaded into the constant store, and the appropriate
vertex buffers are loaded. The vertex shader code looks like the following:
; position in v0
; matrix weights in v1
; c0.z = 1.0
; View/Projection matrix in c12-c15
; World 0 matrix in c16-c19
; World 1 matrix in c20-c23
; World 2 matrix in c24-c27
; World 3 matrix in c28-31
; Multiply input position by matrix 0
m4x4 r0, v0, c16
mul r1, r0, v1.xxxx
; Multiply input position by matrix 1 and sum
m4x4 r0, v0, c20
mad r1, r0, v1.yyyy, r1
; Multiply input position by matrix 2 and sum
m4x4 r0, v0, c24
mad r1, r0, v1.zzzz, r1
; Multiply input position by matrix 3
m4x4 r0, v0, c28
; Compute fourth weight
dp3 r10, v1, c0.zzzz
sub r11, c0.zzzz, r10
; sum
mad r1, r0, r11.wwww, r1
; Multiply by the view/projection matrix
m4x4 oPos, r1, c12
One variation on this technique is to store a palette of matrices in the constant store. During
preprocessing, four indices are stored per-vertex. These indices determine which matrices from
the palette are used in the blending process. Using this technique, a much larger set of triangles
can be processed without changing constant store state (typically an expensive operation). Note
that it is still worthwhile to sort the triangles by similar bone matrices to take advantage of
pipelined hardware. The vertex shader code takes advantage of the indexing register in order to
reach into the constant store to retrieve the correct matrices. The following vertex shader
shows one way to implement this technique.
; position in v0
; matrix weights in v1
; matrix indices in v2
; c0.z = 1.0
; c10 = (16.0, 4.0, 1.0, 0.0)
; View/Projection matrix in c12-c15
; World 0 matrix in c16-c19
; World 1 matrix in c20-c23
; . . . Other world matrices follow
; figure out the last weight
dp3 r5, v1, c0.zzzz
sub r5.w, c0.zzzz, r5
; First world matrix constant = index*4 + start index
mad r7, v2.xyzw, c10.y, c10.x
; Skin by Matrix 0
mov a0.x, r7.x
m4x4 r0, v0, c[a0.x]
mul r1, r0, v1.xxxx
; Skin by Matrix 1 and sum
mov a0.x, r7.y
m4x4 r0, v0, c[a0.x]
mad r1, r0, v1.yyyy, r1
; Skin by Matrix 2 and sum
mov a0.x, r7.z
m4x4 r0, v0, c[a0.x]
mad r1, r0, v1.zzzz, r1
; Skin by Matrix 3 and sum
mov a0.x, r7.w
m4x4 r0, v0, c[a0.x]
mad r1, r0, r5.wwww, r1
; Multiply by the view/projection and pass it along
m4x4 oPos, r1, c12
One drawback to using the paletted skinning approach is that it doesn't work well with higher
order surfaces. The issue stems from the fact that vertex shaders are computed post-tessellation
and any vertex data other than positions and normals are linearly interpolated. Interpolating the
index values for blend matrices is nonsensical.
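For completeness, the palette itself is just a block of transposed bone matrices in the constant store. A minimal upload sketch, assuming the c16 starting slot used by the shader above (the function name is illustrative):

#include <d3d8.h>
#include <d3dx8math.h>

void SetBonePalette(IDirect3DDevice8* pDevice, const D3DXMATRIX* bones, int numBones)
{
    const int firstConstant = 16;                 // matches "World 0 matrix in c16-c19"
    for (int i = 0; i < numBones; ++i)
    {
        D3DXMATRIX t;
        D3DXMatrixTranspose(&t, &bones[i]);       // transposed for use with m4x4
        pDevice->SetVertexShaderConstant(firstConstant + i * 4, &t, 4);
    }
}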
Skinning and Tweening Together
It is also possible to blend skinning and tweening. There are cases where it makes sense to
model a portion of a character using tweening, but you still want some basic movement con-
trolled by bones. For example, you might want the face of a character to be tweened in order to
capture some subtle movements that would require a large number of bones and would be dif-
ficult to manage within an art tool, but you want to use bones for the movement of the head
and body. The following figures show this kind of example, where the mouth and claws of the
character are tweened and skinned.
The following figure shows a few frames of animation using skinning alone.
The next figure shows just the tweened portion of the animation. Note the complex animation
of the mouth and claws.
The final figure shows the skinning and tweening animations being performed simultaneously
on the character.
When performing this type of animation, the tween portion is computed first and then the
resultant position is skinned. The following vertex shader shows one implementation of this
technique.
; position frame 0 in v0
; position frame 1 in v14
; matrix weights in v1
; matrix indices in v2
; c0.z = 1.0
; tween in c9.x
; c10 = (16.0, 4.0, 1.0, 0.0)
; View/Projection matrix in c12-c15
; World 0 matrix in c16-c19
; World 1 matrix in c20-c23
; . . . Other world matrices follow
; Figure out tween constant
sub r9, v0.wwww, c9
mov r9.y, c9.x
; Compute the tweened position
mul r10, v0, r9.xxxx
mad r2, v14, r9.yyyy, r10
; figure out the last weight
mov r5, v1
dp3 r3, v1, c0.zzzz
sub r5.w, c0.zzzz, r3
; First world matrix constant = index*4 + start index
mad r7, v2.xyzw, c10.y, c10.x
; Skin by Matrix 0
mov a0.x, r7.x
m4x4 r0, r2, c[a0.x]
mul r1, r0, v1.xxxx
; Skin by Matrix 1 and sum
mov a0.x, r7.y
m4x4 r0, r2, c[a0.x]
mad r1, r0, v1.yyyy, r1
; Skin by Matrix 2 and sum
mov a0.x, r7.z
m4x4 r0, r2, c[a0.x]
mad r1, r0, v1.zzzz, r1
; Skin by Matrix 3 and sum
mov a0.x, r7.w
m4x4 r0, r2, c[a0.x]
mad r1, r0, r5.wwww, r1
; Multiply by the projection and pass it along
m4x4 oPos, r1, c12
In this section, we explored one way to combine two different types of character animation. It
is obviously just one way to mix the two types of animation. One benefit of vertex shaders is
the ability to customize the vertex processing to get different effects that fit the needs of your
particular application. Hopefully, this will provide you with some ideas for how to customize
animation to suit your own needs.
So far, we have only discussed animation of vertex positions and normals. This is suffi-
cient for vertex lighting of animated characters. Modern graphics chips provide very powerful
pixel shading capabilities that allow lighting to be performed per-pixel. To do this, however,
some care must be taken in the vertex shader to properly set up the per-pixel lighting computa-
tions. This will be discussed in the next section.
Animating Tangent Space for Per-Pixel Lighting
In order to reasonably light a character animated by these techniques, it is important to animate
the normal as well as the position. In the case of tweening, this means storing a normal along
with the position and interpolating the normal in the same way as the position. Getting the
right normal isn't quite as simple in the case of skinning. Since skinning can potentially
move each vertex in very different ways, the only way to get completely correct normals is to
recompute the normal at each vertex once the mesh is reskinned. Typically, this tends to be a
very expensive operation and would require a lot of additional information stored per-vertex.
To top it off, there really isn't a way to index other vertices from within a vertex shader. The
typical compromise, which gives good enough results, is to skin the normal using just the
rotation part of the bone matrices.
Additionally, if you are doing per-pixel bump mapping or other effects requiring tex-
ture/tangent space discussed in the articles by Engel in Part 1, you also need to skin the basis
vectors. The following shader shows the paletted matrix skinned version that also skins the
tangent/texture space basis vectors.
; position in v0
; matrix weights in v1
; matrix indices in v2
; normal in v3
; tangent in v9
; binormal in v10
; c0.z = 1.0
; c10 = (16.0, 4.0, 1.0, 0.0)
; View/Projection matrix in c12-c15
; World 0 matrix in c16-c19
; World 1 matrix in c20-c23
; . . . Other world matrices follow
; figure out the last weight
dp3 r3, v1, c0.zzzz
sub r5.w, c0.zzzz, r3
; First world matrix constant = index*4 + start index
mad r7, v2.xyzw, c10.y, c10.x
; Skin by Matrix 0
mov a0.x, r7.x
m4x4 r0, v0, c[a0.x] ; Position
mul r1, r0, v1.xxxx
m3x3 r0, v9, c[a0.x] ; Tangent
mul r2, r0, v1.xxxx
m3x3 r0, v10, c[a0.x] ; Bi-Normal
mul r3, r0, v1.xxxx
m3x3 r0, v3, c[a0.x] ; Normal
mul r4, r0, v1.xxxx
; Skin by Matrix 1 and sum
mov a0.x, r7.y
m4x4 r0, v0, c[a0.x] ; Position
mad r1, r0, v1.yyyy, r1
m3x3 r0, v9, c[a0.x] ; Tangent
mad r2, r0, v1.yyyy, r2
m3x3 r0, v10, c[a0.x] ; Bi-Normal
mad r3, r0, v1.yyyy, r3
m3x3 r0, v3, c[a0.x] ; Normal
mad r4, r0, v1.yyyy, r4
; Skin by Matrix 2 and sum
mov a0.x, r7.z
m4x4 r0, v0, c[a0.x] ; Position
mad r1, r0, v1.zzzz, r1
m3x3 r0, v9, c[a0.x] ; Tangent
mad r2, r0, v1.zzzz, r2
m3x3 r0, v10, c[a0.x] ; Bi-Normal
mad r3, r0, v1.zzzz, r3
m3x3 r0, v3, c[a0.x] ; Normal
mad r4, r0, v1.zzzz, r4
; Skin by Matrix 3 and sum
mov a0.x, r7.w
m4x4 r0, v0, c[a0.x] ; Position
mad r1, r0, r5.wwww, r1
m3x3 r0, v9, c[a0.x] ; Tangent
mad r2, r0, r5.wwww, r2
m3x3 r0, v10, c[a0.x] ; Bi-Normal
mad r3, r0, r5.wwww, r3
m3x3 r0, v3, c[a0.x] ; Normal
mad r4, r0, r5.wwww, r4
; Multiply by the projection and pass it along
m4x4 oPos, r1, c12
; >>>> At this point:
; >>>>> r1 contains the skinned vertex position
; >>>>> r2 contains the tangent (v9)
; >>>>> r3 contains the binormal (v10)
; >>>>> r4 contains the normal (v3)
Now, you might ask, what can we do with this basis vector? Well, one common usage is to per-
form per-pixel Dot3 bump mapping. In order to compute this, two textures are needed. The
first texture is the base map, which contains the color on the surface. The second texture is the
bump map that contains the normal at each point in the original texture encoded into the red,
green, and blue channels. Within the vertex shader, the vector from the light to the vertex is
computed, normalized, and converted into tangent/texture space. This vector is then interpo-
lated per-pixel before being sent to the pixel shader.
Per-Pixel Lighting
Additionally, within the vertex shader, you can compute a light falloff. This involves sending
data about the light falloff values in the constant store. The three values sent to the vertex
shader are the distance from the light when the falloff starts, the distance from the light where
the falloff ends, and the difference between those two values. The following shader code builds
upon the previous shader and performs the calculations for three bumped diffuse lights. If
fewer lights are desired, the same shader can be used by setting the color of one or more of the
lights to black (0,0,0,0).
; c0.z = 1.0
; c1 has the position of light 1
; c2 has the color of light 1
; c3.x has the start of the falloff for light 1
; c3.y has the end of the falloff for light 1
; c3.z has the difference between the end of the falloff and the
; start of the falloff
; c4 has the position of light 2
; c5 has the color of light 2
; c6.x has the start of the falloff for light 2
; c6.y has the end of the falloff for light 2
; c6.z has the difference between the end of the falloff and the
; start of the falloff
; c7 has the position of light 3
; c8 has the color of light 3
; c9.x has the start of the falloff for light 3
; c9.y has the end of the falloff for light 3
; c9.z has the difference between the end of the falloff and the
; start of the falloff
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Compute vector for light 1
sub r0, c1, r1 ; Ray from light to point
dp3 r5.x, r0, r0 ; length^2
rsq r5.y, r5.x ; 1/length
mul r0, r0, r5.y ; normalized
rcp r5.z, r5.y ; length
m3x3 r8, r0, r2 ; Convert to tangent space
dp3 r9.x, r8, r8 ; length^2
rsq r9.y, r9.x ; 1/length
mul oT2.xyz, r8, r9.y ; normalized
sub r7.x, c3.y, r5.z ; fallEnd - length
mul r7.y, c3.z, r7.x ; (fallEnd - length)/
; (fallEnd-fallStart)
min r7.w, r7.y, c0.z ; clamp
mul oD0.xyz, r7.w, c2 ; falloff * light color
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Compute vector for light 2
sub r0, c4, r1 ; Ray from light to the point
dp3 r5.x, r0, r0 ; length^2
rsq r5.y, r5.x ; 1/length
mul r0, r0, r5.y ; normalized
rcp r5.z, r5.y ; length
m3x3 r8, r0, r2 ; Convert to tangent space
dp3 r9.x, r8, r8 ; length^2
rsq r9.y, r9.x ; 1/length
mul oT3.xyz, r8, r9.y ; normalized
sub r7.x, c6.y, r5.z ; fallEnd - length
mul r7.y, c6.z, r7.x ; (fallEnd - length)/
; (fallEnd - fallStart)
min r7.w, r7.y, c0.z ; clamp
mul oD1.xyz, r7.w, c5 ; falloff * light color
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Compute vector for light 3
sub r0, c7, r1 ; Ray from light to the point
dp3 r5.x, r0, r0 ; length^2
rsq r5.y, r5.x ; 1/length
mul r0, r0, r5.y ; normalized
rcp r5.z, r5.y ; length
m3x3 r8, r0, r2 ; Convert to tangent space
dp3 r9.x, r8, r8 ; length^2
rsq r9.y, r9.x ; 1/length
mul oT4.xyz, r8, r9.y ; normalized
sub r7.x, c9.y, r5.z ; fallEnd - length
mul r7.y, c9.z, r7.x ; (fallEnd - length)/
; (fallEnd- fallStart)
min r7.w, r7.y, c0.z ; clamp
mul oT5.xyz, r7.w, c8 ; falloff * light color
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Pass along the texture coordinates.
mov oT0.xy, v7.xy
mov oT1.xy, v8.xy
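The shader above multiplies by c3.z, c6.z, and c9.z to get the commented division, so the application would typically store the reciprocal of (fallEnd - fallStart) in that component. A minimal sketch of that constant setup (function and parameter names are assumptions, not from the original text):

#include <d3d8.h>

void SetLightFalloff(IDirect3DDevice8* pDevice, int constReg,
                     float fallStart, float fallEnd)
{
    // x = start, y = end, z = 1/(end - start) so a single mul performs the divide
    float falloff[4] = { fallStart, fallEnd, 1.0f / (fallEnd - fallStart), 0.0f };
    pDevice->SetVertexShaderConstant(constReg, falloff, 1);
    // e.g., constReg = 3, 6, and 9 for the three lights used above
}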
In the pixel shader, we compute the dot products and sum up the contributions from all three
lights. In addition, the light colors passed into the above shaders are pre-divided by two to
allow a bigger range of light colors and allow the lights to over-brighten the base textures.
Within the pixel shader, this range is re-expanded by using the _x2 modifier on the colors.
ps.1.4
; c2 contains the ambient lighting color
texld r0, t0 ; DOT3 Bump Map Texture
texld r1, t1 ; Base Texture
texcrd r2.rgb, t2 ; L1 light 1 vector
texcrd r3.rgb, t3 ; L2 light 2 vector
texcrd r4.rgb, t4 ; L3 light 3 vector
; v0 C1 color of light 1 (from above)
; v1 C2 color of light 2 (from above)
texcrd r5.rgb, t5 ; C3 color of light 3
dp3_sat r2, r0_bx2, r2_bx2 ; N.L1
mul r2, r2, v0_x2 ; (N.L1)*C1
dp3_sat r3, r0_bx2, r3_bx2 ; N.L2
mad r2, r3, v1_x2, r2 ; ((N.L1)*C1) + ((N.L2)*C2)
dp3_sat r4, r0_bx2, r4_bx2 ; N.L3
mad r2.rgb, r4, r5_x2, r2 ; ((N.L1)*C1) + ((N.L2)*C2) +
; ((N.L3)*C3)
add r2.rgb, r2, c2 ; add ambient
mul r0.rgb, r2, r1 ; ((N.L1)*C1) + ((N.L2)*C2) +
; ((N.L3)*C3)*base
The following figures show various pieces of this shader in action. The first three characters in
Figure 6 show the contributions from each of the individual lights. The fourth character shows
the summation of all the lighting contributions, and the fifth shows the base texture for the
character.
The next figure shows a few frames of animation with the full shader in action.
Compression
One downside of doing tweened animation is the large amount of storage space required to
store each individual frame of animation data. Thus, it is desirable to reduce the amount of
memory consumed in storing the character data as well as the amount of data that needs to be
sent to the video card via the buses in the system. One compression technique is done by com-
puting the bounding box of the character and specifying the position data as an integer offset
from one of the corners of the box. The following example packs the position data into 16-bit
shorts for x, y, and z, and pads with one short for w. The preprocessing code to do the packing
looks something like this:
float half = (maxX - minX)/2.0f;
vertex.x = (short)(((x - minX - half)/half)*32767.5f);
half = (maxY - minY)/2.0f;
vertex.y = (short)(((y - minY - half)/half)*32767.5f);
half = (maxZ - minZ)/2.0f;
vertex.z = (short)(((z - minZ - half)/half)*32767.5f);
The vertex data can then be re-expanded within the vertex shader at a relatively low cost. The
following vertex shader fragment shows how this is accomplished:
; position in v0
; c0.x = ((maxX - minX)/2)/32767.5
; c0.y = ((maxY - minY)/2)/32767.5
; c0.z = ((maxZ - minZ)/2)/32767.5
; c1.x = minX + (maxX - minX)/2
; c1.y = minY + (maxY - minY)/2
; c1.z = minZ + (maxZ - minZ)/2
; scale and bias position
mul r2, v0, c0
add r2, r2, c1
A similar technique can be applied to the weights for the skinning matrices. Again, here is a
small bit of preprocessing code:
vertex.w1 = (unsigned char)((float)(w1)*255.0f);
vertex.w2 = (unsigned char)((float)(w2)*255.0f);
vertex.w3 = (unsigned char)((float)(w3)*255.0f);
A few vertex shader instructions can expand this data back into floating-point data to use for
computing the skinned position.
; v1 contains the weights
; c0.x = 0.003921569 = 1/255
; c0.z = 1.0
; unpack the weights
mul r9, v1, c0.x
dp3 r9.w, r9, c0.zzzz
sub r9.w, c0.zzzz, r9.w
Normal data can also be compressed by sacrificing some quality by quantizing the normals
into bytes or shorts and doing a similar unpacking process within the vertex shader. The fol-
lowing code shows how the data is packed within a preprocessing tool:
vertex.nx = (unsigned char)(nx*127.5 + 127.5);
vertex.ny = (unsigned char)(ny*127.5 + 127.5);
vertex.nz = (unsigned char)(nz*127.5 + 127.5);
A small bit of vertex shader code can be used to decompress the normals:
; v3 contains the normal
; c0.x = 0.007843137 = 1/127.5
; r2.w = 1.0
mad r3, v3, c0.x, -r2.w ; scale and bias normal to -1 to 1 range
In this section, we showed a few ways to compress various vertex components. These tech-
niques can be used to significantly reduce the amount of data required per character. They
should, however, be used with some caution since they are lossy methods of compression and
will not work with all data sets. For more on vertex compression, see the "Vertex Decompres-
sion Using Vertex Shaders" article in this book.
Summary
This chapter has discussed a few ways to perform character animation within a vertex shader.
This should be considered a starting point for your exploration of character animation through
vertex shaders. Vertex shaders give you the power to customize the graphics pipeline for your
application, allowing you to free up the CPU to make a more engaging game.
Lighting a Single-Surface Object
Greg James
In this article, we'll explore how a programmable vertex shader can be used to light objects
that expose both sides of their triangles to the viewer. Such objects can be problematic with
fixed-function hardware lighting, but a few vertex shader instructions allow us to light such
objects correctly from all points of view without having to modify vertex data, duplicate trian-
gles, or make complex decisions based on the scene. We will also extend the basic technique to
approximate the effect of light transmission through thin scattering objects.
Three-dimensional objects are often represented by single-surface geometry in 3D applica-
tions. Leaves, foliage, blades of grass, particle system objects, hair, and clothing are typically
modeled as sheets of triangles where both sides of the sheet are potentially visible, and as such,
these objects are rendered with back-face culling disabled. This approach is used widely in 3D
games but presents a problem for hardware-accelerated lighting calculations. The problem
stems from the restriction that each vertex is specified with only a single surface normal vector.
This normal is correct when viewing the front faces of triangles but is not correct when the
back faces are in view. When back faces are visible, this single normal vector points opposite
the direction it should, so a lighting calculation based on this normal will give an incorrect
result. Fortunately, we can use five instructions in an OpenGL or DirectX 8 vertex shader pro-
gram to calculate the correct normal and pass it on to the lighting calculation so that a
single-surface object is lit correctly from all orientations. This same correction will also light a
twisted object correctly regardless of how the surface's front and back faces twist in and out of
view. This approach of detecting the orientation of the normal and correcting it can also be
used to enhance the lighting of thin objects by accounting for transmitted light. A realistic ren-
dering of light shining through leaves or stained glass can be achieved without costly logic
based on the positions of the objects in the scene.
The problem of lighting a single-surface object depends on the viewer's orientation rela-
tive to each face. Since surface normals are defined at each vertex, the surface facing is really a
vertex facing defined by each vertex's normal vector. The four possible orientations of the ver-
tex normal, viewer location, and light location are shown in Figure 1. In the first two cases, the
front faces are visible and no correction of the normal is required. In the last two cases, the
back faces are visible and the normal must be reversed to properly represent the face in view.
Note that the position of the light is not relevant to the problem of selecting the proper normal.
Only the eye position relative to the vertex normal matters. Correct illumination will result if
the vertex normal is always chosen to face toward the viewer. The operation of guaranteeing
that the normal always faces the viewer is easily accomplished in a vertex shader program.
Calculations for correcting the normal can be done in object space or world space. Since the
vertex input position is in object space, and in a vertex shader this is typically transformed
directly to homogeneous clip space (i.e., the world space position is never calculated), it is
more efficient to perform the calculations in object space. If the calculations were done in
world space, this would require extra vertex shader instructions to transform the vertex posi-
tion and vertex normal to world space. On the other hand, working in object space requires us
to transform the light position and viewer position from world space to object space. Since the
light and viewer position in object space is constant for all vertices of the object being ren-
dered, they should be calculated on the CPU and supplied as vertex shader constants. It would
be wasteful to compute them in the vertex shader itself. To compute them, we will have to
compute the inverse object-to-world transformation matrix, and this is simple if the world
matrix consists only of rotation and translation without scaling [Foley94]. In that case, the
inverse of the 4x4 transform is composed of the transpose of the 3x3 rotation portion along
with the opposite amount of translation.
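As an illustration of that inverse, here is one way the CPU-side code might bring the eye position into object space under the rotation-plus-translation assumption. The sketch and its small types are not code from the article; it assumes the usual D3D row-vector convention with the translation in the fourth row of the world matrix:
struct Vec3 { float x, y, z; };

// World matrix assumed to be rotation plus translation (no scale),
// stored row-major, rows 0-2 = rotation basis, row 3 = translation.
struct Matrix4 { float m[4][4]; };

// Transform a world-space point (the eye or light position) into object
// space: undo the translation, then multiply by the transposed rotation.
Vec3 WorldToObject(const Matrix4& world, const Vec3& p)
{
    const float tx = p.x - world.m[3][0];
    const float ty = p.y - world.m[3][1];
    const float tz = p.z - world.m[3][2];
    Vec3 r;
    r.x = tx * world.m[0][0] + ty * world.m[0][1] + tz * world.m[0][2];
    r.y = tx * world.m[1][0] + ty * world.m[1][1] + tz * world.m[1][2];
    r.z = tx * world.m[2][0] + ty * world.m[2][1] + tz * world.m[2][2];
    return r;
}
The same routine can be reused for the light position; the two results are what get loaded into the c[CV_EYE_POS_OSPACE] and c[CV_LIGHT_POS_OSPACE] constants used in Listing 1.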
Figure 1: The four possible orientations of viewer, vertex normal, and light. The light position is irrelevant to selecting the proper vertex normal for lighting.
Figure 2: Illustration of variables calculated in Listing 1
Vertex Shader Code
Figure 2 illustrates the values that we'll calculate, and Listing 1 shows the vertex shader
instructions. The first step of the correction is to compute a vector from the vertex position to
the viewer position. The viewer position in object space is computed on the CPU and supplied
as a vertex shader constant: c[CV_EYE_POS_OSPACE]. The vector from the vertex to the eye
is a simple subtraction. Next, we test whether the vertex normal is facing toward or away from
the eye. The three-component dot product between the normal and the vector to the eye will
give a value that is negative if the vertex normal points away from the eye or positive if the
vertex normal points toward the eye. The result of the dot-product is written to the ALIGNED
variable, which is positive if no correction is needed or negative if we have to flip the normal.
At this point, we must do one of two things based on the sign of the ALIGNED variable.
The C equivalent of what we want to do is:
if ( ALIGNED < 0 )
    normal = -V_NORMAL;
else
    normal = V_NORMAL;
There are no if or else branch instructions in the vertex shader instruction set, so how can
we perform this decision based on the sign of ALIGNED? The answer is to do one calculation
for both cases which will result in a value as though a branch had been used. We use both the
input normal (V_NORMAL) and the reversed normal (-V_NORMAL) in an expression in
which either value can be masked out. Each value is multiplied by a mask and summed. If the
value of mask is 1 or 0, then it can be used in the following equation to select either
V_NORMAL or -V_NORMAL.
normal = mask * V_NORMAL + ( 1 - mask ) * (-V_NORMAL)
In a vertex shader, we could compute mask and 1 - mask, and this would be expressed as:
sub INV_MASK, c[ONE], MASK
mul COR_NORMAL, MASK, V_NORMAL
mad COR_NORMAL, INV_MASK, -V_NORMAL, COR_NORMAL
For our case, the negative normal is computed from the positive normal as in the following
equation, so we can skip the calculation of 1 - mask and save one instruction in the vertex
shader program.
-V_NORMAL = V_NORMAL - 2 * V_NORMAL
The vertex shader instructions are then:
mad COR_NORMAL, V_NORMAL, MASK, -V_NORMAL
mad COR_NORMAL, V_NORMAL, MASK, COR_NORMAL
The mask value is computed from ALIGNED using the sge (Set if Greater or Equal) instruc-
tion to compare ALIGNED to zero.
sge MASK, ALIGNED, c[ZERO]
MASK will be 0 if ALIGNED is less than c[ZERO], or it will be 1 if ALIGNED is greater or
equal to c[ZERO].
The five instructions in the middle of Listing 1 (sub, dp3, sge, mad, mad) will always pro-
duce a vertex normal that faces toward the viewer. Lighting based on this normal will be
correct for all light positions.
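To make the virtual branch concrete, here is a small C++ sketch of what the sub/dp3/sge/mad/mad sequence computes for a single vertex. The types and names are placeholders, not code from the article:
struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Branch-free normal correction: flip the normal when it faces away
// from the eye, using a 0/1 mask instead of an if/else.
Vec3 CorrectNormal(const Vec3& position, const Vec3& normal, const Vec3& eyeObjSpace)
{
    Vec3 toEye = { eyeObjSpace.x - position.x,      // sub
                   eyeObjSpace.y - position.y,
                   eyeObjSpace.z - position.z };
    float aligned = Dot(normal, toEye);             // dp3
    float mask = (aligned >= 0.0f) ? 1.0f : 0.0f;   // sge
    float s = 2.0f * mask - 1.0f;                   // the two mads: mask*N - N + mask*N
    Vec3 corrected = { normal.x * s, normal.y * s, normal.z * s };
    return corrected;
}
The two mads collapse to N * (2 * mask - 1), which is exactly the selection the single multiply performs here.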
Performing the correction of the normal in a vertex shader lets us fill a scene with many
single-surface objects with only a small cost for the added calculations. We can avoid more
costly operations such as updating vertex data with the CPU or rendering in multiple passes to
achieve the correct lighting. The single-surface objects may also be animated in the vertex
shader and remain correctly lit. By performing all of the transform and lighting calculations in
a hardware vertex shader, we can populate scenes with a large number of these objects.
Listing 1: Vertex shader for correcting the input vertex normal based on viewer position
// Include definitions of constant indices
#include "TwoSided.h"
// Define names for vertex input data
#define V_POSITION v0
#define V_NORMAL v1
#define V_DIFFUSE v2
#define V_TEXTURE v3
// Define names for temporaries
#define VEC_VERT_TO_EYE r0
#define VEC_VERT_TO_LIGHT r1
#define ALIGNED r2
#define TEMP r6
#define COR_NORMAL r7
// Vertex shader version 1.1
vs.1.1
// Transform position to clip space and output it
dp4 oPos.x, V_POSITION, c[CV_WORLDVIEWPROJ_0]
dp4 oPos.y, V_POSITION, c[CV_WORLDVIEWPROJ_1]
dp4 oPos.z, V_POSITION, c[CV_WORLDVIEWPROJ_2]
dp4 oPos.w, V_POSITION, c[CV_WORLDVIEWPROJ_3]
// Use eye position relative to the vertex to
// determine the correct orientation of the
// vertex normal. Flip the normal if it is
// pointed away from the camera.
// Eye & light positions were transformed to
// object space on the CPU and are supplied
// as constants.
// Make vector from vertex to the eye
sub VEC_VERT_TO_EYE, c[CV_EYE_POS_OSPACE], V_POSITION
// Dot product with the normal to see if they point
// in the same direction or opposite
dp3 ALIGNED, V_NORMAL, VEC_VERT_TO_EYE
// If aligned is positive, no correction is needed
// If aligned is negative, we need to flip the normal
// Do this with SGE to create a mask for a virtual
// branch calculation
// ALIGNED.x = ALIGNED.x >= 0 ? 1 : 0;
sge ALIGNED.x, ALIGNED, c[CV_ZERO]
mad COR_NORMAL, V_NORMAL, ALIGNED.x, -V_NORMAL
// now COR_NORMAL = 0 or -V_NORMAL
mad COR_NORMAL, V_NORMAL, ALIGNED.x, COR_NORMAL
// COR_NORMAL = V_NORMAL or -V_NORMAL
// Point lighting
// Vector from vertex to the light
add VEC_VERT_TO_LIGHT, c[CV_LIGHT_POS_OSPACE], -V_POSITION
// Normalize it using 3 instructions
dp3 TEMP.w, VEC_VERT_TO_LIGHT, VEC_VERT_TO_LIGHT
rsq TEMP.w, TEMP.w
mul VEC_VERT_TO_LIGHT, VEC_VERT_TO_LIGHT, TEMP.w
// dp3 for lighting. Point light is not attenuated
dp3 r4, VEC_VERT_TO_LIGHT, COR_NORMAL
// Use LIT to clamp to zero if r4 < 0
// r5 will have diffuse light value in y component
lit r5, r4
// Light color to output diffuse color
// c[CV_LIGHT_CONST].x is ambient
add oD0, r5.y, c[CV_LIGHT_CONST].x
// Another lit for the light value from opposite normal
lit r5, -r4
// Square the value for more attenuation. This
// represents transmitted light, which could fall
// off more sharply with angle of incidence
mul r5.y, r5.y, r5.y
// Attenuate it further and add transmitted ambient factor
mad oD1, r5.y, c[CV_LIGHT_CONST].z, c[CV_LIGHT_CONST].y
// Output texture coordinates
mov oT0, V_TEXTURE
mov oT1, V_TEXTURE
Enhanced Lighting for Thin Objects
We can take this approach a step further by enhancing the lighting model for these thin objects.
Ordinarily, we would most likely compute some function of the dot product of the light vector
and normal vector (N·L) at each vertex and be done with it. This would account for light scat-
tering off the surface but not for light shining through the surface. After all, these are thin
objects, and thin objects tend to transmit light. Dramatic lighting effects are possible if we
account for this transmission.
In this case, the transmission is not that of a clear object allowing light to pass directly
through, but it is instead a diffuse scattering of transmitted light. Things behind the object are
not visible, but bright light shining through the object is dispersed and scattered by the mate-
rial. This reveals the inner details or thickness of the object, which have a different appearance
than the surface illuminated by reflected light. Leaves, stained glass, and lamp shades show a
good contrast between the transmitted and reflected light. Hold a leaf to the ground and it
appears dark and waxy. Hold it up to the sun, and the veins and cells are revealed against a
glowing green background. Stained glass viewed from the outside of a building is entirely
dark, but viewing it from the inside with sunlight streaming through reveals brilliant colors.
We can account for this transmitted light by supplying an additional texture for the object.
This additional texture is simply the colors we would see with bright light shining through
the object. It is a texture representing light transmission. The standard lighting equation is
applied to the ordinary diffuse reflective texture, and we add a new lighting calculation that
will affect the diffuse transmission texture. The new lighting calculation is based on the normal
pointing away from the viewer and should contribute no light when the light is positioned in
front of the vertex in view. We use the vertex shader lit instruction to achieve this clamping to
zero, just as we would for the front-facing lighting. When the front-facing reflective lighting
clamps to zero, the transmissive lighting begins to show through.
The result of scattered reflected light can be placed in one vertex color, and the result of
scattered transmitted light can be placed in the second specular color or any output which will
be interpolated in rasterization. A pixel shader then fetches from each texture, applies the light
contribution for transmission to the transmission texture and for reflection to the ordinary dif-
fuse texture, and adds the results together. As each contribution is clamped to zero based on
whether or not the light is in front or behind, the addition of each term results in correct
lighting.
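The combination step itself is just a pair of multiply-adds. The actual effect performs it in a pixel shader, but per channel the math amounts to the C++ sketch below, where the two lighting terms stand in for the interpolated oD0 and oD1 values described above (names and the reduction to scalar lighting terms are simplifications, not the article's code):
struct Color { float r, g, b; };

// diffuseTex:      ordinary reflective texture sample
// transmissionTex: "lit from behind" transmission texture sample
// reflLight:       front-facing lighting term (zero when the light is behind)
// transLight:      transmissive lighting term (zero when the light is in front)
Color CombineLighting(const Color& diffuseTex, const Color& transmissionTex,
                      float reflLight, float transLight)
{
    Color out;
    out.r = diffuseTex.r * reflLight + transmissionTex.r * transLight;
    out.g = diffuseTex.g * reflLight + transmissionTex.g * transLight;
    out.b = diffuseTex.b * reflLight + transmissionTex.b * transLight;
    return out;
}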
In this way, dramatic lighting effects are possible without requiring CPU processing or
complex logic based on the placement of objects and lights. As the sun streams through leaves
overhead, they appear different than leaves off in the distance, or as the viewer and lights move
about a scene, fabrics and hair will change appropriately when lit from behind. Other effects
like x-ray illumination or one-way mirrors could also be devised based on the vertex shader's
determination of the correct normal.
About the Demo
Included on the companion CD is the NV Effects Browser demo Two-Sided Lighting, which
demonstrates this technique. To run, simply start the NVEffectsBrowser.exe and click the
effect under the Lighting (per-vertex) section. Keyboard controls can be displayed by hitting
the H or F1 keys. There are controls for twisting, scaling, and moving the object. Complete
DirectX 8 source code is included.
References
[Foley94] James D. Foley, Andries van Dam, et al., Introduction to Computer Graphics
(Addison-Wesley Co., 1994), p. 182.
Optimizing Software
Vertex Shaders
Kim Pallister
The introduction of the vertex shader technology in the DirectX APIs marked a significant
milestone in the real-time graphics and gaming industries. It was significant not so much
because it exposed any new capability, but rather because it put innovation back in the hands of
the developer.
By making the pipeline programmable, developers were no longer limited to what the
hardware and API had been explicitly designed to do. Conversely, hardware developers, and
those designing the API, did not need to be clairvoyant in anticipating what game developers
would want the pipeline to do over the coming years.
That's the good side of things. However, like most things, there's a cloud attached to this
silver lining.
As is often the case with PC game development, developers are challenged to take advan-
tage of new technologies while still supporting older systems. If programmable vertex shaders
were only supported on the highest end of the PC performance spectrum, developers would
have their work cut out for them, having to support multiple code paths for systems without
support.
Thankfully, this isn't the case. Microsoft, with help from CPU vendors like Intel, has seen
to it that software vertex shaders offer impressive performance and can be used in games today.
This is accomplished through a vertex shader compiler within the DirectX run-time that
compiles the vertex shader code into CPU-optimal instruction streams (SSE and SSE2 instruc-
tions in the case of Pentium III and Pentium 4 processors).
Like with any compiler, properly designing and structuring the source code going into the
compiler can improve the code that it generates. In this article, we'll look at some guidelines
for writing vertex shaders to improve the performance of those shaders when running in
software.
If you are simply interested in knowing a set of canonical rules to apply to your shaders,
you can skip to the optimization guidelines in Table 1. However, if you are interested in some
background about how the software emulation is done, we'll cover that beforehand. First we'll
look at some details about the Pentium III and Pentium 4 processor architectures and then look
at how the DirectX run-time compiler works. This is followed by a list of optimization guide-
lines and an explanation of why they help generate optimal code. Finally, we'll look at a
sample vertex shader and how we can optimize it.
Introduction to Pentium 4 Processor Architecture
The Pentium II and Pentium III processors were both based on the same core architecture. This
microarchitecture introduced speculative execution and out-of-order operation to Intel Archi-
tecture. The Pentium III processor made improvements over the Pentium II processor and
added the Streaming SIMD Extensions (SSE).
The Pentium 4 processor introduced a new microarchitecture, the NetBurst
microarchitecture. Like its predecessors, the Pentium 4 processor operates out of order and
does speculative execution.
However, it has a much deeper
pipeline and boasts a number of
other enhancements. For a more
detailed explanation, see Figure
1.
Instructions are specula-
tively fetched by the Branch
Predictor, the part of the diagram
on the far left comprised of the
Branch Target Buffer, or BTB,
and the Instruction Translation
Lookaside Buffer, or I-TLB. The
logic in these units serves to
make predictions about what
instructions will be needed next,
based on a branch history and some prediction rules.
From there, the instructions are fed into the decoder. The Instruction Decoder receives the
IA-32 opcode bytes delivered to it and breaks them into one or more simple micro-ops (µops).
These µops are the operations that the core of the Pentium 4 processor executes.
The next stage is the Execution Trace Cache, an innovative microarchitecture redesign
introduced as part of the NetBurst microarchitecture. The Trace Cache caches (i.e., remembers)
the decoded µops from the instruction decoder. The Trace Cache is the Pentium 4 processor's
primary or L1 Instruction cache. It is capable of delivering up to three instructions per clock
down the pipeline to the execution units. It has its own Branch Predictor, labeled here as BTB,
that steers the Trace Cache to its next instruction locations. The micro-code ROM in the block
below contains micro-code sequences for more complex operations like fault handling and
string moves. The addition of the Trace Cache is significant because it means that often, previ-
ously executed instruction streams (e.g., a loop after its first iteration) will not need to go
through the decoder stage, avoiding a possible stall.
The renaming stage maps logical IA-32 registers (see Figure 2) to the Pentium 4 proces-
sor's deep physical register file. This abstraction allows for a larger number of physical
registers in the processor without changing the front end of the processor. The allocator stage
assigns all the necessary hardware buffers in the machine for µop execution, at which point
the µops go into some queues to await scheduling.
The scheduling stage that follows determines when an instruction is ready to execute. An
instruction is ready to execute when both input operands (most instructions have two input
sources) are ready, registers and an execution unit of the required type are free, etc. The
Figure 1: Pentium 4 processor architectural diagram
scheduler is kind of like a restaurant host, only seating a party when all the party members
have arrived and a table is free, thus ensuring maximum throughput of customers through the
restaurant.
Once an instruction is ready to execute, execution takes place in the integer or float-
ing-point units (depending on the type of instruction, of course). Having these many execution
units operating at once is what makes the processor so highly parallelized and able to execute a
number of integer and floating-point operations per clock cycle.
The Address Generation Units (labeled AGU) are used for doing loads and stores of data
to the L1 data cache (the last block in Figure 1). Since programs have many loads and stores,
having both a load port and a store port keeps this from being a bottleneck. The Pentium 4 pro-
cessor also has an on-chip L2 cache that stores both code and data, along with a fast system
bus to main memory.
In terms of maximizing performance of software running on this architecture, there are
several goals to keep in mind, such as minimizing mis-predicted branches, minimizing stalls,
maximizing cache usage, etc. We'll talk about how the vertex shader compiler aims to achieve
some of these goals later on, but in order to understand that discussion, we must first look at
some architectural features aimed at enhancing multimedia performance.
Introduction to the Streaming SIMD Extensions
As mentioned earlier, the Pentium III processor introduced the Streaming SIMD Extensions
(better known as SSE), aimed at offering improved floating-point performance for multimedia
applications. The Pentium 4 processor furthered the enhancements of SSE with the introduc-
tion of the SSE2 instructions. Since the vertex shader compiler uses these, a basic
understanding of how these instructions operate will allow for a better understanding of how
the vertex shader compiler works.
SIMD instructions (Single Instruction, Multiple Data) allow for an operation to be issued
once and simultaneously performed on several pieces of data. This is extremely useful in mul-
timedia applications using 3D graphics, video or audio, where an operation may be performed
thousands of times on data (such as vertices, pixels, audio samples, etc.). Most of the SSE and
SSE2 instructions utilize the eight 128-bit wide XMM registers, pictured on the far right of
Figure 2.
These registers can be used in a number of ways. Depending on the instruction used, the data is
read as four 32-bit single-precision floating-point numbers, two 64-bit double-precision float-
ing-point numbers, eight 16-bit integers, or 16 single bytes. The following assembly example
shows two of these usages:
Figure 2: Pentium 4 processor register sets
MULPS XMM1, XMM2 // Multiply 4 single precision
// floating-point numbers in
// XMM1 by those in XMM2, place
// results in XMM1.
MULPD XMM1, XMM2 // Multiply 2 double-precision
// floating-point numbers in
// XMM1 by those in XMM2, place
// results in XMM1
In order to understand how the vertex shader compiler works, a full understanding of the SSE
and SSE2 instruction set isn't necessary. However, those interested in learning how to use SSE
instructions in their own code may refer to Intel's Developer Services web site
(www.intel.com/IDS).
While the full SSE/SSE2 instruction set isn't covered here, there's one more subject that
will give some insight into how the compiler functions: data arrangement.
Optimal Data Arrangement for SSE
Instruction Usage
Often, execution can be hampered by memory bandwidth. This sometimes occurs when data is
laid out in a format that is intuitive to the author of the code but can result in fragmented
access to memory, unnecessary data being loaded into the cache, etc. This is best illustrated
with an example.
Consider an array of vertices, each of which has the following structure:
#define numverts 100
struct {
float x, y, z;
float nx, ny, nz ;
float u,v;
} myvertex;
myvertex mVerts[numverts];
In this array-of-structures (AOS) layout, the array of these vertices would sit in memory as
shown in Figure 3. The problem arises that as an operation is performed on all of the vertices,
say doing transformations, only some of the elements are used in that operation, yet an entire
cache line is loaded (anywhere from 32 to 128 bytes, depending on which processor is being
used). This means that some elements of the vertex structure are being loaded into the cache
and then not being used at all.
Figure 3: Ineffective cache line load with AOS data layout
A better approach is to use a structure-of-arrays (SOA) layout, as illustrated below:
#define numverts 100
struct {
float x[numverts]; //x,x,x
float y[numverts]; //y,y,y
float z[numverts];
float nx[numverts];
float ny[numverts];
float nz[numverts];
float u[numverts];
float v[numverts];
} myvertexarray;
myvertexarray mVerts;
While this type of an approach to structuring data may not be as intuitive, it has a couple of
benefits. First, cache line loads consist of data that will be used for the given operation, with no
unnecessary data being loaded into the cache. Secondly, applying SSE to the code becomes
trivial.
Let's look at an example. Consider the following loop to do a simple animation to a vertex:
//assume variables of type float exist, called fTime and fVelocityX/Y/Z
for (n=0;n<numverts;n++)
{
mVerts.x[n] += fTime * fVelocityX;
mVerts.y[n] += fTime * fVelocityY;
mVerts.z[n] += fTime * fVelocityZ;
}
Modifying this to use SSE becomes a matter of iterating through the loop differently. In the
example below, the same loop has been modified slightly. Where we had our variables defined
as floats before, we've now used a C++ class f32Vec4, which is just a container for four floats
with overloaded operators that use the SSE instructions:
//assume variables of type f32Vec4 exist, called fTime and fVelocityX/Y/Z
for (n=0; n<numverts; n += 4)
{
*(f32Vec4*)&mVerts.x[n] += fTime * fVelocityX;
*(f32Vec4*)&mVerts.y[n] += fTime * fVelocityY;
*(f32Vec4*)&mVerts.z[n] += fTime * fVelocityZ;
}
The online training resources available at http://developer.intel.com/software/products/college/ia32/sse2/
contain more detail about how to use these classes and instructions. We've just
used them here to illustrate the structure-of-arrays data layout.
How the Vertex Shader Compiler Works
The DirectX 8 API describes a dedicated virtual machine known as the vertex virtual machine,
or VVM, which does processing to a stream of incoming vertex data. This could be any kind of
processing that can be performed autonomously on one vertex (see the articles by Wolfgang
Engel in Part 1). The vertex shader mechanism was designed to run on dedicated hardware, but
as we've been discussing, a software implementation exists. Actually, several software imple-
mentations exist and are invoked based on what processor is detected at run time.
When an Intel Pentium III or Pentium 4 processor is detected and software vertex process-
ing is selected by the application, the vertex shader compiler is used. The compiler takes the
vertex shader program and declaration at shader creation time (done with the
IDirect3DDevice8::CreateVertexShader method) and compiles them to a sequence of Intel Architecture
instructions. The compiler attempts to generate the most optimal sequence of IA instructions to
emulate the functionality required. It does so by making use of the SSE and SSE2 instructions,
minimizing memory traffic, rescheduling the generated assembly instructions, etc.
After this sequence of instructions is generated, it exists until destroyed (done with the
IDirect3DDevice8::DeleteVertexShader method). Whenever any of the IDirect3DDevice8::Draw
methods are used after calling IDirect3DDevice8::SetVertexShader, execution will
use the IA-32 code sequence generated by the compiler.
When following this path, a number of things happen. First, incoming vertex data is
arranged into a structure-of-arrays format for more efficient processing. (As expected, there is
a cost to doing this arrangement.) Vertices are then processed four at a time, taking advantage
of the SIMD processing.
The way that the four-wide VVM registers are mapped to the SSE registers is by turning
them on their side, so to speak. Each VVM register is mapped to four SSE registers. This map-
ping forms a 16-float block that is essentially a four-vertex deep VVM register. The illustration
in Figure 4 shows what this looks like. The vertex shader then only needs to be executed one
quarter as many times, with each iteration simultaneously processing four vertices.
Vertex data is arranged back into the array-of-structures format on the way out to be sent on to
the rendering engine. This also incurs a performance penalty.
Figure 4: VVM to SSE register mapping
Performance Expectations
As mentioned above, arranging incoming data from array-of-structures to structure-of-arrays,
and then back again upon exit, incurs some performance penalty. However, this is countered by
the fact that the shader need only be executed one quarter as many times. Additionally, proces-
sor clock speeds tend to be much higher than most graphics hardware.
In the end, the performance benefits depend on the complexity of the shader (as a more
complex shader will amortize the cost of the data arrangement) and on how well the shader is
written. The shader can be written in a way that lets the compiler best do its job of generating
the assembly instructions.
Optimization Guidelines
The optimization guidelines listed below can help improve performance of vertex shaders run-
ning in software. What follows are some details on each one of the guidelines and why it can
affect performance.
Table 1: Guidelines for optimizing software vertex shaders
1. Write only the results you'll need.
2. Use macros whenever possible.
3. Squeeze dependency chains.
4. Write final arithmetic instructions directly to output registers.
5. Reuse the same temporary register if possible.
6. Don't implicitly saturate color and fog values.
7. When using exp and log functions, use the form with the lowest acceptable accuracy.
8. Avoid using the address register when possible.
9. If the address register must be used, try to order vertices in order of the address field.
10. Profile, profile, profile.
Write Only the Results You'll Need
In some cases, only a component of a register is needed. For example, the post-transform
Z-value of a vertex might be used to calculate an alpha or fog value. As we discussed earlier,
each of the components of the VVM registers maps to an individual SSE register. Any instruc-
tion performed on the VVM register will be performed to each of the four SSE registers;
therefore, any unnecessary components of the VVM register that we can avoid using will save
at least a few instructions on one of the SSE registers.
Let's look at an example:
Vertex Shader Code            SSE Instruction Cost (for four vertices)
1. mul r0, c8, r0             Four SSE multiplies, 4 to 8 moves
2. mul r0.z, c8.z, r0.z       One SSE multiply, 1 to 2 moves
Use Macros Whenever Possible
Macros are traditionally thought of as a convenience more than an optimization technique.
Indeed, in C code, macros are just a shorthand representation of a code sequence. In this case,
however, they give the compiler some extra information about what a particular sequence
of instructions is doing. In the m4x4 matrix multiplication macro, for example, the compiler is
aware that the multiplies being executed will be followed by an add (since the matrix multiply
is four dot product operations). Since the compiler is aware of this, it can retain results in regis-
ters, rather than moving them out and back in again.
This is illustrated in the following code sequence:
Before:
dp4 r0.x, v0, c2
dp4 r0.y, v0, c3
dp4 r0.z, v0, c4
dp4 r0.w, v0, c5
add r1, c6,v0
dp3 r2, r1, r1
rsq r2, r2
mov oT0, v2
mul r1,r1,r2
dp3 r3, v1, r1
add r3, r3, c7
mov oD0,r3
mov oPos,r0

After:
m4x4 r0, v0, c2
add r1, c6,v0
dp3 r2, r1, r1
rsq r2, r2
mov oT0, v2
mul r1,r1,r2
dp3 r3, v1, r1
add r3, r3, c7
mov oD0,r3
mov oPos,r0
Squeeze Dependency Chains
Since four SSE registers are used to emulate a VVM register, and there are a total of eight SSE
registers, data often gets moved in and out of registers to make room for the next instruction's
operands.
In cases where an operation's result is going to be used in another operation, performing
that operation soon afterward can often let the compiler use the registers containing the inter-
mediate result without having to move that intermediate result to memory and back.
This is illustrated in the following example. The 4x4 matrix multiply result is stored in r0,
and then r0 is not used again until the end of the code sequence, where it is moved into oPos.
Therefore, bumping the mov instruction up in the code sequence is possible and can let the
compiler save some instructions in the SSE code sequence.
Before:
m4x4 r0, v0, c2
add r1, c6,v0
dp3 r2, r1, r1
rsq r2, r2
mov oT0, v2
mul r1,r1,r2
dp3 r3, v1, r1
add r3, r3, c7
mov oD0,r3
mov oPos,r0

After:
m4x4 r0, v0, c2
mov oPos,r0
add r1, c6,v0
dp3 r2, r1, r1
rsq r2, r2
mul r1,r1,r2
dp3 r3, v1, r1
add r3, r3, c7
mov oD0,r3
mov oT0, v2
Write Final Arithmetic Instructions Directly to Output Registers
A keen eye may have looked at the previous code sequence and wondered why the result of the
matrix multiplication wasn't written directly to the oPos register. Indeed, this is another oppor-
tunity to squeeze a few cycles out of the code sequence, as shown below.
Before:
m4x4 r0, v0, c2
mov oPos,r0
add r1, c6,v0
dp3 r2, r1, r1
rsq r2, r2
mul r1,r1,r2
dp3 r3, v1, r1
add r3, r3, c7
mov oD0,r3
mov oT0, v2

After:
m4x4 oPos, v0, c2
add r1, c6,v0
dp3 r2, r1, r1
rsq r2, r2
mul r1,r1,r2
dp3 r3, v1, r1
add r3, r3, c7
mov oD0,r3
mov oT0, v2
Reuse the Same Temporary Register If Possible
When a temp register is used, there is some work the compiler does in terms of allocating SSE
registers to map to the temp register and the like. Reusing the same temp registers can some-
times save the compiler some work in having to redo this.
Don't Implicitly Saturate Color and Fog Values
This is a simple rule to understand. The compiler saturates color and fog values. Any work
done to saturate them by hand is wasted cycles.
Use the Lowest Acceptable Accuracy with exp and log Functions
The expp and logp instructions offer a variable level of precision. The modifier applied to the
instruction determines the level of precision. As one would imagine, the lower the precision,
the lower the amount of work required by the code sequence the compiler generates.
Avoid Using the Address Register When Possible
The address register allows for the addressing of different constant registers based on the value
in the address register. This is a problem for the compiler generating the SIMD code, which is
trying to process four vertices at a time. Each of the four vertices being processed may end up
referencing a different constant register, in which case the compiler will have to revert to a sca-
lar operation, processing each vertex one at a time.
Try to Order Vertices
Of course, the address register is there for a reason. Some vertex shaders may require its use. If
that's the case, the next best thing is to order vertices in the order of the address field. Ordering
by address field can result in runs of vertices with the same address value. The compiler will
then use SIMD processing on blocks of four vertices with a shared address value.
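As a sketch of how that reordering might be done once at load time (the vertex layout and the name of the palette-index field below are placeholders, and the index buffer must be remapped to match the new vertex order):
#include <algorithm>
#include <vector>

// Hypothetical vertex: 'paletteIndex' is whatever value the shader loads
// into the address register (a0.x).
struct SkinnedVertex
{
    float pos[3];
    float paletteIndex;
    // ... other components
};

static bool ByAddressField(const SkinnedVertex& a, const SkinnedVertex& b)
{
    return a.paletteIndex < b.paletteIndex;
}

// Group vertices with the same address value into contiguous runs so the
// compiler can process blocks of four with SIMD code.
void SortByAddressField(std::vector<SkinnedVertex>& verts)
{
    std::sort(verts.begin(), verts.end(), ByAddressField);
}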
Profile, Profile, Profile
Less of an optimization guideline than a bit of advice, it should go without saying that
optimizations need to be tested. The guidelines provided are a good start, but theres no substi-
tute for good testing! Test your optimization changes before and after and under a variety of
conditions and input data.
Testing the vertex shader performance can be a bit tricky. Some graphics vendors offer
drivers that output performance statistics, but some do not. The DirectX 8 software vertex
pipeline doesn't output performance statistics. The best place to start is to isolate the shader in
a test harness so that performance differences will be amplified, as the shader will comprise a
larger portion of what gets rendered in a frame.
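One simple harness is to time a loop of draw calls that renders nothing but geometry using the shader under test. The Win32 sketch below is only one way to do it; the draw callback and frame count are placeholders, and it ignores GPU buffering, which is reasonable when the software vertex pipeline's CPU cost is what you want to measure:
#include <windows.h>
#include <cstdio>

// Average the cost of 'frames' iterations of a draw routine that isolates
// the shader being profiled.
void ProfileShader(void (*drawTestScene)(), int frames)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (int i = 0; i < frames; ++i)
        drawTestScene();
    QueryPerformanceCounter(&stop);
    const double seconds = double(stop.QuadPart - start.QuadPart) / double(freq.QuadPart);
    printf("%.3f ms per iteration\n", 1000.0 * seconds / frames);
}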
A Detailed Example
The following vertex shader is the one used in the sample code provided on the companion CD
with this book. It's simply a combination of a couple of well-known shaders, the matrix palette
skinning shader and the reflection/refraction shader often used for soap bubble type effects.
Combining these gives us a glass character type effect as shown in Figure 5. Additionally, a
sine wave is propagated along the character's skin to provide a rippling effect.
Figure 5: Skinned mesh soap bubble shader example. Vertex shader is performing skinning, sinusoidal perturbation, reflection, and refraction calculations for texture coordinates.
Here's the shader before and after optimization:
Before:
vs.1.1
; Section 1
; Matrix Palette Skinning
mov a0.x, v3.x
;first bone and normal
dp4 r0.x, v0, c[a0.x + 50]
dp4 r0.y, v0, c[a0.x + 51]
dp4 r0.z, v0, c[a0.x + 52]
dp4 r0.w, v0, c[a0.x + 53]
dp3 r1.x, v4, c[a0.x + 50]
dp3 r1.y, v4, c[a0.x + 51]
dp3 r1.z, v4, c[a0.x + 52]
mov r1.w, c4.y
;second bone and normal
mov a0.x, v3.y
dp4 r2.x, v0, c[a0.x + 50]
dp4 r2.y, v0, c[a0.x + 51]
dp4 r2.z, v0, c[a0.x + 52]
dp4 r2.w, v0, c[a0.x + 53]
dp3 r3.x, v4, c[a0.x + 50]
dp3 r3.y, v4, c[a0.x + 51]
dp3 r3.z, v4, c[a0.x + 52]
mov r3.w, c4.y
;blend between r0, r2
mul r0, r0, v1.x
mad r2, r2, v2.x, r0
mov r2.w, c4.w
;blend between r1, r3
mul r1, r1, v1.x
mad r3, r3, v2.x, r1
; r2 contains final pos
; r3 contains final normal
; Sine-wave calculation
; Transform vert to clip-space
dp4 r0.x, r2, c0
dp4 r0.y, r2, c1
dp4 r0.z, r2, c2
dp4 r0.w, r2, c3
; theta from distance & time
mov r1, r0 ; xyz
dp3 r1.x, r1, r1 ; d^2
rsq r1.x, r1.x
rcp r1.x, r1.x ; d
mul r1, r1, c4.x ; time
; Clamp theta to -pi..pi
add r1.x, r1.x, c7.x
mul r1.x, r1.x, c7.y
frc r1.xy, r1.x
mul r1.x, r1.x, c7.z
add r1.x, r1.x, -c7.x
; Compute 1st 4 series values
mul r4.x, r1.x, r1.x ; d^2
mul r1.y, r1.x, r4.x ; d^3
mul r1.z, r4.x, r1.y ; d^5
mul r1.w, r4.x, r1.z ; d^7
mul r1, r1, c10 ; sin
dp4 r1.x, r1, c4.w
; Move vertex by sin(x)
mad r0, r1.x, c7.w, r0
mov oPos, r0
; Reflection/Refraction calc
; Trans. vert and normal to world-space
dp4 r0.x, r2, c12
dp4 r0.y, r2, c13
dp4 r0.z, r2, c14
dp4 r0.w, r2, c15
dp3 r1.x, r3, c12
dp3 r1.y, r3, c13
dp3 r1.z, r3, c14
; re-normalize normal
dp3 r1.w, r1, r1
rsq r1.w, r1.w
mul r1, r1, r1.w
; Get eye to vertex vector
sub r4, r0, c5
dp3 r3.w, r4, r4
rsq r3.w, r3.w
mul r3, r4, r3.w
; Calculate E - 2*(E dot N)*N
dp3 r4.x, r3, r1
add r4.x, r4.x, r4.x
mad oT0, -r1, r4.x, r3
; Get refraction normal
mul r1, r1, c6
; Calculate E - 2*(E dot N)*N
dp3 r4.x, r3, r1
add r4.x, r4.x, r4.x
mad oT1, -r1, r4.x, r3

After:
vs.1.1
; Section 1
; Matrix Palette Skinning
mov a0.x, v3.x
;first bone and normal
m4x3 r0.xyz, v0, c[a0.x + 50]
m3x3 r1.xyz, v4, c[a0.x + 50]
;second bone
mov a0.x, v3.y
m4x3 r2.xyz, v0, c[a0.x + 50]
m3x3 r3.xyz, v4, c[a0.x + 50]
;blend between bones r0, r2
mul r0.xyz, r0.xyz, v1.x
mad r2, r2.xyz, v2.x, r0.xyz
mov r2.w, c4.w
;blend between r1, r3
mul r1.xyz, r1.xyz, v1.x
mad r3, r3.xyz, v2.x, r1.xyz
mov r3.w, c4.y
; r2 contains final pos
; r3 contains final normal
; Sine-wave calculation
; Transform vert to clip-space
m4x4 r0, r2, c0
; theta from distance & time
mov r1.xyz, r0.xyz ; xyz
dp3 r1.x, r1, r1 ; d^2
rsq r1.x, r1.x
rcp r1.x, r1.x ; d
mul r1.xyz, r1, c4.x ; time
; Clamp theta to -pi..pi
add r1.x, r1.x, c7.x
mul r1.x, r1.x, c7.y
frc r1.xy, r1.x
mad r1.x, r1.x, c7.z, -c7.x
; Compute 1st 4 series values
mul r4.x, r1.x, r1.x ; d^2
mul r1.y, r1.x, r4.x ; d^3
mul r1.z, r4.x, r1.y ; d^5
mul r1.w, r4.x, r1.z ; d^7
dp4 r1.x, r1, c10 ; sin
; Move vertex by sin(x)
mad oPos, r1.x, c7.w, r0
; Reflection/Refraction calc
; Trans. vert and normal to world-space
m4x4 r0, r2, c12
m3x3 r1, r3, c12
; re-normalize normal
dp3 r1.w, r1, r1
rsq r1.w, r1.w
mul r1, r1, r1.w
; Get eye to vertex vector
sub r4, r0, c5
dp3 r3.w, r4, r4
rsq r3.w, r3.w
mul r3, r4, r3.w
; Calculate E - 2*(E dot N)*N
dp3 r4.x, r3, r1
add r4.x, r4.x, r4.x
mad oT0.xyz, -r1.xyz, r4.x, r3.xyz
; Get refraction normal
mul r1.xyz, r1.xyz, c6.xyz
; Calculate E - 2*(E dot N)*N
dp3 r4.x, r3, r1
add r4.x, r4.x, r4.x
mad oT1.xyz, -r1, r4.x, r3
A number of the optimization guidelines are used here, giving an improvement of about 15
percent. The vertex shader before optimization takes approximately 250 cycles per vertex.
After optimization, this is reduced to approximately 220 cycles per vertex. The effect on the
overall frame rate of the application will be less because there will usually be many more
things going on, and the vertex shader comprises only a portion of the total execution time.
Hopefully, the details provided here have given some insight into how the software vertex
shader compiler works and how to optimize code for it. The high performance available
through the software implementation, especially when properly optimized, means that develop-
ers can take advantage of vertex shaders today, even if their games have to run on computers
without vertex shader hardware.
Acknowledgments
I'd like to recognize a couple of individuals who contributed to this paper. First off, a large
thanks is due Ronen Zohar at Intel. Ronen worked closely with Microsoft on optimizing the
DirectX run time and provided the optimization guidelines discussed in this article. Thanks as
well to William Damon at Intel, who worked on the sample application referenced in the
examples.
Compendium of Vertex
Shader Tricks
Scott Le Grand
Introduction
There are a surprising number of procedural effects one can express with vertex shaders. As
hardware speed increases, moving such effects from the CPU to the graphics hardware will
free up more CPU time and memory bandwidth for other computations. In addition, a vertex
shader-based approach to procedural rendering allows the use of static vertex buffers within
procedural effects. This allows one to offload both the computation and the data for a proce-
dural effect entirely into the graphics hardware.
However, the conversion of a procedural effect from a CPU-bound calculation to a vertex
shader requires some thought: There is no flow control within a vertex shader, and data cannot
be shared among the vertices. Fortunately, one can generate the equivalent of decision-based
bit masks and use those to emulate flow control. This article will illustrate some of the tricks
one can employ to move procedural effects into 3D hardware and provide a simple example of
doing so.
Periodic Time
Vertex animation requires updating the positions of an effect's vertices over the time period of
the effect. In a vertex shader, one wants to use static vertex buffers in order to minimize AGP
bandwidth. So the first trick needed here is to calculate a periodic time value to drive vertex
animation from within the vertex shader itself. The use of this periodic time allows the vertices
to follow a calculated trajectory over the lifetime of an effect and then disappear at the end of
one cycle of the period. If the effect is cyclic, the periodic nature of the calculated time value
will allow each vertex to recycle itself and reappear as a seemingly new component of the
effect during the next cycle of the time period. Mathematically, for this task we need an
approximation to a fractional remainder function to calculate a fractional period. While the
DirectX 8 specification supplies just such a function in the frc macro, there is a simpler
approach. The y component of the destination register in an expp instruction returns the frac-
tional part of the w component from the input register.
To derive a periodic time value for a vertex, first supply each vertex with a phase φ. Next,
store the current absolute time, t, in a constant register and 1/period, ω, in another constant
register. The fractional period, f, of any vertex can be calculated by passing ω*(t - φ) into an
expp instruction as follows:
mov r1, c0.xy ; t is in c0.x, ω is in c0.y
sub r1.w, c0.x, v0.x ; φ is in v0.x
mul r1.w, r1.w, c0.y
expp r1.y, r1.w ; f is now in r1.y
Vertex phases can all be identical if you wish the vertices of the effect to move in tandem from
an origin (such as a shockwave from an explosion), random if you wish the effect to always
appear active (such as a particle fountain), or in groups of randomly generated but identical
phases to generate clumps of vertices that are always at the same timepoint in their periodic
trajectories (such as the rings emerging along the path of a railgun shot in Quake).
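For reference, the value those four instructions produce for each vertex is simply the fractional part of ω*(t - φ); the small C++ function below (names are placeholders) can be handy for verifying the shader result on the CPU:
#include <cmath>

// f = fraction of the current cycle, in [0, 1).
// t = absolute time, phase = per-vertex phase, invPeriod = 1/period.
float PeriodicTime(float t, float phase, float invPeriod)
{
    float x = (t - phase) * invPeriod;
    return x - std::floor(x);   // fractional part, matching expp's y result
}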
One-Shot Effect
A variant on the above involves skipping the expp instruction altogether, setting the phases of
all the vertices to the same time as the creation of the effect, and only rendering the vertices up
to the expiration of the time period. This is useful for a one-shot effect such as an explosion.
Random Numbers
While one can supply pregenerated random numbers as a vertex attribute, one will need to gen-
erate a new random number for every cycle of a periodic effect's period, or else the effect
could look repetitive without an enormous number of vertices. Fortunately, one can use a sim-
ple equation (a linear congruence) and the aforementioned expp instruction to generate a
pseudorandom number from a supplied seed (such as the phase used above if space is tight and
it's random itself) and the integer part of the periodic time from above. To calculate a random
number, store T, the integer part of the periodic time, in a constant register, then pass r = T*φ into
an expp instruction:
mov r1.w, c0.z ; T is in c0.z
mul r1.w, r1.w, v0.x ; r = T*φ
expp r2.y, r1.w ; fractional time is in r2
sub r1.w, r1.w, r2.y ; a unit pseudo-random number is in r1.w
Such a random number will remain constant throughout a single period of an effect. This is
useful for initializing a vertex's state. Since no calculated data is retained between invocations
of a vertex shader, this pseudorandom value will be calculated upon every transformation of
each vertex.
The above code fragment also demonstrates how to manipulate the vertex shader input and
output masks. These masks are the register component (x, y, z, and w) sequences following the
period in each input and output register in the above code fragment. While the expp instruction
normally calculates four output values, we can indicate our interest in just the w component
output by leaving out x, y, and z from the output register. The omission of unnecessary compo-
nents of output registers can improve vertex shader performance, especially on machines that
emulate vertex shaders in software. For input registers only, their four components can be
collectively negated, and each component can be swizzled: replicated or permuted in any
desired combination.
Flow Control
While there are no flow control instructions in the DirectX 8 vertex shader language, there are
two instructions, sge and slt, which will perform a greater-than-or-equal or a less-than compari-
son between a pair of source registers and then set each of a destination register's components
either to 1.0, if the comparison is true, or 0.0, if it is false. One can use a series of such tests to
conditionally add in multiple components of an effect's trajectory.
Cross Products
When either constant or vertex attribute space is tight, one can use two instructions to calculate
the cross product of two vectors in order to generate a vector orthogonal to an existing pair of
normalized vectors. This is also handy for generating the third row of an orthonormal transfor-
mation matrix given only its first two rows. The cross product of two three-dimensional
vectors v1 and v2 is defined as the determinant:

    | i  x1  x2 |
    | j  y1  y2 |  =  (y1*z2 - z1*y2)i + (z1*x2 - x1*z2)j + (x1*y2 - y1*x2)k     (1)
    | k  z1  z2 |

This sequence of two multiplies and one subtraction per vector component can be represented
by:
mul r3, r1.yzxw, r2.zxyw
mad r3, -r1.zxyw, r2.yzxw, r3
where v1 is stored in r1, v2 is stored in r2, and the result is written to r3. In this case, we have
used the swizzle on registers r1 and r2, along with negation of register r1, to reproduce the
input terms of equation 1.
Examples
The first example for this article on the companion CD shows how to use periodic time to sim-
ulate and render a particle system entirely within a vertex shader. Each particle requires only
four floating-point values. The first three of these values represent a three-dimensional heading
vector, which the particle would follow in the absence of gravity. The fourth floating-point
value is φ, the phase, which is used to ensure the particle system effect is always in
mid-stream, no matter when it is initiated. This shader only requires five constant registers,
four of which hold the total transformation matrix. The remaining constant register contains, in
order of component, the current time t, normalized by dividing it by Tp, the desired lifetime of
a particle, followed by the launch speed of all particles, after which comes the total gravita-
tional acceleration experienced by a particle during its lifetime, g*Tp^2, where g is the
gravitational acceleration, and finally, a constant, 1.0, which is used to convert the
three-dimensional particle position into a four-dimensional homogeneous coordinate.
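A possible CPU-side setup for that packed constant register is sketched below. The register index (c4, after the matrix in c0-c3), the example values, and the device call are assumptions rather than details given in the text:
#include <d3d8.h>

void SetParticleConstants(IDirect3DDevice8* pDevice, float timeSeconds)
{
    const float Tp    = 2.0f;    // particle lifetime in seconds (example value)
    const float speed = 8.0f;    // launch speed (example value)
    const float g     = -9.8f;   // gravitational acceleration (example value)

    // Layout described above: x = t/Tp, y = launch speed, z = g*Tp^2, w = 1.0
    const float c4[4] = { timeSeconds / Tp, speed, g * Tp * Tp, 1.0f };
    pDevice->SetVertexShaderConstant(4, c4, 1);
}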
The second example makes more extensive use of random numbers to create a series of
recurring procedural explosions similar to firework bursts. As in the first example, each parti-
cle has a phase value φ, which is used here to stagger the times of detonation of each
individual explosion. However, there are two important differences here from the previous
effect. First, each φ is shared by a group of particles so that the group animates synchronously.
Second, there are three additional shared phase values for each of these particle groups, one for
each spatial dimension. These three phases are used to pseudorandomize the position of each
explosion. They each vary in range from 0.5 to 1.0. With each recurring explosion, the position
of a particle is shifted in each dimension by the respective phase within a periodic box. The
periodicity of the box is enforced, not surprisingly, by the use of the expp instruction.
Summary
Vertex shaders can perform much of the bookkeeping for special effects that is currently han-
dled by the CPU. Converting a CPU-bound effect into a vertex shader effect requires thinking
of the effect as either periodic or turning the effect on and off externally. The advantage of han-
dling the effect inside a vertex shader is that the graphics hardware can render it entirely from
static, pregenerated vertex buffers. The disadvantage is that if geometry bandwidth is tight,
converting an effect to a vertex shader may slow things down. As with most processes, the
only way to find out is to try each way.
Perlin Noise and
Returning Results from
Shader Programs
Steven Riddle and Oliver C. Zecha
Limitations of Shaders
The hardware-accelerated programmable shader is the most important advance in computer
graphics today. Although early pixel shaders leave a few things to be desired, vertex shaders
are proving to be a robust and versatile solution to many graphics problems. Together, pixel
and vertex shaders allow a degree of realism on screen unprecedented in real-time animation.
From particle systems to geometric deformations, their scope of application seems endless.
The main reason for this is their vector processing capabilities, which allow complex lighting
models and other visually stunning effects.
Currently, if you do not have a graphics card supporting vertex shaders, there are reason-
able implementations that use the CPU to execute the vertex shader program. (The same is true
for pixel shaders, but because the CPU must do all the texturing, it is painfully slow and ren-
ders the graphics processor useless.) However, at the rate with which graphics technology
advances these days, it is difficult to believe that the CPU will be able to keep up with the GPU
in the immediate future. This is due to the nature of the execution units.
A shader program does not allow branching or looping, and consequently, the GPU does
not need to implement the hardware to make these possible. This fact alone allows the GPU to
run much faster. Besides this, a GPU does not allow access to the data of other vertices, so they
may be processed in parallel. Although CPU manufacturers have developed competing tech-
nologies to operate on multiple data with a single instruction, the GPU still has the advantage
when working with larger sets of data. This becomes especially apparent when one considers
all the other work the CPU must do, such as AI, collision detection, or whatever else your
application requires.
Very soon, the number of vertices in average scenes will be so astronomically huge that
performing the math to transform and light them on the CPU will be futile. Couple this with
the memory bandwidth requirements for large meshes, and it becomes clear that the CPU's
cycles are better spent on tasks other than vertex processing.
This leaves us with the vector math capabilities of the GPU. It can compute more vector
operations faster than any PC hardware before it, and the numbers just keep getting bigger and
better. Already, the number of vector operations feasible on a GPU exceeds that of a CPU, and
there are no signs of going back. In fact, the trend seems to be that GPUs are doubling in
power every six months, a full three times faster than Moore's Law predicts. Therefore, it
seems reasonable that as much graphic computation as possible should be performed using the
GPU. Not only this, but the GPU should be used for as much vector processing as possible.
What this really means, in terms of code, is that as much vector and graphic computation
as achievable should be done using a programmable shader. Here we encounter the first major
limitation of shaders, namely that we may use, at most, 128 instructions in the vertex pipeline
and eight instructions plus four addressing instructions in the pixel pipeline. (Version 1.4 pixel
shaders have increased to six addressing instructions but kept eight regular instructions in each
phase; however, they allow two phases of 14 (8+6) instructions each.) Well, you may
think, even if we are allowed only 140 instructions, at least the instructions themselves are
relatively complicated mathematical operations, and using them we can do all sorts of neat
things. Nevertheless, as graphics programmers, we are the type of people who like to push the
limits, and so we ask, What if we need more? After all, 640K turned out to not be enough for
everybody!
We have always been fascinated by computer graphics and physical simulation, so you can
imagine how excited we were when given the opportunity to harness so much power for
on-screen imaging and vector processing! One of the first applications we saw for vertex
shaders was procedural deformation on a plane. We immediately noticed huge implications for
landscapes and other generated meshes, but something was missing: There was no way of
knowing the positions of the deformed vertices. As far as we were concerned, the objects had
not changed. This presents a whole host of problems. Even doing collision detection with
objects that have been transformed requires knowledge of the transformed vertices. In fact, it
seems that there are many situations where getting results from the shader calculations is use-
ful. Shortly after this, we discovered to our horror that the shader program outputs are write
only. This marks the second major limitation of shaders: the difficulty in returning values.
The first identified limitation, the number of instructions, has been dealt with to some
extent in several documents. The general idea, with which you are certainly familiar by now, is
to utilize the costless modifiers. These consist of swizzle, mask, and negation for the vertex
shader, a whole host of instruction modifiers, and instruction pairing for the pixel shader. This
will allow you to group instructions for speed and possibly allow you to reduce the number of
instructions used. Nevertheless, the gains you can make using these methods are minimal, in
terms of decreasing the number of instructions used. The ideal solution would be to run two
vertex shader programs sequentially, the second processing the output of the first. Unfortu-
nately, this idea was not incorporated into the shader specification, which means that the best
we can do is reduce the problem to getting output from the shader program. Interestingly
enough, this coincides perfectly with the second limitation outlined earlier.
Finally, we have only one solution to discover, and fortunately it exists. Although we can-
not get results directly from the shader program, they do give a sort of result. Generally, shader
programs produce rendered output in memory, usually in video memory for display on screen.
However, it is possible to render to texture. This is the technique used to return values from the
shader program. The trick is to arrange everything so that the texture to which we render is
used as an array of results for our computations. This may seem a bit strange at first, but the
implications are vast. The ability to get values back from the shader pipeline lets us implement
a variety of computation-intensive algorithms as fast shader programs which can run on the
GPU.
In order to accomplish this, we need to add an extra piece of information to each vertex.
This extra information represents the texture coordinate where the result will be stored. In
order to justify moving this extra data across the AGP bus, we must be certain that the compu-
tations performed on each vertex are complex enough that doing them on the GPU is
worthwhile. In this article, we use the vertex shader to generate three-dimensional Perlin noise.
This is very intensive, since for each Perlin value, we need to do three cubic evaluations, eight
dot products, and seven linear interpolations, plus a number of moduli, multiplications, and
additions.
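For concreteness, the following C++ fragment sketches one possible Direct3D 8 setup for this render-to-texture trick: the shader writes its results into a small render-target texture, and the CPU copies that surface into system memory to read the values back. The texture size, the A8R8G8B8 format, and the function name are illustrative assumptions, not the exact arrangement used by the sample on the companion CD:

#include <d3d8.h>

// Sketch: reading back shader results via render-to-texture (Direct3D 8).
// pDevice is an existing IDirect3DDevice8*; sizes and formats are assumptions.
void ReadBackShaderResults(IDirect3DDevice8* pDevice)
{
    IDirect3DTexture8* pResultTex = 0;
    IDirect3DSurface8* pResultRT  = 0;
    IDirect3DSurface8* pReadback  = 0;

    // Render target that will hold one result per texel.
    pDevice->CreateTexture(256, 256, 1, D3DUSAGE_RENDERTARGET,
                           D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT, &pResultTex);
    pResultTex->GetSurfaceLevel(0, &pResultRT);

    // Lockable system-memory surface of the same size and format.
    pDevice->CreateImageSurface(256, 256, D3DFMT_A8R8G8B8, &pReadback);

    // 1. Render into the result texture; each vertex carries the texture
    //    coordinate of the texel where its result should be stored.
    pDevice->SetRenderTarget(pResultRT, NULL);
    // ... SetVertexShader / SetStreamSource / DrawPrimitive ...

    // 2. Copy the rendered results to system memory and read them.
    pDevice->CopyRects(pResultRT, NULL, 0, pReadback, NULL);
    D3DLOCKED_RECT lr;
    pReadback->LockRect(&lr, NULL, D3DLOCK_READONLY);
    // ... interpret lr.pBits as the array of computed results ...
    pReadback->UnlockRect();

    pReadback->Release();
    pResultRT->Release();
    pResultTex->Release();
}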
Perlin Noise and Fractional Brownian Motion
In general, computer graphics tend to have hard edges and appear too perfect to look real. To
correct for this, noise is often added to give the impression of some blemishes and randomness.
This noise may be added to lines to make them sketchy, textures to give them variation, char-
acter animation to make it more natural, or geometry to add some imperfections. In fact, there
are countless applications of noise in computer imaging. Nevertheless, most noise is too ran-
dom to look natural, unless you need a texture for TV static. For this reason, it is common to
perform some smoothing on the noise, such as a Gaussian blur. There is a problem with this
technique, however, when we consider real-time graphics. A Gaussian blur is quite slow. For
this reason, Perlin noise was developed to approximate this smoothed noise.
Perlin noise, named for its creator Ken Perlin, is a technique for generating smooth
pseudorandom noise. Smooth in this case means that the noise is defined at all real points in
the domain. Pseudorandom means that it appears random, but is in fact a function. The most
important feature of Perlin noise is that it is repeatable. This ensures that for any given point,
the noise will be the same each time it is calculated. A Perlin noise function takes a value in R^n
and returns a value in R. In simpler terms, this means that it takes an n-dimensional point with
real coordinates as input, and the result is a real number.
In order to produce smooth pseudorandom
noise, we need a random number generator and
a function for interpolation. The random num-
ber generator is used as a noise function, and
the interpolation function is used to produce a
continuous function from the noise.
Figure 1 shows an example of a
one-dimensional random function. Given an
integer, it generates a pseudorandom real num-
ber between zero and one. Each time it is used
to compute a value for a given point, it gives us
the same pseudorandom number, so it is repeat-
able. This is our noise function.
Figure 1: Random function
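A minimal sketch of such a repeatable noise function is given below. The particular integer hash is an assumption for illustration (a widely circulated variant from noise tutorials); any function that always maps the same integer to the same pseudorandom real number in a fixed range would serve:

// Repeatable pseudorandom noise: the same integer always yields the same value.
// The constants form a commonly used integer hash; the result is scaled to
// the range [0, 1] to match the random function shown in Figure 1.
float IntegerNoise(int x)
{
    unsigned int n = (unsigned int)x;
    n = (n << 13) ^ n;
    n = n * (n * n * 15731u + 789221u) + 1376312589u;
    return (n & 0x7fffffffu) / 2147483647.0f;
}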
In Figure 2, you can see the results of cubic
interpolation on our one-dimensional noise func-
tion. For the moment, don't worry about how
this is done; I'll explain that later. This function
is repeatable like the noise. However, unlike the
noise function, this function is continuous, and
consequently, it can be evaluated at any real
point in the x domain. We call this smooth func-
tion our noise wave.
It is sometimes helpful to think of the noise
wave as a simple wave function, such as sine or
cosine. You can see this analogy illustrated in
Figure 3. The amplitude of a simple wave is the
distance from valley to peak. Similarly, the
amplitude (A) of the noise wave is the maximum
possible distance from valley to peak (this is the
range of the noise function). The wavelength (l)
is taken as the x distance between consecutive
points. Since we know the wavelength (l), we
can calculate the frequency (f) by the following:
f = 1 / l
(N·H)^k, where (N·H)^k → 0 if N·H ≤ 0
Pixel shader:
ps.1.4
texld r0.rgb, t0.xyz //get normal from a texture lookup
texld r1.rgb, t1.xyz //get light vector in same way
texld r2.rgb, t2.xyz //get view vector in same way
texcrd r3.rgb, t3.xyz //pass in spec power in t3.x
//compute H as (V+L)/2
add_d2 r1.rgb, r1_bx2, r2_bx2
//compute N.H and store in r0.r
dot3 r0.r, r0_bx2, r1
//compute H.H and store in r0.g
dot3 r0.g, r1, r1
//copy in k
mov r0.b, r3.r
phase
texld r3, r0.rgb //fetch the 3D texture to compute specular
//move it to the output
mov_sat r0, r3
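The 3D texture fetched in the second phase encodes the specular term for every combination of N·H, H·H, and the exponent k. A possible CPU-side fill loop is sketched below; the texture resolution, the mapping of each axis to [0, 1], and the range of k are illustrative assumptions rather than the exact layout used in this example:

#include <math.h>

// Sketch: filling a 3D specular lookup texture.
// Axis u = N.H, axis v = H.H, axis w = specular power k (all assumptions).
const int SIZE = 32;
unsigned char specTex[SIZE][SIZE][SIZE];

void FillSpecularVolume()
{
    for (int w = 0; w < SIZE; w++)              // specular power slice
    {
        float k = 1.0f + 31.0f * w / (SIZE - 1);
        for (int v = 0; v < SIZE; v++)          // H.H
        {
            float hDotH   = (float)v / (SIZE - 1);
            float invLenH = (hDotH > 0.0f) ? 1.0f / sqrtf(hDotH) : 0.0f;
            for (int u = 0; u < SIZE; u++)      // N.H
            {
                float nDotH = (float)u / (SIZE - 1);
                // Normalize N.H by |H| and clamp the back-facing lobe to zero.
                float s     = nDotH * invLenH;
                float value = (s > 0.0f) ? powf(s, k) : 0.0f;
                if (value > 1.0f) value = 1.0f;
                specTex[w][v][u] = (unsigned char)(255.0f * value);
            }
        }
    }
}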
Noise and Procedural Texturing
Noise is a particular type of function texture that is extremely useful in pixel shading. The
noise that is most interesting to us here is not white noise of random numbers, but Perlin noise
that has smoothly varying characteristics [Perlin85]. Noise of this sort has been used for years
in production quality 3D graphics for texture synthesis. It allows the creation of non-repeating
fractal patterns like those found in wood, marble, or granite. This is especially interesting in
that an effectively infinite texture can be created from finite texture memory. Other uses of 2D
noise textures are discussed in Texture Perturbation Effects.
The most classic example for using a 3D noise function is to create turbulence. Turbulence
is an accumulation of multiple levels (octaves) of noise for the purpose of representing a turbu-
lent flow. A great example of this is the appearance of marble. The veins flow through the
material in a turbulent manner. The example we will use is an adaptation for real time of the
Blue Marble shader in [Upstill89]. In this example, the shader uses six textures to create the
effect. The textures are five occurrences of a noise map and a one-dimensional color table.
(Note that this shader will work even better with more noise textures; six is just the limit for
DirectX 8.1's 1.4 pixel shaders.) The noise textures are scaled and summed to create the turbu-
lence, and the value of the turbulence is then used to index a 1D texture via a dependent read to
map it to a color. The texture coordinates used to index into the noise maps are all procedurally
generated in the same manner, and they are just different scales of each other.
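One possible way to generate those coordinates, sketched below, is to feed the same base 3D coordinate to every stage and scale it with a per-stage texture transform; a vertex shader could achieve the same with one mul per octave. The stage assignments and the factor-of-two frequency step are assumptions for illustration:

// Sketch: derive the five octave coordinates by scaling one base 3D texture
// coordinate with per-stage texture transforms (Direct3D 8, fixed-function
// vertex processing). pDevice is an existing IDirect3DDevice8*.
for (DWORD stage = 0; stage < 5; stage++)
{
    float scale = (float)(1 << stage);   // frequencies 1, 2, 4, 8, 16 (assumed)
    D3DXMATRIX m;
    D3DXMatrixScaling(&m, scale, scale, scale);
    pDevice->SetTransform((D3DTRANSFORMSTATETYPE)(D3DTS_TEXTURE0 + stage), &m);
    // Every stage reuses texture coordinate set 0 and passes all three
    // components through to its noise volume.
    pDevice->SetTextureStageState(stage, D3DTSS_TEXCOORDINDEX, 0);
    pDevice->SetTextureStageState(stage, D3DTSS_TEXTURETRANSFORMFLAGS,
                                  D3DTTFF_COUNT3);
}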
Pixel shader code:
ps.1.4
//fetch noise maps
texld r0, t0.xyz
texld r1, t1.xyz
texld r2, t2.xyz
texld r3, t3.xyz
texld r4, t4.xyz
//accumulate multiple octaves of noise
// this sets r0 equal to:
// 1/2r0 + 1/4r1 + 1/8r2 + 1/16r3 + 1/32r4
add_d2 r3, r3_x2, r4
add_d2 r3, r2_x2, r3
add_d2 r3, r1_x2, r3
add_d4 r0, r0_x2, r3
phase
//fetch 1-D texture to remap
texld r5, r0.xyz
mov r0, r5
Realistically, the interactive 3D graphics field is just now on the edge of getting procedural
noise textures to work. Noise textures tend to require several octaves of noise and dependent
fetches just to implement something as simple as marble. This ignores the application of light-
ing and other effects one would also need to compute in the shader. As a result, it would be
impractical to ship a game that relied upon noise textures of this sort today, but it still certainly
applies as a technique to add that little bit of extra varying detail to help eliminate the repeti-
tive look in many of today's interactive 3D applications. Finally, even if this technique is not
immediately applicable, graphics hardware is evolving at such a pace that we expect this tech-
nique will be very common in a few years.
Attenuation and Distance Measurement
Another procedural trick that 3D textures are useful for is attenuation. The attenuation can cor-
respond to falloff proportional to the distance from a light source, or simply some distance
measurement trick intended to create a cool effect. Typically, a distance measurement like this
is implemented with a combination of pixel and vertex shaders, where the true distance is com-
puted per-vertex and interpolated or a vector representing the distance is interpolated with its
length being computed per-pixel. The deficiency in these methods is that they incorrectly lin-
early interpolate non-linear quantities or that they potentially force some of the computation to
occur in lower precision, leading to poor quality. Additionally, these algorithmic solutions are
only applicable to calculations based on simple primitives such as points, lines, and planes.
By using a 3D texture, the programmer is effectively embedding the object in a distance or
attenuation field. Each element in the field can have arbitrary complexity in its computation
since it is generated off-line. This allows the field to represent a distance from an arbitrary
shape. On the downside, using a 3D texture can be reasonably expensive from a memory
standpoint, and it relies on a piecewise linear approximation to the function from the filtering.
Figure 3: Marble shader using eight octaves of noise instead of five
The toughest portion of using a 3D texture to perform this sort of operation is defining the
space used for the texture and the transform necessary to get the object into the space. Below is
a general outline of the algorithm:
1. Define a texture space containing the area of interest.
2. For each texel in the space, compute the attenuation.
3. Apply the 3D attenuation texture to the object with appropriate texture coordinate
generation.
The first step involves analyzing the data and understanding its symmetry and extents. If the
data is symmetric along one of its axes, then one of the mirrored texturing modes, such as
D3DTADDRESS_MIRRORONCE, can be used to double the resolution (per axis of symme-
try) for a given amount of data. Additionally, an understanding of the extents of the data is
necessary to optimally utilize the texture memory. If a function falls to a steady state, such as
zero along an axis, then the texture address mode can be set to clamp to represent that portion
of the data. Once the user has fit the data into a space that is efficient at representing the func-
tion, the function must be calculated for each texel. Below are a pair of pseudocode examples
showing how to fill an attenuation field for a point. The first is naive and computes all the
attenuation values in the shape of a sphere; the second takes advantage of the symmetry of the
sphere and only computes one octant. This can be done since the values all mirror against any
plane through the center of the volume.
Complete sphere:
#define VOLUME_SIZE 32
unsigned char volume[VOLUME_SIZE][VOLUME_SIZE][VOLUME_SIZE];
float x, y, z;
float dist;
//walk the volume filling in the attenuation function,
// where the center is considered (0,0,0) and the
// edges of the volume are one unit away in parametric
// space (volume ranges from -1 to 1 in all dimensions)
x = -1.0f + 1.0f/32.0f; //sample at texel center
for (int ii=0; ii<VOLUME_SIZE; ii++, x += 1.0f/16.0f)
{
y = -1.0f + 1.0f/32.0f;
for (int jj=0; jj<VOLUME_SIZE; jj++, y += 1.0f/16.0f)
{
z = -1.0f + 1.0f/32.0f;
for (int kk=0; kk<VOLUME_SIZE; kk++, z += 1.0f/16.0f)
{
//compute distance squared
dist = x*x + y*y + z*z;
//compute the falloff and put it into the volume
if (dist > 1.0f)
{
//outside the cutoff
volume[ii][jj][kk] = 0;
}
else
{
//inside the cutoff
volume[ii][jj][kk] = (unsigned char)
( 255.0f * (1.0f - dist));
}
}
}
}
Sphere octant, used with the D3DTADDRESS_MIRRORONCE texture address mode:
#define VOLUME_SIZE 16
unsigned char volume[VOLUME_SIZE][VOLUME_SIZE][VOLUME_SIZE];
float x, y, z;
float dist;
//walk the volume filling in the attenuation function,
// where the center of the volume is really (0,0,0) in
// texture space.
x = 1.0f/32.0f;
for (int ii=0; ii<VOLUME_SIZE; ii++, x += 1.0f/16.0f)
{
y = 1.0f/32.0f;
for (int jj=0; jj<VOLUME_SIZE; jj++, y += 1.0f/16.0f)
{
z = 1.0f/32.0f;
for (int kk=0; kk<VOLUME_SIZE; kk++, z += 1.0f/16.0f)
{
//compute distance squared
dist = x*x + y*y + z*z;
//compute the falloff and put it into the volume
if (dist > 1.0f)
{
//outside the cutoff
volume[ii][jj][kk] = 0;
}
else
{
//inside the cutoff
volume[ii][jj][kk] = (unsigned char)
( 255.0f * (1.0f - dist));
}
}
}
}
With the volume, the next step is to generate the coordinates in the space of the texture. This
can be done with either a vertex shader or the texture transforms available in the fixed-function
vertex processing. The basic algorithm involves two transformations. The first translates the
coordinate space to be centered at the volume, rotates it to be properly aligned, then scales it so
that the area affected by the volume falls into a cube ranging from -1 to 1. The second trans-
formation is dependent on the symmetry being exploited. In the case of the sphere octant, no
additional transformation is needed. With the full sphere, the transformation needs to map the
entire -1 to 1 range to the 0 to 1 range of the texture. This is done by scaling the coordinates by
one half and translating them by one half. All these transformations can be concatenated into a
single matrix for efficiency. The only operation left to perform is the application of the dis-
tance or attenuation to the calculations being performed in the pixel shader.
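A possible concatenation of these transformations with D3DX is sketched below for the full-sphere case. The variable names (volumeCenter, volumeRotation, volumeRadius) and the assumption that the incoming texture coordinates already hold the world-space vertex position are illustrative, not the only way to wire this up:

// Sketch: build the single texture matrix for the full-sphere attenuation volume
// (Direct3D 8 / D3DX). volumeCenter (D3DXVECTOR3), volumeRotation (D3DXMATRIX),
// volumeRadius (float), and pDevice are assumed to exist.
D3DXMATRIX toCenter, toUnitCube, toTexRange, texMatrix;

// 1. Translate so the volume center becomes the origin.
D3DXMatrixTranslation(&toCenter, -volumeCenter.x, -volumeCenter.y, -volumeCenter.z);

// 2. Scale so the region affected by the volume maps to the -1..1 cube.
D3DXMatrixScaling(&toUnitCube, 1.0f / volumeRadius,
                  1.0f / volumeRadius, 1.0f / volumeRadius);

// 3. Full sphere only: remap -1..1 to the 0..1 texture range
//    (scale by one half, translate by one half).
D3DXMatrixScaling(&toTexRange, 0.5f, 0.5f, 0.5f);
toTexRange._41 = toTexRange._42 = toTexRange._43 = 0.5f;

// Concatenate (translate, rotate, scale, remap) and apply as a texture transform.
texMatrix = toCenter * volumeRotation * toUnitCube * toTexRange;
pDevice->SetTransform(D3DTS_TEXTURE0, &texMatrix);
pDevice->SetTextureStageState(0, D3DTSS_TEXTURETRANSFORMFLAGS, D3DTTFF_COUNT3);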
Representation of Amorphous Volume
One of the original uses for 3D textures was in the visualization of scientific and medical data
[Drebin88]. Direct volume visualization of 3D datasets using 2D textures has been in use for
some time, but it requires three sets of textures, one for each of the axes of the data, and is
unable to address filtering between slices without the use of multitexture. With 3D textures, a
single texture can directly represent the dataset.
Although this article does not focus on scientific visualization, the implementation of this
technique is useful in creating and debugging other effects using volume textures. Using this
technique borrowed from medical volume visualization is invaluable in debugging the code
that a developer might use to generate a 3D distance attenuation texture, for example. It allows
the volume to be inspected for correctness, and the importance of this should not be underesti-
mated, as presently an art tool for creating and manipulating volumes does not exist.
Additionally, this topic is of great interest since a large amount of research is presently ongoing
in the area of using pixel shaders to improve volume rendering. For more information on volu-
metric rendering, see Truly Volumetric Effects.
Application
These general algorithms and techniques for 3D textures are great, but the really interesting
part is the concrete applications. The following sections will discuss a few applications of the
techniques described previously in this article. They range from potentially interesting spot
effects to general improvements on local illumination.
Volumetric Fog and Other Atmospherics
One of the first problems that jumps into many programmers' minds as a candidate for volume
texture is the rendering of more complex atmospherics. So far, no one has been able to get
good variable density fog and smoke wafting through a scene, but many have thought of vol-
ume textures as a potential solution. The reality is that many of the rendering techniques that
can produce the images at a good quality level burn enormous amounts of fill-rate. Imagine
having to render an extra 50 to 100 passes over the majority of a 1024 x 768 screen to produce
a nice fog effect. This is not going to happen quite yet. Naturally, hardware is accelerating its
capabilities toward this sort of goal, and with the addition of a few recently discovered
improvements, this sort of effect might be acceptable.
The basic algorithm driving the rendering of the atmospherics is based on the volume visu-
alization shaders discussed earlier. The primary change is an optimization technique presented
at the SIGGRAPH/Eurographics Workshop on Graphics Hardware in 2001 by Klaus Engel
(http://wwwvis.informatik.uni-stuttgart.de/~engel/). The idea is that the accumulation of slices
of the texture map is attempting to perform an integration operation. To speed the process up,
the volume can be rendered in slabs instead. By having each slice sample the volume twice (at
the front and rear of the slab), the algorithm approximates an integral as if the density varied
linearly between those two values. This results in the slice taking the place of several slices
with the same quality. This technique is better described in the following article, Truly Volu-
metric Effects.
The first step in the algorithm is to draw the scene as normal, filling in the color values
that the fog will attenuate and the depth values that will clip the slices appropriately. The atmo-
spheric is next added to the scene by rendering screen-aligned slices with the shader described
in the following article. When rendering something like fog that covers most of the view, this
means generating full-screen quads from the present viewpoint. The quads must be rendered
back to front to allow proper compositing using the alpha blender. The downside to the algo-
rithm is in the coarseness of the planes intersecting with the scene. This will cause the last
section of fog before a surface to be dropped out. In general, this algorithm is best for a fog
volume with relatively little detail, but that tends to be true of most fog volumes.
Light Falloff
A more attainable application of 3D textures and shaders on today's hardware is volumetric
light attenuation. In its simplest form, this becomes the standard light-map algorithm that
everyone knows. In more complex forms, it allows the implementation of per-pixel lighting
derived from something as arbitrary as a fluorescent tube or a light saber.
In simple cases, the idea is to simply use the 3D texture as a ball of light that is modulated
with the base environment to provide a simple lighting effect. Uses for this include the classic
"rocket flying down the dark hallway to kill some ugly monster" shader. Ideally, that rocket
should emit light from the flame. The light should fall off roughly as a sphere centered
around the flame. This obviously oversimplifies the real lighting one might like to do, but it is
likely to be much better than the projected flashes everyone is used to currently. Implementing
this sort of lighting falloff is extremely simple; one only needs to create a coordinate frame
centered around the source of the illumination and map texture coordinates from it. In practice,
this means generating texture coordinates from world or eye-space locations and then trans-
forming them into this light-centered volume. The texture is then applied via modulation with
the base texture. The wrap modes for the light map texture should be set as described in the
earlier section on the general concept of attenuation.
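For reference, a minimal Direct3D 8 setup for those wrap modes and the modulation might look like the following sketch; the stage assignments and the MIRRORONCE choice (for a symmetric falloff volume) are assumptions for illustration:

// Sketch: address modes and blending for the volumetric light map (Direct3D 8).
// Stage 0 is assumed to hold the base texture, stage 1 the 3D falloff texture.
pDevice->SetTextureStageState(1, D3DTSS_ADDRESSU, D3DTADDRESS_MIRRORONCE);
pDevice->SetTextureStageState(1, D3DTSS_ADDRESSV, D3DTADDRESS_MIRRORONCE);
pDevice->SetTextureStageState(1, D3DTSS_ADDRESSW, D3DTADDRESS_MIRRORONCE);

// Modulate the light volume with the base environment.
pDevice->SetTextureStageState(1, D3DTSS_COLOROP,   D3DTOP_MODULATE);
pDevice->SetTextureStageState(1, D3DTSS_COLORARG1, D3DTA_TEXTURE);
pDevice->SetTextureStageState(1, D3DTSS_COLORARG2, D3DTA_CURRENT);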
Extending the simple glow technique to more advanced local lighting models brings in
more complex shader operations. A simple lighting case that might spring to mind is to use this
to implement a point light. This is not really a good use of this technique, as a point light needs
to interpolate its light vector anyway to perform any of the standard lighting models. Once the
light vector is available, the shader is only a dot product and a dependent texture fetch away
from computing an arbitrary falloff function from that point. A more interesting example is to
use a more complex light shape such as a fluorescent tube. With such a shape, the light vector
cannot be interpolated between vertices, so a different method needs to be
found. Luckily, a 3D texture provides a great way to encode light vector and falloff data. The
light vector can be encoded in the RGB channels, and the attenuation can simply fit into the
alpha channel. Once the correct texel is fetched, the pixel shader must rotate the light vector
from texture space to tangent space by performing three dot products. Finally, lighting can pro-
ceed as usual.
This algorithm uses the same logic for texture coordinate generation as the attenuation
technique described earlier. The only real difference with respect to the application of the vol-
ume texture is that the light vector mentioned above is encoded in the texture in addition to the
attenuation factor. The application of the light is computed in the following shader using the
light vector encoded in the volume:
ps.1.4
texld r0.rgba, t0.xyz //sample the volume map
texld r1.rgb, t1.xyz //sample the normal map
texcrd r2.rgb, t2.xyz //pass in row 0 of a rotation
texcrd r3.rgb, t3.xyz //pass in row 1 of a rotation
texcrd r4.rgb, t4.xyz //pass in row 2 of a rotation
//rotate Light vector into tangent space
dot3 r5.r, r0_bx2, r2
dot3 r5.g, r0_bx2, r3
dot3 r5.b, r0_bx2, r4
//compute N.L and store in r0
dot3_sat r0.rgb, r5, r1_bx2
//apply the attenuation
mul r0.rgb, r0, r0.a
The only tricky portion of this shader is the rotation into tangent space. The rotation matrix is
stored in texture coordinate sets two through four. This matrix must rotate from texture space
to surface space. This matrix is generated by concatenating the matrix composed of the tan-
gent, binormal, and normal vectors (the one typically used in per-pixel lighting) with a matrix
to rotate the texture-space light vector into object space. This matrix can easily be derived from
the matrix used to transform object coordinates into texture space.
References
[Drebin88] Robert A. Drebin, Loren Carpenter, and Pat Hanrahan, "Volume Rendering,"
SIGGRAPH Proceedings, Vol. 22, No. 4, August 1988, pp. 65-74.
[Perlin85] Ken Perlin, "An Image Synthesizer," SIGGRAPH Proceedings, Vol. 19, No. 3, July
1985, pp. 287-296.
[Upstill89] Steve Upstill, The RenderMan Companion (Addison Wesley, 1989).
Truly Volumetric Effects
Martin Kraus
This article presents several examples of ps.1.4 programming for volumetric effects, ranging
from quite basic techniques to recent research results; this includes volume visualization with
transfer functions, animation of volume graphics by blending between different volume data
sets, animation of transfer functions, and a recently developed rendering technique for
real-time interactive, high-quality volume graphics, which is called pre-integrated volume
rendering.
Volumetric effects have been part of computer graphics since transparency became popu-
lar, more than 20 years ago. They include the rendering of common effects like explosions,
fire, smoke, fog, clouds, dust, colored liquids, and all other kinds of semi-transparent materials.
However, there are important differences. For example, noninteractive rendering for movie
productions usually tries to achieve the highest possible quality by simulating the physics of
volumetric effects, at least approximately. In contrast to this, graphics in computer games do
not have the privilege of noninteractivity. Therefore, they usually have to fake volumetric
effects, in particular by rendering polygons with precalculated, semi-transparent textures.
Nowadays, this necessity of faking volumetric effects is becoming questionable, as modern
graphics boards offer the required resources for real volume graphics. In fact, programmable
pixel shading offers even more than is necessary, allowing us to increase the quality of interac-
tive volume graphics and, maybe more importantly, to animate volumetric effects in ways that
have not been possible in real time before.
The Role of Volume Visualization
One of the most valuable resources for programmers of interactive, truly volumetric effects is
the literature on volume visualization. Since volume visualization is a branch of scientific data
visualization, faking volumetric effects is not an option in this field; however, the demands for
real-time interactivity and high visual quality come close to the demands in the entertainment
industry. Moreover, in recent years, many applications of volume rendering techniques have
emerged that are no longer within the strict limits of scientific visualization. All these technol-
ogies are now often subsumed under the more general term volume graphics, which is a
branch of computer graphics.
Apart from this growth of research topics, there is also a steady transition going on with
respect to the employed hardware. In fact, PCs with OpenGL-capable graphics boards have
been a common benchmark environment for algorithms in volume visualization for several
years. Consequently, researchers have also started to make use of specific hardware features of
new PC graphics boards. This transition from scientific volume visualization on high-end
graphics workstations to volume graphics on off-the-shelf PCs in combination with the similar-
ity of challenges is why recent research in volume visualization is so much more relevant for
graphics and game programming than it was a few years ago. Therefore, it is worthwhile to
take a closer look at the basic problem of volume visualization and the way researchers have
solved it with the help of programmable pixel shading.
Basic Volume Graphics
Volume visualization started with the task of generating pictures of three-dimensional scalar
fields (i.e., volumes in which a real number is assigned to each point). This problem arose in
medical imaging and many kinds of computer simulations. While one common way to view
and represent this kind of data was a stack of two-dimensional grayscale images, the best
suited representation on a modern graphics board is a three-dimensional texture with a single
component. Note that this volume texture does not hold any colors or opacities, but only a sin-
gle number per texel (or voxel in the terms of volume graphics). Although it might be
convenient to store this number in the place of the red component of a texel, this in no way
implies that all voxels have to be reddish.
If the volume texture does not specify colors, where do the colors come from? The answer
of volume visualization is the concept of a transfer function. This function maps the numbers
stored in the volume texture to colors and opacities (i.e., to red, green, blue, and alpha compo-
nents). Thus, a transfer function works very much like a one-dimensional dependent texture,
and this is in fact the way transfer functions are implemented with programmable pixel
shading.
Once we can color a single point of a volume texture, we can also render an arbitrary slice
through it by assigning appropriate texture coordinates to the corners of the slicing polygon, as
shown in Figure 1 for a volume texture, which resulted from a volume scan of a teddy bear. If
we are able to render one slice, we can easily render the whole volume by rendering a stack of
Note: Of course, there are many different algorithms for volume rendering
(e.g., ray-casting, shear-warp, splatting, cell projection, axis-aligned slicing with
two-dimensional textures, or viewplane-aligned slicing with three-dimensional
textures). While this article only covers the latter, many of the presented ideas
work exactly the same or in a similar way for other rendering algorithms.
Note: If a specific graphics hardware device does not support dependent tex-
tures or similar techniques, then the transfer function has to be applied to the
volume data in a preprocessing step and the resulting colors and opacities are
stored in an RGBA volume texture. (The technical term for this method is
preclassification.) Unfortunately, this alternative has many drawbacks. In partic-
ular, it needs four times more texture memory and does not usually permit
modifications of the transfer function in real time, a feature that is crucial for
both volume visualization and the programming of many volumetric effects.
many slices parallel to the viewplane and compositing the slices in the frame buffer with the
blending equation:
destColor = (1 - srcAlpha) * destColor + srcAlpha * srcColor
Figure 2 depicts a coarse rendering with too few slices, while an appropriate number of
slices was chosen for Figure 3. In order to demonstrate the effect of transfer functions, Figure 4
depicts the same object with identical viewing parameters but a different transfer function.
Figure 1: A textured slice through a volume texture
Figure 2: Several (but too few) composited slices
Note: In many cases, srcColor is pre-multiplied with srcAlpha, thus the
blending equation simplifies to:
destColor = (1 - srcAlpha) * destColor + srcColor
Figure 3: A volume rendering of a teddy bear
Figure 4: Same as Figure 3 with a more colorful transfer function
Here is a simple pixel shader 1.4 program that samples a volume texture, which is stored in
texture stage 0, and applies a transfer function, which is specified in texture stage 1.
ps.1.4
texld r0, t0.xyz // lookup in volume texture
phase
texld r1, r0 // lookup for transfer function
mov r0, r1 // move result to output register
Rendering a stack of viewplane-parallel slices that are textured this way is straightforward. In
fact, it is much more challenging to come up with interesting volume data sets and appropriate
transfer functions for particular volumetric effects.
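A possible host-side loop for this slicing is sketched below; the slice count, the BuildViewplaneSlice helper, and the use of DrawPrimitiveUP are assumptions made for illustration (the renderer on the companion CD is the reference implementation):

// Sketch: back-to-front compositing of viewplane-aligned slices (Direct3D 8).
// BuildViewplaneSlice() is an assumed helper that fills a quad parallel to the
// viewplane together with the 3D texture coordinates of that cut through the
// volume; rotating the volume only changes these texture coordinates.
struct SliceVertex { float x, y, z; float u, v, w; };
void BuildViewplaneSlice(int slice, int numSlices, SliceVertex quad[4]); // assumed

void RenderVolume(IDirect3DDevice8* pDevice, int numSlices)
{
    pDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    pDevice->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_SRCALPHA);
    pDevice->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA);
    pDevice->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
    pDevice->SetVertexShader(D3DFVF_XYZ | D3DFVF_TEX1 | D3DFVF_TEXCOORDSIZE3(0));

    for (int i = numSlices - 1; i >= 0; i--)   // back to front
    {
        SliceVertex quad[4];
        BuildViewplaneSlice(i, numSlices, quad);
        pDevice->DrawPrimitiveUP(D3DPT_TRIANGLEFAN, 2, quad, sizeof(SliceVertex));
    }
}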
Animation of Volume Graphics
While animations of faked volumetric effects often require too many resources, texture mem-
ory in particular, or do not offer a satisfying visual quality, there are several ways to animate
true volume graphics without additional costs.
Rotating Volume Graphics
Perhaps the most important advantage of truly volumetric graphics is the ability to rotate them
easily. In order to do this, we may, however, not rotate the geometric positions of the slices but
rather the texture coordinates. This may seem awkward at first glance, but it becomes clear
when you remember that the slices are parallel to the viewplane all the time. (See the source
code for the simple volume renderer on the book's companion CD for more details.)
Animated Transfer Functions
In some sense, transfer functions are just the three-dimensional analog of color palettes or
indexed colors for images. Many applications of animated color palettes exist in two-dimen-
sional computer graphics; similarly, the animation of transfer functions has a special
importance for volume graphics and volumetric effects in particular.
While color palettes are often animated by modifying single entries or setting a new pal-
ette, there is a different, very efficient and general way to animate transfer functions that are
implemented by dependent textures. The idea is to store a set of one-dimensional textures as
the rows of a two-dimensional texture. When accessing this two-dimensional dependent tex-
ture, we just need to set a second texture coordinate (with the same value for all fragments),
which selects a specific row of the texture and corresponds to a time parameter. We can easily
implement this idea by extending the previous code example. The volume texture is again set
Tip: A common technique to reduce the screen area of the slicing polygons
is to clip them in the space of texture coordinates, such that all fragments have
texture coordinates within the limits of the volume texture and no time is wasted
with the rasterization of superfluous fragments. This technique is demonstrated
in Figures 1 to 4.
in texture stage 0, texture stage 2 specifies the texture containing the transfer functions, and the
red component of the constant register c1 provides the time parameter.
ps.1.4
texld r0, t0.xyz // lookup in volume texture
mov r0.g, c1.r // copy time parameter
phase
texld r2, r0 // lookup for transfer functions
mov r0, r2 // move result to output register
Blending of Volume Graphics
While blending between images is definitely not a spectacular effect, blending between volume
data can lead to rather surprising and dramatic results. This is partially due to the effect of
transfer functions and can be further enhanced by blending between more than two volume
data sets, especially volume data sets with a complicated structure (e.g., noise or turbulence).
An example is given in Figures 5 to 8. Figure 5 depicts a volume rendering of the radial
distance field of a point, while Figure 6 shows a volume data set consisting of noise. Blending
between these two volumes with different pairs of weights results in Figures 7 and 8.
Figure 5: Volume rendering of a sphere
Tip: One additional advantage of this implementation is the linear interpolation
in the dependent texture lookup (i.e., the possibility to choose the time parameter
continuously). This results in a smooth blending between successive transfer func-
tions, which permits smoother animations.
Tip: As indicated in Figures 7 and 8, we cannot only blend smoothly from one
volume texture to another, but we can also distort, warp, or twist volumes by
blending with an appropriate second volume texture.
The following implementation stores four independent volume data sets in the red, green, blue,
and alpha components of an RGBA volume texture. After the texture lookup, a dot product
with a constant vector of weights is performed in order to blend between the four data sets.
Finally, we apply the transfer function and move the result to the output register. The volume
data is again associated with texture stage 0, time-dependent weights are stored in c0, and tex-
ture stage 1 specifies the transfer function.
ps.1.4
texld r0, t0.xyz // lookup in RGBA volume texture
dp4 r0, r0, c0 // blend components with c0
phase
texld r1, r0 // lookup for transfer function
mov r0, r1 // move result to output register
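The weights in c0 can then be animated per frame from the CPU; a minimal sketch, assuming a simple cross-fade between the first two data sets, is shown below:

// Sketch: per-frame update of the blend weights in pixel shader constant c0.
// timeInSeconds and pDevice are assumed to exist; here the first two data sets
// are cross-faded while the other two are ignored.
float t = 0.5f + 0.5f * sinf(timeInSeconds);
float weights[4] = { 1.0f - t, t, 0.0f, 0.0f };
pDevice->SetPixelShaderConstant(0, weights, 1);   // register c0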
Figure 6: Volume rendering of some noise
Figure 7: A blended combination of the volumes of Figures 5 and 6
Figure 8: Same as Figure 7 with an increased weight of Figure 6
High-Quality but Fast Volume Rendering
Presumably, the most important drawback of volume rendering is the rasterization
performance necessary to achieve a satisfying visual quality by rendering a sufficient number
of slices. How many slices are necessary? A first guess is to render enough slices to match the
resolution of the volume texture. This is, in fact, very acceptable for small volume textures.
Unfortunately, the application of transfer functions usually adds further fine structures to the
original volume texture, which leads to visible artifacts if the number of slices is not increased
appropriately. Thus, our dilemma is that we do not want to abandon transfer functions, but we
also do not want to render additional slices.
Fortunately, the literature on volume graphics suggests a solution to this problem, called
pre-integrated volume rendering. In the context of textured slices, this technique actually
requires programmable pixel shading. Its features include full support for transfer functions
and a much higher visual quality than the simple composition of slices, in particular for trans-
fer functions with jumps and sharp peaks. For a detailed description of this algorithm, see
[EKE2001].
Pre-integrated volume rendering reduces the number of slices by rendering slabs instead of
slices. A slab is simply the space between two successive slices. For a single fragment, the
problem reduces to the computation of the color contributed by a ray segment between the two
slices; see Figure 9. If a linear interpolation of the volume data along this ray segment is
assumed, the color will only depend on three numbers: the value on the front slice (at point F
in Figure 9), the value on the back slice (at point B), and the distance between the two slices. If
the slice distance is also assumed to be constant, the color of the ray segment will only depend
on the two values of the volume data, and a two-dimensional lookup table is already enough to
hold this dependency.
In the context of pixel shading, this table is implemented as a two-dimensional dependent
texture, which is generated in a preprocessing step by computing the ray integrals for all possi-
ble pairs of values at the start and end points F and B. (The precomputation of this integral is
the reason for the name pre-integrated volume rendering.)
Figure 9: A slab of the volume between two slices
A corresponding ps.1.4 program has to fetch the value of a volume texture on the front
slice and the value on the back slice (i.e., on the next slice in the stack of slices). Then these
two values are used as texture coordinates in a dependent texture fetch, which corresponds to
the lookup of the precalculated colors of ray segments.
In the following code example, texture stage 0 specifies the volume texture on the front
slice, while texture stage 3 specifies the volume texture on the back slice (i.e., both refer to the
same volume texture data but use different texture coordinates). The lookup table for the
precomputed ray integrals is specified in texture stage 4.
Tip: Depending on the particular data sets, the vector of weights does not nec-
essarily have to be normalized. In fact, a sum of weights unequal to 1 offers
another degree of freedom for volumetric effects. Another way of enhancing this
example is the animation of transfer functions, as discussed before.
ps.1.4
texld r0, t0.xyz // lookup for front slice
texld r3, t3.xyz // lookup for back slice
mov r0.g, r3.r // merge texture coordinates
phase
texld r4, r0 // lookup for transfer function
mov r0, r4 // move result to output register
Note that an additional mov command is necessary in the first phase in order to merge the two
slice values in register r0, which holds the texture coordinates of the dependent texture lookup
in the second phase.
The following code shows one possibility for the computation of a lookup texture for
pre-integrated volume rendering of slabs of a specified thickness. The RGBA transfer function
with premultiplied colors is stored in an array of floats called tfunc[TABLE_SIZE][4]. Note
that the color and alpha components in this transfer function specify color and opacity densities
(i.e., color and opacity per unit length). The resulting RGBA lookup table is returned in
lutable[TABLE_SIZE][TABLE_SIZE][4].
#include <math.h>    // exp, floor
#include <stdlib.h>  // abs

void createLookupTable(float thickness,
float tfunc[TABLE_SIZE][4],
float lutable[TABLE_SIZE][TABLE_SIZE][4])
{
for (int x = 0; x < TABLE_SIZE; x++)
{
for (int y = 0; y < TABLE_SIZE; y++)
{
int n = 10 + 2 * abs(x-y);
double step = thickness / n;
double dr = 0.0, dg = 0.0, db = 0.0,
dtau = 0.0;
for (int i = 0; i < n; i++)
{
double w = x + (y-x)*(double)i/n;
if ((int)(w + 1) >= TABLE_SIZE)
w = (double)(TABLE_SIZE - 1) - 0.5 / n;
double tau = step *
(tfunc[(int)w][3]*(w-floor(w))+
tfunc[(int)(w+1)][3]*(1.0-w+floor(w)));
double r = exp(-dtau)*step*(
tfunc[(int)w][0]*(w-floor(w))+
tfunc[(int)(w+1)][0]*(1.0-w+floor(w)));
double g = exp(-dtau)*step*(
tfunc[(int)w][1]*(w-floor(w))+
tfunc[(int)(w+1)][1]*(1.0-w+floor(w)));
double b = exp(-dtau)*step*(
tfunc[(int)w][2]*(w-floor(w))+
tfunc[(int)(w+1)][2]*(1.0-w+floor(w)));
dr += r;
dg += g;
db += b;
dtau += tau;
}
lutable[x][y][0] =
(dr > 1.0 ? 1.0f : (float)dr);
lutable[x][y][1] =
(dg > 1.0 ? 1.0f : (float)dg);
lutable[x][y][2] =
(db > 1.0 ? 1.0f : (float)db);
lutable[x][y][3] =
(float)(1.0 - exp(-dtau));
}
}
}
Where to Go from Here
Programmable pixel shading offers a whole world of new possibilities for volumetric effects,
and only a small fraction of them have been actually implemented and described. This article
covers an even smaller fraction of these ideas, focusing on the pixel shader programming.
Thus, there are many paths to take from here, for example:
- Play with the code! The book's companion CD contains a simple volume renderer, which
includes all the code snippets presented and discussed in this chapter. It also displays and
lets you modify the pixel shader programs.
- Design your own volumetric effects: Try to play with different volume textures and trans-
fer functions.
- Look for other applications of volume graphics. The research literature about volume visu-
alization and volume graphics is full of interesting applications (e.g., shape modeling,
terrain rendering, simulation of natural phenomena, etc.).
Tip: There are many ways to accelerate the computation of the pre-integrated
lookup table; some of them are discussed in [EKE2001]. However, for static trans-
fer functions, the best solution is probably to save this table to a file and load it
from disk when it is needed.
Acknowledgments
The author would like to thank Klaus Engel, who generated Figures 1 to 8 and programmed
earlier versions of several code samples.
References
[EKE2001] Klaus Engel, Martin Kraus, and Thomas Ertl, "High-Quality Pre-Integrated
Volume Rendering Using Hardware-Accelerated Pixel Shading," SIGGRAPH
Proceedings/Eurographics Workshop on Graphics Hardware, August 2001, pp. 9-16
(http://wwwvis.informatik.uni-stuttgart.de/~engel/pre-integrated/).
First Thoughts on Designing a Shader-Driven
Game Engine
by Steffen Bendel
When you design a game engine, you have to think about the usage of
effects before writing even one line of code. Otherwise, it is possible
that your initial engine design might be trashed by the addition of a
new effect. A single effect is easy to realize, but an engine is more than
the sum of its components. The topics in this article focus on issues
that need to be taken into consideration when designing an engine that
uses shaders.
Visualization with the Krass Game Engine
by Ingo Frick
The objective of this article is to present selected visualization tech-
niques that have been applied in a successful game engine. The
structure of these techniques is demonstrated and motivated by their
developmental history, the further requirements in the foreseeable
future, and the main design goal of the entire engine.
Designing a Vertex Shader-Driven 3D Engine
for the Quake III Format
by Bart Sekura
This article explains how vertex programs can be used to implement real
game engine algorithms, in addition to great-looking effects and demos.
Rather than showing a particular trick, my goal is to demonstrate applicability: for current
vertex shader-capable hardware, vertex programs provide easier coding paths and visible
performance benefits compared to CPU-based implementations of the same effects.
First Thoughts on
Designing a Shader-
Driven Game Engine
Steffen Bendel
When you design a game engine, you have to think about the usage of effects before writing
even one line of code. Otherwise, it is possible that your initial engine design might be trashed
by the addition of a new effect. A single effect is easy to realize, but an engine is more than the
sum of its components. An engine includes a concept and a lot of compromises. It must share
the resources in a meaningful way. Some of the effects described in this book are too expensive
to incorporate into a real-time game engine. In the case of an already existing game engine, they
might be too difficult because the overall engine design is not prepared for that. The perfect
engine does not exist, but there is a good engine for a special purpose and for a defined hard-
ware basis. Try to focus on the things that are most important for you and optimize your design
for that.
I would like to present a few thoughts on specific topics that have to be considered while
designing an engine that uses shaders.
Bump Mapping
Bump mapping is futile without real dynamic lighting. If there is no moving light source, you
see no effect and the effort is wasted. So bump mapping is not an isolated effect. It needs a
complete engine infrastructure, not only for light. Objects using bump mapping require addi-
tional tangent space vectors, which must be supported by the rendering pipeline. The enormous
effort is the reason why it is so seldom used.
Real-time Lighting
Real-time lighting is one of the most important things in engine design today. It is not a
post-rendering effect that changes the brightness of the picture; it is a calculation which con-
nects material properties and light intensity. If you decide to use real-time lighting, you have to
implement it into every shader you use. Whereas per-vertex lighting is an easy way to
implement lighting, it is very inaccurate. It works only with specifically optimized meshes
because you get artifacts with big triangles. You can't use an LOD system with per-vertex
lights, as lower LOD levels will produce artifacts because of the bigger triangles.
Precalculation is only possible for static lights. Dynamic lightmaps can be calculated only
in real time if there is a simple projection algorithm for the texture coordinates. This is espe-
cially suitable for flat landscapes. One of the big challenges is to give the user a consistent
impression of the lighting in the engine. Using different types of lighting for objects and land-
scape doesn't seem to work. Try to avoid an artificial look here. When using dynamic lights in
the engine, sometimes a very simple lighting model is used. The resulting picture of such a
simple lighting model will be too clear and too easy to analyze for the human brain, which is
confusing. To make it more dusty, a preprocessed part of the light might be a good idea. You
might use radiosity, which could be too slow, or precalculated lightmaps.
Use Detail Textures
In the real world, there are numerous details all the way down to the microscopic level. In
computer graphics, the user should not notice an end to the detail. Geometry is limited by hard-
ware polygon throughput, but for textures, multilayer detail textures are a good solution.
Approximately one texel of the highest resolution texture should be mapped onto one screen
pixel.
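A common way to layer such a detail texture on current hardware, sketched below for the Direct3D 8 fixed-function blender, is a signed add (or a 2x modulate) on a second stage; the stage assignments and variable names are assumptions for illustration:

// Sketch: two-stage detail texture setup (Direct3D 8 fixed function).
// Stage 0: base texture; stage 1: tightly tiled detail texture with its own
// texture coordinate set. pDevice, pBaseTexture, and pDetailTexture are assumed.
pDevice->SetTexture(0, pBaseTexture);
pDevice->SetTexture(1, pDetailTexture);

pDevice->SetTextureStageState(0, D3DTSS_COLOROP,   D3DTOP_MODULATE);
pDevice->SetTextureStageState(0, D3DTSS_COLORARG1, D3DTA_TEXTURE);
pDevice->SetTextureStageState(0, D3DTSS_COLORARG2, D3DTA_DIFFUSE);

// ADDSIGNED keeps the average brightness; D3DTOP_MODULATE2X is an alternative.
pDevice->SetTextureStageState(1, D3DTSS_COLOROP,   D3DTOP_ADDSIGNED);
pDevice->SetTextureStageState(1, D3DTSS_COLORARG1, D3DTA_TEXTURE);
pDevice->SetTextureStageState(1, D3DTSS_COLORARG2, D3DTA_CURRENT);
pDevice->SetTextureStageState(1, D3DTSS_TEXCOORDINDEX, 1);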
Use Anisotropic Filtering
Blurring is no longer state of the art. If hardware supports it, anisotropic texture filtering can
increase quality in some cases. Without it, the picture is either smooth or it flickers. The difference
between anisotropic and trilinear filtering is amazing.
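A minimal Direct3D 8 setup for it might look like the following sketch; checking the device caps first is the usual precaution:

// Sketch: enable anisotropic filtering on texture stage 0 (Direct3D 8).
// pDevice is an existing IDirect3DDevice8*.
D3DCAPS8 caps;
pDevice->GetDeviceCaps(&caps);
if (caps.TextureFilterCaps & D3DPTFILTERCAPS_MINFANISOTROPIC)
{
    pDevice->SetTextureStageState(0, D3DTSS_MINFILTER, D3DTEXF_ANISOTROPIC);
    pDevice->SetTextureStageState(0, D3DTSS_MIPFILTER, D3DTEXF_LINEAR);
    pDevice->SetTextureStageState(0, D3DTSS_MAXANISOTROPY, caps.MaxAnisotropy);
}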
Split Up Rendering into Independent Passes
The more complex the rendering model gets, the more important a clear structure of the engine
will get. Two elements determine the color of the object: material and light. The product of
both is the visible color. In the simplest version, this is premodulated in a single texture:
Color_i(r) = Material_i(r) * Sum_j [ Light_j(r) ] * Fog_intensity(r, r_Camera)
             + Fog_color * (1.0 - Fog_intensity(r, r_Camera)),
where r is the world position and Fog_intensity lies in the range [0, 1.0].
If you use a different algorithm to render material and light, then there are many combinations.
To reduce the count of vertex and pixel shaders, it makes sense to split the rendering into
independent passes. The speed increase by using two textures instead of one is noticeable.
The gain from more than two textures per pass is less noticeable and is not compatible with older
hardware, so it is not necessary unless the effect requires it. A separation of the mate-
rial, light, and fog part allows the combination of every material shader with any light shader.
A compact shader is, of course, the better solution if there is only one material map and one or
no lightmap, but these are a limited number of cases.
Two solutions are possible. First, the material color is accumulated in the frame buffer and
the result is blended with the light calculated by the following pass, or the light is rendered and
modulated by material. The best solution depends on your specification. The component that is
rendered second must fit into one pass.
Use _x2
Most rendering values are only in a range between zero and one. This means the light factor is
limited and the product of a color multiply is darker than the original values. To get a brighter
picture, the result of light and material blending should be scaled by a factor of 2. This is pro-
cessed by using the pixel shader _x2 modifier in the single pass solution. If the blending is
done in the frame buffer, SourceBlendingFactor is set to DestColor and DestBlendingFactor is
set to SourceColor to realize the doubled modulation.
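In Direct3D terms, the frame buffer variant boils down to the blend states sketched below; the result is srcColor*destColor + destColor*srcColor, i.e., the doubled modulation:

// Sketch: render states for the doubled modulation in the frame buffer (Direct3D 8).
// finalColor = srcColor * destColor + destColor * srcColor = 2 * srcColor * destColor
pDevice->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
pDevice->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_DESTCOLOR);
pDevice->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_SRCCOLOR);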
Visualization with the
Krass Game Engine
Ingo Frick
Introduction
The objective of this article is to present selected visualization techniques that have been
applied in a successful game engine. The structure of these techniques is demonstrated and
motivated by their developmental history, the further requirements in the foreseeable future,
and the main design goal of the entire engine.
Apart from the theoretical discussion, the practical suitability will also be demonstrated, as
all techniques were used without modification in AquaNox, the recently published game
from Massive Development GmbH in Germany.
We will present techniques which offer a good balance between the current state of tech-
nology, practical usability, and the application in a real game engine. All the techniques are
part of the krass engine, a European game engine that powers some high-end 3D games. It is
important to see the historical development of this engine, as it didn't take place in an ivory
tower with a focus only on technical details. Instead, it originated in the context of gaming
and was first used about six years ago in a game called Archimedean Dynasty. Because of its
origins, its software patterns are mainly designed to solve the problems that arise during the
creation of computer games.
General Structure of the Krass Engine
To explain the location of the rendering systems within the main structure, we will begin with a
short description of the engines architecture. The entire system offers an extensive functional-
ity covering the areas of visualization, networking, simulation, collision, mathematics, physics,
and more. The top structure consists of three separated layers. The topmost layer, the so-called
application layer, holds the game-specific code. It is intended to exchange this layer while
moving to another project. The second layer, the component layer, aggregates functionality,
which had been identified as reusable units and can be shared between different games; for
instance, here you find systems like particle systems, collision systems, animation systems, and
more. The third and lowest layer is the system layer. It offers basic system functionality like
I/O, timing, rendering, etc.
Developmental History of the Rendering
Component
The rendering component is implemented in the system and component layers. The basic
functionalities, like textures, vertex buffers, frame buffers, and depth buffers, are located in the
system layer. More specific effects, like render object with fogmap, static- and dynamic
lightmap, are located in the component layer. The best way to understand the structural archi-
tecture is to follow its historical development.
Looking at visualization techniques over the past five years, the devel-
opment of specialized hardware was the biggest change. This specialization led to higher
graphics quality and an unloading of the main CPU. This shift made a lot of reorganization
necessary. Some of these changes negatively affected the software design and the downstream produc-
tion processes. In earlier (software-driven) rendering systems, it was usually easier to design
the algorithms and data structures with flexibility and controllability in mind. Accessing data
structures like triangles, vertices, and materials using the consistent and powerful program-
ming language of the main CPU (C++, Assembler) often led to data structures that were more
object oriented than process oriented. Beyond this, the overall production efficiency was very
high. This meant that the evolution of specialized hardware had to pass through many steps to
regain these initial advantages.
Considered as a simple model, the generation of a picture with the rendering approach
requires at least the following data structures:
- Vertex
- Triangle (constructed from vertices)
- Triangle's pixel (fragment, generated from color(s), texture(s), etc.)
- Frame's pixel (composed of triangle pixel and frame pixel)
The enumeration above is not complete, but it should be sufficient for the following thoughts.
In the following table, you will find a brief overview of the chronological shift of these data
structures from software to specialized hardware. The data structures that came along with
each shift are labeled with a "D" and the respective processes are labeled with a "P." "SW-"
means software and "HW-" means hardware.
Table 1

Period 0: Software Rendering
  Vertex:         D: SW-Vertex (arbitrary layout); P: SW-Processing (programmable)
  Triangle:       D: SW-Triangle (arbitrary layout); P: SW-Processing (programmable)
  Triangle pixel: D: SW-Texture (arbitrary layout); P: SW-Processing (programmable)
  Frame pixel:    D: HW-FrameBuffer (selection); P: SW-BlendingMode (programmable)

Period 1: Glide
  Vertex:         D: SW-Vertex (predefined); P: SW-Processing (programmable)
  Triangle:       unchanged
  Triangle pixel: D: HW-Texture (selection); P: HW-RenderState (selection)
  Frame pixel:    D: HW-FrameBuffer (selection); P: HW-BlendingMode (selection)

Period 2: Hardware (T/T&L)
  Vertex:         D: HW-VertexBuffer (selection / FVF); P: HW-Processing (selection (T/T&L))
  Triangle:       D: HW-IndexBuffer (selection); P: HW-Processing (selection)
  Triangle pixel: unchanged
  Frame pixel:    unchanged

Period 3: Vertex Shader
  Vertex:         D: HW-VertexStream (selection); P: HW-VertexShader (programmable)
  Triangle:       unchanged
  Triangle pixel: unchanged
  Frame pixel:    unchanged

Period 4: Pixel Shader
  Vertex:         unchanged
  Triangle:       unchanged
  Triangle pixel: D: HW-Texture+Data (selection); P: HW-PixelShader (programmable)
  Frame pixel:    unchanged

Period 5: Pixel Shader++
  Vertex:         unchanged
  Triangle:       unchanged
  Triangle pixel: D: HW-Texture+Data (arbitrary layout); P: HW-PixelShader (programmable)
  Frame pixel:    D: HW-FrameBuffer (arbitrary layout); P: HW-PixelShader (programmable)
Comparing period 0 (pure software rendering) with period 5 (maximum hardware support)
shows a nearly equivalent flexibility for the individual organization and interpretation of the
data structures and processes. This closes the evolution circle mentioned above.
Previous Drawbacks of Hardware Development
The main objective of visualization in computer games is the creation of individual effects.
They are the unique selling points over the competitors and show a definite influence on the
success of a product. Restrictions caused by the current hardware technology normally result in
a drastic reduction of the possible amount of individual effects. This consequence could clearly
be noticed right after the introduction of graphics hardware with T&L capabilities. Though the
game developer could configure the T&L pipeline for the standard cases, every aberrant fea-
ture caused serious problems. The Character Skinning problem, for instance, is well suited to
demonstrate this effect. From today's point of view, the approach to solve this problem using
the standard path (e.g., transformation matrix palettes) seems curious. The resulting lack of
individuality caused all 3D games of this period to look nearly the same.
Current Drawbacks
With the return of flexibility (e.g., vertex and pixel shaders), this problem can be avoided. The
variety of visual effects is again determined by the imagination of the developer instead of the
hardware capabilities. Instead, now you have to face another problem, which unfortunately
gets overlooked many times. The programming of individual vertex and pixel shader effects
only leads to a local flexibility (assembly-syntax, unrestricted data layout). In a larger context,
you incur a global inflexibility, as we have gone back to the development of effects. Effects very
often tend to interact with each other, which makes it difficult to recombine simple effects to
more complex ones. Take a simple vertex shader implementing some effects like skinning and
mapping texture coordinates and try to combine it with a second vertex shader doing the illu-
mination. This very common and basic combination becomes difficult.
Consequently, you have to implement the skinning shader for each number of possible
light sources. This example seems to be more of a technical shortcoming of the current hard-
ware technology, but it demonstrates the general problem.
Ordering Effects in the Krass Engine
Modeling the rendering process through an accumulation of a sufficient amount of effects
seems to be a reasonable method. It disengages the rendering techniques from using
parameterized standard methods and enters the terrain of individually designed rendering
effects. These thoughts also stood at the beginning of the current krass revision. The next ques-
tion is how to define and associate the effects to the overall structure.
Considering commercial and traditional rendering systems for the non-real-time range of
use (3DSMax, Maya), you will usually not find effects in terms of our definition. In fact, these
applications use standard rendering methods on a per material basis and combine these with a
multipass approach to emulate quasi-effects. Apart from that, some effect-oriented concepts
have been realized, such as the shading concept of the RenderMan systems. Here as well, the
visualization is done by the individual programming of a pixel effect (material) rather than
by the combination of generic texture parameter sets.
In our case (real time, hardware accelerated), it makes no sense to take something like a
material as the only criterion of organization. In most of the cases, we don't have a material in
the common sense (e.g., terrain rendering) or the effect is determined more by a geometric
characteristic (e.g., plankton, skinning, morphing).
In our latest game, AquaNox, it turned out to be very useful to order the effects directly by
the different game entities. Each effect is created by some subeffects combined in a very spe-
cial way to ensure flexibility and recombination. The effects themselves are labeled as render
pipeline (RP) because they can be interpreted as a specialized encapsulation of the global ren-
der pipeline for a single effect.
The following table contains a short description of some render pipelines together with a
qualitative description of the effect and the related main elements:
Table 2
Render Pipeline: Building (RP for all fixed objects like buildings)
Elements of Effect:
- Caustic texture (animated)
- Light texture (prelighted)
- Dynamic lighting
- Detail texture
- Diffuse texture
- Fog texture
Render Pipeline: Terrain (RP for generating the terrain visualization from given level-of-detail (LOD) geometry)
Elements of Effect:
- Caustic texture (animated)
- Light texture (prelighted)
- Dynamic lighting
- Detail texture
- One to six material textures
- One material selection texture
- Fog texture
Render Pipeline: Plankton (RP for translation, rotation, wrapping, and visualization of individual particles)
Elements of Effect:
- Diffuse texture
- Fog texture
- Translation with vertex shader
- Rotation with vertex shader
- Wrapping in view space with vertex shader
At that point, it is appropriate to return to the above-mentioned problem of recombining effects
with one another (light source difficulty). To solve this problem in an elegant way, we only
have to unify the advantages of both approaches (recombination through multipass rendering
and individual effect programming for each pass). Reconsidering the AquaNox render pipelines
reveals an interesting strategy. Instead of combining the individual render passes in one
inflexible vertex and/or pixel shader, it makes sense to assign the individual passes to larger
function blocks.
We identified the following blocks:
- Illumination (lightmap, caustic map, vertex-/pixel light, bump map)
- Material (diffuse map, detail map)
- Global effects (fog map, environment map)
That settles the matter for the larger group of effects. Creating the three function blocks illu-
mination, material, and global effects seems to be a sensible concept, which has been dem-
onstrated in the krass engine and AquaNox for the first time. (In the following, this approach
will be abbreviated as IMG = Illumination, Material, Global Effects.) This reorganization gives
you the opportunity to implement each individual pass with the highest flexibility and also to
recombine these blocks with each other. In this context, it makes sense to reintroduce the
multipass concept. So far, this concept has mainly been used as an emergency solution in the
case of an insufficient number of texture stages.
If you assume that the frame buffer stores arbitrary values instead of traditional pixels, you
can interpret the IMG technique as calling subroutines for specialized tasks (illumination,
material generation, global effects). If it was possible to call (non-trivial) subroutines within
the vertex and pixel shaders, the multipass IMG system could easily be collapsed to a single
pass concept.
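To make the idea concrete, here is a minimal C++ sketch of how the three IMG blocks could be sequenced as separate passes over the frame buffer. The types and names are illustrative only, not krass engine code.

// Minimal sketch of the IMG idea: each function block is rendered as its own
// pass and the passes are combined in the frame buffer.
struct render_block_t
{
    virtual ~render_block_t() {}
    virtual void render() = 0;        // issues the draw calls for this block
};

void render_img(render_block_t& illumination,
                render_block_t& material,
                render_block_t& global_effects)
{
    illumination.render();            // pass 1: sum of all light sources
    material.render();                // pass 2: modulate with material colors
    global_effects.render();          // pass 3: fog map, environment map, ...
}

In practice, each block would also configure its own frame buffer blend mode, e.g., the material pass multiplying with the accumulated illumination.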
Application of the IMG Concept for Terrain
Rendering
The proposed function block organization particularly makes sense, as the first block (illumi-
nation) can be constructed from independent individual passes. This block finally results in the
summation of all lights. As shadow is the absence of light, even shadow passes fall within this
block. In the AquaNox terrain system, the summation of all lights originates in the following
sources:
1. Illumination Stage (I)
- Light texture (prelighted)
- Caustic texture (animated)
- Dynamic lights (rendertarget)
The next stage (material) is composed of a predefined number of materials. Each material is
assembled from a material texture, a fine detail texture, and a coarser color texture.
2. Material Stage (M)
- Material texture (sand)
- Color texture (color)
- Detail texture (sand crystals)
The last stage (global) handles the fog which is realized by using a procedurally generated and
dynamically mapped fog texture.
3. Global Stage (G)
- Fog texture (procedural, volumetric fog)
Figure 1 shows an overview of the entire process. The three frames show the clearly separated
IMG stages. The interconnections between the different stages are indicated by the arrows. The
combining operations are displayed as well.
We can now go into detail regarding the global stage. AquaNox here uses a volumetric fog
system which allows us to define multiple planes of constant height with real fog density.
Interestingly enough, this feature is not only a visual effect but also offers a gameplay-relevant
element (e.g., hiding). We do not generate the fog itself with the graphics hardware in the stan-
dard manner (vertex and pixel fog), but instead we use the vertex shader to map a real-time
generated texture onto the geometry (in view space). This method doesn't show the annoying
artifacts that normally accompany the use of standard fog in combination with transparent tex-
tures. It also avoids the appearance of the popping effects that always happen when
vertex-related calculations are carried out on the basis of a real-time LOD geometry system.
The 3D texture is generated in a preprocessing step. The generator solves the equation of light
transport for a predefined parameter range through a numerical integration. The integrated val-
ues for fog density (a) and fog color (r,g,b) are stored in the 3D texture array. The texture
mapping itself corresponds more to a parameterized solution lookup of the equation of light
transport.
Figure 1 (overview of the IMG process)
Figure 2 (extraction of the fog texture slice)
Unfortunately, the major part of the presently installed customer hardware base does not sup-
port hardware-accelerated volume textures. To meet this challenge, we had to think of a small
work-around: We put the texture data set into the CPU main memory and extract one fog tex-
ture slice for each rendered frame. This slice is uploaded to the hardware and used as a
standard 2D texture. The restrictive condition for this method is to find one fixed texture coor-
dinate for the entire frame, which could be achieved by a well-defined texture parameterization.
In the model we use here, the fog parameters are determined by the following values:
- Height of the observer (origin of camera in world space)
- Height difference between vertex and observer
- Distance between camera and vertex (linear norm is sufficient)
Together with the information about the maximum distance (far plane) and the maximum
height and depth of the covered game world, it becomes possible to map the three values to the
3D texture coordinates (u,v,w). Because the height of the observer remains constant for the
entire frame, this parameter is used as the fixed coordinate (w). The texture slice (u,v) is
extracted from the 3D data array exactly at this coordinate (linear interpolation). Figure 2 dem-
onstrates this process.
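As a complement to the figure, the following is a minimal C++ sketch (not the shipped code) of the slice extraction: it picks the two slices of the precomputed 3D fog table nearest to the observer height and blends them linearly into one 2D slice, which is then uploaded as a standard texture. The layout assumption (width*height*depth RGBA texels in main memory) and all names are illustrative.

// Extract one 2D fog slice from the 3D fog table for the current frame.
void extract_fog_slice(const unsigned char* fog_table,
                       int width, int height, int depth,
                       float observer_height,
                       float min_height, float max_height,
                       unsigned char* slice)        // width*height*4 bytes out
{
    // Map the observer height to the fixed w coordinate in [0, 1].
    float w = (observer_height - min_height) / (max_height - min_height);
    if (w < 0.0f) w = 0.0f;
    if (w > 1.0f) w = 1.0f;

    float fw = w * (float)(depth - 1);
    int   w0 = (int)fw;
    int   w1 = (w0 + 1 < depth) ? w0 + 1 : w0;
    float t  = fw - (float)w0;

    // Linear interpolation between the two nearest slices.
    const int slice_size = width * height * 4;
    const unsigned char* s0 = fog_table + w0 * slice_size;
    const unsigned char* s1 = fog_table + w1 * slice_size;
    for (int i = 0; i < slice_size; ++i)
        slice[i] = (unsigned char)(s0[i] + t * ((float)s1[i] - (float)s0[i]));
}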
Particle Rendering to Exemplify a Specialized
Effect Shader
Beside the shaders operating according to the introduced stage concept, we have developed
other shaders for AquaNox, which show the character of an effect shader. As an example for
these shaders, we will now take a look at the so-called particle shader: Its render pipeline
allows you to translate, rotate, wrap, and display a dense field of individual polygons in the
near range of the observer. At first glance, this sounds similar to the hardware-accelerated
point-primitive. The point-primitive had been invented to render huge amounts of quads with
an identical texture map. Up to now, they still show some serious shortcomings which chal-
lenge their use; for example, they are restricted in their maximum size, and they cannot be
rotated. These problems have been solved with the particle shader. We also additionally satis-
fied the following requirements:
- Minimum vertex and pixel demand for an individual particle
- Translation of an individual particle with graphics hardware
- Rotation of an individual particle with graphics hardware
- 3D position wrapping with graphics hardware (classic star scroller)
In order to minimize the area (pixel demand) and the number of vertices (vertex demand) of an
individual particle, we build up each particle from a single triangle. This avoids a second trian-
gle setup for the graphics hardware, and the amount of rasterized pixels is reduced (depending on
the particle shape in the texture). The vertex shader also has to process only three vertices per particle
instead of four. This is important because the vertex shader turned out to be the bottleneck
(complex vertex program).
The individual translation and rotation could be achieved by a sophisticated vertex stream
data layout. In contrast to the standard data, each vertex holds not only the position of the
triangle's center, but also its relative position with regard to this center. The individual
rotation is controlled by a rotational velocity of this relative position around the center.
The wrapping functionality has been realized with the help of the expp vertex shader
instruction. For this purpose, we first carried out the translation in a unit particle cube (0 to
1) and later eliminated the integer part of the position with the instruction mentioned
above. In the subsequent step, the unit cube particle position is transformed into the world
space coordinate system and the rotated relative vertex position is added. The correct
location (origin, dimension) of the unit cube in world space (required to create the
cube-to-world transformation) is illustrated in Figure 3.
The center of the cube (C) is determined
by adding the camera position to the viewing
direction vector which is normalized to a half
cube edge length in world space (D). The
entire vertex shader process is demonstrated in Figure 4.
Operation                            Space
Translation (P_ABS + V_ABS * t)      Unit Space
Wrapping (fmodf(P, 1.0f))            Unit Space
Transform (Unit -> World -> View)    View Space
Attenuation (~D)                     View Space
Rotate (P_ABS + Va * t * P_REL)      View Space
Transform (View -> Clip)             Clip Space
Figure 4
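For clarity, here is a CPU reference sketch of the per-vertex steps listed in Figure 4, simplified to a Z-axis rotation and a direct unit-cube-to-world transform. All types and names are illustrative; the real work is done in the vertex shader.

#include <math.h>

struct vec3 { float x, y, z; };

static vec3 v_add(vec3 a, vec3 b)   { vec3 r = { a.x+b.x, a.y+b.y, a.z+b.z }; return r; }
static vec3 v_scale(vec3 a, float s){ vec3 r = { a.x*s, a.y*s, a.z*s }; return r; }
static vec3 v_fract(vec3 a)         // the fractional part, which expp provides on the GPU
{ vec3 r = { a.x-floorf(a.x), a.y-floorf(a.y), a.z-floorf(a.z) }; return r; }

// Example rotation of the relative corner offset around the particle center.
static vec3 rotate_z(vec3 p, float angle)
{
    float c = cosf(angle), s = sinf(angle);
    vec3 r = { c*p.x - s*p.y, s*p.x + c*p.y, p.z };
    return r;
}

vec3 particle_vertex(vec3 center_unit,       // particle center inside the unit cube
                     vec3 velocity,          // per-particle velocity
                     vec3 corner_rel,        // relative corner position (P_REL)
                     float angular_velocity, // rotational velocity (Va)
                     float t,                // current time
                     vec3 cube_origin_world, float cube_edge_world)
{
    vec3 p = v_add(center_unit, v_scale(velocity, t)); // 1. translation
    p = v_fract(p);                                    // 2. wrapping
    vec3 world = v_add(cube_origin_world,              // 3. unit cube -> world
                       v_scale(p, cube_edge_world));
    return v_add(world,                                // 4. add rotated corner
                 rotate_z(corner_rel, angular_velocity * t));
}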
Summary
In this article, we presented selected visualization techniques that we have actually used in a
shipped game engine. The structure of these techniques is demonstrated by their developmental
history. We showed that the techniques passed the real-world game test successfully with
AquaNox. The mixture of procedural and classic texture generation (fog), as well as the proce-
dural geometry handling (particle), proved to be adequate procedures. Introducing the IMG
concept offers an interesting alternative for reducing the huge number of shader permutations
(>100) that game developers need for their games.
Figure 3 (location of the unit particle cube in world space)
Concluding Remarks and Outlook
The techniques presented here only demonstrate the current state of development. Regarding
the foreseeable developments in hardware evolution, you can deduce tendencies which will
become important in the near future:
- The pixel shader will advance to be the most important instrument for rendering effects.
- The complexity of the vertex and pixel shader languages will increase, and new instructions will be added (jumps, subroutine calls).
- The size of programs, as well as the related data, will increase.
- The hardware-internal data formats will develop into a complete 32-bit floating-point representation.
- The classic pixel will only be used for input/output purposes.
Eventually, if you compare the graphics hardware processors (GPU) with the normal main pro-
cessors (CPU), the differences will appear rather small. Just like the main processors, the
graphics processors will need continually increasing assembly programs. This development
will finally force people to introduce a higher level programming language. Perhaps one of the
already existing higher level languages can be used, which might also be the one you used to
write the rest of the application.
So we're back to square one. The only difference lies within ourselves: We have gained a
much more comprehensive rendering knowledge, and we can look forward to the next hard-
ware evolution in the realm of the remaining global rendering techniques, such as raytracing or
radiosity.
Designing a Vertex
Shader-Driven 3D Engine
for the Quake III Format
Bart Sekura
Some time ago, I developed a Quake III Arena level viewer for the purpose of trying out vari-
ous algorithms and techniques using an industry standard engine format and an abundance of
tools and shading techniques. When DirectX 8.0 came out, I immediately decided to port my
viewer to this API and in particular, take advantage of new programmable pipelines available
for developers. The result was a Quake III Arena level viewer that runs the various effects
described in a shader script as vertex programs, and thus in hardware on vertex shader-capable
cards. In this article, I will present the vertex program translation process, which I think is
also very useful outside of the Quake level viewer context, as it introduces some nice tricks
and effects that can be implemented in vertex programs. The reader can find the Quake III
Arena viewer and its full source code on the companion CD. Please consult the viewer docu-
mentation included on the CD for instructions on how to run and compile it.
I have chosen this particular platform to demonstrate how vertex programs can be used to
implement real game engine algorithms, in addition to great looking effects and demos. Rather
than showing this or that particular trick, my goal is to demonstrate the practical applicability
of vertex programs, which, on current vertex shader-capable hardware, provide an easier coding
path and visible performance benefits compared to a CPU-based implementation of the same
effects.
Quake III Arena Shaders
Quake III Arena allows level designers to create shader scripts for a specific brush/surface dur-
ing the mapping process. This shader is then used by the renderer to draw the faces associated with
it. The shader in Quake III Arena is a text script that describes how the surface should be ren-
dered and is preprocessed by the game engine upon startup. In general, the shader scripts can
be complex and can contain information not relevant for the rendering process (such as hints or
commands for the level editor). A description of the full capabilities of the shader system is
beyond the scope of this article. Useful documentation is provided by the game developer, id
Software, and is included on the companion CD with this book.
Basically, at the beginning of the shader script, general parameters are provided, such as
culling or geometry deformation functions. Consecutive layers are then specified, where each
layer is usually a texture, along with its blending function and various parameters, such as:
- Texture coordinates function (e.g., a deformation function or a specific type of coordinate generation, such as environment mapping)
- Diffuse color and alpha source: vertex, periodic function, or fixed
- Depth and alpha test parameters
We will focus on some of the most common shader effects that can be easily translated to ver-
tex programs. First, the description of the shader setup in the viewer is presented to give the
reader an idea of how things are organized. Next, we will follow the translation process and
look at some specific shader effects as they are translated into vertex programs. Finally, we
will put it all together and see the whole rendering process.
To minimize confusion, I will refer to Quake III Arena shader scripts as "Quake III
shaders" and to DirectX 8.1 vertex shaders as simply "vertex shaders."
Vertex Program Setup in the Viewer
Quake III Arena stores the game data in a package file, which is basically a zipped archive
containing all art content as well as Quake III shaders. The collection of shader scripts resides
in the scripts subdirectory inside the package and each of the text files usually contains many
Quake III shaders.
When the viewer initializes, it scans the package file for any Quake III shader scripts and
parses them into the internal format, represented as C++ classes, which contain shader parame-
ters and a collection of passes, each with specific material properties. Below is a C++
representation of Quake III shaders (reformatted for clarity):
struct shader_t
{
// shader pass definition
struct pass_t
{
pass_t() : flags(0), map_ref(0), vsh_ref(0), tcgen(0) {}
int flags; // misc flags
std::string map; // texture name applied for this pass
int map_ref; // internal renderer reference to texture
int vsh_ref; // internal vertex shader reference
int blend_src; // blend operation (source)
int blend_dst; // blend operation (destination)
int alpha_func; // alpha test function
int alpha_ref; // alpha test reference value
int depth_func; // depth test function
int tcgen; // texture generation type
// texture coordinate modification functions
std::vector<tcmod_t> tcmods;
// concatenated matrix of texture coordinate modifications
matrix_t tcmod_mat;
// color (RGB) generation function
struct rgbgen_t {
rgbgen_t() : type(0), color(vec4_t(1,1,1,1)) {}
int type;
wave_func_t func;
vec4_t color;
}
rgbgen;
// animated textures collection
struct anim_t {
// helper to determine current texture from the collection
// based on current time and FPS settings
int frame(double time) const {
int i = (int)(time*fps)%maps.size();
return map_refs[i];
}
float fps;                      // animation speed
std::vector<std::string> maps;  // textures collection
std::vector<int> map_refs;      // renderer refs
}
anim;
};
// the shader itself
public:
shader_t() : flags(0), type(0), stride(0) {}
std::string name; // name of the shader
std::vector<pass_t> passes; // collection of passes
int stride; // stride in vertex stream
int flags;
int type;
int vertex_base;
int vertex_count;
// vertex deformation function
struct deform_t {
float div;
wave_func_t func;
}
deform;
// parsing function translates text script representation
// into this class internal representation suitable for
// rendering pipeline
static void parse(const byte* data,
int size,
std::vector<shader_t>& shaders);
};
This C++ object is created per the Quake III shader and used during the rendering phase. It
makes it easy to traverse the shader passes and apply various parameters that directly influence
the rendering process. One additional thing that is performed during the setup phase is setting
up all the textures used by Quake III shaders (loading them up and uploading to the card) and
translating the various passes into corresponding vertex shaders. I will describe the latter pro-
cess in greater detail now.
Many of the surfaces inside the game level reference Quake III shaders that don't exist in
the form of a script. This is just a handy shortcut for default Quake III shaders that use only a
base texture and lightmap with no additional rendering passes or special parameters. Whenever
a Quake III shader that is referenced from within the level is not found among the parsed
scripts, the viewer will create a default Quake III shader with two passes: a default texture,
looked up by the shader's name, and a lightmap.
To facilitate the setup code as well as provide easier and more optimized code paths for
rendering, I divided the shaders into the following categories:
Table 1: Shader categories
SHADER_SIMPLE: This is the default Quake III shader that uses a base texture, a lightmap texture, and constant diffuse. There is no vertex or texture coordinate deformation, and diffuse color is not coming from vertex stream data. This is by far the majority of shaders, also known as default shaders, automatically created when there is no explicit script code. Primarily, it is used for basic world geometry.
SHADER_DIFFUSE: This category is very similar to the above shader type, the difference being that diffuse color is coming from the vertex stream. It also uses a base texture and lightmap. It is used primarily for mesh models embedded in world geometry.
SHADER_CUSTOM: All non-standard shaders fall into this category, which represents arbitrarily complex shaders with one or many passes, various vertex and texture coordinate deformation functions, etc.
The category that the Quake III shader belongs to is stored in a type member variable of
shader_t class. The default vertex shaders, SHADER_SIMPLE and SHADER_DIFFUSE, are
set up in the world_t::upload_default_vertex_shaders() method. These never change and are
uploaded when the viewer starts. In addition, I have added one more default vertex shader,
which is not a result of Quake III shader script translation process but is set up to facilitate all
screen space rendering operations (heads-up display, 2D window interface, etc.) and is repre-
sented as SHADER_FONT type. This shader uses one texture and expects the coordinates in
the normalized device range (-1.0 to +1.0). The primary use of this shader can be found in the font_t
class.
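A hypothetical sketch of how these categories might be declared in the viewer source (the actual declaration may differ):

enum shader_category_t
{
    SHADER_SIMPLE,   // base texture + lightmap, constant diffuse (default shaders)
    SHADER_DIFFUSE,  // like SHADER_SIMPLE, but diffuse color from the vertex stream
    SHADER_CUSTOM,   // arbitrary multi-pass shaders translated from script
    SHADER_FONT      // screen-space rendering (HUD, 2D interface, fonts)
};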
The Quake III shaders that fall into the SHADER_SIMPLE, SHADER_DIFFUSE, and
SHADER_FONT categories are hard-coded, since they never change. All SHADER_CUS-
TOM shaders are dynamically constructed depending on the number of passes, their
parameters, etc. This process is constructed from consecutive steps that result in the final vertex
shader source code, which is then compiled by the DirectX runtime and uploaded to the card. The
basic steps are as follows:
Table 2: Constructing custom shaders
Vertex coordinates: These are either set up as coming unchanged from the vertex stream or as the result of a deformation function. In the second case, the vertex deformation function, sine, is evaluated using a Taylor series.
Vertex diffuse color: The vertex color can be constant, come from the vertex stream, or be evaluated as the result of a periodic function. In any case, the actual evaluation happens on the CPU, and the vertex program code merely loads the appropriate value from constant memory or vertex stream data.
Texture coordinates: Texture coordinates either come unchanged from the vertex stream or are computed using a simplified environment mapping function. The texture unit used depends on whether we're dealing with the base texture or the lightmap.
Texture coordinate modification: Texture coordinates computed in the previous step can optionally be changed, depending on shader pass parameters. If that is the case, they are multiplied by a concatenated texture coordinate modification matrix in the vertex shader. The matrix is prepared by the CPU.
As you can see, various Quake III shader script effects are translated into the corresponding
vertex shader source. As it turns out, the final vertex shader code is very often identical for
many different Quake III shader scripts, and it would be a waste of resources to upload identi-
cal vertex shaders many times. I have employed a simple string comparison mechanism to
prevent this. Whenever a vertex shader is generated from a Quake III shader definition, after
it is compiled by the DirectX run-time, its source code is put into a map to allow for easy lookup
later on. The source code (a string) becomes the key for the collection, and the value is the ID of
the already compiled vertex shader, as it was returned by the DirectX run-time after successful
compilation. Then, whenever a vertex shader source is generated and it is time to compile it,
the map is first consulted to see whether an identical vertex shader was previously compiled.
If it was, the same ID is reused; otherwise, the new shader is compiled and put into the
source code map.
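A minimal sketch of such a cache, assuming a hypothetical compile_vertex_shader() helper that assembles the source and calls IDirect3DDevice8::CreateVertexShader(); the global and function names are illustrative.

#include <map>
#include <string>
#include <windows.h>   // for DWORD (DirectX shader handles)

std::map<std::string, DWORD> g_vertex_shader_cache;

// Hypothetical helper: assembles the source and returns the compiled handle.
DWORD compile_vertex_shader(const std::string& source);

DWORD get_or_create_vertex_shader(const std::string& source)
{
    std::map<std::string, DWORD>::const_iterator it =
        g_vertex_shader_cache.find(source);
    if (it != g_vertex_shader_cache.end())
        return it->second;                       // reuse the compiled shader
    DWORD handle = compile_vertex_shader(source);
    g_vertex_shader_cache[source] = handle;      // the source string is the key
    return handle;
}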
Vertex Shader Effects
We will now look in more detail at some of the effects that can be created.
Deformable Geometry
One of the simple things that you can do in a Quake III shader is deformable geometry.
Basically, you can specify a periodic function such as sin() and provide the parameters, and the
geometry is then deformed during the rendering process and parameterized by time. This is, for
example, how all those waving banners are done in Quake levels. It would be great to do that
in a vertex shader, especially since it involves recomputing the vertex position, something
we'd rather avoid. So we need to evaluate the sine function from within the vertex shader.
There is no direct way to do this; that is, there is no sin instruction similar to the C library
function available from within a vertex shader. Therefore, we have to simulate one. This
becomes possible thanks to the Taylor series, which allows us to generate approximate sine val-
ues using available vertex shader instructions.
The theory behind the Taylor series is beyond the scope of this article. What is interesting to
us is how to generate approximate sine values using the Taylor series. Here are the approximation
formulas:
sin(x) = x - (x^3)/3! + (x^5)/5! - (x^7)/7! + ...
sin(x) ~= x*(1 - (x^2)*(1/3! - (x^2)*(1/5! - (x^2)/7!)))
To compute the sine from within the vertex shader using the Taylor series, we need to set up some magic
numbers in vertex shader constant memory. These numbers are:
1.0, -1/3!, 1/5!, -1/7!
Additionally, since the Taylor series will only work accurately within the input range of [-PI, PI],
we need to prepare our input variable accordingly. This is done in the following way:
// r0.x - x
// r1.y - temp
expp r1.y, r0.x
// r1.y=x-floor(x)
mad r0.x, r1.y, c[12].y, -c[12].x
// r0.x=x*2PI-PI
As you can see, we also need to add 2*PI and PI to the vertex shader constant memory. During transla-
tion of the shader scripts into a vertex shader, we generate vertex shader code like the following:
if(shader.flags&SHF_DEFORM)
{
// vertex deformation code
vsh << "dp3 r1.x, v0, c15.w\n"
<< "add r0.x, r1.x, c14.y\n" // r0.x=off+phase
<< "add r0.x, r0.x, c15.x\n" // r0.x=off+phase+time
<< "mul r0.x, r0.x, c14.w\n" // r0.x=(off+phase+time)*freq
<< "expp r1.y, r0.x\n" // r1.y=x-floor(x)
//r0.x=x*TWO_PI-PI
<< "mad r0.x, r1.y, c12.y, -c12.x\n"
// fast sin
// r0.x=sin(r0.x)
<< "dst r2.xy, r0.x, r0.x\n"
<< "mul r2.z, r2.y, r2.y\n"
<< "mul r2.w, r2.y, r2.z\n"
<< "mul r0, r2, r0.x\n"
<< "dp4 r0.x, r0, c13\n"
// r1=y*amp+base
<< "mad r1.xyz, r0.xxx, c14.zzz, c14.xxx\n"
<< "mov r0,v0\n"
<< "mad r0.xyz, v1.xyz, r1.xyz, r0.xyz\n"
<< "mov r0.w, c10.x\n" // w=1.0f
// transform to clip space
<< "dp4 oPos.x,r0,c0\n"
<< "dp4 oPos.y,r0,c1\n"
<< "dp4 oPos.z,r0,c2\n"
<< "dp4 oPos.w,r0,c3\n";
}
This vertex shader section takes the vertex position in world space and deforms it over time using the
sine function. All parameters (particularly the current time) are passed in using vertex shader con-
stant memory. The shader computes a new vertex world position and then transforms it to clip
space.
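For reference, here is a small C++ sketch (not code from the viewer) that mirrors what the generated instructions compute: wrap the argument to the [-PI, PI) range via its fractional part, then evaluate the Taylor polynomial with the same coefficients stored in constant memory.

#include <math.h>

float fast_sin(float x)
{
    const float PI     = 3.14159265f;
    const float TWO_PI = 6.28318530f;
    float frac = x - floorf(x);       // expp delivers this fractional part
    float t  = frac * TWO_PI - PI;    // map [0, 1) to [-PI, PI)
    float t2 = t * t;
    // sin(t) ~= t*(1 - t^2*(1/3! - t^2*(1/5! - t^2/7!)))
    return t * (1.0f - t2 * (1.0f/6.0f - t2 * (1.0f/120.0f - t2/5040.0f)));
}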
Texture Coordinate Generation
One particular effect that is often used is environment mapping. The exact way this is done in
the Quake III Arena engine is, of course, not public knowledge, so the effect we will show here
is an approximation. It has to be noted that even the Q3A engine does not implement real envi-
ronment mapping (e.g., using cube maps), but rather a crude approximation that uses a skillfully
prepared texture and generates texture coordinates based on the viewing angle. This simple
effect is pretty nice and especially useful for implementing all kinds of chrome effects.
Our approximation works like this: Based on the viewing position, referred to as eye,
and the vertex position in world space, we compute the reflection vector; based on the result,
we generate texture coordinates dynamically. This gives the effect of shiny or chrome-like sur-
faces, which change along with the position of the player and the viewing angle.
The C++ code to compute the texture coordinates for our fake environment mapping is
something like this:
vec3_t dir = cam->eye-vec3_t(vertex_pos);
dir.normalize();
st[0]=dir[0]+normal[0];
st[1]=dir[1]+normal[1];
We basically take the direction vector from the vertex to the eye (camera position) in world
space, normalize it, and generate the texture coordinates by adding the normal of the surface to
the direction vector.
In the vertex shader setup phase, we process the custom shader definition and attempt to
generate the corresponding fragment of vertex shader code. In case of the environment map-
ping texture coordinate generation, we construct the code in the following way:
if((pass.flags & SPF_TCGEN) && pass.tcgen == SP_TCGEN_ENVIRONMENT)
{
vsh << "add r0, r0, -c11\n" // r0=vertex-eye
<< "dp3 r0.w,r0,r0\n" // normalize(r0)
<< "rsq r0.w,r0.w\n"
<< "mul r0,r0,r0.w\n" // r0=normalize(vertex-eye)
<< "add r1.x,r0.x,v1.x\n"
<< "add r1.y,r0.y,v1.y\n"
<< "mov r1.zw, c10.zw\n";
}
Note that this can be easily extended to support full environment mapping using cubemaps on
hardware that supports it. We would need to compute the reflection vector from within the ver-
tex shader and use it to generate the texture coordinates for the cubemap.
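As a sketch of that extension (illustrative types, not viewer code), the reflection vector could be computed as R = I - 2*(N.I)*N, with I the normalized view direction (from the eye toward the vertex) and N the surface normal:

struct vec3 { float x, y, z; };

vec3 reflect(const vec3& I, const vec3& N)
{
    // R = I - 2*(N.I)*N
    float d = 2.0f * (I.x * N.x + I.y * N.y + I.z * N.z);
    vec3 r = { I.x - d * N.x, I.y - d * N.y, I.z - d * N.z };
    return r;
}

In the vertex shader, the three components of R would then be written to a texture coordinate register (e.g., oT0.xyz) and used to index the cube map.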
Texture Matrix Manipulation
The texture coordinates can be deformed to produce nice-looking effects. With the Quake
shader script, one can specify multiple texture coordinate modification steps, each applied on
top of another. These modification functions are usually time-based periodic functions with
various parameters controlling amplitude, frequency, etc. In our viewer, during the rendering
phase, we evaluate all these functions based on the current time value, producing a single
matrix per step. So if, for example, we had two modification steps specified in a shader script,
one that does the scaling and another that rotates the texture, we would end up with two matri-
ces for each step: a scaling matrix and a rotation matrix. These matrices are then concatenated
to produce one texture coordinate deformation matrix, which is then loaded into the vertex
shader. The actual modification of texture coordinates takes place inside the vertex program,
when we simply multiply original texture coordinates by the tcmod matrix, thus producing the
final effect.
Let's look at this in more detail. The following function is called once per frame for each shader
pass that involves texture coordinate modification:
static void
eval_tcmod(shader_t::pass_t& pass, const float time)
{
matrix_t t;
t.translate(+0.5f,+0.5f,0);
for(int i = pass.tcmods.size()-1; i >= 0; --i)
{
pass.tcmods[i].eval(time);
t*=pass.tcmods[i].mat;
}
t.translate(-0.5f,-0.5f,0);
matrix_t::transpose(t,pass.tcmod_mat);
}
This function moves through all the texture modification steps for a particular shader pass and
produces a single concatenated matrix. The matrix is transposed in order to be used from
within the vertex shader. Later in the rendering process, when it comes to the actual drawing of
batched geometry data that uses this particular shader, the following piece of code updates the
constant memory of the vertex shader with the final texture modification matrix produced
above, prior to the actual DrawIndexedPrimitive() call:
const int pass_count=s.passes.size();
for(int i=0; i<pass_count; ++i)
{
const shader_t::pass_t& pass = s.passes[i];
[ omitted for clarity ]
// set vertex shader for this pass
d3dev->SetVertexShader(pass.vsh_ref);
// set tcmod matrix if needed
if(pass.flags&SPF_TCMOD)
{
d3dev->SetVertexShaderConstant(4,pass.tcmod_mat,4);
}
[ omitted for clarity ]
}
To put it all together, let's step back to the setup code, which creates a vertex shader for each
shader script. During the translation process, whenever we encounter a pass that involves
texture coordinate modification, we add the following piece of vertex shader code:
if(pass.flags&SPF_TCMOD)
{
vsh << "dp4 " << unit << ".x,r1,c4\n"
<< "dp4 " << unit << ".y,r1,c5\n";
}
This generates the code to transform the texture coordinates of the vertex by our matrix (actu-
ally, the first two vectors from it), which had been set up at constant memory locations 4
through 7. So the final vertex shader code looks like this, assuming we're dealing with the first
texture unit:
dp4 oT0.x, r1, c4
dp4 oT0.y, r1, c5
This results in transforming the texture coordinates by our concatenated texture modification
matrix. This is similar to setting the texture matrix when using the fixed-function pipeline
and letting DirectX 8.0 handle the transformation.
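For comparison, a hedged sketch of that fixed-function route; 'dev' is an IDirect3DDevice8* and 'tcmod_mat' a D3DMATRIX holding the concatenated texture modification matrix (not transposed in this case).

#include <d3d8.h>

void set_fixed_function_tcmod(IDirect3DDevice8* dev, const D3DMATRIX& tcmod_mat)
{
    // Load the texture matrix for stage 0 and tell the pipeline to apply it
    // to the first two texture coordinate components.
    dev->SetTransform(D3DTS_TEXTURE0, &tcmod_mat);
    dev->SetTextureStageState(0, D3DTSS_TEXTURETRANSFORMFLAGS, D3DTTFF_COUNT2);
}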
Rendering Process
We will now describe the rendering pipeline of our level viewer, which is greatly simplified
thanks to vertex shaders. Normally, all kinds of effects require the CPU to touch each vertex
and modify the stream accordingly, which makes the code path rather complicated. Since we
are using vertex shaders to do that on the GPU instead of the CPU, our lower level pipeline boils
down to preparing the vertex stream, managing the rotating dynamic index buffer, and setting
up vertex constant memory.
The rendering process is divided roughly into higher and lower level pipeline code. Table
3 shows a summary of which tasks and functions belong to higher and lower level rendering
code:
Table 3: Rendering code levels
Higher level rendering pipeline code (implementation in world.cpp):
- Traverses the BSP tree
- Culls BSP nodes and leaves against the view frustum
- Culls faces against the view frustum
- Culls faces according to PVS tables
- Sorts faces by shader and lightmap
- Arranges faces by solid and translucent
- Walks the sorted, batched shader face list and prepares vertex constant memory data
Lower level rendering pipeline code (DirectX 8.0-specific driver, implementation in d3dev.cpp):
- Device initialization/shutdown
- Texture upload management
- Static/dynamic vertex and index buffer management
- Batch rendering and device state setup according to shader pass parameters
The higher level rendering pipeline code traverses the BSP tree and performs some gross
culling using the view frustum and bounding boxes for the BSP nodes. At a finer level, each
face is tested using PVS data when it is considered a candidate for drawing. The result of the
higher level rendering pipeline code is a sorted list of faces, batched by shader and lightmap,
and the shader hash map is sorted by shader solidity and translucency. This is performed in
the world_t::process_faces() function. The higher level rendering begins in the world_t::render_faces()
function, where the sorted shader batches are visited and vertex constant memory data is pre-
pared, if needed. This includes texture coordinate modification matrix, RGB generation
functions, etc. When the current batch is ready to be rendered, the control is passed to the
lower level rendering code.
The lower level code is a DirectX 8.0 driver that does the actual rendering of the triangles.
Its main rendering loop consists of managing vertex and index buffers and making low-level
calls for drawing primitives. A single dynamic index buffer is used to prepare batched up verti-
ces for drawing, according to the list passed in from the higher level rendering code. For
hardware vertex processing, all level geometry is uploaded into one static vertex buffer. This
gives significant performance benefits, since the data is effectively cached on GPU and does
not need to travel across the bus. For software vertex processing, this is not possible due to
performance reasons, so I have provided a fallback mechanism, where a fixed-size dynamic
vertex buffer is managed and vertex stream data is uploaded dynamically prior to rendering
calls. The reason behind this stems from the fact that when calling the DrawIndexedPrimitive()
API of DirectX, the range of vertices that needs to be processed is interpreted differently
depending on whether we're using hardware or software vertex processing. In hardware mode,
vertices are transformed and lit according to the indices passed to the call, so the range of verti-
ces specified does not influence the process. We can therefore use a single static vertex buffer
that is spatially addressed by indices from our dynamic index buffer. In software mode, how-
ever, the optimized processor-specific pipeline transforms exactly the range of vertices that
were specified during the call. Thus, it is significantly more beneficial for software vertex pro-
cessing pipeline to batch vertices for drawing using a single dynamic vertex buffer so that the
range of vertices is continuous and always corresponds exactly to the number of vertices actu-
ally being drawn.
Assuming hardware vertex processing and a single static vertex buffer with level geometry
uploaded, the actual rendering loop is extremely simple. All the low-level driver does is man-
age the index buffer by filling it up with the current batch triangle indices, set up device state
according to shader pass parameters (culling, blending, z-buffering writes), set up vertex pro-
gram constant memory, initialize texture units with textures, set up current vertex shader, and
make actual calls to DrawIndexedPrimitive. All this is performed in the d3dev_t::ren-
der_faces() function.
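A simplified sketch of those per-batch calls follows; parameter names are illustrative, and the actual d3dev_t::render_faces() does more (render state setup, multiple passes, etc.).

#include <d3d8.h>

void draw_batch(IDirect3DDevice8* dev,
                IDirect3DVertexBuffer8* static_vb, UINT stride,
                IDirect3DIndexBuffer8* dynamic_ib,
                DWORD vertex_shader,
                const float* tcmod_matrix,          // NULL if the pass has none
                IDirect3DTexture8* texture,
                UINT min_index, UINT num_vertices,
                UINT start_index, UINT triangle_count)
{
    dev->SetStreamSource(0, static_vb, stride);     // all level geometry
    dev->SetIndices(dynamic_ib, 0);                 // current batch indices
    dev->SetVertexShader(vertex_shader);            // shader for this pass
    if (tcmod_matrix)                               // texture matrix, if needed
        dev->SetVertexShaderConstant(4, tcmod_matrix, 4);
    dev->SetTexture(0, texture);
    dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST,
                              min_index, num_vertices,
                              start_index, triangle_count);
}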
Summary
It should be noted that the vertex shaders that result from the translation process do not
implement the Quake III shader scripts so completely that no CPU work is required during
the rendering process. Because of the great number of vertex shader combinations that are
possible as a result of a single Quake III shader script, as well as the size limitations imposed
on the current vertex shader implementation, the ideal situation of having vertex shaders that
implement a Quake III shader script in its entirety (sometimes involving the evaluation of a
couple of sin functions) is extremely hard to achieve, if not impossible. For performance
reasons, the intention was to eliminate the need for the CPU to touch every
vertex in order to perform given shader script functions, such as texture coordinate modifica-
tion or vertex deformation. This was required in the traditional model, before vertex shaders
were available. The fixed rendering pipeline of DirectX does not allow developers to
customize vertex processing according to their needs. Thanks to programmable pipeline and
vertex shaders resulting from translation process, geometry data once uploaded to the card
never needs to be touched by the CPU again, as the GPU makes all the changes to it. The
geometry can therefore be treated as static, even though we are doing changes to individual
components of the vertex stream. Using vertex shaders implemented in hardware, the stream
data does not need to travel across the bus. The balance was achieved in such a way that the
CPU performs more complicated tasks, such as evaluating time-based periodic functions, and
simply updates the constant memory of the vertex shader in each frame, greatly reducing the
complexity of the vertex shader.
Glossary
anisotropic Exhibiting different properties when measured from different directions.
Directional dependence. For instance, anisotropic lighting takes into account direction
with respect to the surface of incoming and/or outgoing light when illuminating a surface.
anisotropic filtering With 2D textures, anisotropic filtering is thought of as a filter along a
line across the texture with a width of two and length up to the maximum degree of
anisotropy. In three dimensions, this increases to a slab with width two extending in the
other two dimensions up to the maximum degree of anisotropy. 3D anisotropy is not
supported by current accelerators.
anisotropic lighting See anisotropic.
basis vectors A (complete) set of row or column vectors for a matrix; these might be the three
vectors that represent a rotation matrix.
bilerping/bilinear filtering An image interpolation technique that is based on the averaging
of the four nearest pixels in a digital image. It first computes a texel address, which is
usually not an integer address, and then finds the texel whose integer address is closest to
the computed address. After that, it computes a weighted average of the texels that are
immediately above, below, to the left of, and to the right of the nearest sample point.
binormal To realize per-pixel lighting, a texture space coordinate system is established at
each vertex of a mesh. With the help of this texture space coordinate system, a light vector
or any other vector can be transformed into texture space. The axes of the texture space
coordinate system are called tangent, binormal, and normal.
BRDF Bidirectional Reflectance Distribution Functions. The BRDF is a generalization of all
shading models, such as Lambert, Blinn, or Phong. A shading model is an analytic
approximation of a surface's reflective properties: It describes how incoming light is
reflected by a surface. The simplest shading model, Lambert shading, says that a surface
reflects incoming light equally in all directions. Phong shading (see Phong shading), on the
other hand, has an additional additive specular term, which permits a surface to reflect
different amounts of light in different directions. Thus, each of these models is a formula
which computes, for a given incoming and a given outgoing light direction, the portion of
incoming light that gets reflected in the outgoing direction. Such functions are called
BRDFs.
bump map Bump mapping adds surface detail (bumpiness) to objects in 3D scenes without
adding more geometry than already exists. It does so by varying the lighting of each pixel
according to values in a bump map texture. As each pixel of a triangle is rendered, a
lookup is done into a texture depicting the surface relief (aka the bump map). The values
in the bump map are then used to perturb the normals for that pixel. The new
bumped-and-wiggled normal is then used during the subsequent color calculation, and the
end result is a flat surface that looks like it has bumps or depressions in it. Usually a
normal map (see normal map) contains the already perturbed normals. In this case, an
additional bump map is not necessary.
canonical Very common. With respect to faces, it means a face with very common features.
cartoon shading The use of any class of rendering approaches that are designed to mimic
hand-drawn cartoon artwork. Examples include pen-and-inking, pencil sketch, pastel, etc.
convex hull The smallest convex region that can enclose a defined group of points.
cube map Cube mapping has become a common environment mapping technique for
reflective objects. Typically, a reflection vector is calculated either per-vertex or per-pixel
and is then used to index into a cube map that contains a picture of the world surrounding
the object being drawn. Cube maps are made up of six square textures of the same size,
representing a cube centered at the origin. Each cube face represents a set of directions
along each major axis (+X, -X, +Y, -Y, +Z, -Z). Think of a unit cube centered about the
origin. Each texel on the cube represents what can be seen from the origin in that
direction.
The cube map is accessed via vectors expressed as 3D texture coordinates (S, T, R or
U, V, W). The greatest magnitude component, S, T, or R, is used to select the cube face.
The other two components are used to select a texel from that face.
Cube mapping is also used for vector normalization with the light vector placed in the
texture coordinates.
dependent read A read from a texture map using a texture coordinate that was calculated
earlier in the pixel shader, often by looking up the coordinate in another texture. This
concept is fundamental to many of the advanced effects that use textures as transfer
functions or complex functions like power functions.
diffuse lighting Diffuse lighting simulates the illumination of an object by a particular light
source. Therefore, by using the diffuse lighting model, you are able to see that the light falls
onto the surface of an object from a particular direction.
It is based on the assumption that light is reflected equally well in all directions, so the
appearance of the reflection does not depend on the position of the observer. The intensity
of the light reflected in any direction depends only on how much light falls onto the
surface.
If the surface of the object is facing the light source, which means it is perpendicular to
the direction of the light, the density of the incident light is the highest. If the surface is
facing the light source under some angle smaller than 90 degrees, the density is
proportionally smaller.
The diffuse reflection model is based on a law of physics called Lambert's Law, which
states that for ideally diffuse (totally matte) surfaces, the reflected light is determined by
the cosine between the surface normal N and the light vector L.
directional light A directional light is a light source at an infinite distance. This simulates
the long distance the light beams have to travel from the sun. In game programming, these
light beams are treated as being parallel.
displacement map A texture that stores a single value at every texel, representing the
distance to move the surface point along the surface normal.
edge detection A type of filter that has a strong response at edges in the input image and a
weak response in areas of the image with little change. This kind of filter is often used as
the first step in machine vision tasks which must segment an image to identify its contents.
EMBM Environment Mapped Bump Mapping. Introduced with DirectX 6, this is a specific
kind of dependent lookup (see dependent read) allowing values of texels in one texture to
offset the fetching of texels in another texture, often an environment map. There is
spherical, dual paraboloid, and cubic environment mapping.
Fresnel term Common real-time environment mapping reflection techniques reflect the
same amount of light no matter how the surface is oriented relative to the
viewer, much like a mirror or chrome. Many real materials, such as water, skin, plastic,
and ceramics, do not behave in this way. Due to their unique physical properties, these
materials appear to be much more reflective edge-on than they are when viewed head-on.
The change in reflectance as a function of the viewing angle is called the Fresnel effect.
Gram-Schmidt orthonormalization Given a set of vectors making up a vector space,
Gram-Schmidt orthonormalization creates a new set of vectors which span the same vector
space, but are unit length and orthogonal to each other.
half vector The half-way vector H is useful to calculate specular reflections (see specular
lighting). It was invented by James F. Blinn to prevent the expensive calculation of a
reflection vector (mirror of light incidence around the surface normal). The H vector is
defined as halfway between L and V. H is therefore:
H = (L + V) / 2
When H coincides with N, the direction of the reflection coincides with the viewing
direction V and a specular highlight is observed. This specular reflection model is called
the Blinn-Phong model.
isotropic Exhibiting the same properties when measured from different directions.
Directional independence.
Lambert model See diffuse lighting.
light reflection vector A vector, usually called R, that describes the direction of the
reflection of light.
light vector A vector that describes the light direction.
LUT Look-up table. A precomputed array of values used to replace a function evaluation
often for the sake of performance. Many pixel shader techniques use 1D and 2D textures
as lookup tables to calculate functions that are prohibitive or even impossible to calculate
given the limited size and scope of the pixel shader instruction set and size.
mip mapping Mip maps consist of a series of textures, each containing a progressively
lower resolution of an image that represents the texture. Each level in the mip map
sequence has a height and a width that is half of the height and width of the previous level.
The levels could be either square or rectangular. Using mip maps ensures that textures
retain their realism and quality as you move closer or further away.
normal map A normal map stores vectors that represent the direction in which the normal
vector (see normal vector) points. A normal map is typically constructed by extracting
normal vectors from a height map whose contents represent the height of a flat surface at
each pixel.
normal vector Normals are vectors that define the direction that a face is pointing or the
visible side of a face.
orthonormal Two vectors are considered to be orthonormal if both vectors are unit length
and have a dot product of 0 between them.
per-pixel lighting For example, diffuse (see diffuse lighting) or specular (see specular
lighting) reflection models calculated on a per-pixel level are called per-pixel lighting.
Per-pixel lighting gets the per-pixel data usually from a normal map, which stores vectors
representing the direction in which the normal vector (see normal vector) points. The
additional needed light (see light vector) and/or half angle vector(s) (see half vector) are
usually transformed via a texture space coordinate system that is built per-vertex into
texture space so that the normal vector provided by the normal map and these vectors are
in the same space.
per-vertex lighting Per-vertex lighting means that the actual lighting calculations are done
for each vertex of a triangle, as opposed to each pixel that gets rendered (see per-pixel
lighting). In some cases, per-vertex lighting produces noticeable artifacts. Think of a large
triangle with a light source close to the surface. As long as the light is close to one of the
vertices of the triangle, you can see the lighting effect on the triangle. When the light
moves toward the center of the triangle, then the triangle gradually loses the lighting
effect. In the worst case, the light is directly in the middle of the triangle and there is
nearly no light visible on the triangle, instead of a triangle with a bright spot in the middle.
That means that a smooth-looking object needs a lot of vertices, or in other words, a
high level of tessellation; otherwise, the coarseness of the underlying geometry is visible.
Phong shading Phong shading is done by interpolating the vertex normals across the surface
of a polygon or triangle and illuminating the pixel at each point, usually using the phong
lighting model. At each pixel, you need to re-normalize the normal vector and also
calculate the reflection vector.
point light A point light source has color and position within a scene but no single direction.
All light rays originate from one point and illuminate equally in all directions. The
intensity of the rays will remain constant regardless of their distance from the point source,
unless a falloff value is explicitly stated. A point light is useful for simulating a light bulb.
post-classification (in volume graphics) The mapping of volume data to RGBA tuples after
interpolating, filtering, and/or sampling of the volume data.
power of 2 texture A texture that has dimensions that fall into the set 1, 2, 4, 8, 16, 32, 64,
128, 256, 512, 1024, 2048, 4096. Some hardware has this limitation, or it may be more
efficient for the hardware to deal with these dimensions.
pre-classification (in volume graphics) The mapping of volume data to RGBA tuples
before interpolating, filtering, and/or sampling of the volume data.
pre-integrated classification (in volume graphics) The mapping of (small) line segments
of volume data to RGBA tuples with the help of a pre-computed look-up table.
quadrilinear filtering Adds a linear filter along the fourth dimension of the volumetric mip
maps consisting of 3D textures, sampling from 16 different texels.
reflection The return of light from a surface.
reflection vector The mirror of a vector around a surface normal. In lighting equations, often
substituted by the half-vector (see half-vector).
refraction Refraction is a phenomenon that simulates the bending of light rays through
semi-transparent objects. There are several properties associated with refraction: Snell's law
and the critical angle. Snell's law states that for a light ray going from a less dense medium
to a denser medium, the light ray will bend in one direction, and going from a denser
medium to a less dense medium, it will bend in the other direction. The amount of bending
is governed by the refraction index, which is the ratio of the speed of light through
one medium divided by the speed of light through the other medium.
render target The buffer in memory to which a graphics processor writes pixels. A render
target may be a displayable surface in memory, like the front or back buffer, or the render
target may be a texture.
renderable texture The render target (see render target) is a texture.
scientific data visualization The branch of computer graphics that is concerned with the
(interactive) generation of comprehensible, informative, accurate, and reliable images
from scientific or technical data.
shadow volume The volume of space that forms a shadow for a given object and light.
Objects inside this volume of space are in shadow. Rapidly determining whether
something is in shadow can be done by using the stencil buffer and a vertex shader.
skinning A technique for blending several bone matrices at a vertex level to produce a
smooth animation.
slab (in volume graphics) The space between two consecutive slices through volume data.
The interior of the slab is considered to be filled with the volume data.
slice (in volume graphics) A flat subset of volume data, often implemented as a polygon
that is textured with a three-dimensional texture.
specular lighting Compared to the diffuse reflection model (see diffuse lighting), the
appearance of the reflection in the specular reflection model depends on the position of
the viewer. When the direction of viewing coincides, or nearly coincides, with the
direction of specular reflection, a bright highlight is observed. This simulates the reflection
of a light source by a smooth, shiny, and polished surface. To describe reflection from
shiny surfaces, an approximation is commonly used, which is called the Phong
illumination model (not to be confused with Phong shading), named after its creator, Bui
Tuong Phong. According to this model, a specular highlight is seen when the viewer is close to
the direction of reflection. The intensity of light falls off sharply when the viewer moves
away from the direction of the specular reflection.
spotlight A light source in which all light rays illuminate in the shape of a cone. The falloff
(the attenuation in the intensity of a light source as the distance increases), spread (the
parameter that controls the width of the cone of light produced by the spotlight), and
dropoff (the parameter that controls the way the light intensity fades based on its distance
from the center of the light cone) of a spotlight are adjustable.
SSE Streaming SIMD Extensions. A set of registers and instructions in Pentium III and
Pentium 4 processors for processing multiple floating-point numbers per instruction.
tangent space Tangent space is a 3D coordinate system defined for every point on the
surface of an object. It can be thought of as a transform from object space into texture
space. The x and y axes are tangent to the surface at the point and are equal to the u and v
directions of the texture coordinates, respectively. The z-axis is equal to the surface
normal. Tangent space is used for a variety of per-pixel shaders because most bump maps
and other types of maps are defined in tangent space. Tangent space is sometimes called
the Frenet Frame or a Surface Normal Coordinate Frame.
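Transforming a vector into tangent space amounts to three dot products against the basis vectors. A minimal C++ sketch (Vec3, Dot, and the function name are illustrative only):

struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Bring a vector (for example, the light direction) into the tangent-space
// basis: T is the u direction, B the v direction, N the surface normal.
static Vec3 ToTangentSpace(const Vec3& v, const Vec3& T, const Vec3& B, const Vec3& N)
{
    return Vec3{ Dot(v, T), Dot(v, B), Dot(v, N) };
}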
tap Each sample that contributes to a filter result.
Taylor series A Taylor series allows the generation of approximate sine and cosine values
without being as computationally expensive as the sine and cosine functions.
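For example, sin(x) can be approximated with the first four terms of its Taylor series, sin(x) ~ x - x^3/3! + x^5/5! - x^7/7!. A minimal sketch (the function name is illustrative):

// Nested (Horner) evaluation of x - x^3/6 + x^5/120 - x^7/5040.
// Accuracy is reasonable for x roughly in [-pi, pi]; wrap larger angles first.
static float TaylorSin(float x)
{
    float x2 = x * x;
    return x * (1.0f - x2 / 6.0f * (1.0f - x2 / 20.0f * (1.0f - x2 / 42.0f)));
}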
texels Abbreviation for texture elements, usually meaning one RGB or ARGB element of
the texture.
texture perturbation With the ability to perform per-pixel math both before and after
texture fetches, a variety of new texture-based effects are possible. Per-pixel texture
perturbation effects are effects that use the result of a texture fetch to change the texture
coordinates used to fetch another texel (see dependent read).
texture space See tangent space.
topology The arrangement in which the nodes of a 3D object are connected to each other.
transfer function A function (often one-dimensional) which is used to enhance or segment
an input image. Transfer functions are often used in scientific visualization to enhance the
understanding of complex datasets. A transfer function might map a scalar, such as heat,
pressure, or density, to a color to aid in understanding the phenomenon being drawn. In
games, transfer functions can be used on images to stylize them or to simulate effects like
night-vision goggles or heat-sensitive displays. In volume graphics, this is the mapping of
a number to an RGBA tuple; it is used in volume visualization and volume graphics to
assign colors to numeric data.
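As a small illustration of a one-dimensional transfer function, the following C++ sketch builds a 256-entry table mapping a scalar in [0,1] to an RGBA color with a simple heat-style ramp (black to red to yellow to white). Such a table could then be uploaded as a 1D texture and sampled with a dependent read; the ramp and names are illustrative, not code from the book:

#include <cstdint>

static void BuildHeatTransferFunction(std::uint8_t rgba[256][4])
{
    for (int i = 0; i < 256; ++i)
    {
        float t = i / 255.0f;
        float r = t * 3.0f;          // red rises first
        float g = t * 3.0f - 1.0f;   // then green
        float b = t * 3.0f - 2.0f;   // then blue
        auto clamp01 = [](float v) { return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v); };
        rgba[i][0] = static_cast<std::uint8_t>(clamp01(r) * 255.0f);
        rgba[i][1] = static_cast<std::uint8_t>(clamp01(g) * 255.0f);
        rgba[i][2] = static_cast<std::uint8_t>(clamp01(b) * 255.0f);
        rgba[i][3] = 255;            // opaque; volume rendering would vary alpha as well
    }
}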
trilinear filtering In the case of 2D textures, this means that multiple linearly filtered mip
maps are blended together. In the case of a 3D texture, trilinear filtering performs linear
filtering along all three axes of the texture, sampling colors from eight different texels. The
filtering method that is called trilinear filtering in conjunction with 2D textures is called
quadrilinear filtering for 3D textures.
trinary operator Operation that can use three components and do two operations
simultaneously.
tweening A technique for interpolating between two or more key frames to produce a
smooth animation.
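The simplest form is a per-vertex linear interpolation between two key-frame positions, v = v0 + t * (v1 - v0) with t in [0,1]. A minimal sketch with illustrative names:

struct Vec3 { float x, y, z; };

// Linearly interpolate between two key-frame positions of the same vertex.
static Vec3 Tween(const Vec3& v0, const Vec3& v1, float t)
{
    return Vec3{ v0.x + t * (v1.x - v0.x),
                 v0.y + t * (v1.y - v0.y),
                 v0.z + t * (v1.z - v0.z) };
}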
unit vector A unit vector is a vector with the length 1. To calculate a unit vector, divide the
vector by its magnitude or length. The magnitude of vectors is calculated by using the
Pythagorean theorem:
x^2 + y^2 + z^2 = m^2
The length of the vector is retrieved by:
||A|| = sqrt(x^2 + y^2 + z^2)
The magnitude of a vector has a special symbol in mathematics. It is a capital letter
designated with two vertical bars: ||A||. So dividing the vector by its magnitude is:
UnitVector = Vector / sqrt(x^2 + y^2 + z^2)
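A minimal C++ sketch of this normalization (Vec3 and the function name are illustrative):

#include <cmath>

struct Vec3 { float x, y, z; };

// Divide the vector by its magnitude ||A|| = sqrt(x^2 + y^2 + z^2).
// The result is undefined for a zero-length vector.
static Vec3 Normalize(const Vec3& v)
{
    float m = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return Vec3{ v.x / m, v.y / m, v.z / m };
}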
vector field map A texture that stores a vector at every texel to move the surface point.
volume data A data set that is indexed by three coordinates, often implemented as
three-dimensional textures (volume textures).
volume graphics The branch of computer graphics that is concerned with the rendering of
volumetric (as opposed to polygonal) primitives.
volume visualization The branch of scientific data visualization that is concerned with the
visualization of volume data.
About the Authors
Philippe Beaudoin ([email protected])
Philippe recently completed a master's thesis in computer graphics at the University of
Montreal where he developed algorithms for the spreading and rendering of 3D fire
effects. As part of his job as a hardware architect at Matrox Graphics, he studied and
developed various vertex and pixel shader technologies. He is currently employed at Dig-
ital Fiction where he develops games for various next-generation consoles.
Steffen Bendel ([email protected])
Steffen started his career as a game programmer in 1996 at a company called BlueByte.
After that, he studied physics with a specialty in quantum optics at the university in
Rostock. While studying, he worked as a freelancer for Massive Development where he
wrote some parts of the 3D engine of the underwater game AquaNox.
Chris Brennan ([email protected])
Chris graduated with a BS degree in Computer Science and another BS degree in Electri-
cal Engineering from Worcester Polytechnic Institute in '97 and joined Digital
Equipment Corp.'s Workstation Graphics group doing hardware design and verification.
When Digital died, Chris joined ATI as a 3D ASIC designer for the RADEON line of
graphics chips and then moved over to the 3D Application Research Group where he
tries to get those chips to do things that were not originally thought possible.
Dean Calver ([email protected])
Games are fun; Dean figured that out at age 2 and has spent the following years working
out how to make better games. For the last five years, people have even paid him to do it.
Having no real preference for console or PC has meant a mixed career flipping between
them for every project. Professionally, he started on a war game, did three years of a rac-
ing game, followed by an X-COM style game and arcade classic updates. Currently he's
doing a 3D graphic adventure. He still studies various subjects including optics, mathe-
matics, and other geeky things for fun. This preoccupation with learning means that he
has been taking exams every year for over half his life. At least he'll be ready to write the
first game for a quantum computer.
Drew Card ([email protected])
Drew is currently a software engineer in the 3D Application Research Group at ATI
Research where he is focusing on the application of shader-based rendering techniques.
He has worked on SDK applications as well as helped out with the demo engine. Drew is
a graduate of the University of South Carolina.
Wolfgang F. Engel ([email protected])
Wolfgang is the author of Beginning Direct3D Game Programming and a co-author of
OS/2 in Team, for which he contributed the introductory chapters on OpenGL and DIVE.
Wolfgang has written several articles in German journals on game programming and
many online tutorials that have been published on www.gamedev.net and his own web
site, www.direct3d.net. During his career in the game industry, he built two game devel-
opment units from scratch with four and five people that published, for example, six
online games for the biggest European TV show Wetten, dass..? As a member of the
board or as a CEO in different companies, he has been responsible for several game
projects.
Ingo Frick ([email protected])
Ingo is co-founder and technical director of Massive Development GmbH. He has played
a leading role in the development of the Krass engine, Archimedean Dynasty
(Schleichfahrt), and AquaNox. In the mid-'80s he developed several games (C64, Amiga,
PC) that were mainly distributed by smaller publishers and magazines. His first success-
ful commercial product was the conversion of The Settlers from the Commodore Amiga.
Ingo has a PhD in the area of numerical simulation of the motion of granular media.
David Gosselin ([email protected])
Dave is currently a software engineer in the 3D Application Research Group at ATI
Research. He is involved in various demo and SDK work, focusing mainly on character
animation. Previously, he worked at several companies, including Oracle, Spacetec IMC,
Xyplex, and MIT Lincoln Laboratory on varied projects from low-level networking and
web technologies to image processing and 3D input devices.
Juan Guardado ([email protected])
Juan currently works at Matrox as a graphics architect. He started as the first developer
support engineer in the days before DirectX. Later, his focus shifted to better understand-
ing the requirements of next-generation games and APIs, using this analysis to direct
research projects and argue with his boss.
Evan Hart ([email protected])
Evan is presently a software engineer with ATI's Application Research Group. He works
with new hardware features on all levels from API specification to implementation in
applications. He has been working in real-time 3D for the past four years. He is a gradu-
ate of Ohio State University.
Matthias Hopf
Matthias graduated with a degree in computer science from the FAU Erlangen in Ger-
many. Right now, he is a PhD student in the Visualization and Interactive Systems Group
at the University of Stuttgart. Despite all his research on adaptive and hierarchical algo-
rithms as well as on hardware-based filters, he is still interested in highly tuned low-level
software and hardware. He is mainly known for still being the maintainer of the Aminet
archive mirror at the FAU Erlangen, collecting gigabytes of Amiga software.
Kenneth L. Hurley ([email protected])
Kenneth started his career in the games industry in 1985 with a company called
Dynamix. He has also worked for Activision, Electronic Arts, and Intel and now works in
developer relations at NVIDIA Corp. His current job includes research and development
and instructing developers how to use new technology from NVIDIA. His credits in the
game industry include Sword of Kadash (Atari ST), Rampage (PC, Amiga, Apple II),
Copy II ST, Chuck Yeager's Air Combat Simulator (PC), The Immortal (PC), and Wing
Commander III (Playstation). While at NVIDIA, he has contributed the following pack-
ages/demos: NVASM (GeForce3 vertex/pixel shader assembler), NVTune (NVIDIA's
performance analysis tool set), DX7 Refract demo, Minnaert Lighting demo, Particle
Physics demo, and the Brushed Metal effect.
John Isidoro ([email protected])
John is a member of the 3D Application Research Group at ATI Technologies and a grad-
uate student at Boston University. His research interests are in the areas of real-time
graphics, image-based rendering, and machine vision.
Greg James ([email protected])
Greg is a software engineer with NVIDIA's technical developer relations group where he
develops tools and demos for real-time 3D graphics. Prior to this, he worked for a small
game company and as a research assistant in a high-energy physics laboratory. He is very
glad to have avoided graduate school and even happier to be working in computer graph-
ics, which he picked up as a hobby after his father brought home a strange beige Amiga
1000.
Martin Kraus ([email protected])
Since graduating in physics, Martin has been a PhD student in the Visualization and
Interactive Systems Group at the University of Stuttgart in Germany. In recent years he has pub-
lished several papers on volume visualization, but he is still best known for his Java
applet LiveGraphics3D. Martin started programming on a C64 in his early teens and
quickly became addicted to computer graphics. Major goals in his life include a long
vacation after receiving his PhD and achieving a basic understanding of quantum
mechanics.
Scott Le Grand ([email protected])
Scott is a senior engineer on the Direct3D driver team at NVIDIA. His previous commer-
cial projects include BattleSphere for the Atari Jaguar and Genesis for the Atari ST. Scott
has been writing video games since 1971 when he played a Star Trek game on a main-
frame and was instantly hooked. In a former life, he picked up a BS in biology from
Siena College and a PhD in biochemistry from Pennsylvania State University. Scott's
current interests are his wife, Stephanie, and developing techniques to render planets in
real time.
Jason L. Mitchell ([email protected])
Jason is the team lead of the 3D Application Research Group at ATI Research, makers of
the RADEON family of graphics processors. Working on the Microsoft campus in
Redmond, Jason has worked with Microsoft to define new Direct3D features, such as the
1.4 pixel shader model in DirectX 8.1. Prior to working at ATI, Jason did work in human
eye tracking for human interface applications at the University of Cincinnati where he
received his master's degree in Electrical Engineering. He received a BS in Computer
Engineering from Case Western Reserve University. In addition to several articles for this
book, Jason has written for the Game Programming Gems books, Game Developer Mag-
azine, Gamasutra.com, and academic publications on graphics and image processing.
Ádám Moravánszky ([email protected] or http://n.ethz.ch/student/adammo/)
Adam is a computer science student at the Swiss Federal Institute of Technology. He is
looking forward to receiving his master's degree in April 2002, after finishing his thesis
on real-time 3D graphics.
Christopher Oat ([email protected])
Christopher is a software engineer in the 3D Application Research Group at ATI
Research where he explores novel rendering techniques for real-time 3D graphics appli-
cations. His current focus is on pixel and vertex shader development in the realm of PC
gaming.
Kim Pallister ([email protected])
Kim is a technical marketing manager and processor evangelist with Intel's Software and
Solutions Group. He is currently focused on real-time 3D graphics technologies and
game development.
Steven Riddle ([email protected])
Steven is an independent contractor developing various web and 3D applications. Since
he started programming on the C64, his main interest has been graphics. It has taken him
from writing the first Sega Genesis emulator to writing his own rendering engine in C
and Assembly. Currently, he is focusing on using vertex and pixel shaders to create a vir-
tual universe.
Guennadi Riguer ([email protected])
Guennadi is a software developer at ATI Technologies, where he is helping game engine
developers to adopt new graphics technologies. Guennadi holds a degree in Computer
Science from York University and has previously studied at Belorussian State University
of Computing and Electronics. He began programming in the mid-'80s and worked on a
wide variety of software development projects prior to joining ATI.
John Schwab ([email protected], www.shaderstudio.com)
John has been programming for the better part of his life. He is the creator of Shader Stu-
dio, but in the past ten years, he has been developing games professionally and has
worked for Psygnosis and Microsoft. His specialty is computer graphics, and he is
always trying to do what others don't think is possible. Currently, he is a software engineer
at Electronic Arts. In his spare time, he builds robots and designs electronic devices.
Bart Sekura ([email protected])
Bart is a software developer with over seven years of professional experience. He loves
computer games and enjoys writing 3D graphics-related code. He spends most of his
time tinkering with DirectX, locking and unlocking vertex buffers, and transposing matri-
ces. Bart is currently a senior developer for People Can Fly working on Painkiller, a
next-generation technology action shooter.
Alex Vlachos ([email protected] or http://alex.vlachos.com)
Alex is currently part of the 3D Application Research Group at ATI Research, where he
has worked since 1998 focusing on 3D engine development. Alex is one of the lead
developers for ATI's graphics demos and screen savers, and he continues to write 3D
engines that showcase next-generation hardware features. In addition, he has developed
N-Patches (a curved surface representation which is part of Microsoft's DirectX 8). Prior
to working at ATI, he worked at Spacetec IMC as a software engineer for the SpaceOrb
360, a six degrees-of-freedom game controller. He has also been published in Game Pro-
gramming Gems 1, 2, and 3 and ACM's I3DG. Alex is a graduate of Boston University.
Daniel Weiskopf ([email protected])
Daniel is a researcher at the Visualization and Interactive Systems Group at the Univer-
sity of Stuttgart. His scientific interests range from figuring out how the latest graphics
boards can be used to speed up scientific visualization to rather crazy things like visualiz-
ing general relativistic faster-than-light travel in a warp spaceship. Daniel received a PhD
in Theoretical Astrophysics from the University of Tuebingen.
Oliver C. Zecha ([email protected])
Oliver is an independent consultant with five years' experience in the real-time 3D graph-
ics field. His primary focus is dynamic physical simulation, for which he has received
several awards and accolades. Recently, he migrated from OpenGL to Direct3D in order
to utilize programmable hardware shaders. At the time of publication, his research
involves the design of new algorithms that utilize consumer grade graphics hardware in
creative and unconventional ways, as well as implementing them for the X-Box console.
Index
3D model, converting photographs into, 297
3D noise, 431
3D Studio Max, 16, 296
3D texture lookup table shader, 430-431
3D textures, 428
filtering, 429
storing, 429-430
using, 430-437
A
add instruction, 23, 101-102
address register (vertex shader), using, 29
alpha blending, 78-79
alpha test, 78, 331
amorphous volume, representing with 3D textures, 435
anisotropic
filtering, 429, 451
lighting model, 376
strand illumination shaders, 379, 380, 381-382
arithmetic address instructions (pixel shader), 128
arithmetic instruction set, ps.1.1-ps.1.4, 101-108
array-of-structures vs. structure-of-arrays, 218-219
ATI ShadeLab, 80-81
atmospherics, creating with 3D textures, 435-436
attenuation, 66-67, 432-434 see also light falloff
B
backface culling, 73-74
backward mapping, 418-419
bem instruction, 102-103
Bézier patch, 65
Bézier patch class, using in vertex shader example, 59-66
bias, 114
Bidirectional Reflectance Distribution Functions, see BRDF
black and white transfer function, 260
Blinn half-vector, using with Phong shading, 400
Blinn-Phong reflection, 63-64
BRDF, 406-407
shaders, 408-409
bubble shaders, 370-375
bubbles, rendering, 369-370
bump mapping, 410-411, 450
shaders, 410-411
bump maps, 130
extracting with Paint Shop Pro, 299
separating with Paint Shop Pro, 299-300
C
cartoon shading, 322-323
shaders, 323-324
character animation, using shadow volumes with, 192-193
clip planes, 74
clipping, 74-75
cloud effects, 337-341
shaders, 338, 339, 341
clouds, rendering, 337-341
cmp instruction, 103-104
cnd instruction, 105
color map shader, ps.1.1-1.3, 128
color registers (pixel shader), 83
common files framework, 40-41
complex instructions (vertex shader), 27
compression
of tweened animation, 206-207
shader, 206-207
compression transform, 176-179
ConfirmDevice() function, 42
constant instruction (pixel shader), 128
constant registers (pixel shader), 82
constant registers (vertex shader), 17
setting, 22, 43, 54-55
using, 29
CreateFile() function, 52
CreateFileMapping() function, 52
CreatePixelShader() function, 120, 135
CreatePSFromCompiledFile, 134-135
CreateVertexShader() function, 34, 46
CreateVSFromCompiledFile() function, 51
cross products, 230
crystal shaders, 364-368
crystal texture, rendering, 363-364
cube mapping, 287
cube maps, 287
diffuse, 287
generating, 313
cubic environment mapping, 290
cubic interpolation, 240
custom shaders, constructing, 466-467
D
D3DXAssembleShader() function, 45-46
using for vertex shader example, 38-49
data arrangement, 218-219
deformable geometry, 467-468
DeletePixelShader() function, 36, 121
DeleteVertexShader() function, 47
depth precision, 331
depth test, 78
destination register modifiers, 117-118
detail textures, 451
development system recommendations, 5
diffuse and specular point light shader, 68-69
diffuse and specular reflection shader, 61-62
diffuse bump mapping shader, 117-118
diffuse cube maps, 287
generating dynamically, 288-289
shader, 287-288
using, 287-288
diffuse lighting, 298
shader, 55
diffuse reflection, 57-59
shader, 144
diffusion cube map tool, 14-15
dilation, 265-266
shader, 266-267
Direct3D pipeline, 5-6, 73
directional light, 57
displacement maps, 184-186
dithering, 79
DLL Detective, 15
dp3 instruction, 23, 106
dp4 instruction, 23, 106-107
dst instruction, 23
E
ease interpolation, 241
edge detection, 262
filters, 262-265
edge determination shaders, 329-330
edges, compositing, 330
engine design considerations, 450-452
environment mapping, 290-292
artifacts, 290
cubic, 290
for eyes, 315-316
shaders, 293-294, 315-316
with Quake III shader, 469
environmental lighting, 312
shader, 314
erosion, 267
shader, 267-268
expp instruction, 24, 27
eye mapping shaders, 315-316
eye space depth shader, 328-329
F
faces, rendering, 296
facial animation, 317
filter kernels, 261
texture coordinates, 261-262
filtering, 429
fire effect, 342-344
shader, 343-344
fixed-function pipeline, 6
mapping, 20
flow control, 230
fog, 77-78
font smoothing, 270-272
shaders, 271-272
forward mapping, 418
fractional Brownian motion, 236-239
three-dimensional, 238
FrameMove() function, 54-55, 60
frc instruction, 27
frequency, using to compare noise waves, 235
Fresnel effects, 281-282
Fresnel shaders, 283-285
Fresnel term, 281
frustum clipping, 74-75
full head mapping, 309-312
G
geometry emulation, 273-274
shaders, 274-276
gmax, 16
Gooch lighting, 326-327
shaders, 327
Gram-Schmidt orthonormalization, 380-381
grass,
animating, 334
lighting, 334-335
shaders, 335-336
guard band clipping, 74
H
half angles, 303
hard shading, 322
hardware vertex shader, 17
hatching, 324
shaders, 325-326
HCLIP space, 186
heat signature
shader, 260
transfer function, 261
hyperspace, 186
I
illumination,
per-pixel strand-based, 378-382
strand-based, 376-377
image rendering, 328-329
imposters, 273
independent rendering passes, 451-452
index buffer, using, 49
inflow, 421-423
input registers (vertex shader), 17
using, 28
instruction modifiers, 116-118
instruction pairing, 119-120
instruction set, complex (vertex shader), 27
instruction set (vertex shader), 23-26
instructions, modifying, 110-111
invert, 114
K
krass engine, 453
development history, 454-455
ordering effects in, 456-457
particle rendering in, 460-461
rendering terrain in, 458-460
structure, 453-454
L
lake water,
rendering, 357
shaders, 358-362
Lambert shading, 406
Lambert's Law, 57-58
light factor scaling, 452
light falloff, see also attenuation
calculating, 203-206
creating with 3D textures, 436
shader, 204-205, 437
light, directional, 57
lighting,
calculating, 57
enhancing, 213-214
improving, 277-278
real-time, 450-451
shaders, 279-280
lighting elements, separating, 298
linear interpolation, 240-241
lit instruction, 24
log instruction, 27
logp instruction, 25
lookup table, using, 140-143, 145-146
lrp instruction, 107
luminance shader, 260
M
m3x2 instruction, 27
m3x3 instruction, 27
m3x4 instruction, 27
m4x3 instruction, 27
m4x4 instruction, 27
mad instruction, 25, 107-108
mapping head, 309-312
MapViewOfFile() function, 53
masking, 32, 117-118
mathematical morphology, 265
max instruction, 25
MFC pixel shader, 80
Microsoft pixel shader assembler, 79-80
Microsoft vertex shader assembler, 11
min instruction, 25
morphing, see tweening
morphology, using to thicken outlines, 332
mov instruction, 25, 108
mul instruction, 25, 108
multiple compression transforms, 180-182
multisampling, 76
N
noise, 234, 431
function, 234-235
one-dimensional, 234-236
Perlin, see Perlin noise
shader, 431-432
two-dimensional, 236-237
using to generate textures, 238-239
noise waves, 235
using frequency to compare, 235
non-integer power function, 383-384
non-photorealistic rendering, 319
nop instruction, 25, 108
normal, transforming, 56
normal map, 130, 301-302
creating for face, 300-301
Normal Map Generator, 13
N-Patches, 6
NVIDIA Effects Browser, 8
NVIDIA NVASM, 10
using for vertex shader example, 49-53
NVIDIA Photoshop plug-ins, 13-14
NVIDIA Shader Debugger, 9
NVLink, 12
O
ocean water,
rendering, 347
rendering using Fresnel term, 350-352
shaders, 348-356
offset textures, 419
one-dimensional noise function, 234-236
one-dimensional Perlin noise, 242
one-time effect, 229
on-the-fly compiled shaders, 120
optimization example (vertex shader), 224-227
optimizations
for creating photorealistic faces, 302
vertex compression, 179-180
vertex shader, 221-224
outflow, 423
outline shader, 321
outlines,
rendering, 319-322
thickening, 332
output registers (pixel shader), 81, 82
output registers (vertex shader), 17
using, 30-31
P
Paint Shop Pro, 296
using to extract maps, 299
using to separate maps, 299-300
particle
motion, 416-420
systems, 414
particle flow
rendering, 414
shader, 423
particles, 416
rendering with krass engine, 460-461
Pentium 4 processor architecture, 216-217
periodic effects, randomizing, 229-230
Perlin noise, 234, 431 see also noise
calculating scalars for, 244-245
generating, 241-244
one-dimensional, 242
three-dimensional, 243-244
two-dimensional, 242-243
per-pixel Fresnel term, 281
per-pixel lighting, 130-134, 203-206
per-pixel strand-based illumination, 378-382
persistence, 236
Phong illumination, 62-63
Phong shading, 399-402, 406
shader, 402
using with Blinn half-vector, 400
photorealistic faces, 296
shaders, 303-309
Photoshop Compression plug-in, 13-14
Photoshop, using to extract lighting characteristics, 298
pipelines, 5-6
disadvantages of, 7
Pixar Animation Studios, 4
pixel shader, 72
architecture, 81-84
assembling, 120, 134
compiling, 120
creating, 120-121, 134-135
creating with Shader Studio, 159-160
defining constants for, 86, 127-128, 137
deleting, 121, 136
effects, 73
freeing resources of, 121, 136
programming, 84-121
range, 83-84
registers, 82-84
setting, 121, 136
setting texture flags for, 126-127, 140
setting texture of, 127, 140-141
standards, 77
tools, 79
pixel shader arithmetic instruction set, ps.1.1-ps.1.4, 101-108
pixel shader examples
using lookup table, 140-143
with diffuse reflection and specular reflection, 144-147
with directional light and diffuse reflection, 125-137
with higher precision specular reflection, 140-143
with specular reflection, 137-140
pixel shader instructions, 87-88, 137-139
ps.1.1-1.3, 128
ps.1.4, 128-130
pixel shader support, checking for, 84-85, 126
pixel shader texture addressing instruction set,
ps.1.1-ps.1.3, 89-97
ps.1.4, 98-100
pixel shader texture stages, setting, 85-86
pixel shader texture, setting, 86
pixel shaders
3D texture lookup table, 430-431
anisotropic strand illumination, 379, 380, 381-382
approximating power function, 392, 394, 396
blending textures, 257
bubble effect, 372-375
bump mapping, 411
cartoon shading, 323-324
cloud effects, 338, 339, 341
computing power function, 384-385, 386
converting to black and white, 260
creating photorealistic faces, 307-309
crystal effect, 365-368
determining edges, 329-330
diffuse bump mapping, 117-118
diffuse cube maps, 287-288
diffuse reflection, 144
dilation, 266-267
displaying color map, 128
environment mapping, 293-294
environmental lighting, 314
erosion, 267-268
eye environment mapping, 316
fire effects, 343-344
font smoothing, 271-272
Fresnel effects, 284-285
geometry emulation, 274
Gooch lighting, 327
grass, 336
hatching, 326
heat signature effect, 260
lake water, 360-362
light falloff, 437
light rendering, 409
lighting, 279
noise, 431-432
ocean water, 350-352, 352-356
particle flow, 423
Phong shading, 402
plasma effects, 344, 345-346
Roberts cross gradient filter, 263
sepia effect, 260
smooth conditional function, 397-398
Sobel filter, 264-265
specular map, 142
specular power, 145, 146
specular reflection, 139
transfer function, 441, 442
transformation to texture space, 133
volume bounded lighting, 399
volume rendering, 443, 445
pixel shaders,
advantages of, 72
limitations of, 232-234
plasma glass
effect, 344-346
shaders, 344, 345-346
point light source, 66
point lights, attenuation for, 66-67
power function,
approximating in pixel shader, 391-396
computing, 384-387
mathematical background, 387-391
non-integer, 383-384
using, 396-402
precompiled shaders, 120
pre-integrated volume rendering, 438, 444-446
procedural textures, 415
programmable pipeline, 4
ps.1.4, advantages of, 77
Q
quad example shader, 43-45
quadrilinear filtering, 429
Quake III Arena shaders, see Quake III shaders
Quake III shaders, 463-464
categories of, 466
vertex shader effects, 467-471
quantization, 174-175
R
random numbers, generating, 229-230
rcp instruction, 25
read port limit, 109-110
real-time lighting, 450-451
reconstruction, 407-408
reflection,
Blinn-Phong, 63-64
diffuse, 57-59
specular, 62-65
reflection map, generating, 358
refraction map, generating, 358
render pipeline, 456-457
render target, 79, 258
rendering passes, independent, 451-452
rendering process with Quake III shader, 471-472
RenderMan, 4
repetition, avoiding, 295
Roberts cross gradient filter, 262-263
shader, 263
rsq instruction, 26
S
scale, 114
scaled offset, 175-176
scrolling, disadvantage of, 295
sepia effect shader, 260
sepia tone transfer function, 260
SetPixelShader() function, 121, 136
SetPixelShaderConstant() function, 86, 127, 137
SetTexture() function, 86, 127, 140-141
SetVertexShader() function, 34-35, 47
SetVertexShaderConstant() function, 22, 43, 61
sge instruction, 26
shader assembly program, 246-251
Shader City, 10
Shader Studio, 11, 149
browser, 165
coordinate system, 150
creating pixel shader with, 159-160
creating vertex shader with, 152-159
drawbacks of, 151
features, 150-151
files, 161
light source, 152
menu layout, 160
object control, 152
transformations, 165-166
Shader Studio dialogs
Configure, 168
Constant Properties, 163-164
Materials, 164
Settings, 167-168
Shaders, 161-163
Statistics, 167
Textures, 165
shaders
in effect files, 120
limitations of, 232-234
shadow volume extrusion, 192-193
shader, 191-192
shadow volumes, 188-189
creating, 189-190
using with character animation, 192-193
shadows, outlining, 331-332
signed scaling, 114
silhouette edges, 319
detecting, 189
single-surface geometry, 209
single-surface object, lighting, 209-210
sinusoidal perturbation, 348-350
skeletal animation, see skinning
skinning, 197-199
blending with tweening, 199-201
shader, 198-199
sliding compression transform, 182-184
slt instruction, 26
smooth conditional function, 396-399
shader, 397-398
Sobel filter, 263-265
shader, 264-265
source register modifiers,
ps.1.1-1.3, 113, 115
ps.1.4, 115
specular map,
extracting with Paint Shop Pro, 299
separating with Paint Shop Pro, 299-300
shader, 142
specular power shader, 145, 146
specular reflection, 62-65
shaders, 138-139
spot light, 66
SSE, 217-218
data arrangement, 218-219
SSE registers, mapping VVM registers to, 220
stencil test, 78
strand-based illumination, 376-377
per-pixel, 378-382
Streaming SIMD Extensions, see SSE
structure-of-arrays vs. array-of-structures, 218-219
sub instruction, 108
swizzling, 31-32, 111-113
T
T&L pipeline, 6
disadvantages of, 7
tangent space, animating for per-pixel lighting, 202-203
temporary registers (pixel shader), 81, 82
temporary registers (vertex shader), 17
using, 30
terrain rendering with krass engine, 458-460
tex instruction, 89
texbem instruction, 89-90
texbeml instruction, 91
texcoord instruction, 91
texcrd instruction, 98
texdepth instruction, 99
texdp3 instruction, 91
texdp3tex instruction, 91-92
texkill instruction, 92, 99
texld instruction, 99-100
texm3x2depth instruction, 92
texm3x2pad instruction, 92-93
texm3x2tex instruction, 93
texm3x3 instruction, 93-94
texm3x3pad instruction, 94
texm3x3spec instruction, 94-95
texm3x3tex instruction, 95-96
texm3x3vspec instruction, 96
texreg2ar instruction, 97
texreg2gb instruction, 97
texreg2rgb instruction, 97
texture
advection, 418
coordinates, 81
matrix manipulation with Quake III shader, 469-471
operations, 81
perturbation effects, 337-346
scrolling, 295
shaders, 257
stages, 81
texture address instructions (pixel shader), 128
texture addressing,
ps.1.1-ps.1.3, 88-89, 97-98
ps.1.4, 98, 100-101
texture addressing instruction set,
ps.1.1-ps.1.3, 89-97
ps.1.4, 98-100
texture registers, 81
pixel shader, 82-83
texture space,
moving light into, 131-134
transforming L into, 132-134
shaders, 132-133
texture space coordinate system, establishing, 131-132
textures,
blending, 256-257
generating using noise, 238-239
three-dimensional Perlin noise, 243-244
time value, calculating, 228-229
tonal art maps, 324
transfer functions, 259, 441-442
implementing, 439-441
shader, 441, 442
types of, 260-261
Transform & Lighting pipeline, see T&L pipeline
transformation matrix, using, 56
trilinear filtering, 429
turbulence, 431
tweening, 195-197
blending with skinning, 199-201
shader, 196-197
two-dimensional noise, 236-237
two-dimensional Perlin noise, 242-243
U
unit vector, calculating, 56
user clip planes, 74
UV flipping, 295
V
vector fields, 184-186
version instruction (pixel shader), 128
vertex buffer, using, 48
vertex compression, 172-173
examples, 174-179, 180-186
optimizations, 179-180
techniques, 172-173
vertex lighting, 277
vertex normal, correcting, 211-213
vertex shader,
architecture, 16-18
assembler, 10-11
benchmarks, 252
compiling, 22, 33-34, 45-46
creating, 34, 46, 51-53
creating with Shader Studio, 152-159
data types, 173
declaring, 19-21, 42-43, 54, 59-60
deleting, 35
effects, 7
effects with Quake III shaders, 467-471
freeing resources of, 47
hardware, 17
non-shader specific code, 47-49, 65
optimization example, 224-227
optimizations, 221-224
process of creating, 18, 42
programming, 18-35
setting, 34-35, 47
tools, 8-16
writing, 22, 32-33
vertex shader constant registers, setting, 22, 43, 54-55
vertex shader constants, setting, 60-61, 67
vertex shader examples
using Bézier patch class, 59-66
using combined diffuse and specular reflection, 59-6
using D3DXAssembleShader() function, 38-49
using NVASM, 49-53
with per-vertex diffuse reflection, 53-59
with point light source, 66-70
vertex shader instruction set, 23-26
complex, 27
vertex shader registers, 17-18
using, 28-31
vertex shader resources, freeing, 35
vertex shader support, checking for, 19, 42
vertex shaders
blending textures, 257
blending tweening and skinning, 200-201
bubble effect, 370-372
bump mapping, 410
cartoon shading, 323
compression, 206-207
compression transform, 179
correcting vertex normal, 212-213
creating photorealistic faces, 303-307
crystal effect, 364-365
diffuse and specular point light, 68-69
diffuse and specular reflection, 61-62
diffuse lighting, 55
displacement, 185-186
environment mapping, 293
eye environment mapping, 315-316
eye space depth, 328-329
font smoothing, 271
Fresnel effects, 283-284
geometry emulation, 274-276
Gooch lighting, 327
grass, 335-336
hatching, 325-326
HCLIP displacement map, 186
HCLIP vector field, 186
lake water, 358-360
light falloff, 204-205
light rendering, 408-409
lighting, 279-280
multiple compression transforms, 181-182
ocean water, 348-350, 352-356
optimized compression transform, 180
outlines, 321
quad example, 43-45
quantization, 175
scaled offset, 176
shadow volume extrusion, 191-192
skinning, 198-199
skinning tangent/texture space vectors, 202-203
specular reflection, 138
transformation to texture space, 132
tweening, 196-197
vector field, 185-186
world space normals, 328-329
vertex shaders, 6-7
advantages of, 7
limitations of, 232-234
vertex stream declaration example, 174
vertex virtual machine, see VVM
vertices, processing, 6
visualization in computer games, 455
drawbacks of, 455-456
volume bound for pixel shader effects, 398-399
volume graphics
animating, 441-444
blending, 442-444
rotating, 441
volume rendering, 444-446
volume visualization, 438-439
volumetric effects, 438
volumetric fog, creating with 3D textures, 435-436
volumetric textures, see 3D textures
voxel, 439
VVM, 220
VVM registers, mapping to SSE registers, 220
W
water, see ocean water, lake water
world space normals shader, 328-329
Color Plates 1-12
About the CD
The companion CD contains vertex and pixel shader code along with a variety of programs
demonstrating the concepts presented throughout the book. When you unzip the CD files,
they will be organized into directories named for the articles.
See the individual readme files for more information.
Additional resources are available at www.shaderx.com
WARNING: By opening the CD package, you accept the terms and conditions
of the CD/Source Code Usage License Agreement.
Additionally, opening the CD package makes this book non-returnable.
CD/Source Code Usage License Agreement
Please read the following CD/Source Code usage license agreement before opening the CD and using the contents
therein:
1. By opening the accompanying software package, you are indicating that you have read and agree to be bound
by all terms and conditions of this CD/Source Code usage license agreement.
2. The compilation of code and utilities contained on the CD and in the book are copyrighted and protected by
both U.S. copyright law and international copyright treaties, and is owned by Wordware Publishing, Inc. Indi-
vidual source code, example programs, help files, freeware, shareware, utilities, and evaluation packages,
including their copyrights, are owned by the respective authors.
3. No part of the enclosed CD or this book, including all source code, help files, shareware, freeware, utilities,
example programs, or evaluation programs, may be made available on a public forum (such as a World Wide
Web page, FTP site, bulletin board, or Internet news group) without the express written permission of
Wordware Publishing, Inc. or the author of the respective source code, help files, shareware, freeware, utili-
ties, example programs, or evaluation programs.
4. You may not decompile, reverse engineer, disassemble, create a derivative work, or otherwise use the
enclosed programs, help files, freeware, shareware, utilities, or evaluation programs except as stated in this
agreement.
5. The software, contained on the CD and/or as source code in this book, is sold without warranty of any kind.
Wordware Publishing, Inc. and the authors specifically disclaim all other warranties, express or implied,
including but not limited to implied warranties of merchantability and fitness for a particular purpose with
respect to defects in the disk, the program, source code, sample files, help files, freeware, shareware, utilities,
and evaluation programs contained therein, and/or the techniques described in the book and implemented in
the example programs. In no event shall Wordware Publishing, Inc., its dealers, its distributors, or the authors
be liable or held responsible for any loss of profit or any other alleged or actual private or commercial dam-
age, including but not limited to special, incidental, consequential, or other damages.
6. One (1) copy of the CD or any source code therein may be created for backup purposes. The CD and all
accompanying source code, sample files, help files, freeware, shareware, utilities, and evaluation programs
may be copied to your hard drive. With the exception of freeware and shareware programs, at no time can any
part of the contents of this CD reside on more than one computer at one time. The contents of the CD can be
copied to another computer, as long as the contents of the CD contained on the original computer are deleted.
7. You may not include any part of the CD contents, including all source code, example programs, shareware,
freeware, help files, utilities, or evaluation programs in any compilation of source code, utilities, help files,
example programs, freeware, shareware, or evaluation programs on any media, including but not limited to
CD, disk, or Internet distribution, without the express written permission of Wordware Publishing, Inc. or the
owner of the individual source code, utilities, help files, example programs, freeware, shareware, or evalua-
tion programs.
8. You may use the source code, techniques, and example programs in your own commercial or private applica-
tions unless otherwise noted by additional usage agreements as found on the CD.
Warning: By opening the CD package, you accept the terms and conditions of
the CD/Source Code Usage License Agreement.
Additionally, opening the CD package makes this book non-returnable.