GPU Zen 2
Advanced Rendering Techniques
Preface
I Rendering
Patrick Cozzi, editor
5 Skinned Decals
Hawar Doghramachi
5.1 Introduction
5.2 Overview
5.3 Implementation
II Environmental Effects
Wolfgang Engel, editor
This book, like its long line of predecessors, is created with the intent of helping
readers better achieve their goals.
For generations, books were used to preserve valuable information. They are an
important source of knowledge in our modern world. With the rise of social media,
information is obscured and transformed into whatever the agenda of the poster is. It
became acceptable to bother other people with information that is sometimes tasteless,
mindless and/or nonsensical in all areas of life, including graphics programming. Po-
litical parties and companies drive large scale misinformation activities (sometimes
called marketing or information warfare) with noise levels that are hard to bear.
This book is meant to provide an oasis of peace and intellectual reflection. All of
us who worked on it tried to make sure this collection of articles is practically useful,
stimulating for your mind, and a joy to read.
The awesome screenshot on the cover is provided by Jeroen Roding with permission
from Guerrilla Games. Thank you!
I would like to thank Eric Lengyel for editing the articles and creating the beautiful
page layout. I would also like to thank Anton Kaplanyan, Mauricio Vives, Patrick Cozzi
and Wessam Bahnassi for being the section editors.
At this point, I also want to thank everyone for supporting this book series and its
predecessors since 2001. These books started friendships, careers, companies and
much more over the years. They certainly changed my life in awesome ways!
—Wolfgang Engel
Real-time rendering is an exciting field, in part because of how rapidly it evolves and
advances, but also because of the graphics community’s eagerness and willingness to
share new ideas, opening the door for others to learn and share in the fun!
In this section we introduce five new rendering techniques that will be relevant to
game developers, hobbyists, and anyone else interested in the world of graphics.
The article “Adaptive GPU Tessellation with Compute Shaders” by Jad Khoury,
Jonathan Dupuy, and Christophe Riccio proposes making rasterization more efficient
for moderately distant polygons by procedurally refining coarse meshes as they get
closer to the camera with the help of compute shaders. They achieve this by manipulating
an implicit (triangle-based) subdivision scheme for each polygon of the scene in
a dedicated compute shader that reads from and writes to a compact, double-buffered
array.
The article “Applying Vectorized Visibility on All Frequency Direct Illumination”
by Ho Chun Leung, Tze Yui Ho, Zhenni Wang, Chi Sing Leung, and Eric Wing Ming Wong
describes a new PRT approach with visibility functions represented in vector graphics
form. This results in a different set of strengths and weaknesses compared to other PRT
approaches. This new approach can preserve the fidelity of high-frequency shadows
and accurately account for a huge number of light sources, even with coarsely tessellated
3D models. It can also handle the specular component from mirror to blurry
reflection.
The article “Non-periodic Tiling of Noise-based Procedural Textures” by Ale-
ksandr Kirillov shows a method to combine noise-based procedural texture synthesis
with a non-periodic tiling algorithm. It describes modifications to several popular
procedural noise functions that directly produce texture maps containing the smallest
possible complete Wang tile set. It can be used as a preprocessing step or during appli-
cation runtime.
The article “Rendering Surgery Simulation with Vulkan” by Nicholas Milef, Di
Qi, and Suvranu De presents a rendering system designed around surgery simulation,
including how higher-level design decisions propagate to lower-level usage of Vulkan.
The last article in the section, “Skinned Decals” by Hawar Doghramachi, describes
a way to dynamically apply decals to a character, for example to show the impact
position of a projectile. This technique overcomes a drawback of deferred decals
in that scenario: when the target area is influenced by several bones, the decals
“swim” on top of the target mesh.
—Patrick Cozzi
1 Adaptive GPU Tessellation with Compute Shaders
1.1 Introduction
GPU rasterizers are most efficient when primitives project into more than a few pixels.
Below this limit, the Z-buffer starts aliasing, and shading rate decreases dramatically
[Riccio 2012]; this makes the rendering of geometrically-complex scenes challenging,
as any moderately distant polygon will project to subpixel size. In order to minimize
such subpixel projections, a simple solution consists in procedurally refining coarse
meshes as they get closer to the camera. In this chapter, we are interested in deriving
such a procedural refinement technique for arbitrary polygon meshes.
Traditionally, mesh refinement has been computed on the CPU via recursive algo-
rithms such as quadtrees [Duchaineau et al. 1997, Strugar 2009] or subdivision sur-
faces [Stam 1998, Cashman 2012]. Unfortunately, CPU-based refinement is now
fundamentally bottlenecked by the massive CPU-GPU streaming of geometric data it
requires for high resolution rendering. In order to avoid these data transfers, extensive
work has been dedicated to implementing and/or emulating these recursive algorithms
directly on the GPU by leveraging tessellation shaders (see, e.g., [Niessner et al. 2012,
Cashman 2012, Mistal 2013]). While tessellation shaders provide a flexible, hardware-accelerated
mechanism for mesh refinement, they remain limited in two respects. First,
they only allow up to $\log_2 64 = 6$ levels of subdivision. Second, their performance
drops as the subdivision depth increases [AMD 2013].
In the following sections, we introduce a GPU-based refinement scheme that is
free from the limitations incurred by tessellation shaders. Specifically, our scheme al-
lows arbitrary subdivision levels at constant memory costs. We achieve this by manip-
ulating an implicit (triangle-based) subdivision scheme for each polygon of the scene
in a dedicated compute shader that reads from and writes to a compact, double-buffered
array. First, we show how we manage our implicit subdivision scheme in Section 1.2.
Then, we provide implementation details for rendering programs we wrote that lever-
age our subdivision scheme in Section 1.3.
Figure 1.1. The (a) subdivision rule we apply on a triangle (b) uniformly and (c) adaptively.
The subdivision levels for the red, blue, and green nodes are respectively 2, 3, and 4.
1.2 Implicit Triangle Subdivision
$$M_{0100} = M_0 \, M_1 \, M_0 \, M_0 \tag{1.3}$$
In our implementation, we store each key produced by our subdivision rule as a 32-bit
unsigned integer. Below is the bit representation of a 32-bit word, encoding the key
0100. Bits irrelevant to the code are denoted by the ‘_’ character.
____ ____ ____ ____ ____ ____ ___1 0100
Note that we always prepend the key’s binary sequence with a binary value of 1 so we
can track the subdivision level associated to the key easily. Listing 1.2 provides the
GLSL code we use to extract the transformation matrix associated to an arbitrary key.
Since we use 32-bit integers, we can store up to $32 - 1 = 31$ levels of subdivision,
which includes the root node. Naturally, more levels require longer words. Because
longer integers are currently unavailable on many GPUs, we emulate them using integer
vectors, where each component represents a 32-bit wide portion of the entire key. For
more details, see our implementation, where we provide a 63-level subdivision algo-
rithm using the GLSL uvec2 datatype.
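As a CPU-side illustration of this encoding, the sketch below packs a sequence of subdivision bits behind a leading 1 and recovers the level and the bit path from a key. It is written in Python rather than the chapter's GLSL, the helper names (`make_key`, `key_level`, `key_bits`) are hypothetical, and it assumes that keys fit in a single 32-bit word.

```python
def make_key(bits):
    """Pack a sequence of 0/1 subdivision choices behind a leading 1 bit."""
    key = 1  # the root key: only the leading 1 is set
    for b in bits:
        key = (key << 1) | (b & 1)
    return key


def key_level(key):
    """Subdivision level, i.e., the position of the leading 1 bit."""
    return key.bit_length() - 1


def key_bits(key):
    """Recover the subdivision choices, most significant first."""
    level = key_level(key)
    return [(key >> i) & 1 for i in range(level - 1, -1, -1)]


key = make_key([0, 1, 0, 0])
print(format(key, "b"), key_level(key), key_bits(key))  # 10100 4 [0, 1, 0, 0]
```

The leading 1 is what makes the level recoverable: without it, the keys 0100 and 100 would be indistinguishable once stored in a fixed-width word.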
return xf;
parent: ____ ____ ____ ____ ____ ____ ____ 1010
key: ____ ____ ____ ____ ____ ____ ___1 0100
child1: ____ ____ ____ ____ ____ ____ __10 1000
child2: ____ ____ ____ ____ ____ ____ __10 1001
Note that compared to the key representation, the other keys are either 1-bit expansions
or contractions. The GLSL code to compute these representations is shown in List-
ing 1.3; it simply consists of bit shifts and logical operations, and is thus very cheap.
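The bit patterns listed above can be reproduced with a few shifts. The following Python sketch (not the chapter's GLSL; the function names are hypothetical) computes the parent and the two children of a key and checks them against the example words.

```python
def parent_key(key):
    """Drop the last subdivision bit (a 1-bit contraction)."""
    return key >> 1


def children_keys(key):
    """Append a 0 or a 1 to the key (1-bit expansions)."""
    return (key << 1), (key << 1) | 1


key = 0b10100
print(format(parent_key(key), "b"))                 # 1010
print([format(c, "b") for c in children_keys(key)])  # ['101000', '101001']
```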
Listing 1.4 provides the pseudocode we typically use for updating the subdivision
buffer in a GLSL compute shader. In practice, if a key needs to be split, it emits two
new words, and the original key is deleted. Conversely, when two sibling keys must
merge, they are replaced by their parent's key. In order to avoid generating two copies
of the same key in memory, we only emit the key once from the 0-child, identified
using the test provided in Listing 1.5. We also provide some unit tests we perform on
the keys to avoid producing invalid keys in Listing 1.6. Keys that do not require
any modification are simply re-emitted, unchanged.
else if (/* keep ? */ keyLod == targetLod)
8 1. Adaptive GPU Tessellation with Compute Shaders
else /* merge ? */
if (/* is root ? */ isRootKey(key))
else if (/* is zero child ? */ isChildZeroKey(key))
Listing 1.5. Determining if the key represents the 0-child of its parent.
It should be clear that our approach maps very well to the GPU. This allows us to
compute adaptive subdivisions such as the one shown in Figure 1.1(c). Note that an
iteration only permits a single refinement or coarsening operation per key. Thus when
more are needed, multiple buffer iterations should be performed. In our rendering im-
plementations, we perform a single buffer iteration at the beginning of each frame.
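The split/keep/merge logic described above can be sketched on the CPU as follows. This is a minimal Python illustration rather than the chapter's GLSL. The predicates mirror the names that appear in the listings (`isRootKey`, `isChildZeroKey`), while `update_subd_buffer`, the single uniform `target_lod` argument, and the `max_lod` cap are assumptions made for the example. It also assumes that two sibling keys always reach the same merge decision.

```python
def key_lod(key):
    """Subdivision level encoded by the leading 1 bit."""
    return key.bit_length() - 1


def is_root_key(key):
    return key == 1


def is_child_zero_key(key):
    return (key & 1) == 0


def update_subd_buffer(keys_in, target_lod, max_lod=31):
    """One buffer iteration: each key is split, kept, or merged at most once."""
    keys_out = []
    for key in keys_in:
        lod = key_lod(key)
        if lod < target_lod and lod < max_lod:
            # split: emit the two children, the original key disappears
            keys_out.append(key << 1)
            keys_out.append((key << 1) | 1)
        elif lod <= target_lod:
            # keep: re-emit the key unchanged
            keys_out.append(key)
        elif is_root_key(key):
            # the root cannot be merged any further
            keys_out.append(key)
        elif is_child_zero_key(key):
            # merge: only the 0-child emits the parent, so it is written once
            keys_out.append(key >> 1)
    return keys_out


buffer = [1]
for _ in range(3):
    buffer = update_subd_buffer(buffer, target_lod=2)
print(buffer)  # [4, 5, 6, 7]
```

Starting from the root key, two iterations are needed to reach level 2, which reflects the one-operation-per-key limit of a single iteration.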
// barycentric interpolation
vec3 berp(in vec3 v[3], in vec2 u)
{
    return v[0] + u.x * (v[1] - v[0]) + u.y * (v[2] - v[0]);
}
Listing 1.7. Compute the vertices v_out of the subtriangle associated to a subdivision key
generated from a triangle defined by vertices v_in.
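The barycentric interpolation used to place subtriangle vertices can be checked in isolation. The Python sketch below mirrors the `berp` expression on plain tuples; the tuple-based representation is an assumption made for the example and is not part of the chapter's code.

```python
def berp(v, u):
    """Barycentric interpolation: v holds three 3D points, u is (u.x, u.y)."""
    return tuple(
        v[0][k] + u[0] * (v[1][k] - v[0][k]) + u[1] * (v[2][k] - v[0][k])
        for k in range(3)
    )


triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(berp(triangle, (0.0, 0.0)))   # first vertex
print(berp(triangle, (1.0, 0.0)))   # second vertex
print(berp(triangle, (0.25, 0.5)))  # a point inside the triangle
```

Feeding the three corner coordinates of a subtriangle, expressed in the barycentric space of the coarse triangle, through this function yields its world-space vertices.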
1.3 Adaptive Subdivision on the GPU
1.3.1 Overview
In this section, we describe a tessellation technique for polygonal geometry that leverages
our implicit subdivision scheme. Our technique computes an adaptive subdivision
for each polygon in the scene, so as to control their extent in screen space and hence
minimize subpixel projections; we describe how we compute such subdivisions using
a distance-based LOD criterion in Section 1.3.2. Since adaptive subdivisions usually
lead to T-junctions, we also discuss how we avoid them entirely in Section 1.3.3.
In practice, our technique requires three GPU kernels with OpenGL 4.5; Figure 1.2
diagrams the OpenGL pipeline of our implementation. The first kernel (LodKernel)
updates the subdivision buffer in a compute shader using the algorithms
described in the previous section. In addition, we perform view-frustum culling for
each key and write the visible ones to a buffer (CulledSubdBuffer) using an atomic
counter. Next, we launch a second compute kernel (IndirectBatcherKernel) that prepares
an indirect compute dispatch call for the next subdivision update (i.e., the next
invocation of LodKernel), as well as an indirect draw call for the third and final kernel.
The final kernel (RenderKernel) executes the indirect drawing commands to render the
Figure 1.2. OpenGL pipeline of our compute-based tessellation shader. The green, red, and
gray boxes respectively denote GPU memory buffers, GPU code execution, and CPU code
float distanceToLod(float z)
{
    float tmp = s(z) * targetPixelSize / screenResolution;
    return -log2(clamp(tmp, 0.0, 1.0));
}
Here, the parameter z denotes the distance from the camera to the subtriangle associated
to the key being processed. Listing 1.8 provides the GLSL pseudocode we execute
in LodKernel.
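To get a feel for how the distance-based criterion behaves, the Python sketch below evaluates the same expression as `distanceToLod` on the CPU. The definition of `s(z)` is not shown in this excerpt, so the sketch assumes it is the extent of the view frustum at distance `z` for a vertical field of view `fov_y`; that definition, the parameter names, and the handling of a zero distance are all assumptions made for illustration.

```python
import math


def distance_to_lod(z, fov_y, target_pixel_size, screen_resolution):
    """CPU mirror of distanceToLod; s(z) is an assumed frustum-extent term."""
    s = 2.0 * z * math.tan(fov_y / 2.0)  # assumed definition of s(z)
    tmp = s * target_pixel_size / screen_resolution
    tmp = min(max(tmp, 0.0), 1.0)        # clamp(tmp, 0.0, 1.0)
    if tmp == 0.0:
        return math.inf                  # GLSL would yield +inf for -log2(0)
    return -math.log2(tmp)


for z in (1.0, 2.0, 4.0, 8.0):
    print(z, distance_to_lod(z, math.pi / 2.0, 10.0, 1080.0))
```

Under this assumption, doubling the distance lowers the target level by exactly one, and distant geometry saturates at level 0 once the clamp reaches 1.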
void main()
// get threadID (each key is associated to a thread)
int threadID = gl_GlobalInvocationID.x;
// write to u_SubdBufferOut
updateSubdBuffer(key, targetLod);
tessellated). Therefore, we chose to rely on such a system as it avoids the need for a
sophisticated T-junction removal system; Listing 1.9 shows the code we use in the ver-
tex shader of our RenderKernel.
void main() {
// get coarse triangle associated to the key
uint primID = i_PerInstanceKey.y;
vec3 v_in[3] = vec3[3](
u_VertexBuffer[u_IndexBuffer[primID * 3 ]],
u_VertexBuffer[u_IndexBuffer[primID * 3 + 1]],
u_VertexBuffer[u_IndexBuffer[primID * 3 + 2]],
1.3.4 Results
To demonstrate the effectiveness of our method, we wrote a renderer for displacement-
mapped terrains, and another one for meshes; our source code is available on github at, and a terrain rendering result is
shown in Figure 1.3. In Table 1.1, we give the CPU and GPU timings of a zoom-
in/zoom-out sequence in the terrain at 1080p. The camera's orientation was fixed, look-
ing downwards, so that the terrain would occupy the whole framebuffer, thus maintain-
ing constant rasterization activity. We configured the renderer to target an average
triangle edge length of 10 pixels; Figure 1.3 shows the wireframe of such a target. The
testing platform is an Intel i7-8700k CPU, running at 3.70 GHz, and an Nvidia GTX
1080 GPU with 8 GiB of memory. Note that the CPU activity only consists of OpenGL
uniform variables and driver management. On current implementations, such tasks run
asynchronously to the GPU.
Figure 1.3. Crack-free, multiresolution terrain rendered entirely on the GPU using compute-
based subdivision and displacement mapping. The alternating colors show the different subdivi-
sion levels.
1.4 Discussion
We introduced a novel compute-based subdivision algorithm that runs entirely on the
GPU thanks to an implicit representation. In future work, we would like to explore the
feasibility of this representation for more complex subdivision schemes such as Cat-
mull-Clark. In the meantime, we provide next a few additional considerations that we
think can be relevant in the context of our work.
How much memory should be allocated for the buffers containing the subdivision
keys? This depends on the target polygon density in screen space. The buffers should
be able to store at least $3 \times \text{max\_level} + 1$ nodes, and do not need to exceed a capacity
of $4^{\text{max\_level}}$ nodes. The lower bound corresponds to a perfectly restricted subdivision,
where neighboring triangles differ by at most one level of subdivision. The upper
bound gives the number of cells at the finest level in the case of uniform subdivision.
How about combining this technique with tessellation shaders to overcome the
subdivision limits of the hardware? We have actually implemented such an ap-
proach. Our open-source implementation is available on github at
jdupuy/opengl-framework (see the demo-isubd-terrain demo). With both approaches
at hand, we leave it up to the developer to decide which approach is best given their
software and hardware constraints.
There are two ways to control polygon density. Either use the implicit subdivision,
or refine the instanced triangle grid. Which approach is best? This will naturally
depend on the platform. Our code provides tools to modify the tessellation of the in-
stanced triangle grid, so that its impact can be thoroughly measured; Figure 1.4 plots
the performance evolution that we measured on our platform.
Can our implicit subdivision scheme smooth input meshes? Our implicit subdivision
scheme offers the same functionality as tessellation shaders. Therefore, any
smoothing technique that runs with tessellation shaders also runs with our subdivision
shaders. For instance, the mesh renderer we provide implements PN-triangles [Vlachos
et al. 2001] and Phong Tessellation [Boubekeur and Alexa 2008] to smooth the surface
of the coarse meshes we refine; Figure 1.5 shows our mesh renderer applying either
bilinear interpolation or Phong Tessellation to a coarse triangle mesh.
1.5 Acknowledgments
This chapter is the result of Jad Khoury’s master’s thesis, which was supervised by Jonathan
Dupuy. All authors conducted this work at Unity Technologies.
Figure 1.4. Performance evolution with respect to the level of subdivision of the instanced
triangle grid on an Nvidia GTX 1080.
Figure 1.5. Our subdivision technique applied on (a) a triangle mesh using (b) bilinear interpolation
and (c) Phong tessellation [Boubekeur and Alexa 2008].
Bibliography
AMD 2013. GCN Performance Tweets. List of all GCN performance tweets that were released
during the first few months of 2013. URL:
BOUBEKEUR, T., AND ALEXA, M. 2008. Phong Tessellation. ACM Transactions on Graphics
(Proc. SIGGRAPH Asia 2008) 27:5.
CASHMAN, T. 2012. Beyond Catmull Clark? A Survey of Advances in Subdivision Surface
Methods. Comput. Graph. Forum 31:1, 42–61. URL:
DUCHAINEAU, M., WOLINSKY, M., SIGETI, D., MILLER, M., ALDRICH, C., AND MINEEV-WEINSTEIN,
M. 1997. ROAMing terrain: real-time optimally adapting meshes. In Proceedings of the
8th Conference on Visualization ’97, pp. 81–88. IEEE Computer Society Press.
MISTAL, B. 2013. Gpu terrain subdivision and tessellation. In GPU Pro 4, 3–20.
NIESSNER, M., LOOP, C., MEYER, M., AND DEROSE, T. 2012. Feature-adaptive GPU Rendering
of Catmull-Clark Subdivision Surfaces. ACM Trans. Graph. 31:1, 6:1–6:11.
RICCIO, C. 2012. Southern Islands in deep dive. SIGGRAPH Tech Talk. URL:
STAM, J. 1998. Exact Evaluation of Catmull-Clark Subdivision Surfaces at Arbitrary Parameter
Values. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive
Techniques, SIGGRAPH ‘98, pp. 395–404. URL:
STRUGAR, F. 2009. Continuous distance-dependent level of detail for rendering heightmaps.
Journal of graphics, GPU, and game tools 14:4, 57–74.
VLACHOS, A., PETERS, J., BOYD, C., AND MITCHELL, J. 2001. Curved PN Triangles. In Proceed-
ings of the 2001 Symposium on Interactive 3D Graphics, I3D ‘01, pp. 159–166. URL:
2 Applying Vectorized Visibility on All Frequency Direct Illumination
2.1 Introduction
Precomputed Radiance Transfer (PRT) [Sloan et al. 2002] is a general framework
for illuminating surfaces using precomputed transfer functions. It plays an important
role in today’s more elaborate rendering applications (e.g., [Elcott 2016]).
The preparation for PRT usually involves a precomputation step, which is a
computationally expensive step for evaluating the light bounce and transfer in a scene;
the output is a set of transfer functions, and these transfer functions are usually pre-
pared in a per vertex manner. Such transfer functions can be prepared by using compu-
tational simulations (e.g., ray tracing) or measuring from the real world with some
specifically designed equipment [Matusik et al. 2003]. Either simulated or measured,
the fundamental objective of PRT is still the same, i.e., providing instant access to the
transfer functions for the rendering applications.
The formulation of PRT is so general that it can practically capture all possible
visual effects due to illumination, including shadows, interreflections, lighting functions,
etc. While PRT can render many visual features, the real challenge is to make
it efficient as well, and this challenge has two parts. First, the PRT approach needs an efficient
means to evaluate the radiance equation (i.e., a numerical integral) with potentially
several hundred thousand samples. The evaluation has to be efficient because we might
need to perform it a few million times per frame (e.g., one evaluation for each
onscreen pixel at a screen resolution of 1280 × 962). Second, it needs an algorithm to compress
the transfer functions to a manageable data size. When local effects (e.g., view-dependent
lighting, shadows, and interreflections) are included, the data size of the
precomputed transfer functions can be on the order of gigabytes. Even if the memory consumption
and the bandwidth of the distribution medium are not a concern, it will not
be worth the effort to spend all that memory on illumination.
Thankfully, the human visual system is very forgiving of illumination that differs a bit
from reality. Even if we simplify the lighting evaluation to an extreme,
the rendered results can still be visually pleasing. For instance, if we restrict the view
positions to a single point infinitely far away and ignore the local effects, PRT reduces
to MatCap [Brauer 2010, Moreno 2018]; if we ignore just the local effects, PRT reduces
to environment mapping with mipmapping for the glossy reflections [Scheuermann
and Isidoro 2006]; if we restrict the view positions to a single point and fix the
3D model postures (i.e., essentially a static image), PRT reduces to Image Based
Lighting [Russell 2015]. Other than the simplified approaches, there exist many PRT
approaches [Ng et al. 2003, Ng et al. 2004, Sloan et al. 2003, Tsai and Shih 2006, Kautz
et al. 2002, Liu et al. 2004, Ben-Artzi et al. 2006, Lam et al. 2010, Wang et al. 2009,
Wang et al. 2013], which attack the problem head-on by modeling the transfer func-
tions as a whole instead of dropping the effects whenever they are inconvenient.
This article presents the implementation and the rendering algorithms for the vec-
torized visibility [Ho et al. 2018], i.e., a variant of the PRT approaches. It differentiates
itself from the other PRT approaches by having its visibility functions represented in
vector graphics forms. This fundamental difference is so profound that the correspond-
ing rendering algorithm evolves through a very different path as compared to other PRT
approaches; this results in a different set of strengths and weaknesses.
Our PRT approach can preserve the fidelity of high-frequency shadows and accurately
account for a huge number of light sources, even with coarsely tessellated 3D models;
it can also handle the specular component from mirror to blurry reflection¹.
Both per-vertex and per-fragment direct illumination are supported. It can also make
use of dynamic tessellation to provide better scalability, which is faster than
per-fragment evaluation and of better quality than per-vertex evaluation. The specular component in the real
world does have shadows², and our algorithm can capture the visual impression of the
specular component shadows³.
¹ The video demonstration of our BRDF editing feature
² The video of the specular component shadows in the real world
³ The video of the specular component shadows from our demo program
2.2 The Precomputed Radiance Transfer
$$I(\omega_0, n) = \int_\Omega f_r(\omega_0, n, \omega)\, L(\omega)\, d\omega, \tag{2.1}$$
where $\omega_0$ is the viewing direction, $\omega$ is the incident direction, $n$ is the surface normal,
$f_r$ is the Bidirectional Reflectance Distribution Function (BRDF) (or the transfer function),
$L$ is the environment map (or the illumination), and $\Omega$ is the spherical domain.
(Note that we will refer to $f_r$ as the transfer function from this point onward.) Conceptually,
the equation simply states that we can obtain the appearance of a surface point by adding
up the influence of all the light sources involved. In simple terms, it is just a complicated
way to describe the multiple light source application, and the integration sign simply
means having as many light sources as possible.
An obvious intention of writing the radiance equation in its integral form is to
apply frequency analysis, that is, transforming the transfer function $f_r$ and the illumination
$L$ to their frequency representations and performing the integration in the frequency
domain. In the frequency domain, the convolution of $f_r$ and $L$ becomes a dot
product of their frequency coefficient vectors $C_{f_r}$ and $C_L$, i.e.,
$$I(\omega_0, n) = C_{f_r} \cdot C_L. \tag{2.2}$$
Figure 2.1 illustrates the radiance equation
evaluation using frequency representations. Using only the lower-frequency components,
for instance 16 coefficients per representation, a numerical integral potentially
requiring a hundred thousand samples can be approximated by a simple dot product
of two 16-component vectors; this makes the radiance evaluation efficient enough
for real-time applications, as presented by Sloan et al. [2002].
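The reason a dot product of coefficient vectors can stand in for the integral is that projection onto an orthonormal basis preserves inner products. The Python sketch below shows this on a deliberately tiny stand-in: four samples and a four-function orthonormal Haar basis instead of spherical harmonics on the sphere. The sample values and names are made up for the example.

```python
import math

R = 1.0 / math.sqrt(2.0)
# Rows form an orthonormal basis of 4-sample signals (coarse to fine).
HAAR = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [R, -R, 0.0, 0.0],
    [0.0, 0.0, R, -R],
]


def project(signal, basis=HAAR):
    """Coefficient vector of a sampled signal in the given basis."""
    return [sum(b * s for b, s in zip(row, signal)) for row in basis]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


transfer = [1.0, 2.0, 3.0, 4.0]   # sampled transfer function (made up)
lighting = [4.0, 3.0, 2.0, 1.0]   # sampled illumination (made up)

direct = dot(transfer, lighting)                            # sample-by-sample sum
full = dot(project(transfer), project(lighting))            # all coefficients
low = dot(project(transfer)[:2], project(lighting)[:2])     # coarse ones only
print(direct, full, low)
```

With all coefficients the two results agree up to rounding; keeping only the two coarsest coefficients gives 21 instead of 20, which is the kind of error that truncating to low frequencies introduces.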
Sloan et al. [2002] use spherical harmonics as their basis for the frequency representation,
which is, loosely speaking, the Fourier series for the spherical domain. Like
Fourier series, spherical harmonics form a global basis, which is inefficient for handling
high-frequency signals. As a result, Sloan et al. [2002] can only render blurry effects,
e.g., blurry shadows and the diffuse lighting function.
Ng et al. [2003] extend Sloan’s work by replacing the global basis with a local
basis⁴ (i.e., Haar wavelets in the cubemap space). To make things more focused, the
effects involved are simplified from global illumination (i.e., view independent lighting,
visibility, and interreflections) to direct illumination (i.e., view independent lighting
and visibility). In this case, the radiance equation becomes
$$I(\omega_0, n) = \int_\Omega S(\omega)\, f(\omega_0, n, \omega)\, L(\omega)\, (\omega \cdot n)\, d\omega, \tag{2.3}$$
⁴ A basis is a prototype of functions. A global basis means each of its functions is non-zero in
general throughout the domain. On the contrary, a local basis (also referred to as a compactly supported
basis) means each of its functions is non-zero only within the neighborhood of a particular
Figure 2.1. An intuitive illustration of the radiance equation evaluation using
frequency representations. The efficiency of this approach comes from aggressively dropping less
important components until the number of components is manageable.
advantage of the local basis (i.e., more efficient in handling high frequency signals),
Ng et al. [2003] attain the all-frequency shadow rendering quality.
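$$I(\omega_0, n) = \int_\Omega S(\omega)\, f(\omega_0, n, \omega)\, L(\omega)\, d\omega. \tag{2.4}$$
2.3 Rewriting the Radiance Equation
$$I(\omega_0, n) = \int_S f(\omega_0, n, \omega)\, L(\omega)\, d\omega. \tag{2.5}$$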
Green’s theorem, one of the major theorems in multivariable calculus, gives a relation-
ship between the line integral of a two-dimensional vector field over a closed path (i.e.,
a contour integral) and the double integral over the region it encloses. Coincidentally,
the visibility function S happens to be some regions enclosed by the contour of its
visible regions. This unusual coincidence allows us to apply Green’s theorem to Equa-
tion (2.5).
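Green's theorem is easiest to see in the plane before moving to the sphere. The Python sketch below is a planar analogy, not the chapter's spherical formulation: it evaluates two double integrals over a polygon by summing one closed-form term per boundary segment. The function names are hypothetical, and polygons are assumed to be listed counterclockwise.

```python
def area_by_contour(polygon):
    """Double integral of 1 over the polygon, computed as the contour integral of x dy."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        total += 0.5 * (x0 + x1) * (y1 - y0)  # exact integral of x dy on the segment
    return total


def x_moment_by_contour(polygon):
    """Double integral of x over the polygon, via the pre-integrated x^2/2 along dy."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        total += (x0 * x0 + x0 * x1 + x1 * x1) / 6.0 * (y1 - y0)
    return total


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(area_by_contour(square), x_moment_by_contour(square))  # 1.0 0.5
```

In both functions the integrand is first pre-integrated along x, and only the boundary is visited afterwards; the cost depends on the number of segments, not on the area enclosed.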
Applying Green’s theorem, we can rewrite the radiance equation from a double
integral to a contour integral, i.e.,
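$$I(\omega_0, n) = \oint_A \mathrm{SAT}\left[f(\omega_0, n, \omega)\, L(\omega)\right](p)\, dp, \tag{2.6}$$
where $A$ is the contour of the visibility function $S$; $\mathrm{SAT}[g]$ denotes the pre-integrated
spherical function of a given spherical function $g$, i.e.,
$$\mathrm{SAT}[g](p) = \int_0^{p_\theta} \int_0^{p_\phi} g(\phi, \theta)\, \sin\theta\, d\phi\, d\theta, \tag{2.7}$$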
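A discrete version of this pre-integration is a summed-area table over a latitude-longitude grid. The Python sketch below builds such a table with a midpoint rule and the $\sin\theta$ area weight. The grid layout, resolution, and function name are assumptions made for illustration and do not describe how the chapter's pre-integrated environment map is actually stored.

```python
import math


def build_sat(g, n_theta, n_phi):
    """sat[i][j] approximates the integral of g * sin(theta) over
    theta in [0, i * d_theta] and phi in [0, j * d_phi]."""
    d_theta = math.pi / n_theta
    d_phi = 2.0 * math.pi / n_phi
    sat = [[0.0] * (n_phi + 1) for _ in range(n_theta + 1)]
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            cell = g(phi, theta) * math.sin(theta) * d_phi * d_theta
            # standard summed-area recurrence
            sat[i + 1][j + 1] = cell + sat[i][j + 1] + sat[i + 1][j] - sat[i][j]
    return sat


sat = build_sat(lambda phi, theta: 1.0, 64, 128)
whole_sphere = sat[64][128]      # close to 4 * pi
upper_hemisphere = sat[32][128]  # close to 2 * pi
print(whole_sphere, upper_hemisphere)
```

Once the table exists, the integral up to any grid point is a single lookup, which is what makes evaluating the integrand along a contour cheap.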
Figure 2.2. An intuitive illustration of the radiance equation evaluation using
the vectorized visibility representation. Using the visibility function as the domain and applying
Green’s theorem, we can convert the radiance equation to a contour integral. The vectorized
visibility happens to be the ideal candidate for the domain of the contour integral.
As a contour integral, Equation (2.6) only needs to process one-dimensional data,
i.e., the contour of the visibility function. Having the contour represented in vector
graphics form, we decouple the visibility function from the illumination resolution.
Hence, even with 400k light sources, Equation (2.6) can still provide an accurate integral
without dropping any information at the same computational cost. This makes
Equation (2.6) both efficient and accurate for the radiance equation evaluation.
The relative pros and cons of the frequency analysis approach
vs. Equation (2.6) might not be immediately obvious. Metaphorically, they
are analogous to those of the Finite Element Method (FEM) vs. the Finite Boundary
Method (FBM), respectively. The downside of FBM is that it takes more effort to
formulate the simulation (in a lot of cases, such formulations may not even be feasible
at all), whilst FEM works in general. However, if FBM is applicable, it can provide
accuracy that is several orders of magnitude higher at a tiny fraction of the computational cost. Two applications
of Equation (2.6), i.e., diffuse lighting and specular lighting, will be discussed in
Section 2.5.
Figure 2.3. More about the nature of the vectorized visibility. (a) The vectorized visibility
sampled at the yellow dot plotted on the 3D model. (b) The vectorized visibility in Figure (a).
(c) Indicating the locations for the visibility contour synthesis in Figure (d). (d) The visibility
contours synthesized at the locations identified in Figure (c) using the vectorized visibility shown
in Figure (b). A single vectorized visibility representation can already synthesize a family of
visibility contours.
Figure 2.4. The data structure for the vectorized visibility. (a) The visibility cubemap. (b) The
sequences of position vectors obtained from the boundaries in Figure (a). (c) Multiple sequences
can be connected to form a single sequence. (d) The position vector sequences of all vertices
would be stored in a 3D texture.
2.5 Lighting Evaluation
$$I_d = \int_\Omega S(\omega)\, \max(n \cdot \omega, 0)\, L(\omega)\, d\omega. \tag{2.8}$$
$$I_d = n \cdot \oint_A \mathrm{SAT}\left[\omega\, L(\omega)\right](p)\, dp. \tag{2.9}$$
We use the synthesized visibility contour $u_i$, which is the unit vector sequence presented
in Section 2.4, as the representation of $A$. Each consecutive pair of unit vectors
defines a line segment, and the contour integral is evaluated by summing up the partial
results of all line segments. Then, the radiance becomes
where vsat is the generalized SAT lookup operator for line segments to be presented
in Section 2.6.
For the specular component, we use the Phong lighting function, and its radiance
equation is given by
$$I_s = \int_\Omega S(\omega)\, \mathrm{Phong}(r, \omega, s)\, L(\omega)\, d\omega, \tag{2.11}$$
$$I_s = \frac{I_s}{\bar{I}_s}\, \bar{I}_s \approx \frac{I_{Cap}}{\bar{I}_{Cap}}\, \bar{I}_s, \tag{2.12}$$
where $\bar{I}_s = \int_\Omega \mathrm{Phong}(r, \omega, s)\, L(\omega)\, d\omega$ is the radiance value without visibility, and
$I_{Cap}$ and $\bar{I}_{Cap}$ are the corresponding radiance values of $I_s$ and $\bar{I}_s$ obtained by substituting an algebraically
friendly function for the Phong lighting function. As shown in Figure 2.5,
Equation (2.12) expresses $I_s$ as a fraction of the radiance value without visibility $\bar{I}_s$.
The intention is to isolate the influence of visibility from the lighting function as a
ratio, so that we can evaluate the ratio with an algebraically friendly function while
preserving the features of the lighting function with $\bar{I}_s$. Our algebraically friendly function
is
$$\mathrm{Cap}(r, \omega, \xi) = \frac{\max(r \cdot \omega - \cos\xi, 0)}{1 - \cos\xi}, \tag{2.13}$$
where $\xi$ is the radius of a circular window $W(r, \xi)$ centered at $r$. Better still, both
the ratio $I_{Cap} / \bar{I}_{Cap}$ and $\bar{I}_s$ are computationally friendly. $\bar{I}_s$ can be obtained easily from
the cubemap mipmap of $L$. However, calculating the ratio is a bit tricky. Expanding
the ratio over the visible region enclosed by $A$ and over the full sphere $\Omega$, we have
$$\frac{I_{Cap}}{\bar{I}_{Cap}} = \frac{\int_A \mathrm{Cap}(r, \omega, \xi)\, L(\omega)\, d\omega}{\int_\Omega \mathrm{Cap}(r, \omega, \xi)\, L(\omega)\, d\omega} = \frac{\int_A \max(r \cdot \omega - \cos\xi, 0)\, L(\omega)\, d\omega}{\int_\Omega \max(r \cdot \omega - \cos\xi, 0)\, L(\omega)\, d\omega}. \tag{2.14}$$
Figure 2.5. Visualizing each of the components separately. The specular component is estimated
by the product of a ratio estimator and the filtered $L$ without visibility.
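The substitute lobe of Equation (2.13) is simple enough to evaluate directly. The Python sketch below implements it for unit vectors stored as tuples, reading the numerator as $r \cdot \omega - \cos\xi$; the representation and helper names are assumptions made for illustration.

```python
import math


def dot3(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cap(r, w, xi):
    """Cap lobe: 1 at w == r, falling linearly in r.w to 0 at angle xi, 0 outside."""
    c = math.cos(xi)
    return max(dot3(r, w) - c, 0.0) / (1.0 - c)


r = (0.0, 0.0, 1.0)
xi = math.pi / 3.0
on_axis = cap(r, r, xi)
on_rim = cap(r, (math.sin(xi), 0.0, math.cos(xi)), xi)
outside = cap(r, (1.0, 0.0, 0.0), xi)
print(on_axis, on_rim, outside)  # 1.0 0.0 0.0
```

Because the lobe is linear in $r \cdot \omega$ inside the window and zero outside, the max can later be traded for a restriction of the integration domain.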
Examining Equation (2.14) carefully, we see that r cannot be factorized out of the in-
tegral due to the presence of the max function. The max function can be considered as
a form of visibility function, which is a circular window in this case. By using this fact,
the max function place can be taken by changing the domain to the intersection of the
original domain and the circular window; then, we have
I Cap
A W r , ξ
r ω cos ξ ω dω
. (2.15)
Ω W r ,ξ
r ω cos ξ ω dω
Without the explicit presence of the max function, r can be factorized, and we have
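$$\frac{I_{\mathrm{Cap}}}{\tilde{I}_{\mathrm{Cap}}} = \frac{\mathbf{r} \cdot \int_{A \cap W(\mathbf{r}, \xi)} \omega\, L(\omega)\, d\omega - \cos\xi \int_{A \cap W(\mathbf{r}, \xi)} L(\omega)\, d\omega}{\mathbf{r} \cdot \int_{W(\mathbf{r}, \xi)} \omega\, L(\omega)\, d\omega - \cos\xi \int_{W(\mathbf{r}, \xi)} L(\omega)\, d\omega}. \tag{2.16}$$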
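The step from the max form to the factored form can be checked numerically. The following is a minimal sketch, not the chapter's shader: it replaces the integrals by sums over random directions, and the lighting function, the hemispherical visible region, and all function names are hypothetical choices made only for illustration. Because the same samples are used in both forms, the two ratios should agree up to rounding.

```python
import math
import random

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def random_unit_vector(rng):
    # Uniform direction on the unit sphere.
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(max(0.0, 1.0 - z * z))
    return (s * math.cos(phi), s * math.sin(phi), z)

def lighting(w):
    # Hypothetical smooth, strictly positive environment lighting.
    return 1.0 + 0.5 * w[0] + 0.25 * w[1] * w[2]

def visible(w):
    # Hypothetical visible region A: the upper hemisphere.
    return w[2] > 0.0

def cap_ratio_max_form(samples, r, xi):
    # Ratio with the explicit max function, integrated over A and over the sphere.
    c = math.cos(xi)
    num = sum(max(dot(r, w) - c, 0.0) * lighting(w) for w in samples if visible(w))
    den = sum(max(dot(r, w) - c, 0.0) * lighting(w) for w in samples)
    return num / den

def cap_ratio_factored(samples, r, xi):
    # Ratio with the domain restricted to the window, so r can be pulled out.
    c = math.cos(xi)

    def moments(keep):
        first = [0.0, 0.0, 0.0]   # sum of w * L(w)
        zeroth = 0.0              # sum of L(w)
        for w in samples:
            if dot(r, w) > c and keep(w):
                l = lighting(w)
                zeroth += l
                for i in range(3):
                    first[i] += w[i] * l
        return first, zeroth

    first_a, zeroth_a = moments(visible)
    first_w, zeroth_w = moments(lambda w: True)
    return (dot(r, first_a) - c * zeroth_a) / (dot(r, first_w) - c * zeroth_w)
```

The factored form only needs the zeroth and first moments of the lighting over each domain, which is what makes a contour-based evaluation possible.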
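The integrals in Equation (2.16) can all be evaluated efficiently using a similar formu-
lation to Equation (2.10) given the contour of $A \cap W(\mathbf{r}, \xi)$. The contour of $A \cap W(\mathbf{r}, \xi)$
can be obtained by clamping all the unit vectors in $A$ to the circular window; each
clamped unit vector is given by

$$v_{\mathrm{clamp}}(\mathbf{a}_i) = \begin{cases} \mathbf{r} \cos\xi + \sin\xi\, \dfrac{(\mathbf{a}_i \cdot \mathbf{r}_b)\, \mathbf{r}_b + (\mathbf{a}_i \cdot \mathbf{r}_t)\, \mathbf{r}_t}{\sqrt{(\mathbf{a}_i \cdot \mathbf{r}_b)^2 + (\mathbf{a}_i \cdot \mathbf{r}_t)^2}}, & \text{if } \mathbf{a}_i \cdot \mathbf{r} < \cos\xi; \\ \mathbf{a}_i, & \text{otherwise}, \end{cases} \tag{2.17}$$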
26 2. Applying Vectorized Visibility on All Frequency Direct Illumination
Figure 2.6. Intersecting a visibility contour with a circular window. (a) shows a visibility con-
tour (black and red) and a circular window (blue). (b) shows the visibility contour of their inter-
section. As indicated by the red arrows, the clamped unit vectors will form either the missing
paths of intersected regions or some zero-area round trips.
where $\mathbf{r}_b$, $\mathbf{r}$, $\mathbf{r}_t$ are orthonormal vectors. The clamped unit vectors will form either the
missing paths of intersected regions or some zero-area round trips as shown in Fig-
ure 2.6(b).
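The clamping step can be sketched on the CPU as follows. This is an illustrative Python version under the assumption that a vector lies outside the window when its dot product with the window center is below cos ξ; the function name and the handling of the degenerate case (a vector exactly opposite to the center) are hypothetical and not taken from the chapter's shader.

```python
import math

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def clamp_to_window(a, r, r_b, r_t, xi):
    """Clamp unit vector a to the circular window of radius xi centered at r.

    r_b, r, r_t are assumed to be orthonormal.
    """
    if dot(a, r) >= math.cos(xi):
        return a                      # already inside the window
    pb, pt = dot(a, r_b), dot(a, r_t)
    norm = math.hypot(pb, pt)
    if norm == 0.0:
        # Degenerate case a == -r: any direction on the rim is equally valid.
        pb, pt, norm = 1.0, 0.0, 1.0
    c, s = math.cos(xi), math.sin(xi) / norm
    # Point on the window rim that lies in the same azimuth as a around r.
    return tuple(c * r[i] + s * (pb * r_b[i] + pt * r_t[i]) for i in range(3))
```

A clamped vector stays unit length and sits exactly on the window rim, so consecutive clamped vertices trace the rim or collapse into zero-area round trips.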
The GLSL code for the evaluation process of the diffuse and specular components
is provided below.
void vsatlookup(
    sampler2D ysat,    // the pre-integrated environment map
    sampler3D vcptmap, // the vectorized visibility 3D texture
    vec2 ci,           // the texture coordinates for vmap
                       // corresponding to a given vertex
    vec3 cv,           // the vertex position in global coordinates
    mat3 mtx,          // local to global coordinate system
                       // rotation matrix
    out vec4 ID,       // the diffuse component radiance value
    float phongbound,  // the glossiness
    vec3 ax,           // these three vectors form a coordinate
    vec3 ay,           // system with ay pointing along the reflected
    vec3 az,           // view direction
    out vec4 II)       // the specular component radiance ratio
int DD = textureSize(vcptmap, 0).z; // the number of position
// vector in the visibility
float dd = 1.0 / float(DD);
vec3 td = vec3(ci, 0.5 * dd);
// synthesize visibility
ll = normalize((mtx * texture(vcptmap, td).xyz) - cv);
d0 = Float32Angle(ll); // unit vector to spherical coords
// synthesize visibility
ll = normalize((mtx * texture(vcptmap, td).xyz) - cv);
d1 = Float32Angle(ll); // unit vector to spherical coords
d0 = d1;
a0 = a1;
Figure 2.7. (a) Non-axis-aligned line segment. (b) The bell-shaped function. (c) Comparison of
the rendering results of the ordinary SAT and the generalized SAT. The image rendered using
the generalized SAT lookup eliminates the jagged shadow boundaries that are caused by the
well-known limitation of the ordinary SAT.
2.6 Shader Implementation for the Generalized SAT Lookup 29
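$$\beta = \frac{\int_{0}^{\infty} \operatorname{sech}^2\!\big(\alpha (x - x_c)\big)\, dx}{\int_{-\infty}^{\infty} \operatorname{sech}^2\!\big(\alpha (x - x_c)\big)\, dx} = \frac{1}{2} + \frac{1}{2} \tanh(\alpha x_c), \tag{2.19}$$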
float dl = 0.5;
if (abs(A.x) > 0.0001 && abs(A.y) > 0.0001)
vec2 B = Float32Angle(normalize(( -
* sign(E.w - F.w)));
B.x = (B.x + 0.5 / DSIDE) / 1.5;
B.x += step(B.x, min(cc.x, cc.z)) / 1.5;
B -=;
dl += tanh(5.87 * clamp((B.y / A.y - B.x / A.x), -1, 1))
* 0.5;
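The `0.5 + 0.5 * tanh(...)` form used in the shader excerpt can be checked against a direct numerical integration of the bell-shaped function. The sketch below assumes that the weight is the fraction of a sech² bell centered at x_c that falls on the positive half-line; the function names, the integration bounds, and the midpoint rule are illustrative choices, not part of the original shader.

```python
import math

def sech2(t):
    c = math.cosh(t)
    return 1.0 / (c * c)

def integrate(f, a, b, steps=20000):
    # Plain midpoint rule.
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def coverage_numeric(alpha, x_c, reach=20.0):
    # The bell is negligible beyond |alpha * (x - x_c)| > reach.
    lo = x_c - reach / alpha
    hi = x_c + reach / alpha
    bell = lambda x: sech2(alpha * (x - x_c))
    total = integrate(bell, lo, hi)
    positive = integrate(bell, max(lo, 0.0), hi) if hi > 0.0 else 0.0
    return positive / total

def coverage_closed_form(alpha, x_c):
    return 0.5 + 0.5 * math.tanh(alpha * x_c)
```

With the constant 5.87 that appears in the shader excerpt as alpha, the two functions should agree closely for bell centers near the boundary, which is why a single tanh evaluation can replace the integral.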
To demonstrate that the generalized SAT lookup can appropriately account for the
integral of non-axis-aligned line segments, we compare it to the horizontal line SAT
lookup. If the SAT lookup can only provide the integrals of axis-aligned line segments,
the rendered image has jagged shadow boundaries, as shown in Figure 2.7(c). The
image rendered using the generalized SAT lookup does not have these jagged shadow
boundaries.
Figure 2.8. The rendering results for the teapot model: (a) ground truth, (b) per-vertex, (c) per-
fragment, and (d) dynamic tessellation. Figure (a) is the reference image generated using
6 × 256 × 256 directional light sources, where the visibility is handled using the
shadow volume algorithm. Figures (b)–(d) are the rendering results of our algorithm using dif-
ferent rendering modes. The per-vertex result was intended to be the ground truth; however, it
failed because of its severe aliasing.
second with a GTX 660 display card. Given the image quality requirement and the
rendering capabilities of our approach (see Section 2.1), this rendering speed is practi-
cally impossible for most PRT approaches to achieve. However, the uncompromising
image quality is also the major drawback of our algorithm, because image quality is
not always the primary concern. In many cases, users would rather sacrifice some
image quality for real-time performance. Most PRT approaches handle this gracefully
through their antialiasing feature, whereas neither the per-vertex nor the per-fragment
mode of our algorithm offers such a flexible tradeoff. To cover this loose end, we
improvise a little using the flexible rendering setup of our algorithm.
As mentioned in the introduction, we can use dynamic tessellation to provide a
flexible tradeoff between speed and quality: it is faster than the per-fragment mode,
gives better quality than the per-vertex mode, and scales better. If we examine the
per-fragment rendered result in Figure 2.8(c), we can see that a good portion of the
regions are almost identical to the per-vertex result in Figure 2.8(b). This is equivalent
to saying that many pixels use extremely expensive computation just to render blurry
signals. In other words, the utilization is very low.
To improve the utilization, an intuitive idea is to adjust the sampling rate of
individual triangles on demand according to the signal frequency. With dynamic
tessellation, this amounts to deciding the tessellation level per triangle. For the
frequency estimation, we estimate the frequency per vertex with
where is a blurry version of . Then, for a triangle, we have three frequency values
$freq_0$, $freq_1$, $freq_2$ corresponding to its vertices. The frequency value per triangle
$freq_\Delta$ is taken to be the maximum of the three frequency values. To convert $freq_\Delta$ to
the tessellation level per triangle $Lv_\Delta$, we use the following equation
where tessscale is an arbitrary constant for the tradeoff. Adjusting tessscale to a higher
value results in a finer-grained tessellation. The parameter tessscale is coupled with
the frame rate and adjusted automatically using a PID controller. If the computational
power could be quantified, each unit of it would then bear more value in terms of
rendering quality.
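To make the coupling between tessscale and the frame rate concrete, here is a minimal sketch of a discrete PID loop. Everything in it is an assumption made for illustration: the gains, the per-frame update without a time step, and in particular the toy cost model in which the frame time grows linearly with tessscale between a per-vertex rate of 83 fps and a per-fragment rate of 2 fps. The controller works on the frame time rather than the frame rate because the toy model is linear in that quantity.

```python
class PID:
    """Textbook discrete PID controller, updated once per frame."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def frame_time(tessscale, max_scale=20.0):
    # Toy cost model (assumption): linear between the per-vertex and
    # per-fragment frame times.
    t_vertex = 1.0 / 83.0
    t_fragment = 1.0 / 2.0
    return t_vertex + (t_fragment - t_vertex) * tessscale / max_scale


def simulate(target_fps, frames=200, start=10.0, max_scale=20.0):
    pid = PID(kp=8.0, ki=20.0, kd=2.0)    # hypothetical gains
    tessscale = start
    for _ in range(frames):
        # Positive error means there is frame-time budget left to spend.
        error = 1.0 / target_fps - frame_time(tessscale, max_scale)
        tessscale = min(max_scale, max(0.0, start + pid.update(error)))
    return tessscale, 1.0 / frame_time(tessscale, max_scale)
```

Calling `simulate(20.0)` drives the toy model toward the requested frame rate. In a real renderer the measured frame time would replace `frame_time`, and the gains would need tuning for the actual scene.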
We allocate the GPU memory for the vertices, the inner tessellated vertices of
edges, and the inner tessellated vertices of triangles as three textures, namely vmap,
emap, and fmap (see Figure 2.9), where the sizes of emap and fmap are chosen to be
large enough for the maximum tessellation level. They are managed this way for two
practical reasons. First, any common edge of two neighboring triangles must have a
single tessellation level in order to render a smooth image across neighboring triangles
with different tessellation levels. Second, as our radiance evaluation is computationally
expensive, we want to ensure that the vertices and the common edges are evaluated
only once.
2.7 Dynamic Tessellation 33
Figure 2.9. An illustrative figure to demonstrate the flow of our dynamic tessellation. Three
textures, namely vmap, emap, and fmap store the radiance values of the vertices, the inner tes-
sellated vertices of edges, and the inner tessellated vertices of triangles. Managing it this way
allows us to render the image seamlessly across the triangles with different tessellation levels.
Each row in emap and fmap contains the radiance of the tessellated vertices of a
triangle. Given a tessellation level $Lv$, we can calculate the number of tessellated ver-
tices of the inner edges and the inner triangles in closed form, i.e.,

$$N_{face}(Lv) = \frac{(2^{Lv} - 1)(2^{Lv} - 2)}{2}. \tag{2.23}$$
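The closed-form count for the inner triangle vertices can be cross-checked by enumeration. The sketch below assumes that a tessellation level Lv splits every triangle edge into 2^Lv segments, so that the tessellated vertices form a regular barycentric grid; that assumption and the function names are ours.

```python
def inner_face_vertices(level):
    # Closed form, assuming each edge is split into 2**level segments.
    n = 2 ** level
    return (n - 1) * (n - 2) // 2

def inner_face_vertices_enumerated(level):
    # Count barycentric grid points (i, j, k) with i + j + k == n whose
    # three coordinates are all strictly positive, i.e. points that are
    # neither on an edge nor at a corner of the triangle.
    n = 2 ** level
    return sum(1 for i in range(1, n) for j in range(1, n - i))
```

For example, level 2 splits each edge into four segments and leaves three vertices strictly inside the triangle.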
Each edge and each face is associated with its own tessellation level. Given an edge
with tessellation level $Lv_{edge}$, we evaluate the radiance for only $N_{edge}(Lv_{edge})$
pixels in the emap. Similarly, given a triangle with tessellation level $Lv_{face}$,
$N_{face}(Lv_{face})$ pixels are evaluated in the fmap. The following geometry
shaders update the pixels of the inner edges and the inner triangles
depending on their tessellation levels.
At render time, we prepare the per-vertex frequency map and then the per-face
tessellation level map. Given the tessellation levels, we evaluate the radiance values
for vmap, emap, and fmap. Then, we render vmap, emap, and fmap to a screen
buffer through straightforward rasterization (note that special attention has to be paid
when rendering the common edges). The screen buffer then contains the radiance values
for the diffuse component and the specular component individually. Finally, the
post-processing effects, e.g., texture mapping, material color, and bump mapping, are
applied. The most beautiful part of this process is that it is all completed within the
2.8 Results
Now, we examine the performance aspect of our rendering algorithm. We begin with
the rendering information and the model information. The rendering was carried out
with a GTX 660 display card. The screen resolution was 1280 × 962. The illumination
consisted of 400k light sources (i.e., the St. Peter HDR cubemap; resolution 6 × 256 × 256).
As mentioned in Section 2.1, our algorithm does not require 3D models to be
densely tessellated. As shown in Table 2.1, the girl model and the teapot model are
coarsely tessellated: they have only 6655 and 2038 vertices and 10333 and 3706 faces,
respectively. Their precomputed data sizes are 11.5 MB and 3.14 MB, and the memory
allocated to buffer the required tessellated data is 10.9 MB and 4.1 MB. Given
the GPU memory capacity nowadays (usually in GBs), both the precomputed
data and the buffer for the tessellated data can reside comfortably in GPU memory.
The preparation times for their vectorized visibility are 3.27 minutes and
0.93 minutes, where the preparation time is measured from loading the model file to
having all the required data saved to the hard disk.
The dynamic tessellation here serves to distribute the computational resources
adaptively to the regions in need, and tessscale controls the overall spending of the
computational resources. Figure 2.10 shows how the distribution of samples follows
the content frequency as more and more computational resources are given. As shown
in Figure 2.10, the density of samples starts increasing in the most demanding regions
first, and then in the less demanding regions.
In particular, by evaluating just a fraction of the sample points of the per-fragment
rendering (see Figure 2.10(d)), the dynamic tessellation (see Figure 2.11(d)) attains
an almost identical result to the per-fragment rendering (see Figure 2.11(c)), or even
the brute force rendering (see Figure 2.11(a)).
The PID controller allows the users to specify their expected frame rate. The PID
will adjust tessscale and try to fulfill the user expectation. The realized frame rate is
limited to somewhere in between the per-vertex and the per-fragment frame rates. In
particular, for the girl model, the expected frame rate we specified is 20 fps, the per-
fragment frame rate is 2 fps, and the per-vertex frame rate is 83 fps (see Table 2.1).
The dynamic tessellation improves the rendering speed to almost ten times the per-
fragment frame rate, while achieving almost identical rendering results.
In addition, our algorithm can also capture the visual impression of the specular
component shadows as shown in Figure 2.8 and Figure 2.11. However, the visual im-
pression of the specular component shadows is not very appreciable in static images.
Please visit the video to visualize the behavior of the specular component shadows in
Figure 2.10. The dynamic tessellation with different tessscale for the girl model. Increasing
the tessscale will increase the density of samples for the more demanding regions.
The video of the specular component shadows from our demo program
Figure 2.11. The rendering results for the girl model: (a) ground truth; (b) per-fragment; (c)
per-vertex; (d) dynamic tessellation with tessscale = 10.1. By evaluating just a fraction of the
sample points of the per-fragment rendering (see Figure 2.10(d)), the dynamic tessellation
attains an almost identical result to the per-fragment rendering.
Our algorithm also supports the BRDF editing feature [Ben-Artzi et al. 2006],
which allows us to adjust the specular component glossiness continuously. The video
shows the specular component glossiness transition from mirror to very rough surface6.
The intention of our specular component approximation, Equation (2.12), is to
isolate the influence of visibility from the lighting function as a ratio. For the lighting
function, filtering the illumination without visibility can be handled with relatively
high accuracy even with the primitive form of cubemap mipmap. For the ratio estimator,
because of its geometrical formulation, the shadows make sense topologically on
their own. Therefore, although the rendered results do not agree with the ground truth
precisely, there are no identifiable visual clues within a rendered image, e.g., light
bleeding, to tell whether the ground truth images or our rendered results are closer to
our world (see Figure 2.8 and Figure 2.11). However, the damage caused by the
approximation is indeed bigger than the static image comparisons suggest. The
video7 shows the visual loss due to our approximation, and we can see that our specular
shadows are less responsive to the illumination than the ground truth.
The video demonstration of our BRDF editing feature
2.9 Conclusion
In this article, we presented the rendering algorithm relying on vectorized visibility
and its implementation. By exploiting its vector graphics properties and the GPU
parallel architecture, our rendering algorithm supports a number of functionalities that
appear impractical to support simultaneously, e.g., per-fragment direct illumination
with all-frequency shadow quality, the use of coarsely tessellated 3D models, and
BRDF editing.
By integrating the dynamic tessellation feature, the scalability of our algorithm is
improved drastically: it is faster than the per-fragment mode with better quality than
the per-vertex mode. Having the frame rate coupled with the tessellation level, we
practically make the frame rate an intrinsic property among devices with different
computational capabilities.
The video to demonstrate the visual loss due to our specular approximation
While interreflection is not yet supported by the presented algorithm, it is feasible
to extend our algorithm to support it by approximating the transfer functions in the
higher dimension space, which is one of our ongoing research projects.
2.10 Acknowledgments
The work is supported by a research grant (CityU 11259516) from the Hong Kong Special Ad-
ministrative Region.
BEN-ARTZI, A., OVERBECK, R., AND RAMAMOORTHI, R. 2006. Real-time BRDF editing in com-
plex lighting. In ACM Trans. Graph., 25:3, pp. 945–954.
BRAUER, D. 2010. MatCap. URL:
CROW, F. 1984. Summed-area tables for texture mapping. In ACM SIGGRAPH computer
graphics, 18:3.
ELCOTT, S., ET AL. 2016. Rendering techniques of final fantasy XV. ACM SIGGRAPH 2016
HO, T., XIAO, Y., FENG, R., LEUNG, C., AND WONG, T. 2015. All-Frequency Direct Illumination
with Vectorized Visibility. In IEEE Trans. Vis. Comput. Graph., 21:8, pp. 945–958.
KAUTZ, J., SLOAN, P., AND SNYDER, J. 2002. Fast, arbitrary BRDF shading for low-frequency
lighting using spherical harmonics. In Proc. 13th Eurograph. Workshop Rendering, pp. 291–
LAM, P., HO, T., LEUNG, C., AND WONG, T. 2010. All-frequency lighting with multiscale spher-
ical radial basis functions. In IEEE Trans. Vis. Comput. Graph., 16, pp. 43–56.
LIU, X., SLOAN, P., SHUM, H., AND SNYDER, J. 2004. All-frequency precomputed radiance trans-
fer for glossy objects. In Proc. Eurograph. Symp. Rendering, pp. 337–344.
MATUSIK, W., PFISTER, H., BRAND, M., AND MCMILLAN, L. 2003. A data-driven reflectance
model. In ACM SIGGRAPH 2003 Papers (SIGGRAPH ‘03). ACM, pp. 759–769. DOI:
MORENO, J. 2018. MatCap Shaders.
NG, R., RAMAMOORTHI, R., AND HANRAHAN, P. 2003. All-frequency shadows using non-linear
wavelet lighting approximation. In ACM Transactions on Graphics, 22:3, pp. 376–381.
NG, R., RAMAMOORTHI, R., AND HANRAHAN, P. 2004. Triple product wavelet integrals for all-
frequency relighting. In ACM Transactions on Graphics, 23:3.
RUSSELL, J. 2015. HDR Image-Based Lighting on the Web. In WebGL Insights. CRC Press,
pp. 253–260.
Bibliography 39
SCHEUERMANN, T. AND ISIDORO, J. 2006. Cubemap filtering with cubemapgen. Game Devel-
opers Conference 2006.
SLOAN, P., KAUTZ, J., AND SNYDER, J. 2002. Precomputed radiance transfer for real-time ren-
dering in dynamic, low-frequency lighting environments. ACM Trans. Graph. 21, pp. 527–
SLOAN, P., HALL, J., HART, J., AND SNYDER, J. 2003. Clustered principal components for pre-
computed radiance transfer. ACM Trans. Graph., 22, pp. 382–391.
TSAI, Y. AND SHIH, Z. 2006. All-frequency precomputed radiance transfer using spherical radial
basis functions and clustered tensor approximation. In ACM Transactions on Graphics, 25:3.
WANG, J., REN, P., GONG, M., SNYDER, J., AND GUO, B. 2009. All-frequency rendering of dy-
namic, spatially-varying reflectance. In ACM Trans. Graph., 28, pp. 1–10.
WANG, R., PAN, M., CHEN, W., REN, Z., ZHOU, K., HUA, W., AND BAO, H. 2013. Analytic double
product integrals for all-frequency relighting. In IEEE Trans. Vis. Comput. Graph., 19:7, pp.
3.1 Introduction
Procedural noise functions have been one of the key tools for adding visual fidelity in
computer graphics for decades. They serve as a foundation for landscape geometry
synthesis, creation of textures containing surface properties such as color and normals,
simulation of atmospheric effects and many other tasks.
With the ever-growing game environment scale and the amount of detail expected
from modern games, developers are more and more frequently faced with content pro-
duction challenges. Reduction of the time required for authoring and iterating on con-
tent is always among the hottest topics in the industry.
Procedural content creation is an increasingly popular solution to this problem.
Procedural methods allow developers to automate and simplify tasks, from object
placement to texture creation. Many large studios as well as independent developers
that cannot afford to produce assets manually already employ them. It is highly likely
that these methods are going to become an industry standard, and an integral part of
all modern content pipelines.
Most of the time games cannot afford to evaluate procedural noise functions at
runtime and instead store the precomputed results in textures. The majority of them
only use basic tiling options provided by the hardware. Noise functions that are de-
signed with efficient evaluation on the GPU in mind are periodic with a relatively small
period. Both cases lead to either repetition or loss of detail. A common approach to
overcome this is to increase the texture resolution, to add decal textures or employ
multitexturing. This results in an increase in memory consumption and memory band-
width requirements.
In this chapter, we present a method to combine noise-based procedural texture
synthesis with a nonperiodic tiling algorithm. We describe modifications to several
popular procedural noise functions that directly produce texture maps containing the
42 3. Nonperiodic Tiling of Noise-based Procedural Textures
smallest possible complete Wang tile set. Our approach can be used as a preprocessing
step or during application runtime.
Additionally, we present several improvements of the algorithm [Wei 2004] that
implements Wang tiling on the GPU. We show how our modifications enable nonperi-
odic tiling for a large range of noise-based procedurally generated textures (see an ex-
ample in Figure 3.1). Finally, we analyze the effect these modifications have on the
performance, and discuss their limitations.
Figure 3.2. A schematic representation of Wang tiling. (a) A complete Wang tile set with two
edge colors. (b) A valid Wang tiling of a surface using the tile set from (a).
Cohen [2003] observed that Wang tiles do not take into account tile corners, which
may result in a discontinuity in the resulting image. Corner tiles [Lagae and Dutré
2006] address this problem by restricting the diagonal tile neighbors in addition to the
horizontal and vertical ones.
We observe, however, an insufficiency in modern Wang tile set synthesis methods.
Many procedural noise functions are limited to producing only periodic images. Noise
functions that allow the construction of nonperiodic noise at runtime are usually com-
putationally expensive, which makes them unusable by many real-time applications.
We lift these limitations by combining Wang tiles and procedural noise functions.
In the following section we present modifications to several noise synthesis algo-
rithms (see Figure 3.3) which directly output a minimal complete set of Wang tiles (16
square tiles with 2 edge colors). The tiles are arranged in a single texture map as pro-
posed by Wei [2004]. Our methods guarantee seamless tiling, can be extended to sup-
port more edge colors or higher dimensions, and can be applied to the noise synthesis
either during precomputation or at runtime. Full implementations of the original and
the modified noise functions are available in the accompanying source code.
Many noise functions are derived from Perlin noise. Improved Perlin noise [Perlin
2002] reduced visual artifacts in the derivatives and improved overall noise appearance.
Modified noise [Olano 2005] replaced the permutation table with a hash function
3.3 Nonperiodic Tiling of Procedural Noise Functions 45
We use a similar approach to construct a Wang tile set. We subdivide all the points
of the lattice into two non-intersecting groups: boundary points, which form the
boundaries of the Wang tiles within the texture, and inner points. We further subdivide
the boundary points into corner points, $C$; vertical border points, $V_0$ and $V_1$ (one
subgroup per Wang tile edge color); and horizontal border points, $H_0$ and $H_1$ (see
Figure 3.4(a)). The modified function needs to guarantee that the values at the tile bor-
ders that have the same edge color are the same, or, in other words, that the lattice
points that are at the same position local to the tile and are in the same group share a
gradient vector.
Listing 3.2. A coordinate mapping function used with Perlin noise to produce Wang tiles.
The easiest way to achieve this is to modify the lattice point coordinates. We first
check which group the lattice point belongs to, and then map the points that fall into
the same group to the same region of the lattice (see Listing 3.2).
For simplicity, we chose the following mapping: all corner points end up at $(0, 0)$;
horizontal borders either at $(x_t, 0)$ or at $(x_t + K, 0)$, depending on the edge color; ver-
tical borders at $(0, y_t)$ or $(0, y_t + K)$, where $K$ equals kPointsPerTile, and $(x_t, y_t)$
is the coordinate of the point local to the tile (see Figure 3.4(b)). As long as the regions
that these groups are mapped to do not overlap, this choice does not noticeably affect
the resulting noise.
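The mapping described above can be sketched as a small function. This is an illustrative Python version, not the chapter's listing: the function name, the argument order, and the idea of passing the edge colors in explicitly are assumptions, and inner points are simply left at their original coordinates. Tile boundaries are assumed to lie on lattice coordinates that are multiples of K.

```python
def transform_coord(x, y, k, h_color, v_color):
    """Remap a lattice point of the tile texture.

    x, y     lattice coordinates within the whole tile texture
    k        lattice points per tile (kPointsPerTile in the text)
    h_color  edge color (0 or 1) of the horizontal border the point lies on
    v_color  edge color (0 or 1) of the vertical border the point lies on
    """
    xt, yt = x % k, y % k            # coordinate local to the tile
    if xt == 0 and yt == 0:
        return (0, 0)                # every corner shares one lattice point
    if yt == 0:
        return (xt + k * h_color, 0) # horizontal border, one row per color
    if xt == 0:
        return (0, yt + k * v_color) # vertical border, one column per color
    return (x, y)                    # inner point, left untouched
```

Inner points never have a zero coordinate modulo K, so they cannot collide with the row and column that the border points are mapped to, which is the non-overlap condition mentioned above.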
The last step is to call the TransformCoord function to modify the lattice point
The resulting noise is shown in Figure 3.4(c). The TransformCoord function can
be used together with many other lattice-based procedural noise functions. We suc-
cessfully applied it to modified noise and Perlin noise with the xor hashing function
introduced by Kensler et al. [2008].
In order to avoid the corner problem mentioned in Section 3.2, our modifications
are defined in such a way that all the tile corners are the same, and thus are guaranteed
to not produce any discontinuities.
points that belong to the borders of the second color are in another group, $C_1$:
$C_0 = C \cup V_0 \cup H_0$, $C_1 = V_1 \cup H_1$. Points from the group $C_0$ are mapped to the coordi-
nate local to the tile, $(x_t, y_t)$. Points from the other group are mapped to positions with
a fixed offset from the coordinate local to the tile, $(x_t + K, y_t + K)$, where $K$ is equal
to kPointsPerTile. The resulting noise is shown in Figure 3.5(c).
$$N(x, y) = \iint \gamma(u, v)\, k(x - u,\, y - v)\, du\, dv. \tag{3.1}$$

$$\gamma(x, y) = \sum_{k} a_k\, \delta(x - x_k,\, y - y_k). \tag{3.2}$$
The parameters $K$ and $a$ control the magnitude and the radius of the Gaussian. $F_0$ and
$\omega_0$ control the frequency and the orientation of the cosine wave.
Gabor noise is constructed as a sparse convolution noise that uses the Gabor kernel
as its kernel. Because the power spectrum of a sparse convolution noise is the power
spectrum of the kernel scaled by a constant, as shown by Lewis [1989], the parameters
of the Gabor kernel provide direct control over the power spectrum of the resulting noise.
In order to increase the computational efficiency, Gabor noise is evaluated on a
grid. The properties of the Gabor kernels are generated per cell on the fly using a
PRNG. The size of a grid cell equals the radius of the Gabor kernels. This allows us to
limit the evaluation of the noise function to the cell containing the point being evalu-
ated, and its immediate neighbors.
Lagae et al. [2009] provide ways to produce both periodic and nonperiodic noise.
In order to obtain noise with a period of N, the grid cells are enumerated in row-major
order with cell coordinates taken with modulo N. The seed for the PRNG is then cal-
culated as a sum of the cell index and a global offset parameter. Nonperiodic noise uses
Morton order for cell enumeration.
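The two cell enumeration schemes can be sketched as follows. The function names, the choice of which coordinate occupies the even bits of the Morton code, and the 16-bit coordinate range are assumptions made for this illustration; only the overall scheme (cell index plus a global offset, row-major with wrapping for periodic noise, Morton order otherwise) comes from the description above.

```python
def periodic_cell_seed(i, j, period, offset):
    # Row-major index with the cell coordinates wrapped modulo the period.
    i %= period
    j %= period
    return j * period + i + offset

def morton_index(i, j, bits=16):
    # Interleave the bits of i (even positions) and j (odd positions).
    code = 0
    for b in range(bits):
        code |= ((i >> b) & 1) << (2 * b)
        code |= ((j >> b) & 1) << (2 * b + 1)
    return code

def nonperiodic_cell_seed(i, j, offset):
    return morton_index(i, j) + offset
```

Cells that are one period apart receive the same seed in the periodic variant and therefore the same kernels, while the Morton index is unique for every cell within the supported coordinate range.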
Figure 3.6. Noise function Wang tile texture maps. (a) Anisotropic Gabor noise. (b) Worley
noise. (c) Worley noise from (b) with separated tiles. Note that the corresponding edges of the
tiles marked with the same color have matching borders.
tangent plane for the final noise evaluation, which renders our methods not applicable
to surface noise.
When the noise is mapped to surfaces, special care must be taken to avoid aliasing
on distant objects and on surfaces seen at an angle. While the precomputed noise can
benefit from the existing texture filtering methods, noise functions that are evaluated
at application runtime usually rely on the properties of the noise to reduce aliasing.
Anisotropic filtering is a commonly used technique that increases the image qual-
ity on surfaces viewed at an oblique angle. It takes several samples in an elliptic area,
producing a more accurate approximation of the pixel projection to the screen. When
the filter area crosses the tile borders, it can produce incorrect values. All the discussed
noises except the ones that use the TransformCoord function (Listing 3.2) exhibit
such artifacts only when the anisotropic filter is wider than the width of the tile border.
Perlin noise and other noise functions that require just one row of lattice points for the
tile border always have this error present. The error is nearly invisible due to interpo-
lation between the lattice points, so some games can completely neglect this. Others
may choose to increase the width of the border region to remove it completely.
Using mipmaps is one of the most common ways of filtering textures on distant
surfaces. When dealing with textures containing tiles, one would need to take into ac-
count the way the hardware performs the sampling while downscaling the images in
order to minimize the discontinuities at the tile boundaries between two adjacent levels
of detail.
Our experiments show that anisotropic filtering hides the discontinuities intro-
duced by general-purpose downscaling algorithms used to construct the mipmaps, even
in high-contrast images. In our opinion, using a combination of anisotropic filtering
and mipmaps produces the best visual results (see Figure 3.7).
The proposed methods do not modify the underlying noise function. It follows that
filtering the procedural noise functions that use them can be done with the help of the
techniques that are applicable to the original noise functions. Read Lagae et al. [2010]
for details on these techniques.
piecewise function TileIndex 1D is applied separately to the vertical and horizontal edges
of the tile, and gives the vertical and horizontal position of the tile respectively:
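$$\mathrm{TileIndex1D}(e_1, e_2) = \begin{cases} 0, & e_1 = e_2 = 0; \\ e_1^2 + 2e_2 - 1, & e_1 > e_2 > 0; \\ 2e_1 + e_2^2, & e_2 > e_1 \ge 0; \\ (e_2 + 1)^2 - 2, & e_1 = e_2 > 0; \\ (e_1 + 1)^2 - 1, & e_1 > e_2 = 0. \end{cases} \tag{3.4}$$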
Most applications, however, limit the usage of Wang tiles to just two edge colors,
thus reducing the texture size. We propose a different packing function that can be used
specifically for two edge colors, ensures the same tile layout and is more efficient to
compute than TileIndex 1D e1 , e 2 :
TileIndex 1D e1 , e 2 2e1 e1 e 2
2 e1 e 2 e 1
2 max e1 , e 2 e1 e 2. (3.5)
We provide three equivalent forms of the same function to account for possible perfor-
mance differences in the target hardware.
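That a two-color packing function reproduces the layout of the general piecewise function can be checked exhaustively. The sketch below uses our reading of the piecewise function of Equation (3.4) and of the max-based form of Equation (3.5), namely 2 max(e1, e2) + e1 − e2; the signs in that form are inferred from the requirement that both functions produce the same layout, and the function names are hypothetical.

```python
def tile_index_1d(e1, e2):
    # General piecewise packing function for any number of edge colors.
    if e1 == 0 and e2 == 0:
        return 0
    if e1 > e2 > 0:
        return e1 * e1 + 2 * e2 - 1
    if e2 > e1 >= 0:
        return 2 * e1 + e2 * e2
    if e1 == e2:                      # e1 == e2 > 0
        return (e2 + 1) ** 2 - 2
    return (e1 + 1) ** 2 - 1          # e1 > e2 == 0

def tile_index_1d_two_colors(e1, e2):
    # Branch-free variant, valid only for edge colors 0 and 1.
    return 2 * max(e1, e2) + e1 - e2
```

For two edge colors, both functions place the edge pairs (0, 0), (0, 1), (1, 1), (1, 0) at positions 0, 1, 2, 3, so a texture packed with one can be indexed with the other.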
hash function is the implementation of an MD5 hash1 for GPUs by Tzeng and Wei
Figure 3.8 shows the tile patterns produced by these tiling methods and the sto-
chastic tiling algorithm proposed by Cohen et al. [2003].
3.6 Results
Perlin [1985] showed that his noise function can be used as a building block for many
realistic-looking textures. We would like to show that similar results can be achieved
with the noise functions modified as proposed in this chapter (see Figure 3.9).
Additionally, we would like to make an observation that once a function F is ap-
plied to a Wang tile texture, the result remains a Wang tile texture with the same layout
if F depends on the pixel color, or if it depends on the pixel position and forms a Wang
tile texture with the same layout. This class of functions includes, for example, linear
combinations of Wang tile textures and periodic functions, with the period being a
multiple of the Wang tile size. We show some textures generated with our application
in Figure 3.9. We provide a Unity project with our example implementations of the
shaders producing these textures in the accompanying code samples.
3.7 Performance
We conducted a series of CPU and GPU tests to analyze the impact of the proposed
modifications on the performance. All tests were performed on a computer with an
Intel Core i7-6820HK CPU (2.7 GHz) and a GeForce GTX 980M (8GB) GPU. The
screen resolution was set to 1920 × 1080. The GPU driver settings were left in the de-
fault state, allowing the application to control most parameters. VSync was turned off.
The test code was written in C++ using Visual Studio 2015. The compiler and
linker settings were left in the default state for a console application in release
configuration. The code was written without usage of SIMD CPU extensions or
multithreading. The application used OpenGL 4.5 as a graphics API.
There are alternatives to MD5 which provide a better balance between hash complexity and
tile pattern repetition.
Figure 3.9. Some interesting Wang tile texture maps produced using the presented techniques
(left) and the corresponding tilings (right). Top to bottom: marble, detailed fabric, brick wall,
cobblestones, stained glass.
We measured the average and the median execution time for all the variants of the
noise functions being tested. The time values provided are in milliseconds. The last
two columns of the following tables present the percentage difference in sampling per-
formance between the original noise functions and the noise functions using the pro-
posed modifications.
We chose to use an anisotropic version of Gabor noise for the performance com-
parison. All the parameters of the Gabor kernel were fixed. We measured the impact
of the proposed modifications on the Gabor noise by introducing a value to cap the
number of kernels per cell (we set the noise parameters to have 32 kernels as the mean
value) and comparing the performance with the cap set to 0, 8, 16, 24, 32 and 40.
For the CPU performance test we provided two algorithms that compute the values
of the Perlin noise and of the better gradient noise. The default implementation samples
the function as usual, both the non-tileable and the modified versions. An optimized
implementation first precomputes the lattice gradients into a large lookup table, and
then uses them to efficiently calculate the values of the final function.
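The difference between the two CPU variants can be sketched as follows. This is only an illustration, not the benchmarked code: the integer hash, the eight-direction gradient set, and the names `hashCell`, `gradientAt`, `GradientTable`, and `noise2D` are assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };

// Hypothetical integer hash of a lattice point (not the hash used in the chapter).
inline std::uint32_t hashCell(int x, int y) {
    std::uint32_t h = static_cast<std::uint32_t>(x) * 0x8da6b343u
                    ^ static_cast<std::uint32_t>(y) * 0xd8163841u;
    h ^= h >> 13; h *= 0x85ebca6bu; h ^= h >> 16;
    return h;
}

// Default path: derive the lattice gradient from the hash on every sample.
inline Vec2 gradientAt(int x, int y) {
    const float angle = static_cast<float>(hashCell(x, y) & 7u) * 0.78539816f; // k * pi / 4
    return { std::cos(angle), std::sin(angle) };
}

// Optimized path: gradients of a size x size lattice are computed once and stored.
// Lookups are only valid for 0 <= x, y < size.
struct GradientTable {
    int size;
    std::vector<Vec2> data;
    explicit GradientTable(int n) : size(n), data(static_cast<std::size_t>(n) * n) {
        for (int y = 0; y < n; ++y)
            for (int x = 0; x < n; ++x)
                data[static_cast<std::size_t>(y) * n + x] = gradientAt(x, y);
    }
    Vec2 at(int x, int y) const { return data[static_cast<std::size_t>(y) * size + x]; }
};

inline float fade(float t) { return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f); }
inline float mix(float a, float b, float t) { return a + t * (b - a); }

// 2D gradient noise; `grad` is any callable (int, int) -> Vec2.
template <typename GradientFn>
float noise2D(float px, float py, GradientFn grad) {
    const int x0 = static_cast<int>(std::floor(px));
    const int y0 = static_cast<int>(std::floor(py));
    const float fx = px - x0, fy = py - y0;
    auto corner = [&](int dx, int dy) {
        const Vec2 g = grad(x0 + dx, y0 + dy);
        return g.x * (fx - dx) + g.y * (fy - dy);
    };
    const float u = fade(fx), v = fade(fy);
    return mix(mix(corner(0, 0), corner(1, 0), u),
               mix(corner(0, 1), corner(1, 1), u), v);
}
```

Passing `gradientAt` samples the function directly, while passing a lambda that forwards to `GradientTable::at` trades memory for skipping the hash and trigonometry on every sample.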
The default implementations of both lattice-based noise functions that produce
Wang tiles are several times slower than the original functions. Still, the amount of
time it takes to sample the functions once per pixel in a 2048 × 2048 texture is relatively
small, indicating that they are very cheap to evaluate. The optimized implementations
require additional memory, but make the performance of both the original and the
modified function equal. The optimized Perlin noise implementation is also faster than
the default one by more than 20% (see Table 3.1).
The evaluation of the anisotropic Gabor noise with zero Gabor kernels per cell
closely estimates the performance cost of using the modified noise function. It is
around 40% higher than that of the original function. The evaluation of additional Ga-
bor kernels quickly reduces the weight of the boundary cell remapping to values close
to zero in the overall function sampling duration.
the whole screen. The texture being sampled contained a full mip chain. We tested two
texture resolutions, 1024 × 1024 and 2048 × 2048. We rendered either a single mesh
covering the whole screen or 16 such meshes located at the same depth in order to
simulate overdraw.
Sampling a Wang texture once per pixel carries very little overhead. In this case
the performance drops insignificantly, by less than 2%. Sampling a texture 16 times
per pixel makes the difference more noticeable, up to about 20% (see Table 3.2).
Our tests showed that enabling trilinear and anisotropic filtering did not affect the
Overall, the results of the performance comparison tests indicate that the modifi-
cations presented in this chapter can be adopted by many applications without having
a major effect on the frame rate. Both precomputation and direct runtime evaluation of
the modified noise functions are only slightly slower than the original. Sampling a pre-
computed Wang tile texture is roughly equivalent in performance to evaluating Perlin
noise with simple tiling in the shader, and is at least several times faster than evaluating
noises that offer higher quality.
3.8 Limitations
As mentioned in Section 3.4, Wang tile textures require special downscaling algo-
rithms in order to minimize errors on tile borders between adjacent mip levels. The
same applies to block texture compression methods. Anisotropic filtering, however,
visually hides the discontinuities arising from texture compression as well.
Using a Wang tile texture within a texture atlas presents challenges very similar to
those of using texture atlases with wrapping modes other than clamping. A detailed
discussion is available in [Nvidia 2004]. If the Wang tile textures are of the same size,
texture arrays can be utilized on the hardware that supports them.
Sampling a Wang tile texture requires using a version of the texture sampling func-
tion with explicitly provided gradients. Older or low-power hardware used in some
mobile devices may lack support for gradient computation. Additionally, some old
GPUs in mobile devices incur a performance penalty when the texture coordinates
used to sample a texture are modified in the fragment shader. This can be mitigated in
some cases by adjusting the mesh UV coordinates so that all tile corners correspond
to mesh vertices, and by moving the texture coordinate calculation to the vertex shader.
The presented algorithms do not solve certain inherent problems of Wang tiles.
For example, when the resulting tiles have large-scale distinct features, the tiling pat-
tern becomes quite obvious.
Hash-based evaluation of tile edge colors requires a complete set of Wang tiles to
operate. The number of tiles in the set grows very quickly when adding an edge color
or increasing the number of dimensions. A full 2D set contains N^4 tiles, where N is
the number of colors. This significantly increases the memory requirements for the
precomputed textures. A tile index texture map can be employed to reduce the memory
occupied by the tile set by including only those tiles that are actually used. This, how-
ever, introduces additional complexity to the tile packing step, and adds a texture read
to the shader.
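To get a feeling for how quickly a complete set grows, the count can be computed directly. The sketch below assumes that a d-dimensional tile has 2d colored sides (4 edges in 2D, 6 faces in 3D), which reproduces the N^4 figure for the 2D case; the function names and the uncompressed-texel memory estimate are illustrative assumptions.

```cpp
#include <cstdint>

// Number of tiles in a complete Wang tile set: each of the 2 * dimensions sides
// of a tile can take any of `colors` colors independently.
inline std::uint64_t completeTileCount(std::uint64_t colors, unsigned dimensions) {
    std::uint64_t count = 1;
    for (unsigned i = 0; i < 2 * dimensions; ++i) count *= colors;
    return count;
}

// Rough memory estimate for a precomputed complete 2D set, assuming square
// tiles and uncompressed texels without mipmaps.
inline std::uint64_t tileSetBytes2D(std::uint64_t colors, std::uint64_t tileSizeTexels,
                                    std::uint64_t bytesPerTexel) {
    return completeTileCount(colors, 2) * tileSizeTexels * tileSizeTexels * bytesPerTexel;
}
```

With two edge colors the 2D set has 16 tiles, with three colors it already has 81, and a two-color 3D set needs 64.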
3.9 Conclusion
We presented a set of modifications to several popular procedural noise functions that
directly produce texture maps containing the smallest complete set of Wang tiles. The
proposed modifications can be used both at application runtime and during the preprocessing
steps and can be generalized to higher dimensions and Wang tile sets with more
edge colors.
Figure 3.10. A comparison between simple tiling of a 1024 × 1024 texture (top) and Wang tiling
of a 512 × 512 Wang tile texture map (bottom). The latter is nonperiodic and requires 4 times
less memory.
The modified noise functions retain most of the key characteristics of the original
functions [Kirillov 2018]. We discussed the effect of using the proposed modifications
on the noise function filtering and the mapping of noise functions to surfaces. Addi-
tionally, we presented several improvements of the tiling algorithm on the GPU for
Wang tiles.
Our modifications enable nonperiodic tiling for a large range of noise-based pro-
cedurally generated textures and effects. The presented techniques can be used to pro-
duce large, non-repetitive and detailed terrain geometry, atmospheric effects and
realistic-looking, natural and artificial textures. The option to combine the Wang tile
texture maps while maintaining tile layout further increases the diversity of the possi-
ble results.
The performance tests indicate that the proposed techniques can be adopted even
by high-performance interactive real-time applications like games. The presented
methods offer a potential to decrease the memory consumption of the precomputed
noise-based textures, and enable precomputation for large noise-based textures that
would not fit into the memory budget of an application otherwise, without sacrificing
the final image variance (see Figure 3.10). The Wang tile texture map sampling is also
straightforward to implement. This guarantees smooth integration into both existing
and new applications.
Bibliography
BLUM, L., BLUM, M., AND SHUB, M. 1986. A Simple Unpredictable Pseudo-Random Number
Generator. In SIAM Journal on Computing, 15:2, pp. 364–383. URL:
COHEN, M., SHADE, J., HILLER, S., AND DEUSSEN, O. 2003. Wang Tiles for Image and Texture
Generation. In ACM Trans. Graph. 22:3, pp. 287–294. URL:
EFROS, A. AND FREEMAN, W. 2001. Image Quilting for Texture Synthesis and Transfer. In Pro-
ceedings of ACM SIGGRAPH ‘01, pp. 341–346. URL:
JEANDEL, E. AND RAO, M. 2015. An aperiodic set of 11 Wang tiles. URL:
KENSLER, A., KNOLL, A., AND SHIRLEY, P. 2008. Better Gradient Noise. SCI Institute Technical
Report No. UUSCI-2008-001. URL:
KIRILLOV, A. 2018. Non-periodic Tiling of Procedural Noise Functions. In Proc. ACM Comput.
Graph. Interact. Tech. 1, 2, Article 32. URL:
LAGAE, A. AND DUTRÉ, P. 2006. An Alternative for Wang Tiles: Colored Edges Versus Colored
Corners. In ACM Trans. Graph. 25:4, pp. 1442–1459. URL:
LAGAE, A., LEFEBVRE, S., DRETTAKIS, G., AND DUTRÉ, P. 2009. Procedural Noise Using Sparse
Gabor Convolution. In Proceedings of ACM SIGGRAPH ‘09, pp. 54:1–54:10. URL:
LAGAE, A., LEFEBVRE, S., COOK, R., DEROSE, T., DRETTAKIS, G., EBERT, D., LEWIS, J., PERLIN,
K., AND ZWICKER, M. 2010. State of the Art in Procedural Noise Functions. In EG 2010 – State
of the Art Reports. Eurographics Association.
LEWIS, J. 1989. Algorithms for Solid Noise Synthesis. In Proceedings of ACM SIGGRAPH
‘89, pp. 263–270. URL:
NEYRET, F. AND CANI, M. 1999. Pattern-based Texturing Revisited. In Proceedings of ACM SIG-
GRAPH ‘99, pp. 235–242. URL:
NVIDIA. 2004. Improve Batching Using Texture Atlases. NVSDK 7.0 Whitepaper. URL:
OLANO, M. 2005. Modified Noise for Evaluation on Graphics Hardware. In Proceedings of the
ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ‘05, pp.
105–110. URL:
PERLIN, K. 1985. An Image Synthesizer. In Proceedings of ACM SIGGRAPH ‘85, pp. 287–
296. URL:
PERLIN, K. 2002. Improving Noise. In Proceedings of ACM SIGGRAPH ‘02, pp. 681–682.
STAM, J. 1997. Aperiodic texture mapping. Tech. rep., R046. European Research Consortium
for Informatics and Mathematics (ERCIM). URL:
TZENG, S. AND WEI, L. 2008. Parallel White Noise Generation on a GPU via Cryptographic
Hash. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, I3D
‘08, pp. 79–87. URL:
WEI, L. AND LEVOY, M. 2000. Fast Texture Synthesis Using Tree-structured Vector Quantiza-
tion. In Proceedings of ACM SIGGRAPH ‘00, pp. 479–488. URL:
WEI, L. 2004. Tile-based Texture Mapping on Graphics Hardware. In Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS ‘04, pp. 55–63.
WORLEY, S. 1996. A Cellular Texture Basis Function. In Proceedings of ACM SIGGRAPH
‘96, pp. 291–294. URL:
Rendering Surgery
Simulation with Vulkan
Nicholas Milef, Di Qi, and Suvranu De
4.1 Introduction
While surgical simulation requires much of the same rendering functionality as games,
critical differences necessitate simulation-specific optimizations and engine design de-
cisions that aren’t commonly needed or provided in rendering engines for games. Given
our unique use cases, we take advantage of the explicitness of the Vulkan API (as com-
pared to OpenGL) to develop a rendering engine for surgical simulation. In this article,
we explain how we tailored our rendering engine design around surgery simulation,
including how higher-level design decisions propagate to lower-level usage of Vulkan.
To achieve this goal, our rendering architecture is designed to be flexible, maintainable
and efficient. In surgical use cases, soft tissues are modeled by deformable meshes
which are specially handled by our efficient memory system. We show how perfor-
mance scales with our memory system. Later, we present a case study using our ren-
derer in a virtual cricothyroidotomy (CCT) 3D simulator.
4.2 Overview
Virtual surgery simulators present some unique computational and development chal-
lenges that are less common in other applications such as games. Rendering is partic-
ularly important because the appearance of the simulator must be convincingly realistic
to properly train surgeons for real-life surgery scenarios.
General-purpose game engines often have limited soft-body physics and haptics
support. Platforms such as the Software Framework for Multimodal Interactive Simu-
lation (SoFMIS) [Halic et al. 2011], the Simulation Open Framework Architecture
(SOFA) [SOFA 2018], OpenSurgSim [OpenSurgSim 2017], and the Interactive Medi-
cal Simulation Toolkit (iMSTK) [iMSTK 2018] seek to fill this gap. Our rendering
engine in particular is part of the larger framework of iMSTK. In addition, newer APIs
such as Vulkan [The Khronos Group 2018] provide more capabilities to make better
use of computing resources and allow for more predictable performance compared to
older graphics APIs such as OpenGL.
Shadow Mapping
In some surgery simulation scenarios, many shadow-casting lights are necessary for
helping users judge the depth of their instruments in the virtual environment.
The first pass of each frame includes rendering shadows for each directional light into
shadow maps, which we place into a texture array. Rendering shadow maps into texture
arrays is a common approach [Pettineo 2015] as it allows for the binding of many shadow
maps during a single draw, which is necessary for forward rendering architectures
that evaluate multiple lights in one draw call. We save on material permutations by
reusing the same shadow pipelines for each shadow pass (since the render passes are
compatible). We pass the index of the current shadow map to the shadow
shader using a push constant that accesses an array of light inverse matrices.
Buffer Type            R        G        B        A
Diffuse Accumulation   16-bit per channel color
Specular Accumulation  16-bit per channel color
Normal/SSS             x        y        z        SSS strength
Depth                  32-bit depth
Table 4.1. G-buffer layout.
Deferred Decals
Deferred decals are a low-cost yet flexible method for adding details to underlying
geometry. The decal pass differs from the opaque rendering in that it doesn’t write to
the normal or subsurface scattering (SSS) buffer but rather reads from it. One of the
benefits of using the underlying SSS buffer is that it allows the decal to blend in with
the surface underneath during the SSS render pass, so any marks on the skin, for instance,
will look more naturally integrated. Since the decals are deferred, they need a
normal, and we chose to use the underlying normal and the underlying SSS constant
(Figure 4.2). Many use cases of the decals such as bleeding or marking only make small
changes to the surface normal, so the underlying surface normal can provide an acceptable
approximation.
Figure 4.2. (left) A mesh without a decal, (right) a mesh with a decal affected by subsurface
scattering.
Post Processing
Surgery simulations include organic geometry such as skin and organs, which requires
subsurface scattering (SSS) to display accurately (Figure 4.3). We chose to implement
a screen-space SSS as opposed to a texture-space implementation for several reasons.
First, screen-space avoids overdraw which becomes a problem when inside the body.
Although a depth prepass mitigates this, there’s still the need to perform extra texture
lookups as compared to screen-space methods. Second, the diffusion profile is relatively
similar across materials since most of them have blood and/or fat under the surface. Third,
it samples across different draw calls. This becomes particularly important when a
mesh is split into smaller meshes in order to avoid unnecessary physics computations.
For example, if a section of an organ undergoes operation, then it must be deformable,
but the surrounding organ can be rigid. With screen-space SSS, the SSS can sample
from across both meshes, creating a seamless rendering.
After the lighting pass, separable screen-space SSS [Jimenez et al. 2015] is applied
to only the diffuse buffer. We keep a pool of 3 HDR buffers (one for the specular render
target, one for the diffuse, and one free) in order to ping-pong the diffuse buffer during
the two passes. After the SSS, the specular buffer and diffuse buffer are composited
into the free buffer. These buffers are reused for later passes.
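The buffer bookkeeping described above can be sketched with three indices that swap roles. The struct and function names here are hypothetical, and the actual render pass recording is omitted; only the ping-pong logic is shown.

```cpp
#include <utility>

// Indices into a pool of three HDR render targets.
struct HdrPool {
    int diffuse = 0;
    int specular = 1;
    int free = 2;
};

// Which buffer a pass reads from and which it writes to.
struct Pass { int read; int write; };

// One separable SSS pass: read the current diffuse buffer, write the blurred
// result into the free buffer, then swap the two roles for the next pass.
inline Pass sssPass(HdrPool& pool) {
    Pass p{pool.diffuse, pool.free};
    std::swap(pool.diffuse, pool.free);
    return p;
}
```

Calling `sssPass` twice (horizontal, then vertical) leaves the result in the original diffuse buffer, so the composite pass can read `pool.diffuse` and `pool.specular` and write into `pool.free`.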
Bloom is then calculated in two passes at quarter resolution and then composited
with the previous result. A filmic tone mapping pass [Hable 2010] follows the bloom
pass to map down to a 32-bit sRGB buffer. In the early stages of implementing SSS,
we found that using a non-filmic tone mapper such as Reinhard [Reinhard et al. 2002]
hid the effect of the SSS by desaturating the effect of the diffusion profile.
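For reference, the two families of operators can be compared on a single value. The sketch below assumes the simplest form of the Reinhard operator and the filmic curve with the constants published by Hable for Uncharted 2; whether the renderer uses exactly these constants is not stated in the text, and the curve is shown on a scalar rather than per color channel.

```cpp
#include <cmath>

// Reinhard operator in its simplest form: x / (1 + x).
inline float reinhard(float x) { return x / (1.0f + x); }

// Filmic curve with the constants from Hable's Uncharted 2 write-up (assumed here).
inline float hableCurve(float x) {
    const float A = 0.15f, B = 0.50f, C = 0.10f, D = 0.20f, E = 0.02f, F = 0.30f;
    return ((x * (A * x + C * B) + D * E) / (x * (A * x + B) + D * F)) - E / F;
}

// Normalize so that `whitePoint` maps to 1.0.
inline float filmic(float x, float whitePoint = 11.2f) {
    return hableCurve(x) / hableCurve(whitePoint);
}
```

Both functions map linear HDR input into the displayable range; gamma or sRGB encoding would still follow as a separate step.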
Shaders are compiled as a build step for compilation of the C++ code, so only
SPIR-V binary files are read into the engine at runtime. This creates some problems
with texture resource management, however, since each material can provide a different
number of texture resources. We solve this problem by creating small placeholder textures,
but the lookup of these textures is restricted by specialization constants. Specialization
constants are set during pipeline creation and allow the driver’s shader compiler to
optimize away unused shader code. Other expensive operations, such as PBR lighting
code, can be optimized away for situations that don’t require it. One drawback of using
specialization is that the pipeline objects can become incompatible for similar materi-
als. In this case, they cannot be shared across draw calls.
Our renderer includes a pool of command buffers for geometry passes, and each frame,
a command buffer is recycled and rewritten to. When we finish writing to a command
buffer, we submit it to the driver, and start recording another command buffer. The
command buffer will run asynchronously to our render loop. We needed to avoid read-
write hazards but didn’t want to stall the render loop, so we implemented multi-buffering
for vertex and index buffers (Figure 4.4). We keep as many copies of the data as
there are back buffers in the swap chain; if the application renders with
triple-buffering, then the mesh data also uses triple-buffering. This makes tracking the
region to update simple, as the frame number modulo the number of copies can just be
passed to the update functions.
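The region selection amounts to a modulo on the frame counter. A minimal sketch, with hypothetical names and byte offsets standing in for the actual buffer handles:

```cpp
#include <cstddef>
#include <cstdint>

// One vertex or index buffer that holds several copies of the same mesh data.
struct MultiBufferedRegion {
    std::size_t regionSize;  // bytes for one copy of the mesh data
    std::uint32_t copies;    // assumed equal to the number of swap chain images

    std::size_t totalSize() const { return regionSize * copies; }

    // Offset of the copy that the CPU may write while recording `frameNumber`.
    std::size_t writeOffset(std::uint64_t frameNumber) const {
        return static_cast<std::size_t>(frameNumber % copies) * regionSize;
    }
};
```

With three copies, frames 0, 1, 2, 3 write at offsets 0, `regionSize`, `2 * regionSize`, and 0 again, so a copy is only rewritten after the frames that could still be reading it have left the swap chain.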
Our multi-buffering implementation is similar to unsynchronized multi-buffering
in OpenGL with persistent data mapping [Hrabcak and Masserann 2012], but we have
more control over the memory management and synchronization.
One problem with this approach is that the tangents can diverge from the normals
since tangents depend on the UV coordinates which are likely unique to each vertex,
whereas normals can be shared across vertices. This produced shading artifacts that
were highlighted by our BRDFs, but a simple fix was to orthogonalize the tangent-
bitangent-normal basis in the vertex shader through the Gram-Schmidt process.
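The Gram-Schmidt step removes the component of the tangent that lies along the normal and renormalizes. The text performs this in the vertex shader; the C++ version below is only a CPU-side sketch of the same arithmetic, with a hypothetical `handedness` parameter for mirrored UVs.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 scale(Vec3 v, float s) { return { v.x * s, v.y * s, v.z * s }; }
inline Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
inline Vec3 normalize(Vec3 v) { return scale(v, 1.0f / std::sqrt(dot(v, v))); }

struct Basis { Vec3 tangent, bitangent, normal; };

// Requires a non-zero normal and a tangent that is not parallel to it.
inline Basis orthogonalize(Vec3 n, Vec3 t, float handedness = 1.0f) {
    n = normalize(n);
    t = normalize(sub(t, scale(n, dot(n, t))));  // remove the part of t along n
    Vec3 b = scale(cross(n, t), handedness);      // rebuild the bitangent
    return { t, b, n };
}
```

For a normal of (0, 0, 1) and a skewed tangent of (1, 0, 0.5), the result is the tangent (1, 0, 0) and the bitangent (0, 1, 0).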
The initial allocation size we chose was 16 MiB, and this allocation is used until
it runs out of space and then a new allocation is made. For images and buffers larger
than the allocation size, the allocation is expanded to account for these, so very large
resources can potentially have their own allocation.
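The allocation policy can be sketched as a simple linear sub-allocator. This is an assumption-laden illustration: the class and member names are hypothetical, the actual device memory calls are left out, and a real implementation would keep separate pools per memory type.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kDefaultAllocationSize = 16u * 1024u * 1024u;  // 16 MiB

struct Allocation { std::size_t size; std::size_t used; };
struct SubAllocation { std::size_t allocationIndex; std::size_t offset; };

class LinearAllocator {
public:
    // `alignment` must be a non-zero power of two.
    SubAllocation allocate(std::size_t size, std::size_t alignment) {
        if (!allocations_.empty()) {
            Allocation& current = allocations_.back();
            const std::size_t offset = (current.used + alignment - 1) & ~(alignment - 1);
            if (offset + size <= current.size) {
                current.used = offset + size;
                return { allocations_.size() - 1, offset };
            }
        }
        // First use or out of space: start a new allocation, expanded if the
        // resource is larger than the default allocation size.
        const std::size_t newSize = size > kDefaultAllocationSize ? size : kDefaultAllocationSize;
        allocations_.push_back({ newSize, size });
        return { allocations_.size() - 1, 0 };
    }

    std::size_t allocationCount() const { return allocations_.size(); }

private:
    std::vector<Allocation> allocations_;
};
```

Small resources share the current 16 MiB block, while a 32 MiB resource ends up alone in a block of its own size.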
Mesh Data
Mesh data is handled differently from the other resource types. Deformable meshes
reside in host visible memory, whereas other meshes (e.g., rigid objects) are in device
local memory through staging. Other game engines such as Source 2 have followed a
similar approach for static/dynamic resources [McDonald 2016].
Deformable meshes also take up more space than rigid meshes need as they are
multi-buffered. When the mesh is initially allocated, this extra buffering space is also
allocated in the same location. For host visible memory, this works out well since the
transfer to the GPU is done implicitly, avoiding the overhead of calling transfer functions.
In some operations, such as mesh cutting, additional triangles or vertices
can be added to or removed from the mesh. Adding new vertices or triangles would require
more buffer space. Because cutting is a high-frequency operation, we needed a way to
expand geometry repeatedly without allocating new memory and deallocating old memory,
which could be costly and cause memory fragmentation. To solve this problem, we allow
users to specify a load factor that sets a
maximum size of the geometry relative to the original mesh size. We allocate addi-
tional space within each buffer subsection for each frame (Figure 4.6).
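A small sizing sketch shows how the load factor and the multi-buffering combine. The names and the rounding choice are assumptions; only vertices are shown, and indices would be handled the same way.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

struct DeformableMeshLayout {
    std::size_t vertexCapacity;  // maximum vertices per copy
    std::size_t bytesPerCopy;    // size of one buffer subsection
    std::size_t totalBytes;      // all copies, allocated up front
};

// loadFactor >= 1.0 scales the initial vertex count to the maximum allowed count.
inline DeformableMeshLayout planLayout(std::size_t initialVertices, std::size_t vertexStride,
                                       double loadFactor, std::uint32_t copies) {
    DeformableMeshLayout layout;
    layout.vertexCapacity = static_cast<std::size_t>(std::ceil(initialVertices * loadFactor));
    layout.bytesPerCopy = layout.vertexCapacity * vertexStride;
    layout.totalBytes = layout.bytesPerCopy * copies;
    return layout;
}

// A cut may add vertices as long as the reserved capacity is not exceeded.
inline bool fits(const DeformableMeshLayout& layout, std::size_t vertexCount) {
    return vertexCount <= layout.vertexCapacity;
}
```

A 1000-vertex mesh with a 32-byte stride, a load factor of 2, and triple-buffering reserves 3 × 64,000 bytes once, after which cuts can grow the mesh to 2000 vertices without any reallocation.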
1. Palpating the neck region to identify the locations of the thyroid and cricoid
   cartilages, which are the landmark anatomies in this procedure.
2. Making an incision along the midline of the neck through the skin and the fat
   tissue to uncover the cricothyroid membrane underneath.
3. Inserting an endotracheal tube inside the trachea through the new incision.
There are two main problems with the CCT simulation from a rendering perspective:
efficiently updating the large geometric models representing the fat and membrane tis-
sues and rendering the surface of each cut.
When performing the cut, we needed a visual cue (e.g., bleeding) for the progress
of the cut. We opted to use a pool of blood decals to display the cutting path while it
was being performed by the user. The decals automatically recycle after the pool hits
a certain maximum number so they can be reused for multiple cuts.
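A recycling pool of this kind behaves like a ring buffer. The sketch below uses hypothetical names and stores only a position per decal; once the pool is full, the oldest slot is overwritten.

```cpp
#include <cstddef>
#include <vector>

struct Decal {
    float position[3] = { 0.0f, 0.0f, 0.0f };
    bool active = false;
};

class DecalPool {
public:
    // `capacity` must be greater than zero.
    explicit DecalPool(std::size_t capacity) : decals_(capacity) {}

    // Places a decal and returns the slot that was used. When every slot is
    // taken, the oldest decal is recycled.
    std::size_t spawn(float x, float y, float z) {
        const std::size_t slot = next_;
        Decal& d = decals_[slot];
        d.position[0] = x; d.position[1] = y; d.position[2] = z;
        d.active = true;
        next_ = (next_ + 1) % decals_.size();
        return slot;
    }

    std::size_t activeCount() const {
        std::size_t n = 0;
        for (const Decal& d : decals_) n += d.active ? 1 : 0;
        return n;
    }

private:
    std::vector<Decal> decals_;
    std::size_t next_ = 0;
};
```

With a pool of three, the fourth blood decal along the cutting path reuses slot 0.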
4.10 Acknowledgments
Research reported in this publication was supported by the National Institute of Biomedical
Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under Award
Number 2R01EB005807, 5R01EB010037, 1R01EB009362, 1R01EB014305; National Heart,
Lung, and Blood Institute (NHLBI) of NIH under Award Number 5R01HL119248; National
Cancer Institute (NCI) of NIH under Award Number 1R01CA197491 and NIH under Award
Number R44OD018334.
Bibliography
BAVOIL, L., SAINZ, M., AND DIMITROV, R. 2008. Image-space horizon-based ambient occlusion.
In ACM SIGGRAPH 2008 talks, p. 22.
Skinned Decals
Hawar Doghramachi
5.1 Introduction
Decals that are dynamically added to a scene are a great way to increase interactivity
by allowing the player to change the game environment. An efficient way to do this for
static environments is the use of deferred decals [Krassnigg 2010], [Persson 2011].
However, in general deferred decals fail to deliver convincing results on top of skinned
meshes. A typical use-case is shooting at a character and creating wounds at the impact
position of the projectiles. We will present a technique for dynamically adding decals
on top of arbitrary skinned meshes that is compatible with common rendering architectures
and easy to integrate into existing rendering engines.
5.2 Overview
Dynamically added deferred decals work pretty well as long as the target area is rigid,
i.e., only influenced by one bone. In this case it is possible to intersect a ray that repre-
sents the projection direction of the decal with the skinned mesh and record the triangle
and barycentric coordinates of the intersection. With this information, each frame the
position and normal of the intersection can be updated and a deferred decal applied
accordingly. However, this method fails to deliver convincing results when the target
area is influenced by several bones. Typical artifacts in such cases are decals that are
“swimming” on top of the target mesh, not stretching according to the underlying sur-
face, producing distracting overlaps and projecting on top of areas that were initially
not in the influence area of the decals.
A method that doesn't suffer from the aforementioned problems was mentioned
by [Bronx 2011]. The idea is to render the skinned mesh that receives a decal once
when a new decal is created. Instead of rendering the mesh itself into a render target,
it uses the mesh texture coordinates as vertex shader output position. In a pixel shader
the decal texture coordinates are calculated, similarly to deferred decals, and the cor-
responding texels are fetched from the decal texture and output into the mesh texture.
This implies that each mesh instance has its own texture that serves as render target
resource. Finally, when the skinned mesh is rendered in the shading pass, the updated
mesh texture is fetched at the mesh texture coordinates and the resulting decals are
automatically applied. Unfortunately, there are several drawbacks that make this technique
impractical for game production:
• Usually modern games use several textures for shading purposes. In general you
  can expect to have a diffuse, normal, specular and roughness map and, in the case of
  parallax occlusion mapping and displacement mapping, additionally a height
  map. Several texture maps would therefore need to be duplicated per mesh
  instance, which can have a large memory impact. Since compressing textures at
  runtime is most likely too expensive, textures would need to be stored uncompressed,
  which increases the memory requirements and the memory bandwidth
  when reading from these textures.
• Once a decal is added, it can't be selectively removed from the mesh textures.
The proposed technique is based on the same base idea as the aforementioned
method, but overcomes several of its drawbacks. The idea is to output decal texture
coordinates instead of decal texture values when a new decal is created and the corre-
sponding skinned mesh is rendered. In this way neither does it matter how many texture
maps are utilized (diffuse, normal, specular, etc.) nor do we have to bother with gener-
ating mipmaps for all of these texture maps. Each mesh instance requires only one
additional texture, called hereafter decal lookup map, which holds the decal texture
coordinates, a fading value and an index for each decal added on top of the skinned
mesh. This decal lookup map is used during shading to fetch and apply the decal tex-
tures. In this way decals correctly stretch according to the underlying surface, don't
overlap under motion and don't project on top of mesh areas that were initially not in
the influence area of the decals.
5.3 Implementation
The presented system can be divided into three steps (Figure 5.1) that will be described
in more detail in the following subsections. All explanations assume a column-major
matrix layout, a right-handed coordinate system, an NDC depth range from 0.0 to 1.0,
and the top-left corner as the texture/screen-space origin. Furthermore, the explanations
assume the use of DirectX 12, but the system could also be implemented with, e.g.,
OpenGL or Vulkan.
Figure 5.1. Overview of the three steps involved in the proposed technique. While the third
step is executed each frame, the first two steps are only executed when a new decal is added. The
skinned model used in this image was authored by KatsBits [2014].
are combined with a texture matrix, which maps texture coordinates from the [-1,1] range
into the [0,1] range, to the final decal matrix. Listing 5.1 shows pseudocode for calculating
the decal matrix.
Matrix4x4 projMatrix(2/decalWidth, 0, 0, 0,
0, 2/decalHeight, 0, 0,
0, 0, -1/decalDepth, 0,
0, 0, 0, 1);
Matrix4x4 textureMatrix(0.5, 0, 0, 0,
0, -0.5, 0, 0,
0, 0, 0.5, 0,
0.5, 0.5, 0.5, 1);
output.normal = input.normal;
return output;
target texel already contains information for another valid decal, the newly added decal
texel will be discarded. To check that the target texel already contains information for
a valid decal and not for a decal that is considered to cause an overlap with existing
decals, a small GPU buffer is used, called hereafter decal validity buffer. Each entry
of this buffer corresponds to one decal on the target mesh and is initialized to 0, i.e.,
all decals are valid at the beginning. In case an overlapping decal texel is detected, the
corresponding entry in this buffer is set to 1, so that, on the one hand, this decal won't
prevent more recent decals from being added into the decal lookup map and, on the other
hand, such decals won't be rendered in the shading pass. Since we store the index of each
decal with 8 bits in a decal lookup map texel, this buffer needs only 256 entries. The decal index
is combined with the decal texture coordinates and a fading value and stored as
DXGI_FORMAT_R8G8B8A8_UNORM. If typed UAV reads of this texture format are not
supported, we can pack all values into a bitmask, using a 32-bit integer, and perform
manual bilinear filtering later on. The corresponding pixel shader is shown in Listing 5.3.
Since for the proposed system the texture coordinates of the target mesh have
to be unique, i.e., each triangle of the mesh has to map to a different texture area, we
never run into race conditions where different pixel shader threads write into the same
texel location.
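The 32-bit fallback mentioned above can be sketched as plain bit packing. The channel order (u, v, fade, index from the lowest byte upward) and all names are assumptions for illustration; the text does not specify the layout.

```cpp
#include <cstdint>

struct DecalTexel {
    float u, v;          // decal texture coordinates in [0, 1]
    float fade;          // fading value in [0, 1]
    std::uint8_t index;  // decal index, 0..255
};

// Quantize a value in [0, 1] to 8 bits, rounding to nearest.
inline std::uint32_t toUnorm8(float x) {
    if (x < 0.0f) x = 0.0f;
    if (x > 1.0f) x = 1.0f;
    return static_cast<std::uint32_t>(x * 255.0f + 0.5f);
}

inline std::uint32_t packDecalTexel(const DecalTexel& t) {
    return toUnorm8(t.u)
         | (toUnorm8(t.v) << 8)
         | (toUnorm8(t.fade) << 16)
         | (static_cast<std::uint32_t>(t.index) << 24);
}

inline DecalTexel unpackDecalTexel(std::uint32_t bits) {
    DecalTexel t;
    t.u = static_cast<float>(bits & 0xFFu) / 255.0f;
    t.v = static_cast<float>((bits >> 8) & 0xFFu) / 255.0f;
    t.fade = static_cast<float>((bits >> 16) & 0xFFu) / 255.0f;
    t.index = static_cast<std::uint8_t>(bits >> 24);
    return t;
}
```

Eight bits per texture coordinate give 256 distinct positions along each axis of the decal, which is why the manual bilinear filtering of the unpacked coordinates matters for larger decals.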
// only add decal when all fragments of processed quad have valid
// decal info to ensure correct UV derivatives
if ((valid == 1.0f) && (ddx(valid) == 0.0f) && (ddy(valid) == 0.0f))
// fetch decal diffuse texture
uint diffuseIndex = constBuffer.decals[decalIndex].diffuseIndex;
float4 decalDiffuse =
SampleGrad(trilinearSampler, decalTCFade.xy, derivX, derivY);
To be able to select the correct mipmap from the decal textures, derivatives of the
decal texture coordinates are calculated. This has to be done outside of the branch
where valid decals are applied, since derivatives require valid values across the entire
2 × 2 processed pixel quad. A decal texel is only considered as valid and added on top
of the target mesh, when all decal indices in a 2 × 2 pixel area of the decal lookup map
are valid and equal. To ensure that decal texture coordinate derivatives for trilinear
filtering are valid as well, we need to check the validity of the entire 2 × 2 processed
pixel quad. For this the derivatives of the validity value are used.
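The rule for the 2 × 2 texel area of the decal lookup map can be written down on the CPU as follows. How an empty texel is encoded is not described here, so the `hasDecal` flag is a hypothetical stand-in; the validity buffer follows the convention from the text, where 0 marks a valid decal and 1 a decal that caused an overlap.

```cpp
#include <array>
#include <cstdint>

struct LookupTexel {
    bool hasDecal;       // hypothetical: whether the texel stores any decal
    std::uint8_t index;  // decal index stored in the texel
};

// True only if all four texels hold the same decal and that decal is still valid.
inline bool quadIsValid(const std::array<LookupTexel, 4>& quad,
                        const std::array<std::uint8_t, 256>& validity) {
    const std::uint8_t first = quad[0].index;
    for (const LookupTexel& t : quad) {
        if (!t.hasDecal) return false;            // texel without decal information
        if (t.index != first) return false;       // mixed decals inside the area
        if (validity[t.index] != 0) return false; // decal marked as overlapping
    }
    return true;
}
```

Rejecting mixed or partially filled areas avoids interpolating texture coordinates that belong to different decals, at the cost of losing a one-texel rim around each decal.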
The decal index is used to index into a common GPU buffer and retrieve decal
specific information. In the case of DirectX 12 this information can contain indices
into a common DirectX 12 shader-resource descriptor table to select the corresponding
decal textures. For platforms where textures can't be dynamically indexed, either a
texture array or a texture atlas can be used instead.
Since at the border of decals we can't perform bilinear texture filtering, artifacts in
the form of jaggies will be visible. Such artifacts can be avoided by adding a border a few
texels wide, with an alpha value of zero, to the alpha mask texture of a decal.
Furthermore, not all areas on a skinned mesh are a good fit for a decal, especially areas
that contain a UV seam or are extremely deformed under skinning, as for example the
area around a character elbow. Decals on top of such mesh areas can be easily faded
out by an alpha mask texture that is stored alongside the diffuse, normal, etc. mesh
textures. Mesh areas that don’t have a unique texture coordinate mapping have to be
masked out as well.
• Supports both forward and deferred rendering systems.
• Supports static, dynamic, skinned and transparent meshes.
• Additional geometry has to be rendered only once when adding a new decal.
• Each mesh instance requires only one additional decal lookup map texture,
  regardless of how many textures the mesh uses.
• Textures don’t need to be decompressed.
• Texture mipmaps don't need to be generated at runtime.
• Artifacts from wrong mipmap selection at geometry edges, as often visible with
  deferred decals, are avoided.
• Supports arbitrary blending of all decal attributes with underlying mesh.
• Decals can also be used to perform displacement mapping or to cut out holes.
• Decals can be selectively removed.
• Decal texture coordinates can be animated.
• One decal lookup map and decal validity buffer required per mesh instance.
• Overlapping decals not supported. However, automatically ensuring that decals
  don’t overlap can also be considered desirable, as it avoids decals stacking up on
  each other and causing a performance impact.
• Texture coordinates of the target mesh need to be unique, i.e., each triangle of
  the mesh has to map to a different texture area.
5.5 Results
In Figure 5.2, a decal was dynamically added on top of an animated, skinned mesh and
rendered once as a deferred decal and once with the proposed technique.
Figure 5.2. The images show a decal dynamically added on top of an animated, skinned model,
authored by KatsBits [2014]. The images on the left side show the animation frame where the
decal was added, the images on the right side a pose a couple of animation frames later. The
images on the top were rendered with the proposed technique, the images on the bottom with
the deferred decal technique.
One can observe that while the deferred decal “swims” on top of the underlying
surface, not stretching accordingly, the skinned decal stretches and follows the
underlying surface correctly. Thus, in contrast to the skinned decal, the deferred decal
breaks the illusion of dynamically changing the target surface. It is also visible that
the skinned decal has the same rendering quality as the deferred decal.
5.6 Conclusion
We presented a system to efficiently add high quality, dynamic decals on top of skinned
meshes that correctly follow the underlying mesh surface. It supports common render-
ing systems and can be easily integrated into existing rendering engines.
BRONX. 2011. Deferred decals. Blog post, URL:
KRASSNIGG, J. 2010. A Deferred Decal Rendering Technique. In Game Engine Gems 1, Jones
and Bartlett, pp. 271–280.
PERSSON, E. 2011. Volume Decals. In GPU Pro 2, A K Peters, pp. 115–120.
Some of the coolest features of games nowadays are effects that are typically observed
in the environment. In general, they stand out as visually impressive simulations of
real-world occurrences.
This edition of GPU Zen has two articles in the Special Effects section: the first
article “Real-Time Fluid Simulation in Shadow of the Tomb Raider” by Peter Sikachev,
Martin Palko and Alexandre Chekroun describes how the latest Tomb Raider install-
ment simulates and renders real-time fluids. The contributions of this article to the field
of real-time fluid simulation are:
The second article in this section, “Real-time Snow Deformation in Horizon Zero Dawn: The
Frozen Wilds” by Kevin Örtegren, describes the technique shown on the cover of this
book. This technique models interactions of dynamic characters and objects with the
environment while being persistent and scalable. The development requirements were:
It had to work in a massive open world and on top of certain static objects, like
rocks and roof tops.
It had to run very fast on the GPU, since the GPU frame was already laid out
and optimized for the base game.
No major asset refactor could be done, both to avoid adding to the expansion download
size and because we could not spare artists to go through all the content
manually and fine-tune it for a new system.
—Wolfgang Engel
1.1 Introduction
Fluid simulation has been a thoroughly investigated topic in academia and the VFX
industry. In practice, however, fluid simulation has been implemented in very few major
AAA game titles.
In this chapter, we present our real-time fluid simulation implementation in the
Shadow of the Tomb Raider game. We simulate a dynamic interaction between charac-
ters and fluid substance: oil spots on water, floating algae, etc.
In particular, our contributions are:
Fluid simulation solution for vast areas using scrollable textures (Section 1.3.3).
1.3 Simulation
In this section we describe how fluid simulation works in the Shadow of the Tomb
Raider engine. First, we give a brief primer on the fluid simulation theory in Sec-
tion 1.3.1. After that, we show the simulation data flow on a block scheme and discuss
each step in Section 1.3.2. We discuss the details of particular features in the subsequent
sections.
Please refer to Harris [2006] or Bridson and Müller-Fischer [2007] for a more detailed
description of these stages. In the rest of the article we will focus on the specifics of
our implementation.
where characterPos is the xy world position of the character (assuming z points up),
fluidSimGridResolution is the resolution of the simulation grid, simulationAreaSize
is the size of the simulated area, and oldPos is the snapped position
calculated on the previous frame. Listing 1.1 shows the grid shifting code executed
on the GPU. When the character moves, we shift the grid accordingly, so that
the character is always in the center of it. As velocity and density are the only persistent
textures, only they have to be shifted. The grid is snapped to the world in order to avoid
jittering when the character moves. Figure 1.4 shows the simulation window within a
bigger static mask map.
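The snapping step can be illustrated with a small CPU-side sketch. It assumes that snapping means rounding the character position down to a whole multiple of the simulation cell size; the function names and the floor-based rounding are illustrative assumptions rather than the shipped code.

```python
import math

def snap_to_grid(character_pos, grid_resolution, area_size):
    """Snap a 2D world position to a whole multiple of the cell size.

    Assumption: one cell covers area_size / grid_resolution world units.
    """
    cell = area_size / grid_resolution
    return tuple(math.floor(c / cell) * cell for c in character_pos)

def grid_shift_in_cells(new_pos, old_pos, grid_resolution, area_size):
    """Whole-cell offset by which the persistent textures must be scrolled."""
    cell = area_size / grid_resolution
    return tuple(int(round((n - o) / cell)) for n, o in zip(new_pos, old_pos))
```

Because the shift is always a whole number of cells, the persistent velocity and density textures can be scrolled without resampling, which is what avoids jitter.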
Figure 1.1. Fluid simulation data flow. Blue figures show algorithm steps. Green rectangles
show input texture data for each of the algorithm steps and orange ones show the output data.
Figure 1.2. Oil simulation. On the top, the auxiliary render targets are shown, cropped and
remapped for better readability. Maps, from left to right: velocity, obstacle, obstacle velocity,
divergence, pressure, and pressure gradient.
Figure 1.3. Static map (white) and simulated area (red) debug view.
Listing 1.1. Grid scrolling shader. g_vPosition stands for the character’s position relative to the
object with fluid simulation, g_v(Inverse)Proxy(Half)Size stands for the size of the object
in world units, and g_texInflow is a texture with the static density map.
Figure 1.4. Static density map (white) and simulated density area (red).
Over time, the density, disturbed by fluid dynamics, should fade back into the
static texture. We call this process of fading density back to the default value density
inflow.
Listing 1.2 shows how the inflow is added into the density map. The following
constants are introduced in it:
g_FadeoutStart is the point (in [0, 1] space on the simulation grid) at which
fadeout starts, and $\texttt{g\_FadeoutLength} = 1 / (1 - \texttt{g\_FadeoutStart})$.
coords = GetScrolledCoords(coords);
g_uavOutput[dtID.xy] = result;
For example, a value on the simulation area boundary is always equal to the static map
value. If fadeout starts from the very center of the simulation area (i.e.,
g_FadeoutStart = 0), then any value halfway to the edge (e.g., at (0, 0.25) or (0.75, 0)
of the simulation grid) should be within 0.5 of the static map value. If it is
not, it is “shifted” towards it. This avoids discontinuities between simulated
and non-simulated areas with any pattern of character locomotion.
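The constraint described above can be sketched as a clamp of the simulated density around the static value. This is a minimal CPU-side illustration under stated assumptions: the distance is normalized to 0 at the center and 1 at the boundary, g_FadeoutLength is taken to be the reciprocal of 1 - g_FadeoutStart, and the function names are hypothetical.

```python
def saturate(x):
    return max(0.0, min(1.0, x))

def fade_factor(dist_from_center, fadeout_start):
    """0 where the simulation is unconstrained, 1 on the boundary.

    Assumption: dist_from_center is normalized (0 = center, 1 = boundary)
    and fadeout_start < 1.
    """
    fadeout_length = 1.0 / (1.0 - fadeout_start)
    return saturate((dist_from_center - fadeout_start) * fadeout_length)

def constrain_density(simulated, static, dist_from_center, fadeout_start):
    """Shift the simulated value towards the static one if it deviates too far."""
    allowed = 1.0 - fade_factor(dist_from_center, fadeout_start)
    return max(static - allowed, min(static + allowed, simulated))
```

With fadeout_start set to 0, a value halfway to the edge may deviate from the static map by at most 0.5, and a value on the boundary equals the static map value.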
1.3.6 Advection
Advection is the transfer of fluid properties (including the velocity itself) by the
velocity field of the fluid. Listing 1.3 shows the advection shader. Besides
advection itself, the advection shader performs two other operations.
First, it adds velocity inflow from the static map. Second, it adds viscosity by
simply dampening the velocity map according to the viscosity values.
Different viscosities are used for the two fluids, and the density map from the
previous frame is used to blend between them. Although it is not physically correct,
this approach works very well in practice.
g_uavOutput[dtID.xy] =
#if EffectType == fxVelocity
lerp(g_ViscosityWater, g_ViscosityOil, saturate(g_texDensity.
SampleLevel(BilinearSampler, coords, 0.0f).x)) *
g_texInput.SampleLevel(BilinearSampler, pos, 0.0f);
Listing 1.3. Advection shader. g_FadeoutPower is a fadeout exponent for the velocity inflow
map, g_DeltaTime is the elapsed time from the previous frame, and g_ViscosityOil and
g_ViscosityWater are oil and water viscosity coefficients, respectively.
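The combination of advection and viscosity damping can be illustrated in one dimension. This is a simplified sketch, not the shader itself: it assumes semi-Lagrangian backtracing with clamp-to-edge linear sampling, velocities expressed in cells per unit time, and viscosity coefficients applied as plain multipliers (as the lerp in the listing fragment suggests). The velocity inflow from the static map is omitted.

```python
import math

def lerp(a, b, t):
    return a + (b - a) * t

def saturate(x):
    return max(0.0, min(1.0, x))

def sample_linear(field, x):
    """Clamp-to-edge linear sample of a 1D field at continuous index x."""
    x = max(0.0, min(len(field) - 1.0, x))
    i0 = int(math.floor(x))
    i1 = min(i0 + 1, len(field) - 1)
    return lerp(field[i0], field[i1], x - i0)

def advect_velocity(velocity, density, dt, viscosity_water, viscosity_oil):
    """Backtrace each cell along its own velocity, then damp the result."""
    out = []
    for i in range(len(velocity)):
        source = i - velocity[i] * dt  # where the fluid came from
        damping = lerp(viscosity_water, viscosity_oil, saturate(density[i]))
        out.append(damping * sample_linear(velocity, source))
    return out
```

Cells with density near 1 use the oil coefficient and cells with density near 0 use the water coefficient, mirroring the blend described in the text.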
First, we use local data share (LDS) memory to prefetch from the VRAM. Since
most fetches (except for the boundary ones) are shared between multiple threads, we
effectively reduce the bandwidth required per thread.
Second, we exploit Gather instructions to optimize fetches. We found that
this is highly beneficial even when not using LDS and using two Gather instructions
to fetch just four texels. In our case, when LDS is used, we can group together fetches
corresponding to different threads and thus minimize the waste. In addition, we keep all
transient textures in ESRAM to improve bandwidth.
pressureLDS[grtID.x*2][grtID.y*2] = topLeftPressure;
pressureLDS[grtID.x*2][grtID.y*2+1] = bottomLeftPressure;
pressureLDS[grtID.x*2+1][grtID.y*2] = topRightPressure;
pressureLDS[grtID.x*2+1][grtID.y*2+1] = bottomRightPressure;
/* Do the same for obstacle */
Listing 1.4. Poisson pressure shader. Some parts were edited in order to fit on the page.
Figure 1.5. Algae simulation. In the inset, the static density inflow map is shown (white) with
the simulated density area (red).
Figure 1.6. Algae fluid simulation data flow (density advection only).
Particle Density Accumulation. Listing 1.5 shows the shader that advects particles
and accumulates particle density. We allocate a one-channel 32-bit unsigned integer
accumulation texture. It has a resolution of original simulation grid resolution size
Effectively, we quantize density into ALGAE_DENSITY_QUANTIZATION quanta. We
add a 0.5 offset in order to avoid energy loss during conversions. This quantization is
needed because in Shader Model 5.0, atomic operations (such as InterlockedAdd)
work only with integer values.
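The quantized accumulation can be mimicked on the CPU as follows. This is only a sketch: the value chosen for ALGAE_DENSITY_QUANTIZATION is hypothetical, the plain integer addition stands in for InterlockedAdd, and dequantize is a simplified conversion for illustration, not the resolve shader of Listing 1.6.

```python
ALGAE_DENSITY_QUANTIZATION = 256  # hypothetical value, chosen for illustration

def quantize(density):
    """Convert a float density to an integer; +0.5 rounds to nearest."""
    return int(density * ALGAE_DENSITY_QUANTIZATION + 0.5)

def accumulate(cells, deposits):
    """Add each (cell_index, density) deposit into an integer texture.

    The += below stands in for an atomic InterlockedAdd on the GPU.
    """
    for index, density in deposits:
        cells[index] += quantize(density)
    return cells

def dequantize(value):
    """Simplified conversion back to a float density."""
    return value / ALGAE_DENSITY_QUANTIZATION
```

Without the 0.5 offset the conversion would truncate, so every deposit would lose up to one quantum and the total density would slowly drain.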
Particle Density Resolution. Listing 1.6 shows the shader resolving particles back to
density. The denominator has ALGAE_DENSITY_QUANTIZATION to the power of two
because (before advection) each simulation grid cell contains ALGAE_DENSITY_QUAN-
d max x , y (1.1)
After that, a fluid simulation drawable is created on the main thread and appended
to the drawables list. This drawable encapsulates all fluid simulation parameters (e.g.,
inflow texture maps, viscosity, dissipation, fadeout factors and such) needed to perform
a fluid simulation step.
Finally, when the render thread flushes the drawables list, the Draw() function of
the fluid simulation drawable is called. This is the place where the simulation happens:
compute shaders for the respective algorithm stages are dispatched.
Figure 1.7. Fluid component architecture. Blue boxes stand for classes, green boxes stand for
threads. Rhombus-ending lines stand for aggregation, and arrowhead-ending lines stand for
function calls. The main thread creates a fluid manager, components, and a drawable, while the
rendering thread calls the Draw() function of the fluid drawable.
vector containing density value, velocity field, and pressure packed into its components
and PressureGradient—a precomputed gradient of the scalar pressure value.
Listing 1.7 shows the implementation of the fluid simulation shader node. g_PlayerPos
and g_InvSimulationSize are global constants set by the fluid simulation
drawable. We use a simple binary fadeout, as there is already a fadeout in the simulation,
but a more complex fadeout function could be used. FluidSimulationTexture and
PressureGradientTexture are global textures to which the fluid simulation writes its
results, and they are persistent between frames.
Color = FluidSimulationTexture.SampleLevel(
SamplerGenericAnisoBorder, float3(texCoord, 0), 0);
Color = lerp(float4(StaticMask.r, 0, 0, 0), Color, fadeout);
PressureGradient = PressureGradientTexture.SampleLevel(
SamplerGenericAnisoBorder, float3(texCoord, 0), 0);
Listing 1.7. Fluid mask shader node. g_InvSimulationSize is the inverse of the size of the
simulation area (in world units). g_PlayerPos is the player’s character position, multiplied by
the size of the simulation area and divided by the grid resolution.
float t = Thickness;
float a = abs(dot(CameraVector, Normal));
Figure 1.10) that represents the micro structure of algae found in swamps. Listing 1.9
shows more details about our shader implementation.
1.5 Optimization
We have already shown how to optimize one of the bottlenecks: solving the Poisson pressure
equation. However, there are a few more issues that need to be addressed.
which capsules intersect with which tiles and store this information in a
StructuredBuffer. We use a StructuredBuffer rather than a ConstantBuffer because of the storage
limitations. Finally, instead of querying all the capsules in each grid cell, we only query
the capsules from the respective tile. In practice, this was good enough to handle a
realistic number of NPCs around the main character.
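The tile binning idea can be sketched on the CPU as follows. The capsule layout (two end points plus a radius, in grid-cell units), the square tile size, and the conservative bounding-box test are assumptions made for this illustration; the per-tile lists play the role of the StructuredBuffer contents.

```python
def bin_capsules(capsules, grid_size, tile_size):
    """Return, for every tile, the indices of capsules that may touch it.

    capsules: list of (ax, ay, bx, by, radius) in grid-cell units.
    A capsule is assigned to every tile overlapped by its bounding box,
    which is conservative but cheap.
    """
    tiles_per_side = (grid_size + tile_size - 1) // tile_size
    tiles = [[] for _ in range(tiles_per_side * tiles_per_side)]
    for index, (ax, ay, bx, by, radius) in enumerate(capsules):
        min_tx = max(0, int((min(ax, bx) - radius) // tile_size))
        max_tx = min(tiles_per_side - 1, int((max(ax, bx) + radius) // tile_size))
        min_ty = max(0, int((min(ay, by) - radius) // tile_size))
        max_ty = min(tiles_per_side - 1, int((max(ay, by) + radius) // tile_size))
        for ty in range(min_ty, max_ty + 1):
            for tx in range(min_tx, max_tx + 1):
                tiles[ty * tiles_per_side + tx].append(index)
    return tiles
```

A grid cell then only tests the capsules listed for its own tile instead of every capsule in the scene.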
it very early in frame, and the results are expected relatively late in the frame (as water
objects are usually rendered after all opaque objects have been rendered and lit).
Therefore, we dispatch the fluid simulation on the low-priority compute pipe early in
the frame. To ensure that the simulation is done by the time its results are needed, we
insert a fence at the end of the simulation and a wait on this fence before
the pass where the water is rendered.
We would like to thank several colleagues who made this feature possible. Maximilien Faubert
came up with the original idea to utilize fluid simulation for fluid on water and did the initial
prototype. Vincent Duboisdendien and Jonathan Bard did countless code reviews and helpful
comments on how to improve and accelerate the method. Finally, we would like to thank all the
Shadow of the Tomb Raider team, Eidos Montréal and Crystal Dynamics studio for providing us
with an opportunity to work together on this game and to make this publication happen.
BRIDSON, R. AND MÜLLER-FISCHER, M. 2007. Fluid Simulation. In ACM SIGGRAPH 2007
Courses. URL:
SIGGRAPH 2006 Courses. URL:
CRANE, K., LLAMAS, I., AND TARIQ, S. 2008. Real-Time Simulation and Rendering of 3D Flu-
ids. In GPU Gems 3, pp. 633-675. URL:
GRINSPUN, E. 2018. Animation and CGI Motion. edX. URL:
Bibliography 111
HARRIS, M. 2004. Fast Fluid Dynamics Simulation on the GPU. In GPU Gems, pp. 637–665.
VLIETINCK, J. 2009. Fluid simulation (DX11/DirectCompute). URL:
WROŃSKI, B. 2014. Volumetric fog: Unified, compute shader based solution to atmospheric
scattering. In ACM SIGGRAPH, Advances in the Real-Time Rendering in 3D Graphics and
ZHU, Y. AND BRIDSON, R. 2005. Animating Sand as a Fluid. In ACM SIGGRAPH Papers, pp.
2.1 Introduction
Having dynamic characters and objects interact with the environment makes the scene
more immersive and alive. Typically, games will have some form of foliage interaction
with the character, where leaves and grass bend out of the way when the character
moves through. Another example is ground projected decals when walking on snow.
Both of these commonly used techniques lack persistence and scalability; the foliage
will not be permanently deformed or crushed and the footsteps in snow usually have
an upper limit to the number of decals active at the same time.
Horizon Zero Dawn: The Frozen Wilds is the expansion to Horizon Zero Dawn¹
and it takes place in a new snowy mountain region. Snow covers most of the landscape,
and we thus needed believable snow that solved some of the shortcomings of existing
techniques and worked under the constraints and requirements of this particular
project. Figure 2.1 gives an overview of the results we achieved.
It had to work in a massive open world and on top of certain static objects, like
rocks and roof tops.
Horizon Zero Dawn™ © 2017–2018 Sony Interactive Entertainment Europe. Developed by
Guerrilla. Horizon Zero Dawn is a trademark of Sony Interactive Entertainment Europe. All
rights reserved.
Figure 2.1. Showing the final applied result versus what the deformation is in the system.
(a) Example of snow trails caused by the deformation system. (b) Debug overlay mesh showing
the actual system deformation, colored by its normal map.
It had to run very fast on the GPU, since the GPU frame was already laid out
and optimized for the base game.
No major asset refactor could be done, both to avoid adding to the expansion download
size and because we could not spare artists to go through all the content
manually and fine-tune it for a new system.
2.3 Implementation
This section will go through the different steps included in the snow deformation
algorithm. An overview block diagram depicting the algorithm is shown in Figure 2.2.
2. Deform & temporal filter. Take the rendered height as input and determine for
each pixel if deformation has to be applied while simultaneously performing
temporal filtering on the persistent deformation data. The output of this shader
is both an updated version of the persistent height data, written to one of the
ping-pong buffers to be read back next frame, as well as the packed calculated
normal and height data to the result buffer. (Compute shader)
After finishing the compute shader work, the deformation data may be read from
the result buffer by any shader. The needed buffers for the algorithm are (all are
1024 × 1024 in our case):
1. 2× persistent UNORM8 buffers containing relative deformation height, used as
ping-pong buffers across GPU frames for temporal filtering.
Figure 2.3. Side view visualization of the deformation height above the different height maps.
This also illustrates the use of an orthographic frustum to capture dynamic object depths locally
around the player, where the deformation occurs. (Illustration not to scale.)
2. 1× 16-bit depth buffer containing linear depth for the orthographic camera, only
used within the same GPU frame (can thus be aliased with other render targets).
3. 1× result buffer, UNORM10.10.10.2, which stores the packed height and normal
for other shaders to read.
The relative deformation height stored in the persistent buffers and in the result buffer
represents a height relative to the baked terrain/objects/water height. See Figure 2.3 for
a visualization of this.
of this pass is a simple 16-bit depth buffer containing linear depth and it serves as the
input to the next pass.
To ensure that small objects have enough depth samples rendered, we render the
depth using nonuniform xy axes in NDC. The nonuniformity is achieved by the equation
$P' = P_{xy}^{ndc} \, |P_{xy}^{ndc}|^{-x}$, where $P_{xy}^{ndc}$ is the original post-projected xy coordinate in
$[-1, 1]$, and $x$ is the power of the distribution, limited between 0.8 (extreme distortion)
and 0.0 (linear). While the depth is linear, the normalized device coordinate (NDC) xy
axes are not. This acts as a dynamic level-of-detail, where objects close to the u and v
center axes in texture space get more depth samples. To read back the depth texture,
the same function must be applied to the sampling coordinate; see Listing 2.1 for a
helper function in HLSL.
Shown in Figure 2.5 is the comparison between uniform distribution and the dis-
tribution we used in The Frozen Wilds. Notice how in Figure 2.5(b) our main character
Aloy (in the center) is using many more samples in the depth texture than in Fig-
ure 2.5(a). Aloy’s feet would normally only be a few samples in the depth render, which
would not be enough for detailed deformation.
Listing 2.1. Shader function to convert from uniform normalized UV coordinates to non-uni-
form normalized UV coordinates.
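A CPU-side sketch of such a conversion is shown below. It assumes the warp is applied per axis as the NDC coordinate multiplied by its absolute value raised to the negative power, and that UV and NDC are related by the usual scale and bias; both are assumptions for illustration, and the function name is hypothetical rather than taken from Listing 2.1.

```python
def uniform_to_nonuniform_uv(u, v, power):
    """Map uniform UV in [0, 1] to the nonuniform UV used for sampling.

    power = 0.0 leaves the coordinate unchanged; larger values (up to 0.8
    in the text) give the area near the center more texels.
    """
    def warp(t):
        ndc = t * 2.0 - 1.0            # UV [0, 1] -> NDC [-1, 1]
        if ndc == 0.0:
            return 0.5                 # avoid 0 raised to a negative power
        warped = ndc * abs(ndc) ** (-power)
        return warped * 0.5 + 0.5      # NDC [-1, 1] -> UV [0, 1]
    return warp(u), warp(v)
```

Points near the center move outward in texture space, so a small object under the player covers more depth texels than it would with a uniform mapping.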
plane value (1.0), which means no object was written to that location. If the read depth
value is above the baked terrain/object/water height and below the current snow height,
we apply it as the new snow height. See Figure 2.3 for a visualization of this in action. The
normal is calculated using finite differences between samples in the 3 × 3 filter. The new
height value and normal are packed and written to the UNORM10.10.10.2 output
buffer. We pack the world space normal using Lambert Azimuthal Equal-Area Projection,
since we know that the Z component will always be positive (pointing up in world
space). The height stored represents the relative snow deformation, 0 meaning
untouched snow and 1 meaning fully deformed down to the height data below it (terrain,
rocks, water, etc.). The new result is also written to the next persistent buffer for input
to the next frame, but before that is done we subtract the snow refill, which is based on
the current precipitation rate of the game.
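The per-texel height update described above can be sketched as follows. This is a scalar CPU-side illustration only: it ignores the 3 × 3 filter and the normal calculation, treats a missing depth sample as None, and uses hypothetical parameter names.

```python
def update_deformation(prev_deform, object_height, base_height, snow_depth, refill):
    """Return (result, persistent) relative deformation for one texel.

    prev_deform: previous relative deformation, 0 = untouched, 1 = fully deformed.
    object_height: world height of the lowest dynamic object, or None if
        nothing was rendered at this texel (far-plane depth).
    base_height: baked terrain/object/water height.
    snow_depth: height of the deformable layer above base_height.
    refill: amount of deformation removed per frame by precipitation.
    """
    snow_top = base_height + snow_depth
    current_height = snow_top - prev_deform * snow_depth
    deform = prev_deform
    if object_height is not None and base_height < object_height < current_height:
        deform = (snow_top - object_height) / snow_depth
    persistent = max(0.0, deform - refill)  # stored for the next frame
    return deform, persistent
```

The first value corresponds to what is written to the result buffer, and the second to what is written to the next persistent buffer after the refill has been subtracted.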
Figure 2.7. Illustration of the deformation system during a move. The current texture reads
back previous frame data using an offset, depending on how much the system moved. Textures
in this example are 2 × 2 texels.
Figure 2.8. Example of snow lumps and piles, which respond to the deformation, placed by our
procedural placement system.
and normal in any artist-created vertex or pixel shader. This led to more than just the
terrain using the deformation. Snow lumps, as in Figure 2.8, and other
snowy assets sampled the deformation and responded to dynamic object interaction.
This is achieved by having the actual deformation height of the system differ from the
rendered snow height. In our case, the snow height of the deformation system is at 1
meter, but the visual snow layer is only about 30 cm deep. This allows shaders to know
about deformation which occurs above the snow height and to apply it as deformation
to snow lumps sticking out of the snow layer.
One interesting thing which spawned from this system was interactive thin
ice/snow slush on the surface of lakes, as shown in Figure 2.9. Using the shader node
to sample the deformation data, our artists could make this happen in the water surface
shader. Since objects below the terrain/object/water height do not contribute to the de-
formation, it was even possible to stealth swim below the surface without disturbing it.
2.4 Results
With a 64 × 64 m region around the player, we achieve the desired quality using
1024 × 1024 buffers, which adds up to 8 MB of VRAM. This gives us a persistent
deformation buffer with 6.25 cm resolution, which roughly matches our innermost
terrain LOD mesh resolution. An example of the result buffer can be seen in Figure 2.10.
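The figures above follow directly from the buffer list in Section 2.3, as this short calculation shows. The per-texel byte sizes are derived from the stated formats (UNORM8, 16-bit depth, UNORM10.10.10.2); the variable names are illustrative.

```python
RESOLUTION = 1024        # texels per side
REGION_SIZE_M = 64.0     # meters per side

texels = RESOLUTION * RESOLUTION
bytes_per_texel = (
    2 * 1    # two persistent UNORM8 ping-pong buffers
    + 2      # one 16-bit depth buffer
    + 4      # one UNORM10.10.10.2 result buffer
)
total_megabytes = texels * bytes_per_texel / (1024 * 1024)
texel_size_cm = REGION_SIZE_M / RESOLUTION * 100.0
```

This gives 8 MB in total and a texel footprint of 6.25 cm, matching the numbers quoted in the text.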
Figure 2.9. Aloy swimming and interacting with the layer of snow slush on the surface of the
cold water.
Figure 2.10. An example of the result buffer after a battle with a few Watcher robots.
The GPU cost of this technique is broken down into three parts: clearing the depth buffer,
rendering the dynamic objects, and the deformation compute shader. The cost of rendering the
dynamic objects depends on their number and complexity, but in general it takes 1 µs per
object. The timings for the other two passes can be found in Table 2.1. The minimal
cost of using the result of the deformation shader is one 32-bit texture sample and the
decoding of the height and normal.
interact with each other. Snow from an upper layer (e.g., on a roof top) could fall down
and get added to a snow deformation layer below.
Even though the size of the deformation area gave a plausible result, it would be
interesting to explore a larger scale, more persistent solution using a combination of
“pre-deformed” data, either coming from streamed in data or from our procedural
placement system, and streaming out real-time deformation results to secondary stor-
age to later stream it back in.
One of the more interesting aspects going forwards is repurposing this technique
to allow for interaction with other environmental assets, like vegetation, sand, mud,
water. Rendering depth from above could allow a dynamic precipitation occlusion sys-
tem to be spawned from this.
BARRÉ-BRISEBOIS, C. 2014. Deformable Snow Rendering in Batman: Arkham Origins. In
Game Developers Conference 2014. URL:
MICHELS, A. AND SIKACHEV, P. 2016. Deferred Snow Deformation in Rise of the Tomb Raider.
In GPU Pro 7, pp. 3–16. CRC Press.
ST-AMOUR, J. 2013. Rendering Assassin’s Creed III. In Game Developers Conference 2013.
Shadows are the dark companions of lights, and although both can exist on their own,
they shouldn’t exist without each other in games. Achieving good visual results in
rendering shadows is considered one of the particularly difficult tasks of graphics
programming.
The first article, “Soft Shadow Approximation for Dappled Light Sources” by
Mariano Merchante, proposes to mimic an effect called dappled light. This type
of effect occurs when, for example, light shines through leaves that are very close
together: a small patch of light can travel through them, projecting the sun shape
onto the shadow receiver.
The second article, “Parallax-Corrected Cached Shadow Maps” by Pavlo
Turchyn, is the successor to another great shadow map article in GPU Pro 2 by the same
author. This article describes a parallax correction algorithm for rendering sweeping
shadows from a dynamic light source using a static shadow map. The resulting imple-
mentation uses Cascaded Shadow Maps up to a distance of 30 meters from the camera
and after that Adaptive Shadow Maps covering the next 500 meters range and updated
every 2500 frames. The use of the parallax correction algorithm enables a fairly seam-
less transition between dynamic shadows rendered with these two methods.
—Mauricio Vives
1.1 Introduction
Common shadow rendering techniques rely on solving the visibility problem through
buffers that contain information related to the distance to the light source. These buffers
are commonly referred to as shadow maps [Williams 1978], and although the results can
be filtered with a wide variety of algorithms to get smooth penumbra effects, filtering
is expensive and does not consider either the size or shape of the light source. Well
known examples of filtering approaches include percentage-closer filtering [Reeves et
al. 1987], percentage-closer soft shadows [Fernando 2005], variance shadow maps
[Donnelly and Lauritzen 2006] and moment shadow maps [Peters and Klein 2015]. A
more recent approach uses raytracing and denoising to estimate the penumbra gener-
ated by a polygonal area light [Heitz et al. 2018], but requires a complex rendering
pipeline setup that may not be available to most real-time engines.
Given the complexity of analytically approximating the penumbra generated by
area lights, these real-time techniques usually ignore the pinhole effect that certain
high frequency objects can generate when lit by such lights. This article proposes an
approximation of this effect, which can be seen working in Figure 1.1. In photography,
this phenomenon is called dappled light, and is very characteristic of tree shadows:
when leaves are very close together, a small patch of light can travel through them while
essentially projecting the sun shape onto the shadow receiver. Moreover, it is
particularly evident in the case of crescent shadows during a solar eclipse, as shown
in Figure 1.2. Minnaert [1937] offers a comprehensive introduction to the subject
matter.
This phenomenon does not happen only at perfectly infinitesimal holes in depth
space, as it is an artifact of complex visibility functions. However, the effect can be
approximated by just identifying these points and using subtractive masking on top of
any other conventional technique. This article will describe a simple and practical
Figure 1.1. A sample scene using the proposed technique, inspired by Tufte’s [1997] sculptures.
Figure 1.2. Examples of dappled light in nature. Left: dappled light on a road. Right: Crescent
shadows during a solar eclipse. (Source: Wikipedia)
algorithm that uses an arbitrary shape texture to represent the projected light shape.
Moreover, the majority of the examples presented here will be related to tree shadows,
given that it is one of the most prominent cases of this phenomenon while also being
very common in games and real-time engines. Other situations where this effect occurs
are metal grates, woven patterns and certain furniture.
A percentage of the points in the inner set, A, have greater depth than the mean
of set B plus its standard deviation multiplied by some constant.
A smoother version of this idea can be implemented with a sample weight defined
by distance; this helps by giving more importance to the center samples, thus reducing
the total number of detected pinholes.
Listing 1.1. A naive implementation that selects pinholes when neighbor pixels are very similar
and the center pixel is an outlier with a certain threshold.
Figure 1.3. Left: The core concept of this technique is finding pinholes a) and b) and their
estimated distances to the light, and then projecting textures based on the distance to the receiv-
ing surface. Right top: The original shadow map. Right bottom: The identified pinholes that will
project the light shape, shown here as crosshairs.
1.2 Detecting Pinholes 131
Figure 1.4. Left: The neighborhood segmentation based on radius. Right: An example of a
possible set of depth samples at a cross section of this neighborhood. Ideally, we would desire
the cavity to be close in radius to our expected radius r.
The outer neighborhood B has to have low variance so that we can then approxi-
mate the estimated pinhole depth by using the average depth of the neighborhood.
Cases where there’s too much noise on the outer neighborhood must be ignored, as
there’s no clear way to approximate the combination of light spills happening.
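The two criteria (a low-variance outer neighborhood B, and a sufficient share of inner samples A that are deeper than the mean of B plus a multiple of its standard deviation) can be sketched as follows. The threshold values, parameter names, and the unweighted sample counting are assumptions for this illustration; it is not a reproduction of Listing 1.1.

```python
import math

def is_pinhole(inner_depths, outer_depths, k=2.0, min_fraction=0.5,
               max_outer_std=0.01):
    """Return (found, estimated_depth) for one candidate texel.

    inner_depths: depth samples of the inner set A.
    outer_depths: depth samples of the outer set B.
    estimated_depth is the mean of B, used as the pinhole depth estimate.
    """
    mean_b = sum(outer_depths) / len(outer_depths)
    variance_b = sum((d - mean_b) ** 2 for d in outer_depths) / len(outer_depths)
    std_b = math.sqrt(variance_b)
    if std_b > max_outer_std:
        return False, mean_b  # outer neighborhood too noisy to trust
    threshold = mean_b + k * std_b
    deeper = sum(1 for d in inner_depths if d > threshold)
    return deeper / len(inner_depths) >= min_fraction, mean_b
```

The smoother variant mentioned earlier would replace the plain count with distance-based sample weights, giving the center samples more importance.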
on a compute buffer. This approach is based on Yang et al. [2010], which is generally
used for order-independent transparency. It also exploits the fact that, in general, pin-
holes are sparsely distributed, so these lists can be reasonably small for later inspection.
We can build a grid that uniformly subdivides the shadow map and contains these
lists of samples, as shown in Listing 1.2. However, the actual points are distributed
throughout a global buffer in a very noncoherent way. To improve coherency, an
intermediate compute shader flattens these lists into flat arrays on a separate compute buffer,
which the receiving shadow shader uses. Additional data can be stored, such as the
“intensity” of this pinhole, which we’ll discuss in Section 1.4.1. Finally, the actual
count of samples is stored in a separate buffer so that iterating through these is easier.
Figure 1.5 shows how the buffers are organized.
void ComputePinholes(uint2 id : SV_DispatchThreadID)
float2 uv = ...
bool foundPinhole = FindPinhole(uv, id, ...);
if (foundPinhole)
uint newIndex = g_PinholeLinkBuffer.IncrementCounter();
if (newIndex >= TotalMaxPinholeCount)
// Atomic swapping
g_OffsetBuffer.InterlockedExchange(offset * 4,
newIndex, prevIndex);
Listing 1.2. An example implementation of the linked list generation. This approach is very
similar to Yang et al. [2010].
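The data structure behind this listing can be mimicked on the CPU as follows. This is a single-threaded sketch: the list append stands in for the atomic counter increment, the head swap stands in for InterlockedExchange, and the class and method names are hypothetical.

```python
END_OF_LIST = 0xFFFFFFFF

class PinholeLists:
    """Per-cell linked lists stored in one global link buffer."""

    def __init__(self, cell_count, max_pinholes):
        self.heads = [END_OF_LIST] * cell_count  # first link of each cell
        self.links = []                          # (payload, next_index)
        self.max_pinholes = max_pinholes

    def insert(self, cell, payload):
        new_index = len(self.links)              # counter increment
        if new_index >= self.max_pinholes:
            return False                         # global buffer is full
        previous = self.heads[cell]              # atomic exchange on the GPU
        self.heads[cell] = new_index
        self.links.append((payload, previous))
        return True

    def flatten(self, cell):
        """Walk one cell's list, newest entry first."""
        out = []
        index = self.heads[cell]
        while index != END_OF_LIST:
            payload, index = self.links[index]
            out.append(payload)
        return out
```

The flatten step corresponds to the intermediate pass that copies each cell's list into a coherent array for the receiving shadow shader.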
1.3 Shadow Rendering 133
Figure 1.5. Left: The shadow map is subdivided into a global buffer with linked lists and a
separate buffer that indicates each cell’s first link. Right: The required buffers. From top to bot-
tom: 1) The global linked list data, which uses an atomic counter along with a 2) buffer that
maintains the list starting element. 3) The coherent uniform grid buffer; each cell has a maximum
amount of pinholes, but 4) for reducing the amount of iterations, a size buffer is used.
$S_p = (d_r - d_p)\, S_L$ (1.1)
Here, $d_r$ is the distance from the receiving point to the light, and $d_p$ is the estimated average
distance of the pinhole’s neighborhood to the light. $d_p$ is used to provide a good
estimate of the pinhole’s abstract position.
Using a single shape texture for all pinholes may be limiting, because it implies
that the source light is infinitely far away and thus the projected shape is always similar,
disregarding size. Because of this, it is also possible to extend this method to use dif-
ferent shaped textures depending on the direction towards the light, in a similar way to
view-dependent impostors [Brucks 2018]. This can be useful if the light has an uncon-
ventional shape (e.g., a star polyhedron) and is very close to the occluders.
Finally, if the light shape is simple enough, it is possible to define it with a function
of the pinhole’s local UV space. For example, a solar eclipse can be described with
two overlapping circles, and the subtracting inner circle’s position can be driven by a
function of time. This may reduce texture lookups drastically and improve performance
if the shape function is computationally cheap. The local pinhole UV can also be ro-
tated or distorted easily through a matrix multiplication or any kind of domain warping
for specific scenarios, e.g., for shadows seen through animated water.
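As an illustration of such a shape function, the eclipse case can be written with two circle tests in the pinhole's local UV space. The radii, the linear motion of the occluding circle, and the function name are assumptions chosen for this sketch.

```python
def eclipse_shape(u, v, time, sun_radius=0.4, moon_radius=0.4):
    """Return 1.0 where the projected light shape is lit, else 0.0.

    time in [0, 1] moves the occluding circle from the left edge (0.0)
    through the center (0.5, full coverage) to the right edge (1.0).
    """
    sun_dx, sun_dy = u - 0.5, v - 0.5
    in_sun = sun_dx * sun_dx + sun_dy * sun_dy <= sun_radius * sun_radius
    moon_center_x = time  # assumed linear motion across the UV square
    moon_dx, moon_dy = u - moon_center_x, v - 0.5
    in_moon = moon_dx * moon_dx + moon_dy * moon_dy <= moon_radius * moon_radius
    return 1.0 if in_sun and not in_moon else 0.0
```

Evaluating this function replaces a texture lookup per pinhole, which is where the potential performance gain comes from when the function is cheap.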
Figure 1.6. Pinholes can leak through occluders. To prevent this, we must still store the actual
depth (not its estimated neighborhood average) at the pinhole’s position in the depth map,
so we can use that to prevent leaking. Left: No occlusion. Right: Occlusion considered.
1.4 Temporal Filtering
return accum;
Listing 1.3. The pinhole rendering code. This method is called over every uniform grid cell
inside a maximum predefined radius.
1.4.2 Merging
In addition to temporal accumulation, it is also desirable to merge any pair of points
that fall within a defined radius. When merging two pinholes, we combine their
positions using linear interpolation, letting the developer control the result
by choosing a bias between the new and the old point. Merging can be executed while
flattening the linked list generated from the pinhole detector, but it requires $N^2$
iterations over the pinhole buffer. This technique also forces the user to do more
bookkeeping, as multiple buffers with different structures need to be kept and traversed.
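The merging step can be sketched on the CPU as follows. This is a minimal Python illustration under stated assumptions, not the chapter's implementation: pinholes are reduced to 2D positions, the function names are hypothetical, and each new point merges with the first stored point found within the radius.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def merge_pinholes(old_points, new_points, radius, bias=0.5):
    """Merge new pinholes into the stored ones.

    bias = 0 keeps the old position, bias = 1 snaps to the new one.
    The nested loop is the quadratic cost mentioned in the text.
    """
    merged = list(old_points)
    for nx, ny in new_points:
        for i, (ox, oy) in enumerate(merged):
            if math.hypot(nx - ox, ny - oy) <= radius:
                merged[i] = (lerp(ox, nx, bias), lerp(oy, ny, bias))
                break
        else:
            # No stored pinhole is close enough: keep the new one as is.
            merged.append((nx, ny))
    return merged
```

A small bias keeps pinholes stable at the cost of lagging behind moving occluders, which is the trade-off that the sliding and swimming artifacts discussed next depend on.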
This filtering technique can cause pinholes to slide and swim through the shadow
map, which can visually break the phenomenon. Thus, it is important to define reason-
able parameters that limit the amount of movement a pinhole can have, and to carefully
merge with neighbor cells. If moving pinholes (Figure 1.7) are not properly merged,
there will be clear artifacts at the edges of cells where pinholes will stop moving and
just decay until discarded.
Figure 1.7. A pinhole has moved from one cell to another in this frame, and should be merged.
If the merging does not occur, the previous pinhole will decay at the edge of the cell, generating
an unwanted artifact.
1.4.4 No Filtering
If the scene or the shadow casters responsible for pinhole generation are nearly static,
then this technique is very effective. It can be used for baked lighting, where temporal
filtering is unnecessary but the shadow phenomenon is desired on the static receivers.
Additionally, if the user desires to have the projected (but static) pinholes interact
with dynamic objects, the pinhole buffer can be calculated and stored offline. This
removes the per-frame cost of pinhole detection, but requires a bit more infrastructure
in the engine. The runtime cost of evaluating the projected textures is still necessary,
1.5 Results
Runtime performance is highly dependent on the scene and the number of pinholes per
frame. Shadow map resolution and the number of shadow-receiving pixels being
evaluated on screen also contribute to performance. The maximum possible pinhole count
and the uniform grid subdivision count are the two biggest performance factors,
as is usual when iterating over uniform grids (see Table 1.1). The technique is also
fill-rate bound, considering that the pinhole search is done in the fragment shader of
each receiver.
Because the radius of the projected light shape is proportional to the distance from
the pinhole to the receiver, the number of neighbor cells iterated increases when objects
are far away, negatively impacting performance, as can be seen in Table 1.2. Having
either a maximum search size or transitioning to a default soft shadow model might be
enough to hide this limitation. At very big pinhole sizes the approximation breaks down
visually, so it is desirable to prevent this from happening anyway.
Sampling light shape textures is another bottleneck, as it must occur for each
pinhole in a specific neighborhood inside the uniform grid. Considering this limitation,
it is better to design shapes that can be procedurally defined, reducing memory
lookups. Finally, downsampling the shadow map also helps substantially, as both the
lookups necessary for pinhole detection and the number of pinholes found decrease.
This also benefits pinhole detection by offloading the filtering to the rasterizer.
In practice, we found that the simple 9-tap pinhole detector works best, coupled
with the simpler temporal filtering approaches. In a way, if the pinholes are moving too
much the result becomes stochastic and approximate, but at least it is not jarring to the
eye. Using the more robust pinhole detection with a radius has multiple problems. First,
the complexity of estimating the mean and variance of the segmented kernel makes it
very expensive. However, our implementation used a very basic local mean and vari-
ance estimator, and by using shared memory, most of the impact from texture lookups
can probably be mitigated. Additionally, selecting a predefined radius is not good for
arbitrary scenes with animations, and even if we sampled at different radii and averaged
the results, this would require even more samples, making it impractical.
Finally, a comparison with PCSS can prove useful. A naive implementation of
PCSS can generate similar results, but requires many samples to reduce noise and cap-
ture pinholes, as it has no particular search method for them. It also cannot approximate
arbitrary light shapes, although sampling a shape texture may be an interesting ap-
proach for future work. Figure 1.8 shows the resulting shadows. Overall, our method
can achieve similar quality of high-frequency shadows as PCSS with fewer filtering
1.6 Conclusion and Future Work
Figure 1.8. Left: Our pinhole approximation with a 64×64 uniform buffer and 16 pinholes per
cell. Right: Naive PCSS with 32 blocker search samples and 256 PCF samples. Note that PCSS
has better shadow edges, but cannot approximate the same optical properties (in the case of
non-circular shapes) and requires many more samples for its computation.
Figure 1.9. Left: Common shadow mapping with simple filtering (Unity). Right: Using a cir-
cular shape to estimate the light shape.
Figure 1.10. Shadows during a simulated eclipse, where the area light is occluded procedurally.
Figure 1.11. Non-photorealistic shape textures, which can be used for artistic control of a
scene, or simulating uncommon light shapes (for example, an LED array light will generate
shadows similar to those from the light shape in the second image).
acceleration structure for the pinhole search, such as a quadtree. Clever use of down-
sampling techniques can also reduce the lookup complexity in most cases. To improve
the pinhole detection, one could use a moment map for approximating the variance. To
improve the temporal filtering, one could implement a more robust clustering approach
that doesn’t consider just close pairs of samples. We could also combine our technique
with PCSS, either by extending the blocker search with information from the
pinhole buffer, or by applying the same shape texture concept based on the blocker
search region. Finally, partial occlusion of the projected pinhole shapes by close
occluders can also aid in simulating this effect.
Parallax-Corrected Cached Shadow Maps
Pavlo Turchyn
2.1 Introduction
Rendering shadows over large viewing distances often requires processing a large number
of shadow-casting objects. Many game engines that use shadow maps for
long-range shadow rendering opt for caching schemes that distribute the
cost of shadow map rendering over several frames, thus exploiting frame-to-frame
coherency and rendering only a subset of shadow casters per frame, e.g., Schulz and
Mader [2014] and Acton [2012]. Some game engines cache occlusion data derived
from shadow maps rather than keeping plain shadow maps, e.g., Valient [2012] and
Gollent [2014].
Figure 2.1. Shadows from a moving directional light rendered using two shadow map cascades.
The first (near) cascade is updated every frame, and the second (far) cascade is cached and
invalidated infrequently. The left image shows a mismatch between the cascades, since the cached
cascade is rendered with a light direction that was captured many frames ago. Parallax correction
fixes this divergence, as shown in the right image.
144 2. Parallax-Corrected Cached Shadow Maps
However, caching is problematic when the shadow casting light is dynamic, e.g.,
in a game with a dynamic day-night cycle where the sun or moon is constantly moving
across the sky, thus making cached data inconsistent with the current light state. One
has to either invalidate the cache often to keep the divergence small, which makes
caching a less efficient optimization, or treat cached shadows as a very rough
approximation of actual shadows because the error is too apparent when viewed up close.
In this paper, we describe a parallax correction algorithm for rendering sweeping
shadows from a dynamic light source using a static shadow map. The use of parallax
correction in Far Cry 5 enabled a fairly seamless transition between dynamic shadows
rendered with two different techniques: near shadows with cascaded shadow maps
(CSM) updated every frame, and far shadows with adaptive shadow maps (ASM)
covering a 500-meter range and updated every 2500 frames. As a result, we use the
relatively expensive CSM for rendering shadows only within 30 meters of the player’s
camera, which is quite a short range for an open-world game.
$P_{occ} = P_0 + d_0 L_0.$ (2.1)
Suppose the light is moving, and its new direction is $L_1$. Let’s project $P_{occ}$ along the new
direction $L_1$ to get a point $P_1$:
$P_1 = P_{occ} - d_1 L_1.$ (2.2)
2.2 Parallax Correction Algorithm
Figure 2.2. Parallax correction algorithm. (a) The shadow map is rendered for a light direction
$L_0$. The point $P_{occ}$ starts to occlude $P_1$ rather than $P_0$ when the light changes its direction
from $L_0$ to $L_1$. Our idea is to sample shadows at $P_0$ using the shadow map computed for light
direction $L_0$, and take the resulting shadow factor as the shadow intensity at $P_1$. (b) We walk
along the direction D starting from $P_1$ in small increments, sampling the shadow map at each
point $S_i$ and accumulating occluder distance values. We stop after a certain number of
iterations. We compute the average of the accumulated occluder distances, which gives us an
approximation of $P_0$ via Equation (2.4).
The practical meaning of Equation (2.2) is that we can compute shadows at $P_1$, i.e.,
$P_1$ will be in shadow for any $d_1 > 0$. So far we have been following this reprojection route:
take a point $P_0$, reconstruct its occluder from a shadow map, and then use the
reprojected occluder when shading the scene. However, we are really interested in doing
these steps in the reverse order. For any given light direction $L_1$ and a point $P_1$, we want
to find the corresponding $P_0$, so that we can sample shadows at $P_0$ using the shadow
map computed for light direction $L_0$ and then use the resulting shadow factor for
shadow intensity at $P_1$.
Our parallax correction algorithm can be briefly described as this: we want to get
to $P_0$ from $P_1$. For this, we need two things: a good guess of the direction from $P_1$ to $P_0$
and a good guess of the length of the path from $P_1$ to $P_0$. Let’s elaborate how to obtain
these values. Substituting Equation (2.1) into Equation (2.2), we get
$P_0 = P_1 + d_1 L_1 - d_0 L_0.$ (2.3)
We attempt to solve Equation (2.3) by assuming $d_1 = k d_0$, where k is a constant. We
will discuss how we choose the value k later in this section. Here we only note that this
assumption enforces a certain relation between $P_1$, $P_{occ}$, and $P_0$. This gets us
$P_0 = P_1 + d_0 (k L_1 - L_0).$ (2.4)
Thus, if we want to compute shadows for light direction $L_1$ at an arbitrary point $P_1$, we
can use Equation (2.4) to find the corresponding point $P_0$, provided that we can compute
the distance to the occluder $d_0$ and choose a reasonable value of k.
Occluder search. As follows from Equation (2.4), $P_0$ is located somewhere on the ray
starting from $P_1$ in the direction
$D = k L_1 - L_0.$ (2.5)
We search for an occluder by marching along this ray with a certain number of
iterations, computing the occluder depth at each step, and then taking the average for $d_0$. This
process is illustrated in Figure 2.2(b) and Figure 2.3(a).
The search distance $d_s$ is a scene-dependent parameter that accounts for the maximum
displacement of the shadows due to the parallax we are expecting. That is, we need a
larger search distance for a scene with long shadows and tall shadow casting structures.
Conversely, the search distance may be shorter for a scene with an overhead light and
small shadow casters. An increase in the difference between $L_0$ and $L_1$ also increases
shadow parallax and thus the search distance.
Choosing the parameter k. The point $P_0$ is located somewhere on the ray originating
at $P_1$, with the ray direction D being controlled by the parameter k. Figure 2.3(b)
illustrates that changing the value of k results in a different D, with values k > 1
corresponding to the point $P_0$ being closer to the occluder than $P_1$. Ideally, having $P_0$ on the
surface of an object containing $P_1$ would give the most accurate shadows, but this is
hardly possible in practice. In our example, $P_0$ will be either above the surface if we
choose k = 2 or k = 3, or even below the surface if we choose k = 1.5.
Figure 2.3. Occluder search. (a) Occluder search with 5 iterations and k = 1.5. We are sampling
shadow map depths at each step to compute the distance from a point on the ray to its occluder,
if there is any. (b) Effect of the parameter k on the search vector D. The value k = 1.5 would give
incorrect results because some points on the search ray are located inside shadow casting geometry.
The best value for k depends on the scene. Consider a difficult situation when we
want to compute parallax correction at a point located on a concave surface. It’s quite
probable that the occluder search may be testing points under the surfaces of nearby
objects as illustrated in Figure 2.3(a), thus interpreting surrounding geometry as an
occluder. Choosing a larger value for k can help prevent these errors in concavities, as
shown in Figure 2.3(b). However, larger values of k can cause the method to miss
smaller occluders, thus producing inaccurate results. Due to this tradeoff, one needs to
pick the value of k that produces the best results for a given scene.
We use the following empirical formula, where k depends on the magnitude of the
difference between light directions:
$k = 1 + \lVert L_1 - L_0 \rVert.$ (2.6)
Finally, we can gather all the bits we have described so far into the sample code given
in Listing 2.1.
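// Occluder search
float3 S = P1;                                   // start marching at the shaded point
float3 dS = mul(float4(D, 1), searchParams);     // search step derived from D
float sum = 0, cnt = 0;
for (int i = 0; i < OCCLUDER_SEARCH_ITS; ++i)
{
    // Project the current sample into shadow map space.
    float3 Sp = mul(float4(S, 1), worldSpaceToShadowMap);
    float occDepth = SampleShadowMapDepthTexture(Sp.xy);
    if (occDepth < Sp.z)
    {
        // The sample is occluded: accumulate its distance to the occluder.
        sum += Sp.z - occDepth;
        cnt += 1;
    }
    S += dS;
}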
Listing 2.1. Example implementation of the parallax correction for a simple orthogonal shadow map.
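To show how the pieces fit together end to end, here is a CPU-side Python sketch of Equations (2.4) to (2.6). It is an illustration under several assumptions rather than the shipped shader: the shadow map lookup is abstracted into a hypothetical callback `occluder_distance(S)` that returns the distance from a point to its occluder along $L_0$ (or `None` if unoccluded), Equation (2.6) is taken to be k = 1 + |L1 - L0|, light directions point towards the light, and the toy scene `plate_distance` is invented for the example.

```python
import math

def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def length(a): return math.sqrt(sum(x * x for x in a))

def parallax_correct(p1, l0, l1, occluder_distance, search_dist, iterations=8):
    """Return the point P0 at which the cached shadow map should be sampled."""
    k = 1.0 + length(sub(l1, l0))          # assumed form of Eq. (2.6)
    d = sub(scale(l1, k), l0)              # Eq. (2.5): D = k*L1 - L0
    d_len = length(d)
    if d_len == 0.0:
        return p1                          # light has not moved
    step = scale(d, search_dist / (iterations * d_len))
    s, total, count = p1, 0.0, 0
    for _ in range(iterations):            # occluder search along D
        dist = occluder_distance(s)
        if dist is not None:
            total += dist
            count += 1
        s = add(s, step)
    if count == 0:
        return p1                          # nothing found: no correction
    d0 = total / count                     # average occluder distance
    return add(p1, scale(d, d0))           # Eq. (2.4)

def plate_distance(s):
    """Toy cached shadow map for L0 = (0, 1, 0): a plate at height 2 over x in [0, 1]."""
    x, y, _ = s
    return 2.0 - y if 0.0 <= x <= 1.0 and y < 2.0 else None
```

With the light tilted towards +x, a ground point slightly left of the plate is lit according to the stale map but shadowed according to the corrected lookup, which is the behavior the algorithm is meant to recover.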
Algorithm limitations. While the algorithm works well as long as the parameters $d_s$
and k are chosen appropriately, a high variation in depth of overlapping shadow casters
results in incorrect parallax correction. It is caused by the occluder search hitting the
furthest occluder and computing parallax correction using a biased occluder distance,
which may distort shadows from closer shadow casters overlapping with shadows from
more distant ones, as shown in Figure 2.4. This is similar to the occluder fusion
problem existing in some soft shadows algorithms, such as percentage-closer soft shadows
(PCSS) [Fernando 2005].
Figure 2.4. Defects occurring with high depth variation of overlapping shadow casters. From
left to right: a shadow from the cone is overlapping with a shadow from the cylinder, which is
much taller and further away from the camera; the PCSS estimator gives an incorrect penumbra
size, resulting in a large penumbra near the cone base; our parallax correction algorithm also
produces incorrect results for the same reason (overestimation of the distance to the occluder),
resulting in shadow distortion; the penumbra of the parallax-corrected shadows is incorrect too.
2.3 Applications of Parallax Correction
Cached cascaded shadow maps. CSM caching implies that not all cascades are
updated within one frame. One way to do that is skipping cascade updates for a certain
small number of frames, either replacing cascade updates with another workload of
similar complexity or updating distant cascades in a round-robin manner. The system
keeps using matrices and shadow map textures cached from previous frames to apply
shadows to the current scene. We suppose that applying parallax correction in this sce-
nario doesn’t offer a lot of improvements since shadow maps are meant to be updated
quite frequently (every other frame or so), thus small discrete changes in light direction
aren’t too noticeable.
The work by Schulz and Mader [2014] employs a different approach to caching,
with a single shadow map containing only static objects replacing the last two cascades.
Shadows cover a 1.4 km range, so the full update of this shadow map takes 10–15 ms
distributed over many frames. In this scenario, parallax correction can reduce the
visual discontinuity between the cascades that is caused by the lengthy shadow map
update process, similar to what is demonstrated in Figure 2.1. One should just update the
static shadow map periodically rather than only at certain preset points in game levels.
Acton [2012] utilizes a toroidal update scheme, also found in other algorithms
such as clipmaps [Asirvatham and Hoppe 2005], to minimize the number of shadow
casting objects rendered into cascades. Their main observation is that moving the
player’s camera only changes the cascade’s frustum translation, but not its size or
orientation, as long as the shadow casting light is static. Typically there’s only a small
difference between the current frustum and the frustum from the previous frame. Thus,
the contents of the shadow map would be nearly the same save for a few small regions. One
can perform a toroidal update reusing a large portion of the previously rendered shadow
map, and only rendering objects falling into the parts near the shadow map border that
were invisible previously. Parallax correction improves consistency between the cas-
cades updated every frame and the cascades updated via toroidal update, thus allowing
cached data to be reused for a longer period before the cached cascades need to be
rebuilt with a new light direction.
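The toroidal addressing idea can be illustrated in one dimension with a few lines of Python. This is only a sketch with hypothetical function names: the cached map is treated as a ring buffer of `size` texels, and when the covered window moves, only the newly exposed world texels are mapped to physical texels that must be re-rendered.

```python
def wrap(texel, size):
    """Map a world-space texel index to its physical slot in the ring buffer."""
    return texel % size  # Python's % is non-negative for a positive size

def dirty_texels(old_origin, new_origin, size):
    """Physical texels to re-render after the window [origin, origin + size) moves."""
    shift = new_origin - old_origin
    if abs(shift) >= size:
        return list(range(size))           # nothing can be reused
    if shift > 0:
        world = range(old_origin + size, new_origin + size)
    else:
        world = range(new_origin, old_origin)
    return [wrap(w, size) for w in world]
```

Applying the same computation independently to rows and columns yields the border strips of a 2D shadow map that need new shadow casters rendered into them; everything else is reused in place.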
Adaptive shadow maps. Shadow map caching is an essential part of the adaptive
shadow maps algorithm, e.g., Turchyn [2011]. It’s possible to discretize light direction
movements and build a separate hierarchy of tiles for each quantized light direction.
Aside from possibly noticeable steps in the light directions, this also implies that we
have to maintain two hierarchies whenever we want to update the shadow maps. One
of the hierarchies is used for shading the current frame, and the other one is in the
process of construction. Having two hierarchies at the same time means we need to
double the size of the tile cache, and we also pay extra costs to render the tiles. Parallax
correction addresses these issues.
We can start using the tiles rendered with the new light direction as they become
available, rather than waiting until the full tile hierarchy is ready. This way shadows
are sampled from a mix of old and new tiles with the parallax correction ensuring
shadow consistency as shown in Figure 2.5. Therefore, we can start discarding old tiles
as the new tiles become available, thus reducing tile cache memory requirements and
improving cache utilization.
2.4 Results
A major challenge in the development of Far Cry 5 was the addition of new rendering
techniques, such as screen-space reflections, that were not present in the engine
previously. The existing subsystems had to become faster to accommodate the new techniques,
thus the shadow rendering budget was reduced from 6 ms to 4.5 ms. The Far Cry series
has a long history of using cached shadow maps [Valient 2012]. However, cached
shadows were always treated as a low-quality solution for objects further away from the
player’s camera, hence cascaded shadow maps used to cover quite a large viewing
distance. Adding parallax correction and improving cached shadow map filtering quality
allowed having cached shadows closer to the camera, thus reducing the range covered
by CSM from 80 to 30 meters. Table 2.1 shows examples of the resulting performance
Figure 2.5. Parallax correction not only enables smooth sweeping shadows with the adaptive
shadow maps algorithm, but also makes it possible to start evicting old tiles from the cache
before the update is fully finished, without the discontinuities visible in the left image.
(a) ASM shadows from a mix of new and old tiles. (b) The same set of tiles with parallax
correction applied.
We are using adaptive shadow maps for shadows covering the range from 30
to 500 meters from the camera. A typical cost of rendering a single ASM tile is around
0.5 ms on the GPU and up to 1.5 ms on the CPU on PS4. An ASM light direction
update is triggered every 1–2 minutes of normal gameplay time, so we’re avoiding the
update costs in the vast majority of frames. Our implementation of the parallax correc-
tion is relatively lightweight, with the typical GPU cost being 40–70 µs on PS4. We are
performing the occluder search with 7 steps over a low-res downsampled depth map
generated using a min-depth kernel over the shadow map (see depth extent map in
Turchyn [2011]), which improves search accuracy while keeping the number of itera-
tions low.
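A min-depth downsample of this kind can be sketched in Python as a simple block-wise minimum. The function name and the list-of-lists representation are assumptions for illustration, and the sketch assumes that smaller depth values are closer to the light, so each low-resolution texel keeps the nearest occluder in its block.

```python
def downsample_min(depth, factor):
    """Reduce a 2D depth map by keeping the minimum of each factor x factor block."""
    h, w = len(depth), len(depth[0])
    return [
        [
            min(
                depth[y][x]
                for y in range(by, min(by + factor, h))
                for x in range(bx, min(bx + factor, w))
            )
            for bx in range(0, w, factor)
        ]
        for by in range(0, h, factor)
    ]
```

Because each coarse texel conservatively summarizes its block, a search step that lands anywhere inside the block still sees the closest occluder there, which is why few iterations can cover a long search distance.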
The importance of consistency between cached and dynamic shadows is clearer in
motion than in static screenshots such as Figure 2.6. Without parallax correction, our very
infrequently updated cached shadows often resulted in lighting being completely
different when transitioning between dynamic and cached shadow maps. In motion,
this change between the two types of shadows was perceived as a cross-fade
between unrelated images rather than a change in a shadow’s details. See this book’s
online sample code for a demonstration of parallax correction in motion.
Figure 2.6. Parallax correction improves the transition between cascaded shadow maps in the
foreground and adaptive shadow maps in the background in Far Cry 5. (a) A mismatch between
long-range and dynamic shadows due to a slow update of cached long-range shadows.
(b) Parallax correction fixes the mismatch so that the long-range shadows are perceived as a
level of detail of the dynamic shadows rather than something unrelated.
3D Engine Design
Welcome to the 3D Engine Design section of GPU Zen’s second volume. The five
chapters presented here are a reflection of the latest trends in modern 3D engine design,
as shown through advances in realism, material synthesis, ray-tracing, as well as tar-
geting the latest graphics API standards.
The section starts with Sergey Makeev’s chapter “Real-Time Layered Materials
Compositing Using Spatial Clustering Encoding”. Sergey presents an algorithm that
mimics the Allegorithmic Substance texture pipeline as closely as possible, but in real
time. It uses a layered materials method which allows us to create composite materials
using a large number of layers. This technique was successfully applied in the
rendering of armored vehicles in the published action multiplayer tank game “Armored Warfare”.
Next, Thomas Deliot and Eric Heitz’s chapter “Procedural Stochastic Textures by
Tiling and Blending” describes a production-ready algorithm that synthesizes
infinitely-tiling stochastic textures from small input texture examples. The technique runs
in a fragment shader and requires no more than 4 texture fetches and a few computations.
The third chapter in this section is “A Ray Casting Technique for Baked Texture
Generation” by Alain Galvan and Jeff Russell. This chapter shows how to bake high-
polygon geometry to textures meant to be used by low-polygon geometry using GPU
ray-casting. Computation times are reduced drastically compared to classical CPU-
based baking tools. The chapter shows example shaders to bake various types of tex-
tures, as well as highlighting a number of potential pitfalls inherent in the process.
In the fourth chapter, “Writing an efficient Vulkan renderer”, Arseny Kapoulkine
explores key topics for implementing Vulkan in modern 3D engines. The topics include
memory allocation, descriptor set management, command buffer recording, pipeline
barriers, and render passes. The chapter also discusses ways to optimize the CPU and GPU
performance of production desktop and mobile Vulkan renderers today, and looks at
what a future-looking Vulkan renderer could do differently.
The fifth chapter “glTF - Runtime 3D Asset Delivery” by Marco Hutter explains
Khronos Group’s glTF – a transmission and delivery format for 3D assets. The chapter
starts with the goals and features that are achieved with glTF and their technical im-
plementation. Then the role of glTF in the 3D content creation workflow is laid out,
showing the tools and libraries that are available to support each step of the content
creation process, and how glTF may open up new application areas that rely on the
efficient transfer and rendering of high-quality 3D content.
I hope you enjoy learning from this section’s authors’ experiences, and do not
hesitate to share with us your latest findings and experiences around 3D Engine Design!
—Wessam Bahnassi
1.1 Introduction
Most modern rendering engines use a library of simple,
well-known materials and a layered material representation to author detailed,
high-quality in-game materials. Popular tools used in texturing pipelines nowadays (e.g.,
Allegorithmic Substance Painter and Quixel NDO Painter) are also based on the concept
of layered materials [Neubelt and Pettineo 2013, Deguy et al. 2016, Karis 2013].
In this chapter, we present an algorithm that uses a layered materials method which
allows us to create composite materials using a large number of layers in real time. Our
algorithm is designed to mimic the Allegorithmic Substance texture pipeline as closely as
possible, but in real time. The proposed technique is based on the blending of multiple
well-known materials, where a shared materials library defines the surface properties
for each material used in compositing.
Using our method, each mesh can have one unique UV set and several unique
texture blend masks, where each blend mask defines the per-pixel blending weights for
a material from the library. Each material from the library can use a detail-texture
technique to improve the resolution of surface details. Using materials with detail textures
for the composition has the advantage of breaking the texture resolution barrier and
allows us to produce a final composition at a very high resolution. Having
high-resolution in-game materials is especially crucial in the 4K era.
Our method supports the replacement of a library of materials and transparency
modifications for the texture blend masks at runtime. Material replacement at runtime
leads to a different visual appearance of the resulting composited material which is
especially important for games supporting User Generated Content or in-game
Customization. The presented technique is used for rendering armored vehicles in Ar-
mored Warfare, an action multiplayer tank game published by
Figure 1.1. An example of Dynamic Material Layering. This example mesh uses three texture
blend masks to define the blending transparency and three library materials to define the surface
1.3 Introduced Terms
Material Template. One single well-known material such as gold, steel, wood,
etc. Material Template can use a tiled detail texture to give the illusion of greater
detail for a material. Material Templates are used as basic blocks to create com-
plex multi-layered materials.
Color ID. A color-coded texture which defines areas of UVs that belong to dif-
ferent opaque materials. The opaque material does not have a blend mask asso-
ciated with it, and it is always used as a bottom layer in our composition. A
color-coded representation where each unique color represents a single material
is used to simplify the content pipeline and reduce the number of required tex-
tures. Each color-coded texture can represent several opaque materials as shown
in Figure 1.2.
Layered Material. This is a material definition which is used to build the final
composite material. Each Layered Material has a single Color ID associated
with it and an ordered set of Material Masks which define the composition order
and the blending weights of the materials. Layered Material also has a set of
Material Templates associated with it to define the visual appearance of each
material used in a composition.
Figure 1.2. Several opaque material masks combined into a single Color ID map and applied
to the mesh.
shared texture channels. At runtime we use this clustered representation and the set of
the material templates to make the final material composition as shown in Figure 1.3.
Since we are storing the different blend masks in the same texture channels, it is
critical to take into account texture filtering boundaries between different clusters.
Texture filtering across different blend masks leads to errors during the composition stage due
to the leaking of texture blend masks from one material into another. While building
the material clusters, we therefore consider which neighboring pixels are involved in the
texture filtering and use this information when creating the clusters.
To create the initial partitioning into the clusters, we perform a connectivity anal-
ysis for the set of Material Masks. Connectivity analysis classifies all the texels which
are used for texture filtering as connected. If the texels are classified as connected, they
will belong to the same material cluster. When we perform a connectivity analysis, we
should also take mipmap texture filtering into account. At the same time, we should
limit the number of supported mipmap levels; otherwise, at the very last mipmap level
all the texels will be classified as connected. For our implementation, we decided to
support only the first four mipmap levels. Smaller mipmap levels are not handled by
our implementation and are discarded. Supporting only the first four mipmap levels is
enough to preserve a good quality of texture filtering and keep the number of
resulting clusters small. An incomplete mipmap chain might lead to aliasing, but its
level is acceptable [Mittring 2008]. In practice, the resulting aliasing is barely
visible and is effectively removed by most modern anti-aliasing algorithms.
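A simplified version of the connectivity analysis can be sketched in Python with a union-find structure. The sketch makes several assumptions that go beyond the text: only the base mip level is considered, a bilinear footprint is approximated by linking every covered texel to its eight neighbors, texels with no blend mask are ignored, and the function name `build_clusters` is hypothetical.

```python
def build_clusters(masks):
    """Group covered texels into clusters and collect each cluster's materials.

    masks: dict mapping a material name to a 2D list of blend weights.
    Returns a dict mapping a cluster representative texel to a set of names.
    """
    names = list(masks)
    h, w = len(masks[names[0]]), len(masks[names[0]][0])
    covered = [[{n for n in names if masks[n][y][x] > 0} for x in range(w)]
               for y in range(h)]
    # Every texel that carries at least one material starts as its own cluster.
    parent = {(y, x): (y, x) for y in range(h) for x in range(w) if covered[y][x]}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # Texels that can share a filter footprint are classified as connected.
    for (y, x) in list(parent):
        for dy, dx in ((0, 1), (1, 0), (1, 1), (1, -1)):
            nb = (y + dy, x + dx)
            if nb in parent:
                ra, rb = find((y, x)), find(nb)
                if ra != rb:
                    parent[rb] = ra

    clusters = {}
    for (y, x) in parent:
        clusters.setdefault(find((y, x)), set()).update(covered[y][x])
    return clusters
```

Extending the sketch to the first four mip levels would amount to also linking texels that fall under the same texel of each coarser level, which is what makes the number of supported levels a direct control on cluster size.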
Each resulting cluster should contain no more than a limited number of materials,
where that number depends on how many per-pixel material layers we need
to support. In practice, the number of materials used in a cluster is usually
four or five, since we store the cluster blend weights in the BC1 or BC3 texture format.
A practically unlimited total number of materials combined with five per-pixel materials
is enough to represent even a very complex layered material.
Figure 1.3. An example of the use of the presented technique. Several texture blend masks
are encoded as a single RGB weights texture and a single cluster indirection texture. The encoded
material blend masks and material templates from the library are composited to get the final image.
Constructing the clusters with a limited number of materials is not always possible,
since we can find more connected materials than the maximum allowed number of
materials per cluster. As a result, we can end up with a cluster that uses more than the
maximum allowed number of materials. We can split such clusters into several smaller ones
that meet our initial requirements. This will lead to a texture filtering error for the texels
shared between the cluster edges. Filtering errors will occur due to erroneous texture
filtering between blend masks from different clusters in which different materials
are encoded (see Figure 1.4). We propose a solution that minimizes leaking of the
texture blend masks while splitting such clusters. For more details, see Section 1.5.7.
Figure 1.4. The texture unit uses blend masks from different Clusters during texture filtering.
The result of such filtering is a leaking of the material boundaries which leads to visual artifacts
in the composition stage.
Cluster Weights. The weight texture defines the blending weight for each material
in the set of materials specified by the cluster ID. We support up to
five different material masks per pixel, where the weights are stored using the
BC3 texture format. Cluster Weights are stored in a texture that has the same
resolution as the input Material Masks: although the set of used materials rarely
varies, neighboring texels of a blend mask can differ significantly. Cluster Weights
can be correctly filtered inside the same material cluster. Texture data is stored
with mipmaps and fetched using a TRILINEAR or ANISOTROPIC texture filtering
mode.
At the composition stage, we obtain the cluster ID for each fragment, which defines
the set of used materials and the blending weights. Then, using the Cluster Properties, we
obtain the surface properties for each material used in the current cluster.
Afterwards, we use the blend weights and the surface properties for the final composition
of the surface properties for a given fragment. See Figure 1.5 for more details.
Then we repeat this operation for all the blend masks used for blending:
This approach depends on the order of operations. Instead of using alpha blending,
we can rewrite the blending equation in weighted form:
$$C_{\text{final}} = w_0 C_0 + w_1 C_1 + w_2 C_2 + \cdots + w_n C_n,$$
$$w_k = (1 - \alpha_n)(1 - \alpha_{n-1}) \cdots (1 - \alpha_{k+1})\,\alpha_k.$$
Since the resulting weights are normalized, the sum of all weights is
always equal to one. We can use this property to reconstruct one of the weights inside
the compositing pixel shader instead of storing this weight in a texture channel:
$$w_n = 1 - w_0 - w_1 - w_2 - \cdots - w_{n-1}.$$
each material layer using a MAX filter to the resolution corresponding to the last supported
mipmap level. If the resulting texel was marked as unused in the previous step,
this texel is located outside of the valid UV mapping and will not be used in the com-
Algorithm 1.1. Converting the texture blend mask to a normalized weighted form.
Figure 1.6. Splitting bitmap data to the graph vertices. Red, Green and Blue circles represent
areas covered by the different texture blend masks.
Figure 1.7. Resulting undirected graph. Arrows represent edges which indicate the texture fil-
tering relationships between vertices.
Next, our algorithm chooses the best solution from all the solutions that have been
tried. In this way, it attempts to find the optimal subset A, the one with the minimum
sum of edge weights to cut. See Algorithm 1.2 for implementation details. Figure 1.9
demonstrates one iteration step of the algorithm.
Figure 1.9. One step of the graph partitioning algorithm. The vertex V 5 moved from subset B
into subset A.
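One iteration step of this kind can be sketched as below. The sketch only illustrates how the cut cost of candidate moves might be evaluated; it is not Algorithm 1.2 itself, and the Edge structure and function names are assumptions made for the example.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Edge { int a; int b; float weight; };

// Sum of the weights of all edges with exactly one endpoint in subset A.
float cutWeight(const std::vector<Edge>& edges, const std::vector<bool>& inA)
{
    float sum = 0.0f;
    for (const Edge& e : edges)
        if (inA[e.a] != inA[e.b]) sum += e.weight;
    return sum;
}

// Tries moving each vertex of subset B into subset A and returns the vertex
// whose move gives the smallest cut weight, or -1 if B is empty.
int bestVertexToMove(const std::vector<Edge>& edges, std::vector<bool> inA)
{
    int best = -1;
    float bestCost = std::numeric_limits<float>::max();
    for (std::size_t v = 0; v < inA.size(); ++v)
    {
        if (inA[v]) continue;
        inA[v] = true;                       // tentatively move v into A
        const float cost = cutWeight(edges, inA);
        inA[v] = false;                      // undo the move
        if (cost < bestCost) { bestCost = cost; best = static_cast<int>(v); }
    }
    return best;
}
```

A full partitioner would repeat such steps while respecting the per-cluster material limit and keep the cheapest partition seen so far.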
Figure 1.10. Final data generated by our implementation. Cluster weight texture (left) and In-
direction texture (right). Indirection texture is colorized and upscaled by 8 times for demonstra-
tion purposes. Image courtesy of Mail.Ru Group.
struct SurfaceParameters
{
    float3 albedo;
};

struct ClusterParameters
{
    SurfaceParameters layer0;
    SurfaceParameters layer1;
    SurfaceParameters layer2;
    SurfaceParameters layer3;
};

// Weights texture
Texture2D cWeights;
// Indirection texture
Texture2D cIndirection;
// Material parameters (stored per cluster)
StructuredBuffer<ClusterParameters> clusterParameters;

// Fetch weights
float4 weights;
weights.xyz = cWeights.Sample(samplerTrilinear, uv).rgb;
// Reconstruct weight
weights.w = 1.0 - weights.x - weights.y - weights.z;
// Fetch index
uint clusterIndex = cIndirection.Sample(samplerPoint, uv).r;
Figure 1.11. Rendering using detail maps (left) and without detail maps (right). Image courtesy
of Mail.Ru Group.
1.6 Results
We used the approach described in this article for rendering the armored vehicles in
the Armored Warfare game. In our game, users can customize the coloring and materials
used for rendering the armored vehicles. The presented technique minimizes
the number of textures stored on disk while supporting high-quality textures and
allowing customization of the visual appearance. You can see some results of using
our technique in Figure 1.12. We compared the number of instructions generated for our
technique with the number generated for the Unreal Engine 4 material layering
technique; see Table 1.1. Table 1.2 shows build times and the number of resulting
material clusters for the armored vehicle.
Figure 1.12. An example of composited material and some material templates used in compo-
sition (Top) and the same composited material with different material templates applied (Bot-
tom). Image courtesy of Mail.Ru Group.
Also, the proposed method supports real-time material recomposition using the proposed
data representation. For the most effective use of our method, it is necessary to
take into account the texture-space connectivity of the different texture blend masks at
the earliest stages of the art pipeline. At the same time, the proposed method is suitable
for any existing art assets without additional preparation, at the cost of some tolerable
texture filtering errors. We continue to develop and refine the proposed technique. Here
are some areas for further development:
Reducing the texture filtering errors when dividing clusters. Since the texture
blend masks are order independent, we can swap the texture channels inside the
cluster. Using the least squares minimization technique along the “seams”
boundary as proposed by Iwanicki [2013], we can reduce the texture filtering
error almost to zero.
Using the vertex color as a blend weight modifier for local dynamic material
recomposition (dynamic dirt, scratches, etc.).
First, I would like to thank Vladimir Egorov, my friend and colleague, for his suggestions and
early feedback on this article. I also thank Peter Sikachev, Vadim Slyusarev, Bonifacio Costiniano,
and Alexandre Chekroun for their feedback on this article. In addition, I would like to thank all
Allods Team members.
BARRÉ-BRISEBOIS, C. AND HILL, S. 2012. Blending in Detail. URL:
BLOOM, C. 2000. Terrain Texture Compositing by Blending in the Frame-Buffer. URL:
DEGUY, S., OLGUIN, R., AND SMITH, B. 2016. Texturing Uncharted 4: a matter of Substance.
Game Developers Conference 2016.
HAMILTON, A. AND BROWN, K. 2016. Photogrammetry and Star Wars Battlefront. Game De-
velopers Conference 2016.
HARDY, A. AND MCROBERTS, D. 2006. Blend maps: enhanced terrain texturing. In SAICSIT
INSIDE UNREAL. 2013. A Look at Unreal Engine 4 Layered Materials. URL:
IWANICKI, M. 2013. Lighting technology of The Last of Us. SIGGRAPH ’13.
KARIS, B. 2013. Real Shading in Unreal Engine 4 : Physically Based Shading in Theory and
Practice. SIGGRAPH ’13.
KERNIGHAN, B. AND LIN, S. 1970. An efficient heuristic procedure for partitioning graphs. In
The Bell System Technical Journal, 49, pp. 291–307.
MITTRING, M. 2008. Advanced Virtual Texture Topics. SIGGRAPH ’08.
NEUBELT, D. AND PETTINEO, M. 2013. Crafting a Next-Gen Material Pipeline for The Order:
1886. SIGGRAPH ’13.
NOGUER, J. 2016. The Next Frontier of Texturing Workflows. URL:
Procedural Stochastic Textures by Tiling and Blending
Thomas Deliot and Eric Heitz
2.1 Introduction
Heitz and Neyret [2018] recently introduced a new by-example procedural texturing
method for stochastic textures, typically natural textures such as moss, granite, sand,
bark, etc. Their algorithm takes as input a small texture example and synthesizes an
infinite output with the same appearance, as in Figure 2.1. The algorithm is a simple
Figure 2.1. Procedural stochastic textures by tiling and blending. Our algorithm runs in a frag-
ment shader that requires no more than 4 texture fetches and a few computations. It can be effi-
ciently integrated into a rendering engine.
2.2.1 Tiling
In this section, we provide the implementation of the functions required for the tiling
part of the algorithm in Listing 2.1.
// Precompute UV derivatives
vec2 duvdx = dFdx(uv);
vec2 duvdy = dFdy(uv);
// Fetch input
vec3 I1 = textureGrad(input, uv1, duvdx, duvdy).rgb;
vec3 I2 = textureGrad(input, uv2, duvdx, duvdy).rgb;
vec3 I3 = textureGrad(input, uv3, duvdx, duvdy).rgb;
// Linear blending
vec3 color = w1 * I1 + w2 * I2 + w3 * I3;
return color;
Triangle grid. We use the equilateral-triangle lattice introduced in Simplex Noise [Perlin
2001]. Listing 2.2 provides the function that, for a given point in uv space, computes
the vertices of its containing triangle and its barycentric coordinates $w_1$, $w_2$, $w_3$ inside
this triangle. With this partitioning of the uv space, each vertex is associated with a
hexagonal tile chosen randomly in the input image such that each point is covered by
exactly 3 tiles and each tile is weighted by a function falling to 0 at the borders and
such that the sum of the weights equals 1 everywhere ($w_1 + w_2 + w_3 = 1$). Note that
the constant $2\sqrt{3}$ controls the size of the input with respect to the size of the tiles. With
this value, the height of a hexagonal tile is half the size of the input texture, which
works well in general. This parameter can be adjusted depending on the input. Using
larger tiles (decreasing the constant) captures more large-scale features but is more
prone to visible repetitions. Using smaller tiles (increasing the constant) increases the
variety of the tiles but misses large-scale features.
Listing 2.2. Computing the local triangle vertices and barycentric coordinates.
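The lattice lookup described above can be sketched on the CPU as follows. This is an illustrative port under assumptions, not a copy of Listing 2.2: the struct, the function name, and the skew written out with $1/\sqrt{3}$ and $2/\sqrt{3}$ are choices made for this example, following the usual simplex-grid construction.

```cpp
#include <cmath>

struct TriangleGridSample
{
    int   vertex[3][2]; // integer lattice coordinates of the three vertices
    float weight[3];    // barycentric coordinates, summing to 1
};

TriangleGridSample triangleGrid(float u, float v)
{
    // Scale uv; 2*sqrt(3) sets the tile size relative to the input.
    const float scale = 2.0f * std::sqrt(3.0f);
    u *= scale;
    v *= scale;

    // Skew into the simplex grid, where each triangle is half a unit square.
    const float sx = u - v / std::sqrt(3.0f);
    const float sy = v * 2.0f / std::sqrt(3.0f);

    const int   bx = static_cast<int>(std::floor(sx));
    const int   by = static_cast<int>(std::floor(sy));
    const float fx = sx - std::floor(sx);
    const float fy = sy - std::floor(sy);
    const float fz = 1.0f - fx - fy;

    if (fz > 0.0f) // lower-left triangle of the skewed cell
        return { { {bx, by}, {bx, by + 1}, {bx + 1, by} }, { fz, fy, fx } };

    // Upper-right triangle of the skewed cell
    return { { {bx + 1, by + 1}, {bx + 1, by}, {bx, by + 1} },
             { -fz, 1.0f - fy, 1.0f - fx } };
}
```

In both branches the three weights are non-negative and sum to one, which is the property the blending step relies on.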
Hash function. We use the hash function given in Listing 2.3 to associate a random
offset with each vertex of the triangle grid and use it to fetch the example texture. The
choice of the hash function does not really matter as long as it provides enough ran-
domness and does not introduce visible correlations between neighboring tiles.
vec2 hash(vec2 p)
{
    return fract(sin((p) * mat2(127.1, 311.7, 269.5, 183.3))
        * 43758.5453);
}
Fetching the example texture. We fetch the input texture with mipmapping and anisotropic
filtering like a conventional texture. Note that the hardware uses screen-space
derivatives to compute the mipmap level and parameterize its anisotropic filter. Typically,
these derivatives are computed with finite differences between neighboring
pixels of the uv positions passed as an argument to the texture function. In our case,
these screen-space derivatives are broken by the random offsets if neighboring pixels
are not in the same triangle. To avoid this problem, in Listing 2.1 we compute the
uv derivatives before adding the random offsets and we pass them explicitly to the
textureGrad function.
2.2.2 Blending
In this section, we address the blending part of the algorithm in Listing 2.1.
The problem of linear blending. Listing 2.1 implements a classic linear blending
$$I = w_1 I_1 + w_2 I_2 + w_3 I_3. \tag{2.1}$$
Unfortunately, it does not yield satisfying results, as shown in Figure 2.3(b). The result
has heterogeneous contrast and exhibits a grid-revealing pattern. Heitz and Neyret explain
that the problem of linear blending is that it does not preserve the statistical properties
of the input, i.e., its histogram. The problem is thus to find a blending operator
that preserves the histogram.
Variance-preserving blending. Heitz and Neyret notice that in the special case where
the input has a Gaussian histogram, variance-preserving blending preserves the Gaussian
histogram. The expression of this operator is
$$G = \frac{w_1 G_1 + w_2 G_2 + w_3 G_3 - \mathbb{E}[G]}{\sqrt{w_1^2 + w_2^2 + w_3^2}} + \mathbb{E}[G], \tag{2.2}$$
where $\mathbb{E}[G]$ is the mean of the Gaussian input.
Fragment shader. We update the blending step of Listing 2.1 with the instructions
provided in Listing 2.5. Instead of sampling the original input, we sample the Gaussian
input stored in texture Tinput and we use the variance-preserving blending operator
of Equation (2.2). Finally, we apply the inverse histogram transformation by fetching
the precomputed look-up table stored in texture invT.
// Variance-preserving blending
vec3 G = w1 * G1 + w2 * G2 + w3 * G3;
G = G - vec3(0.5);
G = G * inversesqrt(w1 * w1 + w2 * w2 + w3 * w3);
G = G + vec3(0.5);
// Fetch LUT
vec3 color;
color.r = texture(invT, vec2(G.r, 0)).r;
color.g = texture(invT, vec2(G.g, 0)).g;
color.b = texture(invT, vec2(G.b, 0)).b;
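For reference, the blending operator of Equation (2.2) can be written as a small scalar function for a single channel. This is a sketch for checking values offline; the function name is hypothetical, and the default mean of 0.5 mirrors the constant used in the shader code above.

```cpp
#include <cmath>

// Variance-preserving blend of three Gaussianized samples (one channel).
// `mean` is the mean of the Gaussian input, 0.5 in the fragment shader.
float variancePreservingBlend(float g1, float g2, float g3,
                              float w1, float w2, float w3,
                              float mean = 0.5f)
{
    const float linear = w1 * g1 + w2 * g2 + w3 * g3;
    const float norm = std::sqrt(w1 * w1 + w2 * w2 + w3 * w3);
    return (linear - mean) / norm + mean;
}
```

With a single non-zero weight the sample is returned unchanged, and a blend of samples that all equal the mean stays at the mean; in between, the division by the weight norm restores the contrast that linear blending would remove.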
A quantile value $U = \mathrm{CDF}(x)$ is the integral of the distribution below $x$. For instance,
if $U = 0.30$, it means that 30% of the integral is below $x$ and 70% is above.
Figure 2.5. Histogram transformation of the input. We sort the pixel values I of the input and
we map them to sorted values G from the target Gaussian distribution.
Implementation. In Listing 2.8, we start by sorting the values of the input image. For
this purpose, we use a structure PixelSortStruct that stores the coordinates and the
value of a pixel. Then, we go through the sorted list of pixel values and for the $i$-th
element we compute its quantile value $U = (i + 1/2)/N$. It means that a fraction $U$ of the
list is before this element and a fraction $1 - U$ is after. We replace the pixel value by the
same quantile in the Gaussian distribution using the inverse CDF of Equation (2.5): $G = \mathrm{CDF}^{-1}(U)$.
sort(sortedInputValues.begin(), sortedInputValues.end());
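The quantile mapping described above can be sketched for a single channel as follows. This is not Listing 2.8: the function names are hypothetical, the inverse CDF is obtained by bisection as a simple stand-in for a closed-form expression, and the Gaussian is assumed to have mean 0.5 and standard deviation 1/6.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// CDF of a Gaussian with the given mean and standard deviation.
double gaussianCdf(double x, double mean, double sigma)
{
    return 0.5 * (1.0 + std::erf((x - mean) / (sigma * std::sqrt(2.0))));
}

// Inverse CDF by bisection (stand-in for a closed-form inverse).
double gaussianInvCdf(double U, double mean, double sigma)
{
    double lo = mean - 10.0 * sigma;
    double hi = mean + 10.0 * sigma;
    for (int it = 0; it < 60; ++it)
    {
        const double mid = 0.5 * (lo + hi);
        if (gaussianCdf(mid, mean, sigma) < U) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

// Replaces each value of one channel by the Gaussian value of equal quantile.
std::vector<float> gaussianize(const std::vector<float>& input,
                               double mean = 0.5, double sigma = 1.0 / 6.0)
{
    // Sort pixel indices by value so results can be written back in place.
    std::vector<std::size_t> order(input.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return input[a] < input[b]; });

    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < order.size(); ++i)
    {
        const double U = (i + 0.5) / order.size(); // quantile of the i-th value
        output[order[i]] = static_cast<float>(gaussianInvCdf(U, mean, sigma));
    }
    return output;
}
```

The mapping keeps the ordering of the pixel values and sends the median pixel to the Gaussian mean.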
Algorithm. The algorithm consists in mapping sorted values, as in the previous section
(Figure 2.5). However, the computation of the values is different. Since we use a Gaussian
distribution that can be well represented in the interval $[0,1]$, we are going to
parameterize the look-up table on this interval and we associate quantiles of the Gaussian
distribution in $[0,1]$ to quantiles of the pixel values.
Implementation. In Listing 2.9, we start by sorting the values of the input image. Note
that an optimized implementation could reuse the sorting step of Listing 2.8. Then, we
go through the texels of the look-up table that parameterizes the interval $[0,1]$ such that
the $i$-th texel out of $N$ is associated with the position $x = (i + 1/2)/N$. We compute the
Gaussian quantile value at this position using Equation (2.4): $U = \mathrm{CDF}(x)$, and we pick up
the same quantile in the sorted pixel values, i.e., we fetch the $\lfloor U \cdot M \rfloor$-th element in
the sorted list if it has $M$ entries. This is the value that we store in the look-up table.
sort(sortedInputValues.begin(), sortedInputValues.end());
Listing 2.9. Precomputing the inverse histogram transformation $T^{-1}$ and storing it in a look-up table.
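The look-up table construction can be sketched for one channel as below. The function name and signature are assumptions for this example, as is the Gaussian with mean 0.5 and standard deviation 1/6; the index is clamped so that a quantile of exactly 1 still addresses a valid element.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds an N-texel inverse look-up table for one channel.
// `values` holds the pixel values of the input (must not be empty).
std::vector<float> buildInverseLut(std::vector<float> values, std::size_t N,
                                   double mean = 0.5, double sigma = 1.0 / 6.0)
{
    std::sort(values.begin(), values.end());
    const std::size_t M = values.size();

    std::vector<float> lut(N);
    for (std::size_t i = 0; i < N; ++i)
    {
        // Position of the i-th texel in [0,1]
        const double x = (i + 0.5) / N;
        // Gaussian quantile at this position
        const double U = 0.5 * (1.0 + std::erf((x - mean) / (sigma * std::sqrt(2.0))));
        // Same quantile in the sorted pixel values
        const std::size_t index = std::min(static_cast<std::size_t>(U * M), M - 1);
        lut[i] = values[index];
    }
    return lut;
}
```

Because both the Gaussian CDF and the sorted list are monotonic, the resulting table is non-decreasing.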
2.3.4 Discussion
With the fragment shader of Section 2.2 and the precomputations of Section 2.3, we
already have a standalone implementation. However, this implementation has several
shortcomings: color problems might appear with some inputs. They are due to computing
separate per-channel histogram transformations and to the incompatibility between
mipmapping and the use of a look-up table. Sections 2.4 and 2.5 are dedicated to overcoming
these shortcomings.
inter-channel correlations and transforming the channels separately does not preserve
these correlations. For instance, the result of Figure 2.6(b) has the same 1D histogram
as the input for each channel. However, since the inter-channel correlations are not
preserved, the 3D shape of this histogram is not preserved and wrong colors appear in
the result. We obtained the result of Figure 2.6(c) by using a color space in which the
channels are not correlated such that processing them separately is less prone to this
$$P = O + v_1 V_1 + v_2 V_2 + v_3 V_3 \quad \text{with } v_1, v_2, v_3 \in [0,1],$$
where the bounding box is defined by its corner $O$ and its orthogonal axes $V_1$, $V_2$, and
Fragment shader. Before returning, the fragment shader transforms the result back
to the original color space with Equation (2.6). This is done in the function Return-
ToOriginalColorSpace provided in Listing 2.10.
Listing 2.10. Return to the original color space in the fragment shader.
appearance to a regular tiling of the input reveals an issue of color deviation, as shown
in Figure 2.8. In this section, we show how to solve this problem by prefiltering the
look-up table.
the color of the input texture at a position uv and P the domain covered by the pixel
footprint. Figure 2.9 illustrates that the filtered color is the integral of the texture over
the pixel footprint:
$$\int_P \mathrm{texture}(uv)\, \mathrm{d}uv. \tag{2.7}$$
Texture mipmapping (with anisotropic filtering for more accuracy) provides a fast way
to evaluate this integral.
As shown in Figure 2.10, this integral can be computed by sampling the values of the
texture over the footprint P, passing the values through the look-up table, and averaging
the results. Unfortunately, this process is too costly, so we would like to use mipmapping,
as for a conventional texture.
Figure 2.10. Filtering the procedural texture. The correct filtering averages the values after the
application of the look-up table. Filtering the texture before and applying the look-up table after
does not produce the same result.
Filtering the procedural texture (wrong). A simple approach consists in using a mip-
mapped version of the input texture, fetching a single sample from it as for a conven-
tional texture, and then passing it through the look-up table, as shown in Figure 2.10.
However, this computes
which is not the right result because the integral and the look-up table do not commute:
This inequality explains the color difference between Figure 2.8(a) and (b).
$$\text{filtered procedural texture} = \int \mathrm{LUT}(t)\, H_P(t)\, \mathrm{d}t, \tag{2.12}$$
where $H_P$ gives the weight of each entry of the look-up table. This weight depends on
the distribution of texture values $t$ inside the pixel footprint $P$. The more a value $t$ of
the texture is represented, the more the entry $\mathrm{LUT}(t)$ contributes to the weighted
average. Hence, $H_P$ is the histogram of the values of the texture inside $P$. This equivalence
is shown in Figure 2.11.
Figure 2.11. Alternative filtering formulation with a look-up table. Filtering the texture with
the look-up table is equivalent to convolving the look-up table by the histogram of the texture
values inside the pixel footprint.
the unfiltered look-up table and whose height is the number of levels of detail of the
input texture. For each level of detail $L$ we compute the average variance in all the
subwindows of width $2^L$. At the first level of detail the variance is 0 and at the highest
level of detail the variance is the variance of the full Gaussian texture, which is $(1/6)^2$
as explained in Section 2.3.1. For each level of detail, we filter the look-up table by a
Gaussian filter of the associated variance.
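Filtering one row of the table for a given variance can be sketched as follows. This is an illustration only: the function name is hypothetical, the table is assumed to be parameterized on [0,1], and renormalizing the kernel at the borders is a choice made for this example, not necessarily what the chapter's implementation does.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Filters a look-up table parameterized on [0,1] with a Gaussian kernel of
// the given variance. A variance of 0 returns the table unchanged.
std::vector<float> filterLut(const std::vector<float>& lut, double variance)
{
    if (variance <= 0.0) return lut;

    const std::size_t N = lut.size();
    std::vector<float> out(N);
    for (std::size_t i = 0; i < N; ++i)
    {
        double sum = 0.0;
        double weightSum = 0.0;
        for (std::size_t j = 0; j < N; ++j)
        {
            // Distance between texel centers in table coordinates
            const double d = (static_cast<double>(i) - static_cast<double>(j)) / N;
            const double w = std::exp(-d * d / (2.0 * variance));
            sum += w * lut[j];
            weightSum += w;
        }
        out[i] = static_cast<float>(sum / weightSum); // renormalize at borders
    }
    return out;
}
```

Calling this once per level of detail, with the average variance measured for that level, would produce the rows of the prefiltered table.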
Fragment shader. We update the fragment shader in Listing 2.11 where we use the
function textureQueryLod to obtain the level of detail of the input texture and we
remap it to a value in $[0,1]$ to obtain a y coordinate to fetch the look-up table.
Listing 2.11. Fetching the prefiltered look-up table in the fragment shader.
2.7 Results
Performance and storage. In Table 2.1, we compare the performance and storage of
our method with those of a classic texture repeat, as in Figures 2.13 and 2.14. On average,
our method is 4–5 times costlier, which makes sense since we fetch the input 3 times, perform
one additional look-up table fetch, and execute a few additional operations. The repeated
tiling only requires the storage of the input texture, while our method requires the storage
of the Gaussian input Tinput and the look-up table invT. Since the Gaussian input
has the same size as the input, the memory overhead of our method is only the storage
of the look-up table, which is small in comparison.
2.8 Conclusion
We have presented an implementation of our procedural texturing algorithm that works
well for breaking the repetition of tiled textures. This algorithm is meant to be used
with stochastic textures (moss, granite, sand, etc.) and cannot be used with repetitive
or strongly correlated patterns. It has little memory overhead, works well with the
compressed DXT texture format, and costs about four times as much as classic repeated
texture tiling. Finally, it is straightforward to adapt it to inputs other than RGB color
data, such as the normal map in Figure 2.14.
This chapter is the result of Thomas Deliot’s master’s thesis, which was supervised by Eric Heitz.
Both authors conducted this work at Unity Technologies.
Figure 2.16. Failure case of our method. Our method does not produce plausible results if the
input presents a strong pattern-like organization.
HEEGER, D. AND BERGEN, J. 1995. Pyramid-based Texture Analysis/Synthesis. In Proceedings
of ACM SIGGRAPH ’95, pp. 229–238.
HEITZ, E. AND NEYRET, F. 2018. High-Performance By-Example Noise using a Histogram-Pre-
serving Blending Operator. In Proceedings of the ACM on Computer Graphics and Interactive
Techniques, 1:2, pp. 25.
HEITZ, E., NOWROUZEZAHRAI, D., POULIN, P., AND NEYRET, F. 2013. Filtering Color Mapped
Textures and Surfaces. In Proceedings of the Symposium on Interactive 3D Graphics
and Games 2013, pp. 129–136.
PERLIN, K. 2001. Noise hardware. Real-time shading languages, SIGGRAPH 2001 Course.
Figure 3.1. An example scene from Marmoset Toolbag 3 showing a model baked using our
As baking is generally an offline process, early tools relied on CPU processing for
greater flexibility. Computation times were on the order of minutes and hours. More
recent implementations have used GPU processing to great effect. Increased parallel-
ism as well as improved support for general computing, in particular ray tracing, have
made GPU baking an appealing choice. Modern systems can bake results in seconds
or even in real-time. This chapter outlines a technique to bake geometry on the GPU
with user input, as well as a number of potential pitfalls inherent in the process.
The technology demo that this chapter follows is based on the baker used in Mar-
moset Toolbag. This example can be downloaded at:
Figure 3.2. A model representation of baking, where rays are cast from the working model to
the reference model.
Figure 3.3. A model of baking which includes a Cage projected from the working model.
Another issue with baking projection relates to ray directions. In general, sample
rays should point “inward” in the direction of the reference model. However, the spe-
cific direction chosen obviously has a significant effect on the sample’s ultimate result.
A simple approach to determine sample directions is to make sample rays run parallel
to the working model’s polygon normals. This works somewhat well, but has the major
drawback of creating discontinuities along polygon edges, as Figure 3.4 shows.
Figure 3.4. A comparison of baking projection techniques. The left shows ray directions based
off the average normal of that surface, and the right shows rays based off the working model’s
face normal. Note the discontinuities on the edge of different faces.
Figure 3.5. A visualization of a cage determined by a positional offset map and skew map.
The left shows the offset cage colored with its skew map; the right shows the direction of rays
projected from that cage with the mesh rotated 180 degrees for easier visualization. Red indicates
that rays should point in the direction of the working model’s normal and green indicates that rays
should point in the direction of the calculated smooth normal.
process of baking models remains more art than science, but by exposing the technical
aspects of baking in an intuitive way artists are able to work more effectively.
In summary, the following data are needed to bake a mesh with our approach:
Output Render Target. Your output render target, which can vary in format and
Working Model. The low polygon geometry used to determine smooth normals
and used as input for any final transformations to output fragments.
Reference Model. The high polygon geometry used when searching for ray col-
Skew Map. A map that allows users to interpolate between either raycasting in
the direction of the computed smooth normal, or the direction of the working
model’s normal.
Offset Map. A map that defines the offset between the cage and the working model.
Offset Bounds. A minimum and maximum offset value to offset the cage from
the working model.
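How the skew map, offset map, and offset bounds might combine into a cage can be sketched as follows. Everything here is an assumption made for illustration: the function names are hypothetical, a skew of 0 is taken to mean the working model's normal and 1 the smooth normal, and the offset map sample is assumed to lie in [0,1] and to be remapped linearly between the two bounds.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;

Vec3 normalize(const Vec3& v)
{
    const float len = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    return { v[0] / len, v[1] / len, v[2] / len };
}

// Interpolates between the working model's normal (skew = 0) and the
// computed smooth normal (skew = 1); this mapping is an assumed convention.
Vec3 cageDirection(const Vec3& modelNormal, const Vec3& smoothNormal, float skew)
{
    Vec3 d;
    for (int c = 0; c < 3; ++c)
        d[c] = (1.0f - skew) * modelNormal[c] + skew * smoothNormal[c];
    return normalize(d);
}

// Pushes a surface point out to the cage. offsetSample is the offset map
// value in [0,1], remapped to the [minOffset, maxOffset] bounds.
Vec3 cagePosition(const Vec3& position, const Vec3& direction,
                  float offsetSample, float minOffset, float maxOffset)
{
    const float offset = minOffset + offsetSample * (maxOffset - minOffset);
    return { position[0] + direction[0] * offset,
             position[1] + direction[1] * offset,
             position[2] + direction[2] * offset };
}
```

Under these assumptions, a sample ray would start at the cage position and travel back along the negated direction toward the reference model.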
3.1.1 Pitfalls
One pitfall we encountered was in computing smooth normals. We noticed
that for convex geometry such as holes or corners, the smooth direction would point
away from the reference model. To fix this for a given vertex, the starting ray’s direction
should be the average normal if its dot product with that vertex’s normal is greater
than 0; otherwise, it should be reflected by that vertex’s normal.
In addition, when baking geometry, it is sometimes necessary for the user to split
a bake into different sections to avoid rays being cast from the inside of the reference
model. Splitting working models and reference models in this way can reduce the
time it takes to bake, but different sections can then be completely isolated from all
other reference geometry. One possible way to mitigate this issue is to expose a user option
to bake using all reference geometry. This can be useful when working with maps that
require scene information, such as ambient occlusion.
Finally, since every user has unique use cases for their baked textures, it’s best to
be aware of the tangent space the working model will be used in, as well as the hand-
edness of the target application to avoid issues such as baked normals facing the wrong
3.1.2 Implementation
Before we begin discussing implementation details, it’s important to introduce the base
functions that we’ll be depending on. We’ll be using the function findTraceDirection
to determine the direction in which our ray will be cast, findTraceOrigin to
determine the origin of our rays, and finally traceRay to perform the ray tracing operation.
For the sake of brevity, we’ll omit traceRay from any source listings in this
chapter, but a working implementation can be found in our included example.
Normal Map. Normals are available directly from the reference model’s vertex data.
One would simply need to compute the fragment as the current reference vertex’s nor-
mal value, and interpolate between vertices through barycentric coordinates. Tangent
based normals require that one transform the reference model normal with the working
model’s normal orientation:
Figure 3.6. A tangent space normal map texture applied to a model rendered with 16×
TriangleHit hit;
bool didHit = traceRay(tracePos, traceDir, hit);
if (didHit)
{
    n = hit.coords.x * uNormals[hit.vertices.x]
        + hit.coords.y * uNormals[hit.vertices.y]
        + hit.coords.z * uNormals[hit.vertices.z];
    n = normalize(n);
    outObjectNormals.rgb = n;
    outObjectNormals.a = 1.0;
    outTangentNormals.rgb = vec3(dot(n, inTangents), dot(n, inBitangents),
        dot(n, inNormals));
    outTangentNormals.a = 1.0;
}
Listing 3.2. An example implementation of baking the reference model’s vertex normal.
Height Map. Height maps are used as inputs in tessellation to determine areas that
require more subdivisions, and to offset those areas by the input texture.
Figure 3.7. Height map computed from cage offset distance to reference model.
TriangleHit hit;
bool didhit = traceRay(tracePos, traceDir, hit);
Listing 3.3. Height map computed from cage offset distance to reference model.
Ambient Occlusion Map. Ambient occlusion describes the average amount of light that
would be expected to miss a region from an omnidirectional light source. This value
can be determined through Monte Carlo sampling of rays cast over the hemisphere
around the hit point of the initial raycast.
Figure 3.8. An example of a baked ambient occlusion map set to 4096 rays.
#define SAMPLES 16
outAO = vec4(1.0,1.0,1.0,0.0);
vec3 traceDir = findTraceDirection(inPosition, normalize(inBakeDir),
inTextureCoords, dirMask);
vec3 tracePos = findTraceOrigin(inPosition, traceDir,
inTextureCoords, tOffsetMap, uOffsetRange);
TriangleHit hit;
if (traceRay(tracePos, traceDir, hit))
vec3 pos = tracePos + traceDir
* (hit.distance - uHemisphereOffset);
vec3 basisY = normalize(hit.coords.x * uNormals[hit.vertices.x]
+ hit.coords.y * uNormals[hit.vertices.y]
+ hit.coords.z * uNormals[hit.vertices.z]);
vec3 basisX = normalize(cross(basisY, fTangent));
vec3 basisZ = cross(basisX, basisY);
float ao = 0.0;
float hits = 0.0;
TriangleHit hit2;
for (int i = 0; i < SAMPLES; ++i)
// Random Direction in hemisphere of first hit
vec3 d = normalize(rand3(fTexCoord + uRandSeed + float(i)));
// Give rays that point away from the top of the hemisphere
// more weight when averaging the final ambient occlusion.
float omega = d.y;
Material Atlas. Scene descriptions introduce the concept of a Mesh being composed
of several Primitives, each coupled with a Material. Different parts of a mesh can
correspond to different materials, which can lead to dense geometry made up of
many materials. This is great for authoring reference models; however, when designing
assets for real-time rendering, the need arises for a simpler working model that
encodes all these materials as textures to be used in a single material with a Physically
Based Rendering (PBR) workflow.
One solution to this problem is to process each material the working model is
composed of, masking out the geometry of what’s not being baked with an alpha of
Figure 3.9. An example of an albedo map that uses the metalness workflow, baked from the
reference model’s Physically Based (PBR) materials using our technique. On the right is the
mesh textured with the generated albedo map rendered with a PBR metalness workflow.
preferredTrisPerLeaf := 12
maxTrisPerLeaf := 64
preferredNodeDepth := 23
maxNodeDepth := preferredNodeDepth + 10
buildNode(depth, triCount):
    if (depth < preferredNodeDepth and
        triCount > preferredTrisPerLeaf) or
       (depth < maxNodeDepth and
        triCount > maxTrisPerLeaf)
        Split triangles into two new nodes
        for each new node
            buildNode(depth + 1, triangle count of the new node)
    else
        Attach triangles to leaf node
Listing 3.5. Pseudocode for k-d tree construction. Triangle and depth limits have been chosen
empirically with a brute-force performance search.
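The split criterion from the pseudocode can be expressed directly in code. The sketch below uses the same limits; the function names are hypothetical, and the even split in countLeaves is a simplifying assumption used only to exercise the criterion, since a real builder would choose a split plane from the geometry.

```cpp
// Limits taken from the pseudocode above.
constexpr int preferredTrisPerLeaf = 12;
constexpr int maxTrisPerLeaf = 64;
constexpr int preferredNodeDepth = 23;
constexpr int maxNodeDepth = preferredNodeDepth + 10;

// True if a node at this depth holding triCount triangles should be split.
bool shouldSplit(int depth, int triCount)
{
    return (depth < preferredNodeDepth && triCount > preferredTrisPerLeaf) ||
           (depth < maxNodeDepth && triCount > maxTrisPerLeaf);
}

// Counts the leaves produced when every split divides the triangles evenly
// (a toy model; real splits depend on the chosen plane).
int countLeaves(int depth, int triCount)
{
    if (!shouldSplit(depth, triCount)) return 1;
    const int left = triCount / 2;
    return countLeaves(depth + 1, left) + countLeaves(depth + 1, triCount - left);
}
```

Up to the preferred depth, nodes are split until they hold at most 12 triangles; between the preferred and maximum depths, only nodes with more than 64 triangles are split further.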
Writing an Efficient Vulkan Renderer
Arseny Kapoulkine
Vulkan is a new explicit cross-platform graphics API. It introduces many new concepts
that may be unfamiliar to even seasoned graphics programmers. The key goal of Vulkan
is performance; however, attaining good performance requires in-depth
knowledge of these concepts and how to apply them efficiently, as well as of how
particular driver implementations implement them. This article explores topics such
as memory allocation, descriptor set management, command buffer recording, pipeline
barriers, and render passes, and discusses ways to optimize CPU and GPU performance of
production desktop/mobile Vulkan renderers today. It also looks at what a forward-looking
Vulkan renderer could do differently.
Modern renderers are becoming increasingly complex and must support many dif-
ferent graphics APIs with varying levels of hardware abstraction and disjoint sets of
concepts. This sometimes makes it challenging to support all platforms at the same
level of efficiency. Fortunately, for most tasks Vulkan provides multiple options that
can be as simple as reimplementing concepts from other APIs with higher efficiency
due to targeting the code specifically towards the renderer needs, and as hard as rede-
signing large systems to make them optimal for Vulkan. We will try to cover both ex-
tremes when applicable—ultimately, this is a tradeoff between maximum efficiency on
Vulkan-capable systems and implementation and maintenance costs that every engine
needs to carefully pick. Additionally, efficiency is often application-dependent—the
guidance in this article is generic and ultimately best performance is achieved by pro-
filing the target application on a target platform and making an informed implementa-
tion decision based on the results.
This article assumes that the reader is familiar with the basics of the Vulkan API and
would like to understand them better and/or learn how to use the API efficiently.
HOST_VISIBLE_BIT—on AMD hardware, this memory type refers to up to 256
MB of video memory that the CPU can write to directly, and is perfect for allo-
cating reasonable amounts of data that is written by CPU every frame, such as
uniform buffers or dynamic vertex/index buffers
We only cover memory allocation types that are writable from host and readable or writable
from GPU; for CPU readback of data that has been written by GPU, memory with the
VK_MEMORY_PROPERTY_HOST_CACHED_BIT flag is more appropriate.
HOST_COHERENT_BIT: this is referring to CPU memory that is directly visible
from GPU; reads from this memory go over PCI-express bus. In absence of the
previous memory type, this generally speaking should be the choice for uniform
buffers or dynamic vertex/index buffers, and also should be used to store staging
buffers that are used to populate static resources allocated with VK_MEMORY_
LAZILY_ALLOCATED_BIT—this is referring to GPU memory that might never
need to be allocated for render targets on tiled architectures. It is recommended
to use lazily allocated memory to save physical memory for large render targets
that are never stored to, such as MSAA images or depth images.
Note that VK_MEMORY_PROPERTY_HOST_COHERENT_BIT generally implies that the memory
will be write-combined; on some devices it’s possible to allocate non-coherent memory and
flush it manually with vkFlushMappedMemoryRanges.
Always over-align image resources (as they typically have larger alignment to
begin with) by bufferImageGranularity, essentially using a maximum of
required alignment and bufferImageGranularity for address and size align-
Track resource type for each allocation, and have the allocator add the requisite
padding only if the previous or following resource is of a different type. This
requires a somewhat more complex allocation algorithm.
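The last two options can be sketched as follows. This is a minimal illustration in C++ for a simple linear (bump) allocator, not code taken from VMA or from any driver; ResourceKind, alignUp, placeOverAligned and placeTracked are hypothetical names, only the start address is handled (the over-aligning option also rounds the size), and alignments are assumed to be powers of two.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical classification: buffers and linear images versus optimally tiled images.
enum class ResourceKind { Linear, NonLinear };

// Round offset up to a power-of-two alignment.
inline uint64_t alignUp(uint64_t offset, uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Always over-align images by bufferImageGranularity.
inline uint64_t placeOverAligned(uint64_t cursor, uint64_t requiredAlignment,
                                 ResourceKind kind, uint64_t granularity) {
    uint64_t alignment = (kind == ResourceKind::NonLinear)
        ? std::max(requiredAlignment, granularity)
        : requiredAlignment;
    return alignUp(cursor, alignment);
}

// Track the previous resource kind and pad only when the kind changes.
inline uint64_t placeTracked(uint64_t cursor, uint64_t requiredAlignment,
                             ResourceKind kind, ResourceKind previousKind,
                             uint64_t granularity) {
    uint64_t alignment = (kind != previousKind)
        ? std::max(requiredAlignment, granularity)
        : requiredAlignment;
    return alignUp(cursor, alignment);
}
```

With a granularity of 1024 bytes, an image with a 256-byte alignment requirement placed at cursor 100 lands at offset 256 when it follows another image under the tracked scheme, but at offset 1024 under the over-aligning scheme.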
On many GPUs the required alignment for image resources is substantially bigger
than it is for buffers which makes the last option attractive—in addition to reducing
waste due to lack of extra padding between buffers and images, it reduces internal frag-
mentation due to image alignment when an image follows a buffer resource. VMA
provides implementations for option 2 (by default) and option 3 (see VMA_POOL_
dedicated allocation. That way the driver can allocate the resources in faster memory
under special circumstances.
To that end, Vulkan provides an extension (core in 1.1) to perform dedicated
allocations: when allocating memory, you can specify that you are allocating this memory
for an individual resource instead of as an opaque blob. To know if this is worthwhile,
you can query the extended memory requirements via vkGetImageMemoryRequirements2KHR
or vkGetBufferMemoryRequirements2KHR; the resulting struct,
VkMemoryDedicatedRequirementsKHR, will contain the requiresDedicatedAllocation flag
(which might be set if the allocated resource needs to be shared with other
processes) and the prefersDedicatedAllocation flag.
In general, applications may see performance improvements from dedicated allo-
cations on large render targets that require a lot of read/write bandwidth depending on
the hardware and drivers.
Do this before CPU needs to write data to the allocation, and unmap once the
write is complete.
Do this right after the host-visible memory is allocated, and never unmap
The second option is otherwise known as persistent mapping and is generally a better
tradeoff—it minimizes the time it takes to obtain a writeable pointer (vkMapMemory is
not particularly cheap on some drivers), removes the need to handle the case where
multiple resources from the same memory object need to be written to simultaneously
(calling vkMapMemory on an allocation that’s already been mapped and not unmapped
is not valid) and simplifies the code in general.
The only downside is that this technique makes the 256 MB chunk of VRAM that
is host visible and device local on AMD GPU that was described in “Memory heap
selection” less useful—on systems with Windows 7 and AMD GPU, using persistent
mapping on this memory may force WDDM to migrate the allocations to system
memory. If this combination is a critical performance target for your users, then map-
ping and unmapping memory when needed might be more appropriate.
that’s in use by the GPU, and to provide the descriptor layout that has an optimal bal-
ance between CPU-side update cost and GPU-side access cost. In addition, since dif-
ferent rendering APIs use different models for resource binding and none of them
match Vulkan model exactly, using the API in an efficient and cross-platform way be-
comes a challenge. We will outline several possible approaches to working with Vulkan
descriptor sets that strike different points on the scale of usability and performance.
via vkResetDescriptorPool and returned to free lists. While it’s possible to free
individual descriptors from a pool via
VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT, this complicates the memory
management on the driver side and is not recommended.
When a descriptor set pool is created, the application specifies the maximum number
of descriptor sets allocated from it, as well as the maximum number of descriptors of
each type that can be allocated from it. In Vulkan 1.1, the application doesn’t have to
handle accounting for these limits: it can just call vkAllocateDescriptorSets
and handle the error from that call by switching to a new descriptor set pool.
Unfortunately, in Vulkan 1.0 without any extensions, it’s an error to call
vkAllocateDescriptorSets if the pool does not have available space, so the application
must track the number of sets and descriptors of each type to know beforehand when to
switch to a different pool.
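For Vulkan 1.0, the accounting described above might look like the following sketch. PoolBudget and DescriptorKind are hypothetical names introduced for illustration; a real implementation would track every VkDescriptorType it uses and pair each budget with an actual VkDescriptorPool.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical descriptor categories; a real renderer would key on VkDescriptorType.
enum DescriptorKind { Texture = 0, Buffer = 1, KindCount = 2 };

struct PoolBudget {
    uint32_t maxSets = 0;
    std::array<uint32_t, KindCount> maxDescriptors{};
    uint32_t usedSets = 0;
    std::array<uint32_t, KindCount> usedDescriptors{};

    // Returns true and records the usage if a set with the given descriptor
    // counts still fits; returns false if the caller must switch pools first.
    bool tryReserve(const std::array<uint32_t, KindCount>& counts) {
        if (usedSets + 1 > maxSets) return false;
        for (int k = 0; k < KindCount; ++k)
            if (usedDescriptors[k] + counts[k] > maxDescriptors[k]) return false;
        ++usedSets;
        for (int k = 0; k < KindCount; ++k) usedDescriptors[k] += counts[k];
        return true;
    }

    // Mirrors resetting the whole pool: all sets are returned at once.
    void reset() {
        usedSets = 0;
        usedDescriptors.fill(0);
    }
};
```

The allocator would call tryReserve before every vkAllocateDescriptorSets call and move to a fresh pool whenever it returns false.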
Different pipeline objects may use different numbers of descriptors, which raises
the question of pool configuration. A straightforward approach is to create all pools
with the same configuration that uses the worst-case number of descriptors for each
type—for example, if each set can use at most 16 texture and 8 buffer descriptors, one
can allocate all pools with maxSets = 1024, and pool sizes 16 × 1024 for texture de-
scriptors and 8 × 1024 for buffer descriptors. This approach can work but in practice it
can result in very significant memory waste for shaders with different descriptor
count—you can’t allocate more than 1024 descriptor sets out of a pool with the afore-
mentioned configuration, so if most of your pipeline objects use 4 textures, you’ll be
wasting 75% of texture descriptor memory.
Two alternatives that provide a better balance with respect to memory use are:
Measure an average number of descriptors used in a shader pipeline per type for
a characteristic scene and allocate pool sizes accordingly. For example, if in a
given scene we need 3000 descriptor sets, 13400 texture descriptors, and 1700
buffer descriptors, then the average number of descriptors per set is 4.47 textures
(rounded up to 5) and 0.57 buffers (rounded up to 1), so a reasonable configu-
ration of a pool is maxSets = 1024, 5 × 1024 texture descriptors, 1024 buffer
descriptors. When a pool is out of descriptors of a given type, we allocate a new
one—so this scheme is guaranteed to work and should be reasonably efficient
on average.
Group shader pipeline objects into size classes, approximating common patterns
of descriptor use, and pick descriptor set pools using the appropriate size class.
This is an extension of the scheme described above to more than one size class.
For example, it’s typical to have large numbers of shadow/depth prepass draw
calls, and large numbers of regular draw calls in a scene—but these two groups
have different numbers of required descriptors, with shadow draw calls typically
requiring 0 to 1 textures per set and 0 to 1 buffers when dynamic buffer offsets
are used. To optimize memory use, it’s more appropriate to allocate descriptor
set pools separately for shadow/depth and other draw calls. Similarly to
general-purpose allocators that can have size classes that are optimal for a given
application, this can still be managed in a lower-level descriptor set management layer
as long as it’s configured with application-specific descriptor set usages beforehand.
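The arithmetic behind the worst-case waste and the averaged pool configuration can be checked with a short sketch. PoolConfig, ceilDiv, sizePoolFromAverages and wastedFraction are hypothetical helpers, and only two descriptor categories are modeled.

```cpp
#include <cassert>
#include <cstdint>

struct PoolConfig {
    uint32_t maxSets;
    uint32_t textureDescriptors;
    uint32_t bufferDescriptors;
};

inline uint32_t ceilDiv(uint32_t a, uint32_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Size a pool from totals measured in a characteristic scene: average
// descriptors per set, rounded up, times the number of sets per pool.
inline PoolConfig sizePoolFromAverages(uint32_t measuredSets, uint32_t measuredTextures,
                                       uint32_t measuredBuffers, uint32_t setsPerPool) {
    return PoolConfig{setsPerPool,
                      ceilDiv(measuredTextures, measuredSets) * setsPerPool,
                      ceilDiv(measuredBuffers, measuredSets) * setsPerPool};
}

// Fraction of one descriptor type left unused when every set uses usedPerSet
// descriptors but the pool was sized for worstCasePerSet per set.
inline double wastedFraction(uint32_t worstCasePerSet, uint32_t usedPerSet) {
    assert(worstCasePerSet > 0 && usedPerSet <= worstCasePerSet);
    return 1.0 - static_cast<double>(usedPerSet) / worstCasePerSet;
}
```

For the scene measured above, sizePoolFromAverages(3000, 13400, 1700, 1024) yields 5 × 1024 texture descriptors and 1024 buffer descriptors, and wastedFraction(16, 4) gives the 75% figure for the worst-case configuration.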
To make this model work, the application needs to decide how many resource
namespaces there are and how they map to Vulkan set/slot indices. For example, in
Metal each stage (VS, FS, CS) has three resource namespaces (textures, buffers,
samplers) with no differentiation between, e.g., uniform buffers and storage buffers. In
Direct3D 11 the namespaces are more complicated since read-only structured buffers
belong to the same namespace as textures, but textures and buffers used with unordered
access reside in a separate one.
The Vulkan specification only guarantees a minimum of 4 descriptor sets accessible to
the entire pipeline (across all stages); because of this, the most convenient mapping
option is to have resource bindings match across all stages (for example, texture slot
3 would contain the same texture resource no matter what stage it’s accessed from)
and use different descriptor sets for different types, e.g., set 0 for buffers, set 1 for
textures, set 2 for samplers. Alternatively, an application can use one descriptor set per
stage and perform static index remapping (e.g., slots 0–16 would be used for textures,
slots 17–24 for uniform buffers, etc.); this, however, can use much more descriptor
set memory and isn’t recommended. Finally, one could implement optimally compact
dynamic slot remapping for each shader stage (e.g., if a vertex shader uses texture slots
0, 4, 5, then they map to Vulkan descriptor indices 0, 1, 2 in set 0), with the
application extracting the relevant texture information at runtime using this remapping
table.
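The compact dynamic remapping can be illustrated with a small table builder. This is a sketch under the assumption that shader reflection has already produced the list of slots a stage uses; buildSlotRemap and kUnusedSlot are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kUnusedSlot = 0xFFFFFFFFu;

// Build a table mapping each API-level texture slot to a compact Vulkan
// binding index, given the slots a shader stage actually uses.
inline std::vector<uint32_t> buildSlotRemap(const std::vector<uint32_t>& usedSlots,
                                            uint32_t slotCount) {
    std::vector<uint32_t> remap(slotCount, kUnusedSlot);
    uint32_t nextBinding = 0;
    for (uint32_t slot : usedSlots) {
        assert(slot < slotCount);
        // Assign the next free binding the first time a slot is seen.
        if (remap[slot] == kUnusedSlot) remap[slot] = nextBinding++;
    }
    return remap;
}
```

For a vertex shader that uses slots 0, 4 and 5, the table maps them to bindings 0, 1 and 2, and every other slot stays marked as unused.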
In all these cases, the implementation of setting a texture to a given slot wouldn’t
generally run any Vulkan commands and would just update shadow state; just before
the draw call or dispatch you’d need to allocate a descriptor set from the appropriate
pool, update it with new descriptors, and bind all descriptor sets using
vkCmdBindDescriptorSets. Note that if a descriptor set has 5 resources, and only one of them
changed since the last draw call, you still need to allocate a new descriptor set with 5
resources and update all of them.
To reach good performance with this approach, you need to follow several guide-
Don’t allocate or update descriptor sets if nothing in the set changed. In the
model with slots that are shared between different stages, this can mean that if
no textures are set between two draw calls, you don’t need to allocate/update
the descriptor set with texture descriptors.
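A dirty flag on the shadow state is one simple way to follow this guideline. The sketch below is an assumption about how such tracking could look, not the implementation of any particular engine; TextureSlots is a hypothetical type, texture handles are plain integers, and the allocation itself is only counted rather than performed.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr uint32_t kTextureSlotCount = 16;

// Shadow state for the descriptor set that holds texture slots.
struct TextureSlots {
    std::array<uint64_t, kTextureSlotCount> textures{};
    bool dirty = true;         // no descriptor set has been allocated yet
    uint32_t allocations = 0;  // counts simulated descriptor set allocations

    // Setting a slot only updates shadow state; no Vulkan call is made here.
    void setTexture(uint32_t slot, uint64_t handle) {
        assert(slot < kTextureSlotCount);
        if (textures[slot] != handle) {
            textures[slot] = handle;
            dirty = true;
        }
    }

    // Called just before a draw or dispatch. Returns true when a new
    // descriptor set had to be allocated and updated.
    bool flushBeforeDraw() {
        if (!dirty) return false;
        ++allocations;  // stand-in for allocating and updating a descriptor set
        dirty = false;
        return true;
    }
};
```

Redundant setTexture calls with an unchanged handle leave the state clean, so consecutive draws that share all textures reuse the previously bound set.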
Note that with only 4 descriptor sets per pipeline, the one-set-per-stage approach can’t
handle full pipeline setup for VS, GS, FS, TCS and TES, which is only a problem if you use
tessellation on drivers that only expose 4 descriptor sets.
In general, the approach outlined above can be very efficient in terms of perfor-
mance—it’s not as efficient as approaches with more static descriptor sets that are de-
scribed below, but it can still run circles around older APIs if implemented carefully.
On some drivers, unfortunately the allocate and update path is not very optimal—on
some mobile hardware, it may make sense to cache descriptor sets based on the de-
scriptors they contain if they can be reused later in the frame.
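A cache of this kind could be keyed by the exact list of descriptors in a set. The following is a minimal sketch with hypothetical names (DescriptorSetCache, acquire); descriptor handles and set identifiers are plain integers, and the cache is assumed to be cleared whenever the backing pools are reset.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

class DescriptorSetCache {
public:
    // Returns the cached set for these descriptors, creating one on a miss.
    uint32_t acquire(const std::vector<uint64_t>& descriptors) {
        auto it = cache_.find(descriptors);
        if (it != cache_.end()) return it->second;
        uint32_t id = nextSetId_++;  // stand-in for allocate + update
        cache_.emplace(descriptors, id);
        return id;
    }

    // Call when the backing pools are reset, for example at the end of the frame.
    void clear() { cache_.clear(); }

    uint32_t setsCreated() const { return nextSetId_; }

private:
    std::map<std::vector<uint64_t>, uint32_t> cache_;
    uint32_t nextSetId_ = 0;
};
```

Whether the lookup is cheaper than allocating and updating a new set depends on the driver, so a cache like this is worth profiling before adopting it.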
Set = 0 descriptor set containing uniform buffer with global, per-frame or per-
view data, as well as globally available textures such as shadow map texture
Set = 1 descriptor set containing uniform buffer and texture descriptors for per-
material data, such as albedo map, Fresnel coefficients, etc.
4.2 Descriptor Sets 225
Set = 2 descriptor set containing dynamic uniform buffer with per-draw data,
such as world transform array
For set = 0, the expectation is that it only changes a handful of times per frame;
it’s sufficient to use a dynamic allocation scheme similar to the previous section.
For set = 1, the expectation is that for most objects, the material data persists be-
tween frames, and as such could be allocated and updated only when the gameplay
code changes material data.
For set = 2, the data would be completely dynamic; due to the use of a dynamic
uniform buffer, we’d rarely need to allocate and update this descriptor set—assuming
dynamic constants are uploaded to a series of large per-frame buffers, for most draws
we’d need to update the buffer with the constant data, and call
vkCmdBindDescriptorSets with new offsets.
Note that due to compatibility rules between pipeline objects, in most cases it’s
enough to bind sets 1 and 2 whenever a material changes, and only set 2 when the material
is the same as that for the previous draw call. This results in just one call to
vkCmdBindDescriptorSets per draw call.
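The choice of which sets to rebind per draw can be expressed as a small helper. This sketch assumes the three-set layout from this section with set 0 already bound; BindRange and setsToBind are hypothetical names, and the returned range corresponds to the firstSet and descriptorSetCount arguments of vkCmdBindDescriptorSets.

```cpp
#include <cassert>
#include <cstdint>

// Range of descriptor sets to pass to a single bind call.
struct BindRange {
    uint32_t firstSet;
    uint32_t setCount;
};

// Set 0 is global, set 1 is per-material, set 2 is per-draw with a dynamic offset.
inline BindRange setsToBind(uint32_t previousMaterial, uint32_t currentMaterial) {
    if (previousMaterial != currentMaterial)
        return BindRange{1, 2};  // rebind the material and per-draw sets together
    return BindRange{2, 1};      // same material: only the per-draw set with a new offset
}
```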
For a complex renderer, different shaders might need to use different layouts—for
example, not all shaders need to agree on the same layout for material data. In rare
cases it might also make sense to use more than 3 sets depending on the frame struc-
ture. Additionally, given the flexibility of Vulkan it’s not strictly required to use the
same resource binding system for all draw calls in the scene. For example, post-pro-
cessing draw call chains tend to be highly dynamic, with texture/constant data changing
completely between individual draw calls. Some renderers initially implement the dy-
namic slot-based binding model from the previous section and proceed to additionally
implement the frequency-based sets for world rendering to minimize the performance
penalty for set management, while still keeping the simplicity of slot-based model for
more dynamic parts of the rendering pipeline.
The scheme described above assumes that in most cases, per-draw data is larger
than the size that can be efficiently set via push constants. Push constants can be set
without updating or rebinding descriptor sets; with a guaranteed limit of 128 bytes per
draw call, it’s tempting to use them for per-draw data such as a 4x3 transform matrix
for an object. However, on some architectures the actual number of constants available
to push quickly depends on the descriptor setup the shaders use, and is closer to 12
bytes or so. Exceeding this limit can force the driver to spill the push constants into
a driver-managed ring buffer, which can end up being more expensive than moving this
data to a dynamic uniform buffer on the application side. While limited use of push
constants may still be a good idea for some designs, it’s more appropriate to use them
in a fully bindless scheme described in the next section.
struct MaterialData
{
    vec4 albedoTint;
    float tilingX;
    float tilingY;
    float reflectance;
    float unused0; // pad to vec4
    uint albedoTexture;
    uint normalTexture;
    uint roughnessTexture;
    uint unused1; // pad to vec4
};
Similarly, all per-draw constants for all objects in the scene can reside in another
large storage buffer; for simplicity, we’ll assume that all per-draw constants have iden-
tical structure. To support skinned objects in a scheme like this, we’ll extract transform
data into a separate, third storage buffer:
struct TransformData
{
    vec4 transform[3];
};
Something that we’ve ignored so far is the vertex data specification. While Vulkan
provides a first-class way to specify vertex data by calling vkCmdBindVertexBuffers,
having to bind vertex buffers per-draw would not work for a fully bindless
design. Additionally, some hardware doesn’t support vertex buffers as a first-class
entity, and the driver has to emulate vertex buffer binding, which causes some CPU-side
slowdowns when using vkCmdBindVertexBuffers. In a fully bindless design, we
need to assume that all vertex buffers are suballocated in one large buffer and either
use per-draw vertex offsets (vertexOffset argument to vkCmdDrawIndexed) to have
hardware fetch data from it, or pass an offset in this buffer to the shader with each draw
call and fetch data from the buffer in the shader. Both approaches can work well, and
might be more or less efficient depending on the GPU; here we will assume that the
vertex shader will perform manual vertex fetching.
Thus, for each draw call we need to specify three integers to the shader:
Material index; used to look up material data from material storage buffer. The
textures can then be accessed using the indices from the material data and the
descriptor array.
Transform data index; used to look up transform data from transform storage
Vertex data offset; used to look up vertex attributes from vertex storage buffer.
We can specify these indices and additional data, if necessary, via draw data:
struct DrawData
{
    uint materialIndex;
    uint transformOffset;
    uint vertexOffset;
    uint unused0; // vec4 padding
};
The shader will need to access storage buffers containing MaterialData,
TransformData, DrawData as well as a storage buffer containing vertex data. These can be
bound to the shader via the global descriptor set; the only remaining piece of information
is the draw data index, which can be passed via a push constant.
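On the CPU side, the storage buffers need matching structure definitions. The following C++ mirrors are a sketch that assumes the shader declares these buffers with the std430 layout, where a vec4 occupies 16 bytes and float and uint occupy 4 bytes each; the static assertions guard the sizes and one offset against accidental changes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// CPU-side mirror of the shader MaterialData structure.
struct MaterialData {
    float albedoTint[4];
    float tilingX, tilingY, reflectance, unused0;
    uint32_t albedoTexture, normalTexture, roughnessTexture, unused1;
};

// Three vec4 values, as in the shader TransformData structure.
struct TransformData {
    float transform[3][4];
};

// CPU-side mirror of the shader DrawData structure.
struct DrawData {
    uint32_t materialIndex, transformOffset, vertexOffset, unused0;
};

static_assert(sizeof(MaterialData) == 48, "MaterialData must match the shader layout");
static_assert(sizeof(TransformData) == 48, "TransformData must match the shader layout");
static_assert(sizeof(DrawData) == 16, "DrawData must match the shader layout");
static_assert(offsetof(MaterialData, albedoTexture) == 32, "texture indices start at byte 32");
```

Because every structure size is a multiple of 16 bytes, arrays of these structures can be copied into the mapped storage buffers without additional padding between elements.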
With this scheme, we’d need to update the storage buffers used by materials and
draw calls each frame and bind them once using our global descriptor set; additionally,
we need to bind index data. Assuming that, like vertex data, index data is allocated in
one large index buffer, we only need to bind it once using vkCmdBindIndexBuffer.
With the global setup complete, for each draw call we need to call
vkCmdBindPipeline if the shader changes, followed by vkCmdPushConstants to specify an
index into the draw data buffer, followed by vkCmdDrawIndexed.
In a GPU-centric design, we can use vkCmdDrawIndirect or
vkCmdDrawIndirectCountKHR (provided by KHR_draw_indirect_count extension) and fetch
per-draw constants using gl_DrawIDARB (provided by KHR_shader_draw_parameters
extension) as an index instead of push constants. The only caveat is that for GPU-based
submission, we’d need to bucket draw calls based on pipeline object on CPU
since there’s no support for switching pipeline objects otherwise.
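The CPU-side bucketing could be as simple as a stable sort followed by a scan. This is a sketch with hypothetical types (Draw, Bucket) in which pipelines are identified by integers; each resulting bucket corresponds to one pipeline bind followed by one indirect draw covering drawCount entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Draw {
    uint32_t pipelineId;
    uint32_t drawDataIndex;
};

// Contiguous run of draws sharing one pipeline.
struct Bucket {
    uint32_t pipelineId;
    uint32_t firstDraw;
    uint32_t drawCount;
};

// Sorts draws by pipeline (stable, so the relative order within a pipeline
// is preserved) and returns one bucket per distinct pipeline.
inline std::vector<Bucket> bucketByPipeline(std::vector<Draw>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const Draw& a, const Draw& b) { return a.pipelineId < b.pipelineId; });
    std::vector<Bucket> buckets;
    for (uint32_t i = 0; i < draws.size(); ++i) {
        if (buckets.empty() || buckets.back().pipelineId != draws[i].pipelineId)
            buckets.push_back(Bucket{draws[i].pipelineId, i, 0});
        ++buckets.back().drawCount;
    }
    return buckets;
}
```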
With this, vertex shader code to transform the vertex could look like this:
DrawData dd = drawData[gl_DrawIDARB];
TransformData td = transformData[dd.transformOffset];
vec4 positionLocal = vec4(positionData[gl_VertexIndex
    + dd.vertexOffset], 1.0);
vec3 positionWorld = mat4x3(td.transform[0], td.transform[1],
    td.transform[2]) * positionLocal;
Depending on the GPU architecture it might also be beneficial to pass some of the indices, like
material index or vertex data offset, via push constants to reduce the number of memory indi-
rections in vertex/fragment shaders.
4.3 Command Buffer Recording and Submission 229
Fragment shader code to sample material textures could look like this:
DrawData dd = drawData[drawId];
MaterialData md = materialData[dd.materialIndex];
vec4 albedo = texture(sampler2D(materialTextures[md.albedoTexture],
    albedoSampler), uv * vec2(md.tilingX, md.tilingY));
This scheme minimizes the CPU-side overhead. Of course, fundamentally it’s a bal-
ance between multiple factors:
While the scheme can be extended to multiple formats of material, draw and
vertex data, it gets harder to manage.
Using storage buffers exclusively instead of uniform buffers can increase GPU
time on some architectures.
Fetching texture descriptors from an array indexed by material data indexed by
material index can add an extra indirection on GPU compared to some alterna-
tive designs.
On some hardware, various descriptor set limits may make this technique
impractical to implement; to be able to index an arbitrary texture dynamically from
the shader, maxPerStageDescriptorSampledImages should be large
enough to accommodate all material textures. While many desktop drivers
expose a large limit here, the specification only guarantees a limit of 16, so
bindless remains out of reach on some hardware that otherwise supports Vulkan.
As the renderers get more and more complex, bindless designs will become more
involved and eventually allow moving even larger parts of rendering pipeline to GPU;
due to hardware constraints this design is not practical on every single Vulkan-compat-
ible device, but it’s definitely worth considering when designing new rendering paths
for future hardware.
Command buffers allocated from one pool may not be recorded concurrently by
multiple threads
Command buffers and pools cannot be freed or reset while the GPU is still
executing the associated commands
Regrettably, Vulkan doesn’t provide a way for the driver to implement thread-safe command
buffer recording so that one command pool can be reused between threads; in the scheme de-
scribed, cross-thread synchronization is only required for switching pages which is relatively rare
and can be lock-free for the most part.
Because of these restrictions, a typical threading setup requires a set of command
buffer pools. The set has to contain F * T pools, where F is the frame queue length
(F is usually 2, with one frame recorded by the CPU while another frame is being
executed by the GPU, or 3) and T is the number of threads that can concurrently record
commands, which can be as high as the core count on the system. When recording commands
from a thread, the thread needs to allocate a command buffer using the pool associated
with the current frame and thread and record commands into it. Assuming that command
buffers aren’t recorded across a frame boundary, and that at a frame boundary the frame
queue length is enforced by waiting for the last frame in the queue to finish executing,
we can then free all command buffers allocated for that frame and reset all associated
command pools.
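Selecting a pool from the F * T set reduces to simple index arithmetic. The sketch below assumes the pools are stored in one flat array ordered by frame slot and then by thread; commandPoolIndex is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Index of the command pool for a given frame and recording thread, assuming
// frameQueueLength * threadCount pools stored in one flat array.
inline uint32_t commandPoolIndex(uint64_t frameNumber, uint32_t threadIndex,
                                 uint32_t frameQueueLength, uint32_t threadCount) {
    assert(frameQueueLength > 0 && threadIndex < threadCount);
    uint32_t frameSlot = static_cast<uint32_t>(frameNumber % frameQueueLength);
    return frameSlot * threadCount + threadIndex;
}
```

With F = 2 and T = 4, frame 2 reuses the pools of frame 0, which is safe only because the frame queue length guarantees that the GPU has finished executing frame 0 by then.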
Additionally, instead of freeing command buffers, it’s possible to reuse them after
calling vkResetCommandPool—which would mean that command buffers don’t have
to be allocated again. While in theory allocating command buffers could be cheap,
some driver implementations have a measurable overhead associated with command
buffer allocation. This also makes sure that the driver doesn’t ever need to return com-
mand memory to the system which can make submitting commands into these buffers
Note that depending on the frame structure, the setup above may result in unbal-
anced memory consumption across threads; for example, shadow draw calls typically
require less setup and less command memory. When combined with effectively random
workload distribution across threads that many job schedulers produce, this can result
in all command pools getting sized for the worst-case consumption. If an application
is memory constrained and this becomes a problem, it’s possible to limit the parallel-
ism for each individual pass and select the command buffer/pool based on the recorded
pass to limit the waste.
This requires introducing the concept of size classes to the command buffer man-
ager. With a command pool per thread and a manual reuse of allocated command buff-
ers as suggested above, it’s possible to keep a free list per size class, with size classes
defined based on the number of draw calls (e.g., “<100”, “100–400”, etc.) and/or the
complexity of individual draw calls (depth-only, gbuffer). Picking the buffer based on
the expected usage leads to a more stable memory consumption. Additionally, for
passes that are too small it is worthwhile to reduce the parallelism when recording
these—for example, if a pass has <100 draw calls, instead of splitting it into 4 recording
jobs on a 4-core system, it can be more efficient to record it in one job since that can
reduce the overhead of command memory management and command buffer submis-
during command processing. Additionally, each submission has some overhead both
on the CPU side and on the GPU side. In general a Vulkan application should target
<10 submits per frame (with each submit accounting for 0.5 ms or more of GPU work-
load), and <100 command buffers per frame (with each command buffer accounting
for 0.1 ms or more of GPU workload). This might require adjusting the concurrency
limits for command recording for individual passes, e.g., if a shadow pass for a specific
light has <100 draw calls, it might be necessary to limit the concurrency on the record-
ing for this pass to just one thread; additionally, for even shorter passes combining them
with neighboring passes into one command buffer becomes beneficial. Finally, the
fewer submissions a frame has the better; this needs to be balanced with submitting
enough GPU work earlier in the frame to increase CPU and GPU parallelism, though.
For example, it might make sense to submit all command buffers for shadow rendering
before recording commands for other parts of the frame.
Crucially, the number of submissions refers to the total number of VkSubmitInfo
structures submitted in all vkQueueSubmit calls in a frame, not to the number of
vkQueueSubmit calls per se. For example, when submitting 10 command buffers, it’s
much more efficient to use one VkSubmitInfo that submits 10 command buffers than
to use 10 VkSubmitInfo structures with one command buffer each, even if in
both cases only one vkQueueSubmit call is performed. Essentially, VkSubmitInfo
is a unit of synchronization/scheduling on GPU since it has its own set of fences/
Record primary command buffers that render chunks of draw calls into the same
framebuffer, using vkCmdBeginRenderPass and vkCmdEndRenderPass; ex-
ecute the resulting command buffers using vkQueueSubmit (batching submits
for efficiency)
Record secondary command buffers that render chunks of draw calls, passing
the render pass to vkBeginCommandBuffer along with VK_COMMAND_
primary command buffer, followed by vkCmdExecuteCommands to execute all
recorded secondary command buffers
While on immediate mode GPUs the first approach can be viable, and it can be a bit
easier to manage with respect to synchronization points on the CPU, it’s vital to use
the second approach on GPUs that use tiled rendering instead. Using the first approach
4.4 Pipeline Barriers 233
on tilers would require that the contents of the tiles are flushed to memory and loaded
back from memory between each command buffer, which is catastrophic for perfor-
by tools like Radeon Graphics Profiler, missing barriers are generally not detected by
validation tools.
Because of this, it’s vital to understand the behavior of barriers, the consequences
of overspecifying them, as well as how to work with them.
1. Stalling execution of a specific stage until another stage is drained of all current
work. For example, if a render pass renders data to a texture, and a subsequent
render pass uses a vertex shader to read from this texture, the GPU must wait for all
pending fragment shader and ROP work to complete before launching shader
threads for the vertex work in the subsequent pass. Most barrier operations will
lead to execution stalling for some stages.
3. Converting the format the resource is stored in, most commonly to decompress
the resource storage. For example, MSAA textures on some architectures are
stored in a compressed form where each pixel has a sample mask indicating
how many unique colors this pixel contains, and a separate storage for sample
It’s crucial to note that a commonly held belief that individual draw calls execute in isolation
without overlap with other work is wrong—GPUs commonly run subsequent draw calls in par-
allel across render state, shader and even render target switches.
data. Transfer stage or shader stage might be unable to read directly from a
compressed texture, so a barrier that transitions from VK_IMAGE_LAYOUT_
press the texture, writing all samples for all pixels to memory. Most barrier
operations won’t need to do this, but the ones that do can be incredibly
With this in mind, let’s try to understand the guidance for using barriers.
On tiled mode renderers, for some designs the expectation is that all vertex work
from the subsequent pass executes to completion before fragment work can
start; waiting for fragment work to end for any vertex work to begin thus com-
pletely eliminates the parallelism between vertex and fragment stages and is one
of the largest potential performance problems that a naively ported Vulkan title
can encounter.
Note that even if the barriers are specified correctly—in this case, assuming the
texture is read from the fragment stage, dstStageMask should be VK_PIPELINE_
valid and optimal barriers for various types of synchronization as part of Vulkan-Docs
repository on GitHub:
These can serve to improve the understanding of general barrier behavior, and can also
be used directly in a shipping application.
Additionally, for cases not covered by these examples and, in general, to simplify
the specification code and make it more correct, it is possible to switch to a simpler
model where, instead of fully specifying access masks, stages and image layouts, the
only concept that needs to be known about a resource is the resource state that encap-
sulates the stages that can use the resource and the usage mode for most common types
of access. Then every transition moves a resource from state A to state
B, which is much easier to understand. To that end, Tobias Hector, a member of
Khronos Group and a co-author of the Vulkan specification, wrote an open-source li-
brary, simple_vulkan_synchronization, that translates resource state (otherwise
known as access type in the library) transitions into Vulkan barrier specification. The
library is small and simple and provides support for split barriers as well as full pipeline
Due to the aforementioned problems, many modern renderers are starting to ex-
periment with render graphs as a way to declaratively specify all dependencies between
frame resources. Based on the resulting DAG structure, it’s possible to establish correct
barriers, including barriers required for synchronizing work across multiple queues,
and allocate transient resources with minimal use of physical memory.
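The core idea can be shown with a deliberately small sketch: passes declare the state in which they use each resource, and a planner walks them in execution order and emits a transition whenever the state changes. All names here (State, Use, Pass, Transition, planTransitions) are hypothetical, and the sketch leaves out much of what a production render graph does: it does not map states to stage masks, access masks or image layouts, does not handle multiple queues or transient memory aliasing, and does not emit the dependencies that consecutive writes in the same state would still require.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified resource states.
enum class State { Undefined, ColorWrite, DepthWrite, ShaderRead, TransferSrc };

struct Use {
    uint32_t resource;
    State state;
};

struct Pass {
    std::vector<Use> uses;
};

// A transition that must be issued before the pass at index beforePass.
struct Transition {
    uint32_t beforePass;
    uint32_t resource;
    State from;
    State to;
};

// Walks passes in their declared execution order and emits a transition
// whenever a resource is used in a different state than it was left in.
inline std::vector<Transition> planTransitions(const std::vector<Pass>& passes) {
    std::map<uint32_t, State> current;  // missing entries start as Undefined
    std::vector<Transition> result;
    for (uint32_t p = 0; p < passes.size(); ++p) {
        for (const Use& use : passes[p].uses) {
            State& state = current[use.resource];
            if (state != use.state) {
                result.push_back(Transition{p, use.resource, state, use.state});
                state = use.state;
            }
        }
    }
    return result;
}
```

Because the whole frame is known before any command is recorded, each emitted transition could later be translated into a single, precisely scoped barrier placed in front of the pass that needs it.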
A full description of a render graph system is out of scope of this article, but in-
terested readers are encouraged to refer to the following talks and articles:
Advanced Graphics Tech: Moving to DirectX 12: Lessons Learned, Tiago Rodrigues,
GDC 2017.
Different engines pick different parameters of the solution; for example, the Frostbite
render graph is specified by the application using the final execution order (which the
author of this article finds more predictable and preferable), whereas two other
presentations linearize the graph based on certain heuristics to try to find a more optimal
execution order. Regardless, the important part is that dependencies between passes
must be declared ahead of time for the entire frame to make sure that barriers can be
emitted appropriately. Importantly, the frame graph systems work well for transient
resources that are limited in number and represent the bulk of required barriers; while
it’s possible to specify barriers required for resource uploads and similar streaming
work as part of the same system, this can make the graphs too complex and the pro-
cessing time too large, so these are generally best handled outside of a frame graph
In the latter case, the driver will perform the necessary work to resolve MSAA contents
as part of work done when subpass/renderpass ends.
The second approach can be significantly more efficient. On tiled architectures,
using the first approach requires storing the entire MSAA texture to main memory,
followed by reading it from memory and resolving to the destination; the second approach
can perform in-tile resolve in the most efficient manner. On immediate mode
architectures, some implementations may not support reading compressed MSAA textures
using the transfer stage: the API requires a transition into the
VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL layout before calling vkCmdResolveImage, which may
lead to decompression of the MSAA texture, wasting bandwidth and performance.
With pResolveAttachments, the driver can perform the resolve operation at maxi-
mum performance regardless of the architecture.
In some cases, fixed-function MSAA resolve is insufficient. It is then necessary
to transition the texture to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL
and do the resolve in a separate render pass. On tiled architectures, this has the same
efficiency issues as the fixed-function vkCmdResolveImage method; on immediate mode
architectures, the efficiency depends on the GPU and driver. One possible alternative is to
use an extra subpass that reads the MSAA texture via an input attachment.
For this to work, the first subpass that renders to the MSAA texture has to specify the
MSAA texture via pColorAttachments, with VK_ATTACHMENT_STORE_OP_DONT_CARE
as the store op. The second subpass that performs the resolve needs to specify the
MSAA texture via pInputAttachments and the resolve target via pColorAttachments;
the subpass then needs to render a full-screen quad or triangle with a
shader that uses a subpassInputMS resource to read MSAA data. Additionally, the application
needs to specify a dependency between the two subpasses that indicates the
stage/access masks, similarly to pipeline barriers, and the dependency flag
VK_DEPENDENCY_BY_REGION_BIT. With this, the driver should have enough information
to arrange the execution such that on tiled GPUs, the MSAA contents never leave the
tile memory and instead are resolved in-tile, with the resolve result being written to main
memory [7]. Note that whether this happens depends on the driver and is unlikely to result
in significant savings on immediate mode GPUs.
[7] Of course, it’s not guaranteed that the driver will perform this optimization; it depends on the
hardware architecture and driver implementation.
While an individual state object is supposed to model parts of GPU state and
could be directly transferred to commands that setup GPU state, on some GPUs
the configuration of the GPU state required data from multiple different state
blocks. Because of this, drivers typically must keep a shadow copy of all state
and convert the state to the actual GPU commands at the time of Draw/DrawIndexed.
With the rasterization pipeline getting more complex and gaining more pro-
grammable stages, some GPUs didn’t map them directly to hardware stages,
which means that the shader microcode can depend on whether other shader
stages are active and, in some cases, on the specific microcode for other stages;
this meant that the driver might have to compile new shader microcode from
state that can only be discovered at the time of Draw/DrawIndexed.
Similarly, on some GPUs, fixed functional units from the API description were
implemented as part of one of the shader stages—changing the vertex input for-
mat, blending setup, or render target format could affect the shader microcode.
Since the state is only known at the time of Draw/DrawIndexed, this, again, is
where the final microcode had to be compiled.
While the first problem is more benign, the second and third problems can lead to significant
stalls during rendering because, due to the complexity of modern shaders and shader
compilation pipelines, shader compilation can take tens to hundreds of milliseconds
depending on hardware. To solve this, Vulkan and other new APIs introduce the concept
of a pipeline object, which encapsulates most GPU state, including vertex input format,
render target format, state for all stages and shader modules for all stages. The expec-
tation is that on every supported GPU, this state is sufficient to build final shader mi-
crocode and GPU commands required to set the state up, so the driver never has to
compile microcode at draw time and can optimize pipeline object setup to the extent
This model, however, presents challenges when implementing renderers on top of
Vulkan. There are multiple ways to solve this problem, with different tradeoffs with
respect to complexity, efficiency, and renderer design.
A minor concern is that the state that needs to be hashed together is potentially
large; doing this for every draw call can be time consuming when the cache already
contains all relevant objects. This can be mitigated by grouping state into objects and
hashing pointers to these objects, and in general simplifying the state specification from
the high-level API point of view.
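The idea of grouping state into objects and hashing pointers to them can be sketched as follows. This is a minimal illustration and not the author's implementation: the types ShaderProgram, RenderState, VertexLayout, PipelineKey and PipelineKeyHash are hypothetical names, and the sketch assumes that the grouped state objects are immutable and long-lived, so that pointer identity implies state identity.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hypothetical grouped state objects. A real renderer would store shader
// modules, blend/depth/raster state and vertex formats inside them.
struct ShaderProgram { int id; };
struct RenderState   { int id; };
struct VertexLayout  { int id; };

// The cache key holds only pointers, so hashing and comparing it costs a few
// machine words per draw call regardless of how large the pointed-to state is.
struct PipelineKey {
    const ShaderProgram* program;
    const RenderState*   state;
    const VertexLayout*  layout;

    bool operator==(const PipelineKey& other) const {
        return program == other.program && state == other.state &&
               layout == other.layout;
    }
};

struct PipelineKeyHash {
    // Mixes one pointer into the running hash value.
    static void combine(std::size_t& seed, const void* pointer) {
        seed ^= std::hash<const void*>{}(pointer) + 0x9e3779b9u +
                (seed << 6) + (seed >> 2);
    }

    std::size_t operator()(const PipelineKey& key) const {
        std::size_t seed = 0;
        combine(seed, key.program);
        combine(seed, key.state);
        combine(seed, key.layout);
        return seed;
    }
};

// Example container type: maps a key to a (stand-in) pipeline identifier.
using PipelineMap = std::unordered_map<PipelineKey, int, PipelineKeyHash>;
```

The tradeoff is that two state objects with identical contents but different addresses produce different keys, so this approach works best when the high-level API deduplicates state objects at creation time.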
A major concern, however, is that for any pipeline state object that must be created,
the driver might need to compile multiple shaders to the final GPU microcode. This
process is time consuming; additionally, it cannot be optimally threaded with a just-in-time
compilation model. If the application only uses one thread for command submission,
this thread would typically also compile pipeline state objects; even with multiple
threads, often several threads would request the same pipeline object, serializing
compilation, or one thread would need several new pipeline objects, which increases
the overall latency of submission since other threads would finish first and have no
work to do.
For multithreaded submission, accessing the cache can result in contention be-
tween cores even when the cache is full. Fortunately, this can be solved by a two-level
cache scheme as follows:
The cache would have two parts, the immutable part that never changes during the
frame, and the mutable part. To perform a pipeline cache lookup, we first check if the
immutable cache has the object—this is done without any synchronization. In the event
of the cache miss, we lock a critical section and check if the mutable cache has the
object; if it doesn’t, we unlock the critical section, create the pipeline object, and then
lock it again and insert the object into the cache, potentially displacing another object
(additional synchronization might be required to ensure that, when two threads request the same
object, only one compilation request is issued to the driver). At the end of the frame,
all objects from the mutable cache are added to the immutable cache and the mutable
cache is cleared, so that on the next frame access to these objects can be free-threaded.
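The two-level scheme described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: PipelineHandle is a hypothetical stand-in for VkPipeline, the creation callback stands in for a call such as vkCreateGraphicsPipelines, eviction is omitted, and endFrame is assumed to be called only while no other thread is performing lookups.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

using StateHash = std::uint64_t;       // hash of the full pipeline state
using PipelineHandle = std::uint64_t;  // hypothetical stand-in for VkPipeline

class TwoLevelPipelineCache {
public:
    using CreateFn = std::function<PipelineHandle(StateHash)>;

    explicit TwoLevelPipelineCache(CreateFn create) : create_(std::move(create)) {}

    PipelineHandle get(StateHash key) {
        // Level 1: never modified during the frame, so it is read without locking.
        auto found = immutable_.find(key);
        if (found != immutable_.end()) return found->second;

        {
            // Level 2: shared and mutable, so it is protected by a mutex.
            std::lock_guard<std::mutex> lock(mutex_);
            auto pending = mutable_.find(key);
            if (pending != mutable_.end()) return pending->second;
        }

        // Create the object outside the lock so other threads are not blocked.
        PipelineHandle created = create_(key);

        std::lock_guard<std::mutex> lock(mutex_);
        // If another thread inserted the same key first, emplace keeps its entry.
        // A real renderer would then destroy the redundant object it just created.
        return mutable_.emplace(key, created).first->second;
    }

    // Must run between frames, while no thread is calling get().
    void endFrame() {
        std::lock_guard<std::mutex> lock(mutex_);
        immutable_.insert(mutable_.begin(), mutable_.end());
        mutable_.clear();
    }

    std::size_t immutableSize() const { return immutable_.size(); }

private:
    CreateFn create_;
    std::unordered_map<StateHash, PipelineHandle> immutable_;
    std::unordered_map<StateHash, PipelineHandle> mutable_;
    std::mutex mutex_;
};
```

In this sketch two threads that miss on the same key may both invoke the creation callback; avoiding that duplicate work requires the extra synchronization mentioned in the text, for example a per-key "in flight" marker.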
Vulkan provides a pipeline cache object, VkPipelineCache, that can store driver-
specific bits of state and shader microcode to improve compilation time for pipeline
objects. For example, if an application creates two pipeline objects with identical setup
except for culling mode, the shader microcode would typically be the same. To make
sure the driver only compiles the object once, the application should pass the same
instance of VkPipelineCache to vkCreateGraphicsPipelines in both calls, in
which case the first call would compile the shader microcode and the second call would
be able to reuse it. If these calls happen concurrently in different threads the driver
might still compile the shaders twice since the data would only be added to the cache
when one of the calls finishes.
It’s vital to use the same VkPipelineCache object when creating all pipeline
objects and to serialize it to disk between runs using vkGetPipelineCacheData and
the pInitialData member of VkPipelineCacheCreateInfo. This makes sure that the
compiled objects are reused between runs and minimizes frame spikes during subsequent
application runs.
Unfortunately, during the first playthrough, shader compilation spikes will still
occur since the pipeline cache will not contain all used combinations. Additionally,
even when the pipeline cache contains the necessary microcode, vkCreateGraphicsPipelines
isn’t free, and as such, compilation of new pipeline objects can
still increase the frame time variance. To solve that, it’s possible to pre-warm the in-memory
cache (and/or VkPipelineCache) during load time.
One possible solution here is that at the end of the gameplay session, the renderer
could save the in-memory pipeline cache data (which shaders were used with which
state [8]) to a database. Then, during QA playthroughs, this database could be populated
with data from multiple playthroughs at different graphics settings and so on, effectively
gathering the set of states that are likely to be used during the actual gameplay.
This database can then be shipped with the game; at game startup, the in-memory
cache could be prepopulated with all states created using the data from that database
(or, depending on the amount of pipeline states, this pre-warming phase could be lim-
ited to just the states for the current graphics settings). This should happen on multiple
threads to reduce the load time impact; the first run would still have a longer load time
(which can be further reduced with features like Steam pre-caching), but frame spikes
due to just-in-time pipeline object creation can be mostly avoided.
If a particular set of state combinations wasn’t discovered during QA play-
throughs, the system can still function correctly—at the expense of some amount of
stuttering. The resulting scheme is more or less universal and practical—but requires a
potentially large effort to play through enough levels with enough different graphics
settings to capture most realistic workloads, making it somewhat hard to manage.
[8] This can use an application-specific format, or a library like Fossilize.
technique gbuffer
    vertex_shader gbuffer_vs
    fragment_shader gbuffer_fs
#ifdef DECAL
    depth_state less_equal false
    blend_state src_alpha one_minus_src_alpha
#else
    depth_state less_equal true
    blend_state disabled
#endif
    render_target 0 rgba16f
    render_target 1 rgba8_unorm
    render_target 2 rgba8_unorm
    vertex_layout gbuffer_vertex_struct
Assuming all draw calls, including ones used for post-effects etc., use the effect
system to specify render state, and assuming the set of effects and techniques is static,
it’s trivial to precreate all pipeline objects—each technique needs just one—at load
time using multiple threads, and at runtime use very efficient code with no need for in-
memory caches or possibility of frame spikes.
In practice, implementing this system in a modern renderer is an exercise in com-
plexity management. It’s common to use complex shader or state permutations—for
example, for two-sided rendering you typically need to change culling state and perhaps
change the shaders to implement two-sided lighting. For skinned rendering, you need
to change vertex format and add some code to the vertex shader to transform the attributes
using skinning matrices. On some graphics settings, you might decide that the render
target format needs to be floating-point R11G11B10 instead of RGBA16F, to
conserve bandwidth. All these combinations multiply and require you to be able to
represent them concisely and efficiently when specifying technique data (for example,
by allowing #ifdef sections inside technique declarations as shown above), and,
importantly, to be aware of the steadily growing number of combinations, refactoring
and simplifying them as appropriate. Some effects are rare enough that they could
be rendered in a separate pass without increasing the number of permutations. Some
computations are simple enough that always running them in all shaders can be a better
tradeoff than increasing the number of permutations. And some rendering techniques
offer better decoupling and separation of concerns, which can also reduce the number
of permutations.
harder but doesn’t make it different—many renderers have to solve the problem of a
large number of shader permutations anyway, and once you incorporate all render state
into shader/technique specification and focus on reducing the number of technique per-
mutations, the same complexity management solutions apply equally to both problems.
The benefit of implementing a system like this is perfect knowledge of all required
combinations (as opposed to having to rely on fragile permutation discovery systems),
great performance with minimal frame-to-frame variance including the first load, and
a forcing function to keep the complexity of rendering code at bay.
The Vulkan API shifts a large amount of responsibility from driver developers onto application
developers. Navigating the landscape of various rendering features becomes
more challenging when many implementation options are available; it’s challenging
enough to write a correct Vulkan renderer, but performance and memory consumption
are paramount. This article discussed important considerations when dealing
with specific problems in Vulkan, presented multiple implementation approaches that
provide different tradeoffs between complexity, ease of use and performance, and spanned
the range from porting existing renderers to redesigning renderers around Vulkan.
Ultimately, it’s hard to give general advice that works across all vendors and is
applicable to all renderers. For this reason, it’s vital to profile the resulting code on the
target platform/vendor. For Vulkan, it’s important to monitor the performance across
all vendors that the game is planning to ship on, as the choices the application makes
are even more important, and in some cases a specific feature, like fixed-function vertex
buffer bindings, is the fast path on one vendor but a slow path on another.
Beyond using validation layers to ensure code correctness and vendor-specific profiling
tools, such as AMD Radeon Graphics Profiler or NVIDIA Nsight Graphics, many
open-source libraries that can help optimize your renderer for Vulkan are available:
VulkanMemoryAllocator. Provides convenient and performant memory allocators
for Vulkan as well as other memory-related algorithms such as defragmentation.
volk. Provides an easy way to use Vulkan entry points from the driver directly,
which can reduce function call overhead.
synchronization. Provides a way to specify Vulkan barriers using a simplified
access type model, which helps balance correctness and performance.
Fossilize. Provides serialization support
for various Vulkan objects, most notably for pipeline state creation info
which can be used to implement pre-warming for a pipeline cache.
perfdoc. Provides layers similar to
validation layers that analyze the stream of rendering commands and identify
potential performance problems on ARM GPUs.
Finally, some vendors develop open-source Vulkan drivers for Linux; studying their
sources can help gain more insight into performance of certain Vulkan constructs:
AMD. Contains xgl which has the Vul-
kan driver source, and PAL which is a library used by xgl; many Vulkan func-
tion calls end up going through both xgl and PAL.
AMD. Contains
community-developed open-source radv driver.
Intel. Contains
Anvil driver.
glTF—Runtime 3D
Asset Delivery
Marco Hutter
The widespread availability of client devices and web-browsers that are capable of
high-quality 3D rendering offers new application areas that involve 3D assets. These
applications range from online product presentation and configuration websites, virtual
museums, geo-information systems to general virtual environments or 3D video games
on mobile devices. Despite the variety of these applications, they pose many similar
requirements for the 3D assets. In order to meet these requirements, the Khronos Group
has designed glTF -- a transmission and delivery format for 3D assets. Its development
is based on the collaborative efforts of 3D content creators and providers, application-
and engine developers and end users. Since the initial release of glTF in 2015, software
companies from different application domains have adopted the format, and it is be-
coming the standard for the delivery of 3D assets for efficient real-time rendering.
This chapter explains the goals and features that are achieved with glTF, and their
technical implementation. The role of glTF in the 3D content creation workflow is laid
out, showing the tools and libraries that are available to support each step of the content
creation process, and how glTF may open up new application areas that rely on the
efficient transfer and rendering of high-quality 3D content.
Figure 5.1. The basic structure of a glTF asset. The JSON file contains information about the
asset structure. The geometry data and textures are stored in binary files. Optionally, all files may
be combined into a single glTF binary (GLB) file.
Textures are stored directly as JPG or PNG images. These formats allow for an
efficient network transfer, and can easily and efficiently be decompressed using stand-
ard image processing libraries.
In the default glTF format, the JSON structure, binary data and images are stored
as individual files. In order to simplify the delivery of assets to the end users, there is
the option to combine all data into a single, binary file. Such a file is called a glTF
binary (GLB) file.
The goal of consistency of the visual representation of the rendered objects is
achieved by making Physically Based Rendering (PBR) part of the standard. The phys-
ically based rendering is based on a model that describes the surface structure of ob-
jects, like reflectivity and color, in a way that precisely defines the final appearance of
the rendered objects.
There are two common models for PBR: The metallic-roughness model and the
specular-glossiness model. Content creators, researchers and rendering engine devel-
opers analyzed the advantages of both models in view of the goals of glTF. Both mod-
els have similar capabilities regarding the range of materials that they can represent.
But the metallic-roughness model has a lower memory consumption, is easier to im-
plement in the rendering client, and is supported by default in most modern rendering
engines. Therefore, the metallic-roughness model is the model that is used in the core
glTF specification. The specular-glossiness model is supported in glTF via an exten-
sion. For both models, the glTF specification contains a definition of the possible ma-
terial properties, and how to interpret these properties for the light and rendering
glTF establishes a set of conventions for the assets: The assets are given in a right-handed
coordinate system, the positive y axis is pointing upwards, and the front of a
glTF asset points along the positive z axis. The basic unit for all coordinates and distances
is the meter. Rotations are defined with quaternions, encoding the rotation axis
and angle unambiguously. These design decisions aim at minimizing fragmentation
and ensure that assets can be displayed by all glTF viewers without the need
for additional conversions.
translation, rotation, and scaling, or as a transformation matrix. Thus, each node de-
fines a local transform, and the concatenation of the transformations of nodes along
one path of the hierarchy defines a global transform. The transformation of the nodes
allows for a free arrangement of the scene elements in the virtual environment.
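The concatenation of local transforms into a global transform can be illustrated with a small sketch. It assumes 4x4 matrices stored in column-major order and, purely for brevity, a hypothetical Node struct that stores a parent index (glTF itself stores lists of children, so a loader would have to derive the parent relation first).

```cpp
#include <array>
#include <cstddef>
#include <vector>

// 4x4 matrix in column-major order: element (row, col) is at index col * 4 + row.
using Mat4 = std::array<float, 16>;

Mat4 identity() {
    return {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};
}

Mat4 translation(float x, float y, float z) {
    Mat4 m = identity();
    m[12] = x;
    m[13] = y;
    m[14] = z;
    return m;
}

// Returns the matrix product a * b.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 result{};
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a[k * 4 + row] * b[col * 4 + k];
            }
            result[col * 4 + row] = sum;
        }
    }
    return result;
}

// Hypothetical node representation: parent is -1 for a root node.
struct Node {
    int parent;
    Mat4 local;
};

// Walks from the node up to the root, prepending each ancestor's local transform.
Mat4 globalTransform(const std::vector<Node>& nodes, int index) {
    Mat4 global = nodes[static_cast<std::size_t>(index)].local;
    int parent = nodes[static_cast<std::size_t>(index)].parent;
    while (parent >= 0) {
        const Node& ancestor = nodes[static_cast<std::size_t>(parent)];
        global = multiply(ancestor.local, global);
        parent = ancestor.parent;
    }
    return global;
}
```

A viewer that renders every node would typically traverse the hierarchy once from the roots downwards and reuse each parent's global transform instead of walking up the tree per node as done here.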
This arrangement of the scene elements is achieved by attaching the objects to
nodes. In the core glTF specification, the objects that may be attached to nodes are
meshes, skins and cameras. Meshes represent the geometric objects that appear in the
scene, and skins are attached to nodes that define the skeleton of a virtual character.
Both concepts are explained in more detail in the following sections. Orthographic and
perspective cameras can be attached to the nodes in order to define the view configu-
ration for the scene.
Additional types of elements can be attached to nodes by defining an extension.
For example, it is possible to define an extension for point or spot lights that have a
certain position and direction. As a part of the extension, these objects may be attached
to nodes, in order to define their spatial arrangement in the scene.
5.3.2 Meshes
The rendered objects that appear in the scene consist of meshes. Different kinds of
meshes are supported by glTF: The primitive type of the meshes can be points, lines
or triangles, and the meshes can either be indexed or non-indexed. Additionally, each
mesh defines a set of vertex attributes.
The core specification of glTF supports a set of vertex attributes that already cover
most application cases: The positions and normals of the vertices are encoded as 3D
floating point vectors. Tangents that may be required for lighting computations based
on normal maps are given as 4D floating point vectors.
Texture coordinates can be given as 2D vectors. Additionally, vertex colors can be
defined. The colors can be 3D or 4D vectors, consisting of the red, green, blue (and
optional alpha) components of the color. The texture coordinates and vertex colors can
either be floating point values, or 8- or 16-bit integer values that are normalized by the
renderer to the range [0, 1].
Further vertex attributes that are supported by the standard are the joint- and
weight information of the vertices that are required for vertex skinning, as explained in
the section about Morphing and Skinning.
Client implementations must at least support the vertex attributes that are defined
in the core specification, which are positions, normals, tangents, vertex colors, two sets
of texture coordinates and joint and weight information. But the specification of glTF
also allows for the definition of additional vertex attributes. This may be additional
texture coordinate sets or vertex colors, but also custom attributes with application-
specific semantics. For example, an application may choose to assign physical proper-
ties like a temperature or pressure to each vertex. A custom renderer implementation
may then use these vertex attributes to offer a color-coded visualization of the physical
property, extending the range of possible application areas of glTF to scientific visual-
izations or engineering applications.
The actual skinning computation then usually takes place in a vertex shader. The
shader receives the vectors of joint indices and weights as 4D attribute variables, as
well as the joint matrices that describe the current deformations caused by each joint.
The joint matrices for the respective indices are selected and the weights are used to
compute a linear combination of these joint matrices, to obtain the skinning matrix.
The skinning matrix eventually transforms each vertex based on the current pose of the skeleton.
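The linear combination of joint matrices can be written out on the CPU as a small sketch of what a vertex shader would compute per vertex. The function names skinningMatrix and transformPoint are illustrative, and the matrices are assumed to be 4x4 and stored in column-major order.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Mat4 = std::array<float, 16>;  // column-major 4x4 matrix
using Vec3 = std::array<float, 3>;

// Blends the four joint matrices referenced by a vertex using its weights.
Mat4 skinningMatrix(const std::vector<Mat4>& jointMatrices,
                    const std::array<unsigned, 4>& joints,
                    const std::array<float, 4>& weights) {
    Mat4 skin{};  // starts as the zero matrix
    for (std::size_t i = 0; i < 4; ++i) {
        const Mat4& joint = jointMatrices[joints[i]];
        for (std::size_t e = 0; e < 16; ++e) {
            skin[e] += weights[i] * joint[e];
        }
    }
    return skin;
}

// Transforms a position (implicit w = 1) by a column-major matrix.
Vec3 transformPoint(const Mat4& m, const Vec3& p) {
    return {m[0] * p[0] + m[4] * p[1] + m[8] * p[2] + m[12],
            m[1] * p[0] + m[5] * p[1] + m[9] * p[2] + m[13],
            m[2] * p[0] + m[6] * p[1] + m[10] * p[2] + m[14]};
}
```

The sketch assumes that the four weights of a vertex sum to one; otherwise the blended matrix would also scale the vertex.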
5.3.4 Animations
Animations may be one part of the description of a scene in a glTF asset. So in addition
to the overall structure and contents of a scene, the asset may also contain information
about the behavior of the objects in the scene over time.
These animations may affect the transformation properties of nodes, or the weights
of morph targets. This is achieved by storing key frames that associate time stamps
with a set of values for certain properties. At runtime, the property values that corre-
spond to the current time stamp are read, and passed to the target of the animation—
for example, to the translation property of a node. The key frames may be iterated
through stepwise, interpolated linearly or using cubic splines, which allows for a
smooth, realistic movement of the objects that are attached to the animated nodes.
There may be multiple animations combined in a single glTF asset. The general
structure of a single animation is that it consists of channels and samplers. The sam-
plers define the interpolation behavior as well as the key frame data, which consists of
the time stamps and the values of the animated properties. The channel establishes the
connection between a sampler and the animated target property. The time stamps for a
sampler are given in seconds of animation time. The client application can either ad-
vance the animation time in real time, or offer user controls, for example, for playing
back animations at different speeds.
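Evaluating a sampler at a given animation time can be sketched as follows for a single scalar property, such as one morph target weight. The names Interpolation and sampleScalar are illustrative; the sketch assumes strictly increasing time stamps, covers only stepwise and linear interpolation, and leaves out cubic splines as well as the spherical interpolation that rotations require.

```cpp
#include <cstddef>
#include <vector>

enum class Interpolation { Step, Linear };

// Returns the animated value at time t (in seconds of animation time).
// times and values must have the same length; times must be strictly increasing.
float sampleScalar(const std::vector<float>& times,
                   const std::vector<float>& values,
                   float t,
                   Interpolation mode) {
    if (times.empty()) return 0.0f;

    // Clamp to the first and last key frame outside of the animated range.
    if (t <= times.front()) return values.front();
    if (t >= times.back()) return values.back();

    // Find the first key frame that lies strictly after t.
    std::size_t next = 1;
    while (times[next] <= t) ++next;
    const std::size_t previous = next - 1;

    if (mode == Interpolation::Step) {
        return values[previous];  // hold the previous value until the next key
    }

    // Linear interpolation between the two surrounding key frames.
    const float u = (t - times[previous]) / (times[next] - times[previous]);
    return values[previous] + u * (values[next] - values[previous]);
}
```

A client that plays back at a different speed only needs to scale the time it passes in, which is why the time stamps can stay in seconds of animation time.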
5.3.5 Materials
The core of the material definition consists of the properties that are used in the metallic-roughness
model of Physically Based Rendering (PBR): The base color, the
metalness and the roughness. The base color defines the main color of the object surface,
and is sometimes also referred to as albedo. The metalness determines the reflectivity
characteristics. The roughness of the object surface affects how matte or glossy
the object appears. Each of these properties may be set uniformly for the whole material,
or defined using a texture.
The materials in glTF may have a set of additional properties: A normal texture
allows adding the appearance of fine geometric details to the object surface, without
the need for a high resolution mesh. An occlusion texture can be defined, to emulate
self-shadowing effects of the object in concave areas. An emissive color or emissive
texture may be added to the material, which allows the object surface to appear
Figure 5.2. Illustration of material properties supported in glTF: The base color and metalness
and roughness textures are part of the core material definition for physically based rendering.
The emissive, occlusion and normal maps allow for additional details and thus an increased
realism of the rendered result.
5.3.6 Textures
Details about the textures that are the basis for the definition of a material are encoded
in the JSON part of a glTF asset. A texture definition consists of an image and a sam-
pler. The image data can be in PNG or JPG format, and can either be stored as an
external file that is referred to using a URI, or as part of the binary data of an asset.
The sampler further defines wrapping- and filtering options for the texture. Wrapping
modes allow textures to be clamped or repeated at the border. Filtering modes define
the magnification- and minification filters, as well as mipmapping behavior of the sam-
pled texture.
Different options exist for structuring the binary data: It is possible to combine multiple
binary data sets into a single binary file, or split them into multiple binary files. For
example, there may be one binary file that stores the vertex positions and triangle in-
dices of one mesh, and multiple other binary files storing the data for different animations.
The binary files are referred to by the JSON part of a glTF asset, using URIs. At
runtime, the contents of such a binary data file are loaded into a buffer. The JSON part
of the glTF asset contains further information about the structure of this binary data:
So-called accessors provide information about how the buffer data has to be inter-
preted. For example, the binary data of a glTF asset may consist of vertex positions
and triangle indices for a mesh. In this case, two accessors will exist: The first one
defines the range of the buffer that contains the positions as 3D floating point vectors.
The second accessor defines that a range of the buffer contains the indices, for example,
as 32-bit scalar integer values.
In general, an accessor defines the range of a buffer that contains the relevant data,
as well as the number of elements in this range. The elements can be defined to be
scalar values, 2D, 3D, or 4D vectors or matrices. The accessor also defines the com-
ponent type of the elements, which may be single-precision floating point values or 8-,
16-, or 32-bit integer values. An optional stride for the data elements even allows for
an interleaved storage of multiple vertex attributes.
Although a large variety of different data layouts can be represented with the accessor
concept of glTF, most of them are tailored for efficient rendering. This
means that in most cases, the part of a buffer that corresponds to one accessor, and
that may, for example, represent a single vertex attribute of a mesh, can be uploaded directly
to the GPU using standard graphics API calls, without further preprocessing.
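How an accessor selects elements from a buffer can be sketched with a reader for 3D floating point vectors. The Vec3Accessor struct is a simplified, hypothetical stand-in that flattens the offset, stride and element count into one place; it is not the JSON layout of the specification. The sketch also assumes a little-endian host, matching the byte order of glTF binary data.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified accessor description (hypothetical field names).
struct Vec3Accessor {
    std::size_t byteOffset;  // where the first element starts in the buffer
    std::size_t byteStride;  // distance between elements; 0 means tightly packed
    std::size_t count;       // number of elements
};

// Copies the 3D float vectors described by the accessor out of the raw buffer.
std::vector<std::array<float, 3>> readVec3(const std::vector<std::uint8_t>& buffer,
                                           const Vec3Accessor& accessor) {
    const std::size_t elementSize = 3 * sizeof(float);
    const std::size_t stride =
        accessor.byteStride != 0 ? accessor.byteStride : elementSize;

    std::vector<std::array<float, 3>> result;
    result.reserve(accessor.count);
    for (std::size_t i = 0; i < accessor.count; ++i) {
        const std::size_t offset = accessor.byteOffset + i * stride;
        if (offset + elementSize > buffer.size()) break;  // stop on truncated data
        std::array<float, 3> element;
        // memcpy avoids alignment and aliasing problems of pointer casts.
        std::memcpy(element.data(), buffer.data() + offset, elementSize);
        result.push_back(element);
    }
    return result;
}
```

With a stride of 20 bytes, for example, the same buffer can interleave a 3D position and a 2D texture coordinate per vertex, and a second accessor with a byte offset of 12 would address the texture coordinates.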
5.4 Ecosystem
The core of the glTF ecosystem is maintained by the Khronos Group, and the main
entry point and overview of the available resources is the Khronos glTF landing
page. The most important resources are summarized
in the following sections.
Tutorials for developers who want to implement loaders or viewers for glTF are
available. These tutorials describe the
technical part of the specification and the underlying concepts. They are associated
with a set of sample models that show and explain each feature individually.
Figure 5.3. A schematic content creation pipeline. The role of glTF is that of a format that is
used for the final delivery to the client application that will import and render the 3D asset.
Preprocessing and optimization steps are supported via the glTF toolchain.
5.5.1 Creation
One of the main sources of 3D content are authoring tools like Blender, 3D Studio
Max, Maya, Cinema 4D or specialized applications, for example, for CAD or character
modeling. These tools allow artists, designers and engineers to define the geometry of
objects, their appearance in terms of the surface structure and material properties, their
behavior over time in form of articulated animation, vertex skinning or morphing, and
how several objects are arranged in a virtual environment.
An increasing number of 3D authoring tools offers the option to directly export
assets as glTF. This is usually accomplished by exporter plugins for the authoring tools.
For the case that no direct export from the authoring tool is possible, there are convert-
ers from different input file formats to glTF. Most importantly, there are converters that
can convert the standard authoring exchange formats COLLADA and FBX into glTF.
Another important source of 3D content is digitization. The digitization
of products or cultural heritage artifacts is usually accomplished with 3D scanners.
The results are often stored as plain geometry formats like OBJ, and there is a variety
of converters for these file formats into glTF.
For the case that existing assets should be converted to glTF, possibly in a batch
process, these converters can often be run as standalone, command-line applications.
Many of the converters are also available as online services that allow converting the
assets simply via drag-and-drop.
The result of an export or conversion will usually be an asset in glTF format where
the internal structures still resemble the data structure of the creation tool or the source
file. Further preprocessing may then be applied to this asset, in order to optimize it for
the delivery to the client.
5.5.2 Optimization
In order to prepare a complex 3D asset for the delivery to the client, it may have to be
preprocessed or optimized. In some cases, the focus may be on optimization or
simplification of the geometry. This may include the use of extensions to integrate spe-
cial compression methods for the geometry data, or just simplifications of the scene
and node structure. In other cases, there may be the need to combine different geome-
tries or materials in one asset. For the final delivery to the client, it may be desirable to
convert the asset into a single file that contains all the elements of the asset, including
geometry data and textures. The main goal here is to have the complete asset, self-
contained, so that it may be downloaded without issuing requests to external resources.
The official tool for most of these optimization and conversion tasks is the glTF
pipeline tool from Analytical Graphics Inc., which is available at
AnalyticalGraphicsInc/gltf-pipeline. It supports the optimization of the mesh structure
using a special mesh compression extension, as well as the conversion between default
glTF assets and glTF binary (GLB) files.
5.5.3 Validation
An important criterion for the broad acceptance of glTF is the robustness and
reliability of the tools that are used to process the 3D assets. In order to make sure that
exporters and converters are generating files that are syntactically and structurally
valid, the files may be validated using the Khronos glTF validator that is available at
The validator can be used as an online drag-and-drop tool, as a standalone
application, or as a library. The online tool makes it possible to quickly validate a given
glTF asset without installing any software. The standalone validator application may
be used to perform individual or batch validation of generated assets, for example,
before they are uploaded to a content distribution network to be delivered to the
clients. As a library, the validator can be integrated by developers into their own glTF
tools and libraries. Input files can then be validated before passing them to the
downstream processing. The validation allows developers to reduce the internal error
handling to a minimum, because they can rely on all the assertions that are made by the
specification for valid glTF assets.
In all cases, the validator performs a validation of the JSON part of the asset, and
of the binary data, to make sure that the files containing the binary data match the
intended usage in terms of size and data layout. The validation of the JSON part is
based on a JSON schema that is part of the specification, and defines the JSON struc-
ture as well as ranges of possible values for the JSON properties. Additional constraints
that are established by the specification are checked and violations of these constraints
are reported.
The validator generates a report of the validation process in JSON format. This
report lists the warnings and errors that have been detected, and contains detailed
information about the source of each problem to help the developer locate and correct
the issue. Because the report is machine-readable, it can also be processed
automatically. This makes it possible to use the validator as part of automated unit tests
for exporters.
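A minimal sketch of such automated processing is shown below. It assumes that the report contains an `issues.messages` list whose entries carry `code`, `severity` and `pointer` fields, with severity 0 meaning error and 1 meaning warning; these field names and the sample report are assumptions made for illustration and should be checked against the validator version in use.

```python
import json

SEVERITY_ERROR, SEVERITY_WARNING = 0, 1  # assumed numeric severity codes


def summarize_report(report_text: str) -> dict:
    """Collect errors and warnings from a validation report given as JSON text."""
    report = json.loads(report_text)
    messages = report.get("issues", {}).get("messages", [])

    def describe(message: dict) -> str:
        return f"{message.get('pointer', '?')}: {message.get('code', '?')}"

    errors = [describe(m) for m in messages if m.get("severity") == SEVERITY_ERROR]
    warnings = [describe(m) for m in messages if m.get("severity") == SEVERITY_WARNING]
    return {"ok": not errors, "errors": errors, "warnings": warnings}


# Hypothetical report, written by hand for this example.
sample_report = json.dumps({
    "issues": {
        "messages": [
            {"code": "EXAMPLE_ERROR", "severity": 0, "pointer": "/accessors/0/min"},
            {"code": "EXAMPLE_WARNING", "severity": 1, "pointer": "/meshes/1"},
        ]
    }
})
summary = summarize_report(sample_report)
```

In a unit test for an exporter, the test would fail whenever `summary["ok"]` is false and print the collected error descriptions.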
A tool for the validation and inspection of individual glTF assets is available in
form of a plugin for the Microsoft Visual Studio Code editor (
com/). The plugin is available at
and uses the official Khronos glTF validator for validating glTF assets. Additionally, it
allows developers to preview the asset, inspect the textures and binary data, offers auto-
completion for manual editing of the JSON part of an asset, and can convert between
glTF and binary glTF assets.
5.5.4 Rendering
One of the main goals of glTF is to support a large variety of rendering clients. There-
fore, many options for importing and rendering glTF assets exist. The tables at https:// list various loaders, engines and viewers that support glTF.
Pure loader libraries represent a glTF asset as an in-memory data structure, and
allow further manipulation of the asset structure, to translate the asset into the internal
data structures of a renderer, or write a manipulated asset to a file. The libraries cover
all major programming languages, including JavaScript, C/C++, Java, Objective-C, Go,
Rust, Haxe, Ada, Swift and TypeScript.
An increasing number of rendering engines have built-in support for glTF. When
loading the asset, they build the internal rendering data structures directly. Rendering
engines that support glTF are available for different programming languages, and based
on different graphics APIs, including WebGL, OpenGL, Vulkan, Metal and DirectX.
Standalone viewers offer the option to quickly inspect a given glTF asset. These
viewers are often intended for end-users, but some of them also offer functions for
developers to analyze details of the model—for example, to visualize the node structure
or animation timeline of a glTF asset.
Figure 5.4. Screenshots from the web-based open-source PBR reference implementation.
5.6 Extensions
Extensions are an important concept for the evolution of the glTF standard, in order to
adapt to future requirements. The core of the specification allows for extensions to be
defined in the JSON part of an asset and to refer to custom binary data like vertex
attributes. Vendors can implement their own extensions in order to support custom
features that add new properties to the JSON objects. The extensions can then be
proposed to the Khronos Group. The group members can discuss the new features and
the extension specification, and eventually promote the extension to become an official
Khronos extension. Depending on the demand for and adoption of the extension, it may
eventually be integrated into the core specification as part of a new major release.
The extensions that are used in a certain glTF asset can be queried by the viewer
or loader. Extensions can be optional, in which case the viewer may omit the features
that are offered by the extension or, depending on the functionality of the extension,
provide a fallback behavior. Other extensions can be declared to be required for
properly rendering the asset, and viewers that do not support such a required extension
can report this to the user. In general, viewers are encouraged to support the official
Khronos extensions, to foster the adoption of the extensions in future versions of
the standard.
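The query described above can be sketched as follows, using the top-level `extensionsUsed` and `extensionsRequired` arrays of the JSON part. The function name `check_extensions` and the set of supported extensions are hypothetical; a real loader would take the latter from its own feature list.

```python
def check_extensions(gltf: dict, supported: set) -> tuple:
    """Return (missing required extensions, optional extensions that will be skipped)."""
    used = set(gltf.get("extensionsUsed", []))
    required = set(gltf.get("extensionsRequired", []))
    missing_required = sorted(required - supported)
    skipped_optional = sorted(used - required - supported)
    return missing_required, skipped_optional


# Hypothetical loader that only understands one extension.
supported_by_loader = {"KHR_texture_transform"}
asset_json = {
    "extensionsUsed": [
        "KHR_draco_mesh_compression",
        "KHR_texture_transform",
        "KHR_lights_punctual",
    ],
    "extensionsRequired": ["KHR_draco_mesh_compression"],
}
missing, skipped = check_extensions(asset_json, supported_by_loader)
```

A viewer would typically refuse to load the asset and report `missing` to the user when it is non-empty, and silently ignore or replace the features listed in `skipped`.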
Several extensions have already been proposed and have become official Khronos
extensions; their goals and functionality are summarized in the following sections.
The connection between the geometry and texture data of the asset and the shaders
is established by defining a rendering technique. Such a technique associates object
properties like vertex positions or texture coordinates with the attributes of the shaders.
A material in this extension is therefore an instance of a technique, with a certain map-
ping for these properties, or with specific values for the uniform variables of the
5.8 Conclusion
A versatile and reliable transmission format is the key for opening up new application
areas for 3D content. High-quality rendering is now possible on many client devices—
including web browsers—and the increasing demand for 3D content on the web is one
of the main factors for the momentum that glTF has gained since its initial release. The
design choices and features have proven to meet the requirements of the content
creators and the users. The ecosystem and tools that have been developed around glTF
support all users in making 3D content available in a large variety of applications.
Being an open standard, glTF is maintained and developed by a large, active community,
in order to keep up with the latest technical developments and to adapt to future
Ray Tracing
Welcome to the Real-time Ray Tracing section of GPU Zen’s second volume. With the
recent revolution in real-time ray tracing, this section presents two novel methods that
utilize ray tracing hardware to achieve new visual effects and improve rendered image
The section starts with an article from Holger Gruen on efficient rendering of sin-
gle-bounce caustics, such as water reflections and refractions. The article presents an
elegant way to track the light energy compression induced by refractive geometry and
importance sample rays in bright and visible parts of caustics followed by denoising.
The method performs well enough for real-time applications and delivers high
rendering quality for caustics.
The second article, by Rahul Sathe et al., covers high-quality antialiasing using new
hardware features such as ray tracing and conservative rasterization. Conservative
rasterization is used to detect partially covered pixels, and ray tracing is then used to
evaluate the subsamples of these pixels. The method provides a high-quality antialiased
image even with thin geometry, and it can also be used for hard and detailed geometry.
All articles come with demos that demonstrate the methods in real-time.
This is just the beginning of a long path to make the new real-time ray tracing
practical. I hope the content of this section will inspire you to apply ray tracing more
widely in your games. Do not hesitate to share your methods with us!
—Anton Kaplanyan
Real-Time Ray-Traced
One-Bounce Caustics
Holger Gruen
1.1 Introduction
This chapter investigates how to make use of the DirectX 12 real-time ray tracing API
DXR to simplify and innovate on current methods for rendering real-time caustics. We
assume that the necessary DXR bounding volume acceleration structures have already
been implemented in a game engine to enable other ray tracing based algorithms.
The use of DXR has several benefits over existing solutions with regard to finding
the specular intersections of refracted and reflected light rays with a dynamic 3D scene.
This is depicted in Figure 1.1(a)—the yellow light rays from the sun interact with the
refracting surface and ‘turn’ into the refracted red light rays that shed light on the brown
receiving geometry. Figure 1.1(b) shows the reflected light rays that shed light on the
purple receiving geometry.
Many games avoid finding intersections with the full game scene and only compute
intersections with a ground plane or the inside of a 3D box. Consequently, the
resulting caustics cannot have shadows that are the result of many incoherent re-
fracted/reflected light rays. Typically, shadows are then generated from a shadow map,
which ‘only’ produces shadows from a single point of view, e.g., from the point of view
of the light.
For a scene with a refractive water interface (not displayed), Figure 1.2 shows a
shadow produced from a shadow map. As a result, the silhouette of the shadow of the
lid of the box is straight. Figure 1.3 shows shadows that are the result of incoherent
refracted light rays, and thus the shadow of the lid of the box is bent and deforms in a
physically correct way as the surface moves.
a. March the primary depth buffer or the shadow map depth buffer in the
pixel shader to find intersections. The problem with this approach is that
refracted/reflected light rays may be occluded in both the primary view and
the view from light as shown in Figure 1.4. It is possible to render and
march a set of images utilizing:
i. Multiple depth layers of the primary depth buffer and the shadow map
Figure 1.4. Issues with shadow map based visibility for caustics.
ii. Multiple viewpoints of the primary depth buffer and the shadow map.
Note that the runtime cost, the memory consumption and the implementation
complexity of these methods can be significantly higher than those of the DXR-based
approach described below.
2. Voxelize the receiving geometry and march the voxelized 3D grid. This can
yield good results, depending on the resolution of the grid. Voxelization is not
a cheap operation and is the rasterization-side equivalent of keeping a bounding
volume hierarchy up to date. Raymarching the 3D grid usually also has a high
or unacceptable performance cost. Overall, the implementation complexity of
voxelization methods is higher than that of the DXR-based approach described
1. Compute a refracted/reflected mesh. For each vertex of the mesh that repre-
sents the geometry of the refractive/reflective interface, a refracted/reflected ray
R is constructed. This ray starts at the interface’s vertex and points along the
refracted/reflected direction of incident light. The resulting refracted/reflected
mesh has the same number of vertices and the same triangle count as the
1.3 Algorithm Overview 271
The same equation is used to compute and store the compression ratio for re-
flected triangles.
Figure 1.6.
reflected triangle. This means that more samples get generated for triangles
that are close to the viewer. The idea to use the compression ratio from Equation
1 is not new and has been described in the past (see, e.g., Golias and Jensen
a. For each random world-space position that can be seen from the light
source, a ray (photon) is formed starting at this position and along the in-
terpolated refracted/reflected direction. One easy solution to find out if a
sample is shadowed (can’t be seen by the light source) is to use a shadow
map. It is of course also possible to shoot a ray towards the light source.
b. Compute the intersection of each such ray with the receiving scene
Figure 1.7 depicts the generation of samples and their corresponding refracted
screen-space positions. It is possible to skip the generation of samples for in-
terface triangles that are guaranteed not to generate any on-screen intersections.
See below in ‘Implementation Details’ for a description.
4. Denoise the buffer that results from Step 3c to arrive at a denoised and
smooth caustics buffer.
The following section describes implementation details of the demo that accom-
panies this chapter.
For Step 1 we call DispatchRays() with each thread tracing exactly one re-
fracted/reflected ray into the scene. The resulting refracted/reflected mesh is
written to a buffer that is read by later steps and uses the same index buffer as
the original interface mesh.
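The direction of each ray traced in Step 1 follows from the incident light direction and the interface normal at the vertex. The sketch below computes both directions in the same way as the HLSL `reflect` and `refract` intrinsics; the helper names and the refractive index of 1.33 for water are assumptions for illustration and are not taken from the demo.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def reflect(incident, normal):
    """Mirror the unit incident direction about the unit normal."""
    d = dot(normal, incident)
    return tuple(i - 2.0 * d * n for i, n in zip(incident, normal))


def refract(incident, normal, eta):
    """Refract a unit direction; eta = n_outside / n_inside.

    Returns None on total internal reflection.
    """
    cos_i = -dot(normal, incident)
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)
    if k < 0.0:
        return None
    scale = eta * cos_i - math.sqrt(k)
    return tuple(eta * i + scale * n for i, n in zip(incident, normal))


light_dir = normalize((1.0, -1.0, 0.0))   # light travelling down and sideways
water_normal = (0.0, 1.0, 0.0)            # flat water surface facing up
refracted = refract(light_dir, water_normal, 1.0 / 1.33)
reflected = reflect(light_dir, water_normal)
```

The ray for an interface vertex then starts at that vertex and points along `refracted` or `reflected`.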
As depicted in Listing 1.1, a simple seed based random number generator is used
to generate (see
in-d3d11/) a set of ‘random’ barycentric positions inside the current interface’s
triangle. The current implementation uses the triangle ID/index as the RandSeed. This
choice means that the generated noise is stable across a triangle. Should the number of
samples for a triangle change between frames, this only removes or adds samples and
never changes the sample positions for a given triangle.
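The stability property can be demonstrated with a small stand-in generator. The sketch below is not the generator from Listing 1.1; it uses a simple linear congruential generator seeded with the triangle index, and the function names are hypothetical.

```python
def lcg_next(state: int) -> int:
    """One step of a 32-bit linear congruential generator (stand-in RNG)."""
    return (state * 1664525 + 1013904223) & 0xFFFFFFFF


def barycentric_samples(triangle_id: int, count: int) -> list:
    """Generate `count` barycentric positions, seeded by the triangle index."""
    state = triangle_id & 0xFFFFFFFF
    samples = []
    for _ in range(count):
        state = lcg_next(state)
        u = state / 2.0 ** 32
        state = lcg_next(state)
        v = state / 2.0 ** 32
        if u + v > 1.0:            # fold the unit square onto the triangle
            u, v = 1.0 - u, 1.0 - v
        samples.append((u, v, 1.0 - u - v))
    return samples


few = barycentric_samples(triangle_id=7, count=4)
many = barycentric_samples(triangle_id=7, count=8)
```

Because the sequence depends only on the seed, `few` is a prefix of `many`: raising or lowering the sample count between frames appends or drops samples without moving the existing ones.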
As described above, the number of samples generated is also based on the screen-
space size of the triangle of the refracted/reflected mesh. As this refracted/reflected
triangle may be off-screen even if some of the generated rays produce on-screen caus-
tics, the maximum screen-space size of the refracted triangle and three additional tri-
angles is used.
These three triangles are formed at the intersection points of the three
refracted/reflected rays. This is depicted in Figure 1.8: the blue triangle is at the
interface, and the three yellow rays show scene intersections at different distances.
In order to speed up Step 4, the buffer accumulating brightness is implemented as
a half-res buffer. This speeds up denoising significantly. Denoising is done through a
set of iterated cross-bilateral blurring steps that account for differences in view-space
depth, normal and view-space positions.
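To illustrate the idea of a cross-bilateral blur, the following sketch runs one horizontal pass over a single row of the caustics buffer. Neighbors are weighted by their distance in pixels and by how well their depth and normal match the center pixel. The weighting functions and all parameter values are assumptions chosen for readability, not the ones used in the demo, and the view-space position term is omitted.

```python
import math


def bilateral_pass(values, depths, normals,
                   radius=2, sigma_s=1.5, sigma_z=0.1, normal_power=8.0):
    """One 1D cross-bilateral blur pass guided by depth and normal."""
    out = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            w_spatial = math.exp(-((i - j) ** 2) / (2.0 * sigma_s ** 2))
            w_depth = math.exp(-abs(depths[i] - depths[j]) / sigma_z)
            n_dot = sum(a * b for a, b in zip(normals[i], normals[j]))
            w_normal = max(0.0, n_dot) ** normal_power
            w = w_spatial * w_depth * w_normal
            total += w * values[j]
            weight_sum += w
        out.append(total / weight_sum)  # the center tap keeps weight_sum > 0
    return out


up = (0.0, 0.0, 1.0)
# Brightness edge that coincides with a depth discontinuity.
edge = bilateral_pass([1, 1, 1, 0, 0, 0], [1, 1, 1, 5, 5, 5], [up] * 6)
# Noise on a flat surface.
flat = bilateral_pass([1, 0, 1], [1, 1, 1], [up] * 3)
```

The brightness edge in `edge` survives because it coincides with a depth discontinuity, while the noise in `flat` is averaged out. Iterating the pass, alternating between rows and columns, widens the effective filter.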
1.5 Results
Figure 1.9 shows the result of applying the raw caustics buffer without denoising. Fig-
ure 1.10 shows a similar frame with denoising enabled. Finally, Figure 1.11 shows the
algorithm running with slightly different lighting conditions.
The following performance numbers have been measured on an NVIDIA RTX
2080Ti board running at a resolution of 1920 × 1080 utilizing the official version of
DXR that is part of the DirectX API:
Adding reflected caustics adds approximately another 0.8 ms to the run-time of the
technique. Please note that the denoise step does not get more expensive unless a
separate denoise step is required for the reflected caustics. These numbers show that
physically inspired ray-traced caustics are fast enough to be brought into scenes of
real-time games on DXR capable hardware.
Multi-bounce caustics. The algorithm described above stops when a refracted ray
hits the scene. With DXR it is possible to generate one or more secondary rays in the
ray generation shader. In this case, for each secondary ray that generates an intersection
with the scene, it is necessary to accumulate brightness at the projected screen space
position of the intersection.
Volumetric caustics. Most algorithms for volumetric caustics (see, e.g., Hu et al.
[2010] and Liktor and Dachsbacher [2011]) make use of the regular structure of a
Caustics Map. It would be interesting to investigate how volumetric caustics can be
implemented based on the random ray samples that the current implementation uses.
2.1 Introduction
Anti-aliasing is a category of techniques used to remove image artifacts that result from
inadequate sampling rates. MSAA [Akeley 1993] is a popular anti-aliasing technique
that samples visibility at a rate different from the typical shading rate of once per pixel
per primitive. Although effective in geometric anti-aliasing, MSAA incurs higher stor-
age costs due to storing depth and color samples at the sampling rate. Additionally, it
can suffer from higher bandwidth usage in the case where color compression fails to
compress the color data well. For these reasons, MSAA produces high image quality
at a relatively high cost.
Ideally, we would like the image quality of MSAA without paying the high asso-
ciated cost. When a primitive covers a pixel entirely, it is not necessary to do further
visibility calculations. When a pixel is partially covered by primitive(s), we need to
determine how much of the pixel is covered by each intersecting primitive to calculate
correct visibility. Taking advantage of this knowledge, we present an approach that
identifies “complex” or “interesting” pixels that require computing visibility more ac-
curately than a single raster sample. We then discuss two methods to compute visibility
for the identified pixels with improved accuracy.
2.2 Overview
In this section, we describe the algorithm at a high level. The flow chart of the pixel
shader used in the algorithm is shown in Figure 2.1. The goal of the algorithm is to
identify partially covered pixels and resolve the visibility of those pixels using ray
tracing. We use conservative rasterization to identify such pixels. To eliminate the fully
covered pixels that are also rasterized by conservative rasterization, we use a system
variable available in Tier 3 conservative rasterization hardware called
SV_InnerCoverage. Using this variable, we mark the fully covered pixels as “less
interesting” and mark the partially covered pixels as “interesting” in a render target.
However, this can generate a lot of false positives even if interesting pixels are
completely behind fully covered pixels. To minimize this, we output the farthest depth
for the fully covered pixels. To ensure that no interesting pixels are missed, we output
the closest depth within the pixel for the partially covered pixels, even if it generates
some false positives. This is described in detail in the next section.
Figure 2.1. Flowchart of the pixel shader used to identify “interesting” pixels that are ray traced
for improved visibility resolution.
2.3.1 Algorithm
We propose using ray tracing to determine the visibility within the pixels where raster-
ization techniques fall short. With the Tier 3 conservative rasterizer at our disposal, we
propose the following algorithm to identify the pixels that require ray tracing to resolve
the visibility further. We set the pipeline with the standard Input Assembler (IA) and
the Vertex Shader (VS), followed by a Pixel Shader (PS). PS calculates the depth equa-
tion using the derivatives (finite differences) of the position with respect to x and y. It
then uses this equation to calculate the depth conservatively (minimum or maximum)
over the entire pixel. It calculates the maximum and minimum depth over the primitive
in the PS using the new method GetAttributeAtVertex to access the per vertex
attributes. We set the rasterizer to rasterize conservatively, and we assume that Tier 3
support is available. We set the PS with the following inputs and outputs.
PS Inputs.
1. The screen space position (with the semantic SV_Position).
2. Inner coverage with the semantic SV_InnerCoverage.
PS Outputs.
1. Depth value with the semantic SV_Depth. This forces the depth-stencil test to
happen late (i.e., after the PS).
2. A uint value that eventually gets merged into a render target.
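The conservative depth computation in the PS can be sketched as follows. The depth plane is evaluated at the pixel corners using the screen-space derivatives, and the result is clamped to the depth range of the three vertices, which the shader can read with GetAttributeAtVertex. The function name and the sample numbers are hypothetical.

```python
import math


def conservative_depth(z_center, dzdx, dzdy, vertex_depths):
    """Return (z_min, z_max) of a planar primitive over a one-pixel footprint."""
    # The plane changes by at most half a pixel in x and in y from the center.
    half_extent = 0.5 * (abs(dzdx) + abs(dzdy))
    lo, hi = min(vertex_depths), max(vertex_depths)

    def clamp(z):
        # The plane extends past the triangle; the vertex range bounds it again.
        return max(lo, min(hi, z))

    return clamp(z_center - half_extent), clamp(z_center + half_extent)


z_min, z_max = conservative_depth(0.5, 0.02, -0.04, (0.4, 0.5, 0.9))
clamped_min, clamped_max = conservative_depth(0.5, 0.2, 0.2, (0.45, 0.5, 0.6))
```

Clamping matters for small or steep triangles, where the unclamped plane would report depths that no point of the primitive actually has.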
Pixel Shader. The goal of this shader is to:
1. Identify the pixels that are partially covered and could be potentially visible.
2. Separate them from the fully covered frontmost (visible) pixels.
We would like to do this without resolving the exact visibility within the partially
covered fragments. We use SV_Coverage to separate partially covered pixels from
fully covered ones. We mark the interesting pixels that are partially covered by
outputting the value 0x1 into the render target. We output the value 0x0 for the fully covered
282 2. Adaptive Anti-Aliasing using Conservative Rasterization and GPU Ray Tracing
pixels. Figure 2.2 shows the pixels identified as fully covered in yellow, and the
partially covered and hence potentially “interesting” pixels in purple.
To identify the frontmost layer of the pixels, both partially and fully covered, we
assign the depth values inside the pixel shader conservatively. We output the farthest z
for the occluding fragments (fully covered pixels) and nearest z for the potentially vis-
ible fragments (partially covered pixels) as output depth (with the semantic
SV_Depth). By doing so, we guarantee that no potentially visible (and partially cov-
ered) fragment gets missed. As a result of these conservative depths, sometimes we
allow some partially covered pixels to pass the depth test even though they are occluded
in reality. This is shown in Figure 2.3(b).
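The effect of these conservative depth outputs can be simulated on the CPU for a single pixel. The sketch below assumes a LESS depth comparison and a render target cleared to 0x0; the function names and the depth values are made up for illustration.

```python
def classify_fragment(inner_coverage: bool, z_min: float, z_max: float):
    """Return (output depth, marker) for one conservatively rasterized fragment."""
    if inner_coverage:
        return z_max, 0x0   # fully covered: farthest depth, "less interesting"
    return z_min, 0x1       # partially covered: nearest depth, "interesting"


def resolve_pixel(fragments) -> int:
    """Apply a late LESS depth test to fragments given in submission order."""
    depth, marker = float("inf"), 0x0
    for inner_coverage, z_min, z_max in fragments:
        out_depth, out_marker = classify_fragment(inner_coverage, z_min, z_max)
        if out_depth < depth:
            depth, marker = out_depth, out_marker
    return marker


occluder = (True, 0.30, 0.35)          # fully covers the pixel
hidden_edge = (False, 0.50, 0.60)      # partial fragment well behind the occluder
grazing_edge = (False, 0.32, 0.60)     # partial fragment whose z range overlaps
front_edge = (False, 0.10, 0.20)       # partial fragment in front

marker_hidden = resolve_pixel([occluder, hidden_edge])
marker_grazing = resolve_pixel([occluder, grazing_edge])
marker_front = resolve_pixel([occluder, front_edge])
```

Here `hidden_edge` is correctly rejected, `front_edge` is correctly marked, and `grazing_edge` is marked even though it may be occluded where it actually covers the pixel, which is the false-positive case.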
Figure 2.3. We output the closest z (zmin) of the partially covered purple fragment and the
farthest z (zmax) of the fully covered yellow fragment. (a) shows the case of an “interesting”
pixel; (b) shows a pixel incorrectly identified as “interesting” (a false positive).
2.3 Pixel Classification using Conservative Rasterization 283
buffer, which is unable to reliably capture all subpixel geometry. Using our conserva-
tive rasterization-based approach, we are able to guarantee that subpixel projected ge-
ometry will be accounted for during the pixel identification phase.
After all pixels are classified, we create a set of rays for each interesting pixel iden-
tified by the conservative raster approach by computing an origin and direction. Since
these are all viewing (or “primary”) rays, we assume a pinhole camera model and all
rays share a common origin (the pinhole). The ray direction is selected based on vary-
ing sample locations corresponding to an offset from the pixel center. These rays are
then tested against the geometry in the scene by invoking the TraceRay() HLSL func-
tion of the DXR API. We perform shading at these sample locations in DXR Hit
Shaders and store the reconstructed result in the framebuffer.
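The ray setup for one interesting pixel can be sketched as follows. The camera convention (looking down the negative z axis with a given vertical field of view) and the four sample offsets, which follow the commonly used 4x rotated-grid pattern in units of 1/16 pixel, are assumptions for this example.

```python
import math

# Offsets from the pixel center, in pixels (assumed 4x sample pattern).
MSAA4_OFFSETS = ((-2 / 16, -6 / 16), (6 / 16, -2 / 16),
                 (-6 / 16, 2 / 16), (2 / 16, 6 / 16))


def subsample_ray_dirs(px, py, width, height, fov_y_deg, offsets=MSAA4_OFFSETS):
    """Return unit view-space directions through each subsample of pixel (px, py)."""
    tan_half = math.tan(math.radians(fov_y_deg) / 2.0)
    aspect = width / height
    directions = []
    for ox, oy in offsets:
        sx, sy = px + 0.5 + ox, py + 0.5 + oy          # sample position in pixels
        ndc_x = 2.0 * sx / width - 1.0
        ndc_y = 1.0 - 2.0 * sy / height                # flip: pixel y grows downward
        d = (ndc_x * aspect * tan_half, ndc_y * tan_half, -1.0)
        length = math.sqrt(sum(c * c for c in d))
        directions.append(tuple(c / length for c in d))
    return directions


rays = subsample_ray_dirs(px=960, py=540, width=1920, height=1080, fov_y_deg=60.0)
```

All rays share the pinhole as their origin, so only the directions differ between subsamples.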
While our current sample distribution follows the MSAA pattern, we could easily
use more sophisticated patterns like 8 rooks. Alternatively, movie production renderers
primarily use a more sophisticated approach called correlated multi-jittered sampling
[Kensler 2013]. This approach avoids the structured artifacts common to quasi Monte
Carlo sequences while achieving similar quality without needing precomputation or
storage. Enterprising developers might experiment with a variety of patterns to deter-
mine which works best for their application.
Figure 2.4.
To render with just one geometry pass, it is necessary to compute ray intersections
for each pixel that is temporarily classified as partially covered, even if it is classified
as fully covered later on. In a nutshell, the algorithm works like this:
1. To support an arbitrary number N of fully customizable subsamples per pixel,
create a buffer B that is big enough to hold all per-pixel/per-sample data (see
Figure 2.5) as well as a 32-bit depth value.
Figure 2.5.
2. Clear the 32-bit depth part of all subsamples in B to the maximum or the min-
imum depth value. Clear to maximum depth if a LESS depth comparison mode
is used. Clear to minimum depth if a GREATER depth comparison mode is used.
3. Set up the same rendering pipeline as in Listing 2.1.
4. Render the main geometry pass.
a. In the pixel shader do the following:
i. Compute attributes at the pixel center by interpolating the three vertex
attributes and output all interpolated per pixel attributes to the bound
render targets.
ii. If SV_InnerCoverage is 1, a simple pixel.
1. Output zMax to SV_Depth.
2. Output 0x0 to a render target for classifying this pixel as simple.
iii. Else (if SV_InnerCoverage is 0), a complex pixel.
1. Output zMin to SV_Depth.
2. Output 0x1 to a render target for classifying this pixel as complex.
3. For s = 0 to N − 1 (iterate over all subsamples)
a. Compute intersection of the ray from the eye to the position of
the current subsample with the plane of the current triangle—see
Figure 2.6.
2.4 Improved Coverage and Shading Computation 287
have finally output 0x1 in Step 4. E.g., in a deferred rendering setup, these sub-
samples need to be lit individually before averaging their results.
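The per-subsample intersection of the eye ray with the plane of the current triangle can be sketched with a standard ray-plane test. The helper names are hypothetical, and the sketch stops at the plane intersection; it does not test whether the hit point lies inside the triangle.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_ray_plane(origin, direction, v0, v1, v2):
    """Intersect a ray with the plane through v0, v1, v2.

    Returns (t, point), or None if the ray is parallel to the plane.
    """
    normal = cross(sub(v1, v0), sub(v2, v0))
    denom = dot(normal, direction)
    if abs(denom) < 1e-12:
        return None
    t = dot(normal, sub(v0, origin)) / denom
    return t, tuple(o + t * d for o, d in zip(origin, direction))


eye = (0.0, 0.0, 0.0)
tri = ((-1.0, -1.0, -5.0), (1.0, -1.0, -5.0), (0.0, 1.0, -5.0))
t_hit, hit_point = intersect_ray_plane(eye, (0.1, 0.0, -1.0), *tri)
```

The depth of `hit_point` is what would be compared against the value stored for that subsample in B.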
For a forward rendering pipeline or other rendering algorithms that only need to
store 32 bits per pixel, a thread-safe way to update subsample data is to use 64-bit wide
interlocked operations:
1. Construct a 64-bit word W like this: (32 bits of intersection depth d: 32 bits of
1. If supported, use a per pixel mutex to lock the data at position P in B and update
all subsamples. Then unlock the mutex again.
2. If Raster Order Views are supported, then render to ROVs. In this case all up-
dates to B are automatically thread safe.
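The 64-bit interlocked variant mentioned above relies on placing the depth in the upper 32 bits of the word, so that a single atomic minimum keeps the nearest sample together with its payload. The sketch below emulates this on the CPU under two assumptions: depths are non-negative floats (whose bit patterns sort like the values themselves) and a LESS depth comparison is used. The 32-bit payload is a placeholder, for example a packed color, and the function names are hypothetical.

```python
import struct


def pack_sample(depth: float, payload: int) -> int:
    """Pack a float depth (high 32 bits) and a 32-bit payload into one word."""
    depth_bits = struct.unpack("<I", struct.pack("<f", depth))[0]
    return (depth_bits << 32) | (payload & 0xFFFFFFFF)


def unpack_sample(word: int):
    """Inverse of pack_sample: return (depth, payload)."""
    depth = struct.unpack("<f", struct.pack("<I", word >> 32))[0]
    return depth, word & 0xFFFFFFFF


def interlocked_min(buffer: list, index: int, word: int) -> None:
    """CPU stand-in for a 64-bit atomic minimum on one buffer element."""
    buffer[index] = min(buffer[index], word)


subsamples = [pack_sample(1.0, 0x00000000)]       # cleared to maximum depth
interlocked_min(subsamples, 0, pack_sample(0.5, 0x000000AA))
interlocked_min(subsamples, 0, pack_sample(0.75, 0x000000BB))
nearest_depth, nearest_payload = unpack_sample(subsamples[0])
```

With a GREATER depth comparison, the same packing works with an atomic maximum and a clear to the minimum depth.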
Figure 2.7. A tower scene with thin subpixel features. Top to bottom: (a) no anti-aliasing; (b)
ray tracing executes on pixels where luminance differs by more than a threshold; and (c) ray
tracing executes on pixels identified by (b) and our conservative rasterization approach.
two vertex IDs match (but in opposite order), we detect that as an internal edge and do
not mark such pixels for ray tracing. Note, however, that depending on the valence of
the vertex and the submission order of the triangles around that vertex, the pixel
containing a vertex may still be marked for ray tracing. With this ROV-based approach,
rasterization passes take significantly longer, e.g., 11.7 ms and 8.1 ms for the tower
scene and the moonscape scene respectively, due to the serialization caused by ROVs. For
Figure 2.8. A moonscape scene with almost no thin subpixel features. Top to bottom: (a) no
anti-aliasing; (b) ray tracing executes on pixels where luminance differs from their neighbors by
more than a threshold; and (c) ray tracing executes on pixels identified by (b) and our conserva-
tive rasterization approach.
the tower scene, the ray tracing cost dropped only slightly to 4.6 ms, whereas for the
moonscape scene it dropped to 5.0 ms because fewer pixels were marked for ray
tracing. In both cases, the savings in ray tracing cost were more than offset by the
additional cost required to remove those pixels. We believe this is an artifact of the
extremely simplified shading we used. Table 2.1 summarizes the time spent in the
passes for the different variations of the algorithm.
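The internal-edge test described above, two vertex IDs that match in opposite order, can be illustrated at the mesh level. The sketch below works on an index list rather than per pixel, so it is not the ROV-based implementation that was measured; the function name is hypothetical and consistent triangle winding is assumed.

```python
def internal_edges(triangles) -> set:
    """Return the edges shared by two triangles with opposite vertex order."""
    directed = set()
    for a, b, c in triangles:
        directed.update(((a, b), (b, c), (c, a)))
    # An edge (a, b) is internal if a neighbor traverses it as (b, a).
    return {tuple(sorted(edge)) for edge in directed
            if (edge[1], edge[0]) in directed}


# Two triangles of a quad, sharing the edge between vertices 1 and 2.
quad = [(0, 1, 2), (2, 1, 3)]
shared = internal_edges(quad)
```

Pixels that are only partially covered because of such a shared edge are fully covered by the mesh as a whole, which is why they do not need to be ray traced.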
2.6 Future work 291
Note that in our measurements we had to use the geometry shader to get the vertex
IDs because the shader compiler dxc that Microsoft distributes with the SDK
10.0.17763.0 failed to compile the shaders with GetAttributeAtVertex().
An alternative solution to this internal edge issue is to use the geometry shader (or
the fast geometry shader on Nvidia hardware) along with the adjacency topology to
identify the silhouette edges. The pixel shader can use this information to mark
partially covered pixels only along the silhouette triangles, but this technique requires
the use of a geometry shader.
We provide a demo running on Nvidia GPUs for the proposed technique.
