Volume Tiled Forward Shading

Volume Tiled Forward Shading enhances Tiled and Clustered Forward Shading techniques to support real-time rendering of up to 4 million light sources at 30 FPS. The method involves a series of steps including depth pre-pass, marking active tiles, and assigning lights to tiles, ultimately improving performance when many lights are active. The document discusses various rendering techniques, GPU architecture considerations, and presents experimental results demonstrating the effectiveness of the proposed method.

Volume Tiled Forward Shading
JEREMIAH VAN OOSTEN – 3910539 - INFOMGMT
Abstract

 Volume Tiled Forward Shading extends Tiled and Clustered Forward Shading (Ola Olsson et al.)
 Goal is to increase the number of lights in the scene
 Can achieve 4 million light sources in real time (30 FPS)
Background

 Forward Rendering
 Deferred Shading
 Tiled Forward Shading
 Clustered Forward Shading
Forward Rendering

 Geometry is pushed “forward” through the rendering pipeline.
 Vertex transformations, texturing & lighting are performed in a single pass.
 $O(g \cdot f \cdot l)$ runtime, where:
 $g$ is the number of geometric objects to render
 $f$ is the number of fragments that are shaded
 $l$ is the number of lights in the scene
Deferred Shading

 Builds geometry buffers (G-buffers) to store attributes


 Depth/Stencil
 Light Accumulation
 Normals
 Specular
 Lambert Diffuse
 Lighting is computed in the second pass
 Lights rendered as shapes
 Point lights as spheres
 Spot lights as cones
 Directional lights as full-screen quads
Tiled Forward Shading

 Tiled Forward Shading


 Splits the screen into uniform screen-space tiles
 Each tile forms a frustum (in view space)
 Lights are assigned to tiles by performing frustum culling
 When shading, only lights inside the tile’s frustum are considered
Clustered Forward Shading

 Clusters samples based on position and normal


 Cluster keys are written to 2D image buffer
 Cluster keys are sorted and compacted to find unique keys
 Lights are assigned to unique clusters
 Only lights inside cluster’s AABB need to be considered during shading
GPU Architecture

 Thread Dispatch
 Coalesced Access to Global Memory
 Avoid Bank Conflicts to Shared Memory
Thread Dispatch

 Work is executed in a grid
 Thread groups consist of threads
 High-speed memory is shared within a thread group
 Synchronization is only possible between threads in a thread group
 Synchronization amongst thread groups is only possible by a separate dispatch
Coalesced Access to Global Memory
 Fetches from global memory are slow
 Coalesced access reduces the number of fetches required
 Global memory is accessed via memory segments
 The size of a segment depends on the size of the word accessed by each thread in a warp
 32 B for 1 B words
 64 B for 2 B words
 128 B for 4, 8, and 16 B words
Avoid Bank Conflicts to Shared Memory
 Shared memory is stored in 32 banks of 32-bit words
 If each thread in a warp accesses a different memory bank then no conflicts occur and all reads / writes happen simultaneously
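The bank rule can be checked with a small model. The following is a minimal sketch, assuming consecutive 32-bit words map to consecutive banks and that threads reading the same word are served by a broadcast; the helper names `bank_of` and `conflict_degree` are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32   # banks of shared memory, as stated above
WORD_BYTES = 4   # each bank holds 32-bit words


def bank_of(byte_address):
    """Bank that serves a given shared-memory byte address (assumed mapping)."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(byte_addresses):
    """Largest number of distinct words requested from any single bank.

    1 means conflict-free; n means the access is split into n serial steps.
    Threads that read the same word are counted once (assumed broadcast).
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        words_per_bank[bank_of(addr)].add(addr // WORD_BYTES)
    return max(len(words) for words in words_per_bank.values())


# Thread i reads word i: every thread of the warp hits its own bank.
stride_1 = [i * WORD_BYTES for i in range(32)]
# Thread i reads word 2*i: pairs of threads land in the same bank.
stride_2 = [2 * i * WORD_BYTES for i in range(32)]
# Thread i reads word 32*i: all threads land in bank 0.
stride_32 = [32 * i * WORD_BYTES for i in range(32)]
```

In this model a stride of one word is conflict-free, while larger power-of-two strides fold several threads onto the same bank.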
Parallel Primitives

 Reduction
 Scan
Parallel Reduction

 A reduction applies a binary associative operator ($\oplus$) over a set of values, reducing them to a single value.
 Interleaved log-step reduction avoids shared memory bank conflicts because the addresses are accessed in an interleaved pattern
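A log-step reduction can be simulated serially to show the addressing pattern. This is a minimal sketch, assuming a power-of-two input length and an operator that is commutative as well as associative (this addressing pattern reorders operands); the function name `log_step_reduce` is hypothetical.

```python
import operator


def log_step_reduce(values, op=operator.add):
    """Reduce `values` to one value in log2(n) steps.

    In each step, "thread" i combines element i with element i + stride.
    Consecutive threads touch consecutive addresses, which is the access
    pattern that keeps a warp on distinct shared-memory banks.
    """
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    stride = n // 2
    while stride > 0:
        # On a GPU every iteration of this loop is one thread, and a
        # thread-group barrier separates the steps.
        for i in range(stride):
            data[i] = op(data[i], data[i + stride])
        stride //= 2
    return data[0]
```

For example, `log_step_reduce([1, 2, 3, 4, 5, 6, 7, 8])` sums eight values in three steps, and passing `op=max` gives the maximum instead.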
Parallel Scan

 Takes a binary associative operator ($\oplus$), its identity ($I$), and an ordered set of elements, and returns the ordered set of running results
 $I$ is 0 for addition and 1 for multiplication
 For example, if $\oplus$ is addition, then the scan applied to an array produces the running sums of its elements
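The scan can be illustrated with a serial simulation of the work-efficient up-sweep / down-sweep formulation (Blelloch, 1989). This is a minimal sketch, assuming an exclusive scan (each output excludes its own element) and a power-of-two length; the input array below is illustrative and the name `exclusive_scan` is hypothetical.

```python
import operator


def exclusive_scan(values, op=operator.add, identity=0):
    """Exclusive scan: output[k] combines all inputs before position k."""
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"

    # Up-sweep: build partial results in place, like a reduction tree.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] = op(data[i - stride], data[i])
        stride *= 2

    # Down-sweep: push prefixes back down the tree, starting from the identity.
    data[n - 1] = identity
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]   # left child inherits the prefix
            data[i] = op(data[i], left)  # right child: prefix, then left subtree
        stride //= 2
    return data


sums = exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
products = exclusive_scan([1, 2, 3, 4], operator.mul, 1)
```

With addition, `sums` holds the running totals shifted by one position, and the first element is the identity.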
Sorting

 Radix Sort
 Merge Sort
Radix Sort

 Radix sort considers a single bit from the sort keys at a time
 All keys with a 0 in the current bit are placed before keys with a 1
 Process is repeated for all bits of the sort keys
 Final result is a sorted list
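The per-bit split can be sketched serially as follows. This is a minimal sketch, assuming non-negative integer keys processed from the least significant bit upward with a stable split; on a GPU the destination of each key would typically come from a scan rather than list concatenation. The name `radix_sort` is hypothetical.

```python
def radix_sort(keys, num_bits=32):
    """Sort non-negative integer keys one bit at a time (least significant first)."""
    keys = list(keys)
    for bit in range(num_bits):
        # Stable split: relative order within each group is preserved,
        # which is what makes the later (higher) bits sort correctly.
        zeros = [k for k in keys if not (k >> bit) & 1]
        ones = [k for k in keys if (k >> bit) & 1]
        keys = zeros + ones
    return keys
```

For example, `radix_sort([5, 3, 7, 0, 2, 6, 1, 4], num_bits=3)` needs only three passes because every key fits in three bits.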
Merge Sort

 Merges two sorted lists (A, B) to produce one large sorted list (C)
 The line that traces the grid created by placing A and B on adjacent axes is called the Merge Path (red line)
 A diagonal line that intersects the merge path shows the split for each thread group (green line)
 To find the point in A and B to perform the merge, a binary search is performed on the list
 A serial merge is performed between merge path partitions for each thread
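The partitioning step can be sketched as a binary search along each diagonal, followed by independent serial merges. This is a minimal sketch in the spirit of GPU merge path (Green et al., 2012), assuming ties are taken from A first; the function names are hypothetical.

```python
def merge_path_split(a, b, diag):
    """Return (i, j) with i + j == diag: items of a and b that precede the diagonal."""
    lo = max(0, diag - len(b))
    hi = min(diag, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = diag - i
        if a[i] <= b[j - 1]:
            lo = i + 1  # a[i] belongs before b[j - 1]: take more from a
        else:
            hi = i
    return lo, diag - lo


def serial_merge(a, b, start, end):
    """Merge a[start_i:end_i] with b[start_j:end_j], as one thread would."""
    (i, j), (i_end, j_end) = start, end
    out = []
    while i < i_end or j < j_end:
        if j >= j_end or (i < i_end and a[i] <= b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out


def merge_path_merge(a, b, num_partitions):
    """Merge two sorted lists by splitting the output into equal partitions."""
    total = len(a) + len(b)
    diagonals = [total * p // num_partitions for p in range(num_partitions + 1)]
    splits = [merge_path_split(a, b, d) for d in diagonals]
    out = []
    # Each (start, end) pair is independent work for one thread or thread group.
    for start, end in zip(splits, splits[1:]):
        out.extend(serial_merge(a, b, start, end))
    return out
```

Because every partition covers the same number of output elements, the work per thread is balanced regardless of how the values in A and B interleave.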
Morton Codes

 Minimum Bounding Volume


 Compute Morton codes
Minimum Bounding Volume

 Compute the AABB over the objects (lights) in the scene
 Required to normalize the position of the objects in the range $[0, 1]$
 Uses parallel reduction
Compute Morton Codes

 The normalized coordinates are scaled by $2^k - 1$ where $k$ is the number of bits used to represent each component
 The bits are interleaved to produce the Morton code
 Results in a Z-order curve of the points in space
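The normalization and bit interleaving can be sketched as follows. This is a minimal sketch, assuming positions are normalized to [0, 1] against the scene AABB, quantized by scaling with 2^k - 1, and interleaved with x in the lowest bit of each triple; the scale factor, bit order, and function names are assumptions.

```python
def bounding_box(points):
    """AABB over 3D points (a serial stand-in for the parallel min/max reduction)."""
    mins = tuple(min(p[axis] for p in points) for axis in range(3))
    maxs = tuple(max(p[axis] for p in points) for axis in range(3))
    return mins, maxs


def expand_bits(value, bits):
    """Spread the low `bits` bits of value so two zero bits follow each one."""
    out = 0
    for i in range(bits):
        out |= ((value >> i) & 1) << (3 * i)
    return out


def morton_code(point, aabb_min, aabb_max, bits=10):
    """3D Morton code of a point relative to an AABB, `bits` bits per axis."""
    scale = (1 << bits) - 1
    code = 0
    for axis in range(3):
        extent = aabb_max[axis] - aabb_min[axis]
        # Normalize to [0, 1]; a flat axis maps to 0 to avoid dividing by zero.
        normalized = (point[axis] - aabb_min[axis]) / extent if extent > 0 else 0.0
        quantized = int(normalized * scale)
        code |= expand_bits(quantized, bits) << axis
    return code


lights = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 2.0, 0.0), (4.0, 2.0, 8.0)]
lo, hi = bounding_box(lights)
codes = [morton_code(p, lo, hi, bits=1) for p in lights]
```

Sorting the lights by these codes orders them along the Z-order curve, so lights that are close in space tend to be close in the sorted list.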
Bounding Volume Hierarchy (BVH)

 BVH Basics
 BVH Construction
 BVH Traversal
BVH Basics

 A tree-like data structure


 The leaf nodes of the tree represent the smallest primitives (triangles or geometric objects)
 Upper nodes are constructed by building an Axis-Aligned Bounding Box (AABB) over the child nodes
 The number of child nodes used to construct the parent node is called the degree of the BVH
BVH Construction

 The leaf nodes are constructed by taking 32 lights at a time from the sorted list
 The AABBs for the first child nodes are constructed by performing a reduction on the leaf nodes
 The upper levels are constructed by performing a reduction on the child nodes
 Process is repeated until only the root node remains
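The bottom-up construction can be sketched as repeated AABB reductions over groups of nodes. This is a minimal sketch, assuming the lights are already sorted (for example by Morton code) and that each level is stored as a flat list of AABBs; the storage layout and function names are assumptions.

```python
def merge_aabbs(boxes):
    """Union of AABBs, each given as ((min_x, min_y, min_z), (max_x, max_y, max_z))."""
    mins = tuple(min(box[0][axis] for box in boxes) for axis in range(3))
    maxs = tuple(max(box[1][axis] for box in boxes) for axis in range(3))
    return mins, maxs


def build_bvh_levels(light_aabbs, degree=32):
    """Return the BVH as a list of levels, root level first, leaf level last."""
    # Leaf level: one node per group of `degree` consecutive lights.
    level = [merge_aabbs(light_aabbs[i:i + degree])
             for i in range(0, len(light_aabbs), degree)]
    levels = [level]
    # Upper levels: reduce groups of `degree` child nodes until one root is left.
    while len(level) > 1:
        level = [merge_aabbs(level[i:i + degree])
                 for i in range(0, len(level), degree)]
        levels.append(level)
    levels.reverse()
    return levels


# 100 unit-sized lights laid out along the x axis, with a small degree for clarity.
example_lights = [((i, 0, 0), (i + 1, 1, 1)) for i in range(100)]
example_levels = build_bvh_levels(example_lights, degree=4)
```

With this layout, the children of node n on one level are nodes n * degree to n * degree + degree - 1 on the next level, so no explicit child pointers are needed.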
BVH Traversal

 The term cell refers to the area that is being checked for overlap
 Uses a stack to push the index of the child node of the BVH if the AABB of the node overlaps with the AABB of the cell
 The 32 threads in a warp each perform one AABB intersection test during traversal
 If it is a leaf node, the AABB of each light is checked against the AABB of the cell
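The stack-based traversal can be sketched serially. This is a minimal sketch, assuming the BVH is stored as a list of levels (root first) in which the children of node n are nodes n * degree onward on the next level; that layout, the small hand-built tree, and the function names are assumptions. The inner loops stand in for the per-thread tests of a warp.

```python
def overlaps(a, b):
    """AABB-AABB test; boxes are ((min_x, min_y, min_z), (max_x, max_y, max_z))."""
    return all(a[0][k] <= b[1][k] and b[0][k] <= a[1][k] for k in range(3))


def lights_in_cell(cell, levels, light_aabbs, degree):
    """Indices of lights whose AABB overlaps the cell, found by BVH traversal."""
    result = []
    stack = [(0, 0)] if overlaps(levels[0][0], cell) else []
    while stack:
        depth, node = stack.pop()
        first = node * degree
        if depth == len(levels) - 1:
            # Leaf node: test the lights it covers against the cell.
            for i in range(first, min(first + degree, len(light_aabbs))):
                if overlaps(light_aabbs[i], cell):
                    result.append(i)
        else:
            # Inner node: push every child whose AABB overlaps the cell.
            children = levels[depth + 1]
            for c in range(first, min(first + degree, len(children))):
                if overlaps(children[c], cell):
                    stack.append((depth + 1, c))
    return sorted(result)


# Hand-built tree of degree 2 over four lights along the x axis.
demo_lights = [((0, 0, 0), (1, 1, 1)), ((2, 0, 0), (3, 1, 1)),
               ((4, 0, 0), (5, 1, 1)), ((6, 0, 0), (7, 1, 1))]
demo_levels = [
    [((0, 0, 0), (7, 1, 1))],                         # root
    [((0, 0, 0), (3, 1, 1)), ((4, 0, 0), (7, 1, 1))], # leaf nodes
]
hits = lights_in_cell(((2.5, 0, 0), (4.5, 1, 1)), demo_levels, demo_lights, degree=2)
```

Subtrees whose AABB misses the cell are never pushed, which is where the savings over a brute-force test of every light come from.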
Volume Tiled Forward Shading

 Initialize
 Determine Grid Size
 Compute AABBs for Volume Tiles
 Update
 Depth pre-pass
 Mark tiles
 Find unique tiles
 Assign lights to tiles
 Shade samples
Determine Grid Size

 Volume tiles are defined in view space
 Only need to be recomputed if the screen resolution or field-of-view changes
 For a given tile size $(t_x, t_y)$ and screen dimensions $(w, h)$, the number of subdivisions is $(S_x, S_y) = (\lceil w / t_x \rceil, \lceil h / t_y \rceil)$
 The number of subdivisions in depth, $S_z$, is chosen so that the volume tiles are as cubic as possible
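The grid dimensions can be sketched as follows. This is a minimal sketch, not the author's exact formulation: the screen subdivisions use a ceiling division, and the depth subdivisions assume the exponential slicing used in clustered shading (Olsson et al., 2012), where each slice is deeper than the previous one by a factor of 1 + 2 tan(fov_y / 2) / S_y. The function name and all parameter names are hypothetical.

```python
import math


def grid_size(width, height, tile_w, tile_h, fov_y, z_near, z_far):
    """Number of volume-tile subdivisions (sx, sy, sz) for a view frustum.

    fov_y is the full vertical field of view in radians.
    """
    sx = math.ceil(width / tile_w)
    sy = math.ceil(height / tile_h)
    # Assumed depth slicing: slice k starts at z_near * ratio**k, so slices
    # grow with distance and tiles stay roughly as deep as they are tall.
    ratio = 1.0 + 2.0 * math.tan(fov_y / 2.0) / sy
    # Rounded up here so that the last slice reaches the far plane.
    sz = math.ceil(math.log(z_far / z_near) / math.log(ratio))
    return sx, sy, sz


sx, sy, sz = grid_size(1280, 720, 16, 16, math.radians(60), 0.1, 1000.0)
```

Because the result depends only on the screen size, tile size, field of view, and clip planes, it can be cached until one of those changes.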


Compute AABBs

 The AABB for a volume tile is the minimum bounding volume that fully encloses the frustum created by the tile
Depth Pre-pass

 Record all of the opaque scene objects into the depth buffer
 Required to ensure only visible samples are drawn in the next pass
Mark Active Tiles

 For each visible sample (pixel shader invocation), mark the volume tile corresponding to the sample
 This results in a sparse list of flags corresponding to “active” tiles
 A dense list of tile IDs is generated in the next pass
Build Tile List

 Compress the list of active tiles


 Produces a dense list of tile IDs
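The mark and compact passes can be sketched together. This is a minimal sketch, assuming one flag per volume tile and a scan-based stream compaction; the slides do not state how the compaction is implemented, so the scan is an assumption, as are the function names.

```python
def mark_active_tiles(sample_tile_ids, num_tiles):
    """Sparse flag list: 1 for every tile that contains at least one visible sample."""
    flags = [0] * num_tiles
    for tile in sample_tile_ids:
        flags[tile] = 1  # many samples may mark the same tile; the result is the same
    return flags


def compact_active_tiles(flags):
    """Dense list of the IDs of all flagged tiles, in ascending order."""
    # Exclusive prefix sum of the flags gives each active tile its output slot.
    offsets = []
    running = 0
    for flag in flags:
        offsets.append(running)
        running += flag
    unique_tiles = [0] * running
    for tile, flag in enumerate(flags):
        if flag:
            unique_tiles[offsets[tile]] = tile
    return unique_tiles


flags = mark_active_tiles([3, 3, 7, 1, 7], num_tiles=8)
active = compact_active_tiles(flags)
```

The length of the dense list is the number of thread groups needed by the light assignment pass.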
Assign Lights to Tiles

 A thread group is executed per active volume tile
 Uses Indirect Dispatch to make sure only enough thread groups are executed (without needing to stall the GPU)
 An AABB-AABB test is performed against the AABB of the volume tile for each light in the scene (brute-force)
 If the BVH of the lights is available, use that to reduce the number of tests that need to be performed
 Results in a volume tile grid and a global light index list
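The brute-force assignment and the lookup used during shading can be sketched as follows. This is a minimal sketch, assuming each grid entry stores an (offset, count) pair into the global light index list; that layout is a common convention and an assumption here, as are the function names.

```python
def aabb_overlap(a, b):
    """AABB-AABB test; boxes are ((min_x, min_y, min_z), (max_x, max_y, max_z))."""
    return all(a[0][k] <= b[1][k] and b[0][k] <= a[1][k] for k in range(3))


def assign_lights(active_tiles, tile_aabbs, light_aabbs):
    """Build the tile grid {tile: (offset, count)} and the global light index list."""
    light_grid = {}
    light_index_list = []
    for tile in active_tiles:  # one thread group per active tile on the GPU
        offset = len(light_index_list)
        for light, box in enumerate(light_aabbs):  # brute force over all lights
            if aabb_overlap(tile_aabbs[tile], box):
                light_index_list.append(light)
        light_grid[tile] = (offset, len(light_index_list) - offset)
    return light_grid, light_index_list


def lights_for_tile(tile, light_grid, light_index_list):
    """Light indices a sample in `tile` has to consider while shading."""
    offset, count = light_grid[tile]
    return light_index_list[offset:offset + count]


tiles = {5: ((0, 0, 0), (1, 1, 1)), 9: ((2, 0, 0), (3, 1, 1))}
scene_lights = [((0.5, 0.5, 0.5), (0.6, 0.6, 0.6)),
                ((0.9, 0.0, 0.0), (2.1, 1.0, 1.0)),
                ((5, 5, 5), (6, 6, 6))]
grid, index_list = assign_lights([5, 9], tiles, scene_lights)
```

A light that spans several tiles appears once per tile in the index list, which is why the list is stored separately from the grid.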
Shade Samples

 Same as Forward Rendering but only the lights intersecting with the current volume tile are considered during shading
Experiment Setup

 DirectX 12 Graphics API
 Targeted for Windows 10
 NVIDIA GeForce GTX Titan X was used for all experiments (compliments of NVIDIA)
 Scenes used
 Sponza atrium (Crytek, 2010)
 San Miguel hacienda (McGuire, 2011)
 Tested Algorithms
 Forward Rendering
 Tiled Forward Shading
 Volume Tiled Forward Shading
 Volume Tiled Forward Shading with BVH
 Timings captured using GPU timestamp queries
 All times reported in milliseconds (ms)
Results

 Forward Rendering (FR)


 Tiled Forward Shading (TFS)
 Volume Tiled Forward Shading (VTFS)
 Volume Tiled Forward Shading with BVH (VTFSBVH)
 Comparison
Forward Rendering (Sponza)
Forward Rendering (San Miguel)
Tiled Forward Shading (Sponza)
VTFS (Sponza)
VTFS (San Miguel)
VTFSBVH (Sponza)
VTFSBVH (San Miguel)
Techniques Combined (Sponza)
Techniques Combined (San Miguel)
Known Issues

 Reducing Draw Calls


 Self-Similar Volume Tiles
Reducing Draw Calls

 Volume Tiled Forward Shading requires several render passes
 3 x opaque objects (Depth pre-pass, mark active tiles, shading)
 2 x transparent objects (mark active tiles, shading)
 API overhead can be mitigated using indirect draw
 Vertex feedback buffers can be used to avoid expensive animation and tessellation stages
Self-Similar Volume Tiles

 Volume tiles close to the camera are relatively small
 Volume tiles further away become larger
 This is done to make tiles as cubic as possible but results in larger volume tiles covering many lights
 May improve culling if the min/max bounds of visible samples are used to reduce the size of the volume tile
Improved Sorting

 Sorting is the bottleneck of the technique
 Difficult to solve the sorting problem efficiently
 Different GPU sorting techniques could be tried
 Merge sort alone may work better than radix sort (if done properly)
Conclusion

 Volume Tiled Forward Shading performs better than Tiled Forward Shading when more than 16,384 lights are active in the scene
 Can handle 4 M active lights (with a constant light distribution of
 May be improved with faster sorting
 Shading may be improved by limiting the volume tile AABB to the range of samples contained in the volume tile
Questions?
References

Akeley, K., Akin, A., Ashbaugh, B., Beretta, B., Carmack, J., & Craighead, M. et al. (2007). ARB_vertex_program. Opengl.org. Retrieved 23 September 2016, from https://www.opengl.org/registry/specs/ARB/vertex_program.txt
AMD Graphics Cores Next (GCN) Architecture. (2012) (1st ed.). Retrieved from https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
Andersson, J. (2009). Parallel Graphics in Frostbite – Current & Future. Presentation, Siggraph.
Balestra, C., & Engstad, P. (2008). The technology of uncharted: Drake’s fortune. Presentation, Game Developer Conference.
Beretta, B., Brown, P., Craighead, M., Everitt, C., Hart, E., & Leech, J. et al. (2013). ARB_fragment_program. OpenGL.org. Retrieved 23 September 2016, from https://www.opengl.org/registry/specs/ARB/fragment_program.txt
Blelloch, G. (1989). Scans as primitive parallel operations. IEEE Transactions On Computers, 38(11), 1526-1538. http://dx.doi.org/10.1109/12.42122
Catmull, E. (1974). A Subdivision Algorithm for Computer Display of Curved Surfaces (Ph.D). University of Utah.
Clark, J. (1976). Hierarchical geometric models for visible surface algorithms. Communications Of The ACM, 19(10), 547-554. http://dx.doi.org/10.1145/360349.360354
CUDA C Best Practices Guide. (2016) (1st ed.). Retrieved from http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf
Deering, M., Winner, S., Schediwy, B., Duffy, C., & Hunt, N. (1988). The triangle processor and normal vector shader. ACM SIGGRAPH Computer Graphics, 22(4), 21-30. http://dx.doi.org/10.1145/378456.378468
Dickau, R. (2008). Lebesgue 3D curve, iteration 2. Retrieved from https://commons.wikimedia.org/wiki/File:Lebesgue-3d-step2.png
Downloads. (2017). Crytek.com. Retrieved 4 January 2017, from http://www.crytek.com/cryengine/cryengine3/downloads
Ericson, C. (2005). Real-time collision detection. Amsterdam: Elsevier.
Foley, J., van Dam, A., Feiner, S., & Hughes, J. (1996). Computer Graphics: Principles and Practice (2nd ed.). Boston: Addison-Wesley.
Geldreich, R., & Pritchard, M. (2004). GDC Vault - Deferred Shading on DX9 Class Hardware and the Xbox. Gdcvault.com. Retrieved 27 September 2016, from http://www.gdcvault.com/play/1015172/Deferred-Shading-on-DX9-Class
Green, O., McColl, R., & Bader, D. (2012). GPU merge path. Proceedings Of The 26th ACM International Conference On Supercomputing - ICS '12. http://dx.doi.org/10.1145/2304576.2304621
Harada, T. (2012). A 2.5D culling for Forward+. SIGGRAPH Asia 2012 Technical Briefs On - SA '12. http://dx.doi.org/10.1145/2407746.2407764
Harada, T., McKee, J., & Yang, J. (2012). Forward+: Bringing Deferred Lighting to the Next Level.
Hargreaves, S., & Harris, M. (2004). Deferred Shading. Presentation.
Harris, M., Sengupta, S., & Owens, J. (2008). Parallel Prefix Sum (Scan) with CUDA. In H. Nguyen, GPU Gems 3 (1st ed., pp. 871-873). Addison-Wesley.
Hillis, W., & Steele, G. (1986). Data parallel algorithms. Communications Of The ACM, 29(12), 1170-1183. http://dx.doi.org/10.1145/7902.7903
Howes, L. (2012). Making GPGPU Easier - Software and Hardware Improvements in GPU Computing. Presentation, University of Texas, Austin, Texas.
Karras, T. (2012). Thinking Parallel, Part II: Tree Traversal on the GPU. Parallel Forall. Retrieved 5 January 2017, from https://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/
Lottes, T. (2009). FXAA. Santa Clara, California, USA: NVIDIA Corporation. Retrieved from http://developer.download.nvidia.com/assets/gamedev/files/sdk/11/FXAA_WhitePaper.pdf
McGuire, M. (2011). Meshes. Graphics.cs.williams.edu. Retrieved 2 June 2017, from http://graphics.cs.williams.edu/data/meshes.xml
McKee, J. (2012). Technology Behind AMD's "Leo Demo". Presentation, San Francisco, California.
Mittring, M. (2009). A bit more deferred - CryEngine 3. Presentation, Raleigh, North Carolina.
Morton, G. (1966). A computer oriented geodetic data base and a new technique in file sequencing (1st ed.). Ottawa: International Business Machines Co.
NVIDIA GeForce GTX 1080 Whitepaper. (2016) (1st ed.). Retrieved from http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf
Olsson, O. (2015). Introduction to Real-Time Shading with Many Lights. Presentation.
Olsson, O., & Assarsson, U. (2011). Tiled Shading. Journal Of Graphics, GPU, And Game Tools, 15(4), 235-251. http://dx.doi.org/10.1080/2151237x.2011.621761
Olsson, O., Billeter, M., & Assarsson, U. (2012). Clustered Deferred and Forward Shading. In Eurographics/ACM SIGGRAPH Symposium on High Performance Graphics. Eurographics: The Eurographics Association. Retrieved from http://dx.doi.org/10.2312/EGGH/HPG12/087-096
Programming Guide :: CUDA Toolkit Documentation. (2016). Docs.nvidia.com. Retrieved 13 January 2017, from https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
Rasterization Rules (Windows). (2017). Msdn.microsoft.com. Retrieved 10 July 2017, from https://msdn.microsoft.com/en-us/library/windows/desktop/cc627092(v=vs.85).aspx#Multisample
Saito, T., & Takahashi, T. (1990). Comprehensible rendering of 3-D shapes. ACM SIGGRAPH Computer Graphics, 24(4), 197-206. http://dx.doi.org/10.1145/97880.97901
SAT (Separating Axis Theorem) – dyn4j. (2017). Dyn4j.org. Retrieved 10 July 2017, from http://www.dyn4j.org/2010/01/sat/
Satish, N., Harris, M., & Garland, M. (2009). Designing efficient sorting algorithms for manycore GPUs. 2009 IEEE International Symposium On Parallel & Distributed Processing. http://dx.doi.org/10.1109/ipdps.2009.5161005
Segal, M., & Akeley, K. (1994). The OpenGL Graphics System: A Specification (1st ed.). Silicon Graphics, Inc. Retrieved from https://www.opengl.org/registry/doc/glspec10.pdf
Segal, M., & Akeley, K. (2004). The OpenGL Graphics System: A Specification (2nd ed.). Silicon Graphics Inc. Retrieved from https://www.opengl.org/registry/doc/glspec20.20041022.pdf
Shishkovtsov, O. (2006). Deferred Shading in S.T.A.L.K.E.R. In M. Pharr & R. Fernando, GPU Gems 2: Programming Techniques For High-Performance Graphics And General-Purpose Computation (3rd ed.). Pearson Addison Wesley Prof. Retrieved from http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter09.html
Singer, G. (2013). The History of the Modern Graphics Processor. TechSpot. Retrieved 2 September 2016, from http://www.techspot.com/article/650-history-of-the-gpu
van der Leeuw, M. (2007). Deferred Rendering in Killzone 2. Presentation, Palo Alto, California.
van Oosten, J. (2011). Optimizing CUDA Applications - 3D Game Engine Programming. 3D Game Engine Programming. Retrieved 6 January 2017, from http://www.3dgep.com/optimizing-cuda-applications/
van Oosten, J. (2014). Introduction to DirectX 11. 3D Game Engine Programming. Retrieved 21 September 2016, from http://www.3dgep.com/introduction-to-directx-11
van Oosten, J. (2015). Forward vs Deferred vs Forward+ Rendering with DirectX 11. 3D Game Engine Programming. Retrieved 29 September 2016, from http://www.3dgep.com/forward-plus
Wilt, N. (2013). The CUDA Handbook: A Comprehensive Guide to GPU Programming (1st ed., pp. 365-383). Addison-Wesley.
Young, E. (2010). DirectCompute Optimizations and Best Practices. Presentation, San Jose, California.
Zhang, H., Manocha, D., Hudson, T., & Hoff, K. (1997). Visibility culling using hierarchical occlusion maps. Proceedings Of The 24th Annual Conference On Computer Graphics And Interactive Techniques - SIGGRAPH '97. http://dx.doi.org/10.1145/258734.258781
