Direct3D 11 Compute Shader: More Generality for Advanced Techniques
Overview
GPGPU vs. Data Parallel Computing
Introducing the Compute Shader
Advantages
Target Applications
Key Features
Examples
Image reduction, histogram, convolution
API Support
Apps can scale to massive parallelism without tricky code changes
General recognition that this model is applicable beyond just rendering, although that is our primary target
Scattered Writes
Can read/write arbitrary data structures
Enables new classes of algorithms
Integrated with Direct3D resources
Target Applications
Image/post-processing:
Image reduction, histogram, convolution, FFT
Effect physics
Particles, smoke, water, cloth, etc.
Render scene
Write out scene image
Use Compute for image post-processing
Output final image
[Diagram: Rasterizer, Pixel Shader, Output Merger, Scene Image, Compute Shader, Data Structure, Final Image]
Analogous to graphics DrawPrimitive() calls
Enables algorithms to execute the optimal number of threads
Not how many vertices are read, or pixels written
Sub Blocking
Not all threads in the call can/should share registers with each other
Sharing threads are broken down into subsets (groups) of threads
Thread indices are made available in shader:
sv_ThreadID
sv_ThreadGroupID
sv_ThreadIDinGroup
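As a rough CPU-side illustration of how the three indices could relate, the Python sketch below enumerates a 1D dispatch. The function name `enumerate_threads` and the relationship "global index = group index * group size + index in group" are assumptions for illustration, modeled on common data-parallel APIs; the slides do not spell out the exact mapping.

```python
def enumerate_threads(num_groups, group_size):
    """Yield (thread_id, group_id, id_in_group) for a hypothetical 1D dispatch."""
    for group_id in range(num_groups):
        for id_in_group in range(group_size):
            # Assumed relationship between the three indices.
            thread_id = group_id * group_size + id_in_group
            yield thread_id, group_id, id_in_group


# Three groups of four threads each: twelve threads in total.
threads = list(enumerate_threads(num_groups=3, group_size=4))
```

Only threads that share a `group_id` would be able to share registers in this model.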
Atomic Intrinsics
Enable parallel operations on individual 32-bit memory locations without requiring full synchronization
Either video memory or shared registers
Atomic Intrinsics
Enables basic operations:
InterlockedAdd( rVar, val );
InterlockedMin( rVar, val );
InterlockedMax( rVar, val );
InterlockedOr( rVar, val );
InterlockedXOr( rVar, val );
InterlockedCompareWrite( rVar, val );
InterlockedCompareExchange( rVar, val );
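To illustrate why atomic updates matter when many threads hit the same location, here is a small CPU-side Python sketch. It is not the shader API: `SharedCell`, `interlocked_add`, `interlocked_max`, and `run_workers` are hypothetical names, a lock stands in for the hardware atomic, and 32-bit wraparound is not modeled.

```python
import threading


class SharedCell:
    """One integer location with lock-based stand-ins for atomic updates."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def interlocked_add(self, val):
        # The read-modify-write happens as one indivisible step.
        with self._lock:
            self.value += val

    def interlocked_max(self, val):
        with self._lock:
            self.value = max(self.value, val)


def run_workers(num_threads, adds_per_thread):
    """Have every thread bump a shared counter and record the largest thread index."""
    total, peak = SharedCell(0), SharedCell(0)

    def worker(index):
        for _ in range(adds_per_thread):
            total.interlocked_add(1)
        peak.interlocked_max(index)

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return total.value, peak.value
```

Each update is safe on its own, so the threads never need a barrier across the whole dispatch; only the single location is serialized.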
DXGI resources
Enables out-of-bounds memory checking
Returns 0 on reads
Writes are No-Ops
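The out-of-bounds behavior can be modeled on the CPU in a few lines. This Python sketch is only an illustration of the stated rule (reads return 0, writes are ignored); the class name `CheckedBuffer` and its methods are hypothetical.

```python
class CheckedBuffer:
    """A fixed-size buffer whose out-of-range accesses are harmless."""

    def __init__(self, size):
        self._data = [0] * size

    def load(self, index):
        # Out-of-bounds reads return 0 instead of faulting.
        if 0 <= index < len(self._data):
            return self._data[index]
        return 0

    def store(self, index, value):
        # Out-of-bounds writes are silently dropped (no-ops).
        if 0 <= index < len(self._data):
            self._data[index] = value


buf = CheckedBuffer(4)
buf.store(2, 7)    # in range: stored
buf.store(10, 9)   # out of range: ignored
```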
Unordered I/O
For fastest performance when ordering of records need not be preserved
Both reads and writes:
UnorderedLoad( ResourceVar, val);
UnorderedStore( ResourceVar, val);
Don't Forget
Texture sampling still works:
Object.Load( Loc, Offset, Samples );
Object.Gather( Sampler, Loc );
Object.Sample( Sampler, Loc );
Object.SampleLevel( Sampler, Loc, LoD );
Examples
Image Reduction
Image Histogram
FFT
Image Post-Processing
Significant fraction of frame time
10-20% for most games
50-70% for deferred shading-based engines
Image Reduction
Find the average intensity of an Image
E.g. for HDR exposure adjustment
Optimizes scene for viewing on SDR monitor
Algorithm breakdown:
Input: 1 million pixels
Compute: 1 MAD per pixel read
Output: 1 value
Million-to-1 reduction
[Diagram: Input, GPU, Output]
float3 vPixel = load( sampler, sv_ThreadID );
float fLuminance = dot( vPixel, LUM_VECTOR );
uint value = fLuminance*65536; // convert to fixed point
uint idx = (sv_ThreadID.x + sv_ThreadID.y + sv_ThreadID.z) & 31; // pick one of 32 partial sums
Total[idx] += value;
Count[idx] += 1;
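A CPU-side Python model of this reduction may help make the idea concrete. It is a sketch under assumptions: the Rec. 709 weights used for `LUM_VECTOR`, the choice of 32 partial accumulators, and the name `average_luminance` are not from the slides, and the image is assumed to be a non-empty list of rows of (r, g, b) floats.

```python
LUM_VECTOR = (0.2126, 0.7152, 0.0722)  # assumed Rec. 709 weights; the deck does not define LUM_VECTOR
NUM_SLOTS = 32                         # assumed number of partial accumulators


def average_luminance(image):
    """Average luminance of a non-empty image given as rows of (r, g, b) floats."""
    total = [0] * NUM_SLOTS
    count = [0] * NUM_SLOTS
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            lum = r * LUM_VECTOR[0] + g * LUM_VECTOR[1] + b * LUM_VECTOR[2]
            value = int(lum * 65536)   # 16.16 fixed point, so integer adds suffice
            idx = (x + y) % NUM_SLOTS  # spread updates to reduce contention on one slot
            total[idx] += value
            count[idx] += 1
    # Final step: fold the partial sums together and convert back to float.
    return sum(total) / (65536 * sum(count))
```

Spreading the updates over several accumulators trades a tiny final merge for far fewer collisions on any single destination.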
Reduction Performance
Pyramid approaches work today
Some choice in reduction level per pass
Tradeoff is contention for destination
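The pyramid approach and its per-pass reduction factor can be sketched on the CPU as follows. This is an illustrative Python model, not the Direct3D implementation; `reduce_once`, `pyramid_sum`, and the `factor` parameter are hypothetical names, and the input is assumed to be non-empty with `factor` of at least 2.

```python
def reduce_once(values, factor):
    """One pass: each output sums `factor` adjacent inputs (the last group may be short)."""
    return [sum(values[i:i + factor]) for i in range(0, len(values), factor)]


def pyramid_sum(values, factor=4):
    """Repeat passes until one value remains; returns (sum, number_of_passes)."""
    passes = 0
    while len(values) > 1:
        values = reduce_once(values, factor)
        passes += 1
    return values[0], passes
```

A larger factor means fewer passes but more inputs funneled into each destination, which is the contention tradeoff noted above.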
Histogram Generation
Similar to reduction problem
Reduce to 64-256 destinations at data dependent (unpredictable) addresses
float3 vPixel = load( sampler, sv_ThreadID );
float fLuminance = dot( vPixel, LUM_VECTOR );
int iBin = fLuminance*255.0f; // compute bin to increment
int iHist = sv_ThreadIDInGroup & 15; // use thread index to pick one of 16 histogram copies
Histograms[iHist][iBin] += 1; // update bin
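The same idea, several histogram copies selected by thread index and merged at the end, can be modeled on the CPU. This Python sketch makes several assumptions not stated in the slides: 256 bins, 16 copies, a group size of 64, luminance already computed and clamped into range, and the name `build_histogram`.

```python
NUM_BINS = 256
NUM_COPIES = 16  # assumed number of histogram copies


def build_histogram(luminances, group_size=64):
    """Histogram of luminance values in [0, 1], built via per-copy partial histograms."""
    copies = [[0] * NUM_BINS for _ in range(NUM_COPIES)]
    for thread_id, lum in enumerate(luminances):
        id_in_group = thread_id % group_size
        i_bin = max(0, min(int(lum * 255.0), NUM_BINS - 1))  # bin to increment, clamped
        i_hist = id_in_group % NUM_COPIES                    # which copy this thread updates
        copies[i_hist][i_bin] += 1
    # Merge the copies into the final histogram.
    return [sum(copy[b] for copy in copies) for b in range(NUM_BINS)]
```

Because the bin address depends on the pixel value, neighboring threads can collide unpredictably; multiple copies lower the odds that two threads update the same counter at once.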
Histogram Performance
Recent work shows similar performance to reductions:
Direct3D takes ~2.4 ms per megapixel
On DirectX10 hardware
8x theoretically possible
if purely read limited
Image Convolution
Fundamental operation for blurs:
HDR flares, depth-of-field, soft shadows, streaks
Convolution Performance
Massively variable depending on method
Direct3D does 5x5 kernel in 0.65ms/Mpix
Separable kernel
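To show why separability helps, the Python sketch below checks that a 5x5 kernel formed as the outer product of a 1D kernel gives the same result as two 1D passes (10 taps per pixel instead of 25). The binomial weights in `KERNEL_1D`, the clamp-to-edge border handling, and all function names are assumptions for illustration.

```python
KERNEL_1D = [1, 4, 6, 4, 1]  # assumed binomial weights; the 5x5 kernel is their outer product


def clamp(i, n):
    """Clamp index i into [0, n - 1] (clamp-to-edge addressing)."""
    return max(0, min(n - 1, i))


def convolve_rows(image, kernel):
    """Apply a 1D kernel along each row."""
    w, r = len(image[0]), len(kernel) // 2
    return [[sum(kernel[j + r] * row[clamp(x + j, w)] for j in range(-r, r + 1))
             for x in range(w)] for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def convolve_separable(image, kernel):
    """Horizontal pass, then vertical pass (done as a row pass on the transpose)."""
    horizontal = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(horizontal), kernel))


def convolve_full(image, kernel):
    """Direct 2D convolution with the outer-product kernel, for comparison."""
    h, w, r = len(image), len(image[0]), len(kernel) // 2
    return [[sum(kernel[i + r] * kernel[j + r] * image[clamp(y + i, h)][clamp(x + j, w)]
                 for i in range(-r, r + 1) for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]
```

Integer weights are used so the two results match exactly; normalizing by 256 afterwards would turn this into a blur.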
Scan (Prefix-sum)
Each number in data sequence is sum of all previous numbers
Used to compute writes in irregular arrays
Foundation of Summed Area Tables
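A sequential Python sketch can show both the scan itself and how it computes write positions for an irregular output. A GPU version would run the scan in parallel; this CPU model only illustrates the result. The names `exclusive_scan` and `compact` are hypothetical.

```python
def exclusive_scan(values):
    """Each output is the sum of all previous inputs; also returns the grand total."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out, running


def compact(items, keep):
    """Keep only the items passing `keep`, using a scan to find each write offset."""
    flags = [1 if keep(x) else 0 for x in items]
    offsets, total = exclusive_scan(flags)
    result = [None] * total
    for item, flag, offset in zip(items, flags, offsets):
        if flag:
            result[offset] = item  # every surviving item knows its slot independently
    return result
```

For example, `exclusive_scan([3, 1, 4, 1, 5])` yields the offsets `[0, 3, 4, 8, 9]` and a total of 14.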
Scan (Prefix-sum)
We are looking at providing this in a library routine, along with FFT, etc.
Enables box filter with performance independent of kernel size O(k)
Fast generation of:
Shadow blur with distance
Depth-of-field
Area light integrals, etc.
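A summed area table makes the kernel-size independence easy to see: once the table is built, any box sum takes four lookups. The Python sketch below is a CPU illustration with hypothetical names (`summed_area_table`, `box_sum`), a non-empty image assumed, and a one-element zero border to avoid edge cases.

```python
def summed_area_table(image):
    """sat[y][x] holds the sum of image[0:y][0:x]; the table has a zero border."""
    h, w = len(image), len(image[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]                   # prefix sum along the row
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum  # plus everything above
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum over x0 <= x < x1 and y0 <= y < y1 using four table lookups."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = summed_area_table(image)
```

Dividing `box_sum` by the box area gives the box-filtered value, at the same cost for a 3x3 box as for a 300x300 one.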
Direct3D FFT
Ping-pong between 2 R32G32F surfaces
R is the real part, G is the imaginary part
Do LogN passes along rows then columns
Pixel shader only
Does not use blenders or iterators
Uses vPos.xy as array indices [i][j]
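The pass structure can be modeled on the CPU: each pass reads one buffer and writes the other, every output element is computed independently (like one pixel), and a 2D transform runs the 1D passes along rows and then columns. This Python sketch is a generic radix-2 FFT under those assumptions, not the Direct3D implementation; the up-front bit-reversal reorder and the names `reverse_bits`, `fft_1d`, and `fft_2d` are choices made for illustration.

```python
import cmath


def reverse_bits(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_1d(values):
    """Radix-2 FFT done as log2(N) gather passes that ping-pong between two buffers."""
    n = len(values)
    passes = max(n.bit_length() - 1, 0)
    if n < 1 or n != 1 << passes:
        raise ValueError("length must be a power of two")
    src = [complex(values[reverse_bits(i, passes)]) for i in range(n)]
    dst = [0j] * n
    for s in range(1, passes + 1):
        m, half = 1 << s, 1 << (s - 1)
        for i in range(n):                 # each output element is independent
            j = i % m
            lo = i - j + (j % half)        # first input of this butterfly
            w = cmath.exp(-2j * cmath.pi * (j % half) / m)
            t = w * src[lo + half]
            dst[i] = src[lo] + t if j < half else src[lo] - t
        src, dst = dst, src                # ping-pong: output becomes next input
    return src


def fft_2d(grid):
    """Transform every row, then every column."""
    rows = [fft_1d(row) for row in grid]
    cols = [fft_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

For a 1024x1024 image this structure means 10 passes along rows and 10 along columns.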
FFT Before and After
[Figures: image before and after FFT]
FFT Performance
Complex 1024x1024 2D FFT:
Implementation    Time    Throughput    Speedup
Software          42ms    6 GFlops
Direct3D9         15ms    17 GFlops     3x
Prototype DX11    6ms     42 GFlops     6x
Latest chips      3ms     100 GFlops
Shared register space and random access writes enable ~2x speedups
Order-Independent Translucency
Eliminates draw-order issues, and shimmer in moving scenes
Correct AA even of transparent objects
Any object is transparent if antialiased, e.g. alpha-tested leaves in forests
Brings visual quality to movie levels without requiring 256-sample MSAA
Something to keep an eye on for OIT
A-Buffer Rendering
Currently prototyping using refrast
DirectX reference rasterizer running on CPU
Measuring memory access patterns/locality
Evaluating feasibility of hardware
Not really feasible with current Direct3D
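The general A-buffer idea can be sketched in a few lines: keep every fragment that lands on a pixel, then sort by depth and blend at resolve time, so submission order no longer matters. This Python model is a generic illustration, not the prototype described here; the fragment layout `(depth, rgb, alpha)`, the convention that larger depth is farther away, and the name `resolve_pixel` are all assumptions.

```python
def resolve_pixel(fragments, background):
    """Blend a pixel's stored fragments back to front over the background color."""
    color = background
    # Sort farthest first (assumes larger depth means farther from the camera).
    for depth, rgb, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = tuple(alpha * c + (1.0 - alpha) * d for c, d in zip(rgb, color))
    return color


red_near = (1.0, (1.0, 0.0, 0.0), 0.5)
blue_far = (2.0, (0.0, 0.0, 1.0), 0.5)
black = (0.0, 0.0, 0.0)

# The result is the same whichever order the fragments were drawn in.
first = resolve_pixel([red_near, blue_far], black)
second = resolve_pixel([blue_far, red_near], black)
```

The cost is an unbounded, data-dependent list per pixel, which is why memory access patterns are the thing to measure.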
Additional Algorithms
New rendering methods
Ray-tracing, collision detection, etc.
Rendering elements at different resolutions
Non-rendering algorithms
IK, physics, AI, simulation, fluid simulation, radiosity
Summary
Compute Shader is coming in Direct3D 11
GPU performance levels for more applications
Questions?
www.xnagamefest.com