
Understanding AMD’s FidelityFX

Overview
From: https://gpuopen.com/fidelityfx-superresolution/

FidelityFX Super Resolution is a spatial upscaler: it works by taking the current anti-aliased
frame and upscaling it to display resolution without relying on other data such as frame history
or motion vectors.
At the heart of FSR is a cutting-edge algorithm that detects and recreates high-resolution edges
from the source image. Those high-resolution edges are a critical element required for turning
the current frame into a “super-resolution” image.
FSR provides consistent upscaling quality regardless of whether the frame is in motion, which
can provide quality advantages compared to other types of upscalers.
FSR is composed of two main passes:
● An upscaling pass called EASU (Edge-Adaptive Spatial Upsampling) that also performs
edge reconstruction. In this pass, the input frame is analyzed and the main part of the
algorithm detects gradient reversals – essentially looking at how neighboring gradients
differ – from a set of input pixels. The intensity of the gradient reversals defines the
weights to apply to the reconstructed pixels at display resolution.
● A sharpening pass called RCAS (Robust Contrast-Adaptive Sharpening) extracts pixel
detail in the upscaled image.
FSR also comes with helper functions for color space conversions, dithering, and tone mapping
to assist with integrating it into common rendering pipelines used with today’s games.

FidelityFX Super Resolution looks for gradient reversals in the source image to reconstruct
high-definition edges at upscaled resolution.

Notes on the code:

● All code is at https://github.com/GPUOpen-Effects/FidelityFX-FSR


● The interesting parts live in two files: ffx-fsr/ffx_a.h, which holds all the portability
definitions, and ffx-fsr/ffx_fsr1.h, which holds all the real code.

EASU Explanation
In the following we explain everything we know/understand/gathered about this technique.

EASU preprocessing
● Image should be well antialiased (TAA, MSAA, etc.)
● Image should be in perceptual space
○ This means we should do a conversion. The ideal conversion is described
elsewhere, for instance at these blogs (in particular, the second one):
https://lettier.github.io/3d-game-shaders-for-beginners/gamma-correction.html
https://chilliant.blogspot.com/2012/08/srgb-approximations-for-hlsl.html
However, in the Unity presentation at SIGGRAPH, they say they used sqrt to go
from RGB to sRGB, and squaring to go the other way round. Not as accurate,
but probably faster (see the sketch after this list).
● Input image must be normalized to [0,1]
○ Negative inputs cause RCAS to output NaN!
● Image should be generated using negative MIP bias to increase texture detail
● Image should be noise-free
○ Add noise/grain AFTER upscaling with FidelityFX
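
A minimal sketch of the cheap sqrt-based conversion mentioned above (our naming; the exact transfer function is the engine's choice, and this is not exact sRGB):

#include <math.h>

// Cheap linear <-> perceptual approximation (sqrt in, square out).
// Not as accurate as the real sRGB curve, but fast.
float linear_to_perceptual(float c) { return sqrtf(c); }
float perceptual_to_linear(float c) { return c * c; }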

EASU Algorithm
● Uses a fixed 12-tap kernel window, selecting the nearest 12 taps in a circular pattern.
EASU requires an analysis of those 12 taps before it can figure out the filter kernel.
The reason 12 taps were chosen, instead of 16, is that with 12 taps you only need
36 registers for the 32-bit version.

❖ 12 Taps = Good Upper Limit


❖ Single-pass algorithm (radial/elliptical filtering)
❖ 12 taps * 3 channels = 36 VGPRs (FP32)
❖ To avoid reading the 12 taps twice, the algorithm has to keep all of them in
registers for its full duration. Therefore, if you wanted to do anything higher,
you'd run out of temporary registers for logic. The goal is to stay at or under
64 registers, as that is a good upper limit on AMD's hardware to be able to
hide latency.
❖ 64 VGPRs (good upper limit) - 36 = 28 VGPRs for logic
❖ The algorithm needs all 12 taps for analysis then filtering

● Does analysis on each ‘+’ pattern that surrounds the inner 2x2 quad in luma (r+2g+b).

So if we look at the 12-tap kernel, there are 4 taps in the center, and for each one of
those, it needs to compute the analysis for direction and length. The analysis works in
luma, and by luma I mean an approximation: red plus two green plus blue. So it is not a
complicated luma; it is more of a “get all the channels included so we don’t miss
anything” approximation (see the sketch below).
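
A trivial restatement of that approximation (the shipping code, shown later, computes half this value to save an operation):

// r+2g+b pseudo-luma: cheap and channel-inclusive, not perceptually weighted.
float easu_luma(float r, float g, float b) { return r + 2.0f * g + b; }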

❖ Example of “Pass Merging”


❖ Analysis could be done as a separate pass
❖ But that would require an extra round trip through memory, using even more bandwidth
❖ Instead the ALU logic gets duplicated 4 times per output pixel

● Analysis is bilinearly interpolated and used to shape final filter kernel

The analysis is done on the 2x2 quad, and this is effectively a form of pass merging.
The analysis could have been done in a separate pass, but that would require two round
trips through memory, and we don’t want that; instead we duplicate a small amount of
work in the shader so we don’t have to go through memory multiple times. Once the
analysis is finished, we bilinearly interpolate it at the position we actually want to
filter at, and that is used to shape the final filter kernel.

This part is set up in the following function (line 156 of ffx_fsr1.h):

A_STATIC void FsrEasuCon(


outAU4 con0,
outAU4 con1,
outAU4 con2,
outAU4 con3,
// This the rendered image resolution being upscaled
AF1 inputViewportInPixelsX,
AF1 inputViewportInPixelsY,
// This is the resolution of the resource containing the input image (useful for dynamic resolution)
AF1 inputSizeInPixelsX,
AF1 inputSizeInPixelsY,
// This is the display resolution which the input image gets upscaled to
AF1 outputSizeInPixelsX,
AF1 outputSizeInPixelsY){

}
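
A minimal CPU-side usage sketch (the resolutions are made-up example values; assumes a C build with A_CPU defined so AU1/AF1 map to uint32_t/float):

// Constants consumed later by FsrEasuF().
AU1 con0[4], con1[4], con2[4], con3[4];
FsrEasuCon(con0, con1, con2, con3,
           1280.0f,  720.0f,   // rendered viewport being upscaled
           1280.0f,  720.0f,   // full size of the input resource
           2560.0f, 1440.0f);  // display resolution to upscale to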

However, that is only the setup; the function where this is really applied, i.e., where each
tap is filtered and accumulated, is (line 239):

void FsrEasuTapF(
inout AF3 aC, // Accumulated color, with negative lobe.
inout AF1 aW, // Accumulated weight.
AF2 off, // Pixel offset from resolve position to tap.
AF2 dir, // Gradient direction.
AF2 len, // Length.
AF1 lob, // Negative lobe strength.
AF1 clp, // Clipping point.
AF3 c){ // Tap color.

}
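Paraphrasing our reading of the surrounding source (a sketch, not verbatim): FsrEasuF invokes this once per tap, passing that tap's offset from the resolve position and its color, then normalizes the accumulated color by the accumulated weight:

// One call per tap, offsets relative to the resolve position 'pp':
FsrEasuTapF(aC, aW, AF2( 0.0,-1.0)-pp, dir, len2, lob, clp, bC); // tap b
FsrEasuTapF(aC, aW, AF2( 1.0,-1.0)-pp, dir, len2, lob, clp, cC); // tap c
// ... 10 more taps ...
pix = aC * AF3_(ARcpF1(aW)); // normalize by the total weight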
EASU Sampling

● The 12-tap pattern is fetched via 4 gather positions


● Set up so the {X,Y} and {Z,W} pairs have the necessary data
● Otherwise we would have to shuffle data around

EASU Analysis

Edge direction is estimated from a central difference: for the analysis, once the taps are in,
the edge direction is estimated using a central difference. The central difference does miss
single-pixel features; however, as we will see later, as feature-length becomes very small the
filter kernel becomes symmetric and non-directional, so we don’t care about directionality for
thin features.

A diagonal diff would have been more expensive and would have a 0.5-texel offset: therefore a
diagonal diff is not used; it would have cost more, and we would have had to deal with a
half-texel offset, which would have made the logic a little more complicated. It is OK to miss
single-pixel features (feature-length forces a small symmetric, non-directional filter in those
cases anyway). So once the edge direction is finished, we look at feature-length, which we
estimate by looking at the 3 texels in the horizontal and 3 texels in the vertical.

Feature-length is estimated by scoring the amount of gradient reversal, looking at what happens
with the luma gradient: if the luma gradient has a reversal, for instance starting at black,
going to white, and returning to black, that is a “Full Reversal”, which has a significant
probability of being a thin feature. Whereas if we look at something with no reversal, say
going from black to white and staying at white, that is probably a large feature, on which we
can use a larger filter kernel.
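
A worked sketch of that score under our own naming (it mirrors the lenX code shown later, with a small epsilon standing in for the approximate reciprocal; full reversal scores 0, no reversal scores 1):

#include <math.h>

// Luma samples b, c, d along one axis ('c' is the center tap).
float reversal_score(float b, float c, float d) {
    float dc = d - c, cb = c - b;
    float m = fmaxf(fabsf(dc), fabsf(cb)) + 1e-6f; // largest local step
    float dirX = d - b;                            // central difference
    float s = fminf(fabsf(dirX) / m, 1.0f);        // saturate
    return s * s;                                  // shaped
}
// Full reversal (black-white-black): reversal_score(0, 1, 0)   == 0
// No reversal (black-grey-white):    reversal_score(0, 0.5, 1) == 1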

EASU and Color Spaces


Most gaming AA ends up with perceptually even gradients on edges

Thus directional analysis works better in perceptual space for games


● A directional analysis is based on horizontal and vertical gradients
● Perceptual as in sRGB (piecewise curve), gamma 2.0, gamma 2.2, etc

This keeps the computation cheap: if we took linear input and converted to perceptual inside
EASU, we would have to do that 12 times, once per tap. Since linear-to-perceptual transforms
are expensive and EASU uses 12 taps, it is much better, and in fact required for good
performance, to factor any linear-to-perceptual translation into the pass prior to EASU.

Highly recommended to run EASU in a perceptual space


● It will work in linear too, just doesn’t look as good on some content

The one compromise, of course, is that if we run in perceptual space, all the filtering runs in
perceptual space too, but as it turns out that is typically acceptable.
EASU Kernel Shaping
Analysis after interpolation produces {direction, length}
● The ‘direction’ used to rotate the filter kernel
● The ‘length’ drives post-rotation kernel scaling, and kernel window adjustment

X scales from {1.0 to sqrt(2.0)} on {axis-aligned to diagonal} direction


● Diagonals get larger kernels as they can sharpen more without banding
Y scales from {1.0 to 2.0} on {small to larger feature length}
● Small axis-aligned features end up with a small symmetric kernel to avoid artifacts
● Longer features get a larger kernel to better restore the edge

So once we have all the analysis finished, we have a {direction, length} pair for the whole
2x2 quad. We use the interpolated direction to rotate the filter kernel, the length to drive
the post-rotation kernel scaling on the X and Y axes, and also the length to adjust the kernel
window (covered further below).
On the X-axis, we go from no scaling to sqrt(2) depending on whether we are axis-aligned or
running on the diagonal. When axis-aligned we don’t scale the X-axis at all, but on a diagonal
we scale by sqrt(2), because we can allow a larger kernel there without seeing any banding.
The banding would have been created by the negative lobe.
The Y-axis goes from no scaling up to double size. We use no scaling for small features, so we
end up with a small symmetric kernel which does not sample outside the feature itself. As the
feature gets larger, we use a longer kernel so we can better restore the edge.
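
A sketch of that shaping logic under our own naming (the real shader derives the scale factors directly from the interpolated dir/len values; diagonality and featureLen here are stand-ins):

#include <math.h>

// diagonality in [0,1]: 0 = axis-aligned edge, 1 = 45-degree diagonal.
// featureLen  in [0,1]: 0 = tiny feature,      1 = long feature.
void shape_kernel(float diagonality, float featureLen,
                  float* scaleX, float* scaleY) {
    *scaleX = 1.0f + (sqrtf(2.0f) - 1.0f) * diagonality; // 1 .. sqrt(2)
    *scaleY = 1.0f + featureLen;                         // 1 .. 2
}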
EASU Kernel

Uses a polynomial approximation to Lanczos. This lives in the function FsrEasuTapF shown
earlier (“Filtering for a given tap for the scalar.”).

● Lanczos is expensive, using {sin(), rcp(), sqrt()}; those are transcendental-class
instructions, which run at quarter rate depending on your hardware, and therefore
they are best avoided if possible.

The EASU kernel itself started as a polynomial approximation to lanczos(2).


So instead it is broken down into a base and a window, similar to the way Lanczos is a sinc
function windowed by another sinc function, and also because we want the window to be
adaptable to the length. When the window is small, the kernel goes from +/- sqrt(2); the
window has been shortened, which truncates the negative lobe: we don’t get as much sharpening,
but we also don’t get the ringing and other problems we would potentially have. The wide
kernel goes from +/- 2, and that kernel has a very strong negative lobe which helps restore
the edge.

Instead base*window is used. This implements an approximation of lanczos(2) without sin(),
rcp(), or sqrt(), taking x^2 directly:

(25/16 * (2/5 * x^2 - 1)^2 - (25/16 - 1)) * (1/4 * x^2 - 1)^2
\________________ base _________________/ \_____ window ____/

Here the window term is shown with w = 1/4; in general, w varies from 1/4 for the {+/- 2}
kernel to 1/2 for the {+/- sqrt(2)} kernel. Note, the general form of the 'base' is

(a*(b*x^2-1)^2 - (a-1)),

where a = 1/(2*b - b^2) and 'b' moves the negative lobe around.
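
As a plain-C sketch (our naming; the shipping shader computes the same thing vectorized, taking x^2 and the window term as inputs):

// EASU kernel weight: polynomial base * adjustable window.
// x2 = x*x; w = 1/4 for the +/-2 kernel, 1/2 for the +/-sqrt(2) kernel.
float easu_weight(float x2, float w) {
    float b = (2.0f / 5.0f) * x2 - 1.0f;
    float base = (25.0f / 16.0f) * b * b - (25.0f / 16.0f - 1.0f);
    float win = w * x2 - 1.0f;
    return base * win * win;
}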

EASU Deringing
The local 2x2 texel quad {min,max} is used to clamp the EASU output

Removes all ringing

Also removes some artifacts of the 12-tap limited window


● Or alternatively, some artifacts of the kernel adaptation
We move on to the deringing step, where we take the local 2x2 texel quad,
the min and max of RGB, and use that to clamp the EASU output.
This removes all the ringing.
It also removes artifacts of the limited 12-tap window.
Therefore, when the scaling is larger and you might see clipping of the window,
it is best to run the deringing step to minimize that.
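
A minimal sketch of that clamp (our naming; the real code computes per-channel min/max of the inner quad {f,g,j,k}):

#include <math.h>

// Clamp the filtered result to the local 2x2 quad's range, per channel.
float dering(float pix, float f, float g, float j, float k) {
    float lo = fminf(fminf(f, g), fminf(j, k));
    float hi = fmaxf(fmaxf(f, g), fmaxf(j, k));
    return fminf(fmaxf(pix, lo), hi);
}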

The analysis part is implemented via two functions, FsrEasuSetF and FsrEasuF. First,
FsrEasuSetF accumulates direction and length:

void FsrEasuSetF(
inout AF2 dir,
inout AF1 len,
AF2 pp,
AP1 biS,AP1 biT,AP1 biU,AP1 biV,
AF1 lA,AF1 lB,AF1 lC,AF1 lD,AF1 lE){

}

Direction is the '+' diff:

 a
bcd
 e

Then, the algorithm takes the magnitude from the abs average of both sides of 'c'. Length
converts gradient reversal to 0, smoothly up to non-reversal at 1, shaped, then adds the
horizontal and vertical terms. This is done as:

AF1 dc=lD-lC;
AF1 cb=lC-lB;
AF1 lenX=max(abs(dc),abs(cb));  // largest local step on either side of 'c'
lenX=APrxLoRcpF1(lenX);         // approximate reciprocal
AF1 dirX=lD-lB;                 // central difference = edge direction
dir.x+=dirX*w;                  // accumulate direction, bilinearly weighted
lenX=ASatF1(abs(dirX)*lenX);    // reversal -> 0, no reversal -> 1
lenX*=lenX;                     // shaped
len+=lenX*w;                    // accumulate length, bilinearly weighted

Then the code repeats this for the y axis, accumulating into the variables dir and len. The helpers come from ffx_a.h:

AF1 APrxLoRcpF1(AF1 a){return AF1_AU1(AU1_(0x7ef07ebb)-AU1_AF1(a));}


A_STATIC AF1 ASatF1(AF1 a){return AMinF1(1.0f,AMaxF1(0.0f,a));}

Here AMinF1 and AMaxF1 are simple minimum- and maximum-computing functions:

A_STATIC AF1 AMinF1(AF1 a,AF1 b){return a<b?a:b;}


A_STATIC AF1 AMaxF1(AF1 a,AF1 b){return a>b?a:b;}

And AF1_AU1 reinterprets the bits of a uint as a float (AU1_AF1 is the reverse bit-cast), while AU1_ is a simple uint cast; the reciprocal approximation above works purely by bit manipulation of the IEEE-754 representation.
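
For illustration, a standalone C equivalent of that reciprocal trick (our naming; same magic constant as APrxLoRcpF1):

#include <stdint.h>
#include <string.h>

// Low-precision 1/x via a magic subtraction in bit space.
static float prx_lo_rcp(float a) {
    uint32_t u;
    memcpy(&u, &a, sizeof u);  // bit-cast float -> uint
    u = 0x7ef07ebbu - u;       // the constant from APrxLoRcpF1
    float r;
    memcpy(&r, &u, sizeof r);  // bit-cast back
    return r;                  // e.g. prx_lo_rcp(2.0f) is roughly 0.47
}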

Finally, the function FsrEasuF is where the heavy lifting happens:

void FsrEasuF(
out AF3 pix,
AU2 ip, // Integer pixel position in output.
AU4 con0, // Constants generated by FsrEasuCon().
AU4 con1,
AU4 con2,
AU4 con3){

}

This code first computes the position of sample “f” from ip, the position of the output pixel
to calculate. Then it applies the 12-tap kernel:

 bc
efgh
ijkl
 no

Remember that gather4 has the following ordering:

a b
r g

For packed FP16, we need either {rg} or {ab} pairs, so the following setup is used for the
gathers in all versions. Also note that the top {a b} row and the bottom {r g} row are
unused (marked 'z'):
//   a b    <- unused (z)
//   r g
// a b a b
// r g r g
//   a b
//   r g    <- unused (z)
// Allowing dead-code removal to remove the 'z's.

Then it applies the simplest multi-channel approximate luma possible (“luma times 2”, in 2
FMA/MAD); each value works out to r*0.5 + g + b*0.5, i.e. half of the r+2g+b approximation
used in the analysis:

AF4 bczzL=bczzB*AF4_(0.5)+(bczzR*AF4_(0.5)+bczzG);
AF4 ijfeL=ijfeB*AF4_(0.5)+(ijfeR*AF4_(0.5)+ijfeG);
AF4 klhgL=klhgB*AF4_(0.5)+(klhgR*AF4_(0.5)+klhgG);
AF4 zzonL=zzonB*AF4_(0.5)+(zzonR*AF4_(0.5)+zzonG);

Next, it accumulates direction and length into the variables dir and len, bilinearly weighted across the 2x2 quad:

AF2 dir=AF2_(0.0);
AF1 len=AF1_(0.0);
FsrEasuSetF(dir,len,pp,true, false,false,false,bL,eL,fL,gL,jL);
FsrEasuSetF(dir,len,pp,false,true ,false,false,cL,fL,gL,hL,kL);
FsrEasuSetF(dir,len,pp,false,false,true ,false,fL,iL,jL,kL,nL);
FsrEasuSetF(dir,len,pp,false,false,false,true ,gL,jL,kL,lL,oL);
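
The four booleans select which corner of the quad each call corresponds to; paraphrasing FsrEasuSetF's internals (a sketch of our reading of the source), each one picks the bilinear weight w from the fractional position pp:

// Inside FsrEasuSetF: pick the bilinear weight for this corner.
AF1 w = AF1_(0.0);
if (biS) w = (AF1_(1.0) - pp.x) * (AF1_(1.0) - pp.y); // corner f
if (biT) w =              pp.x  * (AF1_(1.0) - pp.y); // corner g
if (biU) w = (AF1_(1.0) - pp.x) *              pp.y;  // corner j
if (biV) w =              pp.x  *              pp.y;  // corner k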
