Deep Perceptual Preprocessing for Video Coding (CVPR 2021)
Figure 1: Left: Frame segments of encoder vs. DPP+encoder at the same bitrate. Right: Bjontegaard delta-rate vs. runtime (in multiples of x264 slow preset runtime on CPU) for codec only (red) and DPP+codec (green). More negative BD-rates correspond to higher average bitrate savings for the same visual quality (x264: AVC/H.264, aomenc: AV1, vvenc: VVC/H.266).
Abstract

We introduce the concept of rate-aware deep perceptual preprocessing (DPP) for video encoding. DPP makes a single pass over each input frame in order to enhance its visual quality when the video is to be compressed with any codec at any bitrate. The resulting bitstreams can be decoded and displayed at the client side without any post-processing component. DPP comprises a convolutional neural network that is trained via a composite set of loss functions that incorporates: (i) a perceptual loss based on a trained no-reference image quality assessment model, (ii) a reference-based fidelity loss expressing L1 and structural similarity aspects, (iii) a motion-based rate loss via block-based transform, quantization and entropy estimates that converts the essential components of standard hybrid video encoder designs into a trainable framework. Extensive testing using multiple quality metrics and AVC, AV1 and VVC encoders shows that DPP+encoder reduces, on average, the bitrate of the corresponding encoder by 11%. This marks the first time a server-side neural processing component achieves such savings over the state-of-the-art in video coding.

1. Introduction

Streaming high-resolution video comes with an inevitable trade-off between available bandwidth and visual quality. In recent years, many video compression standards have been developed, such as Advanced Video Coding (AVC) and AOMedia Video 1 (AV1), which offer a number of advanced coding and prediction tools for efficient video encoding and transmission. While these codecs are widely deployed in industry, with AVC still accounting for the largest share of video streaming volume worldwide, the encoding tools are handcrafted and not entirely data dependent. This has led to an increased interest in learned video compression methods [24, 13, 11, 35], which claim to offer better encoding efficiency by training deep neural networks to improve the rate-distortion performance. However, these methods come with their own pitfalls; typically they require bespoke encoder, bitstream and decoder components for end-to-end optimization. The decoder is typically computationally heavy and not viable for deployment on CPU-based commodity devices such as mobile phones. Additionally, work in learned video compression [24, 45] tends to be benchmarked against codecs with limited tools enabled: 'very fast' preset, low-latency mode and GOP sizes of 10 frames. It is unclear if learned video compression methods outperform standards under their more advanced (and most widely used) encoding settings.
In this work we aim to bridge the gap between the data adaptivity and scalability of learned compression methods and the performance and off-the-shelf decoding support offered by standard codec implementations. To this end, our proposed deep perceptual preprocessor (DPP) simply prepends any standard video codec at inference, without requiring any bespoke encoder or decoder component. The key aspect of our proposal is that it offers rate-aware perceptual improvement by encapsulating both perceptual and fidelity losses, as well as a motion-based rate loss that encapsulates the effect of motion-compensated prediction and entropy coding. In addition, our trained DPP models require a single pass over the input, and all encodings with different standards-based encoders at various bitrates and resolutions can be subsequently applied to the DPP output. Experiments versus state-of-the-art AVC, AV1 and Versatile Video Coding (VVC) [14] encoders show that DPP allows for an 11% average reduction in bitrate without requiring changes in encoding, streaming, or video decoding.

We summarize our contributions as follows:

1. We propose a deep perceptual preprocessor (DPP) that preprocesses the input content prior to passing it to any standard video codec, such as AVC, AV1 or VVC.

2. We train the DPP in an end-to-end manner by virtualizing key components of a standard codec with differentiable approximations. We balance between perception and distortion by using an array of no-reference and reference-based loss functions.

3. We test our models under the most stringent testing conditions: multi-resolution, multi-QP, convex-hull optimization per clip, and high-performance AVC, AV1 and VVC presets used extensively by Netflix, Facebook and Intel in several benchmark papers [19, 21, 20].

Visual comparisons of encoder versus DPP+encoder outputs are shown in Fig. 1 (left), illustrating the visual quality improvement that can be achieved at the same bitrate. Fig. 1 (right) illustrates how DPP is able to offer consistent bitrate savings across three video coding standards of increasing sophistication and complexity, while its runtime overhead diminishes in comparison to the encoding runtime.

2. Related Work

2.1. Compression

Recent work in learned image [2, 3, 34, 39] or video [24, 13, 11, 35] compression tends to replace the entirety of a standard transform coding pipeline with neural networks.
That is, a neural network-based encoder learns to transform an image or video x into a latent vector y. The latent vector is quantized, yielding a discrete-valued representation ŷ, upon which rate is minimized via differential entropy computation:

R = −E_ŷ[log₂ p(ŷ)]   (1)

Given that quantization and prior density p(ŷ) estimation for entropy computation are non-differentiable operations [2, 39], these are instead represented with continuous approximations. The reconstructed image or video x̂ can thus be generated from ŷ with a neural-network-based decoder. The error between the reconstructed input x̂ and original input x can be minimized via a distortion measure ∆, such as mean squared error (MSE) or mean absolute error (MAE):

D = E_{x,x̂}[∆(x, x̂)]   (2)

The encoder and decoder thus constitute a (variational) autoencoder framework [2, 3, 11, 16], and this framework is trained end-to-end to jointly optimize rate and distortion with loss L = D + λR, where λ is the Lagrange multiplier that controls the rate-distortion tradeoff [30]. In the case where the prior density model is fully factorized, statistical dependencies between elements of ŷ can be modelled with a (scale) hyperprior [3, 11]; however, any additional encoding bits must be transmitted as side information.
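As an illustration of this objective, the following PyTorch sketch estimates the rate term of Eq. (1) with a fixed factorized Gaussian prior and a uniform-noise relaxation of quantization, and combines it with an MSE distortion as L = D + λR. The prior, the relaxation and the value of λ are illustrative assumptions rather than choices made by the cited methods, which learn the density jointly with the transforms and sweep λ to trace out a full rate-distortion curve.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def rate_bits(y, scale=1.0):
    """Differentiable estimate of -log2 p(ŷ) under a factorized Gaussian prior.
    Rounding is relaxed by adding uniform noise in [-0.5, 0.5], and the
    probability mass of each unit-width bin serves as the likelihood."""
    noisy = y + torch.empty_like(y).uniform_(-0.5, 0.5)      # proxy for quantization
    prior = Normal(loc=torch.zeros_like(y), scale=scale)
    p_bin = prior.cdf(noisy + 0.5) - prior.cdf(noisy - 0.5)  # mass of the bin
    return -torch.log2(p_bin.clamp_min(1e-9)).sum(dim=(1, 2, 3))  # bits per item

def rd_loss(x, x_hat, y, lam=0.01):
    """L = D + lambda * R with MSE distortion; x, x_hat, y are (N, C, H, W)."""
    D = F.mse_loss(x_hat, x)
    R = rate_bits(y).mean() / (x.shape[-2] * x.shape[-1])    # bits per pixel
    return D + lam * R
```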
Contrary to recent methods in learned compression, standard image or video codecs typically adopt orthogonal linear transforms to the frequency domain, where the data is decorrelated and easier to compress. While the transform coefficients are not necessarily data adaptive and can exhibit strong joint dependencies [36, 37], the parameters are exposed and can be finely tuned. While learned video compression has shown some promise for high-bitrate low-delay video compression [24, 2, 3, 13, 35], standard codecs like AVC and HEVC surpass all current methods in learned video compression in terms of standard metrics like SSIM and VMAF when the former are used with all their advanced prediction and entropy coding tools enabled [12]. In addition, more advanced encoder designs of the AOMedia AV1 [15] and MPEG/ITU-T Versatile Video Coding (VVC) standards [40] now include neural components for optimized encoding tool selection [15]. Such standards allow for decoders on CPU-based commodity devices like tablets and mobile phones, and there is no need for bespoke encoder or decoder components that require joint optimization, as in recent proposals [10, 1].

2.2. Metrics

Performance of compression methods is typically evaluated by plotting rate-distortion curves. Rate is measured in bits per pixel (bpp) or bits per second (bps) for video. In recent work, distortion is typically evaluated in terms of PSNR or SSIM.
However, while these metrics are viable options for measuring reconstruction error from the source, they do not capture perceptual quality of the content. Perceptual quality is instead captured by the divergence between the distribution of reconstructed images p(x̂) and original images p(x). Blau et al. [8] mathematically proved the existence of a perception-distortion bound, where distortion must be traded off with perceptual quality or vice versa. This work was extended further to incorporate rate, where it was derived that, in order to improve perceptual quality, either rate or distortion must be increased [9]. Indeed, for constant rate, distortion must be increased to increase perceptual quality, and this tradeoff is strengthened at low rates. Furthermore, perfect perceptual quality cannot be achieved by only optimizing a distortion measure. However, the tradeoff between perception and distortion for constant rate can be weakened for perceptually-oriented distortion measures that capture more semantic similarities.

Given the above, we consider other metrics for evaluating our method, beyond SSIM and PSNR. Notably, VMAF is a perceptually-oriented full-reference (FR) distortion metric, which has been developed and is commercially adopted by Netflix, Facebook, Intel, AOMedia standardization, and several others for codec evaluation [19, 21, 20, 32, 29, 15] and A/B experimentation [20]. VMAF has two primary components: visual information fidelity (VIF) and detail loss metric (DLM), and their respective scores are fused into a single prediction with support vector regression (SVR). Multiple independent studies have shown that VMAF is significantly more correlated to human opinion scores than SSIM and PSNR [31, 4, 47].

Recently, a more compression-oriented variant of VMAF, VMAF NEG [20], has been proposed by Netflix for isolating compression artifacts from general perceptual quality enhancement (e.g., contrast enhancement). Essentially, VMAF NEG is derived by clipping the local gain terms in VIF and DLM to 1.0, thus penalizing linear enhancement operations. In this paper, we present results in terms of VMAF, VMAF NEG and SSIM, to demonstrate how our method traverses the perception-distortion space.
3. Deep Perceptual Preprocessor

3.1. Overview of Proposed Method

In this section, we describe our deep perceptual preprocessing (DPP) framework for video preprocessing. Essentially, the objective of our preprocessing framework is to provide a perceptually optimized and rate-controlled representation of the decoded input frame via a learnable preprocessing approach. On one hand, the preprocessing must have some level of encoder-awareness such that it can adapt to visual distortions induced by changing standard codec settings such as the quantization parameter (QP) and constant rate factor (CRF). On the other hand, in order to maintain a single-pass preprocessing and to avoid training a preprocessing model for every single codec and configuration, the preprocessing must have a marginalized response over the codec parameter space. To this end, we propose to model or 'virtualize' the basic building blocks of a standard video coding pipeline, such that we can approximate the rate-distortion behavior over standard video codecs. The core codec components we model are inter/intra prediction, adaptive macroblock selection, spatial frequency transform and quantization. This virtual codec is appended to our preprocessing neural network and the resulting DPP framework is trained end-to-end with our proposed loss formulations. In this way, we perform perceptual and codec-oriented rate-distortion optimization over the preprocessing network parameters. Notably, in order to aid with marginalization, we also expose parameters such as QP, which can be adjusted during training. During inference/deployment, the virtual encoder is removed and replaced with a standard codec, such as an MPEG or AOMedia encoder.

The training and deployment frameworks are illustrated in Figure 2a. Each color outlines a different component in the training framework. For a given video sequence V = {x_1, ..., x_t, x_{t+1}, ..., x_N} with N frames, the green blocks represent the preprocessing network that maps input video frame x_t at time t to preprocessed frame p_t. The orange blocks represent the components for inter (motion estimation + compensation) and intra prediction, which output a predicted frame p̃_t and residual frame r_t by performing block matching between the current and reference frames. Importantly, in this paper we focus on an open loop codec implementation for inter prediction and exclude the red arrow in the figure. The grey blocks represent the spatial transform and quantization components for encoding and compressing the residual. The residual frame is transformed to the frequency domain output y_t and quantized to ŷ_t, with the quantization level controlled by the quantization parameter (QP). We model the rate of ŷ_t with an entropy model, as represented with the yellow block, as this is what a standard encoder would losslessly compact into the compressed bitstream. The blue blocks represent YUV to RGB conversion and the perceptual model that we use collectively to quantify perceptual quality, based on mean opinion scores (MOS). These components allow us to train the preprocessing network to enhance the perceptual quality of the reconstructed frame p̂_t.

3.2. Learnable Preprocessing

The input video frames are first processed individually by a preprocessing block, represented in green in Figures 2a and 2b. The preprocessing block F(x; Θ) comprises a pixel-to-pixel mapping F, with associated parameters Θ.
Figure 2: (a) Deep perceptual preprocessor framework for training a perceptually-enhanced and rate-controlled representation of input frames via a learnable preprocessing. Dashed arrows represent optional components. (b) Schematic showing the perceptual preprocessor training framework in open loop configuration with its loss functions.
For efficient deployment, the preprocessing processes only the luminance (Y) channel, since it contains all of the frame's structural information and is the main contributor to perceptual sharpness and bitrate, which constitute our main objectives for optimization. Specifically, for input frame x ∈ R^{H×W} scaled to the range [0, 1] and modelled representation p̂, the intention is to optimize parameters Θ in order to achieve a balance on p̂ between perceptual enhancement, rate control and fidelity to x. The mapping F is implemented as a convolutional neural network (CNN) with single-frame latency (assuming the supporting hardware can carry out the CNN inference fast enough). In order to reduce the network complexity while allowing for larger receptive field sizes and maintaining translational equivariance, we utilize dilated convolutions [46] with varying dilation rates per layer. The neural network weights constitute the parameters Θ that we intend to optimize for perceptual quality, rate and distortion in our training framework.
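The following is a minimal sketch of such a dilated-convolution preprocessor; the layer count, channel width, dilation schedule and residual-plus-clamp output are placeholders chosen for illustration, not the architecture used in this paper.

```python
import torch
import torch.nn as nn

class PreprocessorCNN(nn.Module):
    """Sketch of a pixel-to-pixel luminance preprocessor F(x; Θ) built from
    dilated convolutions with parametric ReLU activations. All sizes and
    dilation rates below are illustrative placeholders."""

    def __init__(self, channels=32, dilations=(1, 2, 4, 8, 4, 2, 1)):
        super().__init__()
        layers, in_ch = [], 1
        for d in dilations:
            layers += [nn.Conv2d(in_ch, channels, 3, padding=d, dilation=d),
                       nn.PReLU(channels)]
            in_ch = channels
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):                      # x: (N, 1, H, W) luminance in [0, 1]
        return (x + self.body(x)).clamp(0.0, 1.0)   # residual-style correction

preproc = PreprocessorCNN()
p_t = preproc(torch.rand(1, 1, 256, 256))      # same spatial size as the input
```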
3.3. Inter and Intra Prediction

The preprocessing network maps the current video frame x_t at time step t to p_t. The next step is to generate the residual frame r_t via intra or inter prediction. A standard video codec such as H.264/AVC adaptively divides the frame into variable-sized macroblock partitions and sub-partitions, typically varying from 16×16 to 4×4. Let us first assume a fixed block size. Under this assumption, the preprocessed frame p_t is first divided into a set of blocks of the fixed size K × K. For a block in the current frame centered on the pixel location (n_1, n_2) ∈ [(0, 0), (H − 1, W − 1)], a local search space centered on (n_1, n_2) and of size M × M is extracted from the reference frame. A similarity criterion is used to find the best matching block of size K × K to the current frame block within the local search space. For inter prediction, the local search space is extracted from the previous frame, p_{t−1}. The similarity criterion ε can thus be expressed at (n_1, n_2) as:

ε(m_1, m_2) ≜ Σ_{(k_1,k_2)} d(p_t(n_1 + k_1, n_2 + k_2), p_{t−1}(n_1 + k_1 + m_1, n_2 + k_2 + m_2))   (3)

where the coordinates (k_1, k_2) ∈ [(0, K − 1), (0, K − 1)] shift the pixel location within a K × K block and (m_1, m_2) ∈ [(−M/2, −M/2), (M/2, M/2)] represent the block displacement within the local search space of the reference frame. Here, d represents the similarity measure, which in this paper is set to mean absolute error (MAE), given its better handling of outliers than mean squared error (MSE). Importantly, the operation in (3) can be easily vectorized, which enables efficient end-to-end training on GPUs (at the cost of higher memory allocation). Then, for the given current frame block, the optimal block displacement m* = (m*_1, m*_2)^T in the reference frame is given as:

(m*_1, m*_2) = arg min_{(m_1,m_2)} ε(m_1, m_2)   (4)
The displacement or motion vector m* = (m*_1, m*_2)^T is encoded for each block in the current frame. However, the arg min in (4) has zero gradients almost everywhere with respect to the input and is therefore not differentiable. This poses a problem if we wish to optimize the DPP with end-to-end backpropagation from the reconstructed frame p̂_t back to the input frame x_t. In order to resolve this, we first express (4) in terms of a one-hot matrix, which we denote as 1_{arg min(m)}(ε), where the matrix is 1 at index (m*_1, m*_2) and 0 for all other (m_1, m_2). We approximate the arg min operation by using a straight-through estimator [5]. Our approach is analogous to Gumbel-softmax [17], except that we are not sampling over a discrete distribution but deterministically extracting the optimal block based on ε. The predicted frame p̃_t^inter is then configured as:

p̃_t^inter(n_1 + k_1, n_2 + k_2) = Σ_{(m_1,m_2)} 1_{(m*)}(m_1, m_2) · p_{t−1}(n_1 + k_1 + m_1, n_2 + k_2 + m_2)

and the residual frame r_t is simply the difference between the current frame and the predicted frame: r_t = p_t − p̃_t^inter.
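A compact sketch of this differentiable block matching is given below. It replaces the vectorized local-window search with whole-frame shifts (torch.roll, which wraps at borders) and uses a softmax-based straight-through selection with an assumed temperature tau, so it should be read as an approximation of Eqs. (3)-(4) rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def inter_predict(p_t, p_ref, K=8, M=8, tau=0.1):
    """For every K x K block of p_t, pick the displacement in [-M/2, M/2]^2 of
    p_ref minimising the per-block MAE, with soft gradients flowing through the
    hard selection (straight-through). Shapes: (N, 1, H, W), H and W divisible by K."""
    shifts = [(dy, dx) for dy in range(-M // 2, M // 2 + 1)
                       for dx in range(-M // 2, M // 2 + 1)]
    candidates, costs = [], []
    for dy, dx in shifts:
        cand = torch.roll(p_ref, shifts=(dy, dx), dims=(2, 3))
        candidates.append(cand)
        costs.append(F.avg_pool2d((p_t - cand).abs(), K))   # block MAE: (N,1,H/K,W/K)
    cost = torch.stack(costs, dim=1)                        # (N, S, 1, H/K, W/K)
    soft = torch.softmax(-cost / tau, dim=1)                # differentiable selection
    hard = F.one_hot(cost.argmin(dim=1), cost.shape[1]).movedim(-1, 1).float()
    sel = hard + soft - soft.detach()                       # straight-through estimator
    sel = F.interpolate(sel.squeeze(2), scale_factor=K, mode="nearest").unsqueeze(2)
    pred = (sel * torch.stack(candidates, dim=1)).sum(dim=1)
    return pred, p_t - pred                                 # predicted frame, residual
```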
For intra prediction, we follow a similar approach for generating p̃_t^intra, except that the reference frame from which we extract the local search space is the current frame p_t itself (but masking the block being queried and only searching in the causal neighborhood around the queried block). In this way, we are able to emulate all translational intra prediction modes. The residual is then transformed and quantized, and the reconstructed frame is formed as p̂_t = p̃_t + r̂_t, where r̂_t denotes the reconstructed residual.

3.5. Entropy Model

Given that we aim to optimize our preprocessing on rate, we must minimize the number of bits required to encode the DCT-transformed and quantized frame ŷ_t. This can be estimated by computing the entropy as in (1). However, as discussed, the prior density p(ŷ_t) must be estimated with a continuously differentiable approximation, such that we can compute the number of bits to encode the DCT sub-bands in a differentiable manner. To this end, we can model p(ŷ_t) as a factorized prior. The disadvantage of assuming a factorized prior on ŷ is that it does not account for the strong non-linear dependencies between subband coefficients [36, 25, 41]. Rather than extending the factorized prior with a hyperprior [3], which would require additional training and deviate further from standard codec operation, we propose a simple spatial divisive normalization, which has been shown to decorrelate DCT domain coefficients per sub-band [26]. We denote the divisively normalized coefficients as z_{n,s,t}, where index n runs per subband over all spatial coordinates. In this way, we can assume a factorized prior p(z) on z instead of ŷ:

p(z_t; Φ) = Π_{n,s} p(z_{n,s,t}; Φ^(s))   (5)
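The sketch below illustrates this idea on pre-computed DCT sub-bands: a simple 3×3 spatial divisive normalization per sub-band, followed by a factorized per-sub-band density with a learned scale Φ^(s). The normalization window and the Gaussian density are assumptions; the paper specifies only a spatial divisive normalization and a factorized prior.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class SubbandRate(nn.Module):
    """Factorized-prior rate estimate over divisively normalized DCT sub-bands.
    Input y: (N, S, h, w) quantized sub-band coefficients (S = 16 for a 4x4 DCT)."""

    def __init__(self, subbands=16, eps=1e-3):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(subbands))  # per-sub-band scale Φ^(s)
        self.eps = eps

    def forward(self, y):
        # local energy over a 3x3 spatial neighbourhood, per sub-band
        energy = nn.functional.avg_pool2d(y.abs(), 3, stride=1, padding=1)
        z = y / (self.eps + energy)                            # divisive normalization
        prior = Normal(0.0, self.log_scale.exp().view(1, -1, 1, 1))
        bits = -prior.log_prob(z) / torch.log(torch.tensor(2.0))
        return bits.flatten(1).sum(dim=1), z                   # rate per item, and z
```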
3.6. Perceptual Model

The perceptual model P terminates in a single fully connected layer with 5 neurons. A softmax function maps the output to a distribution over human ratings, or ACR distribution, ranging from poor (1) to excellent (5). To give the output layer access to multi-scale and multi-semantic representations of the input, we also global-average-pool intermediate layer activations and concatenate the pooled activations over layers. The model is thus trained to minimize the total variation distance between predicted and reference human rating distributions. We note that, given that our perceptual model is trained on human-rated RGB images, it is necessary in our perceptual preprocessing framework to first convert the luminance frame p̂_t to the RGB frame p̂_t^RGB. We perform a transform from YUV to RGB space by first concatenating p̂_t with the lossless U and V components of the RGB input, x_t^RGB.

3.7. Loss Functions

Our overall objective is to train our preprocessing F(x_t; Θ) to perform perceptually-oriented rate-distortion optimization on the decoded frame representations p̂_t relative to the input video frames x_t. Assuming the domain shift between our virtual codec and a standard video codec is marginal, this should equate to optimizing the rate and distortion of the decoded frames during deployment with a standard video codec. To this end, we train the CNN of the preprocessor end-to-end with the building blocks of our DPP framework and a perceptual loss (L_P), rate loss (L_R) and fidelity loss (L_F) (as illustrated in Figure 2b). The overall loss function for training the preprocessing can thus be written as a weighted summation: L(x_t, p̂_t; Θ) = γL_P + λL_R + L_F, where γ and λ are the perceptual and rate coefficients respectively. It is worth noting that, contrary to neural encoders, where changing λ maps to a new rate-distortion point, λ in this case shifts the entire rate-distortion curve mapped over multiple QPs/CRFs; this behavior is illustrated in the ablation study on λ in the supplementary. Given that we marginalize over QP, λ gives the flexibility to explore the entire rate-distortion space.

Fidelity Loss, L_F: In order to ensure a likeness between the input luminance frame x_t and the perceptually enhanced and rate-constrained decoded frame representation p̂_t, we train the preprocessing with a combination of fidelity losses. As discussed by Zhao et al. [48], the L1 distance is good for preserving luminance, whereas multiscale structural similarity (MS-SSIM) [42] is better at preserving contrast in high-frequency regions. Our fidelity loss can thus be written as the summation:

L_F(x_t, p̂_t; Θ) = E_{x_t,p̂_t}[α L_L1(x_t, p̂_t; Θ) + β(1 − L_MS-SSIM(x_t, p̂_t; Θ))]   (6)

where L_L1(x_t, p̂_t) = |x_t − p̂_t|, L_MS-SSIM represents the MS-SSIM function (as defined by Wang et al. [42]), and α and β are hyperparameters which control the weighting on structural versus luminance preservation.

Rate Loss, L_R: The virtual codec rate loss L_{R_s} per DCT sub-band s is defined on the divisively normalized transform coefficients z_t:

L_{R_s}(z_t; Θ, Φ) = −E_{z_t}[Σ_n log₂ p(z_{n,s,t}; Φ^(s))]   (7)

where n runs over all spatial coordinates of each sub-band. The final rate loss is simply the summation over all sub-bands: L_R = Σ_{s=1}^{S} L_{R_s}, where S = 16 for a 4 × 4 DCT. The rate loss represents an approximation (upper bound) to the actual rate required to encode the preprocessed frames.

Perceptual Loss, L_P: We quantify perceptual quality with our perceptual model P, which is pre-trained and frozen during the DPP training. Essentially, we aim to maximize the mean opinion scores (MOS) of our decoded RGB frame representations p̂_t^RGB, independent of the reference frame x_t^RGB, but derived on the natural scene statistics (NSS) learned from training the perceptual model on a corpus of natural images. To this end, we minimize:

L_P(p̂_t; Θ) = −E_{p̂_t}[Σ_{i=1}^{5} i · P(p̂_t^RGB)_i]   (8)

where the inner summation represents the predicted MOS score, as the mean over the predicted ACR distributions.
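Putting the three terms together, a minimal sketch of the composite objective is shown below. It assumes the third-party pytorch_msssim package for MS-SSIM, a perceptual model that returns an (N, 5) ACR distribution, and a λ value inside the range reported in Section 4.1; none of these choices should be read as the exact implementation.

```python
import torch
from pytorch_msssim import ms_ssim   # third-party package, assumed available

def dpp_loss(x, p_hat, p_hat_rgb, rate_bits, perceptual_model,
             alpha=0.2, beta=0.8, gamma=0.01, lam=0.005):
    """Composite objective L = γ·L_P + λ·L_R + L_F (Section 3.7).
    `perceptual_model` maps an RGB batch to a (N, 5) softmax over ACR ratings;
    `rate_bits` is a per-item rate estimate such as Eq. (7)."""
    # Fidelity (Eq. 6): L1 plus (1 - MS-SSIM) on the luminance, with α, β = 0.2, 0.8
    l_f = alpha * (x - p_hat).abs().mean() + \
          beta * (1.0 - ms_ssim(p_hat, x, data_range=1.0))
    # Perceptual (Eq. 8): negative expected MOS under the frozen model P
    ratings = torch.arange(1, 6, device=x.device, dtype=x.dtype)
    mos = (perceptual_model(p_hat_rgb) * ratings).sum(dim=1)
    l_p = -mos.mean()
    # Rate (Eq. 7): mean bits from the virtual codec's entropy model
    l_r = rate_bits.mean()
    return gamma * l_p + lam * l_r + l_f
```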
4. Experimental Results

4.1. Implementation Details

The perceptual model P is first trained on the KonIQ-10k no-reference IQA dataset [23] using stochastic gradient descent with momentum set to 0.9 and an initial learning rate of 1 × 10⁻³. The perceptual model is then frozen and the deep preprocessing framework is trained on the Vimeo-90k dataset [44] in an end-to-end manner, under the open loop configuration illustrated in Figure 2b and the loss function defined in Section 3.7. We train a deep convolutional neural network with curriculum learning [6, 43] and using multi-scale crop sizes. The curriculum is generated via a scoring and pacing function, which maps the content type and difficulty. Each convolutional layer is followed by a parametric ReLU activation function. During training we alternate between our inter and intra prediction blocks; we follow a standard encoding pipeline and default to inter prediction only, switching to intra prediction for 1 mini-batch every 100 training iterations (i.e., in correspondence to 1 I-frame every 100 P or B frames). The local search space size M is fixed at 24. The network is trained with the Adam optimizer and the learning rate is decayed when metrics saturate on the validation dataset. Finally, we follow Zhao et al. [48] and fix hyperparameters α and β to 0.2 and 0.8 respectively. For the core hyperparameters that control the rate-perception-distortion tradeoff, λ and γ, we fix γ to 0.01 and vary λ ∈ [0.001, 0.01]. We present an ablation of these parameters in the supplementary material.
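A skeletal training loop reflecting this schedule is sketched below; the virtual_codec callable (which would wrap the prediction, transform, entropy and loss components of Section 3) is an assumed interface, not code from the paper.

```python
import torch

def train_dpp(preproc, virtual_codec, loader, epochs=1, lr=1e-4):
    """Adam training with a switch to the intra-prediction path for one
    mini-batch every 100 iterations (roughly one I-frame per 100 P/B frames)."""
    opt = torch.optim.Adam(preproc.parameters(), lr=lr)
    step = 0
    for _ in range(epochs):
        for x_prev, x_curr in loader:            # consecutive luminance frames
            mode = "intra" if step % 100 == 99 else "inter"
            loss = virtual_codec(preproc, x_prev, x_curr, mode=mode)
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
```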
At deployment, we only retain the part of the preprocessing that comprises the learned pixel-to-pixel mapping; the virtual codec is replaced with a standard video codec, with the decoded frame perceptually enhanced and at the same or lower bitrate than achievable without any preprocessing. Importantly, we achieved real-time performance for full-HD video (1080p@50fps) on a single NVIDIA Tesla T4 GPU by porting our trained models to OpenCV CUDA primitives and fp16 arithmetic. For CPU execution, by porting our models to OpenVINO and quantizing them to int8, we achieved real time for 1080p@60fps on 12 cores of an Intel Cascade Lake CPU with no detriment in visual quality.
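As a deployment illustration, the sketch below applies a trained preprocessor to the luminance plane of raw YUV frames and pipes the result into an unmodified libx264 encode via FFmpeg; the CRF value and preset here are placeholders rather than the encoding recipes of Section 4.2.

```python
import subprocess
import torch

def encode_with_dpp(frames, model, out_path, w=1920, h=1080, fps=50, crf=23):
    """Preprocess the Y plane of each (Y, U, V) uint8 frame, then pipe raw
    YUV 4:2:0 into a standard x264 encode. `frames` is any iterable of planes."""
    cmd = ["ffmpeg", "-y", "-f", "rawvideo", "-pix_fmt", "yuv420p",
           "-s", f"{w}x{h}", "-r", str(fps), "-i", "-",
           "-c:v", "libx264", "-preset", "veryslow", "-crf", str(crf), out_path]
    enc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    with torch.no_grad():
        for y, u, v in frames:
            t = torch.from_numpy(y).float().div_(255).view(1, 1, h, w)
            y_pp = (model(t).clamp(0, 1) * 255).round().byte().numpy().reshape(h, w)
            enc.stdin.write(y_pp.tobytes() + u.tobytes() + v.tobytes())
    enc.stdin.close()
    enc.wait()
```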
4.2. Experimental Setup for BD-Rate Results

We present a detailed evaluation of different models using standard 1080p XIPH and CDVL sequences¹. Our anchor encoders comprise AVC/H.264, AV1 and VVC, utilizing the libx264, aomenc and vvenc open implementations of these standards. We deliberately focus on a very highly optimized encoding setup that is known to outperform all neural or run-of-the-mill proprietary video encoders by a large margin [18, 12, 40, 10]. Our aim is to examine if DPP can push the envelope of what is achievable today under some of the most advanced encoding conditions used in practice. Our x264/AVC encoding recipe is: veryslow preset, tune SSIM and multiple CRF values per resolution. Our aomenc AV1 recipe is: two-pass encoding, CPU=5, 'tune SSIM' or 'tune VMAF' preprocessing options, and multiple target bitrates per resolution². Our vvenc recipe used the slow preset and multiple QPs per resolution. All encodings were produced using a GOP size of 150 frames (128 for VVC) and for multiple resolutions, ranging from the 1080p original resolution all the way to 144p by using FFmpeg Lanczos downscaling. All lower resolutions are upscaled with FFmpeg bicubic to 1080p prior to quality measurements [22]. All Bjontegaard delta-rates (BD-rates) [7] are produced by first finding the subset of monotonically increasing bitrate-quality points that are in the convex hull of the quality-bitrate curve, and then using the Netflix libvmaf repository [22] to measure SSIM, VMAF NEG, VMAF and BD-rates. The convex hull is computed over all resolutions, CRFs/bitrates and multiple rate coefficients λ, such that, per metric, we obtain a single RD curve for both the codec and our proposed DPP+codec. Full details of this convex hull optimization, along with the utilized encoding recipes, can be found in the supplementary.

¹ XIPH source material: https://media.xiph.org/video/derf/ and CDVL material: https://www.cdvl.org/. See supplementary results for more details on exact sequences used.

² We note that preprocessing techniques such as 'tune VMAF' and 'tune SSIM' operate in-loop, i.e., within a specific encoder. As such, our method can offer gains on top of them.
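For reference, a generic version of this BD-rate computation is sketched below. It uses a simple monotone filter in place of the full convex-hull selection and the classic cubic-fit Bjontegaard integration [7], rather than the libvmaf tooling used for the reported results; each curve needs at least four points.

```python
import numpy as np

def monotone_points(rates, quality):
    """Keep the subset of (rate, quality) points with monotonically increasing
    quality when sorted by rate (simplified stand-in for convex-hull selection)."""
    order = np.argsort(rates)
    r, q, best = [], [], -np.inf
    for i in order:
        if quality[i] > best:
            r.append(rates[i]); q.append(quality[i]); best = quality[i]
    return np.array(r), np.array(q)

def bd_rate(r_ref, q_ref, r_test, q_test):
    """Bjontegaard delta-rate: cubic fit of log-rate vs. quality, integrated over
    the overlapping quality range; a negative value means a bitrate saving."""
    p_ref = np.polyfit(q_ref, np.log(r_ref), 3)
    p_test = np.polyfit(q_test, np.log(r_test), 3)
    lo, hi = max(q_ref.min(), q_test.min()), min(q_ref.max(), q_test.max())
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0      # percent rate difference
```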
Figure 3: (a) MS-SSIM and (b) PSNR versus bitrate for the proposed DPP+H264 and DPP+H265 versus DVC [24] on the first 100 frames of HEVC Class B sequences. Points are plotted up to 0.12 bits per pixel (bpp).

4.3. Comparison Against Neural Encoders

Before moving to our main results, we present a short comparison against neural encoders, selecting the recently-proposed DVC framework [24] as a representative candidate of the state of the art. Such neural encoders have been shown to outperform AVC and HEVC when the latter are using: no B slices, 'veryfast' preset, low-latency mode (which disables most advanced temporal prediction tools), and very small GOP sizes of 10 or 12 frames. However, they are not able to approach the performance of these hybrid encoders, or indeed that of our framework, under the state-of-the-art experimental setup of Section 4.2. This is evident in the example results of Fig. 3, where DVC is very substantially outperformed in terms of bitrate vs. PSNR and MS-SSIM (the metrics used in their work) by both DPP+AVC/H.264 and DPP+HEVC/H.265 under our encoding recipe.

4.4. BD-Rate Results with H.264/AVC and AV1

The results of Fig. 4, Table 1 and Table 2 show that the average rate saving over VMAF, VMAF NEG and SSIM for both the H.264 and AV1 standards is just above 11%. As expected, our gains are higher on metrics that are increasingly perception-oriented rather than distortion-oriented: on VMAF, our framework offers 18% to 25% saving; on VMAF NEG, the savings are between 7% and 11%; and on SSIM they are 1% to 3%. This makes the average BD-rate of all three metrics a reliable estimate of the bitrate saving that can be offered in practice, since this average is influenced by performance in both the distortion (SSIM) and perception-oriented dimensions (VMAF and VMAF NEG).
Figure 4: Rate-distortion curves for 16 XIPH sequences (top row) and 24 CDVL sequences (bottom row) on VMAF, VMAF NEG and SSIM respectively. Curves are plotted for the codec and for our proposed DPP+codec. The corresponding BD-rates for our method are reported in Tables 1 and 2, respectively, for each dataset.

Table 1: BD-rates on 16 XIPH sequences for DPP+H264, DPP+AV1 (with perceptual settings tune ssim and tune vmaf) and DPP+VVC. More negative = more saving.

Table 2: BD-rates on 24 CDVL sequences for DPP+H264, DPP+AV1 (with perceptual settings tune ssim and tune vmaf) and DPP+VVC. More negative = more saving.
4.5. BD-Rate Results with VVC

We report BD-rate savings for VVC in Table 1 and Table 2. The average saving over all three metrics is 8.7%. The fact that our framework offers consistent savings over vvenc further illustrates the validity of DPP across encoders, encoding recipes, and convex-hull rate-distortion optimized encoding [18], which is summarized in Fig. 1 (right).

5. Conclusion

We introduced a deep perceptual preprocessor that virtualizes the core components of a standard video coding pipeline in order to optimize the preprocessing for rate, distortion and perceptual quality in an end-to-end differentiable manner. At inference, only the preprocessor is deployed to carry out a single pass through each frame prior to any standard encoder. Our framework delivers consistent gains for three quality metrics with different perception-distortion characteristics and for three very different encoders used at their performance limits. It is also easily deployable, as it attains real-time performance on commodity hardware without requiring any changes in encoding, streaming or video decoding at the client side.
References

[1] Mariana Afonso, Fan Zhang, and David R. Bull. Video compression based on spatio-temporal resolution adaptation. IEEE Transactions on Circuits and Systems for Video Technology, 29(1):275–280, 2018.
[2] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. arXiv preprint arXiv:1611.01704, 2016.
[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
[4] Nabajeet Barman, Steven Schmidt, Saman Zadtootaghaj, Maria G. Martini, and Sebastian Möller. An evaluation of video quality assessment metrics for passive gaming video streaming. In Proceedings of the 23rd Packet Video Workshop, pages 7–12, 2018.
[5] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[6] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48, 2009.
[7] Gisle Bjontegaard. Improvements of the BD-PSNR model. In ITU-T SG16/Q6, 35th VCEG Meeting, Berlin, Germany, July 2008, 2008.
[8] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6228–6237, 2018.
[9] Yochai Blau and Tomer Michaeli. Rethinking lossy compression: The rate-distortion-perception tradeoff. arXiv preprint arXiv:1901.07821, 2019.
[10] Eirina Bourtsoulatze, Aaron Chadha, Ilya Fadeev, Vasileios Giotsas, and Yiannis Andreopoulos. Deep video precoding. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[11] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Variable rate deep image compression with a conditional autoencoder. In Proceedings of the IEEE International Conference on Computer Vision, pages 3146–3154, 2019.
[12] Jan De Cock, Aditya Mavlankar, Anush Moorthy, and Anne Aaron. A large-scale video codec comparison of x264, x265 and libvpx for practical VOD applications. In Applications of Digital Image Processing XXXIX, volume 9971, page 997116. International Society for Optics and Photonics, 2016.
[13] Abdelaziz Djelouah, Joaquim Campos, Simone Schaub-Meyer, and Christopher Schroers. Neural inter-frame compression for video coding. In Proceedings of the IEEE International Conference on Computer Vision, pages 6421–6429, 2019.
[14] Fraunhofer HHI. VVenC software repository, 2020. https://github.com/fraunhoferhhi/vvenc.
[15] Adrian Grange, Andrey Norkin, Cheng Chen, Ching-Han Chiang, Debargha Mukherjee, Hui Su, James Bankoski, Jean-Marc Valin, Jingning Han, Luc Trudeau, et al. An overview of core coding tools in the AV1 video codec. 2018.
[16] Amirhossein Habibian, Ties van Rozendaal, Jakub M. Tomczak, and Taco S. Cohen. Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 7033–7042, 2019.
[17] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
[18] Ioannis Katsavounidis and Liwei Guo. Video codec comparison using the dynamic optimizer framework. In Applications of Digital Image Processing XLI, volume 10752, page 107520Q. International Society for Optics and Photonics, 2018.
[19] Faouzi Kossentini, Hassen Guermazi, Nader Mahdi, Chekib Nouira, Amir Naghdinezhad, Hassene Tmar, Omar Khlif, Phoenix Worth, and Foued Ben Amara. The SVT-AV1 encoder: overview, features and speed-quality tradeoffs. In Applications of Digital Image Processing XLIII, volume 11510, page 1151021. International Society for Optics and Photonics, 2020.
[20] Zhi Li. Video @Scale 2020: VMAF. At Scale Conference, 2020. https://atscaleconference.com/videos/video-scale-2020-vmaf/.
[21] Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy, and Megha Manohara. Toward a practical perceptual video quality metric. The Netflix Tech Blog, 6, 2016.
[22] Zhi Li, Christos Bampis, Julie Novak, Anne Aaron, Kyle Swanson, Anush Moorthy, and J. Cock. VMAF: The journey continues. Netflix Technology Blog, 2018.
[23] Hanhe Lin, Vlad Hosu, and Dietmar Saupe. KonIQ-10k: Towards an ecologically valid and large-scale IQA database. arXiv preprint arXiv:1803.08489, 2018.
[24] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. DVC: An end-to-end deep video compression framework. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11006–11015, 2019.
[25] Siwei Lyu and Eero P. Simoncelli. Nonlinear image representation using divisive normalization. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[26] Jesús Malo, Irene Epifanio, Rafael Navarro, and Eero P. Simoncelli. Nonlinear image representation for efficient perceptual coding. IEEE Transactions on Image Processing, 15(1):68–80, 2005.
[27] Henrique S. Malvar, Antti Hallapuro, Marta Karczewicz, and Louis Kerofsky. Low-complexity transform and quantization in H.264/AVC. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):598–603, 2003.
[28] Detlev Marpe, Heiko Schwarz, and Thomas Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620–636, 2003.
[29] Marta Orduna, César Díaz, Lara Muñoz, Pablo Pérez, Ignacio Benito, and Narciso García. Video multimethod assessment fusion (VMAF) on 360VR contents. IEEE Transactions on Consumer Electronics, 66(1):22–31, 2019.
[30] Antonio Ortega and Kannan Ramchandran. Rate-distortion methods for image and video compression. IEEE Signal Processing Magazine, 15(6):23–50, 1998.
[31] Reza Rassool. VMAF reproducibility: Validating a perceptual practical video quality metric. In 2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), pages 1–2. IEEE, 2017.
[32] Shankar L. Regunathan, Haixiong Wang, Yun Zhang, Yu Ryan Liu, David Wolstencroft, Srinath Reddy, Cosmin Stejerean, Sonal Gandhi, Minchuan Chen, Pankaj Sethi, et al. Efficient measurement of quality at scale in Facebook video ecosystem. In Applications of Digital Image Processing XLIII, volume 11510, page 115100J. International Society for Optics and Photonics, 2020.
[33] Iain E. Richardson. The H.264 Advanced Video Compression Standard. John Wiley & Sons, 2011.
[34] Oren Rippel and Lubomir Bourdev. Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2922–2930. JMLR.org, 2017.
[35] Oren Rippel, Sanjay Nair, Carissa Lew, Steve Branson, Alexander G. Anderson, and Lubomir Bourdev. Learned video compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 3454–3463, 2019.
[36] Daniel L. Ruderman. The statistics of natural images. Network: Computation in Neural Systems, 5(4):517–548, 1994.
[37] Eero P. Simoncelli and Bruno A. Olshausen. Natural image statistics and neural representation. Annual Review of Neuroscience, 24(1):1193–1216, 2001.
[38] Hossein Talebi and Peyman Milanfar. NIMA: Neural image assessment. IEEE Transactions on Image Processing, 27(8):3998–4011, 2018.
[39] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.
[40] Pankaj Topiwala, Madhu Krishnan, and Wei Dai. Performance comparison of VVC, AV1, and HEVC on 8-bit and 10-bit content. In Applications of Digital Image Processing XLI, volume 10752, page 107520V. International Society for Optics and Photonics, 2018.
[41] Martin J. Wainwright and Eero Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. Advances in Neural Information Processing Systems, 12:855–861, 1999.
[42] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
[43] Xiaoxia Wu, Ethan Dyer, and Behnam Neyshabur. When do curricula work? arXiv preprint arXiv:2012.03107, 2020.
[44] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
[45] Ren Yang, Fabian Mentzer, Luc Van Gool, and Radu Timofte. Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6628–6637, 2020.
[46] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[47] Fan Zhang, Angeliki V. Katsenou, Mariana Afonso, Goce Dimitrov, and David R. Bull. Comparing VVC, HEVC and AV1 using objective and subjective assessments. arXiv preprint arXiv:2003.10282, 2020.
[48] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2016.