Efficient Video Encoder Autotuning Via Offline Bayesian Optimization and Supervised Learning Paper
Abstract—Modern video encoders are complex software containing dozens of parameters, which allows them to be configured to different scenarios, requirements, or specific titles or scenes. Besides the number of parameters, the inter-dependency between them adds to the complexity of finding a per-title optimized combination of encoding parameters. Even though good practices in the industry have emerged, with the definition of presets per content type (e.g., film vs. cartoon), such practices are suboptimal for specific titles or scenes. Indeed, finding the best encoding parameters for a piece of content is currently a mix of best practices and trial-and-error artwork. We propose an efficient video encoder autotuner based on offline Bayesian optimization and supervised machine learning. Our proposal uses Bayesian optimization to search for a per-title best encoding parameter set offline to generate a dataset. Then, we use the generated dataset to train machine learning models that can map features extracted from the content to the best encoding parameters. Our experiments show that our generated dataset can find a combination of parameters that improves up to approximately −14.49% BD-Rate (0.77 BD-PSNR) and −11.59% BD-Rate (2.12 BD-VMAF) when optimizing for PSNR and VMAF, respectively. In comparison, our prediction models can recover ∼80% of such performance while requiring only one fast encoding (compared to hundreds of encodes of a search optimization).

Index Terms—Video encoder, Encoding parameters, Bayesian Optimization, Deep Learning

I. INTRODUCTION

Video is one of the most important media for communication and entertainment in today’s digital world, dominating global internet traffic. Video compression standards [1]–[3] provide the key technologies that support the successful deployment of digital video. Modern video encoders have many parameters that can be tuned to specific scenarios (e.g., on-demand vs live video) or in a per-content/scene manner (e.g., cartoon vs live action content). Examples of encoding parameters include the number of reference frames, rate-distortion optimization mode, adaptive quantization mode and strength, number of B-frames, motion estimation range, deblocking filter strength, etc. However, finding the best encoding parameters for a specific content/scene is non-trivial.

A naive approach is to brute-force all the combinations of values for different parameters, encode the content with such combinations, and then choose the one with the best quality (according to a specific quality metric). The huge number of available parameters and the exponential number of their combinations, however, make such an approach impractical. An improved method is using optimization methods (e.g., genetic algorithms [4] or Bayesian optimization [5]) to guide the search for the best encoding parameters per title. Sharma et al. [6] is an example of such a work that uses genetic algorithms to find the best encoding parameters for H.265/HEVC (High-Efficiency Video Coding) [7]. However, even though such approaches can provide an approximation of the optimum encoding parameter values per content/scene, they still require hundreds of encodings for each title during inference.

Brute-force-based approaches have also been proposed for bitrate-ladder construction for HTTP-based dynamic adaptive streaming (HAS) [8]–[10]. More recently, data-driven methods have been explored for such a problem [11]–[14]. In HAS, the video content is split into short segments. Each segment is then encoded at different resolutions and quality levels, which constitutes a bitrate ladder. During streaming, based on network conditions, display resolution, etc., the client can dynamically decide which representation to download for each segment. One key issue of bitrate ladder optimization is to predict for each target bitrate which resolution provides the best quality. Such a problem can be seen as a specific case of video encoder parameter autotuning, in which resolution is the only parameter being selected.

In this paper, we propose a data-driven method that enables a significantly faster decision process than previous video encoder autotuning methods. Our proposal is based on i) generating an offline dataset through optimization methods (e.g., Bayesian optimization or genetic algorithms) and then ii) using such data to learn a model that predicts the best encoding parameters. The main goal is for the model to learn the best encoding parameters found on such a dataset and, by imitating such behavior, extrapolate to predict the best encoding parameters on new samples. Our proposal is also non-invasive since it does not need any change to the encoder. Compared to previous approaches, our proposed method allows for a better exploration of the search space (due to the per-title dynamic exploration of the space of parameters) and faster inference
time (due to the learned prediction models).

We use H.264/AVC (Advanced Video Coding) [1], [15] as our target codec and evaluate our method on the PSNR (Peak Signal-to-Noise Ratio) and VMAF (Video Multimethod Assessment Fusion) [16] quality metrics. However, our overall proposal is encoder-agnostic, and it can be easily applied to different encoders, encoder parameter sets, and target quality metrics. Experimental results show that our dataset generation method supports (without any changes to the encoder itself) an improvement of -14.49% BD-Rate (0.77 BD-PSNR) and -11.59% BD-Rate (2.12 BD-VMAF) when optimizing for PSNR and VMAF, respectively. Our prediction models can recover ∼80% of that performance with just one faster encoding process (compared to the hundreds of encodings required by optimization-based approaches).

II. PROBLEM FORMULATION

We consider the encoder as a function $E$ that takes the frames of a video $V = \{F_1, F_2, \ldots, F_n\}$, the specific encoding parameters $p = \{p_1, p_2, \ldots, p_p\}$, and the target bitrate $b$ as input. The output of the encoder is a set of encoded frames $V' = \{F'_1, F'_2, \ldots, F'_n\}$, i.e.,

$$V' = E(V, p, b). \tag{1}$$

Given the set of encoding parameters, the encoder tries its best to encode $V$ into $V'$ while keeping the final bitrate as close as possible to $b$. The final achieved quality and bitrate depend on the encoder heuristics themselves and how the user controls such heuristics based on $p$.

Given an objective quality metric $M(V, V')$, finding the best set of encoding parameters can then be defined as

$$\max_{p_1, p_2, \ldots \in P_1, P_2, \ldots} M(V, V'), \tag{2}$$

where $V'$ is given by Eq. (1). Common examples of $M$ are PSNR, SSIM [17], and VMAF [16]. The goal of the above problem is to find the parameter set $p \in P$ that maximizes the output quality produced by the encoder.

As aforementioned, a straightforward solution for the problem above is using optimization algorithms, e.g., genetic algorithms [6], simulated annealing [18], and Bayesian optimization [5]. Such approaches require hundreds of function evaluations to converge to the maximum solution. However, since the evaluation of the encoder function (i.e., encoding the content and computing the objective quality) is an expensive process, running such an optimization search approach per title/scene during inference time is prohibitive.

III. PROPOSED METHOD

We first use a Bayesian optimization-based approach to generate an offline dataset (Subsection III-A). The dataset is then used to train machine learning models (Subsection III-B) in a supervised way to approximate the best encoding parameters solution found in the offline dataset. Fig. 1 overviews the proposed method.

Fig. 1. Overview of the proposed approach.

A. Dataset generation

For each video in the source dataset, we perform a Bayesian optimization-based approach to guide the search for the “best” encoding parameters for that video. Also, for each video, we extract features that can characterize its content. These features can later be used to predict the ground-truth encoding parameters found by the optimization search.

1) Bayesian Optimization: Bayesian optimization is an approach to optimizing objective functions that take a long time to evaluate. It uses the accumulated knowledge in the known area of the search space to guide sampling in the remaining area in an iterative process. For that, it builds a surrogate for the objective and quantifies the uncertainty in that surrogate using Gaussian process regression, and then uses an acquisition function to decide where to sample. Bayesian optimization has been used in many optimization tasks when the function to be optimized needs to be treated as a black box, e.g., hyperparameter search for deep learning [19], [20] or parameter tuning of compilers [21].

Bayesian optimization is a good fit for our problem because it does not assume first or second-order derivatives, and thus can work with the encoder as a black-box function. However, our overall proposal is not dependent on Bayesian optimization and could work just as well with other optimization search algorithms, e.g., genetic algorithms. One drawback of genetic algorithms, however, is that they require many more encode runs to converge when compared to Bayesian optimization.

For our implementation of Bayesian optimization, we assume that there is a default preset $p_{\mathrm{def}}$ that we want to improve upon and only consider samples that have a better quality than $p_{\mathrm{def}}$. For each target bitrate $b$, each video sample goes through the above Bayesian optimization approach, being encoded a maximum of $N$ times. Algorithm 1 details such a process.

Algorithm 1 Pseudo-code for the proposed Bayesian optimization-based dataset generation. Here $f(x) = M(V, E(V, x, b))$ denotes the objective of Eq. (2) for video $V$ at target bitrate $b$.

for all $V$ in the source content dataset $D$ do
    Place a Gaussian process prior on $f$
    $n \leftarrow 0$
    $p_{\mathrm{best}}(V) \leftarrow p_{\mathrm{def}}$
    Observe $f$ at the $p_{\mathrm{def}}$ point
    while $n \le N$ do
        Update the posterior probability distribution on $f$ using the data gathered so far
        Let $x_n$ be a maximizer of the acquisition function over $x$, where the acquisition function is computed using the current posterior distribution
        Observe $y_n = f(x_n)$
        $n \leftarrow n + 1$
    end while
    Save the point evaluated with the maximum $f(x)$ as the best encoding parameters for $V$, $p_{\mathrm{best}}(V)$
end for
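As an illustration of Algorithm 1, the following minimal Python sketch runs the loop for a single title on a toy objective. It is not the implementation used in this paper. The helper names (`rbf_kernel`, `gp_posterior`, `optimize_title`, `toy_quality`), the upper-confidence-bound acquisition, the random candidate sampling, and the parameters normalized to [0, 1] are all assumptions made for illustration, and `toy_quality` merely stands in for encoding a video and measuring PSNR or VMAF.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a GP surrogate at x_new."""
    y_mean = y_obs.mean()
    y_scale = y_obs.std() if y_obs.std() > 0 else 1.0
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)                      # shape (n_obs, n_new)
    alpha = np.linalg.solve(k_oo, (y_obs - y_mean) / y_scale)
    v = np.linalg.solve(k_oo, k_on)
    mean = y_mean + y_scale * (k_on.T @ alpha)
    var = 1.0 - np.sum(k_on * v, axis=0)
    return mean, y_scale * np.sqrt(np.clip(var, 0.0, None))


def optimize_title(f, p_def, n_max=30, n_candidates=512, kappa=2.0, seed=0):
    """Search for parameters that improve on the default preset p_def.

    f is the black-box objective (hypothetically: encode, then measure quality).
    Returns (best parameters, best quality, default quality).
    """
    rng = np.random.default_rng(seed)
    x_obs = np.array([p_def], dtype=float)
    y_obs = np.array([f(x_obs[0])])                      # observe f at p_def first
    for _ in range(n_max):                               # n_max further evaluations
        candidates = rng.random((n_candidates, x_obs.shape[1]))
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        x_next = candidates[np.argmax(mean + kappa * std)]  # UCB acquisition (assumed)
        x_obs = np.vstack([x_obs, x_next])
        y_obs = np.append(y_obs, f(x_next))
    best = int(np.argmax(y_obs))
    return x_obs[best], y_obs[best], y_obs[0]


def toy_quality(p):
    """Hypothetical stand-in for 'encode with p, then measure quality'."""
    return 40.0 - 10.0 * np.sum((np.asarray(p) - np.array([0.7, 0.3])) ** 2)


p_best, q_best, q_def = optimize_title(toy_quality, [0.5, 0.5])
print(p_best, q_best, q_def)
```

Because the default preset is always among the observations, the returned parameters never score below it, which mirrors the rule of keeping only samples that improve on $p_{\mathrm{def}}$. In this sketch `n_max` counts the evaluations made after the default preset has been observed.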
2) Feature extraction: Aiming at predicting the best encoding parameters, we extract features from the video samples to characterize them and to be used as input by machine learning algorithms. Specifically, we extract the following features:

a) Spatial Information (SI) and Temporal Information (TI):¹ SI is computed as the Root Mean Square (RMS) difference between the Sobel maps of each of the frames [22],

$$\mathrm{SI}(v, u) = \sqrt{\frac{1}{w \times h} \sum_{i,j} |s_{ij}|^2}, \tag{3}$$

where $w$ and $h$ are the width and height of the $u$ and $v$ frames and

$$s = S(v) - S(u), \tag{4}$$

$$S(z) = \sqrt{(G_1 * z)^2 + (G_1^T * z)^2}, \tag{5}$$

where $*$ denotes the 2-dimensional convolution operation, and $G_1$ is the vertical Sobel filter. TI is based on the motion between adjacent frames, $M_t(i, j)$, defined as the difference of the pixel luminance at the same location, at time $t$, i.e.,

$$M_t(i, j) = F_t(i, j) - F_{t-1}(i, j), \tag{6}$$

where $F_t(i, j)$ is the pixel at position $(i, j)$ of the $t$-th frame. TI is computed as the maximum over time of the standard deviation over space of $M_t(i, j)$ over all $i$ and $j$.

b) Energy-based Video Complexity Features: We compute the per-frame average spatial energy (E) and average temporal energy (h), following the definition of [23]².

c) First pass features from x264:³ Together with the above features, we also run a fast encode of x264, which allows us to extract the following features⁴: Q: average of macroblock QPs before adaptive quantization; AQ: average of macroblock QPs after adaptive quantization, as decided by the rate control; MV: number of bits used by the motion vectors; Tex: number of bits used by the texture component; and Misc: number of bits spent on other signaling, e.g., slice headers and skip flags.

SI, TI, Energy-based Video Complexity, and 1st-pass features are computed per frame, and then statistics on those features (mean, standard deviation, minimum, and maximum) are computed for each video sample.

¹ https://github.com/Telecommunication-Telemedia-Assessment/SITI
² https://github.com/cd-athena/VCA
³ https://www.videolan.org/developers/x264.html
⁴ A more detailed description of those features can be found in the x264 documentation.

B. Prediction Models

Given a dataset that maps features to the best-found encoding parameters, machine learning methods can then be trained in a supervised way to predict such values. The goal is to learn a model such that $M_\theta(V) \approx p_{\mathrm{best}}(V)$ for any $V$, where $\theta$ are the model’s parameters. Two main approaches are possible: classification or regression. With a small number of parameter combinations, it is straightforward to train a classification model. However, this limits the approach to a pre-defined number of presets. Since our best-found encoding parameters are not pre-defined, i.e., the values are dynamically chosen based on the optimization search approach, we opt for using a regression approach. In our experiments, we focused mainly on XGBoost and Multi-Layer Perceptron (MLP) models, but our general proposal is not restricted to them. We also experimented with SVM (Support Vector Machines) and Random
Forest models, but omitted these results here since XGBoost and MLP consistently performed better in our experiments.

IV. EXPERIMENTS

A. Datasets generation and analysis

We experiment with the freely available dataset Inter4K⁵, which is composed of one thousand 4K videos of 5 seconds duration each. We downsampled all the sample videos to 1920x1080 resolution and used that as our source dataset for the following experiments. This source dataset is named Inter4K-HD in the rest of this document. We generated three versions of this initial dataset, which we named Inter4K-HD/PSNR, Inter4K-HD/VMAF, and Inter4K-HD/VMAF-MultiRes. Inter4K-HD/PSNR and Inter4K-HD/VMAF use, respectively, PSNR and VMAF as the target metric for the Bayesian optimization discussed in Section III, whereas Inter4K-HD/VMAF-MultiRes is similar to Inter4K-HD/VMAF but also allows configuring the resolution of the output video as an additional encoding parameter.

For all three dataset variants above, we focus on H.264/AVC as our target codec, using the very-slow preset from x264 as our default preset $p_{\mathrm{def}}$. Table I details the range of encoding parameters and the default values used in the Bayesian optimization. The “resolution” parameter is only used for the Inter4K-HD/VMAF-MultiRes dataset.

of our offline datasets, following Algorithm 1. For illustrative purposes, Fig. 2 shows examples of performing our Bayesian optimization approach for sample titles in the dataset, comparing the performance of the default preset to the best encoding parameters found during the optimization.

Table II and Fig. 3(a) show the statistics of the Inter4K-HD/PSNR dataset, in which we can find a parameter set that provides up to +0.91 dB PSNR on average in the low bitrate regime compared to the default preset. Table III and Fig. 3(b) show the statistics of the Inter4K-HD/VMAF dataset, in which we find a parameter set supporting up to +4.70 VMAF points on average in the lower bitrate regime when compared to the default preset. Finally, Table IV and Fig. 3(c) show the statistics of the Inter4K-HD/VMAF-MultiRes dataset.

TABLE II
INTER4K-HD/PSNR DATASET STATISTICS. VALUES ARE REPORTED IN THE FORMAT: “AVG. PSNR (STANDARD DEVIATION)”.

Inter4K-HD/PSNR
Bitrate    Default         Best            Avg. ∆PSNR
1 Mbps     33.56 (5.76)    34.47 (5.88)    +0.91 (2.56)
2 Mbps     37.05 (5.82)    37.81 (5.91)    +0.76 (0.48)
3 Mbps     39.02 (5.81)    39.70 (5.89)    +0.68 (0.44)
4 Mbps     40.07 (5.25)    40.70 (5.36)    +0.64 (0.44)
5 Mbps     41.45 (5.75)    42.06 (5.86)    +0.61 (0.45)
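The “average (standard deviation)” entries of Table II can be derived from per-video scores as sketched below. This is a minimal Python sketch on made-up scores, not the evaluation code of this paper. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, pstdev


def summarize(default_scores, best_scores):
    """Return (avg, std) for default scores, best scores, and per-video gains."""
    gains = [best - default for default, best in zip(default_scores, best_scores)]
    return {
        "default": (mean(default_scores), pstdev(default_scores)),
        "best": (mean(best_scores), pstdev(best_scores)),
        "delta": (mean(gains), pstdev(gains)),
    }


def format_cell(avg, std, signed=False):
    """Format one table cell as 'avg (std)', with an explicit sign for gains."""
    sign = "+" if signed else ""
    return f"{avg:{sign}.2f} ({std:.2f})"


# Made-up per-video PSNR values (dB) at one target bitrate, for illustration only.
default_psnr = [30.0, 35.0, 40.0]
best_psnr = [31.0, 35.5, 41.5]

stats = summarize(default_psnr, best_psnr)
print(format_cell(*stats["default"]))
print(format_cell(*stats["best"]))
print(format_cell(*stats["delta"], signed=True))
```

The average gain always equals the difference between the two column averages (for the 1 Mbps row, 34.47 − 33.56 = 0.91), whereas its standard deviation is computed over the per-video gains and therefore cannot be derived from the other two columns.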
Fig. 3. Inter4K-HD/H.264 dataset distribution when optimizing for (a) PSNR, (b) VMAF, and (c) VMAF/MultiRes.

Fig. 4. Histogram of best chosen resolution for Inter4K-HD/VMAF-MultiRes.

Finally, Fig. 5 depicts the average rate-distortion curve of the default preset and the optimal per-content parameters found for each of the three dataset variants, while the first row of Table V shows the improvement in terms of BD-Rate and BD-Metric.

B. Prediction model results

We independently trained prediction models on the different dataset variants: Inter4K-HD/PSNR, Inter4K-HD/VMAF, and Inter4K-HD/VMAF-MultiRes. The data was split into 80% for training and 20% for validation (the held-out test set used below), and the same split was used for all the dataset variants in the results presented below. When splitting the dataset into train/test, we make sure that for a given content, all the target bitrate data will appear either only in the training set or only in the test set. Thus, we guarantee that our tests are only performed on video content that was not seen during training. All the features extracted from the source content and the encoding parameters are min/max normalized. For the XGBoost model, we use the default Python xgboost library⁶ with max_depth = 0, while the MLP is composed of 5 layers, each with 512 neurons. For the MLP training, we use an adaptive learning rate starting at $1 \times 10^{-3}$, which is divided by 5 every time 2 consecutive epochs fail to decrease the loss function, until the tolerance of $1 \times 10^{-7}$.

⁶ https://github.com/dmlc/xgboost

Table V shows the BD-Rate and BD-PSNR/VMAF for evaluating the different trained models (XGBoost and MLP) on the three dataset variants we generated. It also reports the BD-Rate and BD-PSNR/VMAF computed on the whole dataset and the same metrics computed only on the test set. The upper bound for the metrics reported for the prediction models is expected to be the “Best (test set)” row, while “Best (full dataset)” is kept just for reference. From the results, it is clear that the prediction models are able to recover most of the performance of the search optimization, while requiring just one fast encoding and feature extraction step.

Fig. 5. Inter4K-HD rate-distortion curves when optimizing for PSNR (a), VMAF (b), and VMAF/MultiRes (c).

TABLE V
BD-RATE/METRIC COMPARING THE BEST IN THE DATASET AND PREDICTED PARAMETERS. DEFAULT PRESET (X264, VERY SLOW) IS USED AS ANCHOR TO COMPUTE BD-RATE/METRIC.

V. CONCLUSION

We introduce a video encoder autotuning framework that takes advantage of Bayesian optimization to search the space of encoder parameters and build an offline dataset, which is then used to train supervised machine learning methods. Our method supports an automated and efficient search of encoding parameters while offering better performance than previous
fixed parameter search methods. Moreover, after training, our method provides an efficient solution, which only requires one fast encoding of the content plus a pass of feature extraction. Specifically, we demonstrate that using x264, we are able to find a parameter set that achieves up to 14.49% and 11.59% BD-Rate reduction compared to the very-slow encoding parameter preset when optimizing for PSNR and VMAF, respectively, and can recover ∼80% of such performance with a much more efficient prediction model. Our proposed framework also opens up new avenues for future work. Although we focus only on simple hand-designed features and more traditional machine learning algorithms in our experiments, more advanced features (e.g., deep learning-based ones) and models (e.g., Transformers) can be easily integrated into our framework. Evaluating our method with other codecs (e.g., HEVC and AV1) is another interesting direction for future work.

REFERENCES

[1] T. Wiegand, G. J. Sullivan et al., “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
[2] B. Bross, Y.-K. Wang et al., “Overview of the Versatile Video Coding (VVC) standard and its applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3736–3764, 2021.
[3] J. Han, B. Li et al., “A technical overview of AV1,” Proceedings of the IEEE, vol. 109, no. 9, pp. 1435–1462, 2021.
[4] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[5] P. I. Frazier, “A tutorial on Bayesian optimization,” arXiv preprint arXiv:1807.02811, 2018.
[6] R. R. Sharma and K. V. Arya, “Parameter optimization for HEVC/H.265 encoder using multi-objective optimization technique,” in 2016 11th International Conference on Industrial and Information Systems (ICIIS). Roorkee, India: IEEE, Dec. 2016, pp. 592–597.
[7] G. J. Sullivan, J.-R. Ohm et al., “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012.
[8] A. Aaron, Z. Li et al., “Per-title encode optimization,” Tech. Rep., 2015. [Online]. Available: https://netflixtechblog.com/per-title-encode-optimization-7e99442b62a2
[9] K. S. Durbha, H. Tmar et al., “Bitrate ladder construction using visual information fidelity,” arXiv preprint arXiv:2312.07780, 2023.
[10] A. Telili, W. Hamidouche et al., “Bitrate ladder prediction methods for adaptive video streaming: A review and benchmark,” arXiv preprint arXiv:2310.15163, 2023.
[11] A. V. Katsenou, J. Sole, and D. R. Bull, “Efficient bitrate ladder construction for content-optimized adaptive video streaming,” IEEE Open Journal of Signal Processing, vol. 2, pp. 496–511, 2021.
[12] F. Nasiri, W. Hamidouche et al., “Multi-preset video encoder bitrate ladder prediction,” ser. ViSNext ’22. New York, NY, USA: ACM, 2022, pp. 8–13.
[13] A. Telili, W. Hamidouche et al., “Efficient per-shot transformer-based bitrate ladder prediction for adaptive video streaming,” in 2023 IEEE ICIP. IEEE, 2023, pp. 1835–1839.
[14] J. Yang, M. Guo et al., “Optimal transcoding resolution prediction for efficient per-title bitrate ladder estimation,” arXiv preprint arXiv:2401.04405, 2024.
[15] “Recommendation ITU-T H.264: Advanced video coding for generic audiovisual services,” 2021.
[16] “VMAF: Video multimethod assessment fusion.” [Online]. Available: https://github.com/Netflix/vmaf
[17] Z. Wang, A. C. Bovik et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[18] D. Bertsimas and J. Tsitsiklis, “Simulated annealing,” Statistical Science, vol. 8, no. 1, pp. 10–15, 1993.
[19] J. Wu, X.-Y. Chen et al., “Hyperparameter optimization for machine learning models based on Bayesian optimization,” Journal of Electronic Science and Technology, vol. 17, no. 1, pp. 26–40, 2019.
[20] A. H. Victoria and G. Maragatham, “Automatic tuning of hyperparameters using Bayesian optimization,” Evolving Systems, vol. 12, no. 1, pp. 217–223, 2021.
[21] A. H. Ashouri, G. Mariani et al., “COBAYN: Compiler autotuning framework using Bayesian networks,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 13, no. 2, pp. 1–25, 2016.
[22] “Recommendation P.910: Subjective video quality assessment methods for multimedia applications,” 2023.
[23] V. V. Menon, C. Feldmann et al., “VCA: video complexity analyzer,” in Proceedings of the 13th ACM Multimedia Systems Conference (MMSys ’22). New York, NY, USA: ACM, 2022, pp. 259–264.