A Survey On Video Coding Optimizations Using Machine Learning
ISSN No:-2456-2165
Abstract:- Video is presently the most common type of data used globally. The volume of video data has been rising explosively around the globe as a result of the rapid development of video applications and the rising demand for higher-quality video services, posing the biggest challenge to multimedia processing, transmission, and storage. Video coding by compression has become somewhat saturated, although the compression ratio has grown over the last three decades. Deep learning algorithms offer new possibilities for improving video coding technologies since they can make data-driven predictions and learn from vast amounts of unstructured data. In this survey, we explore machine learning-based video encoding optimization, which lays a solid groundwork for further advancements in video coding. The designer of a video service must choose a suitable video coding scheme to satisfy criteria such as efficiency, complexity, rate distortion, and flexibility. This article also presents challenges associated with machine learning-based video coding optimization. The survey is organized around two key aspects: first, low-complexity optimization with the help of advanced learning tools, such as feed-forward CNN, deep RL, and deep NN; and second, learning-based visual quality assessment (VQA).

Keywords:- Video Coding, Deep Learning, Machine Learning, High-Efficiency Video Coding Standard (HEVC), Versatile Video Coding (VVC), Visual Quality Assessment (VQA).

I. INTRODUCTION

Numerous video applications, including TV broadcasting, movies, video-on-demand, video conferences, mobile video, video surveillance, remote control, robotics, 3D videos, and free viewpoint TV, have emerged with the development of multimedia computing, communication, and display technologies. These video applications are used extensively in numerous aspects of daily life, including industry, communication, national security, the military, education, medicine, and entertainment. The majority of data transmitted over the internet today is video data, and its volume is increasing dramatically every year. YouTube is extensively used to share information through video.

Digital video has many advantages over traditional analog video, which it has largely replaced. Digital video is also compatible with text and audio data. To store and transmit visual information properly, an effective video coding system is required. Digital videos consume large amounts of data, and if they are not compressed properly, it is highly difficult to store and transmit them. Although data storage capacity, network bandwidth, and computing power have increased tremendously, the demand for better-quality video has never stopped.

YouTube is a video-sharing website that enables users to watch videos online. YouTube received 112.9B visits in October 2023, with an average session duration of 35:09, and its traffic increased by 19.00% within two months [1].

One of the key technologies in video applications is video coding, which makes it possible to compress and organize video data more efficiently for computing, transmission, and storage. For improving video compression efficiency, machine learning is among the most advanced research topics being explored. Machine learning allows data to be analyzed to find hidden patterns and drive decisions. Owing to this exceptional ability to learn from data, machine learning algorithms have been incorporated into the coding process in a number of recent studies, which significantly enhanced video coding results.

The idea of perceptual redundancy rests on the fact that not all video distortions are perceptible to the Human Visual System (HVS), which is ultimately the receiver of most videos. The eyes and the brain are the two functional components of the HVS. Numerous visual characteristics and redundancies have been identified through HVS research based on physiological (eye) and psychological (brain) studies [2]. The notion of Just Noticeable Difference (JND) arises when pixel values in an image vary on a very fine scale; in most such cases, the distortion is unnoticeable. The eyes are responsible for these physiological perceptual redundancies. Additionally, perceptual sensitivity differs depending on the video's content, the viewer's awareness, and the viewer's region of interest (ROI), which corresponds to psychological processing in the brain. The goal of video coding is to preserve visual quality while exploiting signal and perceptual redundancy as much as possible.
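
The notion of distortion that stays below a visibility threshold can be illustrated with a toy sketch. The function name `visible_distortion_ratio`, the constant threshold of 3 gray levels, and the sample pixel values are all assumptions made for illustration; they are not taken from the surveyed works, and practical JND models adapt the threshold to local luminance, texture, and motion.

```python
# Toy illustration of perceptual redundancy with a fixed JND threshold.
# ASSUMPTION: one constant threshold for every pixel (hypothetical value);
# real JND models vary the threshold with luminance, texture, and motion.

def visible_distortion_ratio(original, reconstructed, jnd_threshold=3):
    """Return the fraction of pixels whose absolute error exceeds the threshold."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    visible = sum(
        1 for o, r in zip(original, reconstructed) if abs(o - r) > jnd_threshold
    )
    return visible / len(original)


# Hypothetical 8-bit luma samples before and after lossy coding.
original = [120, 121, 119, 200, 50, 52]
reconstructed = [121, 120, 122, 190, 50, 60]

ratio = visible_distortion_ratio(original, reconstructed)
print(f"{ratio:.2f} of pixels carry distortion above the assumed JND threshold")
```

Under this simplification, errors at or below the threshold count as perceptually redundant, so an encoder could in principle spend fewer bits on them without a visible loss in quality.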
Yu-Huan et al. proposed a multi-phase nearest mean classification based on RD cost clustering for fast mode decisions. It is an unsupervised clustering machine learning model. This method achieved a 68% reduction in time at the expense of a slight increase in bit rate [11].

Jui-Chiu Chiang et al. proposed a fast stereo video encoding algorithm based on hierarchical two-stage neural classification, including fast prediction source determination and fast block partition selection [12].

Paula Carrillo et al. suggested a machine learning-based approach with a three-level topology for inter-mode decision. The first level improves speed through an early SKIP decision. The second level separates inter 8×8 and its sub-modes from inter 16×16 and its sub-modes. If the other leaf is selected, a third classification between inter 16×16 sub-modes and intra 4×4 is evaluated [13].

H.265/HEVC has a large number of decision modes, which makes the task more challenging. Its decision computation is more complex than that of H.264/AVC: H.265/HEVC includes recursive quad-tree CU mode decision as well as multi-class CU and TU mode decisions. In the following sections, machine learning-based HEVC INTRA and INTER coding optimization are discussed. In recent years, deep neural networks (NN) have been widely used in visual signal processing. Researchers are putting their efforts into exploring end-to-end deep learning-based decision schemes.

Zhenyu Liu et al. used a Convolutional Neural Network (CNN) to analyse the texture of images and reduce the number of CU modes. The CU modes that remain available then undergo an exhaustive Rate-Distortion Optimization process. In this encoding, the CNN determines the texture of the CU and then identifies the optimal CU/PU configuration. They incorporated quantization parameters into the CNN architecture. This method could save 63% of intra-coding time at the cost of a 2.66% BDBR increase [14].

Thorsten Laude et al. developed a deep learning-based intra-prediction mode decision process for H.265/HEVC. The sample values of the block to be coded are fed through a deep convolutional neural network. The choice of intra-prediction mode is expressed as a classification problem, without RD optimization of all feasible modes [15].

Tianyi Li et al. proposed a complexity reduction approach for INTRA mode. This model trains a deep Convolutional Neural Network to predict the CTU partition instead of RDO. They established a large-scale database with diverse patterns of CTU partition. Then, they created a

Because of the brute-force searches for rate-distortion optimization, the quad-tree partition of the Coding Unit (CU) is responsible for much of the encoding complexity. Mai Xu et al. collected a large-scale database that includes CU partition data for the intra- and inter-modes of HEVC, which is used for deep learning of the CU partition. They used a hierarchical CU partition map (HCPM) to represent the CU partition of a whole coding tree unit. Next, they proposed an early terminated hierarchical CNN (ETH-CNN) that learns to predict the HCPM. Consequently, by using ETH-CNN to determine the CU partition instead of a brute-force search, the encoding complexity of intra-mode HEVC can be significantly decreased. Third, an ETH-LSTM is proposed to learn the temporal correlation of the CU partition. A combination of ETH-LSTM and ETH-CNN is used to predict the CU partition, which reduces HEVC complexity in inter-mode. Experimental results showed that their method outperformed other state-of-the-art approaches in terms of complexity [17].

To reduce the complexity of H.264-to-HEVC transcoding, Jingyao Xu et al. suggested a deep learning-based approach to replace brute-force searching for rate-distortion optimization. They built a large-scale transcoding database and then determined the correlation between the HEVC CTU partition and H.264 features. This relation helps to find temporal and spatial-temporal similarities of the CTU partition. Next, they proposed a hierarchical long short-term memory (H-LSTM) network architecture to predict the CTU partition of HEVC. The performance of H-LSTM is compared with other methods [18].

Discussion on learning-based low-complexity coding optimization:

Low-complexity optimization becomes more important when the coding complexity grows exponentially. In the meantime, the complexity of each mode decision problem in VVC increases. To solve complicated decision problems, advanced learning tools, such as feed-forward CNN, deep RL, and deep NN, are better options.

IV. LEARNING-BASED VISUAL QUALITY ASSESSMENT (VQA)

The aim of video coding is to minimize distortion (D) or maximize quality (Q). The quality Q is measured by PSNR, based on the pixel-by-pixel difference between the original and reconstructed pictures, and the distortion D is measured by MSE. However, there is no guarantee that PSNR and MSE reflect the quality actually perceived by the HVS. A number of Visual Quality Assessment (VQA) metrics have been developed, such as SSIM, FSIM, and Multi-Scale SSIM. Creating a useful visual quality metric that is in line with human perception is difficult. Through the extraction of visual elements from data and the development