

A Distortion-Aware Multi-Task Learning Framework for Fractional Interpolation in Video Coding

Liangwei Yu, Liquan Shen, Member, IEEE, Hao Yang, Member, IEEE, Xuhao Jiang, and Bo Yan, Senior Member, IEEE

Abstract—Motion-compensated prediction adopts fractional-pixel interpolation to obtain the best motion vector. Traditional fixed interpolation filters cannot handle various content and structures well, and existing convolutional neural network based methods cannot fully exploit the distortion characteristics for fractional interpolation. Therefore, this paper proposes a distortion-aware multi-task learning framework (DA-MLF) to perform fractional interpolation. First, a multi-task training framework is proposed to provide the distortion characteristics as complementary information for improving the performance of subsequent interpolation. Then, a uniform interpolation sub-network is proposed to accomplish fractional interpolation, which utilizes the feature fusion module to fuse abundant local features, and the distortion awareness module to capture the multi-scale information of compression artifacts. Furthermore, DA-MLF is integrated into the High Efficiency Video Coding (HEVC) test model, and multiple experiments are performed to evaluate the effectiveness of our method. On HEVC testing sequences, DA-MLF achieves 5.0%, 4.0% and 1.7% BD-rate reduction on average compared to the HEVC baseline, under low-delay P, low-delay B and random-access configurations, respectively. The experimental results validate that our framework not only achieves the best interpolation performance but also has the lowest computational complexity compared with state-of-the-art methods.

Index Terms—Fractional interpolation, convolutional neural network (CNN), video coding, multi-task learning.

Manuscript received September 1, 2019; revised April 12, 2020 and July 5, 2020; accepted September 17, 2020. Date of publication October 2, 2020; date of current version July 2, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 61671282 and Grant 61931022, in part by the Open Fund of Key Laboratory of Advanced Display and System Applications of Ministry of Education (Shanghai University), in part by the Shanghai Science and Technology Innovation Plan under Grant 18010500200, and in part by the Shanghai Shuguang Program under Grant 17SG37. This article was recommended by Associate Editor J. Xu. (Corresponding author: Liquan Shen.)
Liangwei Yu, Hao Yang, and Xuhao Jiang are with the School of Communication and Information Engineering, Shanghai Institute for Advanced Communication and Data Science, Shanghai University, Shanghai 200072, China (e-mail: [email protected]; [email protected]; [email protected]).
Liquan Shen is with the Key Laboratory of Advanced Display and System Application, Shanghai University, Shanghai 200072, China (e-mail: [email protected]).
Bo Yan is with the School of Computer Science, Fudan University, Shanghai 200433, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSVT.2020.3028330

I. INTRODUCTION

STANDARDIZED hybrid video coding systems [1], [2], including Advanced Video Coding (AVC) and High Efficiency Video Coding (HEVC), adopt block-based motion-compensated prediction (MCP) [3], [4] to reduce the temporal redundancy in video signals. To perform block-based MCP, a picture to be coded is first divided into multiple blocks. For each block, the encoder searches the reference frames to find the best matching block. The relative position between the current coding block and its matching block is represented by a motion vector (MV). During inter coding, only the residual signal between the matching block and the coding block, together with the MV information, is encoded and delivered to the decoder to save bitrate.

The displacement between coding blocks is integral, but the motions of objects in practice are not limited to integer pixels. Thus, fractional interpolation is introduced into recent video coding standards to estimate MVs with fractional-pixel precision and further reduce prediction errors.

Over the past decades, different interpolation methods have been proposed. Linear interpolation and bicubic interpolation are widely used in image processing; they simply weight the neighboring pixels to generate fractional pixels. To generate more accurate fractional-pel samples, an average filter [5] and a discrete cosine transform based interpolation filter (DCTIF) [6] are adopted in H.264 and HEVC, respectively. However, fixed interpolation filters cannot handle various contents well. Therefore, several adaptive interpolation filters [7]–[9] have been proposed to generate fractional pixels according to the statistical characteristics of different content, at the cost of additionally transmitting the filter coefficients to the decoder side.

The great success of deep learning in computer vision tasks has enlightened novel convolutional neural network (CNN) based methods in multimedia compression [10], such as complexity reduction [11], compression artifacts reduction [12], [13], transcoding [14] and intra prediction [15]. Inspired by these, several CNN-based methods [16]–[19] have also been proposed to generate more accurate fractional pixels. Generally, existing methods can be classified into two categories, i.e., interpolation-based generation and prediction-based generation. Interpolation-based methods [16]–[18] focus on learning an end-to-end mapping between the reference frames
and the interpolated fractional pixels. Since the fractional pixels are interpolated from the reference frames, the prediction residual cannot be eliminated. In contrast, prediction-based methods [19] predict the current to-be-coded picture based on both the reference frames and the motion vector. Different from the interpolation-based methods, the prediction-based methods can further eliminate the temporal redundancy to reduce the prediction residual.

These existing methods usually adopt conventional cascaded networks, which have difficulty in extracting and fusing the output of each convolutional layer for further training [20]. To avoid over-smoothing, some of these methods adopt CTU-level determination to choose the optimal interpolation approach, which increases the computational complexity. In addition, these existing methods only focus on training a single interpolation network and cannot fully exploit the correlation between the quality enhancement and fractional interpolation tasks, which limits their interpolation performance. As shown in Fig. 1, the compressed reference frame (Fig. 1(b)) suffers from severe compression distortions when compared to the raw frame (Fig. 1(a)). Although the network used to perform the interpolation in Fig. 1(c) is trained on compressed frames, the distortions are still propagated to fractional pixels through interpolation. Therefore, a more effective method is needed to further remove the distortions during interpolation. In this paper, we jointly train the reference frame enhancement and the fractional interpolation tasks to capture the correlation between these two tasks and further improve the interpolation performance. The interpolation sample of our method is shown in Fig. 1(d). As can be seen, our method can restore frames with higher quality and sharper edges.

Fig. 1. (a) The raw frame. (b) The compressed frame with compression artifacts, which is also the reference frame for fractional interpolation. (c) The interpolated frame generated by a network trained without the auxiliary task of quality enhancement. (d) The interpolated frame generated by our DA-MLF, which is trained with the auxiliary task of quality enhancement.

In this paper, we propose a distortion-aware multi-task learning framework (DA-MLF), where the features of compression distortions are imposed as supplementary information and jointly captured by our auxiliary task for subsequent interpolation. In addition, an advanced uniform interpolation sub-network is proposed with our designed feature fusion and distortion awareness modules. Since CNN-based interpolation filters may suffer from over-smoothing, a three-step training strategy is further proposed to simulate the compression distortions during recurrent interpolation and alleviate the over-smoothing effect. The experimental results demonstrate the superiority of our framework.

The main contributions of this paper are as follows:
1) A multi-task learning framework is proposed to exploit the distortion characteristics to improve the performance of fractional interpolation in video coding.
2) A uniform interpolation sub-network is proposed to accomplish fractional interpolation. The interpolation sub-network adopts the feature fusion and distortion-aware modules to fuse the features from multiple layers and capture the multi-scale information of distortion, which achieves a promising result.
3) A three-step training strategy is proposed to reduce over-smoothing during interpolation. The three-step training strategy helps to introduce the distortion of recurrent enhancement into training samples, and guides the interpolation sub-network to learn to eliminate the over-smoothing effect.

The remainder of this paper is organized as follows. Section II provides a brief review of related works. In Section III, we present the details of DA-MLF. In Section IV, the training configurations of our framework are discussed. Then, testing details and experimental results are given in Section V to demonstrate the advancement of our framework. Section VI concludes this paper.

II. RELATED WORK

A. Traditional Fractional Interpolation Methods

During the development of video coding standards, different interpolation filters have been proposed. H.264/AVC adopts a fixed 6-tap FIR filter and an average filter [5] to interpolate half-pixels and quarter-pixels separately. HEVC adopts a discrete cosine transform based interpolation filter (DCTIF) [6] to interpolate more accurate fractional pixels, whose coefficients are designed using a Fourier decomposition of the discrete cosine transform. These fixed interpolation filters fail to handle various structures and contents well. Thus, several advanced filters have been proposed to perform fractional interpolation with higher performance. With additional analysis, adaptive interpolation filters [7] achieve optimal interpolation performance to improve coding efficiency. Although achieving promising performance, adaptive interpolation filters lead to a significant increase in computational expense. Therefore, Wittmann et al. [9] proposed a separable filter, which is computationally less expensive in performing adaptive interpolation. For HD video coding, Dong et al. [8] proposed a parametric interpolation filter, which adopts a function determined by five parameters to represent interpolation filters. However, most of these approaches rely on the statistical properties of videos, which cannot provide enough capability when handling various contents. Thus, the performance of these methods can be further improved.

B. CNN-Based Interpolation Filters

Inspired by the success of CNNs in computer vision tasks, multiple CNN-based interpolation filters have been
proposed to improve the coding efficiency of HEVC. Reusing the architecture of the Super Resolution CNN (SRCNN) [21], Yan et al. [16] proposed a CNN-based Interpolation Filter (CNNIF) to replace the half-pixels generated by HEVC, which achieved a promising result compared with the HEVC baseline. However, multiple CNNIF networks must be trained to obtain half-pel pixels of different positions separately. To reduce the training burden, Zhang et al. [17] proposed an enhancement network to improve the quality of interpolated images generated by DCTIF. The network is based on the Very Deep Super Resolution network (VDSR) [22] and is only designed to generate half-pel pixels. By sharing feature extraction layers, Liu et al. [18] proposed a Grouped Variation CNN (GVCNN) to further reduce the computational complexity. Inspired by the invertibility of fractional-pixel interpolation, Yan et al. [23], [24] proposed CNN-based end-to-end schemes for fractional interpolation in video coding. In contrast to previous works, Yan et al. [19] proposed a Fractional-pixel Reference generation CNN (FRCNN) to predict fractional pixels instead of interpolating them. In addition, an RD-cost selection algorithm was introduced into FRCNN to further improve the performance, which is time consuming.

Although these existing CNN-based interpolation filters achieve remarkable performance, there are two issues that need to be addressed. First, to avoid distortion propagation, the characteristics of compression distortions should be jointly considered while performing fractional interpolation. Although these existing CNN-based interpolation filters are trained on compressed images, the distortions cannot be comprehensively removed, as shown in Fig. 1(c). Therefore, an additional training strategy should be proposed to exploit the distortion characteristics and enhance the quality of the interpolated images. Second, previous works mainly adopt conventional cascaded structures to form their networks, which cannot fuse multiple local features and capture the multi-scale similarity of the distortions. Thus, there is still room for further performance improvement.

III. THE PROPOSED DA-MLF

Fig. 2. The architecture of DA-MLF.

As shown in Fig. 2, DA-MLF contains four components: feature fusion, distortion awareness, multiple reconstruction and residual short-cut. The input frame is first fed into the feature fusion module, which is formed by dense structures [25]. During the feature fusion procedure, pixel-wise information and distortion characteristics are extracted and fused into multi-level features. Previous works [16]–[18] directly feed these features into the reconstruction layer, and thus the quality of the interpolated frames is not guaranteed. In our method, a multi-task learning framework is proposed to effectively exploit the distortion characteristics for the subsequent reconstruction. These valuable features flow into multiple reconstruction modules to reconstruct different outputs for different tasks. The quality enhancement structure is designed to remove the compression distortions and reconstruct an enhanced image. With the pre-trained feature fusion and distortion awareness modules, the fractional interpolation sub-network can focus on learning the interpolation function. Additionally, the residual short-cut module is adopted to provide global contextual information. Similar to the distortion awareness module, the variable-filter-size technique [12] is adopted in the residual short-cut to capture the distortion characteristics. The structure details are given in Table I.
A. Network Structure

1) Feature Fusion Module: To achieve a promising interpolation performance, a fundamental module should be proposed to provide abundant local features. Previous methods mainly adopt cascaded or nearly cascaded structures to extract the features.
TABLE I
PARAMETER DETAILS OF DA-MLF

The structure of a cascaded network with $d$ layers can be formulated as

$F_{out} = h_d(F_{d-1}) = h_d(h_{d-1}(F_{d-2})) = h_d(h_{d-1}(h_{d-2}(\cdots h_0(F_{in}) \cdots)))$,   (1)

where $F_{out}$ represents the output feature maps, $F_d$ represents the $d$-th extracted feature maps, $h_d$ represents the composite function of convolution and activation, and $F_{in}$ represents the input feature maps. As can be seen, the connections between the extracted feature maps are mainly cascaded, which cannot effectively exploit the complex partition structures of modern video codecs [12].

Inspired by the residual dense network [20], residual dense blocks (RDB) are adopted in our feature fusion module to extract hierarchical features, exploiting the multi-scale similarity of compression artifacts. The residual dense structure is represented as follows:

$F_1 = h_1(F_{in})$
$F_2 = h_2(\mathrm{concat}(F_{in}, F_1))$
$\cdots$
$F_d = h_d(\mathrm{concat}(F_{in}, F_1, F_2, \cdots, F_{d-1}))$
$F_{out} = F_{in} + F_d$,   (2)

where $\mathrm{concat}(\cdot)$ denotes the concatenation operation. In the RDB, convolutional layers obtain additional inputs from all preceding layers and pass on their own feature maps to all subsequent layers, which is more suitable for capturing the multi-scale characteristics in video coding.
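To make Eq. (2) concrete, here is a minimal PyTorch sketch of one residual dense block. The growth rate of 32 and the 1 × 1 local-fusion convolution follow the usual RDB design [20] and are assumptions on our part; the depth of seven anticipates the depth study below.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Residual dense block, Eq. (2): each 3x3 conv sees the block input
    concatenated with all preceding feature maps; a 1x1 conv fuses them
    back to `channels` before the residual addition F_out = F_in + F_d."""
    def __init__(self, channels=64, growth=32, depth=7):
        super().__init__()
        self.layers = nn.ModuleList()
        for d in range(depth):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels + d * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True)))
        # Local feature fusion: concat(F_in, F_1, ..., F_depth) -> channels.
        self.fuse = nn.Conv2d(channels + depth * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))     # residual addition

y = ResidualDenseBlock()(torch.randn(1, 64, 40, 40))
print(y.shape)  # torch.Size([1, 64, 40, 40])
```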
Besides, an additional experiment is performed to find the best depth of the RDB. As can be seen in Fig. 3, once the number of convolutional layers reaches seven, the interpolation performance shows only a limited further increase. Therefore, seven convolutional layers are adopted in the feature fusion module as a reasonable tradeoff between performance and efficiency.

Fig. 3. The curve of interpolation performance versus the number of convolutional layers.

It is worth noting that the feature fusion module only maps the pixel-wise information into high-dimensional features and cannot effectively remove the distortions. If these high-dimensional features are directly fed into the reconstruction layer, the interpolated image will contain many unnecessary compression artifacts, which limits the interpolation performance. In our DA-MLF, the distortion characteristics are effectively captured by the multi-task learning strategy and the distortion awareness module, which leads to a higher interpolation performance.

2) Distortion Awareness Module: Due to the compression procedure, the reference frames to be interpolated may suffer from severe distortion. Thus, a distortion awareness module is needed to better capture the distortion information for interpolating images. In contrast to the compression artifacts of JPEG, the distortions of HEVC have multi-scale similarities [26], which are mainly caused by the variable-size CU mechanism. Because the traditional convolutional layer is not efficient in capturing this characteristic, the variable-filter-size technique [12] is adopted in the distortion awareness module to capture the multi-scale similarity of distortions. Specifically, the variable-size filter utilized in DA-MLF is a combination of 3 × 3 filters and 1 × 1 filters. During the training of the quality enhancement task, the distortion awareness module learns to capture the distortion characteristics. Accordingly, in the training of the fractional interpolation task, the distortion awareness module mainly provides additional distortion information for enhancing the subsequent uniform interpolation.
tures and cannot effectively remove the distortions. If these 3) Residual Short-Cut Module: Residual learning is pro-
high-dimensional features are directly fed into the reconstruc- posed in [27], which has become a common strategy for
tion layer, the interpolated image will contain many unnec- accelerating the training procedure and improving the learning
essary compression artifacts, which limits the interpolation performance. However, conventional residual learning, which

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE PELOTAS. Downloaded on November 17,2024 at 02:49:28 UTC from IEEE Xplore. Restrictions apply.
2828 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 31, NO. 7, JULY 2021

However, conventional residual learning, which directly adds the input to the output, is not suitable in our case. The reason is twofold. First, the input and the output do not have the same size in our task, so they cannot be directly added together. Second, the input to be interpolated is highly distorted with compression artifacts; the enhanced reconstruction would be polluted if the noisy input were directly added to the output. Therefore, a residual short-cut module with a deconvolution layer and the variable-filter-size technique is proposed in DA-MLF. A deconvolution layer is first adopted to resize the input to match the output. Because the resized input still contains many compression artifacts, the variable-size filters are further integrated to remove the distortions. The following 3 × 3 filter with one channel reconstructs a relatively clean image to provide a global context.

4) Multiple Reconstruction Module: In deep learning, when focusing on a single task, some valuable information that might improve the performance is easily ignored. Specifically, this information can be obtained from training related tasks. By sharing representations between related tasks, the designed network can generalize better on the original task. This approach is called multi-task learning [28], [29]. Previous fractional interpolation methods feed the compressed images and their corresponding enlarged pairs into the networks and attempt to optimize the networks to learn the fractional interpolation. However, without any additional training strategy, the distortion characteristics contained in the input images are difficult to exploit to improve the interpolation quality.

To this end, DA-MLF utilizes the multiple reconstruction module to perform multi-task learning, which fully exploits the distortion characteristics to achieve a higher interpolation performance for video coding. A sub-network is first proposed to learn our auxiliary task, quality enhancement. Accordingly, with the pre-learned information, another sub-network of uniform interpolation is designed to focus on learning the interpolation mapping.

a) Quality enhancement: A quality enhancement sub-network is first proposed to learn our auxiliary task and provide the distortion characteristics for the subsequent fractional interpolation. As suggested in [30], deeper networks can enlarge the receptive field and improve learning ability. Therefore, in our quality enhancement sub-network, three convolutional layers are stacked together to further improve the performance.

b) Fractional interpolation: Several studies [16]–[18] have proposed different CNN-based fractional interpolation filters. These methods mainly generate fractional pixels according to the fractional positions and stitch them together to form the interpolated images. In our fractional interpolation sub-network, a uniform interpolation structure is proposed to directly generate the interpolated images.

Specifically, a deconvolution layer is adopted in our interpolation sub-network to perform the uniform interpolation, which directly enlarges the outputs without any post-processing. The formulation of traditional interpolation methods (e.g., linear and bicubic) can be presented as:

$y = \sum_{x \in X} w \times x$,   (3)

where $x$ is a referenced neighbor pixel, $y$ is the interpolated pixel, and $X$ is the input image to be interpolated. $w$ is a fixed matrix, whose weights usually depend on the interpolation method.

Accordingly, the formulation of the deconvolution layer can be presented as:

$y = \delta(w \otimes x + b)$,   (4)

where $x$ and $y$ are the input and output of the deconvolution layer, $w$ denotes the trainable kernels of this layer, $b$ is the bias, and $\delta$ is the activation function. It is worth noting that the deconvolution layer degrades to a traditional interpolation method when the weights and biases are fixed. Generally, the deconvolution layer can be regarded as a uniform interpolation.
An evaluation experiment is further conducted to demonstrate the interpolation ability of the deconvolution layer. VDSR [22], a classic super-resolution network, maps a pre-enlarged low-resolution image into a high-resolution image. We add a deconvolution layer at the end of VDSR so that it can directly interpolate images without any pre-enlargement. To evaluate the performance of this uniform interpolation, we retrain the redesigned VDSR and integrate the network into the HEVC Test Model as a CNN-based half-pel interpolation filter. The experimental configuration is introduced in Section V, and the experimental results are shown in Table II.

TABLE II
THE PERFORMANCE OF THE REDESIGNED VDSR COMPARED TO THE HEVC BASELINE

It can be observed that the redesigned VDSR with the uniform interpolation structure performs interpolation well and achieves an improvement in coding efficiency. However, lacking a further training strategy to exploit the valuable compression information, the interpolation performance of this redesigned VDSR is limited in our task. Therefore, instead of reusing the structure of VDSR, a multi-task learning framework is proposed in our method to merge the distortion information into fractional interpolation. Compared with previous CNN-based interpolation methods [16]–[18], DA-MLF adopts the uniform interpolation structure to interpolate the referenced images directly rather than stitching fractional pixels of different positions together. As noted in [19], the integer pixels should remain unchanged during interpolation. Therefore, the generated integer pixels of the uniform interpolation are not utilized to replace the original integer pixels, and the other generated pixels are adopted as the fractional pixels. More experimental results of our DA-MLF are given in Section V.
Fig. 4. The data generation method of DA-MLF.

IV. TRAINING STRATEGY

A. Training Sample Generation

Because fractional interpolation for video coding is sensitive to distortions, it is important to generate samples with realistic distortions for training. As mentioned before, our framework adopts a uniform interpolation method, which is different from previous methods. Therefore, a new sample generation method is proposed for DA-MLF. Our generation model can be presented as:

$I^{s,q} = ((L \otimes k_{\sigma}) \downarrow_s)_{HEVC_q}$,   (5)

where $I^{s,q}$ is the input image with realistic distortions and $L$ denotes the label image. $\otimes$ denotes the convolution operator. $k_{\sigma}$ stands for the Gaussian blur kernel with standard deviation $\sigma$. $\downarrow_s$ denotes the down-scale operator with scale factor $s$. $(L \otimes k_{\sigma}) \downarrow_s$ simulates the reverse procedure of interpolation, and the Gaussian blurring ensures that no aliasing signal is additionally introduced by down-scaling. $(\cdot)_{HEVC_q}$ denotes the HEVC compression operator with quantization parameter $q$. Inter coding is adopted to simulate the distortions caused by HEVC compression. The flowchart of our proposed generation method is shown in Fig. 4. The input images are first blurred and downsampled to generate the corresponding integer images. Then, the integer images are compressed by HEVC inter coding to simulate the actual distortions.

In the following, we explain the parameter settings for these operations:
1) Blur Kernel: The Gaussian blur kernel is a classic low-pass filter and is utilized in our data generation method to avoid aliasing signals. Specifically, $k_{\sigma}$ is set as 3 × 3 with a standard deviation of 0.5.
2) Downsampler: Direct downsampling, which chooses the top-left pixels as the downsampled pixels, is adopted in our method to keep the integer pixels unchanged. The scale factor $s$ is set as 0.5 and 0.25 to generate half-pel and quarter-pel pixels, respectively.
3) HEVC Compression: To better simulate the actual distortions during fractional interpolation, the quantization parameter $q$ is set as 22, 27, 32 and 37, which is the same configuration as in the following experiments.
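The blur-then-decimate part of Eq. (5) is straightforward to reproduce; a NumPy/SciPy sketch is shown below with the settings listed above (3 × 3 kernel, σ = 0.5, top-left decimation). The HEVC inter-coding operator is an external encoder round trip (e.g., with HM) and is therefore only a placeholder function here.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel3(sigma=0.5):
    """Explicit 3x3 Gaussian kernel with the paper's sigma = 0.5."""
    ax = np.array([-1.0, 0.0, 1.0])
    g = np.exp(-ax**2 / (2 * sigma**2))
    k = np.outer(g, g)
    return k / k.sum()

def hevc_inter_compress(img, qp):
    """Placeholder for the (.)_{HEVC_q} operator of Eq. (5): in the paper
    this is an actual HM inter-coding round trip at QP in {22,27,32,37}."""
    raise NotImplementedError("run an external HEVC encoder/decoder here")

def generate_integer_image(label, s=2):
    """(L (*) k_sigma) down_s of Eq. (5): blur, then keep the top-left pixel
    of each s x s cell so integer positions stay aligned with the label L.
    s = 2 corresponds to scale factor 0.5, s = 4 to 0.25."""
    blurred = convolve2d(label, gaussian_kernel3(), mode="same", boundary="symm")
    return blurred[::s, ::s]

label = np.random.rand(160, 160)
half_pel_input = generate_integer_image(label, s=2)     # for DA-MLF-H
quarter_pel_input = generate_integer_image(label, s=4)  # for DA-MLF-Q
print(half_pel_input.shape, quarter_pel_input.shape)    # (80, 80) (40, 40)
```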
B. Three-Step Training

To alleviate the over-smoothing effect during interpolation, a three-step training strategy is proposed to collaborate with our fused training set and fully capture the features of recurrent enhancement. $I_i$ denotes the input image to be interpolated, $I_e$ denotes the enhanced image, and $I_f$ denotes the interpolated image. Correspondingly, $O_e$ denotes the label of the quality enhancement task and $O_f$ denotes the label of the fractional interpolation task. The weights of DA-MLF are represented in three parts. $W_c$ represents the weights of the common layers shared between the quality enhancement task and the fractional interpolation task, which are marked as the feature extraction module and the distortion awareness module in Fig. 2. $W_e$ and $W_f$ denote the weights of the quality enhancement sub-network and the fractional interpolation sub-network, respectively. DA-MLF takes $I_i$ as the input image, restores a quality-enhanced image $I_e$, and produces a fractionally interpolated image $I_f$.

First, the image dataset is compressed with the all-intra (AI) configuration. $W_c$ and $W_e$ are jointly trained for the quality enhancement task with these compressed images. During this first step, the quality enhancement sub-network learns to remove the compression artifacts under intra coding. However, the distortion features of intra coding and inter coding are quite different; if this pre-trained quality enhancement sub-network were directly used to guide the subsequent interpolation, an over-smoothing effect may occur.

In our second step, the collected video sequences are compressed with the LDP, LDB and RA configurations. Besides, during inter coding, the previously pre-trained quality enhancement sub-network is utilized as a CNN-based loop filter to further simulate the distortions during recurrent interpolation. Then, with $W_e$ fixed, $W_c$ and $W_f$ are jointly trained for the fractional interpolation task with these video samples.
Note that these two steps update parameters according to the following rules:

$(W_c^*, W_e^*) = \min_{W_c, W_e} D\{h_e(I_i \mid W_c, W_e), O_e\}$,   (6)
$(W_c^*, W_f^*) = \min_{W_c, W_f} D\{h_f(I_i \mid W_c, W_f), O_f\}$,   (7)

where $h_e(\cdot)$ and $h_f(\cdot)$ represent the network functions of the quality enhancement task and the interpolation task, respectively, and $D\{\cdot\}$ measures the difference between the generated and the original batches.

Finally, the three parts of DA-MLF are jointly finetuned for global optimization with the fused compressed images and video sequences. Specifically, the quality enhancement sub-network and the fractional interpolation sub-network jointly update parameters according to the following objective function:

$(W_c^*, W_e^*, W_f^*) = \min_{W_c, W_e, W_f} \{D\{h_e(I_i \mid W_c, W_e), O_e\} + D\{h_f(I_i \mid W_c, W_f), O_f\}\}$.   (8)

Generally, the Euclidean distance is adopted in this paper to measure $D\{h_e(I_i \mid W_c, W_e), O_e\}$ and $D\{h_f(I_i \mid W_c, W_f), O_f\}$, i.e.,

$D\{h_e(I_i \mid W_c, W_e), O_e\} = \frac{1}{N} \sum_{i=1}^{N} \{O_e^i - h_e(I_i^i \mid W_c, W_e)\}^2$,   (9)
$D\{h_f(I_i \mid W_c, W_f), O_f\} = \frac{1}{N} \sum_{i=1}^{N} \{O_f^i - h_f(I_i^i \mid W_c, W_f)\}^2$,   (10)

where $N$ is the batch size. This L2 loss is adopted to train our network. As can be seen in Fig. 5, even without a CU-level interpolation determination approach, our DA-MLF avoids severe over-smoothing and achieves superior performance on different QP sets and different sequences.
a superior performance on different QP sets and different 1) CNN Training: Caffe [32] is adopted as our deep learn-
sequences. ing framework because it has better compatibility with the
C++ code. Note that the compression distortions have a
V. E XPERIMENT C ONFIGURATION AND R ESULTS great influence on our fractional interpolation task. Different
To evaluate the performance of DA-MLF, multiple exper- models are trained separately for each QP (including QPs 22,
iments are conducted. The training set and implementation 27, 32 and 37) and each encoding configuration (including
details are first introduced, followed by the experimental low-delay P, low-delay B and random-access). As mentioned
results. Additional ablation studies are further analyzed to above, half-pel pixels and quarter-pel pixels adopt different
evaluate the effect of different modules. generation methods. Thus, two different models, DA-MLF-H
and DA-MLF-Q, are trained for a certain QP, and there are
eight models for a total of four QPs.
A. Training Set The inputs are cropped into 40 × 40 patches with a stride
Previous methods [16]–[18] choose a single existing image of 20. The corresponding labels are cropped into 80 × 80 or
dataset or video dataset as their training set. In this paper, 160 × 160 according to the interpolation scale. Zero padding
a fused dataset, which includes both video samples and image is adopted to keep the image size of the output the same
samples, is collected to train our DA-MLF. The reason is as the label. Because enough contents are included in our
twofold. First, since intra coding and inter coding adopt dataset, no augmentation technique is adopted during our
different coding modes, the compression artifacts in the com- training. In total, there are approximately 194,730 samples for
pressed images cannot fully simulate the features of inter training a single network. Adam [33] optimizer with standard
coding. Particularly, in our task, the CNN-based interpolation back-propagation is adopted to train our proposed networks
filters are supposed to further remove the distortions. Since with a fixed learning rate of 0.0001, which has the best balance
compressed images with intra coding fail to simulate the between fast convergence and stable performance. The batch

Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE PELOTAS. Downloaded on November 17,2024 at 02:49:28 UTC from IEEE Xplore. Restrictions apply.
YU et al.: DISTORTION-AWARE MULTI-TASK LEARNING FRAMEWORK FOR FRACTIONAL INTERPOLATION IN VIDEO CODING 2831

with a fixed learning rate of 0.0001, which gives the best balance between fast convergence and stable performance. The batch size is set to 64. The training procedure finishes in 20 epochs, after which no further loss reduction is observed. DA-MLF is trained and tested on an NVIDIA GeForce GTX 1080.

2) HEVC Coding Configuration: DA-MLF is embedded into the HEVC reference software HM (version 16.12) using the official C++ API provided by Caffe. It is worth noting that DA-MLF is utilized in both motion compensation and motion estimation. In contrast to previous methods, the DCTIF in HM is directly replaced with our proposed DA-MLF. HEVC common test sequences recommended by JCT-VC [34] are used to evaluate the performance of DA-MLF. All video frames are evaluated in the tables with the BD-rate index. Three encoding configurations, low-delay P (LDP), low-delay B (LDB) and random-access (RA), are tested using the corresponding networks under QPs 22, 27, 32 and 37. BD-rate [35] is adopted to calculate the coding efficiency improvement between DA-MLF and the HEVC baseline.
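For completeness, BD-rate [35] fits third-order polynomials to the two PSNR-versus-log-bitrate curves and averages the horizontal gap over their overlapping PSNR range. Below is a common NumPy formulation of Bjøntegaard's method; the rate/PSNR numbers are made up purely for the usage example.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate [35]: average bitrate difference (%) between
    two R-D curves over the overlapping PSNR interval (negative = saving)."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the common PSNR range.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100

# Illustrative numbers only (QPs 37..22, kbps and dB, left to right):
anchor = ([800, 1500, 3000, 6000], [32.0, 34.5, 37.0, 39.5])
test = ([760, 1420, 2850, 5700], [32.1, 34.6, 37.1, 39.6])
print(f"BD-rate: {bd_rate(*anchor, *test):.2f}%")  # negative: bitrate saving
```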
C. Experimental Results

1) Overall Performance: The performance of our proposed DA-MLF compared to the HEVC baseline is shown in Table IV. Five classes of video sequences are tested, including classes A, B, C, D, and E. Three configurations, LDP, LDB and RA, are adopted to evaluate our DA-MLF. DA-MLF achieves, on average, 5.0%, 4.0% and 1.7% BD-rate reduction under the LDP, LDB and RA configurations, respectively. The highest performance is observed on the sequence BQTerrace, which achieves an 8.2% BD-rate reduction under the LDP configuration. The reason for this high performance is that the distortion characteristics guide our DA-MLF to better denoise the large backgrounds in BQTerrace. Similar results can be observed in the video frames of class E, which contain large backgrounds; DA-MLF achieves better performance on these video sequences.

TABLE IV
THE PERFORMANCE OF OUR PROPOSED DA-MLF COMPARED TO THE HEVC BASELINE

Fig. 5. Four DA-MLF R-D curves of the sequences BasketballDrill, BasketballPass, BQTerrace and KristenAndSara compared to the HEVC baseline under the LDP configuration.

For further analysis, four RD curves are given in Fig. 5, covering four videos of different resolutions and different classes. As can be observed, the benefits brought by DA-MLF are not the same under different QPs. Under low QP (e.g., QP 22), the benefits of DA-MLF mainly come from the bitrate decrease. Because details are preserved well under low QP and the reconstructed frames do not have obvious compression artifacts, the fractional interpolation sub-network can generate high-quality fractional pixels and further reduce the inter prediction residual to save bitrate. Under high QP (e.g., QP 37), the benefits of DA-MLF mainly come from the
PSNR increase. Under high QP, the post-processing methods in HEVC fail to comprehensively remove compression artifacts. Thus, DA-MLF with the distortion awareness module can further denoise these compression artifacts to obtain a higher PSNR gain.

2) Comparison Results: To obtain an intuitive evaluation of the performance of our method, the BD-rate reduction results of DA-MLF are further compared with several existing CNN-based fractional interpolation methods, including VDIF, GVCNN and FRCNN. All four methods are tested under the HEVC common test conditions. Comparison results for the Y component are given in Table V. As VDIF only replaces the half-pel pixels, it achieves the lowest BD-rate reduction. GVCNN shares the feature extraction layers in the procedure of generating different fractional pixels to save parameters, and its results are promising. In contrast to previous works, FRCNN predicts rather than generates fractional pixels to further reduce the bitrate, especially under low QP. FRCNN also adopts an RD-cost selection algorithm to improve coding performance. However, these methods lack a further strategy to exploit the distortion characteristics of compression, and thus fail to achieve a larger performance gain. Compared with these methods, our DA-MLF utilizes multi-task learning to merge the distortion information into fractional interpolation and achieves the best BD-rate reduction performance.

TABLE V
THE BD-RATE COMPARISON OF DIFFERENT METHODS UNDER DIFFERENT CONFIGURATIONS

Besides, since the training sample generation method of GVCNN is similar to ours, we retrain GVCNN for a further comparison. LDP is chosen as the coding configuration. As can be seen in Fig. 6, with both networks trained on our fused dataset, our DA-MLF still outperforms GVCNN on all the video classes.

Fig. 6. Comparison between GVCNN and DA-MLF trained on the same dataset.

3) Implementation Time: Because CNN-based methods are more efficient in GPU mode, the training and implementation of DA-MLF all run in GPU mode. To have a clearer
presentation of the efficiency of DA-MLF, the encoding complexity ratio of DA-MLF is calculated and compared with other CNN-based interpolation methods. Because multiple network structures are tested in FRCNN, only the structure achieving the maximum efficiency is listed. The complexity ratio is calculated as $\sigma = \tilde{T}/T$, where $\tilde{T}$ is the encoding time of HM 16.12 with each method integrated, and $T$ is the original encoding time. To obtain a comprehensive evaluation, the implementation time of DA-MLF in CPU mode is also tested. The encoding complexity ratio of DA-MLF is measured on an Intel i7-6700K CPU and a GeForce GTX 1080 GPU. Because the decoding time of GVCNN and FRCNN heavily depends on the selection ratio, the decoding complexity ratios are not compared in this experiment. The results are evaluated under the LDP configuration and are given in Table VI.

TABLE VI
THE RESULTS OF AVERAGE CODING/DECODING TIME RATIOS BETWEEN DIFFERENT METHODS

As can be seen, although DA-MLF has a more complex structure than GVCNN and FRCNN, it achieves the lowest computational complexity. The large gain in implementation efficiency compared to previous methods is mainly due to two reasons. First, DA-MLF adopts uniform interpolation to generate all fractional pixels at the same time, while FRCNN has to generate pixels at different positions with separate networks; therefore, considerable time is wasted in the forward propagation of FRCNN. Second, to improve performance, GVCNN and FRCNN adopt an RD-cost optimization to select the best interpolation method, which increases their coding complexity ratios.

4) Verification of the Sample Generation Approach: To further evaluate the effect of our blurring and down-scaling approach in sample generation, a comparison experiment is conducted. Table VII shows the coding efficiency improvements of different blurring-and-downsampling methods. Direct downsampling means choosing the top-left pixels as the downsampled images without blurring; Gaussian blur means performing Gaussian blur with different standard deviations σ before direct downsampling; bicubic means choosing bicubic interpolation as the downsampling method, which is widely used in different applications including Scalable High Efficiency Video Coding. From Table VII, several generation parameters can be determined:

TABLE VII
THE PERFORMANCE OF DIFFERENT DOWNSAMPLING METHODS

1) Performing Gaussian blur before downsampling is chosen as our downsampling method. Specifically, direct downsampling and bicubic downsampling cannot provide noticeable improvement because these two methods cannot reduce aliasing signals. Additional noise is introduced into the generated training pairs, which decreases the coding performance. Comparatively, Gaussian blur-and-downsampling methods obtain a more obvious BD-rate reduction. This indicates their effectiveness in suppressing aliasing signals and generating promising training pairs.
2) A deviation of 0.5 is chosen as our kernel deviation. A small blurring kernel preserves most low-frequency components at the cost of larger aliasing noise. In contrast, a large blurring kernel can reduce enough aliasing signals, but low-frequency components are also lost. Therefore, an intermediate blur kernel with a deviation of 0.5 achieves the best performance and is chosen as the Gaussian kernel in our method.

5) Verification of Multi-Task Learning: To improve the fractional interpolation performance, an auxiliary task of quality enhancement is introduced into DA-MLF. This auxiliary task learns the mapping function between reconstructed frames and ground truths and provides the distortion characteristics for the subsequent fractional interpolation. To evaluate the effectiveness of our multi-task learning framework, a single network is additionally trained without the auxiliary task to capture the distortion characteristics. The structure of this experimental network remains the same as the uniform interpolation sub-network in DA-MLF. The PSNR-epoch curves of the proposed networks with and without multi-task learning are shown in Fig. 7. Note that QP is set to 32 and 22.

Fig. 7. The PSNR curves of fractional interpolation generated by models trained with and without the multi-task framework.

As can be observed, the auxiliary task of quality enhancement not only accelerates the training procedure but also contributes to a higher interpolation quality. Because the compression distortions are larger under higher QP, the gain from exploiting distortion characteristics is more obvious under QP 32. The networks with and without the multi-task learning strategy
are further integrated into the HEVC Test Model, and the experimental results are given in Table VIII. LDP is chosen as the coding configuration. The tested sequences include classes C, D and E. Without additional specification, the subsequent verification studies follow this experiment configuration. As seen, our DA-MLF has a higher BD-rate reduction compared with the single interpolation network, which demonstrates the effect of the multi-task learning strategy.

TABLE VIII
THE PERFORMANCE COMPARISON OF A SINGLE INTERPOLATION NETWORK AND OUR DA-MLF

6) Verification of the Three-Step Training Strategy: Since CNN-based interpolation filters suffer from the over-smoothing effect, previous methods utilize CTU-level determination to choose the optimal interpolation approach, which increases the computational complexity. Instead of CTU-level determination, a three-step training strategy is utilized in this paper to reduce the complexity and ease the transmission burden. To demonstrate the effectiveness of our three-step training strategy, a comparison experiment is conducted. The average BD-rate reduction of our DA-MLF with and without the three-step training strategy is shown in Table IX.

TABLE IX
THE BD-RATE PERFORMANCE OF DA-MLF WITH AND WITHOUT THE THREE-STEP TRAINING STRATEGY

The experimental result shows that the three-step training strategy has a great impact on the interpolation performance. It can be noticed that, although trained on a larger dataset, DA-MLF without the three-step training strategy cannot achieve an obvious improvement. When adopting the three-step training strategy, the BD-rate reduction of DA-MLF is further improved by 3.5%, which shows the effectiveness of the three-step training strategy.

D. Ablation Study

Ablation analyses of our feature fusion module and distortion awareness module are further conducted to better evaluate the effectiveness of the different structures we adopted.

1) Feature Fusion Module: To achieve a promising interpolation performance, a fundamental module should be designed to effectively extract features from the pixel domain. In DA-MLF, the feature fusion module, which is formed by dense structures, is proposed to fuse the output of each convolution layer to enhance the feature extraction ability. Dense structures are proposed in [25] and connect all layers directly. In the training of our DA-MLF, features captured in the previous layers can thus be effectively merged into the subsequent structures. To demonstrate the superior design of our feature fusion module with dense structures, several additional networks are trained for comparison. Specifically, the feature fusion modules of these additional networks are replaced by stacked convolutional layers, and the number of these convolutional layers increases from 7 to 11. Accordingly, these additional networks are named DA-MLF-S7, DA-MLF-S9 and DA-MLF-S11. The training set is the same as that described in Section IV, and LDP is chosen as the coding configuration. By comparing DA-MLF, which adopts our feature fusion module, with these DA-MLF-S networks on the test sequences, we obtain the results shown in Table X.

TABLE X
COMPARISON BETWEEN DA-MLF WITH THE FEATURE FUSION MODULE AND DA-MLF-S NETWORKS WITH STACKED CONVOLUTIONAL LAYERS

These results indicate that, even though the interpolation performance of the DA-MLF-S networks continues to increase as the network depth grows, it still cannot outperform the DA-MLF with our proposed feature fusion module. This explicitly verifies the superior design of our feature fusion module.

2) Distortion Awareness Module: Different from SR problems, fractional interpolation for HEVC suffers from severe compression distortions. Therefore, it is necessary to integrate the distortion awareness module into our DA-MLF to better exploit the distortion characteristics for interpolation. As mentioned in [12], the variable-filter-size technique can effectively capture the distortion features of HEVC compression. Therefore, a lightweight version of the variable-filter-size technique is integrated into our distortion awareness module. The experiment is conducted by comparing the interpolation performance between networks with and without our distortion awareness module. In addition to giving the PSNR performance, we also employ PSNR-B [36], which modifies PSNR by including a blocking effect factor, to evaluate the performance of the distortion awareness module. The experimental results are shown in Table XI.

TABLE XI
COMPARISON BETWEEN DA-MLF WITH AND WITHOUT THE DISTORTION AWARENESS MODULE

As illustrated in Table XI, the PSNR and PSNR-B gains indicate the benefit of our distortion awareness module with the variable-filter-size technique. Note that the gains on PSNR-B are much larger than those on PSNR. This indicates that DA-MLF with our distortion awareness module can further utilize the distortion features to produce images with fewer blocking artifacts. The large gain comes from two parts. First, the auxiliary task with the distortion awareness module can provide more effective distortion information of compression for later training. Second, the fractional
interpolation network can have a more generalized representation ability with the integrated distortion awareness module.

VI. CONCLUSION

In this paper, we propose a novel distortion-aware multi-task learning framework to perform fractional interpolation for inter coding. With the multi-task learning strategy and the training set fusion strategy, DA-MLF can generate interpolated images with higher quality. Moreover, a uniform fractional interpolation network with the feature fusion module and the distortion awareness module is utilized to directly generate fractional-pel pixels. Our extensive experiments indicate that DA-MLF outperforms the state-of-the-art fractional interpolation approaches and achieves the lowest computational cost. We will conduct further study on utilizing temporal information to improve the interpolation performance.

ACKNOWLEDGMENT

The authors would like to thank Professor D. Liu and Professor M. Xu for their valuable advice and discussions.

REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[3] R. Fan, Y. Zhang, and B. Li, "Motion classification-based fast motion estimation for high-efficiency video coding," IEEE Trans. Multimedia, vol. 19, no. 5, pp. 893–907, May 2017.
[4] F. Luo, S. Wang, S. Wang, X. Zhang, S. Ma, and W. Gao, "GPU-based hierarchical motion estimation for high efficiency video coding," IEEE Trans. Multimedia, vol. 21, no. 4, pp. 851–862, Apr. 2019.
[5] Y. Vatis and J. Ostermann, "Adaptive interpolation filter for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 179–192, Feb. 2009.
[6] K. Ugur et al., "Motion compensated prediction and interpolation filter design in H.265/HEVC," IEEE J. Sel. Topics Signal Process., vol. 7, no. 6, pp. 946–956, Dec. 2013.
[7] T. Wedi, "Adaptive interpolation filters and high-resolution displacements for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 484–491, Apr. 2006.
[8] J. Dong and K. N. Ngan, "Parametric interpolation filter for HD video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 12, pp. 1892–1897, Dec. 2010.
[9] S. Wittmann and T. Wedi, "Separable adaptive interpolation filter for video coding," in Proc. 15th IEEE Int. Conf. Image Process., Oct. 2008, pp. 2500–2503.
[10] S. Ma, X. Zhang, C. Jia, Z. Zhao, S. Wang, and S. Wang, "Image and video compression with neural networks: A review," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 6, pp. 1683–1698, Jun. 2020, doi: 10.1109/TCSVT.2019.2910119.
[11] M. Xu, T. Li, Z. Wang, X. Deng, R. Yang, and Z. Guan, "Reducing complexity of HEVC: A deep learning approach," IEEE Trans. Image Process., vol. 27, no. 10, pp. 5044–5059, Oct. 2018.
[12] Y. Dai, D. Liu, and F. Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in Proc. Int. Conf. MultiMedia Modeling (MMM), Jan. 2017, pp. 1–12.
[13] R. Yang, M. Xu, Z. Wang, and T. Li, "Multi-frame quality enhancement for compressed video," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 6664–6673.
[14] J. Xu, M. Xu, Y. Wei, Z. Wang, and Z. Guan, "Fast H.264 to HEVC transcoding: A deep learning method," IEEE Trans. Multimedia, vol. 21, no. 7, pp. 1633–1645, Jul. 2019, doi: 10.1109/TMM.2018.2885921.
[15] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, "Fully connected network-based intra prediction for image coding," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3236–3247, Jul. 2018.
[16] N. Yan, D. Liu, H. Li, and F. Wu, "A convolutional neural network approach for half-pel interpolation in video coding," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2017, pp. 1–4.
[17] H. Zhang, L. Song, Z. Luo, and X. Yang, "Learning a convolutional neural network for fractional interpolation in HEVC inter coding," in Proc. IEEE Vis. Commun. Image Process. (VCIP), Dec. 2017, pp. 1–4.
[18] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu, "One-for-all: Grouped variation network-based fractional interpolation in video coding," IEEE Trans. Image Process., vol. 28, no. 5, pp. 2140–2151, May 2019.
[19] N. Yan, D. Liu, H. Li, B. Li, L. Li, and F. Wu, "Convolutional neural network-based fractional-pixel motion compensation," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 3, pp. 840–853, Mar. 2019, doi: 10.1109/TCSVT.2018.2816932.
[20] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image super-resolution," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2472–2481.
[21] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016.
[22] J. Kim, J. K. Lee, and K. M. Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1646–1654.
[23] N. Yan, D. Liu, H. Li, T. Xu, F. Wu, and B. Li, "Convolutional neural network-based invertible half-pixel interpolation filter for video coding," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 201–205.
[24] N. Yan, D. Liu, H. Li, B. Li, L. Li, and F. Wu, "Invertibility-driven interpolation filter for video coding," IEEE Trans. Image Process., vol. 28, no. 10, pp. 4912–4925, Oct. 2019.
[25] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269.
[26] L. Yu, L. Shen, H. Yang, L. Wang, and P. An, "Quality enhancement network via multi-reconstruction recursive residual learning for video coding," IEEE Signal Process. Lett., vol. 26, no. 4, pp. 557–561, Apr. 2019.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[28] X. Yin and X. Liu, "Multi-task convolutional neural network for pose-invariant face recognition," IEEE Trans. Image Process., vol. 27, no. 2, pp. 964–975, Feb. 2018.
[29] K. Wang, S. Zhai, H. Cheng, X. Liang, and L. Lin, "Human pose estimation from depth images via inference embedded multi-task learning," in Proc. ACM Multimedia Conf. (MM), 2016, pp. 1227–1236.
[30] S. I. Cho and S.-J. Kang, "Gradient prior-aided CNN denoiser with separable convolution-based optimization of feature dimension," IEEE Trans. Multimedia, vol. 21, no. 2, pp. 484–493, Feb. 2019.
[31] E. Agustsson and R. Timofte, "NTIRE 2017 challenge on single image super-resolution: Dataset and study," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 126–135.
[32] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia (ACMMM), 2014, pp. 675–678.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2014, pp. 1–15.
[34] F. Bossen, Common HM Test Conditions and Software Reference Configurations, document JCTVC-L1100, Apr. 2013.
[35] G. Bjontegaard, Calculation of Average PSNR Differences Between RD-Curves, document VCEG-M33, Apr. 2001.
[36] C. Yim and A. C. Bovik, "Quality assessment of deblocked images," IEEE Trans. Image Process., vol. 20, no. 1, pp. 88–98, Jan. 2011.

Liangwei Yu received the B.S. and M.S. degrees in communication and information systems from Shanghai University, Shanghai, China, in 2017 and 2020, respectively. He has been working on video compression and CNN-based video enhancement. His research interests include image/video coding and machine learning.


Liquan Shen (Member, IEEE) received the B.S. degree in automation control from Henan Polytechnic University, Henan, China, in 2001, and the M.E. and Ph.D. degrees in communication and information systems from Shanghai University, Shanghai, China, in 2005 and 2008, respectively. Since 2008, he has been with the faculty of the School of Communication and Information Engineering, Shanghai University, where he is currently a Professor. He was with the Department of Electrical and Computer Engineering, University of Florida at Gainesville, as a Visiting Professor from November 2013 to November 2014. His major research interests include high efficiency video coding (HEVC), perceptual coding, video codec optimization, 3DTV, and video quality assessment. He has authored or coauthored more than 100 refereed technical papers in international journals and conferences in the field of video coding and image processing. He holds ten patents in the areas of image/video coding and communications.

Hao Yang (Member, IEEE) received the B.S. and Ph.D. degrees in communication engineering from Shanghai University, China, in 2015 and 2020, respectively. His research interests include learning-based image/video processing, rate control, and video codec optimization.

Xuhao Jiang received the B.S. and M.S. degrees in communication and information systems from Shanghai University, Shanghai, China, in 2017 and 2020, respectively. He is currently pursuing the Ph.D. degree with the School of Computer Science, Fudan University, Shanghai. He has been working on image quality assessment and image enhancement. His research interests include image/video processing and machine learning.

Bo Yan (Senior Member, IEEE) received the B.E. and M.E. degrees in communication engineering from Xi'an Jiaotong University (XJTU) in 1998 and 2001, respectively, and the Ph.D. degree in computer science and engineering from the Chinese University of Hong Kong (CUHK) in 2004. From 2004 to 2006, he worked at the National Institute of Standards and Technology (NIST), USA, as a Postdoctoral Guest Researcher. He is currently a Professor with the School of Computer Science, Fudan University, Shanghai, China. His research interests include video processing, computer vision, and multimedia communications. He has served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), and as a Guest Editor of the Special Issue on "Content-Aware Visual Systems: Analysis, Streaming and Retargeting" of the IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS (JETCAS).
