
Neural Network-Based Enhancement to Inter Prediction for Video Coding

Yang Wang, Xiaopeng Fan, Senior Member, IEEE, Ruiqin Xiong, Senior Member, IEEE, Debin Zhao, Member, IEEE, and Wen Gao, Fellow, IEEE

Abstract—Inter prediction is a crucial part of hybrid video coding frameworks, utilized to exploit the temporal redundancy in video sequences and improve the coding performance. During inter prediction, a predicted block is typically derived from reference pictures using motion estimation and motion compensation. To improve the coding performance of inter prediction, a neural network based enhancement to inter prediction (NNIP) is proposed in this paper. NNIP is composed of three networks, namely a residue estimation network, a combination network, and a deep refinement network. Specifically, first, a residue estimation network is designed to estimate the residue between the current block and its predicted block using their available spatial neighbors. Second, the feature maps of the estimated residue and the predicted block are extracted and concatenated in a combination network. Finally, the concatenated feature maps are fed into a deep refinement network to generate a refined residue, which is added back to the predicted block to derive a more accurate predicted block. NNIP is integrated in HEVC to evaluate its efficiency. The experimental results demonstrate that NNIP can achieve 4.6%, 3.0%, and 2.7% BD-rate reduction on average under the LDP, LDB, and RA configurations compared to HEVC.

Index Terms—Video coding, inter prediction, NNIP, HEVC, deep learning.

Manuscript received July 29, 2020; revised December 7, 2020; accepted February 18, 2021. Date of publication March 2, 2021; date of current version February 4, 2022. This work was supported by the National Science Foundation of China under Grant 61972115, Grant 61872116, and Grant 61631017. This article was recommended by Associate Editor M. Cagnazzo. (Corresponding author: Xiaopeng Fan.)
Yang Wang, Xiaopeng Fan, and Debin Zhao are with the Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]; [email protected]; [email protected]).
Ruiqin Xiong is with the Department of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China (e-mail: [email protected]).
Wen Gao is with the Department of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSVT.2021.3063165.
Digital Object Identifier 10.1109/TCSVT.2021.3063165

I. INTRODUCTION

INTER prediction aims at reducing the temporal redundancy in video sequences. During inter prediction, motion estimation is used to obtain a motion vector by searching the reference pictures for the block that best matches the current block, and the motion vector is then used in motion compensation to derive the predicted block. With motion estimation and motion compensation, the temporal redundancy in video sequences can be exploited, resulting in a higher compression ratio. The more accurate the predicted block, the smaller the residue, and the higher the coding performance. Inter prediction has been widely used in modern video coding standards, e.g., H.264/AVC [1] and high efficiency video coding (HEVC) [2].

In inter prediction, the predicted block is typically generated by copying or interpolating a block from the reference pictures. Even when the best matching block is obtained, there is still some residue after prediction that needs to be further encoded. This residue comes from the variations between the current block and its predicted block, such as illumination variation, zooming, deformation, and blurring.

To address these variations and improve the coding performance of inter prediction, numerous algorithms have been proposed in recent years. Local illumination compensation [3], [4] addresses local illumination variation in video sequences. Overlapped block motion compensation [5] reduces the blocking artifacts caused by motion compensation. An adaptive progressive motion vector resolution selection scheme is proposed in [6] to reduce the overhead of signaling motion vectors. Bi-directional optical flow [7] refines bi-prediction using the optical flow equation. Affine motion compensated prediction [8]–[10] addresses non-translational motion. All of these methods are manually designed to address the variations between the current block and the predicted block.

Recently, deep learning has achieved impressive success in video coding [11], [12]. Typically, deep learning based methods are integrated into the hybrid video coding framework to improve the coding performance of a particular module, such as intra prediction [13]–[18], inter prediction [19]–[33], in-loop filtering [34]–[39], post-processing [40]–[46], up-sampling of coded down-sampled blocks [47], [48], entropy coding [49], [50], and rate control [51], [52]. Specifically, for inter prediction, deep learning can be used for bi-prediction [21]–[24], fractional interpolation [25]–[28], inter prediction enhancement [19], [20], and reference frame generation [29]–[33]. More deep learning based methods for video coding are reviewed in [11], [12].

In this paper, a neural network based enhancement to inter prediction (NNIP) for video coding is proposed. NNIP is composed of three networks. A residue estimation network is proposed to estimate the residue between the current block and its predicted block. A combination network extracts and concatenates the feature maps of the estimated residue and the predicted block. A deep refinement network is used to derive a refined residue; the refined residue is then added back to the predicted block to get a more accurate predicted block. NNIP is integrated in HEVC to evaluate its performance. The experimental results demonstrate that NNIP achieves 4.6%, 3.0%, and 2.7% BD-rate reduction on average under the LDP, LDB, and RA test conditions compared to HEVC.

The main contributions of this work are summarized as follows:
1) A neural network based enhancement to inter prediction for video coding is proposed, which consists of three networks, namely a residue estimation network, a combination network, and a deep refinement network.
2) A residue estimation network is designed to estimate the residue between the current block and its predicted block using their available spatial neighbors.
3) A combination network is presented to extract the feature maps of the estimated residue and the predicted block and concatenate these feature maps together. Therefore, the texture information in the predicted block can be fully utilized to guide the residue refinement.
4) A deep refinement network is proposed to take the concatenated feature maps as input and derive a refined residue, which is added back to the predicted block to get a more accurate predicted block.

The rest of this paper is organized as follows. Related work is reviewed in Section II. The proposed neural network is detailed in Section III. Section IV describes how NNIP is integrated in HEVC. The experimental results and analyses are given in Section V. Section VI concludes the paper.

II. RELATED WORK

For inter prediction, deep learning has been used for bi-prediction [21]–[24], fractional interpolation [25]–[28], inter prediction enhancement [19], [20], and reference frame generation [29]–[33].

For bi-prediction, Zhao et al. [21], [22] proposed a bi-directional motion compensation algorithm using a CNN. Two reference blocks generated by bi-prediction are fed into the network to get the final prediction. This network consists of six convolutional layers and can adaptively learn the weights of the two reference blocks, rather than the average weights used in HEVC. To further improve the performance of bi-prediction, Mao et al. [23], [24] proposed to use the spatial neighboring pixels of the current block as additional information, which is fed into the CNN model together with the two reference blocks to get the final prediction. Compared to [22], the additional spatial neighboring pixels of the current block are also utilized in [24], which can provide more contextual information for the network.

For fractional interpolation, Yan et al. [25] proposed a CNN based method for fractional-pixel motion compensation. In their method, fractional-pixel motion compensation is formulated as an inter-picture regression problem and a CNN is utilized to solve the regression problem. Zhang et al. [26] proposed to regard fractional interpolation as an image generation task and use the real pixels at integer positions in the reference block to generate the fractional pixels. Furthermore, Zhang et al. [27] proposed to use the corresponding residual component and the collocated high quality component as compression priors in a CNN to boost the performance of fractional interpolation. Pham and Zhou [28] presented two CNN models of fractional interpolation for the luma and chroma components. A reference frame is first interpolated using discrete cosine transform interpolation filters and then fed into the CNN to avoid the motion shift problem.

Apart from these methods, deep learning is also used to improve inter prediction by refining the predicted block derived from motion compensation. Huo et al. [19] proposed a CNN based method for motion compensation refinement, which takes the predicted block and the spatial neighboring pixels of the current block as the input. Wang et al. [20] proposed a neural network based enhancement to inter prediction for HEVC to enhance motion compensation, in which a fully connected network is used to estimate the residue between the current block and its predicted block from their spatial neighbors. Then the sum of the residue and the predicted block is fed into a convolutional neural network to obtain a more accurate predicted block.

Besides, deep learning is also used for reference frame generation to improve inter prediction. Zhao et al. [29], [30] proposed to generate a high quality virtual reference frame from two reconstructed bi-prediction frames with a deep learning based frame rate up-conversion algorithm. A CTU level coding mode is proposed to select either the existing reference frames or the virtual reference frame. Choi et al. [31] proposed a frame prediction method using a CNN for both uni-prediction and bi-prediction. Two frames from the decoded picture buffer are fed into the CNN to generate a synthesized deep frame, which is directly used as a predictor for the current frame. Xia et al. [32] proposed a multiscale adaptive separable convolutional neural network to generate pixel-wise closer reference frames. Reconstruction losses are enforced on each scale to make the network infer the main structure at small scales. Huo et al. [33] proposed to align the reference frames for further extrapolation by a trained deep network, and this alignment can reduce the diversity of the network input. All of these methods aim to derive an additional reference frame using a neural network to provide a higher quality reference for inter prediction.

Among the above-mentioned methods, [21]–[24] focus on improving bi-prediction, [25]–[28] focus on fractional interpolation, and [29]–[33] focus on generating reference frames. Different from these methods, our work aims to enhance inter prediction after motion compensation, and it can be used together with these methods. Compared to our preliminary work [20], several improvements have been made in this paper. First, a combination network is proposed to concatenate the feature maps of the estimated residue and the predicted block, rather than directly adding the residue to the predicted block as in [20]. Therefore, the texture information in the predicted block can be fully utilized to guide the residue refinement in the deep refinement network. Second, the deep refinement network is redesigned to efficiently refine the estimated residue. Third, the proposed neural network is integrated in HEVC for all prediction unit (PU) partitionings, rather than only the 2N × 2N PU partitioning in [20]. All these improvements bring about 2% BD-rate reduction compared to the preliminary work under the LDP configuration.

III. NEURAL NETWORK BASED ENHANCEMENT TO INTER PREDICTION

In this section, the framework of the neural network based enhancement to inter prediction (NNIP) for video coding is first introduced. Then the three networks of NNIP, namely the residue estimation network, the combination network, and the deep refinement network, are detailed. Finally, the training data generation and the training strategy are described.

A. Framework of NNIP

Fig. 1 depicts the overall framework of NNIP. The inputs of NNIP are the spatial neighboring L-shapes of the current block and the predicted block, and the predicted block itself, denoted by L_C, L_P, and P as shown in Fig. 2, respectively. The predicted block is generated using motion compensation in traditional inter prediction. The output of NNIP is a refined residue. NNIP consists of three networks, namely the residue estimation network, the combination network, and the deep refinement network. As denoted in Fig. 1, the residue estimation network is used to estimate the residue between the current block and its predicted block. The combination network is used to extract the feature maps of the estimated residue and the predicted block and concatenate these feature maps together (Y). The deep refinement network is used to further refine the residue. Finally, the refined residue is added back to the predicted block to get a more accurate block (P').

Fig. 1. Framework of neural network based enhancement to inter prediction.
Fig. 2. Inputs of NNIP.

B. Residue Estimation Network

The residue estimation network aims to capture the variations between the current block and its predicted block. Since the current block has not been reconstructed, these variations cannot be derived directly. Thus, the residue estimation network is proposed to estimate the residue using the spatial neighboring L-shapes of the current block and its predicted block, denoted by L_C and L_P. In this paper, the residue estimation network is implemented as a fully connected network. As shown in Fig. 3, the inputs of the residue estimation network are L_C and L_P, which are composed of the left, above-left, and above neighboring pixels of the current block and its predicted block, respectively. The output of the residue estimation network is the estimated residue. The residue estimation network is composed of four fully connected layers. Each fully connected layer is followed by a non-linear activation layer except the last layer.

For a block of size N × N and an L-shape of width M, the first layer has a dimension of K = 4MN + 2M^2; L_C and L_P are reshaped into a K-dimensional vector for the input. The second and third layers have a dimension of 2K. The last layer has dimension N^2, and its output is reshaped to an N × N block.

Denote the residue estimation network by R. The input and the output of R are represented by X = {L_C, L_P} and R(X). R can be described as follows:

R_1(X) = f(W_1^(R) · X + B_1^(R))
R_i(X) = f(W_i^(R) · R_{i-1}(X) + B_i^(R)), 1 < i < 4
R(X) = W_4^(R) · R_3(X) + B_4^(R)    (1)

where W_i^(R) and B_i^(R) are the weight and bias parameters of layer i, f(·) is a non-linear mapping function, e.g., the parametric rectified linear unit (PReLU) [53], and · denotes the inner product.
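To make the layer dimensions concrete, the following is a minimal PyTorch-style sketch of the residue estimation network described above; it is not the authors' Caffe implementation. The class and variable names are illustrative, the flattening order of the L-shapes is an assumption, and the per-layer widths (K, 2K, 2K, N^2, matching (1) and the stated dimensions) are one plausible reading of the text.

```python
import torch
import torch.nn as nn

class ResidueEstimationNet(nn.Module):
    """Sketch of the four-layer fully connected residue estimation network R.
    Input: flattened L-shapes L_C and L_P (K = 4*M*N + 2*M*M values in total);
    output: an estimated N x N residue block."""
    def __init__(self, block_size_n: int, lshape_width_m: int):
        super().__init__()
        k = 4 * lshape_width_m * block_size_n + 2 * lshape_width_m ** 2
        self.block_size_n = block_size_n
        self.fc = nn.Sequential(
            nn.Linear(k, k), nn.PReLU(),              # layer 1 (dimension K)
            nn.Linear(k, 2 * k), nn.PReLU(),          # layer 2 (dimension 2K)
            nn.Linear(2 * k, 2 * k), nn.PReLU(),      # layer 3 (dimension 2K)
            nn.Linear(2 * k, block_size_n ** 2),      # layer 4 (N^2), no activation
        )

    def forward(self, l_c: torch.Tensor, l_p: torch.Tensor) -> torch.Tensor:
        # l_c and l_p are batched tensors whose flattened lengths sum to K
        x = torch.cat([l_c.flatten(1), l_p.flatten(1)], dim=1)
        r = self.fc(x)
        return r.view(-1, 1, self.block_size_n, self.block_size_n)
```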


Fig. 3. Residue estimation network.

C. Combination Network

In our preliminary work [20], the estimated residue is directly added to the predicted block and then fed into a convolutional neural network to get a more accurate predicted block. This summation may limit the learning capability of the neural network. In this paper, a combination network is designed to first extract feature maps of the estimated residue and the predicted block and then concatenate these feature maps together. Thus, the texture information in the predicted block can be fully utilized to guide the residue refinement in the deep refinement network.

Fig. 4 depicts the structure of the combination network. The inputs are the estimated residue and the predicted block. The output is the concatenated feature maps. The combination network is implemented as a convolutional neural network composed of a convolutional layer and a concatenation layer. The convolutional layer is followed by a non-linear activation layer (i.e., PReLU), and its number of feature maps and filter size are set to 64 and 3 × 3.

Fig. 4. Combination network.

D. Deep Refinement Network

Fig. 5 depicts the structure of the deep refinement network. The input is the concatenated feature maps derived from the combination network. The output is a refined residue block. The deep refinement network is implemented as a convolutional neural network composed of an input convolutional layer, two convolutional blocks, and an output convolutional layer. Each convolutional block is composed of three convolutional layers. Each convolutional layer is followed by a non-linear activation layer (i.e., PReLU) except the output layer.

For the first layer of each convolutional block, the kernel sizes are set to 1 × 1 and 3 × 3 and the number of feature maps is set to 32. The purpose of using different kernel sizes is to effectively aggregate contextual feature information over both short and long distances. For the other layers, the kernel size and the number of feature maps are set to 3 × 3 and 64, respectively (except that the number of feature maps is set to 1 in the output layer).

Fig. 5. Deep refinement network.

Denote the deep refinement network by D. The input and output of D are represented by Y and D(Y). D can be described as follows:

D_1(Y) = f(W_1^(D) ∗ Y + B_1^(D))
D_i(Y) = f(W_i^(D) ∗ D_{i-1}(Y) + B_i^(D)), 1 < i < d
D(Y) = W_d^(D) ∗ D_{d-1}(Y) + B_d^(D)    (2)

where W_i^(D) and B_i^(D) are the weight and bias parameters of layer i, f(·) is the non-linear mapping function (PReLU), ∗ denotes the convolution operation, and d is the network depth of D (d is set to 8 in this paper). Note that D_2 and D_5, the first convolutional layers of the two convolutional blocks in Fig. 5, each have two sub convolutional layers with different kernel sizes whose outputs are concatenated; this can be represented by D_2 = concat(D_2^{1×1}, D_2^{3×3}) and D_5 = concat(D_5^{1×1}, D_5^{3×3}).

The output of the deep refinement network is a refined residue block, which is added back to the predicted block to get a more accurate predicted block:

P' = D(Y) + P    (3)
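As a companion to (2) and (3), the sketch below shows one way the combination network (Section III-C) and the deep refinement network (Section III-D) could be wired together, again as PyTorch-style pseudocode rather than the authors' Caffe models. The use of separate convolutions for the two inputs of the combination network, the resulting 128-channel Y, and the per-branch channel counts of the dual-kernel layers (32 each, concatenated to 64) are assumptions where the text leaves details open.

```python
import torch
import torch.nn as nn

class CombinationNet(nn.Module):
    """Sketch: one 3x3 conv (64 maps) + PReLU per input, outputs concatenated.
    Whether the two inputs share convolution weights is not specified; here
    they are kept separate."""
    def __init__(self):
        super().__init__()
        self.conv_res = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.PReLU())
        self.conv_pred = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.PReLU())

    def forward(self, est_residue, pred_block):
        return torch.cat([self.conv_res(est_residue),
                          self.conv_pred(pred_block)], dim=1)   # Y: 128 channels

class DualKernelLayer(nn.Module):
    """First layer of each convolutional block: parallel 1x1 and 3x3 convolutions
    (assumed 32 maps each), concatenated to 64 maps, followed by PReLU."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 32, 1)
        self.conv3 = nn.Conv2d(in_ch, 32, 3, padding=1)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(torch.cat([self.conv1(x), self.conv3(x)], dim=1))

class DeepRefinementNet(nn.Module):
    """Sketch of the d = 8 layer refinement network: input conv D1, two blocks
    (D2-D4 and D5-D7), and output conv D8 producing a one-channel residue."""
    def __init__(self, in_ch=128):
        super().__init__()
        self.d1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, padding=1), nn.PReLU())
        self.block1 = nn.Sequential(DualKernelLayer(64),
                                    nn.Conv2d(64, 64, 3, padding=1), nn.PReLU(),
                                    nn.Conv2d(64, 64, 3, padding=1), nn.PReLU())
        self.block2 = nn.Sequential(DualKernelLayer(64),
                                    nn.Conv2d(64, 64, 3, padding=1), nn.PReLU(),
                                    nn.Conv2d(64, 64, 3, padding=1), nn.PReLU())
        self.d8 = nn.Conv2d(64, 1, 3, padding=1)       # output layer, no activation

    def forward(self, y, pred_block):
        refined_residue = self.d8(self.block2(self.block1(self.d1(y))))
        return pred_block + refined_residue            # P' = D(Y) + P, as in (3)
```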


E. Training Data Generation

Denote (x_i, y_i) as a training sample, where i is the index of the training data, x_i represents the inputs of NNIP consisting of the two L-shapes L_Ci, L_Pi and the predicted block P_i, and y_i represents the ground truth, which is the original signal of the current block. x_i is extracted from the compressed bitstreams and y_i is extracted from the original video sequences.

We use ten 4K video sequences to generate the training data. These sequences are derived from the SJTU dataset [54]. All of them are YUV 4:2:0 color sampled, with a resolution of 3840 × 2160 and a frame rate of 30 fps. Abundant textures exist in these video sequences, which makes them suitable for training the network. To better adapt to video contents with different resolutions, the ten 4K sequences are also downsampled to five HEVC resolutions (2560 × 1600, 1920 × 1080, 1280 × 720, 832 × 480, and 416 × 240). Thus there are 60 video sequences with different resolutions in total to generate the training data. All these video sequences are encoded using HM 16.9 with the low delay P (LDP) configuration [55]. The quantization parameters (QPs) used to encode these sequences are 22, 27, 32, and 37. The first 100 frames of each sequence are encoded and the inputs of the network are obtained from the compressed bitstreams. In addition, not all blocks encoded in inter mode are used for training; only the blocks with relatively complex texture (e.g., standard deviation ≥ 2) are used, because smooth blocks are already handled well by traditional inter prediction in HEVC. Blocks of size 8 × 8, 16 × 16, 32 × 32, and 64 × 64 for the different CUs are used. There are approximately 200,000 training patches for each block size and each QP.

F. Training Strategy

Learning the mapping F of NNIP requires estimating the weight parameters of the residue estimation network, the combination network, and the deep refinement network. Specifically, given a collection of n training samples (x_i, y_i), the mean squared error (MSE) is used as the loss function, since the sum of squared errors (SSE) is used in the rate distortion optimization of video coding. The loss function is formulated as follows:

L(Θ) = (1/n) Σ_{i=1}^{n} ||(F(x_i | Θ) + P_i) − y_i||²    (4)

where Θ = {W^(R), B^(R), W^(C), B^(C), W^(D), B^(D)}, and {W^(R), B^(R)}, {W^(C), B^(C)}, and {W^(D), B^(D)} are the parameters of the residue estimation network, the combination network, and the deep refinement network, respectively. n is the total number of training samples.

NNIP is trained using Caffe [56] on an NVIDIA GeForce GTX 1080 GPU. All weights of the convolutional filters are initialized as in [57], and all biases are initialized to 0. The loss function is minimized using the first-order gradient based optimization method Adam [58]. A batch-mode learning method is adopted with a batch size of 64. The first momentum of Adam is set to 0.9 and the second momentum is set to 0.99. The base learning rate decays exponentially from 0.1 to 0.0001, changing every 40 epochs, so the training takes 160 epochs in total. The model for QP = 37 is trained with the base learning rate 0.1. The models for the other QPs (22, 27, and 32) are fine-tuned from the model for QP = 37 with a base learning rate of 0.001. In addition, we train different models for the different CU sizes (8 × 8, 16 × 16, 32 × 32, and 64 × 64). Therefore, there are 16 models in NNIP in total. The memory required to store the network parameters is 291.2 MB in total; the models for the 8 × 8, 16 × 16, 32 × 32, and 64 × 64 blocks require 0.3 MB, 2.6 MB, 11.2 MB, and 58.7 MB, respectively.

Taking blocks of size 16 × 16 as an example, the change of the test loss with an increasing number of iterations for QP = 37 is depicted in Fig. 6(a). The test loss at iteration 0 is excluded due to its much higher magnitude. The test loss decreases smoothly before 15,000 iterations and converges to a relatively small value after the base learning rate is changed several times. As shown in Fig. 6(b), the networks for QP = 22, 27, and 32 are fine-tuned from the trained model for QP = 37, and their test loss converges to a small value quickly; in particular, the test loss at iteration 0 is already small.
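A minimal sketch of the training objective in (4) and the optimizer settings of this subsection is given below, written in PyTorch style rather than the Caffe setup actually used by the paper. It assumes the three sub-networks are bundled into a single `model` whose forward pass already returns F(x) + P; the helper names and the per-epoch stepping of the scheduler are illustrative.

```python
import torch

def nnip_training_loss(model, l_c, l_p, pred_block, target_block):
    """Loss of (4): MSE between the enhanced prediction F(x) + P and the
    original block y. `model(l_c, l_p, pred_block)` is assumed to return P'."""
    refined = model(l_c, l_p, pred_block)
    return torch.mean((refined - target_block) ** 2)

def make_optimizer(model):
    """Adam with momenta (0.9, 0.99), base LR 0.1 for the QP = 37 model,
    decayed by 10x every 40 epochs (step the scheduler once per epoch)."""
    opt = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.99))
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=40, gamma=0.1)
    return opt, sched
```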


Fig. 6. Test loss of 16 × 16 CUs with QP = 22, 27, 32, 37.
Fig. 7. The diagram of NNIP integrated in HEVC.

IV. INTEGRATION IN HEVC

To evaluate the efficiency of NNIP, we integrate it into HEVC. In the HEVC main profile, the CU size varies from 8 × 8 to 64 × 64. When NNIP is integrated into HEVC, the trained models for the 8 × 8, 16 × 16, 32 × 32, and 64 × 64 block sizes are used for the corresponding CU sizes. Fig. 7 depicts the diagram of the encoder and the decoder when NNIP is integrated in HEVC.

Several aspects should be considered when integrating NNIP into HEVC. First, NNIP is used in CUs with all PU partitionings (i.e., 2N × 2N, 2N × N, N × 2N, N × N, 2N × nU, 2N × nD, nL × 2N, and nR × 2N). Second, L_C consists of the left and above neighboring reconstructed pixels of the current CU, and P is derived using traditional motion estimation and motion compensation in HEVC. For a CU with only one PU, L_P is derived using the motion vector of that PU, in the same way as traditional motion compensation. For a CU with multiple PUs, L_P is derived using the motion vector of the first PU; the reason is that L_P is composed of the left and above neighboring pixels, which are nearest to the first PU. Third, NNIP is integrated not only in the normal inter mode but also in the skip/merge mode, where it is used for all merge candidates.

Moreover, when integrated into HEVC, NNIP competes with traditional inter prediction through rate distortion optimization, and a CU level flag is set to indicate whether NNIP is used. In this paper, only the luma component is processed by NNIP.
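The following sketch illustrates, under simplifying assumptions, how the L-shapes and the CU-level NNIP decision described in this section could be realized at the encoder; it is not the HM integration itself. It assumes an integer-pixel motion vector, a single PU, padded frames so the neighbor indexing is valid, and stand-in callables `nnip_model` and `rd_cost` for the network forward pass and the encoder's rate-distortion cost (which would include the flag bit).

```python
import numpy as np

def extract_lshape(frame, top, left, n, m):
    """Gather the above-left (m x m), above (m x n), and left (n x m) neighbors
    of an n x n block whose top-left corner is (top, left), as one flat vector
    of length 2*m*n + m*m."""
    above_left = frame[top - m:top, left - m:left]
    above      = frame[top - m:top, left:left + n]
    left_nb    = frame[top:top + n, left - m:left]
    return np.concatenate([above_left.ravel(), above.ravel(), left_nb.ravel()])

def nnip_cu_decision(recon, reference, mv, top, left, n, m, nnip_model, rd_cost):
    """Sketch of Section IV for a single-PU CU: L_C comes from the reconstructed
    neighbors of the current CU, P and L_P come from the reference picture
    displaced by the motion vector, and the NNIP-enhanced prediction competes
    with plain motion compensation through the RD cost; the winner sets the
    CU-level flag."""
    ref_top, ref_left = top + mv[1], left + mv[0]
    pred = reference[ref_top:ref_top + n, ref_left:ref_left + n]
    l_c = extract_lshape(recon, top, left, n, m)
    l_p = extract_lshape(reference, ref_top, ref_left, n, m)
    enhanced = nnip_model(l_c, l_p, pred)              # P' = P + refined residue
    use_nnip = rd_cost(enhanced) < rd_cost(pred)       # CU-level flag by RDO
    return (enhanced if use_nnip else pred), use_nnip
```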


V. EXPERIMENTAL RESULTS

In this section, extensive experiments are conducted to evaluate the performance of NNIP. First, the experimental settings are introduced. Then the experimental results compared to HM 16.9 are given, followed by an ablation experiment on the network structure. After that, the experimental results compared to the state-of-the-art methods are provided. Finally, the coding complexity is discussed.

A. Experimental Settings

NNIP is integrated in HM 16.9 to evaluate its performance under three configurations: low delay P (LDP), low delay B (LDB), and random access (RA), as suggested in [55]. In total, 18 natural sequences with 8 bit depth are tested in our experiments, including class A (2560 × 1600), class B (1920 × 1080), class C (832 × 480), class D (416 × 240), and class E (1280 × 720). Four sequences in class F are also used to evaluate the performance of NNIP on screen content videos. All frames in these sequences are used in the experiments, except for the evaluation of the network structure in Subsection V-C, where the first 64 frames of each sequence are tested. Note that there is no overlap between the training and testing video sequences. About 200,000 blocks are used to train the network models for each block size and each QP. The QPs used in our experiments are 22, 27, 32, and 37. Intel i7-6700 3.4 GHz quad-core processors with 64 GB memory and the Microsoft Windows Server 2012 R2 operating system are used. Both HM 16.9 and the proposed algorithm are compiled with Microsoft Visual Studio 2013. When NNIP is integrated in HM 16.9, the feed-forward operation of the network is processed with the GPU version of Caffe [56].

TABLE I. CODING PERFORMANCE OF NNIP COMPARED TO HM 16.9
TABLE II. CODING PERFORMANCE OF NNIP FOR SCREEN CONTENT COMPARED TO HM 16.9

B. Comparison With HM 16.9

The Bjøntegaard Delta rate (BD-rate) [59] using piecewise cubic interpolation, as typically used in video coding, is calculated to evaluate the coding performance of NNIP. A negative number indicates bitrate saving and a positive number indicates bitrate increase for the same quality. The coding performance of NNIP for each sequence is tabulated in Table I. Compared to HM 16.9, the BD-rates of NNIP for the {Y, U, V} components are {−4.6%, −0.6%, −0.4%}, {−3.0%, 0.4%, 0.3%}, and {−2.7%, −0.9%, −0.9%} on average under the LDP, LDB, and RA configurations, respectively. NNIP achieves coding gain for all sequences, and it obtains more coding gain for video sequences with higher resolution. This is because many large CUs exist in higher-resolution sequences, and these benefit from the improved accuracy of NNIP for large blocks. In addition, the coding performance under LDB and RA is lower than under LDP, because bi-prediction is adopted in RA and LDB and achieves more accurate prediction, resulting in inherently better coding performance for these configurations. Since NNIP is only applied to the luma component, only minor changes of coding performance are observed for the chroma components.

Apart from natural sequences, the coding performance of NNIP is also evaluated on screen content sequences. As shown in Table II, compared to HM 16.9, the BD-rates of NNIP for the luma component are −1.5%, −0.7%, and −0.5% on average under the LDP, LDB, and RA configurations, respectively. The coding performance on screen content sequences is lower than on natural sequences: screen content sequences have quite different characteristics from natural sequences and are not used to train the network. If NNIP is deployed in screen content application scenarios, it is possible and straightforward to retrain the network using data from screen content sequences.
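For reference, a rough sketch of the piecewise-cubic BD-rate computation [59] used for comparisons such as those in Tables I and II is given below. The specific choices here (PCHIP interpolation of log-rate over PSNR and trapezoidal averaging over the overlapping PSNR interval) reflect common practice rather than details taken from this paper, and the function assumes four (rate, PSNR) points per codec with distinct PSNR values.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average relative rate difference (%) between a test codec and a
    reference codec at equal quality; negative values mean bitrate saving."""
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    f_ref = PchipInterpolator(np.sort(psnr_ref), lr_ref[np.argsort(psnr_ref)])
    f_test = PchipInterpolator(np.sort(psnr_test), lr_test[np.argsort(psnr_test)])
    lo = max(min(psnr_ref), min(psnr_test))            # overlapping PSNR range
    hi = min(max(psnr_ref), max(psnr_test))
    grid = np.linspace(lo, hi, 200)
    avg_log_diff = np.trapz(f_test(grid) - f_ref(grid), grid) / (hi - lo)
    return (10 ** avg_log_diff - 1.0) * 100.0
```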


TABLE III. CODING PERFORMANCE OF DIFFERENT NETWORK STRUCTURES COMPARED TO HM 16.9 UNDER LDP CONFIGURATION
TABLE IV. CODING PERFORMANCE OF NNIP UNDER VARIOUS QPS
TABLE V. THE USAGE RATES OF NNIP

C. Evaluation of Network Structure

In our preliminary work [20], a similar neural network structure is proposed to improve inter prediction. There are three main differences between [20] and the method proposed in this paper. First, the network in [20] is only applied to CUs with the 2N × 2N PU partitioning, while the proposed method is applied to CUs with all PU partitionings. Second, in [20] an add operation is used to combine the estimated residue and the predicted block, while here a combination network is designed to combine them; with the combination network, the texture information in the predicted block can be fully utilized to guide the residue refinement. Third, the variable-filter-size convolutional neural network (VRCNN) [41] is directly used as the deep refinement network in [20], while a more efficient deep refinement network is designed in this paper. To further evaluate the network structure of NNIP, the network of the preliminary work [20] and two variations of it (Variation 1 = [20] applied to CUs with all PU partitionings, Variation 2 = Variation 1 + the combination network) are tested. For a fair comparison, [20] and these two variations are trained using the same dataset as NNIP in this paper. In this subsection, the first 64 frames of each video sequence are tested under the LDP configuration, as in [20].

Table III tabulates the coding performance of these four methods. First, Variation 1 achieves 0.3% more coding gain on average than [20], which demonstrates the benefit of applying the network to all PU partitionings. Second, Variation 2 achieves 0.9% more coding gain on average than Variation 1, with 14% encoding time and 38% decoding time increase, which demonstrates the efficiency of the combination network. Third, NNIP achieves 1.1% more coding gain on average than Variation 2, which demonstrates the efficiency of the deep refinement network. Therefore, these results demonstrate the efficiency of the proposed network structure.

In the previous experiments, the normal QPs are set to {22, 27, 32, 37}, as recommended by the HEVC common test conditions (CTC). To verify the generalization ability of NNIP under different QPs, the coding performance under both small QPs (20, 25, 30, 35) and large QPs (24, 29, 34, 39) is also tested. The same trained models from the normal QPs are used for this test, in which the trained model with the closest QP is applied to a particular testing QP, as follows:

M = M22 if QP < 25; M27 if 25 ≤ QP < 30; M32 if 30 ≤ QP < 35; M37 if QP ≥ 35    (5)

where M denotes the model used for a particular QP, and M22, M27, M32, and M37 denote the models trained with the normal CTC QPs. Table IV tabulates the BD-rates of the luma component for these tests. As shown in Table IV, the average coding gains are 4.4%, 4.4%, and 4.3% for the three different QP settings. These comparisons demonstrate the robustness of NNIP across different bitrates and QPs.

Furthermore, a CU level flag is used to indicate whether NNIP is used. The proportion of the flags in the bitstream is approximately 2.3% on average. The coding performance without signaling the flags would be −7.5% BD-rate saving on average.

Table V shows the usage ratios of NNIP for each video sequence under the LDP configuration. For each frame of a video sequence, the usage ratio is defined as follows:

η = (Σ_{i=0}^{3} n_i × N_i²) / (W × H) × 100%    (6)

where η denotes the usage ratio, W and H denote the width and height of a frame, and n_i denotes the number of CUs coded by NNIP with size N_i × N_i at depth i (N_i = 64, 32, 16, and 8 for i = 0, 1, 2, 3).

Fig. 8. The distribution of CUs coded by HEVC and NNIP. The orange CUs are coded by traditional inter/intra prediction in HEVC and the light blue CUs are coded by NNIP.

The usage ratio in Table V is the average value over the four QPs and all 64 encoded frames, including the I frame. As shown in Table V, the usage ratio is about 25% on average. The usage ratio varies across sequences: for instance, it reaches up to 71.5% for Kimono, whereas it is only about 3.0% for BQSquare. The coding performance is relatively better for video sequences with higher usage ratios.

Fig. 8 also shows the visual distribution of CUs coded by NNIP. The orange CUs are coded by traditional inter/intra prediction and the light blue CUs are coded by NNIP. It is observed that regions with complex texture are most likely to be coded by NNIP. Furthermore, rate-distortion curves (RD curves) of several typical video sequences are shown in Fig. 9. It is clear that NNIP outperforms HM 16.9 and improves the coding performance of HEVC significantly in both low bitrate and high bitrate scenarios.

D. Comparison With the State-of-the-Art Methods

To further evaluate the coding performance of NNIP, it is compared with the latest deep learning based inter prediction methods [19], [21]–[28]. [19] is the most closely related work to NNIP, also using a network to improve inter prediction after motion compensation. As shown in Table VI, [19] achieves 2.7% BD-rate reduction on average under the LDP configuration, while NNIP achieves 4.6% BD-rate reduction on average. In [19], the predicted block and the spatial neighboring pixels of the current block are fed into the network, whereas NNIP is fed with the predicted block and the spatial neighboring pixels of both the current block and the predicted block. As more contextual information is utilized by NNIP, it achieves better coding performance.

[21]–[24] are designed for bi-prediction, while [25]–[28] are used for fractional interpolation. As shown in Table VI, [21], [22], [23], and [24] achieve 3.1%, 3.0%, 3.5%, and 5.1% BD-rate reduction on average under the RA configuration, and [25], [26], [27], and [28] achieve 4.3%, 1.0%, 5.3%, and 3.7% BD-rate reduction on average under the LDP configuration. NNIP achieves 4.6% and 2.7% BD-rate reduction under the LDP and RA configurations. Moreover, NNIP can be applied on top of the bi-prediction methods (e.g., [21]–[24]) and the fractional interpolation methods (e.g., [25]–[28]) to achieve better coding performance: for example, the predicted block can first be generated using one of [21]–[28] and then fed into NNIP to further improve inter prediction.

E. Computational Complexity

Table VII tabulates the encoding and decoding complexities of NNIP. The computational complexity is evaluated by the time increase, which is defined as follows:

ΔT = (T_p − T_o) / T_o × 100%    (7)

where ΔT denotes the encoding (decoding) time increase, and T_o and T_p denote the encoding (decoding) time of HM 16.9 and NNIP, respectively.


Fig. 9. Rate-distortion (R-D) curves of several typical video sequences under LDP configuration.


TABLE VI. EXPERIMENTAL RESULTS COMPARED WITH STATE-OF-THE-ART METHODS
TABLE VII. THE COMPUTATIONAL COMPLEXITY OF NNIP

As shown in Table VII, the encoding time of NNIP increases by approximately 2444%, 1727%, and 1599% on average for the LDP, LDB, and RA configurations, respectively. The decoding time increases by approximately 4707%, 3227%, and 1794% on average for the LDP, LDB, and RA configurations, respectively. This high complexity mainly comes from the feed-forward operation of the neural network and the multiple rounds of rate distortion optimization. With the development of dedicated deep learning chips, the complexity is expected to be reduced in the future.

VI. CONCLUSION

In this paper, a neural network based enhancement to inter prediction is proposed for video coding. It is composed of three networks, namely a residue estimation network, a combination network, and a deep refinement network. These three networks are jointly used to generate a more accurate predicted block from the neighboring pixels of the current block and the predicted block. The experimental results demonstrate that the proposed method can achieve 4.6%, 3.0%, and 2.7% BD-rate reduction on average under the LDP, LDB, and RA test conditions compared to HEVC. In the future, several aspects of NNIP need to be further studied, including reducing the complexity, reducing the memory footprint of the models through a QP-independent method, and improving the coding performance for screen content video.

REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[3] H. Liu, Y. Chen, J. Chen, L. Zhang, and M. Karczewicz, Local Illumination Compensation, document VCEG-AZ06, Jun. 2015.
[4] N. Zhang, Y. Lu, X. Fan, R. Xiong, D. Zhao, and W. Gao, "Enhanced inter prediction with localized weighted prediction in HEVC," in Proc. Vis. Commun. Image Process. (VCIP), Dec. 2015, pp. 1–4.
[5] C. Auyeung, J. J. Kosmach, M. T. Orchard, and T. Kalafatis, "Overlapped block motion compensation," Proc. SPIE, vol. 1818, pp. 561–572, Nov. 1992.
[6] Z. Wang, S. Wang, J. Zhang, and S. Ma, "Adaptive progressive motion vector resolution selection based on rate–distortion optimization," IEEE Trans. Image Process., vol. 26, no. 1, pp. 400–413, Jan. 2017.
[7] E. Alshina, A. Alshin, K. Choi, and M. Park, Performance of JEM 1 Tools Analysis, document JVET-B0022, Feb. 2016.
[8] N. Zhang, X. Fan, D. Zhao, and W. Gao, "Merge mode for deformable block motion information derivation," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 11, pp. 2437–2449, Nov. 2017.
[9] L. Li et al., "An efficient four-parameter affine motion model for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 8, pp. 1934–1948, Aug. 2018.
[10] K. Zhang, Y.-W. Chen, L. Zhang, W.-J. Chien, and M. Karczewicz, "An improved framework of affine motion compensation in video coding," IEEE Trans. Image Process., vol. 28, no. 3, pp. 1456–1469, Mar. 2019.
[11] D. Liu, Z. Chen, S. Liu, and F. Wu, "Deep learning-based technology in responses to the joint call for proposals on video compression with capability beyond HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 5, pp. 1267–1280, May 2020.
[12] D. Liu, Y. Li, J. Lin, H. Li, and F. Wu, "Deep learning-based video coding: A review and a case study," ACM Comput. Surv., vol. 53, no. 1, pp. 1–35, May 2020.


[13] W. Cui et al., "Convolutional neural networks based intra prediction for HEVC," in Proc. Data Compress. Conf. (DCC), Apr. 2017, p. 436.
[14] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, "Fully connected network-based intra prediction for image coding," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3236–3247, Jul. 2018.
[15] Y. Hu, W. Yang, M. Li, and J. Liu, "Progressive spatial recurrent neural network for intra prediction," IEEE Trans. Multimedia, vol. 21, no. 12, pp. 3024–3037, Dec. 2019.
[16] T. Dumas, A. Roumy, and C. Guillemot, "Context-adaptive neural network-based prediction for image compression," IEEE Trans. Image Process., vol. 29, pp. 679–693, 2020.
[17] J. Pfaff et al., "Neural network based intra prediction for video coding," Proc. SPIE, vol. 10752, Sep. 2018, Art. no. 1075213.
[18] Y. Wang, X. Fan, S. Liu, D. Zhao, and W. Gao, "Multi-scale convolutional neural network-based intra prediction for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1803–1815, Jul. 2020.
[19] S. Huo, D. Liu, F. Wu, and H. Li, "Convolutional neural network-based motion compensation refinement for video coding," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2018, pp. 1–4.
[20] Y. Wang, X. Fan, C. Jia, D. Zhao, and W. Gao, "Neural network based inter prediction for HEVC," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2018, pp. 1–6.
[21] Z. Zhao, S. Wang, S. Wang, X. Zhang, S. Ma, and J. Yang, "CNN-based bi-directional motion compensation for high efficiency video coding," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2018, pp. 1–4.
[22] Z. Zhao, S. Wang, S. Wang, X. Zhang, S. Ma, and J. Yang, "Enhanced bi-prediction with convolutional neural network for high-efficiency video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 11, pp. 3291–3301, Nov. 2019.
[23] J. Mao, H. Yu, X. Gao, and L. Yu, "CNN-based bi-prediction utilizing spatial information for video coding," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2019, pp. 1–5.
[24] J. Mao and L. Yu, "Convolutional neural network based bi-prediction utilizing spatial and temporal information in video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1856–1870, Jul. 2020.
[25] N. Yan, D. Liu, H. Li, B. Li, L. Li, and F. Wu, "Convolutional neural network-based fractional-pixel motion compensation," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 3, pp. 840–853, Mar. 2019.
[26] H. Zhang, L. Li, L. Song, X. Yang, and Z. Li, "Advanced CNN based motion compensation fractional interpolation," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 709–713.
[27] H. Zhang, L. Song, L. Li, Z. Li, and X. Yang, "Compression priors assisted convolutional neural network for fractional interpolation," IEEE Trans. Circuits Syst. Video Technol., early access, Jul. 22, 2020, doi: 10.1109/TCSVT.2020.3011197.
[28] C. D.-K. Pham and J. Zhou, "Deep learning-based luma and chroma fractional interpolation in video coding," IEEE Access, vol. 7, pp. 112535–112543, 2019.
[29] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced CTU-level inter prediction with deep frame rate up-conversion for high efficiency video coding," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 206–210.
[30] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced motion-compensated video coding with deep virtual reference frame generation," IEEE Trans. Image Process., vol. 28, no. 10, pp. 4832–4844, Oct. 2019.
[31] H. Choi and I. V. Bajić, "Deep frame prediction for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1843–1855, Jul. 2020.
[32] S. Xia, W. Yang, Y. Hu, and J. Liu, "Deep inter prediction via pixel-wise motion oriented reference generation," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 1710–1774.
[33] S. Huo, D. Liu, B. Li, S. Ma, F. Wu, and W. Gao, "Deep network-based frame extrapolation with reference frame alignment," IEEE Trans. Circuits Syst. Video Technol., early access, May 18, 2020, doi: 10.1109/TCSVT.2020.2995243.
[34] W.-S. Park and M. Kim, "CNN-based in-loop filtering for coding efficiency improvement," in Proc. IEEE 12th Image, Video, Multidimensional Signal Process. Workshop (IVMSP), Jul. 2016, pp. 1–5.
[35] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual highway convolutional neural networks for in-loop filtering in HEVC," IEEE Trans. Image Process., vol. 27, no. 8, pp. 3827–3841, Aug. 2018.
[36] C. Jia et al., "Content-aware convolutional neural network for in-loop filtering in high efficiency video coding," IEEE Trans. Image Process., vol. 28, no. 7, pp. 3343–3356, Jul. 2019.
[37] Y. Wang, Z. Chen, Y. Li, L. Zhao, S. Liu, and X. Li, AHG9: Dense Residual Convolutional Neural Network Based In-Loop Filter, document JVET-L0242, ISO/IEC JTC 1/SC 29/WG 11, Joint Video Exploration Team, Macao, China, Oct. 2018.
[38] K. Kawamura, Y. Kidani, and S. Naito, AHG9: Convolution Neural Network Filter, document JVET-L0383, ISO/IEC JTC 1/SC 29/WG 11 Joint Video Exploration Team (JVET) 12th Meeting, Macao, Oct. 2018.
[39] S. Liu et al., JVET AHG Report: Neural Networks in Video Coding (AHG9), document JVET-M0009, ISO/IEC JTC 1/SC 29/WG 11 Joint Video Exploration Team (JVET) 13th Meeting, Marrakech, Jan. 2019.
[40] C. Li, L. Song, R. Xie, and W. Zhang, "CNN based post-processing to improve HEVC," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 4577–4580.
[41] Y. Dai, D. Liu, and F. Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in Proc. Int. Conf. Multimedia Model., Cham, Switzerland: Springer, 2017, pp. 28–39.
[42] T. Wang, M. Chen, and H. Chao, "A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC," in Proc. Data Compress. Conf. (DCC), Apr. 2017, pp. 410–419.
[43] R. Yang, M. Xu, and Z. Wang, "Decoder-side HEVC quality enhancement with scalable convolutional neural network," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2017, pp. 817–822.
[44] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, "Enhancing quality for HEVC compressed videos," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 7, pp. 2039–2054, Jul. 2019.
[45] X. He, Q. Hu, X. Zhang, C. Zhang, W. Lin, and X. Han, "Enhancing HEVC compressed videos with a partition-masked convolutional neural network," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 216–220.
[46] R. Yang, M. Xu, Z. Wang, and T. Li, "Multi-frame quality enhancement for compressed video," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6664–6673.
[47] Y. Li et al., "Convolutional neural network-based block up-sampling for intra frame coding," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 9, pp. 2316–2330, Sep. 2018.
[48] J. Lin, D. Liu, H. Yang, H. Li, and F. Wu, "Convolutional neural network-based block up-sampling for HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 12, pp. 3701–3715, Dec. 2019.
[49] S. Puri, S. Lasserre, and P. L. Callet, "CNN-based transform index prediction in multiple transforms framework to assist entropy coding," in Proc. 25th Eur. Signal Process. Conf. (EUSIPCO), Aug. 2017, pp. 798–802.
[50] R. Song, D. Liu, H. Li, and F. Wu, "Neural network-based arithmetic coding of intra prediction modes in HEVC," in Proc. IEEE Vis. Commun. Image Process. (VCIP), Dec. 2017, pp. 1–4.
[51] L. Zhu, G. Wang, G. Teng, Z. Yang, and L. Zhang, "A deep learning based perceptual bit allocation scheme on conversational videos for HEVC λ-domain rate control," in Proc. Int. Forum Digit. TV Wireless Multimedia Commun., Singapore: Springer, 2017, pp. 515–524.
[52] B. Xu, X. Pan, Y. Zhou, Y. Li, D. Yang, and Z. Chen, "CNN-based rate-distortion modeling for H.265/HEVC," in Proc. IEEE Vis. Commun. Image Process. (VCIP), Dec. 2017, pp. 1–4.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[54] L. Song, X. Tang, W. Zhang, X. Yang, and P. Xia, "The SJTU 4K video sequence dataset," in Proc. 5th Int. Workshop Qual. Multimedia Exper. (QoMEX), Jul. 2013, pp. 34–35.
[55] F. Bossen, Common Test Conditions and Software Reference Configurations, document JCTVC-L1100, Jan. 2013.
[56] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, Nov. 2014, pp. 675–678.
[57] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[58] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), Y. Bengio and Y. LeCun, Eds., San Diego, CA, USA, May 2015, pp. 1–15. [Online]. Available: http://arxiv.org/abs/1412.6980
[59] G. Bjøntegaard, Improvements of the BD-PSNR Model, document VCEG-AI11, ITU-T Video Coding Experts Group, Heinrich-Hertz-Institute, Berlin, Germany, Jul. 2008.


Yang Wang received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology (HIT), Harbin, China, in 2012, 2014, and 2019, respectively. From 2013 to 2014, he was with the School of Electronics Engineering and Computer Science, Peking University, Beijing, as a Research Assistant. From 2014 to 2016, he was with the Media Computing Group, Microsoft Research Asia, Beijing, as an Intern. From 2018 to 2019, he was with the Peng Cheng Laboratory, Shenzhen, as an Intern. His current research interests are in image processing, video coding, and deep learning.

Xiaopeng Fan (Senior Member, IEEE) received the B.S. and M.S. degrees from the Harbin Institute of Technology (HIT), Harbin, China, in 2001 and 2003, respectively, and the Ph.D. degree from the Hong Kong University of Science and Technology, Hong Kong, in 2009. In 2009, he joined HIT, where he is currently a Professor. From 2003 to 2005, he was with Intel Corporation (China) as a Software Engineer. From 2011 to 2012, he was with Microsoft Research Asia as a Visiting Researcher. From 2015 to 2016, he was with HKUST as a Research Assistant Professor. He has authored one book and more than 100 articles in refereed journals and conference proceedings. His current research interests include video coding and transmission, image processing, and computer vision. He served as the Program Chair of PCM 2017, the Chair of the IEEE SGC 2015, and the Co-Chair of MCSN 2015. He has been an Associate Editor of the IEEE 1857 standard since 2012 and was awarded Outstanding Contributions to the Development of the IEEE Standard 1857 by the IEEE in 2013.

Ruiqin Xiong (Senior Member, IEEE) received the B.S. degree in computer science from the University of Science and Technology of China in 2001, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, in 2007. He was a Research Intern with Microsoft Research Asia from 2002 to 2007 and a Senior Research Associate with the University of New South Wales, Australia, from 2007 to 2009. He joined the School of Electronic Engineering and Computer Science, Institute of Digital Media, Peking University, in 2010, where he is currently a Professor. He has published more than 110 technical articles in refereed international journals and conferences. His research interests include statistical image modeling, deep learning, image and video processing, and compression and communications. He was a recipient of the Best Student Paper Award from the SPIE Conference on Visual Communications and Image Processing in 2005 and the Best Paper Award from the IEEE Visual Communications and Image Processing in 2011. He was also a co-recipient of the Best Student Paper Award from the IEEE Visual Communications and Image Processing in 2017.

Debin Zhao (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, China, in 1985, 1988, and 1998, respectively. He is currently a Professor with the Department of Computer Science, Harbin Institute of Technology. He has published over 200 technical articles in refereed journals and conference proceedings in the areas of image and video coding, video processing, video streaming and transmission, and computer vision.

Wen Gao (Fellow, IEEE) received the Ph.D. degree in electronics engineering from the University of Tokyo, Tokyo, Japan, in 1991. He is currently a Professor of Computer Science with Peking University, Beijing, China. Before joining Peking University, he was a Professor of Computer Science with the Harbin Institute of Technology, Harbin, China, from 1991 to 1995, and a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He has authored five books and more than 600 technical articles in refereed journals and conference proceedings in image processing, video coding and communication, pattern recognition, multimedia information retrieval, multimodal interfaces, and bioinformatics. Dr. Gao is a member of the Chinese Academy of Engineering and a fellow of the ACM. He has chaired a number of prestigious international conferences on multimedia and video signal processing, such as the IEEE ICME 2007, the ACM Multimedia 2009, and the IEEE ISCAS 2013, and has served on the advisory and technical committees of numerous professional organizations. He has served on the editorial boards of several journals, such as the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Multimedia, the IEEE Transactions on Autonomous Mental Development, the EURASIP Journal on Image and Video Processing, and the Journal of Visual Communication and Image Representation.
