Neural Network-Based Enhancement To Inter Prediction For Video Coding
Abstract— Inter prediction is a crucial part of hybrid video coding frameworks, utilized to exploit the temporal redundancy in video sequences and improve the coding performance. During inter prediction, a predicted block is typically derived from reference pictures using motion estimation and motion compensation. To improve the coding performance of inter prediction, a neural network based enhancement to inter prediction (NNIP) is proposed in this paper. NNIP is composed of three networks, namely a residue estimation network, a combination network, and a deep refinement network. Specifically, first, the residue estimation network is designed to estimate the residue between the current block and its predicted block using their available spatial neighbors. Second, the feature maps of the estimated residue and the predicted block are extracted and concatenated in the combination network. Finally, the concatenated feature maps are fed into the deep refinement network to generate a refined residue, which is added back to the predicted block to derive a more accurate predicted block. NNIP is integrated in HEVC to evaluate its efficiency. The experimental results demonstrate that NNIP can achieve 4.6%, 3.0%, and 2.7% BD-rate reduction on average under the LDP, LDB, and RA configurations compared to HEVC.

Index Terms— Video coding, inter prediction, NNIP, HEVC, deep learning.

Manuscript received July 29, 2020; revised December 7, 2020; accepted February 18, 2021. Date of publication March 2, 2021; date of current version February 4, 2022. This work was supported by the National Science Foundation of China under Grant 61972115, Grant 61872116, and Grant 61631017. This article was recommended by Associate Editor M. Cagnazzo. (Corresponding author: Xiaopeng Fan.)

Yang Wang, Xiaopeng Fan, and Debin Zhao are with the Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]; [email protected]; [email protected]).

Ruiqin Xiong is with the Department of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China (e-mail: [email protected]).

Wen Gao is with the Department of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSVT.2021.3063165.

Digital Object Identifier 10.1109/TCSVT.2021.3063165

I. INTRODUCTION

INTER prediction aims at reducing the temporal redundancy in video sequences. During inter prediction, motion estimation is used to obtain a motion vector by searching for the best matching block of the current block in reference pictures, and the motion vector is used in motion compensation to derive the predicted block. With motion estimation and motion compensation, the temporal redundancy in video sequences can be exploited, resulting in a higher compression ratio. The more accurate the predicted block, the smaller the residue, and the higher the coding performance. Inter prediction has been widely used in modern video coding standards, e.g., H.264/AVC [1] and high efficiency video coding (HEVC) [2]. In inter prediction, the predicted block is typically generated by copying or interpolating a block from the reference pictures. Even though the best matching block is obtained, there is still some residue after prediction that needs to be further encoded. This residue comes from the variations between the current block and its predicted block, such as illumination variation, zooming, deformation, and blurring.

To address these variations and improve the coding performance of inter prediction, numerous algorithms have been proposed in recent years. Local illumination compensation [3], [4] is proposed to address the local illumination variation in video sequences. Overlapped block motion compensation [5] is proposed to reduce the block artifacts caused by motion compensation. An adaptive progressive motion vector resolution selection scheme is proposed in [6] to reduce the overhead of signaling motion vectors. Bi-directional optical flow [7] is used to refine bi-prediction using the optical flow equation. Affine motion compensated prediction [8]–[10] is proposed to address non-translational motion. All these methods are manually designed to address the variations between the current block and the predicted block.

Recently, deep learning has achieved impressive success on video coding [11], [12]. Typically, deep learning based methods are integrated into the hybrid video coding framework to improve the coding performance of a particular module, such as intra prediction [13]–[18], inter prediction [19]–[33], in-loop filtering [34]–[39], post-processing [40]–[46], up-sampling of coded down-sampled blocks [47], [48], entropy coding [49], [50], and rate control [51], [52]. Specifically, for inter prediction, deep learning can be used for bi-prediction [21]–[24], fractional interpolation [25]–[28], inter prediction enhancement [19], [20], and reference frame generation [29]–[33]. More deep learning based methods on video coding are reviewed in [11], [12].

In this paper, a neural network based enhancement to inter prediction (NNIP) for video coding is proposed. NNIP is composed of three networks. The residue estimation network is proposed to estimate the residue between the current block and its predicted block. The combination network extracts and concatenates the feature maps of the estimated residue and the predicted block.
The deep refinement network is used to derive a refined residue, which is then added back to the predicted block to get a more accurate predicted block. NNIP is integrated in HEVC to evaluate its performance. The experimental results demonstrate that NNIP achieves 4.6%, 3.0%, and 2.7% BD-rate reduction on average under the LDP, LDB, and RA test conditions compared to HEVC.

The main contributions of this work are summarized as follows:

1) A neural network based enhancement to inter prediction for video coding is proposed, which consists of three networks, namely a residue estimation network, a combination network, and a deep refinement network.

2) A residue estimation network is designed to estimate the residue between the current block and its predicted block using their available spatial neighbors.

3) A combination network is presented to extract the feature maps of the estimated residue and the predicted block and concatenate these feature maps together. Therefore, the texture information in the predicted block can be fully utilized to guide the residue refining.

4) A deep refinement network is proposed to take the concatenated feature maps as input and derive a refined residue, which is added back to the predicted block to get a more accurate predicted block.

The rest of this paper is organized as follows. The related works are reviewed in Section II. The proposed neural network is detailed in Section III. In Section IV, how to integrate NNIP in HEVC is introduced. The experimental results and analyses are given in Section V. Section VI concludes the paper.
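To make the data flow of the proposed pipeline concrete, the following is a minimal PyTorch-style sketch of the three-network structure described above, under stated assumptions: the layer depths, channel widths, and the packing of the spatial neighbors into block-sized tensors are placeholders (the actual configurations are specified in Section III), and the class and function names are ours, not the authors' code.

```python
# Schematic sketch of the NNIP data flow only: neighbors -> estimated residue,
# residue/prediction features concatenated, refined residue added back to P.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_layers=3):
    """Placeholder stack of 3x3 conv + ReLU layers; depths/widths are assumptions."""
    layers, ch = [], in_ch
    for _ in range(n_layers):
        layers += [nn.Conv2d(ch, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
        ch = out_ch
    return nn.Sequential(*layers)

class NNIPSketch(nn.Module):
    def __init__(self, feat=64):
        super().__init__()
        # Residue estimation network: spatial neighbors of the current block and
        # of the predicted block (assumed packed into block-sized tensors) -> residue.
        self.residue_est = nn.Sequential(conv_block(2, feat), nn.Conv2d(feat, 1, 3, padding=1))
        # Combination network: feature extraction branches for residue and prediction.
        self.feat_residue = conv_block(1, feat)
        self.feat_pred = conv_block(1, feat)
        # Deep refinement network: concatenated feature maps -> refined residue.
        self.refine = nn.Sequential(conv_block(2 * feat, feat), nn.Conv2d(feat, 1, 3, padding=1))

    def forward(self, neigh_cur, neigh_pred, pred):
        est_residue = self.residue_est(torch.cat([neigh_cur, neigh_pred], dim=1))
        feats = torch.cat([self.feat_residue(est_residue), self.feat_pred(pred)], dim=1)
        return pred + self.refine(feats)  # more accurate predicted block

# Example shapes: one 16x16 luma block with its packed neighbor tensors.
net = NNIPSketch()
p = torch.zeros(1, 1, 16, 16)
print(net(p, p, p).shape)  # torch.Size([1, 1, 16, 16])
```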
II. RELATED WORK

For inter prediction, deep learning has been used for bi-prediction [21]–[24], fractional interpolation [25]–[28], inter prediction enhancement [19], [20], and reference frame generation [29]–[33].

For bi-prediction, Zhao et al. [21], [22] proposed a bi-directional motion compensation algorithm using CNN. Two reference blocks generated by bi-prediction are fed into the network to get the final prediction. This network consists of six convolutional layers and can adaptively learn the weights of the two reference blocks, rather than the average weights used in HEVC. To further improve the performance of bi-prediction, Mao et al. [23], [24] proposed to use the spatial neighboring pixels of the current block as additional information, which is fed into the CNN model together with the two reference blocks to get the final prediction. Compared to [22], the additional spatial neighboring pixels of the current block are also utilized in [24], which can provide more contextual information for the network.

For fractional interpolation, Yan et al. [25] proposed a CNN based method for fractional-pixel motion compensation. In their method, fractional-pixel motion compensation is formulated as an inter-picture regression problem and a CNN is utilized to solve the regression problem. Zhang et al. [26] proposed to regard fractional interpolation as an image generation task and use real pixels at integer positions in the reference block to generate the fractional pixels. Furthermore, Zhang et al. [27] proposed to use the corresponding residual component and the collocated high quality component as compression priors in a CNN to boost the performance of fractional interpolation. Pham and Zhou [28] presented two CNN models of fractional interpolation for the luma and chroma components. A reference frame is first interpolated using Discrete Cosine Transform interpolation filters, then fed into a CNN to avoid the motion shift problem.

Apart from these methods, deep learning is also used to improve inter prediction by refining the predicted block derived from motion compensation. Huo et al. [19] proposed a CNN based method for motion compensation refinement, which takes the predicted block and the spatial neighboring pixels of the current block as the input. Wang et al. [20] proposed a neural network based enhancement to inter prediction for HEVC to enhance motion compensation, in which a fully connected network is used to estimate the residue between the current block and its predicted block from their spatial neighbors. Then the sum of the residue and the predicted block is fed into a convolutional neural network to obtain a more accurate predicted block.

Besides, deep learning is also used for reference frame generation to improve inter prediction. Zhao et al. [29], [30] proposed to generate a high quality virtual reference frame with a deep learning based frame rate up-conversion algorithm from two reconstructed bi-prediction frames. A CTU level coding mode is proposed to select either the existing reference frames or the virtual reference frame. Choi et al. [31] proposed a frame prediction method using CNN for both uni-prediction and bi-prediction. Two frames from the decoded picture buffer are fed into a CNN to generate a synthesized deep frame, which is directly used as a predictor for the current frame. Xia et al. [32] proposed a multiscale adaptive separable convolutional neural network to generate pixel-wise closer reference frames. Reconstruction losses are enforced on each scale to make the network infer the main structure at small scales. Huo et al. [33] proposed to align the reference frames for further extrapolation by a trained deep network, and this alignment can reduce the diversity of the network input. All of these methods aim to derive an additional reference frame using a neural network to provide a higher quality reference for inter prediction.

Among the above-mentioned methods, [21]–[24] focus on improving bi-prediction, [25]–[28] focus on fractional interpolation, and [29]–[33] focus on generating reference frames. Different from these methods, our work aims to enhance inter prediction after motion compensation, and it can be used together with these methods. Compared to our preliminary work [20], several improvements have been made in this paper. First, a combination network is proposed to concatenate the feature maps of the estimated residue and the predicted block, rather than directly adding the residue to the predicted block as in [20]. Therefore, the texture information in the predicted block can be fully utilized to guide the residue refining in the deep refinement network. Second, a deep refinement network is redesigned to efficiently refine the estimated residue. Third, the proposed neural network is applied to CUs with all PU partitionings, rather than only the 2N × 2N partitioning considered in [20].
The frame rate of these 4K video sequences is 30 fps. Abundant textures exist in these video sequences, which makes them suitable for training the network. For better adaptation to video contents with different resolutions, the 10 4K video sequences are downsampled to five HEVC resolutions (2560 × 1600, 1920 × 1080, 832 × 480, 416 × 240, and 1280 × 720). Thus there are 60 video sequences with different resolutions in total to generate the training data. All these video sequences are encoded using HM 16.9 with the low delay P (LDP) configuration [55]. The quantization parameters (QPs) used to encode these sequences are 22, 27, 32, and 37. The first 100 frames of each sequence are encoded, and the inputs of the network are obtained from the compressed bitstreams. In addition, not all blocks encoded by inter mode are used for training; only the blocks with relatively complex texture (e.g., standard deviation ≥ 2) are used, since smooth blocks can already be handled well by the traditional inter prediction in HEVC. Blocks with sizes of 8 × 8, 16 × 16, 32 × 32, and 64 × 64 for different CUs are used. There are approximately 200,000 training patches for each block size and each QP.
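The texture-based sample selection described above can be expressed in a few lines. The sketch below is illustrative only; the threshold (standard deviation ≥ 2) comes from the text, while the function name and array layout are assumptions.

```python
# Minimal sketch of the training-data filter: keep an inter-coded block only if
# its pixel standard deviation is >= 2 (smooth blocks are discarded).
import numpy as np

def keep_for_training(block: np.ndarray, threshold: float = 2.0) -> bool:
    """block: 2-D array of luma samples for one CU (8x8 up to 64x64)."""
    return float(np.std(block)) >= threshold

# Example: a flat 16x16 block is discarded, a textured one is kept.
flat = np.full((16, 16), 128.0)
textured = flat + np.random.default_rng(0).normal(0.0, 8.0, size=(16, 16))
print(keep_for_training(flat), keep_for_training(textured))  # False True
```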
F. Training Strategy

Learning the mapping F of NNIP requires estimating the weighted parameters of the residue estimation network, the combination network, and the deep refinement network. Specifically, given a collection of n training samples (x_i, y_i), the mean squared error (MSE) is used as the loss function, since in video coding the sum of squared error (SSE) is used in the rate distortion optimization. The loss function is formulated as follows:

$$L(\Theta) = \frac{1}{n}\sum_{i=1}^{n}\left\|\left(F(x_i \mid \Theta) + P_i\right) - y_i\right\|^2 \qquad (4)$$

where $\Theta = \{W^{(R)}, B^{(R)}, W^{(C)}, B^{(C)}, W^{(D)}, B^{(D)}\}$; $\{W^{(R)}, B^{(R)}\}$, $\{W^{(C)}, B^{(C)}\}$, and $\{W^{(D)}, B^{(D)}\}$ are the parameters of the residue estimation network, the combination network, and the deep refinement network, respectively; and $n$ is the total number of training samples.

NNIP is trained using Caffe [56] on an NVIDIA GeForce GTX 1080 GPU. All weights of the convolutional filters are initialized as in [57], and all biases are initialized with 0. The loss function is minimized using the first-order gradient based optimization Adam [58]. A batch-mode learning method is adopted with a batch size of 64. The first momentum of the Adam optimization is set to 0.9 and the second momentum is set to 0.99. The base learning rate is set to decay exponentially from 0.1 to 0.0001, changing every 40 epochs. Thus, the training takes 160 epochs in total. The model for QP = 37 is trained using the base learning rate 0.1. The models for the other QPs (22, 27, and 32) are fine-tuned from the model of QP = 37, with a base learning rate of 0.001. In addition, we train different models for the different CU sizes (8 × 8, 16 × 16, 32 × 32, and 64 × 64). Therefore, in total, there are 16 models in NNIP. The memory required to store the network parameters is 291.2M in total; the memories for the 8 × 8, 16 × 16, 32 × 32, and 64 × 64 blocks are 0.3M, 2.6M, 11.2M, and 58.7M, respectively.

Taking blocks of size 16 × 16 as an example, the change of the test loss as the number of iterations increases for QP = 37 is depicted in Fig. 6(a). The test loss for iteration = 0 is excluded due to its much higher magnitude. The test loss decreases smoothly before 15,000 iterations and converges to a relatively small value after changing the base learning rate several times. As shown in Fig. 6(b), the networks for QP = 22, 27, and 32 are fine-tuned based on the trained model with QP = 37, and the test loss converges to a small value quickly; in particular, the test loss at iteration = 0 is already small.
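The loss in (4) and the optimizer settings quoted above can be summarized in the following minimal PyTorch-style sketch. The paper itself trains with Caffe, so this is a re-expression rather than the authors' code; the stand-in model and the data loop are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the three NNIP sub-networks

def nnip_loss(pred_residue, P, y):
    # (4): mean squared error between the enhanced prediction F(x|Theta) + P
    # and the original block y, consistent with the SSE used in RD optimization.
    return torch.mean((pred_residue + P - y) ** 2)

# Adam with first/second momentum 0.9 / 0.99; base learning rate 0.1
# (0.001 when fine-tuning the QP = 22/27/32 models from the QP = 37 model).
optimizer = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.99))
# The base learning rate decays from 0.1 towards 0.0001, dropping by 10x
# every 40 epochs over the 160 training epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

for epoch in range(160):
    for x, P, y in []:  # placeholder for a DataLoader with batch size 64
        optimizer.zero_grad()
        loss = nnip_loss(model(x), P, y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```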
IV. INTEGRATION IN HEVC

To evaluate the efficiency of NNIP, we integrate it into HEVC. In the HEVC main profile, the CU size varies from 8 × 8 to 64 × 64. When NNIP is integrated into HEVC, the trained models for the 8 × 8, 16 × 16, 32 × 32, and 64 × 64 block sizes are used for the corresponding CU sizes. Fig. 7 depicts the diagram of the encoder and the decoder when NNIP is integrated in HEVC.

Several aspects should be considered when integrating NNIP into HEVC. First, NNIP is used in CUs with all PU partitionings (i.e., 2N × 2N, 2N × N, N × 2N, N × N, 2N × nU, 2N × nD, nL × 2N, nR × 2N). Second, L_C consists of the left and above neighboring reconstructed pixels of the current CU, and P is derived using the traditional motion estimation and motion compensation in HEVC. For a CU with only one PU, L_P is derived using the motion vector of that PU, the same as in traditional motion compensation. For a CU with multiple PUs, L_P is derived using the motion vector of the first PU.
The reason why we use the first PU is that L_P is composed of left and above neighboring pixels, which are nearest to the first PU. Third, NNIP is not only integrated in the normal inter mode, but also in the skip/merge mode. In the skip/merge mode, NNIP is used for all merge candidates.

Moreover, when integrated into HEVC, NNIP is selected against the traditional inter prediction by the rate distortion optimization scheme, in which a CU level flag is set to indicate whether NNIP is used. In this paper, only the luma component is processed by NNIP.
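A hedged sketch of this encoder-side decision is given below: both prediction candidates are evaluated with the usual Lagrangian cost J = D + lambda * R, and the cheaper one determines the CU-level NNIP flag. The helper names and the way rate is passed in are illustrative and are not HM code.

```python
# Illustrative CU-level mode decision between traditional inter prediction and
# NNIP, following the RD optimization described above.
import numpy as np

def rd_cost(orig, pred, rate_bits, lam):
    sse = float(np.sum((orig.astype(np.float64) - pred) ** 2))  # distortion D (SSE)
    return sse + lam * rate_bits                                 # J = D + lambda * R

def choose_prediction(orig, pred_traditional, pred_nnip,
                      bits_traditional, bits_nnip, lam):
    # Both candidates pay for the one-bit CU-level NNIP flag.
    j_trad = rd_cost(orig, pred_traditional, bits_traditional + 1, lam)
    j_nnip = rd_cost(orig, pred_nnip, bits_nnip + 1, lam)
    use_nnip = j_nnip < j_trad
    return use_nnip  # the flag value written to the bitstream for this CU

# Hypothetical example: the NNIP candidate is closer to the original and cheaper.
rng = np.random.default_rng(1)
orig = rng.integers(0, 255, (16, 16)).astype(np.float64)
print(choose_prediction(orig, orig + 4, orig + 2, 900, 700, lam=50.0))  # True here
```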
V. EXPERIMENTAL RESULTS

In this section, extensive experiments are conducted to evaluate the performance of NNIP. First, the experimental settings are introduced. Then the experimental results compared to HM 16.9 are given, followed by the ablation experiment on the network structure. After that, the experimental results compared to the state-of-the-art methods are provided. Finally, the coding complexity is discussed.

A. Experimental Settings

NNIP is integrated in HM 16.9 to evaluate its performance under three configurations: low delay P (LDP), low delay B (LDB), and random access (RA), as suggested in [55]. In total, 18 natural sequences with 8 bit depth are tested in our experiments, including class A (2560 × 1600), class B (1920 × 1080), class C (832 × 480), class D (416 × 240), and class E (1280 × 720). Four sequences in class F are also used to evaluate the performance of NNIP on screen content videos. All frames in these sequences are used in the experiments except for the evaluation of the network structure in Subsection V-C, where the first 64 frames of each sequence are tested. Note that there is no overlap between the training video sequences and the testing video sequences. About 200,000 blocks are used to train the network models for each block size and each QP. The QPs used in our experiments are 22, 27, 32, and 37.
Intel i7-6700 3.4GHz quad-core processors with 64GB memory and the Microsoft Windows Server 2012 R2 operating system are used. Both HM 16.9 and the proposed algorithm are compiled with Microsoft Visual Studio 2013. When NNIP is integrated in HM 16.9, the feed-forward operation of the network is processed with the GPU version of Caffe [56].

TABLE I: CODING PERFORMANCE OF NNIP COMPARED TO HM 16.9

TABLE II: CODING PERFORMANCE OF NNIP FOR SCREEN CONTENT COMPARED TO HM 16.9

B. Comparison With HM 16.9

The Bjøntegaard Delta rate (BD-rate) [59] using piecewise cubic interpolation, as typically used in video coding, is calculated to evaluate the coding performance of NNIP. A negative number indicates bitrate saving and a positive number indicates bitrate increase for the same quality.
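For reference, the sketch below shows one common way to compute the BD-rate with piecewise cubic interpolation as mentioned above: interpolate log-rate over PSNR for the anchor and the test codec with PCHIP, integrate both over the overlapping PSNR range, and convert the average difference to a percentage. It is an illustrative implementation, not the exact tool used by the authors.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def bd_rate(psnr_anchor, rate_anchor, psnr_test, rate_test):
    # Fit log(bitrate) as a function of PSNR with piecewise cubic interpolation.
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    p = np.linspace(lo, hi, 100)
    int_a = np.trapz(PchipInterpolator(psnr_anchor, lr_a)(p), p)
    int_t = np.trapz(PchipInterpolator(psnr_test, lr_t)(p), p)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0   # negative value => bitrate saving

# Hypothetical RD points (one per QP); PSNR values must be increasing.
psnr_a = np.array([32.0, 34.5, 37.0, 39.5]); rate_a = np.array([800.0, 1600, 3200, 6400])
psnr_t = np.array([32.3, 34.8, 37.2, 39.6]); rate_t = np.array([780.0, 1540, 3100, 6300])
print(round(bd_rate(psnr_a, rate_a, psnr_t, rate_t), 2))
```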
The coding performance of NNIP for each sequence is tabulated in Table I. Compared to HM 16.9, the BD-rates of NNIP for the {Y, U, V} components are {−4.6%, −0.6%, −0.4%}, {−3.0%, 0.4%, 0.3%}, and {−2.7%, −0.9%, −0.9%} on average under the LDP, LDB, and RA configurations, respectively. NNIP achieves coding gain for all sequences, and more coding gain for video sequences with higher resolution. This is because many large CUs exist in video sequences with higher resolution, and these CUs benefit from the improved accuracy of NNIP for large blocks. In addition, the coding performance under LDB and RA is worse than under LDP. This is because bi-prediction is adopted in RA and LDB and can achieve more accurate prediction, resulting in inherently better coding performance under the RA and LDB configurations. Since NNIP is only applied to the luma component, only minor changes of coding performance are observed for the chroma components.

Apart from natural sequences, the coding performance of NNIP is also evaluated on screen content sequences. As shown in Table II, compared to HM 16.9, the BD-rates of NNIP for the luma component are −1.5%, −0.7%, and −0.5% on average under the LDP, LDB, and RA configurations, respectively. It is observed that the coding performance on screen content sequences is worse than on natural sequences. Screen content sequences have quite different characteristics from natural sequences and are not used to train the network. If NNIP is deployed in screen content application scenarios, it is possible and straightforward to retrain the network using data from screen content sequences.

C. Evaluation of Network Structure

In our preliminary work [20], a similar neural network structure is proposed to improve inter prediction. The following three main differences exist between [20] and the proposed method in this paper. First, the network in [20] is only applied to CUs with 2N × 2N PU partitioning, while the proposed method is applied to CUs with all PU partitionings.
Second, in [20], the add operation is used to combine the estimated residue and the predicted block, while here a combination network is designed to combine them. With the combination network, the texture information in the predicted block can be fully utilized to guide the residue refining. Third, the variable-filter-size convolutional neural network (VRCNN) [41] is directly used as the deep refinement network in [20], while a more efficient deep refinement network is designed in this paper. To further evaluate the network structure of NNIP, the network in the preliminary work [20] and its two variations (Variation 1 = [20] applied to CUs with all PU partitionings; Variation 2 = Variation 1 + the combination network) are tested. For a fair comparison, [20] and these two variations are trained using the same dataset as NNIP in this paper. In this subsection, the first 64 frames of each video sequence are tested under the LDP configuration, as done in [20].

Table III tabulates the coding performance of these four methods. First, Variation 1 achieves 0.3% coding gain on average over [20], which demonstrates the efficiency of applying the network to different PU partitionings. Second, Variation 2 achieves 0.9% coding gain on average over Variation 1, with 14% encoding time and 38% decoding time increase, which demonstrates the efficiency of the combination network. Third, NNIP achieves 1.1% coding gain on average over Variation 2, which demonstrates the efficiency of the deep refinement network. Therefore, these extensive results demonstrate the efficiency of the proposed network structure.

In the previous experiments, the normal QPs are set to {22, 27, 32, 37} as recommended by the HEVC CTC. To verify the generalization ability of NNIP under different QPs, the coding performance under both small QPs (20, 25, 30, 35) and large QPs (24, 29, 34, 39) is tested. The same trained models from the normal QPs are used for the test, in which the trained model with the closest QP is applied to a particular testing QP, as follows:

$$M=\begin{cases} M_{22}, & QP < 25\\ M_{27}, & 25 \le QP < 30\\ M_{32}, & 30 \le QP < 35\\ M_{37}, & QP \ge 35 \end{cases} \qquad (5)$$

where $M$ denotes the model used for a particular QP, and $M_{22}$, $M_{27}$, $M_{32}$, and $M_{37}$ denote the models trained using the normal CTC QPs.
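Equation (5) amounts to a simple nearest-QP lookup, sketched below; the dictionary of trained models is a placeholder.

```python
# Direct transcription of (5): pick the trained model whose QP is closest to
# the testing QP.
def select_model(qp, models):
    """models: dict like {22: M22, 27: M27, 32: M32, 37: M37}."""
    if qp < 25:
        return models[22]
    if qp < 30:
        return models[27]
    if qp < 35:
        return models[32]
    return models[37]

# e.g. QP 20 -> M22, QP 29 -> M27, QP 34 -> M32, QP 39 -> M37
```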
Table IV tabulates the BD-rates of the luma component for these tests. As shown in Table IV, the average coding gains are 4.4%, 4.4%, and 4.3% for the three different QP settings. These comparisons demonstrate the robustness of NNIP for different bitrates and QPs.

Furthermore, a CU level flag is used to indicate whether NNIP is used. The proportion of these flags in the bitstream is approximately 2.3% on average. The coding performance without signaling the flags is −7.5% BD-rate saving on average.

TABLE V: THE USAGE RATES OF NNIP

Table V shows the usage ratios of NNIP for each video sequence under the LDP configuration. For each frame of a video sequence, the usage ratio is defined as follows:

$$\eta=\frac{\sum_{i=0}^{3} n_i \times N_i^2}{W\times H}\times 100\% \qquad (6)$$

where $\eta$ denotes the usage ratio, $W$ and $H$ denote the width and height of the frame, and $n_i$ denotes the number of CUs coded by NNIP with size $N_i \times N_i$, which is 64 × 64, 32 × 32, 16 × 16, and 8 × 8 for depth $i = 0, 1, 2, 3$.
Fig. 8. The distribution of CUs coded by HEVC and NNIP. The orange CUs are coded by traditional inter/intra prediction in HEVC and the light blue CUs
are coded by NNIP.
The usage ratio in Table V is the average value over the four QPs and all 64 encoded frames, including the I frame. As shown in Table V, the average usage ratio is about 25%. The usage ratio varies among different sequences: for instance, the usage ratio for Kimono can reach up to 71.5%, whereas it is only about 3.0% for BQSquare. The coding performance is relatively better for video sequences with a higher usage ratio.
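The per-frame usage ratio in (6) can be computed as follows; the function name and example numbers are illustrative.

```python
# Sketch of (6): the area covered by NNIP-coded CUs divided by the frame area.
def nnip_usage_ratio(n_per_depth, width, height):
    sizes = [64, 32, 16, 8]                      # N_i for depth i = 0, 1, 2, 3
    covered = sum(n * s * s for n, s in zip(n_per_depth, sizes))
    return 100.0 * covered / (width * height)    # eta in percent

# Hypothetical example: 30 64x64, 120 32x32, 300 16x16 and 500 8x8 NNIP CUs
# in one 1920x1080 frame.
print(round(nnip_usage_ratio([30, 120, 300, 500], 1920, 1080), 1))  # 17.1
```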
Fig. 8 also shows the distribution of CUs coded by NNIP. The orange CUs are coded by traditional inter/intra prediction and the light blue CUs are coded by NNIP. It is observed that most regions with complex texture are likely to be coded by NNIP. Furthermore, rate distortion curves (RD curves) of several typical video sequences are shown in Fig. 9. It is obvious that NNIP outperforms HM 16.9. NNIP can improve the coding performance of HEVC significantly for both low bitrate and high bitrate scenarios.

D. Comparison With the State-of-the-Art Methods

To further evaluate the coding performance of NNIP, it is compared with the latest deep learning based inter prediction methods [19], [21]–[28]. [19] is the most related work to NNIP, using a network to improve inter prediction after motion compensation. As shown in Table VI, [19] can achieve 2.7% BD-rate reduction on average under the LDP configuration, while NNIP can achieve 4.6% BD-rate reduction on average. In [19], the predicted block and the spatial neighboring pixels of the current block are fed into the network, whereas in NNIP the predicted block and the spatial neighboring pixels of both the current block and the predicted block are fed into the network. As more contextual information is utilized by NNIP, it achieves better coding performance.

[21]–[24] are designed for bi-prediction, while [25]–[28] are used for fractional interpolation. As shown in Table VI, [21], [22], [23], and [24] achieve 3.1%, 3.0%, 3.5%, and 5.1% BD-rate reduction on average under the RA configuration, and [25], [26], [27], and [28] achieve 4.3%, 1.0%, 5.3%, and 3.7% BD-rate reduction on average under the LDP configuration. NNIP achieves 4.6% and 2.7% BD-rate reduction under the LDP and RA configurations. In fact, NNIP can be applied on top of bi-prediction methods (e.g., [21]–[24]) and fractional interpolation methods (e.g., [25]–[28]) to achieve better coding performance. For example, the predicted block is first generated using one of [21]–[28], then fed into NNIP to further improve inter prediction.

E. Computational Complexity

Table VII tabulates the encoding and decoding complexities of NNIP. The computational complexity is evaluated by the time increase, which is defined as follows:

$$\Delta T = \frac{T_p - T_o}{T_o} \times 100\% \qquad (7)$$

where $T_p$ and $T_o$ denote the coding time of the proposed method and of the original HM anchor, respectively.
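A direct transcription of (7) is given below; the example timing values are hypothetical.

```python
# Sketch of the time-increase metric in (7): the relative change of coding time
# with NNIP enabled (T_p) versus the anchor (T_o), in percent.
def time_increase(t_proposed, t_original):
    return 100.0 * (t_proposed - t_original) / t_original

# e.g. an encode taking 540 s with NNIP versus 120 s for the anchor
print(round(time_increase(540.0, 120.0)))  # 350 (% increase)
```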
Fig. 9. Rate-distortion (R-D) curves of several typical video sequences under LDP configuration.
TABLE VI: EXPERIMENTAL RESULTS COMPARED WITH STATE-OF-THE-ART METHODS

TABLE VII: THE COMPUTATIONAL COMPLEXITY OF NNIP
[13] W. Cui et al., "Convolutional neural networks based intra prediction for HEVC," in Proc. Data Compress. Conf. (DCC), Apr. 2017, p. 436.
[14] J. Li, B. Li, J. Xu, R. Xiong, and W. Gao, "Fully connected network-based intra prediction for image coding," IEEE Trans. Image Process., vol. 27, no. 7, pp. 3236–3247, Jul. 2018.
[15] Y. Hu, W. Yang, M. Li, and J. Liu, "Progressive spatial recurrent neural network for intra prediction," IEEE Trans. Multimedia, vol. 21, no. 12, pp. 3024–3037, Dec. 2019.
[16] T. Dumas, A. Roumy, and C. Guillemot, "Context-adaptive neural network-based prediction for image compression," IEEE Trans. Image Process., vol. 29, pp. 679–693, 2020.
[17] J. Pfaff et al., "Neural network based intra prediction for video coding," Proc. SPIE, vol. 10752, Sep. 2018, Art. no. 1075213.
[18] Y. Wang, X. Fan, S. Liu, D. Zhao, and W. Gao, "Multi-scale convolutional neural network-based intra prediction for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1803–1815, Jul. 2020.
[19] S. Huo, D. Liu, F. Wu, and H. Li, "Convolutional neural network-based motion compensation refinement for video coding," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2018, pp. 1–4.
[20] Y. Wang, X. Fan, C. Jia, D. Zhao, and W. Gao, "Neural network based inter prediction for HEVC," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2018, pp. 1–6.
[21] Z. Zhao, S. Wang, S. Wang, X. Zhang, S. Ma, and J. Yang, "CNN-based bi-directional motion compensation for high efficiency video coding," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2018, pp. 1–4.
[22] Z. Zhao, S. Wang, S. Wang, X. Zhang, S. Ma, and J. Yang, "Enhanced bi-prediction with convolutional neural network for high-efficiency video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 11, pp. 3291–3301, Nov. 2019.
[23] J. Mao, H. Yu, X. Gao, and L. Yu, "CNN-based bi-prediction utilizing spatial information for video coding," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2019, pp. 1–5.
[24] J. Mao and L. Yu, "Convolutional neural network based bi-prediction utilizing spatial and temporal information in video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1856–1870, Jul. 2020.
[25] N. Yan, D. Liu, H. Li, B. Li, L. Li, and F. Wu, "Convolutional neural network-based fractional-pixel motion compensation," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 3, pp. 840–853, Mar. 2019.
[26] H. Zhang, L. Li, L. Song, X. Yang, and Z. Li, "Advanced CNN based motion compensation fractional interpolation," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 709–713.
[27] H. Zhang, L. Song, L. Li, Z. Li, and X. Yang, "Compression priors assisted convolutional neural network for fractional interpolation," IEEE Trans. Circuits Syst. Video Technol., early access, Jul. 22, 2020, doi: 10.1109/TCSVT.2020.3011197.
[28] C. D.-K. Pham and J. Zhou, "Deep learning-based luma and chroma fractional interpolation in video coding," IEEE Access, vol. 7, pp. 112535–112543, 2019.
[29] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced CTU-level inter prediction with deep frame rate up-conversion for high efficiency video coding," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 206–210.
[30] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced motion-compensated video coding with deep virtual reference frame generation," IEEE Trans. Image Process., vol. 28, no. 10, pp. 4832–4844, Oct. 2019.
[31] H. Choi and I. V. Bajić, "Deep frame prediction for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 7, pp. 1843–1855, Jul. 2020.
[32] S. Xia, W. Yang, Y. Hu, and J. Liu, "Deep inter prediction via pixel-wise motion oriented reference generation," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 1710–1774.
[33] S. Huo, D. Liu, B. Li, S. Ma, F. Wu, and W. Gao, "Deep network-based frame extrapolation with reference frame alignment," IEEE Trans. Circuits Syst. Video Technol., early access, May 18, 2020, doi: 10.1109/TCSVT.2020.2995243.
[34] W.-S. Park and M. Kim, "CNN-based in-loop filtering for coding efficiency improvement," in Proc. IEEE 12th Image, Video, Multidimensional Signal Process. Workshop (IVMSP), Jul. 2016, pp. 1–5.
[35] Y. Zhang, T. Shen, X. Ji, Y. Zhang, R. Xiong, and Q. Dai, "Residual highway convolutional neural networks for in-loop filtering in HEVC," IEEE Trans. Image Process., vol. 27, no. 8, pp. 3827–3841, Aug. 2018.
[36] C. Jia et al., "Content-aware convolutional neural network for in-loop filtering in high efficiency video coding," IEEE Trans. Image Process., vol. 28, no. 7, pp. 3343–3356, Jul. 2019.
[37] Y. Wang, Z. Chen, Y. Li, L. Zhao, S. Liu, and X. Li, AHG9: Dense Residual Convolutional Neural Network Based In-Loop Filter, document JVET-L0242, ISO/IEC JTC 1/SC 29/WG 11 Joint Video Exploration Team, Macao, China, Oct. 2018.
[38] K. Kawamura, Y. Kidani, and S. Naito, AHG9: Convolution Neural Network Filter, document JVET-L0383, ISO/IEC JTC 1/SC 29/WG 11 Joint Video Exploration Team (JVET) 12th Meeting, Macao, Oct. 2018.
[39] S. Liu et al., JVET AHG Report: Neural Networks in Video Coding (AHG9), document JVET-M0009, ISO/IEC JTC 1/SC 29/WG 11 Joint Video Exploration Team (JVET) 13th Meeting, Marrakech, Jan. 2019.
[40] C. Li, L. Song, R. Xie, and W. Zhang, "CNN based post-processing to improve HEVC," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2017, pp. 4577–4580.
[41] Y. Dai, D. Liu, and F. Wu, "A convolutional neural network approach for post-processing in HEVC intra coding," in Proc. Int. Conf. Multimedia Model. Cham, Switzerland: Springer, 2017, pp. 28–39.
[42] T. Wang, M. Chen, and H. Chao, "A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC," in Proc. Data Compress. Conf. (DCC), Apr. 2017, pp. 410–419.
[43] R. Yang, M. Xu, and Z. Wang, "Decoder-side HEVC quality enhancement with scalable convolutional neural network," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2017, pp. 817–822.
[44] R. Yang, M. Xu, T. Liu, Z. Wang, and Z. Guan, "Enhancing quality for HEVC compressed videos," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 7, pp. 2039–2054, Jul. 2019.
[45] X. He, Q. Hu, X. Zhang, C. Zhang, W. Lin, and X. Han, "Enhancing HEVC compressed videos with a partition-masked convolutional neural network," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 216–220.
[46] R. Yang, M. Xu, Z. Wang, and T. Li, "Multi-frame quality enhancement for compressed video," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6664–6673.
[47] Y. Li et al., "Convolutional neural network-based block up-sampling for intra frame coding," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 9, pp. 2316–2330, Sep. 2018.
[48] J. Lin, D. Liu, H. Yang, H. Li, and F. Wu, "Convolutional neural network-based block up-sampling for HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 12, pp. 3701–3715, Dec. 2019.
[49] S. Puri, S. Lasserre, and P. L. Callet, "CNN-based transform index prediction in multiple transforms framework to assist entropy coding," in Proc. 25th Eur. Signal Process. Conf. (EUSIPCO), Aug. 2017, pp. 798–802.
[50] R. Song, D. Liu, H. Li, and F. Wu, "Neural network-based arithmetic coding of intra prediction modes in HEVC," in Proc. IEEE Vis. Commun. Image Process. (VCIP), Dec. 2017, pp. 1–4.
[51] L. Zhu, G. Wang, G. Teng, Z. Yang, and L. Zhang, "A deep learning based perceptual bit allocation scheme on conversational videos for HEVC λ-domain rate control," in Proc. Int. Forum Digit. TV Wireless Multimedia Commun. Singapore: Springer, 2017, pp. 515–524.
[52] B. Xu, X. Pan, Y. Zhou, Y. Li, D. Yang, and Z. Chen, "CNN-based rate-distortion modeling for H.265/HEVC," in Proc. IEEE Vis. Commun. Image Process. (VCIP), Dec. 2017, pp. 1–4.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1026–1034.
[54] L. Song, X. Tang, W. Zhang, X. Yang, and P. Xia, "The SJTU 4K video sequence dataset," in Proc. 5th Int. Workshop Qual. Multimedia Exper. (QoMEX), Jul. 2013, pp. 34–35.
[55] F. Bossen, Common Test Conditions and Software Reference Configurations, document JCTVC-L1100, Jan. 2013.
[56] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, Nov. 2014, pp. 675–678.
[57] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[58] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), Y. Bengio and Y. LeCun, Eds., San Diego, CA, USA, May 2015, pp. 1–15. [Online]. Available: http://arxiv.org/abs/1412.6980
[59] G. Bjøntegaard, Improvements of the BD-PSNR Model, document VCEG-AI11, ITU-T Video Coding Experts Group, Heinrich-Hertz-Institute, Berlin, Germany, Jul. 2008.
Yang Wang received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology (HIT), Harbin, China, in 2012, 2014, and 2019, respectively. From 2013 to 2014, he was with the School of Electronics Engineering and Computer Science, Peking University, Beijing, as a Research Assistant. From 2014 to 2016, he was with the Media Computing Group, Microsoft Research Asia, Beijing, as an Intern. From 2018 to 2019, he was with the Peng Cheng Laboratory, Shenzhen, as an Intern. His current research interests are in image processing, video coding, and deep learning.

Debin Zhao (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, China, in 1985, 1988, and 1998, respectively. He is currently a Professor with the Department of Computer Science, Harbin Institute of Technology. He has published over 200 technical articles in refereed journals and conference proceedings in the areas of image and video coding, video processing, video streaming and transmission, and computer vision.