
OpenDVC: An Open Source Implementation of the DVC Video Compression Method

Ren Yang, Luc Van Gool, Radu Timofte
Computer Vision Laboratory, ETH Zürich, Switzerland
(Luc Van Gool is also with KU Leuven, Belgium)
[email protected]  [email protected]  [email protected]

arXiv:2006.15862v2 [eess.IV] 3 Aug 2020

1. Introduction
In this technical report, we introduce an open source Tensorflow [1] implementation of the Deep Video Compression (DVC) [9] method. DVC [9] is the first end-to-end optimized learned video compression method, achieving better MS-SSIM performance than the Low-Delay P (LDP) very fast setting of x265 and comparable PSNR performance with x265 (LDP very fast). At the time of writing this report, several learned video compression methods [5, 6, 8, 15, 16] are superior to DVC [9], but none of them provides open source code. We hope that our OpenDVC code can serve as a useful model for further development and facilitate future research
on learned video compression. Different from the original DVC, which is only optimized for PSNR, we release not only the PSNR-optimized re-implementation, denoted by OpenDVC (PSNR), but also the MS-SSIM-optimized model OpenDVC (MS-SSIM). Our OpenDVC (MS-SSIM) model provides a more convincing baseline for MS-SSIM-optimized methods, which previously could only be compared with the PSNR-optimized DVC [9]. The OpenDVC source code and pre-trained models are publicly released at https://github.com/RenYang-home/OpenDVC.

Figure 1. The high-level framework of DVC [9].

2. Implementation

In this section, we describe the implementation of our OpenDVC, which follows the framework of DVC [9] shown in Figure 1. The high-level architecture of DVC is motivated by the handcrafted video coding standards [13, 12], i.e., adopting motion compensation to reduce the temporal redundancy and using two compression networks to compress the motion and residual information, respectively. In the following, we introduce the OpenDVC implementation of each module presented in Figure 1.

Motion estimation. DVC utilizes the pyramid network [11] to estimate the motion between the current frame and the previously compressed frame, shown as the “Optical Flow Net” module in Figure 1. The large receptive field of the pyramid architecture helps DVC handle large motions. In OpenDVC, the motion estimation network is implemented in Tensorflow in the file motion.py, based on a PyTorch implementation [10] of the pyramid network [11]. We follow the settings described in [11] and use a 5-level pyramid network. Each level has five convolutional layers with a kernel size of 7 × 7 and filter numbers of 32, 64, 32, 16 and 2, respectively. As Figure 1 shows, the estimated motion vt is the output of the pyramid network, denoted as flow_tensor in OpenDVC_test_video.py.

Motion compression. We follow [9] and use the auto-encoder of [2] to compress the estimated motion. The encoder part consists of four convolutional layers with ×2 down-sampling, and the first three layers use the GDN activation function [2]. In the decoder part, there are four corresponding convolutional layers with ×2 up-sampling, and the first three layers use the inverse GDN activation function [2]. In motion compression, we set the filter size to 3 × 3 and the filter number to 128 for all layers except the last layer of the decoder, which has a filter number of 2 to reconstruct the 2-channel motion vector.
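The shape arithmetic of this auto-encoder is straightforward: four ×2 down-sampling layers shrink each spatial dimension by a factor of 16 overall, which is also why OpenDVC only requires the input height and width to be multiples of 16. A minimal Python sketch of the encoder and decoder output shapes (the function names are ours, not from the OpenDVC code):

```python
def mv_encoder_shape(h, w):
    """Latent shape of the motion auto-encoder: four conv layers with
    x2 down-sampling, 128 filters each (kernel 3x3)."""
    assert h % 16 == 0 and w % 16 == 0, "input must be a multiple of 16"
    for _ in range(4):          # four x2 down-sampling layers
        h, w = h // 2, w // 2
    return (h, w, 128)          # 128 filters in every encoder layer

def mv_decoder_shape(h_lat, w_lat):
    """Reconstruction shape: four x2 up-sampling layers; the last layer
    has 2 filters to recover the 2-channel motion vector."""
    for _ in range(4):          # four x2 up-sampling layers
        h_lat, w_lat = h_lat * 2, w_lat * 2
    return (h_lat, w_lat, 2)    # 2-channel motion field
```

For example, a 448 × 832 frame yields a 28 × 52 × 128 motion latent, and decoding restores a 448 × 832 × 2 motion field.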

Figure 2. The architecture of the motion compensation network [9, 15].
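Before the network of Figure 2 is applied, the reference frame is backward-warped by the decoded motion; OpenDVC performs this with tf.contrib.image.dense_image_warp. As a rough NumPy sketch of the same operation (nearest-neighbor sampling for brevity, whereas the TensorFlow op interpolates bilinearly; variable names are ours):

```python
import numpy as np

def backward_warp_nn(ref, flow):
    """Backward warping: out[y, x] = ref[y - flow[y, x, 0], x - flow[y, x, 1]],
    the sampling convention of tf.contrib.image.dense_image_warp.

    ref:  (H, W) reference frame (e.g. the previously compressed frame)
    flow: (H, W, 2) motion field; channel 0 is vertical, channel 1 horizontal
    """
    h, w = ref.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Round the sampled positions and clamp them to the image borders.
    src_y = np.clip(np.rint(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs - flow[..., 1]).astype(int), 0, w - 1)
    return ref[src_y, src_x]
```

The warped frame, together with the reference frame and the decoded motion, is then fed to the motion compensation network of Figure 2.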

The encoder and decoder for motion compression are implemented as the functions MV_analysis and MV_synthesis in CNN_img.py. Different from DVC [9], which employs the hyperprior entropy model [3], our OpenDVC uses the factorized entropy model [2]¹. As such, OpenDVC has lower requirements on the input resolution, i.e., DVC requires the input height and width to be multiples of 32, while our OpenDVC only needs the height and width to be multiples of 16. More importantly, replacing the hyperprior model with the factorized model does not lead to an obvious drop in performance (please refer to Section 4).

Motion compensation. As described in DVC [9], the reference frame is first warped by the compressed motion flow_hat, and the motion compensation network takes as inputs the reference frame Y0_com, the warped² reference frame Y1_warp and the compressed motion flow_hat to generate the motion compensated frame Y1_MC. The motion compensation network in our OpenDVC follows the architecture shown in the Appendix³ of [9]. The detailed network is shown in Figure 2, in which all layers have a filter size of 3 × 3. The filter number of each layer is set to 64, except the last layer, whose filter number is 3. ↑2 and ↓2 indicate up- and down-sampling with a stride of 2, respectively, and ⊕ denotes element-wise addition.

Residual compression. After motion compensation, the residual can be obtained as the difference between the compensated reference frame and the current raw frame. In our OpenDVC, we compress the residual with the same method as the motion compression. The only difference is that we use filters of size 5 × 5 in the auto-encoder for residual compression, instead of 3 × 3 as in motion compression. The reason is that the residual contains more information and consumes more bit-rate than the motion [9], and a larger filter size improves the representation ability of the auto-encoder. Finally, the reconstructed compressed frame can be obtained by adding the compressed residual to the compensated reference frame.

¹ https://github.com/tensorflow/compression/releases/tag/v1.0
² In OpenDVC, we use backward warping, which is implemented as tf.contrib.image.dense_image_warp in Tensorflow 1.12.
³ https://arxiv.org/abs/1812.00101

3. Training

In this technical report, we use the same notations as DVC [9], shown in Figure 1. The definitions of the notations and their corresponding variable names in our OpenDVC code OpenDVC_test_video.py are listed in Table 1.

Table 1. Corresponding notations and variable names.

  Definition                                  Notation   Variable name
  Reference frame                             x̂t−1       Y0_com
  Current frame                               xt         Y1_raw
  Estimated motion                            vt         flow_tensor
  Latent representation of motion             mt         flow_latent
  Quantized latent representation of motion   m̂t         flow_latent_hat
  Compressed motion                           v̂t         flow_hat
  Motion compensated reference frame          x̄t         Y1_MC
  Residual                                    rt         Res
  Latent representation of residual           yt         res_latent
  Quantized latent representation of residual ŷt         res_latent_hat
  Compressed residual                         r̂t         Res_hat
  Compressed frame                            x̂t         Y1_com

The OpenDVC network is trained on the Vimeo-90k [14] dataset in a progressive manner. At the beginning, the motion estimation network is first trained with the loss function

  LME = D(xt, W(x̂t−1, vt)),   (1)

where W is the backward warping operation. After the convergence of the motion estimation network, we further include the motion compression network into the training, with a loss that includes the distortion of the reference frame warped by the compressed motion and the bit-rate for compressing m̂t, i.e.,

  LM = λ · D(xt, W(x̂t−1, v̂t)) + R(m̂t),   (2)

Figure 3. The performance of DVC [9], OpenDVC and our latest RLVC approach [16].

in which λ balances the penalties of rate and distortion, and R stands for the bit-rate estimated by the entropy model [2]. Then, the motion compensation network is trained by

  LMC = λ · D(xt, x̄t) + R(m̂t).   (3)

When LMC has converged, the whole network is jointly trained in an end-to-end manner, using the loss

  L = λ · D(xt, x̂t) + R(m̂t) + R(ŷt).   (4)

The learning rate is initially set to 10−4 for all loss functions (1), (2), (3) and (4). When training the whole network with the final loss (4), the learning rate is decreased by a factor of 10 each time the loss converges, until it reaches 10−6.

In OpenDVC, we first follow DVC [9] and train the PSNR-optimized models with the distortion D as the Mean Squared Error (MSE) and λ = 256, 512, 1024 and 2048. Then, the MS-SSIM models are fine-tuned using only the final loss function (4) with D = 1 − MS-SSIM. The MS-SSIM models with λ = 8, 16, 32 and 64 are fine-tuned from the pre-trained PSNR models with λ = 256, 512, 1024 and 2048, respectively. Note that we use BPG [4] to compress the I-frames for the PSNR models in OpenDVC, and the learned image compression method [7] to compress the I-frames for the MS-SSIM models. Specifically, BPG with QP = 37, 32, 27 and 22 is used for the PSNR models with λ = 256, 512, 1024 and 2048, respectively. The MS-SSIM models with λ = 8, 16, 32 and 64 use [7] with the quality levels of 2, 3, 5 and 7, respectively.

4. Performance

The rate-distortion performance of OpenDVC is demonstrated in Figure 3, in comparison with the results reported in DVC [9]. It can be seen that the OpenDVC (PSNR) model achieves comparable performance with DVC in terms of both PSNR and MS-SSIM, and that OpenDVC (MS-SSIM) obviously outperforms DVC in terms of MS-SSIM. Note that Figure 3 directly uses the results of DVC, x265 (very fast) and x264 (very fast) reported in [9].

5. Our latest works

In 2020, we proposed a Hierarchical Learned Video Compression (HLVC) approach [15] with hierarchical quality and a recurrent enhancement layer. Our HLVC approach was published at CVPR 2020. The paper can be downloaded at https://arxiv.org/abs/2003.01966, and the project page is at https://github.com/RenYang-home/HLVC.

Later, we proposed a Recurrent Learned Video Compression (RLVC) approach [16] with a recurrent auto-encoder and a recurrent probability model. The paper is publicly available at https://arxiv.org/abs/2006.13560. The results of our latest RLVC approach [16] are also illustrated in Figure 3; RLVC clearly outperforms DVC and further advances the state of the art in learned video compression (refer to our paper).

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[2] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[4] Fabrice Bellard. BPG image format. https://bellard.org/bpg/.
[5] Abdelaziz Djelouah, Joaquim Campos, Simone Schaub-Meyer, and Christopher Schroers. Neural inter-frame compression for video coding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6421–6429, 2019.
[6] Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[7] Jooyoung Lee, Seunghyun Cho, and Seung-Kwon Beack. Context-adaptive entropy model for end-to-end optimized image compression. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[8] Haojie Liu, Lichao Huang, Ming Lu, Tong Chen, and Zhan Ma. Learned video compression via joint spatial-temporal correlation exploration. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[9] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. DVC: An end-to-end deep video compression framework. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11006–11015, 2019.
[10] Simon Niklaus. A reimplementation of SPyNet using PyTorch. https://github.com/sniklaus/pytorch-spynet, 2018.
[11] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4161–4170, 2017.
[12] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.
[13] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
[14] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
[15] Ren Yang, Fabian Mentzer, Luc Van Gool, and Radu Timofte. Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[16] Ren Yang, Fabian Mentzer, Luc Van Gool, and Radu Timofte. Learning for video compression with recurrent auto-encoder and recurrent probability model. arXiv preprint arXiv:2006.13560, 2020.
