Paper 5
1. Introduction
We introduce an open-source TensorFlow [1] implementation of the Deep Video Compression (DVC) [9] method in this technical report. DVC [9] is the first end-to-end optimized learned video compression method, achieving better MS-SSIM performance than the Low-Delay P (LDP) very fast setting of x265, and PSNR performance comparable with x265 (LDP very fast). At the time of writing this report, several learned video compression methods [5, 6, 8, 15, 16] are superior to DVC [9], but currently none of them provides open-source code. We hope that our OpenDVC code can serve as a useful model for further development and facilitate future research
on learned video compression.

Figure 1. The high-level framework of DVC [9].

Different from the original DVC, which is only optimized for PSNR, we release not only the PSNR-optimized re-implementation, denoted by OpenDVC (PSNR), but also the MS-SSIM-optimized model OpenDVC (MS-SSIM). Our OpenDVC (MS-SSIM) model provides a more convincing baseline for MS-SSIM-optimized methods, which in the past could only be compared with the PSNR-optimized DVC [9]. The OpenDVC source code and pre-trained models are publicly released at https://github.com/RenYang-home/OpenDVC.

2. Implementation

In this section, we describe the implementation of our OpenDVC, which follows the framework of DVC [9] shown in Figure 1. The high-level architecture of DVC is motivated by the handcrafted video coding standards [13, 12], i.e., adopting motion compensation to reduce the temporal redundancy and using two compression networks to compress the motion and residual information, respectively. In the following, we introduce the OpenDVC implementation of each module presented in Figure 1.

Motion estimation. DVC utilizes the pyramid network [11] to estimate the motion between the current frame and the previous compressed frame, shown as the "Optical Flow Net" module in Figure 1. The large receptive field of the pyramid architecture helps DVC handle large motions. In OpenDVC, the motion estimation network is implemented in TensorFlow in the file motion.py, based on a PyTorch implementation [10] of the pyramid network [11]. We follow the settings described in [11] to use a 5-level pyramid network. Each level has five convolutional layers with the kernel size of 7 × 7 and the filter numbers of 32, 64, 32, 16 and 2, respectively. As Figure 1 shows, the estimated motion v_t is the output of the pyramid network, denoted as flow_tensor in OpenDVC_test_video.py.

Motion compression. We follow [9] in using the auto-encoder of [2] to compress the estimated motion. The encoder part consists of four convolutional layers with ×2 down-sampling, and the first three layers use the GDN activation function [2]. In the decoder part, there are four corresponding convolutional layers with ×2 up-sampling, and the first three layers use the inverse GDN activation function [2]. In motion compression, we set the filter size as 3 × 3 and the filter number as 128 for all layers except the last layer in the decoder, which has the filter number of 2 to reconstruct the 2-channel motion vector. The encoder and decoder for motion compression are implemented
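The motion auto-encoder described above can be sketched in a few lines of tf.keras. This is an illustrative reconstruction, not the released OpenDVC code: ReLU stands in here for the GDN / inverse-GDN activations (which in practice come from, e.g., the tensorflow-compression package), and the function and layer names are hypothetical.

```python
# Sketch of the motion-compression auto-encoder of [2] as used in DVC/OpenDVC.
# NOTE: illustrative only; the real model uses GDN / inverse-GDN activations,
# replaced by ReLU here so the snippet stays self-contained.
import tensorflow as tf

def build_motion_autoencoder():
    # Encoder: four 3x3 conv layers, each down-sampling by 2, 128 filters each;
    # the first three are activated (GDN in the paper), the last is linear.
    encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(128, 3, strides=2, padding="same"),  # latent
    ], name="mv_encoder")
    # Decoder: four mirrored 3x3 transposed convs, each up-sampling by 2;
    # the last layer outputs 2 channels to reconstruct the motion vectors.
    decoder = tf.keras.Sequential([
        tf.keras.layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.Conv2DTranspose(2, 3, strides=2, padding="same"),  # 2-channel flow
    ], name="mv_decoder")
    return encoder, decoder

encoder, decoder = build_motion_autoencoder()
mv = tf.zeros([1, 64, 64, 2])   # dummy 64x64 motion field (vx, vy)
latent = encoder(mv)            # 16x spatially down-sampled representation
mv_hat = decoder(latent)        # reconstructed 2-channel motion
```

With a 64 × 64 input, the four stride-2 layers reduce the spatial size to 4 × 4 with 128 channels, and the decoder mirrors this back to a 64 × 64 × 2 motion field.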
Figure 3. The performance of DVC [9], OpenDVC and our latest RLVC approach [16].
in which λ balances the penalties of rate and distortion, and R stands for the bit-rate estimated by the entropy model [2]. Then, the motion compensation network is trained by

models with λ = 8, 16, 32 and 64 use [7] with the quality levels of 2, 3, 5 and 7, respectively.
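As a worked example of the objective above: in DVC [9] the training loss takes the standard rate-distortion form L = λ · D + R, where D is the distortion and R the estimated bit-rate, so a larger λ trades bit-rate for quality. The sketch below (function name and numbers are illustrative, not from the OpenDVC code) shows the effect of λ:

```python
# Hedged sketch of a DVC-style rate-distortion objective: L = lambda * D + R,
# with D the distortion (e.g. MSE or 1 - MS-SSIM) and R the bit-rate estimate
# from the entropy model [2]. Names and values are illustrative.

def rd_loss(distortion, rate, lam):
    """Rate-distortion loss: lambda weighs distortion against bit-rate."""
    return lam * distortion + rate

# A larger lambda penalizes distortion more, yielding higher quality at a
# higher bit-rate (the MS-SSIM models use lambda in {8, 16, 32, 64}).
low  = rd_loss(distortion=0.02, rate=0.10, lam=8)    # 8*0.02 + 0.10 = 0.26
high = rd_loss(distortion=0.02, rate=0.10, lam=64)   # 64*0.02 + 0.10 = 1.38
```

The same distortion/rate pair thus costs more under a larger λ, pushing the optimizer toward lower distortion at those settings.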
sults of our latest RLVC [16] approach are also illustrated in Figure 3, which clearly outperforms DVC and also advances the state of the art of learned video compression (refer to our paper).

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[2] Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[3] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[4] Fabrice Bellard. BPG image format. https://bellard.org/bpg/.
[5] Abdelaziz Djelouah, Joaquim Campos, Simone Schaub-Meyer, and Christopher Schroers. Neural inter-frame compression for video coding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 6421–6429, 2019.
[6] Amirhossein Habibian, Ties van Rozendaal, Jakub M. Tomczak, and Taco S. Cohen. Video compression with rate-distortion autoencoders. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
[7] Jooyoung Lee, Seunghyun Cho, and Seung-Kwon Beack. Context-adaptive entropy model for end-to-end optimized image compression. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[8] Haojie Liu, Lichao Huang, Ming Lu, Tong Chen, and Zhan Ma. Learned video compression via joint spatial-temporal correlation exploration. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[9] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao. DVC: An end-to-end deep video compression framework. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11006–11015, 2019.
[10] Simon Niklaus. A reimplementation of SPyNet using PyTorch. https://github.com/sniklaus/pytorch-spynet, 2018.
[11] Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4161–4170, 2017.
[12] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.
[13] Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
[14] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T. Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
[15] Ren Yang, Fabian Mentzer, Luc Van Gool, and Radu Timofte. Learning for video compression with hierarchical quality and recurrent enhancement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[16] Ren Yang, Fabian Mentzer, Luc Van Gool, and Radu Timofte. Learning for video compression with recurrent auto-encoder and recurrent probability model. arXiv preprint arXiv:2006.13560, 2020.