Learning End-to-End Lossy Image Compression: A Benchmark
Yueyu Hu, Wenhan Yang, Zhan Ma, and Jiaying Liu
Abstract—Image compression is one of the most fundamental techniques and commonly used applications in the image and video
processing field. Earlier methods built a well-designed pipeline, and efforts were made to improve all modules of the pipeline by
handcrafted tuning. Later, tremendous contributions were made, especially when data-driven methods revitalized the domain with their
excellent modeling capacities and flexibility in incorporating newly designed modules and constraints. Despite great progress, a
systematic benchmark and comprehensive analysis of end-to-end learned image compression methods are lacking. In this paper, we
first conduct a comprehensive literature survey of learned image compression methods. The literature is organized based on several
aspects to jointly optimize the rate-distortion performance with a neural network, i.e., network architecture, entropy model and rate
control. We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and
provide insights into their historical development routes. With this survey, the main challenges of image compression methods are
revealed, along with opportunities to address the related issues with recent advanced learning methods. This analysis provides an
opportunity to take a further step towards higher-efficiency image compression. By introducing a coarse-to-fine hyperprior model for
entropy estimation and signal reconstruction, we achieve improved rate-distortion performance, especially on high-resolution images.
Extensive benchmark experiments demonstrate the superiority of our model in rate-distortion performance and time complexity on
multi-core CPUs and GPUs. Our project website is available at https://huzi96.github.io/compression-bench.html.
1 INTRODUCTION
With the rapid development of deep learning, there have been many works exploring the potential of artificial neural networks to form an end-to-end optimized image compression framework. The development of these learning-based methods differs significantly from that of traditional methods. For traditional methods, improved performance mainly comes from designing more complex tools for each component in the coding loop. Deeper analysis can be conducted on the input image, and more adaptive operations can be applied, resulting in more compact codes. However, in some cases, although the performance of a single module is improved, the final performance of the codec, i.e., the superimposed performance of the different modules, might not increase much, making further improvement difficult. For end-to-end learned methods, as the whole framework can be jointly optimized, a performance improvement in one module naturally boosts the final objective. Furthermore, joint optimization causes all modules to work more adaptively with each other.

In the design of an end-to-end learned image compression method, two aspects are considered. First, if the latent representation coefficients after the transform network are less correlated, more bit-rate can be saved in entropy coding. Second, if the probability distribution of the coefficients can be accurately estimated by an entropy model, the bit-stream can be utilized more efficiently and the bit-rate to encode the latent representations can be better controlled; thus, a better trade-off between the bit-rate and the distortion can be achieved. The pioneering work of Toderici et al. [10] presents an end-to-end learned image compression framework that reconstructs the image by applying a recurrent neural network (RNN). Meanwhile, generalized divisive normalization (GDN) [11] was proposed by Ballé et al. to model image content with a density model, which shows an impressive capacity for image compression. Since then, numerous end-to-end learned image compression methods have been inspired by these frameworks.

Although tremendous progress has been made in end-to-end learned image compression, there is a lack of a systematic survey and benchmark to summarize and compare different methods thoroughly. To this end, in this work, we conduct a comprehensive survey of recent progress in learning-based image compression as well as a thorough benchmarking analysis of different learning-based image compression methods. The contributions and novelties of existing works are summarized and highlighted, and future directions are illustrated. With the summarized guidance from the survey and benchmark, we propose a novel end-to-end learned image compression framework that offers state-of-the-art performance.

The contributions of this paper are as follows:
• We comprehensively summarize the existing end-to-end learned image compression methods. The contributions and novelties of these methods are discussed and highlighted. The technical improvements of these methods are commented on based on their categorizations, and we present a clear picture of the design methodologies as well as interesting future research directions.
• Inspired by the insights and challenges summarized for the existing approaches, we further explore the potential of end-to-end learned image compression and propose a coarse-to-fine hyperprior modeling framework for lossy image compression. The proposed method is shown to outperform existing methods in terms of coding performance while keeping the time complexity low on parallel computing hardware.
• We conduct a thorough benchmark analysis to compare the performance of existing end-to-end compression methods, the proposed method, and traditional codecs. The comparison is conducted fairly from different perspectives, i.e., the rate-distortion performance on different ranges of bit-rate or resolution and the complexity of the implementation.

Note that this paper is an extension of our earlier publication [12]. We summarize the changes here. First, this paper additionally focuses on a thorough survey and benchmark of end-to-end learned image compression methods. In addition to [12], we summarize the contributions of existing works on end-to-end learned image compression in Sec. 3 and present a more detailed comparative analysis of the merits towards high-efficiency end-to-end learned image compression in Sec. 4 and Sec. 5. Second, we conduct a benchmark evaluation of existing methods in Sec. 7.2, where we present comparative experimental results on two additional datasets, in both PSNR and MS-SSIM. Third, we raise the novel problem of cross-metric performance with respect to image compression methods in Sec. 7.4, where we present an empirical analysis of the phenomenon of cross-metric bias and briefly discuss future research directions to address the related issues.

The rest of the paper is organized as follows. In Sec. 2, we first formulate the image compression problem, especially focusing on end-to-end learned schemes. After that, in Sec. 3, we briefly summarize the main contributions of existing research. Then, in Sec. 4, we categorize existing learned image compression methods according to their backbone models. Special attention is paid to the rate control technique in Sec. 5, as it is the most specialized component of image compression compared with other deep-learning-based processing or understanding methods. Inspired by our survey and analysis, we introduce our newly proposed method in Sec. 6. Later, in Sec. 7, we introduce the benchmarking protocols and make benchmarking comparisons of existing methods. Finally, in Sec. 8, we draw conclusions and discuss potential future research directions.

2 PROBLEM FORMULATION
Natural image signals include many spatial redundancies and can therefore be compressed without much degradation in perceptual quality. Considering practical constraints on bandwidth and storage, lossy image compression is widely adopted to minimize the bit-rate of representing a given image while tolerating a certain level of distortion. The compression framework usually consists of an encoder-decoder pair. Given an input image x with distribution p_x, an encoder with an encoding transform E
and a quantization function Q generates a discrete code y as follows:

y = Q(E(x; θ_E)),    (1)

where θ_E denotes the encoder parameters to be tuned during the learning procedure. To obtain the pixel representation of the image, the corresponding decoder D reconstructs the image x̂ from the code y as follows:

x̂ = D(y; θ_D) = D(Q(E(x; θ_E)); θ_D),    (2)

where θ_D denotes the parameters in D.

Two kinds of metrics, i.e., the distortion D and the bit-rate R, give rise to the rate-distortion optimization R + λD, the core problem of lossy image compression. The distortion term D measures how different the reconstructed image is from the original image, and it is usually measured via fidelity-driven or perceptual metrics as follows:

D = E_{x∼p_x}[d(x, x̂)],    (3)

where d denotes the distortion function. The rate term R corresponds to the number of bits needed to encode y, which is bounded by the entropy constraints. However, the actual probability distribution of the latent code y, denoted as p_y, is unknown, making accurate entropy calculation intractable. Thus, we usually utilize an entropy model q_y as an estimate of p_y for entropy coding. Hence, the rate term can be formulated as the cross entropy of p_y and q_y as follows:

R = H(p_y, q_y) = E_{y∼p_y}[−log q_y(y)],    (4)

where p_y stands for the real probability distribution and q_y refers to the distribution estimated by the entropy model. The overall compression model can be viewed as an optimization of the weighted sum of R and D. Formally, the problem can be solved by minimizing the following objective with a trade-off coefficient λ:

θ̂_E, θ̂_D, θ̂_p = arg min_{θ_E, θ_D, θ_p} R + λD,    (5)

where θ_p denotes the parameters of the entropy model. The optimal parameters θ̂_E, θ̂_D, θ̂_p cause the model to achieve overall good rate-distortion performance on images x that follow x ∼ p_x. Different λ values indicate different rate-distortion trade-offs, depending on the requirements of different applications.

Though the idea of rate-distortion optimization is also applied in traditional compression schemes, learning-based methods finally make the joint optimization of all the components feasible. The opportunities and challenges are listed below:
• Global Optimization. The major difference between learned image compression and the traditional hybrid codec lies in their optimization. Instead of hand-crafted tuning, learned image compression models can be automatically tuned with respect to any differentiable metric, e.g., SSIM [13], MS-SSIM [14], or a perceptual difference [15] calculated by neural networks. In addition, while the traditional hybrid coding framework is usually improved at the scale of individual components, in learning-based methods all modules are trainable, and it is possible to optimize all parameters and components jointly. However, it is nontrivial to achieve good performance in end-to-end learned compression because of the difficulties in optimization.
• Full-Resolution Processing. Convolutional neural networks support the full-resolution processing of images, while hybrid frameworks usually process partitioned blocks. Full-resolution processing brings more benefits to entropy modeling with more context and avoids the blocking artifacts caused by partitioning. However, it also comes with an increase in complexity: because the receptive field of a convolutional kernel is limited, the network needs to be deepened to perceive larger regions and improve its modeling capacity.
• Rate Control. With joint optimization, the whole model can directly target the rate-distortion constraint, while in hybrid schemes an additional rate-control component is employed and may not produce an optimal approximation. However, for a large portion of learning-based methods, multiple models need to be trained for different rate-distortion trade-offs, and the single-model variable-bit-rate architectures are usually much more time-consuming. Therefore, practical applications of these methods are sometimes limited.

3 OVERVIEW OF PROGRESS IN RECENT YEARS
Since the pioneering work of Toderici et al. [10] in 2015 exploited recurrent neural networks for learned image compression, much progress has been made. Benefiting from the strong modeling capacity of deep networks, the performance of learned image compression has exceeded that of traditional codecs from JPEG up to BPG (HEVC Intra), and the performance gap is widening further. The milestones of learned image compression are summarized in Table 1. Early works aim to search for possible architectures to apply transform coding with neural networks and propose end-to-end trainable solutions. Ballé et al. [11], [16], [20] propose a learning-based framework with GDN-nonlinearity-embedded analysis and synthesis transforms for learned image compression, while Toderici et al. utilize recurrent models for variable-rate learned compression [10], [17].

To make the network end-to-end trainable, the quantization component, which is not differentiable by definition, should be designed carefully and approximated by a differentiable process. Some works replace true quantization with additive uniform noise [20], [24], while others use direct rounding in the forward pass and back-propagate the gradient of the identity mapping y = x. In addition, Agustsson et al. [18] propose replacing direct scalar quantization with soft-to-hard vector quantization to make the quantization smoother. Dumas et al. [39] design a model that additionally learns the quantization parameters. As it is nontrivial to train a variational autoencoder (VAE) [40] based model that incorporates quantization, advanced optimization techniques for image compression are still being extensively studied [41].
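To make the noise-based quantization relaxation and the objective of Eq. (5) concrete, below is a minimal PyTorch-style sketch of the two pieces. It illustrates the general recipe described above rather than the implementation of any particular paper; `likelihoods` stands for whatever per-element probabilities the entropy model assigns.

```python
import torch

def quantize(latent, training=True):
    """Differentiable stand-in for rounding during training [20], [24]."""
    if training:
        # Additive uniform noise in [-0.5, 0.5) mimics the rounding error.
        return latent + torch.empty_like(latent).uniform_(-0.5, 0.5)
    return torch.round(latent)  # true quantization at test time

def rate_distortion_loss(x, x_hat, likelihoods, lam):
    """R + lambda * D objective of Eq. (5); the rate term follows Eq. (4)."""
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate_bpp = -torch.log2(likelihoods).sum() / num_pixels  # bits per pixel
    distortion = torch.mean((x - x_hat) ** 2)  # MSE; any differentiable metric works
    return rate_bpp + lam * distortion
```

During training, the loss would be `rate_distortion_loss(x, decoder(quantize(encoder(x))), likelihoods, lam)` for a hypothetical encoder/decoder pair and a chosen trade-off λ.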
When the compression network is trainable, the next issue is to efficiently reduce the spatial redundancy in the image signal, where the transform is usually the critical part. Some transforms take the form of a convolutional neural network (CNN), e.g., with GDN [11], [16], [20] or residual blocks with enhanced nonlinearity [19]. Advanced convolutional architectures such as attention modules [36], non-local networks [38], and invertible structures [37] have also been employed to improve the modeling capacity of the transforms. Others resort to a recurrent neural network (RNN) to infer latent representations progressively, which forms a scalable coding framework [10]. In each iteration, the network largely squeezes out the unnecessary bits in the latent representations. Therefore, the final representations are compact.

After the transform, the compact latent representations are further compressed via entropy coding, where frequently occurring patterns are represented with few bits and rarely occurring patterns with many bits. Earlier works incorporate elementwise independent entropy models to estimate the probability distribution of the latent representations [19], [20] and independently encode each element with an arithmetic coder. After these initial trials, later advanced methods explicitly estimate the entropy with hyperpriors [24], [26], predictive models [25], [32], [34], [36], [37], [38], or other learned parametric models [17], [27], [28].

In addition to the abovementioned methods that target signal fidelity with learned transform coding frameworks, there are emerging works targeting novel application conditions, notably compression for machine vision [42] or for human perception at low bit-rates. According to research on the human visual system, human eyes are less sensitive to pixelwise distortion in areas with complex texture. Therefore, generative models such as conditional generative adversarial networks (GANs) can be employed to synthesize such areas, with low-bit-rate representations serving as the guidance. This can be utilized to design high-efficiency image codecs. Rippel et al. [22] first proposed utilizing an adversarial loss function in an end-to-end framework to improve visual quality. In later literature, Agustsson et al. [43], Tschannen et al. [30], and Santurkar et al. [44] improve the capacity of adversarial learning by introducing advanced generative networks to provide superior reconstruction quality at extremely low bit-rates. Mentzer et al. [45] demonstrated that, with a hyperprior-based compression model and a generative convolutional decoder with ChannelNorm, it is possible to achieve similar visual quality on high-resolution natural images with only half the bit-rate.

In summary, the tremendous progress in learned image compression unveils the power of machine learning techniques. Nevertheless, there is still a large number of problems to investigate, which requires a systematic benchmark to illustrate the critical areas where end-to-end learned frameworks for image compression can be further improved. In the following, we first analyze the important components (i.e., the backbone architecture and the entropy model) in detail and then conduct the benchmark analysis on the methods according to various aspects.

4 BACKBONES FOR IMAGE COMPRESSION
A typical neural network backbone for image compression is built upon the VAE architecture. The architecture encodes images into vectors in a latent space, forming a compact representation. With dimensionality reduction and entropy constraints, the redundancy in the image is squeezed out by the compressive transform. There have been a variety of architectures for the backbone of the framework, which can be coarsely divided into two categories, namely, one-time feed-forward frameworks and multistage recurrent frameworks. Each component in a one-time feed-forward framework conducts the feed-forward operation only once in the encoding and decoding procedure. Usually, multiple models need to be trained to cover different ranges of bit-rates, as the encoder and decoder networks determine the rate-distortion trade-off. In contrast, in multistage recurrent frameworks, an encoding component of the network iteratively conducts compression on the original and residual signals, and the number of iterations controls the rate-distortion trade-off. Each iteration encodes a portion of the residual signal with a certain amount of bits. Such a model can conduct variable-bit-rate compression on its own. In the following, we introduce both types of architectures and conduct a comparative analysis of them.

4.1 One-Time Feed-Forward Frameworks
One-time feed-forward frameworks have been most widely adopted for end-to-end learned image compression. Basic variations of the architectures in the literature are illustrated in Fig. 1.

The first end-to-end learned image compression with a one-time feed-forward structure was proposed by Ballé et al. [16], where the analysis and synthesis transforms for encoding and decoding are made up of single-layer GDN and inverse GDN (iGDN) operations. This structure is then improved to support full-resolution processing, with strided convolutions and the corresponding transposed convolutions [20]. In later works, the hyperprior network [24] is introduced to extract side information from the latent representation produced by the analysis transform, and the side information can improve the entropy estimation of the latent code.

In addition to the frameworks equipped with GDN, another kind of feed-forward network utilizing residual blocks is proposed by Theis et al. [19] and Mentzer et al. [27]. These networks stack multiple residual blocks in both the encoder and decoder, greatly expanding the depth of the model. With deeper networks, the encoder and decoder can embed more complex image priors, and they have more flexibility in modeling nonlinear transforms. In addition, some works adopt a multiscale structure [22], [31], which also extends the capacity of the network.

It is reported that a more complex design of an architecture with GDN may bring further improvements in compression performance [46], [47], but these are not as significant as those of other contributions, such as a hyperprior. Unlike other computer vision tasks, e.g., image recognition, where a deeper network can usually bring extra gain in performance [48], [49], simply extending the architecture complexity does not result in significant improvements in compression performance.
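As a concrete reference for the GDN nonlinearity [11] used throughout these transforms, a compact sketch is given below. Real implementations additionally constrain β and γ to stay positive (e.g., via reparameterization), which is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDN(nn.Module):
    """Generalized divisive normalization [11]:
    y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2),
    applied at every spatial position across channels."""
    def __init__(self, channels, inverse=False):
        super().__init__()
        self.inverse = inverse  # iGDN multiplies instead of dividing
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, x):
        # A 1x1 convolution computes sum_j gamma_ij * x_j^2 per position.
        gamma = self.gamma.view(*self.gamma.shape, 1, 1)
        norm = torch.sqrt(F.conv2d(x * x, gamma, self.beta))
        return x * norm if self.inverse else x / norm
```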
[Fig. 1 graphic: one-time feed-forward backbone variations, with panels including (a) the GDN transform [16] built from Conv/GDN layers in the analysis transform and Deconv/iGDN layers in the synthesis transform, (c) the hyperprior model [24], [25] with arithmetic encoding/decoding (AE/AD) of the latents conditioned on predicted (μ, σ), and (d) the residual auto-encoder [19] built from residual blocks.]

Fig. 2: Illustration of the backbones of the multistage recurrent framework and its variations. The main feature of these designs is that the residue of one stage is taken as the input of the next stage. (a) and (b) show the vanilla structure and its improved stateful form [10]. (c)-(e) show different cross-stage connections [21], i.e., Incremental, Skip-Connection, and Stateful Propagation.
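As a rough illustration of the Incremental variant named in the caption of Fig. 2(c), the following sketch shows the multistage coding loop; `enc_dec` is a placeholder for one stage of the recurrent network, and the binarization and entropy coding of each stage's bits are abstracted away.

```python
import torch

def incremental_compress(x, enc_dec, num_stages):
    """Incremental multistage coding (Fig. 2(c)): each stage codes the
    residue left by the previous stages, and the decoded stage outputs
    are summed to form the final reconstruction."""
    reconstruction = torch.zeros_like(x)
    residue = x
    for _ in range(num_stages):        # more stages -> more bits, less distortion
        stage_out = enc_dec(residue)   # encode and decode the current residue
        reconstruction = reconstruction + stage_out
        residue = x - reconstruction   # what remains for the next stage
    return reconstruction
```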
4.2 Multistage Recurrent Frameworks
In the stateful form of the multistage recurrent framework [10], the network keeps a state during sequential processing that propagates the features of the images to the following steps to facilitate the modeling of the multilevel residue. Fig. 2 shows the unrolled stateful structure. In each stage, the modules in the pipeline take the currently processed residue and the state from the previous stage as the input. The states are updated and propagated for processing in the next step.

There have been studies on the aggregation of the outputs of each stage. Baig et al. [21] present and analyze different kinds of aggregation schemes. The basic Incremental scheme adds the outputs of all stages together to form the final decoded image. The loss function of the Incremental scheme usually includes a term to encourage the output of each stage to approximate the residue of the previous stage. A different way to combine all the stages is to treat the multistage structure as a residual network, forming the Skip-Connection scheme. There is only one term in the loss function for such a scheme, requiring that the sum of all the stages reconstructs the original image. Unlike the Incremental structure, there is no explicit arrangement of the residue in the Skip-Connection structure. The outputs of all stages contribute to the final reconstruction, each as a supplement to the reconstruction quality with respect to the others. In addition to these two kinds of schemes, Baig et al. report that with the Stateful-Propagation structure and the corresponding residual-to-image prediction, where each step produces a prediction of the original image rather than of the residual signal, the network achieves the best performance. In such a stateful propagation scheme, it is important to propagate the states of the layers to the next step to construct a refined decoded image.

4.3 Comparative Analysis
Each of the two categories of backbone architectures has its own properties and corresponding pros and cons. The differences are mainly due to the choice between the one-time structure and the progressive structure. Here are the main differences:
• Recurrent models can naturally handle variable-rate compression, while for feed-forward networks, multiple instances of the network need to be trained to support a variable range of bit-rates.
• Feed-forward networks are comparatively shallower, and the path of back-propagation is much shorter. Training such a network can be easier. In contrast, training recurrent models requires the back-propagation through time (BPTT) technique, which is more complicated.
• Weights are shared across different stages in the recurrent model; thus, a practical image codec may require less storage for the parameters compared with one-time feed-forward models. However, residual signals and image signals are different in nature, making the training of a recurrent model more challenging.
• It usually takes more time for recurrent models to encode and decode an image because the network is executed multiple times.

Despite these pros and cons, existing works report higher rate-distortion performance with one-time feed-forward architectures [25], [28]. However, variable-bit-rate compression is commonly required by applications, which becomes the major barrier for end-to-end learned image compression methods to be adopted by existing systems. More efforts are needed to investigate an efficient way to achieve variable-rate compression for learning-based approaches.

5 ENTROPY MODELS
Entropy coding is an important component in an image compression framework. According to information theory [51], the bit-rate needed to encode a signal is bounded by the information entropy, which corresponds to the probability distribution of the symbols representing the signal. Thus, the entropy coding component is embedded in the end-to-end learned image compression framework to estimate the probability distribution of the latent representations and apply constraints on the entropy to reduce the bit-rate.

There is a large amount of research on entropy models for learned image compression. A summary of solutions to the problem of entropy modeling is presented in Table 2, and we illustrate the typical structure of different variations in Fig. 3.

Ideal entropy coding requires a precise estimation of the joint distribution of the elements in the latent representations for each instance of the image. In earlier works, those elements are assumed to be independently distributed [19], [20] to simplify the design. However, even with optimized transforms, it is still difficult to eliminate the spatial redundancy in the latent maps of the images. Thus, a variety of entropy models are proposed to further reduce the redundancy in the latent code. These methods include statistical analysis over a given dataset [18], [19], [20], [26], [35], contextual prediction or analysis [17], [25], [28], [29], [31], [33], [34], [38], [52], and utilizing a learned hyperprior [24], [25], [34] for entropy modeling. The entropy model provides the estimation of the likelihood of all the elements, and the expectation of the negative log-likelihoods bounds the bit-rate needed to encode these elements. With the entropy model, in most works, arithmetic coding [5] is utilized to practically losslessly encode the symbols of the latent representations.

It is worth noting that in traditional hybrid frameworks, improvements to the entropy model only affect the entropy coding performance. For learned methods, as all the components are jointly optimized, a better-designed entropy model not only produces a more precise estimate of the entropy but also changes the patterns produced by the analysis transform. As a consequence, the design of the entropy model should also take the structure of the other components in the pipeline into consideration.

In summary, existing methods aim to provide a flexible transform and an accurate entropy model, all of which are neural network-based and end-to-end trainable. In addition to the main goal of rate-distortion performance, several issues need to be addressed in the exploration. The model should be adaptive to different ranges of resolutions, bit-rates, and distortions. Currently, as high-resolution capturing and display devices emerge, high-efficiency compression of high-resolution images is a constantly growing need.
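As a small numeric illustration of the log-likelihood bound discussed in this section, the expected bit count of a latent under a factorized entropy model can be computed directly from the per-element likelihoods; the probabilities below are made up for the example.

```python
import numpy as np

# Hypothetical per-element likelihoods q(y_i) assigned by an entropy model.
likelihoods = np.array([0.5, 0.25, 0.125, 0.125])

# Information content of each symbol and the total coding cost bound.
bits_per_symbol = -np.log2(likelihoods)  # [1, 2, 3, 3] bits
total_bits = bits_per_symbol.sum()       # 9 bits for this toy latent
print(total_bits)  # an arithmetic coder [5] approaches this bound in practice
```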
TABLE 2: Summary of solutions for entropy modeling in learned image compression.

Binary / Direct: The network directly produces binary codes, which are transmitted as the bit-stream without entropy modeling [10]. Optional external entropy codecs, such as adaptive arithmetic coding [22], can be applied to the bit-stream to improve coding efficiency.

Binary / Masked: In addition to the binary code, the network also constructs a mask from the feature to indicate the length of the binary code [23], [29], [52]. The mask is usually transmitted together with the bit-stream. With rate control, the overall performance can be further improved compared to the direct scheme.

Binary / Context-Model: The probability distribution of all the symbols to be encoded is estimated by the network from previously coded symbols [17] and spatially adjacent symbols [28]. The context model can estimate the probability more accurately so that entropy coding can be conducted more efficiently.

Statistical / Histogram: The probability distribution is estimated by a histogram [18]. A variation of this scheme is to use a Laplace-smoothed histogram for better generalization [19].

Statistical / Piecewise Linear: The probability density function is approximated by a parametric piecewise linear function during training [20]. Context-Adaptive Binary Arithmetic Coding (CABAC) [6] is used to practically compress the latent codes.

Statistical / Parametric Factorized: A function p(x_i) = f(x_i, θ) with trainable parameters θ is modeled to estimate the probability of a symbol x_i. These parameters reflect the distribution of the latent code over the training set and can be generalized to all images [35], [53].

Statistical / Gaussian (Mixture): Networks based on the VAE assume that the latent code follows an elementwise Gaussian distribution. The loss function includes a cross-entropy term between the actual distribution and the estimated Gaussian distribution to control the bit-rate [24], [25], [34]. A Gaussian Mixture distribution is shown to better estimate the likelihoods [36].

Context-Model / PixelRNN, PixelCNN: Multistage recurrent models [17] employ PixelRNN [54], while one-time feed-forward models [27], [31] utilize PixelCNN [55] for spatial-context-conditioned probability modeling.

Context-Model / Masked Convolution: Masked 2D [25], [34] or 3D [38] convolutions can be seen as a simplified version of PixelCNN for conditional probability modeling. They estimate the likelihood of a to-be-encoded element based on already decoded elements.

Context-Model / Offline: The latent code produced by a given encoder is analyzed offline in tiles by learning a dictionary, and the indices are transmitted with lossless compression [26].

Side-Information Guided / Hyperprior: The hyperprior, transmitted in the bit-stream, encodes the parameters of a Gaussian entropy model [24] to estimate the likelihoods of the elements to be encoded. It greatly improves the accuracy of the entropy model, and it can be combined with the context model for enhanced modeling.
On the other hand, with the rapid development of large-scale parallel computing devices, e.g., GPUs, models should also be designed to take advantage of parallel computing for higher efficiency. According to the above analysis, the one-time feed-forward frameworks with convolutional neural network-powered hyperprior structures have more potential to scale to high resolutions and to support large-scale parallelism. With this idea in mind, we adopt a one-time feed-forward framework and take one step towards superior performance with a newly proposed coarse-to-fine hyperprior compression model.

6 PROPOSED COARSE-TO-FINE MODEL
6.1 Coarse-to-Fine Hyperprior Modeling
As analyzed, we follow the basic design of a one-time feed-forward framework, which consists of an analysis transform G_a and a synthesis transform G_s. G_a transforms the image to latent representations, and G_s reconstructs the image from those representations. To perform entropy coding, the latent representations are first quantized to a vector of discrete symbols X = {X_1, X_2, ..., X_n}. In addition, a parametric entropy model Q_X(X; θ) w.r.t. the random vector X is built to provide the estimation of the likelihoods. The aim of entropy coding is now to jointly optimize the parameters in the networks to 1) accurately model the distribution P_X(X) of the random vector X with Q_X(X; θ) and 2) minimize the overall rate-distortion function with the estimated entropy. State-of-the-art methods combine context models and hyperpriors. In such approaches, it is first assumed that the joint probability distribution of X can be factorized into a product of sequential conditional probabilities as follows:

Q_{X|Y}(X|Y) = ∏_i Q_i(X_i | X_{i−1}, X_{i−2}, ..., X_{i−m}, Y),    (6)

where Y denotes the hyperprior, which is generated from X and encoded into the bit-stream. When we need to decode X, Y has already been decoded. These kinds of models need to address two issues. First, the dimensionality and the corresponding bit-rate of Y should be kept low; otherwise, Y itself may contain too much redundancy and is not efficiently compressed.
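A sketch of the masked convolution commonly used to realize the causal conditioning in Eq. (6) (a simplified PixelCNN-style layer, cf. Table 2) is shown below; the channel sizes in the usage example are illustrative only.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose kernel is zeroed at and after the center position,
    so each output only sees already-decoded (causal) context in raster order."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        mask = torch.zeros_like(self.weight)
        _, _, kh, kw = self.weight.shape
        mask[:, :, :kh // 2, :] = 1        # rows strictly above the center
        mask[:, :, kh // 2, :kw // 2] = 1  # left of the center in the same row
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask      # enforce causality before each call
        return super().forward(x)

# Example: predict entropy-model parameters from the causal context of a latent map.
context = MaskedConv2d(192, 384, kernel_size=5, padding=2)
params = context(torch.randn(1, 192, 16, 16))  # e.g., means and scales
```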
Fig. 3: Illustration of entropy modeling methods. (a)-(c) Binary methods and variations with the masking and context model, including (a) direct modeling [10], (b) masked modeling [23], [29], [52], and (c) the binary context model [17]. (d) Spatial context model for latent code maps [25], [27], [34]. (e) Hyperprior entropy model [24].

In such a circumstance, the hyperprior may not provide enough information to accurately model the conditional probability, especially for higher ranges of bit-rates and large resolutions. Second, although contextual conditioning can help with accuracy, it is performed in a sequential way and is hard to accelerate with large-scale parallel computing devices. Thus, the framework is less scalable for input images of different sizes.

To address the issues of the sequential context models, in the proposed method we adopt a multilayer conditioning framework, which improves scalability for images of different sizes. The formulation is modified as follows:

Q_X(X) = Q_{X,Y}(X, Y) = Q_Y(Y) Q_{X|Y}(X|Y).    (7)

The first equality in Eq. (7) holds because the hyperprior Y is generated from X in a deterministic manner. When X becomes complex and is controlled by expanding the dimension, Y may need to embed more information to support accurate conditional modeling. Therefore, an additional layer of the hyperprior is introduced as follows:

Q_Y(Y) = Q_{Y,Z}(Y, Z) = Q_Z(Z) Q_{Y|Z}(Y|Z),    (8)

which in fact forms a coarse-to-fine hyperprior model. The dimension of Z is reduced, and the redundancy is squeezed out by the hyper transforms. Thus, the joint distribution P_Z(Z) of the latent representation Z = {Z_1, Z_2, ..., Z_n} at the innermost layer can be approximately factorized as follows:

Q_Z(Z) = Q_Z(Z_1, Z_2, ..., Z_n) ≈ ∏_i Q_{Z_i}(Z_i).    (9)

With Eq. (7) and Eq. (8), the probability distributions of Y and X can now be modeled in a conditional way, and existing works [54], [56] show that neural networks are capable of modeling conditional probability distributions. The hyper representation Y is also designed to embed the main information of the images to be compressed. Therefore, the joint distributions can also be approximately factorized as follows:

Q(X|Y) = Q(X_1, ..., X_n|Y) ≈ ∏_i Q_{X_i|Y}(X_i|Y),
Q(Y|Z) = Q(Y_1, ..., Y_n|Z) ≈ ∏_i Q_{Y_i|Z}(Y_i|Z),    (10)

where all elements in the previous layer can be utilized as conditions to estimate the distribution of the latent representation at the upper layer. Although no contextual conditioning is conducted here, contextual conditioning can be implicitly modeled in the information flow from X to Y and then used to predict X from Y. Unlike existing block-conditioning context models, in the proposed framework the estimation of the probability for each element utilizes information from a larger area due to the coarse-to-fine structure. This helps to exploit long-term correlations in images and improves the compression performance, especially for high-resolution images.

6.2 Network Architecture
The overall structure of the end-to-end learned coarse-to-fine framework is shown in Fig. 4, jointly with the encoder and decoder. The analysis transform network encodes the input image as the latent representation X, which is then quantized with a rounding operation. It aims to squeeze out pixelwise redundancy as much as possible. We exploit GDN as the activation in the analysis transform and inverse GDN in the synthesis transform. We conduct coarse-to-fine modeling with a multilayer hyper analysis transform and a symmetric hyper synthesis transform. According to Eq. (7) and Eq. (8), to estimate the distribution of X, a probability estimation network is employed to process Y and predict the likelihood P_{X_i}(X_i = x_i) with the estimated Q_{X_i}(X_i = x_i) for each element X_i in X. As stated in [24], the conditional distribution of each element in X can be assumed to be Gaussian, and the probability estimation network predicts the mean and scale of the Gaussian distribution. As the latent code has been rounded to be discrete, the likelihood of the latent code can be calculated as follows:

Q_{X_i|Y}(X_i = x_i | Y) = φ((x_i + 1/2 − µ_{x_i}) / σ_{x_i}) − φ((x_i − 1/2 − µ_{x_i}) / σ_{x_i}),    (11)

where φ denotes the cumulative distribution function of the standard normal distribution, while the mean µ_{x_i} and scale σ_{x_i} are predicted from Y. The same process is conducted w.r.t. Y and Z to estimate the probability distribution of Y. As illustrated in Eq. (9), the probability distribution of Z can be approximately factorized. Thus, we employ a zero-mean Gaussian model. The likelihood of each element in Z can be calculated as follows:

Q_{Z_i}(Z_i = z_i) = φ((z_i + 1/2) / σ_{z_i}) − φ((z_i − 1/2) / σ_{z_i}).    (12)
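Eq. (11) and Eq. (12) translate directly into code via the standard normal CDF; a sketch follows, with a small floor on σ added for numerical stability (an implementation detail assumed here, not specified in the text).

```python
import torch

def gaussian_likelihood(x, mu, sigma):
    """Probability mass of the rounded value x under N(mu, sigma^2), i.e.,
    Phi((x + 0.5 - mu)/sigma) - Phi((x - 0.5 - mu)/sigma) as in Eq. (11).
    For the innermost layer Z, set mu = 0 to recover Eq. (12)."""
    sigma = sigma.clamp(min=1e-6)            # stability floor (assumption)
    gaussian = torch.distributions.Normal(mu, sigma)
    upper = gaussian.cdf(x + 0.5)
    lower = gaussian.cdf(x - 0.5)
    return (upper - lower).clamp(min=1e-9)   # avoid log(0) in the rate term
```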
[Fig. 4 graphic: the input image passes through the encoder network (analysis transform) to produce the latent representation; multiple layers of hyper transforms feed arithmetic encoder/decoder (AE/AD) pairs; probability estimation networks predict (μ, σ) for each layer, with the innermost layer using (μ = 0, σ); an information aggregation network and the decoder network (synthesis transform) produce the decoded image.]
Fig. 4: Overall architecture of the multilayer image compression framework. The probability distribution of the innermost
layer of the hyperprior is approximated with a zero-mean Gaussian distribution, where the scale values σ are channelwise
independent and spatially shared.
Note that σ_{z_i} is a trainable parameter in the network. All elements in a channel of the latent representation share the same σ, while each channel has an independent one.

According to information theory, the minimum bit-rate required to encode X (or Y and Z) with the estimated distribution equals the cross entropy of the real distribution P_{X|Y}(X|Y) and the estimated distribution Q_{X|Y}(X|Y) ∼ N(µ_x, σ_x), which is denoted as follows:

R = H(Q) + D_KL(P‖Q) = E_{X|Y}[−log(Q)].    (13)

We minimize the rate-distortion function L_RD = R + λD with the network. To accelerate convergence during the training of the multilayer network, an additional information-fidelity loss is introduced. This loss term encourages the hyper representation Y to maintain the critical information in X during training and is formulated as follows:

min_{Y,θ} L_if = ‖F(Y; θ) − X‖².    (14)

[Figure graphic: the L1 representation passes through Deconv #1-#3 and is concatenated with the main representation, followed by a residue block and Conv #1-#3 with a skip connection.]

After the split, one half of the tensor is used as the mean of the Gaussian distribution. We calculate the absolute value of the other half as the scale.

TABLE 3: Structure of the signal-preserving hyper transform.

(a) Hyper analysis transform.
Name | Operation | Output Shape | Activation
Input | / | (b, h, w, c) | /
#1 E | Conv. (3 × 3) | (b, h, w, 2c) | Linear
Down | Space-to-Depth | (b, h/2, w/2, 8c) | /
#2 E | Conv. (1 × 1) | (b, h/2, w/2, 4c) | ReLU
#3 E | Conv. (1 × 1) | (b, h/2, w/2, 4c) | ReLU
#4 E | Conv. (1 × 1) | (b, h/2, w/2, c) | Linear

(b) Hyper synthesis transform.
Name | Operation | Output Shape | Activation
Input | / | (b, h/2, w/2, c) | /
#1 D | Deconv. (1 × 1) | (b, h/2, w/2, 4c) | Linear
#2 D | Deconv. (1 × 1) | (b, h/2, w/2, 4c) | ReLU
#3 D | Deconv. (1 × 1) | (b, h/2, w/2, 4c) | ReLU
Up | Depth-to-Space | (b, h, w, c) | /
#4 D | Deconv. (3 × 3) | (b, h, w, c) | Linear
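Read as a network definition, Table 3(a) corresponds to a short stack of layers; a possible PyTorch rendering is sketched below, using PixelUnshuffle for the Space-to-Depth step and treating the channel width c as a free parameter.

```python
import torch.nn as nn

def hyper_analysis(c):
    """Sketch of the signal-preserving hyper analysis transform of Table 3(a)."""
    return nn.Sequential(
        nn.Conv2d(c, 2 * c, kernel_size=3, padding=1),  # #1 E, linear
        nn.PixelUnshuffle(2),                           # Space-to-Depth: 2c -> 8c
        nn.Conv2d(8 * c, 4 * c, kernel_size=1),         # #2 E
        nn.ReLU(inplace=True),
        nn.Conv2d(4 * c, 4 * c, kernel_size=1),         # #3 E
        nn.ReLU(inplace=True),
        nn.Conv2d(4 * c, c, kernel_size=1),             # #4 E, linear
    )
```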
During training, we first train the main transforms and then train the fine-grained hyper transforms progressively. Each group of hyper transforms (i.e., the fine-grained groups and the coarse-grained groups of hyper transforms) is trained for 20,000 iterations. Next, we train the Information-Aggregation Reconstruction subnetwork for another 20,000 iterations. Finally, we tune the whole network end-to-end for 400,000 iterations to complete the training.

7 EVALUATION
7.1 Datasets
End-to-end learned image compression is a self-supervised problem where distortion metrics measure the difference between the original image and the reconstructed image, and the bit-rate corresponds to the entropy of the latent code. Thus, no extra labeling labor is needed, and many existing large-scale image sets, e.g., ImageNet [58] and DIV2K [57], can be used to train networks for image compression. To reduce possible compression artifacts in the images, lossy-compressed images are usually downsampled before they are used for network training.

Commonly used testing image sets include Kodak [59] and Tecnick [60], which contain high-quality natural images that have not been lossy-compressed. The Kodak dataset consists of 24 images with resolution 512 × 768, with a wide variety of content and textures that are sensitive to artifacts. Thus, it has been widely used to evaluate image compression methods. For the Tecnick dataset, the SAMPLING testset is used for evaluation in some works. In contrast to Kodak, this dataset contains images with a higher resolution (1200 × 1200), which can serve as a supplemental benchmark for image compression methods that may perform differently on images with different resolutions. In addition, in recent years, the CVPR Workshop and Challenge on Learned Image Compression (CLIC), with the goal of encouraging research in learning-based image compression, has attracted much attention in the community. A testing dataset consisting of images captured by both mobile phones and professional cameras is provided and updated year by year. The images have higher resolutions, on average 1913 × 1361 for mobile photos and 1803 × 1175 for professional photos. Evaluation results on this dataset indicate compression performance on images with relatively high resolutions.

7.2 Rate-Distortion Performance
Although the overall history of the development of end-to-end learned image-compression methods is not as long as that of hybrid coding standards, there have been a significant number of works on this topic, and tremendous progress has been made. However, few studies have thoroughly evaluated rate-distortion performance on various images and compared baselines (i.e., anchors). It is nevertheless valuable to compare performance on technical merits to investigate which directions truly affect performance. In the following, we summarize the performance of selected works. The contributions of these works include different methods for entropy modeling, novel architecture designs, and normalization.

7.2.1 Evaluation Protocol
Three datasets, i.e., Kodak, Tecnick, and the CLIC 2019 validation set, are used in the evaluation, corresponding to three different levels of resolution and different content. For the evaluated learning-based methods, we average the metrics of the bit-rate (bpp) and the distortion (PSNR and MS-SSIM) across each dataset for the different models, which are usually trained with different trade-off coefficients λ. We compare the learning-based methods with JPEG, BPG, and VVC. For these hybrid codecs, the metrics are averaged at different quality factors (QFs) or quantization parameters (QPs). To illustrate the comparison, we show the rate-distortion curves in Fig. 6.² We also calculate the BD-rate [61] with respect to the bit-rate and PSNR over the three datasets. As not all methods cover the whole range of bit-rate and PSNR, we separate different bit-rate ranges for evaluation and comparison, marked as low, median, high, and full. The bit-rate ranges differ among the datasets due to variations in content, but the full ranges are selected to cover the variation of image quality from poor to transparent, as shown in Table 6. The BD-Rate results are shown in Table 7. We analyze the results and summarize the important properties in the following.

2. The results of NeurIPS18 correspond to the publicly released code, which does not include the auto-regressive context model.

TABLE 6: Specifications of the BD-Rate ranges on the different testing datasets. The full ranges are illustrated, and the low, median, and high ranges are chosen within the full ranges.

Dataset | Bit-Rate Range | PSNR Range
Kodak | 0.25 bpp - 1.40 bpp | 26 dB - 40 dB
Tecnick | 0.12 bpp - 0.70 bpp | 26 dB - 43 dB
CLIC | 0.20 bpp - 1.05 bpp | 28 dB - 40 dB

7.2.2 Entropy Model
The design of the entropy model is the main driving force of improvements in rate-distortion performance. The design of entropy models in end-to-end learned image compression has developed from contextual binary entropy models [17] to hyperprior models and spatial / cross-channel entropy estimation [24], [25]. Specifically, as shown in Fig. 6, a leap in gain occurred with the emergence of hyperpriors, which have been adopted by many other frameworks. Despite their great success, modeling contextual probability is still a challenging topic in image modeling due to variation in resolution. As shown in Table 7, the context model-based method [34] may have unstable gain over BPG at different levels of resolution, while the proposed methods achieve consistent superiority over the anchor.

7.2.3 Depth of the Network
The depth of the network is a comparatively less important factor in performance, while in other computer vision tasks, networks with deeper architectures usually perform better than those with fewer layers. Some works [46], [62] also confirm this observation.
Fig. 6: Rate-distortion curves. The methods include PCS18-ReLU and PCS18-GDN [53], ICLR18-Factorized and ICLR18-HyperPrior [24], NeurIPS18 [25], ICLR19 [34], CVPR17-RNN [17], CVPR18-Condition [27], BPG-4:4:4 [9], VTM8-4:4:4 [8], and JPEG [1]. We conduct the evaluation on the three datasets Kodak, Tecnick, and CLIC 2019 (validation set). PSNR and MS-SSIM are used as the distortion metrics. We convert the MS-SSIM values to decibels (−10 log₁₀(1 − d), where d refers to the MS-SSIM value) for a clear illustration, following [24].
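For reference, the decibel conversion used in these plots is a one-line helper:

```python
import math

def msssim_to_db(d):
    """Convert an MS-SSIM value d in [0, 1) to decibels, following [24]."""
    return -10.0 * math.log10(1.0 - d)

print(msssim_to_db(0.99))  # an MS-SSIM of 0.99 maps to 20.0 dB
```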
Instead of building complicated network architectures, work may focus more on the specific design of the networks to better model the image prior. However, it has been reported that the network should consist of a sufficient quantity of parameters, and the width of the network should be large enough for effective modeling of images, especially for higher ranges of bit-rates and higher quality [34].

7.2.4 Normalization
It is reported in [24] that batch normalization [63], commonly used to improve the performance of neural networks, does not bring significant improvement. However, Ballé et al. proposed generalized divisive normalization [11], [53], which is proven to be able to decorrelate the elements in images and improve the overall performance. Most state-of-the-art solutions adopt normalization and its inverse in the main encoding and decoding transforms. However, it remains a topic for future research to reduce spatial redundancy more efficiently with normalization.

7.2.5 Summary
We evaluate the rate-distortion performance of different methods developed in recent years. As we can see from the results, great progress has been made in improving rate-distortion performance, where the decorrelating normalization and the hyperprior model bring significant improvement. Nevertheless, we also see large variations in
performance on different testing datasets. Compared with existing works, the proposed method achieves a more consistent gain on different content and resolutions.

TABLE 7: Evaluation of the BD-Rate of different methods (optimized for PSNR) on different image sets. We set BPG-4:4:4 as the anchor. Negative values reflect the average bit-rate saving compared to the anchor at the same level of distortion. Best performances are marked in bold, while the second-best ones are underlined.

Kodak
Method | Low | Median | High | Full
CVPR18-Condition | 76.18% | N/A | N/A | N/A
VTM8-4:4:4 | -20.85% | -18.26% | -15.86% | -18.91%
PCS18-GDN | 42.09% | 40.64% | 41.62% | 41.68%
ICLR18-Factorized | 35.46% | 32.75% | 28.59% | 32.96%
ICLR18-HyperPrior | 6.11% | 3.06% | 0.32% | 4.07%
NeurIPS18 | -4.63% | -4.62% | -5.03% | -4.68%
ICLR19 | -6.27% | -4.75% | -4.18% | -5.40%
CVPR17-RNN | 189.38% | 174.02% | 193.28% | 176.16%
JPEG | N/A | 113.28% | 104.99% | N/A
Ours | -10.65% | -8.74% | -8.31% | -9.42%

Tecnick
Method | Low | Median | High | Full
CVPR18-Condition | N/A | 123.13% | N/A | N/A
VTM8-4:4:4 | -31.63% | -30.14% | -27.56% | -30.25%
PCS18-GDN | 49.28% | 43.57% | 32.34% | 43.00%
ICLR18-Factorized | 41.29% | 37.64% | 25.58% | 36.70%
ICLR18-HyperPrior | 5.64% | -0.23% | -7.82% | 0.98%
NeurIPS18 | -12.76% | -14.84% | -18.16% | -14.54%
ICLR19 | -9.85% | -10.92% | -11.78% | -11.65%
CVPR17-RNN | N/A | 210.12% | 224.64% | N/A
JPEG | 222.30% | 193.80% | 187.90% | 198.24%
Ours | -14.82% | -14.56% | -16.48% | -14.15%

CLIC³
Method | Low | Median | High | Full
CVPR18-Condition | 88.73% | N/A | N/A | N/A
VTM8-4:4:4 | -23.43% | -20.31% | -17.52% | -21.22%
PCS18-GDN | 53.34% | 53.53% | 54.26% | 53.54%
ICLR18-Factorized | 49.37% | 49.45% | 50.85% | 49.64%
ICLR18-HyperPrior | 12.00% | 8.97% | 9.99% | 10.63%
NeurIPS18 | -3.53% | -1.40% | 4.45% | -1.88%
ICLR19 | -7.52% | -4.33% | -2.17% | -4.61%
JPEG | 124.30% | 115.73% | N/A | N/A
Ours | -14.49% | -12.21% | -11.64% | -12.86%

3. The results of CVPR17-RNN on the CLIC 2019 validation dataset are not included, as the available code does not support the resolutions in this dataset.

TABLE 8: BD-Rate evaluation for the coarse-to-fine hyperprior model at different resolutions, with the single-layer hyperprior as the anchor.

Resolution | BD-Rate
4K | -4.65%
1080p | -2.65%
540p | -1.97%

Fig. 7: Rate-distortion curves for different aggregation methods. The methods vary in the aggregated feature forms (Hyper and Mean), feature granularity (Fine and Fine+Coarse), and fusion stage (SYN and IAR).

7.3 Studies on the Proposed Method
7.3.1 Coarse-to-Fine Modeling
We propose the coarse-to-fine hyperprior model to reduce the bit-rate, and we conduct ablation studies to evaluate the coarse-to-fine design. In this experiment, we benchmark on a subset of the LIU4K dataset [64] to evaluate the performance at different resolutions with the same content. Images in the LIU4K dataset are of 4K resolution. We down-sample the images to 1080p (1920 × 1080) and 540p (960 × 540) to build three subsets of different resolutions. We calculate the BD-Rate on the R-D curves, with the single-layer hyperprior model as the anchor. The BD-Rate results are shown in Table 8. As shown, the coarse-to-fine model achieves R-D performance improvements over the original single-layer model. We also show that the coarse-to-fine model achieves more significant BD-Rate reductions on high-resolution images. It especially benefits emerging high-resolution applications.

7.3.2 Information-Aggregation Reconstruction
The Information-Aggregation Reconstruction (IAR) subnetwork is designed to improve reconstruction quality. It aggregates image representations at different granularities to fully utilize the transmitted information for reconstruction. To analyze the effect of the IAR component, we conduct ablation studies considering the forms and granularities of the aggregated features. The results are shown in Fig. 7 and Table 9. There are two types of feature forms, i.e., the Hyper information retrieved right after the hyper synthesis transforms and the Mean of the Gaussian distributions generated by the prediction subnetwork [47]. These feature maps can be aggregated at different resolutions, i.e., at the small-resolution stage before the synthesis transform (SYN) or at the full-resolution stage within the IAR subnetwork. With the proposed coarse-to-fine hyperprior model, the hyperpriors can be aggregated at different granularities, i.e., Fine and Coarse. We empirically analyze the effect of combining these factors in the ablation study. As shown in Fig. 7, the fusion of multi-resolution representations shows significant benefits, and it is beneficial to aggregate information at both coarse and fine granularities. Utilizing the hyperprior representation tends to show better performance than concatenating the Mean information. Besides, aggregation at the higher-resolution stage leads to improved performance.

7.3.3 Visual Quality Analysis
We conduct a visual analysis of the reconstructed images in Fig. 8, where we compare our method with the representative learning-based method (ICLR18-HyperPrior) [24] and BPG.
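All BD-Rate numbers in Tables 7 and 8 follow the Bjøntegaard metric [61]; a sketch of the commonly used cubic-fit computation is given below, with hypothetical R-D points rather than the paper's data.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit-rate difference (%) between two R-D curves at equal PSNR,
    using a cubic fit of log-rate against PSNR as in [61]."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)  # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the overlapping PSNR interval.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100     # negative => bit-rate saving

# Hypothetical four-point R-D curves (bpp, dB):
print(bd_rate([0.2, 0.4, 0.8, 1.2], [30, 33, 36, 38],
              [0.18, 0.36, 0.75, 1.1], [30, 33, 36, 38]))
```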
TABLE 9: BD-Rate corresponding to the R-D curves in Fig. 7. We use the setting “No Aggregation” as the anchor.

TABLE 10: Encoding and decoding time (seconds) for various methods.
Fig. 8: Visualization of the reconstructed images by the proposed method, ICLR18-Hyperprior, and BPG.
[Fig. 9 graphic: VGG L4 distance plotted against MS-SSIM (dB) and PSNR (dB); each learned method appears in MSE-optimized and SSIM-optimized variants.]
Fig. 9: Evaluation of perceptual distance [15] (a lower value corresponds to better quality) with respect to PSNR and
MS-SSIM for different methods. The methods correspond to those in Fig. 6.
scheme and therefore runs faster than those methods in the experiments. VTM is designed with no thread-level parallelism and thus is not accelerated by multi-core devices. Existing works have shown the potential of end-to-end learned compression to achieve higher performance and better efficiency in the near future.
No. 2018AAA0102702, the Fundamental Research Funds for the Central Universities, and the National Natural Science Foundation of China under Contract No. 61772043 and No. 62022038.

REFERENCES
[1] M. W. Marcellin, M. J. Gormish, A. Bilgin, and M. P. Boliek, “An overview of JPEG-2000,” in Proc. Data Compression Conf., 2000.
[2] M. Rabbani and R. Joshi, “An overview of the JPEG 2000 still image compression standard,” Signal Processing: Image Communication, vol. 17, no. 1, pp. 3–48, 2002.
[3] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Springer Science & Business Media, 2012, vol. 159.
[4] M. W. Marcellin and T. R. Fischer, “Trellis coded quantization of memoryless and Gauss-Markov sources,” IEEE Trans. Commun., vol. 38, no. 1, pp. 82–93, 1990.
[5] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
[6] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620–636, 2003.
[7] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, 2012.
[8] J. Chen, Y. Ye, and S. H. Kim, “Versatile video coding (draft 8),” JVET-Q2002-v3, 2020.
[9] F. Bellard, “BPG image format,” http://bellard.org/bpg/, accessed: 2017-01-30.
[10] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” in Proc. Int. Conf. Learn. Representations, 2016.
[11] J. Ballé, V. Laparra, and E. P. Simoncelli, “Density modeling of images using a generalized normalization transformation,” in Proc. Int. Conf. Learn. Representations, 2016.
[12] Y. Hu, W. Yang, and J. Liu, “Coarse-to-fine hyper-prior modeling for learned image compression,” in Proc. AAAI Conf. Artif. Intell., 2020.
[13] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[14] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Conference Record of the Asilomar Conference on Signals, Systems & Computers, 2003.
[15] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2016.
[16] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimization of nonlinear transform codes for perceptual quality,” in Proc. Picture Coding Symp., 2016.
[17] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[18] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Proc. Adv. Neural Inform. Process. Syst., 2017.
[19] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in Proc. Int. Conf. Learn. Representations, 2017.
[20] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proc. Int. Conf. Learn. Representations, 2017.
[21] M. H. Baig, V. Koltun, and L. Torresani, “Learning to inpaint for image compression,” in Proc. Adv. Neural Inform. Process. Syst., 2017.
[22] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in Proc. Int. Conf. Mach. Learn., 2017.
[23] D. Minnen, G. Toderici, M. Covell, T. Chinen, N. Johnston, J. Shor, S. J. Hwang, D. Vincent, and S. Singh, “Spatially adaptive image compression using a tiled deep network,” in Proc. Int. Conf. Image Process., 2017.
[24] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Proc. Int. Conf. Learn. Representations, 2018.
[25] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Proc. Adv. Neural Inform. Process. Syst., 2018.
[26] D. Minnen, G. Toderici, S. Singh, S. J. Hwang, and M. Covell, “Image-dependent local entropy models for learned image compression,” in Proc. Int. Conf. Image Process., 2018.
[27] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[28] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[29] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[30] M. Tschannen, E. Agustsson, and M. Lucic, “Deep generative models for distribution-preserving lossy compression,” in Proc. Adv. Neural Inform. Process. Syst., 2018.
[31] K. M. Nakanishi, S.-i. Maeda, T. Miyato, and D. Okanohara, “Neural multi-scale image compression,” in Proc. Asian Conf. Comput. Vis., 2018.
[32] J. Klopp, Y.-C. F. Wang, S.-Y. Chien, and L.-G. Chen, “Learning a code-space predictor by exploiting intra-image-dependencies,” in Proc. Brit. Mach. Vis. Conf., 2018.
[33] J. Cai and L. Zhang, “Deep image compression with iterative non-uniform quantization,” in Proc. Int. Conf. Image Process., 2018.
[34] J. Lee, S. Cho, and S.-K. Beack, “Context adaptive entropy model for end-to-end optimized image compression,” in Proc. Int. Conf. Learn. Representations, 2019.
[35] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learning image and video compression through spatial-temporal energy compaction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019.
[36] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized Gaussian mixture likelihoods and attention modules,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020.
[37] H. Ma, D. Liu, N. Yan, H. Li, and F. Wu, “End-to-end optimized versatile image compression with wavelet-like transform,” IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[38] T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “Neural image compression via non-local attention optimization and improved context modeling,” IEEE Trans. Image Process., 2021.
[39] T. Dumas, A. Roumy, and C. Guillemot, “Autoencoder based image compression: can the learning be quantization independent?” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018.
[40] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[41] Y. Yang, R. Bamler, and S. Mandt, “Improving inference for neural image compression,” in Proc. Adv. Neural Inform. Process. Syst., 2020.
[42] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Trans. Image Process., vol. 29, pp. 8680–8695, 2020.
[43] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” in Proc. Int. Conf. Comput. Vis., 2018.
[44] S. Santurkar, D. Budden, and N. Shavit, “Generative compression,” in Proc. Picture Coding Symp., 2018.
[45] F. Mentzer, G. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” in Proc. Adv. Neural Inform. Process. Syst., 2020.
[46] J. Lee, S. Cho, S.-Y. Jeong, H. Kwon, H. Ko, H. Y. Kim, and J. S. Choi, “Extended end-to-end optimized image compression method based on a context-adaptive entropy model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019.
[47] J. Zhou, S. Wen, A. Nakagawa, K. Kazui, and Z. Tan, “Multi-scale and context-adaptive entropy model for image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019.
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[50] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[51] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[52] M. Covell, N. Johnston, D. Minnen, S. J. Hwang, J. Shor, S. Singh, D. Vincent, and G. Toderici, “Target-quality image compression with recurrent, convolutional neural networks,” arXiv preprint arXiv:1705.06687, 2017.
[53] J. Ballé, “Efficient nonlinear transforms for lossy image compression,” in Proc. Picture Coding Symp., 2018.
[54] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with PixelCNN decoders,” in Proc. Adv. Neural Inform. Process. Syst., 2016.
[55] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.
[56] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[57] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017.
[58] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[59] E. Kodak, “Kodak lossless true color image suite (PhotoCD PCD0992),” http://r0k.us/graphics/kodak/.
[60] N. Asuni and A. Giachetti, “TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms,” in Proc. Eur. Italian Chapter Conf., 2014.
[61] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, 2001.
[62] L. Zhou, Z. Sun, X. Wu, and J. Wu, “End-to-end optimized image compression with attention mechanism,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019.
[63] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learn., 2015.
[64] J. Liu, D. Liu, W. Yang, S. Xia, X. Zhang, and Y. Dai, “A comprehensive benchmark for single image compression artifact reduction,” IEEE Trans. Image Process., vol. 29, pp. 7845–7860, 2020.
[65] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[66] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015.
[67] D. Liu, H. Zhang, and Z. Xiong, “On the classification-distortion-perception tradeoff,” in Proc. Adv. Neural Inform. Process. Syst., 2019.
[68] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
Yueyu Hu (STM'18-GSM'19) received the B.S. degree in computer science from Peking University, Beijing, China, in 2018, where he is currently working toward the master's degree with the Wangxuan Institute of Computer Technology, Peking University. His current research interests include video and image compression and analytics with machine learning.

Wenhan Yang (M'18) received the B.S. degree and Ph.D. degree (Hons.) in computer science from Peking University, Beijing, China, in 2012 and 2018, respectively. He is currently a postdoctoral research fellow with the Department of Computer Science, City University of Hong Kong. His current research interests include image/video processing and restoration, bad weather restoration, and human-machine collaborative coding. He has authored over 100 technical articles in refereed journals and proceedings, and holds 9 granted patents. He received the IEEE ICME-2020 Best Paper Award, the IFTC 2017 Best Paper Award, and the IEEE CVPR-2018 UG2 Challenge First Runner-up Award. He was a candidate for the CSIG Best Doctoral Dissertation Award in 2019. He served as an Area Chair of IEEE ICME-2021 and an organizer of the IEEE CVPR-2019/2020/2021 UG2+ Challenge and Workshop.

Zhan Ma (SM'19) received the B.S. and M.S. degrees from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2004 and 2006, respectively, and the Ph.D. degree from New York University, New York, in 2011. He is now on the faculty of the School of Electronic Science and Engineering, Nanjing University, Jiangsu 210093, China. From 2011 to 2014, he was with Samsung Research America, Dallas, TX, and Futurewei Technologies, Inc., Santa Clara, CA, respectively. His current research focuses on next-generation video coding, energy-efficient communication, gigapixel streaming, and deep learning. He was a finalist of the 2018 ACM SIGCOMM Student Research Competition and the 2018 PCM Best Paper, and a co-recipient of the 2019 IEEE Broadcast Technology Society Best Paper Award.

Jiaying Liu (M'10-SM'17) is currently an Associate Professor and Peking University Boya Young Fellow with the Wangxuan Institute of Computer Technology, Peking University. She received the Ph.D. degree (Hons.) in computer science from Peking University, Beijing, China, in 2010. She has authored over 100 technical articles in refereed journals and proceedings, and holds 50 granted patents. Her current research interests include multimedia signal processing, compression, and computer vision.

Dr. Liu is a Senior Member of IEEE, CSIG, and CCF. She was a Visiting Scholar with the University of Southern California, Los Angeles, from 2007 to 2008. She was a Visiting Researcher with Microsoft Research Asia in 2015, supported by the Star Track Young Faculties Award. She has served as a member of the Multimedia Systems & Applications Technical Committee (MSA TC) and the Visual Signal Processing and Communications Technical Committee (VSPC TC) of the IEEE Circuits and Systems Society. She received the IEEE ICME-2020 Best Paper Award and the IEEE MMSP-2015 Top 10% Paper Award. She has also served as an Associate Editor of IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, and Elsevier JVCI; the Technical Program Chair of IEEE ICME-2021/ACM ICMR-2021; the Publicity Chair of IEEE ICME-2020/ICIP-2019; and an Area Chair of CVPR-2021/ECCV-2020/ICCV-2019. She was an APSIPA Distinguished Lecturer (2016-2017).