Learning End-to-End Lossy Image Compression: A Benchmark
Yueyu Hu, Wenhan Yang, Zhan Ma, and Jiaying Liu
Abstract—Image compression is one of the most fundamental techniques and commonly used applications in the image and video
processing field. Earlier methods built a well-designed pipeline, and efforts were made to improve all modules of the pipeline by
handcrafted tuning. Later, tremendous contributions were made, especially when data-driven methods revitalized the domain with their
excellent modeling capacities and flexibility in incorporating newly designed modules and constraints. Despite great progress, a
systematic benchmark and comprehensive analysis of end-to-end learned image compression methods are lacking. In this paper, we
first conduct a comprehensive literature survey of learned image compression methods. The literature is organized based on several
aspects to jointly optimize the rate-distortion performance with a neural network, i.e., network architecture, entropy model and rate
control. We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and
provide insights into their historical development routes. With this survey, the main challenges of image compression methods are
revealed, along with opportunities to address the related issues with recent advanced learning methods. This analysis provides an
opportunity to take a further step towards higher-efficiency image compression. By introducing a coarse-to-fine hyperprior model for
entropy estimation and signal reconstruction, we achieve improved rate-distortion performance, especially on high-resolution images.
Extensive benchmark experiments demonstrate the superiority of our model in rate-distortion performance and time complexity on
multi-core CPUs and GPUs. Our project website is available at https://huzi96.github.io/compression-bench.html.
1 INTRODUCTION
With the rapid development of deep learning, there have been many works exploring the potential of artificial neural networks to form an end-to-end optimized image compression framework. The development of these learning-based methods differs significantly from that of traditional methods. For traditional methods, improved performance mainly comes from designing more complex tools for each component in the coding loop. Deeper analysis can be conducted on the input image, and more adaptive operations can be applied, resulting in more compact codes. However, in some cases, although the performance of a single module is improved, the final performance of the codec, i.e., the superimposed performance of the different modules, might not increase much, making further improvement difficult. For end-to-end learned methods, as the whole framework can be jointly optimized, a performance improvement in one module naturally boosts the final objective. Furthermore, joint optimization causes all modules to work more adaptively with each other.

In the design of an end-to-end learned image compression method, two aspects are considered. First, if the latent representation coefficients after the transform network are less correlated, more bit-rate can be saved in entropy coding. Second, if the probability distribution of the coefficients can be accurately estimated by an entropy model, the bit-stream can be utilized more efficiently and the bit-rate to encode the latent representations can be better controlled; thus, a better trade-off between the bit-rate and the distortion can be achieved. The pioneering work of Toderici et al. [10] presents an end-to-end learned image compression framework that reconstructs the image by applying a recurrent neural network (RNN). Meanwhile, generalized divisive normalization (GDN) [11] was proposed by Ballé et al. to model image content with a density model, which shows an impressive capacity for image compression. Since then, numerous end-to-end learned image compression methods have been inspired by these frameworks.

Although tremendous progress has been made in end-to-end learned image compression, there is a lack of a systematic survey and benchmark to summarize and compare different methods thoroughly. To this end, in this work, we conduct a comprehensive survey of recent progress in learning-based image compression as well as a thorough benchmarking analysis of different learning-based image compression methods. The contributions and novelties of existing works are summarized and highlighted, and future directions are illustrated. With the summarized guidance from the survey and benchmark, we propose a novel end-to-end learned image compression framework that offers state-of-the-art performance.

The contributions of this paper are as follows:
• We comprehensively summarize the existing end-to-end learned image compression methods. The contributions and novelties of these methods are discussed and highlighted. The technical improvements of these methods are commented on based on their categorizations, and we present a clear picture of the design methodologies as well as interesting future research directions.
• Inspired by the insights and challenges summarized for the existing approaches, we further explore the potential of end-to-end learned image compression and propose a coarse-to-fine hyperprior modeling framework for lossy image compression. The proposed method is shown to outperform existing methods in terms of coding performance while keeping the time complexity low on parallel computing hardware.
• We conduct a thorough benchmark analysis to compare the performance of existing end-to-end compression methods, the proposed method, and traditional codecs. The comparison is conducted fairly from different perspectives, i.e., the rate-distortion performance on different ranges of bit-rate or resolution and the complexity of the implementation.

Note that this paper is an extension of our earlier publication [12]. We summarize the changes here. First, this paper additionally focuses on a thorough survey and benchmark of end-to-end learned image compression methods. In addition to [12], we summarize the contributions of existing works on end-to-end learned image compression in Sec. 3 and present a more detailed comparative analysis of the merits towards high-efficiency end-to-end learned image compression in Sec. 4 and Sec. 5. Second, we conduct a benchmark evaluation of existing methods in Sec. 7.2, where we present comparative experimental results on two additional datasets, in both PSNR and MS-SSIM. Third, we raise the novel problem of cross-metric performance with respect to image compression methods in Sec. 7.4, where we present an empirical analysis of the phenomenon of cross-metric bias and briefly discuss future research directions to address the related issues.

The rest of the paper is organized as follows. In Sec. 2, we first formulate the image compression problem, especially focusing on end-to-end learned schemes. After that, in Sec. 3, we briefly summarize the main contributions of existing research. Then, in Sec. 4, we categorize existing learned image compression methods according to their backbone models. Special attention is paid to the rate control technique in Sec. 5, as it is the most specialized component of image compression compared with other deep-learning-based processing or understanding methods. Inspired by our survey and analysis, we introduce our newly proposed method in Sec. 6. Later, in Sec. 7, we introduce the benchmarking protocols and make benchmarking comparisons of existing methods. Finally, in Sec. 8, we draw conclusions and discuss potential future research directions.

2 PROBLEM FORMULATION
Natural image signals include many spatial redundancies and can therefore be compressed without much degradation in perceptual quality. Considering practical constraints on bandwidth and storage, lossy image compression is widely adopted to minimize the bit-rate of representing a given image while tolerating a certain level of distortion. The compression framework usually consists of an encoder-decoder pair. Given an input image x with distribution p_x, an encoder with an encoding transform E
and a quantization function Q generates a discrete code y as follows:

y = Q(E(x; θ_E)),    (1)

where θ_E denotes the encoder parameters to be tuned during the learning procedure. To obtain the pixel representation of the image, the corresponding decoder D reconstructs the image x̂ from the code y as follows:

x̂ = D(y; θ_D) = D(Q(E(x; θ_E)); θ_D),    (2)

where θ_D denotes the parameters in D.

Two kinds of metrics, i.e., the distortion D and the bit-rate R, give rise to the rate-distortion optimization R + λD, the core problem of lossy image compression. The distortion term D measures how different the reconstructed image is from the original image, and it is usually measured via fidelity-driven or perceptual metrics as follows:

D = E_{x∼p_x}[d(x, x̂)],    (3)

where d denotes the distortion function. The rate term R corresponds to the number of bits needed to encode y, which is bounded by the entropy constraints. However, the actual probability distribution of the latent code y, denoted as p_y, is unknown, making accurate entropy calculation intractable. Thus, we usually utilize an entropy model q_y as an estimate of p_y for entropy coding. Hence, the rate term can be formulated as the cross entropy of p_y and q_y as follows:

R = H(p_y, q_y) = E_{y∼p_y}[−log q_y(y)],    (4)

where p_y stands for the real probability distribution and q_y refers to the distribution estimated by the entropy model. The overall compression model can be viewed as an optimization of the weighted sum of R and D. Formally, the problem can be solved by minimizing the following objective with a trade-off coefficient λ:

θ̂_E, θ̂_D, θ̂_p = arg min_{θ_E, θ_D, θ_p} R + λD,    (5)

where θ_p denotes the parameters of the entropy model. The optimal parameters θ̂_E, θ̂_D, θ̂_p cause the model to achieve overall good rate-distortion performance on images x that follow x ∼ p_x. Different λ values indicate different rate-distortion trade-offs, depending on the requirements of different applications.

Though the idea of rate-distortion optimization is also applied in traditional compression schemes, learning-based methods finally make the joint optimization of all the components feasible. The opportunities and challenges are listed below:
• Global Optimization. The major difference between learned image compression and the traditional hybrid codec lies in their optimization. Instead of hand-crafted tuning, learned image compression models can be automatically tuned with respect to any differentiable metric, e.g., SSIM [13], MS-SSIM [14], or a perceptual difference [15] calculated by neural networks. In addition, while the traditional hybrid coding framework is usually improved at the scale of individual components, in learning-based methods all modules are trainable, and it is possible to optimize all parameters and components jointly. However, it is nontrivial to achieve good performance in end-to-end learned compression because of the difficulties in optimization.
• Full-Resolution Processing. Convolutional neural networks support the full-resolution processing of images, while hybrid frameworks usually process partitioned blocks. Full-resolution processing brings more benefits to entropy modeling with more context and avoids the blocking artifacts caused by partitioning. However, it also comes with an increase in complexity: because the receptive field of a convolutional kernel is limited, the network needs to be deepened to perceive larger regions and improve its modeling capacity.
• Rate Control. With joint optimization, the whole model can directly target the rate-distortion constraint, while in hybrid schemes an additional rate-control component is employed and may not produce an optimal approximation. However, for a large portion of learning-based methods, multiple models need to be trained for different rate-distortion trade-offs, and the single-model variable-bit-rate architectures are usually much more time-consuming. Therefore, practical applications of these methods are sometimes limited.

3 OVERVIEW OF PROGRESS IN RECENT YEARS
Since the pioneering work of Toderici et al. [10] in 2015 exploited recurrent neural networks for learned image compression, much progress has been made. Benefiting from the strong modeling capacity of deep networks, the performance of learned image compression has exceeded that of traditional codecs from JPEG up to BPG (HEVC Intra), and the performance gap is widening further. The milestones of learned image compression are summarized in Table 1. Early works aim to search for possible architectures to apply transform coding with neural networks and propose end-to-end trainable solutions. Ballé et al. [11], [16], [20] propose a learning-based framework with GDN-nonlinearity-embedded analysis and synthesis transforms for learned image compression, while Toderici et al. utilize recurrent models for variable-rate learned compression [10], [17].

To make the network end-to-end trainable, the quantization component, which is not differentiable by definition, should be designed carefully and approximated by a differentiable process. Some works replace true quantization with additive uniform noise [20], [24], while others use direct rounding in the forward pass and back-propagate the gradient of the identity mapping y = x. In addition, Agustsson et al. [18] propose replacing direct scalar quantization with soft-to-hard vector quantization to make the quantization smoother. Dumas et al. [39] design a model that additionally learns the quantization parameters. As it is nontrivial to train a variational autoencoder (VAE) [40] based model that incorporates quantization, advanced optimization techniques for image compression are still being extensively studied [41].
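To make the noise-based quantization relaxation and the objective of Eq. (5) concrete, below is a minimal PyTorch-style sketch of the two pieces. It illustrates the general recipe described above rather than the implementation of any particular paper; `likelihoods` stands for whatever per-element probabilities the entropy model assigns.

```python
import torch

def quantize(latent, training=True):
    """Differentiable stand-in for rounding during training [20], [24]."""
    if training:
        # Additive uniform noise in [-0.5, 0.5) mimics the rounding error.
        return latent + torch.empty_like(latent).uniform_(-0.5, 0.5)
    return torch.round(latent)  # true quantization at test time

def rate_distortion_loss(x, x_hat, likelihoods, lam):
    """R + lambda * D objective of Eq. (5); the rate term follows Eq. (4)."""
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate_bpp = -torch.log2(likelihoods).sum() / num_pixels  # bits per pixel
    distortion = torch.mean((x - x_hat) ** 2)  # MSE; any differentiable metric works
    return rate_bpp + lam * distortion
```

During training, the loss would be `rate_distortion_loss(x, decoder(quantize(encoder(x))), likelihoods, lam)` for a hypothetical encoder/decoder pair and a chosen trade-off λ.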
When the compression network is trainable, the next issue is to efficiently reduce the spatial redundancy in the image signal, where the transform is usually the critical part. Some transforms take the form of a convolutional neural network (CNN), e.g., with GDN [11], [16], [20] or residual blocks with enhanced nonlinearity [19]. Advanced convolutional architectures such as attention modules [36], non-local networks [38], and invertible structures [37] have also been employed to improve the modeling capacity of the transforms. Others resort to a recurrent neural network (RNN) to infer latent representations progressively, which forms a scalable coding framework [10]. In each iteration, the network largely squeezes out the unnecessary bits in the latent representations. Therefore, the final representations are compact.

After the transform, the compact latent representations are further compressed via entropy coding, where frequently occurring patterns are represented with few bits and rarely occurring patterns with many bits. Earlier works incorporate elementwise independent entropy models to estimate the probability distribution of the latent representations [19], [20] and independently encode each element with an arithmetic coder. After these initial trials, later advanced methods explicitly estimate the entropy with hyperpriors [24], [26], predictive models [25], [32], [34], [36], [37], [38], or other learned parametric models [17], [27], [28].

In addition to the abovementioned methods that target signal fidelity with learned transform coding frameworks, there are emerging works targeting novel application conditions, notably compression for machine vision [42] or for human perception at low bit-rates. According to research on the human visual system, human eyes are less sensitive to pixelwise distortion in areas with complex texture. Therefore, generative models such as conditional generative adversarial networks (GANs) can be employed to synthesize such areas, with low-bit-rate representations serving as the guidance. This can be utilized to design high-efficiency image codecs. Rippel et al. [22] first proposed utilizing an adversarial loss function in an end-to-end framework to improve visual quality. In later literature, Agustsson et al. [43], Tschannen et al. [30], and Santurkar et al. [44] improve the capacity of adversarial learning by introducing advanced generative networks to provide superior reconstruction quality at extremely low bit-rates. Mentzer et al. [45] demonstrated that, with a hyperprior-based compression model and a generative convolutional decoder with ChannelNorm, it is possible to achieve similar visual quality on high-resolution natural images with only half the bit-rate.

In summary, the tremendous progress in learned image compression unveils the power of machine learning techniques. Nevertheless, there is still a large number of problems to investigate, which requires a systematic benchmark to illustrate the critical areas where end-to-end learned frameworks for image compression can be further improved. In the following, we first analyze the important components (i.e., the backbone architecture and the entropy model) in detail and then conduct the benchmark analysis on the methods according to various aspects.

4 BACKBONES FOR IMAGE COMPRESSION
A typical neural network backbone for image compression is built upon the VAE architecture. The architecture encodes images into vectors in a latent space, forming a compact representation. With dimensionality reduction and entropy constraints, the redundancy in the image is squeezed out by the compressive transform. There have been a variety of architectures for the backbone of the framework, which can be coarsely divided into two categories, namely, one-time feed-forward frameworks and multistage recurrent frameworks. Each component in a one-time feed-forward framework conducts the feed-forward operation only once in the encoding and decoding procedure. Usually, multiple models need to be trained to cover different ranges of bit-rates, as the encoder and decoder networks determine the rate-distortion trade-off. In contrast, in multistage recurrent frameworks, an encoding component of the network iteratively conducts compression on the original and residual signals, and the number of iterations controls the rate-distortion trade-off. Each iteration encodes a portion of the residual signal with a certain amount of bits. Such a model can conduct variable-bit-rate compression on its own. In the following, we introduce both types of architectures and conduct a comparative analysis of them.

4.1 One-Time Feed-Forward Frameworks
One-time feed-forward frameworks have been most widely adopted for end-to-end learned image compression. Basic variations of the architectures in the literature are illustrated in Fig. 1.

The first end-to-end learned image compression with a one-time feed-forward structure was proposed by Ballé et al. [16], where the analysis and synthesis transforms for encoding and decoding are made up of single-layer GDN and inverse GDN (iGDN) operations. This structure is then improved to support full-resolution processing, with strided convolutions and the corresponding transposed convolutions [20]. In later works, the hyperprior network [24] is introduced to extract side information from the latent representation produced by the analysis transform, and the side information can improve the entropy estimation of the latent code.

In addition to the frameworks equipped with GDN, another kind of feed-forward network utilizing residual blocks is proposed by Theis et al. [19] and Mentzer et al. [27]. These networks stack multiple residual blocks in both the encoder and decoder, greatly expanding the depth of the model. With deeper networks, the encoder and decoder can embed more complex image priors, and they have more flexibility in modeling nonlinear transforms. In addition, some works adopt a multiscale structure [22], [31], which also extends the capacity of the network.

It is reported that a more complex design of an architecture with GDN may bring further improvements in compression performance [46], [47], but these are not as significant as those of other contributions, such as a hyperprior. Unlike other computer vision tasks, e.g., image recognition, where a deeper network can usually bring extra gain in performance [48], [49], simply extending the architecture complexity does not result in significant improvements in compression performance.
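As a concrete reference for the GDN nonlinearity [11] used throughout these transforms, a compact sketch is given below. Real implementations additionally constrain β and γ to stay positive (e.g., via reparameterization), which is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDN(nn.Module):
    """Generalized divisive normalization [11]:
    y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2),
    applied at every spatial position across channels."""
    def __init__(self, channels, inverse=False):
        super().__init__()
        self.inverse = inverse  # iGDN multiplies instead of dividing
        self.beta = nn.Parameter(torch.ones(channels))
        self.gamma = nn.Parameter(0.1 * torch.eye(channels))

    def forward(self, x):
        # A 1x1 convolution computes sum_j gamma_ij * x_j^2 per position.
        gamma = self.gamma.view(*self.gamma.shape, 1, 1)
        norm = torch.sqrt(F.conv2d(x * x, gamma, self.beta))
        return x * norm if self.inverse else x / norm
```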
[Fig. 1 graphic: one-time feed-forward backbone variations, with panels including (a) the GDN transform [16] built from Conv/GDN layers in the analysis transform and Deconv/iGDN layers in the synthesis transform, (c) the hyperprior model [24], [25] with arithmetic encoding/decoding (AE/AD) of the latents conditioned on predicted (μ, σ), and (d) the residual auto-encoder [19] built from residual blocks.]

Fig. 2: Illustration of the backbones of the multistage recurrent framework and its variations. The main feature of these designs is that the residue of one stage is taken as the input of the next stage. (a) and (b) show the vanilla structure and its improved stateful form [10]. (c)-(e) show different cross-stage connections [21], i.e., Incremental, Skip-Connection, and Stateful Propagation.
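As a rough illustration of the Incremental variant named in the caption of Fig. 2(c), the following sketch shows the multistage coding loop; `enc_dec` is a placeholder for one stage of the recurrent network, and the binarization and entropy coding of each stage's bits are abstracted away.

```python
import torch

def incremental_compress(x, enc_dec, num_stages):
    """Incremental multistage coding (Fig. 2(c)): each stage codes the
    residue left by the previous stages, and the decoded stage outputs
    are summed to form the final reconstruction."""
    reconstruction = torch.zeros_like(x)
    residue = x
    for _ in range(num_stages):        # more stages -> more bits, less distortion
        stage_out = enc_dec(residue)   # encode and decode the current residue
        reconstruction = reconstruction + stage_out
        residue = x - reconstruction   # what remains for the next stage
    return reconstruction
```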
4.2 Multistage Recurrent Frameworks
In the stateful form of the multistage recurrent framework [10], the network keeps a state during sequential processing that propagates the features of the images to the following steps to facilitate the modeling of the multilevel residue. Fig. 2 shows the unrolled stateful structure. In each stage, the modules in the pipeline take the currently processed residue and the state from the previous stage as the input. The states are updated and propagated for processing in the next step.

There have been studies on the aggregation of the outputs of each stage. Baig et al. [21] present and analyze different kinds of aggregation schemes. The basic Incremental scheme adds the outputs of all stages together to form the final decoded image. The loss function of the Incremental scheme usually includes a term to encourage the output of each stage to approximate the residue of the previous stage. A different way to combine all the stages is to treat the multistage structure as a residual network, forming the Skip-Connection scheme. There is only one term in the loss function for such a scheme, requiring that the sum of all the stages reconstructs the original image. Unlike the Incremental structure, there is no explicit arrangement of the residue in the Skip-Connection structure. The outputs of all stages contribute to the final reconstruction, each as a supplement to the reconstruction quality with respect to the others. In addition to these two kinds of schemes, Baig et al. report that with the Stateful-Propagation structure and the corresponding residual-to-image prediction, where each step produces a prediction of the original image rather than of the residual signal, the network achieves the best performance. In such a stateful propagation scheme, it is important to propagate the states of the layers to the next step to construct a refined decoded image.

4.3 Comparative Analysis
Each of the two categories of backbone architectures has its own properties and corresponding pros and cons. The differences are mainly due to the choice between the one-time structure and the progressive structure. Here are the main differences:
• Recurrent models can naturally handle variable-rate compression, while for feed-forward networks, multiple instances of the network need to be trained to support a variable range of bit-rates.
• Feed-forward networks are comparatively shallower, and the path of back-propagation is much shorter. Training such a network can be easier. In contrast, training recurrent models requires the back-propagation through time (BPTT) technique, which is more complicated.
• Weights are shared across different stages in the recurrent model; thus, a practical image codec may require less storage for the parameters compared with one-time feed-forward models. However, residual signals and image signals are different in nature, making the training of a recurrent model more challenging.
• It usually takes more time for recurrent models to encode and decode an image because the network is executed multiple times.

Despite these pros and cons, existing works report higher rate-distortion performance with one-time feed-forward architectures [25], [28]. However, variable-bit-rate compression is commonly required by applications, which becomes the major barrier for end-to-end learned image compression methods to be adopted by existing systems. More efforts are needed to investigate an efficient way to achieve variable-rate compression for learning-based approaches.

5 ENTROPY MODELS
Entropy coding is an important component in an image compression framework. According to information theory [51], the bit-rate needed to encode a signal is bounded by the information entropy, which corresponds to the probability distribution of the symbols representing the signal. Thus, the entropy coding component is embedded in the end-to-end learned image compression framework to estimate the probability distribution of the latent representations and apply constraints on the entropy to reduce the bit-rate.

There is a large amount of research on entropy models for learned image compression. A summary of solutions to the problem of entropy modeling is presented in Table 2, and we illustrate the typical structure of different variations in Fig. 3.

Ideal entropy coding requires a precise estimation of the joint distribution of the elements in the latent representations for each instance of the image. In earlier works, those elements are assumed to be independently distributed [19], [20] to simplify the design. However, even with optimized transforms, it is still difficult to eliminate the spatial redundancy in the latent maps of the images. Thus, a variety of entropy models are proposed to further reduce the redundancy in the latent code. These methods include statistical analysis over a given dataset [18], [19], [20], [26], [35], contextual prediction or analysis [17], [25], [28], [29], [31], [33], [34], [38], [52], and utilizing a learned hyperprior [24], [25], [34] for entropy modeling. The entropy model provides the estimation of the likelihood of all the elements, and the expectation of the negative log-likelihoods bounds the bit-rate needed to encode these elements. With the entropy model, in most works, arithmetic coding [5] is utilized to practically losslessly encode the symbols of the latent representations.

It is worth noting that in traditional hybrid frameworks, improvements to the entropy model only affect the entropy coding performance. For learned methods, as all the components are jointly optimized, a better-designed entropy model not only produces a more precise estimate of the entropy but also changes the patterns produced by the analysis transform. As a consequence, the design of the entropy model should also take the structure of the other components in the pipeline into consideration.

In summary, existing methods aim to provide a flexible transform and an accurate entropy model, all of which are neural network-based and end-to-end trainable. In addition to the main goal of rate-distortion performance, several issues need to be addressed in the exploration. The model should be adaptive to different ranges of resolutions, bit-rates, and distortions. Currently, as high-resolution capturing and display devices emerge, high-efficiency compression of high-resolution images is a constantly growing need.
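As a small numeric illustration of the log-likelihood bound discussed in this section, the expected bit count of a latent under a factorized entropy model can be computed directly from the per-element likelihoods; the probabilities below are made up for the example.

```python
import numpy as np

# Hypothetical per-element likelihoods q(y_i) assigned by an entropy model.
likelihoods = np.array([0.5, 0.25, 0.125, 0.125])

# Information content of each symbol and the total coding cost bound.
bits_per_symbol = -np.log2(likelihoods)  # [1, 2, 3, 3] bits
total_bits = bits_per_symbol.sum()       # 9 bits for this toy latent
print(total_bits)  # an arithmetic coder [5] approaches this bound in practice
```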
TABLE 2: Summary of solutions for entropy modeling in learned image compression.

Binary / Direct: The network directly produces binary codes, which are transmitted as the bit-stream without entropy modeling [10]. Optional external entropy codecs, such as adaptive arithmetic coding [22], can be applied to the bit-stream to improve coding efficiency.

Binary / Masked: In addition to the binary code, the network also constructs a mask from the feature to indicate the length of the binary code [23], [29], [52]. The mask is usually transmitted together with the bit-stream. With rate control, the overall performance can be further improved compared to the direct scheme.

Binary / Context-Model: The probability distribution of all the symbols to be encoded is estimated by the network from previously coded symbols [17] and spatially adjacent symbols [28]. The context model can estimate the probability more accurately so that entropy coding can be conducted more efficiently.

Statistical / Histogram: The probability distribution is estimated by a histogram [18]. A variation of this scheme is to use a Laplace-smoothed histogram for better generalization [19].

Statistical / Piecewise Linear: The probability density function is approximated by a parametric piecewise linear function during training [20]. Context-Adaptive Binary Arithmetic Coding (CABAC) [6] is used to practically compress the latent codes.

Statistical / Parametric Factorized: A function p(x_i) = f(x_i, θ) with trainable parameters θ is modeled to estimate the probability of a symbol x_i. These parameters reflect the distribution of the latent code over the training set and can be generalized to all images [35], [53].

Statistical / Gaussian (Mixture): Networks based on the VAE assume that the latent code follows an elementwise Gaussian distribution. The loss function includes a cross-entropy term between the actual distribution and the estimated Gaussian distribution to control the bit-rate [24], [25], [34]. A Gaussian Mixture distribution is shown to better estimate the likelihoods [36].

Context-Model / PixelRNN, PixelCNN: Multistage recurrent models [17] employ PixelRNN [54], while one-time feed-forward models [27], [31] utilize PixelCNN [55] for spatial-context-conditioned probability modeling.

Context-Model / Masked Convolution: Masked 2D [25], [34] or 3D [38] convolutions can be seen as a simplified version of PixelCNN for conditional probability modeling. They estimate the likelihood of a to-be-encoded element based on already decoded elements.

Context-Model / Offline: The latent code produced by a given encoder is analyzed offline in tiles by learning a dictionary, and the indices are transmitted with lossless compression [26].

Side-Information Guided / Hyperprior: The hyperprior, transmitted in the bit-stream, encodes the parameters of a Gaussian entropy model [24] to estimate the likelihoods of the elements to be encoded. It greatly improves the accuracy of the entropy model, and it can be combined with the context model for enhanced modeling.
On the other hand, with the rapid development of large-scale parallel computing devices, e.g., GPUs, models should also be designed to take advantage of parallel computing for higher efficiency. According to the above analysis, the one-time feed-forward frameworks with convolutional neural network-powered hyperprior structures have more potential to scale to high resolutions and to support large-scale parallelism. With this idea in mind, we adopt a one-time feed-forward framework and take one step towards superior performance with a newly proposed coarse-to-fine hyperprior compression model.

6 PROPOSED COARSE-TO-FINE MODEL
6.1 Coarse-to-Fine Hyperprior Modeling
As analyzed, we follow the basic design of a one-time feed-forward framework, which consists of an analysis transform G_a and a synthesis transform G_s. G_a transforms the image to latent representations, and G_s reconstructs the image from those representations. To perform entropy coding, the latent representations are first quantized to a vector of discrete symbols X = {X_1, X_2, ..., X_n}. In addition, a parametric entropy model Q_X(X; θ) w.r.t. the random vector X is built to provide the estimation of the likelihoods. The aim of entropy coding is now to jointly optimize the parameters in the networks to 1) accurately model the distribution P_X(X) of the random vector X with Q_X(X; θ) and 2) minimize the overall rate-distortion function with the estimated entropy. State-of-the-art methods combine context models and hyperpriors. In such approaches, it is first assumed that the joint probability distribution of X can be factorized into a product of sequential conditional probabilities as follows:

Q_{X|Y}(X|Y) = ∏_i Q_i(X_i | X_{i−1}, X_{i−2}, ..., X_{i−m}, Y),    (6)

where Y denotes the hyperprior, which is generated from X and encoded into the bit-stream. When we need to decode X, Y has already been decoded. These kinds of models need to address two issues. First, the dimensionality and the corresponding bit-rate of Y should be kept low; otherwise, Y itself may contain too much redundancy and is not efficiently compressed.
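A sketch of the masked convolution commonly used to realize the causal conditioning in Eq. (6) (a simplified PixelCNN-style layer, cf. Table 2) is shown below; the channel sizes in the usage example are illustrative only.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose kernel is zeroed at and after the center position,
    so each output only sees already-decoded (causal) context in raster order."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        mask = torch.zeros_like(self.weight)
        _, _, kh, kw = self.weight.shape
        mask[:, :, :kh // 2, :] = 1        # rows strictly above the center
        mask[:, :, kh // 2, :kw // 2] = 1  # left of the center in the same row
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask      # enforce causality before each call
        return super().forward(x)

# Example: predict entropy-model parameters from the causal context of a latent map.
context = MaskedConv2d(192, 384, kernel_size=5, padding=2)
params = context(torch.randn(1, 192, 16, 16))  # e.g., means and scales
```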
Fig. 3: Illustration of entropy modeling methods. (a)-(c) Binary methods and variations with the masking and context model, including (a) direct modeling [10], (b) masked modeling [23], [29], [52], and (c) the binary context model [17]. (d) Spatial context model for latent code maps [25], [27], [34]. (e) Hyperprior entropy model [24].

In such a circumstance, the hyperprior may not provide enough information to accurately model the conditional probability, especially for higher ranges of bit-rates and large resolutions. Second, although contextual conditioning can help with accuracy, it is performed in a sequential way and is hard to accelerate with large-scale parallel computing devices. Thus, the framework is less scalable for input images of different sizes.

To address the issues of the sequential context models, in the proposed method we adopt a multilayer conditioning framework, which improves scalability for images of different sizes. The formulation is modified as follows:

Q_X(X) = Q_{X,Y}(X, Y) = Q_Y(Y) Q_{X|Y}(X|Y).    (7)

The first equality in Eq. (7) holds because the hyperprior Y is generated from X in a deterministic manner. When X becomes complex and is controlled by expanding the dimension, Y may need to embed more information to support accurate conditional modeling. Therefore, an additional layer of the hyperprior is introduced as follows:

Q_Y(Y) = Q_{Y,Z}(Y, Z) = Q_Z(Z) Q_{Y|Z}(Y|Z),    (8)

which in fact forms a coarse-to-fine hyperprior model. The dimension of Z is reduced, and the redundancy is squeezed out by the hyper transforms. Thus, the joint distribution P_Z(Z) of the latent representation Z = {Z_1, Z_2, ..., Z_n} at the innermost layer can be approximately factorized as follows:

Q_Z(Z) = Q_Z(Z_1, Z_2, ..., Z_n) ≈ ∏_i Q_{Z_i}(Z_i).    (9)

With Eq. (7) and Eq. (8), the probability distributions of Y and X can now be modeled in a conditional way, and existing works [54], [56] show that neural networks are capable of modeling conditional probability distributions. The hyper representation Y is also designed to embed the main information of the images to be compressed. Therefore, the joint distributions can also be approximately factorized as follows:

Q(X|Y) = Q(X_1, ..., X_n|Y) ≈ ∏_i Q_{X_i|Y}(X_i|Y),
Q(Y|Z) = Q(Y_1, ..., Y_n|Z) ≈ ∏_i Q_{Y_i|Z}(Y_i|Z),    (10)

where all elements in the previous layer can be utilized as conditions to estimate the distribution of the latent representation at the upper layer. Although no contextual conditioning is conducted here, contextual conditioning can be implicitly modeled in the information flow from X to Y and then used to predict X from Y. Unlike existing block-conditioning context models, in the proposed framework the estimation of the probability for each element utilizes information from a larger area due to the coarse-to-fine structure. This helps to exploit long-term correlations in images and improves the compression performance, especially for high-resolution images.

6.2 Network Architecture
The overall structure of the end-to-end learned coarse-to-fine framework is shown in Fig. 4, jointly with the encoder and decoder. The analysis transform network encodes the input image as the latent representation X, which is then quantized with a rounding operation. It aims to squeeze out pixelwise redundancy as much as possible. We exploit GDN as the activation in the analysis transform and inverse GDN in the synthesis transform. We conduct coarse-to-fine modeling with a multilayer hyper analysis transform and a symmetric hyper synthesis transform. According to Eq. (7) and Eq. (8), to estimate the distribution of X, a probability estimation network is employed to process Y and predict the likelihood P_{X_i}(X_i = x_i) with the estimated Q_{X_i}(X_i = x_i) for each element X_i in X. As stated in [24], the conditional distribution of each element in X can be assumed to be Gaussian, and the probability estimation network predicts the mean and scale of the Gaussian distribution. As the latent code has been rounded to be discrete, the likelihood of the latent code can be calculated as follows:

Q_{X_i|Y}(X_i = x_i | Y) = φ((x_i + 1/2 − µ_{x_i}) / σ_{x_i}) − φ((x_i − 1/2 − µ_{x_i}) / σ_{x_i}),    (11)

where φ denotes the cumulative distribution function of the standard normal distribution, while the mean µ_{x_i} and scale σ_{x_i} are predicted from Y. The same process is conducted w.r.t. Y and Z to estimate the probability distribution of Y. As illustrated in Eq. (9), the probability distribution of Z can be approximately factorized. Thus, we employ a zero-mean Gaussian model. The likelihood of each element in Z can be calculated as follows:

Q_{Z_i}(Z_i = z_i) = φ((z_i + 1/2) / σ_{z_i}) − φ((z_i − 1/2) / σ_{z_i}).    (12)
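Eq. (11) and Eq. (12) translate directly into code via the standard normal CDF; a sketch follows, with a small floor on σ added for numerical stability (an implementation detail assumed here, not specified in the text).

```python
import torch

def gaussian_likelihood(x, mu, sigma):
    """Probability mass of the rounded value x under N(mu, sigma^2), i.e.,
    Phi((x + 0.5 - mu)/sigma) - Phi((x - 0.5 - mu)/sigma) as in Eq. (11).
    For the innermost layer Z, set mu = 0 to recover Eq. (12)."""
    sigma = sigma.clamp(min=1e-6)            # stability floor (assumption)
    gaussian = torch.distributions.Normal(mu, sigma)
    upper = gaussian.cdf(x + 0.5)
    lower = gaussian.cdf(x - 0.5)
    return (upper - lower).clamp(min=1e-9)   # avoid log(0) in the rate term
```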
[Fig. 4 graphic: the input image passes through the encoder network (analysis transform) to produce the latent representation; multiple layers of hyper transforms feed arithmetic encoder/decoder (AE/AD) pairs; probability estimation networks predict (μ, σ) for each layer, with the innermost layer using (μ = 0, σ); an information aggregation network and the decoder network (synthesis transform) produce the decoded image.]
Fig. 4: Overall architecture of the multilayer image compression framework. The probability distribution of the innermost
layer of the hyperprior is approximated with a zero-mean Gaussian distribution, where the scale values σ are channelwise
independent and spatially shared.
Note that σ_{z_i} is a trainable parameter in the network. All elements in a channel of the latent representation share the same σ, while each channel has an independent one.

According to information theory, the minimum bit-rate required to encode X (or Y and Z) with the estimated distribution equals the cross entropy of the real distribution P_{X|Y}(X|Y) and the estimated distribution Q_{X|Y}(X|Y) ∼ N(µ_x, σ_x), which is denoted as follows:

R = H(Q) + D_KL(P‖Q) = E_{X|Y}[−log(Q)].    (13)

We minimize the rate-distortion function L_RD = R + λD with the network. To accelerate convergence during the training of the multilayer network, an additional information-fidelity loss is introduced. This loss term encourages the hyper representation Y to maintain the critical information in X during training and is formulated as follows:

min_{Y,θ} L_if = ‖F(Y; θ) − X‖².    (14)

[Figure graphic: the L1 representation passes through Deconv #1-#3 and is concatenated with the main representation, followed by a residue block and Conv #1-#3 with a skip connection.]

After the split, one half of the tensor is used as the mean of the Gaussian distribution. We calculate the absolute value of the other half as the scale.

TABLE 3: Structure of the signal-preserving hyper transform.

(a) Hyper analysis transform.
Name | Operation | Output Shape | Activation
Input | / | (b, h, w, c) | /
#1 E | Conv. (3 × 3) | (b, h, w, 2c) | Linear
Down | Space-to-Depth | (b, h/2, w/2, 8c) | /
#2 E | Conv. (1 × 1) | (b, h/2, w/2, 4c) | ReLU
#3 E | Conv. (1 × 1) | (b, h/2, w/2, 4c) | ReLU
#4 E | Conv. (1 × 1) | (b, h/2, w/2, c) | Linear

(b) Hyper synthesis transform.
Name | Operation | Output Shape | Activation
Input | / | (b, h/2, w/2, c) | /
#1 D | Deconv. (1 × 1) | (b, h/2, w/2, 4c) | Linear
#2 D | Deconv. (1 × 1) | (b, h/2, w/2, 4c) | ReLU
#3 D | Deconv. (1 × 1) | (b, h/2, w/2, 4c) | ReLU
Up | Depth-to-Space | (b, h, w, c) | /
#4 D | Deconv. (3 × 3) | (b, h, w, c) | Linear
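Read as a network definition, Table 3(a) corresponds to a short stack of layers; a possible PyTorch rendering is sketched below, using PixelUnshuffle for the Space-to-Depth step and treating the channel width c as a free parameter.

```python
import torch.nn as nn

def hyper_analysis(c):
    """Sketch of the signal-preserving hyper analysis transform of Table 3(a)."""
    return nn.Sequential(
        nn.Conv2d(c, 2 * c, kernel_size=3, padding=1),  # #1 E, linear
        nn.PixelUnshuffle(2),                           # Space-to-Depth: 2c -> 8c
        nn.Conv2d(8 * c, 4 * c, kernel_size=1),         # #2 E
        nn.ReLU(inplace=True),
        nn.Conv2d(4 * c, 4 * c, kernel_size=1),         # #3 E
        nn.ReLU(inplace=True),
        nn.Conv2d(4 * c, c, kernel_size=1),             # #4 E, linear
    )
```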
During training, we first train the main transforms and then train the fine-grained hyper transforms progressively. Each group of hyper transforms (i.e., the fine-grained groups and the coarse-grained groups of hyper transforms) is trained for 20,000 iterations. Next, we train the Information-Aggregation Reconstruction subnetwork for another 20,000 iterations. Finally, we tune the whole network end-to-end for 400,000 iterations to complete the training.

7 EVALUATION
7.1 Datasets
End-to-end learned image compression is a self-supervised problem where distortion metrics measure the difference between the original image and the reconstructed image, and the bit-rate corresponds to the entropy of the latent code. Thus, no extra labeling labor is needed, and many existing large-scale image sets, e.g., ImageNet [58] and DIV2K [57], can be used to train networks for image compression. To reduce possible compression artifacts in the images, lossy-compressed images are usually downsampled before they are used for network training.

Commonly used testing image sets include Kodak [59] and Tecnick [60], which contain high-quality natural images that have not been lossy-compressed. The Kodak dataset consists of 24 images with resolution 512 × 768, with a wide variety of content and textures that are sensitive to artifacts. Thus, it has been widely used to evaluate image compression methods. For the Tecnick dataset, the SAMPLING testset is used for evaluation in some works. In contrast to Kodak, this dataset contains images with a higher resolution (1200 × 1200), which can serve as a supplemental benchmark for image compression methods that may perform differently on images with different resolutions. In addition, in recent years, the CVPR Workshop and Challenge on Learned Image Compression (CLIC), with the goal of encouraging research in learning-based image compression, has attracted much attention in the community. A testing dataset consisting of images captured by both mobile phones and professional cameras is provided and updated year by year. The images have higher resolutions, on average 1913 × 1361 for mobile photos and 1803 × 1175 for professional photos. Evaluation results on this dataset indicate compression performance on images with relatively high resolutions.

7.2 Rate-Distortion Performance
Although the overall history of the development of end-to-end learned image-compression methods is not as long as that of hybrid coding standards, there have been a significant number of works on this topic, and tremendous progress has been made. However, few studies have thoroughly evaluated rate-distortion performance on various images and compared baselines (i.e., anchors). It is nevertheless valuable to compare performance on technical merits to investigate which directions truly affect performance. In the following, we summarize the performance of selected works. The contributions of these works include different methods for entropy modeling, novel architecture designs, and normalization.

7.2.1 Evaluation Protocol
Three datasets, i.e., Kodak, Tecnick, and the CLIC 2019 validation set, are used in the evaluation, corresponding to three different levels of resolution and different content. For the evaluated learning-based methods, we average the metrics of the bit-rate (bpp) and the distortion (PSNR and MS-SSIM) across each dataset for the different models, which are usually trained with different trade-off coefficients λ. We compare the learning-based methods with JPEG, BPG, and VVC. For these hybrid codecs, the metrics are averaged at different quality factors (QFs) or quantization parameters (QPs). To illustrate the comparison, we show the rate-distortion curves in Fig. 6.² We also calculate the BD-rate [61] with respect to the bit-rate and PSNR over the three datasets. As not all methods cover the whole range of bit-rate and PSNR, we separate different bit-rate ranges for evaluation and comparison, marked as low, median, high, and full. The bit-rate ranges differ among the datasets due to variations in content, but the full ranges are selected to cover the variation of image quality from poor to transparent, as shown in Table 6. The BD-Rate results are shown in Table 7. We analyze the results and summarize the important properties in the following.

2. The results of NeurIPS18 correspond to the publicly released code, which does not include the auto-regressive context model.

TABLE 6: Specifications of the BD-Rate ranges on the different testing datasets. The full ranges are illustrated, and the low, median, and high ranges are chosen within the full ranges.

Dataset | Bit-Rate Range | PSNR Range
Kodak | 0.25 bpp - 1.40 bpp | 26 dB - 40 dB
Tecnick | 0.12 bpp - 0.70 bpp | 26 dB - 43 dB
CLIC | 0.20 bpp - 1.05 bpp | 28 dB - 40 dB

7.2.2 Entropy Model
The design of the entropy model is the main driving force of improvements in rate-distortion performance. The design of entropy models in end-to-end learned image compression has developed from contextual binary entropy models [17] to hyperprior models and spatial / cross-channel entropy estimation [24], [25]. Specifically, as shown in Fig. 6, a leap in gain occurred with the emergence of hyperpriors, which have been adopted by many other frameworks. Despite their great success, modeling contextual probability is still a challenging topic in image modeling due to variation in resolution. As shown in Table 7, the context model-based method [34] may have unstable gain over BPG at different levels of resolution, while the proposed methods achieve consistent superiority over the anchor.

7.2.3 Depth of the Network
The depth of the network is a comparatively less important factor in performance, while in other computer vision tasks, networks with deeper architectures usually perform better than those with fewer layers. Some works [46], [62] also confirm this observation.
Fig. 6: Rate-distortion curves. The methods include PCS18-ReLU and PCS18-GDN [53], ICLR18-Factorized and ICLR18-HyperPrior [24], NeurIPS18 [25], ICLR19 [34], CVPR17-RNN [17], CVPR18-Condition [27], BPG-4:4:4 [9], VTM8-4:4:4 [8], and JPEG [1]. We conduct the evaluation on the three datasets Kodak, Tecnick, and CLIC 2019 (validation set). PSNR and MS-SSIM are used as the distortion metrics. We convert the MS-SSIM values to decibels (−10 log₁₀(1 − d), where d refers to the MS-SSIM value) for a clear illustration, following [24].
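For reference, the decibel conversion used in these plots is a one-line helper:

```python
import math

def msssim_to_db(d):
    """Convert an MS-SSIM value d in [0, 1) to decibels, following [24]."""
    return -10.0 * math.log10(1.0 - d)

print(msssim_to_db(0.99))  # an MS-SSIM of 0.99 maps to 20.0 dB
```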
Instead of building complicated network architectures, work may focus more on the specific design of the networks to better model the image prior. However, it has been reported that the network should consist of a sufficient quantity of parameters, and the width of the network should be large enough for effective modeling of images, especially for higher ranges of bit-rates and higher quality [34].

7.2.4 Normalization
It is reported in [24] that batch normalization [63], commonly used to improve the performance of neural networks, does not bring significant improvement. However, Ballé et al. proposed generalized divisive normalization [11], [53], which is proven to be able to decorrelate the elements in images and improve the overall performance. Most state-of-the-art solutions adopt normalization and its inverse in the main encoding and decoding transforms. However, it remains a topic for future research to reduce spatial redundancy more efficiently with normalization.

7.2.5 Summary
We evaluate the rate-distortion performance of different methods developed in recent years. As we can see from the results, great progress has been made in improving rate-distortion performance, where the decorrelating normalization and the hyperprior model bring significant improvement. Nevertheless, we also see large variations in
performance on different testing datasets. Compared with existing works, the proposed method achieves a more consistent gain on different content and resolutions.

TABLE 7: Evaluation of the BD-Rate of different methods (optimized for PSNR) on different image sets. We set BPG-4:4:4 as the anchor. Negative values reflect the average bit-rate saving compared to the anchor at the same level of distortion. Best performances are marked in bold, while the second-best ones are underlined.

Kodak
Method | Low | Median | High | Full
CVPR18-Condition | 76.18% | N/A | N/A | N/A
VTM8-4:4:4 | -20.85% | -18.26% | -15.86% | -18.91%
PCS18-GDN | 42.09% | 40.64% | 41.62% | 41.68%
ICLR18-Factorized | 35.46% | 32.75% | 28.59% | 32.96%
ICLR18-HyperPrior | 6.11% | 3.06% | 0.32% | 4.07%
NeurIPS18 | -4.63% | -4.62% | -5.03% | -4.68%
ICLR19 | -6.27% | -4.75% | -4.18% | -5.40%
CVPR17-RNN | 189.38% | 174.02% | 193.28% | 176.16%
JPEG | N/A | 113.28% | 104.99% | N/A
Ours | -10.65% | -8.74% | -8.31% | -9.42%

Tecnick
Method | Low | Median | High | Full
CVPR18-Condition | N/A | 123.13% | N/A | N/A
VTM8-4:4:4 | -31.63% | -30.14% | -27.56% | -30.25%
PCS18-GDN | 49.28% | 43.57% | 32.34% | 43.00%
ICLR18-Factorized | 41.29% | 37.64% | 25.58% | 36.70%
ICLR18-HyperPrior | 5.64% | -0.23% | -7.82% | 0.98%
NeurIPS18 | -12.76% | -14.84% | -18.16% | -14.54%
ICLR19 | -9.85% | -10.92% | -11.78% | -11.65%
CVPR17-RNN | N/A | 210.12% | 224.64% | N/A
JPEG | 222.30% | 193.80% | 187.90% | 198.24%
Ours | -14.82% | -14.56% | -16.48% | -14.15%

CLIC³
Method | Low | Median | High | Full
CVPR18-Condition | 88.73% | N/A | N/A | N/A
VTM8-4:4:4 | -23.43% | -20.31% | -17.52% | -21.22%
PCS18-GDN | 53.34% | 53.53% | 54.26% | 53.54%
ICLR18-Factorized | 49.37% | 49.45% | 50.85% | 49.64%
ICLR18-HyperPrior | 12.00% | 8.97% | 9.99% | 10.63%
NeurIPS18 | -3.53% | -1.40% | 4.45% | -1.88%
ICLR19 | -7.52% | -4.33% | -2.17% | -4.61%
JPEG | 124.30% | 115.73% | N/A | N/A
Ours | -14.49% | -12.21% | -11.64% | -12.86%

3. The results of CVPR17-RNN on the CLIC 2019 validation dataset are not included, as the available code does not support the resolutions in this dataset.

TABLE 8: BD-Rate evaluation for the coarse-to-fine hyperprior model at different resolutions, with the single-layer hyperprior as the anchor.

Resolution | BD-Rate
4K | -4.65%
1080p | -2.65%
540p | -1.97%

Fig. 7: Rate-distortion curves for different aggregation methods. The methods vary in the aggregated feature forms (Hyper and Mean), feature granularity (Fine and Fine+Coarse), and fusion stage (SYN and IAR).

7.3 Studies on the Proposed Method
7.3.1 Coarse-to-Fine Modeling
We propose the coarse-to-fine hyperprior model to reduce the bit-rate, and we conduct ablation studies to evaluate the coarse-to-fine design. In this experiment, we benchmark on a subset of the LIU4K dataset [64] to evaluate the performance at different resolutions with the same content. Images in the LIU4K dataset are of 4K resolution. We down-sample the images to 1080p (1920 × 1080) and 540p (960 × 540) to build three subsets of different resolutions. We calculate the BD-Rate on the R-D curves, with the single-layer hyperprior model as the anchor. The BD-Rate results are shown in Table 8. As shown, the coarse-to-fine model achieves R-D performance improvements over the original single-layer model. We also show that the coarse-to-fine model achieves more significant BD-Rate reductions on high-resolution images. It especially benefits emerging high-resolution applications.

7.3.2 Information-Aggregation Reconstruction
The Information-Aggregation Reconstruction (IAR) subnetwork is designed to improve reconstruction quality. It aggregates image representations at different granularities to fully utilize the transmitted information for reconstruction. To analyze the effect of the IAR component, we conduct ablation studies considering the forms and granularities of the aggregated features. The results are shown in Fig. 7 and Table 9. There are two types of feature forms, i.e., the Hyper information retrieved right after the hyper synthesis transforms and the Mean of the Gaussian distributions generated by the prediction subnetwork [47]. These feature maps can be aggregated at different resolutions, i.e., at the small-resolution stage before the synthesis transform (SYN) or at the full-resolution stage within the IAR subnetwork. With the proposed coarse-to-fine hyperprior model, the hyperpriors can be aggregated at different granularities, i.e., Fine and Coarse. We empirically analyze the effect of combining these factors in the ablation study. As shown in Fig. 7, the fusion of multi-resolution representations shows significant benefits, and it is beneficial to aggregate information at both coarse and fine granularities. Utilizing the hyperprior representation tends to show better performance than concatenating the Mean information. Besides, aggregation at the higher-resolution stage leads to improved performance.

7.3.3 Visual Quality Analysis
We conduct a visual analysis of the reconstructed images in Fig. 8, where we compare our method with the representative learning-based method (ICLR18-HyperPrior) [24] and BPG.
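All BD-Rate numbers in Tables 7 and 8 follow the Bjøntegaard metric [61]; a sketch of the commonly used cubic-fit computation is given below, with hypothetical R-D points rather than the paper's data.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit-rate difference (%) between two R-D curves at equal PSNR,
    using a cubic fit of log-rate against PSNR as in [61]."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)  # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the overlapping PSNR interval.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100     # negative => bit-rate saving

# Hypothetical four-point R-D curves (bpp, dB):
print(bd_rate([0.2, 0.4, 0.8, 1.2], [30, 33, 36, 38],
              [0.18, 0.36, 0.75, 1.1], [30, 33, 36, 38]))
```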
TABLE 9: BD-Rate corresponding to the R-D curves in Fig. 7. We use the setting “No Aggregation” as the anchor.

TABLE 10: Encoding and decoding time (seconds) for various methods.
Fig. 8: Visualization of the reconstructed images by the proposed method, ICLR18-Hyperprior, and BPG.
[Fig. 9 graphic: VGG L4 distance plotted against MS-SSIM (dB) and PSNR (dB); each learned method appears in MSE-optimized and SSIM-optimized variants.]
Fig. 9: Evaluation of perceptual distance [15] (a lower value corresponds to better quality) with respect to PSNR and
MS-SSIM for different methods. The methods correspond to those in Fig. 6.
scheme and therefore runs faster than those methods in the experiments. VTM is designed with no thread-level parallelism and thus is not accelerated by multi-core devices. Existing works have shown the potential of end-to-end learned compression to achieve higher performance and better efficiency in the near future.
No. 2018AAA0102702, the Fundamental Research Funds for the Central Universities, and the National Natural Science Foundation of China under Contract No. 61772043 and No. 62022038.

REFERENCES
[1] M. W. Marcellin, M. J. Gormish, A. Bilgin, and M. P. Boliek, “An overview of JPEG-2000,” in Proc. Data Compression Conf., 2000.
[2] M. Rabbani and R. Joshi, “An overview of the JPEG 2000 still image compression standard,” Signal Processing: Image Communication, vol. 17, no. 1, pp. 3–48, 2002.
[3] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Springer Science & Business Media, 2012, vol. 159.
[4] M. W. Marcellin and T. R. Fischer, “Trellis coded quantization of memoryless and Gauss-Markov sources,” IEEE Trans. Commun., vol. 38, no. 1, pp. 82–93, 1990.
[5] I. H. Witten, R. M. Neal, and J. G. Cleary, “Arithmetic coding for data compression,” Communications of the ACM, vol. 30, no. 6, pp. 520–540, 1987.
[6] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620–636, 2003.
[7] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, 2012.
[8] J. Chen, Y. Ye, and S. H. Kim, “Versatile video coding (draft 8),” JVET-Q2002-v3, 2020.
[9] F. Bellard, “BPG image format,” http://bellard.org/bpg/, accessed: 2017-01-30.
[10] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, “Variable rate image compression with recurrent neural networks,” in Proc. Int. Conf. Learn. Representations, 2016.
[11] J. Ballé, V. Laparra, and E. P. Simoncelli, “Density modeling of images using a generalized normalization transformation,” in Proc. Int. Conf. Learn. Representations, 2016.
[12] Y. Hu, W. Yang, and J. Liu, “Coarse-to-fine hyper-prior modeling for learned image compression,” in Proc. AAAI Conf. Artif. Intell., 2020.
[13] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[14] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Conference Record of the Asilomar Conference on Signals, Systems & Computers, 2003.
[15] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2016.
[16] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimization of nonlinear transform codes for perceptual quality,” in Proc. Picture Coding Symp., 2016.
[17] G. Toderici, D. Vincent, N. Johnston, S. Jin Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[18] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Proc. Adv. Neural Inform. Process. Syst., 2017.
[19] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in Proc. Int. Conf. Learn. Representations, 2017.
[20] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proc. Int. Conf. Learn. Representations, 2017.
[21] M. H. Baig, V. Koltun, and L. Torresani, “Learning to inpaint for image compression,” in Proc. Adv. Neural Inform. Process. Syst., 2017.
[22] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in Proc. Int. Conf. Mach. Learn., 2017.
[23] D. Minnen, G. Toderici, M. Covell, T. Chinen, N. Johnston, J. Shor, S. J. Hwang, D. Vincent, and S. Singh, “Spatially adaptive image compression using a tiled deep network,” in Proc. Int. Conf. Image Process., 2017.
[24] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Proc. Int. Conf. Learn. Representations, 2018.
[25] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Proc. Adv. Neural Inform. Process. Syst., 2018.
[26] D. Minnen, G. Toderici, S. Singh, S. J. Hwang, and M. Covell, “Image-dependent local entropy models for learned image compression,” in Proc. Int. Conf. Image Process., 2018.
[27] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[28] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[29] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[30] M. Tschannen, E. Agustsson, and M. Lucic, “Deep generative models for distribution-preserving lossy compression,” in Proc. Adv. Neural Inform. Process. Syst., 2018.
[31] K. M. Nakanishi, S.-i. Maeda, T. Miyato, and D. Okanohara, “Neural multi-scale image compression,” in Proc. Asian Conf. Comput. Vis., 2018.
[32] J. Klopp, Y.-C. F. Wang, S.-Y. Chien, and L.-G. Chen, “Learning a code-space predictor by exploiting intra-image-dependencies,” in Proc. Brit. Mach. Vis. Conf., 2018.
[33] J. Cai and L. Zhang, “Deep image compression with iterative non-uniform quantization,” in Proc. Int. Conf. Image Process., 2018.
[34] J. Lee, S. Cho, and S.-K. Beack, “Context adaptive entropy model for end-to-end optimized image compression,” in Proc. Int. Conf. Learn. Representations, 2019.
[35] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learning image and video compression through spatial-temporal energy compaction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019.
[36] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized Gaussian mixture likelihoods and attention modules,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020.
[37] H. Ma, D. Liu, N. Yan, H. Li, and F. Wu, “End-to-end optimized versatile image compression with wavelet-like transform,” IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[38] T. Chen, H. Liu, Z. Ma, Q. Shen, X. Cao, and Y. Wang, “Neural image compression via non-local attention optimization and improved context modeling,” IEEE Trans. Image Process., 2021.
[39] T. Dumas, A. Roumy, and C. Guillemot, “Autoencoder based image compression: can the learning be quantization independent?” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018.
[40] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[41] Y. Yang, R. Bamler, and S. Mandt, “Improving inference for neural image compression,” in Proc. Adv. Neural Inform. Process. Syst., 2020.
[42] L. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video coding for machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Trans. Image Process., vol. 29, pp. 8680–8695, 2020.
[43] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” in Proc. Int. Conf. Comput. Vis., 2018.
[44] S. Santurkar, D. Budden, and N. Shavit, “Generative compression,” in Proc. Picture Coding Symp., 2018.
[45] F. Mentzer, G. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” in Proc. Adv. Neural Inform. Process. Syst., 2020.
[46] J. Lee, S. Cho, S.-Y. Jeong, H. Kwon, H. Ko, H. Y. Kim, and J. S. Choi, “Extended end-to-end optimized image compression method based on a context-adaptive entropy model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019.
[47] J. Zhou, S. Wen, A. Nakagawa, K. Kazui, and Z. Tan, “Multi-scale and context-adaptive entropy model for image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019.
[48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[50] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[51] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[52] M. Covell, N. Johnston, D. Minnen, S. J. Hwang, J. Shor, S. Singh, D. Vincent, and G. Toderici, “Target-quality image compression with recurrent, convolutional neural networks,” arXiv preprint arXiv:1705.06687, 2017.
[53] J. Ballé, “Efficient nonlinear transforms for lossy image compression,” in Proc. Picture Coding Symp., 2018.
[54] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with PixelCNN decoders,” in Proc. Adv. Neural Inform. Process. Syst., 2016.
[55] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.
[56] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[57] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017.
[58] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[59] E. Kodak, “Kodak lossless true color image suite (PhotoCD PCD0992),” http://r0k.us/graphics/kodak/.
[60] N. Asuni and A. Giachetti, “TESTIMAGES: a large-scale archive for testing visual devices and basic image processing algorithms,” in Proc. Eur. Italian Chapter Conf., 2014.
[61] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, 2001.
[62] L. Zhou, Z. Sun, X. Wu, and J. Wu, “End-to-end optimized image compression with attention mechanism,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019.
[63] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learn., 2015.
[64] J. Liu, D. Liu, W. Yang, S. Xia, X. Zhang, and Y. Dai, “A comprehensive benchmark for single image compression artifact reduction,” IEEE Trans. Image Process., vol. 29, pp. 7845–7860, 2020.
[65] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
[66] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Representations, 2015.
[67] D. Liu, H. Zhang, and Z. Xiong, “On the classification-distortion-perception tradeoff,” in Proc. Adv. Neural Inform. Process. Syst., 2019.
[68] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
Yueyu Hu (STM'18-GSM'19) received the B.S. degree in computer science from Peking University, Beijing, China, in 2018, where he is currently working toward the master's degree with the Wangxuan Institute of Computer Technology, Peking University. His current research interests include video and image compression and analytics with machine learning.

Wenhan Yang (M'18) received the B.S. degree and Ph.D. degree (Hons.) in computer science from Peking University, Beijing, China, in 2012 and 2018, respectively. He is currently a postdoctoral research fellow with the Department of Computer Science, City University of Hong Kong. His current research interests include image/video processing and restoration, bad weather restoration, and human-machine collaborative coding. He has authored over 100 technical articles in refereed journals and proceedings, and holds 9 granted patents. He received the IEEE ICME-2020 Best Paper Award, the IFTC 2017 Best Paper Award, and the IEEE CVPR-2018 UG2 Challenge First Runner-up Award. He was a candidate for the CSIG Best Doctoral Dissertation Award in 2019. He served as an Area Chair of IEEE ICME-2021 and an organizer of the IEEE CVPR-2019/2020/2021 UG2+ Challenge and Workshop.

Zhan Ma (SM'19) received the B.S. and M.S. degrees from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2004 and 2006, respectively, and the Ph.D. degree from New York University, New York, in 2011. He is now on the faculty of the School of Electronic Science and Engineering, Nanjing University, Jiangsu 210093, China. From 2011 to 2014, he was with Samsung Research America, Dallas, TX, and Futurewei Technologies, Inc., Santa Clara, CA, respectively. His current research focuses on next-generation video coding, energy-efficient communication, gigapixel streaming, and deep learning. He was a finalist of the 2018 ACM SIGCOMM Student Research Competition and the 2018 PCM Best Paper, and a co-recipient of the 2019 IEEE Broadcast Technology Society Best Paper Award.

Jiaying Liu (M'10-SM'17) is currently an Associate Professor and Peking University Boya Young Fellow with the Wangxuan Institute of Computer Technology, Peking University. She received the Ph.D. degree (Hons.) in computer science from Peking University, Beijing, China, in 2010. She has authored over 100 technical articles in refereed journals and proceedings, and holds 50 granted patents. Her current research interests include multimedia signal processing, compression, and computer vision.

Dr. Liu is a Senior Member of IEEE, CSIG, and CCF. She was a Visiting Scholar with the University of Southern California, Los Angeles, from 2007 to 2008. She was a Visiting Researcher with Microsoft Research Asia in 2015, supported by the Star Track Young Faculties Award. She has served as a member of the Multimedia Systems & Applications Technical Committee (MSA TC) and the Visual Signal Processing and Communications Technical Committee (VSPC TC) of the IEEE Circuits and Systems Society. She received the IEEE ICME-2020 Best Paper Award and the IEEE MMSP-2015 Top 10% Paper Award. She has also served as an Associate Editor of IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, and Elsevier JVCI; the Technical Program Chair of IEEE ICME-2021/ACM ICMR-2021; the Publicity Chair of IEEE ICME-2020/ICIP-2019; and an Area Chair of CVPR-2021/ECCV-2020/ICCV-2019. She was an APSIPA Distinguished Lecturer (2016-2017).