Semantics-To-Signal Scalable Image Compression With Learned Revertible Representations
https://doi.org/10.1007/s11263-021-01491-7
Received: 20 December 2020 / Accepted: 9 June 2021 / Published online: 22 June 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021
Abstract
Image/video compression and communication need to serve both human vision and machine vision. To address this need, we propose a scalable image compression solution. We assume that machine vision needs less information, namely the information related to semantics, whereas human vision needs more information, namely the information used to reconstruct the signal. We then propose semantics-to-signal scalable compression, where a partial bitstream is decodable for machine vision and the entire bitstream is decodable for human vision. Our method is inspired by the scalable image coding standard, JPEG2000, and similarly adopts subband-wise representations. We first design a trainable and revertible transform based on the lifting structure, which converts an image into a pyramid of multiple subbands; the transform is trained to make the partial representations useful for multiple machine vision tasks. We then design an end-to-end optimized encoding/decoding network for compressing the multiple subbands, to jointly optimize compression ratio, semantic analysis accuracy, and signal reconstruction quality. We experiment with two datasets, CUB200-2011 and FGVC-Aircraft, taking coarse-to-fine image classification tasks as an example. Experimental results demonstrate that our proposed method achieves semantics-to-signal scalable compression, and outperforms JPEG2000 in compression efficiency. The proposed method sheds light on a generic approach for image/video coding for humans and machines.
Keywords Deep learning · Image compression · Lifting structure · Machine vision · Scalable coding
…
information. Accordingly, the base layer can be the feature bitstream, which contains less information of the image, i.e. the information about its semantics. The enhancement layer may contain richer semantics-related information, and/or the information that is used to reconstruct the signal. In addition, there is redundancy between the image and its features, so the enhancement layer may refer to the base layer to improve the compression efficiency.

Fig. 1 Conceptual illustration of the proposed semantics-to-signal scalability with learned revertible representations. The image signal, i.e. pixels (a low-level pixel representation), is transformed into a set of features (a high-level feature representation). The transform is revertible in the sense that the image can be perfectly reconstructed using all the features. Using partial features we can perform semantic analysis tasks; using more features, we have more semantic information. In this figure, we show a coarse-to-fine image classification task: with more features we perform finer-grained classification, from "Wren", "House wren", to "Northern house wren"

Some studies (Wang et al. 2019; Hu et al. 2020) propose scalable image compression methods based on feature extraction and image generation to fulfill the need of human–machine collaborative judgment. Feature extraction is a process of information contraction (Simonyan and Zisserman 2014; Zhao et al. 2019; Latif et al. 2019), during which a large amount of task-independent information related to the original image is lost, resulting in the lack of clear correspondence between high-level features and low-level pixels, namely the semantic gap (Kwaśnicka and Jain 2018). The semantic gap makes it difficult to predict an image based on its sparse and compact features, and further affects the flexibility of increasing semantics-related information through partial decoding of the scalable bitstream. In addition, the extracted features may not be compact enough (Zhao et al. 2020; Ruder 2017; Baxter 1997), which incurs unnecessary bitstream cost.

In this paper, we target image compression for human–machine collaborative judgment from the perspective of jointly considering feature extraction, feature compression, and image compression. Firstly, to bridge the semantic gap between the image and its features, inspired by the scalable image coding standard, JPEG2000 (Christopoulos et al. 2000), we design a learned revertible transform, namely a hierarchical signal representation method, as illustrated in Fig. 1. The transform is trainable so that semantic tasks can be well performed by intercepting partial features. The information contained in the partial features is a subset of the image information, which enhances the interpretability of the scalable structure. Secondly, we design an end-to-end encoding/decoding network to achieve layered compression of the features. Our specific contributions are summarized as follows.

– First, … With increasing features, the semantics-related information contained in the features is gradually enriched.
– Second, we design a layered compression network to compress the multiple features into a scalable bitstream. We use the end-to-end optimization strategy to achieve the joint optimization of semantic accuracy, signal fidelity, and compression ratio.
– Third, we use coarse-to-fine image classification and image reconstruction as a motivating example to conduct experiments, and verify the effectiveness of the proposed semantics-to-signal scalable image compression.

The remainder of this paper is organized as follows. Related work is presented in Sect. 2. We elaborate on the hierarchical representation of the proposed semantics-to-signal scalable bitstream in Sect. 3. In Sect. 4, we introduce the proposed feature representation network. Section 5 demonstrates the proposed layered compression network for the learned representations. In Sect. 6, we show the experimental results about semantic analysis accuracy and compression efficiency. Finally, Sect. 7 concludes the paper.

2 Related Work
2.1 Image and Feature Compression

Studies in Johnston et al. (2018), Dejean-Servières et al. (2017), Dodge and Karam (2016) have shown that compression artifacts have an impact on machine vision tasks. Taking classification as an example, compression has little impact on the classification accuracy at high bit rates, but at low bit rates, compression will seriously lower the classification accuracy. To make images serve both human vision and machine vision, signal distortion and semantic analysis accuracy need to be considered in the compression. Ma et al. conduct a systematic review, analyze the joint compression of images and features (Ma et al. 2018), and explain the advantages of joint image-feature compression for reconstruction quality and analysis accuracy. Wang et al. propose a scalable image coding framework for the face recognition task (Wang et al. 2019), where a base layer and an enhancement layer are used to represent face features and signal residuals, respectively. The framework uses traditional coding technologies, such as quantization and entropy coding, on the face features, and uses a network-based compression scheme for the signal residuals. Its compression performance surpasses JPEG and JPEG2000 while maintaining semantic analysis accuracy. Hu et al. (2020) extend the strategy to the face keypoint detection task, and further unify the entire compression scheme with a deep neural network. Yan et al. propose an image compression method based on semantic scalability in Yan et al. (2020). The multi-layer features of the deep network are compressed into a scalable bitstream, which can serve multi-grained classification tasks, verifying the advantage of joint compression for significant bit savings. Besides, some researchers (Zhang et al. 2016; Duan et al. 2020; Xia et al. 2020) extend the concept of scalability to videos and verify the effectiveness of joint video-feature compression.

In recent years, end-to-end optimized image compression based on deep neural networks has demonstrated more flexible and efficient image compression capabilities. Toderici et al. propose the first end-to-end image coding method based on a recurrent neural network (RNN) (Toderici et al. 2015), and achieve scalable coding by iteratively invoking the RNN-based encoder to compress the image or residual. Ballé et al. propose the first end-to-end image coding method based on a convolutional neural network (CNN) (Ballé et al. 2016). After that, the hyper-prior model (Ballé et al. 2018) and the autoregressive model (Minnen et al. 2018; Lee et al. 2018) are introduced for more efficient entropy coding. By using the non-local attention module (Li et al. 2020; Zhou et al. 2019; Chen et al. 2019), the compression efficiency is further improved. Nowadays, CNN-based end-to-end coding methods significantly outperform the state-of-the-art non-deep image coding scheme, Better Portable Graphics (BPG, https://bellard.org/bpg/). Our encoding/decoding network is also based on CNN and adopts the hyper-prior model as well as the non-local attention module. The existing end-to-end compression methods, similar to non-deep methods, focus on improving the objective or subjective quality of reconstructed images, but ignore the fidelity of semantic information. Torfason et al. propose to use the same bitstream for machine vision and image reconstruction (Torfason et al. 2018). Intuitively, machine vision tasks (e.g. classification, detection, recognition) usually require less information than image reconstruction does. Using the scheme of Torfason et al. (2018), the number of bits suitable for machine vision may be too few to reconstruct a visually pleasing image, and the number of bits suitable for image reconstruction may be redundant for machine vision. Different from their work, our designed bitstream is scalable; in our bitstream, the bits used for machine vision occupy only a small fraction (e.g. less than 10%, see Sect. 6.3.1) of the entire bitstream.

2.2 Image Representation via Transforms

Signal representation is an important approach for image analysis and processing. In image compression, the discrete cosine transform (DCT) (Akansu and Liu 1991) and the discrete wavelet transform (DWT) (Mallat 1989) are the most commonly used signal representation methods. DCT linearly maps the signal from the spatial domain into the frequency domain while keeping the resolution unchanged. DWT uses orthogonal basis functions to decompose the original signal into multi-resolution coefficients with a pyramid structure. Further, nonlinear wavelets (Goutsias and Heijmans 2000; Heijmans and Goutsias 2000) are proposed based on the lifting structure (Sweldens 1998), which brings the advantages of perfect reconstruction and non-redundant representation. However, DCT and DWT directly decompose low-frequency and high-frequency components in the signal from the perspective of energy distribution (Akansu et al. 2001). These relatively low-level coefficients are difficult to directly serve image understanding.

How to obtain high-level semantics-oriented features from raw image pixels remains an open problem. Feature representation based on deep learning is the current main research trend and has achieved outstanding performance in various machine vision tasks. Several popular neural networks, such as VGGNet (Simonyan and Zisserman 2014), ResNet (He et al. 2016), and DenseNet (Huang et al. 2017), are believed to have outstanding feature extraction capabilities. Based on the concept of the information bottleneck (Tishby et al. 2000), Tishby et al. interpret the learning process of a deep neural network (DNN) as constantly forgetting information about the input while obtaining an efficient expression of the label (Tishby and Zaslavsky 2015; Shwartz-Ziv and Tishby 2017). The process of forgetting is irrevertible, which can easily lead to a semantic gap (Kwaśnicka and Jain 2018), i.e. the lack …
consider whether the feature is useful to express semantic information, but do not consider whether the feature is able to reconstruct the image; the existing neural image compression methods like Ballé et al. (2018), Minnen et al. (2018), Chen et al. (2019) only consider whether the coded representation (feature) is useful to reconstruct the image, but do not consider whether the feature carries important semantic information. Different from them, we consider signal reconstruction and semantic analysis simultaneously to obtain compact features.

4 Proposed Learned Revertible Representation

In this section, we propose a Lifting-based Feature Representation Network (LFRNet) to convert an image into multiple features. The conversion is revertible. Using LFRNet, we achieve semantics-to-signal scalability through the hierarchical representation of the image information.

4.1 Lifting-Based Feature Representation Network

4.1.1 Overview

The proposed LFRNet is a fully convolutional network based on the lifting structure (Sweldens 1998). The network structure is shown in Fig. 3a. Specifically, the input to the network is the image $I$, or rigorously speaking, the pixels. LFRNet converts $I$ into feature representations with a hierarchical structure. The conversion can be formulated as

$$\left(F_K^m, F_K^d, F_{K-1}^d, \ldots, F_2^d, F_1^d\right) = \overrightarrow{F}(I \mid \Theta) \quad (6)$$

where $\overrightarrow{F}$ stands for the forward-transform of LFRNet, $\Theta$ is the set of trainable parameters of LFRNet, and $K$ is the order of the transform. $K = 5$ in the experiments by default. The features $\{F_K^m, F_K^d, F_{K-1}^d, \ldots, F_2^d, F_1^d\}$ are derived from multiple Revertible Feature Representation Units (RFRUs). RFRU is the basic unit of LFRNet, as shown in Fig. 3b. It is designed based on the lifting structure and CNN. RFRU includes three basic operations: Split, Predict, and Update. The input of $\mathrm{RFRU}_k^F$ ($k \in [1, K]$), the $k$th forward-transform unit of LFRNet, is $F_{k-1}^m$. The Split operation decomposes $F_{k-1}^m$ into the main branch subband $\tilde{F}_k^m$ and the dual branch subband $\tilde{F}_k^d$. In particular, the splitting process is revertible:

$$\left(\tilde{F}_k^m, \tilde{F}_k^d\right) = \mathrm{Split}\left(F_{k-1}^m\right) \quad (7)$$

Then, we use $\tilde{F}_k^m$ to predict $\tilde{F}_k^d$, and use the prediction residual $F_k^d$ to update $F_k^m$:

$$\begin{cases} F_k^d = \tilde{F}_k^d - \mathrm{Predict}\left(\tilde{F}_k^m\right) \\ F_k^m = \tilde{F}_k^m + \mathrm{Update}\left(F_k^d\right) \end{cases} \quad (8)$$

Equation (8) is revertible. Its inverse process is:

$$\begin{cases} \tilde{F}_k^m = F_k^m - \mathrm{Update}\left(F_k^d\right) \\ \tilde{F}_k^d = F_k^d + \mathrm{Predict}\left(\tilde{F}_k^m\right) \end{cases} \quad (9)$$

In other words, $F_{k-1}^m$ can be perfectly reconstructed when $F_k^m$ and $F_k^d$ are known.

The revertibility of RFRU implies that LFRNet is revertible, provided that the parameters used for the forward-transform are also used for the inverse-transform. Here, we do not consider the information loss due to numeric computations. When all the features are known, the input image $I$ can be reconstructed by the inverse operation $\overleftarrow{F}$:

$$I = \overleftarrow{F}\left(F_K^m, F_K^d, F_{K-1}^d, \ldots, F_2^d, F_1^d \mid \Theta\right) \quad (10)$$

4.1.2 Structure of RFRU

Note that RFRU follows the general lifting structure, which was proposed initially for efficient implementation of the wavelet transform (Sweldens 1998). While the lifting structure remains the same, our RFRU is different from the traditional wavelets, because in RFRU the Predict and Update operations are implemented by trained networks.

Specifically, Predict and Update in (8) use the same network structure, but have different parameters. Besides, different RFRUs in LFRNet do not share parameters. Each Predict/Update network has three parts: redundant representation, feature extraction, and feature shrinkage. The redundant representation uses a convolutional layer with kernel size 3 × 3 to expand the channel dimension to 8 times. The feature extraction uses $N$ repetitive units, each of which has a convolution, a batch normalization, and a Rectified Linear Unit (ReLU). $N$ is the order of nonlinearity, and is set to {2, 2, 3, 3, 3} for the $K = 5$ RFRUs. The feature shrinkage uses a convolutional layer with kernel size 3 × 3 to shrink the channel dimension back to that of the input.
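To illustrate how a lifting step implemented with trained networks stays exactly invertible, the following is a minimal PyTorch sketch of one RFRU, assuming a hypothetical space-to-depth Split followed by a channel split; the Predict/Update layer widths follow the description above, but the exact Split operation and channel settings of LFRNet are not specified here, so this is a sketch under stated assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def split(x):
    # Hypothetical Split: space-to-depth by 2, then divide channels into a
    # "main" half and a "dual" half. Any invertible rearrangement works here.
    x = F.pixel_unshuffle(x, 2)              # (B, 4C, H/2, W/2)
    main, dual = torch.chunk(x, 2, dim=1)
    return main, dual


def merge(main, dual):
    # Inverse of split: concatenate channels and undo space-to-depth.
    return F.pixel_shuffle(torch.cat([main, dual], dim=1), 2)


class PredictUpdateNet(nn.Module):
    """Shared structure for Predict and Update (separate parameters):
    channel expansion -> N conv-BN-ReLU units -> channel shrinkage."""
    def __init__(self, channels, n_units=2, expand=8):
        super().__init__()
        layers = [nn.Conv2d(channels, channels * expand, 3, padding=1)]
        for _ in range(n_units):
            layers += [nn.Conv2d(channels * expand, channels * expand, 3, padding=1),
                       nn.BatchNorm2d(channels * expand),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels * expand, channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


class RFRU(nn.Module):
    """One revertible feature representation unit (one lifting step)."""
    def __init__(self, channels, n_units=2):
        super().__init__()
        self.predict = PredictUpdateNet(channels, n_units)
        self.update = PredictUpdateNet(channels, n_units)

    def forward(self, x):                       # Eq. (8)
        main_t, dual_t = split(x)
        dual = dual_t - self.predict(main_t)    # prediction residual
        main = main_t + self.update(dual)
        return main, dual

    def inverse(self, main, dual):              # Eq. (9)
        main_t = main - self.update(dual)
        dual_t = dual + self.predict(main_t)
        return merge(main_t, dual_t)


if __name__ == "__main__":
    rfru = RFRU(channels=6).eval()     # a 3-channel input yields 6 channels per branch
    x = torch.randn(1, 3, 64, 64)
    with torch.no_grad():
        m, d = rfru(x)
        x_rec = rfru.inverse(m, d)
    print(torch.allclose(x, x_rec, atol=1e-5))  # reconstruction up to numeric error
```

Whatever networks Predict and Update are, the subtraction/addition pattern of the lifting step guarantees that the inverse in (9) recovers the Split outputs exactly, which is the property LFRNet relies on.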
4.1.3 Structure of Hierarchical Representations

LFRNet achieves the mapping from an image $I$ to a set of features $\{F_K^m, F_K^d, F_{K-1}^d, \ldots, F_2^d, F_1^d\}$ through the cascade of RFRUs.
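As a continuation of the previous sketch, the cascade in Eqs. (6) and (10) can be written as a short loop; the channel counts follow from the hypothetical Split used above and are illustrative only.

```python
# Hypothetical cascade of K RFRUs (K = 5 as in the paper), reusing the RFRU
# class from the sketch above. Each unit consumes the previous main branch and
# emits one dual-branch feature; the final main branch plus all dual branches
# form the hierarchical feature set of Eq. (6).
K = 5
units = [RFRU(channels=3 * 2 ** (k + 1)) for k in range(K)]

def forward_transform(image):
    main, duals = image, []
    for unit in units:
        main, dual = unit(main)
        duals.append(dual)
    return main, duals              # (F_K^m, [F_1^d, ..., F_K^d])

def inverse_transform(main, duals):
    # Eq. (10): apply the units in reverse order with the same parameters.
    for unit, dual in zip(reversed(units), reversed(duals)):
        main = unit.inverse(main, dual)
    return main                     # reconstructed image
```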
Fig. 3 The proposed lifting-based feature representation network (LFRNet) and revertible feature representation unit (RFRU)
Next, these features are bound with machine vision tasks. Usually machine vision tasks need a compact set of representations for the sake of computational simplicity. Thus, we may further decompose the features into subsets and choose, for example, a subset of features for a task. For example, we may decompose at the subband level, or may decompose at the channel level.

In LFRNet, subband decomposition is embedded in the cascade of RFRUs. Specifically, while one RFRU outputs two subbands, main branch and dual branch, the following RFRU deals with the main branch only. Thus, the deeper RFRUs process less data, with the hope of extracting compact features. It is also worth noting that the degree of nonlinearity increases with deeper RFRUs.

In addition, we may also select partial channels from a subband as a subset. Because the proposed RFRU reduces the spatial resolution but increases the number of channels, the main-branch and dual-branch features have more and more channels. Possibly, for a machine vision task, we may need only a subset of a subband. Channel decomposition is a convenient way to achieve this.

In this paper, we consider the image classification task, which benefits from a relatively deeper CNN with more nonlinearity. Thus, we use the last main-branch feature $F_K^m$ to perform classification. For coarse-grained classification, we perform channel decomposition on $F_K^m$ to obtain a subset. This is illustrated in Fig. 3a, where we decompose $F_K^m$ into $F_1^s$ and $F_2^s$. Here we use a notation slightly different from that in …

… the number of network parameters and computational cost. Fourth, the inverse-transform directly uses the parameters of the forward-transform, avoiding additional modeling and training.

4.2 Task-Oriented Optimization

The proposed LFRNet can be optimized for specific machine vision tasks. Note that LFRNet produces a series of features, which can be input to different networks to fulfill various machine vision tasks. LFRNet and the task-specific networks may be jointly optimized to ensure the usability of the features. Here, we first give a general formulation, and then present the specific formulation for the scenario of coarse-to-fine image classification.

In general, we may have $T$ tasks; for each task $\mathrm{Task}_t$ ($t = 1, \ldots, T$), we assign a subset of features $\mathcal{F}_t \subseteq \{F_K^m, F_K^d, F_{K-1}^d, \ldots, F_2^d, F_1^d\}$ to perform the task. Let $\Theta$ be the set of trainable parameters of LFRNet, and $\Theta_t$ be the set of trainable parameters of the task-specific network; the optimization problem can be defined as

$$\min_{\Theta, \Theta_1, \ldots, \Theta_T} \sum_{t=1}^{T} \lambda_t \mathcal{L}_t(\Theta, \Theta_t) \quad (11)$$
Fig. 4 The proposed layered compression network (LCNet). RAFC$_l$ stands for the resolution adaptive feature compression (RAFC) unit for layer $l$. The superscripts $\{e\}$ and $\{d\}$ stand for encoding and decoding, respectively. "PUnit" is the inter-layer prediction unit. $\{F_1^s, F_2^s, \ldots, F_L^s\}$ are the features to be compressed. $\{\hat{F}_1^s, \hat{F}_2^s, \ldots, \hat{F}_L^s\}$ are the reconstructed features after compression. $\{B_1, B_2, \ldots, B_L\}$ form a scalable bitstream
… $loop_{lh}$ to 2. In addition, the width/height and number of channels of $X_l$ and $Z_l$ are set to

$$S_m = S \cdot 2^{-\mathrm{floor}\left(\frac{\log_2(S)-1}{2}\right)}, \qquad C_m = \frac{C \cdot S^2}{4 \cdot (S_m)^2}, \qquad S_h = \frac{S_m}{4}, \qquad C_h = \frac{C_m}{1.5} \quad (14)$$

$S_m$ and $C_m$ are for $X_l$, and $S_h$ and $C_h$ are for $Z_l$, respectively. Therefore, the size of $X_l$ is 1/4 of the size of the input feature, and the size of $Z_l$ is 1/24 of the size of $X_l$.
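For illustration, a small sketch that evaluates (14); the example input size below is hypothetical and only meant to show that the stated 1/4 and 1/24 size ratios follow from the formulas.

```python
import math

def rafc_sizes(S, C):
    """Spatial size and channel count of the latent X_l (S_m, C_m) and the
    hyper-latent Z_l (S_h, C_h), for an input feature of spatial size S with
    C channels, following Eq. (14)."""
    S_m = S * 2 ** (-math.floor((math.log2(S) - 1) / 2))
    C_m = C * S ** 2 / (4 * S_m ** 2)
    S_h = S_m / 4
    C_h = C_m / 1.5
    return S_m, C_m, S_h, C_h

# Hypothetical example: a 64x64 feature with 12 channels.
S_m, C_m, S_h, C_h = rafc_sizes(64, 12)        # -> 16.0, 48.0, 4.0, 32.0
assert C_m * S_m ** 2 == 12 * 64 ** 2 / 4      # X_l is 1/4 of the input size
assert C_h * S_h ** 2 == C_m * S_m ** 2 / 24   # Z_l is 1/24 of X_l
```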
5.3.2 Inter-Feature Prediction Unit

The inter-feature prediction has two modes, depending on how the features are decomposed. $F_1^s$ and $F_2^s$ are different channels of the same subband; the following features belong to different subbands. As shown in Fig. 5, PUnit first distinguishes the two cases. If it is channel decomposition, the feature to be predicted and the feature used to predict have the same spatial resolution, so a fully convolutional network is directly used. If it is subband decomposition, the feature used to predict shall be processed by an RFRU so that the spatial resolution is aligned, and then passed to a fully convolutional network.

Each PUnit has a lightweight six-layer CNN. Its first convolutional layer uses kernel size 5 × 5 and 128 channels for redundant representation. The following four layers are organized into two ResBlocks. Each ResBlock has two convolutional layers with kernel size 3 × 3, a ReLU between the two, and a residual connection. The last convolutional layer uses kernel size 1 × 1 and outputs 1 channel.

5.3.3 Post-Processing Unit

A post-processing unit is added to filter the entire reconstructed image, so as to reduce compression artifacts and enhance signal reconstruction quality. In particular, the post-processing unit uses the same network structure as the PUnit.
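A minimal PyTorch sketch of such a six-layer prediction/post-processing unit follows; the input channel count is a placeholder, and the layer sizes simply mirror the description above rather than the released implementation.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Two 3x3 convolutions with a ReLU in between and a residual connection."""
    def __init__(self, channels=128):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))


class PUnit(nn.Module):
    """Lightweight six-layer CNN: a 5x5 redundant-representation layer with
    128 channels, two ResBlocks, and a 1x1 output convolution with 1 channel."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 128, 5, padding=2)
        self.body = nn.Sequential(ResBlock(128), ResBlock(128))
        self.tail = nn.Conv2d(128, 1, 1)

    def forward(self, x):
        return self.tail(self.body(self.head(x)))
```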
6 Experiments

6.1 Tasks and Datasets

In this paper, we consider coarse-to-fine image classification as the targeted machine vision task, where one image can be classified into a coarse category (the number of optional categories is small) or a fine category (the number of optional categories is large). Note that our scheme is to provide a scalable bitstream, which can be partially decoded to obtain features that then serve the classification. For the classification tasks, one may imagine a compression scheme that performs classification at the encoder side and transmits only the classification results to the decoder side. This scheme is feasible, but may have severe limitations: the transmitted data may be useless if the decoder side wants to perform another classification task with a different category set; the encoder side may not have sufficient computational resources to perform the classification. In view of these limitations, we do not experimentally compare with the imaginary scheme of "transmitting classification results."

There are several datasets that support our designed study. For example, CUB200-2011 (Wah et al. 2011) is a bird image dataset consisting of 11,788 images, where 5994 images are for training and 5794 are for test. All the images are divided into 200 categories. According to ornithology systematics, the 200 categories can be merged into 122 coarse categories or 37 coarser categories. For CUB200-2011, we use coarse-grained to refer to 37-category classification, fine-grained to refer to 200-category classification, and intermediate-grained to refer to 122-category classification. For another example, FGVC-Aircraft (Maji et al. 2013) is an aircraft image dataset consisting of 10,000 images, where 6667 images are for training and 3333 are for test. The images can be divided by manufacturer, family, and variant into 30, 70, and 100 categories, respectively. For FGVC-Aircraft, we use coarse-grained, intermediate-grained, and fine-grained to refer to the 30-category, 70-category, and 100-category classification, respectively. All these category labels are available in the datasets, so the classification accuracy can be directly calculated. Note that our proposed LFRNet and LCNet are optimized only for coarse-grained and fine-grained classification, but we will test their generalization ability for intermediate-grained classification.

Since the content of the above two datasets is relatively homogeneous, we use the ILSVRC dataset (Russakovsky et al. 2015) (often known as ImageNet) for pre-training of LFRNet. ILSVRC contains 1000 categories, and each category has about 1000 images.

All the images in the used datasets have the resolution of 256 × 256, or have been resized to this resolution, to ensure a fair comparison with other methods. Figure 6 presents some images of the used datasets.

We evaluate the proposed method and compare with the others in both semantic analysis accuracy and compression efficiency. Top-1 accuracy and top-5 accuracy are used to evaluate the classification results. Compression efficiency is evaluated from three indicators: compression ratio (or bitrate in bit-per-pixel, bpp), peak signal-to-noise ratio (PSNR), and multi-scale structural similarity (MS-SSIM).
Fig. 6 Exemplar images of the used datasets. ILSVRC (often known as ImageNet) is used for network pre-training. Both CUB200-2011 and
FGVC-Aircraft are used for coarse-to-fine image classification
Epoch number: 128 / 128
Batch size: 256 / 24
Initial lr: 1e−2 / 1e−5
Step lr: 0.1× every 32 epochs / 0.75× every 32 epochs
Weight decay: 0.0005 / 0.0005
Momentum: 0.9 / 0.9
"lr" is short for learning rate

Four compression models are trained by setting different values for λ. These models refer to the same LFRNet model. Table 2 shows the average bit rate of compressed images with different models. On CUB200-2011, the maximal average compression ratio is 471 (0.051 bpp) and the minimal one is 34 (0.711 bpp). On FGVC-Aircraft, the maximal one is 889 (0.027 bpp) and the minimal one is 51 (0.474 bpp).
Table 2 Classification accuracy results
Dataset Task Bitrate Top-1 Accuracy Top-5 Accuracy
JPEG 2000 Ours (Image) BPG NLAIC Ours (Feature) JPEG 2000 Ours (Image) BPG NLAIC Ours (Feature)
CUB200-2011 Coarse-grained Orig. 88.9 88.9 88.9 88.9 89.3 98.7 98.7 98.7 98.7 98.4
0.711 83.3 84.7 85.0 87.0 87.9 97.1 97.4 97.8 98.2 98.1
0.343 74.2 77.0 76.0 81.1 87.9 93.8 94.7 95.1 94.5 98.2
0.153 57.8 58.5 62.5 63.1 87.4 85.8 85.3 88.9 88.2 98.1
0.051 26.6 30.7 40.6 – 87.6 62.5 59.6 73.0 – 98.0
Intermediate-grained Orig. 81.9 81.9 81.9 81.9 80.7 96.4 96.4 96.4 96.4 96.0
0.711 73.0 76.2 76.5 79.2 77.2 92.3 93.9 94.1 95.1 94.9
0.343 61.3 67.2 65.4 71.0 76.8 85.6 89.2 88.9 92.1 94.5
0.153 40.9 46.7 48.7 50.2 76.4 69.2 74.4 78.2 77.5 94.7
0.051 12.2 20.8 23.9 – 76.8 33.5 43.4 50.9 – 94.5
Fine-grained Orig. 75.8 75.8 75.8 75.8 75.5 94.0 94.0 94.0 94.0 93.3
0.711 68.6 69.9 70.2 72.4 72.6 90.4 91.7 91.4 92.9 92.3
0.343 57.0 61.5 60.2 65.5 72.4 83.4 87.1 86.4 89.3 92.3
0.153 36.1 41.8 46.3 44.6 72.2 66.5 70.4 75.7 73.5 92.3
0.051 9.9 16.0 22.5 – 72.1 29.6 36.4 48.3 – 92.4
FGVC-Aircraft Coarse-grained Orig. 93.3 93.3 93.3 93.3 92.6 98.5 98.5 98.5 98.5 98.6
0.474 82.3 83.4 85.5 86.4 90.9 95.7 96.1 96.5 96.7 98.1
0.206 58.9 70.2 75.0 79.6 90.4 85.2 88.8 92.4 94.0 98.1
0.093 22.4 41.0 49.4 57.2 90.9 49.4 68.0 77.4 79.9 98.0
0.027 6.7 10.2 16.1 – 89.6 21.2 28.3 38.7 – 97.5
Intermediate-grained Orig. 89.7 89.7 89.7 89.7 88.4 96.8 96.8 96.8 96.8 96.3
0.474 75.6 79.6 80.1 81.3 85.2 92.6 94.6 94.6 95.0 95.1
0.206 50.2 69.7 70.8 75.9 85.1 78.3 89.1 91.4 93.2 95.0
0.093 18.5 46.2 49.7 57.4 84.3 43.4 73.8 77.9 81.3 94.6
0.027 5.3 12.0 14.7 – 81.3 18.0 32.9 36.9 – 94.2
Fine-grained Orig. 89.6 89.6 89.6 89.6 89.0 97.0 97.0 97.0 97.0 97.0
0.474 76.7 80.5 80.7 82.0 87.1 92.7 94.1 94.5 94.9 96.3
0.206 55.0 71.2 74.1 78.7 86.6 80.5 90.3 92.4 93.7 95.9
0.093 20.4 47.7 56.0 61.5 86.4 43.8 73.0 80.8 83.5 96.0
0.027 4.3 11.7 18.3 – 85.1 14.8 34.4 41.0 – 95.5
“Ours (Feature)” and “Ours (Image)” respectively indicate decoded feature-based and reconstructed image-based classification results. “Orig.” indicates using original uncompressed images, whose
results are underlined to distinguish
Bold indicates the best accuracy for each bitrate
Fig. 7 Quantitative evaluation of the semantics-to-signal scalability (λ = 0.0067). The horizontal axis represents bitrate (bit-per-pixel, bpp), and the left and right vertical axes represent top-1 classification accuracy and reconstruction quality in PSNR, respectively. The classification accuracy is not displayed in the high bitrate range because it becomes stable after the indicated (by the green arrow) point. The reconstruction quality is not displayed in the (extremely) low bitrate range because it is too low to be useful (Color figure online)
… still be able to identify the shape of the object. With more bits decoded, colors, edges, and textures become more and more clear. When all the bits are decoded, reconstructed images appear similar to the original images.

6.3.2 Compression Performance

We choose three image compression methods for comparison. The first is JPEG2000, which is the widely used standard for scalable image compression. The second is BPG, which represents the state-of-the-art of non-learned compression. We use the default configuration of BPG, that is, to compress with the YUV420 format. The third is NLAIC, a CNN-based end-to-end learned image compression method proposed in Chen et al. (2019). We choose NLAIC because our compression network also borrows some ideas from it. For JPEG2000 and BPG, we have adjusted the quantization parameter to achieve compression ratios similar to those of our method. For NLAIC, we use pre-trained models, so the bit rates are not well aligned to those of the other methods.

First, we show the classification results. Here, for CUB200-2011 and FGVC-Aircraft, we respectively train a VGG16 model with uncompressed images, and use the same model on images reconstructed by the different methods (JPEG2000, ours, BPG, NLAIC) at different bit rates. These results are summarized in Table 2. In addition, we also report the classification results of our method not using reconstructed images but using features that are decoded from the partial bitstream. It can be observed that with the decrease of bit rate, the classification accuracy on reconstructed images usually drops significantly. However, our results using features are quite stable across different bit rates. For example, on CUB200-2011, for coarse-grained classification, when the bitrate is around 0.051 bpp, the top-1 accuracy of JPEG2000 reconstructed images is 26.6%, but our result using features is 87.6%. Note that the classifier for features is a three-layer fully-connected network and it is not stronger than VGG16. Therefore, the results show that compressing and transmitting features for machine vision tasks is a good choice at low bit rates.

Second, we compare the results of different methods on PSNR, MS-SSIM, and bitrate. Figure 9 shows the rate-PSNR/MS-SSIM curves of our method and JPEG2000, BPG, and NLAIC. Our proposed method significantly surpasses JPEG2000 in both PSNR and MS-SSIM. In addition, our method achieves MS-SSIM comparable to BPG and NLAIC at high bit rates, but does not catch up with BPG and NLAIC in PSNR. We believe the result is mainly due to the lower efficiency of our entropy coding. JPEG2000 compresses all the subbands simultaneously, and it has dedicated, highly efficient coding tools like the zero-tree (Taubman 2000). BPG has an advanced context-adaptive binary arithmetic coding (CABAC) engine for entropy coding (Marpe et al. 2003). NLAIC compresses all the features and optimizes the uniform entropy coder in an end-to-end fashion. In our scheme, the features are compressed one by one, using prediction from each to the next. The correlation among non-adjacent features is not fully exploited. As a result, our method performs less well, especially in PSNR at high bit rates where entropy coding has a big impact. Nonetheless, our method performs well in MS-SSIM at high bit rates;
Fig. 8 Reconstructed images of our method. Numbers shown below each image indicate bitrate and PSNR. The top two rows correspond to
λ = 0.04 and the bottom two rows correspond to λ = 0.2, respectively
Fig. 9 Rate–PSNR and rate–MS-SSIM curves of JPEG2000, Proposed, BPG, and NLAIC (horizontal axes: bitrate in bpp)
this is probably due to the effectiveness of the multi-scale decomposition of our method.

6.3.3 Complexity Analysis

We report the average running time of each module in Table 3. The time was measured on a GTX1080Ti GPU. Note that our scheme enables partial decoding, so we report the time needed for coarse-grained classification, fine-grained classification, and image reconstruction, respectively. It can be observed that the slowest module in our scheme is the decoding module, which is attributed to the autoregressive context model (Chen et al. 2019). In the future, we may simplify the context model to accelerate the decoding process. It is also noticeable that, excluding the decoding, the analysis modules at the decoder side are computationally efficient, reducing computational time by more than 90% compared with image-based classification.

6.4 Performance of LFRNet

LFRNet converts an image into hierarchical feature representations in a revertible manner. Partial features may serve as a compact representation of the information needed for …
… classification task. By grid search, we found k = 72 is appropriate, as shown in Fig. 10.

For comparison, we also fine-tune VGG16 and i-RevNet for the intermediate-grained classification. The results are given in Table 6. It is observed that LFRNet still achieves very competitive results, especially taking into account that VGG16 and i-RevNet are specifically trained for the task. These results demonstrate that LFRNet has extracted compact features that work well for trained tasks and meanwhile are generalizable to similar tasks.

Table 6 Classification accuracy results of uncompressed features for the intermediate-grained classification task

Dataset Method Top-1 Acc. Top-5 Acc.
CUB200-2011 VGG16 81.9 96.4
CUB200-2011 i-RevNet 84.2 (↑2.3) 96.4 (↑0.0)
CUB200-2011 Ours 80.7 (↓1.2) 96.0 (↓0.4)
FGVC-Aircraft VGG16 89.7 96.8
FGVC-Aircraft i-RevNet 89.2 (↓0.5) 97.1 (↑0.3)
FGVC-Aircraft Ours 88.4 (↓1.3) 96.3 (↓0.5)
Note that in our method, the features (72 × 8 × 8) are a subset of the features (96 × 8 × 8) prepared for the fine-grained classification. Nonetheless, the intermediate-grained classification was not considered during the training.

Fig. 10 Relation between classification accuracy and number of channels of $F_K^m$. In each curve, there is a solid marker showing that the top-1 accuracy becomes stable after that point (also indicated by a green arrow) (Color figure online)
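As a hypothetical sketch of the channel decomposition used here, the snippet below keeps the first k channels of the last main-branch feature (96 × 8 × 8, k = 72, as quoted above) and feeds them to a three-layer fully-connected classifier; the hidden width and the way the subset is selected are placeholders, not values taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: channel decomposition of F_K^m followed by a
# three-layer fully-connected classifier for intermediate-grained labels.
k, num_classes = 72, 122

classifier = nn.Sequential(
    nn.Flatten(),                              # (B, k, 8, 8) -> (B, k*64)
    nn.Linear(k * 8 * 8, 512), nn.ReLU(inplace=True),
    nn.Linear(512, 512), nn.ReLU(inplace=True),
    nn.Linear(512, num_classes),
)

f_Km = torch.randn(4, 96, 8, 8)   # last main-branch feature for a batch of 4
subset = f_Km[:, :k]              # keep k of the 96 channels
logits = classifier(subset)       # (4, num_classes)
```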
7 Conclusion
… to-fine image classification and image reconstruction as the targets for image compression. Our experimental results have verified the effectiveness of the proposed method, which outperforms JPEG2000 significantly.

In the future, our work can be extended in several directions. First, we may further investigate revertible networks to enhance the feature learning capabilities. Second, we may design advanced methods for more efficient compression of the features. Third, we may consider video coding in a similar way, but need to address motion carefully.

Acknowledgements This work was supported by the National Key Research and Development Program of China under Grant 2018YFA0701603, by the Natural Science Foundation of China under Grant 61772483, and by the Fundamental Research Funds for the Central Universities under Contract WK3490000005. We acknowledge the support of the GPU cluster built by MCC Lab of the School of Information Science and Technology of USTC.

References

Akansu, A. N., Haddad, P. A., Haddad, R. A., & Haddad, P. R. (2001). Multiresolution signal decomposition: Transforms, subbands, and wavelets. New York: Academic Press.
Akansu, A. N., & Liu, Y. (1991). On-signal decomposition techniques. Optical Engineering, 30(7), 912–921.
Ballé, J., Laparra, V., & Simoncelli, E. P. (2016). End-to-end optimized image compression. Technical Report. arXiv preprint arXiv:1611.01704
Ballé, J., Minnen, D., Singh, S., Hwang, S. J., & Johnston, N. (2018). Variational image compression with a scale hyperprior. Technical Report. arXiv preprint arXiv:1802.01436
Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1), 7–39.
Chen, T., Liu, H., Ma, Z., Shen, Q., Cao, X., & Wang, Y. (2019). Neural image compression via non-local attention optimization and improved context modeling. Technical Report. arXiv preprint arXiv:1910.06244
Christopoulos, C., Skodras, A., & Ebrahimi, T. (2000). The JPEG2000 still image coding system: An overview. IEEE Transactions on Consumer Electronics, 46(4), 1103–1127.
Dejean-Servières, M., Desnos, K., Abdelouahab, K., Hamidouche, W., Morin, L., & Pelcat, M. (2017). Study of the impact of standard image compression techniques on performance of image classification with a convolutional neural network. Technical Report. hal-01725126. https://hal.archives-ouvertes.fr/hal-01725126
Dodge, S., & Karam, L. (2016). Understanding how image quality affects deep neural networks. In International conference on quality of multimedia experience (pp. 1–6). IEEE.
Duan, L., Liu, J., Yang, W., Huang, T., & Gao, W. (2020). Video coding for machines: A paradigm of collaborative compression and intelligent analytics. IEEE Transactions on Image Processing, 29, 8680–8695.
Gomez, A. N., Ren, M., Urtasun, R., & Grosse, R. B. (2017). The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems (pp. 2214–2224).
Goutsias, J., & Heijmans, H. J. (2000). Nonlinear multiresolution signal decomposition schemes (I) Morphological pyramids. IEEE Transactions on Image Processing, 9(11), 1862–1876.
He, C., Shi, Z., Qu, T., Wang, D., & Liao, M. (2019). Lifting scheme-based deep neural network for remote sensing scene classification. Remote Sensing, 11(22), 2648.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).
Heijmans, H. J., & Goutsias, J. (2000). Nonlinear multiresolution signal decomposition schemes (II) Morphological wavelets. IEEE Transactions on Image Processing, 9(11), 1897–1913.
Hu, Y., Yang, S., Yang, W., Duan, L. Y., & Liu, J. (2020). Towards coding for human and machine vision: A scalable image coding approach. In ICME (pp. 1–6). IEEE.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In CVPR (pp. 4700–4708).
Jacobsen, J. H., Smeulders, A., & Oyallon, E. (2018). i-RevNet: Deep invertible networks. Technical Report. arXiv preprint arXiv:1802.07088
Johnston, P., Elyan, E., & Jayne, C. (2018). Spatial effects of video compression on classification in convolutional neural networks. In IJCNN (pp. 1–8).
Kwaśnicka, H., & Jain, L. C. (2018). Bridging the semantic gap in image and video analysis. Berlin: Springer.
Latif, A., Rasheed, A., Sajid, U., Ahmed, J., Ali, N., Ratyal, N. I., et al. (2019). Content-based image retrieval and feature extraction: A comprehensive review. Mathematical Problems in Engineering, 2019, 1–21.
Lee, J., Cho, S., & Beack, S. K. (2018). Context-adaptive entropy model for end-to-end optimized image compression. Technical Report. arXiv preprint arXiv:1809.10452
Li, M., Zhang, K., Zuo, W., Timofte, R., & Zhang, D. (2020). Learning context-based non-local entropy modeling for image compression. Technical Report. arXiv preprint arXiv:2005.04661
Lo, S. C., Li, H., & Freedman, M. T. (2003). Optimization of wavelet decomposition for image compression and feature preservation. IEEE Transactions on Medical Imaging, 22(9), 1141–1151.
Ma, H., Liu, D., Xiong, R., & Wu, F. (2019). iWave: CNN-based wavelet-like transform for image compression. IEEE Transactions on Multimedia, 22, 1667–1679.
Ma, H., Liu, D., Yan, N., Li, H., & Wu, F. (2020). End-to-end optimized versatile image compression with wavelet-like transform. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2020.3026003.
Ma, S., Zhang, X., Wang, S., Zhang, X., Jia, C., & Wang, S. (2018). Joint feature and texture coding: Toward smart video representation via front-end intelligence. IEEE Transactions on Circuits and Systems for Video Technology, 29(10), 3095–3105.
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. Technical Report. arXiv preprint arXiv:1306.5151.
Mallat, S. G. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693.
Marpe, D., Schwarz, H., & Wiegand, T. (2003). Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7), 620–636.
Minnen, D., Ballé, J., & Toderici, G. D. (2018). Joint autoregressive and hierarchical priors for learned image compression. In Advances in neural information processing systems (pp. 10771–10780).
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch. Technical Report. OpenReview.net, https://openreview.net/forum?id=BJJsrmfCZ.
Poyser, M., Atapour-Abarghouei, A., & Breckon, T. P. (2020). On the impact of lossy image and video compression on the performance of deep convolutional neural network architectures. Technical Report. arXiv preprint arXiv:2007.14314.
Rodriguez, M. X. B., Gruson, A., Polania, L., Fujieda, S., Prieto, F., Takayama, K., & Hachisuka, T. (2020). Deep adaptive wavelet network. In IEEE Winter conference on applications of computer vision (pp. 3111–3119).
Ruder, S. (2017). An overview of multi-task learning in deep neural networks. Technical Report. arXiv preprint arXiv:1706.05098
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Shwartz-Ziv, R., & Tishby, N. (2017). Opening the black box of deep neural networks via information. Technical Report. arXiv preprint arXiv:1703.00810
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. Technical Report. arXiv preprint arXiv:1409.1556.
Sweldens, W. (1998). The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2), 511–546.
Taubman, D. (2000). High performance scalable image compression with EBCOT. IEEE Transactions on Image Processing, 9(7), 1158–1170.
Tishby, N., Pereira, F. C., & Bialek, W. (2000). The information bottleneck method. Technical Report. arXiv preprint arXiv:physics/0004057.
Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In IEEE information theory workshop (pp. 1–5).
Toderici, G., O'Malley, S. M., Hwang, S. J., Vincent, D., Minnen, D., Baluja, S., Covell, M., & Sukthankar, R. (2015). Variable rate image compression with recurrent neural networks. Technical Report. arXiv preprint arXiv:1511.06085
Torfason, R., Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., & Van Gool, L. (2018). Towards image understanding from deep compression without decoding. Technical Report. arXiv preprint arXiv:1803.06131
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical Report. CNS-TR-2011-001, California Institute of Technology.
Wang, S., Wang, S., Zhang, X., Wang, S., Ma, S., & Gao, W. (2019). Scalable facial image compression with deep feature reconstruction. In ICIP (pp. 2691–2695). IEEE.
Xia, S., Liang, K., Yang, W., Duan, L. Y., & Liu, J. (2020). An emerging coding paradigm VCM: A scalable coding approach beyond feature and signal. In ICME (pp. 1–6). IEEE.
Yan, N., Liu, D., Li, H., & Wu, F. (2020). Semantically scalable image coding with compression of feature maps. In ICIP (pp. 3114–3118). IEEE.
Zhang, X., Ma, S., Wang, S., Zhang, X., Sun, H., & Gao, W. (2016). A joint compression scheme of video feature descriptors and visual content. IEEE Transactions on Image Processing, 26(2), 633–647.
Zhao, J., Peng, Y., & He, X. (2020). Attribute hierarchy based multi-task learning for fine-grained image classification. Neurocomputing, 395, 150–159.
Zhao, Z. Q., Zheng, P., Xu, S. T., & Wu, X. (2019). Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212–3232.
Zhou, L., Sun, Z., Wu, X., & Wu, J. (2019). End-to-end optimized image compression with attention mechanism. In CVPR workshops (pp. 1–4).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.