
International Journal of Computer Vision (2021) 129:2605–2621

https://doi.org/10.1007/s11263-021-01491-7

Semantics-to-Signal Scalable Image Compression with Learned Revertible Representations
Kang Liu¹ · Dong Liu¹ · Li Li¹ · Ning Yan¹ · Houqiang Li¹

Received: 20 December 2020 / Accepted: 9 June 2021 / Published online: 22 June 2021
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021

Abstract
Image/video compression and communication need to serve both human vision and machine vision. To address this need, we propose a scalable image compression solution. We assume that machine vision needs less information, mainly related to semantics, whereas human vision needs more information in order to reconstruct the signal. We then propose semantics-to-signal scalable compression, where a partial bitstream is decodable for machine vision and the entire bitstream is decodable for human vision. Our method is inspired by the scalable image coding standard, JPEG2000, and similarly adopts subband-wise representations. We first design a trainable and revertible transform based on the lifting structure, which converts an image into a pyramid of multiple subbands; the transform is trained to make the partial representations useful for multiple machine vision tasks. We then design an end-to-end optimized encoding/decoding network for compressing the multiple subbands, to jointly optimize compression ratio, semantic analysis accuracy, and signal reconstruction quality. We experiment with two datasets, CUB200-2011 and FGVC-Aircraft, taking coarse-to-fine image classification tasks as an example. Experimental results demonstrate that our proposed method achieves semantics-to-signal scalable compression and outperforms JPEG2000 in compression efficiency. The proposed method sheds light on a generic approach to image/video coding for humans and machines.

Keywords Deep learning · Image compression · Lifting structure · Machine vision · Scalable coding

Communicated by Dong Xu.

¹ CAS Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, University of Science and Technology of China, Hefei 230027, China

1 Introduction

Image/video contains rich semantic information, being one of the main information sources for humans. With the explosive growth of image/video data, it has become impossible to rely on limited manpower alone to understand massive numbers of images. Instead, the development of machine vision algorithms, especially with the help of deep learning technologies, has greatly improved the efficiency of machine understanding of images. Nonetheless, machine vision cannot completely replace human observation, so human–machine collaborative judgment will last for a while. Note that humans can directly view images, but machine vision generally needs to convert pixels into compact semantic representations, namely features. In particular, various machine vision tasks generally require different features.

In order to reduce the cost of storage and transmission, images in typical scenarios are compressed with information loss. Traditional image compression algorithms do not consider feature fidelity, so the compression artifacts under low bit rates seriously affect the semantic analysis accuracy (Poyser et al. 2020; Dejean-Servières et al. 2017; Dodge and Karam 2016). A feasible solution is to obtain and compress compact feature representations extracted from the original image. Note that it is difficult to reconstruct the image only based on the features.

For human–machine collaborative judgment, it appears necessary to compress and transmit both the features and the image itself simultaneously.

Scalable image compression is a promising approach for jointly compressing the image and the features. Scalable compression means that the compressed bitstream can be partially decoded to obtain meaningful output. Note that the features are compact representations of the image and the information contained in the features is a subset of the image information. Accordingly, the base layer can be the feature bitstream, which contains less information of the image, mainly about its semantics. The enhancement layer may contain richer semantics-related information, and/or the information that is used to reconstruct the signal. In addition, there is redundancy between the image and its features, so the enhancement layer may refer to the base layer to improve the compression efficiency.

Fig. 1 Conceptual illustration of the proposed semantics-to-signal scalability with learned revertible representations. Image signal, i.e. pixels, is transformed into a set of features. The transform is revertible in the sense that the image can be perfectly reconstructed using all the features. Using partial features we can perform semantic analysis tasks; using more features, we have more semantic information. In this figure, we show a coarse-to-fine image classification task: with more features we perform finer-grained classification, from "Wren", "House wren", to "Northern house wren"

Some studies (Wang et al. 2019; Hu et al. 2020) propose scalable image compression methods based on feature extraction and image generation to fulfill the need of human–machine collaborative judgment. Feature extraction is a process of information contraction (Simonyan and Zisserman 2014; Zhao et al. 2019; Latif et al. 2019), during which a large amount of task-independent information related to the original image is lost, resulting in the lack of a clear correspondence between high-level features and low-level pixels, namely the semantic gap (Kwaśnicka and Jain 2018). The semantic gap makes it difficult to predict an image based on its sparse and compact features, and further limits the flexibility of increasing semantics-related information through partial decoding of the scalable bitstream. In addition, the extracted features may not be compact enough (Zhao et al. 2020; Ruder 2017; Baxter 1997), which incurs unnecessary bitstream cost.

In this paper, we target image compression for human–machine collaborative judgment from the perspective of jointly considering feature extraction, feature compression, and image compression. Firstly, to bridge the semantic gap between image and features, inspired by the scalable image coding standard JPEG2000 (Christopoulos et al. 2000), we design a learned revertible transform, namely a hierarchical signal representation method, as illustrated in Fig. 1. The transform is trainable so that semantic tasks can be well performed by intercepting partial features. The information contained in the partial features is a subset of the image information, which enhances the interpretability of the scalable structure. Secondly, we design an end-to-end encoding/decoding network to achieve layered compression of the features. Our specific contributions are summarized as follows.

– First, we propose a task-driven learned revertible transform that converts an image into compact features and achieves hierarchical representations of the image information. With increasing features, the semantics-related information contained in the features is gradually enriched.
– Second, we design a layered compression network to compress the multiple features into a scalable bitstream. We use an end-to-end optimization strategy to achieve the joint optimization of semantic accuracy, signal fidelity, and compression ratio.
– Third, we use coarse-to-fine image classification and image reconstruction as a motivating example to conduct experiments, and verify the effectiveness of the proposed semantics-to-signal scalable image compression.

The remainder of this paper is organized as follows. Related work is presented in Sect. 2. We elaborate on the hierarchical representation of the proposed semantics-to-signal scalable bitstream in Sect. 3. In Sect. 4, we introduce the proposed feature representation network. Section 5 demonstrates the proposed layered compression network for the learned representations. In Sect. 6, we show the experimental results about semantic analysis accuracy and compression efficiency. Finally, Sect. 7 concludes the paper.

2 Related Work

Image and features need to be compressed simultaneously for human–machine collaborative judgment. In this section, we introduce in detail the related work from two perspectives: compression and image representation.

2.1 Image and Feature Compression

Studies in Johnston et al. (2018), Dejean-Servières et al. (2017), Dodge and Karam (2016) have shown that compression artifacts have an impact on machine vision tasks. Taking classification as an example, compression has little impact on the classification accuracy at high bit rates, but at low bit rates, compression will seriously lower the classification accuracy.

To make images serve both human vision and machine vision, signal distortion and semantic analysis accuracy need to be considered in the compression. Ma et al. conduct a systematic review, analyze the joint compression of image and features (Ma et al. 2018), and explain the advantages of the joint image-feature compression for reconstruction quality and analysis accuracy. Wang et al. propose a scalable image coding framework for the face recognition task (Wang et al. 2019), where the base layer and the enhancement layer are used to represent face features and signal residuals, respectively. The framework uses traditional coding technologies, such as quantization and entropy coding, on the face features, and uses a network-based compression scheme for the signal residuals. Its compression performance surpasses JPEG and JPEG2000 while maintaining semantic analysis accuracy. Hu et al. (2020) extend the strategy to the face keypoint detection task, and further unify the entire compression scheme with a deep neural network. Yan et al. propose an image compression method based on semantic scalability in Yan et al. (2020). The multi-layer features of the deep network are compressed into a scalable bitstream, which can serve multi-grained classification tasks, verifying the advantage of joint compression for significant bit savings. Besides, some researchers (Zhang et al. 2016; Duan et al. 2020; Xia et al. 2020) extend the concept of scalability to videos and verify the effectiveness of joint video-feature compression.

In recent years, end-to-end optimized image compression based on deep neural networks has demonstrated more flexible and efficient image compression capabilities. Toderici et al. propose the first end-to-end image coding method based on a recurrent neural network (RNN) (Toderici et al. 2015), and achieve scalable coding by iteratively invoking the RNN-based encoder to compress the image or residual. Ballé et al. propose the first end-to-end image coding method based on a convolutional neural network (CNN) (Ballé et al. 2016). After that, the hyper-prior model (Ballé et al. 2018) and the autoregressive model (Minnen et al. 2018; Lee et al. 2018) are introduced for more efficient entropy coding. By using the non-local attention module (Li et al. 2020; Zhou et al. 2019; Chen et al. 2019), the compression efficiency is further improved. Nowadays, CNN-based end-to-end coding methods significantly outperform the state-of-the-art non-deep image coding scheme, Better Portable Graphics (BPG).¹ Our encoding/decoding network is also based on CNN and adopts the hyper-prior model as well as the non-local attention module. The existing end-to-end compression methods, similar to non-deep methods, focus on improving the objective or subjective quality of reconstructed images, but ignore the fidelity of semantic information. Torfason et al. propose to use the same bitstream for machine vision and image reconstruction (Torfason et al. 2018). Intuitively, machine vision tasks (e.g. classification, detection, recognition) usually require less information than image reconstruction does. Using the scheme of Torfason et al. (2018), the number of bits suitable for machine vision may be too few to reconstruct a visually pleasing image, and the number of bits suitable for image reconstruction may be too redundant for machine vision. Different from their work, our designed bitstream is scalable; in our bitstream, the bits used for machine vision occupy only a small fraction (e.g. less than 10%, see Sect. 6.3.1) of the entire bitstream.

2.2 Image Representation via Transforms

Signal representation is an important approach for image analysis and processing. In image compression, the discrete cosine transform (DCT) (Akansu and Liu 1991) and the discrete wavelet transform (DWT) (Mallat 1989) are the most commonly used signal representation methods. DCT linearly maps the signal from the spatial domain into the frequency domain while keeping the resolution unchanged. DWT uses orthogonal basis functions to decompose the original signal into multi-resolution coefficients with a pyramid structure. Further, nonlinear wavelets (Goutsias and Heijmans 2000; Heijmans and Goutsias 2000) are proposed based on the lifting structure (Sweldens 1998), which brings the advantages of perfect reconstruction and non-redundant representation. However, DCT and DWT directly decompose low-frequency and high-frequency components of the signal from the perspective of energy distribution (Akansu et al. 2001). These relatively low-level coefficients are difficult to directly serve image understanding.

How to obtain high-level semantics-oriented features from raw image pixels remains an open problem. Feature representation based on deep learning is the current main research trend and has achieved outstanding performance in various machine vision tasks. Several popular neural networks such as VGGNet (Simonyan and Zisserman 2014), ResNet (He et al. 2016), and DenseNet (Huang et al. 2017) are believed to have outstanding feature extraction capabilities. Based on the concept of the information bottleneck (Tishby et al. 2000), Tishby et al. interpret the learning process of the deep neural network (DNN) as constantly forgetting information about the input while obtaining an efficient expression of the label (Tishby and Zaslavsky 2015; Shwartz-Ziv and Tishby 2017).

¹ https://bellard.org/bpg/.

The process of forgetting is irrevertible, which can easily lead to a semantic gap (Kwaśnicka and Jain 2018), i.e. the lack of an explicit connection between features and image. The semantic gap brings more difficulties to the joint compression of image and features.

Convolutions and nonlinear activations in DNNs are the main reasons for irreversibility. It is a possible direction to combine DNNs with traditional transforms. In Lo et al. (2003), Ma et al. (2019, 2020), linear and nonlinear neural networks are introduced into the lifting structure, where the studies focus on signal fidelity instead of preserving semantic information. He et al. (2019) replace partial modules in ResNet with the lifting structure, which surpasses ResNet in a remote sensing classification task and shows the effectiveness of the lifting structure for feature representation. Another wavelet-based network is designed in Rodriguez et al. (2020), where the wavelet coefficients obtained by the transform are directly used for classification, but the network does not have the revertible characteristic. i-RevNet is a completely revertible nonlinear convolutional neural network proposed in Jacobsen et al. (2018) on the basis of Gomez et al. (2017). In i-RevNet, the input information is always fully retained, and the features based on multi-level mapping are used for object classification and signal reconstruction. However, i-RevNet has the drawback that the feature extraction process and the classifier always use the entire image information, which ignores the compactness of features.

3 Hierarchical Representation for Semantics-to-Signal Scalability

For human–machine collaborative judgment, an image needs to be provided to both humans and machines. We now introduce an application scenario that motivates the following discussions and experiments. In the considered scenario, there are multiple semantic analysis tasks, for example coarse-grained classification and fine-grained classification. Here, coarse-grained classification has fewer optional classes, and fine-grained classification has more and finer classes. For example, we may want to classify each image as "dog" or "cat," and if there is a cat, we ask which species the cat belongs to. The dog-cat problem and the dog/cat species problem are coarse-grained and fine-grained, respectively. In addition, the image shall be reconstructed for human viewing.

It is intuitive that machine vision needs less, semantics-related information, whereas human vision needs more, signal-reconstructive information. For classification tasks with different granularities, the information required for coarse-grained classification is intuitively a subset of the information required for fine-grained classification. Meanwhile, the features required by machine vision tasks are generally compact representations of an image; that is to say, the information required by machine vision tasks is a subset of the image information. We now use symbols F1, F2, and F to denote the features for coarse-grained classification, the features for fine-grained classification, and the image, respectively, as shown in Fig. 2. In the language of information theory, we know that

H(F_1 | F_2) = 0    (1)
H(F_2 | F) = 0    (2)

then we have

I(F_2; F) = H(F_2) - H(F_2 | F) = H(F_2)    (3)
I(F_1; F_2) = H(F_1) - H(F_1 | F_2) = H(F_1)    (4)

and thus

H(F) = H(F | F_2) + I(F_2; F)
     = H(F | F_2) + H(F_2)
     = H(F | F_2) + H(F_2 | F_1) + I(F_1; F_2)    (5)
     = H(F | F_2) + H(F_2 | F_1) + H(F_1)

which is depicted in Fig. 2. In other words, the image information can be decomposed into a series of entropies or conditional entropies of the features or the image. The hierarchical structure of entropy naturally leads to a scalable bitstream, where the base layer corresponds to H(F_1), and the enhancement layers correspond to H(F_2 | F_1) and H(F | F_2), respectively. It also implies that the enhancement layers may be compressed by using prediction from the previous layer(s).

Fig. 2 Theoretical interpretation of the semantics-to-signal scalability. Left: the entropy of the image may be represented in a hierarchy. Right: the hierarchical representations can be naturally compressed into a scalable bitstream

The hierarchical structure of entropy inspires us that features may also have a hierarchical characteristic. Motivated by JPEG2000, we design a revertible feature representation method, which distributes the image information into the feature space without any information loss and achieves compact representations for machine vision tasks by constraining the amount of information contained in the features. By gradually adding features, the semantic information can be continuously augmented, so the hierarchical representations have the characteristic of semantics-to-signal scalability.

Note that the existing feature extraction methods like Simonyan and Zisserman (2014), He et al. (2016) only consider whether the feature is useful to express semantic information, but do not consider whether the feature is able to reconstruct the image; the existing neural image compression methods like Ballé et al. (2018), Minnen et al. (2018), Chen et al. (2019) only consider whether the coded representation (feature) is useful to reconstruct the image, but do not consider whether the feature carries important semantic information. Different from them, we consider signal reconstruction and semantic analysis simultaneously to obtain compact features.

4 Proposed Learned Revertible Representation

In this section, we propose a Lifting-based Feature Representation Network (LFRNet) to convert an image into multiple features. The conversion is revertible. Using LFRNet, we achieve semantics-to-signal scalability through the hierarchical representation of the image information.

4.1 Lifting-Based Feature Representation Network

4.1.1 Overview

The proposed LFRNet is a fully convolutional network based on the lifting structure (Sweldens 1998). The network structure is shown in Fig. 3a. Specifically, the input to the network is the image I, or rigorously speaking, the pixels. LFRNet converts I into feature representations with a hierarchical structure. The conversion can be formulated as

(F_K^m, F_K^d, F_{K-1}^d, ..., F_2^d, F_1^d) = \vec{F}(I | Θ)    (6)

where \vec{F} stands for the forward-transform of LFRNet, Θ is the set of trainable parameters of LFRNet, and K is the order of the transform. K = 5 in the experiments by default. The features, {F_K^m, F_K^d, F_{K-1}^d, ..., F_2^d, F_1^d}, are derived from multiple Revertible Feature Representation Units (RFRUs).

RFRU is the basic unit of LFRNet, as shown in Fig. 3b. It is designed based on the lifting structure and CNN. RFRU includes three basic operations: Split, Predict, and Update. The input of RFRU_k^F (k ∈ [1, K]), the kth forward-transform unit of LFRNet, is F_{k-1}^m. The Split operation decomposes F_{k-1}^m into the main branch subband F̃_k^m and the dual branch subband F̃_k^d. In particular, the splitting process is revertible:

(F̃_k^m, F̃_k^d) = Split(F_{k-1}^m)    (7)

Then, we use F̃_k^m to predict F̃_k^d, and use the prediction residual F_k^d to update F̃_k^m:

F_k^d = F̃_k^d - Predict(F̃_k^m)
F_k^m = F̃_k^m + Update(F_k^d)    (8)

Equation (8) is revertible. Its inverse process is:

F̃_k^m = F_k^m - Update(F_k^d)
F̃_k^d = F_k^d + Predict(F̃_k^m)    (9)

In other words, F_{k-1}^m can be perfectly reconstructed when F_k^m and F_k^d are known.

The revertibility of RFRU implies that LFRNet is revertible, provided that the parameters used for the forward-transform are also used for the inverse-transform. Here, we do not consider the information loss due to numeric computations. When all the features are known, the input image I can be reconstructed by the inverse operation \overleftarrow{F}:

I = \overleftarrow{F}(F_K^m, F_K^d, F_{K-1}^d, ..., F_2^d, F_1^d | Θ)    (10)

4.1.2 Structure of RFRU

Note that RFRU follows the general lifting structure, which was proposed initially for efficient implementation of the wavelet transform (Sweldens 1998). While the lifting structure remains the same, our RFRU is different from traditional wavelets, because in RFRU the Predict and Update operations are implemented by trained networks.

Specifically, Predict and Update in (8) use the same network structure, but have different parameters. Besides, different RFRUs in LFRNet do not share parameters. Each Predict/Update network has three parts: redundant representation, feature extraction, and feature shrinkage. The redundant representation uses a convolutional layer with kernel size 3 × 3 to expand the channel dimension to 8 times. The feature extraction uses N repetitive units, each of which has a convolution, a batch normalization, and a Rectified Linear Unit (ReLU); N is the order of nonlinearity, and is set to {2, 2, 3, 3, 3} for the K = 5 RFRUs. The feature shrinkage uses a convolutional layer with kernel size 3 × 3 to shrink the channel dimension back to that of the input.
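To make the lifting-based unit concrete, the following is a minimal PyTorch sketch of one RFRU. It assumes a pixel-unshuffle-style Split that halves the spatial resolution and redistributes pixels into channels, and an equal channel split between the main and dual branches; these choices, as well as the channel counts, are illustrative assumptions, since the paper does not spell out the exact Split operator. The Predict/Update networks follow Sect. 4.1.2: a 3 × 3 convolution expanding the channels 8 times, N repetitions of convolution + batch normalization + ReLU, and a 3 × 3 convolution shrinking the channels back.

```python
import torch
import torch.nn as nn

class PredictUpdateNet(nn.Module):
    """Predict/Update network of Sect. 4.1.2: expand x8, N conv-BN-ReLU units, shrink back."""
    def __init__(self, channels, n_units):
        super().__init__()
        layers = [nn.Conv2d(channels, 8 * channels, 3, padding=1)]   # redundant representation
        for _ in range(n_units):                                     # feature extraction
            layers += [nn.Conv2d(8 * channels, 8 * channels, 3, padding=1),
                       nn.BatchNorm2d(8 * channels),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(8 * channels, channels, 3, padding=1)]  # feature shrinkage
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class RFRU(nn.Module):
    """One revertible feature representation unit (lifting: Split, Predict, Update)."""
    def __init__(self, in_channels, n_units):
        super().__init__()
        branch_channels = 2 * in_channels            # pixel_unshuffle(2) gives 4x channels, split in half
        self.predict = PredictUpdateNet(branch_channels, n_units)
        self.update = PredictUpdateNet(branch_channels, n_units)

    def split(self, x):
        y = nn.functional.pixel_unshuffle(x, 2)      # (C, H, W) -> (4C, H/2, W/2), revertible
        c = y.shape[1] // 2
        return y[:, :c], y[:, c:]                    # main branch, dual branch

    def merge(self, main, dual):
        return nn.functional.pixel_shuffle(torch.cat([main, dual], dim=1), 2)

    def forward(self, x):
        main, dual = self.split(x)
        f_d = dual - self.predict(main)              # Eq. (8): prediction residual
        f_m = main + self.update(f_d)                # Eq. (8): update
        return f_m, f_d

    def inverse(self, f_m, f_d):
        main = f_m - self.update(f_d)                # Eq. (9)
        dual = f_d + self.predict(main)              # Eq. (9)
        return self.merge(main, dual)

# Perfect reconstruction holds by construction (up to numerical precision):
unit = RFRU(in_channels=3, n_units=2).eval()
x = torch.randn(1, 3, 256, 256)
f_m, f_d = unit(x)
assert torch.allclose(unit.inverse(f_m, f_d), x, atol=1e-5)
```

Stacking K = 5 such units along the main branch yields the feature pyramid {F_K^m, F_K^d, F_{K-1}^d, ..., F_2^d, F_1^d} used throughout the paper.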

Fig. 3 The proposed lifting-based feature representation network (LFRNet) and revertible feature representation unit (RFRU)

4.1.3 Structure of Hierarchical Representations

LFRNet achieves the mapping from an image I to a set of features {F_K^m, F_K^d, F_{K-1}^d, ..., F_2^d, F_1^d} through the cascade of RFRUs. Next, these features are bound with machine vision tasks. Usually machine vision tasks need a compact set of representations for the sake of computational simplicity. Thus, we may further decompose the features into subsets and choose, for example, a subset of features for a task. For example, we may decompose at the subband level, or we may decompose at the channel level.

In LFRNet, subband decomposition is implanted into the cascade of RFRUs. Specifically, while one RFRU outputs two subbands, a main branch and a dual branch, the following RFRU deals with the main branch only. Thus, the deeper RFRUs process less data, with the hope of extracting compact features. It is also worth noting that the degree of nonlinearity increases with deeper RFRUs.

In addition, we may also select partial channels from a subband as a subset. Because the proposed RFRU reduces the spatial resolution but increases the number of channels, the main-branch and dual-branch features have more and more channels. Possibly, for a machine vision task, we may need only a subset of a subband. Channel decomposition is a convenient way to achieve this.

In this paper we consider the image classification task, which benefits from a relatively deeper CNN that has more nonlinearity. Thus, we use the last main-branch feature F_K^m to perform classification. For coarse-grained classification, we perform channel decomposition on F_K^m to obtain a subset. This is illustrated in Fig. 3a, where we decompose F_K^m into F_1^s and F_2^s. Here we use a notation slightly different from that in Sect. 3: F_1^s is equivalent to F_1 in Sect. 3, and refers to the features for coarse-grained classification; {F_1^s, F_2^s} is equivalent to F_2 in Sect. 3, and refers to the features for fine-grained classification.

We would like to remark that the proposed LFRNet has the following advantages. First, the image information is redistributed in the feature space without any information loss. Second, by using a part/all of the features, compact/complete representations are obtained, respectively. Third, compared with i-RevNet, which always performs feature extraction upon the entire image information (Jacobsen et al. 2018), the main branches of LFRNet discard information gradually, which makes the feature extraction more interpretable and reduces the number of network parameters and the computational cost. Fourth, the inverse-transform directly uses the parameters of the forward-transform, avoiding additional modeling and training.

4.2 Task-Oriented Optimization

The proposed LFRNet can be optimized for specific machine vision tasks. Note that LFRNet produces a series of features, which can be input to different networks to fulfill various machine vision tasks. LFRNet and the task-specific networks may be jointly optimized to ensure the usability of the features. Here, we first give a general formulation, and then present the specific formulation for the scenario of coarse-to-fine image classification.

In general, we may have T tasks; for each task Task_t (t = 1, ..., T), we assign a subset of features F_t ⊆ {F_K^m, F_K^d, F_{K-1}^d, ..., F_2^d, F_1^d} to perform the task. Let Θ be the set of trainable parameters of LFRNet, and Θ_t be the set of trainable parameters of the task-specific network; the optimization problem can be defined as

min_{Θ, Θ_1, ..., Θ_T}  Σ_{t=1}^{T} λ_t L_t(Θ, Θ_t)    (11)

where λ_t is the weight of the tth task and L_t measures the task-specific loss.

Specifically, for the considered coarse-to-fine image classification, there are two tasks: coarse-grained classification and fine-grained classification. There are respectively two task-specific networks dedicated to classification. For example, we use three fully-connected layers to build a classification network. The two classification networks are denoted by N_c and N_f, and their parameters are Θ_c and Θ_f, respectively.
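As a concrete illustration of the channel decomposition and the task-specific heads, the sketch below builds the two classifiers N_c and N_f as three fully-connected layers on top of channel subsets of F_K^m. The 24/96 channel split and the class counts follow the settings reported later in Sect. 6; the hidden width (512) and the flattening of the feature maps are illustrative assumptions, not specified by the paper.

```python
import torch
import torch.nn as nn

def make_classifier(in_features, num_classes, hidden=512):
    """Three fully-connected layers, as used for the task-specific networks N_c and N_f."""
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, num_classes),
    )

# F_K^m has 96 channels with spatial size 8 x 8 (see Sect. 6.4).
f_Km = torch.randn(4, 96, 8, 8)

f1s = f_Km[:, :24]        # F_1^s: first 24 channels, used for coarse-grained classification
f2s = f_Km[:, 24:]        # F_2^s: remaining 72 channels

n_c = make_classifier(24 * 8 * 8, num_classes=37)    # coarse-grained (CUB200-2011: 37 classes)
n_f = make_classifier(96 * 8 * 8, num_classes=200)   # fine-grained uses {F_1^s, F_2^s}, i.e. all 96 channels

coarse_logits = n_c(f1s)
fine_logits = n_f(torch.cat([f1s, f2s], dim=1))
```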

Fig. 4 The proposed layered compression network (LCNet). RAFC_l stands for the resolution adaptive feature compression (RAFC) unit for layer l. The superscripts {e} and {d} stand for encoding and decoding, respectively. "PUnit" is the inter-layer prediction unit. {F_1^s, F_2^s, ..., F_L^s} are the features to be compressed. {F̂_1^s, F̂_2^s, ..., F̂_L^s} are the reconstructed features after compression. {B_1, B_2, ..., B_L} form a scalable bitstream

In addition, we have mentioned that we use F_1^s for coarse-grained classification and {F_1^s, F_2^s} for fine-grained classification. Therefore, the specific optimization problem becomes:

min_{Θ, Θ_c, Θ_f}  λ_c H(N_c(F_1^s | Θ_c), L_c) + λ_f H(N_f({F_1^s, F_2^s} | Θ_f), L_f)    (12)

where H(·, ·) is the cross-entropy loss function, and L_c and L_f are the ground-truth labels for coarse-grained and fine-grained classification, respectively. λ_c = λ_f = 1 in the experiments because we assign equal importance to the two tasks.

5 Proposed Layered Compression Network

In this section, we propose a layered compression network to compress the multi-layer features into a scalable bitstream. The compression network is optimized end-to-end to achieve the joint optimization of compression ratio, signal distortion, and semantic analysis accuracy.

5.1 Problem Formulation

Based on the hierarchical representations generated by LFRNet, we propose a Layered Compression Network (LCNet) to compress all the features. To use a unified notation, hereafter we use F_l^s (l = 1, ..., L) to replace the previous symbols {F_K^m, F_K^d, F_{K-1}^d, ..., F_2^d, F_1^d}, as shown in Fig. 3. Note that we have split F_K^m into F_1^s and F_2^s, so L = K + 2 = 7 in the experiments. When optimizing LCNet, we assume the parameters of LFRNet have been trained and are kept unchanged. So the optimization problem can be defined as

min_Ω  Distortion + λ × Rate
= λ_c H(N_c(F̂_1^s | Ω), L_c) + λ_f H(N_f({F̂_1^s, F̂_2^s} | Ω), L_f)
  + λ_D L(\overleftarrow{F}(F̂_1^s, ..., F̂_L^s | Ω), I) + λ Σ_{l=1}^{L} R_l(Ω)    (13)

where λ is the Lagrangian multiplier for the rate-distortion trade-off, F̂_l^s is the lossily compressed and reconstructed version of F_l^s and is dependent on the parameters Ω, L(·, ·) measures the signal distortion and λ_D is its weight, and R_l is the rate of F_l^s and is dependent on the parameters Ω. Note that Θ, Θ_c, and Θ_f are omitted in the optimization problem because they all keep unchanged.

The joint optimization of bitrate, signal distortion, and semantic analysis accuracy has twofold benefits. First, it effectively retains the semantic information required by machine vision tasks, thereby ensuring the accuracy of semantics. Second, it can trade off between bitrate and signal distortion by more or less compressing the features that are irrelevant to machine vision tasks.

5.2 Layered Compression Network

Under the guidance of the optimization problem defined in (13), we construct the layered compression network as shown in Fig. 4. The entire LCNet consists of three parts: Encoder, Decoder, and Post-Processing Unit. During the end-to-end optimization, LFRNet is involved but the parameters of LFRNet are fixed.
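A minimal sketch of the training objective in (13) is given below, assuming PyTorch and that the compression model provides, for each layer, the reconstructed feature and an estimated rate R_l in bits (how the rates come out of the entropy models is abstracted away here). The weights follow Sect. 6.2 (λ_c = λ_f = 1, λ_D = 0.5, λ ∈ {0.0067, 0.04, 0.2, 2}); the choice of mean squared error for L(·, ·) is an assumption, since the paper only states that L measures the signal distortion.

```python
import torch.nn.functional as F

def lcnet_loss(rates, coarse_logits, fine_logits, labels_c, labels_f,
               rec_image, orig_image, lam, lam_c=1.0, lam_f=1.0, lam_d=0.5):
    """Rate-distortion-accuracy objective of Eq. (13).

    rates:       list of estimated rates R_l (in bits) from the per-layer entropy models
    rec_image:   image reconstructed by the inverse LFRNet from the decoded features
    """
    acc_term = lam_c * F.cross_entropy(coarse_logits, labels_c) \
             + lam_f * F.cross_entropy(fine_logits, labels_f)       # semantic accuracy
    dist_term = lam_d * F.mse_loss(rec_image, orig_image)           # signal distortion L(.,.), assumed MSE
    rate_term = lam * sum(rates)                                     # lambda * sum_l R_l
    return acc_term + dist_term + rate_term
```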

The Encoder in LCNet has multiple resolution-adaptive feature compression (RAFC) units as well as multiple inter-feature prediction units (PUnits). Each RAFC (encoding) unit deals with one feature F_l^s, either compressing the feature (when l = 1) or compressing the residual of the feature (when l ≥ 2) into a part of the bitstream B_l. Each PUnit predicts the feature F_l^s (when l ≥ 2) from the previously reconstructed features F̂_1^s, ..., F̂_{l-1}^s; the prediction is denoted by F_l^p and the corresponding residual is denoted by F_l^r = F_l^s - F_l^p. The compressed and reconstructed residual is denoted by F̂_l^r. Then, the reconstructed feature is F̂_l^s = F_l^p + F̂_l^r. PUnit effectively reduces the inter-feature redundancy.

The Decoder in LCNet also has multiple RAFC units and multiple PUnits. Each RAFC (decoding) unit decodes the partial bitstream B_l and reconstructs the feature (when l = 1) or the residual (when l ≥ 2). The residual is added to the prediction to obtain the reconstructed feature when l ≥ 2. Note that the Decoder can decode a partial bitstream, which is a nature of scalable coding. For example, if one wants to perform coarse-grained classification, it is sufficient to decode B_1 into F̂_1^s, and then use the task-specific network. The entire bitstream is decoded only when we want to reconstruct the image.

Since the features are lossily compressed, the compression artifacts may deteriorate the quality of the reconstructed image, especially at low bit rates. Thus, we use a post-processing unit to repair the reconstructed image for human vision.

5.3 Basic Modules

5.3.1 Resolution-Adaptive Feature Compression Unit

Our resolution-adaptive feature compression (RAFC) unit is a simplified and adapted version of the network of Non-Local Attention optimization and Improved Context modeling-based image compression (NLAIC) (Chen et al. 2019). NLAIC is based on Ballé's pioneering work on the hyper-prior model (Ballé et al. 2018). Compared to Ballé's work, NLAIC introduces the non-local attention optimization and the improved context model: the attention mechanism together with the non-local operations is used to process multi-layer features adaptively; both the hyper-prior model and the neighboring reconstructed features are used to improve the efficiency of context modeling. Compared to NLAIC, our RAFC has a simplified network structure and adapts some hyper-parameters for different features. Figure 5 shows the core modules of RAFC.

Fig. 5 Left: the proposed resolution adaptive feature compression (RAFC) unit. "Conv5x5 s2" indicates a convolutional layer using a kernel of size 5 × 5 and a stride of 2. "loop_l^m" and "loop_l^h" respectively represent the number of stackings of the specified module. Right: the proposed prediction unit (PUnit). Each ResBlock contains two convolutional layers and a rectified linear unit (ReLU) inside the two, as well as a residual connection, i.e. f(x) = Conv2(ReLU(Conv1(x))) + x

The compression unit for the lth feature is denoted by RAFC_l. RAFC_l^{e} and RAFC_l^{d} represent its encoding part and decoding part, respectively. RAFC_l^{e} is a combination of {E_m, E_h, D_h, Q, AE, CM}. RAFC_l^{d} is a combination of {D_m, D_h, AD, CM}. In the encoder, we first use the main transformation E_m to obtain the representation X_l, then we use the secondary transformation E_h to obtain the representation Z_l. Z_l passes the quantization (Q) and the arithmetic encoder (AE) to become a bitstream B_l^h. B_l^h is decoded by the arithmetic decoder (AD) to obtain Ẑ_l, which is then sent to the secondary inverse-transformation D_h. The output of D_h, together with the neighboring reconstructed features that pass a masked convolution (MaskConv), is used by the context modeling (CM), which provides probability models for the AE to encode the quantized version of X_l into another bitstream B_l^m. So far, we obtain the bitstream B_l composed of B_l^m and B_l^h. Note that the rate of B_l, in other words R_l, can be estimated according to these probability models as calculated in Chen et al. (2019). Finally, we can decode B_l^m to get X̂_l, and use the main inverse-transformation D_m to obtain the reconstruction.

Compared to Chen et al. (2019), our RAFC uses a simplified non-local attention module (sNLAM). Specifically, we remove one residual-connection block (ResBlock) and reduce the number of channels from 192 to 128. This is observed to be efficient for computation and does not incur much compression performance loss.
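The layered encoding and decoding procedure of Sect. 5.2 can be summarized with the following structural sketch. The rafc and punit objects stand for the RAFC units and PUnits described above; their internals (main/secondary transforms, hyper-prior, context model, arithmetic coding) are abstracted away, so this is a sketch of the data flow rather than the actual implementation.

```python
def encode(features, rafc, punit):
    """features = [F_1^s, ..., F_L^s]; returns the scalable bitstream [B_1, ..., B_L]."""
    bitstream, rec = [], []
    for l, f in enumerate(features, start=1):
        if l == 1:
            b, f_hat = rafc[l].encode(f)            # compress F_1^s directly
        else:
            pred = punit[l](rec)                    # F_l^p from previously reconstructed features
            b, r_hat = rafc[l].encode(f - pred)     # compress the residual F_l^r = F_l^s - F_l^p
            f_hat = pred + r_hat                    # reconstructed feature F_l^s_hat = F_l^p + F_l^r_hat
        bitstream.append(b)
        rec.append(f_hat)
    return bitstream

def decode(bitstream, rafc, punit, num_layers):
    """Decode only the first num_layers parts, e.g. num_layers=1 for coarse-grained classification."""
    rec = []
    for l in range(1, num_layers + 1):
        if l == 1:
            f_hat = rafc[l].decode(bitstream[0])
        else:
            pred = punit[l](rec)
            f_hat = pred + rafc[l].decode(bitstream[l - 1])
        rec.append(f_hat)
    return rec
```

Decoding all L layers and applying the inverse LFRNet (followed by the post-processing unit) yields the reconstructed image; decoding only B_1, or B_1 and B_2, serves the coarse-grained and fine-grained classifiers, respectively.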

In addition, all the features use RAFC units, but the features have different resolutions. Accordingly, the RAFC unit for each feature slightly differs from one another, notably in the down-sampling scales of E_m and E_h. The scales are controlled by loop_l^m and loop_l^h. Denote the resolution of F_l^s by (C, S, S), where C is the number of channels and S = 2^j is the width/height. We set loop_l^m to floor((log_2(S) - 1) / 2), and set loop_l^h to 2. In addition, the width/height and number of channels of X_l and Z_l are set to

S_m = S · 2^{-floor((log_2(S) - 1) / 2)}
C_m = (C · S^2) / (4 · S_m^2)
S_h = S_m / 4                                    (14)
C_h = C_m / 1.5

S_m and C_m are for X_l, and S_h and C_h are for Z_l, respectively. Therefore, the size of X_l is 1/4 of the size of the input feature, and the size of Z_l is 1/24 of the size of X_l.

5.3.2 Inter-Feature Prediction Unit

The inter-feature prediction has two modes depending on how the features are decomposed. F_1^s and F_2^s are different channels of the same subband. The following features belong to different subbands. As shown in Fig. 5, PUnit first distinguishes the two cases. If it is channel decomposition, the feature to be predicted and the feature used to predict have the same spatial resolution, so a fully convolutional network is directly used. If it is subband decomposition, the feature used to predict shall be processed by an RFRU so that the spatial resolution is aligned, and then passes through a fully convolutional network.

Each PUnit has a lightweight six-layer CNN. Its first convolutional layer uses kernel size 5 × 5 and 128 channels for redundant representation. The following four layers are organized into two ResBlocks. Each ResBlock has two convolutional layers with kernel size 3 × 3 and a ReLU between the two, and a residual connection. The last convolutional layer uses kernel size 1 × 1 and 1 channel to output.

5.3.3 Post-Processing Unit

A post-processing unit is added to filter the entire reconstructed image, so as to reduce compression artifacts and enhance the signal reconstruction quality. In particular, the post-processing unit uses the same network structure as the PUnit.

6 Experiments

6.1 Tasks and Datasets

In this paper, we consider coarse-to-fine image classification as the targeted machine vision tasks, where one image can be classified into a coarse category (the number of optional categories is small), or a fine category (the number of optional categories is large). Note that our scheme is to provide a scalable bitstream, which can be partially decoded to obtain features that then serve the classification. For the classification tasks, one may imagine a compression scheme that performs classification at the encoder side and transmits only the classification results to the decoder side. This scheme is feasible, but may have severe limitations: the transmitted data may be useless if the decoder side wants to perform another classification task with a different category set; the encoder side may not have sufficient computational resources to perform the classification. In view of these limitations, we do not experimentally compare with the imaginary scheme of "transmitting classification results."

There are several datasets that support our designed study. For example, CUB200-2011 (Wah et al. 2011) is a bird image dataset consisting of 11,788 images, where 5994 images are for training and 5794 are for test. All the images are divided into 200 categories. According to ornithology systematics, the 200 categories can be merged into 122 coarse categories or 37 coarser categories. For CUB200-2011, we use coarse-grained to refer to 37-category classification, fine-grained to refer to 200-category classification, and intermediate-grained to refer to 122-category classification. For another example, FGVC-Aircraft (Maji et al. 2013) is an aircraft image dataset consisting of 10,000 images, where 6667 images are for training and 3333 are for test. The images can be divided by manufacturer, family, and variant into 30, 70, and 100 categories, respectively. For FGVC-Aircraft, we use coarse-grained, intermediate-grained, and fine-grained to refer to the 30-category, 70-category, and 100-category classification, respectively. All these category labels are available in the datasets, so the classification accuracy can be directly calculated. Note that our proposed LFRNet and LCNet are optimized only for coarse-grained and fine-grained classification, but we will test their generalization ability for intermediate-grained classification.

Since the content of the above two datasets is relatively homogeneous, we use the ILSVRC dataset (Russakovsky et al. 2015) (often known as ImageNet) for pre-training of LFRNet. ILSVRC contains 1000 categories, and each category has about 1000 images.

All the images in the used datasets have the resolution of 256 × 256, or have been resized to this resolution, to ensure a fair comparison with other methods. Fig. 6 presents some images of the used datasets.

We evaluate the proposed method and compare it with the others in both semantic analysis accuracy and compression efficiency. Top-1 accuracy and top-5 accuracy are used to evaluate the classification results. Compression efficiency is evaluated from three indicators: compression ratio (or bitrate in bits per pixel, bpp), peak signal-to-noise ratio (PSNR), and multi-scale structural similarity (MS-SSIM).

(a) ILSVRC (b) CUB200-2011 (c) FGVC-Aircraft

Fig. 6 Exemplar images of the used datasets. ILSVRC (often known as ImageNet) is used for network pre-training. Both CUB200-2011 and
FGVC-Aircraft are used for coarse-to-fine image classification

Table 1 Training hyper-parameters 6.3 Performance of Semantics-to-Signal Scalable


Network LFRNet LCNet Compression

Epoch number 128 128 Four compression models are trained by setting different val-
Batch size 256 24 ues for λ. These models are referring to the same LFRNet
Initial lr 1e−2 1e−5 model. Table 2 shows the average bit rate of compressed
Step lr 0.1×/32 epochs 0.75×/32 epochs images with different models. On CUB200-2011, the max-
Weight decay 0.0005 0.0005 imal average compression ratio is 471 (0.051 bpp) and the
Momentum 0.9 0.9 minimal one is 34 (0.711 bpp). On FGVC-Aircraft, the max-
“lr” is short for learning rate imal one is 889 (0.027 bpp) and the minimal one is 51 (0.474
bpp).

6.3.1 Semantics-to-Signal Scalability

First, we examine the scalable coding functionality of the


proposed method. Figure 7 shows the partial decoding results
6.2 Experimental Settings on CUB200-2011 using the model with λ = 0.0067. It also
displays some reconstructed images of partial decoding, tak-
Our implementation is based on PyTorch (Paszke et al. ing one picture in the CUB200-2011 test set as an example.
2017). All the training and test are conducted on a cluster of Clearly, with more bits decoded, the classification accuracy
GTX1080Ti graphics processing units (GPUs). Four GPUs and PSNR both increase, and the visual quality of the recon-
are used for fast training. structed images becomes better. Also, after a certain rate
In the training stage, we use the stochastic gradient descent (e.g. 0.011 bpp for coarse-grained), the classification accu-
algorithm for gradient back-propagation and parameter opti- racy becomes stable. This rate is called “critical rate” for
mization. The weights in (13) are set to λc = λ f = 1, the machine vision task. Obviously, the critical rate for fine-
λ D = 0.5, and λ ∈ {0.0067, 0.04, 0.2, 2}, respectively. Four grained classification (∼ 0.02 bpp) is higher than that for
values are used for λ to achieve different bit rates. Com- coarse-grained, but it is still far less than the rate needed for
pared with several neural image compression networks such image reconstruction. As shown in Fig. 7, to reconstruct visu-
as Ballé et al. (2018), Chen et al. (2019) that use patches ally pleasing image, the bitrate shall be higher than 0.2 bpp.
for training, our networks are trained with complete images In this example, the bits required for the classification tasks
due to the considered machine vision tasks–image classifica- occupy less than 10% of the entire bitstream that provides
tion. Note that our LFRNet and LCNet are separately trained: visually acceptable image reconstruction.
we train LFRNet first, then fix the parameters of LFRNet In Fig. 8, we randomly select two images from the
and train LCNet. Our LFRNet is pre-trained on ILSVRC CUB200-2011 test set and the FGVC-Aircraft test set,
and then further trained on either CUB200-2011 or FGVC- respectively, and display some reconstructed images of par-
Aircraft. LCNet is not pre-trained. Table 1 summarizes the tial decoding. When the decoded bit rate is very low, the
hyper-parameters for network training. reconstructed images are hard to recognize, but human may

Table 2 Classification accuracy results
Dataset Task Bitrate (bpp) Top-1 Accuracy (five columns) Top-5 Accuracy (five columns)
Columns under each accuracy group: JPEG 2000 / Ours (Image) / BPG / NLAIC / Ours (Feature)

CUB200-2011 Coarse-grained Orig. 88.9 88.9 88.9 88.9 89.3 98.7 98.7 98.7 98.7 98.4
0.711 83.3 84.7 85.0 87.0 87.9 97.1 97.4 97.8 98.2 98.1
0.343 74.2 77.0 76.0 81.1 87.9 93.8 94.7 95.1 94.5 98.2
0.153 57.8 58.5 62.5 63.1 87.4 85.8 85.3 88.9 88.2 98.1
0.051 26.6 30.7 40.6 – 87.6 62.5 59.6 73.0 – 98.0
Intermediate-grained Orig. 81.9 81.9 81.9 81.9 80.7 96.4 96.4 96.4 96.4 96.0
0.711 73.0 76.2 76.5 79.2 77.2 92.3 93.9 94.1 95.1 94.9
0.343 61.3 67.2 65.4 71.0 76.8 85.6 89.2 88.9 92.1 94.5
0.153 40.9 46.7 48.7 50.2 76.4 69.2 74.4 78.2 77.5 94.7
0.051 12.2 20.8 23.9 – 76.8 33.5 43.4 50.9 – 94.5
Fine-grained Orig. 75.8 75.8 75.8 75.8 75.5 94.0 94.0 94.0 94.0 93.3

0.711 68.6 69.9 70.2 72.4 72.6 90.4 91.7 91.4 92.9 92.3
0.343 57.0 61.5 60.2 65.5 72.4 83.4 87.1 86.4 89.3 92.3
0.153 36.1 41.8 46.3 44.6 72.2 66.5 70.4 75.7 73.5 92.3
0.051 9.9 16.0 22.5 – 72.1 29.6 36.4 48.3 – 92.4
FGVC-Aircraft Coarse-grained Orig. 93.3 93.3 93.3 93.3 92.6 98.5 98.5 98.5 98.5 98.6
0.474 82.3 83.4 85.5 86.4 90.9 95.7 96.1 96.5 96.7 98.1
0.206 58.9 70.2 75.0 79.6 90.4 85.2 88.8 92.4 94.0 98.1
0.093 22.4 41.0 49.4 57.2 90.9 49.4 68.0 77.4 79.9 98.0
0.027 6.7 10.2 16.1 – 89.6 21.2 28.3 38.7 – 97.5
Intermediate-grained Orig. 89.7 89.7 89.7 89.7 88.4 96.8 96.8 96.8 96.8 96.3
0.474 75.6 79.6 80.1 81.3 85.2 92.6 94.6 94.6 95.0 95.1
0.206 50.2 69.7 70.8 75.9 85.1 78.3 89.1 91.4 93.2 95.0
0.093 18.5 46.2 49.7 57.4 84.3 43.4 73.8 77.9 81.3 94.6
0.027 5.3 12.0 14.7 – 81.3 18.0 32.9 36.9 – 94.2
Fine-grained Orig. 89.6 89.6 89.6 89.6 89.0 97.0 97.0 97.0 97.0 97.0
0.474 76.7 80.5 80.7 82.0 87.1 92.7 94.1 94.5 94.9 96.3
0.206 55.0 71.2 74.1 78.7 86.6 80.5 90.3 92.4 93.7 95.9
0.093 20.4 47.7 56.0 61.5 86.4 43.8 73.0 80.8 83.5 96.0
0.027 4.3 11.7 18.3 – 85.1 14.8 34.4 41.0 – 95.5
“Ours (Feature)” and “Ours (Image)” respectively indicate decoded feature-based and reconstructed image-based classification results. “Orig.” indicates using original uncompressed images, whose
results are underlined to distinguish
Bold indicates the best accuracy for each bitrate


Fig. 7 Quantitative evaluation of the semantics-to-signal scalability (λ = 0.0067). The horizontal axis represents bitrate (bits per pixel, bpp), and the left and right vertical axes represent top-1 classification accuracy (coarse-grained, intermediate-grained, and fine-grained) and reconstruction quality in PSNR (dB), respectively. The classification accuracy is not displayed in the high bitrate range because it becomes stable after the indicated (by the green arrow) point. The reconstruction quality is not displayed in the (extremely) low bitrate range because it is too low to be useful (Color figure online)

In Fig. 8, we randomly select two images from the CUB200-2011 test set and the FGVC-Aircraft test set, respectively, and display some reconstructed images of partial decoding. When the decoded bit rate is very low, the reconstructed images are hard to recognize, but humans may still be able to identify the shape of the object. With more bits decoded, colors, edges, and textures become more and more clear. When all the bits are decoded, the reconstructed images appear similar to the original images.

6.3.2 Compression Performance

We choose three image compression methods for comparison. The first is JPEG2000, which is the widely used standard for scalable image compression. The second is BPG, which represents the state-of-the-art of non-learned compression. We use the default configuration of BPG, that is, to compress with the YUV420 format. The third is NLAIC, a CNN-based end-to-end learned image compression method proposed in Chen et al. (2019). We choose NLAIC because our compression network also borrows some ideas from it. For JPEG2000 and BPG, we have adjusted the quantization parameter to achieve compression ratios similar to our method. For NLAIC, we use pre-trained models, so the bit rates are not well aligned to those of the other methods.

First, we show the classification results. For CUB200-2011 and FGVC-Aircraft, we respectively train a VGG16 model with uncompressed images, and use the same model on images reconstructed by the different methods (JPEG2000, ours, BPG, NLAIC) at different bit rates. These results are summarized in Table 2. In addition, we also report the classification results of our method not using reconstructed images but using features that are decoded from the partial bitstream. It can be observed that with the decrease of bit rate, the classification accuracy on reconstructed images usually drops significantly. However, our results using features are quite stable across different bit rates. For example, on CUB200-2011, for coarse-grained classification, when the bitrate is around 0.051 bpp, the top-1 accuracy on JPEG2000 reconstructed images is 26.6%, but our result using features is 87.6%. Note that the classifier for features is a three-layer fully-connected network and it is not stronger than VGG16. Therefore, the results show that compressing and transmitting features for machine vision tasks is a good choice at low bit rates.

Second, we compare the results of different methods on PSNR, MS-SSIM, and bitrate. Figure 9 shows the rate-PSNR/MS-SSIM curves of our method and JPEG2000, BPG, and NLAIC. Our proposed method significantly surpasses JPEG2000 in both PSNR and MS-SSIM. In addition, our method achieves MS-SSIM comparable to BPG and NLAIC at high bit rates, but does not catch up with BPG and NLAIC in PSNR. We believe this result is mainly due to the lower compression efficiency of our entropy coding method. JPEG2000 compresses all the subbands simultaneously, and it has dedicated, highly efficient coding tools like the zero-tree (Taubman 2000). BPG has an advanced context-adaptive binary arithmetic coding (CABAC) engine for entropy coding (Marpe et al. 2003). NLAIC compresses all the features and optimizes the uniform entropy coder in an end-to-end fashion. In our scheme, the features are compressed one by one using prediction from each to the next. The correlation among non-adjacent features is not fully exploited. As a result, our method performs less well especially in PSNR at high bit rates, where entropy coding has a big impact.
0.0392/16.07 0.0785/17.54 0.1569/21.81 0.2071/24.04 0.2354/25.08 0.3138/28.91 Original

0.0398/13.72 0.0795/13.97 0.1591/19.46 0.2100/21.72 0.2386/23.54 0.3182/28.61 Original

0.0024/12.49 0.0049/13.99 0.0098/14.10 0.0129/18.90 0.0146/20.00 0.0195/23.41 Original

0.0025/12.33 0.0050/13.15 0.0101/13.41 0.0133/19.24 0.0151/19.60 0.0201/22.72 Original

Fig. 8 Reconstructed images of our method. Numbers shown below each image indicate bitrate and PSNR. The top two rows correspond to
λ = 0.04 and the bottom two rows correspond to λ = 0.2, respectively

Fig. 9 Rate-distortion curves (PSNR and MS-SSIM versus bpp) of the proposed method compared to JPEG2000, BPG, and NLAIC (Chen et al. 2019). a, b Correspond to the CUB200-2011 dataset. c, d Correspond to the FGVC-Aircraft dataset

Nonetheless, our method performs well in MS-SSIM at high bit rates; this is probably due to the effectiveness of the multi-scale decomposition of our method.

6.3.3 Complexity Analysis

We report the average running time of each module in Table 3. The time was measured on a GTX1080Ti GPU. Note that our scheme enables partial decoding, so we report the time needed for coarse-grained classification, fine-grained classification, and image reconstruction, respectively. It can be observed that the slowest module in our scheme is the decoding module, which is attributed to the autoregressive context model (Chen et al. 2019). In the future, we may simplify the context model to accelerate the decoding process. It is also noticeable that, excluding the decoding, the analysis modules at the decoder side are computationally efficient, reducing computational time by more than 90% compared with the image-based classification.

6.4 Performance of LFRNet

LFRNet converts an image into hierarchical feature representations in a revertible manner.

Table 3 Average running time of each module for one image
Module                                   Running time (s)
Encoder
LFRNet 0.0099
Encoding of F1s 0.0264
Encoding of F2s 0.0241
Encoding of the other features 0.1931
Total time 0.2535
Decoder
Decoding of F̂1s t1 = 0.8244
Decoding of F̂2s t2 = 2.3098
Decoding of the other features t3 = 47.3683
Analysis of F̂1s t4 = 0.0002
Analysis of { F̂1s , F̂2s } t5 = 0.0003
Inverse LFRNet t6 = 0.0099
Total time for coarse-grained t1 + t4 = 0.8246
Total time for fine-grained t1 + t2 + t5 = 3.1345
Total time for image reconstruction t1 + t2 + t3 + t6 = 50.5124
Image-based classification
Coarse-grained 0.0036
Fine-grained 0.0043

6.4 Performance of LFRNet

LFRNet converts an image into hierarchical feature representations in a revertible manner. Partial features may serve as a compact representation of the information needed for a machine vision task, and the image can be perfectly reconstructed from all of the features. In this section, we ask: how many features are sufficient for a given task? We provide empirical results to address this question. In addition, we compare with other networks, notably VGG16 (Simonyan and Zisserman 2014) and i-RevNet (Jacobsen et al. 2018).

6.4.1 Performance of Pre-trained Models

We have pre-trained LFRNet on the ILSVRC training set. For fair comparison, we similarly pre-train VGG16 and i-RevNet on the same training set. Then we test the three models on the ILSVRC validation set. The tested classification accuracy, as well as the number of parameters, is shown in Table 4. The three models achieve comparable top-1 and top-5 accuracy, but our LFRNet greatly reduces the number of parameters. Note that our LFRNet gradually discards information in the network, which is quite different from VGG16 and i-RevNet. This also confirms that the information required for classification may be compactly represented by a small set of features.

Table 4 Classification accuracy on ILSVRC dataset

Network    Top-1 Acc.     Top-5 Acc.     # Parameters
VGG16      73.36          91.51          1.38e8
i-RevNet   74.02 (↑0.66)  91.59 (↑0.08)  1.25e8 (↓9%)
Ours       72.84 (↓0.52)  90.32 (↓1.19)  0.54e8 (↓61%)

6.4.2 Performance in Coarse-to-Fine Classification

Based on the pre-trained LFRNet model and a given dataset/task (e.g. CUB200-2011, coarse-grained classification), we want to identify how many features are sufficient for the task. Here we perform a grid search using k channels of F_K^m, where k ∈ {8, 16, 24, ..., 96}. For each k we fine-tune LFRNet with the given dataset/task, and draw the curves of top-1 and top-5 accuracy with respect to k in Fig. 10. It can be observed that the accuracy becomes stable once k exceeds a "critical number." According to these results, we use 24 channels for coarse-grained classification and 96 channels for fine-grained classification, respectively. In other words, F_1^s has 24 channels, and F_2^s has 72 channels.

Based on the above settings, we now fine-tune LFRNet for the two classification tasks simultaneously to optimize (12). For comparison, we also fine-tune VGG16 and i-RevNet, but for coarse-grained and fine-grained classification individually. Table 5 presents the classification results of the different fine-tuned models on the corresponding test set. It is observed that our method achieves comparable classification accuracy while using far fewer features, again demonstrating the advantage of the compact representation.
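As an illustration of the joint coarse-to-fine fine-tuning described above, the following sketch attaches one classification head to the first 24 channels (F_1^s) and one to all 96 channels ({F_1^s, F_2^s}) and sums the two cross-entropy losses. The head architectures, the placeholder class counts, and the equal-weight sum standing in for objective (12) are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTaskHeads(nn.Module):
    """Coarse head reads only F_1^s (24 channels); fine head reads all 96
    channels, so the coarse features are a strict subset of the fine ones."""
    def __init__(self, n_coarse: int, n_fine: int):
        super().__init__()
        self.coarse = nn.Sequential(nn.Flatten(), nn.Linear(24 * 8 * 8, n_coarse))
        self.fine = nn.Sequential(nn.Flatten(), nn.Linear(96 * 8 * 8, n_fine))

    def forward(self, feats: torch.Tensor):
        return self.coarse(feats[:, :24]), self.fine(feats)

heads = TwoTaskHeads(n_coarse=20, n_fine=200)  # placeholder class counts
feats = torch.randn(4, 96, 8, 8)               # stand-in for LFRNet features
y_coarse = torch.randint(0, 20, (4,))
y_fine = torch.randint(0, 200, (4,))
logits_c, logits_f = heads(feats)
loss = F.cross_entropy(logits_c, y_coarse) + F.cross_entropy(logits_f, y_fine)
loss.backward()  # in practice the gradients also flow back into LFRNet
```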


Table 5 Classification accuracy results of uncompressed features

Dataset        Method     Coarse-grained classification              Fine-grained classification
                          Feature       Top-1 Acc.   Top-5 Acc.      Feature       Top-1 Acc.   Top-5 Acc.
CUB200-2011    VGG16      (512, 8, 8)   88.9         98.7            (512, 8, 8)   75.8         94.0
               i-RevNet   (1024, 8, 8)  89.7 (↑0.8)  98.5 (↓0.2)     (1024, 8, 8)  76.7 (↑0.9)  93.3 (↓0.7)
               Ours       (24, 8, 8)    89.3 (↑0.4)  98.4 (↓0.3)     (96, 8, 8)    75.5 (↓0.3)  93.3 (↓0.7)
FGVC-Aircraft  VGG16      (512, 8, 8)   93.3         98.5            (512, 8, 8)   89.6         97.0
               i-RevNet   (1024, 8, 8)  92.1 (↓1.2)  98.5 (↑0.0)     (1024, 8, 8)  90.0 (↑0.4)  97.2 (↑0.2)
               Ours       (24, 8, 8)    92.6 (↓0.7)  98.6 (↑0.1)     (96, 8, 8)    89.0 (↓0.6)  97.0 (↑0.0)

The dimensions of the used features are shown as (512, 8, 8), which is short for 512 × 8 × 8, i.e. 512 feature maps with spatial resolution 8 × 8

6.4.3 Performance in Intermediate-Grained Classification

We go one step further and ask: can LFRNet perform well on machine vision tasks it was not trained for? To answer this question, we introduce a third classification task, namely the intermediate-grained classification mentioned before. Note that intermediate-grained classification is not considered during the LFRNet training.

Again we need to identify how many features are sufficient for this task. We now fix all the parameters of LFRNet, set different values of k, and train the classifier (also a three-layer fully-connected network) for the intermediate-grained classification task. By grid search, we found k = 72 to be appropriate, as shown in Fig. 10; a sketch of this setup is given after Table 6.

For comparison, we also fine-tune VGG16 and i-RevNet for the intermediate-grained classification. The results are given in Table 6. It is observed that LFRNet still achieves very competitive results, especially taking into account that VGG16 and i-RevNet are specifically trained for this task. These results demonstrate that LFRNet has extracted compact features that work well for the trained tasks and meanwhile generalize to similar tasks.

[Figure 10: two panels plotting top-1 and top-5 accuracy against the number of channels of F_K^m for coarse-grained, intermediate-grained, and fine-grained classification.]

Fig. 10 Relation between classification accuracy and number of channels of F_K^m. In each curve, there is a solid marker showing that the top-1 accuracy becomes stable after that point (also indicated by a green arrow) (Color figure online)

Table 6 Classification accuracy results of uncompressed features for the intermediate-grained classification task

Dataset        Method     Top-1 Acc.   Top-5 Acc.
CUB200-2011    VGG16      81.9         96.4
               i-RevNet   84.2 (↑2.3)  96.4 (↑0.0)
               Ours       80.7 (↓1.2)  96.0 (↓0.4)
FGVC-Aircraft  VGG16      89.7         96.8
               i-RevNet   89.2 (↓0.5)  97.1 (↑0.3)
               Ours       88.4 (↓1.3)  96.3 (↓0.5)

Note that in our method, the features (72 × 8 × 8) are a subset of the features (96 × 8 × 8) prepared for the fine-grained classification. Nonetheless, the intermediate-grained classification was not considered during the training
arrow) (Color figure online)

7 Conclusion

In this paper, we have presented a semantics-to-signal scalable image compression framework with learned revertible representations. We have proposed LFRNet to learn effective and efficient features oriented to machine vision tasks. We have also proposed LCNet to compress the features into a scalable bitstream, so as to achieve a joint optimization of compression ratio, signal reconstruction quality, and semantic analysis accuracy. As a concrete example of human–machine collaborative judgment, we study coarse-to-fine image classification and image reconstruction as the targets for image compression. Our experimental results have verified the effectiveness of the proposed method, which outperforms JPEG2000 significantly.

In the future, our work can be extended in several directions. First, we may further investigate revertible networks to enhance the feature learning capabilities. Second, we may design advanced methods for more efficient compression of the features. Third, we may consider video coding in a similar way, but motion will need to be handled carefully.

Acknowledgements This work was supported by the National Key Research and Development Program of China under Grant 2018YFA0701603, by the Natural Science Foundation of China under Grant 61772483, and by the Fundamental Research Funds for the Central Universities under Contract WK3490000005. We acknowledge the support of the GPU cluster built by MCC Lab of the School of Information Science and Technology of USTC.
References

Akansu, A. N., Haddad, P. A., Haddad, R. A., & Haddad, P. R. (2001). Multiresolution signal decomposition: Transforms, subbands, and wavelets. New York: Academic Press.
Akansu, A. N., & Liu, Y. (1991). On-signal decomposition techniques. Optical Engineering, 30(7), 912–921.
Ballé, J., Laparra, V., & Simoncelli, E. P. (2016). End-to-end optimized image compression. Technical Report. arXiv preprint arXiv:1611.01704
Ballé, J., Minnen, D., Singh, S., Hwang, S. J., & Johnston, N. (2018). Variational image compression with a scale hyperprior. Technical Report. arXiv preprint arXiv:1802.01436
Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1), 7–39.
Chen, T., Liu, H., Ma, Z., Shen, Q., Cao, X., & Wang, Y. (2019). Neural image compression via non-local attention optimization and improved context modeling. Technical Report. arXiv preprint arXiv:1910.06244
Christopoulos, C., Skodras, A., & Ebrahimi, T. (2000). The JPEG2000 still image coding system: An overview. IEEE Transactions on Consumer Electronics, 46(4), 1103–1127.
Dejean-Servières, M., Desnos, K., Abdelouahab, K., Hamidouche, W., Morin, L., & Pelcat, M. (2017). Study of the impact of standard image compression techniques on performance of image classification with a convolutional neural network. Technical Report. hal-01725126. https://hal.archives-ouvertes.fr/hal-01725126
Dodge, S., & Karam, L. (2016). Understanding how image quality affects deep neural networks. In International conference on quality of multimedia experience (pp. 1–6). IEEE.
Duan, L., Liu, J., Yang, W., Huang, T., & Gao, W. (2020). Video coding for machines: A paradigm of collaborative compression and intelligent analytics. IEEE Transactions on Image Processing, 29, 8680–8695.
Gomez, A. N., Ren, M., Urtasun, R., & Grosse, R. B. (2017). The reversible residual network: Backpropagation without storing activations. In Advances in neural information processing systems (pp. 2214–2224).
Goutsias, J., & Heijmans, H. J. (2000). Nonlinear multiresolution signal decomposition schemes (I) Morphological pyramids. IEEE Transactions on Image Processing, 9(11), 1862–1876.
He, C., Shi, Z., Qu, T., Wang, D., & Liao, M. (2019). Lifting scheme-based deep neural network for remote sensing scene classification. Remote Sensing, 11(22), 2648.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).
Heijmans, H. J., & Goutsias, J. (2000). Nonlinear multiresolution signal decomposition schemes (II) Morphological wavelets. IEEE Transactions on Image Processing, 9(11), 1897–1913.
Hu, Y., Yang, S., Yang, W., Duan, L. Y., & Liu, J. (2020). Towards coding for human and machine vision: A scalable image coding approach. In ICME (pp. 1–6). IEEE.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In CVPR (pp. 4700–4708).
Jacobsen, J. H., Smeulders, A., & Oyallon, E. (2018). i-RevNet: Deep invertible networks. Technical Report. arXiv preprint arXiv:1802.07088
Johnston, P., Elyan, E., & Jayne, C. (2018). Spatial effects of video compression on classification in convolutional neural networks. In IJCNN (pp. 1–8).
Kwaśnicka, H., & Jain, L. C. (2018). Bridging the semantic gap in image and video analysis. Berlin: Springer.
Latif, A., Rasheed, A., Sajid, U., Ahmed, J., Ali, N., Ratyal, N. I., et al. (2019). Content-based image retrieval and feature extraction: A comprehensive review. Mathematical Problems in Engineering, 2019, 1–21.
Lee, J., Cho, S., & Beack, S. K. (2018). Context-adaptive entropy model for end-to-end optimized image compression. Technical Report. arXiv preprint arXiv:1809.10452
Li, M., Zhang, K., Zuo, W., Timofte, R., & Zhang, D. (2020). Learning context-based non-local entropy modeling for image compression. Technical Report. arXiv preprint arXiv:2005.04661
Lo, S. C., Li, H., & Freedman, M. T. (2003). Optimization of wavelet decomposition for image compression and feature preservation. IEEE Transactions on Medical Imaging, 22(9), 1141–1151.
Ma, H., Liu, D., Xiong, R., & Wu, F. (2019). iWave: CNN-based wavelet-like transform for image compression. IEEE Transactions on Multimedia, 22, 1667–1679.
Ma, H., Liu, D., Yan, N., Li, H., & Wu, F. (2020). End-to-end optimized versatile image compression with wavelet-like transform. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2020.3026003
Ma, S., Zhang, X., Wang, S., Zhang, X., Jia, C., & Wang, S. (2018). Joint feature and texture coding: Toward smart video representation via front-end intelligence. IEEE Transactions on Circuits and Systems for Video Technology, 29(10), 3095–3105.
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. Technical Report. arXiv preprint arXiv:1306.5151
Mallat, S. G. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693.
Marpe, D., Schwarz, H., & Wiegand, T. (2003). Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7), 620–636.
Minnen, D., Ballé, J., & Toderici, G. D. (2018). Joint autoregressive and hierarchical priors for learned image compression. In Advances in neural information processing systems (pp. 10771–10780).
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch. Technical Report. OpenReview.net. https://openreview.net/forum?id=BJJsrmfCZ
Poyser, M., Atapour-Abarghouei, A., & Breckon, T. P. (2020). On the impact of lossy image and video compression on the performance of deep convolutional neural network architectures. Technical Report. arXiv preprint arXiv:2007.14314
Rodriguez, M. X. B., Gruson, A., Polania, L., Fujieda, S., Prieto, F., Takayama, K., & Hachisuka, T. (2020). Deep adaptive wavelet network. In IEEE Winter conference on applications of computer vision (pp. 3111–3119).
Ruder, S. (2017). An overview of multi-task learning in deep neural networks. Technical Report. arXiv preprint arXiv:1706.05098
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Shwartz-Ziv, R., & Tishby, N. (2017). Opening the black box of deep neural networks via information. Technical Report. arXiv preprint arXiv:1703.00810
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. Technical Report. arXiv preprint arXiv:1409.1556
Sweldens, W. (1998). The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2), 511–546.
Taubman, D. (2000). High performance scalable image compression with EBCOT. IEEE Transactions on Image Processing, 9(7), 1158–1170.
Tishby, N., Pereira, F. C., & Bialek, W. (2000). The information bottleneck method. Technical Report. arXiv preprint arXiv:physics/0004057
Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In IEEE information theory workshop (pp. 1–5).
Toderici, G., O'Malley, S. M., Hwang, S. J., Vincent, D., Minnen, D., Baluja, S., Covell, M., & Sukthankar, R. (2015). Variable rate image compression with recurrent neural networks. Technical Report. arXiv preprint arXiv:1511.06085
Torfason, R., Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., & Van Gool, L. (2018). Towards image understanding from deep compression without decoding. Technical Report. arXiv preprint arXiv:1803.06131
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
Wang, S., Wang, S., Zhang, X., Wang, S., Ma, S., & Gao, W. (2019). Scalable facial image compression with deep feature reconstruction. In ICIP (pp. 2691–2695). IEEE.
Xia, S., Liang, K., Yang, W., Duan, L. Y., & Liu, J. (2020). An emerging coding paradigm VCM: A scalable coding approach beyond feature and signal. In ICME (pp. 1–6). IEEE.
Yan, N., Liu, D., Li, H., & Wu, F. (2020). Semantically scalable image coding with compression of feature maps. In ICIP (pp. 3114–3118). IEEE.
Zhang, X., Ma, S., Wang, S., Zhang, X., Sun, H., & Gao, W. (2016). A joint compression scheme of video feature descriptors and visual content. IEEE Transactions on Image Processing, 26(2), 633–647.
Zhao, J., Peng, Y., & He, X. (2020). Attribute hierarchy based multi-task learning for fine-grained image classification. Neurocomputing, 395, 150–159.
Zhao, Z. Q., Zheng, P., Xu, S. T., & Wu, X. (2019). Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212–3232.
Zhou, L., Sun, Z., Wu, X., & Wu, J. (2019). End-to-end optimized image compression with attention mechanism. In CVPR workshops (pp. 1–4).

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
