
Received April 21, 2022, accepted May 2, 2022, date of publication May 9, 2022, date of current version May 13, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3173256

Variable-Rate Deep Image Compression With Vision Transformers

BINGLIN LI1, JIE LIANG1, (Senior Member, IEEE), AND JINGNING HAN2, (Senior Member, IEEE)
1 School of Engineering Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
2 WebM Codec Team, Google LLC, Mountain View, CA 94043, USA
Corresponding author: Jie Liang ([email protected])
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grant RGPIN-2020-04525,
and in part by the Google Chrome University Research Program.

ABSTRACT Recently, vision transformers have been applied to many computer vision problems due to their ability to learn long-range dependencies. However, they have not been thoroughly explored in image compression. We propose a patch-based learned image compression network that incorporates vision transformers. The input image is divided into patches before being fed to the encoder, and the patches reconstructed by the decoder are merged to form a complete image. Different kinds of transformer blocks (TransBlocks) are applied to meet the various requirements of the subnetworks. We also propose a transformer-based context model (TransContext) to facilitate coding based on previously decoded symbols. Since the computational complexity of the attention mechanism in transformers is a quadratic function of the sequence length, we partition the feature tensor into segments and apply the transformer within each segment to save computational cost. To alleviate compression artifacts, we use overlapping patches and apply an existing deblocking network to further remove the artifacts. Finally, a residual coding scheme is adopted to obtain compression results at variable bit rates. We show that our patch-based learned image compression with transformers obtains a 0.75 dB improvement in PSNR at 0.15 bpp over the prior variable-rate compression work on the Kodak dataset. When using the residual coding strategy, our framework maintains good PSNR performance and is comparable to BPG420. For MS-SSIM, we obtain higher results than BPG444 over a range of bit rates (by 0.021 at 0.21 bpp) and than other variable-rate learned image compression models at low bit rates.

INDEX TERMS Learned image compression, transformer, variable-rate.

I. INTRODUCTION
Recently, there has been a line of research [1]–[8] on deep image compression. The autoencoder approaches [6]–[8] with joint autoregressive and hierarchical hyperprior models have become the mainstream practice for learning-based image compression. Although these methods show promising compression performance compared with conventional image codecs, there are two main drawbacks in real applications.

Firstly, a separate model needs to be trained for each bit rate, which increases the coding complexity. To this end, variable-bitrate image compression models [9]–[11] have been developed to cover various bit rates with one trained model. In particular, in [11], a layered coding scheme is developed, where the base layer feature map is obtained by a deep learning (DL) network, and the residual between the input and the base layer reconstruction is coded by a traditional method to cover more bit rates. Motivated by [11], in this paper, we propose a more effective learned image compression framework by incorporating transformers in the base layer and apply residual coding to achieve compression across a range of bit rates with one single model. In [11], only eight feature maps are used for the compact representation, which limits the learning capability of the base layer. In our framework, a hyperprior network [12] is adopted to estimate the distribution parameters of the quantized feature representation, so that the channel dimension of the representation can be set larger in the base layer. Experimental results show a 0.75 dB improvement in PSNR at 0.15 bpp over [11]. When using the residual coding strategy, our framework maintains good PSNR performance and is comparable to BPG420, whereas the performance of [11] is lower than BPG420.

The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.


Secondly, although the masked convolutional context model in [6]–[8] achieves better compression performance than the scale-only hyperprior model [12], it brings in extra computational overhead because of the sequential decoding process. Besides, the context information is constrained to 5 × 5 windows. In this paper, we leverage transformers to capture long-range dependency and use the masked multi-head attention module of transformers during training to guarantee the causal relationship, which we call the TransContext model. Equipped with local transformers, we divide the latent representation into segments and apply the transformer within each segment. Each segment can thus be processed in parallel to reduce the total decoding time.

In our framework, only local transformer blocks (TransBlocks) are adopted, since the computational complexity of the attention mechanism in transformers is a quadratic function of the sequence length. Similar to [13], we divide the input image into image patches. In this way the spatial size of the input can be reduced and we can feed the patch features to transformer blocks in the encoder. At the output of the decoder, each image patch is reconstructed based on the information from the local patch. All the patches are merged into a complete image. This can produce blocking artifacts at the patch boundaries [14]. To alleviate these artifacts, we use image patches with a small portion of overlap as the input to the network and average the values at positions where two outputs are predicted. We find this strategy can improve the compression performance to some extent. To further remove the artifacts, a post-processing network [15] is applied.

When the network is designed with fully convolutional layers, the size of the input image can be arbitrary. Previous works use the entire image as input to the compression network; patch-based learned image compression has not been explored. Although patch-based image compression may result in blocking artifacts as in JPEG [14], it has its advantages. In [16], inpainting techniques are embedded into a patch-based image compression framework to improve the compression performance. In our work, image patches are used as input to cope with the demanding computational resources required by high-resolution inputs to transformers.

Our contributions include: 1) We build an effective patch-based learned image compression network with vision transformers in the base layer, based on [11], for variable-rate deep image compression. To alleviate the compression artifacts resulting from patch reconstructions, we partition the image into overlapping patches and utilize an existing deblocking network to further remove the blocking artifacts. 2) Different kinds of transformer blocks are applied to meet the various requirements of the subnetworks. 3) We propose a transformer-based context model to facilitate the Gaussian parameter predictions based on previously decoded symbols. It is performed on segments of the quantized latent representation and thus can reduce the total decoding time compared with the masked convolution context model in [6].

The rest of the paper is organized as follows. In Section II, we discuss related work on learned image compression and transformers in vision applications. Section III introduces our framework and explains its building blocks. Experimental results on the Kodak dataset and discussions are presented in Section IV. Section V concludes the paper.

II. RELATED WORKS
A. LEARNED IMAGE COMPRESSION
Many learned image compression models have been proposed with the prevalence of DL techniques in various research fields. Some studies address learning-based image compression in specific scenarios. In [17], a discrete wavelet transform based DL model is proposed for the internet of underwater things. [18] presents a compression model using a convolutional neural network for remote sensing images. In this paper, we focus on deep image compression for natural RGB images. In [12], a hyperprior network is proposed to learn the scale parameters of the Gaussian scale mixture model used as the entropy model. The hyper-latent is transmitted as side information to help decode the main latent. However, the estimation is not image-dependent and spatially adaptive after training. In [6]–[8], the main latent representation is modeled by a Gaussian distribution with parameters learned from the context and prior information. The context model allows the information from neighboring decoded symbols to be combined, thus giving a more accurate prediction. It has become the classic learned image compression approach due to its superior PSNR and MS-SSIM performance compared with previous works.

In [8], non-local blocks are embedded in the encoder and decoder networks to learn long-range dependency. However, the self-attention mechanism results in non-negligible computational cost, which imposes restrictions on the compression framework design. Based on this joint architecture, a new framework that incorporates octave convolutions was recently proposed in [19] and achieves higher results than VVC (4:2:0) and other DL-based image compression models. Later works extend the single Gaussian probability model in [6] to a Gaussian mixture model (GMM) [20], [21] and show better compression efficiency. In [21], a joint optimization of the image compression and quality enhancement models is applied. The loss at the output of the compression network acts as an intermediate supervision, and the output of the quality enhancement model is the final reconstruction. Such post-processing techniques are commonly employed with the earlier JPEG codec [22] to remove compression artifacts.


FIGURE 1. Our proposed deep image compression model in the base layer. The input image x is divided into patches with a size of 3 × n × n. Patches are flattened to vectors to form a 3D tensor as input to the main encoder. The main decoder outputs a 3D tensor, and the vector at each spatial position is reshaped to a 3 × n × n patch. The patches are merged into the reconstructed image.

1) VARIABLE-RATE LEARNED IMAGE COMPRESSION
In [23], [24], variable-rate image compression models are proposed based on convolutional and deconvolutional LSTM recurrent networks. In [25], four code layers, including a base layer and three enhancement layers, are adopted to construct a scalable image compression framework. A decorrelation unit is utilized to remove redundancy between the base layer and the current enhancement layer. During inference, the output of each layer corresponds to the reconstruction at a certain bitrate. These methods rely on layered architectures to adjust for the variable bitrate and are not flexible in reaching a specific rate target. In [9], a multi-scale decomposition transform is learned and a rate allocation algorithm is used to determine the optimal scale of each image block based on content complexity given a target rate. In [10], the authors apply bit-plane decomposition before the transform and introduce a bidirectional network to disentangle the information of different bit-planes. However, the performance of [9] and [10] still has a large gap from the state-of-the-art. In [26], a set of scaling factors is applied to the quantized feature map of a pre-trained high-bit-rate model to fine-tune it for a lower bit rate while keeping the main parameters fixed. For low bit rates far from the bit rate of the pre-trained model, the performance is not satisfactory. In [27], a conditional autoencoder is proposed with coarse rate control by the Lagrange multiplier and fine-tuning by the quantization bin size. The fine-tuning process is conducted on intervals between individually trained models. Therefore, to obtain compression results for a wide range of bitrates, it is still required to train multiple discrete models.

In [11], [28], [29], a hybrid architecture that combines a learning-based model and a conventional codec is proposed. The BPG-based residual coding is applied as the enhancement layer to obtain compression results for the subsequent bit rates. However, only eight feature maps are used for the compact representation in [11], which limits the learning capability of the base layer. Based on this, we build a more effective model for the base layer.

B. TRANSFORMERS IN VISION
Attention mechanisms are widely applied in DL models for speech processing and computer vision problems [20], [30]–[32]. Transformers [33] with multi-head attention have become the predominant DL models for natural language processing (NLP). Due to their ability to learn long-range interactions on sequential data, transformers have recently been migrated to many computer vision tasks such as image classification [13], object detection [34], segmentation [35], as well as low-level computer vision tasks [36]. However, purely using transformers instead of convolution layers requires pre-training on a very large-scale dataset and consumes a vast amount of training time [13] to obtain comparable or even better performance than convolutional networks. Other works integrate convolution layers and transformers to improve results at similar computational complexity [34], [35], [37].

In [38], transformers are applied on convolutional feature maps and followed by a convolutional decoder to synthesize high-resolution scene images. It leverages the autoregressive structure of transformers to predict the current index based on previous indices. This property also suits the context model in the entropy coding module of image compression. In [38], the standard transformer layers are applied. In our work, different transformer modules are developed. In addition, we apply transformers in local windows, and the symbols in each window can be decoded in parallel, which compensates for the expensive time cost of the context model in [6]–[8].


FIGURE 2. The encoding and decoding process of the overall framework. ‘‘Base layer’’ is the proposed deep image compression model in Fig. 1. ‘‘· · ·’’ represents the decoder part in the base layer.

III. OUR APPROACH
We propose an effective learned image compression framework by incorporating transformers in the base layer and apply the residual coding [11] to achieve compression across a range of bit rates. No previous work has applied vision transformers to variable-rate image compression. Our patch-based framework, along with the post-processing step, performs better in the base layer than other baselines with the residual coding scheme [11], [28]. The encoding and decoding process of the overall framework is given in Fig. 2. Next we elaborate on the autoencoder image compression model in the base layer, the deblocking network, and the residual coding in the enhancement layer separately.

A. AUTOENCODER NETWORK
The architecture of our proposed deep image compression model in the base layer is given in Fig. 1. During training, the input image is randomly cropped to a resolution of 256 × 256. Given a 2D image x ∈ R^{H×W}, the sequence length is H × W, where H and W are the height and width of the image. As the computational complexity of transformers is a quadratic function of the sequence length, it is infeasible to apply transformers to the entire image directly. We partition the input image x into patches with a size of n × n. Each patch can be flattened to a vector of length 3n². We have (H/n) × (W/n) patches. Each vector is then projected to d dimensions, where d is the channel size used through the autoencoder network. At this point, we obtain a tensor X ∈ R^{h×w×d}, where h = H/n and w = W/n. We reshape the tensor as X ∈ R^{hw×d} as input of the main encoder network. At the output of the main decoder, the vector at each spatial position is first mapped from d back to 3n² dimensions and then reshaped to a 3 × n × n patch. All the patches are merged to a complete image.
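As a concrete illustration of this patch partition, the sketch below shows a minimal non-overlapping patchify/unpatchify pair in PyTorch; the function names and the 256 × 256 example input are illustrative and are not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

def patchify(x: torch.Tensor, n: int) -> torch.Tensor:
    """Split an image batch (B, 3, H, W) into flattened n x n patches.

    Returns a sequence tensor of shape (B, (H/n)*(W/n), 3*n*n).
    """
    B, C, H, W = x.shape
    assert H % n == 0 and W % n == 0
    x = x.unfold(2, n, n).unfold(3, n, n)          # (B, C, H/n, W/n, n, n)
    x = x.permute(0, 2, 3, 1, 4, 5).contiguous()   # (B, H/n, W/n, C, n, n)
    return x.view(B, (H // n) * (W // n), C * n * n)

def unpatchify(seq: torch.Tensor, n: int, H: int, W: int) -> torch.Tensor:
    """Inverse of patchify: (B, (H/n)*(W/n), 3*n*n) -> (B, 3, H, W)."""
    B, L, D = seq.shape
    C = D // (n * n)
    x = seq.reshape(B, H // n, W // n, C, n, n)
    x = x.permute(0, 3, 1, 4, 2, 5).contiguous()
    return x.view(B, C, H, W)

# Linear projection of each flattened patch (length 3*n^2) to the model dimension d.
n, d = 16, 512
embed = nn.Linear(3 * n * n, d)
img = torch.rand(1, 3, 256, 256)
tokens = embed(patchify(img, n))                   # (1, 256, 512), i.e. hw x d
```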
The main encoder and decoder consist of Generalized Divisive Normalization (GDN) [39] layers, residual blocks (ResBlocks) [40], and transformer blocks (TransBlocks). GDN layers are suited for Gaussianizing data from natural images. ResBlocks are added to extract local information to compensate for the transformer blocks, which focus more on long-range dependency. We introduce the TransBlocks in detail in Sec. III-A1 below.

In Fig. 1, we denote the output of the main encoder as y, and it is followed by a quantizer Q to obtain the quantized latent ŷ. Note that ŷ has the same spatial size h × w as X, since no downsampling is needed in the main encoder network. Similar to [12], the latent ŷ is modeled with a Gaussian distribution, and a hyperprior network is applied to predict the Gaussian parameters µ and σ². The hyper-encoder and decoder contain three types of TransBlocks. The output of the hyper-encoder is denoted as z, and as ẑ after quantization. The context model is called TransContext, which is detailed in Sec. III-A2. The output of the context model E′ is then concatenated with the output of the hyper-decoder E″ to predict the parameters µ and σ² for ŷ. We use arithmetic encoding (AE) and arithmetic decoding (AD) to encode and decode the latent ŷ and the hyper-latent ẑ with the predicted Gaussian distribution.

FIGURE 3. The original transformer block in [33].

The loss function of the compression model is

    L_comp = R + λD
           = (E_{x∼p_x}[−log₂ p_{ŷ|ẑ}(ŷ|ẑ)] + E_{x∼p_x}[−log₂ p_ẑ(ẑ)]) + λD        (1)

where the first two terms are the bitrate losses for the latent ŷ and the hyper-latent ẑ, and the last term D is the distortion between the original image x and the reconstructed image x̃. λ controls the tradeoff between the distortion and the bitrate. The distortion D can be the mean square error (MSE) loss, optimized for peak signal-to-noise ratio (PSNR), or the multi-scale structural similarity index measure (MS-SSIM) loss, optimized for MS-SSIM [41].

The MSE loss is given by

    D_MSE = (1/N) Σ ‖x − x̃‖²        (2)

where N is the number of elements. The PSNR is calculated by 20 log₁₀(255/√D_MSE). The final compression loss optimized with MSE in our experiment is L = R + 0.003 × 255² × D_MSE for the base layer.
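For reference, Eqs. (1) and (2) translate into a few lines of PyTorch once the entropy model provides the likelihoods of the quantized symbols. The sketch below is an outline under stated assumptions: likelihoods_y and likelihoods_z stand for p(ŷ|ẑ) and p(ẑ) evaluated at the quantized values, images are assumed to be scaled to [0, 1], and the Gaussian entropy model itself is omitted.

```python
import torch

def rate_distortion_loss(x, x_hat, likelihoods_y, likelihoods_z, lam=0.003 * 255 ** 2):
    """Compute L = R + lambda * D_MSE as in Eqs. (1)-(2).

    likelihoods_y / likelihoods_z: p(y_hat | z_hat) and p(z_hat) evaluated at the
    quantized symbols (same shapes as y_hat / z_hat). x, x_hat: (B, 3, H, W) in [0, 1].
    """
    num_pixels = x.numel() / x.shape[1]            # B * H * W
    # Rate: total bits of latent and hyper-latent, normalized to bits per pixel.
    bpp_y = -torch.log2(likelihoods_y).sum() / num_pixels
    bpp_z = -torch.log2(likelihoods_z).sum() / num_pixels
    # Distortion: MSE over all elements, Eq. (2).
    mse = torch.mean((x - x_hat) ** 2)
    # PSNR on the 0-255 scale, i.e. 20*log10(255 / sqrt(D_MSE)).
    psnr = 20 * torch.log10(255.0 / torch.sqrt(mse * 255 ** 2))
    return bpp_y + bpp_z + lam * mse, psnr
```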


FIGURE 5. Masked attention module in TransContext Model.


FIGURE 4. DownTransBlock and UpTransBlock illustration.

The PSNR metric is commonly used as the quality assessment for image reconstruction. However, it does not target perceived quality. MS-SSIM is a complementary metric to evaluate the structural similarity between two images [41]. The MS-SSIM is calculated as

    MS-SSIM(x, x̃) = [l_M(x, x̃)]^{α_M} ∏_{j=1}^{M} [c_j(x, x̃)]^{β_j} [s_j(x, x̃)]^{γ_j}        (3)

where M is the number of scales, and l_M(x, x̃), c_j(x, x̃) and s_j(x, x̃) are the luminance, contrast and structure comparison measures, respectively [41]. α_M, β_j and γ_j are the relative importance of the terms. The final compression loss optimized with MS-SSIM in our experiment is L = R + 8 × (1 − D_MS-SSIM) for the base layer.

1) TransBlocks
We explore the use of transformer blocks to extract long-range information in the learned image compression network. The original transformer block in [33] is given in Fig. 3. One transformer block contains a multi-head attention network and a point-wise feed-forward network.

We denote the number of heads as m. The input tensor X is divided into m heads with d_i = d/m dimensions for each head (i = 1, 2, · · · , m). For a tensor X_i ∈ R^{hw×d_i}, the multi-head attention process can be represented by the set of equations below, where hw is the sequence length and d_i is the vector dimension in the ith head. When X is reshaped to a sequence of length hw, the position information is lost, so a positional encoding module is added to provide spatial information at the input.

    Q_i = X_i W_{Q_i}^T,   K_i = X_i W_{K_i}^T,   V_i = X_i W_{V_i}^T
    Z_i = Softmax(Q_i K_i^T / √d_k) V_i
    O(Q, K, V) = Concat(Z_1, Z_2, · · · , Z_m) W_O^T        (4)

where W_{Q_i}, W_{K_i}, W_{V_i} and W_O are the weights of the linear layers and √d_k is a scaling factor. In the second equation, Softmax is the softmax operation that produces the attention scores. In the third equation, the weighted vectors from each head are concatenated as the final output. The attention here is referred to as the multi-head self-attention (MHSA) mechanism, as the three items Q, K and V are obtained from the same input X.

The output of MHSA is then fed into the feed-forward network (FFN) as

    f(O(Q, K, V)) = ReLU(O(Q, K, V) W_1^T + b_1) W_2^T + b_2        (5)

where W_1, W_2 are the weights and b_1, b_2 are the biases of the linear layers, and ReLU is a ReLU activation layer.
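A bare-bones implementation of Eqs. (4) and (5) is sketched below for reference. It implements only the MHSA and FFN computations; the residual connections and layer normalization of the original block in Fig. 3, as well as the positional encoding, are left out, and the FFN hidden width is chosen arbitrarily.

```python
import math
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Multi-head self-attention following Eq. (4)."""
    def __init__(self, d: int, m: int):
        super().__init__()
        assert d % m == 0
        self.m, self.d_head = m, d // m
        self.wq, self.wk, self.wv, self.wo = (nn.Linear(d, d) for _ in range(4))

    def forward(self, x):                            # x: (B, L, d), L = h*w tokens
        B, L, _ = x.shape
        def split(t):                                # (B, L, d) -> (B, m, L, d_head)
            return t.view(B, L, self.m, self.d_head).transpose(1, 2)
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        z = (attn @ v).transpose(1, 2).reshape(B, L, -1)   # concatenate the heads
        return self.wo(z)

class FFN(nn.Module):
    """Point-wise feed-forward network, Eq. (5)."""
    def __init__(self, d: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, x):
        return self.net(x)

# Example: one MHSA + FFN layer applied to an 8x8 local window of d = 512 tokens.
x = torch.rand(1, 64, 512)
y = FFN(512)(MHSA(512, m=8)(x))
```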
Differing from [33] for machine translation tasks, the posi- Given an input from the quantized feature representation ŷ,
tional encoding module is based on 2D fixed sine function we first flatten the tensor and pad it with a vector with all 0s
for images. The periodic property of the sine function allows at the beginning. For the current symbol (upper right value of
to extend for longer sequence length. In addition, in order the input), the output of the corresponding position (second
to get a compact feature representation for an image and vector) only depends on the first vector and padded 0s. In the
reconstruct it after the decoder, the transformer blocks need softmax operation, the product of q and k is added with the
to be scalable for spatial size which is not required in [33] mask so that the values corresponding to 0s in the mask will
for language modeling. We propose the DownTransBlock not change and the values above the diagonal direction will
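The exact block grouping of Fig. 4 cannot be fully recovered from the text, so the sketch below only illustrates the general idea under an assumed 2 × 2 space-to-depth grouping, which matches the stated factor-of-2 spatial reduction: neighboring positions are folded into the channel dimension, a 1 × 1 convolution restores the channel size to d, and a regular TransBlock follows.

```python
import torch
import torch.nn as nn

class DownTransBlock(nn.Module):
    """Sketch of a downsampling transformer block (assumed 2x2 grouping)."""
    def __init__(self, d: int, trans_block: nn.Module):
        super().__init__()
        self.reduce = nn.Conv2d(4 * d, d, kernel_size=1)   # 1x1 conv back to d channels
        self.trans_block = trans_block                      # a regular TransBlock

    def forward(self, x):                                   # x: (B, d, h, w)
        x = nn.functional.pixel_unshuffle(x, 2)             # space-to-depth: (B, 4d, h/2, w/2)
        x = self.reduce(x)                                   # (B, d, h/2, w/2)
        B, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                # (B, h*w, d) token sequence
        return self.trans_block(tokens)
```

The UpTransBlock would mirror this with a 1 × 1 convolution that expands the channels followed by a depth-to-space (pixel shuffle) step.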

2) TransContext MODEL
In [6], the context model is a simple masked convolution layer with 5 × 5 kernels. A symbol is decoded based on previously decoded symbols above and to the left of the current symbol in the window. However, the context information is constrained to local windows. We propose to apply a transformer-based context model, called TransContext, to allow more context to be used for prediction.

During training, we use masked multi-head attention modules [33] in the TransContext model to allow the network to back-propagate for gradient calculation. Fig. 5 gives an illustration of the masked attention module for a tensor with an input size of 2 × 2 × d. The mask shown in the figure has 0s at and below the diagonal, and the values above the diagonal are set to negative infinity.

Given an input from the quantized feature representation ŷ, we first flatten the tensor and pad it with a vector of all 0s at the beginning. For the current symbol (the upper right value of the input), the output of the corresponding position (the second vector) only depends on the first vector and the padded 0s. In the softmax operation, the product of q and k is added with the mask so that the values corresponding to 0s in the mask do not change and the values above the diagonal become negative infinity. Note that the softmax of negative infinity is 0. In this way, each symbol only uses the information of previously decoded symbols during test. The output is then sent to the FFN. The last linear layer outputs a tensor with a channel size of 2d, which is then combined with the output of the hyper-decoder network to predict the µ and σ² of the Gaussian distribution.
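A minimal sketch of this masked attention is given below for a single head and a single flattened segment; the shapes and names are illustrative. Positions above the diagonal receive −∞ before the softmax, so each output position attends only to the padded zero vector and previously decoded symbols.

```python
import torch

def masked_context_attention(q, k, v):
    """Masked attention for one head, as used in the TransContext model.

    q, k, v: (L, d_head) tensors for a flattened segment of y_hat that has been
    right-shifted by one position (an all-zero vector padded at the start).
    """
    L = q.shape[0]
    scores = q @ k.t() / k.shape[-1] ** 0.5              # (L, L) attention logits
    # Upper-triangular mask: -inf strictly above the diagonal, 0 at and below it.
    mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    attn = torch.softmax(scores + mask, dim=-1)          # softmax(-inf) contributes 0
    return attn @ v

# During training the whole segment is processed in one pass; at test time the same
# mask guarantees that position i only uses symbols decoded before position i.
out = masked_context_attention(torch.rand(16, 64), torch.rand(16, 64), torch.rand(16, 64))
```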
In implementation, given a feature representation ŷ ∈ R^{h×w×d}, we divide ŷ by 2 in each spatial direction to obtain segments of size h/2 × w/2 × d. In each segment, we apply the TransContext model for inference in parallel. The TransContext model also contains N = 4 layers of MHSA and FFN.
FIGURE 6. Examples of blocking artifacts for patch-based reconstruction: (a) non-overlap, (b) overlap, (c) overlap+deblock. (Please zoom in.)

B. DEBLOCKING NETWORK
As in Sec. III-A, given an image x, we use image patches as the input in order to leverage the transformer blocks in the autoencoder network. During reconstruction, each vector is reshaped to form an image patch, and the patches from all positions are merged into a complete image x̃. Experiments show that the restored image contains some blocking artifacts at the patch borders. This is because each vector only uses local information in the last linear layer, and the edge values of adjacent patches cannot be kept consistent in the prediction. We show two examples in Fig. 6. In (a), the reconstruction is based on non-overlapping image patches, whereas in (b) the image patches are overlapped by two pixels and the overlapping areas are averaged between the neighboring patches. We find that (b) has fewer artifacts than (a).

Although the method in Fig. 6 (b) can reduce some artifacts, it is insufficient for image compression, where blocking noise can result in PSNR or MS-SSIM degradation. Motivated by [42], where a network is developed to post-process the compression artifacts of JPEG [14] for better compression performance, we apply the model in [15] to enhance the image reconstruction quality. The decompressed image x̃ is fed into the deblocking network, as shown in Fig. 2, to obtain the deblocked image x̃_d. Different from previous work [42] and [31], where only the MSE loss is used during training in accordance with the JPEG optimization metric, we train the deblocking network with the MSE or MS-SSIM loss between the deblocked image x̃_d and the original image x, depending on the optimization metric of the image compression network. An example result after applying the deblocking network is shown in Fig. 6 (c).
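The overlap-and-average merge can be implemented by accumulating each reconstructed patch into an output canvas together with a count map and dividing at the end. The sketch below is illustrative only; the two-pixel overlap corresponds to, e.g., 18 × 18 patches extracted with a stride of 16.

```python
import torch

def merge_overlapping_patches(patches, positions, H, W):
    """Average-merge reconstructed patches back into an image.

    patches:   list of (3, n, n) tensors predicted by the decoder.
    positions: list of (top, left) coordinates of each patch in the image.
    Pixels covered by two patches are predicted twice and simply averaged.
    """
    out = torch.zeros(3, H, W)
    weight = torch.zeros(1, H, W)
    for patch, (top, left) in zip(patches, positions):
        n = patch.shape[-1]
        out[:, top:top + n, left:left + n] += patch
        weight[:, top:top + n, left:left + n] += 1.0
    return out / weight.clamp(min=1.0)
```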
The deblocking process does not increase the bitrate: once training is complete, the reconstructed image from the image compression network can be improved with one feed-forward step of the deblocking network. It could also be trained jointly with the image compression network end-to-end. However, this would increase the model complexity, which makes it hard to train the model on one GPU card. Therefore, in our experiments, we train the two networks separately.

C. RESIDUAL ENCODING FOR VARIABLE RATE
Current learned image compression networks achieve state-of-the-art compression performance, but they need to train a separate model for each bit rate. In variable-rate image compression, a single model is trained to obtain results for a range of bitrates. The fine-tuning trick may reduce the total training time, but it can only be applied from a trained high-bit-rate model to a nearby low bit rate. For low bit rates far from the bit rate of the pre-trained model, the performance drops dramatically [26].

Similar to [11], we use the BPG444 codec¹ to encode and decode the residual between the reconstructed image x̃_d from the deblocking network in Sec. III-B and the original image x, as an enhancement layer, as shown in Fig. 2. The bit rate of the BPG codec is controlled by a quality parameter q. The total bit rate for our framework is the sum of the bitrate R from the base layer in Eq. 1 and the bitrate R_bpg from this enhancement layer controlled by q.

1 http://bellard.org/bpg
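The enhancement layer can be reproduced by shelling out to the reference BPG tools on the residual image. In the sketch below, the command-line options and the +128 offset used to map the signed residual into an 8-bit image are our assumptions and are not taken from the paper.

```python
import subprocess
import numpy as np
from PIL import Image

def bpg_residual_layer(x, x_d, q=30):
    """Encode the residual x - x_d with BPG 4:4:4 and return the final reconstruction.

    x, x_d: uint8 RGB arrays (H, W, 3); q is the BPG quality parameter controlling
    the enhancement-layer rate R_bpg.
    """
    res = np.clip(x.astype(np.int16) - x_d.astype(np.int16) + 128, 0, 255).astype(np.uint8)
    Image.fromarray(res).save("residual.png")
    subprocess.run(["bpgenc", "-q", str(q), "-f", "444", "-o", "residual.bpg", "residual.png"], check=True)
    subprocess.run(["bpgdec", "-o", "residual_dec.png", "residual.bpg"], check=True)
    res_dec = np.asarray(Image.open("residual_dec.png")).astype(np.int16) - 128
    # Final reconstruction: base-layer (deblocked) output plus the decoded residual.
    return np.clip(x_d.astype(np.int16) + res_dec, 0, 255).astype(np.uint8)
```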
IV. EXPERIMENTS
A. DATASET AND TRAINING DETAILS
1) DATASET
Since for a learned image compression model the input and the ground truth image are the same, no extra labels are needed for training. Prior works conduct experiments on different training datasets. We use a subset of 40k images from the COCO-2014 set [43] as the training set and compare results on the popular Kodak PhotoCD dataset² and the Berkeley Segmentation Dataset (BSD) 100 test dataset [44].

2 http://r0k.us/graphics/kodak/

2) TRAINING SETTING
We randomly crop each image to 256 × 256 during training. The learning rate is set to 0.00003 for the image compression network; we find that a higher learning rate makes it hard for the training to converge. Its training lasts 300 epochs, we reduce the learning rate by a factor of 0.1 after 180 epochs, and the batch size is 20. The learning rate for the deblocking network is set to 0.0001. Its training lasts 80 epochs, we reduce the learning rate by a factor of 0.5 after 40 and 60 epochs, and the batch size is 8. We experiment on the PyTorch framework [45] and use one TITAN X GPU for training with the Adam optimizer.
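The schedule above maps onto a standard PyTorch optimizer/scheduler setup. The sketch below mirrors the stated hyperparameters; the model objects are placeholders and the training-loop body is omitted.

```python
import torch

# Compression network: lr 3e-5, 300 epochs, decay by 0.1 after epoch 180, batch size 20.
comp_net = torch.nn.Linear(10, 10)        # placeholder for the actual compression model
comp_opt = torch.optim.Adam(comp_net.parameters(), lr=3e-5)
comp_sched = torch.optim.lr_scheduler.MultiStepLR(comp_opt, milestones=[180], gamma=0.1)

# Deblocking network: lr 1e-4, 80 epochs, decay by 0.5 after epochs 40 and 60, batch size 8.
deblock_net = torch.nn.Linear(10, 10)     # placeholder for the deblocking model
deblock_opt = torch.optim.Adam(deblock_net.parameters(), lr=1e-4)
deblock_sched = torch.optim.lr_scheduler.MultiStepLR(deblock_opt, milestones=[40, 60], gamma=0.5)

for epoch in range(300):
    # ... one epoch of training with 256x256 random crops ...
    comp_sched.step()
```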


FIGURE 7. (a) PSNR/bpp and (b) MS-SSIM/bpp on the Kodak dataset.
FIGURE 8. (a) PSNR/bpp and (b) MS-SSIM/bpp on the BSD100 dataset.

B. EXPERIMENTAL RESULTS
1) RESULTS ON KODAK DATASET
Fig. 7 compares our results with conventional codecs (JPEG [14], JPEG2000 [22], BPG420 and BPG444) and the learned variable-rate image compression models Cai2018 [9], Zhang2019 [10], Akbari2020 [11] and Fu2021 [28] on the Kodak dataset, in terms of PSNR and MS-SSIM versus bits per pixel (bpp). Our approach achieves PSNR comparable to BPG420. The first point in the R-D curve reflects the influence of the proposed compression model in the base layer in Fig. 1. Our result in the base layer is 0.75 dB higher than Akbari2020 and 1.7 dB higher than Fu2021 at 0.15 bpp, where BPG444 is also applied for the residual coding.

For MS-SSIM in (b), we have better performance at the base layer (first point at 0.21 bpp) than Akbari2020 and Fu2021. Compared with BPG444, we are 0.021 higher at 0.21 bpp. However, as the bitrate increases, our MS-SSIM result saturates to that of BPG444, which is similar to Fu2021. The MS-SSIM of our method is better than the traditional codecs and Zhang2019. It also outperforms Cai2018 at low bit rates. Our approach does not show advantages in MS-SSIM at high bit rates, as the residual coding with the classic codec BPG is not optimized for MS-SSIM. However, the residual coding strategy provides an effective as well as simple way to achieve variable-rate image compression.

2) RESULTS ON BSD100 DATASET
The methods Cai2018, Zhang2019 and Akbari2020 only report results on the Kodak dataset. We also compare our results with JPEG, JPEG2000, BPG420 and BPG444 and the learned variable-rate image compression model Fu2021 on the BSD100 dataset, as given in Fig. 8. The overall trend is consistent with that on the Kodak dataset, which shows that the trained models generalize well.


FIGURE 9. Ablation study of our framework.

3) ABLATION TEST: NON OVERLAP vs. OVERLAP
We experiment with two different partition schemes, which we call non-overlap and overlap, on the Kodak dataset, as shown in Fig. 9. For non-overlap, we set the patch size to 16 × 16. For overlap, the patch size is 18 with a stride of 16, so the overlapping areas are two pixels in each direction. With the same λ in Eq. 1, the calculated bit rate for non-overlap is less than that for overlap at the base layer. After using the BPG residual coding, overlap (green curve) shows generally better PSNR performance than non-overlap (blue curve). For non-overlap, only the local information is used to construct each patch in the last linear layer, and the edge values of neighboring patches can have a large variance. For overlap, the inconsistency is averaged to reduce the blocking artifacts, as shown in Fig. 6 (b).

TABLE 1. Ablation test for TransBlocks and TransContext modules optimized with MSE loss at 0.15 bpp.

4) ABLATION TEST: TransBlocks AND TransContext
To prove the effectiveness of the TransBlocks and TransContext, we experiment with removing each of them while keeping the remaining parts of the model the same. Tab. 1 shows the results with MSE optimization at 0.15 bpp without the deblocking post-processing. The first row shows the result without the TransContext model. The second row gives the result without the TransBlocks in the main encoder and decoder. We show the result for the model with both modules in the third row.

Compared with convolutional layers, where the receptive field size is constrained by the kernel size, TransBlocks can extract long-range dependency from the feature tensor. When combined with the ResBlocks, our model extracts both local and global information to optimize the compression loss. The TransContext model allows the Gaussian parameters to be predicted from previously decoded symbols, which contributes to a more accurate probability estimation for the arithmetic coding. Tab. 1 shows that both the TransBlocks and TransContext help improve the compression performance.

5) ABLATION TEST: DEBLOCKING NETWORK
The deblocking network is applied after we obtain the reconstructed results from the main decoder. We get about a 0.2 dB improvement in PSNR and 0.001 in MS-SSIM for overlap after the deblocking network. The corresponding PSNR curve is displayed as the red curve in Fig. 9.

TABLE 2. Results for different channel sizes d in the transformers.

6) ABLATION TEST: FEATURE SIZE IN TRANSFORMERS
In the above experiments, we set the channel dimension to d = 512. This better maintains the information from a patch. We also experiment with a smaller channel dimension of d = 256. As given in Tab. 2, d = 512 achieves higher PSNR and MS-SSIM at a lower bit rate. Therefore, we use 512 as the channel dimension in the other experiments. The PSNR with d = 512 after residual coding (green curve) is steadily better than that with d = 256 (black curve) at various bit rates, as shown in Fig. 9.

C. TIME COMPLEXITY
We discuss the running time of our framework for inference on an E5-2620 v4 CPU (2.10 GHz) with 128 GB RAM. The most time-consuming part of the model is the context model, which needs to decode sequentially from previously decoded symbols. The main encoder and decoder networks take about 0.05 s and 0.04 s for one forward step. For the hyper-encoder and hyper-decoder networks, the running times are 0.01 s and 0.01 s. The hyper-latent ẑ can be encoded and decoded in parallel. The decoding time for one position is around 0.04 s. In [6], the latent ŷ can only be decoded one symbol at a time, which takes h × w forward steps of the entropy model. In our framework, the transformer is applied on local windows of size h/2 × w/2, and the number of sequential forward steps is reduced to 1/4 when decoding in parallel. The running time for one window is around 177 s. Note that the arithmetic coding in our experiment is not optimized.³ Since different platforms may affect the elapsed time, we also test the original context model with a masked convolution layer (5 × 5 kernels) on this device. Its running time is about 240 s, which is longer than our scheme.

3 https://github.com/nayuki/Reference-arithmetic-coding

D. EXAMPLES
In Fig. 10 and Fig. 11, we show reconstructed examples from different methods. In Fig. 10, the results in (e) and (f) show clearer lines on the sail but blurrier human faces.


FIGURE 10. Reconstructed example from different methods (bpp, PSNR, MS-SSIM).


FIGURE 11. Reconstructed example from different methods (bpp, PSNR, MS-SSIM).


The results in (b) and (c) contain more detailed features on the human faces. This is because the results in (e) and (f) are obtained from the model whose base layer is trained with the MS-SSIM loss. As the MS-SSIM loss focuses more on overall structures, the MS-SSIM in (e) and (f) is higher than BPG444, whereas the PSNR in (e) and (f) is relatively low.

At the higher bit rate in Fig. 11, the visual difference is not as significant. The MS-SSIM in (f) is slightly better than BPG444 in (d) with 0.04 bpp less. Note that in (e) and (f), due to the residual coding scheme based on BPG, which is optimized with MSE, the results also have high PSNR values. The wall texture on the left in (a), obtained with BPG420, is not well restored. The corner between the roof and the wall contour on the left in (d) is blurry. In both figures, our results are improved when adding the deblocking module compared with those from the model optimized with the corresponding loss alone.

V. CONCLUSION
We propose to incorporate vision transformers into a variable-rate learned image compression framework. Different transformer blocks are applied to meet the various requirements of the subnetworks. Compared with other variable-rate learned image compression networks, our framework achieves higher PSNR across a range of bit rates and higher MS-SSIM performance at low bit rates. Ablation experiments show the effectiveness of the proposed TransBlocks and TransContext model. We also experiment with two different image patch strategies and show that the overlap partition achieves better compression performance than the non-overlap partition. Finally, we discuss the time complexity of our model and show that it can reduce the inference time of the autoregressive context model.

When applying vision transformers, the sequence length, which is the number of image patches in our framework, determines the computation cost. More transformer layers can be added and explored if the sequence length can be further reduced. In future work, we may mask out some of the patches and apply image inpainting techniques to fill the masked patches at the decoder.

REFERENCES
[1] J. Ballé, V. Laparra, and E. P. Simoncelli, ‘‘End-to-end optimized image compression,’’ 2016, arXiv:1611.01704.
[2] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, ‘‘Soft-to-hard vector quantization for end-to-end learning compressible representations,’’ in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1141–1151.
[3] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, ‘‘Learning convolutional networks for content-weighted image compression,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3214–3223.
[4] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, ‘‘Conditional probability models for deep image compression,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4394–4402.
[5] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, ‘‘Energy compaction-based image compression using convolutional AutoEncoder,’’ IEEE Trans. Multimedia, vol. 22, no. 4, pp. 860–873, Apr. 2020.
[6] D. Minnen, J. Ballé, and G. D. Toderici, ‘‘Joint autoregressive and hierarchical priors for learned image compression,’’ in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 10771–10780.
[7] J. Lee, S. Cho, and S.-K. Beack, ‘‘Context-adaptive entropy model for end-to-end optimized image compression,’’ in Proc. 7th Int. Conf. Learn. Represent., May 2019, pp. 1–20.
[8] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, ‘‘Non-local attention optimized deep image compression,’’ 2019, arXiv:1904.09757.
[9] C. Cai, L. Chen, X. Zhang, and Z. Gao, ‘‘Efficient variable rate image compression with multi-scale decomposition network,’’ IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 12, pp. 3687–3700, Dec. 2018.
[10] Z. Zhang, Z. Chen, J. Lin, and W. Li, ‘‘Learned scalable image compression with bidirectional context disentanglement network,’’ in Proc. IEEE Int. Conf. Multimedia Expo. (ICME), Jul. 2019, pp. 1438–1443.
[11] M. Akbari, J. Liang, J. Han, and C. Tu, ‘‘Learned variable-rate image compression with residual divisive normalization,’’ in Proc. IEEE Int. Conf. Multimedia Expo. (ICME), Jul. 2020, pp. 1–6.
[12] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, ‘‘Variational image compression with a scale hyperprior,’’ 2018, arXiv:1802.01436.
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, ‘‘An image is worth 16x16 words: Transformers for image recognition at scale,’’ 2020, arXiv:2010.11929.
[14] G. K. Wallace, ‘‘The JPEG still picture compression standard,’’ Commun. ACM, vol. 34, no. 4, pp. 30–44, Apr. 1991.
[15] N. Ahn, B. Kang, and K.-A. Sohn, ‘‘Fast, accurate, and lightweight super-resolution with cascading residual network,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 252–268.
[16] D. Liu, X. Sun, and F. Wu, ‘‘Inpainting with image patches for compression,’’ J. Vis. Commun. Image Represent., vol. 23, pp. 100–113, Jan. 2012.
[17] N. Krishnaraj, M. Elhoseny, M. Thenmozhi, M. M. Selim, and K. Shankar, ‘‘Deep learning model for real-time image compression in Internet of Underwater Things (IoUT),’’ J. Real-Time Image Process., vol. 17, pp. 2097–2111, May 2019.
[18] B. Sujitha, V. S. Parvathy, E. L. Lydia, P. Rani, Z. Polkowski, and K. Shankar, ‘‘Optimal deep learning based image compression technique for data transmission on industrial Internet of Things applications,’’ Trans. Emerg. Telecommun. Technol., vol. 32, Apr. 2020, Art. no. e3976.
[19] M. Akbari, J. Liang, J. Han, and C. Tu, ‘‘Generalized octave convolutions for learned multi-frequency image compression,’’ 2020, arXiv:2002.10032.
[20] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, ‘‘Learned image compression with discretized Gaussian mixture likelihoods and attention modules,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7939–7948.
[21] J. Lee, S. Cho, and M. Kim, ‘‘An end-to-end joint learning scheme of image compression and quality enhancement with improved entropy minimization,’’ 2019, arXiv:1912.12817.
[22] Information Technology JPEG 2000 Image Coding System: Core Coding System, International Organization for Standardization, Geneva, Switzerland, Dec. 2000.
[23] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, ‘‘Variable rate image compression with recurrent neural networks,’’ 2015, arXiv:1511.06085.
[24] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici, ‘‘Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4385–4393.
[25] Z. Guo, Z. Zhang, and Z. Chen, ‘‘Deep scalable image compression via hierarchical feature decorrelation,’’ in Proc. Picture Coding Symp. (PCS), Nov. 2019, pp. 1–5.
[26] T. Chen and Z. Ma, ‘‘Variable bitrate image compression with quality scaling factors,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 2163–2167.
[27] Y. Choi, M. El-Khamy, and J. Lee, ‘‘Variable rate deep image compression with a conditional autoencoder,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3146–3154.
[28] H. Fu, F. Liang, B. Lei, Q. Zhang, J. Liang, C. Tu, and G. Zhang, ‘‘An extended context-based entropy hybrid modeling for image compression,’’ Signal Process., Image Commun., vol. 95, Jul. 2021, Art. no. 116244.


[29] M. Akbari, J. Liang, J. Han, and C. Tu, ‘‘Learned multi-resolution variable-rate image compression with octave-based residual blocks,’’ IEEE Trans. Multimedia, vol. 23, pp. 3013–3021, 2021.
[30] A. Tursunov, Mustaqeem, J. Y. Choeh, and S. Kwon, ‘‘Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms,’’ Sensors, vol. 21, no. 17, p. 5892, Sep. 2021.
[31] B. Li, J. Liang, and Y. Wang, ‘‘Compression artifact removal with stacked multi-context channel-wise attention network,’’ in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 3601–3605.
[32] K. Muhammad, A. Ullah, A. S. Imran, M. Sajjad, M. S. Kiran, G. Sannino, and V. H. C. de Albuquerque, ‘‘Human action recognition using attention based LSTM network with dilated CNN features,’’ Future Gener. Comput. Syst., vol. 125, pp. 820–830, Dec. 2021.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[34] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2020, pp. 213–229.
[35] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, ‘‘MaX-DeepLab: End-to-end panoptic segmentation with mask transformers,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 5463–5474.
[36] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, ‘‘Pre-trained image processing transformer,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12299–12310.
[37] H. Yan, Z. Li, W. Li, C. Wang, M. Wu, and C. Zhang, ‘‘ConTNet: Why not use convolution and transformer at the same time?’’ 2021, arXiv:2104.13497.
[38] P. Esser, R. Rombach, and B. Ommer, ‘‘Taming transformers for high-resolution image synthesis,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12873–12883.
[39] J. Ballé, V. Laparra, and E. P. Simoncelli, ‘‘Density modeling of images using a generalized normalization transformation,’’ 2015, arXiv:1511.06281.
[40] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[41] Z. Wang, E. P. Simoncelli, and A. C. Bovik, ‘‘Multiscale structural similarity for image quality assessment,’’ in Proc. 37th Asilomar Conf. Signals, Syst. Comput., vol. 2, Jul. 2003, pp. 1398–1402.
[42] C. Dong, Y. Deng, C. C. Loy, and X. Tang, ‘‘Compression artifacts reduction by a deep convolutional network,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 576–584.
[43] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ‘‘Microsoft COCO: Common objects in context,’’ in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2014, pp. 740–755.
[44] D. Martin, C. Fowlkes, D. Tal, and J. Malik, ‘‘A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,’’ in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2, Jun. 2001, pp. 416–423.
[45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, and A. Desmaison, ‘‘PyTorch: An imperative style, high-performance deep learning library,’’ in Proc. Adv. Neural Inf. Process. Syst., H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

BINGLIN LI received the B.Eng. degree in communication engineering from Wuhan University, Wuhan, China, in 2014, and the M.Sc. degree in engineering from the University of Manitoba, Winnipeg, Canada, in 2016. She is currently pursuing the Ph.D. degree with the Multimedia Laboratory, School of Engineering Science, Simon Fraser University, Burnaby, Canada. Since 2018, she has been a Research Assistant with the Multimedia Laboratory, School of Engineering Science, Simon Fraser University. Her research interests include deep-learning-based image compression and computer vision.

JIE LIANG (Senior Member, IEEE) received the B.E. and M.E. degrees from Xi’an Jiaotong University, China, in 1992 and 1995, respectively, the M.E. degree from the National University of Singapore, in 1998, and the Ph.D. degree from Johns Hopkins University, USA, in 2003. From 2003 to 2004, he worked with the Microsoft Digital Media Division, Video Codec Group. Since May 2004, he has been with the School of Engineering Science, Simon Fraser University, Canada, where he is currently a Professor. His research interests include image and video processing, computer vision, and deep learning. Prof. Liang received the 2014 IEEE TCSVT Best Associate Editor Award, the 2014 SFU Dean of the Graduate Studies Award for Excellence in Leadership, and the 2015 Canada NSERC Discovery Accelerator Supplements (DAS) Award. He has served as an Associate Editor for several journals, including IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), and IEEE SIGNAL PROCESSING LETTERS. He has also served on three IEEE Technical Committees.

JINGNING HAN (Senior Member, IEEE) received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2007, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, in 2008 and 2012, respectively. He joined the WebM Codec Team, Google, Mountain View, CA, USA, in 2012, where he is the Main Architect of the VP9 and AV1 codecs and leads the Software Video Codec Team. He has published more than 60 research articles and holds more than 50 U.S. patents in the field of video coding. His research interests include video coding and computer science architecture. Dr. Han received the Dissertation Fellowship from the Department of Electrical and Computer Engineering, University of California at Santa Barbara, in 2012. He was a recipient of the Best Student Paper Award at the IEEE International Conference on Multimedia and Expo, in 2012. He also received the IEEE Signal Processing Society Best Young Author Paper Award, in 2015.
