Variable-Rate Deep Image Compression With Vision Transformers
Digital Object Identifier 10.1109/ACCESS.2022.3173256
ABSTRACT Recently, vision transformers have been applied to many computer vision problems due to their long-range learning ability. However, they have not been thoroughly explored in image compression. We propose a patch-based learned image compression network that incorporates vision transformers. The input image is divided into patches before being fed to the encoder, and the patches reconstructed by the decoder are merged to form a complete image. Different kinds of transformer blocks (TransBlocks) are applied to meet the various requirements of the subnetworks. We also propose a transformer-based context model (TransContext) to facilitate the coding based on previously decoded symbols. Since the computational complexity of the attention mechanism in transformers is a quadratic function of the sequence length, we partition the feature tensor into segments and conduct the transformer in each segment to save computational cost. To alleviate the compression artifacts, we use overlapping patches and apply an existing deblocking network to further remove the artifacts. Finally, a residual coding scheme is adopted to obtain compression at variable bit rates. We show that our patch-based learned image compression with transformers obtains a 0.75 dB PSNR improvement at 0.15 bpp over the prior variable-rate compression work on the Kodak dataset. When using the residual coding strategy, our framework maintains good PSNR performance and is comparable to BPG420. For MS-SSIM, we obtain higher results than BPG444 across a range of bit rates (0.021 higher at 0.21 bpp) and than other variable-rate learned image compression models at low bit rates.
I. INTRODUCTION
Recently, there has been a line of research [1]–[8] on deep image compression. Autoencoder approaches [6]–[8] with joint autoregressive and hierarchical hyperprior models have become the mainstream practice for learning-based image compression. Although these methods show promising compression performance compared with conventional image codecs, there are two main drawbacks in real applications.
Firstly, a separate model needs to be trained for each bit rate, which increases the coding complexity. To this end, variable-bitrate image compression models [9]–[11] have been developed to cover various bit rates with one trained model. In particular, in [11], a layered coding scheme is developed, where the base-layer feature map is obtained by a deep learning (DL) network, and the residual between the input and the base-layer reconstruction is coded by a traditional method to cover more bit rates. Motivated by [11], in this paper, we propose a more effective learned image compression framework by incorporating transformers in the base layer and apply residual coding to achieve compression across a range of bit rates with a single model. In [11], only eight feature maps are used for the compact representation, which limits the learning capability of the base layer. In our framework, a hyperprior network [12] is adopted to estimate the distribution parameters of the quantized feature representation, so that the channel dimension of the representation can be set larger in the base layer. Experimental results show a 0.75 dB PSNR improvement over [11] at 0.15 bpp. When using the residual coding strategy, our framework keeps good PSNR performance and is comparable to BPG420, whereas the performance of [11] is lower than BPG420.
Secondly, although the masked convolutional context model in [6]–[8] achieves better compression performance than the scale-only hyperprior model [12], it brings extra computational overhead because of the sequential decoding process. Besides, the context information is constrained to 5 × 5 windows. In this paper, we leverage transformers to capture the long-range dependency and use the masked multi-head attention module of transformers during training to guarantee the causal relationship; we call this the TransContext model. Equipped with local transformers, we divide the latent representation into segments and conduct the transformer in each segment. Each segment can thus be processed in parallel to reduce the total decoding time.
In our framework, only local transformer blocks (TransBlocks) are adopted, since the computational complexity of the attention mechanism in transformers is a quadratic function of the sequence length. Similar to [13], we divide the input image into image patches. In this way, the spatial size of the input is reduced and we can feed the patch features to the transformer blocks in the encoder. At the output of the decoder, an image patch is reconstructed based on the information from the local patch, and all the patches are merged into a complete image. This can produce blocking artifacts at the patch boundaries [14]. To alleviate these artifacts, we use image patches with a small portion of overlap as the input to the network and average the values at positions where two outputs are predicted. We find this strategy improves the compression performance to some extent. To further remove the artifacts, a post-processing network [15] is applied.
When the network is designed with fully convolutional layers, the size of the input image can be arbitrary, and previous works use the entire image as input to the compression network. Patch-based learned image compression has not been explored. Although patch-based image compression methods may result in blocking artifacts as in JPEG [14], they have their advantages. In [16], inpainting techniques are embedded into a patch-based image compression framework to improve the compression performance. In our work, image patches are used as input to cope with the demanding computation resources required by a high-resolution input to transformers.
Our contributions include: 1) We build an effective patch-based learned image compression network with vision transformers in the base layer, based on [11], for variable-rate deep image compression. To alleviate the compression artifacts resulting from patch reconstructions, we partition the image into overlapping patches and utilize an existing deblocking network to further remove the blocking artifacts. 2) Different kinds of transformer blocks are applied to meet the various requirements of the subnetworks. 3) We propose a transformer-based context model to facilitate the Gaussian parameter predictions based on the previously decoded symbols. It is performed on segments of the quantized latent representation and thus can reduce the total decoding time compared with the masked convolution context model in [6].
The rest of the paper is organized as follows. In Section II, we discuss related work on learned image compression and transformers in vision applications. Then we introduce our framework and explain its building blocks. Experimental results on the Kodak dataset and discussions are presented in Section IV. Section V concludes the paper.

II. RELATED WORKS
A. LEARNED IMAGE COMPRESSION
Many learned image compression models have been proposed with the prevalence of DL techniques in various research fields. Some works study learning-based image compression in specific scenarios. In [17], a discrete wavelet transform based DL model is proposed for the Internet of Underwater Things. [18] presents a compression model using a convolutional neural network for remote sensing images.
In this paper, we focus on deep image compression for natural RGB images. In [12], a hyperprior network is proposed to learn the scale parameters of the Gaussian scale mixture model used as the entropy model. The hyper-latent is transmitted as side information to help decode the main latent. However, the estimation is not image-dependent or spatially adaptive after training. In [6]–[8], the main latent representation is modeled by a Gaussian distribution with parameters learned from the context and prior information. The context model combines the information from neighboring decoded symbols and thus gives a more accurate prediction. This has become the classic learned image compression approach due to its superior PSNR and MS-SSIM performance compared with previous works.
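For concreteness, the following is a minimal sketch (ours, not code from the cited works) of how such a Gaussian entropy model turns predicted means and scales into a rate estimate; the function name and the clamping constants are illustrative.

    import torch

    def gaussian_rate_bits(y_hat: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        """Estimated bits for integer-quantized symbols y_hat under N(mu, sigma^2):
        each symbol's probability is the Gaussian mass over its quantization bin
        [y_hat - 0.5, y_hat + 0.5]."""
        dist = torch.distributions.Normal(mu, sigma.clamp(min=1e-6))
        p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
        return -torch.log2(p.clamp(min=1e-9)).sum()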
In [8], non-local blocks are embedded in the encoder and decoder networks to learn the long-range dependency. However, the self-attention mechanism results in a non-negligible computational cost, which imposes restrictions on the compression framework design. Based on this joint architecture, a new framework that combines octave convolutions was recently applied in [19] and achieves higher results than VVC (4:2:0) and other DL-based image compression models. Later works extend the single Gaussian probability model in [6] to a Gaussian mixture model (GMM) [20], [21] and show better compression efficiency. In [21], a joint optimization of the image compression and quality enhancement models is applied: the loss at the output of the compression network acts as an intermediate supervision, and the output of the quality enhancement model is the final reconstruction. Such post-processing techniques are also commonly employed with the earlier codec JPEG [22] to remove compression artifacts.

1) VARIABLE-RATE LEARNED IMAGE COMPRESSION
In [23], [24], variable-rate image compression models are proposed based on convolutional and deconvolutional LSTM recurrent networks. In [25], four code layers, including a base layer and three enhancement layers, are adopted to construct a scalable image compression framework. A decorrelation unit is utilized to remove the redundancy between the base layer and the current enhancement layer.
During inference, the output of each layer corresponds to the reconstruction at a certain bitrate. These methods rely on layered architectures to adjust the bitrate and are not flexible enough to hit a specific rate target. In [9], a multi-scale decomposition transform is learned, and a rate allocation algorithm is used to determine the optimal scale of each image block based on content complexity given a target rate. In [10], the authors apply bit-plane decomposition before the transform and introduce a bidirectional network to disentangle the information of different bit-planes. However, the performance of [9] and [10] still has a large gap from the state-of-the-art. In [26], a set of scaling factors is applied to the quantized feature map of a pre-trained high bit-rate model to fine-tune for lower bit rates while keeping the main parameters fixed. For bit rates far from the bit rate of the pre-trained model, the performance is not satisfactory. In [27], a conditional autoencoder is proposed with coarse rate control by the Lagrange multiplier and fine tuning by the quantization bin size. The fine-tuning process is conducted on intervals between individually trained models. Therefore, to obtain compression results for a wide range of bitrates, it is still required to train multiple discrete models.
In [11], [28], [29], a hybrid architecture that combines a learning-based model and a conventional codec is proposed. BPG-based residual coding is applied as the enhancement layer to obtain compression results for the subsequent bit rates. However, only eight feature maps are used for the compact representation in [11], which limits the learning capability of the base layer. Based on this, we build a more effective model for the base layer.
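To make the hybrid scheme concrete, the following minimal sketch illustrates residual coding around a base-layer reconstruction; the codec wrapper and the +128 shift are our assumptions, not the exact procedure of [11], [28], [29].

    import numpy as np

    def residual_coding(x: np.ndarray, x_base: np.ndarray, codec_round_trip) -> np.ndarray:
        """Enhancement-layer residual coding: the residual between the input and the
        base-layer reconstruction is shifted into [0, 255], compressed with a
        traditional codec (BPG in the hybrid schemes above), and added back after
        decoding. `codec_round_trip` is a user-supplied encode+decode wrapper."""
        res = x.astype(np.int16) - x_base.astype(np.int16)
        res_shifted = np.clip(res + 128, 0, 255).astype(np.uint8)      # map signed residual to pixel range
        res_hat = codec_round_trip(res_shifted).astype(np.int16) - 128
        return np.clip(x_base.astype(np.int16) + res_hat, 0, 255).astype(np.uint8)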
B. TRANSFORMERS IN VISION
Attention mechanisms are widely applied in DL models for speech processing and computer vision problems [20], [30]–[32]. Transformers [33] with multi-head attention have become the predominant DL models for natural language processing (NLP). Due to their ability to learn long-range interactions on sequential data, transformers have recently been migrated to many computer vision tasks such as image classification [13], object detection [34], segmentation [35], as well as low-level computer vision tasks [36]. However, purely using transformers instead of convolution layers requires pre-training on a very large-scale dataset and consumes a vast amount of training time [13] to reach comparable or better performance than convolutional networks. Other works integrate convolution layers and transformers to improve results at similar computational complexity [34], [35], [37].
In [38], transformers are applied on convolutional feature maps and followed by a convolutional decoder to synthesize high-resolution scene images. This approach leverages the autoregressive structure of transformers to predict the current index based on previous indices, a property that also suits the context model in the entropy coding module of image compression. In [38], the standard transformer layers are applied, whereas in our work different transformer modules are developed. In addition, we apply transformers in local windows so that symbols in each window can be decoded in parallel, which compensates for the expensive time cost of the context model in [6]–[8].

III. OUR APPROACH
We propose an effective learned image compression framework by incorporating transformers in the base layer and apply residual coding [11] to achieve compression across a range of bit rates. No previous work has applied vision transformers to variable-rate image compression models. Our patch-based framework, along with the post-processing step, performs better in the base layer than other baselines with residual coding.
FIGURE 1. Our proposed deep image compression model in the base layer. The input image x is divided into patches with a size of 3 × n × n. Patches are flattened to vectors to form a 3D tensor as input to the main encoder. The main decoder outputs a 3D tensor, and the vector at each spatial position is reshaped to a 3 × n × n patch. The patches are merged into the reconstructed image.
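As a minimal sketch of the patch formatting in Fig. 1 (the function name, tensor layout, and default sizes are illustrative), an image batch can be turned into the 3D token tensor as follows. With the stride equal to the patch size the patches tile the image exactly; a smaller stride gives the overlapping patches discussed in Section I.

    import torch

    def image_to_patch_tokens(x: torch.Tensor, n: int = 16, stride: int = 16) -> torch.Tensor:
        """Split an image batch (B, 3, H, W) into n x n patches (overlapping when
        stride < n) and flatten each patch to a vector, producing a
        (B, H', W', 3*n*n) tensor for the main encoder."""
        patches = x.unfold(2, n, stride).unfold(3, n, stride)   # (B, 3, H', W', n, n)
        b, c, hp, wp, _, _ = patches.shape
        return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, hp, wp, c * n * n)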
When optimizing for MS-SSIM, the distortion term follows the multi-scale SSIM definition in [41], where αM, βj and γj are the relative importance of the luminance, contrast and structure terms. The final compression loss optimized with MS-SSIM in our experiments is L = R + 8 × (1 − D_MS-SSIM) for the base layer.
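A minimal sketch of this objective, assuming a third-party MS-SSIM implementation such as pytorch_msssim and a rate term already expressed in bits per pixel:

    import torch
    from pytorch_msssim import ms_ssim  # assumed third-party MS-SSIM implementation

    def base_layer_loss(rate_bpp: torch.Tensor, x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
        """Base-layer objective L = R + 8 * (1 - MS-SSIM) used for MS-SSIM optimization."""
        d_ms_ssim = 1.0 - ms_ssim(x_hat, x, data_range=1.0)   # distortion term in [0, 1]
        return rate_bpp + 8.0 * d_ms_ssim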
1) TransBlocks
We explore transformer blocks to extract long-range information in the learned image compression network. The original transformer block in [33] is shown in Fig. 3. One transformer block contains a multi-head attention network and a point-wise feed-forward network.
We denote the number of heads as m. The input tensor X is divided into m heads with d_i = d/m dimensions for each head (i = 1, 2, ..., m). For a tensor X_i ∈ R^(hw×d_i), the multi-head attention process can be represented by the equations below, where hw is the sequence length and d_i is the vector dimension of the i-th head. When X is reshaped to a sequence of length hw, the position information is lost, so a positional encoding module is added to provide spatial information at the input.

    Q_i = X_i W_Qi^T,   K_i = X_i W_Ki^T,   V_i = X_i W_Vi^T
    Z_i = Softmax(Q_i K_i^T / √d_k) V_i
    O(Q, K, V) = Concat(Z_1, Z_2, ..., Z_m) W_O^T                      (4)

where W_Qi, W_Ki, W_Vi and W_O are the weights of the linear layers and √d_k is a scaling factor. In the second equation, Softmax is the softmax operation that produces the attention scores. In the third equation, the weighted vectors from each head are concatenated as the final output. The attention here is referred to as multi-head self-attention (MHSA), since the three items Q, K and V are obtained from the same input X.
The output of MHSA is then fed into the feed-forward network (FFN) as

    f(O(Q, K, V)) = ReLU(O(Q, K, V) W_1^T + b_1) W_2^T + b_2           (5)

where W_1, W_2 are the weights and b_1, b_2 are the biases of the linear layers, and ReLU is a ReLU activation layer.
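A minimal PyTorch sketch of one MHSA + FFN layer corresponding to Eq. (4) and (5); the local-window partitioning, the 2D positional encoding, and the FFN hidden width (set to 4d here) are simplifications of ours.

    import torch
    import torch.nn as nn

    class TransBlockLayer(nn.Module):
        """One MHSA + FFN layer following Eq. (4)-(5)."""
        def __init__(self, d: int, num_heads: int = 8):
            super().__init__()
            self.mhsa = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, hw, d), one flattened local window per sequence
            z, _ = self.mhsa(x, x, x)   # multi-head self-attention, Eq. (4)
            return self.ffn(z)          # point-wise feed-forward network, Eq. (5)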
Differing from [33] for machine translation tasks, the positional encoding module is based on a 2D fixed sine function for images. The periodic property of the sine function allows extension to longer sequence lengths. In addition, in order to obtain a compact feature representation for an image and reconstruct it after the decoder, the transformer blocks need to be scalable in spatial size, which is not required in [33] for language modeling. We propose the DownTransBlock and UpTransBlock, as depicted in Fig. 4, to meet the various requirements in the architecture.
The regular TransBlock has input and output with the same spatial size, similar to [13]. The DownTransBlock is modified to reduce the output size by a factor of 2, as shown in the first row of Fig. 4. The input tensor is divided into 4×4 blocks. We then flatten each block to a vector and use a convolution layer with a 1×1 kernel to reduce the channel size to that of the input tensor. The resulting tensor is then followed by the regular TransBlock. The UpTransBlock is the inverse operation of the DownTransBlock, as shown in the second row of Fig. 4. The DownTransBlock is applied in the hyper-encoder network to obtain the compressed hyper-latent ẑ, and the UpTransBlock is used to transform back in order to predict the Gaussian parameters for ŷ.
All the TransBlocks are conducted in local windows. Based on the spatial size of the tensors, we use an 8 × 8 window size for the main encoder and decoder networks, where each TransBlock contains N = 4 layers of MHSA and FFN. For the hyper-encoder and hyper-decoder networks, a 4 × 4 window size is applied and each TransBlock contains N = 2 layers of MHSA and FFN.
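A rough sketch of the DownTransBlock under this description (the block size, the reuse of TransBlockLayer from the previous sketch, and the omission of local-window partitioning are our simplifications):

    import torch
    import torch.nn as nn

    class DownTransBlock(nn.Module):
        """Fold small spatial blocks into the channel dimension, project back to
        the original channel count with a 1x1 convolution, then apply a regular
        TransBlock layer."""
        def __init__(self, channels: int, block: int, num_heads: int = 8):
            super().__init__()
            self.unshuffle = nn.PixelUnshuffle(block)                      # (C, H, W) -> (C*block^2, H/block, W/block)
            self.proj = nn.Conv2d(channels * block * block, channels, kernel_size=1)
            self.trans = TransBlockLayer(d=channels, num_heads=num_heads)  # from the previous sketch

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.proj(self.unshuffle(x))                   # spatial downsampling, same channel count
            b, c, h, w = x.shape
            seq = x.flatten(2).transpose(1, 2)                 # (B, h*w, C)
            return self.trans(seq).transpose(1, 2).reshape(b, c, h, w)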
2) TransContext MODEL
In [6], the context model is a simple masked convolution layer with 5 × 5 kernels: a symbol is decoded based on previously decoded symbols above and to the left of the current symbol within the window, so the context information is constrained to local windows. We propose a transformer-based context model, called TransContext, to allow more context to be used for prediction.
During training, we use masked multi-head attention modules [33] in the TransContext model so that the network can back-propagate for gradient calculation. Fig. 5 gives an illustration of the masked attention module for a tensor with an input size of 2 × 2 × d. The mask shown in the figure has 0s at and below the diagonal, and the values above the diagonal are set to negative infinity.
Given an input from the quantized feature representation ŷ, we first flatten the tensor and pad it with an all-zero vector at the beginning. For the current symbol (the upper-right value of the input), the output of the corresponding position (the second vector) only depends on the first vector and the padded 0s. In the softmax operation, the product of q and k is added to the mask, so that the values corresponding to 0s in the mask are unchanged while the values above the diagonal are pushed to negative infinity and receive near-zero attention weight after the softmax.
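A single-head sketch of this masked attention (the actual TransContext model uses masked multi-head attention [33]; the names and shapes here are illustrative):

    import torch
    import torch.nn.functional as F

    def transcontext_masked_attention(y_hat_flat, w_q, w_k, w_v):
        """Causal masked attention over one flattened segment of the quantized
        latent. y_hat_flat: (L, d); w_q/w_k/w_v: (d, d) projection weights.
        Output i may only depend on the zero-padded start vector and positions
        before i."""
        L, d = y_hat_flat.shape
        x = torch.cat([torch.zeros(1, d), y_hat_flat], dim=0)                       # pad an all-zero vector at the start
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = (q @ k.t()) / d ** 0.5
        mask = torch.triu(torch.full((L + 1, L + 1), float("-inf")), diagonal=1)    # -inf strictly above the diagonal
        attn = F.softmax(scores + mask, dim=-1)                                     # masked positions get ~0 weight
        return (attn @ v)[:-1]                                                      # row i is the context for symbol i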
FIGURE 7. (a) PSNR/bpp and (b) MS-SSIM/bpp on the Kodak dataset.
FIGURE 8. (a) PSNR/bpp and (b) MS-SSIM/bpp on the BSD100 dataset.
B. EXPERIMENTAL RESULTS
1) RESULTS ON KODAK DATASET
Fig. 7 shows the comparison of our results with conventional codecs (JPEG [14], JPEG2000 [22], BPG420 and BPG444) and the learned variable-rate image compression models Cai2018 [9], Zhang2019 [10], Akbari2020 [11] and Fu2021 [28] on the Kodak dataset, in terms of PSNR and MS-SSIM versus bits per pixel (bpp). Our approach achieves PSNR comparable to BPG420. The first point in the R-D curve reflects the influence of the proposed compression model in the base layer in Fig. 1. Our result in the base layer is 0.75 dB higher than Akbari2020 and 1.7 dB higher than Fu2021 at 0.15 bpp, where BPG444 is also applied for the residual coding.
For MS-SSIM in (b), we have better performance at the base layer (first point at 0.21 bpp) than Akbari2020 and Fu2021. Compared with BPG444, we are 0.021 higher at 0.21 bpp. However, as the bitrate increases, our MS-SSIM result saturates to that of BPG444, which is similar to Fu2021. The MS-SSIM of our method is better than the traditional codecs and Zhang2019, and it also outperforms Cai2018 at low bit rates. Our approach does not show advantages for MS-SSIM at high bit rates because the residual coding with the classic codec BPG is not optimized for MS-SSIM. However, the residual coding strategy provides an effective and simple way to achieve variable-rate image compression.

2) RESULTS ON BSD100 DATASET
The methods Cai2018, Zhang2019 and Akbari2020 only report results on the Kodak dataset. We also compare our results with JPEG, JPEG2000, BPG420 and BPG444 and the learned variable-rate image compression model Fu2021 on the BSD100 dataset, as given in Fig. 8. The overall trend is consistent with that on the Kodak dataset, which shows that the trained models generalize well.

3) ABLATION TEST: NON OVERLAP vs. OVERLAP
We experiment with two different partition schemes, which we call non-overlap and overlap, on the Kodak dataset, as shown in Fig. 9. For non-overlap, we set the patch size to 16 × 16. For overlap, the patch size is 18 with
a stride of 16, so the overlapping areas are two pixels in each direction. With the same λ in Eq. 1, the calculated bit rate for non-overlap is lower than that for overlap at the base layer. After applying the BPG residual coding, overlap (green curve) shows generally better PSNR performance than non-overlap (blue curve). For non-overlap, only local information is used to construct each patch in the last linear layer, so the edge values of neighboring patches can have a large variance. For overlap, this inconsistency is averaged out, which reduces the blocking artifacts as shown in Fig. 6 (b).
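A minimal sketch of this overlap averaging (the array layout and function name are ours):

    import numpy as np

    def merge_overlapping_patches(patches, coords, out_h, out_w):
        """Merge reconstructed patches into an image, averaging pixels covered by
        more than one patch (e.g. 18x18 patches on a stride-16 grid as above).
        `patches` is a list of (n, n, 3) arrays; `coords` holds the matching
        top-left (row, col) positions."""
        acc = np.zeros((out_h, out_w, 3), dtype=np.float64)
        cnt = np.zeros((out_h, out_w, 1), dtype=np.float64)
        for patch, (r, c) in zip(patches, coords):
            n = patch.shape[0]
            acc[r:r + n, c:c + n] += patch
            cnt[r:r + n, c:c + n] += 1.0
        return acc / np.maximum(cnt, 1.0)   # average the overlapping regions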
4) ABLATION TEST: TransBlocks AND TransContext
To prove the effectiveness of the TransBlocks and TransContext, we remove them respectively while keeping the remaining parts of the model the same. Tab. 1 shows the results with MSE optimization at 0.15 bpp without the deblocking post-processing. The first row shows the result without the TransContext model, the second row gives the result without the TransBlocks in the main encoder and decoder, and the third row shows the result for the model with both modules. Compared with convolutional layers, where the receptive field size is constrained by the kernel size, TransBlocks can extract long-range dependency from the feature tensor.

C. TIME COMPLEXITY
We discuss the running time of our framework for inference on an E5-2620 v4 CPU (2.10 GHz) with 128 GB RAM. The most time-consuming part of the model is the context model, which needs to decode sequentially from previously decoded symbols. The main encoder and decoder networks take about 0.05 s and 0.04 s for one forward step. For the hyper-encoder and hyper-decoder networks, the running times are 0.01 s and 0.01 s. The hyper-latent ẑ can be encoded and decoded in parallel. The decoding time for one position is around 0.04 s. In [6], the latent ŷ can only be decoded one symbol at a time, which takes h × w forward steps of the entropy model. In our framework, the transformer is applied on local windows of size h/2 × w/2, so the number of sequential forward steps is reduced to 1/4 (each of the four windows contains hw/4 symbols and the windows are decoded in parallel). The running time for one window is around 177 s. Note that the arithmetic coding in our experiment is not optimized (see https://github.com/nayuki/Reference-arithmetic-coding). Since different platforms may affect the elapsed time, we also test the original context model with a masked convolution layer (5×5 kernels) on this device. Its running time is about 240 s, which is longer than our scheme.

D. EXAMPLES
In Fig. 10 and Fig. 11, we show reconstructed examples from different methods. In Fig. 10, the results in (e) and (f) show clearer lines on the sail but blurrier human faces.
The results in (b) and (c) contain more detailed features on human faces. This is because the results in (e) and (f) are obtained from the base-layer model trained with the MS-SSIM loss. As the MS-SSIM loss focuses more on overall structures, the MS-SSIM values in (e) and (f) are higher than BPG444, whereas the PSNR values in (e) and (f) are relatively low.
At the higher bit rate in Fig. 11, the visual difference is not that significant. The MS-SSIM in (f) is slightly better than BPG444 in (d) with 0.04 bpp less. Note that in (e) and (f), due to the residual coding scheme based on BPG, which is optimized for MSE, the results also have high PSNR values. The wall texture on the left in (a), obtained with BPG420, is not well restored, and the corner between the roof and the wall contour on the left in (d) is blurry. In both figures, our results improve when the deblocking module is added, compared with those from the model optimized with the corresponding loss alone.

V. CONCLUSION
We propose to incorporate vision transformers into a variable-rate learned image compression framework. Different transformer blocks are applied to meet the various requirements in the subnetworks. Compared with other variable-rate learned image compression networks, our framework achieves higher PSNR across a range of bit rates and higher MS-SSIM at low bit rates. Ablation experiments show the effectiveness of the proposed TransBlocks and TransContext model. We also experiment with two different image patch strategies and show that the overlap partition achieves better compression performance than the non-overlap partition. Finally, we discuss the time complexity of our model, which reduces the inference time of the autoregressive context model.
When applying vision transformers, the sequence length, which is the number of image patches in our framework, determines the computation cost. More layers of the transformer block can be added and explored if the sequence length can be further reduced. In future work, we may mask out some of the patches and apply image inpainting techniques to fill the masked patches at the decoder.
REFERENCES
[1] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," 2016, arXiv:1611.01704.
[2] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1141–1151.
[3] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3214–3223.
[4] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, "Conditional probability models for deep image compression," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4394–4402.
[5] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, "Energy compaction-based image compression using convolutional AutoEncoder," IEEE Trans. Multimedia, vol. 22, no. 4, pp. 860–873, Apr. 2020.
[6] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 10771–10780.
[7] J. Lee, S. Cho, and S.-K. Beack, "Context-adaptive entropy model for end-to-end optimized image compression," in Proc. 7th Int. Conf. Learn. Represent., May 2019, pp. 1–20.
[8] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, "Non-local attention optimized deep image compression," 2019, arXiv:1904.09757.
[9] C. Cai, L. Chen, X. Zhang, and Z. Gao, "Efficient variable rate image compression with multi-scale decomposition network," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 12, pp. 3687–3700, Dec. 2018.
[10] Z. Zhang, Z. Chen, J. Lin, and W. Li, "Learned scalable image compression with bidirectional context disentanglement network," in Proc. IEEE Int. Conf. Multimedia Expo. (ICME), Jul. 2019, pp. 1438–1443.
[11] M. Akbari, J. Liang, J. Han, and C. Tu, "Learned variable-rate image compression with residual divisive normalization," in Proc. IEEE Int. Conf. Multimedia Expo. (ICME), Jul. 2020, pp. 1–6.
[12] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," 2018, arXiv:1802.01436.
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[14] G. K. Wallace, "The JPEG still picture compression standard," Commun. ACM, vol. 34, no. 4, pp. 30–44, Apr. 1991.
[15] N. Ahn, B. Kang, and K.-A. Sohn, "Fast, accurate, and lightweight super-resolution with cascading residual network," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 252–268.
[16] D. Liu, X. Sun, and F. Wu, "Inpainting with image patches for compression," J. Vis. Commun. Image Represent., vol. 23, pp. 100–113, Jan. 2012.
[17] N. Krishnaraj, M. Elhoseny, M. Thenmozhi, M. M. Selim, and K. Shankar, "Deep learning model for real-time image compression in Internet of Underwater Things (IoUT)," J. Real-Time Image Process., vol. 17, pp. 2097–2111, May 2019.
[18] B. Sujitha, V. S. Parvathy, E. L. Lydia, P. Rani, Z. Polkowski, and K. Shankar, "Optimal deep learning based image compression technique for data transmission on industrial Internet of Things applications," Trans. Emerg. Telecommun. Technol., vol. 32, Apr. 2020, Art. no. e3976.
[19] M. Akbari, J. Liang, J. Han, and C. Tu, "Generalized octave convolutions for learned multi-frequency image compression," 2020, arXiv:2002.10032.
[20] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, "Learned image compression with discretized Gaussian mixture likelihoods and attention modules," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7939–7948.
[21] J. Lee, S. Cho, and M. Kim, "An end-to-end joint learning scheme of image compression and quality enhancement with improved entropy minimization," 2019, arXiv:1912.12817.
[22] Information Technology JPEG 2000 Image Coding System: Core Coding System, International Organization for Standardization, Geneva, Switzerland, Dec. 2000.
[23] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," 2015, arXiv:1511.06085.
[24] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4385–4393.
[25] Z. Guo, Z. Zhang, and Z. Chen, "Deep scalable image compression via hierarchical feature decorrelation," in Proc. Picture Coding Symp. (PCS), Nov. 2019, pp. 1–5.
[26] T. Chen and Z. Ma, "Variable bitrate image compression with quality scaling factors," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 2163–2167.
[27] Y. Choi, M. El-Khamy, and J. Lee, "Variable rate deep image compression with a conditional autoencoder," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3146–3154.
[28] H. Fu, F. Liang, B. Lei, Q. Zhang, J. Liang, C. Tu, and G. Zhang, "An extended context-based entropy hybrid modeling for image compression," Signal Process., Image Commun., vol. 95, Jul. 2021, Art. no. 116244.
[29] M. Akbari, J. Liang, J. Han, and C. Tu, "Learned multi-resolution variable-rate image compression with octave-based residual blocks," IEEE Trans. Multimedia, vol. 23, pp. 3013–3021, 2021.
[30] A. Tursunov, Mustaqeem, J. Y. Choeh, and S. Kwon, "Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms," Sensors, vol. 21, no. 17, p. 5892, Sep. 2021.
[31] B. Li, J. Liang, and Y. Wang, "Compression artifact removal with stacked multi-context channel-wise attention network," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 3601–3605.
[32] K. Muhammad, A. Ullah, A. S. Imran, M. Sajjad, M. S. Kiran, G. Sannino, and V. H. C. de Albuquerque, "Human action recognition using attention based LSTM network with dilated CNN features," Future Gener. Comput. Syst., vol. 125, pp. 820–830, Dec. 2021.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[34] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2020, pp. 213–229.
[35] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, "MaX-DeepLab: End-to-end panoptic segmentation with mask transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 5463–5474.
[36] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, "Pre-trained image processing transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12299–12310.
[37] H. Yan, Z. Li, W. Li, C. Wang, M. Wu, and C. Zhang, "ConTNet: Why not use convolution and transformer at the same time?" 2021, arXiv:2104.13497.
[38] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12873–12883.
[39] J. Ballé, V. Laparra, and E. P. Simoncelli, "Density modeling of images using a generalized normalization transformation," 2015, arXiv:1511.06281.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[41] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proc. 37th Asilomar Conf. Signals, Syst. Comput., vol. 2, Jul. 2003, pp. 1398–1402.
[42] C. Dong, Y. Deng, C. C. Loy, and X. Tang, "Compression artifacts reduction by a deep convolutional network," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 576–584.
[43] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2014, pp. 740–755.
[44] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2, Jun. 2001, pp. 416–423.
[45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, and A. Desmaison, "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

BINGLIN LI received the B.Eng. degree in communication engineering from Wuhan University, Wuhan, China, in 2014, and the M.Sc. degree in engineering from the University of Manitoba, Winnipeg, Canada, in 2016. She is currently pursuing the Ph.D. degree with the Multimedia Laboratory, School of Engineering Science, Simon Fraser University, Burnaby, Canada. Since 2018, she has been a Research Assistant with the Multimedia Laboratory, School of Engineering Science, Simon Fraser University. Her research interests include deep-learning-based image compression and computer vision.

JIE LIANG (Senior Member, IEEE) received the B.E. and M.E. degrees from Xi'an Jiaotong University, China, in 1992 and 1995, respectively, the M.E. degree from the National University of Singapore, in 1998, and the Ph.D. degree from Johns Hopkins University, USA, in 2003. From 2003 to 2004, he worked with the Microsoft Digital Media Division, Video Codec Group. Since May 2004, he has been with the School of Engineering Science, Simon Fraser University, Canada, where he is currently a Professor. His research interests include image and video processing, computer vision, and deep learning. Prof. Liang received the 2014 IEEE TCSVT Best Associate Editor Award, the 2014 SFU Dean of Graduate Studies Award for Excellence in Leadership, and the 2015 Canada NSERC Discovery Accelerator Supplements (DAS) Award. He served as an Associate Editor for several journals, including IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), and IEEE SIGNAL PROCESSING LETTERS. He has also served on three IEEE Technical Committees.

JINGNING HAN (Senior Member, IEEE) received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2007, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, in 2008 and 2012, respectively. He joined the WebM Codec Team, Google, Mountain View, CA, USA, in 2012, where he is the Main Architect of the VP9 and AV1 codecs, and leads the Software Video Codec Team. He has published more than 60 research articles and holds more than 50 U.S. patents in the field of video coding. His research interests include video coding and computer science architecture. Dr. Han received the Dissertation Fellowship from the Department of Electrical and Computer Engineering, University of California at Santa Barbara, in 2012. He was a recipient of the Best Student Paper Award at the IEEE International Conference on Multimedia and Expo, in 2012. He also received the IEEE Signal Processing Society Best Young Author Paper Award, in 2015.