Variable-Rate Deep Image Compression With Vision Transformers
Digital Object Identifier 10.1109/ACCESS.2022.3173256
ABSTRACT Recently, vision transformers have been applied to many computer vision problems due to their long-range learning ability. However, they have not been thoroughly explored in image compression. We propose a patch-based learned image compression network that incorporates vision transformers. The input image is divided into patches before being fed to the encoder, and the patches reconstructed by the decoder are merged to form a complete image. Different kinds of transformer blocks (TransBlocks) are applied to meet the various requirements of the subnetworks. We also propose a transformer-based context model (TransContext) to facilitate the coding based on previously decoded symbols. Since the computational complexity of the attention mechanism in transformers is a quadratic function of the sequence length, we partition the feature tensor into segments and conduct the transformer in each segment to save computational cost. To alleviate the compression artifacts, we use overlapping patches and apply an existing deblocking network to further remove the artifacts. Finally, a residual coding scheme is adopted to obtain compression at variable bit rates. We show that our patch-based learned image compression with transformers obtains a 0.75 dB PSNR improvement at 0.15 bpp over the prior variable-rate compression work on the Kodak dataset. When using the residual coding strategy, our framework maintains good PSNR performance and is comparable to BPG420. For MS-SSIM, we obtain higher results than BPG444 across a range of bit rates (0.021 higher at 0.21 bpp) and than other variable-rate learned image compression models at low bit rates.
I. INTRODUCTION
Recently, there has been a line of research [1]–[8] on deep image compression. Autoencoder approaches [6]–[8] with joint autoregressive and hierarchical hyperprior models have become the mainstream practice for learning-based image compression. Although these methods show promising compression performance compared with conventional image codecs, there are two main drawbacks in real applications.
Firstly, a separate model needs to be trained for each bit rate, which increases the coding complexity. To this end, variable-bitrate image compression models [9]–[11] have been developed to cover various bit rates with one trained model. In particular, in [11], a layered coding scheme is developed, where the base-layer feature map is obtained by a deep learning (DL) network, and the residual between the input and the base-layer reconstruction is coded by a traditional method to cover more bit rates. Motivated by [11], in this paper, we propose a more effective learned image compression framework by incorporating transformers in the base layer and apply residual coding to achieve compression across a range of bit rates with a single model. In [11], only eight feature maps are used for the compact representation, which limits the learning capability of the base layer. In our framework, a hyperprior network [12] is adopted to estimate the distribution parameters of the quantized feature representation, so that the channel dimension of the representation can be set larger in the base layer. Experimental results show a 0.75 dB PSNR improvement over [11] at 0.15 bpp. When using the residual coding strategy, our framework keeps good PSNR performance and is comparable to BPG420, whereas the performance of [11] is lower than BPG420.
Secondly, although the masked convolutional context model in [6]–[8] achieves better compression performance than the scale-only hyperprior model [12], it brings extra computational overhead because of the sequential decoding process. Besides, the context information is constrained to 5 × 5 windows. In this paper, we leverage transformers to capture the long-range dependency and use the masked multi-head attention module of transformers during training to guarantee the causal relationship; we call this the TransContext model. Equipped with local transformers, we divide the latent representation into segments and conduct the transformer in each segment. Each segment can thus be processed in parallel to reduce the total decoding time.
In our framework, only local transformer blocks (TransBlocks) are adopted, since the computational complexity of the attention mechanism in transformers is a quadratic function of the sequence length. Similar to [13], we divide the input image into image patches. In this way, the spatial size of the input is reduced and we can feed the patch features to the transformer blocks in the encoder. At the output of the decoder, an image patch is reconstructed based on the information from the local patch, and all the patches are merged into a complete image. This can produce blocking artifacts at the patch boundaries [14]. To alleviate these artifacts, we use image patches with a small portion of overlap as the input to the network and average the values at positions where two outputs are predicted. We find this strategy improves the compression performance to some extent. To further remove the artifacts, a post-processing network [15] is applied.
When the network is designed with fully convolutional layers, the size of the input image can be arbitrary, and previous works use the entire image as input to the compression network. Patch-based learned image compression has not been explored. Although patch-based image compression methods may result in blocking artifacts as in JPEG [14], they have their advantages. In [16], inpainting techniques are embedded into a patch-based image compression framework to improve the compression performance. In our work, image patches are used as input to cope with the demanding computation resources required by a high-resolution input to transformers.
Our contributions include: 1) We build an effective patch-based learned image compression network with vision transformers in the base layer, based on [11], for variable-rate deep image compression. To alleviate the compression artifacts resulting from patch reconstructions, we partition the image into overlapping patches and utilize an existing deblocking network to further remove the blocking artifacts. 2) Different kinds of transformer blocks are applied to meet the various requirements of the subnetworks. 3) We propose a transformer-based context model to facilitate the Gaussian parameter predictions based on the previously decoded symbols. It is performed on segments of the quantized latent representation and thus can reduce the total decoding time compared with the masked convolution context model in [6].
The rest of the paper is organized as follows. In Section II, we discuss related work on learned image compression and transformers in vision applications. Then we introduce our framework and explain its building blocks. Experimental results on the Kodak dataset and discussions are presented in Section IV. Section V concludes the paper.

II. RELATED WORKS
A. LEARNED IMAGE COMPRESSION
Many learned image compression models have been proposed with the prevalence of DL techniques in various research fields. Some works study learning-based image compression in specific scenarios. In [17], a discrete wavelet transform based DL model is proposed for the Internet of Underwater Things. [18] presents a compression model using a convolutional neural network for remote sensing images.
In this paper, we focus on deep image compression for natural RGB images. In [12], a hyperprior network is proposed to learn the scale parameters of the Gaussian scale mixture model used as the entropy model. The hyper-latent is transmitted as side information to help decode the main latent. However, the estimation is not image-dependent or spatially adaptive after training. In [6]–[8], the main latent representation is modeled by a Gaussian distribution with parameters learned from the context and prior information. The context model combines the information from neighboring decoded symbols and thus gives a more accurate prediction. This has become the classic learned image compression approach due to its superior PSNR and MS-SSIM performance compared with previous works.
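For concreteness, the following is a minimal sketch (ours, not code from the cited works) of how such a Gaussian entropy model turns predicted means and scales into a rate estimate; the function name and the clamping constants are illustrative.

    import torch

    def gaussian_rate_bits(y_hat: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        """Estimated bits for integer-quantized symbols y_hat under N(mu, sigma^2):
        each symbol's probability is the Gaussian mass over its quantization bin
        [y_hat - 0.5, y_hat + 0.5]."""
        dist = torch.distributions.Normal(mu, sigma.clamp(min=1e-6))
        p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
        return -torch.log2(p.clamp(min=1e-9)).sum()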
In [8], non-local blocks are embedded in the encoder and decoder networks to learn the long-range dependency. However, the self-attention mechanism results in a non-negligible computational cost, which imposes restrictions on the compression framework design. Based on this joint architecture, a new framework that combines octave convolutions was recently applied in [19] and achieves higher results than VVC (4:2:0) and other DL-based image compression models. Later works extend the single Gaussian probability model in [6] to a Gaussian mixture model (GMM) [20], [21] and show better compression efficiency. In [21], a joint optimization of the image compression and quality enhancement models is applied: the loss at the output of the compression network acts as an intermediate supervision, and the output of the quality enhancement model is the final reconstruction. Such post-processing techniques are also commonly employed with the earlier codec JPEG [22] to remove compression artifacts.

1) VARIABLE-RATE LEARNED IMAGE COMPRESSION
In [23], [24], variable-rate image compression models are proposed based on convolutional and deconvolutional LSTM recurrent networks. In [25], four code layers, including a base layer and three enhancement layers, are adopted to construct a scalable image compression framework. A decorrelation unit is utilized to remove the redundancy between the base layer and the current enhancement layer.
During inference, the output of each layer corresponds to the reconstruction at a certain bitrate. These methods rely on layered architectures to adjust the bitrate and are not flexible enough to hit a specific rate target. In [9], a multi-scale decomposition transform is learned, and a rate allocation algorithm is used to determine the optimal scale of each image block based on content complexity given a target rate. In [10], the authors apply bit-plane decomposition before the transform and introduce a bidirectional network to disentangle the information of different bit-planes. However, the performance of [9] and [10] still has a large gap from the state-of-the-art. In [26], a set of scaling factors is applied to the quantized feature map of a pre-trained high bit-rate model to fine-tune for lower bit rates while keeping the main parameters fixed. For bit rates far from the bit rate of the pre-trained model, the performance is not satisfactory. In [27], a conditional autoencoder is proposed with coarse rate control by the Lagrange multiplier and fine tuning by the quantization bin size. The fine-tuning process is conducted on intervals between individually trained models. Therefore, to obtain compression results for a wide range of bitrates, it is still required to train multiple discrete models.
In [11], [28], [29], a hybrid architecture that combines a learning-based model and a conventional codec is proposed. BPG-based residual coding is applied as the enhancement layer to obtain compression results for the subsequent bit rates. However, only eight feature maps are used for the compact representation in [11], which limits the learning capability of the base layer. Based on this, we build a more effective model for the base layer.
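To make the hybrid scheme concrete, the following minimal sketch illustrates residual coding around a base-layer reconstruction; the codec wrapper and the +128 shift are our assumptions, not the exact procedure of [11], [28], [29].

    import numpy as np

    def residual_coding(x: np.ndarray, x_base: np.ndarray, codec_round_trip) -> np.ndarray:
        """Enhancement-layer residual coding: the residual between the input and the
        base-layer reconstruction is shifted into [0, 255], compressed with a
        traditional codec (BPG in the hybrid schemes above), and added back after
        decoding. `codec_round_trip` is a user-supplied encode+decode wrapper."""
        res = x.astype(np.int16) - x_base.astype(np.int16)
        res_shifted = np.clip(res + 128, 0, 255).astype(np.uint8)      # map signed residual to pixel range
        res_hat = codec_round_trip(res_shifted).astype(np.int16) - 128
        return np.clip(x_base.astype(np.int16) + res_hat, 0, 255).astype(np.uint8)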
B. TRANSFORMERS IN VISION
Attention mechanisms are widely applied in DL models for speech processing and computer vision problems [20], [30]–[32]. Transformers [33] with multi-head attention have become the predominant DL models for natural language processing (NLP). Due to their ability to learn long-range interactions on sequential data, transformers have recently been migrated to many computer vision tasks such as image classification [13], object detection [34], segmentation [35], as well as low-level computer vision tasks [36]. However, purely using transformers instead of convolution layers requires pre-training on a very large-scale dataset and consumes a vast amount of training time [13] to reach comparable or better performance than convolutional networks. Other works integrate convolution layers and transformers to improve results at similar computational complexity [34], [35], [37].
In [38], transformers are applied on convolutional feature maps and followed by a convolutional decoder to synthesize high-resolution scene images. This approach leverages the autoregressive structure of transformers to predict the current index based on previous indices, a property that also suits the context model in the entropy coding module of image compression. In [38], the standard transformer layers are applied, whereas in our work different transformer modules are developed. In addition, we apply transformers in local windows so that symbols in each window can be decoded in parallel, which compensates for the expensive time cost of the context model in [6]–[8].

III. OUR APPROACH
We propose an effective learned image compression framework by incorporating transformers in the base layer and apply residual coding [11] to achieve compression across a range of bit rates. No previous work has applied vision transformers to variable-rate image compression models. Our patch-based framework, along with the post-processing step, performs better in the base layer than other baselines with residual coding.
FIGURE 1. Our proposed deep image compression model in the base layer. The input image x is divided into patches with a size of 3 × n × n. Patches are flattened to vectors to form a 3D tensor as input to the main encoder. The main decoder outputs a 3D tensor, and the vector at each spatial position is reshaped to a 3 × n × n patch. The patches are merged into the reconstructed image.
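As a minimal sketch of the patch formatting in Fig. 1 (the function name, tensor layout, and default sizes are illustrative), an image batch can be turned into the 3D token tensor as follows. With the stride equal to the patch size the patches tile the image exactly; a smaller stride gives the overlapping patches discussed in Section I.

    import torch

    def image_to_patch_tokens(x: torch.Tensor, n: int = 16, stride: int = 16) -> torch.Tensor:
        """Split an image batch (B, 3, H, W) into n x n patches (overlapping when
        stride < n) and flatten each patch to a vector, producing a
        (B, H', W', 3*n*n) tensor for the main encoder."""
        patches = x.unfold(2, n, stride).unfold(3, n, stride)   # (B, 3, H', W', n, n)
        b, c, hp, wp, _, _ = patches.shape
        return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, hp, wp, c * n * n)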
When optimizing for MS-SSIM, the distortion term follows the multi-scale SSIM definition in [41], where αM, βj and γj are the relative importance of the luminance, contrast and structure terms. The final compression loss optimized with MS-SSIM in our experiments is L = R + 8 × (1 − D_MS-SSIM) for the base layer.
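A minimal sketch of this objective, assuming a third-party MS-SSIM implementation such as pytorch_msssim and a rate term already expressed in bits per pixel:

    import torch
    from pytorch_msssim import ms_ssim  # assumed third-party MS-SSIM implementation

    def base_layer_loss(rate_bpp: torch.Tensor, x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
        """Base-layer objective L = R + 8 * (1 - MS-SSIM) used for MS-SSIM optimization."""
        d_ms_ssim = 1.0 - ms_ssim(x_hat, x, data_range=1.0)   # distortion term in [0, 1]
        return rate_bpp + 8.0 * d_ms_ssim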
1) TransBlocks
We explore transformer blocks to extract long-range information in the learned image compression network. The original transformer block in [33] is shown in Fig. 3. One transformer block contains a multi-head attention network and a point-wise feed-forward network.
We denote the number of heads as m. The input tensor X is divided into m heads with d_i = d/m dimensions for each head (i = 1, 2, ..., m). For a tensor X_i ∈ R^(hw×d_i), the multi-head attention process can be represented by the equations below, where hw is the sequence length and d_i is the vector dimension of the i-th head. When X is reshaped to a sequence of length hw, the position information is lost, so a positional encoding module is added to provide spatial information at the input.

    Q_i = X_i W_Qi^T,   K_i = X_i W_Ki^T,   V_i = X_i W_Vi^T
    Z_i = Softmax(Q_i K_i^T / √d_k) V_i
    O(Q, K, V) = Concat(Z_1, Z_2, ..., Z_m) W_O^T                      (4)

where W_Qi, W_Ki, W_Vi and W_O are the weights of the linear layers and √d_k is a scaling factor. In the second equation, Softmax is the softmax operation that produces the attention scores. In the third equation, the weighted vectors from each head are concatenated as the final output. The attention here is referred to as multi-head self-attention (MHSA), since the three items Q, K and V are obtained from the same input X.
The output of MHSA is then fed into the feed-forward network (FFN) as

    f(O(Q, K, V)) = ReLU(O(Q, K, V) W_1^T + b_1) W_2^T + b_2           (5)

where W_1, W_2 are the weights and b_1, b_2 are the biases of the linear layers, and ReLU is a ReLU activation layer.
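A minimal PyTorch sketch of one MHSA + FFN layer corresponding to Eq. (4) and (5); the local-window partitioning, the 2D positional encoding, and the FFN hidden width (set to 4d here) are simplifications of ours.

    import torch
    import torch.nn as nn

    class TransBlockLayer(nn.Module):
        """One MHSA + FFN layer following Eq. (4)-(5)."""
        def __init__(self, d: int, num_heads: int = 8):
            super().__init__()
            self.mhsa = nn.MultiheadAttention(embed_dim=d, num_heads=num_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, hw, d), one flattened local window per sequence
            z, _ = self.mhsa(x, x, x)   # multi-head self-attention, Eq. (4)
            return self.ffn(z)          # point-wise feed-forward network, Eq. (5)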
Differing from [33] for machine translation tasks, the positional encoding module is based on a 2D fixed sine function for images. The periodic property of the sine function allows extension to longer sequence lengths. In addition, in order to obtain a compact feature representation for an image and reconstruct it after the decoder, the transformer blocks need to be scalable in spatial size, which is not required in [33] for language modeling. We propose the DownTransBlock and UpTransBlock, as depicted in Fig. 4, to meet the various requirements in the architecture.
The regular TransBlock has input and output with the same spatial size, similar to [13]. The DownTransBlock is modified to reduce the output size by a factor of 2, as shown in the first row of Fig. 4. The input tensor is divided into 4×4 blocks. We then flatten each block to a vector and use a convolution layer with a 1×1 kernel to reduce the channel size to that of the input tensor. The resulting tensor is then followed by the regular TransBlock. The UpTransBlock is the inverse operation of the DownTransBlock, as shown in the second row of Fig. 4. The DownTransBlock is applied in the hyper-encoder network to obtain the compressed hyper-latent ẑ, and the UpTransBlock is used to transform back in order to predict the Gaussian parameters for ŷ.
All the TransBlocks are conducted in local windows. Based on the spatial size of the tensors, we use an 8 × 8 window size for the main encoder and decoder networks, where each TransBlock contains N = 4 layers of MHSA and FFN. For the hyper-encoder and hyper-decoder networks, a 4 × 4 window size is applied and each TransBlock contains N = 2 layers of MHSA and FFN.
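A rough sketch of the DownTransBlock under this description (the block size, the reuse of TransBlockLayer from the previous sketch, and the omission of local-window partitioning are our simplifications):

    import torch
    import torch.nn as nn

    class DownTransBlock(nn.Module):
        """Fold small spatial blocks into the channel dimension, project back to
        the original channel count with a 1x1 convolution, then apply a regular
        TransBlock layer."""
        def __init__(self, channels: int, block: int, num_heads: int = 8):
            super().__init__()
            self.unshuffle = nn.PixelUnshuffle(block)                      # (C, H, W) -> (C*block^2, H/block, W/block)
            self.proj = nn.Conv2d(channels * block * block, channels, kernel_size=1)
            self.trans = TransBlockLayer(d=channels, num_heads=num_heads)  # from the previous sketch

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.proj(self.unshuffle(x))                   # spatial downsampling, same channel count
            b, c, h, w = x.shape
            seq = x.flatten(2).transpose(1, 2)                 # (B, h*w, C)
            return self.trans(seq).transpose(1, 2).reshape(b, c, h, w)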
2) TransContext MODEL
In [6], the context model is a simple masked convolution layer with 5 × 5 kernels: a symbol is decoded based on previously decoded symbols above and to the left of the current symbol within the window, so the context information is constrained to local windows. We propose a transformer-based context model, called TransContext, to allow more context to be used for prediction.
During training, we use masked multi-head attention modules [33] in the TransContext model so that the network can back-propagate for gradient calculation. Fig. 5 gives an illustration of the masked attention module for a tensor with an input size of 2 × 2 × d. The mask shown in the figure has 0s at and below the diagonal, and the values above the diagonal are set to negative infinity.
Given an input from the quantized feature representation ŷ, we first flatten the tensor and pad it with an all-zero vector at the beginning. For the current symbol (the upper-right value of the input), the output of the corresponding position (the second vector) only depends on the first vector and the padded 0s. In the softmax operation, the product of q and k is added to the mask, so that the values corresponding to 0s in the mask are unchanged while the values above the diagonal are pushed to negative infinity and receive near-zero attention weight after the softmax.
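A single-head sketch of this masked attention (the actual TransContext model uses masked multi-head attention [33]; the names and shapes here are illustrative):

    import torch
    import torch.nn.functional as F

    def transcontext_masked_attention(y_hat_flat, w_q, w_k, w_v):
        """Causal masked attention over one flattened segment of the quantized
        latent. y_hat_flat: (L, d); w_q/w_k/w_v: (d, d) projection weights.
        Output i may only depend on the zero-padded start vector and positions
        before i."""
        L, d = y_hat_flat.shape
        x = torch.cat([torch.zeros(1, d), y_hat_flat], dim=0)                       # pad an all-zero vector at the start
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = (q @ k.t()) / d ** 0.5
        mask = torch.triu(torch.full((L + 1, L + 1), float("-inf")), diagonal=1)    # -inf strictly above the diagonal
        attn = F.softmax(scores + mask, dim=-1)                                     # masked positions get ~0 weight
        return (attn @ v)[:-1]                                                      # row i is the context for symbol i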
FIGURE 7. (a) PSNR/bpp and (b) MS-SSIM/bpp on the Kodak dataset.
FIGURE 8. (a) PSNR/bpp and (b) MS-SSIM/bpp on the BSD100 dataset.
B. EXPERIMENTAL RESULTS
1) RESULTS ON KODAK DATASET
Fig. 7 shows the comparison of our results with conventional codecs (JPEG [14], JPEG2000 [22], BPG420 and BPG444) and the learned variable-rate image compression models Cai2018 [9], Zhang2019 [10], Akbari2020 [11] and Fu2021 [28] on the Kodak dataset, in terms of PSNR and MS-SSIM versus bits per pixel (bpp). Our approach achieves PSNR comparable to BPG420. The first point in the R-D curve reflects the influence of the proposed compression model in the base layer in Fig. 1. Our result in the base layer is 0.75 dB higher than Akbari2020 and 1.7 dB higher than Fu2021 at 0.15 bpp, where BPG444 is also applied for the residual coding.
For MS-SSIM in (b), we have better performance at the base layer (first point at 0.21 bpp) than Akbari2020 and Fu2021. Compared with BPG444, we are 0.021 higher at 0.21 bpp. However, as the bitrate increases, our MS-SSIM result saturates to that of BPG444, which is similar to Fu2021. The MS-SSIM of our method is better than the traditional codecs and Zhang2019, and it also outperforms Cai2018 at low bit rates. Our approach does not show advantages for MS-SSIM at high bit rates because the residual coding with the classic codec BPG is not optimized for MS-SSIM. However, the residual coding strategy provides an effective and simple way to achieve variable-rate image compression.

2) RESULTS ON BSD100 DATASET
The methods Cai2018, Zhang2019 and Akbari2020 only report results on the Kodak dataset. We also compare our results with JPEG, JPEG2000, BPG420 and BPG444 and the learned variable-rate image compression model Fu2021 on the BSD100 dataset, as given in Fig. 8. The overall trend is consistent with that on the Kodak dataset, which shows that the trained models generalize well.

3) ABLATION TEST: NON OVERLAP vs. OVERLAP
We experiment with two different partition schemes, which we call non-overlap and overlap, on the Kodak dataset, as shown in Fig. 9. For non-overlap, we set the patch size to 16 × 16. For overlap, the patch size is 18 with
a stride of 16, so the overlapping areas are two pixels in each direction. With the same λ in Eq. 1, the calculated bit rate for non-overlap is lower than that for overlap at the base layer. After applying the BPG residual coding, overlap (green curve) shows generally better PSNR performance than non-overlap (blue curve). For non-overlap, only local information is used to construct each patch in the last linear layer, so the edge values of neighboring patches can have a large variance. For overlap, this inconsistency is averaged out, which reduces the blocking artifacts as shown in Fig. 6 (b).
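A minimal sketch of this overlap averaging (the array layout and function name are ours):

    import numpy as np

    def merge_overlapping_patches(patches, coords, out_h, out_w):
        """Merge reconstructed patches into an image, averaging pixels covered by
        more than one patch (e.g. 18x18 patches on a stride-16 grid as above).
        `patches` is a list of (n, n, 3) arrays; `coords` holds the matching
        top-left (row, col) positions."""
        acc = np.zeros((out_h, out_w, 3), dtype=np.float64)
        cnt = np.zeros((out_h, out_w, 1), dtype=np.float64)
        for patch, (r, c) in zip(patches, coords):
            n = patch.shape[0]
            acc[r:r + n, c:c + n] += patch
            cnt[r:r + n, c:c + n] += 1.0
        return acc / np.maximum(cnt, 1.0)   # average the overlapping regions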
4) ABLATION TEST: TransBlocks AND TransContext
To prove the effectiveness of the TransBlocks and TransContext, we remove them respectively while keeping the remaining parts of the model the same. Tab. 1 shows the results with MSE optimization at 0.15 bpp without the deblocking post-processing. The first row shows the result without the TransContext model, the second row gives the result without the TransBlocks in the main encoder and decoder, and the third row shows the result for the model with both modules. Compared with convolutional layers, where the receptive field size is constrained by the kernel size, TransBlocks can extract long-range dependency from the feature tensor.

C. TIME COMPLEXITY
We discuss the running time of our framework for inference on an E5-2620 v4 CPU (2.10 GHz) with 128 GB RAM. The most time-consuming part of the model is the context model, which needs to decode sequentially from previously decoded symbols. The main encoder and decoder networks take about 0.05 s and 0.04 s for one forward step. For the hyper-encoder and hyper-decoder networks, the running times are 0.01 s and 0.01 s. The hyper-latent ẑ can be encoded and decoded in parallel. The decoding time for one position is around 0.04 s. In [6], the latent ŷ can only be decoded one symbol at a time, which takes h × w forward steps of the entropy model. In our framework, the transformer is applied on local windows of size h/2 × w/2, so the number of sequential forward steps is reduced to 1/4 (each of the four windows contains hw/4 symbols and the windows are decoded in parallel). The running time for one window is around 177 s. Note that the arithmetic coding in our experiment is not optimized (see https://github.com/nayuki/Reference-arithmetic-coding). Since different platforms may affect the elapsed time, we also test the original context model with a masked convolution layer (5×5 kernels) on this device. Its running time is about 240 s, which is longer than our scheme.

D. EXAMPLES
In Fig. 10 and Fig. 11, we show reconstructed examples from different methods. In Fig. 10, the results in (e) and (f) show clearer lines on the sail but blurrier human faces.
The results in (b) and (c) contain more detailed features on human faces. This is because the results in (e) and (f) are obtained from the base-layer model trained with the MS-SSIM loss. As the MS-SSIM loss focuses more on overall structures, the MS-SSIM values in (e) and (f) are higher than BPG444, whereas the PSNR values in (e) and (f) are relatively low.
At the higher bit rate in Fig. 11, the visual difference is not that significant. The MS-SSIM in (f) is slightly better than BPG444 in (d) with 0.04 bpp less. Note that in (e) and (f), due to the residual coding scheme based on BPG, which is optimized for MSE, the results also have high PSNR values. The wall texture on the left in (a), obtained with BPG420, is not well restored, and the corner between the roof and the wall contour on the left in (d) is blurry. In both figures, our results improve when the deblocking module is added, compared with those from the model optimized with the corresponding loss alone.

V. CONCLUSION
We propose to incorporate vision transformers into a variable-rate learned image compression framework. Different transformer blocks are applied to meet the various requirements in the subnetworks. Compared with other variable-rate learned image compression networks, our framework achieves higher PSNR across a range of bit rates and higher MS-SSIM at low bit rates. Ablation experiments show the effectiveness of the proposed TransBlocks and TransContext model. We also experiment with two different image patch strategies and show that the overlap partition achieves better compression performance than the non-overlap partition. Finally, we discuss the time complexity of our model, which reduces the inference time of the autoregressive context model.
When applying vision transformers, the sequence length, which is the number of image patches in our framework, determines the computation cost. More layers of the transformer block can be added and explored if the sequence length can be further reduced. In future work, we may mask out some of the patches and apply image inpainting techniques to fill the masked patches at the decoder.
REFERENCES
[1] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," 2016, arXiv:1611.01704.
[2] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1141–1151.
[3] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3214–3223.
[4] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, "Conditional probability models for deep image compression," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4394–4402.
[5] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, "Energy compaction-based image compression using convolutional AutoEncoder," IEEE Trans. Multimedia, vol. 22, no. 4, pp. 860–873, Apr. 2020.
[6] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 10771–10780.
[7] J. Lee, S. Cho, and S.-K. Beack, "Context-adaptive entropy model for end-to-end optimized image compression," in Proc. 7th Int. Conf. Learn. Represent., May 2019, pp. 1–20.
[8] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, "Non-local attention optimized deep image compression," 2019, arXiv:1904.09757.
[9] C. Cai, L. Chen, X. Zhang, and Z. Gao, "Efficient variable rate image compression with multi-scale decomposition network," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 12, pp. 3687–3700, Dec. 2018.
[10] Z. Zhang, Z. Chen, J. Lin, and W. Li, "Learned scalable image compression with bidirectional context disentanglement network," in Proc. IEEE Int. Conf. Multimedia Expo. (ICME), Jul. 2019, pp. 1438–1443.
[11] M. Akbari, J. Liang, J. Han, and C. Tu, "Learned variable-rate image compression with residual divisive normalization," in Proc. IEEE Int. Conf. Multimedia Expo. (ICME), Jul. 2020, pp. 1–6.
[12] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," 2018, arXiv:1802.01436.
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[14] G. K. Wallace, "The JPEG still picture compression standard," Commun. ACM, vol. 34, no. 4, pp. 30–44, Apr. 1991.
[15] N. Ahn, B. Kang, and K.-A. Sohn, "Fast, accurate, and lightweight super-resolution with cascading residual network," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 252–268.
[16] D. Liu, X. Sun, and F. Wu, "Inpainting with image patches for compression," J. Vis. Commun. Image Represent., vol. 23, pp. 100–113, Jan. 2012.
[17] N. Krishnaraj, M. Elhoseny, M. Thenmozhi, M. M. Selim, and K. Shankar, "Deep learning model for real-time image compression in Internet of Underwater Things (IoUT)," J. Real-Time Image Process., vol. 17, pp. 2097–2111, May 2019.
[18] B. Sujitha, V. S. Parvathy, E. L. Lydia, P. Rani, Z. Polkowski, and K. Shankar, "Optimal deep learning based image compression technique for data transmission on industrial Internet of Things applications," Trans. Emerg. Telecommun. Technol., vol. 32, Apr. 2020, Art. no. e3976.
[19] M. Akbari, J. Liang, J. Han, and C. Tu, "Generalized octave convolutions for learned multi-frequency image compression," 2020, arXiv:2002.10032.
[20] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, "Learned image compression with discretized Gaussian mixture likelihoods and attention modules," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7939–7948.
[21] J. Lee, S. Cho, and M. Kim, "An end-to-end joint learning scheme of image compression and quality enhancement with improved entropy minimization," 2019, arXiv:1912.12817.
[22] Information Technology JPEG 2000 Image Coding System: Core Coding System, International Organization for Standardization, Geneva, Switzerland, Dec. 2000.
[23] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," 2015, arXiv:1511.06085.
[24] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4385–4393.
[25] Z. Guo, Z. Zhang, and Z. Chen, "Deep scalable image compression via hierarchical feature decorrelation," in Proc. Picture Coding Symp. (PCS), Nov. 2019, pp. 1–5.
[26] T. Chen and Z. Ma, "Variable bitrate image compression with quality scaling factors," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 2163–2167.
[27] Y. Choi, M. El-Khamy, and J. Lee, "Variable rate deep image compression with a conditional autoencoder," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3146–3154.
[28] H. Fu, F. Liang, B. Lei, Q. Zhang, J. Liang, C. Tu, and G. Zhang, "An extended context-based entropy hybrid modeling for image compression," Signal Process., Image Commun., vol. 95, Jul. 2021, Art. no. 116244.
[29] M. Akbari, J. Liang, J. Han, and C. Tu, "Learned multi-resolution variable-rate image compression with octave-based residual blocks," IEEE Trans. Multimedia, vol. 23, pp. 3013–3021, 2021.
[30] A. Tursunov, Mustaqeem, J. Y. Choeh, and S. Kwon, "Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms," Sensors, vol. 21, no. 17, p. 5892, Sep. 2021.
[31] B. Li, J. Liang, and Y. Wang, "Compression artifact removal with stacked multi-context channel-wise attention network," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 3601–3605.
[32] K. Muhammad, A. Ullah, A. S. Imran, M. Sajjad, M. S. Kiran, G. Sannino, and V. H. C. de Albuquerque, "Human action recognition using attention based LSTM network with dilated CNN features," Future Gener. Comput. Syst., vol. 125, pp. 820–830, Dec. 2021.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[34] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2020, pp. 213–229.
[35] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, "MaX-DeepLab: End-to-end panoptic segmentation with mask transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 5463–5474.
[36] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, "Pre-trained image processing transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12299–12310.
[37] H. Yan, Z. Li, W. Li, C. Wang, M. Wu, and C. Zhang, "ConTNet: Why not use convolution and transformer at the same time?" 2021, arXiv:2104.13497.
[38] P. Esser, R. Rombach, and B. Ommer, "Taming transformers for high-resolution image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12873–12883.
[39] J. Ballé, V. Laparra, and E. P. Simoncelli, "Density modeling of images using a generalized normalization transformation," 2015, arXiv:1511.06281.
[40] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[41] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proc. 37th Asilomar Conf. Signals, Syst. Comput., vol. 2, Jul. 2003, pp. 1398–1402.
[42] C. Dong, Y. Deng, C. C. Loy, and X. Tang, "Compression artifacts reduction by a deep convolutional network," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 576–584.
[43] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer, 2014, pp. 740–755.
[44] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2, Jun. 2001, pp. 416–423.
[45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, and A. Desmaison, "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

BINGLIN LI received the B.Eng. degree in communication engineering from Wuhan University, Wuhan, China, in 2014, and the M.Sc. degree in engineering from the University of Manitoba, Winnipeg, Canada, in 2016. She is currently pursuing the Ph.D. degree with the Multimedia Laboratory, School of Engineering Science, Simon Fraser University, Burnaby, Canada. Since 2018, she has been a Research Assistant with the Multimedia Laboratory, School of Engineering Science, Simon Fraser University. Her research interests include deep-learning-based image compression and computer vision.

JIE LIANG (Senior Member, IEEE) received the B.E. and M.E. degrees from Xi'an Jiaotong University, China, in 1992 and 1995, respectively, the M.E. degree from the National University of Singapore, in 1998, and the Ph.D. degree from Johns Hopkins University, USA, in 2003. From 2003 to 2004, he worked with the Microsoft Digital Media Division, Video Codec Group. Since May 2004, he has been with the School of Engineering Science, Simon Fraser University, Canada, where he is currently a Professor. His research interests include image and video processing, computer vision, and deep learning. Prof. Liang received the 2014 IEEE TCSVT Best Associate Editor Award, the 2014 SFU Dean of Graduate Studies Award for Excellence in Leadership, and the 2015 Canada NSERC Discovery Accelerator Supplements (DAS) Award. He served as an Associate Editor for several journals, including IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), and IEEE SIGNAL PROCESSING LETTERS. He has also served on three IEEE Technical Committees.

JINGNING HAN (Senior Member, IEEE) received the B.S. degree in electrical engineering from Tsinghua University, Beijing, China, in 2007, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, in 2008 and 2012, respectively. He joined the WebM Codec Team, Google, Mountain View, CA, USA, in 2012, where he is the Main Architect of the VP9 and AV1 codecs, and leads the Software Video Codec Team. He has published more than 60 research articles and holds more than 50 U.S. patents in the field of video coding. His research interests include video coding and computer science architecture. Dr. Han received the Dissertation Fellowship from the Department of Electrical and Computer Engineering, University of California at Santa Barbara, in 2012. He was a recipient of the Best Student Paper Award at the IEEE International Conference on Multimedia and Expo, in 2012. He also received the IEEE Signal Processing Society Best Young Author Paper Award, in 2015.