

Activating More Pixels in Image Super-Resolution Transformer

Xiangyu Chen1,2,3 Xintao Wang4 Jiantao Zhou1 Yu Qiao2,3 Chao Dong2,3†


1 State Key Laboratory of Internet of Things for Smart City, University of Macau
2 Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
3 Shanghai Artificial Intelligence Laboratory    4 ARC Lab, Tencent PCG
{chxy95, xintao.alpha}@gmail.com  [email protected]  {yu.qiao, chao.dong}@siat.ac.cn
† Corresponding author.

Abstract

Transformer-based methods have shown impressive performance in low-level vision tasks such as image super-resolution. However, through attribution analysis we find that these networks can only utilize a limited spatial range of input information. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better reconstruction, we propose a novel Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages of being able to utilize global statistics and strong local fitting capability. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to exploit the potential of the model for further improvement. Extensive experiments show the effectiveness of the proposed modules, and we further scale up the model to demonstrate that the performance of this task can be greatly improved. Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.

Figure 1. Performance comparison on PSNR (dB) of the proposed HAT with the state-of-the-art methods SwinIR [31] and EDT [27]. HAT-L represents a larger variant of HAT. Our approach can surpass the state-of-the-art methods by 0.3dB∼1.2dB.

1. Introduction

Single image super-resolution (SR) is a classic problem in computer vision and image processing. It aims to reconstruct a high-resolution image from a given low-resolution input. Since deep learning was successfully applied to the SR task [10], numerous methods based on the convolutional neural network (CNN) have been proposed [8, 11, 12, 24, 29, 32, 68, 70] and have dominated this field in the past few years. Recently, due to its success in natural language processing, Transformer [53] has attracted the attention of the computer vision community. After making rapid progress on high-level vision tasks [14, 39, 54], Transformer-based methods have also been developed for low-level vision tasks [6, 57, 65], as well as for SR [27, 31]. In particular, a newly designed network, SwinIR [31], obtains a breakthrough improvement in this task.

Despite the success, "why Transformer is better than CNN" remains a mystery. An intuitive explanation is that this kind of network can benefit from the self-attention mechanism and utilize long-range information. Thus, we employ the attribution analysis method LAM [15] to examine the range of information utilized for reconstruction in SwinIR. Interestingly, we find that SwinIR does NOT exploit more input pixels than CNN-based methods (e.g., RCAN [68]) in super-resolution, as shown in Fig. 2. Besides, although SwinIR obtains higher quantitative performance on average, it produces inferior results to RCAN on some samples, due to the limited range of utilized information. These phenomena illustrate that Transformer has a stronger ability to model local information, but the range of its utilized information needs to be expanded. In addition, we also find that blocking artifacts appear in the intermediate features of SwinIR, as depicted in Fig. 3. This demonstrates that the shifted window mechanism cannot perfectly realize cross-window information interaction.

To address the above-mentioned limitations and further develop the potential of Transformer for SR, we propose a Hybrid Attention Transformer, namely HAT. Our HAT combines channel attention and self-attention schemes, in order to take advantage of the former's capability of using global information and the powerful representative ability of the latter. Besides, we introduce an overlapping cross-attention module to achieve more direct interaction of adjacent window features. Benefiting from these designs, our model can activate more pixels for reconstruction and thus obtains a significant performance improvement.

Since Transformers do not have an inductive bias like CNNs, large-scale data pre-training is important to unlock the potential of such models. In this work, we provide an effective same-task pre-training strategy. Different from IPT [6], which uses multiple restoration tasks for pre-training, and EDT [27], which uses multiple degradation levels for pre-training, we directly perform pre-training on a large-scale dataset for the same task. We believe that large-scale data is what really matters for pre-training, and experimental results also show the superiority of our strategy. Equipped with the above designs, HAT can surpass the state-of-the-art methods by a huge margin (0.3dB∼1.2dB), as shown in Fig. 1.

Contributions: 1) We design a novel Hybrid Attention Transformer (HAT) that combines self-attention, channel attention and a new overlapping cross-attention to activate more pixels for better reconstruction. 2) We propose an effective same-task pre-training strategy to further exploit the potential of the SR Transformer and show the importance of large-scale data pre-training for the task. 3) Our method achieves state-of-the-art performance. By further scaling up HAT to build a big model, we greatly extend the performance upper bound of the SR task.

2. Related Work

2.1. Deep Networks for Image SR

Since SRCNN [10] first introduced deep convolutional neural networks (CNNs) to the image SR task and obtained superior performance over conventional SR methods, numerous deep networks [8, 11, 12, 21, 27, 31, 32, 42, 43, 47, 68, 70] have been proposed for SR to further improve the reconstruction quality. For instance, many methods apply more elaborate convolution module designs, such as the residual block [25, 32] and the dense block [56, 70], to enhance the model representation ability. Several works explore different frameworks such as recursive neural networks [22, 48] and graph neural networks [72]. To improve perceptual quality, [25, 55, 56, 67] introduce adversarial learning to generate more realistic results. By using attention mechanisms, [8, 35, 42, 43, 68, 69] achieve further improvement in terms of reconstruction fidelity. Recently, a series of Transformer-based networks [6, 27, 31] have been proposed and constantly refresh the state of the art of the SR task, showing the powerful representation ability of Transformer.

To better understand the working mechanisms of SR networks, several works have been proposed to analyze and interpret SR networks. LAM [15] adopts the integral gradient method to explore which input pixels contribute most to the final performance. DDR [37] reveals the deep semantic representations in SR networks based on deep feature dimensionality reduction and visualization. FAIG [62] aims to find discriminative filters for specific degradations in blind SR. RDSR [23] introduces a channel saliency map to demonstrate that Dropout can help prevent co-adapting for real-SR networks. SRGA [38] aims to evaluate the generalization ability of SR methods. In this work, we exploit LAM [15] to analyse and understand the behavior of SR networks.

2.2. Vision Transformer

Recently, Transformer [53] has attracted the attention of the computer vision community due to its success in the field of natural language processing. A series of Transformer-based methods [7, 13, 14, 20, 26, 28, 39, 44, 54, 59, 60, 63] have been developed for high-level vision tasks, including image classification [14, 28, 39, 46, 52], object detection [5, 7, 36, 39, 50], segmentation [3, 18, 54, 58], etc. Although the vision Transformer has shown its superiority in modeling long-range dependencies [14, 45], there are still many works demonstrating that convolution can help Transformer achieve better visual representation [26, 59, 61, 63, 64]. Due to the impressive performance, Transformer has also been introduced for low-level vision tasks [4, 6, 27, 30, 31, 51, 57, 65]. Specifically, IPT [6] develops a ViT-style network and introduces multi-task pre-training for image processing. SwinIR [31] proposes an image restoration Transformer based on [39]. VRT [30] introduces Transformer-based networks for video restoration. EDT [27] adopts a self-attention mechanism and a multi-related-task pre-training strategy to further refresh the state of the art of SR. However, existing works still cannot fully exploit the potential of Transformer, while our method can activate more input pixels for better reconstruction.

3. Methodology

3.1. Motivation

Swin Transformer [39] has already presented excellent performance in image super-resolution [31]. We are therefore eager to know what makes it work better than CNN-based methods. To reveal its working mechanisms, we resort to a diagnostic tool, LAM [15], which is an attribution method designed for SR. With LAM, we can tell which input pixels contribute most to the selected region. As shown in Fig. 2, the red marked points are informative pixels that contribute to the reconstruction. Intuitively, the more information is utilized, the better performance can be obtained.
This is true for CNN-based methods, as a comparison of RCAN [68] and EDSR [32] shows. However, for the Transformer-based method SwinIR, its LAM does not show a larger range than RCAN. This contradicts our common sense, but also provides us with additional insights. First, it implies that SwinIR has a much stronger mapping ability than CNN, and thus can use less information to achieve better performance. Second, SwinIR may restore wrong textures due to the limited range of utilized pixels, and we think it can be further improved if it could exploit more input pixels. Therefore, we aim to design a network that takes advantage of similar self-attention while activating more pixels for reconstruction. As depicted in Fig. 2, our HAT can see pixels almost all over the image and restores correct and clear textures.

Figure 2. LAM [15] results for different networks. The LAM attribution reflects the importance of each pixel in the input LR image when reconstructing the patch marked with a box. The diffusion index (DI) [15] reflects the range of involved pixels; a higher DI represents a wider range of utilized pixels. The results indicate that SwinIR utilizes less information compared to RCAN, while HAT uses the most pixels for reconstruction.

Besides, we can observe obvious blocking artifacts in the intermediate features of SwinIR, as shown in Fig. 3. These artifacts are caused by the window partition mechanism, which suggests that the shifted window mechanism is inefficient for building cross-window connections. Some works for high-level vision tasks [13, 20, 44, 60] also point out that enhancing the connection among windows can improve window-based self-attention methods. Thus, we strengthen cross-window information interaction when designing our approach, and the blocking artifacts in the intermediate features obtained by HAT are significantly alleviated.

Figure 3. The blocking artifacts that appear in the intermediate features of SwinIR [31]. "Layer N" represents the intermediate features after the N-th layer (i.e., RSTB in SwinIR and RHAG in HAT).

3.2. Network Architecture

3.2.1 The Overall Structure

As shown in Fig. 4, the overall network consists of three parts: shallow feature extraction, deep feature extraction and image reconstruction. This architecture design is widely used in previous works [31, 68]. Specifically, for a given low-resolution (LR) input I_LR ∈ R^{H×W×C_in}, we first exploit one convolution layer to extract the shallow feature F_0 ∈ R^{H×W×C}, where C_in and C denote the channel numbers of the input and the intermediate feature. Then, a series of residual hybrid attention groups (RHAG) and one 3×3 convolution layer H_Conv(·) are utilized to perform the deep feature extraction. After that, we add a global residual connection to fuse the shallow feature F_0 and the deep feature F_D ∈ R^{H×W×C}, and then reconstruct the high-resolution result via a reconstruction module. As depicted in Fig. 4, each RHAG contains several hybrid attention blocks (HAB), an overlapping cross-attention block (OCAB) and a 3×3 convolution layer with a residual connection. For the reconstruction module, the pixel-shuffle method [47] is adopted to up-sample the fused feature. We simply use the L1 loss to optimize the network parameters.

3.2.2 Hybrid Attention Block (HAB)

As shown in Fig. 2, more pixels are activated when channel attention is adopted, as global information is involved in calculating the channel attention weights. Besides, many works illustrate that convolution can help Transformer obtain better visual representation or achieve easier optimization [26, 59, 61, 63, 71]. Therefore, we incorporate a channel attention-based convolution block into the standard Transformer block to enhance the representation ability of the network. As demonstrated in Fig. 4, a channel attention block (CAB) is inserted into the standard Swin Transformer block after the first LayerNorm (LN) layer, in parallel with the window-based multi-head self-attention (W-MSA) module. Note that shifted window-based self-attention (SW-MSA) is adopted at intervals in consecutive HABs, similar to [31, 39]. To avoid a possible conflict between CAB and MSA in optimization and visual representation, a small constant α is multiplied to the output of CAB. For a given input feature X, the whole process of HAB is computed as

X_N = LN(X),
X_M = (S)W-MSA(X_N) + αCAB(X_N) + X,    (1)
Y = MLP(LN(X_M)) + X_M,

where X_N and X_M denote the intermediate features and Y represents the output of HAB. Note that we treat each pixel as a token for embedding (i.e., we set the patch size to 1 for patch embedding, following [31]). MLP denotes a multi-layer perceptron.
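To make the data flow of Eq. (1) concrete, the following is a minimal PyTorch-style sketch of a HAB forward pass. It is not the authors' released implementation: the window attention and CAB branches are passed in as placeholder modules (assumed to map token tensors of shape (B, H·W, C) to the same shape), and only the composition of LN, (S)W-MSA, the α-scaled CAB branch, the residuals and the MLP follows the equation.

```python
import torch.nn as nn

class HAB(nn.Module):
    """Hybrid Attention Block, Eq. (1): LN -> (S)W-MSA + alpha*CAB + skip -> LN -> MLP + skip.
    `window_msa` and `cab` are assumed to map (B, H*W, C) -> (B, H*W, C)."""
    def __init__(self, dim, window_msa, cab, alpha=0.01, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = window_msa        # (S)W-MSA module, supplied externally
        self.cab = cab                # channel attention block branch
        self.alpha = alpha            # small constant weighting the CAB output
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):             # x: (B, H*W, C); each pixel is one token
        xn = self.norm1(x)                                    # X_N = LN(X)
        xm = self.attn(xn) + self.alpha * self.cab(xn) + x    # X_M
        return self.mlp(self.norm2(xm)) + xm                  # Y
```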

Figure 4. The overall architecture of HAT and the structure of RHAG and HAB.
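As a rough companion to Figure 4, the sketch below wires together the three parts described in Sec. 3.2.1: a shallow convolution, a stack of RHAGs followed by a convolution and a global residual, and a pixel-shuffle reconstruction head. The RHAG factory, the single-stage upsampler and the exact layer arrangement are illustrative assumptions, not the paper's implementation.

```python
import torch.nn as nn

class HATSketch(nn.Module):
    """Shallow conv -> RHAGs + conv (deep features) -> global residual -> pixel-shuffle head."""
    def __init__(self, rhag, c_in=3, c=180, n_groups=6, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(c_in, c, 3, padding=1)                # F_0
        self.groups = nn.ModuleList([rhag(c) for _ in range(n_groups)])
        self.conv_after = nn.Conv2d(c, c, 3, padding=1)                # H_Conv
        self.reconstruct = nn.Sequential(                              # pixel-shuffle up-sampling [47]
            nn.Conv2d(c, c * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(c, c_in, 3, padding=1))

    def forward(self, lr):            # lr: (B, C_in, H, W)
        f0 = self.shallow(lr)
        fd = f0
        for g in self.groups:         # deep feature extraction -> F_D
            fd = g(fd)
        fd = self.conv_after(fd) + f0 # global residual connection
        return self.reconstruct(fd)
```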

For the calculation of the self-attention module, given an input feature of size H × W × C, it is first partitioned into HW/M² local windows of size M × M, then self-attention is calculated inside each window. For a local window feature X_W ∈ R^{M²×C}, the query, key and value matrices are computed by linear mappings as Q, K and V. Then the window-based self-attention is formulated as

Attention(Q, K, V) = SoftMax(QK^T/√d + B)V,    (2)

where d represents the dimension of the query/key. B denotes the relative position encoding and is calculated as in [53].
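For reference, here is a minimal sketch of the window partition and of Eq. (2), assuming the relative position bias B is supplied as a precomputed (M², M²) tensor and that H and W are divisible by M; the helper names are our own, not taken from the paper's code.

```python
import math
import torch.nn.functional as F

def window_partition(x, m):
    """(B, H, W, C) -> (B * num_windows, m*m, C); assumes H and W are divisible by m."""
    b, h, w, c = x.shape
    x = x.view(b, h // m, m, w // m, m, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, c)

def window_attention(q, k, v, bias):
    """Eq. (2): SoftMax(Q K^T / sqrt(d) + B) V, computed independently per window.
    q, k, v: (num_windows, tokens, d); bias: (tokens_q, tokens_k)."""
    d = q.shape[-1]
    attn = q @ k.transpose(-2, -1) / math.sqrt(d) + bias
    return F.softmax(attn, dim=-1) @ v
```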
Note that we use a large window size to compute self-attention, since we find that it significantly enlarges the range of used pixels, as shown in Sec. 4.2. Besides, to build connections between neighboring non-overlapping windows, we also utilize the shifted window partitioning approach [39] and set the shift size to half of the window size.

A CAB consists of two standard convolution layers with a GELU activation [17] in between and a channel attention (CA) module, as shown in Fig. 4. Since the Transformer-based structure often requires a large number of channels for token embedding, directly using convolutions with constant width incurs a large computation cost. Thus, we compress the channel numbers of the two convolution layers by a constant β. For an input feature with C channels, the channel number of the output feature after the first convolution layer is squeezed to C/β, then the feature is expanded back to C channels through the second layer. Next, a standard CA module [68] is exploited to adaptively rescale channel-wise features.
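A minimal sketch of the CAB as just described: a 3×3 convolution that squeezes the channels by β, a GELU, a 3×3 convolution that expands them back, followed by a squeeze-and-excitation style channel attention [68]. The reduction factor inside the CA module is an assumption for illustration, not a value given in the paper.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Rescales channels using globally pooled statistics (standard CA [68])."""
    def __init__(self, c, reduction=16):          # reduction factor: an assumption
        super().__init__()
        self.scale = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.scale(x)

class CAB(nn.Module):
    """Conv (C -> C/beta) -> GELU -> Conv (C/beta -> C) -> channel attention."""
    def __init__(self, c, beta=3):                # beta = 3 as in Sec. 4.1
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c // beta, 3, padding=1), nn.GELU(),
            nn.Conv2d(c // beta, c, 3, padding=1), ChannelAttention(c))

    def forward(self, x):                         # x: (B, C, H, W)
        return self.body(x)
```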
3.2.3 Overlapping Cross-Attention Block (OCAB)

Figure 5. The overlapping window partition for OCA.

We introduce OCAB to directly establish cross-window connections and enhance the representative ability of the window self-attention. Our OCAB consists of an overlapping cross-attention (OCA) layer and an MLP layer, similar to the standard Swin Transformer block [39]. But for OCA, as depicted in Fig. 5, we use different window sizes to partition the projected features. Specifically, for X_Q, X_K, X_V ∈ R^{H×W×C} of the input feature X, X_Q is partitioned into HW/M² non-overlapping windows of size M × M, while X_K and X_V are unfolded into HW/M² overlapping windows of size M_o × M_o. It is calculated as

M_o = (1 + γ) × M,    (3)

where γ is a constant to control the overlapping size. To better understand this operation, the standard window partition can be considered as a sliding partition with both the kernel size and the stride equal to the window size M. In contrast, the overlapping window partition can be viewed as a sliding partition with the kernel size equal to M_o, while the stride is equal to M. Zero-padding of size γM/2 is used to ensure the size consistency of the overlapping windows. The attention matrix is calculated as in Eq. (2), and a relative position bias B ∈ R^{M²×M_o²} is also adopted.
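The overlapping partition of X_K and X_V can be written as a sliding unfold whose kernel size is M_o = (1 + γ)M, whose stride is M, and whose zero-padding is γM/2 per side, as sketched below (shapes only; the QKV projections and the attention itself are omitted, and the helper name is ours).

```python
import torch.nn.functional as F

def overlapping_window_partition(x, m, gamma=0.5):
    """x: (B, C, H, W) -> (B * num_windows, m_o*m_o, C) overlapping windows,
    where m_o = (1 + gamma) * m, stride m, zero-padding gamma*m/2 per side."""
    b, c, h, w = x.shape
    m_o = int((1 + gamma) * m)
    pad = int(gamma * m / 2)
    # unfold extracts sliding m_o x m_o patches with stride m, matching the M x M query windows
    patches = F.unfold(x, kernel_size=m_o, stride=m, padding=pad)   # (B, C*m_o*m_o, N)
    n = patches.shape[-1]                                           # N = (H/m) * (W/m)
    patches = patches.view(b, c, m_o * m_o, n)
    return patches.permute(0, 3, 2, 1).reshape(b * n, m_o * m_o, c)

# e.g. M = 16, gamma = 0.5 -> M_o = 24 with padding 4: one 24x24 key/value window
# centred on each 16x16 query window.
```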

Unlike WSA, whose query, key and value are calculated from the same window feature, OCA computes the key/value from a larger field where more useful information can be utilized for the query. Note that although the Multi-resolution Overlapped Attention (MOA) module in [44] performs a similar overlapping window partition, our OCA is fundamentally different from MOA, since MOA calculates global attention using window features as tokens while OCA computes cross-attention inside each window feature using pixel tokens.

3.3. The Same-task Pre-training

Pre-training has proven effective for many high-level vision tasks [1, 14, 16]. Recent works [6, 27] also demonstrate that pre-training is beneficial to low-level vision tasks. IPT [6] emphasizes the use of various low-level tasks, such as denoising, deraining and super-resolution, while EDT [27] utilizes different degradation levels of a specific task for pre-training. These works focus on investigating the effect of multi-task pre-training for a target task. In contrast, we directly perform pre-training on a larger-scale dataset (i.e., ImageNet [9]) for the same task, showing that the effectiveness of pre-training depends more on the scale and diversity of data. For example, when we want to train a model for ×4 SR, we first train a ×4 SR model on ImageNet, then fine-tune it on the specific dataset, such as DF2K. The proposed strategy, namely same-task pre-training, is simpler while bringing more performance improvement. It is worth mentioning that sufficient training iterations for pre-training and an appropriately small learning rate for fine-tuning are very important for the effectiveness of the pre-training strategy. We think this is because Transformer requires more data and iterations to learn general knowledge for the task, but needs a small learning rate during fine-tuning to avoid overfitting to the specific dataset.

4. Experiments

4.1. Experimental Setup

We use the DF2K (DIV2K [33] + Flickr2K [49]) dataset as the training dataset, since we find that using only DIV2K leads to overfitting. When utilizing pre-training, we adopt ImageNet [9] following [6, 27]. For the structure of HAT, we keep the depth and width the same as SwinIR. Specifically, the RHAG number and HAB number are both set to 6. The channel number is set to 180. The attention head number and window size are set to 6 and 16 for both (S)W-MSA and OCA. For the hyper-parameters of the proposed modules, we set the weighting factor in HAB (α), the squeeze factor between the two convolutions in CAB (β), and the overlapping ratio of OCA (γ) to 0.01, 3 and 0.5, respectively. For the large variant HAT-L, we directly double the depth of HAT by increasing the RHAG number from 6 to 12. We also provide a small version, HAT-S, with fewer parameters and similar computation to SwinIR. In HAT-S, the channel number is set to 144 and depth-wise convolution is used in CAB. Five benchmark datasets, including Set5 [2], Set14 [66], BSD100 [40], Urban100 [19] and Manga109 [41], are used to evaluate the methods. For the quantitative metrics, PSNR and SSIM (calculated on the Y channel) are reported. More training details can be found in the supplementary file.

Table 1. Quantitative comparison on PSNR (dB) of different window sizes.
Window size   Set5    Set14   BSD100   Urban100   Manga109
(8,8)         32.88   29.09   27.92    27.45      32.03
(16,16)       32.97   29.12   27.95    27.81      32.15

Figure 6. Qualitative comparison of different window sizes.

4.2. Effects of different window sizes

As discussed in Sec. 3.1, activating more input pixels for SR tends to achieve better performance. Enlarging the window size of the window-based self-attention is an intuitive way to realize this goal. In [27], the authors investigate the effects of different window sizes. However, they conduct experiments based on shifted cross local attention and only explore window sizes up to 12×12. We further explore how the window size of self-attention influences the representation ability. To eliminate the influence of our newly-introduced blocks, we conduct the following experiments directly on SwinIR. As shown in Tab. 1, the model with a large window size of 16×16 obtains better performance, especially on Urban100. We also provide a qualitative comparison in Fig. 6. For the red marked patch, the model with a window size of 16 utilizes far more input pixels than the model with a window size of 8. The quantitative performance of the reconstructed results also demonstrates the effectiveness of the large window size. Based on this conclusion, we directly use a window size of 16 as our default setting.

4.3. Ablation Study

Effectiveness of OCAB and CAB. We conduct experiments to demonstrate the effectiveness of the proposed CAB and OCAB. The quantitative performance on the Urban100 dataset for ×4 SR is reported in Tab. 2. Compared with the baseline, both OCAB and CAB bring a performance gain of about 0.1dB.

Benefiting from the two modules, the model obtains a further performance improvement of 0.16dB. We also provide a qualitative comparison to further illustrate the influence of OCAB and CAB, as presented in Fig. 7. We can observe that the model with OCAB has a larger scope of utilized pixels and generates better-reconstructed results. When CAB is adopted, the used pixels expand to almost the full image. Moreover, the result of our method with OCAB and CAB obtains the highest DI [15], which means our method utilizes the most input pixels. Although it obtains slightly lower performance than the model with only OCAB, our method gets the highest SSIM and reconstructs the clearest textures.

Table 2. Ablation study on the proposed OCAB and CAB.
        Baseline
OCAB    ✗          ✓          ✗          ✓
CAB     ✗          ✗          ✓          ✓
PSNR    27.81dB    27.91dB    27.91dB    27.97dB

Figure 7. Ablation study on the proposed OCAB and CAB.

Effects of different designs of CAB. We conduct experiments to explore the effects of different designs of CAB. First, we investigate the influence of channel attention. As shown in Tab. 3, the model using CA achieves a performance gain of 0.05dB compared to the model without CA. This demonstrates the effectiveness of the channel attention in our network. We also conduct experiments to explore the effects of the weighting factor α of CAB. As presented in Sec. 3.2.2, α is used to control the weight of CAB features for feature fusion. A larger α means a larger weight of the features extracted by CAB, and α = 0 means CAB is not used. As shown in Tab. 4, the model with α of 0.01 obtains the best performance. This indicates that CAB and self-attention may have a potential conflict in optimization, while a small weighting factor for the CAB branch can suppress this issue for a better combination.

Table 3. Effects of the channel attention (CA) module in CAB.
Structure      w/o CA              w/ CA
PSNR / SSIM    27.92dB / 0.8362    27.97dB / 0.8367

Table 4. Effects of the weighting factor α in CAB.
α       0          1          0.1        0.01
PSNR    27.81dB    27.86dB    27.90dB    27.97dB

Effects of the overlapping ratio. In OCAB, we set a constant γ to control the overlapping size for the overlapping cross-attention. To explore the effects of different overlapping ratios, we set a group of γ values from 0 to 0.75 and examine the performance change, as shown in Tab. 5. Note that γ = 0 means a standard Transformer block. It can be found that the model with γ = 0.5 performs best. In contrast, when γ is set to 0.25 or 0.75, the model shows no obvious performance gain or even a performance drop. This illustrates that an inappropriate overlapping size cannot benefit the interaction of neighboring windows.

Table 5. Ablation study on the overlapping ratio of OCAB.
γ       0          0.25       0.5        0.75
PSNR    27.85dB    27.81dB    27.91dB    27.86dB

4.4. Comparison with State-of-the-Art Methods

Quantitative results. Tab. 6 shows the quantitative comparison of our approach and the state-of-the-art methods: EDSR [32], RCAN [68], SAN [8], IGNN [72], HAN [43], NLSN [42], RCAN-it [34], as well as approaches using ImageNet pre-training, i.e., IPT [6] and EDT [27]. We can see that our method outperforms the other methods significantly on all benchmark datasets. Concretely, HAT surpasses SwinIR by 0.48dB∼0.64dB on Urban100 and by 0.34dB∼0.45dB on Manga109. When compared with the approaches using pre-training, HAT also has large performance gains of more than 0.5dB against EDT on Urban100 for all three scales. Besides, HAT with pre-training outperforms SwinIR by a huge margin of up to 1dB on Urban100 for ×2 SR. Moreover, the large model HAT-L brings further improvement and greatly expands the performance upper bound of this task. HAT-S, with fewer parameters and similar computation, also significantly outperforms the state-of-the-art method SwinIR. (A detailed computational complexity comparison can be found in the supplementary file.) Note that the performance gaps are much larger on Urban100, as it contains more structured and self-repeated patterns that can provide more useful pixels for reconstruction when the utilized range of information is enlarged. All these results show the effectiveness of our method.

Visual comparison. We provide the visual comparison in Fig. 8. For the images "img_002", "img_011", "img_030", "img_044" and "img_073" in Urban100, HAT successfully recovers the clear lattice content. In contrast, the other approaches all suffer from severe blurry effects. We can also observe similar behavior on "PrayerHaNemurenai" in Manga109. When recovering the characters, HAT obtains significantly clearer textures than the other methods. The visual results also demonstrate the superiority of our approach.

4.5. Study on the pre-training strategy

In Tab. 6, we can see that HAT benefits greatly from the pre-training strategy, as shown by comparing the performance of HAT and HAT†.

Table 6. Quantitative comparison with state-of-the-art methods on benchmark datasets. The top three results are marked in red, blue and
green. “†” indicates that methods adopt pre-training strategy on ImageNet.
Method  Scale  Training Dataset  Set5 (PSNR / SSIM)  Set14 (PSNR / SSIM)  BSD100 (PSNR / SSIM)  Urban100 (PSNR / SSIM)  Manga109 (PSNR / SSIM)
EDSR ×2 DIV2K 38.11 0.9602 33.92 0.9195 32.32 0.9013 32.93 0.9351 39.10 0.9773
RCAN ×2 DIV2K 38.27 0.9614 34.12 0.9216 32.41 0.9027 33.34 0.9384 39.44 0.9786
SAN ×2 DIV2K 38.31 0.9620 34.07 0.9213 32.42 0.9028 33.10 0.9370 39.32 0.9792
IGNN ×2 DIV2K 38.24 0.9613 34.07 0.9217 32.41 0.9025 33.23 0.9383 39.35 0.9786
HAN ×2 DIV2K 38.27 0.9614 34.16 0.9217 32.41 0.9027 33.35 0.9385 39.46 0.9785
NLSN ×2 DIV2K 38.34 0.9618 34.08 0.9231 32.43 0.9027 33.42 0.9394 39.59 0.9789
RCAN-it ×2 DF2K 38.37 0.9620 34.49 0.9250 32.48 0.9034 33.62 0.9410 39.88 0.9799
SwinIR ×2 DF2K 38.42 0.9623 34.46 0.9250 32.53 0.9041 33.81 0.9427 39.92 0.9797
EDT ×2 DF2K 38.45 0.9624 34.57 0.9258 32.52 0.9041 33.80 0.9425 39.93 0.9800
HAT-S (ours) ×2 DF2K 38.58 0.9628 34.70 0.9261 32.59 0.9050 34.31 0.9459 40.14 0.9805
HAT (ours) ×2 DF2K 38.63 0.9630 34.86 0.9274 32.62 0.9053 34.45 0.9466 40.26 0.9809
IPT† ×2 ImageNet 38.37 - 34.43 - 32.48 - 33.76 - - -
EDT† ×2 DF2K 38.63 0.9632 34.80 0.9273 32.62 0.9052 34.27 0.9456 40.37 0.9811
HAT† (ours) ×2 DF2K 38.73 0.9637 35.13 0.9282 32.69 0.9060 34.81 0.9489 40.71 0.9819
HAT-L† (ours) ×2 DF2K 38.91 0.9646 35.29 0.9293 32.74 0.9066 35.09 0.9505 41.01 0.9831
EDSR ×3 DIV2K 34.65 0.9280 30.52 0.8462 29.25 0.8093 28.80 0.8653 34.17 0.9476
RCAN ×3 DIV2K 34.74 0.9299 30.65 0.8482 29.32 0.8111 29.09 0.8702 34.44 0.9499
SAN ×3 DIV2K 34.75 0.9300 30.59 0.8476 29.33 0.8112 28.93 0.8671 34.30 0.9494
IGNN ×3 DIV2K 34.72 0.9298 30.66 0.8484 29.31 0.8105 29.03 0.8696 34.39 0.9496
HAN ×3 DIV2K 34.75 0.9299 30.67 0.8483 29.32 0.8110 29.10 0.8705 34.48 0.9500
NLSN ×3 DIV2K 34.85 0.9306 30.70 0.8485 29.34 0.8117 29.25 0.8726 34.57 0.9508
RCAN-it ×3 DF2K 34.86 0.9308 30.76 0.8505 29.39 0.8125 29.38 0.8755 34.92 0.9520
SwinIR ×3 DF2K 34.97 0.9318 30.93 0.8534 29.46 0.8145 29.75 0.8826 35.12 0.9537
EDT ×3 DF2K 34.97 0.9316 30.89 0.8527 29.44 0.8142 29.72 0.8814 35.13 0.9534
HAT-S (ours) ×3 DF2K 35.01 0.9325 31.05 0.8550 29.50 0.8158 30.15 0.8879 35.40 0.9547
HAT (ours) ×3 DF2K 35.07 0.9329 31.08 0.8555 29.54 0.8167 30.23 0.8896 35.53 0.9552
IPT† ×3 ImageNet 34.81 - 30.85 - 29.38 - 29.49 - - -
EDT† ×3 DF2K 35.13 0.9328 31.09 0.8553 29.53 0.8165 30.07 0.8863 35.47 0.9550
HAT† (ours) ×3 DF2K 35.16 0.9335 31.33 0.8576 29.59 0.8177 30.70 0.8949 35.84 0.9567
HAT-L† (ours) ×3 DF2K 35.28 0.9345 31.47 0.8584 29.63 0.8191 30.92 0.8981 36.02 0.9576
EDSR ×4 DIV2K 32.46 0.8968 28.80 0.7876 27.71 0.7420 26.64 0.8033 31.02 0.9148
RCAN ×4 DIV2K 32.63 0.9002 28.87 0.7889 27.77 0.7436 26.82 0.8087 31.22 0.9173
SAN ×4 DIV2K 32.64 0.9003 28.92 0.7888 27.78 0.7436 26.79 0.8068 31.18 0.9169
IGNN ×4 DIV2K 32.57 0.8998 28.85 0.7891 27.77 0.7434 26.84 0.8090 31.28 0.9182
HAN ×4 DIV2K 32.64 0.9002 28.90 0.7890 27.80 0.7442 26.85 0.8094 31.42 0.9177
NLSN ×4 DIV2K 32.59 0.9000 28.87 0.7891 27.78 0.7444 26.96 0.8109 31.27 0.9184
RRDB ×4 DF2K 32.73 0.9011 28.99 0.7917 27.85 0.7455 27.03 0.8153 31.66 0.9196
RCAN-it ×4 DF2K 32.69 0.9007 28.99 0.7922 27.87 0.7459 27.16 0.8168 31.78 0.9217
SwinIR ×4 DF2K 32.92 0.9044 29.09 0.7950 27.92 0.7489 27.45 0.8254 32.03 0.9260
EDT ×4 DF2K 32.82 0.9031 29.09 0.7939 27.91 0.7483 27.46 0.8246 32.05 0.9254
HAT-S (ours) ×4 DF2K 32.92 0.9047 29.15 0.7958 27.97 0.7505 27.87 0.8346 32.35 0.9283
HAT (ours) ×4 DF2K 33.04 0.9056 29.23 0.7973 28.00 0.7517 27.97 0.8368 32.48 0.9292
IPT† ×4 ImageNet 32.64 - 29.01 - 27.82 - 27.26 - - -
EDT† ×4 DF2K 33.06 0.9055 29.23 0.7971 27.99 0.7510 27.75 0.8317 32.39 0.9283
HAT† (ours) ×4 DF2K 33.18 0.9073 29.38 0.8001 28.05 0.7534 28.37 0.8447 32.87 0.9319
HAT-L† (ours) ×4 DF2K 33.30 0.9083 29.47 0.8015 28.09 0.7551 28.60 0.8498 33.09 0.9335

To show the superiority of the proposed same-task pre-training, we also apply the multi-related-task pre-training of [27] to HAT for comparison, using the full ImageNet under the same training settings as [27]. As depicted in Tab. 7, the same-task pre-training performs better, not only in the pre-training stage but also in the fine-tuning process. From this perspective, multi-task pre-training probably impairs the restoration performance of the network on a specific degradation, while the same-task pre-training can maximize the performance gain brought by large-scale data.

To further investigate the influence of our pre-training strategy on different networks, we apply our pre-training to four networks: SRResNet (1.5M), RRDBNet (16.7M), SwinIR (11.9M) and HAT (20.8M), as shown in Fig. 9.

First, we can see that all four networks benefit from pre-training, showing the effectiveness of the proposed same-task pre-training strategy. Second, for the same type of network (i.e., CNN or Transformer), the larger the network capacity, the more performance gain from pre-training. Third, although it has fewer parameters, SwinIR obtains a greater performance improvement from the pre-training compared to RRDBNet. This suggests that Transformer needs more data to exploit the potential of the model. Finally, HAT obtains the largest gain from pre-training, indicating the necessity of the pre-training strategy for such large models. Equipped with big models and large-scale data, we show that the performance upper bound of this task is significantly extended.

Figure 9. Quantitative comparison on PSNR (dB) of four different networks without and with the same-task pre-training on ×4 SR.

Table 7. Quantitative results on PSNR (dB) of HAT using two kinds of pre-training strategies on ×4 SR under the same training setting. The full ImageNet dataset is adopted to perform pre-training and the DF2K dataset is used for fine-tuning.
Strategy                          Stage          Set5    Set14   Urban100
Multi-related-task pre-training   pre-training   32.94   29.17   28.05
Multi-related-task pre-training   fine-tuning    33.06   29.33   28.21
Same-task pre-training (ours)     pre-training   33.02   29.20   28.11
Same-task pre-training (ours)     fine-tuning    33.07   29.34   28.28

Figure 8. Visual comparison on ×4 SR. The patches for comparison are marked with red boxes in the original images. PSNR/SSIM is calculated based on the patches to better reflect the performance difference.
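The schedule described in Secs. 3.3 and 4.5 reduces to two stages of the same ×4 SR task: a long pre-training run on ImageNet followed by fine-tuning on DF2K with a smaller learning rate. The sketch below only illustrates that structure; the data loaders, iteration counts and learning rates are placeholders rather than the paper's exact settings.

```python
import torch

def train_stage(model, loader, iters, lr, device="cuda"):
    """One stage of L1-loss training over (LR, HR) pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    l1 = torch.nn.L1Loss()
    done = 0
    while done < iters:
        for lr_img, hr_img in loader:
            opt.zero_grad()
            loss = l1(model(lr_img.to(device)), hr_img.to(device))
            loss.backward()
            opt.step()
            done += 1
            if done >= iters:
                break
    return model

# Same-task pre-training: both stages train the same x4 SR task, only the data changes.
# model = train_stage(model, imagenet_x4_loader, iters=800_000, lr=2e-4)  # pre-train (placeholder numbers)
# model = train_stage(model, df2k_x4_loader, iters=250_000, lr=1e-5)      # fine-tune with a small LR
```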

5. Conclusion
In this paper, we propose a novel Hybrid Attention Transformer, HAT, for single image super-resolution. Our model combines channel attention and self-attention to activate more pixels for high-resolution reconstruction. Besides, we propose an overlapping cross-attention module to enhance the interaction of cross-window information. Moreover, we introduce a same-task pre-training strategy to further exploit the potential of HAT. Extensive experiments show the effectiveness of the proposed modules and the pre-training strategy. Our approach significantly outperforms the state-of-the-art methods quantitatively and qualitatively.

Acknowledgement. This work was supported in part by the Macau Science and Technology Development Fund under SKLIOTSC-2021-2023, 0072/2020/AMJ, 0022/2022/A1; in part by the Alibaba Innovative Research Program; in part by the National Natural Science Foundation of China under Grants 61971476 and 62276251 and the Joint Lab of CAS-HK; and in part by the Youth Innovation Promotion Association of the Chinese Academy of Sciences (No. 2020356).

References

[1] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 5
[2] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012. 5
[3] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation, 2021. 2
[4] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer, 2021. 2
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020. 2
[6] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021. 1, 2, 5, 6
[7] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34, 2021. 2
[8] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11065–11074, 2019. 1, 2, 6
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255, 2009. 5
[10] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pages 184–199. Springer, 2014. 1, 2
[11] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015. 1, 2
[12] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In European conference on computer vision, pages 391–407. Springer, 2016. 1, 2
[13] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–12134, 2022. 2, 3
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale, 2020. 1, 2, 5
[15] Jinjin Gu and Chao Dong. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9199–9208, 2021. 1, 2, 3, 6
[16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022. 5
[17] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus), 2016. 4
[18] Gao Huang, Yulin Wang, Kangchen Lv, Haojun Jiang, Wenhui Huang, Pengfei Qi, and Shiji Song. Glance and focus networks for dynamic visual recognition, 2022. 2
[19] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5197–5206, 2015. 5
[20] Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, and Bin Fu. Shuffle transformer: Rethinking spatial shuffle for vision transformer, 2021. 2, 3
[21] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016. 2
[22] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016. 2
[23] Xiangtao Kong, Xina Liu, Jinjin Gu, Yu Qiao, and Chao Dong. Reflash dropout in image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6002–6012, 2022. 2
[24] Xiangtao Kong, Hengyuan Zhao, Yu Qiao, and Chao Dong. Classsr: A general framework to accelerate super-resolution networks by data characteristic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12016–12025, June 2021. 1
[25] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017. 2
[26] Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unifying convolution and self-attention for visual recognition, 2022. 2, 3
[27] Wenbo Li, Xin Lu, Jiangbo Lu, Xiangyu Zhang, and Jiaya Jia. On efficient transformer and image pre-training for low-level vision, 2021. 1, 2, 5, 6, 7

[28] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers, 2021. 2
[29] Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Jinjin Gu, Yu Qiao, and Chao Dong. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 833–843, June 2022. 1
[30] Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. Vrt: A video restoration transformer, 2022. 2
[31] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021. 1, 2, 3
[32] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017. 1, 2, 3, 6
[33] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017. 5
[34] Zudi Lin, Prateek Garg, Atmadeep Banerjee, Salma Abdel Magid, Deqing Sun, Yulun Zhang, Luc Van Gool, Donglai Wei, and Hanspeter Pfister. Revisiting rcan: Improved training for image super-resolution, 2022. 6
[35] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. Advances in neural information processing systems, 31, 2018. 2
[36] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. International journal of computer vision, 128(2):261–318, 2020. 2
[37] Yihao Liu, Anran Liu, Jinjin Gu, Zhipeng Zhang, Wenhao Wu, Yu Qiao, and Chao Dong. Discovering "semantics" in super-resolution networks, 2021. 2
[38] Yihao Liu, Hengyuan Zhao, Jinjin Gu, Yu Qiao, and Chao Dong. Evaluating the generalization ability of super-resolution networks. arXiv preprint arXiv:2205.07019, 2022. 2
[39] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021. 1, 2, 3, 4
[40] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423. IEEE, 2001. 5
[41] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20):21811–21838, 2017. 5
[42] Yiqun Mei, Yuchen Fan, and Yuqian Zhou. Image super-resolution with non-local sparse attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3517–3526, 2021. 2, 6
[43] Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. Single image super-resolution via a holistic attention network. In European conference on computer vision, pages 191–207. Springer, 2020. 2, 6
[44] Krushi Patel, Andres M Bur, Fengjun Li, and Guanghui Wang. Aggregating global features into local vision transformer, 2022. 2, 3, 5
[45] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems, 34, 2021. 2
[46] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Studying stand-alone self-attention in vision models. 2019. 2
[47] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 2, 3
[48] Ying Tai, Jian Yang, and Xiaoming Liu. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3147–3155, 2017. 2
[49] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 114–125, 2017. 5
[50] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021. 2
[51] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxim: Multi-axis mlp for image processing. CVPR, 2022. 2
[52] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12894–12904, 2021. 2
[53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 1, 2, 4

[54] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021. 1, 2
[55] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1905–1914, 2021. 2
[56] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018. 2
[57] Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17683–17693, 2022. 1, 2
[58] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision, 2020. 2
[59] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22–31, 2021. 2, 3
[60] Sitong Wu, Tianyi Wu, Haoru Tan, and Guodong Guo. Pale transformer: A general vision transformer backbone with pale-shaped attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2731–2739, 2022. 2, 3
[61] Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 34, 2021. 2, 3
[62] Liangbin Xie, Xintao Wang, Chao Dong, Zhongang Qi, and Ying Shan. Finding discriminative filters for specific degradations in blind super-resolution. Advances in Neural Information Processing Systems, 34, 2021. 2
[63] Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 579–588, 2021. 2, 3
[64] Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer: High-resolution vision transformer for dense prediction. Advances in Neural Information Processing Systems, 34:7281–7293, 2021. 2
[65] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022. 1, 2
[66] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International conference on curves and surfaces, pages 711–730. Springer, 2010. 5
[67] Wenlong Zhang, Yihao Liu, Chao Dong, and Yu Qiao. Ranksrgan: Generative adversarial networks with ranker for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3096–3105, 2019. 2
[68] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018. 1, 2, 3, 4, 6
[69] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration, 2019. 2
[70] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018. 1, 2
[71] Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng, and Zheng-Jun Zha. A battle of network structures: An empirical study of cnn, transformer, and mlp, 2021. 3
[72] Shangchen Zhou, Jiawei Zhang, Wangmeng Zuo, and Chen Change Loy. Cross-scale internal graph neural network for image super-resolution. Advances in neural information processing systems, 33:3499–3509, 2020. 2, 6

