

Efficient Deep Models for Real-Time 4K Image Super-Resolution.


NTIRE 2023 Benchmark and Report

Marcos V. Conde† Eduard Zamfir† Radu Timofte† Daniel Motilla‡ Cen Liu
Zexin Zhang Yunbo Peng Yue Lin Jiaming Guo Xueyi Zou Yuyi Chen
Yi Liu Jia Hao Youliang Yan Yuanfan Zhang Gen Li Lei Sun
Lingshun Kong Haoran Bai Jinshan Pan Jiangxin Dong Jinhui Tang
Mustafa Ayazoglu Bahri Batuhan Bilecen Mingxi Li Yuhang Zhang Xianjun Fan
Yankai Sheng Long Sun Zibin Liu Weiran Gou Shaoqing Li Ziyao Yi
Yan Xiang Dehui Kong Ke Xu Ganzorig Gankhuyag Kihwan Yoon Jin Zhang
Gaocheng Yu Feng Zhang Hongbin Wang Zhou Zhou Jiahao Chao
Hongfan Gao Jiali Gong Zhengfeng Yang Zhenbing Zeng Chengpeng Chen
Zichao Guo Anjin Park Yuqing Liu Qi Jia Hongyuan Yu Xuanwu Yin
Kunlong Zuo Dongyang Zhang Ting Fu Zhengxue Cheng Shiai Zhu
Dajiang Zhou Hongyuan Yu Weichen Yu Lin Ge Jiahua Dong Yajun Zou
Zhuoyuan Wu Binnan Han Xiaolin Zhang Heng Zhang Xuanwu Yin Ben Shao
Shaolong Zheng Daheng Yin Baijun Chen Mengyang Liu Marian-Sergiu Nistor
Yi-Chung Chen Zhi-Kai Huang Yuan-Chun Chiang Wei-Ting Chen
Hao-Hsiang Yang Hua-En Chang I-Hsiang Chen Chia-Hsuan Hsieh Sy-Yen Kuo
Tu Vo Qingsen Yan Yun Zhu Jinqiu Su Yanning Zhang Cheng Zhang
Jiaying Luo Youngsun Cho Nakyung Lee Kunlong Zuo

Figure 1. NTIRE 2023 Real-Time 4K SR. We introduce a new benchmark and a diverse test set for 4K Super-Resolution. (Panels: 720p, 1080p, 4K.)

Abstract

This paper introduces a novel benchmark for efficient upscaling as part of the NTIRE 2023 Real-Time Image Super-Resolution (RTSR) Challenge, which aimed to upscale images from 720p and 1080p resolution to native 4K (×2 and ×3 factors) in real-time on commercial GPUs. For this, we use a new test set containing diverse 4K images ranging from digital art to gaming and photography. We assessed the methods devised for 4K SR by measuring their runtime, parameters, and FLOPs, while ensuring a minimum PSNR fidelity over Bicubic interpolation. Out of the 170 participants, 25 teams contributed to this report, making it the most comprehensive benchmark to date and showcasing the latest advancements in real-time SR.

† Organizers and corresponding authors. Computer Vision Lab, University of Würzburg, Germany. ‡ Co-organizer. SIE FTG, USA. {marcos.conde, radu.timofte}@uni-wuerzburg.de
NTIRE 2023 webpage: https://cvlai.net/ntire/2023/
Code: https://github.com/eduardzamfir/NTIRE23-RTSR

1. Introduction

Single image super-resolution (SR) refers to the process of generating a high-resolution (HR) image from a single degraded low-resolution (LR) image. This ill-posed problem was initially solved using interpolation methods [28, 77-79]. However, with the emergence of deep learning, SR is now commonly approached through the use of deep neural networks [17, 24, 49, 56, 57, 84, 88, 99]. Image SR assumes that the LR image is obtained through two major degradation processes: blurring and down-sampling. This can be expressed as:

y = (x ∗ k) ↓s ,   (1)

where ∗ represents the convolution operation between the LR image and the blur kernel k, and ↓s is the down-sampling operation with down-sampling factor ×s. Most SR methods are built around the Bicubic model [77, 78] with various down-scaling factors (e.g. ×2, ×3, ×4, ×8).

The advancements in hardware technologies have led to the training of larger and deeper neural networks for image super-resolution, resulting in significant performance improvements. However, these breakthroughs often come at the cost of introducing more complex approaches [3, 20, 56, 84, 99]. Since the seminal work by Shi et al. [70], the design of efficient deep neural networks for single image super-resolution [40, 47, 72, 81, 101] has become pivotal. Various workshops and challenges, such as [42, 53, 94], have emerged as popular forums for sharing ideas and advancing the state of the art in efficient and real-time SR. Publicly available large-scale datasets have been instrumental in driving recent advances in image and video SR [1, 32, 52, 66, 76]. However, with the exception of DIV8K [32] and [95], most existing datasets have images of limited resolution, e.g. 2K. In addition, the practical challenge of performing real-time SR of images and videos to 4K resolution has received relatively little attention so far.

As the amount of digital content continues to surge, there is a mounting demand for effective SR techniques for rendered content [86, 90]. However, rendering presents unique challenges, as it often exhibits significant aliasing, resulting in jagged lines and other sampling artifacts. Consequently, up-scaling rendered content requires a novel approach that involves both anti-aliasing and interpolation, which is distinct from the well-established research on denoising and deblurring in existing SR research [86].

In conjunction with the 2023 New Trends in Image Restoration and Enhancement (NTIRE) workshop, we introduce the real-time 4K super-resolution challenge. The challenge entails super-resolving an LR image from either 720p or 1080p to 4K resolution using a network that reduces one or several aspects, such as runtime, parameters, FLOPs, and memory consumption. The goal is to at least outperform bicubic interpolation on a new and diverse benchmark, while maintaining efficiency. The challenge seeks to identify innovative and advanced solutions for real-time super-resolution, benchmark their efficiency, and identify general trends for designing efficient SR networks.

2. NTIRE 2023 Real-Time Super-Resolution Challenge

The aim of this challenge is to create real-time super-resolution (SR) methods, with a specific focus on up-scaling to 4K resolution. We believe that this area remains largely unexplored within the computer vision community. The challenge has three main objectives: firstly, to advance research on real-time SR methods; secondly, to introduce a novel and competitive benchmark for 4K SR, utilizing various image types such as digital art and natural imagery; thirdly, to facilitate interactions between academic and industry participants and encourage potential collaborations.

2.1. 4K SR Benchmark Dataset

The 4K RTSR benchmark provides a unique test set comprising ultra-high-resolution images from various sources, setting it apart from traditional super-resolution benchmarks. Specifically, the benchmark addresses the increasing demand for upscaling computer-generated content, e.g. gaming and rendered content, in addition to photorealistic imagery, thereby posing a different challenge for existing SR approaches. The test set includes diverse content such as rendered gaming images and digital art, as well as high-resolution photorealistic images of animals, city scenes, and landscapes, totaling 110 test samples. We created this benchmark with the intention of advancing the development of SR methods, as well as replacing outdated test sets such as Set5 [7], Set14 [93], and Urban100 [39].

All the images in the benchmark test set are at least 4K resolution, i.e. 3840 × 2160 (some are bigger, even 8K). The images were filtered manually to ensure there are no unpleasant effects such as noise or strong defocus.

The distribution of the 4K RTSR benchmark test set is: 14 real-world captures using a 60MP DSLR camera, 21 rendered images using Unreal Engine [38], and 75 diverse images, e.g. animals, paintings, digital art, nature, buildings, etc.

2.2. Baseline Model

Previous lightweight SR methods [51] such as IMDN [40] or RFDN [60] are not fast enough for this task. For this reason, we use RT4KSR [92] as the baseline model for this challenge. The primary objective is to enhance its efficiency in terms of runtime, parameter count and FLOPs. Drawing inspiration from the research presented in [42, 53], the baseline design utilizes a shallow convolutional architecture to achieve rapid and precise reconstruction performance.

The proposed baseline stacks five simple 3 × 3 convolutions with a GeLU activation layer and adds a global residual connection with LayerNorm [6] before the standard depth2scale up-sampling operation. Besides, the authors in [92] develop a sophisticated approach that improves model efficiency by downscaling feature maps. To avoid losing important high-frequency details that are already scarce, the authors propose extracting HF details from the LR input prior to its downscaling. Additionally, the authors provide a detailed roadmap of their method's development, resulting in a competitive shallow CNN design that can be scaled up and achieves performance comparable to previous state-of-the-art efficient SR models.

2.3. Tracks and Competition

The objective of this challenge is to develop a high-performance SR technique that can upscale a broad range of images to 4K resolution in real-time, while ensuring a PSNR above traditional Bicubic interpolation.

Track 1: 1080p to 4K. The first challenge track addresses X2 up-scaling from 1080p to 4K resolution.

Track 2: 720p to 4K. The second leg of this NTIRE challenge addresses X3 up-scaling from 720p to 4K resolution.

Challenge Phases. Development and Validation Phase. The participants were provided with access to a validation set comprising 100 images from the DIV2K validation split, along with an additional collection of 50 images that included a variety of content, from videogames to realistic high-resolution photography. The baseline model, scoring function, and evaluation scripts were made available to the participants through GitHub (https://github.com/eduardzamfir/NTIRE23-RTSR). This allowed the participants to benchmark the performance of their models on their systems. During the development phase, the objective was aimed at up-scaling 2K imagery, since DIV2K does not include any 4K imagery. Testing Phase. During the final test phase, the participating teams received a 4K benchmark comprising 110 diverse images. However, they did not have access to the HR ground-truth. Once the participants generated their super-resolved results, they submitted their code, factsheets and resulting images to the organizers via email. The organizers then validated and executed the submitted code to obtain the final results, which were later conveyed to the participants upon completion of the challenge.

Evaluation Protocol. The quantitative evaluation metrics for this challenge comprise testing PSNR, runtime, number of parameters, number of FLOPs and maximum GPU memory consumed during inference. The PSNR is calculated on 110 RGB images sourced from our 4K benchmark test set. The corresponding degraded images are obtained through bicubic down-scaling to their respective resolutions (1080p for X2 and 720p for X3 up-scaling). The average runtime is determined by using mixed-precision and repeatedly evaluating randomly initialized tensors of corresponding sizes, to overcome any bottlenecks that may arise due to data loading. The FLOPs are evaluated on input images of size 1920 × 1080 and 1280 × 720, respectively.

S = 2^{2 × (PSNR_M − PSNR_B)} / (C × T_M^{0.5})   (2)

Similar to [42], we determine the final score S of each participant in the challenge by utilizing Eq. (2), in which PSNR_M and T_M represent the PSNR result and runtime of the individual submission, PSNR_B is the PSNR of the Bicubic reference, and C is a normalization constant. The scoring function is designed to prioritize faster runtime over restoration accuracy. However, in cases where two methods have similar runtimes, the PSNR value will be the deciding factor.

Related NTIRE 2023 Challenges. The NTIRE 2023 Real-Time Image Super-Resolution (RTSR) Challenge is part of the NTIRE 2023 Workshop series of challenges on: night photography rendering [71], HR depth from images of specular and transparent surfaces [91], image denoising [55], video colorization [44], shadow removal [80], quality assessment of video enhancement [62], stereo super-resolution [82], light field image super-resolution [85], image super-resolution (×4) [100], 360° omnidirectional image and video super-resolution [9], lens-to-lens bokeh effect transformation [18], real-time 4K super-resolution [19], HR nonhomogenous dehazing [4], and efficient super-resolution [54].

2.4. Architectures and Main Ideas

Here we summarize the core ideas behind the most competitive solutions. Each proposed solution is covered in the following Sec. 3 and Tab. 2.

1. Re-parameterization allows training the network using complex blocks [22], while during inference the so-called RepBlocks can be reduced to a simple 3 × 3 convolution (see the sketch below).

2. Pixel shuffle and unshuffle (also known as depth-to-space and space-to-depth, respectively) [70] efficiently transform the feature maps and perform both spatial upsampling and downsampling.

3. Multi-stage Training. Since the neural networks are extremely constrained and shallow, this technique allows maximizing learning by alternating different learning rates and loss functions.
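For illustration, here is a minimal PyTorch sketch of idea 1 (structural re-parameterization). It is not any team's exact code: a training-time block with parallel 3×3, 1×1, and identity branches is folded into a single 3×3 convolution for inference, and the two produce identical outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBlock(nn.Module):
    """Training-time block with 3x3, 1x1, and identity branches."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x) + x

    def merge(self) -> nn.Conv2d:
        """Fold the 1x1 and identity branches into a single 3x3 convolution."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels, 3, padding=1)
        k3 = self.conv3.weight.data.clone()
        k1 = F.pad(self.conv1.weight.data, [1, 1, 1, 1])   # place 1x1 kernel at the 3x3 center
        kid = torch.zeros_like(k3)
        for c in range(k3.shape[0]):
            kid[c, c, 1, 1] = 1.0                           # identity branch as a delta kernel
        fused.weight.data = k3 + k1 + kid
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused

x = torch.randn(1, 16, 32, 32)
block = RepBlock(16).eval()
with torch.no_grad():
    assert torch.allclose(block(x), block.merge()(x), atol=1e-5)
```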

1497
Table 1. Results of the NTIRE23 Real-Time SR challenge. The runtimes are computed using an NVIDIA RTX 3090 GPU. The teams are ordered by their ranking according to their score. For comparison, runtimes can be grouped into the bands < 24 FPS, 24-30 FPS, 30-60 FPS, 60-120 FPS and > 120 FPS.

Team | Score | # Params (M) | FLOPs (G) | PSNR RGB (dB, ↑) | PSNR Y (dB, ↑) | SSIM RGB (↑) | Runtime (ms, ↓)
Track 1: Upscaling from 1080p to 4K resolution.
Bicubic - - - 33.92 36.66 0.8829 0.46
Noah TerminalVision 24.13 2.3523 9.062 35.02 37.74 0.8957 3.190
ALONG 23.81 0.0668 15.3281 34.63 37.38 0.8906 1.910
RTVSR 23.13 0.0266 13.7687 34.71 37.50 0.8910 2.240
Team OV 19.06 0.0042 8.734 34.62 37.45 0.8899 2.910
DFCDN Team 15.17 0.0064 6.0881 34.63 37.46 0.8916 4.670
DoYouChargeQQCoin 15.07 0.0008 1.6921 34.14 36.97 0.8855 2.380
NJUST-RTSR 14.96 0.0114 23.5893 34.74 37.64 0.8901 5.560
Multimedia 14.09 0.0100 20.4125 34.85 37.61 0.8926 7.300
PixelBE 13.12 0.0137 14.7226 34.70 37.52 0.8908 6.840
z6 12.87 0.0414 85.7309 35.02 37.76 0.8948 11.19
AGSR 12.77 0.0068 14.0673 34.31 37.00 0.8888 4.220
Antins cv 11.25 0.0111 22.9174 34.71 37.56 0.8921 9.470
ECNU SR 10.37 0.1623 83.2094 35.30 37.95 0.8971 25.23
R.I.P. ShopeeVideo 9.68 0.3987 272.7942 35.32 38.01 0.8971 29.73
dh isp 7.63 0.0113 23.4234 33.99 36.89 0.8809 7.600
P.AI.R 6.27 0.0212 38.486 34.65 37.47 0.8905 28.31
NTU BL6 6.07 0.2223 409.8416 35.26 38.04 0.8977 69.37
diSRupt 5.54 0.0500 207.0 34.07 36.86 0.8830 16.00
Touch Fish 5.03 0.0641 132.5777 34.28 37.14 0.8862 26.31
SEU CNII 4.84 0.0299 58.5454 34.24 37.10 0.8858 26.89
KCML2 3.99 0.0392 57.2567 34.24 37.09 0.8851 39.17
NPU SR 3.45 0.2001 0.165 (*) 34.49 37.42 0.8895 74.00
YNOT 2.25 0.4734 422.6991 34.03 36.99 0.8844 92.79
Our Baseline [92] 9.27 0.0445 171.99 34.22 37.01 0.8854 7.090
Track 2: Upscaling from 720p to 4K resolution.
Bicubic - - - 31.30 33.82 0.8245 0.46
Aselsan Research 31.26 0.0504 11.6343 32.06 34.56 0.8344 1.170
Team OV 29.63 0.0058 5.3748 32.17 34.72 0.8376 1.510
ALONG 28.57 0.2404 13.8019 32.18 34.66 0.8367 1.660
RTVSR 26.89 0.0532 12.2315 32.22 34.77 0.8372 1.960
Noah TerminalVision 26.68 17.797 16.1252 32.65 35.10 0.8455 3.640
NJUST-RTSR 23.51 0.0135 12.4748 32.25 34.90 0.8384 2.680
Antins cv 23.44 0.0127 11.6785 32.63 35.21 0.8457 4.600
DFCDN Team 22.64 0.0075 3.7011 32.07 34.63 0.8371 2.250
Multimedia 21.55 0.0125 11.4361 32.33 34.83 0.8398 3.560
z6 20.90 0.0457 41.9365 32.59 35.05 0.8446 5.470
R.I.P. ShopeeVideo 15.67 0.4073 129.2038 32.84 35.30 0.8469 13.79
ECNU SR 15.39 0.1662 37.8667 32.64 35.17 0.8458 10.75
Touch Fish 11.55 0.1465 134.7748 32.67 35.31 0.8468 19.86
P.AI.R 8.66 0.1280 104.362 32.55 35.04 0.8441 30.03
SEU CNII 6.68 0.0629 55.0807 31.85 34.52 0.8326 19.05
diSRupt 6.34 0.0649 120.0 31.64 34.25 0.8292 16.00
Our Baseline [92] 14.01 0.0575 219.77 31.74 34.37 0.8299 3.740
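The scores in Table 1 follow Eq. (2). As a minimal illustration (the normalization constant C is not restated here, so it is left as a parameter and set to 1 by default):

```python
def challenge_score(psnr_model: float, psnr_bicubic: float, runtime_ms: float, c: float = 1.0) -> float:
    """S = 2^(2*(PSNR_M - PSNR_B)) / (C * T_M^0.5), with runtime in milliseconds."""
    return 2.0 ** (2.0 * (psnr_model - psnr_bicubic)) / (c * runtime_ms ** 0.5)

# Example with the Track-1 Bicubic reference (33.92 dB) and a hypothetical 35.0 dB / 3 ms entry.
print(challenge_score(35.0, 33.92, 3.0))
```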

3. Methods and Teams

3.1. AsConvSR

The winning team in Track 1, Noah TerminalVision, proposes a fast and lightweight super-resolution network (AsConvSR) with assembled convolutions [34]. The key points and contributions of the proposed network (see Fig. 2a) are as follows: (i) Pixel unshuffle [41] is used to reduce the resolution of the image and increase the channel dimension. This design reduces the computational cost of the network while keeping the information volume unchanged. (ii) They remove all residual connections and keep a global skip connection, which repeats each pixel value 4× (or 9× for ×3 SR) [26]. (iii) The authors propose an assembled convolution structure (Fig. 2b). Different from dynamic convolution [14], which generates the whole convolution kernel as a linear combination of the basis, assembled convolution generates the optimal kernel coefficient for each output channel, which is more flexible and outperforms dynamic convolution in this task.

Figure 2. Team Noah TerminalVision solution. (a) FLASRN network architecture. (b) Control module, assembled convolution, and a comparison between dynamic and assembled convolution.

Network architecture. Given an input LR image, the resolution is converted to the channel dimension by a pixel unshuffle layer. Using a 3×3 convolution, the channels of the feature map are converted to the target size (32 for ×2, 64 for ×3) and then fed into the assembled block. The assembled block contains a control module and three assembled convolutions. As shown in Fig. 2b, the control module is mainly responsible for generating coefficients for the assembled convolutions. Based on these coefficients, a 3×3 convolution kernel is generated to perform a classical convolution on the feature maps. Therefore, the major computational cost of the assembled convolution is still the 3×3 convolution itself, and the runtime of an assembled convolution is only slightly higher than that of a classical convolution. After the assembled block, a 3×3 convolution layer converts the channel size to 48 (108 for ×3 SR) so that the feature map can be restored to the target resolution by the pixel shuffle layer. It should be noted that a low-resolution image repeated in the channel dimension can also be restored to high resolution with a pixel shuffle layer; the final pixel shuffle is therefore divided into two steps in order to add the global skip connection to the network.

Assembled block. As shown in Fig. 2b, given the input features F ∈ R^{B×C×H×W}, the control module converts the features F into coefficients coeff ∈ R^{B×Co×E}, where B is the batch size, Co is the number of output channels, and E is the number of candidate convolution bases. Matrix multiplication is performed between the coefficients coeff and all candidate convolution kernels k_basis ∈ R^{E×Ci×ks×ks} — where Ci is the number of input channels and ks is the kernel size — to generate a final convolution kernel K ∈ R^{B×Co×Ci×ks×ks}. Because different batches of data require different convolution kernels, the batch dimension of the feature map is reshaped into the channel dimension and a group convolution is used to compute the output feature maps. As shown in Fig. 2b, dynamic convolution generates the whole convolution kernel (all channels) as a linear combination of the basis, whereas assembled convolution generates an optimal convolution kernel coefficient for each channel, which is more flexible and outperforms dynamic convolution in this task.

Implementation Details. In the training phase, the training sets include DF2K [2, 75], DIV8K [33], GTAV [68], and LIU4K-V2 [59]. The network is trained by minimizing the Charbonnier loss with the Adam optimizer. The initial learning rate is 5e-4 and is halved every 2e5 iterations. The total number of training iterations is 3e6 on a Tesla V100 platform.
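The following is a minimal, shape-level PyTorch sketch of the assembled-convolution idea (not the team's code; channel widths and the number of basis kernels are assumptions). A control module predicts per-output-channel coefficients that mix E basis kernels, and the resulting per-sample kernels are applied with a grouped convolution after folding the batch dimension into the channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssembledConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, num_basis: int = 4, k: int = 3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.basis = nn.Parameter(torch.randn(num_basis, in_ch, k, k) * 0.1)   # (E, Ci, k, k)
        self.control = nn.Sequential(                                          # coeff: (B, Co*E, 1, 1)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch * num_basis, 1))

    def forward(self, x):
        b, _, h, w = x.shape
        coeff = self.control(x).view(b, self.out_ch, -1)             # (B, Co, E)
        kernel = torch.einsum("boe,eikl->boikl", coeff, self.basis)  # (B, Co, Ci, k, k)
        kernel = kernel.reshape(b * self.out_ch, self.in_ch, self.k, self.k)
        x = x.reshape(1, b * self.in_ch, h, w)                       # fold batch into channels
        out = F.conv2d(x, kernel, padding=self.k // 2, groups=b)     # one group per sample
        return out.reshape(b, self.out_ch, h, w)

y = AssembledConv2d(32, 32)(torch.randn(2, 32, 24, 24))
print(y.shape)  # torch.Size([2, 32, 24, 24])
```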

3.2. Bicubic++

The winning team in Track 2, Aselsan Research, proposes a lightweight, single-image super-resolution method named Bicubic++ [8]. Unlike many other lightweight methods, where the input image dimensions are fixed throughout the network, Bicubic++ first downscales the image (by half, with strided convolutions) to greatly reduce the number of operations in the following convolutional layers and meet the real-time requirements. Finally, they apply ×6 upscaling. The overall structure is given in Fig. 3. In addition, they follow a three-stage training approach, where they train a slightly larger model first and perform global structured pruning of convolutional layers and biases, without using heuristic metrics like weight norms, in the following two stages. This approach ultimately yields a much faster, real-time model with no to marginal decrease in visual quality. They have not employed quantization or re-parametrization of the convolutional kernels.

Figure 3. Bicubic++ structure proposed by Aselsan Research. The s and p denote stride and padding, respectively. In the final proposed model, ch is 32, all bias terms are removed, and a strided convolution with s=2, p=1 is used for the downscaling (DS) layer. Red blocks after 3×3 convolutions are leaky ReLU activations. D2S denotes the depth-to-space layer [70].

Implementation Details. The models are trained in PyTorch Lightning. The training is done with mixed precision (FP16) by setting a precision flag in the Trainer, and the Adam optimizer with β1,2 parameters 0.99 and 0.999, respectively. For the first two stages of the training, they start with a learning rate of 5e-4; for the last stage, they start with 1e-4. They utilize a decaying learning rate scheduler for all stages, where after 500 epochs the learning rate decays linearly until reaching 1e-8. For all three stages of the training, they train for 1000 epochs using batch size 8. Each epoch consumes 800 randomly cropped and rotated LR patches of dimension (108, 108, 3) from the Q=90 degraded DIV2K [1] dataset. For validation, they use 48 LR images of dimension (680, 452, 3) from the Q=90 degraded DIV2K validation dataset.

3.3. RUNet

Team ALONG proposes RUNet: Re-parameterization and Unshuffle Network for Real-Time Super-Resolution. The team mainly considers two aspects when designing the network: (i) Receptive field: the model's ability may be limited if its receptive field is too small. (ii) Computational efficiency: the relationship between runtime and computation is not necessarily positive; a higher level of computational efficiency can result in a shorter runtime.

As shown in Fig. 4a, inspired by [83], they initially apply the pixel-unshuffle technique, which is the inverse of pixel shuffle [70], to reduce the spatial dimensions and amplify the channel dimensions of the data before feeding it into the main model architecture. Thus, the majority of the computation is performed in a smaller resolution space, leading to a reduction in computational resource consumption and an effective improvement of the inference speed. Furthermore, this approach increases the receptive field. Next, a convolutional layer followed by an activation function is applied, which effectively extracts low-level features from the input image. The body module is composed of a sequence of Re-Parameter blocks (RepBlock) that extract and refine features in a progressive manner. Following recent suggestions for low-level vision tasks introduced by [53, 58], the Gaussian Error Linear Unit (GeLU) activation function is utilized in the ×2 model, while the Sigmoid Linear Unit (SiLU) activation function is used in the ×3 model. Finally, an upsampling layer and a skip connection are used to increase the image resolution to the desired level; this is achieved by applying a convolutional layer followed by a pixel-shuffle layer.

Besides re-parameterization [22], they also use Knowledge Distillation [36] during training. In the training stage, teacher output images and ground-truth images are used to guide the student network via teacher supervision (TS) and data supervision (DS), respectively. They use the HAT-L model [13] as the teacher model, which is currently considered the SOTA model in the field of super-resolution.

Implementation Details. The method is implemented using PyTorch 1.13. The loss function is L1 for reconstruction, and L2 is employed during the fine-tuning and knowledge distillation phases. For the X2 model, the number of channels in the CNN model (student) is 32 and the number of RepBlocks is 3; additionally, the scale of the pixel unshuffle and pixel shuffle layers is 3. For the X3 model, the number of channels in the CNN model (student) is 64 and the number of RepBlocks is 5; additionally, the scale of the pixel unshuffle and pixel shuffle layers is 4.

3.4. Team OV

Team OV presents a simple and efficient convolutional neural network architecture that incorporates 3×3 convolutions, the GELU activation function, and depth-to-space operations. The network utilizes 12 (for ×2) and 16 (for ×3) channels and produces the final image output through the depth-to-space operation. These architectural elements are depicted in Figure 5. The team also uses re-parameterization, as shown in Fig. 5 (b).

Implementation Details. The network was trained using the DF2K (DIV2K+Flickr2K) dataset [2, 75] in three stages. Initially, low-resolution (LR) patches of dimension 128×128 are randomly cropped from high-resolution (HR) images with a mini-batch size of 64. L1 and FFT losses are used as target loss functions. Following this, network parameters were optimized for 300K iterations employing the Adam algorithm, with a learning rate of 1 × 10−3 decreasing to 1 × 10−7 through the cosine scheduler. In the second stage, the model obtained from the first stage was trained similarly for another 300K iterations. In the final stage, the model was fine-tuned using L2 loss and FFT loss; network parameters are optimized for 300k iterations through the Adam algorithm, with a learning rate of 5e-4 reduced to 1e-7 using the cosine scheduler.
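To make the downscale-then-upscale pattern of Bicubic++ (Sec. 3.2) concrete, below is a minimal PyTorch sketch. It is not the team's code and the layer widths are assumptions: a strided convolution halves the resolution, a few 3×3 convolutions work at the reduced scale, and a single depth-to-space step upscales by 6× for a net ×3 factor.

```python
import torch
import torch.nn as nn

class DownUpSR(nn.Module):
    def __init__(self, ch: int = 32, scale: int = 3):
        super().__init__()
        r = 2 * scale                                             # internal upscale after the /2 step
        self.down = nn.Conv2d(3, ch, 3, stride=2, padding=1)      # x0.5 via strided convolution
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True))
        self.up = nn.Sequential(
            nn.Conv2d(ch, 3 * r * r, 3, padding=1), nn.PixelShuffle(r))   # x6 depth-to-space

    def forward(self, x):
        return self.up(self.body(self.down(x)))

print(DownUpSR()(torch.randn(1, 3, 720, 1280)).shape)  # torch.Size([1, 3, 2160, 3840])
```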
Figure 4. Team ALONG. Overview of the proposed RUNet. (a) RUNet architecture. (b) Reparameter block.

Figure 5. Team OV. Overview of the proposed solution. (a) The model architecture proposed by Team OV (training and inference variants built from 3×3 convolutions, GELU activations and a pixel-shuffle output). (b) The Rep block proposed by Team OV: parallel conv-1×1/conv-3×3 branches combined with fixed Sobel, Laplacian and Prewitt filters.

3.5. Repnet

The team RTVSR proposes Repnet for Real-Time Super-Resolution. To reduce the spatial dimension inside the CNN, they first use a paired space2depth and depth2space for single image super-resolution. Furthermore, they also re-parameterize (conv3-bn-conv1) blocks into a normal 3×3 convolution during inference, effectively improving the performance of the model without increasing its computational complexity. For the ×2 and ×3 tracks, based on runtime considerations, the model body uses three and four repconvs, respectively. The network is illustrated in Fig. 6a.

Figure 6. Team RTVSR. Overview of the proposed Repnet. (a) The architecture of the Repconv-based plain net for RTSR. (b) The architecture of the reparameterized convolution module.

Implementation Details. Their training framework uses PyTorch for training on an A100 GPU. The DIV2K, Flickr2K, DIV8K, and GTAV datasets are used for training. Model training is divided into two stages. In the first stage, the reconfigurable parameterized network structure shown in Fig. 6b is used for training. It is trained for 150 epochs using batch size 32, the patch size is 256×256, and the learning rate is 2e-4; the Adam optimizer is used. In the second stage, they use L2 loss to fine-tune the model obtained in the previous stage. The batch size is 16, the patch size is 256×256, the learning rate is 1e-5, and the training lasts 50 epochs. After the training (and during inference), they re-parameterize the model into a network structure with conventional 3×3 convolutions, as shown in Fig. 6b.
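A basic ingredient of the (conv-bn) merging described above, and of the BN fusion several teams use before inference, is folding a BatchNorm layer into the preceding convolution. A minimal sketch (standard formula, not any team's code):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d equivalent to bn(conv(x)) at inference time."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                    # per-output-channel scale
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused

conv, bn = nn.Conv2d(8, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 8, 16, 16)
with torch.no_grad():
    assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```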

3.6. DFCDN

Team DFCDN proposes a novel network for efficient image super-resolution with a deep feature complement and distillation network (DFCDN). They use online convolutional re-parameterization to reduce the large extra training cost introduced by re-parameterization.

Network Architecture. The overall architecture of Team DFCDN is shown in Fig. 7. The proposed network consists of only one deep feature complement and distillation block (DFCDB). Inspired by [35, 67], the input feature map is split equally along the channel dimension in the block. Then several convolutional layers process one of the split feature maps to generate complement features. The input features and complementary features are concatenated to avoid loss of input information and are distilled by a conv-1 layer. Besides, the output feature map of the DFCDB is further enhanced by an efficient spatial attention layer [63].

Figure 7. Team DFCDN: The overall architecture of the proposed DFCDN network. (a) DFCDN, (b) DFCDB, (c) RepConv.

Online Convolutional Re-parameterization. Re-parameterization [96] has improved the performance of image restoration models without introducing any inference cost. However, the training cost is large because of the complicated training-time blocks. To reduce this extra training cost, they apply online convolutional re-parameterization [37] by converting the complex conv blocks into one single convolutional layer. The architecture of RepConv is shown in Fig. 7 (c). It can be converted to a 3 × 3 convolution during training, which saves considerable training cost.

Implementation Details. The number of features is set to 8 and the number of attention channels is set to 16. The DIV2K [1] dataset is used for training and the inputs are in the range 0-255. First, for training the ×2 (Track 1) models, the setup is as follows: the model is first trained from scratch with 256×256 patches randomly cropped from the HR images of DIV2K. The mini-batch size is set to 64. The L1 loss is minimized with the Adam optimizer. The initial learning rate is set to 5e-4 with a cosine annealing schedule. The total number of epochs is 1000. At the second stage, the model is initialized with the pre-trained weights of Stage 1; the HR patch size is set to 640 and the model is trained with the same settings as in the previous step. At the third stage, the model is initialized with the pre-trained weights of Stage 2; the MSE loss is used for fine-tuning with 640 × 640 HR patches and a learning rate of 1e-5 for 100 epochs.

The training details for ×3 (Track 2) are as follows: at the first stage, the model is initialized with the pre-trained weights of the scale-2 model; the HR patch size is set to 660 and the model is trained with the same settings as X2. At the second stage, the model is initialized with the pre-trained weights of Stage 1; the MSE loss is used for fine-tuning with 660 × 660 HR patches and a learning rate of 1e-5 for 100 epochs.
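Below is a minimal sketch of the split-complement-distill pattern used in blocks of this kind (it is generic and not the exact DFCDB; widths are assumptions): half of the channels are refined by small convolutions, concatenated back with the untouched half, and compressed again by a 1×1 "distillation" convolution.

```python
import torch
import torch.nn as nn

class SplitDistillBlock(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        half = channels // 2
        self.refine = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.GELU(),
            nn.Conv2d(half, half, 3, padding=1), nn.GELU())
        self.distill = nn.Conv2d(channels, channels, 1)    # conv-1 distillation layer

    def forward(self, x):
        keep, work = torch.chunk(x, 2, dim=1)               # split along the channel dimension
        complement = self.refine(work)                       # complementary features
        return self.distill(torch.cat([keep, complement], dim=1)) + x

print(SplitDistillBlock()(torch.randn(1, 16, 32, 32)).shape)
```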

Figure 8. Team NJUST-RTSR: The overall architecture of the proposed network. (Bottom) Detailed structure of the proposed RepRB in its training phase and inference phase.

Figure 9. Team z6 proposed LRSRN network structure. (a) Training mode of the proposed network. (b) Inference mode of the proposed network.

3.7. NJUST-RTSR

The team proposes a method that first transforms the input LR image into the feature space using a convolutional layer, then performs feature extraction using four reparameterizable residual blocks (RepRBs), and finally reconstructs the final output with a sub-pixel [70] convolution. The proposed architecture is illustrated in Fig. 8. To enhance the capability of the model, they use the re-parametrization technique [23]. Fig. 8 (Bottom) shows a detailed description of the RepRB module. It contains three branches in the training phase to learn features from different receptive fields, while in the inference phase it can be merged into a 3 × 3 convolution.

Implementation Details. The team uses DIV2K [2] and Flickr2K [75] as the training data. In order to accelerate the IO speed during training, they crop the 2K-resolution images into sub-images — the HR image is cropped into 640 × 640 and 960 × 960 sub-images for ×2 and ×3 SR, respectively. During training, data augmentation is performed on the input patches with random horizontal flips and rotations. The HR image patch size is initialized as 128 × 128 and increases to 256 × 256, and the batch size is set to 64. They use the Adam [46] optimizer with the cosine annealing scheme [64]. The initial learning rate is set to 1 × 10−3 and the minimum one to 1 × 10−6. The total number of iterations is set to 300k. They use a combination of mean absolute error (MAE) loss and an FFT-based frequency loss function to constrain the model training, the same as [73]. All experiments are conducted with the PyTorch framework on an NVIDIA GeForce RTX 3090 GPU.

3.8. LRSRN

Team z6 proposes a Lightweight Real-Time Image Super-Resolution Network (LRSRN) [30] that can deliver higher accuracy at a faster speed compared to previous real-time SR models for 4K images. They apply a reparameterized convolution (RepConv) for all convolution layers to improve the image quality while maintaining the model size and inference speed. The proposed network is an extended version of [29] (previous work of the team), which was designed for mobile devices. The proposed network is illustrated in Fig. 9.

Implementation Details. The team used PyTorch 1.13. The models were trained in two steps: (i) First, models were trained from scratch. The LR patches were cropped from HR images with mini-batch size 8 and resolutions of 192 × 192 (Track 1) and 128 × 128 (Track 2). The Adam optimizer was used with a 0.0005 learning rate and a cosine warm-up scheduler. The total number of epochs was set to 800, and the L1 loss was used. (ii) In the second step, the model was initialized from the previous step. Fine-tuning with L2 loss improves the PSNR value by 0.01-0.02 dB. In this step, the initial learning rate was set to 0.0001 and the total number of epochs to 200. In particular, DIV2K [1] was used for the scratch training. A combined dataset, which includes the DIV2K train set (800 images), Flickr2K (2650 images), GTA (train sequences 00-19), and LSDIR [52] (first 1000 images), was used for the fine-tuning stage. The training data is preprocessed by center-cropping it to a resolution of 2040 × 1080. To generate the low-resolution inputs, they degrade the center-cropped images with bicubic downsampling and JPEG compression. During training, they used random cropping, rotations, and flips as augmentations.
3.9. SCSYENet

Team Multimedia proposes SCSYENet: a compact, skip-concatenated, simple yet effective real-time image super-resolution network based on an element-wise multiplication fusion operation and re-parameterized convolution [51].

They built an end-to-end RTSR network based on the element-wise multiplication fusion operation and re-parameterized convolution, following previous work [5, 43, 97]. SCSYENet has only 10K/12.5K parameters (in Track 1 (X2) and Track 2, respectively). The network consists of two asymmetrical branches with simple building blocks. To effectively connect the results of the asymmetrical branches, an element-wise multiplication fusion operation is proposed. The architecture of SCSYENet is illustrated in Fig. 10a.

Network Structure. Inspired by ECBSR [97], SCSYENet employs the re-parameterization technique to boost the SR performance while maintaining high efficiency. The model consists of six ECBs (see Fig. 10b), one PReLU, two fusion blocks and one skip connection (a concatenation of the preprocessed input image and the intermediate feature map). The number of channels in the network is set to 16. Pixel shuffle is used to produce the final image output. Typically, in previous multi-branch networks, the fusion of the outputs of different branches is done by concatenation [5, 74] or element-wise addition followed by an activation function [21, 31]. In this study, in order to effectively improve the representational power, an element-wise multiplication fusion operation [43], as in Fig. 10a, is employed for the fusion of the results of the two branches, where ⊗ is the element-wise multiplication and ⊕ is the element-wise addition. During inference, the ECB block can be reparameterized into one single 3×3 convolution.

Implementation Details. The team uses PyTorch 1.21.1, and the training device is an A100 GPU. During training, the DIV2K [1] and Flickr2K [75] datasets are used for the whole process. The team follows a 3-stage training. First, the model is trained from scratch: HR patches of size 128 × 128 are randomly cropped from HR images, and the mini-batch size is set to 32. The SCSYENet model is trained by minimizing the L1 loss function with the Adam optimizer; the initial learning rate is set to 1 × 10−4 and decayed with a cosine annealing scheduler every 200 epochs, and the total number of epochs is 1000. Second, the model is initialized with the pretrained weights and trained with the same settings as in the previous step; this process repeats once. Third, the training settings are the same as Stage 1, except that L2 loss is used for fine-tuning with 2040 × 1080 HR patches and an initial learning rate of 1 × 10−5; the mini-batch size is set to 4.

3.10. ERLFN

Team Antins CV proposes a method built on the Residual Local Feature Network (RLFN) [48]. Based on this network, they prune the architecture and introduce the Enhanced Residual Block (ERB) RepBlock proposed by the runner-up solution in [51], resulting in their Enhanced Residual Local Feature Network (ERLFN).

Network Structure. The RLFN proposed by [48] is an efficient network for the lightweight super-resolution task. For this real-time super-resolution task, they further prune the network for an ideal speed. For Track 1 (upscaling from FHD 1080p to 4K), the network requires heavy computation. To balance for speed, they cut the four RLFB blocks in RLFN down to two blocks and shrink the feature channels to 12; the ESA blocks nested in RLFB are removed to reduce computation cost and save time. For Track 2, to upscale from HD 720p to 4K resolution, they cut the four RLFB blocks in RLFN down to two blocks and shrink the feature channels to 27; the ESA blocks are kept and their channels remain at 16.

The team also uses the ERB RepBlock from the Enhanced Residual Block (ERB) first proposed by the runner-up solution in [51]. They replace the 3 × 3 convolutions in RLFB with the ERB RepBlock. The network and ERB block are shown in Fig. 11. For inference, the ERB RepBlock is reparameterized to a 3×3 convolution. The team does not observe any performance drop after reparameterization.

Implementation Details. The ERLFN model is trained in two stages for both Track 1 and Track 2. In the first stage, they train the model from scratch on DIV2K [1], cropped DIV8K, Flickr2K, OST, WED, the first 2000 images of FFHQ, and the first 1000 images of the SCUT-CTW1500 dataset — following [56]. The HR images are randomly cropped to patches of size 256 × 256 for Track 1 and 192 × 192 for Track 2. They use the Adam optimizer with L1 loss for this stage, set the initial learning rate to 5e-4 with a mini-batch size of 64, train the model for 1000 epochs, and decay the learning rate by 0.5 every 200 epochs. In the second stage, the model is initialized with the pretrained weights from the first stage and trained on the same data as stage 1; it is fine-tuned with L2 loss using a cosine learning rate schedule with an initial learning rate of 1e-4 for 500 epochs.
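For the element-wise multiplication fusion described for SCSYENet (Sec. 3.9), a minimal, generic sketch (not the team's exact block; widths are assumptions) of fusing two branches with a Hadamard product followed by an element-wise addition:

```python
import torch
import torch.nn as nn

class MulFusion(nn.Module):
    def __init__(self, ch: int = 16):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(ch))
        self.branch_b = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        a, b = self.branch_a(x), self.branch_b(x)
        return a * b + x          # element-wise multiplication fusion plus a skip connection

print(MulFusion()(torch.randn(1, 16, 48, 48)).shape)
```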

Figure 10. Team Multimedia. Overview of the proposed SCSYENet. (a) Detailed architecture of SCSYENet. (b) ECB: in the training stage, the block employs multiple branches, which can be merged into one normal convolution layer in the inference stage.

Figure 11. Team Antins CV proposed ERLFN network.

3.11. PCRTSR

Team ECNUSR proposes PCRTSR: a partial-convolution-based network for Real-Time Super-Resolution. The overall architecture is shown in Fig. 12. The network first uses a pixel unshuffle for faster speed and a larger receptive field. Then, several stacked PCBS blocks (Fig. 12 (a)) make up the feature extraction, where each PCBS block is composed of several PCB blocks (Fig. 12 (b)) and a residual connection. Finally, the reconstruction module, consisting of a 3×3 vanilla convolution and a pixel shuffle operation, produces the SR image.

Figure 12. Team ECNUSR architecture for the Partial Convolution based Network for Real-Time Super Resolution (PCRTSR). (a) PCBS block and (b) PCB block.

Network Structure. The team designs the models using partial convolution to accelerate the running speed; they do not use pruning or re-parameterization.

PCB Block. The high latency of most efficient networks is due to the frequent memory access of the operators. To address this, a PCB block is proposed which consists of a partial convolution. The partial convolution applies filters on 1/4 of the channels, resulting in lower FLOPs than a vanilla convolution and higher FLOPS (throughput) than a group convolution. Each PCB block comprises a partial convolution followed by two pointwise convolution layers, with a PReLU activation layer after the middle layer. During feature extraction, there are 3 PCBS blocks which consist of 2, 4 and 2 PCB blocks, respectively. The kernel size of the partial convolution and the vanilla convolution is 3 × 3. The architecture is symmetrically designed and highly optimized, resulting in lower inference latency.

Implementation Details. The team first trained the models on the DF2K (combined DIV2K and Flickr2K) dataset [75], and then fine-tuned on a combined dataset consisting of DIV8K, FFHQ, LSDIR [52], and GTA V for data variety. The patches are cropped with a size of 256 × 256 and augmented by random flipping and rotation. The model is trained with the Adam [46] optimizer with β1 = 0.9 and β2 = 0.999. The initial learning rate is set to 5 × 10−4 and decreases by half at 8 × 10^6 and 1.4 × 10^7 iterations. L1 loss is used for training. The model is implemented in PyTorch 1.12 using one 2080Ti GPU.
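A minimal sketch of the PCB idea (generic, with assumed channel widths): a 3×3 convolution touches only the first 1/4 of the channels, reducing memory traffic, and is followed by two pointwise convolutions with a PReLU in between.

```python
import torch
import torch.nn as nn

class PartialConvBlock(nn.Module):
    def __init__(self, channels: int = 32, ratio: float = 0.25):
        super().__init__()
        self.part = int(channels * ratio)
        self.pconv = nn.Conv2d(self.part, self.part, 3, padding=1)   # 3x3 on a channel subset
        self.pw1 = nn.Conv2d(channels, channels, 1)
        self.act = nn.PReLU(channels)
        self.pw2 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        head, tail = x[:, :self.part], x[:, self.part:]
        x = torch.cat([self.pconv(head), tail], dim=1)                # partial 3x3 convolution
        return self.pw2(self.act(self.pw1(x)))

print(PartialConvBlock()(torch.randn(1, 32, 64, 64)).shape)
```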

3.12. R2CNet

Team R.I.P. ShopeeVideo proposes R2CNet, which uses efficient bottle-in-bottle blocks for RTSR. As shown in Fig. 13, they propose a hardware-efficient R2C block with well-designed channel numbers. In the R2C block, they stack efficient 3×3 convolutions [22] inside with small channel numbers, while keeping the channel numbers large outside to improve performance. In R2CNet, a novel downsample-upsample mechanism is also utilized to process images of large size (4K). Neither pruning nor re-parametrization is used in R2CNet.

Figure 13. Team R.I.P. ShopeeVideo proposed R2C block and R2CNet. (a) The R2C block uses L-ESA, an improved ESA [60]; Batch Normalization (BN) is applied to each convolution layer to accelerate convergence. (b) R2CNet: the macro structure is based on RLFN [47] and M R2C blocks are used.

Network Structure. The proposed R2C block is illustrated in Fig. 13 (a); an input 1×1 convolution reduces the channel numbers, and the output one increases them again. Thus, the channel number inside the block is small, making it efficient to stack efficient 3×3 convolutions inside [10, 22], i.e., N basic blocks and a skip-path 3×3 convolution. The team also proposes L-ESA for efficient and effective spatial attention, in which they simply reset the kernel size and stride of the pooling layer in ESA [60] from 7 and 3 to 11 and 7: the large kernel captures more spatial information and the large stride reduces computation and runtime [16]. With the R2C block, they build R2CNet following the macro structure of RLFN [47], as shown in Fig. 13 (b).

To process images of large size (4K) efficiently, they also introduce a new downsample-upsample mechanism into R2CNet: they simply set the stride of the first R2C block to 2 for downsampling and use a pixel shuffle layer with factor 2 for upsampling. Specifically, in both R2CNet×2 and R2CNet×3, they set N = 4, M = 2, the channel number of the main body to 64, and that inside the R2C block to 32.

Implementation Details. The team uses PyTorch for training and inference. They train the models in three stages, each with 100k iterations. The learning rate is set to 5e-4 for the first two stages, with the first 5k iterations as warm-up, and 2e-4 for the last stage without warm-up; cosine annealing is used. The PSNR loss [12] is utilized. Adam is the optimizer and weight decay is not applied. The global batch size is set to 96 on 3 GPUs. The sizes of the HR images during training for R2CNet×3 and R2CNet×2 are 576 and 512, respectively. Before inference, the BN layers in the R2C blocks are fused into their corresponding convolution layers for fast inference. The team uses the DIV2K [2], Flickr2K, and half of the LSDIR [52] datasets for training.

3.13. FADN

Team P.AI.R proposes FADN: Few Activation Distillation Networks for Real-Time Super-Resolution. The solution is mainly based on RFDN [60]. The architecture of the proposed method differs from RFDN in two ways: 1) the simple gate (SG) introduced in NAFNet [11], which is an element-wise product of feature maps divided into two parts along the channel dimension, is used instead of ReLU in the shallow residual block (SWB); 2) simplified channel attention (SCA), also introduced in [11], is used instead of contrast-aware channel attention (CCA). The team adopted the SG and SCA to simplify the network, as the SG halves the number of channels and the SCA is a simplified version of channel attention. In addition, layer normalization was also adopted in the network to ensure a more stable training process. The FADN (see Fig. 14) consists of four no-activation distillation blocks (NADB).

Figure 14. Team P.AI.R proposed FADN. Comparison of (a) the residual feature distillation block and (b) the no-attention distillation block.

Technical details. The team trains the models with the ADAM optimizer, setting beta1=0.9, beta2=0.999, and eta=10−8. The learning rate is initialized to 2e-4 and halved every 100 epochs. The team used the LSDIR [52] dataset to train the models and generated the training LR images by downsampling HR images with bicubic interpolation and JPEG compression. The model is implemented using the PyTorch framework with an RTX 3090 GPU. The number of feature channels is 16 for ×2 SR and 40 for ×3 SR, so the number of parameters is 0.0121 M and 0.1280 M, respectively.
Figure 15. Team AGSR. Overview of the proposed OELSR. (a) The overall architecture and the structure of the RepConv block (input size [1, 3, h, w], output size [1, 3, 2h, 2w], with channel widths 3→16→16→16→16→12 before the pixel shuffle). (b) The Repblock module in its training and inference forms.

3.14. Team PixelBE

The team proposes a two-stage super-resolution algorithm based on re-parameterization. As a reference they use [98], a re-parameterizable building block, namely the Edge-oriented Convolution Block (ECB), for an efficient convolutional module design. This module uses multiple parallel convolution operators in the training phase to improve the SR capability of the model, and fuses the parallel operators into a single convolution module in the testing phase to improve inference efficiency. Based on this ECB module [98], they designed a two-stage SR algorithm as follows: (i) First, they downsample by a factor of 2 using a convolution with a stride of 2. Downsampling breaks down JPEG compression artifacts and also improves network inference speed. (ii) Then they stack two ECB modules and a ×2 upsampling pixel shuffle module to return a three-channel image. (iii) Finally, two ECB modules and a ×2 upsampling pixel shuffle module are used to return an HR image.

Implementation details. The team uses the LSDIR dataset [52] for training, and the training data is degraded online (i.e. downsampling, JPEG compression). The input image size is 128×128×3 and the optimizer is Adam. The training is divided into two stages: First, the learning rate is 1e-3 and the JPEG loss and super-resolution loss are computed at the same time; this stage is trained for 100k iterations. Second, only the super-resolution loss (L1) is computed and the learning rate is halved; this stage lasts 150k iterations.

3.15. OELSR

Team AGSR proposes an optimized extreme lightweight super-resolution network (OELSR). The Extreme Low-Power Super Resolution Network in [87] is their baseline. The network (see Fig. 15a) stacks multiple highly optimized convolution+activation layers to achieve a good trade-off between enhanced quality and model complexity. The team uses re-parameterizable blocks [25] and replaces them with a single convolution to reduce the inference time. Besides, they use multi-stage training where, in each stage, the weights from previous stages are utilized as a warm start to improve the model performance progressively.

Finally, the team obtains a simple yet effective network structure with single-frame input (as shown in Fig. 15a) which only has 6 layers, of which only 5 have learnable parameters, including 4 Conv layers and a PReLU activation layer. Besides, they use re-parameterizable blocks to improve the performance of the middle convolution. A PixelShuffle operation is used at the end to upscale the output without introducing more computation.

Technical details. The team uses DIV2K [1] and Flickr2K as the training dataset. In each training batch, 64 cropped LR RGB patches augmented by random flipping and rotation are input to the network. The input data range of the network is 0-255. The model is trained using PyTorch and the Adam [46] optimizer with β1 = 0.9 and β2 = 0.999, and they use the Charbonnier loss (first stage) and L2 loss (second stage) separately, since they employ a multi-stage training approach.

3.16. Team DoYouChargeQQCoin

The team proposes an ultra-fast network for image super-resolution. The network is illustrated in Fig. 16; it consists of a 2-layer CNN with a ReLU activation for image SR. This represents the most compact and simple solution in this challenge; it improves over Bicubic upsampling by +0.2dB while running at ≈ 2ms.

They implement the network with PyTorch. The optimizer is Adam with a learning rate of 10e-4, which is halved every 200 epochs. The training dataset is DIV2K, using random flips and rotations. The input of the network is in the range 0-255.
Figure 16. Team DoYouChargeQQCoin proposed network.

Figure 17. Team Touch Fish solution: (a) AttLi block, where pink denotes the generated attention map M. (b) Pipeline for ×2 SR.

Figure 18. Team DH ISP proposed solution.

3.17. Team Touch Fish

The team proposes a new attention mechanism. The rationale behind utilizing an attention map with a considerable perception field is that it can be advantageous for the preceding layers to concentrate their attention on regions of interest. They generate an attention map M(i, j) as:

M(i, j) = ϕ(Conv_{1×1}(F_l(i, j))),   (3)

where ϕ(·) denotes the sigmoid function, and F_l(i, j) and F_f(i, j) denote the values of the feature maps at position (i, j) from the latter layer and the former layers, respectively. Then the generated attention map is used to reweight the features in the former layers as M(i, j) ⊙ F_f(i, j), where ⊙ denotes the Hadamard product.

As depicted in Fig. 17 (b), an attention map is generated for each block, which is subsequently utilized to reweight the feature maps originating from distinct levels.

They also use re-parameterization (rep) [22] to enhance the efficiency of the inference phase. This technique has been incorporated into each convolutional block depicted in Fig. 17. In contrast to prior techniques that employ strided convolutions, pooling, and upsampling, the team merely uses the generated mask. This modification has resulted in a significant acceleration of both inference and training times, as well as a reduction in the memory footprint.

Technical details. The number of channels is set to 24 (×2) and 32 (×3). The learning rate is 5 × 10−4 and is halved every 2 × 10^5 iterations. The network is trained for a total of 10^6 iterations with the L1 loss, a batch size of 64, and the Adam optimizer [45]. Subsequently, fine-tuning is executed using the L1 and L2 loss functions, with an initial learning rate of 1 × 10−5 for 5 × 10^5 iterations and an HR patch size of 512. The dataset utilized for training comprises DIV2K [1] and LSDIR [52].

3.18. Team DH ISP

The team designed a simple lightweight network for image super-resolution. The model consists of two 3×3 convolution layers, one 1×1 convolution layer and four re-parameterizable blocks (RepBlock); the final output is obtained using pixel shuffle. Re-parameterizable blocks can learn features at different scales during the training phase; during inference, they can be converted into 3×3 convolutions to accelerate the inference speed. The network structure is shown in Figure 18.

Two branches are used for feature extraction: (i) four re-parameterizable blocks and a 3×3 convolution, which extract the deep features of the image; (ii) a 1×1 convolution, which extracts the shallow features of the input image. Finally, the features extracted from the two branches are added together for fusion, the upsampled features are obtained through the pixel shuffle layer, and the final output is obtained through the self-attention structure.

Technical details. The training dataset includes Flickr2K and DIV2K [1]. The training of the model is divided into two stages: (i) the network is trained from scratch. The input image size is 256 × 256, the batch size is 16, the loss function is L1, and the Adam optimizer is used with the initial learning rate set to 0.001; the learning rate is halved every 200 epochs,
1508
and a total of 800 batches of training. (ii) On the basis of the first-stage training, the L2 loss is used to continue training for 200 epochs, with an initial learning rate of 0.0001, halved every 50 batches. Finally, the re-parameterizable modules in the network are folded into 3×3 convolutions, and the trained model parameters are transformed accordingly to achieve faster inference.

3.19. PRFDN

Team SEU CNII proposes PRFDN: a High-Parallelism Distillation Network for image super-resolution. The proposed Parallel RFDN (PRFDN) is based on the pre-trained RFDN [60], as shown in Fig. 19a. The method disentangles the sequentially computed trunks in RFDN into branches (Fig. 19b) and performs re-parameterization so that these branches can run in parallel on a single device. After that, they further prune the model (Fig. 19d) and fine-tune it to achieve higher performance.

Figure 19. Team SEU CNII proposed PRFDN, including: (a) RFDN [60], (b) branching, (c) re-parameterization, and (d) pruning.

Network Structure. Branching. To accelerate inference, the authors first reduce the data dependency in the model to achieve higher parallelism. Thus, the method disentangles the sequentially computed trunks into branches. As shown in Fig. 19b, after branching, the major part of the model consists of four independent branches that can be computed in parallel. To improve performance, the authors also design small SR blocks (SRFDB) based on [60] and add them before the input of each branch.

Re-parameterization. Without much data dependency, the branches in the model can be computed in parallel. As shown in Fig. 19c, the major part of these four branches (RFDBs and SRFDBs) has exactly the same structure but different parameters, so the RFDBs and SRFDBs can be merged and re-parameterized into a single branch.

Pruning. To further accelerate inference, they apply channel pruning to the re-parameterized model, as shown in Fig. 19d, using Torch-Pruning [27], and fine-tune the model between each pruning step.
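The sketch below illustrates such a prune-and-fine-tune loop in a simplified, hedged form: it uses torch.nn.utils.prune to mask whole output channels and assumes a user-provided train_step callback for the fine-tuning phase. The team's actual pipeline relies on Torch-Pruning [27], which physically removes the pruned channels instead of masking them.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_and_finetune(model: nn.Module, train_step, rounds: int = 3,
                       steps_per_round: int = 1000, amount: float = 0.1) -> nn.Module:
    """Iterative structured channel pruning with fine-tuning between rounds.
    `train_step(model)` is a placeholder for one optimization step on the SR data."""
    for _ in range(rounds):
        # zero out the `amount` fraction of output channels with the smallest L2 norm
        for module in model.modules():
            if isinstance(module, nn.Conv2d):
                prune.ln_structured(module, name="weight", amount=amount, n=2, dim=0)
        # fine-tune so the remaining channels recover the lost accuracy
        for _ in range(steps_per_round):
            train_step(model)
    # bake the accumulated pruning masks into the weights
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.remove(module, "weight")
    return model
```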
Technical details. The authors use PyTorch and Torch-Pruning [27]. The models are trained using Adam [46] with a learning rate of 1e-5 before re-parameterization and 1e-6 after re-parameterization. The training datasets are LSDIR [52] and DIV2K [1]. Since only the data flow is changed, and not the structure of the RFDB, the pre-trained RFDN parameters can still be loaded into the major part of the branch model (except for the SRFDBs). To benefit from the pre-training, the pre-trained RFDN parameters are loaded into the branch model before it is trained.

3.20. LFDN

Team NTU-BL6F adopts the LFDN [47] model as the backbone. The authors reduce the number of channels and utilize pre-trained weights by selecting the necessary channels to match the compressed channel quantities. The authors also find that channel counts that are powers of 2 result in faster processing than other channel counts. The model is illustrated in Fig. 20.

Figure 20. Team NTU-BL6F solution based on LFDN [47]. They adjust the channel number of RLFB and use mixed precision training to improve the model.

Technical details. The team uses the LFDN [47] model pre-trained on the DIV2K dataset [1]. The network input range is from 0 to 255, and mixed precision was used for fine-tuning. The team uses the DIV2K [1], Flickr [76], OTS [50], and GTA [69] datasets to train the model. The authors adopt the L1 loss to optimize the network. The optimizer is
Adam [46] with a learning rate of 5e-4. In the test phase, they feed the whole image to the model, and the inference speed is approximately 18 ms per image.
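As a reference for the mixed-precision fine-tuning mentioned above, a minimal torch.cuda.amp loop is sketched below; model and loader are placeholders, and the L1 loss follows the details reported by the team.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

def finetune_amp(model, loader, epochs: int = 10, lr: float = 5e-4, device: str = "cuda"):
    """Mixed-precision fine-tuning sketch: `loader` is assumed to yield (lr_img, hr_img)
    pairs; it is a placeholder, not part of the team's released code."""
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scaler = GradScaler()          # scales the loss to avoid fp16 gradient underflow
    l1 = torch.nn.L1Loss()
    for _ in range(epochs):
        for lr_img, hr_img in loader:
            lr_img, hr_img = lr_img.to(device), hr_img.to(device)
            optimizer.zero_grad(set_to_none=True)
            with autocast():       # forward pass runs in fp16 where it is safe to do so
                loss = l1(model(lr_img), hr_img)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
    return model
```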
3.21. DRCNN

Team diSRupt proposes the Depthwise-Residual Convolutional Neural Network (DRCNN). DRCNN (see Fig. 21) extends the SCSRN architecture introduced in [43]. On top of the existing architecture, DRCNN performs nearest-neighbor upsampling to provide the SCSRN stage with an upsampled baseline image. In order to maintain efficiency through GPU parallelism, a space-to-depth transformation is applied to the upscaled LR image, forcing the following convolutional layers to operate on feature maps with the same spatial dimensions as the LR image. The same depthwise-upsampled LR image is added to the feature map generated by the SCSRN, forcing the network to learn the residual between the naive interpolation and the HR image, thus improving convergence speed and overall performance.
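A minimal sketch of this upsampling and residual scheme is given below. Note that the team implemented DRCNN in TensorFlow 2; the sketch uses PyTorch for consistency with the other examples in this report, and the backbone is a placeholder standing in for the SCSRN stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseResidualSR(nn.Module):
    """Sketch of the described scheme: a nearest-neighbor upsampled baseline is folded
    back to LR resolution via space-to-depth (pixel_unshuffle), added as a residual to
    the backbone output, and finally expanded to HR resolution with pixel_shuffle."""

    def __init__(self, backbone: nn.Module, scale: int = 2):
        super().__init__()
        self.backbone = backbone  # stands in for the SCSRN stage
        self.scale = scale

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        baseline = F.interpolate(lr, scale_factor=self.scale, mode="nearest")
        baseline_s2d = F.pixel_unshuffle(baseline, self.scale)  # HR grid -> LR grid, more channels
        residual = self.backbone(lr)       # expected shape: (N, 3 * scale^2, H, W)
        hr_feat = residual + baseline_s2d  # learn the residual over the naive upsample
        return F.pixel_shuffle(hr_feat, self.scale)

# toy usage: a single conv as a stand-in backbone producing 3 * scale^2 channels
toy_backbone = nn.Conv2d(3, 3 * 2 * 2, kernel_size=3, padding=1)
sr = DepthwiseResidualSR(toy_backbone, scale=2)
print(sr(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 128, 128])
```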
Implementation details. The authors use TensorFlow 2. The network was trained for 70 epochs on the entire DIV2K training set [1], using the Adam [46] optimizer with a 3e-4 learning rate, a batch size of 16, a patch size of 128, classical augmentations, and the MSE loss. The model accepts RGB images of any resolution. No re-parameterization, pruning, or quantization was applied.

3.22. ELIS

Team KCML2 proposes Enhanced Lightweight Image Super-resolution (ELIS), which is inspired by XLSR [5] with the addition of an advanced attention mechanism. The main idea is to use channel splitting to separate the feature maps and process them in parallel with attention. Besides this, the authors use a multi-stage warm-start training strategy: in each stage, the pre-trained weights from the previous stages are utilized to improve the model performance. The network is illustrated in Fig. 22.

The authors add a spatial operation to the original block from XLSR [5] to enhance performance, so that each pixel location is treated differently. They design the ECSB block, which contains a channel splitting mechanism, convolution operations, and an enhanced spatial attention (ESA) block, as shown in Fig. 22 (bottom).

Implementation details. The authors use DIV2K and Flickr2K [1] as the training set and randomly crop the images to a size of 512×512. All images are normalized to the range 0-1. During training, they randomly crop LR patches of size 256×256 and use horizontal flipping, vertical flipping, and random intensity scaling for augmentation. As the loss function, they employ the Charbonnier loss with η = 0.1. The number of ECSBs is set to 5 and the number of channels inside each ECSB to 32. The model is trained using a multi-stage training strategy with a cyclic learning rate scheduler, the Adam optimizer [46], and a batch size of 64. The authors did not use any pruning or re-parameterization techniques, only channel splitting and attention.
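For reference, one common formulation of the Charbonnier loss is sketched below; the report does not specify the exact variant used by the team, so the role of η here is an assumption.

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eta: float = 0.1) -> torch.Tensor:
    """Charbonnier loss: sqrt(diff^2 + eta^2), averaged over all elements.
    eta controls the transition between L2-like and L1-like behaviour."""
    diff = pred - target
    return torch.sqrt(diff * diff + eta * eta).mean()

# toy usage
loss = charbonnier_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```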
3.23. Team NPU SuperResolution

The team proposes a model based on ECBSR [96] with some improvements. The authors found that the edge operator does not contribute much to the performance of the whole model, so they propose to replace it with a wavelet transform; their experiments show that the wavelet transform brings a measurable improvement.

The authors also draw on MWCNN [61] and other models that use wavelet transforms for super-resolution. In their model, the LL, HL, LH, and HH sub-bands obtained after the wavelet transform are concatenated along the channel dimension, which ensures that no information is lost and further improves the performance of the model. They chose a very simple model with only one branch so that the speed of the model is guaranteed. In each block, they remove the branches that do not significantly improve the results and keep only the branch that contributes the most. In addition, they use re-parameterization as in ECBSR [96], so that each block can be re-parameterized into one or two 3×3 convolutions for faster inference.
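A minimal single-level wavelet decomposition producing four sub-bands concatenated along the channel dimension is sketched below; the wavelet family actually used by the team is not specified, so the Haar transform is assumed for illustration.

```python
import torch

def haar_dwt(x: torch.Tensor) -> torch.Tensor:
    """Single-level Haar wavelet transform of an (N, C, H, W) tensor (H, W even).
    Returns the low-pass and three high-pass sub-bands concatenated along channels,
    each at half the spatial resolution."""
    a = x[:, :, 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[:, :, 0::2, 1::2]  # top-right
    c = x[:, :, 1::2, 0::2]  # bottom-left
    d = x[:, :, 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2   # low-pass approximation
    hl = (a - b + c - d) / 2   # detail across columns
    lh = (a + b - c - d) / 2   # detail across rows
    hh = (a - b - c + d) / 2   # diagonal detail
    return torch.cat([ll, hl, lh, hh], dim=1)  # (N, 4C, H/2, W/2)

print(haar_dwt(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 12, 32, 32])
```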
Technical details. The team uses PyTorch to implement the model. The optimizer is Adam [46], the learning rate is 5e-4, and the GPU is an A100. The training dataset combines DIV2K [1], Flickr2K, Manga109 [65], and some pictures obtained from the internet; the authors find that this enlarged dataset significantly improves the performance of the model. The final model is re-parameterized.

3.24. Team YNOT

The team utilized an image processing method based on Fast Fourier Convolution (FFC) [15], which has different advantages from conventional convolution-based image processing (i.e., it can exploit both global and local information), together with wavelet analysis [89]. By utilizing information at the frequency level, they aimed for better performance while lightening the baseline architecture of IMDN [40].

The authors found that FFC [15] can replace traditional CNNs, but it may not be suitable for real-time super-resolution. However, by utilizing the information available in the spectral domain (e.g., the Fourier and wavelet transforms), they were able to lighten the architecture of the IMDN [40] model in order to satisfy some of the
computational and performance tradeoffs.

Technical details. The authors use PyTorch 1.7.1 to develop the models. The models are trained for 500 epochs using the L1 loss, the Adam optimizer [46], a learning rate of 2e-4, and a MultiStepLR scheduler with a gamma of 0.5. The team only uses DIV2K [1] for training the models.
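A simplified sketch of the spectral (Fourier) unit at the core of FFC [15] is given below: features are transformed with a real 2D FFT, a 1×1 convolution mixes the stacked real and imaginary parts, and an inverse FFT maps the result back to the spatial domain, giving a global receptive field at low cost. This is an illustrative stand-in, not the team's exact module.

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """Simplified spectral transform in the spirit of FFC:
    rfft2 -> 1x1 conv on stacked real/imaginary parts -> irfft2."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")          # (N, C, H, W//2+1), complex
        feat = torch.cat([spec.real, spec.imag], dim=1)  # (N, 2C, H, W//2+1)
        feat = self.relu(self.conv(feat))
        real, imag = torch.chunk(feat, 2, dim=1)
        spec = torch.complex(real, imag)
        return torch.fft.irfft2(spec, s=(h, w), norm="ortho")

# toy usage
print(FourierUnit(16)(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```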
Figure 21. Team diSRupt proposed DRCNN.

Figure 22. Team KCML2 proposed enhanced lightweight image super-resolution network. (Bottom) Architecture of ECSB with ESA (Enhanced Spatial Attention) [51].

Figure 23. Team YNOT proposed solution.

4. Qualitative Results Comparison

We provide qualitative comparisons in Fig. 24, Fig. 25, and Fig. 26 between the top-3 proposed methods. All high-resolution images and the results from each top team are available on our project website and GitHub. All the top methods can recover details from the LR 1080p and 720p inputs and produce high-quality 4K images.

5. Conclusion

This paper introduces a novel benchmark for efficient upscaling as part of the NTIRE 2023 Real-Time Image Super-Resolution (RTSR) Challenge, which aimed to upscale images from 720p and 1080p resolution to native 4K (×2 and ×3 factors) in real-time on commercial GPUs. For this, we use a new test set containing diverse 4K images ranging from digital art to gaming and photography. We assessed the methods devised for 4K SR by measuring their runtime, parameters, and FLOPs, while ensuring a minimum PSNR fidelity over Bicubic interpolation. These methods allow processing at 60 FPS and even beyond. Out of the 170 participants, 25 teams contributed to this report, making it the most comprehensive benchmark to date and showcasing the latest advancements in real-time SR.
Figure 24. Qualitative results. Comparison of the best methods (AsConvSR [34], Bicubic++ [8], RUNet, RT4KSR [92]) against the LR input and the HR ground-truth, using test sample 11. The image corresponds to a real capture using a 60MP camera. Complete HQ uncompressed results for the top teams can be consulted on our project website.
Figure 25. Qualitative results. Comparison of the best methods (AsConvSR [34], Bicubic++ [8], RUNet, RT4KSR [92]) against the LR input and the HR ground-truth, using test sample 59, a real-world capture using a SONY ILCE-7M3. Image credit: "Asakusa" by @mosdesign.
Figure 26. Qualitative results. Comparison of the best methods (AsConvSR [34], Bicubic++ [8], RUNet, RT4KSR [92]) against the LR input and the HR ground-truth, using test sample 114, rendered content using Unreal Engine [38].
Table 2. Additional training details to facilitate reproducibility of the solutions. For each solution, the teams indicate the resolution of the input RGB image during training, the training time in hours, whether attention or quantization is used, the number of parameters (in millions), and the GPU device.

Method Input Training Time (h) Attention Quantization # Params. (M) GPU
AsConvSR ×2 120 × 120 30 No No 2.3 V100
AsConvSR ×3 80 × 80 30 No No 17 V100
RUNet ×2 192 × 192 24 No No 0.0668 RTX3090
RUNet ×3 192 × 192 20 No No 0.24 RTX3090
Team OV 128 × 128 21 No No 0.005 RTX3090
Repnet ×2 256 × 256 8 No No 0.0266 A100
Repnet ×3 256 × 256 12 No No 0.0532 A100
Bicubic++ ×3 108 × 108 3 No No 0.0504 V100
DFCDN ×2 320 × 320 44 Yes No 0.0064 RTX3090
DFCDN ×3 220 × 220 44 Yes No 0.0075 RTX3090
NJUST-RTSR ×2 256 × 256 16 No No 0.014 RTX3090
LRSRN ×2 192 × 192 48 No No 0.0046 A6000
LRSRN ×3 128 × 128 16 No No 0.0046 A6000
SCSYENet ×2 512 × 512 27 No No 0.01 A100
SCSYENet ×3 540 × 540 18 No No 0.0125 A100
ERLFN ×2 256 × 256 71 ESA No 0.0111 V100x4
ERLFN ×3 192 × 192 47 ESA No 0.0666 V100x4
PCRTSR ×2 256 × 256 30 No No 0.162288 2080Ti
R2CNet ×2 512 × 512 180 L-ESA No 0.3987 V100
R2CNet ×3 576 × 576 180 L-ESA No 0.4073 V100
FADN ×2 256 × 256 130 Yes No 0.0212 RTX3090
PixelBE ×2 128 × 128 96 No No 0.137 V100
OELSR ×2 512 × 512 8 No No 0.0068 2080Ti
QQCoin ×2 256 × 256 48 No No 0.00082 RTX3090
Touch Fish ×2 256 × 256 60 Yes No 0.064 A100x8
Touch Fish ×3 256 × 256 60 Yes No 0.183 A100x8
dh ISP 256 × 256 5 Yes No 0.01 2080Ti
PRFDN ×2 678 × 1020 16 No No 0.0299 RTX3070
PRFDN ×3 512 × 680 16 No No 0.0629 RTX3070
NTU-BL6F (LFDN) ×2 256 × 256 12 Yes Yes 0.22 RTX3090
DRCNN ×2 128 × 128 5 No No 0.0499 NVIDIA T4
DRCNN ×3 128 × 128 3 No No 0.0649 NVIDIA T4
ELIS 256 × 256 10 ESA No 0.039 TITAN RTX
NPU-SR (ECBSR) ×2 1080 × 1920 10 No Yes 0.2 A100
YNOT ×2 256 × 256 4 Yes No 0.4648 A100

Acknowledgments. This work was partly supported by the Humboldt Foundation and Sony Interactive Entertainment. We thank the NTIRE 2023 sponsors: Sony Interactive Entertainment, Meta Reality Labs, ModelScope, ETH Zürich (Computer Vision Lab) and University of Würzburg (Computer Vision Lab).

6. Appendix

6.1. NTIRE 2023 Team
Title: NTIRE 2023 Real-Time Super-Resolution Challenge Organization
Members: Marcos V. Conde 1, Eduard Zamfir 1, Radu Timofte 1, Daniel Motilla 2
Affiliations: 1 Computer Vision Lab, CAIDAS, IFI, University of Würzburg, Germany; 2 Sony Interactive Entertainment, CA.

6.2. Noah TerminalVision
Title: AsConvSR: Fast and Lightweight Super-Resolution Network with Assembled Convolutions
Members: Jiaming Guo, Xueyi Zou, Yuyi Chen, Yi Liu, Jia Hao, Youliang Yan
Affiliations: Huawei Technologies Co., Ltd.

6.3. Aselsan Research
Title: Bicubic++: Slim, Slimmer, Slimmest - Designing an Industry-Grade Super-Resolution Network
Members: Mustafa Ayazoglu, Bahri Batuhan Bilecen
Affiliations: Aselsan Research, Türkiye. https://www.aselsan.com/tr

6.4. ALONG
Title: RUNet: Re-parameterization and Unshuffle Network for Real-time Super-Resolution
Members: Cen Liu, Zexin Zhang, Yunbo Peng, Yue Lin
Affiliations: NetEase Games AI Lab

6.5. Team OV
Title: An Efficient ConvNet for Real-time Image Super-resolution
Members: Lingshun Kong, Haoran Bai, Jinshan Pan, Jiangxin Dong, Jinhui Tang
Affiliations: Nanjing University of Science and Technology

6.6. RTVSR
Title: Repnet for Real-Time Super-Resolution
Members: Yuanfan Zhang, Gen Li, Lei Sun
Affiliations: Tencent

6.7. DFCDN Team
Title: DFCDN: Deep Feature Complement and Distillation Network
Members: Mingxi Li, Yuhang Zhang, Xianjun Fan, Yankai Sheng
Affiliations: Attrsense

6.8. z6
Title: Lightweight Efficient Real-Time Image Super-Resolution Network (LERSRN)
Members: Ganzorig Gankhuyag, Kihwan Yoon
Affiliations: Korea Electronics Technology Institute (KETI)

6.9. NJUST-RTSR
Title: A Simple Residual ConvNet with Progressive Learning for Real-Time Super-Resolution
Members: Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang
Affiliations: Nanjing University of Science and Technology

6.10. Multimedia
Title: SCSYENet: A Compact Skip-Concatenated Simple Yet Effective Real-Time Image Super-Resolution based on element-wise multiplication fusion operation and Re-parameter convolution
Members: Zibin Liu, Weiran Gou, Shaoqing Li, Ziyao Yi, Yan Xiang, Dehui Kong, Ke Xu
Affiliations: Sanechips Co Ltd

6.11. Antins CV
Title: Enhanced Residual Local Feature Network (ERLFN)
Members: Jin Zhang, Gaocheng Yu, Feng Zhang, Hongbin Wang
Affiliations: Ant Group

6.12. ECNU SR
Title: Partial convolution based Network for Real-Time Super Resolution (PCRTSR)
Members: Zhou Zhou, Jiahao Chao, Hongfan Gao, Jiali Gong, Zhengfeng Yang, Zhenbing Zeng
Affiliations: East China Normal University

6.13. R.I.P ShopeeVideo
Title: Efficient Bottle-in-Bottle Block for Real-Time Super-Resolution
Members: Chengpeng Chen, Zichao Guo
Affiliations: Shopee https://shopee.com/

6.14. DoYouChargeQQCoin
Title: Ultra fast network for image super-resolution
Members: Yuqing Liu, Qi Jia, Hongyuan Yu, Xuanwu Yin, Kunlong Zuo
Affiliations: Dalian University of Technology; Xiaomi Inc.

6.15. PixelBE
Title: Two-Stage Super-resolution Algorithm Based on Re-Parameterization
Members: Dongyang Zhang
Affiliations: Mango TV (MGTV)

6.16. AGSR
Title: Optimized Extreme Lightweight Super Resolution
Members: Ting Fu, Zhengxue Cheng, Shiai Zhu, Dajiang Zhou
Affiliations: Ant Group antgroup.com
6.17. dh isp
Title: Lightweight network for image super-resolution
Members: Ben Shao, Shaolong Zheng
Affiliations: Zhejiang Dahua Technology Co., Ltd.

6.18. Touch Fish
Title: Attention Block for Real-time Super-Resolution
Members: Hongyuan Yu, Weichen Yu, Lin Ge, Jiahua Dong, Yajun Zou, Zhuoyuan Wu, Binnan Han, Xiaolin Zhang, Heng Zhang, Xuanwu Yin, Kunlong Zuo
Affiliations: Multimedia Department, Xiaomi Inc.

6.19. P.A.I.R
Title: Few Activation Distillation Networks for Real-time Super-resolution
Members: Anjin Park
Affiliations: Korea Photonic Technology Institute

6.20. SEU CNII
Title: PRFDN: High Parallelism Distillation Network For Image Super-resolution
Members: Daheng Yin, Baijun Chen, Mengyang Liu
Affiliations: School of Computer Science and Engineering, Southeast University

6.21. diSRupt
Title: Depthwise-Residual Convolutional Neural Network (DRCNN)
Members: Marian-Sergiu Nistor
Affiliations: University "Al. I. Cuza" Iasi

6.22. NTU-BL6
Title: Finetuning and pruning for Real-Time Super-Resolution
Members: Yi-Chung Chen 3, Zhi-Kai Huang 2, Yuan-Chun Chiang 2, Wei-Ting Chen 1, Hao-Hsiang Yang 2, Hua-En Chang 2, I-Hsiang Chen 2, Chia-Hsuan Hsieh 4, Sy-Yen Kuo 2
Affiliations: 1 Graduate Institute of Electronics Engineering, National Taiwan University, Taiwan; 2 Department of Electrical Engineering, National Taiwan University, Taiwan; 3 Graduate Institute of Communication Engineering, National Taiwan University, Taiwan; 4 ServiceNow, USA

6.23. NPU Superresolution
Title: ECBSR
Members: Qingsen Yan, Yun Zhu, Jinqiu Su, Yanning Zhang, Cheng Zhang, Jiaying Luo
Affiliations: Northwestern Polytechnical University

6.24. KCML2
Title: Enhanced Lightweight Image Super-resolution (ELIS)
Members: Tu Vo
Affiliations: KC Machine Learning Lab

6.25. YNOT
Title: Super Resolution with Spectral Transform and Wavelet Transform
Members: Youngsun Cho, Nakyung Lee
Affiliations: CJ OliveNetworks AI Research

References

[1] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017. 2, 6, 8, 9, 10, 13, 14, 15, 16, 17
[2] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 126–135, 2017. 5, 6, 9, 12
[3] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In European Conference on Computer Vision, pages 252–268, 2018. 2
[4] Codruta O Ancuti, Cosmin Ancuti, Florin-Alexandru Vasluianu, Radu Timofte, et al. NTIRE 2023 challenge on nonhomogeneous dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[5] Mustafa Ayazoglu. Extremely lightweight quantization robust real-time single-image super resolution for mobile devices. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2472–2479, 2021. 10, 16
[6] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. 3
[7] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie line Alberi Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision Conference, pages 135.1–135.10, 2012. 2
[8] Bahri Batuhan Bilecen and Mustafa Ayazoglu. Bicubic++: Slim, slimmer, slimmest - designing an industry-grade super-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 5, 18, 19, 20
[9] Mingdeng Cao, Chong Mou, Fanghua Yu, Xintao Wang, Yinqiang Zheng, Jian Zhang, Chao Dong, Ying Shan, Gen Li, Radu Timofte, et al. NTIRE 2023 challenge on 360° omnidirectional image and video super-resolution: Datasets, methods and results. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern [22] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong
Recognition Workshops, 2023. 3 Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-
[10] Chengpeng Chen, Zichao Guo, Haien Zeng, Pengfei style convnets great again. In Proceedings of the IEEE/CVF
Xiong, and Jian Dong. Repghost: A hardware-efficient conference on computer vision and pattern recognition,
ghost module via re-parameterization. arXiv preprint pages 13733–13742, 2021. 3, 6, 11, 12, 14
arXiv:2211.06088, 2022. 12 [23] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong
[11] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-
Simple baselines for image restoration. arXiv preprint style convnets great again. In CVPR, 2021. 9
arXiv:2204.04676, 2022. 12 [24] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou
[12] Liangyu Chen, Xin Lu, Jie Zhang, Xiaojie Chu, and Cheng- Tang. Learning a deep convolutional network for image
peng Chen. Hinet: Half instance normalization network for super-resolution. In David Fleet, Tomas Pajdla, Bernt
image restoration. In Proceedings of the IEEE/CVF Confer- Schiele, and Tinne Tuytelaars, editors, European Confer-
ence on Computer Vision and Pattern Recognition (CVPR) ence on Computer Vision, pages 184–199, Cham, 2014.
Workshops, pages 182–192, June 2021. 12 Springer International Publishing. 2
[25] Zongcai Du, Ding Liu, Jie Liu, Jie Tang, Gangshan Wu,
[13] Xiangyu Chen, Xintao Wang, Jiantao Zhou, and Chao
and Lean Fu. Fast and memory-efficient network towards
Dong. Activating more pixels in image super-resolution
efficient image super-resolution, 2022. 13
transformer. arXiv preprint arXiv:2205.04437, 2022. 6
[26] Zongcai Du, Jie Liu, Jie Tang, and Gangshan Wu. Anchor-
[14] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong
based plain net for mobile image super-resolution. In Pro-
Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution:
ceedings of the IEEE/CVF Conference on Computer Vision
Attention over convolution kernels. In Proceedings of
and Pattern Recognition, pages 2494–2502, 2021. 5
the IEEE/CVF conference on computer vision and pattern
[27] Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi,
recognition, pages 11030–11039, 2020. 5
and Xinchao Wang. Depgraph: Towards any structural
[15] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolu- pruning. The Thirty-Fourth IEEE/CVF Conference on Com-
tion. Advances in Neural Information Processing Systems, puter Vision and Pattern Recognition, 2023. 15
33:4479–4488, 2020. 16
[28] William T Freeman, Thouis R Jones, and Egon C Pasztor.
[16] Xiaojie Chu, Liangyu Chen, Chengpeng Chen, and Xin Example-based super-resolution. IEEE Computer graphics
Lu. Improving image restoration by revisiting global in- and Applications, 22(2):56–65, 2002. 2
formation aggregation. In Computer Vision–ECCV 2022:
[29] Ganzorig Gankhuyag, Jingang Huh, Myeongkyun Kim,
17th European Conference, Tel Aviv, Israel, October 23–27,
Kihwan Yoon, HyeonCheol Moon, Seungho Lee, Jinwoo
2022, Proceedings, Part VII, pages 53–71. Springer, 2022.
Jeong, Sungjei Kim, and Yoonsik Choe. Skip-concatenated
12
image super-resolution network for mobile devices. IEEE
[17] Marcos V Conde, Ui-Jin Choi, Maxime Burchi, and Radu Access, 2022. 9
Timofte. Swin2SR: Swinv2 transformer for compressed [30] Ganzorig Gankhuyag, Kihwan Yoon, Jinman Park,
image super-resolution and restoration. In Proceedings Haeng Seon Son, and Kyoungwon Min. Lightweight real-
of the European Conference on Computer Vision (ECCV) time image super-resolution network for 4k images. In Pro-
Workshops, 2022. 2 ceedings of the IEEE/CVF Conference on Computer Vision
[18] Marcos V Conde, Manuel Kolmet, Tim Seizinger, and Pattern Recognition Workshops, 2023. 9
Thomas E. Bishop, Radu Timofte, et al. Lens-to-lens bokeh [31] Michaël Gharbi, Jiawen Chen, Jonathan T. Barron,
effect transformation. NTIRE 2023 challenge report. In Samuel W. Hasinoff, and Frédo Durand. Deep bilateral
Proceedings of the IEEE/CVF Conference on Computer Vi- learning for real-time image enhancement. ACM Transac-
sion and Pattern Recognition Workshops, 2023. 3 tions on Graphics (TOG), 36:1 – 12, 2017. 10
[19] Marcos V Conde, Eduard Zamfir, Radu Timofte, et al. Effi- [32] Shuhang Gu, Andreas Lugmayr, Martin Danelljan, Manuel
cient deep models for real-time 4k image super-resolution. Fritsche, Julien Lamour, and Radu Timofte. Div8k: Diverse
NTIRE 2023 benchmark and report. In Proceedings of 8k resolution image dataset. In 2019 IEEE/CVF Interna-
the IEEE/CVF Conference on Computer Vision and Pattern tional Conference on Computer Vision Workshop (ICCVW),
Recognition Workshops, 2023. 3 pages 3512–3516, 2019. 2
[20] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and [33] Shuhang Gu, Andreas Lugmayr, Martin Danelljan, Manuel
Lei Zhang. Second-order attention network for single im- Fritsche, Julien Lamour, and Radu Timofte. Div8k: Diverse
age super-resolution. In IEEE Conference on Computer Vi- 8k resolution image dataset. In 2019 IEEE/CVF Interna-
sion and Pattern Recognition, pages 11065–11074, 2019. tional Conference on Computer Vision Workshop (ICCVW),
2 pages 3512–3516. IEEE, 2019. 5
[21] Xiaohan Ding, Yuchen Guo, Guiguang Ding, and J. Han. [34] Jiaming Guo, Xueyi Zou, Yuyi Chen, Yi Liu, Jia Hao,
Acnet: Strengthening the kernel skeletons for powerful cnn and Youliang Yan. Asconvsr: Fast and lightweight super-
via asymmetric convolution blocks. 2019 IEEE/CVF In- resolution network with assembled convolution. In Pro-
ternational Conference on Computer Vision (ICCV), pages ceedings of the IEEE/CVF Conference on Computer Vision
1911–1920, 2019. 10 and Pattern Recognition Workshops, 2023. 5, 18, 19, 20

[35] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing [45] Diederik P Kingma and Jimmy Ba. Adam: A method for
Xu, and Chang Xu. Ghostnet: More features from cheap stochastic optimization. arXiv preprint arXiv:1412.6980,
operations. In 2020 IEEE/CVF Conference on Computer 2014. 14
Vision and Pattern Recognition, CVPR 2020, Seattle, WA, [46] Diederik P. Kingma and Jimmy Ba. Adam: A method for
USA, June 13-19, 2020, pages 1577–1586. Computer Vi- stochastic optimization. In ICLR, 2015. 9, 11, 13, 15, 16,
sion Foundation / IEEE, 2020. 7 17
[36] Zibin He, Tao Dai, Jian Lu, Yong Jiang, and Shu-Tao Xia. [47] Fangyuan Kong, Mingxi Li, Songwei Liu, Ding Liu, Jing-
Fakd: Feature-affinity based knowledge distillation for ef- wen He, Yang Bai, Fangmin Chen, and Lean Fu. Residual
ficient image super-resolution. In 2020 IEEE International local feature network for efficient super-resolution. In Pro-
Conference on Image Processing (ICIP), pages 518–522. ceedings of the IEEE/CVF Conference on Computer Vision
IEEE, 2020. 6 and Pattern Recognition, pages 766–776, 2022. 2, 12, 15
[37] Mu Hu, Junyi Feng, Jiashen Hua, Baisheng Lai, Jianqiang [48] Fangyuan Kong, Mingxi Li, Songwei Liu, Ding Liu, Jing-
Huang, Xiaojin Gong, and Xian-Sheng Hua. Online convo- wen He, Yang Bai, Fangmin Chen, and Lean Fu. Resid-
lutional re-parameterization. CoRR, abs/2204.00826, 2022. ual local feature network for efficient super-resolution. In
8 IEEE/CVF Conference on Computer Vision and Pattern
[38] Yaoyu Hu, Wenshan Wang, Huai Yu, Weikun Zhen, and Recognition Workshops, CVPR Workshops 2022, New Or-
Sebastian Scherer. Orstereo: Occlusion-aware recurrent leans, LA, USA, June 19-20, 2022, pages 765–775. IEEE,
stereo matching for 4k-resolution images, 2021. 2, 20 2022. 10
[39] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. [49] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Ca-
Single image super-resolution from transformed self- ballero, Andrew Cunningham, Alejandro Acosta, Andrew
exemplars. In IEEE Conference on Computer Vision and Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al.
Pattern Recognition, pages 5197–5206, 2015. 2 Photo-realistic single image super-resolution using a gener-
[40] Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei ative adversarial network. In IEEE Conference on Com-
Wang. Lightweight image super-resolution with informa- puter Vision and Pattern Recognition, pages 4681–4690,
tion multi-distillation network. In ACM International Con- 2017. 2
ference on Multimedia, pages 2024–2032, 2019. 2, 16 [50] Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan
[41] Andrey Ignatov, Radu Timofte, Maurizio Denna, and Ab- Feng, Wenjun Zeng, and Zhangyang Wang. Benchmark-
del Younes. Real-time quantized image super-resolution on ing single-image dehazing and beyond. IEEE Transactions
mobile npus, mobile ai 2021 challenge: Report. In Pro- on Image Processing, 28(1):492–505, 2019. 15
ceedings of the IEEE/CVF Conference on Computer Vision [51] Yawei Li, Kai Zhang, Luc Van Gool, Radu Timofte, et al.
and Pattern Recognition, pages 2525–2534, 2021. 5 NTIRE 2022 challenge on efficient super-resolution: Meth-
[42] Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel ods and results. In CVPR Workshops, 2022. 2, 10, 17
Younes, Ganzorig Gankhuyag, Jingang Huh, Myeong Kyun [52] Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu,
Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Deman-
et al. Efficient and accurate quantized image super- dolx, Rakesh Ranjan, Radu Timofte, and Luc Van Gool.
resolution on mobile npus, mobile ai & aim 2022 challenge: LSDIR: a large scale dataset for image restoration. In Pro-
report. In Computer Vision–ECCV 2022 Workshops: Tel ceedings of the IEEE/CVF Conference on Computer Vision
Aviv, Israel, October 23–27, 2022, Proceedings, Part III, and Pattern Recognition Workshops, 2023. 2, 9, 11, 12, 13,
pages 92–129. Springer, 2023. 2, 3 14, 15
[43] Andrey Ignatov, Radu Timofte, Shuai Liu, Chaoyu Feng, [53] Yawei Li, Kai Zhang, Radu Timofte, Luc Van Gool,
Furui Bai, Xiaotao Wang, Lei Lei, Ziyao Yi, Yan Xiang, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du,
Zibin Liu, Shaoqing Li, Keming Shi, Dehui Kong, Ke Xu, Ding Liu, Chenhui Zhou, et al. Ntire 2022 challenge on
Minsu Kwon, Yaqi Wu, Jiesi Zheng, Zhihao Fan, Xun Wu, efficient super-resolution: Methods and results. In Proceed-
Feng Zhang, Albert No, Minhyeok Cho, Zewen Chen, Xi- ings of the IEEE/CVF Conference on Computer Vision and
aze Zhang, Ran Li, Juan Wang, Zhiming Wang, Marcos V. Pattern Recognition, pages 1062–1102, 2022. 2, 6
Conde, Ui-Jin Choi, Georgy Perevozchikov, Egor Ershov, [54] Yawei Li, Yulun Zhang, Luc Van Gool, Radu Timofte, et al.
Zheng Hui, Mengchuan Dong, Xin Lou, Wei Zhou, Cong NTIRE 2023 challenge on efficient super-resolution: Meth-
Pang, Haina Qin, and Mingxuan Cai. Learned Smart- ods and results. In Proceedings of the IEEE/CVF Confer-
phone ISP on Mobile GPUs with Deep Learning, Mobile ence on Computer Vision and Pattern Recognition Work-
AI & AIM 2022 Challenge: Report. arXiv e-prints, page shops, 2023. 3
arXiv:2211.03885, Nov. 2022. 10, 16 [55] Yawei Li, Yulun Zhang, Luc Van Gool, Radu Timofte,
[44] Xiaoyang Kang, Xianhui Lin, Kai Zhang, Zheng Hui, et al. NTIRE 2023 challenge on image denoising: Methods
Wangmeng Xiang, Jun-Yan He, Xiaoming Li, Peiran Ren, and results. In Proceedings of the IEEE/CVF Conference
Xuansong Xie, Radu Timofte, et al. NTIRE 2023 video on Computer Vision and Pattern Recognition Workshops,
colorization challenge. In Proceedings of the IEEE/CVF 2023. 3
Conference on Computer Vision and Pattern Recognition [56] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc
Workshops, 2023. 3 Van Gool, and Radu Timofte. Swinir: Image restoration

using swin transformer. In Proceedings of the IEEE/CVF ropean Conference, Amsterdam, The Netherlands, Octo-
International Conference on Computer Vision, pages 1833– ber 11-14, 2016, Proceedings, Part II 14, pages 102–118.
1844, 2021. 2, 10 Springer, 2016. 5
[57] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and [69] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and
Kyoung Mu Lee. Enhanced deep residual networks for sin- Vladlen Koltun. Playing for data: Ground truth from com-
gle image super-resolution. In IEEE Conference on Com- puter games. In Bastian Leibe, Jiri Matas, Nicu Sebe, and
puter Vision and Pattern Recognition Workshops, pages Max Welling, editors, European Conference on Computer
136–144, 2017. 2 Vision (ECCV), volume 9906 of LNCS, pages 102–118.
[58] Zudi Lin, Prateek Garg, Atmadeep Banerjee, Salma Ab- Springer International Publishing, 2016. 15
del Magid, Deqing Sun, Yulun Zhang, Luc Van Gool, [70] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz,
Donglai Wei, and Hanspeter Pfister. Revisiting rcan: Im- Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan
proved training for image super-resolution. arXiv preprint Wang. Real-time single image and video super-resolution
arXiv:2201.11279, 2022. 6 using an efficient sub-pixel convolutional neural network.
[59] J. Liu, D. Liu, W. Yang, S. Xia, X. Zhang, and Y. Dai. In Proceedings of the IEEE conference on computer vision
A comprehensive benchmark for single image compression and pattern recognition, pages 1874–1883, 2016. 2, 3, 6, 9
artifacts reduction. In arXiv, 2019. 5 [71] Alina Shutova, Egor Ershov, Georgy Perevozchikov, Ivan A
[60] Jie Liu, Jie Tang, and Gangshan Wu. Residual feature dis- Ermakov, Nikola Banic, Radu Timofte, Richard Collins,
tillation network for lightweight image super-resolution. In Maria Efimova, Arseniy Terekhin, et al. NTIRE 2023 chal-
Computer Vision–ECCV 2020 Workshops: Glasgow, UK, lenge on night photography rendering. In Proceedings of
August 23–28, 2020, Proceedings, Part III 16, pages 41– the IEEE/CVF Conference on Computer Vision and Pattern
55. Springer, 2020. 2, 12, 15 Recognition Workshops, 2023. 3
[61] Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and [72] Dehua Song, Chang Xu, Xu Jia, Yiyi Chen, Chunjing Xu,
Wangmeng Zuo. Multi-level wavelet-cnn for image restora- and Yunhe Wang. Efficient residual dense block search for
tion. In Proceedings of the IEEE conference on computer image super-resolution. In Proceedings of the AAAI Con-
vision and pattern recognition workshops, pages 773–782, ference on Artificial Intelligence, volume 34, pages 12007–
2018. 16 12014, 2020. 2
[62] Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, [73] Long Sun, Jinshan Pan, and Jinhui Tang. ShuffleMixer: An
Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, efficient convnet for image super-resolution. In NeurIPS,
Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, 2022. 9
et al. NTIRE 2023 quality assessment of video enhance- [74] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,
ment challenge. In Proceedings of the IEEE/CVF Confer- Scott E. Reed, Dragomir Anguelov, D. Erhan, Vincent Van-
ence on Computer Vision and Pattern Recognition Work- houcke, and Andrew Rabinovich. Going deeper with con-
shops, 2023. 3 volutions. 2015 IEEE Conference on Computer Vision and
[63] Zhuoqun Liu, Meiguang Jin, Ying Chen, Huaida Liu, Pattern Recognition (CVPR), pages 1–9, 2014. 10
Canqian Yang, and Hongkai Xiong. Mfdnet: Towards [75] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-
real-time image denoising on mobile devices. CoRR, Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single
abs/2211.04687, 2022. 8 image super-resolution: Methods and results. In Proceed-
[64] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient ings of the IEEE conference on computer vision and pattern
descent with warm restarts. In ICLR, 2017. 9 recognition workshops, pages 114–125, 2017. 5, 6, 9, 10,
[65] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fuji- 11
moto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu [76] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-
Aizawa. Sketch-based manga retrieval using manga109 Hsuan Yang, Lei Zhang, Bee Lim, et al. Ntire 2017 chal-
dataset. Multimedia Tools and Applications, 76(20):21811– lenge on single image super-resolution: Methods and re-
21838, 2017. 16 sults. In The IEEE Conference on Computer Vision and
[66] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Pattern Recognition (CVPR) Workshops, July 2017. 2, 15
Moon, Sanghyun Son, Heewon Lee, Radu Timofte, and Ky- [77] Radu Timofte, Vincent De Smet, and Luc Van Gool. An-
oung Mu Lee. Ntire 2019 challenge on video restoration chored neighborhood regression for fast example-based
and enhancement: Methods and results. The IEEE Confer- super-resolution. In IEEE Conference on International
ence on Computer Vision and Pattern Recognition (CVPR) Conference on Computer Vision, pages 1920–1927, 2013.
Workshops, 2019. 2 2
[67] Ying Nie, Kai Han, Zhenhua Liu, An Xiao, Yiping Deng, [78] Radu Timofte, Vincent De Smet, and Luc Van Gool. A+:
Chunjing Xu, and Yunhe Wang. Ghostsr: Learning Adjusted anchored neighborhood regression for fast super-
ghost features for efficient image super-resolution. CoRR, resolution. In ACCV, 2014. 2
abs/2101.08525, 2021. 7 [79] Radu Timofte, Rasmus Rothe, and Luc Van Gool. Seven
[68] Stephan R Richter, Vibhav Vineet, Stefan Roth, and ways to improve example-based single image super resolu-
Vladlen Koltun. Playing for data: Ground truth from com- tion. In Proceedings of the IEEE conference on computer
puter games. In Computer Vision–ECCV 2016: 14th Eu- vision and pattern recognition, pages 1865–1873, 2016. 2

[80] Florin-Alexandru Vasluianu, Tim Seizinger, Radu Timofte, [92] Eduard Zamfir, Marcos V Conde, and Radu Timofte. To-
et al. NTIRE 2023 image shadow removal challenge report. wards real-time 4k image super-resolution. In Proceedings
In Proceedings of the IEEE/CVF Conference on Computer of the IEEE/CVF Conference on Computer Vision and Pat-
Vision and Pattern Recognition Workshops, 2023. 3 tern Recognition, 2023. 2, 3, 4, 18, 19, 20
[81] Longguang Wang, Xiaoyu Dong, Yingqian Wang, Xinyi [93] Roman Zeyde, Michael Elad, and Matan Protter. On single
Ying, Zaiping Lin, Wei An, and Yulan Guo. Exploring image scale-up using sparse-representations. In Interna-
sparsity in image super-resolution for efficient inference. tional Conference on Curves and Surfaces, pages 711–730,
In Proceedings of the IEEE/CVF conference on computer 2010. 2
vision and pattern recognition, pages 4917–4926, 2021. 2 [94] Kai Zhang, Martin Danelljan, Yawei Li, Radu Timofte, Jie
[82] Longguang Wang, Yulan Guo, Yingqian Wang, Juncheng Liu, Jie Tang, Gangshan Wu, Yu Zhu, Xiangyu He, Wenjie
Li, Shuhang Gu, Radu Timofte, et al. NTIRE 2023 chal- Xu, et al. Aim 2020 challenge on efficient super-resolution:
lenge on stereo image super-resolution: Methods and re- Methods and results. In Computer Vision–ECCV 2020
sults. In Proceedings of the IEEE/CVF Conference on Com- Workshops: Glasgow, UK, August 23–28, 2020, Proceed-
puter Vision and Pattern Recognition Workshops, 2023. 3 ings, Part III 16, pages 5–40, 2020. 2
[83] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. [95] Kaihao Zhang, Dongxu Li, Wenhan Luo, Wenqi Ren,
Real-esrgan: Training real-world blind super-resolution Björn Stenger, Wei Liu, Hongdong Li, and Ming-Hsuan
with pure synthetic data. In Proceedings of the IEEE/CVF Yang. Benchmarking ultra-high-definition image super-
International Conference on Computer Vision, pages 1905– resolution. In Proceedings of the IEEE/CVF international
1914, 2021. 6 conference on computer vision, pages 14769–14778, 2021.
[84] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, 2
Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: [96] Xindong Zhang, Hui Zeng, and Lei Zhang. Edge-oriented
Enhanced super-resolution generative adversarial networks. convolution block for real-time super resolution on mobile
In European Conference on Computer Vision Workshops, devices. In Heng Tao Shen, Yueting Zhuang, John R. Smith,
pages 701–710, 2018. 2 Yang Yang, Pablo César, Florian Metze, and Balakrishnan
[85] Yingqian Wang, Longguang Wang, Zhengyu Liang, Jun- Prabhakaran, editors, MM ’21: ACM Multimedia Confer-
gang Yang, Radu Timofte, Yulan Guo, et al. NTIRE 2023 ence, Virtual Event, China, October 20 - 24, 2021, pages
challenge on light field image super-resolution: Dataset, 4034–4043. ACM, 2021. 8, 16
methods and results. In Proceedings of the IEEE/CVF Con- [97] Xindong Zhang, Huiyu Zeng, and Lei Zhang. Edge-
ference on Computer Vision and Pattern Recognition Work- oriented convolution block for real-time super resolution on
shops, 2023. 3 mobile devices. Proceedings of the 29th ACM International
[86] Lei Xiao, Salah Nouri, Matt Chapman, Alexander Fix, Conference on Multimedia, 2021. 10
Douglas Lanman, and Anton Kaplanyan. Neural supersam- [98] Xindong Zhang, Hui Zeng, and Lei Zhang. Edge-oriented
pling for real-time rendering. ACM Transactions on Graph- convolution block for real-time super resolution on mobile
ics (TOG), 39(4):142–1, 2020. 2 devices. In Proceedings of the 29th ACM International
[87] Tianyu Xu, Zhuang Jia, Yijian Zhang, Long Bao, and Heng Conference on Multimedia, pages 4034–4043, 2021. 13
Sun. Elsr: Extreme low-power super resolution network for [99] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng
mobile devices, 2022. 13 Zhong, and Yun Fu. Image super-resolution using very deep
[88] Ren Yang, Radu Timofte, Xin Li, Qi Zhang, Lin Zhang, residual channel attention networks. In European Confer-
Fanglong Liu, Dongliang He, Fu Li, He Zheng, Weihang ence on Computer Vision, pages 286–301, 2018. 2
Yuan, et al. Aim 2022 challenge on super-resolution of
[100] Yulun Zhang, Kai Zhang, Zheng Chen, Yawei Li, Radu
compressed image and video: Dataset, methods and results.
Timofte, et al. NTIRE 2023 challenge on image super-
In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Is-
resolution (x4): Methods and results. In Proceedings of
rael, October 23–27, 2022, Proceedings, Part III, pages
the IEEE/CVF Conference on Computer Vision and Pattern
174–202. Springer, 2023. 2
Recognition Workshops, 2023. 3
[89] Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu
[101] Hengyuan Zhao, Xiangtao Kong, Jingwen He, Yu Qiao, and
Kang, and Jung-Woo Ha. Photorealistic style transfer via
Chao Dong. Efficient image super-resolution using pixel at-
wavelet transforms. In Proceedings of the IEEE/CVF In-
tention. In Computer Vision–ECCV 2020 Workshops: Glas-
ternational Conference on Computer Vision, pages 9036–
gow, UK, August 23–28, 2020, Proceedings, Part III 16,
9045, 2019. 16
pages 56–72. Springer, 2020. 2
[90] Hongliang Yuan, Boyu Zhang, Mingyan Zhu, Ligang Liu,
and Jue Wang. High-quality supersampling via mask-
reinforced deep learning for real-time rendering. arXiv
preprint arXiv:2301.01036, 2023. 2
[91] Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano,
Radu Timofte, et al. NTIRE 2023 challenge on hr depth
from images of specular and transparent surfaces. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops, 2023. 3
