Efficient Deep Models For Real-Time 4K Image Super-Resolution.
Marcos V. Conde† Eduard Zamfir† Radu Timofte† Daniel Motilla‡ Cen Liu
Zexin Zhang Yunbo Peng Yue Lin Jiaming Guo Xueyi Zou Yuyi Chen
Yi Liu Jia Hao Youliang Yan Yuanfan Zhang Gen Li Lei Sun
Lingshun Kong Haoran Bai Jinshan Pan Jiangxin Dong Jinhui Tang
Mustafa Ayazoglu Bahri Batuhan Bilecen Mingxi Li Yuhang Zhang Xianjun Fan
Yankai Sheng Long Sun Zibin Liu Weiran Gou Shaoqing Li Ziyao Yi
Yan Xiang Dehui Kong Ke Xu Ganzorig Gankhuyag Kihwan Yoon Jin Zhang
Gaocheng Yu Feng Zhang Hongbin Wang Zhou Zhou Jiahao Chao
Hongfan Gao Jiali Gong Zhengfeng Yang Zhenbing Zeng Chengpeng Chen
Zichao Guo Anjin Park Yuqing Liu Qi Jia Hongyuan Yu Xuanwu Yin
Kunlong Zuo Dongyang Zhang Ting Fu Zhengxue Cheng Shiai Zhu
Dajiang Zhou Hongyuan Yu Weichen Yu Lin Ge Jiahua Dong Yajun Zou
Zhuoyuan Wu Binnan Han Xiaolin Zhang Heng Zhang Xuanwu Yin Ben Shao
Shaolong Zheng Daheng Yin Baijun Chen Mengyang Liu Marian-Sergiu Nistor
Yi-Chung Chen Zhi-Kai Huang Yuan-Chun Chiang Wei-Ting Chen
Hao-Hsiang Yang Hua-En Chang I-Hsiang Chen Chia-Hsuan Hsieh Sy-Yen Kuo
Tu Vo Qingsen Yan Yun Zhu Jinqiu Su Yanning Zhang Cheng Zhang
Jiaying Luo Youngsun Cho Nakyung Lee Kunlong Zuo
Figure 1. NTIRE 2023 Real-Time 4K SR. We introduce a new benchmark and a diverse test set for 4K Super-Resolution.
1. Introduction

Single image super-resolution (SR) refers to the process of generating a high-resolution (HR) image from a single degraded low-resolution (LR) image. This ill-posed problem was initially solved using interpolation methods [28, 77–79]. However, with the emergence of deep learning, SR is now commonly approached through the use of deep neural networks [17, 24, 49, 56, 57, 84, 88, 99]. Image SR assumes that the LR image is obtained through two major degradation processes: blurring and down-sampling. This can be expressed as:

\mathbf{y} = (\mathbf{x} * \mathbf{k})\downarrow_s, \qquad (1)

where \mathbf{y} is the LR image, * represents the convolution between the HR image \mathbf{x} and the blur kernel \mathbf{k}, and \downarrow_s is the down-sampling operation with down-sampling factor \times s. Most SR methods are built around the bicubic model [77, 78] with various down-scaling factors (e.g. ×2, ×3, ×4, ×8).

The advancements in hardware technologies have led to the training of larger and deeper neural networks for image super-resolution, resulting in significant performance improvements. However, these breakthroughs often come at the cost of introducing more complex approaches [3, 20, 56, 84, 99]. Since the seminal work by Shi et al. [70], the design of efficient deep neural networks for single image super-resolution [40, 47, 72, 81, 101] has become pivotal. Various workshops and challenges, such as [42, 53, 94], have emerged as popular forums for sharing ideas and advancing the state of the art in efficient and real-time SR. Publicly available large-scale datasets have been instrumental in driving recent advances in image and video SR [1, 32, 52, 66, 76]. However, with the exception of DIV8K [32] and [95], most existing datasets have images of limited resolution, e.g. 2K. In addition, the practical challenge of performing real-time SR of images and videos to 4K resolution has received relatively little attention so far.

As the amount of digital content continues to surge, there is a mounting demand for effective SR techniques for rendered content [86, 90]. However, rendering presents unique challenges, as it often exhibits significant aliasing, resulting in jagged lines and other sampling artifacts. Consequently, up-scaling rendered content requires a novel approach that involves both anti-aliasing and interpolation, which is distinct from the well-established research on denoising and deblurring in existing SR research [86].

In conjunction with the 2023 New Trends in Image Restoration and Enhancement (NTIRE) workshop, we introduce the real-time 4K super-resolution challenge. The challenge entails super-resolving an LR image from either 720p or 1080p to 4K resolution using a network that reduces one or several aspects, such as runtime, parameters, FLOPs, and memory consumption. The goal is to at least outperform bicubic interpolation on a new and diverse benchmark, while maintaining efficiency. The challenge seeks to identify innovative and advanced solutions for real-time super-resolution, benchmark their efficiency, and identify general trends for designing efficient SR networks.

2. NTIRE 2023 Real-Time Super-Resolution Challenge

The aim of this challenge is to create real-time super-resolution (SR) methods, with a specific focus on up-scaling to 4K resolution. We believe that this area remains largely unexplored within the computer vision community. The challenge has three main objectives: firstly, to advance research on real-time SR methods; secondly, to introduce a novel and competitive benchmark for 4K SR, utilizing various image types such as digital art and natural imagery; thirdly, to facilitate interactions between academic and industry participants and encourage potential collaborations.

2.1. 4K SR Benchmark Dataset

The 4K RTSR benchmark provides a unique test set comprising ultra-high-resolution images from various sources, setting it apart from traditional super-resolution benchmarks. Specifically, the benchmark addresses the increasing demand for upscaling computer-generated content, e.g. gaming and rendered content, in addition to photorealistic imagery, thereby posing a different challenge for existing SR approaches. The test set includes diverse content such as rendered gaming images and digital art, as well as high-resolution photorealistic images of animals, city scenes, and landscapes, totaling 110 test samples. We created this benchmark with the intention of advancing the development of SR methods, as well as replacing outdated test sets such as Set5 [7], Set14 [93], and Urban100 [39].

All the images in the benchmark test set are at least 4K resolution, i.e. 3840 × 2160 (some are bigger, even 8K). The images were filtered manually to ensure there are no unpleasant effects such as noise or strong defocus.

The distribution of the 4K RTSR benchmark test set is: 14 real-world captures using a 60MP DSLR camera, 21 rendered images using Unreal Engine [38], and 75 diverse images, e.g. animals, paintings, digital art, nature, buildings, etc.

2.2. Baseline Model

Previous lightweight SR methods [51] such as IMDN [40] or RFDN [60] are not fast enough for this task. For this reason, we use RT4KSR [92] as the baseline model for this challenge. The primary objective is to enhance its efficiency in terms of runtime, parameter count, and FLOPs. Drawing inspiration from the research presented in [42, 53], the baseline design utilizes a shallow convolutional architecture to achieve rapid and precise reconstruction performance. The proposed baseline stacks five simple
3 × 3 convolutions with a GeLU activation layer and adds a global residual connection with LayerNorm [6] before the standard depth2scale up-sampling operation. Moreover, the authors in [92] develop a sophisticated approach that improves model efficiency by downscaling feature maps. To avoid losing important high-frequency details that are already scarce, the authors propose extracting HF details from the LR input prior to its downscaling. Additionally, the authors provide a detailed roadmap of their method's development, resulting in a competitive shallow CNN design that can be scaled up and achieves performance comparable to previous state-of-the-art efficient SR models.

2.3. Tracks and Competition

The objective of this challenge is to develop a high-performance SR technique that can upscale a broad range of images to 4K resolution in real-time, while ensuring a PSNR above traditional bicubic interpolation.

Track 1: 1080p to 4K. The first challenge track addresses ×2 up-scaling from 1080p to 4K resolution.

Track 2: 720p to 4K. The second track of this NTIRE challenge addresses ×3 up-scaling from 720p to 4K resolution.

Challenge Phases. Development and Validation Phase. The participants were provided with access to a validation set comprising 100 images from the DIV2K validation split, along with an additional collection of 50 images that included a variety of content, from videogames to realistic high-resolution photography. The baseline model, scoring function, and evaluation scripts were made available to the participants through GitHub (https://github.com/eduardzamfir/NTIRE23-RTSR). This allowed the participants to benchmark the performance of their models on their systems. During the development phase, the objective was to up-scale 2K imagery, since DIV2K does not include any 4K imagery. Testing Phase. During the final test phase, the participating teams received a 4K benchmark comprising 110 diverse images. However, they did not have access to the HR ground truth. Once the participants generated their super-resolved results, they submitted their code, factsheets, and resulting images to the organizers via email. The organizers then validated and executed the submitted code to obtain the final results, which were later conveyed to the participants upon completion of the challenge.

Evaluation Protocol. The quantitative evaluation metrics for this challenge comprise testing PSNR, runtime, number of parameters, number of FLOPs, and maximum GPU memory consumed during inference. The PSNR is calculated on 110 RGB images sourced from our 4K benchmark test set. The corresponding degraded images are obtained through bicubic down-scaling to their respective resolutions (1080p for ×2 and 720p for ×3 up-scaling). The average runtime is determined by using mixed precision and repeatedly evaluating randomly initialized tensors of corresponding sizes, to overcome any bottlenecks that may arise due to data loading. The FLOPs are evaluated on input images of size 1920 × 1080 and 1280 × 720, respectively.

S = \frac{2^{2 \times (\mathrm{PSNR}_{M} - \mathrm{PSNR}_{B})}}{C \times T_{M}^{0.5}} \qquad (2)

Similar to [42], we determine the final score S of each participant in the challenge by utilizing Eq. (2), in which PSNR_M and T_M represent the PSNR result and runtime of the individual submission, and PSNR_B is the PSNR of the bicubic baseline. The scoring function is designed to prioritize faster runtime over restoration accuracy. However, in cases where two methods have similar runtimes, the PSNR value will be the deciding factor.

Related NTIRE 2023 Challenges. The NTIRE 2023 Real-Time Image Super-Resolution (RTSR) Challenge is part of the NTIRE 2023 Workshop series of challenges on: night photography rendering [71], HR depth from images of specular and transparent surfaces [91], image denoising [55], video colorization [44], shadow removal [80], quality assessment of video enhancement [62], stereo super-resolution [82], light field image super-resolution [85], image super-resolution (×4) [100], 360° omnidirectional image and video super-resolution [9], lens-to-lens bokeh effect transformation [18], real-time 4K super-resolution [19], HR non-homogeneous dehazing [4], and efficient super-resolution [54].

2.4. Architectures and Main Ideas

Here we summarize the core ideas behind the most competitive solutions. Each proposed solution is covered in the following Sec. 3 and Tab. 2.

1. Re-parameterization allows training the network using complex blocks [22], while during inference the so-called RepBlocks can be reduced to a simple 3 × 3 convolution (a short sketch follows this list).

2. Pixel shuffle and unshuffle (also known as depth-to-space and space-to-depth, respectively) [70] efficiently transform the feature maps and perform both spatial upsampling and downsampling.

3. Multi-stage Training. Since the neural networks are extremely constrained and shallow, this technique allows maximizing learning by alternating different learning rates and loss functions.

https://cvlai.net/ntire/2023/
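To make idea (1) concrete, below is a minimal PyTorch sketch of the re-parameterization trick: a training-time block with parallel 3×3, 1×1, and identity branches is folded into one 3×3 convolution for inference. The branch choice is illustrative and does not reproduce any specific team's RepBlock.

```python
# Minimal re-parameterization sketch: train with parallel branches,
# merge them into a single 3x3 convolution for inference.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x) + x  # multi-branch training path

    def merge(self) -> nn.Conv2d:
        """Fold all branches into one 3x3 convolution."""
        c = self.conv3.out_channels
        w = self.conv3.weight.clone()
        # Pad the 1x1 kernel to 3x3 and add it to the 3x3 kernel.
        w += F.pad(self.conv1.weight, [1, 1, 1, 1])
        # The identity branch is a 3x3 kernel with 1 at the center.
        eye = torch.zeros_like(w)
        for i in range(c):
            eye[i, i, 1, 1] = 1.0
        fused = nn.Conv2d(c, c, 3, padding=1)
        fused.weight.data = w + eye
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused

x = torch.randn(1, 16, 64, 64)
block = RepBlock(16).eval()
assert torch.allclose(block(x), block.merge()(x), atol=1e-5)
```

The merged convolution is mathematically equivalent to the multi-branch block, so inference pays the cost of a single plain convolution.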
Table 1. Results of the NTIRE23 Real-Time SR challenge. The runtimes are computed using an NVIDIA RTX 3090 GPU. The teams are ordered by their ranking according to their score. For better comparison, we color-code the runtime using <24 FPS, 24–30 FPS, 30–60 FPS, 60–120 FPS, and >120 FPS, respectively.
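The ranking score behind Table 1 is the one defined in Eq. (2). A small sketch follows; the exact value of the normalization constant C is set by the organizers and is treated as a free parameter here.

```python
# Sketch of the challenge scoring function in Eq. (2). PSNR values are in dB,
# the runtime T_M is in seconds, and C is an organizer-defined normalization
# constant (assumed configurable here).
def challenge_score(psnr_m: float, psnr_b: float, runtime_m: float, c: float = 1.0) -> float:
    return 2.0 ** (2.0 * (psnr_m - psnr_b)) / (c * runtime_m ** 0.5)

# A 0.5 dB PSNR gain over bicubic doubles the numerator, while halving the
# runtime scales the score by sqrt(2).
print(challenge_score(psnr_m=30.5, psnr_b=30.0, runtime_m=0.004))
```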
3. Methods and Teams

3.1. AsConvSR

Team Noah TerminalVision proposes AsConvSR, a fast and lightweight super-resolution network with assembled convolutions [34]. The solution builds on three key components: (i) pixel unshuffle, which reduces the computational cost of the network while keeping the information volume unchanged; (ii) a global skip connection, which repeats each pixel value 4× (or 9× for ×3 SR) [26]; (iii) an assembled convolution structure (Fig. 2b). Different from the dynamic convolution [14], which generates the whole convolution kernel as a linear combination of the basis, the assembled convolution generates the optimal kernel coefficients for each output channel, which is more flexible and outperforms the dynamic convolution in this task.

Figure 2. Team Noah TerminalVision solution. (a) Control module. (b) Assembled convolution. (c) Comparison between dynamic and assembled convolution.
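As a companion to the description in this section, here is a minimal PyTorch sketch of an assembled convolution: a control module predicts per-output-channel coefficients that linearly combine E candidate kernels, and the per-sample kernels are applied via a grouped convolution by folding the batch into the channel dimension. The control-module design (global pooling plus a linear layer) and the basis size are illustrative assumptions, not the team's exact implementation.

```python
# Sketch of an assembled convolution: per-sample, per-output-channel kernel
# coefficients combine a small bank of candidate kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssembledConv2d(nn.Module):
    def __init__(self, c_in: int, c_out: int, num_basis: int = 4, ks: int = 3):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(num_basis, c_in, ks, ks) * 0.1)
        self.control = nn.Sequential(  # coefficients from globally pooled input
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c_in, c_out * num_basis))
        self.c_out, self.num_basis, self.ks = c_out, num_basis, ks

    def forward(self, x):
        b, c_in, h, w = x.shape
        coeff = self.control(x).view(b, self.c_out, self.num_basis)
        # (B, Co, E) @ (E, Ci*ks*ks) -> per-sample kernels (B*Co, Ci, ks, ks)
        k = coeff @ self.basis.view(self.num_basis, -1)
        k = k.view(b * self.c_out, c_in, self.ks, self.ks)
        # Fold the batch into the channel dimension; one grouped convolution.
        out = F.conv2d(x.view(1, b * c_in, h, w), k, padding=self.ks // 2, groups=b)
        return out.view(b, self.c_out, h, w)

y = AssembledConv2d(32, 32)(torch.randn(2, 32, 96, 96))  # -> (2, 32, 96, 96)
```

The grouped-convolution trick is what makes per-sample kernels practical: the main cost remains a single 3×3 convolution, as noted in the network description below.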
Network architecture. Given an input LR image, the resolution is converted to the channel dimension by a pixel unshuffle layer. Using a 3x3 convolution, the channels of the feature map are converted to the target size (32 for ×2, 64 for ×3) and then fed into the assembled block. The assembled block contains a control module and three assembled convolutions. As shown in Fig. 2b, the control module is mainly responsible for generating coefficients for the assembled convolutions. Based on these coefficients, a 3x3 convolution kernel is generated to perform a classical convolution on the feature maps. Therefore, the major computational cost of the assembled convolution is still the 3x3 convolution itself, and the runtime of an assembled convolution is only a little higher than that of a classical convolution. After the assembled block, a 3x3 convolution layer converts the number of channels to 48 (108 for ×3 SR) so that the feature map can be restored to the target resolution by the pixel shuffle layer. It should be noted that a low-resolution image repeated in the channel dimension can also be restored to high resolution with a pixel shuffle layer; the final pixel shuffle is divided into two steps in order to incorporate the global skip connection into the network.

Assembled block. As shown in Fig. 2b, given the input features F ∈ R^{B×C×H×W}, the control module converts the features F into coefficients coeff ∈ R^{B×C_o×E}, where B is the batch size, C_o is the number of output channels, and E is the number of candidate convolution bases. Matrix multiplication is performed on the coefficients coeff and all candidate convolution kernels k_basis ∈ R^{E×C_i×ks×ks} — where C_i is the number of input channels and ks is the kernel size — to generate a final convolution kernel K ∈ R^{B×C_o×C_i×ks×ks}. Because different batches of data require different convolution kernels, the batch dimension of the feature map is reshaped into the channel dimension and a group convolution is used to calculate the output feature maps. As shown in Fig. 2b, dynamic convolution generates the whole convolution kernel (all channels) as a linear combination of the basis, whereas assembled convolution generates an optimal convolution kernel coefficient for each channel, which is more flexible and outperforms the dynamic convolution in this task.

Implementation Details. In the training phase, the training sets include DF2K [2, 75], DIV8K [33], GTAV [68], and LIU4K-V2 [59]. The network is trained by minimizing the Charbonnier loss with the Adam optimizer. The initial learning rate is 5e-4 and is halved every 2e5 iterations. The total number of training iterations is 3e6 on a Tesla V100 platform.

3.2. Bicubic++

The winning team in Track 2, Aselsan Research, proposes a lightweight single-image super-resolution method named Bicubic++ [8]. Unlike many other lightweight methods, where the input image dimensions are fixed throughout the network, Bicubic++ first downscales the image (by half, with strided convolutions) to greatly reduce the number of operations in the following convolutional layers and meet the real-time requirements. Finally, they apply ×6 upscaling. The overall structure is given in Fig. 3.

In addition, they follow a three-stage training approach, where they train a slightly larger model first, and perform global structured convolutional-layer and bias pruning without using heuristic metrics like weight norms in the following two stages. This approach ultimately yields a much faster, real-time model with none to marginal
decrease in the visual quality. They have not employed quantization or re-parametrization of the convolutional kernels.

Figure 3. Bicubic++ structure proposed by Aselsan Research. The s and p denote stride and padding, respectively. In the final proposed model, ch is 32, all bias terms are removed, and a strided convolution with s=2, p=1 is utilized for the downscaling (DS) layer. Red blocks after 3x3 convolutions are leaky ReLU activations. D2S denotes the depth-to-space layer [70].

Implementation Details. The models are trained in PyTorch Lightning. The training is done with mixed precision (FP16) by setting a precision flag in the Trainer, using the Adam optimizer with β1,2 parameters 0.99 and 0.999, respectively. For the first two stages of the training, they start with a learning rate of 5e-4; for the last stage, they start with 1e-4. They utilize a decaying learning rate scheduler for all stages, where after 500 epochs the learning rate decays linearly until it reaches 1e-8. For all three stages of the training, they train for 1000 epochs using batch size 8. Each epoch consumes 800 randomly cropped and rotated LR patches of dimension (108, 108, 3) from the Q=90 degraded DIV2K [1] dataset. For validation, they use 48 LR images with dimensions (680, 452, 3) from the Q=90 degraded DIV2K validation dataset.

3.3. RUNet

Team ALONG proposes RUNet: Re-parameterization and Unshuffle Network for Real-time Super-Resolution. The team mainly considers two aspects when designing the network: (i) Receptive field: the model's ability may be limited if its receptive field is too small. (ii) Computational efficiency: the relationship between runtime and computation is not necessarily positive; a higher level of computational efficiency can result in a shorter runtime.

As shown in Fig. 4a, inspired by [83], they initially apply the pixel-unshuffle technique, which serves as the inverse of pixel shuffle [70], to reduce the spatial dimensions and amplify the channel dimensions of the data before feeding it into the main model architecture. Thus, the majority of calculations are performed within a smaller resolution space, leading to a reduction in computational resource consumption and an effective enhancement of the inference speed. Furthermore, this approach can enlarge the receptive field. Next, a convolutional layer followed by an activation function is applied. This process effectively extracts low-level features from the input image. The body module is composed of a sequence of Re-Parameter blocks (RepBlock) that serve to extract and refine features in a progressive manner. Following the recent suggestions for low-level vision tasks introduced by [53, 58], the Gaussian Error Linear Unit (GeLU) activation function is utilized in the ×2 model, while the Sigmoid Linear Unit (SiLU) activation function is used in the ×3 model. Finally, the upsampling layer and a skip connection are utilized to increase the image resolution to the desired level. This is achieved by applying a convolutional layer, followed by a pixel-shuffle layer.

Besides re-parameterization [22], they also use Knowledge Distillation [36] in training. During the training stage, teacher output images and ground-truth images are used to guide the student network via teacher supervision (TS) and data supervision (DS), respectively. They use the HAT-L model [13] as the teacher model, which is currently considered the SOTA model in the field of super-resolution.

Implementation Details. The method is implemented using PyTorch 1.13. The loss function is L1 for reconstruction, and L2 is employed during the fine-tuning and knowledge distillation phases. For the ×2 model, the number of channels employed in the CNN model (student) is 32, and the number of RepBlocks is 3; the scale of the pixel unshuffle and pixel shuffle layers is 3. For the ×3 model, the number of channels employed in the CNN model (student) is 64, and the number of RepBlocks is 5; the scale of the pixel unshuffle and pixel shuffle layers is 4.

3.4. Team OV

Team OV presents a simple and efficient convolutional neural network architecture that incorporates 3×3 convolutions, the GELU activation function, and depth-to-space operations. The network utilizes 12 (for ×2) and 16 (for ×3) channels and produces the final image output through the depth-to-space operation. These architectural elements are depicted in Figure 5. The team also uses re-parameterization, as shown in Fig. 5 (b).

Implementation Details. The network was trained using the DF2K (DIV2K+Flickr2K) dataset [2, 75], divided into three stages. Initially, low-resolution (LR) patches with a dimension of 128×128 are randomly cropped from high-resolution (HR) images with a mini-batch size of 64. L1 and FFT losses are used as target loss functions. Following this, network parameters were optimized for 300K iterations employing the Adam algorithm, with a learning rate of 1 × 10−3 decreasing to 1 × 10−7 through the cosine scheduler. In the second stage, the model obtained from the first stage was trained similarly for another 300K iterations. In the final stage, the model was fine-tuned using the L2 loss.
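As an illustration of the TS/DS supervision described for RUNet, below is a minimal PyTorch sketch. The stand-in models and the equal loss weighting are assumptions; in the actual solution, the teacher is the frozen HAT-L model [13].

```python
# Sketch of teacher supervision (TS) and data supervision (DS): the student
# output is matched both to the frozen teacher's prediction and to the GT.
import torch
import torch.nn as nn

def distillation_loss(student, teacher, lr_img, hr_img, w_ts=0.5):
    l1 = nn.L1Loss()
    with torch.no_grad():                 # the teacher is frozen
        teacher_sr = teacher(lr_img)
    student_sr = student(lr_img)
    loss_ts = l1(student_sr, teacher_sr)  # teacher supervision (TS)
    loss_ds = l1(student_sr, hr_img)      # data supervision (DS)
    return w_ts * loss_ts + (1.0 - w_ts) * loss_ds

# Stand-in x2 models: any nn.Module mapping LR -> SR works here.
student = nn.Sequential(nn.Conv2d(3, 12, 3, padding=1), nn.PixelShuffle(2))
teacher = nn.Sequential(nn.Conv2d(3, 12, 3, padding=1), nn.PixelShuffle(2))
loss = distillation_loss(student, teacher,
                         torch.randn(4, 3, 64, 64), torch.randn(4, 3, 128, 128))
loss.backward()
```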
Figure 4. (a) RUNet Architecture. (b) Reparameter Block.
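RUNet, like several other entries, relies on the pixel unshuffle/shuffle pair (idea (2) in Sec. 2.4) so that the body runs at a fraction of the output resolution. The following shape walk-through is a minimal sketch; the channel widths are illustrative, not RUNet's.

```python
# Pixel unshuffle folds space into channels; pixel shuffle unfolds it back.
# The body convolutions then operate on a quarter-resolution feature map.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 1080, 1920)                  # 1080p RGB input
down = nn.PixelUnshuffle(2)(x)                     # -> (1, 12, 540, 960)
body = nn.Conv2d(12, 48, 3, padding=1)(down)       # cheap convs on small maps
up = nn.PixelShuffle(4)(body)                      # -> (1, 3, 2160, 3840), 4K
print(down.shape, up.shape)
```

Because both operations are pure reshapes, they cost almost nothing at inference while quartering the spatial size seen by every convolution in the body.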
Figure 7. Team DFCDN: The overall architecture of the proposed DFCDN network. (a) The architecture of the RepConv-based plain net for RTSR. (b) The architecture of the reparameterized convolution module.
3.6. DFCDN

The features are enhanced by an efficient spatial attention layer [63].

Online Convolutional Re-parameterization. Re-parameterization [96] has improved the performance of image restoration models without introducing any inference cost. However, the training cost is large because of the complicated training-time blocks. To reduce the extra training cost, they apply online convolutional re-parameterization [37] by converting the complex conv blocks into one single convolutional layer. The architecture of RepConv is shown in Fig. 7 (c). It can be converted to a 3 × 3 convolution during training, which greatly reduces the training cost.

Implementation Details. The number of features is set to 8 and the number of attention channels is set to 16. The DIV2K [1] dataset is used for training, and the inputs are in the range 0-255. First, for training the ×2 (Track 1) models, the setup is as follows: the model is first trained from scratch with 256×256 patches randomly cropped from HR images from DIV2K. The mini-batch size is set to 64. The L1 loss is minimized with the Adam optimizer. The initial learning rate is set to 5e-4 with a cosine annealing schedule. The total number of epochs is 1000. At the second stage, the model is initialized with the pre-trained weights of Stage 1. The HR patch size is set to 640. The model is trained with the same settings as in the previous step. At the third stage, the model is initialized with the pre-trained weights of Stage 2. The MSE loss is used for fine-tuning with 640 × 640 HR patches and a learning rate of 1e-5 for 100 epochs.

The training details for ×3 (Track 2) are as follows: at the first stage, the model is initialized with the pre-trained weights of the scale-2 model. The HR patch size is set to 660. The model is trained with the same settings as ×2. At the second stage, the model is initialized with the pre-trained weights of Stage 1. The MSE loss is used for fine-tuning with 660 × 660 HR patches and a learning rate of 1e-5 for 100 epochs.
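DFCDN's schedule is a good instance of multi-stage training (idea (3) in Sec. 2.4): each stage warm-starts from the previous weights and may switch the loss, learning rate, and patch size. The sketch below captures the pattern; `make_loader` and `model` are hypothetical placeholders.

```python
# Generic multi-stage training loop: each stage restarts the optimizer and
# scheduler, optionally switching loss (L1 -> MSE) and patch size.
import torch
import torch.nn as nn

def train_stage(model, loader, loss_fn, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for lr_img, hr_img in loader:
            opt.zero_grad()
            loss_fn(model(lr_img), hr_img).backward()
            opt.step()
        sched.step()

# Stage 1: from scratch, L1 loss.  Stage 2: same settings, bigger patches.
# Stage 3: MSE fine-tune at a low LR, warm-started from stage 2.
# train_stage(model, make_loader(patch=256), nn.L1Loss(), lr=5e-4, epochs=1000)
# train_stage(model, make_loader(patch=640), nn.L1Loss(), lr=5e-4, epochs=1000)
# train_stage(model, make_loader(patch=640), nn.MSELoss(), lr=1e-5, epochs=100)
```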
Figure 8. Team NJUST-RTSR: The overall architecture of the proposed network. (Bottom) Detailed network of the proposed RepRB.

Figure 9. Team z6 proposed LRSRN network structure. (a) Training mode of the proposed network. (b) Inference mode of the proposed network.
3.7. NJUST-RTSR

The team proposes a method that first transforms the input LR image into the feature space using a convolutional layer, then performs feature extraction using four reparameterizable residual blocks (RepRBs), and finally reconstructs the final output by a sub-pixel [70] convolution. The proposed architecture is illustrated in Fig. 8.

To enhance the capability of the model, they use the re-parametrization technique [23]. Fig. 8 (Bottom) shows a detailed description of the used RepRB module. It contains three branches in the training phase to learn features from different receptive fields, while in the inference phase it can be merged into a 3 × 3 convolution.

Implementation Details. The team uses DIV2K [2] and Flickr2K [75] as the training data. In order to accelerate the IO speed during training, they crop the 2K-resolution images into sub-images — the HR image is cropped into 640 × 640 and 960 × 960 sub-images for ×2 and ×3 SR, respectively. During training, data augmentation is performed on the input patches with random horizontal flips and rotations. The HR image patch size is initialized as 128 × 128 and increased to 256 × 256, and the batch size is set to 64. They use the Adam [46] optimizer with the cosine annealing scheme [64]. The initial learning rate is set to 1 × 10−3 and the minimum one to 1 × 10−6. The total number of iterations is set to 300k. They use a combination of mean absolute error (MAE) loss and an FFT-based frequency loss function to constrain the model training, the same as [73]. All experiments are conducted with the PyTorch framework on an NVIDIA GeForce RTX 3090 GPU.

3.8. LRSRN

Team z6 proposes a Lightweight Real-Time Image Super-Resolution Network (LRSRN) [30] that can deliver higher accuracy at a faster speed compared to previous real-time SR models for 4K images. They apply a reparameterized convolution (RepConv) for all convolution layers to improve the image quality while maintaining the model size and inference speed. The proposed network is an extended version of [29] (previous work of the team), which was designed for mobile devices. The proposed network is illustrated in Fig. 9.

Implementation Details. The team used PyTorch 1.13. The models were trained in two steps: (i) First, the models were trained from scratch. The LR patches were cropped from HR images with mini-batch size 8 and resolution 192 × 192 (Track 1) and 128 × 128 (Track 2). The Adam optimizer was used with a 0.0005 learning rate and a cosine warm-up scheduler. The total number of epochs was set to 800. They use the L1 loss. (ii) In the second step, the model was initialized from the previous step. Fine-tuning with the L2 loss improves the PSNR value by 0.01 ∼ 0.02 dB. In this step, the initial learning rate was set to 0.0001 and the total number of epochs to 200. In particular, DIV2K [1] was used for training from scratch. A combined dataset, which includes the DIV2K train set (800 images), Flickr2K (2650 images), GTA (train seq 00 ∼ 19), and LSDIR [52] (first 1000 images), was used for the fine-tuning stage. The training data is preprocessed by center cropping it to a resolution of 2040 x 1080. To generate low resolution, they degrade the center-cropped images
with bicubic downsampling and JPEG compression. During training, they used random cropping, rotations, and flip augmentations.

3.9. SCSYENet

Team Multimedia proposes SCSYENet: a compact, skip-concatenated, simple yet effective real-time image super-resolution network based on an element-wise multiplication fusion operation and re-parameterized convolution.

They built an end-to-end RTSR network based on an element-wise multiplication fusion operation and re-parameterized convolution, following previous work [5, 43, 97]. SCSYENet has only 10K/12.5K parameters (in Track 1 (×2) and Track 2, respectively). The network consists of two asymmetrical branches with simple building blocks. To effectively connect the results of the asymmetrical branches, an element-wise multiplication fusion operation is proposed. The architecture of SCSYENet is illustrated in Fig. 10a.

Network Structure. Inspired by ECBSR [97], SCSYENet employs the re-parameterization technique to boost the SR performance while maintaining high efficiency. The model consists of six ECBs (see Fig. 10b), one PReLU, two fusion blocks, and one skip connection (concatenation of the input image after preprocessing and the intermediate feature map). The number of channels in the network is set to 16. Pixel shuffle is used to produce the final image output. Typically, in previous multi-branch networks, the fusion of outputs from different branches is done by concatenation [5, 74] or element-wise addition followed by an activation function [21, 31]. In this study, in order to effectively improve the representational power, an element-wise multiplication fusion operation [43], as in Fig. 10a, is employed for the fusion of the results of the two branches, where ⊗ is the element-wise multiplication and ⊕ is the element-wise addition. During inference, the ECB block can be reparameterized into one single 3×3 convolution.

Implementation Details. The team uses PyTorch 1.12.1, and the training device is an A100 GPU. During training, the DIV2K [1] and Flickr2K [75] datasets are used for the whole process. The team follows a 3-stage training: First, the model is trained from scratch. HR patches of size 128 × 128 are randomly cropped from HR images, and the mini-batch size is set to 32. The SCSYENet model is trained by minimizing the L1 loss function with the Adam optimizer. The initial learning rate is set to 1 × 10−4 and decayed with a cosine annealing scheduler every 200 epochs. The total number of epochs is 1000. Second, the model is initialized with the pretrained weights and trained with the same settings as in the previous step. This process repeats once. Third, the training settings are the same as in Stage 1, except that the L2 loss is used for fine-tuning with 2040 × 1080 HR patches, the initial learning rate is 1 × 10−5, and the mini-batch size is set to 4.

3.10. ERLFN

Team Antins CV proposes a method built on the Residual Local Feature Network (RLFN) [48]. Based on this network, they prune the architecture, introduce the Enhanced Residual Block (ERB) RepBlock proposed by the runner-up solution of [51], and propose their Enhanced Residual Local Feature Network (ERLFN).

Network Structure. The RLFN proposed by [48] is an efficient network for the lightweight super-resolution task. For this real-time super-resolution task, they further prune the network for an ideal speed. For Track 1 (upscaling from FHD 1080p to 4K), the network requires heavy computation. To balance for speed, they cut the four RLFB blocks in RLFN to two blocks and shrink the feature channels to 12. The ESA blocks nested in the RLFB are removed to reduce computation cost and save time. For Track 2, to upscale from HD 720p to 4K resolution, they cut the four RLFB blocks in RLFN to two blocks and shrink the feature channels to 27. The ESA blocks are kept, and their channels remain at 16.

The team also uses the ERB RepBlock from the Enhanced Residual Block (ERB) first proposed by the runner-up solution of [51]. They replace the 3 × 3 convolutions in the RLFB with the ERB RepBlock. The network and ERB block are shown in Fig. 11. For inference, the ERB RepBlock is reparameterized to a 3×3 convolution. The team does not experience any performance drop after reparameterization.

Implementation Details. The ERLFN model is trained in two stages for both Track 1 and Track 2. In the first stage, they train the model from scratch on the DIV2K [1], cropped DIV8K, Flickr2K, OST, WED, first 2000 images of FFHQ, and first 1000 images of SCUT-CTW1500 datasets — following [56]. The HR images are randomly cropped to patches of size 256 × 256 for Track 1 and 192 × 192 for Track 2. They use the Adam optimizer with the L1 loss for this stage, set the initial learning rate to 5e−4 with a mini-batch size of 64, train the model for 1000 epochs, and decay the learning rate by 0.5 every 200 epochs. In the second stage, the model is initialized with the pretrained weights from the first stage on the same training data as stage 1. Then the model is fine-tuned using the L2 loss with a cosine learning rate schedule and an initial learning rate of 1e−4 for 500 epochs.

3.11. PCRTSR

Team ECNUSR proposes PCRTSR: Partial Convolution based Network for Real-Time Super-Resolution. The overall architecture is shown in Fig. 12. The network first
involves the pixel unshuffle for faster speed and a larger receptive field. Then, several stacked PCBS blocks (Fig. 12 (a)) build up the feature extraction, where each PCBS block is composed of several PCB blocks (Fig. 12 (b)) and a residual connection. Finally, the reconstruction module, consisting of a 3×3 vanilla convolution and a pixel shuffle operation, produces the SR image.

Figure 10. (a) Detailed architecture of SCSYENet. (b) ECB: in the training stage, the block employs multiple branches, which can be merged into one normal convolution layer in the inference stage.

Figure 11. Team Antins CV proposed ERLFN network.
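The PCB blocks above build on partial convolution (PConv). The exact layout of PCRTSR's blocks is only partially recoverable from Fig. 12, so the sketch below follows the common partial-convolution formulation: only a fraction of the channels go through the 3×3 convolution while the rest pass through untouched. The split ratio is an assumption.

```python
# Partial convolution sketch: convolve a channel subset, pass the rest through.
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        self.n_conv = max(1, int(channels * ratio))  # channels to convolve
        self.conv = nn.Conv2d(self.n_conv, self.n_conv, 3, padding=1)

    def forward(self, x):
        head, tail = torch.split(x, [self.n_conv, x.size(1) - self.n_conv], dim=1)
        return torch.cat([self.conv(head), tail], dim=1)

y = PartialConv(32)(torch.randn(1, 32, 64, 64))  # same shape out
```

With a 1/4 split, the 3×3 convolution sees 1/4 of the input and output channels, so its FLOPs drop to roughly 1/16 of a full convolution at the same width.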
Figure 13. Team R.I.P. ShopeeVideo proposed R2C block and R2CNet. (a) The R2C block uses L-ESA, an improved ESA [60]; Batch Normalization (BN) is also applied to each convolution layer to accelerate convergence. (b) R2CNet: the macro structure is based on RLFN [47], and M R2C blocks are used.

Figure 14. Team P.AI.R proposed FADN. Comparison of (a) residual feature distillation block and (b) no-attention distillation block.

3.12. R2CNet

In the R2C block, the input 1×1 convolution is used to reduce the channel numbers, and the output one to increase them. Thus, the channel numbers inside the block are small, making it efficient to stack 3×3 convolutions inside [10, 22], i.e., N basic blocks and a skip-path 3×3 convolution. The team also proposes L-ESA for efficient and effective spatial attention, in which they simply reset the kernel size and stride of the pooling layer in ESA [60] from 7 and 3 to 11 and 7. The large kernel captures more spatial information, and the large stride reduces computation and runtime [16]. With the R2C block, they build R2CNet following the macro structure of RLFN [47], as shown in Fig. 13 (b).

To process images of large size (4K) efficiently, they also introduce a new downsample-upsample mechanism into the R2CNet: they simply set the stride of the first R2C block to 2 for downsampling and utilize a pixel shuffle layer with factor 2 for upsampling. Specifically, in both R2CNet×2 and R2CNet×3, they set N = 4, M = 2, the channel number of the main body to 64, and that inside the R2C block to 32.

Implementation Details. The team uses PyTorch for training and inference. They train the models in three stages. Each stage has 100k iterations. The learning rate is set to 5e-4 for the first two stages, with the first 5k iterations as warm-up, and 2e-4 for the last stage without warm-up, using cosine annealing. The PSNR loss [12] is utilized. Adam is the optimizer and weight decay is not applied. The global batch size is set to 96 on 3 GPUs. The sizes of the HR images during training for R2CNet×3 and R2CNet×2 are 576 and 512, respectively. Before inference, the BN layers in the R2C blocks are fused into their corresponding convolution layers for fast inference. The team uses the DIV2K [2], Flickr2K, and half of the LSDIR [52] datasets for training.

3.13. FADN

Team P.AI.R proposes FADN: Few Activation Distillation Networks for Real-time Super-resolution. The solution is mainly based on RFDN [60]. The architecture of the proposed method differs from the RFDN in two ways: 1) the simple gate (SG) introduced in NAFNet [11], which is an element-wise product of feature maps divided into two parts in the channel dimension, is used instead of ReLU in a shallow residual block (SWB); 2) simplified channel attention (SCA), also introduced in [11], is used instead of contrast-aware channel attention (CCA). The team adopted the SG and SCA to simplify the network, as the SG halves the number of channels and the SCA is a simplified version of channel attention. In addition, layer normalization was also adopted in the network to ensure a more stable training process. The FADN (see Fig. 14) consists of four no-activation distillation blocks (NADB).

Technical details. The team trains the models with the ADAM optimizer, setting beta1=0.9, beta2=0.999, and eta=10−8. The learning rate is initialized as 2e-4 and halved every 100 epochs. The team used the LSDIR [52] dataset to train the models, and generated the training LR images by downsampling HR images with bicubic interpolation and JPEG compression. The model is implemented using PyTorch.
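For concreteness, below is a minimal sketch of the two NAFNet components FADN adopts: SimpleGate splits the features in half along the channel dimension and multiplies the halves (halving the width in the process), and SCA rescales channels with a single pooled 1×1 projection. This follows the published NAFNet definitions, not FADN's exact block wiring.

```python
# NAFNet-style SimpleGate and simplified channel attention (SCA).
import torch
import torch.nn as nn

class SimpleGate(nn.Module):
    def forward(self, x):
        a, b = x.chunk(2, dim=1)   # split channels in two
        return a * b               # element-wise product, C/2 channels out

class SCA(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return x * self.proj(self.pool(x))  # per-channel rescaling

x = torch.randn(1, 32, 64, 64)
print(SimpleGate()(x).shape)   # torch.Size([1, 16, 64, 64])
print(SCA(32)(x).shape)        # torch.Size([1, 32, 64, 64])
```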
The number of channels was 16 for ×2 SR and 40 for ×3 SR; thus, the number of parameters is 0.0121 M and 0.1280 M, respectively.

3.14. Team PixelBE

Figure 15. (a) The overall architecture and the structure of the RepConv block (training and inference modes).
Figure 16. Team DoYouChargeQQCoin proposed network.

3.17. Team Touch Fish

The motivation for the learnable perception field is that it can be advantageous for the preceding layers to concentrate their attention on regions of interest. They generate an attention map M(i, j) as:

\mathcal{M}(i,j) = \phi(\mathrm{Conv}_{1 \times 1}(F_l(i,j))), \qquad (3)

where φ(·) denotes the sigmoid function, and F_l(i, j) and F_f(i, j) denote the value of the feature map at position (i, j) from the latter layer and the former layers, respectively. The generated attention map is then used to reweight the features in the former layers as M(i, j) ⊙ F_f(i, j), where ⊙ denotes the Hadamard product.

As depicted in Fig. 17 (b), an attention map is generated for each block, which is subsequently utilized to reweight the feature maps originating from distinct levels.

They also use re-parameterization (rep) [22] to enhance the efficiency of the inference phase. This technique has been incorporated into each convolutional block depicted in Fig. 17. In contrast to prior techniques that employ strided convolutions, pooling, and upsampling, the team merely uses the generated mask. This modification has resulted in a significant acceleration of both inference and training times, as well as a reduction in the memory footprint.

Figure 17. Team Touch Fish solution: (a) AttLi block; pink denotes the generated attention map M. (b) Pipeline for ×2 SR.

Technical details. The number of channels is set to 24 (×2) and 32 (×3). The learning rate is 5 × 10−4 and undergoes a halving every 2 × 10^5 iterations. The network is trained for a total of 10^6 iterations with the L1 loss, a batch size of 64, and the Adam optimizer [45]. Subsequently, fine-tuning is executed using the L1 and L2 loss functions, with an initial learning rate of 1 × 10−5 for 5 × 10^5 iterations and an HR patch size of 512. The dataset utilized for training comprises DIV2K [1] and LSDIR [52].

3.18. Team DH ISP

The team designed a simple lightweight network for image super-resolution. The model consists of two 3x3 convolution layers, one 1x1 convolution layer, and four re-parameterizable blocks (RepBlock); the final output is obtained using pixel shuffle. Re-parameterizable blocks can learn features at different scales during the training phase; during inference, they can be converted into 3x3 convolutions to accelerate the inference speed. The network structure is shown in Figure 18.

Figure 18. Team DH ISP proposed solution.

Two branches are used for feature extraction: (i) four re-parameterizable blocks and a 3x3 convolution, used to extract the deep features of the image; (ii) a 1x1 convolution, used to extract the shallow features of the input image. Finally, the features extracted from the two branches are added together for fusion, the upsampled features are obtained through the pixel shuffle layer, and the final output is obtained through a self-attention structure.

Technical details. The training dataset includes Flickr2K and DIV2K [1]. The training of the model is divided into two stages: (i) the network is trained from scratch. The input image size is 256 × 256, the batch size is 16, the loss function is L1, and the Adam optimizer is used with the initial learning rate set to 0.001; the learning rate is halved every 200 epochs,
and the model is trained for a total of 800 epochs. (ii) On the basis of the training in the first stage, the L2 loss is used to continue training for 200 epochs, with an initial learning rate of 0.0001, halved every 50 epochs. Finally, the re-parameterizable modules in the network are merged into 3x3 convolutions, and the trained model parameters are transformed to achieve faster inference.

3.19. PRFDN
Adam [46] with learning rate 5e-4. In the test phase, they feed the whole-size image to the model, and the inference speed is approximately 18 ms per image.

3.21. DRCNN

Team diSRupt proposes the Depthwise-Residual Convolutional Neural Network (DRCNN). DRCNN (see Fig. 21) extends the SCSRN architecture, which was introduced in [43]. On top of the existing architecture, DRCNN performs nearest-neighbor upsampling to provide the SCSRN stage with an upsampled baseline image. In order to maintain efficiency through GPU parallelism, a space-to-depth transformation is applied to the upscaled LR image, forcing the following convolutional layers to operate on feature maps having the same dimensions as the LR image. The same depthwise-upsampled LR image is added to the feature map generated through the SCSRN, forcing the network to learn the residual between the naive interpolation and the HR image, thus enhancing the convergence speed and the overall performance.

Implementation details. The authors use TensorFlow 2. The network was trained for 70 epochs on the entire DIV2K training set [1], using the Adam [46] optimizer with a 3e-4 learning rate, a batch size of 16, a patch size of 128, classical augmentations, and optimizing for MSE. The model accepts RGB images of any resolution. No re-parameterization, pruning, or quantization was applied.

3.22. ELIS

Team KCML2 proposes Enhanced Lightweight Image Super-resolution (ELIS), which is inspired by XLSR [5] with the addition of an advanced attention mechanism. The main idea is to use channel splitting to separate the feature maps and process them in parallel with attention. Besides this, the authors use a multi-stage warm-start training strategy: in each stage, the pre-trained weights from previous stages are utilized to improve the model performance. The network is illustrated in Fig. 22.

The authors add a spatial operation to the original block from XLSR [5] to enhance the performance, as each pixel is treated differently at each pixel location. They design the ECSB block, which contains a channel splitting mechanism, a convolution operation, and an enhanced spatial attention block (ESA), as shown in Fig. 22 (bottom).

Implementation details. The authors use DIV2K and Flickr2K [1] as the training set and randomly crop the images to the size of 512×512. All images are normalized to the range 0-1. During training, they randomly crop LR patches of size 256×256 and use horizontal flipping and vertical flipping along with random intensity scaling for augmentation. As the loss function, they employ the Charbonnier loss with η = 0.1. The number of ECSBs is set to 5 and the number of channels inside the ECSB to 32. The model is trained using a multi-stage training strategy with a cyclic learning rate scheduler, the Adam optimizer [46], and a batch size of 64. The authors did not use any pruning or re-parameterization technique, only channel splitting and attention.

3.23. Team NPU SuperResolution

The team proposes a model based on ECBSR [96] with some improvements. The authors found that the edge operator does not make a relatively large contribution to the performance of the whole model, so they propose to replace the edge operator with the wavelet transform. Their experiments show that the wavelet transform has a measurable effect on the improvement of the model.

The authors also use ideas from MWCNN [61] and other models that use the wavelet transform to achieve super-resolution. In their model, the LL, HL, LH, and HH sub-bands obtained after the wavelet transform are concatenated in the channel dimension, which ensures that no information is lost, thereby further improving the performance of the model. They chose a very simple model with only one branch, so that the speed of the model can be guaranteed. In a block, they remove the branches that do not significantly improve the model and only keep the branch that contributes the most. In addition, they also use re-parameterization as in ECBSR [96], so that each block can be re-parameterized into one or two 3x3 convolutions, giving the model a faster inference speed.

Technical details. The team uses PyTorch to implement the model. The optimizer is Adam [46], the learning rate is 5e-4, and the GPU is an A100. The training dataset combines DIV2K [1], Flickr2K, manga [65], and some pictures obtained from the internet; the authors find that this dataset can significantly improve the performance of the model. The obtained model is re-parameterized.

3.24. Team YNOT

The team utilized an image processing method based on Fast Fourier Convolution (FFC) [15], which has different advantages from conventional convolution-based image processing (i.e., it can utilize both global and local information), and Wavelet Analysis [89] image processing techniques. By utilizing information at the frequency level, they aimed for better performance while lightening the baseline architecture of IMDN [40].

The authors found that FFC [15] can be used to replace traditional CNNs, but it may not be suitable for real-time super-resolution. However, by utilizing the information available in the spectral domain (e.g., Fourier transform, wavelet transform), they were able to lighten the architecture of the IMDN [40] model to satisfy some of the real-time requirements.
Figure 21. Team diSRupt proposed DRCNN.
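The lossless sub-band decomposition used, in spirit, by the NPU SuperResolution and YNOT entries can be sketched with a single-level Haar wavelet transform: the image splits into LL, HL, LH, and HH sub-bands that are concatenated along the channel dimension, halving the spatial size without discarding information. The Haar filters and normalization below are the textbook choice, not either team's exact implementation.

```python
# Single-level 2D Haar wavelet decomposition with channel concatenation.
import torch

def haar_dwt(x: torch.Tensor) -> torch.Tensor:
    a = x[..., 0::2, 0::2]  # even rows, even cols
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2  # low-frequency approximation
    hl = (a - b + c - d) / 2  # horizontal detail
    lh = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail
    return torch.cat([ll, hl, lh, hh], dim=1)  # (B, 4C, H/2, W/2)

print(haar_dwt(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 12, 32, 32])
```

The transform is invertible, so a network operating on the concatenated sub-bands sees the full image content at half the spatial resolution.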
Figure 24. Qualitative results. Comparison of the best methods (LR input, AsConvSR [34], Bicubic++ [8]) using test sample 11. The image corresponds to a real capture using a 60MP camera. Complete HQ uncompressed results for the top teams can be consulted on our project website.
Figure 25. Qualitative results. Comparison of the best methods (LR input, AsConvSR [34], Bicubic++ [8]) using test sample 59, a real-world capture using a SONY ILCE-7M3. Image credit: "Asakusa" by @mosdesign.
Figure 26. Qualitative results. Comparison of the best methods (LR input, AsConvSR [34], Bicubic++ [8]) using test sample 114, rendered content using Unreal Engine [38].
Table 2. We provide Additional Training Details to facilitate reproducibility of the solutions. The teams indicate the resolution of the
input RGB image during training, the training time in hours, and the GPU device.
Method Input Training Time (h) Attention Quantization # Params. (M) GPU
AsConvSR ×2 120 × 120 30 No No 2.3 V100
AsConvSR ×3 80 × 80 30 No No 17 V100
RUNet ×2 192 × 192 24 No No 0.0668 RTX3090
RUNet ×3 192 × 192 20 No No 0.24 RTX3090
Team OV 128 × 128 21 No No 0.005 RTX3090
Repnet ×2 256 × 256 8 No No 0.0266 A100
Repnet ×3 256 × 256 12 No No 0.0532 A100
Bicubic++ ×3 108 × 108 3 No No 0.0504 V100
DFCDN ×2 320 × 320 44 Yes No 0.0064 RTX3090
DFCDN ×3 220 × 220 44 Yes No 0.0075 RTX3090
NJUST-RTSR ×2 256 × 256 16 No No 0.014 RTX3090
LRSRN ×2 192 × 192 48 No No 0.0046 A6000
LRSRN ×3 128 × 128 16 No No 0.0046 A6000
SCSYENet ×2 512 × 512 27 No No 0.01 A100
SCSYENet ×3 540 × 540 18 No No 0.0125 A100
ERLFN ×2 256 × 256 71 ESA No 0.0111 V100x4
ERLFN ×3 192 × 192 47 ESA No 0.0666 V100x4
PCRTSR ×2 256 × 256 30 No No 0.162288 2080Ti
R2CNet ×2 512 × 512 180 L-ESA No 0.3987 V100
R2CNet ×3 576 × 576 180 L-ESA No 0.4073 V100
FADN ×2 256 × 256 130 Yes No 0.0212 RTX3090
PixelBE ×2 128 × 128 96 No No 0.137 V100
OELSR ×2 512 × 512 8 No No 0.0068 2080Ti
QQCoin ×2 256 × 256 48 No No 0.00082 RTX3090
Touch Fish ×2 256 × 256 60 Yes No 0.064 A100x8
Touch Fish ×3 256 × 256 60 Yes No 0.183 A100x8
dh ISP 256 × 256 5 Yes No 0.01 2080Ti
PRFDN ×2 678 × 1020 16 No No 0.0299 RTX3070
PRFDN ×3 512 × 680 16 No No 0.0629 RTX3070
NTU-BL6F (LFDN) ×2 256 × 256 12 Yes Yes 0.22 RTX3090
DRCNN ×2 128 × 128 5 No No 0.0499 NVIDIA T4
DRCNN ×3 128 × 128 3 No No 0.0649 NVIDIA T4
ELIS 256 × 256 10 ESA No 0.039 TITAN RTX
NPU-SR (ECBSR) ×2 1080 × 1920 10 No Yes 0.2 A100
YNOT ×2 256 × 256 4 Yes No 0.4648 A100
1 University of Würzburg, Germany
2 Sony Interactive Entertainment, CA.

6.2. Noah TerminalVision
Title: AsConvSR: Fast and Lightweight Super-Resolution Network with Assembled Convolutions
Members: Jiaming Guo, Xueyi Zou, Yuyi Chen, Yi Liu, Jia Hao, Youliang Yan
Affiliations: Huawei Technologies Co., Ltd.

6.3. Aselsan Research
Title: Bicubic++: Slim, Slimmer, Slimmest - Designing an Industry-Grade Super-Resolution Network
Members: Mustafa Ayazoglu, Bahri Batuhan Bilecen
Affiliations: Aselsan Research, Türkiye. https://www.aselsan.com/tr

6.4. ALONG
Title: RUNet: Re-parameterization and Unshuffle Network for Real-time Super-Resolution
Members: Cen Liu, Zexin Zhang, Yunbo Peng, Yue Lin
Affiliations: NetEase Games AI Lab

6.5. Team OV
Title: An Efficient ConvNet for Real-time Image Super-resolution
Members: Lingshun Kong, Haoran Bai, Jinshan Pan, Jiangxin Dong, Jinhui Tang
Affiliations: Nanjing University of Science and Technology

6.6. RTVSR
Title: Repnet for Real-Time Super-Resolution
Members: Yuanfan Zhang, Gen Li, Lei Sun
Affiliations: Tencent

6.7. DFCDN Team
Title: DFCDN: Deep Feature Complement and Distillation Network
Members: Mingxi Li, Yuhang Zhang, Xianjun Fan, Yankai Sheng
Affiliations: Attrsense

6.8. z6
Title: Lightweight Efficient Real-Time Image Super-Resolution Network (LERSRN)
Members: Ganzorig Gankhuyag, Kihwan Yoon
Affiliations: Korea Electronics Technology Institute (KETI)

6.9. NJUST-RTSR
Title: A Simple Residual ConvNet with Progressive Learning for Real-Time Super-Resolution
Members: Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang
Affiliations: Nanjing University of Science and Technology

6.10. Multimedia
Title: SCSYENet: A Compact Skip-Concatenated Simple Yet Effective Real-Time Image Super-Resolution based on element-wise multiplication fusion operation and Re-parameter convolution
Members: Zibin Liu, Weiran Gou, Shaoqing Li, Ziyao Yi, Yan Xiang, Dehui Kong, Ke Xu
Affiliations: Sanechips Co., Ltd.

6.11. Antins CV
Title: Enhanced Residual Local Feature Network (ERLFN)
Members: Jin Zhang, Gaocheng Yu, Feng Zhang, Hongbin Wang
Affiliations: Ant Group

6.12. ECNU SR
Title: Partial convolution based Network for Real-Time Super Resolution (PCRTSR)
Members: Zhou Zhou, Jiahao Chao, Hongfan Gao, Jiali Gong, Zhengfeng Yang, Zhenbing Zeng
Affiliations: East China Normal University

6.13. R.I.P ShopeeVideo
Title: Efficient Bottle-in-Bottle Block for Real-Time Super-Resolution
Members: Chengpeng Chen, Zichao Guo
Affiliations: Shopee https://shopee.com/

6.14. DoYouChargeQQCoin
Title: Ultra fast network for image super-resolution
Members: Yuqing Liu, Qi Jia, Hongyuan Yu, Xuanwu Yin, Kunlong Zuo
Affiliations: Dalian University of Technology; Xiaomi Inc.

6.15. PixelBE
Title: Two-Stage Super-resolution Algorithm Based on Re-Parameterization
Members: Dongyang Zhang
Affiliations: Mango TV (MGTV)

6.16. AGSR
Title: Optimized Extreme Lightweight Super Resolution
Members: Ting Fu, Zhengxue Cheng, Shiai Zhu, Dajiang Zhou
Affiliations: Ant Group antgroup.com
6.17. dh isp
Title: Lightweight network for image super-resolution
Members: Ben Shao, Shaolong Zheng
Affiliations: Zhejiang Dahua Technology Co., Ltd.

6.18. Touch Fish
Title: Attention Block for Real-time Super-Resolution
Members: Hongyuan Yu, Weichen Yu, Lin Ge, Jiahua Dong, Yajun Zou, Zhuoyuan Wu, Binnan Han, Xiaolin Zhang, Heng Zhang, Xuanwu Yin, Kunlong Zuo
Affiliations: Multimedia Department, Xiaomi Inc.

6.24. KCML2
Title: Enhanced Lightweight Image Super-resolution (ELIS)
Members: Tu Vo
Affiliations: KC Machine Learning Lab

6.25. YNOT
Title: Super Resolution with Spectral Transform and Wavelet Transform
Members: Youngsun Cho, Nakyung Lee
Affiliations: CJ OliveNetworks AI Research
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[10] Chengpeng Chen, Zichao Guo, Haien Zeng, Pengfei Xiong, and Jian Dong. Repghost: A hardware-efficient ghost module via re-parameterization. arXiv preprint arXiv:2211.06088, 2022. 12
[11] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. arXiv preprint arXiv:2204.04676, 2022. 12
[12] Liangyu Chen, Xin Lu, Jie Zhang, Xiaojie Chu, and Chengpeng Chen. Hinet: Half instance normalization network for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 182–192, June 2021. 12
[13] Xiangyu Chen, Xintao Wang, Jiantao Zhou, and Chao Dong. Activating more pixels in image super-resolution transformer. arXiv preprint arXiv:2205.04437, 2022. 6
[14] Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11030–11039, 2020. 5
[15] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. Advances in Neural Information Processing Systems, 33:4479–4488, 2020. 16
[16] Xiaojie Chu, Liangyu Chen, Chengpeng Chen, and Xin Lu. Improving image restoration by revisiting global information aggregation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VII, pages 53–71. Springer, 2022. 12
[17] Marcos V Conde, Ui-Jin Choi, Maxime Burchi, and Radu Timofte. Swin2SR: Swinv2 transformer for compressed image super-resolution and restoration. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2022. 2
[18] Marcos V Conde, Manuel Kolmet, Tim Seizinger, Thomas E. Bishop, Radu Timofte, et al. Lens-to-lens bokeh effect transformation. NTIRE 2023 challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[19] Marcos V Conde, Eduard Zamfir, Radu Timofte, et al. Efficient deep models for real-time 4k image super-resolution. NTIRE 2023 benchmark and report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[20] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 11065–11074, 2019. 2
[21] Xiaohan Ding, Yuchen Guo, Guiguang Ding, and J. Han. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1911–1920, 2019. 10
[22] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13733–13742, 2021. 3, 6, 11, 12, 14
[23] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In CVPR, 2021. 9
[24] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, European Conference on Computer Vision, pages 184–199, Cham, 2014. Springer International Publishing. 2
[25] Zongcai Du, Ding Liu, Jie Liu, Jie Tang, Gangshan Wu, and Lean Fu. Fast and memory-efficient network towards efficient image super-resolution, 2022. 13
[26] Zongcai Du, Jie Liu, Jie Tang, and Gangshan Wu. Anchor-based plain net for mobile image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2494–2502, 2021. 5
[27] Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. Depgraph: Towards any structural pruning. In The Thirty-Fourth IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 15
[28] William T Freeman, Thouis R Jones, and Egon C Pasztor. Example-based super-resolution. IEEE Computer Graphics and Applications, 22(2):56–65, 2002. 2
[29] Ganzorig Gankhuyag, Jingang Huh, Myeongkyun Kim, Kihwan Yoon, HyeonCheol Moon, Seungho Lee, Jinwoo Jeong, Sungjei Kim, and Yoonsik Choe. Skip-concatenated image super-resolution network for mobile devices. IEEE Access, 2022. 9
[30] Ganzorig Gankhuyag, Kihwan Yoon, Jinman Park, Haeng Seon Son, and Kyoungwon Min. Lightweight real-time image super-resolution network for 4k images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 9
[31] Michaël Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. Hasinoff, and Frédo Durand. Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG), 36:1–12, 2017. 10
[32] Shuhang Gu, Andreas Lugmayr, Martin Danelljan, Manuel Fritsche, Julien Lamour, and Radu Timofte. Div8k: Diverse 8k resolution image dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3512–3516, 2019. 2
[33] Shuhang Gu, Andreas Lugmayr, Martin Danelljan, Manuel Fritsche, Julien Lamour, and Radu Timofte. Div8k: Diverse 8k resolution image dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3512–3516. IEEE, 2019. 5
[34] Jiaming Guo, Xueyi Zou, Yuyi Chen, Yi Liu, Jia Hao, and Youliang Yan. Asconvsr: Fast and lightweight super-resolution network with assembled convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 5, 18, 19, 20
[35] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 1577–1586. Computer Vision Foundation / IEEE, 2020. 7
[36] Zibin He, Tao Dai, Jian Lu, Yong Jiang, and Shu-Tao Xia. Fakd: Feature-affinity based knowledge distillation for efficient image super-resolution. In 2020 IEEE International Conference on Image Processing (ICIP), pages 518–522. IEEE, 2020. 6
[37] Mu Hu, Junyi Feng, Jiashen Hua, Baisheng Lai, Jianqiang Huang, Xiaojin Gong, and Xian-Sheng Hua. Online convolutional re-parameterization. CoRR, abs/2204.00826, 2022. 8
[38] Yaoyu Hu, Wenshan Wang, Huai Yu, Weikun Zhen, and Sebastian Scherer. Orstereo: Occlusion-aware recurrent stereo matching for 4k-resolution images, 2021. 2, 20
[39] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015. 2
[40] Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei Wang. Lightweight image super-resolution with information multi-distillation network. In ACM International Conference on Multimedia, pages 2024–2032, 2019. 2, 16
[41] Andrey Ignatov, Radu Timofte, Maurizio Denna, and Abdel Younes. Real-time quantized image super-resolution on mobile NPUs, Mobile AI 2021 challenge: Report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2525–2534, 2021. 5
[42] Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, Jingang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, et al. Efficient and accurate quantized image super-resolution on mobile NPUs, Mobile AI & AIM 2022 challenge: Report. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pages 92–129. Springer, 2023. 2, 3
[43] Andrey Ignatov, Radu Timofte, Shuai Liu, Chaoyu Feng, Furui Bai, Xiaotao Wang, Lei Lei, Ziyao Yi, Yan Xiang, Zibin Liu, Shaoqing Li, Keming Shi, Dehui Kong, Ke Xu, Minsu Kwon, Yaqi Wu, Jiesi Zheng, Zhihao Fan, Xun Wu, Feng Zhang, Albert No, Minhyeok Cho, Zewen Chen, Xiaze Zhang, Ran Li, Juan Wang, Zhiming Wang, Marcos V. Conde, Ui-Jin Choi, Georgy Perevozchikov, Egor Ershov, Zheng Hui, Mengchuan Dong, Xin Lou, Wei Zhou, Cong Pang, Haina Qin, and Mingxuan Cai. Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report. arXiv e-prints, page arXiv:2211.03885, Nov. 2022. 10, 16
[44] Xiaoyang Kang, Xianhui Lin, Kai Zhang, Zheng Hui, Wangmeng Xiang, Jun-Yan He, Xiaoming Li, Peiran Ren, Xuansong Xie, Radu Timofte, et al. NTIRE 2023 video colorization challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[45] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 14
[46] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 9, 11, 13, 15, 16, 17
[47] Fangyuan Kong, Mingxi Li, Songwei Liu, Ding Liu, Jingwen He, Yang Bai, Fangmin Chen, and Lean Fu. Residual local feature network for efficient super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 766–776, 2022. 2, 12, 15
[48] Fangyuan Kong, Mingxi Li, Songwei Liu, Ding Liu, Jingwen He, Yang Bai, Fangmin Chen, and Lean Fu. Residual local feature network for efficient super-resolution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2022, New Orleans, LA, USA, June 19-20, 2022, pages 765–775. IEEE, 2022. 10
[49] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017. 2
[50] Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28(1):492–505, 2019. 15
[51] Yawei Li, Kai Zhang, Luc Van Gool, Radu Timofte, et al. NTIRE 2022 challenge on efficient super-resolution: Methods and results. In CVPR Workshops, 2022. 2, 10, 17
[52] Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Demandolx, Rakesh Ranjan, Radu Timofte, and Luc Van Gool. LSDIR: a large scale dataset for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 2, 9, 11, 12, 13, 14, 15
[53] Yawei Li, Kai Zhang, Radu Timofte, Luc Van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, et al. NTIRE 2022 challenge on efficient super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1062–1102, 2022. 2, 6
[54] Yawei Li, Yulun Zhang, Luc Van Gool, Radu Timofte, et al. NTIRE 2023 challenge on efficient super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[55] Yawei Li, Yulun Zhang, Luc Van Gool, Radu Timofte, et al. NTIRE 2023 challenge on image denoising: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[56] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021. 2, 10
[57] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017. 2
[58] Zudi Lin, Prateek Garg, Atmadeep Banerjee, Salma Abdel Magid, Deqing Sun, Yulun Zhang, Luc Van Gool, Donglai Wei, and Hanspeter Pfister. Revisiting rcan: Improved training for image super-resolution. arXiv preprint arXiv:2201.11279, 2022. 6
[59] J. Liu, D. Liu, W. Yang, S. Xia, X. Zhang, and Y. Dai. A comprehensive benchmark for single image compression artifacts reduction. In arXiv, 2019. 5
[60] Jie Liu, Jie Tang, and Gangshan Wu. Residual feature distillation network for lightweight image super-resolution. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 41–55. Springer, 2020. 2, 12, 15
[61] Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and Wangmeng Zuo. Multi-level wavelet-cnn for image restoration. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 773–782, 2018. 16
[62] Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, et al. NTIRE 2023 quality assessment of video enhancement challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[63] Zhuoqun Liu, Meiguang Jin, Ying Chen, Huaida Liu, Canqian Yang, and Hongkai Xiong. Mfdnet: Towards real-time image denoising on mobile devices. CoRR, abs/2211.04687, 2022. 8
[64] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017. 9
[65] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20):21811–21838, 2017. 16
[66] Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Heewon Lee, Radu Timofte, and Kyoung Mu Lee. NTIRE 2019 challenge on video restoration and enhancement: Methods and results. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019. 2
[67] Ying Nie, Kai Han, Zhenhua Liu, An Xiao, Yiping Deng, Chunjing Xu, and Yunhe Wang. Ghostsr: Learning ghost features for efficient image super-resolution. CoRR, abs/2101.08525, 2021. 7
[68] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 102–118. Springer, 2016. 5
[69] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, European Conference on Computer Vision (ECCV), volume 9906 of LNCS, pages 102–118. Springer International Publishing, 2016. 15
[70] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 2, 3, 6, 9
[71] Alina Shutova, Egor Ershov, Georgy Perevozchikov, Ivan A Ermakov, Nikola Banic, Radu Timofte, Richard Collins, Maria Efimova, Arseniy Terekhin, et al. NTIRE 2023 challenge on night photography rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[72] Dehua Song, Chang Xu, Xu Jia, Yiyi Chen, Chunjing Xu, and Yunhe Wang. Efficient residual dense block search for image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12007–12014, 2020. 2
[73] Long Sun, Jinshan Pan, and Jinhui Tang. ShuffleMixer: An efficient convnet for image super-resolution. In NeurIPS, 2022. 9
[74] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, D. Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015. 10
[75] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. NTIRE 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 114–125, 2017. 5, 6, 9, 10, 11
[76] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, Lei Zhang, Bee Lim, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017. 2, 15
[77] Radu Timofte, Vincent De Smet, and Luc Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In IEEE International Conference on Computer Vision, pages 1920–1927, 2013. 2
[78] Radu Timofte, Vincent De Smet, and Luc Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In ACCV, 2014. 2
[79] Radu Timofte, Rasmus Rothe, and Luc Van Gool. Seven ways to improve example-based single image super resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1865–1873, 2016. 2
[80] Florin-Alexandru Vasluianu, Tim Seizinger, Radu Timofte, et al. NTIRE 2023 image shadow removal challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[81] Longguang Wang, Xiaoyu Dong, Yingqian Wang, Xinyi Ying, Zaiping Lin, Wei An, and Yulan Guo. Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4917–4926, 2021. 2
[82] Longguang Wang, Yulan Guo, Yingqian Wang, Juncheng Li, Shuhang Gu, Radu Timofte, et al. NTIRE 2023 challenge on stereo image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[83] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1905–1914, 2021. 6
[84] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In European Conference on Computer Vision Workshops, pages 701–710, 2018. 2
[85] Yingqian Wang, Longguang Wang, Zhengyu Liang, Jungang Yang, Radu Timofte, Yulan Guo, et al. NTIRE 2023 challenge on light field image super-resolution: Dataset, methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[86] Lei Xiao, Salah Nouri, Matt Chapman, Alexander Fix, Douglas Lanman, and Anton Kaplanyan. Neural supersampling for real-time rendering. ACM Transactions on Graphics (TOG), 39(4):142–1, 2020. 2
[87] Tianyu Xu, Zhuang Jia, Yijian Zhang, Long Bao, and Heng Sun. Elsr: Extreme low-power super resolution network for mobile devices, 2022. 13
[88] Ren Yang, Radu Timofte, Xin Li, Qi Zhang, Lin Zhang, Fanglong Liu, Dongliang He, Fu Li, He Zheng, Weihang Yuan, et al. AIM 2022 challenge on super-resolution of compressed image and video: Dataset, methods and results. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, pages 174–202. Springer, 2023. 2
[89] Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu Kang, and Jung-Woo Ha. Photorealistic style transfer via wavelet transforms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9036–9045, 2019. 16
[90] Hongliang Yuan, Boyu Zhang, Mingyan Zhu, Ligang Liu, and Jue Wang. High-quality supersampling via mask-reinforced deep learning for real-time rendering. arXiv preprint arXiv:2301.01036, 2023. 2
[91] Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, et al. NTIRE 2023 challenge on hr depth from images of specular and transparent surfaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[92] Eduard Zamfir, Marcos V Conde, and Radu Timofte. Towards real-time 4k image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2, 3, 4, 18, 19, 20
[93] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730, 2010. 2
[94] Kai Zhang, Martin Danelljan, Yawei Li, Radu Timofte, Jie Liu, Jie Tang, Gangshan Wu, Yu Zhu, Xiangyu He, Wenjie Xu, et al. AIM 2020 challenge on efficient super-resolution: Methods and results. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 5–40, 2020. 2
[95] Kaihao Zhang, Dongxu Li, Wenhan Luo, Wenqi Ren, Björn Stenger, Wei Liu, Hongdong Li, and Ming-Hsuan Yang. Benchmarking ultra-high-definition image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14769–14778, 2021. 2
[96] Xindong Zhang, Hui Zeng, and Lei Zhang. Edge-oriented convolution block for real-time super resolution on mobile devices. In Heng Tao Shen, Yueting Zhuang, John R. Smith, Yang Yang, Pablo César, Florian Metze, and Balakrishnan Prabhakaran, editors, MM '21: ACM Multimedia Conference, Virtual Event, China, October 20–24, 2021, pages 4034–4043. ACM, 2021. 8, 16
[97] Xindong Zhang, Huiyu Zeng, and Lei Zhang. Edge-oriented convolution block for real-time super resolution on mobile devices. In Proceedings of the 29th ACM International Conference on Multimedia, 2021. 10
[98] Xindong Zhang, Hui Zeng, and Lei Zhang. Edge-oriented convolution block for real-time super resolution on mobile devices. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4034–4043, 2021. 13
[99] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, pages 286–301, 2018. 2
[100] Yulun Zhang, Kai Zhang, Zheng Chen, Yawei Li, Radu Timofte, et al. NTIRE 2023 challenge on image super-resolution (x4): Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 3
[101] Hengyuan Zhao, Xiangtao Kong, Jingwen He, Yu Qiao, and Chao Dong. Efficient image super-resolution using pixel attention. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 56–72. Springer, 2020. 2