
The Tenth NTIRE 2025 Image Denoising Challenge Report

Lei Sun∗ Hang Guo∗ Bin Ren∗ Luc Van Gool∗ Radu Timofte∗ Yawei Li∗
Xiangyu Kong Hyunhee Park Xiaoxuan Yu Suejin Han Hakjae Jeon Jia Li
Hyung-Ju Chun Donghun Ryou Inju Ha Bohyung Han Jingyu Ma
Zhijuan Huang Huiyuan Fu Hongyuan Yu Boqi Zhang Jiawei Shi Heng Zhang
Huadong Ma Deepak Kumar Tyagi Aman Kukretti Gajender Sharma
Sriharsha Koundinya Asim Manna Jun Cheng Shan Tan Jun Liu Jiangwei Hao
Jianping Luo Jie Lu Satya Narayan Tazi Arnim Gautam Aditi Pawar
Aishwarya Joshi Akshay Dudhane Praful Hambadre Sachin Chaudhary
Santosh Kumar Vipparthi Subrahmanyam Murala Jiachen Tu Nikhil Akalwadi
Vijayalaxmi Ashok Aralikatti Dheeraj Damodar Hegde G Gyaneshwar Rao Jatin Kalal
Chaitra Desai Ramesh Ashok Tabib Uma Mudenagudi Zhenyuan Lin Yubo Dong
Weikun Li Anqi Li Ang Gao Weijun Yuan Zhan Li Ruting Deng
Yihang Chen Yifan Deng Zhanglu Chen Boyang Yao Shuling Zheng
Feng Zhang Zhiheng Fu Anas M. Ali Bilel Benjdira Wadii Boulila JanSeny
Pei Zhou Jianhua Hu K. L. Eddie Law Jaeho Lee M. J. Aashik Rasool
Abdur Rehman SMA Sharif Seongwan Kim Alexandru Brateanu Raul Balmez
Ciprian Orhei Cosmin Ancuti Zeyu Xiao Zhuoyuan Li Ziqi Wang Yanyan Wei
Fei Wang Kun Li Shengeng Tang Yunkai Zhang Weirun Zhou Haoxuan Lu

Abstract

This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.

∗ L. Sun ([email protected], INSAIT, Sofia University "St. Kliment Ohridski"), H. Guo, B. Ren ([email protected], University of Pisa & University of Trento, Italy), L. Van Gool, R. Timofte, and Y. Li were the challenge organizers, while the other authors participated in the challenge. Appendix A contains the authors' teams and affiliations.
NTIRE 2025 webpage: https://cvlai.net/ntire/2025/.
Code: https://github.com/AHupuJR/NTIRE2025_Dn50_challenge.

1. Introduction

Image denoising is a fundamental problem in low-level vision, where the objective is to reconstruct a noise-free image from its degraded counterpart. During image acquisition and processing, various types of noise can be introduced, such as Gaussian noise, Poisson noise, and compression artifacts from formats like JPEG. The presence of these noise sources makes denoising a particularly challenging task. Given the importance of image denoising in applications such as computational photography, medical imaging, and remote sensing, continuous research efforts are necessary to develop more efficient and generalizable denoising solutions [20, 61].

To further advance research in this area, this challenge aims to promote the development of denoising methods. A widely used benchmark for fair performance evaluation is the additive white Gaussian noise (AWGN) model, which serves as the standard setting in this competition.

As part of the New Trends in Image Restoration and Enhancement (NTIRE) 2025 workshop, we organized the Image Denoising Challenge. The objective is to restore clean images from inputs corrupted by AWGN with a noise level of σ = 50. This competition seeks to foster innovative
solutions, establish performance benchmarks, and explore emerging trends in the design of image denoising networks; we hope the methods in this challenge will shed light on image denoising.
This challenge is one of the NTIRE 2025 Workshop associated challenges on: ambient lighting normalization [54], reflection removal in the wild [57], shadow removal [53], event-based image deblurring [48], image denoising [49], XGC quality assessment [37], UGC video enhancement [45], night photography rendering [18], image super-resolution (x4) [12], real-world face restoration [13], efficient super-resolution [44], HR depth estimation [58], efficient burst HDR and restoration [27], cross-domain few-shot object detection [19], short-form UGC video quality assessment and enhancement [29, 30], text to image generation model quality assessment [22], day and night raindrop removal for dual-focused images [28], video quality assessment for video conferencing [23], low light image enhancement [38], light field super-resolution [56], restore any image model (RAIM) in the wild [34], raw restoration and super-resolution [16] and raw reconstruction from RGB on smartphones [17].

2. NTIRE 2025 Image Denoising Challenge

The objectives of this challenge are threefold: (1) to stimulate advancements in image denoising research, (2) to enable a fair and comprehensive comparison of different denoising techniques, and (3) to create a collaborative environment where academic and industry professionals can exchange ideas and explore potential partnerships.
In the following sections, we provide a detailed overview of the challenge, including its dataset, evaluation criteria, challenge results, and the methodologies employed by participating teams. By establishing a standardized benchmark, this challenge aims to push the boundaries of current denoising approaches and foster innovation in the field.

2.1. Dataset

The widely used DIV2K [2] dataset and LSDIR [31] dataset are utilized for the challenge.
The DIV2K dataset comprises 1,000 diverse RGB images at 2K resolution, partitioned into 800 images for training, 100 images for validation, and 100 images for testing.
The LSDIR dataset consists of 86,991 high-resolution, high-quality images, with 84,991 images allocated for training, 1,000 images for validation, and 1,000 images for testing.
Participants were provided with training images from both the DIV2K and LSDIR datasets. During the validation phase, the 100 images from the DIV2K validation set were made accessible to them. In the test phase, evaluation was conducted using 100 images from the DIV2K test set and an additional 100 images from the LSDIR test set. To ensure a fair assessment, the ground-truth noise-free images for the test phase remained hidden from participants throughout the challenge.

2.2. Tracks and Competition

The goal is to develop a network architecture that can generate high-quality denoising results, with performance evaluated based on the peak signal-to-noise ratio (PSNR) metric.

Challenge phases (1) Development and validation phase: Participants were provided with 800 clean training images and 100 clean/noisy image pairs from the DIV2K dataset, along with an additional 84,991 clean images from the LSDIR dataset. During the training process, noisy images were generated by adding Gaussian noise with a noise level of σ = 50. Participants had the opportunity to upload their denoising results to the CodaLab evaluation server, where the PSNR of the denoised images was computed, offering immediate feedback on their model's performance. (2) Testing phase: In the final test phase, participants were given access to 100 noisy test images from the DIV2K dataset and 100 noisy test images from the LSDIR dataset, while the corresponding clean ground-truth images remained concealed. Participants were required to submit their denoised images to the CodaLab evaluation server and send their code and factsheet to the organizers. The organizers then verified the submitted code and ran it to compute the final results, which were shared with participants at the conclusion of the challenge.

Evaluation protocol The primary objective of this challenge is to promote the development of accurate image denoising networks. Hence, PSNR and SSIM metrics are used for quantitative evaluation, based on the 200 test images. A code example for calculating these metrics can be found at https://github.com/AHupuJR/NTIRE2025_Dn50_challenge. Additionally, the code for the submitted solutions, along with the pre-trained weights, is also provided in this repository. Note that computational complexity and model size are not factored into the final ranking of the participants.

3. Challenge Results

Table 1 presents the final rankings and results of the participating teams. Detailed descriptions of each team's implementation are provided in Sec. 4, while team member information can be found in Appendix A. SRC-B secured first place in terms of PSNR, achieving a 1.25 dB advantage over the second-best entry. SNUCV and BuptMM ranked second and third, respectively.
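For reference, the degradation model and metrics described above can be reproduced with a few lines of Python. The sketch below is only an illustration; the official scoring script in the challenge repository governs details such as intensity clipping, seeding, and data handling.

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def add_awgn(clean, sigma=50.0, seed=0):
    # Add i.i.d. Gaussian noise with standard deviation sigma (in 8-bit intensity units).
    rng = np.random.default_rng(seed)
    return clean.astype(np.float32) + rng.normal(0.0, sigma, clean.shape)

def evaluate_pair(clean_uint8, denoised_uint8):
    # PSNR and SSIM on 8-bit RGB images; challenge scores average these over the 200 test images.
    psnr = peak_signal_noise_ratio(clean_uint8, denoised_uint8, data_range=255)
    ssim = structural_similarity(clean_uint8, denoised_uint8, channel_axis=-1, data_range=255)
    return psnr, ssim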
Team              Rank   PSNR (primary)   SSIM
SRC-B              1        31.20         0.8884
SNUCV              2        29.95         0.8676
BuptMM             3        29.89         0.8664
HMiDenoise         4        29.84         0.8653
Pixel Purifiers    5        29.83         0.8652
Alwaysu            6        29.80         0.8642
Tcler Denoising    7        29.78         0.8632
cipher vision      8        29.64         0.8601
Sky-D              9        29.61         0.8602
KLETech-CEVI      10        29.60         0.8602
xd denoise        11        29.58         0.8597
JNU620            12        29.55         0.8590
PSU team          12        29.55         0.8598
Aurora            14        29.51         0.8605
mpu ai            15        29.30         0.8499
OptDenoiser       16        28.95         0.8422
AKDT              17        28.83         0.8374
X-L               18        26.85         0.7836
Whitehairbin      19        26.83         0.8010
mygo              20        24.92         0.6972

Table 1. Results of the NTIRE 2025 Image Denoising Challenge. PSNR and SSIM scores are measured on the 200 test images from the DIV2K test set and the LSDIR test set. Team rankings are based primarily on PSNR.

3.1. Participants

This year, the challenge attracted 290 registered participants, with 20 teams successfully submitting valid results. Compared to the previous challenge [32], the SRC-B team's solution outperformed the top-ranked method from 2023 by 1.24 dB. Notably, the results achieved by the top six teams this year surpassed those of their counterparts from the previous edition, establishing a new benchmark for image denoising.

3.2. Main Ideas and Architectures

During the challenge, participants implemented a range of novel techniques to enhance image denoising performance. Below, we highlight some of the fundamental strategies adopted by the leading teams.
1. Hybrid architectures perform well. All the models from the top-3 teams adopted a hybrid architecture that combines transformer-based and convolution-based networks. Both global features from the transformer and local features from the convolutional network are useful for image denoising. SNUCV further adopted a model ensemble to push the limit.
2. Data is important. This year's winning team, SRC-B, adopted a data selection process to mitigate the influence of data imbalance, and also selected high-quality images in the dataset for training instead of training on the whole DIV2K and LSDIR datasets.
3. The devil is in the details. A wavelet transform loss [25] is utilized by the winning team, which is proven to help the model escape from local optima. Tricks such as a progressive learning strategy also work well. A higher percentage of overlap between patches during inference also leads to higher PSNR. Ensemble techniques effectively improve the performance.
4. New Mamba-based design. SNUCV, the second-ranking team, leveraged the MambaIRv2 framework to design a hybrid architecture, combining the efficient sequence modeling capabilities of Mamba with image restoration objectives.
5. Self-ensemble or model ensembling is adopted by some of the teams to improve performance.

3.3. Fairness

To uphold the fairness of the image denoising challenge, several rules were established, primarily regarding the datasets used for training. First, participants were allowed to use additional external datasets, such as Flickr2K, for training. However, training on the DIV2K validation set, including either high-resolution (HR) or low-resolution (LR) images, was strictly prohibited, as this set was designated for evaluating the generalization ability of the models. Similarly, training with the LR images from the DIV2K test set was not permitted. Lastly, employing advanced data augmentation techniques during training was considered acceptable and within the scope of fair competition.

4. Challenge Methods and Teams

4.1. Samsung MX (Mobile eXperience) Business & Samsung R&D Institute China - Beijing (SRC-B)

4.1.1. Model Framework

The proposed solution is shown in Figure 1. In recent years, the Transformer structure has shown excellent performance in image denoising tasks due to its advantages in capturing global context. However, it is found that pure Transformer architectures are relatively weak in recovering local features and details. On the other hand, CNN-based methods excel in detail recovery but struggle to effectively capture global context information. Therefore, they designed a network that combines the strengths of the transformer network Restormer [59] and the convolutional network NAFNet [10]. They first extract global features using the Transformer network and then enhance detail information using the convolutional network. The denoising network's structure follows Restormer, while the detail enhancement network draws inspiration from NAFNet. Finally, they dynamically fuse the
two features from the transformer network and the convolutional network through a set of learnable parameters to balance denoising and detail preservation, thereby improving the overall performance of image denoising.

Figure 1. Framework of the hybrid network proposed by Team SRC-B.

4.1.2. Dataset and Training Strategy

Dataset. Three datasets are used in total: the DIV2K dataset, the LSDIR dataset, and a self-collected custom dataset consisting of 2 million images. The specific ways in which these training sets are utilized across the different training phases are detailed in the training description below. In the final fine-tuning phase, they construct a high-quality dataset consisting of 1,000 images from LSDIR, 1,000 images from the custom dataset, and all 800 images from DIV2K. The data selection process includes:
• Image resolution: keep only images with a resolution greater than 900x900.
• Image quality: keep only images that rank in the top 30% for all three metrics: Laplacian variance, BRISQUE, and NIQE.
• Semantic selection: to achieve semantic balance, they conducted a semantic selection based on CLIP [43] features to ensure that the dataset reflects diverse and representative content across various scene categories.

Training. The model training consists of three stages. In the first stage, they pre-train the entire network using a custom dataset of 2 million images, with an initial learning rate of 1e−4 and a training time of approximately 360 hours. In the second stage, they fine-tune the detail enhancement network module using the DIV2K and LSDIR datasets, with an initial learning rate of 1e−5 and a training duration of about 240 hours, which enhanced the model's ability to restore details. In the third stage, they select 1,000 images from the custom dataset, 1,000 images from the LSDIR data, and 800 images from DIV2K as the training set. With an initial learning rate of 1e−6, they fine-tuned the entire network for approximately 120 hours.
The model is trained by alternately iterating the L1 loss, the L2 loss, and the Stationary Wavelet Transform (SWT) loss [25]. They found that adding the SWT loss during training helps the model escape from local optima. They also perform progressive learning, where the network is trained on image patch sizes gradually enlarged from 256 to 448 and 768. As the patch size increases, the performance gradually improves. The model was trained on an A100 80G GPU.
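The learnable fusion of the two branches described above admits a very simple realization; the sketch below uses per-channel weights and illustrative names, and is not the SRC-B implementation.

import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    # Weighted fusion of global (transformer) and local (CNN) features with learnable per-channel weights.
    def __init__(self, channels):
        super().__init__()
        self.w_global = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))
        self.w_local = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))

    def forward(self, feat_global, feat_local):
        return self.w_global * feat_global + self.w_local * feat_local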

4.2. SNUCV

Method. As shown in Figure 2, the network architecture they utilized consists of MambaIRv2 [21], Xformer [60], and Restormer [59]. These networks were first trained on Gaussian noise with a standard deviation of 50. Subsequently, the outputs of these networks are concatenated with the noisy image, which is then used as input to the ensemble model. In addition to the outputs, the features from the deepest layers of these networks are also concatenated and integrated into the deepest layer features of the ensemble network. This approach ensures that the feature information from the previous networks is preserved and effectively transferred to the ensemble network without loss. The ensemble model is designed based on Xformer, accepting an input with 12 channels. Its deepest layer is structured to incorporate the concatenated features of the previous models. These concatenated features are then processed through a 1×1 convolution to reduce the channel dimension back to that of the original network, thus alleviating the computational burden. Additionally, while Xformer and Restormer reduce the feature size in their deep layers, MambaIRv2 retains its original feature size without reduction. To align the sizes for concatenation, the features of MambaIRv2 were downscaled by a factor of 8 before being concatenated.

Training details. They first train the denoising networks, and then incorporate the frozen denoising networks to train the ensemble model.
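The 12-channel ensemble input mentioned above is simply a channel-wise concatenation of the noisy image with the three denoiser outputs; a minimal sketch with placeholder tensors follows.

import torch

def build_ensemble_input(noisy, out_mambairv2, out_xformer, out_restormer):
    # Concatenate the noisy RGB image with the three RGB denoiser outputs: 4 x 3 = 12 channels.
    return torch.cat([noisy, out_mambairv2, out_xformer, out_restormer], dim=1)

noisy = torch.rand(1, 3, 64, 64)
outputs = [torch.rand(1, 3, 64, 64) for _ in range(3)]
assert build_ensemble_input(noisy, *outputs).shape[1] == 12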
Figure 2. The overview of the deep ensemble pipeline proposed by Team SNUCV.

Both the denoising models and the ensemble model were trained exclusively using the DIV2K [2] and LSDIR [31] datasets. Training was performed using the AdamW [39] optimizer with hyperparameters β1 = 0.9 and β2 = 0.999, and a learning rate of 3 × 10−4. All models were trained for a total of 300,000 iterations. For the denoising models, Restormer and Xformer were trained using a progressive training strategy to enhance robustness and efficiency. Patch sizes were progressively increased as [128, 160, 192, 256, 320, 384], with corresponding batch sizes of [8, 5, 4, 2, 1, 1]. In contrast, MambaIRv2 was trained with a more constrained setup due to GPU memory limitations, utilizing patch sizes of [128, 160] and batch sizes of [2, 1]. The ensemble model was trained with a progressive patch size schedule of [160, 192, 256, 320, 384, 448] and corresponding batch sizes of [8, 5, 4, 2, 1, 1]. The denoising models were trained using the L1 loss, while the ensemble model was trained using a combination of L1 loss, MSE loss, and a high-frequency loss.

Inference details. During the final inference stage used to derive the test results, they utilized a self-ensemble technique. Furthermore, inference was conducted using a patch-based sliding-window approach. Patch sizes were set at [256, 384, 512], with corresponding overlap values of [48, 64, 96]. The resulting outputs were subsequently averaged to optimize performance. This self-ensemble approach, while significantly increasing computational cost, substantially enhances performance.

4.3. BuptMM

Description. In recent years, the Transformer architecture has been widely used in image denoising tasks. In order to further explore the superiority of two representative networks, Restormer [59] and HAT [11], they propose a dual-network & post-processing denoising model that combines the advantages of the former's global attention mechanism and the latter's channel attention mechanism.
As shown in Fig. 3, their network is divided into two stages. In the first stage, they use the DIV2K [2] and LSDIR [31] training sets to train Restormer [59] and HAT [11] respectively, and then enhance the ability of Restormer [59] through the TLC [36] technique during its inference stage. In the second stage, they first use the Canny operator to perform edge detection on the images processed by the two models. They take an OR operation on the two edge images, and then XOR the result with the edge of HAT to obtain the edge difference between the two images. For this part of the edge difference, they use the result obtained by HAT [11] as the standard for preservation. Finally, they take the average of the other pixels of HAT [11] and Restormer [59] to obtain the final result.

Figure 3. The model architecture of DDU proposed by Team BuptMM.

They used the DIV2K [2] and LSDIR [31] datasets to train both Restormer [59] and HAT [11] simultaneously. They employed a progressive training strategy for Restormer [59] with a total of 292,000 iterations, where the image block size increased from 128 to 384 with a step size of 64. They also used a progressive training strategy for HAT [11], where the image block size increased from 64 to 224. They did not use any other datasets besides those mentioned above. During the training phase, they spent one day separately training Restormer [59] and HAT [11]; the two models were trained using 8 NVIDIA H100 GPUs. They conducted inference on an H20 GPU, with a memory usage of 15 GB. The average inference time for a single image from the 200 test images was 4.4 seconds, while the average time for morphological post-processing was within 1 second.

4.4. HMiDenoise

The network is inspired by the HAT [11] model architecture, and the architecture is optimized specifically for the task. The optimized denoising network structure (D-HAT) is shown in Fig. 4.
The dataset utilized for training comprises DIV2K and LSDIR. To accelerate training and achieve good performance, they initially train on a small scale (64x64) with

batch size 16, then on a medium scale (128x128) with batch size 1, and finally optimize on a larger scale (224x224) with batch size 1. As the patch size increases, the performance gradually improves. The learning rate is initialized at 4 × 10−4 and decays according to a cosine annealing strategy during training. The network undergoes training for a total of 2×105 iterations, with the L2 loss function being minimized through the Adam optimizer. Subsequently, fine-tuning is executed using the L2 loss and SSIM loss functions, with an initial learning rate of 5 × 10−5 for 2×105 iterations. They repeated the aforementioned fine-tuning settings two times after loading the trained weights. All experiments are conducted with the PyTorch 2.0 framework on 8 H100 GPUs.

Figure 4. Model architecture of DB-HAT proposed by Team HMiDenoise.

4.5. Pixel Purifiers

Architecture. The Restormer architecture [59], as shown in Fig. 5(a), is an efficient transformer that uses the multi-Dconv head transposed attention (MDTA) block for channel attention and the gated Dconv feed-forward network (GDFN) for the feed-forward network. The MDTA block applies self-attention across channels rather than the spatial dimension to compute cross-covariance across channels, generating an attention map that encodes the global context implicitly. Additionally, depth-wise convolutions are used to emphasize the local context before computing the feature covariance to produce the global attention map. The GDFN block introduces a novel gating mechanism and depth-wise convolutions to encode information from spatially neighboring pixel positions, useful for learning local image structure for effective restoration.

Training Techniques. They have conducted extensive experiments to evaluate the effectiveness of their approach (as shown in Fig. 5(b)). The network is trained using the DIV2K and LSDIR datasets only, with an L1 loss function. To enhance generalization and mitigate overfitting, they apply randomized data augmentation during training, including horizontal flipping, vertical flipping, and rotations of 90°, 180°, and 270°. A fixed patch size of 256×256 is maintained for both training and inference to preserve global context. For optimization, they used the AdamW optimizer in conjunction with the CosineAnnealingRestartCyclicLR scheduler, with an initial learning rate of 1 × 10−4. Training is done using 8 NVIDIA Tesla V100 GPUs. Additionally, they leveraged hard dataset mining for model fine-tuning, specifically targeting training patches where the loss exceeded a predefined threshold. This technique, discussed in detail below, further enhanced the performance of their baseline model.

Figure 5. Block diagram for image denoising using the Restormer architecture along with hard data mining and ensemble techniques (Team Pixel Purifiers).

Hard Dataset Mining. To further enhance PSNR, they employed a hard dataset mining technique inspired by [3] for fine-tuning. Specifically, training patches with a loss value exceeding a predefined threshold are selected for transfer learning on their base trained model. To preserve the model's generalization while refining its performance on challenging samples, they applied a learning rate that was 100 times smaller than the initial training rate.
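As a rough sketch of the hard dataset mining step described above, the snippet below keeps only the training patches whose current loss exceeds a threshold; the threshold and data handling are placeholders rather than the team's settings.

import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_hard_patches(model, patches, targets, threshold):
    # Per-patch L1 loss; patches above the threshold are kept for low-learning-rate fine-tuning.
    preds = model(patches)
    per_patch = F.l1_loss(preds, targets, reduction="none").mean(dim=(1, 2, 3))
    keep = per_patch > threshold
    return patches[keep], targets[keep]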

DIV2K and LSDIR Datasets Ratio. As the model is to be trained and tested on two datasets (DIV2K and LSDIR), they first analysed their characteristics. DIV2K is a relatively small and generalised dataset with 800 training images, while LSDIR is a significantly larger dataset with 84k+ training images, primarily consisting of high-texture images. Considering the dataset characteristics and their dataset ratio experiments, they found that a DIV2K to LSDIR ratio of 12:88 during training helps to improve overall PSNR and generalise the model better for both the validation and test datasets.

Overlapping Percentage During Inference. Using a small overlap of 5% during inference with a patch size of 256×256 (the same as the training patch size, to preserve global context) resulted in improved inference speed. However, despite applying boundary pixel averaging, minor stitching artifacts were observed, leading to a decline in PSNR performance. To mitigate these artifacts, they increased the overlap to 20% with the original 256×256 patch size, which resulted in a PSNR improvement.

Ensemble Technique at Inference. Ensemble techniques played a crucial role by effectively boosting performance. They used the self-ensemble strategy, specifically test-time augmentation [35], where multiple flips and rotations of the image are applied before model inference. The model outputs are averaged to generate the final output image.

4.6. Alwaysu

Method: Their objective is to achieve efficient Gaussian denoising based on pre-trained denoisers. The core idea, termed Bias-Tuning and initially proposed in transfer learning [8], is to freeze pre-trained denoisers and only fine-tune existing or newly added bias parameters during adaptation, thus maintaining the knowledge of pre-trained models and reducing tuning cost.
They choose the Restormer [59] model trained to remove the same i.i.d. Gaussian noise (σ = 50) without intensity clipping as their baseline. As this pre-trained Restormer did not clip the noisy images' intensities into the normal range, i.e., [0, 255], it performs poorly on clipped noisy images, resulting in low PSNR/SSIM (27.47/0.79 on DIV2K validation) and clear artifacts. After embedding learnable bias parameters into this frozen Restormer (except LayerNorm modules) and fine-tuning the model, satisfactory denoising results can be obtained, and the resulting PSNR increases by over 3 dB (evaluated on the DIV2K validation set). They found that various pre-trained Gaussian denoisers from [59], including three noise-specific models and one noise-blind model, resulted in similar denoising performance on clipped noisy images after Bias-Tuning.
During inference, they further enhance the denoiser via self-ensemble [35] and patch stitching. When dealing with high-resolution (HR) noisy images, they process them via overlapping patches with the same patch size as in the training phase. They stitch these overlapping denoised patches via linear blending, as introduced in image stitching [7].
Training details: They fine-tune this bias-version Restormer using the PSNR loss function and the AdamW optimizer combined with batch size 2, patch size 256 × 256, learning rate 3e−4 (cosine annealed to 1e−6), 200k iterations, and geometric augmentation. The training dataset consists of 800 images from the DIV2K training set and 1,000 images from the LSDIR training set. They also note that the pre-trained Restormer utilized a combined set of 800 images from DIV2K, 2,650 images from Flickr2K, 400 BSD500 images, and 4,744 images from WED.
Inference details: The patch size and overlapping size during patch stitching are 256 × 256 and 16, respectively.
Complexity: Total number of parameters: 26.25M; total number of learnable bias parameters: 0.014M; FLOPs: 140.99G (evaluated on an image of shape 256 × 256 × 3).
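A minimal sketch of the Bias-Tuning setup described above: freeze all weights and leave only existing bias terms trainable, skipping LayerNorm modules. Newly added bias parameters and the Restormer-specific wiring are omitted here.

import torch.nn as nn

def bias_tuning_parameters(model):
    # Freeze everything, then re-enable gradients only for bias parameters outside LayerNorm modules.
    for p in model.parameters():
        p.requires_grad = False
    trainable = []
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            continue
        bias = getattr(module, "bias", None)
        if isinstance(bias, nn.Parameter):
            bias.requires_grad = True
            trainable.append(bias)
    return trainable

# Usage sketch: optimizer = torch.optim.AdamW(bias_tuning_parameters(restormer), lr=3e-4)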

4.7. Tcler Denoising


Building upon the work of Potlapalli et al. [42], they propose a novel transformer-based architecture for image restoration, termed PromptIR-Dn50. This architecture adopts a U-shaped encoder-decoder network structure, incorporating progressive downsampling and upsampling operations. Specifically tailored for denoising tasks under additive white Gaussian noise (AWGN) with a noise level of σ = 50, PromptIR-Dn50 leverages the strengths of the PromptGenBlock with targeted modifications. In this framework, the PromptGenBlock is adapted by explicitly incorporating σ = 50 as an input parameter, ensuring the model is optimized for the specific noise level and achieves superior performance in denoising tasks.
Inspired by the advancements in MambaIRv2 [21], they further introduce a specialized variant, MambaIRv2-Dn50, designed for image restoration tasks. This architecture also adopts a U-shaped encoder-decoder structure but integrates two key innovations: the Attentive State-space Equation (ASE) and Semantic Guided Neighboring (SGN) modules. These components address the causal scanning limitations inherent in traditional Mamba frameworks while maintaining linear computational complexity. Unlike prior approaches that rely on multi-directional scanning, MambaIRv2-Dn50 achieves non-causal global perception through single-sequence processing, making it highly efficient and well-suited for vision tasks.
To further enhance the performance of image restoration, they propose a fusion strategy that combines the strengths of PromptIR-Dn50 and MambaIRv2-Dn50. By integrating the outputs of these two architectures, the fused model leverages the noise-specific optimization of PromptIR-Dn50 and the global perception capabilities of MambaIRv2-Dn50. This hybrid approach ensures robust and high-quality restoration results, effectively addressing the challenges posed by σ = 50 AWGN noise.
The architecture follows a progressive training strategy as in Restormer [59], where input resolutions gradually increase from 64×64 to 112×112. This progressive learning scheme enhances feature adaptation across scales without compromising training stability.
For optimization, they employ the Adam optimizer with an initial learning rate of 1e-4, combined with a CosineAnnealingRestartCyclicLR schedule to adjust the learning rate dynamically during training. The model is trained using a combination of Charbonnier loss and gradient-weighted L1 loss, which effectively balances pixel-wise accuracy and edge preservation. The weights for those two losses are 0.8 and 0.2, respectively. They use the DIV2K [2] and LSDIR [31] datasets exclusively during the training phase, where horizontal and vertical flipping, rotation, and USM sharpening [55] are used to augment the input images of the model.
During the testing phase, the input size is fixed at 112×112, and self-ensemble techniques [50] are applied to further enhance the model's performance. This approach ensures robust denoising results and improved generalization to unseen data.
In summary, MambaIRv2-Dn50 introduces a tailored state-space model-based architecture for denoising tasks, leveraging progressive learning, advanced loss functions, and self-ensemble techniques to achieve state-of-the-art performance on σ = 50 AWGN noise.

4.8. cipher vision

Figure 6. Proposed Pureformer encoder-decoder architecture for image denoising proposed by Team cipher vision. The input noisy image is processed through a multi-level encoder, a feature enhancer block, and a multi-level decoder. Each encoder and decoder level employs xN transformer blocks [62], consisting of Multi-Dconv Head Transposed Attention (MDTA) and Gated-Dconv Feed-Forward Network (GDFN) blocks. The feature enhancer block, placed in the latent space, expands the receptive field using a spatial filter bank. The multi-scale features are then concatenated and refined through xN transformer blocks to enhance feature correlation and merge multi-scale information effectively.

As shown in Figure 6, they employ a Transformer-based encoder-decoder architecture featuring a four-level encoder-decoder structure designed to restore images degraded by Gaussian noise (σ = 50). This architecture is optimized to capture both local and global features, significantly enhancing the quality of input images. The hierarchical structure of the model includes four levels, containing
[4, 6, 6, 8] Transformer blocks, respectively. Each Transformer block includes Multi-Dconv Head Transposed Attention (MDTA) followed by a Gated-Dconv Feed-Forward Network (GDFN), enabling the model to capture long-range feature dependencies effectively. Additionally, skip connections are utilized to link the encoder and decoder, preserving spatial details and ensuring efficient feature reuse throughout the network. The feature enhancer block in the latent space processes latent features through the filter bank, and the extracted multi-scale features are concatenated and passed through the transformer blocks, as shown in Figure 6.

Training Details. Their training strategy uses the DIV2K (1,000 images) and LSDIR (86,991 images) datasets. They leverage small patch-based training and data augmentation techniques to optimize the Pureformer. The training process uses the AdamW optimizer (β1 = 0.9, β2 = 0.999) with a learning schedule that includes a linear warmup for 15 epochs followed by cosine annealing. The batch size is set to 4, consisting of 4 × 3 × 128 × 128 patches, and training is conducted on 2×A100 GPUs. Data augmentation techniques such as random cropping, flips, 90° rotations, and mixup are applied. They use the L1 loss to optimize the parameters.

Testing Strategy. For inference, they use the DIV2K (100 images) and LSDIR (100 images) test sets. Testing is performed using 512 × 512 patches. To enhance robustness, they employ self-ensemble testing with rotational transformations. The input image is rotated by 0°, 90°, 180°, and 270°, processed through the trained model, and rotated back to its original orientation. The final prediction is obtained by averaging the outputs of all four rotations.
4.9. A Two-Stage Denoising Framework with Generalized Denoising Score Matching Pretraining and Supervised Fine-tuning (Sky-D)

Problem Formulation. In natural image denoising, we aim to recover a clean image X_0 ∈ R^d from its noisy observation X_{t_data} ∈ R^d. The noisy observation can be modeled as:

\mathbf{X}_{t_{\text{data}}} = \mathbf{X}_0 + \sigma_{t_{\text{data}}} \mathbf{N}, (1)

where σ_{t_data} > 0 denotes the noise standard deviation at level t_data, and N ∼ N(0, I_d) represents the noise component.
Our approach consists of two stages: (1) self-supervised pretraining using Generalized Denoising Score Matching (GDSM) and (2) supervised fine-tuning. This two-stage approach enables us to leverage both noisy data and clean labels effectively.

4.9.1. Self-Supervised Pretraining with Generalized Denoising Score Matching

For the pretraining stage, we adopt the Generalized Denoising Score Matching (GDSM) framework introduced in Corruption2Self (C2S) [51]. This approach enables effective learning directly from noisy observations without requiring clean labels.

Forward Corruption Process. Following [51], we define a forward corruption process that systematically adds additional Gaussian noise to X_{t_data}:

\begin{aligned} \mathbf{X}_t &= \mathbf{X}_{t_{\text{data}}} + \sqrt{\sigma_t^2 - \sigma_{t_{\text{data}}}^2} \, \mathbf{Z}, \\ \mathbf{Z} &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d), \quad t > t_{\text{data}}, \end{aligned} (2)

where σ_t is a monotonically increasing noise schedule function for t ∈ (t_data, T], with T being the maximum noise level.

Generalized Denoising Score Matching Loss. The GDSM loss function [51] is formulated as:

J(\theta) = \mathbb{E}_{\mathbf{X}_{t_{\text{data}}}, t, \mathbf{X}_t} \Big[ \big\| \gamma(t, \sigma_{t_{\text{target}}}) \, \mathbf{h}_\theta(\mathbf{X}_t, t) + \delta(t, \sigma_{t_{\text{target}}}) \, \mathbf{X}_t - \mathbf{X}_{t_{\text{data}}} \big\|^2 \Big], (3)

where t is sampled uniformly from (t_data, T] and the coefficients are defined by:

\begin{aligned} \gamma(t, \sigma_{t_{\text{target}}}) &:= \frac{\sigma_t^2 - \sigma_{t_{\text{data}}}^2}{\sigma_t^2 - \sigma_{t_{\text{target}}}^2}, \\ \delta(t, \sigma_{t_{\text{target}}}) &:= \frac{\sigma_{t_{\text{data}}}^2 - \sigma_{t_{\text{target}}}^2}{\sigma_t^2 - \sigma_{t_{\text{target}}}^2}. \end{aligned} (4)

The parameter σ_{t_target} controls the target noise level, with σ_{t_target} = 0 representing maximum denoising (complete noise removal).

Reparameterization for Improved Training Stability. To enhance training stability and improve convergence, we employ the reparameterization strategy proposed in [51]. Let τ ∈ (0, T′] be a new variable defined by:

\begin{aligned} \sigma_\tau^2 &= \sigma_t^2 - \sigma_{t_{\text{data}}}^2, \\ T' &= \sqrt{\sigma_T^2 - \sigma_{t_{\text{data}}}^2}. \end{aligned} (5)

The original t can be recovered via:

t = \sigma_t^{-1}\left( \sqrt{ \sigma_\tau^2 + \sigma_{t_{\text{data}}}^2 } \right). (6)

Under this reparameterization, the loss function becomes:

J'(\theta) = \mathbb{E}_{\mathbf{X}_{t_{\text{data}}}, \tau, \mathbf{X}_t} \Big[ \big\| \gamma'(\tau, \sigma_{t_{\text{target}}}) \, \mathbf{h}_\theta(\mathbf{X}_t, t) + \delta'(\tau, \sigma_{t_{\text{target}}}) \, \mathbf{X}_t - \mathbf{X}_{t_{\text{data}}} \big\|^2 \Big], (7)
where the coefficients are:

\begin{aligned} \gamma'(\tau, \sigma_{t_{\text{target}}}) &= \frac{\sigma_{\tau}^2}{\sigma_{\tau}^2 + \sigma_{t_{\text{data}}}^2 - \sigma_{t_{\text{target}}}^2}, \\ \delta'(\tau, \sigma_{t_{\text{target}}}) &= \frac{\sigma_{t_{\text{data}}}^2 - \sigma_{t_{\text{target}}}^2}{\sigma_{\tau}^2 + \sigma_{t_{\text{data}}}^2 - \sigma_{t_{\text{target}}}^2}. \end{aligned} (8)

This reparameterization ensures uniform sampling over τ and consistent coverage of the noise level range during training, leading to smoother and faster convergence.

4.9.2. Supervised Fine-tuning

After pretraining with GDSM, we propose to fine-tune the model with a supervised approach. Unlike traditional methods that train from scratch using clean labels, our approach leverages the knowledge gained during pretraining to enhance performance.

Supervised Fine-tuning Loss. Given paired training data {(X^i_{t_data}, Y^i)}^N_{i=1}, where X^i_{t_data} is the noisy observation and Y^i is the corresponding clean target, we formulate the supervised fine-tuning loss as:

\mathcal{L}_{\text{sup}}(\theta) = \frac{1}{N} \sum_{i=1}^N \left\| \mathbf{h}_\theta(\mathbf{X}_{t_{\text{data}}}^i, t_{\text{data}}) - \mathbf{Y}^i \right\|^2. (9)

This formulation directly optimizes the network to map noisy observations to clean targets. By initializing θ with the pretrained weights from the GDSM stage, we enable more effective and stable fine-tuning.

4.9.3. Time-Conditioned Diffusion Model Architecture

Our approach employs the same time-conditioned diffusion model architecture used in [51], which is based on the U-Net architecture enhanced with time conditioning and the Noise Variance Conditioned Multi-Head Self-Attention (NVC-MSA) module. The model's denoising function h_θ : R^d × R → R^d maps a noisy input X_t and noise level t to an estimate of the clean image X_0.
The time conditioning is implemented through an embedding layer that transforms the noise level t into a high-dimensional feature vector, which is then integrated into the convolutional layers via adaptive instance normalization. This enables the model to dynamically adjust its denoising behavior based on the noise level of the input.
The NVC-MSA module extends standard self-attention by conditioning the attention mechanism on the noise variance, allowing the model to adapt its attention patterns based on the noise characteristics of the input. This adaptation enhances the model's ability to denoise effectively across different noise levels and patterns.

Algorithm 1: Two-Stage Training Procedure for GDSM Pretraining and Supervised Fine-tuning
Require: Training data from DIV2K and LSDIR, max noise level T, learning rates α1, α2
Ensure: Trained denoising model h_θ
1: // Phase 1: Self-supervised pretraining with GDSM
2: Initialize network parameters θ randomly
3: repeat
4:   Sample minibatch {X^i_{t_data}}^m_{i=1} from the DIV2K and LSDIR training sets
5:   Sample noise level τ ∼ U(0, T′]
6:   Sample Gaussian noise Z ∼ N(0, I_d)
7:   Compute t = σ_t^{-1}(sqrt(σ_τ² + σ_{t_data}²))
8:   Generate corrupted samples: X_t = X_{t_data} + σ_τ Z
9:   Compute coefficients γ′(τ, σ_{t_target}) and δ′(τ, σ_{t_target})
10:  Compute GDSM loss J′(θ) according to Eq. (7)
11:  Update parameters: θ ← θ − α1 ∇_θ J′(θ)
12: until convergence or maximum iterations reached
13: // Phase 2: Supervised fine-tuning
14: Initialize network parameters θ with the pretrained weights from Phase 1
15: repeat
16:  Sample paired minibatch {(X^i_{t_data}, Y^i)}^m_{i=1} from the DIV2K and LSDIR training sets
17:  Compute supervised loss: L_sup(θ) = (1/m) Σ_i ‖h_θ(X^i_{t_data}, t_data) − Y^i‖²
18:  Update parameters: θ ← θ − α2 ∇_θ L_sup(θ)  {α2 < α1 for stable fine-tuning}
19: until convergence or maximum iterations reached
20: return trained model h_θ

4.9.4. Training Procedure

As outlined in Algorithm 1, our approach combines self-supervised pretraining with supervised fine-tuning to leverage the strengths of both paradigms. The GDSM pretraining phase enables the model to learn robust representations across diverse noise levels without clean labels, establishing a strong initialization for subsequent supervised learning. This knowledge transfer accelerates convergence during fine-tuning and enhances generalization to noise distributions not explicitly covered in the supervised data. The time-conditioned architecture further facilitates this adaptability by dynamically adjusting denoising behavior based on input noise characteristics. To our knowledge, this represents the first application of GDSM as a pretraining strategy for natural image denoising, offering a principled approach to combining self-supervised and supervised learning objectives for this task.
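To make the pretraining objective concrete, the sketch below implements one Phase-1 step of Algorithm 1 under simplifying assumptions: a generic denoiser with signature model(x, t), the identity noise schedule σ_t = t (so Eq. (6) reduces to t = sqrt(σ_τ² + σ_{t_data}²)), and σ_{t_target} = 0. The team's actual schedule and conditioning follow [51].

import torch

def gdsm_pretrain_step(model, x_noisy, sigma_data, T_max=10.0, sigma_target=0.0):
    # One self-supervised GDSM step (Phase 1 of Algorithm 1) for a batch of noisy images.
    b = x_noisy.shape[0]
    T_prime = (T_max ** 2 - sigma_data ** 2) ** 0.5
    sigma_tau = torch.rand(b, 1, 1, 1, device=x_noisy.device) * T_prime     # tau ~ U(0, T']
    x_t = x_noisy + sigma_tau * torch.randn_like(x_noisy)                   # Eq. (2), reparameterized
    t = torch.sqrt(sigma_tau ** 2 + sigma_data ** 2)                        # Eq. (6) with sigma_t = t
    denom = sigma_tau ** 2 + sigma_data ** 2 - sigma_target ** 2
    gamma = sigma_tau ** 2 / denom                                          # Eq. (8)
    delta = (sigma_data ** 2 - sigma_target ** 2) / denom
    pred = model(x_t, t.flatten())
    return ((gamma * pred + delta * x_t - x_noisy) ** 2).mean()             # Eq. (7)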
4.9.5. Implementation Details

We implement our two-stage training procedure with a progressive learning strategy similar to that proposed in [59], gradually increasing image patch sizes to capture multi-scale features while maintaining computational efficiency. As detailed in Algorithm 1, each stage consists of both self-supervised pretraining and supervised fine-tuning phases.
For the GDSM pretraining, we set the maximum corruption level T = 10, which provides sufficient noise coverage while maintaining training stability. To determine the data noise level t_data, we incorporate standard noise estimation techniques from the skimage package [52]. While we could explicitly set t_data to correspond to specific noise levels (e.g., 50/255), we found that automated estimation suffices for good performance. In future work, more tailored approaches for specific noise level denoising could be implemented.
For optimization, we employ the AdamW optimizer with gradient clipping to stabilize training, coupled with a cosine annealing learning rate scheduler. Our progressive training schedule (see Table 2) gradually increases patch sizes while adjusting batch sizes and learning rates accordingly. We initialize each stage with weights from the previous stage, setting a maximum of 20 epochs per stage with early stopping based on validation performance. Due to computational time constraints, we note that the network training for the final stage of progressive learning had not yet fully converged when reporting our results.

Table 2. Progressive training schedule. *Mixed: randomly selected from {512², 768², 896²} per batch.
Stage   Patch Size   Batch   Learning Rate
1       256²         48      1 × 10−3
2       384²         24      3 × 10−4
3       512²         12      1 × 10−4
4       Mixed*       4       5 × 10−5

This progressive approach allows the model to initially learn basic denoising patterns on smaller patches, where more diverse samples can be processed in each batch, and then gradually adapt to larger contextual information in later stages. We train our models using the DIV2K [2] and LSDIR [31] training datasets, while validation is performed on their respective validation sets, which remain completely separate from training.
Throughout the entire training process, we maintain the same time-conditioned model architecture, leveraging its ability to handle varying noise levels both during self-supervised pretraining and supervised fine-tuning. The self-supervised pretraining with GDSM establishes robust initialization across diverse noise conditions, while the supervised fine-tuning further refines the model's performance on specific noise distributions of interest.

4.9.6. Inference Process

During standard inference, given a noisy observation X_{t_data}, we obtain the denoised output directly from our trained model:

\hat{\mathbf{X}} = \mathbf{h}_{\theta^*}(\mathbf{X}_{t_{\text{data}}}, t_{\text{data}}), (10)

However, to maximize denoising performance for high-resolution images without requiring additional model training, we incorporate two advanced techniques: geometric self-ensemble and adaptive patch-based processing.

Geometric Self-Ensemble. Following [35], we implement geometric self-ensemble to enhance denoising quality by leveraging the model's equivariance properties. This technique applies a set of geometric transformations (rotations and flips) to the input image, processes each transformed version independently, and then averages the aligned outputs. The approach can be concisely formulated as:

\hat{\mathbf{X}}_{\text{GSE}} = \frac{1}{K}\sum_{i=1}^{K} T_i^{-1}\left(\mathbf{h}_{\theta^*}\left(T_i(\mathbf{X}_{t_{\text{data}}}), t_{\text{data}}\right)\right), (11)

where {T_i}^K_{i=1} represents a set of K = 8 geometric transformations (identity, horizontal flip, vertical flip, 90°, 180°, and 270° rotations, plus combinations), and T_i^{-1} denotes the corresponding inverse transformation. This approach effectively provides model-ensembling benefits without requiring multiple models or additional training.

Adaptive Patch-Based Processing. To handle high-resolution images efficiently, we implement an adaptive patch-based processing scheme that dynamically selects appropriate patch sizes based on input dimensions. Algorithm 2 details our complete inference procedure.
Our adaptive patch-based approach dynamically selects from three patch sizes (896 × 896, 768 × 768, or 512 × 512) based on the input image dimensions. For each geometric transformation, the algorithm determines whether patch-based processing is necessary. If so, it divides the image into overlapping patches with a 50% stride, processes each patch independently, and reconstructs the full image by averaging the overlapping regions. This strategy effectively handles high-resolution images while maintaining computational efficiency.
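A compact version of the geometric self-ensemble of Eq. (11), assuming the same model(x, t) signature as above; the adaptive patching of Algorithm 2 below is omitted for brevity.

import torch

def geometric_self_ensemble(model, x, t_data):
    # Average predictions over the 8 flip/rotation transforms and their inverses (Eq. (11)).
    outputs = []
    for flip in (False, True):
        xf = torch.flip(x, dims=[-1]) if flip else x
        for k in range(4):                                  # 0, 90, 180, 270 degrees
            y = model(torch.rot90(xf, k, dims=(-2, -1)), t_data)
            y = torch.rot90(y, -k, dims=(-2, -1))           # undo rotation
            if flip:
                y = torch.flip(y, dims=[-1])                # undo flip
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)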
Algorithm 2: Adaptive Geometric Self-Ensemble Inference
Require: Noisy image X_{t_data}, model h_{θ*}
Ensure: Denoised image X̂
1: T ← {Identity, HFlip, VFlip, Rot90, ...} ▷ 8 transforms
2: H, W ← dimensions of X_{t_data}
3: t_data ← estimate_noise(X_{t_data}) if auto mode; a predefined level otherwise
4: patch_size ← 896 if min(H, W) ≥ 896; 768 if min(H, W) ≥ 768; 512 if min(H, W) ≥ 512
5: stride ← patch_size / 2 ▷ 50% overlap
6: outputs ← ∅
7: for all T ∈ T do
8:   X_T ← T(X_{t_data})
9:   H_T, W_T ← dimensions of X_T
10:  if max(H_T, W_T) > patch_size then
11:    output_T, count ← zeros(H_T, W_T)
12:    Pad X_T to dimensions divisible by stride
13:    for (i, j) in overlapping patch grid do
14:      patch ← X_T[i : i + patch_size, j : j + patch_size]
15:      result ← h_{θ*}(patch, t_data)
16:      Accumulate result and increment count at positions (i, j)
17:    end for
18:    denoised_T ← output_T / count
19:  else
20:    denoised_T ← h_{θ*}(X_T, t_data)
21:  end if
22:  outputs ← outputs ∪ {T^{-1}(denoised_T)}
23: end for
24: return X̂ ← (1/|T|) Σ_{out ∈ outputs} out

4.10. KLETech-CEVI

Method: The proposed HNNformer method is based on the HNN framework [24], which includes three main modules: the hierarchical spatio-contextual (HSC) feature encoder, the Global-Local Spatio-Contextual (GLSC) block, and the hierarchical spatio-contextual (HSC) decoder, as shown in Figure 7. Typically, image denoising networks employ feature scaling for varying the sizes of the receptive fields. The varying receptive fields facilitate learning of local-to-global variances in the features. With this motivation, they learn contextual information from multi-scale features while preserving high-resolution spatial details. They achieve this via a hierarchical-style encoder-decoder network with residual blocks as the backbone for learning. Given an input noisy image x, the proposed multi-scale hierarchical encoder extracts shallow features at three distinct scales, given as:

F_{si} = ME_s(x) (12)

where F_si are the shallow features extracted at the ith scale from the sampled space of the input noisy image x, and ME_s represents the hierarchical encoder at scale s.
Inspired by [60], they propose the Global-Local Spatio-Contextual (GLSC) block, which uses Spatial Attention Blocks (SAB) to learn spatial features at each scale. They also employ a Channel Attention Block (CAB) to fuse the multi-level features. The learned deep features are represented as:

D_{si} = GLSC_{si}(F_{si}) (13)

where D_si is the deep feature at the ith scale, F_si are the spatial features extracted at the ith scale, and GLSC_si represents the Spatial Attention Blocks (SAB) at the respective scales.
They decode the deep features obtained at various scales with the proposed hierarchical decoder, given by:

d_{si} = MD_{si}(D_{si}) (14)

where D_si is the deep feature at the ith scale, d_si is the decoded feature at the ith scale, and MD_si represents the hierarchical decoder. The decoded features and upscaled features at each scale are passed to the reconstruction layers M_r to obtain the denoised image ŷ. The upscaled features from each scale are stacked and represented as:

P = d_{s1} + d_{s2} + d_{s3} (15)

where d_s1, d_s2, and d_s3 are decoded features at three distinct scales, and P represents the final set of features passed to the Channel Attention Block (CAB) to obtain the denoised image ŷ:

\hat{y} = M_r(P) (16)

where ŷ is the denoised image obtained from the reconstruction layers M_r. They optimize the learning of HNNFormer with the proposed L_HNNformer, given as:

L_{HNNformer} = (\alpha \cdot L_1) + (\beta \cdot L_{VGG}) + (\gamma \cdot L_{MSSSIM}) (17)

where α, β, and γ are the weights. They experimentally set the weights to α = 0.5, β = 0.7, and γ = 0.5. L_HNN is a weighted combination of three distinct losses: L1 loss to minimize error at the pixel level, perceptual loss to efficiently restore contextual information between the ground-truth image and the output denoised image, and multi-scale structural dissimilarity loss to restore structural details. The aim here is to minimize the weighted combinational loss L_HNN, given as:

L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \| HNNFormer(x_i) - y_i \| L_{HNN} (18)

where θ denotes the learnable parameters of the proposed framework, N is the total number of training pairs, x and y are the input noisy and output denoised images, respectively, and HNNFormer(·) is the proposed framework for image denoising.

4.11. xd denoise
Figure 7. Overview of the HNNFormer proposed by Team KLETech-CEVI: Hierarchical Noise-Deinterlace Transformer for Image De-
noising (HNNFormer). The encoder extracts features in three distinct scales, with information passed across hierarchies (green dashed
box). Fine-grained global-local spatial and contextual information is learnt through the attention blocks at GLSC (orange dashed box). At
the decoder, information exchange occurs in reverse hierarchies (blue dashed box).

Figure 8. The SCUNet model architecture proposed by Team xd denoise.

Python version: 3.8.0, PyTorch version: 2.0.0, CUDA ver- tation(TTA) into their method during testing, including hor-
sion: 11.7. They only use high-definition images from the izontal flip, vertical flip, and 90-degree rotation. They uti-
DIV2K and LSDIR datasets for training and validation. The lized an ensemble technique by chaining three basic U-Net
training set consists of 85791 images (84991 + 800), and networks and SCUNet, and according to the weights of 0.6
the validation set consists of 350 images (250 + 100). They and 0.4, output the results of concatenating the SCUNet
used the Adam optimizer with 100 training epochs, a batch model with three UNet models to achieve better perfor-
size of 32, and a crop size of 256 × 256. The initial learning mance.
rate was set to 1e−4 , with β1 = 0.9 , β2 = 0.999, and no
weight decay applied. At epoch 90, the learning rate was 4.12. JNU620
reduced to 1e−5 . No data augmentation was applied during
training or validation.The model is trained with MSE loss. Description. Recently, some research in low-level vision
has shown that ensemble learning can significantly improve
Testing description They integrate Test-Time Augmen- model performance. Thus, instead of designing a new archi-
4.12. JNU620

Description. Recently, some research in low-level vision has shown that ensemble learning can significantly improve model performance. Thus, instead of designing a new architecture, they leverage the existing NAFNet [10] and RCAN [63] as basic networks to design a pipeline for image denoising (NRDenoising) based on the idea of ensemble learning, as shown in Fig. 9. They find that the results are further improved by employing both self-ensemble and model-ensemble strategies.
Figure 9. The pipeline of the NRDenoising proposed by Team JNU620.
Implementation details. For the training of NAFNet [10], they utilize the provided DIV2K [2] dataset. The model is trained with MSE loss. They utilize the AdamW optimizer (β1 = 0.9, β2 = 0.9) for 400K iterations on an NVIDIA Tesla V100 GPU. The initial learning rate is set to 1 × 10−3 and gradually reduces to 1 × 10−7 with cosine annealing. The training batch size is set to 4 and the patch size is 384×384. Random horizontal flipping and rotation are adopted for data augmentation. For the training of RCAN [63], the provided DIV2K [2] dataset is also employed. The MSE loss is utilized with an initial learning rate of 1×10−4. The Adam optimizer (β1 = 0.9, β2 = 0.99) is used for 100K iterations. The batch size is 3, and the patch size is 200×200. Data augmentation includes the horizontal flip and the 90-degree rotation.
During inference, they apply a self-ensemble strategy for NAFNet [10] and selectively adopt the TLC [15] method based on the size of the input images; for RCAN [63], they utilize a self-ensemble strategy. Finally, the model-ensemble strategy is employed to combine the outputs of NAFNet [10] and RCAN [63].
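A minimal sketch of the self-ensemble plus model-ensemble inference is given below (the ×8 geometric transform set and the equal-weight averaging of the two restored outputs are illustrative assumptions; the TLC step is omitted):

import torch

@torch.no_grad()
def self_ensemble(model, x):
    # x8 geometric self-ensemble: average over horizontal/vertical flips and transposition.
    outs = []
    for hflip in (False, True):
        for vflip in (False, True):
            for transpose in (False, True):
                t = x
                if hflip:
                    t = torch.flip(t, dims=[-1])
                if vflip:
                    t = torch.flip(t, dims=[-2])
                if transpose:
                    t = t.transpose(-2, -1)
                y = model(t)
                # Undo the transforms in reverse order.
                if transpose:
                    y = y.transpose(-2, -1)
                if vflip:
                    y = torch.flip(y, dims=[-2])
                if hflip:
                    y = torch.flip(y, dims=[-1])
                outs.append(y)
    return torch.stack(outs, dim=0).mean(dim=0)

@torch.no_grad()
def nrdenoising(nafnet, rcan, noisy):
    # Model ensemble: average the self-ensembled outputs of the two networks.
    return 0.5 * (self_ensemble(nafnet, noisy) + self_ensemble(rcan, noisy))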
4.13. PSU team

General method description. They propose OptiMalDiff, a high-fidelity image enhancement framework that reformulates image denoising as an optimal transport problem. The core idea is to model the transition from noisy to clean image distributions via a Schrödinger Bridge-based diffusion process. The architecture (shown in Fig. 10) consists of three main components: (1) a hierarchical Swin Transformer backbone that extracts both local and global features efficiently, (2) a Schrödinger Bridge Diffusion Module that learns forward and reverse stochastic mappings, and (3) a Multi-Scale Refinement Network (MRefNet) designed to progressively refine image details. To enhance realism, they integrate a PatchGAN discriminator with adversarial training.

Training details. The model is trained from scratch using the DIV2K dataset, without relying on any pre-trained weights. They jointly optimize all modules using a composite loss function that includes a diffusion loss, a Sinkhorn-based optimal transport loss, multi-scale SSIM and L1 losses, and an adversarial loss. The training spans 300 epochs with a batch size of 8, totaling 35,500 iterations per epoch. The method emphasizes both fidelity and perceptual quality, achieving strong results in PSNR and LPIPS.

4.14. Aurora

They introduce their algorithm from four aspects: model architecture, data processing methods, training pipeline, and testing pipeline.
Given the excellent performance of generative adversarial networks (GANs) in image generation tasks, and considering that image denoising can also be regarded as a type of generative task, they utilize a generative adversarial network for the denoising task. Specifically, they adopt NAFNet [10] as the generator and have made a series of parameter adjustments; in particular, they increased both the number of channels and the number of modules. Due to the superior performance of the SiLU activation function across various tasks, they replaced the original activation function with SiLU. For the discriminator, they employ a VGG11 architecture without batch normalization (BN) layers, where the ReLU activation function is replaced with LeakyReLU. In the training stage, they exclusively use the DIV2K and LSDIR datasets [31]. Instead of employing overly complex data augmentation algorithms, they applied simple flipping and rotation techniques for data augmentation. Finally, a patch is cropped from the high-resolution (HR) image, normalized, and then fed into the network.
During training, they progressively trained the model using resolutions of [128, 192, 256]. The model was jointly optimized using L1, L2, and Sobel loss functions. The optimizer and learning rate scheduler used during training were AdamW and CosineAnnealingLR, respectively.
In the inference phase, they employed a self-ensemble strategy and selectively adopted the TLC [14] method to further enhance performance.
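The Sobel term of this joint objective can be realized as an L1 distance between gradient maps; a minimal sketch under assumed loss weights (the exact weighting of the L1/L2/Sobel terms is not specified above):

import torch
import torch.nn.functional as F

def sobel_gradients(x):
    # Per-channel Sobel gradients of an NCHW image tensor.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=x.device, dtype=x.dtype)
    ky = kx.t()
    c = x.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    return F.conv2d(x, kx, padding=1, groups=c), F.conv2d(x, ky, padding=1, groups=c)

def denoise_loss(pred, target, w_l1=1.0, w_l2=1.0, w_sobel=0.1):
    gx_p, gy_p = sobel_gradients(pred)
    gx_t, gy_t = sobel_gradients(target)
    sobel = F.l1_loss(gx_p, gx_t) + F.l1_loss(gy_p, gy_t)
    return w_l1 * F.l1_loss(pred, target) + w_l2 * F.mse_loss(pred, target) + w_sobel * sobel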
Figure 10. Overview of the OptiMalDiff architecture proposed by PSU team, combining Schrödinger Bridge diffusion, transformer-based
feature extraction, and adversarial refinement.

4.15. mpu ai

4.15.1. Method
Existing deep learning-based image restoration methods exhibit inadequate generalization capabilities when faced with a variety of noise types and intensities, thereby significantly impeding their broad application in real-world scenarios. To tackle this challenge, they propose a novel prompt-based learning approach, namely Blind Image Restoration Using Dual-Channel Transformers and Multi-Scale Attention Prompt Learning (CTMP), as depicted in Figure 11. The CTMP model features a U-shaped architecture grounded in the Transformer framework, constructed from the enhanced Channel Attention Transformer Block (CATB). During the image restoration process, CTMP adopts a blind image restoration strategy to address diverse noise types and intensities. It integrates an Efficient Multi-Scale Attention Prompt Module (EMAPM) that is based on prompts. Within the EMAPM, an Enhanced Multi-scale Attention (EMA) module is specifically designed. This module extracts global information across different directions and employs dynamic weight calculations to adaptively modulate the importance of features at various scales. The EMA module subsequently fuses the enhanced multi-scale features with the input feature maps, yielding a more enriched feature representation. This fusion mechanism empowers the model to more effectively capture and leverage features at different scales, thereby markedly bolstering its capacity to restore image degradations and showcasing superior generalization capabilities.

4.15.2. Transformer Block Incorporating Channel Attention and Residual Connections
The Transformer Block serves as the cornerstone of their entire model, harnessing the Transformer architecture to extract image features through the self-attention mechanism. In pursuit of enhanced performance, they have refined the Transformer module by devising a novel architecture that integrates Channel Attention with the self-attention mechanism, thereby combining the strengths of both Transformer and Channel Attention. Specifically, the Transformer focuses on extracting high-frequency information to capture the fine details and textures of images, while Channel Attention excels at capturing low-frequency information to extract the overall structure and semantic information of images. This integration further boosts the image denoising effect. As depicted in Figure 12, the improved Transformer architecture, named the Channel Attention Transformer Block (CATB), primarily consists of the following three modules: Multi-DConv Head Transposed Self-Attention (MDTA), Channel Attention (CA), and Gated-Dconv Feed-Forward Network (GDFN).
The Multi-DConv Head Transposed Self-Attention (MDTA) module enhances the self-attention mechanism's perception of local image features by incorporating multi-scale depthwise convolution operations, effectively capturing detailed image information. The Channel Attention (CA) module, dedicated to information processing along the channel dimension, computes the importance weights of each channel to perform weighted fusion of channel features, thereby strengthening the model's perception of the overall image structure.
Figure 11. The CTMP architecture proposed by Team mpu ai

Figure 12. The Channel Attention Transformer Block (CATB), proposed by Team mpu ai

The Gated-Dconv Feed-Forward Network (GDFN) module combines the gating mechanism with depthwise convolution operations, aiming to further optimize the nonlinear transformation of features. By introducing the gating mechanism, the model can adaptively adjust the transmission and updating of features based on the dynamic characteristics of the input features, thereby enhancing the flexibility and adaptability of feature representation. Through the synergistic action of these three modules, the improved Transformer architecture can more effectively handle both high-frequency and low-frequency information in images, thereby significantly enhancing the performance of image denoising and restoration.
In image restoration tasks, feature extraction and representation are crucial steps. Traditional convolutional neural networks (CNNs) and Transformer architectures primarily focus on feature extraction in the spatial domain, while paying less attention to the weighting of features in the channel dimension. To address this limitation, they introduce a Channel Attention module in the Transformer Block, creating a Transformer Block that incorporates Channel Attention and Residual Connections. This module weights the channel dimension through global average pooling and fully connected layers, enhancing important channel features while suppressing less important ones. This weighting mechanism enables the model to focus more effectively on key information, thereby improving the quality of restored images. Additionally, the introduction of residual connections further enhances the model's robustness and performance. Residual connections ensure that the information of the input features is fully retained after processing by the Channel Attention module by adding the input features directly to the output features. This design not only aids gradient propagation but also retains the original information of the input features when the weighting effect of the Channel Attention module is suboptimal, further boosting the model's robustness.
The proposed model incorporates several key enhancements to improve image restoration quality. Firstly, the Channel Attention Module leverages global average pooling and fully connected layers to selectively enhance important channel features while suppressing less relevant ones. This mechanism enables the model to focus more effectively on critical information, thereby improving the quality of the restored image. Secondly, residual connections are employed to ensure that the original input features are fully retained and added directly to the output features after processing by the Channel Attention Module. This not only aids gradient propagation but also preserves the original information when the weighting effect is suboptimal, thus boosting the model's robustness. Lastly, the LeakyReLU activation function is utilized in the Feed-Forward Network to introduce non-linearity while avoiding the "dying neurons" issue associated with ReLU, further enhancing the model's expressive power. Together, these improvements contribute to a more effective and robust image restoration model.
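A minimal sketch of such a channel-attention block with a residual connection is shown below (the reduction ratio and layer sizes are illustrative, not the CTMP configuration):

import torch.nn as nn

class ChannelAttentionResidual(nn.Module):
    # Global average pooling + two fully connected layers produce per-channel weights;
    # the input is added back so information survives even when the weighting is poor.
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(-2, -1)))      # squeeze to (B, C)
        return x * w[:, :, None, None] + x     # re-weight channels, then residual add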
4.15.3. Efficient Multi-Scale Attention Prompt Module
Addressing multi-scale image degradations is a crucial challenge in image restoration tasks. Traditional feature extraction methods typically capture features at a single scale, neglecting the fusion and interaction of features across multiple scales. To overcome this limitation, they propose a prompt-based blind image restoration approach, incorporating an Efficient Multi-Scale Attention Prompt Module (EMAPM). As shown in Figure 13, the core of the EMAPM is the Enhanced Multi-scale Attention (EMA) module, which extracts global information in different directions and combines dynamic weight calculations to adaptively adjust the significance of features at various scales, thereby generating a richer feature representation. This design not only enhances the model's adaptability to multi-scale image degradations but also strengthens the expressiveness of features, significantly improving the quality of image restoration. Experimental results validate the effectiveness of the EMA module, demonstrating its ability to substantially boost model performance across multiple image restoration tasks.
The EMAPM is designed to enhance the model's ability to capture multi-scale features in image restoration tasks. By generating adaptive prompts that focus on different scales and characteristics of the input image, EMAPM allows the model to better handle various types of image degradations. The core components and operations of EMAPM are described as follows.
Module Configuration: To configure the EMAPM, several key parameters are defined:
• Prompt Dimension (d_p): the dimension of each prompt vector, which represents the feature space for each prompt.
• Prompt Length (L_p): the number of prompt vectors, which controls the diversity of the prompts generated.
• Prompt Size (S_p): the spatial size of each prompt vector, which affects the resolution of the prompts.
• Linear Dimension (d_l): the dimension of the input to the linear layer, which processes the embedding of the input feature map.
• Factor (f): the number of groups in the EMA module, which influences the grouping mechanism in the attention process.
Mathematical Formulation: Given an input feature map x \in \mathbb{R}^{B \times C \times H \times W}, where B is the batch size, C is the number of channels, and H \times W is the spatial dimension, the operations within EMAPM are defined as follows.
1. Compute Embedding: The embedding of the input feature map is computed by averaging over the spatial dimensions:

\text{emb} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{:,:,i,j} \in \mathbb{R}^{B \times C} \qquad (19)

2. Linear Layer and Softmax: The embedding is passed through a linear layer followed by a softmax function to generate prompt weights:

\text{prompt\_weights} = \text{softmax}(\text{linear\_layer}(\text{emb})) \in \mathbb{R}^{B \times L_p} \qquad (20)

3. Generate Prompt: The prompts are generated by weighting the prompt parameters with the prompt weights and summing them up. The prompts are then interpolated to match the spatial dimensions of the input feature map:

\text{prompt} = \sum_{k=1}^{L_p} \text{prompt\_weights}_{:,k} \cdot \text{prompt\_param}_{k} \in \mathbb{R}^{B \times d_p \times S_p \times S_p} \qquad (21)

\text{prompt} = \text{F.interpolate}(\text{prompt}, (H, W), \text{mode}=\text{"bilinear"}) \qquad (22)
Figure 13. Efficient Multi-Scale Attention Prompt Module (EMAPM), proposed by Team mpu ai.

4. Enhance Prompt using EMA: The prompts are enhanced using the Enhanced Multi-scale Attention (EMA) module, which refines the prompts by incorporating multi-scale attention:

\text{enhanced\_prompt} = \text{EMA}(\text{prompt}) \in \mathbb{R}^{B \times d_p \times H \times W} \qquad (23)

5. Conv3x3: Finally, the enhanced prompts are processed through a 3×3 convolutional layer to further refine the feature representation:

\text{enhanced\_prompt} = \text{conv3x3}(\text{enhanced\_prompt}) \in \mathbb{R}^{B \times d_p \times H \times W} \qquad (24)
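Steps 1-5 map almost directly to code. The sketch below follows Eqs. (19)-(24); the EMA module is passed in as an arbitrary nn.Module (its internals are not reproduced here), lin_dim must equal the channel count C of the incoming feature map, and the default sizes are illustrative rather than the team's settings:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EMAPM(nn.Module):
    def __init__(self, ema_module, prompt_dim=128, prompt_len=5, prompt_size=96, lin_dim=192):
        super().__init__()
        # L_p learnable prompt components of shape (d_p, S_p, S_p), cf. Eq. (21).
        self.prompt_param = nn.Parameter(torch.rand(prompt_len, prompt_dim, prompt_size, prompt_size))
        self.linear_layer = nn.Linear(lin_dim, prompt_len)               # Eq. (20)
        self.ema = ema_module                                            # Eq. (23)
        self.conv3x3 = nn.Conv2d(prompt_dim, prompt_dim, 3, padding=1)   # Eq. (24)

    def forward(self, x):
        b, c, h, w = x.shape
        emb = x.mean(dim=(-2, -1))                                       # Eq. (19): spatial average, (B, C)
        weights = F.softmax(self.linear_layer(emb), dim=1)               # Eq. (20): (B, L_p)
        prompt = torch.einsum("bl,ldhw->bdhw", weights, self.prompt_param)  # Eq. (21): weighted sum
        prompt = F.interpolate(prompt, size=(h, w), mode="bilinear", align_corners=False)  # Eq. (22)
        return self.conv3x3(self.ema(prompt))                            # Eqs. (23)-(24)

With ema_module = nn.Identity(), the block reduces to plain prompt generation, which is convenient for sanity checks.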
4.15.4. Experiments
In this section, they conducted a series of extensive experiments to comprehensively demonstrate the superior performance of the proposed CTMP model across multiple datasets and benchmarks. The experiments covered a variety of tasks, including denoising and deblocking of compressed images, and were compared with previous state-of-the-art methods. Additionally, they reported the results of ablation studies, which strongly validated the effectiveness of the Channel Attention Transformer Block (CATB) and the Enhanced Multi-scale Attention Prompt Module (EMAPM) within the CTMP architecture.
The CTMP framework is end-to-end trainable without the need for pretraining any individual components. Its architecture consists of a 4-level encoder-decoder, with each level equipped with a different number of Transformer modules, specifically [4, 6, 6, 8] from level 1 to level 4. They placed a Prompt module between every two consecutive decoder levels, resulting in a total of 3 Prompt modules across the entire PromptIR network, with a total of 5 Prompt components. During training, the model was trained with a batch size of 2, leveraging the computational power of a Tesla T4 GPU. The network was optimized through L1 loss, using the Adam optimizer (β1 = 0.9, β2 = 0.999) with a learning rate of 2 × 10−4. To further enhance the model's generalization ability, they used 128×128 cropped blocks as input during training and augmented the training data by applying random horizontal and vertical flips to the input images.
In terms of overall complexity, the proposed model consists of approximately 35.92 million parameters and has a computational cost of 158.41 billion floating-point operations (FLOPs). The number of activations is around 1,863.85 million, with 304 Conv2d layers. During GPU training, the maximum memory consumption is 441.57 MB, and the average runtime for validation is 25,287.67 seconds.
4.15.5. Dataset
To comprehensively evaluate the performance of the CTMP algorithm in image restoration tasks, they conducted experiments in two critical areas: image denoising and deblocking of compressed images. For training, they selected the high-quality DIV2K dataset, which comprises 800 high-resolution clean images with rich textures and details, providing ample training samples to enable the model to perform well under various degradation conditions [2]. Additionally, they used 100 clean/noisy image pairs as the validation set to monitor the model's performance during training and adjust the hyperparameters.
During the testing phase, they chose several widely used datasets, including Kodak, LIVE1, and BSDS100, to comprehensively assess the algorithm's performance. The Kodak dataset consists of 24 high-quality images with diverse scenes and textures, commonly used to evaluate the visual effects of image restoration algorithms [1]. The LIVE1 dataset contains a variety of image types and is widely used for image quality assessment tasks, effectively testing the algorithm's performance under different degradation conditions [47]. The BSDS100 dataset includes 100 images with rich textures and edge information, providing a comprehensive evaluation of the algorithm's performance in image restoration tasks [41].
By testing on these representative datasets, they were able to comprehensively evaluate the CTMP algorithm's performance across different degradation types and image conditions, ensuring its effectiveness and reliability in practical applications.

Figure 14. Overview of the two-stage OptDenoiser framework for image denoising.
4.16. OptDenoiser
Method. They introduce a two-stage transformer-based network that effectively maps low-resolution noisy images to their high-resolution counterparts, as depicted in Fig. 14. The proposed framework comprises two independent encoder-decoder blocks (EDBs) and Multi-Head Correlation (MHC) blocks to generate visually coherent images [46]. To enhance reconstruction efficiency, they integrate illumination mapping [46] guided by Retinex theory [26]. Additionally, they conduct an in-depth evaluation of the effectiveness of illumination mapping in general image reconstruction tasks, including image denoising. Therefore, their framework integrates the Retinexformer [9] network as the first stage. In the context of image denoising, Retinexformer surpasses conventional denoisers such as UFormer, Restormer, and DnCNN. However, like other denoising methods, Retinexformer encounters challenges, including jagged edges, blurred outputs, and difficulties in capturing and representing complex structures in noisy inputs. To address these obstacles, they incorporate the MHC, followed by an additional EDB, in their framework. This design effectively exploits feature correlations from intermediate outputs, enabling more accurate reconstruction with improved structural fidelity and texture preservation. Furthermore, they integrate a perceptual loss function with luminance-chrominance guidance [46] to mitigate color inconsistencies, ensuring visually coherent and perceptually refined reconstructions.

4.16.1. Global Method Description
Training Procedure: During the training phase, input images were randomly cropped into 512 × 512 patches and subsequently downscaled to 128 × 128 to enhance the model's ability to capture spatial features effectively. A fixed learning rate of 0.0001 was maintained throughout the training process. The model was trained exclusively on the LSDIR and DIV2K datasets, without the inclusion of any additional training, validation, or testing data.

4.16.2. Technical details
The proposed solution is implemented with the PyTorch framework. The networks were optimized using the Adam optimizer, where the hyperparameters were tuned as β1 = 0.9, β2 = 0.99, and the learning rate was set to 1 × 10−4. They trained their model using randomly cropped image patches with a constant batch size of 4, which takes approximately 72 hours to complete. All experiments were conducted on a machine equipped with an NVIDIA RTX 3090 GPU.

4.17. AKDT
Method. The team utilizes their existing network, the Adaptive Kernel Dilation Transformer [5] (AKDT), published at VISAPP 2025, with code available at https://github.com/albrateanu/AKDT. Figure 15 presents the architecture of AKDT. It proposes a novel convolutional structure with learnable dilation rates: the Learnable Dilation Rate (LDR) Block, used to formulate the Noise Estimator (NE) Module, which is leveraged within the self-attention and feed-forward mechanisms.
LDR. The Learnable Dilation Rate module lies at the foundation of AKDT and helps the model effectively pick optimal dilation rates for convolutional kernels. Given an input feature map F_in \in \mathbb{R}^{H \times W \times C}, it is formulated as the weighted concatenation of N dilated convolutions:

\mathbf{F}_\text{LDR} = \text{conv}1{\times}1\Big(\operatorname{concat}_{i=1}^{N}\, \alpha_i \cdot \text{conv}3{\times}3_i(\mathbf{F}_\text{in})\Big) \qquad (25)

where concat represents the channel-wise concatenation operation. The specific dilation rates picked for LDR are a hyperparameter that is carefully chosen to balance performance and computational efficiency.
NE. The Noise Estimator integrates both global and local context understanding through its unique structure. The module consists of two distinct parallel components: the Global and Local LDR modules, with dilation rates selected for capturing global and local structure, respectively. It is defined as:
Figure 15. Overall framework of AKDT - Adaptive Kernel Dilation Transformer.

\mathbf{NE} = \varrho\big(\mathbf{LDR}_\text{Global}, \mathbf{LDR}_\text{Local}\big) \qquad (26)

where ϱ is the Noise Estimation Fusion operation that merges global and local noiseless feature context.
NG-MSA. To ensure efficiency in their Noise-Guided Multi-headed Self-Attention, they utilize the Transposed Multi-headed Self-Attention mechanism [59] as the baseline. They then integrate their proposed NE module into the Q, K, V extraction phase, to ensure the self-attended feature maps are produced using noiseless context. Therefore, given the input feature map F_in \in \mathbb{R}^{H \times W \times C}, they can define this process as:

\{\mathbf{Q}, \mathbf{K}, \mathbf{V}\} = \mathbf{NE}(\mathbf{F}_\text{in}), \quad \mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{HW \times C} \qquad (27)

Then, Q and K are used to compute the self-attention map by matrix multiplication and Softmax activation, which is then applied to V to obtain the final self-attended feature map.
NG-FFN. The Noise-Guided Feed-Forward Network also utilizes the NE module for noise-free feature extraction context. It consists of a series of convolutional layers with a gating mechanism used to selectively apply non-linear activations. The noise-free features, obtained by projecting the input through the NE, are referred to as F_NE \in \mathbb{R}^{H \times W \times C}. Consequently, the feed-forward process can be described as:

\mathbf{F}_\text{NG-FFN} = \phi(W_1 \mathbf{F}_\text{NE}) \odot W_2 \mathbf{F}_\text{NE} + \mathbf{F}_\text{NE} \qquad (28)

where ϕ denotes the GELU activation function, ⊙ represents element-wise multiplication, and W1, W2 are the learnable parameters of the parallel paths.
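A compact sketch of the LDR block of Eq. (25) and the noise-guided gated feed-forward of Eq. (28) is given below (the dilation rates, the concatenation-plus-1×1 fusion used in place of ϱ, and the layer sizes are illustrative assumptions, not the released AKDT code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LDR(nn.Module):
    # Learnable Dilation Rate block, Eq. (25): a learnable weighted concatenation of
    # parallel dilated 3x3 convolutions, fused by a 1x1 convolution.
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations]
        )
        self.alpha = nn.Parameter(torch.ones(len(dilations)))        # learnable branch weights
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        feats = [a * conv(x) for a, conv in zip(self.alpha, self.branches)]
        return self.fuse(torch.cat(feats, dim=1))

class NGFFN(nn.Module):
    # Noise-guided feed-forward, Eq. (28): GELU(W1 F_NE) * (W2 F_NE) + F_NE,
    # with F_NE obtained from a global/local LDR pair fused as in Eq. (26).
    def __init__(self, channels, global_dilations=(3, 5), local_dilations=(1, 2)):
        super().__init__()
        self.ldr_global = LDR(channels, global_dilations)
        self.ldr_local = LDR(channels, local_dilations)
        self.ne_fuse = nn.Conv2d(channels * 2, channels, 1)          # stand-in for the fusion operation
        self.w1 = nn.Conv2d(channels, channels, 1)
        self.w2 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        f_ne = self.ne_fuse(torch.cat([self.ldr_global(x), self.ldr_local(x)], dim=1))
        return F.gelu(self.w1(f_ne)) * self.w2(f_ne) + f_ne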


Implementation. AKDT is implemented in PyTorch. They only use the DIV2K dataset for training. The model is trained using the Adam optimizer for 150k iterations, with an initial learning rate of 2e−4 that gradually decreases through a cosine annealing scheme. Each iteration consists of a batch of four 600 × 600 randomly cropped image patches that undergo data augmentation (random flipping/rotation). To optimize the network, they utilize a hybrid loss function capable of capturing pixel-level, multi-scale, and perceptual differences [6] [4]. Testing is performed via standard inference, without additional enhancement techniques.

4.18. X-L
General method description. To ensure performance while reducing computational overhead, they adopted the following strategy: leveraging two leading approaches, Xformer [60] and SwinIR [33], the pipeline is shown in Fig. 16. They directly utilized the pre-trained models to perform self-ensemble, generating two output results. Then, they conducted a model ensemble on these two outputs, integrating the results between models to obtain the final reconstruction result.
Training details. They do not require additional training; instead, they directly leverage existing methods and their pre-trained models for inference. This approach not
only saves significant computational resources and time but also fully utilizes the excellent models and valuable expertise available in the field. By directly employing these pre-trained models, they can quickly generate high-quality predictions while avoiding the high costs and complexity associated with training models from scratch.

Figure 16. Overview of the MixEnsemble pipeline proposed by Team X-L.
4.19. Whitehairbin

4.19.1. Introduce
Their method is based on the Refusion [40] model proposed in previous work, and they trained it on the dataset provided by this competition to validate its effectiveness. The Refusion model itself is a denoising method based on the diffusion model framework. Its core idea is to guide the reverse diffusion process by learning the noise gradient (score function) at different time steps t. Within the Refusion framework, they can still flexibly choose NAFNet or UNet as the neural network backbone architecture to adapt to different computational resources and performance requirements. NAFNet is known for its efficiency, while UNet excels in preserving details. The denoising process follows a stochastic differential equation (SDE) approach, which calculates the score function by predicting the noise residual and iteratively removes noise. Through training and validation on the competition dataset, their method ultimately achieved a test performance of PSNR 27.07 and SSIM 0.79.

4.19.2. Method details
General method description. Their proposed denoising method is based on a diffusion model framework, where the network is designed to estimate the noise gradient (score function) at different time steps t to guide the reverse diffusion process. The core architecture consists of a neural backbone, which can be either NAFNet or UNet, selected based on a trade-off between computational efficiency and denoising quality. NAFNet features a lightweight structure optimized for high-speed image restoration, incorporating a self-gated activation mechanism (SimpleGate), simplified channel attention (SCA), and depth-wise convolutions, making it highly efficient. UNet, on the other hand, is a widely adopted architecture for image denoising, leveraging an encoder-decoder structure with skip connections to preserve spatial details while extracting multi-scale features.
The denoising process follows a stochastic differential equation (SDE) approach, where Gaussian noise N(0, σ_t^2 I) is added to the clean image x_0 during the forward diffusion process, and the network is trained to predict the noise residual s_θ(x_t, t). This predicted noise is used to compute the score function, which guides the reverse diffusion process, progressively removing noise through an iterative update step:

x_{t-1} = x_t - 0.5 \cdot \sigma_t^2 \cdot \text{score}(x_t, t) \cdot dt.

To improve sampling efficiency, they integrate an ODE-based sampling strategy, which allows for faster denoising while maintaining high restoration quality. Additionally, they employ a cosine noise schedule, which ensures a smooth noise transition across time steps and improves training stability. The network is optimized using a custom loss function that minimizes the deviation between the predicted noise and the true noise, ensuring precise score estimation.
Training is conducted with the Lion optimizer, incorporating a learning rate scheduler for improved convergence. To enhance computational efficiency, they apply mixed precision training, reduce the number of time steps T, and utilize lightweight backbone networks, striking a balance between high-quality denoising and efficient execution.
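The iterative update above translates into a short sampling loop; a minimal sketch is given below (the score-model interface, the number of steps, and the cosine-style schedule are placeholders rather than the Refusion configuration):

import math
import torch

@torch.no_grad()
def reverse_denoise(score_model, noisy, sigmas, dt=1.0):
    # Deterministic reverse pass following x_{t-1} = x_t - 0.5 * sigma_t^2 * score(x_t, t) * dt.
    x = noisy
    for t in reversed(range(len(sigmas))):
        t_batch = torch.full((x.shape[0],), t, device=x.device, dtype=torch.long)
        score = score_model(x, t_batch)            # network estimate of the score at step t
        x = x - 0.5 * sigmas[t] ** 2 * score * dt
    return x

# Illustrative cosine-style noise schedule over T steps.
T = 100
sigmas = 0.5 * (1.0 - torch.cos(torch.linspace(0.0, math.pi, T)))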
Figure 17. Diffusion model for image denoising from Team Whitehairbin.

Training description. They trained their diffusion-based denoising model on a mixed dataset composed of DIV2K and LSDIR, which contained high-resolution images with diverse textures and content. The dataset was augmented with random cropping, horizontal flipping, and other data augmentation techniques to improve model generalization.
The backbone network was NAFNet, with the feature channel width set to 64. They experimented with different channel sizes and determined that 64 channels provided a good balance between performance and computational efficiency.
They employed the Lion optimizer with β1 = 0.95 and β2 = 0.98 to ensure faster convergence and better stability during training. The learning rate was initialized at 2 × 10−4 and was reduced by half after every 200k iterations using a CosineAnnealingLR scheduler to achieve smoother convergence.
The loss function was a Matching Loss designed to minimize the distance between the predicted and true noise residuals. This function integrated L1 and L2 components, weighted dynamically based on the noise variance at different time steps to stabilize the training across different diffusion levels.
They applied mixed precision training with automatic gradient scaling to accelerate training while reducing memory usage. The model was trained for a total of 800k iterations, and each batch contained 16 cropped patches of size 128 × 128. Training was conducted using a single NVIDIA RTX 4090 GPU, and the entire process took approximately 36 hours to complete.
To ensure robust noise modeling, a cosine noise schedule was adopted, which progressively adjusted the noise level throughout the training process, allowing the model to better capture high-frequency details during the denoising phase.
Testing description. During the training phase, they validated the model using the official validation dataset provided by the NTIRE 2025 competition. The validation set included images with Gaussian noise of varying intensities, and the model was assessed based on both PSNR and SSIM metrics.
Upon completing 800k iterations, the model achieved a peak PSNR of 26.83 dB and an SSIM of 0.79 on the validation dataset, indicating effective noise suppression and structure preservation.
After training was completed, the model was rigorously tested using the official test set to verify its effectiveness in real-world scenarios. They conducted multiple test runs with different noise levels to ensure model robustness across various conditions. The test results confirmed that the model performed consistently well in Gaussian noise removal, maintaining high PSNR and SSIM values across diverse image types.
To further evaluate the performance, they applied both SDE-based and ODE-based sampling methods during inference. ODE sampling provided a faster and more deterministic denoising process, while SDE sampling yielded more diverse results. The final submitted model leveraged ODE sampling to achieve a balance between quality and inference speed.

4.20. mygo
U-Net adopts a typical encoder-decoder structure. The encoder is responsible for downsampling the input image, extracting features at different scales to capture the global information and semantic features of the image. The decoder performs upsampling, restoring the feature maps to the original image size and progressively recovering the detailed information of the image. This architecture enables U-Net to achieve rich global semantic information while accurately restoring image details when processing high-definition images, thereby realizing high-precision segmentation.
The U-Net architecture is characterized by its symmetric encoder-decoder structure with skip connections. In the encoder (or contracting path), the network progressively downsamples the input image through multiple convolutional layers interspersed with max-pooling operations. This process allows the model to extract hierarchical features at various scales, capturing both the global context and semantic information of the image.
In the decoder (or expansive path), the network employs transposed convolutions (or upsampling layers) to gradually upscale the feature maps back to the original image resolution. During this process, the decoder receives additional information from the encoder via skip connections, which concatenate corresponding feature maps from the encoder to those in the decoder. This mechanism helps in refining the output by incorporating fine-grained details and spatial information, which are crucial for accurate image restoration or segmentation.
This design ensures that U-Net can effectively handle high-resolution images by leveraging both the broad contextual understanding gained from the encoder and the detailed spatial information preserved through the skip connections. Consequently, this dual capability of capturing global semantics and local details makes U-Net particularly powerful for tasks that require precise image segmentation. The uniqueness of U-Net lies in its skip connections. These skip connections directly transfer feature maps of the same scale from the encoder to the corresponding layers in the decoder. This mechanism allows the decoder to utilize low-level feature information extracted by the encoder, aiding in the better recovery of image details. When processing high-definition images, these low-level features contain abundant
edge, texture, and other detail information, which is crucial
for accurate image segmentation.
Compared to Fully Convolutional Networks (FCNs), U-
Net stands out because of its use of skip connections. FCN
is also a commonly used model for image segmentation,
but lacks the skip connections found in U-Net, resulting in
poorer performance in recovering detailed image informa-
tion. When processing high-definition images, FCNs can
produce blurry segmentation results with unclear edges. In
contrast, U-Net can better preserve the details of the image
through its skip connections, thereby improving the accu-
racy of segmentation.
Their model resizes all images to 512×512 for training,
which facilitates the rapid extraction of image features and
effectively reduces the usage of video memory (VRAM).
Next, they feed the images into the network model and
compute the loss of the output images. In particular, their
loss function incorporates both MSE (mean squared error)
and SSIM (structured similarity index measure), allowing
the model to focus on pixel-level accuracy during training
while also emphasizing the structural features of the images.
This dual approach improves the overall performance of the
model. They use the Adam optimizer for training, which
dynamically adjusts the learning rate during the training
process based on the first and second moments of the gra-
dients. This allows it to automatically select the appropri-
ate step sizes for each parameter, leading to more efficient
convergence compared to fixed learning rate methods. Ad-
ditionally, Adam helps reduce the overall memory footprint
by maintaining only a few extra parameters per weight, con-
tributing to its efficiency in practical applications. In par-
ticular, they employ an early stopping mechanism to avoid
redundant computations.
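A minimal sketch of such a combined objective is shown below, using a simplified single-scale SSIM with a uniform local window (illustrative, not the team's exact implementation, and the 0.2 weight is an assumption):

import torch
import torch.nn.functional as F

def ssim_simple(x, y, c1=0.01 ** 2, c2=0.03 ** 2, win=7):
    # Single-scale SSIM with a uniform local window (simplified).
    mu_x = F.avg_pool2d(x, win, 1, win // 2)
    mu_y = F.avg_pool2d(y, win, 1, win // 2)
    var_x = F.avg_pool2d(x * x, win, 1, win // 2) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, win // 2) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, win // 2) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def mse_ssim_loss(pred, target, w_mse=1.0, w_ssim=0.2):
    return w_mse * F.mse_loss(pred, target) + w_ssim * (1.0 - ssim_simple(pred, target))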
It is worth mentioning that they have implemented an
early stopping mechanism. This approach helps prevent
overfitting by halting the training process when the per-
formance on a validation set stops improving, thus avoid-
ing unnecessary computations and saving computational re-
sources. Early stopping monitors a chosen metric (such as
validation loss) and stops training when no improvement is
observed over a predefined number of epochs, effectively
reducing the risk of overfitting and ensuring efficient use of computational resources.

Figure 18. U-Net model architecture from Team mygo.

A. Teams and affiliations


Acknowledgments
NTIRE 2025 team
This work was partially supported by the Humboldt Foun-
dation, the Ministry of Education and Science of Bul- Title: NTIRE 2025 Image Denoising Challenge
garia (support for INSAIT, part of the Bulgarian National Members:
Roadmap for Research Infrastructure). We thank the Lei Sun1 ([email protected]),
NTIRE 2025 sponsors: ByteDance, Meituan, Kuaishou, Hang Guo2 ([email protected]),
and University of Wurzburg (Computer Vision Lab). Bin Ren1,3,4 ([email protected]),
Luc Van Gool1 ([email protected]), 1
Xiaomi Inc.
Radu Timofte5 ([email protected]) 2
Beijing University of Posts and Telecommunications
Yawei Li6 ([email protected]),
Affiliations:
1
INSAIT,Sofia University,”St.Kliment Ohridski”, Bulgaria Pixel Purifiers
2
Tsinghua University, China Title: Denoiser using Restormer and Hard Dataset Mining
3
University of Pisa, Italy Members:
4
University of Trento, Italy Deepak Kumar Tyagi1 ([email protected]),
5
University of Würzburg, Germany Aman Kukretti1 , Gajender Sharma1 , Sriharsha
6
ETH Zürich, Switzerland Koundinya , Asim Manna1
1

Affiliations:
1
Samsung R&D Institute India - Bangalore (SRI-B)
Samsung MX (Mobile eXperience) Business &
Samsung R&D Institute China - Beijing (SRC-B)
Title: Dynamic detail-enhanced image denoising frame- Alwaysu
work Title: Bias-Tuning Enables Efficient Image Denoising
Members: Members:
Xiangyu Kong1 ([email protected]), Hyunhee Jun Cheng 1 ([email protected]), Shan Tan 1
Park2 , Xiaoxuan Yu1 , Suejin Han2 , Hakjae Jeon2 , Jia Li1 , Affiliations:
Hyung-Ju Chun2 1
Huazhong University of Science and Technology
Affiliations:
1
Samsung R&D Institute China - Beijing (SRC-B) Tcler Denosing
2
Department of Camera Innovation Group, Samsung Title: Tcler Denoising
Electronics Members:
Jun Liu1,2 ([email protected]), Jiangwei Hao1,2 , Jianping
Luo1,2 , Jie Lu1,2
SNUCV
Affiliations:
Title: Deep ensemble for Image denoising 1
TCL Corporate Research
Members: 2
TCL Science Park International E City - West Zone,
Donghun Ryou1 ([email protected]), Inju Ha1 , Bohyung Building D4
Han1
Affiliations:
1
Seoul National University cipher vision
Title: Pureformer: Transformer-Based Image Denoising
Members:
BuptMM
Satya Narayan Tazi1 ([email protected]), Arnim
Title: DDU——Image Denoising Unit using transformer Gautam1 , Aditi Pawar1 , Aishwarya Joshi2 , Akshay
and morphology method Dudhane3 , Praful Hambadre4 , Sachin Chaudhary5 , Santosh
Members: Kumar Vipparthi5 , Subrahmanyam Murala6 ,
Jingyu Ma1 ([email protected]), Zhijuan Huang2 , Affiliations:
Huiyuan Fu1 , Hongyuan Yu2 , Boqi Zhang1 , Jiawei Shi1 , 1
Government Engineering College Ajmer
Heng Zhang2 , Huadong Ma1 2
Mohamed bin Zayed University of Artificial Intelligence
Affiliations: gence, Abu Dhabi
1 3
Beijing University of Posts and Telecommunications University of Petroleum and Energy Studies, Dehradun
2 4
Xiaomi Inc., China Indian Institute of Technology, Mandi
5
Indian Institute of Technology, Ropar
6
Trinity College Dublin, Ireland
HMiDenoise
Title: Hybrid Denosing Method Based on HAT
Members:
Sky-D
Zhijuan Huang1 (huang [email protected]), Jingyu Ma2 , Title: A Two-Stage Denoising Framework with General-
Hongyuan Yu1 , Heng Zhang1 , Huiyuan Fu2 , Huadong Ma2 ized Denoising Score Matching Pretraining and Supervised
Affiliations: Fine-tuning
Members: Wadii Boulila1
Jiachen Tu1 ([email protected])
Affiliations: Affiliations:
1 1
University of Illinois Urbana-Champaign Robotics and Internet-of-Things Laboratory, Prince
Sultan University, Riyadh, Saudi Arabia

KLETech-CEVI
Aurora
Title: HNNFormer: Hierarchical Noise-Deinterlace
Title: GAN + NAFNet: A Powerful Combination for
Transformer for Image Denoising
High-Quality Image Denoising
Members:
Members:
Nikhil Akalwadi1,3 ([email protected]), Vi-
JanSeny ([email protected]), Pei Zhou
jayalaxmi Ashok Aralikatti1,3 , Dheeraj Damodar Hegde2,3 ,
G Gyaneshwar Rao2,3 , Jatin Kalal2,3 , Chaitra Desai1,3 ,
Ramesh Ashok Tabib2,3 , Uma Mudenagudi2,3 mpu ai
Affiliations: Title: Enhanced Blind Image Restoration with Channel
1
School of Computer Science and Engineering, KLE Attention Transformers and Multi-Scale Attention Prompt
Technological University Learning
2
School of Electronics and Communication Engineering, Members:
KLE Technological University Jianhua Hu1 ([email protected]), K. L. Eddie Law1
3
Center of Excellence in Visual Intelligence (CEVI), KLE Affiliations:
Technological University 1
Macao Polytechnic University

xd denoise OptDenoiser
Title: SCUNet for image denoising Title: Towards two-stage OptDenoiser framework for
Members: image denoising.
Zhenyuan Lin1 ([email protected]), Yubo Members:
Dong1 , Weikun Li2 , Anqi Li1 , Ang Gao1 Jaeho Lee 1 ([email protected] ), M.J. Aashik Rasool1 ,
Affiliations: Abdur Rehman1 , SMA Sharif1 , Seongwan Kim1
1 Affiliations:
Xidian University
2 1
Guilin University Of Electronic Technology Opt-AI Inc, Marcus Building, Magok, Seoul, South Korea

JNU620 AKDT
Title: High-resolution Image Denoising via Adaptive
Title: Image Denoising using NAFNet and RCAN
Kernel Dilation Transformer
Members:
Members:
Weijun Yuan1 ([email protected]), Zhan Li1 ,
Alexandru Brateanu1 ([email protected]
Ruting Deng1 , Yihang Chen1 , Yifan Deng1 , Zhanglu
chester.ac.uk ), Raul Balmez1 , Ciprian Orhei2 , Cosmin
Chen1 , Boyang Yao1 , Shuling Zheng2 , Feng Zhang1 ,
Ancuti2
Zhiheng Fu1
Affiliations:
Affiliations: 1
1 University of Manchester - Manchester, United Kingdom
Jinan University 2
2 Polytechnica University Timisoara - Timisoara, Romania
Guangdong University of Foreign Studies

X-L
PSU team
Title: MixEnsemble
Title: OptimalDiff: High-Fidelity Image Enhancement Members:
Using Schrödinger Bridge Diffusion and Multi-Scale Zeyu Xiao1 ([email protected]), Zhuoyuan Li2
Adversarial Refinement Affiliations:
1
National University of Singapore
2
Members: University of Science and Technology of China
Anas M. Ali1 ([email protected]), Bilel Benjdira1 ,
Whitehairbin ceedings of the IEEE/CVF international conference on com-
puter vision, pages 12504–12513, 2023. 19
Title: Diffusion-based Denoising Model
[10] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun.
Simple baselines for image restoration. In European confer-
Members: ence on computer vision, pages 17–33. Springer, 2022. 3,
Ziqi Wang1 ([email protected]), Yanyan Wei1 , Fei 14
Wang1 , Kun Li1 , Shengeng Tang1 , Yunkai Zhang1 [11] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao,
and Chao Dong. Activating more pixels in image super-
Affiliations: resolution transformer. In Proceedings of the IEEE/CVF con-
1
Hefei University of Technology, China ference on computer vision and pattern recognition, pages
22367–22377, 2023. 5
mygo [12] Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun,
Title: High-resolution Image Denoising via Unet neural Zongwei Wu, Radu Timofte, Yulun Zhang, et al. NTIRE
network 2025 challenge on image super-resolution (×4): Methods
Members: and results. In Proceedings of the IEEE/CVF Conference
Weirun Zhou1 ([email protected]), Haoxuan Lu2 on Computer Vision and Pattern Recognition (CVPR) Work-
shops, 2025. 2
[13] Zheng Chen, Jingkai Wang, Kai Liu, Jue Gong, Lei Sun,
Affiliations:
1 Zongwei Wu, Radu Timofte, Yulun Zhang, et al. NTIRE
Xidian University
2 2025 challenge on real-world face restoration: Methods and
China University of Mining and Technology results. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) Work-
References shops, 2025. 2
[1] Kodak dataset. http://r0k.us/graphics/kodak/. [14] Xiaojie Chu, Liangyu Chen, Chengpeng Chen, and Xin Lu.
19 Revisiting global statistics aggregation for improving image
[2] Eirikur Agustsson and Radu Timofte. NTIRE 2017 chal- restoration. arXiv preprint arXiv:2112.04491, 2(4):5, 2021.
lenge on single image super-resolution: Dataset and study. 14
In Proceedings of the IEEE Conference on Computer Vision [15] Xiaojie Chu, Liangyu Chen, Chengpeng Chen, and Xin Lu.
and Pattern Recognition Workshops, pages 126–135, 2017. Improving image restoration by revisiting global information
2, 5, 8, 11, 14, 18 aggregation. In European Conference on Computer Vision,
[3] Yuval Becker, Raz Z Nossek, and Tomer Peleg. Make the pages 53–71. Springer, 2022. 14
most out of your net: Alternating between canonical and [16] Marcos Conde, Radu Timofte, et al. NTIRE 2025 challenge
hard datasets for improved image demosaicing. CoRR, 2023. on raw image restoration and super-resolution. In Proceed-
6 ings of the IEEE/CVF Conference on Computer Vision and
[4] Alexandru Brateanu and Raul Balmez. Kolmogorov-arnold Pattern Recognition (CVPR) Workshops, 2025. 2
networks in transformer attention for low-light image en- [17] Marcos Conde, Radu Timofte, et al. Raw image reconstruc-
hancement. In 2024 International Symposium on Electronics tion from RGB on smartphones. NTIRE 2025 challenge re-
and Telecommunications (ISETC), pages 1–4. IEEE, 2024. port. In Proceedings of the IEEE/CVF Conference on Com-
20 puter Vision and Pattern Recognition (CVPR) Workshops,
[5] Alexandru Brateanu, Raul Balmez, Adrian Avram, and 2025. 2
Ciprian Orhei. Akdt: Adaptive kernel dilation transformer [18] Egor Ershov, Sergey Korchagin, Alexei Khalin, Artyom Pan-
for effective image denoising. Proceedings Copyright, 418: shin, Arseniy Terekhin, Ekaterina Zaychenkova, Georgiy
425. 19 Lobarev, Vsevolod Plokhotnyuk, Denis Abramov, Elisey
[6] Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Cosmin Zhdanov, Sofia Dorogova, Yasin Mamedov, Nikola Banic,
Ancuti, and Codruta Ancuti. Enhancing low-light images Georgii Perevozchikov, Radu Timofte, et al. NTIRE 2025
with kolmogorov–arnold networks in transformer attention. challenge on night photography rendering. In Proceedings
Sensors, 25(2):327, 2025. 20 of the IEEE/CVF Conference on Computer Vision and Pat-
[7] Matthew Brown and David G Lowe. Automatic panoramic tern Recognition (CVPR) Workshops, 2025. 2
image stitching using invariant features. International jour- [19] Yuqian Fu, Xingyu Qiu, Bin Ren Yanwei Fu, Radu Timofte,
nal of computer vision, 74:59–73, 2007. 7 Nicu Sebe, Ming-Hsuan Yang, Luc Van Gool, et al. NTIRE
[8] Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. Tinytl: 2025 challenge on cross-domain few-shot object detection:
Reduce memory, not parameters for efficient on-device Methods and results. In Proceedings of the IEEE/CVF Con-
learning. Advances in Neural Information Processing Sys- ference on Computer Vision and Pattern Recognition (CVPR)
tems, 33:11285–11297, 2020. 7 Workshops, 2025. 2
[9] Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Tim- [20] Shuhang Gu and Radu Timofte. A brief review of image
ofte, and Yulun Zhang. Retinexformer: One-stage retinex- denoising algorithms and beyond. Inpainting and Denoising
based transformer for low-light image enhancement. In Pro- Challenges, pages 1–21, 2019. 1
[21] Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, [31] Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu,
Tao Dai, Shu-Tao Xia, and Yawei Li. Mambairv2: Attentive Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Deman-
state space restoration. arXiv preprint arXiv:2411.15269, dolx, et al. Lsdir: A large scale dataset for image restoration.
2024. 4, 8 In Proceedings of the IEEE/CVF Conference on Computer
[22] Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Vision and Pattern Recognition Workshops, 2023. 2, 5, 8,
Chunle Guo, Chongyi Li, Radu Timofte, et al. NTIRE 2025 11, 14
challenge on text to image generation model quality assess- [32] Yawei Li, Yulun Zhang, Radu Timofte, Luc Van Gool, Zhi-
ment. In Proceedings of the IEEE/CVF Conference on Com- jun Tu, Kunpeng Du, Hailing Wang, Hanting Chen, Wei Li,
puter Vision and Pattern Recognition (CVPR) Workshops, Xiaofei Wang, et al. Ntire 2023 challenge on image denois-
2025. 2 ing: Methods and results. In Proceedings of the IEEE/CVF
[23] Varun Jain, Zongwei Wu, Quan Zou, Louis Florentin, Hen- Conference on Computer Vision and Pattern Recognition,
rik Turbell, Sandeep Siddhartha, Radu Timofte, et al. NTIRE pages 1905–1921, 2023. 3
2025 challenge on video quality enhancement for video con- [33] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc
ferencing: Datasets, methods and results. In Proceedings of Van Gool, and Radu Timofte. Swinir: Image restoration us-
the IEEE/CVF Conference on Computer Vision and Pattern ing swin transformer. In Proceedings of the IEEE/CVF inter-
Recognition (CVPR) Workshops, 2025. 2 national conference on computer vision, pages 1833–1844,
[24] Amogh Joshi, Nikhil Akalwadi, Chinmayee Mandi, Chaitra 2021. 20
Desai, Ramesh Ashok Tabib, Ujwala Patil, and Uma Mude- [34] Jie Liang, Radu Timofte, Qiaosi Yi, Zhengqiang Zhang,
nagudi. Hnn: Hierarchical noise-deinterlace net towards im- Shuaizheng Liu, Lingchen Sun, Rongyuan Wu, Xindong
age denoising. In Proceedings of the IEEE/CVF Conference Zhang, Hui Zeng, Lei Zhang, et al. NTIRE 2025 the 2nd
on Computer Vision and Pattern Recognition, pages 3007– restore any image model (RAIM) in the wild challenge. In
3016, 2024. 11 Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR) Workshops, 2025. 2
[25] Cansu Korkmaz and A Murat Tekalp. Training transformer
[35] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and
models by wavelet losses improves quantitative and visual
Kyoung Mu Lee. Enhanced deep residual networks for single
performance in single image super-resolution. In Proceed-
image super-resolution. In Proceedings of the IEEE confer-
ings of the IEEE/CVF Conference on Computer Vision and
ence on computer vision and pattern recognition workshops,
Pattern Recognition, pages 6661–6670, 2024. 3, 4
pages 136–144, 2017. 7, 11
[26] Edwin H Land and John J McCann. Lightness and retinex theory. Journal of the Optical Society of America, 61(1):1–11, 1971. 19
[27] Sangmin Lee, Eunpil Park, Angel Canelo, Hyunhee Park, Youngjo Kim, Hyungju Chun, Xin Jin, Chongyi Li, Chun-Le Guo, Radu Timofte, et al. NTIRE 2025 challenge on efficient burst HDR and restoration: Datasets, methods, and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[28] Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby Tan, Radu Timofte, et al. NTIRE 2025 challenge on day and night raindrop removal for dual-focused images: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[29] Xin Li, Xijun Wang, Bingchen Li, Kun Yuan, Yizhen Shao, Suhang Yao, Ming Sun, Chao Zhou, Radu Timofte, and Zhibo Chen. NTIRE 2025 challenge on short-form UGC video quality assessment and enhancement: KwaiSR dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[30] Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, et al. NTIRE 2025 challenge on short-form UGC video quality assessment and enhancement: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[36] Jingbo Lin, Zhilu Zhang, Yuxiang Wei, Dongwei Ren, Dongsheng Jiang, Qi Tian, and Wangmeng Zuo. Improving image restoration through removing degradations in textual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2866–2878, 2024. 5
[37] Xiaohong Liu, Xiongkuo Min, Qiang Hu, Xiaoyun Zhang, Jie Guo, et al. NTIRE 2025 XGC quality assessment challenge: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[38] Xiaoning Liu, Zongwei Wu, Florin-Alexandru Vasluianu, Hailong Yan, Bin Ren, Yulun Zhang, Shuhang Gu, Le Zhang, Ce Zhu, Radu Timofte, et al. NTIRE 2025 challenge on low light image enhancement: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 5
[40] Ziwei Luo, Fredrik K Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B Schön. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1680–1691, 2023. 21
[41] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision (ICCV), pages 416–423, 2001. 19
[42] Vaishnav Potlapalli, Syed Waqas Zamir, Salman H Khan, and Fahad Shahbaz Khan. PromptIR: Prompting for all-in-one image restoration. Advances in Neural Information Processing Systems, 36:71275–71293, 2023. 8
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 4
[44] Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, et al. The tenth NTIRE 2025 efficient super-resolution challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[45] Nickolay Safonov, Alexey Bryntsev, Andrey Moskalenko, Dmitry Kulikov, Dmitriy Vatolin, Radu Timofte, et al. NTIRE 2025 challenge on UGC video enhancement: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[46] SMA Sharif, Abdur Rehman, Zain Ul Abidin, Rizwan Ali Naqvi, Fayaz Ali Dharejo, and Radu Timofte. Illuminating darkness: Enhancing real-world low-light scenes with smartphone images. arXiv preprint arXiv:2503.06898, 2025. 19
[47] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. LIVE image quality assessment database release 2. http://live.ece.utexas.edu/research/quality/, 2006. 19
[48] Lei Sun, Andrea Alfarano, Peiqi Duan, Shaolin Su, Kaiwei Wang, Boxin Shi, Radu Timofte, Danda Pani Paudel, Luc Van Gool, et al. NTIRE 2025 challenge on event-based image deblurring: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[49] Lei Sun, Hang Guo, Bin Ren, Luc Van Gool, Radu Timofte, Yawei Li, et al. The tenth NTIRE 2025 image denoising challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[50] Radu Timofte, Rasmus Rothe, and Luc Van Gool. Seven ways to improve example-based single image super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1865–1873, 2016. 8
[51] Jiachen Tu, Yaokun Shi, and Fan Lam. Score-based self-supervised MRI denoising. In The Thirteenth International Conference on Learning Representations, 2025. 9, 10
[52] Stefan Van der Walt, Johannes L Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D Warner, Neil Yager, Emmanuelle Gouillart, and Tony Yu. scikit-image: Image processing in Python. PeerJ, 2:e453, 2014. 11
[53] Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Cailian Chen, Zongwei Wu, Radu Timofte, et al. NTIRE 2025 image shadow removal challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[54] Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Zongwei Wu, Radu Timofte, et al. NTIRE 2025 ambient lighting normalization challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[55] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1905–1914, 2021. 8
[56] Yingqian Wang, Zhengyu Liang, Fengyuan Zhang, Lvli Tian, Longguang Wang, Juncheng Li, Jungang Yang, Radu Timofte, Yulan Guo, et al. NTIRE 2025 challenge on light field image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[57] Kangning Yang, Jie Cai, Ling Ouyang, Florin-Alexandru Vasluianu, Radu Timofte, Jiaming Ding, Huiming Sun, Lan Fu, Jinlong Li, Chiu Man Ho, Zibo Meng, et al. NTIRE 2025 challenge on single image reflection removal in the wild: Datasets, methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[58] Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, Alex Costanzino, Matteo Poggi, Samuele Salti, Stefano Mattoccia, et al. NTIRE 2025 challenge on HR depth from images of specular and transparent surfaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[59] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022. 3, 4, 5, 6, 7, 8, 10, 20
[60] Jiale Zhang, Yulun Zhang, Jinjin Gu, Jiahua Dong, Linghe Kong, and Xiaokang Yang. Xformer: Hybrid X-shaped transformer for image denoising. arXiv preprint arXiv:2303.06440, 2023. 4, 12, 20
[61] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017. 1
[62] Kai Zhang, Yawei Li, Jingyun Liang, Jiezhang Cao, Yulun Zhang, Hao Tang, Deng-Ping Fan, Radu Timofte, and Luc Van Gool. Practical blind image denoising via Swin-Conv-UNet and data synthesis. Machine Intelligence Research, 20(6):822–836, 2023. 8, 12
[63] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018. 14
