NTIRE 2025 Challenge On Image Denoising Methods and Results
Lei Sun∗ Hang Guo∗ Bin Ren∗ Luc Van Gool∗ Radu Timofte∗ Yawei Li∗
Xiangyu Kong Hyunhee Park Xiaoxuan Yu Suejin Han Hakjae Jeon Jia Li
Hyung-Ju Chun Donghun Ryou Inju Ha Bohyung Han Jingyu Ma
Zhijuan Huang Huiyuan Fu Hongyuan Yu Boqi Zhang Jiawei Shi Heng Zhang
Huadong Ma Deepak Kumar Tyagi Aman Kukretti Gajender Sharma
Sriharsha Koundinya Asim Manna Jun Cheng Shan Tan Jun Liu Jiangwei Hao
Jianping Luo Jie Lu Satya Narayan Tazi Arnim Gautam Aditi Pawar
Aishwarya Joshi Akshay Dudhane Praful Hambadre Sachin Chaudhary
Santosh Kumar Vipparthi Subrahmanyam Murala Jiachen Tu Nikhil Akalwadi
Vijayalaxmi Ashok Aralikatti Dheeraj Damodar Hegde G Gyaneshwar Rao Jatin Kalal
Chaitra Desai Ramesh Ashok Tabib Uma Mudenagudi Zhenyuan Lin Yubo Dong
Weikun Li Anqi Li Ang Gao Weijun Yuan Zhan Li Ruting Deng
Yihang Chen Yifan Deng Zhanglu Chen Boyang Yao Shuling Zheng
Feng Zhang Zhiheng Fu Anas M. Ali Bilel Benjdira Wadii Boulila JanSeny
Pei Zhou Jianhua Hu K. L. Eddie Law Jaeho Lee M. J. Aashik Rasool
Abdur Rehman SMA Sharif Seongwan Kim Alexandru Brateanu Raul Balmez
Ciprian Orhei Cosmin Ancuti Zeyu Xiao Zhuoyuan Li Ziqi Wang Yanyan Wei
Fei Wang Kun Li Shengeng Tang Yunkai Zhang Weirun Zhou Haoxuan Lu
Abstract

This paper presents an overview of the NTIRE 2025 Image Denoising Challenge (σ = 50), highlighting the proposed methodologies and corresponding results. The primary objective is to develop a network architecture capable of achieving high-quality denoising performance, quantitatively evaluated using PSNR, without constraints on computational complexity or model size. The task assumes independent additive white Gaussian noise (AWGN) with a fixed noise level of 50. A total of 290 participants registered for the challenge, with 20 teams successfully submitting valid results, providing insights into the current state-of-the-art in image denoising.

∗ L. Sun ([email protected], INSAIT, Sofia University “St. Kliment Ohridski”), H. Guo, B. Ren ([email protected], University of Pisa & University of Trento, Italy), L. Van Gool, R. Timofte, and Y. Li were the challenge organizers, while the other authors participated in the challenge. Appendix A contains the authors’ teams and affiliations.
NTIRE 2025 webpage: https://cvlai.net/ntire/2025/.
Code: https://github.com/AHupuJR/NTIRE2025_Dn50_challenge.

1. Introduction

Image denoising is a fundamental problem in low-level vision, where the objective is to reconstruct a noise-free image from its degraded counterpart. During image acquisition and processing, various types of noise can be introduced, such as Gaussian noise, Poisson noise, and compression artifacts from formats like JPEG. The presence of these noise sources makes denoising a particularly challenging task. Given the importance of image denoising in applications such as computational photography, medical imaging, and remote sensing, continuous research efforts are necessary to develop more efficient and generalizable denoising solutions [20, 61].

To further advance research in this area, this challenge aims to promote the development of denoising methods. A widely used benchmark for fair performance evaluation is the additive white Gaussian noise (AWGN) model, which serves as the standard setting in this competition.

As part of the New Trends in Image Restoration and Enhancement (NTIRE) 2025 workshop, we organized the Image Denoising Challenge. The objective is to restore clean images from inputs corrupted by AWGN with a noise level of σ = 50.
This competition seeks to foster innovative solutions, establish performance benchmarks, and explore emerging trends in the design of image denoising networks; we hope the methods in this challenge will shed light on image denoising.

This challenge is one of the NTIRE 2025 Workshop associated challenges on: ambient lighting normalization [54], reflection removal in the wild [57], shadow removal [53], event-based image deblurring [48], image denoising [49], XGC quality assessment [37], UGC video enhancement [45], night photography rendering [18], image super-resolution (x4) [12], real-world face restoration [13], efficient super-resolution [44], HR depth estimation [58], efficient burst HDR and restoration [27], cross-domain few-shot object detection [19], short-form UGC video quality assessment and enhancement [29, 30], text to image generation model quality assessment [22], day and night raindrop removal for dual-focused images [28], video quality assessment for video conferencing [23], low light image enhancement [38], light field super-resolution [56], restore any image model (RAIM) in the wild [34], raw restoration and super-resolution [16] and raw reconstruction from RGB on smartphones [17].

2. NTIRE 2025 Image Denoising Challenge

The objectives of this challenge are threefold: (1) to stimulate advancements in image denoising research, (2) to enable a fair and comprehensive comparison of different denoising techniques, and (3) to create a collaborative environment where academic and industry professionals can exchange ideas and explore potential partnerships.

In the following sections, we provide a detailed overview of the challenge, including its dataset, evaluation criteria, challenge results, and the methodologies employed by participating teams. By establishing a standardized benchmark, this challenge aims to push the boundaries of current denoising approaches and foster innovation in the field.

2.1. Dataset

The widely used DIV2K [2] dataset and LSDIR [31] dataset are utilized for the challenge.
DIV2K dataset comprises 1,000 diverse RGB images at 2K resolution, partitioned into 800 images for training, 100 images for validation, and 100 images for testing.
LSDIR dataset consists of 86,991 high-resolution, high-quality images, with 84,991 images allocated for training, 1,000 images for validation, and 1,000 images for testing.
Participants were provided with training images from both the DIV2K and LSDIR datasets. During the validation phase, the 100 images from the DIV2K validation set were made accessible to them. In the test phase, evaluation was conducted using 100 images from the DIV2K test set and an additional 100 images from the LSDIR test set. To ensure a fair assessment, the ground-truth noise-free images for the test phase remained hidden from participants throughout the challenge.

2.2. Tracks and Competition

The goal is to develop a network architecture that can generate high-quality denoising results, with performance evaluated based on the peak signal-to-noise ratio (PSNR) metric.

Challenge phases. (1) Development and validation phase: Participants were provided with 800 clean training images and 100 clean/noisy image pairs from the DIV2K dataset, along with an additional 84,991 clean images from the LSDIR dataset. During the training process, noisy images were generated by adding Gaussian noise with a noise level of σ = 50. Participants had the opportunity to upload their denoising results to the CodaLab evaluation server, where the PSNR of the denoised images was computed, offering immediate feedback on their model’s performance. (2) Testing phase: In the final test phase, participants were given access to 100 noisy test images from the DIV2K dataset and 100 noisy test images from the LSDIR dataset, while the corresponding clean ground-truth images remained concealed. Participants were required to submit their denoised images to the CodaLab evaluation server and send their code and factsheet to the organizers. The organizers then verified the submitted code and ran it to compute the final results, which were shared with participants at the conclusion of the challenge.

Evaluation protocol. The primary objective of this challenge is to promote the development of accurate image denoising networks. Hence, PSNR and SSIM metrics are used for quantitative evaluation, based on the 200 test images. A code example for calculating these metrics can be found at https://github.com/AHupuJR/NTIRE2025_Dn50_challenge. Additionally, the code for the submitted solutions, along with the pre-trained weights, is also provided in this repository. Note that computational complexity and model size are not factored into the final ranking of the participants.
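For reference, a minimal sketch of how PSNR (and SSIM via scikit-image) can be computed on 8-bit RGB images is given below. The exact evaluation script lives in the challenge repository linked above; the function and variable names here are illustrative assumptions only.

```python
import numpy as np
from skimage.metrics import structural_similarity  # requires scikit-image >= 0.19 for channel_axis

def psnr(clean: np.ndarray, denoised: np.ndarray, data_range: float = 255.0) -> float:
    # Peak signal-to-noise ratio between two images of identical shape.
    mse = np.mean((clean.astype(np.float64) - denoised.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((data_range ** 2) / mse)

def ssim(clean: np.ndarray, denoised: np.ndarray, data_range: float = 255.0) -> float:
    # Structural similarity on HxWx3 arrays; channel_axis selects the color axis.
    return structural_similarity(clean, denoised, data_range=data_range, channel_axis=-1)

# Example usage over the 200 test images (paths are placeholders):
# scores = [psnr(load(gt), load(pred)) for gt, pred in zip(gt_paths, pred_paths)]
```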
3. Challenge Results

Table 1 presents the final rankings and results of the participating teams. Detailed descriptions of each team’s implementation are provided in Sec. 4, while team member information can be found in Appendix A. SRC-B secured first place in terms of PSNR, achieving a 1.25 dB advantage over the second-best entry. SNUCV and BuptMM ranked second and third, respectively.
Team              Rank  PSNR (primary)  SSIM
SRC-B               1       31.20       0.8884
SNUCV               2       29.95       0.8676
BuptMM              3       29.89       0.8664
HMiDenoise          4       29.84       0.8653
Pixel Purifiers     5       29.83       0.8652
Alwaysu             6       29.80       0.8642
Tcler Denoising     7       29.78       0.8632
cipher vision       8       29.64       0.8601
Sky-D               9       29.61       0.8602
KLETech-CEVI       10       29.60       0.8602
xd denoise         11       29.58       0.8597
JNU620             12       29.55       0.8590
PSU team           12       29.55       0.8598
Aurora             14       29.51       0.8605
mpu ai             15       29.30       0.8499
OptDenoiser        16       28.95       0.8422
AKDT               17       28.83       0.8374
X-L                18       26.85       0.7836
Whitehairbin       19       26.83       0.8010
mygo               20       24.92       0.6972

Table 1. Results of NTIRE 2025 Image Denoising Challenge. PSNR and SSIM scores are measured on the 200 test images from the DIV2K test set and LSDIR test set. Team rankings are based primarily on PSNR.

3.1. Participants

This year, the challenge attracted 290 registered participants, with 20 teams successfully submitting valid results. Compared to the previous challenge [32], the SRC-B team’s solution outperformed the top-ranked method from 2023 by 1.24 dB. Notably, the results achieved by the top six teams this year surpassed those of their counterparts from the previous edition, establishing a new benchmark for image denoising.

3.2. Main Ideas and Architectures

During the challenge, participants implemented a range of novel techniques to enhance image denoising performance. Below, we highlight some of the fundamental strategies adopted by the leading teams.
1. Hybrid architecture performs well. All the models from the top-3 teams adopted a hybrid architecture that combines transformer-based and convolution-based networks. Both global features from the transformer and local features from the convolutional network are useful for image denoising. SNUCV further adopted a model ensemble to push the limit.
2. Data is important. This year’s winning team, SRC-B, adopted a data selection process to mitigate the influence of data imbalance, and also selected high-quality images in the dataset for training instead of training on the whole DIV2K and LSDIR datasets.
3. The devil is in the details. A Wavelet Transform loss [25] is utilized by the winning team, which is proven to help the model escape from local optima. Tricks such as a progressive learning strategy also work well. A higher percentage of overlap between patches during inference also leads to higher PSNR. Ensemble techniques effectively improve the performance.
4. New Mamba-based design. SNUCV, the second-ranking team, leveraged the MambaIRv2 framework to design a hybrid architecture, combining the efficient sequence modeling capabilities of Mamba with image restoration objectives.
5. Self-ensemble or model ensembling is adopted by some of the teams to improve performance.

3.3. Fairness

To uphold the fairness of the image denoising challenge, several rules were established, primarily regarding the datasets used for training. First, participants were allowed to use additional external datasets, such as Flickr2K, for training. However, training on the DIV2K validation set, including either high-resolution (HR) or low-resolution (LR) images, was strictly prohibited, as this set was designated for evaluating the generalization ability of the models. Similarly, training with the LR images from the DIV2K test set was not permitted. Lastly, employing advanced data augmentation techniques during training was considered acceptable and within the scope of fair competition.

4. Challenge Methods and Teams

4.1. Samsung MX (Mobile eXperience) Business & Samsung R&D Institute China - Beijing (SRC-B)

4.1.1. Model Framework

The proposed solution is shown in Figure 1. In recent years, the Transformer structure has shown excellent performance in image denoising tasks due to its advantages in capturing global context. However, it is found that pure Transformer architectures are relatively weak in recovering local features and details. On the other hand, CNN-based methods excel in detail recovery but struggle to effectively capture global context information. Therefore, they designed a network that combines the strengths of the transformer network Restormer [59] and the convolutional network NAFNet [10]. They first extract global features using the Transformer network and then enhance detail information using the convolutional network. The denoising network’s structure follows Restormer, while the detail enhancement network draws inspiration from NAFNet. Finally, they dynamically fuse the two features from the transformer network and the convolutional network through a set of learnable parameters to balance denoising and detail preservation, thereby improving the overall performance of image denoising.

Figure 1. Framework of the hybrid network proposed by Team SRC-B.
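As an illustration of the dynamic fusion described above, the following is a minimal PyTorch sketch of blending a transformer-branch output with a CNN-branch output through learnable weights; it is not the team’s actual code, and the module and tensor names are assumptions.

```python
import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    """Blend two restored images (or feature maps) with learnable weights."""

    def __init__(self):
        super().__init__()
        # Two scalars, normalized with softmax so the weights stay on the simplex.
        self.logits = nn.Parameter(torch.zeros(2))

    def forward(self, transformer_out: torch.Tensor, cnn_out: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)
        return w[0] * transformer_out + w[1] * cnn_out

# Usage: fused = LearnableFusion()(restormer_branch(x), nafnet_branch(x))
```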
4.1.2. Dataset and Training Strategy

Dataset. Three datasets are used in total: the DIV2K dataset, the LSDIR dataset, and a self-collected custom dataset consisting of 2 million images. The specific ways in which they utilized these training sets across different training phases are detailed in the training details below. In the final fine-tuning phase, they construct a high-quality dataset consisting of 1,000 images from LSDIR, 1,000 images from the custom dataset, and all 800 images from DIV2K. The data selection process includes:
• Image resolution: Keep only images with a resolution greater than 900×900.
• Image quality: Keep only images that rank in the top 30% for all three metrics: Laplacian variance, BRISQUE, and NIQE.
• Semantic selection: To achieve semantic balance, they conducted a semantic selection based on CLIP [43] features to ensure that the dataset reflects diverse and representative content across various scene categories.

Training. The model training consists of three stages. In the first stage, they pre-train the entire network using a custom dataset of 2 million images, with an initial learning rate of 1e−4 and a training time of approximately 360 hours. In the second stage, they fine-tune the detail enhancement network module using the DIV2K and LSDIR datasets, with an initial learning rate of 1e−5 and a training duration of about 240 hours, which enhanced the model’s ability to restore details. In the third stage, they select 1,000 images from the custom dataset, 1,000 images from the LSDIR data, and 800 images from DIV2K as the training set. With an initial learning rate of 1e−6, they fine-tuned the entire network for approximately 120 hours.

The model is trained by alternately iterating L1 loss, L2 loss, and Stationary Wavelet Transform (SWT) loss [25]. They found that adding SWT loss during training helps the model escape from local optima. They also perform progressive learning, where the network is trained on image patch sizes gradually enlarged from 256 to 448 and 768. As the patch size increases, the performance gradually improves. The model was trained on an A100 80G GPU.
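The SWT loss itself is described in [25]; below is only a minimal single-level, Haar-wavelet approximation of a stationary wavelet transform loss in PyTorch, written for illustration — the filters, decomposition levels, and weighting used by the team may differ.

```python
import torch
import torch.nn.functional as F

def _haar_kernels(channels: int, device) -> torch.Tensor:
    lo = torch.tensor([0.5, 0.5], device=device)
    hi = torch.tensor([0.5, -0.5], device=device)
    bands = [torch.outer(a, b) for a in (lo, hi) for b in (lo, hi)]  # LL, LH, HL, HH
    k = torch.stack(bands).unsqueeze(1)        # (4, 1, 2, 2)
    return k.repeat(channels, 1, 1, 1)         # depthwise kernels: (4*C, 1, 2, 2)

def swt_l1_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between single-level undecimated Haar subbands of two (B, C, H, W) images."""
    c = pred.shape[1]
    k = _haar_kernels(c, pred.device)

    def swt(x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (0, 1, 0, 1), mode="reflect")   # keep spatial size (no decimation)
        return F.conv2d(x, k, stride=1, groups=c)

    return F.l1_loss(swt(pred), swt(target))
```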
4.2. SNUCV

Method. As shown in Figure 2, the network architecture they utilized consists of MambaIRv2 [21], Xformer [60], and Restormer [59]. These networks were first trained on Gaussian noise with a standard deviation of 50. Subsequently, the outputs of these networks are concatenated with the noisy image, which is then used as input to the ensemble model. In addition to the output, the features from the deepest layers of these networks are also concatenated and integrated into the deepest layer features of the ensemble network. This approach ensures that the feature information from the previous networks is preserved and effectively transferred to the ensemble network without loss. The ensemble model is designed based on Xformer, accepting an input with 12 channels. Its deepest layer is structured to incorporate the concatenated features of the previous models. These concatenated features are then processed through a 1×1 convolution to reduce the channel dimension back to that of the original network, thus alleviating the computational burden. Additionally, while Xformer and Restormer reduce the feature size in their deep layer, MambaIRv2 retains its original feature size without reduction. To align the sizes for concatenation, the features of MambaIRv2 were downscaled by a factor of 8 before being concatenated.

Figure 2. The overview of the deep ensemble pipeline proposed by Team SNUCV.
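A minimal sketch of the input construction for such a deep ensemble is shown below: three frozen denoiser outputs are concatenated with the noisy image into a 12-channel tensor, and the concatenated deep features are merged back to the base width with a 1×1 convolution. Names and channel sizes are illustrative assumptions, not the team’s code.

```python
import torch
import torch.nn as nn

class EnsembleInput(nn.Module):
    def __init__(self, base_channels: int = 48, num_branches: int = 3):
        super().__init__()
        # Reduce concatenated deep features back to the ensemble network's width.
        self.reduce = nn.Conv2d(base_channels * (num_branches + 1), base_channels, kernel_size=1)

    def forward(self, noisy, branch_outputs, branch_deep_feats, ensemble_deep_feat):
        # branch_outputs: list of 3 denoised RGB predictions -> 3*3 + 3 = 12 input channels.
        x_in = torch.cat(branch_outputs + [noisy], dim=1)
        # Merge frozen-branch deep features into the ensemble's deepest layer.
        fused_deep = self.reduce(torch.cat(branch_deep_feats + [ensemble_deep_feat], dim=1))
        return x_in, fused_deep
```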
Training details. They first train the denoising networks, and then they incorporate the frozen denoising networks to train the ensemble model. Both the denoising models and the ensemble model were trained exclusively using the DIV2K [2] and LSDIR [31] datasets. Training was performed using the AdamW [39] optimizer with hyperparameters \beta_1 = 0.9 and \beta_2 = 0.999, and a learning rate of 3 \times 10^{-4}. All models were trained for a total of 300,000 iterations. For the denoising models, Restormer and Xformer were trained using a progressive training strategy to enhance robustness and efficiency. Patch sizes were progressively increased as [128, 160, 192, 256, 320, 384], with corresponding batch sizes of [8, 5, 4, 2, 1, 1]. In contrast, MambaIRv2 was trained with a more constrained setup due to GPU memory limitations, utilizing patch sizes of [128, 160] and batch sizes of [2, 1]. The ensemble model was trained with a progressive patch size schedule of [160, 192, 256, 320, 384, 448] and corresponding batch sizes of [8, 5, 4, 2, 1, 1]. The denoising models were trained using L1 loss, while the ensemble model was trained using a combination of L1 loss, MSE loss, and high-frequency loss.

Inference details. During the final inference stage to derive test results, they utilized a self-ensemble technique. Furthermore, inference was conducted using a patch-based sliding-window approach. Patch sizes were set at [256, 384, 512], with corresponding overlap values of [48, 64, 96]. The resulting outputs were subsequently averaged to optimize performance. This self-ensemble approach, while significantly increasing computational cost, substantially enhances performance.
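The patch-based sliding-window inference with overlap averaging can be sketched as follows (a single patch size for brevity; the team additionally averages over several patch sizes and overlaps). This is an illustrative implementation, not the submitted code.

```python
import torch

@torch.no_grad()
def tiled_inference(model, img: torch.Tensor, patch: int = 256, overlap: int = 48) -> torch.Tensor:
    """Denoise a (1, C, H, W) image by averaging overlapping patch predictions."""
    _, _, h, w = img.shape
    stride = patch - overlap
    out = torch.zeros_like(img)
    weight = torch.zeros_like(img)
    ys = list(range(0, max(h - patch, 0) + 1, stride))
    xs = list(range(0, max(w - patch, 0) + 1, stride))
    if h > patch and ys[-1] != h - patch:
        ys.append(h - patch)              # make sure the bottom border is covered
    if w > patch and xs[-1] != w - patch:
        xs.append(w - patch)              # make sure the right border is covered
    for y in ys:
        for x in xs:
            tile = img[:, :, y:y + patch, x:x + patch]
            out[:, :, y:y + patch, x:x + patch] += model(tile)
            weight[:, :, y:y + patch, x:x + patch] += 1.0
    return out / weight.clamp(min=1.0)    # average the overlapping regions
```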
4.3. BuptMM

Description. In recent years, the Transformer architecture has been widely used in image denoising tasks. In order to further explore the superiority of the two representative networks, Restormer [59] and HAT [11], they propose a dual-network & post-processing denoising model that combines the advantages of the former’s global attention mechanism and the latter’s channel attention mechanism.

As shown in Fig. 3, their network is divided into two stages. In the first stage, they use the DIV2K [2] and LSDIR [31] training sets to train Restormer [59] and HAT [11] respectively, and then enhance the ability of Restormer [59] through TLC [36] technology during its inference stage. In the second stage, they first use the Canny operator to perform edge detection on the images processed by the two models. They take an OR operation on the two edge images, and then XOR the result with the edge of HAT to obtain the edge difference between the two images. For this part of the edge difference, they use the result obtained by HAT [11] as the standard for preservation. Finally, they take the average of the other pixels of HAT [11] and Restormer [59] to obtain the final result.

They used the DIV2K [2] and LSDIR [31] datasets to train both the Restormer [59] and HAT [11] simultaneously. They employed a progressive training strategy for the Restormer [59] with a total of 292,000 iterations, where the image block size increased from 128 to 384 with a step size of 64. They also used a progressive training strategy for the HAT [11], where the image block size increased from 64 to 224. They did not use any other datasets besides the datasets mentioned above during the process. During the training phase, they spent one day separately training the Restormer [59] and HAT [11]; the two models were trained on 8 NVIDIA H100 GPUs. They conducted the inference on an NVIDIA H20, with a memory usage of 15 GB. The average inference time for a single image from the 200 test images was 4.4 seconds, while the average time for morphological post-processing was within 1 second.

Figure 3. The model architecture of DDU proposed by Team BuptMM.
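The edge-guided post-processing described above can be sketched with OpenCV as follows; the Canny thresholds and the exact per-pixel rule are assumptions for illustration, not the team’s implementation.

```python
import cv2
import numpy as np

def edge_guided_fusion(hat_img: np.ndarray, restormer_img: np.ndarray,
                       t1: int = 50, t2: int = 150) -> np.ndarray:
    """Fuse two denoised uint8 HxWx3 images using an edge-difference rule."""
    g_hat = cv2.cvtColor(hat_img, cv2.COLOR_BGR2GRAY)
    g_res = cv2.cvtColor(restormer_img, cv2.COLOR_BGR2GRAY)
    e_hat = cv2.Canny(g_hat, t1, t2)
    e_res = cv2.Canny(g_res, t1, t2)
    union = cv2.bitwise_or(e_hat, e_res)       # edges found by either model
    diff = cv2.bitwise_xor(union, e_hat)       # edge pixels missing from HAT's edge map
    fused = (hat_img.astype(np.float32) + restormer_img.astype(np.float32)) / 2.0
    mask = diff.astype(bool)                   # on disputed edge pixels, keep HAT's prediction
    fused[mask] = hat_img[mask]
    return fused.astype(np.uint8)
```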
4.4. HMiDenoise

The network is inspired by the HAT [11] model architecture, and the architecture is optimized specifically for the task. The optimized denoising network structure (D-HAT) is shown in Fig. 4. The dataset utilized for training comprises DIV2K and LSDIR. To accelerate training and achieve good performance, they initially train on a small scale (64×64) with …

4.5. Pixel Purifiers

Considering the dataset characteristics and their dataset ratio experiments, they found that a DIV2K-to-LSDIR ratio of 12:88 during training helps to improve overall PSNR and generalise the model better for both the validation and test datasets.

Overlapping Percentage During Inference. Using a small overlap of 5% during inference with a patch size of 256×256 (the same as the training patch size, to preserve global context) resulted in improved inference speed. However, despite applying boundary pixel averaging, minor stitching artifacts were observed, leading to a decline in PSNR performance. To mitigate these artifacts, they increased the overlap to 20% with the original 256×256 patch size, which resulted in a PSNR improvement.

Ensemble Technique at Inference. Ensemble techniques played a crucial role by effectively boosting performance. They used the self-ensemble strategy, specifically test-time augmentation ensemble [35], where multiple flips and rotations of images were used before model inference. The model outputs are averaged to generate the final output image.
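A compact sketch of the flip/rotation test-time-augmentation ensemble described above (geometric self-ensemble [35]) is given below; this is a generic illustration rather than any team’s code.

```python
import torch

@torch.no_grad()
def self_ensemble(model, x: torch.Tensor) -> torch.Tensor:
    """Average predictions over the 8 flip/rotation variants of a (B, C, H, W) tensor."""
    outputs = []
    for hflip in (False, True):
        for k in range(4):                        # 0, 90, 180, 270 degree rotations
            aug = torch.flip(x, dims=[-1]) if hflip else x
            aug = torch.rot90(aug, k, dims=(-2, -1))
            y = model(aug)
            y = torch.rot90(y, -k, dims=(-2, -1))  # undo rotation
            if hflip:
                y = torch.flip(y, dims=[-1])       # undo flip
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)
```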
4.6. Alwaysu

Method: Their objective is to achieve efficient Gaussian denoising based on pre-trained denoisers. The core idea, termed Bias-Tuning, initially proposed in transfer learning [8], is freezing pre-trained denoisers and only fine-tuning existing or newly added bias parameters during adaptation, thus maintaining the knowledge of pre-trained models and reducing tuning cost.

They choose the Restormer [59] model trained to remove the same i.i.d. Gaussian noise (σ = 50) without intensity clipping as their baseline. As this pre-trained Restormer did not clip noisy images’ intensities into the normal range, i.e., [0, 255], it performs poorly on clipped noisy images, resulting in low PSNR/SSIM (27.47/0.79 on DIV2K validation) and clear artifacts. After embedding learnable bias parameters into this frozen Restormer (except LayerNorm modules) and fine-tuning the model, satisfactory denoising results can be obtained, and the resultant PSNR increases by over 3 dB (evaluated on the DIV2K validation set). They found that various pre-trained Gaussian denoisers from [59], including three noise-specific models and one noise-blind model, resulted in similar denoising performance on clipped noisy images after Bias-Tuning.

During inference, they further enhance the denoiser via self-ensemble [35] and patch stitching. When dealing with high-resolution (HR) noisy images, they process them via overlapping patches with the same patch size as in the training phase. They stitch these overlapping denoised patches via linear blending, as introduced in image stitching [7].

Training details: They fine-tune this bias-version Restormer using the PSNR loss function and AdamW optimizer combined with batch size 2, patch size 256 × 256, learning rate 3e−4 (cosine annealed to 1e−6), 200k iterations and geometric augmentation. The training dataset consists of 800 images from the DIV2K training set and 1,000 images from the LSDIR training set. They also note that the pre-trained Restormer utilized a combined set of 800 images from DIV2K, 2,650 images from Flickr2K, 400 BSD500 images and 4,744 images from WED.

Inference details: The patch size and overlapping size during patch stitching are 256 × 256 and 16, respectively.

Complexity: Total number of parameters: 26.25M; total number of learnable bias parameters: 0.014M; FLOPs: 140.99G (evaluated on an image with shape 256 × 256 × 3).
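A minimal sketch of the Bias-Tuning idea — freezing a pre-trained denoiser and leaving only bias parameters trainable — is shown below; module handling is generic and the LayerNorm exclusion follows the description above only approximately.

```python
import torch.nn as nn

def enable_bias_tuning(model: nn.Module):
    """Freeze all weights; train only bias parameters outside LayerNorm modules."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            continue                               # keep LayerNorm fully frozen
        bias = getattr(module, "bias", None)
        if isinstance(bias, nn.Parameter):
            bias.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return trainable                               # pass this list to the optimizer
```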
F_{si} = ME_s(x) \quad (12)

L_{HNN}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \| \mathrm{HNNFormer}(x_i) - y_i \| \quad (18)
Python version: 3.8.0, PyTorch version: 2.0.0, CUDA version: 11.7. They only use high-definition images from the DIV2K and LSDIR datasets for training and validation. The training set consists of 85,791 images (84,991 + 800), and the validation set consists of 350 images (250 + 100). They used the Adam optimizer with 100 training epochs, a batch size of 32, and a crop size of 256 × 256. The initial learning rate was set to 1e−4, with β1 = 0.9, β2 = 0.999, and no weight decay applied. At epoch 90, the learning rate was reduced to 1e−5. No data augmentation was applied during training or validation. The model is trained with MSE loss.

Testing description. They integrate Test-Time Augmentation (TTA) into their method during testing, including horizontal flip, vertical flip, and 90-degree rotation. They utilized an ensemble technique by chaining three basic U-Net networks and SCUNet and, according to the weights of 0.6 and 0.4, combining the output of the SCUNet model with that of the three U-Net models to achieve better performance.

4.12. JNU620

Description. Recently, some research in low-level vision has shown that ensemble learning can significantly improve model performance. Thus, instead of designing a new architecture, they leverage the existing NAFNet [10] and RCAN [63] as basic networks to design a pipeline for image denoising (NRDenoising) based on the idea of ensemble learning, as shown in Fig. 9. They find the results are further improved by employing both self-ensemble and model ensemble strategies.
…integrate a PatchGAN discriminator with adversarial training.

Training details. The model is trained from scratch using the DIV2K dataset, without relying on any pre-trained weights. They jointly optimize all modules using a composite loss function that includes a diffusion loss, a Sinkhorn-based optimal transport loss, multi-scale SSIM and L1 losses, and an adversarial loss. The training spans 300 epochs with a batch size of 8, totaling 35,500 iterations per epoch. The method emphasizes both fidelity and perceptual quality, achieving strong results in PSNR and LPIPS.
Figure 10. Overview of the OptiMalDiff architecture proposed by PSU team, combining Schrödinger Bridge diffusion, transformer-based feature extraction, and adversarial refinement.
4.15. mpu ai

The team proposes Enhanced Blind Image Restoration with Channel Attention Transformers and Multi-Scale Attention Prompt Learning (CTMP), as depicted in Figure 11. The CTMP model features a U-shaped architecture grounded in the Transformer framework, constructed from the enhanced Channel Attention Transformer Block (CATB). During the image restoration process, CTMP adopts a blind image restoration strategy to address diverse noise types and intensities. It integrates an Efficient Multi-Scale Attention Prompt Module (EMAPM) that is based on prompts. Within the EMAPM, an Enhanced Multi-scale Attention (EMA) module is specifically designed. This module extracts global information across different directions and employs dynamic weight calculations to adaptively modulate the importance of features at various scales. The EMA module subsequently fuses the enhanced multi-scale features with the input feature maps, yielding a more enriched feature representation. This fusion mechanism empowers the model to more effectively capture and leverage features at different scales, thereby markedly bolstering its capacity to restore image degradations and showcasing superior generalization capabilities.

Figure 11. The CTMP architecture proposed by Team mpu ai.

4.15.2. Transformer Block Incorporating Channel Attention and Residual Connections

The Transformer Block serves as the cornerstone of their entire model, harnessing the Transformer architecture to extract image features through the self-attention mechanism. In pursuit of enhanced performance, they have refined the Transformer module by devising a novel architecture that integrates Channel Attention with the self-attention mechanism, thereby combining the strengths of both Transformer and Channel Attention. Specifically, the Transformer focuses on extracting high-frequency information to capture the fine details and textures of images, while Channel Attention excels at capturing low-frequency information to extract the overall structure and semantic information of images. This integration further boosts the image denoising effect. As depicted in Figure 12, the improved Transformer architecture, named the Channel Attention Transformer Block (CATB), primarily consists of the following three modules: Multi-DConv Head Transposed Self-Attention (MDTA), Channel Attention (CA), and Gated-Dconv Feed-Forward Network (GDFN).

Figure 12. The Channel Attention Transformer Block (CATB), proposed by Team mpu ai.
The Multi-DConv Head Transposed Self-Attention (MDTA) module enhances the self-attention mechanism’s perception of local image features by incorporating multi-scale depthwise convolution operations, effectively capturing detailed image information. The Channel Attention (CA) module, dedicated to information processing along the channel dimension, computes the importance weights of each channel to perform weighted fusion of channel features, thereby strengthening the model’s perception of the overall image structure. The Gated-Dconv Feed-Forward Network (GDFN) module combines the gating mechanism with depthwise convolution operations, aiming to further optimize the nonlinear transformation of features. By introducing the gating mechanism, the model can adaptively adjust the transmission and updating of features based on the dynamic characteristics of the input features, thereby enhancing the flexibility and adaptability of feature representation. Through the synergistic action of these three modules, the improved Transformer architecture can more effectively handle both high-frequency and low-frequency information in images, thereby significantly enhancing the performance of image denoising and restoration.

In image restoration tasks, feature extraction and representation are crucial steps. Traditional convolutional neural networks (CNNs) and Transformer architectures primarily focus on feature extraction in the spatial domain, while paying less attention to the weighting of features in the channel dimension. To address this limitation, they introduce a Channel Attention module in the Transformer Block, creating a Transformer Block that incorporates Channel Attention and Residual Connections. This module weights the channel dimension through global average pooling and fully connected layers, enhancing important channel features while suppressing less important ones. This weighting mechanism enables the model to focus more effectively on key information, thereby improving the quality of restored images. Additionally, the introduction of residual connections further enhances the model’s robustness and performance. Residual connections ensure that the information of the input features is fully retained after processing by the Channel Attention module by adding the input features directly to the output features. This design not only aids gradient propagation but also retains the original information of the input features when the weighting effect of the Channel Attention module is suboptimal, further boosting the model’s robustness.

The proposed model incorporates several key enhancements to improve image restoration quality. Firstly, the Channel Attention Module leverages global average pooling and fully connected layers to selectively enhance important channel features while suppressing less relevant ones. This mechanism enables the model to focus more effectively on critical information, thereby improving the quality of the restored image. Secondly, residual connections are employed to ensure that the original input features are fully retained and added directly to the output features after processing by the Channel Attention Module. This not only aids gradient propagation but also preserves the original information when the weighting effect is suboptimal, thus boosting the model’s robustness. Lastly, the LeakyReLU activation function is utilized in the Feed-Forward Network to introduce non-linearity while avoiding the “dying neurons” issue associated with ReLU, further enhancing the model’s expressive power. Together, these improvements contribute to a more effective and robust image restoration model.
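A minimal sketch of a channel-attention block with a residual connection of the kind described above (global average pooling, fully connected layers, per-channel re-weighting) is given below; the team’s CATB implementation may differ in its details.

```python
import torch
import torch.nn as nn

class ChannelAttentionResidual(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel importance weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x + x * w                               # weighted features plus residual
```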
4.15.3. Efficient Multi-Scale Attention Prompt Module

Addressing multi-scale image degradations is a crucial challenge in image restoration tasks. Traditional feature extraction methods typically capture features at a single scale, neglecting the fusion and interaction of features across multiple scales. To overcome this limitation, they propose a prompt-based blind image restoration approach, incorporating an Efficient Multi-Scale Attention Prompt Module (EMAPM). As shown in Figure 13, the core of the EMAPM is the Enhanced Multi-scale Attention (EMA) module, which extracts global information in different directions and combines dynamic weight calculations to adaptively adjust the significance of features at various scales, thereby generating a richer feature representation. This design not only enhances the model’s adaptability to multi-scale image degradations but also strengthens the expressiveness of features, significantly improving the quality of image restoration. The introduction of the EMA module represents a significant innovation in their image restoration approach. Experimental results validate the effectiveness of this design.

The Efficient Multi-Scale Attention Prompt Module (EMAPM) is designed to enhance the model’s ability to capture multi-scale features in image restoration tasks. By generating adaptive prompts that focus on different scales and characteristics of the input image, EMAPM allows the model to better handle various types of image degradations. The core components and operations of EMAPM are described as follows:

Module Configuration: To configure the EMAPM, several key parameters are defined:
• Prompt Dimension (d_p): This determines the dimension of each prompt vector, which represents the feature space for each prompt.
• Prompt Length (L_p): This specifies the number of prompt vectors, which controls the diversity of prompts generated.
• Prompt Size (S_p): This sets the spatial size of each prompt vector, which affects the resolution of the prompts.
• Linear Dimension (d_l): This is the dimension of the input to the linear layer, which processes the embedding of the input feature map.
• Factor (f): This defines the number of groups in the EMA module, which influences the grouping mechanism in the attention process.

Mathematical Formulation: Given an input feature map x \in \mathbb{R}^{B \times C \times H \times W}, where B is the batch size, C is the number of channels, and H \times W is the spatial dimension, the operations within EMAPM are defined as follows:

1. Compute Embedding: The embedding of the input feature map is computed by averaging the spatial dimensions.

\text{emb} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{:, :, i, j} \in \mathbb{R}^{B \times C} \quad (19)

2. Linear Layer and Softmax: The embedding is passed through a linear layer followed by a softmax function to generate prompt weights.

\text{prompt\_weights} = \text{softmax}(\text{linear\_layer}(\text{emb})) \in \mathbb{R}^{B \times L_p} \quad (20)

3. Generate Prompt: The prompts are generated by weighting the prompt parameters with the prompt weights and then summing them up. The prompts are then interpolated to match the spatial dimensions of the input feature map.

\text{prompt} = \sum_{k=1}^{L_p} \text{prompt\_weights}_{:, k} \cdot \text{prompt\_param}_{k} \in \mathbb{R}^{B \times d_p \times S_p \times S_p} \quad (21)
4. Enhance Prompt using EMA: The prompts are enhanced using the Enhanced Multi-scale Attention (EMA) module, which refines the prompts by incorporating multi-scale attention.

\text{enhanced\_prompt} = \text{EMA}(\text{prompt}) \in \mathbb{R}^{B \times d_p \times H \times W} \quad (23)

5. Conv3x3: Finally, the enhanced prompts are processed through a 3×3 convolutional layer to further refine the feature representation.

\text{enhanced\_prompt} = \text{conv3x3}(\text{enhanced\_prompt}) \in \mathbb{R}^{B \times d_p \times H \times W} \quad (24)
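The prompt-generation pipeline of Eqs. (19)–(24) can be sketched in PyTorch as follows; the EMA module is stubbed out with an identity, and all dimensions and names are illustrative assumptions rather than the team’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptModule(nn.Module):
    def __init__(self, in_ch: int, prompt_dim: int = 64, prompt_len: int = 5, prompt_size: int = 96):
        super().__init__()
        self.prompt_param = nn.Parameter(torch.rand(prompt_len, prompt_dim, prompt_size, prompt_size))
        self.linear = nn.Linear(in_ch, prompt_len)
        self.ema = nn.Identity()                      # stand-in for the Enhanced Multi-scale Attention module
        self.conv3x3 = nn.Conv2d(prompt_dim, prompt_dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        emb = x.mean(dim=(-2, -1))                    # Eq. (19): spatial average -> (B, C)
        weights = F.softmax(self.linear(emb), dim=1)  # Eq. (20): prompt weights (B, L_p)
        prompt = torch.einsum("bl,ldhw->bdhw", weights, self.prompt_param)  # Eq. (21)
        prompt = F.interpolate(prompt, size=(h, w), mode="bilinear", align_corners=False)
        prompt = self.ema(prompt)                     # Eq. (23): multi-scale attention enhancement
        return self.conv3x3(prompt)                   # Eq. (24)
```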
4.15.4. Experiments

In this section, they conducted a series of extensive experiments to comprehensively demonstrate the superior performance of the proposed CTMP model across multiple datasets and benchmarks. The experiments covered a variety of tasks, including denoising and deblocking of compressed images, and were compared with previous state-of-the-art methods. Additionally, they reported the results of ablation studies, which strongly validated the effectiveness of the Channel Attention Transformer Block (CATB) and the Enhanced Multi-scale Attention Prompt Module (EMAPM) within the CTMP architecture.

The CTMP framework is end-to-end trainable without the need for pretraining any individual components. Its architecture consists of a 4-level encoder-decoder, with each level equipped with a different number of Transformer modules, specifically [4, 6, 6, 8] from level 1 to level 4. They placed a Prompt module between every two consecutive decoder levels, resulting in a total of 3 Prompt modules across the entire PromptIR network, with a total of 5 Prompt components. During training, the model was trained with a batch size of 2, leveraging the computational power of a Tesla T4 GPU. The network was optimized through L1 loss, using the Adam optimizer (β1 = 0.9, β2 = 0.999) with a learning rate of 2 × 10−4. To further enhance the model’s generalization ability, they used 128×128 cropped blocks as input during training and augmented the training data by applying random horizontal and vertical flips to the input images.

The proposed model exhibits the following characteristics in terms of overall complexity: it consists of approximately 35.92 million parameters and has a computational cost of 158.41 billion floating-point operations (FLOPs). The number of activations is around 1,863.85 million, with 304 Conv2d layers. During GPU training, the maximum memory consumption is 441.57 MB, and the average runtime for validation is 25,287.67 seconds.

4.15.5. Dataset

To comprehensively evaluate the performance of the CTMP algorithm in image restoration tasks, they conducted experiments in two critical areas: image denoising and deblocking of compressed images. For training, they selected the high-quality DIV2K dataset, which comprises 800 high-resolution clean images with rich textures and details, providing ample training samples to enable the model to perform well under various degradation conditions [2]. Additionally, they used 100 clean/noisy image pairs as the validation set to monitor the model’s performance during training and adjust the hyperparameters.

During the testing phase, they chose several widely used datasets, including Kodak, LIVE1, and BSDS100, to comprehensively assess the algorithm’s performance. The Kodak dataset consists of 24 high-quality images with diverse scenes and textures, commonly used to evaluate the visual effects of image restoration algorithms [1]. The LIVE1 dataset contains a variety of image types and is widely used for image quality assessment tasks, effectively testing the …
…only saves significant computational resources and time but also fully utilizes the excellent models and valuable expertise available in the field. By directly employing these pre-trained models, they can quickly generate high-quality predictions while avoiding the high costs and complexity associated with training models from scratch.

4.19. Whitehairbin

4.19.1. Introduce

Their method is based on the Refusion [40] model proposed in previous work, and they trained it on the dataset provided by this competition to validate its effectiveness. The Refusion model itself is a denoising method based on the diffusion model framework. Its core idea is to guide the reverse diffusion process by learning the noise gradient (score function) at different time steps t. Within the Refusion framework, they can still flexibly choose NAFNet or UNet as the neural network backbone architecture to adapt to different computational resources and performance requirements. NAFNet is known for its efficiency, while UNet excels in preserving details. The denoising process follows a stochastic differential equation (SDE) approach, which calculates the score function by predicting the noise residual and iteratively removes noise. Through training and validation on the competition dataset, their method ultimately achieved a test performance of PSNR 27.07 and SSIM 0.79.

4.19.2. Method details

General method description. Their proposed denoising method is based on a diffusion model framework, where the network is designed to estimate the noise gradient (score function) at different time steps t to guide the reverse diffusion process. The core architecture consists of a neural backbone, which can be either NAFNet or UNet, selected based on a trade-off between computational efficiency and denoising quality. NAFNet features a lightweight structure optimized for high-speed image restoration, incorporating a self-gated activation mechanism (SimpleGate), simplified channel attention (SCA), and depth-wise convolutions, making it highly efficient. UNet, on the other hand, is a widely adopted architecture for image denoising, leveraging an encoder-decoder structure. The reverse update follows

x_{t-1} = x_t - 0.5 \cdot \sigma_t^2 \cdot \text{score}(x_t, t) \cdot dt.

To improve sampling efficiency, they integrate an ODE-based sampling strategy, which allows for faster denoising while maintaining high restoration quality. Additionally, they employ a cosine noise schedule, which ensures a smooth noise transition across time steps and improves training stability. The network is optimized using a custom loss function that minimizes the deviation between the predicted noise and the true noise, ensuring precise score estimation.

Training is conducted with the Lion optimizer, incorporating a learning rate scheduler for improved convergence. To enhance computational efficiency, they apply mixed precision training, reduce the number of time steps T, and utilize lightweight backbone networks, striking a balance between high-quality denoising and efficient execution.
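Using the update rule above, the reverse-diffusion loop can be sketched as below; the score-network interface, noise schedule, and step size are placeholders for illustration, not the team’s implementation.

```python
import torch

@torch.no_grad()
def reverse_diffusion(score_net, x_T: torch.Tensor, sigmas: torch.Tensor, dt: float = 1.0) -> torch.Tensor:
    """Iteratively apply x_{t-1} = x_t - 0.5 * sigma_t^2 * score(x_t, t) * dt."""
    x = x_T
    for t in range(len(sigmas) - 1, -1, -1):
        t_batch = torch.full((x.shape[0],), t, device=x.device)
        score = score_net(x, t_batch)              # estimated noise gradient at step t
        x = x - 0.5 * (sigmas[t] ** 2) * score * dt
    return x
```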
Training description. They trained their diffusion-based denoising model on a mixed dataset composed of DIV2K and LSDIR, which contained high-resolution images with diverse textures and content. The dataset was augmented with random cropping, horizontal flipping, and other data augmentation techniques to improve model generalization.

The backbone network was NAFNet, with the feature channel width set to 64. They experimented with different channel sizes and determined that 64 channels provided a good balance between performance and computational efficiency.

They employed the Lion optimizer with β1 = 0.95 and β2 = 0.98 to ensure faster convergence and better stability during training. The learning rate was initialized at 2 × 10−4 and was reduced by half after every 200k iterations using a CosineAnnealingLR scheduler to achieve smoother convergence.

The loss function was a Matching Loss designed to minimize the distance between the predicted and true noise residuals. This function integrated L1 and L2 components, weighted dynamically based on the noise variance at different time steps to stabilize training across different diffusion levels.
They applied mixed precision training with automatic gradient scaling to accelerate training while reducing memory usage. The model was trained for a total of 800k iterations, and each batch contained 16 cropped patches of size 128 × 128. Training was conducted using a single NVIDIA RTX 4090 GPU, and the entire process took approximately 36 hours to complete.

To ensure robust noise modeling, a cosine noise schedule was adopted, which progressively adjusted the noise level throughout the training process, allowing the model to better capture high-frequency details during the denoising phase.

Testing description. During the training phase, they validated the model using the official validation dataset provided by the NTIRE 2025 competition. The validation set included images with Gaussian noise of varying intensities, and the model was assessed based on both PSNR and SSIM metrics.

Upon completing 800k iterations, the model achieved a peak PSNR of 26.83 dB and an SSIM of 0.79 on the validation dataset, indicating effective noise suppression and structure preservation.

After training was completed, the model was rigorously tested using the official test set to verify its effectiveness in real-world scenarios. They conducted multiple test runs with different noise levels to ensure model robustness across various conditions. The test results confirmed that the model performed consistently well in Gaussian noise removal, maintaining high PSNR and SSIM values across diverse image types.

To further evaluate the performance, they applied both SDE-based and ODE-based sampling methods during inference. ODE sampling provided a faster and more deterministic denoising process, while SDE sampling yielded more diverse results. The final submitted model leveraged ODE sampling to achieve a balance between quality and inference speed.

Figure 17. Diffusion model for image denoising from Team Whitehairbin.
4.20. mygo

U-Net adopts a typical encoder-decoder structure. The encoder is responsible for downsampling the input image, extracting features at different scales to capture the global information and semantic features of the image. The decoder performs upsampling, restoring the feature maps to the original image size and progressively recovering the detailed information of the image. This architecture enables U-Net to achieve rich global semantic information while accurately restoring image details when processing high-definition images, thereby realizing high-precision segmentation.

The U-Net architecture is characterized by its symmetric encoder-decoder structure with skip connections. In the encoder (or contracting path), the network progressively downsamples the input image through multiple convolutional layers interspersed with max-pooling operations. This process allows the model to extract hierarchical features at various scales, capturing both the global context and semantic information of the image.

In the decoder (or expansive path), the network employs transposed convolutions (or upsampling layers) to gradually upscale the feature maps back to the original image resolution. During this process, the decoder receives additional information from the encoder via skip connections, which concatenate corresponding feature maps from the encoder to those in the decoder. This mechanism helps in refining the output by incorporating fine-grained details and spatial information, which are crucial for accurate image restoration or segmentation.

This design ensures that U-Net can effectively handle high-resolution images by leveraging both the broad contextual understanding gained from the encoder and the detailed spatial information preserved through the skip connections. Consequently, this dual capability of capturing global semantics and local details makes U-Net particularly powerful for tasks that require precise image segmentation. The uniqueness of U-Net lies in its skip connections. These skip connections directly transfer feature maps of the same scale from the encoder to the corresponding layers in the decoder. This mechanism allows the decoder to utilize low-level feature information extracted by the encoder, aiding in the better recovery of image details. When processing high-definition images, these low-level features contain abundant edge, texture, and other detail information, which is crucial for accurate image segmentation.
Compared to Fully Convolutional Networks (FCNs), U-
Net stands out because of its use of skip connections. FCN
is also a commonly used model for image segmentation,
but lacks the skip connections found in U-Net, resulting in
poorer performance in recovering detailed image informa-
tion. When processing high-definition images, FCNs can
produce blurry segmentation results with unclear edges. In
contrast, U-Net can better preserve the details of the image
through its skip connections, thereby improving the accu-
racy of segmentation.
Their model resizes all images to 512×512 for training,
which facilitates the rapid extraction of image features and
effectively reduces the usage of video memory (VRAM).
Next, they feed the images into the network model and
compute the loss of the output images. In particular, their
loss function incorporates both MSE (mean squared error)
and SSIM (structured similarity index measure), allowing
the model to focus on pixel-level accuracy during training
while also emphasizing the structural features of the images.
This dual approach improves the overall performance of the
model. They use the Adam optimizer for training, which
dynamically adjusts the learning rate during the training
process based on the first and second moments of the gra-
dients. This allows it to automatically select the appropri-
ate step sizes for each parameter, leading to more efficient
convergence compared to fixed learning rate methods. Ad-
ditionally, Adam helps reduce the overall memory footprint
by maintaining only a few extra parameters per weight, con-
tributing to its efficiency in practical applications. In par-
ticular, they employ an early stopping mechanism to avoid
redundant computations.
It is worth mentioning that they have implemented an
early stopping mechanism. This approach helps prevent
overfitting by halting the training process when the per-
formance on a validation set stops improving, thus avoid-
ing unnecessary computations and saving computational re-
sources. Early stopping monitors a chosen metric (such as
validation loss) and stops training when no improvement is
observed over a predefined number of epochs, effectively
reducing the risk of overfitting and ensuring efficient use of computational resources.

Figure 18. Unet model architecture from Team mygo.
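The early-stopping rule described above can be sketched as follows; the patience value and monitored metric are placeholders for illustration.

```python
class EarlyStopping:
    """Stop training when the monitored validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True -> stop training

# Usage sketch:
# stopper = EarlyStopping(patience=10)
# for epoch in range(max_epochs):
#     train_one_epoch()
#     if stopper.step(validate()):
#         break
```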
A. Teams and Affiliations

…Affiliations:
1 Samsung R&D Institute India - Bangalore (SRI-B)

Samsung MX (Mobile eXperience) Business & Samsung R&D Institute China - Beijing (SRC-B)
Title: Dynamic detail-enhanced image denoising framework
Members: Xiangyu Kong1 ([email protected]), Hyunhee Park2, Xiaoxuan Yu1, Suejin Han2, Hakjae Jeon2, Jia Li1, Hyung-Ju Chun2
Affiliations:
1 Samsung R&D Institute China - Beijing (SRC-B)
2 Department of Camera Innovation Group, Samsung Electronics

SNUCV
Title: Deep ensemble for Image denoising
Members: Donghun Ryou1 ([email protected]), Inju Ha1, Bohyung Han1
Affiliations:
1 Seoul National University

BuptMM
Title: DDU——Image Denoising Unit using transformer and morphology method
Members: Jingyu Ma1 ([email protected]), Zhijuan Huang2, Huiyuan Fu1, Hongyuan Yu2, Boqi Zhang1, Jiawei Shi1, Heng Zhang2, Huadong Ma1
Affiliations:
1 Beijing University of Posts and Telecommunications
2 Xiaomi Inc., China

HMiDenoise
Title: Hybrid Denoising Method Based on HAT
Members: Zhijuan Huang1 (huang [email protected]), Jingyu Ma2, Hongyuan Yu1, Heng Zhang1, Huiyuan Fu2, Huadong Ma2
Affiliations:

Alwaysu
Title: Bias-Tuning Enables Efficient Image Denoising
Members: Jun Cheng1 ([email protected]), Shan Tan1
Affiliations:
1 Huazhong University of Science and Technology

Tcler Denoising
Title: Tcler Denoising
Members: Jun Liu1,2 ([email protected]), Jiangwei Hao1,2, Jianping Luo1,2, Jie Lu1,2
Affiliations:
1 TCL Corporate Research
2 TCL Science Park International E City - West Zone, Building D4

cipher vision
Title: Pureformer: Transformer-Based Image Denoising
Members: Satya Narayan Tazi1 ([email protected]), Arnim Gautam1, Aditi Pawar1, Aishwarya Joshi2, Akshay Dudhane3, Praful Hambadre4, Sachin Chaudhary5, Santosh Kumar Vipparthi5, Subrahmanyam Murala6
Affiliations:
1 Government Engineering College Ajmer
2 Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi
3 University of Petroleum and Energy Studies, Dehradun
4 Indian Institute of Technology, Mandi
5 Indian Institute of Technology, Ropar
6 Trinity College Dublin, Ireland

Sky-D
Title: A Two-Stage Denoising Framework with Generalized Denoising Score Matching Pretraining and Supervised Fine-tuning
Members: Jiachen Tu1 ([email protected])
Affiliations:
1 University of Illinois Urbana-Champaign
KLETech-CEVI
Title: HNNFormer: Hierarchical Noise-Deinterlace Transformer for Image Denoising
Members: Nikhil Akalwadi1,3 ([email protected]), Vijayalaxmi Ashok Aralikatti1,3, Dheeraj Damodar Hegde2,3, G Gyaneshwar Rao2,3, Jatin Kalal2,3, Chaitra Desai1,3, Ramesh Ashok Tabib2,3, Uma Mudenagudi2,3
Affiliations:
1 School of Computer Science and Engineering, KLE Technological University
2 School of Electronics and Communication Engineering, KLE Technological University
3 Center of Excellence in Visual Intelligence (CEVI), KLE Technological University

xd denoise
Title: SCUNet for image denoising
Members: Zhenyuan Lin1 ([email protected]), Yubo Dong1, Weikun Li2, Anqi Li1, Ang Gao1
Affiliations:
1 Xidian University
2 Guilin University Of Electronic Technology

JNU620
Title: Image Denoising using NAFNet and RCAN
Members: Weijun Yuan1 ([email protected]), Zhan Li1, Ruting Deng1, Yihang Chen1, Yifan Deng1, Zhanglu Chen1, Boyang Yao1, Shuling Zheng2, Feng Zhang1, Zhiheng Fu1
Affiliations:
1 Jinan University
2 Guangdong University of Foreign Studies

PSU team
Title: OptimalDiff: High-Fidelity Image Enhancement Using Schrödinger Bridge Diffusion and Multi-Scale Adversarial Refinement
Members: Anas M. Ali1 ([email protected]), Bilel Benjdira1, Wadii Boulila1
Affiliations:
1 Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh, Saudi Arabia

Aurora
Title: GAN + NAFNet: A Powerful Combination for High-Quality Image Denoising
Members: JanSeny ([email protected]), Pei Zhou

mpu ai
Title: Enhanced Blind Image Restoration with Channel Attention Transformers and Multi-Scale Attention Prompt Learning
Members: Jianhua Hu1 ([email protected]), K. L. Eddie Law1
Affiliations:
1 Macao Polytechnic University

OptDenoiser
Title: Towards two-stage OptDenoiser framework for image denoising.
Members: Jaeho Lee1 ([email protected]), M.J. Aashik Rasool1, Abdur Rehman1, SMA Sharif1, Seongwan Kim1
Affiliations:
1 Opt-AI Inc, Marcus Building, Magok, Seoul, South Korea

AKDT
Title: High-resolution Image Denoising via Adaptive Kernel Dilation Transformer
Members: Alexandru Brateanu1 ([email protected]), Raul Balmez1, Ciprian Orhei2, Cosmin Ancuti2
Affiliations:
1 University of Manchester - Manchester, United Kingdom
2 Polytechnica University Timisoara - Timisoara, Romania

X-L
Title: MixEnsemble
Members: Zeyu Xiao1 ([email protected]), Zhuoyuan Li2
Affiliations:
1 National University of Singapore
2 University of Science and Technology of China
Whitehairbin
Title: Diffusion-based Denoising Model
Members:
Ziqi Wang1 ([email protected]), Yanyan Wei1, Fei Wang1, Kun Li1, Shengeng Tang1, Yunkai Zhang1
Affiliations:
1 Hefei University of Technology, China

mygo
Title: High-resolution Image Denoising via Unet neural network
Members:
Weirun Zhou1 ([email protected]), Haoxuan Lu2
Affiliations:
1 Xidian University
2 China University of Mining and Technology

References
[1] Kodak dataset. http://r0k.us/graphics/kodak/. 19
[2] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 126–135, 2017. 2, 5, 8, 11, 14, 18
[3] Yuval Becker, Raz Z Nossek, and Tomer Peleg. Make the most out of your net: Alternating between canonical and hard datasets for improved image demosaicing. CoRR, 2023. 6
[4] Alexandru Brateanu and Raul Balmez. Kolmogorov-arnold networks in transformer attention for low-light image enhancement. In 2024 International Symposium on Electronics and Telecommunications (ISETC), pages 1–4. IEEE, 2024. 20
[5] Alexandru Brateanu, Raul Balmez, Adrian Avram, and Ciprian Orhei. Akdt: Adaptive kernel dilation transformer for effective image denoising. Proceedings Copyright, 418:425. 19
[6] Alexandru Brateanu, Raul Balmez, Ciprian Orhei, Cosmin Ancuti, and Codruta Ancuti. Enhancing low-light images with kolmogorov–arnold networks in transformer attention. Sensors, 25(2):327, 2025. 20
[7] Matthew Brown and David G Lowe. Automatic panoramic image stitching using invariant features. International journal of computer vision, 74:59–73, 2007. 7
[8] Han Cai, Chuang Gan, Ligeng Zhu, and Song Han. Tinytl: Reduce memory, not parameters for efficient on-device learning. Advances in Neural Information Processing Systems, 33:11285–11297, 2020. 7
[9] Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12504–12513, 2023. 19
[10] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In European conference on computer vision, pages 17–33. Springer, 2022. 3, 14
[11] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22367–22377, 2023. 5
[12] Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, et al. NTIRE 2025 challenge on image super-resolution (×4): Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[13] Zheng Chen, Jingkai Wang, Kai Liu, Jue Gong, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, et al. NTIRE 2025 challenge on real-world face restoration: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[14] Xiaojie Chu, Liangyu Chen, Chengpeng Chen, and Xin Lu. Revisiting global statistics aggregation for improving image restoration. arXiv preprint arXiv:2112.04491, 2(4):5, 2021. 14
[15] Xiaojie Chu, Liangyu Chen, Chengpeng Chen, and Xin Lu. Improving image restoration by revisiting global information aggregation. In European Conference on Computer Vision, pages 53–71. Springer, 2022. 14
[16] Marcos Conde, Radu Timofte, et al. NTIRE 2025 challenge on raw image restoration and super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[17] Marcos Conde, Radu Timofte, et al. Raw image reconstruction from RGB on smartphones. NTIRE 2025 challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[18] Egor Ershov, Sergey Korchagin, Alexei Khalin, Artyom Panshin, Arseniy Terekhin, Ekaterina Zaychenkova, Georgiy Lobarev, Vsevolod Plokhotnyuk, Denis Abramov, Elisey Zhdanov, Sofia Dorogova, Yasin Mamedov, Nikola Banic, Georgii Perevozchikov, Radu Timofte, et al. NTIRE 2025 challenge on night photography rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[19] Yuqian Fu, Xingyu Qiu, Bin Ren, Yanwei Fu, Radu Timofte, Nicu Sebe, Ming-Hsuan Yang, Luc Van Gool, et al. NTIRE 2025 challenge on cross-domain few-shot object detection: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[20] Shuhang Gu and Radu Timofte. A brief review of image denoising algorithms and beyond. Inpainting and Denoising Challenges, pages 1–21, 2019. 1
[21] Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, Tao Dai, Shu-Tao Xia, and Yawei Li. Mambairv2: Attentive state space restoration. arXiv preprint arXiv:2411.15269, 2024. 4, 8
[22] Shuhao Han, Haotian Fan, Fangyuan Kong, Wenjie Liao, Chunle Guo, Chongyi Li, Radu Timofte, et al. NTIRE 2025 challenge on text to image generation model quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[23] Varun Jain, Zongwei Wu, Quan Zou, Louis Florentin, Henrik Turbell, Sandeep Siddhartha, Radu Timofte, et al. NTIRE 2025 challenge on video quality enhancement for video conferencing: Datasets, methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[24] Amogh Joshi, Nikhil Akalwadi, Chinmayee Mandi, Chaitra Desai, Ramesh Ashok Tabib, Ujwala Patil, and Uma Mudenagudi. Hnn: Hierarchical noise-deinterlace net towards image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3007–3016, 2024. 11
[25] Cansu Korkmaz and A Murat Tekalp. Training transformer models by wavelet losses improves quantitative and visual performance in single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6661–6670, 2024. 3, 4
[26] Edwin H Land and John J McCann. Lightness and retinex theory. Journal of the Optical society of America, 61(1):1–11, 1971. 19
[27] Sangmin Lee, Eunpil Park, Angel Canelo, Hyunhee Park, Youngjo Kim, Hyungju Chun, Xin Jin, Chongyi Li, Chun-Le Guo, Radu Timofte, et al. NTIRE 2025 challenge on efficient burst hdr and restoration: Datasets, methods, and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[28] Xin Li, Yeying Jin, Xin Jin, Zongwei Wu, Bingchen Li, Yufei Wang, Wenhan Yang, Yu Li, Zhibo Chen, Bihan Wen, Robby Tan, Radu Timofte, et al. NTIRE 2025 challenge on day and night raindrop removal for dual-focused images: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[29] Xin Li, Xijun Wang, Bingchen Li, Kun Yuan, Yizhen Shao, Suhang Yao, Ming Sun, Chao Zhou, Radu Timofte, and Zhibo Chen. NTIRE 2025 challenge on short-form ugc video quality assessment and enhancement: Kwaisr dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[30] Xin Li, Kun Yuan, Bingchen Li, Fengbin Guan, Yizhen Shao, Zihao Yu, Xijun Wang, Yiting Lu, Wei Luo, Suhang Yao, Ming Sun, Chao Zhou, Zhibo Chen, Radu Timofte, et al. NTIRE 2025 challenge on short-form ugc video quality assessment and enhancement: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[31] Yawei Li, Kai Zhang, Jingyun Liang, Jiezhang Cao, Ce Liu, Rui Gong, Yulun Zhang, Hao Tang, Yun Liu, Denis Demandolx, et al. Lsdir: A large scale dataset for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023. 2, 5, 8, 11, 14
[32] Yawei Li, Yulun Zhang, Radu Timofte, Luc Van Gool, Zhijun Tu, Kunpeng Du, Hailing Wang, Hanting Chen, Wei Li, Xiaofei Wang, et al. Ntire 2023 challenge on image denoising: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1905–1921, 2023. 3
[33] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021. 20
[34] Jie Liang, Radu Timofte, Qiaosi Yi, Zhengqiang Zhang, Shuaizheng Liu, Lingchen Sun, Rongyuan Wu, Xindong Zhang, Hui Zeng, Lei Zhang, et al. NTIRE 2025 the 2nd restore any image model (RAIM) in the wild challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[35] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017. 7, 11
[36] Jingbo Lin, Zhilu Zhang, Yuxiang Wei, Dongwei Ren, Dongsheng Jiang, Qi Tian, and Wangmeng Zuo. Improving image restoration through removing degradations in textual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2866–2878, 2024. 5
[37] Xiaohong Liu, Xiongkuo Min, Qiang Hu, Xiaoyun Zhang, Jie Guo, et al. NTIRE 2025 XGC quality assessment challenge: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[38] Xiaoning Liu, Zongwei Wu, Florin-Alexandru Vasluianu, Hailong Yan, Bin Ren, Yulun Zhang, Shuhang Gu, Le Zhang, Ce Zhu, Radu Timofte, et al. NTIRE 2025 challenge on low light image enhancement: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 5
[40] Ziwei Luo, Fredrik K Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B Schön. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1680–1691, 2023. 21
[41] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision (ICCV), pages 416–423, 2001. 19
[42] Vaishnav Potlapalli, Syed Waqas Zamir, Salman H Khan, and Fahad Shahbaz Khan. Promptir: Prompting for all-in-one image restoration. Advances in Neural Information Processing Systems, 36:71275–71293, 2023. 8
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021. 4
[44] Bin Ren, Hang Guo, Lei Sun, Zongwei Wu, Radu Timofte, Yawei Li, et al. The tenth NTIRE 2025 efficient super-resolution challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[45] Nickolay Safonov, Alexey Bryntsev, Andrey Moskalenko, Dmitry Kulikov, Dmitriy Vatolin, Radu Timofte, et al. NTIRE 2025 challenge on UGC video enhancement: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[46] SMA Sharif, Abdur Rehman, Zain Ul Abidin, Rizwan Ali Naqvi, Fayaz Ali Dharejo, and Radu Timofte. Illuminating darkness: Enhancing real-world low-light scenes with smartphone images. arXiv preprint arXiv:2503.06898, 2025. 19
[47] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. Live image quality assessment database release 2. http://live.ece.utexas.edu/research/quality/, 2006. 19
[48] Lei Sun, Andrea Alfarano, Peiqi Duan, Shaolin Su, Kaiwei Wang, Boxin Shi, Radu Timofte, Danda Pani Paudel, Luc Van Gool, et al. NTIRE 2025 challenge on event-based image deblurring: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[49] Lei Sun, Hang Guo, Bin Ren, Luc Van Gool, Radu Timofte, Yawei Li, et al. The tenth ntire 2025 image denoising challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[50] Radu Timofte, Rasmus Rothe, and Luc Van Gool. Seven ways to improve example-based single image super resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1865–1873, 2016. 8
[51] Jiachen Tu, Yaokun Shi, and Fan Lam. Score-based self-supervised MRI denoising. In The Thirteenth International Conference on Learning Representations, 2025. 9, 10
[52] Stefan Van der Walt, Johannes L Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D Warner, Neil Yager, Emmanuelle Gouillart, and Tony Yu. scikit-image: image processing in python. PeerJ, 2:e453, 2014. 11
[53] Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Cailian Chen, Zongwei Wu, Radu Timofte, et al. NTIRE 2025 image shadow removal challenge report. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[54] Florin-Alexandru Vasluianu, Tim Seizinger, Zhuyun Zhou, Zongwei Wu, Radu Timofte, et al. NTIRE 2025 ambient lighting normalization challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[55] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914, 2021. 8
[56] Yingqian Wang, Zhengyu Liang, Fengyuan Zhang, Lvli Tian, Longguang Wang, Juncheng Li, Jungang Yang, Radu Timofte, Yulan Guo, et al. NTIRE 2025 challenge on light field image super-resolution: Methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[57] Kangning Yang, Jie Cai, Ling Ouyang, Florin-Alexandru Vasluianu, Radu Timofte, Jiaming Ding, Huiming Sun, Lan Fu, Jinlong Li, Chiu Man Ho, Zibo Meng, et al. NTIRE 2025 challenge on single image reflection removal in the wild: Datasets, methods and results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[58] Pierluigi Zama Ramirez, Fabio Tosi, Luigi Di Stefano, Radu Timofte, Alex Costanzino, Matteo Poggi, Samuele Salti, Stefano Mattoccia, et al. NTIRE 2025 challenge on hr depth from images of specular and transparent surfaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. 2
[59] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5728–5739, 2022. 3, 4, 5, 6, 7, 8, 10, 20
[60] Jiale Zhang, Yulun Zhang, Jinjin Gu, Jiahua Dong, Linghe Kong, and Xiaokang Yang. Xformer: Hybrid x-shaped transformer for image denoising. arXiv preprint arXiv:2303.06440, 2023. 4, 12, 20
[61] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26(7):3142–3155, 2017. 1
[62] Kai Zhang, Yawei Li, Jingyun Liang, Jiezhang Cao, Yulun Zhang, Hao Tang, Deng-Ping Fan, Radu Timofte, and Luc Van Gool. Practical blind image denoising via swin-conv-unet and data synthesis. Machine Intelligence Research, 20(6):822–836, 2023. 8, 12
[63] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018. 14