Modumer: Modulating Transformer for Image Restoration
Abstract—Image restoration aims to recover clean images from degraded counterparts. While Transformer-based approaches have achieved significant advancements in this field, they are limited by high complexity and their inability to capture omni-range dependencies, hindering their overall performance. In this work, we develop Modumer for effective and efficient image restoration by revisiting the Transformer block and modulation design, which processes input through a convolutional block and projection layers and fuses features via elementwise multiplication. Specifically, within each unit of Modumer, we integrate the cascaded modulation design with the downsampled Transformer block to build the attention layers, enabling omni-kernel modulation and mapping inputs into high-dimensional feature spaces. Moreover, we introduce a bioinspired parameter-sharing mechanism to attention layers, which not only enhances efficiency but also improves performance. In addition, a dual-domain feed-forward network (DFFN) strengthens the representational power of the model. Extensive experimental evaluations demonstrate that the proposed Modumer achieves state-of-the-art performance across ten datasets in five single-degradation image restoration tasks, including image motion deblurring, deraining, dehazing, desnowing, and low-light enhancement. Moreover, the model exhibits strong generalization capabilities in all-in-one image restoration tasks. Additionally, it demonstrates competitive performance in composite-degradation image restoration.

Index Terms—All-in-one image restoration, composite-degradation image restoration, dual-domain learning, image restoration, modulation design, parameter sharing, transformer.

I. INTRODUCTION

Convolutional neural networks (CNNs) have demonstrated significant success in addressing this ill-posed problem by learning direct mappings between degraded inputs and their corresponding restored outputs [8], [9], [10]. However, the shortcomings of convolutional operators are obvious. Due to poor receptive field scaling [11], [12], CNNs cannot capture long-range dependencies for powerful image representations.

Recently, Transformers have significantly advanced the state-of-the-art performance of low-level tasks [13], [14], [15], [16]. Despite having the great power to capture content-aware global perceptive fields, the self-attention (SA) layer features quadratic complexity with respect to the input size, limiting its applications in real-world scenarios. Many attempts have been made to enhance the efficiency of this expensive mechanism. SwinIR [17], Uformer [18], and Stripformer [19] reduce the complexity of Transformer models by confining the SA operation to a fixed spatial range. Restormer [14] tactfully switches the operation dimension from the spatial domain to channels. Afterward, a few works explore adopting both channel SA and spatial SA in cascading or parallel manners to improve representational ability [12], [20]. Nonetheless, these methods impede the inherent potential of SA, originally proposed for superior global feature modeling, leading to a deterioration in restoration performance. Moreover, they mostly operate at a single scale, limiting their ability to capture multiscale receptive fields within a single computational unit.

Most recently, the modulation mechanism [21], as illustrated
Fig. 1. Comparison of the Transformer block, the modulation design, and our block. ⊗ and ⊙ denote matrix and elementwise multiplication, respectively. Compared to Transformer and modulation blocks, our design performs the attention calculation in downsampled spaces and employs a cascaded modulation operation to pursue omni-kernel feature refinement and high-dimensional representation learning. As such, the model achieves a better tradeoff between complexity and accuracy. (a) Transformer block. (b) Modulation design. (c) Our block.
Fig. 2. Computation comparisons between the proposed model and state-of-the-art algorithms on AGAN-Data [27], HIDE [28], CSD [29], and Haze4k [30]
for deraining, motion deblurring, desnowing, and dehazing, respectively.
features into higher-dimensional feature spaces. Additionally, our CTX is content-aware, which is beneficial for dealing with spatially varying degradations. Moreover, we explore a bioinspired parameter-sharing mechanism that shares parameters across different attention layers, improving both efficiency and performance.
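The sketch below illustrates one way such sharing can be realized in PyTorch; the per-stage granularity of sharing and the stand-in feed-forward layers are assumptions for illustration, since the text only states that parameters are shared across different attention layers.

```python
import torch.nn as nn

class SharedAttentionStage(nn.Module):
    # One attention module instance is registered once and reused by every block,
    # so its parameters are shared, while each block keeps its own feed-forward layer.
    def __init__(self, attn: nn.Module, channels: int, num_blocks: int):
        super().__init__()
        self.attn = attn  # single set of attention parameters
        self.ffns = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_blocks)]  # stand-in FFNs
        )

    def forward(self, x):
        for ffn in self.ffns:
            x = x + self.attn(x)  # same attention weights at every depth
            x = x + ffn(x)
        return x
```

Because the shared module is registered once, its weights receive gradients from every block that reuses them, which is also why the parameter count drops relative to unshared attention layers.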
Additionally, to reduce discrepancies between the spectra of clean/degraded image pairs, we present a dual-domain feed-forward network (DFFN) to improve dual-domain representation learning. Specifically, DFFN first utilizes GEGLU [25] to achieve spatial-domain signal interactions. Subsequently, the resulting features pass through the fast Fourier transform (FFT) to obtain the spectra, which are then modulated by learnable parameters and transformed back to the spatial domain through the inverse FFT (IFFT) [26]. Next, the results interact with spatial features under the guidance of attention weights. By doing so, our DFFN achieves intra- and interdomain interactions, improving the representational ability.
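To make this data flow concrete, a minimal PyTorch-style sketch of such a dual-domain feed-forward layer is given below. The hidden width, the exact GEGLU variant, the per-channel form of the learnable spectral weights, and the sigmoid-gated fusion are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFFNSketch(nn.Module):
    # Sketch of a dual-domain feed-forward network: GEGLU-style spatial gating,
    # FFT-domain modulation with learnable weights, inverse FFT, and an
    # attention-guided interaction with the spatial branch.
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.proj_in = nn.Conv2d(channels, hidden * 2, kernel_size=1)   # value/gate pair
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.spectral_weight = nn.Parameter(torch.ones(hidden, 1, 1))   # learnable spectrum modulation (assumed per-channel)
        self.proj_out = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj_in(x).chunk(2, dim=1)
        spatial = self.dwconv(value) * F.gelu(gate)                     # spatial-domain (GEGLU-like) interaction
        spectrum = torch.fft.rfft2(spatial, norm="ortho")               # to the frequency domain
        spectrum = spectrum * self.spectral_weight                      # modulate the spectra
        spectral = torch.fft.irfft2(spectrum, s=spatial.shape[-2:], norm="ortho")
        return self.proj_out(spatial * torch.sigmoid(spectral))         # attention-guided spatial/spectral fusion
```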
The unit of our U-shaped Modumer is built upon the above modulation-based SA block and DFFN. Unlike other Transformer-based restoration algorithms that utilize a uniform block throughout the model, we adopt a channelwise modulation-based SA block at the initial scale to enable more efficient global feature modeling. For lower-resolution features at deeper scales, we apply spatialwise blocks, effectively capturing spatial representations. Based on these designs, Modumer achieves state-of-the-art performance on several single-degradation image restoration tasks with lower complexity and fewer parameters (see Fig. 2).

The main contributions of this study are listed as follows.
1) We introduce an attention block that consecutively modulates SA outputs derived from downsampled features, enabling efficient omni-kernel modulation and enhancing high-dimensional representational capacity.
2) We develop a DFFN that achieves spatial-spatial and spectral-spatial interactions.
3) We deploy channelwise Transformer blocks at the first scale while using spatialwise blocks at deeper scales with lower-resolution features, resulting in our effective and efficient image restoration network, dubbed Modumer.
4) Extensive experimental results demonstrate that Modumer achieves state-of-the-art performance on single-degradation, all-in-one, and composite-degradation image restoration tasks.

II. RELATED WORKS

A. Image Restoration

Image restoration aims to reconstruct a sharp image from a degraded observation [11], [31], [32], [33], [34], [35], [36]. Recently, deep learning methods have remarkably boosted the performance of various image restoration tasks by learning generalizable features from collected large-scale data. These methods can be roughly divided into CNN-based and Transformer-based categories.
Fig. 3. Network architecture of our U-shaped Modumer. We employ CMB with shared parameters at the first scale while using SMB at the deeper scales, which involve lower-resolution features. This strikes a better balance between complexity and representational ability. The DFFN enhances dual-domain frequency learning via spatial-spatial and spatial-spectral interactions.
CNN-based methods leverage attention mechanisms to attend to informative features along different dimensions [8], [37], e.g., spatial and channel. Furthermore, these methods incorporate advanced techniques to expand receptive fields and capture multiscale features [38], [39], [40], [41], [42], [43], [44], [45], including encoder-decoder architectures, atrous convolution, and multistage learning strategies. Subsequently, Transformer methods scale the receptive field to global features via the SA layer [46]. To enhance its efficiency on low-level vision tasks, a few algorithms confine the SA region to fixed windows or strips [18], [19], which impedes the inherent potential of SA. Moreover, they cannot model multiscale features within a single unit, limiting their capability for removing degradations of different sizes. In this article, we apply SA to downsampled embedding spaces to capture global dependencies and use the cascaded modulation operation to complement the missing local information.

B. Modulation Design

The modulation mechanism [21], [23] considers context modeling using a large-kernel convolutional unit and modulates the projected inputs using elementwise multiplication, which has exhibited cutting-edge performance in high-level vision tasks. FocalNet [24] utilizes a stack of depthwise convolutional layers to implement hierarchical contextualization and uses gated aggregation to selectively gather contexts. Afterward, EfficientMod [21] adopts a simpler method for context modeling using a series of linear projections and depthwise convolution. MambaOut [47] and Conv2Former [22] use 7 × 7 depthwise convolutions to extract contextual features. Recently, StarNet [48] reveals that the strong representational capacity of elementwise multiplication arises from its implicit mapping to high-dimensional spaces. However, the receptive fields of the CTX in these methods are limited. In contrast, our approach incorporates long-range contextual signals by applying SA to downsampled embedding spaces, effectively balancing computational complexity and accuracy. Furthermore, mesoscale and local information is leveraged to modulate SA outputs through cascaded modulation, enabling omni-kernel refinement and mapping inputs into higher-dimensional spaces.

III. METHODOLOGY

In this section, we first introduce the overall architecture of Modumer. Subsequently, the proposed components are delineated individually, including two kinds of attention layers (CMB and SMB), the parameter-sharing mechanism, and the DFFN.

A. Overall Pipeline

Modumer follows the encoder-decoder design (see Fig. 3). We employ a channelwise modulation block (CMB) at the first scale, as channelwise SA effectively captures long-range features in an implicit manner. Meanwhile, a spatialwise modulation block (SMB) is utilized at the two lower-resolution scales to enhance spatial feature representation. As such, the model strikes a better balance between complexity and representational capacity.

More specifically, given an image, we use a 3 × 3 convolution to extract the embedding features of size R^{C×H×W}, where C denotes the channel count and H × W defines the spatial resolution. Subsequently, the features are fed into the three-scale encoder subnetwork to produce the in-depth features. Each scale contains several Transformer blocks, whose calculation process is formulated as follows:

    X'_k = CMB/SMB(X_{k-1}) + X_{k-1}    (1)
    X_k = DFFN(X'_k) + X'_k    (2)

where X_{k-1} and X_k are the outputs of the previous and current Transformer blocks, respectively. In the encoder stage, the resolution of the features is gradually downsampled using bilinear interpolation while the channel capacity is doubled using a 3 × 3 convolution. Next, the in-depth features pass through the symmetric decoder network to generate the clean features.
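For clarity, the block-level computation in (1) and (2) can be sketched as follows. The normalization layers and the module interfaces are assumptions added to make the snippet self-contained: attn stands for CMB or SMB, and ffn stands for the DFFN.

```python
import torch.nn as nn

class ModumerBlockSketch(nn.Module):
    # Sketch of the residual block structure in (1)-(2): an attention layer
    # (CMB or SMB) followed by the DFFN, each with a skip connection.
    def __init__(self, attn: nn.Module, ffn: nn.Module, channels: int):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, channels)  # LayerNorm-like normalization (assumed)
        self.norm2 = nn.GroupNorm(1, channels)
        self.attn = attn   # CMB at the first scale, SMB at deeper scales
        self.ffn = ffn     # the dual-domain feed-forward network

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # Eq. (1): X'_k = CMB/SMB(X_{k-1}) + X_{k-1}
        x = x + self.ffn(self.norm2(x))   # Eq. (2): X_k = DFFN(X'_k) + X'_k
        return x
```

Stacking several such blocks per scale, with downsampling between encoder scales and upsampling plus skip concatenation in the decoder, yields the U-shaped layout of Fig. 3.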
Fig. 4. Architectures of the channel and spatial modulation blocks (CMB and SMB). (a) CMB. (b) SMB.
In this process, the resolution of features is progressively restored to the original size using bilinear interpolation and 3 × 3 convolution. Meanwhile, the skip connection is adopted to combine the encoder and decoder features via concatenation. The features produced by the three-level decoder are then processed by a refinement stage consisting of r Transformer blocks. Finally, a 3 × 3 convolution is applied to generate the residual image, which is added to the original input image to obtain the final model output. Next, we present the internal components of the Transformer block.

B. Channelwise Modulation Block

The architectural details of CMB are illustrated in Fig. 4(a). CMB contains a downsampled channelwise SA layer for global information modeling, along with two depthwise convolutional branches that modulate the SA output. These branches enhance local and mesoscale receptive fields while mapping features into higher-dimensional spaces. The calculation process of CMB can be formally expressed as follows:

    X̂_CMB = W_2(X̂_{M7×7} ⊙ W_1(X̂_{M3×3} ⊙ DCSA(X_CMB)))    (3)

where X̂_CMB and X_CMB denote the output and input of CMB, respectively, and DCSA is a downsampled channelwise SA layer.

2) Modulation Design: DCSA encodes downsampled global information while disregarding fine-grained local details during the downsampling process. To complement local information, we first filter the initially generated V tensor using a 3 × 3 depthwise convolution, which is expressed as follows:

    X̂_{M3×3} = Sigmoid(Dw_{3×3}(V)) ⊙ V    (5)

where Dw_{3×3} is a depthwise convolution of kernel size 3 × 3. Next, we modulate the output of DCSA with the locally filtered result via elementwise multiplication. This approach enables the model to capture both downsampled global and fine-grained local dependencies while mapping inputs into high-dimensional spaces, thereby enhancing its representational capacity. To simplify the analysis, we assume a scenario with a single-pixel input x ∈ R^{d×1} and a single-element output x̂ ∈ R^{1×1}, where d is the channel count. We define w_1, w_2 ∈ R^{1×d} as convolution parameters. The modulation process [48], which involves a single convolution in each branch, can be expressed as follows:

    (w_1^T x) ⊙ (w_2^T x) = (Σ_{i=1}^{d} w_1^i x_i) ⊙ (Σ_{j=1}^{d} w_2^j x_j)    (6)
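Expanding the right-hand side of (6) yields a weighted sum over all pairwise products x_i x_j; as analyzed in [48], this is why a single elementwise multiplication implicitly maps the input into a much higher-dimensional (roughly quadratic) feature space even though each branch involves only one convolution.

For concreteness, a minimal PyTorch-style sketch of the CMB computation in (3) and (5) is given below. The 1 × 1 projections standing in for W_1 and W_2, the analogous 7 × 7 mesoscale branch, and deriving V with a 1 × 1 convolution are illustrative assumptions rather than the authors' exact implementation, and DCSA is abstracted as a pluggable module.

```python
import torch
import torch.nn as nn

class CMBSketch(nn.Module):
    # Sketch of the channelwise modulation block around Eqs. (3) and (5).
    def __init__(self, channels: int, dcsa: nn.Module):
        super().__init__()
        self.dcsa = dcsa                                     # downsampled channelwise SA layer
        self.to_v = nn.Conv2d(channels, channels, 1)         # stand-in for the initially generated V tensor
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dw7 = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)
        self.w1 = nn.Conv2d(channels, channels, 1)
        self.w2 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        v = self.to_v(x)
        m3 = torch.sigmoid(self.dw3(v)) * v                  # Eq. (5): local 3x3 filtering branch
        m7 = torch.sigmoid(self.dw7(v)) * v                  # assumed analogous 7x7 mesoscale branch
        return self.w2(m7 * self.w1(m3 * self.dcsa(x)))      # Eq. (3): cascaded modulation
```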
TABLE I
Details of our model versions. The number of heads at the three scales is set to [1, 2, 4]. FLOPs are measured on 3 × 256 × 256 patches.
TABLE II
Dataset summary for five single-degradation image restoration tasks.
Fig. 7. Deblurred results on GoPro [42]. Compared to other algorithms, the proposed method restores more details and clearer structures from the input.
TABLE V
Image motion deblurring results. Our model is trained only on the GoPro [42] dataset and directly applied to the GoPro [42] and HIDE [28] datasets.

Fig. 10. Visualization of the training process on the Snow100K [9] dataset.

TABLE VIII
Image desnowing comparisons on three widely used datasets.
TABLE IX
Numerical comparisons on the LOL-V2-syn dataset [53] for low-light image enhancement.

TABLE X
Ablation studies for each component. FLOPs and memory footprint are measured on a 3 × 256 × 256 patch size using ptflops and torch.cuda.max_memory_allocated(), respectively.

TABLE XI
Ablation studies of the deployment strategy for different kinds of attention.

TABLE XII
More ablation studies for DFFN.
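The measurement protocol referenced in the Table X caption can be reproduced roughly as follows; this helper is a sketch (device handling and the exact ptflops options are assumptions), not the authors' evaluation script.

```python
import torch
from ptflops import get_model_complexity_info

def profile_model(model: torch.nn.Module):
    # FLOPs/params via ptflops on a 3 x 256 x 256 input, then peak GPU memory
    # for one forward pass (assumes a CUDA device is available).
    macs, params = get_model_complexity_info(
        model, (3, 256, 256), as_strings=True, print_per_layer_stat=False
    )
    model = model.cuda().eval()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(torch.randn(1, 3, 256, 256, device="cuda"))
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    return macs, params, peak_mib
```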
Fig. 13. Object detection results on the degraded and our restored images. The examples are obtained from the CSD [29], GoPro [42], and LOL-V2-syn [53]
datasets for desnowing, deblurring, and low-light enhancement, respectively.
TABLE XIV
Datasets used for the all-in-one setting. Motion deblurring and low-light enhancement are only used for the five-task setting.
TABLE XV
Quantitative comparisons on three image restoration tasks under the all-in-one setting.
Fig. 15. Visual comparisons on the Rain100L [85] dataset under the all-in-one setting. The image produced by our model is closer to the reference image.
TABLE XVI
Numerical comparisons on five image restoration tasks under the all-in-one setting: dehazing (SOTS [89]), deraining (Rain100L [85]), denoising (BSD68 [91]), deblurring (GoPro [42]), and low-light image enhancement (LOL-v1 [90]).
TABLE XVII
Quantitative evaluation under the all-in-one setting with learning-based metrics.
Fig. 16. Results of applying models trained in the three-task setting to images from the UAVDT [103] dataset.

C. Composite-Degradation Image Restoration

We conduct experiments on the CDD-11 [99] dataset for composite-degradation image restoration. This dataset comprises a total of 11 degradation categories, created by combining four types: low light, haze, rain, and snow. The training configuration follows that of single-degradation image restoration. Table XVIII shows that our model outperforms the previous leading algorithm [99] while using fewer parameters.

D. Discussion

1) Evaluation Datasets: In this study, we conduct experiments transitioning from single-degradation to composite-degradation scenarios for image restoration, aligning with the evolving trends in the field. However, given the diverse real-world conditions encountered by users, it is impractical to encompass all possible scenarios within a single study. Consequently, several valuable datasets remain available for further exploration and evaluation.

For instance, MC-Blur [104] includes four types of blur (uniform blur, motion blur caused by averaging continuous frames, heavy defocus blur, and real-world blur), providing a comprehensive benchmark for multi-cause image deblurring frameworks [105].
Fig. 17. Visual comparisons on the CDD-11 [99] dataset under the haze + snow scenario.
TABLE XVIII
Quantitative comparisons on the CDD-11 [99] for composite-degradation image restoration. The scores are reported in the form of PSNR (dB, ↑) and SSIM (↑). Our model outperforms the previous leading algorithm [99] while using fewer parameters.
Similarly, the SnowKITTI2012 and SnowCityScapes datasets introduced in [106] feature three levels of snow degradation in street environments, facilitating the development of algorithms aimed at enhancing autonomous driving in adverse weather conditions. Additionally, JRSRD [107] synthesizes rain streaks and raindrops simultaneously, offering a more realistic representation of rainy conditions compared to datasets that include only a single type of degradation.

Furthermore, our model can be extended to stereo image restoration tasks, such as evaluating its performance on deraining using the Stereo RainKITTI2012 dataset [108]. These additional datasets present promising avenues for future research, enabling a more comprehensive assessment of image restoration techniques across varied conditions.

2) Network Architecture: We employ a unified network architecture across various image restoration tasks, including single-degradation, all-in-one, and composite-degradation scenarios. In all-in-one settings, a common approach involves first extracting task-aware information, which then guides the restoration process [93], [94]. Despite not incorporating such explicit task-specific components, our model achieves state-of-the-art performance on two all-in-one tasks. This success can be attributed to two key factors: 1) the modulation operation, which functions as a gated mechanism, dynamically attending to informative signals for different tasks; and 2) the synergy between the modulation operation and dual-domain interactions in DFFN, which enhances the model's representational capacity without significantly increasing complexity. A similar

V. CONCLUSION

This study presents an effective and efficient Transformer model for image restoration, termed Modumer. The model incorporates different downsampled SA layers with cascaded modulation designs, which can model omni-receptive-field features, keep a better balance between complexity and accuracy, and map features into high-dimensional spaces. Moreover, we investigate a bioinspired parameter-sharing mechanism in attention layers, improving efficiency and performance. In addition, we introduce a DFFN to facilitate intra- and interdomain interactions. Comprehensive experiments across three categories of image restoration tasks validate the effectiveness of our model.

REFERENCES

[1] Y. Quan, P. Lin, Y. Xu, Y. Nan, and H. Ji, “Nonblind image deblurring via deep learning in complex field,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 10, pp. 5387–5400, Oct. 2021.
[2] Y. Cui, Y. Tao, W. Ren, and A. Knoll, “Dual-domain attention for image deblurring,” in Proc. AAAI Conf. Artif. Intell., vol. 37, Jun. 2023, pp. 479–487.
[3] Y. Zheng, X. Yu, M. Liu, and S. Zhang, “Single-image deraining via recurrent residual multiscale networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 3, pp. 1310–1323, Mar. 2022.
[4] Y. Cui and A. Knoll, “Exploring the potential of channel interactions for image restoration,” Knowl.-Based Syst., vol. 282, Dec. 2023, Art. no. 111156.
[5] Y. Zhou, Z. Chen, P. Li, H. Song, C. L. P. Chen, and B. Sheng, “FSAD-Net: Feedback spatial attention dehazing network,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 10, pp. 7719–7733, Oct. 2023.
[6] K. Jiang et al., “Multi-scale hybrid fusion network for single image deraining,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 7, pp. 3594–3608, Jul. 2023.
[7] Y. Cui, Y. Tao, L. Jing, and A. Knoll, “Strip attention for image [32] L. Ruan, B. Chen, J. Li, and M. Lam, “Learning to deblur using light
restoration,” in Proc. Int. Joint Conf. Artif. Intell., Aug. 2023, field generated and real defocus images,” in Proc. IEEE/CVF Conf.
pp. 645–653. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16283–16292.
[8] Q. Xu, Z. Wang, Y. Bai, X. Xie, and H. Jia, “FFA-Net: Feature fusion [33] M. Liu, Y. Cui, W. Ren, J. Zhou, and A. C. Knoll, “LIEDNet: A
attention network for single image dehazing,” in Proc. AAAI Conf. lightweight network for low-light enhancement and deblurring,” IEEE
Artif. Intell., vol. 34, Apr. 2020, pp. 11908–11915. Trans. Circuits Syst. Video Technol., early access, Feb. 13, 2025, doi:
[9] Y.-F. Liu, D.-W. Jaw, S.-C. Huang, and J.-N. Hwang, “DesnowNet: 10.1109/TCSVT.2025.3541429.
Context-aware deep network for snow removal,” IEEE Trans. Image [34] X. Su et al., “Prior-guided hierarchical harmonization network for
Process., vol. 27, no. 6, pp. 3064–3073, Jun. 2018. efficient image dehazing,” 2025, arXiv:2503.01136.
[10] Y. Cui, W. Ren, and A. Knoll, “Omni-kernel modulation for universal [35] Y. Cui and A. Knoll, “Enhancing local–global representation learning
image restoration,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, for image restoration,” IEEE Trans. Ind. Informat., vol. 20, no. 4,
no. 12, pp. 12496–12509, Dec. 2024. pp. 6522–6530, Apr. 2024.
[11] S.-J. Cho, S.-W. Ji, J.-P. Hong, S.-W. Jung, and S.-J. Ko, “Rethinking [36] Y. Cui, Q. Wang, C. Li, W. Ren, and A. Knoll, “EENet: An effective
coarse-to-fine approach in single image deblurring,” in Proc. IEEE/CVF and efficient network for single image dehazing,” Pattern Recognit.,
Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 4621–4630. vol. 158, Feb. 2025, Art. no. 111074.
[12] X. Chen et al., “A comparative study of image restoration networks for [37] Y. Cui, W. Ren, X. Cao, and A. Knoll, “Focal network for image
general backbone network design,” in Proc. Eur. Conf. Comput. Vis., restoration,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct.
Oct. 2024, pp. 74–91. 2023, pp. 12955–12965.
[13] Y. Song, Z. He, H. Qian, and X. Du, “Vision transformers for single [38] H. Son, J. Lee, S. Cho, and S. Lee, “Single image defocus deblurring
image dehazing,” IEEE Trans. Image Process., vol. 32, pp. 1927–1941, using kernel-sharing parallel atrous convolutions,” in Proc. IEEE/CVF
2023. Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2622–2630.
[14] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and [39] Y. Cui, M. Liu, W. Ren, and A. Knoll, “Hybrid frequency modulation
M. Yang, “Restormer: Efficient transformer for high-resolution image network for image restoration,” in Proc. 33rd Int. Joint Conf. Artif.
restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Intell., Aug. 2024, pp. 722–730.
(CVPR), Jun. 2022, pp. 5718–5729. [40] K.-H. Liu, C.-H. Yeh, J.-W. Chung, and C.-Y. Chang, “A motion deblur
[15] Y. Cui and A. Knoll, “PSNet: Towards efficient image restoration with method based on multi-scale high frequency residual image learning,”
self-attention,” IEEE Robot. Autom. Lett., vol. 8, no. 9, pp. 5735–5742, IEEE Access, vol. 8, pp. 66025–66036, 2020.
Sep. 2023. [41] Y. Cui, J. Zhu, and A. Knoll, “Enhancing perception for autonomous
[16] J.-G. Wang, Y. Cui, Y. Li, W. Ren, and X. Cao, “Omnidirectional image vehicles: A multi-scale feature modulation network for image
super-resolution via bi-projection fusion,” in Proc. AAAI Conf. Artif. restoration,” IEEE Trans. Intell. Transp. Syst., vol. 26, no. 4,
Intell., vol. 38, Mar. 2024, pp. 5454–5462. pp. 4621–4632, Apr. 2025.
[17] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timo- [42] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convo-
fte, “SwinIR: Image restoration using Swin transformer,” in Proc. lutional neural network for dynamic scene deblurring,” in Proc.
IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017,
pp. 1833–1844. pp. 257–265.
[18] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: [43] Q. Wang, Y. Cui, Y. Li, Y. Ruan, B. Zhu, and W. Ren,
A general U-shaped transformer for image restoration,” in Proc. “RFFNet: Towards robust and flexible fusion for low-light image
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, denoising,” in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024,
pp. 17662–17672. pp. 836–845.
[44] K. Jiang et al., “Multi-scale progressive fusion network for single image
[19] F.-J. Tsai, Y.-T. Peng, Y. Lin, C.-C. Tsai, and C. Lin, “Stripformer: Strip
deraining,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
transformer for fast image deblurring,” in Proc. Eur. Conf. Comput.
(CVPR), Jun. 2020, pp. 8343–8352.
Vis., Jan. 2022, pp. 146–162.
[45] Y. Cui, W. Ren, S. Yang, X. Cao, and A. Knoll, “IRNeXt: Rethinking
[20] J. Zhang, Y. Zhang, J. Gu, J. Dong, L. Kong, and X. Yang, “Xformer:
convolutional network design for image restoration,” in Proc. Int. Conf.
Hybrid X-shaped transformer for image denoising,” in Proc. 12th Int.
Mach. Learn., Jul. 2023, pp. 6545–6564.
Conf. Learn. Represent., Jan. 2023.
[46] C. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, “Image dehaz-
[21] X. Ma et al., “Efficient modulation for vision networks,” in Proc. 12th ing transformer with transmission-aware 3D position embedding,” in
Int. Conf. Learn. Represent., Mar. 2024. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun.
[22] Q. Hou, C.-Z. Lu, M.-M. Cheng, and J. Feng, “Conv2Former: A simple 2022, pp. 5802–5810.
transformer-style ConvNet for visual recognition,” IEEE Trans. Pattern [47] W. Yu and X. Wang, “MambaOut: Do we really need mamba for
Anal. Mach. Intell., vol. 46, no. 12, pp. 8274–8283, Dec. 2024. vision?,” 2024, arXiv:2405.07992.
[23] M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng, and S.-M. Hu, “Visual [48] X. Ma, X. Dai, Y. Bai, Y. Wang, and Y. Fu, “Rewrite the stars,” in
attention network,” Comput. Vis. Media, vol. 9, no. 4, pp. 733–752, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun.
Dec. 2023. 2024, pp. 5694–5703.
[24] J. Yang, C. Li, and J. Gao, “Focal modulation networks,” in Proc. Adv. [49] M. P. Witter, T. P. Doan, B. Jacobsen, E. S. Nilssen, and S. Ohara,
Neural Inf. Process. Syst., Jan. 2022, pp. 4203–4217. “Architecture of the entorhinal cortex a review of entorhinal anatomy
[25] N. Shazeer, “GLU variants improve transformer,” 2020, in rodents with some comparative notes,” Frontiers Syst. Neurosci.,
arXiv:2002.05202. vol. 11, p. 46, Jun. 2017.
[26] L. Kong, J. Dong, J. Ge, M. Li, and J. Pan, “Efficient frequency [50] T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. H. Lau,
domain-based transformers for high-quality image deblurring,” in Proc. “Spatial attentive single-image deraining with a high quality real rain
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, dataset,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
pp. 5886–5895. (CVPR), Jun. 2019, pp. 12262–12271.
[27] R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu, “Attentive generative [51] W. Yan, R. T. Tan, and D. Dai, “Nighttime defogging using high-low
adversarial network for raindrop removal from a single image,” in frequency decomposition and grayscale-color networks,” in Proc. Eur.
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, Conf. Comput. Vis., Jan. 2020, pp. 473–488.
pp. 2482–2491. [52] W. Chen, H. Fang, J. Ding, C.-C. Tsai, and S. Kuo, “JSTASR: Joint
[28] Z. Shen et al., “Human-aware motion deblurring,” in Proc. IEEE/CVF size and transparency-aware snow removal algorithm based on modified
Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5571–5580. partial convolution and veiling effect removal,” in Proc. Eur. Conf.
[29] W.-T. Chen et al., “ALL snow removed: Single image desnowing Comput. Vis., Jan. 2020, pp. 754–770.
algorithm using hierarchical dual-tree complex wavelet representation [53] W. Yang, W. Wang, H. Huang, S. Wang, and J. Liu, “Sparse gra-
and contradict channel loss,” in Proc. IEEE/CVF Int. Conf. Comput. dient regularized deep retinex network for robust low-light image
Vis. (ICCV), Oct. 2021, pp. 4176–4185. enhancement,” IEEE Trans. Image Process., vol. 30, pp. 2072–2086,
[30] Y. Liu et al., “From synthetic to real: Image dehazing collaborating 2021.
with unlabeled real data,” in Proc. 29th ACM Int. Conf. Multimedia, [54] X. Liu, M. Suganuma, Z. Sun, and T. Okatani, “Dual residual networks
Oct. 2021, pp. 50–58. leveraging the potential of paired operations for image restoration,” in
[31] Y. Cui and A. Knoll, “Dual-domain strip attention for image Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun.
restoration,” Neural Netw., vol. 171, pp. 429–439, Mar. 2024. 2019, pp. 7000–7009.
[55] R. Quan, X. Yu, Y. Liang, and Y. Yang, “Removing raindrops and [79] J. M. Jose Valanarasu, R. Yasarla, and V. M. Patel, “TransWeather:
rain streaks in one go,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Transformer-based restoration of images degraded by adverse weather
Recognit. (CVPR), Jun. 2021, pp. 9143–9152. conditions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
[56] Y. Quan, S. Deng, Y. Chen, and H. Ji, “Deep learning for seeing through (CVPR), Jun. 2022, pp. 2343–2353.
window with raindrops,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. [80] Y. Cui, W. Ren, and A. Knoll, “Omni-kernel network for image
(ICCV), Oct. 2019, pp. 2463–2471. restoration,” in Proc. AAAI Conf. Artif. Intell., vol. 38, Mar. 2024,
[57] J. Xiao, X. Fu, A. Liu, F. Wu, and Z.-J. Zha, “Image de-raining pp. 1426–1434.
transformer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, [81] S. W. Zamir et al., “Learning enriched features for fast image restora-
pp. 12978–12995, Nov. 2023. tion and enhancement,” IEEE Trans. Pattern Anal. Mach. Intell.,
[58] Z. Tu et al., “MAXIM: Multi-axis MLP for image processing,” in vol. 45, no. 2, pp. 1934–1948, Feb. 2023.
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. [82] X. Xu, R. Wang, C.-W. Fu, and J. Jia, “SNR-aware low-light image
2022, pp. 5759–5770. enhancement,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog-
[59] T. Ye et al., “Adverse weather removal with codebook priors,” nit. (CVPR), Jun. 2022, pp. 17693–17703.
in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, [83] Y. Cai, H. Bian, J. Lin, H. Wang, R. Timofte, and Y. Zhang,
pp. 12619–12630. “Retinexformer: One-stage retinex-based transformer for low-light
[60] S. Zhou, J. Pan, J. Shi, D. Chen, L. Qu, and J. Yang, “Seeing the image enhancement,” in Proc. IEEE/CVF Int. Conf. Comput. Vis.
unseen: A frequency prompt guided transformer for image restoration,” (ICCV), Oct. 2023, pp. 12504–12513.
in Proc. Eur. Conf. Comput. Vis., Oct. 2024, pp. 246–264. [84] J. Weng, Z. Yan, Y. Tai, J. Qian, J. Yang, and J. Li, “MambaLLIE:
[61] S. Zhou, D. Chen, J. Pan, J. Shi, and J. Yang, “Adapt or perish: Implicit retinex-aware low light enhancement with global-then-local
Adaptive sparse transformer with attentive feature refinement for image state space,” in Proc. 38th Annu. Conf. Neural Inf. Process. Syst.,
restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Sep. 2024, pp. 27440–27462.
(CVPR), Jun. 2024, pp. 2952–2963. [85] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, “Deep joint
[62] S. W. Zamir et al., “Multi-stage progressive image restoration,” in Proc. rain detection and removal from a single image,” in Proc. IEEE Conf.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1685–1694.
pp. 14816–14826. [86] C.-Y. Wang, A. Bochkovskiy, and H.-Y.-M. Liao, “YOLOv7: Trainable
[63] Y. Guo, X. Xiao, Y. Chang, S. Deng, and L. Yan, “From sky to the bag-of-freebies sets new state-of-the-art for real-time object detectors,”
ground: A large-scale benchmark and simple baseline towards real rain in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun.
removal,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 7464–7475.
2023, pp. 12063–12073. [87] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection
[64] X. Chen, H. Li, M. Li, and J. Pan, “Learning a sparse transformer and hierarchical image segmentation,” IEEE Trans. Pattern Anal.
network for effective image deraining,” in Proc. IEEE/CVF Conf. Mach. Intell., vol. 33, no. 5, pp. 898–916, May 2011.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 5896–5905. [88] K. Ma et al., “Waterloo exploration database: New challenges for image
[65] H. Zhang, Y. Dai, H. Li, and P. Koniusz, “Deep stacked hierarchical quality assessment models,” IEEE Trans. Image Process., vol. 26,
multi-patch network for image deblurring,” in Proc. IEEE/CVF Conf. no. 2, pp. 1004–1016, Feb. 2017.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5971–5979.
[89] B. Li et al., “Benchmarking single-image dehazing and beyond,” IEEE
[66] K. Zhang et al., “Deblurring by realistic blurring,” in Proc. Trans. Image Process., vol. 28, no. 1, pp. 492–505, Jan. 2019.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
[90] C. Wei, W. Wang, W. Yang, and J. Liu, “Deep retinex decomposition
pp. 2734–2743.
for low-light enhancement,” 2018, arXiv:1808.04560.
[67] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image
[91] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human
restoration,” in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 17–33.
segmented natural images and its application to evaluating segmenta-
[68] Y. Li et al., “Efficient and explicit modelling of image hierarchies
tion algorithms and measuring ecological statistics,” in Proc. 8th IEEE
for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Int. Conf. Comput. Vis. (ICCV), vol. 2, Oct. 2001, pp. 416–423.
Recognit. (CVPR), Jun. 2023, pp. 18278–18289.
[92] Q. Fan, D. Chen, L. Yuan, G. Hua, N. Yu, and B. Chen, “A general
[69] Y. Cui, W. Ren, X. Cao, and A. Knoll, “Image restoration via frequency
decoupled learning framework for parameterized image operators,”
selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 2,
IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 33–47, Jan.
pp. 1093–1108, Feb. 2024.
2021.
[70] Y. Cui, W. Ren, X. Cao, and A. Knoll, “Revitalizing convolutional
network for image restoration,” IEEE Trans. Pattern Anal. Mach. [93] B. Li, X. Liu, P. Hu, Z. Wu, J. Lv, and X. Peng, “All-in-one image
Intell., vol. 46, no. 12, pp. 9423–9438, Dec. 2024. restoration for unknown corruption,” in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17431–17441.
[71] X. Gao et al., “Efficient multi-scale network with learnable dis-
crete wavelet transform for blind motion deblurring,” in Proc. [94] V. Potlapalli, S. W. Zamir, S. H. Khan, and F. S. Khan, “PromptIR:
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, Prompting for all-in-one image restoration,” in Proc. Adv. Neural Inf.
pp. 2733–2742. Process. Syst., Sep. 2023, pp. 71275–71293.
[72] C. X. Liu et al., “Motion-adaptive separable collaborative filters for [95] G. Wu, J. Jiang, K. Jiang, and X. Liu, “Harmony in diversity: Improving
blind motion deblurring,” in Proc. IEEE/CVF Conf. Comput. Vis. all-in-one image restoration via multi-task collaboration,” in Proc. 32nd
Pattern Recognit. (CVPR), Jun. 2024, pp. 25595–25605. ACM Int. Conf. Multimedia, Oct. 2024, pp. 6015–6023.
[73] X. Mao, J. Wang, X. Xie, Q. Li, and Y. Wang, “LoFormer: Local [96] M. V. Conde, G. Geigle, and R. Timofte, “InstructIR: High-quality
frequency transformer for image deblurring,” in Proc. 32nd ACM Int. image restoration following human instructions,” in Proc. Eur. Conf.
Conf. Multimedia, Oct. 2024, pp. 10382–10391. Comput. Vis., Oct. 2024, pp. 1–21.
[74] X. Liu, Y. Ma, Z. Shi, and J. Chen, “GridDehazeNet: [97] J. Zhang et al., “Ingredient-oriented multi-degradation learning for
Attention-based multi-scale network for image dehazing,” in image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, Recognit. (CVPR), Jun. 2023, pp. 5825–5835.
pp. 7313–7322. [98] H. Guo, J. Li, T. Dai, Z. Ouyang, X. Ren, and S. T. Xia, “MambaIR:
[75] H. Dong et al., “Multi-scale boosted dehazing network with dense fea- A simple baseline for image restoration with state-space model,” in
ture fusion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Proc. Eur. Conf. Comput. Vis., 2024, pp. 222–241.
(CVPR), Jun. 2020, pp. 2154–2164. [99] Y. Guo, Y. Gao, Y. Lu, H. Zhu, R. Liu, and S. He, “OneRestore: A
[76] Y. Tian et al., “Perceiving and modeling density for image dehazing,” universal restoration framework for composite degradation,” in Proc.
in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 130–145. Eur. Conf. Comput. Vis., Jul. 2024, pp. 255–272.
[77] D. Engin, A. Genc, and H. K. Ekenel, “Cycle-dehaze: Enhanced [100] Y. Zhu et al., “Learning weather-general and weather-specific features
CycleGAN for single image dehazing,” in Proc. IEEE/CVF Conf. for image restoration under multiple adverse weather conditions,” in
Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun.
pp. 938–9388. 2023, pp. 21747–21758.
[78] Y. Jin, B. Lin, W. Yan, Y. Yuan, W. Ye, and R. T. Tan, “Enhancing [101] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The
visibility in nighttime haze images using guided APSF and gradient unreasonable effectiveness of deep features as a perceptual metric,”
adaptive convolution,” in Proc. 31st ACM Int. Conf. Multimedia, Oct. in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
2023, pp. 2446–2457. pp. 586–595.
[102] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality Mingyu Liu (Graduate Student Member, IEEE)
assessment: Unifying structure and texture similarity,” IEEE Trans. received the dual master’s degree in electrical and
Pattern Anal. Mach. Intell., vol. 44, no. 5, pp. 2567–2581, May 2020. computer engineering from the Department of Elec-
[103] D. Du et al., “The unmanned aerial vehicle benchmark: Object detec- tronics and Communication Engineering, Technical
tion and tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, University of Munich (TUM), Munich, Germany,
pp. 370–386. and Tongji University, Shanghai, China, in 2022.
[104] K. Zhang et al., “MC-blur: A comprehensive benchmark for image He is currently pursuing the Ph.D. degree with the
deblurring,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 5, Chair of Robotics, Artificial Intelligence and Real-
pp. 3755–3767, May 2024. time Systems, TUM.
[105] K. Zhang et al., “Deep image deblurring: A survey,” Int. J. Comput. His research interests include computer vision
Vis., vol. 130, no. 9, pp. 2103–2130, Sep. 2022. in autonomous driving, deep learning, and artificial
[106] K. Zhang, R. Li, Y. Yu, W. Luo, and C. Li, “Deep dense multi-scale intelligence.
network for snow removal using semantic and depth priors,” IEEE
Trans. Image Process., vol. 30, pp. 7419–7431, 2021.
[107] K. Zhang, D. Li, W. Luo, and W. Ren, “Dual attention-in-attention
model for joint rain streak and raindrop removal,” IEEE Trans. Image
Process., vol. 30, pp. 7608–7619, 2021.
[108] K. Zhang et al., “Beyond monocular deraining: Stereo image deraining Wenqi Ren (Member, IEEE) received the Ph.D.
via semantic understanding,” in Proc. 16th Eur. Conf. Comp. Vis. degree from Tianjin University, Tianjin, China, in
(ECCV), Aug. 2020, pp. 71–89. 2017.
From 2015 to 2016, he was supported by China
Scholarship Council and worked with Prof. Ming-
Husan Yang as a Joint-Training Ph.D. Student with
the Electrical Engineering and Computer Science
Department, University of California at Merced,
Merced, CA, USA. He is currently a Professor with
the School of Cyber Science and Technology, Shen-
zhen Campus, Sun Yat-sen University, Shenzhen,
China. His research interests include image processing and related high-level
vision problems.