

Modumer: Modulating Transformer for Image Restoration

Yuning Cui, Graduate Student Member, IEEE, Mingyu Liu, Graduate Student Member, IEEE, Wenqi Ren, Member, IEEE, and Alois Knoll, Fellow, IEEE

Abstract—Image restoration aims to recover clean images from degraded counterparts. While Transformer-based approaches have achieved significant advancements in this field, they are limited by high complexity and their inability to capture omni-range dependencies, hindering their overall performance. In this work, we develop Modumer for effective and efficient image restoration by revisiting the Transformer block and the modulation design, which processes input through a convolutional block and projection layers and fuses features via elementwise multiplication. Specifically, within each unit of Modumer, we integrate the cascaded modulation design with the downsampled Transformer block to build the attention layers, enabling omni-kernel modulation and mapping inputs into high-dimensional feature spaces. Moreover, we introduce a bioinspired parameter-sharing mechanism to attention layers, which not only enhances efficiency but also improves performance. In addition, a dual-domain feed-forward network (DFFN) strengthens the representational power of the model. Extensive experimental evaluations demonstrate that the proposed Modumer achieves state-of-the-art performance across ten datasets in five single-degradation image restoration tasks, including image motion deblurring, deraining, dehazing, desnowing, and low-light enhancement. Moreover, the model exhibits strong generalization capabilities in all-in-one image restoration tasks. Additionally, it demonstrates competitive performance in composite-degradation image restoration.

Index Terms—All-in-one image restoration, composite-degradation image restoration, dual-domain learning, image restoration, modulation design, parameter sharing, transformer.

Received 27 November 2024; revised 19 March 2025; accepted 13 April 2025. This work was supported in part by the National Natural Science Foundation of China under Grant 62322216, Grant 62172409, and Grant 62311530686; in part by the Shenzhen Science and Technology Program under Grant RCYX20221008092849068, Grant JCYJ20220530145209022, and Grant KQTD20221101093559018; in part by the Project "VIDETEC-2" under Grant 19F2232E; and in part by the Federal Ministry for Digital and Transport of Germany (BMDV). (Corresponding author: Wenqi Ren.)

Yuning Cui is with the School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China, and also with the School of Computation, Information and Technology, Technical University of Munich, 85748 Munich, Germany (e-mail: [email protected]).

Mingyu Liu and Alois Knoll are with the School of Computation, Information and Technology, Technical University of Munich, 85748 Munich, Germany (e-mail: [email protected]; [email protected]).

Wenqi Ren is with the School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNNLS.2025.3561924

© 2025 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

As a longstanding task, image restoration aims to recover a high-quality image from its degraded counterpart [1], [2], [3], [4], [5], [6], [7]. In recent years, convolutional neural networks (CNNs) have demonstrated significant success in addressing this ill-posed problem by learning direct mappings between degraded inputs and their corresponding restored outputs [8], [9], [10]. However, the shortcomings of convolutional operators are obvious. Due to poor receptive field scaling [11], [12], CNNs cannot capture long-range dependencies for powerful image representations.

Recently, Transformers have significantly advanced the state-of-the-art performance of low-level tasks [13], [14], [15], [16]. Despite having the great power to capture content-aware global perceptive fields, the self-attention (SA) layer features quadratic complexity to the input, limiting its applications in real-world scenarios. Many attempts have been made to enhance the efficiency of this expensive mechanism. SwinIR [17], Uformer [18], and Stripformer [19] reduce the complexity of Transformer models by confining the SA operation to a fixed spatial range. Restormer [14] tactfully switches the operation dimension from the spatial domain to channels. Afterward, a few works explore adopting both channel SA and spatial SA in cascading or parallel manners to improve representational ability [12], [20]. Nonetheless, these methods impede the inherent potential of SA, originally proposed for superior global feature modeling, leading to a deterioration in restoration performance. Moreover, they mostly operate at a single scale, limiting their ability to capture multiscale receptive fields within a single computational unit.

Most recently, the modulation mechanism [21], as illustrated in Fig. 1(b), which performs context modeling using a large-kernel convolutional block and modulates the projected input via elementwise multiplication, has become popular in high-level vision tasks [22], [23], [24]. These approaches are computationally efficient and implementation-friendly, showing competitive performance on par with Transformer counterparts. Inspired by this modulation technique, we acquire the approximate omni-kernel feature modeling ability by integrating the Transformer layer [Fig. 1(a)] and the modulation design [Fig. 1(b)] within a block. As illustrated in Fig. 1(c), the context branch (CTX) is implemented through a Transformer block at a downsampled scale, which retains the ability of SA to model global features while striking a trade-off between complexity and accuracy. The local and mesoscale receptive fields are complemented by modulating the result of SA in series using depthwise convolutions of different kernel sizes.


Fig. 1. Comparison of Transformer block, modulation design, and our block. ⊗ and ⊙ denote matrix and elementwise multiplication, respectively. Compared to Transformer and modulation blocks, our design performs attention calculation in downsampled spaces and employs a cascaded modulation operation to pursue omni-kernel feature refinement and high-dimensional representation learning. As such, the model achieves a better tradeoff between complexity and accuracy. (a) Transformer block. (b) Modulation design. (c) Our block.

Fig. 2. Computation comparisons between the proposed model and state-of-the-art algorithms on AGAN-Data [27], HIDE [28], CSD [29], and Haze4k [30]
for deraining, motion deblurring, desnowing, and dehazing, respectively.

Compared to the canonical modulation design, our block provides real context modeling and performs cascaded modulation processes, mapping input features into higher-dimensional feature spaces. Additionally, our CTX is content-aware, which is beneficial for dealing with spatially varying degradations. Moreover, we explore a bioinspired parameter-sharing mechanism that shares parameters across different attention layers, improving both efficiency and performance.

Additionally, to reduce discrepancies between spectra of clean/degraded image pairs, we present a dual-domain feed-forward network (DFFN) to improve dual-domain representation learning. Specifically, DFFN first utilizes GEGLU [25] to achieve spatial-domain signal interactions. Subsequently, the resulting features pass through the fast Fourier transform (FFT) to obtain the spectra, which are then modulated by learnable parameters and transformed back to the spatial domain through the inverse FFT (IFFT) [26]. Next, the results interact with spatial features under the guidance of attention weights. By doing so, our DFFN achieves intra- and interdomain interactions, improving the representational ability.

The unit of our U-shaped Modumer is built upon the above modulation-based SA block and DFFN. Unlike other Transformer-based restoration algorithms that utilize a uniform block throughout the model, we adopt a channelwise modulation-based SA block at the initial scale to enable more efficient global feature modeling. For lower-resolution features at deeper scales, we apply spatialwise blocks, effectively capturing spatial representations. Based on these designs, Modumer achieves state-of-the-art performance on several single-degradation image restoration tasks with lower complexity and fewer parameters (see Fig. 2).

The main contributions of this study are listed as follows.
1) We introduce an attention block that consecutively modulates SA outputs derived from downsampled features, enabling efficient omni-kernel modulation and enhancing high-dimensional representational capacity.
2) We develop a DFFN that achieves spatial-spatial and spectral-spatial interactions.
3) We deploy channelwise Transformer blocks at the first scale while using spatialwise blocks at deeper scales with lower-resolution features, resulting in our effective and efficient image restoration network, dubbed Modumer.
4) Extensive experimental results demonstrate that Modumer achieves state-of-the-art performance on single-degradation, all-in-one, and composite-degradation image restoration tasks.

II. RELATED WORKS

A. Image Restoration

Image restoration aims to reconstruct a sharp image from a degraded observation [11], [31], [32], [33], [34], [35], [36]. Recently, deep learning methods have remarkably boosted the performance of various image restoration tasks by learning generalizable features from collected large-scale data.


Fig. 3. Network architecture of our U-shaped Modumer. We employ CMB with shared parameters at the first scale while using SMB at the deeper scales, which involve lower-resolution features. This strikes a better balance between complexity and representational ability. The DFFN enhances dual-domain representation learning via spatial-spatial and spatial-spectral interactions.

These methods can be roughly divided into CNN-based and Transformer-based categories. CNN-based methods leverage attention mechanisms to attend to informative information in different dimensions [8], [37], e.g., spatial and channel. Furthermore, these methods incorporate advanced techniques to expand receptive fields and capture multiscale features [38], [39], [40], [41], [42], [43], [44], [45], including encoder-decoder architectures, atrous convolution, and multistage learning strategies. Subsequently, Transformer methods scale the receptive field to global features via the SA layer [46]. To enhance its efficiency on low-level vision tasks, a few algorithms confine the SA region to fixed windows or strips [18], [19], which impedes the inherent potential of SA. Moreover, they cannot model multiscale features within a single unit, limiting their capability for removing degradations of different sizes. In this article, we apply SA to downsampled embedding spaces to capture global dependencies and use the cascaded modulation operation to complement the missing local information.

B. Modulation Design

The modulation mechanism [21], [23] considers context modeling using a large-kernel convolutional unit and modulates the projected inputs using elementwise multiplication, which has exhibited cutting-edge performance in high-level vision tasks. FocalNet [24] utilizes a stack of depthwise convolutional layers to implement hierarchical contextualization and uses gated aggregation to selectively gather contexts. Afterward, EfficientMod [21] adopts a simpler method for context modeling using a series of linear projections and depthwise convolution. MambaOut [47] and Conv2Former [22] use 7 × 7 depthwise convolutions to extract contextual features. Recently, StarNet [48] reveals that the strong representational capacity of elementwise multiplication arises from its implicit mapping to high-dimensional spaces. However, the receptive fields of the CTX in these methods are limited. In contrast, our approach incorporates long-range contextual signals by applying SA to downsampled embedding spaces, effectively balancing computational complexity and accuracy. Furthermore, mesoscale and local information is leveraged to modulate SA outputs through cascaded modulation, enabling omni-kernel refinement and mapping inputs into higher-dimensional spaces.

III. METHODOLOGY

In this section, we first introduce the overall architecture of Modumer. Subsequently, the proposed components are delineated individually, including two kinds of attention layers (CMB and SMB), the parameter-sharing mechanism, and the DFFN.

A. Overall Pipeline

Modumer follows the encoder-decoder design (see Fig. 3). We employ a channelwise modulation block (CMB) at the first scale, as channelwise SA effectively captures long-range features in an implicit manner. Meanwhile, a spatialwise modulation block (SMB) is utilized at the two lower-resolution scales to enhance spatial feature representation. As such, the model strikes a better balance between complexity and representational capacity.

More specifically, given an image, we use a 3 × 3 convolution to extract the embedding features of size R^{C×H×W}, where C denotes the channel count while H × W defines the spatial size. Subsequently, the features are fed into the three-scale encoder subnetwork to produce the in-depth features. Each scale contains several Transformer blocks, whose calculation process is formulated as follows:

    X'_k = CMB/SMB(X_{k-1}) + X_{k-1}    (1)
    X_k = DFFN(X'_k) + X'_k              (2)

where X_{k-1} and X_k are the output of the last and current Transformer blocks, respectively, and X'_k is the intermediate feature. In the encoder stage, the resolution of the features is gradually downsampled using bilinear interpolation while the channel capacity is doubled using a 3 × 3 convolution.
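As a concrete reading of (1) and (2), the following minimal PyTorch sketch shows how one such unit and the encoder scale transition could be wired. The module granularity and helper names are our assumptions rather than the authors' released implementation; CMB, SMB, and DFFN stand for the components defined in Sections III-B to III-D.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModumerUnit(nn.Module):
    """One Transformer unit of Modumer, following (1) and (2).

    `attn` is a CMB at the first scale or an SMB at the deeper scales;
    `ffn` is the dual-domain feed-forward network (DFFN). Both sub-modules
    are assumed to contain their own normalization layers.
    """
    def __init__(self, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn = attn  # CMB or SMB
        self.ffn = ffn    # DFFN

    def forward(self, x):
        x = self.attn(x) + x  # Eq. (1): X'_k = CMB/SMB(X_{k-1}) + X_{k-1}
        x = self.ffn(x) + x   # Eq. (2): X_k  = DFFN(X'_k) + X'_k
        return x


def encoder_transition(x, channel_doubling_conv):
    """Scale transition in the encoder: bilinear downsampling followed by a
    3 x 3 convolution that doubles the channel count, as described above."""
    x = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
    return channel_doubling_conv(x)  # e.g., nn.Conv2d(C, 2 * C, 3, padding=1)


# Toy usage with identity placeholders standing in for CMB and DFFN.
unit = ModumerUnit(nn.Identity(), nn.Identity())
out = unit(torch.randn(1, 48, 64, 64))  # residual structure keeps the shape unchanged
```

The decoder mirrors this skeleton with bilinear upsampling and skip connections, as the following paragraph describes.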


Fig. 4. Architectures of channel and spatial modulation blocks (CMB|SMB). (a) CMB. (b) SMB.

Next, the in-depth features pass through the symmetric decoder network to generate the clean features. In this process, the resolution of features is progressively restored to the original size using bilinear interpolation and 3 × 3 convolution. Meanwhile, the skip connection is adopted to combine the encoder and decoder features via concatenation. The features produced by the three-level decoder are then processed by a refinement stage consisting of r Transformer blocks. Finally, a 3 × 3 convolution is applied to generate the residual image, which is added to the original input image to obtain the final model output. Next, we present the internal components of the Transformer block.

B. Channelwise Modulation Block

The architectural details of CMB are illustrated in Fig. 4(a). CMB contains a downsampled channelwise SA layer for global information modeling, along with two depthwise convolutional branches that modulate the SA output. These branches enhance local and mesoscale receptive fields while mapping features into higher-dimensional spaces. The calculation process of CMB can be formally expressed as follows:

    X̂_CMB = W_2(X̂_{M7×7} ⊙ W_1(X̂_{M3×3} ⊙ DCSA(X_CMB)))    (3)

where X̂_CMB and X_CMB denote the output and input of CMB, respectively, ⊙ is elementwise multiplication, DCSA is a downsampled channelwise SA layer, X̂_{Mn×n} is the modulation branch with the kernel size of n × n, encoding local information, and W_1 and W_2 are 1 × 1 convolutions for refinement.

1) DCSA: Compared to the normal channel SA, our version computes attention maps in a downsampled space, resulting in high efficiency. We assume that the number of heads is 1 and consider DCSA in a single-head fashion. Given the normalized input X_N ∈ R^{C×H×W}, DCSA first utilizes the projection layers to produce query, key, and value tensors by Q = W_Q X_N, K = W_K X_N, and V = W_V X_N, where W_(·) denotes parameters of 1 × 1 pointwise convolution. Then, the obtained Q, K, and V tensors are reshaped into the size of C × N, N × C, and C × N, respectively, where N = H × W. The query and key tensors are further normalized and downsampled to prepare for cross-covariance attention [14]. The transposed attention map is calculated by Q and K, with the size of R^{C×C}. The output of DCSA is obtained by

    X̂_DCSA = Softmax(QK/τ)V    (4)

where τ is a learnable temperature parameter and X̂_DCSA ∈ R^{C×N} is reshaped to the original input feature size of R^{H×W×C} for further modulation operation.

2) Modulation Design: DCSA encodes downsampled global information while disregarding fine-grained local details during the downsampling process. To complement local information, we first filter the initially generated V tensor using a 3 × 3 depthwise convolution, which is expressed as follows:

    X̂_{M3×3} = Sigmoid(Dw_{3×3}(V)) ⊙ V    (5)

where Dw_{3×3} is a depthwise convolution of kernel size 3 × 3. Next, we modulate the output of DCSA with the locally filtered result via elementwise multiplication. This approach enables the model to capture both downsampled global and fine-grained local dependencies while mapping inputs into high-dimensional spaces, thereby enhancing its representational capacity. To simplify the analyses, we assume a scenario with a single-pixel input x ∈ R^{d×1} and a single-element output x̂ ∈ R^{1×1}, where d is the channel count. We define w_1, w_2 ∈ R^{1×d} as convolution parameters. The modulation process [48], which involves a single convolution in each branch, can be expressed as follows:

    (w_1^T x) ⊙ (w_2^T x) = (Σ_{i=1}^{d} w_1^i x^i)(Σ_{j=1}^{d} w_2^j x^j)    (6)
                          = Σ_{i=1}^{d} Σ_{j=1}^{d} w_1^i w_2^j x^i x^j      (7)
                          = α_{1,1} x^1 x^1 + ··· + α_{2,3} x^2 x^3 + ··· + α_{d,d} x^d x^d    (8)

where the expansion in (8) contains d(d + 1)/2 distinct terms, and

    α_{i,j} = w_1^i w_2^j,                i = j
    α_{i,j} = w_1^i w_2^j + w_1^j w_2^i,  i ≠ j    (9)

where i and j index the channel. It can be observed that each term in (9) exhibits a nonlinear relationship with x and represents an individual dimension, suggesting that this scenario achieves a representation in an implicit feature space of dimensionality d(d + 1)/2. Note that besides convolutions, the branches in our modulation design experience complicated SA, further improving the representational capability. Additionally, we apply a 7 × 7 kernel branch to further modulate the preceding outcome and supply mesoscale receptive fields.
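To make (3)-(5) concrete, the sketch below implements a single-head CMB in PyTorch under stated assumptions: GroupNorm stands in for the normalization layer, average pooling realizes the ×8 downsampling of Q and K, and the 7 × 7 mesoscale branch is assumed to mirror the 3 × 3 branch of (5). These choices are illustrative and not confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMB(nn.Module):
    """Channelwise modulation block, a minimal sketch of Eqs. (3)-(5)."""
    def __init__(self, dim, ds_ratio=8):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)                 # stands in for layer normalization
        self.qkv = nn.Conv2d(dim, dim * 3, 1)            # W_Q, W_K, W_V as 1x1 convolutions
        self.dw3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # Dw_3x3, local branch
        self.dw7 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # 7x7 mesoscale branch
        self.w1 = nn.Conv2d(dim, dim, 1)                 # W_1
        self.w2 = nn.Conv2d(dim, dim, 1)                 # W_2
        self.tau = nn.Parameter(torch.ones(1))           # learnable temperature in (4)
        self.ds = ds_ratio                               # downsampling ratio of SA (8)

    def forward(self, x):
        # Assumes H and W are divisible by the downsampling ratio.
        b, c, h, w = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=1)

        # DCSA, Eq. (4): channelwise (transposed) attention with downsampled Q and K.
        q_ds = F.adaptive_avg_pool2d(q, (h // self.ds, w // self.ds)).flatten(2)  # C x N'
        k_ds = F.adaptive_avg_pool2d(k, (h // self.ds, w // self.ds)).flatten(2)  # C x N'
        q_ds = F.normalize(q_ds, dim=-1)
        k_ds = F.normalize(k_ds, dim=-1)
        attn = torch.softmax(q_ds @ k_ds.transpose(-2, -1) / self.tau, dim=-1)    # C x C map
        dcsa = (attn @ v.flatten(2)).view(b, c, h, w)                             # apply to V

        # Cascaded modulation, Eqs. (3) and (5); the 7x7 branch form is an assumption.
        m3 = torch.sigmoid(self.dw3(v)) * v
        m7 = torch.sigmoid(self.dw7(v)) * v
        return self.w2(m7 * self.w1(m3 * dcsa))
```

Consistent with the analysis in (6)-(9), each elementwise product above implicitly expands the representation into pairwise channel interactions; for d = 2, for instance, (w_1^T x)(w_2^T x) = α_{1,1} x^1 x^1 + α_{1,2} x^1 x^2 + α_{2,2} x^2 x^2, i.e., 3 = d(d + 1)/2 distinct monomials.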


TABLE I. Details of Our Model Versions. The Number of Heads at the Three Scales Is Set to [1, 2, 4]. FLOPs Are Measured on 3 × 256 × 256 Patches

Fig. 5. Motivation of the parameter-sharing mechanism.

3) Parameter Sharing: We employ a parameter-sharing mechanism inspired by the relationship between the hippocampus and cortex in the brain [49]. Specifically, as illustrated in Fig. 5, although different layers of the cortex, such as layers II and III, perform distinct tasks, they exchange information with the shared memory in the hippocampal field CA1. Accordingly, we conceptualize the attention layer as analogous to the hippocampus, while the feed-forward layer represents the cortex, forming the foundation of our parameter-sharing mechanism illustrated in the left part of Fig. 3. Interestingly, this design not only saves parameters but also improves performance.

C. Spatialwise Modulation Block

Fig. 4(b) presents SMB, which has three branches: a downsampled spatialwise attention unit (DSSA) and two modulation operators. The output of SMB is obtained by

    X̂_SMB = W_4(X̂_{M7×7} ⊙ W_3(X̂_{M3×3} ⊙ DSSA(X_SMB)))    (10)

where X_SMB is the input of SMB.

1) DSSA: DSSA is used at low-resolution scales to model spatial global features. Similarly, we also assume the number of heads is 1 to transfer DSSA to single-head mode. Given any input X ∈ R^{H×W×C}, it is first processed by layer normalization to yield X_N. Then, the query (Q), key (K), and value (V) tensors are produced by Q = W_Q X_N, K = W_K X_N↓, and V = W_V X_N↓, where K and V are generated from the downsampled input (X_N↓) for high efficiency. After reshaping Q, K, and V to new tensors of size N × C, C × N′, and N′ × C, respectively, where N = H × W and N′ = H/8 × W/8, the calculation process of DSSA is formulated as follows:

    X̂_DSSA = Softmax(QK/√C)V.    (11)

2) Modulation Design: Similar to CMB, we utilize a cascaded modulation design with kernel sizes of 3 × 3 and 7 × 7 to complement local and mesoscale information. As such, the model is equipped with an approximate omni-kernel modulation ability, i.e., local-mesoscale-global.

D. Dual-Domain Feed-Forward Network

DFFN facilitates the spatial-spatial and spatial-spectral interactions for high-fidelity reconstruction. Fig. 3 illustrates the architecture. To be specific, given input features X ∈ R^{H×W×C}, DFFN first applies layer normalization, followed by GEGLU [25], as formulated below:

    X̂_{S-S} = W_7(GELU(Dw^1_3(W_5(X_N))) ⊙ Dw^2_3(W_6(X_N)))    (12)

where W_5, W_6, and W_7 denote 1 × 1 convolutions, Dw^1_3 and Dw^2_3 are 3 × 3 depthwise convolutions, X_N is the normalized input, and X̂_{S-S} is the spatial-spatial interaction output.

Furthermore, DFFN facilitates spatial-spectral interactions by integrating the Fourier-domain refined output with spatial features, guided by learnable attention weights. The calculation process is formulated as follows:

    X̂_DFFN = αX_Spectral + (1 − α)X̂_{S-S}            (13)
    X_Spectral = P^{-1}(F^{-1}(W ⊙ F(P(X̂_{S-S}))))    (14)

where F and F^{-1} denote the FFT and the inverse transform, respectively, P and P^{-1} are the window partition operation and its inverse, respectively, W is the learnable parameter to filter the frequency signals [26], and α is the learnable parameter to control information aggregation.

IV. EXPERIMENTS

We evaluate the performance of our proposed Modumer on three kinds of tasks: single-degradation, all-in-one, and composite-degradation image restoration. We train separate model instances for different single-degradation tasks, while a unified model is trained on a mixed dataset encompassing multiple tasks for the all-in-one version. For composite-degradation restoration, the training dataset consists of images affected by multiple degradation types simultaneously. Based on the complexity of different datasets, we deploy two versions of our model, Modumer-S (small) and Modumer-B (base), to ensure a better trade-off between efficiency and accuracy. More details can be found in Table I.

A. Single-Degradation Image Restoration

1) Implementation Details: We evaluate our model on five representative tasks with ten benchmark datasets (see Table II). We adopt the dual-domain loss functions [11], [26], [37] to train the network for 300 000 iterations with the Adam optimizer. The deblurring task needs another 300 000 iterations following [26]. The initial learning rate is set to 1e-3, which is gradually reduced to 1e-7 with the cosine annealing strategy. The patch size is set to 128 × 128, and the batch size is 32. We adopt the same data augmentation strategy as [14]. The window size in DFFN and the downsampling ratio in SA are set to 8. In tables, the best results are highlighted.

2) Results:

a) Image Deraining: The numerical results on the raindrop dataset AGAN-Data [27] are presented in Table III. Our method significantly outperforms the recent Transformer-based AST-B [61] and FPro [60] by 0.73 and 1.09 dB, respectively, while consuming lower complexity, as illustrated in Fig. 2(a). Fig. 6 shows that our method is more effective in raindrop removal than competitors.
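Returning to the DFFN defined in (12)-(14) above, the following PyTorch sketch spells out the two interactions. The hidden expansion factor, the use of a real FFT with a complex-valued learnable filter, and the sigmoid that keeps the fusion weight in [0, 1] are our assumptions; the window size of 8 follows the implementation details above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFFN(nn.Module):
    """Dual-domain feed-forward network, a minimal sketch of Eqs. (12)-(14)."""
    def __init__(self, dim, expansion=2, window=8):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.GroupNorm(1, dim)                  # stands in for layer normalization
        self.w5 = nn.Conv2d(dim, hidden, 1)
        self.w6 = nn.Conv2d(dim, hidden, 1)
        self.dw1 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.dw2 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.w7 = nn.Conv2d(hidden, dim, 1)
        self.win = window
        # Learnable spectral filter W, stored as a real/imag pair, identity-initialized.
        filt = torch.zeros(dim, window, window // 2 + 1, 2)
        filt[..., 0] = 1.0
        self.filter = nn.Parameter(filt)
        self.alpha = nn.Parameter(torch.zeros(1))         # fusion weight in (13)

    def forward(self, x):
        xn = self.norm(x)
        # Eq. (12): spatial-spatial interaction (GEGLU with depthwise convolutions).
        x_ss = self.w7(F.gelu(self.dw1(self.w5(xn))) * self.dw2(self.w6(xn)))

        # Eq. (14): window partition -> FFT -> learnable filtering -> inverse FFT.
        b, c, h, w = x_ss.shape
        k = self.win  # assumes h and w are divisible by the window size
        win = x_ss.reshape(b, c, h // k, k, w // k, k).permute(0, 1, 2, 4, 3, 5)  # P(.)
        spec = torch.fft.rfft2(win)                                               # per-window FFT
        spec = spec * torch.view_as_complex(self.filter)[None, :, None, None]     # W (.) filter
        out = torch.fft.irfft2(spec, s=(k, k))                                    # inverse FFT
        x_spec = out.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)                # P^{-1}(.)

        # Eq. (13): attention-weighted fusion of spectral and spatial branches.
        a = torch.sigmoid(self.alpha)
        return a * x_spec + (1 - a) * x_ss
```

With the window size of 8 used in the paper, the spectral branch filters each 8 × 8 window independently, which corresponds to the window partition P in (14).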


TABLE II. Dataset Summary for Five Single-Degradation Image Restoration Tasks

Fig. 6. Visual comparisons on the raindrop AGAN-Data [27] dataset.

Fig. 7. Deblurred results on GoPro [42]. Compared to other algorithms, the proposed method restores more details and clearer structures from the input.

Fig. 8. Image dehazing comparisons on the Haze4k [30] dataset.

TABLE III. Quantitative Results on AGAN-Data [27] for Raindrop Removal

Moreover, the comparison results on the rain streak dataset SPAD [50] are reported in Table IV. As seen, our method achieves the best performance in terms of PSNR, outperforming the previous state-of-the-art algorithm [61] by 0.06 dB PSNR.

b) Image Motion Deblurring: We conduct experiments for motion deblurring on the GoPro [42] dataset and compare our results with state-of-the-art works in Table V. Our method significantly surpasses the recent frequency-based Transformer model [73] by 0.18 dB PSNR while using 65% fewer parameters. Compared to the recent convolutional network ConvIR-L [70], our method achieves a notable gain of 0.99 dB PSNR. The visual results in Fig. 7 show that our model recovers more structural details from the hard example. We further apply our model pretrained on GoPro to the HIDE [28] dataset. The results presented in Table V show that our method obtains the best result in PSNR with a prominent gain of 0.15 dB over the second-best LoFormer-L [73], demonstrating the better generalization ability of our model.

c) Image Dehazing: We perform dehazing experiments on the Haze4k [30] dataset.


Fig. 9. Image desnowing comparisons on the CSD [29] dataset.

TABLE IV. Quantitative Results on SPAD [50] for Rain Streak Removal

TABLE V. Image Motion Deblurring Results. Our Model Is Trained Only on the GoPro [42] Dataset and Directly Applied to the GoPro [42] and HIDE [28] Datasets

TABLE VI. Image Dehazing Comparisons on the Haze4k [30] Dataset

TABLE VII. Quantitative Results on GTA5 [51] for Night Haze Removal

TABLE VIII. Image Desnowing Comparisons on Three Widely Used Datasets

Fig. 10. Visualization of the training process on the Snow100K [9] dataset.

The numerical results are presented in Table VI. Our model attains a significant performance gain of 0.54 dB PSNR over the recent algorithm [70] with lower FLOPs [see Fig. 2(d)]. Compared to the CNN-based method FSNet [69], our method has a more obvious advantage with much lower complexity. Fig. 8 shows that our model can better deal with haze degradations than other algorithms. Additionally, we provide comparison results on a nighttime dehazing dataset GTA5 [51] in Table VII. Our Modumer-S is still superior to the strong competitors.

d) Image Desnowing: Furthermore, we verify the effectiveness of our model in snow removal using three datasets: CSD [29], SRRS [52], and Snow100K [9]. The quantitative results are presented in Table VIII. Our method achieves 39.17 dB PSNR on the CSD dataset, 0.74 dB higher than the second-best algorithm [70]. The superiority of our model is also evident on the other two datasets, further demonstrating its effectiveness in snow removal. Fig. 9 shows that our model yields a more favorable image by removing more snow degradations. Visualizations of the training process on the Snow100K [9] dataset are illustrated in Fig. 10.


Fig. 11. Visual results on the LOL-V2-syn [53] dataset.

TABLE IX. Numerical Comparisons on the LOL-V2-Syn Dataset [53] for Low-Light Image Enhancement

TABLE X. Ablation Studies for Each Component. FLOPs and Memory Footprint Are Measured on a 3 × 256 × 256 Patch Size Using ptflops and torch.cuda.max_memory_allocated(), Respectively

TABLE XI. Ablation Studies of the Deployment Strategy for Different Kinds of Attention

TABLE XII. More Ablation Studies for DFFN

e) Low-Light Image Enhancement: The numerical results on LOL-V2-Syn [53] for low-light image enhancement are presented in Table IX. Our method significantly outperforms the Transformer-based algorithm Retinexformer [83] by 0.43 dB PSNR. Notably, our model also surpasses the recent Mamba-based MambaLLIE [84]. The visual results are illustrated in Fig. 11. Our model recovers more edges from the input image. These results highlight the strong potential of our method for low-light image enhancement.

3) Ablation Studies: We perform ablation studies by training our small model for 70 000 iterations on GoPro [42].

a) Effects of Individual Components: Table X shows the results of individually removing the proposed component from the complete model. The removal of our modulation branch results in a performance degradation compared to the full model. Our parameter-sharing mechanism achieves a 0.04 dB PSNR performance improvement while requiring fewer parameters. Employing only the spatial-spatial interactions, i.e., GEGLU, in the feed-forward network achieves 31.69 dB PSNR, which is 0.13 dB lower than our full model. Additionally, the FLOPs and memory footprint comparisons indicate that our designs introduce minimal computational complexity and memory overhead. These results demonstrate the effectiveness of our proposed modules and mechanism.

b) Deployment Strategy for Attention: We deploy CMBs at the first scale while using spatialwise blocks at other scales, as spatialwise SA is more computationally expensive than its channelwise counterpart when modeling large-scale features. In our case, the first scale includes the highest resolution features. Table XI shows that our strategy achieves the best performance. Moreover, we experiment by using only regular channel attention [14] in all scales, achieving a 0.32 dB lower performance than our full model. These results validate the efficacy of our design.

c) DFFN: We conduct more ablation studies for DFFN by removing or substituting certain operators. Table XII shows that removing the spatial branch in interdomain fusion achieves 31.76 dB PSNR, suggesting the significance of dual-domain feature fusion. Removing the attention weights leads to 31.73 dB PSNR, which is even lower than the result of using a single branch, demonstrating the importance of coordinating the fusion process.

d) Modulation Design: In this part, we perform ablation studies for the modulation design. We use the plain depthwise convolutions with the same kernel size to supplant the filter operation, achieving 31.69 dB PSNR, which is 0.13 dB lower than our design.

e) Parameter-Sharing Mechanism: In our model, we share the parameters across CMB. We carry out experiments to apply the parameter-sharing strategy to deeper scales, achieving lower performance than our design (see Table XIII). We also attempt to further share the parameters among DFFN in the first scale, obtaining only 30.53 dB PSNR. Therefore, we only apply the mechanism in CMB for better performance.
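As a concrete reading of the parameter-sharing choice examined above, the sketch below registers a single CMB instance in every unit of the first scale while keeping one DFFN per unit. The factory functions are hypothetical, and ModumerUnit refers to the sketch given after (1) and (2) in Section III-A.

```python
import torch.nn as nn

def build_first_scale(dim, num_blocks, make_cmb, make_dffn):
    """Parameter sharing at the first scale: a minimal sketch.

    One CMB instance (the 'hippocampus') is reused by every unit of the scale,
    so its attention weights are shared, while each unit keeps its own DFFN
    (the 'cortex'), whose weights are not shared.
    """
    shared_attn = make_cmb(dim)  # one set of attention parameters for the whole scale
    return nn.ModuleList(
        [ModumerUnit(shared_attn, make_dffn(dim)) for _ in range(num_blocks)]
    )
```

Because PyTorch ties parameters by module instance, reusing the same attention module in several units is sufficient to share its weights, while the per-unit DFFNs keep their own parameters, matching the design described above (sharing only the first-scale CMB, not the DFFN).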


Fig. 12. Visualization comparison of intermediate features obtained from models with and without downsampling.

TABLE XIII. Ablation Studies for the Parameter-Sharing Mechanism. Scale 0, 1, 2 Indicates That Parameters Are Shared Within Each Scale Across All Scales

f) Ablations for Downsampling Operation: In CMB, we apply downsampling after the convolutions to fully capture spatial connectivity, as the channelwise SA layer is unable to model true spatial pixel interactions. We evaluate the impact of performing downsampling before convolutions, which reduces the computational cost by 1.89 GFLOPs but results in a 0.11 dB decrease in PSNR. Consequently, we opt to place downsampling after convolutions in CMB.

We further conduct an experiment using the model without downsampling. Interestingly, this variant achieves a 0.2 dB lower PSNR than our design (31.62 versus 31.82), indicating that the synergy between our two modulation branches and downsampling-based global modeling is more effective than the direct global operation. Fig. 12 presents a visualization comparison of intermediate features obtained from models with and without downsampling. Compared to full-size attention, our model exhibits an enhanced ability to capture long-range features. Notably, it produces sharper results for the girl, which represents the most blurred region in the first image. Furthermore, our model effectively preserves local details, such as the wires in the top-right region of the second example.

g) Frequency Processing or Convolutions?: In DFFN, we employ Fourier processing to facilitate interdomain interactions. Compared to convolutions, our operation offers two key advantages.
1) According to the convolution theorem, a pixel in the Fourier spectrum incorporates global information from the spatial domain. Consequently, Fourier processing can effectively model long-range dependencies while reducing parameter usage and improving efficiency.
2) Since there are frequency discrepancies between degraded and clean image pairs, certain degradation patterns can be easily removed in the Fourier domain. We conduct an experiment by replacing the Fourier processing with a 7 × 7 depthwise convolution to demonstrate the effectiveness of our choice. This version achieves a PSNR of 31.67 dB, which is 0.15 dB lower than our design while introducing more parameters (0.02 M) and higher complexity (2.06 GFLOPs).

4) Applications: To assess the effectiveness of our approach in enhancing the performance of high-level vision tasks, particularly object detection, we integrate YOLOv7 [86] into our evaluation. Specifically, we apply the detector to both the degraded observations and the restored images generated by our method across three image restoration tasks. As depicted in Fig. 13, our results facilitate improved object detection by enabling the model to identify a greater number of objects, enhance detection confidence, and rectify misclassifications. For instance, in the snowy scene, our method successfully corrects the erroneous identification of a mouse instance.

5) Limitation: Although our model yields superior results on the GoPro [42] dataset for image deblurring, it struggles to produce completely clean outputs for fast-moving subjects, as exemplified by the foreground person in Fig. 14. A potential solution to mitigate this issue is the incorporation of optical flow to better handle large movements.

B. All-in-One Image Restoration

1) Implementation Details: Many all-in-one methods are evaluated under two settings: the three-task setting, as used by [94] and [93], and the five-task setting, as employed by [97]. Recent studies, like [96], have conducted experiments in both settings. We follow this trend to ensure a comprehensive and complete evaluation of our model, Modumer-B. The dataset summary is presented in Table XIV. The model is trained on 32 samples of size 128 × 128 per iteration using the Adam optimizer with a learning rate of 2e-4. Training is conducted for 150 epochs with the L1 loss function.

2) Results: For the three-task setting, the model is trained on a mixed dataset obtained from denoising, dehazing, and deraining. Table XV shows that our model achieves an average score of 32.77 dB PSNR, 0.34 dB higher than the recent InstructIR [96]. Moreover, our method achieves the best performance across most metrics. Notably, for the deraining task, our model outperforms InstructIR by 0.8 dB. Fig. 15 demonstrates that our model is more effective in removing rain streaks, resulting in a noticeably cleaner image.

Moreover, we report experimental results under the five-task all-in-one setting. The quantitative results are presented in Table XVI. As observed, our method achieves a PSNR score of 30.19 dB when averaged across all tasks, which is 0.64 and 1.22 dB higher than those of InstructIR [96] and MambaIR [98], respectively. In particular, for the dehazing problem, our model significantly outperforms the second-best algorithm [96] by 3.19 dB PSNR. Despite not incorporating a complex dynamic mechanism for identifying degradation types, our method consistently delivers promising results across various all-in-one tasks, thanks to its robust representational capability.
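For the all-in-one setting just described, a minimal training-loop sketch is given below. The data pipeline is an assumption (paired degraded/clean datasets per task, simply concatenated and shuffled so that each batch mixes degradation types), while the optimizer, learning rate, batch size, epoch count, and L1 loss follow the implementation details above.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import ConcatDataset, DataLoader

def train_all_in_one(model, task_datasets, epochs=150, lr=2e-4, batch_size=32, device="cuda"):
    """All-in-one training sketch: one unified model over a mixed dataset.

    `task_datasets` is assumed to be a list of datasets, one per task, each
    yielding (degraded, clean) pairs of 128 x 128 patches.
    """
    loader = DataLoader(ConcatDataset(task_datasets), batch_size=batch_size,
                        shuffle=True, num_workers=4, drop_last=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train().to(device)
    for _ in range(epochs):
        for degraded, clean in loader:
            degraded, clean = degraded.to(device), clean.to(device)
            loss = F.l1_loss(model(degraded), clean)   # L1 loss, as stated above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```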


Fig. 13. Object detection results on the degraded and our restored images. The examples are obtained from the CSD [29], GoPro [42], and LOL-V2-syn [53] datasets for desnowing, deblurring, and low-light enhancement, respectively.

TABLE XIV. Datasets Used for the All-in-One Setting. Motion Deblurring and Low-Light Enhancement Are Only Used for the Five-Task Setting

TABLE XV. Quantitative Comparisons on Three Image Restoration Tasks Under the All-in-One Setting

Fig. 14. Our model struggles to produce sharp results for fast-moving subjects.

3) Evaluation on Deep Learning-Based Metrics: In addition to distortion-based metrics, e.g., PSNR and SSIM, we additionally compare our model with recent competing algorithms under the three-task setting using the learning-based metrics, including LPIPS [101] and DISTS [102]. Table XVII shows that our model outperforms two recent competitors across all datasets. This demonstrates the effectiveness of our approach in preserving perceptual quality while maintaining structural and textural consistency.

4) Generalization: To evaluate the generalization capability of our method, we directly apply the model trained in the three-task setting to the UAVDT [103] dataset. Fig. 16 illustrates that our model is robust in out-of-distribution scenarios, removing more hazy degradations from challenging inputs than competing all-in-one algorithms.


Fig. 15. Visual comparisons on the Rain100 [85] dataset under the all-in-one setting. The image produced by our model is closer to the reference image.

TABLE XVI. Numerical Comparisons on Five Image Restoration Tasks Under the All-in-One Setting: Dehazing (SOTS [89]), Deraining (Rain100L [85]), Denoising (BSD68 [91]), Deblurring (GoPro [42]), and Low-Light Image Enhancement (LOL-V1 [90])

TABLE XVII. Quantitative Evaluation Under the All-in-One Setting With Learning-Based Metrics

Fig. 16. Results of applying models trained in the three-task setting to images from the UAVDT [103] dataset.

C. Composite-Degradation Image Restoration

We conduct experiments on the CDD-11 [99] dataset for composite-degradation image restoration. This dataset comprises a total of 11 degradation categories, created by combining four types: low light, haze, rain, and snow. The training configuration follows that of single-degradation image restoration. Table XVIII shows that our model achieves the best results in most degradation categories while requiring fewer parameters. Specifically, compared to the previous state-of-the-art algorithm [99], our model yields a significant performance gain of 1.08 dB in PSNR when averaged across all 11 degradation categories. Visual comparisons in Fig. 17 demonstrate that our model is more effective in removing mixed degradations, whereas the results generated by competing methods still contain snow and haze artifacts. Fig. 18 illustrates the training process on this dataset.

D. Discussion

1) Evaluation Datasets: In this study, we conduct experiments transitioning from single-degradation to composite-degradation scenarios for image restoration, aligning with the evolving trends in the field. However, given the diverse real-world conditions encountered by users, it is impractical to encompass all possible scenarios within a single study. Consequently, several valuable datasets remain available for further exploration and evaluation.

Fig. 17. Visual comparisons on the CDD-11 [99] dataset under the haze + snow scenario.

TABLE XVIII. Quantitative Comparisons on CDD-11 [99] for Composite-Degradation Image Restoration. The Scores Are Reported in the Form of PSNR (dB, ↑) and SSIM (↑). Our Model Outperforms the Previous Leading Algorithm [99] While Using Fewer Parameters

Fig. 18. Visualization of the training process on the CDD-11 [99] dataset.

For instance, MC-Blur [104] includes four types of blur (uniform blur, motion blur caused by averaging continuous frames, heavy defocus blur, and real-world blur), providing a comprehensive benchmark for multi-cause image deblurring frameworks [105]. Similarly, the SnowKITTI2012 and SnowCityScapes datasets introduced in [106] feature three levels of snow degradation in street environments, facilitating the development of algorithms aimed at enhancing autonomous driving in adverse weather conditions. Additionally, JRSRD [107] synthesizes rain streaks and raindrops simultaneously, offering a more realistic representation of rainy conditions compared to datasets that include only a single type of degradation.

Furthermore, our model can be extended to stereo image restoration tasks, such as evaluating its performance on deraining using the Stereo RainKITTI2012 dataset [108]. These additional datasets present promising avenues for future research, enabling a more comprehensive assessment of image restoration techniques across varied conditions.

2) Network Architecture: We employ a unified network architecture across various image restoration tasks, including single-degradation, all-in-one, and composite-degradation scenarios. In all-in-one settings, a common approach involves first extracting task-aware information, which then guides the restoration process [93], [94]. Despite not incorporating such explicit task-specific components, our model achieves state-of-the-art performance on two all-in-one tasks. This success can be attributed to two key factors: 1) the modulation operation, which functions as a gated mechanism, dynamically attending to informative signals for different tasks; and 2) the synergy between the modulation operation and dual-domain interactions in DFFN, which enhances the model's representational capacity without significantly increasing complexity. A similar pattern is observed in the single-degradation setting, where our model consistently demonstrates strong performance across multiple tasks. While employing a unified architecture for different image restoration scenarios offers practical advantages, integrating task-specific or spatially adaptive strategies remains a promising direction for further performance enhancement.

V. CONCLUSION

This study presents an effective and efficient Transformer model for image restoration, termed Modumer. The model incorporates different downsampled SA layers with cascaded modulation designs, which can model omni-receptive-field features, keep a better balance between complexity and accuracy, and map features into high-dimensional spaces. Moreover, we investigate a bioinspired parameter-sharing mechanism in attention layers, improving efficiency and performance. In addition, we introduce a DFFN to facilitate intra- and interdomain interactions. Comprehensive experiments across three categories of image restoration tasks validate the effectiveness of our model.

REFERENCES

[1] Y. Quan, P. Lin, Y. Xu, Y. Nan, and H. Ji, "Nonblind image deblurring via deep learning in complex field," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 10, pp. 5387–5400, Oct. 2021.
[2] Y. Cui, Y. Tao, W. Ren, and A. Knoll, "Dual-domain attention for image deblurring," in Proc. AAAI Conf. Artif. Intell., vol. 37, Jun. 2023, pp. 479–487.
[3] Y. Zheng, X. Yu, M. Liu, and S. Zhang, "Single-image deraining via recurrent residual multiscale networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 3, pp. 1310–1323, Mar. 2022.
[4] Y. Cui and A. Knoll, "Exploring the potential of channel interactions for image restoration," Knowl.-Based Syst., vol. 282, Dec. 2023, Art. no. 111156.
[5] Y. Zhou, Z. Chen, P. Li, H. Song, C. L. P. Chen, and B. Sheng, "FSAD-Net: Feedback spatial attention dehazing network," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 10, pp. 7719–7733, Oct. 2023.
[6] K. Jiang et al., "Multi-scale hybrid fusion network for single image deraining," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 7, pp. 3594–3608, Jul. 2023.


[7] Y. Cui, Y. Tao, L. Jing, and A. Knoll, "Strip attention for image restoration," in Proc. Int. Joint Conf. Artif. Intell., Aug. 2023, pp. 645–653.
[8] Q. Xu, Z. Wang, Y. Bai, X. Xie, and H. Jia, "FFA-Net: Feature fusion attention network for single image dehazing," in Proc. AAAI Conf. Artif. Intell., vol. 34, Apr. 2020, pp. 11908–11915.
[9] Y.-F. Liu, D.-W. Jaw, S.-C. Huang, and J.-N. Hwang, "DesnowNet: Context-aware deep network for snow removal," IEEE Trans. Image Process., vol. 27, no. 6, pp. 3064–3073, Jun. 2018.
[10] Y. Cui, W. Ren, and A. Knoll, "Omni-kernel modulation for universal image restoration," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 12, pp. 12496–12509, Dec. 2024.
[11] S.-J. Cho, S.-W. Ji, J.-P. Hong, S.-W. Jung, and S.-J. Ko, "Rethinking coarse-to-fine approach in single image deblurring," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 4621–4630.
[12] X. Chen et al., "A comparative study of image restoration networks for general backbone network design," in Proc. Eur. Conf. Comput. Vis., Oct. 2024, pp. 74–91.
[13] Y. Song, Z. He, H. Qian, and X. Du, "Vision transformers for single image dehazing," IEEE Trans. Image Process., vol. 32, pp. 1927–1941, 2023.
[14] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, "Restormer: Efficient transformer for high-resolution image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5718–5729.
[15] Y. Cui and A. Knoll, "PSNet: Towards efficient image restoration with self-attention," IEEE Robot. Autom. Lett., vol. 8, no. 9, pp. 5735–5742, Sep. 2023.
[16] J.-G. Wang, Y. Cui, Y. Li, W. Ren, and X. Cao, "Omnidirectional image super-resolution via bi-projection fusion," in Proc. AAAI Conf. Artif. Intell., vol. 38, Mar. 2024, pp. 5454–5462.
[17] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, "SwinIR: Image restoration using Swin transformer," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 1833–1844.
[18] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, "Uformer: A general U-shaped transformer for image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17662–17672.
[19] F.-J. Tsai, Y.-T. Peng, Y. Lin, C.-C. Tsai, and C. Lin, "Stripformer: Strip transformer for fast image deblurring," in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 146–162.
[20] J. Zhang, Y. Zhang, J. Gu, J. Dong, L. Kong, and X. Yang, "Xformer: Hybrid X-shaped transformer for image denoising," in Proc. 12th Int. Conf. Learn. Represent., Jan. 2023.
[21] X. Ma et al., "Efficient modulation for vision networks," in Proc. 12th Int. Conf. Learn. Represent., Mar. 2024.
[22] Q. Hou, C.-Z. Lu, M.-M. Cheng, and J. Feng, "Conv2Former: A simple transformer-style ConvNet for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 8274–8283, Dec. 2024.
[23] M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng, and S.-M. Hu, "Visual attention network," Comput. Vis. Media, vol. 9, no. 4, pp. 733–752, Dec. 2023.
[24] J. Yang, C. Li, and J. Gao, "Focal modulation networks," in Proc. Adv. Neural Inf. Process. Syst., Jan. 2022, pp. 4203–4217.
[25] N. Shazeer, "GLU variants improve transformer," 2020, arXiv:2002.05202.
[26] L. Kong, J. Dong, J. Ge, M. Li, and J. Pan, "Efficient frequency domain-based transformers for high-quality image deblurring," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 5886–5895.
[27] R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu, "Attentive generative adversarial network for raindrop removal from a single image," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2482–2491.
[28] Z. Shen et al., "Human-aware motion deblurring," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5571–5580.
[29] W.-T. Chen et al., "ALL snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 4176–4185.
[30] Y. Liu et al., "From synthetic to real: Image dehazing collaborating with unlabeled real data," in Proc. 29th ACM Int. Conf. Multimedia, Oct. 2021, pp. 50–58.
[31] Y. Cui and A. Knoll, "Dual-domain strip attention for image restoration," Neural Netw., vol. 171, pp. 429–439, Mar. 2024.
[32] L. Ruan, B. Chen, J. Li, and M. Lam, "Learning to deblur using light field generated and real defocus images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16283–16292.
[33] M. Liu, Y. Cui, W. Ren, J. Zhou, and A. C. Knoll, "LIEDNet: A lightweight network for low-light enhancement and deblurring," IEEE Trans. Circuits Syst. Video Technol., early access, Feb. 13, 2025, doi: 10.1109/TCSVT.2025.3541429.
[34] X. Su et al., "Prior-guided hierarchical harmonization network for efficient image dehazing," 2025, arXiv:2503.01136.
[35] Y. Cui and A. Knoll, "Enhancing local–global representation learning for image restoration," IEEE Trans. Ind. Informat., vol. 20, no. 4, pp. 6522–6530, Apr. 2024.
[36] Y. Cui, Q. Wang, C. Li, W. Ren, and A. Knoll, "EENet: An effective and efficient network for single image dehazing," Pattern Recognit., vol. 158, Feb. 2025, Art. no. 111074.
[37] Y. Cui, W. Ren, X. Cao, and A. Knoll, "Focal network for image restoration," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12955–12965.
[38] H. Son, J. Lee, S. Cho, and S. Lee, "Single image defocus deblurring using kernel-sharing parallel atrous convolutions," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2622–2630.
[39] Y. Cui, M. Liu, W. Ren, and A. Knoll, "Hybrid frequency modulation network for image restoration," in Proc. 33rd Int. Joint Conf. Artif. Intell., Aug. 2024, pp. 722–730.
[40] K.-H. Liu, C.-H. Yeh, J.-W. Chung, and C.-Y. Chang, "A motion deblur method based on multi-scale high frequency residual image learning," IEEE Access, vol. 8, pp. 66025–66036, 2020.
[41] Y. Cui, J. Zhu, and A. Knoll, "Enhancing perception for autonomous vehicles: A multi-scale feature modulation network for image restoration," IEEE Trans. Intell. Transp. Syst., vol. 26, no. 4, pp. 4621–4632, Apr. 2025.
[42] S. Nah, T. H. Kim, and K. M. Lee, "Deep multi-scale convolutional neural network for dynamic scene deblurring," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 257–265.
[43] Q. Wang, Y. Cui, Y. Li, Y. Ruan, B. Zhu, and W. Ren, "RFFNet: Towards robust and flexible fusion for low-light image denoising," in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 836–845.
[44] K. Jiang et al., "Multi-scale progressive fusion network for single image deraining," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8343–8352.
[45] Y. Cui, W. Ren, S. Yang, X. Cao, and A. Knoll, "IRNeXt: Rethinking convolutional network design for image restoration," in Proc. Int. Conf. Mach. Learn., Jul. 2023, pp. 6545–6564.
[46] C. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, "Image dehazing transformer with transmission-aware 3D position embedding," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5802–5810.
[47] W. Yu and X. Wang, "MambaOut: Do we really need Mamba for vision?," 2024, arXiv:2405.07992.
[48] X. Ma, X. Dai, Y. Bai, Y. Wang, and Y. Fu, "Rewrite the stars," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 5694–5703.
[49] M. P. Witter, T. P. Doan, B. Jacobsen, E. S. Nilssen, and S. Ohara, "Architecture of the entorhinal cortex: A review of entorhinal anatomy in rodents with some comparative notes," Frontiers Syst. Neurosci., vol. 11, p. 46, Jun. 2017.
[50] T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. H. Lau, "Spatial attentive single-image deraining with a high quality real rain dataset," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12262–12271.
[51] W. Yan, R. T. Tan, and D. Dai, "Nighttime defogging using high-low frequency decomposition and grayscale-color networks," in Proc. Eur. Conf. Comput. Vis., Jan. 2020, pp. 473–488.
[52] W. Chen, H. Fang, J. Ding, C.-C. Tsai, and S. Kuo, "JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal," in Proc. Eur. Conf. Comput. Vis., Jan. 2020, pp. 754–770.
[53] W. Yang, W. Wang, H. Huang, S. Wang, and J. Liu, "Sparse gradient regularized deep retinex network for robust low-light image enhancement," IEEE Trans. Image Process., vol. 30, pp. 2072–2086, 2021.
[54] X. Liu, M. Suganuma, Z. Sun, and T. Okatani, "Dual residual networks leveraging the potential of paired operations for image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7000–7009.

[55] R. Quan, X. Yu, Y. Liang, and Y. Yang, “Removing raindrops and rain streaks in one go,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 9143–9152.
[56] Y. Quan, S. Deng, Y. Chen, and H. Ji, “Deep learning for seeing through window with raindrops,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2463–2471.
[57] J. Xiao, X. Fu, A. Liu, F. Wu, and Z.-J. Zha, “Image de-raining transformer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, pp. 12978–12995, Nov. 2023.
[58] Z. Tu et al., “MAXIM: Multi-axis MLP for image processing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5759–5770.
[59] T. Ye et al., “Adverse weather removal with codebook priors,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12619–12630.
[60] S. Zhou, J. Pan, J. Shi, D. Chen, L. Qu, and J. Yang, “Seeing the unseen: A frequency prompt guided transformer for image restoration,” in Proc. Eur. Conf. Comput. Vis., Oct. 2024, pp. 246–264.
[61] S. Zhou, D. Chen, J. Pan, J. Shi, and J. Yang, “Adapt or perish: Adaptive sparse transformer with attentive feature refinement for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 2952–2963.
[62] S. W. Zamir et al., “Multi-stage progressive image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 14816–14826.
[63] Y. Guo, X. Xiao, Y. Chang, S. Deng, and L. Yan, “From sky to the ground: A large-scale benchmark and simple baseline towards real rain removal,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12063–12073.
[64] X. Chen, H. Li, M. Li, and J. Pan, “Learning a sparse transformer network for effective image deraining,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 5896–5905.
[65] H. Zhang, Y. Dai, H. Li, and P. Koniusz, “Deep stacked hierarchical multi-patch network for image deblurring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5971–5979.
[66] K. Zhang et al., “Deblurring by realistic blurring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2734–2743.
[67] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 17–33.
[68] Y. Li et al., “Efficient and explicit modelling of image hierarchies for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 18278–18289.
[69] Y. Cui, W. Ren, X. Cao, and A. Knoll, “Image restoration via frequency selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 2, pp. 1093–1108, Feb. 2024.
[70] Y. Cui, W. Ren, X. Cao, and A. Knoll, “Revitalizing convolutional network for image restoration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 9423–9438, Dec. 2024.
[71] X. Gao et al., “Efficient multi-scale network with learnable discrete wavelet transform for blind motion deblurring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 2733–2742.
[72] C. X. Liu et al., “Motion-adaptive separable collaborative filters for blind motion deblurring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 25595–25605.
[73] X. Mao, J. Wang, X. Xie, Q. Li, and Y. Wang, “LoFormer: Local frequency transformer for image deblurring,” in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 10382–10391.
[74] X. Liu, Y. Ma, Z. Shi, and J. Chen, “GridDehazeNet: Attention-based multi-scale network for image dehazing,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7313–7322.
[75] H. Dong et al., “Multi-scale boosted dehazing network with dense feature fusion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2154–2164.
[76] Y. Tian et al., “Perceiving and modeling density for image dehazing,” in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 130–145.
[77] D. Engin, A. Genc, and H. K. Ekenel, “Cycle-dehaze: Enhanced CycleGAN for single image dehazing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 938–9388.
[78] Y. Jin, B. Lin, W. Yan, Y. Yuan, W. Ye, and R. T. Tan, “Enhancing visibility in nighttime haze images using guided APSF and gradient adaptive convolution,” in Proc. 31st ACM Int. Conf. Multimedia, Oct. 2023, pp. 2446–2457.
[79] J. M. Jose Valanarasu, R. Yasarla, and V. M. Patel, “TransWeather: Transformer-based restoration of images degraded by adverse weather conditions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 2343–2353.
[80] Y. Cui, W. Ren, and A. Knoll, “Omni-kernel network for image restoration,” in Proc. AAAI Conf. Artif. Intell., vol. 38, Mar. 2024, pp. 1426–1434.
[81] S. W. Zamir et al., “Learning enriched features for fast image restoration and enhancement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 1934–1948, Feb. 2023.
[82] X. Xu, R. Wang, C.-W. Fu, and J. Jia, “SNR-aware low-light image enhancement,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17693–17703.
[83] Y. Cai, H. Bian, J. Lin, H. Wang, R. Timofte, and Y. Zhang, “Retinexformer: One-stage retinex-based transformer for low-light image enhancement,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12504–12513.
[84] J. Weng, Z. Yan, Y. Tai, J. Qian, J. Yang, and J. Li, “MambaLLIE: Implicit retinex-aware low light enhancement with global-then-local state space,” in Proc. 38th Annu. Conf. Neural Inf. Process. Syst., Sep. 2024, pp. 27440–27462.
[85] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, “Deep joint rain detection and removal from a single image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1685–1694.
[86] C.-Y. Wang, A. Bochkovskiy, and H.-Y.-M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 7464–7475.
[87] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 898–916, May 2011.
[88] K. Ma et al., “Waterloo exploration database: New challenges for image quality assessment models,” IEEE Trans. Image Process., vol. 26, no. 2, pp. 1004–1016, Feb. 2017.
[89] B. Li et al., “Benchmarking single-image dehazing and beyond,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 492–505, Jan. 2019.
[90] C. Wei, W. Wang, W. Yang, and J. Liu, “Deep retinex decomposition for low-light enhancement,” 2018, arXiv:1808.04560.
[91] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2, Oct. 2001, pp. 416–423.
[92] Q. Fan, D. Chen, L. Yuan, G. Hua, N. Yu, and B. Chen, “A general decoupled learning framework for parameterized image operators,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 33–47, Jan. 2021.
[93] B. Li, X. Liu, P. Hu, Z. Wu, J. Lv, and X. Peng, “All-in-one image restoration for unknown corruption,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17431–17441.
[94] V. Potlapalli, S. W. Zamir, S. H. Khan, and F. S. Khan, “PromptIR: Prompting for all-in-one image restoration,” in Proc. Adv. Neural Inf. Process. Syst., Sep. 2023, pp. 71275–71293.
[95] G. Wu, J. Jiang, K. Jiang, and X. Liu, “Harmony in diversity: Improving all-in-one image restoration via multi-task collaboration,” in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 6015–6023.
[96] M. V. Conde, G. Geigle, and R. Timofte, “InstructIR: High-quality image restoration following human instructions,” in Proc. Eur. Conf. Comput. Vis., Oct. 2024, pp. 1–21.
[97] J. Zhang et al., “Ingredient-oriented multi-degradation learning for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 5825–5835.
[98] H. Guo, J. Li, T. Dai, Z. Ouyang, X. Ren, and S. T. Xia, “MambaIR: A simple baseline for image restoration with state-space model,” in Proc. Eur. Conf. Comput. Vis., 2024, pp. 222–241.
[99] Y. Guo, Y. Gao, Y. Lu, H. Zhu, R. Liu, and S. He, “OneRestore: A universal restoration framework for composite degradation,” in Proc. Eur. Conf. Comput. Vis., Jul. 2024, pp. 255–272.
[100] Y. Zhu et al., “Learning weather-general and weather-specific features for image restoration under multiple adverse weather conditions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 21747–21758.
[101] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 586–595.
[102] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 5, pp. 2567–2581, May 2022.
[103] D. Du et al., “The unmanned aerial vehicle benchmark: Object detection and tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 370–386.
[104] K. Zhang et al., “MC-blur: A comprehensive benchmark for image deblurring,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 5, pp. 3755–3767, May 2024.
[105] K. Zhang et al., “Deep image deblurring: A survey,” Int. J. Comput. Vis., vol. 130, no. 9, pp. 2103–2130, Sep. 2022.
[106] K. Zhang, R. Li, Y. Yu, W. Luo, and C. Li, “Deep dense multi-scale network for snow removal using semantic and depth priors,” IEEE Trans. Image Process., vol. 30, pp. 7419–7431, 2021.
[107] K. Zhang, D. Li, W. Luo, and W. Ren, “Dual attention-in-attention model for joint rain streak and raindrop removal,” IEEE Trans. Image Process., vol. 30, pp. 7608–7619, 2021.
[108] K. Zhang et al., “Beyond monocular deraining: Stereo image deraining via semantic understanding,” in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), Aug. 2020, pp. 71–89.

Yuning Cui (Graduate Student Member, IEEE) received the B.Eng. degree from Central South University, Changsha, China, in 2016, and the M.Eng. degree from the National University of Defense Technology, Changsha, in 2018. He is currently pursuing the Ph.D. degree with the Chair of Robotics, Artificial Intelligence and Real-time Systems, School of Computation, Information and Technology, Technical University of Munich (TUM), Munich, Germany.
His research interest lies in image restoration.

Mingyu Liu (Graduate Student Member, IEEE) received the dual master’s degree in electrical and computer engineering from the Department of Electronics and Communication Engineering, Technical University of Munich (TUM), Munich, Germany, and Tongji University, Shanghai, China, in 2022. He is currently pursuing the Ph.D. degree with the Chair of Robotics, Artificial Intelligence and Real-time Systems, TUM.
His research interests include computer vision in autonomous driving, deep learning, and artificial intelligence.

Wenqi Ren (Member, IEEE) received the Ph.D. degree from Tianjin University, Tianjin, China, in 2017.
From 2015 to 2016, he was supported by the China Scholarship Council and worked with Prof. Ming-Hsuan Yang as a Joint-Training Ph.D. Student with the Electrical Engineering and Computer Science Department, University of California at Merced, Merced, CA, USA. He is currently a Professor with the School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China. His research interests include image processing and related high-level vision problems.

Alois Knoll (Fellow, IEEE) received the Diploma (M.Sc.) degree in electrical/communications engineering from the University of Stuttgart, Stuttgart, Germany, in 1985, and the Ph.D. degree (summa cum laude) in computer science from the Technical University of Berlin, Berlin, Germany, in 1988.
Since 2001, he has been a Professor at the Department of Informatics, Technical University of Munich (TUM), Munich, Germany. His research interests include cognitive, medical, and sensor-based robotics, multiagent systems, and model-driven development of embedded systems.