

Modumer: Modulating Transformer for Image Restoration

Yuning Cui, Graduate Student Member, IEEE, Mingyu Liu, Graduate Student Member, IEEE, Wenqi Ren, Member, IEEE, and Alois Knoll, Fellow, IEEE

Abstract—Image restoration aims to recover clean images from degraded counterparts. While Transformer-based approaches have achieved significant advancements in this field, they are limited by high complexity and their inability to capture omni-range dependencies, hindering their overall performance. In this work, we develop Modumer for effective and efficient image restoration by revisiting the Transformer block and the modulation design, which processes input through a convolutional block and projection layers and fuses features via elementwise multiplication. Specifically, within each unit of Modumer, we integrate the cascaded modulation design with the downsampled Transformer block to build the attention layers, enabling omni-kernel modulation and mapping inputs into high-dimensional feature spaces. Moreover, we introduce a bioinspired parameter-sharing mechanism to attention layers, which not only enhances efficiency but also improves performance. In addition, a dual-domain feed-forward network (DFFN) strengthens the representational power of the model. Extensive experimental evaluations demonstrate that the proposed Modumer achieves state-of-the-art performance across ten datasets in five single-degradation image restoration tasks, including image motion deblurring, deraining, dehazing, desnowing, and low-light enhancement. Moreover, the model exhibits strong generalization capabilities in all-in-one image restoration tasks. Additionally, it demonstrates competitive performance in composite-degradation image restoration.

Index Terms—All-in-one image restoration, composite-degradation image restoration, dual-domain learning, image restoration, modulation design, parameter sharing, transformer.

Received 27 November 2024; revised 19 March 2025; accepted 13 April 2025. This work was supported in part by the National Natural Science Foundation of China under Grant 62322216, Grant 62172409, and Grant 62311530686; in part by the Shenzhen Science and Technology Program under Grant RCYX20221008092849068, Grant JCYJ20220530145209022, and Grant KQTD20221101093559018; in part by the Project "VIDETEC-2" under Grant 19F2232E; and in part by the Federal Ministry for Digital and Transport of Germany (BMDV). (Corresponding author: Wenqi Ren.)

Yuning Cui is with the School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China, and also with the School of Computation, Information and Technology, Technical University of Munich, 85748 Munich, Germany (e-mail: [email protected]).

Mingyu Liu and Alois Knoll are with the School of Computation, Information and Technology, Technical University of Munich, 85748 Munich, Germany (e-mail: [email protected]; [email protected]).

Wenqi Ren is with the School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TNNLS.2025.3561924

© 2025 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION

As a longstanding task, image restoration aims to recover a high-quality image from its degraded counterpart [1], [2], [3], [4], [5], [6], [7]. In recent years, convolutional neural networks (CNNs) have demonstrated significant success in addressing this ill-posed problem by learning direct mappings between degraded inputs and their corresponding restored outputs [8], [9], [10]. However, the shortcomings of convolutional operators are obvious. Due to poor receptive field scaling [11], [12], CNNs cannot capture long-range dependencies for powerful image representations.

Recently, Transformers have significantly advanced the state-of-the-art performance of low-level tasks [13], [14], [15], [16]. Despite having the great power to capture content-aware global perceptive fields, the self-attention (SA) layer features quadratic complexity to the input, limiting its applications in real-world scenarios. Many attempts have been made to enhance the efficiency of this expensive mechanism. SwinIR [17], Uformer [18], and Stripformer [19] reduce the complexity of Transformer models by confining the SA operation to a fixed spatial range. Restormer [14] tactfully switches the operation dimension from the spatial domain to channels. Afterward, a few works explore adopting both channel SA and spatial SA in cascading or parallel manners to improve representational ability [12], [20]. Nonetheless, these methods impede the inherent potential of SA, originally proposed for superior global feature modeling, leading to a deterioration in restoration performance. Moreover, they mostly operate at a single scale, limiting their ability to capture multiscale receptive fields within a single computational unit.

Most recently, the modulation mechanism [21], as illustrated in Fig. 1(b), which performs context modeling using a large-kernel convolutional block and modulates the projected input via elementwise multiplication, has become popular in high-level vision tasks [22], [23], [24]. These approaches are computationally efficient and implementation-friendly, showing competitive performance on par with Transformer counterparts. Inspired by this modulation technique, we acquire the approximate omni-kernel feature modeling ability by integrating the Transformer layer [Fig. 1(a)] and the modulation design [Fig. 1(b)] within a block. As illustrated in Fig. 1(c), the context branch (CTX) is implemented through a Transformer block at a downsampled scale, which retains the ability of SA to model global features while striking a trade-off between complexity and accuracy. The local and mesoscale receptive fields are complemented by modulating the result of SA in series using depthwise convolutions of different kernel sizes.


Fig. 1. Comparison of Transformer block, modulation design, and our block. ⊗ and ⊙ denote matrix and elementwise multiplication, respectively. Compared to Transformer and modulation blocks, our design performs attention calculation in downsampled spaces and employs a cascaded modulation operation to pursue omni-kernel feature refinement and high-dimensional representation learning. As such, the model achieves a better tradeoff between complexity and accuracy. (a) Transformer block. (b) Modulation design. (c) Our block.

Fig. 2. Computation comparisons between the proposed model and state-of-the-art algorithms on AGAN-Data [27], HIDE [28], CSD [29], and Haze4k [30]
for deraining, motion deblurring, desnowing, and dehazing, respectively.

Compared to the canonical modulation design, our block provides real context modeling and performs cascaded modulation processes, mapping input features into higher-dimensional feature spaces. Additionally, our CTX is content-aware, which is beneficial for dealing with spatially varying degradations. Moreover, we explore a bioinspired parameter-sharing mechanism that shares parameters across different attention layers, improving both efficiency and performance.

Additionally, to reduce discrepancies between spectra of clean/degraded image pairs, we present a dual-domain feed-forward network (DFFN) to improve dual-domain representation learning. Specifically, DFFN first utilizes GEGLU [25] to achieve spatial-domain signal interactions. Subsequently, the resulting features pass through the fast Fourier transform (FFT) to obtain the spectra, which are then modulated by learnable parameters and transformed back to the spatial domain through the inverse FFT (IFFT) [26]. Next, the results interact with spatial features under the guidance of attention weights. By doing so, our DFFN achieves intra- and interdomain interactions, improving the representational ability.

The unit of our U-shaped Modumer is built upon the above modulation-based SA block and DFFN. Unlike other Transformer-based restoration algorithms that utilize a uniform block throughout the model, we adopt a channelwise modulation-based SA block at the initial scale to enable more efficient global feature modeling. For lower-resolution features at deeper scales, we apply spatialwise blocks, effectively capturing spatial representations. Based on these designs, Modumer achieves state-of-the-art performance on several single-degradation image restoration tasks with lower complexity and fewer parameters (see Fig. 2).

The main contributions of this study are listed as follows.
1) We introduce an attention block that consecutively modulates SA outputs derived from downsampled features, enabling efficient omni-kernel modulation and enhancing high-dimensional representational capacity.
2) We develop a DFFN that achieves spatial-spatial and spectral-spatial interactions.
3) We deploy channelwise Transformer blocks at the first scale while using spatialwise blocks at deeper scales with lower-resolution features, resulting in our effective and efficient image restoration network, dubbed Modumer.
4) Extensive experimental results demonstrate that Modumer achieves state-of-the-art performance on single-degradation, all-in-one, and composite-degradation image restoration tasks.

II. RELATED WORKS

A. Image Restoration

Image restoration aims to reconstruct a sharp image from a degraded observation [11], [31], [32], [33], [34], [35], [36]. Recently, deep learning methods have remarkably boosted the performance of various image restoration tasks by learning generalizable features from collected large-scale data.


Fig. 3. Network architecture of our U-shaped Modumer. We employ CMB with shared parameters at the first scale while using SMB at the deeper scales, which involve lower-resolution features. This strikes a better balance between complexity and representational ability. The DFFN enhances dual-domain representation learning via spatial-spatial and spatial-spectral interactions.

These methods can be roughly divided into CNN-based and Transformer-based categories. CNN-based methods leverage attention mechanisms to attend to informative information in different dimensions [8], [37], e.g., spatial and channel. Furthermore, these methods incorporate advanced techniques to expand receptive fields and capture multiscale features [38], [39], [40], [41], [42], [43], [44], [45], including encoder-decoder architectures, atrous convolution, and multistage learning strategies. Subsequently, Transformer methods scale the receptive field to global features via the SA layer [46]. To enhance its efficiency on low-level vision tasks, a few algorithms confine the SA region to fixed windows or strips [18], [19], which impedes the inherent potential of SA. Moreover, they cannot model multiscale features within a single unit, limiting their capability for removing degradations of different sizes. In this article, we apply SA to downsampled embedding spaces to capture global dependencies and use the cascaded modulation operation to complement the missing local information.

B. Modulation Design

The modulation mechanism [21], [23] considers context modeling using a large-kernel convolutional unit and modulates the projected inputs using elementwise multiplication, which has exhibited cutting-edge performance in high-level vision tasks. FocalNet [24] utilizes a stack of depthwise convolutional layers to implement hierarchical contextualization and uses gated aggregation to selectively gather contexts. Afterward, EfficientMod [21] adopts a simpler method for context modeling using a series of linear projections and depthwise convolution. MambaOut [47] and Conv2Former [22] use 7 × 7 depthwise convolutions to extract contextual features. Recently, StarNet [48] reveals that the strong representational capacity of elementwise multiplication arises from its implicit mapping to high-dimensional spaces. However, the receptive fields of the CTX in these methods are limited. In contrast, our approach incorporates long-range contextual signals by applying SA to downsampled embedding spaces, effectively balancing computational complexity and accuracy. Furthermore, mesoscale and local information is leveraged to modulate SA outputs through cascaded modulation, enabling omni-kernel refinement and mapping inputs into higher-dimensional spaces.

III. METHODOLOGY

In this section, we first introduce the overall architecture of Modumer. Subsequently, the proposed components are delineated individually, including two kinds of attention layers (CMB and SMB), the parameter-sharing mechanism, and the DFFN.

A. Overall Pipeline

Modumer follows the encoder-decoder design (see Fig. 3). We employ a channelwise modulation block (CMB) at the first scale, as channelwise SA effectively captures long-range features in an implicit manner. Meanwhile, a spatialwise modulation block (SMB) is utilized at the two lower-resolution scales to enhance spatial feature representation. As such, the model strikes a better balance between complexity and representational capacity.

More specifically, given an image, we use a 3 × 3 convolution to extract the embedding features of size R^{C×H×W}, where C denotes the channel count while H × W defines the spatial size. Subsequently, the features are fed into the three-scale encoder subnetwork to produce the in-depth features. Each scale contains several Transformer blocks, whose calculation process is formulated as follows:

    X'_k = CMB/SMB(X_{k-1}) + X_{k-1}    (1)
    X_k = DFFN(X'_k) + X'_k              (2)

where X_{k-1} and X_k are the output of the last and current Transformer blocks, respectively, and X'_k is the intermediate feature. In the encoder stage, the resolution of the features is gradually downsampled using bilinear interpolation while the channel capacity is doubled using a 3 × 3 convolution.
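As a concrete reading of (1) and (2), the following minimal PyTorch sketch shows how one such unit and the encoder scale transition could be wired. The module granularity and helper names are our assumptions rather than the authors' released implementation; CMB, SMB, and DFFN stand for the components defined in Sections III-B to III-D.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModumerUnit(nn.Module):
    """One Transformer unit of Modumer, following (1) and (2).

    `attn` is a CMB at the first scale or an SMB at the deeper scales;
    `ffn` is the dual-domain feed-forward network (DFFN). Both sub-modules
    are assumed to contain their own normalization layers.
    """
    def __init__(self, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn = attn  # CMB or SMB
        self.ffn = ffn    # DFFN

    def forward(self, x):
        x = self.attn(x) + x  # Eq. (1): X'_k = CMB/SMB(X_{k-1}) + X_{k-1}
        x = self.ffn(x) + x   # Eq. (2): X_k  = DFFN(X'_k) + X'_k
        return x


def encoder_transition(x, channel_doubling_conv):
    """Scale transition in the encoder: bilinear downsampling followed by a
    3 x 3 convolution that doubles the channel count, as described above."""
    x = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
    return channel_doubling_conv(x)  # e.g., nn.Conv2d(C, 2 * C, 3, padding=1)


# Toy usage with identity placeholders standing in for CMB and DFFN.
unit = ModumerUnit(nn.Identity(), nn.Identity())
out = unit(torch.randn(1, 48, 64, 64))  # residual structure keeps the shape unchanged
```

The decoder mirrors this skeleton with bilinear upsampling and skip connections, as the following paragraph describes.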


Fig. 4. Architectures of channel and spatial modulation blocks (CMB|SMB). (a) CMB. (b) SMB.

Next, the in-depth features pass through the symmetric decoder network to generate the clean features. In this process, the resolution of features is progressively restored to the original size using bilinear interpolation and 3 × 3 convolution. Meanwhile, the skip connection is adopted to combine the encoder and decoder features via concatenation. The features produced by the three-level decoder are then processed by a refinement stage consisting of r Transformer blocks. Finally, a 3 × 3 convolution is applied to generate the residual image, which is added to the original input image to obtain the final model output. Next, we present the internal components of the Transformer block.

B. Channelwise Modulation Block

The architectural details of CMB are illustrated in Fig. 4(a). CMB contains a downsampled channelwise SA layer for global information modeling, along with two depthwise convolutional branches that modulate the SA output. These branches enhance local and mesoscale receptive fields while mapping features into higher-dimensional spaces. The calculation process of CMB can be formally expressed as follows:

    X̂_CMB = W_2(X̂_{M7×7} ⊙ W_1(X̂_{M3×3} ⊙ DCSA(X_CMB)))    (3)

where X̂_CMB and X_CMB denote the output and input of CMB, respectively, ⊙ is elementwise multiplication, DCSA is a downsampled channelwise SA layer, X̂_{Mn×n} is the modulation branch with the kernel size of n × n, encoding local information, and W_1 and W_2 are 1 × 1 convolutions for refinement.

1) DCSA: Compared to the normal channel SA, our version computes attention maps in a downsampled space, resulting in high efficiency. We assume that the number of heads is 1 and consider DCSA in a single-head fashion. Given the normalized input X_N ∈ R^{C×H×W}, DCSA first utilizes the projection layers to produce query, key, and value tensors by Q = W_Q X_N, K = W_K X_N, and V = W_V X_N, where W_(·) denotes parameters of 1 × 1 pointwise convolution. Then, the obtained Q, K, and V tensors are reshaped into the size of C × N, N × C, and C × N, respectively, where N = H × W. The query and key tensors are further normalized and downsampled to prepare for cross-covariance attention [14]. The transposed attention map is calculated by Q and K, with the size of R^{C×C}. The output of DCSA is obtained by

    X̂_DCSA = Softmax(QK/τ)V    (4)

where τ is a learnable temperature parameter and X̂_DCSA ∈ R^{C×N} is reshaped to the original input feature size of R^{H×W×C} for further modulation operation.

2) Modulation Design: DCSA encodes downsampled global information while disregarding fine-grained local details during the downsampling process. To complement local information, we first filter the initially generated V tensor using a 3 × 3 depthwise convolution, which is expressed as follows:

    X̂_{M3×3} = Sigmoid(Dw_{3×3}(V)) ⊙ V    (5)

where Dw_{3×3} is a depthwise convolution of kernel size 3 × 3. Next, we modulate the output of DCSA with the locally filtered result via elementwise multiplication. This approach enables the model to capture both downsampled global and fine-grained local dependencies while mapping inputs into high-dimensional spaces, thereby enhancing its representational capacity. To simplify the analyses, we assume a scenario with a single-pixel input x ∈ R^{d×1} and a single-element output x̂ ∈ R^{1×1}, where d is the channel count. We define w_1, w_2 ∈ R^{1×d} as convolution parameters. The modulation process [48], which involves a single convolution in each branch, can be expressed as follows:

    (w_1^T x) ⊙ (w_2^T x) = (Σ_{i=1}^{d} w_1^i x^i)(Σ_{j=1}^{d} w_2^j x^j)    (6)
                          = Σ_{i=1}^{d} Σ_{j=1}^{d} w_1^i w_2^j x^i x^j      (7)
                          = α_{1,1} x^1 x^1 + ··· + α_{2,3} x^2 x^3 + ··· + α_{d,d} x^d x^d    (8)

where the expansion in (8) contains d(d + 1)/2 distinct terms, and

    α_{i,j} = w_1^i w_2^j,                i = j
    α_{i,j} = w_1^i w_2^j + w_1^j w_2^i,  i ≠ j    (9)

where i and j index the channel. It can be observed that each term in (9) exhibits a nonlinear relationship with x and represents an individual dimension, suggesting that this scenario achieves a representation in an implicit feature space of dimensionality d(d + 1)/2. Note that besides convolutions, the branches in our modulation design experience complicated SA, further improving the representational capability. Additionally, we apply a 7 × 7 kernel branch to further modulate the preceding outcome and supply mesoscale receptive fields.
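To make (3)-(5) concrete, the sketch below implements a single-head CMB in PyTorch under stated assumptions: GroupNorm stands in for the normalization layer, average pooling realizes the ×8 downsampling of Q and K, and the 7 × 7 mesoscale branch is assumed to mirror the 3 × 3 branch of (5). These choices are illustrative and not confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMB(nn.Module):
    """Channelwise modulation block, a minimal sketch of Eqs. (3)-(5)."""
    def __init__(self, dim, ds_ratio=8):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)                 # stands in for layer normalization
        self.qkv = nn.Conv2d(dim, dim * 3, 1)            # W_Q, W_K, W_V as 1x1 convolutions
        self.dw3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # Dw_3x3, local branch
        self.dw7 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # 7x7 mesoscale branch
        self.w1 = nn.Conv2d(dim, dim, 1)                 # W_1
        self.w2 = nn.Conv2d(dim, dim, 1)                 # W_2
        self.tau = nn.Parameter(torch.ones(1))           # learnable temperature in (4)
        self.ds = ds_ratio                               # downsampling ratio of SA (8)

    def forward(self, x):
        # Assumes H and W are divisible by the downsampling ratio.
        b, c, h, w = x.shape
        q, k, v = self.qkv(self.norm(x)).chunk(3, dim=1)

        # DCSA, Eq. (4): channelwise (transposed) attention with downsampled Q and K.
        q_ds = F.adaptive_avg_pool2d(q, (h // self.ds, w // self.ds)).flatten(2)  # C x N'
        k_ds = F.adaptive_avg_pool2d(k, (h // self.ds, w // self.ds)).flatten(2)  # C x N'
        q_ds = F.normalize(q_ds, dim=-1)
        k_ds = F.normalize(k_ds, dim=-1)
        attn = torch.softmax(q_ds @ k_ds.transpose(-2, -1) / self.tau, dim=-1)    # C x C map
        dcsa = (attn @ v.flatten(2)).view(b, c, h, w)                             # apply to V

        # Cascaded modulation, Eqs. (3) and (5); the 7x7 branch form is an assumption.
        m3 = torch.sigmoid(self.dw3(v)) * v
        m7 = torch.sigmoid(self.dw7(v)) * v
        return self.w2(m7 * self.w1(m3 * dcsa))
```

Consistent with the analysis in (6)-(9), each elementwise product above implicitly expands the representation into pairwise channel interactions; for d = 2, for instance, (w_1^T x)(w_2^T x) = α_{1,1} x^1 x^1 + α_{1,2} x^1 x^2 + α_{2,2} x^2 x^2, i.e., 3 = d(d + 1)/2 distinct monomials.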


TABLE I. Details of Our Model Versions. The Number of Heads at the Three Scales Is Set to [1, 2, 4]. FLOPs Are Measured on 3 × 256 × 256 Patches

Fig. 5. Motivation of the parameter-sharing mechanism.

3) Parameter Sharing: We employ a parameter-sharing mechanism inspired by the relationship between the hippocampus and cortex in the brain [49]. Specifically, as illustrated in Fig. 5, although different layers of the cortex, such as layers II and III, perform distinct tasks, they exchange information with the shared memory in the hippocampal field CA1. Accordingly, we conceptualize the attention layer as analogous to the hippocampus, while the feed-forward layer represents the cortex, forming the foundation of our parameter-sharing mechanism illustrated in the left part of Fig. 3. Interestingly, this design not only saves parameters but also improves performance.

C. Spatialwise Modulation Block

Fig. 4(b) presents SMB, which has three branches: a downsampled spatialwise attention unit (DSSA) and two modulation operators. The output of SMB is obtained by

    X̂_SMB = W_4(X̂_{M7×7} ⊙ W_3(X̂_{M3×3} ⊙ DSSA(X_SMB)))    (10)

where X_SMB is the input of SMB.

1) DSSA: DSSA is used at low-resolution scales to model spatial global features. Similarly, we also assume the number of heads is 1 to transfer DSSA to single-head mode. Given any input X ∈ R^{H×W×C}, it is first processed by layer normalization to yield X_N. Then, the query (Q), key (K), and value (V) tensors are produced by Q = W_Q X_N, K = W_K X_N↓, and V = W_V X_N↓, where K and V are generated from the downsampled input (X_N↓) for high efficiency. After reshaping Q, K, and V to new tensors of size N × C, C × N′, and N′ × C, respectively, where N = H × W and N′ = H/8 × W/8, the calculation process of DSSA is formulated as follows:

    X̂_DSSA = Softmax(QK/√C)V.    (11)

2) Modulation Design: Similar to CMB, we utilize a cascaded modulation design with kernel sizes of 3 × 3 and 7 × 7 to complement local and mesoscale information. As such, the model is equipped with an approximate omni-kernel modulation ability, i.e., local-mesoscale-global.

D. Dual-Domain Feed-Forward Network

DFFN facilitates the spatial-spatial and spatial-spectral interactions for high-fidelity reconstruction. Fig. 3 illustrates the architecture. To be specific, given input features X ∈ R^{H×W×C}, DFFN first applies layer normalization, followed by GEGLU [25], as formulated below:

    X̂_{S-S} = W_7(GELU(Dw^1_3(W_5(X_N))) ⊙ Dw^2_3(W_6(X_N)))    (12)

where W_5, W_6, and W_7 denote 1 × 1 convolutions, Dw^1_3 and Dw^2_3 are 3 × 3 depthwise convolutions, X_N is the normalized input, and X̂_{S-S} is the spatial-spatial interaction output.

Furthermore, DFFN facilitates spatial-spectral interactions by integrating the Fourier-domain refined output with spatial features, guided by learnable attention weights. The calculation process is formulated as follows:

    X̂_DFFN = αX_Spectral + (1 − α)X̂_{S-S}            (13)
    X_Spectral = P^{-1}(F^{-1}(W ⊙ F(P(X̂_{S-S}))))    (14)

where F and F^{-1} denote the FFT and the inverse transform, respectively, P and P^{-1} are the window partition operation and its inverse, respectively, W is the learnable parameter to filter the frequency signals [26], and α is the learnable parameter to control information aggregation.

IV. EXPERIMENTS

We evaluate the performance of our proposed Modumer on three kinds of tasks: single-degradation, all-in-one, and composite-degradation image restoration. We train separate model instances for different single-degradation tasks, while a unified model is trained on a mixed dataset encompassing multiple tasks for the all-in-one version. For composite-degradation restoration, the training dataset consists of images affected by multiple degradation types simultaneously. Based on the complexity of different datasets, we deploy two versions of our model, Modumer-S (small) and Modumer-B (base), to ensure a better trade-off between efficiency and accuracy. More details can be found in Table I.

A. Single-Degradation Image Restoration

1) Implementation Details: We evaluate our model on five representative tasks with ten benchmark datasets (see Table II). We adopt the dual-domain loss functions [11], [26], [37] to train the network for 300 000 iterations with the Adam optimizer. The deblurring task needs another 300 000 iterations following [26]. The initial learning rate is set to 1e-3, which is gradually reduced to 1e-7 with the cosine annealing strategy. The patch size is set to 128 × 128, and the batch size is 32. We adopt the same data augmentation strategy as [14]. The window size in DFFN and the downsampling ratio in SA are set to 8. In tables, the best results are highlighted.

2) Results:

a) Image Deraining: The numerical results on the raindrop dataset AGAN-Data [27] are presented in Table III. Our method significantly outperforms the recent Transformer-based AST-B [61] and FPro [60] by 0.73 and 1.09 dB, respectively, while consuming lower complexity, as illustrated in Fig. 2(a). Fig. 6 shows that our method is more effective in raindrop removal than competitors.
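Returning to the DFFN defined in (12)-(14) above, the following PyTorch sketch spells out the two interactions. The hidden expansion factor, the use of a real FFT with a complex-valued learnable filter, and the sigmoid that keeps the fusion weight in [0, 1] are our assumptions; the window size of 8 follows the implementation details above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFFN(nn.Module):
    """Dual-domain feed-forward network, a minimal sketch of Eqs. (12)-(14)."""
    def __init__(self, dim, expansion=2, window=8):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.GroupNorm(1, dim)                  # stands in for layer normalization
        self.w5 = nn.Conv2d(dim, hidden, 1)
        self.w6 = nn.Conv2d(dim, hidden, 1)
        self.dw1 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.dw2 = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.w7 = nn.Conv2d(hidden, dim, 1)
        self.win = window
        # Learnable spectral filter W, stored as a real/imag pair, identity-initialized.
        filt = torch.zeros(dim, window, window // 2 + 1, 2)
        filt[..., 0] = 1.0
        self.filter = nn.Parameter(filt)
        self.alpha = nn.Parameter(torch.zeros(1))         # fusion weight in (13)

    def forward(self, x):
        xn = self.norm(x)
        # Eq. (12): spatial-spatial interaction (GEGLU with depthwise convolutions).
        x_ss = self.w7(F.gelu(self.dw1(self.w5(xn))) * self.dw2(self.w6(xn)))

        # Eq. (14): window partition -> FFT -> learnable filtering -> inverse FFT.
        b, c, h, w = x_ss.shape
        k = self.win  # assumes h and w are divisible by the window size
        win = x_ss.reshape(b, c, h // k, k, w // k, k).permute(0, 1, 2, 4, 3, 5)  # P(.)
        spec = torch.fft.rfft2(win)                                               # per-window FFT
        spec = spec * torch.view_as_complex(self.filter)[None, :, None, None]     # W (.) filter
        out = torch.fft.irfft2(spec, s=(k, k))                                    # inverse FFT
        x_spec = out.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)                # P^{-1}(.)

        # Eq. (13): attention-weighted fusion of spectral and spatial branches.
        a = torch.sigmoid(self.alpha)
        return a * x_spec + (1 - a) * x_ss
```

With the window size of 8 used in the paper, the spectral branch filters each 8 × 8 window independently, which corresponds to the window partition P in (14).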


TABLE II. Dataset Summary for Five Single-Degradation Image Restoration Tasks

Fig. 6. Visual comparisons on the raindrop AGAN-Data [27] dataset.

Fig. 7. Deblurred results on GoPro [42]. Compared to other algorithms, the proposed method restores more details and clearer structures from the input.

Fig. 8. Image dehazing comparisons on the Haze4k [30] dataset.

TABLE III. Quantitative Results on AGAN-Data [27] for Raindrop Removal

Moreover, the comparison results on the rain streak dataset SPAD [50] are reported in Table IV. As seen, our method achieves the best performance in terms of PSNR, outperforming the previous state-of-the-art algorithm [61] by 0.06 dB PSNR.

b) Image Motion Deblurring: We conduct experiments for motion deblurring on the GoPro [42] dataset and compare our results with state-of-the-art works in Table V. Our method significantly surpasses the recent frequency-based Transformer model [73] by 0.18 dB PSNR while using 65% fewer parameters. Compared to the recent convolutional network ConvIR-L [70], our method achieves a notable gain of 0.99 dB PSNR. The visual results in Fig. 7 show that our model recovers more structural details from the hard example. We further apply our model pretrained on GoPro to the HIDE [28] dataset. The results presented in Table V show that our method obtains the best result in PSNR with a prominent gain of 0.15 dB over the second-best LoFormer-L [73], demonstrating the better generalization ability of our model.

c) Image Dehazing: We perform dehazing experiments on the Haze4k [30] dataset.


Fig. 9. Image desnowing comparisons on the CSD [29] dataset.

TABLE IV. Quantitative Results on SPAD [50] for Rain Streak Removal

TABLE V. Image Motion Deblurring Results. Our Model Is Trained Only on the GoPro [42] Dataset and Directly Applied to the GoPro [42] and HIDE [28] Datasets

TABLE VI. Image Dehazing Comparisons on the Haze4k [30] Dataset

TABLE VII. Quantitative Results on GTA5 [51] for Night Haze Removal

TABLE VIII. Image Desnowing Comparisons on Three Widely Used Datasets

Fig. 10. Visualization of the training process on the Snow100K [9] dataset.

The numerical results are presented in Table VI. Our model attains a significant performance gain of 0.54 dB PSNR over the recent algorithm [70] with lower FLOPs [see Fig. 2(d)]. Compared to the CNN-based method FSNet [69], our method has a more obvious advantage with much lower complexity. Fig. 8 shows that our model can better deal with haze degradations than other algorithms. Additionally, we provide comparison results on a nighttime dehazing dataset GTA5 [51] in Table VII. Our Modumer-S is still superior to the strong competitors.

d) Image Desnowing: Furthermore, we verify the effectiveness of our model in snow removal using three datasets: CSD [29], SRRS [52], and Snow100K [9]. The quantitative results are presented in Table VIII. Our method achieves 39.17 dB PSNR on the CSD dataset, 0.74 dB higher than the second-best algorithm [70]. The superiority of our model is also evident on the other two datasets, further demonstrating its effectiveness in snow removal. Fig. 9 shows that our model yields a more favorable image by removing more snow degradations. Visualizations of the training process on the Snow100K [9] dataset are illustrated in Fig. 10.


Fig. 11. Visual results on the LOL-V2-syn [53] dataset.

TABLE IX. Numerical Comparisons on the LOL-V2-Syn Dataset [53] for Low-Light Image Enhancement

TABLE X. Ablation Studies for Each Component. FLOPs and Memory Footprint Are Measured on a 3 × 256 × 256 Patch Size Using ptflops and torch.cuda.max_memory_allocated(), Respectively

TABLE XI. Ablation Studies of the Deployment Strategy for Different Kinds of Attention

TABLE XII. More Ablation Studies for DFFN

e) Low-Light Image Enhancement: The numerical results on LOL-V2-Syn [53] for low-light image enhancement are presented in Table IX. Our method significantly outperforms the Transformer-based algorithm Retinexformer [83] by 0.43 dB PSNR. Notably, our model also surpasses the recent Mamba-based MambaLLIE [84]. The visual results are illustrated in Fig. 11. Our model recovers more edges from the input image. These results highlight the strong potential of our method for low-light image enhancement.

3) Ablation Studies: We perform ablation studies by training our small model for 70 000 iterations on GoPro [42].

a) Effects of Individual Components: Table X shows the results of individually removing the proposed component from the complete model. The removal of our modulation branch results in a performance degradation compared to the full model. Our parameter-sharing mechanism achieves a 0.04 dB PSNR performance improvement while requiring fewer parameters. Employing only the spatial-spatial interactions, i.e., GEGLU, in the feed-forward network achieves 31.69 dB PSNR, which is 0.13 dB lower than our full model. Additionally, the FLOPs and memory footprint comparisons indicate that our designs introduce minimal computational complexity and memory overhead. These results demonstrate the effectiveness of our proposed modules and mechanism.

b) Deployment Strategy for Attention: We deploy CMBs at the first scale while using spatialwise blocks at other scales, as spatialwise SA is more computationally expensive than its channelwise counterpart when modeling large-scale features. In our case, the first scale includes the highest resolution features. Table XI shows that our strategy achieves the best performance. Moreover, we experiment by using only regular channel attention [14] in all scales, achieving a 0.32 dB lower performance than our full model. These results validate the efficacy of our design.

c) DFFN: We conduct more ablation studies for DFFN by removing or substituting certain operators. Table XII shows that removing the spatial branch in interdomain fusion achieves 31.76 dB PSNR, suggesting the significance of dual-domain feature fusion. Removing the attention weights leads to 31.73 dB PSNR, which is even lower than the result of using a single branch, demonstrating the importance of coordinating the fusion process.

d) Modulation Design: In this part, we perform ablation studies for the modulation design. We use the plain depthwise convolutions with the same kernel size to supplant the filter operation, achieving 31.69 dB PSNR, which is 0.13 dB lower than our design.

e) Parameter-Sharing Mechanism: In our model, we share the parameters across CMB. We carry out experiments to apply the parameter-sharing strategy to deeper scales, achieving lower performance than our design (see Table XIII). We also attempt to further share the parameters among DFFN in the first scale, obtaining only 30.53 dB PSNR. Therefore, we only apply the mechanism in CMB for better performance.
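As a concrete reading of the parameter-sharing choice examined above, the sketch below registers a single CMB instance in every unit of the first scale while keeping one DFFN per unit. The factory functions are hypothetical, and ModumerUnit refers to the sketch given after (1) and (2) in Section III-A.

```python
import torch.nn as nn

def build_first_scale(dim, num_blocks, make_cmb, make_dffn):
    """Parameter sharing at the first scale: a minimal sketch.

    One CMB instance (the 'hippocampus') is reused by every unit of the scale,
    so its attention weights are shared, while each unit keeps its own DFFN
    (the 'cortex'), whose weights are not shared.
    """
    shared_attn = make_cmb(dim)  # one set of attention parameters for the whole scale
    return nn.ModuleList(
        [ModumerUnit(shared_attn, make_dffn(dim)) for _ in range(num_blocks)]
    )
```

Because PyTorch ties parameters by module instance, reusing the same attention module in several units is sufficient to share its weights, while the per-unit DFFNs keep their own parameters, matching the design described above (sharing only the first-scale CMB, not the DFFN).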


Fig. 12. Visualization comparison of intermediate features obtained from models with and without downsampling.

TABLE XIII. Ablation Studies for the Parameter-Sharing Mechanism. Scale 0, 1, 2 Indicates That Parameters Are Shared Within Each Scale Across All Scales

f) Ablations for Downsampling Operation: In CMB, we apply downsampling after the convolutions to fully capture spatial connectivity, as the channelwise SA layer is unable to model true spatial pixel interactions. We evaluate the impact of performing downsampling before convolutions, which reduces the computational cost by 1.89 GFLOPs but results in a 0.11 dB decrease in PSNR. Consequently, we opt to place downsampling after convolutions in CMB.

We further conduct an experiment using the model without downsampling. Interestingly, this variant achieves a 0.2 dB lower PSNR than our design (31.62 versus 31.82), indicating that the synergy between our two modulation branches and downsampling-based global modeling is more effective than the direct global operation. Fig. 12 presents a visualization comparison of intermediate features obtained from models with and without downsampling. Compared to full-size attention, our model exhibits an enhanced ability to capture long-range features. Notably, it produces sharper results for the girl, which represents the most blurred region in the first image. Furthermore, our model effectively preserves local details, such as the wires in the top-right region of the second example.

g) Frequency Processing or Convolutions?: In DFFN, we employ Fourier processing to facilitate interdomain interactions. Compared to convolutions, our operation offers two key advantages.
1) According to the convolution theorem, a pixel in the Fourier spectrum incorporates global information from the spatial domain. Consequently, Fourier processing can effectively model long-range dependencies while reducing parameter usage and improving efficiency.
2) Since there are frequency discrepancies between degraded and clean image pairs, certain degradation patterns can be easily removed in the Fourier domain. We conduct an experiment by replacing the Fourier processing with a 7 × 7 depthwise convolution to demonstrate the effectiveness of our choice. This version achieves a PSNR of 31.67 dB, which is 0.15 dB lower than our design while introducing more parameters (0.02 M) and higher complexity (2.06 GFLOPs).

4) Applications: To assess the effectiveness of our approach in enhancing the performance of high-level vision tasks, particularly object detection, we integrate YOLOv7 [86] into our evaluation. Specifically, we apply the detector to both the degraded observations and the restored images generated by our method across three image restoration tasks. As depicted in Fig. 13, our results facilitate improved object detection by enabling the model to identify a greater number of objects, enhance detection confidence, and rectify misclassifications. For instance, in the snowy scene, our method successfully corrects the erroneous identification of a mouse instance.

5) Limitation: Although our model yields superior results on the GoPro [42] dataset for image deblurring, it struggles to produce completely clean outputs for fast-moving subjects, as exemplified by the foreground person in Fig. 14. A potential solution to mitigate this issue is the incorporation of optical flow to better handle large movements.

B. All-in-One Image Restoration

1) Implementation Details: Many all-in-one methods are evaluated under two settings: the three-task setting, as used by [94] and [93], and the five-task setting, as employed by [97]. Recent studies, like [96], have conducted experiments in both settings. We follow this trend to ensure a comprehensive and complete evaluation of our model, Modumer-B. The dataset summary is presented in Table XIV. The model is trained on 32 samples of size 128 × 128 per iteration using the Adam optimizer with a learning rate of 2e-4. Training is conducted for 150 epochs with the L1 loss function.

2) Results: For the three-task setting, the model is trained on a mixed dataset obtained from denoising, dehazing, and deraining. Table XV shows that our model achieves an average score of 32.77 dB PSNR, 0.34 dB higher than the recent InstructIR [96]. Moreover, our method achieves the best performance across most metrics. Notably, for the deraining task, our model outperforms InstructIR by 0.8 dB. Fig. 15 demonstrates that our model is more effective in removing rain streaks, resulting in a noticeably cleaner image.

Moreover, we report experimental results under the five-task all-in-one setting. The quantitative results are presented in Table XVI. As observed, our method achieves a PSNR score of 30.19 dB when averaged across all tasks, which is 0.64 and 1.22 dB higher than those of InstructIR [96] and MambaIR [98], respectively. In particular, for the dehazing problem, our model significantly outperforms the second-best algorithm [96] by 3.19 dB PSNR. Despite not incorporating a complex dynamic mechanism for identifying degradation types, our method consistently delivers promising results across various all-in-one tasks, thanks to its robust representational capability.
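For the all-in-one setting just described, a minimal training-loop sketch is given below. The data pipeline is an assumption (paired degraded/clean datasets per task, simply concatenated and shuffled so that each batch mixes degradation types), while the optimizer, learning rate, batch size, epoch count, and L1 loss follow the implementation details above.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import ConcatDataset, DataLoader

def train_all_in_one(model, task_datasets, epochs=150, lr=2e-4, batch_size=32, device="cuda"):
    """All-in-one training sketch: one unified model over a mixed dataset.

    `task_datasets` is assumed to be a list of datasets, one per task, each
    yielding (degraded, clean) pairs of 128 x 128 patches.
    """
    loader = DataLoader(ConcatDataset(task_datasets), batch_size=batch_size,
                        shuffle=True, num_workers=4, drop_last=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train().to(device)
    for _ in range(epochs):
        for degraded, clean in loader:
            degraded, clean = degraded.to(device), clean.to(device)
            loss = F.l1_loss(model(degraded), clean)   # L1 loss, as stated above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```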


Fig. 13. Object detection results on the degraded and our restored images. The examples are obtained from the CSD [29], GoPro [42], and LOL-V2-syn [53] datasets for desnowing, deblurring, and low-light enhancement, respectively.

TABLE XIV. Datasets Used for the All-in-One Setting. Motion Deblurring and Low-Light Enhancement Are Only Used for the Five-Task Setting

TABLE XV. Quantitative Comparisons on Three Image Restoration Tasks Under the All-in-One Setting

Fig. 14. Our model struggles to produce sharp results for fast-moving subjects.

3) Evaluation on Deep Learning-Based Metrics: In addition to distortion-based metrics, e.g., PSNR and SSIM, we additionally compare our model with recent competing algorithms under the three-task setting using the learning-based metrics, including LPIPS [101] and DISTS [102]. Table XVII shows that our model outperforms two recent competitors across all datasets. This demonstrates the effectiveness of our approach in preserving perceptual quality while maintaining structural and textural consistency.

4) Generalization: To evaluate the generalization capability of our method, we directly apply the model trained in the three-task setting to the UAVDT [103] dataset. Fig. 16 illustrates that our model is robust in out-of-distribution scenarios, removing more hazy degradations from challenging inputs than competing all-in-one algorithms.


Fig. 15. Visual comparisons on the Rain100 [85] dataset under the all-in-one setting. The image produced by our model is closer to the reference image.

TABLE XVI. Numerical Comparisons on Five Image Restoration Tasks Under the All-in-One Setting: Dehazing (SOTS [89]), Deraining (Rain100L [85]), Denoising (BSD68 [91]), Deblurring (GoPro [42]), and Low-Light Image Enhancement (LOL-V1 [90])

TABLE XVII. Quantitative Evaluation Under the All-in-One Setting With Learning-Based Metrics

Fig. 16. Results of applying models trained in the three-task setting to images from the UAVDT [103] dataset.

C. Composite-Degradation Image Restoration

We conduct experiments on the CDD-11 [99] dataset for composite-degradation image restoration. This dataset comprises a total of 11 degradation categories, created by combining four types: low light, haze, rain, and snow. The training configuration follows that of single-degradation image restoration. Table XVIII shows that our model achieves the best results in most degradation categories while requiring fewer parameters. Specifically, compared to the previous state-of-the-art algorithm [99], our model yields a significant performance gain of 1.08 dB in PSNR when averaged across all 11 degradation categories. Visual comparisons in Fig. 17 demonstrate that our model is more effective in removing mixed degradations, whereas the results generated by competing methods still contain snow and haze artifacts. Fig. 18 illustrates the training process on this dataset.

D. Discussion

1) Evaluation Datasets: In this study, we conduct experiments transitioning from single-degradation to composite-degradation scenarios for image restoration, aligning with the evolving trends in the field. However, given the diverse real-world conditions encountered by users, it is impractical to encompass all possible scenarios within a single study. Consequently, several valuable datasets remain available for further exploration and evaluation.

Fig. 17. Visual comparisons on the CDD-11 [99] dataset under the haze + snow scenario.

TABLE XVIII. Quantitative Comparisons on CDD-11 [99] for Composite-Degradation Image Restoration. The Scores Are Reported in the Form of PSNR (dB, ↑) and SSIM (↑). Our Model Outperforms the Previous Leading Algorithm [99] While Using Fewer Parameters

Fig. 18. Visualization of the training process on the CDD-11 [99] dataset.

For instance, MC-Blur [104] includes four types of blur (uniform blur, motion blur caused by averaging continuous frames, heavy defocus blur, and real-world blur), providing a comprehensive benchmark for multi-cause image deblurring frameworks [105]. Similarly, the SnowKITTI2012 and SnowCityScapes datasets introduced in [106] feature three levels of snow degradation in street environments, facilitating the development of algorithms aimed at enhancing autonomous driving in adverse weather conditions. Additionally, JRSRD [107] synthesizes rain streaks and raindrops simultaneously, offering a more realistic representation of rainy conditions compared to datasets that include only a single type of degradation.

Furthermore, our model can be extended to stereo image restoration tasks, such as evaluating its performance on deraining using the Stereo RainKITTI2012 dataset [108]. These additional datasets present promising avenues for future research, enabling a more comprehensive assessment of image restoration techniques across varied conditions.

2) Network Architecture: We employ a unified network architecture across various image restoration tasks, including single-degradation, all-in-one, and composite-degradation scenarios. In all-in-one settings, a common approach involves first extracting task-aware information, which then guides the restoration process [93], [94]. Despite not incorporating such explicit task-specific components, our model achieves state-of-the-art performance on two all-in-one tasks. This success can be attributed to two key factors: 1) the modulation operation, which functions as a gated mechanism, dynamically attending to informative signals for different tasks; and 2) the synergy between the modulation operation and dual-domain interactions in DFFN, which enhances the model's representational capacity without significantly increasing complexity. A similar pattern is observed in the single-degradation setting, where our model consistently demonstrates strong performance across multiple tasks. While employing a unified architecture for different image restoration scenarios offers practical advantages, integrating task-specific or spatially adaptive strategies remains a promising direction for further performance enhancement.

V. CONCLUSION

This study presents an effective and efficient Transformer model for image restoration, termed Modumer. The model incorporates different downsampled SA layers with cascaded modulation designs, which can model omni-receptive-field features, keep a better balance between complexity and accuracy, and map features into high-dimensional spaces. Moreover, we investigate a bioinspired parameter-sharing mechanism in attention layers, improving efficiency and performance. In addition, we introduce a DFFN to facilitate intra- and interdomain interactions. Comprehensive experiments across three categories of image restoration tasks validate the effectiveness of our model.

REFERENCES

[1] Y. Quan, P. Lin, Y. Xu, Y. Nan, and H. Ji, "Nonblind image deblurring via deep learning in complex field," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 10, pp. 5387–5400, Oct. 2021.
[2] Y. Cui, Y. Tao, W. Ren, and A. Knoll, "Dual-domain attention for image deblurring," in Proc. AAAI Conf. Artif. Intell., vol. 37, Jun. 2023, pp. 479–487.
[3] Y. Zheng, X. Yu, M. Liu, and S. Zhang, "Single-image deraining via recurrent residual multiscale networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 3, pp. 1310–1323, Mar. 2022.
[4] Y. Cui and A. Knoll, "Exploring the potential of channel interactions for image restoration," Knowl.-Based Syst., vol. 282, Dec. 2023, Art. no. 111156.
[5] Y. Zhou, Z. Chen, P. Li, H. Song, C. L. P. Chen, and B. Sheng, "FSAD-Net: Feedback spatial attention dehazing network," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 10, pp. 7719–7733, Oct. 2023.
[6] K. Jiang et al., "Multi-scale hybrid fusion network for single image deraining," IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 7, pp. 3594–3608, Jul. 2023.


[7] Y. Cui, Y. Tao, L. Jing, and A. Knoll, "Strip attention for image restoration," in Proc. Int. Joint Conf. Artif. Intell., Aug. 2023, pp. 645–653.
[8] Q. Xu, Z. Wang, Y. Bai, X. Xie, and H. Jia, "FFA-Net: Feature fusion attention network for single image dehazing," in Proc. AAAI Conf. Artif. Intell., vol. 34, Apr. 2020, pp. 11908–11915.
[9] Y.-F. Liu, D.-W. Jaw, S.-C. Huang, and J.-N. Hwang, "DesnowNet: Context-aware deep network for snow removal," IEEE Trans. Image Process., vol. 27, no. 6, pp. 3064–3073, Jun. 2018.
[10] Y. Cui, W. Ren, and A. Knoll, "Omni-kernel modulation for universal image restoration," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 12, pp. 12496–12509, Dec. 2024.
[11] S.-J. Cho, S.-W. Ji, J.-P. Hong, S.-W. Jung, and S.-J. Ko, "Rethinking coarse-to-fine approach in single image deblurring," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 4621–4630.
[12] X. Chen et al., "A comparative study of image restoration networks for general backbone network design," in Proc. Eur. Conf. Comput. Vis., Oct. 2024, pp. 74–91.
[13] Y. Song, Z. He, H. Qian, and X. Du, "Vision transformers for single image dehazing," IEEE Trans. Image Process., vol. 32, pp. 1927–1941, 2023.
[14] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, "Restormer: Efficient transformer for high-resolution image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5718–5729.
[15] Y. Cui and A. Knoll, "PSNet: Towards efficient image restoration with self-attention," IEEE Robot. Autom. Lett., vol. 8, no. 9, pp. 5735–5742, Sep. 2023.
[16] J.-G. Wang, Y. Cui, Y. Li, W. Ren, and X. Cao, "Omnidirectional image super-resolution via bi-projection fusion," in Proc. AAAI Conf. Artif. Intell., vol. 38, Mar. 2024, pp. 5454–5462.
[17] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, "SwinIR: Image restoration using Swin transformer," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 1833–1844.
[18] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, "Uformer: A general U-shaped transformer for image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17662–17672.
[19] F.-J. Tsai, Y.-T. Peng, Y. Lin, C.-C. Tsai, and C. Lin, "Stripformer: Strip transformer for fast image deblurring," in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 146–162.
[20] J. Zhang, Y. Zhang, J. Gu, J. Dong, L. Kong, and X. Yang, "Xformer: Hybrid X-shaped transformer for image denoising," in Proc. 12th Int. Conf. Learn. Represent., Jan. 2023.
[21] X. Ma et al., "Efficient modulation for vision networks," in Proc. 12th Int. Conf. Learn. Represent., Mar. 2024.
[22] Q. Hou, C.-Z. Lu, M.-M. Cheng, and J. Feng, "Conv2Former: A simple transformer-style ConvNet for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 8274–8283, Dec. 2024.
[23] M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng, and S.-M. Hu, "Visual attention network," Comput. Vis. Media, vol. 9, no. 4, pp. 733–752, Dec. 2023.
[24] J. Yang, C. Li, and J. Gao, "Focal modulation networks," in Proc. Adv. Neural Inf. Process. Syst., Jan. 2022, pp. 4203–4217.
[25] N. Shazeer, "GLU variants improve transformer," 2020, arXiv:2002.05202.
[26] L. Kong, J. Dong, J. Ge, M. Li, and J. Pan, "Efficient frequency domain-based transformers for high-quality image deblurring," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 5886–5895.
[27] R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu, "Attentive generative adversarial network for raindrop removal from a single image," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 2482–2491.
[28] Z. Shen et al., "Human-aware motion deblurring," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 5571–5580.
[29] W.-T. Chen et al., "ALL snow removed: Single image desnowing algorithm using hierarchical dual-tree complex wavelet representation and contradict channel loss," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 4176–4185.
[30] Y. Liu et al., "From synthetic to real: Image dehazing collaborating with unlabeled real data," in Proc. 29th ACM Int. Conf. Multimedia, Oct. 2021, pp. 50–58.
[31] Y. Cui and A. Knoll, "Dual-domain strip attention for image restoration," Neural Netw., vol. 171, pp. 429–439, Mar. 2024.
[32] L. Ruan, B. Chen, J. Li, and M. Lam, "Learning to deblur using light field generated and real defocus images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16283–16292.
[33] M. Liu, Y. Cui, W. Ren, J. Zhou, and A. C. Knoll, "LIEDNet: A lightweight network for low-light enhancement and deblurring," IEEE Trans. Circuits Syst. Video Technol., early access, Feb. 13, 2025, doi: 10.1109/TCSVT.2025.3541429.
[34] X. Su et al., "Prior-guided hierarchical harmonization network for efficient image dehazing," 2025, arXiv:2503.01136.
[35] Y. Cui and A. Knoll, "Enhancing local–global representation learning for image restoration," IEEE Trans. Ind. Informat., vol. 20, no. 4, pp. 6522–6530, Apr. 2024.
[36] Y. Cui, Q. Wang, C. Li, W. Ren, and A. Knoll, "EENet: An effective and efficient network for single image dehazing," Pattern Recognit., vol. 158, Feb. 2025, Art. no. 111074.
[37] Y. Cui, W. Ren, X. Cao, and A. Knoll, "Focal network for image restoration," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12955–12965.
[38] H. Son, J. Lee, S. Cho, and S. Lee, "Single image defocus deblurring using kernel-sharing parallel atrous convolutions," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2622–2630.
[39] Y. Cui, M. Liu, W. Ren, and A. Knoll, "Hybrid frequency modulation network for image restoration," in Proc. 33rd Int. Joint Conf. Artif. Intell., Aug. 2024, pp. 722–730.
[40] K.-H. Liu, C.-H. Yeh, J.-W. Chung, and C.-Y. Chang, "A motion deblur method based on multi-scale high frequency residual image learning," IEEE Access, vol. 8, pp. 66025–66036, 2020.
[41] Y. Cui, J. Zhu, and A. Knoll, "Enhancing perception for autonomous vehicles: A multi-scale feature modulation network for image restoration," IEEE Trans. Intell. Transp. Syst., vol. 26, no. 4, pp. 4621–4632, Apr. 2025.
[42] S. Nah, T. H. Kim, and K. M. Lee, "Deep multi-scale convolutional neural network for dynamic scene deblurring," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 257–265.
[43] Q. Wang, Y. Cui, Y. Li, Y. Ruan, B. Zhu, and W. Ren, "RFFNet: Towards robust and flexible fusion for low-light image denoising," in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 836–845.
[44] K. Jiang et al., "Multi-scale progressive fusion network for single image deraining," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 8343–8352.
[45] Y. Cui, W. Ren, S. Yang, X. Cao, and A. Knoll, "IRNeXt: Rethinking convolutional network design for image restoration," in Proc. Int. Conf. Mach. Learn., Jul. 2023, pp. 6545–6564.
[46] C. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, "Image dehazing transformer with transmission-aware 3D position embedding," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5802–5810.
[47] W. Yu and X. Wang, "MambaOut: Do we really need Mamba for vision?," 2024, arXiv:2405.07992.
[48] X. Ma, X. Dai, Y. Bai, Y. Wang, and Y. Fu, "Rewrite the stars," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 5694–5703.
[49] M. P. Witter, T. P. Doan, B. Jacobsen, E. S. Nilssen, and S. Ohara, "Architecture of the entorhinal cortex: A review of entorhinal anatomy in rodents with some comparative notes," Frontiers Syst. Neurosci., vol. 11, p. 46, Jun. 2017.
[50] T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. H. Lau, "Spatial attentive single-image deraining with a high quality real rain dataset," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12262–12271.
[51] W. Yan, R. T. Tan, and D. Dai, "Nighttime defogging using high-low frequency decomposition and grayscale-color networks," in Proc. Eur. Conf. Comput. Vis., Jan. 2020, pp. 473–488.
[52] W. Chen, H. Fang, J. Ding, C.-C. Tsai, and S. Kuo, "JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal," in Proc. Eur. Conf. Comput. Vis., Jan. 2020, pp. 754–770.
[53] W. Yang, W. Wang, H. Huang, S. Wang, and J. Liu, "Sparse gradient regularized deep retinex network for robust low-light image enhancement," IEEE Trans. Image Process., vol. 30, pp. 2072–2086, 2021.
[54] X. Liu, M. Suganuma, Z. Sun, and T. Okatani, "Dual residual networks leveraging the potential of paired operations for image restoration," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7000–7009.

[55] R. Quan, X. Yu, Y. Liang, and Y. Yang, “Removing raindrops and rain streaks in one go,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 9143–9152.
[56] Y. Quan, S. Deng, Y. Chen, and H. Ji, “Deep learning for seeing through window with raindrops,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2463–2471.
[57] J. Xiao, X. Fu, A. Liu, F. Wu, and Z.-J. Zha, “Image de-raining transformer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, pp. 12978–12995, Nov. 2023.
[58] Z. Tu et al., “MAXIM: Multi-axis MLP for image processing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5759–5770.
[59] T. Ye et al., “Adverse weather removal with codebook priors,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12619–12630.
[60] S. Zhou, J. Pan, J. Shi, D. Chen, L. Qu, and J. Yang, “Seeing the unseen: A frequency prompt guided transformer for image restoration,” in Proc. Eur. Conf. Comput. Vis., Oct. 2024, pp. 246–264.
[61] S. Zhou, D. Chen, J. Pan, J. Shi, and J. Yang, “Adapt or perish: Adaptive sparse transformer with attentive feature refinement for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 2952–2963.
[62] S. W. Zamir et al., “Multi-stage progressive image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 14816–14826.
[63] Y. Guo, X. Xiao, Y. Chang, S. Deng, and L. Yan, “From sky to the ground: A large-scale benchmark and simple baseline towards real rain removal,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12063–12073.
[64] X. Chen, H. Li, M. Li, and J. Pan, “Learning a sparse transformer network for effective image deraining,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 5896–5905.
[65] H. Zhang, Y. Dai, H. Li, and P. Koniusz, “Deep stacked hierarchical multi-patch network for image deblurring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5971–5979.
[66] K. Zhang et al., “Deblurring by realistic blurring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2734–2743.
[67] L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 17–33.
[68] Y. Li et al., “Efficient and explicit modelling of image hierarchies for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 18278–18289.
[69] Y. Cui, W. Ren, X. Cao, and A. Knoll, “Image restoration via frequency selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 2, pp. 1093–1108, Feb. 2024.
[70] Y. Cui, W. Ren, X. Cao, and A. Knoll, “Revitalizing convolutional network for image restoration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 12, pp. 9423–9438, Dec. 2024.
[71] X. Gao et al., “Efficient multi-scale network with learnable discrete wavelet transform for blind motion deblurring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 2733–2742.
[72] C. X. Liu et al., “Motion-adaptive separable collaborative filters for blind motion deblurring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2024, pp. 25595–25605.
[73] X. Mao, J. Wang, X. Xie, Q. Li, and Y. Wang, “LoFormer: Local frequency transformer for image deblurring,” in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 10382–10391.
[74] X. Liu, Y. Ma, Z. Shi, and J. Chen, “GridDehazeNet: Attention-based multi-scale network for image dehazing,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7313–7322.
[75] H. Dong et al., “Multi-scale boosted dehazing network with dense feature fusion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2154–2164.
[76] Y. Tian et al., “Perceiving and modeling density for image dehazing,” in Proc. Eur. Conf. Comput. Vis., Jan. 2022, pp. 130–145.
[77] D. Engin, A. Genc, and H. K. Ekenel, “Cycle-dehaze: Enhanced CycleGAN for single image dehazing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 938–9388.
[78] Y. Jin, B. Lin, W. Yan, Y. Yuan, W. Ye, and R. T. Tan, “Enhancing visibility in nighttime haze images using guided APSF and gradient adaptive convolution,” in Proc. 31st ACM Int. Conf. Multimedia, Oct. 2023, pp. 2446–2457.
[79] J. M. Jose Valanarasu, R. Yasarla, and V. M. Patel, “TransWeather: Transformer-based restoration of images degraded by adverse weather conditions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 2343–2353.
[80] Y. Cui, W. Ren, and A. Knoll, “Omni-kernel network for image restoration,” in Proc. AAAI Conf. Artif. Intell., vol. 38, Mar. 2024, pp. 1426–1434.
[81] S. W. Zamir et al., “Learning enriched features for fast image restoration and enhancement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 1934–1948, Feb. 2023.
[82] X. Xu, R. Wang, C.-W. Fu, and J. Jia, “SNR-aware low-light image enhancement,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17693–17703.
[83] Y. Cai, H. Bian, J. Lin, H. Wang, R. Timofte, and Y. Zhang, “Retinexformer: One-stage retinex-based transformer for low-light image enhancement,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12504–12513.
[84] J. Weng, Z. Yan, Y. Tai, J. Qian, J. Yang, and J. Li, “MambaLLIE: Implicit retinex-aware low light enhancement with global-then-local state space,” in Proc. 38th Annu. Conf. Neural Inf. Process. Syst., Sep. 2024, pp. 27440–27462.
[85] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, “Deep joint rain detection and removal from a single image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1685–1694.
[86] C.-Y. Wang, A. Bochkovskiy, and H.-Y.-M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 7464–7475.
[87] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 898–916, May 2011.
[88] K. Ma et al., “Waterloo exploration database: New challenges for image quality assessment models,” IEEE Trans. Image Process., vol. 26, no. 2, pp. 1004–1016, Feb. 2017.
[89] B. Li et al., “Benchmarking single-image dehazing and beyond,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 492–505, Jan. 2019.
[90] C. Wei, W. Wang, W. Yang, and J. Liu, “Deep retinex decomposition for low-light enhancement,” 2018, arXiv:1808.04560.
[91] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2, Oct. 2001, pp. 416–423.
[92] Q. Fan, D. Chen, L. Yuan, G. Hua, N. Yu, and B. Chen, “A general decoupled learning framework for parameterized image operators,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 33–47, Jan. 2021.
[93] B. Li, X. Liu, P. Hu, Z. Wu, J. Lv, and X. Peng, “All-in-one image restoration for unknown corruption,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 17431–17441.
[94] V. Potlapalli, S. W. Zamir, S. H. Khan, and F. S. Khan, “PromptIR: Prompting for all-in-one image restoration,” in Proc. Adv. Neural Inf. Process. Syst., Sep. 2023, pp. 71275–71293.
[95] G. Wu, J. Jiang, K. Jiang, and X. Liu, “Harmony in diversity: Improving all-in-one image restoration via multi-task collaboration,” in Proc. 32nd ACM Int. Conf. Multimedia, Oct. 2024, pp. 6015–6023.
[96] M. V. Conde, G. Geigle, and R. Timofte, “InstructIR: High-quality image restoration following human instructions,” in Proc. Eur. Conf. Comput. Vis., Oct. 2024, pp. 1–21.
[97] J. Zhang et al., “Ingredient-oriented multi-degradation learning for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 5825–5835.
[98] H. Guo, J. Li, T. Dai, Z. Ouyang, X. Ren, and S. T. Xia, “MambaIR: A simple baseline for image restoration with state-space model,” in Proc. Eur. Conf. Comput. Vis., 2024, pp. 222–241.
[99] Y. Guo, Y. Gao, Y. Lu, H. Zhu, R. Liu, and S. He, “OneRestore: A universal restoration framework for composite degradation,” in Proc. Eur. Conf. Comput. Vis., Jul. 2024, pp. 255–272.
[100] Y. Zhu et al., “Learning weather-general and weather-specific features for image restoration under multiple adverse weather conditions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 21747–21758.
[101] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 586–595.
[102] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 5, pp. 2567–2581, May 2022.
[103] D. Du et al., “The unmanned aerial vehicle benchmark: Object detection and tracking,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 370–386.
[104] K. Zhang et al., “MC-blur: A comprehensive benchmark for image deblurring,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 5, pp. 3755–3767, May 2024.
[105] K. Zhang et al., “Deep image deblurring: A survey,” Int. J. Comput. Vis., vol. 130, no. 9, pp. 2103–2130, Sep. 2022.
[106] K. Zhang, R. Li, Y. Yu, W. Luo, and C. Li, “Deep dense multi-scale network for snow removal using semantic and depth priors,” IEEE Trans. Image Process., vol. 30, pp. 7419–7431, 2021.
[107] K. Zhang, D. Li, W. Luo, and W. Ren, “Dual attention-in-attention model for joint rain streak and raindrop removal,” IEEE Trans. Image Process., vol. 30, pp. 7608–7619, 2021.
[108] K. Zhang et al., “Beyond monocular deraining: Stereo image deraining via semantic understanding,” in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), Aug. 2020, pp. 71–89.

Yuning Cui (Graduate Student Member, IEEE) received the B.Eng. degree from Central South University, Changsha, China, in 2016, and the M.Eng. degree from the National University of Defense Technology, Changsha, in 2018. He is currently pursuing the Ph.D. degree with the Chair of Robotics, Artificial Intelligence and Real-time Systems, School of Computation, Information and Technology, Technical University of Munich (TUM), Munich, Germany.
His research interest lies in image restoration.

Mingyu Liu (Graduate Student Member, IEEE) received the dual master’s degree in electrical and computer engineering from the Department of Electronics and Communication Engineering, Technical University of Munich (TUM), Munich, Germany, and Tongji University, Shanghai, China, in 2022. He is currently pursuing the Ph.D. degree with the Chair of Robotics, Artificial Intelligence and Real-time Systems, TUM.
His research interests include computer vision in autonomous driving, deep learning, and artificial intelligence.

Wenqi Ren (Member, IEEE) received the Ph.D. degree from Tianjin University, Tianjin, China, in 2017.
From 2015 to 2016, he was supported by the China Scholarship Council and worked with Prof. Ming-Hsuan Yang as a Joint-Training Ph.D. Student with the Electrical Engineering and Computer Science Department, University of California at Merced, Merced, CA, USA. He is currently a Professor with the School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University, Shenzhen, China. His research interests include image processing and related high-level vision problems.

Alois Knoll (Fellow, IEEE) received the Diploma (M.Sc.) degree in electrical/communications engineering from the University of Stuttgart, Stuttgart, Germany, in 1985, and the Ph.D. degree (summa cum laude) in computer science from the Technical University of Berlin, Berlin, Germany, in 1988.
Since 2001, he has been a Professor at the Department of Informatics, Technical University of Munich (TUM), Munich, Germany. His research interests include cognitive, medical, and sensor-based robotics, multiagent systems, and model-driven development of embedded systems.