
Towards Efficient and Scale-Robust

Ultra-High-Definition Image Demoiréing


Xin Yu1 , Peng Dai1 , Wenbo Li2 , Lan Ma3 ,

Jiajun Shen3 , Jia Li4 , and Xiaojuan Qi1

1 The University of Hong Kong

2 The Chinese University of Hong Kong

3 TCL AI Lab

4 Sun Yat-sen University

Abstract. With the rapid development of mobile devices, modern widely used mobile
phones typically allow users to capture 4K resolution (i.e., ultra-high-definition) images.
However, for image demoiréing, a challenging task in low-level vision, existing works
are generally carried out on low-resolution or synthetic images. Hence, the
effectiveness of these methods on 4K resolution images is still unknown. In this paper,
we explore moiré pattern removal for ultra-high-definition images. To this end, we
propose the first ultra-high-definition demoiréing dataset (UHDM), which contains 5,000
real-world 4K resolution image pairs, and conduct a benchmark study on current state-
of-the-art methods. Further, we present an efficient baseline model ESDNet for tackling
4K moiré images, wherein we build a semantic-aligned scale-aware module to address
the scale variation of moiré patterns. Extensive experiments manifest the effectiveness
of our approach, which outperforms state-of-the-art methods by a large margin while
being much more lightweight. Code and dataset are available at https://xinyu-andy.github.io/uhdm-page.
Keywords: Image demoiréing, Image restoration, Ultra-high-definition
1 Introduction
When photographing content displayed on a digital screen, frequency aliasing between the camera's color filter array (CFA) and the screen's LCD subpixels is almost inevitable. The captured images are thus contaminated with colorful stripes, known as moiré patterns, which severely degrade the perceptual quality of the images. Currently,
efficiently removing moiré patterns from a single moiré image is still challenging and
receives growing attention from the research community.
Recently, several image demoiréing methods [13,47,12,29,22,8,20,40] have been
proposed, yielding a plethora of dedicated designs such as moiré pattern classification
[12], frequency domain modeling [22,47], and multi-stage framework [13]. Apart from
FHDe2Net [13] which is specially designed for high-definition images, most of the
research efforts have been devoted to studying low-resolution images [29] or synthetic
images [40]. However, the fast development of mobile devices enables modern mobile
phones to capture ultra-high-definition images, so it is more practical to conduct
research on 4K image demoiréing for real applications. Unfortunately, the highest
resolution in current public demoiréing datasets (see Table 1) is 1080p [13] (1920 ×
1080). Whether methods investigated on such datasets can be trivially transferred into
the 4K scenario is still unknown due to the data distribution change and dramatically
increased computational cost.
Under this circumstance, we explore the more practical yet more challenging
demoiréing scenario, i.e., ultra-high-definition image demoiréing. To evaluate the
demoiréing methods in this scenario, we build the first large-scale real-world ultra-high-
definition demoiréing dataset (UHDM), which consists of 4,500 training image pairs and 500 testing image pairs covering diverse scenes (see Fig. 1).
Benchmark study and limitation analysis: Based upon our dataset, we conduct a
benchmark study on state-of-the-art methods [13,47,12,29,22,8]. Our empirical study
reveals that most methods [29,8,47] struggle to remove moiré patterns spanning a much wider range of scales in 4K images, while also facing sharply increased computational cost (see Fig. 3) or loss of fine image detail [13] (see Fig. 2). We
attribute their deficiencies to the lack of an effective multi-scale feature extraction
strategy. Concretely, existing methods attempting to address the scale challenge can be
coarsely categorized into two lines of research. One line of research develops multi-
stage models, such as FHDe2Net [13], to process large moiré patterns at a low-
resolution stage and then refines the textures at a high-resolution stage, which however
incurs huge computational cost
when applied to 4K images (see Fig. 3: FHDe2Net). Another line of research utilizes
features from different depths of a network to build multi-scale representations, in which
the most representative work [47] achieves a better trade-off between accuracy and
efficiency (see Fig. 3: MBCNN), yet still cannot be generally scale-robust (see Fig. 2
and Fig. 5). We note that the extracted multi-scale features are from different semantic
levels which may result in misaligned features when fused together, potentially limiting
its capabilities. Detailed study and analysis are unfolded in Section 3.2.
To this end, inspired by HRNet [33], we propose a plug-and-play semantic-aligned scale-aware module (SAM) to boost the network's capability in handling moiré patterns with
diverse scales without incurring too much computational cost, serving as a supplement
to existing methods. Specifically, SAM incorporates a pyramid context extraction module
to effectively and efficiently extract multi-scale features aligned at the same semantic
level. Further, a cross-scale dynamic fusion module is developed to selectively fuse
multi-scale features where the fusion weights are learned and dynamically adapted to
individual images.
Equipped with SAM, we develop an efficient and scale-robust network for 4K image demoiréing, named ESDNet. ESDNet adopts a simple encoder-decoder network with skip-connections as its backbone and stacks SAMs at different semantic levels to boost the model's capability in addressing scale variations of 4K moiré images. ESDNet is easy to implement while achieving state-of-the-art performance (see Fig. 5 and Table 2) on the challenging ultra-high-definition image demoiréing dataset and three other public demoiréing datasets [13,40,29].
In particular, ESDNet exceeds the multi-stage high-resolution method FHDe2Net by 1.8 dB in terms of PSNR while being about 300× faster (0.017 s vs. 5.620 s) on the UHDM dataset. Our
major contributions are summarized as follows:
– We are the first to explore the ultra-high-definition image demoiréing problem, which is
more practical yet more challenging. To this end, we build a large-scale real-world 4K
resolution demoiréing dataset UHDM.
– We conduct a benchmark study for the existing state-of-the-art methods on this
dataset, summarizing several challenges and analyses. Motivated by these analyses,
we propose an efficient baseline model ESDNet for ultra-high-definition image
demoiréing.
– Our ESDNet achieves state-of-the-art results on the UHDM dataset and three other public demoiréing datasets, in terms of both quantitative evaluation and qualitative comparison. Moreover, ESDNet is lightweight and can process standard 4K (3840 × 2160) resolution images at 60 fps.
2 Related Work
Image demoiréing: To remove moiré patterns caused by frequency aliasing, Liu et
al. [20] propose a synthetic dataset by simulating the camera imaging process and
develop a GAN-based [10] framework. Further, a large-scale synthetic dataset [40] is
proposed and promotes many follow-up works [47,8,40].
However, it is difficult for models trained on synthetic data to handle real-world
scenarios due to the sim-to-real gap. For real-world image demoiréing, Sun et al. [29]
propose the first real-world moiré image dataset (i.e., TIP2018) and develop a multi-scale network (DMCNN). To distinguish different types of moiré patterns, He et al. [12] manually annotate moiré images with category labels to train a moiré pattern classification model. Frequency domain methods [22,47] are also studied for moiré
removal. To deal with high-resolution images, He et al. [13] construct a high-definition
dataset FHDMi and develop the multi-stage framework FHDe2Net. Although significant
progress has been achieved, the above methods either cannot achieve satisfactory
results [47,12,29,8] or suffer from heavy computational cost [47,13,12,8]. More
importantly, the highest resolution of existing image demoiréing datasets is FHDMi [13]
with 1080p resolution, which is not suitable for practical use considering the ultra-high-
definition (4K) images captured by current mobile cameras. We focus on developing a
lightweight model that can process ultra-high-definition images.
Image restoration: To date, many learning-based image restoration models have been proposed. For instance, residual learning [14] and dense connections [15] are
widely used to develop very deep neural networks for different low-level vision tasks
[43,1,19,17,46]. In order to capture multi-scale information, encoder-decoder [25]
structures or hierarchical architectures are frequently exploited in image restoration
tasks [42,41,9]. Inspired by iterative solvers, some methods utilize recurrent structures
[9,31] to gradually recover images while reducing the number of parameters. To
preserve structural and semantic information, many works [36,21,28,37,30,34] adopt the
perceptual loss [16] or generative loss [10,11,2] to guide the training procedure. In our
work, we also take advantage of the well-designed dense blocks for efficient feature
reuse and the perceptual loss for semantically guided optimization.
Multi-scale network: The multi-scale network has been widely adopted in various tasks
[33,4,48,38,6] due to its ability to leverage features with different receptive fields. U-Net
[25], as one representative multi-scale network, extracts multi-scale information using
an encoder-decoder structure, and enhances features in decoder with skip-connections.
To preserve the high-resolution representation, the full resolution residual network [24]
extends the U-Net by introducing an extra stream containing information of the full
resolution, and similar operations can be found in the HRNet [33]. Considering that the
extracted multi-scale features have different semantic meanings, the question of how to
fuse features with different meanings is also important and has been widely studied in
many works [3,5,7]. In this work, we design a semantic-aligned scale-aware module to handle moiré patterns with diverse scales without incurring much computational cost, which renders our method highly practical for 4K images.
3 UHDM Dataset
We study ultra-high-definition image demoiréing, which has more practical applications.
For the training of 4K demoiréing models and the evaluation of existing methods, we
collect a large-scale ultra-high-definition demoiréing dataset (UHDM). Dataset collection
and benchmark study are elaborated upon below.
3.1 Data Collection and Selection
To obtain the real-world 4K image pairs, we first collect high-quality images with
resolutions ranging from 4K to 8K from the Internet. We note that Internet resources
lack document scenes, which also constitute a vital application scenario (e.g., slides, papers), so we manually generate high-quality text images and make sure they maintain 3000 dpi (dots per inch). Finally, the collected moiré-free images cover a wide range of
scenes (see Fig. 1), such as landscapes, sports, video clips, and documents. Given
these high-quality images, we generate diverse real world moiré patterns elaborated
upon below.
First, to produce realistic moiré images and ease the difficulties of calibrations, we shoot
the clean pictures displayed on the screen with a camera phone fixed on a DJI OM 5
smartphone gimbal, which allows us to conveniently and flexibly adjust the camera view
through its control button, as shown in Fig. 1.
Second, we note that the characteristics of moiré patterns are highly dependent upon the geometric relationship between the screen and the camera (see supplement
for more details). Therefore, during the capturing process, we continuously adjust the
viewpoint every ten shots to produce diverse moiré patterns.
Third, we adopt multiple ⟨mobile phone, screen⟩ combinations (i.e., three mobile phones and three digital screens; see supplement for more details) to cover various device pairs, since these also affect the styles of moiré patterns. Finally, to obtain aligned pairs, we use the RANSAC algorithm [32] to estimate the homography matrix between the original high-quality image and the captured moiré screen image (a sketch of this alignment step is given after the dataset summary below). Since it is difficult to ensure accurate pixel-wise calibration due to the camera's internal nonlinear distortions and perturbations from moiré artifacts, manual selection is performed to rule out severely misaligned image pairs, thereby ensuring quality. Our dataset contains 5,000 image pairs in total. We randomly split them into 4,500 for training and 500 for validation. As we collect moiré images using various mobile phones, the resolution is either 4032 × 3024 or 4624 × 3472. Comparisons with existing datasets are shown in Table 1, and the characteristics of our dataset are summarized below.
– Ultra-high resolution: UHDM is the first 4K-resolution demoiréing dataset, consisting of 5,000 image pairs in total.
– Diverse image scenes: The dataset includes diverse scenes, such as landscapes, sports, video clips, and documents.
– Real-world capture settings: The moiré images are captured following practical routines, with different device combinations and viewpoints to produce diverse moiré patterns.
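To make the alignment step described above concrete, the following is a minimal sketch of homography-based pair alignment using OpenCV's RANSAC estimator; the SIFT features, ratio test, and reprojection threshold are illustrative assumptions rather than the exact settings used for UHDM.

```python
# Hypothetical alignment sketch: warp the captured moire photo onto the clean
# source image using a RANSAC-estimated homography. Feature detector and
# thresholds are assumptions for illustration.
import cv2
import numpy as np

def align_pair(clean_bgr, moire_bgr):
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(cv2.cvtColor(moire_bgr, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = sift.detectAndCompute(cv2.cvtColor(clean_bgr, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # RANSAC rejects outlier matches
    h, w = clean_bgr.shape[:2]
    return cv2.warpPerspective(moire_bgr, H, (w, h))      # moire image on the clean grid
```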
Table 1: Comparisons of different demoiréing datasets; our dataset is the first ultra-high-
definition dataset (“London’s Buildings” is not available currently)

Dataset | Avg. Resolution | Size | Diversity | Real-world
TIP2018 [29] | 256×256 | 135,000 | No text scenes | Yes
LCDMoiré [40] | 1024×1024 | 10,200 | Only text scenes | No
FHDMi [13] | 1920×1080 | 12,000 | Diverse scenes | Yes
London's Buildings [22] | 2100×1700 | 460 | Only urban scenes | Yes
UHDM (ours) | 4238×3248 | 5,000 | Diverse scenes | Yes

Fig. 2: Limitations of current methods: they often fail to remove moiré patterns spanning a wide scale range or lose high-frequency details (panels: input moiré image, zoom-in region, DMCNN, ours, ground truth)
3.2 Benchmark Study on 4K Demoiréing
As the image resolution increases to 4K, the scale of moiré patterns spans a very wide range, from very large patterns to small ones (see Fig. 1). This poses a major challenge to demoiréing methods, which are required to be scale-robust.
Furthermore, increased image resolution also leads to dramatically increased
computational cost and high requirements of detail restoration/preservation. Here, we
carry out a benchmark study on the existing state-of-the-art methods [47,29,12,13,22,8]
on our 4K demoiréing dataset to evaluate their effectiveness. Main results are
summarized in Fig. 2 and Fig. 3: existing methods are mostly not capable of achieving a
good balance of accuracy and computational efficiency. More detailed results are shown
in Section 5.
Analysis and discussions: Although existing methods also attempt to address the
scale challenge by developing multi-scale strategies, they still have several deficiencies
regarding computational efficiency and restoration quality when applied to 4K high-
resolution images (see Fig. 2). One line of methods, such as DMCNN [29] and MDDM
[8], fuses multi-scale features harvested from multi-resolution inputs only at the output
stage, which potentially prohibits the intermediate features from interacting with and
refining each other, leading to sub-optimal results, i.e., significantly sacrificing accuracy
on 4K image demoiréing despite being lightweight (see Fig. 3 and Fig. 2). Another line
of methods, such as MBCNN [47], exploits multi-scale features at different network
depths following a U-Net-like architecture. Compared with other existing methods,
although it achieves the best trade-off between accuracy and efficiency, it still suffers
from moiré patterns with a wide-scale range (the second row of Fig. 2 and Fig. 5). One
possible issue is that the combined multi-scale features come from different semantic
levels [33], prohibiting a specific feature level to harvest multi-resolution representations
[33], which could also be an important cue for image demoiréing. On the other hand,
FHDe2Net [13] designs a coarse-to-fine two-stage model to simultaneously address the scale and detail challenges. However, it incurs heavy computational cost when applied to 4K images (see Fig. 3), yet still fails to fully remove moiré patterns (see Fig. 5) or recover fine image detail (see Fig. 2 and Fig. 5).
4 Proposed Method
Motivated by observations in Section 3.2, we introduce a baseline approach to advance
4K resolution image demoiréing, aimed towards a more scale-robust and efficient
model. In the following, we first present an overview of our pipeline and then elaborate
on our core semantic-aligned scale-aware module (SAM).
4.1 Pipeline
The overall architecture is shown in Fig. 4, where a pre-processing head is utilized to
enlarge the receptive field, followed by an encoder-decoder architecture for image
demoiréing. The pre-processing head adopts a pixel-shuffle operation [26] to downsample the image by a factor of two and a 5 × 5 convolution layer to further extract low-level features.
Then, the extracted low-level features are fed into an encoder-decoder backbone
architecture that consists of three downsampling and upsampling levels. Note that the
encoder and decoder are connected via skip-connections to allow features containing
high-resolution information to facilitate the restoration of corresponding moiré-free
images. At each decoder level, the network produces intermediate results through a convolution layer and a pixel-shuffle upsampling operation (see the upper part of Fig. 4), which are also
supervised by the ground-truth, serving the purpose of deep supervision to facilitate
training. Specifically, each encoder or decoder level (see Fig. 4) contains a dilated
residual dense block [46,15,14,39] for refining the input features (as detailed below) and
a proposed semantic-aligned scale-aware module (SAM) for extracting and dynamically fusing multi-scale features at the same semantic level (as elaborated in Section 4.2).
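As a minimal sketch of the pre-processing head described above, the snippet below pairs a 2× pixel-shuffle downsampling (PixelUnshuffle in PyTorch) with a 5 × 5 convolution; the channel width is an assumed value for illustration.

```python
# Sketch of the pre-processing head: 2x pixel-shuffle downsampling followed by
# a 5x5 convolution. The output channel width (48) is an assumption.
import torch
import torch.nn as nn

class PreprocessHead(nn.Module):
    def __init__(self, out_channels=48):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(2)        # (B, 3, H, W) -> (B, 12, H/2, W/2)
        self.conv = nn.Conv2d(12, out_channels, kernel_size=5, padding=2)

    def forward(self, x):
        return self.conv(self.unshuffle(x))

feat = PreprocessHead()(torch.randn(1, 3, 256, 256))  # -> (1, 48, 128, 128)
```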
Dilated residual dense block: For each level i ∈ {1, 2, 3, 4, 5, 6} (i.e., three encoder levels and three decoder levels), the input feature first goes through a convolutional block, i.e., a dilated residual dense block, for refinement. It incorporates the residual dense block (RDB) [46,15,14] and dilated convolution layers [39] to process the input features and output refined ones. Specifically, given an input feature F_i^0 to the i-th level encoder or decoder, the cascaded local features from each layer inside the block can be formulated as Eq. (1):

F_i^l = C_l([F_i^0, F_i^1, ..., F_i^{l-1}]),   (1)

where [F_i^0, F_i^1, ..., F_i^{l-1}] denotes the concatenation of all intermediate features inside the block before layer l, and C_l is the operator that processes the concatenated features, consisting of a 3 × 3 convolution with dilation rate d_l and a rectified linear unit (ReLU). After that, we apply a 1 × 1 convolution to keep the output channel number the same as that of F_i^0. Finally, we exploit a residual connection to produce the refined feature representation F_i^r, formulated as Eq. (2):

F_i^r = F_i^0 + Conv_{1×1}([F_i^0, F_i^1, ..., F_i^L]),   (2)

where L is the number of layers inside the block. The refined feature representation F_i^r is then fed to our proposed SAM for semantic-aligned multi-scale feature extraction.
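A minimal PyTorch sketch of the dilated residual dense block, following Eq. (1) and Eq. (2); the number of layers, growth rate, and dilation rates are assumptions, not the exact configuration.

```python
# Sketch of a dilated residual dense block: each layer concatenates all previous
# features and applies a 3x3 dilated conv + ReLU (Eq. 1); a 1x1 conv restores the
# channel count and a residual connection produces the output (Eq. 2).
import torch
import torch.nn as nn

class DilatedResidualDenseBlock(nn.Module):
    def __init__(self, channels, growth=32, dilations=(1, 2, 3)):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for d in dilations:
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=d, dilation=d),
                nn.ReLU(inplace=True)))
            in_ch += growth                                   # dense concatenation grows the input
        self.fuse = nn.Conv2d(in_ch, channels, kernel_size=1)  # back to the input width

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))       # Eq. (1)
        return x + self.fuse(torch.cat(feats, dim=1))          # Eq. (2): residual connection
```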
Fig. 4: The pipeline of our ESDNet and the proposed semantic-aligned scale-aware module (SAM)
4.2 Semantic-Aligned Scale-Aware Module

Given the input feature F_i^r, SAM is intended to extract multi-scale features within the same semantic level i and allow them to interact and be dynamically fused, significantly improving the model's ability to handle moiré patterns with a wide range of scales. As demonstrated in Table 3, SAM enables us to develop a lightweight network that is still more effective than existing methods. In the following, we detail the design of SAM, which encompasses two major modules: pyramid context extraction and cross-scale dynamic fusion.

Pyramid context extraction: Given an input feature map F_i^r ∈ R^{H×W×C} (which we simplify to F^r in the following discussion), we first produce pyramid input features F^r_0, F^r_1, and F^r_2 at scales 1, 1/2, and 1/4 through bilinear interpolation, and then feed them into corresponding convolutional branches E0, E1, and E2, each with five convolution layers, to yield the pyramid outputs Y0, Y1, Y2 (see the lower part of Fig. 4):

Yk = Ek(F^r_k),   k = 0, 1, 2,   (3)

where we build E0, E1, and E2 via the dilated dense block followed by a 1 × 1 convolution layer. In addition, up-sampling operations are performed in E1 and E2 to align the sizes of the three outputs, i.e., Yk ∈ R^{H×W×C} (k = 0, 1, 2). Note that, since the internal architectures of E0, E1, and E2 are identical, their learnable parameters can be shared to reduce the parameter count. In fact, as shown in Section 5, the improvement primarily comes from the pyramid architecture rather than from additional parameters.
Cross-scale dynamic fusion: Given the pyramid features Y0, Y1, Y2, the cross-scale dynamic fusion module fuses them to produce multi-scale features for the next level to process. The insight behind this module is that the scale of moiré patterns varies from image to image, and thus the importance of features at different scales also varies across images. Therefore, we develop the following cross-scale dynamic fusion module so that the fusion process is dynamically adjusted and adapted to each image. Specifically, we learn dynamic weights to fuse Y0, Y1, Y2.
Given Yk ∈ R^{H×W×C} (k = 0, 1, 2), we first apply global average pooling over the spatial dimensions of each feature map to obtain a 1D global descriptor vk ∈ R^C for each scale k, following Eq. (4):

vk = GAP(Yk),   k = 0, 1, 2.   (4)

Then, we concatenate the descriptors along the channel dimension and learn the dynamic weights through an MLP module:

[w0, w1, w2] = MLP([v0, v1, v2]),   (5)

where "MLP" consists of three fully connected layers and outputs w0, w1, w2 ∈ R^C, used to fuse Y0, Y1, Y2 dynamically. Finally, with these fusion weights, we channel-wise fuse the pyramid features with the input-adaptive weights and then add the input feature F^r to obtain the final output of SAM:

F^out = F^r + w0 ⊙ Y0 + w1 ⊙ Y1 + w2 ⊙ Y2,   (6)

where ⊙ denotes channel-wise multiplication, and the output F^out goes to the next level (i → i + 1) for further feature extraction and image reconstruction.
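The following is a compact sketch of SAM under the formulation above (Eqs. (3)–(6)); the branch here is a plain convolution stack rather than the full dilated dense block, and the branch depth and MLP width are assumptions.

```python
# Sketch of SAM: pyramid inputs at scales 1, 1/2, 1/4 via bilinear interpolation,
# a shared branch per scale (Eq. 3), global average pooling (Eq. 4), an MLP
# producing per-channel fusion weights (Eq. 5), and dynamic fusion with a
# residual connection (Eq. 6).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(                  # shared across the three scales
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1))
        self.mlp = nn.Sequential(                     # three FC layers predicting w0, w1, w2
            nn.Linear(3 * channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, 3 * channels))

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, h, w = x.shape
        ys = []
        for s in (1.0, 0.5, 0.25):                    # pyramid inputs
            xi = x if s == 1.0 else F.interpolate(x, scale_factor=s, mode='bilinear',
                                                  align_corners=False)
            yi = self.branch(xi)                      # Eq. (3)
            if s != 1.0:                              # align sizes back to (H, W)
                yi = F.interpolate(yi, size=(h, w), mode='bilinear', align_corners=False)
            ys.append(yi)
        v = torch.cat([y.mean(dim=(2, 3)) for y in ys], dim=1)     # Eq. (4): GAP
        w0, w1, w2 = self.mlp(v).chunk(3, dim=1)                    # Eq. (5)
        weights = [wi.view(b, c, 1, 1) for wi in (w0, w1, w2)]
        return x + sum(wi * yi for wi, yi in zip(weights, ys))      # Eq. (6)
```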
Comparisons and analysis: Existing methods [47,22] utilize features from different
depths to obtain multi-scale representations. However, features at different depths have
different levels of semantic information. Thus, they are incapable of representing multi-
scale information at the same semantic level, which might provide important cues for
boosting the model’s multi-scale modeling capabilities, as indicated in [33]. We offer
SAM as a supplement to existing methods as Y0, Y1, Y2 include semantic-aligned
information with different local receptive fields. The dynamic fusion further makes the module adaptive to different images and boosts its capability. This strategy can also be viewed as an implicit classifier, in contrast to the explicit one in MopNet [12]; it is more efficient and avoids ambiguous hand-crafted attribute definitions. We include a more detailed analysis in our supplementary file.
4.3 Loss Function
To boost optimization, we adopt the deep supervision strategy, which has been proven useful in [47]. As shown in Fig. 4, at each decoder level the network produces a hierarchical prediction, giving Î_1, Î_2, Î_3, which are all supervised by the ground-truth image. We note that moiré patterns disrupt image structures since they introduce new strip-shaped structures. Therefore, we adopt the perceptual loss [16] for feature-based supervision. At each level l, we build the loss by combining the pixel-wise L1 loss and the feature-based perceptual loss L_p. Hence, the final loss function is formulated as:

L = Σ_{l=1}^{3} ( ||Î_l − I||_1 + λ L_p(Î_l, I) ),

where I denotes the ground-truth image. For the perceptual loss, we extract features from conv3_3 (after ReLU) of a pre-trained VGG16 [27] network and compute the L1 distance in the feature space; we simply set λ = 1 during training. We find that this perceptual loss is effective in removing moiré patterns.
Table 2: Quantitative comparisons between our model and state-of-the-art methods on four datasets. (↑) denotes the larger the better, and (↓) denotes the smaller the better. Red: best and Blue: second-best.
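A hedged sketch of this loss follows: an L1 term plus a VGG16 conv3_3 (after ReLU) perceptual term for each of the three deep-supervised outputs, with λ = 1; input normalization and the exact feature slicing are assumptions.

```python
# Sketch of the deep-supervised L1 + perceptual loss. The VGG16 slice
# features[:16] ends after the ReLU of conv3_3; image normalization is assumed
# to be handled elsewhere.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualL1Loss(nn.Module):
    def __init__(self, lam=1.0):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16]
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.lam = vgg.eval(), lam

    def forward(self, preds, target):                 # preds: [I_hat_1, I_hat_2, I_hat_3]
        tgt_feat = self.vgg(target)
        loss = 0.0
        for pred in preds:
            loss = loss + F.l1_loss(pred, target) \
                        + self.lam * F.l1_loss(self.vgg(pred), tgt_feat)
        return loss
```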
5 Experiments
Datasets and metrics: We conduct experiments on the proposed UHDM dataset and three other public datasets: FHDMi [13], TIP2018 [29], and LCDMoiré [40]. In our UHDM dataset, we keep the two original resolutions (see Section 3), and models are trained on cropped patches. During evaluation, we center-crop the original images to obtain test pairs with a resolution of 3840 × 2160 (standard 4K size). We adopt the widely used PSNR, SSIM [35], and LPIPS [44] metrics for quantitative evaluation. It has been shown that LPIPS is more consistent with human perception and is suitable for measuring demoiréing quality [13]. Note that existing methods only report PSNR and SSIM on TIP2018 and LCDMoiré, so we follow this setup for comparisons.
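For reference, a small sketch of how these metrics can be computed with common open-source tools (scikit-image for PSNR/SSIM, the lpips package for LPIPS); the tooling choices and the AlexNet LPIPS backbone are assumptions, not necessarily the evaluation code used here.

```python
# Sketch of PSNR / SSIM / LPIPS evaluation on a restored-vs-ground-truth pair.
# Inputs are HxWx3 float arrays in [0, 1]; LPIPS expects tensors in [-1, 1].
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')                    # assumed LPIPS backbone

def evaluate(pred, gt):
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```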
Implementation details: We implement our algorithm using PyTorch on an NVIDIA
RTX 3090 GPU card. During training, we randomly crop a 768 × 768 patch from the
ultra-high-definition images, and set the batch size to 2. The model is trained for 150
epochs and optimized by Adam [18] with β1 = 0.9 and β2 = 0.999. The learning rate is
initially set to 0.0002 and scheduled by cyclic cosine annealing [23]. Details for
implementations on other benchmarks are provided in the supplementary file. We also carefully and sufficiently retrain the other methods on our dataset; details are given in the supplementary file.
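A minimal sketch of this optimization setup (Adam with the stated betas, initial learning rate 2e-4, cyclic cosine annealing) is given below; the restart period and minimum learning rate are assumed values, and the model is a stand-in.

```python
# Sketch of the optimizer / scheduler configuration; T_0 and eta_min are assumptions.
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Conv2d(3, 3, 3, padding=1)                 # stand-in for ESDNet
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=50, eta_min=1e-6)

for epoch in range(150):
    # ... one epoch over random 768x768 crops with batch size 2 ...
    scheduler.step()
```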
Fig. 5: Qualitative comparisons with state-of-the-art methods on the UHDM dataset.
Please zoom in for a better view. More results are given in the supplementary file
5.1 Comparisons with State-of-the-Art Methods
We provide two versions of our model: ESDNet and ESDNet-L. ESDNet is the default
lightweight model and ESDNet-L is a larger model, stacking one more SAM in each
network level.
Quantitative comparison: Table 2 shows quantitative performance of existing
approaches. The proposed method achieves state-of-the-art results on all four
datasets. Specifically, both of our two models outperform other methods by a large
margin in the ultra-high-definition UHDM dataset and high-definition FHDMi dataset,
demonstrating the effectiveness of our method in high-resolution scenarios. It is
worthwhile to note that our ESDNet, though possessing far fewer parameters, already
shows competitive performance.
Qualitative comparison: We present visual comparisons between our algorithm and
existing methods in Fig. 5. Apparently, our method obtains more perceptually
satisfactory results. In comparison, MDDM [8], DMCNN [29], and WDNet [22] often fail to remove moiré patterns, while MBCNN [47] and MopNet [12] cannot handle large-scale patterns well. Though performing better than the other methods (except ours), FHDe2Net [13] usually suffers from severe loss of detail. All these facts manifest the superiority of
our method.
Table 3: Ablation study of the proposed SAM. “A” represents the baseline model. “A+”
denotes a stronger baseline which is of similar model capacity compared to our full
model “E”. “B” adds the pyramid context extraction with shared weights across all
branches to “A” while “D” adopts adaptive weights. “C” and “E” add the cross-scale
dynamic fusion based on “B” and “D”, respectively
Computational cost: As shown in Fig. 3, our method strikes a sweet spot, balancing parameter count, computational cost (MACs), and demoiréing performance. We also test the inference speed of our method on an NVIDIA RTX 3090 GPU. Surprisingly, our ESDNet needs only 17 ms (i.e., 60 fps) to process a standard 4K resolution image, almost 300× faster than FHDe2Net. The competitive performance and low computational cost render our method highly practical in the 4K scenario.
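For context, one way to time 4K inference on a GPU is sketched below (warm-up, then synchronized timing); the stand-in module is an assumption, standing in for ESDNet.

```python
# Sketch of GPU inference timing for a standard 4K input; synchronize around
# the timed region so asynchronous CUDA kernels are fully counted.
import time
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = nn.Conv2d(3, 3, 3, padding=1).to(device).eval()   # stand-in for ESDNet
x = torch.randn(1, 3, 2160, 3840, device=device)

with torch.no_grad():
    for _ in range(3):                                   # warm-up iterations
        net(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        net(x)
    if device == 'cuda':
        torch.cuda.synchronize()
print(f'avg inference time: {(time.time() - start) / 10 * 1000:.1f} ms')
```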
5.2 Ablation Study
In this section, we tease apart which components of our network contribute most to the
final performance on the UHDM dataset. As shown in Table 3, we start from the
baseline model (model “A”), which ablates the pyramid context extraction and the cross-
scale dynamic fusion strategies. To make a fair comparison, we further build a stronger
baseline model (model “A+”) that is comparable to our full model (model “E”) in terms of
the model capacity.
Pyramid context extraction: We construct two variants (model “B” and model “D”) for
exploring the effectiveness of this design. Compared with the baseline (model “A”), we
observe that the proposed pyramid context extraction can significantly boost the model
performance. To validate whether the improvement comes from more parameters in the
additional two sub-branches, we exploit a weight-sharing strategy across all branches
(model “B”). The observations in Table 3 demonstrate that the performance gain mainly stems from the pyramid design rather than from the increase in parameters. Further, as shown in Fig. 6, we find that our pyramid design successfully removes the moiré patterns that are not well addressed by the baseline model.
Table 4: Ablation study of the loss function. The left and right of “/” denote results trained with the pixel-wise L1 loss and with our loss, respectively.
Cross-scale dynamic fusion: To verify the importance of the proposed dynamic fusion scheme, we further add this design to model “B” and model “D”, resulting in model “C” and model “E”, respectively. We observe consistent improvements for both models, especially on
PSNR. Also, Fig. 6 shows that the artifacts retained in model “D” are totally removed in
the result of model “E”, achieving a more harmonious color style.
Loss function: Through our experiments, we find that the perceptual loss plays an essential role in image demoiréing. As shown in Table 4, when replacing our loss function with a single L1 loss, we observe obvious performance drops in our method, especially on LPIPS. We further explore applying our loss function to other state-of-the-art methods [29,8]. The significant improvements on LPIPS illustrate the importance of the loss design in yielding higher perceptual quality of the recovered images. We suggest that our loss is more robust to large-scale moiré patterns and to the misalignment issues present in real-world datasets [13,29]. More discussions are included in the supplementary file.
6 Conclusion
In this paper, to explore the more practical yet challenging 4K image demoiréing scenario, we propose the first real-world ultra-high-definition demoiréing dataset (UHDM). Based upon this dataset, we conduct a benchmark study and limitation analysis of current methods, which motivates us to build a lightweight semantic-aligned scale-aware module (SAM) to strengthen the model's multi-scale capability without incurring much computational cost. By leveraging SAM at different depths of a simple encoder-decoder backbone network, we develop ESDNet to handle 4K high-resolution image demoiréing effectively. Our method is computationally efficient and easy to implement, achieving state-of-the-art results on four benchmark demoiréing datasets (including our UHDM). We hope our investigation will inspire future research in this more practical setting.
Acknowledgements. This work is partially supported by HKU-TCL Joint Research
Center for Artificial Intelligence, Hong Kong Research Grant Council - Early Career
Scheme (Grant No. 27209621), National Key R&D Program of China
(No.2021YFA1001300), and Guangdong-Hong Kong-Macau Applied Math Center grant
2020B1515310011.
