
J. Vis. Commun. Image R. 71 (2020) 102851
https://doi.org/10.1016/j.jvcir.2020.102851

NERNet: Noise estimation and removal network for image denoising


Bingyang Guo a,b, Kechen Song a,b,*, Hongwen Dong a,b, Yunhui Yan a,b,*, Zhibiao Tu c, Liu Zhu c

a School of Mechanical Engineering & Automation, Northeastern University, Shenyang, Liaoning, China
b Energy Saving Metallurgical Equipment and Intelligent Detection Engineering Technology Research Center of Liaoning Province, Shenyang, Liaoning, China
c Taizhou University, Taizhou, Zhejiang, China

* Corresponding authors at: School of Mechanical Engineering & Automation, Northeastern University, Shenyang, Liaoning, China. E-mail addresses: [email protected] (K. Song), [email protected] (Y. Yan).

Article history: Received 20 January 2020; Revised 8 May 2020; Accepted 27 June 2020; Available online 20 July 2020.

Keywords: Image denoising; Convolutional neural networks; Attention mechanism; Dilated convolution; Dilation rate selecting.

Abstract

While some denoising methods based on deep learning achieve superior results on synthetic noise, they are far from dealing with photographs corrupted by realistic noise. Denoising of real-world noisy images faces more significant challenges because its sources are more complicated than those of synthetic noise. To address this issue, we propose a novel network consisting of a noise estimation module and a removal module (NERNet). The noise estimation module automatically estimates the noise level map corresponding to the information extracted by the symmetric dilated block and the pyramid feature fusion block. The removal module focuses on removing the noise from the noisy input with the help of the estimated noise level map. The dilation selective block with an attention mechanism in the removal module adaptively fuses features from convolution layers with different dilation rates and also aggregates global and local information, which benefits the preservation of details and textures. Experiments on two datasets of synthetic noise and three datasets of realistic noise show that NERNet achieves competitive results in comparison with other state-of-the-art methods.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

Image denoising, which aims to improve the quality of images and solve a series of problems caused by noise corruption, refers to the process of reducing noise in digital images. As an essential problem in computer vision, image denoising effectively improves the accuracy of subsequent tasks such as classification, segmentation, and target detection.

Over the past few decades, various traditional methods have been proposed, including filter methods [1-3], sparse coding methods [4,5], effective prior methods [6], total variational methods [7,8], and wavelet methods [9,10], and significant performance has been achieved for the removal of synthetic noise (e.g., additive white Gaussian noise (AWGN), salt noise, pepper noise). In order to acquire high-quality images, most traditional methods involve solving a complex optimization problem, which creates a conflict between good performance and running time. With the rapid development of deep learning, some methods [11-16] based on deep learning effectively improve the denoising performance due to their extraordinary capability of feature learning and ingenious network architectures. However, both traditional methods and deep learning methods tend to be over-fitted to synthetic noise. With the increasing demand for image quality, it is critical to improve the robustness towards realistic noise. The sources of realistic noise in a real camera system are more complicated, and the quality of the output image is affected by external and internal conditions such as illumination, CCD/CMOS sensors, and camera shaking [17]. All these make realistic noise much different from synthetic noise, as shown in Fig. 1. Unfortunately, current algorithms are far from dealing with realistic noise, and it remains a challenging issue in practical application environments.

In this paper, we design a novel network including a noise estimation module and a removal module (NERNet). As noted in FFDNet [21], taking the noise level map as input is beneficial for removing noise and preserving detail. However, FFDNet achieves its best result only when the input noise level map matches the ground-truth noise level. When the input noise level map is lower or higher than the ground-truth one, the result of FFDNet is unsatisfactory. To address this issue, CBDNet [25] designs an asymmetric loss function, which imposes a larger penalty on under-estimation errors in order to estimate a more accurate noise level map. Different from CBDNet, which uses a complex loss function to fix wrong predictions, our network instead relies on an exquisitely designed architecture to gather more information about the noise and predict a more precise noise level map.

Fig. 1. Comparison of different noises. Realistic noise is far different from other synthetic noise.

Firstly, we design a symmetric dilated convolution block and a pyramid feature fusion block in the noise estimation module. In contrast to the plain structure used in CBDNet, our architecture constitutes a receptive field pyramid that gives our network the natural ability to learn multi-scale information. Secondly, based on the idea of the kernel selecting module in [31], we design a novel dilation selective block to make full use of detail information under different dilation rates. What's more, the dilation selective block also contains an attention block that pays more attention to local and global information of the feature. More details will be given in later sections.

In short, the main contributions of our work are generalized in four folds:

(1) NERNet is presented by considering both synthetic noise and realistic noise, achieving the end-to-end denoising task.
(2) Benefiting from the dilated convolution block and the pyramid feature fusion block, our network has the natural ability to learn multi-scale information of noise and effectively estimates an accurate noise level map.
(3) An implementation of the dilation selective block with an attention mechanism is elaborated such that multiple branches with different dilation rates can be fused by linear combination to enhance the ability of the removal module.
(4) Experiments on two datasets of synthetic noise and three datasets of realistic noise prove that our NERNet outperforms many state-of-the-art methods.

2. Related works

2.1. Traditional denoising methods

As a classical problem in computer vision, image denoising has been studied for a long time. Initially, many spatial domain filters, including mean filtering [18], Wiener filtering [19], and median filtering [20], were applied to image denoising. Nevertheless, they failed to preserve image texture and lost sharp edges. Then, some variational denoising methods [1,3,8] based on maximum a posteriori (MAP) estimation and non-local means achieved good results. Subsequently, researchers proposed methods, e.g., BM3D [2], focusing on the transform domain rather than the spatial domain. The above traditional methods have two disadvantages: (i) they take a long processing time even when dealing with a single image, and (ii) they cannot handle spatially variant noise and realistic noise.

2.2. Denoising methods based on deep learning

Due to their powerful feature learning ability and novel network architectures, some methods based on deep learning significantly improve the performance of image denoising. Zhang et al. [11] propose a plain convolutional network named DnCNN, proving that residual learning and batch normalization are beneficial for improving performance and reducing training time. Zhang et al. [12] design a similar network called IRCNN, which also confirms the conclusion of DnCNN. Meanwhile, they find that using dilated filters can enlarge the receptive field of the network. To improve the ability to tolerate errors in noise estimation, Shi et al. [13] design HRLNet by adding a fusion sub-network that concatenates features of different levels in the backbone.

What is well known to us all is that deeper networks have better learning capabilities. The above methods have simple architectures but still achieve better performance than traditional methods. To effectively stimulate the potential abilities of deep networks, Mao et al. [14] design a very deep network with an encoder-decoder architecture (RED), which has a strong learning capacity. Thanks to the deep network structure, RED can use a single model to handle different levels of noise. However, it is challenging to recover edges and textures from noisy images; some details are inevitably lost during the forward propagation as the number of convolution layers increases. To address this, Tai et al. [15] proposed a network named MemNet, consisting of a recursive unit and a gate unit, to adaptively learn the weights of different features. This novel architecture, modeled after human memory, can take advantage of multi-scale information and address the problem of long-term memory in DnCNN, IRCNN, and RED. Inspired by MemNet, Jia et al. [16] develop FOCNet, which includes a multi-scale memory system that fuses features of previous layers and different scales, by solving a fractional-order optimal control (FOC) problem.

Whether shallow or deep, these networks all focus on AWGN without spatial variance. Zhang et al. [21] propose FFDNet, which performs better on spatially variant noise by transforming the noise into a noise level map and taking it as input. This conversion introduces more information about the noise, which is beneficial for noise estimation. However, FFDNet is not efficient when the noise standard deviation is unknown, and this increases the difficulty of training the model.

2.3. Denoising of real images

Because the sources of realistic noise are far from AWGN, denoising of real images is more difficult and significant than that of synthetic noise. Some PCA-based methods [22,23] provide the motivation to deal with this issue; they build two stages, including noise estimation and denoising. Inspired by these, Foi et al. [24] proposed local estimation of multiple expectation/standard-deviation pairs and a global parametric model to realize the denoising task. Guo et al. [25] designed CBDNet, which also contains two sub-networks, i.e., a noise estimation sub-network and a blind denoising sub-network. What's more, a novel loss function helps avoid inaccurate noise levels predicted by the noise estimation sub-network.

3. Proposed method

This section presents our NERNet, which consists of the noise estimation module and the removal module. To begin with, we introduce the general network architecture of NERNet. Then, we describe each module in detail.

3.1. Network architecture

As illustrated in Fig. 2, the proposed NERNet includes the noise estimation module and the removal module. First, the noise estimation module produces the estimated noise level map $\hat{\sigma}(y)$ from a noisy observation $x$, which can be formulated as:

$$\hat{\sigma}(y) = M_e(x; W_e) \quad (1)$$

where $M_e(\cdot)$ denotes the noise estimation module and $W_e$ is the network parameters of the noise estimation module.

Then, the removal module takes both $x$ and $\hat{\sigma}(y)$ as input to predict the clean image $\hat{y}$:

$$\hat{y} = M_r(\mathrm{Concat}(x, \hat{\sigma}(y)); W_r) \quad (2)$$

where $M_r(\cdot)$ denotes the removal module, the "Concat" operation concatenates the noisy input $x$ and the estimated noise level map $\hat{\sigma}(y)$, and $W_r$ is the network parameters of the removal module.

In the noise estimation module, the mean absolute error (MAE) between the real noise level map $\sigma(y)$ and the estimated noise level map $\hat{\sigma}(y)$ serves as the loss function:

$$Loss_e = \frac{1}{N}\sum_{i=1}^{N} \|\hat{\sigma}(y) - \sigma(y)\|_1 \quad (3)$$

It should be noted that the ground truth $\sigma(y)$ used for synthetic noise and realistic noise is different. For synthetic noise, the ground truth $\sigma(y)$ is the noise with a specific standard deviation. As for realistic noise, the ground truth $\sigma(y)$ is the difference $x - y$ acquired by subtracting the clean image $y$ from the noisy image $x$.

Likewise, in the removal module, the mean absolute error (MAE) between the clean image $y$ and the estimated clean image $\hat{y}$ serves as the loss function:

$$Loss_r = \frac{1}{N}\sum_{i=1}^{N} \|\hat{y} - y\|_1 \quad (4)$$
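To make the pipeline concrete, the following PyTorch sketch wires the two modules together as in Eqs. (1)-(2) and forms the MAE objectives of Eqs. (3)-(4). The class name, the estimator/remover sub-networks, and the way the two losses are returned are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the two-stage NERNet formulation (Eqs. 1-4).
# `estimator` and `remover` stand in for the modules of Sections 3.2 and 3.3.
import torch
import torch.nn as nn

class NERNetSketch(nn.Module):
    def __init__(self, estimator: nn.Module, remover: nn.Module):
        super().__init__()
        self.estimator = estimator  # M_e: noise estimation module
        self.remover = remover      # M_r: removal module

    def forward(self, x):
        sigma_hat = self.estimator(x)                           # Eq. (1)
        y_hat = self.remover(torch.cat([x, sigma_hat], dim=1))  # Eq. (2)
        return sigma_hat, y_hat

mae = nn.L1Loss()

def training_losses(sigma_hat, sigma, y_hat, y):
    # Eq. (3): MAE between estimated and ground-truth noise level maps
    # (for synthetic noise, sigma is the known-standard-deviation map;
    # for realistic noise, sigma is the residual x - y).
    loss_e = mae(sigma_hat, sigma)
    # Eq. (4): MAE between the predicted clean image and the ground truth.
    loss_r = mae(y_hat, y)
    return loss_e, loss_r
```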
3.2. Noise estimation module

In the task of predicting the noise level map, sufficient contextual information is beneficial for acquiring accurate noise and inferring high-frequency components. To make full use of features from different layers and scales, we design a symmetric dilated convolution block and a pyramid feature fusion block. In the remainder of this subsection, we introduce the design idea and function of each block in detail.

3.2.1. Symmetric dilated convolution block

Currently, an effective way to expand the receptive field is dilated convolution [26], which can cover more spatial information of the feature map without extra parameters or computational cost. However, stacking convolution layers with the same dilation rate causes the gridding phenomenon. Inspired by multiple dilated convolutional blocks [27], we find that combining different dilation rates improves the receptive field and avoids the gridding phenomenon.

To extract more discriminative features from input noisy images, we propose a symmetric dilated convolutional block including five DCRs (convolutions with dilation, each followed by ReLU), as shown in Fig. 2. In each convolution layer of a DCR, the kernel size is $3 \times 3$, the dilation rates are set to 1, 2, 3, 2, 1, and the channel number is set to 64. We can calculate that the receptive field of each layer is 3, 10, 21, 28, 31 according to the formula:

$$l_k = l_{k-1} + \sum_{i=1}^{k}\left[(f_k - 1) \times s_i\right] \quad (5)$$

where $l_k$ and $f_k$ denote the receptive field and the kernel size of layer $k$, and $s_i$ represents the stride of pooling layer $i$.

What's more, we concatenate the feature maps extracted from each convolution layer of the symmetric dilated convolutional block:

$$F_0 = \mathrm{Concat}(f_1, \ldots, f_5) \quad (6)$$

where the "Concat" operation autonomously learns the weight value of each feature, $f_i$ is the feature map extracted from each layer, and $F_0$ is the feature sent to the next block.
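As a concrete reading of this block, the sketch below stacks the five DCR layers with the stated kernel size, dilation rates, and channel width, and concatenates the per-layer outputs as in Eq. (6); the class name and input channel count are assumptions.

```python
# Sketch of the symmetric dilated convolution block: five 3x3 conv+ReLU
# layers (DCRs) with dilation rates 1,2,3,2,1 and 64 channels; the outputs
# f_1..f_5 are concatenated into F_0 (Eq. 6).
import torch
import torch.nn as nn

class SymmetricDilatedBlock(nn.Module):
    def __init__(self, in_ch=3, ch=64, rates=(1, 2, 3, 2, 1)):
        super().__init__()
        self.dcrs = nn.ModuleList(
            nn.Sequential(
                # padding = dilation keeps the spatial size fixed for 3x3 kernels
                nn.Conv2d(in_ch if i == 0 else ch, ch, 3, padding=r, dilation=r),
                nn.ReLU(inplace=True))
            for i, r in enumerate(rates))

    def forward(self, x):
        feats = []
        for dcr in self.dcrs:
            x = dcr(x)
            feats.append(x)          # f_1, ..., f_5
        return torch.cat(feats, 1)   # F_0 = Concat(f_1, ..., f_5)
```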
3.2.2. Pyramid feature fusion block

As one of the classic network architectures, the pyramid pooling module [28] has been widely applied to image segmentation and target detection, but it has not been used for image denoising.

[Fig. 2 diagram: the noise estimation module (a symmetric dilated convolution block built from DCR-1/DCR-2/DCR-3 layers, followed by the pyramid feature fusion block with 1x1, 2x2, and 4x4 average pooling, 1x1 convolutions, and upsampling) feeds the removal module (a dense U-Net with the dilation selective block, skip connections, element-wise addition, and concatenation). DCR-N denotes a convolution with dilation factor N followed by ReLU.]

Fig. 2. The architecture of NERNet. Given a noisy image, the noise estimation module produces the feature of the noise level, and the removal module predicts the final clean image.

The foremost advantage of the pyramid pooling module is that spatial information of different sizes can be fully extracted, which improves the robustness to different object sizes. On the other hand, image details are lost due to the multiple pooling operations. To address this issue while taking advantage of the pyramid pooling module, we design the pyramid feature fusion block. Our module consists of three branches with different pooling kernel sizes, set to $1 \times 1$, $2 \times 2$, and $4 \times 4$. Then, the smaller feature obtained after pooling is concatenated with the upper-layer feature; this operation makes full use of feature information from multiple sizes and reduces the loss of details as much as possible. Last but not least, we estimate the feature of the noise level by a $1 \times 1$ convolution layer.

The features of the different branches and the final output feature of the noise level can be expressed as follows:

$$A_i = \mathrm{Avg}_i(F_0) \quad (7)$$

$$F_1 = w_c * \mathrm{Concat}(A_i, A_{i+1}) + b_c, \quad i \in \{1, 2\} \quad (8)$$

where $\mathrm{Avg}_i$ means the average pooling layer with kernel size $i$, $A_i$ means the feature after the $\mathrm{Avg}_i$ operation, the "Concat" operation concatenates the features extracted from the different branches, $w_c$ is a filter of size $1 \times 1$, $b_c$ is the bias, and $F_1$ is the feature of the noise level.
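One plausible implementation of this block, following Fig. 2 and Eqs. (7)-(8), is sketched below: three average-pooling branches (kernel sizes 1, 2, 4) each pass through a 1x1 convolution, and the coarser maps are upsampled and fused with the finer ones step by step. The channel widths and output width are assumptions for illustration.

```python
# Sketch of the pyramid feature fusion block (Eqs. 7-8).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFeatureFusion(nn.Module):
    def __init__(self, in_ch=320, out_ch=3):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(in_ch, 64, 1) for _ in range(3))
        self.fuse_mid = nn.Conv2d(128, 64, 1)      # Eq. (8), i = 2
        self.fuse_out = nn.Conv2d(128, out_ch, 1)  # Eq. (8), i = 1

    def forward(self, f0):
        # A_i = Avg_i(F_0), Eq. (7), pooling kernel sizes 1, 2, 4
        a1 = self.reduce[0](F.avg_pool2d(f0, 1))
        a2 = self.reduce[1](F.avg_pool2d(f0, 2))
        a4 = self.reduce[2](F.avg_pool2d(f0, 4))
        # upsample the coarser feature and concatenate with the finer one
        a4 = F.interpolate(a4, size=a2.shape[2:], mode="nearest")
        a2 = self.fuse_mid(torch.cat([a2, a4], dim=1))
        a2 = F.interpolate(a2, size=a1.shape[2:], mode="nearest")
        return self.fuse_out(torch.cat([a1, a2], dim=1))  # F_1, the noise level feature
```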

3.3. Removal module

We acquire the feature of the noise level after the noisy input has passed through the noise estimation module. The next step is to build a removal module to remove the noise from the noisy input. We choose U-Net [29] as the basic framework and modify it by using the dense block and the dilation selective block. In the remainder of this subsection, we introduce the structure of each block in detail.

3.3.1. Dense block

Huang et al. [30] proposed DenseNet for image classification and achieved great success. The main design idea of DenseNet is to build dense connections between adjacent layers. Inspired by DenseNet, we design a dense block to replace the plain convolution layers in the U-Net.

The dense block in our model consists of only three convolution layers without any pooling layer, which is different from the structure in DenseNet. As the lines in Fig. 3 show, we concatenate the feature obtained from the former layer with the features acquired from the latter layers, which can be expressed as:

$$x_i = w_c * \mathrm{Concat}(x_0, \ldots, x_{i+1}) + b_c, \quad i \in \{1, 2, 3, 4\} \quad (9)$$

where $w_c$ is a filter of size $1 \times 1$, $b_c$ is the bias, and the "Concat" operation concatenates the features extracted from the different convolution layers of the dense block.

Furthermore, each convolution layer has a $3 \times 3$ kernel and padding 1, ensuring that the size of the feature map stays consistent during the forward propagation.

Fig. 3. The architecture of the dense block. The lines with different colors denote the forward propagation of features from different layers.
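A short sketch of this dense block is given below. The paper mentions both 3x3 kernels with padding 1 and 1x1 fusion filters $w_c$; the sketch applies a single 3x3 convolution to the concatenated features as one plausible reading, with a 64-channel width assumed from Fig. 2.

```python
# Sketch of the three-layer dense block: each layer sees the concatenation
# of the block input and all earlier layer outputs (in the spirit of Eq. 9).
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, ch=64, n_layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ch * (i + 1), ch, 3, padding=1) for i in range(n_layers))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x0):
        feats = [x0]
        for conv in self.convs:
            feats.append(self.relu(conv(torch.cat(feats, dim=1))))
        return feats[-1]  # spatial size is preserved by the padding
```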

Fig. 4. The structure of our dilation selective block. "FC" means the fully connected layer.

3.3.2. Dilation selective block

As described in Section 3.2.1, dilated convolution is good for enlarging the receptive field. To select among the features produced by convolution layers with different dilation rates, we propose the dilation selective block, inspired by SKNet [31].

As shown in Fig. 4, the input of our dilation selective block is the feature $F_2$ output by two dense blocks and two average pooling layers. It is divided into three parts by three parallel convolution layers with dilation rates 1, 3, and 5, respectively, to get the middle features $F_2^{I}$, $F_2^{II}$, and $F_2^{III}$. We add the parts element-wise to get the feature $F_3$:

$$F_3 = F_2^{I} + F_2^{II} + F_2^{III} \quad (10)$$

Then we feed $F_3$ into an attention mechanism that includes a local attention block and a global attention block, as shown in Fig. 5.

Inspired by the Gram matrix [32], we propose the local attention block, which pays more attention to measuring the relationships between pixels by multiplying the elements of $F_3$ with $F_3^{T}$:

$$F_{local} = F_3 \otimes F_3^{T} \quad (11)$$

In addition to focusing on the local attention, we also extract global information $F_{global}$ by using a global average pooling (GAP) layer:

$$F_{global} = \mathrm{GAP}(F_3) \quad (12)$$

We combine the local information $F_{local}$ and the global information $F_{global}$ to acquire the attention weight $F_{fusion}$. Then $F_{fusion}$ is shrunk and expanded by passing through two fully connected layers and is operated on by Softmax to get three weight outputs $\lambda'$, $\mu'$, and $\nu'$. Specifically, the Softmax operation is applied to the channel-wise digits:

$$\lambda_c = \frac{e^{\lambda'_c}}{e^{\lambda'_c} + e^{\mu'_c} + e^{\nu'_c}}, \quad \text{and likewise for } \mu_c \text{ and } \nu_c \quad (13)$$

where $\lambda$, $\mu$, and $\nu$ denote the soft attention vectors for $F_2^{I}$, $F_2^{II}$, and $F_2^{III}$, and $\lambda_c$ is the $c$-th element of $\lambda$, likewise $\mu_c$ and $\nu_c$.

We get the final feature maps $F_4$ by integrating each feature $F_2^{i}$ with its weight output $\lambda_c$, $\mu_c$, and $\nu_c$:

$$F_4^{c} = \lambda_c \cdot F_2^{I} + \mu_c \cdot F_2^{II} + \nu_c \cdot F_2^{III}, \quad \lambda_c + \mu_c + \nu_c = 1 \quad (14)$$
Fig. 5. The operation of the attention mechanism in the dilation selective block. "GAP" means the global average pooling operation.

Table 1. Average PSNR (dB) achieved by various algorithms on two synthetic noise datasets. Bold font marks the best result of each experiment.

Dataset  Noise level  BM3D [2]  IRCNN [12]  DnCNN [11]  FFDNet [21]  HRLNet [13]  FOCNet [16]  FFDNet*  NERNet*  NERNet
BSD68    15           33.52     33.86       33.89       33.87        33.83        33.07        33.85    34.08    34.03
BSD68    25           30.69     31.16       31.23       31.21        31.14        30.73        31.20    31.39    31.37
BSD68    50           27.37     27.86       27.92       27.96        27.86        27.43        27.96    28.22    28.12
Set14    15           32.37     32.77       32.86       32.77        32.96        32.98        32.74    33.34    33.32
Set14    25           29.97     30.38       30.43       30.44        30.51        30.53        30.41    30.59    30.57
Set14    50           26.72     27.14       27.18       27.32        27.34        27.33        27.31    27.37    27.36

$$F_4 = \{F_4^{1}, \ldots, F_4^{c}\} \quad (15)$$

Finally, we feed $F_4$ into a $3 \times 3$ convolution layer to acquire the predicted image $\hat{y}$:

$$\hat{y} = w_c * F_4 + b_c \quad (16)$$

where $w_c$ is the convolution filter and $b_c$ is the bias.
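Putting Eqs. (10)-(16) together, the following sketch shows the selective fusion of the three dilated branches with channel-wise softmax weights. For brevity it keeps only the global (GAP) attention path and omits the Gram-matrix local attention of Eq. (11); the class name, reduction factor, and channel sizes are assumptions.

```python
# Sketch of the dilation selective block: three parallel 3x3 convolutions
# with dilation rates 1, 3, 5, fused by softmax attention weights (Eqs. 10-14).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilationSelectiveBlock(nn.Module):
    def __init__(self, ch=64, reduction=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in (1, 3, 5))
        self.fc1 = nn.Linear(ch, ch // reduction)      # shrink
        self.fc2 = nn.Linear(ch // reduction, 3 * ch)  # expand to 3 weight vectors

    def forward(self, f2):
        u = [b(f2) for b in self.branches]           # F2^I, F2^II, F2^III
        f3 = u[0] + u[1] + u[2]                      # Eq. (10)
        g = F.adaptive_avg_pool2d(f3, 1).flatten(1)  # F_global, Eq. (12)
        w = self.fc2(F.relu(self.fc1(g)))            # two fully connected layers
        # Eq. (13): softmax across the three branches, per channel
        w = w.view(-1, 3, f3.shape[1], 1, 1).softmax(dim=1)
        # Eq. (14): lambda_c * F2^I + mu_c * F2^II + nu_c * F2^III
        return w[:, 0] * u[0] + w[:, 1] * u[1] + w[:, 2] * u[2]
```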

Fig. 6. Comparison of our model with other methods on an image from Set14. The region in the blue box shows a more detailed part of the result. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 7. Comparison of our model with other methods on an image from BSD68. The region in the orange box shows a more detailed part of the result. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4. Experiment

In this section, we evaluate the performance of our model on both synthetic noise and real noise denoising. All experiments are evaluated in RGB space. First of all, we introduce the training and testing datasets of the different tasks. Next, we give the details of network training. Then, we show the qualitative comparisons with other state-of-the-art methods on synthetic noise and realistic noise denoising. Finally, we illustrate the function of each module by ablation studies.

4.1. Datasets

It is well known that the quality of the datasets is vital for the performance of a model in image denoising. We select a number of source image datasets according to the different tasks.

For denoising of synthetic noise, we collected 1600 images from Waterloo [33], 1600 images from MIT-Adobe FiveK [34], and 500 images from ImageNet [35] as training data. We use two datasets, namely BSD68 [36] and Set14 [37], to test the ability to remove synthetic noise. The BSD68 dataset consists of 68 color images from the BSD300 dataset [36]. The Set14 dataset includes 14 images and is widely used for evaluating denoising ability. What's more, the images are not resized or cropped in either the training or the testing phase.

For denoising of realistic noise, we selected RENOIR [38] as training data. Three datasets of real-world noisy images, i.e., NC12 [39], Nam [40], and SIDD [41], are adopted for testing. The NC12 dataset includes 12 realistic noise images without ground truth. The Nam dataset consists of 11 static scenes, and images in the same scene mostly contain similar objects and textures. For each scene, 500 temporal images were captured to compute the ground-truth image and the noise image. The SIDD dataset contains 320 image pairs for training, 1280 images for validation, and 1280 images for testing. We crop images to 350 x 350 in the training and testing phases.
4.2. Parameter settings for network training

We adopt the ADAM algorithm with the exponential decay rate equal to 0.9 to train our model. The method described in [42] is selected for weight initialization. The batch size is set to 8, and the number of epochs is set to 80. For the symmetric dilated convolutional module and the spatial feature pyramid module, we set the initial learning rate of the feature extraction module to $10^{-3}$. The initial learning rate of the inference module is set to $5 \times 10^{-3}$. Both learning rates are multiplied by 0.1 every 20 epochs.

We implement our model with the PyTorch package. All the experiments are carried out in an Ubuntu 16.04 + Python 3.6 environment running on a PC with an Intel i7-9700K CPU, an Nvidia GeForce RTX 2080Ti GPU, and 32 GB of Galaxy GAMER RAM. It takes about 36 h to train our model on the GPU with acceleration by CUDA 8.0.61 + cuDNN v6.1.
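Under the stated settings, the optimizer and schedule could be set up as below; the two parameter groups mirror the two learning rates, and the attribute names refer to the earlier NERNet sketch, so they are assumptions rather than the authors' code.

```python
# Sketch of the training configuration: ADAM (beta1 = 0.9), two parameter
# groups with initial learning rates 1e-3 and 5e-3, both decayed by 0.1
# every 20 epochs; batch size 8, 80 epochs.
import torch

def make_optimizer(model):
    optimizer = torch.optim.Adam([
        {"params": model.estimator.parameters(), "lr": 1e-3},
        {"params": model.remover.parameters(), "lr": 5e-3},
    ], betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
    return optimizer, scheduler
```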

Fig. 8. Comparison of our model with other methods on an image from NC12. The region in the green box shows a more detailed part of the result. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.3. Comparison with state-of-the-art methods

4.3.1. Experiment on synthetic noisy images

In this subsection, we evaluate our model on synthetic noisy images corrupted by spatially invariant AWGN. Unlike other methods, we train our model with both fixed-level AWGN (non-blind) and random-level AWGN (blind), rather than only using fixed-level AWGN. Then we evaluate the denoising ability towards different levels of AWGN.

As the PSNR on BSD68 and Set14 in comparison with other methods, shown in Table 1, indicates, our model achieves very competitive results for all noise levels. Our network exceeds BM3D by about 0.7 dB over a wide range of noise levels on BSD68 and Set14, which proves the powerful performance of deep learning. Furthermore, our model produces better images, as shown in Figs. 6 and 7.

In contrast to FFDNet, the advantages of NERNet can be summarized in two folds: (i) benefiting from the noise estimation module, our method does not need to know the noise level, i.e., it performs blind denoising; (ii) the structure of the removal module is more complex than that of FFDNet and has better performance on denoising. To further prove that our method is superior to FFDNet, we conduct relevant experiments. First of all, we remove the noise estimation module and keep only the removal module, calling the result NERNet*. Then, we feed the specific noise level used in FFDNet into NERNet*. As shown in Table 1, we find that NERNet* has better performance than FFDNet and NERNet, which demonstrates that the architecture of the removal module is better than that of FFDNet. What's more, the contrast with NERNet confirms that a more accurate noise level map yields better results. Last but not least, we feed the estimated noise level map output by the noise estimation module into FFDNet, calling the result FFDNet*. We find that its result is very close to FFDNet, as shown in Table 1, which proves that our noise estimation module can predict an accurate noise level map.

4.4. Experiment on realistic noisy images

Because realistic image noise comes from multiple sources, denoising of realistic noisy images is difficult. In this subsection, we further verify the robustness of our model towards realistic noise, which is more valuable in practical applications. We select NC12, Nam, and SIDD as our testing datasets; they have been widely used by other denoising methods.

NC12: We only report the denoising results through the denoised images, because the ground-truth images of NC12 are unavailable. We compared different methods, including a traditional method (BM3D) and deep learning methods (DnCNN, FFDNet, and CBDNet). The images produced by the different methods are shown in Fig. 8. The result of BM3D is still fuzzy. DnCNN, trained on synthetic noise, cannot effectively remove the noise, and several light spots appear on some patches in the image. The denoised image produced by FFDNet is darker than the original noisy image, which makes the visual effect worse. CBDNet does not perform well on some details of the image. Compared with the above methods, our method achieves better visual effects and preserves more information in the whole image.

Fig. 9. Comparison of our model with other methods on an image from SIDD. The result predicted by our model shows the best visual effects.

Table 2. The PSNR (dB) and SSIM achieved by various methods on the SIDD dataset. Bold font marks the best result.

Method       Blind/Non-blind  Denoising on  PSNR   SSIM
BM3D [2]     Non-blind        RGB           25.65  0.685
WNNM [4]     Non-blind        RGB           25.78  0.809
DnCNN [11]   Non-blind        RGB           23.66  0.583
FFDNet [21]  Non-blind        RGB           29.20  0.735
CBDNet [25]  Blind            RGB           33.28  0.868
NERNet       Blind            RGB           37.97  0.942

Table 3. Average PSNR (dB) achieved by different methods on the Nam dataset. Bold font marks the best result.

Method  BM3D [2]  DnCNN [11]  FFDNet [21]  CBDNet [25]  NERNet
PSNR    37.30     35.55       38.70        39.01        40.10

SIDD: We compare the PSNR and SSIM of the different methods on the SIDD validation dataset, as shown in Table 2. It is obvious that our model achieves the best result. Moreover, we select three subtle regions of the image to fully reveal the ability of the different methods, as shown in Fig. 9. One can see that NERNet generates more accurate details and textures than the other models.

Nam: In CBDNet, 25 patches of cropped images were randomly selected for testing rather than using the whole images. This testing method cannot completely reflect denoising ability. To address this issue and reflect the results more accurately, we crop images into 1/16 of the original size and use all of them for evaluation. The average PSNR and SSIM scores are presented in Table 3, and we can conclude that our method performs better than the other methods under the different evaluation indicators. The visual denoising images produced by our model have more accurate details, as shown in Fig. 10. What's more, our model is a blind denoising method and more flexible than other non-blind methods.

4.5. Ablation studies

We conduct four ablation experiments to prove the ability of each module by training four different network architectures on our training dataset: "PC (Plain Convolution Block) + PFF (Pyramid Feature Fusion Block) + DUDS (Dense U-Net with Dilation Selective Block)", "SDC (Symmetric Dilated Convolution Block) + PFF + DUDS", "SDC + DUDS", and "SDC + PFF + DU (Dense U-Net)". Our ablation experiments are evaluated on the BSD68 and SIDD datasets.

SDC vs. PC. To prove the effectiveness of the receptive field improvement, we use five plain convolution layers with dilation rate 1 (PC) instead of the symmetric dilated convolution block (SDC). The difference in receptive field between the two structures is shown in Table 4: SDC has about three times the receptive field of PC, which means SDC can use more contextual information to predict images. What's more, the comparison between experiments 3 and 4 indicates that SDC exceeds PC by about 0.56 dB on BSD68 and 0.1 dB on SIDD, as shown in Table 5.

With PFF vs. without PFF. We remove the PFF from our network without adding any module in its place; otherwise, we still reserve SDC and DUDS. As shown in Table 5, experiments 2 and 4 show that the model with PFF achieves about a 0.47 dB gain over the model without PFF on both BSD68 and SIDD. We can conclude that PFF effectively fuses the feature information at different scales.

DUDS vs. DU. From our design perspective, the dilation selective module can adaptively fuse the features extracted from convolutions with different dilation rates, so we consider that DUDS is more beneficial for acquiring information due to its ability for feature fusion. We replace DUDS with DU to demonstrate this view. As shown in Table 5, the results of experiments 1 and 4 conclusively prove the powerful ability of the dilation selective module.

Furthermore, we plot the training curves on synthetic noise and realistic noise, as shown in Fig. 11, to observe the effect of each module. Through the comparison of the different training curves, we have the following observations:

(i) The network containing all three modules achieves the best result among the architectures, which proves the effect of our designed modules.
(ii) It is clear that the dilation selective module improves the PSNR more than the other modules.

Fig. 10. Comparison of our model with other methods on an image from Nam. The regions in the different color boxes show more detailed parts of the results.

Table 4. The receptive fields of the plain convolution block (PC) and the symmetric dilated convolution block (SDC).

Layer  1  2   3   4   5
PC     3  5   7   9   11
SDC    3  10  21  28  31

4.6. Comparison on testing time

In addition to PSNR or SSIM quality evaluation, another important aspect is testing time. We select 50 images from BSD68 with noise level 15 for the speed evaluation among the different ablation studies. We compare the different models on CPU and GPU to evaluate their speed. Table 6 shows the testing times of DnCNN, FFDNet, HRLNet, FOCNet, CBDNet, and NERNet for images of 256 x 256, 512 x 512, and 1024 x 1024. We can conclude that our model consumes less time. What's more, we also compare the testing times of the different structures in NERNet, as shown in Table 7. As modules are added, the testing time does not increase by much; it is clear that the time cost of each module is similar.

Table 5. The evaluation results of the ablation study on the BSD68 and SIDD datasets (average PSNR, dB). Bold font marks the best result.

Serial number  SDC  PFF  DUDS  BSD68  SIDD
1              Y    Y    -     33.28  38.15
2              Y    -    Y     33.47  38.42
3              -    Y    Y     33.51  38.49
4              Y    Y    Y     34.03  38.81

Fig. 11. Training curves of the different ablation studies on (a) the synthetic noise dataset (BSD68) and (b) the realistic noise dataset (SIDD). Compared with the other models, our model, including the symmetric dilated convolution block, pyramid feature fusion block, and dilation selective block, achieves the best result.

Table 6. The testing time (in seconds) for different image sizes. Bold font marks the best result.

Size         Device  DnCNN [11]  FFDNet [21]  HRLNet [13]  FOCNet [16]  CBDNet [25]  NERNet
256 x 256    CPU     2.44        0.62         0.55         2.32         1.24         0.58
             GPU     0.011       0.008        0.007        0.009        0.008        0.007
512 x 512    CPU     9.85        2.51         2.17         9.11         3.56         2.08
             GPU     0.040       0.017        0.018        0.039        0.021        0.15
1024 x 1024  CPU     38.11       10.17        9.21         35.11        11.59        9.19
             GPU     0.057       0.057        0.058        0.059        0.059        0.056

Table 7. The testing time (in seconds) of the different structures among the ablation studies.

SDC  PFF  DUDS  CPU   GPU
Y    Y    -     0.58  0.007
Y    -    Y     0.60  0.008
-    Y    Y     0.61  0.008
Y    Y    Y     0.58  0.007

5. Conclusion

In this paper, we proposed a novel denoising method based on deep learning, namely NERNet, which performs better on both synthetic noise and realistic noise than other state-of-the-art methods. The proposed network includes two sequential stages. The first stage estimates the noise level map from the noisy input. Benefiting from the dilated convolution block and the pyramid feature fusion block, the noise estimation module has the natural ability to learn multi-scale information of noise and effectively estimates an accurate noise level map. At the second stage, with the help of the noise level map estimated by the noise estimation module, the removal module combines multiple branches with different dilation rates to enhance the performance of denoising. Extensive experiments on two synthetic noise datasets and three realistic noise datasets demonstrate the performance of our model.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (51805078), the National Key Research and Development Program of China (2017YFB0304200), the Fundamental Research Funds for the Central Universities (N2003021), and the Taizhou Science and Technology Planning Project (1902GY09).

References

[1] B.K. Shreyamsha Kumar, Image denoising based on non-local means filter and its method noise thresholding, SIViP 7 (6) (2013) 1211-1227.
[2] W.J. Li et al., Fast combination filtering based on weighted fusion, J. Vis. Commun. Image Represent. 62 (2019) 226-233.
[3] J. Liu et al., Image denoising searching similar blocks along edge directions, Signal Processing-Image Commun. 57 (2017) 33-45.
[4] S.H. Gu et al., Weighted nuclear norm minimization with application to image denoising, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2862-2869.
[5] Y.X. Liu et al., Patch based image denoising using the finite ridgelet transform for less artifacts, J. Vis. Commun. Image Represent. 25 (2014) 1006-1017.
[6] Y. Niu et al., Model-based adaptive resolution upconversion of degraded images, J. Vis. Commun. Image Represent. 23 (2012) 1144-1157.
[7] W.H. Li et al., Total variation blind deconvolution employing split Bregman iteration, J. Vis. Commun. Image Represent. 23 (2012) 409-417.
[8] Z.X. Cui et al., A nonconvex nonsmooth regularization method with structure tensor total variation, J. Vis. Commun. Image Represent. 43 (2017) 30-40.
[9] D. Cho, T.D. Bui, Multivariate statistical modeling for image denoising using wavelet transforms, Signal Processing-Image Commun. 20 (1) (2005) 77-89.
[10] K.Q. Huang et al., Color image denoising with wavelet thresholding based on human visual system model, Signal Processing-Image Commun. 20 (2) (2005) 115-127.
[11] K. Zhang et al., Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising, IEEE Trans. Image Process. 26 (7) (2017) 3142-3155.
[12] K. Zhang et al., Learning deep CNN denoiser prior for image restoration, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2808-2817.

[13] W.Z. Shi et al., Hierarchical residual learning for image denoising, Signal Processing-Image Commun. 76 (2019) 243-251.
[14] X.J. Mao, C.H. Shen, Y.B. Yang, Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections, in: Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016.
[15] Y. Tai et al., MemNet: a persistent memory network for image restoration, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4549-4557.
[16] X. Jia et al., FOCNet: a fractional optimal control network for image denoising, in: 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6054-6063.
[17] L. Fan, F. Zhang, H. Fan, et al., Brief review of image denoising techniques, Visual Computing for Industry, Biomedicine, and Art 2 (1) (2019) 7.
[18] A. Ben Hamza, H. Krim, Image denoising: a nonlinear robust statistical approach, IEEE Trans. Signal Process. 49 (12) (2001) 3045-3054.
[19] J. Benesty, J.D. Chen, Y.T. Huang, Study of the widely linear Wiener filter for noise reduction, in: 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010, pp. 205-208.
[20] R.K. Yang et al., Optimal weighted median filtering under structural constraints, IEEE Trans. Signal Process. 43 (3) (1995) 591-604.
[21] K. Zhang, W.M. Zuo, L. Zhang, FFDNet: toward a fast and flexible solution for CNN-based image denoising, IEEE Trans. Image Process. 27 (9) (2018) 4608-4622.
[22] X.H. Liu, M. Tanaka, M. Okutomi, Single-image noise level estimation for blind denoising, IEEE Trans. Image Process. 22 (12) (2013) 5226-5237.
[23] S. Pyatykh, J. Hesser, L. Zheng, Image noise level estimation by principal component analysis, IEEE Trans. Image Process. 22 (2) (2013) 687-699.
[24] A. Foi et al., Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data, IEEE Trans. Image Process. 17 (10) (2008) 1737-1754.
[25] S. Guo et al., Toward convolutional blind denoising of real photographs, in: 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 1712-1722.
[26] F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv preprint arXiv:1511.07122, 2015.
[27] Y. Wei et al., Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7268-7277.
[28] K.M. He et al., Spatial pyramid pooling in deep convolutional networks for visual recognition, in: Computer Vision - ECCV 2014, Pt III, 8691 (2014) 346-361.
[29] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention, Pt III, 9351 (2015) 234-241.
[30] G. Huang et al., Densely connected convolutional networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261-2269.
[31] X. Li et al., Selective kernel networks, in: 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 510-519.
[32] P. Drineas, M.W. Mahoney, On the Nystrom method for approximating a Gram matrix for improved kernel-based learning, J. Mach. Learn. Res. 6 (2005) 2153-2175.
[33] K.D. Ma et al., Waterloo exploration database: new challenges for image quality assessment models, IEEE Trans. Image Process. 26 (2) (2017) 1004-1016.
[34] V. Bychkovsky et al., Learning photographic global tonal adjustment with a database of input/output image pairs, in: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 97-104.
[35] J. Deng et al., ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255.
[36] R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-representations, in: International Conference on Curves and Surfaces, 2010, pp. 711-730.
[37] S. Roth, M.J. Black, Fields of experts, Int. J. Comput. Vision 82 (2) (2009) 205-229.
[38] J. Anaya, A. Barbu, RENOIR - a dataset for real low-light image noise reduction, J. Vis. Commun. Image Represent. 51 (2018) 144-154.
[39] M. Lebrun, M. Colom, J.M. Morel, The noise clinic: a universal blind denoising algorithm, in: 2014 IEEE International Conference on Image Processing (ICIP), 2014, pp. 2674-2678.
[40] S. Nam et al., A holistic approach to cross-channel image noise modeling and its application to image denoising, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1683-1691.
[41] A. Abdelhamed, S. Lin, M.S. Brown, A high-quality denoising dataset for smartphone cameras, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 1692-1700.
[42] K.M. He et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026-1034.
