
Noise2Noise: Learning Image Restoration without Clean Data

Jaakko Lehtinen¹,² Jacob Munkberg¹ Jon Hasselgren¹ Samuli Laine¹ Tero Karras¹ Miika Aittala³ Timo Aila¹

¹NVIDIA ²Aalto University ³MIT CSAIL. Correspondence to: Jaakko Lehtinen <[email protected]>.

arXiv:1803.04189v3 [cs.CV] 29 Oct 2018. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We apply basic statistical reasoning to signal reconstruction by machine learning – learning to map corrupted observations to clean signals – with a simple and powerful conclusion: it is possible to learn to restore images by only looking at corrupted examples, at performance at and sometimes exceeding training using clean data, without explicit image priors or likelihood models of the corruption. In practice, we show that a single model learns photographic noise removal, denoising synthetic Monte Carlo images, and reconstruction of undersampled MRI scans – all corrupted by different processes – based on noisy data only.

1. Introduction

Signal reconstruction from corrupted or incomplete measurements is an important subfield of statistical data analysis. Recent advances in deep neural networks have sparked significant interest in avoiding the traditional, explicit a priori statistical modeling of signal corruptions, and instead learning to map corrupted observations to the unobserved clean versions. This happens by training a regression model, e.g., a convolutional neural network (CNN), with a large number of pairs (x̂ᵢ, yᵢ) of corrupted inputs x̂ᵢ and clean targets yᵢ, and minimizing the empirical risk

$$\operatorname*{argmin}_\theta \sum_i L\bigl(f_\theta(\hat{x}_i),\, y_i\bigr), \tag{1}$$

where fθ is a parametric family of mappings (e.g., CNNs), under the loss function L. We use the notation x̂ to underline the fact that the corrupted input x̂ ∼ p(x̂|yᵢ) is a random variable distributed according to the clean target. Training data may include, for example, pairs of short and long exposure photographs of the same scene, incomplete and complete k-space samplings of magnetic resonance images, fast-but-noisy and slow-but-converged ray-traced renderings of a synthetic scene, etc. Significant advances have been reported in several applications, including Gaussian denoising, de-JPEG, text removal (Mao et al., 2016), super-resolution (Ledig et al., 2017), colorization (Zhang et al., 2016), and image inpainting (Iizuka et al., 2017). Yet, obtaining clean training targets is often difficult or tedious: a noise-free photograph requires a long exposure; full MRI sampling precludes dynamic subjects; etc.

In this work, we observe that we can often learn to turn bad images into good images by only looking at bad images, and do this just as well – sometimes even better – as if we were using clean examples. Further, we require neither an explicit statistical likelihood model of the corruption nor an image prior, and instead learn these indirectly from the training data. (Indeed, in one of our examples, synthetic Monte Carlo renderings, the non-stationary noise cannot be characterized analytically.) In addition to denoising, our observation is directly applicable to inverse problems such as MRI reconstruction from undersampled data. While our conclusion is almost trivial from a statistical perspective, it significantly eases practical learned signal reconstruction by lifting requirements on the availability of training data.

The reference TensorFlow implementation for Noise2Noise training is available on GitHub: https://github.com/NVlabs/noise2noise

2. Theoretical Background

Assume that we have a set of unreliable measurements (y₁, y₂, ...) of the room temperature. A common strategy for estimating the true unknown temperature is to find a number z that has the smallest average deviation from the measurements according to some loss function L:

$$\operatorname*{argmin}_z \; \mathbb{E}_y\{L(z, y)\}. \tag{2}$$

For the L2 loss L(z, y) = (z − y)², this minimum is found at the arithmetic mean of the observations:

$$z = \mathbb{E}_y\{y\}. \tag{3}$$

The L1 loss, the sum of absolute deviations L(z, y) = |z − y|, in turn, has its optimum at the median of the observations.
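To make the loss–estimator correspondence concrete, here is a small sketch (ours, not part of the paper; only NumPy is assumed) that minimizes the average deviation of a scalar estimate on a grid and compares the optima against the sample mean and median:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 21.0 + rng.standard_normal(2_000)        # noisy temperature readings

z = np.linspace(15, 25, 2_001)               # candidate estimates
l2 = ((z[:, None] - y) ** 2).mean(axis=1)    # average L2 deviation per candidate
l1 = np.abs(z[:, None] - y).mean(axis=1)     # average L1 deviation per candidate

print(z[l2.argmin()], y.mean())              # L2 optimum ~= arithmetic mean
print(z[l1.argmin()], np.median(y))          # L1 optimum ~= median
```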

The general class of deviation-minimizing estimators is known as M-estimators (Huber, 1964). From a statistical viewpoint, summary estimation using these common loss functions can be seen as ML estimation by interpreting the loss function as the negative log likelihood.

Training neural network regressors is a generalization of this point estimation procedure. Observe the form of the typical training task for a set of input-target pairs (xᵢ, yᵢ), where the network function fθ(x) is parameterized by θ:

$$\operatorname*{argmin}_\theta \; \mathbb{E}_{(x,y)}\{L(f_\theta(x), y)\}. \tag{4}$$

Indeed, if we remove the dependency on input data, and use a trivial fθ that merely outputs a learned scalar, the task reduces to (2). Conversely, the full training task decomposes to the same minimization problem at every training sample; simple manipulations show that (4) is equivalent to

$$\operatorname*{argmin}_\theta \; \mathbb{E}_x\bigl\{\mathbb{E}_{y|x}\{L(f_\theta(x), y)\}\bigr\}. \tag{5}$$

The network can, in theory, minimize this loss by solving the point estimation problem separately for each input sample. Hence, the properties of the underlying loss are inherited by neural network training.

The usual process of training regressors by Equation 1 over a finite number of input-target pairs (xᵢ, yᵢ) hides a subtle point: instead of the 1:1 mapping between inputs and targets (falsely) implied by that process, in reality the mapping is multiple-valued. For example, in a superresolution task (Ledig et al., 2017) over all natural images, a low-resolution image x can be explained by many different high-resolution images y, as knowledge about the exact positions and orientations of the edges and texture is lost in decimation. In other words, p(y|x) is the highly complex distribution of natural images consistent with the low-resolution x. Training a neural network regressor using training pairs of low- and high-resolution images using the L2 loss, the network learns to output the average of all plausible explanations (e.g., edges shifted by different amounts), which results in spatial blurriness for the network's predictions. A significant amount of work has been done to combat this well-known tendency, for example by using learned discriminator functions as losses (Ledig et al., 2017; Isola et al., 2017).

Our observation is that for certain problems this tendency has an unexpected benefit. A trivial, and, at first sight, useless, property of L2 minimization is that on expectation, the estimate remains unchanged if we replace the targets with random numbers whose expectations match the targets. This is easy to see: Equation (3) holds, no matter what particular distribution the ys are drawn from. Consequently, the optimal network parameters θ of Equation (5) also remain unchanged, if input-conditioned target distributions p(y|x) are replaced with arbitrary distributions that have the same conditional expected values. This implies that we can, in principle, corrupt the training targets of a neural network with zero-mean noise without changing what the network learns. Combining this with the corrupted inputs from Equation 1, we are left with the empirical risk minimization task

$$\operatorname*{argmin}_\theta \sum_i L\bigl(f_\theta(\hat{x}_i),\, \hat{y}_i\bigr), \tag{6}$$

where both the inputs and the targets are now drawn from a corrupted distribution (not necessarily the same), conditioned on the underlying, unobserved clean target yᵢ such that E{ŷᵢ|x̂ᵢ} = yᵢ. Given infinite data, the solution is the same as that of (1). For finite data, the variance is the average variance of the corruptions in the targets, divided by the number of training samples (see appendix). Interestingly, none of the above relies on a likelihood model of the corruption, nor a density model (prior) for the underlying clean image manifold. That is, we do not need an explicit p(noisy|clean) or p(clean), as long as we have data distributed according to them.
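As an illustration of the risk in Equation (6), here is a minimal training sketch (ours; PyTorch is assumed, and the tiny network and synthetic data stand in for the paper's U-Net and ImageNet crops). Both the input and the target are independent noisy realizations of the same unobserved latent:

```python
import torch
import torch.nn as nn

# Tiny stand-in denoiser; the paper uses a U-Net (see appendix).
net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(32, 32, 3, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

clean = torch.rand(16, 3, 64, 64)   # unobserved latents (synthetic stand-in)

for step in range(1000):
    sigma = 25.0 / 255.0
    x_hat = clean + sigma * torch.randn_like(clean)  # noisy input realization
    y_hat = clean + sigma * torch.randn_like(clean)  # independent noisy target
    loss = ((net(x_hat) - y_hat) ** 2).mean()        # Equation (6) with L2 loss
    opt.zero_grad(); loss.backward(); opt.step()
```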
In many image restoration tasks, the expectation of the corrupted input data is the clean target that we seek to restore. Low-light photography is an example: a long, noise-free exposure is the average of short, independent, noisy exposures. With this in mind, the above suggests the ability to learn to remove photon noise given only pairs of noisy images, with no need for potentially expensive or difficult long exposures.

Similar observations can be made about other loss functions. For instance, the L1 loss recovers the median of the targets, meaning that neural networks can be trained to repair images with significant (up to 50%) outlier content, again only requiring access to pairs of such corrupted images.

In the next sections, we present a wide variety of examples demonstrating that these theoretical capabilities are also efficiently realizable in practice.

3. Practical Experiments

We now experimentally study the practical properties of noisy-target training. We start with simple noise distributions (Gaussian, Poisson, Bernoulli) in Sections 3.1 and 3.2, and continue to the much harder, analytically intractable Monte Carlo image synthesis noise (Section 3.3). In Section 3.4, we show that image reconstruction from sub-Nyquist spectral samplings in magnetic resonance imaging (MRI) can be learned from corrupted observations only.

3.1. Additive Gaussian Noise

We will first study the effect of corrupted targets using synthetic additive Gaussian noise. As the noise has zero mean, we use the L2 loss for training to recover the mean.

Our baseline is a recent state-of-the-art method "RED30" (Mao et al., 2016), a 30-layer hierarchical residual network with 128 feature maps, which has been demonstrated to be very effective in a wide range of image restoration tasks, including Gaussian noise. We train the network using 256×256-pixel crops drawn from the 50k images in the IMAGENET validation set. We furthermore randomize the noise standard deviation σ ∈ [0, 50] separately for each training example, i.e., the network has to estimate the magnitude of noise while removing it ("blind" denoising).
[Figure 1: three convergence plots, PSNR (dB) vs. training epoch; legends: clean targets vs. noisy targets (a), blur bandwidths 2–40 pix (b), Case 1 (trad.) / Case 2 / Case 3 (N2N) (c).]

Figure 1. Denoising performance (dB on the KODAK dataset) as a function of training epoch for additive Gaussian noise. (a) White Gaussian, σ = 25: for i.i.d. (white) Gaussian noise, clean and noisy targets lead to very similar convergence speed and eventual quality. (b) Brown Gaussian, σ = 25: increased inter-pixel noise correlation (wider spatial blur; one graph per bandwidth) slows convergence down, but eventual performance remains close. (c) Capture budget study: effect of different allocations of a fixed capture budget to noisy vs. clean examples (see text).

We use three well-known datasets: BSD300 (Martin et al., 2001), SET14 (Zeyde et al., 2010), and KODAK.² As summarized in Table 1, the behavior is qualitatively similar in all three sets, and thus we discuss the averages. When trained using the standard way with clean targets (Equation 1), RED30 achieves 31.63 ± 0.02 dB with σ = 25. The confidence interval was computed by sampling five random initializations. The widely used benchmark denoiser BM3D (Dabov et al., 2007) gives ∼0.7 dB worse results. When we modify the training to use noisy targets (Equation 6) instead, the denoising performance remains equally good. Furthermore, the training converges just as quickly, as shown in Figure 1a. This leads us to conclude that clean targets are unnecessary in this application. This perhaps surprising observation holds also with different networks and network capacities. Figure 2a shows an example result.

² http://r0k.us/graphics/kodak/

Table 1. PSNR results from three test datasets KODAK, BSD300, and SET14 for Gaussian, Poisson, and Bernoulli noise. The comparison methods are BM3D, Inverse Anscombe transform (ANSC), and deep image prior (DIP).

            Gaussian (σ=25)         Poisson (λ=30)          Bernoulli (p=0.5)
            clean  noisy  BM3D      clean  noisy  ANSC      clean  noisy  DIP
KODAK       32.50  32.48  31.82     31.52  31.50  29.15     33.01  33.17  30.78
BSD300      31.07  31.06  30.34     30.18  30.16  27.56     31.04  31.16  28.97
SET14       31.31  31.28  30.50     30.07  30.06  28.36     31.51  31.72  30.67
Average     31.63  31.61  30.89     30.59  30.57  28.36     31.85  32.02  30.14

For all further tests, we switch from RED30 to a shallower U-Net (Ronneberger et al., 2015) that is roughly 10× faster to train and gives similar results (−0.2 dB in Gaussian noise). The architecture and training parameters are described in the appendix.

Convergence speed  Clearly, every training example asks for the impossible: there is no way the network could succeed in transforming one instance of the noise to another. Consequently, the training loss does not actually decrease during training, and the loss gradients continue to be quite large. Why do the larger, noisier gradients not affect convergence speed? While the activation gradients are indeed noisy, the weight gradients are in fact relatively clean because Gaussian noise is independent and identically distributed (i.i.d.) in all pixels, and the weight gradients get averaged over 2¹⁶ pixels in our fully convolutional network.
Figure 1b makes the situation harder by introducing inter-pixel correlation to the noise. This brown additive noise is obtained by blurring white Gaussian noise by a spatial Gaussian filter of different bandwidths and scaling to retain σ = 25; an example is shown in Figure 1b. As the correlation increases, the effective averaging of weight gradients decreases, and the weight updates become noisier. This makes the convergence slower, but even with extreme blur, the eventual quality is similar (within 0.1 dB).
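A sketch of generating such brown Gaussian noise (ours; SciPy assumed; `bandwidth` is the blur filter's standard deviation in pixels):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def brown_noise(shape, sigma=25.0, bandwidth=10.0, rng=None):
    """White Gaussian noise blurred spatially, then rescaled to std sigma."""
    rng = rng or np.random.default_rng()
    white = rng.standard_normal(shape)
    brown = gaussian_filter(white, bandwidth)  # introduce inter-pixel correlation
    return brown * (sigma / brown.std())       # scale back to std sigma

noise = brown_noise((256, 256), sigma=25.0, bandwidth=10.0)
```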
Finite data and capture budget  The previous studies relied on the availability of infinitely many noisy examples produced by adding synthetic noise to clean images. We now study corrupted vs. clean training data in the realistic scenario of finite data and a fixed capture budget. Our experiment setup is as follows. Let one ImageNet image with white additive Gaussian noise at σ = 25 correspond to one "capture unit" (CU). Suppose that 19 CUs are enough for a clean capture, so that one noisy realization plus the clean version (the average of 19 noisy realizations) consumes 20 CUs. Let us fix a total capture budget of, say, 2000 CUs. This budget can be allocated between clean latents (N) and noise realizations per clean latent (M) such that N × M = 2000. In the traditional scenario, we have only 100 training pairs (N = 100, M = 20): a single noisy realization and the corresponding clean image (= average of 19 noisy images; Figure 1c, Case 1). We first observe that using the same captured data as 100 × 20 × 19 = 38000 training pairs with corrupted targets — i.e., for each latent, forming all the 20 × 19 possible noisy/noisy pairs — yields notably better results (several tenths of a dB) than the traditional, fixed noisy+clean pairs, even if we still only have N = 100 latents (Figure 1c, Case 2). Second, we observe that setting N = 1000 and M = 2, i.e., increasing the number of clean latents but only obtaining two noisy realizations of each (resulting in 2000 training pairs), yields even better results (again, by several tenths of a dB; Figure 1c, Case 3).
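The pair-forming logic of Cases 2 and 3 can be sketched as follows (our illustration; `latents` stands for clean images and `realize` for spending one capture unit):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

def realize(latent):
    """One capture unit: a fresh sigma = 25 noise realization of a latent."""
    return latent + rng.normal(0.0, 25.0, latent.shape)

def pairs_for_budget(latents, m):
    """All ordered noisy/noisy pairs from m realizations per latent."""
    out = []
    for latent in latents:
        shots = [realize(latent) for _ in range(m)]
        out.extend((shots[i], shots[j]) for i, j in permutations(range(m), 2))
    return out

latents = [rng.uniform(0, 255, (64, 64)) for _ in range(10)]  # toy stand-ins
print(len(pairs_for_budget(latents, 20)))  # 10 * 20 * 19 ordered pairs
```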
We conclude that for additive Gaussian noise, corrupted targets offer benefits — not just the same performance but better — over clean targets on two levels: both 1) seeing more realizations of the corruption for the same latent clean image, and 2) seeing more latent clean images, even if just two corrupted realizations of each, are beneficial.

3.2. Other Synthetic Noises

We will now experiment with other types of synthetic noise. The training setup is the same as described above.

Poisson noise is the dominant source of noise in photographs. While zero-mean, it is harder to remove because it is signal-dependent. We use the L2 loss, and vary the noise magnitude λ ∈ [0, 50] during training. Training with clean targets results in 30.59 ± 0.02 dB, while noisy targets give an equally good 30.57 ± 0.02 dB, again at similar convergence speed. A comparison method (Mäkitalo & Foi, 2011) that first transforms the input Poisson noise into Gaussian (Anscombe transform), then denoises by BM3D, and finally inverts the transform, yields 2 dB less.

Other effects, e.g., dark current and quantization, are dominated by Poisson noise, can be made zero-mean (Hasinoff et al., 2016), and hence pose no problems for training with noisy targets. We conclude that noise-free training data is unnecessary in this application. That said, saturation (gamut clipping) renders the expectation incorrect due to removing part of the distribution. As saturation is unwanted for other reasons too, this is not a significant limitation.

Multiplicative Bernoulli noise (a.k.a. binomial noise) constructs a random mask m that is 1 for valid pixels and 0 for zeroed/missing pixels. To avoid backpropagating gradients from missing pixels, we exclude them from the loss:

$$\operatorname*{argmin}_\theta \sum_i \bigl(m \odot (f_\theta(\hat{x}_i) - \hat{y}_i)\bigr)^2, \tag{7}$$

as described by Ulyanov et al. (2017) in the context of their deep image prior (DIP).
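A sketch of the masked loss of Equation (7) (ours; PyTorch assumed; `mask` is 1 where the target pixel is valid; Equation (7) uses a plain sum, and normalizing by the number of valid pixels is our own choice):

```python
import torch

def masked_l2(pred, target, mask):
    """Equation (7): exclude missing pixels so no gradients flow from them."""
    diff = mask * (pred - target)                       # zero out missing pixels
    return (diff ** 2).sum() / mask.sum().clamp(min=1)  # mean over valid pixels
```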
Figure 2. Example results for Gaussian, Poisson, and Bernoulli noise (rows: (a) Gaussian (σ = 25) vs. BM3D, (b) Poisson (λ = 30) vs. ANSCOMBE, (c) Bernoulli (p = 0.5) vs. DEEP IMAGE PRIOR; columns: ground truth, input, our result, comparison). Our result was computed by using noisy targets — the corresponding result with clean targets is omitted because it is virtually identical in all three cases, as discussed in the text. A different comparison method is used for each noise type.

The probability of corrupted pixels is denoted with p; in our training we vary p ∈ [0.0, 0.95], and during testing p = 0.5. Training with clean targets gives an average of 31.85 ± 0.03 dB, noisy targets (separate m for input and target) give a slightly higher 32.02 ± 0.03 dB, possibly because noisy targets effectively implement a form of dropout (Srivastava et al., 2014) at the network output. DIP was almost 2 dB worse – DIP is not a learning-based solution, and as such very different from our approach, but it shares the property that neither clean examples nor an explicit model of the corruption is needed. We used the "Image reconstruction" setup as described in the DIP supplemental material.³

³ https://dmitryulyanov.github.io/deep_image_prior

Text removal  Figure 3 demonstrates blind text removal. The corruption consists of a large, varying number of random strings in random places, also on top of each other, and furthermore so that the font size and color are randomized as well. The font and string orientation remain fixed. The network is trained using independently corrupted input and target pairs. The probability of corrupted pixels p is approximately [0, 0.5] during training, and p ≈ 0.25 during testing. In this test the mean (L2 loss) is not the correct answer because the overlaid text has colors unrelated to the actual image, and the resulting image would incorrectly tend towards a linear combination of the right answer and the average text color (medium gray). However, with any reasonable amount of overlaid text, a pixel retains the original color more often than not, and therefore the median is the correct statistic. Hence, we use L1 = |fθ(x̂) − ŷ| as the loss function. Figure 3 shows an example result.

Figure 3. Removing random text overlays corresponds to seeking the median pixel color, accomplished using the L1 loss. The mean (L2 loss) is not the correct answer: note the shift towards the mean text color. Only corrupted images were shown during training. (Example training pairs at p ≈ 0.04 and p ≈ 0.42; input at p ≈ 0.25; PSNR: input 17.12 dB, L2 26.89 dB, L1 35.75 dB, clean targets 35.82 dB.)

Random-valued impulse noise replaces some pixels with noise and retains the colors of others. Instead of the standard salt and pepper noise (randomly replacing pixels with black or white), we study a harder distribution where each pixel is replaced with a random color drawn from the uniform distribution [0, 1]³ with probability p, and retains its color with probability 1 − p. The pixels' color distributions are a Dirac at the original color plus a uniform distribution, with relative weights given by the replacement probability p. In this case, neither the mean nor the median yields the correct result; the desired output is the mode of the distribution (the Dirac spike). The distribution remains unimodal. For approximate mode seeking, we use an annealed version of the "L0 loss" function defined as (|fθ(x̂) − ŷ| + ε)^γ, where ε = 10⁻⁸ and γ is annealed linearly from 2 to 0 during training. This annealing did not cause any numerical issues in our tests. The relationship of the L0 loss and mode seeking is analyzed in the appendix.
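A sketch of the annealed "L0" loss (ours; PyTorch assumed; the schedule is driven by the training progress fraction t ∈ [0, 1]):

```python
import torch

def l0_loss(pred, target, t, eps=1e-8):
    """Annealed 'L0' loss: (|pred - target| + eps)**gamma, gamma: 2 -> 0."""
    gamma = 2.0 * (1.0 - t)                  # linear anneal from 2 to 0
    return ((pred - target).abs() + eps).pow(gamma).mean()
```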
We again train the network using noisy inputs and noisy targets, where the probability of corrupted pixels is randomized separately for each pair from [0, 0.95]. Figure 4 shows the inference results when 70% of the input pixels are randomized.

Figure 4. For random impulse noise, the approximately mode-seeking L0 loss performs better than the mean-seeking (L2) or median-seeking (L1) losses. (Example training pairs at p = 0.22 and p = 0.81; input at p = 0.70; PSNR: input 8.89 dB, L2 13.02 dB, L1 16.36 dB, L0 28.43 dB, clean targets 28.86 dB.)
Training with the L2 loss biases the results heavily towards gray, because the result tends towards a linear combination of the correct answer and the mean of the uniform random corruption. As predicted by theory, the L1 loss gives good results as long as fewer than 50% of the pixels are randomized, but beyond that threshold it quickly starts to bias dark and bright areas towards gray (Figure 5). L0, on the other hand, shows little bias even with extreme corruptions (e.g. 90% of pixels), because of all the possible pixel values, the correct answer (retained with e.g. 10% probability) is still the most common.

Figure 5. PSNR of noisy-target training relative to clean targets (PSNR delta, dB) for the L0 and L1 losses, with a varying percentage (10%–90%) of target pixels corrupted by RGB impulse noise. In this test a separate network was trained for each corruption level, and the graph was averaged over the KODAK dataset.

3.3. Monte Carlo Rendering

Physically accurate renderings of virtual environments are most often generated through a process known as Monte Carlo path tracing. This amounts to drawing random sequences of scattering events ("light paths") in the scene that connect light sources and virtual sensors, and integrating the radiance carried by them over all possible paths (Veach & Guibas, 1995). The Monte Carlo integrator is constructed such that the intensity of each pixel is the expectation of the random path sampling process, i.e., the sampling noise is zero-mean. However, despite decades of research into importance sampling techniques, little else can be said about the distribution. It varies from pixel to pixel, heavily depends on the scene configuration and rendering parameters, and can be arbitrarily multimodal. Some lighting effects, such as focused caustics, also result in extremely long-tailed distributions with rare, bright outliers.

All of these effects make the removal of Monte Carlo noise much more difficult than removing, e.g., Gaussian noise. On the other hand, the problem is somewhat alleviated by the possibility of generating auxiliary information that has been empirically found to correlate with the clean result during data generation. In our experiments, the denoiser input consists of not only the per-pixel luminance values, but also the average albedo (i.e., texture color) and normal vector of the surfaces visible at each pixel.

High dynamic range (HDR)  Even with adequate sampling, the floating-point pixel luminances may differ from each other by several orders of magnitude. In order to construct an image suitable for the generally 8-bit display devices, this high dynamic range needs to be compressed to a fixed range using a tone mapping operator (Cerdá-Company et al., 2016). We use a variant of Reinhard's global operator (Reinhard et al., 2002): T(v) = (v/(1 + v))^{1/2.2}, where v is a scalar luminance value, possibly pre-scaled with an image-wide exposure constant. This operator maps any v ≥ 0 into the range 0 ≤ T(v) < 1.

The combination of a virtually unbounded range of luminances and the nonlinearity of the operator T poses a problem. If we attempt to train a denoiser that outputs luminance values v, a standard MSE loss L2 = (fθ(x̂) − ŷ)² will be dominated by the long-tail effects (outliers) in the targets, and training does not converge. On the other hand, if the denoiser were to output tonemapped values T(v), the nonlinearity of T would make the expected value of noisy target images E{T(v)} different from the clean training target T(E{v}), leading to incorrect predictions.

A metric often used for measuring the quality of HDR images is the relative MSE (Rousselle et al., 2011), where the squared difference is divided by the square of the approximate luminance of the pixel, i.e., (fθ(x̂) − ŷ)²/(ŷ + ε)². However, this metric suffers from the same nonlinearity problem as comparing tonemapped outputs. Therefore, we propose to use the network output, which tends towards the correct value in the limit, in the denominator: L_HDR = (fθ(x̂) − ŷ)²/(fθ(x̂) + 0.01)². It can be shown that L_HDR converges to the correct expected value as long as we consider the gradient of the denominator to be zero. Finally, we have observed that it is beneficial to tone map the input image T(x̂) instead of using HDR inputs. The network continues to output non-tonemapped (linear-scale) luminance values, retaining the correctness of the expected value. Figure 6 evaluates the different loss functions.
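A sketch of the tone mapping operator and the proposed L_HDR loss (ours; PyTorch assumed; the stop-gradient on the denominator reflects the requirement that its gradient be treated as zero):

```python
import torch

def tonemap(v):
    """Variant of Reinhard's global operator: maps v >= 0 into [0, 1)."""
    return (v / (1.0 + v)) ** (1.0 / 2.2)

def l_hdr(pred, target):
    """L_HDR = (f - y)^2 / (f + 0.01)^2, denominator gradient suppressed."""
    denom = (pred.detach() + 0.01) ** 2   # detach: treat its gradient as zero
    return ((pred - target) ** 2 / denom).mean()
```

In the best-performing configuration of Figure 6, the network would then see `tonemap(x_hat)` as input while `l_hdr` compares its linear-scale output against the HDR target ŷ.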
Denoising Monte Carlo rendered images  We trained a denoiser for Monte Carlo path traced images rendered using 64 samples per pixel (spp). Our training set consisted of 860 architectural images, and the validation was done using 34 images from a different set of scenes. Three versions of the training images were rendered: two with 64 spp using different random seeds (noisy input, noisy target), and one with 131k spp (clean target). The validation images were rendered in both 64 spp (input) and 131k spp (reference) versions. All images were 960×540 pixels in size, and as mentioned earlier, we also saved the albedo and normal buffers for all of the input images. Even with such a small dataset, rendering the 131k spp clean images was a strenuous effort — for example, Figure 7d took 40 minutes to render on a high-end graphics server with 8 × NVIDIA Tesla P100 GPUs and a 40-core Intel Xeon CPU.

The average PSNR of the 64 spp validation inputs with respect to the corresponding reference images was 22.31 dB (see Figure 7a for an example). The network trained for 2000 epochs using clean target images reached an average PSNR of 31.83 dB on the validation set, whereas the similarly trained network using noisy target images gave 0.5 dB less. Examples are shown in Figure 7b,c – the training took 12 hours with a single NVIDIA Tesla P100 GPU.

At 4000 epochs, the noisy targets matched 31.83 dB, i.e., noisy targets took approximately twice as long to converge. However, the gap between the two methods had not narrowed appreciably, leading us to believe that some quality difference will remain even in the limit. This is not surprising, since the training dataset contained only a limited number of training pairs (and thus noise realizations) due to the cost of generating the clean target images, and we wanted to test both methods using matching data. That said, given that noisy targets are 2000 times faster to produce, one could trivially produce a larger quantity of them and still realize vast gains. The finite capture budget study (Section 3.1) supports this hypothesis.
Figure 6. Comparison of various loss functions for training a Monte Carlo denoiser with noisy target images rendered at 8 samples per pixel (spp). In this high-dynamic-range setting, our custom relative loss L_HDR is clearly superior to L2. Applying a non-linear tone map to the inputs is beneficial, while applying it to the target images skews the distribution of noise and leads to wrong, visibly too dark results. (PSNR: input 8 spp 11.32 dB; L2 with x̂, ŷ 25.46 dB; L2 with T(x̂), ŷ 25.39 dB; L2 with T(x̂), T(ŷ) 15.50 dB; L_HDR with x̂, ŷ 29.05 dB; L_HDR with T(x̂), ŷ 30.09 dB; reference rendered at 32k spp.)

Figure 7. Denoising a Monte Carlo rendered image. (a) Image rendered with 64 samples per pixel, 23.93 dB. (b) Denoised 64 spp input, trained using 64 spp targets, 32.42 dB. (c) Same as previous, but trained on clean targets, 32.95 dB. (d) Reference image rendered with 131 072 samples per pixel. PSNR values refer to the images shown here; see text for averages over the entire validation set.

Online training  Since it can be tedious to collect a sufficiently large corpus of Monte Carlo images for training a generally applicable denoiser, a possibility is to train a model specific to a single 3D scene, e.g., a game level or a movie shot (Chaitanya et al., 2017). In this context, it can even be desirable to train on-the-fly while walking through the scene. In order to maintain interactive frame rates, we can afford only a few samples per pixel, and thus both input and target images will be inherently noisy.

Figure 8 shows the convergence plots for an experiment where we trained a denoiser from scratch for the duration of 1000 frames in a scene flythrough. On an NVIDIA Titan V GPU, path tracing a single 512×512 pixel image with 8 spp took 190 ms, and we rendered two images to act as input and target. A single network training iteration with a random 256×256 pixel crop took 11.25 ms, and we performed eight of them per frame. Finally, we denoised both rendered images, each taking 15 ms, and averaged the result to produce the final image shown to the user. Rendering, training and inference together took 500 ms/frame.

Figure 8. Online training PSNR during a 1000-frame flythrough of the scene in Figure 6 (noisy targets, clean targets, and input). Noisy target images are almost as good for learning as clean targets, but are over 2000× faster to render (190 milliseconds vs. 7 minutes per frame in this scene). Both denoisers offer a substantial improvement over the noisy input.

Figure 8 shows that training with clean targets does not perform appreciably better than noisy targets. As rendering a single clean image takes approx. 7 minutes in this scene (vs. 190 ms for a noisy target), the quality/time tradeoff clearly favors noisy targets.

3.4. Magnetic Resonance Imaging (MRI)

Magnetic Resonance Imaging (MRI) produces volumetric images of biological tissues essentially by sampling the Fourier transform (the "k-space") of the signal.
Modern MRI techniques have long relied on compressed sensing (CS) to cheat the Nyquist–Shannon limit: they undersample k-space, and perform non-linear reconstruction that removes aliasing by exploiting the sparsity of the image in a suitable transform domain (Lustig et al., 2008).

We observe that if we turn the k-space sampling into a random process with a known probability density p(k) over the frequencies k, our main idea applies. In particular, we model the k-space sampling operation as a Bernoulli process where each individual frequency has a probability p(k) = e^{−λ|k|} of being selected for acquisition.⁴ The frequencies that are retained are weighted by the inverse of the selection probability, and non-chosen frequencies are set to zero. Clearly, the expectation of this "Russian roulette" process is the correct spectrum. The parameter λ controls the overall fraction of k-space retained; in the following experiments, we choose it so that 10% of the samples are retained relative to a full Nyquist–Shannon sampling. The undersampled spectra are transformed to the primal image domain by the standard inverse Fourier transform. An example of an undersampled input/target picture, the corresponding fully sampled reference, and their spectra, are shown in Figure 9(a, d).

⁴ Our simplified example deviates from practical MRI in the sense that we do not sample the spectra along 1D trajectories. However, we believe that designing pulse sequences that lead to similar pseudo-random sampling characteristics is straightforward.

Now we simply set up a regression problem of the form (6) and train a convolutional neural network using pairs of two independent undersampled images x̂ and ŷ of the same volume. As the spectra of the input and target are correct on expectation, and the Fourier transform is linear, we use the L2 loss. Additionally, we improve the result slightly by enforcing the exact preservation of frequencies that are present in the input image x̂ by Fourier transforming the result fθ(x̂), replacing the frequencies with those from the input, and transforming back to the primal domain before computing the loss; the final loss reads

$$\bigl(\mathcal{F}^{-1}\bigl(R_{\hat{x}}\bigl(\mathcal{F}(f_\theta(\hat{x}))\bigr)\bigr) - \hat{y}\bigr)^2,$$

where R_x̂ denotes the replacement of non-zero frequencies from the input. This process is trained end-to-end.
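A sketch of the undersampling process and the spectral-replacement loss (ours; NumPy assumed; `lam` is a hypothetical value, whereas the paper tunes λ for a 10% retention rate; in an actual training loop the loss would use a differentiable FFT, e.g. torch.fft, rather than NumPy):

```python
import numpy as np

def undersample(img, lam=0.03, rng=None):
    """Bernoulli k-space sampling with p(k) = exp(-lam * |k|), 1/p weighting."""
    rng = rng or np.random.default_rng()
    spec = np.fft.fft2(img)
    ky, kx = np.meshgrid(np.fft.fftfreq(img.shape[0]) * img.shape[0],
                         np.fft.fftfreq(img.shape[1]) * img.shape[1],
                         indexing="ij")
    p = np.exp(-lam * np.hypot(kx, ky))   # selection probability per frequency
    keep = rng.random(p.shape) < p        # Russian roulette
    spec = np.where(keep, spec / p, 0.0)  # 1/p weighting keeps the expectation
    return np.fft.ifft2(spec).real, spec, keep

def replaced_l2(pred, x_spec, x_keep, y_hat):
    """Put back the input's sampled frequencies, then compare to the target."""
    spec = np.fft.fft2(pred)
    spec = np.where(x_keep, x_spec, spec)  # R: preserve input's frequencies
    return np.mean((np.fft.ifft2(spec).real - y_hat) ** 2)
```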
We perform experiments on 2D slices extracted from the IXI brain scan MRI dataset.⁵ To simulate spectral sampling, we draw random samples from the FFT of the (already reconstructed) images in the dataset. Hence, in deviation from actual MRI samples, our data is real-valued and has the periodicity of the discrete FFT built in. The training set contained 5000 images in 256×256 resolution from 50 subjects, and for validation we chose 1000 random images from 10 different subjects. The baseline PSNR of the sparsely-sampled input images was 20.03 dB when reconstructed directly using the IFFT. The network trained for 300 epochs with noisy targets reached an average PSNR of 31.74 dB on the validation data, and the network trained with clean targets reached 31.77 dB. Here the training with clean targets is similar to prior art (Wang et al., 2016; Lee et al., 2017). Training took 13 hours on an NVIDIA Tesla P100 GPU. Figure 9(b, c) shows an example of reconstruction results between convolutional networks trained with noisy and clean targets, respectively. In terms of PSNR, our results quite closely match those reported in recent work.

⁵ http://brain-development.org/ixi-dataset → T1 images.

Figure 9. MRI reconstruction example (top row: image; bottom row: spectrum). (a) Input image with only 10% of spectrum samples retained and scaled by 1/p, 18.93 dB. (b) Reconstruction by a network trained with noisy target images similar to the input image, 29.77 dB. (c) Same as previous, but training done with clean target images similar to the reference image, 29.81 dB. (d) Original, uncorrupted image. PSNR values refer to the images shown here; see text for averages over the entire validation set.

4. Discussion

We have shown that simple statistical arguments lead to new capabilities in learned signal recovery using deep neural networks; it is possible to recover signals under complex corruptions without observing clean signals, without an explicit statistical characterization of the noise or other corruption, at performance levels equal or close to using clean target data. That clean data is not necessary for denoising is not a new observation: indeed, consider, for instance, the classic BM3D algorithm (Dabov et al., 2007) that draws on self-similar patches within a single noisy image. We show that the previously-demonstrated high restoration performance of deep neural networks can likewise be achieved entirely without clean data, all based on the same general-purpose deep convolutional model. This points the way to significant benefits in many applications by removing the need for potentially strenuous collection of clean data.

AmbientGAN (Bora et al., 2018) trains generative adversarial networks (Goodfellow et al., 2014) using corrupted observations. In contrast to our approach, AmbientGAN needs an explicit forward model of the corruption. We find combining ideas along both paths intriguing.
Acknowledgments

We thank Bill Dally, David Luebke, and Aaron Lefohn for discussions and for supporting the research; NVIDIA Research staff for suggestions and discussion; Runa Lober and Gunter Sprenger for synthetic off-line training data; Jacopo Pantaleoni for the interactive renderer used in on-line training; Samuli Vuorinen for initial photography test data; Koos Zevenhoven for discussions on MRI; and Peyman Milanfar for helpful comments.

References

Bora, Ashish, Price, Eric, and Dimakis, Alexandros G. AmbientGAN: Generative models from lossy measurements. In ICLR, 2018.

Cerdá-Company, Xim, Párraga, C. Alejandro, and Otazu, Xavier. Which tone-mapping operator is the best? A comparative study of perceptual quality. CoRR, abs/1601.04450, 2016.

Chaitanya, Chakravarty R. Alla, Kaplanyan, Anton S., Schied, Christoph, Salvi, Marco, Lefohn, Aaron, Nowrouzezahrai, Derek, and Aila, Timo. Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Trans. Graph., 36(4):98:1–98:12, 2017.

Dabov, K., Foi, A., Katkovnik, V., and Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process., 16(8):2080–2095, 2007.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial networks. In NIPS, 2014.

Hasinoff, Sam, Sharlet, Dillon, Geiss, Ryan, Adams, Andrew, Barron, Jonathan T., Kainz, Florian, Chen, Jiawen, and Levoy, Marc. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Trans. Graph., 35(6):192:1–192:12, 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.

Huber, Peter J. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.

Iizuka, Satoshi, Simo-Serra, Edgar, and Ishikawa, Hiroshi. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):107:1–107:14, 2017.

Isola, Phillip, Zhu, Jun-Yan, Zhou, Tinghui, and Efros, Alexei A. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, 2017.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.

Ledig, Christian, Theis, Lucas, Huszár, Ferenc, Caballero, Jose, Aitken, Andrew P., Tejani, Alykhan, Totz, Johannes, Wang, Zehan, and Shi, Wenzhe. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. CVPR, pp. 105–114, 2017.

Lee, D., Yoo, J., and Ye, J. C. Deep residual learning for compressed sensing MRI. In Proc. IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 15–18, 2017.

Lustig, Michael, Donoho, David L., Santos, Juan M., and Pauly, John M. Compressed sensing MRI. IEEE Signal Processing Magazine, 25:72–82, 2008.

Maas, Andrew L., Hannun, Awni Y., and Ng, Andrew. Rectifier nonlinearities improve neural network acoustic models. In Proc. International Conference on Machine Learning (ICML), volume 30, 2013.

Mäkitalo, Markku and Foi, Alessandro. Optimal inversion of the Anscombe transformation in low-count Poisson image denoising. IEEE Trans. Image Process., 20(1):99–109, 2011.

Mao, Xiao-Jiao, Shen, Chunhua, and Yang, Yu-Bin. Image restoration using convolutional auto-encoders with symmetric skip connections. In Proc. NIPS, 2016.

Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. ICCV, volume 2, pp. 416–423, 2001.

Reinhard, Erik, Stark, Michael, Shirley, Peter, and Ferwerda, James. Photographic tone reproduction for digital images. ACM Trans. Graph., 21(3):267–276, 2002.

Ronneberger, Olaf, Fischer, Philipp, and Brox, Thomas. U-Net: Convolutional networks for biomedical image segmentation. In Proc. MICCAI, 9351:234–241, 2015.

Rousselle, Fabrice, Knaus, Claude, and Zwicker, Matthias. Adaptive sampling and reconstruction using greedy error minimization. ACM Trans. Graph., 30(6):159:1–159:12, 2011.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Ulyanov, Dmitry, Vedaldi, Andrea, and Lempitsky, Victor S. Deep image prior. CoRR, abs/1711.10925, 2017.

Veach, Eric and Guibas, Leonidas J. Optimally combining sampling techniques for Monte Carlo rendering. In Proc. ACM SIGGRAPH 95, pp. 419–428, 1995.

Wang, S., Su, Z., Ying, L., Peng, X., Zhu, S., Liang, F., Feng, D., and Liang, D. Accelerating magnetic resonance imaging via deep learning. In Proc. IEEE 13th International Symposium on Biomedical Imaging (ISBI), pp. 514–517, 2016.

Zeyde, R., Elad, M., and Protter, M. On single image scale-up using sparse-representations. In Proc. Curves and Surfaces: 7th International Conference, pp. 711–730, 2010.

Zhang, Richard, Isola, Phillip, and Efros, Alexei A. Colorful image colorization. In Proc. ECCV, pp. 649–666, 2016.
A. Appendix

A.1. Network architecture

Table 2 shows the structure of the U-network (Ronneberger et al., 2015) used in all of our tests, with the exception of the first test in Section 3.1 that used the "RED30" network (Mao et al., 2016). For all basic noise and text removal experiments with RGB images, the number of input and output channels was n = m = 3. For Monte Carlo denoising we had n = 9, m = 3, i.e., the input contained RGB pixel color, RGB albedo, and a 3D normal vector per pixel. The MRI reconstruction was done with monochrome images (n = m = 1). Input images were represented in the range [−0.5, 0.5].

A.2. Training parameters

The network weights were initialized following He et al. (2015). No batch normalization, dropout or other regularization techniques were used. Training was done using ADAM (Kingma & Ba, 2015) with parameter values β₁ = 0.9, β₂ = 0.99, ε = 10⁻⁸.

The learning rate was kept at a constant value during training except for a brief rampdown period at the end, during which it was smoothly brought to zero. A learning rate of 0.001 was used for all experiments except Monte Carlo denoising, where 0.0003 was found to provide better stability. A minibatch size of 4 was used in all experiments.

Table 2. Network architecture used in our experiments. Nout denotes the number of output feature maps for each layer. The number of network input channels n and output channels m depends on the experiment. All convolutions use padding mode "same", and except for the last layer are followed by a leaky ReLU activation function (Maas et al., 2013) with α = 0.1. Other layers have linear activation. Upsampling is nearest-neighbor.

NAME          Nout   FUNCTION
INPUT         n
ENC_CONV0     48     Convolution 3×3
ENC_CONV1     48     Convolution 3×3
POOL1         48     Maxpool 2×2
ENC_CONV2     48     Convolution 3×3
POOL2         48     Maxpool 2×2
ENC_CONV3     48     Convolution 3×3
POOL3         48     Maxpool 2×2
ENC_CONV4     48     Convolution 3×3
POOL4         48     Maxpool 2×2
ENC_CONV5     48     Convolution 3×3
POOL5         48     Maxpool 2×2
ENC_CONV6     48     Convolution 3×3
UPSAMPLE5     48     Upsample 2×2
CONCAT5       96     Concatenate output of POOL4
DEC_CONV5A    96     Convolution 3×3
DEC_CONV5B    96     Convolution 3×3
UPSAMPLE4     96     Upsample 2×2
CONCAT4       144    Concatenate output of POOL3
DEC_CONV4A    96     Convolution 3×3
DEC_CONV4B    96     Convolution 3×3
UPSAMPLE3     96     Upsample 2×2
CONCAT3       144    Concatenate output of POOL2
DEC_CONV3A    96     Convolution 3×3
DEC_CONV3B    96     Convolution 3×3
UPSAMPLE2     96     Upsample 2×2
CONCAT2       144    Concatenate output of POOL1
DEC_CONV2A    96     Convolution 3×3
DEC_CONV2B    96     Convolution 3×3
UPSAMPLE1     96     Upsample 2×2
CONCAT1       96+n   Concatenate INPUT
DEC_CONV1A    64     Convolution 3×3
DEC_CONV1B    32     Convolution 3×3
DEC_CONV1C    m      Convolution 3×3, linear act.
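For concreteness, a PyTorch sketch of the architecture in Table 2 (ours — the reference implementation is in TensorFlow; the channel counts can be checked against the Nout column):

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, act=True):
    """3x3 'same' convolution, optionally followed by leaky ReLU (alpha=0.1)."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, padding=1)]
    if act:
        layers.append(nn.LeakyReLU(0.1))
    return nn.Sequential(*layers)

class Noise2NoiseUNet(nn.Module):
    def __init__(self, n=3, m=3):
        super().__init__()
        self.enc0 = conv(n, 48)                                    # ENC_CONV0
        self.enc1 = conv(48, 48)                                   # ENC_CONV1
        self.enc = nn.ModuleList([conv(48, 48) for _ in range(5)]) # ENC_CONV2..6
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec5 = nn.Sequential(conv(96, 96), conv(96, 96))      # DEC_CONV5A/B
        self.dec4 = nn.Sequential(conv(144, 96), conv(96, 96))     # DEC_CONV4A/B
        self.dec3 = nn.Sequential(conv(144, 96), conv(96, 96))     # DEC_CONV3A/B
        self.dec2 = nn.Sequential(conv(144, 96), conv(96, 96))     # DEC_CONV2A/B
        self.dec1 = nn.Sequential(conv(96 + n, 64), conv(64, 32),
                                  conv(32, m, act=False))          # DEC_CONV1A/B/C

    def forward(self, x):
        h = self.enc1(self.enc0(x))
        pools = []
        for enc in self.enc:        # POOL_k followed by ENC_CONV_{k+1}
            h = self.pool(h)
            pools.append(h)         # store POOL1..5 outputs for the skips
            h = enc(h)
        for dec, skip in zip([self.dec5, self.dec4, self.dec3, self.dec2],
                             [pools[3], pools[2], pools[1], pools[0]]):
            h = self.up(h)
            h = dec(torch.cat([h, skip], dim=1))   # CONCAT with pool output
        h = self.up(h)
        return self.dec1(torch.cat([h, x], dim=1)) # CONCAT1 with INPUT

net = Noise2NoiseUNet(n=3, m=3)
out = net(torch.zeros(1, 3, 256, 256))             # -> shape (1, 3, 256, 256)
```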
A.3. Finite corrupted data in L2 minimization

Let us compute the expected error in the L2 norm minimization task when corrupted targets {ŷᵢ}ᴺᵢ₌₁ are used in place of the clean targets {yᵢ}ᴺᵢ₌₁, with N a finite number. Let ŷᵢ be arbitrary random variables such that E{ŷᵢ} = yᵢ. As usual, the point of least deviation is found at the respective mean. The expected squared difference between these means across realizations of the noise is then:

$$\mathbb{E}_{\hat{y}}\left[\frac{1}{N}\sum_i y_i - \frac{1}{N}\sum_i \hat{y}_i\right]^2 = \frac{1}{N^2}\left[\mathbb{E}_{\hat{y}}\Bigl(\sum_i y_i\Bigr)^2 - 2\,\mathbb{E}_{\hat{y}}\Bigl(\sum_i y_i\Bigr)\Bigl(\sum_i \hat{y}_i\Bigr) + \mathbb{E}_{\hat{y}}\Bigl(\sum_i \hat{y}_i\Bigr)^2\right] = \frac{1}{N^2}\operatorname{Var}\Bigl(\sum_i \hat{y}_i\Bigr) = \frac{1}{N}\left[\frac{1}{N}\sum_i\sum_j \operatorname{Cov}(\hat{y}_i, \hat{y}_j)\right] \tag{8}$$

In the intermediate steps, we have used E_ŷ(Σᵢ ŷᵢ) = Σᵢ yᵢ and basic properties of (co)variance. If the corruptions are mutually uncorrelated, the last row simplifies to

$$\frac{1}{N}\left[\frac{1}{N}\sum_i \operatorname{Var}(\hat{y}_i)\right]. \tag{9}$$

In either case, the variance of the estimate is the average (co)variance of the corruptions, divided by the number of samples N. Therefore, the error approaches zero as the number of samples grows. The estimate is unbiased in the sense that it is correct on expectation, even with a finite amount of data.

The above derivation assumes scalar target variables. When ŷᵢ are images, N is to be taken as the total number of scalars in the images, i.e., #images × #pixels/image × #color channels.
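A quick numerical check of this result (ours; NumPy assumed): the squared error of the corrupted-target mean against the clean-target mean should match the average corruption variance divided by N:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, var = 100, 50_000, 4.0
y = rng.uniform(0, 10, N)                     # clean targets

noisy_means = (y + rng.normal(0, np.sqrt(var), (trials, N))).mean(axis=1)
err = ((y.mean() - noisy_means) ** 2).mean()  # LHS of Equation (8), estimated
print(err, var / N)                           # both ~0.04
```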
A.4. Mode seeking and the "L0" norm

Interestingly, while the "L0 norm" could intuitively be expected to converge to an exact mode, i.e. a local maximum of the probability density function of the data, theoretical analysis reveals that it recovers a slightly different point. While an actual mode is a zero-crossing of the derivative of the PDF, the L0 norm minimization recovers a zero-crossing of its Hilbert transform instead. We have verified this behavior in a variety of numerical experiments, and, in practice, we find that the estimate is typically close to the true mode. This can be explained by the fact that the Hilbert transform approximates differentiation (with a sign flip): the latter is a multiplication by iω in the Fourier domain, whereas the Hilbert transform is a multiplication by −i sgn(ω).

For a continuous data density q(x), the norm minimization task for Lp amounts to finding a point x* that has a minimal expected p-norm distance (suitably normalized, and omitting the pth root) from points y ∼ q(y):

$$x^* = \operatorname*{argmin}_x \; \mathbb{E}_{y\sim q}\left\{\frac{1}{p}|x - y|^p\right\} = \operatorname*{argmin}_x \int \frac{1}{p}|x - y|^p\, q(y)\, dy \tag{10}$$

Following the typical procedure, the minimizer is found at a root of the derivative of the expression under argmin:

$$0 = \frac{\partial}{\partial x}\int \frac{1}{p}|x - y|^p\, q(y)\, dy = \int \operatorname{sgn}(x - y)\,|x - y|^{p-1}\, q(y)\, dy \tag{11}$$

This equality holds also when we take lim_{p→0}. The usual results for the L2 and L1 norms can readily be derived from this form. For the L0 case, we take p = 0 and obtain

$$0 = \int \operatorname{sgn}(x - y)\,|x - y|^{-1}\, q(y)\, dy = \int \frac{1}{x - y}\, q(y)\, dy. \tag{12}$$

The right hand side is the formula for the Hilbert transform of q(x), up to a constant multiplier.