Noise2Noise: Learning Image Restoration Without Clean Data
Jaakko Lehtinen^{1,2}, Jacob Munkberg^1, Jon Hasselgren^1, Samuli Laine^1, Tero Karras^1, Miika Aittala^3, Timo Aila^1
known as M-estimators (Huber, 1964). From a statistical viewpoint, summary estimation using these common loss functions can be seen as ML estimation by interpreting the loss function as the negative log likelihood.

Training neural network regressors is a generalization of this point estimation procedure. Observe the form of the typical training task for a set of input-target pairs $(x_i, y_i)$, where the network function $f_\theta(x)$ is parameterized by $\theta$:

$$\operatorname*{argmin}_{\theta} \; \mathbb{E}_{(x,y)} \{ L(f_\theta(x), y) \}. \tag{4}$$

Indeed, if we remove the dependency on input data, and use a trivial $f_\theta$ that merely outputs a learned scalar, the task reduces to (2). Conversely, the full training task decomposes to the same minimization problem at every training sample; simple manipulations show that (4) is equivalent to

$$\operatorname*{argmin}_{\theta} \; \mathbb{E}_{x} \{ \mathbb{E}_{y|x} \{ L(f_\theta(x), y) \} \}. \tag{5}$$

The network can, in theory, minimize this loss by solving the point estimation problem separately for each input sample. Hence, the properties of the underlying loss are inherited by neural network training.

The usual process of training regressors by Equation 1 over a finite number of input-target pairs $(x_i, y_i)$ hides a subtle point: instead of the 1:1 mapping between inputs and targets (falsely) implied by that process, in reality the mapping is multiple-valued. For example, in a superresolution task (Ledig et al., 2017) over all natural images, a low-resolution image $x$ can be explained by many different high-resolution images $y$, as knowledge about the exact positions and orientations of the edges and texture is lost in decimation. In other words, $p(y|x)$ is the highly complex distribution of natural images consistent with the low-resolution $x$. When a neural network regressor is trained on pairs of low- and high-resolution images with the L2 loss, the network learns to output the average of all plausible explanations (e.g., edges shifted by different amounts), which results in spatial blurriness for the network's predictions. A significant amount of work has been done to combat this well-known tendency, for example by using learned discriminator functions as losses (Ledig et al., 2017; Isola et al., 2017).

Our observation is that for certain problems this tendency has an unexpected benefit. A trivial, and, at first sight, useless, property of L2 minimization is that on expectation, the estimate remains unchanged if we replace the targets with random numbers whose expectations match the targets. This is easy to see: Equation (3) holds, no matter what particular distribution the $y$s are drawn from. Consequently, the optimal network parameters $\theta$ of Equation (5) also remain unchanged, if input-conditioned target distributions $p(y|x)$ are replaced with arbitrary distributions that have the same conditional expected values. This implies that we can, in principle, corrupt the training targets of a neural network with zero-mean noise without changing what the network learns. Combining this with the corrupted inputs from Equation 1, we are left with the empirical risk minimization task

$$\operatorname*{argmin}_{\theta} \; \sum_i L(f_\theta(\hat{x}_i), \hat{y}_i), \tag{6}$$

where both the inputs and the targets are now drawn from a corrupted distribution (not necessarily the same), conditioned on the underlying, unobserved clean target $y_i$ such that $\mathbb{E}\{\hat{y}_i \mid \hat{x}_i\} = y_i$. Given infinite data, the solution is the same as that of (1). For finite data, the variance of the estimate is the average variance of the corruptions in the targets, divided by the number of training samples (see appendix). Interestingly, none of the above relies on a likelihood model of the corruption, nor a density model (prior) for the underlying clean image manifold. That is, we do not need an explicit p(noisy|clean) or p(clean), as long as we have data distributed according to them.

In many image restoration tasks, the expectation of the corrupted input data is the clean target that we seek to restore. Low-light photography is an example: a long, noise-free exposure is the average of short, independent, noisy exposures. With this in mind, the above suggests the ability to learn to remove photon noise given only pairs of noisy images, with no need for potentially expensive or difficult long exposures.

Similar observations can be made about other loss functions. For instance, the L1 loss recovers the median of the targets, meaning that neural networks can be trained to repair images with significant (up to 50%) outlier content, again only requiring access to pairs of such corrupted images.

In the next sections, we present a wide variety of examples demonstrating that these theoretical capabilities are also efficiently realizable in practice.

3. Practical Experiments

We now experimentally study the practical properties of noisy-target training. We start with simple noise distributions (Gaussian, Poisson, Bernoulli) in Sections 3.1 and 3.2, and continue to the much harder, analytically intractable Monte Carlo image synthesis noise (Section 3.3). In Section 3.4, we show that image reconstruction from sub-Nyquist spectral samplings in magnetic resonance imaging (MRI) can be learned from corrupted observations only.

3.1. Additive Gaussian Noise

We will first study the effect of corrupted targets using synthetic additive Gaussian noise. As the noise has zero mean, we use the L2 loss for training to recover the mean.

Our baseline is a recent state-of-the-art method "RED30" (Mao et al., 2016), a 30-layer hierarchical residual net-
[Figure 1 plots. (a) White Gaussian, σ = 25: curves for clean targets and noisy targets. (b) Brown Gaussian, σ = 25: curves for 2 pix, 5 pix, 10 pix, 20 pix and 40 pix. (c) Capture budget study (see text): curves for Case 1 (trad.), Case 2 and Case 3 (N2N).]
Figure 1. Denoising performance (dB in KODAK dataset) as a function of training epoch for additive Gaussian noise. (a) For i.i.d. (white)
Gaussian noise, clean and noisy targets lead to very similar convergence speed and eventual quality. (b) For brown Gaussian noise, we
observe that increased inter-pixel noise correlation (wider spatial blur; one graph per bandwidth) slows convergence down, but eventual
performance remains close. (c) Effect of different allocations of a fixed capture budget to noisy vs. clean examples (see text).
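The claim behind Figure 1a, that zero-mean noise in the targets leaves the L2 minimizer essentially unchanged, can be illustrated with a toy regression. The following is only a sketch under stated assumptions: it replaces the denoising network with a closed-form least-squares line fit, corrupts only the targets, and uses a hypothetical clean signal `2x + 0.5`; the noise level σ = 25 is taken from the figure and rescaled to a [0, 1] range.

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares (L2) fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)
SIGMA = 25.0 / 255.0  # assumption: sigma = 25 on an 8-bit scale, rescaled to [0, 1]

xs = [rng.uniform(0.0, 1.0) for _ in range(20000)]
clean_targets = [2.0 * x + 0.5 for x in xs]  # hypothetical clean signal
noisy_targets = [y + rng.gauss(0.0, SIGMA) for y in clean_targets]  # zero-mean corruption

clean_fit = fit_line(xs, clean_targets)
noisy_fit = fit_line(xs, noisy_targets)
print("fit on clean targets:", clean_fit)
print("fit on noisy targets:", noisy_fit)
```

With enough samples the two fits agree closely, which mirrors (in a much simpler setting) the similar curves for clean and noisy targets.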
[Figure panel labels: (a) Gaussian (σ = 25), with BM3D; (b) Poisson (λ = 30), with ANSCOMBE; (c) Bernoulli (p = 0.5).]

(N) and noise realizations per clean latent (M) such that N × M = 2000. In the traditional scenario, we have only 100 training pairs (N = 100, M = 20): a single noisy realization and the corresponding clean image (= average of 19 noisy images; Figure 1c, Case 1). We first observe that using the same captured data as 100 × 20 × 19 = 38000 training pairs with corrupted targets (i.e., for each latent, forming all the 19 × 20 possible noisy/clean pairs) yields notably better results (several tenths of a dB) than the traditional, fixed noisy+clean pairs, even if we still only have N = 100 latents (Figure 1c, Case 2). Second, we observe that setting N = 1000 and M = 2, i.e., increasing the number of clean latents but only obtaining two noisy realizations of each (resulting in 2000 training pairs), yields even better results (again, by several tenths of a dB; Figure 1c, Case 3).

We conclude that for additive Gaussian noise, corrupted targets offer benefits (not just the same performance but better) over clean targets on two levels: both 1) seeing
[Figure 3 panels: p ≈ 0.04, p ≈ 0.42]
Figure 3. Removing random text overlays corresponds to seeking the median pixel color, accomplished using the L1 loss. The mean (L2
loss) is not the correct answer: note shift towards mean text color. Only corrupted images shown during training.
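Why the median, and not the mean, is the right statistic here can be seen on a single pixel. The sketch below is an illustration only: the clean intensity, the overlay probability and the uniformly random "text" intensity are all hypothetical values, and the L1 and L2 minimizers are computed directly as the sample median and mean instead of by training a network.

```python
import random
import statistics

rng = random.Random(1)
CLEAN = 0.2      # hypothetical clean pixel intensity
P_OVERLAY = 0.3  # hypothetical probability that text covers the pixel (below 0.5)

# Each observation is either the clean value or a random "text" intensity.
observations = [
    rng.random() if rng.random() < P_OVERLAY else CLEAN
    for _ in range(10001)
]

l1_estimate = statistics.median(observations)  # minimizer of the summed L1 loss
l2_estimate = statistics.mean(observations)    # minimizer of the summed L2 loss
print("median:", l1_estimate, "mean:", l2_estimate)
```

As long as fewer than half of the observations are corrupted, the median lands on the clean value, whereas the mean is pulled towards the average overlay intensity.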
[Figure 4 panels: p = 0.22, p = 0.81]
Figure 4. For random impulse noise, the approx. mode-seeking L0 loss performs better than the mean (L2 ) or median (L1 ) seeking losses.
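The advantage of mode seeking under heavy impulse noise can be sketched on one pixel. This is a toy stand-in under stated assumptions: the clean value and corruption rate are hypothetical, and the mode is found by explicit counting, which is not the approximate L0 loss used for training in the paper.

```python
import random
import statistics
from collections import Counter

rng = random.Random(2)
CLEAN = 40       # hypothetical clean 8-bit pixel value
P_CORRUPT = 0.9  # hypothetical rate: 90% of observations are replaced by random values

observations = [
    rng.randrange(256) if rng.random() < P_CORRUPT else CLEAN
    for _ in range(5000)
]

mode_estimate = Counter(observations).most_common(1)[0][0]  # most frequent value
median_estimate = statistics.median(observations)           # L1 minimizer
mean_estimate = statistics.mean(observations)               # L2 minimizer
print("mode:", mode_estimate, "median:", median_estimate, "mean:", mean_estimate)
```

Even at this corruption rate the clean value remains the single most common one, while both the median and the mean drift towards the middle of the value range.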
[Plot axis label: PSNR delta from clean targets]
Training with L2 loss biases the results heavily towards gray, because the result tends towards a linear combination of the correct answer and the mean of the uniform random corruption. As predicted by theory, the L1 loss gives good results as long as fewer than 50% of the pixels are randomized, but beyond that threshold it quickly starts to bias dark and bright areas towards gray (Figure 5). L0, on the other hand, shows little bias even with extreme corruptions (e.g., 90% of the pixels), because of all the possible pixel values, the correct answer (e.g., 10%) is still the most common.

3.3. Monte Carlo Rendering

Physically accurate renderings of virtual environments are most often generated through a process known as Monte Carlo path tracing. This amounts to drawing random sequences of scattering events ("light paths") in the scene that connect light sources and virtual sensors, and integrating the radiance carried by them over all possible paths (Veach & Guibas, 1995). The Monte Carlo integrator is constructed such that the intensity of each pixel is the expectation of the random path sampling process, i.e., the sampling noise is zero-mean. However, despite decades of research into importance sampling techniques, little else can be said about the distribution. It varies from pixel to pixel, heavily depends on the scene configuration and rendering parameters, and can be arbitrarily multimodal. Some lighting effects, such as focused caustics, also result in extremely long-tailed distributions with rare, bright outliers.

All of these effects make the removal of Monte Carlo noise much more difficult than removing, e.g., Gaussian noise. On the other hand, the problem is somewhat alleviated by the possibility of generating auxiliary information that has been empirically found to correlate with the clean result during data generation. In our experiments, the denoiser input consists of not only the per-pixel luminance values, but also the average albedo (i.e., texture color) and normal vector of the surfaces visible at each pixel.

High dynamic range (HDR). Even with adequate sampling, the floating-point pixel luminances may differ from each other by several orders of magnitude. In order to construct an image suitable for the generally 8-bit display devices, this high dynamic range needs to be compressed to a fixed range using a tone mapping operator (Cerdá-Company et al., 2016). We use a variant of Reinhard's global operator (Reinhard et al., 2002): $T(v) = (v/(1+v))^{1/2.2}$, where $v$ is a scalar luminance value, possibly pre-scaled with an image-wide exposure constant. This operator maps any $v \ge 0$ into range $0 \le T(v) < 1$.

The combination of virtually unbounded range of luminances and the nonlinearity of operator $T$ poses a problem. If we attempt to train a denoiser that outputs luminance values $v$, a standard MSE loss $L_2 = (f_\theta(\hat{x}) - \hat{y})^2$ will be dominated by the long-tail effects (outliers) in the targets, and training does not converge. On the other hand, if the denoiser were to output tonemapped values $T(v)$, the nonlinearity of $T$ would make the expected value of noisy target images $\mathbb{E}\{T(v)\}$ different from the clean training target $T(\mathbb{E}\{v\})$, leading to incorrect predictions.

A metric often used for measuring the quality of HDR images is the relative MSE (Rousselle et al., 2011), where the squared difference is divided by the square of approximate luminance of the pixel, i.e., $(f_\theta(\hat{x}) - \hat{y})^2 / (\hat{y} + \epsilon)^2$. However, this metric suffers from the same nonlinearity problem as comparing tonemapped outputs. Therefore, we propose to use the network output, which tends towards the correct value in the limit, in the denominator: $L_{HDR} = (f_\theta(\hat{x}) - \hat{y})^2 / (f_\theta(\hat{x}) + 0.01)^2$. It can be shown that $L_{HDR}$ converges to the correct expected value as long as we consider the gradient of the denominator to be zero.

Finally, we have observed that it is beneficial to tone map the input image $T(\hat{x})$ instead of using HDR inputs. The network continues to output non-tonemapped (linear-scale) luminance values, retaining the correctness of the expected value. Figure 6 evaluates the different loss functions.

Denoising Monte Carlo rendered images. We trained a denoiser for Monte Carlo path traced images rendered using 64 samples per pixel (spp). Our training set consisted of 860 architectural images, and the validation was done using 34 images from a different set of scenes. Three versions of the training images were rendered: two with 64 spp using different random seeds (noisy input, noisy target), and one with 131k spp (clean target). The validation images were rendered in both 64 spp (input) and 131k spp (reference) versions. All images were 960×540 pixels in size, and as mentioned earlier, we also saved the albedo and normal buffers for all of the input images. Even with such a small dataset, rendering the 131k spp clean images was a strenuous effort; for example, Figure 7d took 40 minutes to render on a high-end graphics server with 8 × NVIDIA Tesla P100 GPUs and a 40-core Intel Xeon CPU.

The average PSNR of the 64 spp validation inputs with respect to the corresponding reference images was 22.31 dB (see Figure 7a for an example). The network trained for 2000 epochs using clean target images reached an average PSNR of 31.83 dB on the validation set, whereas the similarly trained network using noisy target images gave 0.5 dB less. Examples are shown in Figure 7b,c; the training took 12 hours with a single NVIDIA Tesla P100 GPU.

At 4000 epochs, the noisy targets matched 31.83 dB, i.e., noisy targets took approximately twice as long to converge. However, the gap between the two methods had not narrowed appreciably, leading us to believe that some quality difference will remain even in the limit. This is not sur-
[Figure 6 panels and PSNR]
Panel                      PSNR
Input, 8 spp               11.32 dB
L2 with x̂, ŷ               25.46 dB
L2 with T(x̂), ŷ            25.39 dB
L2 with T(x̂), T(ŷ)         15.50 dB
LHDR with x̂, ŷ             29.05 dB
LHDR with T(x̂), ŷ          30.09 dB
Reference, 32k spp         (no PSNR listed)
Figure 6. Comparison of various loss functions for training a Monte Carlo denoiser with noisy target images rendered at 8 samples per
pixel (spp). In this high-dynamic range setting, our custom relative loss LHDR is clearly superior to L2 . Applying a non-linear tone map to
the inputs is beneficial, while applying it to the target images skews the distribution of noise and leads to wrong, visibly too dark results.
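The two effects summarized in this caption can be checked numerically for the tone mapping operator T and the loss LHDR of Section 3.3. The sketch below is an illustration under stated assumptions: the two-point pixel distribution is hypothetical, and the "frozen denominator" gradient is written out by hand rather than obtained from an automatic differentiation framework.

```python
def tonemap(v):
    """Operator from the text: T(v) = (v / (1 + v)) ** (1 / 2.2)."""
    return (v / (1.0 + v)) ** (1.0 / 2.2)

def l_hdr(prediction, target, eps=0.01):
    """Relative loss with the prediction in the denominator (0.01 as in the text)."""
    return (prediction - target) ** 2 / (prediction + eps) ** 2

def l_hdr_grad_frozen(prediction, target, eps=0.01):
    """Gradient of l_hdr w.r.t. the prediction, treating the denominator as constant."""
    return 2.0 * (prediction - target) / (prediction + eps) ** 2

# Hypothetical long-tailed pixel: usually black, rarely a very bright outlier.
pixel_distribution = [(0.0, 0.99), (100.0, 0.01)]  # (luminance, probability)

mean_luminance = sum(v * p for v, p in pixel_distribution)
tonemap_of_mean = tonemap(mean_luminance)                              # T(E{v})
mean_of_tonemap = sum(tonemap(v) * p for v, p in pixel_distribution)   # E{T(v)}

# Expected frozen-denominator gradient, evaluated at the true mean luminance.
expected_grad_at_mean = sum(
    p * l_hdr_grad_frozen(mean_luminance, v) for v, p in pixel_distribution
)

print("T(E{v}) =", tonemap_of_mean, " E{T(v)} =", mean_of_tonemap)
print("expected L_HDR gradient at the mean:", expected_grad_at_mean)
```

Tone mapping the targets moves the expected target far below the tone-mapped clean value, consistent with the too-dark results noted in the caption, while the expected LHDR gradient vanishes at the true mean when the denominator is held fixed.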
(a) Input (64 spp), 23.93 dB (b) Noisy targets, 32.42 dB (c) Clean targets, 32.95 dB (d) Reference (131k spp)
Figure 7. Denoising a Monte Carlo rendered image. (a) Image rendered with 64 samples per pixel. (b) Denoised 64 spp input, trained
using 64 spp targets. (c) Same as previous, but trained on clean targets. (d) Reference image rendered with 131 072 samples per pixel.
PSNR values refer to the images shown here, see text for averages over the entire validation set.
prising, since the training dataset contained only a limited number of training pairs (and thus noise realizations) due to the cost of generating the clean target images, and we wanted to test both methods using matching data. That said, given that noisy targets are 2000 times faster to produce, one could trivially produce a larger quantity of them and still realize vast gains. The finite capture budget study (Section 3.1) supports this hypothesis.

Online training. Since it can be tedious to collect a sufficiently large corpus of Monte Carlo images for training a generally applicable denoiser, a possibility is to train a model specific to a single 3D scene, e.g., a game level or a movie shot (Chaitanya et al., 2017). In this context, it can even be desirable to train on-the-fly while walking through the scene. In order to maintain interactive frame rates, we can afford only a few samples per pixel, and thus both input and target images will be inherently noisy.

Figure 8 shows the convergence plots for an experiment where we trained a denoiser from scratch for the duration of 1000 frames in a scene flythrough. On an NVIDIA Titan V GPU, path tracing a single 512×512 pixel image with 8 spp took 190 ms, and we rendered two images to act as input and target. A single network training iteration with a random 256×256 pixel crop took 11.25 ms and we performed eight of them per frame. Finally, we denoised both rendered images, each taking 15 ms, and averaged the result to produce the final image shown to the user. Rendering, training and inference took 500 ms/frame.

Figure 8 shows that training with clean targets does not perform appreciably better than noisy targets. As rendering a single clean image takes approx. 7 minutes in this scene (resp. 190 ms for a noisy target), the quality/time tradeoff clearly favors noisy targets.

[Figure 8 plot: PSNR (0 to 40) versus frame number (0 to 1000); curves: Noisy targets, Clean targets, Input]
Figure 8. Online training PSNR during a 1000-frame flythrough of the scene in Figure 6. Noisy target images are almost as good for learning as clean targets, but are over 2000× faster to render (190 milliseconds vs 7 minutes per frame in this scene). Both denoisers offer a substantial improvement over the noisy input.

3.4. Magnetic Resonance Imaging (MRI)

Magnetic Resonance Imaging (MRI) produces volumetric images of biological tissues essentially by sampling the
MRI techniques have long relied on compressed sensing (CS) to cheat the Nyquist-Shannon limit: they undersample k-space, and perform non-linear reconstruction that removes aliasing by exploiting the sparsity of the image in a suitable transform domain (Lustig et al., 2008).

We observe that if we turn the k-space sampling into a random process with a known probability density $p(k)$ over the frequencies $k$, our main idea applies. In particular, we model the k-space sampling operation as a Bernoulli process where each individual frequency has a probability $p(k) = e^{-\lambda |k|}$ of being selected for acquisition.[4] The frequencies that are retained are weighted by the inverse of the selection probability, and non-chosen frequencies are set to zero. Clearly, the expectation of this "Russian roulette" process is the correct spectrum. The parameter $\lambda$ controls the overall fraction of k-space retained; in the following experiments, we choose it so that 10% of the samples are retained relative to a full Nyquist-Shannon sampling. The undersampled spectra are transformed to the primal image domain by the standard inverse Fourier transform. An example of an undersampled input/target picture, the corresponding fully sampled reference, and their spectra, are shown in Figure 9(a, d).

[Figure 9 panels (rows: Image, Spectrum): (a) Input, 18.93 dB; (b) Noisy trg., 29.77 dB; (c) Clean trg., 29.81 dB; (d) Reference]
Figure 9. MRI reconstruction example. (a) Input image with only 10% of spectrum samples retained and scaled by 1/p. (b) Reconstruction by a network trained with noisy target images similar to the input image. (c) Same as previous, but training done with clean target images similar to the reference image. (d) Original, uncorrupted image. PSNR values refer to the images shown here, see text for averages over the entire validation set.

Now we simply set up a regression problem of the form (6) and train a convolutional neural network using pairs of two independent undersampled images $\hat{x}$ and $\hat{y}$ of the same volume. As the spectra of the input and target are correct on expectation, and the Fourier transform is linear, we use the L2 loss. Additionally, we improve the result slightly by enforcing the exact preservation of frequencies that are present in the input image $\hat{x}$ by Fourier transforming the result $f_\theta(\hat{x})$, replacing the frequencies with those from the input, and transforming back to the primal domain before computing the loss: the final loss reads $(\mathcal{F}^{-1}(R_{\hat{x}}(\mathcal{F}(f_\theta(\hat{x})))) - \hat{y})^2$, where $R$ denotes the replacement of non-zero frequencies from the input. This process is trained end-to-end.

We perform experiments on 2D slices extracted from the IXI brain scan MRI dataset.[5] To simulate spectral sampling, we draw random samples from the FFT of the (already reconstructed) images in the dataset. Hence, in deviation from actual MRI samples, our data is real-valued and has the periodicity of the discrete FFT built-in. The training set contained 5000 images in 256×256 resolution from 50 subjects, and for validation we chose 1000 random images from 10 different subjects. The baseline PSNR of the sparsely-sampled input images was 20.03 dB when reconstructed directly using IFFT. The network trained for 300 epochs with noisy targets reached an average PSNR of 31.74 dB on the validation data, and the network trained with clean targets reached 31.77 dB. Here the training with clean targets is similar to prior art (Wang et al., 2016; Lee et al., 2017). Training took 13 hours on an NVIDIA Tesla P100 GPU. Figure 9(b, c) shows an example of reconstruction results between convolutional networks trained with noisy and clean targets, respectively. In terms of PSNR, our results quite closely match those reported in recent work.

[4] Our simplified example deviates from practical MRI in the sense that we do not sample the spectra along 1D trajectories. However, we believe that designing pulse sequences that lead to similar pseudo-random sampling characteristics is straightforward.
[5] http://brain-development.org/ixi-dataset → T1 images.

4. Discussion

We have shown that simple statistical arguments lead to new capabilities in learned signal recovery using deep neural networks; it is possible to recover signals under complex corruptions without observing clean signals, without an explicit statistical characterization of the noise or other corruption, at performance levels equal or close to using clean target data. That clean data is not necessary for denoising is not a new observation: indeed, consider, for instance, the classic BM3D algorithm (Dabov et al., 2007) that draws on self-similar patches within a single noisy image. We show that the previously-demonstrated high restoration performance of deep neural networks can likewise be achieved entirely without clean data, all based on the same general-purpose deep convolutional model. This points the way to significant benefits in many applications by removing the need for potentially strenuous collection of clean data.

AmbientGAN (Ashish Bora, 2018) trains generative adversarial networks (Goodfellow et al., 2014) using corrupted observations. In contrast to our approach, AmbientGAN needs an explicit forward model of the corruption. We find combining ideas along both paths intriguing.
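The Bernoulli k-space sampling of Section 3.4, where each frequency is kept with probability p(k) = e^(−λ|k|) and scaled by 1/p(k), can be checked for unbiasedness on a toy example. The sketch below makes several assumptions that are not from the paper: a real-valued 1D "spectrum" stored as a dictionary, a hypothetical decay rate `LAM`, and a helper name `undersample` chosen for illustration.

```python
import math
import random

def undersample(spectrum, lam, rng):
    """Keep coefficient k with probability p(k) = exp(-lam * |k|), scaled by 1 / p(k)."""
    sampled = {}
    for k, coeff in spectrum.items():
        p = math.exp(-lam * abs(k))  # selection probability p(k)
        sampled[k] = coeff / p if rng.random() < p else 0.0
    return sampled

rng = random.Random(3)
LAM = 0.2  # hypothetical decay rate (the paper picks it so that 10% of samples are kept)
spectrum = {k: 1.0 / (1.0 + abs(k)) for k in range(-8, 9)}  # toy real-valued 1D spectrum

TRIALS = 20000
average = {k: 0.0 for k in spectrum}
for _ in range(TRIALS):
    for k, coeff in undersample(spectrum, LAM, rng).items():
        average[k] += coeff / TRIALS

max_error = max(abs(average[k] - spectrum[k]) for k in spectrum)
print("largest deviation of the averaged samples from the true spectrum:", max_error)
```

Averaging many independent draws approaches the full spectrum, which is the property that lets two undersampled acquisitions act as an input/target pair under the L2 loss.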
Hasinoff, Sam, Sharlet, Dillon, Geiss, Ryan, Adams, Andrew, Barron, Jonathan T., Kainz, Florian, Chen, Jiawen, and Levoy, Marc. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Trans. Graph., 35(6):192:1–192:12, 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. CoRR, abs/1502.01852, 2015.

Huber, Peter J. Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101, 1964.

Iizuka, Satoshi, Simo-Serra, Edgar, and Ishikawa, Hiroshi. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):107:1–107:14, 2017.

Isola, Phillip, Zhu, Jun-Yan, Zhou, Tinghui, and Efros, Alexei A. Image-to-image translation with conditional adversarial networks. In Proc. CVPR 2017, 2017.

Reinhard, Erik, Stark, Michael, Shirley, Peter, and Ferwerda, James. Photographic tone reproduction for digital images. ACM Trans. Graph., 21(3):267–276, 2002.

Ronneberger, Olaf, Fischer, Philipp, and Brox, Thomas. U-net: Convolutional networks for biomedical image segmentation. MICCAI, 9351:234–241, 2015.

Rousselle, Fabrice, Knaus, Claude, and Zwicker, Matthias. Adaptive sampling and reconstruction using greedy error minimization. ACM Trans. Graph., 30(6):159:1–159:12, 2011.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Ulyanov, Dmitry, Vedaldi, Andrea, and Lempitsky, Victor S. Deep image prior. CoRR, abs/1711.10925, 2017.
A. Appendix

A.1. Network architecture

Table 2 shows the structure of the U-network (Ronneberger et al., 2015) used in all of our tests, with the exception of the first test in Section 3.1 that used the "RED30" network (Mao et al., 2016). For all basic noise and text removal experiments with RGB images, the number of input and output channels were n = m = 3. For Monte Carlo denoising we had n = 9, m = 3, i.e., input contained RGB pixel color, RGB albedo, and a 3D normal vector per pixel. The MRI reconstruction was done with monochrome images (n = m = 1). Input images were represented in range [−0.5, 0.5].

NAME            N_out   FUNCTION
INPUT           n
ENC CONV 0      48      Convolution 3 × 3
ENC CONV 1      48      Convolution 3 × 3
POOL 1          48      Maxpool 2 × 2
ENC CONV 2      48      Convolution 3 × 3
POOL 2          48      Maxpool 2 × 2
ENC CONV 3      48      Convolution 3 × 3
POOL 3          48      Maxpool 2 × 2
ENC CONV 4      48      Convolution 3 × 3
POOL 4          48      Maxpool 2 × 2
ENC CONV 5      48      Convolution 3 × 3
POOL 5          48      Maxpool 2 × 2
ENC CONV 6      48      Convolution 3 × 3
UPSAMPLE 5      48      Upsample 2 × 2
CONCAT 5        96      Concatenate output of POOL 4
DEC CONV 5 A    96      Convolution 3 × 3
DEC CONV 5 B    96      Convolution 3 × 3
UPSAMPLE 4      96      Upsample 2 × 2
CONCAT 4        144     Concatenate output of POOL 3
DEC CONV 4 A    96      Convolution 3 × 3
DEC CONV 4 B    96      Convolution 3 × 3
UPSAMPLE 3      96      Upsample 2 × 2
CONCAT 3        144     Concatenate output of POOL 2
DEC CONV 3 A    96      Convolution 3 × 3
DEC CONV 3 B    96      Convolution 3 × 3
UPSAMPLE 2      96      Upsample 2 × 2
CONCAT 2        144     Concatenate output of POOL 1
DEC CONV 2 A    96      Convolution 3 × 3
DEC CONV 2 B    96      Convolution 3 × 3
UPSAMPLE 1      96      Upsample 2 × 2
CONCAT 1        96+n    Concatenate INPUT
DEC CONV 1 A    64      Convolution 3 × 3
DEC CONV 1 B    32      Convolution 3 × 3
DEC CONV 1 C    m       Convolution 3 × 3, linear act.

Table 2. Network architecture used in our experiments. N_out denotes the number of output feature maps for each layer. Number of network input channels n and output channels m depend on the experiment. All convolutions use padding mode "same", and except for the last layer are followed by leaky ReLU activation function (Maas et al., 2013) with α = 0.1. Other layers have linear activation. Upsampling is nearest-neighbor.

A.2. Training parameters

The network weights were initialized following He et al. (2015). No batch normalization, dropout or other regularization techniques were used. Training was done using ADAM (Kingma & Ba, 2015) with parameter values $\beta_1 = 0.9$, $\beta_2 = 0.99$, $\epsilon = 10^{-8}$.

Learning rate was kept at a constant value during training except for a brief rampdown period at the end, where it was smoothly brought to zero. Learning rate of 0.001 was used for all experiments except Monte Carlo denoising, where 0.0003 was found to provide better stability. Minibatch size of 4 was used in all experiments.

A.3. Finite corrupted data in L2 minimization

Let us compute the expected error in L2 norm minimization task when corrupted targets $\{\hat{y}_i\}_{i=1}^{N}$ are used in place of the clean targets $\{y_i\}_{i=1}^{N}$, with $N$ a finite number. Let $\hat{y}_i$ be arbitrary random variables, such that $\mathbb{E}\{\hat{y}_i\} = y_i$. As usual, the point of least deviation is found at the respective mean. The expected squared difference between these means across realizations of the noise is then:

$$
\begin{aligned}
\mathbb{E}_{\hat{y}} \left[ \frac{1}{N} \sum_i y_i - \frac{1}{N} \sum_i \hat{y}_i \right]^2
&= \frac{1}{N^2} \left[ \mathbb{E}_{\hat{y}} \Big( \sum_i y_i \Big)^2 - 2\, \mathbb{E}_{\hat{y}} \Big[ \Big( \sum_i y_i \Big) \Big( \sum_i \hat{y}_i \Big) \Big] + \mathbb{E}_{\hat{y}} \Big( \sum_i \hat{y}_i \Big)^2 \right] \\
&= \frac{1}{N^2} \operatorname{Var}\Big( \sum_i \hat{y}_i \Big) \\
&= \frac{1}{N} \left[ \frac{1}{N} \sum_i \sum_j \operatorname{Cov}(\hat{y}_i, \hat{y}_j) \right]
\end{aligned}
\tag{8}
$$

In the intermediate steps, we have used $\mathbb{E}_{\hat{y}} ( \sum_i \hat{y}_i ) = \sum_i y_i$ and basic properties of (co)variance. If the corruptions are mutually uncorrelated, the last row simplifies to

$$
\frac{1}{N} \left[ \frac{1}{N} \sum_i \operatorname{Var}(\hat{y}_i) \right]
\tag{9}
$$

In either case, the variance of the estimate is the average (co)variance of the corruptions, divided by the number of samples $N$. Therefore, the error approaches zero as the number of samples grows. The estimate is unbiased in the
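The finite-data result of Appendix A.3 can be checked by simulation for the uncorrelated case, where the expected squared difference between the clean mean and the noisy mean should equal the average corruption variance divided by N. The following is a sketch only: the values of `N`, `SIGMA` and `TRIALS`, the Gaussian corruption, and the uniformly drawn clean targets are all assumptions made for illustration.

```python
import random

def squared_error_of_means(clean, sigma, rng):
    """Squared difference between the mean of clean targets and of one noisy realization."""
    noisy = [y + rng.gauss(0.0, sigma) for y in clean]  # zero-mean, uncorrelated corruption
    diff = sum(clean) / len(clean) - sum(noisy) / len(noisy)
    return diff * diff

rng = random.Random(4)
N = 50         # hypothetical number of training targets
SIGMA = 0.5    # hypothetical standard deviation of the corruption
TRIALS = 4000  # number of independent noise realizations to average over

clean = [rng.uniform(-0.5, 0.5) for _ in range(N)]
empirical = sum(squared_error_of_means(clean, SIGMA, rng) for _ in range(TRIALS)) / TRIALS
predicted = SIGMA ** 2 / N  # average variance of the corruptions divided by N

print("empirical:", empirical, "predicted:", predicted)
```

Increasing `N` shrinks both numbers in proportion, in line with the statement that the error approaches zero as the number of samples grows.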