A Connection Between Score Matching and Denoising Autoencoders
Pascal Vincent
[email protected]
Département d’Informatique, Université de Montréal,
Montréal (QC) H3C 3J7, Canada
1 Introduction
This note uncovers an unsuspected link between the score matching tech-
nique (Hyvärinen, 2005, 2008) for learning the parameters of unnormalized
density models over continuous-valued data, and the training of denois-
ing autoencoders (Vincent, Larochelle, Bengio, & Manzagol, 2008; Vincent,
Larochelle, Lajoie, Bengio, & Manzagol, 2010).
Score matching (SM) is an alternative to the maximum likelihood princi-
ple suitable for unnormalized probability density models whose partition
function is intractable. Its relationship to maximum likelihood has been
investigated by Lyu (2010), who formally relates the Fisher divergence
that yields score matching and the Kullback-Leibler divergence that yields
maximum likelihood. Interestingly, his formal analysis indicates that SM
searches for parameters that are more robust to small-noise perturbations
of the training data. Score matching has also been recast as a special case
under the more general frameworks of generalized score matching (Lyu,
2010; Marlin, Swersky, Chen, & de Freitas, 2010) and minimum probability
flow (Sohl-Dickstein et al., 2009), allowing generalizations of SM to discrete
distributions (Hyvärinen, 2007b; Lyu, 2010; Sohl-Dickstein et al., 2009). The
minimum probability flow paradigm is particularly interesting as it unifies
several recent alternative parameter estimation methods, for both continu-
ous and discrete data, under a single unified view.1 Recently, Kingma and
LeCun (2010) investigated a regularized form of SM that adds a specific reg-
ularization term to the original SM objective. Its relationship to this work is
discussed in detail in section 5.
Denoising autoencoders (DAE) were proposed by Vincent et al. (2008)
as a simple and competitive alternative to the contrastive-divergence-
trained restricted Boltzmann machines (RBMs) used by Hinton, Osin-
dero, and Teh (2006) for pretraining deep networks (Erhan et al., 2010;
Vincent et al., 2010). Previous studies have already pointed out connec-
tions between SM and contrastive divergence (Hyvärinen, 2007a; Sohl-
Dickstein et al., 2009), have connected SM to optimal denoising for
gaussian noise with infinitesimal variance (Hyvärinen, 2008), and have
shown that training a gaussian-binary RBM with SM is equivalent to
training a regular (nondenoising) autoencoder with an additional reg-
ularization term (Swersky, 2010). This note, however, is the first to
recast the training of a DAE as a form of regularized SM. This con-
nection yields insights relevant to both research directions and suggests
a novel parameter estimation technique that has its roots in both DAE
and SM.
We begin with a brief presentation of the DAE architecture for
continuous-valued inputs in section 2 and the SM technique in section 3.
This allows us to introduce our formalism and precise terminology. In sec-
tion 4, we connect the denoising autoencoder objective to SM. We conclude
with a discussion of how our findings advance our understanding of both
approaches.
1 In that framework, score matching, minimum velocity learning (Movellan, 2008), and
certain forms of contrastive divergence (Hinton, 2002; Welling & Hinton, 2002) are all
recast as minimizing the Kullback-Leibler divergence between the data distribution and
the distribution obtained after running, for infinitesimal time, a dynamic that would
transform it into the model distribution (Sohl-Dickstein et al., 2009).
2 Denoising Autoencoders
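The denoising autoencoder considered here corrupts an input x into x̃ with additive isotropic gaussian noise of variance σ², encodes x̃ through a sigmoid hidden layer h = sigmoid(Wx̃ + b), decodes with the tied-weight affine decoder Wᵀh + c, and is trained to minimize the squared reconstruction error against the uncorrupted x (this is the form that reappears in equations 4.7 and 4.8 below). The following NumPy sketch of a one-sample estimate of the resulting objective J_DAEσ is illustrative only; the dimensions, random parameters, and function names are assumptions made for the example.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_loss(W, b, c, x, sigma, rng):
        """One-sample estimate of J_DAE_sigma: corrupt x with gaussian noise,
        encode with a sigmoid layer, decode with the tied-weight affine
        decoder, and return the squared error against the clean x."""
        x_tilde = x + sigma * rng.standard_normal(x.shape)  # sample from q_sigma(x_tilde | x)
        h = sigmoid(W @ x_tilde + b)                         # encoder
        reconstruction = W.T @ h + c                         # tied-weight affine decoder
        return np.sum((reconstruction - x) ** 2)

    # Toy usage with random parameters and a single training example.
    rng = np.random.default_rng(0)
    d, dh = 5, 3
    W = rng.standard_normal((dh, d))
    b, c = np.zeros(dh), np.zeros(d)
    x = rng.standard_normal(d)
    print(dae_loss(W, b, c, x, sigma=0.1, rng=rng))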
3 Score Matching
Score matching applies to unnormalized density models over continuous-valued data, of the form

p(x; \theta) = \frac{1}{Z(\theta)} \exp(-E(x; \theta)),

where E is the energy function and Z(\theta) is the typically intractable partition function. The model's score is \psi(x; \theta) = \frac{\partial \log p(x; \theta)}{\partial x}, which conveniently does not depend on Z(\theta). The idea of explicit score matching is to make this model score match the corresponding score
of the true distribution, \frac{\partial \log q(x)}{\partial x}. The corresponding objective function to be
minimized is the expected squared error between these two vectors:

J_{ESM q}(\theta) = E_{q(x)}\left[ \frac{1}{2} \left\| \psi(x; \theta) - \frac{\partial \log q(x)}{\partial x} \right\|^2 \right].
This appears to require knowing the score of the true distribution, which is unavailable. Hyvärinen (2005), however, shows that the objective can be rewritten as

J_{ESM q}(\theta) = \underbrace{E_{q(x)}\left[ \frac{1}{2} \left\| \psi(x; \theta) \right\|^2 + \sum_{i=1}^{d} \frac{\partial \psi_i(x; \theta)}{\partial x_i} \right]}_{J_{ISM q}(\theta)} + C_1, \qquad (3.1)

where \psi_i(x; \theta) = \psi(x; \theta)_i = \frac{\partial \log p(x; \theta)}{\partial x_i}, and C_1 is a constant that does not
depend on \theta. This yields an implicit SM objective J_{ISM q} that no longer
requires having an explicit score target for q but is nevertheless equivalent
to J_{ESM q}. Hyvärinen (2005) formally shows that provided q(x) and \psi(x; \theta)
satisfy some weak regularity conditions,4 we have

J_{ESM q} \asymp J_{ISM q}, \qquad (3.2)

where ≍ denotes equality up to an additive constant that does not depend on θ (so that both objectives have the same minima in θ).
4 Namely, q(x) and \psi(x; \theta) are differentiable, E_{q(x)}\left[ \left\| \frac{\partial \log q(x)}{\partial x} \right\|^2 \right] is finite, and, for any \theta,
E_{q(x)}\left[ \left\| \psi(x; \theta) \right\|^2 \right] is finite and \lim_{x \to \infty} q(x)\, \psi(x; \theta) = 0.
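To illustrate equation 3.1 concretely, consider the toy energy E(x; μ) = ½‖x − μ‖², for which ψ(x; μ) = μ − x and every diagonal derivative ∂ψ_i/∂x_i equals −1. The following NumPy sketch (a hypothetical illustration, not taken from the original text) estimates J_ISM from samples of q without ever requiring the score of q or the partition function.

    import numpy as np

    def ism_objective(mu, X):
        """Monte Carlo estimate of J_ISM (eq. 3.1, up to the constant C1) for the
        toy model E(x; mu) = 0.5 * ||x - mu||^2, where psi(x; mu) = mu - x and
        d(psi_i)/d(x_i) = -1 for every coordinate i."""
        d = X.shape[1]
        half_sq_score = 0.5 * np.sum((mu - X) ** 2, axis=1)  # 0.5 * ||psi(x)||^2 per sample
        divergence = -d                                       # sum_i d(psi_i)/d(x_i)
        return np.mean(half_sq_score + divergence)

    rng = np.random.default_rng(0)
    X = rng.normal(loc=2.0, scale=1.0, size=(10000, 2))       # samples from q
    # The objective is (approximately) minimized at the sample mean of q,
    # exactly where this gaussian model's maximum likelihood estimate lies.
    print(ism_objective(np.zeros(2), X), ism_objective(X.mean(axis=0), X))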
To its finite-sample version J_{ISM q_0} (the implicit SM objective with the expectation taken over the empirical distribution q_0 of the n training samples x^{(1)}, \ldots, x^{(n)}), Kingma and LeCun (2010) add a regularization term that penalizes the curvature terms at the training points:

J_{ISM reg}(\theta) = J_{ISM q_0}(\theta) + \lambda\, \frac{1}{n} \sum_{t=1}^{n} \sum_{i=1}^{d} \left( \frac{\partial \psi_i(x^{(t)}; \theta)}{\partial x_i} \right)^2. \qquad (3.5)
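To make the penalty of equation 3.5 concrete, the following sketch estimates the regularization term for an arbitrary score function psi, using central finite differences for the diagonal derivatives ∂ψ_i/∂x_i; the function name, finite-difference scheme, and step size are illustrative assumptions (for a specific energy function these derivatives could equally be computed analytically).

    import numpy as np

    def ism_regularizer(psi, X, lam, eps=1e-5):
        """Regularization term of eq. 3.5: lam * (1/n) * sum_t sum_i (d psi_i / d x_i)^2,
        evaluated at the n training points in X (one row per point), with each
        diagonal derivative estimated by a central finite difference."""
        n, d = X.shape
        total = 0.0
        for x in X:
            for i in range(d):
                e = np.zeros(d)
                e[i] = eps
                dpsi_ii = (psi(x + e)[i] - psi(x - e)[i]) / (2 * eps)
                total += dpsi_ii ** 2
        return lam * total / n

    # Toy usage with the score of E(x; mu) = 0.5 * ||x - mu||^2, for which
    # every diagonal derivative is -1, so the penalty is simply lam * d.
    mu = np.array([2.0, -1.0])
    psi = lambda x: mu - x
    X = np.random.default_rng(0).standard_normal((100, 2))
    print(ism_regularizer(psi, X, lam=0.1))   # approximately 0.1 * 2 = 0.2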
4 Linking Score Matching to the Denoising Autoencoder Objective

For gaussian corruption q_\sigma(\tilde{x}|x) \propto \exp\left( -\frac{1}{2\sigma^2} \|\tilde{x} - x\|^2 \right), the score of the corruption kernel is

\frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} = \frac{1}{\sigma^2} (x - \tilde{x}). \qquad (4.4)

The direction (1/σ²)(x − x̃) clearly corresponds to moving from x̃ back toward the clean
sample x, and we want ψ to match it as best it can. This is precisely what the denoising score matching objective does:

J_{DSM q_\sigma}(\theta) = E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) - \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\|^2 \right],

where q_\sigma(x, \tilde{x}) = q_0(x)\, q_\sigma(\tilde{x}|x), with q_0 the empirical distribution of the training samples.
This objective, inspired by denoising autoencoders, is equivalent to ex-
plicit SM. Formally,

J_{DSM q_\sigma}(\theta) \asymp J_{ESM q_\sigma}(\theta). \qquad (4.5)

The proof is given in the appendix and does not depend on the particular form
of q_σ(x̃|x), as long as log q_σ(x̃|x) is differentiable with respect to x̃.
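In its stochastic form, this denoising criterion only requires drawing a corrupted x̃ and regressing the model score ψ(x̃; θ) onto the target (1/σ²)(x − x̃) of equation 4.4; no second derivatives are involved. A minimal sketch under the gaussian corruption assumption, reusing the toy score function from the earlier example (all names are illustrative):

    import numpy as np

    def dsm_loss(psi, x, sigma, rng):
        """Single-sample stochastic estimate of J_DSM_{q_sigma}: corrupt x, then
        measure the squared error between the model score at x_tilde and the
        corruption-kernel score (x - x_tilde) / sigma^2 of eq. 4.4."""
        x_tilde = x + sigma * rng.standard_normal(x.shape)   # sample from q_sigma(x_tilde | x)
        target = (x - x_tilde) / sigma ** 2
        return 0.5 * np.sum((psi(x_tilde) - target) ** 2)

    rng = np.random.default_rng(0)
    mu = np.array([2.0, -1.0])
    psi = lambda x: mu - x            # score of the toy model E(x) = 0.5 * ||x - mu||^2
    x = rng.standard_normal(2)        # a clean training sample
    print(dsm_loss(psi, x, sigma=0.1, rng=rng))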
We now consider a density model of the familiar form

p(x; \theta) = \frac{1}{Z(\theta)} \exp(-E(x; \theta)),

with energy function

E(x; W, b, c) = -\frac{ \langle c, x \rangle - \frac{1}{2}\|x\|^2 + \sum_{j=1}^{d_h} \mathrm{softplus}\left( \langle W_j, x \rangle + b_j \right) }{ \sigma^2 }, \qquad (4.6)

whose parameters are \theta = \{W, b, c\}.
We then have

\psi_i(x; \theta) = \frac{\partial \log p(x; \theta)}{\partial x_i}
= -\frac{\partial E}{\partial x_i}
= \frac{1}{\sigma^2} \left( c_i - x_i + \sum_{j=1}^{d_h} \frac{\partial}{\partial x_i} \mathrm{softplus}\left( \langle W_j, x \rangle + b_j \right) \right)
= \frac{1}{\sigma^2} \left( c_i - x_i + \sum_{j=1}^{d_h} \mathrm{sigmoid}\left( \langle W_j, x \rangle + b_j \right) W_{ji} \right)
= \frac{1}{\sigma^2} \left( c_i - x_i + \left\langle W_{\cdot i},\, \mathrm{sigmoid}(Wx + b) \right\rangle \right),

which in vector form reads

\psi(x; \theta) = \frac{1}{\sigma^2} \left( W^T \mathrm{sigmoid}(Wx + b) + c - x \right). \qquad (4.7)
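As a sanity check of equation 4.7, the closed-form score should coincide with a numerical gradient of −E (the log-partition term does not depend on x). The sketch below does this with central finite differences under assumed random parameters; the dimensions, step size, and tolerances are illustrative choices.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def softplus(a):
        return np.logaddexp(0.0, a)

    def energy(x, W, b, c, sigma):
        """Energy of equation 4.6 (up to the x-independent log Z term)."""
        return -(c @ x - 0.5 * x @ x + np.sum(softplus(W @ x + b))) / sigma ** 2

    def score_closed_form(x, W, b, c, sigma):
        """Equation 4.7: psi(x) = (W^T sigmoid(Wx + b) + c - x) / sigma^2."""
        return (W.T @ sigmoid(W @ x + b) + c - x) / sigma ** 2

    rng = np.random.default_rng(0)
    d, dh, sigma, eps = 4, 3, 0.5, 1e-5
    W, b, c = rng.standard_normal((dh, d)), rng.standard_normal(dh), rng.standard_normal(d)
    x = rng.standard_normal(d)

    numeric = np.array([
        (energy(x - eps * e, W, b, c, sigma) - energy(x + eps * e, W, b, c, sigma)) / (2 * eps)
        for e in np.eye(d)
    ])  # finite-difference estimate of -dE/dx_i
    print(np.allclose(numeric, score_closed_form(x, W, b, c, sigma), atol=1e-5))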
Substituting equations 4.4 and 4.7 into the denoising score matching objective then yields

J_{DSM q_\sigma}(\theta) = E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) - \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\|^2 \right]
= E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \frac{1}{\sigma^2}\left( W^T \mathrm{sigmoid}(W\tilde{x} + b) + c - \tilde{x} \right) - \frac{1}{\sigma^2}(x - \tilde{x}) \right\|^2 \right]
= \frac{1}{2\sigma^4}\, E_{q_\sigma(x, \tilde{x})}\left[ \left\| W^T \mathrm{sigmoid}(W\tilde{x} + b) + c - x \right\|^2 \right]
= \frac{1}{2\sigma^4}\, J_{DAE\sigma}(\theta). \qquad (4.8)
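The proportionality can also be verified numerically: for fixed parameters and the same corrupted samples, the Monte Carlo estimate of J_DSMqσ equals exactly 1/(2σ⁴) times the estimate of J_DAEσ. A small sketch under the same assumed toy setup as above:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    d, dh, sigma = 4, 3, 0.3
    W, b, c = rng.standard_normal((dh, d)), rng.standard_normal(dh), rng.standard_normal(d)
    X = rng.standard_normal((1000, d))                         # clean samples from D_n
    X_tilde = X + sigma * rng.standard_normal(X.shape)         # corrupted versions

    recon = sigmoid(X_tilde @ W.T + b) @ W + c                 # W^T sigmoid(W x_tilde + b) + c
    dae = np.mean(np.sum((recon - X) ** 2, axis=1))            # J_DAE_sigma estimate
    psi = (recon - X_tilde) / sigma ** 2                       # eq. 4.7 evaluated at x_tilde
    target = (X - X_tilde) / sigma ** 2                        # eq. 4.4
    dsm = np.mean(0.5 * np.sum((psi - target) ** 2, axis=1))   # J_DSM_{q_sigma} estimate

    print(np.isclose(dsm, dae / (2 * sigma ** 4)))             # True: dsm == dae / (2 sigma^4)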
5 Discussion
Putting together equations 4.2, 4.5, and 4.8, we can write, for σ > 0,

J_{ISM q_\sigma}(\theta) \asymp J_{ESM q_\sigma}(\theta) \asymp J_{DSM q_\sigma}(\theta) = \frac{1}{2\sigma^4}\, J_{DAE\sigma}(\theta). \qquad (5.1)

In other words, training the denoising autoencoder of section 2 amounts to performing a
form of score matching, explicit or implicit, on the Parzen density estimate q_σ with the
energy function of equation 4.6; samples from q_σ are obtained simply by
corrupting samples from D_n. And it may be carried out with any of these
four optimization objective formulations.5
We introduced the kernel-smoothed empirical distribution q_σ to show
a connection between SM and a simple DAE. Interestingly, the regular-
ized SM criterion J_{ISM reg} (see equation 3.5) that Kingma and LeCun (2010)
recently introduced with the very different motivation of curing possible
instabilities was derived by approximating what amounts to J_{ISM q_σ}.6 From
this perspective, our four q_σ-based criteria in equation 5.1, including the
DAE, may be seen as alternative approximation-free forms of regularized
score matching. A key difference is that, as is done with DAE training, we
would optimize stochastic versions of these approximation-free regularized
criteria by corrupting training examples (i.e., sampling from q_σ), whereas
Kingma and LeCun (2010) optimize an approximation of J_{ISM q_σ} centered on
the training samples only (i.e., sampling from q_0). Also, whereas J_{ISM reg}, like
the other ISM criteria, requires computing second derivatives, the stochas-
tic version of our novel J_{DSM q_σ} criterion does not, and thus appears much
simpler to implement.
Note that the energy function in equation 4.6 is particular in that its
scaling, which we may call its temperature, is chosen to match the corrupt-
ing noise level σ². This is required only to establish the last equivalence
with the specific DAE we considered. But regarding the generic objectives
J_{ISM q_σ} ≍ J_{ESM q_σ} ≍ J_{DSM q_σ}, their σ may in principle be chosen regardless
of the form or temperature of whatever energy function is to be learned.
Interestingly, the energy function in equation 4.6, which we designed to
yield the equivalence with our denoising autoencoder objective, happens
to be very similar to the free energy of a restricted Boltzmann machine with
binary hidden units and gaussian visible units (Welling, Rosen-Zvi, & Hin-
ton, 2005; Bengio, Lamblin, Popovici, & Larochelle, 2007; Swersky, 2010).
The major difference is that this latter free energy does not have a global
temperature scaling of the whole expression.7 We designed equation 4.6 to
exactly yield the denoising version of the classic autoencoder described in
section 2. But with tied weights, it may be preferable to allow an extra pos-
itive scaling parameter α for the reconstruction, so that there at least exists
an equivalent reparameterization of the model for scaled input data.8
5 Note, however, that while these q_σ-based objectives are formally equivalent, their
stochastic gradient descent optimization, based on sampling a limited number of
corrupted examples, is likely to behave differently for each objective.
6 A first-order Taylor expansion and a diagonal Hessian approximation are used.
7 Specifically, in the free energy of a gaussian-binary RBM, the softplus terms are not
scaled by a global 1/σ² temperature factor.
8 For example, if the input data were scaled by a factor of 100, one could obtain the same
hidden representation as before by dividing W by 100. But because of the tied weights,
this means that the reconstruction would also be divided by 100 (i.e., there is no equiv-
alent reparameterization), unless it can be compensated by an additional scaling of the
reconstruction by a parameter α.
Allowing an adjustable positive scaling α_j for the contribution of each hidden unit, together with a model temperature parameter σ_m decoupled from the corruption noise level σ, yields the more general energy function

E(x; W, b, c, \alpha, \sigma_m) = -\frac{1}{\sigma_m^2} \left( \langle c, x \rangle - \frac{1}{2}\|x\|^2 + \sum_{j=1}^{d_h} \alpha_j\, \mathrm{softplus}\left( \langle W_j, x \rangle + b_j \right) \right),

whose parameters are \theta = \{W, b, c, \alpha, \sigma_m\}.
9 Previous experiments on DAE showed that the best models, judged by their ability to
extract useful features, were obtained for nonnegligible values of the noise parameters.
Moreover, this way of controlling the model's capacity worked much better than either
reducing the hidden layer size or using traditional weight decay.
Appendix

The explicit score matching criterion using the Parzen density estimator is
defined in equation 4.1 as

J_{ESM q_\sigma}(\theta) = E_{q_\sigma(\tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) - \frac{\partial \log q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\|^2 \right],

which expands into

J_{ESM q_\sigma}(\theta) = E_{q_\sigma(\tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) \right\|^2 \right] - S(\theta) + C_2,

where C_2 = E_{q_\sigma(\tilde{x})}\left[ \frac{1}{2} \left\| \frac{\partial \log q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\|^2 \right] is a constant that does not depend on \theta,
and
S(\theta) = E_{q_\sigma(\tilde{x})}\left[ \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\rangle \right]
= \int_{\tilde{x}} q_\sigma(\tilde{x}) \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\rangle d\tilde{x}
= \int_{\tilde{x}} q_\sigma(\tilde{x}) \left\langle \psi(\tilde{x}; \theta), \frac{ \frac{\partial q_\sigma(\tilde{x})}{\partial \tilde{x}} }{ q_\sigma(\tilde{x}) } \right\rangle d\tilde{x}
= \int_{\tilde{x}} \left\langle \psi(\tilde{x}; \theta), \frac{\partial q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\rangle d\tilde{x}
= \int_{\tilde{x}} \left\langle \psi(\tilde{x}; \theta), \frac{\partial}{\partial \tilde{x}} \int_x q_0(x)\, q_\sigma(\tilde{x}|x)\, dx \right\rangle d\tilde{x}
= \int_{\tilde{x}} \left\langle \psi(\tilde{x}; \theta), \int_x q_0(x)\, \frac{\partial q_\sigma(\tilde{x}|x)}{\partial \tilde{x}}\, dx \right\rangle d\tilde{x}
= \int_{\tilde{x}} \left\langle \psi(\tilde{x}; \theta), \int_x q_0(x)\, q_\sigma(\tilde{x}|x)\, \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}}\, dx \right\rangle d\tilde{x}
= \int_{\tilde{x}} \int_x q_0(x)\, q_\sigma(\tilde{x}|x) \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\rangle dx\, d\tilde{x}
= \int_{\tilde{x}} \int_x q_\sigma(\tilde{x}, x) \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\rangle dx\, d\tilde{x}
= E_{q_\sigma(\tilde{x}, x)}\left[ \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\rangle \right].

Substituting this expression for S(\theta) back into the expansion of J_{ESM q_\sigma} and completing the square under E_{q_\sigma(x, \tilde{x})} yields

J_{ESM q_\sigma}(\theta) = E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) - \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\|^2 \right] + C_2 - C_3 = J_{DSM q_\sigma}(\theta) + C_2 - C_3,

where C_3 = E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\|^2 \right] is also a constant that does not depend on \theta. The two criteria thus differ only by a constant, that is, J_{ESM q_σ} ≍ J_{DSM q_σ}.
Acknowledgments
I thank Yoshua Bengio, Olivier Delalleau, and the other members of the Lisa
Lab who provided timely feedback, as well as two anonymous referees.
References
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise train-
ing of deep networks. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in
neural information processing systems, 19 (pp. 153–160). Cambridge, MA: MIT Press.
Duane, S., Kennedy, A., Pendleton, B., & Roweth, D. (1987). Hybrid Monte Carlo.
Phys. Lett. B, 195, 216–222.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., & Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? Journal of Machine
Learning Research, 11, 625–660.
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulie, F. (1987). Mémoires
associatives distribuées. In Proceedings of COGNITIVA 87. Paris: Cesta-Afcet.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive
divergence. Neural Computation, 14, 1771–1800.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep
belief nets. Neural Computation, 18, 1527–1554.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6, 695–709.
Hyvärinen, A. (2007a). Connections between score matching, contrastive diver-
gence, and pseudolikelihood for continuous-valued variables. IEEE Transactions
on Neural Networks, 18, 1529–1531.
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics
and Data Analysis, 51, 2499–2512.
Hyvärinen, A. (2008). Optimal approximation of signal priors. Neural Computation,
20(12), 3087–3110.
Kingma, D., & LeCun, Y. (2010). Regularized estimation of image statistics by score
matching. In J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta
(Eds.), Advances in neural information processing systems, 23 (pp. 1126–1134).
Cambridge, MA: MIT Press.
LeCun, Y. (1987). Modèles connexionnistes de l’apprentissage. Unpublished doctoral
dissertation, Université de Paris VI.
Lyu, S. (2010). Interpretation and generalization of score matching. In Proceedings of
the 25th Conference on Uncertainty in Artificial Intelligence (UAI’09). Corvallis, OR:
AUAI Press.
Marlin, B., Swersky, K., Chen, B., & de Freitas, N. (2010). Inductive principles
for restricted Boltzmann machine learning. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10) (Vol. 9,
pp. 509–516). N.p.: JMLR.
Movellan, J. R. (2008). Contrastive divergence in gaussian diffusions. Neural
Computation, 20(9), 2238–2252.
Seung, S. H. (1998). Learning continuous attractors in recurrent networks. In
M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing
systems, 10 (pp. 654–660). Cambridge, MA: MIT Press.
Sohl-Dickstein, J., Battaglino, P., & DeWeese, M. R. (2009). Minimum probability flow
learning (Tech. Rep.). arXiv:0906.4779.
Swersky, K. (2010). Inductive principles for learning restricted Boltzmann machines.
Unpublished master’s thesis, University of British Columbia.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In W. W. Cohen,
A. McCallum, & S. T. Roweis (Eds.), Proceedings of the 25th Annual International
Conference on Machine Learning (pp. 1096–1103). New York: ACM.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with
a local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408.
Welling, M., & Hinton, G. E. (2002). A new learning algorithm for mean field
Boltzmann machines. In ICANN ’02: Proceedings of the International Conference on
Artificial Neural Networks (pp. 351–357). Berlin: Springer-Verlag.
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums
with an application to information retrieval. In L. Saul, Y. Weiss, & L. Bottou
(Eds.), Advances in neural information processing systems, 17 (pp. 1481–1488).
Cambridge, MA: MIT Press.