0% found this document useful (0 votes)

45 views13 pages

A Connection Between Score Matching

This document discusses the connection between score matching and denoising autoencoders. It shows that the training criterion for a simple denoising autoencoder is equivalent to matching the score of a specific energy-based model to that of a non-parametric density estimator of the data, known as score matching. This defines a probabilistic model for denoising autoencoders and justifies techniques like tied weights between the encoder and decoder. It also suggests extending score matching approaches related to learning to denoise data corrupted with noise.

Uploaded by

Pixel Dinosaur

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

45 views13 pages

A Connection Between Score Matching

Uploaded by

Pixel Dinosaur

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

1

A Connection Between Score Matching

and Denoising Autoencoders
Pascal Vincent
[email protected]
Dept. IRO, Université de Montréal,
CP 6128, Succ. Centre-Ville, Montréal (QC) H3C 3J7, Canada.
Technical Report 1358
Département d’Informatique et de Recherche Opérationnelle
December 2010
THIS IS A PREPRINT VERSION OF A NOTE THAT HAS BEEN
ACCEPTED FOR PUBLICATION IN NEURAL COMPUTATION.

Keywords: autoencoder, energy based models, score matching, denoising, density

estimation.

Abstract

Denoising autoencoders have been previously shown to be competitive alternatives

to Restricted Boltzmann Machines for unsupervised pre-training of each layer of a deep
architecture. We show that a simple denoising autoencoder training criterion is equiv-
alent to matching the score (with respect to the data) of a specific energy based model
to that of a non-parametric Parzen density estimator of the data. This yields several
useful insights. It defines a proper probabilistic model for the denoising autoencoder
technique which makes it in principle possible to sample from them or to rank examples
by their energy. It suggests a different way to apply score matching that is related to
learning to denoise and does not require computing second derivatives. It justifies the
use of tied weights between the encoder and decoder, and suggests ways to extend the
success of denoising autoencoders to a larger family of energy-based models.

1 Introduction
This note uncovers an unsuspected link between the score matching technique (Hyväri-
nen, 2005; Hyvärinen, 2008) for learning the parameters of unnormalized density mod-
els over continuous-valued data, and the training of denoising autoencoders (Vincent
et al., 2008, 2010).
Score matching (SM) is an alternative to the maximum likelihood principle suitable
for unnormalized probability density models whose partition function is intractable. Its
relationship to maximum likelihood has been investigated by Lyu (2009) who formally
relates the Fisher divergence that yields score matching and the Kullback-Leibler diver-
gence that yields maximum likelihood. Interestingly, his formal analysis indicates that
score matching searches for parameters that are more robust to small-noise perturba-
tions of the training data (Lyu, 2009). Score matching has also been recast as a special
case under the more general frameworks of generalized score matching (Lyu, 2009;
Marlin et al., 2009) and minimum probability flow (Sohl-Dickstein et al., 2009), allow-
ing generalizations of score matching to discrete distributions (Hyvärinen, 2007b; Lyu,
2009; Sohl-Dickstein et al., 2009). The minimum probability flow paradigm is partic-
ularly interesting as it unifies several recent alternative parameter-estimation methods,
both for continuous and discrete data, under a single unified view1 . Recently, Kingma
and LeCun (2010) investigated a regularized form of score matching which adds a spe-
cific regularization term to the original score matching objective. Its relationship to the
present work will be discussed in detail in Section 5.
Denoising Autoencoders (DAE) were proposed by Vincent et al. (2008) as a simple
and competitive alternative to the Contrastive-Divergence-trained Restricted Boltzmann
Machines (RBM) used by Hinton et al. (2006) for pretraining deep networks (Erhan
et al., 2010; Vincent et al., 2010). Previous studies have already pointed out connec-
tions between SM and Contrastive Divergence (Hyvärinen, 2007a; Sohl-Dickstein et al.,
2009), have connected SM to optimal denoising for Gaussian noise with infinitesimal
variance (Hyvärinen, 2008) and have shown that training Gaussian Binary RBM with
SM is equivalent to training a regular (non-denoising) autoencoder with an additional
regularization term (Swersky, 2010). The present study however is the first to recast
the training of a DAE as a form of regularized score matching. This connection yields
insights relevant to both research directions and suggests a novel parameter estimation
technique that has its roots in both DAE and SM.
We begin by a very brief presentation of the DAE architecture for continuous-valued
inputs in Section 2 and of the SM technique in Section 3. This allows us to introduce
our formalism and precise terminology. Then, in section 4, we connect the denoising
autoencoder objective to score matching. We conclude by a discussion on how our
findings advance our understanding of both approaches.

Notation
We are interested in techniques that learn the parameters θ of a model by minimiz-
ing some objective function J(θ). For uniformity of notation, all distributions will be
represented by their probability density functions (pdf) on Rd . The pdf for discrete
distributions will be expressed with Dirac-deltas δ.

q(x) Unknown true pdf. x ∈ Rd .

Dn = {x , . . . , x(n) } Training set: i.i.d. sample from q.
(1)

1
Specifically score matching (Hyvärinen, 2005), minimum velocity learning (Movellan, 2008), and
certain forms of contrastive divergence (Hinton, 2002; Welling and Hinton, 2002) are all recast as mini-
mizing the Kullback-Leibler divergence between the data distribution and the distribution obtained after
running, for infinitesimal time, a dynamic that would transform it into the model distribution (Sohl-
Dickstein et al., 2009).

2
q0 (x) = n1 ni=1 δ(kx − x(i) k) Empirical pdf associated with Dn .
P
1 2
qσ (x̃|x) = (2π)d/21
σd
e− 2σ2 kx̃−xk Smoothing kernel or noise model: isotropic
Gaussian of variance σ 2 .
qσ (x̃, x) P= qσ (x̃|x)q0 (x) Joint pdf.
qσ (x̃) = n nt=1 qσ (x̃|x(t) ) Parzen density estimate based on Dn
1

obtainable by marginalizing qσ (x̃, x).

p(x; θ) Density model with parameters θ.
J1 ^ J2 Means J1 (θ) and J2 (θ) are equivalent
´ optimization objectives 2 .
Eq(x) [g(x)] = x q(x)g(x)dxP Expectation with respect to distribution q.
hu, vi = p i ui vi Dot product between two vectors.
kuk = hu, ui Euclidean norm of vector u.
softplus(x) = log(1 + ex ) Will be applied elementwise to vectors.
sigmoid(x) = 1+e1−x = softplus0 (x) Will be applied elementwise to vectors.
I Identity matrix.
T
W Transpose of matrix W.
Wi Vector for ith row of W.
W·,j Vector for j th column of W.

2 Denoising Autoencoders
Denoising Autoencoders (DAEs) are a simple modification of classical autoencoder
neural networks that are trained, not to reconstruct their input, but rather to denoise an
artificially corrupted version of their input (Vincent et al., 2008, 2010). Whereas an
over-complete regular autoencoder can easily learn a useless identity mapping, a DAE
must extract more useful features in order to solve the much harder denoising problem.
DAEs have proven to be an empirically successful alternative to Restricted Boltzmann
Machines (RBM) for pre-training deep networks (Vincent et al., 2008, 2010; Erhan
et al., 2010). Denoising autoencoders have also been used in different contexts in the
earlier works of LeCun (1987); Gallinari et al. (1987); Seung (1998).
In this study, we will consider the denoising version of a simple classical autoen-
coder that uses a single sigmoidal hidden layer. Since data points originate from a con-
tinuous real valued distribution it is natural to use a linear decoder with a squared recon-
struction loss3 . We will be using tied weights whereby encoder and decoder share the
same linear transformation parameters. The considered corruption is additive isotropic
Gaussian noise. A detailed description of the architecture follows:
• A training input x ∈ Dn is first corrupted by additive Gaussian noise of covari-
ance σ 2 I yielding corrupted input x̃ = x + , ∼ N (0, σ 2 I). This corresponds to
2
Equivalence will be asserted when J2 = αJ1 + β with α > 0, β ∈ R. Indeed a gradient-based
optimization algorithm, when starting from some initial θ value, should land in the exact same minimum
whether optimizing J1 or J2 (this may however require learning rate adjustment to compensate for scaling
factor α).
3
as opposed to a linear+sigmoid decoder with a Bernoulli cross-entropy loss, which would be the
preferred choice for binary input.

3
1 2
conditional density qσ (x̃|x) = 1
(2π)d/2 σ d
e− 2σ2 kx̃−xk .

• The corrupted version x̃ is encoded into a hidden representation h ∈ Rdh through

an affine mapping followed by a nonlinearity:
h = encode(x̃) = sigmoid(Wx̃+b), where x̃ ∈ Rd , h ∈ (0, 1)dh , W is a dh ×d
matrix and b ∈ Rdh .

• The hidden representation h is decoded into reconstruction xr through affine

mapping: xr = decode(h) = WT h + c, where c ∈ Rd .

• The parameters θ = {W, b, c} are optimized so that the expected squared recon-
struction error kxr − xk2 is minimized, i.e. the objective function being mini-
mized by such a denoising autoencoder (DAE) is:

JDAEσ (θ) = Eqσ (x̃,x) kdecode(encode(x̃)) − xk2

h 2 i
T
= Eqσ (x̃,x) W sigmoid(Wx̃ + b)) + c − x .
(1)

3 Score Matching
3.1 Explicit Score Matching (ESM)
Score Matching was introduced by Hyvärinen (2005) as a technique to learn the param-
eters θ of probability density models p(x; θ) with intractable partition function Z(θ),
where p can be written as
1
p(x; θ) = exp(−E(x; θ)).
Z(θ)

E is called the energy function. Following Hyvärinen (2005), we will call score the
gradient of the log density with respect to the data vector: ψ(x; θ) = ∂ log∂x p(x;θ)
. Beware
that this usage differs slightly from traditional statistics terminology where score usu-
ally refers to the derivative of the log likelihood with respect to parameters, whereas
here we are talking about a score with respect to the data. The core principle of score
matching (Hyvärinen, 2005) is to learn θ so that ψ(x; θ) = ∂ log∂x p(x;θ)
best matches the
∂ log q(x)
corresponding score of the true distribution, i.e. ∂x
. The corresponding objective
function to be minimized is the expected squared error between these two vectors, i.e.
" 2 #
1 ψ(x; θ) − ∂ log q(x)
JESM q (θ) = Eq(x) .
2 ∂x

We refer to this formulation as explicit score matching (ESM).

Note that the score ψ(x; θ) does not depend on troublesome Z(θ). But since q is un-
known, we do not have explicit regression targets ∂ log∂xq(x) . Hyvärinen (2005) mentions
in passing that non-parametric methods might be used to estimate those, and we shall
later pay closer attention to this possibility.

4
3.2 Implicit Score Matching (ISM)
Hyvärinen (2005) instead proceeds by proving the following remarkable property:
" 2 # " d
#
1
ψ(x; θ) − ∂ log q(x) 1 2
X ∂ψ i (x; θ)
Eq(x) = Eq(x) kψ(x; θ)k + +C1
2 ∂x 2 i=1
∂xi
| {z } | {z }
JESM q (θ) JISM q (θ)
(2)
∂ log p(x;θ)
where ψi (x; θ) = ψ(x; θ)i = ∂xi
and C1 is a constant that does not depend
,
on θ. This yields an implicit score matching objective JISM q that no longer requires
having an explicit score target for q but is nevertheless equivalent to JESM q . Hyvärinen
(2005) formally shows that, provided q(x) and ψ(x; θ) satisfy some weak regularity
conditions4 , we have

JESM q ^ JISM q (3)

3.3 Finite Sample Version of Implicit Score Matching

Since we only have samples Dn from q, Hyvärinen proposes to optimize the finite
sample version of JISM q which, following our notation, we shall write JISM q0 :
" d
#
1 X ∂ψ i (x; θ)
JISM q0 (θ) = Eq0 (x) kψ(x; θ)k2 +
2 i=1
∂x i
n d
!
(t)
1X 1 ψ(x(t) ; θ) 2 +
X ∂ψ i (x ; θ)
= . (4)
n t=1 2 i=1
∂x i

JISM q0 is asymptotically equivalent to JISM q when n → ∞ and hence asymptoti-

cally equivalent to objective JESM q . This can be summarized as

JESM q ^ JISM q ^ lim JISM q0 . (5)

n→∞

What happens in the transition from JISM q to finite sample version JISM q0 is how-
ever not entirely clear. Concerns regarding the stability of the resulting criterion were
raised by Kingma and LeCun (2010), who propose instead to optimize a regularized
version of JISM q0 :
d 2
X ∂ψi (x(t) ; θ)
JISM reg (θ) = JISM q0 (θ) + λ , (6)
i=1
∂xi

where the strength of the additional regularization term is controlled by hyperparameter

λ. The relationship between this criterion and the criteria we propose below will be
further discussed in Section 5.
2
4
Namely: q(x) and ψ(x; θ) are differentiable, Eq(x) ∂ log∂xq(x) is finite, and for any θ,

h i
2
Eq(x) kψ(x; θ)k is finite and limkxk→∞ q(x)ψ(x; θ) = 0.

5
4 Linking Score Matching to the Denoising Autoencoder
Objective
4.1 Matching the Score of a Non-Parametric Estimator
As previously stated, the possibility of matching the score ψ(x; θ) with an explicit tar-
get score for q obtained through non-parametric estimation was mentioned but not pur-
sued in Hyvärinen (2005). We now examine this possibility more closely. Explicitly
matching ψ(x; θ) with the score of Parzen windows density estimator qσ (x̃) yields the
following objective:
" 2 #
1
ψ(x̃; θ) − ∂ log q σ (x̃)
JESM qσ (θ) = Eqσ (x̃) . (7)
2 ∂ x̃

∂ log qσ (x̃) 2
For σ > 0, qσ is differentiable, decreases to 0 at infinity, and Eqσ (x̃) ∂ x̃
is finite. All regularity conditions are satisfied, so the same equivalence with ISM as in
Eq. 3 holds, i.e.
JESM qσ ^ JISM qσ . (8)
Note that this equivalence however breaks in the limit σ → 0, because qσ no longer
satisfies these regularity conditions, and JESM qσ can no longer be computed (whereas
JISM qσ remains well-behaved).

4.2 Denoising Score Matching (DSM)

Let us now consider a slightly different objective, that is inspired by both the Score
Matching principle and by the Denoising Autoencoder approach of using pairs of clean
and corrupted examples (x, x̃). For joint density qσ (x̃, x) = qσ (x̃|x)q0 (x), we define
the following denoising score matching (DSM) objective:
" 2 #
1 ψ(x̃; θ) − ∂ log q σ (x̃|x)
JDSM qσ (θ) = Eqσ (x,x̃) . (9)
2 ∂ x̃

The underlying intuition is that following the gradient ψ of the log density at some
corrupted point x̃ should ideally move us towards the clean sample x. Note that with
the considered Gaussian kernel we have
∂ log qσ (x̃|x) 1
= 2 (x − x̃). (10)
∂ x̃ σ
Direction σ12 (x − x̃) clearly corresponds to moving from x̃ back towards clean sample
x, and we want ψ to match that as best it can.
Now this alternate objective, inspired by denoising autoencoders, is equivalent to
explicit score matching. Formally,

JESM qσ ^ JDSM qσ (11)

The proof is in the appendix, and does not depend on the particular form of qσ (x̃|x)
or q(x) as long as log qσ (x̃|x) is differentiable with respect to x̃.

6
4.3 An Energy Function that Yields the Denoising Autoencoder Ob-
jective
Let us now choose for model p the form
1
p(x; θ) = exp(−E(x; θ))
Z(θ)
hc, xi − 12 kxk2 + dj=1
Ph
softplus (hWj , xi + bj )
E(x; W, b, c) = − . (12)
| {z } σ2
θ

We then have
∂ log p(x; θ)
ψi (x; θ) =
∂xi
∂E
= −
∂xi
dh
!
1 X ∂ (hWj , xi + bj )
= ci − xi + softplus0 (hWj , xi + bj )
σ2 j=1
∂xi
dh
!
1 X
= ci − xi + sigmoid (hWj , xi + bj ) Wji
σ2 j=1
1
= (ci − xi + hW·i , sigmoid(Wx + b)i) ,
σ2
which we can write as the single equation
1 T

ψ(x; θ) = W sigmoid(Wx + b) + c − x . (13)
σ2
Substituting Eq. 10 and Eq. 13 in the expression for JDSM qσ (Eq. 9) we get, for this
choice of Parzen kernel and density model,

" 2 #
1 ψ(x̃; θ) − ∂ log q σ (x̃|x)
JDSM qσ (θ) = Eqσ (x,x̃)
2 ∂ x̃
" 2 #
1 1 1
WT sigmoid(Wx̃ + b) + c − x̃ − 2 (x − x̃)

= Eqσ (x,x̃)
2
2 σ σ
1 h
T
2 i
= E q (x,x̃)
W sigmoid(Wx̃ + b) + c − x
2σ 4 σ
1
= JDAEσ (θ).
2σ 4
We have thus shown that
JDSM qσ ^ JDAEσ (14)

7
5 Discussion
Putting together equations 8, 11 and 14, we can write, for σ > 0,

JISM qσ ^ JESM qσ ^ JDSM qσ ^ JDAEσ . (15)

In summary, training the denoising autoencoder defined in section 2 is equiva-

lent to performing score matching (explicit or implicit) with the energy function of
Eq. 12 on Parzen density estimate qσ . Such a training would typically use stochastic
gradient descent, whereby samples from qσ are obtained by corrupting samples from
Dn . And it may be carried out with any of these four optimization objective formula-
tions5 .
We introduced the kernel-smoothed empirical distribution qσ to show a connection
between score matching and a simple denoising autoencoder. Interestingly, the reg-
ularized score matching criterion JISM reg (Eq. 6) that Kingma and LeCun (2010) re-
cently introduced with the very different motivation of curing possible instabilities, was
derived by approximating6 what amounts to JISM qσ . From this perspective our four
qσ -based criteria in Eq. 15, including the denoising autoencoder, may be seen as al-
ternative approximation-free forms of regularized score matching. A key difference is
that, as is done with denoising autoencoder training, we would optimize stochastic ver-
sions of these approximation-free regularized criteria by corrupting training examples
(i.e. sampling form qσ ), whereas Kingma and LeCun (2010) optimize an approximation
of JISM qσ , centered on the training samples only (i.e. sampling from q0 ). Also, whereas
JISM reg , like the other ISM criteria, requires computing second derivatives, the stochas-
tic version of our novel JDSM qσ criterion does not, and thus appears much simpler to
implement.
Note that the energy function in Eq. 12 is particular in that its scaling, which we
may call its temperature, is chosen to match the corrupting noise level σ 2 . This is
required only to establish the last equivalence with the specific DAE we considered.
But regarding the generic objectives JISM qσ ^ JESM qσ ^ JDSM qσ , their σ may in
principle be chosen irrespective of the form or temperature of whatever energy function
is to be learnt. Interestingly, the energy function in Eq. 12, that we designed to yield
the equivalence with our denoising autoencoder objective, happens to be very similar
to the free energy of a Restricted Boltzmann Machine with binary hidden units and
Gaussian visible units (Welling et al., 2005; Bengio et al., 2007; Swersky, 2010). The
major difference is that this latter free energy does not have a global temperature scaling
of the whole expression7 . We designed Eq. 12 to exactly yield the denoising version
of the classic autoencoder described in Section 2. But with tied weights, it may be
preferable to allow for an extra positive scaling parameter α for the reconstruction, so
that there at least exists an equivalent reparametrization of the model for scaled input
5
Note however that while these qσ -based objectives are formally equivalent, their stochastic gradient
descent optimization, based on sampling a limited number of corrupted examples, is likely to behave
differently for each objective.
6
A first order Taylor expansion and a diagonal Hessian approximation are used.
7
Specifically, in the free energy of a Gaussian-binary RBM, the softplus terms are not divided by σ 2
nor scaled in any way.

8
data8 . This is easily obtained in the energy function by multiplying the sum of softplus
terms in Eq. 12 by α. We may even allow an arbitrary rescaling factor αj for each
hidden layer dimension independently by multiplying each softplus term by its own
rescaling parameter αj , which yields the following more flexible energy function:
dh
!
1 1 X
E(x; W, b, c, α, σm ) = − 2 hc, xi − kxk2 + αj softplus (hWj , xi + bj ) .
| {z } σm 2 j=1
θ

Here we have also included, as model parameter, a σm (where m stands for model)
distinct from the noise σ of the training objective9 .
Our qσ -based objectives JISM qσ , JESM qσ , or JDSM qσ can be used as alternatives to
the finite sample objective JISM q0 (Eq. 4) advocated in Hyvärinen (2005) for learning
unnormalized densities. Note that JISM q0 is a special case of JISM qσ obtained in the
limit of σ → 0. Also, since Kingma and LeCun (2010) showed that it may be preferable
to use a regularized criterion (that they derived form smoothed empirical distribution
qσ ), it is likely that our qσ -based criteria may, for σ > 0, yield better generalization
performance than the JISM q0 advocated in Hyvärinen (2005)10 . It seems that σ could
allow one to choose an optimal bias-variance tradeoff for the finite-sample estimation
of the true score matching gradient of interest ∇θ JESM q = ∇θ JISM q . While ∇θ JISM q0
is an unbiased estimator of it, ∇θ JISM qσ = ∇θ JESM qσ = ∇θ JDSM qσ will generally be
biased when σ > 0 but are also likely to have a lower variance.
Among the three equivalent score matching objectives based on qσ , objective JDSM qσ
appears particularly interesting as a novel alternative formulation. It was motivated by
both the SM principle and the DAE principle. From DAE it borrows the idea of learning
to denoise artificially corrupted samples, and from SM it borrows the idea of learning
a score function derived from an unnormalized density. JDSM qσ may prove simpler and
more efficient in practice than the mathematically equivalent JISM qσ , as it does not
require computing second derivatives.
Our result is also a significant advance for DAEs. First, we have defined a proper
energy function for the considered DAE through Eq. 12. This will enable many pre-
viously impossible or ill-defined operations on a trained DAE, for example deciding
which is the more likely among several inputs, or sampling from a trained DAE us-
ing Hybrid Monte-Carlo (Duane et al., 1987). Second, whereas using the same weight
matrix (“tied weights”) for the encoder and decoder is justified for RBMs, the encoder-
decoder framework does not constrain that choice. Previous work on DAEs (Vincent
et al., 2008; Erhan et al., 2010; Vincent et al., 2010) explored both options, often find-
ing tied weights to yield better empirical results. Within the SM framework presented
here, using tied weights between encoder and decoder now has a proper justification,
8
If for example one multiplies the input values by 100, one can obtain the same hidden representation
as before by dividing W by 100. But because of the tied weights this means that the reconstruction would
also be divided by 100 (i.e. there is no equivalent reparametrization), unless it can be compensated by an
additional scaling of the reconstruction by a parameter α.
9
We would however have to set σm = σ to recover a recognizable denoising autoencoder objective.
10
It is also noteworthy that the experimental results of Vincent et al. (2008, 2010) on DAE showed that
the best models, judged by their ability to extract useful features, were obtained for non negligible values
of the noise parameters. Moreover this way of controlling the model’s capacity worked much better than
either reducing the hidden layer size or than traditional weight decay.

9
since it follows naturally from differentiating the energy. Third, this framework opens
the door to new variants that would naturally fall out from other choices of the energy
function.

Acknowledgements
We would like to thank Yoshua Bengio, Olivier Delalleau and the other members of
the Lisa Lab who provided timely feedback, as well as two anonymous referees whose
thoughtful comments and suggestions helped improve the manuscript. This research
was supported by NSERC, MITACS and FQRNT.

References
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise
training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Ad-
vances in Neural Information Processing Systems 19 (NIPS’06), pages 153–160. MIT
Press.

Duane, S., Kennedy, A., Pendleton, B., and Roweth, D. (1987). Hybrid Monte Carlo.
Phys. Lett. B, 195, 216–222.

Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S.
(2010). Why does unsupervised pre-training help deep learning? Journal of Machine
Learning Research, 11, 625–660.

Gallinari, P., LeCun, Y., Thiria, S., and Fogelman-Soulie, F. (1987). Memoires associa-
tives distribuees. In Proceedings of COGNITIVA 87, Paris, La Villette.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive diver-

gence. Neural Computation, 14, 1771–1800.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep
belief nets. Neural Computation, 18, 1527–1554.

Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score

matching. Journal of Machine Learning Research, 6, 695–709.

Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,

and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural
Networks, 18, 1529–1531.

Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics

and Data Analysis, 51, 2499–2512.

Hyvärinen, A. (2008). Optimal approximation of signal priors. Neural Computation,

20(12), 3087–3110.

10
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score
matching. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23 (NIPS’10), pages
1126–1134. MIT Press.

LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université

de Paris VI.

Lyu, S. (2009). Interpretation and generalization of score matching. In Proceedings of

the 25th Conference in Uncertainty in Artificial Intelligence (UAI’09).

Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2009). Inductive principles for
restricted boltzmann machine learning. In Proceedings of The Thirteenth Interna-
tional Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9,
pages 509–516.

Movellan, J. R. (2008). Contrastive divergence in gaussian diffusions. Neural Compu-

tation, 20(9), 2238–2252.

Seung, S. H. (1998). Learning continuous attractors in recurrent networks. In M. Jordan,

M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems
10 (NIPS’97), pages 654–660. MIT Press.

Sohl-Dickstein, J., Battaglino, P., and DeWeese, M. R. (2009). Minimum probability

flow learning. Technical report, arXiv:0906.4779.

Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines.

Master’s thesis, University of British Columbia.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In W. W. Cohen, A. McCal-
lum, and S. T. Roweis, editors, ICML’08, pages 1096–1103. ACM.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked
denoising autoencoders: learning useful representations in a deep network with a
local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408.

Welling, M. and Hinton, G. E. (2002). A new learning algorithm for mean field Boltz-
mann machines. In ICANN ’02: Proceedings of the International Conference on
Artificial Neural Networks, pages 351–357, London, UK. Springer-Verlag.

Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums
with an application to information retrieval. In L. Saul, Y. Weiss, and L. Bottou,
editors, Advances in Neural Information Processing Systems 17 (NIPS’04), pages
1481–1488, Cambridge, MA. MIT Press.

11
Appendix
Proof that JESM qσ ^ JDSM qσ (11)
The explicit score matching criterion using the Parzen density estimator is defined in
Eq. 7 as
" 2 #
1 ψ(x̃; θ) − ∂ log q σ (x̃)
JESM qσ (θ) = Eqσ (x̃)
2 ∂ x̃

which we can develop as

1 2
JESM qσ (θ) = Eqσ (x̃) kψ(x̃; θ)k − S(θ) + C2 (16)
2
2
1 ∂ log qσ (x̃)
where C2 = Eqσ (x̃) 2 ∂ x̃ is a constant that does not depend on θ, and

∂ log qσ (x̃)
S(θ) = Eqσ (x̃) ψ(x̃; θ),
∂ x̃
ˆ
∂ log qσ (x̃)
= qσ (x̃) ψ(x̃; θ), dx̃
x̃ ∂ x̃
ˆ *
∂
+
q σ (x̃)
= qσ (x̃) ψ(x̃; θ), ∂ x̃ dx̃
x̃ qσ (x̃)
ˆ
∂
= ψ(x̃; θ), qσ (x̃) dx̃
∂ x̃
ˆ
x̃
ˆ
∂
= ψ(x̃; θ), q0 (x)qσ (x̃|x)dx dx̃
∂ x̃ x
ˆ
x̃
ˆ
∂qσ (x̃|x)
= ψ(x̃; θ), q0 (x) dx dx̃
∂ x̃
ˆ
x̃
ˆx

∂ log qσ (x̃|x)
= ψ(x̃; θ), q0 (x)qσ (x̃|x) dx dx̃
∂ x̃
ˆ ˆ
x̃ x

∂ log qσ (x̃|x)
= q0 (x)qσ (x̃|x) ψ(x̃; θ), dxdx̃
∂ x̃
ˆ ˆ
x̃ x

∂ log qσ (x̃|x)
= qσ (x̃, x) ψ(x̃; θ), dxdx̃
x̃ x ∂ x̃

∂ log qσ (x̃|x)
= Eqσ (x̃,x) ψ(x̃; θ), .
∂ x̃
Substituting this expression for S(θ) in Eq. 16 yields

1 2
JESM qσ (θ) = Eqσ (x̃) kψ(x̃; θ)k
2

∂ log qσ (x̃|x)
−Eqσ (x,x̃) ψ(x̃; θ), + C2 . (17)
∂ x̃

12
We also have defined in Eq. 9,
" 2 #
1
ψ(x̃; θ) − ∂ log q σ (x̃|x)
JDSM qσ (θ) = Eqσ (x,x̃) ,
2 ∂ x̃

which we can develop as

1 2
JDSM qσ (θ) = Eqσ (x̃) kψ(x̃; θ)k
2

∂ log qσ (x̃|x)
−Eqσ (x,x̃) ψ(x̃; θ), + C3 (18)
∂ x̃
2
∂ log q (x̃|x)
where C3 = Eqσ (x,x̃) 21 σ
is a constant that does not depend on θ.

∂ x̃

Looking at equations 17 and 18 we see that JESM qσ (θ) = JDSM qσ (θ) + C2 − C3 .

We have thus shown that the two optimization objectives are equivalent.

2011 Vincent
No ratings yet
2011 Vincent
14 pages
Marginalized Denoising Auto-Encoders For Nonlinear Representations
No ratings yet
Marginalized Denoising Auto-Encoders For Nonlinear Representations
9 pages
DL Unit 4
No ratings yet
DL Unit 4
21 pages
Noise Learning Based Denoising Autoencoder
No ratings yet
Noise Learning Based Denoising Autoencoder
5 pages
Unit VI
No ratings yet
Unit VI
46 pages
Alain 14 A
No ratings yet
Alain 14 A
31 pages
Deconstructing Denoising Diffusion Models For Self-Supervised Learning
No ratings yet
Deconstructing Denoising Diffusion Models For Self-Supervised Learning
10 pages
Auto Encoder
No ratings yet
Auto Encoder
73 pages
Denoising Autoencoders
No ratings yet
Denoising Autoencoders
13 pages
Denoising Autoencoders
No ratings yet
Denoising Autoencoders
2 pages
Unit 5e - Autoencoders
No ratings yet
Unit 5e - Autoencoders
32 pages
Analyzing Noise in Autoencoders and Deep Networks
No ratings yet
Analyzing Noise in Autoencoders and Deep Networks
10 pages
Lecture09 Ae
No ratings yet
Lecture09 Ae
13 pages
Denoising Adversarial Autoencoders
No ratings yet
Denoising Adversarial Autoencoders
17 pages
AAI Module 3
No ratings yet
AAI Module 3
11 pages
DL Unit - 4
No ratings yet
DL Unit - 4
26 pages
用于语义分割的解码器去噪预训练
No ratings yet
用于语义分割的解码器去噪预训练
17 pages
DLA Unit 5
No ratings yet
DLA Unit 5
18 pages
Deep Learning For Image Denoising
No ratings yet
Deep Learning For Image Denoising
10 pages
Denoising Autoencoder Explanation
No ratings yet
Denoising Autoencoder Explanation
4 pages
Module 4
No ratings yet
Module 4
10 pages
Autoencoders
No ratings yet
Autoencoders
35 pages
Unit-V DL
No ratings yet
Unit-V DL
31 pages
DeepLearning Unit IV Notes
No ratings yet
DeepLearning Unit IV Notes
58 pages
Tutorial On Diffusion Models
No ratings yet
Tutorial On Diffusion Models
4 pages
Denoising Autoencoders tr1316
No ratings yet
Denoising Autoencoders tr1316
16 pages
Pseudo Label Final
No ratings yet
Pseudo Label Final
7 pages
1 Autoencoders
No ratings yet
1 Autoencoders
22 pages
Ad3501-Dl-Unit 5 Notes
No ratings yet
Ad3501-Dl-Unit 5 Notes
16 pages
DL M3 Tech
No ratings yet
DL M3 Tech
15 pages
From Denoising Diffusions To Denoising Markov Models
No ratings yet
From Denoising Diffusions To Denoising Markov Models
55 pages
Zhang Learning Deep CNN CVPR 2017 Paper
No ratings yet
Zhang Learning Deep CNN CVPR 2017 Paper
10 pages
Advance Deep Learning - BIT L3
No ratings yet
Advance Deep Learning - BIT L3
45 pages
Intro
No ratings yet
Intro
12 pages
ch14 Autoencoder
No ratings yet
ch14 Autoencoder
42 pages
Dl-Unit 3
No ratings yet
Dl-Unit 3
14 pages
Week 6
No ratings yet
Week 6
4 pages
Lecture 14
No ratings yet
Lecture 14
23 pages
Tutorial On Diffusion Models For Imaging and Vision: Stanley Chan March 28, 2024
No ratings yet
Tutorial On Diffusion Models For Imaging and Vision: Stanley Chan March 28, 2024
51 pages
DL Class5
No ratings yet
DL Class5
23 pages
Lecture 13
No ratings yet
Lecture 13
43 pages
Gen AI Unit 2
100% (1)
Gen AI Unit 2
65 pages
D5 PPT
No ratings yet
D5 PPT
79 pages
Convolutional Autoencoder For Image Denoising
No ratings yet
Convolutional Autoencoder For Image Denoising
11 pages
DAC: Deep Autoencoder-Based Clustering, A General Deep Learning Framework of Representation Learning
No ratings yet
DAC: Deep Autoencoder-Based Clustering, A General Deep Learning Framework of Representation Learning
12 pages
Fast Inference in Sparse Coding Algorithms With Applications To Object Recognition
No ratings yet
Fast Inference in Sparse Coding Algorithms With Applications To Object Recognition
9 pages
Unit5 Autoencoders
No ratings yet
Unit5 Autoencoders
45 pages
Medical Image Denoising Using Convolutional Denoising Autoencoders
No ratings yet
Medical Image Denoising Using Convolutional Denoising Autoencoders
6 pages
Unit 3
No ratings yet
Unit 3
23 pages
Lecture 12
No ratings yet
Lecture 12
35 pages
ACV - Notes - Final
No ratings yet
ACV - Notes - Final
7 pages
Tutorialon Diffusion Modelsfor Imaging and Vision
No ratings yet
Tutorialon Diffusion Modelsfor Imaging and Vision
90 pages
Eneralization in Diffusion Models Arises From Geometry Adaptive Harmonic Representations
No ratings yet
Eneralization in Diffusion Models Arises From Geometry Adaptive Harmonic Representations
25 pages
Gondara 2016
No ratings yet
Gondara 2016
6 pages
Auto Encoders
No ratings yet
Auto Encoders
4 pages
Icip2007 Final Submitted PDF
No ratings yet
Icip2007 Final Submitted PDF
4 pages
Unit 5
No ratings yet
Unit 5
36 pages
Deep Learning Basics Lecture 8 Autoencoder & DBM
No ratings yet
Deep Learning Basics Lecture 8 Autoencoder & DBM
28 pages
DL Lecture8 Autoencoder
No ratings yet
DL Lecture8 Autoencoder
28 pages
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet
Duality
100% (1)
Duality
26 pages
ACT Customization Guide For DesignXplorer
No ratings yet
ACT Customization Guide For DesignXplorer
28 pages
1 BR LLamasoft Supply Chain Guru US
No ratings yet
1 BR LLamasoft Supply Chain Guru US
6 pages
OR HW F02 Soln 1to3
No ratings yet
OR HW F02 Soln 1to3
49 pages
Railway Route Optimization System
No ratings yet
Railway Route Optimization System
3 pages
Circle Packing Survey
No ratings yet
Circle Packing Survey
31 pages
QT Chap5 Optimisation
No ratings yet
QT Chap5 Optimisation
21 pages
Linear Programming Part 1
No ratings yet
Linear Programming Part 1
20 pages
(2010) Automating The Conceptual Design Process
No ratings yet
(2010) Automating The Conceptual Design Process
14 pages
(Ebook) Model-Based Predictive Control: A Practical Approach (Control Series) by J.A. Rossiter ISBN 9780849312915, 0849312914 Download
No ratings yet
(Ebook) Model-Based Predictive Control: A Practical Approach (Control Series) by J.A. Rossiter ISBN 9780849312915, 0849312914 Download
55 pages
Amin Bin Shahrom - Industrial Training Report
No ratings yet
Amin Bin Shahrom - Industrial Training Report
34 pages
Drill Hole Spacing Analysis For Classification and Cost Optimisation
No ratings yet
Drill Hole Spacing Analysis For Classification and Cost Optimisation
13 pages
SGPEHandbook1314 17sept
No ratings yet
SGPEHandbook1314 17sept
59 pages
Transportation Problem Chapter 6
No ratings yet
Transportation Problem Chapter 6
5 pages
Optimal Power Flow Analysis For 23MW Microgrid Using ETAP
No ratings yet
Optimal Power Flow Analysis For 23MW Microgrid Using ETAP
6 pages
Lecture Notes
No ratings yet
Lecture Notes
54 pages
Credit Card Fraud Detection Using A Combined Approach of Genetic Algorithm and Random Forest
No ratings yet
Credit Card Fraud Detection Using A Combined Approach of Genetic Algorithm and Random Forest
4 pages
Simplex Method-2
No ratings yet
Simplex Method-2
9 pages
Shortest Path Algorithms
100% (1)
Shortest Path Algorithms
25 pages
Optimal Design of Hybrid Renewable Energy System For Electrification of A Remote Village in Egypt
No ratings yet
Optimal Design of Hybrid Renewable Energy System For Electrification of A Remote Village in Egypt
6 pages
Yu Shi, Rolf D. Reitz: Article Info
No ratings yet
Yu Shi, Rolf D. Reitz: Article Info
15 pages
Fertilizacion Nitrogenada en Maiz Basada en Materia Organica (Sanchez Roldan)
No ratings yet
Fertilizacion Nitrogenada en Maiz Basada en Materia Organica (Sanchez Roldan)
33 pages
Bia 2
No ratings yet
Bia 2
4 pages
Interactive 3D CAD Generation of Tensegrity Structures by Charalambides, Liapi
No ratings yet
Interactive 3D CAD Generation of Tensegrity Structures by Charalambides, Liapi
6 pages
660006b1e6a76 - 1711277745 - Fpso Offshore WTG Bess Case Study
No ratings yet
660006b1e6a76 - 1711277745 - Fpso Offshore WTG Bess Case Study
13 pages
Dynamic Surrogate Optimization of Vertically Stacked Nanosheet FET Based On Gaussian Process Regression
No ratings yet
Dynamic Surrogate Optimization of Vertically Stacked Nanosheet FET Based On Gaussian Process Regression
9 pages
Basketball Hoop Topology Optimization - Midas NFX
No ratings yet
Basketball Hoop Topology Optimization - Midas NFX
9 pages
Heat Exchangers: Types, Design, and Applications: Types, Design, and Applications, Nova Science Publishers, Incorporated, 2010. Proquest Ebook
100% (1)
Heat Exchangers: Types, Design, and Applications: Types, Design, and Applications, Nova Science Publishers, Incorporated, 2010. Proquest Ebook
245 pages
Tracing Complexity Theory
No ratings yet
Tracing Complexity Theory
25 pages
SCM Chapter 3 Supply Chain
No ratings yet
SCM Chapter 3 Supply Chain
166 pages

A Connection Between Score Matching

Uploaded by

A Connection Between Score Matching

Uploaded by

1

A Connection Between Score Matching

Keywords: autoencoder, energy based models, score matching, denoising, density

Denoising autoencoders have been previously shown to be competitive alternatives

q(x) Unknown true pdf. x ∈ Rd .

obtainable by marginalizing qσ (x̃, x).

• The corrupted version x̃ is encoded into a hidden representation h ∈ Rdh through

• The hidden representation h is decoded into reconstruction xr through affine

JDAEσ (θ) = Eqσ (x̃,x) kdecode(encode(x̃)) − xk2

We refer to this formulation as explicit score matching (ESM).

JESM q ^ JISM q (3)

3.3 Finite Sample Version of Implicit Score Matching

JISM q0 is asymptotically equivalent to JISM q when n → ∞ and hence asymptoti-

JESM q ^ JISM q ^ lim JISM q0 . (5)

where the strength of the additional regularization term is controlled by hyperparameter

4.2 Denoising Score Matching (DSM)

JESM qσ ^ JDSM qσ (11)

JISM qσ ^ JESM qσ ^ JDSM qσ ^ JDAEσ . (15)

In summary, training the denoising autoencoder defined in section 2 is equiva-

Hinton, G. E. (2002). Training products of experts by minimizing contrastive diver-

Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score

Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,

Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics

Hyvärinen, A. (2008). Optimal approximation of signal priors. Neural Computation,

LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université

Lyu, S. (2009). Interpretation and generalization of score matching. In Proceedings of

Movellan, J. R. (2008). Contrastive divergence in gaussian diffusions. Neural Compu-

Seung, S. H. (1998). Learning continuous attractors in recurrent networks. In M. Jordan,

Sohl-Dickstein, J., Battaglino, P., and DeWeese, M. R. (2009). Minimum probability

Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines.

which we can develop as

which we can develop as

Looking at equations 17 and 18 we see that JESM qσ (θ) = JDSM qσ (θ) + C2 − C3 .

You might also like