A Connection Between Score Matching and Denoising Autoencoders
Pascal Vincent
[email protected]
Département d’Informatique, Université de Montréal,
Montréal (QC) H3C 3J7, Canada
1 Introduction
This note uncovers an unsuspected link between the score matching tech-
nique (Hyvärinen, 2005, 2008) for learning the parameters of unnormalized
density models over continuous-valued data, and the training of denois-
ing autoencoders (Vincent, Larochelle, Bengio, & Manzagol, 2008; Vincent,
Larochelle, Lajoie, Bengio, & Manzagol, 2010).
Score matching (SM) is an alternative to the maximum likelihood princi-
ple suitable for unnormalized probability density models whose partition
function is intractable. Its relationship to maximum likelihood has been
investigated by Lyu (2010), who formally relates the Fisher divergence
that yields score matching and the Kullback-Leibler divergence that yields
maximum likelihood. Interestingly, his formal analysis indicates that SM
searches for parameters that are more robust to small-noise perturbations
of the training data. Score matching has also been recast as a special case
under the more general frameworks of generalized score matching (Lyu,
2010; Marlin, Swersky, Chen, & de Freitas, 2010) and minimum probability
flow (Sohl-Dickstein et al., 2009), allowing generalizations of SM to discrete
distributions (Hyvärinen, 2007b; Lyu, 2010; Sohl-Dickstein et al., 2009). The
minimum probability flow paradigm is particularly interesting as it unifies
several recent alternative parameter estimation methods, for both continu-
ous and discrete data, under a single unified view.1 Recently, Kingma and
LeCun (2010) investigated a regularized form of SM that adds a specific reg-
ularization term to the original SM objective. Its relationship to this work is
discussed in detail in section 5.
Denoising autoencoders (DAE) were proposed by Vincent et al. (2008)
as a simple and competitive alternative to the contrastive-divergence-
trained restricted Boltzmann machines (RBMs) used by Hinton, Osin-
dero, and Teh (2006) for pretraining deep networks (Erhan et al., 2010;
Vincent et al., 2010). Previous studies have already pointed out connec-
tions between SM and contrastive divergence (Hyvärinen, 2007a; Sohl-
Dickstein et al., 2009), have connected SM to optimal denoising for
gaussian noise with infinitesimal variance (Hyvärinen, 2008), and have
shown that training a gaussian-binary RBM with SM is equivalent to
training a regular (nondenoising) autoencoder with an additional reg-
ularization term (Swersky, 2010). This note, however, is the first to
recast the training of a DAE as a form of regularized SM. This con-
nection yields insights relevant to both research directions and suggests
a novel parameter estimation technique that has its roots in both DAE
and SM.
We begin with a brief presentation of the DAE architecture for
continuous-valued inputs in section 2 and the SM technique in section 3.
This allows us to introduce our formalism and precise terminology. In sec-
tion 4, we connect the denoising autoencoder objective to SM. We conclude
with a discussion of how our findings advance our understanding of both
approaches.
1 In that framework, score matching, minimum velocity learning (Movellan, 2008), and
certain forms of contrastive divergence (Hinton, 2002; Welling & Hinton, 2002) are all
recast as minimizing the Kullback-Leibler divergence between the data distribution and
the distribution obtained after running, for infinitesimal time, a dynamic that would
transform it into the model distribution (Sohl-Dickstein et al., 2009).
2 Denoising Autoencoders
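The denoising autoencoder considered here corrupts an input x into x̃ with additive isotropic gaussian noise of variance σ², encodes x̃ through a sigmoid hidden layer h = sigmoid(Wx̃ + b), decodes with the tied-weight affine decoder Wᵀh + c, and is trained to minimize the squared reconstruction error against the uncorrupted x (this is the form that reappears in equations 4.7 and 4.8 below). The following NumPy sketch of a one-sample estimate of the resulting objective J_DAEσ is illustrative only; the dimensions, random parameters, and function names are assumptions made for the example.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_loss(W, b, c, x, sigma, rng):
        """One-sample estimate of J_DAE_sigma: corrupt x with gaussian noise,
        encode with a sigmoid layer, decode with the tied-weight affine
        decoder, and return the squared error against the clean x."""
        x_tilde = x + sigma * rng.standard_normal(x.shape)  # sample from q_sigma(x_tilde | x)
        h = sigmoid(W @ x_tilde + b)                         # encoder
        reconstruction = W.T @ h + c                         # tied-weight affine decoder
        return np.sum((reconstruction - x) ** 2)

    # Toy usage with random parameters and a single training example.
    rng = np.random.default_rng(0)
    d, dh = 5, 3
    W = rng.standard_normal((dh, d))
    b, c = np.zeros(dh), np.zeros(d)
    x = rng.standard_normal(d)
    print(dae_loss(W, b, c, x, sigma=0.1, rng=rng))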
3 Score Matching
Score matching applies to unnormalized density models over continuous-valued data, of the form

p(x; \theta) = \frac{1}{Z(\theta)} \exp(-E(x; \theta)),

where E is the energy function and Z(\theta) is the typically intractable partition function. The model's score is \psi(x; \theta) = \frac{\partial \log p(x; \theta)}{\partial x}, which conveniently does not depend on Z(\theta). The idea of explicit score matching is to make this model score match the corresponding score
of the true distribution, \frac{\partial \log q(x)}{\partial x}. The corresponding objective function to be
minimized is the expected squared error between these two vectors:

J_{ESM q}(\theta) = E_{q(x)}\left[ \frac{1}{2} \left\| \psi(x; \theta) - \frac{\partial \log q(x)}{\partial x} \right\|^2 \right].
This appears to require knowing the score of the true distribution, which is unavailable. Hyvärinen (2005), however, shows that the objective can be rewritten as

J_{ESM q}(\theta) = \underbrace{E_{q(x)}\left[ \frac{1}{2} \left\| \psi(x; \theta) \right\|^2 + \sum_{i=1}^{d} \frac{\partial \psi_i(x; \theta)}{\partial x_i} \right]}_{J_{ISM q}(\theta)} + C_1, \qquad (3.1)

where \psi_i(x; \theta) = \psi(x; \theta)_i = \frac{\partial \log p(x; \theta)}{\partial x_i}, and C_1 is a constant that does not
depend on \theta. This yields an implicit SM objective J_{ISM q} that no longer
requires having an explicit score target for q but is nevertheless equivalent
to J_{ESM q}. Hyvärinen (2005) formally shows that provided q(x) and \psi(x; \theta)
satisfy some weak regularity conditions,4 we have

J_{ESM q} \asymp J_{ISM q}, \qquad (3.2)

where ≍ denotes equality up to an additive constant that does not depend on θ (so that both objectives have the same minima in θ).
4 Namely, q(x) and \psi(x; \theta) are differentiable, E_{q(x)}\left[ \left\| \frac{\partial \log q(x)}{\partial x} \right\|^2 \right] is finite, and, for any \theta,
E_{q(x)}\left[ \left\| \psi(x; \theta) \right\|^2 \right] is finite and \lim_{x \to \infty} q(x)\, \psi(x; \theta) = 0.
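To illustrate equation 3.1 concretely, consider the toy energy E(x; μ) = ½‖x − μ‖², for which ψ(x; μ) = μ − x and every diagonal derivative ∂ψ_i/∂x_i equals −1. The following NumPy sketch (a hypothetical illustration, not taken from the original text) estimates J_ISM from samples of q without ever requiring the score of q or the partition function.

    import numpy as np

    def ism_objective(mu, X):
        """Monte Carlo estimate of J_ISM (eq. 3.1, up to the constant C1) for the
        toy model E(x; mu) = 0.5 * ||x - mu||^2, where psi(x; mu) = mu - x and
        d(psi_i)/d(x_i) = -1 for every coordinate i."""
        d = X.shape[1]
        half_sq_score = 0.5 * np.sum((mu - X) ** 2, axis=1)  # 0.5 * ||psi(x)||^2 per sample
        divergence = -d                                       # sum_i d(psi_i)/d(x_i)
        return np.mean(half_sq_score + divergence)

    rng = np.random.default_rng(0)
    X = rng.normal(loc=2.0, scale=1.0, size=(10000, 2))       # samples from q
    # The objective is (approximately) minimized at the sample mean of q,
    # exactly where this gaussian model's maximum likelihood estimate lies.
    print(ism_objective(np.zeros(2), X), ism_objective(X.mean(axis=0), X))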
To its finite-sample version J_{ISM q_0} (the implicit SM objective with the expectation taken over the empirical distribution q_0 of the n training samples x^{(1)}, \ldots, x^{(n)}), Kingma and LeCun (2010) add a regularization term that penalizes the curvature terms at the training points:

J_{ISM reg}(\theta) = J_{ISM q_0}(\theta) + \lambda\, \frac{1}{n} \sum_{t=1}^{n} \sum_{i=1}^{d} \left( \frac{\partial \psi_i(x^{(t)}; \theta)}{\partial x_i} \right)^2. \qquad (3.5)
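To make the penalty of equation 3.5 concrete, the following sketch estimates the regularization term for an arbitrary score function psi, using central finite differences for the diagonal derivatives ∂ψ_i/∂x_i; the function name, finite-difference scheme, and step size are illustrative assumptions (for a specific energy function these derivatives could equally be computed analytically).

    import numpy as np

    def ism_regularizer(psi, X, lam, eps=1e-5):
        """Regularization term of eq. 3.5: lam * (1/n) * sum_t sum_i (d psi_i / d x_i)^2,
        evaluated at the n training points in X (one row per point), with each
        diagonal derivative estimated by a central finite difference."""
        n, d = X.shape
        total = 0.0
        for x in X:
            for i in range(d):
                e = np.zeros(d)
                e[i] = eps
                dpsi_ii = (psi(x + e)[i] - psi(x - e)[i]) / (2 * eps)
                total += dpsi_ii ** 2
        return lam * total / n

    # Toy usage with the score of E(x; mu) = 0.5 * ||x - mu||^2, for which
    # every diagonal derivative is -1, so the penalty is simply lam * d.
    mu = np.array([2.0, -1.0])
    psi = lambda x: mu - x
    X = np.random.default_rng(0).standard_normal((100, 2))
    print(ism_regularizer(psi, X, lam=0.1))   # approximately 0.1 * 2 = 0.2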
4 Linking Score Matching to the Denoising Autoencoder Objective

For gaussian corruption q_\sigma(\tilde{x}|x) \propto \exp\left( -\frac{1}{2\sigma^2} \|\tilde{x} - x\|^2 \right), the score of the corruption kernel is

\frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} = \frac{1}{\sigma^2} (x - \tilde{x}). \qquad (4.4)

The direction (1/σ²)(x − x̃) clearly corresponds to moving from x̃ back toward the clean
sample x, and we want ψ to match it as best it can. This is precisely what the denoising score matching objective does:

J_{DSM q_\sigma}(\theta) = E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) - \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\|^2 \right],

where q_\sigma(x, \tilde{x}) = q_0(x)\, q_\sigma(\tilde{x}|x), with q_0 the empirical distribution of the training samples.
This objective, inspired by denoising autoencoders, is equivalent to ex-
plicit SM. Formally,

J_{DSM q_\sigma}(\theta) \asymp J_{ESM q_\sigma}(\theta). \qquad (4.5)

The proof is given in the appendix and does not depend on the particular form
of q_σ(x̃|x), as long as log q_σ(x̃|x) is differentiable with respect to x̃.
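In its stochastic form, this denoising criterion only requires drawing a corrupted x̃ and regressing the model score ψ(x̃; θ) onto the target (1/σ²)(x − x̃) of equation 4.4; no second derivatives are involved. A minimal sketch under the gaussian corruption assumption, reusing the toy score function from the earlier example (all names are illustrative):

    import numpy as np

    def dsm_loss(psi, x, sigma, rng):
        """Single-sample stochastic estimate of J_DSM_{q_sigma}: corrupt x, then
        measure the squared error between the model score at x_tilde and the
        corruption-kernel score (x - x_tilde) / sigma^2 of eq. 4.4."""
        x_tilde = x + sigma * rng.standard_normal(x.shape)   # sample from q_sigma(x_tilde | x)
        target = (x - x_tilde) / sigma ** 2
        return 0.5 * np.sum((psi(x_tilde) - target) ** 2)

    rng = np.random.default_rng(0)
    mu = np.array([2.0, -1.0])
    psi = lambda x: mu - x            # score of the toy model E(x) = 0.5 * ||x - mu||^2
    x = rng.standard_normal(2)        # a clean training sample
    print(dsm_loss(psi, x, sigma=0.1, rng=rng))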
We now consider a density model of the familiar form

p(x; \theta) = \frac{1}{Z(\theta)} \exp(-E(x; \theta)),

with energy function

E(x; W, b, c) = -\frac{ \langle c, x \rangle - \frac{1}{2}\|x\|^2 + \sum_{j=1}^{d_h} \mathrm{softplus}\left( \langle W_j, x \rangle + b_j \right) }{ \sigma^2 }, \qquad (4.6)

whose parameters are \theta = \{W, b, c\}.
We then have

\psi_i(x; \theta) = \frac{\partial \log p(x; \theta)}{\partial x_i}
= -\frac{\partial E}{\partial x_i}
= \frac{1}{\sigma^2} \left( c_i - x_i + \sum_{j=1}^{d_h} \frac{\partial}{\partial x_i} \mathrm{softplus}\left( \langle W_j, x \rangle + b_j \right) \right)
= \frac{1}{\sigma^2} \left( c_i - x_i + \sum_{j=1}^{d_h} \mathrm{sigmoid}\left( \langle W_j, x \rangle + b_j \right) W_{ji} \right)
= \frac{1}{\sigma^2} \left( c_i - x_i + \left\langle W_{\cdot i},\, \mathrm{sigmoid}(Wx + b) \right\rangle \right),

which in vector form reads

\psi(x; \theta) = \frac{1}{\sigma^2} \left( W^T \mathrm{sigmoid}(Wx + b) + c - x \right). \qquad (4.7)
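As a sanity check of equation 4.7, the closed-form score should coincide with a numerical gradient of −E (the log-partition term does not depend on x). The sketch below does this with central finite differences under assumed random parameters; the dimensions, step size, and tolerances are illustrative choices.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def softplus(a):
        return np.logaddexp(0.0, a)

    def energy(x, W, b, c, sigma):
        """Energy of equation 4.6 (up to the x-independent log Z term)."""
        return -(c @ x - 0.5 * x @ x + np.sum(softplus(W @ x + b))) / sigma ** 2

    def score_closed_form(x, W, b, c, sigma):
        """Equation 4.7: psi(x) = (W^T sigmoid(Wx + b) + c - x) / sigma^2."""
        return (W.T @ sigmoid(W @ x + b) + c - x) / sigma ** 2

    rng = np.random.default_rng(0)
    d, dh, sigma, eps = 4, 3, 0.5, 1e-5
    W, b, c = rng.standard_normal((dh, d)), rng.standard_normal(dh), rng.standard_normal(d)
    x = rng.standard_normal(d)

    numeric = np.array([
        (energy(x - eps * e, W, b, c, sigma) - energy(x + eps * e, W, b, c, sigma)) / (2 * eps)
        for e in np.eye(d)
    ])  # finite-difference estimate of -dE/dx_i
    print(np.allclose(numeric, score_closed_form(x, W, b, c, sigma), atol=1e-5))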
Substituting equations 4.4 and 4.7 into the denoising score matching objective then yields

J_{DSM q_\sigma}(\theta) = E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) - \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\|^2 \right]
= E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \frac{1}{\sigma^2}\left( W^T \mathrm{sigmoid}(W\tilde{x} + b) + c - \tilde{x} \right) - \frac{1}{\sigma^2}(x - \tilde{x}) \right\|^2 \right]
= \frac{1}{2\sigma^4}\, E_{q_\sigma(x, \tilde{x})}\left[ \left\| W^T \mathrm{sigmoid}(W\tilde{x} + b) + c - x \right\|^2 \right]
= \frac{1}{2\sigma^4}\, J_{DAE\sigma}(\theta). \qquad (4.8)
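The proportionality can also be verified numerically: for fixed parameters and the same corrupted samples, the Monte Carlo estimate of J_DSMqσ equals exactly 1/(2σ⁴) times the estimate of J_DAEσ. A small sketch under the same assumed toy setup as above:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    d, dh, sigma = 4, 3, 0.3
    W, b, c = rng.standard_normal((dh, d)), rng.standard_normal(dh), rng.standard_normal(d)
    X = rng.standard_normal((1000, d))                         # clean samples from D_n
    X_tilde = X + sigma * rng.standard_normal(X.shape)         # corrupted versions

    recon = sigmoid(X_tilde @ W.T + b) @ W + c                 # W^T sigmoid(W x_tilde + b) + c
    dae = np.mean(np.sum((recon - X) ** 2, axis=1))            # J_DAE_sigma estimate
    psi = (recon - X_tilde) / sigma ** 2                       # eq. 4.7 evaluated at x_tilde
    target = (X - X_tilde) / sigma ** 2                        # eq. 4.4
    dsm = np.mean(0.5 * np.sum((psi - target) ** 2, axis=1))   # J_DSM_{q_sigma} estimate

    print(np.isclose(dsm, dae / (2 * sigma ** 4)))             # True: dsm == dae / (2 sigma^4)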
5 Discussion
Putting together equations 4.2, 4.5, and 4.8, we can write, for σ > 0,

J_{ISM q_\sigma}(\theta) \asymp J_{ESM q_\sigma}(\theta) \asymp J_{DSM q_\sigma}(\theta) = \frac{1}{2\sigma^4}\, J_{DAE\sigma}(\theta). \qquad (5.1)

In other words, training the denoising autoencoder of section 2 amounts to performing a
form of score matching, explicit or implicit, on the Parzen density estimate q_σ with the
energy function of equation 4.6; samples from q_σ are obtained simply by
corrupting samples from D_n. And it may be carried out with any of these
four optimization objective formulations.5
We introduced the kernel-smoothed empirical distribution q_σ to show
a connection between SM and a simple DAE. Interestingly, the regular-
ized SM criterion J_{ISM reg} (see equation 3.5) that Kingma and LeCun (2010)
recently introduced with the very different motivation of curing possible
instabilities was derived by approximating what amounts to J_{ISM q_σ}.6 From
this perspective, our four q_σ-based criteria in equation 5.1, including the
DAE, may be seen as alternative approximation-free forms of regularized
score matching. A key difference is that, as is done with DAE training, we
would optimize stochastic versions of these approximation-free regularized
criteria by corrupting training examples (i.e., sampling from q_σ), whereas
Kingma and LeCun (2010) optimize an approximation of J_{ISM q_σ} centered on
the training samples only (i.e., sampling from q_0). Also, whereas J_{ISM reg}, like
the other ISM criteria, requires computing second derivatives, the stochas-
tic version of our novel J_{DSM q_σ} criterion does not, and thus appears much
simpler to implement.
Note that the energy function in equation 4.6 is particular in that its
scaling, which we may call its temperature, is chosen to match the corrupt-
ing noise level σ². This is required only to establish the last equivalence
with the specific DAE we considered. But regarding the generic objectives
J_{ISM q_σ} ≍ J_{ESM q_σ} ≍ J_{DSM q_σ}, their σ may in principle be chosen regardless
of the form or temperature of whatever energy function is to be learned.
Interestingly, the energy function in equation 4.6, which we designed to
yield the equivalence with our denoising autoencoder objective, happens
to be very similar to the free energy of a restricted Boltzmann machine with
binary hidden units and gaussian visible units (Welling, Rosen-Zvi, & Hin-
ton, 2005; Bengio, Lamblin, Popovici, & Larochelle, 2007; Swersky, 2010).
The major difference is that this latter free energy does not have a global
temperature scaling of the whole expression.7 We designed equation 4.6 to
exactly yield the denoising version of the classic autoencoder described in
section 2. But with tied weights, it may be preferable to allow an extra pos-
itive scaling parameter α for the reconstruction, so that there at least exists
an equivalent reparameterization of the model for scaled input data.8
5 Note, however, that while these q_σ-based objectives are formally equivalent, their
stochastic gradient descent optimization, based on sampling a limited number of
corrupted examples, is likely to behave differently for each objective.
6 A first-order Taylor expansion and a diagonal Hessian approximation are used.
7 Specifically, in the free energy of a gaussian-binary RBM, the softplus terms are not
scaled by a global 1/σ² temperature factor.
8 For example, if the input data were scaled by a factor of 100, one could obtain the same
hidden representation as before by dividing W by 100. But because of the tied weights,
this means that the reconstruction would also be divided by 100 (i.e., there is no equiv-
alent reparameterization), unless it can be compensated by an additional scaling of the
reconstruction by a parameter α.
Allowing an adjustable positive scaling α_j for the contribution of each hidden unit, together with a model temperature parameter σ_m decoupled from the corruption noise level σ, yields the more general energy function

E(x; W, b, c, \alpha, \sigma_m) = -\frac{1}{\sigma_m^2} \left( \langle c, x \rangle - \frac{1}{2}\|x\|^2 + \sum_{j=1}^{d_h} \alpha_j\, \mathrm{softplus}\left( \langle W_j, x \rangle + b_j \right) \right),

whose parameters are \theta = \{W, b, c, \alpha, \sigma_m\}.
9 Previous experiments on DAE showed that the best models, judged by their ability to
extract useful features, were obtained for nonnegligible values of the noise parameters.
Moreover, this way of controlling the model's capacity worked much better than either
reducing the hidden layer size or using traditional weight decay.
Appendix

The explicit score matching criterion using the Parzen density estimator is
defined in equation 4.1 as

J_{ESM q_\sigma}(\theta) = E_{q_\sigma(\tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) - \frac{\partial \log q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\|^2 \right],

which expands into

J_{ESM q_\sigma}(\theta) = E_{q_\sigma(\tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) \right\|^2 \right] - S(\theta) + C_2,

where C_2 = E_{q_\sigma(\tilde{x})}\left[ \frac{1}{2} \left\| \frac{\partial \log q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\|^2 \right] is a constant that does not depend on \theta,
and
S(\theta) = E_{q_\sigma(\tilde{x})}\left[ \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\rangle \right]
= \int_{\tilde{x}} q_\sigma(\tilde{x}) \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\rangle d\tilde{x}
= \int_{\tilde{x}} q_\sigma(\tilde{x}) \left\langle \psi(\tilde{x}; \theta), \frac{ \frac{\partial q_\sigma(\tilde{x})}{\partial \tilde{x}} }{ q_\sigma(\tilde{x}) } \right\rangle d\tilde{x}
= \int_{\tilde{x}} \left\langle \psi(\tilde{x}; \theta), \frac{\partial q_\sigma(\tilde{x})}{\partial \tilde{x}} \right\rangle d\tilde{x}
= \int_{\tilde{x}} \left\langle \psi(\tilde{x}; \theta), \frac{\partial}{\partial \tilde{x}} \int_x q_0(x)\, q_\sigma(\tilde{x}|x)\, dx \right\rangle d\tilde{x}
= \int_{\tilde{x}} \left\langle \psi(\tilde{x}; \theta), \int_x q_0(x)\, \frac{\partial q_\sigma(\tilde{x}|x)}{\partial \tilde{x}}\, dx \right\rangle d\tilde{x}
= \int_{\tilde{x}} \left\langle \psi(\tilde{x}; \theta), \int_x q_0(x)\, q_\sigma(\tilde{x}|x)\, \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}}\, dx \right\rangle d\tilde{x}
= \int_{\tilde{x}} \int_x q_0(x)\, q_\sigma(\tilde{x}|x) \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\rangle dx\, d\tilde{x}
= \int_{\tilde{x}} \int_x q_\sigma(\tilde{x}, x) \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\rangle dx\, d\tilde{x}
= E_{q_\sigma(\tilde{x}, x)}\left[ \left\langle \psi(\tilde{x}; \theta), \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\rangle \right].

Substituting this expression for S(\theta) back into the expansion of J_{ESM q_\sigma} and completing the square under E_{q_\sigma(x, \tilde{x})} yields

J_{ESM q_\sigma}(\theta) = E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \psi(\tilde{x}; \theta) - \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\|^2 \right] + C_2 - C_3 = J_{DSM q_\sigma}(\theta) + C_2 - C_3,

where C_3 = E_{q_\sigma(x, \tilde{x})}\left[ \frac{1}{2} \left\| \frac{\partial \log q_\sigma(\tilde{x}|x)}{\partial \tilde{x}} \right\|^2 \right] is also a constant that does not depend on \theta. The two criteria thus differ only by a constant, that is, J_{ESM q_σ} ≍ J_{DSM q_σ}.
Acknowledgments
I thank Yoshua Bengio, Olivier Delalleau, and the other members of the Lisa
Lab who provided timely feedback, as well as two anonymous referees.
References
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise train-
ing of deep networks. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in
neural information processing systems, 19 (pp. 153–160). Cambridge, MA: MIT Press.
Duane, S., Kennedy, A., Pendleton, B., & Roweth, D. (1987). Hybrid Monte Carlo.
Phys. Lett. B, 195, 216–222.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., & Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? Journal of Machine
Learning Research, 11, 625–660.
Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulie, F. (1987). Mémoires
associatives distribuées. In Proceedings of COGNITIVA 87. Paris: Cesta-Afcet.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive
divergence. Neural Computation, 14, 1771–1800.
Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep
belief nets. Neural Computation, 18, 1527–1554.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6, 695–709.
Hyvärinen, A. (2007a). Connections between score matching, contrastive diver-
gence, and pseudolikelihood for continuous-valued variables. IEEE Transactions
on Neural Networks, 18, 1529–1531.
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics
and Data Analysis, 51, 2499–2512.
Hyvärinen, A. (2008). Optimal approximation of signal priors. Neural Computation,
20(12), 3087–3110.
Kingma, D., & LeCun, Y. (2010). Regularized estimation of image statistics by score
matching. In J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R. Zemel, & A. Culotta
(Eds.), Advances in neural information processing systems, 23 (pp. 1126–1134).
Cambridge, MA: MIT Press.
LeCun, Y. (1987). Modèles connexionnistes de l’apprentissage. Unpublished doctoral
dissertation, Université de Paris VI.
Lyu, S. (2010). Interpretation and generalization of score matching. In Proceedings of
the 25th Conference on Uncertainty in Artificial Intelligence (UAI’09). Corvallis, OR:
AUAI Press.
Marlin, B., Swersky, K., Chen, B., & de Freitas, N. (2010). Inductive principles
for restricted Boltzmann machine learning. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10) (Vol. 9,
pp. 509–516). N.p.: JMLR.
Movellan, J. R. (2008). Contrastive divergence in gaussian diffusions. Neural
Computation, 20(9), 2238–2252.
Seung, S. H. (1998). Learning continuous attractors in recurrent networks. In
M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing
systems, 10 (pp. 654–660). Cambridge, MA: MIT Press.
Sohl-Dickstein, J., Battaglino, P., & DeWeese, M. R. (2009). Minimum probability flow
learning (Tech. Rep.). arXiv:0906.4779.
Swersky, K. (2010). Inductive principles for learning restricted Boltzmann machines.
Unpublished master’s thesis, University of British Columbia.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In W. W. Cohen,
A. McCallum, & S. T. Roweis (Eds.), Proceedings of the 25th Annual International
Conference on Machine Learning (pp. 1096–1103). New York: ACM.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with
a local denoising criterion. Journal of Machine Learning Research, 11, 3371–3408.
Welling, M., & Hinton, G. E. (2002). A new learning algorithm for mean field
Boltzmann machines. In ICANN ’02: Proceedings of the International Conference on
Artificial Neural Networks (pp. 351–357). Berlin: Springer-Verlag.
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2005). Exponential family harmoniums
with an application to information retrieval. In L. Saul, Y. Weiss, & L. Bottou
(Eds.), Advances in neural information processing systems, 17 (pp. 1481–1488).
Cambridge, MA: MIT Press.