Variational Bayes
1. Introduction
A Gibbs posterior, also known as a PAC-Bayesian or pseudo-posterior, is a probability
distribution for random estimators of the form:
\[
\hat\rho_\lambda(d\theta) = \frac{\exp[-\lambda r_n(\theta)]}{\int \exp[-\lambda r_n]\,d\pi}\,\pi(d\theta).
\]
More precise definitions will follow, but for now, θ may be interpreted as a parameter
(in a finite or infinite-dimensional space), rn (θ) as an empirical measure of risk (e.g.
prediction error), and π(dθ) a prior distribution.
We will follow in this paper the PAC (Probably Approximately Correct)-Bayesian
approach, which originates from machine learning [Shawe-Taylor and Williamson, 1997,
McAllester, 1998, Catoni, 2004]; see Catoni [2007] for an exhaustive study, and Jiang and
Tanner [2008], Yang [2004], Zhang [2006], Dalalyan and Tsybakov [2008] for related per-
spectives (such as the aggregation of estimators in the last 3 papers). There, ρ̂λ appears
as the probability distribution that minimises the upper bound of an oracle inequality
on the risk of random estimators. The PAC-Bayesian approach offers sharp theoretical
guarantees on the properties of such estimators, without assuming a particular model
for the data generating process.
The Gibbs posterior has also appeared in other places, and under different motiva-
tions: in Econometrics, as a way to avoid direct maximisation in moment estimation
[Chernozhukov and Hong, 2003]; and in Bayesian decision theory, as a way to define
a Bayesian posterior distribution when no likelihood has been specified [Bissiri et al.,
2013]. Another well-known connection, although less directly useful (for Statistics), is
with thermodynamics, where rn is interpreted as an energy function, and λ as the inverse
of a temperature.
Whatever the perspective, estimators derived from Gibbs posteriors usually show ex-
cellent performance in diverse tasks, such as classification, regression, ranking, and so
on, yet their actual implementation is still far from routine. The usual recommendation
[Dalalyan and Tsybakov, 2012, Alquier and Biau, 2013, Guedj and Alquier, 2013] is to
sample from a Gibbs posterior using MCMC [Markov chain Monte Carlo, see e.g. Green
et al., 2015]; but constructing an efficient MCMC sampler is often difficult, and even
efficient implementations are often too slow for practical uses when the dataset is very
large.
In this paper, we consider instead VB (Variational Bayes) approximations, which have
been initially developed to provide fast approximations of ‘true’ posterior distributions
(i.e. Bayesian posterior distributions for a given model); see Jordan et al. [1999], MacKay
[2002] and Chap. 10 in Bishop [2006].
Our main results are as follows: when PAC-Bayes bounds are available (mainly, when
a strong concentration inequality holds), replacing the Gibbs posterior by a variational
approximation does not affect the rate of convergence to the best possible prediction,
provided that the Kullback-Leibler divergence between the posterior and the
approximation is itself controlled in an appropriate way.
We also provide empirical bounds, which may be computed from the data so as to
ascertain the actual performance of estimators obtained by variational approximation.
All these results give, we believe, strong incentives to recommend Variational Bayes as
the default approach to approximate Gibbs posteriors.
The rest of the paper is organized as follows. In Section 2 we introduce the notations
and assumptions. In Section 3 we introduce variational approximations and the corre-
sponding algorithms. The main results are provided in general form in Section 4: in
Subsection 4.1, we give results under the assumption that a Hoeffding type inequality
holds (slow rates) and in Subsection 4.2, we give results under the assumption that a
Bernstein type inequality holds (fast rates). Note that, for the sake of brevity, we will
refer to these settings as the “Hoeffding assumption” and the “Bernstein assumption”,
even though this terminology is not standard. We then apply these results in various
settings: classification (Section 5), convex classification (Section 6), ranking (Section 7), and matrix
completion (Section 8). In each case, we show how to specialise the general results of
Section 4 to the considered application, so as to obtain the properties of the VB approx-
imation, and we also discuss its numerical implementation. All the proofs are collected
in the Appendix.
2. PAC-Bayesian framework
We observe a sample (X1 , Y1 ), . . . , (Xn , Yn ), taking values in X × Y, where the pairs
(Xi , Yi ) have the same distribution P . We will assume explicitly that the (Xi , Yi )’s are
independent in several of our specialised results, but we do not make this assumption
at this stage, as some of our general results, and more generally the PAC-Bayesian
theory, may be extended to dependent observations; see e.g. Alquier and Li [2012]. The
label set Y is always a subset of R. A set of predictors is chosen by the statistician:
{f_θ : X → R, θ ∈ Θ}. For example, in linear regression, we may have f_θ(x) = ⟨θ, x⟩, the
inner product of X = R^d, while in classification, one may have f_θ(x) = 1_{⟨θ,x⟩>0} ∈ {0, 1}.
We assume we have at our disposal a risk function R(θ); typically R(θ) is a measure
of the prediction error. We set $\bar R = R(\bar\theta)$, where $\bar\theta \in \arg\min_\Theta R$; i.e. $f_{\bar\theta}$ is an optimal
predictor. We also assume that the risk function R(θ) has an empirical counterpart
$r_n(\theta)$, and set $\bar r_n = r_n(\bar\theta)$. Often, R and $r_n$ are based on a loss function ℓ : R² → R; i.e.
\[
R(\theta) = \mathbb{E}\big[\ell(Y, f_\theta(X))\big] \quad\text{and}\quad r_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell\big(Y_i, f_\theta(X_i)\big).
\]
(In this paper, the symbol E will always denote the expectation with respect to the (unknown) law P of the (X_i, Y_i)'s.)
There are situations however (e.g. ranking), where R and rn have a different form.
We define a prior probability measure π(·) on the set Θ (equipped with the standard
σ-algebra for the considered context), and we let M1+ (Θ) denote the set of all probability
measures on Θ.
\[
\hat\rho_\lambda(d\theta) = \frac{\exp[-\lambda r_n(\theta)]}{\int \exp[-\lambda r_n]\,d\pi}\,\pi(d\theta).
\]
The pseudo-posterior ρ̂λ (also known as the Gibbs posterior, Catoni [2004, 2007], or
the exponentially weighted aggregate, Dalalyan and Tsybakov [2008]) plays a central
role in the PAC-Bayesian approach. It is obtained as the distribution that minimises
the upper bound of a certain oracle inequality applied to random estimators. Practical
estimators (predictors) may be derived from the pseudo-posterior, by e.g. taking the
expectation, or sampling from it. Of course, when exp[−λrn (θ)] may be interpreted as
the likelihood of a certain model, ρ̂λ becomes a Bayesian posterior distribution, but we
will not restrict our attention to this particular case.
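To fix ideas, here is a minimal numerical illustration (not taken from the paper) of how a Gibbs posterior reweights a prior: Θ is discretised into a grid, r_n is the empirical 0-1 risk of a simple threshold classifier, and π is uniform over the grid. The data, the classifier and the value of λ are placeholders chosen for the example only.

```python
# Toy Gibbs posterior on a discretised parameter space (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(int)        # labels in {0, 1}

thetas = np.linspace(-2.0, 2.0, 401)                          # discretised Theta
r_n = np.array([np.mean((x > t).astype(int) != y) for t in thetas])  # empirical risk
prior = np.full(thetas.shape, 1.0 / thetas.size)              # uniform prior pi

lam = 100.0                                                   # inverse temperature lambda
log_w = -lam * r_n + np.log(prior)
log_w -= log_w.max()                                          # numerical stability
rho_hat = np.exp(log_w) / np.exp(log_w).sum()                 # Gibbs posterior weights

print(np.sum(rho_hat * thetas), thetas[np.argmax(rho_hat)])   # posterior mean and mode
```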
The following ‘theoretical’ counterpart of ρ̂λ will prove useful to state results.
\[
\pi_\lambda(d\theta) = \frac{\exp[-\lambda R(\theta)]}{\int \exp[-\lambda R]\,d\pi}\,\pi(d\theta).
\]
We will derive PAC-Bayesian bounds on predictions obtained by variational approx-
imations of ρ̂λ under two types of assumptions: a Hoeffding-type assumption, from
which we may deduce slow rates of convergence (Subsection 4.1), and a Bernstein-type
assumption, from which we may obtain fast rates of convergence (Subsection 4.2).
Definition 2.3 We say that a Hoeffding assumption is satisfied for prior π when there
is a function f and an interval I ⊂ R*₊ such that, for any λ ∈ I,
\[
\pi\Big(\mathbb{E}\exp\big\{\lambda[R(\theta)-r_n(\theta)]\big\}\Big) \le \exp\big[f(\lambda,n)\big]
\quad\text{and}\quad
\pi\Big(\mathbb{E}\exp\big\{\lambda[r_n(\theta)-R(\theta)]\big\}\Big) \le \exp\big[f(\lambda,n)\big]. \tag{1}
\]
Definition 2.4 We say that a Bernstein assumption is satisfied for prior π when there
is a function g and an interval I ⊂ R*₊ such that, for any λ ∈ I,
\[
\pi\Big(\mathbb{E}\exp\big\{\lambda[R(\theta)-\bar R]-\lambda[r_n(\theta)-\bar r_n]\big\}\Big) \le \pi\Big(\exp\big\{g(\lambda,n)[R(\theta)-\bar R]\big\}\Big)
\]
and
\[
\pi\Big(\mathbb{E}\exp\big\{\lambda[r_n(\theta)-\bar r_n]-\lambda[R(\theta)-\bar R]\big\}\Big) \le \pi\Big(\exp\big\{g(\lambda,n)[R(\theta)-\bar R]\big\}\Big). \tag{2}
\]
that allow the use of the more general form of the margin assumption of Mammen and
Tsybakov [1999], Tsybakov [2004]. PAC-Bayes bounds in this context are provided by Catoni
[2007]. However, the techniques involved would require many pages to be described, so we
decided to focus on the cases κ = 0 and κ = 1 to keep the exposition simple.
3. Numerical approximations of the pseudo-posterior
3.1. Monte Carlo
As already explained in the introduction, the usual approach to approximate ρ̂λ is
MCMC (Markov chain Monte Carlo) sampling. Ridgway et al. [2014] proposed tem-
pering SMC (Sequential Monte Carlo, e.g. Del Moral et al. [2006]) as an alternative
to MCMC to sample from Gibbs posteriors: one samples sequentially from ρ̂λt , with
0 = λ0 < · · · < λT = λ where λ is the desired temperature. One advantage of this
approach is that it makes it possible to contemplate different values of λ, and choose
one by e.g. cross-validation. Another advantage is that such an algorithm requires little
tuning; see Appendix B for more details on the implementation of tempering SMC. We
will use tempering SMC as our gold standard in our numerical studies.
SMC and related Monte Carlo algorithms tend to be too slow for practical use in
situations where the sample size is large, the dimension of Θ is large, or fθ is expen-
sive to compute. This motivates the use of fast, deterministic approximations, such as
Variational Bayes, which we describe in the next section.
The VB approximation of ρ̂_λ within a family F ⊂ M¹₊(Θ) is defined as
\[
\tilde\rho_\lambda = \arg\min_{\rho\in\mathcal F} K(\rho, \hat\rho_\lambda),
\]
where K(ρ, ρ̂_λ) denotes the KL (Kullback-Leibler) divergence between ρ and ρ̂_λ:
$K(m,\mu) = \int \log\big[\frac{dm}{d\mu}\big]\,dm$ if m ≪ µ (i.e. µ dominates m), K(m, µ) = +∞ otherwise.
The difficulty is to find a family F (a) which is large enough, so that ρ̃λ may be close
to ρ̂λ , and (b) such that computing ρ̃λ is feasible. We now review two types of families
popular in the VB literature.
• Mean field VB: for a certain decomposition Θ = Θ1 × . . . × Θd , F is the set of
product probability measures
\[
\mathcal F^{MF} = \left\{ \rho \in \mathcal M_+^1(\Theta) :\ \rho(d\theta) = \prod_{i=1}^d \rho_i(d\theta_i),\ \forall i\in\{1,\dots,d\},\ \rho_i\in\mathcal M_+^1(\Theta_i)\right\}. \tag{3}
\]
The infimum of the KL divergence K(ρ, ρ̂_λ), relative to ρ = ∏_i ρ_i, satisfies the
following fixed-point condition [Parisi, 1988, Bishop, 2006, Chap. 10]:
\[
\forall j\in\{1,\dots,d\},\quad \rho_j(d\theta_j) \propto \exp\left(\int\big\{-\lambda r_n(\theta)+\log\pi(\theta)\big\}\prod_{i\neq j}\rho_i(d\theta_i)\right)d\theta_j. \tag{4}
\]
This leads to a natural algorithm where we update successively every ρ_j until
stabilisation.
• Parametric family:
\[
\mathcal F^{P} = \left\{ \rho \in \mathcal M_+^1(\Theta) :\ \rho(d\theta) = f(\theta; m)\,d\theta,\ m \in \mathcal M \right\},
\]
where {f(·; m) : m ∈ M} is a parametric family of probability densities (e.g. Gaussian densities parametrised by their mean and covariance).
For any ρ ≪ π, equation (8) in Appendix A gives
\[
-\log\int\exp(-\lambda r_n)\,d\pi = \lambda\int r_n\,d\rho + K(\rho,\pi) - K(\rho,\hat\rho_\lambda).
\]
Since the left-hand side does not depend on ρ, one sees that ρ̃_λ, which minimises K(ρ, ρ̂_λ)
over F, is also the minimiser of:
\[
\tilde\rho_\lambda = \arg\min_{\rho\in\mathcal F}\left\{\int r_n(\theta)\rho(d\theta) + \frac{1}{\lambda}K(\rho,\pi)\right\}.
\]
This quantity will appear frequently in the sequel in the form of an empirical upper bound.
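As an illustration of this variational objective, the following sketch minimises ∫ r_n dρ + K(ρ, π)/λ over an isotropic Gaussian family ρ = N(m, s²I), with prior π = N(0, ϑ²I), using Monte Carlo with common random numbers and a derivative-free optimiser. Everything in it (data, risk, constants) is a hypothetical toy setting, not the paper's experiments.

```python
# Minimal sketch of parametric (Gaussian) VB for a Gibbs posterior.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))
Y = np.sign(X @ np.array([1.0, -1.0, 0.5]) + 0.5 * rng.normal(size=n))   # labels in {-1, 1}

def r_n(theta):
    """Empirical 0-1 risk of the linear classifier sign(<theta, x>)."""
    return np.mean(np.sign(X @ theta) != Y)

lam, vartheta2, n_mc = 200.0, 1.0, 200
eps = rng.normal(size=(n_mc, d))                 # common random numbers across evaluations

def objective(params):
    m, log_s = params[:d], params[d]
    s = np.exp(log_s)
    thetas = m + s * eps                          # Monte Carlo draws from N(m, s^2 I)
    expected_risk = np.mean([r_n(t) for t in thetas])
    kl = 0.5 * (d * s**2 / vartheta2 + m @ m / vartheta2 - d
                + d * np.log(vartheta2 / s**2))   # K(N(m, s^2 I), N(0, vartheta2 I))
    return expected_risk + kl / lam

res = minimize(objective, x0=np.zeros(d + 1), method="Nelder-Mead",
               options={"maxiter": 5000})
print(res.fun, res.x[:d], np.exp(res.x[d]))
```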
4. General results
This section gives our general results, under either a Hoeffding assumption (Definition
2.3) or a Bernstein assumption (Definition 2.4), on risk bounds for the variational
approximation, and on how they relate to risk bounds for Gibbs posteriors. These results
will be specialised to several learning problems in the following sections.
4.1. Bounds under the Hoeffding assumption
This result is a simple variant of a result in Catoni [2007] but, for the sake of
completeness, its proof is given in Appendix A. It gives us an upper bound on the risk
of both the pseudo-posterior (take ρ = ρ̂_λ) and its variational approximation (take
ρ = ρ̃_λ). These bounds may be computed from the data, and therefore provide a simple
way to evaluate the performance of the corresponding procedure, in the spirit of the
first PAC-Bayesian inequalities [Shawe-Taylor and Williamson, 1997, McAllester, 1998,
1999]. However, these bounds do not provide the rate of convergence of these estimators.
For this reason, we also provide oracle-type inequalities.
Theorem 4.2 Assume that the Hoeffding assumption is satisfied (Definition 2.3). For
any ε > 0, with probability at least 1 − ε we have simultaneously
\[
\int R\,d\hat\rho_\lambda \;\le\; B_\lambda\big(\mathcal M_+^1(\Theta)\big) := \inf_{\rho\in\mathcal M_+^1(\Theta)}\left\{\int R\,d\rho + 2\,\frac{f(\lambda,n)+K(\rho,\pi)+\log\frac{2}{\varepsilon}}{\lambda}\right\}
\]
and
\[
\int R\,d\tilde\rho_\lambda \;\le\; B_\lambda(\mathcal F) := \inf_{\rho\in\mathcal F}\left\{\int R\,d\rho + 2\,\frac{f(\lambda,n)+K(\rho,\pi)+\log\frac{2}{\varepsilon}}{\lambda}\right\}.
\]
Moreover,
\[
B_\lambda(\mathcal F) = B_\lambda\big(\mathcal M_+^1(\Theta)\big) + \frac{2}{\lambda}\inf_{\rho\in\mathcal F} K\big(\rho,\pi_{\lambda/2}\big).
\]
able to obtain explicit expressions for the right-hand side of these inequalities in various
models, and thus to obtain rates of convergence. This will be done in the remaining
sections. This leads to the second interest of this result: if there is a λ = λ(n) that leads
to Bλ (M1+ (Θ)) ≤ R + sn with sn → 0 for the pseudo-posterior ρ̂λ , then we only have
to prove that there is a ρ ∈ F such that K(ρ, πλ )/λ ≤ csn for some constant c > 0 to
ensure that the VB approximation ρ̃λ also reaches the rate sn .
We will see in the following sections several examples where the approximation does
not deteriorate the rate of convergence. But first let us show the equivalent oracle
inequality under the Bernstein assumption.
4.2. Bounds under the Bernstein assumption
In this context the empirical bound on the risk would depend on the minimal achievable
risk r̄n , and cannot be computed explicitly. We give the oracle inequality for both the
Gibbs posterior and its VB approximation in the following theorem.
Theorem 4.3 Assume that the Bernstein assumption is satisfied (Definition 2.4). As-
sume that λ > 0 satisfies λ − g(λ, n) > 0. Then for any ε > 0, with probability at least
1 − ε we have simultaneously:
\[
\int R\,d\hat\rho_\lambda - \bar R \;\le\; \bar B_\lambda\big(\mathcal M_+^1(\Theta)\big),
\qquad
\int R\,d\tilde\rho_\lambda - \bar R \;\le\; \bar B_\lambda(\mathcal F).
\]
In addition,
\[
\bar B_\lambda(\mathcal F) = \bar B_\lambda\big(\mathcal M_+^1(\Theta)\big) + \frac{2}{\lambda-g(\lambda,n)}\inf_{\rho\in\mathcal F} K\Big(\rho,\pi_{\frac{\lambda+g(\lambda,n)}{2}}\Big).
\]
The main difference with Theorem 4.2 is that the function R(·) is replaced by R(·) − R̄.
This is a well-known way to obtain better rates of convergence.
5. Application to classification
5.1. Preliminaries
In all this section, we assume that Y = {0, 1} and we consider linear classification: Θ =
X = R^d, f_θ(x) = 1_{⟨θ,x⟩≥0}. We put $r_n(\theta) = \frac1n\sum_{i=1}^n \mathbb{1}\{f_\theta(X_i)\neq Y_i\}$, R(θ) = P(Y ≠ f_θ(X)),
and assume that the (X_i, Y_i), i = 1, . . . , n, are i.i.d. In this setting, it is well known that the
Hoeffding assumption always holds. We state as a reminder the following lemma.
Lemma 5.2 Assume that Mammen and Tsybakov's margin assumption is satisfied, i.e.
there is a constant C such that
\[
\mathbb{E}\Big[\big(\mathbb{1}\{f_\theta(X)\neq Y\} - \mathbb{1}\{f_{\bar\theta}(X)\neq Y\}\big)^2\Big] \le C\,[R(\theta)-\bar R].
\]
Then the Bernstein assumption (2) is satisfied with $g(\lambda,n) = \frac{C\lambda^2}{2n-\lambda}$.
Remark 5.1 We refer the reader to Tsybakov [2004] for a proof that
\[
\mathbb{P}\big(0 < |\langle\theta,X\rangle| \le t\big) \le C' t
\]
for some constant C′ > 0 implies the margin assumption. In words, when X is not
likely to be in the region ⟨θ, X⟩ ≃ 0, where points are hard to classify, the problem
becomes easier and the classification rate can be improved.
We consider three nested families of Gaussian approximations,
\[
\mathcal F_1 = \big\{\Phi_{m,\sigma^2}: m\in\mathbb R^d, \sigma^2\in\mathbb R_+\big\},\quad
\mathcal F_2 = \big\{\Phi_{m,\sigma^2}: m\in\mathbb R^d, \sigma^2\in(\mathbb R_+)^d\big\},\quad
\mathcal F_3 = \big\{\Phi_{m,\Sigma}: m\in\mathbb R^d, \Sigma\ \text{positive definite}\big\},
\]
where Φ_{m,σ²} is the Gaussian distribution N_d(m, σ²I_d) in the first case, Φ_{m,σ²} is N_d(m, diag(σ²)) in the second, and Φ_{m,Σ} is N_d(m, Σ). Obviously, F1 ⊂ F2 ⊂ F3 ⊂ M¹₊(Θ), and therefore
\[
B_\lambda(\mathcal F_3) \le B_\lambda(\mathcal F_2) \le B_\lambda(\mathcal F_1). \tag{6}
\]
Note that, for the sake of simplicity, we will use the following classical notations in the
rest of the paper: ϕ(·) is the density of N (0, 1) w.r.t. the Lebesgue measure, and Φ(·)
the corresponding c.d.f. The rest of Section 5 is organized as follows. In Subsection 5.3,
we calculate explicitly Bλ (F2 ) and Bλ (F1 ). Thanks to (6) this also gives an upper bound
on Bλ (F3 ) and proves the validity of the three types of Gaussian approximations. Then,
we give details on algorithms to compute the variational approximation based on F2 and
F3 , and provide a numerical illustration on real data.
Corollary 5.3 For any ε > 0, with probability at least 1 − ε we have, for any m ∈ Rd ,
σ 2 ∈ (R+ )d ,
\[
\int R\,d\Phi_{m,\sigma^2} \;\le\; \int r_n\,d\Phi_{m,\sigma^2} + \frac{\lambda}{2n}
+ \frac{\sum_{i=1}^d\left[\frac12\log\frac{\vartheta^2}{\sigma_i^2}+\frac{\sigma_i^2}{2\vartheta^2}\right]+\frac{\|m\|^2}{2\vartheta^2}-\frac d2+\log\frac1\varepsilon}{\lambda}.
\]
We now want to apply Theorem 4.2 in this context. In order to do so, we introduce
an additional assumption.
Definition 5.1 We say that Assumption A1 is satisfied when there is a constant c > 0
such that, for any (θ, θ′) ∈ Θ² with ‖θ‖ = ‖θ′‖ = 1, P(⟨X, θ⟩⟨X, θ′⟩ < 0) ≤ c‖θ − θ′‖.
Note that this is not a stringent assumption. For example, it is satisfied as soon as
X/kXk has a bounded density on the unit sphere.
Corollary 5.4 Assume that Assumption A1 holds and that the VB approximation is done
on either F1, F2 or F3. Take λ = √(nd) and ϑ = 1/√d. Then, for any ε > 0, with probability at least 1 − ε,
\[
\left.\begin{array}{c}\int R\,d\hat\rho_\lambda\\[2pt]\int R\,d\tilde\rho_\lambda\end{array}\right\}
\;\le\; \bar R + \sqrt{\frac dn}\,\log\big(4ne^2\big) + \frac{c}{\sqrt n} + \sqrt{\frac{d}{4n^3}} + \frac{2\log\frac2\varepsilon}{\sqrt{nd}}.
\]
See the appendix for a proof. Note also that the values λ = √(nd) and ϑ = 1/√d allow one to
derive this almost optimal rate of convergence, but are not necessarily the best choices
in practice.
Remark 5.2 Note that Assumption A1 is not necessary to obtain oracle inequalities on
the risk integrated under ρ̂λ . We refer the reader to Chapter 1 in Catoni [2007] for such
assumption-free bounds. However, it is clear that without this assumption the shape of ρ̂λ
and ρ̃λ might be very different. Thus, it seems reasonable to require that A1 is satisfied
for the approximation of ρ̂λ by ρ̃λ to make sense.
We finally provide an application of Theorem 4.3. Under the additional constraint
that the margin assumption is satisfied, we obtain a better rate.
Corollary 5.5 Assume that the VB approximation is done on either F1, F2 or F3. Under
Assumption A1 (Definition 5.1) and under Mammen and Tsybakov's margin
assumption, with λ = 2n/(C + 2) and ϑ > 0, for any ε > 0, with probability at least 1 − ε,
\[
\left.\begin{array}{c}\int R\,d\hat\rho_\lambda\\[2pt]\int R\,d\tilde\rho_\lambda\end{array}\right\}
\;\le\; \bar R + \frac{(C+2)(C+1)}{2}\left\{\frac{2d\vartheta}{n}+\frac{2}{n^2}+\frac{d\log(n\vartheta)}{\vartheta n}-\frac{d}{\vartheta n}+\frac{2}{n}\log\frac2\varepsilon\right\} + \frac{2c(2C+1)\sqrt d}{n}.
\]
The prior variance optimising the bound is ϑ = d/(d + 2 + 2d/n); this choice, or any
constant instead, leads to a rate in d log(n)/n. Note that the rate d/n is minimax-
optimal in this context. This is, for example, a consequence of more general results
in Lecué [2007] under a general form of the margin assumption. See the Appendix
for a proof.
5.4. Implementation and numerical results
For family F2 (mean field), the variational lower bound (5) equals
\[
L_{\lambda,\vartheta}(m,\sigma) = -\frac{\lambda}{n}\sum_{i=1}^n \Phi\!\left(-Y_i\,\frac{X_i m}{\sqrt{X_i\,\mathrm{diag}(\sigma^2)\,X_i^T}}\right) - \frac{m^T m}{2\vartheta} + \frac12\sum_{k=1}^d\left(\log\sigma_k^2 - \frac{\sigma_k^2}{\vartheta}\right),
\]
Both objective functions (for F2 and for F3) are non-convex, but the multimodality of the
latter may be more severe due to the larger dimension of F3. To address this issue, we
recommend using the reparametrisation of Opper and Archambeau [2009], which makes
the dimension of the latter optimisation problem O(n); see Khan [2014] for a related
approach. In both cases, we found deterministic annealing to be a good approach to
optimise such non-convex functions. We refer to Appendix B for more details on
deterministic annealing and on our particular implementation.
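For concreteness, here is a minimal sketch of how the mean-field objective L_{λ,ϑ}(m, σ) displayed above can be evaluated and optimised; the dataset and all tuning constants are placeholders, and a plain quasi-Newton routine is used instead of the deterministic annealing of Appendix B.

```python
# Minimal sketch: mean-field (F2) VB objective for linear classification.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
n, d = 300, 4
X = rng.normal(size=(n, d))
Y = np.sign(X @ np.array([1.5, -1.0, 0.5, 0.0]) + rng.normal(size=n))   # labels in {-1, 1}

lam, vartheta = 100.0, 1.0

def neg_lower_bound(params):
    m, log_s2 = params[:d], params[d:]
    s2 = np.exp(log_s2)                                   # sigma_k^2 > 0
    scale = np.sqrt((X**2) @ s2)                          # sqrt(X_i diag(sigma^2) X_i^T)
    fit = lam / n * norm.cdf(-Y * (X @ m) / scale).sum()  # expected-risk term
    kl = m @ m / (2 * vartheta) - 0.5 * np.sum(np.log(s2) - s2 / vartheta)
    return fit + kl                                       # minus the lower bound

res = minimize(neg_lower_bound, np.zeros(2 * d), method="L-BFGS-B")
print(res.x[:d], np.exp(res.x[d:]))                       # m and sigma^2
```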
We now compare the numerical performance of the mean field and full covariance VB
approximations to the Gibbs posterior (as approximated by SMC, see Section 3.1) for the
classification of standard datasets; see Table 1. We also include results for a kernel SVM
(support vector machine); this comparison is not entirely fair, since SVM is a non-linear
classifier, while all the other classifiers are linear. Still, except for the Glass dataset, the
full covariance VB approximation performs as well or better than both SMC and SVM
(while being much faster to compute, especially compared to SMC).
Table 1 (columns): Dataset | Covariates | Mean Field (F2) | Full cov. (F3) | SMC | SVM.
Interestingly, VB outperforms SMC in certain cases. This might be due to the fact
that a VB approximation tends to be more concentrated around the mode than the
Gibbs posterior it approximates. Mean field VB does not perform so well on certain
datasets (e.g. Indian). This may be due either to the approximation family being too
small, or to the corresponding optimisation problem being strongly multi-modal.
6. Application to classification under convexified loss
In this section the 0-1 loss is replaced by the hinge loss: with labels recoded as Y_i ∈ {−1, 1},
we consider the empirical hinge risk $r_n^H(\theta) = \frac1n\sum_{i=1}^n(1 - Y_i\langle\theta, X_i\rangle)_+$.
We will write R^H for its theoretical counterpart and R̄^H for its minimum in θ. We
keep the superscript H in order to allow comparison with the risk R under the 0-1 loss.
We assume in this section that the X_i are uniformly bounded by a constant c_x.
Note that we do not require an assumption of the form (A1) to obtain the results of this
section, as we rely directly on the Lipschitz continuity of the hinge risk.
Lemma 6.1 Under an independent Gaussian prior π such that each component is N(0, ϑ²),
for $\lambda < \sqrt{2n/(c_x^2\vartheta^2)}$ and with bounded design |X_{ij}| < c_x, the Hoeffding assumption (1) is satisfied with
\[
f(\lambda, n) = \frac{\lambda^2}{4n} - \frac{1}{2}\log\left(1 - \frac{\vartheta^2\lambda^2 c_x^2}{4n}\right).
\]
The main impact of such a bound is that the prior variance cannot be taken too big
relative to λ.
The oracle inequality in the above corollary enjoys the same rate of convergence as
the equivalent result in the preceding section. In the following we link the two results.
Remark 6.1 As stated at the beginning of the section, we can use the estimator specified
under the hinge loss to bound the excess risk of the 0-1 loss. We write R* and R^{H*} for the
respective risks of the corresponding Bayes classifiers. From Zhang [2004] (Section 3.3)
we have the following inequality, linking the excess risk under the hinge loss and the 0-1
loss,
\[
R(\theta) - R^{*} \le R^H(\theta) - R^{H*}
\]
for every θ ∈ R^p. By integrating with respect to ρ̃^H (the VB approximation, on any of
F1, F2, F3, of the Gibbs posterior for the hinge risk) and making use of Corollary 6.2, we
have with high probability
\[
\tilde\rho^H\big(R(\theta)\big) - R^{*} \;\le\; \inf_{\theta\in\mathbb R^p}\Big\{R^H(\theta) - R^{H*}\Big\} + O\!\left(\sqrt{\frac dn}\,\log\frac nd\right).
\]
writing Γ_i := Y_i X_i. Hence the lower bound to be maximised is given by
\[
L(m,\sigma) = -\frac{\lambda}{n}\left\{\sum_{i=1}^n (1-\Gamma_i m)\,\Phi\!\left(\frac{1-\Gamma_i m}{\sigma\|\Gamma_i\|_2}\right) + \sum_{i=1}^n \sigma\|\Gamma_i\|\,\varphi\!\left(\frac{1-\Gamma_i m}{\sigma\|\Gamma_i\|_2}\right)\right\}
- \frac{\|m\|_2^2}{2\vartheta} + \frac d2\log\sigma^2 - \frac{d\,\sigma^2}{2\vartheta}.
\]
It is easy to see that the function is convex in (m, σ): first note that the map
\[
\Psi:\ \begin{pmatrix}x\\ y\end{pmatrix} \mapsto x\,\Phi\!\left(\frac xy\right) + y\,\varphi\!\left(\frac xy\right)
\]
is convex, and note that we can write $\Xi_i = \Psi\!\left(A\begin{pmatrix}m\\ \sigma\end{pmatrix}+b\right)$; hence, by
composition of a convex function with a linear mapping, we have the result. Similar reasoning
applies to the cases F2 and F3, where in the latter the parametrisation should be done
in C such that Σ = CC^T. The bound is, however, not universally Lipschitz in σ, which
impacts the optimisation algorithms.
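The following sketch evaluates and optimises the hinge-loss objective L(m, σ) above for the isotropic family F1; a log-parametrisation of σ is used to enforce positivity (the objective itself is convex in (m, σ), as argued above). Data and constants are placeholders.

```python
# Minimal sketch: convexified (hinge-loss) VB objective for family F1.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n, d = 300, 4
X = rng.normal(size=(n, d))
Y = np.sign(X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=n))   # labels in {-1, 1}
G = Y[:, None] * X                                    # Gamma_i = Y_i X_i
norms = np.linalg.norm(G, axis=1)

lam, vartheta = 50.0, 1.0

def neg_L(params):
    m, sigma = params[:d], np.exp(params[d])
    mu = 1.0 - G @ m
    s = sigma * norms
    # E[(1 - <Gamma_i, theta>)_+] for theta ~ N(m, sigma^2 I), in closed form
    hinge = mu * norm.cdf(mu / s) + s * norm.pdf(mu / s)
    reg = m @ m / (2 * vartheta) - 0.5 * d * np.log(sigma**2) + d * sigma**2 / (2 * vartheta)
    return lam / n * hinge.sum() + reg                # minus the lower bound L(m, sigma)

res = minimize(neg_L, np.zeros(d + 1), method="L-BFGS-B")
print(res.x[:d], np.exp(res.x[d]))
```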
On the class of functions $\mathcal F_0 = \{\Phi_{m,\frac1n},\ m\in\mathbb R^d\}$, for which our oracle inequalities still
hold, we could get faster numerical algorithms. The objective function has Lipschitz-continuous
derivatives, and we would get a rate of $\frac{L}{(1+k)^2}$.
Other convex losses could be considered, leading to convex optimisation problems;
for instance, one could consider the exponential loss.
where L is the Lipschitz coefficient, on a ball of radius M, of the objective function
maximised in VB.
From Theorem 6.3 we can compute the number of iterations needed to reach a given level of
error with a given probability.
We find that on average the misclassification error (Table 2) is lower than for the 0-1
loss, for which we have no guarantee that the maximum is attained.
7. Application to ranking
7.1. Preliminaries
In this section we take Y = {0, 1} and consider again linear classifiers: Θ = X = Rd ,
f_θ(x) = 1_{⟨θ,x⟩≥0}. We consider however a different criterion: in ranking, not only do we want
to classify an object x well, but we also want to make sure that, given two different objects,
the one that is more likely to correspond to a label 1 is assigned a larger score
by the function f_θ. A usual way to measure this is to introduce the risk function
\[
R(\theta) = \mathbb{P}\Big\{\big[f_\theta(X_1)-f_\theta(X_2)\big](Y_1-Y_2) < 0\Big\},
\quad
r_n(\theta) = \frac{1}{n(n-1)}\sum_{i\neq j}\mathbb{1}\Big\{\big[f_\theta(X_i)-f_\theta(X_j)\big](Y_i-Y_j) < 0\Big\}.
\]
The variant of the margin assumption adapted to ranking was established by Robbiano
[2013] and Ridgway et al. [2014].
Lemma 7.2 Assume that the margin assumption for ranking is satisfied, i.e. there is a
constant C such that
\[
\mathbb{E}\Big[\big(\mathbb{1}\{[f_\theta(X_1)-f_\theta(X_2)][Y_1-Y_2]<0\} - \mathbb{1}\{[f_{\bar\theta}(X_1)-f_{\bar\theta}(X_2)][Y_1-Y_2]<0\}\big)^2\Big] \le C\,[R(\theta)-\bar R].
\]
Then the Bernstein assumption (2) is satisfied with $g(\lambda,n) = \frac{C\lambda^2}{n-1-4\lambda}$.
Corollary 7.3 For any ε > 0, with probability at least 1 − ε we have, for any m ∈ Rd ,
σ 2 ∈ (R+ )d ,
\[
\int R\,d\Phi_{m,\sigma^2} \;\le\; \int r_n\,d\Phi_{m,\sigma^2} + \frac{\lambda}{n-1}
+ \frac{\sum_{i=1}^d\left[\frac12\log\frac{\vartheta^2}{\sigma_i^2}+\frac{\sigma_i^2}{2\vartheta^2}\right]+\frac{\|m\|^2}{2\vartheta^2}-\frac d2+\log\frac1\varepsilon}{\lambda}.
\]
In order to derive a theoretical bound, we introduce the following variant of Assump-
tion A1.
Definition 7.1 We say that Assumption A2 is satisfied when there is a constant c > 0
such that, for any (θ, θ′) ∈ Θ² with ‖θ‖ = ‖θ′‖ = 1, P(⟨X₁ − X₂, θ⟩⟨X₁ − X₂, θ′⟩ < 0) ≤
c‖θ − θ′‖.
Assumption A2 is satisfied as soon as (X1 − X2 )/kX1 − X2 k has a bounded density on
the unit sphere.
Corollary 7.4 Use either F1, F2 or F3. Take $\lambda = \sqrt{d(n-1)/2}$ and ϑ = 1. Under (A2),
for any ε > 0, with probability at least 1 − ε,
\[
\left.\begin{array}{c}\int R\,d\hat\rho_\lambda\\[2pt]\int R\,d\tilde\rho_\lambda\end{array}\right\}
\;\le\; \bar R + \sqrt{\frac{2d}{n-1}}\left(1+\frac12\log\big(2d(n-1)\big)\right) + \frac{c\sqrt2}{\sqrt{n-1}} + \frac{2\sqrt2\log\frac{2e}{\varepsilon}}{\sqrt{(n-1)d}}.
\]
Corollary 7.5 Use either F1, F2 or F3. Under Assumption A2 and under the margin
assumption for ranking (Lemma 7.2), for a suitable choice of λ and any ϑ > 0, for any
ε > 0, with probability at least 1 − ε,
\[
\left.\begin{array}{c}\int R\,d\hat\rho_\lambda\\[2pt]\int R\,d\tilde\rho_\lambda\end{array}\right\}
\;\le\; \bar R + \frac{(C+5)(C+1)}{2}\left\{\frac{2d\vartheta}{n-1}+\frac{2}{n(n-1)}+\frac{d\log(n\vartheta)}{\vartheta(n-1)}-\frac{d}{\vartheta(n-1)}+\frac{2}{n-1}\log\frac2\varepsilon\right\} + \frac{4c(C+1)\sqrt d}{n}.
\]
The prior variance optimizing the bound is ϑ = d/(d + 2 + 2d/n). The proof is similar
to the ones of Corollaries 5.4, 5.5 and 7.4.
As in the case of classification, ranking under an AUC loss can be done by replacing
the indicator function by the corresponding upper bound given by a hinge loss. In this
case we can derive results similar to those for convexified classification; in particular, we
get a convex minimisation problem and obtain results without requiring assumption
(A2).
We propose to use stochastic gradient descent in the spirit of Hoffman et al. [2013].
The model we consider is not in an exponential family, so we cannot use the trick
developed by these authors; we propose instead a standard stochastic gradient descent.
The idea is to replace the gradient by an unbiased version based on a batch of size B,
as described in Algorithm 4 in the Appendix. Robbins and Monro [1951] show that for
a step-size sequence (λ_t)_t such that $\sum_t \lambda_t^2 < \infty$ and $\sum_t \lambda_t = \infty$ the algorithm converges to a local
optimum.
In our case we propose to sample pairs of data with replacement and use the unbiased
version of the derivative of the risk component. We use a simple gradient descent with-
out any curvature information. One could also use recent research on stochastic quasi
Newton-Raphson [Byrd et al., 2014].
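A minimal sketch of this stochastic scheme is given below: pairs with different labels are sampled with replacement, a closed-form expression of the expected pairwise hinge loss under a Gaussian ρ gives simple per-pair gradients, and Robbins-Monro step sizes are used. The data, the step-size schedule and all constants are placeholders, not the settings used for the Pima or Adult datasets.

```python
# Minimal sketch: stochastic-gradient VB for the convexified AUC risk.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, d = 1000, 5
X = rng.normal(size=(n, d))
Y = (X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n) > 0).astype(int)
pos, neg = np.where(Y == 1)[0], np.where(Y == 0)[0]

lam, vartheta, B, T = 100.0, 1.0, 50, 300
m, sigma = np.zeros(d), 1.0

for t in range(1, T + 1):
    i = rng.choice(pos, size=B)                     # one positive and one negative per pair
    j = rng.choice(neg, size=B)
    Z = X[i] - X[j]                                 # score difference should be positive
    mu = 1.0 - Z @ m
    s = sigma * np.linalg.norm(Z, axis=1)
    # gradients of E[(1 - <theta, Z>)_+], theta ~ N(m, sigma^2 I): Phi(mu/s) in m, phi(mu/s) in sigma
    grad_m = lam * np.mean(-Z * norm.cdf(mu / s)[:, None], axis=0) + m / vartheta
    grad_sigma = lam * np.mean(np.linalg.norm(Z, axis=1) * norm.pdf(mu / s)) \
                 + d * sigma / vartheta - d / sigma
    step = 0.5 / (t + 10) ** 0.7                    # Robbins-Monro step size
    m -= step * grad_m
    sigma = max(sigma - step * grad_sigma, 1e-3)

print(m, sigma)
```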
For illustration, we consider a small dataset (Pima) and a larger one (Adult). The
latter is already quite challenging, with n₊n₋ = 193,829,520 pairs to compare. In both
cases, and for different batch sizes, convergence is obtained within a few iterations only
and leads to acceptable bounds.
In Figure 1 we show the empirical bound on the AUC risk as a function of the iteration
of the algorithm, for several batch sizes. The bound is taken for 95% probability; the
batch sizes are B = 1, 10, 20, 50 for the Pima dataset, and 50 for the Adult
dataset. The figure shows an additional feature of VB approximation in the context of
Gibbs posteriors: namely, the possibility of computing the empirical upper bound given
by Corollary 7.3. That is, we can check the quality of the bound at each iteration of the
algorithm, or for different values of the hyperparameters.
Figure 1: Error bound at each iteration, stochastic descent, Pima and Adult
datasets.
Stochastic VB with fixed temperature λ = 100 for Pima and λ = 1000 for Adult. The left panel shows
several curves that correspond to different batch sizes; these curves are hard to distinguish. The right panel
is for a batch size of 50. The Adult dataset has n = 32,556 observations and n₊n₋ = 193,829,520 possible
pairs. Convergence is obtained in a matter of seconds. The bounds are the empirical bounds obtained in
Corollary 7.3 for a probability of 95%.
8. Application to matrix completion
We propose the following approximation family:
\[
\mathcal F = \left\{\rho\big(d(U,V)\big) = \prod_{i=1}^{m_1} u_i(dU_{i,\cdot}) \prod_{j=1}^{m_2} v_j(dV_{j,\cdot})\right\}.
\]
Theorem 8.1 Assume that M = UV^T with |U_{i,k}|, |V_{j,k}| ≤ C. Assume that rank(M) = r,
so that we can assume that U_{·,r+1} = · · · = U_{·,K} = V_{·,r+1} = · · · = V_{·,K} = 0 (note that
the prior π does not depend on the knowledge of r, though). Choose the prior distribution
on the hyper-parameters γ_j as inverse Gamma Inv-Γ(a, b) with b ≤ 1/[2β(m₁ ∨ m₂) log(2K(m₁ ∨ m₂))].
Then there is a constant C(a, C) such that, for any β > 0,
\[
\inf_{\rho\in\mathcal F} K(\rho,\pi_\beta) \le C(a,C)\left\{r(m_1+m_2)\log\big[\beta b(m_1+m_2)K\big] + \frac1\beta\right\}.
\]
and note that in this context it is known that the minimax rate is at least r(m₁ + m₂)/n
[Koltchinskii et al., 2011].
8.1. Algorithm
As already mentioned, the approximation family is not parametric in this case, but rather
of type mean field. The corresponding VB algorithm amounts to iterating equation (4),
which takes the following form in this particular case:
\[
u_j(dU_{j,\cdot}) \propto \exp\left\{-\frac{\lambda}{n}\sum_i \mathbb{E}_{V,U_{-j}}\Big[\big(Y_{X_i}-(UV^T)_{X_i}\big)^2\Big] - \sum_{k=1}^K \mathbb{E}_{\gamma_k}\!\left[\frac{1}{2\gamma_k}\right] U_{jk}^2\right\}
\]
\[
v_j(dV_{j,\cdot}) \propto \exp\left\{-\frac{\lambda}{n}\sum_i \mathbb{E}_{V_{-j},U}\Big[\big(Y_{X_i}-(UV^T)_{X_i}\big)^2\Big] - \sum_{k=1}^K \mathbb{E}_{\gamma_k}\!\left[\frac{1}{2\gamma_k}\right] V_{jk}^2\right\}
\]
\[
p(\gamma_k) \propto \exp\left\{-\frac{1}{2\gamma_k}\left(\sum_j \mathbb{E}_U\big[U_{jk}^2\big] + \sum_i \mathbb{E}_V\big[V_{ik}^2\big]\right) + (\alpha+1)\log\frac{1}{\gamma_k} - \frac{\beta}{\gamma_k}\right\}
\]
where the expectations are taken with respect to the thus defined variational approxi-
mations. One recognises Gaussian distributions for the first two, and an inverse Gamma
distribution for the third. We refer to Lim and Teh [2007] for more details on this
algorithm and for a numerical illustration.
9. Discussion
We showed in several important scenarios that approximating a Gibbs posterior through
VB (Variational Bayes) techniques does not deteriorate the rate of convergence of the
corresponding procedure. We also described practical algorithms for fast computation of
these VB approximations, and provided empirical bounds that may be computed from
the data to evaluate the performance of the so-obtained VB-approximated procedure.
We believe these results provide a strong incentive to recommend VB as the default
approach to approximate Gibbs posteriors, in lieu of Monte Carlo methods.
We hope to extend our results to other applications beyond those discussed in this
paper, such as regression. One technical difficulty with regression is that the risk function
is not bounded, which makes our approach a bit less direct to apply. In many papers
on PAC-Bayesian bounds for regression, the noise can be unbounded (usually, it is
assumed to be sub-exponential), but one assumes that the predictors are bounded, see
e.g. Alquier and Biau [2013]. However, using the robust loss function of Audibert and
Catoni, it is possible to relax this assumption [Audibert and Catoni, 2011, Catoni, 2012].
This requires a more technical analysis, which we leave for further work.
References
P. Alquier. Bayesian methods for low-rank matrix estimation: short survey and theoret-
ical study. In S. Jain, R. Munos, F. Stephan, and T. Zeugmann, editors, Algorithmic
Learning Theory. Springer - Lecture Notes in Artificial Intelligence, 2014.
P. Alquier and G. Biau. Sparse single-index model. Journal of Machine Learning Re-
search, 14(1):243–280, 2013.
J.-Y. Audibert and O. Catoni. Robust linear least squares regression. Ann. Statist.,
39(5):2766–2794, 2011. doi: 10.1214/11-AOS918. URL http://dx.doi.org/10.1214/11-AOS918.
J. Bennett and S. Lanning. The netflix prize. In Proceedings of KDD Cup and Workshop
07, 2007.
C. M. Bishop. Pattern Recognition and Machine Learning, chapter 10. Springer, 2006.
P. Bissiri, C. Holmes, and S. Walker. A general framework for updating belief distribu-
tions. arXiv preprint arXiv:1306.6430, 2013.
R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method
for large-scale optimization. arXiv preprint arXiv:1401.7020, 2014.
E. J. Candès and T. Tao. The power of convex relaxation: near-optimal matrix completion.
IEEE Trans. Inform. Theory, 56(5):2053–2080, 2010. ISSN 0018-9448. doi:
10.1109/TIT.2010.2044061. URL http://dx.doi.org/10.1109/TIT.2010.2044061.
O. Catoni. Statistical learning theory and stochastic optimization, volume 1851 of Lecture
Notes in Mathematics. Springer-Verlag, Berlin, 2004. Lecture notes from the 31st
Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
O. Catoni. Challenging the empirical mean and empirical variance: A deviation study.
Ann. Inst. H. Poincaré Probab. Statist., 48(4):1148–1185, 2012. doi: 10.1214/11-AIHP454.
URL http://dx.doi.org/10.1214/11-AIHP454.
P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. J. R. Statist.
Soc. B, 68(3):411–436, 2006. ISSN 1467-9868.
W. Jiang and M. A. Tanner. Gibbs posterior for variable selection in high-dimensional
classification and data mining. The Annals of Statistics, 36(5):2207–2231, 2008.
T. T. Mai and P. Alquier. A Bayesian approach for matrix completion: optimal rate
under general sampling distribution. Electronic Journal of Statistics, 9:823–841, 2015.
G. Parisi. Statistical field theory. Addison-Wesley, New-York, 1988.
T. Suzuki. Convergence rate of Bayesian tensor estimator: Optimal rate without re-
stricted strong convexity. arXiv preprint arXiv:1408.3092 (accepted by ICML2015),
2014.
O. Wintenberger. Deviation inequalities for sums of weakly dependent time series. Elec-
tronic Communications in Probability, 15:489–503, 2010.
A. Yuille. Belief propagation, mean-field and the Bethe approximation. Technical report,
Dept. Statistics UCLA, 2010.
T. Zhang. Information theoretical upper and lower bounds for statistical estimation.
IEEE Transaction on Information Theory, 52:1307–1321, 2006.
A. Proofs
A.1. Preliminary remarks
We start with a general remark. Let h be a function Θ → R₊ with $\int\exp[-h(\theta)]\,\pi(d\theta) < \infty$.
Let us put
\[
\pi[h](d\theta) = \frac{\exp[-h(\theta)]}{\int\exp[-h(\theta')]\,\pi(d\theta')}\,\pi(d\theta).
\]
Direct calculation yields, for any ρ ≪ π with $\int h\,d\rho < \infty$,
\[
K(\rho,\pi[h]) = \int h\,d\rho + K(\rho,\pi) + \log\int\exp(-h)\,d\pi.
\]
We will use this identity many times in what follows. The most frequent application
will be with h(θ) = λr_n(θ) (in this case π[λr_n] = ρ̂_λ) or h(θ) = ±λ[r_n(θ) − R(θ)];
the first case leads to
\[
K(\rho,\hat\rho_\lambda) = \lambda\int r_n\,d\rho + K(\rho,\pi) + \log\int\exp(-\lambda r_n)\,d\pi, \tag{8}
\]
\[
\hat\rho_\lambda = \arg\min_{\rho\in\mathcal M_+^1(\Theta)}\left\{\lambda\int r_n\,d\rho + K(\rho,\pi)\right\}, \tag{9}
\]
\[
-\log\int\exp(-\lambda r_n)\,d\pi = \min_{\rho\in\mathcal M_+^1(\Theta)}\left\{\lambda\int r_n\,d\rho + K(\rho,\pi)\right\}. \tag{10}
\]
We will use (8), (9) and (10) several times in this appendix.
then apply the preliminary remark with h(θ) = λ[r_n(θ) − R(θ)]:
\[
\mathbb{E}\exp\left\{\sup_{\rho\in\mathcal M_+^1(\Theta)}\left[\int\lambda[R(\theta)-r_n(\theta)]\,\rho(d\theta) - K(\rho,\pi)\right] - f(\lambda,n)\right\} \le 1.
\]
Multiply both sides by ε and use E[exp(U)] ≥ P(U > 0) for any U to obtain:
\[
\mathbb{P}\left[\sup_{\rho\in\mathcal M_+^1(\Theta)}\left\{\int\lambda[R(\theta)-r_n(\theta)]\,\rho(d\theta) - K(\rho,\pi)\right\} - f(\lambda,n) + \log(\varepsilon) > 0\right] \le \varepsilon.
\]
Now, we work with ρ̃_λ = arg min_{ρ∈F} K(ρ, ρ̂_λ). Plugging (8) into (11) we get, for any ρ,
\[
\lambda\int R\,d\rho \le f(\lambda,n) + K(\rho,\hat\rho_\lambda) - \log\int\exp(-\lambda r_n)\,d\pi + \log\frac2\varepsilon.
\]
By definition of ρ̃_λ, we have:
\[
\lambda\int R\,d\tilde\rho_\lambda \le \inf_{\rho\in\mathcal F}\left\{f(\lambda,n) + K(\rho,\hat\rho_\lambda) - \log\int\exp(-\lambda r_n)\,d\pi + \log\frac2\varepsilon\right\}
\]
This proves the second inequality of the theorem. In order to prove the claim
\[
B_\lambda(\mathcal F) = B_\lambda\big(\mathcal M_+^1(\Theta)\big) + \frac2\lambda\inf_{\rho\in\mathcal F}K\big(\rho,\pi_{\lambda/2}\big),
\]
note that
\[
B_\lambda(\mathcal F) = \inf_{\rho\in\mathcal F}\left\{\int R\,d\rho + \frac{2f(\lambda,n)}{\lambda} + \frac{2K(\rho,\pi)}{\lambda} + \frac{2\log\frac2\varepsilon}{\lambda}\right\}
\]
\[
= \inf_{\rho\in\mathcal F}\left\{-\frac2\lambda\log\int\exp\left(-\frac\lambda2 R\right)d\pi + \frac{2f(\lambda,n)}{\lambda} + \frac{2K(\rho,\pi_{\lambda/2})}{\lambda} + \frac{2\log\frac2\varepsilon}{\lambda}\right\}
\]
\[
= -\frac2\lambda\log\int\exp\left(-\frac\lambda2 R\right)d\pi + \frac{2f(\lambda,n)}{\lambda} + \frac{2\log\frac2\varepsilon}{\lambda} + \frac2\lambda\inf_{\rho\in\mathcal F}K\big(\rho,\pi_{\lambda/2}\big)
\]
\[
= B_\lambda\big(\mathcal M_+^1(\Theta)\big) + \frac2\lambda\inf_{\rho\in\mathcal F}K\big(\rho,\pi_{\lambda/2}\big).
\]
We combine (13) and (14) by a union bound argument, and we consider the complemen-
tary event: with probability at least 1 − ε, simultaneously for all ρ ∈ M1+ (Θ),
\[
[\lambda-g(\lambda,n)]\left(\int R\,d\rho - \bar R\right) \le \lambda\left(\int r_n\,d\rho - \bar r_n\right) + K(\rho,\pi) + \log\frac2\varepsilon, \tag{15}
\]
\[
\lambda\left(\int r_n\,d\rho - \bar r_n\right) \le [\lambda+g(\lambda,n)]\left(\int R\,d\rho - \bar R\right) + K(\rho,\pi) + \log\frac2\varepsilon. \tag{16}
\]
We now derive consequences of these two inequalities (in other words, we focus on the
event where these two inequalities are satisfied). Using (9) in (15) yields
\[
[\lambda-g(\lambda,n)]\left(\int R\,d\hat\rho_\lambda - \bar R\right) \le \inf_{\rho\in\mathcal M_+^1(\Theta)}\left\{\lambda\left(\int r_n\,d\rho - \bar r_n\right) + K(\rho,\pi) + \log\frac2\varepsilon\right\}.
\]
Proof of Lemma 5.2. Apply Theorem 2.10 in Boucheron et al. [2013], and plug the
margin assumption.
Proof of Corollary 5.4. We remind that thanks to (6) it is enough to prove the claim for
F1 . We apply Theorem 4.2 to get:
\[
B_\lambda(\mathcal F_1) = \inf_{(m,\sigma^2)}\left\{\int R\,d\Phi_{m,\sigma^2} + \frac\lambda n + 2\,\frac{K(\Phi_{m,\sigma^2},\pi)+\log\frac2\varepsilon}{\lambda}\right\}
= \inf_{(m,\sigma^2)}\left\{\int R\,d\Phi_{m,\sigma^2} + \frac\lambda n + 2\,\frac{d\left[\frac12\log\frac{\vartheta^2}{\sigma^2}+\frac{\sigma^2}{2\vartheta^2}\right]+\frac{\|m\|^2}{2\vartheta^2}-\frac d2+\log\frac2\varepsilon}{\lambda}\right\}.
\]
Note that the minimiser of R, θ̄, is not unique (because f_θ(x) does not depend on ‖θ‖)
and we can choose it in such a way that ‖θ̄‖ = 1. Then
\[
R(\theta)-\bar R = \mathbb{E}\left[\mathbb{1}_{\langle\theta,X\rangle Y<0} - \mathbb{1}_{\langle\bar\theta,X\rangle Y<0}\right] \le \mathbb{E}\left[\mathbb{1}_{\langle\theta,X\rangle\langle\bar\theta,X\rangle<0}\right]
= \mathbb{P}\big(\langle\theta,X\rangle\langle\bar\theta,X\rangle<0\big) \le c\left\|\frac{\theta}{\|\theta\|}-\bar\theta\right\| \le 2c\,\|\theta-\bar\theta\|.
\]
So:
\[
B_\lambda(\mathcal F_1) \le \bar R + \inf_{(m,\sigma^2)}\left\{2c\int\|\theta-\bar\theta\|\,\Phi_{m,\sigma^2}(d\theta) + \frac\lambda n + 2\,\frac{d\left[\frac12\log\frac{\vartheta^2}{\sigma^2}+\frac{\sigma^2}{2\vartheta^2}\right]+\frac{\|m\|^2}{2\vartheta^2}-\frac d2+\log\frac2\varepsilon}{\lambda}\right\}.
\]
We put σ = 1/(2λ), m = θ̄, and substitute 1/√d for ϑ to get
\[
B(\mathcal F_1) \le \bar R + \frac\lambda n + \frac{c\sqrt d + d\log\frac{4\lambda^2}{d} + \frac{d^2}{2\lambda^2} + \frac d2 + 2\log\frac2\varepsilon}{\lambda}.
\]
Substitute √(nd) for λ to get the desired result.
Proof of Corollary 5.5. We apply Theorem 4.3:
\[
\int(R-\bar R)\,d\tilde\rho_\lambda \le \inf_{m,\sigma^2}\left\{\frac{\lambda+g(\lambda,n)}{\lambda-g(\lambda,n)}\int(R-\bar R)\,d\Phi_{m,\sigma^2} + \frac{2K(\Phi_{m,\sigma^2},\pi) + 2\log\frac2\varepsilon}{\lambda-g(\lambda,n)}\right\}
\]
where λ < 2n/(C + 1). Computations similar to those in the proof of Corollary 5.4 lead to
\[
\int R\,d\tilde\rho_\lambda \le \bar R + \inf_{m,\sigma^2}\left\{\frac{\lambda+g(\lambda,n)}{\lambda-g(\lambda,n)}\,2c\int\|\theta-\bar\theta\|\,\Phi_{m,\sigma^2}(d\theta) + 2\,\frac{\sum_{j=1}^d\left[\frac12\log\frac{\vartheta^2}{\sigma^2}+\frac{\sigma^2}{2\vartheta^2}\right]+\frac{\|m\|^2}{2\vartheta^2}-\frac d2+\log\frac2\varepsilon}{\lambda-g(\lambda,n)}\right\}.
\]
Taking m = θ̄ and λ = 2n/(C + 2), we get the result.
where the last inequality stems from the fact that (a + b)2 ≤ 2 a2 + b2 and the fact
that we have supposed the Xi to be bounded. We can take the expectation of this term
with respect to the Xi ’s and with respect to our Gaussian prior.
\[
\pi\Big(\mathbb{E}\exp\big\{\lambda(R^H-r_n^H)\big\}\Big)
\le \frac{\exp\big(\frac{\lambda^2}{4n}\big)}{(2\pi\vartheta^2)^{d/2}}\int\exp\left(\frac{\lambda^2c_x^2}{4n}\|\theta\|^2 - \frac{1}{2\vartheta^2}\|\theta\|^2\right)d\theta
= \frac{\exp\big(\frac{\lambda^2}{4n}\big)}{(2\pi\vartheta^2)^{d/2}}\int\exp\left(-\left[\frac{1}{2\vartheta^2}-\frac{\lambda^2c_x^2}{4n}\right]\|\theta\|^2\right)d\theta.
\]
The integral is a properly defined Gaussian integral under the hypothesis that $\frac{1}{2\vartheta^2}-\frac{\lambda^2c_x^2}{4n}>0$,
hence $\lambda<\sqrt{2n/(c_x^2\vartheta^2)}$. The integral is proportional to a Gaussian normalising constant, and we can directly
write:
\[
\pi\Big(\mathbb{E}\exp\big\{\lambda(R^H-r_n^H)\big\}\Big) \le \frac{\exp\big(\frac{\lambda^2}{4n}\big)}{\sqrt{1-\frac{\vartheta^2\lambda^2c_x^2}{4n}}}.
\]
Proof of Corollary 6.2. We apply Theorem 4.2 to get:
\[
B_\lambda(\mathcal F_1) = \inf_{(m,\sigma^2)}\left\{\int R^H d\Phi_{m,\sigma^2} + \frac{\lambda}{2n} - \frac1\lambda\log\left(1-\frac{\vartheta^2\lambda^2c_x^2}{4n}\right) + 2\,\frac{K(\Phi_{m,\sigma^2},\pi)+\log\frac2\varepsilon}{\lambda}\right\}
\]
\[
= \inf_{(m,\sigma^2)}\left\{\int R^H d\Phi_{m,\sigma^2} + \frac{\lambda}{2n} - \frac1\lambda\log\left(1-\frac{\vartheta^2\lambda^2c_x^2}{4n}\right) + 2\,\frac{\sum_{j=1}^d\left[\frac12\log\frac{\vartheta^2}{\sigma^2}+\frac{\sigma^2}{2\vartheta^2}\right]+\frac{\|m\|^2}{2\vartheta^2}-\frac d2+\log\frac2\varepsilon}{\lambda}\right\}.
\]
We specify $\sigma^2 = \frac{1}{\sqrt{dn}}$ and $\lambda = \frac{1}{c_x}\sqrt{\frac{n}{\vartheta^2}}$, so that we get:
\[
B(\mathcal F_1) \le \bar R^H + c_x\sqrt{\frac dn} + \frac{1}{2c_x\sqrt n} - \frac{c_x\vartheta}{\sqrt n}\log\left(1-\frac{\vartheta^2}{4}\right) + \frac{c_x\vartheta\,d}{\sqrt n}\log\left(\vartheta^2\sqrt{nd}\right) + \frac{c_x\vartheta}{\sqrt n}\left(\frac{2d}{\vartheta^2\sqrt{nd}}+\frac{2}{\vartheta^2}-d+2\log\frac2\varepsilon\right).
\]
To get the correct rate we take the prior variance to be ϑ² = 1/d; by replacing in the above
equation we get the desired result.
Proof of Theorem 6.3. From Nesterov [2004] (Th. 3.2.2) we have the following bound on
the objective function minimised by VB (the objective is not uniformly Lipschitz):
\[
\rho^k(r_n^H) + \frac1\lambda K(\rho^k,\pi) - \inf_{\rho\in\mathcal F_1}\left\{\rho(r_n^H) + \frac1\lambda K(\rho,\pi)\right\} \le \frac{LM}{\sqrt{1+k}}. \tag{17}
\]
Using equation (11) a second time, we get with probability 1 − ε
\[
\int R^H d\rho^k \le \frac{LM}{\sqrt{1+k}} + \frac2\lambda f(n,\lambda) + \rho(R^H) + \frac2\lambda K(\rho,\pi) + \frac2\lambda\log\frac2\varepsilon.
\]
Because this is true for any ρ ∈ F1 on an event of probability 1 − ε, we can write the bound
for the best measure in F1:
\[
\int R^H d\rho^k \le \frac{LM}{\sqrt{1+k}} + \frac2\lambda f(n,\lambda) + \inf_{\rho\in\mathcal F_1}\left\{\rho(R^H) + \frac2\lambda K(\rho,\pi)\right\} + \frac2\lambda\log\frac2\varepsilon.
\]
By taking the Gaussian measure with variance 1/n and mean θ̄ in the infimum, and taking
$\lambda = \frac{1}{c_x}\sqrt{nd}$ and ϑ = 1/d, we can use the results of Corollary 6.2 to get the result.
so that
\[
U_n := \frac{1}{n(n-1)}\sum_{i\neq j} q^\theta_{i,j} = r_n(\theta) - R(\theta).
\]
The Hoeffding decomposition gives
\[
U_n = \frac{1}{n!}\sum_{\pi}\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} q^\theta_{\pi(i),\pi(i+\lfloor n/2\rfloor)},
\]
where the sum is taken over all the permutations π of {1, . . . , n}. Jensen's inequality
leads to
\[
\mathbb{E}\exp[\lambda U_n] = \mathbb{E}\exp\left(\lambda\,\frac{1}{n!}\sum_{\pi}\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} q^\theta_{\pi(i),\pi(i+\lfloor n/2\rfloor)}\right)
\le \frac{1}{n!}\sum_{\pi}\mathbb{E}\exp\left(\frac{\lambda}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} q^\theta_{\pi(i),\pi(i+\lfloor n/2\rfloor)}\right).
\]
For each of the terms in the sum, we use the same argument as in the proof of Lemma 5.1 to get
\[
\mathbb{E}\exp[\lambda U_n] \le \frac{1}{n!}\sum_{\pi}\exp\left(\frac{\lambda^2}{2\lfloor n/2\rfloor}\right) \le \exp\left(\frac{\lambda^2}{n-1}\right)
\]
(in the last step, we used ⌊n/2⌋ ≥ (n − 1)/2). We proceed in the same way to upper bound
E exp[−λU_n].
Proof of Lemma 7.2. As already done above, we use Bernstein's inequality and the Hoeffding
decomposition. Fix θ. We define this time
\[
q^\theta_{i,j} = \mathbb{1}\{\langle\theta, X_i-X_j\rangle(Y_i-Y_j) < 0\} - \mathbb{1}\{[\sigma(X_i)-\sigma(X_j)](Y_i-Y_j) < 0\} - R(\theta) + \bar R
\]
so that
\[
U_n := r_n(\theta) - \bar r_n - R(\theta) + \bar R = \frac{1}{n(n-1)}\sum_{i\neq j} q^\theta_{i,j}.
\]
Then,
\[
U_n = \frac{1}{n!}\sum_{\pi}\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} q^\theta_{\pi(i),\pi(i+\lfloor n/2\rfloor)}.
\]
Jensen's inequality:
\[
\mathbb{E}\exp[\lambda U_n] = \mathbb{E}\exp\left(\lambda\,\frac{1}{n!}\sum_{\pi}\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} q^\theta_{\pi(i),\pi(i+\lfloor n/2\rfloor)}\right)
\le \frac{1}{n!}\sum_{\pi}\mathbb{E}\exp\left(\frac{\lambda}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} q^\theta_{\pi(i),\pi(i+\lfloor n/2\rfloor)}\right).
\]
Then, for each of the terms in the sum, use Bernstein's inequality:
\[
\mathbb{E}\exp\left(\frac{\lambda}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} q^\theta_{\pi(i),\pi(i+\lfloor n/2\rfloor)}\right)
\le \exp\left(\frac{\frac{\lambda^2}{\lfloor n/2\rfloor}\,\mathbb{E}\Big(\big(q^\theta_{\pi(1),\pi(1+\lfloor n/2\rfloor)}\big)^2\Big)}{2\left(1-\frac{2\lambda}{\lfloor n/2\rfloor}\right)}\right).
\]
We use again ⌊n/2⌋ ≥ (n − 1)/2. Then, as the pairs (X_i, Y_i) are i.i.d., we have
$\mathbb{E}\big((q^\theta_{\pi(1),\pi(1+\lfloor n/2\rfloor)})^2\big) = \mathbb{E}\big((q^\theta_{1,2})^2\big)$, and $\mathbb{E}\big((q^\theta_{1,2})^2\big) \le C[R(\theta)-\bar R]$ thanks to the margin assumption. So
\[
\mathbb{E}\exp\left(\frac{\lambda}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor} q^\theta_{\pi(i),\pi(i+\lfloor n/2\rfloor)}\right)
\le \exp\left(\frac{\frac{\lambda^2}{n-1}\,C[R(\theta)-\bar R]}{1-\frac{4\lambda}{n-1}}\right).
\]
A.7. Proofs of Section 8
Proof. First, note that, for any ρ,
\[
K(\rho,\pi_\beta) = \beta\int(R-\bar R)\,d\rho + K(\rho,\pi) + \log\int\exp\big\{-\beta(R-\bar R)\big\}\,d\pi
\le \beta\int(R-\bar R)\,d\rho + K(\rho,\pi).
\]
Now, we define a subset of F that will be used for the calculation of the bound. We
define, for δ > 0, the probability distribution ρ_{U,V,δ}(dθ) as π conditioned to θ = µν^T, with
µ uniform on {∀(i, ℓ), |µ_{i,ℓ} − U_{i,ℓ}| ≤ δ} and ν uniform on {∀(j, ℓ), |ν_{j,ℓ} − V_{j,ℓ}| ≤ δ}.
Note that
\[
\int(R-\bar R)\,d\rho_{U,V,\delta} = \int\mathbb{E}\big((\theta_X - M_X)^2\big)\,\rho_{U,V,\delta}(d\theta)
\le 3\int\mathbb{E}\big(((UV^T)_X - M_X)^2\big)\,\rho_{U,V,\delta}(d(\mu,\nu))
+ 3\int\mathbb{E}\big(((U\nu^T)_X - (UV^T)_X)^2\big)\,\rho_{U,V,\delta}(d(\mu,\nu))
+ 3\int\mathbb{E}\big(((\mu\nu^T)_X - (U\nu^T)_X)^2\big)\,\rho_{U,V,\delta}(d(\mu,\nu)),
\]
and each of the two last terms is bounded by Kr(C + δ)²δ². So:
\[
\int(R-\bar R)\,d\rho_{U,V,\delta} \le 2Kr\delta^2(C+\delta)^2.
\]
Now, let us consider the term K(ρ_{U,V,δ}, π). An explicit calculation is possible but tedious.
Instead, we might just introduce the set G_δ = {θ = µν^T, ‖µ − U‖_F ≤ δ, ‖ν − V‖_F ≤ δ}
and note that $K(\rho_{U,V,\delta},\pi) \le \log\frac{1}{\pi(G_\delta)}$. An upper bound for this quantity is calculated
on pages 317-320 of Alquier [2014], and the result is given by (10) in this reference, as soon as the restriction
\[
b \le \min\left\{\frac{\delta^2}{2m_1K\log(2m_1K)},\ \frac{\delta^2}{2m_2K\log(2m_2K)}\right\}
\]
is satisfied. Note that ‖U‖²_F ≤ C²rm₁, ‖V‖²_F ≤ C²rm₂ and K ≤ m₁ + m₂, so it is clear that the choice
$\delta = \sqrt{1/\beta}$ and b ≤ 1/[2β(m₁ ∨ m₂) log(2K(m₁ ∨ m₂))] leads to the existence of a constant C(a, C)
such that
\[
K(\rho_{U,V,\delta},\pi_\beta) \le C(a,C)\left\{r(m_1+m_2)\log\big[\beta b(m_1+m_2)K\big] + \frac1\beta\right\}.
\]
B. Implementation details
B.1. Sequential Monte Carlo
Tempering SMC approximates iteratively a sequence of distributions ρ_{λ_t}, with
\[
\rho_{\lambda_t}(d\theta) = \frac{1}{Z_t}\exp\big(-\lambda_t r_n(\theta)\big)\,\pi(d\theta),
\]
and temperature ladder λ₀ = 0 < . . . < λ_T = λ. The pseudo-code below is given for an
adaptive sequence of temperatures.
Algorithm 1 Tempering SMC
Input: N (number of particles), τ ∈ (0, 1) (ESS threshold), κ > 0 (random walk tuning parameter).
Initialisation: sample θ₀^i ∼ π for i = 1, . . . , N; set Z₀ = 1, λ₀ = 0, t = 1.
At iteration t:
a. Solve in λ_t the equation
\[
\frac{\big\{\sum_{i=1}^N w_t(\theta_{t-1}^i)\big\}^2}{\sum_{i=1}^N \big\{w_t(\theta_{t-1}^i)\big\}^2} = \tau N, \qquad w_t(\theta) = \exp\big[-(\lambda_t-\lambda_{t-1})\,r_n(\theta)\big], \tag{18}
\]
using bisection search. If λ_t ≥ λ_T, set $Z_T = Z_{t-1}\times\big\{\frac1N\sum_{i=1}^N w_t(\theta_{t-1}^i)\big\}$, and stop.
b. Resample: for i = 1 to N, draw A_t^i in {1, . . . , N} so that $\mathbb{P}(A_t^i = j) = w_t(\theta_{t-1}^j)/\sum_{k=1}^N w_t(\theta_{t-1}^k)$; see Algorithm 2 in the appendix.
c. Sample $\theta_t^i \sim M_t(\theta_{t-1}^{A_t^i}, d\theta)$ for i = 1 to N, where M_t is an MCMC kernel that leaves ρ_{λ_t} invariant; see comments below.
d. Set $Z_t = Z_{t-1}\times\big\{\frac1N\sum_{i=1}^N w_t(\theta_{t-1}^i)\big\}$.
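Step a. can be implemented as a simple bisection on λ_t; the sketch below (with hypothetical function and argument names) solves the ESS equation (18) given the values r_n(θ_{t-1}^i) of the current particles.

```python
# Minimal sketch: adaptive choice of the next temperature by bisection on the ESS.
import numpy as np

def next_temperature(r_n_values, lam_prev, lam_max, tau, tol=1e-8):
    def ess(lam):
        # constant shift of r_n leaves the ESS unchanged; used for numerical stability
        w = np.exp(-(lam - lam_prev) * (r_n_values - r_n_values.min()))
        return w.sum() ** 2 / (w ** 2).sum()

    if ess(lam_max) >= tau * len(r_n_values):   # no intermediate temperature needed
        return lam_max
    lo, hi = lam_prev, lam_max
    while hi - lo > tol:                        # the ESS typically decreases in lambda
        mid = 0.5 * (lo + hi)
        if ess(mid) >= tau * len(r_n_values):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```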
The algorithm outputs a weighted sample (w_T^i, θ_T^i) approximately distributed according to the
target posterior, and an unbiased estimator of the normalising constant Z_{λ_T}.
Step b. of Algorithm 1 depends on a resampling algorithm. We choose to use
systematic resampling, described in Algorithm 2.
Algorithm 2 Systematic resampling
a. Compute the normalised weights W^i = w_t(θ_{t-1}^i)/Σ_{k=1}^N w_t(θ_{t-1}^k), and the cumulative sums C^m = N Σ_{j=1}^m W^j.
b. Draw U ∼ U([0, 1]).
c. Set s ← U, m ← 1.
d. For n = 1 : N
While C^m < s do m ← m + 1.
A^n ← m, and s ← s + 1.
End For
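For reference, a direct translation of Algorithm 2 into code (0-based indices, hypothetical function name):

```python
# Minimal sketch of systematic resampling: given unnormalised weights w,
# return ancestor indices A^1, ..., A^N.
import numpy as np

def systematic_resampling(w, rng):
    N = len(w)
    C = N * np.cumsum(w) / np.sum(w)      # cumulative sums scaled to [0, N]
    A = np.empty(N, dtype=int)
    s, m = rng.uniform(), 0               # s <- U, m <- 1 (0-based here)
    for n in range(N):
        while C[m] < s:                   # advance until the cumulative weight covers s
            m += 1
        A[n] = m
        s += 1.0
    return A

# Example: systematic_resampling(np.array([0.1, 0.4, 0.2, 0.3]), np.random.default_rng(0))
```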
For the MCMC step, we used a Gaussian random-walk Metropolis kernel, with a
covariance matrix for the random step that is proportional to the empirical covariance
matrix of the current set of simulations.
Algorithm 3 Deterministic annealing
Loop t = 1, . . . , T
a. m_{λ_t}, Σ_{λ_t} = minimise L_{λ_t}(m, Σ) using some local optimisation routine, with initial points m_{λ_{t-1}}, Σ_{λ_{t-1}}.
b. Break if the empirical bound increases.
End Loop
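A compact sketch of this annealing loop is given below; `neg_objective` (the negative variational lower bound at a given temperature) and `empirical_bound` (e.g. the bound of Theorem 4.2 evaluated at the current solution) are assumed to be supplied by the user, and both names are placeholders.

```python
# Minimal sketch of the deterministic annealing loop of Algorithm 3.
import numpy as np
from scipy.optimize import minimize

def deterministic_annealing(neg_objective, empirical_bound, params0, lam_ladder):
    """Optimise for each temperature in the ladder, warm-starting from the previous
    solution, and stop early if the empirical risk bound starts to increase."""
    params = np.asarray(params0, dtype=float)
    prev_bound = np.inf
    for lam in lam_ladder:                           # e.g. [0, 125, 250, 375, 500]
        res = minimize(neg_objective, params, args=(lam,), method="L-BFGS-B")
        bound = empirical_bound(res.x, lam)          # empirical bound at this solution
        if bound > prev_bound:                       # step b.: break if the bound increases
            break
        params, prev_bound = res.x, bound            # warm start for the next temperature
    return params
```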
[Figure 2 here: left panel, objective value along the annealing path for temperatures γ = 0, 125, 250, 375, 500; right panel, the 95% empirical bound.]
Figure 2: Deterministic annealing on the Pima Indians dataset, with one covariate and with the full
model, respectively.
The right panel gives the empirical bound obtained for the DA method (in red); the dots are direct global
optimisations based on the L-BFGS algorithm, with starting values drawn from the prior. Each optimisation
problem is repeated 20 times.
We find that using a deterministic annealing algorithm with a limited number of steps
helps in finding a high enough optimum. On the left panel of Figure 2, we can see the one-dimensional
case, where the initial problem γ = 0 corresponds to a convex minimisation
problem and where the increasing temperature gradually complexifies the optimisation
problem. Figure 2 shows that the solution given by DA is on average lower than that of
randomly initialised optimisation.
Algorithm 4 Stochastic gradient descent
Input: B a batch size, an unbiased estimator $\hat\nabla_B f$ of the gradient, η ∈ (0, 1) and c.
While ¬converged
a. $x_{t+1} = x_t - \lambda_t \hat\nabla_B f(x_t)$
b. Update $\lambda_{t+1} = \frac{1}{(t+c)^\eta}$
End While