On the properties of variational approximations of Gibbs posteriors


Pierre Alquier (CREST-ENSAE),
James Ridgway (CREST-ENSAE and Université Paris Dauphine)
and Nicolas Chopin (CREST-ENSAE and HEC Paris)
arXiv:1506.04091v2 [stat.ML] 15 Jun 2015

The PAC-Bayesian approach is a powerful set of techniques to derive non-asymptotic
risk bounds for random estimators. The corresponding optimal
distribution of estimators, usually called the Gibbs posterior, is unfortunately
intractable. One may sample from it using Markov chain Monte Carlo, but
this is often too slow for big datasets. We consider instead variational approx-
imations of the Gibbs posterior, which are fast to compute. We undertake
a general study of the properties of such approximations. Our main finding
is that such a variational approximation often has the same rate of convergence
as the original PAC-Bayesian procedure it approximates. We specialise
our results to several learning tasks (classification, ranking, matrix comple-
tion), discuss how to implement a variational approximation in each case,
and illustrate the good properties of said approximation on real datasets.

1. Introduction
A Gibbs posterior, also known as a PAC-Bayesian or pseudo-posterior, is a probability
distribution for random estimators of the form:
\[
\hat\rho_\lambda(d\theta) = \frac{\exp[-\lambda r_n(\theta)]}{\int \exp[-\lambda r_n]\,d\pi}\,\pi(d\theta).
\]

More precise definitions will follow, but for now, θ may be interpreted as a parameter
(in a finite or infinite-dimensional space), rn (θ) as an empirical measure of risk (e.g.
prediction error), and π(dθ) a prior distribution.
In this paper we follow the PAC (Probably Approximately Correct)-Bayesian
approach, which originates from machine learning [Shawe-Taylor and Williamson, 1997,
McAllester, 1998, Catoni, 2004]; see Catoni [2007] for an exhaustive study, and Jiang and
Tanner [2008], Yang [2004], Zhang [2006], Dalalyan and Tsybakov [2008] for related perspectives
(such as the aggregation of estimators in the last 3 papers). There, ρ̂λ appears
as the probability distribution that minimises the upper bound of an oracle inequality
on the risk of random estimators. The PAC-Bayesian approach offers sharp theoretical
guarantees on the properties of such estimators, without assuming a particular model
for the data generating process.
The Gibbs posterior has also appeared in other places, and under different motiva-
tions: in Econometrics, as a way to avoid direct maximisation in moment estimation
[Chernozhukov and Hong, 2003]; and in Bayesian decision theory, as a way to define
a Bayesian posterior distribution when no likelihood has been specified [Bissiri et al.,
2013]. Another well-known connection, although less directly useful (for Statistics), is
with thermodynamics, where rn is interpreted as an energy function, and λ as the inverse
of a temperature.
Whatever the perspective, estimators derived from Gibbs posteriors usually show ex-
cellent performance in diverse tasks, such as classification, regression, ranking, and so
on, yet their actual implementation is still far from routine. The usual recommendation
[Dalalyan and Tsybakov, 2012, Alquier and Biau, 2013, Guedj and Alquier, 2013] is to
sample from a Gibbs posterior using MCMC [Markov chain Monte Carlo, see e.g. Green
et al., 2015]; but constructing an efficient MCMC sampler is often difficult, and even
efficient implementations are often too slow for practical uses when the dataset is very
large.
In this paper, we consider instead VB (Variational Bayes) approximations, which have
been initially developed to provide fast approximations of ‘true’ posterior distributions
(i.e. Bayesian posterior distributions for a given model); see Jordan et al. [1999], MacKay
[2002] and Chap. 10 in Bishop [2006].
Our main results are as follows: when PAC-Bayes bounds are available - mainly, when
a strong concentration inequality holds - replacing the Gibbs posterior by a variational
approximation does not affect the rate of convergence to the best possible prediction,
on the condition that the Kullback-Leibler divergence between the posterior and the
approximation is itself controlled in an appropriate way.
We also provide empirical bounds, which may be computed from the data so as to
ascertain the actual performance of estimators obtained by variational approximation.
All these results give strong incentives, we believe, to recommend Variational Bayes as
the default approach to approximate Gibbs posteriors.
The rest of the paper is organized as follows. In Section 2 we introduce the notations
and assumptions. In Section 3 we introduce variational approximations and the corre-
sponding algorithms. The main results are provided in general form in Section 4: in
Subsection 4.1, we give results under the assumption that a Hoeffding type inequality
holds (slow rates) and in Subsection 4.2, we give results under the assumption that a
Bernstein type inequality holds (fast rates). Note that, for brevity, we will
refer to these settings as the “Hoeffding assumption” and the “Bernstein assumption”, even if
this terminology is non-standard. We then apply these results in various settings: classification
(Section 5), convex classification (Section 6), ranking (Section 7), and matrix
completion (Section 8). In each case, we show how to specialise the general results of
Section 4 to the considered application, so as to obtain the properties of the VB approximation,
and we also discuss its numerical implementation. All the proofs are collected
in the Appendix.

2. PAC-Bayesian framework
We observe a sample (X1 , Y1 ), . . . , (Xn , Yn ), taking values in X × Y, where the pairs
(Xi , Yi ) have the same distribution P . We will assume explicitly that the (Xi , Yi )’s are
independent in several of our specialised results, but we do not make this assumption
at this stage, as some of our general results, and more generally the PAC-Bayesian
theory, may be extended to dependent observations; see e.g. Alquier and Li [2012]. The
label set Y is always a subset of R. A set of predictors is chosen by the statistician:
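$\{f_\theta : \mathcal{X} \to \mathbb{R},\ \theta \in \Theta\}$. For example, in linear regression, we may have $f_\theta(x) = \langle\theta, x\rangle$, the
inner product of $\mathcal{X} = \mathbb{R}^d$, while in classification, one may have $f_\theta(x) = \mathbf{1}_{\langle\theta,x\rangle > 0} \in \{0, 1\}$.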
We assume we have at our disposal a risk function $R(\theta)$; typically $R(\theta)$ is a measure
of the prediction error. We set $\overline{R} = R(\overline{\theta})$, where $\overline{\theta} \in \arg\min_\Theta R$; i.e. $f_{\overline{\theta}}$ is an optimal
predictor. We also assume that the risk function $R(\theta)$ has an empirical counterpart
$r_n(\theta)$, and set $\overline{r}_n = r_n(\overline{\theta})$. Often, $R$ and $r_n$ are based on a loss function $\ell : \mathbb{R}^2 \to \mathbb{R}$; i.e.
$R(\theta) = \mathbb{E}[\ell(Y, f_\theta(X))]$ and $r_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i))$. (In this paper, the symbol $\mathbb{E}$
will always denote the expectation with respect to the (unknown) law $P$ of the $(X_i, Y_i)$'s.)
There are situations however (e.g. ranking), where $R$ and $r_n$ have a different form.
We define a prior probability measure π(·) on the set Θ (equipped with the standard
σ-algebra for the considered context), and we let M1+ (Θ) denote the set of all probability
measures on Θ.

Definition 2.1 We define, for any λ > 0, the pseudo-posterior ρ̂λ by

\[
\hat\rho_\lambda(d\theta) = \frac{\exp[-\lambda r_n(\theta)]}{\int \exp[-\lambda r_n]\,d\pi}\,\pi(d\theta).
\]

The pseudo-posterior ρ̂λ (also known as the Gibbs posterior, Catoni [2004, 2007], or
the exponentially weighted aggregate, Dalalyan and Tsybakov [2008]) plays a central
role in the PAC-Bayesian approach. It is obtained as the distribution that minimises
the upper bound of a certain oracle inequality applied to random estimators. Practical
estimators (predictors) may be derived from the pseudo-posterior, by e.g. taking the
expectation, or sampling from it. Of course, when exp[−λrn (θ)] may be interpreted as
the likelihood of a certain model, ρ̂λ becomes a Bayesian posterior distribution, but we
will not restrict our attention to this particular case.
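As an illustration of Definition 2.1, here is a minimal sketch of the pseudo-posterior when Θ is a finite set, so that the normalising integral reduces to a sum. The function name `gibbs_posterior` and the numerical values are hypothetical and only serve the example; the prior weights are assumed strictly positive.

```python
import math

def gibbs_posterior(risks, prior, lam):
    """Weights of the pseudo-posterior on a finite parameter set.

    risks[k] plays the role of r_n(theta_k), prior[k] of pi(theta_k),
    and lam of lambda > 0. Prior weights are assumed strictly positive.
    """
    # Work with log-weights and subtract the maximum for numerical stability.
    log_w = [math.log(p) - lam * r for r, p in zip(risks, prior)]
    shift = max(log_w)
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical example: three candidate parameters under a uniform prior.
risks = [0.30, 0.10, 0.25]
prior = [1 / 3, 1 / 3, 1 / 3]
rho_hat = gibbs_posterior(risks, prior, lam=20.0)
```

With a uniform prior, the ratio of two posterior weights is exp(−λ times the difference of empirical risks), so increasing λ concentrates the mass on the empirical risk minimiser.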
The following ‘theoretical’ counterpart of ρ̂λ will prove useful to state results.

Definition 2.2 We define, for any λ > 0, πλ as

\[
\pi_\lambda(d\theta) = \frac{\exp[-\lambda R(\theta)]}{\int \exp[-\lambda R]\,d\pi}\,\pi(d\theta).
\]

We will derive PAC-Bayesian bounds on predictions obtained by variational approx-
imations of ρ̂λ under two types of assumptions: a Hoeffding-type assumption, from
which we may deduce slow rates of convergence (Subsection 4.1), and a Bernstein-type
assumption, from which we may obtain fast rates of convergence (Subsection 4.2).

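Definition 2.3 We say that a Hoeffding assumption is satisfied for prior π when there
is a function $f$ and an interval $I \subset \mathbb{R}_+^*$ such that, for any $\lambda \in I$, for any $\theta \in \Theta$,

\[
\left.\begin{array}{l}
\pi\left(\mathbb{E}\exp\{\lambda[R(\theta) - r_n(\theta)]\}\right) \\
\pi\left(\mathbb{E}\exp\{\lambda[r_n(\theta) - R(\theta)]\}\right)
\end{array}\right\} \le \exp[f(\lambda, n)]. \tag{1}
\]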

Inequality (1) can be interpreted as an integrated version (with respect to π) of Hoeffding's
inequality, for which $f(\lambda, n) \asymp \lambda^2/n$. In many cases the loss will be bounded
uniformly over θ; then Hoeffding’s inequality will directly imply (1). The expectation
with respect to π in (1) allows us to treat some cases where the loss is not upper bounded
by specifying a prior with sufficiently light tails.

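Definition 2.4 We say that a Bernstein assumption is satisfied for prior π when there
is a function $g$ and an interval $I \subset \mathbb{R}_+^*$ such that, for any $\lambda \in I$, for any $\theta \in \Theta$,

\[
\left.\begin{array}{l}
\pi\left(\mathbb{E}\exp\{\lambda[R(\theta) - \overline{R}] - \lambda[r_n(\theta) - \overline{r}_n]\}\right) \\
\pi\left(\mathbb{E}\exp\{\lambda[r_n(\theta) - \overline{r}_n] - \lambda[R(\theta) - \overline{R}]\}\right)
\end{array}\right\} \le \pi\left(\exp\{g(\lambda, n)[R(\theta) - \overline{R}]\}\right). \tag{2}
\]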

This assumption is satisfied for example by sums of i.i.d. sub-exponential random
variables, see Subsection 2.4 p. 27 in Boucheron et al. [2013], when a margin assumption
on the function R(·) is satisfied [Tsybakov, 2004]. This is discussed in Section 4.2. Again,
extensions beyond the i.i.d. case are possible, see e.g. Wintenberger [2010] for a survey
and new results. In all these examples, the important feature of the function g that we
will use to derive rates of convergence is the fact that there is a constant c > 0 such that
when $\lambda = cn$, $g(\lambda, n) = g(cn, n) \asymp n$.
As mentioned previously, we will often consider $r_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(Y_i, f_\theta(X_i))$; however,
the previous assumptions can also be satisfied when $r_n(\theta)$ is a U-statistic, using
Hoeffding’s decomposition of U-statistics combined with the corresponding inequality
for sums of independent variables [Hoeffding, 1948]. This idea comes from Clémençon
et al. [2008] and we will use it in our ranking application.

Remark 2.1 We could consider more generally inequalities of the form

\[
\left.\begin{array}{l}
\pi\left(\mathbb{E}\exp\{\lambda[R(\theta) - \overline{R}] - \lambda[r_n(\theta) - \overline{r}_n]\}\right) \\
\pi\left(\mathbb{E}\exp\{\lambda[r_n(\theta) - \overline{r}_n] - \lambda[R(\theta) - \overline{R}]\}\right)
\end{array}\right\} \le \pi\left(\exp\{g(\lambda, n)[R(\theta) - \overline{R}]^{\kappa}\}\right)
\]

which allow the use of the more general form of the margin assumption of Mammen and Tsybakov
[1999], Tsybakov [2004]. PAC-Bayes bounds in this context are provided by Catoni
[2007]. However, the techniques involved would require many pages to be described, so we
decided to focus on the cases κ = 0 and κ = 1 to keep the exposition simple.

3. Numerical approximations of the pseudo-posterior
3.1. Monte Carlo
As already explained in the introduction, the usual approach to approximate ρ̂λ is
MCMC (Markov chain Monte Carlo) sampling. Ridgway et al. [2014] proposed tem-
pering SMC (Sequential Monte Carlo, e.g. Del Moral et al. [2006]) as an alternative
to MCMC to sample from Gibbs posteriors: one samples sequentially from ρ̂λt , with
0 = λ0 < · · · < λT = λ where λ is the desired inverse temperature. One advantage of this
approach is that it makes it possible to contemplate different values of λ, and choose
one by e.g. cross-validation. Another advantage is that such an algorithm requires little
tuning; see Appendix B for more details on the implementation of tempering SMC. We
will use tempering SMC as our gold standard in our numerical studies.
SMC and related Monte Carlo algorithms tend to be too slow for practical use in
situations where the sample size is large, the dimension of Θ is large, or fθ is expen-
sive to compute. This motivates the use of fast, deterministic approximations, such as
Variational Bayes, which we describe in the next section.

3.2. Variational Bayes


Various versions of VB (Variational Bayes) have appeared in the literature, but the main
idea is as follows. We define a family F ⊂ M1+ (Θ) of probability distributions that are
considered as tractable. Then, we define the VB-approximation of ρ̂λ : ρ̃λ .

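Definition 3.1 Let

\[
\tilde\rho_\lambda = \arg\min_{\rho \in \mathcal{F}} \mathcal{K}(\rho, \hat\rho_\lambda),
\]

where $\mathcal{K}(\rho, \hat\rho_\lambda)$ denotes the KL (Kullback-Leibler) divergence of $\hat\rho_\lambda$ relative to $\rho$:
$\mathcal{K}(m, \mu) = \int \log\left[\frac{dm}{d\mu}\right] dm$ if $m \ll \mu$ (i.e. $\mu$ dominates $m$), $\mathcal{K}(m, \mu) = +\infty$ otherwise.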

The difficulty is to find a family F (a) which is large enough, so that ρ̃λ may be close
to ρ̂λ , and (b) such that computing ρ̃λ is feasible. We now review two types of families
popular in the VB literature.
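• Mean field VB: for a certain decomposition $\Theta = \Theta_1 \times \dots \times \Theta_d$, $\mathcal{F}$ is the set of
product probability measures
\[
\mathcal{F}^{MF} = \left\{ \rho \in \mathcal{M}_+^1(\Theta) : \rho(d\theta) = \prod_{i=1}^d \rho_i(d\theta_i),\ \forall i \in \{1, \dots, d\},\ \rho_i \in \mathcal{M}_+^1(\Theta_i) \right\}. \tag{3}
\]
The infimum of the KL divergence $\mathcal{K}(\rho, \hat\rho_\lambda)$, relative to $\rho = \prod_i \rho_i$, satisfies the
following fixed point condition [Parisi, 1988, Bishop, 2006, Chap. 10]:
\[
\forall j \in \{1, \dots, d\} \quad \rho_j(d\theta_j) \propto \exp\left( \int \{-\lambda r_n(\theta) + \log \pi(\theta)\} \prod_{i \neq j} \rho_i(d\theta_i) \right) \pi(d\theta_j). \tag{4}
\]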

This leads to a natural algorithm where we update each ρj successively until
stabilization.
• Parametric family:
\[
\mathcal{F}^{P} = \left\{ \rho \in \mathcal{M}_+^1(\Theta) : \rho(d\theta) = f(\theta; m)\,d\theta,\ m \in M \right\},
\]
where $M$ is finite-dimensional; say $\mathcal{F}^{P}$ is the family of Gaussian distributions (of dimension
$d$). In this case, several methods may be used to compute the infimum. As
above, one may use fixed-point iteration, provided an equation similar to (4) is
available. Alternatively, one may directly maximize $\int \log\left[\exp[-\lambda r_n(\theta)] \frac{d\pi}{d\rho}(\theta)\right] \rho(d\theta)$
with respect to the parameter $m$, using numerical optimization routines. This approach
was used for instance in Hoffman et al. [2013], in combination with
stochastic gradient descent, to perform inference on a latent Dirichlet allocation
model. See also e.g. Khan [2014], Khan et al. [2013] for efficient algorithms for
Gaussian variational approximation.
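The successive mean field updates can be sketched on a toy problem. The following is a minimal illustration, not the implementation used in this paper: it assumes a finite product space Θ = Θ1 × Θ2, a product prior π = π1 ⊗ π2 with strictly positive weights, and a risk table `r[a][b]`; all names and values are hypothetical. Each update sets one factor proportional to its prior factor times the exponential of minus λ times the risk averaged over the other factor, and the quantity λ∫rn dρ + K(ρ, π) is tracked to check that it never increases.

```python
import math

def normalise(w):
    z = sum(w)
    return [v / z for v in w]

def free_energy(r, p1, p2, q1, q2, lam):
    """lam * E_q[r] + K(q, pi) for q = q1 x q2 and pi = p1 x p2."""
    exp_risk = sum(q1[a] * q2[b] * r[a][b]
                   for a in range(len(q1)) for b in range(len(q2)))
    kl1 = sum(q * math.log(q / p) for q, p in zip(q1, p1) if q > 0)
    kl2 = sum(q * math.log(q / p) for q, p in zip(q2, p2) if q > 0)
    return lam * exp_risk + kl1 + kl2

def mean_field(r, p1, p2, lam, sweeps=50):
    """Alternate exact updates of the two factors, starting from the prior."""
    q1, q2 = list(p1), list(p2)
    history = [free_energy(r, p1, p2, q1, q2, lam)]
    for _ in range(sweeps):
        # Update factor 1 with factor 2 held fixed.
        q1 = normalise([
            p1[a] * math.exp(-lam * sum(q2[b] * r[a][b] for b in range(len(q2))))
            for a in range(len(p1))])
        # Update factor 2 with factor 1 held fixed.
        q2 = normalise([
            p2[b] * math.exp(-lam * sum(q1[a] * r[a][b] for a in range(len(q1))))
            for b in range(len(p2))])
        history.append(free_energy(r, p1, p2, q1, q2, lam))
    return q1, q2, history

# Hypothetical 3 x 2 risk table with uniform prior factors.
r = [[0.5, 0.2], [0.1, 0.4], [0.3, 0.3]]
p1 = [1 / 3, 1 / 3, 1 / 3]
p2 = [0.5, 0.5]
q1, q2, history = mean_field(r, p1, p2, lam=10.0)
```

Because each update minimises the tracked objective exactly over one factor, the sequence `history` is non-increasing, although the limit may be a local rather than a global minimum.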
In what follows (Subsections 4.1 and 4.2) we provide tight bounds for the prediction
risk of ρ̃λ . This leads to the identification of a condition on F such that the risk of ρ̃λ is
not worse than the risk of ρ̂λ . We will make this condition explicit in various examples,
using either mean field VB or parametric approximations.
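Remark 3.1 A useful identity, obtained by direct calculations, is: for any $\rho \ll \pi$,
\[
\log \int \exp[-\lambda r_n(\theta)]\,\pi(d\theta) = -\lambda \int r_n(\theta)\rho(d\theta) - \mathcal{K}(\rho, \pi) + \mathcal{K}(\rho, \hat\rho_\lambda). \tag{5}
\]
Since the left hand side does not depend on ρ, one sees that ρ̃λ, which minimises K(ρ, ρ̂λ)
over F, can equivalently be written as:
\[
\tilde\rho_\lambda = \arg\min_{\rho \in \mathcal{F}} \left\{ \int r_n(\theta)\rho(d\theta) + \frac{1}{\lambda}\mathcal{K}(\rho, \pi) \right\}.
\]
This equation will appear frequently in the sequel in the form of an empirical upper bound.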
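Identity (5) is easy to check numerically when Θ is finite. The sketch below is only an illustration under that assumption; the helper names `kl` and `gibbs` and all numerical values are hypothetical.

```python
import math

def kl(rho, mu):
    """K(rho, mu) for two distributions on a finite set, rho dominated by mu."""
    return sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0)

def gibbs(risks, prior, lam):
    """Return the pseudo-posterior and the log of its normalising constant."""
    w = [p * math.exp(-lam * r) for r, p in zip(risks, prior)]
    z = sum(w)
    return [v / z for v in w], math.log(z)

risks = [0.30, 0.10, 0.25, 0.40]   # stand-ins for r_n(theta_k)
prior = [0.1, 0.2, 0.3, 0.4]       # stand-ins for pi(theta_k)
lam = 5.0
rho_hat, log_z = gibbs(risks, prior, lam)

rho = [0.25, 0.25, 0.25, 0.25]     # an arbitrary rho dominated by pi
mean_risk = sum(r * q for r, q in zip(risks, rho))

lhs = log_z
rhs = -lam * mean_risk - kl(rho, prior) + kl(rho, rho_hat)

# Equivalent objective: mean risk + K(rho, pi) / lam differs from
# K(rho, rho_hat) / lam only by a constant that does not depend on rho.
objective = mean_risk + kl(rho, prior) / lam
```

Here `lhs` and `rhs` agree up to rounding error, and `objective` equals `(kl(rho, rho_hat) - log_z) / lam`, which is why both criteria share the same minimiser over any family F.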

4. General results
This section gives our general results, under either a Hoeffding Assumption (Definition
2.3) or a Bernstein Assumption (Definition 2.4), on risk bounds for the variational
approximation, and how they relate to risk bounds for Gibbs posteriors. These results
will be specialised to several learning problems in the following sections.

4.1. Bounds under the Hoeffding assumption


4.1.1. Empirical bounds
Theorem 4.1 Under the Hoeffding assumption (Definition 2.3), for any ε > 0, with
probability at least 1 − ε we have simultaneously for any $\rho \in \mathcal{M}_+^1(\Theta)$,
\[
\int R\,d\rho \le \int r_n\,d\rho + \frac{f(\lambda, n) + \mathcal{K}(\rho, \pi) + \log\frac{1}{\varepsilon}}{\lambda}.
\]

This result is a simple variant of a result in Catoni [2007] but for the sake of com-
pleteness, its proof is given in Appendix A. It gives us an upper bound on the risk
of both the pseudo-posterior (take ρ = ρ̂λ ) and its variational approximation (take
ρ = ρ̃λ ). These bounds may be computed from the data, and therefore provide a simple
way to evaluate the performance of the corresponding procedure, in the spirit of the
first PAC-Bayesian inequalities [Shawe-Taylor and Williamson, 1997, McAllester, 1998,
1999]. However, these bounds do not provide the rate of convergence of these estimators.
For this reason, we also provide oracle-type inequalities.

4.1.2. Oracle-type inequalities


Another way to use PAC-Bayesian bounds is to compare $\int R\,d\hat\rho_\lambda$ to the best possible
risk, thus linking this approach to oracle inequalities. This is the point of view developed
in Catoni [2004, 2007], Dalalyan and Tsybakov [2008].

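Theorem 4.2 Assume that the Hoeffding assumption is satisfied (Definition 2.3). For
any ε > 0, with probability at least 1 − ε we have simultaneously
\[
\int R\,d\hat\rho_\lambda \le B_\lambda(\mathcal{M}_+^1(\Theta)) := \inf_{\rho \in \mathcal{M}_+^1(\Theta)} \left\{ \int R\,d\rho + 2\,\frac{f(\lambda, n) + \mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}}{\lambda} \right\}
\]
and
\[
\int R\,d\tilde\rho_\lambda \le B_\lambda(\mathcal{F}) := \inf_{\rho \in \mathcal{F}} \left\{ \int R\,d\rho + 2\,\frac{f(\lambda, n) + \mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}}{\lambda} \right\}.
\]
Moreover,
\[
B_\lambda(\mathcal{F}) = B_\lambda(\mathcal{M}_+^1(\Theta)) + \frac{2}{\lambda} \inf_{\rho \in \mathcal{F}} \mathcal{K}\left(\rho, \pi_{\frac{\lambda}{2}}\right)
\]
where we recall that πλ is defined in Definition 2.2.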


In this way, we are able to compare $\int R\,d\hat\rho_\lambda$ to the best possible aggregation procedure
in $\mathcal{M}_+^1(\Theta)$ and $\int R\,d\tilde\rho_\lambda$ to the best aggregation procedure in $\mathcal{F}$. More importantly, we are
able to obtain explicit expressions for the right-hand side of these inequalities in various
models, and thus to obtain rates of convergence. This will be done in the remaining
sections. This leads to the second interest of this result: if there is a λ = λ(n) that leads
to $B_\lambda(\mathcal{M}_+^1(\Theta)) \le \overline{R} + s_n$ with $s_n \to 0$ for the pseudo-posterior $\hat\rho_\lambda$, then we only have
to prove that there is a ρ ∈ F such that K(ρ, πλ )/λ ≤ csn for some constant c > 0 to
ensure that the VB approximation ρ̃λ also reaches the rate sn .
We will see in the following sections several examples where the approximation does
not deteriorate the rate of convergence. But first let us show the equivalent oracle
inequality under the Bernstein assumption.

4.2. Bounds under the Bernstein assumption
In this context the empirical bound on the risk would depend on the minimal achievable
risk r̄n , and cannot be computed explicitly. We give the oracle inequality for both the
Gibbs posterior and its VB approximation in the following theorem.

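Theorem 4.3 Assume that the Bernstein assumption is satisfied (Definition 2.4). Assume
that λ > 0 satisfies λ − g(λ, n) > 0. Then for any ε > 0, with probability at least
1 − ε we have simultaneously:
\[
\int R\,d\hat\rho_\lambda - \overline{R} \le \overline{B}_\lambda\left(\mathcal{M}_+^1(\Theta)\right),
\]
\[
\int R\,d\tilde\rho_\lambda - \overline{R} \le \overline{B}_\lambda(\mathcal{F}),
\]
where, for either $A = \mathcal{M}_+^1(\Theta)$ or $A = \mathcal{F}$,
\[
\overline{B}_\lambda(A) = \frac{1}{\lambda - g(\lambda, n)} \inf_{\rho \in A} \left\{ [\lambda + g(\lambda, n)] \int (R - \overline{R})\,d\rho + 2\mathcal{K}(\rho, \pi) + 2\log\frac{2}{\varepsilon} \right\}.
\]
In addition,
\[
\overline{B}_\lambda(\mathcal{F}) = \overline{B}_\lambda\left(\mathcal{M}_+^1(\Theta)\right) + \frac{2}{\lambda - g(\lambda, n)} \inf_{\rho \in \mathcal{F}} \mathcal{K}\left(\rho, \pi_{\frac{\lambda + g(\lambda, n)}{2}}\right).
\]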

The main difference with Theorem 4.2 is that the function $R(\cdot)$ is replaced by $R(\cdot) - \overline{R}$.
This is a well-known way to obtain better rates of convergence.

5. Application to classification
5.1. Preliminaries
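Throughout this section, we assume that $\mathcal{Y} = \{0, 1\}$ and we consider linear classification:
$\Theta = \mathcal{X} = \mathbb{R}^d$, $f_\theta(x) = \mathbf{1}_{\langle\theta, x\rangle \ge 0}$. We put $r_n(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{f_\theta(X_i) \ne Y_i\}$, $R(\theta) = \mathbb{P}(Y \ne f_\theta(X))$
and assume that the $[(X_i, Y_i)]_{i=1}^n$ are i.i.d. In this setting, it is well-known that the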
Hoeffding assumption always holds. We state as a reminder the following lemma.

Lemma 5.1 Hoeffding assumption (1) is satisfied with $f(\lambda, n) = \lambda^2/(2n)$.

The proof is given in Appendix A for the sake of completeness.


It is also possible to prove that Bernstein assumption (2) holds in the case where the
so-called margin assumption of Mammen and Tsybakov is satisfied. The condition we
use was introduced by Tsybakov [2004] in a classification setting, based on a related
definition in Mammen and Tsybakov [1999].

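Lemma 5.2 Assume that Mammen and Tsybakov's margin assumption is satisfied: i.e.
there is a constant C such that

\[
\mathbb{E}\left[\left(\mathbf{1}_{f_\theta(X) \ne Y} - \mathbf{1}_{f_{\overline{\theta}}(X) \ne Y}\right)^2\right] \le C[R(\theta) - \overline{R}].
\]

Then Bernstein assumption (2) is satisfied with $g(\lambda, n) = \frac{C\lambda^2}{2n - \lambda}$.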

Remark 5.1 We refer the reader to Tsybakov [2004] for a proof that

\[
\mathbb{P}\left(0 < |\langle\overline{\theta}, X\rangle| \le t\right) \le C' t
\]

for some constant $C' > 0$ implies the margin assumption. In words, when $X$ is not
likely to be in the region $\langle\overline{\theta}, X\rangle \simeq 0$, where points are hard to classify, then the problem
becomes easier and the classification rate can be improved.

We propose in this context a Gaussian prior: $\pi = \mathcal{N}_d(0, \vartheta^2 I_d)$, and we consider a VB
approach based on Gaussian families. The corresponding optimization problem is not
convex, but remains feasible as we explain below.

5.2. Three sets of Variational Gaussian approximations


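Consider the three following Gaussian families
\[
\begin{aligned}
\mathcal{F}_1 &= \left\{ \Phi_{m,\sigma^2},\ m \in \mathbb{R}^d,\ \sigma^2 \in \mathbb{R}_+^* \right\}, \\
\mathcal{F}_2 &= \left\{ \Phi_{m,\boldsymbol{\sigma}^2},\ m \in \mathbb{R}^d,\ \boldsymbol{\sigma}^2 \in (\mathbb{R}_+^*)^d \right\} \quad \text{(mean field approximation)}, \\
\mathcal{F}_3 &= \left\{ \Phi_{m,\Sigma},\ m \in \mathbb{R}^d,\ \Sigma \in \mathcal{S}_+^d \right\} \quad \text{(full covariance approximation)},
\end{aligned}
\]

where $\Phi_{m,\sigma^2}$ is the Gaussian distribution $\mathcal{N}_d(m, \sigma^2 I_d)$, $\Phi_{m,\boldsymbol{\sigma}^2}$ is $\mathcal{N}_d(m, \mathrm{diag}(\boldsymbol{\sigma}^2))$, and $\Phi_{m,\Sigma}$
is $\mathcal{N}_d(m, \Sigma)$. Obviously, F1 ⊂ F2 ⊂ F3 ⊂ M1+ (Θ), and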

Bλ (M1+ (Θ)) ≤ Bλ (F3 ) ≤ Bλ (F2 ) ≤ Bλ (F1 ). (6)

Note that, for the sake of simplicity, we will use the following classical notations in the
rest of the paper: ϕ(·) is the density of N (0, 1) w.r.t. the Lebesgue measure, and Φ(·)
the corresponding c.d.f. The rest of Section 5 is organized as follows. In Subsection 5.3,
we calculate explicitly Bλ (F2 ) and Bλ (F1 ). Thanks to (6) this also gives an upper bound
on Bλ (F3 ) and proves the validity of the three types of Gaussian approximations. Then,
we give details on algorithms to compute the variational approximation based on F2 and
F3 , and provide a numerical illustration on real data.

5.3. Theoretical analysis


We start with the empirical bound for F2 (and F1 as a consequence), which is a direct
corollary of Theorem 4.1.

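Corollary 5.3 For any ε > 0, with probability at least 1 − ε we have, for any $m \in \mathbb{R}^d$,
$\boldsymbol{\sigma}^2 \in (\mathbb{R}_+)^d$,
\[
\int R\,d\Phi_{m,\boldsymbol{\sigma}^2} \le \int r_n\,d\Phi_{m,\boldsymbol{\sigma}^2} + \frac{\lambda}{2n} + \frac{\sum_{i=1}^d \left[ \frac{1}{2}\log\left(\frac{\vartheta^2}{\sigma_i^2}\right) + \frac{\sigma_i^2}{\vartheta^2} \right] + \frac{\|m\|^2}{\vartheta^2} - \frac{d}{2} + \log\frac{1}{\varepsilon}}{\lambda}.
\]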
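Since the bound of Corollary 5.3 only involves observable quantities, it can be evaluated directly. The sketch below is a minimal illustration under stated assumptions, not the code used for our experiments: labels are in {0, 1}, the complexity term is read as Σi [½ log(ϑ²/σi²) + σi²/ϑ²] + ‖m‖²/ϑ² − d/2, and the integrated empirical risk uses the fact that ⟨θ, x⟩ is Gaussian under Φ with mean ⟨m, x⟩ and variance Σi σi² xi². All function names and data are hypothetical.

```python
import math

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def integrated_empirical_risk(X, Y, m, sigma2):
    """Average over the sample of P(f_theta(x) != y), theta ~ N(m, diag(sigma2))."""
    total = 0.0
    for x, y in zip(X, Y):
        mean = sum(mj * xj for mj, xj in zip(m, x))
        sd = math.sqrt(sum(s2 * xj * xj for s2, xj in zip(sigma2, x)))
        sign = 2 * y - 1  # maps labels {0, 1} to {-1, +1}
        total += std_normal_cdf(-sign * mean / sd)
    return total / len(X)

def empirical_bound(X, Y, m, sigma2, lam, vartheta2, eps):
    """Right-hand side of the empirical bound, with prior N(0, vartheta2 * I)."""
    n, d = len(X), len(m)
    complexity = sum(0.5 * math.log(vartheta2 / s2) + s2 / vartheta2
                     for s2 in sigma2)
    complexity += sum(mj * mj for mj in m) / vartheta2 - d / 2.0
    return (integrated_empirical_risk(X, Y, m, sigma2)
            + lam / (2.0 * n)
            + (complexity + math.log(1.0 / eps)) / lam)

# Hypothetical toy data: n = 4 observations in dimension d = 2.
X = [[1.0, 0.5], [-0.3, 1.2], [0.8, -1.0], [-1.1, -0.4]]
Y = [1, 1, 0, 0]
m, sigma2 = [0.5, 0.5], [0.25, 0.25]
risk_val = integrated_empirical_risk(X, Y, m, sigma2)
bound_val = empirical_bound(X, Y, m, sigma2, lam=2.0, vartheta2=0.5, eps=0.05)
```

On such a small sample the bound is vacuous (larger than 1); it becomes informative only when n is large relative to d and λ is of order √(nd).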
We now want to apply Theorem 4.2 in this context. In order to do so, we introduce
an additional assumption.
Definition 5.1 We say that Assumption A1 is satisfied when there is a constant c > 0
such that, for any $(\theta, \theta') \in \Theta^2$ with $\|\theta\| = \|\theta'\| = 1$, $\mathbb{P}(\langle X, \theta\rangle \langle X, \theta'\rangle < 0) \le c\|\theta - \theta'\|$.
Note that this is not a stringent assumption. For example, it is satisfied as soon as
X/kXk has a bounded density on the unit sphere.

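Corollary 5.4 Assume that the VB approximation is done on either F1, F2 or F3.
Take $\lambda = \sqrt{nd}$ and $\vartheta = \frac{1}{\sqrt{d}}$. Under Assumption A1, for any ε > 0, with probability at
least 1 − ε we have simultaneously

\[
\left.\begin{array}{r}
\int R\,d\hat\rho_\lambda \\
\int R\,d\tilde\rho_\lambda
\end{array}\right\} \le \overline{R} + \sqrt{\frac{d}{n}}\log\left(4ne^2\right) + \frac{c}{\sqrt{n}} + \sqrt{\frac{d}{4n^3}} + \frac{2\log\frac{2}{\varepsilon}}{\sqrt{nd}}.
\]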

See the appendix for a proof. Note also that the values $\lambda = \sqrt{nd}$ and $\vartheta = \frac{1}{\sqrt{d}}$ allow us to
derive this almost optimal rate of convergence, but are not necessarily the best choices
in practice.
Remark 5.2 Note that Assumption A1 is not necessary to obtain oracle inequalities on
the risk integrated under ρ̂λ . We refer the reader to Chapter 1 in Catoni [2007] for such
assumption-free bounds. However, it is clear that without this assumption the shape of ρ̂λ
and ρ̃λ might be very different. Thus, it seems reasonable to require that A1 is satisfied
for the approximation of ρ̂λ by ρ̃λ to make sense.
We finally provide an application of Theorem 4.3. Under the additional constraint
that the margin assumption is satisfied, we obtain a better rate.
Corollary 5.5 Assume that the VB approximation is done on either F1, F2 or F3. Under
Assumption A1 (Definition 5.1), and under Mammen and Tsybakov's margin
assumption, with $\lambda = \frac{2n}{C+2}$ and $\vartheta > 0$, for any ε > 0, with probability at least 1 − ε,
 √
d log nϑ
R  
Rdρ̂λ (C + 2)(C + 1) 2dϑ 2 d 2 2 d2c(2C + 1)
R ≤ R̄+ + 2 + − + log + .
Rdρ̃λ 2 n n ϑ ϑn n ε n
The prior variance optimizing the bound is ϑ = d/(d + 2 + 2d/n); this choice, or any
constant instead, leads to a rate of d log(n)/n. Note that the rate d/n is minimax-optimal
in this context. This is, for example, a consequence of more general results
in Lecué [2007] under a general form of the margin assumption. See the Appendix
for a proof.

5.4. Implementation and numerical results
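For family F2 (mean field), the variational lower bound (5) equals
\[
\mathcal{L}_{\lambda,\vartheta}(m, \boldsymbol{\sigma}) = -\frac{\lambda}{n}\sum_{i=1}^n \Phi\left(-Y_i \frac{X_i m}{\sqrt{X_i\,\mathrm{diag}(\boldsymbol{\sigma}^2)\,X_i^t}}\right) - \frac{m^T m}{2\vartheta} + \frac{1}{2}\sum_{k=1}^d \left( \log\sigma_k^2 - \frac{\sigma_k^2}{\vartheta} \right),
\]
while for family F3 (full covariance), it equals
\[
\mathcal{L}_{\lambda,\vartheta}(m, \Sigma) = -\frac{\lambda}{n}\sum_{i=1}^n \Phi\left(-Y_i \frac{X_i m}{\sqrt{X_i \Sigma X_i^t}}\right) - \frac{m^T m}{2\vartheta} + \frac{1}{2}\left( \log|\Sigma| - \frac{1}{\vartheta}\mathrm{tr}\,\Sigma \right).
\]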

Both functions are non-convex, but the multimodality of the latter may be more
severe due to the larger dimension of F3 . To address this issue, we recommend using
the reparametrisation of Opper and Archambeau [2009], which makes the dimension
of the latter optimisation problem O(n); see Khan [2014] for a related approach. In
both cases, we found deterministic annealing to be a good approach to optimise
such non-convex functions. We refer to Appendix B for more details on deterministic
annealing and on our particular implementation.
We now compare the numerical performance of the mean field and full covariance VB
approximations to the Gibbs posterior (as approximated by SMC, see Section 3.1) for the
classification of standard datasets; see Table 1. We also include results for a kernel SVM
(support vector machine); this comparison is not entirely fair, since SVM is a non-linear
classifier, while all the other classifiers are linear. Still, except for the Glass dataset, the
full covariance VB approximation performs as well as or better than both SMC and SVM
(while being much faster to compute, especially compared to SMC).

Dataset    Covariates    Mean field (F2)    Full cov. (F3)    SMC     SVM
Pima            7             31.0               21.3         22.3    30.4
Credit         60             32.0               33.6         32.0    32.0
DNA           180             23.6               23.6         23.6    20.4
SPECTF         22              8.0                6.9          8.5    10.1
Glass          10             34.6               19.6         23.3     4.7
Indian         11             48.0               25.5         26.2    26.8
Breast         10             35.1                1.1          1.1     1.7

Table 1: Comparison of misclassification rates (%).

Misclassification rates for different datasets and for the proposed approximations of the
Gibbs posterior. The last column is the misclassification rate given by a kernel-SVM with
radial kernel. The hyper-parameters are chosen by cross-validation.

Interestingly, VB outperforms SMC in certain cases. This might be due to the fact
that a VB approximation tends to be more concentrated around the mode than the
Gibbs posterior it approximates. Mean field VB does not perform so well on certain
datasets (e.g. Indian). This may be due either to the approximation family being too
small, or to the corresponding optimisation problem being strongly multimodal.

6. Application to classification under convexified loss


Compared to the previous section, the advantage of convex classification is that the
corresponding variational approximation will amount to minimising a convex function.
This means that (a) the minimisation problem will be easier to deal with; and (b) we
will be able to compute a bound for the integrated risk after a given number of steps of
the minimisation procedure.
The setting is the same as in the previous section, except that for convenience we now
take $\mathcal{Y} = \{-1, 1\}$, and the risk is based on the hinge loss,
\[
r_n^H(\theta) = \frac{1}{n}\sum_{i=1}^n \max\left(0, 1 - Y_i\langle\theta, X_i\rangle\right).
\]

We will write RH for the theoretical counterpart and R̄H for its minimum in θ. We
keep the superscript H in order to allow comparison with the risk R under the 0 − 1 loss.
We assume in this section that the Xi are uniformly bounded by a constant, |Xi |< cx .
Note that we do not require an assumption of the form (A1) to obtain the results of this
section, as we rely directly on the Lipschitz continuity of the hinge risk.

6.1. Theoretical Results


Contrary to the previous section, the risk is not bounded in θ, and we must specify a
prior distribution for the Hoeffding assumption to hold.

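Lemma 6.1 Under an independent Gaussian prior π such that each component is $\mathcal{N}(0, \vartheta^2)$,
and for $\lambda < \frac{2}{c_x}\sqrt{\frac{n}{\vartheta^2}}$ and with bounded design $|X_{ij}| < c_x$, Hoeffding assumption (1) is satisfied
with $f(\lambda, n) = \frac{\lambda^2}{4n} - \frac{1}{2}\log\left(1 - \frac{\vartheta^2\lambda^2 c_x^2}{4n}\right)$.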

The main impact of such a bound is that the prior variance cannot be taken too big
relative to λ.

Corollary 6.2 Assume that the VB approximation is done on either F1, F2 or F3.
Take $\lambda = \frac{1}{c_x}\sqrt{\frac{n}{\vartheta^2}}$ and $\vartheta = \frac{1}{\sqrt{d}}$. For any ε > 0, with probability at least 1 − ε we have
simultaneously
R H  r  2 
R dρ̂λ H cx d n d 1 cx + 1 2
R H ≤R + log + 2cx + √ + 2cx log
R dρ̃λ 2 n d n nd 2cx 

The oracle inequality in the above corollary enjoys the same rate of convergence as
the equivalent result in the preceding section. In the following we link the two results.

Remark 6.1 As stated in the beginning of the section we can use the estimator specified
under the hinge loss to bound the excess risk of the 0-1 loss. We write $R^\star$ and $R^{H\star}$ for the
respective risks of the corresponding Bayes classifiers. From Zhang [2004] (section 3.3)
we have the following inequality, linking the excess risk under the hinge loss and the 0−1
loss,
\[
R(\theta) - R^\star \le R^H(\theta) - R^{H\star}
\]
for every $\theta \in \mathbb{R}^p$. By integrating with respect to $\tilde\rho^H$ (the VB approximation on any
F1, F2, F3 of the Gibbs posterior for the hinge risk) and making use of Corollary 6.2 we
have with high probability,
\[
\tilde\rho^H(R(\theta)) - R^\star \le \inf_{\theta \in \mathbb{R}^p} R^H(\theta) - R^{H\star} + O\left(\sqrt{\frac{d}{n}}\log\frac{n}{d}\right).
\]

6.2. Numerical application


We have motivated the introduction of the hinge loss as a convex upper bound. In the
sequel we show that the resulting VB approximation also leads to a convex optimization
problem. This has the advantage of opening a range of possible optimization algorithms
[Nesterov, 2004]. In addition we are able to bound the error of the approximated measure
after a fixed number of iterations (see Theorem 6.3).
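Under the model F1 each individual risk is given by:
\[
\rho_{m,\sigma}(r_i(\theta)) = (1 - \Gamma_i m)\,\Phi\left(\frac{1 - \Gamma_i m}{\sigma\|\Gamma_i\|_2}\right) + \sigma\|\Gamma_i\|_2\,\varphi\left(\frac{1 - \Gamma_i m}{\sigma\|\Gamma_i\|_2}\right) := \Xi_i\begin{pmatrix} m \\ \sigma \end{pmatrix},
\]
writing $\Gamma_i := Y_i X_i$.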
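The closed form above is the mean of the positive part of a Gaussian variable: if W = 1 − Γi θ has mean μ = 1 − Γi m and standard deviation s = σ‖Γi‖₂, then E[max(0, W)] = μΦ(μ/s) + sφ(μ/s). The following sketch compares it with a plain Monte Carlo average; it is an illustrative check only, and the function names and numerical values are hypothetical.

```python
import math
import random

def std_normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def expected_hinge(gamma, m, sigma):
    """E max(0, 1 - <gamma, theta>) for theta ~ N(m, sigma^2 I)."""
    mu = 1.0 - sum(g * mj for g, mj in zip(gamma, m))
    s = sigma * math.sqrt(sum(g * g for g in gamma))
    return mu * std_normal_cdf(mu / s) + s * std_normal_pdf(mu / s)

def monte_carlo_hinge(gamma, m, sigma, draws, rng):
    """Plain Monte Carlo estimate of the same expectation."""
    total = 0.0
    for _ in range(draws):
        score = sum(g * rng.gauss(mj, sigma) for g, mj in zip(gamma, m))
        total += max(0.0, 1.0 - score)
    return total / draws

gamma = [0.5, -1.0, 2.0]   # stands for Gamma_i = Y_i X_i
m = [0.2, 0.1, 0.3]
sigma = 0.7
exact = expected_hinge(gamma, m, sigma)
approx = monte_carlo_hinge(gamma, m, sigma, 100_000, random.Random(0))
```

The two values agree up to Monte Carlo error, which is why the variational objective under the hinge loss can be evaluated without any sampling.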
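Hence the lower bound to be maximized is given by
\[
\mathcal{L}(m, \sigma) = -\frac{\lambda}{n}\left\{ \sum_{i=1}^n (1 - \Gamma_i m)\,\Phi\left(\frac{1 - \Gamma_i m}{\sigma\|\Gamma_i\|_2}\right) + \sum_{i=1}^n \sigma\|\Gamma_i\|_2\,\varphi\left(\frac{1 - \Gamma_i m}{\sigma\|\Gamma_i\|_2}\right) \right\} - \frac{\|m\|_2^2}{2\vartheta} + \frac{d}{2}\left( \log\sigma^2 - \frac{\sigma^2}{\vartheta} \right).
\]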

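It is easy to see that $-\mathcal{L}$ is convex in $(m, \sigma)$. First note that the map
\[
\Psi : \begin{pmatrix} x \\ y \end{pmatrix} \mapsto x\,\Phi\left(\frac{x}{y}\right) + y\,\varphi\left(\frac{x}{y}\right)
\]
is convex, and note that we can write $\Xi_i\begin{pmatrix} m \\ \sigma \end{pmatrix} = \Psi\left(A\begin{pmatrix} m \\ \sigma \end{pmatrix} + b\right)$ for a suitable matrix $A$ and vector $b$; hence, by composition
of a convex function with an affine mapping, we have the result. Similar reasoning
applies to the cases F2 and F3, where in the latter the parametrization should be done
in $C$ such that $\Sigma = CC^t$. The bound is however not universally Lipschitz in σ, which
impacts the optimization algorithms.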
On the class of functions $\mathcal{F}_0 = \left\{ \Phi_{m,\frac{1}{n}},\ m \in \mathbb{R}^d \right\}$, for which our oracle inequalities still
hold, we could get faster numerical algorithms. The objective function has Lipschitz
continuous derivatives and we would get a rate of $\frac{L}{(1+k)^2}$.

Other convex losses could be considered, which would also lead to convex optimization
problems. For instance one could consider the exponential loss.

Dataset    Covariates    Hinge loss    SMC
Pima            7           21.8       22.3
Credit         60           27.2       32.0
DNA           180            4.2       23.6
SPECTF         22           19.2        8.5
Glass          10           26.12      23.3
Indian         11           26.2       25.5
Breast         10            0.5        1.1

Table 2: Comparison of misclassification rates (%).

Misclassification rates for different datasets and for the proposed approximations of the Gibbs
posterior. The hyperparameters are chosen by cross-validation. This is to be compared to
Table 1.

Theorem 6.3 Assume that the VB approximation is done on F1, F2 or F3. Denote by
$\tilde\rho_k(d\theta)$ the VB approximated measure after the $k$th iteration of an optimal convex solver
using the hinge loss. Take $\lambda = \sqrt{nd}$ and $\vartheta = \frac{1}{\sqrt{d}}$; then, under the hypotheses of Corollary
6.2, with probability $1 - \varepsilon$,
Z r  2 
H H LM cx d n d 1 cx + 1 2
R dρ̃k ≤ R + √ ++ log + 2cx + √ + 2cx log
1+k 2 n d n nd 2cx 

where L is the Lipschitz coefficient on a ball of radius M of the objective function
maximized in VB.

From Theorem 6.3 we can compute the number of iterations needed to reach a given level of
error with a given probability.
We find that on average the misclassification error (Table 2) is lower than for the 0-1
loss, where we have no guarantee that the maximum is attained.
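
As a rough illustration of this computation, the sketch below assumes that the optimization error enters the bound only through the term LM/√(1+k), and returns the smallest k for which this term falls below a tolerance. The function name and the numerical values of L, M and the tolerance are hypothetical.

```python
import math


def iterations_for_tolerance(lipschitz, radius, tol):
    """Smallest integer k >= 0 such that lipschitz * radius / sqrt(1 + k) <= tol."""
    if tol <= 0:
        raise ValueError("tol must be positive")
    k = math.ceil((lipschitz * radius / tol) ** 2 - 1.0)
    return max(k, 0)


# Hypothetical values: L = 2, M = 5, target optimization error 0.5.
k_needed = iterations_for_tolerance(2.0, 5.0, 0.5)
```

The remaining terms of the bound do not depend on k, so this count only controls the part of the error due to stopping the solver early.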

7. Application to ranking
7.1. Preliminaries
In this section we take $\mathcal{Y} = \{0, 1\}$ and consider again linear classifiers: $\Theta = \mathcal{X} = \mathbb{R}^d$,
$f_\theta(x) = \mathbf{1}_{\langle\theta,x\rangle \ge 0}$. We consider however a different criterion: in ranking, we not only want
to classify an object x well, but also want to make sure that, given two different objects,
the one that is more likely to correspond to a label 1 is assigned a larger score
through the function fθ. A usual way to measure this is to introduce the risk function

R(θ) = P[(Y1 − Y2 )(fθ (X1 ) − fθ (X2 )) < 0]

and the empirical risk

$$r_n(\theta) = \frac{1}{n(n-1)}\sum_{1\le i\ne j\le n}\mathbf{1}\{(Y_i-Y_j)(f_\theta(X_i)-f_\theta(X_j))<0\}.$$
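
A direct, quadratic-time evaluation of this empirical risk can be sketched as follows. As an assumption made for illustration, the score of an object is taken to be the linear score ⟨θ, x⟩ (the quantity that appears in the algorithmic part of this section); the function name and the toy data are hypothetical.

```python
def empirical_ranking_risk(theta, X, Y):
    """Fraction of ordered pairs (i, j), i != j, ranked in the wrong order.

    Assumption: the score of x is the linear score <theta, x>.
    """
    n = len(X)
    scores = [sum(t * x for t, x in zip(theta, row)) for row in X]
    errors = 0
    for i in range(n):
        for j in range(n):
            if i != j and (Y[i] - Y[j]) * (scores[i] - scores[j]) < 0:
                errors += 1
    return errors / (n * (n - 1))


# Toy data: one covariate, labels increasing with the covariate.
X_toy = [[1.0], [2.0], [3.0], [4.0]]
Y_toy = [0, 0, 1, 1]
risk_good = empirical_ranking_risk([1.0], X_toy, Y_toy)
risk_bad = empirical_ranking_risk([-1.0], X_toy, Y_toy)
```

Pairs with equal labels never contribute, so only the pairs with one positive and one negative label matter.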

Then, again, we recall classical results.

Lemma 7.1 The Hoeffding-type assumption is satisfied with $f(\lambda, n) = \frac{\lambda^2}{n-1}$.

The variant of the margin assumption adapted to ranking was established by Robbiano
[2013] and Ridgway et al. [2014].

Lemma 7.2 Assume the following margin assumption:

$$\mathbb{E}\left[\left(\mathbf{1}_{[f_\theta(X_1)-f_\theta(X_2)][Y_1-Y_2]<0} - \mathbf{1}_{[f_{\bar\theta}(X_1)-f_{\bar\theta}(X_2)][Y_1-Y_2]<0}\right)^2\right] \le C[R(\theta)-\bar R].$$

Then the Bernstein assumption (2) is satisfied with $g(\lambda, n) = \frac{C\lambda^2}{n-1-4\lambda}$.

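We still consider a Gaussian prior

$$\pi(d\theta) = \prod_{i=1}^{d}\varphi(\theta_i; 0, \vartheta^2)\,d\theta_i$$

and the approximation families will be the same as in Section 5: $F_1 = \{\Phi_{m,\sigma^2},\ m \in \mathbb{R}^d,\ \sigma^2 \in \mathbb{R}_+^*\}$,
$F_2 = \{\Phi_{m,\sigma^2},\ m \in \mathbb{R}^d,\ \sigma^2 \in (\mathbb{R}_+^*)^d\}$ and $F_3 = \{\Phi_{m,\Sigma},\ m \in \mathbb{R}^d,\ \Sigma \in \mathcal{S}_+^d\}$.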

7.2. Theoretical study


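Here again, we start with the empirical bound.

Corollary 7.3 For any ε > 0, with probability at least 1 − ε we have, for any $m \in \mathbb{R}^d$,
$\sigma^2 \in (\mathbb{R}_+)^d$,
$$\int R\,d\Phi_{m,\sigma^2} \le \int r_n\,d\Phi_{m,\sigma^2} + \frac{\lambda}{n-1} + \frac{\sum_{j=1}^{d}\left[\frac{1}{2}\log\left(\frac{\vartheta^2}{\sigma_j^2}\right) + \frac{\sigma_j^2}{\vartheta^2}\right] + \frac{\|m\|^2}{\vartheta^2} - \frac{d}{2} + \log\frac{1}{\varepsilon}}{\lambda}.$$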
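
An empirical bound of this type is easy to evaluate once the expected empirical risk is available. The sketch below is an illustration under stated assumptions: it uses the general form of Theorem 4.1 with f(λ, n) = λ²/(n − 1) (Lemma 7.1) and the exact Kullback divergence between the diagonal Gaussian Φ_{m,σ²} and the prior N(0, ϑ²I). The function names and the numerical values are hypothetical.

```python
import math


def gaussian_kl(m, sigma2, vartheta2):
    """Exact KL divergence between N(m, diag(sigma2)) and N(0, vartheta2 * I)."""
    kl = 0.0
    for mj, s2 in zip(m, sigma2):
        kl += 0.5 * (math.log(vartheta2 / s2) + s2 / vartheta2 - 1.0)
        kl += mj * mj / (2.0 * vartheta2)
    return kl


def empirical_bound_ranking(emp_risk, m, sigma2, vartheta2, lam, n, eps):
    """Right-hand side: emp_risk + lam/(n-1) + (KL + log(1/eps)) / lam.

    emp_risk is the integral of r_n with respect to the Gaussian approximation,
    assumed to be computed elsewhere.
    """
    complexity = gaussian_kl(m, sigma2, vartheta2) + math.log(1.0 / eps)
    return emp_risk + lam / (n - 1) + complexity / lam


# Hypothetical values: approximation equal to the prior, so the KL term vanishes.
bound = empirical_bound_ranking(0.2, [0.0, 0.0], [1.0, 1.0], 1.0,
                                lam=50.0, n=101, eps=0.05)
```

Such a function can be called at each iteration of an optimizer, or on a grid of values of λ and ϑ², to monitor the quality of the bound.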

In order to derive a theoretical bound, we introduce the following variant of Assumption A1.
Definition 7.1 We say that Assumption A2 is satisfied when there is a constant c > 0
such that, for any $(\theta, \theta') \in \Theta^2$ with $\|\theta\| = \|\theta'\| = 1$, $\mathbb{P}(\langle X_1 - X_2, \theta\rangle\langle X_1 - X_2, \theta'\rangle < 0) \le c\|\theta - \theta'\|$.
Assumption A2 is satisfied as soon as $(X_1 - X_2)/\|X_1 - X_2\|$ has a bounded density on
the unit sphere.
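Corollary 7.4 Use either F1, F2 or F3. Take $\lambda = \sqrt{\frac{d(n-1)}{2}}$ and ϑ = 1. Under (A2),
for any ε > 0, with probability at least 1 − ε,
$$\left.\begin{array}{r}\int R\,d\hat\rho_\lambda \\ \int R\,d\tilde\rho_\lambda\end{array}\right\} \le \bar R + \sqrt{\frac{2d}{n-1}}\left(1 + \frac{1}{2}\log\big(2d(n-1)\big)\right) + \frac{c\sqrt{2}}{\sqrt{n-1}} + \frac{2\sqrt{2}\,\log\frac{2e}{\varepsilon}}{\sqrt{(n-1)d}}.$$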

Finally, under an additional margin assumption, we have:


Corollary 7.5 Under Assumption A2 and the margin assumption of Lemma (7.2), for
n−1
λ = C+5 and ϑ > 0, for any ε > 0, with probability at least 1 − ε,

d log nϑ
R   
Rdρ̂λ (C + 5)(C + 1) 2dϑ 2 d 2 2
R ≤ R̄+ + + − + log
Rdρ̃λ 2 n−1 n(n − 1) ϑ ϑn − 1 n − 1 ε

d4c(C + 1)
+ .
n
The prior variance optimizing the bound is ϑ = d/(d + 2 + 2d/n). The proof is similar
to the ones of Corollaries 5.4, 5.5 and 7.4.
As in the case of classification, ranking under an AUC loss can be done by replacing
the indicator function by the corresponding upper bound given by a hinge loss. In this
case we can derive results similar to those for convexified classification; in particular, we
obtain a convex minimization problem, and results that do not require Assumption
(A2).

7.3. Algorithms and numerical results


As an illustration we focus here on family F2 (mean field). In this case the VB objective
to maximize is given by:
$$L(m,\sigma^2) = -\frac{\lambda}{n_+ n_-}\sum_{i:y_i=1,\ j:y_j=0}\Phi\left(-\frac{\Gamma_{ij}m}{\sqrt{\sum_{k=1}^{d}(\gamma_{ij}^k)^2\sigma_k^2}}\right) - \frac{\|m\|_2^2}{2\vartheta} + \frac{1}{2}\sum_{k=1}^{d}\left(\log\sigma_k^2 - \frac{\sigma_k^2}{\vartheta}\right), \tag{7}$$
where $\Gamma_{ij} = X_i - X_j$, and where $(\gamma_{ij}^k)_k$ are the elements of $\Gamma_{ij}$.
This function is expensive to compute, as it involves $n_+ n_-$ terms, the computation of
which is O(p).
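
To make the cost of this objective concrete, here is a minimal sketch that evaluates it by looping over all positive/negative pairs. It assumes the form of (7) with a diagonal variance vector (σ_k²); the function names and the toy inputs are hypothetical, and identical positive and negative points (which would give a zero standard deviation) are assumed not to occur.

```python
import math


def std_normal_cdf(t):
    """Distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def vb_ranking_objective(m, sigma2, X_pos, X_neg, lam, vartheta):
    """Objective (7): expected AUC risk term plus Gaussian penalty terms."""
    total = 0.0
    for xi in X_pos:              # observations with label 1
        for xj in X_neg:          # observations with label 0
            gamma = [a - b for a, b in zip(xi, xj)]
            mean = sum(g * mk for g, mk in zip(gamma, m))
            sd = math.sqrt(sum(g * g * s2 for g, s2 in zip(gamma, sigma2)))
            total += std_normal_cdf(-mean / sd)
    risk = lam * total / (len(X_pos) * len(X_neg))
    quad = sum(mk * mk for mk in m) / (2.0 * vartheta)
    ent = 0.5 * sum(math.log(s2) - s2 / vartheta for s2 in sigma2)
    return -risk - quad + ent


# Hypothetical toy inputs: one positive and one negative observation.
value_at_zero = vb_ranking_objective([0.0, 0.0], [1.0, 1.0],
                                     [[1.0, 0.0]], [[0.0, 0.0]], lam=10.0, vartheta=1.0)
value_aligned = vb_ranking_objective([3.0, 0.0], [1.0, 1.0],
                                     [[1.0, 0.0]], [[0.0, 0.0]], lam=10.0, vartheta=1.0)
```

The double loop is what makes a full evaluation expensive on large datasets, which motivates the stochastic approach discussed next.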

We propose to use a stochastic gradient descent in the spirit of Hoffman et al. [2013].
The model we consider is not in an exponential family, meaning we cannot use the trick
developed by these authors. We propose instead to use a standard descent.
The idea is to replace the gradient by an unbiased version based on a batch of size B,
as described in Algorithm 4 in the Appendix. Robbins and Monro [1951] show that, for
a step-size sequence $(\lambda_t)_t$ such that $\sum_t \lambda_t^2 < \infty$ and $\sum_t \lambda_t = \infty$, the algorithm converges to a local
optimum.
In our case we propose to sample pairs of data with replacement and use the unbiased
version of the derivative of the risk component. We use a simple gradient descent
without any curvature information. One could also use recent research on stochastic
quasi-Newton methods [Byrd et al., 2014].
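
The following sketch illustrates one possible stochastic scheme of this kind; it is not the Algorithm of the Appendix. It assumes objective (7), parametrizes the variances as σ_k² = exp(η_k) to keep them positive (a choice made here for convenience), samples pairs with replacement, and uses the step size 0.1/(1+t)^0.6, which satisfies the two summability conditions above. All names and the toy data are hypothetical.

```python
import math
import random


def phi(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def Phi(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def objective(m, eta, X_pos, X_neg, lam, vartheta):
    """Objective (7) over all pairs, with sigma_k^2 = exp(eta_k); for monitoring only."""
    sigma2 = [math.exp(e) for e in eta]
    risk = 0.0
    for xi in X_pos:
        for xj in X_neg:
            g = [a - b for a, b in zip(xi, xj)]
            mean = sum(gk * mk for gk, mk in zip(g, m))
            s = math.sqrt(sum(gk * gk * s2 for gk, s2 in zip(g, sigma2)))
            risk += Phi(-mean / s)
    risk /= len(X_pos) * len(X_neg)
    quad = sum(mk * mk for mk in m) / (2.0 * vartheta)
    ent = 0.5 * sum(e - s2 / vartheta for e, s2 in zip(eta, sigma2))
    return -lam * risk - quad + ent


def stochastic_vb(X_pos, X_neg, lam, vartheta, batch=5, iters=300, seed=0):
    """Stochastic gradient ascent on (7) using random pairs (one positive, one negative)."""
    rng = random.Random(seed)
    d = len(X_pos[0])
    m, eta = [0.0] * d, [0.0] * d
    for t in range(iters):
        step = 0.1 / (1.0 + t) ** 0.6          # sum = inf, sum of squares < inf
        sigma2 = [math.exp(e) for e in eta]
        gm, ge = [0.0] * d, [0.0] * d
        for _ in range(batch):
            xi, xj = rng.choice(X_pos), rng.choice(X_neg)
            g = [a - b for a, b in zip(xi, xj)]
            mean = sum(gk * mk for gk, mk in zip(g, m))
            s = math.sqrt(sum(gk * gk * s2 for gk, s2 in zip(g, sigma2)))
            dens = phi(mean / s)
            for k in range(d):
                # Unbiased estimates of the gradient of minus the risk term.
                gm[k] += dens * g[k] / s / batch
                ge[k] -= dens * mean * g[k] ** 2 * sigma2[k] / (2.0 * s ** 3) / batch
        for k in range(d):
            m[k] += step * (lam * gm[k] - m[k] / vartheta)
            eta[k] += step * (lam * ge[k] + 0.5 * (1.0 - sigma2[k] / vartheta))
    return m, eta


# Hypothetical, well separated toy data.
X_pos = [[1.0, 0.2], [1.5, -0.3], [0.8, 0.1]]
X_neg = [[-1.0, 0.1], [-0.7, -0.2], [-1.2, 0.3]]
start = objective([0.0, 0.0], [0.0, 0.0], X_pos, X_neg, lam=20.0, vartheta=1.0)
m_hat, eta_hat = stochastic_vb(X_pos, X_neg, lam=20.0, vartheta=1.0)
final = objective(m_hat, eta_hat, X_pos, X_neg, lam=20.0, vartheta=1.0)
```

Each iteration costs O(Bd) instead of O(n₊n₋d); the full objective is evaluated here only to monitor progress on the toy data.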
For illustration, we consider a small dataset (Pima), and a larger one (Adult). The
latter is already quite challenging, with $n_+ n_-$ = 193,829,520 pairs to compare. In both
cases, and for the different batch sizes, convergence is obtained within a few iterations only
and leads to acceptable bounds.
In Figure 1 we show the empirical bound on the AUC risk as a function of the iteration
of the algorithm, for several batch sizes. The bound is taken for 95% probability, the
batch sizes are taken to be B = 1, 10, 20, 50 for the Pima dataset, and 50 for the Adult
dataset. The figure shows an additional feature of VB approximation in the context of
Gibbs posteriors: namely the possibility of computing the empirical upper bound given
by Corollary 7.3. That is, we can check the quality of the bound at each iteration of the
algorithm, or for different values of the hyperparameters.

8. Application to matrix completion


The matrix completion problem has received increasing attention recently, partly due to
spectacular theoretical results [Candès and Tao, 2010], and to challenging applications
like the Netflix challenge [Bennett and Lanning, 2007]. In the perspective of this paper,
the specific interest of this application is twofold. First, this is a case where the family of
approximations is not parametric, but rather of the form (3), i.e. the family of products
of independent components. Then, there is no known theoretical result for the Gibbs
estimator in the considered model, yet we can still directly bound the loss induced by
the variational approximation.
We observe i.i.d. pairs $((X_i, Y_i))_{i=1}^{n}$ where $X_i \in \{1, \dots, m_1\} \times \{1, \dots, m_2\}$, and we
assume that there is an $m_1 \times m_2$ matrix M such that $Y_i = M_{X_i} + \varepsilon_i$ and the $\varepsilon_i$ are
centred. Assuming that $X_i$ is uniform on $\{1, \dots, m_1\} \times \{1, \dots, m_2\}$, that $f_\theta(X_i) = \theta_{X_i}$,
and taking the quadratic risk, $R(\theta) = \mathbb{E}\left[(Y_i - \theta_{X_i})^2\right]$, we have that
$$R(\theta) - \bar R = \frac{1}{m_1 m_2}\|\theta - M\|_F^2$$
where $\|\cdot\|_F$ stands for the Frobenius norm.
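
This identity can be verified on a small example. The sketch below assumes, in addition to the text, that the noise has a finite variance that does not depend on the observed entry; the helper names and the toy matrices are hypothetical.

```python
def low_rank(U, V):
    """theta = U V^T for U (m1 x K) and V (m2 x K), given as nested lists."""
    return [[sum(u * v for u, v in zip(ui, vj)) for vj in V] for ui in U]


def quadratic_risk(theta, M, noise_var):
    """R(theta) = E(Y - theta_X)^2, X uniform on the entries, Y = M_X + centred noise."""
    m1, m2 = len(M), len(M[0])
    total = sum((M[i][j] - theta[i][j]) ** 2 + noise_var
                for i in range(m1) for j in range(m2))
    return total / (m1 * m2)


def excess_risk(theta, M):
    """Squared Frobenius distance divided by m1 * m2."""
    m1, m2 = len(M), len(M[0])
    return sum((theta[i][j] - M[i][j]) ** 2
               for i in range(m1) for j in range(m2)) / (m1 * m2)


# Hypothetical 3 x 2 example with K = 2.
M_true = low_rank([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
theta0 = [[0.0, 0.0] for _ in range(3)]
gap = quadratic_risk(theta0, M_true, 0.3) - quadratic_risk(M_true, M_true, 0.3)
```

The noise variance cancels in the difference, which is why only the Frobenius distance to M remains.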
A common way to parametrise the problem is
$$\Theta = \{\theta = UV^T,\ U \in \mathbb{R}^{m_1 \times K},\ V \in \mathbb{R}^{m_2 \times K}\}$$

[Figure: two panels, (a) Pima and (b) Adult, each plotting the empirical bound (95%) against the number of iterations.]
Figure 1: Error bound at each iteration, stochastic descent, Pima and Adult
datasets.
Stochastic VB with fixed temperature λ = 100 for Pima and λ = 1000 for Adult. The left panel shows
several curves that correspond to different batch sizes; these curves are hard to distinguish. The right panel
is for a batch size of 50. The Adult dataset has n = 32556 observations and $n_+ n_-$ = 193,829,520 possible
pairs. Convergence is obtained in a matter of seconds. The bounds are the empirical bounds obtained in
Corollary 7.3 for a probability of 95%.

where K is large; e.g. K = min(m1, m2). Following Salakhutdinov and Mnih [2008], we
define the following prior distribution: $U_{\cdot,j} \sim \mathcal{N}(0, \gamma_j I)$, $V_{\cdot,j} \sim \mathcal{N}(0, \gamma_j I)$ where the $\gamma_j$'s
are i.i.d. from an inverse gamma distribution, $\gamma_j \sim I\Gamma(a, b)$.
Note that VB algorithms were used in this context by Lim and Teh [2007] (with a
slightly simpler prior however: the $\gamma_j$'s are fixed rather than random). Since then, this
prior and variants were used in several papers [e.g. Lawrence and Urtasun, 2009, Zhou
et al., 2010]. Until now, to our knowledge, no theoretical results have been proved. Two
papers prove minimax-optimal rates for slightly modified estimators (by truncation), for
which efficient algorithms are unknown [Mai and Alquier, 2015, Suzuki, 2014]. However,
using Theorems 4.2 and 4.3 we are able to prove the following: if there is a PAC-Bayesian
bound leading to a rate for ρ̂λ in this context, then the same rate holds for ρ̃λ. In other
words: if someone proves the conjecture that the Gibbs estimator is minimax-optimal
(up to log terms) in this context, then the VB approximation will automatically enjoy
the same property.

We propose the following approximation:
$$\mathcal{F} = \left\{\rho(d(U,V)) = \prod_{i=1}^{m_1}u_i(dU_{i,\cdot})\prod_{j=1}^{m_2}v_j(dV_{j,\cdot})\right\}.$$

Theorem 8.1 Assume that $M = UV^T$ with $|U_{i,k}|, |V_{j,k}| \le C$. Assume that rank(M) = r
so that we can assume that $U_{\cdot,r+1} = \dots = U_{\cdot,K} = V_{\cdot,r+1} = \dots = V_{\cdot,K} = 0$ (note that
the prior π does not depend on the knowledge of r though). Choose the prior distribution
on the hyper-parameters $\gamma_j$ as inverse gamma Inv−Γ(a, b) with $b \le 1/[2\beta(m_1 \vee m_2)\log(2K(m_1 \vee m_2))]$.
Then there is a constant C(a, C) such that, for any β > 0,
$$\inf_{\rho\in\mathcal{F}}\mathcal{K}(\rho,\pi_\beta) \le C(a,C)\left\{r(m_1+m_2)\log\left[\beta b(m_1+m_2)K\right] + \frac{1}{\beta}\right\}.$$

See the Appendix for a proof.


For instance, in Theorem 4.3, in classification and ranking we had λ, λ − g(λ, n) and
λ + g(λ, n) of order O(n). In this case we would have:
$$\frac{2}{\lambda-g(\lambda,n)}\inf_{\rho\in\mathcal{F}}\mathcal{K}\left(\rho,\pi_{\frac{\lambda+g(\lambda,n)}{2}}\right) = O\left(\frac{C(a,C)\,r(m_1+m_2)\log\left[nb(m_1+m_2)K\right]}{n}\right),$$
and note that in this context it is known that the minimax rate is at least $r(m_1 + m_2)/n$
[Koltchinskii et al., 2011].

8.1. Algorithm
As already mentioned, the approximation family is not parametric in this case, but rather
of mean-field type. The corresponding VB algorithm amounts to iterating equation (4),
which takes the following form in this particular case:
$$u_j(dU_{j,\cdot}) \propto \exp\left\{-\frac{\lambda}{n}\sum_i \mathbb{E}_{V,U_{-j}}\left[\left(Y_i - (UV^T)_{X_i}\right)^2\right] - \sum_{k=1}^{K}\mathbb{E}_{\gamma_k}\left[\frac{1}{2\gamma_k}\right]U_{jk}^2\right\}$$
$$v_j(dV_{j,\cdot}) \propto \exp\left\{-\frac{\lambda}{n}\sum_i \mathbb{E}_{V_{-j},U}\left[\left(Y_i - (UV^T)_{X_i}\right)^2\right] - \sum_{k=1}^{K}\mathbb{E}_{\gamma_k}\left[\frac{1}{2\gamma_k}\right]V_{jk}^2\right\}$$
$$p(\gamma_k) \propto \exp\left\{-\frac{1}{2\gamma_k}\left(\sum_j \mathbb{E}_U\left[U_{jk}^2\right] + \sum_i \mathbb{E}_V\left[V_{ik}^2\right]\right) + (\alpha+1)\log\frac{1}{\gamma_k} - \frac{\beta}{\gamma_k}\right\}$$

where the expectations are taken with respect to the thus defined variational
approximations. One recognises Gaussian distributions for the first two, and an inverse Gamma
distribution for the third. We refer to Lim and Teh [2007] for more details on this
algorithm and for a numerical illustration.

9. Discussion
We showed in several important scenarios that approximating a Gibbs posterior through
VB (Variational Bayes) techniques does not deteriorate the rate of convergence of the
corresponding procedure. We also described practical algorithms for fast computation of
these VB approximations, and provided empirical bounds that may be computed from
the data to evaluate the performance of the so-obtained VB-approximated procedure.
We believe these results provide a strong incentive to recommend VB as the default
approach to approximate Gibbs posteriors, in lieu of Monte Carlo methods.
We hope to extend our results to other applications beyond those discussed in this
paper, such as regression. One technical difficulty with regression is that the risk function
is not bounded, which makes our approach a bit less direct to apply. In many papers
on PAC-Bayesian bounds for regression, the noise can be unbounded (usually, it is
assumed to be sub-exponential), but one assumes that the predictors are bounded, see
e.g. Alquier and Biau [2013]. However, using the robust loss function of Audibert and
Catoni, it is possible to relax this assumption [Audibert and Catoni, 2011, Catoni, 2012].
This requires a more technical analysis, which we leave for further work.

References
P. Alquier. Bayesian methods for low-rank matrix estimation: short survey and theoret-
ical study. In S. Jain, R. Munos, F. Stephan, and T. Zeugmann, editors, Algorithmic
Learning Theory. Springer - Lecture Notes in Artificial Intelligence, 2014.

P. Alquier and G. Biau. Sparse single-index model. Journal of Machine Learning Re-
search, 14(1):243–280, 2013.

P. Alquier and X. Li. Prediction of quantiles by statistical learning and application to
GDP forecasting. In J.-G. Ganascia, P. Lenca, and J.-M. Petit, editors, Discovery
Science. Springer - Lecture Notes in Artificial Intelligence, 2012.

J.-Y. Audibert and O. Catoni. Robust linear least squares regression. Ann. Statist.,
39(5):2766–2794, 10 2011. doi: 10.1214/11-AOS918. URL http://dx.doi.org/10.1214/11-AOS918.

J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop
07, 2007.

C. M. Bishop. Pattern Recognition and Machine Learning, chapter 10. Springer, 2006.

P. Bissiri, C. Holmes, and S. Walker. A general framework for updating belief
distributions. arXiv preprint arXiv:1306.6430, 2013.

S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. Oxford University
Press, 2013.

R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method
for large-scale optimization. arXiv preprint arXiv:1401.7020, 2014.

E. J. Candès and T. Tao. The power of convex relaxation: near-optimal matrix
completion. IEEE Trans. Inform. Theory, 56(5):2053–2080, 2010. ISSN 0018-9448. doi:
10.1109/TIT.2010.2044061. URL http://dx.doi.org/10.1109/TIT.2010.2044061.

O. Catoni. Statistical learning theory and stochastic optimization, volume 1851 of Lecture
Notes in Mathematics. Springer-Verlag, Berlin, 2004. Lecture notes from the 31st
Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.

O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical
learning. Institute of Mathematical Statistics Lecture Notes–Monograph Series, 56.
Institute of Mathematical Statistics, Beachwood, OH, 2007.

O. Catoni. Challenging the empirical mean and empirical variance: A deviation study.
Ann. Inst. H. Poincaré Probab. Statist., 48(4):1148–1185, 11 2012. doi: 10.1214/11-AIHP454.
URL http://dx.doi.org/10.1214/11-AIHP454.

V. Chernozhukov and H. Hong. An MCMC approach to classical estimation. Journal of
Econometrics, 115(2):293–346, 2003.

S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of
U-statistics. Ann. Stat., 36(2):844–874, 2008.

A. S. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp
PAC-Bayesian bounds and sparsity. Machine Learning, 72:39–61, 2008.

A. S. Dalalyan and A. B. Tsybakov. Sparse regression learning by aggregation and
Langevin Monte-Carlo. Journal of Computer and System Science, 78(5):1423–1443,
2012.

P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. J. R. Statist.
Soc. B, 68(3):411–436, 2006. ISSN 1467-9868.

P. J. Green, K. Latuszynski, M. Pereyra, and C. P. Robert. Bayesian computation:
a perspective on the current state, and sampling backwards and forwards. Preprint
arXiv:1502.01148, 2015.

B. Guedj and P. Alquier. PAC-Bayesian estimation and prevision in sparse additive
models. Electronic Journal of Statistics, 7:264–291, 2013.

W. Hoeffding. Probability inequalities for sums of random variables. Annals of
Mathematical Statistics, 10:293–325, 1948.

M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference.
The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

W. Jiang and M. A. Tanner. Gibbs posterior for variable selection in high-dimensional
classification and data mining. The Annals of Statistics, 36(5):2207–2231, 2008.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to
variational methods for graphical models. Machine Learning, (37):183–233, 1999.

M. E. Khan. Decoupled variational Gaussian inference. In Advances in Neural
Information Processing Systems, pages 1547–1555, 2014.

M. E. Khan, A. Aravkin, M. Friedlander, and M. Seeger. Fast dual variational inference
for non-conjugate latent Gaussian models. In Proceedings of The 30th International
Conference on Machine Learning, pages 951–959, 2013.

V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal
rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329,
2011.

N. D. Lawrence and R. Urtasun. Non-linear matrix factorization with Gaussian
processes. In Proceedings of the 26th Annual International Conference on Machine
Learning, pages 601–608. ACM, 2009.

G. Lecué. Méthodes d’aggrégation: optimalité et vitesses rapides. Ph.D. thesis,
Université Paris 6, 2007.

Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction.
Proceedings of KDD Cup and Workshop, 7:15–21, 2007.

D. J. C. MacKay. Information theory, inference and learning algorithms. Cambridge
University Press, 2002.

T. T. Mai and P. Alquier. A Bayesian approach for matrix completion: optimal rate
under general sampling distribution. Electronic Journal of Statistics, 9:823–841, 2015.

E. Mammen and A. Tsybakov. Smooth discrimination analysis. The Annals of Statistics,
27(6):1808–1829, 1999.

D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth
Annual Conference On Computational Learning Theory, Santa Cruz, California
(Electronic), pages 164–170. ACM, New-York, 1999.

D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the eleventh annual
conference on Computational learning theory, pages 230–234. ACM, New York, 1998.

Y. Nesterov. Introductory lectures on convex optimization, volume 87. Springer Science
& Business Media, 2004.

M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural
computation, 21(3):786–792, 2009.

G. Parisi. Statistical field theory. Addison-Wesley, New-York, 1988.

J. Ridgway, P. Alquier, N. Chopin, and F. Liang. PAC-Bayesian AUC classification
and scoring. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q.
Weinberger, editors, Advances in Neural Information Processing Systems 27, pages
658–666. Curran Associates Inc., 2014.

S. Robbiano. Upper bounds and aggregation in bipartite ranking. Electronic Journal of
Statistics, 7:1249–1271, 2013.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of
Mathematical Statistics, pages 400–407, 1951.

R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov
chain Monte Carlo. In Proceedings of the 25th international conference on Machine
learning, pages 880–887. ACM, 2008.

J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayesian estimator. In
Proceedings of the tenth annual conference on Computational learning theory, pages
2–9. ACM, 1997.

T. Suzuki. Convergence rate of Bayesian tensor estimator: Optimal rate without
restricted strong convexity. arXiv preprint arXiv:1408.3092 (accepted by ICML2015),
2014.

A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of
Statistics, 32(1):135–166, 2004.

O. Wintenberger. Deviation inequalities for sums of weakly dependent time series.
Electronic Communications in Probability, 15:489–503, 2010.

Y. Yang. Aggregating regression procedures to improve performance. Bernoulli, 10:
25–47, 2004.

A. Yuille. Belief propagation, mean-field and the Bethe approximation. Technical report,
Dept. Statistics UCLA, 2010.

T. Zhang. Statistical behavior and consistency of classification methods based on convex
risk minimization. Annals of Statistics, pages 56–85, 2004.

T. Zhang. Information theoretical upper and lower bounds for statistical estimation.
IEEE Transaction on Information Theory, 52:1307–1321, 2006.

M. Zhou, C. Wang, M. Chen, J. Paisley, D. Dunson, and L. Carin. Nonparametric
Bayesian matrix completion. Proc. IEEE SAM, 2010.

A. Proofs
A.1. Preliminary remarks
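We start with a general remark. Let h be a function $\Theta \to \mathbb{R}_+$ with $\int\exp[-h(\theta)]\pi(d\theta) < \infty$.
Let us put
$$\pi[h](d\theta) = \frac{\exp[-h(\theta)]}{\int\exp[-h(\theta')]\pi(d\theta')}\,\pi(d\theta).$$
Direct calculation yields, for any $\rho \ll \pi$ with $\int h\,d\rho < \infty$,
$$\mathcal{K}(\rho,\pi[h]) = \int h\,d\rho + \mathcal{K}(\rho,\pi) + \log\int\exp(-h)\,d\pi.$$

Two well-known consequences are
$$\pi[h] = \arg\min_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{\int h\,d\rho + \mathcal{K}(\rho,\pi)\right\},$$
$$-\log\int\exp(-h)\,d\pi = \min_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{\int h\,d\rho + \mathcal{K}(\rho,\pi)\right\}.$$

We will use these identities many times in what follows. The most frequent application
will be with $h(\theta) = \lambda r_n(\theta)$ (in this case $\pi[\lambda r_n] = \hat\rho_\lambda$) or $h(\theta) = \pm\lambda[r_n(\theta) - R(\theta)]$.
The first case leads to
$$\mathcal{K}(\rho,\hat\rho_\lambda) = \lambda\int r_n\,d\rho + \mathcal{K}(\rho,\pi) + \log\int\exp(-\lambda r_n)\,d\pi, \tag{8}$$
$$\hat\rho_\lambda = \arg\min_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{\lambda\int r_n\,d\rho + \mathcal{K}(\rho,\pi)\right\}, \tag{9}$$
$$-\log\int\exp(-\lambda r_n)\,d\pi = \min_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{\lambda\int r_n\,d\rho + \mathcal{K}(\rho,\pi)\right\}. \tag{10}$$

We will use (8), (9) and (10) several times in this appendix.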
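
On a finite parameter set these identities reduce to elementary sums, which makes them easy to check. The sketch below is an illustration only; the three-point prior, the values of h and the helper names are hypothetical.

```python
import math


def kl(rho, pi):
    """Kullback divergence between two distributions on a finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


def gibbs(pi, h):
    """Return pi[h] and its normalizing constant, the sum of pi * exp(-h)."""
    weights = [p * math.exp(-hv) for p, hv in zip(pi, h)]
    z = sum(weights)
    return [w / z for w in weights], z


def free_energy(rho, pi, h):
    """The functional minimized by pi[h]: integral of h d(rho) plus KL(rho, pi)."""
    return sum(r * hv for r, hv in zip(rho, h)) + kl(rho, pi)


# Hypothetical three-point example.
pi = [0.5, 0.3, 0.2]
h = [1.0, 0.2, 2.5]
post, z = gibbs(pi, h)
other = [0.1, 0.6, 0.3]
gap = free_energy(other, pi, h) - free_energy(post, pi, h)
```

The gap between the free energy of any other distribution and that of the Gibbs distribution equals the Kullback divergence between the two, which is the first identity of this subsection.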

A.2. Proof of the theorems in Subsection 4.1


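Proof of Theorem 4.1. This proof follows the standard PAC-Bayesian approach (see Catoni
[2007]). Apply Fubini’s theorem to the first inequality of (1):
$$\mathbb{E}\int\exp\left\{\lambda[R(\theta)-r_n(\theta)] - f(\lambda,n)\right\}\pi(d\theta) \le 1$$

then apply the preliminary remark with $h(\theta) = \lambda[r_n(\theta) - R(\theta)]$:
$$\mathbb{E}\exp\left\{\sup_{\rho\in\mathcal{M}_+^1(\Theta)}\left[\int\lambda[R(\theta)-r_n(\theta)]\rho(d\theta) - \mathcal{K}(\rho,\pi)\right] - f(\lambda,n)\right\} \le 1.$$

Multiply both sides by ε and use $\mathbb{E}[\exp(U)] \ge \mathbb{P}(U > 0)$ for any U to obtain:
$$\mathbb{P}\left[\sup_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{\int\lambda[R(\theta)-r_n(\theta)]\rho(d\theta) - \mathcal{K}(\rho,\pi)\right\} - f(\lambda,n) + \log(\varepsilon) > 0\right] \le \varepsilon.$$

Then consider the complementary event:
$$\mathbb{P}\left\{\forall\rho\in\mathcal{M}_+^1(\Theta),\quad \lambda\int R\,d\rho \le \lambda\int r_n\,d\rho + f(\lambda,n) + \mathcal{K}(\rho,\pi) + \log\frac{1}{\varepsilon}\right\} \ge 1-\varepsilon.$$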

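Proof of Theorem 4.2. Using the same calculations as above, we have, with probability
at least 1 − ε, simultaneously for all $\rho \in \mathcal{M}_+^1(\Theta)$,
$$\lambda\int R\,d\rho \le \lambda\int r_n\,d\rho + f(\lambda,n) + \mathcal{K}(\rho,\pi) + \log\frac{2}{\varepsilon} \tag{11}$$
$$\lambda\int r_n\,d\rho \le \lambda\int R\,d\rho + f(\lambda,n) + \mathcal{K}(\rho,\pi) + \log\frac{2}{\varepsilon}. \tag{12}$$

We use (11) with $\rho = \hat\rho_\lambda$ and (9) to get
$$\lambda\int R\,d\hat\rho_\lambda \le \inf_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{\lambda\int r_n\,d\rho + f(\lambda,n) + \mathcal{K}(\rho,\pi) + \log\frac{2}{\varepsilon}\right\}$$

and plugging (12) into the right-hand side, we obtain
$$\lambda\int R\,d\hat\rho_\lambda \le \inf_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{\lambda\int R\,d\rho + 2f(\lambda,n) + 2\mathcal{K}(\rho,\pi) + 2\log\frac{2}{\varepsilon}\right\}.$$

Now, we work with $\tilde\rho_\lambda = \arg\min_{\rho\in\mathcal{F}}\mathcal{K}(\rho,\hat\rho_\lambda)$. Plugging (8) into (11) we get, for any ρ,
$$\lambda\int R\,d\rho \le f(\lambda,n) + \mathcal{K}(\rho,\hat\rho_\lambda) - \log\int\exp(-\lambda r_n)\,d\pi + \log\frac{2}{\varepsilon}.$$
By definition of $\tilde\rho_\lambda$, we have:
$$\lambda\int R\,d\tilde\rho_\lambda \le \inf_{\rho\in\mathcal{F}}\left\{f(\lambda,n) + \mathcal{K}(\rho,\hat\rho_\lambda) - \log\int\exp(-\lambda r_n)\,d\pi + \log\frac{2}{\varepsilon}\right\}$$

and, using (8) again, we obtain:
$$\lambda\int R\,d\tilde\rho_\lambda \le \inf_{\rho\in\mathcal{F}}\left\{\lambda\int r_n\,d\rho + f(\lambda,n) + \mathcal{K}(\rho,\pi) + \log\frac{2}{\varepsilon}\right\}.$$

We plug (12) into the right-hand side to obtain:
$$\lambda\int R\,d\tilde\rho_\lambda \le \inf_{\rho\in\mathcal{F}}\left\{\lambda\int R\,d\rho + 2f(\lambda,n) + 2\mathcal{K}(\rho,\pi) + 2\log\frac{2}{\varepsilon}\right\}.$$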

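This proves the second inequality of the theorem. In order to prove the claim
$$B_\lambda(\mathcal{F}) = B_\lambda(\mathcal{M}_+^1(\Theta)) + \frac{2}{\lambda}\inf_{\rho\in\mathcal{F}}\mathcal{K}(\rho,\pi_{\frac{\lambda}{2}}),$$

note that
$$\begin{aligned}
B_\lambda(\mathcal{F}) &= \inf_{\rho\in\mathcal{F}}\left\{\int R\,d\rho + \frac{2f(\lambda,n)}{\lambda} + \frac{2\mathcal{K}(\rho,\pi)}{\lambda} + \frac{2\log\frac{2}{\varepsilon}}{\lambda}\right\} \\
&= \inf_{\rho\in\mathcal{F}}\left\{-\frac{2}{\lambda}\log\int\exp\left(-\frac{\lambda}{2}R\right)d\pi + \frac{2f(\lambda,n)}{\lambda} + \frac{2\mathcal{K}(\rho,\pi_{\frac{\lambda}{2}})}{\lambda} + \frac{2\log\frac{2}{\varepsilon}}{\lambda}\right\} \\
&= -\frac{2}{\lambda}\log\int\exp\left(-\frac{\lambda}{2}R\right)d\pi + \frac{2f(\lambda,n)}{\lambda} + \frac{2\log\frac{2}{\varepsilon}}{\lambda} + \frac{2}{\lambda}\inf_{\rho\in\mathcal{F}}\mathcal{K}(\rho,\pi_{\frac{\lambda}{2}}) \\
&= B_\lambda(\mathcal{M}_+^1(\Theta)) + \frac{2}{\lambda}\inf_{\rho\in\mathcal{F}}\mathcal{K}(\rho,\pi_{\frac{\lambda}{2}}).
\end{aligned}$$

This ends the proof. □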

A.3. Proof of Theorem 4.3 (Subsection 4.2)


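Proof of Theorem 4.3. As in the proof of Theorem 4.1, we apply Fubini, then (10) to
the first inequality of (2) to obtain
$$\mathbb{E}\exp\left\{\sup_\rho\left[\int\Big(\lambda[R(\theta)-\bar R] - \lambda[r_n(\theta)-\bar r_n] - g(\lambda,n)[R(\theta)-\bar R]\Big)\rho(d\theta) - \mathcal{K}(\rho,\pi)\right]\right\} \le 1$$

and we multiply both sides by ε/2 to get
$$\mathbb{P}\left\{\sup_\rho\left[[\lambda-g(\lambda,n)]\left(\int R\,d\rho-\bar R\right) - \lambda\left(\int r_n\,d\rho-\bar r_n\right) - \mathcal{K}(\rho,\pi) - \log\frac{2}{\varepsilon}\right] \ge 0\right\} \le \frac{\varepsilon}{2}. \tag{13}$$

We now consider the second inequality in (2):
$$\mathbb{E}\exp\left\{\lambda[r_n(\theta)-\bar r_n] - \lambda[R(\theta)-\bar R] - g(\lambda,n)[R(\theta)-\bar R]\right\} \le 1.$$

The same derivation leads to
$$\mathbb{P}\left\{\sup_\rho\left[\lambda\left(\int r_n\,d\rho-\bar r_n\right) - [\lambda+g(\lambda,n)]\left(\int R\,d\rho-\bar R\right) - \mathcal{K}(\rho,\pi) - \log\frac{2}{\varepsilon}\right] \ge 0\right\} \le \frac{\varepsilon}{2}. \tag{14}$$

We combine (13) and (14) by a union bound argument, and we consider the complementary
event: with probability at least 1 − ε, simultaneously for all $\rho \in \mathcal{M}_+^1(\Theta)$,
$$[\lambda-g(\lambda,n)]\left(\int R\,d\rho-\bar R\right) \le \lambda\left(\int r_n\,d\rho-\bar r_n\right) + \mathcal{K}(\rho,\pi) + \log\frac{2}{\varepsilon}, \tag{15}$$
$$\lambda\left(\int r_n\,d\rho-\bar r_n\right) \le [\lambda+g(\lambda,n)]\left(\int R\,d\rho-\bar R\right) + \mathcal{K}(\rho,\pi) + \log\frac{2}{\varepsilon}. \tag{16}$$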
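We now derive consequences of these two inequalities (in other words, we focus on the
event where these two inequalities are satisfied). Using (9) in (15) yields
$$[\lambda-g(\lambda,n)]\left(\int R\,d\hat\rho_\lambda-\bar R\right) \le \inf_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{\lambda\left(\int r_n\,d\rho-\bar r_n\right) + \mathcal{K}(\rho,\pi) + \log\frac{2}{\varepsilon}\right\}.$$

We plug (16) into the right-hand side to obtain:
$$[\lambda-g(\lambda,n)]\left(\int R\,d\hat\rho_\lambda-\bar R\right) \le \inf_{\rho\in\mathcal{M}_+^1(\Theta)}\left\{[\lambda+g(\lambda,n)]\left(\int R\,d\rho-\bar R\right) + 2\mathcal{K}(\rho,\pi) + 2\log\frac{2}{\varepsilon}\right\}.$$

Now, we work with $\tilde\rho_\lambda$. Plugging (8) into (13) we get
$$[\lambda-g(\lambda,n)]\left(\int R\,d\rho-\bar R\right) \le \mathcal{K}(\rho,\hat\rho_\lambda) - \log\int\exp[-\lambda(r_n-\bar r_n)]\,d\pi + \log\frac{2}{\varepsilon}.$$

By definition of $\tilde\rho_\lambda$, we have:
$$[\lambda-g(\lambda,n)]\left(\int R\,d\tilde\rho_\lambda-\bar R\right) \le \inf_{\rho\in\mathcal{F}}\left\{\mathcal{K}(\rho,\hat\rho_\lambda) - \log\int\exp[-\lambda(r_n-\bar r_n)]\,d\pi + \log\frac{2}{\varepsilon}\right\}.$$

Then, apply (8) again to get:
$$[\lambda-g(\lambda,n)]\left(\int R\,d\tilde\rho_\lambda-\bar R\right) \le \inf_{\rho\in\mathcal{F}}\left\{\lambda\int(r_n-\bar r_n)\,d\rho + \mathcal{K}(\rho,\pi) + \log\frac{2}{\varepsilon}\right\}.$$

Plug (16) into the right-hand side to get
$$[\lambda-g(\lambda,n)]\left(\int R\,d\tilde\rho_\lambda-\bar R\right) \le \inf_{\rho\in\mathcal{F}}\left\{[\lambda+g(\lambda,n)]\int(R-\bar R)\,d\rho + 2\mathcal{K}(\rho,\pi) + 2\log\frac{2}{\varepsilon}\right\}.$$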

A.4. Proofs of Section 5


Proof of Lemma 5.1. Combine Theorem 2.1 p. 25 and Lemma 2.2 p. 27 in Boucheron
et al. [2013]. □

Proof of Lemma 5.2. Apply Theorem 2.10 in Boucheron et al. [2013], and plug in the
margin assumption. □
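Proof of Corollary 5.4. We recall that, thanks to (6), it is enough to prove the claim for
F1. We apply Theorem 4.2 to get:
$$\begin{aligned}
B_\lambda(F_1) &= \inf_{(m,\sigma^2)}\left\{\int R\,d\Phi_{m,\sigma^2} + \frac{\lambda}{n} + 2\,\frac{\mathcal{K}(\Phi_{m,\sigma^2},\pi) + \log\frac{2}{\varepsilon}}{\lambda}\right\} \\
&= \inf_{(m,\sigma^2)}\left\{\int R\,d\Phi_{m,\sigma^2} + \frac{\lambda}{n} + 2\,\frac{d\left[\frac{1}{2}\log\left(\frac{\vartheta^2}{\sigma^2}\right) + \frac{\sigma^2}{\vartheta^2}\right] + \frac{\|m\|^2}{\vartheta^2} - \frac{d}{2} + \log\frac{2}{\varepsilon}}{\lambda}\right\}.
\end{aligned}$$

Note that the minimizer of R, $\bar\theta$, is not unique (because $f_\theta(x)$ does not depend on $\|\theta\|$)
and we can choose it in such a way that $\|\bar\theta\| = 1$. Then
$$\begin{aligned}
R(\theta)-\bar R &= \mathbb{E}\left[\mathbf{1}_{\langle\theta,X\rangle Y<0} - \mathbf{1}_{\langle\bar\theta,X\rangle Y<0}\right] \le \mathbb{E}\left[\mathbf{1}_{\langle\theta,X\rangle\langle\bar\theta,X\rangle<0}\right] \\
&= \mathbb{P}\left(\langle\theta,X\rangle\langle\bar\theta,X\rangle<0\right) \le c\left\|\frac{\theta}{\|\theta\|}-\bar\theta\right\| \le 2c\|\theta-\bar\theta\|.
\end{aligned}$$

So:
$$B_\lambda(F_1) \le \bar R + \inf_{(m,\sigma^2)}\left\{2c\int\|\theta-\bar\theta\|\,\Phi_{m,\sigma^2}(d\theta) + \frac{\lambda}{n} + 2\,\frac{d\left[\frac{1}{2}\log\left(\frac{\vartheta^2}{\sigma^2}\right) + \frac{\sigma^2}{\vartheta^2}\right] + \frac{\|m\|^2}{\vartheta^2} - \frac{d}{2} + \log\frac{2}{\varepsilon}}{\lambda}\right\}.$$

We now restrict the infimum to distributions ν such that $m = \bar\theta$:
$$B_\lambda(F_1) \le \bar R + \inf_{\sigma^2}\left\{2c\sqrt{d}\,\sigma + \frac{\lambda}{n} + \frac{d\log\left(\frac{\vartheta^2}{\sigma^2}\right) + \frac{2d\sigma^2}{\vartheta^2} + \frac{2}{\vartheta^2} - d + 2\log\frac{2}{\varepsilon}}{\lambda}\right\}.$$

We put $\sigma = \frac{1}{2\lambda}$ and substitute $\frac{1}{\sqrt{d}}$ for ϑ to get
$$B_\lambda(F_1) \le \bar R + \frac{\lambda}{n} + \frac{c\sqrt{d} + d\log\left(\frac{4\lambda^2}{d}\right) + \frac{d^2}{2\lambda^2} + d + 2\log\frac{2}{\varepsilon}}{\lambda}.$$

Substitute $\sqrt{nd}$ for λ to get the desired result. □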
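Proof of Corollary 5.5. We apply Theorem 4.3:
$$\int(R-\bar R)\,d\tilde\rho_\lambda \le \inf_{m,\sigma^2}\left\{\frac{\lambda+g(\lambda,n)}{\lambda-g(\lambda,n)}\int(R-\bar R)\,d\Phi_{m,\sigma^2} + \frac{1}{\lambda-g(\lambda,n)}\left[2\mathcal{K}(\Phi_{m,\sigma^2},\pi) + 2\log\frac{2}{\varepsilon}\right]\right\}$$
where $\lambda < \frac{2n}{C+1}$. Computations similar to those in the proof of Corollary 5.4 lead to
$$\int R\,d\tilde\rho_\lambda \le \bar R + \inf_{m,\sigma^2}\left\{\frac{\lambda+g(\lambda,n)}{\lambda-g(\lambda,n)}\,2c\int\|\theta-\bar\theta\|\,\Phi_{m,\sigma^2}(d\theta) + 2\,\frac{\sum_{j=1}^{d}\left[\frac{1}{2}\log\left(\frac{\vartheta^2}{\sigma^2}\right) + \frac{\sigma^2}{\vartheta^2}\right] + \frac{\|m\|^2}{\vartheta^2} - \frac{d}{2} + \log\frac{2}{\varepsilon}}{\lambda-g(\lambda,n)}\right\}.$$

Taking $m = \bar\theta$ and $\lambda = \frac{2n}{C+2}$, we get the result. □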

A.5. Proofs of Section 6


Proof of Lemma 6.1. For fixed θ we can upper bound the individual risk such that:

0 ≤ max(0, 1− < θ, Xi > Yi ) ≤ 1 + |< θ, Xi > |

such that we can apply Hoeffding’s inequality conditionally on Xi and fixed θ.


We get,
n
( )
λ2 X
E exp λ(RH − rnH ) |X1 , · · · , Xn ≤ exp (1 + |< θ, Xi > |)2
  
8n2
i=1
 2
λ2 c2x

λ
≤ exp + kθk2
4n 4n

where the last inequality stems from the fact that (a + b)2 ≤ 2 a2 + b2 and the fact


that we have supposed the Xi to be bounded. We can take the expectation of this term
with respect to the Xi ’s and with respect to our Gaussian prior.
 2
λ
exp 4n
Z  2 2
λ cx 1

H H 2 2
  
π E exp λ(R − rn ) ≤ d√ exp kθk − 2 kθk dθ
(2π) 2 ϑ2 4n 2ϑ
 2
λ
exp 4n Z 
1 1

λ2 c2x
 
2
≤ d√ exp − − kθk dθ
(2π) 2 ϑ2 2 ϑ2 4n
2 2
The integral is a properly defined Gaussian integral under the hypothesis that ϑ12 − λ4ncx >
q
0 hence λ < c2x nϑ 2 . The integral is proportional to a Gaussian and we can directly
write:  2
λ
exp 4n
H H
  
π E exp λ(R − rn ) ≤q
2 λ2 c2
1 − ϑ 4n x

writing everything in the exponential gives the desired result. 

Proof of Corollary 6.2. We apply Theorem 4.2 to get:
(Z )
ϑ2 λ2 c2x K(Φm,σ2 , π) + log 2ε
 
H λ 1
Bλ (F1 ) = inf R dΦm,σ2 + − log 1 − +2
(m,σ 2 ) 2n λ 4n λ

ϑλ2 c2x
Z  
H λ 1
= inf R dΦm,σ2 + − log 1 −
(m,σ 2 )  2n λ 4n
h   i
Pd 1 ϑ2 σ2 kmk2 d

2 
j=1 2 log σ2
+ ϑ2
+ ϑ2
− 2 + log ε
+2 .
λ 

We use the fact that the hinge loss is Lipschitz and that the $(X_i)$ are uniformly bounded,
$\|X\|_\infty < c_x$. We get $R^H(\theta) \le \bar R^H + c_x\sqrt{d}\,\|\theta - \bar\theta\|$ and restrict the infimum to distributions
ν such that $m = \bar\theta$:
   
 2 2 2
 d log ϑ2 + 2dσ2 + 2 − d + 2 log 2
H
 λ 1 ϑ λ cx σ2 ϑ2 ϑ2 ε

B(F1 ) ≤ R +inf cx dσ 2 + − log 1 − + .
σ2  2n λ 4n λ 

√1
pn
We specify σ 2 = dn
and λ = cx ϑ2
such that we get:
r √ r
 √  2d 2 2

ϑ2 ϑ2 2 + − d + 2 log
 
d 1 cx ϑ ϑ2 ε
B(F1 ) ≤ RH +cx + √ −cx log 1 − +d √ log ϑ2 nd +cx ϑ nϑ √ .
n 2cx n n 4 n n

To get the correct rate we take the prior variance to be ϑ2 = d1 by replacing in the above
equation we get the desired result.

Proof of Theorem 6.3. From Nesterov [2004] (Theorem 3.2.2) we have the following bound on
the objective function minimized by VB (the objective is not uniformly Lipschitz):
 
k H 1 k H 1 LM
ρ (rn ) + K(ρ , π) − inf ρ(rn ) + K(ρ, π) ≤ √ . (17)
λ ρ∈F1 λ 1+k

We have from equation (11) specified for measures ρk probability 1 − ε,


Z Z  
H k H k k 1
λ rn dρ ≤ λ R dρ + f (λ, n) + K(ρ , π) + log
ε
Combining the two equations yields,
Z  
H k LM 1 H 1 1 1
R dρ ≤ √ + f (n, λ) + inf ρ(rn ) + K(ρ, π) + log
1+k λ ρ∈F1 λ λ ε
We can therefore write for any ρ ∈ F1 ,
Z
LM 1 1 1 1
RH dρk ≤ √ + f (n, λ) + ρ(rnH ) + K(ρ, π) + log
1+k λ λ λ ε

Using equation (11) a second time we get with probability 1 − ε
Z
LM 2 2 2 2
RH dρk ≤ √ + f (n, λ) + ρ(RH ) + K(ρ, π) + log
1+k λ λ λ ε
Because this is true for any ρ ∈ F1 in 1 − ε we can write the bound for the smallest
measure in F1 .
Z  
LM 2 2 2 2
RH dρk ≤ √ + f (n, λ) + inf ρ(RH ) + K(ρ, π) + log
1+k λ ρ∈F 1 λ λ ε

By taking the Gaussian measure with variance $\frac{1}{n}$ and mean $\bar\theta$ in the infimum and taking
$\lambda = \frac{1}{c_x}\sqrt{nd}$ and $\vartheta = \frac{1}{d}$, we can use the results of Corollary 6.2 to get the result.

A.6. Proofs of Section 7


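Proof of Lemma 7.1. The idea of the proof is to use Hoeffding’s decomposition of
U-statistics combined with Hoeffding’s inequality for i.i.d. random variables. This was done
in ranking by Clémençon et al. [2008], and later in Robbiano [2013], Ridgway et al. [2014]
for ranking via aggregation and Bayesian statistics. The proof is as follows: we define
$$q_{i,j}^\theta = \mathbf{1}_{(Y_i-Y_j)(f_\theta(X_i)-f_\theta(X_j))<0} - R(\theta)$$

so that
$$U_n := \frac{1}{n(n-1)}\sum_{i\ne j}q_{i,j}^\theta = r_n(\theta) - R(\theta).$$

From Hoeffding [1948] we have
$$U_n = \frac{1}{n!}\sum_\pi\frac{1}{\lfloor\frac{n}{2}\rfloor}\sum_{i=1}^{\lfloor\frac{n}{2}\rfloor}q_{\pi(i),\pi(i+\lfloor\frac{n}{2}\rfloor)}^\theta$$

where the sum is taken over all the permutations π of {1, . . . , n}. Jensen’s inequality
leads to
$$\begin{aligned}
\mathbb{E}\exp[\lambda U_n] &= \mathbb{E}\exp\left[\lambda\,\frac{1}{n!}\sum_\pi\frac{1}{\lfloor\frac{n}{2}\rfloor}\sum_{i=1}^{\lfloor\frac{n}{2}\rfloor}q_{\pi(i),\pi(i+\lfloor\frac{n}{2}\rfloor)}^\theta\right] \\
&\le \frac{1}{n!}\sum_\pi\mathbb{E}\exp\left[\frac{\lambda}{\lfloor\frac{n}{2}\rfloor}\sum_{i=1}^{\lfloor\frac{n}{2}\rfloor}q_{\pi(i),\pi(i+\lfloor\frac{n}{2}\rfloor)}^\theta\right].
\end{aligned}$$

For each of the terms in the sum, we now use the same argument as in the proof
of Lemma 5.1 to get
$$\mathbb{E}\exp[\lambda U_n] \le \frac{1}{n!}\sum_\pi\exp\left[\frac{\lambda^2}{2\lfloor\frac{n}{2}\rfloor}\right] \le \exp\left[\frac{\lambda^2}{n-1}\right]$$

(in the last step, we used $\lfloor\frac{n}{2}\rfloor \ge (n-1)/2$). We proceed in the same way to upper bound
$\mathbb{E}\exp[-\lambda U_n]$. □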
Proof of Lemma 7.2. As already done above, we use Bernstein inequality and Hoeffding
decomposition. Fix θ. We define this time
θ
qi,j = 1{hθ, Xi − Xj i (Yi − Yj ) < 0} − 1{[σ(Xi ) − σ(Xj )](Yi − Yj ) < 0} − R(θ) + R

so that
1 X
θ
Un := rn (θ) − rn − R(θ) + R = qi,j .
n(n − 1)
i6=j

Then,
2
bnc
1 X 1 X θ
Un = qπ(i),π(i+b n .
c)
n! π b n2 c 2
i=1

Jensen’s inequality:
bn
 
2
c
1 X 1 X
θ
E exp[λUn ] = E exp λ qπ(i),π(i+b n 
c)
n! π b n2 c 2
i=1
bn
 
2
c
1 X λ X
θ
≤ E exp  n qπ(i),π(i+b n .
c)
n! π b2c 2
i=1

Then, for each of the terms in the sum, use Bernstein’s inequality:
bn 2 λ2
   
c θ
2 E((q n ) ) n
λ X
θ π(1),π(1+b 2 c) b c
E exp  n qπ(i),π(i+b n  ≤ exp 
c)   2 .
b2c 2
2 1 − 2 λn
i=1 b2c

We use again b n2 c ≥ (n−1)/2. Then, as the pairs (Xi , Yi ) are iid, we have E((qπ(1),π(1+b
θ 2
n ) ) =
c)
2
θ )2 ) and then E((q θ )2 ) ≤ C[R(θ) − R] thanks to the margin assumption. So
E((q1,2 1,2

bn
   
c λ2
λ X2
C[R(θ) − R]
E exp  n θ
qπ(i),π(i+b n  ≤ exp 
c)  n−1  .
b2c 2
1− 4λ
i=1 n−1

This ends the proof of the proposition. 


Proof of Corollary 7.4. The calculations are similar to the ones in the proof of
Corollary 5.4, so we do not give the details. Note that when we reach
$$B_\lambda(F_1) \le \bar R + \frac{2\lambda}{n-1} + \frac{c\sqrt{d} + d\log(2\lambda) + 2\log\frac{2e}{\varepsilon}}{\lambda},$$
an approximate minimization with respect to λ leads to the choice $\lambda = \sqrt{\frac{d(n-1)}{2}}$. □

A.7. Proofs of Section 8
Proof. First, note that, for any ρ,
Z Z
 
K(ρ, πβ ) = β (R − R)dρ + K(ρ, π) + log exp −β(R − R) dπ
Z
≤ β (R − R)dρ + K(ρ, π).

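Now, we define a subset of $\mathcal{F}$ that will be used for the calculation of the bound. For $\delta > 0$, we define the probability distribution $\rho_{U,V,\delta}(d\theta)$ as $\pi$ conditioned on $\theta = \mu\nu^T$, where $\mu$ is uniform on $\{\forall (i,\ell),\ |\mu_{i,\ell} - U_{i,\ell}| \le \delta\}$ and $\nu$ is uniform on $\{\forall (j,\ell),\ |\nu_{j,\ell} - V_{j,\ell}| \le \delta\}$. Note that
\[
\begin{aligned}
\int (R - \overline{R})\,d\rho_{U,V,\delta}
&= \int \mathbb{E}\bigl((\theta_X - M_X)^2\bigr)\,\rho_{U,V,\delta}(d\theta) \\
&\le 3\int \mathbb{E}\bigl(((UV^T)_X - M_X)^2\bigr)\,\rho_{U,V,\delta}(d(\mu,\nu)) \\
&\quad + 3\int \mathbb{E}\bigl(((U\nu^T)_X - (UV^T)_X)^2\bigr)\,\rho_{U,V,\delta}(d(\mu,\nu)) \\
&\quad + 3\int \mathbb{E}\bigl(((\mu\nu^T)_X - (U\nu^T)_X)^2\bigr)\,\rho_{U,V,\delta}(d(\mu,\nu)).
\end{aligned}
\]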

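By definition, the first term is equal to 0. Moreover, using the Cauchy-Schwarz inequality in the second line:
\[
\begin{aligned}
\int \mathbb{E}\bigl(((U\nu^T)_X - (UV^T)_X)^2\bigr)\,\rho_{U,V,\delta}(d(\mu,\nu))
&= \int \frac{1}{m_1 m_2}\sum_{i,j}\Bigl[\sum_{k} U_{i,k}(\nu_{j,k} - V_{j,k})\Bigr]^2 \rho_{U,V,\delta}(d(\mu,\nu)) \\
&\le \int \frac{1}{m_1 m_2}\sum_{i,j}\Bigl[\sum_{k} U_{i,k}^2\Bigr]\Bigl[\sum_{k} (\nu_{j,k} - V_{j,k})^2\Bigr] \rho_{U,V,\delta}(d(\mu,\nu)) \\
&\le K r C^2 \delta^2.
\end{aligned}
\]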

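In the same way,
\[
\int \mathbb{E}\bigl(((\mu\nu^T)_X - (U\nu^T)_X)^2\bigr)\,\rho_{U,V,\delta}(d(\mu,\nu))
\le \int \|\mu - U\|_F^2\,\|\nu\|_F^2\,\rho_{U,V,\delta}(d(\mu,\nu))
\le K r (C + \delta)^2 \delta^2.
\]
So:
\[
\int (R - \overline{R})\,d\rho_{U,V,\delta} \le 2Kr\delta^2 (C + \delta^2).
\]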

Now, let us consider the term $\mathcal{K}(\rho_{U,V,\delta}, \pi)$. An explicit calculation is possible but tedious. Instead, we introduce the set $G_\delta = \{\theta = \mu\nu^T,\ \|\mu - U\|_F \le \delta,\ \|\nu - V\|_F \le \delta\}$ and note that $\mathcal{K}(\rho_{U,V,\delta}, \pi) \le \log\frac{1}{\pi(G_\delta)}$. An upper bound on this quantity is derived on pages 317-320 of Alquier [2014], and the result is given by (10) in this reference:

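\[
\begin{aligned}
\mathcal{K}(\rho_{U,V,\delta}, \pi) \le{}& 4\delta^2 + 2\|U\|_F^2 + 2\|V\|_F^2 + 2\log(2) \\
&+ (m_1 + m_2)\,r\,\log\Biggl(\frac{1}{\delta}\sqrt{\frac{3\pi(m_1 \vee m_2)K}{4}}\Biggr) + 2K\log\Biggl(\frac{\Gamma(a)\,3^{a+1}\exp(2)}{b^{a+1}\,2^{a}}\Biggr)
\end{aligned}
\]
as soon as the restrictions $b \le \frac{\delta^2}{2m_1 K\log(2m_1 K)}$ and $b \le \frac{\delta^2}{2m_2 K\log(2m_2 K)}$ are satisfied. So we obtain: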

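\[
\begin{aligned}
\mathcal{K}(\rho_{U,V,\delta}, \pi_\beta) \le{}& \beta\,2Kr\delta^2(C + \delta^2) + 4\delta^2 + 2\|U\|_F^2 + 2\|V\|_F^2 + 2\log(2) \\
&+ (m_1 + m_2)\,r\,\log\Biggl(\frac{1}{\delta}\sqrt{\frac{3\pi(m_1 \vee m_2)K}{4}}\Biggr) + 2K\log\Biggl(\frac{\Gamma(a)\,3^{a+1}\exp(2)}{b^{a+1}\,2^{a}}\Biggr).
\end{aligned}
\]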

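Note that $\|U\|_F^2 \le C^2 r m_1$, $\|V\|_F^2 \le C^2 r m_2$ and $K \le m_1 + m_2$, so it is clear that the choice $\delta = \sqrt{1/\beta}$ and $b \le \frac{1}{2\beta(m_1 \vee m_2)\log(2K(m_1 \vee m_2))}$ leads to the existence of a constant $\mathcal{C}(a, C)$ such that
\[
\mathcal{K}(\rho_{U,V,\delta}, \pi_\beta) \le \mathcal{C}(a, C)\Bigl\{ r(m_1 + m_2)\log\bigl[\beta b (m_1 + m_2) K\bigr] + \frac{1}{\beta} \Bigr\}.
\]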

B. Implementation details
B.1. Sequential Monte Carlo
Tempering SMC iteratively approximates a sequence of distributions $\rho_{\lambda_t}$, with
\[
\rho_{\lambda_t}(d\theta) = \frac{1}{Z_t}\exp\bigl(-\lambda_t r_n(\theta)\bigr)\,\pi(d\theta),
\]
and temperature ladder $\lambda_0 = 0 < \dots < \lambda_T = \lambda$. The pseudo-code below is given for an adaptive sequence of temperatures.

Algorithm 1 Tempering SMC

Input: $N$ (number of particles), $\tau \in (0, 1)$ (ESS threshold), $\kappa > 0$ (random walk tuning parameter)

Init.: Sample $\theta_0^i \sim \pi_\xi(\theta)$ for $i = 1$ to $N$, set $t \leftarrow 1$, $\lambda_0 = 0$, $Z_0 = 1$.

Loop:
a. Solve in $\lambda_t$ the equation
\[
\frac{\bigl\{\sum_{i=1}^{N} w_t(\theta_{t-1}^i)\bigr\}^2}{\sum_{i=1}^{N} \bigl\{w_t(\theta_{t-1}^i)\bigr\}^2} = \tau N, \qquad w_t(\theta) = \exp[-(\lambda_t - \lambda_{t-1})\,r_n(\theta)] \tag{18}
\]
using bisection search. If $\lambda_t \ge \lambda_T$, set $Z_T = Z_{t-1} \times \bigl\{\frac{1}{N}\sum_{i=1}^{N} w_t(\theta_{t-1}^i)\bigr\}$, and stop.
b. Resample: for $i = 1$ to $N$, draw $A_t^i$ in $1, \dots, N$ so that $P(A_t^i = j) = w_t(\theta_{t-1}^j) / \sum_{k=1}^{N} w_t(\theta_{t-1}^k)$; see Algorithm 2.
c. Sample $\theta_t^i \sim M_t(\theta_{t-1}^{A_t^i}, d\theta)$ for $i = 1$ to $N$, where $M_t$ is an MCMC kernel that leaves $\pi_t$ invariant; see comments below.
d. Set $Z_t = Z_{t-1} \times \bigl\{\frac{1}{N}\sum_{i=1}^{N} w_t(\theta_{t-1}^i)\bigr\}$.

The algorithm outputs a weighted sample $(w_T^i, \theta_T^i)$ approximately distributed according to the target posterior, and an unbiased estimator of the normalizing constant $Z_{\lambda_T}$.
Step b of Algorithm 1 relies on a resampling algorithm. We use systematic resampling, described in Algorithm 2.
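Step a of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the names `ess` and `next_temperature` are hypothetical, `risks` stands for the values $r_n(\theta_{t-1}^i)$, and the bisection runs for a fixed number of iterations rather than until a tolerance is met.

```python
import math


def ess(delta, risks):
    """Left-hand side of (18) for weights w_i = exp(-delta * r_n(theta_i))."""
    r_min = min(risks)  # shifting the risks leaves the ratio unchanged and avoids underflow
    w = [math.exp(-delta * (r - r_min)) for r in risks]
    return sum(w) ** 2 / sum(x * x for x in w)


def next_temperature(lam_prev, lam_final, risks, tau, n_iter=100):
    """Bisection search for lambda_t such that the ESS equals tau * N."""
    target = tau * len(risks)
    if ess(lam_final - lam_prev, risks) >= target:
        return lam_final  # the final temperature is reachable in one step
    lo, hi = 0.0, lam_final - lam_prev  # ESS(lo) >= target > ESS(hi)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if ess(mid, risks) >= target:
            lo = mid
        else:
            hi = mid
    return lam_prev + lo


# Hypothetical empirical risks of N = 100 particles.
risks = [0.1 * i for i in range(100)]
lam_1 = next_temperature(0.0, 1000.0, risks, tau=0.5)
print(lam_1, ess(lam_1, risks))
```

The ESS equals $N$ when $\lambda_t = \lambda_{t-1}$ and decreases as the increment grows, which is what makes a bisection search applicable.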

Algorithm 2 Systematic resampling

Input: Normalised weights $W_t^j := w_t(\theta_{t-1}^j) / \sum_{i=1}^{N} w_t(\theta_{t-1}^i)$.

Output: indices $A^i \in \{1, \dots, N\}$, for $i = 1, \dots, N$.

a. Sample $U \sim \mathcal{U}([0, 1])$.

b. Compute cumulative weights as $C^n = \sum_{m=1}^{n} N W^m$.

c. Set $s \leftarrow U$, $m \leftarrow 1$.

d. For $n = 1 : N$

While $C^m < s$ do $m \leftarrow m + 1$.

$A^n \leftarrow m$, and $s \leftarrow s + 1$.

End For
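Algorithm 2 translates almost line by line into code. The sketch below is an illustration with two departures that should be read as assumptions: indices are 0-based, and the inner loop is guarded so that rounding errors in the cumulative sums cannot push the index past the last particle. The function name is hypothetical.

```python
import random


def systematic_resampling(weights, rng=random):
    """Return N ancestor indices (0-based) drawn by systematic resampling."""
    n = len(weights)
    total = sum(weights)  # weights need not be normalised beforehand
    cumulative = []
    acc = 0.0
    for w in weights:
        acc += n * w / total  # cumulative sums of N * W^m, so the last one is about N
        cumulative.append(acc)
    s = rng.random()  # a single uniform draw drives all N selections
    m = 0
    ancestors = []
    for _ in range(n):
        while m < n - 1 and cumulative[m] < s:
            m += 1
        ancestors.append(m)
        s += 1.0
    return ancestors


print(systematic_resampling([0.1, 0.2, 0.3, 0.4], random.Random(0)))
```

Up to ties, each particle $j$ is selected either $\lfloor N W^j \rfloor$ or $\lceil N W^j \rceil$ times, which is the reason systematic resampling has a lower variance than multinomial resampling.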

For the MCMC step, we used a Gaussian random-walk Metropolis kernel, with a
covariance matrix for the random step that is proportional to the empirical covariance
matrix of the current set of simulations.
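A possible form of this MCMC step is sketched below in pure Python. It assumes that the tuning parameter $\kappa$ of Algorithm 1 multiplies the empirical covariance matrix, and that `log_target` returns the log-density of the current tempered distribution up to a constant, that is $-\lambda_t r_n(\theta) + \log\pi(\theta)$; both are assumptions made for illustration, and all function names are hypothetical.

```python
import math
import random


def empirical_covariance(particles):
    """Empirical covariance matrix of a list of d-dimensional particles."""
    n, d = len(particles), len(particles[0])
    mean = [sum(p[k] for p in particles) / n for k in range(d)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in particles) / (n - 1)
             for b in range(d)] for a in range(d)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def rw_metropolis_step(particles, log_target, kappa, rng=random):
    """One Gaussian random-walk Metropolis move per particle."""
    d = len(particles[0])
    low = cholesky(empirical_covariance(particles))
    scale = math.sqrt(kappa)  # proposal covariance is kappa times the empirical one
    moved = []
    for theta in particles:
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        step = [scale * sum(low[a][b] * z[b] for b in range(a + 1)) for a in range(d)]
        proposal = [t + s for t, s in zip(theta, step)]
        log_ratio = log_target(proposal) - log_target(theta)
        accept = log_ratio >= 0 or rng.random() < math.exp(log_ratio)
        moved.append(proposal if accept else theta)
    return moved
```

Scaling the proposal by the empirical covariance of the particles adapts the random walk to the spread of the current tempered distribution without any manual tuning beyond $\kappa$.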

B.2. Optimizing the bound


A natural idea to find a global optimum of the objective is to solve a sequence of local optimization problems with increasing temperatures. For γ = 0 the problem can be solved exactly (as a KL divergence between two Gaussians). Then, for two consecutive temperatures, the corresponding solutions should be close enough for the previous solution to serve as a good starting point.
This idea has appeared under several names. It has a long history in variational algorithms under the name deterministic annealing; Yuille [2010] uses it for mean field approximations of Gibbs distributions for Markov random fields. In addition, the intermediate results can be of interest in our case for selecting the temperature: one can compute the bound at almost no additional cost as a function of the current risk. In turn, this can be used to monitor the bound.

Algorithm 3 Deterministic annealing

Input: $(\lambda_t)_{t \in [0, T]}$, a sequence of temperatures

Init.: Set $m = 0$ and $\Sigma = \vartheta I_d$, the values minimizing the KL divergence for $\lambda = 0$

Loop: $t = 1, \dots, T$
a. $(m_{\lambda_t}, \Sigma_{\lambda_t})$ = Minimize $\mathcal{L}_{\lambda_t}(m, \Sigma)$ using some local optimization routine with initial points $m_{\lambda_{t-1}}, \Sigma_{\lambda_{t-1}}$
b. Break if the empirical bound increases.

End Loop
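The warm-start structure of Algorithm 3 can be sketched as follows. This is an illustration only: the variational parameters $(m, \Sigma)$ are flattened into a generic list `x`, the local routine is a crude finite-difference gradient descent standing in for "some local optimization routine", the toy `objective` and `bound` are hypothetical, and keeping the previous solution when the bound increases is one possible reading of step b.

```python
def local_minimize(f, x0, step=0.1, iters=200, h=1e-5):
    """Crude local optimiser: gradient descent with central finite differences."""
    x = list(x0)
    for _ in range(iters):
        grad = []
        for k in range(len(x)):
            up = x[:k] + [x[k] + h] + x[k + 1:]
            dn = x[:k] + [x[k] - h] + x[k + 1:]
            grad.append((f(up) - f(dn)) / (2.0 * h))
        x = [xk - step * g for xk, g in zip(x, grad)]
    return x


def deterministic_annealing(objective, bound, temperatures, x0):
    """Solve a sequence of local problems, each started from the previous solution."""
    x, best_bound = list(x0), float("inf")
    for lam in temperatures:
        candidate = local_minimize(lambda v: objective(lam, v), x)
        b = bound(lam, candidate)
        if b > best_bound:
            break  # step b: the empirical bound increased, keep the previous solution
        x, best_bound = candidate, b
    return x, best_bound


# Toy problem: a quadratic "risk" plus a quadratic penalty playing the role of the KL term.
def objective(lam, v):
    return lam * (v[0] - 2.0) ** 2 + 0.5 * v[0] ** 2


def bound(lam, v):
    return (v[0] - 2.0) ** 2 + (0.5 * v[0] ** 2 + 1.0) / lam


solution, value = deterministic_annealing(objective, bound, [0.5, 1.0, 2.0, 4.0], [0.0])
print(solution, value)
```

In this toy problem the minimiser at temperature $\lambda$ is $4\lambda/(2\lambda + 1)$, so successive solutions move gradually, which is the situation where warm starts help.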

[Figure 2, two panels. (a) A one dimensional problem: vertical axis "value", one curve for each of γ = 0, 125, 250, 375, 500. (b) Empirical bound: vertical axis "95% bound", horizontal axis λ.]

Figure 2: Deterministic annealing on the Pima Indians dataset, with one covariate (left panel) and with the full model (right panel).
The right panel gives the empirical bound obtained with the DA method (in red); the dots are direct global optimizations based on the L-BFGS algorithm, from starting values drawn from the prior. Each optimization problem is repeated 20 times.

We find that using a deterministic annealing algorithm with a limited number of steps helps in finding a good enough optimum. On the left panel of Figure 2, we can see the one dimensional case, where the initial problem γ = 0 corresponds to a convex minimization problem and where the increasing temperature gradually makes the optimization problem more complex. Figure 2 shows that the solution given by DA is on average lower than that of a randomly initialized optimization.

C. Stochastic gradient descent


The stochastic gradient descent algorithm used in Section ?? is described as Algorithm 4.

Algorithm 4 Stochastic Gradient Descent

Input: $B$ a batch size, an unbiased estimator of the gradient $\hat{\nabla}_B f$, $\eta \in (0, 1)$ and $c$

While not converged:
a. $x_{t+1} = x_t - \lambda_t \hat{\nabla}_B f(x_t)$
b. Update $\lambda_{t+1} = \frac{1}{(t + c)^{\eta}}$

End While

In all our experiments we take c = 1 and η = 0.9.
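Algorithm 4 with these values can be sketched as follows. Several choices are assumptions made for illustration: the loop runs for a fixed number of steps instead of testing convergence, the step used at iteration $t$ (counted from 0) is $1/(t + c)^{\eta}$ since the algorithm does not specify the initial step, and the toy objective $f(x) = \frac{1}{2}\mathbb{E}(x - Y)^2$ with its mini-batch gradient is hypothetical.

```python
import random


def sgd(grad_estimate, x0, c=1.0, eta=0.9, n_steps=1000):
    """Stochastic gradient descent with step size 1 / (t + c)^eta."""
    x = list(x0)
    for t in range(n_steps):
        step = 1.0 / (t + c) ** eta
        g = grad_estimate(x)  # unbiased estimate of the gradient at x
        x = [xk - step * gk for xk, gk in zip(x, g)]
    return x


# Toy example: minimise the mean squared distance to a data set.
rng = random.Random(0)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]


def minibatch_gradient(x, batch_size=10):
    """Gradient of 0.5 * (x - y)^2 averaged over a random mini-batch."""
    batch = [rng.choice(data) for _ in range(batch_size)]
    return [x[0] - sum(batch) / batch_size]


estimate = sgd(minibatch_gradient, [0.0], n_steps=2000)
print(estimate)  # close to the mean of the data
```

With $\eta \in (0, 1)$ the steps decrease more slowly than $1/t$, which trades a slower reduction of the noise for a faster forgetting of the initial point.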
