Revisiting Generalization for Deep Learning: PAC-Bayes and Generative Models (Thesis, 2018)
Department of Engineering
University of Cambridge
I hereby declare that except where specific reference is made to the work of others, the
contents of this dissertation are original and have not been submitted in whole or in part
for consideration for any other degree or qualification in this, or any other university. This
dissertation is my own work and contains nothing which is the outcome of work done in
collaboration with others, except as specified in the text and Acknowledgements. This
dissertation contains fewer than 65,000 words including appendices, bibliography, footnotes,
tables and equations and has fewer than 150 figures.
My first exposure to research was due to Wei Ji Ma, who accepted me for a summer
internship in Computational Neuroscience despite the fact that I had only just completed
my undergraduate degree in Mathematics and had no prior research experience. Working
with Weiji changed my perspective on research as a career choice. I am grateful to Weiji for
encouraging me to apply to PhD programs.
I would like to thank my supervisor Zoubin Ghahramani for supporting my decision
to transition into machine learning research. Zoubin quickly directed me to interesting
projects that resulted in publications, motivating me to continue my research career. I was
very fortunate to have an advisor who was an expert in so many different areas of machine
learning.
Much of the research reported in this thesis was carried out or initiated while I was a
visiting student in the “Foundations of Machine Learning” program at the Simons Institute
for the Theory of Computing at UC Berkeley. I would like to explicitly thank Peter Bartlett,
Shai Ben-David, Dylan Foster, Matus Telgarsky, and Ruth Urner for helpful discussions. I
would also like to thank Bharath Sriperumbudur for technical discussions regarding the work
described in Chapter 5.
I am very grateful to my parents who were willing to help at any time and in any way they
could, who visited me for months at a time, travelled with me to conferences, and helped
care for my children while I was working unreasonable hours.
I am most thankful for my husband’s never-ending support.
Last but not least, I would like to thank the hyper-competitive deep learning research
community for pushing me to exceed my rest-to-work limits while raising two small children.
Under different, more relaxed, conditions, I might have gotten much less work done.
Abstract
Contents (excerpt)
List of figures
Notation
Introduction
3.7 Discussion
Conclusion
References
List of figures
4.1 SGLD and Entropy-SGLD results for various levels of privacy on the CONV network on two-class MNIST.
4.2 Differentially private Entropy-SGLD results for a fully connected network.
4.3 Performance of SGLD when tuned to be differentially private.
4.4 Entropy-SGLD performance on MNIST.
4.5 Entropy-SGLD performance on CIFAR10.
4.6 Entropy-SGLD CIFAR10 experiments with a fixed level of privacy.
Neural networks have enjoyed several waves of popularity. One of the defining properties
of the most recent resurgence—the “deep learning” era—is the use of large data sets and
much larger networks. Neural network approaches now dominate in fields such as vision,
natural language processing, and many others. Despite this success, the generalization
properties of deep learning algorithms are yet to be fully understood: there is, as yet, no
complete theory explaining why common algorithms work in practice. Instead, guidelines
for choosing and tuning common learning algorithms are based on empirical experience. Such
practice is problematic. In some applications, standard algorithms like stochastic gradient
descent (SGD) reliably return solutions with low test error. In other applications, these same
algorithms rapidly overfit.
Our goal is to improve our understanding of generalization of neural networks in the deep
learning regime, empowering us to:
Any attempt to explain generalization must grapple with the fact that the hypothesis class
induced by standard neural network models is huge, as measured by standard complexity
measures, such as VC dimension and Rademacher complexity. One of the defining properties
of deep learning is that models are chosen to have many more parameters than available
training data. In practice, the number of available training instances is too small in comparison
to the size of the network to yield generalization guarantees without taking into consideration
the learning algorithm or depending on the complexity of the learned hypothesis.
This fact has long been appreciated and Bartlett (1998) is essentially credited with solving
this problem in 1998. In his seminal work, Bartlett introduced fat shattering risk bounds
where the fat shattering dimension of the network is controlled by the norms of the weights.
One of the key contributions of this thesis is revisiting this and later developments and
questioning whether they solve the problem at hand of understanding generalization in deep
learning. One might hope that the generalization properties of SGD could be explained by
showing that SGD performs implicit regularization of the weight norms. However, Bartlett’s
learning bounds are numerically vacuous, i.e., greater than the upper bound on the loss, when
applied to modern networks learned by SGD. Logically, in order to explain generalization,
we need nonvacuous bounds.
Nonetheless, many authors are actively working on understanding implicit regularization
and several empirical studies suggest that the implicit regularization idea may be promising
(Neyshabur, Tomioka, and Srebro, 2014; Neyshabur, 2017; Zhang et al., 2017; Shwartz-Ziv
and Tishby, 2017; Bartlett, Foster, and Telgarsky, 2017a; Gunasekar et al., 2017). For
example, Zhang et al. (2017) demonstrate that, without regularization, SGD can achieve zero training error on standard benchmark datasets such as MNIST and CIFAR. At the same time, SGD obtains weights with very small generalization error on the original labels. As a simplified model, Zhang et al. study SGD in a linear model and show that it obtains the minimum-norm solution, and thus performs implicit regularization. They suggest that a similar phenomenon may occur when using SGD to train neural networks. Indeed, earlier work
by Neyshabur, Tomioka, and Srebro (2014) observes similar phenomena and argues for the
same point: implicit regularization underlies the ability of SGD to generalize, even under
massive overparametrization. Subsequent work by Neyshabur, Tomioka, and Srebro (2015)
introduces “path norms” as a better measure of the complexity of ReLU networks. The
authors bound the Rademacher complexity of classes of neural networks with bounded path
norm. Despite progress, we show that these new bounds probably will not lead to nonvacuous
generalization bounds.
The generalization question may also be addressed by studying properties of learning
algorithms, such as algorithmic stability. However, existing stability guarantees typically
diminish rapidly with training time. The experiments by Neyshabur, Tomioka, and Srebro
(2014) demonstrate that the classification error, as estimated on the test set, does not get
worse if we run SGD to convergence. On the other hand, Zhang et al. show that SGD rapidly overfits when handed randomized labels. This example highlights the difficulty of using label-independent stability analyses to explain generalization, as noted by Zhang et al.
(2017).
In summary, the theoretical relevance of many learning bounds has been assessed by examining whether the bound captures some implicit regularization, structural properties of
the solution, and/or properties of the data distribution. Progress was made by eliminating or
improving the dependencies on the number of layers, depth, and parameters in the neural
network. Empirical relevance was commonly illustrated via plots that show the generalization
bounds tracking the estimated generalization error.
In our work, we take one step further by seeking generalization bounds that are numerically close to the estimate of generalization obtained on held-out data. Our approach combines the Probably Approximately Correct (PAC)-Bayes framework (McAllester, 1999a) with nonconvex optimization and other computational techniques. This direction turns out to be fruitful for several reasons. For instance, it yields insight into which structures of the empirical risk surface found by SGD lead to good generalization, and it allows us to interpret existing optimizers as risk-bound minimizers and to analyze and control their generalization guarantees, both in theory and in practice.
• Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani (2015). “Train-
ing Generative Neural Networks via Maximum Mean Discrepancy Optimization”.
Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence.
UAI’15. Amsterdam, Netherlands: AUAI Press, pp. 258–267 (Chapter 5).
The core ideas were developed during discussions with my collaborators Zoubin Ghahra-
mani and Daniel M. Roy, and, thus, these contributions are shared among the authors of
each of the papers listed above. I carried out all the empirical and theoretical work, with one
exception: the results in Section 3.4.1 were derived jointly in collaboration with Daniel M.
Roy.
falls within the (empirically nonvacuous) risk bounds computed under the assumption that
SGLD reaches stationarity. In particular, Entropy-SGLD can be configured to yield relatively
tight generalization bounds and still fit real labels, although these same settings do not obtain
state-of-the-art performance.
In Chapter 5, we consider training a deep neural network to generate samples from an
unknown distribution given i.i.d. data. We frame learning as an optimization minimizing a
two-sample test statistic—informally speaking, a good generator network produces samples
that cause a two-sample test to fail to reject the null hypothesis. As our two-sample test
statistic, we use an unbiased estimate of the maximum mean discrepancy, which is the
centerpiece of the nonparametric kernel two-sample test proposed by Gretton et al. (Gretton
et al., 2012). We compare to the generative adversarial nets framework introduced by
Goodfellow et al. (Goodfellow et al., 2014), in which learning is a two-player game between
a generator network and an adversarial discriminator network, both trained to outwit the other.
From this perspective, the MMD statistic plays the role of the discriminator. In addition to
empirical comparisons, we prove bounds on the generalization error incurred by optimizing
the empirical MMD.
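The following minimal sketch (not taken from the thesis, and using an assumed Gaussian RBF kernel with a fixed bandwidth) illustrates the unbiased squared-MMD estimate that serves as the two-sample test statistic:

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2*bandwidth^2))."""
    sq_dists = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared maximum mean discrepancy between the
    distributions generating x (shape (m, d)) and y (shape (n, d))."""
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))   # exclude i == j terms
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

# Example: samples from the same distribution versus a shifted one.
rng = np.random.default_rng(0)
x = rng.standard_normal((500, 2))
y = rng.standard_normal((500, 2))
z = rng.standard_normal((500, 2)) + 1.0   # shifted distribution
print(mmd2_unbiased(x, y), mmd2_unbiased(x, z))
```

Samples drawn from the same distribution yield an estimate near zero, while samples from different distributions yield a clearly positive one; this is the signal the generator network is trained to drive down.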
Chapter 1
Statistical Learning Theory
In this chapter we introduce basic concepts from statistical learning theory. Informally,
learning uses data to improve performance on one or more tasks. Statistical learning theory
provides a formal framework for defining learning problems, measuring their complexity,
and anticipating the performance of learning algorithms.
In this section, our focus will be on the batch supervised learning setting, and, in
particular, classification. In supervised learning, we want to learn a mapping from a space
Rk of inputs to a space K of labels. The goal is to find the best predictor, or hypothesis, h,
in some hypothesis class H ⊆ Rk → K, on the basis of example input–output pairs. In the
batch setting, these examples are presented all at once to the learner and are assumed to be
independent and identically distributed (i.i.d.).
In the batch setting, the i.i.d. assumption allows us to link (past) empirical performance
to (future/expected) performance. Empirical performance is thus a key measure of the quality
of a proposed hypothesis. However, empirical performance can be misleading, depending
on the quantity of data and the learning algorithm. Generalization bounds relate empirical performance to performance on unseen data, and can give us confidence in using a learning algorithm. These bounds may depend on a number of factors, including:
1. the complexity of the hypothesis class;
2. the properties of the data distribution;
3. the properties of the chosen predictor and/or prior knowledge expressed by the learner
(Section 1.2);
4. the properties of the algorithm used by the learner (Sections 1.3 and 1.4), such as its
stability, the property that the distribution of the algorithm’s output does not change
when small modifications are made to the training data.
Bounds that depend only on (1) exist, but they capture only worst-case scenarios. Taking the data distribution into consideration often yields tighter bounds. Bounds combining all
factors listed above are specific to the particular problem being studied and yield the tightest
known bounds.
where the second equality follows from Fubini’s theorem and the fact that ℓ is bounded
below.
Let Sm = (z1 , . . . , zm ) denote a training set of size m. We will often drop the subscript
m and simply write S, unless the training set size is not obvious. Let D̂ := (1/m) ∑_{i=1}^m δ_{z_i} be
the empirical distribution. Given a weight distribution Q, such as that chosen by a learning
algorithm on the basis of data S, its empirical risk,
R̂_S(Q) := R_D̂(Q) = (1/m) ∑_{i=1}^m E_{w∼Q}[ℓ(w, z_i)],   (1.3)
will be studied as a stand-in for its risk, which we cannot compute. While R̂S (Q) is easily seen
to be an unbiased estimate of RD (Q) when Q is independent of S, our goal is to characterize
the (one-sided) generalization error RD (Q) − R̂S (Q) when Q is random and dependent on S.
Note that, throughout this thesis, we will often use the term generalization bound, by which we refer to a bound on the difference between the risk and the empirical risk.
In Chapter 2, for K = {±1}, we also make use of the logistic loss ℓ̆ : R^p × (X × {±1}) → R_+,

ℓ̆(w, (x, y)) = (1/log 2) · log(1 + exp(−h(w, x) y)),
which serves as a convex surrogate (i.e., upper bound) to the 0–1 loss.
We also consider parametric families of probability-density-valued classifiers h : R p ×
X → [0, 1]K . For every input x ∈ X, the output h(w, x) determines a probability distribution
on K. In this setting, ℓ(w, (x, y)) = g(h(w, x), y) for some g : [0, 1]K × K → R. The standard
loss is then the cross entropy, given by g((p1 , . . . , pK ), y) = − log py . (Under cross entropy
loss, the empirical risk is, up to a multiplicative constant, a negative log likelihood.) In the
special case of binary classification, the output can be represented simply by an element
of [0, 1], i.e., the probability the label is one. The binary cross entropy, ℓBCE , is given by
g(p, y) = −y log(p) − (1 − y) log(1 − p). Note that cross entropy loss is merely bounded
below. We consider bounded modifications in Section 4.7.2.
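For concreteness, here is a small sketch (illustrative only) of the surrogate and cross-entropy losses defined above, with labels in {±1} for the logistic surrogate and in {0, 1} for the binary cross entropy:

```python
import numpy as np

def logistic_surrogate(margin):
    """Logistic surrogate loss (1/log 2) * log(1 + exp(-y*h(w,x))),
    an upper bound on the 0-1 loss; `margin` = y * h(w, x), y in {-1, +1}."""
    return np.log1p(np.exp(-margin)) / np.log(2.0)

def cross_entropy(p, y):
    """K-class cross entropy g((p_1, ..., p_K), y) = -log p_y."""
    return -np.log(p[y])

def binary_cross_entropy(p, y):
    """Binary cross entropy -y log p - (1-y) log(1-p), with y in {0, 1}."""
    return -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)

# A confidently correct prediction incurs small loss, a confidently wrong one a large loss.
print(logistic_surrogate(np.array([3.0, -3.0])))      # ~[0.07, 4.40]
print(cross_entropy(np.array([0.7, 0.2, 0.1]), y=0))  # ~0.36
print(binary_cross_entropy(0.9, 1))                   # ~0.11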
In the context of supervised learning, we will sometimes refer to elements of R p and
M1 (R p ) as classifiers and randomized classifiers, respectively. Likewise, under 0–1 loss, we
will sometimes refer to (empirical) risk as (empirical) error.
In summary, we define the following notions of risk that we use throughout:

• R̂_S(w) = (1/m) ∑_{i=1}^m ℓ(w, z_i): the empirical risk of (the hypothesis indexed by parameters) w on training data S (under 0–1 loss, the training error);

• R̆_S(w) = (1/m) ∑_{i=1}^m ℓ̆(w, z_i): the empirical risk under the surrogate loss (or surrogate error) of w on training data S. We use this empirical risk for training purposes in Chapter 2 when we need our objective to be differentiable;

• R_D(w) = E_{S∼D^m}[R̂_S(w)]: the risk (error) under (the data distribution) D for w (in Chapter 2, we often drop the subscript D and just write R(w));

• R̂_S(Q) = E_{w∼Q}[R̂_S(w)]: the empirical risk (error) of (randomized classifier) Q on training data S;
The properties of ERM are well studied. A large hypothesis class H may contain a
predictor with minimal empirical risk but large true risk, leading to a high generalization
error, i.e., overfitting. Choosing a more restrictive hypothesis class may prevent overfitting in
exchange for bias. Sharp results exist relating the possibility of obtaining uniform bounds on
the generalization error to properties of the hypothesis class H and the 0–1 loss function ℓ.
Definition 1.1.1 (Agnostic PAC Learnable). Let an algorithm map a training sample to a hypothesis, i.e., A : Z^m → H. A hypothesis class H is called (agnostic) PAC learnable if there exists an algorithm A and a function m_H : (0, 1)^2 → N such that, for every (ε, δ) ∈ (0, 1)^2 and every data distribution D, for all m > m_H(ε, δ), with probability at least 1 − δ over the training sample S,

R_D(A(S)) ≤ inf_{w∈H} R_D(w) + ε.
We call mH (ε, δ ) a sample complexity for H , i.e., it is the number of examples needed to
guarantee PAC learnability of H .
In other words, if a PAC learning algorithm is given enough data, for any data distribution
it is guaranteed with high probability to return a predictor that is nearly as good as the best
predictor in the hypothesis class. It is also known that H is PAC learnable if and only if it
is learnable by ERM.
Uniform convergence
There are many ways to understand PAC learnability; one of the key ways is through uniform convergence. This property guarantees that our generalization error converges to zero with
the sample size, and quantifies the rate of that convergence.
Definition 1.1.2 (Uniform Convergence). A hypothesis class H has the uniform convergence (UC) property if there exists a function m^{UC}_H : (0, 1)^2 → N such that, for every (ε, δ) ∈ (0, 1)^2 and every data distribution D, for all m ≥ m^{UC}_H(ε, δ), with probability at least 1 − δ over the training sample S,

sup_{w∈H} |R_D(w) − R̂_S(w)| ≤ ε.
VC dimension
One measure of complexity of the hypothesis class is expressed in terms of the number of
ways it can label the dataset.
Thus if H has VC dimension d, we know that there is a hypothesis w in the class that
achieves zero empirical risk for every labelling of some dataset of size d. A key result in
statistical learning is that H has the uniform convergence property if and only if it has a finite
VC dimension.
The VC dimension can be used to bound the sample complexity, and thus the generalization error. However, VC bounds on the generalization error are loose in practice because they account for worst-case data distributions. These bounds are valid for any data distribution D and any ERM learning algorithm, and must therefore cover even an adversarially crafted combination of data distribution and hypothesis.
One can bound the VC dimension of neural network hypothesis space. The first bounds
in the literature are reported in (Goldberg and Jerrum, 1995; Koiran and Sontag, 1996) and
depend on the number of parameters in the network. Bartlett, Maiorov, and Meir (1999)
improved these bounds by introducing a dependence on the number of layers. These bounds are of order O(LW log W), where W is the number of parameters in the network and L is the number of layers. In Chapter 2 we evaluate one such bound and show that, for large neural networks used in practice, we get a very large upper bound on the VC dimension (more than 10^7).
training examples in the dataset results in generalization bounds that are orders of magnitude
too big to make a nonvacuous guarantee on the risk.
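As a rough numerical illustration (a back-of-the-envelope sketch, not a calculation from the thesis), the dominant term of a uniform-convergence VC bound scales as the square root of the VC dimension divided by the sample size; with a VC dimension of order 10^7 and an MNIST-sized training set, this term alone is already far above 1, regardless of constants and logarithmic factors:

```python
import numpy as np

# Dominant scaling of uniform-convergence VC bounds on the generalization gap
# (constants and confidence terms omitted): gap ~ sqrt(d / m).
d = 1e7      # order of the VC dimension upper bound discussed in Chapter 2
m = 55000    # number of MNIST training examples used in the experiments

print(np.sqrt(d / m))   # ~13.5, far above 1, i.e., a vacuous bound
```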
Rademacher Complexity
The VC bounds discussed above ignore the distribution over the data. Thus one can possibly
get tighter bounds by incorporating the data distribution. In this section, we describe
the application of Rademacher complexity to statistical learning theory (Koltchinskii and
Panchenko, 2002; Bartlett and Mendelson, 2003).
Definition 1.1.5 (Rademacher Complexity). Fix a sample size m ∈ N. Let σ = {σi }i∈N be a
sequence of Rademacher random variables, i.e., random variables taking values in {−1, 1}
with equal probability. Fix a class of measurable functions F ⊆ Z → R. For a sample S ∈ Z m ,
the empirical Rademacher complexity of the class F is
R̂_m(S, F) = E_σ[ sup_{f∈F} (1/m) ∑_{i=1}^m σ_i f(z_i) ].   (1.8)
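For a finite function class, the expectation in Eq. (1.8) can be approximated directly by Monte Carlo over the Rademacher signs. The sketch below (illustrative; the function class is represented simply by its values on the sample) makes the definition concrete:

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i f(z_i) ]
    for a finite class F; `values` has shape (|F|, m) with values[f, i] = f(z_i)."""
    rng = np.random.default_rng(seed)
    n_funcs, m = values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
        total += np.max(values @ sigma) / m        # sup over the finite class
    return total / n_draws

# Example: 50 random {-1, +1}-valued "classifiers" evaluated on m = 200 points.
rng = np.random.default_rng(1)
values = rng.choice([-1.0, 1.0], size=(50, 200))
# Massart's lemma bounds the result by sqrt(2*log(50)/200) ~ 0.2.
print(empirical_rademacher(values))
```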
Usually, one studies the Rademacher complexity of the loss class, and thus F refers to
the loss function composed with H , i.e., all f ∈ F are such that f (x, y) = ℓ(w, (x, y)) for
some w ∈ H. In this case, the Rademacher complexity can be used to bound the expected worst-case generalization error:
Theorem 1.1.6.

E_{S∼D^m}[ sup_{w∈R^p} ( R_D(w) − R̂_S(w) ) ] ≤ 2 R_m({ℓ(w, ·) : w ∈ R^p}).   (1.10)
This bound can be used to prove a bound on the expected generalization error and
expected oracle risk (the gap between risks of an ERM classifier and the best classifier in the
hypothesis class) of any ERM mechanism.
One of the key applications of Rademacher complexity in statistical learning theory is to margin bounds for classifiers that output a real-valued prediction. The 0–1 loss of these predictors is obtained by thresholding the output. To obtain a margin bound, one studies the ramp loss, a Lipschitz upper bound on the 0–1 loss. One can show that the Rademacher complexity of the loss class is no larger than the product of the Lipschitz constant and the Rademacher complexity of the real-valued hypothesis class. In this case, H has high Rademacher complexity if it is rich enough to contain predictors that can explain a random labelling of the data.
In Chapter 2 we present and evaluate a generalization error bound for neural network
classifiers. Also, in Chapter 5 (Section 5.4), we introduce a neural network generative model
and use the Rademacher complexity to obtain some generalization guarantees.
where α ∈ R+ and the term C(w) may depend on the hypothesis w. In ERM, the term C(w)
is a constant, i.e., it does not depend on the particular hypothesis w. In classical SRM, the
term C(w) is not constant, and so Eq. (1.11) encodes a trade-off between the empirical risk
and some notion of complexity for predictors.
Consider a hypothesis class H = ∪_{n∈N} H_n where, for each n ∈ N, the subclass H_n has the uniform convergence property with sample complexity m^{UC}_{H_n}(ε, δ). Define the function ε_n : M × N × (0, 1) → (0, 1) as
Define a weight function w : N → [0, 1] over the subclasses {H_n}_{n∈N} of H, such that ∑_{n∈N} w(n) ≤ 1. The following generalization guarantee holds:
Note that the weight function, once normalized, has the same form as a prior distribution
in Bayesian analysis. However, the guarantee holds for any weight function, provided it is
fixed before seeing the data. The theorem makes it clear that we are incentivized to assign
large weight to small subclasses Hn that we believe will fit the data well, and so there is an
indirect link with the Bayesian prior.
The risk bound, Eq. (1.13), suggests a natural learning algorithm for any hypothesis space
that can be written as a countable union of PAC learnable classes. The following paradigm is
called Structural Risk Minimization (SRM):
1. Write H = ∪_{n∈N} H_n for some countable collection of PAC learnable subclasses;

2. Fix a weight function w : N → [0, 1] over the subclasses with ∑_{n∈N} w(n) ≤ 1;
3. Given data S and a confidence parameter δ ∈ (0, 1), choose w ∈ H that minimizes
the right hand side of Eq. (1.13).
Note that, while each subclass H_n in SRM is uniformly learnable, H is only nonuniformly learnable.
The SRM paradigm drives the learning towards the simplest hypothesis that achieves
a relatively low empirical risk. This closely connects to the Minimum Description Length
(MDL) and PAC-Bayes frameworks discussed in the following sections.
1.2 PAC-Bayes
PAC-Bayes theory was first developed by McAllester in 1999 with the goal of providing PAC learning guarantees for Bayesian algorithms. The first PAC analysis of Bayesian algorithms is due to Shawe-Taylor and Williamson. The theory reported in (Shawe-Taylor and Williamson, 1997) applies under a more restrictive set of assumptions on the parameter space and prior measures, as compared to McAllester's work.
PAC-Bayes theorems give generalization and oracle bounds that can be used to build
and analyze learning algorithms that are similar to SRM and MDL. In all three settings, the
learner:
• fixes a prior/weight function over the hypothesis class, prior to seeing the data;
• as in MDL, the hypothesis class H = ∪_{h∈H} H_h is viewed as a (possibly uncountable) union of singleton classes H_h = {h};
• the search over classifiers, h, is replaced by one over randomized classifiers, Q, i.e.,
probability measures over H
MDL is obtained as the special case where H is countable and the search is restricted to
degenerate probability measures (i.e., ones assigning probability one to a single hypothesis).
Note that P and Q are the same type of structure.
kl(q||p) := KL(B(q)||B(p)) = q log(q/p) + (1 − q) log((1 − q)/(1 − p)),
1.2.2 Bounds
We now present a PAC-Bayes theorem, first established by McAllester (1999a). We focus
on the setting of bounding the generalization error of a (randomized) classifier on a finite
discrete set of labels K. The following variation is due to Langford and Seeger (2001) for
0–1 loss (see also (Langford, 2002) and (Catoni, 2007)).
Theorem 1.2.1 (PAC-Bayes (McAllester, 1999a; Langford and Seeger, 2001)). Under 0–1
loss, for every δ > 0, m ∈ N, distribution D on Rk × K, and distribution P on R p ,
P_{S∼D^m}( (∀Q) kl(R̂_S(Q) || R_D(Q)) ≤ (KL(Q||P) + log(2m/δ)) / (m − 1) ) ≥ 1 − δ.   (1.16)
Theorem 1.2.2 (Linear PAC-Bayes Bound (McAllester, 2013; Catoni, 2007)). Fix λ > 1/2 and assume the loss takes values in an interval of length L_max. For every δ > 0, m ∈ N, distribution D on R^k × K, and distribution P on R^p,

P_{S∼D^m}( (∀Q) R_D(Q) ≤ (1 / (1 − 1/(2λ))) ( R̂_S(Q) + (λ L_max / m) (KL(Q||P) + log(1/δ)) ) ) ≥ 1 − δ.   (1.17)
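For fixed λ, the right-hand side of the linear bound in Theorem 1.2.2 is a closed-form expression that is easy to evaluate; the numbers in the sketch below are made up and only illustrate the computation:

```python
import numpy as np

def linear_pac_bayes_bound(emp_risk, kl, m, delta, lam, l_max=1.0):
    """Right-hand side of the linear PAC-Bayes bound (Theorem 1.2.2):
    (emp_risk + (lam * l_max / m) * (kl + log(1/delta))) / (1 - 1/(2*lam)),
    valid for lam > 1/2 and losses taking values in an interval of length l_max."""
    assert lam > 0.5
    return (emp_risk + (lam * l_max / m) * (kl + np.log(1.0 / delta))) / (1.0 - 1.0 / (2.0 * lam))

# Illustrative (made-up) numbers: a stochastic classifier with 2% empirical
# error and a moderate KL to the prior, on m = 55000 examples.
print(linear_pac_bayes_bound(emp_risk=0.02, kl=5000.0, m=55000, delta=0.05, lam=1.0))  # ~0.22
```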
Fix a prior distribution P. It is well known that the optimal data-dependent randomized
classifier Q that minimizes the PAC-Bayes risk bound is a Gibbs posterior. The result appears
in McAllester (2003, Thm. 2) for a PAC-Bayes bound that is valid for any measurable
loss functions, but the hypothesis class is restricted to be finite. A more general result for
uncountable H was developed by Catoni (2007, Lem. 1.1.3) in the case of bounded risk.
Thus, for a given P and a bounded loss function, the optimal distribution Q minimizing the linear PAC-Bayes bound satisfies

Q(dw) ∝ exp(−τ R̂_S(w)) P(dw).   (1.18)

For some τ, the parameter λ in the linear PAC-Bayes bound stated in Theorem 1.2.2 can be expressed in terms of the inverse temperature parameter τ.
Local Priors
It is natural to consider a fixed data-dependent posterior Q = Q(S) and ask what prior
optimizes a PAC-Bayes bound. Catoni (2007) studied this question and showed that the
optimal prior (in expectation) is
P = E_{S∼D^m}[Q(S)].   (1.19)
This prior is not available in practice because we do not know D. Instead, one can study
some non-optimal but data-distribution-dependent prior distributions.
In the case when Q(w) is a Gibbs classifier with density proportional to
Lever, Laviolette, and Shawe-Taylor (2013) were able to bound the KL divergence with a
distribution-dependent prior whose prior density is
While this prior choice is not optimal, it is an interesting case to study due to optimality
of Gibbs distributions discussed above. The functions FQ and FP may be different and
act as regularizers. They may depend on the parameters of the hypothesis or perform a
data-dependent regularization. In (Lever, Laviolette, and Shawe-Taylor, 2013), the authors
give localized PAC-Bayes bounds for these particular choices of P and Q.
The use of an informed prior gives a much tighter PAC-Bayes bound. A generic PAC-
Bayes bound depends on KL(Q||P). This term may be very large when P is chosen poorly
and is not tailored for the specific task at hand. In the analysis presented by Lever, Laviolette,
and Shawe-Taylor (2013), they eliminate the KL term by replacing it with an upper bound
when P and Q are chosen as described above. In Chapter 3 we study local prior bounds and
develop a new data-dependent prior bound that depends on the properties of the algorithm
used to choose P.
For a use of local priors, see, e.g., Parrado-Hernández et al. (2012). In this work, the
authors provide tighter PAC-Bayes generalization bounds for SVM classifiers. They explore
several strategies to achieve this. One idea is to use part of the dataset to learn a better prior
for a PAC-Bayes bound, which the authors call a prior PAC-Bayes bound. Another idea
is to choose a Gaussian prior with the mean depending on the data distribution. Since the
data distribution is not available, the authors instead use an empirical data distribution and
then upper bound the difference between the expected parameter value under the empirical
distribution and the true data distribution. This approach yields a bound named an expectation
prior PAC-Bayes bound (Parrado-Hernández et al., 2012, Thm. 9). Their experiments
demonstrate that the use of informative priors for SVM classifiers results in tighter PAC-
Bayes bounds.
This is equivalent to Eq. (1.18) with a rescaled loss function, i.e., the density of a Gibbs
posterior with τ = m. In a special case, where the loss function is the negative log likelihood
and τ = m, the term −τ R̂S (w) is the expected log likelihood under Q. This demonstrates
the connection among the PAC-Bayes optimal posteriors, general Bayesian inference, and
classical Bayesian inference. However, note that the latter works under an assumption that the
likelihood contains the true data generating distribution. In contrast, the former frameworks
are valid under model misspecification. Most PAC-Bayes generalization bounds require a
bounded loss function. Germain et al. (2016) study the connection between PAC-Bayes
and Bayesian inference and extend PAC-Bayes generalization bounds for regression tasks
and real-valued unbounded loss functions. Similar connections have been made by Zhang
(2006a), Zhang (2006b), Jiang and Tanner (2008), Grünwald (2012), and Grünwald and
Mehta (2016).
The log normalizing constant of a Gibbs posterior is

log ∫ exp(−τ R̂_S(w′)) P(dw′).   (1.23)
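When one can sample from P and evaluate the empirical risk, this log normalizing constant can be approximated by a simple Monte Carlo average. The sketch below uses a placeholder prior and risk function and is only meant to make the quantity concrete; for large τ this naive estimator has high variance.

```python
import numpy as np

def log_gibbs_normalizer(sample_prior, emp_risk, tau, n_samples=1000, seed=0):
    """Monte Carlo estimate of the log normalizing constant in Eq. (1.23),
    log E_{w ~ P}[exp(-tau * R_hat_S(w))], computed stably via log-sum-exp."""
    rng = np.random.default_rng(seed)
    log_terms = np.array([-tau * emp_risk(sample_prior(rng)) for _ in range(n_samples)])
    shift = log_terms.max()
    return shift + np.log(np.mean(np.exp(log_terms - shift)))

# Toy placeholder model: a standard normal "prior" over a scalar parameter
# and a bounded quadratic stand-in for the empirical risk.
prior = lambda rng: rng.normal()
risk = lambda w: min(1.0, w ** 2)
print(log_gibbs_normalizer(prior, risk, tau=10.0))
```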
adapted for online learning algorithms. Bounds on estimation error are also often referred to as excess risk or
oracle bounds.
Bayesian inference employs Bayes' rule to express a posterior distribution over the hypothesis class. In all but the simplest scenarios, this distribution is computationally intractable.
One can address this problem by building approximate samplers for the posterior distribution
(MCMC methods). An alternative approach is provided by variational methods.
In variational Bayesian inference, one replaces the exact inference problem with an approximate one. The approximate inference problem can then be cast as an optimization problem.
The density of the Bayesian posterior measure QBayes on the hypothesis space is given by
p(w|S) = p(w, S) / ∫ p(w, S) dw.   (1.24)
The integral in the denominator is the marginal density of the observations S, also called the
evidence. The evidence is intractable for many models of interest. In variational inference,
the goal is to find a distribution close to the Bayesian posterior Q_Bayes. We can formalize this as finding a distribution Q_VI (with density q(w)) within a tractable family of probability measures Q satisfying

Q_VI = argmin_{Q∈Q} KL(Q || Q_Bayes).

Computing this KL term is intractable. However, the optimization problem can be seen to be equivalent to

Q_VI = argmax_{Q∈Q} ( E_{w∼Q}[log p(S|w)] − KL(Q || P) ),   (1.28)

where P denotes the prior, with density p(w).
The objective in Eq. (1.28) is called the evidence lower bound objective (ELBO). In modern
stochastic variational inference, one then proceeds by using stochastic gradient method to
optimize the ELBO objective and find QVI . For a review on variational inference methods,
see, e.g., Blei, Kucukelbir, and McAuliffe (2017).
The first term in the ELBO objective is the expected log likelihood, which is unknown.
During optimization, it is usually replaced with its Monte Carlo estimate. Thus in variational
for some prior P. The reader may recognize that the ELBO objective is of the same form as
Eq. (1.11), and, in particular, a linear PAC-Bayes bound on the risk (Theorem 1.2.2), with
the loss chosen to be log loss. The connection between minimizing PAC-Bayes bounds under
log loss and maximizing log marginal densities is the subject of recent work by Germain
et al. (2016).
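A minimal sketch of the Monte Carlo ELBO estimate discussed above, assuming a factorized Gaussian posterior, a Gaussian prior, and a placeholder log-likelihood; the reparameterization w = μ + σ·ε keeps the estimate differentiable in the variational parameters, which is what stochastic variational inference exploits:

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) between factorized Gaussians, in closed form."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2) - 0.5)

def elbo_estimate(log_likelihood, mu_q, sigma_q, mu_p, sigma_p, n_samples=16, seed=0):
    """Monte Carlo estimate of E_{w~Q}[log p(S|w)] - KL(Q||P) using the
    reparameterization w = mu_q + sigma_q * eps with eps ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    ll = 0.0
    for _ in range(n_samples):
        w = mu_q + sigma_q * rng.standard_normal(mu_q.shape)
        ll += log_likelihood(w)
    return ll / n_samples - gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)

# Toy placeholder likelihood: scalar observations with mean w (up to a constant).
data = np.array([0.9, 1.1, 1.3])
log_lik = lambda w: -0.5 * np.sum((data - w[0]) ** 2)
print(elbo_estimate(log_lik, mu_q=np.array([1.0]), sigma_q=np.array([0.1]),
                    mu_p=np.array([0.0]), sigma_p=np.array([1.0])))
```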
Kingma, Salimans, and Welling (2015) use variational inference to improve the dropout
technique in neural network training. The original Gaussian dropout (Srivastava et al., 2014)
can be interpreted as sampling stochastic weights. The sampling distribution is an isotropic Gaussian, centered at the current weight values, with variance equal to the dropout rate (and
thus the variance is equal for all weights in the network). Kingma, Salimans, and Welling
(2015) treat the dropout rate as a variational parameter which is learned via optimization.
In summary, variational dropout can be interpreted as optimizing the parameters of Q
by minimizing a PAC-Bayes risk bound with a fixed prior. The posterior distribution on the
parameters is a stochastic classifier Q, which is an isotropic Gaussian with variances tuned to
the individual weights. This closely resembles our work presented on PAC-Bayes risk bound
optimization and described in Chapter 2.
Note that uniform stability depends on the dataset size m and requires the algorithm to be
deterministic. One can easily extend this to randomized algorithms and obtain generalization
in expectation. See (Hardt, Recht, and Singer, 2015) for more details.
Theorem 1.3.2 (McDiarmid's Inequality (Mendelson, 2003)). Let s_1, ..., s_m be independent random variables, and S = (s_1, ..., s_m). Let S^i denote a vector that differs from S at only one coordinate i. Let a function F map S to R. If there exist constants (c_1, ..., c_m) such that F satisfies |F(S) − F(S^i)| ≤ c_i for all such pairs and every i, then for all t > 0,

P( |F(S) − E[F(S)]| ≥ t ) ≤ 2 exp( −2t² / ∑_{i=1}^m c_i² ).
Now consider F(S) = R(A(S)) − R̂_S(A(S)). Then it is straightforward to show that E_S[F(S)] is bounded by ε. Assume that the loss function ℓ is bounded and takes values in an interval of length L_max. Then we can easily demonstrate that |F(S) − F(S′)| is bounded in terms of ε and L_max. It is a straightforward exercise to apply McDiarmid's inequality to F to get a bound on |F(S) − E_S[F(S)]|, and a bound on the risk follows.
Theorem 1.3.3 (Uniform stability implies generalization; Bousquet and Elisseeff 2002). Let an algorithm A be ε-uniformly stable. Assume the loss function takes values in an interval of length L_max. Then, with probability at least 1 − δ over the training sample S,

R(A(S)) ≤ R̂_S(A(S)) + 2ε + (2mε + L_max) √( log(1/δ) / (2m) ).   (1.32)
There exist a number of other notions of stability that imply generalization either with high probability or in expectation. For a comparison, see, e.g., Bousquet and Elisseeff (2002) and Bassily et al. (2016).
More concretely, given an initial hypothesis w_0 ∈ R^p, SGD repeatedly performs the updates

w_{n+1} = w_n − η_n (1/k) ∑_{i=1}^k ∇_w ℓ(w_n, z_{j_i}),   (1.33)

where, on each round n = 1, 2, . . ., some number k < m of indices j_1, . . . , j_k are chosen uniformly at random and without replacement from [m] := {1, 2, . . . , m}.
Theorem 1.3.4 (Hardt, Recht, and Singer, 2015). Let the loss function ℓ take values in [0, 1] and be L-Lipschitz and β-smooth. Assume we run SGD for T steps, with step sizes η_n < c/n. Then SGD is ε-uniformly stable with

ε ≤ ((1 + 1/(βc)) / (m − 1)) (2cL²)^{1/(βc+1)} T^{βc/(βc+1)}.   (1.34)
While this is an interesting result regarding the stability of the algorithm, it only gives a
generalization bound in expectation, rather than a high probability bound. Furthermore, as
discussed above, uniform stability is a very strong requirement and the notion is independent
of the data distribution, and thus the risk bounds obtained using uniform stability are very loose for "nice" datasets. It would allow us to take only a limited number of SGD steps before the bound on the risk exceeds 0.5 and becomes trivial for binary classification tasks.
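To get a feel for the numbers, the sketch below evaluates the right-hand side of Eq. (1.34) as a function of the number of SGD steps T; the Lipschitz and smoothness constants are hypothetical stand-ins, since for a real network they are large and hard to estimate (and larger constants make the bound degrade faster with T):

```python
import numpy as np

def sgd_uniform_stability(T, m, L, beta, c):
    """Right-hand side of Eq. (1.34): the uniform-stability parameter of SGD run
    for T steps with step sizes of order c/n, on an L-Lipschitz, beta-smooth loss
    taking values in [0, 1], with m training examples."""
    q = beta * c / (beta * c + 1.0)                 # exponent on T
    return (1.0 + 1.0 / (beta * c)) / (m - 1) * (2.0 * c * L ** 2) ** (1.0 - q) * T ** q

m = 55000
for T in [10 ** 3, 10 ** 5, 10 ** 7]:
    # Hypothetical constants; real networks have much larger L and beta.
    print(T, sgd_uniform_stability(T, m, L=10.0, beta=10.0, c=0.1))
```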
Here we formally define some of the differential privacy related terms used in the main
text. (See (Dwork, 2006; Dwork and Roth, 2014b) for more details.)
Let U, U_1, U_2, . . . be independent uniform (0, 1) random variables, independent also of any random variables introduced by P and E, and let π : N × [0, 1] → [0, 1] satisfy that (π(1, U), . . . , π(k, U)) is equal in distribution to (U_1, . . . , U_k) for all k ∈ N. Write π_k for π(k, ·).
Differential privacy can also be viewed as a very strong notion of algorithmic stability. In
particular, every differentially private algorithm is uniformly stable. This sufficiency result
is well known and follows immediately from the definition of differential privacy. In some
work, differential privacy is also referred to as Max-KL Stability to highlight the connection
to other stability definitions (Bassily et al., 2016).
We introduce several additional generalization bounds in Chapters 3 and 4, where we use
differential privacy to derive a data dependent PAC-Bayes bound and to analyze an existing
learning algorithm in terms of this new bound.
Chapter 2
Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks
2.1 Introduction
By optimizing a PAC-Bayes bound, we show that it is possible to compute nonvacuous
numerical bounds on the generalization error of deep stochastic neural networks with millions
of parameters, despite the training data sets being one or more orders of magnitude smaller
than the number of parameters. To our knowledge, these are the first explicit and nonvacuous
numerical bounds computed for trained neural networks in the modern deep learning regime
where the number of network parameters eclipses the number of training examples.
The bounds we compute are data dependent, incorporating millions of components
optimized numerically to identify a large region in weight space with low average empirical
error around the solution obtained by stochastic gradient descent (SGD). The data dependence
is essential: indeed, the VC dimension of neural networks is typically bounded below by the
number of parameters, and so one needs as many training data as parameters before (uniform)
PAC bounds are nonvacuous, i.e., before the generalization error falls below 1. To put this in
concrete terms, on MNIST, having even 72 hidden units in a fully connected first layer yields
vacuous PAC bounds.
Evidently, we are operating far from the worst case: observed generalization cannot
be explained in terms of the regularizing effect of the size of the neural network alone.
This is an old observation, and one that attracted considerable theoretical attention two
decades ago: Bartlett (Bartlett, 1997; Bartlett, 1998) showed that, in large (sigmoidal) neural
networks, when the learned weights are small in magnitude, the fat-shattering dimension
is more important than the VC dimension for characterizing generalization. In particular,
Bartlett established classification error bounds in terms of the empirical margin and the fat-
shattering dimension, and then gave fat-shattering bounds for neural networks in terms of the
magnitudes of the weights and the depth of the network alone. Improved norm-based bounds
were obtained using Rademacher and Gaussian complexity by Bartlett and Mendelson (2002)
and Koltchinskii and Panchenko (2002).
These norm-based bounds are the foundation of our current understanding of neural
network generalization. It is widely accepted that these bounds explain observed general-
ization, at least “qualitatively” and/or when the weights are explicitly regularized. Indeed,
recent work by Neyshabur, Tomioka, and Srebro (2014) puts forth the idea that SGD per-
forms implicit norm-based regularization. Somewhat surprisingly, when we investigated
state-of-the-art Rademacher bounds for ReLU networks, the bounds were vacuous when
applied to solutions obtained by SGD on real networks/datasets. We discuss the details of
this analysis in Section 2.9. While most researchers assume these bounds are explanatory
even if they are numerically loose, we argue that the bounds, being numerically vacuous, do
not establish generalization on their own. It is worth highlighting that nonvacuous bounds
may exist under hypotheses that currently only yield vacuous bounds. This is an important
avenue to investigate.
2. still able to achieve ≈ 0 training error even after the labels are randomized, and does
so with only a small factor of additional computational time.
Taken together, these two observations demonstrate that these networks have a tremendous
capacity to overfit and yet SGD does not abuse this capacity as it optimizes the surrogate
loss, despite the lack of explicit regularization.
It is a major open problem to explain this phenomenon. A natural approach would be
to show that, under realistic hypotheses, SGD performs implicit regularization or tends to
find solutions that possess some particular structural property that we already know to be
connected to generalization. However, in order to complete the logical connection, we need
an associated error bound to be nonvacuous in the regime of model size / data size where we
hope to explain the phenomenon.
This work establishes a potential candidate, building off ideas by Langford (2002) and
Langford and Caruana (2002a): On a binary class variant of MNIST, we find that SGD
solutions are nearby to relatively large regions in weight space with low average empirical
error. We find this structure by optimizing a PAC-Bayes bound, starting at the SGD solution,
obtaining a nonvacuous generalization bound for a stochastic neural network. Across a
variety of network architectures, our PAC-Bayes bounds on the test error are in the range
16–22%. These bounds are far from vacuous, but they are loose: Chernoff bounds on the test error based
on held-out data are consistently around 3%. Despite the gap, theoreticians aware of the
numerical performance of generalization bounds will likely be surprised that it is possible at
all to obtain nonvacuous numerical bounds for models with such large capacity trained on so
few training examples. While we cannot entirely explain the magnitude of generalization,
we can demonstrate nontrivial generalization.
Our approach was inspired by a line of work in physics by Baldassi, Ingrosso, Lucibello,
Saglietti, and Zecchina (Baldassi et al., 2015) and the same authors with Borgs and Chayes
(Baldassi et al., 2016). Based on theoretical results for discrete optimization linking compu-
tational efficiency to the existence of nonisolated solutions, the authors propose a number
of new algorithms for learning discrete neural networks by explicitly driving a local search
towards nonisolated solutions. On the basis of Bayesian ideas, they posit that these solutions
have good generalization properties. In a recent work with Chaudhari, Choromanska, Soatto,
and LeCun (Chaudhari et al., 2017), they introduce local-entropy loss and EntropySGD,
extending these algorithmic ideas to modern deep learning architectures with continuous
parametrizations, and obtaining impressive empirical results.
In the continuous setting, nonisolated solutions correspond to “flat minima”. The exis-
tence and regularizing effects of flat minima in the empirical error surface were recognized early on by researchers, going back to work by Hinton and Camp (1993) and Hochreiter and
Schmidhuber (1997). Hochreiter and Schmidhuber discuss sharp versus flat minima using
the language of minimum description length (MDL; (Rissanen, 1983; Grünwald, 2007)).
In short, describing weights in sharp minima requires high precision in order to not incur
nontrivial excess error, whereas flat minima can be described with lower precision. A
similar coding argument appears in (Hinton and Camp, 1993).
Hochreiter and Schmidhuber propose an algorithm to find flat minima by minimizing
the training error while maximizing the log volume of a connected region of the parameter
space that yields similar classifiers with similarly good training error. There are very close
connections—at both the level of analysis and algorithms—with the work of Chaudhari et al.
(2017) and close connections with the approach we take to compute nonvacuous error bounds
by exploiting the local error surface. (We discuss more related work in Section 2.10.)
2.1.2 Approach
Our working hypothesis is that SGD finds good solutions only if they are surrounded by a
relatively large volume of solutions that are nearly as good. This hypothesis suggests that
PAC-Bayes bounds may be fruitful: if SGD finds a solution contained in a large volume of
equally good solutions, then the expected error rate of a classifier drawn at random from
this volume should match that of the SGD solution. The PAC-Bayes theorem (McAllester,
1999a) bounds the expected error rate of a classifier chosen from a distribution Q in terms
of the Kullback–Leibler divergence from some a priori fixed distribution P, and so if the
volume of equally good solutions is large, and not too far from the mass of P, we will obtain
a nonvacuous bound.
Our approach will be to use optimization to find a broad distribution Q over neural
network parameters that minimizes the PAC-Bayes bound, in effect mapping out the volume
of equally good solutions surrounding the SGD solution. This idea is actually a modern
take on an old idea by Langford and Caruana (Langford and Caruana, 2002a), who apply
PAC-Bayes bounds to small two-layer stochastic neural networks (with only 2 hidden units)
that were trained on (relatively large, in comparison) data sets of several hundred labeled
examples.
The basic idea can be traced back even further to work by Hinton and Camp (Hinton
and Camp, 1993), who propose an algorithm for controlling overfitting in neural networks
via the minimum description length principle. In particular, they minimize the sum of the
empirical squared error and the KL divergence between a prior and posterior distribution on
the weights. Their algorithm is applied to networks with 100’s of inputs and 4 hidden units,
trained on several hundred labeled examples. Hinton and Camp do not compute numerical
generalization bounds to verify that MDL principles alone suffice to explain the observed
generalization.
Our algorithm more directly extends the work by Langford and Caruana, who propose to
construct a distribution Q over neural networks by performing a sensitivity analysis on each
parameter after training, searching for the largest deviation that does not increase the training
error by more than, e.g., 1%. For Q, Langford and Caruana choose a multivariate normal
distribution over the network parameters, centered at the parameters of the trained neural
network. The covariance matrix is diagonal, with the variance of each parameter chosen to be
the estimated sensitivity, scaled by a global constant. (The global scale is chosen so that the
training error of Q is within, e.g., 1% of that of the original trained network.) Their prior P is
also a multivariate normal, but with zero mean and covariance given by some scalar multiple
of the identity matrix. By employing a union bound, they allow themselves to choose the
scalar multiple in a data-dependent fashion to optimize the PAC-Bayes bound.
The algorithm sketched by Langford and Caruana does not scale to modern neural
networks for several reasons, but one dominates: in massively overparametrized networks,
individual parameters often have negligible effect on the training classification error, and
so it is not possible to estimate the relative sensitivity of large populations of neurons by
studying the sensitivity of neurons in isolation.
Instead, we use stochastic gradient descent to directly optimize the PAC-Bayes bound on
the error rate of a stochastic neural network. At each step, we update the network weights
and their variances by taking a step along an unbiased estimate of the gradient of (an upper
bound on) the PAC-Bayes bound. In effect, the objective function is the sum of i) the
empirical surrogate loss averaged over a random perturbation of the SGD solution, and ii) a
generalization error bound that acts like a regularizer.
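The following NumPy sketch illustrates the structure of such an objective under simplifying assumptions: Q is a diagonal Gaussian over the weights, P is a spherical Gaussian, the surrogate risk is a placeholder function, and the PAC-Bayes term is loosened via the Pinsker-style relaxation KL^{-1}(q|c) ≤ q + √(c/2) of Section 2.2. It is not the exact objective used in the experiments (which, among other things, also optimizes the prior scale through a union bound); it is only meant to show the two ingredients: a perturbed surrogate loss plus a KL-driven penalty.

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for a diagonal Gaussian Q and a spherical Gaussian P."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2) - 0.5)

def pac_bayes_objective(surrogate_risk, mu_q, rho_q, mu_p, sigma_p, m, delta, rng):
    """One stochastic evaluation of: surrogate risk of a single perturbed weight
    sample, plus the relaxed penalty sqrt((KL(Q||P) + log(2m/delta)) / (2(m-1)))."""
    sigma_q = np.exp(rho_q)                                  # std on a log scale
    w = mu_q + sigma_q * rng.standard_normal(mu_q.shape)     # perturbed weights
    penalty = np.sqrt((kl_gaussians(mu_q, sigma_q, mu_p, sigma_p)
                       + np.log(2.0 * m / delta)) / (2.0 * (m - 1)))
    return surrogate_risk(w) + penalty

# Toy usage with a placeholder surrogate risk; a real run would backpropagate
# through both terms with respect to mu_q and rho_q.
rng = np.random.default_rng(0)
d, m = 1000, 55000
mu_p = rng.standard_normal(d) * 0.1                  # prior mean at the "initialization"
mu_q, rho_q = mu_p.copy(), np.full(d, -3.0)
risk = lambda w: float(np.mean(np.abs(w)))           # stand-in for the surrogate loss
print(pac_bayes_objective(risk, mu_q, rho_q, mu_p, sigma_p=0.1, m=m, delta=0.025, rng=rng))
```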
Having demonstrated that this simple approach can construct a witness to generalization,
it is worthwhile asking whether these ideas can be extended to the setting of local-entropic
loss (Chaudhari et al., 2017). If we view the distribution that defines the local-entropic loss as
defining a stochastic neural network, can we use PAC-Bayes bounds to establish nonvacuous
bounds on its generalization error?
2.2 Bounds
We will employ three probabilistic bounds to control generalization error: the union bound, a
sample convergence bound derived from the Chernoff bound, and the PAC-Bayes bound due
to McAllester (1999a). We state the union bound for completeness.
Theorem 2.2.1 (union). Let E_1, E_2, . . . be events. Then P(∪_n E_n) ≤ ∑_n P(E_n).
Recall that B(p) denotes the Bernoulli distribution on {0, 1} with mean p ∈ [0, 1]. The
following bound is derived from the KL formulation of the Chernoff bound:
Theorem 2.2.2 (sample convergence (Langford and Caruana, 2002a)). For every p, δ ∈ (0, 1) and n ∈ N, with probability at least 1 − δ over x ∼ B(p)^n,

kl( n^{-1} ∑_{i=1}^n x_i || p ) ≤ log(2/δ) / n.
Theorem 2.2.3 (PAC-Bayes (McAllester, 1999a; Langford and Seeger, 2001)). For every
δ > 0, m ∈ N, distribution D on Rk × {−1, 1}, and distribution P on R p , with probability at
least 1 − δ over Sm ∼ D m , for all distributions Q on R p ,
kl(R̂_{S_m}(Q) || R(Q)) ≤ (KL(Q||P) + log(2m/δ)) / (m − 1).
Inverting the bound, given q ∈ [0, 1] and c ≥ 0, let KL^{-1}(q|c) denote the largest p* ∈ [q, 1] such that kl(q||p*) ≤ c.
We are not aware of a simple formula for KL−1 (q|c), although numerical approximations
are readily obtained via Newton’s method (Section 2.4). For the purpose of gradient-based
optimization, we can use the well-known inequality 2(q − p)² ≤ kl(q||p) to obtain a simple upper bound¹

KL^{-1}(q|c) ≤ q + √(c/2).   (2.2)

This bound is quantitatively loose near q ≈ 0, because then KL^{-1}(q|c) ≈ c for c ≪ 1, versus the upper bound of Θ(√c). On the other hand, when c is large enough that q + √(c/2) > 1, the derivative of KL^{-1}(q|c) is zero, whereas the upper bound provides a useful derivative.

¹ In this section we use a bound that is slightly tighter than the ones stated in Section 1.2.2; the two differ only in the logarithmic confidence term. Note that this term is very small when divided by m − 1 (≈ 1e−4), so the effect on the computed bounds is negligible.
In all but the simplest scenarios, making predictions according to the optimal Q is intractable.
However, we can attempt to approximate it.
We cannot compute this bound exactly because computing R̂Sm (Q) is intractable. However,
we can obtain unbiased estimates and apply the sample convergence bound (Theorem 2.2.2).
In particular, given n i.i.d. samples w_1, . . . , w_n from Q, we produce the Monte Carlo approximation Q̂_n = (1/n) ∑_{i=1}^n δ_{w_i}, for which R̂_{S_m}(Q̂_n) is exactly computable, and obtain the bound

R̂_{S_m}(Q) ≤ R̂^{n,δ′}_{S_m}(Q) := KL^{-1}( R̂_{S_m}(Q̂_n) | n^{-1} log(2/δ′) ),

R(Q) ≤ KL^{-1}( R̂^{n,δ′}_{S_m}(Q) | B_RE(w, s, λ; δ) ),   (2.6)
p_{n+1} = N(p_n; q, c), where N(p; q, c) = p − h_{q,c}(p) / h′_{q,c}(p).
2. If b̃ ≥ 1, then return 1.
Our reported results in Table 2.1 use five steps of Newton’s method.
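A sketch of this inversion, assuming h_{q,c}(p) = kl(q||p) − c: start from the relaxed upper bound q + √(c/2) of Eq. (2.2) (returning 1 if it already exceeds 1) and take a fixed number of Newton steps:

```python
import numpy as np

def kl_bernoulli(q, p):
    """kl(q || p) = KL(B(q) || B(p)) for Bernoulli distributions."""
    eps = 1e-12
    p = np.clip(p, eps, 1.0 - eps)
    return q * np.log((q + eps) / p) + (1.0 - q) * np.log((1.0 - q + eps) / (1.0 - p))

def kl_inverse(q, c, newton_steps=5):
    """Approximate KL^{-1}(q|c): the largest p >= q with kl(q || p) <= c.
    Newton's method on h(p) = kl(q || p) - c, started at q + sqrt(c/2)."""
    p = q + np.sqrt(c / 2.0)
    if p >= 1.0:
        return 1.0
    for _ in range(newton_steps):
        h = kl_bernoulli(q, p) - c
        h_prime = (1.0 - q) / (1.0 - p) - q / p      # d/dp kl(q || p)
        p = np.clip(p - h / h_prime, q, 1.0 - 1e-12)
    return p

# Example: empirical risk 2% and a (hypothetical) right-hand side c = 0.02.
print(kl_inverse(0.02, 0.02))   # ~0.062, versus the relaxation q + sqrt(c/2) = 0.12
```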
Lemma 2.5.1. Let S be a finite set of network symmetries, let P be an absolutely continuous
distribution such that P = Pσ for all σ ∈ S, and define QS as above for some arbitrary
absolutely continuous distribution Q on Rd with finite differential entropy. Then KL(QS ||P) =
KL(Q||P) − KL(Q||QS ) ≤ KL(Q||P).
The above lemma can be generalized to distributions over (potentially infinite) sets of
network symmetries.
It follows from this lemma that one can do no worse by accounting for symmetries
using mixtures, provided that one is comparing to a distribution P that is invariant to those
symmetries. In light of the PAC-Bayes theorem, this means that a generalization bound
based upon a KL divergence that does not account for symmetries can likely be improved.
However, for a finite set S of symmetries, it is easy to show that the improvement is bounded
by log |S|, which suggests that, in order to obtain appreciable improvements in a numerical
bound, one would need to account for an exponential number of symmetries. Unfortunately,
exploiting this many symmetries seems intractable. It is hard to obtain useful lower bounds
to KL(Q||QS ), while upper bounds from Jensen’s inequality lead to negative (hence vacuous)
lower bounds on KL(QS ||P).
In this work, we therefore take a different approach to dealing with symmetries. Neural
networks are randomly initialized in order to break symmetries. Combined with the idea that
the learned parameters will reflect these broken symmetries, we choose our prior P to be
located at the random initialization, rather than at zero.
2.6 Experiments
Starting from neural networks whose weights have been trained by SGD (with momentum)
to achieve near-perfect accuracy on a binary class variant of MNIST, we then optimize a
PAC-Bayes bound on the error rate of stochastic neural network whose weights are ran-
dom perturbations of the weights learned by SGD. We consider several different network
architectures, varying both the depth and the width of the network.
Table 2.1 Results for experiments on the binary class variant of MNIST. SGD is either trained on (T) true labels or (R) random labels. The network architecture is expressed as N^L, indicating L hidden layers with N nodes each. Errors are classification error. The reported VC dimension is the best known upper bound (in millions) for ReLU networks. The SNN error rates are tight upper bounds (see text for details). The PAC-Bayes bounds upper bound the test error with probability 0.965.
2.6.1 Dataset
We use the MNIST handwritten digits data set (LeCun, Cortes, and Burges, 2010) as provided
in Tensorflow (Abadi et al., 2015), where the dataset is split into the training set (55000
images) and test set (10000 images). (We do not use the validation set.) Each MNIST image
is black and white and 28-pixels square, resulting in a network input dimension of k = 784.
MNIST is usually treated as a multiclass classification problem. In order to use standard PAC-
Bayes bounds, we produce a binary classification problem by mapping numbers {0, . . . , 4}
to label 1 and {5, . . . , 9} to label −1. In some experiments, we train on random labels, i.e.,
binary labels drawn independently and uniformly at random.
2.7 Results
See Table 2.1. All SGD trained
networks achieve perfect or near-perfect accuracy on the training data. On true labels, the
SNN mean training error increases slightly as the weight distribution broadens to minimize
the KL divergence. The SGD solution is close to the mean of the SNN as measured with respect
to the SNN covariance. (See Section 2.8 for a discussion.) For the random-label experiment,
the SNN mean training error rises above 10%. Ideally, it might have risen to nearly 50%,
while driving down the KL term to near zero, finding the optimal Q equal to P.
The empirical test error of the SGD classifiers does not change much across the different
architectures, despite the potential for overfitting. This phenomenon is well known, though
still remarkable. For the random-label experiment, the empirical test classification error of
0.508 represents lack of generalization, as expected. The same two patterns hold for the SNN
test error too, with slightly higher error rates.
Remarkably, the PAC-Bayes bounds do not grow much despite the networks becoming
several times larger, and all true label experiments have classification error bounded by
0.23. (This observation is consistent with (Neyshabur, Tomioka, and Srebro, 2014).) Since
larger networks possess many more symmetries, the true PAC-Bayes bounds for our learned
stochastic neural network classifier might be substantially smaller. (See Section 2.5 for
a discussion.) While these bounds are several times larger than the test error estimated
tight as those that we computed. Do the weights change much during optimization of the
bound? How would we measure this change?
To answer these questions, we calculated the p-value of the SGD solution under the
distribution of the stochastic neural network.
Let Q_SNN denote the distribution obtained by optimizing the PAC-Bayes bound, write w_SNN and Σ_SNN for its mean and covariance, and let ‖w‖_{Σ_SNN} = w^T Σ_SNN^{-1} w denote the induced norm. Using 10000 samples, we estimated

P_{w∼Q_SNN}( ‖w − w_SNN‖_{Σ_SNN} < ‖w_SGD − w_SNN‖_{Σ_SNN} ).
The estimate was 0 for all true-label experiments, i.e., $w_{\mathrm{SGD}}$ is a less extreme perturbation of $w_{\mathrm{SNN}}$ than a typical perturbation. For the random-label experiments, $w_{\mathrm{SNN}}$ and $w_{\mathrm{SGD}}$
differ significantly, which is consistent with the bound being optimized in the face of random
labels.
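The following Python/NumPy sketch illustrates this Monte Carlo estimate, assuming a diagonal SNN covariance parameterized by a hypothetical vector `s`; it is an illustration of the computation, not the code used in the experiments.

```python
import numpy as np

def snn_pvalue(w_sgd, w_snn, s, n_samples=10000, seed=0):
    """Monte Carlo estimate of P_{w ~ Q_SNN}[ ||w - w_snn||_Sigma < ||w_sgd - w_snn||_Sigma ],
    where Q_SNN = N(w_snn, diag(s)) and ||v||_Sigma^2 = v^T diag(s)^{-1} v."""
    rng = np.random.default_rng(seed)
    d_sgd = np.sum((w_sgd - w_snn) ** 2 / s)          # squared Sigma^{-1}-norm of the SGD solution
    w = w_snn + rng.standard_normal((n_samples, len(w_snn))) * np.sqrt(s)  # samples from Q_SNN
    d_samples = np.sum((w - w_snn) ** 2 / s, axis=1)  # squared norms of the sampled perturbations
    return np.mean(d_samples < d_sgd)

# Toy usage with synthetic vectors (not the trained networks from the experiments):
p = 1000
rng = np.random.default_rng(1)
w_snn, s = rng.standard_normal(p), np.full(p, 0.01)
w_sgd = w_snn + 0.001 * rng.standard_normal(p)        # a small perturbation of the SNN mean
print(snn_pvalue(w_sgd, w_snn, s))                     # close to 0: w_sgd is less extreme than typical samples
```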
2.9 Evaluating Rademacher error bounds
Theorem 2.9.1. For every L > 0, with probability at least $1 - \delta$ over the choice of $S_m \sim \mathcal{D}^m$, for all $h \in \mathcal{F}$,
$$R_D(h) \le \hat{R}_S(h, L) + 2 L\, \mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}, \qquad (2.7)$$
where
$$\hat{R}_S(h, L) = \frac{1}{m}\sum_{i=1}^{m} \max\bigl(\min\bigl(1 - L\, y_i h(x_i),\, 1\bigr),\, 0\bigr).$$
In order to compute these bounds, we must compute (bounds on) the Rademacher
complexity of appropriate function classes. To that end, we will use results by Neyshabur,
Tomioka, and Srebro (2015) for ReLU networks (i.e., multilayer perceptrons with ReLU
activations).
Let w be the weights of a ReLU network and let $w^{(k)}_{i,j}$ denote the weight associated with the edge from neuron i in layer k − 1 to neuron j in layer k. Neyshabur, Tomioka, and Srebro (2015) define the $\ell_1$ path norm
$$\phi_1(w) = \sum_j \bigl|w^{(2)}_{j,1}\bigr| \sum_i \bigl|w^{(1)}_{i,j}\bigr|, \qquad (2.8)$$
stated here in the special case of a 2-layer network with 1 output neuron. For any number of
stated here in the special case of a 2-layer network with 1 output neuron. For any number of
layers, the path norm can be computed easily in a forward pass, requiring only a matrix–vector
product at each layer.
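A minimal sketch of this forward-pass computation for a fully connected ReLU network, using hypothetical weight matrices `Ws` and ignoring biases; the usage lines also check the layer-rescaling invariance discussed below.

```python
import numpy as np

def l1_path_norm(Ws):
    """ell_1 path norm of a fully connected ReLU network, computed by propagating a
    vector of ones through the network with all weights replaced by their absolute values.
    Ws[k] has shape (n_{k-1}, n_k): rows index layer k-1 units, columns index layer k units."""
    v = np.ones(Ws[0].shape[0])        # one "path weight" per input unit
    for W in Ws:
        v = v @ np.abs(W)              # matrix-vector product per layer
    return v.sum()

# Usage on a hypothetical 2-layer network; the path norm is unchanged by the rescaling
# w^(1) -> c * w^(1), w^(2) -> w^(2) / c (c > 0), which leaves the ReLU network unchanged.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((784, 600)), rng.standard_normal((600, 1))
c = 10.0
print(np.isclose(l1_path_norm([W1, W2]), l1_path_norm([c * W1, W2 / c])))  # True
```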
Neyshabur, Tomioka, and Srebro also provide the following Rademacher bound in terms of the path norm:
Theorem 2.9.2 ((Neyshabur, Tomioka, and Srebro, 2015, Cor. 7)). Given m datapoints $x_1, \ldots, x_m \in \mathbb{R}^D$, the Rademacher complexity of the class of depth-d ReLU networks, whose $\ell_1$ path norms are bounded by $\phi$, is no greater than
$$2^{d}\, \phi\, \sqrt{\frac{\log(2D)}{m}}\, \max_i \|x_i\|_\infty. \qquad (2.9)$$
Let $w^{(k)}_j$ denote the jth column of $w^{(k)}$, i.e., the vector of weights for edges from layer k − 1 to neuron j in layer k. The $\ell_1$ path norm is closely related to the norm
$$\gamma_{1,\infty}(w) = \prod_{k=1}^{d} \max_j \bigl\|w^{(k)}_j\bigr\|_1.$$
If the upper bound φ appearing in the bound of Theorem 2.9.2 is instead taken to be a
bound on γ1,∞ (w), then one essentially obtains the Gaussian complexity bounds for neural
networks established by Bartlett and Mendelson (2002) and Koltchinskii and Panchenko
(2002). However, their bounds apply only to networks with bounded activation functions,
ruling out ReLU networks.
Regardless, the path-norm bound is tighter for ReLU networks. In order to establish
the connection, let $\mathcal{W}(w)$ denote the set of all weights w′ obtained from redistributing the weights w across layers, i.e., by multiplying the weights $w^{(k-1)}$ in a layer by a constant c > 0 and multiplying the weights in the subsequent layer $w^{(k)}$ by $c^{-1}$. Note that the function computed by a ReLU network is invariant to this transformation. This is the key insight of Neyshabur, Tomioka, and Srebro. Obviously, $\phi_1(w) = \phi_1(w')$ for all $w' \in \mathcal{W}(w)$. Neyshabur, Tomioka, and Srebro show that $\phi_1(w) = \inf_{w' \in \mathcal{W}(w)} \gamma_{1,\infty}(w')$, and so the path norm better captures the complexity of a ReLU network.
In our experiments, we will compute the bound obtained by combining Theorems 2.9.1
and 2.9.2.
Note that the constant L in Theorem 2.9.1 must be chosen independently of the data Sm .
As in the original result (Koltchinskii and Panchenko, 2002, Thm. 2), one can use a union
bound to allow oneself to choose L based on the data in order to minimize the bound. Even
though the effect of this change is usually (relatively) small, its magnitude depends on the
particular weight function employed in the union bound. Instead, we will apply the bound
with an optimized L, yielding an optimistic bound (formally, a lower bound on any upper
bound obtained from a union bound). We optimize L over a grid of values, and handle the
vacuous edge cases analytically. Nevertheless, we will see that even the resulting (optimistic) bound is vacuous.
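A minimal sketch of this optimistic grid search, assuming the margins $y_i h(x_i)$ and a precomputed value for the Rademacher complexity term (e.g., obtained from Eq. (2.9)) are supplied; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def ramp_risk(margins, L):
    """Empirical surrogate risk R_hat_S(h, L) from Eq. (2.7): a ramp loss with slope L."""
    return np.mean(np.clip(1.0 - L * margins, 0.0, 1.0))

def optimistic_margin_bound(margins, rademacher, m, delta=0.05,
                            L_grid=np.logspace(-3, 3, 61)):
    """Optimistically minimize the r.h.s. of Eq. (2.7) over a grid of L values.
    `rademacher` is a bound on R_m(F), e.g. computed from the path norm via Eq. (2.9)."""
    slack = np.sqrt(np.log(2.0 / delta) / (2.0 * m))
    bounds = [ramp_risk(margins, L) + 2.0 * L * rademacher + slack for L in L_grid]
    # As L -> 0 the ramp risk tends to 1, so small L is always vacuous; cap the result at 1.
    return min(min(bounds), 1.0)

# Toy usage with synthetic margins standing in for y_i * h(x_i); a value >= 1 is vacuous.
rng = np.random.default_rng(0)
margins = rng.normal(loc=2.0, scale=1.0, size=55000)
print(optimistic_margin_bound(margins, rademacher=0.5, m=55000))
```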
2.9.2 Results
When the network is trained by optimizing the logistic cost function without regularization,
the error bound becomes vacuous within a fraction of a single epoch. This occurs before the training error dips appreciably below chance. The bound's behavior is due to the path norm
diverging. While the level sets R̂S (h, ·)−1 of the empirical margin distribution are growing,
they are not growing fast enough to counteract the growth of the path norm. (See the left
column of Fig. 2.1.)
When the network is trained with explicit path-norm regularization, we obtain vacuous
error bounds, unless we apply excessive amounts of regularization. We report results when
the regularization parameter is 0.01 and 0.05. Both settings are clearly too large, as evidenced
by the training error converging to ~20% and ~30%, respectively. A cursory study of overall
ℓ1 and ℓ2 regularization produced qualitatively similar results. Further study is necessary.
2.10 Related work
The objective optimized by Hinton and Camp is of the same essential form as this one, except for the choice of squared error and different prior and posterior distributions. We explored using Eq. (2.10) as our objective with a surrogate loss, but it did not produce different results.
In the introduction we discuss the close connection of our work to several recent papers
(Baldassi et al., 2015; Baldassi et al., 2016; Chaudhari et al., 2017) that study “flat” or
nonisolated minima on account of their generalization and/or algorithmic properties.
Based on theoretical results for k-SAT that efficient algorithms find nonisolated solutions,
Baldassi et al. (2016) model efficient neural network learning algorithms as minimizers of a
replicated version of the empirical loss surface, which emphasizes nonisolated minima and
deemphasizes isolated minima. They then propose several algorithms for learning discrete
neural networks using these ideas.
In follow-up work with Chaudhari, Choromanska, Soatto, and LeCun (Chaudhari et al.,
2017), they translate these ideas into the setting of continuously parametrized neural networks.
They introduce an algorithm, called Entropy-SGD, which seeks out large regions of dense
local minima: it maximizes the depth and flatness of the energy landscape. Their objective
integrates both the energy of nearby parameters and the weighted distance to the parameters.
In particular, rather than directly minimizing an error surface $w \mapsto \hat{R}_{S_m}(w)$, they propose the local-entropic objective (Eq. (2.11)), where γ > 0 is a parameter and C(γ) a constant. In comparison, our algorithm can be interpreted as an optimization of the form given in Eq. (2.12),
where L serves as a regularizer that accounts for the generalization error by, roughly speaking, trying to expand the axis-aligned ellipsoid $\{x \in \mathbb{R}^d : (w - x)^{T} \operatorname{diag}(s)^{-1} (w - x) = 1\}$ and draw it closer to some point $w_0$ near the origin. Comparing Eqs. (2.11) and (2.12) highlights
similarities and differences. The local-entropic loss is sensitive to the volume of the regions
containing good solutions. While the first term in our objective function looks similar, it does
not, on its own, account for the volume of regions. This role is played by the second term,
which prefers large regions (but also ones near the initialization w0 ). In our formulation, the
first term is the empirical error of a stochastic neural network, which is precisely the term
whose generalization error we are trying to bound. Entropy-SGD was not designed for the
purpose of finding good stochastic neural networks, although it seems possible that having
small local-entropic loss would lead to generalization for neural networks whose parameters
are drawn from the local Gibbs distribution. Another difference is that, in our formulation,
the diagonal covariance of the multivariate normal perturbation is learned adaptively, and
driven by the goal of minimizing error. The shape of the normal perturbation is not learned,
although the region whose volume is being measured is determined by the error surface, and
it seems likely that this volume will be larger than that spanned by a multivariate Gaussian
chosen to lie entirely in a region with good loss.
Chaudhari et al. (2017) give an informal characterization of the generalization proper-
ties of local-entropic loss in Bayesian terms by comparing the marginal likelihood of two
Bayesian priors centered at a solution with small and large local-entropic loss. Informally,
a Bayesian prior centered on an isolated solution will lead to small marginal likelihood in
contrast to one centered in a wide valley. They give a formal result relying on the uniform
stability of SGD (Hardt, Recht, and Singer, 2015) to show under some strong (and admittedly
unrealistic) conditions that Entropy-SGD generalizes better than SGD. The key property is
that the local-entropic loss surface is smoother than the original error surface.
2 A detailed analysis of local entropy and the Entropy-SGD algorithm is given in Chapter 4.
Other authors have found evidence of the importance of “flat” minima: Recent work by
Keskar, Mudigere, Nocedal, Smelyanskiy, and Tang (Keskar et al., 2017) finds that large-
batch methods tend to converge to sharp / isolated minima and have worse generalization
performance compared to mini-batch algorithms, which tend to converge to flat minima and
have good generalization performance. The bulk of their paper is devoted to the problem of
restoring good generalization behavior to batch algorithms.
Finally, our algorithm also bears resemblance to graduated optimization, an approach to-
ward non-convex optimization attributed to Blake and Zisserman (1987) whereby a sequence
of increasingly fine-grained versions of an optimization problem are solved in succession.
(See (Hazan, Levy, and Shalev-Shwartz, 2016) and references therein.) In this context,
Eq. (2.11) is the result of a local smoothing operation acting on the objective function $w \mapsto \breve{\ell}(w, S_m)$. In graduated optimization, the effect of the local smoothing operation would be decreased over time, eventually disappearing. In our formulation, the act of balancing the empirical loss and generalization error serves to drive the evolution of the local smoothing in an adaptive fashion. Moreover, in the limit, the local smoothing does not vanish in our
algorithm, as the volume spanned by the perturbations relates to the generalization error. Our
results suggest that SGD solutions live inside relatively large volumes, and so perhaps SGD
can be understood in terms of graduated optimization.
2.11 Discussion
We obtain nonvacuous generalization bounds for deep neural networks with millions of
parameters trained on 55000 MNIST examples. These bounds are obtained by optimizing
an objective derived from the PAC-Bayes bound, starting from the solution produced by
SGD. Despite the weights changing, the SGD solution remains well within the 1% ellipsoidal
quantile, i.e., the volume spanned by the stochastic neural network contains the original SGD
solution. (When labels are randomized, however, optimizing the PAC-Bayes bound causes
the solution to shift considerably.)
Our experiments look only at fully connected feed forward networks trained on a binary
class variant of MNIST. It would be interesting to see if the results extend to multiclass
classification, to other data sets, and to other types of architectures, especially convolutional
ones.
Our PAC-Bayes bound can be tightened in several ways. Highly dependent weights
constrain the size of the axis-aligned ellipsoid representing the stochastic neural network.
We can potentially recognize small populations of highly dependent weights, and optimize
their covariance parameters, rather than enforcing independence in the posterior.
One might also consider replacing the multivariate normal posterior with a distribution
that is more tuned to the loss surface. One promising avenue is to follow the lines of
Chaudhari et al. (2017) and consider (local) Gibbs distributions. If the solutions obtained by
minimizing the local-entropic loss are flatter than those obtained by SGD, then we may be able to demonstrate quantitatively tighter bounds.
Finally, there is the hard work of understanding the generalization properties of SGD.
In light of our work, it may be useful to start by asking whether SGD finds solutions in
flat minima. Such solutions could then be lifted to stochastic neural networks with good
generalization properties. Going from stochastic networks back to deterministic ones may
require additional structure.
[Fig. 2.1: panel titles "Train error" and "0-1 Error"; y-axis $10^{0}$ to $10^{-3}$ (log scale); x-axis 0 to 5 (epochs); three columns of plots.]
Chapter 3
Data-dependent PAC-Bayes bounds
3.1 Introduction
There has been a resurgence of interest in PAC-Bayes bounds, especially towards explain-
ing generalization in large-scale neural networks trained by stochastic gradient descent
(Neyshabur et al., 2017b; Neyshabur et al., 2017a). See also (Bégin et al., 2016; Germain
et al., 2016; Thiemann et al., 2017; Bartlett, Foster, and Telgarsky, 2017b; Raginsky, Rakhlin,
and Telgarsky, 2017; Grünwald and Mehta, 2017; Smith and Le, 2018).
PAC-Bayes bounds control the generalization error of Gibbs classifiers (aka PAC-Bayes
“posteriors”) in terms of the Kullback–Leibler (KL) divergence to a fixed probability measure
(aka PAC-Bayes “prior”) on the space of classifiers. PAC-Bayes bounds depend on a tradeoff between the empirical risk of the posterior Q and a penalty $\frac{1}{m}\mathrm{KL}(Q\|P)$, where P is the prior, fixed independently of the sample $S \in Z^m$ from some space Z of labelled examples. The KL
penalty is typically the largest contribution to the bound and so finding the tightest possible
bound generally depends on minimizing the KL term.
The KL penalty vanishes for Q = P, but typically P, viewed as a randomized (Gibbs)
classifier, has poor performance since it has been chosen independently of the data. On the
other hand, since P is chosen independently of the data, posteriors Q tuned to the data to
achieve minimal empirical risk often bear little resemblance to the data-independent prior P,
causing KL(Q||P) to be large. As a result, PAC-Bayes bounds can be loose or even vacuous.
The problem of excessive KL penalties is not inherent to the PAC-Bayes framework.
Indeed, the PAC-Bayes theorem permits one to choose the prior P based on the distribution D
of the data. However, since D is considered unknown, and our only insight as to D is through
the sample S, this flexibility would seem to be useless, as P must be chosen independently of
S in existing bounds. Nevertheless, it is possible to make progress in this direction, and it is
likely the best way towards tighter bounds and deeper understanding.
Theorem 3.1.1 (Lever, Laviolette, and Shawe-Taylor 2013). Fix τ > 0. For $S \in Z^m$, let $Q(S) = P_{\exp(-\tau \hat{R}_S)}$ be a Gibbs posterior with respect to some base measure P on $\mathbb{R}^p$, where the empirical risk $\hat{R}_S$ is bounded in [0, 1]. For every δ > 0, $m \in \mathbb{N}$, distribution $\mathcal{D}$ on Z,
$$\mathbb{P}_{S \sim \mathcal{D}^m}\left[ \mathrm{kl}\bigl(\hat{R}_S(Q(S)) \,\|\, R_D(Q(S))\bigr) \le \frac{1}{m}\left( \tau\sqrt{\frac{2}{m}\ln\frac{2\sqrt{m}}{\delta}} + \frac{\tau^2}{2m} + \ln\frac{2\sqrt{m}}{\delta} \right) \right] \ge 1 - \delta. \qquad (3.1)$$
The dependence on the data distribution is captured through τ, which is ideally chosen as
small as possible, subject to Q(S) yielding small empirical risk. (One can use a union bound
to tune τ based on S.) The fact that the KL bound does not depend on D, other than through
τ, implies that the bound must be loose for all τ such that there exists a distribution D that
causes Q to overfit with high probability on size m datasets S ∼ D m . In other words, for fixed
τ, the bound is no longer distribution dependent. This would not be important if not for the
following empirical finding: weights sampled according to high values of τ do not overfit on
real data, but they do on data whose labels have been randomized. Thus these bounds are
vacuous in practice when the generalization error is, in fact, small. Evidently, the KL bound
gives up too much.
Our work launches a different attack on the problem of using distribution-dependent
priors. Loosely speaking, if a prior is chosen on the basis of the data, but in a way that is
very stable to perturbations of the data set, then the resulting data-dependent prior should
reflect the underlying data distribution, rather than the data, resulting in a bound that should
still hold, perhaps with smaller probability.
We formalize this intuition using differential privacy (Dwork, 2006; Dwork et al., 2015b).
We show that an ε-differentially private prior mean yields a valid, though necessarily looser, PAC-Bayes bound.
3.2 Other Related Work
Kifer, Smith, and Thakurta (2012, Thm. 1) also establish a “limit” theorem for differential privacy, showing that the almost sure limit of a sequence of mechanisms of the same privacy level is itself a private mechanism of the same privacy level. Our result can be viewed as a significant weakening of the hypothesis: we require only that the weak limit be private, and no element of the sequence need itself be private.
The bounds we establish hold for bounded loss functions and i.i.d. data. Under additional
assumptions, one can obtain PAC-Bayes generalization and excess risk bounds for unbounded
loss with heavy tails (Catoni, 2007; Germain et al., 2016; Grünwald and Mehta, 2016; Alquier
and Guedj, 2018). Alquier and Guedj (2018) also consider non-i.i.d. training data. Our
approach to differentially private data-dependent priors can be readily extended to these
settings.
Definition 3.2.1 (Dwork et al. 2015a, §3). Let β ≥ 0, let X and Y be random variables in arbitrary measurable spaces, and let X′ be independent of Y and equal in distribution to X. The β-approximate max-information between X and Y, denoted $I_\infty^{\beta}(X;Y)$, is the least value k such that, for all product-measurable events E,
$$\mathbb{P}[(X,Y) \in E] \le e^{k}\, \mathbb{P}[(X',Y) \in E] + \beta.$$
For an algorithm $\mathcal{A} : Z^m \rightsquigarrow T$, the β-approximate max-information of $\mathcal{A}$, denoted $I_\infty^{\beta}(\mathcal{A}, m)$, is the least value k such that, for all distributions $\mathcal{D}$ on Z, we have $I_\infty^{\beta}(S; \mathcal{A}(S)) \le k$ when $S \sim \mathcal{D}^m$. The max-information of $\mathcal{A}$ is defined similarly.¹
In Section 3.3.1, we consider the case where the dataset S and a data-dependent prior
P(S) have small approximate max-information. The above definition tells us that we can almost treat the data-dependent prior as if it were chosen independently of S. The following
is the key result connecting pure differential privacy and max-information:
Theorem 3.3.1 (PAC-Bayes; Maurer 2004, Thm. 5). Under bounded loss $\ell \in [0,1]$, for every δ > 0, $m \in \mathbb{N}$, distribution $\mathcal{D}$ on Z, and distribution P on $\mathbb{R}^p$,
$$\mathbb{P}_{S \sim \mathcal{D}^m}\left[ (\forall Q)\ \mathrm{kl}\bigl(\hat{R}_S(Q) \,\|\, R_D(Q)\bigr) \le \frac{\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{m}}{\delta}}{m} \right] \ge 1 - \delta. \qquad (3.3)$$
One can use Pinsker's inequality to obtain a bound on the generalization error $|\hat{R}_S(Q) - R_D(Q)|$; however, this significantly loosens the bound, especially when $\hat{R}_S(Q)$ is close to zero. We refer to the quantity $\mathrm{kl}(\hat{R}_S(Q)\,\|\,R_D(Q))$ as the KL-generalization error. From a bound on this quantity, we can bound the risk as follows: given empirical risk q and a bound on the KL-generalization error c, the risk is bounded by the largest value $p \in [0,1]$ such that $\mathrm{kl}(q\|p) \le c$. See Chapter 2 for a discussion of this computation. When the empirical risk is near zero, the KL-generalization error is essentially the generalization error. As empirical risk increases, the bound loosens and the square root of the KL-generalization error bounds the generalization error.
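A minimal sketch of this inversion by bisection, using the fact that $p \mapsto \mathrm{kl}(q\|p)$ is increasing on [q, 1]; the helper names and the numbers in the example are hypothetical.

```python
import numpy as np

def kl_bernoulli(q, p):
    """kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q, c, tol=1e-9):
    """Largest p in [q, 1] with kl(q || p) <= c, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# Example: empirical risk 2% and a KL-generalization-error bound of 0.05
print(kl_inverse(0.02, 0.05))   # risk bound, roughly 0.10
```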
See Section 3.3.1 for a proof of a more general statement of the theorem and further
discussion. The main innovation here is recognizing the potential to choose data-dependent
priors using private mechanisms. The hard work is done by Theorem 3.2.2: obtaining
differentially private versions of other PAC-Bayes bounds is straightforward.
When one is choosing the privacy parameter, ε, there is a balance between minimizing
the direct contributions of ε to the bound (forcing ε smaller) and minimizing the indirect
contribution of ε through the KL term for posteriors Q that have low empirical risk (forcing
ε larger). The optimal value for ε is often much less than one, which can be challenging to
obtain. We discuss strategies for achieving the required privacy in later sections.
We prove a slightly more general result than the one stated in Theorem 3.3.2.
Theorem 3.3.3. Fix a bounded loss ℓ ∈ [0, 1]. Let m ∈ N, let P : Z m ⇝ M1 (R p ) be an
ε-differentially private data-dependent prior, let D ∈ M1 (Z), and let S ∼ D m . Then, for all
δ ∈ (0, 1) and β ∈ (0, δ ), with probability at least 1 − δ ,
$$\forall Q \in \mathcal{M}_1(\mathbb{R}^p),\quad \mathrm{kl}\bigl(\hat{R}_S(Q) \,\|\, R_D(Q)\bigr) \le \frac{\mathrm{KL}(Q\|P(S)) + \ln\frac{2\sqrt{m}}{\delta - \beta}}{m} + \varepsilon^2/2 + \varepsilon\sqrt{\frac{\ln(2/\beta)}{2m}}. \qquad (3.5)$$
Proof. For every distribution P on $\mathbb{R}^p$, let
$$R(P) = \left\{ S \in Z^m : (\exists Q)\ \mathrm{kl}\bigl(\hat{R}_S(Q) \,\|\, R_D(Q)\bigr) \ge m^{-1}\Bigl( \mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{m}}{\delta'} \Bigr) \right\}. \qquad (3.6)$$
It follows from Theorem 3.3.1 that $\mathbb{P}_{S\sim\mathcal{D}^m}(S \in R(P)) \le \delta'$. Let β > 0. Then, by the definition of approximate max-information, we have
$$\mathbb{P}_{S\sim\mathcal{D}^m}\bigl(S \in R(P(S))\bigr) \le e^{I_\infty^{\beta}(P;m)}\, \mathbb{P}_{(S,S')\sim\mathcal{D}^{2m}}\bigl(S \in R(P(S'))\bigr) + \beta \qquad (3.7)$$
$$\le e^{I_\infty^{\beta}(P;m)}\, \delta' + \beta := \delta. \qquad (3.8)$$
We have $\delta' = e^{-I_\infty^{\beta}(P;m)}(\delta - \beta)$. Therefore, with probability no more than δ over $S \sim \mathcal{D}^m$,
$$\exists Q \in \mathcal{M}_1(\mathbb{R}^p),\quad \mathrm{kl}\bigl(\hat{R}_S(Q) \,\|\, R_D(Q)\bigr) \ge \frac{\mathrm{KL}(Q\|P(S)) + \ln\frac{2\sqrt{m}}{\delta - \beta} + I_\infty^{\beta}(P;m)}{m}. \qquad (3.9)$$
The result follows from replacing the approximate max-information $I_\infty^{\beta}(P;m)$ with the bound provided by Theorem 3.2.2.
The theorem leaves open the choice of β < δ . For any fixed values for ε, m, and δ , it is
easy to optimize β to obtain the tightest possible bound. In practice, however, the optimal bound is almost indistinguishable from that obtained by taking β = δ/2. For the remainder of the chapter, we take this value for β, in which case the r.h.s. of Eq. (3.4) is
$$\frac{\mathrm{KL}(Q\|P(S)) + \ln\frac{4\sqrt{m}}{\delta}}{m} + \varepsilon^2/2 + \varepsilon\sqrt{\frac{\ln(4/\delta)}{2m}}. \qquad (3.10)$$
Note that the bound holds for all posteriors Q. In general the bounds are interesting only
when Q is data dependent, otherwise one can obtain tighter bounds via concentration of
measure results for empirical means of bounded i.i.d. random variables.
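A minimal sketch of evaluating this right-hand side for given values of the KL term, m, δ, and ε; the numbers in the usage line are hypothetical, and the result can be inverted to a risk bound with the kl-inversion sketch given earlier.

```python
import numpy as np

def dp_pacbayes_rhs(kl_qp, m, epsilon, delta=0.05):
    """r.h.s. of Eq. (3.10) (beta = delta/2): a bound on kl(empirical risk || risk).
    kl_qp is KL(Q || P(S)) for the epsilon-differentially private prior P(S)."""
    main = (kl_qp + np.log(4.0 * np.sqrt(m) / delta)) / m
    privacy = 0.5 * epsilon ** 2 + epsilon * np.sqrt(np.log(4.0 / delta) / (2.0 * m))
    return main + privacy

# Example: m = 50000, a KL term of 1000 nats, and epsilon = 0.05
c = dp_pacbayes_rhs(kl_qp=1000.0, m=50000, epsilon=0.05, delta=0.05)
print(c)  # bound on the KL-generalization error
```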
When one is choosing the privacy parameter, ε, there is a balance between minimizing
the direct contributions of ε to the bound (forcing ε smaller) and minimizing the indirect
contribution of ε through the KL term for posteriors Q that have low empirical risk (forcing
ε larger). One approach is to compute the value of ε that achieves a certain bound on the
excess generalization error. In particular, choosing $\varepsilon^2/2 = \alpha$ contributes an additional gap
of α to the KL-generalization error. Choosing α is complicated by the fact that there is a
non-linear relationship between the generalization error and the KL-generalization error,
depending on the empirical risk. A better approach is often to attempt to balance the direct
contribution with the indirect one. Regardless, the optimal value for ε is much less than one,
which can be challenging to obtain. We discuss strategies for achieving the required privacy
in later sections.
3.4 Weak approximations to ε-differentially private priors
One would like to choose a prior that achieves small surrogate risk $\breve{R}_S$, subject to privacy constraints. A natural way to do this is via an exponential mechanism.
We pause to introduce some notation for Gibbs distributions. For a measure P on $\mathbb{R}^p$ and measurable function $g : \mathbb{R}^p \to \mathbb{R}$, let $P[g]$ denote the expectation $\int g(h)\, P(\mathrm{d}h)$ and, provided $P[g] < \infty$, let $P_g$ denote the probability measure on $\mathbb{R}^p$, absolutely continuous with respect to P, with Radon–Nikodym derivative $\frac{\mathrm{d}P_g}{\mathrm{d}P}(h) = \frac{g(h)}{P[g]}$. A distribution of the form $P_{\exp(-\tau g)}$ is generally referred to as a Gibbs distribution with energy function g and inverse temperature τ. In the special case where P is a probability measure, we call $P_{\exp(-\tau \breve{R}_S)}$ a “Gibbs posterior”.
Lemma 3.4.1 (McSherry and Talwar 2007, Thm. 6). Let $q : Z^m \times \mathbb{R}^p \to \mathbb{R}$ be measurable, let P be a σ-finite measure on $\mathbb{R}^p$, let β > 0, and assume $P[\exp(-\beta q(S, \cdot))] < \infty$ for all $S \in Z^m$. Let $\Delta q := \sup_{S,S'} \sup_{w \in \mathbb{R}^p} |q(S, w) - q(S', w)|$, where the first supremum ranges over pairs $S, S' \in Z^m$ that disagree on no more than one coordinate. Let $\mathcal{A} : Z^m \rightsquigarrow \mathbb{R}^p$, on input $S \in Z^m$, output a sample from the Gibbs distribution $P_{\exp(-\beta q(S,\cdot))}$. Then $\mathcal{A}$ is $2\beta\Delta q$-differentially private.
Corollary 3.4.2. Let τ > 0 and let $\breve{R}_S$ denote the surrogate risk, taking values in an interval of length ∆. One sample from the Gibbs posterior $P_{\exp(-\tau \breve{R}_S)}$ is $\frac{2\tau\Delta}{m}$-differentially private.
The proof is straightforward, highlights the role of Q in judging the difference between
P′ and P, and leads immediately to the following corollary (see Section 3.5).
Corollary 3.4.7 (Gaussian means close to private means). Let $m \in \mathbb{N}$, let $\mathcal{D} \in \mathcal{M}_1(Z)$, let $S \sim \mathcal{D}^m$, let $\mathcal{A} : Z^m \rightsquigarrow \mathbb{R}^p$ be ε-differentially private, and let w(S) denote a data-dependent mean vector such that, for some $w^*(S)$ satisfying $P[w^*(S)|S] = P[\mathcal{A}(S)|S]$, we have $\|w(S) - w^*(S)\|_2^2 \le C$ with probability at least $1 - \delta'$. Let $\sigma_{\min}$ be the minimum eigenvalue of Σ. Then, with probability at least $1 - \delta - \delta'$, Eq. (3.4) holds with $\mathrm{KL}(Q\|P(S))$ replaced by $\mathrm{KL}(Q\|N(w(S))) + \frac{1}{2}\, C/\sigma_{\min} + \sqrt{C/\sigma_{\min}}\ \mathbb{E}_{v\sim Q}\, \|v - w(S)\|_{\Sigma^{-1}}$. In particular, for a Gibbs posterior $Q = P_{\exp(-\tau \breve{R}_S)}$, we have $\mathbb{E}_{v\sim Q}\, \|v - w(S)\|_{\Sigma^{-1}} \le \sqrt{2\tau\Delta} + \sqrt{2/\pi}$.
Corollary 3.4.8 (SGLD PAC-Bayes Bound). Consider SGLD sampling the Gibbs posterior $P_{\exp(-\tau \breve{R}_S)}$ with Gaussian P and smooth surrogate risk $\breve{R}_S$ taking values in a length-∆ interval. Assume that for every $C > C' > 0$, there is a choice of step size η and number of SGLD iterations n, such that the n-th iterate $w(S) \in \mathbb{R}^p$ produced by SGLD satisfies $W_2\bigl(P[w(S)|S], P[\mathcal{A}(S)|S]\bigr) \le C'$, where $\mathcal{A}(S)$ is a $\frac{2\tau\Delta}{m}$-differentially private vector distributed according to the Gibbs posterior given S. By Markov's inequality and the definition of $W_2$, there exists $w^*(S)$ as in Corollary 3.4.7 such that, with probability $1 - \delta'$, $\|w(S) - w^*(S)\|_2^2 \le C/\delta'$.
The dependence on δ ′ appears to be poor. However, Markov chain algorithms are often
geometrically ergodic, in which case C decays exponentially fast in the number of Markov
chain steps, allowing one to spend computation to control the 1/δ ′ term.
$$\mathrm{KL}(Q\|P) - \mathrm{KL}(Q\|P') = Q\Bigl[\ln\frac{\mathrm{d}Q}{\mathrm{d}P}\Bigr] - \mathrm{KL}(Q\|P') \qquad (3.12)$$
$$= Q\Bigl[\ln\frac{\mathrm{d}Q}{\mathrm{d}P'} + \ln\frac{\mathrm{d}P'}{\mathrm{d}P}\Bigr] - \mathrm{KL}(Q\|P') \qquad (3.13)$$
$$= \mathrm{KL}(Q\|P') + Q\Bigl[\ln\frac{\mathrm{d}P'}{\mathrm{d}P}\Bigr] - \mathrm{KL}(Q\|P') \qquad (3.14)$$
$$= Q\Bigl[\ln\frac{\mathrm{d}P'}{\mathrm{d}P}\Bigr]. \qquad (3.15)$$
Proof of Lemma 3.4.4. Let $P^*(S)$ satisfy the conditions in the statement of the theorem. Then $P^*(S)$ is ε-differentially private. By Theorem 3.3.2, the bound in Eq. (3.4) holds with probability at least 1 − δ for the data-dependent prior $P^*(S)$ and all posteriors Q. By hypothesis, with probability $1 - \delta - \delta'$, $P_S \ll P^*(S)$, and so, by Lemma 3.4.3, $\mathrm{KL}(Q\|P^*(S)) = \mathrm{KL}(Q\|P_S) + Q\bigl[\ln\frac{\mathrm{d}P_S}{\mathrm{d}P^*(S)}\bigr]$.
Proof of Lemma 3.4.5. Expanding the log ratio of Gaussian densities and then applying Cauchy–Schwarz, we obtain
$$\ln\frac{\mathrm{d}N(w')}{\mathrm{d}N(w)}(v) = \frac{1}{2}\Bigl(\|w - v\|_{\Sigma^{-1}}^2 - \|w' - v\|_{\Sigma^{-1}}^2\Bigr) \qquad (3.16)$$
$$= \langle w' - w, v\rangle_{\Sigma^{-1}} + \frac{1}{2}\Bigl(\|w\|_{\Sigma^{-1}}^2 - \|w'\|_{\Sigma^{-1}}^2\Bigr) \qquad (3.17)$$
$$= \frac{1}{2}\langle w' - w, 2v\rangle_{\Sigma^{-1}} - \frac{1}{2}\langle w' - w, w + w'\rangle_{\Sigma^{-1}} \qquad (3.18)$$
$$= \frac{1}{2}\langle w' - w, 2v - w - 2w' + w'\rangle_{\Sigma^{-1}} \qquad (3.19)$$
$$= \frac{1}{2}\langle w' - w, 2(v - w') + w' - w\rangle_{\Sigma^{-1}} \qquad (3.20)$$
$$\le \frac{1}{2}\|w' - w\|_{\Sigma^{-1}}^2 + \|w' - w\|_{\Sigma^{-1}}\, \|v - w'\|_{\Sigma^{-1}}. \qquad (3.21)$$
The result follows by taking the expectation with respect to $v \sim Q$.
Proof of Lemma 3.4.6. Let $g = \frac{\mathrm{d}Q}{\mathrm{d}P} := \frac{e^{h}}{P[e^{h}]}$. Then $\|g\|_{L^1(P)} = 1$ and $\|g\|_{L^\infty(P)} \le e^{\|h\|_{L^\infty(P)}}$. Let $f(v) = \|v - w\|_{\Sigma^{-1}}$. Then $\mathbb{E}_{v\sim Q}\, \|v - w\|_{\Sigma^{-1}} = \|f\|_{L^1(Q)} = \|fg\|_{L^1(P)}$. Finally, let χ be the indicator function of the ellipsoid $\{v : \|v - w\|_{\Sigma^{-1}} \le R\}$, and let $\bar{\chi} = 1 - \chi$. Then $\|f\chi\|_{L^\infty(P)} \le R$ and
where the inequalities follow from two applications of Hölder's inequality. Choosing $R = \sqrt{2\|h\|_{L^\infty(P)}}$ gives $\|f\|_{L^1(Q)} \le \sqrt{2\|h\|_{L^\infty(P)}} + \sqrt{2/\pi}$.
Proof of Corollary 3.4.7. Let $P_S = N(w(S))$ and $P^*(S) = N(w^*(S))$. By the closure of ε-differential privacy under composition, $P^*(S)$ is ε-differentially private and is absolutely continuous with respect to N(w) for all w, and so satisfies the conditions of Lemma 3.4.4. In particular, with probability 1 − δ, Eq. (3.4) holds with $\mathrm{KL}(Q\|P^*(S))$ replaced by $\mathrm{KL}(Q\|P_S) + Q\bigl[\ln\frac{\mathrm{d}P_S}{\mathrm{d}P^*(S)}\bigr]$.
By Lemma 3.4.6, $\mathbb{E}_{v\sim Q}\, \|v - w(S)\|_{\Sigma^{-1}}$ is bounded, for Gibbs measures based on a surrogate risk taking values in a length-∆ interval, by $\sqrt{2\tau\Delta} + \sqrt{2/\pi}$.
3.6 Empirical studies
Violated bounds would be an obvious sign of trouble. We expect the bound on the classification error not to go below the true error as estimated on the held-out test set (with high probability). We perform an experiment on MNIST (and CIFAR10, with the same conclusion, so we have not included it) using true and random labels and find that no bounds
are violated. The results suggest that it may be useful to empirically study bounds for Gibbs
classifiers using SGLD.
Our main focus is a synthetic experiment comparing the bounds of Lever, Laviolette, and
Shawe-Taylor (2013) to our new bounds based on privacy. The main finding here is that, as
expected, the bounds by Lever, Laviolette, and Shawe-Taylor must explode when the Gibbs
classifier begins to overfit random labels, whereas our bounds, on true labels, continue to
track the training error and bound the test error.
3.6.1 Setup
Our focus is on classification by neural networks into K classes. Thus Z = X × [K].
We use neural networks that output probability vectors over these K classes. Given
weights w ∈ R p and input x ∈ X, the probability vector output by the network is p(w, x) ∈
[0, 1]K . Networks are trained by minimizing cross entropy loss: ℓ(w, (x, y)) = g(p(w, x), y),
where g((p1 , . . . , pK ), y) = − ln py . Note that cross entropy loss is merely bounded below.
We report results in terms of {0, 1}-valued classification error: $\ell(w, (x, y)) = 0$ if and only if p(w, x) takes its maximum value only at coordinate y, and $\ell(w, (x, y)) = 1$ otherwise.
We sometimes refer to elements of R p and M1 (R p ) as classifiers and randomized
classifiers, respectively, and refer to the (empirical) 0–1 risk as the (empirical) error. We train
two different architectures using SGLD on MNIST and a synthetic dataset, SYNTH. The
experimental setup is explained in Section 3.6.1.
One-stage training procedure We run SGLD for T training epochs with a fixed value of
the parameter τ. We observe that convergence appears to occur within 10 epochs, but use a
much larger number of training epochs to potentially expose nonconvergence behavior. The
value of the inverse temperature τ is fixed during the whole training procedure.
Two-stage training procedure In order to evaluate our private PAC-Bayes bound (Theo-
rem 3.3.2), we perform a two-stage training procedure:
• Stage One. We run SGLD for T1 epochs with inverse temperature τ1 , minimizing the
standard cross entropy objective. Let w0 denote the neural network weights after stage
one.
• Transition. We restart the learning rate schedule and continue SGLD for $T_1$ epochs while linearly annealing the inverse temperature between $\tau_1$ and $\tau_2$, i.e., $\tau_t = \bigl((t - T_1)\tau_2 + (2T_1 - t)\tau_1\bigr)/T_1$, where t is the current epoch number. The objective at w is the cross entropy loss for w plus a weight-decay term $\frac{\gamma}{2}\|w - w_0\|_2^2$.
• Stage Two. We continue SGLD for T2 − T1 epochs with inverse temperature τ2 . The
objective is the same as in the transition stage.
During the first stage, the k-step transitions of SGLD converge weakly towards a Gibbs
distribution with a uniform base measure, producing a random vector w0 ∈ R p . The private
data-dependent prior $P_{w_0}$ is the Gaussian distribution centred at $w_0$ with diagonal covariance $\frac{1}{\gamma} I_p$.
During the second stage, SGLD converges to the Gibbs posterior with a Gaussian base
measure Pw0 , i.e., Qτ2 = Pexp(−τ2 R̂S ) .
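A minimal sketch of the resulting inverse-temperature schedule, with the stage boundaries implied by the description above; the value used for $\tau_2$ in the example is hypothetical.

```python
def inverse_temperature(t, T1, tau1, tau2):
    """Inverse temperature at epoch t: stage one (t <= T1) uses tau1, the transition
    (T1 < t <= 2*T1) anneals linearly from tau1 to tau2, and stage two uses tau2."""
    if t <= T1:
        return tau1
    if t <= 2 * T1:
        return ((t - T1) * tau2 + (2 * T1 - t) * tau1) / T1
    return tau2

# Example with the SYNTH settings reported below (T1 = 100, tau1 = 1):
for t in (1, 100, 150, 200, 500):
    print(t, inverse_temperature(t, T1=100, tau1=1.0, tau2=100.0))
```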
Bound calculation Our experiments evaluate different values of the inverse temperature τ.
We evaluate Lever bounds for the randomized classifier Qτ obtained by the one-stage
training procedure, with T = 1000. We do so on both the MNIST and SYNTH datasets.
We also evaluate our private PAC-Bayes bound (Theorem 3.3.2) for the randomized
classifier Qτ2 and the private data-dependent prior Pw0 , where the privacy parameter depends
on τ1 . The bound depends on the value of the KL(Qτ1 ||Pw0 ). The challenges of estimating this
term are described in Section 3.6.1. We only evaluate the differentially private PAC-Bayes
bounds on the small neural network and the SYNTH dataset.
The parameter settings for the SYNTH experiments are $T_1 = 100$, $T_2 = 1000$, γ = 2; for MNIST, $T_1 = 500$, $T_2 = 1000$, γ = 5. When evaluating Lever bounds with a one-stage learning procedure, for either dataset, T = 1000.
Bounded loss While it is typical to train neural networks by minimizing cross entropy, this
loss is unbounded and our theory is developed only for bounded loss. We therefore work
with a bounded version of cross-entropy loss, which we obtain by preventing the network
from producing extreme probabilities near zero and one. We describe our modification of the
cross entropy in Section 3.6.1.
Datasets We use two datasets. The first is MNIST, which consists of handwritten digit
images with labels in {0, ..., 9}. The dataset contains 50,000 training images and 10,000
validation images.
Architectures We train with SGLD without any standard modifications (such as momentum or batch norm), to ensure that the stationary distribution of SGLD is the intended one. For MNIST, we use a
fully connected neural network architecture. The network has 3 layers and 600 units in each
hidden layer. The input is a 784 dimensional vector and the output layer has 10 units. For the
SYNTH dataset, we use a fully connected neural network with 1 hidden layer consisting of
100 units. The input layer has 4 units, and the output layer is a single unit.
Learning rate At epoch t, the learning rate is $a_t = a_0 \cdot t^{-b}$, where $a_0$ is the initial learning rate and b is the decay rate. We set b = 0.5 and use $a_0 = 10^{-5}$ for MNIST experiments and $a_0 = 10^{-3}$ for SYNTH experiments.
Minibatches An epoch refers to a full pass through the data in minibatches of size 128 for the MNIST data, and 10 for the SYNTH data.
In order to achieve differential privacy, we work with a bounded version of the cross entropy loss. The problem is associated with extreme probabilities near zero and one. Our solution is to remap the probabilities $p \mapsto \psi(p)$, where
$$\psi(p) = e^{-L_{\max}} + (1 - 2e^{-L_{\max}})\, p$$
is an affine transformation that maps [0, 1] to $[e^{-L_{\max}}, 1 - e^{-L_{\max}}]$, removing extreme probability values. Cross entropy loss is then replaced by $g((p_1, \ldots, p_K), y) = -\ln \psi(p_y)$. As a result, cross entropy loss is contained in the interval $[0, L_{\max}]$. We take $L_{\max} = 4$ in our experiments.
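A minimal sketch of this bounded loss, assuming the increasing affine map written above and $L_{\max} = 4$.

```python
import numpy as np

L_MAX = 4.0

def psi(p):
    """Increasing affine map sending [0, 1] onto [exp(-L_MAX), 1 - exp(-L_MAX)]."""
    a = np.exp(-L_MAX)
    return a + (1.0 - 2.0 * a) * p

def bounded_cross_entropy(probs, y):
    """Bounded cross-entropy loss: -ln(psi(p_y)), which lies in (0, L_MAX]."""
    return -np.log(psi(probs[y]))

# The loss is L_MAX when the network puts probability 0 on the true class,
# and close to 0 when it puts probability 1 on it.
print(bounded_cross_entropy(np.array([0.0, 1.0]), y=0))  # = L_MAX = 4.0
print(bounded_cross_entropy(np.array([0.0, 1.0]), y=1))  # ~ 0.0185
```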
For a given PAC-Bayes prior P and dataset S, it is natural to ask which posterior Q = Q(S)
minimizes the PAC-Bayes bounds. In general, some Gibbs posterior (with respect to P) is
the minimizer. We now introduce the Gibbs posterior and discuss how we can compute the
term KL(Q||P) in the case of Gibbs posteriors.
For a σ-finite measure P over $\mathbb{R}^p$ and function $g : \mathbb{R}^p \to \mathbb{R}$, let $P[g]$ denote the expectation $\int g(h)\, P(\mathrm{d}h)$ and, provided $P[g] < \infty$, let $P_g$ denote the probability measure on $\mathbb{R}^p$, absolutely continuous with respect to P, with Radon–Nikodym derivative $\frac{\mathrm{d}P_g}{\mathrm{d}P}(h) = \frac{g(h)}{P[g]}$. A distribution of the form $P_{\exp(-\tau g)}$ is generally referred to as a Gibbs distribution. A Gibbs posterior is a probability measure of the form $P_{\exp(-\tau \hat{R}_S)}$ for some constant τ > 0.
The challenge of evaluating PAC-Bayes bounds for Gibbs posteriors is computing the KL term. We now describe a classical estimate and show that, with high probability, it yields an upper bound. Fix a prior P and τ ≥ 0, let $Q_\tau = P_{\exp(-\tau \hat{R}_S)}$, and let $Z_\tau = P[\exp(-\tau \hat{R}_S)]$. Then
$$\mathrm{KL}(Q_\tau\|P) = Q_\tau\Bigl[\ln\frac{\mathrm{d}Q_\tau}{\mathrm{d}P}\Bigr] \qquad (3.27)$$
$$= Q_\tau\Bigl[\ln\frac{\exp(-\tau \hat{R}_S)}{Z_\tau}\Bigr] \qquad (3.28)$$
$$= -\tau\, Q_\tau[\hat{R}_S] - \ln Z_\tau. \qquad (3.29)$$
(The quantity within the expectation on the r.h.s. thus defines an unbiased estimator of $Q_\tau[\hat{R}_S]$.) In the ideal case, the samples are independent, and then the variance decays at an $n^{-1}$ rate. In practice, it is often difficult to even sample from $Q_\tau$ for high values of τ. Indeed, using this approach, we would generally overestimate the risk, which means that we do not obtain an upper bound on the KL term. So instead, we use the approximation $-\tau Q_\tau[\hat{R}_S] \approx 0$; since this term is nonpositive, the approximation can only increase the estimate. Despite this, we obtain nonvacuous bounds.
The second term is challenging to estimate accurately, even assuming that P and $Q_\tau$ can be efficiently simulated. One tack is to consider i.i.d. samples $V_1, \ldots, V_n \sim P$, and note that
$$-\ln Z_\tau = -\ln P[\exp(-\tau \hat{R}_S)] = -\ln \mathbb{E}\Bigl[\frac{1}{n}\sum_{i=1}^{n} \exp(-\tau \hat{R}_S(V_i))\Bigr] \qquad (3.31)$$
$$\le \mathbb{E}\Bigl[-\ln \frac{1}{n}\sum_{i=1}^{n} \exp(-\tau \hat{R}_S(V_i))\Bigr], \qquad (3.32)$$
where the inequality follows from an application of Jensen’s inequality. The quantity within
the expectation on the r.h.s. thus forms an upper bound, and indeed, it is possible to show that it falls below the l.h.s. by more than ε only with probability exponentially small in ε. Thus we have a high-probability (near) upper bound on the term in the KL. One might be inclined to compute a normalized importance sampling estimate instead, but since Q cannot be effectively sampled, one does not obtain an upper bound with high probability.
The term ln Zτ is a generalized log marginal likelihood, which, in our experiments, we
approximate by sampling from a Gaussian distribution P. Numerical integration techniques
rapidly diminish in accuracy with increasing dimensionality of the parameter space.
Note that, due to the convexity of the exponential, samples $W_i \sim P$ for which $\hat{R}_S(W_i)$ is close to zero will dominate $Z_\tau$. Due to the high dimensionality of the neural network parameter space, with high probability a random sample $W_i$ from P will not lie near a minimum of the empirical loss surface and therefore $\hat{R}_S(W_i)$ will be high. As a result, in our experiments we obtain a very loose upper bound on the KL.
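A minimal sketch of this estimator, assuming access to prior samples and a hypothetical `empirical_risk` callback; the log-sum-exp form is used only for numerical stability and does not change the value.

```python
import numpy as np

def neg_log_Z_upper_bound(prior_samples, empirical_risk, tau):
    """High-probability (near) upper bound on -ln Z_tau from Eq. (3.32):
    -ln( (1/n) * sum_i exp(-tau * R_hat_S(V_i)) ) for i.i.d. prior samples V_i."""
    risks = np.array([empirical_risk(v) for v in prior_samples])
    a = -tau * risks
    log_mean = np.logaddexp.reduce(a) - np.log(len(a))   # log-sum-exp, then divide by n
    return -log_mean

def kl_upper_bound(prior_samples, empirical_risk, tau):
    """Estimate of KL(Q_tau || P) per Eq. (3.29), using the approximation
    -tau * Q_tau[R_hat_S] ~ 0 discussed above (this keeps the estimate an upper bound)."""
    return neg_log_Z_upper_bound(prior_samples, empirical_risk, tau)

# Toy usage: a prior over 1-d "weights" and a risk that is small only near the origin.
rng = np.random.default_rng(0)
samples = rng.standard_normal(10000)
risk = lambda w: float(np.clip(np.abs(w), 0.0, 1.0))
print(kl_upper_bound(samples, risk, tau=50.0))
```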
3.6.2 Results
Results are presented in Figs. 3.1 and 3.2. We never observe a violation of the PAC-Bayes
bounds for Gibbs distributions. This suggests either that our assumption that SGLD has nearly converged is accurate enough, or that the bounds are sufficiently loose that any effect of nonconvergence was masked.
Our MNIST experiments highlight that the Lever bounds must also upper bound the risk
for every possible data distribution, including the random label distribution. In the random
label experiment (right plot in Fig. 3.1), we observe that when τ gets close to the number
of training samples, the generalization error starts increasing steeply. This phase transition
is captured by the Lever bound. In the true label experiment (left plot), the generalization error does not rise with τ. Indeed, it continues to decrease, and so the Lever bound quickly
becomes vacuous as we increase τ. The Lever bound cannot capture this behavior because it
must capture also the behavior under random labels.
On the SYNTH dataset, we see the same phase transition under random labels and so
Lever bounds remain vacuous after this point. In contrast, we see that our private PAC-Bayes
bounds can track the error beyond the phase transition that occurs under random labels. (See
Fig. 3.2.) At high values of τ, our KL upper bound becomes very loose.
Private versus Lever PAC-Bayes bound While the Lever PAC-Bayes bound fails to explain generalization for high τ values, our private PAC-Bayes bound may remain nonvacuous. This is due to the fact that it retains the KL term, which is sensitive to the data distribution via Q, and thus it can be much lower than the upper bound on the KL in Lever, Laviolette, and Shawe-Taylor (2013) for datasets with small true Bayes risk. Two-stage optimization, inspired by the DP PAC-Bayes bound, allows us to obtain more accurate classifiers by setting a higher inverse temperature parameter, $\tau_2$, at the second stage.
We do not plot the DP PAC-Bayes bound for MNIST experiments due to the computational challenges of approximating the KL term for a high-dimensional parameter space, as discussed in Section 3.6.1. We evaluate our private PAC-Bayes bound on the MNIST dataset only for a combination of $\tau_1 = 10^3$ and $\tau_2 \in \{3\times10^3, 3\times10^4, 10^5, 3\times10^5\}$. The values are chosen such that $\tau_1$ gives a small penalty for using the data to learn the prior and $\tau_2$ is chosen such that at $\tau = \tau_2$ Lever's bound is vacuous (as seen in Fig. 3.1). We use $10^5$ samples from the DP Gaussian prior learnt in stage one to approximate the log Z term that appears in the KL, as defined in Eq. (3.31).
The results are presented in the table below. While the DP PAC-Bayes bound is very loose, it is still smaller than Lever's bound for high values of the inverse temperature.
Note that for smaller values of $\tau_2$, we can use Lever's upper bound on the KL term instead of performing a Monte Carlo approximation. Since $\tau_1$ is small and adds only a small penalty (∼ 1%), the DP PAC-Bayes bound is equal to Lever's bound plus a differential privacy penalty (∼ 1%).
Fig. 3.1 Results for a fully connected neural network trained on the MNIST dataset with SGLD and a fixed value of τ. We vary τ on the x-axis. The y-axis shows the average 0–1 loss. We plot the estimated generalization gap, which is the difference between the training and test errors. The left plot shows the results for the true label dataset. We observe that the training error converges to zero as τ increases. Further, while the generalization error increases for intermediate values of τ ($10^4$ to $10^6$), it starts dropping again as τ increases. We see that the Lever bound fails to capture this behaviour due to its monotonic increase with τ. The right hand side plot shows the results for a classifier trained on a random labelling of MNIST images. The true error is around 0.9. For small values of τ (under $10^3$) the network fails to learn and the training error stays at around 0.9. When τ exceeds the number of training points, the network starts to overfit heavily. The sharp increase of the generalization gap is predicted by the Lever bound.
3.7 Discussion
In this chapter we presented a valid PAC-Bayes bound that allows one to learn a prior from
the training data. Our proposed bound works by requiring the prior to be differentially private.
A prior learned with a differentially private algorithm can approximate a data-distribution-
dependent prior. While the data dependency of the prior adds an additional penalty term,
it may also allow us to get tighter generalization bounds by decreasing the KL term in the
PAC-Bayes bound.
Achieving pure differential privacy is difficult. For example, sampling from an exponential mechanism is intractable. To achieve bounds for practical algorithms, we also derive a bound that is valid for algorithms that are weak approximations of the exponential mechanism. This result is of practical importance and allows us to use SGLD, a popular variant of SGD with stronger privacy guarantees.
In our empirical evaluation, we use SGLD to learn a PAC-Bayes prior. We empirically
evaluate a two-stage learning procedure, which consists of initially running SGLD at high
[Fig. 3.2 panels: legend entries are Gibbs Empirical Risk; Gibbs Risk; Generalization Error; DP PAC-Bayes bound on risk; DP PAC-Bayes bound on Generalization Error.]
Fig. 3.2 Results for a small fully connected neural network trained on a synthetically generated
dataset SYNTH, consisting of 50 training examples. The x-axis shows the τ value, and the
y-axis the average 0–1 loss. To generate the top plots, we train the network with a one-stage
SGLD. The top-left plot corresponds to the true label dataset, and the top-right to the random
label dataset. As in the MNIST experiments, we do not witness any violation of the Lever bounds. Once again, we notice that the Lever bound gets very loose for larger values of τ in the true label case. The bottom plot shows the results for the two-stage SGLD. In this case the x-axis shows the τ value used in the second-stage optimization. The first stage
used τ1 = 1. The network is trained on true labels. We see that the differentially private
PAC-Bayes bound yields a much tighter estimate of the generalization gap for larger values
of τ than the Lever bound (top left). When τ becomes very large relative to the amount of
training data, it becomes more difficult to sample from the Gibbs posterior. This results in a
looser upper bound on the KL divergence between the prior and posterior.
temperature to learn the prior and then running SGLD at low temperature to obtain a
posterior. Our results demonstrate the advantage of such two-stage training over learning at
high temperature only, and provide evidence that our bounds can sometimes achieve much
stronger guarantees than standard PAC-Bayes bounds, as well as those of Lever et al.
Chapter 4
Entropy-SGD optimizes the prior of a PAC-Bayes bound
4.1 Introduction
Optimization is central to much of machine learning, but generalization is the ultimate goal.
Despite this, the generalization properties of many optimization-based learning algorithms
are poorly understood. The standard example is stochastic gradient descent (SGD), one of the
workhorses of deep learning, which has good generalization performance in many settings,
even under overparametrization (Neyshabur, Tomioka, and Srebro, 2014), but rapidly overfits
in others (Zhang et al., 2017). Can we develop high performance learning algorithms with
provably strong generalization guarantees? Or is there a limit?
In this chapter, we study an optimization algorithm called Entropy-SGD (Chaudhari
et al., 2017), which was designed to outperform SGD in terms of generalization error when
optimizing an empirical risk. Entropy-SGD minimizes an objective f : R p → R indirectly by
performing (approximate) stochastic gradient ascent on the so-called local entropy
prior. This connection between local entropy and PAC-Bayes follows from a result due to
Catoni (2007, Lem. 1.1.3) in the case of bounded risk. (See Theorem 4.4.1.) In the special
case where f is the empirical cross entropy, the local entropy is literally a Bayesian log
marginal density. The connection between minimizing PAC-Bayes bounds under log loss
and maximizing log marginal densities is the subject of recent work by Germain et al. (2016).
Similar connections have been made by Zhang (2006a), Zhang (2006b), Grünwald (2012),
and Grünwald and Mehta (2016).
Despite the connection to PAC-Bayes, as well as theoretical results by Chaudhari et al.
suggesting that Entropy-SGD may be more stable than SGD, we demonstrate that Entropy-
SGD (and its corresponding Gibbs posterior) can rapidly overfit, just like SGD. We identify
two changes, motivated by theoretical analysis, that suffice to control generalization error,
and thus prevent overfitting.
The first change relates to the stability of optimizing the prior mean. The PAC-Bayes
theorem requires that the prior be independent of the data, and so by optimizing the prior
mean, Entropy-SGD invalidates the bound. Indeed, the bound does not hold empirically.
While a PAC-Bayes prior may not be chosen based on the data, it can depend on the data
distribution. This suggests that if the prior depends only weakly on the data, it may be
possible to derive a valid bound.
Indeed, in Chapter 3 we formalized this idea using differential privacy (Dwork, 2006;
Dwork et al., 2015b) under the assumption of bounded risk. Using existing results connecting
statistical validity and differential privacy (Dwork et al., 2015b, Thm. 11), we show that an
ε-differentially private prior yields a valid, though looser, PAC-Bayes bound.
Achieving strong differential privacy can be computationally intractable. Motivated by
this obstruction, in Chapter 3 we relax the privacy requirement in the case of Gaussian
PAC-Bayes priors parameterized by their mean vector. We show that one need only sample a
sequence that converges in distribution to the output of a differentially private mechanism.
This allows one to use stochastic gradient Langevin dynamics (SGLD; Welling and Teh,
2011), which is known to converge weakly to its target distribution, under regularity condi-
tions. We will refer to the Entropy-SGD algorithm as Entropy-SGLD when the SGD step on local entropy is replaced by SGLD.
The one hurdle to using data-dependent priors learned by SGLD is that we cannot easily
measure how close we are to converging. Rather than abandoning this approach, we take
two steps: First, we run SGLD far beyond the point where it appears to have converged
visually. Second, we assume convergence, but then view/interpret the bounds as being
optimistic. In effect, these two steps allow us to see the potential and limitations of using
private data-dependent priors to study Entropy-SGLD.
We find that the resulting PAC-Bayes bounds are remarkably tight but still conservative in our experiments. On MNIST, when the privacy of Entropy-SGLD is tuned to contribute no more than $2\varepsilon^2 \times 100 \approx 0.2\%$ to the generalization error, the test error of the learned network is 3–8%, which is approximately 5–10 times higher than the state of the art (between 0.2% and 1% for MNIST), although the community has almost certainly overfit to MNIST.
The second change pertains to the stability of the stochastic gradient estimate made on each iteration of Entropy-SGD. This estimate is made using SGLD. (Hence Entropy-SGD is SGLD within SGD.) Chaudhari et al. make a subtle but critical modification to the noise term in the SGLD update: the noise is divided by a factor that ranges from $10^3$ to $10^4$. (This factor was ostensibly tuned to produce good empirical results.) Our analysis shows that, as a result of this modification, the Lipschitz constant of the objective function is approximately $10^6$–$10^8$ times larger, and the conclusion that the Entropy-SGD objective is smoother than the original risk surface no longer stands. This change to the noise also negatively impacts the differential privacy of the prior mean. Working backwards from the desire to obtain tight generalization bounds, we are led to divide the SGLD noise by a factor of only $\sqrt[4]{m}$, where m is the number of data points. (For MNIST, $\sqrt[4]{m} \approx 16$.) The resulting bounds are nonvacuous and tighter than those we obtained in Chapter 2, although it must be emphasized that the bounds are optimistic because we assume SGLD has converged. The extent to which it has not converged would inflate the bound.
We begin by introducing sufficient background so that we can make a formal connec-
tion between local entropy and PAC-Bayes bounds. We discuss additional related work
in Section 4.2. We then introduce several existing learning bounds that use differential
privacy, including the PAC-Bayes bounds outlined above that use data-dependent priors. In
Section 4.6, we present experiments on MNIST and CIFAR10, which provide evidence for
our theoretical analysis. We close with a short discussion.
4.2 Related work
(Achille and Soatto (2017) arrive at a similar objective from the information bottleneck perspective.) In contrast, our work shows that Entropy-SGD implicitly uses the optimal (Gibbs) posterior, and then optimizes the resulting PAC-Bayes bound with respect to the prior.
Our work also relates to renewed interest in nonvacuous generalization bounds (Langford,
2002; Langford and Caruana, 2002b), i.e., bounds on the numerical difference between the
unknown classification error and the training error that are (much) tighter than the tautological
upper bound of one. In Chapter 2, nonvacuous generalization bounds are demonstrated for
MNIST. (Their algorithm can be viewed as variational dropout (Kingma, Salimans, and
Welling, 2015) or information dropout (Achille and Soatto, 2018), with a proper data-
independent prior but without local reparametrization.) Their work builds on core insights
by Langford and Caruana (2002b), who computed nonvacuous bounds for neural networks roughly five orders of magnitude smaller. Our work shows that Entropy-SGLD yields generalization bounds, though the value of these bounds depends on the degree to which SGLD is allowed to converge. Under the optimistic assumption that convergence has been achieved, we see that the resulting bounds are tighter than those computed in Chapter 2.
Our analysis of Entropy-SGLD exploits results in differential privacy (Dwork, 2008b)
and its connection to generalization (Dwork et al., 2015b; Dwork et al., 2015a; Bassily
et al., 2016; Oneto, Ridella, and Anguita, 2017). Entropy-SGLD is related to differentially
private empirical risk minimization, which is well studied, both in the abstract (Chaudhuri,
Monteleoni, and Sarwate, 2011; Kifer, Smith, and Thakurta, 2012; Bassily, Smith, and
Thakurta, 2014) and in the particular setting of private training via SGD (Bassily, Smith,
and Thakurta, 2014; Abadi et al., 2016). Given the connection between Gibbs posteriors
and Bayesian posteriors, Entropy-SGLD also relates to the differential privacy of Bayesian
and approximate sampling algorithms (Mir, 2013; Bassily, Smith, and Thakurta, 2014;
Dimitrakakis et al., 2014; Wang, Fienberg, and Smola, 2015; Minami et al., 2016).
Finally, the local entropy should not be confused with the smoothed risk surface obtained
by convolution with a Gaussian kernel: in that case, every point on this surface represents
the risk of a randomized classifier, obtained by perturbing the parameters according to a
Gaussian distribution. (This type of smoothing relates to the approach in Chapter 2.) The
local entropy also relates to a Gaussian perturbation, but the perturbation is either accepted
or rejected based upon its relative performance (as measured by the exponentiated loss)
compared with typical perturbations. Thus the local entropy perturbation concentrates on
regions of weight space with low empirical risk, provided they have sufficient probability
mass under the distribution of the random perturbation.
4.3 Preliminaries: Entropy-SGD
where
$$F_{\gamma,\tau}(w; S) = \log \int_{\mathbb{R}^p} \underbrace{\exp\Bigl(-\tau \hat{R}_S(w') - \tau\frac{\gamma}{2}\|w' - w\|_2^2\Bigr)}_{g^{w,S}_{\gamma,\tau}(w')}\, \mathrm{d}w'. \qquad (4.2)$$
The objective $F_{\gamma,\tau}(\cdot\,; S)$ is known as the local entropy, and can be viewed as the log partition function of the unnormalized probability density function $g^{w,S}_{\gamma,\tau}$. (We will denote the corresponding distribution by $G^{w,S}_{\gamma,\tau}$.) Assuming that one can exchange differentiation and
integration, it is straightforward to verify that
and then the local entropy Fγ,τ (·; S) is differentiable, even if the empirical risk R̂S is not.
Indeed, Chaudhari et al. show that the local entropy and its derivative are Lipschitz. Chaudhari
et al. argue informally that maximizing the local entropy leads to “flat minima” in the
empirical risk surface, which several authors (Hinton and Camp, 1993; Hochreiter and
Schmidhuber, 1997; Baldassi et al., 2015; Baldassi et al., 2016) have argued is tied to
good generalization performance (though none of these papers gives generalization bounds,
vacuous or otherwise). Chaudhari et al. propose a Monte Carlo estimate of the gradient, $\nabla_w F_{\gamma,\tau}(w; S) \approx \tau\gamma(w - \mu_L)$, with $\mu_1 = w_1$ and $\mu_{j+1} = \alpha w'_j + (1-\alpha)\mu_j$, where $w'_1, w'_2, \ldots$ are (approximately) i.i.d. samples from $G^{w,S}_{\gamma,\tau}$ and $\alpha \in (0,1)$ defines a weighted average. Obtaining samples from $G^{w,S}_{\gamma,\tau}$ may be difficult when the dimensionality of the weight vector
is large. Chaudhari et al. use Stochastic Gradient Langevin Dynamics (SGLD; Welling and
Teh, 2011), which simulates a Markov chain whose long-run distribution converges to $G^{w,S}_{\gamma,\tau}$,¹ and requires that the empirical risk be differentiable. The final output of Entropy-SGD is
the deterministic predictor corresponding to the final weights w∗ achieved by several epochs
of optimization.
Algorithm 2 gives a complete description of the stochastic gradient step performed by Entropy-SGD. If we rescale the learning rate, $\eta' \leftarrow \frac{1}{2}\eta'\tau$, lines 6 and 7 are equivalent to
6: $\mathrm{d}w' \leftarrow -\frac{1}{K}\sum_{i=1}^{K} \nabla_{w'}\, \ell(w', z_{j_i}) - \gamma(w' - w)$
7: $w' \leftarrow w' + \eta'_i\, \mathrm{d}w' + \sqrt{\eta'_i}\,\sqrt{2/\tau}\; N(0, I_p)$
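A minimal sketch of this rescaled inner loop and the resulting gradient estimate $\tau\gamma(w - \mu)$; `grad_minibatch_loss` is a hypothetical callback returning the average gradient over a freshly sampled minibatch, and the default step count, step size, and averaging weight follow the footnote below.

```python
import numpy as np

def inner_sgld_step(w_prime, w, grad_minibatch_loss, gamma, tau, eta, rng):
    """One SGLD step of the Entropy-SGD inner loop (rescaled lines 6 and 7): drift towards
    low minibatch loss and towards the outer iterate w, plus noise scaled by sqrt(eta)*sqrt(2/tau)."""
    dw = -grad_minibatch_loss(w_prime) - gamma * (w_prime - w)
    noise = np.sqrt(eta) * np.sqrt(2.0 / tau) * rng.standard_normal(w_prime.shape)
    return w_prime + eta * dw + noise

def local_entropy_grad_estimate(w, grad_minibatch_loss, gamma, tau,
                                n_steps=20, eta=0.2, alpha=0.75, seed=0):
    """Monte Carlo estimate tau * gamma * (w - mu) of the local-entropy gradient,
    with mu an exponential moving average of the inner SGLD iterates."""
    rng = np.random.default_rng(seed)
    w_prime, mu = w.copy(), w.copy()
    for _ in range(n_steps):
        w_prime = inner_sgld_step(w_prime, w, grad_minibatch_loss, gamma, tau, eta, rng)
        mu = alpha * w_prime + (1.0 - alpha) * mu
    return tau * gamma * (w - mu)
```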
Notice that the noise term is multiplied by a factor of $\sqrt{2/\tau}$. A multiplicative factor ε (called the “thermal noise”, but playing exactly the same role as $\sqrt{2/\tau}$ here) appears in the original description of the Entropy-SGD algorithm given by Chaudhari et al. However,
1 Chaudhari et al. take L = 20 steps of SGLD, using a constant step size $\eta'_j = 0.2$ and weighting α = 0.75.
ε does not appear in the definition of local entropy used in their stability analysis. Our derivation highlights that scaling the noise term in the SGLD update has a profound effect: the thermal noise exponentiates the density that defines the local entropy. The smoothness analysis of Entropy-SGD does not take into consideration the role of ε, which is critical because Chaudhari et al. take ε to be as small as $10^{-3}$ and $10^{-4}$. Indeed, the conclusion that the local entropy surface is smoother no longer holds. We will see that τ controls the differential privacy and thus the generalization error of Entropy-SGD.
4.4 Maximizing local entropy minimizes a PAC-Bayes bound
Theorem 4.4.1 (Maximizing local entropy optimizes a PAC-Bayes bound's prior). Assume the loss takes values in an interval of length $L_{\max}$ and let $\tau = \frac{m}{\lambda L_{\max}}$ for some λ > 1/2. Then the set of weights w maximizing the local entropy $F_{\gamma,\tau}(w; S)$ equals the set of weights w minimizing the right hand side of Theorem 1.2.2 for $Q = G^{w,S}_{\gamma,\tau} = P_{\exp(-\tau \hat{R}_S)}$ and P a multivariate normal distribution with mean w and covariance matrix $(\tau\gamma)^{-1} I_p$.
Proof of Theorem 4.4.1. Let m, δ , D, and P be as in Theorem 1.2.2 and let S ∼ D m . The
linear PAC-Bayes bound (Theorem 1.2.2) ensures that for any fixed λ > 1/2 and bounded
loss function, with probability at least 1 − δ over the choice of S, the bound
(1 − 1/(2λ)) · (m/(λ Lmax)) · RD(Q) ≤ (m/(λ Lmax)) · R̂S(Q) + KL(Q||P) + g(δ)
holds for all Q ∈ M1 (R p ). Minimizing the upper bound on RD (Q) is equivalent to the
problem
inf_Q { Q[r] + KL(Q||P) }   (4.4)
with r(h) = (m/(λ Lmax)) R̂S(h). By (Catoni, 2007, Lem. 1.1.3), for all Q ∈ M1(R^p) with KL(Q||P) <
∞,
− log P[exp(−r)] = Q[r] + KL(Q||P) − KL(Q||Pexp(−r) ). (4.5)
By the nonnegativity of the Kullback–Leibler divergence, the infimum is achieved when the KL term is zero, i.e., when Q = P_{exp(−r)}. Then
(1 − 1/(2λ)) · (m/(λ Lmax)) · RD(P_{exp(−r)}) ≤ − log P[exp(−r)] + g(δ).
Finally, it is plain to see that Fγ,τ(w; S) = C + log P[exp(−r)] when C = (p/2) log(2π(τγ)^{−1}) is a constant, τ = m/(λ Lmax), and P = N(w, (τγ)^{−1} I_p) is a multivariate normal with mean w and covariance matrix (τγ)^{−1} I_p.
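The identity Fγ,τ(w; S) = C + log P[exp(−r)] can also be checked numerically. The sketch below compares the two sides in one dimension (p = 1) by quadrature; the stand-in risk function and the values of τ, γ, and w are arbitrary illustrative choices.

```python
import numpy as np
from scipy import integrate

tau, gamma, w = 50.0, 0.1, 0.3                    # illustrative values (p = 1)
risk = lambda v: np.sin(3.0 * v) ** 2             # arbitrary stand-in for the empirical risk

# Left-hand side: the local entropy of Eq. (4.2).
Z, _ = integrate.quad(lambda v: np.exp(-tau * risk(v) - 0.5 * tau * gamma * (v - w) ** 2), -50, 50)
F = np.log(Z)

# Right-hand side: C + log P[exp(-r)] with r = tau * risk and P = N(w, (tau*gamma)^{-1}).
C = 0.5 * np.log(2.0 * np.pi / (tau * gamma))
sigma = np.sqrt(1.0 / (tau * gamma))
prior = lambda v: np.exp(-0.5 * ((v - w) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
P_exp_neg_r, _ = integrate.quad(lambda v: np.exp(-tau * risk(v)) * prior(v), -50, 50)

print(F, C + np.log(P_exp_neg_r))                 # the two values agree up to quadrature error
```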
The theorem requires the loss function to be bounded, because the PAC-Bayes bound
we have used applies only to bounded loss functions. Germain et al. (2016) described
PAC-Bayes generalization bounds for unbounded loss functions, though these require that one
make additional assumptions about the distribution of the empirical risk, which we would
prefer not to make. (See Grünwald and Mehta (2016) for related work on excess risk bounds
and further references).
Theorem 4.5.1. Let m ∈ N, let A : Z^m ⇝ R^p be ε-differentially private, and let δ > 0. Then |RD(A(S)) − R̂S(A(S))| < ε̄ + m^{−1/2} with probability at least 1 − δ over S ∼ D^m, where ε̄ = max{ε, √((1/m) log(3/δ))}. The same holds for the upper bound √(6 R̂S(A(S))) (ε̄ + m^{−1/2}) + 6(ε̄² + m^{−1}).
Theorem 4.5.2. Let γ, τ > 0, and assume the range of the loss is contained in an interval of length Lmax. One sample from the local entropy distribution P_{exp(β Fγ,τ(·;S))} is (2β Lmax τ / m)-differentially private.
Proof of Theorem 4.5.2. The result follows immediately from Lemma 3.4.1 and the follow-
ing lemma.
Lemma 4.5.3. Let Fγ,τ(w; S) be defined as in Eq. (4.2), assume the range of the loss is contained in an interval of length Lmax, and define q(S, w) = −Fγ,τ(w; S). Then ∆q := sup_{S,S′} sup_{w∈R^p} |q(S, w) − q(S′, w)| ≤ Lmax τ / m, where the outer supremum is over pairs of samples S, S′ ∈ Z^m differing in at most one example.
Proof. The proof essentially mirrors that of (McSherry and Talwar, 2007, Thm. 6).
At each round of the outer loop, ĝ(w) is an estimate of the gradient ∇w Fγ,τ (w; S). (Recall the identity
Eq. (4.3).) As in Entropy-SGD, we construct biased gradient estimates via an inner loop of
SGLD. (We ignore this source of error.) In summary, the only change to Entropy-SGD is
the addition of noise in the outer loop. We call the resulting algorithm Entropy-SGLD. (See
Algorithm 2.)
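In code, the modification is a single extra noise term in the outer update. The sketch below (the inverse temperature β, the step size η_t, and the exact noise scaling are illustrative assumptions) shows one outer Entropy-SGLD update given the inner-loop estimate µ:

```python
import numpy as np

def entropy_sgld_outer_update(w, mu, tau, gamma, beta, eta_t, rng):
    """One outer-loop update of Entropy-SGLD (a sketch).

    mu is the inner-loop estimate of the mean of G^{w,S}_{gamma,tau}; the Langevin
    noise is scaled so that the chain targets the local entropy distribution
    P_{exp(beta * F)} (this scaling is an assumption made for illustration).
    """
    g_hat = tau * gamma * (mu - w)                               # gradient estimate (Eq. (4.3))
    noise = np.sqrt(2.0 * eta_t / beta) * rng.standard_normal(w.shape)
    return w + eta_t * g_hat + noise
```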
As we run SGLD longer, we obtain a tighter bound that holds with probability approaching 1 − δ. However, in practice we may not know the rate at which this convergence
occurs. In our experiments, we use very long runs to approximate near-convergence and
then only interpret the bounds as being optimistic. We return to these issues in Sections 4.6
and 4.10.
For the local entropy distribution, the degree ε of privacy is determined by the product of the τ and β parameters of the local entropy distribution. (Thermal noise is √(2/τ).) In turn, ε increases the generalization bound. For a fixed β, theory predicts that τ affects the degree of overfitting. We see this empirically. No bound we compute is violated more frequently than
it is expected to be. The PAC-Bayes bound for SGLD is expanded by an amount ε ′ that goes
to zero as SGLD converges. We assume SGLD has converged and so the bounds we plot are
optimistic. We discuss this point below in light of our empirical results, and then return to
this point in the discussion.
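To fix the scales involved, the following few lines compute the thermal noise and the per-sample privacy parameter of Theorem 4.5.2 for the settings used in our two-class MNIST experiments (τ = √m, β = 1). The value Lmax = 4 for the bounded surrogate loss is an assumption made here purely for illustration.

```python
import numpy as np

m = 60000                       # number of two-class MNIST training examples
tau = np.sqrt(m)                # the setting used in the experiments below
beta = 1.0
L_max = 4.0                     # bound on the surrogate loss; assumed here for illustration

thermal_noise = np.sqrt(2.0 / tau)       # ~0.09, as quoted for the bottom row of Fig. 4.1
epsilon = 2.0 * beta * L_max * tau / m   # per-sample privacy from Theorem 4.5.2; ~0.033 here
print(thermal_noise, epsilon)
```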
The weights learned by SGD, SGLD, and Entropy-SGD are treated differently from those
learned by Entropy-SGLD. In the former case, the weights parametrize a neural network as
usual, and the training and test error are computed using these weights. In the latter case, the
weights are taken to be the mean of a multivariate normal prior, and we evaluate the training
and test error of the associated Gibbs posterior (i.e., a randomized classifier). We also report
the performance of the (deterministic) network parametrized by these weights (the “mean”
classifier) in order to give a coarse statistic summarizing the local empirical risk surface.
Following Zhang et al. (2017), we study these algorithms on MNIST with the original
(“true”) labels, as well as on random labels. A value of τ that performs very well in one
setting often does not perform well in the other. Random labels mimic data where the Bayes
error rate is high, and where overfitting can have severe consequences.
4.6.1 Details
We use a two-class variant of MNIST (LeCun, Cortes, and Burges, 2010).2 Some experiments
involve random labels, i.e., labels drawn independently and uniformly at random at the start of
training. We study three network architectures, abbreviated FC600, FC1200, and CONV. Both
FC600 and FC1200 are 3-layer fully connected networks, with 600 and 1200 units per hidden
layer, respectively. CONV is a convolutional architecture. All three network architectures are
taken from the MNIST experiments by Chaudhari et al. (2017), but adapted to our two-class
version of MNIST.3 Let S and Stst denote the training and test sets, respectively. For all
learning algorithms we track
(i) R^{0–1}_S(w) and R^{0–1}_{Stst}(w), i.e., the training/test error for w.
We also track
2 The MNIST handwritten digits dataset (LeCun, Cortes, and Burges, 2010) consists of 60000 training
set images and 10000 test set images, labeled 0–9. We transformed MNIST to a two-class (i.e., binary)
classification task by mapping digits 0–4 to label 1 and 5–9 to label −1.
3 We adapt the code provided by Chaudhari et al., with some modifications to the training procedure and
(ii) the corresponding training/test errors of the Gibbs classifier G^{w,S}_{γ,τ}. Note that
R^{0–1}_D(G^{w,S}_{γ,τ}) = E_{w′∼G^{w,S}_{γ,τ}}(R^{0–1}_D(w′)),
and so we may interpret the bounds in terms of the performance of a randomized classifier or
the mean performance of a randomly chosen classifier.
4.6.2 Results
Key results for the convolutional architecture (CONV) appear in Fig. 4.1. Results for FC600
and FC1200 appear in Fig. 4.2 of Section 4.7. (Training the CONV network produces
the lowest training/test errors and tightest generalization bounds. Results and bounds for
FC600 are nearly identical to those for FC1200, despite FC1200 having three times as many
parameters.)
The top row of Fig. 4.1 presents the performance of SGLD for various levels of thermal noise √(2/τ) under both true and random labels. (Assuming SGLD is close to weak conver-
gence, we may also use SGLD to directly perform a private optimization of the empirical risk
surface. The level of thermal noise determines the differential privacy of SGLD’s stationary
distribution and so we expect to see a tradeoff between empirical risk and generalization
error. Note that, algorithmically, SGD is SGLD with zero thermal noise.) SGD achieves
the smallest training and test error on true labels, but overfits the worst on random labels.
In comparison, SGLD’s generalization performance improves with higher thermal noise,
while its risk performance worsens. At 0.05 thermal noise, SGLD achieves a reasonable though relatively large risk and almost zero generalization error on both true and random labels.
Other thermal noise settings have either much worse risk or generalization performance.
The middle row of Fig. 4.1 presents the performance of Entropy-SGD for various levels of thermal noise √(2/τ) under both true and random labels. As with SGD, Entropy-SGD’s
generalization performance improves with higher thermal noise, while its risk performance
worsens. At the same levels of thermal noise, Entropy-SGD outperforms the risk and
generalization error of SGD. At 0.01 thermal noise, Entropy-SGD achieves good risk and low
generalization error on both true and random labels. However, the test-set performance of
Entropy-SGD at 0.01 thermal noise is still worse than that of SGD. Whether this difference
is due to SGD overfitting to the MNIST test set is unclear and deserves further study.
The bottom row of Fig. 4.1 presents the performance of Entropy-SGLD with τ = √m on true and random labels. (This corresponds to approximately 0.09 thermal noise.) On true
labels, both the mean and Gibbs classifier learned by Entropy-SGLD have approximately 2%
test error and essentially zero generalization error, which is less than predicted by the bounds
evaluated. The differentially private PAC-Bayes risk bounds are roughly 3%. As expected
by the theory, Entropy-SGLD, properly tuned, does not overfit on random labels, even after
thousands of epochs.
We find that the PAC-Bayes bounds are generally tighter than the H- and C-bounds. All
bounds are nonvacuous, though still loose. The error bounds reported here are tighter than
those reported in Chapter 2. However, the bounds are optimistic because they do not include
the additional term which measures how far SGLD is from its weak limit. Despite the bounds
being optimistically tight, we see almost no violations in the data. (Many violations would
undermine our assumption.) While we observe tighter generalization bounds than previously
reported, and better test error, we are still far from the performance of SGD. The optimistic
picture we get from the bounds suggests we need to develop new approaches. Weaker notions
of stability/privacy may be necessary to achieve further improvement in generalization error
and test error.
CONV is a convolutional neural network, whose architecture is the same as that used by
Chaudhari et al. (2017) for multiclass MNIST classification, except modified to produce a
single probability output for our two-class variant of MNIST. In particular, CONV has two
convolutional layers, a fully connected ReLU layer, and a sigmoidal output layer, yielding
126,711 parameters in total.
FC600 and FC1200 are fully connected 3-layer neural networks, with 600 and 1200 units per hidden layer, respectively, yielding 834,601 and 2,385,185 parameters in total. We use ReLU activations for all but the last layer, which is sigmoidal to produce an output in [0, 1].
In their MNIST experiments, Chaudhari et al. (2017) use dropout and batch normalization.
We do not use dropout. The bounds we achieve with and without batch norm are very similar.
Without batch norm, however, it is necessary to tune the learning rates. Understanding the
combination of SGLD and batch norm and the limiting invariant distribution, if any, is an
important open problem.
Ordinarily, an epoch implies one pass through the entire data set. For SGD, each stochastic
gradient step processes a minibatch of size K = 128. Therefore, an epoch is m/K = 468
steps of SGD. An epoch for Entropy-SGD and Entropy-SGLD is defined as follows: each
iteration of the inner SGLD loop processes a minibatch of size K = 128, and the inner loop
runs for L = 20 steps. Therefore, an epoch is m/(LK) steps of the outer loop; in other words, there are L = 20 steps of SGD for every one step of Entropy-SG(L)D, and the x-axis of our plots measures epochs divided by L. This choice, used also by Chaudhari et al. (2017), ensures that the wall-clock times of Entropy-SG(L)D and SGD align.
The step sizes for SGLD must be square summable but not summable. The step sizes for the
outer SGLD loop are of the form η_t = η t^{−0.6}, with η = η′/(γτ), where η′ = 0.006 is called the base learning rate. The step sizes for the inner SGLD loop are of the form η_t = η t^{−1}, with η = 2/τ.
The estimate produced by the inner SGLD loop is computed using a weighted average
(line 8) with α = 0.75. We use SGLD again when computing the PAC-Bayes generalization
bound (Section 4.7.3). In this case, SGLD is used to sample from the local Gibbs distribution
when estimating the Gibbs risk and the KL term. We run SGLD for 1000 epochs to obtain
our estimate. Again, we use weighted averages, but with α = 0.005, in order to average over
a larger number of samples and better control the variance.
In our setting, the right-hand side of the differentially private PAC-Bayes bound evaluates to
c = 2 max{ln(3/δ), m ε²} / m.   (4.6)
In order to bound the risk using the differentially private PAC-Bayes bound, we must compute
the largest value p such that kl(q||p) ≤ c. There does not appear to be a simple formula for
this value. In practice, however, the value can be efficiently numerically approximated using,
e.g., Newton’s method. See Section 2.4.
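Since kl(q||p) is increasing in p for p ≥ q, the inversion is a one-dimensional root-finding problem. The sketch below uses bisection (Newton’s method works equally well); the numbers in the example are illustrative.

```python
import numpy as np

def kl_bernoulli(q, p):
    """kl(q || p) for Bernoulli means q and p."""
    eps = 1e-12
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse_upper(q, c, iters=100):
    """Largest p with kl(q || p) <= c, found by bisection on [q, 1)."""
    lo, hi = q, 1.0 - 1e-12
    if kl_bernoulli(q, hi) <= c:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative numbers: a ~2% empirical Gibbs error and c ~ 0.002 give a risk bound near 3%.
print(kl_inverse_upper(0.02, 0.002))
```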
In Chapter 3 we make use of this to propose the following two Monte Carlo estimates:
E_{w∼P_{exp(−ℓ)}}[−ℓ(w)] ≈ −(1/k′) ∑_{i=1}^{k′} ℓ(w′_i),   (4.8)
where w′_1, . . . , w′_{k′} are taken from a Markov chain targeting P_{exp(−ℓ)}, such as SGLD run for k′ ≫ 1 steps (which is how we computed our bounds), and
log P[exp(−ℓ)] = log ∫ exp{−ℓ(w)} P(dw)   (4.9)
⪆ log (1/k) ∑_{i=1}^{k} exp{−ℓ(w_i)},   (4.10)
where w_1, . . . , w_k are i.i.d. P (which is a multivariate Gaussian in this case). In the latter case, due to the concavity of log, the estimate is a lower bound with high probability, yielding a high-probability upper bound on the KL term.
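A sketch of the two estimators is given below; the loss function and the routine supplying the SGLD samples are assumed to be provided by the caller, and the Gaussian prior P is parametrized by its mean and (isotropic) variance.

```python
import numpy as np

def gibbs_risk_estimate(loss, sgld_samples):
    """Estimate E_{w ~ P_exp(-loss)}[loss(w)] by averaging over (approximate) SGLD
    samples, as in Eq. (4.8) (up to sign)."""
    return float(np.mean([loss(w) for w in sgld_samples]))

def log_P_exp_neg_loss(loss, prior_mean, prior_var, k, rng=None):
    """High-probability lower bound on log P[exp(-loss)] via Eq. (4.10),
    using k i.i.d. draws from the Gaussian prior P = N(prior_mean, prior_var * I)."""
    rng = rng or np.random.default_rng()
    draws = prior_mean + np.sqrt(prior_var) * rng.standard_normal((k, prior_mean.shape[0]))
    vals = np.array([-loss(w) for w in draws])
    shift = vals.max()                               # log-mean-exp, computed stably
    return shift + np.log(np.mean(np.exp(vals - shift)))
```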
4.8.1 Objective
The neural network produces a probability vector (p1 , . . . , pK ) via a soft-max operation.
Ordinarily, we then apply the cross entropy loss − log py . When training privately, we use a
bounded variant of the cross entropy loss, − log ψ(py ), where ψ is defined as in Eq. (3.26).
converges very quickly, in most cases within the first 5 calls of the outer-loop Entropy-SGLD
step. In order to estimate the Gibbs randomized classifier error and KL divergence and
evaluate the PAC-Bayes bound, we run SGLD for an extra 50 epochs at the end of training.
yields the smallest bound on the risk and also the best-performing classifiers, as judged by the risk evaluated on the test set. A value of τ < 0.01 corresponds to a large prior variance
and excessive smoothing, which results in Entropy-SGLD finding a poor classifier.
On the right-hand plot, we reduce β to 0.004 to be able to increase τ and maintain the same level of privacy. This corresponds to higher SGLD noise on the outer-loop step, and smaller noise on the inner SGLD step. Recall that the prior variance is (γτ)^{−1}. Since τ is now much larger, γ < 0.001 results in less smoothing than in the β = 1 case, and we see that Entropy-SGLD is now able to find a relatively good classifier while preserving the same level of privacy. In addition, we see further improvement in the bound on the risk and the test error for a larger range of γ values (γ ∈ [0.0003, 20]).
4.10 Discussion
Our work reveals that Entropy-SGD can be understood as optimizing a PAC-Bayes gener-
alization bound in terms of the bound’s prior. Because the prior must be independent of
the data, the bound is invalid, and, indeed, we observe overfitting in our experiments with Entropy-SGD when the thermal noise √(2/τ) is set to 0.0001, as suggested by Chaudhari et al. for MNIST.
PAC-Bayes priors can, however, depend on the data distribution. This flexibility seems
wasted, since the data sample is typically viewed as one’s only view onto the data distribution.
However, using results combining differential privacy and PAC-Bayes bounds, we arrive
at an algorithm, Entropy-SGLD, that minimizes its own PAC-Bayes bound (though for a
surrogate risk). Entropy-SGLD performs an approximately private computation on the data,
extracting information about the underlying distribution, without undermining the statistical
validity of its PAC-Bayes bound. The cost of using the data is a looser bound, but the gains in
choosing a better prior make up for the loss. (The gains come from the KL term being much
smaller on account of the prior being better matched to the data-dependent posterior.)
Our bounds based on Theorem 3.3.2 are optimistic because we do not include the ε ′
term, assuming that SGLD has essentially converged. We do not find overt evidence that
our approximation is grossly violated, which would be the case if we saw the test error
repeatedly falling outside our confidence intervals. We believe that it is useful to view the
bounds we obtain for Entropy-SGLD as being optimistic and representing the bounds we
might be able to achieve rigorously should there be a major advance in private optimization.
(No analysis of the privacy of SGLD takes advantage of the fact that it mixes weakly, in part because it is difficult to characterize how much it has converged in any real-world setting after a finite number of steps.) On account of using private data-dependent priors (and making optimistic assumptions), the bounds we observe for Entropy-SGLD are significantly tighter than those reported in Chapter 2. However, despite our bounds potentially being
optimistic, the test set error we are able to achieve is still 5–10 times worse than that of SGD.
Differential privacy may be too conservative for our purposes, leading us to underfit. We are
able to achieve good generalization on both true and random labels under 0.01 thermal noise,
despite this value of noise being too large for tight bounds. Identifying the appropriate notion
of privacy/stability to combine with PAC-Bayes bounds is an important problem.
Despite Entropy-SGLD having much stronger generalization guarantees, Entropy-SGLD
learns much more slowly than Entropy-SGD, the test error of Entropy-SGLD is far from state
of the art, and the PAC-Bayes bounds, while much tighter than existing bounds, are still quite
loose. It seems possible that we may be facing a fundamental tradeoff between the speed
of learning, the excess risk, and the ability to produce a certificate of one’s generalization
error via a rigorous bound. Characterizing the relationship between these quantities is an
important open problem.
Fig. 4.1 Results on the CONV network on two-class MNIST. (left column) Training error
(under 0–1 loss) for SGLD on the empirical risk −τ R̂S under a variety of thermal noise √(2/τ) settings. SGD corresponds to zero thermal noise. (top-left) The large markers on
the right indicate test error. The gap is an estimate of the generalization error. On true
labels, SGLD finds classifiers with relatively small generalization error. At low thermal noise
settings, SGLD (and its zero limit, SGD), achieve small empirical risk. As we increase the
thermal noise, the empirical 0–1 error increases, but the generalization error decreases. At
0.1 thermal noise, risk is close to 50%. (bottom-left) On random labels, SGLD has high
generalization error for thermal noise values 0.01 and below. (True error is 50%). (top-
middle) On true labels, Entropy-SGD, like SGD and SGLD, has small generalization error.
For the same settings of thermal noise, empirical risk is lower. (bottom-middle) On random
labels, Entropy-SGD overfits for thermal noise values 0.005 and below. Thermal noise 0.01
produces good performance on both true and random labels. (right column) Entropy-SGLD
is configured to approximately sample from an ε-differentially private mechanism with ε ≈ 0.0327 by setting τ = √m, where m is the number of training samples. (top-right) On
true labels, the generalization error for networks learned by Entropy-SGLD is close to zero.
Generalization bounds are relatively tight. (bottom-right) On random labels, Entropy-SGLD does not overfit. See Fig. 4.3 for SGLD bounds at the same privacy setting.
Fig. 4.2 Fully connected networks trained on binarized MNIST with a differentially private
Entropy-SGLD algorithm. (left) Entropy-SGLD applied to FC600 network trained on true
labels. (right) Entropy-SGLD applied to FC1200 network trained on true labels. Both
training error and generalization error are similar for both network architectures. The true
generalization gap is close to zero, since the test and train errors overlap. All the computed
bounds on the test error are loose but nonvacuous.
Fig. 4.3 SGLD tuned to the same privacy setting as Entropy-SGLD in Fig. 4.1: 0–1 error (× 100) versus epochs, showing Train and Test curves together with the H-bound and C-bound.
Fig. 4.4 (“Local” here refers to the mean classifier.) Entropy-SGLD results on MNIST.
(top-left) FC1024 network trained on true labels. The train and test error suggest that the
generalization gap is close to zero, while all three bounds exceed the test error by slightly
more than 3%. (bottom-left) CONV network trained on true labels. Both the train and the
test errors are lower than those achieved by the FC1024 network. We still do not observe
overfitting. The C-bound and PAC-Bayes bounds exceed the test error by ≈ 3%. (top-right)
FC1024 network trained on random labels. After approximately 1000 epochs, we notice
overfitting by ≈ 2%. Running Entropy-SGLD further does not cause additional overfitting.
Theory suggests that our choice of τ prevents overfitting via differential privacy. (bottom-
right) CONV network trained on random labels. We observe almost no overfitting (less than
1%). Both training and test error coincide and remain close to the guessing rate (90%).
Fig. 4.5 Results for Entropy-SGLD trained on CIFAR10 data, with β = 1. (top-left) For this
configuration of parameters, Entropy-SGLD finds good classifiers (with test error lower than
0.3) only at the values of τ for which both generalization bounds are vacuous. Note that, as τ
increases, the test error keeps dropping, which cannot be captured by the risk bounds using
differential privacy. However, this is only observed for the true label dataset, where the true
Bayes error is small. (top-right) The generalization gap increases with τ as suggested by the
risk bounds, and takes a maximum value for τ > 106 . (bottom-left) The pattern is similar
to the γ = 0.03 case (top left plot). In contrast, Entropy-SGLD finds better classifiers (with
test error lower than 0.3) for smaller values of τ. The PAC-Bayes and C-bounds are very
close to each other. (bottom-right) As in the γ = 0.03 case (top right plot), we see maximal
overfitting for large values of τ. However, Entropy-SGLD starts overfitting at a much lower value of τ (τ > 2×10²) compared to the smaller γ case (τ > 5×10⁴). The PAC-Bayes bound approaches the C-bound, but no generalization bounds are violated.
Fig. 4.6 Entropy-SGLD experiments on CIFAR10 data with a fixed level of privacy β τ =
2000. The generalization bounds we evaluate do not depend on γ, therefore the C-gen
bound takes a constant value. The DP-PAC-Bayes generalization bound corresponds to
the DP-PAC-Bayes bound on the risk minus the empirical error (Train), and is tighter than the C-gen
bound for all values of β , τ, γ that we tested. (left) β is set to 1 during optimization. A
small value of γ = 0.003 corresponds to high prior variance and thus substantial smoothing
of the optimization surface. In this case, Entropy-SGLD fails to find a good classifier. We
hypothesize that this happens due to over-smoothing and information loss relating to the
location of the actual empirical risk minima. (right) β is reduced to 0.004, which allows
us to increase τ and maintain the same level of privacy. One can notice that we are able to
achieve lower empirical risk (train error is ≈ 0.06 for γ = 3), risk (≈ 0.19 on the test set),
and thus a lower PAC-Bayes bound on the risk (≈ 0.32) compared to the best result for the β = 1
setup. The largest value of γ tested is 50. In this case, the prior variance (τγ)−1 is very small,
and the step size is also very small (initial step size is set to base learning rate times the prior
variance). This may explain why Entropy-SGLD finds slightly worse classifiers compared to
smaller values of γ.
Chapter 5
In this chapter, we consider the problem of learning generative models from i.i.d. data with
unknown distribution P. We formulate the learning problem as one of finding a function G,
called the generator, such that, given an input Z drawn from some fixed noise distribution
N , the distribution of the output G(Z) is close to the data’s distribution P. Note that, given
G and N , we can easily generate new samples despite not having an explicit representation
for the underlying density.
We are particularly interested in the case where the generator is a deep neural network
whose parameters we must learn. Rather than being used to classify or predict, these networks
transport input randomness to output randomness, thus inducing a distribution. The first direct
instantiation of this idea is due to MacKay (1994), although MacKay draws connections
even further back to the work of Saund (1989) and others on autoencoders, suggesting that
generators can be understood as decoders. MacKay’s proposal, called density networks, uses
multi-layer perceptrons (MLP) as generators and learns the parameters by approximating
Bayesian inference.
Since MacKay’s proposal, there has been a great deal of progress on learning generative
models, especially over high-dimensional spaces like images. Some of the most successful
approaches have been based on restricted Boltzmann machines (Salakhutdinov and Hinton,
2009) and deep Boltzmann networks (Hinton and Salakhutdinov, 2006). A recent example is
the Neural Autoregressive Density Estimator due to Uria, Murray, and Larochelle (2013).
An in-depth survey, however, is beyond the scope of this chapter.
The work in this chapter builds on a proposal due to Goodfellow et al. (2014). Their
generative adversarial nets framework takes an indirect approach to learning deep generative
neural networks: a discriminator network is trained to recognize the difference between train-
ing data and generated samples, while the generator is trained to confuse the discriminator.
The resulting two-player game is cast as a minimax optimization of a differentiable objective
and solved greedily by iteratively performing gradient descent steps to improve the generator
and then the discriminator.
Given the greedy nature of the algorithm, Goodfellow et al. (2014) give a careful pre-
scription for balancing the training of the generator and the discriminator. In particular,
two gradient steps on the discriminator’s parameters are taken for every iteration of the
generator’s parameters. It is not clear at this point how sensitive this balance is as the data set
and network vary. In this chapter, we describe an approximation to adversarial learning that
replaces the adversary with a closed-form nonparametric two-sample test statistic based on
the Maximum Mean Discrepancy (MMD), which we adopted from the kernel two sample test
(Gretton et al., 2012). We call our proposal MMD nets.1 We give bounds on the estimation
error incurred by optimizing an empirical estimator rather than the true population MMD
and give some illustrations on synthetic and real data.
1 Li, Swersky, and Zemel (2015) independently also propose to use MMD as a training objective for generative neural networks. We leave a comparison to future work.
Fig. 5.1 (top left) Comparison of adversarial nets and MMD nets. (top right) Here we present a simple one-dimensional illustration
of optimizing a generator via MMD. Both the training data and noise data are Gaussian distributed and we consider the class of
generators given by G(µ,σ ) (w) = µ + σ w. The plot on the left shows the isocontours of the MMD-based cost function and the path
taken by gradient descent. On the right, we show the distribution of the generator before and after a number of training iterations, as
compared with the data generating distribution. Here we did not resample the generated points and so we do not expect to be able
to drive the MMD to zero and match the distribution exactly. (bottom) The same procedure is repeated here for a two-dimensional
dataset. On the left, we see the gradual alignment of the Gaussian-distributed input data to the Gaussian-distributed output data as the
parameters of the generator Gθ are optimized. The learning curve on the right shows the decrease in MMD obtained via gradient
descent.
The two-player game is cast as a minimax optimization of the value function
V(Gθ, Dφ) = E[ log Dφ(X) + log(1 − Dφ(Gθ(W))) ].   (5.4)
In practice, for every gradient step taken with respect to Gθ, they take two gradient steps with respect to Dφ to bring Dφ closer to the desired optimum (Goodfellow, pers. comm.).
It is unclear how sensitive this balance is. Regardless, while adversarial networks deliver
impressive sampling performance, the optimization takes approximately 7.5 hours to train on
the MNIST dataset running on a GeForce GTX TITAN GPU from nVidia with 6GB RAM.
Can we potentially speed up the process with a more tractable choice of adversary?
Our proposal is to replace the adversary with the kernel two-sample test introduced by
Gretton et al. (2012). In particular, we replace the family of discriminators with a family H
of test functions X → R, closed under negation, and use the maximum mean discrepancy
between P and Gθ(N) over H, given by
MMD(H, P, Gθ(N)) = sup_{f∈H} ( E[f(X)] − E[f(Gθ(W))] ),   X ∼ P, W ∼ N.
The functions induced by a kernel k are those functions in the closure of the span of the set {k(·, x) : x ∈ Ω}, which is necessarily an RKHS. Note that for every positive definite kernel there is a unique RKHS H such that every function in H satisfies Eq. (5.6).
Assume that X is a nonempty compact metric space and F a class of functions f : X → R.
Let p and q be Borel probability measures on X, and let X and Y be random variables with
distribution p and q, respectively. The maximum mean discrepancy (MMD) between p and q
is
MMD(F, p, q) = sup_{f∈F} ( E[f(X)] − E[f(Y)] ).
When F is the unit ball of a reproducing kernel Hilbert space H, one may write MMD²(F, p, q) = ∥µp − µq∥²_H, where µp ∈ H is the mean embedding of p, i.e., the element satisfying
E[f(X)] = ⟨f, µp⟩_H for all f ∈ H.   (5.7)
The properties of MMD(H , ·, ·) depend on the underlying RKHS H . For our purposes, it
suffices to say that if we take X to be RD and consider the RKHS H induced by Gaussian
or Laplace kernels, then MMD is a metric, and so the minimum of our learning objective is
achieved uniquely by P, as desired. (For more details, see Sriperumbudur et al. (2008).)
In practice, we often do not have access to p or q. Instead, we are given independent
i.i.d. data X, X′, X1, . . . , XN and Y, Y′, Y1, . . . , YM from p and q, respectively, and would like to estimate the MMD. Gretton et al. (2012) showed that the statistic
MMD²u[H, X, Y] = (1/(N(N−1))) ∑_{n≠n′} k(xn, xn′) + (1/(M(M−1))) ∑_{m≠m′} k(ym, ym′) − (2/(MN)) ∑_{m=1}^{M} ∑_{n=1}^{N} k(xn, ym)   (5.8)
is an unbiased estimator of MMD²(H, p, q).
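A direct numpy implementation of Eq. (5.8) is below; the Gaussian RBF kernel and its bandwidth are illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """The unbiased estimator MMD^2_u[H, X, Y] of Eq. (5.8)."""
    N, M = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_kernel(X, X, sigma), rbf_kernel(Y, Y, sigma), rbf_kernel(X, Y, sigma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (N * (N - 1))   # drop the n = n' diagonal
    term_y = (Kyy.sum() - np.trace(Kyy)) / (M * (M - 1))   # drop the m = m' diagonal
    return term_x + term_y - 2.0 * Kxy.mean()

# Two samples from the same distribution give an estimate close to zero.
rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.standard_normal((500, 2)), rng.standard_normal((600, 2))))
```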
Writing Yθ = (y1, . . . , yM) with ym = Gθ(Wm) for the generated sample, our learning objective is to minimize
C(Yθ, X) = (1/(M(M−1))) ∑_{m≠m′} k(ym, ym′) − (2/(MN)) ∑_{m=1}^{M} ∑_{n=1}^{N} k(ym, xn).
Note that C(Yθ , X) is composed of only those parts of the unbiased estimator (Eq. (5.8)) that
depend on θ .
In practice, the minimization is solved by gradient descent, possibly on subsets of the data. More carefully, the chain rule gives us
∇θ C(Yθ, X) = (1/N) ∑_{n=1}^{N} ∑_{m=1}^{M} (∂Cn(Yθ, Xn)/∂ym) (∂ym/∂θ),
where
Cn(Yθ, Xn) = (1/(M(M−1))) ∑_{m≠m′} k(ym, ym′) − (2/M) ∑_{m=1}^{M} k(ym, xn).
Each derivative ∂Cn(Yθ, Xn)/∂ym is easily computed for standard kernels like the RBF kernel. Our gradient ∇C(Yθ, Xn) depends on the partial derivatives of the generator with respect to its parameters, which we can compute using back propagation.
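As an illustration of this training loop, the sketch below fits the one-dimensional generator G(µ,σ)(w) = µ + σw of Fig. 5.1 by gradient descent on the statistic C(Yθ, X). For brevity the gradient is approximated by finite differences rather than computed by back propagation, and the data distribution, kernel bandwidth, step size, and iteration count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(1.5, 0.5, size=400)        # "training data" (an assumed distribution)
W = rng.standard_normal(400)              # fixed noise inputs to the generator

def k(a, b, sigma=1.0):                   # RBF kernel on scalars
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

def C(theta):                             # the theta-dependent part of MMD^2_u
    mu, log_sigma = theta
    Y = mu + np.exp(log_sigma) * W
    Kyy = k(Y, Y)
    M = len(Y)
    return (Kyy.sum() - np.trace(Kyy)) / (M * (M - 1)) - 2.0 * k(Y, X).mean()

theta, h = np.array([0.0, 0.0]), 1e-4     # initial (mu, log sigma); finite-difference width
for _ in range(300):
    grad = np.array([(C(theta + h * e) - C(theta - h * e)) / (2 * h) for e in np.eye(2)])
    theta = theta - 0.2 * grad            # plain gradient descent on C
print(theta[0], np.exp(theta[1]))         # should move toward the data mean (1.5) and scale (0.5)
```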
To that end, for a measured space X, write L∞(X) for the space of essentially bounded functions on X and write B(L∞(X)) for the unit ball under the sup norm.
The bounds we obtain will depend on a notion of complexity captured by the fat-shattering dimension. We consider the class
G_{k+}^X = {g = k(x, Gθ(·)) : x ∈ X, θ ∈ Θ}
of functions from W to R that are compositions of some generator and the kernel with some fixed input, and the (sub)class
Theorem 5.3.2 (estimation error). Assume the kernel is bounded by one and that there exist γ1, γ2 > 1 and p1, p2 ∈ N such that, for all ε > 0, it holds that fat_ε(G_{k+}) ≤ γ1 ε^{−p1} and fat_ε(G_{k+}^X) ≤ γ2 ε^{−p2}. Then, with probability at least 1 − δ,
MMD²(H, pdata, pθ̂) < MMD²(H, pdata, pθ∗) + ε,
with
ε = r(p1, γ1, M) + r(p2, γ2, M − 1) + 12 M^{−1/2} √(log(2/δ)).
The proof appears in Section 5.4. We can obtain simpler, but slightly more restrictive,
hypotheses if we bound the fat-shattering dimension of the class of generators {Gθ : θ ∈ Θ}
alone: Take the observation space X to be a bounded subset of a finite-dimensional Euclidean
space and the kernel to be Lipschitz continuous and translation invariant. For the RBF kernel,
the Lipschitz constant is proportional to the inverse of the length-scale: the resulting bound
loosens as the length scale shrinks.
5.4 Proofs
We begin with some preliminaries and known results.
Theorem 5.4.2 ((Mendelson, 2003, Thm. 2.35)). Let F ⊂ B(L∞(X)). Assume there exists γ > 1 such that, for all ε > 0, fat_ε(F) ≤ γ ε^{−p} for some p ∈ N. Then there exist constants C_p depending on p only, such that R_N(F) ≤ C_p Ψ(p, N, γ), where
Ψ(p, N, γ) = γ^{1/2} if 0 < p < 2;  γ^{1/2} log^{3/2} N if p = 2;  γ^{1/2} N^{1/2 − 1/p} if p > 2.
where
δ_ε = 2 exp( − ε²M / (16K²) ).
Theorem 5.4.4 (estimation error for finite parameter set). Let pθ be the distribution of
Gθ (W ), with θ taking values in some finite set Θ = {θ1 , ..., θT }, T < ∞. Then, with probabil-
ity at least 1 − (T + 1)δε , where δε is defined as in Theorem 5.4.3, we have
Because θ̂ depends on the training data X and generator data Y , we use a uniform bound that
holds over all θ . Specifically,
P( |E(θ̂) − T(θ̂)| > ε ) ≤ P( sup_θ |E(θ) − T(θ)| > ε )   (5.18)
≤ ∑_{t=1}^{T} P( |E(θt) − T(θt)| > ε ) ≤ T δ_ε.
2ε ≥ |E(θ̂) − T(θ̂)| + |E(θ∗) − T(θ∗)|   (5.19)
≥ |E(θ∗) − E(θ̂) + T(θ̂) − T(θ∗)|.
2ε ≥ T(θ̂) − T(θ∗) = MMD²(H, pdata, pθ∗) − MMD²(H, pdata, pθ̂).
Corollary 5.4.5. With the definitions from Theorem 5.4.4, with probability at least 1 − δ ,
where
ε_δ = 8K √((1/M) log[2(T + 1)/δ]).   (5.21)
In order to prove the general result, we begin with some technical lemmas. The develop-
ment here owes much to Gretton et al. (2012).
Lemma 5.4.6. Let F satisfy the fat-shattering assumption of Theorem 5.4.2 and define
ρ(y1, . . . , yN) = sup_{f∈F} ( E[f(Y, Y′)] − (1/(N(N−1))) ∑_{n≠n′} f(yn, yn′) ).   (5.23)
Then
E(ρ(Y1, . . . , YN)) ≤ (C/√(N−1)) Ψ(γ, N−1, p).   (5.24)
Proof. Let us introduce {ζn}_{n=1}^{N}, where ζn and Yn′ have the same distribution and are independent for all n, n′ ∈ {1, . . . , N}. Then the following is true:
E(f(Y, Y′)) = E( (1/(N(N−1))) ∑_{n≠n′} f(ζn, ζn′) ).   (5.25)
Using Jensen’s inequality and the independence of Y, Y′ and Yn, Yn′, we have
E(ρ(Y1, . . . , YN)) = E( sup_{f∈F} ( E(f(Y, Y′)) − (1/(N(N−1))) ∑_{n≠n′} f(Yn, Yn′) ) )
≤ E( sup_{f∈F} ( (1/(N(N−1))) ∑_{n≠n′} f(ζn, ζn′) − (1/(N(N−1))) ∑_{n≠n′} f(Yn, Yn′) ) ).   (5.26)
Introducing conditional expectations allows us to rewrite the equation with the sum over n outside the expectations. I.e., Eq. (5.26) equals
(1/N) ∑_n E E^{(Yn, ζn)} ( sup_{f∈F} (1/(N−1)) ∑_{n′≠n} Φ(ζn, ζn′, Yn, Yn′) )
= E E^{(Y, ζ)} ( sup_{f∈F} (1/(N−1)) ∑_{n=1}^{N−1} σn Φ(ζ, ζn, Y, Yn) ),   (5.27)
where Φ(x, x′, y, y′) = f(x, x′) − f(y, y′). The second equality follows by the symmetry of the random variables {ζn}_{n=1}^{N−1}. Note that we also added Rademacher random variables {σn}_{n=1}^{N−1} before each term in the sum, since (f(ζn, ζn′) − f(Yn, Yn′)) has the same distribution as −(f(ζn, ζn′) − f(Yn, Yn′)) for all n, n′, and therefore the σ’s do not affect the expectation of the sum.
Note that ζm and Ym are identically distributed. Thus the triangle inequality implies that Eq. (5.27) is less than or equal to
(2/(N−1)) E( E^{(Y)}( sup_{f∈F} ∑_{n=1}^{N−1} σn f(Y, Yn) ) )   (5.28)
≤ (2/√(N−1)) R_{N−1}(F+),   (5.29)
where R_{N−1}(F+) is the Rademacher complexity of F+. Then, by Theorem 5.4.2, we have
E(ρ(Y1, . . . , YN)) ≤ (C/√(N−1)) Ψ(γ, N−1, p).
Lemma 5.4.7. Similarly, for F as above, define
ρ(x1, . . . , xN, y1, . . . , yM) = sup_{f∈F} ( E(f(X, Y)) − (1/(NM)) ∑_{n,m} f(xn, ym) ).   (5.30, 5.31)
Then
E(ρ(X1, . . . , XN, Y1, . . . , YM)) ≤ (C/√M) Ψ(γ, M, p).   (5.32)
Proof of Theorem 5.3.2. The proof follows the same steps as the proof of Theorem 5.4.4, apart from a stronger uniform bound than the one stated in Eq. (5.18). I.e., we need to show:
P( sup_{θ∈Θ} |E(θ) − T(θ)| ≥ ε ) ≤ δ.
sup_{θ∈Θ} |E(θ) − T(θ)|
= sup_{θ∈Θ} | E(k(X, X′)) − (1/(N(N−1))) ∑_{n≠n′} k(Xn, Xn′)
  + E(k(Gθ(W), Gθ(W′))) − (1/(M(M−1))) ∑_{m≠m′} k(Gθ(Wm), Gθ(Wm′))
  − 2E(k(X, Gθ(W))) + (2/(MN)) ∑_{m,n} k(Xn, Gθ(Wm)) |.   (5.33)
For all n ∈ {1, . . . , N}, k(Xn , Xn′ ) does not depend on θ and therefore the first two terms of
the equation above can be taken out of the supremum. Also, note that since |k(·, ·)| ≤ K, we
have
|ζ(x1, . . . , xn, . . . , xN) − ζ(x1, . . . , x′n, . . . , xN)| ≤ 2K/N,   (5.34)
where
ζ(x1, . . . , xN) = (1/(N(N−1))) ∑_{n≠n′} k(xn, xn′),   (5.35)
Therefore Eq. (5.33) is bounded by the sum of the bound in Eq. (5.36) and the supremum over θ of the remaining terms. Thus the next step is to bound that supremum. Define f(W_M) to be the supremum over θ ∈ Θ of the difference between E(k(Gθ(W), Gθ(W′))) and its empirical counterpart, and define h(X_N, W_M) analogously for the cross terms involving E(k(X, Gθ(W))); the remaining supremum is then at most
f(W_M) + 2h(X_N, W_M).
We will first find the upper bound on f(W_M), i.e., for every ε > 0, we will show that there exists δ_f such that
P( f(W_M) > ε ) ≤ δ_f.   (5.44)
The kernel is bounded by K, and therefore k(Gθ(Wm), Gθ(Wm′)) is bounded by K for all m. The conditions of Theorem 5.4.1 are thus satisfied and we can use McDiarmid’s inequality on f:
P( f(W_M) − E(f(W_M)) ≥ ε ) ≤ exp( − ε²M / (2K²) ).
To show Eq. (5.44), we need to bound the expectation of f. We can apply Lemma 5.4.6 on
the function classes Gk and Gk+ . The resulting bound is
E(f(W_M)) ≤ ε_{p1} = (C_f / √(M−1)) Ψ(γ1, M−1, p1),   (5.45)
where p1 and γ1 are parameters associated with fat shattering dimension of Gk+ as stated in
the assumptions of the theorem, and C f is a constant depending on p1 .
Now we can write down the bound on f :
P( f(W_M) ≥ ε_{p1} + ε ) ≤ exp( − ε²M / (2K²) ) = δ_f.   (5.46)
Similarly, applying McDiarmid’s inequality to h,
P( h(X_N, W_M) − E(h(X_N, W_M)) ≥ ε ) ≤ exp( − ε²NM / (2K²(N + M)) ).   (5.47)
We can bound the expectation of h(X_N, W_M) using Lemma 5.4.7 applied on G_k^X and G_{k+}^X. Then
E(h(X_N, W_M)) ≤ ε_{p2} = (C_h / √M) Ψ(γ2, M, p2).   (5.48)
P( h(X_N, W_M) ≥ ε_{p2} + ε ) ≤ exp( − ε²NM / (2K²(N + M)) ) = δ_h.
Summing up the bounds from Eq. (5.46) and Eq. (5.47), it follows that
P( f(W_M) + 2h(X_N, W_M) ≥ ε_{p1} + 2ε_{p2} + 3ε ) ≤ max(δ_f, δ_h) = δ_h.
Using the bound in Eq. (5.36), we obtain the uniform bound we were looking for:
P( sup_{θ∈Θ} |E(θ) − T(θ)| > ε_{p1} + 2ε_{p2} + 4ε ) ≤ δ_h,
To finish, we proceed as in the proof of Theorem 5.4.4. We can rearrange some of the
terms to get a different form of Eq. (5.17):
P( |E(θ∗) − T(θ∗)| > 2ε ) ≤ 2 exp( − ε²M / 4 ) = 2δ_h.   (5.49)
All of the above implies that for any ε > 0, there exists δ , such that
P MMD2 (H , pdata , pθ̂ ) − MMD2 (H , pdata , pθ ∗ ) ≥ ε ≤ δ , (5.50)
where
ε = ε_{p1} + 2ε_{p2} + (12/√M) √(log(2/δ)).
We close by noting that the approximation error is zero in the nonparametric limit.
Theorem 5.4.8 (Gretton et al. (2012)). Let F be the unit ball in a universal RKHS H ,
defined on the compact metric space X, with associated continuous kernel k(·, ·). Then
MMD[H , p, q] = 0 if and only if p = q.
Corollary 5.4.9 (approximation error). Assume pdata is in the family {pθ} and that H is an RKHS induced by a characteristic kernel. Then MMD²(H, pdata, pθ∗) = 0.
Proof. By Theorem 5.4.8, it follows that MMD2 (H , ·, ·) is a metric. The result is then
immediate.
Fig. 5.2 (top-left) MNIST digits from the training set. (top-right) Newly generated digits
produced after 1,000,000 iterations (approximately 5 hours). Despite the remaining artifacts,
the resulting kernel-density estimate of the test data was state of the art at the time when
MMD nets were introduced. (top-center) Newly generated digits after 300 further iterations
optimizing the associated empirical MMD. (bottom-left) MMD learning curves for first
2000 iterations. (bottom-right) MMD learning curves from 2000 to 500,000 iterations. Note
the difference in y-axis scale. No appreciable change is seen in later iterations.
generated points. It is clear that the digits produced have many artifacts not appearing in
the MNIST data set. Indeed, the mean log density of held-out test data was estimated to be
only 113 ± 2, as compared with the reported 225 ± 2 achieved by adversarial nets. On the
other hand, most of the gain is achieved by MMD nets in the first 100-200k iterations, and so
perhaps MMD nets could be used to initialize a network further optimized by other means.
Fig. 5.3 (left) TFD. (right) Faces generated by network trained for 500,000 iterations.
(center) After an additional 500 iterations.
resulting network are plotted in Fig. 5.3. Again, the samples produced by MMD nets are
clearly distinguishable from the training samples and this is reflected in a much lower mean
log density than adversarial nets.
5.6 Discussion
MMD offers a closed-form surrogate for the discriminator in the adversarial nets framework.
After using Bayesian optimization for the parameters, we found that the network produced
samples that were visually similar, but far from indistinguishable from those used to train
the network. On one hand, adversarial nets handily outperformed MMD nets in terms of
mean log density. On the other, MMD nets achieve most of their gain quickly and so it
seems promising to combine MMD nets with another technique, perhaps using MMD nets to
initialize a more costly procedure.
Li et al. (2017) propose to learn the kernel using adversarial learning techniques, and
call the resulting learning scheme MMD GANs. The authors perform a number of empirical
studies in order to compare the performance of MMD GANs to MMD nets and other standard
GAN approaches. Their findings indicate that MMD GANs achieve state-of-the-art results
on standard benchmark datasets.
Another recent paper by Bińkowski et al. (2018) contains a theoretical explanation of
how GAN training can lead to biased gradient estimates for the generator network parameters
in multiple discriminator learning settings. The bias arises due to the discriminator being
optimized based on samples. This is true for MMD GANs as well as Wasserstein GANs
(Arjovsky, Chintala, and Bottou, 2017). The authors prove that in the case of a fixed
discriminator, the gradients on the generator parameters are unbiased.
Bińkowski et al. (2018) also describe a method for choosing the kernel for the MMD critic and suggest how one can adapt training strategies from Wasserstein GANs (Arjovsky, Chintala, and Bottou, 2017). The paper contains empirical studies demonstrating that MMD GANs can achieve performance comparable to Wasserstein GANs with a much smaller critic network, making
training more efficient and practical. In addition, the authors also propose a new metric for
evaluating GAN convergence.
Conclusion
In this thesis we studied generalization bounds for existing learning algorithms using PAC-
Bayes theory and nonconvex optimization. We also proposed modifications to several existing learning algorithms in order to control overfitting. In addition, we designed a new learning algorithm for unsupervised learning and analyzed its generalization properties.
We produced the first nonvacuous train-set bound on the risk for a modern neural network
via direct PAC-Bayes risk bound optimization. One drawback of our approach is that it is
computationally intensive and is difficult to apply to much larger networks. Despite our
bounds being the best known, the bounds we obtain are still much higher than error estimates
on heldout data. In order to explain generalization, we need tighter bounds. Our current
method relies on an isotropic Gaussian PAC-Bayesian posterior. As a result, the posterior
does not capture the dependencies between neural network weights. Replacing the isotropic
Gaussian posterior with a non-isotropic Gaussian posterior distribution on the parameters
may produce a smaller KL divergence and tighter bound. However, networks used in practice
have millions of parameters and so we cannot hope to optimize a full covariance matrix
scalably. Variational inference methods that use factored representations could potentially be
used to overcome this problem.
We introduced a PAC-Bayes bound that allows one to use a data-dependent but differen-
tially private prior. Differential privacy is a very strong notion, and so we show that weak
convergence towards a private prior suffices for generalization. It would be interesting to
explore the necessity of this condition in order to obtain PAC-Bayes generalization bounds
with data-dependent priors.
This leads to a related problem. PAC-Bayes generalization bounds allow us to bound
the performance of a stochastic classifier. Therefore, we may understand the expected
generalization properties of stochastic learning algorithms by studying their distributions.
While experimentally evaluating the differentially private PAC-Bayes bound, we centered
our attention on SGLD. Under certain assumptions, SGLD is known to converge weakly to a
Gibbs distribution, which minimizes PAC-Bayes bounds. However, recent work suggests that
SGD may be usefully thought of as a stochastic learning algorithm. This is yet to be fully
understood. Thus a fundamental problem is: can we precisely formalize the distributions of
SGD and SGLD? This knowledge may lead to improved understanding of the generalization
properties of these learning algorithms. It may also lead to the construction of better
approximate samplers for Gibbs distributions with tractable generalization error.
Poor prior choices lead to loose PAC-Bayes bounds. Local prior bounds have great potential, since they eliminate the need to choose a prior distribution. We empirically evaluated local prior bounds using SGLD. Our experiments highlight that current local prior bounds need to account for the worst-case data distribution and thus fail to capture the
dynamics of the generalization error for easy data as the temperature parameter in a Gibbs
distribution increases beyond the point where the Gibbs distribution overfits on random labels.
Further research is needed in order to obtain useful PAC-Bayes bounds based on local priors.
One approach is to seek tighter data-dependent upper bounds on the KL divergence between
the posterior and the local prior. Such an estimate should be sensitive to the data distribution.
For the purposes of understanding generalization, we could potentially make use of held-out
data.
References
V Vapnik and A Chervonenkis (1974). “About structural risk minimization principle”. Au-
tomation Remote Control 8, p. 9.
Jorma Rissanen (June 1983). “A Universal Prior for Integers and Estimation by Minimum
Description Length”. Ann. Statist. 11.2, pp. 416–431. DOI: 10.1214/aos/1176346150.
Andrew Blake and Andrew Zisserman (1987). Visual Reconstruction. Cambridge, MA, USA:
MIT Press.
Eric Saund (1989). “Dimensionality-Reduction Using Connectionist Networks.” IEEE Trans.
Pattern Anal. Mach. Intell. 11.3, pp. 304–314.
Geoffrey E. Hinton and Drew van Camp (1993). “Keeping the Neural Networks Simple by
Minimizing the Description Length of the Weights”. Proceedings of the Sixth Annual
Conference on Computational Learning Theory. COLT ’93. Santa Cruz, California, USA:
ACM, pp. 5–13. DOI: 10.1145/168304.168306.
David J.C. MacKay (1994). “Bayesian Neural Networks and Density Networks”. Nuclear
Instruments and Methods in Physics Research, A, pp. 73–80.
Paul W Goldberg and Mark R Jerrum (1995). “Bounding the Vapnik-Chervonenkis dimension
of concept classes parameterized by real numbers”. Machine Learning 18.2-3, pp. 131–
148.
Pascal Koiran and Eduardo D Sontag (1996). “Neural networks with quadratic VC dimen-
sion”. Advances in neural information processing systems, pp. 197–203.
Peter L Bartlett (1997). “For valid generalization the size of the weights is more important
than the size of the network”. Advances in Neural Information Processing Systems,
pp. 134–140.
Sepp Hochreiter and Jürgen Schmidhuber (Jan. 1997). “Flat Minima”. Neural Comput. 9.1,
pp. 1–42. DOI: 10.1162/neco.1997.9.1.1.
John Shawe-Taylor and Robert C Williamson (1997). “A PAC analysis of a Bayesian esti-
mator”. Proceedings of the tenth annual conference on Computational learning theory.
ACM, pp. 2–9.
Peter L Bartlett (1998). “The sample complexity of pattern classification with neural networks:
the size of the weights is more important than the size of the network”. IEEE Transactions
on Information Theory 44.2, pp. 525–536.
Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998). “Gradient-based
learning applied to document recognition”. Proceedings of the IEEE, pp. 2278–2324.
Carl Edward Rasmussen and Christopher K. I. Williams (2005). Gaussian Processes for
Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
Cynthia Dwork (2006). “Differential Privacy”. Automata, Languages and Programming:
33rd International Colloquium, ICALP 2006, Venice, Italy, July 10–14, 2006, Proceedings,
Part II. Ed. by Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener.
Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 1–12. DOI: 10.1007/11787006_1.
G E Hinton and R R Salakhutdinov (July 2006). “Reducing the dimensionality of data with
neural networks”. Science 313.5786, pp. 504–507.
Olav Kallenberg (2006). Foundations of modern probability. Springer Science & Business
Media.
Tong Zhang (2006a). “From ε-entropy to KL-entropy: Analysis of minimum information
complexity density estimation”. The Annals of Statistics 34.5, pp. 2180–2210. DOI:
10.1214/009053606000000704.
— (2006b). “Information-theoretic upper and lower bounds for statistical estimation”. IEEE
Transactions on Information Theory 52.4, pp. 1307–1321.
Olivier Catoni (2007). “PAC-Bayesian supervised classification: the thermodynamics of
statistical learning”. arXiv preprint arXiv:0712.0248.
Peter D Grünwald (2007). The minimum description length principle. MIT press.
Frank McSherry and Kunal Talwar (2007). “Mechanism Design via Differential Privacy”.
Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science.
FOCS ’07. Washington, DC, USA: IEEE Computer Society, pp. 94–103. DOI: 10.1109/
FOCS.2007.41.
Cynthia Dwork (2008a). “Differential privacy: A survey of results”. International Conference
on Theory and Applications of Models of Computation. Springer, pp. 1–19.
— (2008b). “Differential privacy: A survey of results”. International Conference on Theory
and Applications of Models of Computation. Springer, pp. 1–19.
Wenxin Jiang and Martin A Tanner (2008). “Gibbs posterior for variable selection in high-
dimensional classification and data mining”. The Annals of Statistics, pp. 2207–2231.
Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Gert Lanckriet, and Bernhard
Schölkopf (2008). “Injective Hilbert Space Embeddings of Probability Measures”. Conf.
Comp. Learn. Theory, (COLT).
Alex Krizhevsky (2009). Learning Multiple Layers of Features from Tiny Images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
Ruslan Salakhutdinov and Geoffrey E. Hinton (2009). “Deep Boltzmann Machines”. Journal
of Machine Learning Research - Proceedings Track 5, pp. 448–455.
Yann LeCun, Corinna Cortes, and Christopher J. C. Burges (2010). MNIST handwritten digit
database. http://yann.lecun.com/exdb/mnist/.
J. M. Susskind, A. K. Anderson, and G. E. Hinton (2010). The Toronto face database. Tech.
rep.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro (2014). In Search of the Real
Inductive Bias: On the Role of Implicit Regularization in Deep Learning. Workshop track
poster at ICLR 2015. arXiv: 1412.6614v4 [cs.LG].
Shai Shalev-Shwartz and Shai Ben-David (2014). Understanding machine learning: From
theory to algorithms. Cambridge university press.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut-
dinov (2014). “Dropout: a simple way to prevent neural networks from overfitting.”
Journal of machine learning research 15.1, pp. 1929–1958.
Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian
Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefow-
icz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry
Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya
Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda
Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and
Xiaoqiang Zheng (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous
Systems. Software available from tensorflow.org.
Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina
(2015). “Subdominant Dense Clusters Allow for Simple Learning and High Computa-
tional Performance in Neural Networks with Discrete Synapses”. Phys. Rev. Lett. 115
(12), p. 128101. DOI: 10.1103/PhysRevLett.115.128101.
Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron
Roth (2015a). “Generalization in adaptive data analysis and holdout reuse”. Advances in
Neural Information Processing Systems, pp. 2350–2358.
Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron
Leon Roth (2015b). “Preserving statistical validity in adaptive data analysis”. Proceedings
of the Forty-Seventh Annual ACM on Symposium on Theory of Computing. ACM, pp. 117–
126.
Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani (2015). “Training Gen-
erative Neural Networks via Maximum Mean Discrepancy Optimization”. Proceedings of
the Thirty-First Conference on Uncertainty in Artificial Intelligence. UAI’15. Amsterdam,
Netherlands: AUAI Press, pp. 258–267.
Moritz Hardt, Benjamin Recht, and Yoram Singer (2015). “Train faster, generalize better:
Stability of stochastic gradient descent”. CoRR abs/1509.01240.
Diederik P Kingma, Tim Salimans, and Max Welling (2015). “Variational Dropout and the
Local Reparameterization Trick”. Advances in Neural Information Processing Systems
28. Ed. by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett. Curran
Associates, Inc., pp. 2575–2583.
Yujia Li, Kevin Swersky, and Richard Zemel (2015). “Generative Moment Matching Net-
works”. http://arxiv.org/abs/1502.02761v1.
Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro (2015). “Path-SGD: Path-
Normalized Optimization in Deep Neural Networks”. Advances in Neural Information
Processing Systems. arXiv: 1506.02617v1 [cs.LG].
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro (2015). “Norm-based capacity
control in neural networks”. Proceedings of the 28th Conference on Learning Theory.
COLT 2015, pp. 1376–1401. arXiv: 1503.00036v2 [cs.LG].
Yu-Xiang Wang, Stephen E. Fienberg, and Alexander J. Smola (2015). “Privacy for Free:
Posterior Sampling and Stochastic Gradient Monte Carlo”. Proceedings of the 32nd
International Conference on International Conference on Machine Learning - Volume 37.
ICML’15. Lille, France: JMLR.org, pp. 2493–2502.
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal
Talwar, and Li Zhang (2016). “Deep Learning with Differential Privacy”. Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications Security. CCS
’16. Vienna, Austria: ACM, pp. 308–318. DOI: 10.1145/2976749.2978318.
Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello,
Luca Saglietti, and Riccardo Zecchina (2016). “Unreasonable effectiveness of learning
neural networks: From accessible states and robust ensembles to basic algorithmic
schemes”. Proceedings of the National Academy of Sciences 113.48, E7655–E7662. DOI:
10.1073/pnas.1608103113. eprint: http://www.pnas.org/content/113/48/E7655.full.pdf.
Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan
Ullman (2016). “Algorithmic stability for adaptive data analysis”. Proceedings of the
48th Annual ACM SIGACT Symposium on Theory of Computing. ACM, pp. 1046–1059.
Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy (2016). “PAC-
Bayesian bounds based on the Rényi divergence”. Artificial Intelligence and Statistics,
pp. 435–444.
Pier Giovanni Bissiri, CC Holmes, and Stephen G Walker (2016). “A general framework
for updating belief distributions”. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 78.5, pp. 1103–1130.
Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien (2016). “PAC-
Bayesian Theory Meets Bayesian Inference”. Advances in Neural Information Processing
Systems, pp. 1884–1892.
Peter D. Grünwald and Nishant A. Mehta (2016). “Fast Rates for General Unbounded Loss
Functions: from ERM to Generalized Bayes”. CoRR abs/1605.00252. arXiv: 1605.00252.
Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz (2016). “On Graduated Opti-
mization for Stochastic Non-Convex Problems”. Proceedings of the 33rd International
Conference on Machine Learning (ICML), pp. 1833–1841.
Kentaro Minami, Hitomi Arai, Issei Sato, and Hiroshi Nakagawa (2016). “Differential Privacy
without Sensitivity”. Advances in Neural Information Processing Systems 29. Ed. by
D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett. Curran Associates, Inc.
Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos (2017).
“MMD GAN: Towards deeper understanding of moment matching network”. Advances in
Neural Information Processing Systems, pp. 2200–2210.
Behnam Neyshabur (2017). “Implicit regularization in deep learning”. arXiv: 1709.01953.
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro (2017a). A
PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.
arXiv: 1707.09564.
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro (2017b).
“Exploring generalization in deep learning”. Advances in Neural Information Processing
Systems, pp. 5949–5958.
Luca Oneto, Sandro Ridella, and Davide Anguita (2017). “Differential privacy and general-
ization: Sharper bounds with applications”. Pattern Recognition Letters 89, pp. 31–38.
DOI: 10.1016/j.patrec.2017.02.006.
Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky (2017). “Non-convex learning via
Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis”. Proc. Conference
on Learning Theory (COLT). arXiv: 1702.03849.
Ravid Shwartz-Ziv and Naftali Tishby (2017). “Opening the black box of deep neural
networks via information”. arXiv: 1703.00810.
Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin (2017). “A
Strongly Quasiconvex PAC-Bayesian Bound”. International Conference on Algorithmic
Learning Theory, pp. 466–492.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals (2017). “Un-
derstanding deep learning requires rethinking generalization”. International Conference
on Learning Representations (ICLR). arXiv: 1611.03530v2 [cs.LG].
Alessandro Achille and Stefano Soatto (2018). “Information dropout: Learning optimal
representations through noisy computation”. IEEE Transactions on Pattern Analysis and
Machine Intelligence. First appeared as https://arxiv.org/abs/1611.01353.
Pierre Alquier and Benjamin Guedj (May 2018). “Simpler PAC-Bayesian Bounds for Hostile
Data”. Mach. Learn. 107.5, pp. 887–902. DOI: 10.1007/s10994-017-5690-0.
Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton (2018). “De-
mystifying MMD GANs”. arXiv: 1801.01401.
Gintare Karolina Dziugaite and Daniel M. Roy (2018a). “Data-dependent PAC-Bayes priors
via differential privacy”. Advances in Neural Information Processing Systems (NIPS).
Vol. 29. Cambridge, MA: MIT Press. arXiv: 1802.09583.
— (2018b). “Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization prop-
erties of Entropy-SGD and data-dependent priors”. Proceedings of the 35th International
Conference on Machine Learning (ICML). arXiv: 1712.09376.
Samuel L Smith and Quoc V Le (2018). “A Bayesian perspective on generalization and
stochastic gradient descent”. Proc. of ICLR.