Lecture Notes For Machine Learning Theory
Instructor: Tengyu Ma
Contents
2 Asymptotic Analysis 9
2.1 Asymptotics of empirical risk minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Key ideas of proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Main proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Well-specified case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Limitations of asymptotic analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Concentration Inequalities 15
3.1 The big-O notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Chebyshev’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Hoeffding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Sub-Gaussian random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4.1 Examples of sub-Gaussian random variables . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 Concentrations of functions of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5.1 Bounds for Gaussian random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.6.3 Lipschitz composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.7 VC dimension and its limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8 Nonconvex Optimization 79
8.1 Optimization landscape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2 Efficient convergence to (approximate) local minima . . . . . . . . . . . . . . . . . . . . . . . 80
8.2.1 Strict-saddle condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.3 All local minima are global minima: two examples . . . . . . . . . . . . . . . . . . . . . . . . 81
8.3.1 Principal components analysis (PCA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.3.2 Matrix Completion [Ge et al., 2016] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.3.3 Other problems where all local minima are global minima . . . . . . . . . . . . . . . . 87
8.4 The Neural Tangent Kernel (NTK) Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.4.1 Two examples of the NTK regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.4.2 Optimizing L̂(gθ ) vs. L̂(fθ ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
8.4.3 Limitations of NTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.5.1 Warmup: SGD on the one dimensional quadratic function . . . . . . . . . . . . . . . . 121
9.5.2 SGD on multi-dimensional quadratic functions . . . . . . . . . . . . . . . . . . . . . . 122
9.5.3 SGD on non-quadratic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.5.4 SGD with label noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Acknowledgments
This monograph is a collection of scribe notes for the course CS229M/STATS214 at Stanford University.
The materials in Chapter 1–5 are mostly based on Percy Liang’s lecture notes [Liang, 2016], and Chapter 11
is largely based on Haipeng Luo’s lectures [Luo, 2017]. Kenneth Tay contributed significantly to the revision
of these notes as a teaching assistant for the course. The original contributors to the scribe notes are Stanford
students, including but not limited to Anusri Pampari, Gabriel Poesia, Alexander Ke, Trenton Chang, Brad
Ross, Robbie Jones, Yizhou Qian, Will Song, Daniel Do, Spencer M. Richards, Thomas Lew, David Lin,
Jinhui Wang, Rafael Rafailov, Aidan Perreault, Kevin Han, Han Wu, Andrew Wang, Rohan Taori, Jonathan
Lee, Rohith Kuditipudi, Kefan Dong, Roshni Sahoo, Sarah Wu, Tianyu Du, Xin Lu, Soham Sinha, Kevin
Guo, Jeff Z. HaoChen, Carrie Wu, Kaidi Cao, and Ruocheng Wang. The notes will be updated every year
with new materials. The reference list is far from complete.
Chapter 1
In this chapter, we will set up the standard theoretical formulation of supervised learning and introduce the
empirical risk minimization (ERM) paradigm. The setup will apply to almost the entire monograph and
the ERM paradigm will be the main focus of Chapters 2, 3, and 4.
Given a loss function ℓ : Y × Y → R≥0 and a data distribution P over X × Y, the population risk of a predictor h is L(h) = E_{(x,y)∼P}[ℓ(h(x), y)]. Note that L is nonnegative because ℓ is nonnegative. Typically, the loss function is designed so that the best possible loss is zero when ŷ matches y exactly. Therefore, the goal is to find h such that L(h) is as close to zero as possible.
Examples: regression and classification problems. Here are two standard types of supervised learning
problems based on the properties of the output space:
• In the problem of regression, predictions are real numbers (Y = R). We would like predictions to be
as close as possible to the real labels. A classical loss function that captures this is the squared error,
`(ŷ, y) = (ŷ − y)2 .
• In the problem of classification, predictions are in a discrete set of k unordered classes Y = [k] = {1, · · · , k}. One possible classification loss is the 0-1 loss: ℓ(ŷ, y) = 1{ŷ ≠ y}, i.e. 0 if the prediction is equal to the true label, and 1 otherwise.
Hypothesis class. So far, we said we would like to find any function that minimizes population risk.
However, in practice, we do not have a way of optimizing over arbitrary functions. Instead, we work within a
more constrained set of functions H, which we call the hypothesis family (or hypothesis class). Each element
of H is a function h : X → Y. Usually, we choose a set H that we know how to optimize over (e.g. linear
functions, or neural networks).
Given one particular function h ∈ H, we define the excess risk of h with respect to H as the difference between the population risk of h and the best possible population risk inside H:

E(h) ≜ L(h) − inf_{g∈H} L(g).
Generally we need more assumptions about a specific problem and hypothesis class to bound absolute
population risk, hence we focus on bounding the excess risk.
Usually, the family we choose to work with can be parameterized by a vector of parameters θ ∈ Θ. In
that case, we can refer to an element of H by hθ , making that explicit. An example of such a parametrization
of the hypothesis class is H = {hθ : hθ(x) = θ^⊤x, θ ∈ R^d}.
Since we are assuming that our training examples are drawn from the same distribution as the whole
population, we know that empirical risk and population risk are equal in expectation (over the randomness
of the training dataset):
E_{(x^(i),y^(i)) ∼iid P}[L̂(hθ)] = E_{(x^(i),y^(i)) ∼iid P}[ (1/n) Σ_{i=1}^n ℓ(hθ(x^(i)), y^(i)) ]   (1.4)
= (1/n) Σ_{i=1}^n E_{(x^(i),y^(i)) ∼iid P}[ ℓ(hθ(x^(i)), y^(i)) ]   (1.5)
= (1/n) · n · E_{(x,y)∼P}[ ℓ(hθ(x), y) ]   (1.6)
= L(hθ).   (1.7)
This is one reason why it makes sense to use empirical risk: it is an unbiased estimator of the population
risk.
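To make this concrete, here is a minimal simulation sketch (not part of the original notes): the linear model, squared loss, and sample sizes below are illustrative choices. It checks that, for a fixed hypothesis chosen independently of the data, the empirical risk averages out to the population risk over many random training sets.

```python
# Minimal sketch: empirical risk is an unbiased estimator of population risk for a FIXED h_theta.
# All modeling choices (linear hypothesis, squared loss, Gaussian data) are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n, num_trials = 5, 20, 20000
theta = rng.normal(size=d)          # a fixed, data-independent hypothesis h_theta(x) = <theta, x>
theta_star = rng.normal(size=d)     # parameter of the data-generating distribution P

def sample(m):
    """Draw m examples (x, y) from P with y = <theta_star, x> + Gaussian noise."""
    X = rng.normal(size=(m, d))
    return X, X @ theta_star + rng.normal(size=m)

def empirical_risk(X, y):
    """Squared-error empirical risk of h_theta on a sample."""
    return np.mean((X @ theta - y) ** 2)

# Average of L_hat(h_theta) over many independent training sets of size n.
mean_emp_risk = np.mean([empirical_risk(*sample(n)) for _ in range(num_trials)])
# Approximate L(h_theta) with one very large sample.
pop_risk = empirical_risk(*sample(2_000_000))
print(f"E[L_hat(h_theta)] ~ {mean_emp_risk:.4f}   L(h_theta) ~ {pop_risk:.4f}")
```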
The key question that we seek to answer in the first part of this course is: what guarantees do we have
on the excess risk for the parameters learned by ERM? The hope with ERM is that minimizing the
training error will lead to small testing error. One way to make this rigorous is by showing that the ERM
minimizer’s excess risk is bounded.
Chapter 2
Asymptotic Analysis
In this chapter, we use an asymptotic approach (i.e. assuming the number of training samples n → ∞) to bound the excess risk of ERM. We then instantiate these results for the case where the loss function is the negative log-likelihood (i.e. maximum likelihood estimation) and discuss the limitations of asymptotics. (In future chapters we will assume finite n and provide a non-asymptotic analysis.)
The main result of this chapter, stated informally, is that the excess risk of the ERM minimizer satisfies L(θ̂) − L(θ∗) ≤ c/n + o(1/n). Here c is a problem-dependent constant that does not depend on n, and o(1/n) hides all dependencies except n. This shows that as we have more training data (i.e. as n increases), the excess risk of ERM decreases at the rate of 1/n.
Let {(x(1) , y (1) ), · · · , (x(n) , y (n) )} be the training data and let H = {hθ : θ ∈ Rp } be the parameterized
family of hypothesis functions. Let the ERM minimizer be θ̂ as defined in Equation (1.3). Let θ∗ be the
minimizer of the population risk L, i.e. θ∗ = argminθ L(θ). The theorem below quantifies the excess risk
L(θ̂) − L(θ∗ ):
Theorem 2.1 (Informally stated). Suppose that (a) θ̂ →p θ∗ as n → ∞ (i.e. consistency of θ̂), (b) ∇²L(θ∗) is full rank, and (c) other appropriate regularity conditions hold.¹ Then,

1. √n(θ̂ − θ∗) = O_P(1), i.e. for every ε > 0, there is an M such that sup_n P(‖√n(θ̂ − θ∗)‖₂ > M) < ε. (This means that the sequence {√n(θ̂ − θ∗)} is "bounded in probability".)

2. √n(θ̂ − θ∗) →d N(0, (∇²L(θ∗))⁻¹ Cov(∇ℓ((x, y), θ∗)) (∇²L(θ∗))⁻¹).

3. n(L(θ̂) − L(θ∗)) = O_P(1).

4. n(L(θ̂) − L(θ∗)) →d (1/2)‖S‖₂², where S ∼ N(0, ∇²L(θ∗)^{-1/2} Cov(∇ℓ((x, y), θ∗)) ∇²L(θ∗)^{-1/2}).

5. lim_{n→∞} E[n(L(θ̂) − L(θ∗))] = (1/2) tr(∇²L(θ∗)⁻¹ Cov(∇ℓ((x, y), θ∗))).
¹ Xₙ →p X means that for all ε > 0, P(‖Xₙ − X‖ > ε) → 0, while Xₙ →d X means that P(Xₙ ≤ t) → P(X ≤ t) at all points t for which P(X ≤ t) is continuous. These two notions of convergence are known as convergence in probability and convergence in distribution, respectively. These concepts are not essential to this course, but additional information can be found by reading the Wikipedia article on convergence of random variables.
Remark: In the theorem above, Parts 1 and 3 only show the rate or order of convergence, while Parts 2 and 4 define the limiting distribution for the random variables.
Theorem 2.1 is a powerful conclusion because once we know that √n(θ̂ − θ∗) is (asymptotically) Gaussian, we can easily work out the distribution of the excess risk. If we believe in our assumptions and n is large enough such that we can assume n → ∞, this allows us to analytically determine quantities of interest in almost any scenario (for example, if our test distribution changes). The key takeaway is that our parameter error θ̂ − θ∗ decreases in order 1/√n and the excess risk decreases in order 1/n. While we will not discuss the regularity assumptions in Theorem 2.1 in great detail, we note that the assumption that L is twice differentiable is crucial.
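As a quick sanity check of this takeaway (not part of the original notes), the following sketch estimates the distribution of √n(θ̂ − θ∗) for the simplest ERM problem, mean estimation with the squared loss ℓ(x, θ) = (θ − x)². Here ∇²L(θ∗) = 2 and Cov(∇ℓ(x, θ∗)) = 4·Var(x), so the "sandwich" covariance in Part 2 is just Var(x); the exponential data distribution and the sample sizes are arbitrary choices.

```python
# Monte Carlo check of Theorem 2.1 for mean estimation with the squared loss.
# Sandwich covariance: (1/2) * 4 Var(x) * (1/2) = Var(x).
import numpy as np

rng = np.random.default_rng(0)
n, num_trials = 500, 5000
data_mean, data_var = 1.0, 1.0      # Exp(1) has mean 1 and variance 1 (asymmetric on purpose)

errors = []
for _ in range(num_trials):
    x = rng.exponential(scale=1.0, size=n)
    theta_hat = x.mean()            # ERM for the squared loss is the sample mean
    errors.append(np.sqrt(n) * (theta_hat - data_mean))
errors = np.array(errors)

print(f"empirical variance of sqrt(n)(theta_hat - theta*): {errors.var():.3f}")
print(f"variance predicted by Theorem 2.1 (sandwich):       {data_var:.3f}")
print(f"empirical mean (should be ~0):                      {errors.mean():.3f}")
```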
Lemma 2.3. We will use the following standard facts about Gaussian distributions:
1. If Z ∼ N(0, Σ) and A is a deterministic matrix of appropriate dimensions, then AZ ∼ N(0, AΣA^⊤).
2. If Z ∼ N(0, Σ⁻¹) and Z ∈ R^p, then Z^⊤ΣZ ∼ χ²(p), where χ²(p) is the chi-squared distribution with p degrees of freedom.
Since θ̂ minimizes L̂, we have ∇L̂(θ̂) = 0. Taylor expanding ∇L̂ around θ∗ gives

0 = ∇L̂(θ̂) = ∇L̂(θ∗) + ∇²L̂(θ∗)(θ̂ − θ∗) + O(‖θ̂ − θ∗‖₂²).   (2.2)

Rearranging, we have

θ̂ − θ∗ = −(∇²L̂(θ∗))⁻¹ ∇L̂(θ∗) + O(‖θ̂ − θ∗‖₂²).   (2.3)

Multiplying by √n on both sides,

√n(θ̂ − θ∗) = −(∇²L̂(θ∗))⁻¹ √n ∇L̂(θ∗) + O(√n ‖θ̂ − θ∗‖₂²)   (2.4)
≈ −(∇²L̂(θ∗))⁻¹ √n ∇L̂(θ∗).   (2.5)
Applying the Central Limit Theorem (Theorem 2.2) with Xᵢ = ∇ℓ((x^(i), y^(i)), θ∗) and X̄ = ∇L̂(θ∗), and noticing that E[∇L̂(θ∗)] = ∇L(θ∗), we have

√n(∇L̂(θ∗) − ∇L(θ∗)) →d N(0, Cov(∇ℓ((x, y), θ∗))).   (2.6)

Note that ∇L(θ∗) = 0 because θ∗ is the minimizer of L, so √n ∇L̂(θ∗) →d N(0, Cov(∇ℓ((x, y), θ∗))). By the law of large numbers, ∇²L̂(θ∗) →p ∇²L(θ∗). Applying these results to (2.5) (together with an application of Slutsky's theorem),

√n(θ̂ − θ∗) →d ∇²L(θ∗)⁻¹ N(0, Cov(∇ℓ((x, y), θ∗)))   (2.7)
=d N(0, ∇²L(θ∗)⁻¹ Cov(∇ℓ((x, y), θ∗)) ∇²L(θ∗)⁻¹),   (2.8)
where the second step is due to Lemma 2.3. This proves Part 2 of Theorem 2.1.
Part 1 follows directly from Part 2 by the following fact: if Xₙ →d P for some probability distribution P, then Xₙ = O_P(1).
We now turn to proving Parts 3 and 4. Using a Taylor expansion of L with respect to θ at θ∗, we find

L(θ̂) = L(θ∗) + ⟨∇L(θ∗), θ̂ − θ∗⟩ + (1/2)⟨θ̂ − θ∗, ∇²L(θ∗)(θ̂ − θ∗)⟩ + o(‖θ̂ − θ∗‖₂²).   (2.9)

Since θ∗ is the minimizer of the population risk L, we know that ∇L(θ∗) = 0 and the linear term is equal to 0. Rearranging and multiplying by n, we can write

n(L(θ̂) − L(θ∗)) = (n/2)⟨θ̂ − θ∗, ∇²L(θ∗)(θ̂ − θ∗)⟩ + o(n‖θ̂ − θ∗‖₂²)   (2.10)
≈ (1/2)⟨√n(θ̂ − θ∗), ∇²L(θ∗) √n(θ̂ − θ∗)⟩   (2.11)
= (1/2) ‖∇²L(θ∗)^{1/2} √n(θ̂ − θ∗)‖₂²,   (2.12)
where the last equality follows from the fact that for any vector v and positive semi-definite matrix A of appropriate dimensions, the inner product ⟨v, Av⟩ = v^⊤Av = ‖A^{1/2}v‖₂². Let S = ∇²L(θ∗)^{1/2} √n(θ̂ − θ∗), i.e. the random vector inside the norm. By Part 2, we know the asymptotic distribution of √n(θ̂ − θ∗) is Gaussian. Thus as n → ∞, n(L(θ̂) − L(θ∗)) →d (1/2)‖S‖₂² where, by Part 2 and Lemma 2.3,

S →d ∇²L(θ∗)^{1/2} N(0, ∇²L(θ∗)⁻¹ Cov(∇ℓ((x, y), θ∗)) ∇²L(θ∗)⁻¹) = N(0, ∇²L(θ∗)^{-1/2} Cov(∇ℓ((x, y), θ∗)) ∇²L(θ∗)^{-1/2}).

This proves Part 4, and Part 3 follows directly from the definition of the O_P notation.
Finally, for Part 5, using the fact that the trace operator is invariant under cyclic permutations, the fact
that E[S] = 0, and some regularity conditions,
lim_{n→∞} E[n(L(θ̂) − L(θ∗))] = E[(1/2)‖S‖₂²] = (1/2) E[tr(S^⊤S)]   (2.15)
= (1/2) E[tr(SS^⊤)] = (1/2) tr(E[SS^⊤])   (2.16)
= (1/2) tr(Cov(S))   (2.17)
= (1/2) tr(∇²L(θ∗)⁻¹ Cov(∇ℓ((x, y), θ∗))).   (2.18)
2.1.3 Well-specified case
Theorem 2.1 is powerful because it is general, avoiding any assumptions of a probabilistic model of our data.
However in many applications, we assume a model of our data and we define the log-likelihood with respect
to this model. Formally, suppose that we have a family of probability distributions Pθ , parameterized by
θ ∈ Θ, such that Pθ∗ is the true data-generating distribution. This is known as the well-specified case. To
make the results of Theorem 2.1 more applicable, we derive analogous results for this well-specified case in
Theorem 2.4.
Theorem 2.4. In addition to the assumptions of Theorem 2.1, suppose there exists a parametric model P(y | x; θ), θ ∈ Θ, such that y^(i) | x^(i) ∼ P(y^(i) | x^(i); θ∗) for some θ∗ ∈ Θ. Assume that we are performing maximum likelihood estimation (MLE), i.e. our loss function is the negative log-likelihood ℓ((x^(i), y^(i)), θ) = −log P(y^(i) | x^(i); θ). As before, let θ̂ and θ∗ denote the minimizers of empirical risk and population risk, respectively. Then

argmin_θ L(θ) = θ∗,   (2.19)
E[∇ℓ((x, y), θ∗)] = 0,   (2.20)
Cov(∇ℓ((x, y), θ∗)) = ∇²L(θ∗),   (2.21)
√n(θ̂ − θ∗) →d N(0, ∇²L(θ∗)⁻¹).   (2.22)
Remark 1: You may also have seen (2.22) in the following form: under the maximum likelihood
estimation (MLE) paradigm, the MLE is asymptotically efficient as it achieves the Cramer-Rao lower bound.
That is, the parameter error of the MLE estimate converges in distribution to N (0, I(θ)−1 ), where I(θ) is
the Fisher information matrix (in this case, equivalent to the risk Hessian ∇2 L(θ∗ )) [Rice, 2006].
Remark 2: (2.21) is also known as Bartlett’s identity [Liang, 2016].
Although the proofs were not presented in live lecture, we include them here.
Proof. From the definition of the population loss,

L(θ) = E[ℓ((x^(i), y^(i)), θ)]   (2.23)
= E[−log P(y | x; θ)]   (2.24)
= E[−log P(y | x; θ) + log P(y | x; θ∗)] + E[−log P(y | x; θ∗)]   (2.25)
= E[ log ( P(y | x; θ∗) / P(y | x; θ) ) ] + E[−log P(y | x; θ∗)].   (2.26)
Notice that the second term is a constant which we will express as H(y | x; θ∗). We expand the first term using the tower rule (or law of total expectation):

L(θ) = E_x[ E[ log ( P(y | x; θ∗) / P(y | x; θ) ) | x ] ] + H(y | x; θ∗).   (2.27)
The term in the inner expectation is just the KL divergence between the two conditional distributions, so

L(θ) = E_x[ KL( P(y | x; θ∗) ‖ P(y | x; θ) ) ] + H(y | x; θ∗) ≥ H(y | x; θ∗),   (2.28)

since KL divergence is always non-negative. Since θ∗ makes the KL divergence term 0, it minimizes L(θ) and so θ∗ ∈ argmin_θ L(θ). However, the minimizer of L(θ) is unique because of consistency, so we must have argmin_θ L(θ) = θ∗, which proves (2.19).
For (2.20), recall ∇L(θ∗) = 0, so we have

0 = ∇L(θ∗) = ∇E[ℓ((x^(i), y^(i)), θ∗)] = E[∇ℓ((x^(i), y^(i)), θ∗)],   (2.30)
where we can switch the gradient and expectation under some regularity conditions.
To prove (2.21), we first expand the RHS using the definition of covariance and express the expectations as integrals. Since ∇ℓ((x, y), θ∗) = −∇log P(y | x; θ∗) = −∇P(y | x; θ∗)/P(y | x; θ∗) and E[∇ℓ((x, y), θ∗)] = 0 by (2.20),

Cov(∇ℓ((x, y), θ∗)) = E[∇ℓ((x, y), θ∗) ∇ℓ((x, y), θ∗)^⊤] = ∫∫ P(x) ( ∇P(y | x; θ∗) ∇P(y | x; θ∗)^⊤ / P(y | x; θ∗) ) dy dx.   (2.33)

Now we expand the LHS using the definition of the population loss and differentiate repeatedly; using the fact that ∫ P(y | x; θ) dy = 1 (so the terms involving ∇²P(y | x; θ∗) integrate to zero), we find

∇²L(θ∗) = ∫∫ P(x) ( ∇P(y | x; θ∗) ∇P(y | x; θ∗)^⊤ / P(y | x; θ∗) ) dy dx = Cov(∇ℓ((x, y), θ∗)).   (2.37)
Finally, (2.22) follows directly from Part 2 of Theorem 2.1 and (2.21).
Using similar logic to our proof of Parts 4 and 5 of Theorem 2.1, we can see that n(L(θ̂) − L(θ∗)) →d (1/2)‖S‖₂² where S ∼ N(0, I). Since a chi-squared distribution with p degrees of freedom is defined as a sum of the squares of p independent standard normals, it quickly follows that 2n(L(θ̂) − L(θ∗)) →d χ²(p) as n → ∞, where θ ∈ R^p. We can thus characterize the excess risk in this case using the properties of a chi-squared distribution:

lim_{n→∞} E[n(L(θ̂) − L(θ∗))] = p/2, i.e. E[L(θ̂) − L(θ∗)] ≈ p/(2n) for large n.   (2.38)
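The following small simulation (not from the notes; the Gaussian-mean model and the sample sizes are illustrative) checks this chi-squared behavior: for maximum likelihood estimation of the mean of N(θ∗, I_p), the scaled excess risk 2n(L(θ̂) − L(θ∗)) should have mean approximately p and variance approximately 2p.

```python
# Well-specified case: MLE of the mean of N(theta*, I_p). The negative log-likelihood loss
# gives excess risk ||theta_hat - theta*||^2 / 2, so 2n * excess risk ~ chi^2(p).
import numpy as np

rng = np.random.default_rng(0)
p, n, num_trials = 10, 200, 5000
theta_star = rng.normal(size=p)

scaled_excess = []
for _ in range(num_trials):
    X = theta_star + rng.normal(size=(n, p))
    theta_hat = X.mean(axis=0)                       # MLE of the Gaussian mean
    excess = 0.5 * np.sum((theta_hat - theta_star) ** 2)
    scaled_excess.append(2 * n * excess)
scaled_excess = np.array(scaled_excess)

print(f"mean of 2n * excess risk: {scaled_excess.mean():.2f}   (chi^2(p) mean = {p})")
print(f"var  of 2n * excess risk: {scaled_excess.var():.2f}   (chi^2(p) variance = {2 * p})")
```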
2.2 Limitations of asymptotic analysis

The asymptotic result (2.38) says that, in expectation,

L(θ̂) − L(θ∗) ≤ p/(2n) + o(1/n).   (2.39)

(Here o(·) treats the parameter p as a constant as n goes to infinity.) We have no idea how large n needs to be for asymptotic bounds to be "reasonable." Compare two possible versions of (2.39):

p/(2n) + 1/n²   vs.   p/(2n) + p¹⁰⁰/n².   (2.40)
Asymptotic analysis treats both of these bounds as the same, hiding the polynomial dependence on p in the second bound. Clearly, the second bound is significantly more data-intensive than the first: we would need n > p⁵⁰ for p¹⁰⁰/n² to be less than one. Since p represents the dimensionality of the data, this may be an unreasonable assumption.
This is where non-asymptotic analysis can be helpful. Whereas asymptotic analysis uses large-sample
theorems such as the central limit theorem and the law of large numbers to provide convergence guarantees,
non-asymptotic analysis relies on concentration inequalities to develop alternative techniques for reasoning
about the performance of learning algorithms.
Chapter 3
Concentration Inequalities
In this chapter, we take a little diversion and develop the notion of concentration inequalities. Assume that
we have independent random variables X1 , . . . , Xn . We will develop tools to show results that formalize the
intuition for these statements:
1. X1 + . . . + Xn concentrates around E[X1 + . . . + Xn ].
2. More generally, f (X1 , . . . , Xn ) concentrates around E[f (X1 , . . . , Xn )].
These inequalities will be used in subsequent chapters to bound several key quantities of interest.
As it turns out, the material from this chapter constitutes arguably the most important mathematical tools in the entire course. No matter what area of machine learning one wants to study, if it involves sample complexity, some kind of concentration result will typically be required. Hence, concentration inequalities are some of the most important tools in modern statistical learning theory.
Theorem 3.1 (Chebyshev’s inequality). Let Z be a random variable with finite expectation and variance.
Then
Pr[|Z − E[Z]| ≥ t] ≤ Var(Z)/t²,   ∀t > 0.   (3.1)
Intuitively, this means that as we approach the tails of the distribution of Z, the tail probability decreases at a rate of at least 1/t². Moreover, for any δ ∈ (0, 1], by plugging t = sd(Z)/√δ into (3.1) we see that

Pr[ |Z − E[Z]| ≤ sd(Z)/√δ ] ≥ 1 − δ.   (3.2)
Unfortunately, it turns out that Chebyshev’s inequality is a rather weak concentration inequality. To
illustrate this, assume Z ∼ N (0, 1). We can show (using the Gaussian tail bound derived in Problem 3(c)
in Homework 0) that

Pr[ |Z − E[Z]| ≤ sd(Z)·√(2 log(2/δ)) ] ≥ 1 − δ   (3.3)
for any δ ∈ (0, 1]. In other words, the density at the tails of the normal distribution is decreasing at an exponential rate, while Chebyshev's inequality only gives a quadratic rate. The discrepancy between (3.2) and (3.3) is made more apparent when we consider inverse-polynomial δ = 1/n^c for some parameter n and degree c (we will see concrete instances of this setup in future chapters). Then the tail bound for the normal distribution (3.3) implies that

|Z − E[Z]| ≤ sd(Z)·√(log O(n^c)) = sd(Z)·O(√(log n))   w.p. 1 − δ,   (3.4)

whereas Chebyshev's inequality (3.2) only gives |Z − E[Z]| ≤ sd(Z)·n^{c/2} with the same probability.
Chebyshev’s inequality is actually optimal without further assumptions, in the sense that there exist
distributions with finite variance for which the bound is tight. However, in many cases, we will be able to
improve the 1/t2 rate of tail decay in Chebyshev’s inequality to an e−t rate. In the next two sections, we
will demonstrate how to construct tail bounds with exponential decay rates.
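Before moving on, a small numeric comparison (not part of the original notes) makes the gap concrete: for Z ∼ N(0, 1), the table below compares the true two-sided tail with Chebyshev's 1/t² bound and the Gaussian-style bound 2 exp(−t²/2). The grid of t values is arbitrary.

```python
# Compare the true tail of a standard normal with the Chebyshev and exponential bounds.
import math

def true_tail(t):
    # Pr[|Z| >= t] for Z ~ N(0, 1), via the complementary error function.
    return math.erfc(t / math.sqrt(2.0))

print(f"{'t':>4} {'true tail':>12} {'Chebyshev':>12} {'2exp(-t^2/2)':>14}")
for t in [1.0, 2.0, 3.0, 4.0, 5.0]:
    print(f"{t:4.1f} {true_tail(t):12.2e} {1.0 / t**2:12.2e} {2 * math.exp(-t**2 / 2):14.2e}")
```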
3.3 Hoeffding's inequality

Theorem 3.2 (Hoeffding's inequality). Let X₁, . . . , Xₙ be independent random variables with aᵢ ≤ Xᵢ ≤ bᵢ almost surely, and let X̄ = (1/n) Σ_{i=1}^n Xᵢ with µ = E[X̄]. Then for any ε > 0,

Pr[ |X̄ − µ| ≤ ε ] ≥ 1 − 2 exp( −2n²ε² / Σ_{i=1}^n (bᵢ − aᵢ)² ).   (3.6)
Note that the denominator within the exponential term, Σ_{i=1}^n (bᵢ − aᵢ)², can be thought of as an upper bound or proxy for the variance of the Xᵢ's. In fact, under the independence assumption, we can show

Var(X̄) = (1/n²) Σ_{i=1}^n Var(Xᵢ) ≤ (1/n²) Σ_{i=1}^n (bᵢ − aᵢ)².   (3.7)
Let σ² = (1/n²) Σ_{i=1}^n (bᵢ − aᵢ)². If we take ε = O(σ√(log n)) = σ√(c log n), so that ε is bounded above by some large (i.e., c ≥ 10) multiple of σ times √(log n), we can substitute this value of ε into (3.6) to reach the following conclusion:

Pr[ |X̄ − µ| ≤ ε ] ≥ 1 − 2 exp(−2ε²/σ²)   (3.8)
= 1 − 2 exp(−2c log n)   (3.9)
= 1 − 2n^{−2c}.   (3.10)

We can see that as n grows, the right-most term tends to zero such that Pr[|X̄ − µ| ≤ ε] very quickly approaches 1. Intuitively, this result tells us that, with high probability, the sample mean X̄ will not deviate from the population mean µ by more than a sublogarithmic (√(c log n)) factor of the standard deviation.¹ Thus, we can restate this claim as follows:

Remark 3.3. For sufficiently large n, |X̄ − µ| ≤ O(σ√(log n)) with high probability.
Remark 3.4. If, in addition, we have aᵢ = −O(1) and bᵢ = O(1), then σ² = O(1/n), and |X̄ − µ| ≤ O(√(log n / n)) = Õ(1/√n).²
Remark 3.4 provides a compact form of the Hoeffding bound that we can use when the Xi are bounded
almost surely.
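As an illustration (not part of the original notes), the following sketch checks (3.6) empirically for i.i.d. Bernoulli(1/2) variables, for which aᵢ = 0 and bᵢ = 1; the deviation ε and the sample sizes are arbitrary choices.

```python
# Empirical check of Hoeffding's inequality for Bernoulli(1/2) samples.
import numpy as np

rng = np.random.default_rng(0)
n, num_trials, eps = 200, 100_000, 0.1
mu = 0.5

X = rng.integers(0, 2, size=(num_trials, n))        # Bernoulli(1/2) samples
deviation_freq = np.mean(np.abs(X.mean(axis=1) - mu) > eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)       # failure-probability form of (3.6)
print(f"empirical Pr[|Xbar - mu| > eps] = {deviation_freq:.4f}")
print(f"Hoeffding bound                 = {hoeffding_bound:.4f}")
```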
So far, we have only shown how to construct exponential tail bounds for bounded random variables.
Since requiring boundedness in [0, 1] (or [a, b] more generally) is limiting, it is worth asking what types
of distributions permit such an exponential tail bound. The following section will explore such a class of
random variables: sub-Gaussian random variables.
3.4 Sub-Gaussian random variables

Definition 3.5 (Sub-Gaussian random variable). A random variable X with mean µ = E[X] is σ-sub-Gaussian (with variance proxy σ²) if for all λ ∈ R,

E[exp(λ(X − µ))] ≤ exp(λ²σ²/2).   (3.11)

Remark 3.6. A bound on the moment generating function is a bound on infinitely many moments of X, i.e. a requirement that the moments of X are all finite and grow slowly enough to allow the power series to converge. Though a proof of this result is beyond the scope of this monograph, Proposition 2.5.2 in [Vershynin, 2018] shows that (3.11) is equivalent to E[|X|^p]^{1/p} ≲ √p for all p ≥ 1.
1 This is with the caveat, of course, that σ is not exactly the standard deviation but a loose upper bound on standard
deviation.
² Õ is analogous to Big-O notation, but Õ hides logarithmic factors. That is, if f(n) = O(log n), then f(n) = Õ(1).
Although (3.11) is not a particularly intuitive definition, it turns out to imply exactly the type of exponential tail bound we want:

Theorem 3.7 (Tail bound for sub-Gaussian random variables). If a random variable X with finite mean µ is σ-sub-Gaussian, then

Pr[|X − µ| ≥ t] ≤ 2 exp( −t²/(2σ²) ),   ∀t ≥ 0.   (3.13)
Proof. For any λ > 0 and t > 0, Markov's inequality applied to the non-negative random variable exp(λ(X − µ)) gives

Pr[X − µ ≥ t] = Pr[exp(λ(X − µ)) ≥ exp(λt)] ≤ exp(−λt) E[exp(λ(X − µ))] ≤ exp(−λt + σ²λ²/2).   (3.17)

Because the bound (3.17) holds for any choice of λ > 0 and exp(·) is monotonically increasing, we can optimize the bound (3.17) by finding the λ which minimizes the exponent −λt + σ²λ²/2. Differentiating and setting the derivative equal to zero, we find that the optimal choice is λ = t/σ², yielding the one-sided tail bound

Pr[X − µ ≥ t] ≤ exp( −t²/(2σ²) ).   (3.18)

Going through the same line of reasoning but for −X and −t, we can also show that for any t > 0,

Pr[X − µ ≤ −t] ≤ exp( −t²/(2σ²) ).   (3.19)

Combining the two one-sided bounds,

Pr[|X − µ| ≥ t] = Pr[X − µ ≥ t] + Pr[X − µ ≤ −t] ≤ 2 exp( −t²/(2σ²) ).   (3.20)
Remark 3.8 (Tail bound implies sub-Gaussianity). In addition to being a necessary condition for sub-
Gaussianity (Theorem 3.7), the tail bound (3.13) for sub-Gaussian random variables is also a sufficient
condition up to a constant factor. In particular, if a random variable X with finite mean µ satisfies (3.13)
for some σ > 0, then X is O(σ)-sub-Gaussian. Unfortunately, the proof of this reverse direction is somewhat
more involved, so we refer the interested reader to Theorem 2.6 and its proof in Section 2.4 of [Wainwright,
2019] and Proposition 2.5.2 in [Vershynin, 2018] for details. While the tail bound is the property we ulti-
mately care about most when studying sub-Gaussian random variables, the definition in (3.11) is a more
technically convenient characterization, as we will see in the proof of Theorem 3.10.
Remark 3.9. Note that in light of Remark 3.6, the tail bound (3.3) requires all central moments of X to exist
and not grow too quickly. In contrast, Chebyshev’s inequality (and more generally any polynomial variant
of Markov’s inequality Pr[|X − µ| ≥ t] = Pr[|X − µ|k ≥ tk ] ≤ t−k E[|X − µ|k ]) only requires that the second
central moment E[(X −µ)2 ] (more generally, the kth central moment E[|X −µ|k ]) is finite to yield a tail bound.
If all moments exist, however, it turns out that inf_{k∈N} t^{−k} E[|X − µ|^k] ≤ inf_{λ>0} exp(−λt) E[exp(λ(X − µ))], i.e. the optimal polynomial tail bound is tighter than the optimal exponential tail bound (see Exercise 2.3 in [Wainwright, 2019]). As we will see shortly though, using exponential functions of random variables
allows us to prove results about sums of random variables more conveniently. This “tensorization” property
is why most researchers use exponential tail bounds in practice.
Having defined and derived exponential tail bounds for sub-Gaussian random variables, we can now accomplish the first of the goals we set out at the beginning of the chapter: show that under certain conditions, namely independence and sub-Gaussianity of X₁, . . . , Xₙ, the sum Z = Σ_{i=1}^n Xᵢ concentrates around E[Z] = E[Σ_{i=1}^n Xᵢ].
Theorem 3.10 (Sum of sub-Gaussian random variables is sub-Gaussian). If X₁, . . . , Xₙ are independent sub-Gaussian random variables with variance proxies σ₁², . . . , σₙ², then Z = Σ_{i=1}^n Xᵢ is sub-Gaussian with variance proxy Σ_{i=1}^n σᵢ². As a consequence, we have the tail bound

Pr[|Z − E[Z]| ≥ t] ≤ 2 exp( −t² / (2 Σ_{i=1}^n σᵢ²) )   (3.21)

for all t ≥ 0.
Proof. Using the independence of X₁, . . . , Xₙ, we have that for any λ ∈ R:

E[exp{λ(Z − E[Z])}] = E[ Π_{i=1}^n exp{λ(Xᵢ − E[Xᵢ])} ]   (3.22)
= Π_{i=1}^n E[exp{λ(Xᵢ − E[Xᵢ])}]   (3.23)
≤ Π_{i=1}^n exp(λ²σᵢ²/2)   (3.24)
= exp( λ² Σ_{i=1}^n σᵢ² / 2 ),   (3.25)

so Z is sub-Gaussian with variance proxy Σ_{i=1}^n σᵢ². The tail bound then follows immediately from (3.13).
The proof above demonstrates the value of the moment generating functions of sub-Gaussian random
variables: they factorize conveniently when dealing with sums of independent random variables.
3.4.1 Examples of sub-Gaussian random variables

Example 3.11 (Rademacher random variables). Let X be a Rademacher random variable, i.e. X = 1 or X = −1, each with probability 1/2 (so E[X] = 0). Then for any λ ∈ R,

E[exp(λX)] = (e^λ + e^{−λ})/2 = Σ_{k=0}^∞ λ^{2k}/(2k)! ≤ Σ_{k=0}^∞ (λ²/2)^k / k! = exp(λ²/2),

which is exactly the moment generating function bound (3.11) required for 1-sub-Gaussianity.
Example 3.12 (Random variables with bounded distance to mean). Suppose a random variable X satisfies
|X − E[X]| ≤ M almost surely for some constant M . Then X is O(M )-sub-Gaussian.
We now provide an even more general class of sub-Gaussian random variables that subsume the random
variables in Example 3.12:
Example 3.13 (Bounded random variables). If X is a random variable such that a ≤ X ≤ b almost surely for some constants a, b ∈ R, then

E[ e^{λ(X − E[X])} ] ≤ exp( λ²(b − a)²/8 ),

i.e., X is sub-Gaussian with variance proxy (b − a)²/4. (We will prove this in Question 2(a) of Homework 1.) Note that combining the (b − a)/2-sub-Gaussianity of i.i.d. bounded random variables X₁, . . . , Xₙ and Theorem 3.10 yields a proof of Hoeffding's inequality.
Example 3.14 (Gaussian random variables). If X is Gaussian with variance σ 2 , then X satisfies (3.11)
with equality. In this special case, the variance and the variance proxy are the same.
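As a small numeric spot check (not from the notes), the sketch below verifies the MGF bound of Example 3.13 for a Bernoulli(p) variable, which is bounded in [0, 1]; the value of p and the grid of λ's are arbitrary.

```python
# Check E[exp(lambda(X - E X))] <= exp(lambda^2 (b - a)^2 / 8) = exp(lambda^2 / 8) for Bernoulli(p).
import numpy as np

p = 0.3
for lam in [-4.0, -1.0, 0.5, 1.0, 2.0, 4.0]:
    # Exact MGF of the centered Bernoulli: p e^{lam(1-p)} + (1-p) e^{-lam p}.
    mgf = p * np.exp(lam * (1 - p)) + (1 - p) * np.exp(-lam * p)
    bound = np.exp(lam**2 / 8)
    print(f"lambda = {lam:5.1f}: MGF = {mgf:8.4f}  <=  bound = {bound:8.4f}  ({mgf <= bound})")
```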
3.5 Concentrations of functions of random variables

Theorem 3.15 (McDiarmid's inequality). Let f be a function of n variables that satisfies the bounded difference condition: for every index i and every x₁, . . . , xₙ and xᵢ',

|f(x₁, . . . , xᵢ₋₁, xᵢ, xᵢ₊₁, . . . , xₙ) − f(x₁, . . . , xᵢ₋₁, xᵢ', xᵢ₊₁, . . . , xₙ)| ≤ cᵢ.   (3.31)

(Intuitively, (3.31) states that f is not overly sensitive to arbitrary changes in a single coordinate.) Then, for any independent random variables X₁, . . . , Xₙ,

Pr[ f(X₁, . . . , Xₙ) − E[f(X₁, . . . , Xₙ)] ≥ t ] ≤ exp( −2t² / Σ_{i=1}^n cᵢ² ).   (3.32)

Moreover, f(X₁, . . . , Xₙ) is O(√(Σ_{i=1}^n cᵢ²))-sub-Gaussian.
Remark 3.16. Note that McDiarmid's inequality is a generalization of Hoeffding's inequality with aᵢ ≤ xᵢ ≤ bᵢ and

f(x₁, . . . , xₙ) = Σ_{i=1}^n xᵢ.   (3.33)
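To see the inequality in action, here is an illustrative simulation (not part of the original notes): the function f counts the number of distinct symbols among n i.i.d. draws from an alphabet of size k, so changing one coordinate changes f by at most 1 and (3.32) applies with cᵢ = 1. The alphabet size, sample size, and threshold t are arbitrary choices.

```python
# McDiarmid's inequality for f = number of distinct symbols (bounded differences c_i = 1).
import numpy as np

rng = np.random.default_rng(0)
n, k, num_trials, t = 100, 50, 50_000, 6.0

samples = rng.integers(0, k, size=(num_trials, n))
f_vals = np.array([len(np.unique(row)) for row in samples])  # f = number of distinct symbols
mean_f = f_vals.mean()

empirical = np.mean(f_vals - mean_f >= t)
mcdiarmid = np.exp(-2 * t**2 / n)                            # (3.32) with c_i = 1 for all i
print(f"E[f] ~ {mean_f:.2f}")
print(f"empirical Pr[f - E f >= {t}] = {empirical:.5f}")
print(f"McDiarmid bound              = {mcdiarmid:.5f}")
```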
Proof. The idea of this proof is to take the quantity f (X1 , . . . , Xn ) − E[f (X1 , . . . , Xn )] and break it into
manageable components by conditioning on portions of the sample. To this end, we begin by defining:
Z0 = E [f (X1 , . . . , Xn )] constant
Z1 = E [f (X1 , . . . , Xn )|X1 ] a function of X1
···
Zi = E [f (X1 , . . . , Xn )|X1 , . . . , Xi ] a function of X1 , . . . , Xi
···
Zn = f (X1 , . . . , Xn )
20
Using the law of total expectation, we show also that the expectation of Zi equals Z0 for all i.
E[Zi ] = E [E [f (X1 , . . . , Xn )|X1 , . . . , Xi ]]
= E[f (X1 , . . . , Xn )]
= Z0
The fact that E[Di ] = 0, where Di = Zi − Zi−1 , is an immediate corollary of this result. Next, we observe
that we can rewrite the quantity of interest, Zn − Z0 , as a telescoping sum in the increments Zi − Zi−1 :
Zn − Z0 = (Zn − Zn−1 ) + (Zn−1 − Zn−2 ) + · · · + (Z1 − Z0 )
Xn
= Di
i=1
Next, we show that conditional on X1 , . . . , Xi−1 , Di is a bounded random variable. First, observe that:
Aᵢ = inf_x E[f(X₁, . . . , Xₙ) | X₁, . . . , Xᵢ₋₁, Xᵢ = x] − E[f(X₁, . . . , Xₙ) | X₁, . . . , Xᵢ₋₁]
Bᵢ = sup_x E[f(X₁, . . . , Xₙ) | X₁, . . . , Xᵢ₋₁, Xᵢ = x] − E[f(X₁, . . . , Xₙ) | X₁, . . . , Xᵢ₋₁]
It is clear from their definitions that Aᵢ ≤ Dᵢ ≤ Bᵢ. Furthermore, by independence of the Xᵢ's, we have that

Bᵢ − Aᵢ ≤ sup_{x₁:ᵢ₋₁} sup_{x, x'} ∫ ( f(x₁, . . . , xᵢ₋₁, x, xᵢ₊₁, . . . , xₙ) − f(x₁, . . . , xᵢ₋₁, x', xᵢ₊₁, . . . , xₙ) ) dP(xᵢ₊₁, . . . , xₙ) ≤ cᵢ.

Using this bound, the properties of conditional expectation, and Example 3.13, we can now prove that Zₙ − Z₀ is O(√(Σ_{i=1}^n cᵢ²))-sub-Gaussian:

E[ e^{λ(Zₙ − Z₀)} ] = E[ e^{λ Σ_{i=1}^n (Zᵢ − Zᵢ₋₁)} ]
= E[ E[ e^{λ(Zₙ − Zₙ₋₁)} | X₁, . . . , Xₙ₋₁ ] · e^{λ Σ_{i=1}^{n−1} (Zᵢ − Zᵢ₋₁)} ]
≤ e^{λ²cₙ²/8} E[ e^{λ Σ_{i=1}^{n−1} (Zᵢ − Zᵢ₋₁)} ]
· · ·
≤ e^{λ² (Σ_{i=1}^n cᵢ²)/8}.

The tail bound (3.32) then follows from the Chernoff argument used in (3.17) and (3.18).
Here Dᵢ⁺f(x) and Dᵢ⁻f(x) denote the one-sided discrete derivatives of f at x (see [van Handel, 2016]): they measure how much f can increase, respectively decrease, when only the i-th coordinate of x is changed. These two quantities are functions of x ∈ Rⁿ, and hence can be interpreted as describing the sensitivity of f at a particular point. (Contrast this with the bounded difference condition (3.31), which bounds the sensitivity of f universally over all points.) For convenience, define

d₊ = ‖ Σ_{i=1}^n |Dᵢ⁺f|² ‖_∞ = sup_{x₁,...,xₙ} Σ_{i=1}^n [Dᵢ⁺f(x₁, . . . , xₙ)]²   (3.36)
d₋ = ‖ Σ_{i=1}^n |Dᵢ⁻f|² ‖_∞ = sup_{x₁,...,xₙ} Σ_{i=1}^n [Dᵢ⁻f(x₁, . . . , xₙ)]².   (3.37)
Theorem 3.17 (Bounded difference inequality, Theorem 3.18 in [van Handel, 2016]). Let f : Rⁿ → R, and let X₁, . . . , Xₙ be independent random variables. Then, for all t ≥ 0,

Pr[ f(X₁, . . . , Xₙ) ≥ E[f(X₁, . . . , Xₙ)] + t ] ≤ exp( −t²/(4d₋) )   (3.38)
Pr[ f(X₁, . . . , Xₙ) ≤ E[f(X₁, . . . , Xₙ)] − t ] ≤ exp( −t²/(4d₊) ).   (3.39)
Before introducing the next theorem, we recall that a function f : Rⁿ → R is L-Lipschitz with respect to the ℓ₂-norm if there exists a non-negative constant L ∈ R such that for all x, y ∈ Rⁿ,

|f(x) − f(y)| ≤ L‖x − y‖₂.   (3.41)

Theorem 3.18 (Concentration of Lipschitz functions of Gaussians). Let X ∼ N(0, I_{n×n}) and let f : Rⁿ → R be L-Lipschitz with respect to the ℓ₂-norm. Then for all t ≥ 0,

Pr[ |f(X) − E[f(X)]| ≥ t ] ≤ 2 exp( −t²/(2L²) ).   (3.42)
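As a quick empirical illustration (not from the notes), the following sketch applies the bound to the 1-Lipschitz function f(x) = ‖x‖₂ of a standard Gaussian vector; the dimension, number of trials, and threshold t are arbitrary choices.

```python
# Gaussian Lipschitz concentration for f(X) = ||X||_2, which is 1-Lipschitz in the l2-norm.
import numpy as np

rng = np.random.default_rng(0)
n_dim, num_trials, t = 100, 100_000, 2.0

X = rng.normal(size=(num_trials, n_dim))
f_vals = np.linalg.norm(X, axis=1)            # f(X) = ||X||_2, Lipschitz constant L = 1
mean_f = f_vals.mean()                        # ~ sqrt(n_dim) for large n_dim

empirical = np.mean(np.abs(f_vals - mean_f) >= t)
bound = 2 * np.exp(-t**2 / 2)                 # RHS of (3.42) with L = 1
print(f"E[f(X)] ~ {mean_f:.2f}")
print(f"empirical Pr[|f(X) - E f(X)| >= {t}] = {empirical:.5f}")
print(f"bound 2 exp(-t^2 / (2 L^2))          = {bound:.5f}")
```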
Chapter 4
In Chapter 2, we pointed out some limitations of asymptotic analysis. In this chapter, we will turn our
focus to non-asymptotic analysis, where we provide convergence guarantees without having the number of
observations n go off to infinity. A key tool for proving such guarantees is uniform convergence, where we
have bounds of the following form:
Pr[ sup_{h∈H} |L̂(h) − L(h)| ≤ ε ] ≥ 1 − δ.   (4.1)

In other words, the probability that the difference between our empirical loss and population loss is larger than ε for some h ∈ H is at most δ. We give motivation for uniform convergence and show how it can give us non-asymptotic guarantees on excess risk.
In other words, uniform convergence tells us that for any choice of θ, our empirical risk is always close to
our population risk with high probability. Let’s look at a motivating example for why this type of bound is
useful.
We know that L̂(θ̂) − L̂(θ∗) ≤ 0 since θ̂ is a minimizer of L̂. This allows us to write

L(θ̂) − L(θ∗) = [L(θ̂) − L̂(θ̂)] + [L̂(θ̂) − L̂(θ∗)] + [L̂(θ∗) − L(θ∗)]
≤ [L(θ̂) − L̂(θ̂)] + [L̂(θ∗) − L(θ∗)]
≤ 2 sup_{θ∈Θ} |L(θ) − L̂(θ)|.   (4.6)

This result tells us that if sup_{θ∈Θ} |L(θ) − L̂(θ)| is small (say, less than ε/2), then the excess risk L(θ̂) − L(θ∗) is less than ε. But this is exactly in the form of the bound in (4.2). Hence, if we can show that a parameter family exhibits uniform convergence, we can get a bound on excess risk as well.
For future reference, Equation (4.6) can be strengthened straightforwardly into the following with slightly more careful treatment of the signs of each term:

L(θ̂) − L(θ∗) ≤ |L̂(θ∗) − L(θ∗)| + [L(θ̂) − L̂(θ̂)] ≤ |L̂(θ∗) − L(θ∗)| + sup_{θ∈Θ} ( L(θ) − L̂(θ) ).   (4.7)
This will make some of our future derivations technically slightly more convenient, but the nuanced difference
between Equations (4.6) and (4.7) does not change the fundamental idea and the discussions in this chapter.
Let us try to apply our knowledge of concentration inequalities to this problem. Earlier we assumed that ℓ((x, y); θ) is bounded, so we can bound the term |L̂(θ∗) − L(θ∗)| by Õ(1/√n) via Hoeffding's inequality (Remark 3.4). However, we cannot apply the same concentration inequality to the term L(θ̂) − L̂(θ̂): since θ̂ is data-dependent by definition, the i.i.d. assumption no longer holds. (To see this, note that θ̂ depends on the training dataset {(x^(i), y^(i))}, so the terms in L̂(θ̂), namely ℓ((x^(i), y^(i)); θ̂), all depend on the training dataset too.) This is concerning: it is certainly possible that L(θ̂) − L̂(θ̂) is large. You've probably encountered this yourself when a model exhibits low training loss, but high validation/testing loss.
Two observations suggest a way forward:

• For a fixed θ ∈ Θ chosen independently of the data, Hoeffding's inequality bounds |L̂(θ) − L(θ)| with high probability, just as above.

• If we know all possible values of θ in advance, we can use the above bound to create a more general bound over all values of θ.

In particular, we can use the union-bound inequality to create the general bound described in the second bullet point, using the bound in the first bullet point:

Pr[ ∃θ ∈ Θ : |L̂(θ) − L(θ)| ≥ ε₀ ] ≤ Σ_{θ∈Θ} Pr[ |L̂(θ) − L(θ)| ≥ ε₀ ].   (4.8)

We can then use Hoeffding's inequality to deal with the summands, since each θ there is fixed and no longer data-dependent. We will talk more later about proving statements of this form.
Figure 4.1: These curves demonstrate how we apply uniform convergence to bound the population risk (both panels plot risk L against the parameter θ). The blue curves are the unobserved population risk we aim to bound. The green curves denote the empirical risk we observe. Though this curve is often depicted as the fluctuating curve used in Figure 4.1a, it is more often a smooth curve whose shape mimics that of the population risk (Figure 4.1b). Uniform convergence allows us to construct additive error bounds for the excess risk, which are depicted using the red, dashed lines.
For a single fixed h ∈ H, Hoeffding's inequality gives

Pr[ |L̂(h) − L(h)| ≥ ε ] ≤ 2 exp( −2n²ε² / Σ_{i=1}^n (bᵢ − aᵢ)² )   (4.11)
= 2 exp( −2n²ε² / n )   (4.12)
= 2 exp(−2nε²),   (4.13)
since we can set aᵢ = 0, bᵢ = 1. The bound above holds for a single fixed h. To prove a similar inequality that holds for all h ∈ H, we apply the union bound with E_h = {|L̂(h) − L(h)| ≥ ε}:

Pr[ ∃h ∈ H s.t. |L̂(h) − L(h)| ≥ ε ] ≤ Σ_{h∈H} Pr[ |L̂(h) − L(h)| ≥ ε ]   (4.14)
≤ Σ_{h∈H} 2 exp(−2nε²).   (4.15)
The bound here depends on each h. In contrast, the uniform convergence bound we obtain from (4.17) is uniform over all h ∈ H:

w.h.p., ∀h ∈ H, |L̂(h) − L(h)| ≤ Õ( √( ln|H| / n ) ),   (4.22)

if we omit the ln(1/δ) factor (we can do this since ln(1/δ) is small in general and we take δ = 1/poly(n)). Hence, the extra ln|H| term that depends on the size of our finite hypothesis family H can be viewed as a trade-off in order to make the bound uniform.
Remark 4.2. There is no standard definition for the term with high probability (w.h.p). For this class, the
term is equivalent to the condition that the probability is higher than 1 − n−c for some constant c.
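The following toy experiment (not part of the original notes) illustrates the finite-class bound: for a class of |H| threshold classifiers on one-dimensional data with label noise, it compares the observed sup_{h∈H}|L̂(h) − L(h)| against the √(ln(2|H|/δ)/(2n)) guarantee that Hoeffding plus the union bound provides. The data distribution, noise level, and class size are illustrative choices.

```python
# Uniform convergence over a finite class of threshold classifiers h_t(x) = 1{x > t}.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, num_trials, delta = 500, 50_000, 200, 0.05
thresholds = np.linspace(0.0, 1.0, 101)          # |H| = 101 threshold classifiers

def sample(m):
    x = rng.uniform(size=m)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(size=m) < 0.1             # 10% label noise
    return x, np.where(flip, 1 - y, y)

def losses(x, y):
    # 0-1 loss of every h_t, returned as a vector indexed by t.
    preds = (x[None, :] > thresholds[:, None]).astype(int)
    return (preds != y[None, :]).mean(axis=1)

x_test, y_test = sample(n_test)
pop_loss = losses(x_test, y_test)                # large-sample proxy for the population risk L(h)

sup_gaps = [np.max(np.abs(losses(*sample(n)) - pop_loss)) for _ in range(num_trials)]
bound = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * n))
print(f"95th percentile of sup_h |L_hat - L| over trials: {np.quantile(sup_gaps, 0.95):.3f}")
print(f"union-bound guarantee at confidence 1 - delta:    {bound:.3f}")
```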
4.2.2 Comparing Theorem 4.1 with asymptotic bounds
We can also compare the bound in Theorem 4.1 with our original asymptotic bound, namely,
L(ĥ) − L(h∗) ≤ c/n + o(n⁻¹).   (4.23)
The o(n−1 ) term can vary significantly depending on the problem. For instance, both n−2 and p100 n−2
are o(n−1 ) but the second one converges much more slowly. With the new bound, there are no longer any
constants hidden in an o(n−1 ) term (in fact that term is no longer there). However, we now have a slower
convergence rate of O(n−1/2 ).
Remark 4.3. O(n−1/2 ) convergence is sometimes known as the slow rate while O(n−1 ) convergence is known
as the fast rate. We were only able to get the slow rate from uniform convergence: we needed asymptotics
to get the fast rate. (It is possible to get the fast rate from uniform convergence under certain conditions,
e.g. when the population risk on the true h∗ is very low.)
Definition 4.4 (ε-cover). A set C is an ε-cover of S with respect to a metric ρ if for every x ∈ S, there exists x' ∈ C such that ρ(x, x') ≤ ε. (We note that in some definitions it is possible for points in C to lie outside of S; we do not worry about this technicality in this class.) The following lemma tells us that our parameter space S = {θ ∈ Rᵖ : ‖θ‖₂ ≤ B} has an ε-cover with not too many elements:

Lemma 4.5 (ε-cover of ℓ₂ ball). Let B, ε > 0, and let S = {x ∈ Rᵖ : ‖x‖₂ ≤ B}. Then there exists an ε-cover of S with respect to the ℓ₂-norm with at most max( (3B√p/ε)ᵖ, 1 ) elements.
Proof. Note that if ε > B√p, then S is trivially contained in the ball centered at the origin with radius ε, and the ε-cover has size 1. Assume ε ≤ B√p. Set

C = { x ∈ S : xᵢ = kᵢ · ε/√p, kᵢ ∈ Z, |kᵢ| ≤ B√p/ε },   (4.27)

i.e. C is the set of grid points in Rᵖ of width ε/√p that are contained in S. See Figure 4.2 for an illustration.

Figure 4.2: The ε-cover (shown in red) of S that we construct in the proof of Lemma 4.5. For x ∈ S, we choose the grid point x' such that ‖x − x'‖₂ ≤ ε.

We claim that C is an ε-cover of S with respect to the ℓ₂-norm: for all x ∈ S, there exists a grid point x' ∈ C such that |xᵢ − xᵢ'| ≤ ε/√p for each i. Therefore,

‖x − x'‖₂ = √( Σ_{i=1}^p |xᵢ − xᵢ'|² ) ≤ √( p · ε²/p ) = ε.

We now bound the size of C. Since each kᵢ in the definition of C has at most 2B√p/ε + 1 choices, we have

|C| ≤ ( 2B√p/ε + 1 )ᵖ ≤ ( 3B√p/ε )ᵖ.   (4.28)
Remark 4.6. We can actually prove a stronger version of Lemma 4.5: there exists an ε-cover of S with at most (3B/ε)ᵖ elements. We will be using this version of the lemma in the proof below. (We will leave the proof of this stronger version as a homework exercise.)
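As a concrete illustration (not from the notes), the sketch below constructs the grid cover from the proof of Lemma 4.5 in a tiny dimension and checks its size and covering property; the values of p, B, and ε are arbitrary, and p is kept small because |C| grows exponentially in p.

```python
# Grid eps-cover of the l2 ball S = {x in R^p : ||x||_2 <= B}, as in (4.27).
import itertools
import numpy as np

rng = np.random.default_rng(0)
p, B, eps = 2, 1.0, 0.3
width = eps / np.sqrt(p)                               # grid spacing eps / sqrt(p)
k_max = int(np.floor(B * np.sqrt(p) / eps))            # |k_i| <= B sqrt(p) / eps

grid = np.array([np.array(k) * width
                 for k in itertools.product(range(-k_max, k_max + 1), repeat=p)])
cover = grid[np.linalg.norm(grid, axis=1) <= B]        # keep grid points inside S
print(f"|C| = {len(cover)}   (Lemma 4.5 bound: {(3 * B * np.sqrt(p) / eps) ** p:.0f})")

# Empirical check: every sampled x in S has a cover point within eps.
dirs = rng.normal(size=(10_000, p))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
pts = dirs * (B * rng.uniform(size=(10_000, 1)) ** (1.0 / p))   # uniform points in the ball
dists = np.min(np.linalg.norm(pts[:, None, :] - cover[None, :, :], axis=2), axis=1)
print(f"max distance from a sample point to the cover: {dists.max():.3f} (should be <= {eps})")
```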
Assume that our infinite hypothesis class H can be parameterized by H = {hθ : θ ∈ Rᵖ, ‖θ‖₂ ≤ B}. We have the following uniform convergence theorem for our infinite hypothesis class H:
Theorem 4.8. Suppose ℓ((x, y), θ) ∈ [0, 1], and ℓ((x, y), θ) is κ-Lipschitz in θ with respect to the ℓ₂-norm for all (x, y). Then, with probability at least 1 − O(exp(−Ω(p))), we have

∀θ, |L̂(θ) − L(θ)| ≤ O( √( p·max(ln(κBn), 1) / n ) ).   (4.29)
Proof of Theorem 4.8. Fix parameters δ, ε > 0 (we will specify their values later). Let C be the ε-cover of our parameter space S with respect to the ℓ₂-norm constructed in Lemma 4.5. Define the event E = {∀θ ∈ C, |L̂(θ) − L(θ)| ≤ δ}. By Theorem 4.1, we have Pr(E) ≥ 1 − 2|C| exp(−2nδ²).

Now for any θ ∈ S, we can pick some θ' ∈ C such that ‖θ − θ'‖₂ ≤ ε. Since L and L̂ are κ-Lipschitz functions (this follows from the Lipschitzness of ℓ), we have

|L(θ) − L(θ')| ≤ κ‖θ − θ'‖₂ ≤ κε, and   (4.30)
|L̂(θ) − L̂(θ')| ≤ κ‖θ − θ'‖₂ ≤ κε.   (4.31)

Therefore, conditional on E, we have

|L̂(θ) − L(θ)| ≤ |L̂(θ) − L̂(θ')| + |L̂(θ') − L(θ')| + |L(θ') − L(θ)| ≤ 2κε + δ.   (4.32)

It remains to choose suitable parameters δ and ε to get the desired bound in Theorem 4.8 while making the failure probability small. First, set ε = δ/(2κ) so that conditional on E,

|L̂(θ) − L(θ)| ≤ 2δ.   (4.33)
To choose the correct δ, we must reason about the probability of E under different choices of the parameter. The event E happens with probability 1 − 2|C| exp(−2nδ²) = 1 − 2 exp(ln|C| − 2nδ²). From Remark 4.6, we know that ln|C| ≤ p ln(3B/ε) = p ln(6Bκ/δ). If we ignore the log term, assume ln|C| ≤ p, and choose δ = √(p/n), then this would give us the high probability bound we want:

2|C| exp(−2nδ²) = 2 exp(ln|C| − 2nδ²) ≤ 2 exp(p − 2p) = 2 exp(−p).   (4.34)

(At the same time, we see from (4.33) that this choice of δ gives |L̂(θ) − L(θ)| ≤ 2√(p/n), which is roughly the bound we want.)
Since we cannot actually drop the log term in the inequality ln|C| ≤ p ln(6Bκ/δ), we need to make δ a little bit bigger. So, if we set δ = √( c₀ · p·max(1, ln(κBn)) / n ) with c₀ = 36, then by Remark 4.6,

ln|C| − 2nδ² ≤ p ln(6Bκ/δ) − 2nδ²   (4.35)
≤ p ln( 6Bκ√n / √(c₀ p max(1, ln(κBn))) ) − 2c₀ p ln(κBn)   (definition of δ)   (4.36)
≤ p ln( Bκ√(n/p) ) − 72 p ln(κBn)   (max(1, ln(κBn)) ≥ 1, c₀ = 36)   (4.37)
≤ p ln(Bκn) − 72 p ln(Bκn)   (√(n/p) ≤ n)   (4.38)
≤ −p,   (4.39)

since ln(Bκn) ≥ 1 for large enough n. Therefore, with probability greater than 1 − 2|C| exp(−2nδ²) = 1 − 2 exp(ln|C| − 2nδ²) ≥ 1 − O(e⁻ᵖ), we have

|L̂(θ) − L(θ)| ≤ 2δ = O( √( (p/n)·max(1, ln(κBn)) ) ).   (4.40)
Remark 4.9. We bounded the generalization error |L̂(θ) − L(θ)| by δ + 2κε ≤ √(ln|C|/n) + 2κε. The term 2κε represents the error from our brute-force discretization. It is not a problem because we can always choose ε small enough without worrying about the growth of the first term √(ln|C|/n). This in turn is because ln|C| ≈ p ln(1/ε), which is very insensitive to ε, even if we let ε = 1/poly(n). We also observe that both √(ln|C|/n) and √(p/n) are bounds that depend on the "size" of our hypothesis class, in terms of either its total size or dimensionality. This possibly explains why one may need more training samples when the hypothesis class is larger.
Both of these bounds have a 1/√n dependency on n, which is known as the "slow rate". The terms in the numerator (log|H| and p, respectively) can be thought of as complexity measures of H.
The bound (4.42) is not precise enough: it depends solely on p and is not always optimal. For example,
this would be a poor bound if the hypothesis class H has very high dimension but small norm. One specific
example is for the following two hypothesis classes:
This complexity measure will depend on the distribution P over X × Y (the input and output spaces),
and hence takes into account how easy it is to learn P . If P is easy to learn, then this complexity measure
will be small even if the hypothesis space is big.
One of the practical implications of having such a complexity measure is that we can restrict the hypoth-
esis space by regularizing the complexity measure (assuming it is something we can evaluate and train with).
If we successfully find a low complexity model, then this generalization bound guarantees that we have not
overfit.
4.4.2 Definitions
In uniform convergence, we sought a high probability bound for sup_{h∈H}(L(h) − L̂(h)). Here we have a weaker goal: we try to obtain an upper bound for its expectation instead, i.e.

E[ sup_{h∈H} (L(h) − L̂(h)) ] ≤ upper bound.   (4.44)

The expectation is over the randomness in the training data {(x^(i), y^(i))}_{i=1}^n.¹
To do so, we first define Rademacher complexity.
Definition 4.10 (Rademacher complexity). Let F be a family of functions mapping Z → R, and let P be a distribution over Z. The (average) Rademacher complexity of F is defined as

Rₙ(F) ≜ E_{z₁,...,zₙ ∼iid P} E_{σ₁,...,σₙ ∼iid {±1}} [ sup_{f∈F} (1/n) Σ_{i=1}^n σᵢ f(zᵢ) ],   (4.45)

where σ₁, . . . , σₙ are independent Rademacher random variables, i.e. each taking on the value of 1 or −1 with probability 1/2.
Remark 4.11. For applications to empirical risk minimization, we will take Z = X × Y. However, Definition
4.10 holds for abstract input spaces Z as well.
Remark 4.12. Note that Rn (F) is also dependent on the measure P of the space, so technically it should be
Rn,P (F), but for brevity, we refer to it as Rn (F).
An interpretation is that Rn (F) is the maximal possible correlation between outputs of some f ∈ F
(on points f (z1 ), . . . , f (zn )) and random Rademacher variables (σ1 , . . . , σn ). Essentially, functions with more
random sign outputs will better match random patterns of Rademacher variables and have higher complexity
(greater ability to mimic or express randomness).
The following theorem is the main theorem involving Rademacher complexity:

Theorem 4.13.

E_{z₁,...,zₙ ∼iid P} [ sup_{f∈F} ( (1/n) Σ_{i=1}^n f(zᵢ) − E_{z∼P}[f(z)] ) ] ≤ 2Rₙ(F).   (4.46)
Remark 4.14. We can think of (1/n) Σ_{i=1}^n f(zᵢ) as an empirical average and E_{z∼P}[f(z)] as a population average.
Why is Theorem 4.13 useful to us? We can set F to be the family of loss functions, i.e.

F = { z = (x, y) ∈ Z ↦ ℓ((x, y), h) ∈ R : h ∈ H }.   (4.47)

This is the family of losses induced by the hypothesis functions in H. We also define the function class −F as {−f : f ∈ F}. It should be obvious from this definition that Rₙ(F) = Rₙ(−F), since σᵢ =d −σᵢ for all i. Then, letting zᵢ = (x^(i), y^(i)),

E[ sup_{h∈H} ( L(h) − L̂(h) ) ] = E_{{(x^(i),y^(i))}}[ sup_{h∈H} ( L(h) − (1/n) Σ_{i=1}^n ℓ((x^(i), y^(i)), h) ) ]   (4.48)
= E_{{zᵢ}}[ sup_{f∈F} ( E[f(z)] − (1/n) Σ_{i=1}^n f(zᵢ) ) ]   (4.49)
= E_{{zᵢ}}[ sup_{f∈−F} ( (1/n) Σ_{i=1}^n f(zᵢ) − E[f(z)] ) ]   (4.50)
≤ 2Rₙ(−F) = 2Rₙ(F),   (4.51)
¹ Though we might like to pull the sup outside of the E operator, and bound the expectation of the excess risk (a far simpler quantity to deal with!), in general, the sup and E operators do not commute. In particular, E[sup_{h∈H}(L(h) − L̂(h))] ≥ sup_{h∈H} E[L(h) − L̂(h)].
where the last step follows by Theorem 4.13.
Thus, 2Rn (F) is an upper bound for the generalization error. In this context, Rn (F) can be interpreted
as how well the loss sequence `((x(1) , y (1) ), h), . . . `((x(n) , y (n) ), h) correlates with σ1 , . . . , σn .
Example 4.15. Consider the binary classification setting where y ∈ {±1}. Let ℓ₀₋₁ denote the zero-one loss function. Note that

ℓ₀₋₁((x, y), h) = 1{h(x) ≠ y} = (1 − y·h(x))/2.   (4.52)

Hence,

Rₙ(F) = E_{{(x^(i),y^(i))}, σ}[ sup_{h∈H} (1/n) Σ_{i=1}^n ℓ₀₋₁((x^(i), y^(i)), h) σᵢ ]   (by definition)   (4.53)
= E_{{(x^(i),y^(i))}, σ}[ sup_{h∈H} (1/n) Σ_{i=1}^n σᵢ (−h(x^(i)) y^(i) + 1)/2 ]   (by (4.52))   (4.54)
= (1/2) E_{{(x^(i),y^(i))}, σ}[ (1/n) Σ_{i=1}^n σᵢ + sup_{h∈H} (1/n) Σ_{i=1}^n (−h(x^(i)) y^(i)) σᵢ ]   (sup only over H)   (4.55)
= (1/2) E_{{(x^(i),y^(i))}, σ}[ sup_{h∈H} (1/n) Σ_{i=1}^n (−h(x^(i)) y^(i)) σᵢ ]   (E[σᵢ] = 0)   (4.56)
= (1/2) E_{{(x^(i),y^(i))}, σ}[ sup_{h∈H} (1/n) Σ_{i=1}^n h(x^(i)) σᵢ ]   (−y^(i)σᵢ =d σᵢ)   (4.57)
= (1/2) Rₙ(H).   (by definition)   (4.58)
In this setting, Rn (F) and Rn (H) are the same (except for the factor of 2). Rn (H) has a slightly more
intuitive interpretation: it represents how well h ∈ H can fit random patterns.
Warning! Rn (F) is not always the same as Rn (H) in other problems.
Remark 4.16. Rademacher complexity is invariant to translation. This property manifests in the previous example when the +1 in the (−h(x^(i))y^(i) + 1)/2 term essentially vanishes in the computation.
Let us now prove Theorem 4.13.
Proof of Theorem 4.13. We use a technique called symmetrization, which is a very important technique in probability theory. We first fix z₁, . . . , zₙ and draw z₁', . . . , zₙ' ∼iid P. Then we can rewrite the term in the expectation on the LHS of (4.46):

sup_{f∈F} ( (1/n) Σ_{i=1}^n f(zᵢ) − E[f] ) = sup_{f∈F} ( (1/n) Σ_{i=1}^n f(zᵢ) − E_{z₁',...,zₙ'}[ (1/n) Σ_{i=1}^n f(zᵢ') ] )   (4.59)
= sup_{f∈F} E_{z₁',...,zₙ'}[ (1/n) Σ_{i=1}^n f(zᵢ) − (1/n) Σ_{i=1}^n f(zᵢ') ]   (4.60)
≤ E_{z₁',...,zₙ'}[ sup_{f∈F} ( (1/n) Σ_{i=1}^n f(zᵢ) − (1/n) Σ_{i=1}^n f(zᵢ') ) ].   (4.61)
Now, if we take the expectation over z₁, . . . , zₙ on both sides of (4.61),

E_{zᵢ}[ sup_{f∈F} ( (1/n) Σ_{i=1}^n f(zᵢ) − E[f] ) ] ≤ E_{zᵢ}[ E_{zᵢ'}[ sup_{f∈F} (1/n) Σ_{i=1}^n ( f(zᵢ) − f(zᵢ') ) ] ]   (4.63)
= E_{zᵢ, zᵢ'}[ E_{σ}[ sup_{f∈F} (1/n) Σ_{i=1}^n σᵢ ( f(zᵢ) − f(zᵢ') ) ] ]   (4.64)
≤ E_{zᵢ, zᵢ', σ}[ sup_{f∈F} (1/n) Σ_{i=1}^n σᵢ f(zᵢ) + sup_{f∈F} (1/n) Σ_{i=1}^n ( −σᵢ f(zᵢ') ) ]   (4.65)
= 2Rₙ(F),   (4.66)

where (4.64) is because σᵢ(f(zᵢ) − f(zᵢ')) =d f(zᵢ) − f(zᵢ'), since f(zᵢ) − f(zᵢ') has a symmetric distribution. The last equality holds since −σᵢ =d σᵢ and zᵢ, zᵢ' are drawn i.i.d. from the same distribution.
Here is an intuitive understanding of what Theorem 4.13 achieves. Consider the quantities on the LHS
and RHS of (4.46):
sup_{f∈F} ( (1/n) Σ_{i=1}^n f(zᵢ) − E[f(z)] )   vs.   sup_{f∈F} ( (1/n) Σ_{i=1}^n σᵢ f(zᵢ) ).
First, we removed E[f (z)], which is hard to control quantitatively since it is deterministic. Second, we
added more randomness in the form of Rademacher variables. This will allow us to shift our focus from
the randomness in the zi ’s to the randomness in the σi ’s. In the future, our bounds on the Rademacher
complexity will typically only depend on the randomness from the σi ’s.
4.5 Empirical Rademacher complexity
In the previous section, we bounded the expectation of sup_{f∈F}( (1/n) Σ_{i=1}^n f(zᵢ) − E_{z∼P}[f(z)] ). This expectation is taken over the training examples z₁, . . . , zₙ. In many instances we only have one training set, and do not have access to many training sets. Thus, the bound on the expectation does not give a guarantee for the one training set that we have. In this section, we seek to bound the quantity itself with high probability.

Definition 4.17 (Empirical Rademacher complexity). Given a dataset S = {z₁, . . . , zₙ}, the empirical Rademacher complexity is defined as

R_S(F) ≜ E_{σ₁,...,σₙ}[ sup_{f∈F} (1/n) Σ_{i=1}^n σᵢ f(zᵢ) ].   (4.73)
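The empirical Rademacher complexity of a finite class can be estimated directly by Monte Carlo over the Rademacher variables. The following sketch (not part of the original notes) does this for finite classes of {±1}-valued threshold functions h_t(z) = sign(z − t) on a fixed one-dimensional dataset; richer classes correlate better with random sign patterns and thus have larger complexity. The dataset, class sizes, and number of σ draws are illustrative choices.

```python
# Monte Carlo estimate of R_S(H) = E_sigma[ sup_{h in H} (1/n) sum_i sigma_i h(z_i) ].
import numpy as np

rng = np.random.default_rng(0)
n, num_sigma_draws = 200, 2000
z = np.sort(rng.uniform(size=n))                       # a fixed dataset S = {z_1, ..., z_n}

def rademacher_complexity(outputs):
    """outputs: |H| x n matrix whose rows are (h(z_1), ..., h(z_n)) for each h in H."""
    sigma = rng.choice([-1.0, 1.0], size=(num_sigma_draws, n))
    # For each sigma draw, take the best correlation over the class, then average.
    return np.mean(np.max(sigma @ outputs.T / n, axis=1))

for num_thresholds in [1, 10, 100, 1000]:
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    outputs = np.where(z[None, :] > thresholds[:, None], 1.0, -1.0)
    print(f"|H| = {num_thresholds:5d}:  R_S(H) ~ {rademacher_complexity(outputs):.4f}")
# Finite-class scale sqrt(2 log|H| / n), for comparison with the |H| = 1000 estimate.
print(f"sqrt(2 log(1000) / n) = {np.sqrt(2 * np.log(1000) / n):.4f}")
```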
Step 2: We apply Theorem 4.13 to get
Using a similar argument to that of Step 1, we show that g̃ satisfies the bounded difference condition:
as required.
Setting F to be a family of loss functions bounded by [0, 1] in Theorem 4.18 gives the following corollary:

Corollary 4.19. Let F be a family of loss functions F = {(x, y) ↦ ℓ((x, y), h) : h ∈ H} with ℓ((x, y), h) ∈ [0, 1] for all ℓ, (x, y) and h. Then, with probability 1 − δ, the generalization gap is

L̂(h) − L(h) ≤ 2R_S(F) + 3√( log(2/δ) / (2n) )   for all h ∈ H.   (4.92)
Remark 4.20. If we want to bound the generalization gap by the average Rademacher complexity instead, we can replace the RHS of (4.92) with 2Rₙ(F) + √( log(2/δ) / (2n) ).
Interpretation of Corollary 4.19. It is typically the case that O(√(log(2/δ)/n)) ≪ R_S(F) and O(√(log(2/δ)/n)) ≪ Rₙ(F). This is the case because R_S(F) and Rₙ(F) often take the form c/√n where c is a big constant depending on the complexity of F, whereas we only have a logarithmic term in the numerator of O(√(log(2/δ)/n)). As a result, we can view the 3√(log(2/δ)/(2n)) term in the RHS of Corollary 4.19 as negligible. Another way of seeing this is noting that a Õ(1/√n) term is necessary even for the concentration bound of a single function h ∈ H. Previously, we bounded L(h) − L̂(h) using a union bound over h ∈ H, which necessarily needs to be larger than Õ(1/√n). As a result, the O(√(log(2/δ)/n)) term is not significant.
where (4.96) follows because E_{σ₁,...,σₙ}[ (1/n) Σ_{i=1}^n σᵢ c₀ ] = 0, since the σᵢ's are Rademacher random variables.
For the purposes of computing the empirical Rademacher complexity, we only care about the outputs of a function f ∈ F, and not the function itself (i.e. it is sufficient for our purposes to know f(z₁), . . . , f(zₙ), but not to know f). In other words, we can characterize f ∈ F by (f(z₁), . . . , f(zₙ)). In the sequel, we will take advantage of this simplification from the (potentially large) space of all functions F to the output space

Q ≜ { (f(z₁), . . . , f(zₙ))^⊤ : f ∈ F } ⊆ Rⁿ,   (4.98)
which may be drastically smaller than F. Correspondingly, the empirical Rademacher complexity can be rewritten as a maximization over the output space Q instead of the function space F:

R_S(F) = E_σ[ sup_{v∈Q} (1/n) ⟨σ, v⟩ ].   (4.99)

In other words, the complexity of F can also be interpreted as how much the vectors in Q can be correlated with a random vector σ. See Figure 4.3 for an illustration of this idea. One can also view E_σ[sup_{v∈Q} (1/n)⟨σ, v⟩] as a complexity measure for the set Q. If we replace σ by a Gaussian vector with spherical covariance, then the corresponding quantity (without the 1/n scaling), E_{g∼N(0,I)}[sup_{v∈Q} ⟨g, v⟩], is often referred to as the Gaussian complexity of the set Q. (It turns out that Gaussian complexity and Rademacher complexity are closely related.)

Another corollary of this is that the empirical Rademacher complexity only depends on the functionality of F but not on the exact parameterization of F. For example, suppose we have two parameterizations F = { f(x) = Σᵢ θᵢxᵢ : θ ∈ R^d } and F' = { f(x) = Σᵢ θᵢ³·wᵢxᵢ : θ ∈ R^d, w ∈ R^d }. Since Q_F and Q_{F'} are the same, we see that R_S(F) = R_S(F'), since our earlier expression for R_S(F) only depends on F through Q_F.
Figure 4.3: We can view empirical Rademacher complexity as the expectation of the maximum inner product
between σ and v ∈ Q.
Rademacher complexity of finite hypothesis classes. In practice, we cannot directly evaluate the
Rademacher complexity, so we instead bound its value using quantities that are computable. Given finite
|Q|, we often rely on the following bound, which is also known as Massart’s finite lemma:
Proposition 4.6.1. Let F be a collection of functions mapping Z → R and let Q be defined as in (4.98). Assume that (1/√n)‖v‖₂ ≤ M < ∞ for all v ∈ Q. Then,

R_S(F) ≤ √( 2M² log|Q| / n ).   (4.100)
We prove a (slightly) simplified version of this result in Problem 3(c) of Homework 2, so we omit the
proof of Massart’s lemma here. Using Massart’s lemma, we can also bound the Rademacher complexity in
terms of F. Restating the assumption accordingly,
Corollary 4.21. Let F be a collection of functions mapping Z → R. If √( (1/n) Σ_{i=1}^n f(zᵢ)² ) ≤ M for all f ∈ F, then

R_S(F) ≤ √( 2M² log|F| / n ).   (4.101)
Note that Corollary 4.21 yields a looser bound than Massart’s lemma since |Q| ≤ |F |.
In practice, we rarely apply Massart’s lemma directly since |Q| is typically infinite. In the sequel, we
discuss alternative approaches to bounding the Rademacher complexity that are appropriate for this setting.
Bounding Rademacher complexity using ε-covers. When |Q| is infinite, we can apply the same discretization trick that we used to prove the generalization bound for an infinite hypothesis space. This time, instead of trying to cover the parameter space, we will cover the output space. To this end, we first recall a few definitions concerning ε-covers.

Definition 4.22. C is an ε-cover of Q with respect to metric ρ if for all v' ∈ Q, there exists v ∈ C such that ρ(v, v') ≤ ε.
Definition 4.23. The covering number is defined as the minimum size of an ε-cover, or explicitly:

N(ε, Q, ρ) ≜ (minimum size of an ε-cover of Q w.r.t. metric ρ).   (4.102)
Figure 4.4: We can visualize the ε-cover C by depicting a set of ε-balls that cover the output space Q. The yellow circles denote the ε-neighborhoods of the covering points uᵢ ∈ C.
² Pₙ denotes the empirical distribution, i.e. the uniform distribution over the observations z₁, . . . , zₙ. More generally the L_p(Q) metric is defined by E_Q[(f(z) − f'(z))^p]^{1/p}.
Recall we have established the following correspondences between the set of functions F and the output space Q:

f ∈ F ⟺ (f(z₁), . . . , f(zₙ))^⊤ ∈ Q.   (4.104)

We can write a trivial correspondence between both the output and function class points of view as follows:

N(ε, F, L₂(Pₙ)) = N( ε, Q, (1/√n)‖·‖₂ ).   (4.105)
The results below will be stated in the function-space notation, but in the proofs we will shift to the Q-
formulation for the sake of clarity. In general, we prefer to reason about covering numbers on Q as it is more
natural to analyze vector spaces compared to function spaces.
Equipped with the definition of minimal -covers, we can prove the following Rademacher complexity
bound:
Theorem 4.25. Let F be a family of functions Z 7→ [−1, 1]. Then
r !
2 log N (, F, L2 (Pn ))
RS (F) ≤ inf + . (4.106)
>0 n
The term can be thought of as the discretization error, while the second term is the Rademacher
complexity of the finite -cover. The precise form of this complexity bound follows from Proposition 4.6.1.
Proof. Fix any ε > 0. Let C be the minimal ε-cover of Q with respect to the metric ρ(v, v') = (1/√n)‖v − v'‖₂. Note that |C| = N(ε, Q, (1/√n)‖·‖₂) = N(ε, F, L₂(Pₙ)).

We aim to bound R_S(F) = E_σ[sup_{v∈Q} (1/n)⟨v, σ⟩] by approximating v with v' ∈ C. In particular, for every point v ∈ Q, choose v' ∈ C such that ρ(v, v') ≤ ε, and let z ≜ v − v' (so that (1/√n)‖z‖₂ ≤ ε). This gives

(1/n)⟨v, σ⟩ = (1/n)⟨v', σ⟩ + (1/n)⟨v − v', σ⟩   (4.107)
≤ (1/n)⟨v', σ⟩ + (1/n)‖z‖₂‖σ‖₂   (z ≜ v − v', Cauchy-Schwarz)   (4.108)
≤ (1/n)⟨v', σ⟩ + ε.   (since ‖z‖₂ ≤ ε√n and ‖σ‖₂ = √n)   (4.109)

Taking the expectation of the supremum on both sides of this inequality gives

R_S(F) = E_σ[ sup_{v∈Q} (1/n)⟨v, σ⟩ ]   (4.110)
≤ E_σ[ sup_{v'∈C} (1/n)⟨v', σ⟩ + ε ]   (4.111)
= ε + E_σ[ sup_{v'∈C} (1/n)⟨v', σ⟩ ]   (4.112)
≤ ε + √( 2 log|C| / n )   (Proposition 4.6.1)   (4.113)
= ε + √( 2 log N(ε, Q, ρ) / n )   (4.114)
= ε + √( 2 log N(ε, F, L₂(Pₙ)) / n ).   (Remark 4.24)   (4.115)
Since the argument above holds for any ε > 0, we can take the infimum over all ε to arrive at Equation (4.106).
This bound follows by observing that E[sup(A + B)] ≤ E[sup A] + E[sup B], since the sup on the RHS is taken separately over both terms. The difficult term to tightly bound is the last one, (1/n)⟨z, σ⟩. In the previous derivation, we naively upper bounded ⟨z, σ⟩ using Cauchy-Schwarz,

(1/n)⟨z, σ⟩ ≤ ‖z‖₂ · ‖σ‖₂ / n,   (4.119)

but this bound is only tight if there exists z ∈ B_{v'} that is perfectly correlated with σ. We claim that such perfect correlation is unlikely. Recall that the output space is defined by possible outputs of f ∈ F given n inputs. Unless our function class is extremely expressive, the set of points within radius ε of v' contained in Q will only be a small subset of the ε-ball centered at v'; thus, sup_z (1/n)⟨z, σ⟩ ≪ ‖z‖₂·‖σ‖₂/n.

To precisely set up our approach, we observe that E[sup_{z∈B_{v'}} (1/n)⟨z, σ⟩] is itself a Rademacher complexity: R_S(B_{v'} ∩ Q). To more tightly bound E[sup_{z∈B_{v'}} (1/n)⟨z, σ⟩], we then repeat the ε-covering argument again with a smaller choice of ε. Intuitively, this procedure amounts to decomposing ⟨z, σ⟩ from (4.117) into another pair of terms corresponding to the new ε-cover and the discretization error. "Chaining" then repeats this decomposition countably many times. This procedure is illustrated visually by Figure 4.5, and we formalize this argument in the sequel.
Proof. Let 0 = supf ∈F maxi |f (zi )|, so that for all v ∈ Q,
v
u n r
u1 X 1
0 ≥ t 2
f (zi ) = kvk22 . (4.120)
n i=1 n
40
(a) (b)
(c)
Figure 4.5: We depict how the chaining procedure approximates v using a sequence of progressively finer
discretizations. Figure 4.5a illustrates how we first approximate v using the nearest covering point u1 , while
Figures 4.5b and 4.5c describe how we refine this approximation using two finer covers, whose nearest points
are denoted by u2 and u3 , respectively.
Define j = 2−j 0 and let Cj be an j -cover of Q. Then, C0 is the coarsest cover of Q, and as j increases,
we obtain progressively more fine-grained covers Cj . We can intuitively think of these covers as nested, but
this is not necessary for the proof to hold. We next use this sequence of covers to define a telescoping series
that equals v; the terms in this series can then be analyzed using the tools that we have developed in the
prequel.
For v ∈ Q, let ui denote the nearest neighbor of v in Ci . Note that by definition ρ(u, vj ) ≤ j . Taking
u0 = 0, it follows from our definition of Ci that as j → ∞, j → 0 and uj → v. Leveraging these observations,
we can express v using the following series:
41
Substituting (4.123) in the Rademacher complexity we aim to bound, we obtain
∞
" #
1 1X
E sup hv, σi = E sup hui − ui−1 , σi (4.124)
v∈Q n v∈Q n i=1
"∞ #
X 1
≤E sup hui − ui−1 , σi (4.125)
i=1 ui ∈Ci ,ui−1 ∈Ci−1
n
∞
" #
X 1
= E sup hui − ui−1 , σi . (4.126)
i=1 ui ∈Ci ,ui−1 ∈Ci−1 n
is a Rademacher complexity defined over the finite space Ci ×Ci−1 , so we can use Proposition 4.6.1 (Massart’s
lemma) to obtain a tractable upper bound. To do so, we must first compute an upper bound on √1n kui −
ui−1 k2 :
1 1
√ kui − ui−1 k2 = √ k(ui − v) − (ui−1 − v)k2 (4.128)
n n
1
≤ √ (kui − vk2 − kui−1 − vk2 ) (4.129)
n
≤ i + i−1 (4.130)
∆
= 3i (i−1 = 2i ) (4.131)
Now we apply Proposition 4.6.1 with M = 3i and |Q| = |Ci × Ci−1 | ≤ |Ci | · |Ci−1 |.
" # r
1 2(3i )2 log(|Ci | · |Ci−1 |)
E sup hui − ui−1 , σi ≤ (4.132)
ui ∈Ci ,ui−1 ∈Ci−1 n n
3i p
=√ 2(log |Ci | + log |Ci−1 |) (4.133)
n
6i p
≤√ log |Ci | (|Ci | ≥ |Ci−1 |) (4.134)
n
Applying (4.134) to each term in (4.126) and substituting the covering number N (i , F, L2 (Pn )) for |Ci |,
we obtain the following upper bound on the Rademacher complexity:
X ∞
1 6 p
E sup hv, σi ≤ √ i log N (i , F, L2 (Pn )). (4.135)
v∈Q n i=1
n
p
Finally, we must relate (4.135) to the target upper bound of 12 √1n log N (, F, L2 (Pn ))d. Examining
R
Figure 4.6, we can make two crucial observations. First, for sufficiently large , log N (, F, L2 (Pn )) = 0 since
one point is sufficient to construct a cover. Second, we observe that
p Z i p
(i − i+1 ) log |Ci | ≤ log N (, F, L2 (Pn ))d (4.136)
i+1
since the LHS of (4.136) is the area of the dotted rectangle illustrated in Figure 4.6 while the RHS is the
area under the curve for that interval. Formally, this result is equivalent to observing that the right Riemann
sum underestimates the integral for monotone decreasing functions f .
42
Figure 4.6: We observe that log N (, F, L2 (Pn )) is monotone decreasing in . The area of the dotted rectangle
formed by the vertical lines at i+1 and i equals (up to a constant factor) the i−th term of the infinite sum
derived in our proof of Dudley’s theorem (4.135). The figure shows that the area of this rectangle is no larger
than the integral of log N (, F, L2 (Pn )) over this same interval.
i
Recognizing that i − i+1 = 2, we note that the LHS of (4.136) is equal (up to a constant factor) to the
i-th term of (4.135). Thus,
∞ ∞
6 p 12 X
√ i log N (i , F, L2 (Pn )) = √
X p
(i − i+1 ) log N (i , F, L2 (Pn )) (4.137)
i=1
n n i=1
Z i p
12
≤√ log N (i , F, L2 (Pn ))d (4.138)
n i+1
Z 0 p
12
=√ log N (, F, L2 (Pn ))d. (4.139)
n 0
To complete the proof, observe that log N (, F, L2 (Pn )) = 0 for all > 0 . This allows us to extend the
upper limit of the integral given by (4.139) to ∞ and yields the desired result:
Z ∞p
1 12
E sup hv, σi ≤ √ log N (, F, L2 (Pn ))d. (4.140)
v∈Q n n 0
Remark 4.27. If F consists of functions bounded in [−1, 1], then we have that for all > 1, N (, F, L2 (Pn )) =
1. To see this, choose {f ≡ 0}, which is a complete cover for > 1. Hence, the limits of integration in
(4.116) can be truncated to [0, 1]:
Z 1r
log N (, F, L2 (Pn ))
RS (F) ≤ 12 d, (4.141)
0 n
43
4.6.2 Translating Covering Number Bounds to Rademacher Complexity
Of course, the bound in (4.116) is only useful if the integral on the RHS is finite. Here are some setups where
this is the case (we continue to assume that the functions in F are bounded in [−1, 1]):
then we have log N (, F, L2 (Pn )) ≈ R log(1/). We can plug this into the RHS of (4.116) to get
Z 1r Z 1r r
log N (, F, L2 (Pn )) R log(1/) R
d = d ≈ . (4.143)
0 n 0 n n
Z 1r Z 1r
log N (, F, L2 (Pn )) R
d ≈ log a d (4.145)
0 n 0 n
r Z 1r
R 1
= log a d (4.146)
n 0
r !
R
=Oe . (4.147)
n
2
R
3. If the covering number has the form N (, F, L2 (Pn )) ≈ aR/ , then log N (, F, L2 (Pn )) ≈ 2 log a. In
this case we have:
r r
1 1
log N (, F, L2 (Pn ))
Z Z
R 1
d ≈ log a d = ∞, (4.148)
0 n n
| 0 {z }
=∞
i.e. the bound in (4.116) is vacuous. This is because of the behavior of 7→ 1/2 near 0: the function
goes to infinity too quickly for us to upper bound its integral. Fortunately, there is an “improved”
version of Dudley’s theorem that is applicable here:
Theorem 4.28 (Localized Dudley’s Theorem). If F is a function class from Z 7→ R, then for any
fixed cutoff α ≥ 0 we have the bound
Z ∞r
log N (, F, L2 (Pn ))
RS (F) ≤ 4α + 12 d. (4.149)
α n
The proof of this theorem is similar to the proof of the original Dudley’s theorem, except that the
iterative covering procedure is stopped at the threshold = α at the cost of the extra 4α term above.
44
Theorem 4.28 allows us to avoid the problematic region around = 0 in the integral in (4.116). If we
let α = 1/poly(n), where poly(n) denotes some polynomial function of n, the bound in (4.149) becomes
√
R log a 1 1
Z
1
RS (F) ≤ + √ d (4.150)
poly(n) n α
√
1 R log a
= + √ log(1/α) (4.151)
poly(n) n
r !
R
=O e . (4.152)
n
1 − sgn(h(x))y
`0-1 ((x, y), h) = . (4.157)
2
The first approach is to reason directly about the Rademacher complexity of `0-1 loss, i.e. considering the
family of functions F = {z = (x, y) 7→ `0-1 ((x, y), h) : h ∈ H}. Define Q to be the set of all possible outputs
45
on our dataset: Q = sgn h x(1) , . . . , sgn h x(n) | h ∈ H . Then, using our earlier remark about
viewing the empirical Rademacher complexity as an inner product between v ∈ Q and σ, we have
" n
#
1 X 1 − sgn(h(x(i) ))yi
RS (F) = E sup σi (4.158)
σ1 ,...,σn f ∈F n 2
i=1
" n
#
1 X sgn(h(x(i) ))
= E sup σi (4.159)
σ1 ,...,σn f ∈F n 2
i=1
1 1
= E sup hσ, vi . (4.160)
2 σ1 ,...,σn v∈Q n
Notice that the supremum is now over Q instead of F. If n is sufficiently large, then it is typically the
case that |Q| > |F|. To see why this is the case, note that each function f corresponds to a single element
in Q. However, as n increases, |Q| increases as well. For any particular v ∈ Q, notice that hv, σi is a sum of
bounded random variables, so we can use Hoeffding’s inequality to obtain
1
Pr hσ, vi ≥ t ≤ exp(−nt2 /2). (4.161)
n
This upper bound only depends on the number of parameters in our model, and does not take into the
account the scale and norm of the parameters. Additionally, this doesn’t work with kernel methods since the
explicit parameterization is possibly infinite-dimensional, and therefore this upper bound becomes useless.
These limitations motivate the use of margin theory, which does take into account the norm of parameters
and provides a theoretical basis for regularization techniques such as L1 and L2 regularization.
46
Chapter 5
In this chapter, we will instantiate Rademacher complexity for two important hypothesis classes: linear
models and two-layer neural networks. In the process, we will develop margin theory and use it to bound
the generalization gap for binary classifiers.
Definition 5.2 (Minimum margin). Given a dataset D = ((x(1) , y (1) ), . . . , (x(n) , y (n) )), the minimum margin
over the dataset is defined as γmin , mini∈{1,...,|D|} y (i) hθ (x(i) ).
Our final bound will have the form (generalization gap) ≤ f (margin, parameter norm). This is very
generic since there are many different bounds we could derive based on what margin we use. For this current
setting we are using γmin , which is the minimum margin, but in other settings could use γaverage , which is
the average margin of each point in the dataset.
47
Figure 5.1: The red and black lines are two decision boundaries. The X’s are positive examples and the
O’s are negative examples. The black line has a larger margin than the red line, and is intuitively a better
classifier.
We will begin by introducing the idea of a surrogate loss, a loss function which approximates zero-one
loss but takes the scale of the margin into account. The margin loss (also known as ramp loss) is defined as
0,
t≥γ
`γ (t) = 1, t≤0 (5.1)
1 − t/γ, 0 ≤ t ≤ γ
It is plotted in Figure 5.2. For convenience, define `γ ((x, y), h) , `γ (yh(x)). We can view `γ as a
continuous version of `0-1 that is more sensitive to the scale of the margin on [0, γ]. Notice that `0-1 is always
less than or equal to the `γ when γ ≥ 0, i.e.
`0-1 ((x, y), h) = 1[yh(x) < 0] ≤ `γ (yh(x)) = `γ ((x, y), h) (5.2)
holds for all (x, y) ∼ P . Taking the expectation over (x, y) on both sides of this inequality, we see that
L(h) = E [`0-1 ((x, y), h)] ≤ E [`γ ((x, y), h)] . (5.3)
(x,y)∼P (x,y)∼P
48
Therefore, the population loss is bounded by the expectation of the margin loss, and so it is sufficient to
bound the expectation of the margin loss in order to bound the population loss.
Define the population and empirical versions of the margin loss:
n h
X i
Lγ (h) = E [`γ ((x, y), h)] , L̂γ (h) = `γ ((x(i) , y (i) ), h) . (5.4)
(x,y)∼P
i=1
1
RS (F) ≤ RS (H0 ) (5.7)
γ
" n
#
1 1X
= E sup σi y (i) h(x(i) ) (5.8)
γ σ1 ,...,σn h∈H n
i=1
" n
#
1 1X (i)
= E sup σi h(x ) (5.9)
γ σ1 ,...,σn h∈H n i=1
1
= RS (H). (5.10)
γ
Putting this all together, we have shown that for γ = γmin ,
r !
RS (H) log(2/δ)
L0-1 (h) ≤ Lγ (h) ≤ 0 + O +O e (5.11)
γ 2n
r !
RS (H) log(2/δ)
=O +O
e . (5.12)
mini y (i) h(x(i) ) 2n
In other words, for training data of the form S = {(x(i) , y (i) )}ni=1 ⊂ Rd × {−1, 1}, a hypothesis class H
and 0-1 loss, we can derive a bound of the form
2RS (H)
generalization loss ≤ + low-order term, (5.13)
γmin
where γmin is the minimum margin achievable on S over those hypotheses in H that separate the data, and
RS (H) is the empirical Rademacher complexity of H. Such bounds state that simpler models will generalize
better beyond the training data, particularly for data that is strongly separable.
49
Remark 5.4. Note there is a subtlety here. If we think of the dataset as random, it follows that γmin is a
random variable. Consequently, the γ we choose to define the hypothesis class is random, which is not a
valid choice when thinking about Rademacher complexity! Technically we cannot apply Talagrand’s lemma
with a random κ (which we took to be 1/γ). Also, when we use concentration inequalities, we implicitly
assume that the `γ ((x(i) , y (i) ), h) are independent of each other. That is not the case if γ is dependent on
the data.
We sketch out how one might address this issue below. The main idea is to do another union bound
over γ. Choose a family Γ = 2k : k ∈ [−B, B] for some B. Then, for every fixed γ ∈ Γ, with probability
greater than 1 − δ,
s
1
b γ (h) + O RS (H) + O log δ
L0-1 (h) ≤ L e . (5.14)
γ n
Taking a union bound over all γ ∈ Γ, it further holds that for all γ ∈ (0, B),
s r !
1
R S (H) log δ log B
L0-1 (h) ≤ L
b γ (h) + O +Oe +O e . (5.15)
γ n n
Last, choose the largest γ ∈ Γ such that γ ≤ γmin . Then, for this value of γ, our desired
bound directly
follows from the bound in (5.15). Namely, we have that L b γ (h) = 0 and O RS (H) = O RS (H) . The
γ γmin
q
log B
additional term, O
e
n , is the price exacted by the uniform convergence argument required to correct
the heuristic bound given in (5.13).
and
BC
Rn (H) ≤ √ . (5.17)
n
Generally speaking, there are two methods with which we can bound the Rademacher complexity of a
model. The first method, which we used in Chapter 4, consists of discretizing the space of possible outputs
from our hypothesis class, then using a union bound or covering number argument to bound the Rademacher
complexity of the model. While this method is powerful and generally applicable, it yields bounds that
depend on the logarithm of the cardinality of this discretized output space, which in turn depends on the
number of data points n. In the proof below, we will instead use a more elegant, albeit limited technique
which does not rely on discretization of the output space.
50
Proof. We start with the proof of (5.16). By definition,
" n
#
1X D (i)
E
RS (H) = E sup σi w, x (5.18)
σ kwk ≤B n
2 i=1
" * n
+#
1 X
= E sup w, σi x(i) (5.19)
n σ kwk2 ≤B i=1
" n #
B X
(i)
= E σi x (supkwk2 ≤B hw, vi = B kvk2 ) (5.20)
n σ i=1
2
v
u
n 2
Bu X
u
≤ tE σi x(i) (Jensen’s ineq. for α 7→ α2 ) (5.21)
n σ i=1 2
v
u
n
* n
+
Bu
u X
2 X
= tE σi2 x(i) + σi x(i) , σj x(j) (5.22)
n σ i=1 2
j6=i
v
u n
BuX 2
= t x(i) 2
. (σi indep. and E[σi ] = 0) (5.23)
n i=1
This completes the proof of (5.16) for the empirical Rademacher complexity. The bound on the average
Rademacher complexity in (5.17) follows from taking the expectation of both sides to get
v v
u n u n h
B u X
(i) 2 B uX i BC
(i) 2 ≤ √ ,
Rn (H) = E [RS (H)] = E t x 2
≤ t E x 2
(5.24)
n i=1
n i=1 n
We observe that both the empirical and average Rademacher complexities scale with the upper `2 -norm
bound kwk2 ≤ B on the parameters w, which motivates regularizing the model. However, smaller weights
in the model may reduce the margin γmin , which in turn hurts generalization according to (5.13).
Remark 5.6. Note that if we scale the data by some multiplicative factor, the bound on empirical Rademacher
complexity RS (H) will scale accordingly. However, at the same time, we expect the margin to scale by the
same multiplicative factor, so the bound on the generalization gap in (5.13) does not change. This lines up
with our intuition that the bound should not depend on the scaling of the data.
51
√
Lemma 5.8 (Massart’s lemma). Suppose Q ⊂ Rn is finite and contained in the `2 -norm ball of radius M n
for some constant M > 0, i.e., √
Q ⊂ {v ∈ Rn | kvk2 ≤ M n}. (5.26)
Then, for Rademacher variables σ = (σ1 , σ2 , . . . , σn ) ∈ Rn ,
r
1 2 log |Q|
E sup hσ, vi ≤ M . (5.27)
σ v∈Q n n
As a corollary, if F is a set of real-valued functions satisfying
n
1X
sup f (z (i) )2 ≤ M 2 , (5.28)
f ∈F n i=1
where the last equality is because supkwk1 ≤B hw, vi = B kvk∞ , i.e., the `∞ -norm is the dual of the `1 -
norm, which is a consequence of Hölder’s inequality. However, the `∞ -norm is difficult to simplify further.
Instead, we use the fact that supkwk1 ≤1 hw, vi for any v ∈ Rd is always attained at one of the vertices
Sd
W = i=1 {−ei , ei }, where ei ∈ Rd is the i-th coordinate unit vector. Defining the restricted hypothesis
class H̄ = {x 7→ hw, xi | w ∈ W} ⊂ H, this yields
" * n
+#
1 X
(i)
RS (H) = E sup w, σi x (5.33)
n σ kwk1 ≤B i=1
" * n
+#
B X
(i)
= E max w, σi x (5.34)
n σ w∈W i=1
= BRS (H̄). (5.35)
In particular, the model class H̄ is bounded and finite with cardinality |H̄| = 2d. This suggests using
Massart’s lemma to complete the proof. To do so, we need to confirm that H̄ is bounded with respect to the
`2 -metric. Indeed, since the inner product of x(i) with a coordinate vector ej just selects the j-th coordinate
of x(i) , for any w ∈ W we have
n E2 n 2 n
1 XD 1 X (i) 1X 2
w, x(i) ≤ x ≤ C = C 2, (5.36)
n i=1 n i=1 ∞ n i=1
52
where the last inequality uses the assumption kxi k∞ ≤ C. So H̄ is bounded in the `2 -metric and finite, thus
by Massart’s Lemma we have
r r
2 log |H̄| 2 log(2d)
RS (H) = BRS (H̄) ≤ BC = BC , (5.37)
n n
which completes the proof.
53
5.3 Two-layer neural networks
We now compute a bound for the Rademacher complexity of two-layer neural networks. Throughout this
section, we use the following notation:
• θ = (w, U ) are the parameters of the model with w ∈ Rm and U ∈ Rm×d , where m denotes the number
of hidden units. We use ui ∈ Rd to denote the i-th row of U (written as a column vector).
• φ(z) = max(z, 0) is the ReLU activation function applied element-wise.
This bound is not ideal as it depends on the number of neurons m. Empirically, it has been found that
the generalization error does not increase monotonically with m. As more neurons are added to the model,
thereby giving it more expressive power, studies have shown that generalization is improved [Belkin et al.,
2019]. This contradicts the bound above, which states that more neurons leads to worse generalization. We
also note that the theorem can be generalized straightforwardly to the setting where the w and U are jointly
the sense that we set√H = {fθ | kwk2 · (maxi kui k2 ) ≤ B} and obtain the generalization bound
constrained in p
Rn (H) ≤ 2BC m n . However, the m dependency still exists under this formulation of H. Nevertheless, we
now derive this bound.
54
Proof. By definition,
" n
#
1X D E
RS (H) = E sup σi w, φ(U x(i) ) (5.43)
σ θ n i=1
" * n
+#
1 X
(i)
= E sup sup w, σi φ(U x ) (5.44)
n σ U :kuj k2 ≤Bu kwk2 ≤Bw i=1
" n
#
Bw X
= E sup σi φ(U x(i) ) (supkwk2 ≤B hw, vi = B kvk2 ) (5.45)
n σ U :kuj k2 ≤Bu i=1
2
√ " n
#
Bw m X
(i)
√
≤ E sup σi φ(U x ) (kvk2 ≤ m kvk∞ ) (5.46)
n σ U :ku k ≤B
j 2 u i=1 ∞
√ " n
#
Bw m X
> (i)
= E sup max σi φ(uj x ) (5.47)
n σ U :ku k ≤B 1≤j≤m
j 2 u i=1
√ " n
#
Bw m X
= E sup σi φ(u> x(i) ) (5.48)
n σ kuk ≤B
2 u i=1
√ " n
#
2Bw m X
> (i)
≤ E sup σi φ(u x ) (by Lemma 5.12) (5.49)
n σ kuk ≤B
2 u i=1
√ " n
#
2Bw m X
≤ E sup σi u> x(i) , (5.50)
n σ kuk ≤B
2 u i=1
where the last inequality follows by applying the contraction lemma (Talagrand’s lemma) and observing that
the ReLU function is 1-Lipschitz. (Observe that the expectation in (5.49) is the Rademacher complexity for
{x 7→ φ(u> x) | kuk2 ≤ Bu }: this is the family that we are applying the contraction lemma to.)
We now observe that the expectation in (5.50) is the Rademacher complexity of the family of linear
models {x 7→ hu, xi | kuk2 ≤ Bu }. Thus, applying Theorem 5.7 yields
√
v
u n
2Bw m u X 2
RS (H) ≤ Bu t x(i) 2 . (5.51)
n i=1
Taking the expectation of both sides and using similar steps to those in the proof of Theorem 5.7 gives
us
This upper bound is undesirable since it grows with the number of neurons m, contradicting empirical
observations of the generalization error decreasing with m.
55
5.3.1 Refined bounds
Next, we look at a finer bound that results from defining a new complexity measure. A recurring theme in
subsequent proofs will be the functional invariance of two-layer neural networks under a class of rescaling
transformations. The key ingredient will be the positive homogeneity of the ReLU function, i.e.
This implies that for any λi > 0 (i = 1, . . . , m), the transformation θ = {(wi , ui )}1≤i≤m 7→ θ0 =
{(λi wi , ui /λi )}1≤i≤m has no net effect on the neural network’s functionality (i.e. fθ = fθ0 ) since
> !
> (i)
ui (i)
wi · φ ui x = (λi wi ) · φ x . (5.57)
λi
In light of this, we devise a new complexity measure C(θ) that is also invariant under such transformations
and use it to prove a better bound for the Rademacher complexity. This positive homogeneity property is
absent in the complexity measure used in the hypothesis class (5.41) of Theorem 5.9.
Pm
Theorem 5.10. Let C(θ) = j=1 |wj | kuj k2 , and for some constant B > 0 consider the hypothesis class
If x(i) 2
≤ C for all i ∈ {1, . . . , n}, then
2BC
RS (H) ≤ √ . (5.59)
n
Remark 5.11. Compared to Theorem 5.9, this bound does not explicitly depend on the number of neurons m.
Thus, it is possible to use more neurons and still maintain a tight bound if the value of the new complexity
measure C(θ) is reasonable. In contrast, the bound of Theorem 5.9 explicitly grows with the total number
of neurons. In fact, Theorem 5.10 is strictly stronger than Theorem 5.9 as elaborated below. Note that
X X 1/2 X 1/2
|wj |kuj k2 ≤ |wj |2 kuj k22 (by Cauchy-Schwarz inequality)
√
≤ kwk2 · m · max kuj k2 (5.60)
j
√
Therefore, if we consider H1 = {fθ | |wj |kuj k2 ≤ B 0 } and H2 = {fθ | kwk2 · m · maxj kuj k√ 0
P
2 ≤ B },
1 2 0
then either Theorem 5.10 on H or Theorem 5.9 on H gives the same generalization bound O(B / n), but
H1 ⊃ H2 .
Moreover, Theorem 5.10 is stronger as we have more neurons—this is because the hypothesis class H as
defined in (5.58) is bigger as m increases. Because of this, it’s possible to obtain a generalization guarantee
that decreases as m increases, as shown in Section 5.4.2.
Proof of Theorem 5.10. Due to the positive homogeneity of the ReLU function φ, it will be useful to define
∆
the `2 -normalized weight vector ūj = uj /kuj k2 so that φ u> j x = ku k
j 2 ·φ(ū>
j x). The empirical Rademacher
complexity satisfies
" n
#
1 X
(i)
RS (H) = E sup σi fθ x (5.61)
n σ θ i=1
n m
1 X X
= E sup σi wj φ uTj x(i) (by dfn of fθ ) (5.62)
n σ θ i=1 j=1
n m
1 X X
= E sup σi wj kuj k2 φ ūTj x(i) (by positive homogeneity of φ) (5.63)
n σ θ i=1 j=1
56
" n #
m
1 X X
= E sup wj kuj k2 σi φ ūTj x(i) (5.64)
n σ θ j=1 i=1
m n
1 X X X X
≤ E sup |wj |kuj k2 max σi φ ūTk x(i) because αj βj ≤ |αj | max |βk |
n σ θ j=1 k∈[n]
i=1 j j
k
(5.65)
" n
#
B X
≤ E sup max σi φ ūTk x(i) (because C(θ) ≤ B) (5.66)
n σ θ=(w,U ) k∈[n] i=1
" n
#
B X
= E sup σi φ ūT x(i) (5.67)
n σ ū:kūk2 =1 i=1
" n
#
B X
T (i)
≤ E sup σi φ ū x (5.68)
n σ ū:kūk2 ≤1 i=1
" n
#
2B X
≤ E sup σi φ ūT x(i) (see Lemma 5.12) (5.69)
n σ ū:kūk2 ≤1 i=1
0
= 2BRS (H ), (5.70)
RS (H00 ) where H00 = x 7→ ū> x : ū ∈ Rd , kūk2 ≤ 1 is a linear hypothesis space. Using RS (H00 ) ≤ √Cn by
We complete the proof by deriving the Lemma 5.12 used in the second-to-last inequality. Notably, the
lemma’s assumption holds in the current context, since
n
X
suphσ, fθ (x)i = sup σi φ ū> x(i) ≥ 0. (5.71)
θ ū:kūk2 ≤1 i=1
Proof. Letting φ be the ReLU function, the lemma’s assumption implies that supθ φ (hσ, fθ (x)i) =
supθ hσ, fθ (x)i for any σ ∈ {±1}n . Observing that |z| = φ(z) + φ(−z),
d
Taking the expectation over σ (and noting that σ = −σ), we get the desired conclusion.
57
5.4.1 Connection to `2 regularization
Recall that margin theory yields
r !
2RS (H) log (2/δ)
for all θ, L0-1 (θ) ≤ +O
e , (5.76)
γmin n
RS (H)
with probability at least 1 − δ. Thus, Theorem 5.10 motivates us to minimize γmin by regularizing C(θ).
Concretely, this can be formulated as the optimization problem
m
X
minimize C(θ) = |wj | · kuj k2 (I)
j=1
or equivalently,
At first glance, the above seems orthogonal to techniques used in practice. However, it turns out that
the optimal neural network from (I) is functionally equivalent to that of the new problem:
m m
1X 1X
minimize C`2 (θ) = |wj |2 + kuj k22 (I*)
2 j=1 2 j=1
subject to γmin (θ) ≥ 1.
This is a simple consequence of the positive homogeneity of φ. For any scaling factor λ = (λ1 , . . . , λm ) ∈ Rm+,
∆
the rescaled neural network θλ = {(λi wi , ui /λi )} has the same functionality as the original neural network
θ = {wi , ui } (i.e. it achieves the same γmin ). Thus,
m m
1 X 1 X
min C`2 (θ) = min min λ2 |wj |2 + λ−2 kuj k22 (5.77)
θ θ λ 2 j=1 j 2 j=1 j
m
X
= min |wj | · kuj k2 (5.78)
θ
j=1
58
of (II) enables us to isolate all dependencies of m in
γmin. Letting θbm denote the minimizer of program (II)
∗
with width m and defining optimal value γm = γmin θbm , we can rewrite the margin bound (5.76) as
4C 1
L(θbm ) ≤ √ · ∗ + (lower-order terms), (5.80)
n γm
∗
where all dependencies on m are now contained in γm . Hence, to show that this bound does not worsen as
∗
m grows, we just have to show that γm is non-decreasing in m. This is intuitively the case since a neural
network of width m + 1 contains one of width m under the same complexity constraints. The following
theorem formalizes this hunch:
∗
Theorem 5.13. Let γm be the minimum margin obtained by solving (II) with a two-layer neural network
∗ ∗
of width m. Then γm ≤ γm+j for all positive integers j.
Proof. Suppose θ = {(wi , ui )}1≤i≤m is a two-layer neural network of width m satisfying C(θ) ≤ 1. Then we
may construct a neural network θe = {(w̃i , ũi )}1≤i≤m+1 of width m + 1 by simply taking
(
(wi , ui ) i ≤ m,
(wei , u
ei ) = (5.81)
(0, 0) otherwise.
θe is functionally equivalent to θ and C(θ)e = C(θ) ≤ 1. This means maximizing γmin over {C(θ) e :
θ of width m + 1} should give no lower of a value than the maximum of γmin over {C(θ) : θ of width m}.
e
∗
is well-defined. The next interesting fact is that in this m → ∞ limit, γ∞ of the two-layer neural network is
equivalent to the minimum margin of an `1 -SVM. As a brief digression, we recap the formulation of `p -SVMs
and discuss the importance of `1 -SVMs in particular.
Since a collection of data points with binary class labels may not be a priori separable, a kernel model
first transforms an input x to ϕ(x) where ϕ : Rd → G is known as the feature map. The model then seeks a
separating hyperplane in this new (extremely high-dimensional) feature space G, parameterized by a vector
µ pointing from the origin to the hyperplane. The prediction of the model on an input x is then a decision
score that quantifies ϕ(x)’s displacement with respect to the hyperplane:
∆
gµ,ϕ (x) = hµ, ϕ(x)i . (5.83)
Motivated by margin theory, it is desirable to seek the maximum-margin hyperplane under a constraint on
µ to guarantee the generalizability of the model. In particular, a kernel model with an `p -constraint seeks
to solve the following program:
subject to kµkp ≤ 1.
Observe that both the prediction and optimization of the feature model only rely on inner products in G.
The ingenuity of the SVM is to choose maps ϕ such that K(x, x0 ) = hϕ(x), ϕ(x0 )i can be directly computed
in terms of x and x0 in the original space Rd , thereby circumventing the need to perform expensive inner
products in the large space G. Remarkably, this “kernel trick” enables us to even operate in an implicit,
infinite-dimensional G.
59
The case of p = 1 is particularly useful in practice as `1 -regularization generally produces sparse feature
weights (the constrained parameter space is a polyhedron and the optimum tends to lie at one of its vertices).
Hence, `1 -regularization is an important feature selection method when one expects only a few dimensions
of G to be significant. Unfortunately, the `1 -SVM is not kernelizable due to the kernel trick relying on
`2 -geometry, and is hence infeasible to implement. However, our next theorem shows that a two-layer neural
network can approximate a particular `1 -SVM in the m → ∞ limit (and in fact, for finite m). For the sake
of simplicity, we sacrifice rigor in defining the space G and convey the main ideas.
Theorem 5.14. Define the feature map φrelu : Rd → G such that x is mapped to φ(u> x) for all vectors u
on the d − 1-dimensional sphere S d−1 . Informally,
..
∆
.>
φ(u x)
φrelu (x) =
..
. u∈S d−1
is an “infinite-dimensional vector” that contains an entry φ(u> x) for each vector u ∈ S d−1 , and we let
φrelu (x)[u] denote the “u”-th entry of this vector. Noting that G is the space of functions which can be
indexed by u ∈ S d−1 , the inner product structure on G is defined by hf, gi = S d−1 f [u]g[u]du.
R
Note that {(wei , ui )} has the same functionality (and also the same complexity measure C(θ), margin, etc.)
as that of {(wi , ui )}, but now ui has unit `2 -norm (i.e. ūi ∈ S d−1 ). Thus, φ(u>
i x) can be treated as a
feature in G and we can construct an equivalent `1 -SVM (denoted by µ) such that w ei is the coefficient of µ
associated with that feature. Since w ei must only be “turned on” at ui , we have
X
µ[u] = w̃i δ(u − ui ), (5.87)
i∈S d−1
where δ(u) is the Dirac-delta function. Given this µ, we can check that the SVM’s prediction is
Z
gµ,φrelu (x) = µ[u]φrelu (x)[u]du (5.88)
d−1
ZS X
w̃i δ(u − ui )φ u> x du
= (5.89)
S d−1
i∈S d−1
X
w̃i φ u>
= i x , (5.90)
i∈S d−1
which is identical to the output f{(wei ,ui )} (x) of the neural network. Furthermore,
∞
X ∞
X
kµk1 = |w
ei | = |wi | · kui k2 ≤ 1, (5.91)
i=1 i=1
60
where the last equality holds because {(wei , ui )} satisfies the constraints of (II). This shows that our con-
∗
structed µ satisfies the `1 -SVM constraint. Thus, γ∞ ≤ γ`∗1 since the functional behavior of the optimal
neural network is contained in the search range of the SVM.
Remark 5.15. How well does a finite-dimensional neural network approximate the infinite-dimensional `1
network? Proposition B.11 of [Wei et al., 2020] shows that you only need n + 1 neurons. Another way to
say this is that {γm } stabilizes once m = n + 1:
The main idea of the proof is that if we have a neural net θ with (n + 2) neurons, then we can always pick
a simplification θ0 with (n + 1) neurons such that θ, θ0 agree on all n datapoints.
∗
As an aside, this result also resolves the issue in our partial proof. A priori, γ∞ may not necessarily be
attained by a set of weights {(w ei , ui )}, but the above shows that it is achievable with just n + 1 non-zero
indices.
B2C 2
log N (, F2 , ρ) ≤ log2 (2 min(n, d) + 1) (5.94)
2
Remark 5.17. Applying (4.152) to the covering number bound derived above with R = B 2 C 2 , we conclude
that the Rademacher complexity of this class of linear models satisfies
RS (Fq ) ≤ Oe BC
√ . (5.95)
n
We also prove this result without relying on Dudley’s theorem in Theorem 5.5.
Next, we consider multivariate linear functions as
Pthey are building blocks for multi-layer neural networks.
n
Let M = (M1 , · · · , Mn ) ∈ Rm×n and kM k2,1 = i=1 kMi k2 . Then, kM > k2,1 denotes the sum of the `2
norms of the rows of M .
q P
n
Theorem 5.18. Let F = {x → W x : W ∈ Rm×d , ||W > ||2,1 ≤ B} and let C = n1 i=1 ||x(i) ||22 . Then,
c2 B 2
log N (, F, L2 (Pn )) ≤ ln(2dm). (5.96)
2
61
Remark 5.19. In some sense, Theorem 5.18 arises from treating each dimension of the multivariate problem
independently.
> We
can view the linear layer as applying m different linear functions. Explicitly, if W =
w1 w1> x
.. .
. and W x = .. , then as we expect, kW > k2,1 = kwi k2 .
P
> >
wm wm x
Using this notation, we establish an upper bound on the Rademacher complexity of a dense neural network.
Theorem 5.20 ([Bartlett et al., 2017]). Suppose that ∀i, kx(i) k2 ≤ c and let
Then,
r
! r
!3/2
2/3
c Y X bi
RS (F) ≤ √ · κi · 2/3
. (5.99)
n i=1 i=1 κi
| {z } | {z }
(I) (II)
We use kW kop to denote the operator norm (or spectral norm) of W , and recall that kWi> k2,1 denotes
the sum of the `2 norms of the rows of Wi . Examining (5.99), we see that (II) is relatively small as it is a
sum of matrix norms, and so the bound is dominated by (I), which is a product of matrix norms.
Remark 5.21. We note that f (x) = W x is Lipschitz with a Lipschitz constant of kW kop . This is because
.
Remark 5.22. As a corollary of the above theorem, we also get a bound on the generalization error for the
margin loss of the following form:
2/3 3/2
! !
r r
1 1 Y X kWi> k2,1
generalization error ≤ Õ ·√ · kWi kop · 2/3
, (5.102)
γmin n i=1 i=1 kWi kop
• At a high level, we want to show that the covering number N (, F, ρ) for a dense neural network is ≤ R2 .
Proving this would enable us to apply Theorem 4.28 to get a bound on the Rademacher Complexity.
• To bound the covering number for a dense neural network, we use -covers to cover each layer of fθ
separately, and then combine them to prove that there exists an -cover of the original function fθ .
62
• To combine the -covers of each layer, we use the Lipschitzness of each layer.
• We control and approximate the error propagation that is introduced through discretizing each layer
using i -coverings in order to get a reasonable final .
As a prelude to the proof of Theorem 5.20, let us abstractify each layer of F as Fi where Fi corresponds
to matrix multiplication by Wi composed with a nonlinear activation function σ. We then denote F as the
composition of each of these (single layer) function spaces as follows:
Let us also assume, for simplicity, that fi (0) = 0 and kx(j) k2 ≤ c for all j = 1, . . . , n. Then, by applying the
definition of Lipschitz continuity, we obtain that:
Proof. Let 1 , . . . , r be the radius for each layer. Let C1 be an 1 -cover of F1 . Then, for all f10 ∈ C1 , we
define C2,f10 as an 2 -covering of the set
Taking a union of this covering over all f10 ∈ C1 clearly yields an -covering for F2 ◦ F2 . In paricular, if
[
C2 = C2,f10 , (5.109)
f10 ∈C1
then C2 is an -cover of F2 ◦ F1 with = 1 · κ2 + 2 . We depict this covering procedure in Figure 5.3, and
we prove this claim rigorously in the sequel.
1 If c2
i−1
Fi defines a collection of linear models, then log N (i , Fi , L2 (Pn )) ≤ 2
.
i
63
Figure 5.3: We visualize the covering strategy adopted in the proof of Lemma 5.23. The two grey sets depict
the output spaces of the first and second layers, namely, Q1 and Q2 , respectively. The blue dots in Q1 are
the outputs of three functions in the 1 -cover C1 , while the blue subsets of Q2 depict F2 ◦ f10 and F2 ◦ f100 .
The red circles show how we construct a covering, C2 , of Q2 . In particular, the two collections of red circles
depict the C2,f10 and C2,f100 covers. Taking the union of such covers over all functions in C1 yields C2 .
Next, we bound the sizes of these covers. Directly applying the assumption given by (5.106), we conclude
that
Next, we show that for the above construction, the radius of the cover for F is
X r Yr
= i κj . (5.116)
i=1 j=i+1
64
For any choice of fr ◦ · · · ◦ f1 ∈ Fr ◦ Fr−1 ◦ · · · ◦ F1 , then, by definition of C1 , there exists f10 ∈ C1 such that
We can leverage these two facts and the triangle inequality to now prove that f20 ◦ f10 is close to f2 ◦ f1 .
Namely,
ρ (f20 ◦ f10 , f2 ◦ f1 ) ≤ ρ (f20 ◦ f10 , f2 ◦ f10 ) + ρ (f2 ◦ f10 , f2 ◦ f1 ) (triangle ineq.) (5.119)
≤ 2 + ρ (f2 ◦ f10 , f2 ◦ f1 ) (def. of C2,f10 ) (5.120)
≤ 2 + κ2 ρ (f10 , f1 ) (5.104) (5.121)
≤ 2 + κ2 1 (def. of C1 ) (5.122)
Inducting to prove this argument for arbitrary k, we similarly apply the definition of Ck,fk−1
0 ,...,f10 to conclude
0 0 0
that there exists fk ◦ fk−1 ◦ · · · ◦ f1 ∈ Ck such that
ρ(fk0 ◦ fk−1
0
◦ · · · f10 , fk ◦ fk−1
0
◦ · · · f10 ) ≤ k (5.123)
Then, expanding using the triangle inequality and peeling off terms by applying the definition of our i -
coverings, we again show that
ρ fk0 ◦ fk−1
0
◦ · · · ◦ f10 , fk ◦ · · · ◦ f1 ≤ ρ fk0 ◦ fk−1
0
◦ · · · ◦ f10 , fk ◦ fk−1
0
◦ · · · ◦ f10
(5.124)
0 0 0 0 0
+ ρ fk ◦ fk−1 ◦ fk−2 ◦ · · · ◦ f1 , fk ◦ fk−1 ◦ fk−2 ◦ · · · ◦ f1
+ · · · + ρ (fk ◦ fk−1 ◦ · · · ◦ f2 ◦ f10 , fk ◦ fk−1 ◦ · · · ◦ f1 )
≤ ρ fk0 ◦ fk−1
0
◦ · · · ◦ f10 , fk ◦ fk−1
0
◦ · · · ◦ f10
(5.125)
0
+ κk · ρ(fk−1 ◦ · · · ◦ f10 , fk−1 ◦ fk−2
0
◦ · · · ◦ f10 ) (5.126)
Y k
+ ··· + κj ρ(f10 , f1 )
j=2
k
X k
Y
≤ i κj . (5.127)
i=1 j=i+1
Note that the first inequality follows by the triangle inequality, the second by the κi -Lipschitz continuity of
fi , and the third by applying the definition of each of our i -covers.
Proof of Theorem 5.20. We now apply Lemma 5.23 to dense neural networks. Dense neural networks consist
of a composition of layers, where each layer is a linear model composed with a 1-Lipschitz activation. Using
Theorem 5.18 along with the property that 1-Lipschitz functions will only contribute a factor of at most 1
(Lemma 4.29), the covering number of each layer can be bounded by:
2 2
c b
g (i , ci−1 ) = Õ i−12 i , (5.128)
i
where c2i−1 is the norm of the inputs, b2i is kWi> k2,1 , and 2i is the radius. From Lemma 5.23, we know that
r
!
X c2i−1 b2i
log N (, F, ρ) = Õ (5.129)
i=1
2i
65
for
r
X r
Y
= i κj (5.130)
i=1 j=i+1
r
X α2 i R
≥ , (5.134)
i=1
2i 2
Pr Qr 32 23
where R = i=1 ci−1 bi j=i+1 κj . We note that equality holds when
! 13
c2 b2
i = Qri−1 i · 2 (5.135)
j=i+1 κj
2
Pr bi3 Qr
i=1 2 i=1 κi
3
κi3
| {z }
0
Using this choice of i and letting 0 equal the second factor in (5.135) for notational convenience, we know
that the log covering number is (up to a constant factor):
r r 2
X c2i−1 b2i X c2i−1 b2i (κi+1 · · · κr ) 3
= (5.136)
i=1
2i i=1
4
ci−1
3
4
bi3 (0 )2
r
X 1 2
= (ci−1 bi κi+1 · · · κr ) 30 )2
(5.137)
i=1
(
2 P 2
2
r 23 Yr r bi 32 Qr
2
X bi 2 c 3 (
i=1 κi ) κ
i=1 i
3
= c3 κi3 2
(5.138)
i=1
κi i=1
r 23 Yr
!3
2
X bi 2 1
= c 3 κi3
(5.139)
i=1
κi i=1
2
r r 2 !3
2
Y
2
X bi 3 1
=c κi . (5.140)
i=1 i=1
κi 2
66
Since this log covering number is of the form R/2 , we can apply (4.152) and conclude that
r
R
RS (F) . (5.141)
n
Last, plugging in
r r 2 !3
2
Y X bi 3
R=c κ2i (5.142)
i=1 i=1
κi
67
Chapter 6
Data-dependent Generalization
Bounds for Deep Nets
In Theorem 5.20, we proved the following bound on the Rademacher complexity of deep neural networks:
r
Y
RS (F) ≤ kWi kop · poly(kW1 k, . . . , kWr k). (6.1)
i=1
This bound, however, suffers from multiple deficiencies. In particular, it grows exponentially in the depth,
r, of the network and kWi kop measures the worst-case Lipschitz-ness of the network layers over the input
space.
In this section, we obtain a tighter generalization bound that depends upon the realized Lipschitz-ness
of the model on the training data. To further motivate this approach, we also note that stochastic gradient
descent, i.e. the typical optimization method typically used to fit deep neural networks, prefers models that
are more Lipschitz (see Chapter (TBD) for further discussion) . This preference must be realized by the
model on empirical data, however, as no learning algorithm has access to the model’s properties over the
entire data space.
Ultimately, we aim to prove a tighter bound on the population loss that grows polynomially in the
Lipschitz-ness of f on the empirical data. Namely, given that f is parameterized by some θ, we hope to
derive a bound on the population loss at θ that is a polynomial function of the Lipschitz-ness of f on
x(1) , . . . , x(n) as well as the norm of θ.
Uniform convergence with a data-dependent hypothesis class. So far in this course, given some
complexity measure we denote as comp(·), our uniform convergence results always appear in one of the two
following forms (which are essentially equivalent). Namely, with high probability,
comp(F)
∀f ∈ F, L(f ) ≤ √ (I) (6.2)
n
comp(f )
∀f, L(f ) ≤ √ (II) (6.3)
n
√
Remark 6.1. Most of the results we have obtained so far are of type I, e.g. with comp(F)/ n = Rn (F).
We obtain results of type II by considering a restricted set of functions FC = {f : comp(f ) ≤ C}. We then
apply a type I bound to FC and take a union bound over all C. Therefore, these two type of bounds are
essentially equivalent (up to a small additive factor difference due to the additional union bound over the
choices of C.)
Note, however, that neither of these approaches produce bounds that depend upon the data. By contrast,
in the sequel, we will derive a new data-dependent generalization bound. These bounds state that with high
68
probability over the choice of the empirical data and, for all functions f ∈ F,
L(f ) ≤ comp f, {(x(i) , y (i) )}ni=1 (6.4)
Even though the complexity measure depends on the training data, and is thus a random variable by itself,
it can be used as a regularizer which can be added to the original training loss.
Remark 6.2. Although there is no universal consensus on the type of generalization bound we should derive,
we can argue that there is no way to leverage more information in a generalization bound beyond the empirical
data. For example, one might try to use the input distribution P to define the complexity measure, but if
we allowed ourselves access to P , we could just define comp(f, P ) = EP [f (X)]. In some sense, defining a
generalization bound using the true distribution amounts to cheating, and the dependence on the empirical
data seems to be proper because the bound can still be used as a regularizer.
In this new paradigm, we can no longer take the previous approach of obtaining type I bounds and then
derive a type II bound via a reduction. To see why, suppose that we have the hypothesis class
If our complexity measure depends on the empirical data, then so does our hypothesis class FC , which
makes FC itself a random variable. However, our theorems regarding Rademacher complexity require that
the hypothesis class be fixed before we ever see the empirical data.
We may hope to get around this by changing the way we think about uniform convergence. Consider the
simplified case where our new complexity measure is separable, i.e.
n
X
comp(f, {(x(i) , y (i) )}ni=1 ) = h(f, x(i) ), (6.6)
i=1
Suppose we have a region of low complexity in our existing loss function as depicted in Figure 6.1. Because
this region is random, so we cannot selectively apply uniform convergence. However, we can use our new
surrogate loss function `˜ in that region. By modifying the loss function in this way, we can still fix the
hypothesis class ahead of time, allowing us to apply existing tools to `(f˜ ). The surrogate loss was used
in [Wei and Ma, 2019a] to obtain a data-dependent generalization bound, though there are possibly various
other ways to define surrogate losses and apply existing uniform convergence guarantees. In the sequel, we
introduce a particular surrogate “margin” that allows us to cleanly apply our previous results to a (implicitly)
data-dependent hypothesis class [Wei and Ma, 2019a].
69
loss
train
test
θ
low-complexity params
Figure 6.1: These curves depict a “low-complexity” region in parameter space. The blue curve is the
unobserved test loss we aim to bound, while the green curve denotes the empirical training loss we observe.
Observe that in the region of θ that we identify as being “low-complexity,” the gap between the train and
test losses is smaller than in the high-complexity regions.
To simplify the exposition of the machinery below, we also introduce the ∞-covering number N∞ (, F) as
the minimum cover size with respect to the metric ρ defined as the infinity-norm distance on an input domain
X:
ρ(f, f ) , sup |f (x) − f 0 (x)| , kf − f 0 k∞ .1 (6.9)
x∈X
Remark 6.3. Notice that N∞ (, F) ≥ N (, F, L2 (Pn )). This is because the ρ = L∞ (X ) is a more demanding
measure of error: f and f 0 must be close on every input, not just the empirical data. That is,
v
u n
u1 X
t (f (xi ) − f 0 (xi ))2 ≤ sup |f (x) − f 0 (x)|. (6.10)
n i=1 x∈X
Lemma 6.4. Suppose gf is a generalized margin. Let G = {gf : f ∈ F }. Suppose that for some R,
2
log N∞ (, G) ≤ b R2 c for all > 0.2 Then, with high probability over the randomness in the training data,
for every f in F that correctly predicts all the training examples,
L01 ≤ Oe √1 · R
+ e √1 .
O (6.11)
n mini∈[n] gf (x(i) , y (i) ) n
Proof. The high-level idea of our proof is to replace F with G before repeating the standard margin theory
argument from Section 5.1.2.
Let `γ be the ramp loss given in (5.1), which is 1 for negative values, 0 for values greater than γ,
and a linearPinterpolation between 1 and 0 for values between 0 and γ. We define the surrogate loss as
n
L̂γ (θ) = n1 i=1 `γ (gfθ (x(i) , y (i) )), and the surrogate population loss as Lγ (θ) = E[`γ (gfθ (x, y))]. Applying
Corollary 4.19, where we used the Rademacher complexity to control the generalization error, we conclude
that
e √1 .
Lγ (θ) − L̂γ (θ) ≤ RS (`γ ◦ G) + O (6.12)
n
1 If f maps X to multi-dimensional outputs, we will define ρ(f, f ) , sup 0 0
x∈X kf (x) − f (x)k , kf − f k∞ where the norm in
kf (x) − f 0 (x)k is a norm in the output space of f (which will be the Euclidean norm in this rest of this section).
2 Recall that this is the worst dependency on that we can tolerate when converting covering number bounds to Rademacher
complexity.
70
Next we observe that
Then, using our results relating the log of the covering number toa bound
on the Rademacher complexity (re-
R
call (4.152) and Theorem 4.28), we conclude that RS (`γ ◦G) ≤ O γ n . Take γ = γmin = mini gγ (x(i) , y (i) ).3
e √
Using Corollary 4.19, we conclude that L̂γmin (θ) ≤ 0 + O e √ R + e √1 , as desired.
O
n·γmin n
We can then ask: what was the smallest perturbation that changed our decision? That is, let
v
u r
∆
uX
mf (x, y) = min t ||δi ||22 s.t. f (x, δ)y ≤ 0, (6.20)
δ
i=1
71
Theorem 6.5. With high probability, for all f with training error 0,
Pr
1 i=1 kWi k1,1 r
L01 (f ) ≤ O √ ·
e +O √
e , (6.21)
n mini∈[n] mf (x(i) , y (i) ) n
where N∞ (i , Fi ) is defined with respect to the input domain X = {x : kxk2 ≤ 1}.
That is, we only have to find the covering number for each layer, and then we have the covering number
for the (all-layer margin of the) composed function class. Notice that we bounded the covering number of
m ◦ F in the above lemma, not F.
Then, the desired result follows directly from the preceding decomposition lemma.
j 2k
c
Corollary 6.7. Assume that log N∞ (i , Fi ) ≤ 2i for every Fi , i.e. the function class corresponding to
i
ci
the i-th layer of f in Theorem 6.5. Then, by taking i = · √P 2
, we have that
i ci
2
P
i ci
log N∞ (, m ◦ F) ≤ . (6.23)
2
This result gives the complexity of the composed model in terms of the complexity of the layers, with
2
e β2i (where βi is a bound on
each ci given by kWi k1,1 . For linear models, we can show N∞ (i , Fi ) ≤ O
kWi k1,1 ), and this implies Theorem 6.54 Finally, we are only left with the proof of Lemma 6.6.
Proof of Lemma 6.6. Now we will prove a limited form of the decomposition lemma for affine models: Fi =
{z 7→ σ(Wi z) : kWi k1,1 ≤ βi }. There are two crucial steps to this problem. First, we will prove that mf (x, y)
is 1-Lipschitz in f . That is, for all F = Fr ◦ Fr−1 ◦ · · · ◦ F1 and F 0 = Fr0 ◦ Fr−1
0
◦ · · · ◦ F10 ,
v
u r
uX
|mf (x, y) − mf 0 (x, y)| ≤ t max kfi (x) − fi0 (x)k22 . (6.24)
kxk2 ≤1
i=1
Notice that now we are working with a clean sum of differences, with no multipliers!
Second, we construct a cover: Let U1 , . . . , Ur be 1 , . . . , r -covers of F1 , . . . , Fr , respectively, such that
|Ui | = N∞ (i , Fi ). By definition, for all fi in Fi , there exists a ui ∈ Ui such that maxkxk≤1 kfi (x)−ui (x)k2 ≤
i . Take U = Ur ◦ Ur−1 ◦ · · · ◦ U1 = {ur ◦ ur−1 ◦ · · · ◦ u1 } as the cover for m ◦ F. Suppose we were given
4 Technically, we also need to union bound over the choices of βi , which can also be achieved following Remark 5.4.
72
f = fr ◦ · · · ◦ f1 ∈ F. Let ur , . . . , u1 be the nearest neighbors of fr , . . . , f1 . Then
v
u r
uX
|mf (x, y) − mu (x, y)| ≤ t max ||fi (x) − ui (x)||22 (6.25)
||x||≤1
i=1
v
u r
uX
≤t 2 i (by construction). (6.26)
i=1
Having established the validity of our cover, we now return to our claim of 1-Lipschitz-ness stated in
(6.24). By symmetry, it is sufficient to prove an upper bound for mf 0 (x, y) − mf (x, y).
Let δ1∗ , . . . , δr∗ be the optimal choices of δ in defining mf (x, y). Our goal is to turn these into a feasi-
ble solution q of mf 0 (x, y), which we denote by δ̂1 , . . . , δ̂r . If this solution is feasible, we obtain the bound
kδ̂i k22 .
P
mf 0 (x, y) ≤
Intuitively, we want to define a perturbation for f 0 that does the same thing as δ1∗ , . . . , δr∗ for f . In plain
English, (f 0 , δ̂1 , . . . , δ̂r ) should do the same thing as (f1 , δ1∗ , . . . , δr∗ ). Recall that f has parameters W1 , . . . , Wr
and f 0 has parameters W10 , . . . , Wr0 . Then, under the optimal perturbation,
h1 = W1 x + δ1∗ kxk2 (6.27)
h2 = σ(W2 h1 ) + δ2∗ kh1 k2 (6.28)
..
.
hr = σ(Wr hr−1 ) + δr∗ khr−1 k2 (6.29)
We want to imitate this by perturbing f 0 in some way. In particular, let
h1 = W10 x + δ1∗ kxk2 + (W1 − W10 )x, (6.30)
| {z }
∆
= δ̂1 kxk2
∆ (W1 −W10 )x
where the last term serves to compensate for the difference between W1 and W10 . Thus, δ̂1 = δ1∗ + kxk2 .
We repeat this argument for every layer. Using the second layer as an example,
h2 = σ(W20 h1 ) + δ2∗ kh1 k + σ(W2 h1 ) − σ(W20 h1 ) . (6.31)
| {z }
∆
= δ̂2 khk2
σ(W2 h1 )−σ(W20 h1 )
So, δ̂2 = δ2∗ + kh1 k2 . In general,
73
Note that in (6.35), constraining kxk2 ≤ 1 is equivalent to dividing by the `2 -norm of x.
Remark 6.8. We can compare the above with Theorem 5.20 proven in [Bartlett et al., 2017].
Ignoring minor details (e.g. dependency on r), we suppose that y = 1. Then, if f (x) > 0 and f (x + δ) ≤ 0,
it must be the case that kδk2 . Qr |f (x)|
kWi kop . This further implies that
i=1
mf (x, y) 1
& Qr . (6.38)
yf (x) i=1 kW i kop
1 1
Rearranging, we conclude that we have obtained a tighter bound since the inverse margin mf (x,y) . yf (x) ·
Qr
i=1 kWi kop .
Remark 6.9. Later, we will show that SGD prefers Lipschitz solutions and Lipschitzness on data points.
Implicitly, SGD seems to be maximizing the all-layer margin. Since the algorithm is (in a sense) minimizing
Lipschitzness on a data point, this likely accounts for the empirically observed gap between the two bounds.
Remark 6.10. The approach we have described here is also similar to other methods in the deep learning
literature. Other authors have introduced a method known as SAM (a form of sharpness-aware regulariza-
tion); this method applies a perturbation to the parameter θ itself rather than on the intermediate hidden
∂`
parameters hi . However, these two methods are related! If we consider the (single-example) loss ∂W i
, it
∂` >
equals ∂hi+1 · hi . Note that the norm of the term on the left is bounded by the product of the norms of the
two terms of the right; this observation relates the model’s Lipschitzness with respect to the parameters to
its Lipschitzness with respect to the hidden layer outputs.
Remark 6.11. Finally, we can prove a more general version of this result in which we do not need to study
the minimum margin of the entire dataset, and insteadqconsider the average margin. Using this approach,
Pn
we can show that the test error is bounded above by n1 n1 i=1 mf (x(i)1 ,y(i) )2 times the sum of complexities
of each layer, plus a low-order term.
qP qP qP
5 Minkowski’s inequality, which states that kai + bi k22 ≤ kai k22 + kbi k22 . In this setting, this inequality can
also be proved using Cauchy-Schwarz.
74
Chapter 7
We now turn to a high-level overview of deep learning theory. To begin, we outline a framework for classical
machine learning theory, then discuss how the situation is different from deep learning theory.
1. Approximation theory attempts to answer whether there is any choice of parameters θ that achieves
low population error. In other words, is the choice of hypothesis class good enough to approximate
the ground truth function? Using notation from earlier in this course, the goal is to upper bound
L(θ∗ ) = minθ∈Θ L(θ).
2. Statistical generalization focuses on bounding the excess risk L(θ̂)−L(θ∗ ). In Chapter 4 we obtained
the following bound:
Any global minimizer θ̂λ of L̂reg has small excess risk L(θ̂λ ) − L(θ∗ ) . (7.2)
3. Optimization considers how to obtain the minimizer θ̂ or θ̂λ computationally. This usually involves
convex optimization: if L̂ or L̂reg is convex, then we have a polynomial-time algorithm to find the
global minimum.
1 In earlier chapters, we defined the complexity of a hypothesis class, not of a specific parameter value. To reconcile these
two approaches, think of R as a measure of complexity (such as a norm) that we can then use to define a hypothesis class Θ,
i.e. Θ = {θ0 : R(θ0 ) ≤ R(θ)}.
75
While there are many tradeoffs to consider between these three components (for example, we may be
able to find a loss function for which optimization is easy, but generalization becomes worse), they are
conceptually independent, and it is typically possible to study each area individually, then combine all three
to get a result.
Figure 7.1: The black and red lines denote the training and test error, respectively, of a three layer neural
network fit to and evaluated on MNIST [Neyshabur et al., 2015]. While classical generalization theory
predicts that beyond some threshold, the test error will increase with complexity (shown by the purple line),
the true test error continues to decline with overparameterization. Though not depicted here, Neyshabur et
al. observe similar test set error curves for a neural network fit to CIFAR-10.
Let us consider the difference in each of the three components described for classical machine learning
theory.
1. Approximation theory: Large neural net models are considered to be very expressive. That is, both
the population loss L(θ) and the finite sample loss L̂(θ) can be made small. In fact, neural networks are
universal approximators; see for example [Hornik, 1991]. This can be a somewhat misleading statement
as the definition of universal approximator allows for the size of the network to be impracticably large,
but morally it seems to hold true in practice anyway.
This expressivity is possible because neural networks are usually highly over-parametrized : they have
many more parameters than samples. It is possible to prove that in this regime, the network can
“memorize” the entire dataset and achieve approximately zero training error [Arpit et al., 2017].
2. Statistical generalization: Relatively weak regularization is used in practice. In many cases only
weak `2 regularization is used, i.e.
b reg (θ) = L̂(θ) + λkθk2 .
L (7.3)
2
The first interesting fact is that this regularized loss does not have a unique (approximate) global
minimizer. This is due to overparametrization: there are so many degrees of freedom that there are
many approximate global minimizers with approximately the same `2 norm.
76
(a) (b)
Figure 7.2: We use dotted and solid lines to depict training and test error, respectively. Figure 7.2a demon-
strates how global minimizers for the training loss can have differing performance on test data. In Figure 7.2b,
blue and red colors differentiate between the model fit with a decaying learning rate and a small constant
learning rate. Though both neural networks shown in this plot achieve 0 training error, the global minimizer
obtained by a more sophisticated learning rate schedule appears to generalize better to unseen data.
However, it turns out that these global minimizers are not equally good: many models which achieve
zero training error may have very bad test error (Figure 7.2a). Take, for example, using stochastic
gradient descent (SGD) to learn a model to classify the dataset CIFAR-10. In Figure 7.2b, we show
two instantiations of this: one starting with a large learning rate and slowly decreasing it, and one
with a small learning rate throughout. Even though both instantiations result in approximately zero
training error, the former leads to much better test performance.
Therefore, the job of optimizers in deep learning is not just to find an arbitrary global minimum: we
need to find the right global minimum. This contrasts sharply with (7.2) from the classical setting,
where achieving a global minimum leads to good guarantees on generalization error. This means that
(7.2) is simply not powerful enough to deal with deep learning, because it cannot distinguish between
global minima with good test error and bad test error.
3. Optimization: The discussion above means that optimization plays a significant role in generalization
for deep learning. Different training algorithms/optimizers have different “implicit biases” or “implicit
regularization effect”, causing them to converge to different global minimizers. Understanding the
implicit regularization effect of optimizers is thus a central goal of deep learning theory. The lack
of understanding implicit regularization hinders the development of fast optimizers—it is impossible
to design a good optimization algorithm without also considering its impact on generalization. In
fact, many algorithms for non-convex optimization have been proposed that work well for minimizing
training loss, but because their implicit bias is different, they lead to worse test performance and are
therefore not too useful.
Often these implicit biases or implicit regularization effect can be characterized in the form of showing
the optimizers prefer θ̂ of certain low complexity among all the global minimizers. The deep learning
analog of (7.2) often consists of two statements: (a) the optimizer implicitly prefers low complexity so-
lution according to complexity measure R(·) by converging to a global minimizer θ̂ with low complexity
R(θ̂), and (b) low complexity solutions generalize. This means that we end up doing more work on
the optimization front—the optimizer needs to ensure both a small training loss and a low complexity
solution. On the other hand, proving generalization bounds (statement (b)) works similarly to the
classical setting once we understand how our optimizer finds a low-complexity solution.
We summarize some of the results that we will present in the future chapters.
77
1. Optimization. First, we will prove that under certain data distribution assumption, optimizers such as
stochastic gradient decent can converge to an approximate global minimum, even though the objective
function is non-convex. Results of this form can be shown on matrix factorization problems and
linearized neural networks, even without over-parameterization, but so far are limited to these simple
models. Second, we will discuss a recent approach, called neural tangent kernels (NTK), which proves
that for almost any neural networks, with overparameterization, gradient descent can converge to a
global minimum, under specific hyperparameter settings (e.g, specific learning rate and initialization).
However, it turns out that these specific hyperparaemeter settings does not provide sufficient implicit
regularization effect for the learned models to generalize. (In other words, the optimizer only returns
a global minimizer, but not a global minimizer that generalizes well.)
2. Implicit regularization effect. This involves showing that the solution θ̂ obtained by a particular
optimizer has low complexity R(θ̂) ≤ C according to some complexity measure R(·) (which depends
on the choice of optimizers). It’s believed and empirically observed that any changes or tricks in the
optimizers (e.g., learning rate schedule, batch size, initialization, batchnorm) could introduce additional
implicit regularization effects. We will only demonstrate these on some special cases of models (e.g.
logistic regression, matrix factorization) and optimizers (e.g. gradient descent, label noise in SGD,
dropout, learning rate). Recently, there are also more general results with label noise SGD [Blanc
et al., 2019, Damian et al., 2021].
3. Generalization bounds. This part involves showing that for all θ such that R(θ) ≤ C with L̂(θ) ≈ 0,
we have L(θ) is small. That is, we show that low-complexity solutions to the empirical risk problem
generalize well. We will be working with more fine-grained complexity measures (e.g., those complexity
measures that are similar to the complexity measure in part 2 above that are preferred by the optimizer).
Here, many tools we developed in classical machine learning can still apply.
78
Chapter 8
Nonconvex Optimization
In the previous chapter, we outlined conceptual topics in deep learning theory and how the situation was
different from classical machine learning theory. In particular, we described approximation theory, statistical
generalization and optimization. In this chapter, we will focus on optimization theory in deep learning. We
will introduce some basics about optimization (Section 8.2), discuss how we can make the notion “all local
minima are global minima” rigorous, and walk through two examples where this is the case (Section 8.3).
Finally, we introduce the neural tangent kernel approach which allows us to characterize of the loss of general
neural networks near a specific initialization (or under specific parameterization).
Figure 8.1: Classification of different functions for optimization. The functions we optimize in deep learning
seem to fall mostly within the middle cloud.
Before diving into details, we first highlight some observations that will be important to keep in mind
when discussing optimization in deep learning. Suppose g(θ) is the loss function. Recall that the gradient
descent (GD) algorithm would do the following:
∆
1. θ0 = initialization
79
Figure 8.2: Illustration of how gradient descent does not always find the global minimum. In the picture,
gradient descent initialized at the blue point only makes it to the local minimum at the red point: it does
not find the global minimum at the black point.
80
It is generally not easy to verify if a point is a local minimum. In fact, we have the following theorem
regarding the computational tractability:
Theorem 8.3. It is NP-hard to check whether a point is a local minimum or not [Murty and Kabadi,
1987]. In addition, Hillar and Lim [Hillar and Lim, 2013] show that a degree four polynomial is NP-hard to
optimize.
Definition 8.4 (Strict-saddle condition [Lee et al., 2016]). For positive α, β, γ, we say that f : Rd 7→ R is
(α, β, γ)-strict-saddle if every x ∈ Rd satisfies one of the following:
1. k∇f (x)k2 ≥ α.
2. λmin (∇2 f (x)) ≤ −β.
Theorem 8.5 (Informally stated). If f is (α, β, γ)-strict-saddle for some positive α, β, γ, then many op-
timizers (e.g. gradient descent, stochastic gradient descent,
cubic regularization)
can converge to a local
minimum with -error in Euclidean distance in time poly d, α1 , β1 , γ1 , 1 .
Therefore, if all local minima are global minima and the function satisfies the strict-saddle condition,
then optimizers can converge to a global minimum with -error in polynomial time. (See Figure 8.3 for an
example of a function whose local minima are all global minima.) The next theorem expresses this concretely
by being explicit about the strict-saddle condition:
Theorem 8.6. Suppose f is a function that satisfies the following condition: ∃ 0 , τ0 , c > 0 such that if
x ∈ Rd satisfies k∇f (x)k2 ≤ < 0 and ∇2 f (x) −τ0 I, then x is c -close to a global minimum of f .
Then many optimizers
can converge to a global minimum of f up to δ-error in Euclidean distance in time
poly 1δ , τ10 , d .
Theorem 8.7. All local minima of g are global minima (even though g is non-convex).
81
Figure 8.3: A two-dimensional function with the property that all local minima are global minima. It also
satisfies the strict-saddle condition because all the saddle points have a strictly negative curvature in some
direction.
Figure 8.4: Objective function for principal components analysis (PCA) when d = 1.
Remark 8.8. For d = 1, g(x) = 12 (m − x2 )2 for some constant m. Figure 8.4 below shows such an example.
We can see that all local minima are indeed global minima.
Proof. Step 1: Show that all stationary points must be eigenvectors. From HW0, we know that ∇g(x) =
−(M − xx> )x, hence
∇g(x) = 0 =⇒ M x = kxk22 · x, (8.2)
which implies that x is an eigenvector of M with eigenvalue kxk22 . From the Eckart–Young–Mirsky theorem
we know the global minimum (i.e. the best rank-1 approximation) is the eigenvector with the largest
eigenvalue.
82
Step 2: Show that all local minima must be eigenvectors of the largest eigenvalue. We use the second
order condition for this. For x to be a local minimum we need ∇2 g(x) 0, which means for any v ∈ Rd ,
To compute hv, ∇2 g(x)vi, we use the following trick: expand g(x + v) into g(x) + linear term in v +
quadratic term in v, then the quadratic term will be 21 hv, ∇2 g(x)vi (see HW0 Problem 2d for an exam-
ple). Using this trick, we get
1
g(x + v) = kM − (x + v)(x + v)> k2F (8.4)
2
1 1
= kM − xx> k2F − hM − xx> , xv > + vx> i + hxv > + vx> , xv > + vx> i
2 2
− hM − xx> , vv > i + higher order terms in v. (8.5)
Hence, we have
1 1
hv, ∇2 g(x)vi = hxv > + vx> , xv > + vx> i − hM − xx> , vv > i (8.6)
2 2
= hx, vi2 + kxk22 kvk22 − v M v + hx, vi2 (8.7)
2 >
= 2hx, vi + kxk22 kvk22 − v M v. (8.8)
Picking v = v1 , the unit eigenvector with the largest eigenvalue (denoted λ1 ), for x to be a local minimum
it must satisfy
hv1 , ∇2 g(x)v1 i = 2hx, v1 i2 − v1> M v1 + kxk22 ≥ 0. (8.9)
Note that by (8.2), all our candidates for local minima are eigenvectors of M so naturally we have two
cases:
• Case 1: x has eigenvalue λ1 . Then x is the global minimum (by the Eckart–Young–Mirsky theorem).
• Case 2: x has eigenvalue λ < λ1 . Then we know x and v1 are orthogonal (eigenvectors with different
eigenvalues are always orthogonal), hence
83
Remark 8.11. We need d parameters to describe a rank-1 matrix M and the number of observations is
roughly pd2 . Thus, for identifiability we need to work in the regime where pd2 > d, i.e. p d1 .
We define our non-convex loss functions to be
1 X
min f (x) , (Mij − xi xj )2 (8.11)
x∈Rd 2
(i,j)∈Ω
1
= kPΩ (M − xx> )k2F . (8.12)
2
To really solve our problem we need some regularity condition on the ground truth vector z (recall
M = zz > ). Incoherence is one such condition:
Definition 8.12 (Incoherence). Without loss of generality, assume the ground truth vector z ∈ Rd satisfies
kzk2 = 1. z satisfies the incoherence condition if kzk∞ ≤ √µd , where µ is considered to be a constant or log
in dimension d.
Remark 8.13. A nice counterexample to think about why such condition is necessary is when z = e1 and
M = e1 e>
1 . All entries of M are 0 except for a 1 in the top-left corner. There is no way to recover M without
observing the top-left corner.
The goal is to prove that local minima of this objective function are close to a global minimum:
poly(µ, log d)
Theorem 8.14. Assume p = for some sufficient small constant and assume z is incoherent.
d2 √
Then with high probability, all local minima of f are O( )-close to +z or −z (the global minima of f ).
Before presenting the proof, we make some observations that will guide the proof strategy.
Remark 8.15. f (x) can be viewed as a sampled version of the PCA loss function g(x) = 21 kM − xx> k2F =
1 2
P
2 (i,j)∈[d]×[d] (Mij − xi xj ) , in which we only observe a subset of the matrix entries. Thus, we would like
to claim that f (x) ≈ g(x). However, matching the values of f and g is not sufficient to prove the theorem:
even a small margin of error between f and g could lead to creation of many spurious local minima (see
Figure 8.5 for an illustration). In order to ensure that the local minima of f look like the local minima of g,
we will need further conditions like ∇f (x) ≈ ∇g(x) and ∇2 f (x) ≈ ∇2 g(x).
Remark 8.16. Key idea: concentration for scalars is easy. We can approximate a sum of scalars via a sample:
X X
Tij ≈ p Tij , (8.13)
(i,j)∈Ω (i,j)∈[d]×[d]
with high probability. This suggests the strategy of casting the estimation of our desired quantities in the
form of estimating a scalar sum via a sample. In particular, we note that for any matrices A and B,
X
hA, PΩ (B)i = Aij Bij ≈ phA, Bi. (8.15)
(i,j)∈Ω
To make use of this observation to understand the quantities of interest (∇f (x) and ∇2 f (x)), we compute
the bilinear and quadratic forms for ∇f (x) and ∇2 f (x) respectively:
hv, ∇f (x)i = hv, PΩ (M − xx> )xi = hvx> , PΩ (M − xx> )i, (8.16)
where we have used the fact that hA, BCi = hAC > , Bi. Also note that vx> is a rank-1 matrix and M − xx>
is a rank-2 matrix.
hv, ∇2 f (x)vi = kPΩ (vx> + xv > )k2F − 2hPΩ (M − xx> ), vv > i (8.17)
> > > > > >
= hPΩ (vx + xv ), vx + xv i − 2hPΩ (M − xx ), vv i, (8.18)
84
Figure 8.5: Even if f (x) and g(x) are no more than apart at any given x, without any additional knowledge,
the local minima of f may possibly look dramatically different from the local minima of g. However, the
proofs in this section show that the landscape of f (the matrix completion objective) and g (the PCA
objective) are have similar properties by proving more advanced concentration inequalities.
where we have used the fact that kPΩ (A)k2F = hPΩ (A), PΩ (A)i = hP (Ω(A), Ai.
The key lemma that applies the scalar concentration to these matrix quantities is as follows:
poly(µ, log d)
Lemma 8.17. Let > 0, p = . Given that A = uu> , B = vv > for some u, v satisfying
√ d2 √
kuk2 ≤ 1, kvk2 ≤ 1, kuk∞ ≤ µ/ d, kvk∞ ≤ µ/ d, we have |hPΩ (A), Bi/p − hA, Bi| ≤ w.h.p.
If we can show that g has no bad local minima via a proof that only uses g via terms of the form hv, ∇g(x)i
and hv, ∇2 g(x)vi, then by Lemma 8.17 this proof will automatically generalize to f by concentration.
Next, we prove some facts about g and show the analogous proofs for f that we will use in the proof of
Theorem 8.14.
Lemma 8.18 (Connecting inner product and norm for g). If x satisfies ∇g(x) = 0, then hx, zi2 = kxk42 .
Proof.
√
Lemma 8.19 (Connecting inner product and norm for f ). Suppose kxk∞ ≤ 2µ/ d. If x satisfies ∇f (x) = 0,
then hx, zi2 ≥ kxk42 − with high probability.
Proof.
85
Lemma 8.20 (Bound norm for g). If ∇2 g(x) 0, then kxk22 ≥ 1/3.
Proof.
√
Lemma 8.21 (Bound norm for f ). Suppose kxk∞ ≤ µ/ d. If ∇2 f (x) 0, then kxk22 ≥ 1/3 − /3 with high
probability.
Proof.
Lemma 8.22 (g has no bad local minimum). All local minima of g are global minima.
Proof.
Since |hx, zi| ≥ 1/3 6= 0 (by Lemma 8.20), we must have kxk22 = 1. But then Lemma 8.18 implies hx, zi2 =
kxk42 = 1, so x = ±z by Cauchy-Schwarz.
We now prove Theorem 8.14, restated for convenience:
poly(µ, log d)
Theorem 8.23 (f has no bad local minimum). Assume p = . Then with high probability, all
√ d2
local minima of f are O( )-close to +z or −z.
Proof. Observe that kx − zk22 = kxk22 + kzk22 − 2hx, zi ≤ kxk22 + 1 − 2hx, zi. Our goal is to show that
this quantity is small with high probability, hence we need to bound kxk22 and hx, zi w.h.p. Note that the
following bounds in this proof are understood to hold w.h.p.
Let x be such that ∇f (x) = 0. For ≤ 1/16,
86
With this, we can get a bound on kxk22 :
∇f (x) = 0 =⇒ hx, ∇f (x)i = 0 (8.41)
=⇒ |hz, ∇g(x)i| ≤ (by Lemma 8.17) (8.42)
=⇒ |hx, zi| · |1 − kxk22 | ≤ (by dfn of g) (8.43)
=⇒ |1 − kxk22 | ≤ 32 = O() (by (8.40)) (8.44)
=⇒ kxk22 = 1 ± O(). (8.45)
Next, we bound hx, zi:
hx, zi2 ≥ kxk42 − (by Lemma 8.19) (8.46)
2
≥ (1 − O()) − (by (8.45)) (8.47)
= 1 − O(). (8.48)
Finally, we put these quantities together to bound kx − zk22 . We have two cases:
Case 1: hx, zi ≥ 1 − O(). Then
kx − zk22 = kxk22 + kzk22 − 2hx, zi (8.49)
≤ kxk22+ 1 − 2hx, zi (8.50)
≤ 1 + O() + 1 − 2(1 − O()) (8.51)
≤ O(). (8.52)
√
Hence we conclude x is O( )-close to z. √
Case 2: hx, zi ≤ −(1 − O()). Then by an analogous argument, x is O( )-close to −z.
We have shown above that matrix completion of a rank-1 matrix has no spurious local minima. This
proof strategy can be extended to handle higher-rank matrices and noisy matrices [Ge et al., 2016]. The
proof also demonstrates a generally useful proof strategy: often, reducing a hard problem to an easy problem
results in solutions that do not give much insight into the original problem, because the proof techniques do
not generalize. It can often be fruitful to seek a proof in the simplified problem that makes use of a restricted
set of tools that could generalize to the harder problem. Here we limited ourselves to only using hv, ∇g(x)i
and hv, ∇2 g(x)vi in the easy case; these quantities could then be easily converted to analogous quantities in
f via the concentration lemma (Lemma 8.17).
8.3.3 Other problems where all local minima are global minima
We have now demonstrated that two classes of machine learning problems, rank-1 PCA and rank-1 matrix
completion, have no spurious local minima and are thus amenable to being solvable by gradient descent
methods. We now outline some major classes of problems for which it is known that there are no spurious
local minima.
• Principal component analysis (covered in previous lecture).
• Matrix completion (and other matrix factorization problems). On a related note, it has also been
shown that linearized neural networks of the form y = W1 W2 x, where W1 and W2 are optimized
separately, have no spurious local minima [Baldi and Hornik, 1989]. It should be noted that linearized
neural networks are not very useful in practice since the advantage of optimizing W1 and W2 separately
versus optimizing a single W = W1 W2 is not clear.
• Tensor decomposition. The problem is as follows:
d X
X d X
d X
d
maximize Tijkl xi xj xk xl such that kxk2 = 1. (8.53)
i=1 j=1 k=1 l=1
87
Additionally, constraints are imposed on the tensor T to make the problem tractable. For example,
one condition is that T must be a low-rank tensor with orthonormal components [Ge et al., 2015].
Figure 8.6: Training loss around an initialized θ0 . The dotted lines indicate B(θ0 ), a region where the loss
is convex, and where a global minimum exists.
fθ (x) = fθ0 (x) + h∇θ fθ0 (x), θ − θ0 i + higher order terms (8.54)
| {z }
∆
=gθ (x)
Note that gθ (x) is an affine function in θ, as fθ0 (x) is a constant for fixed x, θ0 . Similarly, defining
∆θ = θ − θ0 , we can say that gθ (x) is linear in ∆θ. For convenience, we will sometimes choose θ0 such that
fθ0 (x) = 0 for all x. It is easy to see why such an initialization exists. Consider splitting a two-layer neural
network fθ (x) Pwith width 2m into P two halves, each with m neurons; the outputs of these two networks are
m m
then given by i=1 ai σ(wi> x) and i=1 −ai σ(wi> x), respectively. Here, wi can be randomly chosen so long
as Wi is the same in both halves, and ai can be randomly chosen as long as the other half is initialized with
−ai . Summing these two networks together yields fθ0 (x) ≡ 0 for all x.
When fθ0 (x) ≡ 0, we have that
we observe that ∆θ depends upon the parameter we evaluate the network at, while ∇θ fθ0 (x) can be thought
of as a feature map since it is a fixed function of x (given the architecture and θ0 ) that does not depend on
θ whatsoever. We thus let φ(x) , ∇θ fθ0 (x), which motivates the following definition:
88
Definition 8.24 (Neural Tangent Kernel). For simplicity, we assume fθ0 (x) = 0 so that y = y 0 . The neural
tangent kernel K is given by
Here, the feature ∇θ fθ0 (x) is precisely the gradient of the neural network. This is where the “tangent”
in Neural Tangent Kernel comes from.
Instead of fθ (x), suppose we use the approximation gθ (x), which we recall is linear in θ. The kernel
method gives a linear model on top of features. When θ ≈ θ0 , given a convex loss function `, we have
Convexity of the RHS follows from the fact that a convex function, `, composed with a linear function, gθ ,
is still convex.
A natural question to ask is: how valid is this approximation? We devote the rest of this chapter to
answering this question. First, we define the empirical loss:
n
1X
` fθ x(i) , y (i)
L̂(fθ ) = (8.59)
n i=1
n
1X
` gθ x(i) , y (i) .
L̂(gθ ) = (8.60)
n i=1
The key idea is that the Taylor approximation works for certain cases. We defer a more complete enumeration
of these cases to a later section of this monograph. Here we outline the high-level approach we take to validate
and use this Taylor expansion. Namely, we will show that there exists a neighborhood around θ0 called B(θ0 ),
such that we have the following:
1. Accurate approximation: fθ (x) ≈ gθ (x), and L̂(fθ ) ≈ L̂(gθ ) for all θ ∈ B(θ0 ).
2. It suffices to optimize in B(θ0 ): There exists an approximate global minimum θ̂ ∈ B(θ0 ), so L̂(gθ̂ ) ≈ 0.
This is the lowest possible loss (because the loss is nonnegative), which implies we are close to the
global minimum. Because of 1, this implies that L̂(fθ̂ ) ≈ 0 as well. See Figure 8.7 for an illustration.
3. Optimizing L̂(fθ ) is similar to optimizing L̂(gθ ) and does not leave B(θ0 ), i.e. everything is confined
to this region. Intuitively, this last point to some extent is “implied” by (1) and (2), but this claim
still requires a formal proof.
Note (1), (2), and (3) can all be true in various settings. In particular, to attain all three, we will require:
(a) Overparametrization and/or a particular scaling of the initialized θ0 .
(b) Small (or even zero) stochasticity, so θ never leaves B(θ0 ). This condition is guaranteed by a small
learning rate or full-batch gradient descent.
Despite the limitations of the requirements of (a) and (b), the existence of such a region is still surprising.
Given the loss landscape which could potentially be highly non-convex, it is striking to find a neighborhood
where the loss function is convex (e.g. quadratic) with a global minimum. This suggests there is some
flexibility in the loss landscape.
To begin our formal discussion, we start by providing tools for proving (1) and (2). Let
89
Figure 8.7: Here, L̂(gθ ) and L̂(fθ ) are both plotted. At θ̂, we have reached the approximate global minimum
where L̂(gθ̂ ) ≈ 0, in turn implying also that L̂(fθ̂ ) ≈ 0.
and
>
φ(1)
. n×p
.. ∈ R
Φ= (8.62)
>
φ(n)
where p is the number of parameters. Taking the quadratic loss, we have
n > 2
1 X (i) 1
L̂(gθ ) = y − φ x(i) ∆θ = k~y − Φ · ∆θk22 (8.63)
n i=1 n
>
where ~y = y (1) , · · · , y (n) ∈ Rn . Note that this looks a lot like linear regression, where Φ and ∆θ are
(i)
the analogues
√ of the design matrix and parameter, respectively. We further assume that y = O(1) and
kyk2 = O( n). Now, we can prove a lemma that addresses the second of the three conditions we described
above, i.e. that it is sufficient to optimize in some small ball around θ0 .
Lemma 8.25 (for (2)). Suppose we are in the setting where p ≥ n, rank(Φ) = n, and σmin (Φ) = σ > 0.
Then, letting ∆θ̂ denote the minimum norm solution, i.e. the nearest global minimum, of ~y = Φ∆θ, we have
√
k∆θ̂k2 ≤ O( n/σ) (8.64)
Remark 8.26. The meaning of the bound on ∆θ̂ becomes clear if we consider the ball given by
√
Bθ0 = {θ = θ0 + ∆θ : k∆θk2 ≤ O( n/σ)}. (8.65)
In particular, notice that Bθ0 contains a global minimum, so this lemma characterizes how large the ball
must be to contain a global minimum.
Remark 8.27. We also note that the condition rank(Φ) = n and σ > 0 can be thought of as a “finite-sample
expressivity” condition, saying that the features Φ are expressive enough so that there exists a linear model
on top of these features that perfectly fit the data. The condition rank(Φ) = n requires p ≥ n—so we need
some amount of over-parameterization to apply these analysis.
Proof. Letting Φ+ denote the Moore-Penrose pseudoinverse of Φ, note that ∆θ̂ = Φ+ y, and kΦ+ kop =
1 1
σmin (Φ) = σ . A simple argument shows
90
Next, we prove a lemma that addresses the first of the three steps we described above.
Lemma 8.28 (for (1)). Suppose ∇θ fθ (x) is β-Lipschitz in θ, i.e. for every x, and θ, θ0 , we have
k∇θ fθ (x) − ∇θ fθ0 (x)k2 ≤ β · kθ − θ0 k2 . (8.68)
Then,
|fθ (x) − gθ (x)| ≤ O βk∆θk22 .
(8.69)
If we further restrict our choice of θ using Bθ0 as defined in Remark 8.26, we obtain that
βn
|fθ (x) − gθ (x)| ≤ O , ∀θ ∈ Bθ0 . (8.70)
σ2
Proof. The proof comes from the following fact: if h(θ) is such that ∇h(θ) is β-Lipschitz (which if differen-
tiable is equivalent to k∇2 h(θ)kop ≤ β), then
As shown above, the proof is as simple as plugging in fθ (x) = h(θ) and gθ (x) = h(θ0 ) + ∇h(θ0 ), ∆θ .
Remark 8.29. The lemma above bounds the approximation error. Intuitively, as you move farther away from
θ0 , the Taylor approximation gets worse; the approximation error is bounded above by a second order ∆θ
term.
Remark 8.30. Note that if fθ involves a relu function, then ∇fθ is not continuous everywhere. This requires
a technical fix outside the scope of our discussion.2
Furthermore, let β̄ be the Lipschitz parameter of ∇θ f¯θ (x) in θ. A simple chain-rule gradient argu-
ment shows that scaling f¯θ by α also scales σ and β accordingly, i.e. σ = ασ̄, and β = αβ̄. Some
straightforward algebra yields
β β̄ 1
= 2 · →0 as α → ∞. (8.73)
σ2 σ̄ α
Once α becomes big enough, then by Lemma 8.28, the approximation |fθ (x) − gθ (x)| ≤ O βn/σ 2
becomes very good.
2 A relu function is continuous almost everywhere, so we can make some minor fixes and still use some modified notion of
91
Remark 8.31. A priori, such a phenomenon may appear to be too good to be true. To understand
it better, we first note that this re-parameterizaton does not change the scale of the loss, but rather
change the shape of the loss function. Intuitively, as α becomes larger, the function fθ becomes sharper
and more non-smooth (leading to higher approximation error). However, on the other hand, we note
0
that we only need to travel a little√ bit away from θ to find a global minimum given that there is a
global minimum within radius O( n/σ). It turns out that the radius needed shrinks faster than the
smoothness grows.
To visualize this effect, we can consider the following example with only 1 data point with 1-dimensional
input (x, y) = (1, 1) and the quadratic model f¯θ (x) = x(θ + βθ2 ) = θ + βθ2 . Using the squared loss,
we have
b f¯θ ) = (1 − (θ + βθ2 ))2
L( (8.74)
Let θ0 = 0. Taylor expanding at θ0 gives the linear approximation ḡθ (x) = θx = θ, and the resulting
loss function that is quadratic
b θ ) = (1 − θ)2
L(ḡ (8.75)
and
2
L(αḡθ ) = (1 − αθ) (8.77)
b
0
Note that the minimizer of L(αḡ θ ) is 1/α, which is closer to θ as α → ∞. We zoom into the region
b
¯
[0, 1/α] and find out the difference between αfθ and αḡθ is αβθ2 ≤ β/α, which is much smaller than
the value of αḡθ ≈ O(1).
We visualize the these functions in Figure 8.8. We observe that L(αḡ
b θ ) becomes a better approximation
¯ b f¯θ ) globally.)
of L(αfθ ) in the region [0, 1/α] as α → ∞ (though L(αḡθ ) is a worse approximation of L(α
b b
2. Overparametrization (with specific initialization). Early papers on the NTK take this approach
(e.g., [Li and Liang, 2018, Du and Hu, 2019]). Consider a two-layer network with m neurons.
m
1 X
ŷ = √ ai σ(wi> x) (8.78)
m i=1
√
The scaling 1/ m is to ensure that a random initialization with constant scale will have output on the
right order, as we see momentarily. We make the following assumptions regarding the network and its
inputs.
>
w1
..
W = . ∈ Rm×d (8.79)
>
wm
σ is 1-Lipschitz and twice-differentiable (8.80)
ai ∼ {±1} (not optimized) (8.81)
wi0 ∼ N (0, Id ) (8.82)
kxk2 = Θ(1) (8.83)
dm
θ = vec(W ) ∈ R (vectorized W ) (8.84)
92
Figure 8.8: The approximation L(αḡ
b b ¯
θ ) becomes a better approximation of L(αfθ ) in the region [0, 1/α] as
α → ∞ (though L(αḡ
b b ¯
θ ) is a worse approximation of L(αfθ ) globally).
We will assume m → ∞ polynomially in n and d. In particular, for fixed n, d, we have m = poly(n, d).
√ >
Why do we use the 1/ m scaling? Note that σ wi0 x ≈ 1 because kxk2 = Θ(1) and wi0 is drawn
0>
Pm
from a spherical Gaussian. Thus, as some ai are positive and others are negative, i=1 ai σ wi x =
√
Θ ( m), and finally fθ0 (x) = Θ(1).
Now we analyze σ and β. We let
q
σ = σmin (Φ) = σmin (ΦΦ> ) (8.85)
where
D E
ΦΦ> = ∇θ fθ0 x(i) , ∇θ fθ0 x(j)
ij
(8.86)
93
Note that the scale of k∇θ fθ0 (x)k2 does not depend on m, so the inner product in (8.86) also does not
depend on m either. As above, we can show
m
1 X
h∇θ fθ0 (x), ∇θ fθ0 (x0 )i = hx, x0 i σ 0 (w> x)σ 0 (w> x0 ) (8.92)
m i=1
0 >
σ (w x)σ 0 (w> x0 ) hx, x0 i
→ E (8.93)
w∼N (0,Id )
This is precisely the NTK with m = ∞. Though we omit the proof of this claim, it can be shown that
K ∞ is full rank. Then, let
σmin , σmin (K ∞ ) > 0. (8.95)
We can show that
1
σ = σmin ΦΦ> > σmin (8.96)
2
Intuitively, ΦΦ> → K ∞ , so the spectrum of the matrix should also converge. Thus, in some sense, we
have shown that σ is constant in the limit.
Now what about β? If we can show β → 0 as m → ∞, we are done. We begin by analyzing this key
expression:
m
1 0 > 0 0>
∇θ fθ (x) − ∇θ fθ0 (x) = √ σ wi x − σ wi x · x (8.97)
m i=1
assumption that σ 0 is O(1)-Lipschitz. The third line uses Cauchy-Schwarz and the fact that kxk22 ≈ 1.
Taking the square root, we have that
1
k∇θ fθ (x) − ∇θ fθ0 (x)k2 . √ kθ − θ0 k2 (8.102)
m
√
Thus, the Lipschitz parameter is β = O(1/ m). Thus, our key quantity β/σ 2 goes to 0 as m grows.
Namely,
β 1 1
≈ √ · 2 →0 as m → ∞. (8.103)
σ2 m σmin
Recall here that σmin does not depend on m. Concretely, this result tells us that our function becomes
more smooth (the gradient has a smaller Lipschitz constant) as we add more neurons.
94
8.4.2 Optimizing L̂(gθ ) vs. L̂(fθ )
We now discuss how to establish the last of the three conditions under which we claimed a Taylor approxi-
mation is reasonable. We need to show that optimizing L̂(fθ ) is similar to optimizing L̂(gθ ). To do so, we
require two steps:
(A) Analyze optimization of L̂(gθ ).
(B) Analyze optimization of L̂(fθ ) by re-using or modifying the proofs in (A).
There are two approaches in the literature for (A), which implies that there exist two approaches for (B) as
well.
(i) We leverage the strong convexity of L̂(gθ ), and then show an exponential convergence rate.3
(ii) Instead of strong convexity, we rely on the smoothness of fθ (i.e. bounded second derivative).
We will only discuss the first of these two methods in the sequel.
Remark 8.32. In both either approach (i) or (ii), we will implicitly or explicitly use the following simple fact.
Suppose at any θt , we take the Taylor expansion of fθ at θt :
This is really saying that fθ and gθt agree up to first-order at θt . This implies that L(fθ ) and L(gθt ) also agree
to first-order at θt . This also means that T steps of gradient descent on L̂(fθ ) is the same as performing
online gradient descent4 on a sequence of changing objectives L(gθ0 ), . . . , L(gθT ), and this online learning
perspective is useful in the approach (ii).
We will now show that under the strong convexity regime, optimizing a neural network fθ is equivalent
to optimizing a linear model gθ . We will also observe that this regime is not particularly practically relevant,
but this analysis is nevertheless of interest to us for two reasons. First, the approach used in the subsequent
exposition is of technical interest and second, it remains quite interesting that optimizing fθ and optimization
gθ yields the same results under any regime.
Optimizing gθ
We relate the optimization of gθ to performing linear regression. Recall that we can think of ∇fθ0 (x) as
a feature map. Then, the problem of choosing ∆θ to get gθ (x) to be close to ~y is a linear regression. In
particular, we use gradient descent to minimize
where
∇fθ0 (x(1) )> y (1)
.. n×p
~y = ... ∈ Rn
Φ= ∈R . (8.108)
.
(n) > (n)
∇fθ0 (x ) y
3 Recall that a differentiable function f is µ-strongly convex if
µ
f (y) ≥ f (x) + ∇f (x)> (y − x) + ky − xk22 (8.104)
2
for some µ > 0 and all x, y.
4 Online gradient descent is the algorithm that takes one gradient descent step upon receiving a new objective function. See
95
For learning rate η, the gradient descent update rule is
This analysis considers changes in the output space. Define ŷ t = Φ∆θt . Then, we’re interested in changes
in
From this decomposition, we see that the residuals, ŷ t − ~y , are monotonically shrinking since ηΦΦ> , i.e. the
term we are subtracting from I in (8.115), is positive semidefinite. Next, we quantify how quickly we are
shrinking the residuals. Define
1
Then, we claim that when η ≤ τ2 ,
Why? Let the eigenvalues of ΦΦ> be (in descending order) τ12 , . . . , τn2 . By definition, τ12 = τ 2 and τn2 = σ 2 .
Now, given the singular value decomposition, Φ = U ΣV > , we obtain the eigendecomposition:
(8.121) is the eigendecomposition of I − ηΦΦ> , so I − ηΦΦ> has eigenvalues 1 − ητ12 , . . . , 1 − ητn2 . Note
that assuming η ≤ τ12 ensures that all eigenvalues of I − ηΦΦ> are non-negative. Thus,
= 1 − ητn2 (8.123)
2
= 1 − ησ , (8.124)
96
Optimizing fθ
We now transition to an analysis of the optimization of fθ . Our key result is Theorem 8.33. If we compare
it against what we have in (8.128), we see the claimed similarity between fθ and gθ in error decay under
optimization.
β c0
Theorem 8.33. There exists a constant c0 ∈ (0, 1) such that for σ2 ≤ n and sufficiently small η (which
could depend on β, σ, or p), L̂ (fθT ) ≤ after T = O log 1/
ησ 2 steps.
Proof. (This is actually a proof sketch that elides a few technical details for the sake of a simpler exposition.)
Our approach is to follow the preceding analysis of gθ , making changes where necessary.
Let
∇fθt (x(1) )>
Φt = .. n×p
∈R . (8.129)
.
∇fθt (x(n) )>
To obtain our gradient descent update rule, we find, using the chain rule,
n
X
∇L̂ (fθt ) = fθt x(i) − y (i) ∇fθt x(i) (8.130)
i=1
Xn
= ŷ (i),t − y (i) ∇fθt x(i) (8.131)
i=1
t >
ŷ t − ~y .
= (Φ ) (8.132)
where we have let bt = (Φt )> (ŷ t − ~y ). Following our treatment of gθ , we want to express ŷ t+1 as a function
of ŷ t . The challenge now is that f is nonlinear. To deal with this, we Taylor expand fθ at θt :
D E
fθt+1 (x(i) ) = fθt (x(i) ) + ∇fθt (x(i) ), θt+1 − θt + high order terms (8.136)
D E
= fθt (x(i) ) + ∇fθt (x(i) ), −ηbt + O kθt+1 − θt k22 .
(8.137)
Since O kθt+1 − θt k22 is O η 2 , we can ignore this term as η → 0. Vectorizing (8.137) without
O kθt+1 − θt k22 ,
97
>
Comparing (8.141) with (8.115), we see one difference: in (8.141), our convergence depends on ηΦt (Φt ) ,
which is a matrix that changes as we iterate, whereas in (8.115), convergence is controlled by a matrix that
is fixed as we iterate.
>
To understand the convergence implications of (8.141), we examine the eigenvalues of I − ηΦt (Φt ) . For
now, suppose √
kθt − θ0 k2 ≤ σ/(4 nβ) (8.142)
σ
at time t. This implies that kΦt − ΦkF ≤ 4 by the Lipschitzness of ∇fθ (x) in θ. Then, we claim that
Next, we note that the second term of (8.145) is lower bounded by σ by simplifying and applying the
definition of σ given in (8.117). Combining this observation with (8.149), we conclude that (8.143) must
hold.
Applying this lower bound on the eigenvalues of Φt , we can use the same argument we used to establish
(8.118) to conclude that
>
kI − ηΦt Φt kop ≤ 1 − 3ησ/4, (8.150)
and
t+1
kŷ t+1 − ~y k2 ≤ (1 − 3ησ/4) kŷ 0 − ~y k2 . (8.151)
log 1/
So, as desired, we see exponential decay in the error at each iteration and after T = O nσ 2 iterations,
L̂(fθT ) ≤ . (8.152)
t 0
√ complete our proof, observe that this argument is predicated upon the assumption that kθ − θ k2 ≤
To
σ/(4 nβ). This assumption is reasonable, however, given what we have already proven. Recall that in
Lemma 8.25, we proved that
√
k∆θ̂k2 = kθ̂ − θ0 k2 . n/σ. (8.153)
√ √
Thus, when β/σ 2 → 0, eventually, n/σ σ/(4 nβ). To extend t
√ this to0 kθ̂ − θ k2 for arbitrary t, we
heuristically argue that since
√ the empirical minimizer is within σ/(4 nβ) of θ , we would not expect to have
traveled more than σ/(4 nβ) from θ0 at any iteration.
More formally, we claim that for all t ∈ N,
√
kŷ t − ~y k2 ≤ O( n). (8.154)
98
We proceed via induction. For t = 0, because each element of ŷ is of order 1, we know that:
1
√ kŷ 0 − ~y k2 ≤ O(1). (8.155)
n
Now, suppose that (8.154) holds for some t. Then, because the errors are monotonically decreasing, (cf.
(8.141) and (8.150)),
1 1
√ kŷ t+1 − ~y k2 ≤ √ kŷ t − ~y k2 ≤ O(1). (8.156)
n n
1
√ kΦθt − ŷ t k2 ≤ O(1) (8.157)
n
Using this result and (8.154), we can show that √1 kΦ(θ t − θ̂)k2 is O(1).
n
1 1
√ kΦ(θt − θ̂)k2 = √ kΦθt − ~y k2 (~y = Φθ̂) (8.158)
n n
1
= √ kΦθt − ŷ t + ŷ t − ~y k2 (8.159)
n
1 1
≤ √ kΦθt − ŷ t k2 + √ kŷ t − ~y k2 (triangle ineq.) (8.160)
n n
≤ O(1). (8.161)
Then, leveraging the definition of σ given in (8.117) and rearranging, we obtain (nearly) the desired result:
1 √
kθt − θ̂k2 ≤ kΦ(θt − θ̂)k2 ≤ O( n/σ). (8.162)
σ
Recall that in Lemma 8.25, we proved that
√
kθ̂ − θ0 k2 ≤ O( n/σ). (8.163)
99
• The analysis above was for gradient descent, while stochastic gradient descent is used in practice,
introducing noise in the procedure. This means that NTK with stochastic gradient descent requires
a small learning rate to stay in the initialization neighborhood. Deviating from the requirements can
lead to leaving the initialization neighborhood.
One possible explanation for the gap between theory and practice is because NTK effectively requires a
fixed kernel, so there is no incentive to select the right features. Furthermore, the minimum `2 -norm solution
is typically dense. This is similar to the difference between sparse and dense combinations of features
observed in the `1 -SVM/two-layer network versus the standard kernel method SVM (or `2 -SVM) analyzed
previously.
To make these ideas more concrete, consider the following example [Wei et al., 2020].
Example 8.34. Let x ∈ Rd and y ∈ {−1, 1}. Assume that each component of x satisfies xi ∈ {−1, 1}.
Define the output y = x1 x2 , that is, y is only a function of the first two components of x.
This output function can be described exactly by a neural network consisting of a sparse combination of
the features (4 neurons to be exact):
1
ŷ = [φrelu (x1 + x2 ) + φrelu (−x1 − x2 ) − φrelu (x1 − x2 ) − φrelu (x2 − x1 )] (8.166)
2
1
= (|x1 + x2 | − |x1 − x2 |) (8.167)
2
= x1 x2 . (8.168)
(8.167) follows from the fact that φrelu (t) + φrelu (−t) = |t| for all t, while (8.168) follows from evaluating
the 4 possible values of (x1 , x2 ). Thus, we can solve this problem exactly with a very sparse combination of
features.
However, if we were to use the NTK approach (kernel method), the network’s output will always involve
σ(w> x) where w is random so it includes all components of x (i.e. a dense combination of features), and
cannot isolate just the relevant features x1 and x2 . This is illustrated in the following informal theorem:
2
Theorem 8.35. The kernel method with NTK Pm requires n = Ω(d ) samples to learn Example 8.34 well. In
contrast, the neural network regularized by j=1 |uj |kwj k2 only requires n = O(d) samples.
100
Chapter 9
Implicit/Algorithmic Regularization
Effect
One of the miracles of modern deep learning is the phenomenon of algorithmic regularization (also known
as implicit regularization or implicit bias): although the loss landscape may contain infinitely many global
minimizers, many of which do not generalize well, in practice our optimizer (e.g. SGD) tends to recover
solutions with good generalization properties.
The focus of this chapter will be to illustrate algorithmic regularization in simple settings. In particular,
we first show that gradient descent (with the right initialization) identifies the minimum norm interpolating
solution in overparametrized linear regression. Next, we show that for a certain non-convex reparametrization
of the linear regression task where the data is generated from a sparse ground-truth model, gradient descent
(again, suitably initialized) approximately recovers a sparse solution with good generalization. Finally,
we discuss algorithmic regularization in the classification setting, and how stochasticity can contribute to
algorithmic regularization.
Lemma 9.1. Let X + denote the pseudoinverse1 of X. Then β is a global minimizer if and only if β =
X + ~y + ζ for some ζ such that ζ ⊥ x1 , ..., xn .
Proof. For any β ∈ Rd , we can decompose it as β = X + + ζ for some ζ ∈ Rd . Since
1 Since X is full rank, XX > is invertible and so we have X + = X > (XX > )−1 . Note that XX + X = X.
101
Figure 9.1: Visualization of proof intuition for Theorem 9.3. The solution β ? is the projection of the origin
onto the subspace of global minima.
∆
From Lemma 9.1, we can derive an explicit formula for the minimum norm interpolant β ? =
argminβ:L(β)=0
b kβk2 .
Corollary 9.2. β ? = X + ~y .
Proof. Take any β such that L(β)
b = 0, and write β = X + ~y + ζ. Then from the definition of X + and the
fact that Xζ = 0 (see the proof of Lemma 9.1), we have
Now, suppose we learn β using gradient descent with initialization β 0 , where at iteration t we set β t =
t−1 b t−1 ) for some learning rate η. Since L(β)
β − η∇L(β b is convex, we know from standard results in convex
optimization that gradient descent will converge to a global minimizer for a suitably chosen learning rate
η (in particular, taking η to be sufficiently small). Assuming β 0 = 0, we will in fact recover the minimum
norm interpolating solution.
Theorem 9.3. Suppose gradient descent on L(β)
b with initialization β 0 = 0 converges to a solution β̂ such
?
that L(
b β̂) = 0. Then β̂ = β .
The main idea of the proof is that the iterates of gradient descent always lie in the span of the x(i) ’s (see
Figure 9.1 for an illustration).
Proof. We first show via induction that β t ∈ span x(1) , . . . , x(n) for all t. For the induction base case, note
that β 0 = 0 ∈ span x(1) , . . . , x(n) . Now suppose β t−1 ∈ span x(1) , . . . , x(n) . Recall that β t = β t−1 −
b t−1 ). Since left-multiplying any vector by X > amounts to taking a linear combination of the rows of
η∇L(β
b t−1 ) = ηX > (Xβ t−1 − ~y ) ∈ span x(1) , . . . , x(n) , and so β t = β t−1 − η∇L(β
X, it follows that η∇L(β b t−1 ) ∈
(1)
span x , . . . , x(n) . This proves the induction step.
Next, we show that β̂ ∈ span x(1) , . . . , x(n) and L( b β̂) = 0 implies β̂ = β ? . By definition, β̂ ∈
span x , . . . , x(n) implies β̂ = X > v for some v ∈ Rn . Since L( b β̂) = 0, we have 0 = X β̂ − ~y = XX > v − ~y .
(1)
> −1 > > > −1
This implies v = (XX ) y, and so β̂ = X v = X (XX ) ~y = X + ~y = β̂ ? .
102
9.2 Implicit regularization of small initialization in nonlinear mod-
els
We give another example of implicit regularization effect of small initialization in a non-convex version of
the overparametrized linear regression task considered in the previous section. The results in this subsection
are largely simplifications of the paper Li et al. [2017] which studies over-parameterized compressed sensing
and two-layer neural nets with quadratic activation.
iid
We assume x(1) , ..., x(n) ∼ N (0, Id×d ) and y (i) = fβ ? (x(i) ), where the ground truth vector β ? is r-sparse
(i.e. kβ ? k0 = r). For simplicity, we assume βi? = 1{i ∈ S} for some S ⊂ [d] such that |S| = r. We again
analyze the overparametrized setting, where this time n d but also n ≥ Ω(r e 2 ).
Our goal is to find a weight vector that minimizes our empirical loss function
n 2
∆ 1 X (i)
L(β)
b = y − fβ (x(i) ) , (9.7)
4n i=1
∆
where fβ (x) = hβ β, xi. The operation denotes the Hadamard product: for u, v ∈ Rd , u v ∈ Rd is
∆
defined by (u v)i = ui vi for i = 1, . . . , d.
with respect to θ, where θ = β β. However, it turns out that we can equivalently learn a sparse solution
by running gradient descent from a suitable initialization on the original unregularized loss.
To be specific, let β 0 = α1 ∈ Rd be the initialization where α is a small positive number. The update
rule of gradient descent algorithm is given by β t+1 = β t − η∇L(β
b t ). The next theorem shows that when
2 ?
n = Ω(r
e ), gradient descent on L(β)
b converges to β .
Theorem 9.4. Let c be a sufficiently large universal constant. Suppose n ≥ cr2 log2 (d) and α ≤ 1/dc , then
log(d/α) 1
when . T . √ , we have
η η dα
2
√
β> β> − β? β? 2 ≤ O α d . (9.9)
103
Remark 9.6. Theorem 9.4 is a simplified version of Theorem 1.1 in [Li et al., 2018].
Remark 9.7. L(β)
b has many global minima. To see this, observe that the number of parameters is d and
the number of constraints to fit all the examples is O(n) because there are only n examples. Recall that for
overparameterized model we have d n; consequently, there exists many global minima of L(β).
b
Remark 9.8. β ? is the min-norm solution in this case. That is,
Informally, this is because we can view β β as a vector θ ∈ Rd , which leads to kβk22 = kθk1 . Then in the θ
space (and with a little abuse of notation), the optimization problem (9.10) becomes
9.2.2 Ground work for proof and the restricted isometry property
In this section we prepare the ground work for the proof of Theorem 9.4.
We start by showing several basic properties about L(β).
b Note that for any fixed vector v ∈ Rd and
d
x ∈ R , when x is drawn from N (0, I), we have
It follows that
1
(y − hβ β, xi2
L(β) = E (9.13)
4 x∼N (0,I)
1 ?
β ? − β β, xi2
= E hβ (by definition of y) (9.14)
4 x∼N (0,I)
1 ? 2
= kβ β ? − β βk2 . (by (9.12)) (9.15)
4
Note that (9.15) is the metric that we use to characterize how close β is to the ground-truch parameter β ?
(see (9.9)).
In the following lemma we show that L(β)
b ≈ L(β) by uniform convergence. Generally speaking, uniform
convergence of the loss function for all β requires n ≥ Ω(d) samples, so in our setting (where n d)
L(β)
b ≈ L(β) does not always hold. However, since we assume β ? is sparse, the analysis only requires
uniform convergence for sparse vectors.
e 2 ). With high probability over the randomness in x(1) , · · · , x(n) , ∀v such
Lemma 9.10. Assume n ≥ Ω(r
that kvk0 ≤ r we have
n
1X
(1 − δ)kvk22 ≤ hv, x(i) i2 ≤ (1 + δ)kvk22 . (9.16)
n i=1
Lemma
9.10 is a special case of Lemma 2.2 in [Li et al., 2018] so the proof is omitted here. We say the
set x(1) , · · · , x(n) (or X = [x(1) , · · · , x(n) ]) satisfies (r, δ)-RIP condition (restricted isometric property) if
(9.16) holds.
104
By algebraic manipulation, (9.16) is equivalent to
n
!
1 X (i) (i) >
(1 − δ)kvk22 ≤ v > x (x ) v ≤ (1 + δ)kvk22 . (9.17)
n i=1
Pn
In other
Pn words, from the point of view of a sparse vector v we have i=1 x(i) (x(i) )> P≈ I. (Note however
n
that i=1 x(i) (x(i) )> is not close to Id×d in other notions of closeness.
Pn For example, (i) (i) >
i=1 x (x ) is not
(i) (i) >
close to Id×d in spectral norm. Another way to see this is that i=1 x (x ) is a d × d matrix but only
has rank n d.)
As a result, with the RIP condition we have L(β)
b ≈ L(β) if β is sparse. With more tools we can also get
∇L(β)
b ≈ ∇L(β). Let us define the set Sr = {β : kβk0 ≤ O(r)}, the set where we have uniform convergence
b and ∇L.
of L b and ∇L
b Informally, as long as we are in the set Sr , L b have similar behavior to their population
counterparts. (Note, on the other hand, that there exists a dense β 6∈ Sr such that L(β)
b = 0 but L(β) 0.)
The RIP condition also gives us the following lemma which will be needed for the proof of Theorem 9.4.
Lemma 9.11. Suppose x(1) , x(2) , . . . x(n) satisfy the (r, δ)-RIP condition. Then, ∀v, w such that kvk0 ≤ r
and kwk0 ≤ r, we have that
n n
!
1 X (i) (i) T 1 X (i) (i) >
hx , vihx , wi − hv, wi = v x (x ) w − hv, wi (9.18)
n i=1 n i=1
≤ 4δ kvk2 · kwk2 . (9.19)
Combining the two steps we can show that gradient descent on L(β)
b does not leave Sr and converges to
β?.
As a warm up, we prove the following theorem for gradient descent on L(β).
? log(1/(α))
Theorem 9.13. For sufficiently small η, gradient descent on L(β) converges to β in Θ
η
iteration with -error in `2 -distance.
Proof. Since
∇L(β) = (β β − β? β?) β, (9.22)
the gradient descent step is
105
Figure 9.2: Visualization of proof intuition for Theorem 9.4.
Recall that β ? = 1{i ∈ S} and β 0 = α1, and the update rule above decouples across the coordinates of
β . Thus, we only need to show that |βi? − β t | ≤ for the number of iterations stated in the Theorem.
t
106
For a coordinate i ∈
/ S, the gradient descent update for this problem becomes
βit+1 = β t − η(β t β t − β ? β ? ) β t i
(9.33)
= βit − η(βit · βit ) · βit (since βi? = 0 ∀i ∈
/ S) (9.34)
= βit − η(βit )3 . (9.35)
Since our initialization β 0 was small, the update to these coordinates will be even smaller because (βit )3
is small. We can prove the desired claim using strong induction. Suppose βis ≤ 2α for all s ≤ t and i ∈ / S,
and that t + 1 ≤ 1/(10ηα2 ). Then, for all s ≤ t,
With strong induction, we can repeatedly apply this gradient update starting from t = 0 to obtain
Remark 9.15. Note that Theorem 9.14 is a slightly weaker version of Theorem 9.4 for r = 1, since the bound
on the RHS depends on the number of examples and not the initialization α. In Theorem 9.4, we could take
α as small as we like to drive the bound to zero; we cannot do this for Theorem 9.14.
We proceed to prove Theorem 9.14 with the follow steps:
1. Computing the gradient update ∇L(β),
b
107
4. Putting it all together.
Computing the gradient update ∇L(β)
b
WLOG, assume that β = e1 . We can decompose the gradient descent iterate β t as
?
β t = rt · e1 + ζt , (9.45)
where ζt ⊥ e1 . The idea is to prove convergence to β ? by showing that (i) rt → 1 as t → ∞, and (ii)
kζt k∞ ≤ O(α) for t ≤ Oe 1/η). In other words, the signal rt converges quickly to 1 while the noise ζt remains
small for some number of initial iterations. One may be concerned that it is possible for the noise to amplify
after many iterations, but we will not have to worry about this scenario if we can guarantee that β t converges
to β ? first.
b t ) as follows. Since y (i) = hβ ? β ? , x(i) i and β t = rt e1 +ζt = rt β ? +ζt ,
We can compute the gradient of L(β
n
b t) = 1X t
∇L(β (hβ β t , x(i) i − y (i) )x(i) βt (9.46)
n i=1
n
1X t
= (hβ βt − β? β ? , x(i) i)x(i) βt (9.47)
n i=1
n
1X 2 ?
= hr β β ? + ζt ζt − β ? β ? , x(i) ix(i) βt (9.48)
n i=1 t
n
1 XD 2 E
rt − 1 β ? β ? + ζt ζt , x(i) x(i) β t .
= (9.49)
n i=1
| {z }
mt
n
1 XD 2 E
rt − 1 β ? β ? + ζt ζt , x(i) x(i)
mt = (9.50)
n i=1
n
! n
!
1 X (i) (i) > 1 X (i) (i) >
rt2 − 1 · (β ? β ? ) +
= x x x (x ) (ζt ζt ) (9.51)
n i=1 n i=1
= X rt2 − 1 · β ? β ? + X ζt ζt .
(9.52)
| {z } | {z }
part of ut vt
∆ ∆
Now, define ut = (rt2 − 1)(β ? β ? ) − X(rt2 − 1)(β∗ β∗ ) and vt = X βt βt . Then we can rewrite the
gradient as
b t ) = mt
∇L(β β t = [(rt2 − 1)β ? β ? − ut + vt ] βt . (9.53)
t
Our goal is to show that both ut and vt are small, so that ∇L(β b ) is close to its population version
∇L(β t ). Observe that X appears in both ut and vt . This matrix is challenging to deal with mathematically
because it does not have full rank (because n < d). Instead, we rely on the RIP condition to reason about
the behavior of X: the idea is that X behaves like the identity for sparse vector multiplication. Applying
Corollary 9.12, we can bound kut k∞ as
(In the second inequality, we assume that |rt | < 1. We can do this because rt starts out at α which is
small; if rt ≥ 1, then we are already in the regime where gradient descent has converged.) We can bound
108
kvt k∞ in a similar manner: since Corollary 9.12 implies kvt − ζt ζt k∞ ≤ 4δ kζt ζt k2 ,
Note that the size of vt depends on the size of the noise ζt . Thus, by bounding ζt (e.g. with a small
initialization), we can ensure that vt is also small. (Ensuring bounds on ut is more difficult because it
depends only on δ.) In the next two subsections, we analyze the growth of ζt and rt .
Dynamics analysis of ζt
First, we analyze the dynamics of the noise ζt , which we want to ensure does not grow too fast.
Lemma 9.16. For all t ≤ 1/(cηδ) with sufficiently large constant c, we have
2 1
kζt k∞ ≤ 2α, kζt k2 ≤ , and kζt+1 k∞ ≤ 1 + O(ηδ) kζt k∞ . (9.58)
2
Note that this result is weaker than what we were able to show for the population gradient (exponential
growth with a small fixed rate), but we will ultimately show that the growth of the signal will be even faster.
Proof. Recall that the empirical gradient (9.53) is ∇L̂(β) = (rt2 − 1)β ? β ? − ut + vt β t . Hence, the
t
gradient update to β is
β t+1 = β t − η rt2 − 1 β ? β ? − ut + vt βt
(9.59)
= β t − η rt2 − 1 β ? β ? β t −η (−ut + vt ) t
β . (9.60)
| {z }
GD update for population loss
Recall that ζt+1 is simply β t+1 except for the first coordinate (where it has a zero instead of rt+1 ), i.e.
ζt+1 is the projection of β t+1 onto the subspace orthogonal to e1 . Hence,
ζt+1 = I − e1 e>
t+1
1 β (9.61)
> >
t t
= I − e1 e1 · β − η I − e1 e1 (vt − ut ) β (by (9.60), second term = 0) (9.62)
T T
t
= ζt − η I − e1 e1 (vt − ut ) I − e1 e1 β (by distribution law for ) (9.63)
T
= ζt − η I − e1 e1 (vt − ut ) ζt . (9.64)
| {z }
ρt
If we define ρt such that ζt+1 = ζt − ηρt ζt , then the growth of ζt is dictated by the size of ρt . We can
bound this as
kζt+1 k∞ ≤ (1 + η kρt k∞ ) kζt k∞ . (9.65)
Now, we will prove the lemma by using strong induction on t. Suppose that the first two pieces of (9.58)
hold for all iterations up to t. We can show that
where the last step holds because we can take α to be arbitrarily small (e.g. α ≤ poly(1/n) ≤ O(δ)).
Plugging this into (9.65), we have
kζt+1 k∞ ≤ (1 + 8ηδ) kζt k∞ = 1 + O(ηδ) kζt k∞ , (9.70)
109
which proves the third piece of the lemma. Using this piece, we can show that
t+1 1/(cηδ)
kζt+1 k∞ ≤ (1 + 8ηδ) kζ0 k∞ ≤ 1 + 8ηδ · α ≤ 2α (9.71)
for a sufficiently large constant c, which proves the second piece. Finally, we show that
2 t+1 2
√ 1
kζt+1 k2 ≤ 1 + 8ηδ kζ0 k2 ≤ 1 + 8ηδ)1/(cηδ) · α d ≤ , (9.72)
2
1
if α ≤ nO(1)
, which proves the first piece.
Dynamics analysis of rt
Next, we analyze the dynamics of the signal rt , which we want to show converges to 1.
Lemma 9.17. For all t ≤ 1/(cηδ) with sufficiently large constant c, we have that
rt+1 = 1 + η 1 − rt2 rt + O ηδ rt .
Note that the first term on the RHS is rt+1 during gradient descent on the population loss, and the
second term captures the error.
Proof. Recall that the gradient descent update from the empirical gradient (9.53) is
β t+1 = β t − η rt2 − 1 β ? β ? − ut + vt
βt . (9.73)
We have that
Putting it all together Finally, we return to the proof of Theorem 9.14. By Lemma 9.17, we know that
as long as rt ≤ 1/2 it will grow exponentially fast, since
η
rt+1 ≥ 1 + η 1 − rt2 − O(ηδ) · rt ≥ 1 +
· rt . (9.81)
2
This implies that at some t0 = O log(1/α)
η , we’ll observe rt0 > 1/2 for the first time. Consider what happens
after this point.
110
• When 1/2 < rt ≤ 1, we have that
implying that rt remains very close to 1 after the same order of iterations.
This completes the proof of Theorem 9.14, bounding the number of iterations needed for gradient descent
on the empirical loss to converge to β ∗ .
If we scale the time by η so that each update by gradient descent corresponds to a time step of size η (rather
than size 1), the update becomes
Taking η → 0 yields a differential equation, which can be thought of as a continuous process rather than
discrete updates:
111
This can also be written as:
∂w(t)
ẇ(t) = −∇L(w(t) with ẇ(t) = (9.90)
∂t
This allows us to ignore the η 2 term (alternatively the (dt2 ) term), which will simplify some of the technical
details that follow.
where w+ , w− ∈ Rd . Let w denote the concatenation of the two parameter vectors, i.e. = (w+ , w− ). In
(9.7), we defined fβ (x) = (β β)> x; this model can only represent positive linear combinations of x. By
contrast, fw (x) can represent any linear model. Moreover, if we choose our initialization for w such that
w+ (0) = w− (0), we obtain fw(0) (x) ≡ 0 for all x. Similar to our analysis of the NTK, this initialization will
simplify the subsequent derivations.
Next, we define the following loss function,
n 2
1 X (i)
L(w)
b = y − fw (x(i) ) , (9.92)
2 i=1
where ~1 denotes the all-ones vector. The analysis technique still applies to any general initializations as
long as all the dimension are initialized to be non-zero, but the the initialization scale is the most important
factor, and therefore we chose this simplification for the ease of exposition.
Note that every w = (w+ , w− ) corresponds to a de facto linear function of x. We denote the resulting
linear model as θw :
θw = w + 2 − w − 2 . (9.94)
>
Note that θw x = fw (x).
Let w(∞) denote the limit of the gradient flow, i.e.
Then, the converged linear model in the θ space is defined by θα (∞) = θw(∞) —we are interested in under-
standing its properties. For simplicity, we will omit the ∞ index and refer to this quantity as θα . We assume
throughout that the limit exists and all corresponding regularity conditions are met.
Let
(1)> (1)
x y
.. n×d ..
X= . ∈R and ~y = . . (9.96)
>
x(n) y (n)
In the sequel, we formally state our result relating the complexity of the solution discovered by gradient flow
to the size of the initialization.
112
Theorem 9.18 (Theorem 1 in Woodworth et al. [2020]). For any 0 < α < ∞, assume that gradient flow
with initialization w+ (0) = w− (0) = α · ~1 converges to a solution that fits the data exactly: Xθα = ~y .3 Then,
the solution satisfies the following notion of minimum complexity:
where
n
X θi
Qα (θ) = α2 · q 2
(9.99)
i=1
α
and
p z
q(z) = 2 − 4 + z 2 + z · arcsinh (9.100)
2
In words, Theorem 9.18 claims that θα is the minimum complexity solution for the complexity measure
Qα .
Remark 9.19. In particular, when α → ∞ we have that
and so
1 2
Qα (θ) kθk2 . (9.102)
α2
This means that if α → ∞ than the complexity measure Qα is the `2 -norm, ||θ||2 . If α → 0, then the
complexity measure becomes
θi |θi | 1
q 2 log (by Taylor expansion) (9.103)
α2 α α2
and so,
kθk1 1
Qα (θ) log (9.104)
α2 α2
To summarize, for α → ∞, the constrained minimization problem we solve in (9.98) yields the minimum
`2 -norm solution of θ (i.e. the `4 -norm for w). When α → 0, solving (9.98) yields the minimum `1 -norm θ
(which is the `2 -norm for w). For 0 < α < ∞, we obtain some interpolation of `1 and `2 regularization of
the optimum.
Remark 9.20. Note that when α → 0, the intuition is similar to what we had observed in previous analyses;
in particular, the solution is the global minimum closest to the initialization. Note however, that when
α 6= 0, the solution discovered by gradient descent will not exactly correspond to the solution closest to the
initialization.
Remark 9.21. When α → ∞, we claim that the model optimization is in the neural tangent kernel (NTK)
regime. Recall that we had two parameters, (σ, β), that determined if we could treat the optimization
problem as a kernel regression. Further recall that σ denotes the minimum singular value of Φ and β is the
Lipschitzness of the gradient. Let us now compute σ and β for large α initializations of our model.
3 This assumption can likely be proved to be true and thus not required. Here we still include the condition because the
113
For w− (0) = w+ (0) = α~1,
w+ (0) · x x
∇fw(0) (x) = 2 = 2α (9.105)
−w− (0) x −x
by the chain rule. It is clear then that both σ and β linearly depend on α. This implies that
β
→0 as α → ∞ (9.106)
σ2
since the denominator is O(α2 ), while the numerator is O(α). In particular, the features used in this kernel
method are:
x
φ(x) = ∇fw(0) (x) = 2α (9.107)
−x
The neural tangent kernel perspective then gives an alternative proof of this complexity minimization result
for α → ∞. In the NTK regime, the solution (to our convex problem) is always the minimum `2 -norm
X
solution for the feature matrix, which in this case equals .
−X
Note that practice tends not to follow the assumptions made here. Often, people either do not use large
initializations or do not use infinitesimally small step sizes. But this is a good thing because we do not want
to be in the NTK regime; being in the NTK regime implies that we are doing no different or better than
just using a kernel method.
We can now prove Theorem 9.18, which is similar to the overparametrized linear regression proof of
Theorem 9.3.
This proof follows in two steps:
1. We find an invariance maintained by the optimizer. In the overparametrized linear regression proof
of Theorem 9.3, we required θ ∈ span{x(i) }. For this proof, we will use a slightly more complicated
invariance.
2. We characterize the solution using this invariance. The invariance, which depends on α, will tell us
which zero error solution the optimization converges to.
Note also that all of these conditions only depend upon the empirically observed samples. The invariance
and minimum is not defined with respect to any population quantities.
Proof. Let
n×2d w+ (t)
∈ R2d .
X̃ = X −X ∈ R and w(t) = (9.108)
w− (t)
Then, the model output on n data points can be described in matrix notation as follows:
fw(t) (x(1) )
2
w+ (t) ..
X̃w(t) 2 = X −X n
= ∈R . (9.109)
w− (t) 2 .
fw(t) (x(n) )
114
the gradient of w(t) can be computed as
where r(t) = X̃w(t) 2 − ~y denotes the residual vector. We see that the X̃ > r(t) term in (9.113) is reminiscent
of linear regression for which it would correspond to the gradient, although the w(t) reminds us that this
problem is indeed quadratic.
We cannot directly solve this differential equation, but we claim that
!
Z >
w(t) = w(0) exp −2X̃ > r(s)ds (exp is applied entry-wise) (9.114)
0
which is not quite a closed form solution of equation 9.113 since r(s) is still a function of w(t). To understand
how we obtained this “solution,” we consider a more abstract setting. Suppose that
We can then “solve” this differential equation as follows. Rearranging, we observe that
u̇(t)
= v(t) (9.116)
u(t)
d log u(t)
= v(t) (chain rule) (9.117)
dt
Z t
log u(t) − log u(0) = v(s)ds (integration) (9.118)
0
Z t
u(t)
= exp v(s)ds (9.119)
u(0) 0
115
Letting t → ∞, we have that
Z ∞
2 >
θα = 2α sinh −4X r(s)ds . (9.125)
0
Lastly, we also know
Xθα = ~y (9.126)
since this is the assumption by the theorem (which should can be proven because the optimization should
converge to a zero-error solution). We next show that (9.125) and (9.126) are also sufficient conditions
for a solution to the constrained optimization problem given by (9.98). In particular, (9.125) and (9.126)
correspond to the Karush-Kuhn-Tucker (or KKT) conditions of (9.98).
A KKT condition is an optimality condition for constrained optimization problems. While these condi-
tions can have a variety of formulations and typically one can invoke some off-the-shelf theorems to use them,
we can motivate the conditions we encountered by considering the following general optimization program:
argmin Q(θ) (9.127)
s.t. Xθ = ~y . (9.128)
We say that θ satisfies the (first order) KKT conditions if
∇Q(θ) = X > ν for some ν ∈ Rn (9.129)
Xθ = ~y (9.130)
More intuitively, we know that optimality implies that there are no first order local improvements that satisfy
the constraint (up to first order). Then, consider a perturbation ∆θ. In order to satisfy the constraint, we
must enforce the following:
∆θ ⊥ row-span{X} so X∆θ = 0 (9.131)
So, if we look at θ + ∆θ satisfying the constraint, we can use a Taylor expansion to show that
Q(θ + ∆θ) = Q(θ) + h∆θ, ∇Q(θ)i ≤ Q(θ) (9.132)
because if h∆θ, ∇Q(θ)i is positive it violates the optimality assumption. In fact, it is very easy to make the
sign flip for h∆θ, ∇Q(θ)i because you can flip ∆θ to be the opposite direction. This means that
∀ ∆θ ⊥ row-span{X}, h∆θ, ∇Q(θ)i = 0 (9.133)
because if it is negative, you can equivalently flip it to be positive which violates optimality. This means
that Q(θ) ⊆ row-span{X}, or Q(θ) = X > ν for some ν.
Returning to our problem, the KKT condition gives
∇Q(θ) = X > ν (9.134)
and the invariance gives us
Z ∞
2 >
θα = 2α sinh −4X r(s)ds (9.135)
0
= 2α2 sinh −4X > v 0
(9.136)
0
R∞
where we let v = 0
r(s)ds for simplicity. Taking the gradient of Q gives
1
∇Qα (θ) = arcsinh θ (9.137)
2α2
Plugging in θα , we get
1
∇Q(θα ) = arcsinh θα = −4X > v 0 (9.138)
2α2
Thus, θα satisfies both KKT conditions. Even further, since our optimization problem (9.98) is convex (we
do not formally argue this), we conclude that θα is a global minimum.
116
9.4 Implicit regularization towards max-margin solutions in clas-
sification
We now switch our focus to classification problems. We consider linear models (though these results also
apply to nonlinear models with a weaker version of the conclusion). We assume that our data is separable and
will prove that gradient descent converges to the max-margin solution. This result holds for any initialization
and does not require any additional regularization; we only require the use of gradient descent and the
standard logistic loss function. The results in this subsection are originally given by Soudry et al. [2018],
and our exposition heavily depends on those in [Ji and Telgarsky, 2018, Telgarsky, 2021].
Assume we have data {(x(i) , y (i) )}ni=1 , where x(i) ∈ Rd and y (i) ∈ {±1}. We consider the linear model
Pn
hw (x) = w> x and the cross entropy loss function L(w) (i) (i)
b = i=1 ` y , hw x , where `(t) = log(1 +
exp(−t)) is the logistic loss.
As we have separable data, there can be multiple global minima, as you can trivially take an infinite
number of separators. More formally, there are an infinite number of unit vectors w̄ such that w̄> x(i) y (i) > 0
for all i as one can perturb any strict separator while still maintaining a separation of classes. Then, we can
scale the separator to make the loss arbitrarily small—we have that L(α b w̄) → 0 as α → ∞. Thus, informally,
for any unit vector w̄ that separate the data, ∞ · w̄ is a global minimum.
We would like to understand which global minimum gradient descent converges to. We will now show
that it finds the max-margin solution. Before we can do so, we recall/introduce the following definitions.
Definition 9.22 (Margin). Let {(x(i) , y (i) )}ni=1 be given data. Assuming {(x(i) , y (i) )}ni=1 is linearly separa-
ble, the margin is defined as
Definition 9.23 (Normalized Margin). Let {(x(i) , y (i) )}ni=1 be given data. Assuming {(x(i) , y (i) )}ni=1 is
linearly separable, the normalized margin is defined as
Definition 9.24 (Max-Margin Solution). Using the normalized margin γ defined in Definition 9.23, we
define a max-margin solution as
γ(w(t)) → γ̄ as t → ∞ (9.142)
117
3. Using observation 1, we see that kw(t)k2 → ∞ because if kw(t)k2 were instead bounded, then the
loss L(w(t))
b will be bounded below by a constant that is strictly greater than zero, contradicting
observation 1. Formally, if kw(t)k2 ≤ B, then
4. Suppose we have w such that kwk2 = q is very big. Then, using observation 2, we see that
n
X
L(w)
b = `(y (i) w> x(i) ) (9.145)
i=1
Xn
≈ exp −y (i) w> x(i) (9.146)
i=1
n
X
log L(w)
b ≈ log exp −y (i) w> x(i) (9.147)
i=1
Xn
= log exp −qy (i) w̄> x(i) (9.148)
i=1
w
where w̄ = kwk 2
and the last step holds because the log of a sum of exponentials (log-sum-exp) is a
smooth approximation to the maximum function. To motivate this claim, observe that:
n
X
log exp(aui ) ≥ q max ui (9.150)
i
i=1
Xn
log exp(aui ) ≤ log n exp(q max ui ) (9.151)
i
i=1
= log n + q max ui (9.152)
i
≈ q max ui + o(q) as q → ∞ (9.153)
i∈[n]
The above observations heuristically demonstrate that minimizing the logistic loss with gradient descent
is equivalent (in the limit) to maximizing the margin. Below, we prove Theorem 9.25 rigorously for the
exponential loss function `(t) = exp(−t), which is nearly the same as the logistic loss.
118
Proof of Theorem 9.25. We begin by defining the smooth margin as
∆ − log L̂(w)
γ̃(w) = (9.156)
kwk2
Pn (i) > (i)
− log i=1 exp(−y w x )
= . (9.157)
kwk2
Note that γ̃(w) approximates γ(w) by the log-sum-exp approximation. To make this precise, recall that
γ(w) ≥ γ̃(w) because y (i) w> x(i) ≥ γ(w)kwk2 for all i.
Then, since γ(w) ≤ γ̄ by definition, it suffices to show that
While the derivation above tells us how the numerator of (9.156) is changing, we have yet to relate this to
the denominator, i.e. the norm of w. Recall that w∗ is the direction of the max-margin solution. Then, we
have
= γ̄ · L̂(w(t)). (9.170)
This shows that ẇ(t) is correlated to w∗ , and that this correlation depends on γ̄ and the loss. In addition,
ẇ(t) is not too small compared to the loss.
119
Next, we substitute (9.165) into the second term of the right-hand-side of (9.163):
Z T Z T
kẇ(t)k22
dt ≥ γ̄ · kẇ(t)k2 dt (9.171)
0 L̂(w(t)) 0
Z T
≥ γ̄ · ẇ(t)dt (9.172)
0
2
= γ̄kw(T )k2 . (9.173)
log L̂(w(T ))
lim − ≥ γ̄. (9.176)
T →∞ kw(T )k2
Recognizing the LHS as the definition of the smooth margin, i.e. (9.156), we conclude that
120
Given the score of the lectures, we will only be able to discuss some of these results informally and
heuristically. For example, we refer to the paper Damian et al. [2021] for the a concrete, formal version
result for the third bullet.
For the remainder of this section, let g(x) denote the general loss function. Then, the formulation of
SGD is: for t in [0, T ],
where η > 0 is the learning rate, ξt denotes the SGD noise, and E[ξt ] = 0. Note that in the most general
case, ξt can depend on xt .
(1 − η)xt is called the contraction because η > 0, which means that this term will shrink after each iteration.
The random noise term ησξt will accumulate over time, and the scale of ησξt remains unchanged. When
xt is large, the contraction term will dominate. When xt is small, the noise term will dominate. Without
the noise term, as xt continues its contraction, we approach the global minimum x = 0. However, with the
presence of the noise σξt , xt will not stay at 0, but instead bounce around it.
To characterize this intuition more precisely, we have
The first term in (9.189) becomes negligible when ηt 1 (since (1 − η)t ≈ e−ηt ). The second term in
(9.189) is the accumulation of noise, which is the sum Ptof Gaussians. Leveraging the properties of Gaussian
distributions, we know that its variance equals η 2 σ 2 k=0 (1 − η)2k .
P∞ η2 σ2
From the analysis above, we know that as t → ∞, Var(xt ) ≈ η 2 σ 2 k=0 (1 − η)2k = 2η−η 2
2 = Θ(ησ ).
2
Therefore, as t → ∞, xt ∼ N (0, Θ(ησ )).
Interpretation. In the one dimensional case, the noise only makes it harder to converge to the global
minimum. Classical convex optimization tells us: (1) noisy GD leads to a less accurate solution and (2)
noisy GD is faster than GD. What we do in practice is achieve a balance between (1) and (2). This does not
lead to implicit regularization since E[xt ] → 0 as t → ∞. However, this case is important for further analysis
because (9.183) corresponds to the Ornstein–Uhlenbeck (OU) process which we use more extensively in the
multi-dimensional cases.
121
9.5.2 SGD on multi-dimensional quadratic functions
Consider a PSD matrix A ∈ Rd×d . In this section, g(x) = 12 x> Ax. Suppose ξt ∼ N (0, Σ). For ease of
presentation, assume that A and Σ are simultaneously diagonizable (they have the same set of eigenvectors).
We use K to denote the span of the eigenvectors of A/Σ. Then, consider the following SGD iterate:
σ2
Interpretation. Intuitively, dii here is the iterate fluctuation in the direction of the i-th eigenvector. This
results tell us that the fluctuation of the iterates depends on the shape of Σ and A. If Σ is not full rank,
√
the fluctuations will be limited to the subspace K. Also note that E[kxt k2 ] = Θ( η). This reflects the
noise accumulation since the scale of noise in each step is Θ(η). However, we still do not have any implicit
regularization effect. This is because the Hessian of the quadratic objective is unchanged. When we have
the change in Hessian, SGD noise will exert an implicit bias on the iterate. See Figure 9.3 for an example.
Figure 9.3: The effect of SGD noise with the change in Hessian when x = 0. Consider the objective F (x) = x2
1 2
when x ≤ 0 and F (x) = 10 x when x > 0. Suppose we initialize SGD at x = 0 and run 1024 steps of SGD
with step size 0.01. We plot the probability density of the iterate after various steps of SGD. Note that the
density function and the mean gradually move to the left.
In the sequel, we separately analyze the second order and third order effects of SGD on a general non-
quadratic function. The second order effect exactly corresponds to this section’s analysis when A equals the
Hessian of the general non-quadratic function.
5 For random variable ξ ∈ Rd , E[(W ξ)(W ξ)> ] = W E[ξξ > ]W >
122
9.5.3 SGD on non-quadratic functions
In this section, we analyze SGD on non-quadratic functions based on [Damian et al., 2021]. Due to the
complexity of the analysis, we provide heuristic derivations to convey the main insights.
Without loss of generality, suppose a global minimum of g(x) is x = 0. Therefore, ∇x g(0) = 0 and
∇2x g(0) is PSD. We also assume the iterates xt are close to 0, so we can Taylor expand around 0.
Let H = ∇2x g(0) and T = ∇3x g(0). Since T is a tensor, we first clarify our notation. First, for T ∈ Rd×d×d ,
x, y ∈ Rd , T [x, y] ∈ Rd , and
∆
X
T [x, y]i = Tijk xj yk . (9.198)
j,k∈[d]
Now returning to (9.197), after dropping the higher order terms, we obtain the following third-order
Taylor expansion:
If we don’t consider the third order term ηT [xt , xt ], the update reduces to the one we studied in the
√
previous subsection. Next, recall that kxt k2 ≈ η. Therefore, ηT [xt , xt ] ≈ η 2 . This quantity is dominated
by both ηξt and ηHxt ≈ η 1.5 .
So, when H is positive definite, the third order term can be negligible. However, in overparametrized
models, H is typically low-dimensional. For instance, if the NTK matrix is full rank, then the manifold
of interpolators has dimension d − n. Then, in the direction orthogonal to the span of H, the contraction
term disappears. Letting ΠA denote projections onto the subspace A, we see that ηHΠK ⊥ (xt ) = 0 and
T [xt , xt ] ≈ η 2 will dominate the update in that direction.
Consider the case in which both H and Σ are not full rank. When the loss is quadratic as in the previous
section, we know that the iterate xt bounces in the subspace K and remains stable in the subspace K ⊥ .
What happens when the loss is not quadratic, i.e. T [xt , xt ] affects the gradient update?
To answer this question, we decompose the effect of the update in (9.201) between the two subspaces of
interest, K and K ⊥ . First, observe that (I − ηH)xt − ηξt is working in K, and −ηT [xt , xt ] is only working
in K ⊥ because in K the effect of ηT [xt , xt ] is dominated by (I − ηH)xt − ηξt . In previous section, we
already well-characterized the effect of optimization without a third order effect. To refine our analysis of
the gradient update, we define an iterate ut+1 = (I − ηH)yt − ηξt in which we do not have the third order
effect.6 Then, to analyze what the implicit regularization effect is, we study rt = xt − ut .
Note that we only have the contraction and the bias terms for the rt iterate. The stochasticity term ηξt is
canceled out.
6 Note that ξt is the same for each ut and xt .
123
In the subspace K = span(H), the effect of ηT [xt , xt ] is again dominated by (I − ηH)xt − ηξt , so no
meaningful regularization occurs. But letting ΠA denote the projection onto the subspace A, we have that
in K ⊥ ,
Namely, the effect of T [uk , uk ] is slowly accumulating in K ⊥ . In Figure 9.4, an illustration of this phenomenon
is provided.
Note that the OU process is a Markov chain and a Gaussian process. Here we assume that H is constructed
Pt that ut converges to its stationary distribution. Suppose the Markov chain ut mixes as t → ∞. Then,
such
k=0 ΠK ⊥ T [uk , uk ] ≈ t E[T [u∞ , u∞ ]]. By equation (9.198) and equation (9.199),
X
E[T [u, u]]i = E[ Tijk ui uj ] (9.204)
j,k
X
= Tijk E[uu> ]jk (9.205)
j,k
Figure 9.4: The effect of SGD noise on non-quadratic functions. K is the span of the noise covariance Σ.
In the quadratic case, the iterates will fluctuate in K, but remains unchanged in K ⊥ . When the function is
non-quadratic, the third order effect slowly accumulates in K ⊥ , resulting in implicit regularization.
Interpretation. Intuitively, the direction of the implicit regularization is T (S) = ∇x h∇2x g(0), Si . In
other words, the implicit bias −T (S) is trying to make h∇2x g(0), Si small. [Damian et al., 2021] further
prove that SGD with label noise on loss L̂(θ) converges to a stationary point of the regularized loss L̂(θ) +
λtr(∇2θ L̂(θ)). In the next subsection, we will heuristically explain why this regularization term is useful.
124
9.5.4 SGD with label noise
We previously claimed that SGD with label noise minimizes the regularized loss
But why is tr(∇2θ L̂(θ)) a useful regularizer? This question has been the subject of recent study in the
implicit regularization literature. [Wei and Ma, 2019b] show that the complexity of neural networks can be
controlled by its Lipschitzness. Indeed, we will see that tr(∇2 L̂(θ)) is intimately related to the Lipschitzness
of the networks. [Foret et al., 2020] also discover empirically that regularizing the sharpness of the local
curvature leads to better generalization performance on a wide range of tasks. In the sequel, we will unpack
∆
some of these arguments to justify regularizing by R(θ) = tr ∇2 L̂(θ) .
We first consider the case of one data point, i.e. L̂(θ) = `(fθ (x), y). For notational simplicity, let
∆
f = fθ (x) denote the model output, p be the number of parameters, and `(f, y) be the loss function. Then,
2 ∂` ∂f
∇ L̂(θ) = ∇θ · (9.208)
∂f ∂θ
∂`
= ∇θ · ∇θ fθ (x) (9.209)
∂f
∂2` ∂` 2
= · ∇θ fθ (x)∇θ fθ (x)> + ∇ fθ (x) . (9.210)
∂f 2 ∂f | θ {z }
∈Rp×p
Suppose the loss function is `(f, y) = 21 (f − y)2 . Then, observing that ` is simply a quadratic function of f ,
we have
Note that the first term of (9.211) is positive semi-definite (PSD), while the second term is not necessarily
PSD. In general, (9.211) is referred to as the Gauss-Newton decomposition. Note also that for convex losses
`,
∂2`
≥ 0, (9.212)
∂f 2
which further implies that
∂2`
∇fθ (x)∇fθ (x)> < 0. (9.213)
∂f 2
Empirically, we observe that the second term (f − y)∇2 fθ (x) is generally smaller. This is especially
evident when θ is at a global minimum for which `(fθ , y) = 0. In this case, (f − y)∇2 fθ (x) = 0 because
fθ (x) = y. These two observations suggest that we can ignore the second term. In that case,
∂2`
tr ∇2 L̂(θ) ≈ · tr ∇f (x)∇f (x)>
2
(9.214)
∂f
∂2`
= · k∇fθ (x)k22 (9.215)
∂f 2
Thus, minimizing tr ∇2 L̂(θ) is approximately equivalent to minimizing the Lipschitzness of the model
output with respect to θ, which is approximately equivalent to minimizing the Lipschitzness of the model
output with respect to hidden variables.
125
For example, let θ = (w1 , . . . , wr ), then we have
∂f ∂f
= · h> , (9.216)
∂wi ∂h0i+1 i
where h0i+1 = wi hi , and hi denotes the hidden variables of the i-th layer and h0i+1 is the pre-activation of
the (i + 1)-th layer. Then,
∂f ∂f
= · khi k2 . (9.217)
∂wi F ∂hi+1 2
This validates our claim that minimizing the Lipschitzness of the model output with respect to the parameters
is (approximately) equivalent to minimizing the Lipschitzness of the model output with respect to the hidden
variables. We have previously connected the latter concept to generalization of deep neural networks. See
Section 6.1 for a discussion of the all-layer margin, a measure of Lipschitzness of the model with respect to
hidden layer variables that can be directly used to bound generalization error of a deep net.
126
Chapter 10
We venture into unsupervised learning by first studying classical (and analytically tractable) approaches
to unsupervised learning. Classical unsupervised learning usually consists of specifying a latent variable
model and fitting using the expectation-maximization (EM) algorithm. However, so far we do not have a
comprehensive theoretical analysis for the convergence of EM algorithms because fundamentally analyzing
EM algorithms involves understanding non-convex optimization. Most analysis of EM only applies to special
cases (e.g., see Xu et al. [2016], Daskalakis et al. [2016]) and it is not clear whether any of the results can
be extended to more realistic, complex setups, without a fundamentally new technique for understanding
nonconvex optimization. Instead, we will analyze a family of algorithms which are broadly referred to as
spectral methods or tensor methods, which are a particular application of the method of moments [Pearson,
1894] with the algorithmic technique of tensor decomposition [Anandkumar et al., 2015]. While the spectral
method appears to be not as empirically sample-efficient as EM, it has provable guarantees and arguably is
more reliable than EM given the provable guarantees.
After discussing the basics of classical unsupervised learning, we will move on to modern applications
of deep learning. In particular, we’ll focus on theoretical interpretations of contrastive learning, which is a
class of successful self-supervised learning algorithms in computer vision.
127
Here i is called the latent variable since we only observe x. Here we assume the covariances of the Gaussians
to be identity, but they can also be parameters that are to be learned.
There are many other latent variables that could be defined via a similar generative process, such as
Hidden Markov Models, Independent Component Analysis, which we will discuss later.
M2 = E[xx> ]
∆
(10.9)
1 1
= E[xx> |i = 1] + E[xx> |i = 2] (10.10)
2 2
To compute these expectations, consider an arbitrary Z ∼ N (µ, I). Then,
E[ZZ > ] = E[Z] E[Z]> + Cov(Z) (10.11)
>
= µµ + I (10.12)
Recognizing that this second moment calculation is the same for both Gaussians in our mixture, we obtain:
1 1
M2 = (µµ> + I) + (µµ> + I) (10.13)
2 2
= µµ> + I (10.14)
Since the second moment provides information about µ, we can complete the two tasks required for the
moment method using the second moment.
If we had access to infinite data, then we can compute the exact second moment M2 = µµ> +I. Then, we
can recover µ by evaluating the top eigenvector and eigenvalue of M2 .1 The top eigenvector and eigenvalue
∆ µ
of M2 is µ̄ = kµk 2
and kµk22 + 1, respectively.
In practice, however, we do not have infinite data. In that case, we need to estimate the second moment
by an empirical average.
n
c2 = 1 >
X
M x(i) x(i) (10.15)
n i=1
1 This approach is known as the spectral method.
128
We can then recover µ by evaluating the top egivenvector and eigenvalue of M c2 . However, we need this
algorithm to be robust to errors, i.e., similar estimates, M2 , of the second moment should yield similar
c
estimates of µ. Fortunately, most algorithms we might use for obtaining the top eigenvector and eigenvalue
are robust, so we can limit our attention to the infinite data case. Having outlined the moment method
approach to the mixture of two Gaussians problem, we study a generalization of this problem in the sequel.
M1 = E[x] (10.17)
k
X 1
= E[x|i] (10.18)
i=1
k
k
1X
= µi , (10.19)
k i=1
M2 = E[xx> ] (10.20)
k
X 1
= E[xx> |i] (10.21)
i=1
k
k
X 1
= (µi µ>
i + I) (10.22)
i=1
k
k
1X
= µi µ>
i + I. (10.23)
k i=1
129
denote the matrix we aim to recover. Then, consider some rotation matrix R ∈ Rk×k . We consider U versus
U R:
k
1X 1
µi µ>
i = UU> (10.25)
k i=1 k
1
= (U R)(U R)> (RR> = I) (10.26)
k
This result proves that the second moment is invariant to rotations. To prove a similar claim for the first
moment, we also constrain our choice of R such that
R · ~1 = ~1 (10.27)
Then,
k
1X 1
µi = U · ~1 (10.28)
k i=1 k
1
= U R · ~1 (10.29)
k
Therefore, the first and second moments of U and U R are indistinguishable, and we must consider the third
moment in order to identify U .
Each term in the sum now corresponds to the third moment for some multivariate Gaussian. Fortunately,
Lemma 10.1 suggests a formula for estimating its value.
Lemma 10.1. Suppose z ∈ N (v, I). Then,
d
X d
X d
X
E[z ⊗ z ⊗ z] = v ⊗ v ⊗ v + E[z] ⊗ el ⊗ el + el ⊗ E[z] ⊗ el + el ⊗ el ⊗ E[z] (10.31)
l=1 l=1 l=1
130
To explicitly compute the last four terms in (10.34), we note that:
(
0 if i 6= k
E[ξi ξk ] = 2
(10.35)
E[ξi ] = 1 if i = k
d
X d
X d
X
E[z ⊗ z ⊗ z] = v ⊗ v ⊗ v + v ⊗ el ⊗ el + el ⊗ v ⊗ el + el ⊗ el ⊗ v. (10.38)
l=1 l=1 l=1
We can now apply Lemma 10.1 to compute the third moment of the mixture of k Gaussians. In particular,
k
1X
E[x ⊗ x ⊗ x] = E[x ⊗ x ⊗ x | i] (10.39)
k i=1
k d d d
!
1X X X X
= µi ⊗ µi ⊗ µi + µi ⊗ el ⊗ el + el ⊗ µi ⊗ el + el ⊗ el ⊗ µi (10.40)
k i=1
l=1 l=1 l=1
k d k
! d k
!
1X X 1 X X 1 X
= µi ⊗ µi ⊗ µi + µi ⊗ el ⊗ el + el ⊗ µi ⊗ el
k i=1 k i=1 k i=1
l=1 l=1
d k
!
X 1 X
+ el ⊗ el ⊗ µi (10.41)
k i=1
l=1
k d d d
1X X X X
= µi ⊗ µi ⊗ µi + E[x] ⊗ el ⊗ el + el ⊗ E[x] ⊗ el + el ⊗ el ⊗ E[x] (10.42)
k i=1
l=1 l=1 l=1
(10.43)
For notational convenience, let
a⊗3 = a ⊗ a ⊗ a.
∆
(10.44)
Pk
⊗3
So far, we have shown how to compute 1
i=1 µi
k from E[x⊗3 ] and E[x]. In the sequel, we will formalize the
k
remaining problem, recovering {µi }ki=1 from k1 i=1 µ⊗3
P
i , as the tensor decomposition problem, and discuss
efficient algorithms for it.
Tensor decomposition Recovering the Gaussian means, {µi }ki=1 , from the third mixture moment,
1
Pk ⊗3
k i=1 µi , is a special case of the general tensor decomposition problem. That problem is set up as
Pk
follows: Assume that a1 , · · · ak ∈ Rd are unknown. Then, given i=1 a⊗3
i , our goal is to reconstruct the ai
vectors.
Before we present some standard results on tensor decomposition, we first describe some basic facts about
tensors. Much like matrices, tensors have an associated rank. For example, a ⊗ b ⊗ c is a rank-1 tensor. In
general, the rank of a tensor T is the minimum k such that T can be decomposed as
k
X
T = ai ⊗ bi ⊗ ci . (10.45)
i=1
131
for some {ai }ki=1 , {bi }ki=1 , {ci }ki=1 . Another difference between tensors and matrices is that the former objects
do not have the typical rotational invariance. In particular, consider applying a right rotation R ∈ Rk×k to
the matrix
and get
e = AR = ã1
A ··· ãk ∈ Rd×k (10.47)
Then,
k
X k
X
ai a> > >
i = AA = (AR) · (AR) = ãi ã>
i (10.48)
i=1 i=1
Then,
k
X k
X
a⊗3
i = ã⊗3
i (10.50)
i=1 i=1
The lack of rotation invariance in the sense above and the existence of permutation invariance make tensor
decomposition computationally challenging as well as powerful. Pk
We now summarize some standard results regarding tensor decomposition for T = i=1 a⊗3 i . The results
Pk
for decomposing the asymmetric version T = i=1 ai ⊗ bi ⊗ are largely analogous. We will not prove these
results here.
0. In the most general case, recovering the ai ’s from T is computationally hard. Any procedure will either
fail to find a unique ai or it fails to find ai efficiently.
1. In the orthogonal case, i.e. a1 , . . . , ak are orthogonal vectors, each ai is a global maximizer of
X
max T (x, x, x) = Tijk xi xj xk (10.51)
kxk2 =1
i,j,k
Results 1 and 2 above both involve the so-called “under-complete” case (k ≤ d), e.g., when the number of
Gaussians in the mixture is smaller than the dimension of the data. Next, we describe certain overcomplete
cases for which efficient tensor decomposition is possible.
132
3. Suppose a⊗2 ⊗2 2
1 , . . . , ak are independent for k ≤ d . Then, applying Result 2, we can recover ai from
Pk ⊗2 ⊗3 Pk ⊗6 6
i=1 (ai ) = i=1 (ai ) ∈ Rd .
4. Excluding an algebraic
Pk set of measure 0,2 we can use the FOOBI algorithm to recover ai from the
fourth-order tensor i=1 a⊗4
i when k ≤ d . A robust version of the FOOBI algorithm is described in
Ma et al. [2016].
5. Assume ai are randomly generated unit vectors. Then, for the third order tensor, k can be large as
d1.5 [Ma et al., 2016, Schramm and Steurer, 2017].
In summary, the moment method is a recipe in which we first compute high order moments (i.e. tensors),
assume that these estimates are noiseless, and decompose these tensors to recover the latent variables.
Though we do not discuss these results here, there is an extensive literature analyzing the robustness of
the moment method to error in the moment estimates. Last, we note that even though we only explicitly
analyze the mixture of Gaussians model here, latent variable models amenable to analysis by the moment
method include ICA, Hidden Markov Models, topic models, etc.
1 7
3
6
2
4 5
S1 S2
Figure 10.1: A demonstration of graph partitioning. Sets S1 and S2 form a good partition of the graph since
there’s only one edge between them.
133
n
where S and S̄ partition [n]. Assume |S| = 2. We then assume the following generative model for G. If
i, j ∈ S or i, j ∈ S̄, then
(
1 w.p. p
Gij = . (10.54)
0 w.p. 1 − p
for p > q (i.e., more likely to be connected if from the same group). For instance, S and S̄ could mean two
companies, and i ∈ [n] is a user of a social network. Two users i, j are more likely to know each other if they
are in the same company.
𝑆̅
2
Figure 10.2: A graph generated by the stochastic block model with p = 3 and q = 15 .
Our goal is then to recover S and S̄ from G; the primary tool we use towards this goal is the eigende-
composition of G.
In some trivial cases, it is not necessary to eigendecompose G to recover the two hidden communities.
Suppose, for instance, that p = 0.5 and q = 0. Then, the graph represented by G will contain two connected
components that correspond to S and S̄.
As a warm-up to motivate our approach, we eigendecompose Ḡ = E[G]. Observe that
(
p if i, j from the same community
Ḡij = . (10.56)
q o.w.
134
Lemma 10.2. Let Ḡ = E[G] for the stochastic block model. Then, letting ui (A) denote the i-th eigenvector
of the matrix A,
u1 (Ḡ) = ~1 (10.58)
>
u2 (Ḡ) = [1, . . . , 1, −1, . . . , −1] (10.59)
| {z } | {z }
|S| |S̄|
for r = p − q. To precisely define G0 , we note that G0 is block diagonal with two blocks of size |S| and |S̄|,
respectively. Then,
Ḡ = ~1~1> q + G0 . (10.63)
Thus,
r···r 1
.. ..
.
r···r
0 .
· = r · n · u.
0
1
G ·u= (10.64)
−1
r···r 2
.. .
0 . .
.
r···r −1
135
More generally, when we have more than two clusters in the graph, G0 is block diagonal with more than
two blocks. In this setting, the eigenvectors will still align with the blocks. We illustrate this below for a
generic block diagonal matrix. Let
1···1
..
.
1···1
0 0
1 · · · 1
.
A=
0 .. 0
(10.67)
1···1
1···1
..
0 0 .
1···1
Then, the three eigenvectors of A are
1 0 0
.. .. ..
. . .
1 0 0
0 1 0
.. .. ..
. , . , . (10.68)
0 1 0
0 0 1
. . .
.. .. ..
0 0 1
Furthermore, the rows of the matrix formed by the three eigenvectors given by (10.68) clearly give the
cluster IDs of the vertices in G. Note also that permutations of A will result in equivalent permutations in
the coordinates of each of the three eigenvectors.
Next, we relate this observation to the result in Lemma 10.2. While there are no negative values in the
eigenvectors given in (10.68), we observe that any linear combination of eigenvectors is also an eigenvector,
so recovering a solution that look more like (10.59) is straightforward. Indeed, taking linear combinations of
the eigenvectors defined above shows that there is an alternative eigenbasis that includes the all-ones vector,
~1. How for this choice of A, the all-ones vector is not the unique top eigenvector. For that to be the case,
we require background noise in Ḡ.
In reality, we only observe G. In the sequel, we will show that in terms of the spectrum, G ≈ E[G].
Formally, we will leverage earlier concentration results to prove that kG − E[G]kop is small. Concretely, then,
G = (G − E[G]) + E[G] (10.69)
p + q ~~ > p − q
= (G − E[G]) + · 11 + · uu> (10.70)
2 2
Rearranging, we obtain that:
p + q ~~ > p−q
G− · 11 = (G − E[G]) + · uu> (10.71)
2 2
p+q
We then hope that G − E[G] is a small perturbation, so that the top eigenvector of G − 2 · ~1~1> is close to
u. Namely, it suffices to show that
p−q
kG − E[G]kop · uu> . (10.72)
2 op
136
Lemma 10.4. With high probability,
p
kG − E[G]kop ≤ O( n log n) . (10.73)
Note that this concentration inequality is different from the ones we have seen in the course so far because
both G and E[G] are matrices, not scalars. Our goal will be to turn the quantity on the LHS into something
that we are familiar with.
Proof. The key idea is that we can use uniform convergence, after noting that
Now, the quantity inside the max is the difference between the sum of independent random variables and
their expectation, which we are familiar with. We can use brute force discretization to deal with the max.
First, note that for a fixed v with kvk2 = 1, we can use Hoeffding’s inequality to find that
X X 2
Pr vi vj Gij − E vi vj Gij ≥ ≤ exp(− ) . (10.77)
2
i,j∈[n] i,j∈[n]
√
Then, we choose = O( n log n), take a discretization of the unit ball with granularity 1/nO(1) (which
yields a cover of cardinality exp(n log n)), and take a union bound over this discretized set to achieve the
desired result.
√
Remark 10.5. Comparing this bound to p−q2 ·n, we can deduce that if p−q
log n
√
n
, then we can recover the
vector u approximately. Via a post-processing step that we do not discuss here, u can actually be recovered
exactly.
∆ |E(S, S̄)|
φ(S) = (10.78)
Vol(S)
where E(S, S̄) is the total number of edges between S and S̄, and Vol(S) is the total number of edges
connecting to S. To be precise, let A be the adjacency matrix of G,
X
E(S, S̄) = Aij (10.79)
i∈S,j∈S̄
X
Vol(S) = Aij . (10.80)
i∈S,j∈[n]
137
the conductance of the other part could be larger. Therefore, typically people only consider the conductance
of a smaller part of the partition.
Next, we define φ(G) to be the sparsest cut of G:
One may wonder why we need to normalize by Vol(S) in the definition of conductance. The reason is that
E(S, S̄) itself is typically minimized when S is small. Thus, without this normalization, the sparsest cut
would not be very meaningful. For instance, suppose the graph G contains N nodes and can be divided
into two halves each containing N/2 nodes, and every node is connected to all the other nodes in the
same half, but is connected to only 2 nodes in the other half (as shown in Figure 10.3). Then, we can
consider two subsets S1 and S2 , where S1 contains half the nodes, and S2 contains two nodes in the same
half. It’s easy to see that E(S1 , S̄1 ) = N2 · 2 > E(S2 , S̄2 ) = N2 . However, the conductance of S1 is
N N
E(S1 ,S̄1 ) 2 ·2 4 E(S2 ,S̄2 ) 1
φ(S1 ) = Vol(S1 ) = N
·( N N ≈ N, whereas the conductance of S2 is φ(S2 ) = Vol(S2 ) = 2
N +2 ≈ 2.
2 2 −1)+ 2 ·2
Thus, when n is large, S1 is a sparser cut than S2 under φ(·).
𝑆"
𝑆!
Our goal is to find an approximate sparsest cut Ŝ such that φ(Ŝ) ≈ φ(G).2 Our main technique is
eigendecomposition or spectral clustering, though in the literature much more advanced and better algorithms
have been proposed, e.g., the famous ARV algorithm [Arora et al., 2009]. Let di = Vol({i}) be the degree of
node i, and let D = diag({di }) be the diagonal matrix that contains the degrees di as entries. Furthermore,
let
1 1
Ā = D− 2 AD 2 (10.82)
be the normalized adjacency matrix. This is equivalent to normalizing each element Aij of the adjacency
A
matrix by √ 1 (i.e., Āij = √ ij ). In most cases, it suffices to starting with considering G as a regular
di dj di dj
graph (whose degrees are all the same), because the proof for regular graph can oftentimes extend to general
graph easily. Assuming G is a κ-regular graph, i.e. di = κ; then, this normalization simply scales A by κ1 .
Let L = I − Ā be the Laplacian matrix. Note that any eigenvector of L is also an eigenvector of Ā.
Suppose L has eigenvalues λ1 ≤ . . . ≤ λn with corresponding eigenvectors u1 . . . un , then Ā has eigenvalues
1 − λ1 ≥ . . . ≥ 1 − λn with the same eigenvectors.
The following famous Cheeger’s inequality relates the eigendecompositions to the graph partitioning.
Theorem 10.6 (Cheeger’s inequality). The second eigenvalue of L, namely λ2 , is related to the sparsest cut
φ(G) as follows:
λ2 p
≤ φ(G) ≤ 2λ2 . (10.83)
2
√ p
Moreover, we can find Ŝ such that φ(Ŝ) ≤ 2λ2 ≤ 2 φ(G) efficiently by rounding the second eigenvector.
Suppose u2 = [β1 · · · βn ]> ∈ Rn is the second eigenvector of L. Then wep can choose a threshold τ = βi and
consider Ŝi = {j ∈ [n] : βj ≤ τ }. At least one such Ŝi satisfies φ(Ŝi ) ≤ 2 φ(G).
2 Finding the exact sparsest cut is a NP-hard problem.
138
Note that this can be viewed as a general but weaker version of the theorem that we proved for stochastic
block model. There is no randomized assumption; the conclusion is weaker than those for SBM; also the
rounding algorithm to recover the cluster is also more complicated—one has to try multiple thresholding
instead of using threshold 1/2.
We will refer the readers to Chung [2007] for the proof of the theorem. Here below we give a few basic
lemmas that help build up intuitions on why eigendecompositions relate to graph decomposition.
First, one might wonder why this algorithm uses the second eigenvector of Ā, but not the first eigenvector.
As we have seen in the SBM case, the first eigenvector captures the background in some sense. Here for
general graph, we see the same phenomenon. The top eigenvector is generally not that interesting as it only
captures the “background density” of the graph. For instance, when A is κ-regular, ~1 is the top eigenvector
of A and is thus also the top eigenvector of Ā = κ1 · A. More generally, we have the following lemma:
√ √
Lemma 10.7. The top eigenvector of Ā (respectively, the smallest eigenvector of L) is u1 = [ d1 · · · dn ]> .
Proof.
X √
(Ā · u1 )i = Āij dj (10.84)
j
X Aij √
= √ p dj (10.85)
j
di dj
1 X
=√ Aij (10.86)
di j
di √
= √ = di . (10.87)
di
To understand why the eigenvectors of the Laplacian are related to the sparsest cut, we examine the
quadratic form the Laplacian. In particular, we have the following lemma:
Lemma 10.8. Let v ∈ RN be a vector, L is the graph Laplacian. Then,
!2
1 X v v
>
v Lv = √ i − pj . (10.88)
2 di dj
(i,j)∈E
Proof.
139
If G is κ-regular, then this becomes v > Lv = 1
− vj )2 . Furthermore, suppose v ∈ {0, 1} is a
P
2κ (i,j)∈E (vi
binary vector with support S = supp(v). Then,
1 X 1
(vi − vj )2 = E(S, S̄) (10.94)
2κ κ
(i,j)∈E
1
= E(supp(v), supp(v)) . (10.95)
κ
If | supp(v)| ≤ n/2, implying Vol(S) ≤ Vol(V )/2, then
1
v > Lv κ E(S, S̄)
= = φ(S) . (10.96)
kvk22 1
κ Vol(S)
>
The term vkvkLv
2 is also called the Rayleigh quotient. This result nicely connects the abstract linear algebraic
2
approach to the sparsest cut approach. Note that we only achieve an approximate sparsest cut because when
we compute eigenvectors, we minimize the Rayleigh quotient without any constraints on v. By contrast, the
sparsest cut minimizes the Rayleigh quotient subject to the constraint that v ∈ {0, 1}n . Proving Cheeger
inequality essentially involves controlling the difference caused by real v vs binary v. We omit the proof of
Cheeger’s inequality because it’s beyond the scope of this notes.
represents a distance between two data points. Then, we compute the first k eigenvectors of the Laplacian
L and arrange them into the columns of a matrix:
| |
u1 · · · uk ∈ Rn×k . (10.98)
| |
The i-th row of this matrix (which we denote by vi ) is then a k-dimensional embedding of the i-th example.
Note that this is usually a much lower-dimensional representation than the raw data. Finally, we can use
k-means to cluster the embeddings {v1 , . . . , vn }.
In high dimensions, the issue with Ng et al. [2001]’s approach is that the training data points are all far
away from each other. The Euclidean distance between points becomes meaningless, and so our definition
of G does not make sense.
How do we solve this issue? HaoChen et al. [2021] propose a solution. They consider an infinite weighted
graph G(V, w), where w are the edge weights, and V = X ⊆ Rn is the set of all possible data points. We
define w(x, x0 ) to be large only if x and x0 are very close in `2 distance. Now, the graph is more meaningful,
because only data points that are small perturbations of each other have high connection weights. However,
we do not have explicit access to this graph, and its eigenvectors are infinite-dimensional.
Now, suppose we have some eigenvector u = (ux )x∈X . Rather than explicitly represent ux , we represent
ux by fθ (x) where fθ is some parameterized model. Now, the challenge is to find θ such that [fθ (x)]x∈X is the
second smallest eigenvector of Laplacian of G. It turns out solving this problem gives a form of contrastive
learning, which we will discuss in Section 10.3.2.
140
10.3 Self-supervised Learning
10.3.1 Pretraining / self-supervised learning / foundation model basics
Self-supervised learning is widely used for pretraining modern models. These large pretrained models are also
called foundation models [Bommasani et al., 2021]. One simplified setup / workflow contains the following
two stages:
Pretraining on unlabeled, massive data. Let {x(1) , · · · , x(n) } be a dataset where x(i) ∈ Rd is sampled
from some pretraining data distribution x(i) ∼ Ppre . The goal is to learn a pretrained model fθ : Rd → Rk ,
where k is the dimension of features / representations / embeddings, and θ is the model parameter. This
model can be learned by minimizing certain pretrained loss function: L̂pre (θ) = L̂pre (x(1) , · · · , x(n) ; θ), which
Pn
sometimes is of the form L̂pre (θ) = n1 i=1 `pre (x(i) ; θ). We use θ̂ = argminθ L̂pre (θ) to denote the parameter
learned during pretraining.
Adaptation. During adaptation, we have access to a set of labeled downstream task examples
(1) (1) (ntask ) (ntask )
{(xtask , ytask ), · · · , (xtask , ytask )}, where usually ntask n. One popular adapataion method is lin-
ear probe, which uses fθ̂ (x) as features / representations / embeddings, and train a linear classifier
on downstream tasks. Concretely, the prediction on data x is w> fθ̂ (x), where w is the linear head
1
Pntask (i) > (i)
learned from minw L̂task (w) = ntask i=1 `task (ytask , w fθ̂ (xtask )). Another popular adaptation method
is finetuning, which also updates the parameter θ. Concretely, one initializes from θ = θ̂, and solve
1
Pntask (i) > (i)
minθ,w L̂task (w, θ) = ntask i=1 `task (ytask , w fθ (xtask )).
Why does pretraining on unlabeled data with an unsupervised (self-supervised) loss help a wide range
of downstream prediction tasks? There are many open questions to be answered in this field. For instance,
we may ask: (1) how pretraining helps label efficiency of downstream tasks, (2) why pretraining can give
“universal” representations, and (3) why does pretraining provide robustness to distribution shift.
augmentation
Figure 10.4: A demonstration of contrastive learning. Representations of augmentations of the same image
are pulled close, whereas augmentations of random images are pushed far away.
Given a natural image x̄ ∈ X̄, one can generate augmentations by random cropping, flipping, adding
Gaussian noise or color transformation. We use x ∼ A(·|x̄) to denote the conditional distribution of aug-
mentations given the natural image. For simplicity, we consider the case where Gaussian blurring is the
141
augmentation, we have
A
Notice that − log A+B is decreasing in A but increasing in B, the loss above encourages fθ (x(i) )> fθ (x(i)+ )
(i) >
to be large, while fθ (x ) fθ (x(j)+ ) to be small.
In the rest of this section, we consider a variant of contrastive loss, proposed in [HaoChen et al., 2021]:
2
L(θ) = −2 +
E fθ (x)> fθ (x+ ) + 0
E fθ (x)> fθ (x0 ) . (10.101)
(x,x )∼positive (x,x )∼random
This contrastive loss follows the same design principle as other contrastive losses in the literature: suppose
all representations have the same norm, then the first term encourages the representations of a positive pair
to be closer while the second term encourages the representations of a random pair to be orthogonal to each
other (hence far away). [HaoChen et al., 2021] show that the loss function, though inspired by theoretical
derivations, still perform competitively, nearly matching the SOTA methods.
We can also define the empirical loss on a set of tuples (x(i) , x+(i) , x0(i) ) sampled i.i.d. as follows: x̄ ∼
P̄X̄ , x(i) ∼ A(·|x̄(i) ), x+(i) ∼ A(·|x̄(i) ), x̄0 ∼ P̄X̄ , x0(i) ∼ A(·|x̄0(i) ). The empirical loss is defined as
Xn 2
L̂(θ) = −2fθ (x(i) )> fθ (x+(i) ) + fθ (x(i) )> fθ (x0(i) ) . (10.102)
i=1
We learn the linear head ŵ = argminw L̂task (w, θ̂), and the evaluate its performance on downstream
population data:
1 2
Ltask (ŵ, θ̂) = E ytask − ŵ> fθ (xtask ) . (10.104)
2
Analysis pipeline. We give a summary of our analysis pipeline below. The key takeaway is that we only
have to focus on the population distribution case (step 3).
0. Assume expressivity, i.e., assuming ∃θ∗ such that L(θ∗ ) is sufficiently small (the details will be quan-
tified later).
142
1. For large enough n (e.g., n > Comp(F)/2 where F = {fθ } is the function class, Comp(·) is some
measure of complexity, is the target error), show that L̂(θ) = L(θ) ± .
2. Let θ̂ be the parameter learned on empirical data. Since L̂(θ̂) = minθ L̂(θ) ≤ L̂(θ∗ ) ≤ L(θ∗ ) + , we
have
L̂(θ̂) ≤ ⇒ L(θ̂) ≤ 2 (10.105)
3. Key step: (infinite data case) We will prove a theorem (Theorem 10.12 below as a simplified version,
or Theorem 3.8 of HaoChen et al. [2021]) that shows if L(θ̂) ≤ 2, then there exists w such that
Ltask (θ, w) ≤ δ, where δ is a function of and data distribution P̄ .
4. When we have enough downstream data ntask ≥ poly(k, 10 ), for any θ, with high probability we have
(via uniform convergence) that for any w, L̂task (w, θ) ≈ Ltask (w, θ) ± 0 .
5. Using the results in step 3 and step 4, we have L̂task (ŵ, θ̂) = minw L̂task (w, θ̂) ≤ minw Ltask (w, θ̂) + 0 ≤
δ + 0 . Thus, the final evaluation loss on the downstream task is Ltask (ŵ, θ̂) ≤ L̂task (ŵ, θ̂) + 0 ≤ δ + 20 .
Key step: the case with population pretraining and downstream data. We will now dive into the
analysis of step 3, as all the other steps are from standard concentration inequalities. Recall that
2
L(θ) = −2 E fθ (x)> fθ (x+ ) + E 0 fθ (x)> fθ (x0 ) . (10.106)
(x,x+ ) (x,x )
As expected, the analysis requires structural assumptions on the data. In particular, we will use the
graph-theoretic language to describe the assumptions on population data. Let X be the set of all augmented
data, P be the distribution of augmented data x ∼ A(·|x̄) where x̄ ∼ P̄X̄ . Let p(x, x+ ) be the probability
density of positive pair (x, x+ ). We define a graph G(V, w) where vertex set V = X and edge weights
w(x, z) = p(x, z) for any (x, z) ∈ X × X. In general, this graph may be infinitely large. To simplify math
and avoid integrals, we assume |X| = N where N is the number of all possible augmented images (which
can be infinite or exponential in dimension).
The degree of node x is p(x) = z∈X p(x, z). Let A ∈ RN ×N be the adjacency matrix of this graph
P
defined as Ax,z = p(x, z), and let Ā ber the normalized adjacency matrix such that Āx,z = √p(x,z) .
p(x)p(z)
The following lemma shows that contrastive learning is closely related to the eigendecomposition of Ā.
2
Lemma 10.9. Let L(f ) = −2 E(x,x+ ) f (x)> f (x+ ) + E(x,x0 ) f (x)> f (x0 ) . Suppose X = {x1 , · · · , xN }, let
matrix 1
p(x1 ) 2 f (x1 )>
F = ..
. (10.107)
.
1
p(xN ) 2 f (xN )>
Then,
L(f ) = kĀ − F F > k2F + const. (10.108)
Hence, minimizing L(f ) w.r.t the variable f is equivalent to eigendecomposition of Ā.
Proof. Directly expanding the Frobenius norm kĀ − F F > k2F as a sum over entries, we have
!2
> 2
X p(x, z) >
p p
kĀ − F F kF = p p − f (x) f (z) p(x) p(z) (10.109)
x,z∈X
p(x) p(z)
X X 2
= const − 2 p(x, z)f (x)> f (z) + p(x)p(z) f (x)> f (z) (10.110)
x,z∈X x,z∈X
2
= const − 2 E f (x)> f (x+ ) + E f (x)> f (x0 ) , (10.111)
(x,x+ )∼positive (x,x0 )∼random
143
where the last equation uses the fact that p(x, z) and p(x)p(z) are the probability densities of (x, z) being a
positive pair and a random pair, respectively.
1
Standard matrix decomposition results tell us that the minimizer of kĀ−F F > k2F satisfies F = U ·diag(γi2 ),
where γi ’s are the eigenvalues of Ā and U ∈ RN ×k contains the top k eigenvectors of Ā as its columns.
Suppose we use v1 , · · · , vN to represent the rows of U , i.e.,
>
v1
..
U = . . (10.112)
>
vN
1 1
Then we know f (xj ) = p(xj )− 2 · diag(γi2 ) · vj is the minimizer of the contrastive loss.
k
One interesting thing is hthat f (xi ) has the same
1
i separability as vi . This is because1 for any vector b ∈ R ,
we have 1 b vi > 0 = 1 b> diag(γi 2 )f (x) > 0 , suggesting linear head diag(γi 2 )b applied on feature
> − −
f (xi ) would achieve the same classification accuracy as v applied on vi . Thus, it suffices to analyze vi ’s
downstream accuracy under linear head.
Since vi is exactly the feature used by the classic spectral clustering algorithm, we may ask when spectral
clustering produces good features. As discussed in Section 10.2, spectral clustering is good at graph partition-
ing in stochastic block models. In this section, we aim to find more general settings where
P spectral clustering
produces good features. For simplicity, let’s consider a regular graph where w(x) = x0 ∈V w(x, x0 ) = κ.3
The following lemma shows that suppose the graph roughly contains two clusters, then the spectral
clustering features can be used to accurately predict which cluster a node belongs to.
N
Lemma 10.10. Suppose P the graph G can be partitioned into 2 clusters S1 , S2 with size |S1 | = |S2 | = 2 ,
such that E(S1 , S2 ) = x∈S1 ,z∈S2 w(x, z) ≤ ακN . Furthermore, suppose G cannot be partitioned well into
3 clusters in the sense that for all partition T1 , T2 , T3 , we have max{φ(T1 ), φ(T2 ), φ(T3 )} ≥ ρ. (Figure 10.5
gives a demonstration of these assumptions.) Then, let g = 1(S1 ) ∈ RN (i.e., gi = 1 if i ∈ S1 ), and k ≥ 6,
𝐸 𝑇! , 𝑇"
≥𝜌 ≤𝛼
𝑉𝑜𝑙(𝑇! )
𝑇#
𝑇!
𝑇"
Figure 10.5: A demonstration of the assumptions in Lemma 10.10. The left half and right half of the graph
can be chosen as S1 and S2 , since there’s at most α proportion of edges between them. Sets T1 , T2 , T3 form
a 3-way partition where φ(T1 ) ≥ ρ.
144
The above lemma essentially says that hvx , bi ≈ gx for all data x ∈ X, where vx is the x-th row of U .
Before proving the above lemma, we first introduce the following higher-order Cheeger inequality, which
shows that when the graph cannot be partitioned well into 3 clusters, the 6-th smalled eigenvalue of the
Laplacian cannot be too small.
Lemma 10.11 (Proposition 1.2 in [Louis and Makarychev, 2014]). Let G = (V, w) be a weight graph.
Suppose the graph cannot be partitioned into 3 sets S1 , S2 , S3 such that max{φ(S1 ), φ(S2 ), φ(S3 )} ≤ ρ. Then,
we have
λ 6 & ρ2 .
We can combine Lemma 10.9 and Lemma 10.10 to prove the following theorem, which shows that when
the graph roughly contains 2 clusters, the feature learned from contrastive learning can be used to predict
the cluster membership accurately.
2
Theorem 10.12. Let L(f ) = −2 E(x,x+ ) f (x)> f (x+ ) + E(x,x0 ) f (x)> f (x0 ) , and f ∗ : X → Rk is a
minimizer of L(f ) for k ≥ 6. Suppose P the graph G can be partitioned into 2 clusters S1 , S2 with size
|S1 | = |S2 | = N2 , such that E(S1 , S2 ) = x∈S1 ,z∈S2 w(x, z) ≤ ακN . Furthermore, suppose G cannot be parti-
tioned well into 3 clusters in the sense that for all partition T1 , T2 , T3 , we have max{φ(T1 ), φ(T2 ), φ(T3 )} ≥ ρ.
Let y(xi ) = 1(xi ∈ S1 ) (i.e., y(xi ) = 1 if xi ∈ S1 , otherwise y(xi ) = 0). Then, there exists linear classifier
b ∈ Rk such that
1 X 2 α
f (xi )> b − y(xi ) . 2 . (10.118)
N ρ
i∈[N ]
Proof. Let U ∈ RN ×k contains the top k eigenvectors of Ā as its columns. By Lemma 10.10, we know there
exists some b̂ ∈ Rk such that kU b̂ − gk22 . Nρ2α , where g ∈ RN such that gi = y(xi ). Let v1 , · · · , vN be the
1 1 1 1
rows of U . According to Lemma 10.9, we know that f (xi ) = p(xi )− 2 · diag(γj2 ) · vi = κ− 2 · diag(γj2 ) · vi ,
1 1 1 1
where γj is the j-th largest eigenvalue of Ā, and diag(γj2 ) is a diagonal matrix containing γ12 , γ22 , · · · , γk2 as
145
√ −1
its entries. Thus, if we let b = κ · diag(γj 2 ) · b̂, we would have
X X Nα
(f (xi )> b − y(xi ))2 = (vi> b̂ − gi )2 = kU b̂ − gk22 . . (10.119)
ρ2
i∈[N ] i∈[N ]
146
Chapter 11
Online learning
In this chapter, we switch gears and talk about online learning and online convex optimization. The main
idea driving online learning is that we move away from the assumption that the training and test data are
both drawn i.i.d from some fixed distribution. In the online setting, training data and test data come to
the user in an interwoven manner, and data can be generated adversarially. We will describe how online
learning can be reduced to online convex optimization, some important algorithms, as well as applications
of these algorithms to some illustrative examples.
147
The game takes place over T rounds or time steps. At each step t = 1, . . . , T , the learner receives an
input xt ∈ X from the environment and makes a prediction ŷ ∈ Y in response. The learner then receives
the label yt from the environment and suffers some loss. This procedure is outlined in Algorithm 1 and is
illustrated in Figure 11.1.
Later, we will see that the manner in which nature generates (xt , yt ) leads to different types of online
learning. In the most adversarial setting of online learning, it is possible that the “true label” yt is not
generated at the same time as xt . The environment could generate the label yt depending on the prediction
ŷt made by the learner. We can also see that Algorithm 1 is a very general framework as there are very few
constraints on how xt and yt are generated.
However, as the environment can be adversarial, the task itself might be inherently hard and even the
best possible learner fails to achieve a small loss. Hence, instead instead of using the cumulative loss for
a learner by itself, we compare its performance against a suitable baseline, the “best model in hindsight”.
Assume that our learner comes from a set of hypotheses H. Let us choose the hypothesis h ∈ H that
minimizes the cumulative loss, i.e.
T
X
h? = argmin `(yt , h(xt )). (11.2)
h∈H t=1
148
Note here that in minimizing the cumulative loss, the learner gets to see all the data points (xt , yt ) at
once. The cumulative loss of h? is the best we can ever hope to do, and so it would be better to compare
the cumulative loss of the learner against it. (This approach is analogous to “excess risk”, which tells how
far the current model is away from the best we could hope for.) This measurement is denoted as regret, and
is formally defined as:
" T # " T
#
∆
X X
Regret = `(yt , ŷt ) − min `(yt , h(xt )) (11.3)
h∈H
t=1 t=1
| {z }
best loss in hindsight
Using this definition, if the best model in hindsight performs well, then the learner has more responsibility
to learn to predict well in order to match up the performance of the baseline.
so in this example, one has regret ≤ log |H|, which is a non-trivial bound when H is finite.
As one can see in the example, the realizable case usually indicates that the problem is not too far
out of reach. Indeed, for finite hypothesis classes and linear models, the realizable case is considered to be
straightforward to solve. This is perhaps why most of the past literature has focused on non-realizable cases.
However, the realizable case is still an interesting problem and perhaps a very good starting point when the
model class is beyond linear models and when the loss function is no longer convex, because the xt ’s are not
i.i.d. as they are in classical supervised learning. Hence, standard statistical learning theory does not apply,
and there is still research to be done here.
In the rest of the chapter, we will only focus on the convex loss case, where we reduce online learning to
online convex optimization.
149
11.2 Online (convex) optimization (OCO)
Online convex optimization (OCO) is a particularly useful tool to get results for online learning. Many online
learning problems (and many other types of problems!) can be reduced to OCO problems, which allow them
to be solved and analyzed algorithmically. Algorithm 2 describes the OCO problem, which is more general
than the online learning problem. (Note: Online optimization (OO) refers to Algorithm 2 except that the
ft ’s need not be convex. However, due to the difficulty in non-convex function optimization, most research
has focused on OCO.)
Essentially the learner is trying to minimize the function ft at each step. As with online learning, one
evaluates the performance of learner in online optimization setting using the regret:
T
X T
X
Regret = ft (wt ) − min ft (w) . (11.5)
w∈Ω
t=1 t=1
| {z }
best action in hindsight
At some level, OCO seems like an impossible task, since we are trying to minimize a function ft that we
only get to see after we have made our prediction! This is certainly the case for t = 1. However, as time
goes on, we see more and more functions and, if future functions are somewhat related to past functions,
we have more information to make better predictions. (And if the future functions are completely unrelated
or contradictory to past functions, then the best action in hindsight would also be bad and therefore our
algorithm does not have to do much.)
Depending on how the environment generates the functions f_t, we can distinguish several settings:
• Stochastic setting: f_1, \ldots, f_T are i.i.d. samples from some distribution P. This corresponds to (x_t, y_t)
being i.i.d. in online learning. Under this setting, the environment is not adversarial.
• Oblivious setting: f_1, \ldots, f_T are chosen arbitrarily, but before the game starts. This corresponds to
(x_t, y_t) being chosen before the game starts. In this setting, the environment can be adversarial but
cannot be adaptive: it can choose these functions based on the learner's algorithm, but not on the actual
actions taken if the learner's algorithm is randomized. (This is the setting that we focus on in this course.)
• Non-oblivious/adaptive setting: for all t, f_t can depend on the learner's actions w_1, \ldots, w_t. Under this
setting, the environment can be adversarial and adaptive. This is the most challenging setting because
the environment knows not only the strategy of the learner but also the exact choices the learner has made.
(Note, however, that if the learner is deterministic, the environment has no more power here than in the
oblivious setting: the oblivious adversary can simulate the game before it starts and choose the most
adversarial input accordingly.)
11.3 Reducing online learning to online optimization
There is a natural way to reduce the online learning problem to online optimization, with respect to a specific
type of model h_w parametrized by w ∈ Ω. Recall that in the online learning problem, the learner predicts ŷ_t
upon receiving x_t. If the learner has access to an oracle that solves the online optimization problem, it can
consult the oracle to obtain w_t, the model parameter at time t, and then predict
ŷ_t = h_{w_t}(x_t).
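To make the reduction concrete, here is a minimal Python sketch (our own illustration, not part of the original algorithm boxes): the interface names get_action/feed_gradient, the linear model, and the squared loss are illustrative assumptions. The learner obtains w_t from an OCO procedure, predicts h_{w_t}(x_t), and then feeds back information about the loss f_t(w) = ℓ(y_t, h_w(x_t)).

import numpy as np

class OGDOracle:
    """A stand-in OCO oracle: plain online gradient descent on the gradients
    that the caller feeds back. (The interface is ours, for illustration.)"""
    def __init__(self, dim, eta=0.1):
        self.w = np.zeros(dim)
        self.eta = eta

    def get_action(self):
        return self.w.copy()

    def feed_gradient(self, grad):
        self.w -= self.eta * grad

def online_learning_via_oco(stream, oracle):
    """Reduction sketch: predict with the linear model h_w(x) = <w, x>, then
    feed back the gradient of f_t(w) = (y_t - <w, x_t>)^2 evaluated at w_t."""
    total_loss = 0.0
    for x_t, y_t in stream:
        w_t = oracle.get_action()           # consult the OCO oracle for w_t
        y_hat = float(w_t @ x_t)            # predict y_hat_t = h_{w_t}(x_t)
        total_loss += (y_t - y_hat) ** 2    # suffer the loss l(y_t, y_hat_t)
        grad = -2.0 * (y_t - y_hat) * x_t   # gradient of f_t at w_t
        oracle.feed_gradient(grad)          # reveal f_t (via its gradient)
    return total_loss

# Usage sketch on a synthetic stream:
rng = np.random.default_rng(0)
xs = rng.normal(size=(200, 3))
ys = xs @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.normal(size=200)
print(online_learning_via_oco(zip(xs, ys), OGDOracle(dim=3)))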
In the next two subsections, we give two examples of how an online learning problem can be reduced to
an OCO problem.
binary, i.e. ŷ ∈ {0, 1}, and suppose the loss function is 0-1 loss. (The problem can easily be generalized to
more general predictions and losses.) The problem is outlined in Algorithm 5.
We want to design a method that chooses i_t at each step (line 3 in Algorithm 5) so as to minimize the regret:
\mathrm{Regret} \triangleq \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(i_t)\right] - \underbrace{\min_{i \in [N]} \sum_{t=1}^{T} \ell_t(i)}_{\text{the best expert in hindsight}},   (11.6)
where the expectation is over the randomness of the i_t's (the learner may randomize its choice of expert).
To make the expert problem amenable to a reduction to OCO, we introduce the idea of a continuous action
space. Instead of choosing i_t from Ω = [N], the learner chooses a distribution p_t from the N-dimensional
simplex \Delta(N) = \{p \in \mathbb{R}^N : \|p\|_1 = 1,\ p \ge 0\}, and then samples i_t \sim p_t. With this formulation,
instead of selecting a particular expert i_t to follow, the learner adjusts its belief p_t and samples from this
distribution to decide which expert to follow. Algorithm 6 outlines this procedure. Note that the loss incurred
is the expected loss \mathbb{E}_{i\sim p_t}[\ell_t(i)] = \langle p_t, \ell_t\rangle instead of the sampled \ell_t(i_t).
With the continuous action space, it is easy to reduce the expert problem to an OCO problem: see Algorithm 7.
(The problem is convex since the loss function f_t(p) = \langle p, \ell_t\rangle is linear, hence convex, in p, and the
parameter space \Delta(N) is convex.)
In this setting, one can rewrite the regret as:
\mathrm{Regret} = \sum_{t=1}^{T} \langle p_t, \ell_t\rangle - \min_{i\in[N]} \sum_{t=1}^{T} \ell_t(i)   (11.7)
= \sum_{t=1}^{T} \langle p_t, \ell_t\rangle - \min_{p\in\Delta(N)} \sum_{t=1}^{T} \langle p, \ell_t\rangle   (11.8)
= \sum_{t=1}^{T} f_t(p_t) - \min_{p\in\Delta(N)} \sum_{t=1}^{T} f_t(p),   (11.9)
where f_t(p) = \langle p, \ell_t\rangle; the equality of (11.7) and (11.8) uses the fact that a linear function over the
simplex \Delta(N) is minimized at one of its vertices, i.e., at a point mass on a single expert.
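As a quick numerical sanity check (ours, not part of the original notes), the following sketch verifies that minimizing ⟨p, Σ_t ℓ_t⟩ over the simplex coincides with the best single expert, and computes the regret (11.9) for an arbitrary sequence of beliefs p_t:

import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 5
losses = rng.random((T, N))              # l_t(i), rows indexed by t
p_ts = rng.dirichlet(np.ones(N), T)      # an arbitrary sequence of beliefs p_t

cum = losses.sum(axis=0)                 # cumulative loss vector sum_t l_t
best_expert_loss = cum.min()             # min_i sum_t l_t(i), as in (11.7)

# A linear function <p, cum> over Delta(N) is minimized at a vertex, so no
# distribution can beat the best single expert:
random_ps = rng.dirichlet(np.ones(N), 10000)
assert (random_ps @ cum >= best_expert_loss - 1e-12).all()

regret = np.sum(p_ts * losses) - best_expert_loss   # (11.9): sum_t f_t(p_t) - best in hindsight
print(regret)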
Theorem 11.3 (Relationship between excess risk and regret). Assume that \ell((x, y), w) is convex in w. Suppose we
run an online learning algorithm on the dataset \{(x_i, y_i)\}_{i=1}^{T} and obtain a sequence of models w_1, \ldots, w_T
with regret R_T. Let \bar{w} = \frac{1}{T}\sum_{i=1}^{T} w_i. Then the excess risk of \bar{w} can be bounded above:
L(\bar{w}) - L(w^\star) \le \frac{R_T}{T} + \tilde{O}\!\left(\frac{1}{\sqrt{T}}\right),   (11.11)
where w^\star = \argmin_{w\in\Omega} L(w).
Here are some intuitive interpretations of the theorem (a sketch of the standard proof idea follows this list):
• If R_T = O(T), then we obtain a non-trivial bound (and if R_T = o(T), the excess-risk bound in fact vanishes
as T grows). Otherwise, the bound in (11.11) increases with T and does not provide any useful information.
• If the batch problem has a 1/\sqrt{T} generalization bound, then the best one can hope for in online
learning is R_T = O(\sqrt{T}).
• If the batch problem has a 1/T generalization bound, one can hope for O(1) regret (or \tilde{O}(1) regret in
some cases).
• We often have O(1/\sqrt{T}) excess risk in supervised learning problems; hence it is reasonable to expect O(\sqrt{T})
regret in online learning problems.
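For intuition, here is a brief sketch of the standard argument behind Theorem 11.3 (our paraphrase, assuming bounded losses so that martingale concentration applies):

L(\bar{w}) \le \frac{1}{T}\sum_{t=1}^{T} L(w_t)   (Jensen's inequality, since L is convex)
\approx \frac{1}{T}\sum_{t=1}^{T} \ell((x_t, y_t), w_t)   (w_t is independent of (x_t, y_t); concentration gives an \tilde{O}(1/\sqrt{T}) error)
= \frac{1}{T}\left[\min_{w\in\Omega}\sum_{t=1}^{T}\ell((x_t, y_t), w) + R_T\right]   (definition of regret)
\le \frac{1}{T}\sum_{t=1}^{T}\ell((x_t, y_t), w^\star) + \frac{R_T}{T} \approx L(w^\star) + \frac{R_T}{T}   (concentration again, for the fixed w^\star).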
11.5 Follow-the-Leader (FTL) algorithm
In this section, we analyze an algorithm called “Follow-the-Leader” (FTL) for OCO, which is intuitive but
fails to perform well in many cases.
The FTL algorithm behaves as its name suggests: it always selects the action wt such that it minimizes
the historical loss the learner has seen so far, i.e.
w_t = \argmin_{w\in\Omega} \sum_{i=1}^{t-1} f_i(w).   (11.12)
We now demonstrate how the FTL algorithm can fail for the expert problem. In the expert problem,
f_t(p) = \langle p, \ell_t\rangle, so
p_t = \argmin_{p\in\Delta(N)} \sum_{i=1}^{t-1} f_i(p)   (11.13)
= \argmin_{p\in\Delta(N)} \sum_{i=1}^{t-1} \langle\ell_i, p\rangle   (11.14)
= \argmin_{p\in\Delta(N)} \left\langle\sum_{i=1}^{t-1}\ell_i,\ p\right\rangle.   (11.15)
The minimizer p ∈ ∆(N) is a point mass, placed on the coordinate of \sum_{i=1}^{t-1}\ell_i with the smallest
cumulative loss. In other words, FTL follows the expert
i_t = \argmin_{j\in[N]} \sum_{i=1}^{t-1} \ell_i(j),   (11.16)
and suffers loss \ell_t(i_t) at round t.
Now, consider the following example: suppose we have only two experts. Suppose expert 1 makes perfect
predictions on even days while expert 2 makes perfect predictions on odd days. Assume also that the FTL
algorithm chooses expert 1 to break ties (this is not an important point but makes the exposition simpler).
In this setting, the FTL algorithm always selects the wrong expert to follow. A few rounds of simulation of
this example are shown in Table 11.1.
The best expert in hindsight has a loss of T/2 (choosing either expert all the time incurs this loss), so
the regret of the FTL algorithm is T − T/2 = T/2 = Θ(T). The main reason for FTL's failure is that it is
a deterministic algorithm driven by an extreme update, with no consideration of a potential shift in the
environment (it always selects the best expert based on the past, ignoring what the next f_t might be). Knowing
its deterministic strategy, the environment can easily play adversarially against it. To perform better in a
problem like this, we need some randomness to hedge risk. A small simulation of this failure mode is sketched below.
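Here is a minimal simulation sketch (ours) of the alternating-experts example, with ties broken toward expert 1 as above; it reproduces the Θ(T) regret:

import numpy as np

T, N = 100, 2                 # horizon and number of experts
losses = np.zeros((T, N))     # 0-1 losses; expert 1 is index 0, expert 2 is index 1
losses[0::2, 0] = 1           # expert 1 errs on odd days (rows 0, 2, ...)
losses[1::2, 1] = 1           # expert 2 errs on even days (rows 1, 3, ...)

cum = np.zeros(N)             # cumulative losses observed so far
ftl_loss = 0.0
for t in range(T):
    i_t = int(np.argmin(cum))        # FTL: follow the leader; ties break toward expert 1
    ftl_loss += losses[t, i_t]
    cum += losses[t]

best_in_hindsight = losses.sum(axis=0).min()   # = T/2
print(ftl_loss - best_in_hindsight)            # regret = T/2 = Theta(T)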
defined for the FTL algorithm. Note that this is an “illegal” choice for the action because wt+1 depends on
ft : in online convex optimization, the action at time t is required to be chosen before seeing the function ft .
Nevertheless, we can still gain some useful insights by analyzing this procedure. In particular, the following
lemma shows that the BTL strategy is worth emulating because it achieves very good regret.
Lemma 11.4. The BTL strategy has non-positive regret. That is, if w_t is defined as in the FTL algorithm,
then
\mathrm{BTL\ regret} = \sum_{t=1}^{T} f_t(w_{t+1}) - \min_{w\in\Omega}\sum_{t=1}^{T} f_t(w) \le 0.   (11.17)
A useful consequence of this lemma is a regret bound for the FTL strategy.
Lemma 11.5 (FTL regret bound). Again, let w_t be as in the FTL algorithm. The FTL strategy has the
regret guarantee
\mathrm{FTL\ regret} = \sum_{t=1}^{T} f_t(w_t) - \min_{w\in\Omega}\sum_{t=1}^{T} f_t(w) \le \sum_{t=1}^{T}\left[f_t(w_t) - f_t(w_{t+1})\right].   (11.22)
Proof.
\mathrm{FTL\ regret} = \sum_{t=1}^{T} f_t(w_t) - \min_{w\in\Omega}\sum_{t=1}^{T} f_t(w)   (11.23)
= \sum_{t=1}^{T} f_t(w_{t+1}) - \min_{w\in\Omega}\sum_{t=1}^{T} f_t(w) + \sum_{t=1}^{T}\left[f_t(w_t) - f_t(w_{t+1})\right]   (11.24)
\le 0 + \sum_{t=1}^{T}\left[f_t(w_t) - f_t(w_{t+1})\right],   (11.25)
where the last inequality follows from Lemma 11.4.
Lemma 11.5 tells us that if the terms f_t(w_t) − f_t(w_{t+1}) are small (e.g., w_t does not change much from round
to round), then the FTL strategy can have small regret. It suggests that the player should adopt a stable
policy, i.e., one where the terms f_t(w_t) − f_t(w_{t+1}) are small. It turns out that following this intuition
leads to a strategy that improves the regret all the way to O(\sqrt{T}) in certain cases.
11.7 Follow-the-regularized-leader (FTRL) strategy
Now we discuss an OCO strategy that aims to improve the stability of FTL by controlling the differences f_t(w_t) −
f_t(w_{t+1}). To describe the method, we will first need a preliminary definition.
Definition 11.6. We say that a differentiable function φ : Ω → R is α-strongly convex with respect to the
norm ‖·‖ on Ω if we have
\phi(x) \ge \phi(y) + \langle\nabla\phi(y),\, x - y\rangle + \frac{\alpha}{2}\|x - y\|^2   (11.26)
for any x, y ∈ Ω.
Remark 11.7. If φ is convex, then we know that φ(x) has a linear lower bound φ(y) + ⟨∇φ(y), x − y⟩. Being
α-strongly convex means that φ(x) has a quadratic lower bound, the RHS of (11.26). This quadratic lower
bound is very useful in proving theorems in optimization.
Remark 11.8. If ∇²φ(y) ⪰ αI for all y, then φ is α-strongly convex with respect to the ℓ₂ norm. This follows
directly from writing the second-order Taylor expansion of φ around y.
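As a quick example (ours, not from the original notes): φ(x) = ½‖x‖₂² is 1-strongly convex with respect to ‖·‖₂, since ∇²φ(x) = I ⪰ 1·I; indeed,

\tfrac{1}{2}\|x\|_2^2 = \tfrac{1}{2}\|y\|_2^2 + \langle y,\, x - y\rangle + \tfrac{1}{2}\|x - y\|_2^2,

so (11.26) holds with equality in this case.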
Given a 1-strongly-convex function φ(·), which we call a regularizer, we can implement the “Follow the
Regularized Leader” (FTRL) strategy. At time t, this strategy chooses the action
" t−1 #
X 1
wt = argmin fi (w) + φ(w) , (11.27)
w∈Ω i=1
η
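As a minimal numerical sketch of the update (11.27), assuming an unconstrained domain Ω = R^d, a quadratic regularizer, and a generic numerical solver for the argmin (all of which are our illustrative choices):

import numpy as np
from scipy.optimize import minimize

def ftrl_step(past_losses, eta, dim):
    """One FTRL action as in (11.27): minimize the cumulative past loss plus
    (1/eta) * phi(w), with phi(w) = 0.5 * ||w||_2^2 and Omega = R^dim."""
    def objective(w):
        return sum(f(w) for f in past_losses) + 0.5 * float(np.dot(w, w)) / eta
    return minimize(objective, np.zeros(dim)).x

# Usage sketch with quadratic losses f_i(w) = ||w - c_i||^2:
centers = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
past = [lambda w, c=c: float(np.sum((w - c) ** 2)) for c in centers]
w_next = ftrl_step(past, eta=1.0, dim=2)
print(w_next)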
Lemma 11.9. Let F : Ω → R be α-strongly convex with respect to a norm ‖·‖, and let f : Ω → R be convex and
differentiable. Let w = \argmin_{v\in\Omega} F(v) and w' = \argmin_{v\in\Omega} [F(v) + f(v)]. Then
0 \le f(w) - f(w') \le \frac{1}{\alpha}\|\nabla f(w)\|_*^2,   (11.28)
where ‖·‖_* is the dual norm of ‖·‖.
Adding these two inequalities (the strong convexity inequality for F at its minimizer w, and for F + f at its minimizer w') gives
f(w) - f(w') \ge \alpha\|w - w'\|^2.   (11.32)
Since the right-hand side is non-negative, this shows 0 ≤ f(w) − f(w').
Next, we prove the upper bound on f(w) − f(w'). Rearranging the inequality (11.32), we obtain
\|w - w'\| \le \sqrt{\frac{1}{\alpha}\left[f(w) - f(w')\right]}.   (11.33)
Since f is convex, we have f(w') ≥ f(w) + ⟨∇f(w), w' − w⟩. Rearranging this gives f(w) − f(w') ≤ ⟨∇f(w), w − w'⟩ ≤ ‖∇f(w)‖_* ‖w − w'‖. Combining this with (11.33) and rearranging yields the upper bound in (11.28).
i.e. FTRL is just FTL with an additional “round” of play at time zero. Thus, by Lemma 11.5 with time
starting from t = 0, we have
\sum_{t=0}^{T} f_t(w_t) - \min_{w\in\Omega}\sum_{t=0}^{T} f_t(w) \le \sum_{t=0}^{T}\left[f_t(w_t) - f_t(w_{t+1})\right].   (11.38)
For any t ≥ 1, applying Lemma 11.9 with F(w) = \sum_{i=0}^{t-1} f_i(w) (which is 1/η-strongly convex) and f(w) =
f_t(w) gives the bound f_t(w_t) − f_t(w_{t+1}) ≤ η‖∇f_t(w_t)‖_*^2. Plugging this into the preceding display gives the
upper bound:
\sum_{t=0}^{T} f_t(w_t) - \min_{w\in\Omega}\sum_{t=0}^{T} f_t(w) \le f_0(w_0) - f_0(w_1) + \eta\sum_{t=1}^{T}\|\nabla f_t(w_t)\|_*^2.   (11.39)
Next, we need to relate the LHS of the above display (which starts at time t = 0) to the actual regret of
FTRL (which starts at time t = 1). To do this, define w^* = \argmin_{w\in\Omega}\sum_{t=1}^{T} f_t(w). Then,
\sum_{t=0}^{T} f_t(w_t) - \min_{w\in\Omega}\sum_{t=0}^{T} f_t(w) \ge \sum_{t=0}^{T} f_t(w_t) - \sum_{t=0}^{T} f_t(w^*)   (11.40)
= f_0(w_0) - f_0(w^*) + \underbrace{\left(\sum_{t=1}^{T} f_t(w_t) - \min_{w\in\Omega}\sum_{t=1}^{T} f_t(w)\right)}_{\text{Regret of FTRL}}.   (11.41)
Therefore, by choosing η = \sqrt{1/(8T)} and applying the FTRL regret theorem (Theorem 11.11), we can obtain
the regret guarantee
\sum_{t=1}^{T}(y_t - w_t^\top x_t)^2 - \min_{\|w\|_2 \le 1}\sum_{t=1}^{T}(y_t - w^\top x_t)^2 \le 4\sqrt{T}.   (11.48)
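For concreteness, here is a minimal sketch of an online learner for this squared-loss setting with the constraint ‖w‖₂ ≤ 1, using projected online gradient descent as a convenient stand-in for the FTRL instantiation analyzed above; the step size, the synthetic data, and the use of a fixed comparator w* are our illustrative choices, not the exact constants of Theorem 11.11.

import numpy as np

rng = np.random.default_rng(0)
T, d = 1000, 5
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)               # a feasible comparator, ||w*||_2 <= 1
xs = rng.normal(size=(T, d))
ys = xs @ w_star + 0.1 * rng.normal(size=T)    # noisy labels

def project_l2_ball(w, radius=1.0):
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

eta = 1.0 / np.sqrt(T)                         # illustrative step size
w = np.zeros(d)
learner_loss = 0.0
for t in range(T):
    pred = float(w @ xs[t])
    learner_loss += (ys[t] - pred) ** 2        # f_t(w_t) = (y_t - <w_t, x_t>)^2
    grad = -2.0 * (ys[t] - pred) * xs[t]       # gradient of f_t at w_t
    w = project_l2_ball(w - eta * grad)        # projected online gradient step

comparator_loss = float(np.sum((ys - xs @ w_star) ** 2))
# Uses w* as the comparator, so this lower-bounds the regret against the best w in the ball.
print(learner_loss - comparator_loss)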
We now return to the expert problem and instantiate FTRL with the negative entropy regularizer
\phi(p) = \sum_{j=1}^{N} p(j)\log p(j) = -H(p),
where p ∈ ∆(N) ranges over the set of distributions over [N]. We first introduce the following nice property of this
regularizer.
Lemma 11.13. φ(p) defined above is 1-strongly convex with respect to the ℓ₁ norm ‖·‖₁.
Proof. By the definition of strong convexity, we need to show that for all p, q ∈ ∆(N),
\phi(p) - \phi(q) - \langle\nabla\phi(q),\, p - q\rangle \ge \frac{1}{2}\|p - q\|_1^2.   (11.55)
By direct computation, the gradient of φ at q is
\nabla\phi(q) = \begin{bmatrix} 1 + \log q(1) \\ \vdots \\ 1 + \log q(N) \end{bmatrix}.   (11.56)
Plugging this into the LHS of (11.55), the terms rearrange to exactly KL(p‖q), so the claim follows from
Pinsker's inequality KL(p‖q) ≥ ½‖p − q‖₁².
Hence, φ satisfies the condition on the regularizer for our FTRL regret guarantee. To obtain the
regret bound (11.36), we also need to bound D = sup φ(p) − inf φ(p) and G = sup ‖∇f_t(w)‖_∞ (since ‖·‖_∞
is the dual norm of ‖·‖₁). Since the negative entropy is always non-positive and the (positive) entropy is always
bounded above by log N, we can bound D by
D = \sup_p \phi(p) - \inf_p \phi(p) \le -\inf_p \phi(p) = -\inf_p(-H(p)) = \sup_p H(p) \le \log N.   (11.62)
where \mathbb{E}_\nu[f] = \mathbb{E}_{x\sim\nu}[f(x)] = \langle\nu, f\rangle and \mathbb{E}_\mu[e^f] = \mathbb{E}_{x\sim\mu}[e^{f(x)}]. Moreover, the optimum is attained
at \nu^*(x) \propto \mu(x)e^{f(x)}.
Intuitively, Lemma 11.14 (the Gibbs variational principle) says that maximizing the linear functional \mathbb{E}_\nu[f]
over distributions ν, with the KL divergence KL(ν‖µ) as the regularizer, yields the distribution obtained by
exponentiating f and reweighting µ.
If we take µ to be the uniform distribution on [N ] and replace f with −f in Lemma 11.14, we get the
following corollary:
Corollary 11.15. Let ν range over probability distributions on [N]. Then \mathbb{E}_\nu[f] - H(\nu) is minimized at ν(x) ∝ e^{-f(x)}.
Proof. When µ is the uniform distribution, we have
KL(\nu\|\mu) = \sum_x \nu(x)\log\frac{\nu(x)}{\mu(x)}   (11.67)
= \log N - \sum_x \nu(x)\log\frac{1}{\nu(x)}   (11.68)
= \log N - H(\nu).   (11.69)
So \sup_\nu\left(\mathbb{E}_\nu[-f] - KL(\nu\|\mu)\right) = -\inf_\nu\left(\mathbb{E}_\nu[f] - H(\nu) + \log N\right). This means that the ν that
attains the infimum of \mathbb{E}_\nu[f] - H(\nu) is the same as the ν attaining the supremum of \mathbb{E}_\nu[-f] - KL(\nu\|\mu), which by
Lemma 11.14 is proportional to e^{-f(x)}.
We now apply the Gibbs variational principle to the expert problem. Notice that our FTRL update for
the expert problem at time t can be written as
\argmin_{p_t\in\Delta(N)}\left[\left\langle\sum_{i=1}^{t-1}\ell_i,\ p_t\right\rangle - \frac{1}{\eta}H(p_t)\right] = \argmin_{p_t\in\Delta(N)}\left[\left\langle\eta\sum_{i=1}^{t-1}\ell_i,\ p_t\right\rangle - H(p_t)\right],   (11.70)
where \ell_i is the vector of expert losses at time i. Letting f = \eta\sum_{i=1}^{t-1}\ell_i, we know from Corollary 11.15 that
the minimizer is attained at p_t \propto \exp\!\left(-\eta\sum_{i=1}^{t-1}\ell_i\right), or equivalently,
p_t(j) = \frac{\exp(-\eta L_t(j))}{\sum_{k=1}^{N}\exp(-\eta L_t(k))},   (11.71)
where L_t = \sum_{i=1}^{t-1}\ell_i is the cumulative loss vector. In other words, solving the expert problem amounts to
looking at the historical loss of each expert and taking a softmax to obtain the probability distribution over how
much to trust each expert.
This algorithm is also called the "Multiplicative Weights Update"; it was studied before the online
learning framework became popular [Arora et al., 2005, Freund and Schapire, 1997, Littlestone and Warmuth,
1994]. One way of performing the multiplicative weights update is the following. Let p̃_t be the unnormalized
distribution that we keep track of. At each time step t, for each expert j, we look at \ell_{t-1}(j): if \ell_{t-1}(j) = 1,
i.e. the expert made a mistake at the previous time step, we update p̃_t(j) = p̃_{t-1}(j) \cdot \exp(-\eta); otherwise we
make no change. We then obtain a distribution by normalizing p̃_t:
p_t = \frac{\tilde{p}_t}{\|\tilde{p}_t\|_1}.   (11.72)
A minimal implementation sketch is given below.
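Here is a minimal sketch of this multiplicative weights / softmax update, assuming binary losses ℓ_t(j) ∈ {0, 1}; the synthetic data and the step size η are illustrative choices:

import numpy as np

def multiplicative_weights(losses, eta):
    """losses: (T, N) array with losses[t, j] = l_t(j) in {0, 1}.
    Returns the sequence of distributions p_t over the N experts."""
    T, N = losses.shape
    p_tilde = np.ones(N)                               # unnormalized weights
    ps = []
    for t in range(T):
        ps.append(p_tilde / p_tilde.sum())             # normalize, as in (11.72)
        p_tilde = p_tilde * np.exp(-eta * losses[t])   # downweight experts that erred
    return np.array(ps)

# Usage sketch; by (11.71) this is the softmax of the negative cumulative losses.
rng = np.random.default_rng(0)
losses = (rng.random((200, 4)) < 0.4).astype(float)
eta = np.sqrt(np.log(4) / 200)           # a common (illustrative) choice of step size
ps = multiplicative_weights(losses, eta)
expected_losses = np.sum(ps * losses)    # sum_t <p_t, l_t>
print(expected_losses - losses.sum(axis=0).min())   # regret against the best expert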
3. The learner constructs the linear function g_t(w) = \langle\nabla f_t(w_t), w\rangle, which is the local linear approximation
of f_t at w_t. (Technically, the local linear approximation is \langle\nabla f_t(w_t), w - w_t\rangle, but we drop
the w_t shift for convenience.)
4. The learner feeds g_t(\cdot) to algorithm A as the loss function.
We have the following informal claim:
Proposition 11.8.1 (Informal). If a deterministic algorithm A has regret at most γ(T) for linear loss functions,
for some function γ(·), then the algorithm A' described above has regret at most γ(T) for convex loss functions.
Proof. For all w ∈ Ω, the regret guarantee on A tells us that
\sum_{t=1}^{T} g_t(w_t) - \sum_{t=1}^{T} g_t(w) \le \gamma(T).   (11.73)
By convexity of f_t, we have f_t(w_t) − f_t(w) ≤ \langle\nabla f_t(w_t), w_t - w\rangle = g_t(w_t) − g_t(w). Summing over t and
combining with (11.73) shows that the regret of A' on f_1, \ldots, f_T is also at most γ(T).
where Π_Ω(·) is the projection operator onto the set Ω. The last equality holds because, for any vector a, we have
\argmin_{w\in\Omega}\left[\langle a, w\rangle + \frac{1}{2\eta}\|w\|_2^2\right] = \argmin_{w\in\Omega}\left[\frac{1}{2\eta}\|w + \eta a\|_2^2 - \frac{\eta}{2}\|a\|_2^2\right],   (11.80)
and the second term on the right-hand side does not depend on w, so the minimizer is Π_Ω(−ηa).
Intuitively, we can think of this algorithm as gradient descent with "lazy" projection: the unprojected iterate
accumulates the gradient steps, and projection onto Ω is applied only when an action is needed. Similarly, we
can define gradient descent with "eager" projection, which projects back onto Ω after every gradient step and
achieves similar regret bounds. A minimal sketch of both variants is given below.
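The following sketch is our own illustration of the two variants (not the exact pseudocode of the algorithm boxes referenced above), using Euclidean projection onto an ℓ₂ ball as the running example for Ω:

import numpy as np

def project(w, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius (our example Omega)."""
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

def ogd_lazy(grads, eta, dim):
    """Lazy projection: keep an unprojected state theta and project only to act.
    grads[t] is a callable returning the gradient of f_t at the given point."""
    theta = np.zeros(dim)
    actions = []
    for g in grads:
        w_t = project(theta)             # action at time t
        actions.append(w_t)
        theta = theta - eta * g(w_t)     # gradient step on the unprojected state
    return actions

def ogd_eager(grads, eta, dim):
    """Eager projection: project back onto Omega after every gradient step."""
    w = np.zeros(dim)
    actions = []
    for g in grads:
        actions.append(w)
        w = project(w - eta * g(w))      # step, then project immediately
    return actions

# Usage sketch with linear losses f_t(w) = <l_t, w>, whose gradient is l_t:
rng = np.random.default_rng(0)
ells = rng.normal(size=(100, 3))
grads = [lambda w, l=l: l for l in ells]
lazy_actions = ogd_lazy(grads, eta=0.1, dim=3)
eager_actions = ogd_eager(grads, eta=0.1, dim=3)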
Bibliography
Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models
through tensor methods. In Proceedings of the Conference on Learning Theory (COLT), Paris, France,
2015.
Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming
using the multiplicative weights update method. In 46th Annual IEEE Symposium on Foundations of
Computer Science (FOCS’05), pages 339–348. IEEE, 2005.
Sanjeev Arora, Satish Rao, and Umesh Vazirani. Expander flows, geometric embeddings and graph parti-
tioning. Journal of the ACM (JACM), 56(2):1–37, 2009.
Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kan-
wal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer
look at memorization in deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the
34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Re-
search, pages 233–242. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/arpit17a.
html.
Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples
without local minima. Neural networks, 2(1):53–58, January 1989. ISSN 0893-6080. doi: 10.1016/
0893-6080(89)90014-2. URL http://dx.doi.org/10.1016/0893-6080(89)90014-2.
Peter Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural
networks. NeurIPS, 2017.
M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical
bias-variance trade-off. Proceedings of the National Academy of Sciences (PNAS), 116(32):15849–15854,
2019. doi: 10.1073/pnas.1903070116.
Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for
domain adaptation. Advances in neural information processing systems, 19:137, 2007.
Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks
driven by an ornstein-uhlenbeck like process. arXiv preprint arXiv:1904.09080, 2019.
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S.
Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dal-
las Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora
Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin
Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby
Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong,
Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karam-
cheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna,
Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Lev-
ent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani,
Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen
Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadim-
itriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich,
Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Sh-
iori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W.
Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael
Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui
Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the opportunities and risks of foundation models.
arXiv preprint arXiv:2108.07258, 2021.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive
learning of visual representations. In International Conference on Machine Learning (ICML), pages 1597–
1607, 2020.
Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv
preprint arXiv:1812.07956, 8, 2018.
Fan Chung. Four proofs for the cheeger inequality and graph partition algorithms. In Proceedings of ICCM,
volume 2, page 378. Citeseer, 2007.
Alex Damian, Tengyu Ma, and Jason Lee. Label noise sgd provably prefers flat global minimizers, 2021.
Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for mixtures
of two gaussians. arXiv preprint arXiv:1609.00368, 2016.
Simon Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. In Inter-
national Conference on Machine Learning, pages 1655–1664. PMLR, 2019.
Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for
efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application
to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points — online stochastic gradient
for tensor decomposition. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The
28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages
797–842, Paris, France, 03–06 Jul 2015. PMLR.
Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In D. Lee,
M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/
2016/file/7fb8ceb3bd59c7956b1df66729296a4c-Paper.pdf.
Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep
learning with spectral contrastive loss, 2021.
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised
visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 9729–9738, June 2020.
Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are np-hard. Journal of the ACM, 60(6),
2013.
Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251–257,
1991.
Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint
arXiv:1803.07300, 2018.
Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to
minimizers. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference
on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246–1257, Columbia
University, New York, New York, USA, 23–26 Jun 2016. PMLR.
Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent
on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018. URL
https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16067.
Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix
sensing and neural networks with quadratic activations. arXiv preprint arXiv:1712.09203, pages 2–47,
2017.
Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix
sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47.
PMLR, 2018.
Percy Liang. Cs229t/stat231: Statistical learning theory (winter 2016), April 2016.
Shiyu Liang, Ruoyu Sun, Jason D Lee, and R Srikant. Adding one neuron can eliminate all bad local minima.
Neural Information Processing Systems (NIPS), 2018.
Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation,
108(2):212–261, 1994.
Pengda Liu and Garrett Thomas. Cs229t/stat231: Statistical learning theory (fall 2018), October 2018.
Anand Louis and Konstantin Makarychev. Approximation algorithm for sparsest k-partitioning. In Proceed-
ings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1244–1255. SIAM,
2014.
Haipeng Luo. Introduction to online learning, 2017. URL https://haipeng-luo.net/courses/CSCI699/.
Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares.
IEEE Symposium on Foundations of Computer Science (FOCS), 2016. URL http://arxiv.org/abs/
1610.01980.
Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses, 2017.
Katta G. Murty and Santosh N. Kabadi. Some np-complete problems in quadratic and nonlinear program-
ming. Mathematical Programming, 39:117–129, 1987.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks.
In Conference on Learning Theory, pages 1376–1401. PMLR, 2015.
Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances
in neural information processing systems, 14:849–856, 2001.
Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the
Royal Society of London. A, 185:71–110, 1894.
John A. Rice. Mathematical Statistics and Data Analysis. Belmont, CA: Duxbury Press., third edition,
2006.
Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary
learning. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning
Theory, volume 65 of Proceedings of Machine Learning Research, pages 1760–1793. PMLR, 07–10 Jul
2017. URL https://proceedings.mlr.press/v65/schramm17a.html.
Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern
analysis and machine intelligence, 22(8):888–905, 2000.
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias
of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
Matus Telgarsky. Deep learning theory lecture notes. https://mjt.cs.illinois.edu/dlt/, 2021. Version:
2021-10-27 v0.0-e7150f2d (alpha).
Ramon van Handel. Probability in high dimension: Apc 550 lecture notes, December 2016.
Roman Vershynin. High-dimensional probability: An introduction with applications in data science, vol-
ume 47. Cambridge university press, 2018.