
Why do deep neural networks perform really well?


A guarantee from normalization methods

Khoat Than
Hanoi University of Science and Technology

RADL, VIASM, 6/2023


Contents 2

¡ Recent breakthroughs
¡ The open theoretical challenge
¡ Basic concepts in learning theory
¡ Some theories for deep neural networks
¡ Theoretical benefits of normalization methods
Some successes: AlphaGo (2016) 3

¡ AlphaGo of Google DeepMind beat the world champion at Go (cờ vây), 3/2016
¡ Go is a 2,500-year-old game
¡ Go is one of the most complex games
¡ AlphaGo learns from 30 million human moves and plays itself to find new moves

[Image: Wired]
Some successes: GPT-3 (2020) 4

¡ Language generation (writing ability?)
¨ A huge model was trained from a huge data set
¨ This model, as universal knowledge, can be used for problems with few data
¡ Humans cannot tell whether a 500-word article was written by a machine or by a human

Table 7.3 of Brown et al.: human accuracy in identifying whether short (~200 word) articles are model generated; accuracy (the ratio of correct assignments to non-neutral assignments) ranges from 86% on the control model to 52% on GPT-3 175B.

Model                               Mean accuracy   95% Confidence Interval (low, hi)
Control (deliberately bad model)    86%             83%–90%
GPT-3 Small                         76%             72%–80%
GPT-3 Medium                        61%             58%–65%
GPT-3 Large                         68%             64%–72%
GPT-3 XL                            62%             59%–65%
GPT-3 2.7B                          62%             58%–65%
GPT-3 6.7B                          60%             56%–63%
GPT-3 13B                           55%             52%–58%
GPT-3 175B                          52%             49%–54%

Table 7.4 of Brown et al.: human accuracy in identifying whether ~500 word articles are model generated, with a two-sample T-Test between GPT-3 175B and the control model (an unconditional GPT-3 Small model with increased output randomness).

Model        Mean accuracy   95% Confidence Interval (low, hi)   t compared to control (p-value)
Control      88%             84%–91%                             -
GPT-3 175B   52%             48%–57%                             12.7 (3.2e-…)

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language models are few-shot learners." NeurIPS (2020). Best Paper Award.
Some successes: AlphaFold 2 (2021) 5
¡ Accurate prediction of protein folding

"This computational work represents a stunning advance on the protein-folding problem, a 50-year-old grand challenge in biology."
– Venki Ramakrishnan, Nobel Laureate

Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021).
Théâtre D’opéra Spatial 6

An AI-Generated Picture Won an Art Prize
@Jason Allen + Midjourney

https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html
Some successes: Text-to-image (2022) 7

¡ Draw pictures by short descriptions

"A bowl of soup"

DALL-E 2 | Midjourney | Imagen
Some successes: ChatGPT (2022) 8

¡ Human-level Chatting, Writing, QA,…

- Forbes, 2/2023
Open question 9

¡ Why can deep neural networks perform well?


¡ Many breakthroughs in recognition, games, image
synthesis, language generation, Protein folding
prediction, …
Theoretical study 10

¡ Approximation (power of an architecture)


¡ Pros: any continuous function can be approximated well by a deep neural network (NN)
¡ Cons: unclear how to find such a specific NN from a given training set

¡ Optimization (learning process)


¡ Overparameterized NNs can have zero training error, but do not overfit
¡ SGD can find global solutions to the training problems
¡ Cons: good optimization does not imply good generalization ability

¡ Generalization (ability of trained NNs to perform on unseen data)


¡ Existing standard theories cannot be used, due to vacuousness
¡ Some theories work well only for NNs with one hidden layer
11

Learning theory
Basic concepts
The learning problem 12

¡ There is an unknown (measurable) function 𝑦*: 𝒳 → 𝒴
¡ It maps each input 𝒙 ∈ 𝒳 to a label (output) 𝑦 ∈ 𝒴
¡ Spaces: input space 𝒳, output space 𝒴

¡ We can collect a dataset D = {(x1, y1), (x2, y2), …, (xM, yM)}
¡ 𝑦_i = 𝑦*(𝒙_i) for any 𝑖 ∈ {1, …, 𝑀}
¡ Sometimes labels cannot be collected

¡ We need to learn 𝑦* from D
¡ In practice, we often find a function h to approximate 𝑦*

[Figures by C. Bishop: a training set of N = 10 points generated from sin(2πx), where the goal is to predict t for a new x without knowing the green curve, and a two-class synthetic dataset showing the decision boundary, margin boundaries, and support vectors of an SVM with a Gaussian kernel]
Basic concepts 13

¡ Loss/cost function: 𝑓: 𝒴 × 𝒴 → ℝ
¡ 𝑓(𝑦, 𝑦̂): the cost/loss of prediction 𝑦̂ about 𝑦
¡ 0-1 loss: 𝑓(𝑦, 𝑦̂) = 𝟏[𝑦̂ ≠ 𝑦]
¡ Square loss: 𝑓(𝑦, 𝑦̂) = (𝑦 − 𝑦̂)²

¡ Empirical loss: the loss of a function h on the training set D
𝐹(𝑫, ℎ) = (1/𝑀) Σ_{i=1}^{M} 𝑓(𝑦_i, ℎ(𝒙_i))

¡ Expected loss (risk): the loss of a function h over the whole space
𝐹(𝑃, ℎ) = 𝔼_{(𝒙,𝑦)∼𝑃} [𝑓(𝑦, ℎ(𝒙))]
¡ P is the distribution from which each (𝒙, 𝑦) is sampled
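To make these definitions concrete, here is a minimal Python sketch (not from the slides) of the empirical loss F(D, h) and a Monte-Carlo estimate of the expected loss F(P, h); the toy target y*, the hypothesis h, and the data sampler are invented for illustration.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # 0-1 loss: 1 if the prediction differs from the label, else 0
    return (y != y_hat).astype(float)

def square_loss(y, y_hat):
    # squared loss (y - y_hat)^2
    return (y - y_hat) ** 2

def empirical_loss(X, y, h, loss):
    # F(D, h) = (1/M) * sum_i f(y_i, h(x_i)) over the training set D
    return np.mean(loss(y, h(X)))

def expected_loss_mc(sampler, h, loss, n=100_000):
    # Monte-Carlo estimate of F(P, h) = E_{(x,y)~P}[f(y, h(x))]
    X, y = sampler(n)
    return np.mean(loss(y, h(X)))

# Toy example: y* = sign(x) with Gaussian inputs, h a slightly shifted threshold rule
def sampler(n, rng=np.random.default_rng(0)):
    X = rng.normal(size=n)
    return X, np.sign(X)

h = lambda X: np.sign(X - 0.1)            # an imperfect hypothesis
X_train, y_train = sampler(20)
print(empirical_loss(X_train, y_train, h, zero_one_loss))   # training loss F(D, h)
print(expected_loss_mc(sampler, h, zero_one_loss))          # ≈ expected loss F(P, h)
```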
Learning goal 14

¡ Function space (hypothesis space, model space):


a set of functions ℋ, where a learner will select a good function ℎ ∈ ℋ
¡ Depends on input features: ℎ: 𝒳 → 𝒴
¡ Represents prior knowledge about a task

¡ Learner: a learning algorithm that can select one ℎ ∈ ℋ, based on a training set D
¡ Learning goal: find one ℎ ∈ ℋ with small expected loss
¡ It should generalize well on future data
¡ Small training loss/error is not enough

¡ Ultimately, we want to find the best one in ℋ


ℎ* = arg min_{ℎ∈ℋ} 𝐹(𝑃, ℎ)

¡ Learning ≠ Fitting
¡ Fitting focuses on minimizing the training loss 𝐹(𝑫, ℎ)
Errors of a trained model 15

¡ After training, the learning algorithm will return ℎ_D ∈ ℋ
¡ How well does it work with future data? ⇒ Generalization!
¡ Maybe: ℎ_D ≠ ℎ_D* and ℎ_D ≠ ℎ*
¡ ℎ_D* = arg min_{ℎ∈ℋ} 𝐹(𝑫, ℎ) is the minimizer of the empirical loss
¡ ℎ* = arg min_{ℎ∈ℋ} 𝐹(𝑃, ℎ) is the best member of family ℋ

¡ Note:
𝐹(𝑃, ℎ_D) − 𝐹(𝑃, 𝑦*) = [𝐹(𝑃, ℎ_D) − 𝐹(𝑃, ℎ*)] + [𝐹(𝑃, ℎ*) − 𝐹(𝑃, 𝑦*)]
¡ Estimation error: 𝐹(𝑃, ℎ_D) − 𝐹(𝑃, ℎ*)
¡ How good is the training algorithm?
¡ Approximation error: 𝐹(𝑃, ℎ*) − 𝐹(𝑃, 𝑦*)
¡ Capacity (representational power) of family ℋ

Bousquet et al. Introduction to statistical learning theory. In Machine Learning, LNAI, volume 3176. Springer, 2004.
Error decomposition 16

¡ Estimation error:
|𝐹(𝑃, ℎ_D) − 𝐹(𝑃, ℎ*)| ≤ |𝐹(𝑫, ℎ_D) − 𝐹(𝑫, ℎ_D*)| + 2 sup_{ℎ∈ℋ} |𝐹(𝑃, ℎ) − 𝐹(𝑫, ℎ)|
¡ It can be decomposed into two types of error
¡ Optimization error: 𝐹(𝑫, ℎ_D) − 𝐹(𝑫, ℎ_D*)
¡ How close to optimality is ℎ_D?
¡ Generalization error: 𝐹(𝑃, ℎ_D) − 𝐹(𝑫, ℎ_D)
¡ How far is the training loss from the expected loss?

¡ In summary:
Error(ℎ_D) ≔ Optimization error + Generalization error + Approximation error
Errors by different factors 17

¡ Function space ℋ
¡ A bigger space ℋ′ gives a (probably) smaller approximation error
¡ More complex members give a (probably) smaller approximation error ⇒ larger capacity
¡ An effective space is enough ⇒ it need not be too big/complex

¡ Training algorithm 𝒜
¡ A better 𝒜 implies a smaller estimation error of the trained model
¡ A bad 𝒜 can provide a small optimization error, but a large generalization error ⇒ overfitting
¡ A good 𝒜 can localize an effective subset ℋ* ⊂ ℋ

¡ Data
¡ Complexity of the data space (data manifolds)
¡ Representativeness of the training samples, …
A unified view 18

[Diagram: the total Error splits into Approximation error, Optimization error, and Generalization error, which are governed by the Data space 𝒳 (×𝒴), the Learning algorithm 𝒜, and the Function space ℋ]
Error bounds 19

¡ Study upper (and lower) bounds for the errors

¡ Approximation error:
|𝐹(𝑃, 𝑦*) − 𝐹(𝑃, ℎ*)| ≤ 𝜖_a
¡ Capacity of family ℋ

¡ Optimization error:
|𝐹(𝑫, ℎ_D*) − 𝐹(𝑫, ℎ_D)| ≤ 𝜖_o
¡ Depends on the number of training iterations (epochs)
¡ Capacity of learning algorithm 𝒜
Bounds on Generalization Error 20

𝐹(𝑃, ℎ_D) − 𝐹(𝑫, ℎ_D) ≤ 𝜖_g

¡ Generalizability of a learned function ℎ_D

¡ Uniform bounds:
sup_{ℎ∈ℋ} |𝐹(𝑃, ℎ) − 𝐹(𝑫, ℎ)| ≤ 𝜖_g
¡ Generalizability of the worst member
¡ May not be a good way to explain a learned function ℎ_D

¡ PAC-Bayes bounds:
𝔼_{ℎ∈ℋ} [𝐹(𝑃, ℎ) − 𝐹(𝑫, ℎ)] ≤ 𝜖_g
¡ Study the error on average over ℋ
¡ May not explain a learned function ℎ_D

Bousquet et al. Introduction to statistical learning theory. In Machine Learning, LNAI, volume 3176. Springer, 2004.
Nagarajan & Kolter. Uniform convergence may be unable to explain generalization in deep learning. In NeurIPS, 2019.
21

Theoretical results for


deep neural networks
A short summary
Neural network 22

¡ Artificial neural networks (ANNs):
¡ Biologically inspired by the human brain
¡ A rich family to represent complex functions

¡ An ANN:
¡ Consists of many neurons, organized in a layer-wise manner
¡ Each neuron computes a simple function
¡ A neuron can have a few connections to other neurons

¡ Each configuration of #neurons, #layers, #connections, … ⇒ an architecture
Mathematical description 23

ℎ 𝒙, 𝑾 = 𝑔9 𝑾9 ℎ9:$ , where ℎ" = 𝑔" 𝑾" ℎ":$ , ℎ; = 𝒙


¡ An NN with K layers
¡ 𝑾! is the weight matrix at layer i
¡ ℎ! is the output of layer i
𝑦∗
¡ 𝑔! is the activation function at layer i
ℎ∗
¡ A NN maps an input 𝒙 to an output y = ℎ 𝒙, 𝑾 ℎ% ℎ%∗
¡ Training: often find weights W, by minimizing a loss 𝐹 𝑫, ℎ ℋ

𝐸𝑟𝑟𝑜𝑟 ℎ- ≔ Optimization error +Generalization error +Approximation error


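A minimal numpy sketch of the layer-wise map above (not from the slides): the sizes, weights, and activations are invented, and bias terms are omitted, as in the slide's formula.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, activations):
    """Forward pass h(x, W) = g_K(W_K h_{K-1}), with h_i = g_i(W_i h_{i-1}), h_0 = x."""
    h = x
    for W, g in zip(weights, activations):
        h = g(W @ h)          # layer i: h_i = g_i(W_i h_{i-1})
    return h

# A made-up 3-layer network: input dim 4 -> 8 -> 8 -> 2
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(2, 8))]
activations = [relu, relu, lambda z: z]   # identity activation at the output layer

x = rng.normal(size=4)
y = forward(x, weights, activations)
print(y.shape)   # (2,)
```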
Approximation error: classical 24

‖𝑦* − ℎ‖ ≤ 𝜖_a

¡ Increase capacity ⇒ approximate better
¡ Larger family ℋ′
¡ More complex NNs ⇒ stronger representational power
¡ E.g., wider or deeper NNs

¡ Any binary function can be learnt (approximately well) by a feedforward network using one hidden layer, when the width goes to infinity
¡ Any bounded continuous function can be learnt (approximately well) by a feedforward network using one hidden layer [Cybenko, 1989; Hornik, 1991]

Cybenko, G. (1989). Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251-257.
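As a toy illustration of the one-hidden-layer statements above (my own sketch, not from the slides): a sigmoid network with random hidden weights and least-squares output weights approximates a fixed continuous target better as the width grows. The target function and all sizes are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_one_hidden_layer(x, y, width, rng):
    # One hidden layer of sigmoids with random weights/biases;
    # only the output weights are fit (by least squares).
    W = rng.normal(scale=10.0, size=width)      # hidden weights (random)
    b = rng.uniform(-10.0, 10.0, size=width)    # hidden biases (random)
    H = sigmoid(np.outer(x, W) + b)             # hidden activations, shape (n, width)
    a, *_ = np.linalg.lstsq(H, y, rcond=None)   # output weights
    return lambda t: sigmoid(np.outer(t, W) + b) @ a

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 500)
y = np.sin(2 * np.pi * x) + 0.3 * np.cos(7 * x)   # a continuous target on [0, 1]

for width in (5, 50, 500):
    h = fit_one_hidden_layer(x, y, width, rng)
    err = np.max(np.abs(h(x) - y))
    print(f"width={width:4d}  max |y*(x) - h(x)| = {err:.4f}")  # error typically shrinks with width
```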
Approximation error: modern 25

¡ Any continuous function can be approximated arbitrarily well by a convolutional neural network, when the depth is large [Zhou, 2020]
¡ Any Lebesgue-integrable function can be approximated arbitrarily well by
a ResNet with one neuron per hidden layer [Lin & Jegelka, 2018]
¡ Deep NNs avoid the curse of dimensionality when approximating Lipschitz
functions [Poggio et al. 2017]
¡ Shallow NNs cannot

Universal tools
Lin, H., & Jegelka, S. (2018). ResNet with one-neuron hidden layers is a universal approximator. NeurIPS.
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-
networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing.
Zhou, D. X. (2020). Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis.
Approximation: existence ↛ method 26

Unclear
how to find such DNNs,
based on a training set
Optimization error 27

¡ Training is often done by minimizing a loss 𝐹(𝑫, ℎ)

𝐹(𝑫, ℎ_D) − 𝐹(𝑫, ℎ_D*)

¡ The training loss is highly non-convex
¡ Theory:
¡ An exponentially large number of iterations may be needed
¡ Intractable in the worst case [Nesterov, 2018]

¡ Practice:
¡ Often have zero training error ⇒ global solution ℎ_D*?
¡ Easily fit random labelling of data perfectly [Zhang et al. 2021] (training seems to be easy!)

¡ Contradiction? What's missing?


Nesterov, Y. (2018). Lectures on convex optimization. Springer.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM.
Optimization: theoretically easy 28

¡ Gradient descent (GD) achieves zero training loss in polynomial time for a
deep over-parameterized ResNet [Du et al. 2019]
¡ Over-parameterization: #parameters ≫ training size

¡ GD can find a global optimum when the width of the last hidden layer of
an MLP exceeds the number of training samples [Nguyen, 2021]
¡ Stochastic gradient descent (SGD) can find global minima on the training
objective of DNNs in polynomial time [Allen-Zhu et al. 2019]
¡ Architecture: MLP, CNN, ResNet

Du, S., Lee, J., Li, H., Wang, L., & Zhai, X. (2019). Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning.
Nguyen, Q. (2021). On the proof of global convergence of gradient descent for deep relu networks with linear widths. In International Conference on Machine Learning.
Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning.
Optimization: reminder 29

However
global optimality
of the training problem
does not imply
good predictive ability
Bias-Variance tradeoff: classical view 30

¡ The more complex the model is, the more data points it can capture, and the lower the bias can be
¡ However, higher complexity makes the model "move" more to capture the data points, and hence its variance will be larger

[Figures from Hastie et al.: expected prediction error of k-NN regression versus the number of neighbors k, and test/training error versus model complexity: low complexity means high bias and low variance, high complexity means low bias and high variance, while the training error keeps decreasing]
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.
Bias-Variance: modern behavior 31

¡ Modern phenomenon: very rich models such as DNNs are trained to exactly fit the data, but often obtain high accuracy on test data [Belkin et al., 2019]
¡ 𝐵𝑖𝑎𝑠 ≅ 0
¡ GPT-4, ResNets, StyleGAN, DALLE-3, …

¡ Classical view: a more complex model ⇒ lower bias, higher variance

[Figures: Hastie et al.'s test and training error as a function of model complexity, and Belkin et al.'s risk curves: (A) the classical U-shaped curve arising from the bias–variance trade-off; (B) that curve together with the observed behavior of high-capacity models in the interpolating regime, separated by the interpolation threshold]

Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.
Generalization ability: long-standing open 32

¡ Main goal: small expected loss 𝐹(𝑃, ℎ_D)
¡ Practice: training loss 𝐹(𝑫, ℎ_D) ≅ 0 for overparameterized NNs
¡ Why can a trained DNN generalize well?
(Generalization: the ability to perform well on unseen data)

Error(ℎ_D) ≔ Approximation error + Optimization error + Generalization error

¡ We want to assure, for 𝛿 > 0,
Pr( 𝐹(𝑃, ℎ_D) − 𝐹(𝑫, ℎ_D) ≤ 𝜖 ) ≥ 1 − 𝛿
¡ The generalization gap should be small with a high probability over the random choice of D
¡ How fast does 𝐹(𝑫, ℎ_D) converge to 𝐹(𝑃, ℎ_D)? (as the training size increases)

A long-standing challenge in DL theory

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning. MIT press.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM.
Generalization: VC dimension 33

¡ Vapnik–Chervonenkis (VC) dimension:
¡ A measure of the capacity (complexity, expressive power, richness) of a set of functions
¡ The cardinality of the largest set of points that the function family can shatter
¡ A higher VC dimension ⇒ a richer model family ℋ

¡ Example: in 𝑛-dimensional space
¡ Linear models: 𝑉𝐶(ℋ) = 𝑛 + 1
¡ ReLU networks with 𝑊 weights: 𝑉𝐶(ℋ) = Ω(𝑊 log 𝑊)
[Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research.]

¡ Classical bound: for any 𝛿 > 0, with probability at least 1 − 𝛿,
𝐹(𝑃, ℎ) − 𝐹(𝑫, ℎ) ≤ √( (2/𝑚) 𝑉𝐶(ℋ) log(2𝑒𝑚/𝑉𝐶(ℋ)) ) + √( (1/𝑚) log(2/𝛿) )

¡ Vacuous/meaningless for modern DNNs, due to 𝑊 ≫ 𝑚 (training size)

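A quick numeric check of the vacuousness claim, using the classical bound as written on the slide; the weight counts and the VC-dimension scaling VC(ℋ) ≈ W log W are illustrative assumptions.

```python
import numpy as np

def vc_bound(d, m, delta=0.05):
    """Classical VC generalization bound from the slide:
    gap <= sqrt((2/m) * d * log(2*e*m/d)) + sqrt(log(2/delta)/m)."""
    if d >= m:
        return float("inf")   # trivially vacuous once the VC dimension exceeds the sample size
    return np.sqrt(2.0 * d * np.log(2.0 * np.e * m / d) / m) + np.sqrt(np.log(2.0 / delta) / m)

m = 50_000                        # e.g. a CIFAR-10-sized training set
for W in (10**2, 10**4, 10**8):   # made-up numbers of weights
    d = W * np.log(W)             # VC dimension of a ReLU net scales like W log W
    print(f"W={W:>9}: bound ≈ {vc_bound(d, m):.2f}")
# Already for W around 10^4 the bound exceeds 1 (or is infinite),
# i.e. it says nothing about a 0-1 loss when W >> m.
```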

Generalization: Weight norm 34

¡ DNN: ℎ(𝒙, 𝑾) = 𝑔_K(𝑾_K ℎ_{K−1})
¡ Bartlett: #params is not important
¡ The size of the weights may be more important

¡ Neyshabur et al.; Golowich et al.:
𝐹(𝑃, ℎ) − 𝐹(𝑫, ℎ) ≤ 𝑂( ‖𝑾_1‖_F ⋯ ‖𝑾_K‖_F ) / √𝑚

¡ Bartlett et al.:
𝐹(𝑃, ℎ) − 𝐹(𝑫, ℎ) ≤ 𝑂( ‖𝑾_1‖_2 ⋯ ‖𝑾_K‖_2 ) / √𝑚

¡ Uninformative for modern DNNs

[Figure from Arora et al. 2018: comparing such generalization bounds (including VC-dimension and norm-based bounds) with the empirical generalization error during training, for randomly initialized and trained networks]

Arora, S., Ge, R., Neyshabur, B., & Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. In ICML.
Bartlett, P. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory.
Bartlett, P. L., Foster, D. J., & Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In NeurIPS.
Golowich, N., Rakhlin, A., & Shamir, O. (2020). Size-independent sample complexity of neural networks. Information and Inference: A Journal of the IMA.
Neyshabur, B., Bhojanapalli, S., & Srebro, N. (2018). A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks. In ICLR.
Generalization: PAC-Bayes 35

¡ Consider 𝔼_{ℎ∼𝜌}[𝐹(𝑃, ℎ) − 𝐹(𝑫, ℎ)]
¡ Generalization error on average over ℋ
¡ 𝜌 is the posterior distribution of h
¡ 𝜇 is the prior distribution of h
¡ KL is the Kullback-Leibler divergence

¡ McAllester: with probability at least 1 − 𝛿,
𝔼_{ℎ∼𝜌}[𝐹(𝑃, ℎ) − 𝐹(𝑫, ℎ)] ≤ √( (𝐾𝐿(𝜌‖𝜇) + log(𝑚/𝛿)) / (2𝑚 − 1) )

¡ The "distance" between posterior 𝜌 and prior 𝜇 plays an important role
¡ It depends on the bias of the learning algorithm
¡ Unclear how fast 𝜌 can approach 𝜇
¡ Does not directly consider the complexity of family ℋ

Meaningful bounds appeared

McAllester, D. A. (2003). PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 5-21.
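A small numeric illustration of the McAllester bound as written above; the KL values and the training size are hypothetical.

```python
import numpy as np

def mcallester_bound(kl, m, delta=0.05):
    """Right-hand side of the McAllester PAC-Bayes bound on the slide:
    sqrt((KL(rho || mu) + log(m/delta)) / (2m - 1))."""
    return np.sqrt((kl + np.log(m / delta)) / (2 * m - 1))

m = 60_000                       # e.g. an MNIST-sized training set
for kl in (10.0, 1e3, 1e5):      # hypothetical KL(rho || mu) values
    print(f"KL={kl:>9.0f}: gap bound ≈ {mcallester_bound(kl, m):.3f}")
# The bound is meaningful only when the posterior stays close to the prior
# (small KL); compression makes KL small, which is what the next slide exploits.
```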
Generalization: non-vacuous bounds 36

¡ We can optimize the PAC-Bayes bound
¡ Find the posterior 𝜌* that minimizes 𝐾𝐿(𝜌‖𝜇)

¡ Dziugaite & Roy: non-vacuous bounds
¡ MLP with 3 layers, SGD algorithm, MNIST dataset

¡ Zhou et al.: compressibility ⇒ small KL
¡ Use a SOTA compression algorithm to find a non-vacuous bound for ImageNet, LeNet-5, MobileNet

¡ Lotfi et al.:
¡ Propose a compression algorithm to find non-vacuous bounds for LeNet-5, ResNet-18, MobileViT

Stochastic DNNs

¡ Biggs & Guedj: non-vacuous bounds for (special) deterministic networks on the MNIST and Fashion-MNIST datasets

PAC-Bayes compression bounds of Lotfi et al. (Table 2; 95% confidence, data-independent priors) versus prior SOTA bounds:

Dataset                  Err. Bound (%)   SOTA (%)
MNIST                    11.6             21.7
+ SVHN Transfer          9.0              16.1
FashionMNIST             32.8             46.5
+ CIFAR-10 Transfer      28.2             30.1
CIFAR-10                 58.2             89.9
+ ImageNet Transfer      35.1             54.2
CIFAR-100                94.6             100
+ ImageNet Transfer      81.3             98.1
ImageNet                 93.5             96.5

Dziugaite, G., & Roy, D. (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In UAI.
Zhou, W., Veitch, V., Austern, M., Adams, R., & Orbanz, P. (2019). Non-vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach. In ICLR.
Lotfi, S., Finzi, M., Kapoor, S., Potapczynski, A., Goldblum, M., & Wilson, A. G. (2022). PAC-Bayes compression bounds so tight that they can explain generalization. In NeurIPS.
Biggs, F., & Guedj, B. (2022). Non-vacuous generalisation bounds for shallow neural networks. In ICML.
Generalization: long-standing open 37

¡ Some other approaches:
¡ Neural tangent kernel, mean field
¡ Algorithmic robustness, algorithmic stability, …

Current meaningful bounds, however, are mostly for stochastic or shallow NNs

Unclear about big pretrained models and deep NNs in practice
Unclear about why many tricks in DL improve performance
38

Normalization
methods
The exponential benefits
Batch Normalization 39

ℎ(𝒙, 𝑾) = 𝑔_K(𝑾_K ℎ_{K−1}), where ℎ_i = 𝑔_i(𝑾_i ℎ_{i−1}), ℎ_0 = 𝒙

¡ Large variance of the input at a layer:
¡ Training is often unstable; gradient explosion may appear

¡ Batch normalization [Ioffe & Szegedy, 2015]:
ℎ_i = 𝑔_i(𝑩𝑵(𝑾_i ℎ_{i−1}))
¡ Each input signal will have mean 0 and variance 1

¡ DNN + BN:
¡ Training is often easier and faster
¡ Has become a standard
¡ Has much better generalization

[Figure from Ioffe & Szegedy: validation accuracy of Inception and its batch-normalized variants (BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid) over training steps; BN variants reach Inception's accuracy in far fewer steps]

Ioffe & Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
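A minimal numpy sketch of the BN transform used in ℎ_i = 𝑔_i(𝑩𝑵(𝑾_i ℎ_{i−1})), using training-mode batch statistics only (the running averages used at inference are omitted). Shapes and numbers are invented.

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch X of shape (batch, features):
    each feature is standardized to mean 0 and variance 1, then scaled by
    gamma and shifted by beta (the trainable parameters)."""
    mu = X.mean(axis=0)                    # per-feature batch mean
    var = X.var(axis=0)                    # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)  # standardized activations
    return gamma * X_hat + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=20.0, size=(128, 16))   # a high-variance layer input
Y = batch_norm(X, gamma=np.ones(16), beta=np.zeros(16))
print(X.var(axis=0).round(1)[:4])   # large, uneven variances before BN
print(Y.var(axis=0).round(3)[:4])   # ≈ 1 after BN
```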
Many normalizers 40

¡ Batch normalization
¡ Layer normalization [Ba et al., 2016]
¡ Instance normalization [Ulyanov et al., 2016]
¡ Group normalization [Wu & He, 2020]
¡ Spectral normalization [Miyato et al., 2018]
¡ Weight normalization [Salimans & Kingma, 2016]
¡…
Ioffe & Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
Ba, Kiros, and Hinton. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018) Spectral Normalization for Generative Adversarial Networks. In ICLR.
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS.
Wu & He. (2020). Group normalization. International Journal of Computer Vision.
Normalizers: many why's 41

ℎ(𝒙, 𝑾) = 𝑔_K(𝑾_K ℎ_{K−1}), where ℎ_i = 𝑔_i(𝑾_i ℎ_{i−1}), ℎ_0 = 𝒙

¡ Batch normalization: ℎ_i = 𝑔_i(𝑩𝑵(𝑾_i ℎ_{i−1}))

¡ Why can they speed up training?
¡ BN can reduce the Lipschitz constant of the loss ⇒ flatten the loss [Santurkar et al. 2018; Lyu et al. 2022]
¡ Unclear for other normalizers

¡ Do they control the capacity of a neural net?
¡ Yes, for a single-layer perceptron [Luo et al. 2019]

¡ Why can they improve generalization? Unclear

Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? In NeurIPS.
Luo, P., Wang, X., Shao, W., & Peng, Z. (2019). Towards Understanding Regularization in Batch Normalization. In ICLR.
Lyu, Li, & Arora. (2022). Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction. In NeurIPS.
Normalization 42

Despite being key parts of
modern Deep Learning,
a theoretical understanding
of normalization methods
remains missing
Normalization: our work 43

¡ BN can reduce the Lipschitz constant of a DNN at an exponential rate


¡ Capacity control, regularization role
¡ Many other normalizers have this property

¡ A normalized DNN can have exponentially smaller generalization error


than its unnormalized version in the worst case
¡ A normalized DNN may require exponentially less training data

¡ A normalizer can make the training loss exponentially flatter


¡ Iteration complexity is exponentially reduced
(The required number of iterations for any gradient-based learning algorithm)
¡ Training can converge exponentially faster
Normalization: Lipschitz continuity 44

¡ Lipschitz constant:
¡ Tells how fast a function h(x) can change at an input x
¡ Represents the complexity of h(x) ⇒ capacity of h

¡ BN normalizes each input 𝑥^(i) (with population mean 𝜇_x and variance 𝜎_x²) as
𝑩𝑵(𝑥^(i), 𝜸, 𝜖) = 𝜸 (𝑥^(i) − 𝜇_x) / √(𝜎_x² + 𝜖)
¡ 𝜸 is trainable, 𝜇 and 𝜎 are the mean and variance of the signal, and 𝜖 is a smoothing constant

¡ Lipschitz constant of BN w.r.t. the input x: ‖𝑩𝑵‖_Lip = ‖𝜸 ⁄ 𝝈‖, where 𝝈_k = √(𝜎_k² + 𝜖)

¡ Other normalizers:
¡ Layer norm: ‖𝑳𝑵‖_Lip ≤ (1 − 1/𝑛) ‖𝜸 ⁄ 𝝈‖ (for a layer with n neurons)
¡ Group norm: ‖𝑮𝑵‖_Lip ≤ (1 − 1/𝑛_g) ‖𝜸 ⁄ 𝝈‖ (for a group of 𝑛_g neurons)

[Excerpt shown on the slide: Section 3 and Lemma 1 of the accompanying paper, which estimate ‖𝑩𝑵‖_Lip and observe that the Jacobian of BN is large only when the variance of the input signal is small, so that a higher-varying input is penalized more strongly]
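A small sketch, assuming the estimate ‖𝑩𝑵‖_Lip = ‖𝜸 ⁄ 𝝈‖ above: BN contracts its input exactly when the incoming variance is large. The batch shapes and scales are invented.

```python
import numpy as np

def bn_lipschitz_factor(X, gamma, eps=1e-5):
    """Per-feature Lipschitz factor of BN w.r.t. its input, |gamma| / sqrt(var + eps),
    estimated from the batch statistics of X (shape: batch x features)."""
    sigma_bar = np.sqrt(X.var(axis=0) + eps)
    return np.abs(gamma) / sigma_bar

rng = np.random.default_rng(0)
gamma = np.ones(8)

X_low_var = rng.normal(scale=0.5, size=(256, 8))    # low-variance layer input
X_high_var = rng.normal(scale=20.0, size=(256, 8))  # high-variance layer input

print(bn_lipschitz_factor(X_low_var, gamma).max())   # ≈ 2: BN can expand here
print(bn_lipschitz_factor(X_high_var, gamma).max())  # ≈ 0.05: BN contracts strongly
```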
Exponential benefits: Capacity control 45

[Figure: comparison of ResNet-18 and its unnormalized version (w/o BN) on CIFAR10. The variance of the input distribution before each BN layer is computed per-dimension over mini-batches: DNNs without normalization often have high input variances at deep layers, while DNNs with BN keep the variances small.]

Lemma 4: any h in Definition 1 satisfies ‖ℎ‖_Lip ≤ ∏_{k=1}^{K} 𝑠_k, and its normalized version satisfies ‖ℎ_no‖_Lip ≤ ∏_{k=1}^{K} 𝑠_k ‖NO_k‖_Lip.

¡ For BN: ‖𝒉_BN‖_Lip ≤ (∏_{k=1}^{K} ‖𝜸_k ⁄ 𝝈_k‖) (∏_{k=1}^{K} 𝑠_k)

¡ Reminder: "large variance of the input"
¡ Often appears in practice

¡ ResNet-18 (DNN+BN): ∏_{k=1}^{K} ‖𝜸_k ⁄ 𝝈_k‖ ≈ 1.35×10^{−…} ≈ 1.35×2^{−…}

Exponential reduction!
k⇤ + ✏,
213 diameter at most B, and a loss function f : H ⇥ X ! R which is bounded by a constant C.2
214 Given a distribution P defined on X , the quality of a function h 2 H is measured by its expected
Exponential benefits: Generalization
215
216
loss F (P, h) = Ex⇠P [f (h, x)]. Since P is unknown, we need to rely on a training P set D =
{x1 , ..., xm } ✓ X of size m and often work with the empirical loss F (D, h) = m x2D f (h, x).
1
46

217 The generalization error of an h is often measured by |F (P, h) F (D, h)|.


SN
WeLet
¡ 218 study bridge
(X ) := i=1 X between Lipschitz
i be a partition of X intoproperty & generalization
N disjoint nonempty subsets. Denoteability
i as the
219 diameter of Xi , Li as the local Lipschitz constant of f on Xi , and mi = |D \ Xi | as the number of
¡ Γ denote a partition of input space 𝒳P into
N small parts: 𝒳! with diameter 𝜆!
220 samples falling into Xi , meaning that m = j=1 mj . Denote TD = {i 2 [N ] : D \ Xi 6= ;}. We
is the
¡ 𝐿! have
221 thelocal Lipschitz
following constant
connection whose of lossappears
proof f on 𝒳in! Appendix C.
222 Theorem 1 Consider a function h defined over X , and D consisting of m i.i.d. samples from
223 distribution P . Assume that the loss f (h,q
x) is Lipschitz continuous on every Xi , i 2 TD . For any
p
224 > 0, denoting g( , D, ) = C( 2 + 1) |TD | log(2N/m
)
+ 2C|TD | log(2N/ )
m , we have the following
225 with probability at least 1 :
X mi
|F (P, h) F (D, h)|  i Li + g( , D, ) (6)
m
i2TD

Generalization
¡ 226 error:
This bound depends only on the specific sample D and function h (but not the whole family H),
227 and hence is both hypothesis-specific and data-dependent. Note that we only need the assumption
¡ Depends heavily on the local behaviors around training samples
228 of Lipschitzness on some small areas around the individual samples of D. Therefore, this bound
¡ A can
229 DNN work
witheven with non-Lipschitz
smaller functions.may
Lipschitz constant As ageneralize
result, it can help us to
better (𝐿analyze a large class of
! ≤ 𝑓 -!. 𝒉0' |𝒳! -!. )
230 models. The uncertainty part g( , D, ) does not depend
P mon H and |TD | is often small (compared
¡ Explain
231 with Nwhy many adversarial
) for practical training
probelms [20]. methods
The first term i are reasonable
m i Li depends on only function h, and
i
Exponential benefits: Generalization 47

𝐹(𝑃, 𝒉) − 𝐹(𝑫, 𝒉) ≤ 𝜔 + 𝑂(𝑚^{−0.5})
𝐹(𝑃, 𝒉_BN) − 𝐹(𝑫, 𝒉_BN) ≤ 𝜔 ∏_{k=1}^{K} ‖𝜸_k ⁄ 𝝈_k‖ + 𝑂(𝑚^{−0.5})   (for BN)

¡ Ex.: ∏_{k=1}^{K} ‖𝜸_k ⁄ 𝝈_k‖ ≈ 1.35×2^{−…}

¡ A normalized DNN can have an exponentially smaller generalization error than its unnormalized version in the worst case
¡ A normalized DNN may require exponentially less training data
¡ "Deeper" can save more samples

¡ The same property appears for many normalizers
Exponential benefits: Optimization 48

¡ The training loss 𝐹(𝑫, 𝒉) can be flattened exponentially by a normalizer
¡ Consider a DNN 𝒉(𝒙, 𝑾_1, …, 𝑾_K), a layer 𝒉_i = 𝑔_i(𝑾_i 𝒉_{i−1}), and a loss function 𝑓(𝒉, 𝒙)

∂𝑓/∂𝑾_i = (∂𝑓/∂𝒉) × (∂𝒉/∂𝒚_i) × 𝒉_{i−1},  where 𝒚_i = 𝑾_i 𝒉_{i−1}

¡ The gradient w.r.t. weight 𝑾_i depends on ∂𝒉/∂𝒚_i (the Lipschitz constant of h w.r.t. layer i)
¡ ‖𝒉_no‖_Lip can be exponentially smaller than ‖𝒉‖_Lip
⇒ 𝐹(𝑫, 𝒉_no) can be exponentially flatter than 𝐹(𝑫, 𝒉)

¡ Consequences for training:
¡ Iteration complexity (lower bound on #iterations)
¡ Convergence rate (upper bound on #iterations)
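An illustrative sketch of the chain-rule observation above, under strong simplifications (a tiny linear model, squared loss, and a crude stand-in "normalizer" that only rescales the hidden layer to unit standard deviation): shrinking the downstream factor ∂𝒉/∂𝒚_i shrinks the gradient w.r.t. 𝑾_1, i.e. the loss becomes flatter in that direction. Everything here is made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
t = rng.normal(size=2)                      # regression target
W1 = rng.normal(scale=10.0, size=(8, 4))    # large weights -> high-variance hidden layer
W2 = rng.normal(size=(2, 8))

def loss(W1, W2, normalize):
    y1 = W1 @ x                              # pre-activation of layer 1
    h1 = y1 / y1.std() if normalize else y1  # crude normalization to unit std
    h = W2 @ h1
    return 0.5 * np.sum((h - t) ** 2)

def grad_W1_norm(normalize, eps=1e-5):
    # numerical gradient of the loss w.r.t. W1 (central differences)
    g = np.zeros_like(W1)
    for idx in np.ndindex(W1.shape):
        E = np.zeros_like(W1); E[idx] = eps
        g[idx] = (loss(W1 + E, W2, normalize) - loss(W1 - E, W2, normalize)) / (2 * eps)
    return np.linalg.norm(g)

print("||dF/dW1|| without normalization:", grad_W1_norm(False))
print("||dF/dW1|| with normalization:   ", grad_W1_norm(True))
```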
Exponential benefits: Convergence rate 49

¡ Iteration complexity: any gradient-based method requires at least
¡ Ω(‖𝐹‖_Lip/𝛼) iterations to find an 𝛼-stationary point of a nonconvex function 𝐹 [Zhang et al.]

¡ Convergence rate: after T iterations, we can find an approximate solution
¡ with error 𝑂(‖𝐹‖_Lip^{…}/𝑇^{…}), for a nonconvex function 𝐹 [Davis et al.]
¡ with error 𝑂(‖𝐹‖_Lip/√𝑇), for a convex function 𝐹

¡ ‖𝐹‖_Lip can be made exponentially smaller by a normalizer
¡ Iteration complexity is exponentially reduced
¡ Training can converge exponentially faster

[Figure from Ioffe & Szegedy: batch-normalized Inception variants reach the baseline accuracy in many fewer training steps]

Zhang, J., Lin, H., Jegelka, S., Sra, S., & Jadbabaie, A. (2020). Complexity of finding stationary points of nonconvex nonsmooth functions. In ICML.
Davis, D., Drusvyatskiy, D., Lee, Y. T., Padmanabhan, S., & Ye, G. (2022). A gradient sampling method with complexity guarantees for Lipschitz functions in high and low dimensions. In NeurIPS.
Take-home messages 50

¡ Deep neural networks are universal approximators


¡ Theoretically clear about:
¡ Approximation ability
¡ Optimization (learning process)

¡ Normalization methods have exponential benefits


¡ Long-standing open challenge about Generalization ability
