RADL TQKhoat
Khoat Than
Hanoi University of Science and Technology
• Recent breakthroughs
• The open theoretical challenge
• Basic concepts in learning theory
• Some theories for deep neural networks
• Theoretical benefits of normalization methods
Some successes: AlphaGo (2016)
Some successes: AlphaFold (2020)
"This computational work represents a stunning advance on the protein-folding problem, a 50-year-old grand challenge in biology."
– Venki Ramakrishnan, Nobel Laureate
An AI-Generated Picture Won an Art Prize
@Jason Allen + Midjourney
https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html
Some successes: Text-to-image (2022)
[Images generated for the prompt "A bowl of soup" by DALL-E 2, Midjourney, and Imagen]
Some successes: ChatGPT (2022)
[Quotation about ChatGPT, Forbes, 2/2023]
Open question
Learning theory
Basic concepts
The learning problem
• Training data D = {(x_1, y_1), …, (x_M, y_M)}, drawn from an (unknown) distribution P; a loss function f; a family ℋ of candidate functions; y* denotes the true labeling function
• Empirical loss: the loss of a function h on the training set D
$$F(\boldsymbol{D}, h) = \frac{1}{M} \sum_{i=1}^{M} f\big(y_i, h(\boldsymbol{x}_i)\big)$$
• Learner: a learning algorithm that can select one h ∈ ℋ, based on a training set D
• Learning goal: find one h ∈ ℋ with small expected loss
  • It should generalize well to future data
  • A small training loss/error is not enough
• Learning ≠ Fitting
  • Fitting focuses on minimizing the training loss F(D, h)
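For reference, the expected loss mentioned in the learning goal is the average loss under the data distribution P; in the notation of these slides it can be written as (a standard definition, added here for completeness):
$$F(P, h) = \mathbb{E}_{(\boldsymbol{x}, y) \sim P}\,\big[f\big(y, h(\boldsymbol{x})\big)\big]$$
The empirical loss F(D, h) above is its finite-sample estimate, which is why a small training loss alone does not guarantee a small expected loss.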
Errors of a trained model
• Note (h_D denotes the model returned by the learner on D, h* the best function in ℋ w.r.t. the expected loss, and h*_D the minimizer of the training loss F(D, ·) over ℋ):
$$F(P, h_D) - F(P, y^*) = \big[F(P, h_D) - F(P, h^*)\big] + \big[F(P, h^*) - F(P, y^*)\big]$$
• Estimation error: F(P, h_D) − F(P, h*)
  • How good is the training algorithm?
• Approximation error: F(P, h*) − F(P, y*)
  • Capacity (representational power) of the family ℋ

Bousquet et al. Introduction to statistical learning theory. In Machine Learning, LNAI, volume 3176. Springer, 2004.
Error decomposition
• Estimation error:
$$|F(P, h_D) - F(P, h^*)| \le |F(\boldsymbol{D}, h_D) - F(\boldsymbol{D}, h_D^*)| + 2 \sup_{h \in \mathcal{H}} |F(P, h) - F(\boldsymbol{D}, h)|$$
  • It can be decomposed into two types of error: an optimization error (the first term) and a generalization error (the supremum term)
• In summary:
  • Function space ℋ
    • A bigger space (ℋ′) gives a (probably) smaller approximation error
    • More complex members give a (probably) smaller approximation error → larger capacity
    • An effective space (ℋ*) is enough → not too big/complex
  • Training algorithm 𝒜
    • A better 𝒜 implies a smaller estimation error of the trained model
    • A bad 𝒜 can provide a small optimization error but a large generalization error → overfitting
    • A good 𝒜 can localize an effective subset ℋ* ⊂ ℋ
  • Data
    • Complexity of the data space (data manifolds)
    • Representativeness of the training samples, …
A unified view

[Diagram: the total error decomposes into the approximation error, the optimization error, and the generalization error]

• Approximation error:
$$|F(P, y^*) - F(P, h^*)| \le \epsilon_a$$
• Optimization error:
$$|F(\boldsymbol{D}, h_D^*) - F(\boldsymbol{D}, h_D)| \le \epsilon_o$$
  • Depends on the number of training iterations (epochs)
  • Depends on the capacity of the learning algorithm 𝒜
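Putting the decomposition together (using the estimation-error inequality from the previous slide), the excess risk of the trained model is controlled by all three terms at once:
$$F(P, h_D) - F(P, y^*) \;\le\; \epsilon_a + \epsilon_o + 2 \sup_{h \in \mathcal{H}} |F(P, h) - F(\boldsymbol{D}, h)|$$
The last term is exactly the uniform generalization gap bounded on the next slide, so a learning guarantee needs the approximation, optimization, and generalization errors to be small simultaneously.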
Bounds on Generalization Error

• Generalization error: F(P, h_D) − F(D, h_D) ≤ ε_g

[Diagram: data space → learning algorithm → function space]

• Uniform bounds:
$$\sup_{h \in \mathcal{H}} |F(P, h) - F(\boldsymbol{D}, h)| \le \epsilon_g$$
  • Bound the generalizability of the worst member of ℋ
  • May not be a good way to explain a learned function h_D
• PAC-Bayes bounds:
$$\mathbb{E}_{h \in \mathcal{H}}\big[F(P, h) - F(\boldsymbol{D}, h)\big] \le \epsilon_g$$
  • Study the error on average over ℋ
  • May not explain a learned function h_D
Bousquet et al. Introduction to statistical learning theory. In Machine Learning, LNAI, volume 3176. Springer, 2004.
Nagarajan & Kolter. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems. 2019.
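For concreteness, a standard form of such a uniform bound (e.g., in Bousquet et al., 2004, or Mohri et al., 2018; stated here up to constants for a loss bounded in [0, 1]) is: with probability at least 1 − δ over the draw of the m training samples,
$$\sup_{h \in \mathcal{H}} \big(F(P, h) - F(\boldsymbol{D}, h)\big) \;\le\; 2\,\mathfrak{R}_m(\mathcal{H}) + O\!\left(\sqrt{\frac{\log(1/\delta)}{m}}\right),$$
where 𝔎_m(ℋ) denotes the Rademacher complexity of the associated loss class, so the quality of the bound hinges on the complexity measure used for ℋ.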
• An ANN:
  • Consists of many neurons, organized in a layer-wise manner
  • Each neuron computes a simple function
  • A neuron can have a few connections to other neurons
• Approximation: classical universal-approximation results guarantee some h ∈ ℋ with ‖y* − h‖ ≤ ε_a
• Increase capacity → approximate better
  • Larger family ℋ′
  • More complex NNs → stronger representational power
  • E.g., wider or deeper NNs (a small numerical sketch follows the references below)
Cybenko, G. (1989). Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251-257.
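To make the "wider networks approximate better" point concrete, here is a small illustrative sketch (my own, not from the slides or the cited papers): a one-hidden-layer tanh network whose hidden weights are random and whose output weights are fitted by least squares. As the width grows, the approximation error on a fixed smooth target typically shrinks; note that this says nothing about how to find such a network from data, which is exactly the caveat of the next slide.

import numpy as np

# Approximate a smooth 1-D target with a one-hidden-layer tanh network
# h(x) = sum_j c_j * tanh(w_j * x + b_j); the hidden units are random,
# only the output weights c are fitted (by least squares).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 200)[:, None]
x_test = np.linspace(0.0, 1.0, 1000)[:, None]
target = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * np.pi * x)

for width in [2, 8, 32, 128, 512]:
    w = rng.normal(0.0, 10.0, size=(1, width))      # random hidden weights
    b = rng.uniform(-5.0, 5.0, size=width)          # random hidden biases
    phi = lambda z: np.tanh(z @ w + b)              # hidden-layer features
    c, *_ = np.linalg.lstsq(phi(x_train), target(x_train), rcond=None)
    err = np.abs(phi(x_test) @ c - target(x_test)).max()
    print(f"width = {width:4d}   sup-norm error on [0,1] ~ {err:.3f}")

Increasing the width enlarges the family ℋ, which is the capacity effect that the classical results of Cybenko and Hornik formalize.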
Approximation error: modern
Universal tools: modern architectures are also universal approximators, e.g., ResNet with one-neuron hidden layers and deep convolutional networks (see the references below)
Lin, H., & Jegelka, S. (2018). ResNet with one-neuron hidden layers is a universal approximator. NeurIPS.
Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., & Liao, Q. (2017). Why and when can deep-but not shallow-
networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing.
Zhou, D. X. (2020). Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis.
Approximation: existence ↛ method
Unclear how to find such DNNs, based on a training set
Optimization error
• Practice:
  • Often have zero training error → a global solution h*_D?
  • DNNs can easily fit a random labelling of the data perfectly [Zhang et al. 2021] (training seems to be easy!); a small sketch follows the references below
• Gradient descent (GD) achieves zero training loss in polynomial time for a deep over-parameterized ResNet [Du et al. 2019]
  • Over-parameterization: #parameters ≫ training size
• GD can find a global optimum when the width of the last hidden layer of an MLP exceeds the number of training samples [Nguyen, 2021]
• Stochastic gradient descent (SGD) can find global minima of the training objective of DNNs in polynomial time [Allen-Zhu et al. 2019]
  • Architectures: MLP, CNN, ResNet
Du, S., Lee, J., Li, H., Wang, L., & Zhai, X. (2019). Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning.
Nguyen, Q. (2021). On the proof of global convergence of gradient descent for deep relu networks with linear widths. In International Conference on Machine Learning.
Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning.
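The random-labelling observation is easy to reproduce. The sketch below (my own, assuming PyTorch is available; it is not the exact setup of Zhang et al.) trains an over-parameterized MLP on Gaussian inputs with completely random labels; the training loss typically goes to (near) zero, even though nothing generalizable is being learned.

import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, classes = 256, 32, 10                  # training size, input dim, #classes
x = torch.randn(n, d)
y = torch.randint(0, classes, (n,))          # completely random labels

model = nn.Sequential(                       # ~84k parameters >> 256 samples
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, classes),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2001):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        acc = (model(x).argmax(dim=1) == y).float().mean().item()
        print(f"step {step:4d}   train loss {loss.item():.4f}   train accuracy {acc:.2f}")

A perfect fit of noise is a global solution of the training problem, which is why the reminder on the next slide matters.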
Optimization: reminder
However, global optimality of the training problem does not imply good predictive ability.
Bias-Variance tradeoff: classical view

[Figure from Hastie et al.: expected prediction error vs. model complexity. The test-sample error follows a U-shape, from high bias/low variance (simple models) to low bias/high variance (complex models), with bias decreasing and variance increasing along the way; the training-sample error decreases monotonically with complexity.]
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer, 2009.
Bias-Variance: modern behavior

[Figure from Belkin et al. (2019): training risk (dashed) and test risk (solid) vs. capacity. (A) The classical U-shaped risk curve arising from the bias–variance trade-off. (B) The double-descent risk curve: the U-shaped "classical" regime together with the behavior observed with high-capacity function classes (the "modern" interpolating regime), separated by the interpolation threshold; the predictors to the right of the threshold fit the training data perfectly yet their test risk decreases again. An illustrative sketch follows below.]

Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.
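The modern curve can be reproduced with a very small experiment. The sketch below (my own illustration, not the experiment of Belkin et al.) fits minimum-norm least squares on random ReLU features; as the number of features passes the number of training samples, the test error typically spikes near the interpolation threshold and then decreases again.

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 2000, 5
w_true = rng.normal(size=d)

def make_data(n):
    x = rng.normal(size=(n, d))
    y = x @ w_true + 0.5 * rng.normal(size=n)    # noisy linear target
    return x, y

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

V = rng.normal(size=(d, 1000))                   # fixed random projection
def features(x, k):
    """Random ReLU features; more columns = higher capacity."""
    return np.maximum(x @ V[:, :k], 0.0)

for k in [10, 50, 90, 100, 110, 200, 500, 1000]:
    coef = np.linalg.pinv(features(x_tr, k)) @ y_tr   # minimum-norm least squares
    test_mse = np.mean((features(x_te, k) @ coef - y_te) ** 2)
    print(f"#features = {k:4d}   test MSE ~ {test_mse:.2f}")

The peak sits where the model just barely interpolates the training set (here around 100 features); with many more features the minimum-norm solution becomes smoother and the test error drops again.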
Generalization ability: a long-standing open problem
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning. MIT press.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM.
Generalization: VC dimension
• Example: in n-dimensional space
  • Linear models: VC(ℋ) = n + 1
  • ReLU networks with W weights: VC(ℋ) = Ω(W log W)

Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research.
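For context, the reason the VC dimension matters is that it controls a uniform bound of roughly the following form (standard VC theory, e.g., Mohri et al., 2018; not stated explicitly on the slide): with probability at least 1 − δ,
$$\sup_{h \in \mathcal{H}} \big|F(P, h) - F(\boldsymbol{D}, h)\big| \;\le\; O\!\left(\sqrt{\frac{VC(\mathcal{H}) \log m + \log(1/\delta)}{m}}\right).$$
With VC(ℋ) = Ω(W log W) for ReLU networks, this is vacuous whenever the number of weights W far exceeds the sample size m, which is one motivation for the norm-based measures discussed next.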
• DNN: $h(\boldsymbol{x}, \boldsymbol{W}) = g_L(\boldsymbol{W}_L h_{L-1})$
• Bartlett (1998): #params is not important
  • The size of the weights may be more important
• Neyshabur et al.; Golowich et al.:
$$F(P, h) - F(\boldsymbol{D}, h) \le O\big(\|\boldsymbol{W}_1\|_F \cdots \|\boldsymbol{W}_L\|_F\big)/\sqrt{m}$$
• Bartlett et al.:
$$F(P, h) - F(\boldsymbol{D}, h) \le O\big(\|\boldsymbol{W}_1\|_\sigma \cdots \|\boldsymbol{W}_L\|_\sigma\big)/\sqrt{m}$$
  → Uninformative for modern DNNs (a small numerical sketch follows the references below)

[Figure from Arora et al. (2018): comparison of norm-based generalization bounds with the empirical generalization error during training, for randomly initialized and trained networks, together with the distributions of the layer cushion and the minimal inter-layer cushion.]
Arora, S., Ge, R., Neyshabur, B., & Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. In ICML.
Bartlett, P. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory.
Bartlett, P. L., Foster, D. J., & Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In NeurIPS.
Golowich, N., Rakhlin, A., & Shamir, O. (2020). Size-independent sample complexity of neural networks. Information and Inference: A Journal of the IMA.
Neyshabur, B., Bhojanapalli, S., & Srebro, N. (2018). A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks. In ICLR.
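The quantities inside these bounds are easy to compute. The sketch below (my own, in NumPy) evaluates the two products for a randomly initialized deep MLP; both grow quickly with depth, which gives some intuition for why such bounds can be vacuous for modern DNNs.

import numpy as np

rng = np.random.default_rng(0)
width, depth, m = 512, 10, 50_000            # m: a hypothetical training-set size

# He-style random initialization of square weight matrices.
weights = [rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
           for _ in range(depth)]

prod_fro = np.prod([np.linalg.norm(W, 'fro') for W in weights])
prod_spec = np.prod([np.linalg.norm(W, 2) for W in weights])   # largest singular values

print(f"product of Frobenius norms / sqrt(m): {prod_fro / np.sqrt(m):.3e}")
print(f"product of spectral norms  / sqrt(m): {prod_spec / np.sqrt(m):.3e}")

For trained networks the picture is more subtle (the full bounds in the cited papers include margin terms and further correction factors), but the depth-wise product structure is the part that normalization methods can tame, as the last part of the talk argues.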
Generalization: PAC-Bayes

[Diagram: data space → learning algorithm → function space]

• PAC-Bayes bounds, however, are mostly for the average (randomized) predictor over ℋ rather than for the specific trained function.
Normalization methods
The exponential benefits
Batch Normalization
• DNN + BN
• Batch normalization [Ioffe & Szegedy, 2015]
• Layer normalization [Ba et al., 2016]
• Instance normalization [Ulyanov et al., 2016]
• Group normalization [Wu & He, 2020]
• Spectral normalization [Miyato et al., 2018]
• Weight normalization [Salimans & Kingma, 2016]
• …
Ioffe & Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML.
Ba, Kiros, and Hinton. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018) Spectral Normalization for Generative Adversarial Networks. In ICLR.
Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS.
Wu & He. (2020). Group normalization. International Journal of Computer Vision.
From Santurkar et al. (2018): Broadly speaking, BatchNorm is a mechanism that aims to stabilize the distribution (over a mini-batch) of inputs to a given network layer during training. This is achieved by augmenting the network with additional layers that set the first two moments (mean and variance) of the distribution of each activation to be zero and one respectively. Then, the batch-normalized inputs are also typically scaled and shifted based on trainable parameters to preserve model expressivity. This normalization is applied before the non-linearity of the previous layer.
One of the key motivations for the development of BatchNorm was the reduction of the so-called internal covariate shift (ICS). This reduction has been widely viewed as the root of BatchNorm's success. Ioffe and Szegedy [10] describe ICS as the phenomenon wherein the distribution of inputs to a layer in the network changes due to an update of the parameters of the previous layers; this amounts to a constant shift of the underlying training problem and is thus believed to harm the training process.

Normalizers: many why's
• DNN: $h(\boldsymbol{x}, \boldsymbol{W}) = g_L(\boldsymbol{W}_L h_{L-1})$, where $h_i = g_i(\boldsymbol{W}_i h_{i-1})$ and $h_0 = \boldsymbol{x}$
• Batch normalization: $h_i = g_i\big(BN(\boldsymbol{W}_i h_{i-1})\big)$
• Why can they speed up training?
  • BN can reduce the Lipschitz constant of the loss → flattens the loss [Santurkar et al. 2018; Lyu et al. 2022]
  • Unclear for the other normalizers
• Do they control the capacity of a neural net?
  • Yes, for a single-layer perceptron [Luo et al. 2019]
• Why can they improve generalization? Unclear
Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization?. In NeurIPS.
Luo, P., Wang, X., Shao, W., & Peng, Z. (2019). Towards Understanding Regularization in Batch Normalization. In ICLR.
Lyu, Li, & Arora. (2022). Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction. In NeurIPS.
Normalization

• Lipschitz constant
  • Tells how fast a function h(x) can change at an input x
  • Represents the complexity of h(x) → capacity of h
• Batch normalization (BN) normalizes the individual inputs of a layer according to their own distributions, so two different inputs are normalized independently. Each sample x^(i) of a signal x with population mean μ_x and variance σ_x² is normalized as
$$BN(x^{(i)}, \gamma, \epsilon) = \frac{\gamma}{\sqrt{\sigma_x^2 + \epsilon}}\big(x^{(i)} - \mu_x\big) \qquad (1)$$
  where γ is trainable and ε is a smoothing constant.
• Note that ∂BN(x, γ, ε)/∂x = γ/√(σ_x² + ε), since μ_x and σ_x represent the population distribution of signal x. This Jacobian is large only when the variance of x is small, so the magnitude of ‖BN‖_Lip depends on the distribution of signal x only, not on the other signals in the same layer.
• Lipschitz constant w.r.t. the input (Lemma 1): for an input x = (x_1, …, x_n) whose components are normalized as in (1) with population variances σ_k²,
$$\|BN\|_{Lip} = \left\|\boldsymbol{\gamma}/\bar{\boldsymbol{\sigma}}\right\|, \quad \text{where } \boldsymbol{\gamma}/\bar{\boldsymbol{\sigma}} = \Big(\gamma_1/\sqrt{\sigma_1^2 + \epsilon},\; \ldots,\; \gamma_n/\sqrt{\sigma_n^2 + \epsilon}\Big)$$
  • Hence ‖BN‖_Lip is small when all the input variances are large; a higher-varying input is penalized more strongly.
• Other normalizers:
  • Layer norm: ‖LN‖_Lip ≤ (1 − 1/n) ‖γ/σ̄‖ (for a layer with n neurons)
  • Group norm: ‖GN‖_Lip ≤ (1 − 1/n_g) ‖γ/σ̄‖ (for a group of n_g neurons)
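A minimal numerical sketch of formula (1) and of the Lipschitz factor γ/√(σ² + ε) above (my own NumPy illustration, not code from the slides; the function names are hypothetical). It normalizes each feature of a mini-batch independently and reports the per-feature slope, showing that a high-variance input gets a small slope.

import numpy as np

def batch_norm(x, gamma, eps=1e-5):
    """Normalize each feature (column) of x independently:
    BN(x) = gamma * (x - mu) / sqrt(var + eps), as in formula (1)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps)

def bn_lipschitz_factor(x, gamma, eps=1e-5):
    """Per-feature slope |gamma| / sqrt(var + eps), i.e. the quantity
    entering ||BN||_Lip in the lemma above."""
    return np.abs(gamma) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = np.stack([rng.normal(0.0, 0.1, 256),    # low-variance feature
              rng.normal(0.0, 10.0, 256)],  # high-variance feature
             axis=1)
gamma = np.ones(2)
print(batch_norm(x, gamma).std(axis=0))     # both features now have std ~ 1
print(bn_lipschitz_factor(x, gamma))        # large for feature 1, small for feature 2

Only the statistics of each individual feature enter its slope, which is the independence-across-inputs property stated above.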
Exponential benefits: Capacity control

[Figure 1: Comparison of ResNet-18 and its unnormalized version (w/o BN). The variance (σ²) of the input distribution before each BN layer is calculated per-dimension over mini-batches. The leftmost subfigure reports the variance at each layer after training, the next two show the dynamics of the variance along the training process, and the rightmost reports the top-5 accuracy. CIFAR10 is used in this experiment. DNNs without normalization often have high input variances at deep layers, while DNNs with BN often have small variances.]

• Lemma 4: for any h defined in Definition 1, ‖h‖_Lip ≤ ∏_{k=1}^{K} s_k; its normalized version satisfies
$$\|h_{no}\|_{Lip} \le \prod_{k=1}^{K} s_k \,\|NO_k\|_{Lip}$$

• Generalization error:
  • Depends heavily on the local behaviors around the training samples
  • A DNN with a smaller Lipschitz constant may generalize better ($L_i \le \|f\|_{Lip}\,\|h_{no}\|_{Lip|\mathcal{X}_i}$)
  • Explains why many adversarial training methods are reasonable

Properties of the bound (text shown on the slide): this bound depends only on the specific sample D and function h (but not the whole family ℋ), and hence is both hypothesis-specific and data-dependent. Only the assumption of Lipschitzness on some small areas around the individual samples of D is needed, so the bound can work even with non-Lipschitz functions and thus helps to analyze a large class of models. The uncertainty part g(·, D, ·) does not depend on ℋ, and |T_D| is often small (compared with N) for practical problems [20]. The first term, (1/m) Σ_i L_i, depends only on the function h, …
Exponential benefits: Generalization

$$F(P, \boldsymbol{h}) - F(\boldsymbol{D}, \boldsymbol{h}) \le \omega + O(m^{-0.5})$$
$$F(P, \boldsymbol{h}_{BN}) - F(\boldsymbol{D}, \boldsymbol{h}_{BN}) \le \omega \prod_{l=1}^{L} \left\|\boldsymbol{\gamma}_l/\boldsymbol{\sigma}_l\right\| + O(m^{-0.5}) \quad \text{(for BN)}$$
• Example: on a trained DNN the measured product ∏_l ‖γ_l/σ_l‖ ≈ 1.35 × 2^{−…}, i.e., exponentially small

• Backpropagation:
$$\frac{\partial f}{\partial \boldsymbol{W}_i} = \frac{\partial f}{\partial \boldsymbol{h}} \times \frac{\partial \boldsymbol{h}}{\partial \boldsymbol{y}_i} \times \boldsymbol{h}_{i-1}, \quad \text{where } \boldsymbol{y}_i = \boldsymbol{W}_i \boldsymbol{h}_{i-1}$$
• The gradient w.r.t. the weights W_i depends on ∂h/∂y_i (the Lipschitz constant of h w.r.t. layer i)
• ‖h_no‖_Lip can be exponentially smaller than ‖h‖_Lip
  → F(D, h_no) can be exponentially flatter than F(D, h)
• Consequences for training:
  • Iteration complexity (lower bound on #iterations)
  • Convergence rate (upper bound on #iterations)
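To spell out where the "exponential" factor comes from before the convergence-rate slide: combining Lemma 4 with the Lipschitz constant of BN from the Normalization slide (a sketch in the slides' notation, where s_k is the Lipschitz constant of the k-th unnormalized layer) gives
$$\|h_{no}\|_{Lip} \;\le\; \prod_{k=1}^{K} s_k\,\|NO_k\|_{Lip} \;=\; \prod_{k=1}^{K} s_k \left\|\boldsymbol{\gamma}_k/\bar{\boldsymbol{\sigma}}_k\right\| \quad \text{(for BN)}.$$
When the per-layer factors ‖γ_k/σ̄_k‖ are on average below 1, this product shrinks exponentially with the depth K, which is what makes the generalization term and the flatness of F(D, h_no) above exponentially better.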
Exponential benefits: Convergence rate
• Iteration complexity is exponentially reduced by a normalizer