Wilson2020 Part2
Process Perspective
https://cims.nyu.edu/~andrewgw
Courant Institute of Mathematical Sciences
Center for Data Science
New York University
1 / 47
Last Time... Machine Learning for Econometrics
(The Start of My Journey...)
2 / 47
Autoregressive Conditional Heteroscedasticity (ARCH)
2003 Nobel Prize in Economics
3 / 47
Heteroscedasticity revisited...
Choice 1
Choice 2
4 / 47
Some conclusions...
▶ Flexibility isn't the whole story; inductive biases are at least as important.
▶ Degenerate model specification can be helpful, rather than something to necessarily avoid.
▶ Asymptotic results often mean very little. Rates of convergence, or even intuitions about non-asymptotic behaviour, are more meaningful.
▶ Infinite models (models with unbounded capacity) are almost always desirable, but the details matter.
▶ Releasing good code is crucial.
▶ Try to keep the approach as simple as possible.
▶ Empirical results often provide the most effective argument.
5 / 47
Model Selection
[Figure: monthly time-series observations (roughly 100–700) plotted against Year, 1949–1961]
6 / 47
A Function-Space View
f (x) = w0 + w1 x , (1)
w0 , w1 ∼ N (0, 1) . (2)
[Figure: sample functions f(x) drawn from this prior, plotted as Output f(x) against Input x over x ∈ [−10, 10]]
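A minimal sketch of this function-space view, assuming nothing beyond equations (1)–(2): sampling (w0, w1) from the prior induces a distribution over straight-line functions, with E[f(x)] = 0 and cov[f(x), f(x′)] = 1 + x x′.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 200)

# Each draw of (w0, w1) ~ N(0, 1) gives one random function f(x) = w0 + w1 * x,
# i.e. a distribution over parameters induces a distribution over functions.
samples = np.array([w0 + w1 * x for w0, w1 in rng.standard_normal((5000, 2))])

print(samples.mean(axis=0)[:3].round(2))          # ≈ 0 everywhere
emp_cov = np.cov(samples[:, 0], samples[:, -1])   # f at x = -10 and x = +10
print(emp_cov.round(1))                           # ≈ [[101, -99], [-99, 101]], i.e. 1 + x·x′
```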
7 / 47
Model Construction and Generalization
[Figure: marginal likelihood p(D|M) over possible datasets for three model types:
▶ Well-specified model, calibrated inductive biases (example: CNN)
▶ Simple model, poor inductive biases (example: linear function)
▶ Complex model, poor inductive biases (example: MLP)]
8 / 47
How do we learn?
▶ The ability of a system to learn is determined by its support (which solutions are a priori possible) and its inductive biases (which solutions are a priori likely).
▶ We should not conflate flexibility and complexity.
▶ An influx of new massive datasets provides great opportunities to automatically learn rich statistical structure, leading to new scientific discoveries.
9 / 47
What is Bayesian learning?
10 / 47
Why Bayesian Deep Learning?
11 / 47
Mode Connectivity
12 / 47
Better Marginalization
p(y|x∗, D) = ∫ p(y|x∗, w) p(w|D) dw .   (4)
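In practice the integral in (4) is approximated by a Monte Carlo average over posterior samples of the weights. A minimal sketch; the sampler and toy predictive model below are placeholders, not from the references:

```python
import numpy as np

def bayesian_model_average(x_star, posterior_weight_samples, predict):
    """Monte Carlo approximation of Eq. (4):
    p(y | x*, D) ≈ (1/J) * sum_j p(y | x*, w_j),  with w_j ~ p(w | D)."""
    probs = np.stack([predict(x_star, w) for w in posterior_weight_samples])
    return probs.mean(axis=0)   # average the predictive distributions, not the weights

# Toy stand-in: 'predict' returns class probabilities for a 3-class problem.
def predict(x_star, w):
    logits = w @ x_star
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
weight_samples = rng.standard_normal((100, 3, 4))   # pretend draws from p(w | D)
print(bayesian_model_average(rng.standard_normal(4), weight_samples, predict))
```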
[1] Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift.
Ovadia et al., 2019
[2] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson and Izmailov, 2020
18 / 47
Double Descent
Reconciling modern machine learning practice and the bias-variance trade-off. Belkin et al., 2018
19 / 47
Double Descent
20 / 47
Bayesian Model Averaging Alleviates Double Descent
[Figure: test error (y-axis ≈ 30–40) versus ResNet-18 width (10–50)]
Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020
21 / 47
Neural Network Priors
[1] Deep Image Prior. Ulyanov, D., Vedaldi, A., Lempitsky, V. CVPR 2018.
[2] Understanding Deep Learning Requires Rethinking Generalization. Zhang et al., ICLR 2017.
[3] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.
22 / 47
Tempered Posteriors
23 / 47
Cold Posteriors
Wenzel et al. (2020) highlight the result that, for a prior p(w) = N(0, I), cold posteriors with temperature T < 1 often provide improved performance.
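For reference, the cold posterior in Wenzel et al. (2020) rescales the posterior energy by 1/T, so that T = 1 recovers the standard Bayes posterior and T < 1 sharpens it:

p_T(w|D) ∝ exp(−U(w)/T) ,   U(w) = −∑_{i=1}^{n} log p(y_i | x_i, w) − log p(w) .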
How good is the Bayes posterior in deep neural networks really? Wenzel et al., ICML 2020.
24 / 47
Prior Misspecification?
They suggest the result is due to prior misspecification, showing that sample functions from p(f(x)) tend to assign a single class to most of the data on CIFAR-10.
25 / 47
Changing the prior variance scale α
Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.
26 / 47
The effect of data on the posterior
[Figure: class probabilities over classes 0–9: panel (a) the prior (α = √10), panel (b) after 10 datapoints, with further panels as more data are observed]
27 / 47
Neural Networks from a Gaussian Process Perspective
28 / 47
Prior Class Correlations
[Figure: prior correlations between MNIST classes (0, 1, 2, 4, 7) induced by the network prior: correlations near 0.97–0.99 at the largest prior scale fall to roughly 0.71–0.90 by panel (g), α = 1; panel (h) shows NLL as a function of the prior std α]
Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.
29 / 47
Thoughts on Tempering (Part 1)
30 / 47
Thoughts on Tempering (Part 2)
▶ While the prior p(f(x)) is certainly misspecified, the result of assigning one class to most data is a soft prior bias, which (1) doesn't hurt the predictive distribution, (2) is easily corrected by appropriately setting the prior parameter variance α², and (3) is quickly modulated by data.
▶ More important is the induced covariance function (kernel) over images, which appears reasonable. The deep image prior and random network feature results also suggest this prior is largely reasonable.
▶ In addition to not tuning α, the result in Wenzel et al. (2020) could have been exacerbated by a lack of multimodal marginalization.
▶ There are cases when T < 1 will help given a finite number of samples, even if the untempered model is correctly specified. Imagine estimating the mean of N(0, I) in d dimensions from samples, where d ≫ 1: each sample will have norm close to √d, even though the true mean is at the origin (see the sketch below).
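A minimal numerical sketch of that last point; the dimension and sample count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000          # high dimension
n_samples = 20      # a finite number of samples

samples = rng.standard_normal((n_samples, d))  # draws from N(0, I_d)

# Each individual sample lies near the sphere of radius sqrt(d),
# even though the true mean is the origin.
norms = np.linalg.norm(samples, axis=1)
print("sqrt(d)          =", np.sqrt(d))           # 100.0
print("sample norms     ≈", norms.round(1))       # all close to 100

# Averaging the samples (analogous to marginalizing) pulls the
# estimate of the mean back toward the origin.
print("norm of the mean =", np.linalg.norm(samples.mean(axis=0)).round(1))
```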
31 / 47
Rethinking Generalization
[1] Understanding Deep Learning Requires Rethinking Generalization. Zhang et al., ICLR 2017.
[2] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.
32 / 47
Model Construction
[Figure: marginal likelihood p(D|M) over possible datasets for three model types:
▶ Well-specified model, calibrated inductive biases (example: CNN)
▶ Simple model, poor inductive biases (example: linear function)
▶ Complex model, poor inductive biases (example: MLP)]
33 / 47
Function Space Priors
34 / 47
PAC-Bayes
PAC-Bayes provides explicit generalization error bounds for a stochastic network with posterior Q, prior P, n training points, and probability at least 1 − δ, based on

√( [ KL(Q||P) + log(n/δ) ] / [ 2(n − 1) ] ) .   (6)
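A small worked evaluation of the bound in (6); the KL value, sample size, and δ below are hypothetical, chosen only to show the scale of the resulting gap:

```python
import math

def pac_bayes_gap(kl, n, delta):
    """Right-hand side of Eq. (6): with probability >= 1 - delta, an upper
    bound on the gap between expected and empirical risk of the stochastic
    predictor Q."""
    return math.sqrt((kl + math.log(n / delta)) / (2 * (n - 1)))

# Hypothetical values: KL(Q||P) in nats, 50k training points, 95% confidence.
print(pac_bayes_gap(kl=5_000.0, n=50_000, delta=0.05))  # ~0.224
print(pac_bayes_gap(kl=500.0,   n=50_000, delta=0.05))  # ~0.072
```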
35 / 47
Rethinking Parameter Counting: Effective Dimension
[Figure: top, train loss, test loss, and effective dimensionality N_eff(Hessian) as a function of width; bottom, effective dimensionality, test loss, and train loss as a function of width and depth]
N_eff(H) = ∑_i λ_i / (λ_i + α)
Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited.
W. Maddox, G. Benton, A.G. Wilson, 2020.
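A minimal sketch of computing N_eff(H) from a hypothetical Hessian eigenvalue spectrum; here λ_i are the eigenvalues of the Hessian and α is a regularization constant:

```python
import numpy as np

def effective_dimensionality(hessian_eigenvalues, alpha):
    """N_eff(H) = sum_i lam_i / (lam_i + alpha), with lam_i the eigenvalues
    of the Hessian and alpha a regularization constant."""
    lam = np.asarray(hessian_eigenvalues)
    return float(np.sum(lam / (lam + alpha)))

# Hypothetical spectrum: a few large, dominant directions and many
# near-zero (degenerate) ones.
eigs = np.concatenate([np.array([100.0, 50.0, 10.0]), np.full(997, 1e-4)])
print(effective_dimensionality(eigs, alpha=1.0))  # ~3, despite 1000 parameters
```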
36 / 47
Properties in Degenerate Directions
37 / 47
Gaussian Processes and Neural Networks
Introduction to Gaussian processes. MacKay, D. J. In Bishop, C. M. (ed.), Neural Networks and Machine
Learning, Chapter 11, pp. 133-165. Springer-Verlag, 1998.
38 / 47
Deep Kernel Learning Review
Deep kernel learning combines the inductive biases of deep learning architectures
with the non-parametric flexibility of Gaussian processes.
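A rough sketch of the construction, not the authors' implementation: a base kernel is applied to learned network features, k_deep(x, x′) = k_base(g(x; w), g(x′; w)), and the network weights w are trained jointly with the kernel hyperparameters (in the paper, through the GP marginal likelihood). All names and shapes below are illustrative:

```python
import numpy as np

def rbf_kernel(z1, z2, lengthscale=1.0):
    """Base RBF kernel applied to (possibly learned) feature representations."""
    d2 = np.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def small_net(x, w1, w2):
    """A toy two-layer network g(x; w) standing in for a deep architecture."""
    return np.tanh(x @ w1) @ w2

def deep_kernel(x1, x2, w1, w2, lengthscale=1.0):
    """Deep kernel: k_deep(x, x') = k_base(g(x; w), g(x'; w))."""
    return rbf_kernel(small_net(x1, w1, w2), small_net(x2, w1, w2), lengthscale)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))            # 5 inputs with 4 features
w1, w2 = rng.standard_normal((4, 8)), rng.standard_normal((8, 2))
print(deep_kernel(x, x, w1, w2).shape)     # (5, 5) covariance matrix
```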
[Figure: a deep architecture with input layer x_1, …, x_D, hidden layers h^(1), …, h^(L) parameterized by W^(1), …, W^(L), feeding an infinite basis-function layer h_∞(θ) that produces outputs y_1, …, y_P]
Deep Kernel Learning. Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P. AISTATS, 2016
39 / 47
Face Orientation Extraction
Training data
Test data
Figure: Top: Randomly sampled examples of the training and test data. Bottom: The
two-dimensional outputs of the convolutional network on a set of test cases. Each
point is shown using a line segment that has the same orientation as the input face.
40 / 47
Learning Flexible Non-Euclidean Similarity Metrics
Figure: Left: The induced covariance matrix using DKL-SM (spectral mixture)
kernel on a set of test cases, where the test samples are ordered according to the
orientations of the input faces. Middle: The respective covariance matrix using
DKL-RBF kernel. Right: The respective covariance matrix using regular RBF
kernel. The models are trained with n = 12,000.
41 / 47
Kernels from Infinite Bayesian Neural Networks
▶ The neural network kernel (Neal, 1996) is famous for triggering research on Gaussian processes in the machine learning community.
Consider a neural network with one hidden layer:
f(x) = b + ∑_{i=1}^{J} v_i h(x; u_i) .   (7)
▶ b is a bias, v_i are the hidden-to-output weights, h is any bounded hidden unit transfer function, u_i are the input-to-hidden weights, and J is the number of hidden units. Let b and v_i be independent with zero mean and variances σ_b² and σ_v²/J, respectively, and let the u_i have independent identical distributions.
Collecting all free parameters into the weight vector w,
E_w[f(x)] = 0 ,   (8)
cov[f(x), f(x′)] = E_w[f(x) f(x′)] = σ_b² + σ_v² (1/J) ∑_{i=1}^{J} E_u[h_i(x; u_i) h_i(x′; u_i)] ,   (9)
f(x) = b + ∑_{i=1}^{J} v_i h(x; u_i) .   (11)
▶ Let h(x; u) = erf(u_0 + ∑_{j=1}^{P} u_j x_j), where erf(z) = (2/√π) ∫_0^z e^{−t²} dt
▶ Choose u ∼ N(0, Σ)
Then we obtain
k_NN(x, x′) = (2/π) sin⁻¹( 2x̃ᵀΣx̃′ / √[(1 + 2x̃ᵀΣx̃)(1 + 2x̃′ᵀΣx̃′)] ) ,   (12)
where x̃ = (1, x_1, …, x_P)ᵀ is the input augmented with a bias component.
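A minimal implementation sketch of the kernel in (12), assuming x̃ is the bias-augmented input as defined above; the example Σ and inputs are arbitrary:

```python
import numpy as np

def nn_kernel(X1, X2, Sigma):
    """Neural-network (arcsine) kernel of Eq. (12):
    k(x, x') = (2/pi) * arcsin( 2 xt^T Sigma xt' /
               sqrt((1 + 2 xt^T Sigma xt)(1 + 2 xt'^T Sigma xt')) ),
    where xt = (1, x_1, ..., x_P) is the bias-augmented input."""
    aug = lambda X: np.hstack([np.ones((X.shape[0], 1)), X])
    A1, A2 = aug(X1), aug(X2)
    cross = 2.0 * A1 @ Sigma @ A2.T
    n1 = 1.0 + 2.0 * np.einsum("ij,jk,ik->i", A1, Sigma, A1)
    n2 = 1.0 + 2.0 * np.einsum("ij,jk,ik->i", A2, Sigma, A2)
    return (2.0 / np.pi) * np.arcsin(cross / np.sqrt(np.outer(n1, n2)))

# Example for 1-D inputs with Sigma = diag(sigma_0, sigma), as on the next slides.
X = np.linspace(-3, 3, 5).reshape(-1, 1)
K = nn_kernel(X, X, Sigma=np.diag([1.0, 10.0]))
print(np.round(K, 2))   # 5 x 5 covariance matrix
```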
43 / 47
Neural Network Kernel
k_NN(x, x′) = (2/π) sin⁻¹( 2x̃ᵀΣx̃′ / √[(1 + 2x̃ᵀΣx̃)(1 + 2x̃′ᵀΣx̃′)] )   (13)
Set Σ = diag(σ0 , σ). Draws from a GP with a neural network kernel with varying σ:
Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006
44 / 47
Neural Network Kernel
k_NN(x, x′) = (2/π) sin⁻¹( 2x̃ᵀΣx̃′ / √[(1 + 2x̃ᵀΣx̃)(1 + 2x̃′ᵀΣx̃′)] )   (14)
Set Σ = diag(σ0 , σ). Draws from a GP with a neural network kernel with varying σ:
Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006
45 / 47
NN → GP Limits and Neural Tangent Kernels
▶ Several recent works [e.g., 2-9] have extended Radford Neal's limits to multilayer nets and other architectures.
▶ Closely related work also derives neural tangent kernels from infinite neural network limits, with promising results.
▶ Note that most kernels from infinite neural network limits have a fixed structure. On the other hand, standard neural networks essentially learn a similarity metric (kernel) for the data, and learning a kernel amounts to representation learning. Bridging this gap is interesting future work.
46 / 47
What’s next?
47 / 47