0% found this document useful (0 votes)

24 views47 pages

Wilson2020 Part2

This document provides a summary of Bayesian neural networks from a Gaussian process perspective by Andrew Gordon Wilson. It discusses Wilson's early work with Gaussian copula processes for modeling heteroscedasticity. It then covers model selection and the importance of inductive biases. The document discusses Bayesian deep learning and using the Bayesian model average to represent model uncertainty. It describes challenges in approximating the posterior distribution for neural networks and techniques like MultiSWAG to better approximate the Bayesian model average.

Uploaded by

hu jack

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

24 views47 pages

Wilson2020 Part2

Uploaded by

hu jack

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 47

Bayesian Neural Networks from a Gaussian

Process Perspective

Andrew Gordon Wilson

https://cims.nyu.edu/~andrewgw
Courant Institute of Mathematical Sciences
Center for Data Science
New York University

Gaussian Process Summer School

September 16, 2020

1 / 47
Last Time... Machine Learning for Econometrics
(The Start of My Journey...)

Autoregressive Conditional Heteroscedasticity (ARCH)

2003 Nobel Prize in Economics

y(t) = N (y(t); 0, a0 + a1 y(t − 1)2 )

2 / 47
Autoregressive Conditional Heteroscedasticity (ARCH)
2003 Nobel Prize in Economics

y(t) = N (y(t); 0, a0 + a1 y(t − 1)2 )

Gaussian Copula Process Volatility (GCPV)

(My First PhD Project)

y(x) = N (y(x); 0, f (x)2 )

f (x) ∼ GP(m(x), k(x, x0 ))

I Can approximate a much greater range of variance functions

I Operates on continuous inputs x
I Can effortlessly handle missing data
I Can effortlessly accommodate multivariate inputs x (covariates other than time)
I Observation: performance extremely sensitive to even small changes in
kernel hyperparameters

3 / 47
Heteroscedasticity revisited...

Which of these models do you prefer, and why?

Choice 1

y(x)|f (x), g(x) ∼ N (y(x); f (x), g(x)2 )

f (x) ∼ GP, g(x) ∼ GP

Choice 2

y(x)|f (x), g(x) ∼ N (y(x); f (x)g(x), g(x)2 )

f (x) ∼ GP, g(x) ∼ GP

4 / 47
Some conclusions...

I Flexibility isn’t the whole story, inductive biases are at least as important.
I Degenerate model specification can be helpful, rather than something to
necessarily avoid.
I Asymptotic results often mean very little. Rates of convergence, or even
intuitions about non-asymptotic behaviour, are more meaningful.
I Infinite models (models with unbounded capacity) are almost always desirable,
but the details matter.
I Releasing good code is crucial.
I Try to keep the approach as simple as possible.
I Empirical results often provide the most effective argument.

5 / 47
Model Selection

700

Airline Passengers (Thousands)

600

500

400

300

200

100
1949 1951 1953 1955 1957 1959 1961
Year

Which model should we choose?

4
3
X 10
X
(1): f1 (x) = w0 + w1 x (2): f2 (x) = wj x j (3): f3 (x) = wj xj
j=0 j=0

6 / 47
A Function-Space View

Consider the simple linear model,

f (x) = w0 + w1 x , (1)
w0 , w1 ∼ N (0, 1) . (2)

10
Output, f(x)

−5

−10

−15

−20

−25
−10 −8 −6 −4 −2 0 2 4 6 8 10
Input, x

7 / 47
Model Construction and Generalization

p(D|M)

Well-Specified Model
Calibrated Inductive Biases
Example: CNN

Simple Model
Poor Inductive Biases
Example: Linear Function
Complex Model
Poor Inductive Biases
Example: MLP

Corrupted CIFAR-10 MNIST Dataset

CIFAR-10
Structured Image Datasets

8 / 47
How do we learn?

I The ability for a system to learn is determined by its support (which solutions
are a priori possible) and inductive biases (which solutions are a priori likely).
I We should not conflate flexibility and complexity.
I An influx of new massive datasets provide great opportunities to automatically
learn rich statistical structure, leading to new scientific discoveries.

Bayesian Deep Learning and a Probabilistic Perspective of Generalization

Wilson and Izmailov, 2020
arXiv 2002.08791

9 / 47
What is Bayesian learning?

I The key distinguishing property of a Bayesian approach is marginalization

instead of optimization.
I Rather than use a single setting of parameters w, use all settings weighted by
their posterior probabilities in a Bayesian model average.

10 / 47
Why Bayesian Deep Learning?

Recall the Bayesian model average (BMA):

Z
p(y|x∗ , D) = p(y|x∗ , w)p(w|D)dw . (3)

I Think of each setting of w as a different model. Eq. (3) is a Bayesian model

average over models weighted by their posterior probabilities.
I Represents epistemic uncertainty over which f (x, w) fits the data.
I Can view classical training as using an approximate posterior
q(w|y, X) = δ(w = wMAP ).
I The posterior p(w|D) (or loss L = − log p(w|D)) for neural networks is
extraordinarily complex, containing many complementary solutions, which is
why BMA is especially significant in deep learning.
I Understanding the structure of neural network loss landscapes is crucial for
better estimating the BMA.

11 / 47
Mode Connectivity

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs.

T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A.G. Wilson. NeurIPS 2018.
Loss landscape figures in collaboration with Javier Ideami (losslandscape.com).

12 / 47
Mode Connectivity

13 / 47
Mode Connectivity

14 / 47
Mode Connectivity

15 / 47
Mode Connectivity

16 / 47
Better Marginalization

Z
p(y|x∗ , D) = p(y|x∗ , w)p(w|D)dw . (4)

I MultiSWAG forms a Gaussian mixture posterior from multiple independent

SWAG solutions.
I Like deep ensembles, MultiSWAG incorporates multiple basins of attraction in
the model average, but it additionally marginalizes within basins of attraction
for a better approximation to the BMA.
17 / 47
Better Marginalization: MultiSWAG

[1] Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift.
Ovadia et. al, 2019
[2] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson and Izmailov, 2020

18 / 47
Double Descent

Belkin et. al (2018)

Reconciling modern machine learning practice and the bias-variance trade-off. Belkin et. al, 2018

19 / 47
Double Descent

Should a Bayesian model experience double descent?

20 / 47
Bayesian Model Averaging Alleviates Double Descent

CIFAR-100, 20% Label Corruption

50
SGD
45 SWAG
Test Error (%) Multi-SWAG

10 20 30 40 50
ResNet-18 Width

Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020

21 / 47
Neural Network Priors

A parameter prior p(w) = N (0, α2 ) with a neural network architecture f (x, w)

induces a structured distribution over functions p(f (x)).

Deep Image Prior

I Randomly initialized CNNs without training provide excellent performance for
image denoising, super-resolution, and inpainting: a sample function from
p(f (x)) captures low-level image statistics, before any training.
Random Network Features
I Pre-processing CIFAR-10 with a randomly initialized untrained CNN
dramatically improves the test performance of a Gaussian kernel on pixels from
54% accuracy to 71%, with an additional 2% from `2 regularization.

[1] Deep Image Prior. Ulyanov, D., Vedaldi, A., Lempitsky, V. CVPR 2018.
[2] Understanding Deep Learning Requires Rethinking Generalzation. Zhang et. al, ICLR 2016.
[3] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.

22 / 47
Tempered Posteriors

In Bayesian deep learning it is typical to consider the tempered posterior:

1
pT (w|D) = p(D|w)1/T p(w), (5)
Z(T)

where T is a temperature parameter, and Z(T) is the normalizing constant

corresponding to temperature T. The temperature parameter controls how the prior
and likelihood interact in the posterior:
I T < 1 corresponds to cold posteriors, where the posterior distribution is more
concentrated around solutions with high likelihood.
I T = 1 corresponds to the standard Bayesian posterior distribution.
I T > 1 corresponds to warm posteriors, where the prior effect is stronger and
the posterior collapse is slower.

E.g.: The safe Bayesian. Grunwald, P. COLT 2012.

23 / 47
Cold Posteriors

Wenzel et. al (2020) highlight the result that for p(w) = N(0, I) cold posteriors with
T < 1 often provide improved performance.

How good is the Bayes posterior in deep neural networks really? Wenzel et. al, ICML 2020.

24 / 47
Prior Misspecification?

They suggest the result is due to prior misspecification, showing that sample
functions p(f (x)) seem to assign one label to most classes on CIFAR-10.

25 / 47
Changing the prior variance scale α

Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.

26 / 47
The effect of data on the posterior
1.0 1.0

Class Probability
0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0.0 0.0

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9

1.0 1.0

Class Probability 0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0.0 0.0

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9

√
(a) Prior (α = 10) (b) 10 datapoints

1.0 1.0

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0.0 0.0

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9

1.0 1.0

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0.0 0.0

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9

(c) 100 datapoints (d) 1000 datapoints

27 / 47
Neural Networks from a Gaussian Process Perspective

From a Gaussian process perspective, what

properties of the prior over functions induced by
a Bayesian neural network might you check to see
if it seems reasonable?

28 / 47
Prior Class Correlations
1.00 1.0

7 0.97 0.97 0.97 0.97 0.98 7 0.81 0.82 0.85 0.84 0.88
0.9

MNIST Class
0.98

MNIST Class
4 0.97 0.97 0.97 0.98 0.97 4 0.81 0.79 0.83 0.89 0.84
0.96 0.8

2 0.97 0.97 0.98 0.97 0.97 2 0.83 0.82 0.89 0.83 0.85
0.94 0.7
1 0.96 0.99 0.97 0.97 0.97 1 0.75 0.90 0.82 0.79 0.82
0.92 0.6
0 0.98 0.96 0.97 0.97 0.97 0 0.89 0.75 0.83 0.81 0.81
0.90 0.5

0 1 2 4 7 0 1 2 4 7
MNIST Class MNIST Class

(e) α = 0.02 (f) α = 0.1

1.0
7 0.76 0.79 0.80 0.81 0.85
0.9
MNIST Class

4 0.76 0.78 0.80 0.85 0.81

0.8 104
2 0.77 0.80 0.84 0.80 0.80 5 · 103

NLL
0.7
1 0.71 0.89 0.80 0.78 0.79
0.6 103
0 0.85 0.71 0.77 0.76 0.76 5 · 102
0.5

0 1 2 4 7
10−2 10−1 100 101
MNIST Class Prior std α

(g) α = 1 (h)

Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.
29 / 47
Thoughts on Tempering (Part 1)

I It would be surprising if T = 1 was the best setting of this hyperparameter.

I Our models are certainly misspecified, and we should acknowledge that
misspecification in our estimation procedure by learning T. Learning T is not
too different from learning other properties of the likelihood, such as noise.
I A tempered posterior is a more honest reflection of our prior beliefs than the
untempered posterior. Bayesian inference is about honestly reflecting our
beliefs in the modelling process.

30 / 47
Thoughts on Tempering (Part 2)

I While certainly the prior p(f (x)) is misspecified, the result of assigning one
class to most data is a soft prior bias, which (1) doesn’t hurt the predictive
distribution, (2) is easily corrected by appropriately setting the prior parameter
variance α2 , and (3) is quickly modulated by data.
I More important is the induced covariance function (kernel) over images, which
appears reasonable. The deep image prior and random network feature results
also suggest this prior is largely reasonable.
I In addition to not tuning α, the result in Wenzel et. al (2020) could have been
exacerbated due to lack of multimodal marginalization.
I There are cases when T < 1 will help given a finite number of samples, even if
the untempered model is correctly specified. Imagine estimating the mean √
of
N (0, I) from samples where d 1. The samples will have norm close to d.

31 / 47
Rethinking Generalization

[1] Understanding Deep Learning Requires Rethinking Generalzation. Zhang et. al, ICLR 2016.
[2] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.

32 / 47
Model Construction

p(D|M)

Well-Specified Model
Calibrated Inductive Biases
Example: CNN

Simple Model
Poor Inductive Biases
Example: Linear Function
Complex Model
Poor Inductive Biases
Example: MLP

Corrupted CIFAR-10 MNIST Dataset

CIFAR-10
Structured Image Datasets

33 / 47
Function Space Priors

We should embrace the function space perspective in constructing priors.

I However, if we contrive priors over parameters p(w) to induce distributions
over functions p(f ) that resemble familiar models such as Gaussian processes
with RBF kernels, we could be throwing the baby out with the bathwater.
I Indeed, neural networks are useful as their own model class precisely because
they have different inductive biases from other models.
I We should try to gain insights by thinking in function space, but note that
architecture design itself is thinking in function space: properties such as
equivariance to translations in convolutional architectures imbue the associated
distribution over functions with these properties.

34 / 47
PAC-Bayes
PAC-Bayes provides explicit generalization error bounds for stochastic networks
with posterior Q, prior P, training points n, probability 1 − δ based on
s
KL(Q||P) + log( δn )
. (6)
2(n − 1)

I Non-vacuous bounds derived from exploiting flatness in Q (e.g., at least 80%

generalization accuracy predicted on binary MNIST).
I Very promising framework but tends not to be prescriptive about model
construction, or informative for understanding why a model generalizes.
I Bounds are improved by compact P and a low dimensional parameter space.
We suggest a P with large support and many parameters.
I Generalization significantly improved by multimodal Q, but not PAC-Bayes
generalization bounds.
Fantastic generalization measures and where to find them. Jiang et. al, 2019.
A primer on PAC-Bayesian learning. Guedj, 2019.
Computing nonvacuous generalization bounds for deep (stochastic) neural networks. Dziugaite & Roy, 2017.
A PAC-Bayesian approach to spectrally-normalized bounds for neural networks. Neyshabur et. al, 2017.

35 / 47
Rethinking Parameter Counting: Effective Dimension
3.5 94
Test Loss
3.0 Train Loss 92
2.5 Neff(Hessian) 90

Neff(Hessian)
2.0 88
Loss
1.5 86
1.0 84
0.5 82
0.0 80
10 20 30 40 50 60 70
Width
4 Effective Dimensionality 100 4 Test Loss 2.2 4 Train Loss
3.5
95 2.0
3 3 3 3.0
90 2.5
1.8
Depth

85 2.0
2 2 2
80 1.6 1.5
1 75 1 1.4 1 1.0
70 0.5
0 12 0 12 1.2 0 12
16 20 24 28 32 36 16 20 24 28 32 36 16 20 24 28 32 36
Width Width Width

X λi
Neff (H) =
i
λi + α
Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited.
W. Maddox, G. Benton, A.G. Wilson, 2020.
36 / 47
Properties in Degenerate Directions

Decision boundaries do not change in directions of little posterior contraction,

suggesting a mechanism for subspace inference!

37 / 47
Gaussian Processes and Neural Networks

“How can Gaussian processes

possibly replace neural networks?
Have we thrown the baby out with
the bathwater?” (MacKay, 1998)

Introduction to Gaussian processes. MacKay, D. J. In Bishop, C. M. (ed.), Neural Networks and Machine
Learning, Chapter 11, pp. 133-165. Springer-Verlag, 1998.

38 / 47
Deep Kernel Learning Review

Deep kernel learning combines the inductive biases of deep learning architectures
with the non-parametric flexibility of Gaussian processes.

W (1)
W (2) h1 (θ)
W (L)
(1)
Input layer h1
Output layer
...

...
(2)
h1
x1 (L)
h1
y1

...
...
...

...

...
... yP
xD (L)
hC

...
hB
(2)
...
(1)
hA
h∞ (θ)
Hidden layers
∞ layer

Base kernel hyperparameters θ and deep network hyperparameters w are

jointly trained through the marginal likelihood objective.

Deep Kernel Learning. Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P. AISTATS, 2016

39 / 47
Face Orientation Extraction

-43.10 36.15 17.35 -3.49 -19.81 Label

Training data

Test data

Figure: Top: Randomly sampled examples of the training and test data. Bottom: The
two dimensional outputs of the convolutional network on a set of test cases. Each
point is shown using a line segment that has the same orientation as the input face.

40 / 47
Learning Flexible Non-Euclidean Similarity Metrics

0.2 300

2
100
100 100

0.1 200

200 200
200

0 100

300 300 300

400 −0.1 400 0 400 0

100 200 300 400 100 200 300 400 100 200 300 400

Figure: Left: The induced covariance matrix using DKL-SM (spectral mixture)
kernel on a set of test cases, where the test samples are ordered according to the
orientations of the input faces. Middle: The respective covariance matrix using
DKL-RBF kernel. Right: The respective covariance matrix using regular RBF
kernel. The models are trained with n = 12, 000.

41 / 47
Kernels from Infinite Bayesian Neural Networks
I The neural network kernel (Neal, 1996) is famous for triggering research on
Gaussian processes in the machine learning community.
Consider a neural network with one hidden layer:
J
X
f (x) = b + vi h(x; ui ) . (7)
i=1

I b is a bias, vi are the hidden to output weights, h is any bounded hidden unit
transfer function, ui are the input to hidden weights, and J is the number of
hidden units. Let b and vi be independent with zero mean and variances σb2 and
σv2 /J, respectively, and let the ui have independent identical distributions.
Collecting all free parameters into the weight vector w,
Ew [f (x)] = 0 , (8)
J
1 X
cov[f (x), f (x0 )] = Ew [f (x)f (x0 )] = σb2 + σv2 Eu [hi (x; ui )hi (x0 ; ui )] , (9)
J i=1

= σb2 + σv2 Eu [h(x; u)h(x0 ; u)] . (10)

We can show any collection of values f (x1 ), . . . , f (xN ) must have a joint Gaussian
distribution using the central limit theorem.
Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.
42 / 47
Neural Network Kernel

J
X
f (x) = b + vi h(x; ui ) . (11)
i=1

Rz 2
e−t dt
PP
I Let h(x; u) = erf(u0 + uj xj ), where erf(z) = √2
j=1 π 0
I Choose u ∼ N (0, Σ)
Then we obtain
2 2x̃T Σx̃0
kNN (x, x0 ) = sin( p ), (12)
π (1 + 2x̃T Σx̃)(1 + 2x̃0T Σx̃0 )

where x ∈ RP and x̃ = (1, xT )T .

43 / 47
Neural Network Kernel

2 2x̃T Σx̃0
kNN (x, x0 ) = sin( p ) (13)
π (1 + 2x̃ Σx̃)(1 + 2x̃0T Σx̃0 )
T

Set Σ = diag(σ0 , σ). Draws from a GP with a neural network kernel with varying σ:

Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006

44 / 47
Neural Network Kernel

2 2x̃T Σx̃0
kNN (x, x0 ) = sin( p ) (14)
π (1 + 2x̃T Σx̃)(1 + 2x̃0T Σx̃0 )

Set Σ = diag(σ0 , σ). Draws from a GP with a neural network kernel with varying σ:

Question: Is a GP with this kernel doing representation learning?

Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006

45 / 47
NN → GP Limits and Neural Tangent Kernels

I Several recent works [e.g., 2-9] have extended Radford Neal’s limits to
multilayer nets and other architectures.
I Closely related work also derives neural tangent kernels from infinite neural
network limits, with promising results.
I Note that most kernels from infinite neural network limits have a fixed
structure. On the other hand, standard neural networks essentially learn a
similarity metric (kernel) for the data. Learning a kernel amounts to
representation learning. Bridging this gap is interesting future work.

[1] Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.

[2] Deep Convolutional Networks as Shallow Gaussian Processes. Garriga-Alonso et. al, NeurIPS 2018.
[3] Gaussian Process Behaviour in Wide Deep Neural Networks. Matthews et. al, ICLR 2018.
[4] Deep neural networks as Gaussian processes. Lee et. al, ICLR 2018.
[5] Bayesian Deep CNNs with Many Channels are Gaussian Processes. Novak et. al, ICLR 2019.
[6] Scaling limits of wide neural networks with weight sharing. Yang, G. arXiv 2019.
[7] Neural tangent kernel: convergence and generalization in neural networks. Jacot et. al, NeurIPS 2018.
[8] On exact computation with an infinitely wide neural net. Arora et. al, NeurIPS 2019.
[9] Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks. Arora et. al, arXiv 2019.

46 / 47
What’s next?

I A broader view of deep learning, where we look at deep hierarchical

representations, often quite distinct from neural networks.
I Much more Bayesian non-parametric function-space representation learning!
I Challenges will include non-stationarity, high dimensional inputs, scalable
high-fidelity approximate inference, and accommodating for misspecification in
Bayesian inference procedures.
I Using what we’ve learned about Gaussian processes as a tool to understand the
principles of model construction and a wide variety of model classes.

47 / 47

Bayesoptbook A4
No ratings yet
Bayesoptbook A4
374 pages
Unit 5 - Machine Learning
No ratings yet
Unit 5 - Machine Learning
16 pages
Hands-On Bayesian Neural Network
No ratings yet
Hands-On Bayesian Neural Network
28 pages
Unit 5 - Machine Learning
No ratings yet
Unit 5 - Machine Learning
17 pages
CS6659 AI UNIT 3 Notes
50% (4)
CS6659 AI UNIT 3 Notes
30 pages
Advanced Neural Network: Multiple Choice Questions and Answers
No ratings yet
Advanced Neural Network: Multiple Choice Questions and Answers
39 pages
Gonzalez 2021
No ratings yet
Gonzalez 2021
67 pages
BNN Tutorial CILVR
No ratings yet
BNN Tutorial CILVR
83 pages
Sparse Bayesian Learning - Analysis and Applications
No ratings yet
Sparse Bayesian Learning - Analysis and Applications
57 pages
4-2 Generalizing Bayesian Optimization With Likelihood-Free Inference and Decision-Theoretic Entropies
No ratings yet
4-2 Generalizing Bayesian Optimization With Likelihood-Free Inference and Decision-Theoretic Entropies
45 pages
Lecture 12 Bayesian Neural Network
No ratings yet
Lecture 12 Bayesian Neural Network
46 pages
Week2 DL
No ratings yet
Week2 DL
29 pages
B N N P R: Ayesian Eural Etwork Riors Evisited
No ratings yet
B N N P R: Ayesian Eural Etwork Riors Evisited
35 pages
DL Unit-2
No ratings yet
DL Unit-2
24 pages
PAC Bayesian Learning Overview
No ratings yet
PAC Bayesian Learning Overview
66 pages
Probabilistic Models For Supervised Learning: Piyush Rai Introduction To Machine Learning (CS771A)
No ratings yet
Probabilistic Models For Supervised Learning: Piyush Rai Introduction To Machine Learning (CS771A)
32 pages
FunctionSpace Regularization in Neural NetworksA Probabilistic Perspective
No ratings yet
FunctionSpace Regularization in Neural NetworksA Probabilistic Perspective
16 pages
T C D B I: Ransformers AN O Ayesian Nference
No ratings yet
T C D B I: Ransformers AN O Ayesian Nference
23 pages
Hands-On Bayesian Neural NetworksA Tutorial For Deep Learning Users
No ratings yet
Hands-On Bayesian Neural NetworksA Tutorial For Deep Learning Users
20 pages
Bayes Optimization For Machine Learning
No ratings yet
Bayes Optimization For Machine Learning
29 pages
Lec 04 Deep Networks 2
No ratings yet
Lec 04 Deep Networks 2
78 pages
Transformers Can Do Bayesian Inference
No ratings yet
Transformers Can Do Bayesian Inference
23 pages
Priors in Bayesian Learning
No ratings yet
Priors in Bayesian Learning
26 pages
Slide 1
No ratings yet
Slide 1
37 pages
Bayesdll: Bayesian Deep Learning Library: T.Hospedales@Ed - Ac.Uk
No ratings yet
Bayesdll: Bayesian Deep Learning Library: T.Hospedales@Ed - Ac.Uk
13 pages
Hands-On Bayesian Neural Networks
No ratings yet
Hands-On Bayesian Neural Networks
24 pages
A Simple Baseline For Bayesian Uncertainty in Deep Learning
No ratings yet
A Simple Baseline For Bayesian Uncertainty in Deep Learning
25 pages
Bayesian Inference With Certifiable Adversarial Robustness: Matthew Wicker Luca Laurenti Andrea Patane
No ratings yet
Bayesian Inference With Certifiable Adversarial Robustness: Matthew Wicker Luca Laurenti Andrea Patane
18 pages
Lecture 06
No ratings yet
Lecture 06
22 pages
MOPED
No ratings yet
MOPED
8 pages
PML Class 1 2025
No ratings yet
PML Class 1 2025
54 pages
Hernandez Lobatoc15
No ratings yet
Hernandez Lobatoc15
9 pages
Slide07 Bayes
No ratings yet
Slide07 Bayes
51 pages
Introduction of Machine Learning
No ratings yet
Introduction of Machine Learning
61 pages
DLT Unit-1
No ratings yet
DLT Unit-1
28 pages
Rethinking Function Space Vari
No ratings yet
Rethinking Function Space Vari
14 pages
Bishop2008 Chapter ANewFrameworkForMachineLearnin
No ratings yet
Bishop2008 Chapter ANewFrameworkForMachineLearnin
24 pages
Statistical Learning Theory
No ratings yet
Statistical Learning Theory
4 pages
Machine Learning: Lecture 6: Bayesian Learning (Based On Chapter 6 of Mitchell T.., Machine Learning, 1997)
No ratings yet
Machine Learning: Lecture 6: Bayesian Learning (Based On Chapter 6 of Mitchell T.., Machine Learning, 1997)
15 pages
Ml2 Script v2
No ratings yet
Ml2 Script v2
123 pages
07 - Bayesian Learning
No ratings yet
07 - Bayesian Learning
55 pages
Optimization of Deep Networks
No ratings yet
Optimization of Deep Networks
84 pages
ML 3
No ratings yet
ML 3
66 pages
AI 2 Marks PDF
No ratings yet
AI 2 Marks PDF
14 pages
Machine - Learning (Unit 3)
No ratings yet
Machine - Learning (Unit 3)
9 pages
Lecture 03 - Feedforward Networks - 4p
No ratings yet
Lecture 03 - Feedforward Networks - 4p
19 pages
Unit 5 - Machine Learning - WWW - Rgpvnotes.in
No ratings yet
Unit 5 - Machine Learning - WWW - Rgpvnotes.in
17 pages
6.036: Intro To Machine Learning: Lecturer: Professor Leslie Kaelbling Notes By: Andrew Lin Fall 2019
No ratings yet
6.036: Intro To Machine Learning: Lecturer: Professor Leslie Kaelbling Notes By: Andrew Lin Fall 2019
50 pages
Fit Without Fear - Remarkable Mathematical Phenomena of Deep Learning Through The Prism of Interpolation
No ratings yet
Fit Without Fear - Remarkable Mathematical Phenomena of Deep Learning Through The Prism of Interpolation
51 pages
Supervised Learning
No ratings yet
Supervised Learning
5 pages
Bayesian Neural Networks Essential
No ratings yet
Bayesian Neural Networks Essential
17 pages
Instructor Solution Manual To Neural Networks and Deep Learning A Textbook Solutions 3319944622 9783319944623 - Compress
No ratings yet
Instructor Solution Manual To Neural Networks and Deep Learning A Textbook Solutions 3319944622 9783319944623 - Compress
40 pages
CS 601 Machine Learning Unit 5
No ratings yet
CS 601 Machine Learning Unit 5
18 pages
Mastering Probabilistic Graphical Models Using Python - Sample Chapter
No ratings yet
Mastering Probabilistic Graphical Models Using Python - Sample Chapter
36 pages
Ryan Adams 140814 Bayesopt Ncap
No ratings yet
Ryan Adams 140814 Bayesopt Ncap
84 pages
Module - 04 Machine Learning (BCS602) Search Creators
No ratings yet
Module - 04 Machine Learning (BCS602) Search Creators
21 pages
Probabilistic Machine Learning: Advantages of Using Probabilistic Models
No ratings yet
Probabilistic Machine Learning: Advantages of Using Probabilistic Models
3 pages
Instructor's Solution Manual For Neural Networks
No ratings yet
Instructor's Solution Manual For Neural Networks
40 pages
CS 601 Machine Learning Unit 5
No ratings yet
CS 601 Machine Learning Unit 5
18 pages
NLP-NeuralNetworks Reading Notes
No ratings yet
NLP-NeuralNetworks Reading Notes
13 pages
Lecture 8.2 - Variational Quantum Eigensolver
No ratings yet
Lecture 8.2 - Variational Quantum Eigensolver
27 pages
Question of The Day: N N N N
No ratings yet
Question of The Day: N N N N
8 pages
Ec3561 Vlsi Laboratory L T P C
No ratings yet
Ec3561 Vlsi Laboratory L T P C
6 pages
Constantinou 2018 ML PDF
No ratings yet
Constantinou 2018 ML PDF
27 pages
Lec20 RidgeRegression
No ratings yet
Lec20 RidgeRegression
21 pages
Lecture 1.1 - Single States
No ratings yet
Lecture 1.1 - Single States
49 pages
Part 1
No ratings yet
Part 1
200 pages
Operational Risk Modeling in Insurance and Banking
No ratings yet
Operational Risk Modeling in Insurance and Banking
13 pages
Durrande 2020
No ratings yet
Durrande 2020
90 pages
Enabling Artificial Intelligence and Cyber Security in Smart Manufacturing
No ratings yet
Enabling Artificial Intelligence and Cyber Security in Smart Manufacturing
15 pages
Ek 2020
No ratings yet
Ek 2020
203 pages
CH 7 - Uncertain Knowledge & Reasoning2
No ratings yet
CH 7 - Uncertain Knowledge & Reasoning2
89 pages
Ranjan 2019
No ratings yet
Ranjan 2019
11 pages
Reasoning With Uncertainty - Probabilistic Reasoning: Version 2 CSE IIT, Kharagpur
No ratings yet
Reasoning With Uncertainty - Probabilistic Reasoning: Version 2 CSE IIT, Kharagpur
10 pages
Dai 2020
No ratings yet
Dai 2020
62 pages
Lec30 GibbsSampling
No ratings yet
Lec30 GibbsSampling
55 pages
Gonzalez 2020
No ratings yet
Gonzalez 2020
79 pages
Lec7 InformationTheory
No ratings yet
Lec7 InformationTheory
41 pages
Lec29 ImportanceSampling
No ratings yet
Lec29 ImportanceSampling
84 pages
Institute of Engineering and Technology, Lucknow: Special Lab (Kcs-751)
No ratings yet
Institute of Engineering and Technology, Lucknow: Special Lab (Kcs-751)
37 pages
Seminar em
No ratings yet
Seminar em
51 pages
Lec31 32 CaterpillarRegressionExample
No ratings yet
Lec31 32 CaterpillarRegressionExample
108 pages
AI Unit 5 Notes
No ratings yet
AI Unit 5 Notes
35 pages
Lec14 15 GenerativeModelsForDiscreteData
No ratings yet
Lec14 15 GenerativeModelsForDiscreteData
74 pages
B.Tech. Robotic & AI 6th - Sem
No ratings yet
B.Tech. Robotic & AI 6th - Sem
17 pages
Lec33 MetropolisHastings
No ratings yet
Lec33 MetropolisHastings
66 pages
Lec25 MonteCarloMethods
No ratings yet
Lec25 MonteCarloMethods
57 pages
Lec9 MultivariateGaussian
No ratings yet
Lec9 MultivariateGaussian
60 pages
Lec16 SummarizingPosteriors BayesianModelSelection
No ratings yet
Lec16 SummarizingPosteriors BayesianModelSelection
59 pages
Lec27 AcceptReject
No ratings yet
Lec27 AcceptReject
30 pages
Introduction To State Space Models and Sequential Bayesian Inference
No ratings yet
Introduction To State Space Models and Sequential Bayesian Inference
58 pages
Bayesian Belief Network
No ratings yet
Bayesian Belief Network
41 pages
X22-Artificial Intelligence Enabled Energy-Efficient Heating, Ventilation and Air
No ratings yet
X22-Artificial Intelligence Enabled Energy-Efficient Heating, Ventilation and Air
27 pages
Decision Theory Models For Applications in Artificial Intelligence Concepts and Solutions 1st Edition L Enrique Sucar Download
No ratings yet
Decision Theory Models For Applications in Artificial Intelligence Concepts and Solutions 1st Edition L Enrique Sucar Download
82 pages
Lecture 3 - Entanglement in Action
No ratings yet
Lecture 3 - Entanglement in Action
36 pages
ML Lab Manual (IT-804)
No ratings yet
ML Lab Manual (IT-804)
49 pages
Lec22 Introduction2BayesianRegression
No ratings yet
Lec22 Introduction2BayesianRegression
42 pages
Lec35 SequentialImportanceSampling
No ratings yet
Lec35 SequentialImportanceSampling
46 pages
Lec17 PriorModeling
No ratings yet
Lec17 PriorModeling
37 pages
Lecture 4.1 - Quantum Query Algorithms
No ratings yet
Lecture 4.1 - Quantum Query Algorithms
38 pages
Lec23 Evidence4Regression
No ratings yet
Lec23 Evidence4Regression
38 pages
Lec24 BayesianLinearRegression
No ratings yet
Lec24 BayesianLinearRegression
29 pages
Lecture 8.1 - Iterative Quantum Phase Estimation - Moving Beyond Traditional QPE
No ratings yet
Lecture 8.1 - Iterative Quantum Phase Estimation - Moving Beyond Traditional QPE
31 pages
Expert Systems (Unit 4)
No ratings yet
Expert Systems (Unit 4)
29 pages
Lec18 HierarchicalBayesianModels
No ratings yet
Lec18 HierarchicalBayesianModels
20 pages
Lecture 7 - Introduction To Quantum Noise Bonus
No ratings yet
Lecture 7 - Introduction To Quantum Noise Bonus
13 pages
Lec21 BiasVarianceDecomposition
No ratings yet
Lec21 BiasVarianceDecomposition
15 pages
Honors10 1PP
No ratings yet
Honors10 1PP
16 pages
Multivariate Regression Modelling
No ratings yet
Multivariate Regression Modelling
10 pages
Is ZC444 Ec-2r Second Sem 2021-2022 - Ans
No ratings yet
Is ZC444 Ec-2r Second Sem 2021-2022 - Ans
4 pages
Ipc2022-87872 Structured, Systematic Threat Based Approach To Evaluate and Improve
No ratings yet
Ipc2022-87872 Structured, Systematic Threat Based Approach To Evaluate and Improve
8 pages
Midterm Solutions
No ratings yet
Midterm Solutions
8 pages
Chapter 3 Fundamentals of Bayesian Inference - Bayesian Hierarchical Models in Ecology
No ratings yet
Chapter 3 Fundamentals of Bayesian Inference - Bayesian Hierarchical Models in Ecology
14 pages
Medical AI
No ratings yet
Medical AI
10 pages
CS3491 - Aiml - Qbank
No ratings yet
CS3491 - Aiml - Qbank
9 pages
An Introduction To Bayesian Methods For Analyzing Chemistry Data Part 2 PDF
No ratings yet
An Introduction To Bayesian Methods For Analyzing Chemistry Data Part 2 PDF
10 pages
Uncertainty in AI
No ratings yet
Uncertainty in AI
3 pages
Worked Examples in Mathematics for Scientists and Engineers
From Everand
Worked Examples in Mathematics for Scientists and Engineers
G. Stephenson
No ratings yet
Geometric functions in computer aided geometric design
From Everand
Geometric functions in computer aided geometric design
Oscar Ruiz
No ratings yet
Calculus-II (Mathematics) Question Bank
From Everand
Calculus-II (Mathematics) Question Bank
Mohmmad Khaja Shareef
No ratings yet

Wilson2020 Part2

Uploaded by

Wilson2020 Part2

Uploaded by

Bayesian Neural Networks from a Gaussian

Andrew Gordon Wilson

Gaussian Process Summer School

Autoregressive Conditional Heteroscedasticity (ARCH)

y(t) = N (y(t); 0, a0 + a1 y(t − 1)2 )

y(t) = N (y(t); 0, a0 + a1 y(t − 1)2 )

Gaussian Copula Process Volatility (GCPV)

y(x) = N (y(x); 0, f (x)2 )

I Can approximate a much greater range of variance functions

Which of these models do you prefer, and why?

y(x)|f (x), g(x) ∼ N (y(x); f (x), g(x)2 )

y(x)|f (x), g(x) ∼ N (y(x); f (x)g(x), g(x)2 )

Airline Passengers (Thousands)

Which model should we choose?

Consider the simple linear model,

Corrupted CIFAR-10 MNIST Dataset

Bayesian Deep Learning and a Probabilistic Perspective of Generalization

I The key distinguishing property of a Bayesian approach is marginalization

Recall the Bayesian model average (BMA):

I Think of each setting of w as a different model. Eq. (3) is a Bayesian model

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs.

I MultiSWAG forms a Gaussian mixture posterior from multiple independent

Belkin et. al (2018)

Should a Bayesian model experience double descent?

CIFAR-100, 20% Label Corruption

A parameter prior p(w) = N (0, α2 ) with a neural network architecture f (x, w)

Deep Image Prior

In Bayesian deep learning it is typical to consider the tempered posterior:

where T is a temperature parameter, and Z(T) is the normalizing constant

E.g.: The safe Bayesian. Grunwald, P. COLT 2012.

Class Probability 0.8 0.8

(c) 100 datapoints (d) 1000 datapoints

From a Gaussian process perspective, what

(e) α = 0.02 (f) α = 0.1

4 0.76 0.78 0.80 0.85 0.81

I It would be surprising if T = 1 was the best setting of this hyperparameter.

Corrupted CIFAR-10 MNIST Dataset

We should embrace the function space perspective in constructing priors.

I Non-vacuous bounds derived from exploiting flatness in Q (e.g., at least 80%

Decision boundaries do not change in directions of little posterior contraction,

“How can Gaussian processes

Base kernel hyperparameters θ and deep network hyperparameters w are

-43.10 36.15 17.35 -3.49 -19.81 Label

300 300 300

400 −0.1 400 0 400 0

= σb2 + σv2 Eu [h(x; u)h(x0 ; u)] . (10)

where x ∈ RP and x̃ = (1, xT )T .

Question: Is a GP with this kernel doing representation learning?

[1] Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.

I A broader view of deep learning, where we look at deep hierarchical

You might also like