CEG5304 / EE6934 (PT2)

Lecture 12:
A Simple Introduction to:
Bayesian Neural Networks & Variational AutoEncoders (VAEs)
© Copyright National University of Singapore. All Rights Reserved.

Admin Issues for Week 12

The final Guest Lecture will be in Week 13 (16 April).

If you have not attended last week's briefing:
Make sure to watch the recording and download Examplify!


The Prior:
Bayesian Inference &
Bayesian Neural Network

✓ Bayesian Inference
✓ Source of Model Uncertainty and Ensembling
✓ Bayesian Neural Networks



Quick Recap: Pillars of Deep Learning
The two pillars of Deep Learning: The Model and the Data



What Do We Have? What Have We Left Behind?
Good performance with ever-increasing model complexity:
Expressive, modular structures; efficient training.
Performance is evaluated by training and testing on (a) certain dataset(s).
What about generalization to other datasets?
Black-box models: how do we judge the confidence of a model's decisions?

One (long forgotten) principle: Occam's Razor:

The problem-solving principle that recommends searching for explanations constructed with the smallest possible set of elements. It is also known as the principle of parsimony or the law of parsimony.

Bayesian Inference
What do we actually obtain from deep learning models?
Probabilities: probabilities of classes / probabilities of parameters.
Bayesian inference:
Leverage Bayes' theorem to update the probability of a hypothesis (a model, which could be a statistical model) as more evidence or information (data) becomes available:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \quad \text{or} \quad p(\boldsymbol{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{w})\, p(\boldsymbol{w})}{p(\mathcal{D})}$$

$p(\boldsymbol{w})$: prior over the parameters $\boldsymbol{w}$ before seeing the data; $p(\mathcal{D} \mid \boldsymbol{w})$: likelihood, the probability of the data $\mathcal{D}$ given $\boldsymbol{w}$; $p(\boldsymbol{w} \mid \mathcal{D})$: posterior probability of $\boldsymbol{w}$ given $\mathcal{D}$.
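As a toy illustration of this update rule (not from the lecture), the sketch below evaluates the posterior of a coin's head-probability on a discrete grid; the prior, grid, and data are made-up assumptions.

```python
import numpy as np

# Hypothetical example: posterior over a coin's head-probability theta
# after observing D = 7 heads in 10 flips (made-up numbers).
theta = np.linspace(0.01, 0.99, 99)            # grid of candidate parameters
prior = np.ones_like(theta) / len(theta)       # uniform prior p(theta)
heads, flips = 7, 10
likelihood = theta**heads * (1 - theta)**(flips - heads)   # p(D | theta)

unnormalized = likelihood * prior              # p(D | theta) * p(theta)
posterior = unnormalized / unnormalized.sum()  # normalize by the evidence p(D)

print(theta[np.argmax(posterior)])             # MAP estimate, ~0.7
```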


Bayesian Inference and Uncertainty
Given a data input $u$, what we want is the prediction $y$ for $u$, computed from the posterior predictive distribution.
Posterior predictive distribution: the distribution of possible unobserved values conditional on the observed values,

$$p(y \mid u, \mathcal{D}) = \int p(\boldsymbol{w} \mid \mathcal{D})\, p(y \mid u, \boldsymbol{w})\, d\boldsymbol{w}$$

Why do we need to compute the prediction in this manner?
To understand the "uncertainty" of the results;
Probability expresses uncertainty;
What types of uncertainty may we encounter?

Uncertainty and Bayesian Inference

Aleatoric uncertainty: inherent uncertainty in the environment's dynamics;
the output distribution of a classifier or model (from the SoftMax).
Epistemic uncertainty: uncertainty about the model parameters;
not considered in normal neural networks;
different networks result from different training processes.

Examples of Uncertainty in Classification



Quantifying Uncertainty with Bayesian Inference
A brief example of how Bayesian inference quantifies model uncertainty: Bayesian linear regression.
It considers various plausible explanations for how the data were generated;
Predictions are obtained using all possible regression models, weighted by their posterior probability $p(\boldsymbol{w} \mid \mathcal{D})$.


Example: Bayesian Linear Regression

Prior distribution: $\boldsymbol{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{S})$;
Likelihood: $p(y \mid u, \boldsymbol{w}) = \mathcal{N}(\boldsymbol{w}^T \psi(u), \sigma^2)$;
Assuming fixed or known $\mathbf{S}$ and $\sigma^2$ is a big assumption; they can also be estimated!
Posterior probability: $p(\boldsymbol{w} \mid \mathcal{D})$ (a worked sketch follows below).

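A minimal NumPy sketch of this model (not the lecture's code), assuming an isotropic prior $\mathbf{S} = s^2\mathbf{I}$, a known noise variance $\sigma^2$, and a hypothetical polynomial feature map for $\psi(u)$; the Gaussian posterior has a closed form that can be computed directly.

```python
import numpy as np

def psi(u, degree=3):
    """Hypothetical feature map: polynomial features [1, u, u^2, ...]."""
    return np.stack([u**d for d in range(degree + 1)], axis=-1)

def blr_posterior(U, y, s2_prior=1.0, sigma2=0.1):
    """Posterior p(w | D) = N(m_N, S_N) for Bayesian linear regression
    with prior w ~ N(0, s2_prior * I) and Gaussian noise variance sigma2."""
    Phi = psi(U)                                     # design matrix, N x D
    D = Phi.shape[1]
    S_N_inv = np.eye(D) / s2_prior + Phi.T @ Phi / sigma2
    S_N = np.linalg.inv(S_N_inv)                     # posterior covariance
    m_N = S_N @ Phi.T @ y / sigma2                   # posterior mean
    return m_N, S_N

# Toy data (made up): y = sin(u) + noise
rng = np.random.default_rng(0)
U = rng.uniform(-3, 3, size=20)
y = np.sin(U) + 0.3 * rng.standard_normal(20)
m_N, S_N = blr_posterior(U, y)
```
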
Example: Bayesian Linear Regression
The connection between prior, likelihood, and posterior:

Line 1 of the figure: the prior distribution;
Lines 2-4: the model distribution is adjusted as data points are observed.


Another Example of Bayesian Regression
Another example of estimating model uncertainty:
Assume the model is built from radial basis functions, formulated as:

$$f_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2\sigma^2}\right)$$

The posterior probability $p(\boldsymbol{w} \mid \mathcal{D})$ is obtained from the sampled data points:


Visualizing Confidence Intervals vs. Data Observations

Visualize confidence intervals based on the posterior predictive mean and variance at each input point (see the sketch below).

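A self-contained NumPy sketch of such a confidence band, assuming the radial-basis-function model from the earlier example with made-up centres, prior scale, and noise level; the posterior predictive mean and variance at a test point $x_*$ are $\boldsymbol{m}_N^T \boldsymbol{f}(x_*)$ and $\sigma^2 + \boldsymbol{f}(x_*)^T \boldsymbol{S}_N \boldsymbol{f}(x_*)$.

```python
import numpy as np

def rbf_features(x, centres, width=1.0):
    """f_j(x) = exp(-(x - mu_j)^2 / (2 * width^2)) for each centre mu_j."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * width**2))

# Made-up setup: RBF centres, prior w ~ N(0, I), noise variance sigma^2.
centres = np.linspace(-4, 4, 9)
sigma2 = 0.1

rng = np.random.default_rng(1)
x_train = rng.uniform(-3, 3, 25)
y_train = np.sin(x_train) + np.sqrt(sigma2) * rng.standard_normal(25)

Phi = rbf_features(x_train, centres)
S_N = np.linalg.inv(np.eye(len(centres)) + Phi.T @ Phi / sigma2)  # posterior cov
m_N = S_N @ Phi.T @ y_train / sigma2                              # posterior mean

x_test = np.linspace(-5, 5, 200)
Phi_test = rbf_features(x_test, centres)
pred_mean = Phi_test @ m_N
pred_var = sigma2 + np.sum(Phi_test @ S_N * Phi_test, axis=1)     # per-point variance

lower = pred_mean - 2 * np.sqrt(pred_var)   # approx. 95% confidence band
upper = pred_mean + 2 * np.sqrt(pred_var)
```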


Data Uncertainty vs. Model Uncertainty



Bayesian Neural Network?
Can we combine the advantages of neural networks with Bayesian models?

Bayesian Neural Network (BNN):
Place a prior on the weights of the network, e.g. $p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w}; \mathbf{0}, \eta\mathbf{I})$.
Define an observation model, e.g. $p(y \mid u, \boldsymbol{w}) = \mathcal{N}(y; f_{\boldsymbol{w}}(u), \sigma^2)$.
Apply Bayes' rule (a sketch of the resulting unnormalized log-posterior follows below):

$$p(\boldsymbol{w} \mid \mathcal{D}) \propto p(\boldsymbol{w}) \prod_{i=1}^{N} p(y_i \mid u_i, \boldsymbol{w})$$

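As an illustration (not the lecture's implementation), a minimal NumPy sketch of the unnormalized log-posterior for a one-hidden-layer regression BNN under the Gaussian prior and observation model above; the architecture and hyperparameters are assumptions.

```python
import numpy as np

def forward(w, u, hidden=20):
    """Tiny 1-hidden-layer network f_w(u); w is a flat parameter vector."""
    W1 = w[:hidden].reshape(hidden, 1)
    b1 = w[hidden:2 * hidden]
    W2 = w[2 * hidden:3 * hidden]
    b2 = w[3 * hidden]
    h = np.tanh(u[:, None] @ W1.T + b1)     # hidden activations, shape (N, hidden)
    return h @ W2 + b2                      # network output, shape (N,)

def log_posterior_unnorm(w, u, y, eta=1.0, sigma2=0.1):
    """log p(w) + sum_i log p(y_i | u_i, w), up to additive constants."""
    log_prior = -0.5 * np.sum(w**2) / eta
    resid = y - forward(w, u)
    log_lik = -0.5 * np.sum(resid**2) / sigma2
    return log_prior + log_lik

# Made-up usage:
rng = np.random.default_rng(0)
u = rng.uniform(-3, 3, 30)
y = np.sin(u) + 0.3 * rng.standard_normal(30)
w = rng.standard_normal(3 * 20 + 1)
print(log_posterior_unnorm(w, u, y))
```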


Bayesian Model: Samples from the Prior
Understand a Bayesian model by looking at prior samples of the functions (i.e., functions whose weights are sampled from $p(\boldsymbol{w})$).
Visualization: prior function samples for a BNN with 1 hidden layer of 10k units.

Under certain assumptions, an infinitely wide BNN approximates a Gaussian process; the same holds for deep BNNs.


BNN vs. NN (in General, with DNNs)

Regularization arises naturally! (Think of dropout.)

NN: parameters are represented by single, fixed values after training; conventional approaches to training NNs can be interpreted as approximations to the Bayesian method.
BNN: parameters are represented by distributions; introduce a prior distribution on the weights $p(\boldsymbol{w})$ and obtain the posterior $p(\boldsymbol{w} \mid \mathcal{D})$ through learning.

BNN: Maximizing the Posterior
Compute the posterior with Bayes' rule:

$$p(\boldsymbol{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{w})\, p(\boldsymbol{w})}{p(\mathcal{D})} = \frac{p(\mathcal{D} \mid \boldsymbol{w})\, p(\boldsymbol{w})}{\int p(\mathcal{D} \mid \boldsymbol{w})\, p(\boldsymbol{w})\, d\boldsymbol{w}}$$

What is the problem with this computation?
Problem: the denominator cannot be computed directly;
it is intractable to integrate over all possible weight configurations!
Two ways to compute an approximation to the posterior:
Markov Chain Monte Carlo (MCMC);
Variational Inference (Variational Bayes).


Posterior Inference with Markov Chain Monte Carlo
To explore how to compute the posterior $p(\boldsymbol{w} \mid \mathcal{D})$, we first look at how we use this posterior distribution.
What we want is the prediction.
Sample a set of values $\boldsymbol{w}_1, \dots, \boldsymbol{w}_K$ from the posterior distribution $p(\boldsymbol{w} \mid \mathcal{D})$ and average the predictive distributions:

$$p(y \mid u, \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^{K} p(y \mid u, \boldsymbol{w}_k)$$

How do we sample from the posterior?
One possible method: Markov Chain Monte Carlo (MCMC).
Sample a chain of samples that converges to $p(\boldsymbol{w} \mid \mathcal{D})$.
This can be very expensive with large datasets.

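A minimal sketch of this averaging step, assuming posterior weight samples are already available (e.g. from an MCMC chain) and a Gaussian observation model; `forward` stands for any network evaluation function, such as the hypothetical one sketched earlier.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    """Density of y under N(mean, var)."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def predictive_density(y, u, weight_samples, forward, sigma2=0.1):
    """p(y | u, D) ~ (1/K) * sum_k p(y | u, w_k), for K posterior samples w_k."""
    return np.mean([gaussian_pdf(y, forward(w, u), sigma2)
                    for w in weight_samples], axis=0)
```
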
Posterior Inference: Variational Bayes
A less accurate but more scalable approach.
Idea: approximate the complex posterior distribution with a simpler, analytical variational approximation $q$ with parameters $\boldsymbol{\theta}$:
Assume a Gaussian posterior with diagonal covariance,

$$q(\boldsymbol{w}; \boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{w}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{j=1}^{D} \mathcal{N}(w_j; \mu_j, \sigma_j)$$

Each weight of the network has its own mean and variance.

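A small illustrative sketch of drawing a weight vector from this factorized Gaussian via the reparameterization $\boldsymbol{w} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$; storing per-weight log-variances is an assumption of this sketch, not something prescribed by the slides.

```python
import numpy as np

def sample_weights(mu, log_var, rng):
    """Draw w ~ q(w; theta) = prod_j N(w_j; mu_j, sigma_j^2)."""
    eps = rng.standard_normal(mu.shape)        # epsilon ~ N(0, I)
    return mu + np.exp(0.5 * log_var) * eps    # w = mu + sigma * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(61), np.full(61, -4.0)  # made-up variational parameters
w = sample_weights(mu, log_var, rng)
```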


Posterior Inference: Variational Bayes
How do we estimate the $\boldsymbol{\theta}$ that gives the best approximation of $p(\boldsymbol{w} \mid \mathcal{D})$?
Kullback-Leibler divergence (KL divergence, a measure of the discrepancy between two distributions):

$$D_{KL}\big(q(\boldsymbol{w}; \boldsymbol{\theta}) \,\|\, p(\boldsymbol{w} \mid \mathcal{D})\big) = \int q(\boldsymbol{w}; \boldsymbol{\theta}) \log \frac{q(\boldsymbol{w}; \boldsymbol{\theta})}{p(\boldsymbol{w} \mid \mathcal{D})}\, d\boldsymbol{w} = \mathbb{E}_{q(\boldsymbol{w}; \boldsymbol{\theta})}\!\left[\log \frac{q(\boldsymbol{w}; \boldsymbol{\theta})}{p(\boldsymbol{w} \mid \mathcal{D})}\right]$$

$$D_{KL}\big(q(\boldsymbol{w}; \boldsymbol{\theta}) \,\|\, p(\boldsymbol{w} \mid \mathcal{D})\big) = D_{KL}\big(q(\boldsymbol{w}; \boldsymbol{\theta}) \,\|\, p(\boldsymbol{w})\big) - \mathbb{E}_{q(\boldsymbol{w}; \boldsymbol{\theta})}\big[\log p(\mathcal{D} \mid \boldsymbol{w})\big] + \log p(\mathcal{D})$$

Variational Free Energy: $\mathcal{F}(\mathcal{D}, \boldsymbol{\theta}) \equiv D_{KL}\big(q(\boldsymbol{w}; \boldsymbol{\theta}) \,\|\, p(\boldsymbol{w})\big) - \mathbb{E}_{q(\boldsymbol{w}; \boldsymbol{\theta})}\big[\log p(\mathcal{D} \mid \boldsymbol{w})\big]$
Minimizing $\mathcal{F}$ w.r.t. $\boldsymbol{\theta}$ also minimizes the KL divergence.

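A minimal Monte Carlo sketch of this free energy for the factorized Gaussian $q$ above, assuming a standard-normal prior on the weights (so the KL term has a closed form) and a Gaussian likelihood, in the spirit of Bayes-by-Backprop; the model and numbers are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

def kl_diag_gauss_vs_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def free_energy_estimate(mu, log_var, u, y, forward, sigma2=0.1,
                         n_samples=8, rng=np.random.default_rng(0)):
    """F(D, theta) ~ KL(q || p(w)) - (1/S) sum_s log p(D | w_s), with w_s ~ q."""
    expected_log_lik = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        w = mu + np.exp(0.5 * log_var) * eps          # reparameterized sample from q
        resid = y - forward(w, u)
        # Gaussian log-likelihood of the data under this weight sample
        expected_log_lik += (-0.5 * np.sum(resid**2) / sigma2
                             - 0.5 * len(y) * np.log(2 * np.pi * sigma2))
    expected_log_lik /= n_samples
    return kl_diag_gauss_vs_std_normal(mu, log_var) - expected_log_lik
```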


Uses of BNNs (Some)
Bayesian optimization: Snoek et al., 2015. Scalable Bayesian optimization using deep neural networks.
Curriculum learning: Graves et al., 2017. Automated curriculum learning for neural networks.
Intrinsic motivation in reinforcement learning: Houthooft et al., 2016. Variational information maximizing exploration.
Network compression: Louizos et al., 2017. Bayesian compression for deep learning.


An Example of BNN:
Variational AutoEncoders (VAEs):
Generative Models and Representation Learning

✓ Generative Modeling and Representation Learning
✓ Variational AutoEncoders (VAEs)


Learning Generative Models via Density Models
Objective of generative modeling:
A generator produces synthetic data that looks like the real data: the synthetic data has high probability under a density model fit to the real data.
Produce synthetic data that is identically distributed to the training data, i.e. $\hat{x} \sim p_{data}$, where $p_{data}$ is the true process that produced the training data.
Use a density model $p_\theta$ to get close to $p_{data}$.


Autoregressive Models: Image Generation
Synthesize an image pixel by pixel:
Predict each next pixel from the partial image already completed, i.e. from the context of the partial image (see the sketch below).

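A schematic sketch of that sampling loop, assuming a hypothetical `predict_pixel_distribution(image, r, c)` function that returns a categorical distribution over pixel values conditioned on the already-generated pixels; nothing here is specific to the lecture's model.

```python
import numpy as np

def sample_image_autoregressively(predict_pixel_distribution, height, width,
                                  n_levels=256, rng=np.random.default_rng(0)):
    """Generate an image one pixel at a time in raster-scan order.
    predict_pixel_distribution(image, r, c) -> probabilities over pixel values,
    conditioned only on the already-generated pixels (the partial image)."""
    image = np.zeros((height, width), dtype=np.int64)
    for r in range(height):
        for c in range(width):
            probs = predict_pixel_distribution(image, r, c)   # shape (n_levels,)
            image[r, c] = rng.choice(n_levels, p=probs)
    return image
```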


The Gaussian Diffusion Model
If the forward and reverse processes are modelled as Gaussians, we obtain the Gaussian diffusion model (as described in previous slides):
Forward: $x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_t$, $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$, i.e.
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(\sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\big)$$
Backward: $\mu = f_\theta(x_t, t)$, $x_{t-1} \sim \mathcal{N}(\mu, \sigma^2)$, i.e.
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(f_\theta(x_t, t),\; \sigma^2\big)$$

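A minimal sketch of the forward (noising) step under these Gaussian assumptions; the linear noise schedule is a made-up choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # hypothetical linear noise schedule

def forward_diffusion_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * eps
```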


The Generative Adversarial Network (GAN)
The generator $G: \mathcal{Z} \to \mathcal{X}$ and the discriminator $D: \mathcal{X} \to \Delta^1$:
$G$ tries to synthesize fake images that fool $D$;
$D$ tries to identify the synthetic images (i.e., differentiate between real and fake).
A minimax (adversarial) game between $G$ and $D$.

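For reference, the standard formulation of this minimax objective (not printed on the slide) is:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$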


Generative Modelling vs. Representation Learning
Representation learning: mapping data to an abstract representation (for data analysis).

Generative modelling: mapping abstract representations to data (for data synthesis).


How Does Generative Modelling Affect the Distribution?
We can view generative models as distribution transformers (not the Transformer architecture):
They transform a distribution over a latent space into the data distribution.


Properties of Generative Modelling
How do we generate from $z$?
We assume we know the distribution $p_z$!
The generated data is evaluated with the log-likelihood function:

$$L\big(\{x^{(i)}\}_{i=1}^{N}, \theta\big) = \sum_{i=1}^{N} \log p_\theta\big(x^{(i)}\big)$$

To compute $p_\theta(x)$, one way is to express it as the marginal likelihood of $x$, marginalizing over all unobserved latent variables $z$:

$$p_\theta(x) = \int_z p_\theta(x \mid z)\, p_z(z)\, dz$$

This reduces the problem to learning the conditional distribution $p_\theta(x \mid z)$, given that we know the distribution $p_z(z)$.


Generative Modelling vs. Representation Learning

Compare / connect generative modelling and representation learning from the distribution perspective:
Define the "latent space" as a "representation space".
Representation learning: $x \sim p_{data}$, $z = f(x)$;
Generative modelling: $z \sim p_z$, $x = g(z)$;
Observe: what structure can you associate with each mapping?


Generative Modelling with Autoencoder (AE)
Autoencoder: a model that learns an embedding that can be decoded to reconstruct the input data.
Connection: the decoder of an autoencoder is similar to a generator.


Adapting Autoencoder (AE) for Generative Modelling
How do we obtain this latent variable $z$?
Goal of generative modeling: make up random images from scratch.
This requires a distribution from which to sample different $z$, i.e., it requires $p_z$.
Can we do this with an autoencoder directly?
One possibility: after training, just sample a random $z$ from a unit Gaussian and feed it through the decoder?
Problem: the sample might be different from what the decoder was trained on, and might not map to a natural-looking image.
In short: the latent space of an autoencoder can be just as complex as the data space.


Adapting Autoencoder (AE) for Generative Modelling
Turn an AE into a generative model: the latent variable needs to be sampled from a distribution, and the model should maximize the data likelihood under a formal probabilistic model.
Variational AutoEncoders (VAEs): model the data with a density model $p_\theta(x)$, a mixture of Gaussians:
$$p_\theta(x) = \sum_{i=1}^{k} w_i\, \mathcal{N}(x; \mu_i, \sigma_i^2)$$
whose parameters are the means and variances.
What is special about VAEs: they leverage an infinite mixture of Gaussians, $k \to \infty$.
[Figure: model distribution vs. target distribution]


Adapting Autoencoder (AE) for Generative Modelling
Given the formula above, how do we parameterize an infinite mixture?
The "reparameterization trick":
Make the mean and variance functions of an underlying continuous (latent) variable.
Observe the expression: it has exactly the same form as the marginal likelihood function!
Assume the latent variable follows a Gaussian distribution, $p_z(z) = \mathcal{N}(z; 0, \mathbf{I})$, with conditional likelihood $p_\theta(x \mid z)$.


Adapting Autoencoder (AE) for Generative Modelling



Learning the Model Parameters of VAEs
The optimal parameters are obtained as:

$$\theta^* = \arg\max_\theta L\big(\{x^{(i)}\}_{i=1}^{N}, \theta\big) = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta\big(x^{(i)}\big) = \arg\max_\theta \sum_{i=1}^{N} \log \int_z \mathcal{N}\big(x^{(i)}; g_\theta^{\mu}(z), g_\theta^{\sigma^2}(z)\big)\, \mathcal{N}(z; 0, \mathbf{I})\, dz$$

This is difficult to optimize directly, so we apply certain mathematical tricks!
Trick 1: approximate the objective via sampling (a Monte Carlo estimate of the integral; see the sketch below).

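A minimal sketch of Trick 1 for a single datapoint, assuming a hypothetical decoder `decode(z)` that returns a mean and a diagonal variance; drawing $z$ from the prior and averaging the resulting likelihoods gives an unbiased but high-variance estimate.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of x under a diagonal Gaussian N(mean, diag(var))."""
    return np.sum(-0.5 * ((x - mean) ** 2 / var + np.log(2 * np.pi * var)))

def mc_log_likelihood(x, decode, latent_dim, n_samples=1000,
                      rng=np.random.default_rng(0)):
    """log p_theta(x) ~ log( (1/S) * sum_s p_theta(x | z_s) ), with z_s ~ N(0, I)."""
    log_ps = []
    for _ in range(n_samples):
        z = rng.standard_normal(latent_dim)       # z ~ p_z(z) = N(0, I)
        mean, var = decode(z)                     # hypothetical decoder g_theta
        log_ps.append(log_gauss_diag(x, mean, var))
    log_ps = np.array(log_ps)
    m = log_ps.max()                              # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_ps - m)))
```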


Learning the Model Parameters of VAEs
Trick 2: efficient approximation via importance sampling.
Problem with Trick 1: the higher the dimension of $z$, the more samples are needed, yet when approximating the likelihood $p_\theta(x)$ only a few samples (e.g. $z^{(2)}$ in the figure) contribute.
Importance sampling: only sample $z$ values that place high probability on $x$, i.e. where $p_\theta(x \mid z)$ is large.
Define a distribution $q_z$ whose optimum is $q^* = p_\theta(z \mid x)$: a prediction of which $z$ values are most likely to have generated the observed $x$ (see the sketch below).

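A minimal sketch of the importance-sampled estimator, assuming a hypothetical proposal `sample_q(rng)` that returns a sample together with its log-density under $q$; when $q = p_\theta(z \mid x)$, a single sample already gives the exact value.

```python
import numpy as np

def iw_log_likelihood(x, decode, sample_q, n_samples=16,
                      rng=np.random.default_rng(0)):
    """log p_theta(x) ~ log( (1/S) * sum_s p_theta(x|z_s) p_z(z_s) / q(z_s) )."""
    log_ws = []
    for _ in range(n_samples):
        z, log_q = sample_q(rng)                  # z ~ q, with log q(z)
        mean, var = decode(z)                     # hypothetical decoder
        log_px_given_z = np.sum(-0.5 * ((x - mean) ** 2 / var
                                        + np.log(2 * np.pi * var)))
        log_pz = np.sum(-0.5 * (z ** 2 + np.log(2 * np.pi)))  # N(0, I) prior
        log_ws.append(log_px_given_z + log_pz - log_q)
    log_ws = np.array(log_ws)
    m = log_ws.max()                              # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_ws - m)))
```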


Learning the Model Parameters of VAEs



Learning the Model Parameters of VAEs
Now that we know we should be sampling from $p_\theta(z \mid x)$, how do we do it?
VAEs leverage variational inference (Trick 3): predict the optimal sampling distribution for each given datapoint.
Approximate $p_\theta(z \mid x)$ with $q_\psi$, a Gaussian $\mathcal{N}\big(f_\psi^{\mu}(x), f_\psi^{\sigma^2}(x)\big)$, where $f_\psi$ maps from the data $x$ to the latent variable $z$: a probabilistic encoder.


Learning the Model Parameters of VAEs
To best approximate $p_\theta(z \mid x)$ with $q_\psi$, the optimal $q_\psi$ is obtained by minimizing the KL divergence between $q_\psi$ and $p_\theta(z \mid x)$. Following the same decomposition as the variational free energy above, the objective function $J_q$ is:

$$J_q(\psi) = D_{KL}\big(q_\psi(z \mid x) \,\|\, p_\theta(z \mid x)\big) = D_{KL}\big(q_\psi(z \mid x) \,\|\, p_z(z)\big) - \mathbb{E}_{q_\psi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \log p_\theta(x)$$

The optimal parameters $\psi$ for $q_\psi$ are obtained as (since $\log p_\theta(x)$ is constant w.r.t. $\psi$):

$$\psi^* = \arg\min_\psi \; D_{KL}\big(q_\psi(z \mid x) \,\|\, p_z(z)\big) - \mathbb{E}_{q_\psi(z \mid x)}\big[\log p_\theta(x \mid z)\big]$$


Learning the Model Parameters of VAEs



Learning the Model Parameters of VAEs
Question: how is learning $q$ related to learning $p$?

$p$ and $q$ now share the exact same objective, so learning $q$ is learning $p$!

$$\theta^*, \psi^* = \arg\min_{\theta, \psi} J(\theta, \psi)$$

Learning the Model Parameters of VAEs



Training a VAE: Step-by-Step
Sample one or more datapoints $x \sim \{x^{(i)}\}_{i=1}^{N}$;
Encode the data with a forward pass through $f_\psi$;
For each datapoint, create one or more noisy latent codes using the distribution parameterized by the encoder;
Decode the data by passing the noisy latent codes through $g_\theta$;
Compute the losses and backpropagate to update $\theta$ and $\psi$ (a sketch of one such step follows below).

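A minimal PyTorch sketch of these steps for a toy fully-connected VAE on flattened inputs; the architecture, sizes, and hyperparameters are illustrative assumptions, not the lecture's implementation.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Toy fully-connected VAE (illustrative sizes)."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)         # f_psi^mu(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)     # f_psi^{sigma^2}(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))   # g_theta(z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                    # noise for the latent code
        z = mu + torch.exp(0.5 * logvar) * eps        # reparameterized sample
        return self.dec(z), mu, logvar

def training_step(model, optimizer, x):
    """One update: reconstruction term plus KL(q_psi(z|x) || N(0, I))."""
    x_hat, mu, logvar = model(x)
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()      # Gaussian decoder, fixed variance
    kl = 0.5 * (logvar.exp() + mu ** 2 - 1.0 - logvar).sum(dim=1).mean()
    loss = recon + kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with made-up data:
model = TinyVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x_batch = torch.rand(32, 784)
training_step(model, optimizer, x_batch)
```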
