Lecture 1
Course Structure

Course Assessment
Bayesian Machine Learning—Course Outline
Lectures
• Machine Learning Paradigms (1 hour)
• Bayesian Modeling (2 hours)
• Foundations of Bayesian Inference (1 hour)
• Advanced Inference Methods (1 hour)
• Variational Auto-Encoders (1 hour)—key lecture for
assessments!
I will upload notes after each lecture. These will not perfectly
overlap with the lectures/slides, so you will need to digest each
separately.
What is Machine Learning?

Motivation: Why Should we Take a Bayesian Approach?

Learning from Data

Supervised Learning
Supervised Learning—Classification
[Figure: example image classifications: Cat, Dog, and Flying Spaghetti Monster]
Supervised Learning

Training data: N datapoints, each with input features x1, …, xM and an output y.

Index | x1    | x2    | x3    | …  | xM   | y
1     | 0.24  | 0.12  | -0.34 | …  | 0.98 | 3
2     | 0.56  | 1.22  | 0.20  | …  | 1.03 | 2
3     | -3.20 | -0.01 | 0.21  | …  | 0.93 | 1
…     | …     | …     | …     | …  | …    | …
N     | 2.24  | 1.76  | -0.47 | …  | 1.16 | 2
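In code, a training set like this is just an N×M feature matrix paired with a length-N vector of outputs. A minimal NumPy sketch, using the first few rows of the table (values purely illustrative):

```python
import numpy as np

# Training data: each row is one datapoint; columns are input features.
X = np.array([[0.24,  0.12, -0.34, 0.98],   # datapoint 1
              [0.56,  1.22,  0.20, 1.03],   # datapoint 2
              [-3.20, -0.01, 0.21, 0.93]])  # datapoint 3
y = np.array([3, 2, 1])                     # one output label per datapoint

N, M = X.shape  # N datapoints, M features
print(N, M)     # 3 4
```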
Unsupervised Learning—Clustering
Unsupervised Learning—Deep Generative Models

[Figure 1 from the Glow paper: synthetic celebrities sampled from the model]

These are not real faces: they are samples from a learned model!

D P Kingma and P Dhariwal. “Glow: Generative flow with invertible 1x1 convolutions”. In: NeurIPS. 2018.
Discriminative vs Generative Machine Learning
Image credit: Jason Martuscello, medium.com
Discriminative Machine Learning
Cons
• Can be difficult to impart prior information
• Typically lack interpretability
• Do not usually provide natural uncertainty estimates
Generative Machine Learning
Cons
• Can be difficult to construct—typically requires problem-specific
expertise
• Can impart unwanted assumptions—often less effective for huge
datasets
• Tackles an inherently more difficult problem than straight
prediction
The Bayesian Paradigm
Bayesian Probability is All About Belief
Frequentist Probability
The frequentist interpretation of probability is that it is the average
proportion of the time an event will occur if a trial is repeated
infinitely many times.
Bayesian Probability
The Bayesian interpretation of probability is that it is the
subjective belief that an event will occur in the presence of
incomplete information.
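The frequentist reading can be illustrated directly in code: simulate a trial many times and watch the empirical proportion approach the event's probability (a minimal sketch; the die example is mine, not from the slides):

```python
import random

random.seed(0)

# Frequentist view: probability = long-run proportion of trials in which
# the event occurs. Simulate rolling a fair die and estimate p(roll == 6).
trials = 100_000
hits = sum(random.randint(1, 6) == 6 for _ in range(trials))

print(f"empirical proportion: {hits / trials:.3f}")  # close to 1/6 (0.167)
```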
Bayesianism vs Frequentism
https://xkcd.com/1132/
Bayesianism vs Frequentism
Warning
Bayesianism has its shortfalls too—see the course notes.
The Basic Laws of Probability
Bayes’ Rule
p(B|A) = p(A|B) p(B) / p(A)
Using Bayes’ Rule
We have just had a result back from the doctor for a cancer
screening, and it has come back positive. How worried should we be,
given the test isn't perfect?
Example: Positive Cancer Test (2)
Before these results came in, the chance of us having this type of
cancer was quite low: 1/1000. Let's say θ = 1 represents us having
cancer, so our prior is p(θ = 1) = 1/1000.
For people who do have cancer, the test is 99.9% accurate.
Denoting the event of the test returning positive as D = 1, we
thus have p(D = 1|θ = 1) = 999/1000.
For people who do not have cancer, the test is 99% accurate. We
thus have p(D = 1|θ = 0) = 1/100.
Our prospects might seem quite grim at this point given how
accurate the test is.
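Plugging these numbers into Bayes' rule takes only a few lines of Python (a minimal sketch; the variable names are mine):

```python
# Bayes' rule for the positive cancer test, using the numbers above:
# p(theta=1 | D=1) = p(D=1 | theta=1) * p(theta=1) / p(D=1)

prior = 1 / 1000          # p(theta = 1): prior probability of having cancer
sensitivity = 999 / 1000  # p(D = 1 | theta = 1): positive test given cancer
false_pos = 1 / 100       # p(D = 1 | theta = 0): positive test given no cancer

# Marginal likelihood p(D = 1): sum over both values of theta
evidence = sensitivity * prior + false_pos * (1 - prior)

posterior = sensitivity * prior / evidence
print(f"p(theta=1 | D=1) = {posterior:.4f}")  # 0.0909, i.e. roughly 1 in 11
```

Despite the accurate test, the posterior probability of cancer is only about 9%, because the disease is so rare to begin with.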
Example: Positive Cancer Test (3)
Alternative Viewpoint
How Might we Write a System to Break Captchas?
(Tom Rainforth, “Bridging the Gap Between the Bayesian Ideal and Common Practice”)

[Figure: latent variables of a captcha generative model: letter identities (a), indices (i), positions (x), and noise terms (displacement field, stroke)]
Simulating Captchas is Much Easier

[Figure: a generative model renders latent letter identities (a), indices (i), and positions (x), plus noise (displacement field, stroke), into a captcha image such as “gxs2rRj”; generation runs forwards from latents to image, inference runs backwards]

[Le, Baydin, and Wood. Inference Compilation and Universal Probabilistic Programming]
The Bayesian Pipeline

p(θ|D) ∝ p(D|θ) p(θ)

Posterior ∝ Likelihood × Prior
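As a concrete instance of p(θ|D) ∝ p(D|θ) p(θ), consider a biased coin (the worked example in the course notes; the numbers here are illustrative): with a Beta(α, β) prior on the heads probability θ and a Bernoulli likelihood for each flip, the posterior is again a Beta distribution, so the update is a one-liner:

```python
# Conjugate Bayesian update for a biased coin (illustrative numbers).
# Prior: theta ~ Beta(alpha, beta); likelihood: each flip ~ Bernoulli(theta).
# Posterior: theta | D ~ Beta(alpha + heads, beta + tails).

alpha, beta = 1, 1           # Beta(1, 1) = uniform prior over theta
flips = [1, 0, 1, 1, 1, 0]   # observed data D: 1 = heads, 0 = tails

heads = sum(flips)
tails = len(flips) - heads

alpha_post = alpha + heads
beta_post = beta + tails

posterior_mean = alpha_post / (alpha_post + beta_post)
print(f"theta | D ~ Beta({alpha_post}, {beta_post}), mean = {posterior_mean:.3f}")
# theta | D ~ Beta(5, 3), mean = 0.625
```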
Breaking Captchas with Bayesian Models
https://youtu.be/ZTKx4TaqNrQ?t=9
TA Le, A G Baydin, and F Wood. “Inference Compilation and Universal Probabilistic Programming”. In: AISTATS. 2017.
Making Predictions
Making Predictions (2)
Points of Note
• We usually assume that p(D∗|θ, D) = p(D∗|θ), i.e. data is
conditionally independent given θ
• p(D∗|θ) is equivalent to the likelihood model of the new data:
in almost all cases we just use the likelihood from the original
model
• Calculating the posterior predictive can be computationally
challenging: sometimes we resort to approximations,
e.g. taking a point estimate for θ (see Lecture 4)
• There are lots of things we might use the posterior for other
than just calculating the posterior predictive, e.g. making
decisions (see course notes) and calculating expectations
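The posterior predictive p(D∗|D) = ∫ p(D∗|θ) p(θ|D) dθ can often be approximated by Monte Carlo: sample θ from the posterior and average p(D∗|θ). A sketch for a biased coin whose posterior over the heads probability is Beta(5, 3), e.g. after observing 4 heads and 2 tails under a uniform prior (an assumed example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior over heads probability: theta | D ~ Beta(5, 3).
# Predictive probability the next flip is heads:
# p(D* = heads | D) = E_{theta ~ p(theta|D)}[theta], estimated by sampling.
theta_samples = rng.beta(5, 3, size=100_000)
p_heads = theta_samples.mean()

print(f"Monte Carlo estimate: {p_heads:.3f}")  # exact answer is 5/8 = 0.625
```

Here the integral is also available in closed form (the posterior mean), which makes it easy to check that the Monte Carlo estimate is behaving.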
Recap
Further Reading
• Look at the course notes! For this lecture there is a discussion
of Bayesian vs frequentist approaches, and a worked example
of Bayesian modeling for a biased coin.
• Chapter 1 of K P Murphy. Machine learning: a probabilistic
perspective. 2012. https://www.cs.ubc.ca/~murphyk/MLbook/pml-intro-22may12.pdf.
• L Breiman. “Statistical modeling: The two cultures”. In:
Statistical Science (2001).
• Chapter 1 of C Robert. The Bayesian choice: from
decision-theoretic foundations to computational
implementation. 2007. https://www.researchgate.net/publication/41222434_The_Bayesian_Choice_From_Decision_Theoretic_Foundations_to_Computational_Implementation.