
Introduction to Probabilistic Machine Learning


a handbook for innumerate computer scientists

Jacob Moss
University of Cambridge

Last updated: December 17, 2020

Citation: Jacob Moss. (2020). Introduction to Probabilistic Machine Learning.


Contents

Preface

1 Introduction
  1.1 Moments
  1.2 Distributions
    1.2.1 Continuous
    1.2.2 Discrete
    1.2.3 Mapping Between Random Variables
  1.3 Softmax Transformation
  1.4 Types of Models
  1.5 Graphical Models
    1.5.1 Directed Graphical Models
    1.5.2 Markov Random Fields
  1.6 Keep in mind

2 Maximum Likelihood
  2.1 Least Squares Regression
  2.2 Regression with Additive Independent Gaussian Noise

3 Bayesian Inference
  3.1 Maximum a posteriori
  3.2 Conjugate Priors
  3.3 Gaussian Likelihood
  3.4 Marginal Likelihood

4 Gaussian Processes
  4.1 Non-parametric Modelling
  4.2 Definition
  4.3 Hyperparameters & Covariance Functions
    4.3.1 Hilbert Spaces
    4.3.2 Positive Definiteness of Inner Products in Hilbert Spaces
    4.3.3 Reproducing Kernel Hilbert Space
    4.3.4 Squared Exponential (RBF)
    4.3.5 Periodic
    4.3.6 Reproducing Kernel for Vector-Valued Functions
  4.4 Coregionalization
  4.5 Relation with Linear in the Parameters Model
  4.6 Eigenvectors and Relation to PCA
  4.7 Cholesky Decomposition
  4.8 Gaussian Process Example
  4.9 Scalability

5 Gaussian Process Classification

6 Monte Carlo
  6.1 Monte Carlo
  6.2 Markov Chains & Gibbs Sampling
  6.3 Example: Step Model

7 Ranking
  7.1 Towards Probabilistic Ranking
  7.2 Gibbs Sampling in TrueSkill
  7.3 Message Passing on Factor Graphs

8 Expectation-Maximisation
  8.1 Gaussian Mixture Models
  8.2 Mathematical Machinery
    8.2.1 Jensen's Inequality
    8.2.2 Kullback-Leibler Divergence
    8.2.3 Mutual Information
  8.3 Expectation Maximisation Algorithm
  8.4 Example: K-Means
  8.5 Example: Topic Modelling
    8.5.1 Latent Dirichlet Allocation

9 Variational Inference
  9.1 Amortizing Variational Inference
  9.2 General Form
    9.2.1 Mean Field Approximation
  9.3 Examples
    9.3.1 Inducing Point Approximation for Gaussian Processes
    9.3.2 Deep Gaussian Processes
  9.4 Implementation Details

10 Stochastic Calculus
  10.1 Probability Generating Function
  10.2 Stochastic Process
  10.3 Brownian Motion
    10.3.1 Brownian Motion as a GP
  10.4 Stochastic Differential Equations
    10.4.1 Itô vs. Stratonovich

11 Advanced Generative Models
  11.1 Variational Autoencoders
    11.1.1 Reparameterisation Trick
  11.2 Generative Adversarial Networks
  11.3 Normalising Flows

A Derivations
  A.1 Double Integral of a Mean Function
  A.2 Expansion of Strange Gaussian Identity
  A.3 Derivation of Biased Variance Estimator
  A.4 Derivations of Means and Variances

B Kullback-Leibler Divergence

C Backup

Todo List:
1. Beta distribution derivation, arising from Bayesian inference;
2. Monte Carlo
3. Things they don't tell us...
Preface
The purpose of this booklet is to give the foundations and intuitions for probabilistic machine learning. The target audience is computer scientists who might have missed out on some critical components in their mathematical education. When I started my Master's in machine learning in 2019, I found I had considerable gaps which made reading the literature both intimidating and inefficient. This is why I set out to create this booklet. I am now a PhD student at the University of Cambridge (a Bayesian stronghold), which has only solidified my passion for the subject, and I can only hope this helps others find their way into this field.
Probabilistic machine learning is a fascinating subject, and also incredibly useful in practice. The field is growing rapidly, so I will regularly update this document with new material, clarifications, and corrections. Please do contact me at [email protected] if you spot any mistakes or have any requests.
Disclaimer: Some of the structure is inspired by courses including Carl Rasmussen's Probabilistic ML (4F13) course at Cambridge and Dariush Hosseini's data mining course at UCL.
1 Introduction
In machine learning, we typically try to fit a model to a dataset. This model may be parameterised by θ. In a Bayesian model, we assign some prior distribution over the parameters. We also have a likelihood: the probability of the data given a particular parameter setting. For example, the likelihood of a coin toss is $p(x = 1 | \theta) = \theta$, where θ = 0.5 for an evenly-weighted coin. This is an example of a discrete distribution. In a Bayesian setting, however, this θ is not fixed: it is given a distribution. In the next few sections, it will become clear how a Bayesian mindset enables us to apply our prior knowledge that θ is around 0.5!

Colour Coding

In order to make equations easier to read, we use the following colour coding throughout:
• prior distribution
• posterior distribution
• marginal distribution
• variational distribution
n.b. the terms above will be defined at their first point of use

The key thing to understand in Bayesian statistics is the posterior update; using the prior and
likelihood, the posterior is updated according to the data. This is achieved using Bayes’ theorem:
the root of all Bayesian statistics,

$$p_\Theta(\theta | x) = \frac{p_D(x | \theta)\, p_\Theta(\theta)}{p_D(x)},$$

where $p_\Theta(\theta | x)$ is the posterior, $p_D(x | \theta)$ is the likelihood, $p_\Theta(\theta)$ is the prior, and $p_D(x)$ is the evidence or marginal likelihood. The marginal likelihood is very important; see Section 3.4.
Note that some literature may use $\mathrm{lik}(\theta | x)$, which is the likelihood function: it is neither a conditional probability nor a probability density. Likelihood is the probability of observing the dataset given the parameters.

Types of Probability

• Discrete variables have probability mass assigned to each event in the countable sample space. The Poisson distribution is an example.
• Continuous variables have probability density assigned to an uncountable set. The Gaussian distribution is an example.
• Marginal: marginalisation is simply summing out/integrating over the probability of a r.v. as follows:

$$p(x) = \sum_y p(x, y) \;\text{ for discrete distributions}, \qquad p(x) = \int p(x, y)\, dy \;\text{ for continuous distributions}$$

• Joint: $p_{X,Y}(x, y) = p(X = x \text{ and } Y = y)$
• Conditional: $p(y | x)$

Another useful result is the law of total probability:

$$p(x) = \int p(x | y)\, p(y)\, dy$$
1.1 Moments
Moments summarise key features of distributions, for example the mean, variance, kurtosis, and
skewness. We will discuss a few of these below, in addition to deriving some useful identities.

Moment   Equation                                              Name
0        $\int p_X(x)\, dx = 1$                                (normalisation)
1        $\mathbb{E}[X] = \int x\, p_X(x)\, dx$                Mean
2        $\mathrm{Var}[X] = \int (x - \mu)^2 p_X(x)\, dx$      Variance

Table 1: Moments

Expectation is a linear operator, so for linear $f$, $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$. Furthermore, for linear operations the variance exhibits the following results:

$$\mathrm{Var}(X) = \mathbb{E}\big((X - \mathbb{E}(X))^2\big) = \mathbb{E}\big(X^2 + \mathbb{E}(X)^2 - 2X\mathbb{E}(X)\big) = \mathbb{E}(X^2) - \mathbb{E}(X)^2$$

$$\begin{aligned}
\Rightarrow\; \mathrm{Var}(aX + b) &= \mathbb{E}\big((aX + b)^2\big) - \big(\mathbb{E}(aX + b)\big)^2 \\
&= \mathbb{E}\big(a^2X^2 + b^2 + 2abX\big) - (a\mathbb{E}(X) + b)^2 \\
&= a^2\mathbb{E}(X^2) + 2ab\,\mathbb{E}(X) - a^2\mathbb{E}(X)^2 - 2ab\,\mathbb{E}(X) \\
&= a^2\big(\mathbb{E}(X^2) - \mathbb{E}(X)^2\big) = a^2\,\mathrm{Var}(X)
\end{aligned}$$

Note that the sample mean $\hat\mu = \frac{1}{N}\sum_{i=1}^N X_i$ is an estimator of the mean, not a moment! The same goes for other estimators such as the sample variance. These estimators are, in fact, random variables. For example:

$$\mathbb{E}[\hat\mu] = \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^N X_i\Big] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}(X_i) = \mathbb{E}(X_i)$$

so the sample mean is unbiased; as we will see below, its error drops proportionally to the square root of the sample size. Let us now derive¹ the same for the variance:

$$\mathbb{E}[s^2] = \mathbb{E}\Bigg[\frac{1}{N}\sum_{i=1}^N \Big(X_i - \frac{1}{N}\sum_{j=1}^N X_j\Big)^2\Bigg] = \frac{N-1}{N}\,\mathrm{Var}[X_i]$$

Because of the multiplying factor, this is called a biased estimator. We can make it an unbiased estimator by multiplying by $\frac{N}{N-1}$. How bad are these estimators? Well, we can interpret the standard deviation of the estimator as the error in the estimator; e.g., for the mean:

$$\mathrm{Var}[\hat\mu] = \mathrm{Var}\Big[\frac{1}{N}\sum_{i=1}^N X_i\Big] = \frac{1}{N^2}\sum_{i=1}^N \mathrm{Var}[X_i] = \frac{1}{N}\sigma_X^2 \;\Rightarrow\; \sigma_{\hat\mu} = \frac{1}{\sqrt N}\sigma_X$$

¹ The full derivation of the estimator for variance is found in Appendix A.3.
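To make the bias concrete, here is a minimal NumPy sketch (the true variance, sample size, and trial count are illustrative assumptions): it draws many samples of size N and averages both estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, N, trials = 4.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
mu_hat = x.mean(axis=1, keepdims=True)

s2_biased = ((x - mu_hat) ** 2).mean(axis=1)   # divides by N
s2_unbiased = s2_biased * N / (N - 1)          # rescaled by N/(N-1)

print(s2_biased.mean())    # close to (N-1)/N * sigma2 = 3.6
print(s2_unbiased.mean())  # close to sigma2 = 4.0
```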

1.2 Distributions
Probability distributions are functions that output probabilities for given events/data. Distributions are nice to work with, but note that you may encounter distributions of unknown analytical form. For example, products of known distributions are not always tractable. Some well-known distributions are discussed below.

1.2.1 Continuous
Gaussian Distribution
Also called the Normal distribution. A very special distribution, used for measuring magnitudes, for example heights or temperatures. The Central Limit Theorem states that the (normalised) sum of many independent random variables, whatever their distribution, tends to a Gaussian.
A joint Gaussian distribution is the same as a multivariate Gaussian.

1.2.2 Discrete
Counts
Suppose we have an experiment where we toss a coin n times, observing k heads. A suitable distribution would be the binomial. What is π, the probability of getting a head? If we pursue a frequentist approach of maximum likelihood:

$$p(k | \pi, n) \propto \pi^k (1 - \pi)^{n-k} \quad\text{(binomial pmf)}$$

$$\arg\max_\pi p(k | \pi, n) = \arg\max_\pi \log p(k | \pi, n) = \arg\max_\pi\; k\log\pi + (n - k)\log(1 - \pi)$$

$$\frac{\partial \log p(k | \pi, n)}{\partial\pi} = \frac{k}{\pi} - \frac{n-k}{1-\pi} = 0 \;\Rightarrow\; \pi = \frac{k}{n}$$

If we approach instead in a Bayesian setting, we apply a prior distribution to our probability π... a distribution over probabilities?

$$\mathrm{Beta}(\pi | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\pi^{\alpha-1}(1 - \pi)^{\beta-1}$$

α and β are shape parameters and can be read as pseudo-counts plus one. The Beta distribution is defined over [0, 1] and is actually a special case of the Dirichlet distribution, which will be discussed in a later section. The Beta is a conjugate prior to the binomial distribution (the posterior is Beta when the likelihood is binomial).
Continuing with the Bayesian approach, let our data consist of a single coin toss: D = {k = 1} with n = 1. Our likelihood is p(D|π) = π. Our posterior:

$$p(\pi | D) = \frac{p(\pi | \alpha, \beta)\,p(D | \pi)}{p(D)} \propto \mathrm{Beta}(\pi | \alpha, \beta)\,\pi \propto \pi^{\alpha}(1 - \pi)^{\beta-1} \propto \mathrm{Beta}(\alpha + 1, \beta)$$

Word Counts For example, a word frequency bar chart. Zipf's law states that the frequency of any word is inversely proportional to its rank in the ordered frequency table. Just counting words, without taking order into account, is very simple but not very useful: words like 'the' are the most frequent yet give little information about meaning.
Multinomial Distribution

Dirichlet Distribution The Dirichlet distribution is, as mentioned, a generalisation of the Beta distribution, defined on the (m−1)-dimensional simplex. A simplex is a higher-dimensional triangle, defined by m vertices in an m-dimensional space. The Dirichlet is the conjugate prior for the multinomial.
• The binomial is to the Bernoulli as the multinomial is to the categorical (multiple trials).
• The Dirichlet is to the multinomial as the Beta is to the binomial.

$$\mathrm{Dir}(\pi | \alpha_1, \dots, \alpha_m) = \frac{\Gamma\big(\sum_{i=1}^m \alpha_i\big)}{\prod_{i=1}^m \Gamma(\alpha_i)} \prod_{i=1}^m \pi_i^{\alpha_i - 1}$$

The $\alpha_i$ are the shape parameters, and $\mathbb{E}(\pi_j) = \alpha_j / \sum_i \alpha_i$ is the mean of the j-th element.

Figure 1: Symmetric Dirichlet plot with various settings of α

1.2.3 Mapping Between Random Variables

We will discuss a method called Inverse Transform Sampling here, which is used to sample from other distributions using the c.d.f. We discuss more complicated sampling methods in Section 6. This is crucial for some advanced models such as the Variational Autoencoder in Section 11.1. Suppose we have a continuous r.v. U with p.d.f. $p_U(u)$ and c.d.f. $F_U(u)$, and we seek a function $f: U \to X$ where X has our desired distribution. To preserve ordering, f must be a monotonically increasing function.
The conservation of probability, $p_U(u)\,du = p_X(x)\,dx$, means:

$$p_U(f^{-1}(x))\,du = p_X(x)\,dx \;\Rightarrow\; p_X(x) = p_U(f^{-1}(x))\,\frac{du}{dx} = p_U\big(f^{-1}(x)\big)\frac{df^{-1}(x)}{dx}$$

The same applies to the c.d.f.:

$$F_X(x) = \int_{-\infty}^x p_X(t)\,dt = \int_{-\infty}^x p_U\big(f^{-1}(t)\big)\frac{df^{-1}(t)}{dt}\,dt \;\Rightarrow\; F_X(x) = F_U(f^{-1}(x))$$

Let U be uniform on [0, 1], so that $F_U(u) = u$:

$$F_X(x) = F_U(f^{-1}(x)) = f^{-1}(x) = u \;\Rightarrow\; x = F_X^{-1}(u)$$

Therefore, if we have the inverse c.d.f. then we can sample from the distribution. The same applies to discrete distributions. This is the foundation for Normalising Flows [Rezende and Mohamed, 2015], a generative model that we will discuss in Section 11.3.
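As a minimal sketch, assuming an exponential target (an illustrative choice, not an example from the text): the c.d.f. $F_X(x) = 1 - e^{-rx}$ inverts to $F_X^{-1}(u) = -\log(1-u)/r$, so uniform draws map to exponential draws.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 2.0
u = rng.uniform(size=100_000)   # U ~ Uniform(0, 1), so F_U(u) = u
x = -np.log(1.0 - u) / r        # x = F_X^{-1}(u), exponentially distributed

print(x.mean(), 1 / r)          # empirical mean matches the exponential mean 1/r
```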

1.3 Softmax Transformation
The softmax function is very useful in machine learning: it takes a length-K vector and outputs a normalised probability distribution (summing to 1) consisting of K probabilities. It is used in multi-class classification, and can be used in multivariate optimisation. For example, optimise

$$f(x_1, x_2, x_3) = 0.2x_1 + 0.3x_2 + 0.5x_3 \quad\text{where } x_1, x_2, x_3 \in [0, 1] \text{ and } x_1 + x_2 + x_3 = 1$$

We can set $x_3 = 1 - x_1 - x_2$. The softmax is normalised, which will satisfy the constraint. Also, due to this constraint, we have only two free variables:

$$x_1 = \frac{e^{\xi_1}}{e^{\xi_1} + e^{\xi_2} + 1}, \qquad x_2 = \frac{e^{\xi_2}}{e^{\xi_1} + e^{\xi_2} + 1}, \qquad x_3 = \frac{1}{e^{\xi_1} + e^{\xi_2} + 1}$$

todo finish
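A minimal NumPy sketch of the transformation; the max-subtraction is a standard numerical-stability trick rather than part of the definition, and fixing one logit to zero (as in the three-variable example above, whose third logit is 0) removes the redundant degree of freedom.

```python
import numpy as np

def softmax(xi):
    # Subtracting the max changes nothing mathematically (softmax is
    # invariant to adding a constant to every logit) but avoids overflow.
    e = np.exp(xi - np.max(xi))
    return e / e.sum()

xi = np.array([1.0, 2.0, 0.0])   # unconstrained parameters [xi_1, xi_2, 0]
x = softmax(xi)
print(x, x.sum())                # each x_k in (0, 1), and they sum to 1
```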

1.4 Types of Models

A generative model estimates the joint distribution, and from it computes the posterior $p_Y(y | x)$ to make predictions. Often this is done by inferring the likelihood $p_X(x | y)$ and prior $p_Y(y)$. Learning the posterior directly is termed discriminative classification.

1.5 Graphical Models

Graphical models are sometimes treated as a separate topic, but I prefer to view them as a tool for visualising and constructing probabilistic models. A graphical model is essentially a dependency graph: there is an arrow from node A to node B (where A, B are r.v.s) if B is conditioned on A.

1.5.1 Directed Graphical Models

These are directed acyclic graphs, where to each node we assign a random variable $X_i$ and a probability density function $f_i(x_i, x_{\pi_i})$, where $\pi_i$ is the set of parents of $X_i$: those on which $X_i$ is conditioned. The graphical model is possible using the chain rule of probability:

$$p_{X_1,\dots,X_n}(x_1, \dots, x_n) = p_{X_1}(x_1)\,p_{X_2|X_1}(x_2 | x_1)\cdots p_{X_n|X_1,\dots,X_{n-1}}(x_n | x_1, \dots, x_{n-1}) \tag{1}$$

Equation 1 creates a fully-connected DAG, which can represent any probability distribution. The joint probability distribution for N variables, where each variable can take a value in X, requires a table of size $|X|^N$. When the conditional distributions involved do not depend on all conditioning variables, some edges can be removed. Sparse graphs can lead to more efficient inference.

Figure 2: Directed and undirected graphical models. (a) $p_{X,Y,Z} = p_{Z|Y}\,p_{Y|X}\,p_X$; (b) $X_A \perp\!\!\!\perp X_B$.

1.5.2 Markov Random Fields
Markov Random Fields are undirected graphical models which satisfy the Markov property: the future and past are conditionally independent given the present. That is, $X \perp\!\!\!\perp Y | Z$ iff all paths from a node in X to a node in Y pass through a node in Z. In Figure 2(b), we see that the nodes in $X_A$ are independent of the nodes in $X_B$.

1.6 Keep in mind

When marginalising out a variable, the limits are often not explicit:

$$L_i(\Theta) = -\log\Bigg(\prod_{j=1}^K \int p_{ij}^{y_{ij}+\alpha_{ij}-1}\,dp_{ij}\Bigg)$$

really means

$$L_i(\Theta) = -\log\Bigg(\prod_{j=1}^K \int_0^1 p_{ij}^{y_{ij}+\alpha_{ij}-1}\,dp_{ij}\Bigg)$$
2 Maximum Likelihood
Before we discuss Bayesian models of inference, we will discuss an incredibly important (and simple) concept: maximum likelihood estimation (MLE). Most neural networks carry out MLE: they minimise some cost function which is often the negative log likelihood, in other words maximising the log likelihood! MLE is exactly this: maximise the likelihood. Equivalently, we often maximise the log-likelihood, since the logarithm is an increasing function and it often makes the analytical solution easier to derive.

$$\theta^* = \arg\max_\theta\, \mathrm{lik}(\theta | x) = \arg\max_\theta\, p_X(x | \theta)$$

where $p_X$ is either the probability mass for discrete variables or the density for continuous variables. For example, with a binomial likelihood with n trials:

$$\theta^* = \arg\max_\theta\, \log\binom{n}{x}\theta^x(1-\theta)^{n-x} = \arg\max_\theta\, \underbrace{x\log\theta + (n-x)\log(1-\theta)}_{f}$$

$$\frac{\partial f}{\partial\theta} = \frac{x}{\theta^*} - \frac{n-x}{1-\theta^*} = 0 \;\Rightarrow\; x = \theta^*(n - x + x) \;\Rightarrow\; \theta^* = \frac{x}{n}$$

Therefore $\theta^* = x/n$ is our MLE, i.e. the probability of success is the number of successful trials over the total number of trials: very intuitive. We will look at this example again through a Bayesian lens in Section 3.1.

2.1 Least Squares Regression

Another example of MLE is least squares regression²: a method of fitting a line through a set of points. To start with, how can we fit a line through a cloud of points? Perhaps a polynomial function would be good? The question then is: what order should the polynomial be? Suppose we take an M-order polynomial, $f_w(x)$. We denote $\{\phi_j(x)\}_{j=0}^M$ as the basis functions. Such a model is linear in parameters (w) but not linear in variables.

$$y^* = f_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^M w_j\,\phi_j(x) \quad\text{where } \phi_j(x) = x^j$$

$$\mathbf{y}^* = X\mathbf{w}$$

where X is the design matrix with rows $[\phi_0(x), \phi_1(x), \dots]$ and w is the column vector of weights.
We then attempt to minimise the error, which is the discrepancy between the actual points y and the model estimates y*.

² In fact, least squares regression can be derived from several directions: projection matrices and maximum likelihood are two such starting points.
$$e = y - y^*$$
$$E(w) = \|e\|^2 = (y - y^*)^\top(y - y^*) = (y - Xw)^\top(y - Xw) = y^\top y - 2w^\top X^\top y + (Xw)^\top(Xw)$$
$$\frac{\partial E(w)}{\partial w} = 2X^\top Xw - 2X^\top y = 0 \;\Rightarrow\; X^\top Xw = X^\top y$$
$$w = (X^\top X)^{-1}X^\top y \quad\text{which is the normal equation}$$
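A minimal NumPy sketch of the normal equation on synthetic data (the polynomial order, true weights, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(x.size)

M = 2                                                # polynomial order
X = np.stack([x**j for j in range(M + 1)], axis=1)   # design matrix, column j = phi_j(x)

# Normal equation w = (X^T X)^{-1} X^T y; solving the linear system is
# numerically preferable to forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # approximately [1.0, -2.0, 0.5]
```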

2.2 Regression with Additive Independent Gaussian Noise

An interesting result is that MLE and the OLS regression shown above yield the same result if we model the function as $y^{(i)} = f_w(x^{(i)}) + \varepsilon^{(i)}$ where $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, with $\sigma^2$ the noise variance. Thus $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, which is called an isotropic multivariate Gaussian: isotropic because the covariance is a scalar multiple of the identity, so its density contours are hyperspheres.
Since we are assuming the noise is independent, and there are N terms ($\mathrm{len}(\varepsilon) = N$), the following result follows simply from the Gaussian pdf:

$$p(\varepsilon) = \mathcal{N}(0, \sigma^2 I) = \prod_{n=1}^N p(\varepsilon^{(n)}) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^N \exp\Big(-\frac{\varepsilon^\top\varepsilon}{2\sigma^2}\Big)$$

Since $y = y^* + \varepsilon$, we can work out the probability of the data y given the model estimate y*, by constructing the same normal distribution with the mean as the estimate and variance as the variance of the noise. In other words: centred around the point, with the spread at that point.

$$\varepsilon^\top\varepsilon = \|\varepsilon\|^2 = \|y - y^*\|^2 = E(w) \quad\text{(the sum of squared errors from before)}$$

$$p(y | y^*) = \mathcal{N}(y^*, \sigma^2 I) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^N \exp\Big(-\frac{\|y - y^*\|^2}{2\sigma^2}\Big)$$

Since $y^* = Xw$, for a given X, $p(y | y^*) = p(y | w)$. Aha! The likelihood function of w has fallen out. Thus we attempt to maximise the likelihood $L(w) \propto p(y | w)$:

$$w^* = \arg\max_w L(w) = \arg\max_w \exp\Big(-\frac{\|y - y^*\|^2}{2\sigma^2}\Big) = \arg\min_w E(w)$$

The above is possible since exp is an increasing function. Note how this is the same result as with the least squares method. Overfitting is still an issue: with MLE, more complex models overfit the data. Next we bring Bayes into it.
3 Bayesian Inference
3.1 Maximum a posteriori
The MAP estimate is the point estimate of a parameter θ: essentially the MLE, but taking into account the prior. Suppose we have observations x from the distribution $p(x | \theta)$. Start with the MLE; we therefore need the likelihood, $\mathrm{lik}(\theta | x) = p(x | \theta)$.
We can calculate the posterior using Bayes' theorem. We have a prior belief in the form of a distribution over θ, $p_\Theta(\theta)$:

$$p(\theta | x) = \frac{p(x | \theta)\,p_\Theta(\theta)}{p(x)}$$

We now maximise the posterior, noting that the marginal likelihood does not affect the maximisation over θ and is always positive, so we can ignore it:

$$\theta^* = \arg\max_\theta p(\theta | x) = \arg\max_\theta \frac{p(x | \theta)\,p_\Theta(\theta)}{p(x)} = \arg\max_\theta p(x | \theta)\,p_\Theta(\theta)$$

3.2 Conjugate Priors

If the prior and posterior distributions are in the same family then they are said to be conjugate, and the prior is the conjugate prior to the likelihood. We will continue our example from Section 2, where we have a binomial likelihood: $p(x | \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$. Since we are doing Bayesian inference, our parameters must be given prior distributions. In this case, we need a prior for θ: the probability of success. We therefore need a distribution over probabilities. The beta distribution is one such distribution, which also happens to be the conjugate prior to the binomial and Bernoulli likelihoods. It is defined as:

$$p(\theta) = \frac{1}{B(\alpha, \beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$$

where α and β can be interpreted as one plus the pseudo-counts. For example, if you have seen 5 successes and 4 failures, set α = 6, β = 5. $B(\alpha, \beta)$ is defined as:

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$

where $\Gamma(x) = (x - 1)!$ for positive integers x. Now we will show that the beta distribution is the conjugate prior. Starting with Bayes' rule:

$$p(\theta | x) = \frac{p(x | \theta)\,p_\Theta(\theta)}{p(x)} = \frac{\binom{n}{x}\theta^x(1-\theta)^{n-x}\,\frac{1}{B(\alpha,\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}}{p(x)} \tag{2}$$

$$= \frac{\binom{n}{x}B(\alpha,\beta)^{-1}\,\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}}{p(x)} \tag{3}$$

$$= \frac{\frac{n!}{x!(n-x)!}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}}{p(x)} \tag{4}$$

$$= \frac{\overbrace{\frac{\Gamma(n+1)\Gamma(\alpha+\beta)}{\Gamma(x+1)\Gamma(n-x+1)\Gamma(\alpha)\Gamma(\beta)}}^{G}\;\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}}{p(x)} \tag{5}$$

We pause here and calculate the marginal:

$$p(x) = \int p(x | \theta)\,p_\Theta(\theta)\,d\theta \tag{6}$$

$$= \int G\cdot\underbrace{\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}}_{\text{unnormalised } \mathrm{Beta}(\alpha+x,\, n-x+\beta)}\,d\theta \tag{7}$$

$$= G\cdot B(x+\alpha,\, n-x+\beta)\underbrace{\int \frac{1}{B(x+\alpha,\, n-x+\beta)}\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}\,d\theta}_{\text{normalises to } 1} \tag{8}$$

where we multiplied and divided by B(·) in order to normalise the integrand in Eq. 8. Plugging this marginal into the intermediate result, Eq. 5:

$$p(\theta | x) = \frac{G\cdot\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}}{G\cdot B(x+\alpha,\, n-x+\beta)} = \mathrm{Beta}(x+\alpha,\, n-x+\beta)$$

Since our posterior is also beta, it is conjugate with the beta prior. We can finish this example by calculating the MAP and comparing it to the MLE! We maximise as follows:

$$\begin{aligned}
\theta^* &= \arg\max_\theta \log p(\theta | x) = \arg\max_\theta \log\mathrm{Beta}(x+\alpha,\, n-x+\beta)\\
&= \arg\max_\theta \log\frac{\theta^{x+\alpha-1}(1-\theta)^{n-x+\beta-1}}{B(x+\alpha,\, n-x+\beta)}\\
&= \arg\max_\theta\, (x+\alpha-1)\log\theta + (n-x+\beta-1)\log(1-\theta)
\end{aligned}$$

$$\frac{\partial}{\partial\theta} = \frac{x+\alpha-1}{\theta} - \frac{n-x+\beta-1}{1-\theta} = 0 \;\Rightarrow\; \theta^*_{\mathrm{MAP}} = \frac{x+\alpha-1}{n+\alpha+\beta-2}$$

Compare this to our MLE: $\theta_{\mathrm{MLE}} = x/n$. They are equivalent if α = β = 1, and otherwise they reflect whatever prior information we give with our setting of α and β. Very satisfying!
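A small numerical check of this result, using SciPy and illustrative values for α, β, n, and x:

```python
import numpy as np
from scipy import stats

alpha, beta, n, x = 3, 3, 10, 7     # illustrative prior and data: 7 successes in 10

theta_mle = x / n                                        # 0.7
theta_map = (x + alpha - 1) / (n + alpha + beta - 2)     # 9/14, approx. 0.643

# Check against the mode of the Beta(x + alpha, n - x + beta) posterior.
grid = np.linspace(1e-6, 1 - 1e-6, 200_001)
pdf = stats.beta(x + alpha, n - x + beta).pdf(grid)
print(theta_mle, theta_map, grid[np.argmax(pdf)])   # the last two agree
```

The MAP estimate is pulled from the MLE towards the prior, as the derivation predicts.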

3.3 Gaussian Likelihood

Take a model M, representing a choice of model structure and parameter values. Let the structure be $y = f_w(x) + \varepsilon$; the probability of the data is conditional: $p(y | x)$. The Gaussian likelihood is:

$$p(y | x, w, \mathcal{M}) \propto \prod_{i=1}^N \exp\Big(-\frac{(y^{(i)} - f_w(x^{(i)}))^2}{2\sigma^2}\Big)$$

Fit the model by optimising for the weights (MLE):

$$w^* = \arg\max_w p(y | x, w, \mathcal{M})$$

After maximisation, make predictions from $p(y | x, w^*, \mathcal{M})$. Notice how it now uses the fitted weights.
With a certain likelihood distribution, the conjugate prior is the distribution such that the posterior will be tractable (there is a closed form). Here we will have a Gaussian likelihood $p(y | x, w, \mathcal{M}) = \mathcal{N}(Xw, \sigma^2 I)$ and Gaussian prior $p(w | \mathcal{M}) = \mathcal{N}(0, \Sigma_w)$.
Let's now apply Bayes' rule as before to include the prior. We first calculate the posterior:

$$p(w | x, y, \mathcal{M}) = \frac{p(y | x, w, \mathcal{M})\,p(w | \mathcal{M})}{p(y | x, \mathcal{M})} = \mathcal{N}(\mu, \Sigma)$$
$$\propto \exp\Big(-\frac{1}{2\sigma^2}\|y - Xw\|^2\Big)\exp\Big(-\frac{1}{2}w^\top\Sigma_w^{-1}w\Big)$$
$$= \exp\Big(-\frac{1}{2}\big((Xw)^\top Xw\,\sigma^{-2} - 2y^\top(Xw)\,\sigma^{-2} + w^\top\Sigma_w^{-1}w\big)\Big)$$
$$= \exp\Big(-\frac{1}{2}\big(w^\top(X^\top X\sigma^{-2} + \Sigma_w^{-1})w - 2\sigma^{-2}w^\top X^\top y\big)\Big) \tag{9}$$

Notice equation (9); remember completing the square with a scalar quadratic, e.g. $x^2 + bx + c$. With matrices, the result is similar:

$$x^\top Ax + x^\top b + c = (x - \mu)^\top A(x - \mu) + k$$

where $\mu = -\frac{1}{2}A^{-1}b$ and $k = c - \frac{1}{4}b^\top A^{-1}b$. Apply this result to the exponent as follows:

$$= \exp\Big(-\frac{1}{2}(w - \mu)^\top\Sigma^{-1}(w - \mu)\Big)$$
$$\text{where } \Sigma = (\sigma^{-2}X^\top X + \Sigma_w^{-1})^{-1}, \qquad \mu = -\frac{1}{2}\times(-2\sigma^{-2})\,\Sigma X^\top y = \sigma^{-2}\Sigma X^\top y$$

Reminder of the marginal: $p(x) = \int p(x, y)\,dy = \int p(x | y)\,p(y)\,dy$. We need to marginalise out the parameters in order to make predictions. The following is the predictive distribution:

$$p(y^* | x^*, x, y, \mathcal{M}) = \int p(y^*, w | x^*, x, y, \mathcal{M})\,dw = \int p(y^* | x^*, w, \mathcal{M})\,p(w | x, y, \mathcal{M})\,dw$$
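A minimal NumPy sketch of the posterior computation (the basis, prior covariance, true weights, and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.1**2                                  # noise variance (assumed known)
x = np.linspace(-1.0, 1.0, 20)
X = np.stack([np.ones_like(x), x], axis=1)       # basis functions [1, x]
y = X @ np.array([0.5, -1.0]) + np.sqrt(sigma2) * rng.standard_normal(x.size)

Sigma_w = np.eye(2)                              # prior p(w) = N(0, Sigma_w)

# Posterior from the completed square:
# Sigma = (X^T X / sigma^2 + Sigma_w^{-1})^{-1},  mu = Sigma X^T y / sigma^2
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Sigma_w))
mu = Sigma @ X.T @ y / sigma2
print(mu)          # posterior mean, close to the true weights [0.5, -1.0]
```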

3.4 Marginal Likelihood

We obtain the marginal likelihood $p(y | x, \mathcal{M})$ by marginalising out w. If there is confusion regarding w not being conditioned on x, I believe the intuition is that the distribution of w is in some way parameterised by x. This is an application of the law of total probability:

$$p(y | x, \mathcal{M}) = \int p(w | \mathcal{M})\,p(y | x, w, \mathcal{M})\,dw$$

One then optimises the marginal likelihood to tune the hyperparameters.
To demonstrate how the marginal likelihood assists in model selection, we apply Bayes' rule again:

$$p(\mathcal{M} | y, x) = \frac{p(y | x, \mathcal{M})\,p(\mathcal{M})}{p(y | x)} \propto p(y | x, \mathcal{M})\,p(\mathcal{M})$$

Since the probability of a model given the data is proportional to the marginal likelihood, this provides some hand-wavy intuition as to why the marginal likelihood gives some kind of score to the model.
4 Gaussian Processes
4.1 Non-parametric Modelling
Previously, the models discussed have been parametric. These parameters are marginalised over to yield a predictive distribution:

$$p(y^* | x^*, x, y) = \int p(y^* | x^*, w, x, y)\,p(w | x, y)\,dw$$

Observe the last term: $p(w | x, y)$ is a bottleneck for the model, since there are a fixed number of parameters. In fact, the distribution over the parameters implies a distribution over functions. A non-parametric model would work directly with such a distribution. Moreover, the predictive distribution is usually an intractable integral. Which distribution is easily integrable? Gaussians.
Suppose we have the joint probability:

$$p(x_1, x_2) = \mathcal{N}\Bigg(\begin{bmatrix}\mu_1\\ \mu_2\end{bmatrix}, \begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}\Bigg)$$

• What is the marginal distribution, $p(x_1)$? We integrate out $x_2$:

$$p(x_1) = \int p(x_1, x_2)\,dx_2 = \mathcal{N}(\mu_1, \Sigma_{11})$$

• What is the conditional distribution, $p(x_1 | x_2)$? This has the solution $p(x_1 | x_2) = \mathcal{N}(\mu_c, \Sigma_c)$ where:

$$\mu_c = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \Sigma_c = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

Recall a model using basis functions: $f_w(x) = \sum_{m=0}^M w_m\phi_m(x)$. A prior over the weights, p(w), induces a prior distribution over functions, p(f).
How do we make predictions from such a distribution over functions? Given a parametric family of functions, $f(x | \mathbf{f})$, and a prior over $\mathbf{f}$, Bayesian modelling helps us predict given the data, $p(\mathbf{f} | y, x)$:

$$p(\mathbf{f} | y, x) = \frac{p(y | x, \mathbf{f})\,p(\mathbf{f})}{p(y | x)}$$

Figure 3: Polynomials are not a great prior over functions: beyond |x| > 1 they experience rapid growth.

4.2 Definition
Definition: a Gaussian process is a collection of random variables, every finite subset of which is jointly Gaussian.

$$f \sim \mathcal{GP}(m, k)$$

where $m(x)$ is the mean function and $k(x, x')$ is the covariance function. A GP is a Gaussian distribution with an infinitely long mean vector and infinite covariance matrix. We can't really reason with an infinitely long mean vector and covariance matrix, so we restrict ourselves to a finite subset and rely on the marginalisation property to marginalise out the infinite remainder. So now we've learnt how to draw random functions, but this alone is not useful: we want to somehow model the data.
Since we are in a non-parametric model, the parameters are the function itself. Let's go through the same Bayesian inference procedure as in Section 3.3 but plug in the function r.v. instead of the parameter r.v. We first find an expression for the likelihood:

$$p(y | x, \mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma^2 I)$$

where f is the vector of the function applied to the input data. We can do this instead of a Gaussian distribution over y|x since we only know what the function is doing at the observations.
We have a Gaussian process prior for the functions, as we saw before: $p(f) = \mathcal{GP}(m = 0, k)$.
The posterior is calculated by multiplying the likelihood and the prior. The product of two Gaussians is a Gaussian; this leads to an infinite Gaussian, which is a Gaussian process:

$$p(f | x, y) = \mathcal{GP}(m_{\mathrm{post}}, k_{\mathrm{post}})$$

Figure 4: The arrows are eigenvectors of Σ, scaled by the square root of their eigenvalues.

And the following predictive distribution:

$$p(y^* | x^*, x, y) = \int p(y^*, \mathbf{f} | x, y, x^*)\,d\mathbf{f} = \int p(y^* | x^*, \mathbf{f}, x, y)\,p(\mathbf{f} | x, y)\,d\mathbf{f}$$

What does the marginal likelihood look like? The evidence is just the probability of the observations where the function has been marginalised out. This yields a closed-form solution:

$$\log p(y | x) = \underbrace{-\frac{1}{2}y^\top[K + \sigma_n^2 I]^{-1}y}_{\text{data fit}}\; \underbrace{-\frac{1}{2}\log\big|K + \sigma_n^2 I\big|}_{\text{complexity penalty}}\; -\frac{n}{2}\log(2\pi)$$

where the data fit measures how well the model fits the training data, and the complexity penalty penalises the size of the model class we are using. The nice thing is that there are no hyperparameters here as with regularisers; Occam's razor is automatic.

4.3 Hyperparameters & Covariance Functions

Kernel methods are useful since they can express an infinite number of features in closed form using the dot product. This section will outline how we can obtain such a kernel. First, what is positive definiteness?

A symmetric matrix $A \in \mathbb{R}^{n\times n}$ is positive definite if $x^\top Ax > 0$ for all $x \neq 0$.

Let's build an intuition of what this means.

$$f(x, y) = \begin{bmatrix}x & y\end{bmatrix}\begin{bmatrix}1 & 3\\ 3 & 8\end{bmatrix}\begin{bmatrix}x\\ y\end{bmatrix} = x^2 + 6xy + 8y^2$$

What are the roots of the equation f(x, y) = 0? To calculate them we can set x = α and y = 1.³ Now our equation is $x^2 + 6x + 8$, with roots x = −2, x = −4. Plugging in any value of x between our two roots:

$$f(-3, 1) = 9 - 18 + 8 = -1 < 0,$$

which shows our matrix is not positive definite.

Figure 5: $x^2 + 6x + 8$

³ We have essentially scaled y by α: let x = β and y = γ. Now set x = β/γ = α and y = 1.
Let us now apply this in the case of a covariance matrix. Suppose we have two variables that are jointly Gaussian:

$$(x, y) \sim \mathcal{N}\Bigg(\begin{bmatrix}\mu\\ \nu\end{bmatrix}, \Sigma = \begin{bmatrix}\sigma^2 & \alpha\\ \alpha & \rho^2\end{bmatrix}\Bigg)$$

We will now see that for the covariance matrix to have any meaning, it must be positive semi-definite. Suppose $Z = aX + bY$. What is the distribution of Z? Gaussian. The mean is trivial: $\mathbb{E}(aX + bY) = a\mathbb{E}(X) + b\mathbb{E}(Y) = a\mu + b\nu$. The variance is also easy:

$$\begin{aligned}
\mathrm{Var}(aX + bY) &= \mathrm{Var}(aX) + \mathrm{Var}(bY) + 2\,\mathrm{Cov}(aX, bY) = a^2\sigma^2 + b^2\rho^2 + 2\,\mathbb{E}\big((aX - a\mu)(bY - b\nu)\big)\\
&= a^2\sigma^2 + b^2\rho^2 + 2\big(ab\,\mathbb{E}(XY) - ab\,\mathbb{E}(X)\mathbb{E}(Y)\big) = a^2\sigma^2 + b^2\rho^2 + 2ab\,\mathrm{Cov}(X, Y)\\
&= a^2\sigma^2 + b^2\rho^2 + 2ab\alpha
\end{aligned}$$

Common sense tells us that $\mathrm{Var}(Z) \ge 0$ (variances can't be negative). Thus the only covariance matrices that are permissible are those that are positive semi-definite:

$$Z \sim \mathcal{N}\Bigg(a\mu + b\nu,\; \begin{bmatrix}a & b\end{bmatrix}\begin{bmatrix}\sigma^2 & \alpha\\ \alpha & \rho^2\end{bmatrix}\begin{bmatrix}a\\ b\end{bmatrix}\Bigg) \sim \mathcal{N}\big(a\mu + b\nu,\; w^\top\Sigma w\big) \quad\text{where } w^\top\Sigma w \ge 0$$

Notice the similarity to the result $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$.

4.3.1 Hilbert Spaces

Let H be a vector space over ℝ. A map $\langle\cdot,\cdot\rangle : H\times H \to \mathbb{R}$ is an inner product iff (i) $\langle\alpha f_1 + \beta f_2, g\rangle = \alpha\langle f_1, g\rangle + \beta\langle f_2, g\rangle$; (ii) $\langle f, g\rangle = \langle g, f\rangle$; (iii) $\langle f, f\rangle \ge 0$, and it is 0 only if f = 0.
A Hilbert space is a space on which an inner product is defined, subject to another technical condition relating to Cauchy sequences (completeness), which is not necessary to go into here. A kernel is defined as follows:
The function $k : X\times X \to \mathbb{R}$ is a kernel if there exists a Hilbert space H and a map $\phi : X \to H$ s.t. $\forall x, x' \in X$, $k(x, x') = \langle\phi(x), \phi(x')\rangle$.

4.3.2 Positive Definiteness of Inner Products in Hilbert Spaces

Let H be any Hilbert space, X a non-empty set and φ a feature mapping. This implies that $k(x, y) = \langle\phi(x), \phi(y)\rangle$ is a positive semidefinite function.

Proof. The norm is written as $\|f\| = \sqrt{\langle f, f\rangle}$. Then

$$\sum_i\sum_j a_i a_j\,k(x_i, x_j) = \sum_i\sum_j \langle a_i\phi(x_i), a_j\phi(x_j)\rangle = \Big\langle \sum_i a_i\phi(x_i), \sum_j a_j\phi(x_j)\Big\rangle = \Big\|\sum_i a_i\phi(x_i)\Big\|^2 \ge 0$$

The reverse direction also holds: a positive semidefinite function is guaranteed to be an inner product in some Hilbert space. Thus positive semidefiniteness is a way of proving a function is a kernel.

4.3.3 Reproducing Kernel Hilbert Space
We now have kernels on feature spaces. We want to define what our functions on X look like. The space of these functions is known as a reproducing kernel Hilbert space (RKHS).
Suppose our feature map $\phi : \mathbb{R}^2 \to \mathbb{R}^3$ is, for example,

$$\phi\Bigg(\begin{bmatrix}x_1\\ x_2\end{bmatrix}\Bigg) = \begin{bmatrix}x_1\\ x_2\\ x_1x_2\end{bmatrix}$$

with the kernel given by the dot product in feature space, $k(x, y) = \langle\phi(x), \phi(y)\rangle$. This feature space is denoted by H.
Let $f(x) = ax_1 + bx_2 + cx_1x_2$, or $f(\cdot) = [a, b, c]^\top$.
We can also express f as $f(x) = f(\cdot)^\top\phi(x) = \langle f(\cdot), \phi(x)\rangle$. φ(x) is a function mapping ℝ² → ℝ³ and defines the parameters of a function mapping ℝ² → ℝ. To illustrate this further, take

$$k(\cdot, y) = \begin{bmatrix}y_1\\ y_2\\ y_1y_2\end{bmatrix} = \phi(y)$$

For every y, there is a vector $k(\cdot, y)$ s.t. $\langle k(\cdot, y), \phi(x)\rangle = ax_1 + bx_2 + cx_1x_2$ where $a = y_1$, $b = y_2$, $c = y_1y_2$. This is equivalent to

$$\langle k(\cdot, x), \phi(y)\rangle = x_1y_1 + x_2y_2 + x_1x_2\,y_1y_2 = k(x, y)$$

So we can write $\phi(x) = k(\cdot, x)$ and $\phi(y) = k(\cdot, y)$ without ambiguity.
This shows that:
• every feature mapping is in the feature space: $\forall x\in X$, $k(\cdot, x)\in H$;
• $\forall x\in X$, $\forall f\in H$, $\langle f, k(\cdot, x)\rangle = f(x)$.
This last property is the reproducing property. It yields another appealing property: the norm in an RKHS is a natural measure of how complex a function is.

4.3.4 Squared Exponential (RBF)

$$k(x, x') = v^2\exp\Big(-\frac{(x - x')^2}{2l^2}\Big) + \sigma^2\delta_{xx'}$$

Here l is the length scale of the squared exponential covariance function: roughly, the distance between inputs over which the covariance remains high. You can have anisotropic RBF kernels with multiple length scales, one for each direction in the input space; that way you can accommodate input features that are on different scales.

4.3.5 Periodic

$$k(x, x') = \exp\Big(-\frac{2\sin^2(\pi(x - x'))}{l^2}\Big)$$

The x's are first mapped to $u = [\sin(x), \cos(x)]^\top$ and then distances are measured in u-space. If the length scale is larger than the period, then the covariance stays high across a period, whereas in the inverse case there can be a lot of action within the period.

4.3.6 Reproducing Kernel for Vector-Valued Functions
Our reproducing kernel is now defined as a symmetric function $k : X\times X \to \mathbb{R}^{D\times D}$ s.t. $k(x, x')$ yields a positive semidefinite matrix.
Let H be the vector-valued RKHS over functions $f : X\to\mathbb{R}^D$. This means that $\forall x'\in X$, $\forall c\in\mathbb{R}^D$, the function $f(x) = k(x, x')\,c$ belongs to H, and the reproducing property is now written as $\langle f, k(\cdot, x)\,c\rangle = f(x)^\top c$.

4.4 Coregionalization
Coregionalization originated in the geostatistics literature, where it is known as cokriging. In order for covariance functions to be valid kernels, as seen before, they must be positive semidefinite. Suppose we have a multi-output problem with D outputs. In the linear model of coregionalization, these outputs are written as a linear combination of Q independent latent functions which have zero mean and a covariance function.
For each $d\in\{1,\dots,D\}$, the output is determined by the function $f_d(x)$ with p-dimensional input vector x:

$$f_d(x) = \sum_{q=1}^Q a_{d,q}\,u_q(x)$$

where the $u_q(x)$ are the latent functions with covariance

$$\mathrm{Cov}(u_q(x), u_{q'}(x')) = \begin{cases}k_q(x, x') & q = q'\\ 0 & \text{otherwise}\end{cases}$$

due to independence. Some of these latent functions can share the same covariance kernel and can be grouped:

$$f_d(x) = \sum_{q=1}^Q\sum_{i=1}^{R_q} a^i_{d,q}\,u^i_q(x)$$

where the $u^i_q(x)$ have covariance $\mathrm{Cov}(u^i_q(x), u^{i'}_{q'}(x')) = k_q(x, x')$ for $i = i'$, $q = q'$, and 0 otherwise. There are now Q groups of functions, each group sharing a covariance function.
We can now write the cross-covariance between functions as:

$$\begin{aligned}
\mathrm{Cov}(f_d(x), f_{d'}(x')) = (K(x, x'))_{d,d'} &= \sum_{q=1}^Q\sum_{q'=1}^Q\sum_{i=1}^{R_q}\sum_{i'=1}^{R_{q'}} a^i_{d,q}\,a^{i'}_{d',q'}\cdot\mathrm{Cov}(u^i_q(x), u^{i'}_{q'}(x'))\\
&= \sum_{q=1}^Q\sum_{i=1}^{R_q} a^i_{d,q}\,a^i_{d',q}\cdot k_q(x, x') \quad\text{by independence}\\
&= \sum_{q=1}^Q b^q_{d,d'}\,k_q(x, x')
\end{aligned}$$

where $b^q_{d,d'} = \sum_{i=1}^{R_q} a^i_{d,q}\,a^i_{d',q}$, which forms a D × D matrix $B_q$ called the coregionalisation matrix. The rank of $B_q$, the number of linearly independent row or column vectors, is intuitively determined by $R_q$.
Writing our kernel $K(x, x') = \sum_{q=1}^Q B_q\,k_q(x, x')$, one can intuit that this is a sum of products of two kernels (called separable kernels): one that models the output dependencies ($B_q$) and one that models the input dependencies ($k_q$).
4.5 Relation with Linear in the Parameters Model
Consider $f(x) = ax + b$ where $a\sim\mathcal{N}(0, \alpha)$, $b\sim\mathcal{N}(0, \beta)$. We can work out the mean function (see A.1 for a derivation):

$$\mu(x) = \mathbb{E}[f(x)] = \iint (ax + b)\,p(a)p(b)\,da\,db = \int ax\,p(a)\,da + \int b\,p(b)\,db = 0$$

And now the covariance function:

$$\begin{aligned}
k(x, x') &= \mathbb{E}[(f(x) - 0)(f(x') - 0)] = \iint (ax + b)(ax' + b)\,p(a)p(b)\,da\,db\\
&= \iint (a^2xx' + b^2 + ab(x + x'))\,p(a)p(b)\,da\,db\\
&= xx'\int a^2p(a)\,da + \int b^2p(b)\,db + (x + x')\int a\,p(a)\,da\int b\,p(b)\,db\\
&= \alpha xx' + \beta
\end{aligned}$$

So we have, in a very overly complicated way, constructed a linear model. We can now take this finite linear model and go to a Gaussian process (infinite):

$$f(x) = \sum_{m=1}^M w_m\phi_m(x), \qquad p(w) = \mathcal{N}(0, A)$$

The joint distribution of any vector $\mathbf{f} = [f(x_1), \dots, f(x_N)]$ is a multivariate Gaussian, and f is therefore a Gaussian process. The mean function is 0. The covariance:

$$\begin{aligned}
k(x_i, x_j) &= \mathrm{Cov}_w(f(x_i), f(x_j)) = \mathbb{E}(f(x_i)f(x_j)) - \mathbb{E}(f(x_i))\mathbb{E}(f(x_j)) = \mathbb{E}(f(x_i)f(x_j))\\
&= \int\!\!\cdots\!\!\int \Big(\sum_{k=1}^M\sum_{l=1}^M w_kw_l\,\phi_k(x_i)\phi_l(x_j)\Big)p(w)\,dw\\
&= \sum_{k=1}^M\sum_{l=1}^M \phi_k(x_i)\phi_l(x_j)\iint w_kw_l\,p(w_k, w_l)\,dw_k\,dw_l
\end{aligned}$$

and the last integral is just $\mathbb{E}(w_kw_l) = A_{kl}$. This shows that any linear in the parameters model with a Gaussian prior over the weights is also a Gaussian process. Mercer's theorem states that every GP also corresponds to a linear in the parameters model, but not necessarily a finite one.
We will now show a very cool result: that a squared exponential covariance function corresponds to a linear in parameters model with infinitely many Gaussian bumps.
Consider the following Gaussian bump basis function:

$$f(x) = \lim_{N\to\infty}\frac{1}{N}\sum_{n=-N/2}^{N/2}\gamma_n\exp\Big(-\big(x - \tfrac{n}{\sqrt N}\big)^2\Big) \quad\text{where } \gamma_n\sim\mathcal{N}(0, 1)$$

We use the sum (taken to infinity) to place bumps everywhere along x. But $\lim_{N\to\infty}\frac{1}{N}\sum_{-N/2}^{N/2}$ becomes $\int_{-\infty}^{\infty}$, so

$$f(x) = \int_{-\infty}^{\infty}\gamma(u)\exp(-(x - u)^2)\,du$$

Figure 6: Example Gaussian bump

$$\mu(x) = \mathbb{E}(f(x)) = \int_{-\infty}^{\infty}\exp(-(x - u)^2)\int_{-\infty}^{\infty}\gamma(u)\,p(\gamma(u))\,d\gamma(u)\,du = 0$$

4.6 Eigenvectors and Relation to PCA

Let's visualise what a 2-variable joint Gaussian distribution (2-D multivariate Gaussian) would look like.

Figure 7: $(x, y) \sim \mathcal{N}\Big(0, \Sigma = \big[\begin{smallmatrix}0.2 & 0.4\\ 0.4 & 0.2\end{smallmatrix}\big]\Big)$. Figure 8: $(x, y) \sim \mathcal{N}\Big(0, \Sigma = \big[\begin{smallmatrix}0.2 & 0.2\\ 0.2 & 0.2\end{smallmatrix}\big]\Big)$.

4.7 Cholesky Decomposition

Due to the marginalisation/consistency property (or even the definition of a GP), the marginal (and the conditional) distributions are also Gaussian.
Recall $p(f) = \int p(f, y)\,dy$, but now suppose y is infinitely long:

$$p(x, y) = \mathcal{N}\Bigg(\begin{bmatrix}a\\ b\end{bmatrix}, \begin{bmatrix}A & B\\ B^\top & C\end{bmatrix}\Bigg) \;\Rightarrow\; p(x) = \mathcal{N}(a, A)$$

Cholesky factorisation decomposes a positive-definite matrix as $A = R^\top R$ where R is upper triangular: essentially the "square root" of A.
We can now sample from a D-dimensional joint Gaussian with mean vector m and covariance matrix K:

z = randn(D, 1)
y = chol(K)ᵀ z + m

$$\mathbb{E}\big((y - m)(y - m)^\top\big) = \mathbb{E}(R^\top zz^\top R) = R^\top\mathbb{E}(zz^\top)R = R^\top IR = K$$
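A minimal NumPy sketch of this sampling recipe, assuming the squared exponential covariance of Section 4.3.4 and a small jitter term added for numerical positive definiteness. Note that NumPy's cholesky returns the lower-triangular factor L with K = LLᵀ, the transpose of the convention above.

```python
import numpy as np

def rbf(x1, x2, v=1.0, l=0.5):
    # Squared exponential covariance k(x, x') = v^2 exp(-(x - x')^2 / (2 l^2))
    return v**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

rng = np.random.default_rng(0)
xs = np.linspace(-3.0, 3.0, 200)
K = rbf(xs, xs) + 1e-9 * np.eye(xs.size)    # jitter keeps K numerically PD

L = np.linalg.cholesky(K)                   # lower-triangular, K = L L^T
f = L @ rng.standard_normal((xs.size, 3))   # three prior draws, each f ~ N(0, K)
```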

4.8 Gaussian Process Example

$$y = f + \varepsilon$$

where y are our training outputs, and the likelihood for our data y is $p(y | \mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma^2 I)$. Any set of function variables f has $p(\mathbf{f} | X) = \mathcal{N}(0, K)$. The marginal likelihood is

$$p(y) = \int p(y | \mathbf{f})\,p(\mathbf{f})\,d\mathbf{f} = \mathcal{N}(0, K + \sigma^2 I)$$

For prediction, consider the joint training and test marginal likelihood:

$$p(y, y_*) = \mathcal{N}\Bigg(0, \begin{bmatrix}K_{xx} & K_{xt}\\ K_{tx} & K_{tt}\end{bmatrix}\Bigg)$$

Conditioning on the training outputs:

$$p(y_* | y) = \mathcal{N}(\mu, \Sigma) \quad\text{where}\quad \mu = K_{tx}[K_{xx} + \sigma^2 I]^{-1}y, \qquad \Sigma = K_{tt} - K_{tx}[K_{xx} + \sigma^2 I]^{-1}K_{xt}$$
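A minimal NumPy sketch of these predictive equations on synthetic data (an illustrative assumption), reusing the rbf function from the previous sketch and also evaluating the log marginal likelihood of Section 4.2:

```python
import numpy as np

rng = np.random.default_rng(1)
xtrain = rng.uniform(-3.0, 3.0, 15)
ytrain = np.sin(xtrain) + 0.1 * rng.standard_normal(15)
xtest = np.linspace(-4.0, 4.0, 100)
sigma2 = 0.1**2

Kxx = rbf(xtrain, xtrain)        # rbf as defined in the previous sketch
Kxt = rbf(xtrain, xtest)
Ktt = rbf(xtest, xtest)
A = Kxx + sigma2 * np.eye(xtrain.size)

mu = Kxt.T @ np.linalg.solve(A, ytrain)          # predictive mean
Sigma = Ktt - Kxt.T @ np.linalg.solve(A, Kxt)    # predictive covariance

# Log marginal likelihood: -1/2 y^T A^{-1} y - 1/2 log|A| - n/2 log(2 pi)
_, logdet = np.linalg.slogdet(A)
lml = (-0.5 * ytrain @ np.linalg.solve(A, ytrain)
       - 0.5 * logdet - 0.5 * xtrain.size * np.log(2 * np.pi))
```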

4.9 Scalability
GPs scale poorly due to the O(n³) matrix inversion. In Section 9.3.1, we will discuss methods of scaling GPs using inducing point approximations.
5 Gaussian Process Classification
Now the outputs are categorical and we have a new likelihood:

$$p(y | x) = \sigma(f(x))^y\,\big(1 - \sigma(f(x))\big)^{1-y}$$

where σ is the sigmoid function.

The integral is no longer tractable. finish
6 Monte Carlo
How can we integrate an intractable function? We want to find approximate expectations of a function φ(x) w.r.t. a probability p(x):

$$\mathbb{E}_{p(x)}[\phi(x)] = \bar\phi = \int \phi(x)\,p(x)\,dx$$

We could lay out a grid and compute

$$\int \phi(x)\,p(x)\,dx \approx \sum_{\tau=1}^T \phi(x^{(\tau)})\,p(x^{(\tau)})\,\Delta x.$$

However, this requires too many points as the dimensionality increases: with a grid of 10 points along each dimension, a D-dimensional problem needs $10^D$ evaluations.

6.1 Monte Carlo

If we instead choose the points from the distribution p(x), then

$$\mathbb{E}_{p(x)}[\phi(x)] \simeq \frac{1}{T}\sum_{\tau=1}^T \phi(x^{(\tau)}) \quad\text{where } x^{(\tau)}\sim p(x)$$

Furthermore, the average of independent samples yields an unbiased estimate with $\mathbb{V}[\hat\phi] = \mathbb{V}[\phi]/T$, so this variance is independent of the dimensionality of x.
This leads to the question of how we sample from p(x). What if it is intractable?

6.2 Markov Chains & Gibbs Sampling

What if we are trying to find the posterior $p(\theta | x) = \frac{p(x|\theta)\,p(\theta)}{p(x)}$? We don't need the denominator to get the shape of the posterior; we just need the relative value of the posterior at one point versus all others. Computing the denominator involves (for continuous variables) an intractable integral. This is where dependent sampling comes in, and in particular Markov chains. A Markov chain is a sequence defined by a transition function $q(x' | x)$. Gibbs sampling generates a new value for one component, keeping all other components the same, by sampling it from the conditional distribution:

$$x_i' \sim p(x_i | x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_D)$$

If you iterate this over all the indices and repeat this process many times, then the samples will have the correct distribution. Of course, these conditional distributions must be known.
https://www.youtube.com/watch?v=ER3DDBFzH2g
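A minimal sketch of the scheme on an assumed toy target: a zero-mean bivariate Gaussian with unit variances and correlation ρ, whose exact conditionals are $x_i | x_j \sim \mathcal{N}(\rho x_j, 1 - \rho^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, T = 0.8, 50_000
samples = np.empty((T, 2))
x1 = x2 = 0.0

for t in range(T):
    # Alternate the exact Gibbs conditionals of the bivariate Gaussian.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = x1, x2

print(np.corrcoef(samples[1000:].T))   # off-diagonal close to 0.8 after burn-in
```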

6.3 Example: Step Model

Suppose

$$x_{i\le\theta}\sim\mathrm{Poisson}(\lambda), \qquad x_{i>\theta}\sim\mathrm{Poisson}(\mu)$$

such that

$$p(x_i) = \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \exp\big(x_i\log\lambda - \lambda - \log(x_i!)\big) \;\Longleftarrow\; \log\frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = -\lambda + x_i\log\lambda - \log(x_i!)$$

with priors:

$$\lambda\sim\mathrm{Gamma}(a, b), \qquad \mu\sim\mathrm{Gamma}(a, b), \qquad \theta\sim\mathrm{U}(1851, 1961)$$

$$p(\lambda) = \frac{1}{\Gamma(a)}b^a\lambda^{a-1}e^{-b\lambda}$$

But since $\log\big(\frac{1}{\Gamma(a)}b^a\lambda^{a-1}e^{-b\lambda}\big) = -\log\Gamma(a) + a\log b + (a-1)\log\lambda - b\lambda$:

$$\log p(\lambda) = a\log b - \log\Gamma(a) + (a-1)\log\lambda - b\lambda$$
$$\log p(\mu) = a\log b - \log\Gamma(a) + (a-1)\log\mu - b\mu$$

We wish to find the posterior, so we can use Bayes' theorem:

$$\begin{aligned}
p(\theta, \lambda, \mu | x) &\propto p(x | \theta, \lambda, \mu)\,p(\theta, \lambda, \mu)\\
&= p(x_{i\le\theta} | \theta, \lambda)\cdot p(x_{i>\theta} | \theta, \mu)\cdot p(\theta, \lambda, \mu)\\
&= \prod_{i\le\theta}\exp\big(x_i\log\lambda - \lambda - \log(x_i!)\big)\prod_{i>\theta}\exp\big(x_i\log\mu - \mu - \log(x_i!)\big)\cdot p(\theta, \lambda, \mu)
\end{aligned}$$

$$\log p(\theta, \lambda, \mu | x) \propto \sum_{i\le\theta}\big(x_i\log\lambda - \lambda - \log(x_i!)\big) + \sum_{i>\theta}\big(x_i\log\mu - \mu - \log(x_i!)\big) + \log p(\theta, \lambda, \mu)$$

To find the conditional posteriors for the parameters, we ignore terms which do not include that parameter:

$$\begin{aligned}
\log p(\lambda | x, \theta, \mu) &= \sum_{i\le\theta}\big(x_i\log\lambda - \lambda - \log(x_i!)\big) + a\log b - \log\Gamma(a) + (a-1)\log\lambda - b\lambda\\
&\propto \sum_{i\le\theta}(x_i\log\lambda - \lambda) + (a-1)\log\lambda - b\lambda\\
&\propto \Big(a - 1 + \sum_{i\le\theta}x_i\Big)\log\lambda - (\theta + b)\lambda\\
&\propto \log\mathrm{Gamma}\Big(a + \sum_{i\le\theta}x_i,\ \theta + b\Big)
\end{aligned}$$

Similarly,

$$\log p(\mu | x, \lambda, \theta) \propto \Big(a - 1 + \sum_{i>\theta}x_i\Big)\log\mu - (N - \theta + b)\mu \propto \log\mathrm{Gamma}\Big(a + \sum_{i>\theta}x_i,\ N - \theta + b\Big)$$

It is worth noting at this point that the Gamma and Poisson are conjugate.

$$\log p(\theta | x, \lambda, \mu) \propto \sum_{i\le\theta}\big(x_i\log\lambda - \lambda - \log(x_i!)\big) + \sum_{i>\theta}\big(x_i\log\mu - \mu - \log(x_i!)\big)$$
$$\propto \sum_{i\le\theta}x_i\log\lambda - \theta\lambda + \sum_{i>\theta}x_i\log\mu - (N - \theta)\mu$$

This is not a standard distribution, but it is simple enough to sample from: evaluate it for θ = 0, …, N and use the normalised values to construct a multinomial distribution. finish
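A minimal sketch of the full Gibbs sweep for this model; the synthetic data, hyperparameters, and iteration counts are illustrative assumptions, and θ is treated directly as the changepoint index rather than a year.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N, true_theta = 2.0, 1.0, 100, 40
x = np.concatenate([rng.poisson(4.0, true_theta), rng.poisson(1.0, N - true_theta)])
cs = np.concatenate([[0], np.cumsum(x)])   # cs[k] = sum of the first k counts
ks = np.arange(1, N)                       # keep both segments non-empty

theta, thetas = N // 2, []
for it in range(5000):
    # Conjugate conditional updates derived above (numpy gamma takes shape, scale).
    lam = rng.gamma(a + cs[theta], 1.0 / (theta + b))
    mu = rng.gamma(a + cs[N] - cs[theta], 1.0 / (N - theta + b))
    # Unnormalised log conditional over theta, then sample from the normalised values.
    logp = (cs[ks] * np.log(lam) - ks * lam
            + (cs[N] - cs[ks]) * np.log(mu) - (N - ks) * mu)
    p = np.exp(logp - logp.max())
    theta = rng.choice(ks, p=p / p.sum())
    thetas.append(theta)

print(np.mean(thetas[500:]))   # posterior mean of theta, close to 40
```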

7 Ranking
7.1 Towards Probabilistic Ranking
Suppose we have player 1 and player 2, with skills $w_1, w_2$ respectively. The skill difference is therefore $s = w_1 - w_2$. The performance may not be perfectly consistent though, so we can add noise:

$$t = s + n \quad\text{where } n\sim\mathcal{N}(0, 1)$$

The game outcome is given as

$$y = \mathrm{sign}(t) = \begin{cases}+1 & \text{player 1 wins}\\ -1 & \text{player 2 wins}\end{cases}$$

$$p(t | w_1, w_2) = \mathcal{N}(w_1 - w_2,\, 1)$$
$$p(y = 1 | w_1, w_2) = p(t > 0 | w_1, w_2) = \Phi(w_1 - w_2) \quad\text{where } \Phi \text{ is the cumulative distribution function}$$

We can now construct the likelihood:

$$p(y | w_1, w_2) = \Phi(y(w_1 - w_2))$$

We can also write the likelihood as this kind of chain:

$$p(y | w_1, w_2) = \iint p(y | t)\,p(t | s)\,p(s | w_1, w_2)\,dt\,ds$$

Recall Bayes' rule:

$$p(w | y) = \frac{p(y | w)\,p(w)}{p(y)}$$

$$\begin{aligned}
p(w_1, w_2 | y) &= \frac{p(y | w_1, w_2)\,p(w_1)\,p(w_2)}{p(y)} \quad\text{(our data is } y\text{)}\\
&= \frac{p(y | w_1, w_2)\,p(w_1)\,p(w_2)}{\iint p(w_1)\,p(w_2)\,p(y | w_1, w_2)\,dw_1\,dw_2}\\
&= \frac{\Phi(y(w_1 - w_2))\,\mathcal{N}(w_1 | \mu_1, \sigma_1^2)\,\mathcal{N}(w_2 | \mu_2, \sigma_2^2)}{\iint \Phi(y(w_1 - w_2))\,\mathcal{N}(w_1 | \mu_1, \sigma_1^2)\,\mathcal{N}(w_2 | \mu_2, \sigma_2^2)\,dw_1\,dw_2}
\end{aligned}$$

where we set the prior to be $p(w_i) = \mathcal{N}(w_i | \mu_i, \sigma_i^2)$.
Every time one has a game and an outcome, the skills become correlated: one player will win, so their skill estimate will be higher. This posterior does not have a closed form, since it is not Gaussian.
The prior over y (the normalising constant, the model evidence, the marginal likelihood) does have a closed form:

$$p(y) = \Phi\Bigg(\frac{y(\mu_1 - \mu_2)}{\sqrt{1 + \sigma_1^2 + \sigma_2^2}}\Bigg)$$

Consider if we are very uncertain about the skills: then the argument to the cumulative distribution gets closer to zero, so the probability of the outcome gets closer to 50%.

7.2 Gibbs Sampling in TrueSkill

Suppose for games g = 1, ..., G we have variables $I_g$ and $J_g$ for the ids of the first and second players. The outcome is

$$y_g = \begin{cases}+1 & I_g \text{ wins,}\\ -1 & \text{otherwise.}\end{cases}$$
We first initialise w from the prior p(w). We then need the performance differences:

$$p(t_g | w_{I_g}, w_{J_g}, y_g) \propto \delta(y_g - \mathrm{sign}(t_g))\,\mathcal{N}(t_g;\ w_{I_g} - w_{J_g},\, 1)$$

The conditional distribution of the performance differences as above is a univariate truncated Gaussian, sampled either using rejection sampling or the inverse transform method.
Jointly sample the skills:

$$p(w | t, y) = p(w | t) \propto p(w)\prod_{g=1}^G p(t_g | w_{I_g}, w_{J_g})$$

But $t = s + \mathcal{N}(0, 1)$ implies $p(t_g | w_{I_g}, w_{J_g}) \propto \mathcal{N}(t_g;\ w_{I_g} - w_{J_g},\, 1)$. (Refer to the appendix for an expansion, which doesn't show much.)
For Gibbs sampling, we then iterate back to calculating the performance differences. When one conditions on a random variable, its value is fixed, which simplifies things.
The product of two Gaussians yields an unnormalised Gaussian:

$$\mathcal{N}(\mu_a, \Sigma_a)\,\mathcal{N}(\mu_b, \Sigma_b) = z_c\,\mathcal{N}(\mu_c, \Sigma_c)$$
$$\Sigma_c^{-1} = \Sigma_a^{-1} + \Sigma_b^{-1}, \qquad \mu_c = \Sigma_c\big(\Sigma_a^{-1}\mu_a + \Sigma_b^{-1}\mu_b\big)$$

Suppose $p(w) = \mathcal{N}(\mu_0, \Sigma_0)$. Using the above result for multiplying Gaussians:

$$\Sigma^{-1} = \Sigma_0^{-1} + \sum_{g=1}^G \Sigma_g^{-1}, \qquad \mu = \Sigma\Big(\Sigma_0^{-1}\mu_0 + \sum_{g=1}^G \Sigma_g^{-1}\mu_g\Big)$$

where each game contributes a (degenerate) Gaussian factor in $(w_{I_g}, w_{J_g})$, since expanding the square gives

$$p(t_g | w_{I_g}, w_{J_g}) \propto \exp\Big(-\frac{1}{2}(w_{I_g} - w_{J_g} - t_g)^2\Big) \propto \exp\Bigg(-\frac{1}{2}\begin{bmatrix}w_{I_g} & w_{J_g}\end{bmatrix}\begin{bmatrix}1 & -1\\ -1 & 1\end{bmatrix}\begin{bmatrix}w_{I_g}\\ w_{J_g}\end{bmatrix} + t_g(w_{I_g} - w_{J_g})\Bigg)$$

An alternative to Gibbs sampling for TrueSkill is message passing on graphs.

7.3 Message Passing on Factor Graphs

Factor graphs are a type of probabilistic graphical model. We can lay out the probabilities in a factor graph. Suppose:

$$p(v, w, x, y, z) = f_1(v, w)\,f_2(w, x)\,f_3(x, y)\,f_4(x, z)$$

We can now ask questions like: what are the marginal distributions, the conditional distributions, p(w)?

$$p(w) = \sum_v\sum_x\sum_y\sum_z f_1(v, w)\,f_2(w, x)\,f_3(x, y)\,f_4(x, z)$$

Computing this takes $K^4$ sums (where K is the number of values the variables can take) for each of the K possible values of w, therefore $O(K^5)$. We can break up the factor graph above into two subgraphs split at w:

$$p(w) = \sum_v f_1(v, w)\sum_x\sum_y\sum_z f_2(w, x)\,f_3(x, y)\,f_4(x, z) \quad\text{and, while we're here,}$$
$$= \sum_v f_1(v, w)\sum_x f_2(w, x)\sum_y f_3(x, y)\sum_z f_4(x, z)$$

We call these components messages:

$$m_{f_1\to w}(w) = \sum_v f_1(v, w)$$
$$m_{f_2\to w}(w) = \sum_x\sum_y\sum_z f_2(w, x)\,f_3(x, y)\,f_4(x, z) = \sum_x f_2(w, x)\sum_y f_3(x, y)\sum_z f_4(x, z) = \sum_x f_2(w, x)\,m_{x\to f_2}(x)$$

where $m_{x\to f_2}(x) = m_{f_3\to x}(x)\,m_{f_4\to x}(x)$, and finally

$$p(w) = m_{f_1\to w}(w)\cdot m_{f_2\to w}(w)$$

So nodes take incoming messages and pass them on. In summary, message passing involves three update equations:
• Marginals are the product of all incoming messages from neighbouring factors:

$$p(t) = \prod_{f\in F_t} m_{f\to t}(t)$$

• Messages from factors sum out all variables except the receiving one.
• Messages from variables are the product of all incoming messages except the message from the receiving factor:

$$m_{t\to f}(t) = \frac{p(t)}{m_{f\to t}(t)}$$

The benefits of this are partial and localised computations.
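A minimal sketch verifying the decomposition numerically on the factor graph above, with randomly generated positive factor tables over K states (the table values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
f1, f2, f3, f4 = (rng.uniform(0.1, 1.0, size=(K, K)) for _ in range(4))

# Brute force, O(K^5): sum over v, x, y, z for each value of w.
brute = np.einsum('vw,wx,xy,xz->w', f1, f2, f3, f4)

# Message passing: each message is a small local sum.
m_f1_w = f1.sum(axis=0)            # sum_v f1(v, w)
m_f3_x = f3.sum(axis=1)            # sum_y f3(x, y)
m_f4_x = f4.sum(axis=1)            # sum_z f4(x, z)
m_x_f2 = m_f3_x * m_f4_x           # variable-to-factor: product of incoming messages
m_f2_w = f2 @ m_x_f2               # sum_x f2(w, x) m_{x->f2}(x)

print(np.allclose(brute, m_f1_w * m_f2_w))   # True; normalise to get p(w)
```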

8 Expectation-Maximisation
8.1 Gaussian Mixture Models
In a GMM, the parameters are $\theta = \{\mu_j, \sigma_j^2, \pi_j\}_{j=1\dots k}$. We have one latent variable $z_i$ for each datapoint: an assignment to a class. GMMs can represent any distribution (given enough components). We wish to find:

$$\arg\max_\theta p(X | \theta) = \prod_i^N p(x_i | \theta) = \prod_i^N \big(\pi_1\mathcal{N}(x_i | \mu_1, \Sigma_1) + \dots\big)$$

subject to $\sum_c^K \pi_c = 1$ and $\Sigma_i \succeq 0$. This psd constraint is quite tough, so we introduce the EM algorithm for efficiently training these models.
We represent the GMM as a latent variable problem, where we introduce a latent variable z for each datapoint such that $p(z = c | \theta) = \pi_c$, with a Gaussian likelihood for the data given its class. We now marginalise away the latent variable:

$$p(x | z = c, \theta) = \mathcal{N}(x | \mu_c, \Sigma_c)$$
$$p(x | \theta) = \sum_{c=1}^K p(x | z = c, \theta)\,p(z = c | \theta)$$

which gives us the same likelihood result as without the latent variable z, which we refer to as the source.

Figure 9: Graphical model for the GMM (z → x, plate over N).

The idea of the EM algorithm is that we keep iterating, alternately holding the parameters fixed and holding the sources fixed.
Keeping the sources fixed, we can calculate the parameters:

$$p(x | z = 1, \theta) = \mathcal{N}(x | \mu_1, \sigma_1^2)$$

where $p(z_i = j | \theta) = \pi_j$. We will derive $\mu_1, \sigma_1$ for "soft" assignments in the next sections, which results in

$$\mu_1 = \frac{\sum_i p(z_i = 1 | x_i, \theta)\,x_i}{\sum_i p(z_i = 1 | x_i, \theta)}$$

For now, suppose we choose a hard assignment:

$$\mu_1 = \frac{\sum_{i\in I_1} x_i}{|I_1|}, \qquad \sigma_1^2 = \frac{\sum_{i\in I_1}(x_i - \mu_1)^2}{|I_1|}$$

Keeping the parameters fixed, we can work out the sources. Suppose our parameters are $p(x | z = 1, \theta) = \mathcal{N}(-2, 1)$, and we want $p(z = 1 | x, \theta)$:

$$p(z = 1 | x, \theta) = \frac{p(z = 1 | \theta)\,p(x | z = 1, \theta)}{p(x | \theta)}$$

The normalising constant can be determined explicitly: it is a marginalisation over just K terms.

8.2 Mathematical Machinery
8.2.1 Jensen's Inequality
A function f is concave if all function values between any two points lie on or above the chord joining the two points:

$$\forall a, b, \alpha: \quad f(\alpha a + (1-\alpha)b) \ge \alpha f(a) + (1-\alpha)f(b),$$

where $\alpha\in[0, 1]$ represents the distance along the line between a and b. This generalises to multiple αs:

$$f\big(\mathbb{E}_{p(z)}[z]\big) \ge \mathbb{E}_{p(z)}[f(z)] \quad\text{(Jensen's inequality)}$$

8.2.2 Kullback-Leibler Divergence

$$\mathrm{KL}(p\|q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$$

This is an asymmetric, non-negative divergence (it is not a metric). Proof of non-negativity:

$$-\mathrm{KL}(p\|q) = \mathbb{E}_p\Big[-\log\frac{p}{q}\Big] = \mathbb{E}_p\Big[\log\frac{q}{p}\Big] \le \log\mathbb{E}_p\Big[\frac{q}{p}\Big] = \log\int p(x)\frac{q(x)}{p(x)}\,dx = \log 1 = 0$$

using Jensen's inequality.

8.2.3 Mutual Information

$$\mathrm{MI}(X, Y) = \mathrm{KL}(p_{X,Y}\,\|\,p_X\,p_Y)$$

more detail needed
8.3 Expectation Maximisation Algorithm
This algorithm is for probabilistic models with latent variables, observed variables, and parameters.
We will discuss the algorithm in terms of the GMM. First, we can write out the likelihood from
the graphical model in Figure 11, by marginalising the joint:
K
p(xi |zi = c, θ)p(zi = c|θ) X
p(xi |θ) = = p(xi |zi = c, θ)p(zi = c|θ)
p(zi = c|xi , θ) c=1

and now we wish to compute the maximum marginal likelihood:


N
X K
X
log p(xi |zi = c, θ)p(zi = c|θ)
i=1 c=1

we could maximise with a stochastic optimizer, but it’s not very efficient. Instead, let’s find a
lower bound by applying Jensen’s inequality:

N K N X K
X X q(zi = c) X p(xi , zi = c|θ)
log p(xi , zi = c|θ) ≥ q(zi = c) log
i=1 c=1
q(zi = c) i=1 c=1
q(zi = c)

This q is known as the variational distribution. In order to illustrate what maximising this
lower bound does, we start with Bayes’ rule:
p(x|z, θ)p(z|θ) p(x|z, θ)p(z|θ) q(z)
p(x|θ) = =
p(z|x, θ) q(z) p(z|x, θ)
p(x|z, θ)p(z|θ) q(z)
log p(x|θ) = log + log
q(z) p(z|x, θ)

31
Finally we average both sides w.r.t. q(z):
Z Z
p(x|z, θ)p(z|θ) q(z)
log p(x|θ) = q(z) log dz + q(z) log dz
q(z) p(z|x, θ)
| {z } | {z }
=L(x|θ), lower-bound functional KL-divergence

The lower-bound functional is a lower bound since the second term is always non-negative.
We now have the log-marginal-likelihood as a sum of two terms. But how do we select q? We do
this using the EM algorithm. The EM algorithm is as follows:
• Initialise randomly θt=0 , then for t = 1...T :
• E-step: for fixed θt−1 maximise the lower-bound functional, L(x|θ), wrt q(z)
Since the marginal likelihood term log p(x|θ) does not depend on q(z), maximising the lower-
bound consists of minimising the KL-divergence, by setting q t (t) = p(z|x, θt−1 )
• M-step: for fixed q t (z), maximise the lower-bound functional, L(x|θ), wrt θ

Z Z Z
p(x|z, θ)p(z|θ)
arg max q(z) log dz = arg max q(z) log p(x|z, θ)p(z|θ) dz− q(z) log q(z) dz
θ q(z) θ

The second term is the entropy of q(z) but is not dependent on θ so throw it away.
Z
θ = arg max q t (z) log p(x|z, θ)p(z|θ) dz = arg max Eq [log p(x, z|θ)]
t
θ θ

For the GMM, where the likelihood is Gaussian and the prior weights are πc, we can work out the analytical expression of the M-step:

θ^t = arg max_θ ∫ q^t(z) log p(x|z, θ) p(z|θ) dz = arg max_θ E_q[ log p(x, z|θ) ]
    = arg max_θ Σ_{i=1}^N Σ_{c=1}^K q^t(z_i = c) log [ (1/Z) exp( −(x_i − µ_c)² / 2σ_c² ) π_c ]
    = arg max_θ Σ_{i=1}^N Σ_{c=1}^K q^t(z_i = c) [ log (π_c/Z) − (x_i − µ_c)² / 2σ_c² ]

Using school-level maths, in order to maximise we take the derivative w.r.t. each µ_c and set it to zero; for µ_1, for example:

∂/∂µ_1 : Σ_{i=1}^N q^t(z_i = 1) ∂/∂µ_1 [ −(x_i − µ_1)² / 2σ_1² ] = Σ_{i=1}^N q^t(z_i = 1) (x_i − µ_1)/σ_1² = 0

=⇒ µ_1 = Σ_{i=1}^N q^t(z_i = 1) x_i / Σ_{i=1}^N q^t(z_i = 1)

When can the EM algorithm be used?

The EM algorithm works when the posterior over the latents, p(z|x, θ), is known and tractable (so the E-step is exact), and the expected complete-data log-likelihood can be maximised in the M-step.
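Putting the E- and M-steps together for a one-dimensional, two-component GMM gives the following sketch (the synthetic data and initialisation are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians (parameters made up for illustration)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])

K = 2
mu, sigma, pi = np.array([-1.0, 1.0]), np.ones(K), np.full(K, 1.0 / K)

for t in range(50):
    # E-step: responsibilities q^t(z_i = c) = p(z_i = c | x_i, theta)
    joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (N, K)
    q = joint / joint.sum(axis=1, keepdims=True)
    # M-step: weighted maximum-likelihood updates
    Nk = q.sum(axis=0)
    mu = (q * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)

print(mu, sigma, pi)   # recovers means near -2 and 2
```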

8.4 Example: K-Means
K-means is a clustering algorithm: an unsupervised method of grouping together datapoints into
groups (clusters). Datapoints are assigned to their closest cluster centroid by minimising squared
Euclidean distance. It is important to note that this is not a probabilistic method, so one could
argue it does not belong here. Nevertheless, we shall discover K-means by analysing its close similarity with the GMM. We first fix σc = 1 and the assignment weights πc to be uniform (1/M, where M is the number of Gaussians). This means that the new likelihood is:

p(x|z = c, θ) = (1/Z) exp( −0.5 ||x − µc||² ).

We modify the E-step and M-step in the following ways:

• E-step: We choose our variational distribution q(z = c) ∈ Q, where Q is the set of delta functions. As we saw, the E-step consists only of minimising the KL-divergence between the variational distribution and the true posterior. Thus the optimal q will be the delta function:

  q(z) = 1 if z = c, and 0 otherwise,

  corresponding to the marginal-likelihood maximisation:

  c_i = arg max_c p(z = c|x, θ) = arg max_c (1/Z) p(x|z = c, θ) p(z = c|θ)
      = arg max_c (1/Z) exp( −0.5 ||x − µc||² ) πc
      = arg min_c ||x − µc||²

  The last step follows because πc and Z are constant across c and exp is monotonic, so maximising the likelihood is the same as minimising the squared distance.


• M-step: Recall that for fixed q^t, we maximise the lower bound w.r.t. θ. With a delta-function q, this is simply the average of all datapoints assigned to a cluster (see the sketch after this list):

  µ_1^{t+1} = Σ_{i=1}^N q^t(z_i = 1) x_i / Σ_{i=1}^N q^t(z_i = 1)

  =⇒ µ_c^{t+1} = Σ_{i:c_i=c} x_i / |{i : c_i = c}|
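A minimal sketch of the resulting algorithm (Lloyd's iteration; the toy two-blob data is made up):

```python
import numpy as np

def kmeans(x, K, iters=20, seed=0):
    """Lloyd's algorithm: hard-assignment EM on a fixed-variance GMM.
    Assumes, for simplicity, that no cluster ever empties out."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), K, replace=False)]  # init centroids at random datapoints
    for _ in range(iters):
        # E-step (delta-function q): assign each point to its nearest centroid
        c = np.argmin(((x[:, None, :] - mu) ** 2).sum(-1), axis=1)
        # M-step: each centroid becomes the mean of its assigned points
        mu = np.array([x[c == k].mean(axis=0) for k in range(K)])
    return mu, c

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(-2, 1, (150, 2)), rng.normal(2, 1, (150, 2))])
mu, c = kmeans(x, K=2)
print(mu)   # two centroids near (-2, -2) and (2, 2)
```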

8.5 Example: Topic Modelling


This section covers three models for modelling text documents, in order of increasing complexity: we start with a simple bag-of-words model, move to a mixture model fit with the EM algorithm, and finish with Latent Dirichlet Allocation.
The general layout of the problem is as follows.
• There are D documents,
• with a vocab size of M words,
• Nd is the word count in doc d,
• wnd is the n-th word in document d, parameterised by β, the probabilities of each word. Our
data is given a Categorical likelihood: wnd |β ∼ Cat(wnd |β)

The simplest model is a single global categorical distribution over words (its graphical model is just the plate over w_{nd}, n = 1...Nd, d = 1...D, with parameter β). We will find an MLE solution, β̂, which we constrain so that the β's normalise to 1:

β̂ = arg max_β Π_{d=1}^D Π_{n=1}^{Nd} p(w_{nd}|β)    s.t.  Σ_m β_m = 1
  = arg max_β log Π_{m=1}^M β_m^{c_m} = arg max_β Σ_{m=1}^M c_m log β_m
where c_m is the total count of word m: c_m = Σ_{d=1}^D Σ_{n=1}^{Nd} I(w_{nd} = m). We use the Lagrange multiplier method to carry out this constrained optimisation problem. Setting derivatives to zero:

L = Σ_{m=1}^M c_m log β_m + λ ( 1 − Σ_{m=1}^M β_m )

∂L/∂β_m = c_m/β_m − λ = 0 =⇒ β_m = c_m/λ

∂L/∂λ = 1 − Σ_{m=1}^M β_m = 0 =⇒ Σ_{m=1}^M β_m = 1

We can combine these two equations to yield our MLE:

Σ_{m=1}^M c_m/λ = 1 =⇒ λ = Σ_{m=1}^M c_m

=⇒ β̂_m = c_m / N

where N is the total number of words (including repetitions) in the corpus.
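In code this MLE is just normalised word counts (a sketch on a made-up toy corpus):

```python
from collections import Counter

# Toy corpus (hypothetical documents); the MLE is beta_hat_m = c_m / N
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
counts = Counter(w for d in docs for w in d)
N = sum(counts.values())
beta_hat = {w: c / N for w, c in counts.items()}
print(beta_hat)  # e.g. 'sat' -> 3/7
```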


This model is intuitive but limited, since all documents are modelled by the same global word-frequency distribution; we want a model which accommodates different topics of documents. In this new model we add more latent variables: z_d ∈ {1, ..., K} assigns document d to one of K topics, and θ_k = p(z_d = k) is the probability of a document being assigned to topic k. These form the parameters of a categorical distribution, θ.

[Figure 10: graphical model θ → z_d → w_{nd} ← β_k, with plates n = 1...Nd, d = 1...D, k = 1...K; z ∼ Cat(θ), w|z ∼ Cat(β_{z_d}).]
First, we write down the likelihood (try reading it off the graphical model!):

log p(w|θ, β) = log Π_{d=1}^D p(w_d|θ, β)
             = log Π_{d=1}^D Σ_{k=1}^K p(w_d, z_d = k|θ, β)      (marginalising the z_d)
             = log Π_{d=1}^D Σ_{k=1}^K p(z_d = k|θ) p(w_d|z_d = k, β)
             = Σ_{d=1}^D log Σ_{k=1}^K p(z_d = k|θ) Π_{n=1}^{Nd} p(w_{nd}|z_d = k, β)

What does this look like? GMM! So we can use the EM algorithm! Recall that
log p(x|θ) = ∫ q(z) log [ p(x|z, θ) p(z|θ) / q(z) ] dz + ∫ q(z) log [ q(z) / p(z|x, θ) ] dz

Our EM algorithm is then as follows:


• E-step: for fixed θ^{t−1}, maximise the lower bound w.r.t. q(z). This consists of minimising the KL-divergence, by setting q^t(z) to the posterior (normalised over k):

  q^t(z_d = k) ∝ p(z_d = k|θ^{t−1}) Π_{n=1}^{Nd} p(w_{nd}|z_d = k, β)
              = θ_k Mult(c_{1d}, ..., c_{Md}|β_k, N_d)
              := r_{kd}

• M-step: for fixed q t (z), maximise the lower-bound wrt θ.

8.5.1 Latent Dirichlet Allocation


A symmetric Dirichlet simply has all parameters identical (∀i : αi = α).

9 Variational Inference
Recall our GMM model:

[Figure 11: Graphical model for the GMM — z → x, inside a plate over N.]

We seek the posterior distribution of the latent variable given the observations: p(z|x). This is often intractable because the normalising constant p(x) = ∫ p(x|z)p(z) dz involves an integral over the latent space that rarely has a closed form.


VI approximates the intractable distribution with a simpler one, whose parameters (called variational parameters) are optimised. One such objective function is the ELBO, which we came across in Section 8. Supposing we use a Gaussian posterior approximation, we would optimise the mean and deviation parameters for each observation. The number of variational parameters, therefore, scales with the dataset size.

9.1 Amortizing Variational Inference


Instead of optimising the free variational parameters directly, we can create a parameterised func-
tion that maps from observation space to variational parameter space. We end up with a constant
number of variational parameters. The downside is that this results in less expressivity than free
optimisation.

9.2 General Form


To give a more general comment on variational inference, we take:
log p(x) = log ∫ p(x|z) p(z) dz = log ∫ p(x, z) [ q(z)/q(z) ] dz
         = log E_{q(z)}[ p(x, z) / q(z) ]
         ≥ E_{q(z)}[ log ( p(x, z) / q(z) ) ] = ∫ q(z) log [ p(x|z) p(z) / q(z) ] dz

which can be further decomposed into the lower bound and the KL-divergence:

∫ q(z) log [ p(x|z) p(z) / q(z) ] dz = ∫ q(z) log p(x|z) dz − ∫ q(z) log [ q(z) / p(z) ] dz
                                     = F(q) − KL(q||p)

9.2.1 Mean Field Approximation

This is a method of making the KL term tractable by restricting the variational family to fully factorised distributions: we select Q = {q | q(z) = Π_{i=1}^d q_i(z_i)}.

min_{q_k} KL( Π_{i=1}^d q_i || p ) requires expanding the KL:

KL( Π_{i=1}^d q_i || p ) = ∫ Π_{i=1}^d q_i · log [ ( Π_{j=1}^d q_j ) / p ] dz
                        = Σ_{j=1}^d ∫ ( Π_{i=1}^d q_i ) log q_j dz − ∫ ( Π_{i=1}^d q_i ) log p dz

For the j = k term, ∫ ( Π_i q_i ) log q_k dz = ∫ q_k log q_k dz_k, since the remaining factors integrate to 1 × ... × 1 = 1; the j ≠ k terms do not depend on q_k. Since we are optimising w.r.t. q_k:

min_{q_k} KL( Π_{i=1}^d q_i || p ) = min_{q_k} [ ∫ q_k log q_k dz_k − ∫ ( Π_{i=1}^d q_i ) log p dz ] + const

Following a similar procedure for the second term, ∫ ( Π_i q_i ) log p dz = ∫ q_k(z_k) E_{Π_{i≠k} q_i}[ log p(z) ] dz_k. Defining log p̃(z_k) = E_{Π_{i≠k} q_i}[ log p(z) ] + const, the objective becomes KL(q_k || p̃), which is minimised by the standard mean-field update

q_k*(z_k) ∝ exp( E_{Π_{i≠k} q_i}[ log p(z) ] ),

with each factor updated in turn using the current estimates of the others.
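As a concrete instance of this update (a minimal sketch, not from the text: for a bivariate Gaussian target the mean-field updates have closed form, and coordinate ascent recovers the true marginal means):

```python
import numpy as np

# Target: bivariate Gaussian p(z) = N(mu, inv(Lam)); mean-field q(z) = q1(z1) q2(z2)
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.8], [0.8, 1.5]])  # precision matrix

m = np.zeros(2)  # means of the Gaussian factors q1, q2
for _ in range(20):  # coordinate ascent (each q_k* from the current other factor)
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print(m)  # converges to the true mean mu; factor variances are 1/Lam[k, k]
```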

9.3 Examples
9.3.1 Inducing Point Approximation for Gaussian Processes
We will now go through the inducing point approximation, which is commonly used with multi-output GPs. With D outputs, let y_d be the d-th Gaussian process, such that y_d = f_d + ε with ε ∼ N(0, β²). The inducing points are u_d. First, we apply the general form of VI from Section 9.2:

log p(Y) = log ∫ p(Y|X) p(X) dX ≥ ∫ q(X) log p(Y|X) dX − ∫ q(X) log [ q(X)/p(X) ] dX    (10)
         = F(q) − KL(q||p)    (11)

We use the variational distribution q(f_d, u_d) = p(f_d|u_d) φ(u_d). The marginal likelihood is now:

log p(y_d|X) = log ∫∫ p(y_d|f_d) p(f_d|u_d) p(u_d) df_d du_d
             = log ∫∫ [ p(y_d|f_d) p(f_d|u_d) p(u_d) / ( p(f_d|u_d) φ(u_d) ) ] p(f_d|u_d) φ(u_d) df_d du_d

Now apply Jensen's inequality, where the expectation is over the variational distribution:

log p(y_d|X) ≥ ∫∫ p(f_d|u_d) φ(u_d) log [ p(y_d|f_d) p(f_d|u_d) p(u_d) / ( p(f_d|u_d) φ(u_d) ) ] df_d du_d    (12)
             = ∫ φ(u_d) [ ∫ p(f_d|u_d) log p(y_d|f_d) df_d + log ( p(u_d)/φ(u_d) ) ] du_d    (13)
               (the p(f_d|u_d) attached to the log-ratio term integrates to 1)
             = ∫ φ(u_d) E_{p(f_d|u_d)}[ log p(y_d|f_d) ] du_d − KL(φ||p)    (14)
             = L1    (15)

• Note that p(f_d|u_d) = N(f_d | α_d, K_NN − K_NM K_MM^{-1} K_MN) and p(y_d|f_d) = N(y_d | f_d, σ²I).
• Note that the expectation E_{p(f_d|u_d)}[ log p(y_d|f_d) ] in Eq. 14 is a lower bound on the log conditional: E_{p(f_d|u_d)}[ log p(y_d|f_d) ] ≤ log E_{p(f_d|u_d)}[ p(y_d|f_d) ] = log p(y_d|u_d). With a Gaussian likelihood, p(y_d|u_d) is tractable, but costs O(N³) compared with O(M³) for the lower bound.
• L1 is the lower bound from Titsias and Lawrence [2010], Hensman et al. [2013].
We take the trace of the scalar inside the log Gaussian (the trace of a scalar is the scalar), so expectations can be moved inside it:

E_{p(f_d|u_d)}[ log p(y_d|f_d) ] = E_{p(f_d|u_d)}[ −(N/2) log(2πσ²) − (1/2σ²)( y_d'y_d − 2 f_d'y_d + f_d'f_d ) ]
  = −(N/2) log(2πσ²) − (1/2σ²) tr( y_d'y_d − 2 E_{p(f_d|u_d)}[f_d]' y_d + E_{p(f_d|u_d)}[f_d f_d'] ),
    where E[f_d f_d'] = Cov[f_d] + α_d α_d'
  = −(N/2) log(2πσ²) − (1/2σ²) tr( y_d'y_d − 2 α_d'y_d + α_d α_d' + K_NN − K_NM K_MM^{-1} K_MN )
  = log N(y_d | α_d, σ²I) − (1/2σ²) tr( K_NN − K_NM K_MM^{-1} K_MN )    (16)

writing Q_NN := K_NM K_MM^{-1} K_MN for the matrix in the trace.

where α_d = K_NM K_MM^{-1} u_d. We now plug Eq. 16 into Eq. 14, giving us Eq. 13 from Titsias and Lawrence [2010]:

L1 = ∫ φ(u_d) log [ N(y_d | α_d, σ²I) p(u_d) / φ(u_d) ] du_d − (1/2σ²) tr( K_NN − Q_NN )    (17)

How do we acquire the optimal variational distribution?

1. We could take the (functional) derivative of L1 w.r.t. φ(u_d), set it to zero, and plug the result back into L1. Writing A(u_d) = N(y_d | α_d, σ²I) p(u_d):

   0 = dL1/dφ = (d/dφ) [ ∫ φ(u_d) log ( A/φ(u_d) ) du_d − (1/2σ²) tr( K_NN − Q_NN ) ]    (18)
     = (d/dφ) ∫ φ log ( A/φ ) du_d      (the trace term has no φ dependence)    (19)
     = (d/dφ) ∫ [ φ log A − φ log φ ] du_d    (20)
     = log A − log φ − 1    (21)
   =⇒ A/φ = e    (22)
   =⇒ φ ∝ A = N(y_d | α_d, σ²I) p(u_d)    (24)

   (Strictly, a Lagrange multiplier enforcing ∫ φ du_d = 1 should be added; it only changes the constant of proportionality.)
φ

2. Another way is to reverse Jensen's inequality, which we shall see now.

   L1 = ∫ φ(u_d) log [ N(y_d|α_d, σ²I) p(u_d) / φ(u_d) ] du_d − (1/2σ²) tr( K_NN − Q_NN )    (25)
      ≤ log ∫ φ(u_d) [ N(y_d|α_d, σ²I) p(u_d) / φ(u_d) ] du_d − (1/2σ²) tr( K_NN − Q_NN )      (by reverse Jensen's inequality)
   =⇒ L1 ≤ log ∫ N(y_d|α_d, σ²I) p(u_d) du_d − (1/2σ²) tr( K_NN − Q_NN )    (26)
      = log ∫ N(y_d|α_d, σ²I) N(u_d|0, K_MM) du_d − (1/2σ²) tr( K_NN − Q_NN )    (27)
      = log ∫ N(y_d|α_d, σ²I) N(α_d|0, K_NM K_MM^{-1} K_MN) dα_d − (1/2σ²) tr( K_NN − Q_NN )    (28)
        (α_d = K_NM K_MM^{-1} u_d is a scaled Gaussian: Cov(BX) = B Cov(X) B')
      = log N(y_d | 0, Q_NN + σ²I) − (1/2σ²) tr( K_NN − Q_NN )    (29, 30)
      = L2    (31)

   where Eq. 29 comes from the marginalisation identity ∫ N(a|µ1, Σ1) N(µ1|µ2, Σ2) dµ1 = N(a|µ2, Σ1 + Σ2). This gives us L2
from Hensman et al. [2013].
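To make L2 concrete, here is a minimal numpy sketch evaluating the collapsed bound on toy data (the RBF kernel, the data, and the inducing locations are all illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf(a, b, ell=1.0, var=1.0):
    # Squared-exponential kernel on 1-D inputs
    return var * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)                     # N inputs
y = np.sin(x) + 0.1 * rng.normal(size=50)
z = np.linspace(0, 5, 7)                      # M inducing inputs
sigma2 = 0.1 ** 2

Knn_diag = np.diag(rbf(x, x))
Knm = rbf(x, z)
Kmm = rbf(z, z) + 1e-8 * np.eye(len(z))       # jitter for numerical stability
Qnn = Knm @ np.linalg.solve(Kmm, Knm.T)

# L2 = log N(y | 0, Qnn + sigma^2 I) - tr(Knn - Qnn) / (2 sigma^2)
L2 = multivariate_normal.logpdf(y, mean=np.zeros(len(x)), cov=Qnn + sigma2 * np.eye(len(x)))
L2 -= (Knn_diag.sum() - np.trace(Qnn)) / (2 * sigma2)
print(L2)
```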
Continuing with Hensman et al. [2013], we wish to derive a new lower bound to which we can apply stochastic variational inference (SVI), which scales to large datasets. Marginalising u_d re-introduces dependencies between the observations, so Hensman et al. [2013] keep an explicit variational distribution q(u_d); this enables the derivation of natural gradients and therefore SVI.

SVI requires a set of global variables. For our purposes, u_d will be the global variables, and a new lower bound is defined:

log p(y|X) ≥ E_{q(u_d)}[ L1 + log p(u_d) − log q(u_d) ] = L3
9.3.2 Deep Gaussian Processes
Imagine each layer is a multi-output GP, for example with Q inputs and D output units, f : RQ →
RD . Each layer, therefore, would add many model parameters. Moreover, how do we pick the size
of layers and number of layers? Deep GPs as defined in Damianou and Lawrence [2013], therefore,
marginalise out the entire latent space. We start with a two-layer model:

[Graphical model: Z → X → Y, with GP mappings f_X : Z → X and f_Y : X → Y, inside a plate over N.]

First, the marginal likelihood:

log p(Y) = log ∫∫ p(Y|X) p(X|Z) p(Z) dX dZ,    where Z ∼ N(0, I)

log p(Y) ≥ ∫ φ log [ p(Y, F_Y, F_X, X, Z) / φ ] d{X, Z, F_X, F_Y}    (32)

where we can factorise p(Y, F_Y, F_X, X, Z) = p(Y|F_Y) p(F_Y|X) p(X|F_X) p(F_X|Z) p(Z). This is intractable as X and Z appear in the integral non-linearly. Again, we use inducing points, yielding:

p(Y, F_Y, F_X, X, Z, U_Y, U_X, X̃, Z̃) = p(Y|F_Y) p(F_Y|U_Y, X) p(U_Y) p(X|F_X) p(F_X|U_X, Z) p(U_X) p(Z)

Note that, for brevity, p(U_Y) = p(U_Y|X̃), and similarly for p(U_X) with its inducing inputs Z̃. We define φ as:

φ(F_Y, U_Y, F_X, U_X) = p(F_Y|U_Y, X) p(F_X|U_X, Z) q(U_Y) q(U_X) q(X) q(Z)

We can now plug this into the unruly integral; the conditionals p(F_Y|U_Y, X) and p(F_X|U_X, Z) cancel between the joint and φ, and the remaining terms separate (Eq. 13 in Damianou and Lawrence [2013]):

log p(Y) ≥ ∫ φ log [ p(Y|F_Y) p(U_Y) p(X|F_X) p(U_X) p(Z) / ( q(U_Y) q(U_X) q(X) q(Z) ) ] d{X, Z, F_X, F_Y, U_X, U_Y}    (33, 34)
         = g_Y + r_X + H_{q(X)} − KL(q(Z)||p(Z))    (35)

It is quite a large integral, but deriving g_Y, r_X, and H_{q(X)} is a matter of routine algebraic rearrangement; they are shown below without derivation:

g_Y = E_{p(F_Y|U_Y,X) q(U_Y) q(X)} [ log p(Y|F_Y) + log ( p(U_Y)/q(U_Y) ) ]    (36)
r_X = E_{p(F_X|U_X,Z) q(U_X) q(X) q(Z)} [ log p(X|F_X) + log ( p(U_X)/q(U_X) ) ]    (37)
H_{q(X)} = − ∫ q(X) log q(X) dX      (the entropy of q(X))    (38)

We can conclude:
• q(U_Y), q(U_X) are free-form variational distributions
• q(X), q(Z) are Gaussians factorised along the dimensions
• Both g_Y and r_X involve known Gaussian densities, so they are tractable.
• Notice that g_Y has the same form as L1 from Eq. 15.

9.4 Implementation Details


Now that we know some lower bounds, it is crucial that we understand how we might implement them. Suppose we have the lower bound:

E_{q(X)} [ E_{q(F|m,S,X,Z)}[ log p(Y|F) ] ]  −  KL[ q(X)||p(X) ]  −  KL[ q(U)||p(U) ]
       (term 3)                                  (term 2)            (term 1)

1: KL between the inducing points. First, let's calculate KL[q(U)||p(U)]. We need two variational parameters here: q_µ, the mean of the inducing points; and q_σ, a square-root (Cholesky factor) of the inducing-point covariance. The KL-divergence between two multivariate Gaussians is:

(1/2) [ log ( |Σ2| / |Σ1| ) − d + tr{ Σ2^{-1} Σ1 } + (µ2 − µ1)' Σ2^{-1} (µ2 − µ1) ]

where d is the dimensionality of the Gaussians (here the number of inducing points), Σ1 = q_σ q_σ', Σ2 = K_MM, µ1 = q_µ, and µ2 = 0.

Theorem 1. The determinant of a triangular matrix is the product of its diagonal entries.

Theorem 2. The determinant of the square-root of a matrix equals the square-root of its determinant; more generally, |A^n| = |A|^n.

Using these results, we can find the log-determinant (first) term:

− log |Σ1| = − log |q_σ q_σ'| = − log Π diag(q_σ)² = − Σ log diag(q_σ)²

log |K_MM| = log |LL'| = 2 log |L| = 2 Σ log diag(L),    where L = chol(K_MM)

Combining the two gives the log-ratio term. Now for the trace term: to calculate tr(X) where X = K_MM^{-1} q_σ q_σ', note that tr(X) = tr( L^{-T} L^{-1} q_σ q_σ' ) = ||L^{-1} q_σ||_F², so, since L is triangular, tr(X) = sum( triangular_solve(L, q_σ)² ).

Finally, we need the q_µ' K_MM^{-1} q_µ term. K_MM^{-1} q_µ can be calculated using cholesky_solve(L, q_µ), and therefore the whole term is q_µ' cholesky_solve(L, q_µ).
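Putting the three pieces together gives this sketch (a minimal implementation under the assumptions above; the random test matrix is made up):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, cho_solve

def kl_inducing(q_mu, q_sqrt, Kmm):
    """KL[ N(q_mu, q_sqrt q_sqrt') || N(0, Kmm) ], following the text's recipe."""
    M = len(q_mu)
    L = cholesky(Kmm, lower=True)
    # log|Kmm| = 2 sum(log diag(L)); log|Sigma1| = 2 sum(log diag(q_sqrt))
    logdet_ratio = 2 * (np.sum(np.log(np.diag(L))) - np.sum(np.log(np.diag(q_sqrt))))
    # tr(Kmm^{-1} Sigma1) = ||L^{-1} q_sqrt||_F^2, via a triangular solve
    trace = np.sum(solve_triangular(L, q_sqrt, lower=True) ** 2)
    # q_mu' Kmm^{-1} q_mu via a Cholesky solve
    quad = q_mu @ cho_solve((L, True), q_mu)
    return 0.5 * (logdet_ratio - M + trace + quad)

M = 5
A = np.random.default_rng(0).normal(size=(M, M))
Kmm = A @ A.T + M * np.eye(M)          # an arbitrary positive-definite K_MM
print(kl_inducing(np.zeros(M), 0.5 * np.eye(M), Kmm))
```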
2: KL between the latents. Now, we wish to find KL[q(X)||p(X)]. If, for example, p(X) is a standard-normal prior and q(X) factorises as in term 3 below, this is a sum of closed-form Gaussian KL terms of the same form as above.
3: Marginal likelihood. Finally, we can calculate the third term using Monte Carlo sampling of q(X). First, we set

q(X) = Π_{n=1}^N N(µ_n, S_n)

so we have {(µ_n, S_n)}_{n=1}^N variational parameters, where µ_n ∈ R^Q. Recall also that p(f_d|u_d) = N(f_d | α_d, K_NN − K_NM K_MM^{-1} K_MN), where α_d = K_NM K_MM^{-1} u_d.

Algorithm 1: Monte Carlo estimate of the marginal-likelihood term.

Input: variational parameters µ, S (for q(X)) and q_µ (for q(U)); L = chol(K_MM); K_MN
Output: samples of the variational posterior mean over F
# First, we generate our latent samples:
ε ∼ N(0, I) ∈ R^{NS×N×Q}
X = µ + ε S^{1/2}    (reparameterisation trick)
# Second, we sample from the variational posterior:
temp = triangular_solve(L, K_MN)          # = L^{-1} K_MN
α = temp' · triangular_solve(L, q_µ)      # = K_NM K_MM^{-1} q_µ, since K_MM^{-1} = L^{-T} L^{-1}
10 Stochastic Calculus
When reading probabilistic machine learning literature, especially when applied to physical or
dynamical systems, some foundational understanding of stochastic calculus is extremely useful. I
cover only the main topics around stochastic processes and stochastic differential equations, so
note that this only scratches the surface.

10.1 Probability Generating Function


We shall see that the probability generating function (pgf) is a useful tool when manipulating probabilities. It is defined as:

G_X(s) = E[ s^X ] = Σ_{x≥0} s^x p_X(x)

where s is a complex variable in the unit circle (Figure 12 sketches the unit complex circle).

Note that G_X(1) = Σ_{x≥0} p_X(x) is the normalisation of the pmf p_X. Therefore,

G_X(1) = 1 if p_X(∞) = 0, and 1 − p_X(∞) otherwise,

where p_X(∞) denotes the probability that X takes a value outside of [0, ∞). For example, this could be the probability that a temperature will go below zero.
We can differentiate this function n times as follows:

d^n G_X(s)/ds^n = Σ_{x≥n} x(x−1)···(x−n+1) s^{x−n} p_X(x)      (the summand is zero for 0 ≤ x ≤ n−1)
               = Σ_{x≥n} [ x!/(x−n)! ] s^{x−n} p_X(x)

Why is this useful? Notice that the first derivative at s → 1⁻ is the expectation:

lim_{s→1⁻} dG_X(s)/ds = Σ_{x≥1} x s^{x−1} p_X(x) |_{s=1} = Σ_{x≥1} x p_X(x) = E[X]

Similarly, we can work out the variance:

Var[X] = E[X²] − E[X]² = G''_X(1) + G'_X(1) − (G'_X(1))²

since G''_X(1) = E[X(X − 1)] = E[X²] − E[X].

When s → 0, the n-th derivative of the pgf gives the probabilities:

lim_{s→0} d^n G_X(s)/ds^n = lim_{s→0} [ n! p_X(n) + (n+1)! s p_X(n+1) + ((n+2)!/2!) s² p_X(n+2) + ... ] = n! · p_X(n)

Let us derive the pgf for the Binomial distribution:

G(s) = Σ_{x≥0} C(n, x) s^x p^x q^{n−x} = (ps + q)^n      by the Binomial Theorem
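A quick symbolic check of these identities on the Binomial pgf (a sketch using sympy):

```python
import sympy as sp

s, p, n = sp.symbols('s p n', positive=True)
q = 1 - p
G = (p * s + q) ** n          # pgf of Binomial(n, p), as derived above

mean = sp.limit(sp.diff(G, s), s, 1)   # G'(1) = E[X]
print(sp.simplify(mean))               # n*p

# Var[X] = G''(1) + G'(1) - G'(1)^2
var = sp.diff(G, s, 2).subs(s, 1) + mean - mean**2
print(sp.simplify(var))                # n*p*(1 - p)
```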

10.2 Stochastic Process
A stochastic process is an indexed collection of random variables, where the index set may be
discrete or continuous. Some examples of stochastic processes include Brownian Motion (Section
10.3), Gaussian Processes (Section 4), and Markov chains (Section 6). Markov chains satisfy the
Markov property, which states that, given the current state, the distribution of the future state is independent of the past. More formally:

p(X_t ≤ x|X_{t−1}, X_{t−2}, ..., X_0) = p(X_t ≤ x|X_{t−1})

10.3 Brownian Motion


Brownian motion, Bt , is a stochastic process, in particular the continuous random walk that
particles exhibit, for example in a gas. It satisfies the following properties:
• independent increments
• ∀s < t : Bt − Bs ∼ N(0, t − s)
• paths are continuous
• B0 = 0
There are some interesting properties we can show. First, Brownian motion arises as the limit of a discrete random walk. Let X_i be zero-mean i.i.d. random variables with variance 1. The random walk starts at S_0 = 0 and proceeds with S_n = S_0 + Σ_{i=1}^n X_i for all n ≥ 1. Define the continuous-time process:

B_t^n = S_{[nt]} / √n,    where [nt] denotes the integer part of nt.

The Central Limit Theorem states that S_n/(σ√n) → N(0, 1) as n → ∞ (here µ = 0, σ = 1). Hence

B_t^n = ( S_{[nt]} / √(nt) ) √t → √t · N(0, 1) as n → ∞.

Therefore, B_t^n converges to a scaled normal random variable, whose variance is Var[√t N(0, 1)] = t, so B_t ∼ N(0, t).

Furthermore, the increments satisfy

B_t^n − B_s^n = ( S_{[nt]} − S_{[ns]} ) / √n = (1/√n) Σ_{i=[ns]+1}^{[nt]} X_i = ( S_{[nt]−[ns]} / √([nt] − [ns]) ) · √( ([nt] − [ns]) / n ) →_d N(0, t − s).
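A quick simulation of this limit (a sketch; the ±1 step distribution is an arbitrary zero-mean, unit-variance choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 2_000, 0.7
# Many independent copies of the scaled random walk B_t^n = S_[nt] / sqrt(n)
steps = rng.choice([-1.0, 1.0], size=(2_000, int(n * t)))
B_t = steps.sum(axis=1) / np.sqrt(n)
print(B_t.var())  # ≈ t = 0.7, consistent with B_t ~ N(0, t)
```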

10.3.1 Brownian Motion as a GP

Brownian motion can be interpreted as a zero-mean GP. For s < t, its covariance is

Cov(B_s, B_t) = Cov(B_s, B_s + (B_t − B_s)) = Cov(B_s, B_s) + Cov(B_s, B_t − B_s) = Var(B_s) + 0 = s,

where the cross term vanishes by independent increments. Thus, in general, the kernel is κ(s, t) = min(s, t).

10.4 Stochastic Differential Equations

Modelling diffusive processes can be done using stochastic differential equations (SDEs). They typically look like this:

dX_t = µ(X_t) dt + σ(X_t) dB_t

where µ(X_t) dt is the drift term and σ(X_t) dB_t the diffusion term, corresponding to (for t > s):

X_t = X_s + ∫_s^t µ(X_u) du + ∫_s^t σ(X_u) dB_u

Some things to note:

• If there is no diffusion term then this reduces to an ODE.
• σ(X_t) dB_t ∼ N(0, σ²(X_t) dt)
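The simplest way to simulate such an SDE is the Euler–Maruyama discretisation (a sketch; the mean-reverting drift and constant diffusion below, an Ornstein–Uhlenbeck process, are illustrative assumptions):

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, T=1.0, n=1000, seed=0):
    """Simulate dX_t = mu(X_t) dt + sigma(X_t) dB_t on [0, T]."""
    rng = np.random.default_rng(seed)
    dt = T / n
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        dB = rng.normal(0.0, np.sqrt(dt))  # Brownian increment ~ N(0, dt)
        x[i + 1] = x[i] + mu(x[i]) * dt + sigma(x[i]) * dB
    return x

# Ornstein-Uhlenbeck: mean-reverting drift, constant diffusion
path = euler_maruyama(mu=lambda x: -2.0 * x, sigma=lambda x: 0.5, x0=1.0)
print(path[-1])
```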

10.4.1 Itô vs. Stratonovich


The Riemann–Stieltjes integral is the following:

∫_s^t f(u) dg(u) = lim_{n→∞} Σ_{i=1}^n f(τ_i) ( g(t_{i+1}) − g(t_i) ),    with τ_i ∈ [t_i, t_{i+1}].

We wish to use this to evaluate the diffusion term. There are two conventions for picking τ_i:
• Itô: τ_i = t_i
• Stratonovich: τ_i = (t_i + t_{i+1})/2

11 Advanced Generative Models
11.1 Variational Autoencoders
The autoencoder is a method of compressing data through a bottleneck such that it can be recon-
structed with minimal error. A simple example architecture is shown in Figure 13. The z vector in
the bottleneck layer is referred to as the latent state. The variational autoencoder (VAE) [Kingma
and Welling, 2013] is an extension which combines the autoencoder and latent variable approach,
resulting in a continuous latent state that has meaningful samples. This means that the latent
space can be explored by decoding these samples. You might have guessed from the “variational”
in the name that a lower bound is imminent. We will derive it now.

[Figure 13: General autoencoder architecture — an input layer x1, ..., xn, a bottleneck layer z1, ..., zm, and an output layer y1, ..., yn.]

The fundamental model assumption is that our dataset X = {x^(i)}_{i=1}^N consists of samples of some random variable x generated by a continuous latent random variable z with prior p(z). In our case, the decoder represents the conditional p(x|z). Moreover, the aforementioned pdfs are differentiable w.r.t. both parameters and latent variables. Our encoder is also our variational posterior: q(z|x).

When are VAEs used?


• Probabilistically, the VAE is used for cases where the true posterior distribution is
intractable.
• Practically, the VAE is used in an unsupervised learning setting where we wish to
generate new data that mimics the same generative process as the real data.

Our log marginal likelihood is:

log p(x^(i)) = ∫ q(z|x^(i)) log [ p(x|z) p(z) / q(z|x^(i)) ] dz + ∫ q(z|x^(i)) log [ q(z|x^(i)) / p(z|x^(i)) ] dz
             = L(φ; x^(i)) + KL[ q(z|x^(i)) || p(z|x^(i)) ]
             ≥ L(φ; x^(i)) = E_{q(z|x^(i))}[ log p(x|z) p(z) − log q(z|x^(i)) ]

=⇒ L(φ; x^(i)) = E_{q(z|x^(i))}[ log p(x^(i)|z) ] − KL[ q(z|x^(i)) || p(z) ]

Note how we have removed dependency on the true posterior by using this lower bound. Note also
that our variational posterior is conditioned on the data. One way of optimising such a bound is

using Monte Carlo gradient sampling, where we seek the gradient w.r.t. the parameters:
∇_φ E_{q_φ(z|x^(i))}[ f(z) ] = E_{q_φ}[ f(z) ∇_φ log q_φ(z|x^(i)) ] ≈ (1/T) Σ_{t=1}^T f(z^(t)) ∇_φ log q_φ(z^(t)|x^(i))

where z^(t) ∼ q_φ(z|x^(i)) and f(z) = log p(x^(i)|z). See Section 6 for intuition on Monte Carlo estimators of this kind.
However, this method “exhibits high variance” [Kingma and Welling, 2013]. In order to find
another method of inferring a posterior, we must introduce the reparameterisation trick.

11.1.1 Reparameterisation Trick

Let z ∼ q_φ(z|x) (our posterior). We represent z as a deterministic transformation z = g(ε, x) of an auxiliary variable ε ∼ p(ε), where g is a vector-valued function with parameters φ. An expectation w.r.t. q_φ(z|x) can then be reparameterised as an expectation that is differentiable w.r.t. φ.

As we saw in Section 1.2.3, under the conservation of probability, we have:

q_φ(z|x) dz = p(ε) dε

Substituting both this equation and our function g into the Monte Carlo sampler yields:

∇_φ E_{q_φ(z|x^(i))}[ f(z) ] = ∇_φ ∫ f(z) q_φ(z|x) dz
                            = ∇_φ ∫ f(g(ε, x)) p(ε) dε
                            = ∇_φ E_{p(ε)}[ f(g(ε, x)) ]
                            = ∇_φ E_{p(ε)}[ log p(x^(i)|z) ]
                            = E_{p(ε)}[ ∇_φ log p(x^(i)|z) ]

where z = g(ε, x).

Key Concept

Note that in the last step the gradient can pass inside the expectation, since the sampling distribution p(ε) does not depend on φ. This enables us to propagate a gradient through the sampling procedure!

In order to define g for the VAE, we should specify the variational posterior, which we pick to be a multivariate Gaussian:

q_φ(z|x^(i)) = N(z | µ^(i), σ^{2(i)} I)

where µ^(i) and σ^(i) are output by our encoder neural network. A valid g, therefore, is:

z^(i,t) = g(ε^(t), x^(i)) = µ^(i) + σ^(i) ⊙ ε^(t),    ε^(t) ∼ N(0, I)

Our final loss function, therefore, is:

L(φ; x^(i)) ≈ −KL[ q(z|x^(i)) || p(z) ] + (1/T) Σ_{t=1}^T log p(x^(i)|z^(i,t))

where z^(i,t) = µ^(i) + σ^(i) ⊙ ε^(t) and ε^(t) ∼ N(0, I).
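Putting the pieces together, here is a minimal numpy sketch of the reparameterised single-datapoint estimate (the linear decoder and unit-variance Gaussian likelihood are illustrative assumptions, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, log_sigma, decode, T=32):
    """Reparameterised single-datapoint ELBO with Gaussian q(z|x) and prior N(0, I).

    `decode` maps z to the mean of a unit-variance Gaussian p(x|z); this decoder
    choice (and the toy linear decoder below) is a made-up assumption."""
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=(T, len(mu)))
    z = mu + sigma * eps                    # reparameterisation trick: z = g(eps, x)
    # Monte Carlo estimate of E_q[log p(x|z)], up to an additive constant
    recon = np.mean([-0.5 * np.sum((x - decode(zt)) ** 2) for zt in z])
    # Closed-form KL[ N(mu, diag(sigma^2)) || N(0, I) ]
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return recon - kl

W = rng.normal(size=(3, 2))                 # hypothetical linear "decoder"
print(elbo_estimate(np.ones(3), np.zeros(2), np.zeros(2), decode=lambda z: W @ z))
```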

11.2 Generative Adversarial Networks

11.3 Normalising Flows
Recall that for two random variables X and Z related by a function f : X → Z,

p_X(x) dx = p_Z(z) dz

=⇒ p_X(x) = |det ( df(x)/dx )| · p_Z(f(x))

where the determinant of the Jacobian generalises the result to multidimensional distributions [Rezende and Mohamed, 2015]. Taking the logarithm:

log p_X(x) = log |det ( df/dx )| + log p_Z(f(x))
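As a one-line numeric check of the change-of-variables formula (a sketch with f(x) = log x and a standard-normal base distribution, which recovers the log-normal density):

```python
import numpy as np
from scipy.stats import norm, lognorm

# f(x) = log(x), z ~ N(0, 1):
# p_X(x) = |df/dx| p_Z(f(x)) = (1/x) * N(log x | 0, 1), the log-normal density
x = 1.7
px = (1.0 / x) * norm.pdf(np.log(x))
print(px, lognorm.pdf(x, s=1.0))  # the two values agree
```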

References
Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
Michalis Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 844–851, 2010.

A Derivations
A.1 Double Integral of a Mean Function

∫_b ∫_a (ax + b) p(a) p(b) da db = ∫_b ∫_a ax p(a) p(b) da db + ∫_b ∫_a b p(a) p(b) da db
  = ∫_a ax p(a) da · ∫_b p(b) db + ∫_a p(a) da · ∫_b b p(b) db      but ∫_a p(a) da = ∫_b p(b) db = 1
  = ∫_a ax p(a) da + ∫_b b p(b) db

A.2 Expansion of Strange Gaussian Identity

p(t_g | w_Ig, w_Jg) ∝ exp( −(1/2) (w_Ig − w_Jg − t_g)² )

We verify that this matches the quadratic form with precision matrix [[1, −1], [−1, 1]] and means µ1, µ2 satisfying t_g = µ1 − µ2:

−(1/2) [w_Ig − µ1, w_Jg − µ2] [[1, −1], [−1, 1]] [w_Ig − µ1; w_Jg − µ2]
 = −(1/2) [ ((w_Ig − µ1) − (w_Jg − µ2)) (w_Ig − µ1) + (−(w_Ig − µ1) + (w_Jg − µ2)) (w_Jg − µ2) ]
 = −(1/2) [ (w_Ig − w_Jg − t_g)(w_Ig − µ1) + (−w_Ig + w_Jg + t_g)(w_Jg − µ2) ]
 = −(1/2) [ w_Ig² − w_Jg w_Ig − t_g w_Ig − µ1(w_Ig − w_Jg − t_g) − w_Ig w_Jg + w_Jg² + t_g w_Jg + µ2(w_Ig − w_Jg − t_g) ]
 = −(1/2) [ w_Ig² − 2 w_Jg w_Ig − t_g w_Ig + (µ2 − µ1)(w_Ig − w_Jg − t_g) + w_Jg² + t_g w_Jg ]
 = −(1/2) [ w_Ig² − 2 w_Jg w_Ig − t_g w_Ig − t_g(w_Ig − w_Jg − t_g) + w_Jg² + t_g w_Jg ]
 = −(1/2) ( w_Ig² + w_Jg² + t_g² − 2 w_Jg w_Ig − 2 t_g w_Ig + 2 t_g w_Jg )
 = −(1/2) (w_Ig − w_Jg − t_g)²

A.3 Derivation of Biased Variance Estimator

E[s²] = E[ (1/N) Σ_{i=1}^N ( X_i − (1/N) Σ_{j=1}^N X_j )² ]

Expanding the square gives

( X_i − X̄ )² = X_i² − (2/N) Σ_{j=1}^N X_i X_j + (1/N²) ( Σ_{j=1}^N X_j )²,

and using independence (E[X_i X_j] = E[X_i] E[X_j] for i ≠ j) together with ( Σ_j X_j )² = Σ_j X_j² + 2 Σ_j Σ_{k>j} X_j X_k:

E[s²] = (1/N) Σ_{i=1}^N [ E[X_i²] − (2/N)( E[X_i²] + (N−1) E[X_i]² ) + (1/N²)( N E[X_i²] + N(N−1) E[X_i]² ) ]
      = (1/N) Σ_{i=1}^N [ (1 − 2/N + 1/N) E[X_i²] − ( 2(N−1)/N − (N−1)/N ) E[X_i]² ]
      = (1/N) Σ_{i=1}^N [(N−1)/N] ( E[X_i²] − E[X_i]² )
      = (1/N) Σ_{i=1}^N [(N−1)/N] Var[X_i] = ((N−1)/N) Var[X_i]

A.4 Derivations of Means and Variances

We go through derivations for some common distributions. Recall Var[X] = E[(X − µ)²] = E[X²] − E[X]².

X ∼ Bernoulli(p):
  E[X] = Σ_x x p_X(x) = 1·p + 0·(1 − p) = p
  Var[X] = Σ_x x² p_X(x) − p² = p − p² = p(1 − p)

X ∼ Binomial(n, p):      (note k C(n, k) = n C(n−1, k−1))
  E[X] = Σ_{k=0}^n k C(n, k) p^k q^{n−k} = np Σ_k C(n−1, k−1) p^{k−1} q^{(n−1)−(k−1)} = np
  Var[X] = Σ_k k² C(n, k) p^k q^{n−k} − (np)²
         = np Σ_k (1 + (k − 1)) C(n−1, k−1) p^{k−1} q^{n−k} − (np)²
         = np ( 1 + (n − 1)p ) − (np)² = np(1 − p)

X ∼ Geometric(p), p_X(x) = (1 − p)^{x−1} p:      (note 1/(1 − q) = Σ_{x≥0} q^x, a power series in q = 1 − p)
  E[X] = Σ_{x≥1} x q^{x−1} p = p (d/dq)(1 − q)^{-1} = p / (1 − q)² = 1/p
  For the variance, use (d²/dq²)(1 − q)^{-1} = Σ_x x(x − 1) q^{x−2} = 2/(1 − q)³, so E[X(X−1)] = 2q/p²:
  Var[X] = E[X(X−1)] + E[X] − 1/p² = 2q/p² + 1/p − 1/p² = ( 2(1 − p) + p − 1 ) / p² = (1 − p)/p²

B Kullback-Leibler Divergence

To minimise the KL divergence ∫ q(x) log [ q(x)/p(x) ] dx, we add a Lagrange multiplier to enforce that q(x) normalises to 1. We work with KL(q(x)||p(x)):

δ/δq(x) [ ∫ q(x) log ( q(x)/p(x) ) dx + λ( 1 − ∫ q(x) dx ) ] = log ( q(x)/p(x) ) + 1 − λ

since q(x) log ( q(x)/p(x) ) = q(x) log q(x) − q(x) log p(x), whose derivative w.r.t. q(x) is q(x)/q(x) + log q(x) − log p(x) = 1 + log ( q(x)/p(x) ).

Setting this to zero gives q(x) = exp(λ − 1) p(x), and normalisation forces λ = 1, so q(x) = p(x) at the minimum.

C Backup

We can now work out the bound from Hensman et al. [2013]. Let L1 = E_{p(f_d|u_d)}[ log p(y_d|f_d) ]:

exp(L1) = Π_{i=1}^N N( y_d^{(i)} | α_d^{(i)}, σ² ) exp( −(1/2σ²) k̃_ii )

where k̃_ii is the i-th diagonal element of K_NN − Q_NN. Note that this assumes that p(y_d|f_d) factorises over the data. Next, by plugging the exponentiated L1 into the likelihood, we get:

log p(y_d|X) ≥ log ∫ exp(L1) p(u_d) du_d = L2

This is from Titsias, and may not be necessary. Plugging into the lower bound from Eq. 11:

F_d(q) ≥ ∫ q(X) [ ∫ φ(u_d) log ( N(y_d|α_d, σ²I) p(u_d) / φ(u_d) ) du_d − (1/2σ²) tr( K_NN − Q_NN ) ] dX
       = ∫ φ(u_d) [ ⟨log N(y_d|α_d, σ²I)⟩_{q(X)} + log ( p(u_d)/φ(u_d) ) ] du_d − (1/2σ²) ⟨tr( K_NN − Q_NN )⟩_{q(X)}

Now reverse Jensen's inequality, E(·) → log E exp(·):

       ≤ log ∫ φ(u_d) exp( ⟨log N(y_d|α_d, σ²I)⟩_{q(X)} ) ( p(u_d)/φ(u_d) ) du_d − (1/2σ²) ⟨tr( K_NN − Q_NN )⟩_{q(X)}
       = log ∫ exp( ⟨log N(y_d|α_d, σ²I)⟩_{q(X)} ) p(u_d) du_d − (1/2σ²) ⟨tr( K_NN − Q_NN )⟩_{q(X)}
