Introduction To Probabilistic Learning

This document discusses different types of machine learning: 1. Unsupervised learning involves observing input data alone to discover underlying patterns and structure in the data. 2. Supervised learning involves observing input-output pairs to learn relationships and make predictions on new data. Main examples are classification and regression. 3. Reinforcement learning involves learning from rewards or payoffs to find policies that maximize long-term payoffs from actions. The document advocates for a probabilistic approach to modeling data generation processes and using probabilities to represent beliefs and make inferences. Basic probability rules like non-negativity and normalization are also covered.


Probabilistic & Unsupervised Learning

Introduction and Foundations

Maneesh Sahani
[email protected]

Gatsby Computational Neuroscience Unit, and
MSc ML/CSML, Dept Computer Science
University College London

Term 1, Autumn 2018

What do we mean by learning?

[Painting by Jan Steen]

Not just remembering:
- Systematising (noisy) observations: discovering structure.
- Predicting new outcomes: generalising.
- Choosing actions wisely.

Three learning problems

- Unsupervised learning. Observe (sensory) input alone:
  x1, x2, x3, x4, . . .
  Describe the pattern of the data [p(x)]; identify and extract underlying structural variables [xi → yi].
- Supervised learning. Observe input/output pairs ("teaching"):
  (x1, y1), (x2, y2), (x3, y3), (x4, y4), . . .
  Predict the correct y∗ for a new test input x∗.
- Reinforcement learning. Rewards or payoffs (and possibly also inputs) depend on actions:
  x1: a1 → r1, x2: a2 → r2, x3: a3 → r3, . . .
  Find a policy for action choice that maximises payoff.

Unsupervised Learning

Find underlying structure:
- separate generating processes (clusters)
- reduced dimensionality representations
- good explanations (causes) of the data
- modelling the data density

[Figure: a sparse-coding diagram. An image patch x, drawn from an image ensemble, is passed through filters W to give causes a, which reconstruct the patch via basis functions Φ.]

Uses of Unsupervised Learning:
- structure discovery, science
- data compression
- outlier detection
- input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards)
- a theory of biological learning and perception
Supervised learning

Two main examples:
- Classification: discrete (class label) outputs.
- Regression: continuous-valued outputs.

[Figure: left, a 2-D classification data set of x's and o's; right, a 1-D regression data set with a fitted curve.]

But also: ranks, relationships, trees etc.

Variants may relate to unsupervised learning:
- semi-supervised learning (most x unlabelled; assumes the structure of {x} and the relationship x → y are linked).
- multitask (transfer) learning (predict different y in different contexts; assumes links between the structure of the relationships).

A probabilistic approach

Data are generated by random and/or unknown processes. Our approach to learning starts with a probabilistic model of data production:

P(data | parameters):  P(x|θ)  or  P(y|x, θ)

This is the generative model or likelihood.

- The probabilistic model can be used to
  - make inferences about missing inputs
  - generate predictions/fantasies/imagery
  - make predictions or decisions which minimise expected loss
  - communicate the data in an efficient way
- Probabilistic modelling is often equivalent to other views of learning:
  - information theoretic: finding compact representations of the data
  - physical analogies: minimising the (free) energy of a corresponding statistical mechanical system
  - structural risk: compensating for overconfidence in powerful models

The calculus of probabilities naturally handles randomness. It is also the right way to reason about unknown values.

Representing beliefs

Let b(x) represent our strength of belief in (plausibility of) proposition x:

0 ≤ b(x) ≤ 1
b(x) = 0    x is definitely not true
b(x) = 1    x is definitely true
b(x|y)      strength of belief that x is true given that we know y is true

Cox Axioms (Desiderata):
- Let b(x) be real. As b(x) increases, b(¬x) decreases, and so the function mapping b(x) ↔ b(¬x) is monotonically decreasing and self-inverse.
- b(x ∧ y) depends only on b(y) and b(x|y).
- Consistency:
  - If a conclusion can be reasoned in more than one way, then every way should lead to the same answer.
  - Beliefs always take into account all relevant evidence.
  - Equivalent states of knowledge are represented by equivalent plausibility assignments.

Consequence: belief functions (e.g. b(x), b(x|y), b(x, y)) must be isomorphic to probabilities, satisfying all the usual laws, including Bayes rule. (See Jaynes, Probability Theory: The Logic of Science.)

Basic rules of probability

- Probabilities are non-negative: P(x) ≥ 0 for all x.
- Probabilities normalise: Σ_{x∈X} P(x) = 1 for a distribution over a discrete variable x, and ∫ p(x) dx = 1 for a probability density over a continuous variable.
- The joint probability of x and y is P(x, y).
- The marginal probability of x is P(x) = Σ_y P(x, y), assuming y is discrete.
- The conditional probability of x given y is P(x|y) = P(x, y)/P(y).
- Bayes Rule:
  P(x, y) = P(x) P(y|x) = P(y) P(x|y)  ⇒  P(y|x) = P(x|y) P(y) / P(x)

Warning: I will not be obsessively careful in my use of p and P for probability density and probability distribution. It should be obvious from context.
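These rules can be checked mechanically on a small discrete joint distribution. A minimal sketch (the joint table P below is made up purely for illustration) verifies normalisation, marginalisation, and Bayes rule numerically:

```python
# A small discrete joint distribution P(x, y) over x in {'a','b'}, y in {0,1}.
# All probabilities here are arbitrary illustrative values.
P = {('a', 0): 0.1, ('a', 1): 0.3,
     ('b', 0): 0.2, ('b', 1): 0.4}

# Normalisation: the whole joint sums to 1.
assert abs(sum(P.values()) - 1.0) < 1e-12

# Marginals, by summing out the other variable.
Px = {x: sum(p for (xx, _), p in P.items() if xx == x) for x in ('a', 'b')}
Py = {y: sum(p for (_, yy), p in P.items() if yy == y) for y in (0, 1)}

# Conditionals from the joint: P(y|x) = P(x, y)/P(x), and symmetrically.
def cond_y_given_x(y, x):
    return P[(x, y)] / Px[x]

def cond_x_given_y(x, y):
    return P[(x, y)] / Py[y]

# Bayes rule: P(y|x) = P(x|y) P(y) / P(x), for every (x, y) pair.
for x in ('a', 'b'):
    for y in (0, 1):
        bayes = cond_x_given_y(x, y) * Py[y] / Px[x]
        assert abs(bayes - cond_y_given_x(y, x)) < 1e-12
```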
The Dutch book theorem

Assume you are willing to accept bets with odds proportional to the strength of your beliefs. That is, b(x) = 0.9 implies that you will accept a bet on x at 1 : 9, winning ≥ $1 if x is true and losing $9 if x is false.

Then, unless your beliefs satisfy the rules of probability theory, including Bayes rule, there exists a set of simultaneous bets (called a "Dutch Book") which you are willing to accept, and for which you are guaranteed to lose money, no matter what the outcome.

E.g. suppose A ∩ B = ∅; then the beliefs

b(A) = 0.3,  b(B) = 0.2,  b(A ∪ B) = 0.6

mean you accept the bets ¬A at 3 : 7, ¬B at 2 : 8, and A ∪ B at 4 : 6. But then:

¬A ∩ B   ⇒ win +3 − 8 + 4 = −1
A ∩ ¬B   ⇒ win −7 + 2 + 4 = −1
¬A ∩ ¬B  ⇒ win +3 + 2 − 6 = −1

The only way to guard against Dutch Books is to ensure that your beliefs are coherent: i.e. that they satisfy the rules of probability.

Bayesian learning

Apply the basic rules of probability to learning from data.

- Problem specification:
  Data: D = {x1, . . . , xn}. Models: M1, M2, etc. Parameters: θi (per model).
  Prior probability of models: P(Mi).
  Prior probabilities of model parameters: P(θi|Mi).
  Model of data given parameters (likelihood model): P(x|θi, Mi).
- Data probability (likelihood):
  P(D|θi, Mi) = ∏_{j=1}^{n} P(xj|θi, Mi) ≡ L(θi)
  (provided the data are independently and identically distributed (iid)).
- Parameter learning (posterior):
  P(θi|D, Mi) = P(D|θi, Mi) P(θi|Mi) / P(D|Mi),  where  P(D|Mi) = ∫ dθi P(D|θi, Mi) P(θi|Mi).
  P(D|Mi) is called the marginal likelihood or evidence for Mi. It is proportional to the posterior probability of model Mi being the one that generated the data.
- Model selection:
  P(Mi|D) = P(D|Mi) P(Mi) / P(D)
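The guaranteed-loss arithmetic in the Dutch book example above can be verified with a few lines of code. This sketch hard-codes the three bets, using the convention that accepting "E at w : l" wins +w when E occurs and loses l otherwise:

```python
# Three bets implied by the incoherent beliefs b(A)=0.3, b(B)=0.2, b(A∪B)=0.6,
# with A and B disjoint. Each bet "E at w:l" pays +w if E occurs, -l otherwise.
def payoff(outcome):
    """Total payoff for one joint outcome; outcome = (A_true, B_true)."""
    a, b = outcome
    total = 0
    total += 3 if not a else -7        # bet on ¬A at 3:7
    total += 2 if not b else -8        # bet on ¬B at 2:8
    total += 4 if (a or b) else -6     # bet on A∪B at 4:6
    return total

# Since A ∩ B = ∅, these are the only possible outcomes:
outcomes = [(False, True), (True, False), (False, False)]
losses = [payoff(o) for o in outcomes]
print(losses)   # → [-1, -1, -1]: a guaranteed loss, whatever happens
```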

Bayesian learning: A coin toss example

Coin toss: one parameter q, the probability of obtaining heads. So our space of models is the set of distributions over q ∈ [0, 1].

Learner A believes model MA: all values of q are equally plausible.
Learner B believes model MB: it is more plausible that the coin is "fair" (q ≈ 0.5) than "biased".

[Figure: the two prior densities over q. A: α1 = α2 = 1.0; B: α1 = α2 = 4.0.]

Both prior beliefs can be described by the Beta distribution:

p(q|α1, α2) = q^(α1−1) (1 − q)^(α2−1) / B(α1, α2) = Beta(q|α1, α2)

Bayesian learning: The coin toss (cont)

Now we observe a toss. Two possible outcomes:

p(H|q) = q    p(T|q) = 1 − q

Suppose our single coin toss comes out heads. The probability of the observed data (likelihood) is:

p(H|q) = q

Using Bayes Rule, we multiply the prior p(q) by the likelihood and renormalise to get the posterior probability:

p(q|H) = p(q) p(H|q) / p(H)
       ∝ q Beta(q|α1, α2)
       ∝ q q^(α1−1) (1 − q)^(α2−1) = Beta(q|α1 + 1, α2)
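This update can be checked by brute force, assuming learner A's uniform prior: computing prior × likelihood on a grid and renormalising should reproduce the Beta(q|2, 1) posterior, whose mean is 2/3.

```python
# Grid-based Bayes update for the single coin toss under a uniform prior
# Beta(q|1, 1). The posterior should match Beta(q|2, 1), with mean 2/3.
N = 100000
qs = [(i + 0.5) / N for i in range(N)]     # midpoint grid on (0, 1)

prior = [1.0] * N                          # Beta(q|1, 1) is uniform
lik = qs                                   # p(H|q) = q

unnorm = [p * l for p, l in zip(prior, lik)]
Z = sum(unnorm) / N                        # ≈ p(H) = ∫ p(q) p(H|q) dq = 1/2
post = [u / Z for u in unnorm]             # renormalised posterior density

post_mean = sum(q * p for q, p in zip(qs, post)) / N
print(Z, post_mean)                        # ≈ 0.5 and ≈ 0.6667
```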
Bayesian learning: The coin toss (cont)

What about multiple tosses? Suppose we observe D = { H H T H T T }:

p({ H H T H T T }|q) = q q (1 − q) q (1 − q)(1 − q) = q³(1 − q)³

This is still straightforward:

p(q|D) = p(q) p(D|q) / p(D) ∝ q³(1 − q)³ Beta(q|α1, α2) ∝ Beta(q|α1 + 3, α2 + 3)

[Figure: priors and posteriors for the two learners. A: prior Beta(q|1, 1), posterior after one head Beta(q|2, 1); B: prior Beta(q|4, 4), posterior after one head Beta(q|5, 4).]
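In this conjugate setting the whole computation reduces to counting: each head increments α1 and each tail increments α2. A minimal sketch (the update helper is illustrative, not from the slides):

```python
# Conjugate Beta updates for coin tosses: starting from Beta(q|a1, a2),
# each head adds 1 to a1 and each tail adds 1 to a2.
def update(alpha1, alpha2, tosses):
    for t in tosses:
        if t == 'H':
            alpha1 += 1
        else:
            alpha2 += 1
    return alpha1, alpha2

# Learner A (uniform prior) and learner B after D = H H T H T T:
print(update(1, 1, "HHTHTT"))   # → (4, 4)
print(update(4, 4, "HHTHTT"))   # → (7, 7)
```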

Conjugate priors

Updating the prior to form the posterior was particularly easy in these examples. This is because we used a conjugate prior for an exponential family likelihood.

Exponential family distributions take the form:

P(x|θ) = g(θ) f(x) e^(φ(θ)ᵀ T(x))

with g(θ) the normalising constant. Given n iid observations,

P({xi}|θ) = ∏i P(xi|θ) = g(θ)ⁿ e^(φ(θ)ᵀ Σi T(xi)) ∏i f(xi)

Thus, if the prior takes the conjugate form

P(θ) = F(τ, ν) g(θ)^ν e^(φ(θ)ᵀ τ)

with F(τ, ν) the normaliser, then the posterior is

P(θ|{xi}) ∝ P({xi}|θ) P(θ) ∝ g(θ)^(ν+n) e^(φ(θ)ᵀ (τ + Σi T(xi)))

with the normaliser given by F(τ + Σi T(xi), ν + n).

Conjugate priors (cont)

The posterior given an exponential family likelihood and conjugate prior is:

P(θ|{xi}) = F(τ + Σi T(xi), ν + n) g(θ)^(ν+n) exp[φ(θ)ᵀ (τ + Σi T(xi))]

Here,
- φ(θ) is the vector of natural parameters
- Σi T(xi) is the vector of sufficient statistics
- τ are pseudo-observations which define the prior
- ν is the scale of the prior (need not be an integer)

As new data come in, each observation increments the sufficient statistics vector and the scale to define the posterior.

The prior appears to be based on "pseudo-observations", but:
1. This is different to applying Bayes' rule: there is no prior! Sometimes we can take a uniform prior (say on [0, 1] for q), but for unbounded θ there may be no equivalent.
2. A valid conjugate prior might have non-integral ν or impossible τ, with no likelihood equivalent.
Conjugacy in the coin flip

Distributions are not always written in their natural exponential form. The Bernoulli distribution (a single coin flip) with parameter q and observation x ∈ {0, 1} can be written:

P(x|q) = q^x (1 − q)^(1−x)
       = e^(x log q + (1−x) log(1−q))
       = e^(log(1−q) + x log(q/(1−q)))
       = (1 − q) e^(log(q/(1−q)) x)

So the natural parameter is the log odds log(q/(1 − q)), and the sufficient statistic (for multiple tosses) is the number of heads.

The conjugate prior is

P(q) = F(τ, ν) (1 − q)^ν e^(τ log(q/(1−q)))
     = F(τ, ν) (1 − q)^ν e^(τ log q − τ log(1−q))
     = F(τ, ν) (1 − q)^(ν−τ) q^τ

which has the form of the Beta distribution ⇒ F(τ, ν) = 1/B(τ + 1, ν − τ + 1).

In general, then, the posterior will be P(q|{xi}) = Beta(α1, α2), with

α1 = 1 + τ + Σi xi    α2 = 1 + (ν + n) − (τ + Σi xi)

If we observe a head, we add 1 to the sufficient statistic Σi xi and also 1 to the count n; this increments α1. If we observe a tail we add 1 to n but not to Σi xi, incrementing α2.

Bayesian coins – comparing models

We have seen how to update posteriors within each model. To study the choice of model, consider two more extreme models: "fair" and "bent". A priori, we may think that "fair" is more probable, eg:

p(fair) = 0.8,  p(bent) = 0.2

For the bent coin, we assume all parameter values are equally likely, whilst the fair coin has a fixed probability:

[Figure: p(q|bent) is uniform on [0, 1]; p(q|fair) places all its mass at q = 0.5.]

We make 10 tosses, and get: D = (T H T H T T T T T T).
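The (τ, ν) ↔ Beta correspondence for the coin flip above can be sketched in a couple of lines (the helper name tau_nu_to_beta is invented for illustration):

```python
# Bernoulli conjugacy: the (τ, ν) prior corresponds to Beta(α1, α2) with
# α1 = τ + 1, α2 = ν − τ + 1. Observing tosses adds the number of heads
# to τ and the number of tosses to ν.
def tau_nu_to_beta(tau, nu):
    return tau + 1, nu - tau + 1

tau, nu = 3, 7                       # e.g. prior "pseudo-data": 3 heads in 7 tosses
assert tau_nu_to_beta(tau, nu) == (4, 5)

heads, n = 2, 6                      # now observe 2 heads in 6 real tosses
post = tau_nu_to_beta(tau + heads, nu + n)
print(post)                          # → (6, 9): α1 grew by the heads, α2 by the tails
```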

Bayesian coins – comparing models (cont)

Which model should we prefer a posteriori (i.e. after seeing the data)?

The evidence for the fair model is:

P(D|fair) = (1/2)¹⁰ ≈ 0.001

and for the bent model is:

P(D|bent) = ∫ dq P(D|q, bent) p(q|bent) = ∫ dq q²(1 − q)⁸ = B(3, 9) ≈ 0.002

Thus, the posterior for the models, by Bayes rule:

P(fair|D) ∝ 0.0008,  P(bent|D) ∝ 0.0004

ie, a two-thirds probability that the coin is fair.

How do we make predictions? We could choose the fair model (model selection). Or we could weight the predictions from each model by their probability (model averaging). The probability of H at the next toss is then:

P(H|D) = P(H|D, fair) P(fair|D) + P(H|D, bent) P(bent|D) = (1/2 × 2/3) + (3/12 × 1/3) = 5/12

Learning parameters

The Bayesian probabilistic prescription tells us how to reason about models and their parameters. But it is often impractical for realistic models (outside the exponential family).

- Point estimates of parameters or other predictions: compute the posterior and find the single parameter that minimises the expected loss:
  θ_BP = argmin_θ̂ ⟨L(θ̂, θ)⟩_P(θ|D)
  - ⟨θ⟩_P(θ|D) minimises the squared loss.
- Maximum a Posteriori (MAP) estimate: assume a prior over the model parameters P(θ), and compute the parameters that are most probable under the posterior:
  θ_MAP = argmax P(θ|D) = argmax P(θ) P(D|θ)
  - Equivalent to minimising the 0/1 loss.
- Maximum Likelihood (ML) learning: no prior over the parameters. Compute the parameter value that maximises the likelihood function alone:
  θ_ML = argmax P(D|θ)
  - Parameterisation-independent.
- Approximations may allow us to recover samples from the posterior, or to find a distribution which is close in some sense.
- Choosing between these and other alternatives may be a matter of definition, of goals (loss function), or of practicality.
- For the next few weeks we will look at ML and MAP learning in more complex models. We will then return to the fully Bayesian formulation for the few interesting cases where it is tractable. Approximations will be addressed in the second half of the course.
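The model-comparison numbers for the fair/bent coins can be reproduced directly. This sketch evaluates B(3, 9) via math.gamma and follows the prior p(fair) = 0.8 from the example; note the exact posterior is 0.659, which the slides round to 2/3:

```python
from math import gamma

# D contains 2 heads and 8 tails in 10 tosses.
def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

ev_fair = 0.5 ** 10                  # P(D|fair) ≈ 0.001
ev_bent = beta_fn(3, 9)              # ∫ q²(1−q)⁸ dq = B(3, 9) = 1/495 ≈ 0.002

# Posterior over the two models with prior p(fair) = 0.8, p(bent) = 0.2:
post_fair = 0.8 * ev_fair / (0.8 * ev_fair + 0.2 * ev_bent)

# Model-averaged predictive probability of heads at the next toss.
# Under "bent", the posterior over q is Beta(3, 9), with mean 3/12.
pred_head = post_fair * 0.5 + (1 - post_fair) * (3 / 12)
print(post_fair, pred_head)          # ≈ 0.659 and ≈ 0.415 (rounded to 2/3 and 5/12)
```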
Modelling associations between variables

- Data set D = {x1, . . . , xN}
- Each data point is a vector of D features: xi = [xi1 . . . xiD]
- Assume data are i.i.d. (independent and identically distributed).

[Figure: a scatter plot of 2-D data points (xi1, xi2) with correlated features.]

A simple form of unsupervised (structure) learning: model the mean of the data and the correlations between the D features in the data.

We can use a multivariate Gaussian model:

p(x|µ, Σ) = N(µ, Σ) = |2πΣ|^(−1/2) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))

ML Learning for a Gaussian

Data set D = {x1, . . . , xN}, likelihood: p(D|µ, Σ) = ∏_{n=1}^{N} p(xn|µ, Σ)

Goal: find µ and Σ that maximise the likelihood ⇔ maximise the log likelihood:

ℓ = log ∏_{n=1}^{N} p(xn|µ, Σ) = Σn log p(xn|µ, Σ)
  = −(N/2) log|2πΣ| − ½ Σn (xn − µ)ᵀ Σ⁻¹ (xn − µ)

Note: equivalently, minimise −ℓ, which is quadratic in µ.

Procedure: take derivatives and set to zero:

∂ℓ/∂µ = 0  ⇒  µ̂ = (1/N) Σn xn    (sample mean)
∂ℓ/∂Σ = 0  ⇒  Σ̂ = (1/N) Σn (xn − µ̂)(xn − µ̂)ᵀ    (sample covariance)
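The ML formulas can be checked by hand on a tiny data set. This sketch uses four 2-D points chosen so that the sample mean and (biased, 1/N) sample covariance come out exactly:

```python
# ML estimates for a multivariate Gaussian: sample mean and 1/N covariance.
# Four hand-checkable 2-D points at the corners of a square:
X = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
N, D = len(X), len(X[0])

# µ̂ = (1/N) Σ_n x_n
mu = [sum(x[d] for x in X) / N for d in range(D)]

# Σ̂ = (1/N) Σ_n (x_n − µ̂)(x_n − µ̂)ᵀ
Sigma = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in X) / N
          for j in range(D)] for i in range(D)]

print(mu)      # → [1.0, 1.0]
print(Sigma)   # → [[1.0, 0.0], [0.0, 1.0]]
```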

Refresher – matrix derivatives of scalar forms

We will use the following facts:

xᵀAy = yᵀAᵀx = Tr[xᵀAy]    (scalars equal their own transpose and trace)

Tr[A] = Tr[Aᵀ]    Tr[ABC] = Tr[CAB] = Tr[BCA]

∂/∂Aij Tr[AᵀB] = ∂/∂Aij Σn [AᵀB]nn = ∂/∂Aij Σ_{mn} Amn Bmn = Bij
⇒ ∂/∂A Tr[AᵀB] = B

Writing Tr[AᵀBAC] = Tr[F1(A)ᵀ B F2(A) C] with F1 and F2 both identity maps, and differentiating through each appearance of A separately:

∂/∂A Tr[AᵀBAC] = BF2(A)C + BᵀF1(A)Cᵀ = BAC + BᵀACᵀ

∂/∂Aij log|A| = (1/|A|) ∂|A|/∂Aij = (−1)^(i+j) |[A]ij| / |A|    (cofactor expansion)
⇒ ∂/∂A log|A| = (A⁻¹)ᵀ

Gaussian Derivatives

For the mean:

∂(−ℓ)/∂µ = ∂/∂µ [ (N/2) log|2πΣ| + ½ Σn (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
         = ½ Σn ∂/∂µ [ (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
         = ½ Σn ∂/∂µ [ xnᵀΣ⁻¹xn + µᵀΣ⁻¹µ − 2µᵀΣ⁻¹xn ]
         = ½ Σn [ 2Σ⁻¹µ − 2Σ⁻¹xn ]
         = NΣ⁻¹µ − Σ⁻¹ Σn xn
         = 0  ⇒  µ̂ = (1/N) Σn xn
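The final gradient expression can be sanity-checked against a finite difference in the scalar (D = 1) case, where Σ is a single variance s2 (the data values below are arbitrary):

```python
from math import log, pi

# Check ∂(−ℓ)/∂µ = NΣ⁻¹µ − Σ⁻¹Σₙxₙ numerically for a 1-D Gaussian.
data = [0.3, -1.2, 2.5, 0.7]
s2 = 2.0
N = len(data)

def neg_ll(mu):
    return sum(0.5 * log(2 * pi * s2) + (x - mu) ** 2 / (2 * s2) for x in data)

mu0 = 0.4
analytic = N * mu0 / s2 - sum(data) / s2          # the closed-form gradient

eps = 1e-6
numeric = (neg_ll(mu0 + eps) - neg_ll(mu0 - eps)) / (2 * eps)  # central difference
print(analytic, numeric)                          # both ≈ -0.35
```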
Gaussian Derivatives (cont)

For the covariance (differentiating with respect to Σ⁻¹):

∂(−ℓ)/∂Σ⁻¹ = ∂/∂Σ⁻¹ [ (N/2) log|2πΣ| + ½ Σn (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
           = ∂/∂Σ⁻¹ [ (N/2) log|2πI| − (N/2) log|Σ⁻¹| ] + ½ Σn ∂/∂Σ⁻¹ [ (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
           = −(N/2) Σᵀ + ½ Σn (xn − µ)(xn − µ)ᵀ
           = 0  ⇒  Σ̂ = (1/N) Σn (xn − µ)(xn − µ)ᵀ

Equivalences

[Figure: the 2-D scatter plot again, with the fitted Gaussian's correlation structure.]

modelling correlations
⇔ maximising the likelihood of a Gaussian model
⇔ minimising a squared error cost function
⇔ minimising the data coding cost in bits (assuming Gaussian distributed data)
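The Gaussian-likelihood/squared-error equivalence can be seen numerically: with a fixed variance, −ℓ and the sum of squared errors differ only by a constant, so a grid search minimises both at the same µ (the data here are arbitrary):

```python
from math import log, pi

# With σ² fixed, −ℓ(µ) = const + SSE(µ)/(2σ²), so both objectives share a minimiser.
data = [0.5, 1.5, 2.0, 4.0]
sigma2 = 1.0

def neg_log_lik(mu):
    return sum(0.5 * log(2 * pi * sigma2) + (x - mu) ** 2 / (2 * sigma2)
               for x in data)

def sse(mu):
    return sum((x - mu) ** 2 for x in data)

grid = [i / 100 for i in range(501)]     # µ candidates in [0, 5]
mu_nll = min(grid, key=neg_log_lik)
mu_sse = min(grid, key=sse)
print(mu_nll, mu_sse)                    # both 2.0, the sample mean
```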

Multivariate Linear Regression

The relationship between variables can also be modelled as a conditional distribution.

- Data D = {(x1, y1), . . . , (xN, yN)}
- Each xi (yi) is a vector of Dx (Dy) features.
- yi is conditionally independent of all else, given xi.

[Figure: a 2-D scatter plot with a fitted regression line.]

A simple form of supervised (predictive) learning: model y as a linear function of x, with Gaussian noise.

p(y|x, W, Σy) = |2πΣy|^(−1/2) exp(−½ (y − Wx)ᵀ Σy⁻¹ (y − Wx))

Multivariate Linear Regression – ML estimate

ML estimates are obtained by maximising the (conditional) likelihood, as before:

ℓ = Σi log p(yi|xi, W, Σy)
  = −(N/2) log|2πΣy| − ½ Σi (yi − Wxi)ᵀ Σy⁻¹ (yi − Wxi)

∂(−ℓ)/∂W = ∂/∂W [ (N/2) log|2πΣy| + ½ Σi (yi − Wxi)ᵀ Σy⁻¹ (yi − Wxi) ]
         = ½ Σi ∂/∂W [ (yi − Wxi)ᵀ Σy⁻¹ (yi − Wxi) ]
         = ½ Σi ∂/∂W [ yiᵀΣy⁻¹yi + xiᵀWᵀΣy⁻¹Wxi − 2xiᵀWᵀΣy⁻¹yi ]
         = ½ Σi ( ∂/∂W Tr[WᵀΣy⁻¹W xi xiᵀ] − 2 ∂/∂W Tr[WᵀΣy⁻¹ yi xiᵀ] )
         = ½ Σi ( 2Σy⁻¹W xi xiᵀ − 2Σy⁻¹ yi xiᵀ )
         = 0  ⇒  Ŵ = (Σi yi xiᵀ)(Σi xi xiᵀ)⁻¹
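In the scalar-input, scalar-output case the ML formula reduces to w = Σᵢ yᵢxᵢ / Σᵢ xᵢ². A minimal sketch with noise-free toy data y = 2x, so the estimate recovers the true weight exactly:

```python
# ML weight for scalar linear regression: w = Σ yᵢxᵢ / Σ xᵢ².
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]      # noise-free data with true weight 2

w_ml = sum(y * x for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(w_ml)   # → 2.0
```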
Multivariate Linear Regression – Posterior

Let yi be scalar (so that W is a row vector) and write w for the column vector of weights. A conjugate prior for w is

P(w|A) = N(0, A⁻¹)

Then the log posterior on w is

log P(w|D, A, σy) = log P(D|w, A, σy) + log P(w|A, σy) − log P(D|A, σy)
 = −½ wᵀAw − ½ Σi (yi − wᵀxi)² σy⁻² + const
 = −½ wᵀ(A + σy⁻² Σi xi xiᵀ)w + wᵀ Σi (yi xi) σy⁻² + const

Writing Σw⁻¹ = A + σy⁻² Σi xi xiᵀ, this becomes

 = −½ wᵀΣw⁻¹w + wᵀΣw⁻¹ [Σw Σi (yi xi) σy⁻²] + const
 = log N(µw, Σw),  with µw = Σw Σi (yi xi) σy⁻²

MAP and ML for linear regression

As the posterior is Gaussian, the MAP and posterior mean weights are the same:

w_MAP = (A + Σi xi xiᵀ / σy²)⁻¹ (Σi yi xi / σy²) = (A σy² + Σi xi xiᵀ)⁻¹ Σi yi xi

Compare this to the (transposed) ML weight vector for scalar outputs:

w_ML = Ŵᵀ = (Σi xi xiᵀ)⁻¹ Σi yi xi

- The prior acts to "inflate" the apparent covariance of the inputs.
- As A is positive (semi)definite, it shrinks the weights towards the prior mean (here 0).
- If A = αI this is known as the ridge regression estimator.
- The MAP/shrinkage/ridge weight estimate often has lower squared error (despite its bias) and makes more accurate predictions on test inputs than the ML estimate.
- This is an example of prior-based regularisation of estimates.
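With A = αI in the scalar case, the ridge/MAP estimate divides by ασy² + Σᵢ xᵢ² instead of Σᵢ xᵢ², shrinking the ML weight towards 0. A sketch with noise-free toy data y = 2x (the choice α = 1 is arbitrary, for illustration):

```python
# Scalar ridge/MAP estimate: w_map = Σ yᵢxᵢ / (α σ² + Σ xᵢ²).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]      # true weight 2
sigma2 = 1.0
alpha = 1.0                     # prior precision A = αI (here α = 1)

sum_yx = sum(y * x for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)

w_ml = sum_yx / sum_xx
w_map = sum_yx / (alpha * sigma2 + sum_xx)

print(w_ml, w_map)              # w_map is shrunk: it lies between 0 and w_ml
```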

Gaussians for Regression

- Models the conditional P(y|x).
- If we also model P(x), then learning is indistinguishable from unsupervised learning. In particular, if P(x) is Gaussian and P(y|x) is linear-Gaussian, then x and y are jointly Gaussian.
- Generalised Linear Models (GLMs) generalise to non-Gaussian, exponential-family distributions and to non-linear link functions:
  yi ∼ ExpFam(µi, φ),  g(µi) = wᵀxi
  Posterior, or even ML, estimation is not possible in closed form ⇒ iterative methods such as gradient ascent or iteratively re-weighted least squares (IRLS).
  A warning to fMRIers: SPM uses GLM for the "general" (not -ised) linear model, which is just linear.
- These models (Gaussians, linear-Gaussian regression and GLMs) are important building blocks for the more sophisticated models we will develop later.
- Gaussian models are also used for regression in Gaussian Process models. We'll see these later too.

Three limitations of the multivariate Gaussian model

- What about higher order statistical structure in the data?
  ⇒ nonlinear and hierarchical models
- What happens if there are outliers?
  ⇒ other noise models
- There are D(D + 1)/2 parameters in the multivariate Gaussian model. What if D is very large?
  ⇒ dimensionality reduction
End Notes

- It is very important that you understand all the material in the following cribsheet:
  http://www.gatsby.ucl.ac.uk/teaching/courses/ml1/cribsheet.pdf
- The following notes by (the late) Sam Roweis are quite useful:
  - Matrix identities and matrix derivatives:
    http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf
  - Gaussian identities:
    http://www.cs.nyu.edu/~roweis/notes/gaussid.pdf
- Here is a useful statistics / pattern recognition glossary:
  http://alumni.media.mit.edu/~tpminka/statlearn/glossary/
- Tom Minka's in-depth notes on matrix algebra:
  http://research.microsoft.com/en-us/um/people/minka/papers/matrix/
