Introduction To Probabilistic Learning
Maneesh Sahani
[email protected]
▶ Reinforcement learning. Rewards or payoffs (and possibly also inputs) depend on actions:

  x1 : a1 → r1 , x2 : a2 → r2 , x3 : a3 → r3 , . . .

  Find a policy for action choice that maximises payoff.

Unsupervised learning has many uses:
▶ structure discovery, science
▶ data compression
▶ outlier detection
▶ input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards)
▶ a theory of biological learning and perception
Supervised learning

Two main examples:

▶ Classification: discrete (class label) outputs.
▶ Regression: continuous-valued outputs.

[Figure: left panel — classification, two classes of points marked x and o in the input plane; right panel — regression, y plotted against x with a fitted curve.]

A probabilistic approach

Data are generated by random and/or unknown processes. Our approach to learning starts with a probabilistic model of data production:

P(data | parameters) : P(x|θ) or P(y|x, θ)

▶ The probabilistic model can be used to
  ▶ make inferences about missing inputs
  ▶ generate predictions/fantasies/imagery
  ▶ make predictions or decisions which minimise expected loss
Bayesian learning: A coin toss example

The probability of heads is an unknown parameter q: p(H|q) = q.

[Figure: two candidate prior densities P(q) over q ∈ [0, 1]. A: α1 = α2 = 1.0; B: α1 = α2 = 4.0.]

Both prior beliefs can be described by the Beta distribution:

p(q|α1, α2) = q^(α1−1) (1 − q)^(α2−1) / B(α1, α2) = Beta(q|α1, α2)

Bayesian learning: The coin toss (cont)

Using Bayes' rule, we multiply the prior p(q) by the likelihood and renormalise to get the posterior probability:

p(q|H) = p(q) p(H|q) / p(H)
       ∝ q Beta(q|α1, α2)
       ∝ q^(α1) (1 − q)^(α2−1) = Beta(q|α1 + 1, α2)

[Figure: the corresponding posteriors P(q) after observing a single head, for priors A and B.]
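This conjugate update is easy to check numerically. A minimal sketch using scipy.stats (scipy is an assumption, not part of the notes; the prior parameters are those of example B):

    # Posterior after a single head: Beta(a1, a2) -> Beta(a1 + 1, a2).
    from scipy.stats import beta

    a1, a2 = 4.0, 4.0               # prior B: Beta(q | 4, 4)
    post1, post2 = a1 + 1.0, a2     # observing H multiplies by q, incrementing a1

    print(beta.mean(a1, a2))        # prior mean E[q] = 0.5
    print(beta.mean(post1, post2))  # posterior mean E[q|H] = 5/9 ≈ 0.556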
Bayesian learning: The coin toss (cont)

What about multiple tosses? Suppose we observe D = { H H T H T T }:

p({ H H T H T T }|q) = q q (1 − q) q (1 − q)(1 − q) = q^3 (1 − q)^3

This is still straightforward:

p(q|D) = p(q) p(D|q) / p(D) ∝ q^3 (1 − q)^3 Beta(q|α1, α2) ∝ Beta(q|α1 + 3, α2 + 3)

[Figure: priors A (α1 = α2 = 1.0) and B (α1 = α2 = 4.0) in the top row, with the corresponding posteriors after observing D in the bottom row.]
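Applied toss by toss, the same update gives this posterior in a single pass over the data. A minimal sketch in plain Python (the function name is illustrative):

    # Each H increments alpha1, each T increments alpha2; the toss order is irrelevant.
    def beta_posterior(a1, a2, tosses):
        for t in tosses:
            if t == 'H':
                a1 += 1.0
            else:
                a2 += 1.0
        return a1, a2

    print(beta_posterior(1.0, 1.0, 'HHTHTT'))   # prior A -> Beta(4, 4)
    print(beta_posterior(4.0, 4.0, 'HHTHTT'))   # prior B -> Beta(7, 7)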
Conjugate priors

Updating the prior to form the posterior was particularly easy in these examples. This is because we used a conjugate prior for an exponential family likelihood. Exponential family distributions take the form:

P(x|θ) = f(x) g(θ) exp[ φ(θ)ᵀ T(x) ]

with T(x) the sufficient statistic; the corresponding conjugate prior is

P(θ) = F(τ, ν) g(θ)^ν exp[ φ(θ)ᵀ τ ]

The posterior given an exponential family likelihood and conjugate prior is:

P(θ|{xi}) = F( τ + ∑i T(xi), ν + n ) g(θ)^(ν+n) exp[ φ(θ)ᵀ ( τ + ∑i T(xi) ) ]

For the coin, p(x|q) = q^x (1 − q)^(1−x) = (1 − q) e^(x log(q/(1−q))), so g(q) = 1 − q, φ(q) = log(q/(1−q)) and T(x) = x. The conjugate prior is

P(q) = F(τ, ν) (1 − q)^ν e^(τ log(q/(1−q)))
     = F(τ, ν) (1 − q)^ν e^(τ log q − τ log(1−q))
     = F(τ, ν) (1 − q)^(ν−τ) q^τ

which has the form of the Beta distribution ⇒ F(τ, ν) = 1/B(τ + 1, ν − τ + 1).

In general, then, the posterior will be P(q|{xi}) = Beta(α1, α2), with

α1 = 1 + τ + ∑i xi        α2 = 1 + (ν + n) − (τ + ∑i xi)

If we observe a head, we add 1 to the sufficient statistic ∑i xi, and also 1 to the count n; this increments α1. If we observe a tail we add 1 to n, but not to ∑i xi, incrementing α2.
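In code, conjugate updating is just accumulation of the sufficient statistics. A minimal sketch in the (τ, ν) parameterisation above, run on the six tosses D = { H H T H T T } from earlier (function names are illustrative):

    # tau accumulates the sufficient statistic sum(x_i); nu counts the observations n.
    def update(tau, nu, xs):
        for x in xs:                # x = 1 for heads, 0 for tails
            tau += x
            nu += 1
        return tau, nu

    tau, nu = update(0.0, 0, [1, 1, 0, 1, 0, 0])   # tau = nu = 0 is the uniform prior A
    print(1 + tau, 1 + nu - tau)                   # Beta(4, 4), matching Beta(a1+3, a2+3)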
Bayesian learning: The coin toss (cont)

[Figure: two prior densities over the parameter q for competing models of the coin — p(q|fair) and p(q|bent).]

We make 10 tosses, and get: D = (T H T H T T T T T T). Thus, the posterior for the models, by Bayes' rule:

P(model|D) ∝ P(D|model) P(model)

Maximum a posteriori (MAP) learning instead reports the single most probable parameter value:

θMAP = argmaxθ P(θ|D) = argmaxθ P(θ) P(D|θ)

▶ Equivalent to minimising the 0/1 loss.
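As a concrete instance (a worked step, not from the slides): the mode of Beta(q|α1, α2) for α1, α2 > 1 is (α1 − 1)/(α1 + α2 − 2), so with a uniform prior the ten tosses above (2 heads, 8 tails) give a posterior Beta(3, 9) and

qMAP = (3 − 1)/(3 + 9 − 2) = 0.2

which here coincides with the maximum likelihood estimate 2/10, as it must under a flat prior.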
A simple form of unsupervised (structure) learning: model the mean of the data and the correlations between the D features in the data. We can use a multivariate Gaussian model:

p(x|µ, Σ) = N(µ, Σ) = |2πΣ|^(−1/2) exp[ −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ]

[Figure: scatter of two-dimensional data points xi = [xi1 . . . xiD], plotted as xi2 against xi1.]

▶ Assume data are i.i.d. (independent and identically distributed). Then the log likelihood is

ℓ = log ∏n p(xn|µ, Σ) = ∑n log p(xn|µ, Σ) = −(N/2) log |2πΣ| − ½ ∑n (xn − µ)ᵀ Σ⁻¹ (xn − µ)

Note: equivalently, minimise −ℓ, which is quadratic in µ.

Procedure: take derivatives and set to zero:

∂ℓ/∂µ = 0 ⇒ µ̂ = (1/N) ∑n xn (sample mean)
∂ℓ/∂Σ = 0 ⇒ Σ̂ = (1/N) ∑n (xn − µ̂)(xn − µ̂)ᵀ (sample covariance)
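A minimal numpy sketch of these estimates on synthetic data (numpy is an assumption; note np.cov divides by N − 1 unless bias=True):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))           # N = 1000 i.i.d. points, D = 2

    mu_hat = X.mean(axis=0)                  # sample mean
    Xc = X - mu_hat
    Sigma_hat = Xc.T @ Xc / len(X)           # sample covariance, 1/N normalisation

    assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))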
The derivatives can be worked out explicitly. For µ:

½ ∑n ∂/∂µ [ (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
  = ½ ∑n ∂/∂µ [ xnᵀ Σ⁻¹ xn + µᵀ Σ⁻¹ µ − 2 µᵀ Σ⁻¹ xn ]
  = ½ ∑n [ ∂/∂µ (µᵀ Σ⁻¹ µ) − 2 ∂/∂µ (µᵀ Σ⁻¹ xn) ]
  = ½ ∑n [ 2 Σ⁻¹ µ − 2 Σ⁻¹ xn ]
  = N Σ⁻¹ µ − Σ⁻¹ ∑n xn = 0  ⇒  µ̂ = (1/N) ∑n xn

For Σ it is easier to differentiate with respect to Σ⁻¹, which requires a few matrix derivatives:

∂/∂Aij Tr[AᵀB] = ∂/∂Aij ∑n [AᵀB]nn = ∂/∂Aij ∑n ∑m Amn Bmn = Bij
  ⇒ ∂/∂A Tr[AᵀB] = B

∂/∂A Tr[AᵀBAC]: write Tr[AᵀBAC] = Tr[F1(A)ᵀ B F2(A) C] with F1 and F2 both identity maps. Then, by the chain rule and the identity above (using Tr[F1ᵀ B F2 C] = Tr[F2ᵀ Bᵀ F1 Cᵀ]),

∂/∂A Tr[F1ᵀ B F2 C] = ∂/∂F1 Tr[F1ᵀ (B F2 C)] · ∂F1/∂A + ∂/∂F2 Tr[F2ᵀ (Bᵀ F1 Cᵀ)] · ∂F2/∂A
  = B F2 C + Bᵀ F1 Cᵀ = BAC + Bᵀ A Cᵀ

∂/∂Aij log |A| = (1/|A|) ∂|A|/∂Aij = (1/|A|) ∂/∂Aij ∑k (−1)^(i+k) Aik |[A]ik| = (1/|A|) (−1)^(i+j) |[A]ij| = [A⁻¹]ji
  ⇒ ∂/∂A log |A| = (A⁻¹)ᵀ

where |[A]ik| is the minor of A obtained by deleting row i and column k.
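Each identity can be sanity-checked by finite differences; a minimal numpy sketch for the first one:

    import numpy as np

    rng = np.random.default_rng(1)
    A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))

    # Numerical gradient of f(A) = Tr(A^T B) with respect to each entry A_ij.
    eps = 1e-6
    G = np.zeros_like(A)
    for i in range(3):
        for j in range(3):
            Ap = A.copy()
            Ap[i, j] += eps
            G[i, j] = (np.trace(Ap.T @ B) - np.trace(A.T @ B)) / eps

    assert np.allclose(G, B, atol=1e-4)      # matches d/dA Tr[A^T B] = B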
Gaussian Derivatives

∂(−ℓ)/∂Σ⁻¹ = ∂/∂Σ⁻¹ [ (N/2) log |2πΣ| + ½ ∑n (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
  = ∂/∂Σ⁻¹ [ (N/2) log |2πI| − (N/2) log |Σ⁻¹| ] + ½ ∑n ∂/∂Σ⁻¹ [ (xn − µ)ᵀ Σ⁻¹ (xn − µ) ]
  = −(N/2) Σᵀ + ½ ∑n (xn − µ)(xn − µ)ᵀ

(using ∂/∂A log |A| = (A⁻¹)ᵀ and ∂/∂A (xᵀAx) = xxᵀ). Setting this to zero, with Σ symmetric and µ = µ̂, gives Σ̂ = (1/N) ∑n (xn − µ̂)(xn − µ̂)ᵀ.

Linear regression

A simple form of supervised (predictive) learning: model y as a linear function of x, with Gaussian noise:

p(y|x, W, Σy) = |2πΣy|^(−1/2) exp[ −½ (y − Wx)ᵀ Σy⁻¹ (y − Wx) ]

[Figure: scatter of (xi1, yi) pairs with the fitted linear relationship.]

▶ yi is conditionally independent of all else, given xi.

Differentiating −ℓ with respect to W:

∂(−ℓ)/∂W = ½ ∑i ∂/∂W [ (yi − Wxi)ᵀ Σy⁻¹ (yi − Wxi) ]
  = ½ ∑i ∂/∂W [ yiᵀ Σy⁻¹ yi + xiᵀ Wᵀ Σy⁻¹ W xi − 2 xiᵀ Wᵀ Σy⁻¹ yi ]
  = ½ ∑i [ ∂/∂W Tr[Wᵀ Σy⁻¹ W xi xiᵀ] − 2 ∂/∂W Tr[Wᵀ Σy⁻¹ yi xiᵀ] ]
  = ½ ∑i [ 2 Σy⁻¹ W xi xiᵀ − 2 Σy⁻¹ yi xiᵀ ]
  = 0  ⇒  Ŵ = ( ∑i yi xiᵀ ) ( ∑i xi xiᵀ )⁻¹
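A minimal numpy sketch of this ML solution on synthetic data (the true weights and noise level are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    N, D_in, D_out = 500, 3, 2
    W_true = rng.normal(size=(D_out, D_in))               # illustrative ground truth
    X = rng.normal(size=(N, D_in))
    Y = X @ W_true.T + 0.1 * rng.normal(size=(N, D_out))  # y_i = W x_i + Gaussian noise

    # W_hat = (sum_i y_i x_i^T)(sum_i x_i x_i^T)^{-1}; solve() avoids an explicit inverse.
    W_hat = np.linalg.solve(X.T @ X, X.T @ Y).T

    print(np.max(np.abs(W_hat - W_true)))                 # small for modest noise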
Multivariate Linear Regression – Posterior

Let yi be scalar (so that W is a row vector) and write w for the column vector of weights. A conjugate prior for w is

P(w|A) = N(0, A⁻¹)

Then the log posterior on w is

log P(w|D, A, σy) = log P(D|w, A, σy) + log P(w|A, σy) − log P(D|A, σy)
  = −½ wᵀ A w − ½ ∑i (yi − wᵀxi)² σy^(−2) + const
  = −½ wᵀ ( A + σy^(−2) ∑i xi xiᵀ ) w + wᵀ ∑i (yi xi) σy^(−2) + const      [write Σw⁻¹ = A + σy^(−2) ∑i xi xiᵀ]
  = −½ wᵀ Σw⁻¹ w + wᵀ Σw⁻¹ Σw ∑i (yi xi) σy^(−2) + const                   [write µw = Σw ∑i (yi xi) σy^(−2)]
  = log N( Σw ∑i (yi xi) σy^(−2), Σw )

MAP and ML for linear regression

As the posterior is Gaussian, the MAP and posterior mean weights are the same:

wMAP = ( A + ∑i xi xiᵀ / σy² )⁻¹ ∑i yi xi / σy² = ( A σy² + ∑i xi xiᵀ )⁻¹ ∑i yi xi

Compare this to the (transposed) ML weight vector for scalar outputs:

wML = Ŵᵀ = ( ∑i xi xiᵀ )⁻¹ ∑i yi xi

▶ The prior acts to “inflate” the apparent covariance of the inputs.
▶ As A is positive (semi)definite, it shrinks the weights towards the prior mean (here 0).
▶ If A = αI this is known as the ridge regression estimator.
▶ The MAP/shrinkage/ridge weight estimate often has lower squared error (despite bias) and makes more accurate predictions on test inputs than the ML estimate.
▶ An example of prior-based regularisation of estimates.
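A minimal numpy sketch comparing the two estimators, with A = αI (the data sizes, α, and σy are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    N, D = 20, 10                                  # few points relative to the dimension
    w_true = rng.normal(size=D)
    X = rng.normal(size=(N, D))
    y = X @ w_true + 0.5 * rng.normal(size=N)      # sigma_y = 0.5

    alpha, sig2 = 1.0, 0.25                        # A = alpha * I, sigma_y^2
    S = X.T @ X                                    # sum_i x_i x_i^T
    w_ml = np.linalg.solve(S, X.T @ y)             # ML weights
    w_map = np.linalg.solve(alpha * sig2 * np.eye(D) + S, X.T @ y)   # ridge/MAP weights

    print(np.sum((w_ml - w_true) ** 2))            # squared error of ML
    print(np.sum((w_map - w_true) ** 2))           # typically smaller for the MAP estimate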
▶ It is very important that you understand all the material in the following cribsheet:
  http://www.gatsby.ucl.ac.uk/teaching/courses/ml1/cribsheet.pdf
▶ The following notes by (the late) Sam Roweis are quite useful:
  ▶ Matrix identities and matrix derivatives:
    http://www.cs.nyu.edu/~roweis/notes/matrixid.pdf
  ▶ Gaussian identities:
    http://www.cs.nyu.edu/~roweis/notes/gaussid.pdf