ML, MAP, and Bayesian Estimation
CE-717: Machine Learning
Sharif University of Technology
Fall 2016
Soleymani
Outline
Introduction
Maximum-Likelihood (ML) estimation
Maximum A Posteriori (MAP) estimation
Bayesian inference
Relation of learning & statistics
The target model in a learning problem can be considered as a statistical model
Density estimation
Estimating the probability density function $p(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
Parametric density estimation
Goal: estimate parameters of a distribution from a dataset
$\mathcal{D} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$
𝒟 contains 𝑁 independent, identically distributed (i.i.d.) training
samples.
Example
$P(x|\mu) = \mathcal{N}(x|\mu, 1)$
Maximum Likelihood Estimation (MLE)
Maximum-likelihood estimation (MLE) is a method of estimating the
parameters of a statistical model given data.
$p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N} p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta})$
Maximum Likelihood Estimation (MLE)
$\mathcal{L}(\boldsymbol{\theta}) = \ln p(\mathcal{D}|\boldsymbol{\theta}) = \ln \prod_{i=1}^{N} p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta})$
Thus, we solve $\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$ to find the optimum (a stationary point of the log-likelihood).
MLE
Bernoulli
Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N-m$ tails (0)
$p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
$p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta) = \prod_{i=1}^{N}\theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}$
$\ln p(\mathcal{D}|\theta) = \sum_{i=1}^{N}\ln p(x^{(i)}|\theta) = \sum_{i=1}^{N}\left\{x^{(i)}\ln\theta + (1-x^{(i)})\ln(1-\theta)\right\}$
$\dfrac{\partial \ln p(\mathcal{D}|\theta)}{\partial\theta} = 0 \;\Rightarrow\; \theta_{ML} = \dfrac{\sum_{i=1}^{N}x^{(i)}}{N} = \dfrac{m}{N}$
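As a numerical check, a minimal Python sketch (NumPy is not part of the slides; the array `tosses` is hypothetical data) computing the Bernoulli MLE as the fraction of heads:

import numpy as np

# hypothetical coin-toss data: 1 = heads, 0 = tails
tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0])
N = len(tosses)
m = tosses.sum()       # number of heads
theta_ml = m / N       # closed-form MLE: theta_ML = m / N
print(theta_ml)        # 0.625 for this sample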
MLE
Bernoulli: example
Example: $\mathcal{D} = \{1,1,1\}$, $\theta_{ML} = \frac{3}{3} = 1$
Prediction: all future tosses will land heads up
Overfitting to 𝒟
MLE: Multinomial distribution
Multinomial distribution (on a variable with $K$ states):
Parameter space: $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_K]$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K}\theta_k = 1$
$P(\boldsymbol{x}|\boldsymbol{\theta}) = \prod_{k=1}^{K}\theta_k^{x_k}$, so that $P(x_k = 1) = \theta_k$
One-hot encoding: $\boldsymbol{x} = [x_1, \ldots, x_K]$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K}x_k = 1$
(Figure: the parameter simplex over $\theta_1, \theta_2, \theta_3$ for $K = 3$.)
MLE: Multinomial distribution
$\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(N)}\}$
$P(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N}P(\boldsymbol{x}^{(i)}|\boldsymbol{\theta}) = \prod_{i=1}^{N}\prod_{k=1}^{K}\theta_k^{x_k^{(i)}} = \prod_{k=1}^{K}\theta_k^{\sum_{i=1}^{N}x_k^{(i)}}$
$N_k = \sum_{i=1}^{N}x_k^{(i)}$, $\quad\sum_{k=1}^{K}N_k = N$
Maximizing subject to $\sum_{k}\theta_k = 1$ (Lagrange multiplier $\lambda$): $\mathcal{L}(\boldsymbol{\theta},\lambda) = \ln p(\mathcal{D}|\boldsymbol{\theta}) + \lambda\left(1 - \sum_{k=1}^{K}\theta_k\right)$
$\theta_k^{ML} = \dfrac{\sum_{i=1}^{N}x_k^{(i)}}{N} = \dfrac{N_k}{N}$
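A minimal Python sketch of this result (the one-hot array `X` is hypothetical data): the multinomial MLE is just each state's count divided by the total number of samples.

import numpy as np

# hypothetical one-hot samples over K = 3 states
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
N_k = X.sum(axis=0)          # per-state counts N_k
theta_ml = N_k / X.shape[0]  # theta_k = N_k / N
print(theta_ml)              # [0.5  0.25 0.25]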
MLE
Gaussian: unknown 𝜇
$\ln p(x^{(i)}|\mu) = -\ln\left(\sqrt{2\pi}\,\sigma\right) - \dfrac{1}{2\sigma^2}\left(x^{(i)}-\mu\right)^2$
$\dfrac{\partial\mathcal{L}(\mu)}{\partial\mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\dfrac{\partial}{\partial\mu}\ln p(x^{(i)}|\mu) = 0$
$\Rightarrow\; \dfrac{1}{\sigma^2}\sum_{i=1}^{N}\left(x^{(i)}-\mu\right) = 0 \;\Rightarrow\; \mu_{ML} = \dfrac{1}{N}\sum_{i=1}^{N}x^{(i)}$
MLE
Gaussian: unknown 𝜇 and 𝜎
$\boldsymbol{\theta} = [\mu, \sigma]$, solve $\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$:
$\dfrac{\partial\mathcal{L}(\mu,\sigma)}{\partial\mu} = 0 \;\Rightarrow\; \mu_{ML} = \dfrac{1}{N}\sum_{i=1}^{N}x^{(i)}$
$\dfrac{\partial\mathcal{L}(\mu,\sigma)}{\partial\sigma} = 0 \;\Rightarrow\; \sigma_{ML}^2 = \dfrac{1}{N}\sum_{i=1}^{N}\left(x^{(i)}-\mu_{ML}\right)^2$
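A minimal Python sketch of these closed-form estimates (the array `x` is hypothetical data); note that $\sigma_{ML}^2$ divides by $N$, not $N-1$:

import numpy as np

x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])   # hypothetical samples
mu_ml = x.mean()                           # mu_ML = (1/N) * sum of x_i
sigma2_ml = np.mean((x - mu_ml) ** 2)      # sigma^2_ML, biased estimate (divides by N)
# equivalently: np.var(x), whose default ddof=0 also divides by N
print(mu_ml, sigma2_ml)                    # 2.0 and ~0.092 for this sample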
Maximum A Posteriori (MAP) estimation
MAP estimation: $\boldsymbol{\theta}_{MAP} = \underset{\boldsymbol{\theta}}{\mathrm{argmax}}\; p(\boldsymbol{\theta}|\mathcal{D})$
MAP estimation
Gaussian: unknown 𝜇
$p(x|\mu) \sim \mathcal{N}(\mu, \sigma^2)$   ($\mu$ is the only unknown parameter)
$p(\mu|\mu_0) \sim \mathcal{N}(\mu_0, \sigma_0^2)$   ($\mu_0$ and $\sigma_0$ are known)
$\dfrac{d}{d\mu}\ln\left[p(\mu)\prod_{i=1}^{N}p(x^{(i)}|\mu)\right] = 0$
$\Rightarrow\; \dfrac{1}{\sigma^2}\sum_{i=1}^{N}\left(x^{(i)}-\mu\right) - \dfrac{1}{\sigma_0^2}\left(\mu-\mu_0\right) = 0$
$\Rightarrow\; \mu_{MAP} = \dfrac{\mu_0 + \frac{\sigma_0^2}{\sigma^2}\sum_{i=1}^{N}x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2}N}$
$\dfrac{\sigma_0^2}{\sigma^2} \gg 1$ or $N \to \infty \;\Rightarrow\; \mu_{MAP} \approx \mu_{ML} = \dfrac{\sum_{i=1}^{N}x^{(i)}}{N}$
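A minimal Python sketch of the closed-form $\mu_{MAP}$ above (the data array and the values of $\sigma$, $\mu_0$, $\sigma_0$ are hypothetical); as $N$ grows, $\mu_{MAP}$ approaches $\mu_{ML}$:

import numpy as np

x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])  # hypothetical samples
sigma, mu0, sigma0 = 1.0, 0.0, 2.0       # assumed known noise std and prior N(mu0, sigma0^2)
N = len(x)
r = sigma0**2 / sigma**2
mu_map = (mu0 + r * x.sum()) / (1 + r * N)   # formula derived above
mu_ml = x.mean()
print(mu_map, mu_ml)   # mu_map (~1.905) is pulled slightly toward mu0; mu_ml = 2.0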
Maximum A Posteriori (MAP) estimation
Given a set of observations $\mathcal{D}$ and a prior distribution $p(\boldsymbol{\theta})$ on the parameters, the parameter vector that maximizes $p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})$ is found.
For the Gaussian case above, the resulting estimate (the posterior mean $\mu_N$) can equivalently be written as a weighted combination of the prior mean and the ML estimate:
$\mu_N = \dfrac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \dfrac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}$
MAP estimation
Gaussian: unknown 𝜇 (known 𝜎)
$p(\mu|\mathcal{D}) \propto p(\mu)\,p(\mathcal{D}|\mu)$
$p(\mu|\mathcal{D}) = \mathcal{N}(\mu\,|\,\mu_N, \sigma_N^2)$
$\mu_N = \dfrac{\mu_0 + \frac{\sigma_0^2}{\sigma^2}\sum_{i=1}^{N}x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2}N}$
$\dfrac{1}{\sigma_N^2} = \dfrac{1}{\sigma_0^2} + \dfrac{N}{\sigma^2}$
(Figure from [Bishop]: the posterior $p(\mu|\mathcal{D})$ for increasing $N$, starting from the prior $p(\mu)$.)
More samples ⟹ sharper $p(\mu|\mathcal{D})$, i.e., higher confidence in the estimate
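A minimal Python sketch of this full posterior over $\mu$ (same hypothetical $\sigma$, $\mu_0$, $\sigma_0$ as in the earlier snippet; the data are simulated), showing that $\sigma_N^2$ shrinks roughly like $\sigma^2/N$ as $N$ grows:

import numpy as np

sigma, mu0, sigma0 = 1.0, 0.0, 2.0   # assumed known likelihood std and prior N(mu0, sigma0^2)
for N in [1, 10, 100]:
    x = np.random.default_rng(0).normal(2.0, sigma, size=N)   # hypothetical data with true mean 2.0
    r = sigma0**2 / sigma**2
    mu_N = (mu0 + r * x.sum()) / (1 + r * N)
    var_N = 1.0 / (1.0 / sigma0**2 + N / sigma**2)   # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    print(N, mu_N, var_N)   # posterior narrows around the true mean as N increases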
Conjugate Priors
We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties: a prior is conjugate to a likelihood when the resulting posterior has the same functional form as the prior.
Prior for Bernoulli Likelihood
Beta distribution over $\theta \in [0,1]$:
$\mathrm{Beta}(\theta|\alpha_1,\alpha_0) \propto \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$
$\mathrm{Beta}(\theta|\alpha_1,\alpha_0) = \dfrac{\Gamma(\alpha_0+\alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)}\,\theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$
Mean: $E[\theta] = \dfrac{\alpha_1}{\alpha_0+\alpha_1}$
Mode (most probable $\theta$): $\theta = \dfrac{\alpha_1-1}{(\alpha_1-1)+(\alpha_0-1)}$
Beta distribution
(Figure: the Beta density for different values of $\alpha_1$ and $\alpha_0$.)
Bernoulli likelihood: posterior
Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m = \sum_{i=1}^{N}x^{(i)}$ heads (1) and $N-m$ tails (0)
$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta) = \left[\prod_{i=1}^{N}\theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}\right]\mathrm{Beta}(\theta|\alpha_1,\alpha_0) \propto \theta^{m+\alpha_1-1}(1-\theta)^{N-m+\alpha_0-1}$
$\Rightarrow\; p(\theta|\mathcal{D}) \propto \mathrm{Beta}(\theta|\alpha_1',\alpha_0')$ with $\alpha_1' = \alpha_1 + m$ and $\alpha_0' = \alpha_0 + N - m$
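A minimal Python sketch of this conjugate update (the data and prior pseudo-counts are the slide's toy values, treated here as hypothetical inputs): the posterior is again a Beta, with the observed counts simply added to the prior pseudo-counts.

import numpy as np

tosses = np.array([1, 1, 1])       # hypothetical data D = {1,1,1}
alpha1, alpha0 = 2, 2              # prior Beta(theta | alpha1, alpha0)
N, m = len(tosses), tosses.sum()
alpha1_post = alpha1 + m           # alpha1' = alpha1 + m
alpha0_post = alpha0 + N - m       # alpha0' = alpha0 + N - m
theta_map = (alpha1_post - 1) / (alpha1_post - 1 + alpha0_post - 1)   # mode of the posterior
print(alpha1_post, alpha0_post, theta_map)   # 5, 2, 0.8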
Example
Bernoulli likelihood: $p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
Prior: $\mathrm{Beta}(\theta|\alpha_1,\alpha_0)$ with $\alpha_0 = \alpha_1 = 2$
Given $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N-m$ tails (0); here $\mathcal{D} = \{1,1,1\} \Rightarrow N = 3$, $m = 3$
Posterior: $\mathrm{Beta}(\theta|\alpha_1',\alpha_0')$ with $\alpha_1' = 5$, $\alpha_0' = 2$
$\theta_{MAP} = \underset{\theta}{\mathrm{argmax}}\; P(\theta|\mathcal{D}) = \dfrac{\alpha_1'-1}{(\alpha_1'-1)+(\alpha_0'-1)} = \dfrac{4}{5}$
(Figure: prior, Bernoulli likelihood $p(x{=}1|\theta)$, and posterior, each plotted against $\theta$.)
Toss example
MAP estimation can avoid overfitting
$\mathcal{D} = \{1,1,1\}$, $\theta_{ML} = 1$
$\theta_{MAP} = 0.8$ (with prior $p(\theta) = \mathrm{Beta}(\theta|2,2)$)
Bayesian inference
Parameters $\boldsymbol{\theta}$ are treated as random variables with an a priori (prior) distribution
Bayesian estimation utilizes the available prior information about the
unknown parameter
As opposed to ML and MAP estimation, it does not seek a specific point
estimate of the unknown parameter vector 𝜽
Bayesian estimation: predictive distribution
Given a set of samples $\mathcal{D} = \{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$, a prior distribution $P(\boldsymbol{\theta})$ on the parameters, and the form of the distribution $P(\boldsymbol{x}|\boldsymbol{\theta})$, the predictive distribution averages over the full posterior:
$P(\boldsymbol{x}|\mathcal{D}) = \int P(\boldsymbol{x}|\boldsymbol{\theta})\,P(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta}$
Bernoulli likelihood: prediction
Training samples: $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ with $m$ heads
$P(\theta) = \mathrm{Beta}(\theta|\alpha_1,\alpha_0) \propto \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$
$P(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\alpha_1+m, \alpha_0+N-m) \propto \theta^{\alpha_1+m-1}(1-\theta)^{\alpha_0+(N-m)-1}$
$P(x|\mathcal{D}) = E_{P(\theta|\mathcal{D})}\left[P(x|\theta)\right]$
$\Rightarrow\; P(x=1|\mathcal{D}) = E_{P(\theta|\mathcal{D})}\left[\theta\right] = \dfrac{\alpha_1+m}{\alpha_0+\alpha_1+N}$
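A minimal Python sketch of this Bayesian predictive probability, compared against the ML and MAP plug-in predictions (the data and prior pseudo-counts are the same hypothetical toy values as above):

import numpy as np

tosses = np.array([1, 1, 1])            # hypothetical data
alpha1, alpha0 = 2, 2                   # prior pseudo-counts
N, m = len(tosses), tosses.sum()
p_bayes = (alpha1 + m) / (alpha0 + alpha1 + N)          # P(x=1 | D) = E[theta | D]
p_ml    = m / N                                         # plug-in with theta_ML
p_map   = (alpha1 + m - 1) / (alpha0 + alpha1 + N - 2)  # plug-in with theta_MAP (posterior mode)
print(p_bayes, p_ml, p_map)             # ~0.714, 1.0, 0.8 for this toy example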
ML, MAP, and Bayesian Estimation
If $p(\boldsymbol{\theta}|\mathcal{D})$ has a sharp peak at $\boldsymbol{\theta} = \widehat{\boldsymbol{\theta}}$ (i.e., $p(\boldsymbol{\theta}|\mathcal{D}) \approx \delta(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}})$), then $p(\boldsymbol{x}|\mathcal{D}) \approx p(\boldsymbol{x}|\widehat{\boldsymbol{\theta}})$.
In this case, Bayesian estimation will be approximately equal to MAP estimation.
If $p(\mathcal{D}|\boldsymbol{\theta})$ is concentrated around a sharp peak and $p(\boldsymbol{\theta})$ is broad enough around this peak, then ML, MAP, and Bayesian estimation yield approximately the same result.
Summary
ML and MAP result in a single (point) estimate of the unknown parameter vector.
Simpler and more interpretable than Bayesian estimation