
ML, MAP Estimation, and Bayesian Inference
CE-717: Machine Learning
Sharif University of Technology
Fall 2016

Soleymani
Outline
 Introduction
 Maximum-Likelihood (ML) estimation
 Maximum A Posteriori (MAP) estimation
 Bayesian inference

Relation of learning & statistics
 The target model in a learning problem can be considered as a statistical model

 For a fixed set of data and an underlying target (statistical model), the estimation methods try to estimate the target from the available data

Density estimation
 Estimating the probability density function 𝑝(𝒙), given a set of data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$ drawn from it.

 Main approaches of density estimation:

  Parametric: assuming a parameterized model for the density function
   A number of parameters are optimized by fitting the model to the data set

  Nonparametric (instance-based): no specific parametric model is assumed
   The form of the density function is determined entirely by the data
Parametric density estimation
 Estimating the probability density function 𝑝(𝒙), given a set of data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$ drawn from it.

 Assume that 𝑝(𝒙) takes a specific functional form which has a number of adjustable parameters.

 Methods for parameter estimation


 Maximum likelihood estimation
 Maximum A Posteriori (MAP) estimation

Parametric density estimation
 Goal: estimate the parameters of a distribution from a dataset $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$
 𝒟 contains 𝑁 independent, identically distributed (i.i.d.) training
samples.

 We need to determine 𝜽 given $\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$


 How to represent 𝜽?
  A point estimate 𝜽∗ or a distribution 𝑝(𝜽)?
Example

$P(x|\mu) = \mathcal{N}(x|\mu, 1)$

[Figures omitted]
Maximum Likelihood Estimation (MLE)
 Maximum-likelihood estimation (MLE) is a method of estimating the
parameters of a statistical model given data.

 Likelihood is the conditional probability of the observations $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(N)}\}$ given the value of the parameters 𝜽
 Assuming i.i.d. observations:

$p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N} p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta})$

(the likelihood of 𝜽 w.r.t. the samples)

 Maximum likelihood estimation:

$\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}}\; p(\mathcal{D}|\boldsymbol{\theta})$
Maximum Likelihood Estimation (MLE)

The ML estimate is the 𝜃 that best agrees with the observed samples.

[Figures illustrating the likelihood of different candidate values of 𝜃 omitted]
Maximum Likelihood Estimation (MLE)
$\mathcal{L}(\boldsymbol{\theta}) = \ln p(\mathcal{D}|\boldsymbol{\theta}) = \ln \prod_{i=1}^{N} p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta})$

$\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}}\; \mathcal{L}(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}}\; \sum_{i=1}^{N} \ln p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta})$

 Thus, we solve $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$ to find the global optimum
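As a quick check of this recipe, the maximization can also be done numerically. The sketch below is not from the slides; the data, the unit variance, and the helper name are illustrative assumptions. It maximizes the Gaussian log-likelihood with unknown mean and compares the numerical optimum with the closed-form stationary point derived a few slides later (the sample mean).

```python
# A minimal sketch, assuming a Gaussian model N(mu, 1) and synthetic data:
# maximize L(mu) numerically and compare with the closed-form answer (the sample mean).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)   # i.i.d. samples x^(1), ..., x^(N)

def neg_log_likelihood(mu):
    # -L(mu) = -sum_i ln N(x^(i) | mu, 1), dropping constants that do not depend on mu
    return 0.5 * np.sum((data - mu) ** 2)

result = minimize_scalar(neg_log_likelihood)      # numerical arg max of L(mu)
print(result.x, data.mean())                      # the two values should agree closely
```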

MLE
Bernoulli
 Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, 𝑚 heads (1), 𝑁 − 𝑚 tails (0)

$p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$

$p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta) = \prod_{i=1}^{N} \theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}$

$\ln p(\mathcal{D}|\theta) = \sum_{i=1}^{N} \ln p(x^{(i)}|\theta) = \sum_{i=1}^{N} \left\{ x^{(i)} \ln\theta + (1 - x^{(i)}) \ln(1-\theta) \right\}$

$\frac{\partial \ln p(\mathcal{D}|\theta)}{\partial\theta} = 0 \;\Rightarrow\; \theta_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N} = \frac{m}{N}$
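A minimal numeric illustration of this result (the toss data below are made up for the example):

```python
# ML estimate of the Bernoulli parameter theta from coin-toss data: theta_ML = m / N.
import numpy as np

tosses = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # 1 = heads, 0 = tails (illustrative data)
m, N = tosses.sum(), tosses.size
theta_ml = m / N
print(theta_ml)                              # 0.75 for this sample
```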
MLE
Bernoulli: example
 Example: 𝒟 = {1, 1, 1}, $\theta_{ML} = \frac{3}{3} = 1$
 Prediction: all future tosses will land heads up

 Overfitting to 𝒟

MLE: Multinomial distribution
 Multinomial distribution (on a variable with 𝐾 states):

$P(\boldsymbol{x}|\boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}, \qquad P(x_k = 1) = \theta_k$

Parameter space: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$, with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$

$\boldsymbol{x} = (x_1, \ldots, x_K)$, with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$

[Figure: the simplex of valid parameter vectors (𝜃1, 𝜃2, 𝜃3) for 𝐾 = 3]
MLE: Multinomial distribution
$\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(N)}\}$

$P(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N} P(\boldsymbol{x}^{(i)}|\boldsymbol{\theta}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(i)}} = \prod_{k=1}^{K} \theta_k^{\sum_{i=1}^{N} x_k^{(i)}}$

where $N_k = \sum_{i=1}^{N} x_k^{(i)}$ and $\sum_{k=1}^{K} N_k = N$.

Maximizing the log-likelihood subject to $\sum_{k=1}^{K} \theta_k = 1$ via a Lagrange multiplier 𝜆:

$\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln p(\mathcal{D}|\boldsymbol{\theta}) + \lambda \left(1 - \sum_{k=1}^{K} \theta_k \right)$

$\Rightarrow\; \theta_k = \frac{\sum_{i=1}^{N} x_k^{(i)}}{N} = \frac{N_k}{N}$
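The same result in a short sketch (illustrative data; each observation is stored as a class label rather than a one-hot vector):

```python
# ML estimates for a K-state (multinomial/categorical) variable: theta_k = N_k / N.
import numpy as np

K = 3
labels = np.array([0, 2, 1, 0, 0, 2, 1, 0])   # illustrative observations in {0, ..., K-1}
N_k = np.bincount(labels, minlength=K)        # N_k = number of samples in state k
theta_ml = N_k / labels.size
print(theta_ml)                               # [0.5, 0.25, 0.25] for this sample
```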
MLE
Gaussian: unknown 𝜇
$\ln p(x^{(i)}|\mu) = -\ln\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2}\left(x^{(i)} - \mu\right)^2$

$\frac{\partial \mathcal{L}(\mu)}{\partial\mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \frac{\partial}{\partial\mu} \ln p(x^{(i)}|\mu) = 0$

$\Rightarrow\; \frac{1}{\sigma^2} \sum_{i=1}^{N} \left(x^{(i)} - \mu\right) = 0 \;\Rightarrow\; \mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$

MLE corresponds to many well-known estimation methods (here, it recovers the sample mean).

MLE
Gaussian: unknown 𝜇 and 𝜎

$\boldsymbol{\theta} = (\mu, \sigma)$

$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$

$\frac{\partial \mathcal{L}(\mu,\sigma)}{\partial\mu} = 0 \;\Rightarrow\; \mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$

$\frac{\partial \mathcal{L}(\mu,\sigma)}{\partial\sigma} = 0 \;\Rightarrow\; \sigma_{ML}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)} - \mu_{ML}\right)^2$
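A minimal sketch of these two estimators on synthetic data (note that the ML variance divides by 𝑁, i.e., it is the biased estimator):

```python
# ML estimates of a Gaussian's mean and variance from i.i.d. samples.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # illustrative data

mu_ml = x.mean()                                # (1/N) * sum_i x^(i)
sigma2_ml = np.mean((x - mu_ml) ** 2)           # (1/N) * sum_i (x^(i) - mu_ML)^2
print(mu_ml, sigma2_ml)                         # close to 5.0 and 4.0
```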

Maximum A Posteriori (MAP) estimation
 MAP estimation:

$\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}}\; p(\boldsymbol{\theta}|\mathcal{D})$

 Since $p(\boldsymbol{\theta}|\mathcal{D}) \propto p(\mathcal{D}|\boldsymbol{\theta})\, p(\boldsymbol{\theta})$:

$\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}}\; p(\mathcal{D}|\boldsymbol{\theta})\, p(\boldsymbol{\theta})$

 Example of a prior distribution:

$p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$
MAP estimation
Gaussian: unknown 𝜇
$p(x|\mu) \sim \mathcal{N}(\mu, \sigma^2)$   (𝜇 is the only unknown parameter)
$p(\mu|\mu_0) \sim \mathcal{N}(\mu_0, \sigma_0^2)$   (𝜇0 and 𝜎0 are known)

$\frac{d}{d\mu} \ln\!\left[ p(\mu) \prod_{i=1}^{N} p(x^{(i)}|\mu) \right] = 0$

$\Rightarrow\; \frac{1}{\sigma^2} \sum_{i=1}^{N} \left(x^{(i)} - \mu\right) - \frac{1}{\sigma_0^2}\left(\mu - \mu_0\right) = 0$

$\Rightarrow\; \mu_{MAP} = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}$

$\frac{\sigma_0^2}{\sigma^2} \gg 1 \text{ or } N \to \infty \;\Rightarrow\; \mu_{MAP} \to \mu_{ML} = \frac{\sum_{i=1}^{N} x^{(i)}}{N}$
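A minimal sketch of this estimator (the prior parameters, noise level, and data below are illustrative):

```python
# MAP estimate of a Gaussian mean with a Gaussian prior N(mu0, sigma0^2) and known sigma.
import numpy as np

sigma, mu0, sigma0 = 1.0, 0.0, 0.5
rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=sigma, size=20)

ratio = sigma0**2 / sigma**2
mu_map = (mu0 + ratio * x.sum()) / (1 + ratio * x.size)   # formula above
mu_ml = x.mean()
print(mu_ml, mu_map)   # mu_MAP is pulled from the sample mean toward the prior mean mu0
```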
Maximum A Posteriori (MAP) estimation
 Given a set of observations 𝒟 and a prior distribution 𝑝(𝜽) on the parameters, the parameter vector that maximizes $p(\mathcal{D}|\boldsymbol{\theta})\, p(\boldsymbol{\theta})$ is found.

[Figure: two sketches of the likelihood $p(\mathcal{D}|\theta)$ together with the prior, one where $\theta_{MAP} \cong \theta_{ML}$ and one where $\theta_{MAP} > \theta_{ML}$]

$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}$
MAP estimation
Gaussian: unknown 𝜇 (known 𝜎)

$p(\mu|\mathcal{D}) \propto p(\mu)\, p(\mathcal{D}|\mu)$

$p(\mu|\mathcal{D}) = \mathcal{N}(\mu \,|\, \mu_N, \sigma_N^2)$

$\mu_N = \frac{\mu_0 + \frac{\sigma_0^2}{\sigma^2} \sum_{i=1}^{N} x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2} N}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$

[Figure from Bishop: the prior 𝑝(𝜇) and the posterior 𝑝(𝜇|𝒟) for increasing 𝑁]
More samples ⟹ sharper 𝑝(𝜇|𝒟), i.e., higher confidence in the estimate
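A small numeric illustration of the sharpening (assuming, for the example, 𝜎0 = 𝜎 = 1):

```python
# Posterior standard deviation sigma_N shrinks as N grows: 1/sigma_N^2 = 1/sigma0^2 + N/sigma^2.
sigma0, sigma = 1.0, 1.0
for N in [0, 1, 10, 100, 1000]:
    sigma_N = (1.0 / sigma0**2 + N / sigma**2) ** -0.5
    print(N, round(sigma_N, 4))   # 1.0, 0.7071, 0.3015, 0.0995, 0.0316
```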
Conjugate Priors
 We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties

 Choose a prior such that the posterior distribution, which is proportional to 𝑝(𝒟|𝜽)𝑝(𝜽), has the same functional form as the prior:

$\forall \boldsymbol{\alpha}, \mathcal{D} \;\; \exists \boldsymbol{\alpha}' \;\; P(\boldsymbol{\theta}|\boldsymbol{\alpha}') \propto P(\mathcal{D}|\boldsymbol{\theta})\, P(\boldsymbol{\theta}|\boldsymbol{\alpha})$

(the prior and the posterior have the same functional form)
Prior for Bernoulli Likelihood
 Beta distribution over 𝜃 ∈ [0,1]:

$\text{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$

$\text{Beta}(\theta|\alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)}\, \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$

Mean: $E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$;  mode (most probable 𝜃): $\hat{\theta} = \frac{\alpha_1 - 1}{(\alpha_0 - 1) + (\alpha_1 - 1)}$

 The Beta distribution is the conjugate prior of the Bernoulli likelihood:

$P(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
Beta distribution

[Figure: plots of the Beta density for several settings of (𝛼1, 𝛼0) omitted]
Bernoulli likelihood: posterior
Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, 𝑚 heads (1), 𝑁 − 𝑚 tails (0), where $m = \sum_{i=1}^{N} x^{(i)}$

$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\, p(\theta)$
$\quad = \left[ \prod_{i=1}^{N} \theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}} \right] \text{Beta}(\theta|\alpha_1, \alpha_0)$
$\quad \propto \theta^{m}(1-\theta)^{N-m}\, \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$
$\quad \propto \theta^{m+\alpha_1-1}(1-\theta)^{N-m+\alpha_0-1}$

$\Rightarrow\; p(\theta|\mathcal{D}) \propto \text{Beta}(\theta|\alpha_1', \alpha_0'), \qquad \alpha_1' = \alpha_1 + m, \quad \alpha_0' = \alpha_0 + N - m$
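This update can be checked numerically. The sketch below is not part of the slides; the data and the grid are illustrative. It evaluates likelihood × prior on a grid of 𝜃 values and compares the normalized result with the closed-form Beta posterior.

```python
# Numerical check of the Beta-Bernoulli conjugate update.
import numpy as np
from scipy.stats import beta

alpha1, alpha0 = 2.0, 2.0               # prior hyperparameters
data = np.array([1, 1, 0, 1])           # illustrative coin tosses
m, N = data.sum(), data.size

theta = np.linspace(1e-6, 1 - 1e-6, 2001)
posterior = theta**m * (1 - theta)**(N - m) * beta.pdf(theta, alpha1, alpha0)
posterior /= posterior.sum() * (theta[1] - theta[0])        # normalize on the grid

closed_form = beta.pdf(theta, alpha1 + m, alpha0 + N - m)   # Beta(theta | alpha1', alpha0')
print(np.max(np.abs(posterior - closed_form)))              # small (discretization error only)
```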
Example
$p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$

Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with 𝑚 heads (1) and 𝑁 − 𝑚 tails (0), and prior $\alpha_0 = \alpha_1 = 2$

$\mathcal{D} = \{1,1,1\} \;\Rightarrow\; N = 3,\; m = 3$

Prior: $\text{Beta}(\theta|\alpha_1 = 2, \alpha_0 = 2)$;  Posterior: $\text{Beta}(\theta|\alpha_1' = 5, \alpha_0' = 2)$
[Figure: plots of the prior, the Bernoulli likelihood 𝑝(𝑥 = 1|𝜃), and the posterior as functions of 𝜃]

$\theta_{MAP} = \arg\max_{\theta}\; P(\theta|\mathcal{D}) = \frac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \frac{4}{5}$
Toss example
 MAP estimation can avoid overfitting
 𝒟 = {1,1,1}, 𝜃𝑀𝐿 = 1
 $\theta_{MAP} = 0.8$ (with prior $p(\theta) = \text{Beta}(\theta|2,2)$)
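A minimal sketch reproducing these two numbers:

```python
# ML vs. MAP for the toss example D = {1,1,1} with a Beta(2,2) prior.
data = [1, 1, 1]
alpha1, alpha0 = 2, 2
m, N = sum(data), len(data)

alpha1_post, alpha0_post = alpha1 + m, alpha0 + N - m        # posterior Beta parameters
theta_ml = m / N
theta_map = (alpha1_post - 1) / (alpha1_post + alpha0_post - 2)
print(theta_ml, theta_map)                                   # 1.0 vs 0.8
```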

Bayesian inference
 Parameters 𝜽 are treated as random variables with a prior distribution
 Bayesian estimation utilizes the available prior information about the unknown parameter
 As opposed to ML and MAP estimation, it does not seek a specific point estimate of the unknown parameter vector 𝜽

 The observed samples 𝒟 convert the prior density 𝑝(𝜽) into a posterior density 𝑝(𝜽|𝒟)
 We keep track of beliefs about the values of 𝜽 and use these beliefs to reach conclusions
 In the Bayesian approach, we first find 𝑝(𝜽|𝒟) and then compute the predictive distribution 𝑝(𝒙|𝒟)
Bayesian estimation: predictive distribution
 Given a set of samples $\mathcal{D} = \{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$, a prior distribution 𝑃(𝜽) on the parameters, and the form of the distribution 𝑃(𝒙|𝜽)

 We find 𝑃(𝜽|𝒟) and then use it to specify 𝑃(𝒙) = 𝑃(𝒙|𝒟) as an estimate of 𝑃(𝒙):

$P(\boldsymbol{x}|\mathcal{D}) = \int P(\boldsymbol{x}, \boldsymbol{\theta}|\mathcal{D})\, d\boldsymbol{\theta} = \int P(\boldsymbol{x}|\mathcal{D}, \boldsymbol{\theta})\, P(\boldsymbol{\theta}|\mathcal{D})\, d\boldsymbol{\theta} = \int P(\boldsymbol{x}|\boldsymbol{\theta})\, P(\boldsymbol{\theta}|\mathcal{D})\, d\boldsymbol{\theta}$

This is the predictive distribution; the last step holds because, once the value of the parameters 𝜽 is known, the distribution of 𝒙 is known exactly, so $P(\boldsymbol{x}|\mathcal{D}, \boldsymbol{\theta}) = P(\boldsymbol{x}|\boldsymbol{\theta})$.

 Analytical solutions exist only for very special forms of the involved functions
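When the integral is not tractable, it can be approximated by Monte Carlo, averaging 𝑃(𝒙|𝜽) over samples drawn from the posterior. The sketch below is illustrative (not from the slides); it reuses the Beta posterior of the coin example, where the exact answer is known, so the approximation can be checked.

```python
# Monte Carlo approximation of the predictive distribution:
# P(x=1|D) ~= (1/S) * sum_s P(x=1|theta_s), with theta_s drawn from the posterior p(theta|D).
import numpy as np
from scipy.stats import beta

alpha1_post, alpha0_post = 5, 2          # Beta posterior from D = {1,1,1} with a Beta(2,2) prior
S = 100_000
theta_samples = beta.rvs(alpha1_post, alpha0_post, size=S, random_state=0)

p_heads_mc = np.mean(theta_samples)      # P(x=1|theta) = theta, so just average the samples
p_heads_exact = alpha1_post / (alpha1_post + alpha0_post)   # 5/7, derived on the next slide
print(p_heads_mc, p_heads_exact)
```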

Bernoulli likelihood: prediction
 Training samples: $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$

$P(\theta) = \text{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1}(1-\theta)^{\alpha_0 - 1}$

$P(\theta|\mathcal{D}) = \text{Beta}(\theta|\alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1 + m - 1}(1-\theta)^{\alpha_0 + (N-m) - 1}$

$P(x|\mathcal{D}) = \int P(x|\theta)\, P(\theta|\mathcal{D})\, d\theta = E_{P(\theta|\mathcal{D})}\!\left[P(x|\theta)\right]$

$\Rightarrow\; P(x=1|\mathcal{D}) = E_{P(\theta|\mathcal{D})}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$
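Putting the three estimates of 𝑃(𝑥 = 1) side by side for the toss example (a sketch; the data and prior follow the earlier slides):

```python
# ML plug-in, MAP plug-in, and Bayesian predictive estimates of P(x=1) for D = {1,1,1}.
data = [1, 1, 1]
alpha1, alpha0 = 2, 2
m, N = sum(data), len(data)

p_ml = m / N                                          # 1.0
p_map = (alpha1 + m - 1) / (alpha1 + alpha0 + N - 2)  # 0.8
p_bayes = (alpha1 + m) / (alpha0 + alpha1 + N)        # 5/7 ~= 0.714
print(p_ml, p_map, p_bayes)
```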
ML, MAP, and Bayesian Estimation
 If $p(\boldsymbol{\theta}|\mathcal{D})$ has a sharp peak at $\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}$ (i.e., $p(\boldsymbol{\theta}|\mathcal{D}) \approx \delta(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$), then $p(\boldsymbol{x}|\mathcal{D}) \approx p(\boldsymbol{x}|\hat{\boldsymbol{\theta}})$
 In this case, the Bayesian estimation will be approximately equal
to the MAP estimation.
 If $p(\mathcal{D}|\boldsymbol{\theta})$ is concentrated around a sharp peak and 𝑝(𝜽) is broad enough around this peak, the ML, MAP, and Bayesian estimations yield approximately the same result.

 All three methods asymptotically (𝑁 → ∞) result in the same estimate

Summary
 ML and MAP result in a single (point) estimate of the unknown parameter vector.
  Simpler and more interpretable than Bayesian estimation

 The Bayesian approach finds a predictive distribution using all the available information:
  expected to give better results
  needs higher computational complexity

 Bayesian methods have gained a lot of popularity over the past decade due to advances in computer technology.

 All three methods asymptotically (𝑁 → ∞) result in the same estimate.
Resource

 C. Bishop, “Pattern Recognition and Machine Learning”, Chapter 2.
