ML, MAP, and Bayesian Estimation
CE-717: Machine Learning
Sharif University of Technology
Fall 2016
Soleymani
Outline
Introduction
Maximum-Likelihood (ML) estimation
Maximum A Posteriori (MAP) estimation
Bayesian inference
Relation of learning & statistics
The target model in a learning problem can be considered as a statistical model
Density estimation
Estimating the probability density function $p(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$ drawn from it.
Parametric density estimation
Goal: estimate parameters of a distribution from a dataset
$\mathcal{D} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}\}$
𝒟 contains 𝑁 independent, identically distributed (i.i.d.) training
samples.
Example
$P(x|\mu) = \mathcal{N}(x|\mu, 1)$
Maximum Likelihood Estimation (MLE)
Maximum-likelihood estimation (MLE) is a method of estimating the
parameters of a statistical model given data.
$p(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N} p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta})$
Maximum Likelihood Estimation (MLE)
$\mathcal{L}(\boldsymbol{\theta}) = \ln p(\mathcal{D}|\boldsymbol{\theta}) = \ln \prod_{i=1}^{N} p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta}) = \sum_{i=1}^{N} \ln p(\boldsymbol{x}^{(i)}|\boldsymbol{\theta})$
Thus, we solve $\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$ to find the optimum (a stationary point of the log-likelihood).
MLE
Bernoulli
Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N-m$ tails (0)
$p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
$p(\mathcal{D}|\theta) = \prod_{i=1}^{N} p(x^{(i)}|\theta) = \prod_{i=1}^{N}\theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}$
$\ln p(\mathcal{D}|\theta) = \sum_{i=1}^{N}\ln p(x^{(i)}|\theta) = \sum_{i=1}^{N}\left\{x^{(i)}\ln\theta + (1-x^{(i)})\ln(1-\theta)\right\}$
$\dfrac{\partial \ln p(\mathcal{D}|\theta)}{\partial\theta} = 0 \;\Rightarrow\; \theta_{ML} = \dfrac{\sum_{i=1}^{N}x^{(i)}}{N} = \dfrac{m}{N}$
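As a numerical check, a minimal Python sketch (NumPy is not part of the slides; the array `tosses` is hypothetical data) computing the Bernoulli MLE as the fraction of heads:

import numpy as np

# hypothetical coin-toss data: 1 = heads, 0 = tails
tosses = np.array([1, 0, 1, 1, 0, 1, 1, 0])
N = len(tosses)
m = tosses.sum()       # number of heads
theta_ml = m / N       # closed-form MLE: theta_ML = m / N
print(theta_ml)        # 0.625 for this sample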
MLE
Bernoulli: example
Example: $\mathcal{D} = \{1,1,1\}$, $\theta_{ML} = \frac{3}{3} = 1$
Prediction: all future tosses will land heads up
Overfitting to 𝒟
MLE: Multinomial distribution
Multinomial distribution (on a variable with $K$ states):
Parameter space: $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_K]$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K}\theta_k = 1$
$P(\boldsymbol{x}|\boldsymbol{\theta}) = \prod_{k=1}^{K}\theta_k^{x_k}$, so that $P(x_k = 1) = \theta_k$
One-hot encoding: $\boldsymbol{x} = [x_1, \ldots, x_K]$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K}x_k = 1$
(Figure: the parameter simplex over $\theta_1, \theta_2, \theta_3$ for $K = 3$.)
MLE: Multinomial distribution
$\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(N)}\}$
$P(\mathcal{D}|\boldsymbol{\theta}) = \prod_{i=1}^{N}P(\boldsymbol{x}^{(i)}|\boldsymbol{\theta}) = \prod_{i=1}^{N}\prod_{k=1}^{K}\theta_k^{x_k^{(i)}} = \prod_{k=1}^{K}\theta_k^{\sum_{i=1}^{N}x_k^{(i)}}$
$N_k = \sum_{i=1}^{N}x_k^{(i)}$, $\quad\sum_{k=1}^{K}N_k = N$
Maximizing subject to $\sum_{k}\theta_k = 1$ (Lagrange multiplier $\lambda$): $\mathcal{L}(\boldsymbol{\theta},\lambda) = \ln p(\mathcal{D}|\boldsymbol{\theta}) + \lambda\left(1 - \sum_{k=1}^{K}\theta_k\right)$
$\theta_k^{ML} = \dfrac{\sum_{i=1}^{N}x_k^{(i)}}{N} = \dfrac{N_k}{N}$
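A minimal Python sketch of this result (the one-hot array `X` is hypothetical data): the multinomial MLE is just each state's count divided by the total number of samples.

import numpy as np

# hypothetical one-hot samples over K = 3 states
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
N_k = X.sum(axis=0)          # per-state counts N_k
theta_ml = N_k / X.shape[0]  # theta_k = N_k / N
print(theta_ml)              # [0.5  0.25 0.25]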
MLE
Gaussian: unknown 𝜇
$\ln p(x^{(i)}|\mu) = -\ln\left(\sqrt{2\pi}\,\sigma\right) - \dfrac{1}{2\sigma^2}\left(x^{(i)}-\mu\right)^2$
$\dfrac{\partial\mathcal{L}(\mu)}{\partial\mu} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\dfrac{\partial}{\partial\mu}\ln p(x^{(i)}|\mu) = 0$
$\Rightarrow\; \dfrac{1}{\sigma^2}\sum_{i=1}^{N}\left(x^{(i)}-\mu\right) = 0 \;\Rightarrow\; \mu_{ML} = \dfrac{1}{N}\sum_{i=1}^{N}x^{(i)}$
MLE
Gaussian: unknown 𝜇 and 𝜎
$\boldsymbol{\theta} = [\mu, \sigma]$, solve $\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}) = \mathbf{0}$:
$\dfrac{\partial\mathcal{L}(\mu,\sigma)}{\partial\mu} = 0 \;\Rightarrow\; \mu_{ML} = \dfrac{1}{N}\sum_{i=1}^{N}x^{(i)}$
$\dfrac{\partial\mathcal{L}(\mu,\sigma)}{\partial\sigma} = 0 \;\Rightarrow\; \sigma_{ML}^2 = \dfrac{1}{N}\sum_{i=1}^{N}\left(x^{(i)}-\mu_{ML}\right)^2$
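A minimal Python sketch of these closed-form estimates (the array `x` is hypothetical data); note that $\sigma_{ML}^2$ divides by $N$, not $N-1$:

import numpy as np

x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])   # hypothetical samples
mu_ml = x.mean()                           # mu_ML = (1/N) * sum of x_i
sigma2_ml = np.mean((x - mu_ml) ** 2)      # sigma^2_ML, biased estimate (divides by N)
# equivalently: np.var(x), whose default ddof=0 also divides by N
print(mu_ml, sigma2_ml)                    # 2.0 and ~0.092 for this sample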
Maximum A Posteriori (MAP) estimation
MAP estimation: $\boldsymbol{\theta}_{MAP} = \underset{\boldsymbol{\theta}}{\mathrm{argmax}}\; p(\boldsymbol{\theta}|\mathcal{D})$
MAP estimation
Gaussian: unknown 𝜇
$p(x|\mu) \sim \mathcal{N}(\mu, \sigma^2)$   ($\mu$ is the only unknown parameter)
$p(\mu|\mu_0) \sim \mathcal{N}(\mu_0, \sigma_0^2)$   ($\mu_0$ and $\sigma_0$ are known)
$\dfrac{d}{d\mu}\ln\left[p(\mu)\prod_{i=1}^{N}p(x^{(i)}|\mu)\right] = 0$
$\Rightarrow\; \dfrac{1}{\sigma^2}\sum_{i=1}^{N}\left(x^{(i)}-\mu\right) - \dfrac{1}{\sigma_0^2}\left(\mu-\mu_0\right) = 0$
$\Rightarrow\; \mu_{MAP} = \dfrac{\mu_0 + \frac{\sigma_0^2}{\sigma^2}\sum_{i=1}^{N}x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2}N}$
$\dfrac{\sigma_0^2}{\sigma^2} \gg 1$ or $N \to \infty \;\Rightarrow\; \mu_{MAP} \approx \mu_{ML} = \dfrac{\sum_{i=1}^{N}x^{(i)}}{N}$
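A minimal Python sketch of the closed-form $\mu_{MAP}$ above (the data array and the values of $\sigma$, $\mu_0$, $\sigma_0$ are hypothetical); as $N$ grows, $\mu_{MAP}$ approaches $\mu_{ML}$:

import numpy as np

x = np.array([2.1, 1.8, 2.5, 2.0, 1.6])  # hypothetical samples
sigma, mu0, sigma0 = 1.0, 0.0, 2.0       # assumed known noise std and prior N(mu0, sigma0^2)
N = len(x)
r = sigma0**2 / sigma**2
mu_map = (mu0 + r * x.sum()) / (1 + r * N)   # formula derived above
mu_ml = x.mean()
print(mu_map, mu_ml)   # mu_map (~1.905) is pulled slightly toward mu0; mu_ml = 2.0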
Maximum A Posteriori (MAP) estimation
Given a set of observations $\mathcal{D}$ and a prior distribution $p(\boldsymbol{\theta})$ on the parameters, the parameter vector that maximizes $p(\mathcal{D}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})$ is found.
For the Gaussian case above, the resulting estimate (the posterior mean $\mu_N$) can equivalently be written as a weighted combination of the prior mean and the ML estimate:
$\mu_N = \dfrac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \dfrac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{ML}$
MAP estimation
Gaussian: unknown 𝜇 (known 𝜎)
$p(\mu|\mathcal{D}) \propto p(\mu)\,p(\mathcal{D}|\mu)$
$p(\mu|\mathcal{D}) = \mathcal{N}(\mu\,|\,\mu_N, \sigma_N^2)$
$\mu_N = \dfrac{\mu_0 + \frac{\sigma_0^2}{\sigma^2}\sum_{i=1}^{N}x^{(i)}}{1 + \frac{\sigma_0^2}{\sigma^2}N}$
$\dfrac{1}{\sigma_N^2} = \dfrac{1}{\sigma_0^2} + \dfrac{N}{\sigma^2}$
(Figure from [Bishop]: the posterior $p(\mu|\mathcal{D})$ for increasing $N$, starting from the prior $p(\mu)$.)
More samples ⟹ sharper $p(\mu|\mathcal{D})$, i.e., higher confidence in the estimate
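A minimal Python sketch of this full posterior over $\mu$ (same hypothetical $\sigma$, $\mu_0$, $\sigma_0$ as in the earlier snippet; the data are simulated), showing that $\sigma_N^2$ shrinks roughly like $\sigma^2/N$ as $N$ grows:

import numpy as np

sigma, mu0, sigma0 = 1.0, 0.0, 2.0   # assumed known likelihood std and prior N(mu0, sigma0^2)
for N in [1, 10, 100]:
    x = np.random.default_rng(0).normal(2.0, sigma, size=N)   # hypothetical data with true mean 2.0
    r = sigma0**2 / sigma**2
    mu_N = (mu0 + r * x.sum()) / (1 + r * N)
    var_N = 1.0 / (1.0 / sigma0**2 + N / sigma**2)   # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    print(N, mu_N, var_N)   # posterior narrows around the true mean as N increases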
Conjugate Priors
We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties: a prior is conjugate to a likelihood when the resulting posterior has the same functional form as the prior.
Prior for Bernoulli Likelihood
Beta distribution over $\theta \in [0,1]$:
$\mathrm{Beta}(\theta|\alpha_1,\alpha_0) \propto \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$
$\mathrm{Beta}(\theta|\alpha_1,\alpha_0) = \dfrac{\Gamma(\alpha_0+\alpha_1)}{\Gamma(\alpha_0)\Gamma(\alpha_1)}\,\theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$
Mean: $E[\theta] = \dfrac{\alpha_1}{\alpha_0+\alpha_1}$
Mode (most probable $\theta$): $\theta = \dfrac{\alpha_1-1}{(\alpha_1-1)+(\alpha_0-1)}$
Beta distribution
(Figure: the Beta density for different values of $\alpha_1$ and $\alpha_0$.)
Bernoulli likelihood: posterior
Given: $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ with $m = \sum_{i=1}^{N}x^{(i)}$ heads (1) and $N-m$ tails (0)
$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta) = \left[\prod_{i=1}^{N}\theta^{x^{(i)}}(1-\theta)^{1-x^{(i)}}\right]\mathrm{Beta}(\theta|\alpha_1,\alpha_0) \propto \theta^{m+\alpha_1-1}(1-\theta)^{N-m+\alpha_0-1}$
$\Rightarrow\; p(\theta|\mathcal{D}) \propto \mathrm{Beta}(\theta|\alpha_1',\alpha_0')$ with $\alpha_1' = \alpha_1 + m$ and $\alpha_0' = \alpha_0 + N - m$
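A minimal Python sketch of this conjugate update (the data and prior pseudo-counts are the slide's toy values, treated here as hypothetical inputs): the posterior is again a Beta, with the observed counts simply added to the prior pseudo-counts.

import numpy as np

tosses = np.array([1, 1, 1])       # hypothetical data D = {1,1,1}
alpha1, alpha0 = 2, 2              # prior Beta(theta | alpha1, alpha0)
N, m = len(tosses), tosses.sum()
alpha1_post = alpha1 + m           # alpha1' = alpha1 + m
alpha0_post = alpha0 + N - m       # alpha0' = alpha0 + N - m
theta_map = (alpha1_post - 1) / (alpha1_post - 1 + alpha0_post - 1)   # mode of the posterior
print(alpha1_post, alpha0_post, theta_map)   # 5, 2, 0.8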
Example
Bernoulli likelihood: $p(x|\theta) = \theta^{x}(1-\theta)^{1-x}$
Prior: $\mathrm{Beta}(\theta|\alpha_1,\alpha_0)$ with $\alpha_0 = \alpha_1 = 2$
Given $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ with $m$ heads (1) and $N-m$ tails (0); here $\mathcal{D} = \{1,1,1\} \Rightarrow N = 3$, $m = 3$
Posterior: $\mathrm{Beta}(\theta|\alpha_1',\alpha_0')$ with $\alpha_1' = 5$, $\alpha_0' = 2$
$\theta_{MAP} = \underset{\theta}{\mathrm{argmax}}\; P(\theta|\mathcal{D}) = \dfrac{\alpha_1'-1}{(\alpha_1'-1)+(\alpha_0'-1)} = \dfrac{4}{5}$
(Figure: prior, Bernoulli likelihood $p(x{=}1|\theta)$, and posterior, each plotted against $\theta$.)
Toss example
MAP estimation can avoid overfitting
$\mathcal{D} = \{1,1,1\}$, $\theta_{ML} = 1$
$\theta_{MAP} = 0.8$ (with prior $p(\theta) = \mathrm{Beta}(\theta|2,2)$)
Bayesian inference
Parameters $\boldsymbol{\theta}$ are treated as random variables with an a priori (prior) distribution
Bayesian estimation utilizes the available prior information about the
unknown parameter
As opposed to ML and MAP estimation, it does not seek a specific point
estimate of the unknown parameter vector 𝜽
Bayesian estimation: predictive distribution
Given a set of samples $\mathcal{D} = \{\boldsymbol{x}^{(i)}\}_{i=1}^{N}$, a prior distribution $P(\boldsymbol{\theta})$ on the parameters, and the form of the distribution $P(\boldsymbol{x}|\boldsymbol{\theta})$, the predictive distribution averages over the full posterior:
$P(\boldsymbol{x}|\mathcal{D}) = \int P(\boldsymbol{x}|\boldsymbol{\theta})\,P(\boldsymbol{\theta}|\mathcal{D})\,d\boldsymbol{\theta}$
Bernoulli likelihood: prediction
Training samples: $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ with $m$ heads
$P(\theta) = \mathrm{Beta}(\theta|\alpha_1,\alpha_0) \propto \theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}$
$P(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\alpha_1+m, \alpha_0+N-m) \propto \theta^{\alpha_1+m-1}(1-\theta)^{\alpha_0+(N-m)-1}$
$P(x|\mathcal{D}) = E_{P(\theta|\mathcal{D})}\left[P(x|\theta)\right]$
$\Rightarrow\; P(x=1|\mathcal{D}) = E_{P(\theta|\mathcal{D})}\left[\theta\right] = \dfrac{\alpha_1+m}{\alpha_0+\alpha_1+N}$
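A minimal Python sketch of this Bayesian predictive probability, compared against the ML and MAP plug-in predictions (the data and prior pseudo-counts are the same hypothetical toy values as above):

import numpy as np

tosses = np.array([1, 1, 1])            # hypothetical data
alpha1, alpha0 = 2, 2                   # prior pseudo-counts
N, m = len(tosses), tosses.sum()
p_bayes = (alpha1 + m) / (alpha0 + alpha1 + N)          # P(x=1 | D) = E[theta | D]
p_ml    = m / N                                         # plug-in with theta_ML
p_map   = (alpha1 + m - 1) / (alpha0 + alpha1 + N - 2)  # plug-in with theta_MAP (posterior mode)
print(p_bayes, p_ml, p_map)             # ~0.714, 1.0, 0.8 for this toy example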
ML, MAP, and Bayesian Estimation
If $p(\boldsymbol{\theta}|\mathcal{D})$ has a sharp peak at $\boldsymbol{\theta} = \widehat{\boldsymbol{\theta}}$ (i.e., $p(\boldsymbol{\theta}|\mathcal{D}) \approx \delta(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}})$), then $p(\boldsymbol{x}|\mathcal{D}) \approx p(\boldsymbol{x}|\widehat{\boldsymbol{\theta}})$.
In this case, Bayesian estimation will be approximately equal to MAP estimation.
If $p(\mathcal{D}|\boldsymbol{\theta})$ is concentrated around a sharp peak and $p(\boldsymbol{\theta})$ is broad enough around this peak, then ML, MAP, and Bayesian estimation yield approximately the same result.
Summary
ML and MAP result in a single (point) estimate of the unknown parameter vector.
Simpler and more interpretable than Bayesian estimation