
Week-4: Estimation and EM Algorithm Overview

Sherry Thomas
21f3001449

Contents

Introduction to Estimation in Machine Learning
Maximum Likelihood Estimation
    Fisher's Principle of Maximum Likelihood
    Likelihood Estimation for Bernoulli Distributions
    Likelihood Estimation for Gaussian Distributions
Bayesian Estimation
    Bayesian Estimation for a Bernoulli Distribution
Gaussian Mixture Models
Likelihood of GMMs
    Convexity and Jensen's Inequality
Estimating the Parameters
EM Algorithm
Acknowledgments
Abstract
This week introduces estimators and then delves deeper into the Maximum
Likelihood Estimator and the Bayesian Estimator. It then covers Gaussian
Mixture Models and their estimation via the EM algorithm.

Introduction to Estimation in Machine Learning


Estimation in machine learning involves inferring unknown parameters or predicting
outcomes from observed data. Estimators, often algorithms or models, are used for
these tasks and to characterize the data's underlying distribution.
Let $\{x_1, x_2, \ldots, x_n\}$ represent a dataset, where each data point $x_i$ lies in the
$d$-dimensional binary space $\{0, 1\}^d$. It is assumed that the data points are
independent and identically distributed (i.i.d.).
Independence means $P(x_i \mid x_j) = P(x_i)$ for $i \ne j$; identically distributed means
$P(x_i) = P(x_j) = p$ for all $i, j$.

Maximum Likelihood Estimation
Fisher’s Principle of Maximum Likelihood
Fisher’s principle of maximum likelihood is a statistical method used to estimate
parameters of a statistical model by selecting values that maximize the likelihood
function. This function quantifies how well the model fits the observed data.

Likelihood Estimation for Bernoulli Distributions


Applying the likelihood function to the aforementioned dataset, we obtain:

$$
\begin{aligned}
\mathcal{L}(p; \{x_1, x_2, \ldots, x_n\}) &= P(x_1, x_2, \ldots, x_n; p) \\
&= p(x_1; p)\, p(x_2; p) \cdots p(x_n; p) \\
&= \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i}
\end{aligned}
$$

The maximum likelihood estimate is the value of $p$ that maximizes the log-likelihood:

$$
\hat{p}_{ML} = \underset{p}{\arg\max} \; \log \left( \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} \right)
$$

Differentiating with respect to $p$ and setting the derivative to zero, we get

$$
\hat{p}_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$
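As a quick illustration (a minimal sketch, not part of the original notes), the Bernoulli MLE is just the fraction of ones in a sample; the toy data below is made up:

    import numpy as np

    # Toy i.i.d. Bernoulli sample (made-up data): 1s and 0s.
    x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

    # Maximum likelihood estimate: p_hat_ML = (1/n) * sum(x_i), i.e. the sample mean.
    p_hat_ml = x.mean()
    print(f"p_hat_ML = {p_hat_ml:.2f}")  # 0.70 for this toy sample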

Likelihood Estimation for Gaussian Distributions


Let {x1 , x2 , … , x𝑛 } be a dataset where x𝑖 ∼ 𝒩(𝜇, 𝜎2 ). We assume that the data
points are independent and identically distributed.

$$
\begin{aligned}
\mathcal{L}(\mu, \sigma^2; \{x_1, x_2, \ldots, x_n\}) &= f_{x_1, x_2, \ldots, x_n}(x_1, x_2, \ldots, x_n; \mu, \sigma^2) \\
&= \prod_{i=1}^{n} f_{x_i}(x_i; \mu, \sigma^2) \\
&= \prod_{i=1}^{n} \left[ \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right]
\end{aligned}
$$

$$
\therefore \log \mathcal{L}(\mu, \sigma^2; \{x_1, x_2, \ldots, x_n\}) = \sum_{i=1}^{n} \left[ \log \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]
$$

By differentiating with respect to $\mu$ and $\sigma^2$ and setting the derivatives to zero, we get

$$
\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
\hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{ML})^2
$$
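A minimal sketch along the same lines (assuming NumPy; the synthetic data is generated purely for illustration), showing that the ML estimates are the sample mean and the biased sample variance:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=1000)  # synthetic data with true mu = 2, sigma = 1.5

    mu_hat = x.mean()                        # (1/n) * sum(x_i)
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # (1/n) * sum((x_i - mu_hat)^2); note the 1/n (biased) form
    print(mu_hat, sigma2_hat)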

Bayesian Estimation
Bayesian estimation is a statistical method that updates parameter estimates by
incorporating prior knowledge or beliefs along with observed data to calculate
the posterior probability distribution of the parameters.
Let {x1 , x2 , … , x𝑛 } be a dataset where x𝑖 follows a distribution with parameters
𝜃. We assume that the data points are independent and identically distributed,
and we also consider 𝜃 as a random variable with its own probability distribution.
Our objective is to update the parameters using the available data.
i.e.
𝑃 (𝜃) ⇒ 𝑃 (𝜃|{x1 , x2 , … , x𝑛 })
where, employing Bayes’ Law, we find

$$
P(\theta \mid \{x_1, x_2, \ldots, x_n\}) = \frac{P(\{x_1, x_2, \ldots, x_n\} \mid \theta)}{P(\{x_1, x_2, \ldots, x_n\})} \cdot P(\theta)
$$

Bayesian Estimation for a Bernoulli Distribution


Let $\{x_1, x_2, \ldots, x_n\}$ be a dataset where each $x_i \in \{0, 1\}$ is drawn from a Bernoulli
distribution with parameter $\theta$. What distribution would be suitable for the prior $P(\theta)$?
A commonly used prior is the Beta distribution:

$$
f(p; \alpha, \beta) = \frac{p^{\alpha - 1} (1 - p)^{\beta - 1}}{z} \qquad \forall p \in [0, 1]
$$

where $z$ is a normalizing factor.

Hence, utilizing the Beta Distribution as the Prior, we obtain,

$$
P(\theta \mid \{x_1, x_2, \ldots, x_n\}) \propto P(\{x_1, x_2, \ldots, x_n\} \mid \theta) \cdot P(\theta)
$$

$$
\begin{aligned}
f_{\theta \mid \{x_1, \ldots, x_n\}}(p) &\propto \left[ \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} \right] \cdot \left[ p^{\alpha - 1} (1 - p)^{\beta - 1} \right] \\
&\propto p^{\sum_{i=1}^{n} x_i + \alpha - 1} \, (1 - p)^{\sum_{i=1}^{n} (1 - x_i) + \beta - 1}
\end{aligned}
$$

That is, with $n_h$ the number of ones and $n_t$ the number of zeros in the data, we obtain

$$
\text{Beta prior}(\alpha, \beta) \xrightarrow{\;\{x_1, \ldots, x_n\} \,\sim\, \text{Bernoulli}\;} \text{Beta posterior}(\alpha + n_h, \beta + n_t)
$$

$$
\therefore \hat{p} = \mathbb{E}[\text{Posterior}] = \mathbb{E}[\text{Beta}(\alpha + n_h, \beta + n_t)] = \frac{\alpha + n_h}{\alpha + n_h + \beta + n_t}
$$
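A minimal sketch of this conjugate update (the prior hyperparameters and the observations below are arbitrary choices made for illustration):

    import numpy as np

    x = np.array([1, 1, 0, 1, 0, 1, 1, 1])  # made-up Bernoulli observations
    alpha, beta = 2.0, 2.0                  # arbitrary Beta prior hyperparameters

    n_h = int(x.sum())   # number of ones ("heads")
    n_t = len(x) - n_h   # number of zeros ("tails")

    # Conjugacy: Beta(alpha, beta) prior + Bernoulli data -> Beta(alpha + n_h, beta + n_t) posterior.
    alpha_post, beta_post = alpha + n_h, beta + n_t
    p_hat = alpha_post / (alpha_post + beta_post)  # posterior-mean estimate
    print(alpha_post, beta_post, p_hat)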

Gaussian Mixture Models
Gaussian Mixture Models are a type of probabilistic model used to represent
complex data distributions by combining multiple Gaussian distributions.
The procedure is as follows:
• Step 1: Generate a mixture component $z_i \in \{1, 2, \ldots, K\}$ according to

$$
P(z_i = k) = \pi_k \qquad \left[ \sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1 \;\; \forall k \right]
$$

• Step 2: Generate $x_i \sim \mathcal{N}(\mu_{z_i}, \sigma^2_{z_i})$.

Hence, there are $3K$ parameters ($K$ means, $K$ variances, and $K$ mixture weights).
However, since $\sum_{k=1}^{K} \pi_k = 1$, the number of free parameters to be estimated is
$3K - 1$ for a GMM with $K$ components.
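A minimal sketch of this two-step generative procedure for a 1-D GMM (the weights, means, and variances below are made-up values):

    import numpy as np

    rng = np.random.default_rng(1)
    pi = np.array([0.5, 0.3, 0.2])     # mixture weights, sum to 1 (made up)
    mu = np.array([-2.0, 0.0, 3.0])    # component means (made up)
    sigma = np.array([0.5, 1.0, 0.8])  # component standard deviations (made up)

    n = 500
    z = rng.choice(len(pi), size=n, p=pi)      # Step 1: sample mixture components z_i
    x = rng.normal(loc=mu[z], scale=sigma[z])  # Step 2: sample x_i ~ N(mu_{z_i}, sigma^2_{z_i})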

Likelihood of GMMs

Let $\theta = \{\mu_1, \ldots, \mu_K, \; \sigma^2_1, \ldots, \sigma^2_K, \; \pi_1, \ldots, \pi_K\}$ collect all the parameters, and let $f_{\text{mix}}$ denote the mixture density. Then

$$
\mathcal{L}(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{\text{mix}}(x_i; \theta)
= \prod_{i=1}^{n} \left[ \sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{(x_i - \mu_k)^2}{2\sigma^2_k}} \right]
$$

$$
\therefore \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log \left[ \sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{(x_i - \mu_k)^2}{2\sigma^2_k}} \right]
$$
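For reference, a minimal sketch of how this log-likelihood could be evaluated numerically (assuming SciPy's logsumexp for stability; the data and parameters are placeholders, not from the lecture):

    import numpy as np
    from scipy.special import logsumexp

    def gmm_log_likelihood(x, pi, mu, sigma):
        """log L(theta) = sum_i log( sum_k pi_k * N(x_i; mu_k, sigma_k^2) )."""
        x = x[:, None]  # shape (n, 1) so broadcasting against mu, sigma gives an (n, K) array
        log_comp = (np.log(pi)
                    - 0.5 * np.log(2 * np.pi * sigma**2)
                    - (x - mu)**2 / (2 * sigma**2))
        return logsumexp(log_comp, axis=1).sum()

    # Example with made-up data and placeholder parameters:
    x = np.array([-1.9, -2.2, 0.1, 3.1, 2.8])
    print(gmm_log_likelihood(x, np.array([0.5, 0.3, 0.2]),
                             np.array([-2.0, 0.0, 3.0]),
                             np.array([0.5, 1.0, 0.8])))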

Maximizing this log-likelihood directly is difficult because of the summation inside the logarithm; to make progress, we need to understand convexity.

Convexity and Jensen’s Inequality


A set is convex if the line segment between any two of its points lies entirely within
the set. Relatedly, a function $f$ is concave if its value at any convex combination of
points is at least the corresponding combination of its values:

$$
f\!\left( \sum_{k=1}^{K} \lambda_k a_k \right) \ge \sum_{k=1}^{K} \lambda_k f(a_k)
$$

where

$$
\sum_{k=1}^{K} \lambda_k = 1, \qquad \lambda_k \ge 0, \qquad a_k \text{ are points in the domain of } f
$$

This is also known as Jensen's Inequality.
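A tiny numerical check of Jensen's inequality for the concave function $\log$ (the points and weights below are arbitrary):

    import numpy as np

    a = np.array([0.5, 2.0, 8.0])    # arbitrary positive points
    lam = np.array([0.2, 0.5, 0.3])  # weights summing to 1

    lhs = np.log(np.dot(lam, a))     # f(sum_k lambda_k * a_k)
    rhs = np.dot(lam, np.log(a))     # sum_k lambda_k * f(a_k)
    assert lhs >= rhs                # Jensen's inequality holds since log is concave
    print(lhs, rhs)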

Estimating the Parameters
Since $\log$ is a concave function, we can lower-bound the log-likelihood of a GMM as
follows. Recall

$$
\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log \left[ \sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{(x_i - \mu_k)^2}{2\sigma^2_k}} \right]
$$

By introducing parameters $\{\lambda^i_1, \lambda^i_2, \ldots, \lambda^i_K\}$ for each data point $x_i$ such that
$\sum_{k=1}^{K} \lambda^i_k = 1$ and $0 \le \lambda^i_k \le 1$ for all $i, k$, we obtain:

$$
\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log \left[ \sum_{k=1}^{K} \lambda^i_k \left( \frac{\pi_k}{\lambda^i_k} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{(x_i - \mu_k)^2}{2\sigma^2_k}} \right) \right]
$$

Using Jensen's Inequality, we get:

$$
\log \mathcal{L}(\theta) \ge \text{modified\_log}\mathcal{L}(\theta)
$$

$$
\therefore \text{modified\_log}\mathcal{L}(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda^i_k \log \left( \frac{\pi_k}{\lambda^i_k} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{(x_i - \mu_k)^2}{2\sigma^2_k}} \right) \tag{1}
$$

Note that the modified log-likelihood function gives a lower bound for the true
log-likelihood at $\theta$. Finally, to get the parameters, we alternate between the
following two steps (both updates are sketched in code after this list):

• To get $\theta$: Fix $\lambda$ and maximize over $\theta$.

$$
\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda^i_k \log \left( \frac{\pi_k}{\lambda^i_k} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{(x_i - \mu_k)^2}{2\sigma^2_k}} \right)
$$

Differentiating w.r.t. $\mu$, $\sigma^2$, and $\pi$ (subject to $\sum_k \pi_k = 1$), we get the following:

$$
\hat{\mu}_k^{\text{MML}} = \frac{\sum_{i=1}^{n} \lambda^i_k \, x_i}{\sum_{i=1}^{n} \lambda^i_k}
\qquad
\hat{\sigma}_k^{2\,\text{MML}} = \frac{\sum_{i=1}^{n} \lambda^i_k \, (x_i - \hat{\mu}_k^{\text{MML}})^2}{\sum_{i=1}^{n} \lambda^i_k}
\qquad
\hat{\pi}_k^{\text{MML}} = \frac{\sum_{i=1}^{n} \lambda^i_k}{n}
$$

• To get $\lambda$: Fix $\theta$ and maximize over $\lambda$. For any $i$:

$$
\max_{\lambda^i_1, \ldots, \lambda^i_K} \sum_{k=1}^{K} \left[ \lambda^i_k \log \left( \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{(x_i - \mu_k)^2}{2\sigma^2_k}} \right) - \lambda^i_k \log(\lambda^i_k) \right]
\quad \text{s.t.} \quad \sum_{k=1}^{K} \lambda^i_k = 1, \;\; 0 \le \lambda^i_k \le 1
$$

Solving the above constrained optimization problem analytically, we get:

$$
\hat{\lambda}_k^{i\,\text{MML}} = \frac{\pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{(x_i - \mu_k)^2}{2\sigma^2_k}}}{\sum_{j=1}^{K} \pi_j \cdot \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma^2_j}}}
$$
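A minimal sketch of both closed-form updates for a 1-D GMM (assuming NumPy; the function and variable names are mine, not from the lecture): e_step computes the $\hat{\lambda}^i_k$ for a fixed $\theta$, and m_step computes $\hat{\mu}_k$, $\hat{\sigma}^2_k$, and $\hat{\pi}_k$ for a fixed $\lambda$.

    import numpy as np

    def e_step(x, pi, mu, sigma2):
        """lambda_ik proportional to pi_k * N(x_i; mu_k, sigma_k^2), normalized over k."""
        dens = (1.0 / np.sqrt(2 * np.pi * sigma2)) * \
               np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2))  # shape (n, K)
        lam = pi * dens
        return lam / lam.sum(axis=1, keepdims=True)

    def m_step(x, lam):
        """Weighted-average updates for pi, mu, and sigma^2 with weights lambda_ik."""
        n_k = lam.sum(axis=0)                                   # effective counts per component
        mu = (lam * x[:, None]).sum(axis=0) / n_k
        sigma2 = (lam * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
        pi = n_k / len(x)
        return pi, mu, sigma2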

EM Algorithm
The EM (Expectation-Maximization) algorithm is a popular method for estimating the
parameters of statistical models with incomplete data. It iteratively alternates
between expectation and maximization steps until convergence to a stable solution.
The algorithm is as follows:

• Initialize $\theta^0 = \left\{ \mu_1, \ldots, \mu_K; \;\; \sigma^2_1, \ldots, \sigma^2_K; \;\; \pi_1, \ldots, \pi_K \right\}$ using Lloyd's algorithm.

• Until convergence ($\|\theta^{t+1} - \theta^t\| \le \epsilon$, where $\epsilon$ is the tolerance parameter), do the following:

$$
\lambda^{t+1} = \underset{\lambda}{\arg\max} \; \text{modified\_log}\mathcal{L}(\theta^{t}, \lambda) \qquad \rightarrow \text{Expectation Step}
$$

$$
\theta^{t+1} = \underset{\theta}{\arg\max} \; \text{modified\_log}\mathcal{L}(\theta, \lambda^{t+1}) \qquad \rightarrow \text{Maximization Step}
$$

The EM algorithm produces a soft clustering. For hard clustering using EM, a further
step is involved:

• For a point $x_i$, assign it to a cluster using

$$
z_i = \underset{k}{\arg\max} \; \hat{\lambda}^i_k
$$
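A minimal, self-contained sketch of the full loop (not the course's reference implementation, and with no numerical safeguards): random initialization stands in for Lloyd's algorithm, the change in parameters serves as the convergence test, and the final arg max over $\lambda$ gives the hard clustering.

    import numpy as np

    def em_gmm(x, K, eps=1e-6, max_iter=200, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization (a crude stand-in for Lloyd's algorithm / k-means).
        mu = rng.choice(x, size=K, replace=False)
        sigma2 = np.full(K, x.var())
        pi = np.full(K, 1.0 / K)
        for _ in range(max_iter):
            # Expectation step: responsibilities lambda_ik.
            dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
            lam = pi * dens
            lam /= lam.sum(axis=1, keepdims=True)
            # Maximization step: closed-form parameter updates.
            n_k = lam.sum(axis=0)
            mu_new = (lam * x[:, None]).sum(axis=0) / n_k
            sigma2_new = (lam * (x[:, None] - mu_new) ** 2).sum(axis=0) / n_k
            pi_new = n_k / len(x)
            # Convergence check: ||theta_{t+1} - theta_t|| <= eps.
            change = np.linalg.norm(np.concatenate([mu_new - mu, sigma2_new - sigma2, pi_new - pi]))
            mu, sigma2, pi = mu_new, sigma2_new, pi_new
            if change <= eps:
                break
        z = lam.argmax(axis=1)  # hard clustering: z_i = argmax_k lambda_ik
        return (pi, mu, sigma2), lam, z

    # Example on made-up data drawn from two Gaussians:
    rng = np.random.default_rng(2)
    x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])
    (pi, mu, sigma2), lam, z = em_gmm(x, K=2)
    print(pi, mu, sigma2)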

Acknowledgments
Professor Arun Rajkumar: The content, including the concepts and notations
presented in this document, has been sourced from his slides and lectures. His
expertise and educational materials have greatly contributed to the development
of this document.
