Wk04 Machine Learning
Sherry Thomas
21f3001449
Contents
Maximum Likelihood Estimation
Bayesian Estimation
    Bayesian Estimation for a Bernoulli Distribution
Gaussian Mixture Models
    Likelihood of GMMs
    Estimating the Parameters
EM Algorithm
Acknowledgments
Abstract
The week introduces estimators and delves deeper into the Maximum Likelihood
Estimator and the Bayesian Estimator. Later, it goes into Gaussian Mixture
Models and their implementation.
Maximum Likelihood Estimation
Fisher’s Principle of Maximum Likelihood
Fisher’s principle of maximum likelihood is a statistical method used to estimate
parameters of a statistical model by selecting values that maximize the likelihood
function. This function quantifies how well the model fits the observed data.
For example, for i.i.d. Bernoulli$(p)$ data $\{x_1, x_2, \ldots, x_n\}$,

$$\hat{p}_{\text{ML}} = \underset{p}{\arg\max}\; \log \mathcal{L}(p; \{x_1, x_2, \ldots, x_n\}) = \underset{p}{\arg\max}\; \log\left(\prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}\right)$$

Similarly, for i.i.d. Gaussian data, the ML estimates of the mean and variance are

$$\hat{\mu}_{\text{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \hat{\sigma}^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu}_{\text{ML}})^T (x_i - \hat{\mu}_{\text{ML}})$$
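As a quick numerical check (not part of the lecture), here is a minimal NumPy sketch that computes these closed-form ML estimates from a simulated Gaussian sample; the sample parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample, assumed i.i.d. from a univariate Gaussian.
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# ML estimates: sample mean and the 1/n (biased) sample variance.
mu_ml = x.mean()
sigma2_ml = np.mean((x - mu_ml) ** 2)

print(mu_ml, sigma2_ml)  # should be close to 2.0 and 1.5**2 = 2.25
```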
Bayesian Estimation
Bayesian estimation is a statistical method that updates parameter estimates by
incorporating prior knowledge or beliefs along with observed data to calculate
the posterior probability distribution of the parameters.
Let {x1 , x2 , … , x𝑛 } be a dataset where x𝑖 follows a distribution with parameters
𝜃. We assume that the data points are independent and identically distributed,
and we also consider 𝜃 as a random variable with its own probability distribution.
Our objective is to update the parameters using the available data.
i.e.

$$P(\theta) \;\Rightarrow\; P(\theta \mid \{x_1, x_2, \ldots, x_n\})$$

where, employing Bayes' law, we find

$$P(\theta \mid \{x_1, x_2, \ldots, x_n\}) = \frac{P(\{x_1, x_2, \ldots, x_n\} \mid \theta)}{P(\{x_1, x_2, \ldots, x_n\})} \cdot P(\theta)$$
Bayesian Estimation for a Bernoulli Distribution

For Bernoulli data with parameter $p$, a convenient prior is the Beta distribution,

$$f(p; \alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{z} \quad \forall p \in [0, 1]$$

where $z$ is a normalizing factor. Since the Beta prior is conjugate to the Bernoulli likelihood, we obtain

$$\text{Beta Prior}(\alpha, \beta) \;\xrightarrow[\text{Bernoulli}]{\{x_1, x_2, \ldots, x_n\}}\; \text{Beta Posterior}(\alpha + n_h,\; \beta + n_t)$$

where $n_h$ and $n_t$ are the numbers of ones and zeros in the data, respectively. Taking the posterior mean as the point estimate,

$$\therefore \hat{p} = \mathbb{E}[\text{Posterior}] = \mathbb{E}[\text{Beta}(\alpha + n_h, \beta + n_t)] = \frac{\alpha + n_h}{\alpha + n_h + \beta + n_t}$$
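A minimal sketch of this conjugate Beta-Bernoulli update, assuming 0/1 data and an illustrative Beta(2, 2) prior; the variable names mirror the notation above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated Bernoulli data with an assumed true p = 0.7.
x = rng.binomial(1, 0.7, size=100)
n_h = int(x.sum())        # number of ones ("heads")
n_t = len(x) - n_h        # number of zeros ("tails")

# Beta(alpha, beta) prior; the posterior is Beta(alpha + n_h, beta + n_t).
alpha, beta = 2.0, 2.0
post_alpha, post_beta = alpha + n_h, beta + n_t

# Point estimate: the posterior mean.
p_hat = post_alpha / (post_alpha + post_beta)
print(p_hat)
```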
Gaussian Mixture Models
Gaussian Mixture Models are a type of probabilistic model used to represent
complex data distributions by combining multiple Gaussian distributions.
The procedure is as follows:
• Step 1: Generate a mixture component $z_i \in \{1, 2, \ldots, K\}$ for each data point, where

$$P(z_i = k) = \pi_k \qquad \left[\sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1 \;\; \forall k\right]$$

• Step 2: Generate $x_i \sim \mathcal{N}(\mu_{z_i}, \sigma^2_{z_i})$.

Hence, there are $3K$ parameters ($K$ means, $K$ variances and $K$ mixture weights). However, since $\sum_{k=1}^{K} \pi_k = 1$, the number of parameters to be estimated becomes $3K - 1$ for a GMM with $K$ components.
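The two-step generative story can be simulated directly. Below is a small sketch with made-up parameters for K = 3 one-dimensional components; the specific values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative GMM parameters for K = 3 components.
pi = np.array([0.5, 0.3, 0.2])      # mixture weights, sum to 1
mu = np.array([-2.0, 0.0, 3.0])     # component means
sigma = np.array([0.5, 1.0, 0.8])   # component standard deviations

n = 500
z = rng.choice(len(pi), size=n, p=pi)        # Step 1: pick a component z_i
x = rng.normal(loc=mu[z], scale=sigma[z])    # Step 2: sample x_i ~ N(mu_{z_i}, sigma_{z_i}^2)
```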
Likelihood of GMMs
$$\mathcal{L}\left(\begin{matrix}\mu_1, \mu_2, \ldots, \mu_K \\ \sigma^2_1, \sigma^2_2, \ldots, \sigma^2_K \\ \pi_1, \pi_2, \ldots, \pi_K\end{matrix}\;;\; x_1, x_2, \ldots, x_n\right) = \prod_{i=1}^{n} f_{\text{mix}}\left(x_i\;;\; \begin{matrix}\mu_1, \mu_2, \ldots, \mu_K \\ \sigma^2_1, \sigma^2_2, \ldots, \sigma^2_K \\ \pi_1, \pi_2, \ldots, \pi_K\end{matrix}\right)$$

$$= \prod_{i=1}^{n} \left[\sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x_i;\, \mu_k, \sigma^2_k)\right]$$

$$\therefore \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log\left[\sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}}\right]$$

where

$$\sum_{k=1}^{K} \pi_k = 1$$
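The log-likelihood above translates directly into code. Here is a sketch, assuming one-dimensional data and reusing the pi, mu, sigma arrays from the sampling example.

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, pi, mu, sigma):
    """Return sum_i log( sum_k pi_k * N(x_i; mu_k, sigma_k^2) )."""
    # densities[i, k] = N(x_i; mu_k, sigma_k^2)
    densities = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.sum(np.log(densities @ pi))
```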
Estimating the Parameters
Since $\log$ is a concave function, Jensen's inequality lets us lower-bound the log-likelihood of a GMM rather than maximize it directly. Recall

$$\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log\left[\sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}}\right]$$

By introducing parameters $\{\lambda^i_1, \lambda^i_2, \ldots, \lambda^i_K\}$ for each data point $x_i$ such that $\sum_{k=1}^{K} \lambda^i_k = 1$ and $0 \le \lambda^i_k \le 1$ for all $i, k$, we obtain:

$$\log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log\left[\sum_{k=1}^{K} \lambda^i_k \left(\frac{\pi_k}{\lambda^i_k} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}}\right)\right]$$

$$\therefore \text{modified\_log}\,\mathcal{L}(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda^i_k \log\left(\frac{\pi_k}{\lambda^i_k} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}}\right) \tag{1}$$

Note that, by Jensen's inequality, the modified log-likelihood (1) gives a lower bound for the true log-likelihood at $\theta$. Finally, to get the parameters, we alternate the following two maximizations (a numerical sketch of both steps follows the list):
• To get $\theta$: Fix $\lambda$ and maximize over $\theta$:

$$\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda^i_k \log\left(\frac{\pi_k}{\lambda^i_k} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}}\right)$$

which gives the closed-form updates

$$\hat{\mu}^{\text{MML}}_k = \frac{\sum_{i=1}^{n} \lambda^i_k\, x_i}{\sum_{i=1}^{n} \lambda^i_k} \qquad \hat{\sigma}^{2\,\text{MML}}_k = \frac{\sum_{i=1}^{n} \lambda^i_k\, (x_i - \hat{\mu}^{\text{MML}}_k)^2}{\sum_{i=1}^{n} \lambda^i_k} \qquad \hat{\pi}^{\text{MML}}_k = \frac{\sum_{i=1}^{n} \lambda^i_k}{n}$$
• To get $\lambda$: Fix $\theta$ and maximize over $\lambda$. For any $i$:

$$\max_{\lambda^i_1, \lambda^i_2, \ldots, \lambda^i_K} \sum_{k=1}^{K} \left[\lambda^i_k \log\left(\pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}}\right) - \lambda^i_k \log \lambda^i_k\right] \quad \text{s.t.} \;\; \sum_{k=1}^{K} \lambda^i_k = 1,\; 0 \le \lambda^i_k \le 1$$

whose solution is

$$\hat{\lambda}^{i\,\text{MML}}_k = \frac{\pi_k \cdot \frac{1}{\sqrt{2\pi}\,\sigma_k}\, e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}}}{\sum_{k'=1}^{K} \pi_{k'} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_{k'}}\, e^{-\frac{(x_i - \mu_{k'})^2}{2\sigma_{k'}^2}}}$$
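A minimal sketch of these two alternating maximizations, assuming one-dimensional data; the array lam[i, k] plays the role of $\lambda^i_k$.

```python
import numpy as np
from scipy.stats import norm

def update_lambda(x, pi, mu, sigma):
    """Fix theta, maximize over lambda: lam[i, k] proportional to pi_k * N(x_i; mu_k, sigma_k^2)."""
    weighted = pi[None, :] * norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

def update_theta(x, lam):
    """Fix lambda, maximize over theta: closed-form updates for mu_k, sigma_k^2 and pi_k."""
    nk = lam.sum(axis=0)                                              # effective count per component
    mu = (lam * x[:, None]).sum(axis=0) / nk
    sigma2 = (lam * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk
    pi = nk / len(x)
    return mu, sigma2, pi
```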
EM Algorithm
The EM (Expectation-Maximization) algorithm is a popular method for estimating
the parameters of statistical models with incomplete data by iteratively
alternating between expectation and maximization steps until convergence to a
stable solution.
The algorithm is as follows:
• Initialize $\theta^0 = \left\{\begin{matrix}\mu_1, \mu_2, \ldots, \mu_K \\ \sigma^2_1, \sigma^2_2, \ldots, \sigma^2_K \\ \pi_1, \pi_2, \ldots, \pi_K\end{matrix}\right\}$ using Lloyd's algorithm.

• Until convergence ($\|\theta^{t+1} - \theta^{t}\| \le \epsilon$, where $\epsilon$ is the tolerance parameter), do the following:

$$\lambda^{t+1} = \underset{\lambda}{\arg\max}\; \text{modified\_log}\,\mathcal{L}(\theta^{t}, \lambda) \quad \rightarrow \text{Expectation step}$$

$$\theta^{t+1} = \underset{\theta}{\arg\max}\; \text{modified\_log}\,\mathcal{L}(\theta, \lambda^{t+1}) \quad \rightarrow \text{Maximization step}$$

The EM algorithm produces a soft clustering. For hard clustering using EM, a further step is involved:

• For a point $x_i$, assign it to a cluster using the following equation:

$$z_i = \underset{k}{\arg\max}\; \lambda^i_k$$

A complete loop implementing these steps is sketched below.
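Putting the pieces together, here is a minimal EM loop under the assumptions of the earlier sketches; for brevity the Lloyd's (k-means) initialization is replaced by a crude random one, and convergence is checked on the means only.

```python
import numpy as np
from scipy.stats import norm

def fit_gmm_em(x, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a one-dimensional GMM; returns parameters and hard cluster assignments."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False)   # crude init: K random data points
    sigma = np.full(K, x.std())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # Expectation step: responsibilities lam[i, k]
        weighted = pi[None, :] * norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
        lam = weighted / weighted.sum(axis=1, keepdims=True)
        # Maximization step: closed-form parameter updates
        nk = lam.sum(axis=0)
        new_mu = (lam * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((lam * (x[:, None] - new_mu[None, :]) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
        converged = np.abs(new_mu - mu).max() <= tol
        mu = new_mu
        if converged:
            break
    z = lam.argmax(axis=1)                      # hard assignment: z_i = argmax_k lam[i, k]
    return mu, sigma, pi, z
```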
Acknowledgments
Professor Arun Rajkumar: The content, including the concepts and notations
presented in this document, has been sourced from his slides and lectures.
His expertise and educational materials have greatly contributed to the
development of this document.