Density Estimation
Manikandan Narayanan
Week 6 (Sep 4-, 2023)
PRML Jul-Nov 2023 (Grads section)
Acknowledgment of Sources
• Slides based on content from related courses and books:
• Courses:
• IITM – Profs. Arun/Harish/Chandra's PRML offerings (slides, quizzes, notes, etc.), Prof. Ravi's “Intro to ML” slides – cited respectively as [AR], [HR], [CC], [BR] in the bottom right of a slide.
• India – NPTEL PR course by IISc Prof. P.S. Sastry (slides, etc.) – cited as [PSS] in the bottom right of a slide.
• Books:
• PRML by Bishop (content, figures, slides, etc.) – cited as [CMB]
• Pattern Classification by Duda, Hart and Stork (content, figures, etc.) – [DHS]
• Mathematics for ML by Deisenroth, Faisal and Ong (content, figures, etc.) – [DFO]
• Information Theory, Inference and Learning Algorithms by David J.C. MacKay – [DJM]
Outline for Module M3
• M3. Density Estimation
• M3.0 Introduction/Warmup
• M3.1 Parametric methods
• M3.2 Nonparametric methods (only brief mention)
Outline for Module M3 (detailed)
• M3. Density Estimation
• M3.0 Introduction/Warmup
• M3.0.0 What does it mean to “learn” from data?
• M3.0.1 Intuitive warmup to ML (Estimation)
• M3.1 Parametric methods
(aka parameter learning of probabilistic models)
• M3.1.1 Maximum Likelihood Estimation (MLE)
(for continuous/discrete densities; mixture densities (later))
• M3.1.2 Bayesian Inference(/estimation)
• M3.2 Nonparametric methods (only brief mention)
• M3.2.0 General idea
• M3.2.1 K-nearest neighbors
Outline for Module M3
• M3. Density Estimation
• M3.0 Introduction/Warmup
• M3.0.0 What does it mean to “learn” from data?
• M3.0.1 Intuitive warmup to ML (Estimation)
• M3.1 Parametric methods
• M3.2 Nonparametric methods
Introduction to Density estimation
• So far: Bayesian decision theory (incl. Bayes classifiers)
• Two steps in a generative or discriminative model setting: Inference vs. Decision steps
• But how to do inference, i.e., how to “learn” a Bayes classifier from data?
• Estimate the joint density p(x,t) (class prior and class conditional) or the posterior density p(t|x).
• So density estimation needed in both generative/discriminative model settings.
[CMB]
Inference & Decision (steps in detail for a generative model):
Inference (density estimation) step: [worked out on slide]
Decision step: [worked out on slide]
[HR]
Density Estimation: Problem Statement & Notations
• Problem: “Learn a model from data” == “Estimate a density/distribution 𝔻 from independent observations (i.e., iid samples drawn from 𝔻)”
[DJM,CMB]
Approaches to Density estimation
• Parametric approach:
• some functional form of probability distribution assumed for the data points
• family of models parameterized by θ, i.e., p(x|θ) or f(x|θ), with each family member specified by a particular value of the parameter vector θ.
• Distribution could be simple (e.g., unimodal density) or complex (e.g., multi-modal density, incl. mixture density for mixture models)
• Nonparametric approach:
• distribution not assumed to be of a functional form specified by a few parameters; instead the form of the distribution typically depends on the size of the dataset.
• Still have some “parameters”, but they control model complexity (more so than specifying the exact functional form of the distribution)
[CMB]
Warmup: Intuitive depiction of density estimation example
Warmup: Parametric approach on a toy dataset
[DJM]
Recap: The (1D) Gaussian/Normal Distribution
[CMB]
Warmup: How to fit a 1D Gaussian to this data? – Intuition
[DJM]
Warmup: How to fit a 1D Gaussian to this data? – “Visual” MLE
[DJM]
Warmup: How to fit a 1D Gaussian to this data? (contd.)
[DJM]
Warmup: MLE for 1D Gaussian (the need for “continuous optimization”)
[DJM]
MLE for one 1D Gaussian (closed-form solution)
• Log likelihood: ln L(μ, σ²; D_N) = −(N/2) ln(2πσ²) − (1/(2σ²)) ∑_n (x_n − μ)²
• MLE estimates: μ̂ = (1/N) ∑_n x_n ; σ̂² = (1/N) ∑_n (x_n − μ̂)²
[DJM]
Outline for Module M3
• M3. Density Estimation
• M3.0 Introduction/Background
• M3.1 Parametric methods
• M3.1.1 Maximum Likelihood Estimation (MLE)
(for continuous/discrete densities; mixture densities (later))
• M3.1.2 Bayesian Inference(/estimation)
• M3.2 Nonparametric methods
ML approach
• Dataset D or D_N = {x_1, …, x_N} (iid samples from p(x|θ); p denotes a pmf or pdf)
• Likelihood (a function of the parameters, given the data; used as the score function):
L(θ; D_N) = P({x_1, …, x_N} | θ) = ∏_{n=1,…,N} p(x_n | θ)
• ML estimate (an optimization problem, solved analytically or numerically):
θ̂_ML = argmax_θ L(θ; D_N)
[PSS]
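A minimal Python sketch (not from the slides; the toy data and names are illustrative) of the generic ML recipe: when no closed form is available, maximize the log likelihood ∑_n log p(x_n|θ) numerically, here for a 1D Gaussian.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=100)   # toy iid sample

def neg_log_likelihood(theta, x):
    mu, log_sigma = theta          # optimize log(sigma) so that sigma stays > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
# mu_hat, sigma_hat come out close to the closed-form MLEs data.mean(), data.std()
```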
Examples we will see:
1) Gaussian (uni- and multi-variate)
2) Bernoulli
3) Categorical/Multinoulli
Example 1 (Continuous density): MLE for 1D Gaussian (toy dataset)
• What is the likelihood?
[DJM]
MLE for 1D Gaussian (general N datapoints)
Rough space for illustrations
MLE for one 1D Gaussian
• Log likelihood: ln L(μ, σ²; D_N) = −(N/2) ln(2πσ²) − (1/(2σ²)) ∑_n (x_n − μ)²
• MLE estimates: μ̂ = (1/N) ∑_n x_n ; σ̂² = (1/N) ∑_n (x_n − μ̂)²
[DJM]
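The closed-form MLE in code, as a small sketch (the toy numbers are hypothetical):

```python
import numpy as np

x = np.array([4.8, 5.1, 6.0, 5.5, 4.9])   # hypothetical toy sample
mu_mle = x.mean()                          # mu_hat = (1/N) sum_n x_n
var_mle = np.mean((x - mu_mle) ** 2)       # sigma2_hat = (1/N) sum_n (x_n - mu_hat)^2
```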
Bias of the estimator: E_{D_N={x_1,…,x_N}|θ}[θ̂_N] − θ
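A quick simulation sketch (assuming NumPy; not from the slides) showing that the MLE of the variance is biased: averaged over repeated datasets it is ≈ ((N−1)/N)σ², not σ².

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 5, 100_000
samples = rng.normal(0.0, 1.0, size=(trials, N))   # true mu = 0, sigma^2 = 1
mu_hat = samples.mean(axis=1, keepdims=True)
var_mle = ((samples - mu_hat) ** 2).mean(axis=1)   # MLE of variance per dataset
print(var_mle.mean())   # ~ (N-1)/N * sigma^2 = 0.8, i.e., bias = -sigma^2/N
```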
From uni- to multi-variate Gaussian
[CMB]
Maximum Likelihood for the Gaussian (1)
• Given i.i.d. data X = {x_1, …, x_N}, the log likelihood function is given by
ln p(X|μ, Σ) = −(ND/2) ln(2π) − (N/2) ln|Σ| − (1/2) ∑_n (x_n − μ)ᵀ Σ⁻¹ (x_n − μ)
• Sufficient statistics: ∑_n x_n and ∑_n x_n x_nᵀ
[CMB]
Maximum Likelihood for the Gaussian (2)
• Set the gradient of the log likelihood function to zero, ∂/∂μ ln p(X|μ, Σ) = ∑_n Σ⁻¹(x_n − μ) = 0, giving μ_ML = (1/N) ∑_n x_n
• Similarly, Σ_ML = (1/N) ∑_n (x_n − μ_ML)(x_n − μ_ML)ᵀ
[CMB]
Maximum Likelihood for the Gaussian (3)
Under the true distribution, E[μ_ML] = μ, but the 2nd estimator is biased: E[Σ_ML] = ((N−1)/N) Σ
Hence define the corrected (unbiased) estimator Σ̃ = (1/(N−1)) ∑_n (x_n − μ_ML)(x_n − μ_ML)ᵀ
[CMB]
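A sketch (hypothetical data) of the two covariance estimators side by side:

```python
import numpy as np

X = np.random.default_rng(2).multivariate_normal(
    [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]], size=500)
N = len(X)
mu_ml = X.mean(axis=0)              # mu_ML = (1/N) sum_n x_n
C = X - mu_ml
Sigma_ml = C.T @ C / N              # biased: E[Sigma_ML] = ((N-1)/N) Sigma
Sigma_tilde = C.T @ C / (N - 1)     # corrected (unbiased) estimator
```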
Derivation of MLE of Multi-variate Gaussian
• Facts on gradients (w.r.t. a vector or matrix of parameters):
• (∂/∂x) xᵀA x = Aᵀx + Ax (or 2Ax if A is symmetric)
• (∂/∂A) xᵀA x = x xᵀ (outer product)
• (∂/∂A) log |A| = A⁻ᵀ
• Bernoulli Distribution: Bern(x|μ) = μ^x (1−μ)^(1−x) for x ∈ {0,1}, with E[x] = μ and var[x] = μ(1−μ)
[CMB]
(Parametric) Density Estimation / Parameter Estimation / Parameter learning
• ML for Bernoulli
• Given: D = {x_1, …, x_N}, x_n ∈ {0,1}, iid from Bern(x|μ)
• Likelihood: p(D|μ) = ∏_n μ^(x_n) (1−μ)^(1−x_n); maximizing ln p(D|μ) gives μ_ML = (1/N) ∑_n x_n = m/N (m = no. of 1s/heads)
[CMB]
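In code, the Bernoulli MLE is just the sample mean; a sketch with hypothetical tosses:

```python
import numpy as np

tosses = np.array([1, 0, 1, 1, 0, 1])   # hypothetical coin tosses, 1 = heads
mu_ml = tosses.mean()                    # mu_ML = m/N, observed fraction of heads
```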
Example 3: From Bernoulli to Multinoulli
Categorical (Multinoulli) Variables
1-of-K coding scheme: x = (0, …, 0, 1, 0, …, 0)ᵀ with exactly one x_k = 1; p(x|μ) = ∏_k μ_k^(x_k), where μ_k ≥ 0 and ∑_k μ_k = 1
[CMB]
ML Parameter estimation
• Given: D = {x_1, …, x_N} of 1-of-K coded vectors, the likelihood is ∏_k μ_k^(N_k) with counts N_k = ∑_n x_nk; maximizing subject to ∑_k μ_k = 1 gives μ_k^ML = N_k / N
[CMB]
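A sketch (hypothetical 1-of-K data) of the categorical MLE as normalized counts:

```python
import numpy as np

X = np.array([[1, 0, 0],     # hypothetical 1-of-K observations, K = 3
              [0, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
N_k = X.sum(axis=0)          # category counts N_k = sum_n x_nk
mu_ml = N_k / X.shape[0]     # mu_k_ML = N_k / N
```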
Aside in Appendix
• Laplace’s sunrise problem: What is the probability that the sun will rise tomorrow? [https://en.wikipedia.org/wiki/Sunrise_problem]
• Prior information
• MLE cannot use additional information we may have about the parameter!
• Richer (compound or hierarchical) distbns. to fit the data, and robustness to outliers
• Treating parameters as r.v.s with their own distributions can offer a “natural” plug-and-play hierarchical modelling framework
to construct complex distbns. (marginal distbns. with heavy tails or overdispersion, etc.) that fit the data better.
[CMB]
Bayesian approach
• Bayesian approach in theory: View the parameter θ as an r.v., not as a fixed constant as in MLE
• Why Bayesian? The ML (frequentist/Fisherian) approach gives useful/consistent estimators for many distbns., but fails for small sample sizes and doesn't permit incorporation of additional info. about the parameter!
• Information about the r.v. before seeing the data is encoded as a prior distribution P(θ)
• Use Bayes rule to get the posterior, which captures your degree of belief/uncertainty about θ after seeing the data:
P(θ | D_N) ∝ P(θ) P(D_N | θ)
posterior ∝ prior × likelihood
[PSS,CMB]
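A sketch (not from the slides; the prior and data are illustrative) of posterior ∝ prior × likelihood computed on a grid, for a coin's heads-probability μ:

```python
import numpy as np
from scipy.stats import beta, binom

mu_grid = np.linspace(0.001, 0.999, 999)   # grid over the parameter mu
prior = beta.pdf(mu_grid, 2, 2)            # an assumed Beta(2,2) prior belief
m, N = 7, 10                               # observed: 7 heads in 10 tosses
likelihood = binom.pmf(m, N, mu_grid)      # P(D_N | mu), up to ordering of tosses
posterior = prior * likelihood             # unnormalized posterior
posterior /= posterior.sum() * (mu_grid[1] - mu_grid[0])   # normalize (Riemann sum)
```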
Three examples again:
Bayesian inference for:
Example 2: Bernoulli
Example 3: Categorical/Multinoulli
Example 1: Gaussian (mostly 1D, multi-variate in Appendix)
Example 2: Bayesian inference for Bernoulli
What is a good prior?
What is a good prior? Beta Distribution
• Distribution over μ ∈ [0,1]: Beta(μ|a,b) = (Γ(a+b)/(Γ(a)Γ(b))) μ^(a−1) (1−μ)^(b−1)
[CMB]
Bayesian Bernoulli: with m heads and l = N − m tails observed, the posterior is p(μ|m,l,a,b) ∝ μ^(m+a−1) (1−μ)^(l+b−1), i.e., Beta(μ | m+a, l+b)
[CMB]
Bayesian inference in action: Beta-Bernoulli
(Prior ∙ Likelihood = Posterior)
[CMB]
Pseudocounts, and updating these counts with new data – example
[CMB]
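The pseudocount update as a sketch (illustrative numbers): the Beta hyperparameters act as effective prior observations of heads/tails, so the conjugate update is just adding counts.

```python
a, b = 2, 2                      # prior pseudocounts of heads/tails
m, l = 7, 3                      # new data: m heads, l tails
a_post, b_post = a + m, b + l    # posterior is Beta(a + m, b + l)
post_mean = a_post / (a_post + b_post)   # posterior mean of mu
```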
Properties of the Posterior
As the size of the data set, N, increases, the posterior becomes more sharply peaked (its variance shrinks and its mean approaches the MLE)
[CMB]
Point estimate vs. using the full posterior: Prediction under the (full) posterior
What is the probability that the next coin toss will land heads up?
p(x = 1 | D) = ∫ μ p(μ|D) dμ = E[μ|D] = (m + a) / (m + a + l + b)
[CMB]
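Comparing the plug-in point estimate with the full-posterior predictive, as a sketch (same illustrative numbers):

```python
a, b = 2, 2                                 # Beta prior hyperparameters
m, l = 7, 3                                 # observed heads and tails
p_heads_mle = m / (m + l)                   # plug-in point estimate (MLE)
p_heads_bayes = (m + a) / (m + a + l + b)   # predictive under the full posterior
# the Bayesian answer is pulled toward the prior mean, especially for small N
```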
Example 3: Bayesian inference for Categorical/Multinoulli
Dirichlet Distribution for the Prior: Dir(μ|α) ∝ ∏_k μ_k^(α_k − 1), a distribution over the simplex ∑_k μ_k = 1
[CMB]
Bayesian Categorical: with counts N_k, the posterior is p(μ|D, α) = Dir(μ | α_1 + N_1, …, α_K + N_K)
[CMB]
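A sketch of the conjugate Dirichlet update (hypothetical counts):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])        # assumed symmetric Dirichlet prior
N_k = np.array([4, 1, 5])                # observed category counts
alpha_post = alpha + N_k                 # posterior is Dir(alpha + N_k)
p_next = alpha_post / alpha_post.sum()   # predictive prob. of each category
```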
Example 1: Bayesian inference for 1D Gaussian?
Bayesian Inference for the Gaussian (1)
• Assume σ² is known. Given i.i.d. data D = {x_1, …, x_N}, the likelihood function for μ is given by p(D|μ) = ∏_n N(x_n | μ, σ²)
• Note: this is a function of μ, not a probability distribution over μ (it is not normalized in μ)
Bayesian Inference for the Gaussian (4)
• Example: the posterior over μ for N = 0, 1, 2 and 10 data points [plot on slide]
Bayesian Inference for the Gaussian (5)
• Sequential Estimation
• The posterior obtained after observing N−1 data points becomes the prior when we observe the Nth data point.
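A sketch (assuming NumPy; values illustrative) of sequential estimation for the Gaussian mean with known σ²: each point triggers one conjugate update, with yesterday's posterior as today's prior.

```python
import numpy as np

sigma2 = 1.0                 # known noise variance
mu0, s2 = 0.0, 10.0          # prior N(mu_0, sigma_0^2)
rng = np.random.default_rng(3)
for x_n in rng.normal(2.0, np.sqrt(sigma2), size=10):
    s2_new = 1.0 / (1.0 / s2 + 1.0 / sigma2)   # precisions add
    mu0 = s2_new * (mu0 / s2 + x_n / sigma2)   # precision-weighted mean
    s2 = s2_new
print(mu0, s2)   # matches the batch posterior computed from all 10 points at once
```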
Bayesian Inference for the Gaussian (6)
• Now assume μ is known. The likelihood function for the precision λ = 1/σ² is given by p(D|λ) = ∏_n N(x_n | μ, λ⁻¹) ∝ λ^(N/2) exp(−(λ/2) ∑_n (x_n − μ)²)
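Since this likelihood is Gamma-shaped in λ, a Gamma(a₀, b₀) prior is conjugate; a sketch of the standard update (prior values illustrative):

```python
import numpy as np

mu = 2.0                                    # known mean
a0, b0 = 1.0, 1.0                           # assumed Gamma(a0, b0) prior on lambda
x = np.random.default_rng(4).normal(mu, 0.5, size=50)
a_post = a0 + len(x) / 2                    # a_N = a0 + N/2
b_post = b0 + 0.5 * np.sum((x - mu) ** 2)   # b_N = b0 + (1/2) sum_n (x_n - mu)^2
lambda_mean = a_post / b_post               # posterior mean of the precision
```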