
M3. Density Estimation
Manikandan Narayanan
Week 6 (Sep 4-, 2023)
PRML Jul-Nov 2023 (Grads section)
Acknowledgment of Sources
• Slides based on content from related sources:
• Courses:
• IITM – Profs. Arun/Harish/Chandra’s PRML offerings (slides, quizzes, notes, etc.), Prof.
Ravi’s “Intro to ML” slides – cited respectively as [AR], [HR], [CC], [BR] in the bottom right
of a slide.
• India – NPTEL PR course by IISc Prof. P.S. Sastry (slides, etc.) – cited as [PSS] in the bottom
right of a slide.

• Books:
• PRML by Bishop. (content, figures, slides, etc.) – cited as [CMB]
• Pattern Classification by Duda, Hart and Stork. (content, figures, etc.) – [DHS]
• Mathematics for ML by Deisenroth, Faisal and Ong. (content, figures, etc.) – [DFO]
• Information Theory, Inference and Learning Algorithms by David JC MacKay – [DJM]
Outline for Module M3
• M3. Density Estimation
• M3.0 Introduction/Warmup
• M3.1 Parametric methods
• M3.2 Nonparametric methods (only brief mention)
Outline for Module M3 (detailed)
• M3. Density Estimation
• M3.0 Introduction/Warmup
• M3.0.0 What does it mean to “learn” from data?
• M3.0.1 Intuitive warmup to ML (Estimation)
• M3.1 Parametric methods
(aka parameter learning of probabilistic models)
• M3.1.1 Maximum Likelihood Estimation (MLE)
(for continuous/discrete densities, & mixture densities (later))
• M3.1.2 Bayesian Inference(/estimation)
• M3.2 Nonparametric methods (only brief mention)
• M3.2.0 General idea
• M3.2.1 K-nearest neighbors
Outline for Module M3
• M3. Density Estimation
• M3.0 Introduction/Warmup
• M3.0.0 What does it mean to “learn” from data?
• M3.0.1 Intuitive warmup to ML (Estimation)
• M3.1 Parametric methods
• M3.2 Nonparametric methods
Introduction to Density estimation
• So far: Bayesian decision theory (incl. Bayes classifiers)
• Two steps in a generative or discriminative model setting: Inference vs. Decision steps
• But how to do inference, i.e., how to “learn” a Bayes classifier from data?
• estimate the joint (class prior and class conditional) p(x,t) or posterior density p(t|x).
• So density estimation needed in both generative/discriminative model settings.

• Density estimation (informally aka learning the data distbn.):


• Addresses a fundamental question of what it means to learn from data.
• be it supervised (p(x,t) or p(t|x)) or unsupervised (p(x)) learning!
• Relies heavily on assumptions made in the model selection step – otherwise, an
ill-posed problem!

[CMB]
Inference & Decision (steps in detail for a generative model):

Setup (training data):

--------------------------------------------------------------------------------------------------------------
Inference (density est.):

--------------------------------------------------------------------------------------------------------------
Decision:
[HR]
Density Estimation: Problem Statement & Notations
• Problem: “Learn a model from data” == “Estimate a density/distribution 𝔻
from independent observations (i.e., iid samples drawn from 𝔻)”

• Input: N data points x_1, …, x_N assumed to be iid samples from an
unknown probability distribution 𝔻
• x_n ∼_iid 𝔻 for all n = 1, …, N.
• x_n ∈ ℝ^d

• Output: Probability density/distribution 𝔻 that “best fits” the data

• Univariate distbn. if d=1, and Multivariate/Joint distbn. if multiple (d>1) r.v.s are to be
modelled (e.g., fish length, width and color).
• Family/Form of distributions fixed at the “model selection” step to get a well-posed
problem.
Density estimation (intuitively in pictures)

[DJM,CMB]
Approaches to Density estimation
• Parametric approach:
• some functional form of probability distribution 𝐷 assumed for the data points
• family of models parameterized by θ, i.e., p(x|θ) or f(x|θ), with each family
member specified by a particular value of the parameter vector θ.
• Distribution could be simple (e.g., unimodal density) or complex (e.g., multi-modal
density, incl. mixture density for mixture models)

• Nonparametric approach:
• distribution not assumed to be of a functional form specified by a few parameters;
instead the form of the distribution typically depends on the size of the dataset.
• Still have some “parameters” but they control model complexity (more so than
specifying the exact functional form of the distribution)

[CMB]
Warmup: Intuitive depiction of density
estimation example
Warmup: Parametric approach on a toy
dataset

[DJM]
Recap: The (1D) Gaussian/Normal Distribution

[CMB]
Warmup: How to fit a 1D Gaussian to this
data? – Intuition

[DJM]
Warmup: How to fit a 1D Gaussian to this data?
– “Visual” MLE

[DJM]
Warmup: How to fit a 1D Gaussian to this data?
(contd.)

[DJM]
Warmup: MLE for 1D Gaussian (the need for
“continuous optimization”)

[DJM]
MLE for one 1D Gaussian (closed-form
solution)
• Log likelihood:

• MLE estimates:

[DJM]
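For reference, the log likelihood and the resulting closed-form MLE for fitting a 1D Gaussian N(x | μ, σ²) to data x_1, …, x_N can be written as:

```latex
\ln \mathcal{L}(\mu,\sigma^2) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2,
\qquad
\hat{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n,
\qquad
\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\hat{\mu}_{\mathrm{ML}})^2.
```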
Outline for Module M3
• M3. Density Estimation
• M3.0 Introduction/Background
• M3.1 Parametric methods
• M3.1.1 Maximum Likelihood Estimation (MLE)
(for continuous/discrete densities; mixture densities (later))
• M3.1.2 Bayesian Inference(/estimation)
• M3.2 Nonparametric methods
ML approach
• Dataset D or D_N = {x_1, …, x_N} (iid samples from p(x|θ); p denotes pmf or pdf)
• Likelihood (function of parameters, given the data, is used as the score function):
ℒ(θ; D_N) = P({x_1, …, x_N} | θ) = ∏_{n=1,…,N} p(x_n | θ)
• ML Estimate (opt. problem, solved analytically or numerically):

• Has desirable properties, mainly consistency (for “most” densities):
MLE converges in probab. to the true parameter(s).
[PSS]
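Spelled out, the ML estimate maximizes the (log) likelihood, and consistency means convergence in probability to the true parameter as N grows (under standard regularity conditions):

```latex
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta}\ \mathcal{L}(\theta; D_N)
                           = \arg\max_{\theta}\ \sum_{n=1}^{N} \ln p(x_n \mid \theta),
\qquad
\hat{\theta}_{\mathrm{ML}}^{(N)} \xrightarrow{\;p\;} \theta_{\mathrm{true}} \ \text{ as } N \to \infty.
```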
Examples we will see:
1) Gaussian (uni- and multi-variate)
2) Bernoulli
3) Categorical/Multinoulli
Example 1 (Continuous density): MLE for 1D
Gaussian (toy dataset)
• What is the likelihood?

[DJM]
MLE for 1D Gaussian (general N datapoints)
Rough space for illustrations
MLE for one 1D Gaussian
• Log likelihood:

• MLE estimates:

[DJM]
Bias of the estimator: 𝔼_{D_N = {x_1, …, x_N} | θ}[θ̂_N] − θ
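A minimal NumPy sketch of these estimators on synthetic data (the true parameters and sample size are arbitrary illustrative choices), comparing the biased ML variance with its bias-corrected version:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true, N = 5.0, 2.0, 20          # arbitrary toy settings
x = rng.normal(mu_true, sigma_true, size=N)    # iid samples from the "unknown" distribution

# Closed-form MLE for a 1D Gaussian
mu_ml = x.mean()                               # (1/N) * sum x_n
var_ml = ((x - mu_ml) ** 2).mean()             # (1/N) * sum (x_n - mu_ml)^2  (biased)
var_unbiased = var_ml * N / (N - 1)            # bias-corrected estimate

print(f"mu_ML = {mu_ml:.3f}, var_ML = {var_ml:.3f}, var_unbiased = {var_unbiased:.3f}")
```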
From uni- to multi-variate Gaussian

[CMB]
Maximum Likelihood for the Gaussian (1)
• Given i.i.d. data {x_1, …, x_N}, the log likelihood function is given by

• Sufficient statistics

[CMB]
Maximum Likelihood for the Gaussian (2)
• Set the gradient of the log likelihood function to zero,

• and solve to obtain

• Similarly

[CMB]
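Solving those stationarity conditions yields the standard estimators:

```latex
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n,
\qquad
\Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{x}_n-\mu_{\mathrm{ML}})(\mathbf{x}_n-\mu_{\mathrm{ML}})^{T}.
```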
Maximum Likelihood for the Gaussian (3)
Under the true distribution, there is a bias in the 2nd estimate (Σ_ML):

Hence define

[CMB]
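Concretely (a standard result): the ML mean is unbiased, the ML covariance is biased by a factor (N−1)/N, and the corrected estimator is

```latex
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu,
\qquad
\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N}\,\Sigma,
\qquad
\widetilde{\Sigma} = \frac{N}{N-1}\,\Sigma_{\mathrm{ML}} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n-\mu_{\mathrm{ML}})(\mathbf{x}_n-\mu_{\mathrm{ML}})^{T}.
```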
Derivation of MLE of Multi-variate Gaussian
• Facts on gradients (wrt vector or matrix of parameters):
• ∂/∂x (xᵀ A x) = Aᵀx + Ax (or 2Ax if A is symmetric)
• ∂/∂A (xᵀ A x) = x xᵀ (outer-product)
• ∂/∂A log |A| = A⁻ᵀ

• Gradient of LL(μ, Λ) := LL(μ, Σ⁻¹)
[From Secn. 13.5 of The Multivariate Gaussian, https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter13.pdf]
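As a sketch of how the facts above are used (treating Λ as an unconstrained matrix; a symmetrized derivative gives the same stationary point), setting the two gradients to zero recovers the estimators stated earlier:

```latex
\frac{\partial LL}{\partial \mu} = \Lambda \sum_{n=1}^{N}(\mathbf{x}_n - \mu) = 0
\;\Rightarrow\; \mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
\qquad
\frac{\partial LL}{\partial \Lambda} = \frac{N}{2}\,\Lambda^{-1} - \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n - \mu)(\mathbf{x}_n - \mu)^{T} = 0
\;\Rightarrow\; \Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \mu_{\mathrm{ML}})(\mathbf{x}_n - \mu_{\mathrm{ML}})^{T}.
```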


Outline for Module M3
• M3. Density Estimation
• M3.0 Introduction/Background
• M3.1 Parametric methods
• M3.1.1 Maximum Likelihood Estimation (MLE)
(for continuous/discrete densities, mixture densities (later))
• M3.1.2 Bayesian Inference(/estimation)
• M3.2 Nonparametric methods
Example 2: Bernoulli/Binary RVs
• Coin flipping: heads=1, tails=0

• Bernoulli Distribution

[CMB]
(Parametric) Density Estimation / Parameter
Estimation / Parameter learning
• ML for Bernoulli
• Given:

[CMB]
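In compact form, with m denoting the number of heads among the N tosses (count notation assumed here for illustration):

```latex
\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x},
\qquad
\ln \mathcal{L}(\mu) = \sum_{n=1}^{N}\big[x_n \ln\mu + (1-x_n)\ln(1-\mu)\big],
\qquad
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{m}{N}.
```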
Example 3: From Bernoulli to Multinoulli
Categorical (Multinoulli) Variables
1-of-K coding scheme:

[CMB]
ML Parameter estimation
• Given:

• Ensure Σ_k μ_k = 1, use a Lagrange multiplier.

[CMB]
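With N_k denoting the number of observations having x_k = 1 (count notation assumed here), the Lagrangian and its solution work out to:

```latex
\max_{\mu}\ \sum_{k=1}^{K} N_k \ln \mu_k + \lambda\Big(\sum_{k=1}^{K} \mu_k - 1\Big)
\;\Rightarrow\; \frac{N_k}{\mu_k} + \lambda = 0
\;\Rightarrow\; \mu_k^{\mathrm{ML}} = \frac{N_k}{N} \quad (\lambda = -N).
```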
Aside in Appendix

• An Aside: Relation between Bernoulli and Binomial distribution


• An Aside: Relation between Categorical and Multinomial Distribution
Outline for Module M3
• M3. Density Estimation
• M3.0 Introduction/Background
• M3.1 Parametric methods
• M3.1.1 Maximum Likelihood Estimation (MLE)
• M3.1.2 Bayesian Inference(/estimation)
• M3.2 Nonparametric methods
Motivation: Why go from MLE to Bayesian
inference?
• Small sample sizes - overfitting to training data 𝒟

• ⇒ Prediction: all future tosses will land heads up

• Laplace’s sunrise problem: What is the probability that the sun will rise tomorrow? [https://en.wikipedia.org/wiki/Sunrise_problem]

• Prior information
• MLE cannot use additional information we may have about the parameter!

• Richer (compound or hierarchical) distbns. to fit the data, and robustness to outliers
• Treating parameters as r.v.s with their own distributions can offer a “natural” plug-and-play hierarchical modelling framework
to construct complex distbns. (marginal distbns. with heavy tails or overdispersion, etc.) that fit the data better.

[CMB]
Bayesian approach
• Bayesian approach in theory: View parameter θ as a r.v. and not as a fixed constant as in MLE
• Why Bayesian? ML (frequentist/Fisherian) approach gives useful/consistent estimators for many distbns., but
fails for small sample sizes and doesn’t permit incorporation of additional info. about the parameter!

• Information about the r.v. before seeing the data is encoded as a prior distribution P(θ)

• Use Bayes rule to get posterior that captures your degree of belief/uncertainty about θ after seeing the data:
P(θ | D_N) ∝ P(θ) P(D_N | θ)
posterior ∝ prior × likelihood

• Bayesian approach in practice:


• Conjugate priors make calcn./interpretn. easy by ensuring posterior & prior follow same distbn.
• But may not be applicable always (use approximate inference such as MCMC/Gibbs sampling for more complex priors)
• What about that pesky hyperparameter (i.e., pseudocounts for beta distbn.)?
• Full (posterior) distribution vs. a point estimate?
• Posterior mode (MAP) or Posterior mean – a practical resort
• an ideal Bayesian can integrate over uncertainty around the parameter - posterior predictive distbn.

[PSS,CMB]
Three examples again:
Bayesian inference for:
Example 2: Bernoulli
Example 3: Categorical/Multinoulli
Example 1: Gaussian (mostly 1D, multi-variate in Appendix)
Example 2: Bayesian inference for Bernoulli
What is a good prior?
What is a good prior? Beta Distribution
• Distribution over μ ∈ [0, 1]

(converges for z > 0) [CMB]


Beta Distribution

[CMB]
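For reference, the Beta density over μ ∈ [0, 1] and the Gamma function that the “converges for z > 0” remark refers to:

```latex
\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1},
\qquad
\Gamma(z) = \int_{0}^{\infty} t^{z-1} e^{-t}\,dt \quad (z > 0),
\qquad
\mathbb{E}[\mu] = \frac{a}{a+b}.
```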
Bayesian Bernoulli

Beta distribution is the conjugate prior for the parameter μ of the Bernoulli distribution (or Bernoulli likelihood fn.).

[CMB]
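The conjugacy can be seen by multiplying a Beta(a, b) prior with the Bernoulli likelihood for m heads and l = N − m tails (a, b, m, l are the usual hyperparameter/count notation, assumed here):

```latex
p(\mu \mid D) \;\propto\; \underbrace{\mu^{a-1}(1-\mu)^{b-1}}_{\text{Beta prior}}\;\underbrace{\mu^{m}(1-\mu)^{l}}_{\text{likelihood}}
\;\Rightarrow\;
p(\mu \mid D) = \mathrm{Beta}(\mu \mid a+m,\; b+l).
```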
Bayesian inference in action: Beta-Bernoulli
(Prior ∙ Likelihood = Posterior)

[CMB]
Pseudocounts, and updating these counts
with new data - example

[CMB]
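A minimal sketch of this pseudocount bookkeeping in plain Python (the function name and the prior values a0 = b0 = 2 are illustrative choices, not from the slides):

```python
def beta_bernoulli_update(a, b, tosses):
    """Sequentially update Beta(a, b) pseudocounts with 0/1 observations."""
    for x in tosses:
        a, b = a + x, b + (1 - x)          # a head adds to a, a tail adds to b
    return a, b

a0, b0 = 2, 2                              # prior pseudocounts (illustrative)
a_post, b_post = beta_bernoulli_update(a0, b0, tosses=[1, 1, 0, 1])
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(posterior_mean, 3))   # 5 3 0.625
```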
Properties of the Posterior
As the size of the data set, N, increases, the posterior becomes more sharply peaked around the ML estimate.

[CMB]
Point estimate vs. using the full posterior:
Prediction under the (full) posterior
What is the probability that the next coin toss will land
heads up?

[CMB]
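Averaging over the full Beta posterior (rather than plugging in a point estimate) gives the posterior predictive, with m heads and l tails observed (notation as above):

```latex
p(x = 1 \mid D) = \int_{0}^{1} p(x = 1 \mid \mu)\, p(\mu \mid D)\, d\mu
                = \mathbb{E}[\mu \mid D]
                = \frac{m + a}{m + a + l + b}.
```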
Example 3: Bayesian inference for
Categorical/Multinoulli
Dirichlet Distribution for the Prior

Conjugate prior for the categorical likelihood fn., i.e., the Dirichlet distbn. is the conjugate prior for the parameters of the Categorical distbn.

[CMB]
Bayesian Categorical

[CMB]
Bayesian Categorical

[CMB]
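The update mirrors the Beta-Bernoulli case; with Dirichlet hyperparameters α_1, …, α_K and category counts N_1, …, N_K (notation assumed here):

```latex
\mathrm{Dir}(\mu \mid \alpha) \;\propto\; \prod_{k=1}^{K} \mu_k^{\alpha_k - 1},
\qquad
p(\mu \mid D) = \mathrm{Dir}(\mu \mid \alpha_1 + N_1, \ldots, \alpha_K + N_K).
```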
Example 1: Bayesian inference for 1D
Gaussian?
Bayesian Inference for the Gaussian (1)
• Assume σ² is known. Given i.i.d. data {x_1, …, x_N}, the likelihood function for μ is given by

• This has a Gaussian shape as a function of μ (but it is not a distribution over μ).
Bayesian Inference for the Gaussian (2)
• Combined with a Gaussian prior over μ,

• this gives the posterior

• Completing the square over μ, we see that


Bayesian Inference for the Gaussian (3)
• … where

• Note:
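Writing the prior as N(μ | μ_0, σ_0²) (notation assumed here), the posterior parameters take the standard form:

```latex
p(\mu \mid D) = \mathcal{N}(\mu \mid \mu_N, \sigma_N^2),
\qquad
\frac{1}{\sigma_N^{2}} = \frac{1}{\sigma_0^{2}} + \frac{N}{\sigma^{2}},
\qquad
\mu_N = \sigma_N^{2}\left(\frac{\mu_0}{\sigma_0^{2}} + \frac{\sum_{n=1}^{N} x_n}{\sigma^{2}}\right).
```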
Bayesian Inference for the Gaussian (4)
• Example: for N = 0, 1, 2 and 10.
Bayesian Inference for the Gaussian (5)
• Sequential Estimation

• The posterior obtained after observing N-1 data points becomes the prior when we observe the Nth data point.
Bayesian Inference for the Gaussian (6)
• Now assume μ is known. The likelihood function for λ = 1/σ² is given by

• This has a Gamma shape as a function of λ.

• (cf. Appendix for more on Bayesian inference of Gaussian)
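With a conjugate Gamma(a_0, b_0) prior on the precision λ (hyperparameter notation assumed here), the posterior is again a Gamma distribution:

```latex
\mathrm{Gam}(\lambda \mid a, b) \;\propto\; \lambda^{a-1} e^{-b\lambda},
\qquad
p(\lambda \mid D) = \mathrm{Gam}\!\left(\lambda \;\Big|\; a_0 + \frac{N}{2},\; b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2\right).
```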
