L6: Parameter estimation
• Introduction
• Parameter estimation
• Maximum likelihood
• Bayesian estimation
• Numerical examples
• In previous lectures we showed how to build classifiers when the
underlying densities are known
– Bayesian Decision Theory introduced the general formulation
– Quadratic classifiers covered the special case of unimodal Gaussian data
• In most situations, however, the true distributions are unknown
and must be estimated from data
– Two approaches are commonplace
• Parameter Estimation (this lecture)
• Non-parametric Density Estimation (the next two lectures)
• Parameter estimation
– Assume a particular form for the density (e.g. Gaussian), so only the
parameters (e.g., mean and variance) need to be estimated
• Maximum Likelihood
• Bayesian Estimation
• Non-parametric density estimation
– Assume NO knowledge about the density
• Kernel Density Estimation
• Nearest Neighbor Rule
ML vs. Bayesian parameter estimation
• Maximum Likelihood
– The parameters are assumed to be FIXED but unknown
– The ML solution seeks the parameter value that “best” explains the dataset X
$\hat{\theta} = \arg\max_{\theta} p(X|\theta)$
• Bayesian estimation
– Parameters are assumed to be random variables with some (assumed)
known a priori distribution
– Bayesian methods seek to estimate the posterior density 𝑝(𝜃|𝑋)
– The final density 𝑝(𝑥|𝑋) is obtained by integrating out the parameters
$p(x|X) = \int p(x|\theta)\, p(\theta|X)\, d\theta$
[Figure: two panels comparing Maximum Likelihood and Bayesian estimation. Left: the likelihood p(X|θ) with its peak at the point estimate θ̂. Right: the prior p(θ) and the sharper posterior p(θ|X).]
Maximum Likelihood
• Problem definition
– Assume we seek to estimate a density 𝑝(𝑥) that is known to depend
on a number of parameters $\theta = [\theta_1, \theta_2, \ldots, \theta_M]^T$
• For a Gaussian pdf, 𝜃1 = 𝜇, 𝜃2 = 𝜎 and 𝑝(𝑥) = 𝑁(𝜇, 𝜎)
• To make the dependence explicit, we write 𝑝(𝑥|𝜃)
– Assume we have a dataset $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ drawn independently
from the distribution 𝑝(𝑥|𝜃) (an i.i.d. set)
• Then we can write
$p(X|\theta) = \prod_{k=1}^{N} p(x^{(k)}|\theta)$
• The ML estimate of 𝜃 is the value that maximizes the likelihood $p(X|\theta)$
$\hat{\theta} = \arg\max_{\theta} p(X|\theta)$
• This corresponds to the intuitive idea of choosing the value of 𝜃 that is
most likely to give rise to the data
• For convenience, we will work with the log likelihood
– Because the log is a monotonically increasing function, the two maximizations yield the same solution:
$\hat{\theta} = \arg\max_{\theta} p(X|\theta) = \arg\max_{\theta} \log p(X|\theta)$
[Figure: taking logs maps p(X|θ) to log p(X|θ) without changing the location of the maximizer θ̂.]
– Hence, the ML estimate of 𝜃 can be written as:
$\hat{\theta} = \arg\max_{\theta} \log \prod_{k=1}^{N} p(x^{(k)}|\theta) = \arg\max_{\theta} \sum_{k=1}^{N} \log p(x^{(k)}|\theta)$
• This simplifies the problem, since now we have to maximize a sum of
terms rather than a long product of terms (see the numerical sketch below)
• An added advantage of taking logs will become very clear when the
distribution is Gaussian
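To make this concrete, here is a minimal numerical sketch (assuming NumPy; the toy data, the true mean of 2.0, the known σ = 1, and the grid range are illustrative choices, not taken from the slides). It evaluates the sum of log-terms on a grid of candidate means and picks the argmax; the next slide derives the same answer in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                     # standard deviation assumed known
x = rng.normal(loc=2.0, scale=sigma, size=100)  # i.i.d. dataset X (toy data)

def log_likelihood(mu, x, sigma):
    """Sum over k of log p(x^(k) | mu) for a Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Brute-force the ML estimate by evaluating the sum of logs on a grid of means
grid = np.linspace(0.0, 4.0, 2001)
ll = np.array([log_likelihood(mu, x, sigma) for mu in grid])
mu_ml = grid[np.argmax(ll)]

print(mu_ml, x.mean())   # the grid argmax lands (up to grid resolution) on the sample mean
```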
Example: Gaussian case, 𝝁 unknown
• Problem statement
– Assume a dataset $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ and a density of the form
$p(x) = N(\mu, \sigma)$ where 𝜎 is known
– What is the ML estimate of the mean?
$\theta = \mu \;\Rightarrow\; \hat{\theta} = \arg\max_{\theta} \sum_{k=1}^{N} \log p(x^{(k)}|\theta)$
$= \arg\max_{\theta} \sum_{k=1}^{N} \log\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}\left(x^{(k)}-\mu\right)^2\right)\right]$
$= \arg\max_{\theta} \sum_{k=1}^{N} \left[\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\left(x^{(k)}-\mu\right)^2\right]$
– The maxima of a function are defined by the zeros of its derivative
$\frac{\partial}{\partial\theta}\sum_{k=1}^{N}\log p(x^{(k)}|\theta) = \sum_{k=1}^{N}\frac{\partial}{\partial\theta}\log p(x^{(k)}|\theta) = \sum_{k=1}^{N}\frac{x^{(k)}-\mu}{\sigma^2} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N}\sum_{k=1}^{N}x^{(k)}$
– So the ML estimate of the mean is the average value of the training
data, a very intuitive result!
Example: Gaussian case, both 𝝁 and 𝝈 unknown
• A more general case when neither 𝝁 nor 𝝈 is known
– Fortunately, the problem can be solved in the same fashion
– The derivative becomes a gradient since we have two variables
$\theta = \begin{bmatrix}\theta_1=\mu \\ \theta_2=\sigma^2\end{bmatrix} \;\Rightarrow\; \nabla_{\theta} = \begin{bmatrix}\dfrac{\partial}{\partial\theta_1}\sum_{k=1}^{N}\log p(x^{(k)}|\theta) \\[2ex] \dfrac{\partial}{\partial\theta_2}\sum_{k=1}^{N}\log p(x^{(k)}|\theta)\end{bmatrix} = \sum_{k=1}^{N}\begin{bmatrix}\dfrac{1}{\theta_2}\left(x^{(k)}-\theta_1\right) \\[2ex] -\dfrac{1}{2\theta_2}+\dfrac{\left(x^{(k)}-\theta_1\right)^2}{2\theta_2^2}\end{bmatrix} = 0$
– Solving for 𝜃1 and 𝜃2 yields
$\hat{\theta}_1 = \frac{1}{N}\sum_{k=1}^{N}x^{(k)}\,; \qquad \hat{\theta}_2 = \frac{1}{N}\sum_{k=1}^{N}\left(x^{(k)}-\hat{\theta}_1\right)^2$
• Therefore, the ML estimate of the variance is the sample variance of the dataset,
again a very pleasing result
– Similarly, it can be shown that the ML estimates for the multivariate
Gaussian are the sample mean vector and sample covariance matrix (verified numerically in the sketch below)
$\hat{\mu} = \frac{1}{N}\sum_{k=1}^{N}x^{(k)}\,; \qquad \hat{\Sigma} = \frac{1}{N}\sum_{k=1}^{N}\left(x^{(k)}-\hat{\mu}\right)\left(x^{(k)}-\hat{\mu}\right)^T$
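A brief numerical check of these expressions (a sketch assuming NumPy; the 2-D Gaussian parameters and sample size below are arbitrary demo values). It computes the sample mean vector and the 1/N covariance explicitly and compares them with NumPy's built-in routines.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2-D Gaussian data: N samples, d = 2 features
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[1.0, 0.3], [0.3, 0.5]], size=500)
N = X.shape[0]

mu_ml = X.sum(axis=0) / N              # sample mean vector (1/N sum of x^(k))
diff = X - mu_ml
Sigma_ml = (diff.T @ diff) / N         # ML covariance with 1/N normalization

# NumPy agrees: np.mean, and np.cov with bias=True (i.e., 1/N instead of 1/(N-1))
assert np.allclose(mu_ml, X.mean(axis=0))
assert np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True))
```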
Bias and variance
• How good are these estimates?
– Two measures of “goodness” are used for statistical estimates
– BIAS: how close is the estimate to the true value?
– VARIANCE: how much does it change for different datasets?
[Figure: an estimate's BIAS is its offset from the TRUE value; its VARIANCE is its spread across datasets.]
– The bias-variance tradeoff
• In most cases, you can only decrease one of them at the expense of the
other
[Figure: two estimators of the TRUE value compared: one with low bias and high variance, the other with high bias and low variance.]
• What is the bias of the ML estimate of the mean?
$E\left[\hat{\mu}\right] = E\left[\frac{1}{N}\sum_{k=1}^{N}x^{(k)}\right] = \frac{1}{N}\sum_{k=1}^{N}E\left[x^{(k)}\right] = \mu$
– Therefore the sample mean is an unbiased estimate of the true mean
• What is the bias of the ML estimate of the variance?
$E\left[\hat{\sigma}^2\right] = E\left[\frac{1}{N}\sum_{k=1}^{N}\left(x^{(k)}-\hat{\mu}\right)^2\right] = \frac{N-1}{N}\sigma^2 \neq \sigma^2$
– Thus, the ML estimate of variance is BIASED
• This is because the ML estimate of the variance uses $\hat{\mu}$ (the estimated mean) instead of the true mean 𝜇
– How “bad” is this bias?
• For 𝑁 → ∞ the bias becomes zero asymptotically
• The bias is only noticeable when we have very few samples, in which case
we should not be doing statistics in the first place!
– Notice that MATLAB uses an unbiased estimate of the covariance (the effect of the 1/(N−1) correction is illustrated in the sketch below)
$\hat{\Sigma}_{UNBIASED} = \frac{1}{N-1}\sum_{k=1}^{N}\left(x^{(k)}-\hat{\mu}\right)\left(x^{(k)}-\hat{\mu}\right)^T$
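The bias can be made visible with a small simulation (a sketch assuming NumPy; the true variance, the sample size, and the number of trials are arbitrary demo values). Averaging each estimator over many independent datasets shows the 1/N estimator undershooting by the factor (N−1)/N, while the 1/(N−1) version is on target.

```python
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0     # sigma^2 of the generating Gaussian (chosen for the demo)
N = 5              # deliberately small sample size so the bias is visible

ml_est, unbiased_est = [], []
for _ in range(20000):                      # average over many datasets of size N
    x = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=N)
    ml_est.append(np.var(x))                # 1/N normalization (ML, biased)
    unbiased_est.append(np.var(x, ddof=1))  # 1/(N-1) normalization (unbiased)

print(np.mean(ml_est))        # close to (N-1)/N * true_var = 3.2
print(np.mean(unbiased_est))  # close to true_var = 4.0
```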
Bayesian estimation
• In the Bayesian approach, our uncertainty about the
parameters is represented by a pdf
– Before we observe the data, the parameters are described by a prior
density 𝑝(𝜃), which is typically very broad to reflect the fact that we
know little about their true values
– Once we obtain data, we make use of Bayes theorem to find the
posterior 𝑝(𝜃|𝑋)
• Ideally we want the data to sharpen the posterior 𝑝(𝜃|𝑋), that is, reduce
our uncertainty about the parameters
[Figure: the broad prior p(θ) is sharpened into the posterior p(θ|X) as data are observed.]
– Remember, though, that our goal is to estimate 𝑝(𝑥) or, more exactly,
𝑝(𝑥|𝑋), the density given the evidence provided by the dataset X
• Let us derive the expression of a Bayesian estimate
– From the definition of conditional probability
$p(x, \theta|X) = p(x|\theta, X)\, p(\theta|X)$
– 𝑝(𝑥|𝜃, 𝑋) is independent of X since knowledge of 𝜃 completely
specifies the (parametric) density. Therefore
$p(x, \theta|X) = p(x|\theta)\, p(\theta|X)$
– and, using the theorem of total probability we can integrate 𝜃 out:
$p(x|X) = \int p(x|\theta)\, p(\theta|X)\, d\theta$
• The only unknown in this expression is 𝑝(𝜃|𝑋); using Bayes rule
$p(\theta|X) = \frac{p(X|\theta)\, p(\theta)}{p(X)} = \frac{p(X|\theta)\, p(\theta)}{\int p(X|\theta)\, p(\theta)\, d\theta}$
• Where 𝑝(𝑋|𝜃) can be computed using the i.i.d. assumption
$p(X|\theta) = \prod_{k=1}^{N} p(x^{(k)}|\theta)$
• NOTE: The last three expressions suggest a procedure to estimate 𝑝(𝑥|𝑋).
This is not to say that integrating these expressions is easy! (The sketch below carries the integrations out numerically for a simple one-dimensional case.)
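When θ is one-dimensional, the three expressions can be evaluated numerically on a grid. The sketch below is one way to do it (assuming NumPy; the Gaussian likelihood, the prior, the dataset, and the grid limits are illustrative assumptions): it forms the posterior from the prior and the i.i.d. likelihood, normalizes it, and integrates θ out to obtain p(x|X) at a test point.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.3                                       # known std of p(x|theta)
X = rng.normal(loc=0.8, scale=sigma, size=10)     # observed i.i.d. dataset

def norm_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

theta = np.linspace(-1.0, 2.0, 3001)              # grid over the unknown mean

prior = norm_pdf(theta, 0.0, 0.3)                             # p(theta)
lik = np.prod(norm_pdf(X[:, None], theta, sigma), axis=0)     # p(X|theta), i.i.d. product
post = prior * lik
post /= np.trapz(post, theta)                                 # p(theta|X), normalized

# Predictive density p(x|X) = integral of p(x|theta) p(theta|X) dtheta
x_new = 1.0
p_x_given_X = np.trapz(norm_pdf(x_new, theta, sigma) * post, theta)
print(p_x_given_X)
```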
• Example
– Assume a univariate density where our random variable 𝑥 is generated
from a normal distribution with known standard deviation
– Our goal is to find the mean 𝜇 of the distribution given some i.i.d. data
points $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$
– To capture our knowledge about 𝜃 = 𝜇, we assume that it also follows
a normal density with mean 𝜇0 and standard deviation 𝜎0
$p_0(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left(-\frac{1}{2\sigma_0^2}\left(\theta-\mu_0\right)^2\right)$
– We use Bayes rule to develop an expression for the posterior 𝑝 𝜃 𝑋
$p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)} = \frac{p_0(\theta)}{p(X)}\prod_{k=1}^{N}p(x^{(k)}|\theta)$
$= \frac{1}{p(X)}\,\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left(-\frac{1}{2\sigma_0^2}\left(\theta-\mu_0\right)^2\right)\prod_{k=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}\left(x^{(k)}-\theta\right)^2\right)$
[Bishop, 1995]
– To understand how Bayesian estimation changes the posterior as more
data becomes available, we will find the maximum of 𝑝(𝜃|𝑋)
– The partial derivative with respect to 𝜃 = 𝜇 is
$\frac{\partial}{\partial\theta}\log p(\theta|X) = 0 \;\Rightarrow\; \frac{\partial}{\partial\mu}\left[-\frac{1}{2\sigma_0^2}\left(\mu-\mu_0\right)^2 - \sum_{k=1}^{N}\frac{1}{2\sigma^2}\left(x^{(k)}-\mu\right)^2\right] = 0$
– which, after some algebraic manipulation, becomes
$\mu_N = \frac{\sigma^2}{\sigma^2+N\sigma_0^2}\underbrace{\mu_0}_{PRIOR} + \frac{N\sigma_0^2}{\sigma^2+N\sigma_0^2}\underbrace{\frac{1}{N}\sum_{k=1}^{N}x^{(k)}}_{ML}$
• Therefore, as N increases, the estimate of the mean 𝜇𝑁 moves from the
initial prior 𝜇0 to the ML solution
– Similarly, the standard deviation 𝜎𝑁 can be found to be
$\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}$
[Bishop, 1995]
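The two closed-form expressions translate directly into a small helper (a sketch assuming NumPy; the function name gaussian_posterior is made up for illustration). Given the dataset, the known σ, and the prior N(μ0, σ0), it returns μ_N and σ_N.

```python
import numpy as np

def gaussian_posterior(x, mu0, sigma0, sigma):
    """Posterior p(mu|X) for a Gaussian likelihood with known sigma and a
    Gaussian prior N(mu0, sigma0): returns its mean mu_N and std sigma_N."""
    N = len(x)
    mu_ml = np.mean(x) if N > 0 else 0.0     # ML estimate (gets zero weight when N = 0)
    w = N * sigma0**2 / (sigma**2 + N * sigma0**2)
    mu_N = (1 - w) * mu0 + w * mu_ml         # weighted blend of prior mean and ML estimate
    sigma_N = np.sqrt(1.0 / (N / sigma**2 + 1.0 / sigma0**2))
    return mu_N, sigma_N
```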
Example
• Assume that the true mean of the distribution 𝑝(𝑥) is
𝜇 = 0.8 with standard deviation 𝜎 = 0.3
• In reality we would not know the true mean; we are just “playing God”
– We generate a number of examples from this distribution
– To capture our lack of knowledge about the mean, we assume a
normal prior 𝑝0(𝜃), with 𝜇0 = 0.0 and 𝜎0 = 0.3
– The figure below shows the posterior 𝑝(𝜇|𝑋)
• As 𝑁 increases, the estimate 𝜇𝑁 approaches its true value (𝜇 = 0.8) and
the spread 𝜎𝑁 (or uncertainty in the estimate) decreases
[Figure: posterior p(μ|X) for N = 0, 1, 5, 10 samples; as N increases the curves sharpen and their peaks move from the prior mean toward μ = 0.8.]
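A sketch that reproduces this experiment numerically (assuming NumPy; the random seed and print format are arbitrary, and the closed-form μ_N and σ_N from the previous slide are inlined so the snippet is self-contained). The exact numbers depend on the random draw, but the trend matches the figure: μ_N drifts from the prior mean 0.0 toward 0.8 and σ_N shrinks as N grows.

```python
import numpy as np

rng = np.random.default_rng(4)
true_mu, sigma = 0.8, 0.3      # "playing God": the true generating distribution
mu0, sigma0 = 0.0, 0.3         # broad normal prior on the unknown mean

x_all = rng.normal(true_mu, sigma, size=10)
for N in (0, 1, 5, 10):
    x = x_all[:N]
    mu_ml = x.mean() if N > 0 else 0.0
    w = N * sigma0**2 / (sigma**2 + N * sigma0**2)   # weight given to the data
    mu_N = (1 - w) * mu0 + w * mu_ml                 # posterior mean
    sigma_N = np.sqrt(1.0 / (N / sigma**2 + 1.0 / sigma0**2))
    print(f"N={N:2d}  mu_N={mu_N:5.2f}  sigma_N={sigma_N:5.3f}")
```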
ML vs. Bayesian estimation
• What is the relationship between these two estimates?
– By definition, 𝑝(𝑋|𝜃) peaks at the ML estimate
– If this peak is relatively sharp and the prior is broad, then the integral
below will be dominated by the region around the ML estimate
$p(x|X) = \int p(x|\theta)\,p(\theta|X)\,d\theta \;\cong\; p(x|\hat{\theta})\underbrace{\int p(\theta|X)\,d\theta}_{=1} = p(x|\hat{\theta})$
• Therefore, the Bayesian estimate will approximate the ML solution
– As we have seen in the previous example, as the number of
available samples increases, the posterior 𝑝(𝜃|𝑋) tends to sharpen
• Thus, the Bayesian estimate of 𝑝(𝑥) will approach the ML solution as
𝑁→∞
• In practice, only when we have a limited number of observations will the
two approaches yield different results
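The convergence can be checked numerically with the same grid-integration idea used earlier (a sketch assuming NumPy; all constants are demo values). For a test point x, the Bayesian predictive density obtained by integrating over the posterior is compared with the ML plug-in density p(x|θ̂); the two agree increasingly well as N grows.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, mu0, sigma0 = 0.3, 0.0, 0.3        # known likelihood std and prior (demo values)
theta = np.linspace(-1.0, 2.0, 3001)      # grid over the unknown mean

def norm_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

def predictive_vs_ml(x_data, x_new):
    """Bayesian predictive p(x_new|X) by grid integration vs. ML plug-in p(x_new|theta_ML)."""
    post = norm_pdf(theta, mu0, sigma0) * \
           np.prod(norm_pdf(x_data[:, None], theta, sigma), axis=0)
    post /= np.trapz(post, theta)                        # posterior p(theta|X)
    bayes = np.trapz(norm_pdf(x_new, theta, sigma) * post, theta)
    ml = norm_pdf(x_new, x_data.mean(), sigma)           # plug in the ML estimate
    return bayes, ml

x_all = rng.normal(0.8, sigma, size=100)
for N in (2, 10, 100):
    b, m = predictive_vs_ml(x_all[:N], x_new=0.9)
    print(f"N={N:3d}  Bayes={b:6.3f}  ML plug-in={m:6.3f}")
```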