Lecture 5

Recap

• The Bayes classifier is optimal for minimizing risk; risk
minimization is a very good objective.
• Given the class conditional densities, we can derive the
Bayes classifier for any loss function.
• There are other ways (other than through the loss function) to
trade off different errors; for example, the Neyman–Pearson (NP) classifier.
• The ROC curve also allows for such a trade-off.


Receiver Operating Characteristic (ROC)

• Consider a one-dimensional feature space and a 2-class
problem, with a classifier h(X) = 0 if X < τ.
• Consider equal priors, Gaussian class conditional
densities with equal variance, and 0-1 loss. Now let us
write the probability of error as a function of τ.


Receiver Operating Characteristic (ROC)

P[error] = 0.5 ∫_{−∞}^{τ} f1(X) dX + 0.5 ∫_{τ}^{∞} f0(X) dX
         = 0.5 Φ((τ − µ1)/σ) + 0.5 (1 − Φ((τ − µ0)/σ))

• As we vary τ we trade one kind of error against another.
In the Bayes classifier, the loss function determines the
‘exchange rate’. (A numerical illustration follows.)
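
The expression above is easy to evaluate numerically. The following is a
minimal Python sketch (not part of the lecture) that assumes illustrative
values µ0 = 0, µ1 = 2, σ = 1 and equal priors, and tabulates P[error] as τ
is varied.

    # Error probability as a function of the threshold tau, for the
    # two-Gaussian, equal-prior setting above (values are illustrative).
    import numpy as np
    from scipy.stats import norm

    mu0, mu1, sigma = 0.0, 2.0, 1.0

    def p_error(tau):
        # 0.5 * P[X < tau | class 1]  +  0.5 * P[X > tau | class 0]
        return 0.5 * norm.cdf((tau - mu1) / sigma) + \
               0.5 * (1 - norm.cdf((tau - mu0) / sigma))

    for tau in np.linspace(-2, 4, 13):
        print(f"tau = {tau:5.2f}   P[error] = {p_error(tau):.4f}")
    # The minimum occurs at tau = (mu0 + mu1)/2 = 1 in this symmetric case.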


ROC curve

• The receiver operating characteristic (ROC) curve is
one way to conveniently visualize and exploit this
trade-off.
• For a two-class classifier there are four possible
outcomes of a classification decision – two are correct
decisions and two are errors.
• Let ei denote the probability of wrongly assigning class i,
i = 0, 1.



ROC curve

Then we have

e0 = P[X ≤ τ | X ∈ class-1]   (a miss)
e1 = P[X > τ | X ∈ class-0]   (false alarm)
1 − e0 = P[X > τ | X ∈ class-1]   (correct detection)
1 − e1 = P[X ≤ τ | X ∈ class-0]   (correct rejection)

• For fixed class conditional densities, if we vary τ the
point (e1, 1 − e0) moves on a smooth curve in ℜ².
• This is traditionally called the ROC curve. (The choice of
coordinates is arbitrary.)


• For any fixed τ we can estimate e0 and e1 from the
training data.
• Hence, by varying τ we can trace out the ROC curve and decide which
may be the best operating point (as sketched below).
• This can be done for any threshold-based classifier,
irrespective of the class conditional densities.
• When the class conditional densities are Gaussian
with equal variance, we can use this procedure to
estimate the Bayes error also.
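
A minimal sketch (not part of the lecture) of this procedure on hypothetical
one-dimensional training data: sweep τ and estimate the false-alarm and
detection probabilities as empirical fractions.

    # Empirical ROC points: classify X >= tau as class 1, X < tau as class 0.
    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical training data: 500 samples per class, unit-variance Gaussians.
    x0 = rng.normal(0.0, 1.0, 500)   # class 0
    x1 = rng.normal(2.0, 1.0, 500)   # class 1

    for tau in np.linspace(-1, 3, 9):
        e1 = np.mean(x0 >= tau)        # P[X > tau | class 0], false alarm
        detect = np.mean(x1 >= tau)    # P[X > tau | class 1], i.e. 1 - e0
        print(f"tau = {tau:4.1f}   (e1, 1 - e0) = ({e1:.3f}, {detect:.3f})")
    # Plotting detect against e1 over a fine grid of tau gives the empirical ROC.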


• From our earlier error integral we get

(τ − µ0)/σ = Φ⁻¹(1 − e1) = a, say
(τ − µ1)/σ = Φ⁻¹(1 − (1 − e0)) = b, say

• Then |a − b| = |µ1 − µ0|/σ = d, the discriminability.
• Knowing e1 and (1 − e0), we can get d and hence the
Bayes error. For our given τ we can also get the
actual error probability. We can tweak τ to match the
Bayes error. (A sketch of this calculation follows.)
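
A minimal sketch (not part of the lecture) of this calculation. It assumes
equal priors and equal-variance Gaussian class conditionals, in which case
the Bayes-optimal threshold is the midpoint of the means and the Bayes error
works out to Φ(−d/2); the operating-point numbers below are hypothetical.

    # Recover d from one operating point (e1, 1 - e0), then the Bayes error.
    from scipy.stats import norm

    e1 = 0.16        # hypothetical estimated false-alarm rate at the current tau
    detect = 0.84    # hypothetical estimated detection rate (1 - e0)

    a = norm.ppf(1 - e1)        # (tau - mu0) / sigma
    b = norm.ppf(1 - detect)    # (tau - mu1) / sigma
    d = abs(a - b)              # |mu1 - mu0| / sigma, the discriminability
    bayes_error = norm.cdf(-d / 2)

    print(f"d = {d:.3f}, estimated Bayes error = {bayes_error:.4f}")
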
• We can, in general, use the ROC curve in
multidimensional cases also. Consider, for example,
h(X) = sgn(WᵀX + w0).
We can use the ROC curve to fix w0 after learning W.


Implementing Bayes Classifier

• We need the class conditional densities and the prior
probabilities.
• The prior probabilities can be estimated as the fraction of
examples from each class (see the sketch below).
• Since the examples are iid and the class labels of the
examples are known, we have some iid samples from
each class conditional distribution.
• The problem: Given {x1, x2, · · ·, xn} drawn iid
according to some distribution, estimate the
probability distribution / density.
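
A minimal sketch (not part of the lecture) of the first two steps on
hypothetical labeled data: estimate the priors as class fractions and collect
the per-class iid samples that will be used for density estimation.

    # Priors as class fractions, and per-class samples for density estimation.
    import numpy as np

    # Hypothetical labeled training data.
    X = np.array([1.2, -0.3, 2.5, 0.1, 1.9, -1.0, 2.2, 0.4])
    y = np.array([1,    0,   1,   0,   1,    0,   1,   0])

    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                     # P(class i) as a fraction of examples
    samples = {c: X[y == c] for c in classes}    # iid samples per class

    print(dict(zip(classes, priors)))
    print(samples)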


Estimating densities

• Two main approaches: parametric and
non-parametric.
• Parametric: We assume we have iid realizations of a
random variable X whose distribution is known except
for the values of a parameter vector. We estimate the
parameters of the density using the samples available.
• In the non-parametric approach we do not assume a form
for the density. It is often modelled as a convex
combination of some densities built using the samples.


Estimating parameters of a density

• Denote the density by f(x | θ) where θ is a parameter
vector.
• For example, let θ = (θ1, θ2) and

f(x | θ) = (1 / √(2π θ2)) exp( −(x − θ1)² / (2θ2) )

f(x | θ) is normal, with the mean and variance constituting
the parameter vector.
• Now estimation of the density is the same as estimation of a
parameter vector.


Notation

• Let X denote a random variable with density f(x | θ).
(We use the same notation even when X is a random vector.)
• An (iid) sample of size n consists of n iid realizations of
X.
• x = (x1, · · ·, xn)ᵀ – the sample, or the data.
We sometimes use D to denote the data.
• It can be thought of as a realization of (X1, · · ·, Xn)ᵀ
where the Xi are iid with density f(x | θ).


• A statistic is a function of the data, e.g., g(x1, · · ·, xn).
• An estimator is such a statistic: θ̂(x1, · · ·, xn).
• When we need to remember the sample size, we
write θ̂n.
• For example,

θ̂n = (1/n) Σ_{i=1}^{n} xi,

the well-known sample mean.


• There can be different estimators that are intuitively
reasonable.
• Let X be Poisson with parameter λ. Then the sample
mean as well as the sample variance seem to be
reasonable estimators for λ.
• Let X be normal with mean µ and variance unity.
Both the sample mean and the sample median seem good
choices.
• How does one choose among estimators? (A small simulation
below illustrates that the choice matters.)
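
A small simulation (not part of the lecture) comparing the sample mean and
the sample median as estimators of the mean of an N(µ, 1) density; the sample
size, number of trials and µ are arbitrary.

    # Both estimators are sensible, but their mean square errors differ.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, n, trials = 1.0, 50, 20000

    samples = rng.normal(mu, 1.0, size=(trials, n))
    means = samples.mean(axis=1)
    medians = np.median(samples, axis=1)

    print("MSE of sample mean  :", np.mean((means - mu) ** 2))    # close to 1/n = 0.02
    print("MSE of sample median:", np.mean((medians - mu) ** 2))  # larger, roughly (pi/2)/n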


• We need ‘good’ estimators.
• We need some criteria for ‘goodness’, and also methods
to obtain such estimators.
• In this course, we will consider two methods:
maximum likelihood and Bayesian estimation.
• To begin with, here is a simple introduction to some general
issues in estimation.


• An estimator θ̂ of a parameter (vector) θ is said to be
unbiased if E[θ̂] = θ.
• θ̂ is a function of the data. Hence the expectation is
with respect to the joint density of (X1, · · ·, Xn), the iid
random variables.
• Since Xi ∼ f(x | θ), the expectation above depends on the
value of θ. So we write

Eθ[θ̂] = θ


• Thus θ̂ is an unbiased estimator if, for every density in the
class of densities we are interested in (i.e., every
value of the parameter in the parameter space), the
expected value of the estimator is the true parameter
value.


• Let f(x | θ) be normal with mean θ and variance unity.
Let θ̂n = (1/n) Σ_{i=1}^{n} xi.
• Then E[θ̂n] = θ for all n, because E[Xi] = θ.
• The sample mean is an unbiased estimator of the actual
mean.
• Let θ̂′(x1, · · ·, xn) = 0.5(x1 + x2).
• This is also an unbiased estimator.
• So is θ̂′′ = x1.
• Unbiasedness alone is not enough.

• One possibility: We can say θ̂ is better than θ̂′ if, ∀θ,

Pθ[−a ≤ (θ̂ − θ) ≤ b] ≥ Pθ[−a ≤ (θ̂′ − θ) ≤ b]   ∀a, b > 0

(for any fixed sample size).
• Difficult to get such estimators.


• A weaker method is: θ̂ is better than θ̂′ if

Eθ[(θ̂ − θ)²] ≤ Eθ[(θ̂′ − θ)²]   ∀θ

• The mean square error of an estimator is defined by

MSEθ(θ̂) = Eθ[(θ̂ − θ)²]


• Lemma:

MSEθ(θ̂) = Vθ(θ̂) + [Bθ(θ̂)]²

where Vθ(θ̂) is the variance, given by

Vθ(θ̂) = Eθ[(θ̂ − Eθ[θ̂])²]

and Bθ(θ̂) is the bias, given by

Bθ(θ̂) = Eθ[θ̂] − θ

• For unbiased estimators the variance is the mean
square error (because the bias is zero).

• Proof:

MSE(θ̂) = E[(θ̂ − θ)²]
        = E[{(θ̂ − E[θ̂]) + (E[θ̂] − θ)}²]
        = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)² + 2E[(θ̂ − E[θ̂])(E[θ̂] − θ)]
        = V(θ̂) + [B(θ̂)]² + 2(E[θ̂] − θ) E[θ̂ − E[θ̂]]
        = V(θ̂) + [B(θ̂)]²

since E[θ̂ − E[θ̂]] = 0.


• For unbiased estimators, low variance implies low
MSE.
• Earlier example: When θ̂n is the sample mean,

Vθ(θ̂n) = σ²/n

For θ̂n′ = 0.5(x1 + x2),

Vθ(θ̂n′) = σ²/2

• Hence θ̂n is better than θ̂n′ (see the check below).
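
A quick empirical check (not part of the lecture) of these variances,
assuming standard normal data with σ = 1 and n = 20; all three estimators
are unbiased, but their variances differ widely.

    # Compare the variances of three unbiased estimators of the mean.
    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma, n, trials = 0.0, 1.0, 20, 50000

    x = rng.normal(theta, sigma, size=(trials, n))
    sample_mean = x.mean(axis=1)              # theta_hat_n
    two_point   = 0.5 * (x[:, 0] + x[:, 1])   # theta_hat'
    first_only  = x[:, 0]                     # theta_hat''

    for name, est in [("sample mean", sample_mean),
                      ("0.5(x1+x2)", two_point),
                      ("x1 alone", first_only)]:
        print(f"{name:12s}  mean = {est.mean():+.4f}   variance = {est.var():.4f}")
    # Expect variances close to sigma^2/n = 0.05, sigma^2/2 = 0.5 and sigma^2 = 1.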

• So, unbiased estimators with low mean square error
are good.
• For a given family of density functions, θ̂ is said to be a
uniformly minimum variance unbiased estimator (UMVUE) if
1. θ̂ is unbiased, and
2. MSEθ(θ̂n) ≤ MSEθ(θ̂n′) ∀n, θ,
for all θ̂′ that are unbiased estimators for θ.
• If we can get a UMVUE, then it is the ‘best’ estimator.
• In many cases, it is difficult to get a UMVUE.


• So far, we have been looking at figures of merit of estimators
at (all) fixed sample sizes.
• We can also think of asymptotic properties.
• An estimator θ̂ is said to be consistent for θ if

θ̂n → θ in probability, ∀θ

• For example, the sample mean is a consistent
estimator of the population mean (the expectation of the
random variable), by the law of large numbers.


• A consistent estimator need not be unbiased.
• Let θ be the mean and let

θ̂n = (1/(n + 1)) Σ_{i=1}^{n} xi

• This is not an unbiased estimator.
• But we have the following.


E[(θ̂n − θ)²] = E[ ( (1/(n+1)) Σ_{i=1}^{n} (xi − θ) − θ/(n+1) )² ]

             = (1/(n+1)²) n σ² + (1/(n+1)²) θ² − (2θ/(n+1)²) E[ Σ_i (xi − θ) ]

             = (n/(n+1)²) σ² + (1/(n+1)²) θ²


• Thus, E[(θ̂n − θ)²] → 0 as n → ∞.
• Hence, θ̂n is consistent (though it is biased); a small
numerical check is sketched below.
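
A small numerical check (not part of the lecture) that this biased estimator
still converges, assuming N(θ, 1) data with θ = 2.

    # Bias and MSE of (1/(n+1)) * sum(x_i) as the sample size grows.
    import numpy as np

    rng = np.random.default_rng(0)
    theta, trials = 2.0, 10000

    for n in [10, 100, 1000]:
        x = rng.normal(theta, 1.0, size=(trials, n))
        est = x.sum(axis=1) / (n + 1)
        bias = est.mean() - theta
        mse = np.mean((est - theta) ** 2)
        print(f"n = {n:5d}   bias = {bias:+.5f}   MSE = {mse:.5f}")
    # Both the bias (-theta/(n+1)) and the MSE shrink to zero as n grows.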


• Maximum Likelihood (ML) estimation is a general
procedure for obtaining consistent estimators.
• It is a parametric method.
• We estimate parameters of a density based on iid
samples.
• For most densities, ML estimates are consistent.



Maximum likelihood estimation

• Let x = {x1, x2, · · ·, xn} be the samples.
• The likelihood function is defined by

L(x, θ) = ∏_{j=1}^{n} f(xj | θ)

• If the samples are from a discrete random variable, f is
taken to be the mass function. If the samples are from a
continuous random variable, then f is the density
function.


Maximum likelihood estimation

• We essentially look at the likelihood function as a
function of θ, with the xj being known values (as given
by the data).
• To emphasize this we write it as L(θ, x) or L(θ | x) or
L(θ | D).
Recall that we also denote the data samples by D.


Maximum likelihood estimation contd..

• The maximum likelihood (ML) estimate of θ is the
value that (globally) maximizes the likelihood function.
• θ∗ is the MLE for θ if

L(θ∗ | x) ≥ L(θ | x)   ∀θ

• Finding the MLE is an optimization problem.


• For convenience in optimization we often take the log
likelihood, given by

l(θ | x) = log L(θ | x) = Σ_{j=1}^{n} log f(xj | θ)

• Now the ML estimate would be the maximizer of the log
likelihood.
• For many densities we can analytically solve for the
maximizer.
• In general we can use numerical optimization
techniques, as in the sketch below.
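
A minimal sketch (not part of the lecture) of the numerical route: minimize
the negative log likelihood with a general-purpose optimizer. A Gamma density
is used purely as an illustration, since its shape parameter has no
closed-form ML estimate.

    # Numerical MLE by minimizing the negative log likelihood.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import gamma

    rng = np.random.default_rng(0)
    data = rng.gamma(shape=2.0, scale=1.5, size=500)   # hypothetical iid samples

    def neg_log_likelihood(theta):
        shape, scale = theta
        if shape <= 0 or scale <= 0:
            return np.inf
        return -np.sum(gamma.logpdf(data, a=shape, scale=scale))

    result = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
    print("numerical MLE (shape, scale):", result.x)   # close to (2.0, 1.5) for large n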

Example

• Consider the one-dimensional case.
Let f(x | θ) ∼ N(µ, σ²) with θ1 = µ and θ2 = σ.

f(x | θ) = (1/(θ2 √(2π))) exp( −(x − θ1)² / (2θ2²) )

• Now the likelihood is given by

L(θ | x) = ∏_{j=1}^{n} (1/(θ2 √(2π))) exp( −(xj − θ1)² / (2θ2²) )


Example

• Hence the log likelihood would be

l(θ | x) = Σ_{j=1}^{n} [ −log(θ2) − 0.5 log(2π) − (xj − θ1)²/(2θ2²) ]

         = −n log(θ2) − 0.5 n log(2π) − Σ_{j=1}^{n} (xj − θ1)²/(2θ2²)

• To maximize the log likelihood we equate the partial
derivatives to zero.


• This gives

∂l/∂θ1 = (1/θ2²) Σ_{j=1}^{n} (xj − θ1) = 0

∂l/∂θ2 = −n/θ2 + (1/θ2³) Σ_{j=1}^{n} (xj − θ1)² = 0


• Solving these, we get

θ̂1 = (1/n) Σ_{j=1}^{n} xj

θ̂2² = (1/n) Σ_{j=1}^{n} (xj − θ̂1)²

• These are the ML estimates of the mean and variance of
a normal density (see the sketch below).
• The ML estimate of the variance is not unbiased.
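
A minimal sketch (not part of the lecture) computing these estimates on
simulated data; it also shows the bias of the 1/n variance estimate, whose
expectation is (n−1)/n times the true variance.

    # Gaussian ML estimates on simulated data, and the bias of the variance MLE.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, var, n, trials = 5.0, 4.0, 10, 50000

    x = rng.normal(mu, np.sqrt(var), size=(trials, n))
    mu_hat = x.mean(axis=1)                                 # theta_hat_1
    var_hat = np.mean((x - mu_hat[:, None]) ** 2, axis=1)   # the 1/n variance estimate

    print("E[mu_hat ] ≈", mu_hat.mean())    # close to 5.0
    print("E[var_hat] ≈", var_hat.mean())   # close to (n-1)/n * 4.0 = 3.6, not 4.0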

Example: discrete case

• Let X have a Bernoulli distribution. That is, X takes
values 0 and 1 with probability (1 − p) and p,
respectively.
• Then f(x | p) = p^x (1 − p)^(1−x), x ∈ {0, 1}.
• The mass function has only one parameter, namely p.
• Note that we must have 0 ≤ p ≤ 1.


• The likelihood function is

L(p | x) = ∏_{j=1}^{n} p^{xj} (1 − p)^{1−xj} = p^{n x̄} (1 − p)^{n − n x̄}

where x̄ = (1/n) Σ_{j=1}^{n} xj is the sample mean.
• The log likelihood is given by

l(p | x) = n x̄ log p + n(1 − x̄) log(1 − p)


• Differentiating the log likelihood with respect to p and
equating to zero, we get

n x̄ / p = n(1 − x̄) / (1 − p)

which implies

p̂ = x̄ = (1/n) Σ_{j=1}^{n} xj

• This is the ML estimate of the parameter p of a
Bernoulli random variable.
• The sample mean is the ML estimator (see the sketch below).
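
A minimal sketch (not part of the lecture) of this estimate on simulated
Bernoulli data; the true p is arbitrary, and the estimate settles towards it
as n grows, illustrating consistency.

    # Bernoulli MLE: the sample mean of 0/1 data.
    import numpy as np

    rng = np.random.default_rng(0)
    p_true = 0.3   # hypothetical true parameter

    for n in [10, 100, 1000, 100000]:
        x = rng.binomial(1, p_true, size=n)   # iid Bernoulli(p) samples
        p_hat = x.mean()                      # the ML estimate derived above
        print(f"n = {n:6d}   p_hat = {p_hat:.4f}")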

To Summarize

• To implement the Bayes classifier, we need to estimate
densities.
• Parametric methods assume that the form of the density is
known.
• An estimate (of a parameter) is a function of the (iid) data.
• An estimate is unbiased if its expectation is the true
value.
• The MSE of an unbiased estimator is its variance.
• A UMVUE is a good estimator to have.


• Consistent estimators converge to the true value in
probability as the sample size goes to infinity.
• Maximum likelihood estimation is a general procedure
that can find consistent estimators.
• The MLE is the maximizer of the likelihood function.
• Often, one maximizes the log likelihood.
• For many standard densities we can obtain the MLE
analytically.
