Lecture 6

PR NPTEL course

Recap

• To implement the Bayes classifier we need the class conditional densities.
• There are two main approaches to estimating densities – parametric and non-parametric.
• In the parametric method we assume that the form of the density is known and estimate its parameters.
• The maximum likelihood method is a general procedure for obtaining consistent estimators of the parameters.
• The Maximum Likelihood (ML) estimate is the maximizer of the likelihood (or log likelihood) function.
• For most standard density models, one can analytically derive the ML estimates.
• We have seen some examples of obtaining ML estimates.
• We now see more examples of ML estimates.


Example

• Suppose the assumed density for x is exponential:

      f(x | λ) = λ exp(−λx),   x ≥ 0

• Given iid data D = {x_1, · · · , x_n}, we need to estimate λ.
• The likelihood function is

      L(λ | D) = ∏_{i=1}^{n} λ exp(−λ x_i)

• The log likelihood function is

      l(λ | D) = ∑_{i=1}^{n} (ln(λ) − λ x_i)

• Differentiating w.r.t. λ and equating to zero, we get

      n/λ − ∑_{i=1}^{n} x_i = 0

• This gives us the final ML estimate as

      λ̂ = n / ∑_{i=1}^{n} x_i

• The final estimate is intuitively clear. (Note that E[x] = 1/λ, so λ̂ is simply one over the sample mean; a numerical check follows below.)
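The estimate λ̂ = n / ∑ x_i is easy to verify numerically. Below is a minimal sketch, not part of the lecture, assuming NumPy and an arbitrarily chosen true rate of 2.0 for the simulated data.

```python
# Sketch: ML estimate of the exponential rate, lambda_hat = n / sum(x_i),
# checked against data simulated with a known (assumed) rate.
import numpy as np

rng = np.random.default_rng(0)
true_lambda = 2.0                                          # assumed value for the simulation
x = rng.exponential(scale=1.0 / true_lambda, size=1000)    # iid samples

lambda_hat = x.size / x.sum()                              # ML estimate derived above
print(lambda_hat)                                          # close to 2.0 for large n
```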


Another Example

• Consider the multidimensional Gaussian density

      f(x | θ) = (1 / √((2π)^d |Σ|)) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

  where x ∈ ℜ^d and θ = (µ, Σ) are the parameters.
• For a random vector x having the above joint density, µ ∈ ℜ^d is the mean vector (i.e., E[x] = µ) and the d × d matrix Σ is the covariance matrix defined by Σ = E[(x − µ)(x − µ)^T].
• To find the ML estimate for the parameters, we have to maximize the log likelihood.
• Recall that the log likelihood function is defined by

      l(θ | D) = ∑_{i=1}^{n} ln(f(x_i | θ))

  where D = {x_1, · · · , x_n} constitutes the iid data from which we are estimating the parameters of the density.
• For this model the log likelihood is given by

      l(θ | D) = ∑_{i=1}^{n} [ −(1/2) ln((2π)^d |Σ|) − (1/2) (x_i − µ)^T Σ^{−1} (x_i − µ) ]

  where θ = (µ, Σ) constitute the parameters to be estimated.
• To find the ML estimates, we have to equate the partial derivatives of l (with respect to the parameters) to zero and solve.
• Now, ∂l/∂µ = 0 gives us

      ∑_{i=1}^{n} Σ^{−1} (x_i − µ) = 0

  which gives us the ML estimate for µ as

      µ̂ = (1/n) ∑_{i=1}^{n} x_i

  Thus, even in the multidimensional case, the ML estimate for the mean is the sample mean.
• Finding the partial derivative with respect to Σ is algebraically involved.
• However, one can show that the ML estimate for Σ is

      Σ̂ = (1/n) ∑_{i=1}^{n} (x_i − µ̂)(x_i − µ̂)^T

• Again, the final ML estimate is intuitively obvious. (Recall that Σ = E[(x − µ)(x − µ)^T].) A numerical check follows below.
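These two formulas can also be checked numerically. The sketch below is not from the lecture; it assumes NumPy and arbitrary values for µ and Σ, and uses the 1/n factor derived above rather than the 1/(n−1) of the unbiased sample covariance.

```python
# Sketch: ML estimates for a multivariate Gaussian,
# mu_hat = sample mean and Sigma_hat = (1/n) * sum (x_i - mu_hat)(x_i - mu_hat)^T.
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])                    # assumed, d = 2
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])    # assumed positive definite covariance
X = rng.multivariate_normal(mu_true, Sigma_true, size=2000)  # iid samples as rows

n = X.shape[0]
mu_hat = X.mean(axis=0)                            # sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n              # note: divides by n, not n - 1
print(mu_hat)
print(Sigma_hat)
```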


One more example

• Suppose we have a discrete random variable, say z, that takes values a_1, · · · , a_M with probabilities p_1, · · · , p_M.
• Given data in the form of iid realizations of this random variable, we want to estimate the parameters p_i.
• Note that the parameters satisfy: p_i ≥ 0 and ∑_i p_i = 1.
• For our estimation, we represent the discrete random variable z by an M-dimensional vector random variable x = [x^1, · · · , x^M]^T.
• The idea is that if z takes value a_i then we represent it by x whose ith component is one and all others are zero.
• So, the random vector x actually takes only M possible values, namely, [1, 0, · · · , 0]^T, [0, 1, 0, · · · , 0]^T, etc.
• This is sometimes called the ‘1 of M’ representation for a discrete random variable taking M values.
• Thus, x = [x^1, · · · , x^M]^T satisfies: x^i ∈ {0, 1} and ∑_i x^i = 1.
• Also, we now have p_i = Prob[x^i = 1].
• Now the mass function for x can be written as

      f(x | p) = ∏_{i=1}^{M} p_i^{x^i}

  where x = [x^1, · · · , x^M]^T, x^i ∈ {0, 1}, ∑_i x^i = 1.
• Here, p = (p_1, · · · , p_M)^T is the parameter vector.
• Now the problem of estimating the parameters p_i becomes the following.
• We are given iid data

      D = {x_1, · · · , x_n}

  where x_i = [x_i^1, · · · , x_i^M]^T with x_i^j ∈ {0, 1} and ∑_j x_i^j = 1, ∀i.
• We know the probability mass function of x and we need to derive ML estimates for the parameters p_i.
• The log likelihood function is given by

      l(p | D) = ∑_{i=1}^{n} ln(f(x_i | p))
               = ∑_{i=1}^{n} ln( ∏_{j=1}^{M} p_j^{x_i^j} )
               = ∑_{i=1}^{n} ∑_{j=1}^{M} x_i^j ln(p_j)


• We now want to find values for p_i, i = 1, · · · , M, to maximize l(p | D).
• But this is not an unconstrained maximization.
• We need to maximize l over only those p_i that satisfy p_i ≥ 0 and ∑_i p_i = 1.
• Hence ML estimation of the parameters here becomes the following constrained optimization problem:

      max over p_1, · · · , p_M :   l(p | D) = ∑_{i=1}^{n} ∑_{j=1}^{M} x_i^j ln(p_j)
      subject to   ∑_{i=1}^{M} p_i = 1

• We can solve this by the method of Lagrange multipliers. (We have not explicitly included the non-negativity constraint.)


• The Lagrangian for this problem is given by

      ∑_{i=1}^{n} ∑_{s=1}^{M} x_i^s ln(p_s) + λ ( 1 − ∑_{s=1}^{M} p_s )

  where λ is the Lagrange multiplier.
• Now, we calculate the partial derivatives of the Lagrangian and equate them to zero to get the maximum.
• This gives us

      ( ∑_{i=1}^{n} x_i^j ) / p_j − λ = 0,   j = 1, · · · , M

  Solving this, we get

      p_j = (1/λ) ∑_{i=1}^{n} x_i^j,   j = 1, · · · , M

• Now using the constraint ∑_j p_j = 1, we get the value of λ as

      λ = ∑_{j=1}^{M} ∑_{i=1}^{n} x_i^j = ∑_{i=1}^{n} ∑_{j=1}^{M} x_i^j = n

  where the last step follows because ∑_j x_i^j = 1, ∀i.
• Thus, we get the final ML estimate for p_j as

      p̂_j = (1/n) ∑_{i=1}^{n} x_i^j

• The final ML estimate for p_j is the fraction of times the jth value occurs – intuitively clear. (A numerical check follows below.)
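The relative-frequency estimate is easy to check numerically. The sketch below is not from the lecture; it assumes NumPy and a made-up probability vector with M = 3, encodes the simulated draws of z in the 1-of-M representation, and then averages.

```python
# Sketch: ML estimate of a discrete distribution via the 1-of-M representation.
# The true probabilities below are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(2)
p_true = np.array([0.5, 0.3, 0.2])          # assumed, M = 3 values
M = p_true.size

z = rng.choice(M, size=5000, p=p_true)      # iid realizations of z (as indices)
X = np.eye(M)[z]                            # 1-of-M encoding: row i is x_i

p_hat = X.mean(axis=0)                      # (1/n) * sum_i x_i^j, per component j
print(p_hat)                                # close to [0.5, 0.3, 0.2]
```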


• The distribution (or probability mass function) of any discrete random variable taking finitely many values is specified by some M parameters like the p_i.
• Hence, what we presented is a general procedure using which we can estimate the distribution of any discrete random variable.
• Also, note that for discrete random variables there is really no distinction between parametric and non-parametric ways of estimating the distribution.


• Features that take only finitely many values are important in some pattern classification problems.
• Examples include search and ranking, document classification, spam filtering, etc.
• For instance, for document classification we can use the ‘word count’ as the feature vector, often called the ‘bag of words’ representation.
• In such cases, each feature is a discrete random variable.
• We can estimate the (marginal) distribution of each feature using our procedure.
• To implement the Bayes classifier we need the joint distribution of the feature vector.
• We can, e.g., assume the features are independent.
• Then the joint mass function is the product of the marginals.
• This is often called the ‘naive Bayes’ classifier, sketched below.
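To make the independence assumption concrete, here is a minimal naive Bayes sketch, not from the lecture, using binary presence/absence features rather than full word counts. The per-class marginals are relative frequencies, as in the discrete ML example above, and the class conditional mass function is their product. The toy data and the small smoothing constant are assumptions added only so the snippet runs.

```python
# Sketch: naive Bayes with binary features; per-class marginals are
# relative-frequency (ML-style) estimates, and a small smoothing constant
# (an assumption, not part of the derivation above) avoids zero probabilities.
import numpy as np

def fit_naive_bayes(X, y, eps=1e-3):
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # theta[k, j] ~ Prob[feature j = 1 | class classes[k]], as relative frequencies
    theta = np.array([(X[y == c].mean(axis=0) + eps) / (1 + 2 * eps)
                      for c in classes])
    return classes, priors, theta

def predict(x, classes, priors, theta):
    # log prior + sum of log marginals (the independence assumption)
    log_post = np.log(priors) + (x * np.log(theta) +
                                 (1 - x) * np.log(1 - theta)).sum(axis=1)
    return classes[np.argmax(log_post)]

# Toy data: 6 "documents", 3 binary word-presence features, 2 classes.
X = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1],
              [0, 1, 1], [0, 0, 1], [0, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_naive_bayes(X, y)
print(predict(np.array([1, 0, 0]), *model))   # expected: class 0
```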


ML Estimation

• ML estimates of parameters (of a density) are obtained as maximizers of the (log) likelihood function.
• We have seen many examples of how we can analytically derive ML estimates.
• ML estimates are easy to obtain for most standard densities and it is a very useful method of estimation.
• The ML method of estimation has some drawbacks.
• ML estimates are consistent. Hence, given a large number of samples, we would get good estimates.
• However, when the sample size is small, ML estimates may be quite bad.
• Also, the method does not allow one to incorporate any additional knowledge one may have about the values of the unknown parameters.
• The final estimated value of the parameter is determined by the data alone.


Bayesian Estimation

• Bayesian estimation is the second parametric method of estimation that we consider in this course.
• In ML estimation the parameters are taken to be constants that are unknown.
• In Bayesian estimation we think of the parameter itself as a random variable.
• We capture our lack of knowledge about the value of a parameter through a probability density over the parameter space.
• We call this the prior density of the parameter.
• Any information we may have about the value of the parameter can be incorporated into it.
• We then view the role of the data as transforming our prior density into a posterior density for the parameter. (We will see the details of this shortly.)


Bayesian Approach

• We can think of the prior density of the parameter as capturing our subjective beliefs about the parameter value.
• Thus, our final inference about the parameter value is not completely governed by the data alone; other knowledge we have also plays a role.
• Though we consider it only for parameter estimation of density functions, the Bayesian approach is to be viewed as a generic approach for probabilistic modelling and inference.
• The Bayesian approach is characterized by thinking of probabilities as also capturing subjective beliefs.


Bayesian Parameter Estimation

• As earlier, let θ be the parameter and let D be the data.
• Recall that D = {x_1, · · · , x_n} is the set of iid data and each x_i has density f(x_i | θ) (which is the assumed model).
• Let f(θ) be the prior density of the parameter and let f(θ | D) be the posterior density.
• Now, using Bayes theorem we get

      f(θ | D) = f(D | θ) f(θ) / ∫ f(D | θ) f(θ) dθ

  where f(D | θ) = ∏_i f(x_i | θ) is the data likelihood that we considered earlier.
• In the above expression for f(θ | D), the denominator is not a function of θ. It is a normalizing constant, and when we do not need its details we will denote it by Z. (A numerical illustration follows below.)
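Since the denominator is just a normalizing constant, the posterior can always be evaluated numerically, up to Z, on a grid over the parameter space. The sketch below is not from the lecture; it reuses the exponential model of the earlier example and assumes a uniform prior on (0, 10] purely for illustration.

```python
# Sketch: unnormalized posterior f(D|lam) * f(lam) on a grid of lambda values,
# normalized numerically; the uniform prior on (0, 10] is an assumption.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0 / 2.0, size=50)     # iid data, true rate 2.0

lam = np.linspace(0.01, 10.0, 2000)               # grid over the parameter space
dlam = lam[1] - lam[0]
log_lik = x.size * np.log(lam) - lam * x.sum()    # log f(D | lambda)
prior = np.full_like(lam, 1.0 / 10.0)             # uniform prior density on (0, 10]

unnorm = np.exp(log_lik - log_lik.max()) * prior  # stabilized, known only up to Z
posterior = unnorm / (unnorm.sum() * dlam)        # divide by Z (numerical integral)
print(lam[np.argmax(posterior)])                  # posterior mode, near 2.0
```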


• Essentially, the posterior density is taken as the final Bayesian estimate.
• An important question: how does one represent the posterior (and the prior) density?
• It would be nice if these densities can be represented in some parametric form.
• For that, we would like the prior and posterior densities to have the same general parametric form.
• A form for the prior density that results in the same form of density for the posterior is called a conjugate prior.
• The posterior density depends on the product of the prior and the data likelihood.
• The form of the data likelihood depends on the form assumed for f(x | θ).
• Hence the conjugate prior is determined by the form of f(x | θ) (and hence that of the data likelihood).


• When we use a conjugate prior, both the prior and the posterior belong to the same family of densities.
• Hence calculating the posterior is essentially updating the parameters of the density.
• We shall see many examples where this becomes clearer; one standard instance is sketched below.
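One standard instance, stated here as an assumed example since the conjugate pair is not worked out in these slides: for the exponential model f(x | λ) = λ exp(−λx), a Gamma prior on λ (shape a, rate b) is conjugate, and observing D = {x_1, · · · , x_n} only updates the Gamma parameters to (a + n, b + ∑ x_i).

```python
# Sketch: conjugate updating for the exponential likelihood with a Gamma prior
# (shape a, rate b) -- an assumed example, not derived in the lecture.
# Prior:      f(lambda)     is proportional to lambda**(a-1) * exp(-b*lambda)
# Likelihood: f(D | lambda) is proportional to lambda**n * exp(-lambda*sum(x))
# Posterior:  again a Gamma density, with shape a + n and rate b + sum(x).
import numpy as np

def update_gamma_prior(a, b, x):
    """Return posterior (shape, rate) after observing iid exponential data x."""
    x = np.asarray(x)
    return a + x.size, b + x.sum()

a0, b0 = 2.0, 1.0                                   # assumed prior parameters
x = np.random.default_rng(4).exponential(0.5, 100)  # data with true rate 2.0
a_n, b_n = update_gamma_prior(a0, b0, x)
print(a_n, b_n)                                     # updated Gamma parameters
```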


• How do we use the final posterior density for implementing the classifier?
• There are many possibilities for this.
• We finally need the class conditional densities for implementing the Bayes classifier.
• So, one method is to find the density of x based on the data, so that the density does not depend on any unknown parameter.


• Having obtained f(θ | D), we have

      f(x | D) = ∫ f(x, θ | D) dθ = ∫ f(x | θ) f(θ | D) dθ

• Depending on the form of the posterior, we may be able to get a closed form expression for this density; otherwise it can be approximated numerically, as sketched below.
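When no closed form is available, the same integral can be approximated on a grid of parameter values. The sketch below is not from the lecture; it continues the exponential example with an assumed uniform prior and approximates f(x | D) ≈ ∑_k f(x | λ_k) f(λ_k | D) Δλ.

```python
# Sketch: numerical approximation of f(x | D) = integral of f(x|lam) f(lam|D) dlam
# for the exponential model, using a grid posterior with an assumed uniform prior.
import numpy as np

rng = np.random.default_rng(5)
data = rng.exponential(scale=0.5, size=50)        # iid data, true rate 2.0

lam = np.linspace(0.01, 10.0, 2000)
dlam = lam[1] - lam[0]
log_lik = data.size * np.log(lam) - lam * data.sum()
post = np.exp(log_lik - log_lik.max())            # uniform prior cancels here
post /= post.sum() * dlam                         # normalize to get f(lam | D)

def predictive_density(x):
    """f(x | D) ~= sum_k f(x | lam_k) f(lam_k | D) * dlam, for x >= 0."""
    return np.sum(lam * np.exp(-lam * x) * post) * dlam

print(predictive_density(0.5))                    # around 0.7 for data with rate 2
```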


• Another possibility is to use some specific value of θ based on the posterior density.
• We can take the mode of the posterior density as the parameter value.
• This is called the MAP (Maximum A Posteriori probability) estimate.
• Or, we can take the mean of the posterior density as the parameter value.
• Both of these are also often used (a small numerical illustration follows).
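For the assumed Gamma posterior sketched earlier, both point estimates have simple closed forms, so a brief check is possible (again an assumption-laden sketch, not lecture material): the posterior mean is (a + n)/(b + ∑ x_i) and the posterior mode, i.e. the MAP estimate, is (a + n − 1)/(b + ∑ x_i) when the shape exceeds one.

```python
# Sketch: MAP estimate (posterior mode) and posterior mean for the assumed
# Gamma(a + n, b + sum(x)) posterior of the exponential rate lambda.
import numpy as np

x = np.random.default_rng(6).exponential(scale=0.5, size=100)  # true rate 2.0
a0, b0 = 2.0, 1.0                       # assumed prior parameters
a_n, b_n = a0 + x.size, b0 + x.sum()    # conjugate update from the earlier sketch

lambda_map  = (a_n - 1) / b_n           # mode of a Gamma(shape, rate) density
lambda_mean = a_n / b_n                 # mean of a Gamma(shape, rate) density
print(lambda_map, lambda_mean)          # both close to 2.0, and to the ML estimate
```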
