
Dr. Arslan Shaukat
Bayesian Estimation
In MLE, θ was assumed to have a fixed value
In BE, θ is a random variable
Training data allow us to convert a distribution on this variable into a posterior probability density
The computation of the posterior probabilities P(ωi|x) lies at the heart of Bayesian classification
$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j)\,P(\omega_j)}$$

Given the training sample set D, Bayes' formula can be written as
$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\,P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\,P(\omega_j \mid D)}$$
The training samples D can be used to determine the
class-conditional densities and prior probabilities
Assume that the true values of the a priori
probabilities are known or obtainable from a trivial
calculation; thus we substitute P(ωi) = P(ωi|D)
We can separate the training samples by class into c
subsets D1, ...,Dc, with the samples in Di belonging to ωi
The previous expression can be written as:
$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\,P(\omega_j)}$$

Like MLE, each class is treated independently, so we can dispense with needless class distinctions and simplify the notation from P(x|ωi, Di) to P(x|D)
P(x) is unknown but has a known parametric form; equivalently, the function p(x|θ) is completely known and only the value of θ is unknown.
Any information we might have about θ prior to
observing the samples is assumed to be contained
in a known prior density p(θ).
Observation of the samples converts this to a
posterior density p(θ|D)
Goal is to compute p(x|D)
Do this by integrating the joint density p(x, θ|D)
over θ. That is,

$$p(x \mid D) = \int p(x, \theta \mid D)\, d\theta = \int p(x \mid \theta, D)\, p(\theta \mid D)\, d\theta = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$


In general, if we are less certain about the exact value of θ,
this equation directs us to average p(x|θ) over the possible
values of θ.
Thus, when the unknown densities have a known parametric
form, the samples exert their influence on p(x|D) through the
posterior density p(θ|D).
The basic problem is: “Compute the posterior density P(θ | D)”, then “Derive P(x | D)”
Using Bayes formula, we have:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta)\,P(\theta)\, d\theta}$$
and the independence assumption leads to
$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
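A small numerical sketch of this two-step recipe, posterior first and predictive density second (my own illustration, assuming a one-dimensional θ, a Gaussian likelihood with known σ, and toy data; grid sums stand in for the integrals above):

```python
import numpy as np
from scipy.stats import norm

theta = np.linspace(-5.0, 5.0, 2001)          # grid over the unknown parameter
d_theta = theta[1] - theta[0]
D = np.array([0.8, 1.1, 1.4, 0.9])            # toy training samples
sigma = 1.0                                   # known likelihood std dev

# Step 1: posterior p(theta | D) proportional to prod_k p(x_k | theta) * p(theta)
prior = norm.pdf(theta, loc=0.0, scale=2.0)   # known prior p(theta)
log_lik = norm.logpdf(D[:, None], loc=theta[None, :], scale=sigma).sum(axis=0)
unnorm = np.exp(log_lik) * prior
posterior = unnorm / (unnorm.sum() * d_theta)  # normalize so it integrates to 1

# Step 2: predictive p(x | D) = integral of p(x | theta) p(theta | D) d(theta)
x = 1.0                                        # query point
p_x_given_D = np.sum(norm.pdf(x, loc=theta, scale=sigma) * posterior) * d_theta
print(p_x_given_D)
```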
Bayesian Parameter Estimation: General Theory
The P(x|D) computation can be applied to any situation in which the unknown density can be parametrized; the basic assumptions are:
The form of P(x|θ) is assumed known, but the value of θ is not known exactly
Our knowledge about θ is assumed to be contained in a known prior density P(θ)
The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to the unknown probability density P(x)
$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta)\,P(\theta)\, d\theta}, \qquad P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
Bayesian Parameter Estimation: Gaussian Case
Goal: estimate μ using the a posteriori density P(μ|D)
The univariate case: P(μ|D)
μ is the only unknown parameter
P(x|μ) ~ N(μ, σ²)
P(μ) ~ N(μ0, σ0²)
We assume that whatever prior knowledge we might have about μ can be expressed by a known prior density p(μ)
μ0 and σ0² are known
Roughly speaking, μ0 represents our best a priori guess for μ, and σ0² measures our uncertainty about this guess
By Bayes' formula,
$$P(\mu \mid D) = \frac{P(D \mid \mu)\,P(\mu)}{\int P(D \mid \mu)\,P(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} P(x_k \mid \mu)\,P(\mu) \qquad (1)$$
where α is a normalization factor that depends on D but not on μ.
We assume that
$$P(x_k \mid \mu) \sim N(\mu, \sigma^2), \qquad P(\mu) \sim N(\mu_0, \sigma_0^2)$$
Substituting these Gaussian forms into (1), we have
$$P(\mu \mid D) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2\right] \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right]$$
If we write P(μ|D) ~ N(μn, σn²), then μn and σn² can be found by equating coefficients in the previous equation with the corresponding coefficients in the Gaussian form:
$$P(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$$
This gives
$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0
\qquad\text{and}\qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$
where $\hat{\mu}_n$ is the sample mean of the n observations.
μn represents our best guess for μ after observing n samples, and σn² measures our uncertainty about this guess.
Since σn² decreases monotonically with n, approaching σ²/n as n approaches infinity, each additional observation decreases our uncertainty about the true value of μ.
As n increases, p(μ|D) becomes more and more sharply peaked.
This behavior is commonly known as Bayesian learning.
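The closed-form update is easy to check numerically. The following sketch (my own example, assuming a true mean of 2 with σ = σ0 = 1 and μ0 = 0) computes μn and σn² for growing n and shows the posterior variance shrinking roughly like σ²/n:

```python
import numpy as np

def gaussian_mean_posterior(samples, sigma, mu0, sigma0):
    """Closed-form posterior N(mu_n, sigma_n^2) for an unknown Gaussian mean
    with known variance sigma^2 and prior N(mu0, sigma0^2)."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    mu_hat = samples.mean()                        # sample mean
    denom = n * sigma0**2 + sigma**2
    mu_n = (n * sigma0**2 / denom) * mu_hat + (sigma**2 / denom) * mu0
    sigma_n2 = (sigma0**2 * sigma**2) / denom
    return mu_n, sigma_n2

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)    # true mu = 2, sigma = 1
for n in (1, 10, 100):                             # posterior sharpens with n
    mu_n, s_n2 = gaussian_mean_posterior(data[:n], sigma=1.0, mu0=0.0, sigma0=1.0)
    print(n, round(mu_n, 3), round(s_n2, 4))
```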
The Univariate Case P(x |D)
 P( | D) computed
 P(x | D) remains to be computed

P( x | D)   P( x |  ).P(  | D)d is Gaussian

where

4
It provides:
P( x | D) ~ N (  n ,  2   n2 )

(Desired class-conditional density P(x | Dj, j))


Therefore, P(x | Dj, ωj) together with P(ωj), using Bayes' formula, gives the Bayesian classification rule:
$$\max_j \left[ P(\omega_j \mid x, D) \right] \;\Longleftrightarrow\; \max_j \left[ P(x \mid \omega_j, D_j)\,P(\omega_j) \right]$$

Multivariate Case
Assume: p(x | μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0)
We get: P(μ|D) ~ N(μn, Σn)
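The slides do not spell out μn and Σn; the sketch below uses the standard conjugate-Gaussian expressions that parallel the univariate case, μn = Σ0(Σ0 + Σ/n)⁻¹ μ̂n + (Σ/n)(Σ0 + Σ/n)⁻¹ μ0 and Σn = Σ0(Σ0 + Σ/n)⁻¹ (Σ/n), with toy 2-D data of my own:

```python
import numpy as np

def mvn_mean_posterior(X, Sigma, mu0, Sigma0):
    """Posterior N(mu_n, Sigma_n) for an unknown multivariate Gaussian mean
    with known covariance Sigma and prior N(mu0, Sigma0)."""
    X = np.atleast_2d(X)
    n = X.shape[0]
    mu_hat = X.mean(axis=0)                   # sample mean vector
    A = np.linalg.inv(Sigma0 + Sigma / n)     # shared inverse term
    mu_n = Sigma0 @ A @ mu_hat + (Sigma / n) @ A @ mu0
    Sigma_n = Sigma0 @ A @ (Sigma / n)
    return mu_n, Sigma_n

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], np.eye(2), size=50)   # toy 2-D data
mu_n, Sigma_n = mvn_mean_posterior(X, Sigma=np.eye(2),
                                   mu0=np.zeros(2), Sigma0=np.eye(2))
print(mu_n)        # close to the true mean [1, -1]
print(Sigma_n)     # small: uncertainty shrinks with n
```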


Recursive Bayes Learning

Using Bayes' formula with Dⁿ = {x1, . . . , xn} and the factorization p(Dⁿ|θ) = p(xn|θ) p(Dⁿ⁻¹|θ), we obtain the recursion
$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\,p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\,p(\theta \mid D^{n-1})\, d\theta}$$
This is an incremental or on-line learning method, where learning goes on as the data is collected.
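A minimal sketch of the on-line version for the univariate Gaussian mean (my own illustration; each step treats the previous posterior as the prior for the next sample, which by conjugacy reproduces the batch result):

```python
import numpy as np

def recursive_update(x_k, mu_prev, s2_prev, sigma2):
    """One on-line Bayesian update of the Gaussian-mean posterior:
    the previous posterior N(mu_prev, s2_prev) acts as the prior for x_k."""
    denom = s2_prev + sigma2
    mu_new = (s2_prev * x_k + sigma2 * mu_prev) / denom
    s2_new = (s2_prev * sigma2) / denom
    return mu_new, s2_new

rng = np.random.default_rng(3)
data = rng.normal(2.0, 1.0, 100)
mu, s2 = 0.0, 1.0                     # start from the prior N(0, 1)
for x_k in data:                      # learning goes on as data is collected
    mu, s2 = recursive_update(x_k, mu, s2, sigma2=1.0)
print(round(mu, 3), round(s2, 4))     # matches the batch result on the same data
```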
Difference between the two methods
Computational complexity: maximum likelihood is simpler.
Our confidence in the prior information.
The maximum-likelihood solution must be of the assumed parametric form; the Bayesian solution need not be.
Bayesian methods use more of the available information than maximum likelihood and can therefore give better results.
Bayesian methods exploit the asymmetric information contained in the distribution over θ, while maximum likelihood does not.
Classification Error
To apply these results to multiple classes, separate the training samples into c subsets D1, . . . , Dc, with the samples in Di belonging to class ωi, and then estimate each density p(x|ωi, Di) separately
Different sources of error
Bayes error: due to overlapping class-conditional densities
(related to the features used)
Model error: due to incorrect model
Estimation error: due to estimation of parameters from a
finite sample (can be reduced by increasing the amount of
training data)
Conclusion
The maximum likelihood approach estimates a point in θ space, whereas the Bayesian approach estimates a full distribution.
The Bayesian method has strong theoretical and methodological arguments supporting it, though in practice maximum likelihood is simpler.
When used to design classifiers, the two approaches mostly give the same results.
