CS775 Lec 2

Chapter 3:
Maximum-Likelihood &
Bayesian Parameter Estimation

Introduction
Maximum-Likelihood Estimation
– Example of a Specific Case
– The Gaussian Case: unknown μ and Σ
– Bias
Introduction
– Data availability in a Bayesian framework
We could design an optimal classifier if we knew:
– P(ωi) (priors)
– P(x | ωi) (class-conditional densities)
Unfortunately, we rarely have this complete information!

– Design a classifier from a training sample
Prior estimation poses no problem
Samples are often too small for class-conditional estimation (large dimension of the feature space!)
– A priori information about the problem
– Normality of P(x | ωi):

P(x | ωi) ~ N(μi, Σi)

Characterized by 2 parameters
– Estimation techniques
Maximum-Likelihood (ML) and Bayesian estimation
Results are nearly identical, but the approaches are different (they coincide in the limit of an infinite number of samples)
In ML estimation the parameters are fixed but unknown!
The best parameters are obtained by maximizing the probability of obtaining the samples observed
Bayesian methods view the parameters as random variables having some known prior distribution
In either approach, we use P(ωi | x) for our classification rule!
We can use estimation for other problems too!
Maximum-Likelihood Estimation
Has good convergence properties as the sample size increases
Simpler than alternative techniques
– General principle
Assume we have c classes and

P(x | ωj) ~ N(μj, Σj)
P(x | ωj) ≡ P(x | ωj, θj), where:

$$\theta_j = (\mu_j, \Sigma_j) = (\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \operatorname{cov}(x_j^m, x_j^n), \ldots)$$
Use the information provided by the training samples to estimate
θ = (θ1, θ2, …, θc); each θi (i = 1, 2, …, c) is associated with its category
Suppose that D contains n samples, x1, x2, …, xn

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)$$

P(D | θ) is called the likelihood of θ w.r.t. the set of samples
The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ)
“It is the value of θ that best agrees with the actually observed training samples”
Optimal estimation
– Let θ = (θ1, θ2, …, θp)t and let ∇θ be the gradient operator

$$\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$$

– We define l(θ) as the log-likelihood function

l(θ) = ln P(D | θ)

– New problem statement:
determine the θ that maximizes the log-likelihood

$$\hat\theta = \arg\max_\theta\, l(\theta)$$

The set of necessary conditions for an optimum is:

$$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) = 0$$
Example of a specific case: unknown μ
– P(xk | μ) ~ N(μ, Σ)
(Samples are drawn from a multivariate normal population)

$$\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\!\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1}(x_k - \mu)$$

$$\text{and}\quad \nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$$

θ = μ, therefore:
• The ML estimate for μ must satisfy:

$$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat\mu) = 0$$

• Multiplying by Σ and rearranging, we obtain:

$$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$$

Just the arithmetic average of the training samples! (Exhale now!)

Conclusion:
If P(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)t and perform an optimal classification!
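The following is not from the slides: a minimal numpy sketch, on synthetic data, checking numerically that the arithmetic average maximizes the Gaussian log-likelihood when Σ is known. All names and values here are illustrative assumptions.

```python
# Hypothetical illustration: the sample mean maximizes the Gaussian
# log-likelihood when Sigma is known.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500
true_mu = np.array([1.0, -2.0, 0.5])
Sigma = np.diag([1.0, 2.0, 0.5])                      # known covariance
X = rng.multivariate_normal(true_mu, Sigma, size=n)   # training samples x_1..x_n

Sigma_inv = np.linalg.inv(Sigma)
log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))

def log_likelihood(mu):
    """Sum over k of ln P(x_k | mu) for a Gaussian with known Sigma."""
    diff = X - mu
    quad = np.einsum('ki,ij,kj->k', diff, Sigma_inv, diff)
    return np.sum(log_norm - 0.5 * quad)

mu_hat = X.mean(axis=0)                  # ML estimate: the arithmetic average
print("mu_hat =", mu_hat)
print("l(mu_hat)       =", log_likelihood(mu_hat))
print("l(mu_hat + 0.1) =", log_likelihood(mu_hat + 0.1))   # strictly lower
```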
ML Estimation:
– Gaussian Case: unknown μ and σ
θ = (θ1, θ2) = (μ, σ²)

$$l = \ln P(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$

$$\nabla_\theta l = \begin{bmatrix} \dfrac{\partial}{\partial\theta_1}\ln P(x_k \mid \theta) \\[2mm] \dfrac{\partial}{\partial\theta_2}\ln P(x_k \mid \theta) \end{bmatrix} = 0$$

$$\Rightarrow\quad \frac{1}{\theta_2}(x_k - \theta_1) = 0 \quad\text{and}\quad -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} = 0$$
Summation over the n samples gives:

$$\sum_{k=1}^{n} \frac{1}{\hat\theta_2}(x_k - \hat\theta_1) = 0 \qquad (1)$$

$$-\sum_{k=1}^{n} \frac{1}{\hat\theta_2} + \sum_{k=1}^{n} \frac{(x_k - \hat\theta_1)^2}{\hat\theta_2^2} = 0 \qquad (2)$$

Combining (1) and (2), one obtains:

$$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k \; ; \qquad \hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat\mu)^2$$
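A short illustrative sketch (my addition, on synthetic data) of these closed-form univariate estimates; note the 1/n divisor for σ̂².

```python
# Hypothetical example: univariate ML estimates mu_hat and sigma_hat^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # samples from N(5, 4)

mu_hat = x.sum() / x.size                        # (1/n) * sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).sum() / x.size  # (1/n) * sum_k (x_k - mu_hat)^2

print(mu_hat, sigma2_hat)               # close to 5 and 4
print(np.isclose(sigma2_hat, x.var()))  # np.var uses the same 1/n (ML) divisor
```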
Bias
– The ML estimate for σ² is biased:

$$E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

– An elementary unbiased estimator for Σ is:

$$C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat\mu)(x_k - \hat\mu)^t \qquad \text{(sample covariance matrix)}$$
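A quick simulation sketch (not from the slides; sample size and variance are assumed) showing the (n − 1)/n bias of the ML variance and the effect of the 1/(n − 1) correction.

```python
# Hypothetical simulation: E[sigma2_ML] ~= (n-1)/n * sigma^2 for small n.
import numpy as np

rng = np.random.default_rng(2)
n, trials, true_var = 5, 200_000, 4.0

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
ml_vars = samples.var(axis=1, ddof=0)        # biased ML estimate (1/n)
unbiased_vars = samples.var(axis=1, ddof=1)  # sample variance (1/(n-1))

print(ml_vars.mean())        # ~ (n-1)/n * 4 = 3.2
print(unbiased_vars.mean())  # ~ 4.0
```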
ML Problem Statement
– Let D = {x1, x2, …, xn}

$$P(x_1, \ldots, x_n \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta); \quad |D| = n$$

Our goal is to determine θ̂ (the value of θ that makes this sample the most representative!)

θ = (θ1, θ2, …, θc)

Problem: find θ̂ such that:

$$\max_\theta P(D \mid \theta) = \max_\theta P(x_1, \ldots, x_n \mid \theta) = \max_\theta \prod_{k=1}^{n} P(x_k \mid \theta)$$
Chapter 3:
Maximum-Likelihood and
Bayesian Parameter Estimation
 Bayesian Estimation (BE)
 Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Theory
 Problems of Dimensionality
 Computational Complexity
 Component Analysis and Discriminants
 Hidden Markov Models
Bayesian Estimation (Bayesian learning applied to pattern classification problems)
– In MLE θ was assumed to be fixed
– In BE θ is a random variable
– The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
– Goal: compute P(ωi | x, D)
Given the sample D, Bayes formula can be written:

$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\,P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\,P(\omega_j \mid D)}$$

To derive the preceding equation, use:

$$P(x, D \mid \omega_i) = P(x \mid D, \omega_i)\,P(D \mid \omega_i)$$

$$P(x \mid D) = \sum_j P(x, \omega_j \mid D)$$

$$P(\omega_i) = P(\omega_i \mid D) \quad \text{(the training sample provides this!)}$$

Thus:

$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\,P(\omega_j)}$$
Bayesian Parameter Estimation: Gaussian Case

Goal: Estimate θ using the a-posteriori density P(θ | D)

The univariate case: P(μ | D)
μ is the only unknown parameter

$$P(x \mid \mu) \sim N(\mu, \sigma^2)$$
$$P(\mu) \sim N(\mu_0, \sigma_0^2)$$

(μ0 and σ0 are known!)
$$P(\mu \mid D) = \frac{P(D \mid \mu)\,P(\mu)}{\int P(D \mid \mu)\,P(\mu)\,d\mu} \qquad (1)$$

$$= \alpha \prod_{k=1}^{n} P(x_k \mid \mu)\,P(\mu)$$

– Reproducing density

$$P(\mu \mid D) \sim N(\mu_n, \sigma_n^2) \qquad (2)$$

Identifying (1) and (2) yields:

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0$$

$$\text{and}\quad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

μn is a linear combination of μ̂n (the sample mean) and μ0
When n → ∞, μn → μ̂n (as in the ML case!)
– The univariate case: P(x | D)
P(μ | D) has been computed
P(x | D) remains to be computed!

$$P(x \mid D) = \int P(x \mid \mu)\,P(\mu \mid D)\,d\mu \quad \text{is Gaussian}$$

It provides:

$$P(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

(the desired class-conditional density P(x | Dj, ωj))

Therefore, using P(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

$$\max_{\omega_j}\, P(\omega_j \mid x, D) = \max_{\omega_j}\, P(x \mid \omega_j, D_j)\,P(\omega_j)$$
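A minimal sketch (my addition; the prior values μ0 and σ0² are assumed) of the updates μn and σn² and the resulting predictive density N(μn, σ² + σn²) for the univariate case above.

```python
# Hypothetical sketch of Bayesian learning of a Gaussian mean (sigma^2 known).
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0                        # known data variance
mu0, sigma0_2 = 0.0, 10.0           # assumed prior: P(mu) ~ N(mu0, sigma0^2)
x = rng.normal(2.0, np.sqrt(sigma2), size=50)   # observed samples

n = x.size
mu_ml = x.mean()                    # hat{mu}_n, the sample mean
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_ml \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n_2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

print("posterior  P(mu|D) ~ N(%.3f, %.4f)" % (mu_n, sigma_n_2))
print("predictive P(x |D) ~ N(%.3f, %.4f)" % (mu_n, sigma2 + sigma_n_2))
```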
Bayesian Parameter Estimation: General Theory

– The P(x | D) computation can be applied to any situation in which the unknown density can be parametrized; the basic assumptions are:

The form of P(x | θ) is assumed known, but the value of θ is not known exactly
Our knowledge about θ is assumed to be contained in a known prior density P(θ)
The rest of our knowledge about θ is contained in a set D of n random samples x1, x2, …, xn drawn according to P(x)

The basic problem is:
“Compute the posterior density P(θ | D)”, then “Derive P(x | D)”
Using Bayes formula, we have:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta)\,P(\theta)\,d\theta}$$

And by the independence assumption:

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
More cases: Binary Variables
Coin flipping: heads = 1, tails = 0
Bernoulli Distribution
Binary variables (2)
N coin flips:
Binomial Distribution
Binomial distribution
Parameter Estimation
ML for Bernoulli
Given: a data set D = {x1, …, xN} of observed flips
Parameter Estimation (2)
Example: a data set in which every observed toss lands heads gives the ML estimate μML = 1
Prediction: all future tosses will land heads up
Overfitting to D
Beta Distribution
Distribution over μ ∈ [0, 1]
Bayesian Bernoulli
The Beta distribution provides the conjugate prior for the Bernoulli distribution.
Beta Distribution
Prior ∙ Likelihood = Posterior
Properties of the Posterior
As the size of the data set, N, increases, the posterior mean approaches the ML estimate and the posterior variance shrinks toward zero
Prediction under the Posterior
What is the probability that the next coin toss will land heads up?
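A small sketch (not from the slides; the prior pseudo-counts a and b are assumed) of the Beta-Bernoulli conjugate update and the posterior-predictive probability that the next toss lands heads, contrasted with the overfitting ML estimate.

```python
# Hypothetical Beta-Bernoulli example.
# Prior Beta(a, b); after m heads and (N - m) tails the posterior is Beta(a + m, b + N - m).
import numpy as np

flips = np.array([1, 1, 1])        # three observed heads (1 = heads, 0 = tails)
a, b = 2.0, 2.0                    # assumed prior pseudo-counts

heads = flips.sum()
tails = flips.size - heads

mu_ml = heads / flips.size                          # ML estimate: 1.0 (overfits D)
p_heads_bayes = (a + heads) / (a + b + flips.size)  # posterior-predictive P(next = heads)

print(mu_ml)          # 1.0
print(p_heads_bayes)  # 5/7 ~ 0.714, pulled toward the prior
```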
Multinomial Variables
1-of-K coding scheme, e.g.
x = [0, 0, 1, 0, 0, 0]T
ML Parameter estimation
Given: a data set D of N observations
To ensure Σk μk = 1, use a Lagrange multiplier λ.
The Multinomial Distribution
The Dirichlet Distribution
Conjugate prior for the multinomial distribution.
Bayesian Multinomial (1)
Bayesian Multinomial (2)
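Similarly, a sketch (my addition, with an assumed symmetric Dirichlet prior α) of the conjugate Dirichlet update for 1-of-K counts.

```python
# Hypothetical Dirichlet-multinomial example with 1-of-K observations.
import numpy as np

X = np.array([[0, 0, 1],           # each row is a 1-of-K coded observation
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])
alpha = np.full(3, 2.0)            # assumed symmetric Dirichlet prior

counts = X.sum(axis=0)             # m_k: how often each outcome occurred
mu_ml = counts / counts.sum()                         # ML estimate (Lagrange-multiplier result)
mu_bayes = (alpha + counts) / (alpha + counts).sum()  # posterior mean under the Dirichlet

print(mu_ml)     # [0.25 0.25 0.5 ]
print(mu_bayes)  # smoothed toward uniform by the prior
```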
Problems of Dimensionality
– Problems involving 50 or 100 features (binary valued)
Classification accuracy depends upon the dimensionality and the amount of training data
Case of two classes, multivariate normal with the same covariance:

$$P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\,du, \quad \text{where } r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1}(\mu_1 - \mu_2)$$

$$\lim_{r \to \infty} P(\text{error}) = 0$$
If the features are independent, then:

$$\Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$$

$$r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$

The most useful features are the ones for which the difference between the means is large relative to the standard deviation

It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
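An illustrative sketch (my addition; the means and standard deviations are made up) that evaluates r and the error integral above using the complementary error function.

```python
# Hypothetical check of the error formula for two equal-covariance Gaussians
# with independent features: r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2.
import math
import numpy as np

mu1 = np.array([0.0, 0.0, 0.0])
mu2 = np.array([1.0, 0.5, 0.2])
sigma = np.array([1.0, 1.0, 1.0])          # diagonal Sigma = diag(sigma_i^2)

r = math.sqrt(np.sum(((mu1 - mu2) / sigma) ** 2))

# P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du
#          = 0.5 * erfc( (r/2) / sqrt(2) )
p_error = 0.5 * math.erfc((r / 2) / math.sqrt(2))
print(r, p_error)   # adding discriminative features increases r and shrinks P(error)
```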
Computational Complexity

– Our design methodology is affected by computational difficulty

“big oh” notation

f(x) = O(h(x)): “big oh of h(x)”
If there exist (c0, x0) ∈ ℝ² such that f(x) ≤ c0 h(x) for all x > x0

(An upper bound: f(x) grows no worse than h(x) for sufficiently large x!)

f(x) = 2 + 3x + 4x²
g(x) = x²
f(x) = O(x²)

– “big oh” is not unique!
f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)

“big theta” notation

f(x) = Θ(h(x))
If there exist (x0, c1, c2) ∈ ℝ³ such that for all x > x0:
0 ≤ c1 h(x) ≤ f(x) ≤ c2 h(x)

f(x) = Θ(x²) but f(x) ≠ Θ(x³)
– Complexity of ML Estimation

Gaussian priors in d dimensions, with n training samples for each of c classes

For each category, we have to compute the discriminant function:

$$g(x) = -\frac{1}{2}(x - \hat\mu)^t \hat\Sigma^{-1}(x - \hat\mu) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat\Sigma| + \ln P(\omega)$$

(learning costs per term: estimating μ̂ is O(d·n); estimating the sample covariance Σ̂ is O(n·d²); estimating ln P(ω) from sample frequencies is O(n); the constant (d/2) ln 2π is O(1))

Total = O(d²·n)
Total for c classes = O(c·d²·n) = O(d²·n)

The cost increases when d and n are large!
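A sketch (my addition, not the book's code) of training and evaluating the discriminant g(x) for a single class, with the per-step costs noted in comments.

```python
# Hypothetical quadratic discriminant for one class, trained by ML.
import numpy as np

def train_class(X, prior):
    """Estimate (mu_hat, Sigma_hat, log prior) from an n x d sample matrix."""
    mu_hat = X.mean(axis=0)                      # O(d*n)
    Sigma_hat = np.cov(X, rowvar=False, ddof=1)  # O(n*d^2), sample covariance
    return mu_hat, Sigma_hat, np.log(prior)

def g(x, mu_hat, Sigma_hat, log_prior):
    """Discriminant g(x) as above; cheap per evaluation once the inverse
    and determinant are available."""
    d = x.size
    diff = x - mu_hat
    quad = diff @ np.linalg.inv(Sigma_hat) @ diff
    return (-0.5 * quad
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma_hat))
            + log_prior)

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0], np.eye(2), size=100)
params = train_class(X, prior=0.5)
print(g(np.array([0.1, -0.2]), *params))
```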
Component Analysis and Discriminants

– Combine features in order to reduce the dimension of the feature space
– Linear combinations are simple to compute and tractable
– Project high-dimensional data onto a lower-dimensional space
– Two classical approaches for finding an “optimal” linear transformation:

PCA (Principal Component Analysis): “projection that best represents the data in a least-squares sense”
MDA (Multiple Discriminant Analysis): “projection that best separates the data in a least-squares sense”
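A minimal PCA sketch (my addition) that projects data onto the top-m eigenvectors of the sample covariance, i.e., the least-squares-best linear representation.

```python
# Hypothetical PCA sketch: project d-dimensional data onto the top-m principal axes.
import numpy as np

def pca_project(X, m):
    """Return the m-dimensional projection that best represents X (least squares)."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :m]                   # top-m eigenvectors as columns
    return Xc @ W, W, mu

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0, 0],
                            [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=200)
Y, W, mu = pca_project(X, m=2)
print(Y.shape)   # (200, 2)
```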
Expectation Maximization
Learning the parameters governing a distribution from training points with missing values
A generalization of MLE to missing features
Consider D = {x1, …, xn}, where each sample xi = [xig, xib] has good (observed) features xig and bad (missing) features xib

$$Q(\theta; \theta^i) = E_{D_b}\!\left[\ln p(D_g, D_b; \theta) \mid D_g; \theta^i\right]$$

Here θ is the improved (candidate) estimate, θ^i is the best estimate so far, and the expectation marginalizes over the missing features Db given the observed Dg.
EM Algorithm

begin initialize θ0, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θi)
    M step: θi+1 ← arg maxθ Q(θ; θi)
  until Q(θi+1; θi) − Q(θi; θi−1) ≤ T
  return θ̂ ← θi+1
end
EM example: Mixtures of Gaussians

$$f(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \sigma_k)$$

Pick initial estimates θ0 of
θ = {π1, …, πK, μ1, …, μK, σ1, …, σK}

E-step: plug the current estimates πk, μk, σk into the responsibilities

$$P(k \mid x) = \frac{\pi_k\, N(x \mid \mu_k, \sigma_k)}{f(x)}$$

M-step: use the responsibilities to solve for the new estimates

$$\pi_k = \frac{1}{n}\sum_{i=1}^{n} P(k \mid x(i))$$

$$\mu_k = \frac{1}{n\pi_k}\sum_{i=1}^{n} P(k \mid x(i))\,x(i)$$

$$\sigma_k^2 = \frac{1}{n\pi_k}\sum_{i=1}^{n} P(k \mid x(i))\,(x(i) - \mu_k)^2$$

Then use the new estimates to go back to the E-step, and repeat.
Gaussian Mixture Example: [figure panels showing the fit at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]
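A compact sketch (my addition, 1-D, with synthetic data and an assumed K) of the E- and M-step updates above.

```python
# Hypothetical EM for a 1-D mixture of K Gaussians, following the updates above.
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm(x, K, iters=50):
    n = x.size
    pi = np.full(K, 1.0 / K)                      # initial estimates theta_0
    mu = np.linspace(x.min(), x.max(), K)
    sigma = np.full(K, x.std())
    for _ in range(iters):
        # E-step: responsibilities P(k | x_i)
        dens = np.array([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(K)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate pi_k, mu_k, sigma_k
        Nk = resp.sum(axis=1)
        pi = Nk / n
        mu = (resp @ x) / Nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    return pi, mu, sigma

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gmm(x, K=2))
```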
Hidden Markov Models:
– Markov Chains

– Goal: make a sequence of decisions

Processes that unfold in time; states at time t are influenced by the state at time t − 1

Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …

Any temporal process without memory:

ωT = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states
We might have ω6 = {ω1, ω4, ω2, ω2, ω1, ω4}

The system can revisit a state at different steps, and not every state need be visited
– First-order Markov models

The production of any sequence is described by the transition probabilities:

P(ωj(t + 1) | ωi(t)) = aij

θ = (aij, ωT)

P(ωT | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi)

Example: speech recognition
“production of spoken words”

Production of the word “pattern”, represented by phonemes:
/p/ /a/ /tt/ /er/ /n/ // ( // = silent state)
Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to a silent state
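A toy sketch (my addition; the transition matrix and initial probabilities are made up) of evaluating P(ωT | θ) for the example sequence ω6 = {ω1, ω4, ω2, ω2, ω1, ω4}.

```python
# Hypothetical first-order Markov chain: probability of a given state sequence.
import numpy as np

# Made-up 4-state transition matrix a_ij = P(omega_j at t+1 | omega_i at t);
# rows sum to 1.
A = np.array([[0.2, 0.3, 0.1, 0.4],
              [0.5, 0.2, 0.2, 0.1],
              [0.1, 0.4, 0.4, 0.1],
              [0.3, 0.3, 0.2, 0.2]])
initial = np.full(4, 0.25)          # assumed P(omega(1) = omega_i)

sequence = [1, 4, 2, 2, 1, 4]       # omega^6 = {w1, w4, w2, w2, w1, w4}

p = initial[sequence[0] - 1]
for s, t in zip(sequence[:-1], sequence[1:]):
    p *= A[s - 1, t - 1]            # multiplies a_14 * a_42 * a_22 * a_21 * a_14
print(p)
```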