CS775 Lec 2

Chapter 3:
Maximum-Likelihood &
Bayesian Parameter Estimation

Introduction
Maximum-Likelihood Estimation
– Example of a Specific Case
– The Gaussian Case: unknown μ and Σ
– Bias
Introduction
– Data availability in a Bayesian framework
We could design an optimal classifier if we knew:
– P(ωi) (priors)
– P(x | ωi) (class-conditional densities)
Unfortunately, we rarely have this complete information!

– Design a classifier from a training sample
Prior estimation poses no problem
Samples are often too small for class-conditional estimation (large dimension of the feature space!)
– A priori information about the problem
– Normality of P(x | ωi):

P(x | ωi) ~ N(μi, Σi)

Characterized by 2 parameters
– Estimation techniques
Maximum-Likelihood (ML) and Bayesian estimation
Results are nearly identical, but the approaches are different (they coincide in the limit of an infinite number of samples)
In ML estimation the parameters are fixed but unknown!
The best parameters are obtained by maximizing the probability of obtaining the samples observed
Bayesian methods view the parameters as random variables having some known prior distribution
In either approach, we use P(ωi | x) for our classification rule!
We can use estimation for other problems too!
Maximum-Likelihood Estimation
Has good convergence properties as the sample size increases
Simpler than alternative techniques
– General principle
Assume we have c classes and

P(x | ωj) ~ N(μj, Σj)
P(x | ωj) ≡ P(x | ωj, θj), where:

$$\theta_j = (\mu_j, \Sigma_j) = (\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \operatorname{cov}(x_j^m, x_j^n), \ldots)$$
Use the information provided by the training samples to estimate
θ = (θ1, θ2, …, θc); each θi (i = 1, 2, …, c) is associated with its category
Suppose that D contains n samples, x1, x2, …, xn

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)$$

P(D | θ) is called the likelihood of θ w.r.t. the set of samples
The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ)
“It is the value of θ that best agrees with the actually observed training samples”
Optimal estimation
– Let θ = (θ1, θ2, …, θp)t and let ∇θ be the gradient operator

$$\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$$

– We define l(θ) as the log-likelihood function

l(θ) = ln P(D | θ)

– New problem statement:
determine the θ that maximizes the log-likelihood

$$\hat\theta = \arg\max_\theta\, l(\theta)$$

The set of necessary conditions for an optimum is:

$$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) = 0$$
Example of a specific case: unknown μ
– P(xk | μ) ~ N(μ, Σ)
(Samples are drawn from a multivariate normal population)

$$\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\!\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t \Sigma^{-1}(x_k - \mu)$$

$$\text{and}\quad \nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$$

θ = μ, therefore:
• The ML estimate for μ must satisfy:

$$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat\mu) = 0$$

• Multiplying by Σ and rearranging, we obtain:

$$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$$

Just the arithmetic average of the training samples! (Exhale now!)

Conclusion:
If P(xk | ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)t and perform an optimal classification!
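The following is not from the slides: a minimal numpy sketch, on synthetic data, checking numerically that the arithmetic average maximizes the Gaussian log-likelihood when Σ is known. All names and values here are illustrative assumptions.

```python
# Hypothetical illustration: the sample mean maximizes the Gaussian
# log-likelihood when Sigma is known.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500
true_mu = np.array([1.0, -2.0, 0.5])
Sigma = np.diag([1.0, 2.0, 0.5])                      # known covariance
X = rng.multivariate_normal(true_mu, Sigma, size=n)   # training samples x_1..x_n

Sigma_inv = np.linalg.inv(Sigma)
log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))

def log_likelihood(mu):
    """Sum over k of ln P(x_k | mu) for a Gaussian with known Sigma."""
    diff = X - mu
    quad = np.einsum('ki,ij,kj->k', diff, Sigma_inv, diff)
    return np.sum(log_norm - 0.5 * quad)

mu_hat = X.mean(axis=0)                  # ML estimate: the arithmetic average
print("mu_hat =", mu_hat)
print("l(mu_hat)       =", log_likelihood(mu_hat))
print("l(mu_hat + 0.1) =", log_likelihood(mu_hat + 0.1))   # strictly lower
```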
ML Estimation:
– Gaussian Case: unknown μ and σ
θ = (θ1, θ2) = (μ, σ²)

$$l = \ln P(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$$

$$\nabla_\theta l = \begin{bmatrix} \dfrac{\partial}{\partial\theta_1}\ln P(x_k \mid \theta) \\[2mm] \dfrac{\partial}{\partial\theta_2}\ln P(x_k \mid \theta) \end{bmatrix} = 0$$

$$\Rightarrow\quad \frac{1}{\theta_2}(x_k - \theta_1) = 0 \quad\text{and}\quad -\frac{1}{2\theta_2} + \frac{(x_k - \theta_1)^2}{2\theta_2^2} = 0$$
Summation over the n samples gives:

$$\sum_{k=1}^{n} \frac{1}{\hat\theta_2}(x_k - \hat\theta_1) = 0 \qquad (1)$$

$$-\sum_{k=1}^{n} \frac{1}{\hat\theta_2} + \sum_{k=1}^{n} \frac{(x_k - \hat\theta_1)^2}{\hat\theta_2^2} = 0 \qquad (2)$$

Combining (1) and (2), one obtains:

$$\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k \; ; \qquad \hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat\mu)^2$$
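A short illustrative sketch (my addition, on synthetic data) of these closed-form univariate estimates; note the 1/n divisor for σ̂².

```python
# Hypothetical example: univariate ML estimates mu_hat and sigma_hat^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)    # samples from N(5, 4)

mu_hat = x.sum() / x.size                        # (1/n) * sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).sum() / x.size  # (1/n) * sum_k (x_k - mu_hat)^2

print(mu_hat, sigma2_hat)               # close to 5 and 4
print(np.isclose(sigma2_hat, x.var()))  # np.var uses the same 1/n (ML) divisor
```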
Bias
– The ML estimate for σ² is biased:

$$E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

– An elementary unbiased estimator for Σ is:

$$C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat\mu)(x_k - \hat\mu)^t \qquad \text{(sample covariance matrix)}$$
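A quick simulation sketch (not from the slides; sample size and variance are assumed) showing the (n − 1)/n bias of the ML variance and the effect of the 1/(n − 1) correction.

```python
# Hypothetical simulation: E[sigma2_ML] ~= (n-1)/n * sigma^2 for small n.
import numpy as np

rng = np.random.default_rng(2)
n, trials, true_var = 5, 200_000, 4.0

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
ml_vars = samples.var(axis=1, ddof=0)        # biased ML estimate (1/n)
unbiased_vars = samples.var(axis=1, ddof=1)  # sample variance (1/(n-1))

print(ml_vars.mean())        # ~ (n-1)/n * 4 = 3.2
print(unbiased_vars.mean())  # ~ 4.0
```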
ML Problem Statement
– Let D = {x1, x2, …, xn}

$$P(x_1, \ldots, x_n \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta); \quad |D| = n$$

Our goal is to determine θ̂ (the value of θ that makes this sample the most representative!)

θ = (θ1, θ2, …, θc)

Problem: find θ̂ such that:

$$\max_\theta P(D \mid \theta) = \max_\theta P(x_1, \ldots, x_n \mid \theta) = \max_\theta \prod_{k=1}^{n} P(x_k \mid \theta)$$
Chapter 3:
Maximum-Likelihood and
Bayesian Parameter Estimation
 Bayesian Estimation (BE)
 Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Theory
 Problems of Dimensionality
 Computational Complexity
 Component Analysis and Discriminants
 Hidden Markov Models
Bayesian Estimation (Bayesian learning applied to pattern classification problems)
– In MLE θ was assumed to be fixed
– In BE θ is a random variable
– The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
– Goal: compute P(ωi | x, D)
Given the sample D, Bayes formula can be written:

$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D)\,P(\omega_i \mid D)}{\sum_{j=1}^{c} P(x \mid \omega_j, D)\,P(\omega_j \mid D)}$$

To derive the preceding equation, use:

$$P(x, D \mid \omega_i) = P(x \mid D, \omega_i)\,P(D \mid \omega_i)$$

$$P(x \mid D) = \sum_j P(x, \omega_j \mid D)$$

$$P(\omega_i) = P(\omega_i \mid D) \quad \text{(the training sample provides this!)}$$

Thus:

$$P(\omega_i \mid x, D) = \frac{P(x \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j, D_j)\,P(\omega_j)}$$
Bayesian Parameter Estimation: Gaussian Case

Goal: Estimate θ using the a-posteriori density P(θ | D)

The univariate case: P(μ | D)
μ is the only unknown parameter

$$P(x \mid \mu) \sim N(\mu, \sigma^2)$$
$$P(\mu) \sim N(\mu_0, \sigma_0^2)$$

(μ0 and σ0 are known!)
$$P(\mu \mid D) = \frac{P(D \mid \mu)\,P(\mu)}{\int P(D \mid \mu)\,P(\mu)\,d\mu} \qquad (1)$$

$$= \alpha \prod_{k=1}^{n} P(x_k \mid \mu)\,P(\mu)$$

– Reproducing density

$$P(\mu \mid D) \sim N(\mu_n, \sigma_n^2) \qquad (2)$$

Identifying (1) and (2) yields:

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0$$

$$\text{and}\quad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

μn is a linear combination of μ̂n (the sample mean) and μ0
When n → ∞, μn → μ̂n (as in the ML case!)
– The univariate case: P(x | D)
P(μ | D) has been computed
P(x | D) remains to be computed!

$$P(x \mid D) = \int P(x \mid \mu)\,P(\mu \mid D)\,d\mu \quad \text{is Gaussian}$$

It provides:

$$P(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

(the desired class-conditional density P(x | Dj, ωj))

Therefore, using P(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

$$\max_{\omega_j}\, P(\omega_j \mid x, D) = \max_{\omega_j}\, P(x \mid \omega_j, D_j)\,P(\omega_j)$$
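A minimal sketch (my addition; the prior values μ0 and σ0² are assumed) of the updates μn and σn² and the resulting predictive density N(μn, σ² + σn²) for the univariate case above.

```python
# Hypothetical sketch of Bayesian learning of a Gaussian mean (sigma^2 known).
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.0                        # known data variance
mu0, sigma0_2 = 0.0, 10.0           # assumed prior: P(mu) ~ N(mu0, sigma0^2)
x = rng.normal(2.0, np.sqrt(sigma2), size=50)   # observed samples

n = x.size
mu_ml = x.mean()                    # hat{mu}_n, the sample mean
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_ml \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n_2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

print("posterior  P(mu|D) ~ N(%.3f, %.4f)" % (mu_n, sigma_n_2))
print("predictive P(x |D) ~ N(%.3f, %.4f)" % (mu_n, sigma2 + sigma_n_2))
```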
Bayesian Parameter Estimation: General Theory

– The P(x | D) computation can be applied to any situation in which the unknown density can be parametrized; the basic assumptions are:

The form of P(x | θ) is assumed known, but the value of θ is not known exactly
Our knowledge about θ is assumed to be contained in a known prior density P(θ)
The rest of our knowledge about θ is contained in a set D of n random samples x1, x2, …, xn drawn according to P(x)

The basic problem is:
“Compute the posterior density P(θ | D)”, then “Derive P(x | D)”
Using Bayes formula, we have:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int P(D \mid \theta)\,P(\theta)\,d\theta}$$

And by the independence assumption:

$$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
More cases: Binary Variables
Coin flipping: heads = 1, tails = 0
Bernoulli Distribution
Binary variables (2)
N coin flips:
Binomial Distribution
Binomial distribution
Parameter Estimation
ML for Bernoulli
Given: a data set D = {x1, …, xN} of observed flips
Parameter Estimation (2)
Example: a data set in which every observed toss lands heads gives the ML estimate μML = 1
Prediction: all future tosses will land heads up
Overfitting to D
Beta Distribution
Distribution over μ ∈ [0, 1]
Bayesian Bernoulli
The Beta distribution provides the conjugate prior for the Bernoulli distribution.
Beta Distribution
Prior ∙ Likelihood = Posterior
Properties of the Posterior
As the size of the data set, N, increases, the posterior mean approaches the ML estimate and the posterior variance shrinks toward zero
Prediction under the Posterior
What is the probability that the next coin toss will land heads up?
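A small sketch (not from the slides; the prior pseudo-counts a and b are assumed) of the Beta-Bernoulli conjugate update and the posterior-predictive probability that the next toss lands heads, contrasted with the overfitting ML estimate.

```python
# Hypothetical Beta-Bernoulli example.
# Prior Beta(a, b); after m heads and (N - m) tails the posterior is Beta(a + m, b + N - m).
import numpy as np

flips = np.array([1, 1, 1])        # three observed heads (1 = heads, 0 = tails)
a, b = 2.0, 2.0                    # assumed prior pseudo-counts

heads = flips.sum()
tails = flips.size - heads

mu_ml = heads / flips.size                          # ML estimate: 1.0 (overfits D)
p_heads_bayes = (a + heads) / (a + b + flips.size)  # posterior-predictive P(next = heads)

print(mu_ml)          # 1.0
print(p_heads_bayes)  # 5/7 ~ 0.714, pulled toward the prior
```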
Multinomial Variables
1-of-K coding scheme, e.g.
x = [0, 0, 1, 0, 0, 0]T
ML Parameter estimation
Given: a data set D of N observations
To ensure Σk μk = 1, use a Lagrange multiplier λ.
The Multinomial Distribution
The Dirichlet Distribution
Conjugate prior for the multinomial distribution.
Bayesian Multinomial (1)
Bayesian Multinomial (2)
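Similarly, a sketch (my addition, with an assumed symmetric Dirichlet prior α) of the conjugate Dirichlet update for 1-of-K counts.

```python
# Hypothetical Dirichlet-multinomial example with 1-of-K observations.
import numpy as np

X = np.array([[0, 0, 1],           # each row is a 1-of-K coded observation
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])
alpha = np.full(3, 2.0)            # assumed symmetric Dirichlet prior

counts = X.sum(axis=0)             # m_k: how often each outcome occurred
mu_ml = counts / counts.sum()                         # ML estimate (Lagrange-multiplier result)
mu_bayes = (alpha + counts) / (alpha + counts).sum()  # posterior mean under the Dirichlet

print(mu_ml)     # [0.25 0.25 0.5 ]
print(mu_bayes)  # smoothed toward uniform by the prior
```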
Problems of Dimensionality
– Problems involving 50 or 100 features (binary valued)
Classification accuracy depends upon the dimensionality and the amount of training data
Case of two classes, multivariate normal with the same covariance:

$$P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\,du, \quad \text{where } r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1}(\mu_1 - \mu_2)$$

$$\lim_{r \to \infty} P(\text{error}) = 0$$
If the features are independent, then:

$$\Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$$

$$r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$

The most useful features are the ones for which the difference between the means is large relative to the standard deviation

It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
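An illustrative sketch (my addition; the means and standard deviations are made up) that evaluates r and the error integral above using the complementary error function.

```python
# Hypothetical check of the error formula for two equal-covariance Gaussians
# with independent features: r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2.
import math
import numpy as np

mu1 = np.array([0.0, 0.0, 0.0])
mu2 = np.array([1.0, 0.5, 0.2])
sigma = np.array([1.0, 1.0, 1.0])          # diagonal Sigma = diag(sigma_i^2)

r = math.sqrt(np.sum(((mu1 - mu2) / sigma) ** 2))

# P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du
#          = 0.5 * erfc( (r/2) / sqrt(2) )
p_error = 0.5 * math.erfc((r / 2) / math.sqrt(2))
print(r, p_error)   # adding discriminative features increases r and shrinks P(error)
```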
Computational Complexity

– Our design methodology is affected by computational difficulty

“big oh” notation

f(x) = O(h(x)): “big oh of h(x)”
If there exist (c0, x0) ∈ ℝ² such that f(x) ≤ c0 h(x) for all x > x0

(An upper bound: f(x) grows no worse than h(x) for sufficiently large x!)

f(x) = 2 + 3x + 4x²
g(x) = x²
f(x) = O(x²)

– “big oh” is not unique!
f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)

“big theta” notation

f(x) = Θ(h(x))
If there exist (x0, c1, c2) ∈ ℝ³ such that for all x > x0:
0 ≤ c1 h(x) ≤ f(x) ≤ c2 h(x)

f(x) = Θ(x²) but f(x) ≠ Θ(x³)
– Complexity of ML Estimation

Gaussian priors in d dimensions, with n training samples for each of c classes

For each category, we have to compute the discriminant function:

$$g(x) = -\frac{1}{2}(x - \hat\mu)^t \hat\Sigma^{-1}(x - \hat\mu) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat\Sigma| + \ln P(\omega)$$

(learning costs per term: estimating μ̂ is O(d·n); estimating the sample covariance Σ̂ is O(n·d²); estimating ln P(ω) from sample frequencies is O(n); the constant (d/2) ln 2π is O(1))

Total = O(d²·n)
Total for c classes = O(c·d²·n) = O(d²·n)

The cost increases when d and n are large!
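A sketch (my addition, not the book's code) of training and evaluating the discriminant g(x) for a single class, with the per-step costs noted in comments.

```python
# Hypothetical quadratic discriminant for one class, trained by ML.
import numpy as np

def train_class(X, prior):
    """Estimate (mu_hat, Sigma_hat, log prior) from an n x d sample matrix."""
    mu_hat = X.mean(axis=0)                      # O(d*n)
    Sigma_hat = np.cov(X, rowvar=False, ddof=1)  # O(n*d^2), sample covariance
    return mu_hat, Sigma_hat, np.log(prior)

def g(x, mu_hat, Sigma_hat, log_prior):
    """Discriminant g(x) as above; cheap per evaluation once the inverse
    and determinant are available."""
    d = x.size
    diff = x - mu_hat
    quad = diff @ np.linalg.inv(Sigma_hat) @ diff
    return (-0.5 * quad
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma_hat))
            + log_prior)

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0], np.eye(2), size=100)
params = train_class(X, prior=0.5)
print(g(np.array([0.1, -0.2]), *params))
```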
Component Analysis and Discriminants

– Combine features in order to reduce the dimension of the feature space
– Linear combinations are simple to compute and tractable
– Project high-dimensional data onto a lower-dimensional space
– Two classical approaches for finding an “optimal” linear transformation:

PCA (Principal Component Analysis): “projection that best represents the data in a least-squares sense”
MDA (Multiple Discriminant Analysis): “projection that best separates the data in a least-squares sense”
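A minimal PCA sketch (my addition) that projects data onto the top-m eigenvectors of the sample covariance, i.e., the least-squares-best linear representation.

```python
# Hypothetical PCA sketch: project d-dimensional data onto the top-m principal axes.
import numpy as np

def pca_project(X, m):
    """Return the m-dimensional projection that best represents X (least squares)."""
    mu = X.mean(axis=0)
    Xc = X - mu                                   # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :m]                   # top-m eigenvectors as columns
    return Xc @ W, W, mu

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0, 0],
                            [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=200)
Y, W, mu = pca_project(X, m=2)
print(Y.shape)   # (200, 2)
```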
Expectation Maximization
Learning the parameters governing a distribution from training points with missing values
A generalization of MLE to missing features
Consider D = {x1, …, xn}, where each sample xi = [xig, xib] has good (observed) features xig and bad (missing) features xib

$$Q(\theta; \theta^i) = E_{D_b}\!\left[\ln p(D_g, D_b; \theta) \mid D_g; \theta^i\right]$$

Here θ is the improved (candidate) estimate, θ^i is the best estimate so far, and the expectation marginalizes over the missing features Db given the observed Dg.
EM Algorithm

begin initialize θ0, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θi)
    M step: θi+1 ← arg maxθ Q(θ; θi)
  until Q(θi+1; θi) − Q(θi; θi−1) ≤ T
  return θ̂ ← θi+1
end
EM example: Mixtures of Gaussians

$$f(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \sigma_k)$$

Pick initial estimates θ0 of
θ = {π1, …, πK, μ1, …, μK, σ1, …, σK}

E-step: plug the current estimates πk, μk, σk into the responsibilities

$$P(k \mid x) = \frac{\pi_k\, N(x \mid \mu_k, \sigma_k)}{f(x)}$$

M-step: use the responsibilities to solve for the new estimates

$$\pi_k = \frac{1}{n}\sum_{i=1}^{n} P(k \mid x(i))$$

$$\mu_k = \frac{1}{n\pi_k}\sum_{i=1}^{n} P(k \mid x(i))\,x(i)$$

$$\sigma_k^2 = \frac{1}{n\pi_k}\sum_{i=1}^{n} P(k \mid x(i))\,(x(i) - \mu_k)^2$$

Then use the new estimates to go back to the E-step, and repeat.
Gaussian Mixture Example: [figure panels showing the fit at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]
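A compact sketch (my addition, 1-D, with synthetic data and an assumed K) of the E- and M-step updates above.

```python
# Hypothetical EM for a 1-D mixture of K Gaussians, following the updates above.
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm(x, K, iters=50):
    n = x.size
    pi = np.full(K, 1.0 / K)                      # initial estimates theta_0
    mu = np.linspace(x.min(), x.max(), K)
    sigma = np.full(K, x.std())
    for _ in range(iters):
        # E-step: responsibilities P(k | x_i)
        dens = np.array([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(K)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate pi_k, mu_k, sigma_k
        Nk = resp.sum(axis=1)
        pi = Nk / n
        mu = (resp @ x) / Nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    return pi, mu, sigma

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gmm(x, K=2))
```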
Hidden Markov Models:
– Markov Chains

– Goal: make a sequence of decisions

Processes that unfold in time; states at time t are influenced by the state at time t − 1

Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …

Any temporal process without memory:

ωT = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states
We might have ω6 = {ω1, ω4, ω2, ω2, ω1, ω4}

The system can revisit a state at different steps, and not every state need be visited
– First-order Markov models

The production of any sequence is described by the transition probabilities:

P(ωj(t + 1) | ωi(t)) = aij

θ = (aij, ωT)

P(ωT | θ) = a14 · a42 · a22 · a21 · a14 · P(ω(1) = ωi)

Example: speech recognition
“production of spoken words”

Production of the word “pattern”, represented by phonemes:
/p/ /a/ /tt/ /er/ /n/ // ( // = silent state)
Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to a silent state
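A toy sketch (my addition; the transition matrix and initial probabilities are made up) of evaluating P(ωT | θ) for the example sequence ω6 = {ω1, ω4, ω2, ω2, ω1, ω4}.

```python
# Hypothetical first-order Markov chain: probability of a given state sequence.
import numpy as np

# Made-up 4-state transition matrix a_ij = P(omega_j at t+1 | omega_i at t);
# rows sum to 1.
A = np.array([[0.2, 0.3, 0.1, 0.4],
              [0.5, 0.2, 0.2, 0.1],
              [0.1, 0.4, 0.4, 0.1],
              [0.3, 0.3, 0.2, 0.2]])
initial = np.full(4, 0.25)          # assumed P(omega(1) = omega_i)

sequence = [1, 4, 2, 2, 1, 4]       # omega^6 = {w1, w4, w2, w2, w1, w4}

p = initial[sequence[0] - 1]
for s, t in zip(sequence[:-1], sequence[1:]):
    p *= A[s - 1, t - 1]            # multiplies a_14 * a_42 * a_22 * a_21 * a_14
print(p)
```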