Chapter 4: ML Parametric Classification

BLG 527E Machine Learning

FALL 2021-2022
Assoc. Prof. Yusuf Yaslan & Assist. Prof. Ayşe Tosun

Parametric Methods
Parametric Estimation
• X = { xt }t where xt ~ p(x)
• Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}

Slides based on Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © The MIT Press.
Maximum Likelihood Estimation
• Likelihood of θ given the sample X
l (θ|X) = p (X |θ) = ∏t p (xt|θ)

• Log likelihood
L(θ|X) = log l (θ|X) = ∑t log p (xt|θ)

• Maximum likelihood estimator (MLE)


θ* = argmaxθ L(θ|X)
Examples: Bernoulli/Multinomial
• Bernoulli: Two states, failure/success, x in {0,1}
P(x) = p₀^x (1 – p₀)^(1 – x)
L(p₀|X) = log ∏t p₀^(xt) (1 – p₀)^(1 – xt)
MLE: p₀ = ∑t xt / N

• Multinomial: K > 2 states, xi in {0,1}
P(x1, x2, ..., xK) = ∏i pi^(xi)
L(p1, p2, ..., pK|X) = log ∏t ∏i pi^(xit)
MLE: pi = ∑t xit / N

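As an illustration (a sketch added to these notes, not from the original slides), the closed-form MLEs above in NumPy, assuming x is an array of 0/1 outcomes and X a matrix of one-hot indicator rows:

import numpy as np

def bernoulli_mle(x):
    # x: array of 0/1 outcomes; the MLE of p0 is the sample proportion of 1's
    return x.mean()

def multinomial_mle(X):
    # X: N x K array of one-hot indicator rows; the MLE of p_i is the column mean
    return X.mean(axis=0)

x = np.array([1, 0, 1, 1, 0, 1])
print(bernoulli_mle(x))           # 0.666...
X = np.eye(3)[[0, 2, 2, 1, 2]]    # 5 one-hot draws from a 3-state multinomial
print(multinomial_mle(X))         # [0.2 0.2 0.6]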
Examples: Bernoulli (Derivation)
• Bernoulli: Two states, failure/success, x in {0,1}

P(x) = p₀^x (1 – p₀)^(1 – x)
L(p₀|X) = log ∏t p₀^(xt) (1 – p₀)^(1 – xt) = ∑t [ xt log p₀ + (1 – xt) log(1 – p₀) ]

dL(p₀|X)/dp₀ = (d/dp₀) ∑t xt log p₀ + (d/dp₀) ∑t (1 – xt) log(1 – p₀)
             = (1/p₀) ∑t xt – (1/(1 – p₀)) ∑t (1 – xt) = 0
Bernoulli (Derivation)
(1 – p₀) ∑t xt – p₀ ∑t (1 – xt) = 0   ⇒   ∑t xt – p₀ N = 0   ⇒   p₀ = (1/N) ∑t xt

MLE: p₀ = ∑t xt / N
Gaussian (Normal) Distribution
• p(x) = N(μ, σ²)

p(x) = 1/(√(2π) σ) · exp[ –(x – μ)² / (2σ²) ]

[Figure: Gaussian density centered at μ with standard deviation σ]
Gaussian (Normal) Distribution
• Given that X = { xt }t with xt ~ N(μ, σ²)

Log likelihood: L(μ, σ|X) = –(N/2) log(2π) – N log σ – ∑t (xt – μ)² / (2σ²)

MLE for μ and σ²:

m = ∑t xt / N

s² = ∑t (xt – m)² / N
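A quick numerical check of these estimates (an added sketch, not from the slides); note that the ML variance divides by N, which is what np.var computes by default (ddof=0):

import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)

m = x.sum() / len(x)                 # MLE of the mean
s2 = ((x - m) ** 2).sum() / len(x)   # MLE of the variance (divides by N, not N - 1)

print(m, s2)
print(np.mean(x), np.var(x))         # np.var uses ddof=0 by default, i.e. the ML estimate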
Bias and Variance
Let X be a sample from a population specified up to a parameter θ, and let d = d(X) be an
estimator of θ.

To evaluate the quality of this estimator, we can measure how much it differs from θ,
that is, (d(X) – θ)².

But since d(X) is a random variable (it depends on the sample), we need to average over all
possible X and consider the mean square error of the estimator.

Remember the properties of expectation.
Bias and Variance
Unknown parameter θ
Estimator di = d(Xi) on sample Xi

Bias: bθ(d) = E[d] – θ

Variance: E[(d – E[d])²]

Mean square error:

r(d, θ) = E[(d – θ)²] = E[(d – E[d] + E[d] – θ)²]

        = E[(E[d] – θ)² + (d – E[d])² + 2 (d – E[d])(E[d] – θ)]

Remember the properties of expectation:

        = E[(E[d] – θ)²] + E[(d – E[d])²] + 2 E[(d – E[d])(E[d] – θ)]

        = (E[d] – θ)² + E[(d – E[d])²] + 2 (E[d] – E[d])(E[d] – θ)

        = (E[d] – θ)² + E[(d – E[d])²]

        = Bias² + Variance
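To make the decomposition concrete, here is a small simulation sketch (an addition to the notes) that estimates the bias, variance, and mean square error of the ML variance estimator s², whose bias is known to be –σ²/N:

import numpy as np

rng = np.random.default_rng(1)
N, reps, true_var = 10, 100_000, 4.0

# d(X) = ML variance estimate computed on many independent samples X of size N
d = np.array([np.var(rng.normal(0.0, np.sqrt(true_var), size=N)) for _ in range(reps)])

bias = d.mean() - true_var               # E[d] - theta, approximately -true_var / N = -0.4
variance = ((d - d.mean()) ** 2).mean()  # E[(d - E[d])^2]
mse = ((d - true_var) ** 2).mean()       # approximately bias^2 + variance

print(bias, variance, mse, bias ** 2 + variance)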
Bayes’ Estimator
• Sometimes, before looking at a sample, we may have prior information on the possible range of values that a parameter θ may take.
• This information is quite useful especially when the sample is small.
• Treat θ as a random variable with prior p (θ)

• Bayes’ rule: p (θ|X) = p(X|θ) p(θ) / p(X)


• Density at x: p(x|X) = ∫ p(x|θ, X) p(θ|X) dθ= ∫ p(x|θ) p(θ|X) dθ

Bayes’ Estimator
• Evaluating the p(x|X) integrals may be quite difficult except in cases
where the posterior has a nice form
• Maximum a Posteriori (MAP): θMAP = argmaxθ p(θ|X)
• Maximum Likelihood (ML): θML = argmaxθ p(X|θ)
• Bayes’: θBayes’ = E[θ|X] = ∫ θ p(θ|X) dθ
• If we have no prior reason to favor some values of θ then the prior
density is flat and the posterior will have the same form as the
likelihood p(X|θ)

Bayes’ Estimator: Example
• xt ~ N(θ, σ₀²) and θ ~ N(μ, σ²), where μ, σ², and σ₀² are known
• θML = m

• p(θ|X) ∝ p(X| θ)p(θ)

• Take the derivative with respect to θ

p(xt|θ) = 1/(√(2π) σ₀) · exp[ –(xt – θ)² / (2σ₀²) ]

p(θ) = 1/(√(2π) σ) · exp[ –(θ – μ)² / (2σ²) ]

Likelihood × prior:

p(X|θ) p(θ) = [ ∏t=1..N 1/(√(2π) σ₀) exp( –(xt – θ)² / (2σ₀²) ) ] · 1/(√(2π) σ) exp( –(θ – μ)² / (2σ²) )

Log posterior (up to an additive constant):

L(θ) = – ∑t (xt – θ)² / (2σ₀²) – (θ – μ)² / (2σ²) + const

Setting the derivative to zero:

dL(θ)/dθ = ∑t (xt – θ) / σ₀² + (μ – θ) / σ² = 0

⇒ (N/σ₀²) m + (1/σ²) μ = θ (N/σ₀² + 1/σ²),   where m = ∑t xt / N

E[θ|X] = [ (N/σ₀²) m + (1/σ²) μ ] / [ N/σ₀² + 1/σ² ]
Bayes’ Estimator: Example
• xt ~ N(θ, σ₀²) and θ ~ N(μ, σ²)
• θML = m
• θMAP = θBayes' = E[θ|X] = [ (N/σ₀²) m + (1/σ²) μ ] / [ N/σ₀² + 1/σ² ]
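A minimal sketch of this estimator in code (added for illustration; the function name and arguments are ours): the posterior mean is a precision-weighted combination of the sample mean m and the prior mean μ, and it approaches m as N grows:

import numpy as np

def bayes_gaussian_mean(x, sigma0_sq, mu, sigma_sq):
    # x^t ~ N(theta, sigma0^2) with prior theta ~ N(mu, sigma^2); returns E[theta | X]
    N, m = len(x), np.mean(x)
    w_data, w_prior = N / sigma0_sq, 1.0 / sigma_sq
    return (w_data * m + w_prior * mu) / (w_data + w_prior)

x = np.random.default_rng(2).normal(3.0, 1.0, size=20)
print(np.mean(x))                              # theta_ML = m
print(bayes_gaussian_mean(x, 1.0, 0.0, 0.5))   # pulled toward the prior mean 0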
Bayesian Learning for Coin Model
• Bayesian Learning procedure:
• Given data x1, x2,…, xN write down expression for likelihood p(X| θ)

Specify a prior P (θ )

Compute posterior p(θ |X) = p(X| θ)p(θ)/ p(X)

p(θ |X) ∝ p(X| θ)p(θ)

• This example is adapted from Nando de Freitas' lecture notes


Bayesian Learning for Coin Model
• For the coin model, the likelihood of the data (i.i.d. in our case) is

P(x1, x2, …, xN|θ) = ∏t θ^(xt) (1 – θ)^(1 – xt) = θ^m (1 – θ)^(N – m)

where xt ∈ {0,1} and m is the number of 1's

• Specify a prior on θ. For this we introduce the Beta distribution.


Beta Distribution
(   )   1  ,  are hyperparameters
p ( )   (1   )   1
( )(  )

( z ) e x x z  1dx
0

p( )d 1
(   )   1
( )( )
1
 (1   ) d 1

( )(  ) 
 (1   ) d  (   )
1 1
E[ ] 
 
• The figure is obtained from wikipedia
Bayesian Learning for Coin Model
Compute Posterior: p(θ |X) ∝ p(X| θ)p(θ)

P(x1, x2, …, xN|θ) = ∏t θ^(xt) (1 – θ)^(1 – xt) = θ^m (1 – θ)^(N – m)

p(θ) = Γ(α + β) / (Γ(α) Γ(β)) · θ^(α – 1) (1 – θ)^(β – 1) = (1/const) · θ^(α – 1) (1 – θ)^(β – 1)

p(θ|X) ∝ θ^(α – 1) (1 – θ)^(β – 1) · θ^m (1 – θ)^(N – m)

⇒ p(θ|X) = Γ(α' + β') / (Γ(α') Γ(β')) · θ^(α' – 1) (1 – θ)^(β' – 1),   with α' = m + α, β' = N – m + β
Conjugate Prior
• Conjugate priors: A likelihood-prior pair is said to be conjugate if they
result in a posterior which is of the same form as the prior.
• This enables us to compute the posterior density analytically without
having to worry about computing the denominator in Bayes' rule, the
marginal likelihood.

Prior       Likelihood
Gaussian    Gaussian
Beta        Binomial
Dirichlet   Multinomial
Gamma       Gaussian
Example
• Suppose that we observe X = {1,1,1,1,1,1}, where each xt comes from a Bernoulli distribution: θML = 1
• We can compute the posterior and use its mean as the estimate

p(θ|X) ∝ (1/const) · θ^(α – 1) (1 – θ)^(β – 1) · θ^m (1 – θ)^(N – m)

⇒ p(θ|X) = Γ(α' + β') / (Γ(α') Γ(β')) · θ^(α' – 1) (1 – θ)^(β' – 1),   E[θ] = α' / (α' + β')

with α' = m + α,  β' = N – m + β

• Using a Beta(2,2) prior: θBayes = E[θ|X] = 8/10
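A minimal sketch of this conjugate update (an addition to the notes): with a Beta(α, β) prior and m ones in N Bernoulli trials, the posterior is Beta(m + α, N – m + β) and its mean is the Bayes estimate:

def beta_bernoulli_posterior(x, alpha=2.0, beta=2.0):
    # x: list of 0/1 outcomes; returns the posterior hyperparameters and the posterior mean
    N, m = len(x), sum(x)
    a_post, b_post = m + alpha, N - m + beta
    return a_post, b_post, a_post / (a_post + b_post)

print(beta_bernoulli_posterior([1, 1, 1, 1, 1, 1]))   # (8.0, 2.0, 0.8), versus theta_ML = 1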
Parametric Classification
gi x  px |C i P C i 
or
gi x  log px |C i  log P C i 

px |C i  
1 
exp  
x  i  
2

2 
2  i  2  i 
1
gi x   log 2  log  i 
x   i 2
 log P C i 
2
2 2 i

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 24
• Given the sample X = {xt, rt}, t = 1, …, N, where

rit = 1 if xt ∈ Ci, and rit = 0 if xt ∈ Cj, j ≠ i

• ML estimates are

P̂(Ci) = ∑t rit / N,   mi = ∑t xt rit / ∑t rit,   si² = ∑t (xt – mi)² rit / ∑t rit

• Discriminant becomes

gi(x) = –(1/2) log 2π – log si – (x – mi)² / (2si²) + log P̂(Ci)
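As an illustration (an added sketch, not from the slides), a plug-in version of this classifier for one-dimensional inputs: fit computes the ML estimates above and predict picks the class with the largest discriminant gi(x):

import numpy as np

def fit(x, r):
    # x: (N,) inputs, r: (N, K) one-hot labels; returns ML estimates P(C_i), m_i, s_i^2
    priors = r.mean(axis=0)
    means = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    variances = (r * (x[:, None] - means) ** 2).sum(axis=0) / r.sum(axis=0)
    return priors, means, variances

def predict(x, priors, means, variances):
    # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i); the common constant is dropped
    g = -0.5 * np.log(variances) - (x[:, None] - means) ** 2 / (2 * variances) + np.log(priors)
    return g.argmax(axis=1)

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
r = np.eye(2)[np.repeat([0, 1], 50)]
print(predict(np.array([-0.5, 2.5, 4.0]), *fit(x, r)))   # e.g. [0 1 1]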
Equal variances

Single boundary at halfway between the means
Variances are different

Two boundaries

Probabilistic Interpretation of Linear Regression

r = f(x) + ε
estimator: g(x|θ)
ε ~ N(0, σ²)
p(r|x) ~ N(g(x|θ), σ²)

L(θ|X) = log ∏t=1..N p(xt, rt)
       = log ∏t=1..N p(rt|xt) + log ∏t=1..N p(xt)
Regression: From LogL to Error

Under the Gaussian noise model, maximizing the log likelihood L(θ|X) is equivalent to
minimizing the sum-of-squares error E(θ|X) = (1/2) ∑t [rt – g(xt|θ)]².
Linear Regression
g x t |w1 , w 0 w1 x t  w 0 Take derivative of E

 r
t
t
Nw 0  w1 x t

t
…wrto w0

r x t t
w 0  x  w1  xt

t 2 …wrto w1
t t t

 N t   w0 
x t
   r t

A  w  y  t 
  t x    w1   t x 
t 2    t t
x t
r
t

w A  1y
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 30
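A sketch of solving these normal equations directly (added here; in practice np.linalg.lstsq is preferable to inverting A explicitly):

import numpy as np

def fit_line(x, r):
    # Build A and y from the normal equations above and solve A w = y for w = [w0, w1]
    A = np.array([[len(x), x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    y = np.array([r.sum(), (r * x).sum()])
    return np.linalg.solve(A, y)

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 30)
r = 2.0 * x + 1.0 + rng.normal(0, 0.3, 30)
print(fit_line(x, r))   # approximately [1.0, 2.0], i.e. w0 ~ 1 and w1 ~ 2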
Polynomial Regression
g x |w k , , w 2 , w1 , w 0 w k x
t
    w x   w x
t k
2
t 2
1
t
 w0

1 x 1

x 1 2
 x  

 1 k
r 1

 2
D 
1 x 2
x 2 2
 x  
2 k
r r 
   
   N
 1 x N
x N 2
 x  
N 2
 r 

w D D DT r T 1

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 31
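The same recipe for polynomial regression, as an added sketch: build the design matrix D and solve the least-squares problem; np.linalg.lstsq avoids forming (DᵀD)⁻¹ explicitly:

import numpy as np

def fit_poly(x, r, k):
    # D has columns 1, x, x^2, ..., x^k (np.vander with increasing=True)
    D = np.vander(x, k + 1, increasing=True)
    w, *_ = np.linalg.lstsq(D, r, rcond=None)   # solves min_w ||D w - r||^2
    return w

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 40)
r = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.05, 40)
print(fit_poly(x, r, 2))   # approximately [1.0, -2.0, 0.5]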
Other Error Measures
• Square Error:
E(θ|X) = (1/2) ∑t=1..N [rt – g(xt|θ)]²

• Relative Square Error:
E(θ|X) = ∑t=1..N [rt – g(xt|θ)]² / ∑t=1..N [rt – r̄]²

• Absolute Error: E(θ|X) = ∑t |rt – g(xt|θ)|

• ε-sensitive Error:
E(θ|X) = ∑t 1(|rt – g(xt|θ)| > ε) (|rt – g(xt|θ)| – ε)
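These error measures are direct to code up; a sketch (our addition) assuming r holds the targets rt and g the predictions g(xt|θ):

import numpy as np

def square_error(r, g):
    return 0.5 * np.sum((r - g) ** 2)

def relative_square_error(r, g):
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps):
    # only deviations larger than eps are penalized, and only by the amount exceeding eps
    excess = np.abs(r - g) - eps
    return np.sum(np.where(excess > 0, excess, 0.0))

r = np.array([1.0, 2.0, 3.0])
g = np.array([1.1, 1.8, 3.4])
print(square_error(r, g), relative_square_error(r, g), absolute_error(r, g), eps_sensitive_error(r, g, 0.2))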
Bias and Variance

   
E r  gx 2 | x E r  E r | x 2 | x  E r | x   gx 2
noise squared error

 
E X E r | x   g x  | x  E r | x   E X g x   E X g x   E X g x 
2 2
 2

bias variance

Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 33
Estimating Bias and Variance

• M samples Xi = {xti, rti}, i = 1, …, M, t = 1, …, N,
are used to fit gi(x), i = 1, …, M

Bias²(g) = (1/N) ∑t [ḡ(xt) – f(xt)]²

Variance(g) = (1/(N M)) ∑t ∑i [gi(xt) – ḡ(xt)]²

ḡ(x) = (1/M) ∑i gi(x)
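An added sketch of this procedure for a known target f(x) = sin(x): fit a fixed-degree polynomial on M independently drawn samples and measure bias² and variance of the fits on a grid of evaluation points (np.polyfit/np.polyval are used for the individual fits):

import numpy as np

rng = np.random.default_rng(6)
f = np.sin
M, N, deg = 100, 25, 3
x_eval = np.linspace(0.0, 5.0, 50)   # the points x^t at which bias and variance are measured

# Fit g_i on M independently drawn samples and evaluate every fit on x_eval
fits = []
for _ in range(M):
    x = rng.uniform(0.0, 5.0, N)
    r = f(x) + rng.normal(0.0, 0.3, N)
    fits.append(np.polyval(np.polyfit(x, r, deg), x_eval))
G = np.array(fits)                   # G[i, t] = g_i(x^t)

g_bar = G.mean(axis=0)                           # average fit g_bar(x^t)
bias_sq = np.mean((g_bar - f(x_eval)) ** 2)      # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((G - g_bar) ** 2)             # (1/(N M)) sum_t sum_i [g_i(x^t) - g_bar(x^t)]^2
print(bias_sq, variance)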
Bias/Variance Dilemma
• Example: gi(x)=2 has no variance and high bias

gi(x)= ∑t rti/N has lower bias with variance

• As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)
• Bias/Variance dilemma: (Geman et al., 1992)

[Figure: the target f, the fitted gi's and their average ḡ, illustrating bias and variance.
Bull's-eye diagram from https://www.researchgate.net/figure/Visualizing-bias-and-variance-tradeoff-using-a-bulls-eye-diagram_fig3_318432363]


Polynomial Regression

Best fit “min error”

Best fit, “elbow”

Regression example
Coefficients increase in magnitude as order
increases:

1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Idea: Penalize large coefficients

Regularization
• New cost function

E(w|X) = (1/2) ∑t=1..N [rt – g(xt|w)]² + (λ/2) ∑i wi²

• Ridge Regression: R(w) = ‖w‖² = ∑i wi²
• LASSO: R(w) = ‖w‖₁ = ∑i |wi|

• In matrix form (ridge): E(w) = (1/2) (r – D w)ᵀ (r – D w) + (λ/2) wᵀ w

• ∂E/∂w = –Dᵀ (r – D w) + λ w = 0

• (Dᵀ D + λ I) w = Dᵀ r

• w = (Dᵀ D + λ I)⁻¹ Dᵀ r

• The derivation is adapted from Nando de Freitas' lecture notes
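A sketch of the closed-form ridge solution (our addition), reusing a polynomial design matrix; increasing λ shrinks the coefficients toward zero, which counteracts the exploding coefficients seen in the regression example above:

import numpy as np

def ridge_fit(x, r, k, lam):
    # w = (D^T D + lambda I)^(-1) D^T r with D the polynomial design matrix
    D = np.vander(x, k + 1, increasing=True)
    return np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 20)
r = 1.0 - 2.0 * x + rng.normal(0, 0.1, 20)

for lam in (0.0, 0.1, 10.0):
    print(lam, np.round(ridge_fit(x, r, 6, lam), 3))   # coefficients shrink as lambda grows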
