Chapter 4 ML Parametric Classification
FALL 2021-2022
Assoc. Prof. Yusuf Yaslan & Assist. Prof. Ayşe Tosun
Parametric Methods
Parametric Estimation
• X = { xt }t where xt ~ p (x)
• Parametric estimation:
Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}
Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © The MIT Press (V1.0)
Maximum Likelihood Estimation
• Likelihood of q given the sample X
l (θ|X) = p (X |θ) = ∏t p (xt|θ)
• Log likelihood
L(θ|X) = log l (θ|X) = ∑t log p (xt|θ)
Examples: Bernoulli (Derivation)
• Bernoulli: Two states, failure/success, x in {0,1}
$$P(x) = p_0^{x}(1-p_0)^{1-x}$$
$$\mathcal{L}(p_0|\mathcal{X}) = \log\prod_t p_0^{x^t}(1-p_0)^{1-x^t}$$
$$\frac{d\mathcal{L}(p_0|\mathcal{X})}{dp_0} = \frac{d}{dp_0}\sum_{t=1}^{N}x^t\log p_0 + \frac{d}{dp_0}\sum_{t=1}^{N}(1-x^t)\log(1-p_0) = \frac{1}{p_0}\sum_{t=1}^{N}x^t - \frac{1}{1-p_0}\sum_{t=1}^{N}(1-x^t) = 0$$
Bernoulli (Derivation)
$$(1-p_0)\sum_{t=1}^{N}x^t - p_0\sum_{t=1}^{N}(1-x^t) = 0 \;\Rightarrow\; \sum_{t=1}^{N}x^t - p_0 N = 0 \;\Rightarrow\; p_0 = \frac{1}{N}\sum_{t=1}^{N}x^t$$
MLE: po = ∑t xt / N
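The MLE is simply the sample proportion of successes. A minimal numpy sketch (the sample below is made up for illustration):

```python
import numpy as np

# Hypothetical sample of N Bernoulli observations (0 = failure, 1 = success)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# MLE: p0_hat = sum_t x^t / N, i.e. the sample mean
p0_hat = x.mean()
print(p0_hat)  # 0.75 for this sample
```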
Gaussian (Normal) Distribution
• p(x) = N ( μ, σ2)
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$
[Figure: the normal density, centered at μ with standard deviation σ]
Gaussian (Normal) Distribution
• Given that X = { xt }t with xt ~ N ( μ, σ2)
the log likelihood is
$$\mathcal{L}(\mu,\sigma|\mathcal{X}) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{\sum_t (x^t-\mu)^2}{2\sigma^2}$$
and the ML estimates are
$$m = \frac{\sum_t x^t}{N}, \qquad s^2 = \frac{\sum_t (x^t - m)^2}{N}$$
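A minimal sketch of these two estimates on a synthetic sample (the true parameters below are assumptions for the demo); note that s² uses the 1/N normalizer, i.e. np.var with ddof=0:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic sample with mu=2, sigma=1.5

m = x.mean()                    # m = sum_t x^t / N
s2 = ((x - m) ** 2).mean()      # s^2 = sum_t (x^t - m)^2 / N  (equals np.var(x, ddof=0))
print(m, s2)                    # close to 2.0 and 2.25
```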
Bias and Variance
Let X be a sample from a population specified up to a
parameter θ
Bias and Variance
Unknown parameter θ
Estimator di = d(Xi) on sample Xi
Mean square error of the estimator: r(d, θ) = E[(d − θ)²]
$$E\left[(d-\theta)^2\right] = E\left[(d - E[d] + E[d] - \theta)^2\right]$$
$$= E\left[(d - E[d])^2\right] + E\left[(E[d]-\theta)^2\right] + 2\,E\left[(d - E[d])(E[d]-\theta)\right]$$
The cross term vanishes because E[d − E[d]] = 0, so
$$= \underbrace{(E[d]-\theta)^2}_{\text{Bias}^2} + \underbrace{E\left[(d - E[d])^2\right]}_{\text{Variance}} = \text{Bias}^2 + \text{Variance}$$
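To make the decomposition concrete, here is a small simulation (my own illustration, not from the slides) that estimates the bias and variance of the ML variance estimator s², which divides by N and is therefore biased, and checks that bias² + variance matches the mean square error:

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0                      # theta: the variance being estimated
N, trials = 10, 50_000

# Draw many samples X_i and compute the estimator d(X_i) = s^2 on each
estimates = np.array([np.var(rng.normal(0.0, 2.0, N), ddof=0) for _ in range(trials)])

bias_sq  = (estimates.mean() - true_var) ** 2     # (E[d] - theta)^2
variance = estimates.var()                        # E[(d - E[d])^2]
mse      = ((estimates - true_var) ** 2).mean()   # E[(d - theta)^2]
print(bias_sq + variance, mse)                    # the two agree up to sampling noise
```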
Bayes’ Estimator
• Sometimes, before looking at a sample, we may have prior information on the possible range of values that a parameter θ may take.
• This information is especially useful when the sample is small.
• Treat θ as a random variable with prior p(θ). By Bayes' rule, the posterior is p(θ|X) = p(X|θ) p(θ) / p(X), and predictions average over it: p(x|X) = ∫ p(x|θ) p(θ|X) dθ.
Bayes’ Estimator
• Evaluating the p(x|X) integrals may be quite difficult except in cases
where the posterior has a nice form
• Maximum a Posteriori (MAP): θMAP = argmaxθ p(θ|X)
• Maximum Likelihood (ML): θML = argmaxθ p(X|θ)
• Bayes’: θBayes’ = E[θ|X] = ∫ θ p(θ|X) dθ
• If we have no prior reason to favor some values of θ, the prior density is flat and the posterior has the same form as the likelihood p(X|θ)
Bayes’ Estimator: Example
• xt ~ N(θ, σo²) and θ ~ N(μ, σ²), where μ, σ², and σo² are known
• θML = m
$$p(X|\theta) = \prod_{t=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{(x^t-\theta)^2}{2\sigma_0^2}\right], \qquad p(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(\theta-\mu)^2}{2\sigma^2}\right]$$
The posterior is proportional to the product of likelihood and prior,
$$p(\theta|X) \propto \prod_{t=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{(x^t-\theta)^2}{2\sigma_0^2}\right]\cdot\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(\theta-\mu)^2}{2\sigma^2}\right]$$
and taking logs,
$$\log p(\theta|X) = -\sum_{t=1}^{N}\frac{(x^t-\theta)^2}{2\sigma_0^2} - \frac{(\theta-\mu)^2}{2\sigma^2} + \text{const.}$$
Setting the derivative with respect to θ to zero shows that the posterior is Gaussian with mean a weighted average of the sample mean m and the prior mean μ:
$$E[\theta|X] = \frac{N/\sigma_0^2}{N/\sigma_0^2 + 1/\sigma^2}\, m + \frac{1/\sigma^2}{N/\sigma_0^2 + 1/\sigma^2}\, \mu, \qquad m = \frac{1}{N}\sum_{t=1}^{N} x^t$$
Bayes’ Estimator: Example
• xt ~ N (θ, σo2) and θ ~ N ( μ, σ2)
• θML = m
• θMAP = θBayes’ =
$$E[\theta|X] = \frac{N/\sigma_0^2}{N/\sigma_0^2 + 1/\sigma^2}\, m + \frac{1/\sigma^2}{N/\sigma_0^2 + 1/\sigma^2}\, \mu$$
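A quick numerical check of this weighted average (a sketch with made-up values for μ, σ², σ0² and a synthetic sample):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 0.0, 1.0            # prior: theta ~ N(mu, sigma2)
sigma0_2 = 4.0                   # known observation variance
N = 20
x = rng.normal(1.5, np.sqrt(sigma0_2), N)   # synthetic observations, true theta = 1.5

m = x.mean()                                           # ML estimate
w = (N / sigma0_2) / (N / sigma0_2 + 1 / sigma2)       # weight on the sample mean
theta_bayes = w * m + (1 - w) * mu                     # E[theta | X]
print(m, theta_bayes)            # the Bayes estimate is pulled toward the prior mean mu
```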
Bayesian Learning for Coin Model
• Bayesian Learning procedure:
• Given data x1, x2, …, xN, write down an expression for the likelihood p(X|θ)
• Specify a prior p(θ)
A convenient prior for θ is the Beta(α, β) density, which integrates to one:
$$p(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \int_0^1 p(\theta)\,d\theta = 1$$
since
$$\int_0^1 \theta^{\alpha-1}(1-\theta)^{\beta-1}\,d\theta = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}, \qquad E[\theta] = \frac{\alpha}{\alpha+\beta}$$
[Figure: Beta densities for different (α, β); obtained from Wikipedia]
Bayesian Learning for Coin Model
Compute Posterior: p(θ |X) ∝ p(X| θ)p(θ)
Conjugate prior–likelihood pairs:
Prior       Likelihood
Gaussian    Gaussian
Beta        Binomial
Dirichlet   Multinomial
Gamma       Gaussian
Example
• Suppose that we observe X = {1, 1, 1, 1, 1, 1}, where each xt comes from a Bernoulli distribution; the ML estimate is θML = 1
• We can instead compute the posterior and use its mean as the estimate (see the sketch below)
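With a Beta(α, β) prior, the posterior after observing n1 successes in N trials is Beta(α + n1, β + N − n1), so the posterior mean is (α + n1)/(α + β + N). A minimal sketch, assuming a (hypothetical) uniform Beta(1, 1) prior for this example:

```python
# Observed sample: six successes out of six trials
x = [1, 1, 1, 1, 1, 1]
N, n1 = len(x), sum(x)

alpha, beta = 1.0, 1.0                            # assumed Beta(1, 1) = uniform prior
theta_ml = n1 / N                                 # ML estimate: 1.0
theta_bayes = (alpha + n1) / (alpha + beta + N)   # posterior mean: 7/8 = 0.875
print(theta_ml, theta_bayes)
```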
Parametric Classification
Assume the class-conditional densities are Gaussian:
$$p(x|C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\!\left[-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right]$$
The discriminant (log posterior up to a constant) is
$$g_i(x) = -\frac{1}{2}\log 2\pi - \log\sigma_i - \frac{(x-\mu_i)^2}{2\sigma_i^2} + \log P(C_i)$$
• Given the sample $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$ where
$$r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}$$
• ML estimates are
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}, \qquad s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}$$
• Discriminant becomes
$$g_i(x) = -\frac{1}{2}\log 2\pi - \log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$$
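A compact sketch of the resulting plug-in classifier on synthetic one-dimensional data (the class means, variances, and sample sizes below are made up): the per-class ML estimates feed directly into gi(x), and the predicted class is the argmax.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic 1-D data from two Gaussian classes
x = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(3.0, 2.0, 40)])
r = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])

classes = np.unique(r)
priors = np.array([np.mean(r == c) for c in classes])      # P_hat(C_i)
means  = np.array([x[r == c].mean() for c in classes])     # m_i
vars_  = np.array([x[r == c].var()  for c in classes])     # s_i^2 (ML, 1/N_i normalizer)

def g(x0):
    # g_i(x) = -0.5 log(2 pi) - log(s_i) - (x - m_i)^2 / (2 s_i^2) + log P_hat(C_i)
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(vars_)
            - (x0 - means) ** 2 / (2 * vars_) + np.log(priors))

print(np.argmax(g(2.0)))   # predicted class index for x = 2.0
```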
Equal variances: a single boundary halfway between the means
Variances are different: two boundaries
Probabilistic Interpretation of Linear Regression
$$r = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
estimator: $g(x|\theta)$, hence $p(r|x) \sim \mathcal{N}(g(x|\theta), \sigma^2)$
$$\mathcal{L}(\theta|\mathcal{X}) = \log\prod_{t=1}^{N} p(x^t, r^t) = \sum_{t=1}^{N}\log p(r^t|x^t) + \sum_{t=1}^{N}\log p(x^t)$$
Regression: From LogL to Error
Maximizing the log likelihood (the second sum does not depend on θ) is equivalent to minimizing the sum of squared errors:
$$E(\theta|\mathcal{X}) = \frac{1}{2}\sum_{t=1}^{N}\left[r^t - g(x^t|\theta)\right]^2$$
Linear Regression
$$g(x^t|w_1, w_0) = w_1 x^t + w_0$$
Taking the derivative of E with respect to w0 and w1 and setting them to zero:
$$\sum_t r^t = N w_0 + w_1\sum_t x^t \qquad \text{(w.r.t. } w_0\text{)}$$
$$\sum_t r^t x^t = w_0\sum_t x^t + w_1\sum_t (x^t)^2 \qquad \text{(w.r.t. } w_1\text{)}$$
In matrix form, $\mathbf{A}\mathbf{w} = \mathbf{y}$ with
$$\mathbf{A} = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix}, \qquad \mathbf{w} = \mathbf{A}^{-1}\mathbf{y}$$
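A minimal sketch of the two-parameter case on synthetic data, building A and y and solving A w = y (np.linalg.solve is used instead of forming A⁻¹ explicitly):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 50)
r = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)       # synthetic data: r = 2x + 1 + noise

N = len(x)
A = np.array([[N,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)                   # w = A^{-1} y
print(w0, w1)                                    # close to 1.0 and 2.0
```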
Polynomial Regression
$$g(x^t|w_k,\dots,w_2,w_1,w_0) = w_k (x^t)^k + \dots + w_2 (x^t)^2 + w_1 x^t + w_0$$
$$\mathbf{D} = \begin{bmatrix} 1 & x^1 & (x^1)^2 & \cdots & (x^1)^k \\ 1 & x^2 & (x^2)^2 & \cdots & (x^2)^k \\ \vdots & & & & \vdots \\ 1 & x^N & (x^N)^2 & \cdots & (x^N)^k \end{bmatrix}, \qquad \mathbf{r} = \begin{bmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{bmatrix}$$
$$\mathbf{w} = (\mathbf{D}^T\mathbf{D})^{-1}\mathbf{D}^T\mathbf{r}$$
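A sketch of the same computation for order k on synthetic data, building D row by row with np.vander; np.linalg.lstsq solves the normal equations DᵀD w = Dᵀr in a numerically safer way than an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 30)
r = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, 30)   # synthetic quadratic data

k = 2
D = np.vander(x, k + 1, increasing=True)        # rows [1, x^t, (x^t)^2, ..., (x^t)^k]
w, *_ = np.linalg.lstsq(D, r, rcond=None)       # w = (D^T D)^{-1} D^T r
print(w)                                        # approximately [1.0, -2.0, 0.5]
```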
Other Error Measures
• Square Error: $E(\theta|\mathcal{X}) = \frac{1}{2}\sum_{t=1}^{N}\left[r^t - g(x^t|\theta)\right]^2$
• Relative Square Error: $E(\theta|\mathcal{X}) = \dfrac{\sum_{t=1}^{N}\left[r^t - g(x^t|\theta)\right]^2}{\sum_{t=1}^{N}\left[r^t - \bar{r}\right]^2}$
• Absolute Error: $E(\theta|\mathcal{X}) = \sum_t \left|r^t - g(x^t|\theta)\right|$
• ε-sensitive Error: $E(\theta|\mathcal{X}) = \sum_t \mathbb{1}\!\left(\left|r^t - g(x^t|\theta)\right| > \epsilon\right)\left(\left|r^t - g(x^t|\theta)\right| - \epsilon\right)$
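A small helper (my own sketch, with made-up targets and predictions) that evaluates all four measures:

```python
import numpy as np

def errors(r, g, eps=0.2):
    """Return (square, relative square, absolute, eps-sensitive) errors."""
    res = r - g
    square   = 0.5 * np.sum(res ** 2)
    relative = np.sum(res ** 2) / np.sum((r - r.mean()) ** 2)
    absolute = np.sum(np.abs(res))
    eps_sens = np.sum(np.where(np.abs(res) > eps, np.abs(res) - eps, 0.0))
    return square, relative, absolute, eps_sens

r = np.array([1.0, 2.0, 3.0, 4.0])       # targets r^t
g = np.array([1.1, 1.8, 3.4, 3.9])       # predictions g(x^t | theta)
print(errors(r, g))
```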
Bias and Variance
$$E\left[(r - g(x))^2 \mid x\right] = \underbrace{E\left[(r - E[r|x])^2 \mid x\right]}_{\text{noise}} + \underbrace{\left(E[r|x] - g(x)\right)^2}_{\text{squared error}}$$
$$E_{\mathcal{X}}\left[\left(E[r|x] - g(x)\right)^2 \mid x\right] = \underbrace{\left(E[r|x] - E_{\mathcal{X}}[g(x)]\right)^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{X}}\left[\left(g(x) - E_{\mathcal{X}}[g(x)]\right)^2\right]}_{\text{variance}}$$
Estimating Bias and Variance
$$\bar{g}(x) = \frac{1}{M}\sum_{i=1}^{M} g_i(x)$$
$$\text{Bias}^2(g) = \frac{1}{N}\sum_t \left[\bar{g}(x^t) - f(x^t)\right]^2$$
$$\text{Variance}(g) = \frac{1}{NM}\sum_t \sum_i \left[g_i(x^t) - \bar{g}(x^t)\right]^2$$
where $\bar{g}$ is the average of the $M$ estimators $g_i$, each trained on its own sample.
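A sketch of this procedure with a known target f, M resampled training sets, and polynomial fits (all the specific choices below, f, the noise level, M, and the order, are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(2 * x)                       # known target function
x_eval = np.linspace(0, np.pi, 25)                # N fixed evaluation points
M, n_train, order = 50, 20, 3

fits = []
for _ in range(M):                                # M independent training samples X_i
    xt = rng.uniform(0, np.pi, n_train)
    rt = f(xt) + rng.normal(0, 0.3, n_train)
    fits.append(np.polyval(np.polyfit(xt, rt, order), x_eval))   # g_i on the grid
g_i = np.array(fits)                              # shape (M, N)

g_bar = g_i.mean(axis=0)                          # average fit g_bar(x)
bias2 = np.mean((g_bar - f(x_eval)) ** 2)         # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((g_i - g_bar) ** 2)            # (1/(N M)) sum_t sum_i [g_i - g_bar]^2
print(bias2, variance)
```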
Bias/Variance Dilemma
• Example: gi(x)=2 has no variance and high bias
• As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)
• Bias/Variance dilemma: (Geman et al., 1992)
[Figure: the target f, the individual fits gi and their average ḡ; bias measures the distance of ḡ from f, variance the spread of the gi around ḡ]
Best fit, “elbow”
Regression example
Coefficients increase in magnitude as order
increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
Regularization
• New cost function:
$$E(\mathbf{w}|\mathcal{X}) = \frac{1}{2}\sum_{t=1}^{N}\left[y^t - g(x^t|\mathbf{w})\right]^2 + \lambda\sum_i w_i^2$$
• Ridge regression: $R(\mathbf{w}) = \|\mathbf{w}\|_2^2 = \sum_i w_i^2$
• LASSO: $R(\mathbf{w}) = \|\mathbf{w}\|_1 = \sum_i |w_i|$
For ridge regression with a linear model, write the regularizer as $\sum_i w_i^2 = \mathbf{w}^T\mathbf{w}$, collect the inputs into the design matrix $\mathbf{D}$ and the targets into $\mathbf{y}$, and (absorbing constant factors into λ) set the gradient of the cost to zero:
$$\nabla_{\mathbf{w}} E = -\mathbf{D}^T(\mathbf{y} - \mathbf{D}\mathbf{w}) + \lambda\mathbf{w} = 0$$
$$(\mathbf{D}^T\mathbf{D} + \lambda\mathbf{I})\,\mathbf{w} = \mathbf{D}^T\mathbf{y}$$
$$\mathbf{w}_{\text{ridge}} = (\mathbf{D}^T\mathbf{D} + \lambda\mathbf{I})^{-1}\mathbf{D}^T\mathbf{y}$$
• The image is obtained from Nando de Freitas' lecture notes
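A minimal sketch of the ridge closed form on a high-order polynomial fit (λ, the data, and the order are made up), compared with the unregularized least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 40)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, 40)

k, lam = 8, 1e-2
D = np.vander(x, k + 1, increasing=True)                # high-order design matrix
I = np.eye(k + 1)

w_ridge = np.linalg.solve(D.T @ D + lam * I, D.T @ y)   # (D^T D + lambda I)^{-1} D^T y
w_ls    = np.linalg.lstsq(D, y, rcond=None)[0]          # unregularized least squares
print(np.abs(w_ridge).max(), np.abs(w_ls).max())        # ridge typically shrinks the weights
```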