INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 10: LINEAR DISCRIMINATION
Likelihood- vs. Discriminant-based Classification

Likelihood-based: assume a model for $p(\mathbf{x}\mid C_i)$ and use Bayes' rule to calculate $P(C_i\mid\mathbf{x})$:

$g_i(\mathbf{x}) = \log P(C_i\mid\mathbf{x})$

Discriminant-based: assume a model for $g_i(\mathbf{x}\mid\Phi_i)$; no density estimation.

Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries.
Linear Discriminant

Linear discriminant:

$g_i(\mathbf{x}\mid\mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T\mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij}x_j + w_{i0}$

Advantages:

Simple: $O(d)$ space/computation

Quadratic discriminant:

$g_i(\mathbf{x}\mid\mathbf{W}_i, \mathbf{w}_i, w_{i0}) = \mathbf{x}^T\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$
Two classes:

$g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x}) = \left(\mathbf{w}_1^T\mathbf{x} + w_{10}\right) - \left(\mathbf{w}_2^T\mathbf{x} + w_{20}\right) = \left(\mathbf{w}_1 - \mathbf{w}_2\right)^T\mathbf{x} + \left(w_{10} - w_{20}\right) = \mathbf{w}^T\mathbf{x} + w_0$

choose $C_1$ if $g(\mathbf{x}) > 0$, and $C_2$ otherwise
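A minimal sketch of the two-class rule above in NumPy; the weights, bias, and test point are made-up values for illustration, not parameters from the text.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

# Hypothetical learned parameters for a 2-D problem.
w = np.array([1.5, -0.5])   # plays the role of w = w_1 - w_2
w0 = 0.2                    # plays the role of w0 = w_10 - w_20

x = np.array([0.4, 0.3])
print("choose C1" if g(x, w, w0) > 0 else "choose C2")
```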
Geometry
Multiple Classes

choose $C_i$ if $g_i(\mathbf{x}) = \max_{j=1}^{K} g_j(\mathbf{x})$

Classes are linearly separable.
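A small sketch of the K-class rule, assuming one linear discriminant per class stored as the rows of a matrix; all parameter values are placeholders.

```python
import numpy as np

def choose_class(x, W, w0):
    """Return the i maximizing g_i(x) = w_i^T x + w_i0 over the K classes."""
    g = W @ x + w0               # length-K vector of discriminant values
    return int(np.argmax(g))

# Hypothetical K=3 classes in d=2 dimensions.
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.1, -0.2])

print(choose_class(np.array([0.5, 0.5]), W, w0))   # index of the chosen class
```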
Pairwise Separation

$g_{ij}(\mathbf{x}) > 0$ if $\mathbf{x}\in C_i$
$g_{ij}(\mathbf{x}) \le 0$ if $\mathbf{x}\in C_j$
don't care otherwise

choose $C_i$ if $\forall j \ne i,\ g_{ij}(\mathbf{x}) > 0$
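A sketch of the pairwise rule, assuming each pair $(i, j)$ has its own linear discriminant with $g_{ji} = -g_{ij}$; the randomly drawn parameters are purely illustrative.

```python
import numpy as np

def choose_pairwise(x, W, w0):
    """Choose C_i if g_ij(x) = w_ij^T x + w_ij0 > 0 for all j != i; None if no class wins."""
    K = W.shape[0]
    for i in range(K):
        if all(W[i, j] @ x + w0[i, j] > 0 for j in range(K) if j != i):
            return i
    return None   # x falls in a region where no class beats all the others

# Hypothetical pairwise discriminants for K=3 classes, d=2, with g_ji = -g_ij enforced.
rng = np.random.default_rng(0)
K, d = 3, 2
W = rng.normal(size=(K, K, d))
W = W - W.transpose(1, 0, 2)
w0 = rng.normal(size=(K, K))
w0 = w0 - w0.T

print(choose_pairwise(np.array([0.2, -0.4]), W, w0))
```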
From Discriminants to Posteriors

When $p(\mathbf{x}\mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$:

$g_i(\mathbf{x}\mid\mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$

where $\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$ and $w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \log P(C_i)$

Let $y \equiv P(C_1\mid\mathbf{x})$, so that $P(C_2\mid\mathbf{x}) = 1 - y$. Then

choose $C_1$ if $y > 0.5$, equivalently $y/(1-y) > 1$, equivalently $\log\left[y/(1-y)\right] > 0$, and $C_2$ otherwise.

$\operatorname{logit} P(C_1\mid\mathbf{x}) = \log\dfrac{P(C_1\mid\mathbf{x})}{1 - P(C_1\mid\mathbf{x})} = \log\dfrac{P(C_1\mid\mathbf{x})}{P(C_2\mid\mathbf{x})}$

$= \log\dfrac{p(\mathbf{x}\mid C_1)}{p(\mathbf{x}\mid C_2)} + \log\dfrac{P(C_1)}{P(C_2)}$

$= \log\dfrac{(2\pi)^{-d/2}\,|\boldsymbol{\Sigma}|^{-1/2}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_1)\right]}{(2\pi)^{-d/2}\,|\boldsymbol{\Sigma}|^{-1/2}\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_2)\right]} + \log\dfrac{P(C_1)}{P(C_2)}$

$= \mathbf{w}^T\mathbf{x} + w_0$

where $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $w_0 = -\frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \log\dfrac{P(C_1)}{P(C_2)}$

The inverse of the logit:

$\log\dfrac{P(C_1\mid\mathbf{x})}{1 - P(C_1\mid\mathbf{x})} = \mathbf{w}^T\mathbf{x} + w_0$

$P(C_1\mid\mathbf{x}) = \operatorname{sigmoid}\left(\mathbf{w}^T\mathbf{x} + w_0\right) = \dfrac{1}{1 + \exp\left[-(\mathbf{w}^T\mathbf{x} + w_0)\right]}$
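A brief sketch of the result above: for two Gaussian classes with a shared covariance matrix, $\mathbf{w}$ and $w_0$ follow directly from the means, covariance, and priors, and the posterior is a sigmoid of $\mathbf{w}^T\mathbf{x} + w_0$. The means, covariance, priors, and test point below are invented for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical class-conditional parameters (shared covariance) and priors.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
P1, P2 = 0.6, 0.4

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2) + np.log(P1 / P2)

x = np.array([0.5, 0.2])
y = sigmoid(w @ x + w0)       # P(C1 | x)
print(y, "choose C1" if y > 0.5 else "choose C2")
```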
Sigmoid (Logistic) Function
Gradient

$\nabla_{\mathbf{w}} E = \left[\dfrac{\partial E}{\partial w_1}, \dfrac{\partial E}{\partial w_2}, \ldots, \dfrac{\partial E}{\partial w_d}\right]^T$

Gradient-descent: starts from a random $\mathbf{w}$ and updates $\mathbf{w}$ iteratively in the negative direction of the gradient.
Gradient-Descent

$\Delta w_i = -\eta\dfrac{\partial E}{\partial w_i},\quad \forall i$

$w_i \leftarrow w_i + \Delta w_i$

(Figure: the error decreases from $E(\mathbf{w}^t)$ to $E(\mathbf{w}^{t+1})$ as $\mathbf{w}^t$ is updated to $\mathbf{w}^{t+1}$ with step size $\eta$)
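A minimal gradient-descent loop on a toy quadratic error, just to illustrate the update $\Delta w_i = -\eta\,\partial E/\partial w_i$; the error function, learning rate, and iteration count are arbitrary choices, not taken from the text.

```python
import numpy as np

def E(w):
    """Toy error function: E(w) = (w1 - 3)^2 + (w2 + 1)^2."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad_E(w):
    """Gradient of the toy error."""
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

eta = 0.1                                      # learning rate (step size)
w = np.random.default_rng(1).normal(size=2)    # start from a random w

for _ in range(100):
    w += -eta * grad_E(w)                      # move in the negative gradient direction

print(w, E(w))                                 # w approaches (3, -1), E approaches 0
```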
Logistic Discrimination

Two classes: assume the log likelihood ratio is linear:

$\log\dfrac{p(\mathbf{x}\mid C_1)}{p(\mathbf{x}\mid C_2)} = \mathbf{w}^T\mathbf{x} + w_0^o$

$y = \hat{P}(C_1\mid\mathbf{x}) = \dfrac{1}{1 + \exp\left[-(\mathbf{w}^T\mathbf{x} + w_0)\right]}$

where $w_0 = w_0^o + \log\dfrac{P(C_1)}{P(C_2)}$
Training: Two Classes

$\mathcal{X} = \{\mathbf{x}^t, r^t\}_t$ where $r^t\mid\mathbf{x}^t \sim \text{Bernoulli}(y^t)$

$y = P(C_1\mid\mathbf{x}) = \dfrac{1}{1 + \exp\left[-(\mathbf{w}^T\mathbf{x} + w_0)\right]}$

$l(\mathbf{w}, w_0\mid\mathcal{X}) = \prod_t \left(y^t\right)^{r^t}\left(1 - y^t\right)^{1 - r^t}$

$E = -\log l$

$E(\mathbf{w}, w_0\mid\mathcal{X}) = -\sum_t \left[ r^t\log y^t + \left(1 - r^t\right)\log\left(1 - y^t\right) \right]$
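A short sketch of evaluating the cross-entropy error $E(\mathbf{w}, w_0\mid\mathcal{X})$ above; the data, labels, and parameter values are made up for illustration, and a small epsilon guards the logarithms.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, w0, X, r):
    """E(w, w0 | X) = -sum_t [ r^t log y^t + (1 - r^t) log(1 - y^t) ]."""
    y = sigmoid(X @ w + w0)
    eps = 1e-12                      # avoid log(0)
    return -np.sum(r * np.log(y + eps) + (1 - r) * np.log(1 - y + eps))

# Hypothetical data: 4 instances in 2-D with binary labels r^t.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
r = np.array([0, 0, 1, 1])
print(cross_entropy(np.array([1.0, -1.0]), -1.5, X, r))
```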
Training: Gradient-Descent

$E(\mathbf{w}, w_0\mid\mathcal{X}) = -\sum_t \left[ r^t\log y^t + \left(1 - r^t\right)\log\left(1 - y^t\right) \right]$

If $y = \operatorname{sigmoid}(a)$, then $\dfrac{dy}{da} = y\left(1 - y\right)$

$\Delta w_j = -\eta\dfrac{\partial E}{\partial w_j} = \eta\sum_t\left(\dfrac{r^t}{y^t} - \dfrac{1 - r^t}{1 - y^t}\right)y^t\left(1 - y^t\right)x_j^t = \eta\sum_t\left(r^t - y^t\right)x_j^t,\quad j = 1, \ldots, d$

$\Delta w_0 = -\eta\dfrac{\partial E}{\partial w_0} = \eta\sum_t\left(r^t - y^t\right)$
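A compact sketch of the two-class training procedure implied by the updates above (batch gradient descent on the cross-entropy error); the synthetic data, learning rate, and epoch count are assumptions made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, r, eta=0.01, epochs=1000):
    """Batch updates: Δw_j = η Σ_t (r^t - y^t) x_j^t and Δw_0 = η Σ_t (r^t - y^t)."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        y = sigmoid(X @ w + w0)       # current estimates of P(C1 | x^t)
        err = r - y                   # r^t - y^t for every instance
        w += eta * X.T @ err
        w0 += eta * err.sum()
    return w, w0

# Hypothetical 1-D data: class 1 (r = 1) tends to have larger x.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
r = np.concatenate([np.zeros(50), np.ones(50)])

w, w0 = train_logistic(X, r)
print(w, w0, sigmoid(np.array([2.0]) @ w + w0))   # P(C1 | x = 2) should be near 1
```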
(Figures: the fitted posterior $y = \hat{P}(C_1\mid x)$ after 10, 100, and 1000 training iterations)
K>2 Classes

$y_i = \hat{P}(C_i\mid\mathbf{x}) = \dfrac{\exp\left[\mathbf{w}_i^T\mathbf{x} + w_{i0}\right]}{\sum_{j=1}^{K}\exp\left[\mathbf{w}_j^T\mathbf{x} + w_{j0}\right]},\quad i = 1, \ldots, K \qquad \text{(softmax)}$

$l\left(\{\mathbf{w}_i, w_{i0}\}_i\mid\mathcal{X}\right) = \prod_t\prod_i\left(y_i^t\right)^{r_i^t}$
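A short sketch of the softmax posteriors for $K > 2$ classes, with placeholder parameters; the maximum is subtracted before exponentiating purely for numerical stability, which does not change the ratio.

```python
import numpy as np

def softmax_posteriors(x, W, w0):
    """y_i = exp(w_i^T x + w_i0) / sum_j exp(w_j^T x + w_j0)."""
    a = W @ x + w0
    a = a - a.max()              # numerical stability; cancels in the ratio
    e = np.exp(a)
    return e / e.sum()

# Hypothetical K=3 classes, d=2.
W = np.array([[ 1.0,  0.5],
              [-0.5,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.2, -0.1])

y = softmax_posteriors(np.array([0.3, 0.7]), W, w0)
print(y, y.sum())                # posteriors sum to 1; choose argmax(y)
```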
Quadratic:

$\log\dfrac{p(\mathbf{x}\mid C_i)}{p(\mathbf{x}\mid C_K)} = \mathbf{x}^T\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$
Discrimination by regression, assuming $r^t = y^t + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$:

$y^t = \operatorname{sigmoid}\left(\mathbf{w}^T\mathbf{x}^t + w_0\right) = \dfrac{1}{1 + \exp\left[-(\mathbf{w}^T\mathbf{x}^t + w_0)\right]}$

$l(\mathbf{w}, w_0\mid\mathcal{X}) = \prod_t\dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\dfrac{\left(r^t - y^t\right)^2}{2\sigma^2}\right]$

$E(\mathbf{w}, w_0\mid\mathcal{X}) = \dfrac{1}{2}\sum_t\left(r^t - y^t\right)^2$

$\Delta\mathbf{w} = \eta\sum_t\left(r^t - y^t\right)y^t\left(1 - y^t\right)\mathbf{x}^t$
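A minimal sketch of this sum-of-squares alternative: sigmoid output, squared error, and the update $\Delta\mathbf{w} = \eta\sum_t (r^t - y^t)\,y^t(1 - y^t)\,\mathbf{x}^t$. The data and hyperparameters are invented for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_regression(X, r, eta=0.01, epochs=2000):
    """Gradient descent on E = (1/2) sum_t (r^t - y^t)^2 with y^t = sigmoid(w^T x^t + w0)."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        y = sigmoid(X @ w + w0)
        delta = (r - y) * y * (1 - y)      # (r^t - y^t) y^t (1 - y^t) for every instance
        w += eta * X.T @ delta
        w0 += eta * delta.sum()
    return w, w0

# Hypothetical 1-D two-class data with 0/1 labels.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-1, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
r = np.concatenate([np.zeros(50), np.ones(50)])

w, w0 = train_regression(X, r)
print(w, w0)
```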
Learning to Rank