
Lecture Slides for

INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014

[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 10:

LINEAR DISCRIMINATION
Likelihood- vs. Discriminant-based Classification

- Likelihood-based: assume a model for p(x | C_i) and use Bayes' rule to calculate P(C_i | x); the discriminant is g_i(x) = log P(C_i | x).
- Discriminant-based: assume a model directly for g_i(x | Φ_i); no density estimation.
- Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries.
Linear Discriminant

- Linear discriminant:

  g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}

- Advantages:
  - Simple: O(d) space/computation
  - Knowledge extraction: the discriminant is a weighted sum of the attributes; the signs and magnitudes of the weights are interpretable (e.g., credit scoring)
  - Optimal when the p(x | C_i) are Gaussian with a shared covariance matrix; useful when classes are (almost) linearly separable
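A minimal sketch, assuming NumPy and made-up weight values, of evaluating the linear discriminants and choosing the class with the largest g_i(x):

```python
import numpy as np

# Illustrative parameters for K = 3 classes and d = 3 features (values are made up).
W = np.array([[ 0.8, -0.2,  0.5],    # w_1
              [-0.3,  0.6,  0.1],    # w_2
              [ 0.1,  0.1, -0.7]])   # w_3
w0 = np.array([0.2, -0.1, 0.4])      # biases w_{i0}

def linear_discriminants(x, W, w0):
    """g_i(x) = w_i^T x + w_{i0} for every class i."""
    return W @ x + w0

x = np.array([1.0, 2.0, -1.0])
g = linear_discriminants(x, W, w0)
predicted_class = int(np.argmax(g))   # choose C_i with the largest g_i(x)
```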
Generalized Linear Model

- Quadratic discriminant:

  g_i(x | W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}

- Higher-order (product) terms, e.g.:

  z_1 = x_1, \; z_2 = x_2, \; z_3 = x_1^2, \; z_4 = x_2^2, \; z_5 = x_1 x_2

- Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:

  g_i(x) = \sum_{j=1}^{k} w_{ij} \phi_j(x)
Two Classes

  g(x) = g_1(x) - g_2(x)
       = (w_1^T x + w_{10}) - (w_2^T x + w_{20})
       = (w_1 - w_2)^T x + (w_{10} - w_{20})
       = w^T x + w_0

Choose C_1 if g(x) > 0, and C_2 otherwise.
Geometry

(Figure: geometry of the linear discriminant g(x) = w^T x + w_0 and its decision boundary.)
Multiple Classes

  g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}

Choose C_i if g_i(x) = \max_{j=1,\ldots,K} g_j(x)

Classes are linearly separable.
Pairwise Separation

  g_{ij}(x | w_{ij}, w_{ij0}) = w_{ij}^T x + w_{ij0}

  g_{ij}(x) > 0 if x \in C_i
  g_{ij}(x) \le 0 if x \in C_j
  don't care otherwise

Choose C_i if g_{ij}(x) > 0 for all j \ne i.
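A minimal sketch (the pairwise parameters here are random placeholders) of the pairwise decision rule:

```python
import numpy as np

# Illustrative pairwise parameters: W_pair[i, j] plays the role of w_ij, b_pair[i, j] of w_ij0.
K, d = 3, 2
rng = np.random.default_rng(0)
W_pair = rng.standard_normal((K, K, d))
b_pair = rng.standard_normal((K, K))

def choose_class_pairwise(x):
    """Return i such that g_ij(x) > 0 for all j != i, or None if no class wins."""
    for i in range(K):
        if all(W_pair[i, j] @ x + b_pair[i, j] > 0 for j in range(K) if j != i):
            return i
    return None

print(choose_class_pairwise(np.array([0.5, -1.0])))
```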
From Discriminants to Posteriors

When p(x | C_i) ~ N(\mu_i, \Sigma):

  g_i(x | w_i, w_{i0}) = w_i^T x + w_{i0}
  w_i = \Sigma^{-1} \mu_i, \quad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)

Let y = P(C_1 | x) and P(C_2 | x) = 1 - y. Choose C_1 (and C_2 otherwise) if

  y > 0.5, \quad \text{equivalently} \quad y / (1 - y) > 1, \quad \text{equivalently} \quad \log \frac{y}{1 - y} > 0

  \mathrm{logit}(P(C_1 | x)) = \log \frac{P(C_1 | x)}{1 - P(C_1 | x)} = \log \frac{P(C_1 | x)}{P(C_2 | x)}
    = \log \frac{p(x | C_1)}{p(x | C_2)} + \log \frac{P(C_1)}{P(C_2)}
    = \log \frac{(2\pi)^{-d/2} |\Sigma|^{-1/2} \exp[-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1}(x - \mu_1)]}{(2\pi)^{-d/2} |\Sigma|^{-1/2} \exp[-\frac{1}{2}(x - \mu_2)^T \Sigma^{-1}(x - \mu_2)]} + \log \frac{P(C_1)}{P(C_2)}
    = w^T x + w_0

  where w = \Sigma^{-1}(\mu_1 - \mu_2) and w_0 = -\frac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_1 - \mu_2) + \log \frac{P(C_1)}{P(C_2)}
The inverse of logit

  \log \frac{P(C_1 | x)}{1 - P(C_1 | x)} = w^T x + w_0

  P(C_1 | x) = \mathrm{sigmoid}(w^T x + w_0) = \frac{1}{1 + \exp[-(w^T x + w_0)]}
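A short sketch of turning estimated class means, a shared covariance, and priors into w and w_0 and then reading off the posterior with the sigmoid; the synthetic data and the pooled-covariance estimator are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([2.0, 0.0], np.eye(2), size=100)  # samples from C1
X2 = rng.multivariate_normal([0.0, 1.0], np.eye(2), size=150)  # samples from C2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled (shared) covariance estimate across the two classes.
Sigma = (np.cov(X1.T) * (len(X1) - 1) + np.cov(X2.T) * (len(X2) - 1)) / (len(X1) + len(X2) - 2)
P1 = len(X1) / (len(X1) + len(X2))
P2 = 1.0 - P1

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2) + np.log(P1 / P2)

def posterior_C1(x):
    """P(C1 | x) = sigmoid(w^T x + w0)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))

print(posterior_C1(np.array([1.5, 0.2])))
```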
Sigmoid (Logistic) Function

- Calculate g(x) = w^T x + w_0 and choose C_1 if g(x) > 0, or
- Calculate y = sigmoid(w^T x + w_0) and choose C_1 if y > 0.5
Gradient-Descent

- E(w | X) is the error with parameters w on sample X:

  w^* = \arg\min_w E(w | X)

- Gradient:

  \nabla_w E = \left[ \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_d} \right]^T

- Gradient-descent: starts from a random w and updates w iteratively in the negative direction of the gradient
Gradient-Descent

  \Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \quad \forall i
  w_i \leftarrow w_i + \Delta w_i

(Figure: a single step of size \eta from w^t to w^{t+1} down the error surface, from E(w^t) to E(w^{t+1}).)
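A generic sketch of the update loop; the quadratic example error and the learning rate are illustrative:

```python
import numpy as np

def gradient_descent(grad_E, w_init, eta=0.1, n_iters=100):
    """Iteratively move w in the negative gradient direction."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_iters):
        w += -eta * grad_E(w)          # delta_w = -eta * dE/dw
    return w

# Example: minimize E(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w_init=[0.0, 0.0])
print(w_star)   # approaches [3, 3]
```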
Logistic Discrimination

Two classes: assume the log likelihood ratio is linear:

  \log \frac{p(x | C_1)}{p(x | C_2)} = w^T x + w_0^o

  \mathrm{logit}(P(C_1 | x)) = \log \frac{P(C_1 | x)}{1 - P(C_1 | x)} = \log \frac{p(x | C_1)}{p(x | C_2)} + \log \frac{P(C_1)}{P(C_2)} = w^T x + w_0

  where w_0 = w_0^o + \log \frac{P(C_1)}{P(C_2)}

  y = \hat{P}(C_1 | x) = \frac{1}{1 + \exp[-(w^T x + w_0)]}
Training: Two Classes

  X = \{x^t, r^t\}_t, \quad r^t | x^t \sim \mathrm{Bernoulli}(y^t)

  y = P(C_1 | x) = \frac{1}{1 + \exp[-(w^T x + w_0)]}

  l(w, w_0 | X) = \prod_t (y^t)^{r^t} (1 - y^t)^{1 - r^t}

  E = -\log l

  E(w, w_0 | X) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]
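A small sketch (the arrays are made up) of evaluating this cross-entropy error for given w and w_0:

```python
import numpy as np

def cross_entropy(w, w0, X, r):
    """E(w, w0 | X) = -sum_t [ r^t log y^t + (1 - r^t) log(1 - y^t) ]."""
    y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))   # y^t = sigmoid(w^T x^t + w0)
    eps = 1e-12                               # guard against log(0)
    return -np.sum(r * np.log(y + eps) + (1 - r) * np.log(1 - y + eps))

X = np.array([[0.5, 1.0], [2.0, -1.0], [1.5, 0.5]])
r = np.array([1, 0, 1])
print(cross_entropy(np.array([0.3, -0.2]), 0.1, X, r))
```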
Training: Gradient-Descent

  E(w, w_0 | X) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]

  If y = \mathrm{sigmoid}(a), then \frac{dy}{da} = y(1 - y), so

  \Delta w_j = -\eta \frac{\partial E}{\partial w_j}
             = \eta \sum_t \left( \frac{r^t}{y^t} - \frac{1 - r^t}{1 - y^t} \right) y^t (1 - y^t)\, x_j^t
             = \eta \sum_t (r^t - y^t)\, x_j^t, \quad j = 1, \ldots, d

  \Delta w_0 = -\eta \frac{\partial E}{\partial w_0} = \eta \sum_t (r^t - y^t)
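A compact sketch of the resulting two-class training loop; the synthetic data, learning rate, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def train_logistic(X, r, eta=0.01, n_iters=1000):
    """Gradient descent on the two-class cross-entropy error."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))   # y^t for every training instance
        w = w + eta * X.T @ (r - y)               # delta_w_j = eta * sum_t (r^t - y^t) x_j^t
        w0 = w0 + eta * np.sum(r - y)             # delta_w_0 = eta * sum_t (r^t - y^t)
    return w, w0

# Toy data: two Gaussian clouds labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
r = np.hstack([np.zeros(50), np.ones(50)])
w, w0 = train_logistic(X, r)
```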
(Figures: the fitted discriminant after 10, 100, and 1000 iterations of gradient descent.)
K>2 Classes

  X = \{x^t, r^t\}_t, \quad r^t | x^t \sim \mathrm{Mult}_K(1, y^t)

  \log \frac{p(x | C_i)}{p(x | C_K)} = w_i^T x + w_{i0}^o

  y_i = \hat{P}(C_i | x) = \frac{\exp(w_i^T x + w_{i0})}{\sum_{j=1}^{K} \exp(w_j^T x + w_{j0})}, \quad i = 1, \ldots, K \quad \text{(softmax)}

  l(\{w_i, w_{i0}\}_i | X) = \prod_t \prod_i (y_i^t)^{r_i^t}

  E(\{w_i, w_{i0}\}_i | X) = -\sum_t \sum_i r_i^t \log y_i^t

  \Delta w_j = \eta \sum_t (r_j^t - y_j^t)\, x^t, \quad \Delta w_{j0} = \eta \sum_t (r_j^t - y_j^t)
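A sketch of the K-class training loop these updates imply; the one-hot label matrix, the numerical-stability shift, and the hyperparameters are illustrative choices:

```python
import numpy as np

def train_softmax(X, R, eta=0.01, n_iters=1000):
    """Gradient descent on the K-class cross-entropy; R is one-hot with shape (n, K)."""
    n, d = X.shape
    K = R.shape[1]
    W, w0 = np.zeros((K, d)), np.zeros(K)
    for _ in range(n_iters):
        A = X @ W.T + w0                       # linear scores w_i^T x^t + w_i0, shape (n, K)
        A -= A.max(axis=1, keepdims=True)      # shift for numerical stability
        Y = np.exp(A)
        Y /= Y.sum(axis=1, keepdims=True)      # y_i^t via softmax
        W += eta * (R - Y).T @ X               # delta_w_j  = eta * sum_t (r_j^t - y_j^t) x^t
        w0 += eta * (R - Y).sum(axis=0)        # delta_w_j0 = eta * sum_t (r_j^t - y_j^t)
    return W, w0
```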
Example

(Figure illustrating the K>2-class case.)
Generalizing the Linear Model

- Quadratic:

  \log \frac{p(x | C_i)}{p(x | C_K)} = x^T W_i x + w_i^T x + w_{i0}

- Sum of basis functions:

  \log \frac{p(x | C_i)}{p(x | C_K)} = w_i^T \phi(x) + w_{i0}

  where the \phi(x) are basis functions. Examples:
  - Hidden units in neural networks (Chapters 11 and 12)
  - Kernels in SVM (Chapter 13)
Discrimination by Regression

- Classes are NOT mutually exclusive and exhaustive

  r^t = y^t + \epsilon, \quad \epsilon \sim N(0, \sigma^2)

  y^t = \mathrm{sigmoid}(w^T x^t + w_0) = \frac{1}{1 + \exp[-(w^T x^t + w_0)]}

  l(w, w_0 | X) = \prod_t \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(r^t - y^t)^2}{2\sigma^2} \right]

  E(w, w_0 | X) = \frac{1}{2} \sum_t (r^t - y^t)^2

  \Delta w = \eta \sum_t (r^t - y^t)\, y^t (1 - y^t)\, x^t
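A sketch of the corresponding squared-error update through the sigmoid; the bias update is the analogous term (not given on the slide), and the hyperparameters are illustrative:

```python
import numpy as np

def train_by_regression(X, r, eta=0.01, n_iters=1000):
    """Gradient descent on the squared error with a sigmoid output."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-(X @ w + w0)))
        delta = (r - y) * y * (1 - y)          # (r^t - y^t) y^t (1 - y^t)
        w += eta * X.T @ delta                 # delta_w = eta * sum_t (r^t - y^t) y^t (1 - y^t) x^t
        w0 += eta * np.sum(delta)              # analogous bias update (assumption)
    return w, w0
```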
Learning to Rank

- Ranking: a different problem than classification or regression
- Let us say x^u and x^v are two instances, e.g., two movies. We prefer u to v implies that g(x^u) > g(x^v), where g(x) is a score function, here linear: g(x) = w^T x
- Find a direction w such that we get the desired ranks when instances are projected along w
Ranking Error

- We prefer u to v implies that g(x^u) > g(x^v), so the error is g(x^v) - g(x^u) if g(x^u) < g(x^v)
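A minimal sketch (the data and preference pairs are made up) of accumulating this pairwise ranking error with a linear score g(x) = w^T x:

```python
import numpy as np

def ranking_error(w, X, preferred_pairs):
    """Sum of g(x_v) - g(x_u) over pairs (u, v) where u is preferred but scored lower."""
    g = X @ w
    err = 0.0
    for u, v in preferred_pairs:       # u is preferred to v
        if g[u] < g[v]:
            err += g[v] - g[u]
    return err

X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
pairs = [(0, 1), (1, 2)]               # prefer instance 0 to 1, and 1 to 2
print(ranking_error(np.array([0.2, 1.0]), X, pairs))   # nonzero: both pairs are misordered
```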