Linear Classifiers PPT 1

CS434a/541a: Pattern Recognition
Prof. Olga Veksler
Lecture 9

Today

Linear Discriminant Functions
  Introduction
  2 classes
  Multiple classes
  Optimization with gradient descent
  Perceptron Criterion Function
  Batch perceptron rule
  Single sample perceptron rule
Announcements

Final project proposal due Nov. 1
  1-2 paragraph description
  Late penalty is 1 point off for each day late
Assignment 3 due November 10
Data for final project due Nov. 15
  Must be in Matlab format; send me a .mat file with the data and a short description file of what the data is
  Late penalty is 1 point off for each day late
Final project progress report
  Meet with me the week of November 22-26
  5 points off if I see that you have done NOTHING yet
Assignment 4 due December 1
Final project due December 8

Linear Discriminant Functions on the Road Map

No probability distribution is known (neither its shape nor its parameters), but we have labeled data (e.g. salmon/bass samples plotted by lightness and length)
The shape of the discriminant function is known (here: a linear discriminant function), even though little else is known
Need to estimate the parameters of the discriminant function (the parameters of the line, in the case of a linear discriminant)
[Figure: labeled salmon/bass samples in the lightness-length plane separated by a linear discriminant function]
Linear Discriminant Functions: Basic Idea

Have samples from 2 classes x_1, x_2, ..., x_n
Assume the 2 classes can be separated by a linear boundary l(θ) with some unknown parameters θ
Fit the "best" boundary to the data by optimizing over the parameters θ
What is "best"?
  Minimize classification error on the training data?
  Does not guarantee small testing error
[Figure: salmon/bass samples in the lightness-length plane with two candidate boundaries: a bad boundary and a good boundary]

LDF: Introduction

Discriminant functions can be more general than linear
For now, we will study linear discriminant functions
  Simple model (should try simpler models first)
  Analytically tractable
Linear discriminant functions are optimal for Gaussian distributions with equal covariance
May not be optimal for other data distributions, but they are very simple to use
Knowledge of class densities is not required when using linear discriminant functions
  we can say that this is a non-parametric approach
Parametric Methods vs. Discriminant Functions

Parametric methods:
  Assume the shape of the density for each class is known: p1(x|θ1), p2(x|θ2), ...
  Estimate θ1, θ2, ... from data
  Use a Bayesian classifier to find the decision regions
  In theory, the Bayesian classifier minimizes the risk
  In practice, we do not have confidence in the assumed model shapes
  Estimating accurate density functions is much harder than estimating accurate discriminant functions
  Some argue that estimating densities should be skipped: why solve a harder problem than needed?

Discriminant functions:
  Assume the discriminant functions are of known shape l(θ1), l(θ2), ..., with parameters θ1, θ2, ...
  Estimate θ1, θ2, ... from data
  Use the discriminant functions directly for classification
  In practice, we do not really need the actual density functions in the end
[Figure: both approaches end up producing decision regions c1, c2, c3 in feature space]

LDF: 2 Classes

A discriminant function is linear if it can be written as
  g(x) = w^t x + w_0
w is called the weight vector and w_0 is called the bias or threshold
Decision rule:
  g(x) > 0  =>  x ∈ class 1
  g(x) < 0  =>  x ∈ class 2
  g(x) = 0  =>  either class
[Figure: the decision boundary g(x) = 0 in the x^(1)-x^(2) plane, with region R1 where g(x) > 0 and region R2 where g(x) < 0]
LDF: 2 Classes

The decision boundary g(x) = w^t x + w_0 = 0 is a hyperplane
  a hyperplane is the set of vectors x which, for some scalars α_0, ..., α_d, satisfy α_0 + α_1 x^(1) + ... + α_d x^(d) = 0
A hyperplane is
  a point in 1D
  a line in 2D
  a plane in 3D
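As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the two-class decision rule g(x) = w^t x + w_0; the weight values are made up for illustration.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant function g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """Two-class rule: class 1 if g(x) > 0, class 2 if g(x) < 0."""
    value = g(x, w, w0)
    if value > 0:
        return 1
    elif value < 0:
        return 2
    return 0  # exactly on the decision boundary g(x) = 0: either class

# hypothetical weights for a 2D feature space (e.g. lightness, length)
w = np.array([1.5, -0.5])
w0 = -1.0
print(classify(np.array([2.0, 1.0]), w, w0))
```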
LDF: 2 Classes

g(x) = w^t x + w_0
  w determines the orientation of the decision hyperplane
  w_0 determines the location of the decision surface
[Figure: the hyperplane g(x) = 0 in the x^(1)-x^(2) plane; the distance from a point x to the hyperplane is g(x)/||w||, and the distance from the origin to the hyperplane is w_0/||w||]

LDF: Many Classes

Suppose we have m classes
Define m linear discriminant functions
  g_i(x) = w_i^t x + w_i0,   i = 1, ..., m
Given x, assign class c_i if
  g_i(x) ≥ g_j(x)   ∀j ≠ i
Such a classifier is called a linear machine
A linear machine divides the feature space into m decision regions, with g_i(x) being the largest discriminant if x is in region R_i
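A minimal sketch (not from the slides) of a linear machine: each class has its own weight vector and bias, and x is assigned to the class with the largest discriminant. The weight values below are arbitrary illustrative choices.

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class with the largest linear discriminant.

    W  : (m, d) array, row i is the weight vector w_i
    w0 : (m,)  array, w0[i] is the bias w_i0
    """
    scores = W @ x + w0            # g_i(x) = w_i^t x + w_i0 for all i
    return int(np.argmax(scores))  # index of the winning class

# hypothetical 3-class machine in a 2D feature space
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, -0.5, 1.0])
print(linear_machine(np.array([0.5, 2.0]), W, w0))
```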
LDF: Many Classes

Decision regions of a linear machine are convex:
  y, z ∈ R_i  =>  αy + (1 - α)z ∈ R_i   for 0 ≤ α ≤ 1
This holds because each g_i is linear in x:
  ∀j ≠ i: g_i(y) ≥ g_j(y) and g_i(z) ≥ g_j(z)  =>  ∀j ≠ i: g_i(αy + (1 - α)z) ≥ g_j(αy + (1 - α)z)
In particular, decision regions must be spatially contiguous
[Figure: a convex region R_i containing the segment between y and z; a connected R_j is a valid decision region, a disconnected R_j is not]
LDF: Many Classes

For two contiguous regions R_i and R_j, the boundary that separates them is a portion of the hyperplane H_ij defined by:
  g_i(x) = g_j(x)  <=>  w_i^t x + w_i0 = w_j^t x + w_j0
                   <=>  (w_i - w_j)^t x + (w_i0 - w_j0) = 0
Thus w_i - w_j is normal to H_ij
The distance from x to H_ij is given by
  d(x, H_ij) = (g_i(x) - g_j(x)) / ||w_i - w_j||

Thus the applicability of the linear machine is mostly limited to unimodal conditional densities p(x|θ), even though we did not assume any parametric models
Example: if a class needs non-contiguous decision regions, a linear machine will fail
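For illustration only (not in the slides), a small NumPy sketch computing the distance from a point to the boundary H_ij between two classes of a linear machine, using the formula above; all numbers are made up.

```python
import numpy as np

def distance_to_boundary(x, wi, wi0, wj, wj0):
    """d(x, H_ij) = (g_i(x) - g_j(x)) / ||w_i - w_j||."""
    gi = wi @ x + wi0
    gj = wj @ x + wj0
    return (gi - gj) / np.linalg.norm(wi - wj)

# hypothetical weights for two of the classes
wi, wi0 = np.array([1.0, 0.0]), 0.0
wj, wj0 = np.array([0.0, 1.0]), -0.5
print(distance_to_boundary(np.array([2.0, 1.0]), wi, wi0, wj, wj0))
```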
LDF: Augmented Feature Vector

Linear discriminant function: g(x) = w^t x + w_0
Can rewrite it as:
  g(x) = [w_0  w^t] [1; x] = a^t y = g(y)
where a = [w_0; w] is the new weight vector and y = [1; x] is the new feature vector
y is called the augmented feature vector
Adding a dummy dimension gives a completely equivalent, homogeneous problem:
  old problem: g(x) = w^t x + w_0, with samples x_1, ..., x_n
  new problem: g(y) = a^t y, with samples y_1, ..., y_n
Feature augmenting is done for simpler notation
From now on we always assume that we have augmented feature vectors:
  given samples x_1, ..., x_n, convert them to augmented samples y_1, ..., y_n by adding a new dimension of value 1, i.e. y_i = [1; x_i]

LDF: Training Error

For the rest of the lecture, assume we have 2 classes
Samples y_1, ..., y_n, some in class 1, some in class 2
Use these samples to determine the weights a in the discriminant function g(y) = a^t y
What should be our criterion for determining a?
  For now, suppose we want to minimize the training error (that is, the number of misclassified samples y_1, ..., y_n)
Recall that
  g(y_i) > 0  =>  y_i classified as c_1
  g(y_i) < 0  =>  y_i classified as c_2
Thus the training error is 0 if
  g(y_i) > 0  ∀y_i ∈ c_1
  g(y_i) < 0  ∀y_i ∈ c_2

LDF: Problem "Normalization"

Thus the training error is 0 if
  a^t y_i > 0  ∀y_i ∈ c_1
  a^t y_i < 0  ∀y_i ∈ c_2
Equivalently, the training error is 0 if
  a^t y_i > 0      ∀y_i ∈ c_1
  a^t (-y_i) > 0   ∀y_i ∈ c_2
This suggests problem "normalization":
  1. Replace all examples from class c_2 by their negatives: y_i → -y_i  ∀y_i ∈ c_2
  2. Seek a weight vector a s.t. a^t y_i > 0  ∀y_i
If such an a exists, it is called a separating or solution vector; the original samples x_1, ..., x_n can then indeed be separated by a line
[Figure: the boundary g(y) = 0 in the y^(1)-y^(2) plane, regions R1 (g(y) > 0) and R2 (g(y) < 0); the weight vector a is normal to the boundary and the distance from y to the boundary is g(y)/||a||]
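A minimal sketch (not from the slides, assuming NumPy arrays with class labels 1 and 2) of the augmentation and "normalization" steps described above.

```python
import numpy as np

def augment(X):
    """Add a dummy first dimension of value 1: x -> y = [1, x]."""
    n = X.shape[0]
    return np.hstack([np.ones((n, 1)), X])

def normalize(Y, labels):
    """Replace every class-2 sample by its negative, so a separating
    vector a must satisfy a^t y_i > 0 for all samples."""
    Y = Y.copy()
    Y[labels == 2] *= -1
    return Y

# toy data: three 2D samples, the last one from class 2
X = np.array([[2.0, 1.0], [4.0, 3.0], [1.0, 3.0]])
labels = np.array([1, 1, 2])
Y = normalize(augment(X), labels)
print(Y)
```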
LDF: Problem "Normalization"

[Figure: before normalization, we seek a hyperplane that separates the patterns from the two categories; after "normalization", we seek a hyperplane that puts all normalized patterns on the same (positive) side]

LDF: Solution Region

Solution region for a: the set of all possible solution vectors, defined in terms of the normal a to the separating hyperplane
LDF: Solution Region

Find a weight vector a s.t. for all samples y_1, ..., y_n
  a^t y_i = Σ_{k=0}^{d} a_k y_i^(k) > 0
In general, there are many such solutions a
[Figure: normalized samples in the y^(1)-y^(2) plane with the shaded solution region of weight vectors a; the "best" a lies well inside the region]

Optimization

Need to minimize a function of many variables
  J(x) = J(x_1, ..., x_d)
We know how to minimize J(x): take the partial derivatives and set them to zero
  ∇J(x) = 0   (the gradient)
However, solving analytically is not always easy
  Would you like to solve this system of nonlinear equations?
    sin(x_1^2 + x_2^3) + e^{x_4} = 0
    cos(x_1^2 + x_2^3) + log(x_5^3) / x_4^2 = 0
Sometimes it is not even possible to write down an analytical expression for the derivative; we will see an example later today
Optimization: Gradient Descent

The gradient ∇J(x) points in the direction of steepest increase of J(x), and -∇J(x) in the direction of steepest decrease
[Figure: in one dimension, -dJ(a)/dx points downhill from a; in two dimensions, -∇J(a) points downhill from a]

Gradient descent is guaranteed to find only a local minimum
[Figure: iterates x^(1), x^(2), x^(3), ..., x^(k) descending into a local minimum of J(x) while the global minimum lies elsewhere]
Nevertheless, gradient descent is very popular because it is simple and applicable to any function

Gradient descent for minimizing any function J(x):
  set k = 1 and x^(1) to some initial guess for the weight vector
  while η^(k) ||∇J(x^(k))|| > ε
    choose learning rate η^(k)
    x^(k+1) = x^(k) - η^(k) ∇J(x^(k))   (update rule)
    k = k + 1

Main issue: how to set the parameter η (the learning rate)
  If η is too small, we need too many iterations
  If η is too large, we may overshoot the minimum and possibly never find it (if we keep overshooting)
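A minimal sketch (not from the slides) of the gradient descent loop above; the quadratic test function and the constant learning rate are illustrative choices.

```python
import numpy as np

def gradient_descent(J_grad, x0, eta=0.1, eps=1e-6, max_iter=1000):
    """Minimize J via x(k+1) = x(k) - eta * grad J(x(k)).

    J_grad : function returning the gradient of J at x
    x0     : initial guess
    eta    : fixed learning rate eta(k) = eta
    eps    : stop when eta * ||grad J(x)|| <= eps
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = J_grad(x)
        if eta * np.linalg.norm(g) <= eps:
            break
        x = x - eta * g
    return x

# example: J(x) = (x1 - 1)^2 + (x2 + 2)^2, gradient = [2(x1 - 1), 2(x2 + 2)]
grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] + 2)])
print(gradient_descent(grad, [0.0, 0.0]))   # approaches [1, -2]
```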
Today

Continue Linear Discriminant Functions
  Perceptron Criterion Function
  Batch perceptron rule
  Single sample perceptron rule

LDF

Augmented and "normalized" samples y_1, ..., y_n
Seek a weight vector a s.t. a^t y_i > 0  ∀y_i
[Figure: the samples before normalization and after "normalization"]
If such an a exists, it is called a separating or solution vector; the original samples x_1, ..., x_n can then indeed be separated by a line
LDF: Augmented Feature Vector (recap)

Linear discriminant function: g(x) = w^t x + w_0
  need to estimate the parameters w and w_0 from data
[Figure: salmon/bass samples in the lightness-length plane separated by a line]
Augment the samples x to get an equivalent homogeneous problem in terms of samples y:
  g(x) = [w_0  w^t] [1; x] = a^t y = g(y)
"Normalize" by replacing all examples from class c_2 by their negatives: y_i → -y_i  ∀y_i ∈ c_2

Optimization: Gradient Descent (recap)

[Figure: iterates x^(1), x^(2), x^(3), ..., x^(k) descending on J(x), with steps s^(k+1) = x^(k+1) - x^(k) = η^(k)(-∇J(x^(k))) and ∇J(x^(k)) = 0 at the minimum]
Gradient descent for minimizing any function J(x):
  set k = 1 and x^(1) to some initial guess for the weight vector
  while η^(k) ||∇J(x^(k))|| > ε
    choose learning rate η^(k)
    x^(k+1) = x^(k) - η^(k) ∇J(x^(k))   (update rule)
    k = k + 1
LDF: Criterion Function

Find a weight vector a s.t. for all samples y_1, ..., y_n
  a^t y_i = Σ_{k=0}^{d} a_k y_i^(k) > 0
Need a criterion function J(a) which is minimized when a is a solution vector
Let Y_M be the set of examples misclassified by a:
  Y_M(a) = { sample y_i  s.t.  a^t y_i < 0 }
First natural choice: the number of misclassified examples
  J(a) = |Y_M(a)|
  but this J(a) is piecewise constant, so gradient descent is useless
[Figure: J(a) = |Y_M(a)| as a staircase-shaped function of a]

LDF: Perceptron Criterion Function

Better choice: the Perceptron criterion function
  J_p(a) = Σ_{y ∈ Y_M} (-a^t y)
If y is misclassified, a^t y ≤ 0, thus J_p(a) ≥ 0
Geometric interpretation: J_p(a) is ||a|| times the sum of the distances of the misclassified examples to the decision boundary (the distance from y to the boundary is |a^t y| / ||a||)
J_p(a) is piecewise linear and thus suitable for gradient descent
[Figure: J_p(a) as a piecewise linear function of a]
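A minimal sketch (not from the slides) of evaluating the Perceptron criterion J_p(a) and the misclassified set Y_M on normalized augmented samples; the sample values are illustrative.

```python
import numpy as np

def misclassified(a, Y):
    """Return the normalized samples y_i with a^t y_i <= 0 (the set Y_M)."""
    return Y[Y @ a <= 0]

def perceptron_criterion(a, Y):
    """J_p(a) = sum over misclassified y of (-a^t y); zero iff a separates all samples."""
    Y_M = misclassified(a, Y)
    return float(np.sum(-Y_M @ a))

# toy normalized augmented samples (rows) and a candidate weight vector
Y = np.array([[1.0, 2.0, 1.0], [1.0, 4.0, 3.0], [-1.0, -1.0, -3.0]])
a = np.array([1.0, 1.0, 1.0])
print(perceptron_criterion(a, Y))
```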
LDF: Perceptron Batch Rule

The gradient of J_p(a) is
  ∇J_p(a) = Σ_{y ∈ Y_M} (-y)
where Y_M are the samples misclassified by a^(k)
It is not possible to solve ∇J_p(a) = 0 analytically because of Y_M
The gradient descent update rule is x^(k+1) = x^(k) - η^(k) ∇J(x^(k))
Thus the gradient descent batch update rule for J_p(a) is:
  a^(k+1) = a^(k) + η^(k) Σ_{y ∈ Y_M} y
It is called the batch rule because it is based on all misclassified examples

LDF: Perceptron Single Sample Rule

The gradient descent single sample rule for J_p(a) is:
  a^(k+1) = a^(k) + η^(k) y_M
  note that y_M is one sample misclassified by a^(k)
  must have a consistent way of visiting the samples
Geometric interpretation:
  y_M misclassified by a^(k) means (a^(k))^t y_M ≤ 0, i.e. y_M is on the wrong side of the decision hyperplane
  adding η y_M to a moves the new decision hyperplane in the right direction with respect to y_M
[Figure: the hyperplanes defined by a^(k) and a^(k+1); adding η y_M rotates the boundary toward correctly classifying y_M]
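A minimal sketch (not from the slides) of both perceptron update rules with a fixed learning rate η, operating on normalized augmented samples as defined earlier; max_iter is an added safeguard since the loop need not terminate on non-separable data.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch rule: a(k+1) = a(k) + eta * sum of all currently misclassified samples."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        Y_M = Y[Y @ a <= 0]           # misclassified samples (boundary counts as misclassified)
        if len(Y_M) == 0:
            break                      # a is a solution vector
        a = a + eta * Y_M.sum(axis=0)
    return a

def single_sample_perceptron(Y, eta=1.0, max_iter=1000):
    """Single sample rule: visit samples in a fixed order and update
    a(k+1) = a(k) + eta * y_M whenever a sample y_M is misclassified."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        updated = False
        for y in Y:                    # consistent way of visiting samples
            if a @ y <= 0:
                a = a + eta * y
                updated = True
        if not updated:
            break                      # no misclassified samples left
    return a

# toy normalized augmented samples; prints a separating vector for this data
Y = np.array([[1.0, 2.0, 1.0], [1.0, 4.0, 3.0], [-1.0, -1.0, -3.0]])
print(single_sample_perceptron(Y))
```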
LDF: Perceptron Single Sample Rule

a^(k+1) = a^(k) + η^(k) y_M
[Figure: if η is too large, a previously correctly classified sample y_k becomes misclassified; if η is too small, y_M is still misclassified after the update]

LDF Example: Augment Feature Vector

name  | extra | good attendance? | tall?    | sleeps in class? | chews gum? | grade
Jane  | 1     | yes (1)          | yes (1)  | no (-1)          | no (-1)    | A
Steve | 1     | yes (1)          | yes (1)  | yes (1)          | yes (1)    | F
Mary  | 1     | no (-1)          | no (-1)  | no (-1)          | yes (1)    | F
Peter | 1     | yes (1)          | no (-1)  | no (-1)          | yes (1)    | A

Convert the samples x_1, ..., x_n to augmented samples y_1, ..., y_n by adding a new dimension of value 1 (the "extra" column above)
LDF: Perceptron Example

name  | good attendance? | tall?    | sleeps in class? | chews gum? | grade
Jane  | yes (1)          | yes (1)  | no (-1)          | no (-1)    | A
Steve | yes (1)          | yes (1)  | yes (1)          | yes (1)    | F
Mary  | no (-1)          | no (-1)  | no (-1)          | yes (1)    | F
Peter | yes (1)          | no (-1)  | no (-1)          | yes (1)    | A

class 1: students who get grade A
class 2: students who get grade F

LDF: Perform "Normalization"

Replace all examples from class c_2 by their negatives: y_i → -y_i  ∀y_i ∈ c_2
Seek a weight vector a s.t. a^t y_i > 0  ∀y_i

Augmented and normalized samples:
name  | extra | good attendance? | tall?     | sleeps in class? | chews gum? | grade
Jane  | 1     | yes (1)          | yes (1)   | no (-1)          | no (-1)    | A
Steve | -1    | yes (-1)         | yes (-1)  | yes (-1)         | yes (-1)   | F
Mary  | -1    | no (1)           | no (1)    | no (1)           | yes (-1)   | F
Peter | 1     | yes (1)          | no (-1)   | no (-1)          | yes (1)    | A
LDF: Use Single Sample Rule

Using the augmented and normalized samples above, a sample is misclassified if
  a^t y_i = Σ_{k=0}^{4} a_k y_i^(k) < 0
Gradient descent single sample rule: a^(k+1) = a^(k) + η^(k) y_M
Set a fixed learning rate η^(k) = 1:  a^(k+1) = a^(k) + y_M

LDF: Gradient Descent Example

Set equal initial weights a^(1) = [0.25, 0.25, 0.25, 0.25, 0.25]
Visit all samples sequentially, modifying the weights after finding a misclassified example

name  | a^t y                                                          | misclassified?
Jane  | 0.25*1 + 0.25*1 + 0.25*1 + 0.25*(-1) + 0.25*(-1) > 0           | no
Steve | 0.25*(-1) + 0.25*(-1) + 0.25*(-1) + 0.25*(-1) + 0.25*(-1) < 0  | yes

New weights:
  a^(2) = a^(1) + y_M = [0.25, 0.25, 0.25, 0.25, 0.25] + [-1, -1, -1, -1, -1] = [-0.75, -0.75, -0.75, -0.75, -0.75]

name | a^t y                                                   | misclassified?
Mary | -0.75*(-1) - 0.75*1 - 0.75*1 - 0.75*1 - 0.75*(-1) < 0   | yes

New weights:
  a^(3) = a^(2) + y_M = [-0.75, -0.75, -0.75, -0.75, -0.75] + [-1, 1, 1, 1, -1] = [-1.75, 0.25, 0.25, 0.25, -1.75]

name  | a^t y                                                   | misclassified?
Peter | -1.75*1 + 0.25*1 + 0.25*(-1) + 0.25*(-1) - 1.75*1 < 0   | yes

New weights:
  a^(4) = a^(3) + y_M = [-1.75, 0.25, 0.25, 0.25, -1.75] + [1, 1, -1, -1, 1] = [-0.75, 1.25, -0.75, -0.75, -0.75]
LDF: Gradient Descent Example

Check all samples again with a^(4) = [-0.75, 1.25, -0.75, -0.75, -0.75]:

name  | a^t y                                                           | misclassified?
Jane  | -0.75*1 + 1.25*1 - 0.75*1 - 0.75*(-1) - 0.75*(-1) > 0           | no
Steve | -0.75*(-1) + 1.25*(-1) - 0.75*(-1) - 0.75*(-1) - 0.75*(-1) > 0  | no
Mary  | -0.75*(-1) + 1.25*1 - 0.75*1 - 0.75*1 - 0.75*(-1) > 0           | no
Peter | -0.75*1 + 1.25*1 - 0.75*(-1) - 0.75*(-1) - 0.75*1 > 0           | no

Thus the discriminant function is
  g(y) = -0.75 y^(0) + 1.25 y^(1) - 0.75 y^(2) - 0.75 y^(3) - 0.75 y^(4)

Converting back to the original features x:
  g(x) = 1.25 x^(1) - 0.75 x^(2) - 0.75 x^(3) - 0.75 x^(4) - 0.75
so the decision rule is
  1.25 x^(1) - 0.75 x^(2) - 0.75 x^(3) - 0.75 x^(4) > 0.75  =>  grade A
  1.25 x^(1) - 0.75 x^(2) - 0.75 x^(3) - 0.75 x^(4) < 0.75  =>  grade F
where x^(1) = good attendance, x^(2) = tall, x^(3) = sleeps in class, x^(4) = chews gum

This is just one possible solution vector
If we had started with weights a^(1) = [0, 0.5, 0.5, 0, 0], the solution would be [-1, 1.5, -0.5, -1, -1]:
  1.5 x^(1) - 0.5 x^(2) - x^(3) - x^(4) > 1  =>  grade A
  1.5 x^(1) - 0.5 x^(2) - x^(3) - x^(4) < 1  =>  grade F
In this solution, being tall is the least important feature
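A minimal Python sketch (not from the slides) that reproduces this worked example: it encodes the four normalized augmented student samples, runs the single sample rule with η = 1 from a^(1) = [0.25, 0.25, 0.25, 0.25, 0.25], and should end at the solution vector [-0.75, 1.25, -0.75, -0.75, -0.75] found above.

```python
import numpy as np

# normalized augmented samples: [extra, good attendance, tall, sleeps in class, chews gum]
# the class-2 (grade F) students Steve and Mary are already negated
Y = np.array([
    [ 1,  1,  1, -1, -1],   # Jane  (A)
    [-1, -1, -1, -1, -1],   # Steve (F, negated)
    [-1,  1,  1,  1, -1],   # Mary  (F, negated)
    [ 1,  1, -1, -1,  1],   # Peter (A)
], dtype=float)

a = np.array([0.25, 0.25, 0.25, 0.25, 0.25])   # equal initial weights a(1)
eta = 1.0                                      # fixed learning rate

for _ in range(100):                           # safeguard on the number of passes
    updated = False
    for y in Y:                                # visit samples sequentially
        if a @ y <= 0:                         # misclassified (boundary counts as misclassified)
            a = a + eta * y                    # single sample rule: a <- a + eta * y_M
            updated = True
    if not updated:                            # a now separates all samples
        break

print(a)   # expected: [-0.75  1.25 -0.75 -0.75 -0.75]
```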
LDF: Nonseparable Example

Suppose we have 2 features and the samples are:
  Class 1: [2,1], [4,3], [3,5]
  Class 2: [1,3], [5,6]
These samples are not separable by a line
Still, we would like to get an approximate separation by a line (a good choice is shown in green in the figure)
  some samples may be "noisy", and it is OK if they end up on the wrong side of the line
Get y_1, ..., y_5 by adding the extra feature and "normalizing":
  y_1 = [1, 2, 1]   y_2 = [1, 4, 3]   y_3 = [1, 3, 5]   y_4 = [-1, -1, -3]   y_5 = [-1, -5, -6]

Let's apply the Perceptron single sample algorithm
  initial equal weights a^(1) = [1, 1, 1]  (this is the line x^(1) + x^(2) + 1 = 0)
  fixed learning rate η = 1:  a^(k+1) = a^(k) + y_M

  y_1^t a^(1) = [1, 2, 1] · [1, 1, 1]^t > 0
  y_2^t a^(1) = [1, 4, 3] · [1, 1, 1]^t > 0
  y_3^t a^(1) = [1, 3, 5] · [1, 1, 1]^t > 0
LDF: Nonseparable Example

a^(k+1) = a^(k) + y_M

y_4^t a^(1) = [-1, -1, -3] · [1, 1, 1]^t = -5 < 0, so
  a^(2) = a^(1) + y_M = [1, 1, 1] + [-1, -1, -3] = [0, 0, -2]
y_5^t a^(2) = [-1, -5, -6] · [0, 0, -2]^t = 12 > 0
y_1^t a^(2) = [1, 2, 1] · [0, 0, -2]^t < 0, so
  a^(3) = a^(2) + y_M = [0, 0, -2] + [1, 2, 1] = [1, 2, -1]
LDF: Nonseparable Example

a^(3) = [1, 2, -1]

y_2^t a^(3) = [1, 4, 3] · [1, 2, -1]^t = 6 > 0
y_3^t a^(3) = [1, 3, 5] · [1, 2, -1]^t > 0
y_4^t a^(3) = [-1, -1, -3] · [1, 2, -1]^t = 0, which is not > 0, so
  a^(4) = a^(3) + y_M = [1, 2, -1] + [-1, -1, -3] = [0, 1, -4]

We can continue this forever: there is no solution vector a satisfying, for all i,
  a^t y_i = Σ_{k=0}^{2} a_k y_i^(k) > 0
We need to stop, but at a good point
[Figure: the boundaries corresponding to the solutions at iterations 900 through 915; some are good, some are not]
How do we stop at a good solution?
LDF: Convergence of Perceptron Rules

If the classes are linearly separable and we use a fixed learning rate, that is η^(k) = c for some constant c:
  both the single sample and batch perceptron rules converge to a correct solution (it could be any a in the solution region)
If the classes are not linearly separable:
  the algorithm does not stop; it keeps looking for a solution which does not exist
  by choosing an appropriate learning rate, we can always ensure convergence: η^(k) → 0 as k → ∞
  for example, the inverse linear learning rate: η^(k) = η^(1) / k
  for the inverse linear learning rate, convergence in the linearly separable case can also be proven
  there is no guarantee that we stopped at a good point, but there are good reasons to choose the inverse linear learning rate
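A minimal sketch (not from the slides) of the single sample rule with the inverse linear learning rate η(k) = η(1)/k, run on the non-separable example above; the fixed number of passes is an illustrative stopping choice, not the lecture's.

```python
import numpy as np

# augmented and normalized samples of the non-separable example
Y = np.array([
    [ 1,  2,  1],
    [ 1,  4,  3],
    [ 1,  3,  5],
    [-1, -1, -3],
    [-1, -5, -6],
], dtype=float)

a = np.array([1.0, 1.0, 1.0])   # initial weights a(1)
eta1 = 1.0                      # eta(1)
k = 1                           # update counter

for _ in range(10000):          # no separating vector exists, so we just cap the passes
    updated = False
    for y in Y:
        if a @ y <= 0:                  # misclassified (or exactly on the boundary)
            a = a + (eta1 / k) * y      # inverse linear learning rate: eta(k) = eta(1)/k
            k += 1
            updated = True
    if not updated:
        break

print(a)   # the weight vector changes less and less as the step size shrinks toward 0
```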
LDF: Perceptron Rule and Gradient Descent

Linearly separable data:
  the perceptron rule with gradient descent works well
Linearly non-separable data:
  need to stop the perceptron rule algorithm at a good point; this may be tricky

Batch rule:
  smoother gradient, because all (misclassified) samples are used in each update
Single sample rule:
  easier to analyze
  concentrates more than necessary on any isolated "noisy" training examples