Linear Classifiers PPT 1

Late penalty is 1 point off for each day late
Final project progress report: meet with me the week of November 22-26
5 points off if I see that you have done NOTHING yet
Assignment 4 due December 1 (include a short description file of what the data is)
Final project due December 8
Linear Discriminant Functions: Basic Idea
[Figure: bass and salmon samples in the lightness-length plane, showing a bad boundary and a good boundary]
Have samples from 2 classes x1, x2, ..., xn
Assume the 2 classes can be separated by a linear boundary l(θ) with some unknown parameters θ
Fit the "best" boundary to the data by optimizing over the parameters θ
Need to estimate the parameters of the discriminant function (the parameters of the line in the case of a linear discriminant)
What is "best"?
Minimize the classification error on the training data?
Does not guarantee a small testing error

LDF: Introduction
For now, we will study linear discriminant functions
Simple model (should try simpler models first)
Analytically tractable
Linear discriminant functions are optimal for Gaussian distributions with equal covariance
May not be optimal for other data distributions, but they are very simple to use
Knowledge of the class densities is not required when using linear discriminant functions, so we can say that this is a non-parametric approach
Parametric Methods vs. Discriminant Functions
Parametric methods:
  Assume the shape of the densities for the classes is known: p1(x|θ1), p2(x|θ2), ...
  Estimate θ1, θ2, ... from data
  Use a Bayesian classifier to find the decision regions
  [Figure: decision regions for classes c1, c2, c3 in the (x(1), x(2)) plane]
Discriminant functions:
  Assume the discriminant functions are of known shape l(θ1), l(θ2), ..., with parameters θ1, θ2, ...
  Estimate θ1, θ2, ... from data
  Use the discriminant functions for classification
In theory, the Bayesian classifier minimizes the risk
In practice, we do not have confidence in the assumed model shapes
In practice, we do not really need the actual density functions in the end
Estimating accurate density functions is much harder than estimating accurate discriminant functions
Some argue that estimating densities should be skipped: why solve a harder problem than needed?

LDF: 2 Classes
A discriminant function is linear if it can be written as
  g(x) = w^t x + w0
w is called the weight vector and w0 is called the bias or threshold
  g(x) > 0  =>  x ∈ class 1
  g(x) < 0  =>  x ∈ class 2
  g(x) = 0  =>  either class
[Figure: decision regions ℜ1 and ℜ2 in the (x(1), x(2)) plane, separated by the decision boundary g(x) = 0, with g(x) > 0 on the ℜ1 side and g(x) < 0 on the ℜ2 side]
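A minimal sketch of the 2-class rule above in Python; the weight vector w, bias w0, and test points are made-up values for illustration, not taken from the lecture.

import numpy as np

def classify(x, w, w0):
    """Return 1 if g(x) > 0, 2 if g(x) < 0, None on the boundary."""
    g = w @ x + w0                 # g(x) = w^t x + w0
    if g > 0:
        return 1
    if g < 0:
        return 2
    return None                    # g(x) = 0: either class

w = np.array([1.0, 2.0])           # assumed weight vector
w0 = -3.0                          # assumed bias (threshold)
print(classify(np.array([2.0, 2.0]), w, w0))   # g(x) = 3 > 0  -> class 1
print(classify(np.array([0.0, 1.0]), w, w0))   # g(x) = -1 < 0 -> class 2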
LDF: 2 Classes
The decision boundary g(x) = w^t x + w0 = 0 is a hyperplane:
the set of vectors x which, for some scalars α0, ..., αd, satisfy
  α0 + α1 x(1) + ... + αd x(d) = 0
A hyperplane is
  a point in 1D
  a line in 2D
  a plane in 3D
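For example (made-up numbers), in 2D with w = (1, 2)^t and w0 = -3, the decision boundary is the line x(1) + 2x(2) - 3 = 0.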
LDF: 2 Classes
g(x) = w^t x + w0
w determines the orientation of the decision hyperplane
w0 determines the location of the decision surface
[Figure: in the (x(1), x(2)) plane, the hyperplane g(x) = 0 with normal vector w; the distance from x to the hyperplane is g(x)/||w||, and the hyperplane's offset from the origin is w0/||w||]

LDF: Many Classes
Suppose we have m classes
Define m linear discriminant functions
  gi(x) = wi^t x + wi0,  i = 1, ..., m
Given x, assign class ci if
  gi(x) ≥ gj(x)  ∀j ≠ i
Such a classifier is called a linear machine
A linear machine divides the feature space into m decision regions, with gi(x) being the largest discriminant if x is in region Ri
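A minimal sketch of a linear machine in Python; the weight matrix W and biases w0 are made-up values for illustration.

import numpy as np

def linear_machine(x, W, w0):
    """W holds the m weight vectors as rows, w0 the m biases.
    Returns the index i of the largest discriminant gi(x) = wi^t x + wi0."""
    g = W @ x + w0                 # all m discriminants at once
    return int(np.argmax(g))       # assign the class with the largest gi(x)

W = np.array([[ 1.0,  0.0],        # example: m = 3 classes in 2D (assumed values)
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))   # g = [2, 1, -2.5] -> class index 0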
LDF: Many Classes
Decision regions of a linear machine are convex:
  y, z ∈ Ri  =>  αy + (1-α)z ∈ Ri  for 0 ≤ α ≤ 1
Indeed, since each gi is linear,
  ∀j ≠ i  gi(y) ≥ gj(y) and gi(z) ≥ gj(z)  =>  ∀j ≠ i  gi(αy + (1-α)z) ≥ gj(αy + (1-α)z)
In particular, decision regions must be spatially contiguous
[Figure: a region Rj adjacent to Ri is a valid decision region; a region Rj split into two disconnected pieces separated by Ri is not a valid decision region]
LDF: Many Classes
For two contiguous regions Ri and Rj, the boundary that separates them is a portion of the hyperplane Hij defined by:
  gi(x) = gj(x)  ⇔  wi^t x + wi0 = wj^t x + wj0  ⇔  (wi - wj)^t x + (wi0 - wj0) = 0
Thus wi - wj is normal to Hij
And the distance from x to Hij is given by
  d(x, Hij) = (gi(x) - gj(x)) / ||wi - wj||

LDF: Many Classes
Thus the applicability of the linear machine is mostly limited to unimodal conditional densities p(x|θ), even though we did not assume any parametric models
Example:
[Figure: two classes whose samples need non-contiguous decision regions; a linear machine will fail here]
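A small numeric check of the distance formula d(x, Hij) above, with made-up weight vectors and biases.

import numpy as np

wi, wi0 = np.array([2.0, 0.0]), 1.0    # assumed gi(x) = 2*x(1) + 1
wj, wj0 = np.array([0.0, 1.0]), 0.0    # assumed gj(x) = x(2)
x = np.array([1.0, 1.0])

gi = wi @ x + wi0                      # gi(x) = 3
gj = wj @ x + wj0                      # gj(x) = 1
d = (gi - gj) / np.linalg.norm(wi - wj)
print(d)                               # (3 - 1) / ||(2, -1)|| = 2 / sqrt(5) ≈ 0.894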
LDF: Augmented feature vector
Linear discriminant function: g(x) = w^t x + w0
Can rewrite it as
  g(x) = [w0 w^t] [1; x] = a^t y = g(y)
with the new weight vector a = [w0; w] and the new feature vector y = [1; x]
y is called the augmented feature vector
We added a dummy dimension to get a completely equivalent new homogeneous problem:
  old problem: g(x) = w^t x + w0 with x = [x1; ...; xd]
  new problem: g(y) = a^t y with y = [1; x1; ...; xd]

LDF: Training Error
For the rest of the lecture, assume we have 2 classes
Samples y1, ..., yn, some in class 1, some in class 2
Use these samples to determine the weights a in the discriminant function g(y) = a^t y
What should be our criterion for determining a?
For now, suppose we want to minimize the training error (that is, the number of misclassified samples y1, ..., yn)
Recall that
  g(yi) > 0  =>  yi classified as c1
  g(yi) < 0  =>  yi classified as c2
Thus the training error is 0 if
  g(yi) > 0  ∀yi ∈ c1
  g(yi) < 0  ∀yi ∈ c2
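A minimal sketch of the augmentation in Python: prepend a dummy 1 to each sample so that g(x) = w^t x + w0 becomes g(y) = a^t y; the sample values and weights below are made up for illustration.

import numpy as np

X = np.array([[1.0, 2.0],                       # original samples x1, ..., xn as rows
              [3.0, 0.5]])
Y = np.hstack([np.ones((X.shape[0], 1)), X])    # augmented samples y1, ..., yn

w, w0 = np.array([2.0, -1.0]), 0.5              # assumed weights and bias
a = np.concatenate(([w0], w))                   # a = [w0; w]

print(X @ w + w0)                               # g(x) for each sample: [0.5, 6.0]
print(Y @ a)                                    # g(y) = a^t y gives the same: [0.5, 6.0]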
LDF: Augmented feature vector
Feature augmenting is done for simpler notation
From now on we always assume that we have augmented feature vectors
Given samples x1, ..., xn, convert them to augmented samples y1, ..., yn by adding a new dimension of value 1: yi = [1; xi]
[Figure: regions ℜ1 and ℜ2 in the (y(1), y(2)) plane separated by the hyperplane g(y) = 0 through the origin, with normal a; the distance from y to the hyperplane is g(y)/||a||]

LDF: Problem “Normalization”
Thus the training error is 0 if
  a^t yi > 0  ∀yi ∈ c1
  a^t yi < 0  ∀yi ∈ c2
Equivalently, the training error is 0 if
  a^t yi > 0     ∀yi ∈ c1
  a^t (-yi) > 0  ∀yi ∈ c2
This suggests problem “normalization”:
1. Replace all examples from class c2 by their negatives:
     yi → -yi  ∀yi ∈ c2
2. Seek a weight vector a s.t.
     a^t yi > 0  ∀yi
If such an a exists, it is called a separating or solution vector; the original samples x1, ..., xn can then indeed be separated by a line
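A minimal sketch of the “normalization” step in Python: negate the augmented class-2 samples so that a single condition a^t yi > 0 covers both classes; the samples, labels, and candidate a are made up for illustration.

import numpy as np

Y = np.array([[1.0,  2.0,  1.0],     # augmented samples as rows
              [1.0,  1.0,  2.0],
              [1.0, -1.0, -1.0],
              [1.0, -2.0,  0.0]])
labels = np.array([1, 1, 2, 2])      # class of each sample

Z = Y.copy()
Z[labels == 2] *= -1                 # yi -> -yi for every yi in class c2

a = np.array([0.0, 1.0, 1.0])        # a candidate weight vector (assumed)
print(np.all(Z @ a > 0))             # True -> a is a separating (solution) vector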
LDF: Problem “Normalization”
[Figure: before normalization, we seek a hyperplane that separates the patterns from the different categories; after “normalization”, we seek a hyperplane that puts all the normalized patterns on the same (positive) side]

LDF: Solution Region
Solution region for a: the set of all possible solution vectors, defined in terms of the normal a to the separating hyperplane
[Figure: normalized samples in the (y(1), y(2)) plane with the shaded solution region of weight vectors a]
LDF: Solution Region
Find a weight vector a s.t. for all samples y1, ..., yn
  a^t yi = Σ_{k=0..d} ak yi(k) > 0
In general, there are many such solutions a
[Figure: several separating vectors a in the solution region, with one marked as the “best” a]

Optimization
Need to minimize a function of many variables
  J(x) = J(x1, ..., xd)
We know how to minimize J(x): take the partial derivatives and set them to zero
  ∇J(x) = [∂J/∂x1, ..., ∂J/∂xd]^t = 0   (the gradient)
However, solving analytically is not always easy
Would you like to solve this system of nonlinear equations?
  sin(x1^2 + x2^3) + e^(x4^2) = 0
  cos(x1^2 + x2^3) + log(x5^3) / x4^2 = 0
Sometimes it is not even possible to write down an analytical expression for the derivative; we will see an example later today
Optimization: Gradient Descent
The gradient ∇J(x) points in the direction of steepest increase of J(x), and -∇J(x) in the direction of steepest decrease
[Figure: in one dimension, -dJ(a)/dx at a point a points downhill; in two dimensions, -∇J(a) points downhill]

Optimization: Gradient Descent
Gradient descent is guaranteed to find only a local minimum
[Figure: J(x) with iterates x(1), x(2), x(3), ..., x(k) converging to a local minimum while the global minimum lies elsewhere]
Nevertheless gradient descent is very popular because it is simple and applicable to any function

Optimization: Gradient Descent
[Figure: iterates x(1), x(2), x(3), ..., x(k) with steps s(1), s(2), ... taken along -∇J(x(1)), -∇J(x(2)), ..., stopping where ∇J(x(k)) = 0]
Gradient descent for minimizing any function J(x):
  set k = 1 and x(1) to some initial guess for the weight vector
  while η(k) ||∇J(x(k))|| > ε
    choose the learning rate η(k)
    x(k+1) = x(k) - η(k) ∇J(x(k))   (update rule)
    k = k + 1

Optimization: Gradient Descent
Main issue: how to set the parameter η (the learning rate)
If η is too small, we need too many iterations
If η is too large, we may overshoot the minimum and possibly never find it (if we keep overshooting)
[Figure: with small η the iterates crawl toward the minimum of J(x); with large η the iterates x(1), x(2), ... jump back and forth across it]
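A minimal sketch of this gradient descent loop in Python; the objective J, the fixed learning rate eta, and the threshold eps are made-up choices for illustration (the lecture leaves η(k) and ε abstract).

import numpy as np

def grad_J(x):
    # gradient of the toy objective J(x) = (x1 - 3)^2 + 2*(x2 + 1)^2
    return np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)])

x = np.array([0.0, 0.0])             # x(1): initial guess
eta, eps = 0.1, 1e-6                 # fixed learning rate and stopping threshold
k = 1
while eta * np.linalg.norm(grad_J(x)) > eps:
    x = x - eta * grad_J(x)          # update rule: x(k+1) = x(k) - eta(k) * grad J(x(k))
    k += 1
print(k, x)                          # ends near the minimum at (3, -1)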
Today
Continue Linear Discriminant Functions
  Perceptron criterion function
  Batch perceptron rule
  Single sample perceptron rule

LDF
Augmented and “normalized” samples y1, ..., yn
Seek a weight vector a s.t. a^t yi > 0 ∀yi
[Figure: samples in the (y(1), y(2)) plane before normalization and after “normalization”]
If such an a exists, it is called a separating or solution vector; the original samples x1, ..., xn can then indeed be separated by a line
LDF: Augmented feature vector
Linear discriminant function: g(x) = w^t x + w0
Need to estimate the parameters w and w0 from data
[Figure: bass and salmon samples in the lightness-length plane separated by a line]
Augment the samples x to get an equivalent homogeneous problem in terms of samples y:
  g(x) = [w0 w^t] [1; x] = a^t y = g(y)
“Normalize” by replacing all examples from class c2 by their negatives:
  yi → -yi  ∀yi ∈ c2

Optimization: Gradient Descent
[Figure: J(x) with iterates x(1), x(2), x(3), ..., x(k) and steps s(1), s(2), ... along -∇J(x(1)), -∇J(x(2)), ..., stopping where ∇J(x(k)) = 0; each step is s(k+1) = x(k+1) - x(k) = η(k)(-∇J(x(k)))]
Gradient descent for minimizing any function J(x):
  set k = 1 and x(1) to some initial guess for the weight vector
  while η(k) ||∇J(x(k))|| > ε
    choose the learning rate η(k)
    x(k+1) = x(k) - η(k) ∇J(x(k))   (update rule)
    k = k + 1
LDF: Criterion Function
Find a weight vector a s.t. for all samples y1, ..., yn
  a^t yi = Σ_{k=0..d} ak yi(k) > 0
Need a criterion function J(a) which is minimized when a is a solution vector
Let YM be the set of examples misclassified by a:
  YM(a) = { sample yi s.t. a^t yi < 0 }
First natural choice: the number of misclassified examples
  J(a) = |YM(a)|
[Figure: J(a) is piecewise constant in a, so gradient descent is useless]

LDF: Perceptron Criterion Function
Better choice: the Perceptron criterion function
  Jp(a) = Σ_{y ∈ YM} (-a^t y)
If y is misclassified, a^t y ≤ 0
Thus Jp(a) ≥ 0
Geometric interpretation: Jp(a) is ||a|| times the sum of the distances of the misclassified examples to the decision boundary (the distance from y to the boundary is |a^t y| / ||a||)
Jp(a) is piecewise linear and thus suitable for gradient descent
[Figure: Jp(a) as a piecewise linear function of a]
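A minimal sketch of evaluating the Perceptron criterion Jp(a) in Python, on already “normalized” samples (class-2 samples negated); the data and candidate vectors a are made up for illustration.

import numpy as np

Z = np.array([[ 1.0, 2.0, 1.0],      # normalized augmented samples as rows
              [ 1.0, 1.0, 2.0],
              [-1.0, 1.0, 1.0],
              [-1.0, 2.0, 0.0]])

def Jp(a, Z):
    scores = Z @ a
    YM = Z[scores <= 0]              # misclassified samples (a^t y <= 0)
    return np.sum(-(YM @ a))         # Jp(a) = sum over YM of (-a^t y)

print(Jp(np.array([0.0, 1.0, 1.0]), Z))    # 0.0: a is a solution vector
print(Jp(np.array([1.0, -1.0, 0.0]), Z))   # 6.0: positive, some samples misclassified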
LDF: Perceptron Batch Rule
Recall Jp(a) = Σ_{y ∈ YM} (-a^t y), where YM are the samples misclassified by a(k)
The gradient of Jp(a) is
  ∇Jp(a) = Σ_{y ∈ YM} (-y)
It is not possible to solve ∇Jp(a) = 0 analytically because of YM
Recall the gradient descent update rule: x(k+1) = x(k) - η(k) ∇J(x(k))
Thus the gradient descent batch update rule for Jp(a) is:
  a(k+1) = a(k) + η(k) Σ_{y ∈ YM} y
It is called the batch rule because it is based on all misclassified examples

LDF: Perceptron Single Sample Rule
Thus the gradient descent single sample rule for Jp(a) is:
  a(k+1) = a(k) + η(k) yM
note that yM is one sample misclassified by a(k)
must have a consistent way of visiting the samples
Geometric interpretation: yM is misclassified by a(k), that is (a(k))^t yM ≤ 0, so yM is on the wrong side of the decision hyperplane
Adding η yM to a moves the new decision hyperplane in the right direction with respect to yM
[Figure: the weight vector moves from a(k) to a(k+1) = a(k) + η yM, rotating the decision hyperplane toward the correct side of yM]
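A minimal sketch of the batch and single sample perceptron updates in Python, on already “normalized” samples Z; the data, initial a, and fixed eta are made up for illustration, and the loops terminate here because this toy data is linearly separable.

import numpy as np

Z = np.array([[ 1.0, 2.0, 1.0],      # normalized augmented samples as rows
              [ 1.0, 1.0, 2.0],
              [-1.0, 1.0, 1.0],
              [-1.0, 2.0, 0.0]])
eta = 1.0                            # fixed learning rate (assumed)

# Batch rule: add the sum of all currently misclassified samples
a = np.zeros(3)
while True:
    YM = Z[Z @ a <= 0]               # samples misclassified by the current a
    if len(YM) == 0:
        break
    a = a + eta * YM.sum(axis=0)     # a(k+1) = a(k) + eta(k) * sum over YM of y
print("batch rule:", a)

# Single sample rule: visit samples in a fixed order, update on each mistake
a = np.zeros(3)
changed = True
while changed:
    changed = False
    for y in Z:                      # a consistent way of visiting the samples
        if a @ y <= 0:
            a = a + eta * y          # a(k+1) = a(k) + eta(k) * yM
            changed = True
print("single sample rule:", a)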
LDF: Perceptron Single Sample Rule
  a(k+1) = a(k) + η(k) yM

LDF Example: Augment feature vector
name | extra | good attendance? | tall?   | sleeps in class? | chews gum? | grade
Jane |   1   |     yes (1)      | yes (1) |     no (-1)      |   no (-1)  |   A
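As a quick illustration (only Jane's row is given here), with the ±1 encoding in the table Jane's augmented feature vector is y = (1, 1, 1, -1, -1)^t, where the leading 1 is the “extra” dummy dimension added by augmentation.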