
CSCE 970 Lecture 3:
Linear Classifiers

Stephen D. Scott

January 21, 2003

1

Introduction

• Sometimes probabilistic information unavailable or mathematically intractable

• Many alternatives to Bayesian classification, but optimality guarantee may be compromised!

• Linear classifiers use a decision hyperplane to perform classification

• Simple and efficient to train and use

• Optimality requires linear separability of classes

[Figure: points of Class A and Class B separated by a decision line; an unclassified point is labeled by the side of the line it falls on]

2

Linear Discriminant Functions

• Let w = [w1, . . . , wℓ]^T be a weight vector and w0 (a.k.a. θ) be a threshold

• Decision surface is a hyperplane:

  w^T · x + w0 = 0

• E.g. predict ω2 if Σ_{i=1}^ℓ wi xi > w0, otherwise predict ω1

[Figure: a linear threshold unit with inputs 1, x1, . . . , xℓ, weights w0, w1, . . . , wℓ, and sum Σi wi xi; output y(t) = 1 (ω1) if sum > 0, y(t) = 0 (ω2) otherwise; may also use +1 and −1]

• Focus of this lecture: How to find the wi's

  – Perceptron algorithm

  – Winnow

  – Least squares methods (if classes not linearly separable)

3

The Perceptron Algorithm

• Assume linear separability, i.e. ∃ w* s.t.

  w*^T · x > 0 ∀ x ∈ ω1
  w*^T · x ≤ 0 ∀ x ∈ ω2

  (w0* is included in w*)

• So ∃ a deterministic function classifying vectors (contrary to Ch. 2 assumptions)

• Given actual label y(t) for trial t, update weights:

  w(t + 1) = w(t) + ρ (y(t) − ŷ(t)) x(t)

  – ρ > 0 is the learning rate

  – (y(t) − ŷ(t)) moves weights toward the correct prediction for x(t)

4
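As a concrete illustration of the update rule above, a minimal sketch of one perceptron trial (the function name `perceptron_step`, the 0/1 output convention, and the sample numbers are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron_step(w, x, y, rho=0.1):
    """One online perceptron trial: w includes w0, x includes a leading 1."""
    y_hat = 1 if np.dot(w, x) > 0 else 0   # thresholded prediction (omega_1 vs omega_2)
    return w + rho * (y - y_hat) * x       # unchanged when the prediction is correct

# Example trial: two features plus the constant input for the threshold
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.3])             # [1, x1, x2]
w = perceptron_step(w, x, y=1)
```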
The Perceptron Algorithm
Example

[Figure: points in the (x1, x2) plane with w0 = 0; examples with y(t) = 1 (ω1) and y(t) = 0 (ω2), our decision line defined by w(t), the optimal decision line defined by w*, and our new decision line defined by w(t + 1) after updating on the misclassified example x(t)]

5

The Perceptron Algorithm
Intuition

• Compromise between correctiveness and conservativeness

  – Correctiveness: Tendency to improve on x(t) if prediction error made

  – Conservativeness: Tendency to keep w(t + 1) close to w(t)

• Use cost function that measures both (first term conservative, second term corrective):

  U(w) = ‖w(t + 1) − w(t)‖₂² + η (y(t) − w(t + 1) · x(t))²

       = Σ_{i=1}^ℓ (wi(t + 1) − wi(t))² + η (y(t) − Σ_{i=1}^ℓ wi(t + 1) xi(t))²

6
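A small sketch of evaluating this cost for a candidate w(t + 1) (the function name and the choice η = 0.5 are assumptions):

```python
import numpy as np

def perceptron_cost(w_new, w_old, x, y, eta=0.5):
    """U(w) = ||w(t+1) - w(t)||^2 + eta * (y(t) - w(t+1) . x(t))^2."""
    conservative = np.sum((w_new - w_old) ** 2)      # stay close to w(t)
    corrective = eta * (y - np.dot(w_new, x)) ** 2   # improve on the current example x(t)
    return conservative + corrective
```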

The Perceptron Algorithm
Intuition (cont'd)

• Take gradient w.r.t. w(t + 1) and set to 0:

  0 = 2 (wi(t + 1) − wi(t)) − 2η (y(t) − Σ_{i=1}^ℓ wi(t + 1) xi(t)) xi(t)

• Approximate with

  0 = 2 (wi(t + 1) − wi(t)) − 2η (y(t) − Σ_{i=1}^ℓ wi(t) xi(t)) xi(t),

  which yields

  wi(t + 1) = wi(t) + η (y(t) − Σ_{i=1}^ℓ wi(t) xi(t)) xi(t)

• Applying threshold to summation yields

  wi(t + 1) = wi(t) + η (y(t) − ŷ(t)) xi(t)

7

The Perceptron Algorithm
Miscellany

• If classes linearly separable, then by cycling through vectors, guaranteed to converge in finite number of steps

• For real-valued output, can replace threshold function on sum with

  – Identity function: f(x) = x

  – Sigmoid function: e.g. f(x) = 1/(1 + exp(−ax))

  – Hyperbolic tangent: e.g. f(x) = c tanh(ax)

8
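A sketch of the three smooth output functions listed above applied to the weighted sum (the function names and default values of the free constants a and c are assumptions):

```python
import numpy as np

def identity(s):
    return s

def sigmoid(s, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * s))   # output in (0, 1)

def hyperbolic_tangent(s, a=1.0, c=1.0):
    return c * np.tanh(a * s)             # output in (-c, c)
```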
Winnow/Exponentiated Gradient

[Figure: the same linear threshold unit as before — inputs 1, x1, . . . , xℓ, weights w0, w1, . . . , wℓ, sum Σi wi xi; output y(t) = 1 (ω1) if sum > 0, y(t) = 0 (ω2) otherwise; may also use +1 and −1]

• Same as Perceptron, but update weights:

  wi(t + 1) = wi(t) exp (−2η (ŷ(t) − y(t)) xi(t))

• If y(t), ŷ(t) ∈ {0, 1} ∀ t, then set η = (ln α)/2 (α > 1) and get Winnow:

  wi(t + 1) = wi(t)/α^{xi(t)}   if ŷ(t) = 1, y(t) = 0
              wi(t) α^{xi(t)}   if ŷ(t) = 0, y(t) = 1
              wi(t)             if ŷ(t) = y(t)

9

Winnow/Exponentiated Gradient
Intuition

• Measure distance in cost function with unnormalized relative entropy (first term conservative, second term corrective):

  U(w) = Σ_{i=1}^ℓ [wi(t) − wi(t + 1) + wi(t + 1) ln (wi(t + 1)/wi(t))] + η (y(t) − w(t + 1) · x(t))²

• Take gradient w.r.t. w(t + 1) and set to 0:

  0 = ln (wi(t + 1)/wi(t)) − 2η (y(t) − Σ_{i=1}^ℓ wi(t + 1) xi(t)) xi(t)

• Approximate with

  0 = ln (wi(t + 1)/wi(t)) − 2η (y(t) − Σ_{i=1}^ℓ wi(t) xi(t)) xi(t),

  which yields

  wi(t + 1) = wi(t) exp (−2η (ŷ(t) − y(t)) xi(t))

10
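A sketch of the multiplicative updates on slides 9–10, showing the general exponentiated-gradient form and the Winnow special case for 0/1 labels (function names and the threshold at 0 follow the earlier figure; they are assumptions where not explicit):

```python
import numpy as np

def eg_step(w, x, y, eta=0.5):
    """EG update: multiply each weight by exp(-2*eta*(y_hat - y)*x_i)."""
    y_hat = 1 if np.dot(w, x) > 0 else 0
    return w * np.exp(-2.0 * eta * (y_hat - y) * x)

def winnow_step(w, x, y, alpha=2.0):
    """Winnow: eta = ln(alpha)/2, so weights are multiplied or divided by alpha**x_i."""
    y_hat = 1 if np.dot(w, x) > 0 else 0
    if y_hat == 1 and y == 0:
        return w / alpha ** x     # demote on a false positive
    if y_hat == 0 and y == 1:
        return w * alpha ** x     # promote on a false negative
    return w                      # no change when the prediction is correct
```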

Winnow/Exponentiated Gradient
Negative Weights

• Winnow and EG update wts by multiplying by a pos const: impossible to change sign

  – Weight vectors restricted to one quadrant

• Solution: Maintain wt vectors w+(t) and w−(t)

  – Predict ŷ(t) = (w+(t) − w−(t)) · x(t)

  – Update:

    ri+(t) = exp (−2η (ŷ(t) − y(t)) xi(t) U)

    ri−(t) = 1/ri+(t)

    wi+(t + 1) = U · (wi+(t) ri+(t)) / Σ_{j=1}^ℓ (wj+(t) rj+(t) + wj−(t) rj−(t))

    U and denominator normalize wts for proof of error bound

• Kivinen & Warmuth, "Additive Versus Exponentiated Gradient Updates for Linear Prediction." Information and Computation, 132(1):1–64, Jan. 1997. [see web page]

11

Winnow/Exponentiated Gradient
Miscellany

• Winnow and EG are multiplicative weight update schemes versus additive weight update schemes, e.g. Perceptron

• Winnow and EG work well when most attributes (features) are irrelevant, i.e. optimal weight vector w* is sparse (many 0 entries)

• E.g. xi ∈ {0, 1}, x's are labelled by a monotone k-disjunction over ℓ attributes, k ≪ ℓ

  – Remaining ℓ − k attributes are irrelevant

  – E.g. x5 ∨ x9 ∨ x12, ℓ = 150, k = 3

  – For disjunctions, number of on-line prediction mistakes is O(k log ℓ) for Winnow and worst-case Ω(kℓ) for Perceptron

  – So in worst case, need exponentially fewer updates for training in Winnow than Perceptron

• Other bounds exist for real-valued inputs and outputs

12
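A sketch of the two-vector scheme on slide 11; the update for w−(t) is not written out there, so it is assumed here to mirror the w+(t) update via ri−(t) = 1/ri+(t), after Kivinen & Warmuth's EG± algorithm:

```python
import numpy as np

def eg_pm_step(w_pos, w_neg, x, y, eta=0.1, U=1.0):
    """EG with positive and negative weight vectors; total weight normalized to U."""
    y_hat = np.dot(w_pos - w_neg, x)                   # prediction uses w+ - w-
    r_pos = np.exp(-2.0 * eta * (y_hat - y) * x * U)   # per-coordinate factors
    r_neg = 1.0 / r_pos
    z = np.sum(w_pos * r_pos + w_neg * r_neg)          # shared normalizer
    return U * w_pos * r_pos / z, U * w_neg * r_neg / z
```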
Non-Linearly Separable Classes

• What if no hyperplane completely separates the classes?

[Figure: Class A and Class B examples that are not linearly separable; the optimal decision line is shown]

• Add extra inputs that are nonlinear combinations of original inputs (Section 4.14)

  – E.g. attribs. x1 and x2, so try

    x = [x1, x2, x1 x2, x1², x2², x1² x2, x1 x2², x1³, x2³]^T

  – Perhaps classes linearly separable in new feature space

  – Useful, especially with Winnow/EG logarithmic bounds

  – Kernel functions/SVMs

• Pocket algorithm (p. 63) guarantees convergence to a best hyperplane

• Winnow's & EG's agnostic results

• Least squares methods (Sec. 3.4)

• Networks of classifiers (Ch. 4)

13

Non-Linearly Separable Classes
Winnow's Agnostic Results

• Winnow's total number of prediction mistakes/loss (in on-line setting) provably not much worse than that of the best linear classifier

• Loss bound related to performance of best classifier and total distance under ‖·‖₁ that feature vectors must be moved to make best classifier perfect [Littlestone, COLT '91]

• Similar bounds for EG [Kivinen & Warmuth]

14
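A sketch of the cubic feature expansion suggested on slide 13 (the function name is an assumption):

```python
import numpy as np

def expand_features(x1, x2):
    """Map (x1, x2) to the nonlinear feature vector from slide 13."""
    return np.array([x1, x2, x1 * x2,
                     x1 ** 2, x2 ** 2,
                     x1 ** 2 * x2, x1 * x2 ** 2,
                     x1 ** 3, x2 ** 3])

# A linear classifier trained on expand_features(x1, x2) corresponds to a
# cubic decision surface in the original (x1, x2) space.
```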

Non-Linearly Separable Classes
Least Squares Methods

• Recall from Slide 7:

  wi(t + 1) = wi(t) + η (y(t) − Σ_{i=1}^ℓ wi(t) xi(t)) xi(t)
            = wi(t) + η (y(t) − w(t)^T · x(t)) xi(t)

• If we don't threshold dot product during training and allow η to vary each trial (i.e. substitute ηt), get* Eq. 3.38, p. 69:

  w(t + 1) = w(t) + ηt x(t) (y(t) − w(t)^T · x(t))

• This is the Least Mean Squares (LMS) Algorithm

• If e.g. ηt = 1/t, then

  lim_{t→∞} P[w(t) = w*] = 1,

  where

  w* = argmin_{w ∈ ℝ^ℓ} E[(y − w^T · x)²]

  is the vector minimizing mean square error (MSE)

* Note that here w(t) is the weight before trial t. In the book it is the weight after trial t.

15

Multiclass learning
Kessler's Construction

[Figure: three classes ω1, ω2, ω3 in the plane, each with its own decision line (ω1's line, ω2's line, ω3's line); the point [2, 2] belongs to class ω1]

• For* x = [2, 2, 1]^T of class ω1, want

  Σ_{i=1}^{ℓ+1} w1i xi > Σ_{i=1}^{ℓ+1} w2i xi  AND  Σ_{i=1}^{ℓ+1} w1i xi > Σ_{i=1}^{ℓ+1} w3i xi

* The extra 1 is added so the threshold can be placed in w.

16
Multiclass learning
Kessler's Construction (cont'd)

• So map x to

  x1 = [2, 2, 1, −2, −2, −1, 0, 0, 0]^T   (original block | negated block | zero padding)
  x2 = [2, 2, 1, 0, 0, 0, −2, −2, −1]^T

  (all labels = +1) and let

  w = [w11, w12, w10, w21, w22, w20, w31, w32, w30]^T   (w1 block | w2 block | w3 block)

• Now if w*^T · x1 > 0 and w*^T · x2 > 0, then

  Σ_{i=1}^{ℓ+1} w*1i xi > Σ_{i=1}^{ℓ+1} w*2i xi  AND  Σ_{i=1}^{ℓ+1} w*1i xi > Σ_{i=1}^{ℓ+1} w*3i xi

• In general, map (ℓ + 1) × 1 feature vector x to x1, . . . , xM−1, each of size (ℓ + 1)M × 1

• x ∈ ωi ⇒ x in ith block and −x in jth block (rest are 0s). Repeat for all j ≠ i

• Now train to find weights for new vector space via perceptron, Winnow, etc.

17

Multiclass learning
Error-Correcting Output Codes (ECOC)

• Since Win. & Percep. learn binary functions, learn individual bits of binary encoding of classes

• E.g. M = 4, so use two linear classifiers, one per bit of the encoding:

  Class   Classifier 1 bit   Classifier 2 bit
  ω1      0                  0
  ω2      0                  1
  ω3      1                  0
  ω4      1                  1

  and train simultaneously

• Problem: Sensitive to individual classifier errors, so use a set of encodings per class to improve robustness

• Similar to principle of error-correcting output codes used in communication networks [Dietterich & Bakiri, 1995]

• General-purpose, independent of learner

18
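A sketch of the mapping described on slides 16–17 (function name and zero-based class indices are assumptions; the example reproduces the x = [2, 2, 1], class ω1 case):

```python
import numpy as np

def kessler_expand(x, i, M):
    """Map an (l+1)-vector x of class omega_(i+1) to M-1 vectors of size (l+1)*M.

    Each expanded vector has x in block i, -x in some block j != i, and zeros
    elsewhere; all expanded vectors get label +1.
    """
    d = len(x)
    expanded = []
    for j in range(M):
        if j == i:
            continue
        z = np.zeros(d * M)
        z[i * d:(i + 1) * d] = x      # +x in the i-th block
        z[j * d:(j + 1) * d] = -x     # -x in the j-th block
        expanded.append(z)
    return expanded

# Example from slide 16: x = [2, 2, 1] in class omega_1 (index 0), M = 3 classes
x1, x2 = kessler_expand(np.array([2.0, 2.0, 1.0]), i=0, M=3)
# x1 = [2, 2, 1, -2, -2, -1, 0, 0, 0], x2 = [2, 2, 1, 0, 0, 0, -2, -2, -1]
```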

Topic summary due in 1 week!

19
