Perceptron Bound Proof

This document summarizes a lecture on the perceptron algorithm for machine learning. It provides reminders about homework due dates, then discusses the perceptron learning algorithm through examples and extensions. It analyzes the perceptron by defining concepts like geometric margin and linear separability. Finally, it states the perceptron mistake bound, which guarantees that the number of mistakes is at most the square of the radius divided by the margin, assuming the data are linearly separable with that margin.


10-607 Computational Foundations for Machine Learning

Machine Learning Department


School of Computer Science
Carnegie Mellon University

Perceptron Mistake Bound
Matt Gormley
Lecture 4
Oct. 31, 2018

1
Reminders
• Homework A:
– Out: Tue, Oct. 29
– Due: Wed, Nov. 7 at 11:59pm

2
Q&A
3
THE PERCEPTRON ALGORITHM

4
Perceptron Algorithm: Example
Example sequence (label; mistakes marked with ✗):
  (−1, 2)   −   ✗
  (1, 0)    +
  (1, 1)    +   ✗
  (−1, 0)   −
  (−1, −2)  −   ✗
  (1, −1)   +

Perceptron Algorithm (without the bias term):
§ Set t = 1, start with the all-zeroes weight vector w_1.
§ Given example x, predict positive iff w_t · x ≥ 0.
§ On a mistake, update as follows:
  • Mistake on positive: w_{t+1} ← w_t + x
  • Mistake on negative: w_{t+1} ← w_t − x

Resulting weight sequence:
  w_1 = (0, 0)
  w_2 = w_1 − (−1, 2) = (1, −2)
  w_3 = w_2 + (1, 1) = (2, −1)
  w_4 = w_3 − (−1, −2) = (3, 1)

Slide adapted from Nina Balcan
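A minimal Python trace of this example (my own sketch; it assumes the points are processed in the listed order, which reproduces the weight sequence shown above):

import numpy as np

# One pass of the perceptron (no bias term) over the example points above,
# predicting positive iff w . x >= 0 and updating only on mistakes.
points = [((-1, 2), -1), ((1, 0), +1), ((1, 1), +1),
          ((-1, 0), -1), ((-1, -2), -1), ((1, -1), +1)]
w = np.zeros(2)
for x, y in points:
    x = np.array(x, dtype=float)
    y_hat = +1 if w @ x >= 0 else -1
    if y_hat != y:          # mistake: the y*x update adds on positive, subtracts on negative
        w = w + y * x
        print("mistake on", x, "-> w =", w)
# Prints w = (1, -2), then (2, -1), then (3, 1), matching w_2, w_3, w_4 above.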


Background: Hyperplanes

Notation Trick: fold the bias b and the weights w into a single vector θ by
prepending a constant to x and increasing the dimensionality by one!

Hyperplane (Definition 1):
  H = {x : wᵀx = b}

Hyperplane (Definition 2): the same set, written in terms of the folded vector θ.

Half-spaces:
(Online) Perceptron Algorithm
Data: Inputs are continuous vectors of length M. Outputs are discrete.

Prediction: Output determined by hyperplane.
  ŷ = h_θ(x) = sign(θᵀx),  where sign(a) = +1 if a ≥ 0, and −1 otherwise.

Learning: Iterative procedure:
• initialize parameters to the vector of all zeroes
• while not converged
  • receive next example (x(i), y(i))
  • predict ŷ = h(x(i))
  • if positive mistake: add x(i) to parameters
  • if negative mistake: subtract x(i) from parameters
7
(Online) Perceptron Algorithm
Data: Inputs are continuous vectors of length M. Outputs are discrete.

Prediction: Output determined by hyperplane.
  ŷ = h_θ(x) = sign(θᵀx),  where sign(a) = +1 if a ≥ 0, and −1 otherwise.

Learning: Iterative procedure (same as the previous slide).

Implementation Trick: the single update θ ← θ + y(i) x(i) has the same behavior
as our "add on positive mistake and subtract on negative mistake" version,
because y(i) takes care of the sign.
8
(Batch) Perceptron Algorithm
Learning for Perceptron also works if we have a fixed training dataset, D. We call
this the "batch" setting in contrast to the "online" setting that we've discussed so far.

Algorithm 1 Perceptron Learning Algorithm (Batch)
1: procedure Perceptron(D = {(x(1), y(1)), . . . , (x(N), y(N))})
2:   θ ← 0                          ▷ Initialize parameters
3:   while not converged do
4:     for i ∈ {1, 2, . . . , N} do  ▷ For each example
5:       ŷ ← sign(θᵀ x(i))           ▷ Predict
6:       if ŷ ≠ y(i) then            ▷ If mistake
7:         θ ← θ + y(i) x(i)          ▷ Update parameters
8:   return θ
9
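Below is a runnable NumPy sketch of Algorithm 1. The function name, the max_epochs safeguard, and the {−1, +1} label convention are my additions, not from the slides:

import numpy as np

def perceptron_batch(X, y, max_epochs=100):
    """Cycle through the fixed dataset D = (X, y); on each mistake add y_i * x_i
    to the parameters. Stops once a full pass makes no mistakes (converged)."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if theta @ x_i >= 0 else -1   # predict positive iff theta . x >= 0
            if y_hat != y_i:
                theta = theta + y_i * x_i           # single update covers both mistake types
                mistakes += 1
        if mistakes == 0:
            break
    return theta

# Example usage on the toy data from the earlier slide:
X = np.array([[-1, 2], [1, 0], [1, 1], [-1, 0], [-1, -2], [1, -1]], dtype=float)
y = np.array([-1, 1, 1, -1, -1, 1])
print(perceptron_batch(X, y))   # [3. 1.] separates these points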
(Batch) Perceptron Algorithm
Learning for Perceptron also works if we have a fixed training
dataset, D. We call this the “batch” setting in contrast to the “online”
setting that we’ve discussed so far.

Discussion:
The Batch Perceptron Algorithm can be derived in two ways.
1. By extending the online Perceptron algorithm to the batch
setting (as mentioned above)
2. By applying Stochastic Gradient Descent (SGD) to minimize a
so-called Hinge Loss on a linear separator (a sketch of this derivation follows below)

10
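To make derivation 2 concrete (as flagged above), here is one common way to spell out the connection, assuming the per-example loss is taken to be ℓ(θ; x, y) = max(0, −y(θ · x)); the slide does not pin down the exact loss, so treat this as a sketch:

\[
\ell(\theta; x^{(i)}, y^{(i)}) = \max\bigl(0,\, -y^{(i)}(\theta \cdot x^{(i)})\bigr),
\qquad
\nabla_\theta \ell =
\begin{cases}
-\,y^{(i)} x^{(i)} & \text{if } y^{(i)}(\theta \cdot x^{(i)}) \le 0 \text{ (a mistake)},\\
0 & \text{otherwise,}
\end{cases}
\]
\[
\theta \leftarrow \theta - \eta\, \nabla_\theta \ell
= \theta + y^{(i)} x^{(i)} \quad (\eta = 1, \text{ on a mistake}),
\]

which is exactly the mistake-driven update in Algorithm 1 (at the kink y(θ · x) = 0 a subgradient is used).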
Extensions of Perceptron
• Voted Perceptron
– generalizes better than (standard) perceptron
– memory intensive (keeps around every weight vector seen during
training, so each one can vote)
• Averaged Perceptron
– empirically similar performance to voted perceptron
– can be implemented in a memory efficient way
(running averages are efficient)
• Kernel Perceptron
– Choose a kernel K(x’, x)
– Apply the kernel trick to Perceptron
– Resulting algorithm is still very simple
• Structured Perceptron
– Basic idea can also be applied when y ranges over an exponentially
large set
– Mistake bound does not depend on the size of that set

11
ANALYSIS OF PERCEPTRON

12
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance
from x to the plane w · x = 0 (or the negative if on the wrong side).

[Figure: margin of a positive example x_1 and of a negative example x_2 relative
to the separator w.]

Slide from Nina Balcan


Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance
from x to the plane w · x = 0 (or the negative if on the wrong side).
Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is
the smallest margin over points x ∈ S.

[Figure: positive and negative points on either side of the separator w, with the
set margin γ_w marked on both sides.]

Slide from Nina Balcan
Geometric Margin
Definition: The margin of example x w.r.t. a linear separator w is the distance
from x to the plane w · x = 0 (or the negative if on the wrong side).
Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is
the smallest margin over points x ∈ S.
Definition: The margin γ of a set of examples S is the maximum γ_w over all
linear separators w.

[Figure: the maximum-margin separator w for a linearly separable set, with the
margin γ marked on both sides.]

Slide from Nina Balcan
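Written as formulas, the three definitions above read as follows (a standard rendering with labels y ∈ {−1, +1}, not copied verbatim from the slides):

\[
\gamma_w(x) \;=\; \frac{y\,(w \cdot x)}{\lVert w \rVert},
\qquad
\gamma_w(S) \;=\; \min_{(x,y) \in S} \gamma_w(x),
\qquad
\gamma(S) \;=\; \max_{w \,:\, \lVert w \rVert = 1} \gamma_w(S).
\]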
Linear Separability
Def: For a binary classification problem, a set of examples 𝑆
is linearly separable if there exists a linear decision boundary
that can separate the points

[Figure: four example cases (Case 1 through Case 4) of positive and negative
points, some linearly separable and some not.]
16
Analysis: Perceptron
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of radius R,
then Perceptron makes ≤ (R/γ)² mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100,
doesn't change the number of mistakes; the algorithm is invariant to scaling.)

[Figure: separable data with margin γ inside a ball of radius R.]

Slide adapted from Nina Balcan
17
Analysis: Perceptron
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of radius R,
then Perceptron makes ≤ (R/γ)² mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100,
doesn't change the number of mistakes; the algorithm is invariant to scaling.)

Def: We say that the (batch) perceptron algorithm has converged if it stops
making mistakes on the training data (perfectly classifies the training data).

Main Takeaway: For linearly separable data, if the perceptron algorithm cycles
repeatedly through the data, it will converge in a finite # of steps.

Slide adapted from Nina Balcan
18
Analysis: Perceptron
Perceptron Mistake Bound
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x(i), y(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y(i)(θ* · x(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is
  k ≤ (R/γ)²

[Figure: separable data with margin γ inside a ball of radius R.]

Figure from Nina Balcan
19
Analysis: Perceptron
Proof of Perceptron Mistake Bound:

We will show that there exist constants A and B s.t.
  Ak ≤ ||θ(k+1)|| ≤ B√k

20
Analysis: Perceptron
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x(i), y(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y(i)(θ* · x(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is
  k ≤ (R/γ)²

[Figure: separable data with margin γ inside a ball of radius R.]

Algorithm 1 Perceptron Learning Algorithm (Online)
1: procedure Perceptron(D = {(x(1), y(1)), (x(2), y(2)), . . .})
2:   θ ← 0, k = 1                  ▷ Initialize parameters
3:   for i ∈ {1, 2, . . .} do      ▷ For each example
4:     if y(i)(θ(k) · x(i)) ≤ 0 then   ▷ If mistake
5:       θ(k+1) ← θ(k) + y(i) x(i)      ▷ Update parameters
6:       k ← k + 1
7: return θ

21
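A minimal Python sketch of this online version (the function name and the stream interface are my own; only the pseudocode above is from the slides):

import numpy as np

def perceptron_online(stream, dim):
    """Online perceptron as in Algorithm 1: update only when y * (theta . x) <= 0.
    `stream` is any iterable of (x, y) pairs with y in {-1, +1}; returns the final
    parameters and the number of mistakes k."""
    theta, k = np.zeros(dim), 0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        if y * (theta @ x) <= 0:      # mistake (points on the boundary count too)
            theta = theta + y * x     # update parameters
            k += 1
    return theta, k

# e.g. perceptron_online(zip(X, y), dim=2) with the toy data from the earlier slide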
Analysis: Perceptron
Chalkboard:
– Proof of Perceptron Mistake Bound

22
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 1: for some A, Ak ≤ ||θ(k+1)||

  θ(k+1) · θ* = (θ(k) + y(i) x(i)) · θ*
                  by Perceptron algorithm update
              = θ(k) · θ* + y(i)(θ* · x(i))
              ≥ θ(k) · θ* + γ
                  by assumption
  ⇒ θ(k+1) · θ* ≥ kγ
                  by induction on k, since θ(1) = 0
  ⇒ ||θ(k+1)|| ≥ kγ
                  since ||r|| ||m|| ≥ r · m and ||θ*|| = 1
                  (Cauchy-Schwarz inequality)

23
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 2: for some B, ||θ(k+1)|| ≤ B√k

  ||θ(k+1)||² = ||θ(k) + y(i) x(i)||²
                  by Perceptron algorithm update
              = ||θ(k)||² + (y(i))² ||x(i)||² + 2 y(i)(θ(k) · x(i))
              ≤ ||θ(k)||² + (y(i))² ||x(i)||²
                  since the kth mistake means y(i)(θ(k) · x(i)) ≤ 0
              ≤ ||θ(k)||² + R²
                  since (y(i))² ||x(i)||² = ||x(i)||² ≤ R² by assumption and (y(i))² = 1
  ⇒ ||θ(k+1)||² ≤ kR²
                  by induction on k, since ||θ(1)||² = 0
  ⇒ ||θ(k+1)|| ≤ √k R

24
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 3: Combining the bounds finishes the proof.

  kγ ≤ ||θ(k+1)|| ≤ √k R
  ⇒ k ≤ (R/γ)²

The total number of mistakes must be less than this.

25
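A quick empirical sanity check of the bound; the random dataset, the separator u, and the margin threshold below are constructed for the test and are not from the lecture:

import numpy as np

# Generate separable data with margin >= 0.1 w.r.t. a unit-norm separator u,
# run the perceptron to convergence, and compare the mistake count to (R/gamma)^2.
rng = np.random.default_rng(0)
u = np.array([1.0, 1.0]) / np.sqrt(2)
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ u) >= 0.1]            # keep only points with margin >= 0.1
y = np.sign(X @ u)

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ u))

theta, k = np.zeros(2), 0
for _ in range(500):                   # cycle through the data until no mistakes
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * (theta @ x_i) <= 0:   # mistake
            theta = theta + y_i * x_i
            k += 1
            mistakes += 1
    if mistakes == 0:
        break
print(k, "<=", (R / gamma) ** 2)       # observed mistakes vs. theoretical bound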
Analysis: Perceptron
What if the data is not linearly separable?

1. Perceptron will not converge in this case (it can't!)
2. However, Freund & Schapire (1999) show that by projecting the points
   (hypothetically) into a higher dimensional space, we can achieve a similar
   bound on the number of mistakes made on one pass through the sequence of
   examples

Excerpt from Freund & Schapire (1999), shown on the slide:

  Combining, gives
    √k R ≥ ∥v_{k+1}∥ ≥ v_{k+1} · u ≥ kγ,
  which implies k ≤ (R/γ)², proving the theorem. ✷

  3.2. Analysis for the inseparable case
  If the data are not linearly separable then Theorem 1 cannot be used directly.
  However, we now give a generalized version of the theorem which allows for some
  mistakes in the training set. As far as we know, this theorem is new, although
  the proof technique is very similar to that of Klasner and Simon (1995,
  Theorem 2.2). See also the recent work of Shawe-Taylor and Cristianini (1998)
  who used this technique to derive generalization error bounds for any large
  margin classifier.

  Theorem 2. Let ⟨(x₁, y₁), . . . , (x_m, y_m)⟩ be a sequence of labeled examples
  with ∥x_i∥ ≤ R. Let u be any vector with ∥u∥ = 1 and let γ > 0. Define the
  deviation of each example as
    d_i = max{0, γ − y_i (u · x_i)},
  and define D = √(Σ_{i=1}^m d_i²). Then the number of mistakes of the online
  perceptron algorithm on this sequence is bounded by
    ((R + D)/γ)².

  Proof: The case D = 0 follows from Theorem 1, so we can assume that D > 0.

26
Summary: Perceptron
• Perceptron is a linear classifier
• Simple learning algorithm: when a mistake is
made, add / subtract the features
• Perceptron will converge if the data are linearly
separable; it will not converge if the data are
linearly inseparable
• For linearly separable and inseparable data, we
can bound the number of mistakes (geometric
argument)
• Extensions support nonlinear separators and
structured prediction
27
