Perceptron Mistake Bound
Matt Gormley
Lecture 4
Oct. 31, 2018
Reminders
• Homework A:
– Out: Tue, Oct. 29
– Due: Wed, Nov. 7 at 11:59pm
Q&A
THE PERCEPTRON ALGORITHM
Perceptron Algorithm: Example
Sequence of examples (✗ = mistake, ✓ = correct prediction):
  (−1, 2), −   ✗
  (1, 0), +    ✓
  (1, 1), +    ✗
  (−1, 0), −   ✓
  (−1, −2), −  ✗
  (1, −1), +   ✓
Perceptron Algorithm (without the bias term):
§ Set t = 1, start with all-zeroes weight vector w_1.
§ Given example x, predict positive iff w_t ⋅ x ≥ 0.
§ On a mistake, update as follows:
  • Mistake on positive, update w_{t+1} ← w_t + x
  • Mistake on negative, update w_{t+1} ← w_t − x
Weight vector after each mistake:
  w_1 = (0, 0)
  w_2 = w_1 − (−1, 2) = (1, −2)
  w_3 = w_2 + (1, 1) = (2, −1)
  w_4 = w_3 − (−1, −2) = (3, 1)
Half-spaces: the separator w ⋅ x = 0 divides the input space into a positive half-space (w ⋅ x ≥ 0) and a negative half-space (w ⋅ x < 0).
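A minimal Python sketch (not from the slides) that replays the update trace above and confirms the final weights:

```python
# Replay the perceptron updates from the example above (labels in {-1, +1}).
import numpy as np

examples = [((-1, 2), -1), ((1, 0), +1), ((1, 1), +1),
            ((-1, 0), -1), ((-1, -2), -1), ((1, -1), +1)]

w = np.zeros(2)                       # w_1 = (0, 0)
for x, y in examples:
    x = np.array(x, dtype=float)
    y_hat = +1 if w @ x >= 0 else -1  # predict positive iff w . x >= 0
    if y_hat != y:                    # mistake: add (positive) or subtract (negative) x
        w = w + y * x
        print("mistake on", x, "->", w)

print(w)                              # [3. 1.], matching w_4 = (3, 1)
```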
(Online) Perceptron Algorithm
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
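A hedged Python sketch of the online perceptron described above, assuming labels in {−1, +1} and no bias term (the function name and interface are illustrative):

```python
# Sketch of the (online) perceptron without a bias term; labels assumed in {-1, +1}.
import numpy as np

def online_perceptron(stream, M):
    """Process (x, y) pairs one at a time and return the final weight vector."""
    w = np.zeros(M)                       # start with the all-zeroes weight vector
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        y_hat = +1 if w @ x >= 0 else -1  # predict positive iff w . x >= 0
        if y_hat != y:                    # on a mistake, add or subtract the features
            w = w + y * x
    return w
```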
(Batch) Perceptron Algorithm
Learning for Perceptron also works if we have a fixed training
dataset, D. We call this the “batch” setting in contrast to the “online”
setting that we’ve discussed so far.
Discussion:
The Batch Perceptron Algorithm can be derived in two ways.
1. By extending the online Perceptron algorithm to the batch
setting (as mentioned above)
2. By applying Stochastic Gradient Descent (SGD) to minimize a
so-called Hinge Loss on a linear separator
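A hedged sketch of derivation 1: cycle the online update over the fixed dataset D until a full pass makes no mistakes (the max_epochs guard is an addition, not from the slides). Derivation 2 yields the same per-mistake step: SGD with step size 1 on the per-example loss max(0, −y(w ⋅ x)) produces the same mistake-driven update, up to tie-breaking at w ⋅ x = 0.

```python
# Sketch of the batch perceptron: repeat the online pass over D until convergence.
# Assumes D is a list of (x, y) pairs with y in {-1, +1}; max_epochs is only a guard
# in case the data are not linearly separable (an addition, not from the slides).
import numpy as np

def batch_perceptron(D, M, max_epochs=1000):
    w = np.zeros(M)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in D:
            x = np.asarray(x, dtype=float)
            y_hat = +1 if w @ x >= 0 else -1
            if y_hat != y:
                w = w + y * x
                mistakes += 1
        if mistakes == 0:          # converged: the training data is perfectly classified
            return w
    return w
```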
Extensions of Perceptron
• Voted Perceptron
– generalizes better than (standard) perceptron
– memory intensive (keeps around every weight vector seen during
training, so each one can vote)
• Averaged Perceptron
– empirically similar performance to voted perceptron
– can be implemented in a memory-efficient way
(running averages are efficient; a sketch follows this list)
• Kernel Perceptron
– Choose a kernel K(x’, x)
– Apply the kernel trick to Perceptron
– Resulting algorithm is still very simple
• Structured Perceptron
– Basic idea can also be applied when y ranges over an exponentially
large set
– Mistake bound does not depend on the size of that set
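A hedged sketch of the memory-efficient Averaged Perceptron mentioned above, using a standard running-sum trick so that every intermediate weight vector contributes to the average without being stored (the epoch count and tie-breaking at the boundary are illustrative choices, not from the slides):

```python
# Averaged perceptron via a running-sum trick (illustrative sketch).
# Returns the average of the weight vectors used at every step, without storing them all.
import numpy as np

def averaged_perceptron(D, M, epochs=10):
    w = np.zeros(M)   # current weight vector
    u = np.zeros(M)   # running sum of (step index) * update, used to recover the average
    c = 1             # step counter
    for _ in range(epochs):
        for x, y in D:
            x = np.asarray(x, dtype=float)
            if y * (w @ x) <= 0:          # mistake (boundary treated as a mistake here)
                w = w + y * x
                u = u + y * c * x
            c += 1
    return w - u / c                      # averaged weights
```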
ANALYSIS OF PERCEPTRON
Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear sep. 𝑤 is the
distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0 (or the negative if on wrong side)
[Figure: examples x_1, x_2 and a linear separator w, with the margin of each example (its distance to the plane w ⋅ x = 0) marked. Slide from Nina Balcan.]
Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear sep. 𝑤 is the
distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0 (or the negative if on wrong side)
Definition: The margin γ_w of a set of examples S w.r.t. a linear
separator w is the smallest margin over points x ∈ S.
Definition: The margin γ of a set of examples S is the maximum γ_w
over all linear separators w.
[Figure: a linear separator w achieving the maximum margin γ on the point set. Slide from Nina Balcan.]
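In symbols, a standard way to write these definitions (assuming labels y ∈ {−1, +1}; this formulation is not copied from the slides):

```latex
% Margin of one example, of a set w.r.t. w, and of a set (best separator).
\[
  \gamma_w(x, y) = \frac{y \,(w \cdot x)}{\|w\|},
  \qquad
  \gamma_w(S) = \min_{(x,y) \in S} \gamma_w(x, y),
  \qquad
  \gamma(S) = \max_{w \neq 0} \gamma_w(S).
\]
```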
Linear Separability
Def: For a binary classification problem, a set of examples 𝑆
is linearly separable if there exists a linear decision boundary
that can separate the points
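Equivalently, in symbols (a standard formalization with labels y ∈ {−1, +1} and no bias term; not copied from the slide):

```latex
% Linear separability of a set S (no bias term); \text requires amsmath.
\[
  S \text{ is linearly separable}
  \;\iff\;
  \exists\, w \ \text{s.t.}\ y\,(w \cdot x) > 0 \ \ \forall\, (x, y) \in S.
\]
```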
Analysis: Perceptron
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of
radius R, then Perceptron makes at most (R/γ)² mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100,
doesn’t change the number of mistakes; algo is invariant to scaling.)
[Figure: linearly separable points with margin γ, all inside a ball of radius R. Slide adapted from Nina Balcan.]
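A small Python check of the scaling-invariance claim above, using the example points from earlier (the factor 100 matches the slide's example):

```python
# Scaling all points by a positive constant does not change the mistake count.
import numpy as np

def count_mistakes(X, y):
    w, mistakes = np.zeros(X.shape[1]), 0
    for x_i, y_i in zip(X, y):
        y_hat = +1 if w @ x_i >= 0 else -1
        if y_hat != y_i:
            w, mistakes = w + y_i * x_i, mistakes + 1
    return mistakes

X = np.array([[-1, 2], [1, 0], [1, 1], [-1, 0], [-1, -2], [1, -1]], dtype=float)
y = np.array([-1, +1, +1, -1, -1, +1])
print(count_mistakes(X, y), count_mistakes(100 * X, y))  # same count: 3 and 3
```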
Analysis: Perceptron
Perceptron Mistake Bound
Def: We say that the (batch) perceptron algorithm has
converged if it stops making mistakes on the training data
(perfectly classifies the training data).
Main Takeaway: For linearly separable data, if the
perceptron algorithm cycles repeatedly through the data,
it will converge in a finite # of steps.
[Figure: linearly separable points with margin γ, all inside a ball of radius R. Slide adapted from Nina Balcan.]
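A hedged Python sketch illustrating both the takeaway and the bound: generate linearly separable data with a known margin, cycle the perceptron until it converges, and compare the mistake count to (R/γ)². The data-generation scheme and all names are illustrative, not from the slides.

```python
# Illustrative check: on separable data, cycling converges and mistakes <= (R/gamma)^2.
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
w_star /= np.linalg.norm(w_star)             # unit-norm separator defining the labels

X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X @ w_star >= 0, 1, -1)
keep = np.abs(X @ w_star) >= 0.05            # enforce a margin by dropping near-boundary points
X, y = X[keep], y[keep]

R = np.max(np.linalg.norm(X, axis=1))        # radius of a ball containing all the points
gamma = np.min(y * (X @ w_star))             # margin achieved by w_star (>= 0.05 here)

w, mistakes, converged = np.zeros(2), 0, False
while not converged:                         # cycle repeatedly through the data
    converged = True
    for x_i, y_i in zip(X, y):
        y_hat = +1 if w @ x_i >= 0 else -1
        if y_hat != y_i:
            w = w + y_i * x_i
            mistakes += 1
            converged = False

print(mistakes, "<=", (R / gamma) ** 2)      # the theorem guarantees this inequality
```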
Analysis: Perceptron
Perceptron Mistake Bound
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and
   y^(i) (θ* ⋅ x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron
algorithm on this dataset is
   k ≤ (R/γ)²
[Figure from Nina Balcan: separable points with margin γ inside a ball of radius R.]
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
We will show that there exist constants A and B such that
   Ak ≤ ||θ^(k+1)|| ≤ B√k
Analysis: Perceptron
Theorem 0.1 (Block (1962), Novikoff (1962)), restated:
Given dataset D = {(x^(i), y^(i))}_{i=1}^N with finite size inputs
(||x^(i)|| ≤ R) and linearly separable data (∃ θ* s.t. ||θ*|| = 1 and
y^(i) (θ* ⋅ x^(i)) ≥ γ, ∀i), the number of mistakes made by the
Perceptron algorithm is k ≤ (R/γ)².
[Figure: the same margin/radius illustration as above.]
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 1: for some A, Ak ≤ ||θ^(k+1)||

θ^(k+1) ⋅ θ* = (θ^(k) + y^(i) x^(i)) ⋅ θ*
   by Perceptron algorithm update
 = θ^(k) ⋅ θ* + y^(i) (θ* ⋅ x^(i))
 ≥ θ^(k) ⋅ θ* + γ
   by assumption 2
⟹ θ^(k+1) ⋅ θ* ≥ kγ
   by induction on k, since θ^(1) = 0
⟹ ||θ^(k+1)|| ≥ kγ
   since ||r|| ||m|| ≥ r ⋅ m and ||θ*|| = 1 (Cauchy-Schwarz inequality)
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 2: for some B, ||θ^(k+1)|| ≤ B√k

||θ^(k+1)||² = ||θ^(k) + y^(i) x^(i)||²
   by Perceptron algorithm update
 = ||θ^(k)||² + (y^(i))² ||x^(i)||² + 2 y^(i) (θ^(k) ⋅ x^(i))
 ≤ ||θ^(k)||² + (y^(i))² ||x^(i)||²
   since the kth mistake implies y^(i) (θ^(k) ⋅ x^(i)) ≤ 0
 ≤ ||θ^(k)||² + R²
   since (y^(i))² ||x^(i)||² = ||x^(i)||² ≤ R² by assumption 1 and (y^(i))² = 1
⟹ ||θ^(k+1)||² ≤ kR²
   by induction on k, since ||θ^(1)||² = 0
⟹ ||θ^(k+1)|| ≤ √k R
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 3: Combining the bounds finishes the proof.
   kγ ≤ ||θ^(k+1)|| ≤ √k R
⟹ √k ≤ R/γ
⟹ k ≤ (R/γ)²
Combining the two bounds gives
   √k R ≥ ||v_{k+1}|| ≥ v_{k+1} ⋅ u ≥ kγ,
which implies k ≤ (R/γ)², proving the theorem. ∎
(Here v_{k+1} denotes the weight vector after the kth mistake and u a unit-norm separator.)
Analysis: Perceptron
Extension to data that need not be linearly separable: define the deviation of each example as
   d_i = max{0, γ − y_i (u ⋅ x_i)},
and define D = √(Σ_{i=1}^m d_i²). Then the number of mistakes of the online perceptron algorithm
on this sequence is bounded by
   ((R + D)/γ)².
Proof: The case D = 0 follows from Theorem 1, so we can assume that D > 0.
Summary: Perceptron
• Perceptron is a linear classifier
• Simple learning algorithm: when a mistake is
made, add / subtract the features
• Perceptron will converge if the data are linearly
separable; it will not converge if the data are
linearly inseparable
• For linearly separable and inseparable data, we
can bound the number of mistakes (geometric
argument)
• Extensions support nonlinear separators and
structured prediction