Perceptron Mistake Bound
Matt Gormley
Lecture 4
Oct. 31, 2018
Reminders
• Homework A:
– Out: Tue, Oct. 29
– Due: Wed, Nov. 7 at 11:59pm
Q&A
THE PERCEPTRON ALGORITHM
Perceptron Algorithm: Example
Sequence of examples (✗ = mistake, ✓ = correct prediction):
  (−1, 2), −   ✗
  (1, 0), +    ✓
  (1, 1), +    ✗
  (−1, 0), −   ✓
  (−1, −2), −  ✗
  (1, −1), +   ✓
Perceptron Algorithm (without the bias term):
§ Set t = 1, start with all-zeroes weight vector w_1.
§ Given example x, predict positive iff w_t ⋅ x ≥ 0.
§ On a mistake, update as follows:
  • Mistake on positive, update w_{t+1} ← w_t + x
  • Mistake on negative, update w_{t+1} ← w_t − x
Weight vector after each mistake:
  w_1 = (0, 0)
  w_2 = w_1 − (−1, 2) = (1, −2)
  w_3 = w_2 + (1, 1) = (2, −1)
  w_4 = w_3 − (−1, −2) = (3, 1)
Half-spaces: the separator w ⋅ x = 0 divides the input space into a positive half-space (w ⋅ x ≥ 0) and a negative half-space (w ⋅ x < 0).
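A minimal Python sketch (not from the slides) that replays the update trace above and confirms the final weights:

```python
# Replay the perceptron updates from the example above (labels in {-1, +1}).
import numpy as np

examples = [((-1, 2), -1), ((1, 0), +1), ((1, 1), +1),
            ((-1, 0), -1), ((-1, -2), -1), ((1, -1), +1)]

w = np.zeros(2)                       # w_1 = (0, 0)
for x, y in examples:
    x = np.array(x, dtype=float)
    y_hat = +1 if w @ x >= 0 else -1  # predict positive iff w . x >= 0
    if y_hat != y:                    # mistake: add (positive) or subtract (negative) x
        w = w + y * x
        print("mistake on", x, "->", w)

print(w)                              # [3. 1.], matching w_4 = (3, 1)
```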
(Online) Perceptron Algorithm
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
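A hedged Python sketch of the online perceptron described above, assuming labels in {−1, +1} and no bias term (the function name and interface are illustrative):

```python
# Sketch of the (online) perceptron without a bias term; labels assumed in {-1, +1}.
import numpy as np

def online_perceptron(stream, M):
    """Process (x, y) pairs one at a time and return the final weight vector."""
    w = np.zeros(M)                       # start with the all-zeroes weight vector
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        y_hat = +1 if w @ x >= 0 else -1  # predict positive iff w . x >= 0
        if y_hat != y:                    # on a mistake, add or subtract the features
            w = w + y * x
    return w
```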
(Batch) Perceptron Algorithm
Learning for Perceptron also works if we have a fixed training
dataset, D. We call this the “batch” setting in contrast to the “online”
setting that we’ve discussed so far.
Discussion:
The Batch Perceptron Algorithm can be derived in two ways.
1. By extending the online Perceptron algorithm to the batch
setting (as mentioned above)
2. By applying Stochastic Gradient Descent (SGD) to minimize a
so-called Hinge Loss on a linear separator
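A hedged sketch of derivation 1: cycle the online update over the fixed dataset D until a full pass makes no mistakes (the max_epochs guard is an addition, not from the slides). Derivation 2 yields the same per-mistake step: SGD with step size 1 on the per-example loss max(0, −y(w ⋅ x)) produces the same mistake-driven update, up to tie-breaking at w ⋅ x = 0.

```python
# Sketch of the batch perceptron: repeat the online pass over D until convergence.
# Assumes D is a list of (x, y) pairs with y in {-1, +1}; max_epochs is only a guard
# in case the data are not linearly separable (an addition, not from the slides).
import numpy as np

def batch_perceptron(D, M, max_epochs=1000):
    w = np.zeros(M)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in D:
            x = np.asarray(x, dtype=float)
            y_hat = +1 if w @ x >= 0 else -1
            if y_hat != y:
                w = w + y * x
                mistakes += 1
        if mistakes == 0:          # converged: the training data is perfectly classified
            return w
    return w
```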
Extensions of Perceptron
• Voted Perceptron
– generalizes better than (standard) perceptron
– memory intensive (keeps around every weight vector seen during
training, so each one can vote)
• Averaged Perceptron
– empirically similar performance to voted perceptron
– can be implemented in a memory-efficient way
(running averages are efficient; a sketch follows this list)
• Kernel Perceptron
– Choose a kernel K(x’, x)
– Apply the kernel trick to Perceptron
– Resulting algorithm is still very simple
• Structured Perceptron
– Basic idea can also be applied when y ranges over an exponentially
large set
– Mistake bound does not depend on the size of that set
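A hedged sketch of the memory-efficient Averaged Perceptron mentioned above, using a standard running-sum trick so that every intermediate weight vector contributes to the average without being stored (the epoch count and tie-breaking at the boundary are illustrative choices, not from the slides):

```python
# Averaged perceptron via a running-sum trick (illustrative sketch).
# Returns the average of the weight vectors used at every step, without storing them all.
import numpy as np

def averaged_perceptron(D, M, epochs=10):
    w = np.zeros(M)   # current weight vector
    u = np.zeros(M)   # running sum of (step index) * update, used to recover the average
    c = 1             # step counter
    for _ in range(epochs):
        for x, y in D:
            x = np.asarray(x, dtype=float)
            if y * (w @ x) <= 0:          # mistake (boundary treated as a mistake here)
                w = w + y * x
                u = u + y * c * x
            c += 1
    return w - u / c                      # averaged weights
```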
ANALYSIS OF PERCEPTRON
Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear sep. 𝑤 is the
distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0 (or the negative if on wrong side)
[Figure: examples x_1, x_2 and a linear separator w, with the margin of each example (its distance to the plane w ⋅ x = 0) marked. Slide from Nina Balcan.]
Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear sep. 𝑤 is the
distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0 (or the negative if on wrong side)
Definition: The margin γ_w of a set of examples S w.r.t. a linear
separator w is the smallest margin over points x ∈ S.
Definition: The margin γ of a set of examples S is the maximum γ_w
over all linear separators w.
[Figure: a linear separator w achieving the maximum margin γ on the point set. Slide from Nina Balcan.]
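In symbols, a standard way to write these definitions (assuming labels y ∈ {−1, +1}; this formulation is not copied from the slides):

```latex
% Margin of one example, of a set w.r.t. w, and of a set (best separator).
\[
  \gamma_w(x, y) = \frac{y \,(w \cdot x)}{\|w\|},
  \qquad
  \gamma_w(S) = \min_{(x,y) \in S} \gamma_w(x, y),
  \qquad
  \gamma(S) = \max_{w \neq 0} \gamma_w(S).
\]
```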
Linear Separability
Def: For a binary classification problem, a set of examples 𝑆
is linearly separable if there exists a linear decision boundary
that can separate the points
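Equivalently, in symbols (a standard formalization with labels y ∈ {−1, +1} and no bias term; not copied from the slide):

```latex
% Linear separability of a set S (no bias term); \text requires amsmath.
\[
  S \text{ is linearly separable}
  \;\iff\;
  \exists\, w \ \text{s.t.}\ y\,(w \cdot x) > 0 \ \ \forall\, (x, y) \in S.
\]
```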
Analysis: Perceptron
Perceptron Mistake Bound
Guarantee: If the data has margin γ and all points lie inside a ball of
radius R, then Perceptron makes at most (R/γ)² mistakes.
(Normalized margin: multiplying all points by 100, or dividing all points by 100,
doesn’t change the number of mistakes; algo is invariant to scaling.)
[Figure: linearly separable points with margin γ, all inside a ball of radius R. Slide adapted from Nina Balcan.]
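A small Python check of the scaling-invariance claim above, using the example points from earlier (the factor 100 matches the slide's example):

```python
# Scaling all points by a positive constant does not change the mistake count.
import numpy as np

def count_mistakes(X, y):
    w, mistakes = np.zeros(X.shape[1]), 0
    for x_i, y_i in zip(X, y):
        y_hat = +1 if w @ x_i >= 0 else -1
        if y_hat != y_i:
            w, mistakes = w + y_i * x_i, mistakes + 1
    return mistakes

X = np.array([[-1, 2], [1, 0], [1, 1], [-1, 0], [-1, -2], [1, -1]], dtype=float)
y = np.array([-1, +1, +1, -1, -1, +1])
print(count_mistakes(X, y), count_mistakes(100 * X, y))  # same count: 3 and 3
```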
Analysis: Perceptron
Perceptron Mistake Bound
Def: We say that the (batch) perceptron algorithm has
converged if it stops making mistakes on the training data
(perfectly classifies the training data).
Main Takeaway: For linearly separable data, if the
perceptron algorithm cycles repeatedly through the data,
it will converge in a finite # of steps.
[Figure: linearly separable points with margin γ, all inside a ball of radius R. Slide adapted from Nina Balcan.]
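A hedged Python sketch illustrating both the takeaway and the bound: generate linearly separable data with a known margin, cycle the perceptron until it converges, and compare the mistake count to (R/γ)². The data-generation scheme and all names are illustrative, not from the slides.

```python
# Illustrative check: on separable data, cycling converges and mistakes <= (R/gamma)^2.
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
w_star /= np.linalg.norm(w_star)             # unit-norm separator defining the labels

X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X @ w_star >= 0, 1, -1)
keep = np.abs(X @ w_star) >= 0.05            # enforce a margin by dropping near-boundary points
X, y = X[keep], y[keep]

R = np.max(np.linalg.norm(X, axis=1))        # radius of a ball containing all the points
gamma = np.min(y * (X @ w_star))             # margin achieved by w_star (>= 0.05 here)

w, mistakes, converged = np.zeros(2), 0, False
while not converged:                         # cycle repeatedly through the data
    converged = True
    for x_i, y_i in zip(X, y):
        y_hat = +1 if w @ x_i >= 0 else -1
        if y_hat != y_i:
            w = w + y_i * x_i
            mistakes += 1
            converged = False

print(mistakes, "<=", (R / gamma) ** 2)      # the theorem guarantees this inequality
```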
Analysis: Perceptron
Perceptron Mistake Bound
Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and
   y^(i) (θ* ⋅ x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron
algorithm on this dataset is
   k ≤ (R/γ)²
[Figure from Nina Balcan: separable points with margin γ inside a ball of radius R.]
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
We will show that there exist constants A and B such that
   Ak ≤ ||θ^(k+1)|| ≤ B√k
Analysis: Perceptron
Theorem 0.1 (Block (1962), Novikoff (1962)), restated:
Given dataset D = {(x^(i), y^(i))}_{i=1}^N with finite size inputs
(||x^(i)|| ≤ R) and linearly separable data (∃ θ* s.t. ||θ*|| = 1 and
y^(i) (θ* ⋅ x^(i)) ≥ γ, ∀i), the number of mistakes made by the
Perceptron algorithm is k ≤ (R/γ)².
[Figure: the same margin/radius illustration as above.]
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 1: for some A, Ak ≤ ||θ^(k+1)||

θ^(k+1) ⋅ θ* = (θ^(k) + y^(i) x^(i)) ⋅ θ*
   by Perceptron algorithm update
 = θ^(k) ⋅ θ* + y^(i) (θ* ⋅ x^(i))
 ≥ θ^(k) ⋅ θ* + γ
   by assumption 2
⟹ θ^(k+1) ⋅ θ* ≥ kγ
   by induction on k, since θ^(1) = 0
⟹ ||θ^(k+1)|| ≥ kγ
   since ||r|| ||m|| ≥ r ⋅ m and ||θ*|| = 1 (Cauchy-Schwarz inequality)
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 2: for some B, ||θ^(k+1)|| ≤ B√k

||θ^(k+1)||² = ||θ^(k) + y^(i) x^(i)||²
   by Perceptron algorithm update
 = ||θ^(k)||² + (y^(i))² ||x^(i)||² + 2 y^(i) (θ^(k) ⋅ x^(i))
 ≤ ||θ^(k)||² + (y^(i))² ||x^(i)||²
   since the kth mistake implies y^(i) (θ^(k) ⋅ x^(i)) ≤ 0
 ≤ ||θ^(k)||² + R²
   since (y^(i))² ||x^(i)||² = ||x^(i)||² ≤ R² by assumption 1 and (y^(i))² = 1
⟹ ||θ^(k+1)||² ≤ kR²
   by induction on k, since ||θ^(1)||² = 0
⟹ ||θ^(k+1)|| ≤ √k R
Analysis: Perceptron
Proof of Perceptron Mistake Bound:
Part 3: Combining the bounds finishes the proof.
   kγ ≤ ||θ^(k+1)|| ≤ √k R
⟹ √k ≤ R/γ
⟹ k ≤ (R/γ)²
Combining the two bounds gives
   √k R ≥ ||v_{k+1}|| ≥ v_{k+1} ⋅ u ≥ kγ,
which implies k ≤ (R/γ)², proving the theorem. ∎
(Here v_{k+1} denotes the weight vector after the kth mistake and u a unit-norm separator.)
Analysis: Perceptron
Extension to data that need not be linearly separable: define the deviation of each example as
   d_i = max{0, γ − y_i (u ⋅ x_i)},
and define D = √(Σ_{i=1}^m d_i²). Then the number of mistakes of the online perceptron algorithm
on this sequence is bounded by
   ((R + D)/γ)².
Proof: The case D = 0 follows from Theorem 1, so we can assume that D > 0.
Summary: Perceptron
• Perceptron is a linear classifier
• Simple learning algorithm: when a mistake is
made, add / subtract the features
• Perceptron will converge if the data are linearly
separable; it will not converge if the data are
linearly inseparable
• For linearly separable and inseparable data, we
can bound the number of mistakes (geometric
argument)
• Extensions support nonlinear separators and
structured prediction