

3 Perceptron Learning; Maximum Margin Classifiers

Perceptron Algorithm (cont’d)

Recall:
– linear decision fn f(x) = w · x (for simplicity, no α)
– decision boundary {x : f(x) = 0} (a hyperplane through the origin)
– sample points X1, X2, . . . , Xn ∈ R^d; classifications y1, . . . , yn = ±1
– goal: find weights w such that yi Xi · w ≥ 0
– goal, rewritten: find w that minimizes R(w) = Σ_{i∈V} −yi Xi · w   [risk function]
  where V is the set of indices i for which yi Xi · w < 0.
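[Not part of the original notes: a tiny NumPy sketch of this risk function, assuming the sample points are stored as an n × d array X and the labels as a length-n array y of ±1 values; all names are illustrative.]

import numpy as np

def perceptron_risk(w, X, y):
    """R(w) = sum over misclassified points of -y_i X_i . w (zero iff w separates the data)."""
    margins = y * (X @ w)          # y_i X_i . w for each sample point
    V = margins < 0                # V: the misclassified points
    return -margins[V].sum()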
[Our original problem was to find a separating hyperplane in one space, which I’ll call x-space. But we’ve
transformed this into a problem of finding an optimal point in a different space, which I’ll call w-space. It’s
important to understand transformations like this, where a geometric structure in one space becomes a point
in another space.]
Objects in x-space transform to objects in w-space:
    x-space                            w-space
    hyperplane: {z : w · z = 0}        point: w
    point: x                           hyperplane: {z : x · z = 0}
Point x lies on hyperplane {z : w · z = 0} ⇔ w · x = 0 ⇔ point w lies on hyperplane {z : x · z = 0} in w-space.
[So a hyperplane transforms to its normal vector. And a sample point transforms to the hyperplane whose
normal vector is the sample point.]
[In this case, the transformations happen to be symmetric: a hyperplane in x-space transforms to a point in
w-space the same way that a hyperplane in w-space transforms to a point in x-space. That won’t always be
true for the weight spaces we use this semester.]
If we want to enforce inequality x · w ≥ 0, that means
– in x-space, x should be on the same side of {z : z · w = 0} as w
– in w-space, w should be on the same side of {z : x · z = 0} as x

[Draw this by hand. xwspace.pdf shows x-space at left and w-space at right. Observe that the x-space sample points are the normal vectors for the w-space lines. We can choose w to be anywhere in the shaded region.]

[For a sample point x in class C, w and x must be on the same side of the hyperplane that x transforms into.
For a point x not in class C (marked by an X), w and x must be on opposite sides of the hyperplane that x
transforms into. These rules determine the shaded region above, in which w must lie.]
[Again, what have we accomplished? We have switched from the problem of finding a hyperplane in x-space
to the problem of finding a point in w-space. That’s a much better fit to how we think about optimization
algorithms.]

[Let’s take a look at the risk function these three sample points create.]

riskplot.pdf, riskiso.pdf [Plot & isocontours of risk R(w). Note how R’s creases match the dual chart above.]
[In this plot, we can choose w to be any point in the bottom pizza slice; all those points minimize R.]
[We have an optimization problem; we need an optimization algorithm to solve it.]
An optimization algorithm: gradient descent on R.
Given a starting point w, find gradient of R with respect to w; this is the direction of steepest ascent.
Take a step in the opposite direction. Recall [from your vector calculus class]
∇R(w) = [∂R/∂w1, ∂R/∂w2, . . . , ∂R/∂wd]⊤   and   ∇(z · w) = [z1, z2, . . . , zd]⊤ = z

∇R(w) = ∇ Σ_{i∈V} −yi Xi · w = − Σ_{i∈V} yi Xi

At any point w, we walk downhill in direction of steepest descent, −∇R(w).

w ← arbitrary nonzero starting point (good choice is any yi Xi)
while R(w) > 0
    V ← set of indices i for which yi Xi · w < 0
    w ← w + ε Σ_{i∈V} yi Xi
return w

ε > 0 is the step size aka learning rate, chosen empirically. [Best choice depends on input problem!]
[Show plot of R again. Draw the typical steps of gradient descent.]
Problem: Slow! Each step takes O(nd) time. [Can we improve this?]
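[Not in the original notes: a minimal NumPy sketch of the batch gradient descent loop above, under the same assumptions as the earlier sketch (an n × d array X, a length-n array y of ±1 labels); the step size eps and the iteration cap are illustrative choices.]

import numpy as np

def perceptron_risk_gd(X, y, eps=0.1, max_iters=1000):
    """Batch gradient descent on R(w); each pass costs O(nd) as noted above."""
    w = y[0] * X[0]                          # any yi Xi is a reasonable nonzero start
    for _ in range(max_iters):
        margins = y * (X @ w)                # yi Xi . w for every sample point
        V = margins < 0                      # V: the misclassified points
        if not V.any():                      # R(w) = 0, so w separates the data
            break
        # walk downhill: add eps times the sum of yi Xi over misclassified points
        w = w + eps * (y[V, None] * X[V]).sum(axis=0)
    return w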

Optimization algorithm 2: stochastic gradient descent

Idea: each step, pick one misclassified Xi;
do gradient descent on loss fn L(Xi · w, yi).
Called the perceptron algorithm. Each step takes O(d) time.
[Not counting the time to search for a misclassified Xi .]

while some yi Xi · w < 0
    w ← w + ε yi Xi
return w
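[Again not in the notes: a sketch of the same loop in NumPy, i.e. the perceptron algorithm, updating on one misclassified point at a time; the search for a misclassified point is done with a full scan here for simplicity, and the names are illustrative.]

import numpy as np

def perceptron(X, y, eps=1.0, max_updates=100000):
    """Perceptron algorithm: each update uses one misclassified point and costs O(d)."""
    w = y[0] * X[0]                              # nonzero starting point
    for _ in range(max_updates):
        bad = np.flatnonzero(y * (X @ w) < 0)    # misclassified points (search time not counted above)
        if bad.size == 0:                        # every yi Xi . w >= 0: done
            break
        i = bad[0]                               # pick one misclassified Xi
        w = w + eps * y[i] * X[i]                # O(d) update
    return w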

[Stochastic gradient descent is quite popular and we’ll see it several times more this semester, especially
for neural nets. However, stochastic gradient descent does not work for every problem that gradient descent
works for. The perceptron risk function happens to have special properties that guarantee that stochastic
gradient descent will always succeed.]
What if separating hyperplane doesn’t pass through origin?
Add a fictitious dimension. Decision fn is
f(x) = w · x + α = [w1  w2  α] · [x1  x2  1]⊤

Now we have sample points in R^{d+1}, all lying on the hyperplane x_{d+1} = 1.


Run perceptron algorithm in (d + 1)-dimensional space. [We are simulating a general hyperplane in
d dimensions by using a hyperplane through the origin in d + 1 dimensions.]
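[A sketch of the fictitious dimension trick, reusing the perceptron function from the previous sketch; the helper name is mine, not the notes’.]

import numpy as np

def perceptron_with_bias(X, y, eps=1.0):
    """Append a fictitious coordinate 1 to every point, then run the perceptron through the origin in d+1 dims."""
    X_lifted = np.hstack([X, np.ones((X.shape[0], 1))])   # every lifted point lies on x_{d+1} = 1
    w_lifted = perceptron(X_lifted, y, eps)               # hyperplane through the origin in R^{d+1}
    return w_lifted[:-1], w_lifted[-1]                    # (w, alpha), with decision fn f(x) = w . x + alpha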
[The perceptron algorithm was invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical Laboratory.
It was originally designed not to be a program, but to be implemented in hardware for image recognition on
a 20 × 20 pixel image. Rosenblatt built a Mark I Perceptron Machine that ran the algorithm, complete with
electric motors to do weight updates.]

Mark I perceptron.jpg (from Wikipedia, “Perceptron”) [The Mark I Perceptron Machine. This is what it took to process a 20 × 20 image in 1957.]

[Then he held a press conference where he predicted that perceptrons would be “the embryo of an electronic
computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of
its existence.” We’re still waiting on that.]
[One interesting aspect of the perceptron algorithm is that it’s an “online algorithm,” which means that if
new data points come in while the algorithm is already running, you can just throw them into the mix and
keep looping.]
Perceptron Convergence Theorem: If data is linearly separable, perceptron algorithm will find a linear
classifier that classifies all data correctly in at most O(R^2/γ^2) iterations, where R = max_i |Xi| is “radius of
data” and γ is the “maximum margin.”
[I’ll define “maximum margin” shortly.]
[We’re not going to prove this, because perceptrons are obsolete.]
[Although the step size/learning rate doesn’t appear in that big-O expression, it does have an effect on the
running time, but the effect is hard to characterize. The algorithm gets slower if ε is too small because it has
to take lots of steps to get down the hill. But it also gets slower if ε is too big for a different reason: it jumps
right over the region with zero risk and oscillates back and forth for a long time.]
[Although stochastic gradient descent is faster for this problem than gradient descent, the perceptron algo-
rithm is still slow. There’s no reliable way to choose a good step size ε. Fortunately, optimization algorithms
have improved a lot since 1957. You can get rid of the step size by using any decent modern “line search” al-
gorithm. Better yet, you can find a better decision boundary much more quickly by quadratic programming,
which is what we’ll talk about next.]

MAXIMUM MARGIN CLASSIFIERS

The margin of a linear classifier is the distance from the decision boundary to the nearest sample point.
What if we make the margin as wide as possible?

[Draw this by hand. maxmargin.pdf shows in-class points C and out-of-class points X, with the maximum margin slab bounded by the hyperplanes w · x + α = 1 and w · x + α = −1, and the decision boundary w · x + α = 0 running down its middle.]
We enforce the constraints

yi (w · Xi + α) ≥ 1 for i ∈ [1, n]

[Notice that the right-hand side is a 1, rather than a 0 as it was for the perceptron algorithm. It’s not obvious,
but this is a better way to formulate the problem, partly because it makes it impossible for the weight vector w
to get set to zero.]

Recall: if |w| = 1, signed distance from hyperplane to Xi is w · Xi + α.
Otherwise, it’s (w/|w|) · Xi + α/|w|. [We’ve normalized the expression to get a unit weight vector.]
Hence the margin is min_i (1/|w|) |w · Xi + α| ≥ 1/|w|. [We get the inequality by substituting the constraints.]
There is a slab of width 2/|w| containing no sample points [with the hyperplane running along its middle].
To maximize the margin, minimize |w|. Optimization problem:
Find w and α that minimize |w|^2
subject to yi (Xi · w + α) ≥ 1 for all i ∈ [1, n]
Called a quadratic program in d + 1 dimensions and n constraints.
It has one unique solution! [If the points are linearly separable; otherwise, it has no solution.]
[A reason we use |w|^2 as an objective function, instead of |w|, is that the length function |w| is not smooth at
zero, whereas |w|^2 is smooth everywhere. This makes optimization easier.]
The solution gives us a maximum margin classifier, aka a hard margin support vector machine (SVM).
[Technically, this isn’t really a support vector machine yet; it doesn’t fully deserve that name until we add
features and kernels, which we’ll do in later lectures.]
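[One way to sketch this quadratic program in code, using the cvxpy modeling library (my choice, not the notes’); it assumes the same X, y arrays as the earlier sketches and returns None values if the points are not linearly separable.]

import cvxpy as cp

def hard_margin_svm(X, y):
    """Find w, alpha minimizing |w|^2 subject to yi (Xi . w + alpha) >= 1 for all i."""
    n, d = X.shape
    w = cp.Variable(d)
    alpha = cp.Variable()
    constraints = [cp.multiply(y, X @ w + alpha) >= 1]
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, alpha.value      # w.value is None if the QP is infeasible (data not separable)

# The margin of the resulting classifier is 1/|w| (one over the Euclidean norm of w.value).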
[Let’s see what these constraints look like in weight space.]

weight3d.pdf, weightcross.pdf [This is an example of what the linear constraints look like
in the 3D weight space (w1, w2, α) for an SVM with three training points. The SVM is
looking for the point nearest the origin that lies above the blue plane (representing an in-
class training point) but below the red and pink planes (representing out-of-class training
points). In this example, that optimal point lies where the three planes intersect. At right
we see a 2D cross-section w1 = 1/17 of the 3D space, because the optimal solution lies in
this cross-section. The constraints say that the solution must lie in the leftmost pizza slice,
while being as close to the origin as possible, so the optimal solution is where the three
lines meet.]
