Perceptron Notes
Recall:
– linear decision fn f(x) = w · x (for simplicity, no α)
– decision boundary {x : f(x) = 0} (a hyperplane through the origin)
– sample points X1, X2, . . . , Xn ∈ ℝ^d; classifications y1, . . . , yn = ±1
– goal: find weights w such that yi Xi · w ≥ 0
– goal, rewritten: find w that minimizes R(w) = ∑_{i ∈ V} −yi Xi · w     [risk function]
  where V is the set of indices i for which yi Xi · w < 0.
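[A minimal numpy sketch of this risk function, to make it concrete. The function and the sample points below are hypothetical illustrations, not part of the original notes.]

    import numpy as np

    def risk(w, X, y):
        """Perceptron risk R(w) = sum over misclassified i of -y_i (X_i . w)."""
        s = (X @ w) * y          # s[i] = y_i (X_i . w)
        V = s < 0                # V = indices of misclassified points
        return -np.sum(s[V])

    # Hypothetical data: three points in R^2 with labels +/-1.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-0.5, 1.5]])
    y = np.array([1, 1, -1])
    print(risk(np.array([1.0, -1.0]), X, y))   # prints 1.0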
[Our original problem was to find a separating hyperplane in one space, which I’ll call x-space. But we’ve
transformed this into a problem of finding an optimal point in a different space, which I’ll call w-space. It’s
important to understand transformations like this, where a geometric structure in one space becomes a point
in another space.]
Objects in x-space transform to objects in w-space:
x-space                             w-space
hyperplane: {z : w · z = 0}         point: w
point: x                            hyperplane: {z : x · z = 0}
Point x lies on hyperplane {z : w · z = 0}  ⇔  w · x = 0  ⇔  point w lies on hyperplane {z : x · z = 0} in w-space.
[So a hyperplane transforms to its normal vector. And a sample point transforms to the hyperplane whose
normal vector is the sample point.]
[In this case, the transformations happen to be symmetric: a hyperplane in x-space transforms to a point in
w-space the same way that a hyperplane in w-space transforms to a point in x-space. That won’t always be
true for the weight spaces we use this semester.]
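[A tiny numeric check of the equivalence above; the vectors below are made up. Both questions are decided by the same dot product.]

    import numpy as np

    x = np.array([3.0, -2.0])   # a hypothetical sample point
    w = np.array([2.0, 3.0])    # a hypothetical weight vector

    d = np.dot(w, x)            # one number decides both questions
    print(d == 0.0)             # True: x lies on {z : w . z = 0} and w lies on {z : x . z = 0}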
If we want to enforce inequality x · w ≥ 0, that means
– in x-space, x should be on the same side of {z : z · w = 0} as w
– in w-space, w should be on the same side of {z : x · z = 0} as x
xwspace.pdf [Draw this by hand. Left panel: x-space; right panel: w-space. Observe that the x-space sample
points (classes C and X) are the normal vectors for the w-space lines. We can choose w to be anywhere in
the shaded region.]
[For a sample point x in class C, w and x must be on the same side of the hyperplane that x transforms into.
For a point x not in class C (marked by an X), w and x must be on opposite sides of the hyperplane that x
transforms into. These rules determine the shaded region above, in which w must lie.]
[Again, what have we accomplished? We have switched from the problem of finding a hyperplane in x-space
to the problem of finding a point in w-space. That’s a much better fit to how we think about optimization
algorithms.]
[Let’s take a look at the risk function these three sample points create.]
riskplot.pdf, riskiso.pdf [Plot & isocontours of risk R(w). Note how R’s creases match the dual chart above.]
[In this plot, we can choose w to be any point in the bottom pizza slice; all those points minimize R.]
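[For reference, a plot like riskplot.pdf/riskiso.pdf can be reproduced with a sketch along these lines; the three sample points here are hypothetical stand-ins, not the ones used in the original figure.]

    import numpy as np
    import matplotlib.pyplot as plt

    def risk(w, X, y):
        s = (X @ w) * y
        return -np.sum(s[s < 0])

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-0.5, 1.5]])   # hypothetical sample points
    y = np.array([1, 1, -1])

    w1, w2 = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
    R = np.array([[risk(np.array([a, b]), X, y) for a, b in zip(r1, r2)]
                  for r1, r2 in zip(w1, w2)])

    plt.contour(w1, w2, R, levels=20)   # isocontours; creases lie along {w : X_i . w = 0}
    plt.xlabel("w1"); plt.ylabel("w2")
    plt.show()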
[We have an optimization problem; we need an optimization algorithm to solve it.]
An optimization algorithm: gradient descent on R.
Given a starting point w, find gradient of R with respect to w; this is the direction of steepest ascent.
Take a step in the opposite direction. Recall [from your vector calculus class]
∇R(w) = [∂R/∂w1   ∂R/∂w2   · · ·   ∂R/∂wd]ᵀ      and      ∇(z · w) = [z1   z2   · · ·   zd]ᵀ = z
∇R(w) = ∇ ∑_{i ∈ V} −yi Xi · w = −∑_{i ∈ V} yi Xi
Gradient descent repeatedly takes a step w ← w − ε ∇R(w) = w + ε ∑_{i ∈ V} yi Xi.
ε > 0 is the step size aka learning rate, chosen empirically. [Best choice depends on input problem!]
[Show plot of R again. Draw the typical steps of gradient descent.]
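[A minimal sketch of this gradient descent loop, assuming a fixed step size ε and a simple stopping rule (stop once no point is misclassified). The data and the starting point are hypothetical.]

    import numpy as np

    def gradient_descent(X, y, eps=0.1, max_iters=1000):
        """Batch gradient descent on the perceptron risk R(w)."""
        w = y[0] * X[0].copy()                   # a nonzero starting point (here y_1 X_1)
        for _ in range(max_iters):
            V = (X @ w) * y < 0                  # currently misclassified points
            if not V.any():                      # R(w) = 0: all points classified correctly
                break
            w = w + eps * (y[V][:, None] * X[V]).sum(axis=0)   # w <- w - eps * grad R(w)
        return w

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-0.5, 1.5]])   # hypothetical data
    y = np.array([1, 1, -1])
    print(gradient_descent(X, y))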
Problem: Slow! Each step takes O(nd) time. [Can we improve this?]
[Stochastic gradient descent is quite popular and we’ll see it several times more this semester, especially
for neural nets. However, stochastic gradient descent does not work for every problem that gradient descent
works for. The perceptron risk function happens to have special properties that guarantee that stochastic
gradient descent will always succeed.]
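[For comparison, a sketch of the stochastic version: each step uses a single misclassified point, so one step costs O(d) rather than O(nd). The step size, stopping rule, and data below are hypothetical.]

    import numpy as np

    def perceptron_sgd(X, y, eps=0.3, max_epochs=1000):
        """Stochastic gradient descent on the perceptron risk: one misclassified point per step."""
        w = y[0] * X[0].copy()                  # a nonzero starting point
        for _ in range(max_epochs):
            updated = False
            for Xi, yi in zip(X, y):
                if yi * (Xi @ w) < 0:           # Xi is misclassified
                    w = w + eps * yi * Xi       # step for this one point only
                    updated = True
            if not updated:                     # a full pass with no mistakes: done
                break
        return w

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-0.5, 1.5]])   # hypothetical data
    y = np.array([1, 1, -1])
    print(perceptron_sgd(X, y))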
What if separating hyperplane doesn’t pass through origin?
Add a fictitious dimension. Decision fn is
f(x) = w · x + α = [w1   w2   α] · [x1   x2   1]ᵀ
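[In code, the fictitious dimension is just an extra feature that is always 1, so α rides along as the last component of the weight vector; a small numpy sketch with made-up numbers:]

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.0], [-0.5, 1.5]])    # hypothetical points in R^2
    X_lifted = np.hstack([X, np.ones((X.shape[0], 1))])    # append the fictitious 1 to each point

    w_and_alpha = np.array([1.0, -1.0, 0.5])               # last entry plays the role of alpha
    print(X_lifted @ w_and_alpha)                          # equals w . x + alpha for each point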
[When Frank Rosenblatt invented the perceptron in 1957, he held a press conference where he predicted that
perceptrons would be “the embryo of an electronic computer that [the Navy] expects will be able to walk,
talk, see, write, reproduce itself and be conscious of its existence.” We’re still waiting on that.]
[One interesting aspect of the perceptron algorithm is that it’s an “online algorithm,” which means that if
new data points come in while the algorithm is already running, you can just throw them into the mix and
keep looping.]
Perceptron Convergence Theorem: If data is linearly separable, perceptron algorithm will find a linear
classifier that classifies all data correctly in at most O(R² / γ²) iterations, where R = max |Xi| is the “radius of
the data” and γ is the “maximum margin.”
[I’ll define “maximum margin” shortly.]
[We’re not going to prove this, because perceptrons are obsolete.]
[Although the step size/learning rate doesn’t appear in that big-O expression, it does have an effect on the
running time, but the effect is hard to characterize. The algorithm gets slower if ε is too small because it has
to take lots of steps to get down the hill. But it also gets slower if ε is too big for a different reason: it jumps
right over the region with zero risk and oscillates back and forth for a long time.]
[Although stochastic gradient descent is faster for this problem than gradient descent, the perceptron algo-
rithm is still slow. There’s no reliable way to choose a good step size ε. Fortunately, optimization algorithms
have improved a lot since 1957. You can get rid of the step size by using any decent modern “line search” al-
gorithm. Better yet, you can find a better decision boundary much more quickly by quadratic programming,
which is what we’ll talk about next.]
The margin of a linear classifier is the distance from the decision boundary to the nearest sample point.
What if we make the margin as wide as possible?
maxmargin.pdf [Draw this by hand. Sample points in classes C and X, the decision boundary w · x + α = 0,
and the margin hyperplanes w · x + α = 1 and w · x + α = −1.]
We enforce the constraints
yi (w · Xi + α) ≥ 1    for i ∈ [1, n]
[Notice that the right-hand side is a 1, rather than a 0 as it was for the perceptron algorithm. It’s not obvious,
but this is a better way to formulate the problem, partly because it makes it impossible for the weight vector w
to get set to zero.]
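[A small sketch of what these constraints look like in code: given a candidate (w, α) and hypothetical data, check whether every sample point satisfies yi (w · Xi + α) ≥ 1.]

    import numpy as np

    def satisfies_margin_constraints(w, alpha, X, y):
        """True iff y_i (w . X_i + alpha) >= 1 for every sample point."""
        margins = y * (X @ w + alpha)
        return bool(np.all(margins >= 1))

    # Hypothetical data and a candidate classifier.
    X = np.array([[3.0, 3.0], [4.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
    y = np.array([1, 1, -1, -1])
    print(satisfies_margin_constraints(np.array([0.5, 0.5]), -1.5, X, y))   # True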
weight3d.pdf, weightcross.pdf [This is an example of what the linear constraints look like
in the 3D weight space (w1, w2, α) for an SVM with three training points. The SVM is
looking for the point nearest the origin that lies above the blue plane (representing an in-
class training point) but below the red and pink planes (representing out-of-class training
points). In this example, that optimal point lies where the three planes intersect. At right
we see a 2D cross-section w1 = 1/17 of the 3D space, because the optimal solution lies in
this cross-section. The constraints say that the solution must lie in the leftmost pizza slice,
while being as close to the origin as possible, so the optimal solution is where the three
lines meet.]
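[To make this picture concrete, here is a sketch (not the method used in these notes) that hands the three linear constraints to scipy’s general-purpose SLSQP solver and, following the caption’s description literally, finds the feasible point in (w1, w2, α)-space nearest the origin. The three training points are hypothetical, not the ones behind weight3d.pdf.]

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical training points: one in class C, two not.
    X = np.array([[2.0, 1.0], [-1.0, 2.0], [0.5, -2.0]])
    y = np.array([1, -1, -1])

    # Each point contributes one linear constraint y_i (w . X_i + alpha) >= 1
    # on the 3D weight vector v = (w1, w2, alpha).
    cons = [{"type": "ineq",
             "fun": lambda v, Xi=Xi, yi=yi: yi * (Xi @ v[:2] + v[2]) - 1.0}
            for Xi, yi in zip(X, y)]

    # Nearest feasible point to the origin in (w1, w2, alpha)-space.
    res = minimize(lambda v: v @ v, x0=np.array([1.0, 1.0, 0.0]),
                   constraints=cons, method="SLSQP")
    print(res.x)   # an optimal (w1, w2, alpha)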