
The Perceptron

The perceptron implements a binary classifier $f : \mathbb{R}^D \to \{+1, -1\}$ with a linear decision surface through the origin:
\[
f(x) = \mathrm{step}(\theta^\top x),
\tag{1}
\]
where
\[
\mathrm{step}(z) =
\begin{cases}
+1 & \text{if } z \ge 0 \\
-1 & \text{otherwise.}
\end{cases}
\]
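In code, the decision rule in (1) is a dot product followed by a threshold. A minimal NumPy sketch (step, predict, theta and x are illustrative names, not part of the original notes):

import numpy as np

def step(z):
    # +1 if z >= 0, -1 otherwise (applied elementwise)
    return np.where(z >= 0, 1, -1)

def predict(theta, x):
    # f(x) = step(theta^T x); x may be a single D-vector or an (N, D) matrix
    return step(x @ theta)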

Using the zero-one loss
\[
L(y, f(x)) =
\begin{cases}
0 & \text{if } y = f(x) \\
1 & \text{otherwise,}
\end{cases}
\]
the empirical risk of the perceptron on training data $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is just the number of misclassified examples:
\[
R_{\mathrm{emp}}(\theta) = \sum_{i \in \{1, 2, \ldots, N\} \,:\, y_i \neq \mathrm{step}(\theta^\top x_i)} 1.
\]
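Counting misclassifications is then a one-line check. A small sketch, reusing the hypothetical step and predict from above, with X an (N, D) matrix of inputs and y the vector of ±1 labels:

def zero_one_risk(theta, X, y):
    # number of training examples whose prediction disagrees with the label
    return int(np.sum(predict(theta, X) != y))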

The problem with this is that $R_{\mathrm{emp}}(\theta)$ is not differentiable in $\theta$, so we cannot do gradient descent to learn $\theta$.
To circumvent this, we use the modified empirical loss
\[
R_{\mathrm{emp}}(\theta) = \sum_{i \in \{1, 2, \ldots, N\} \,:\, y_i \neq \mathrm{step}(\theta^\top x_i)} - y_i \theta^\top x_i.
\tag{2}
\]
This just says that correctly classified examples don't incur any loss at all, while incorrectly classified examples contribute $-y_i \theta^\top x_i$, which is some sort of measure of confidence in the (incorrect) labeling.¹
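Spelled out, (2) sums the quantities $-y_i \theta^\top x_i$ over the currently misclassified examples only. A sketch under the same assumptions as the snippets above:

def modified_risk(theta, X, y):
    # eq. (2): sum of -y_i * theta^T x_i over the misclassified examples
    scores = X @ theta
    wrong = step(scores) != y
    return float(np.sum(-y[wrong] * scores[wrong]))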
We can now use gradient descent to learn $\theta$. Starting from an arbitrary $\theta^{(0)}$, we update our parameter vector according to
\[
\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_\theta R_{\mathrm{emp}}(\theta)\big|_{\theta^{(t)}},
\]
where $\eta$, called the learning rate, is a parameter of our choosing. The gradient of (2) is again a sum over the misclassified examples:
\[
\nabla_\theta R_{\mathrm{emp}}(\theta) = \sum_{i \in \{1, 2, \ldots, N\} \,:\, y_i \neq \mathrm{step}(\theta^\top x_i)} - y_i x_i.
\]

¹ A slightly more principled way to look at this is to derive this modified risk from the hinge loss $L(y, \theta^\top x) = \max(0, -y\, \theta^\top x)$.

If we let $M \subseteq S$ be the set of training examples misclassified by $\theta^{(t)}$, the update rule can be written very simply as
\[
\theta^{(t+1)} = \theta^{(t)} + \eta \sum_{(x_i, y_i) \in M} y_i x_i.
\]
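Putting the pieces together gives a batch training loop: find the examples misclassified by the current $\theta$, then move $\theta$ toward correctly classifying them. A hedged sketch (the stopping rule, learning rate and iteration cap are illustrative choices, not from the notes):

def train_batch_perceptron(X, y, eta=1.0, n_iters=100):
    # X: (N, D) inputs, y: (N,) labels in {+1, -1}
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        wrong = predict(theta, X) != y
        if not wrong.any():
            break  # all training examples classified correctly
        # theta^(t+1) = theta^(t) + eta * sum of y_i x_i over misclassified examples
        theta = theta + eta * (y[wrong] @ X[wrong])
    return theta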

One issue that remains is how to implement a bias term, generalizing to linear classifiers that do not necessarily cross the origin:
\[
f(x) = \mathrm{step}(\theta_0 + \theta^\top x).
\tag{3}
\]
The simplest solution to this is to append a constant (0th) element 1 to each input vector and incorporate $\theta_0$ in $\theta$. This reduces (3) to the original (1), except that the dimensionality of all the vectors has increased by one.
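In code the augmentation is one extra column of ones (augment_with_bias is an illustrative name, not from the notes):

def augment_with_bias(X):
    # prepend a constant 1 to every input vector, so theta_0 becomes theta[0]
    return np.hstack([np.ones((X.shape[0], 1)), X])

After augmenting the training and test inputs in the same way, the earlier predict and training loop learn the bias term with no further changes.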

On-line perceptron (not examinable)


What we described above is the batch perceptron. The perceptron has a more prominent role in the world of online learning [1]. In online learning there is no distinction between the training set and the test set. The input is a continuous stream of examples, and the algorithm has to make a prediction immediately after $x_i$ arrives. Before the next example arrives, the true label $y_i$ is presented, and the algorithm can update its internal parameters to reflect what it has learnt from its success or failure in predicting $y_i$.
The online perceptron is about as simple as a learning algorithm gets:
w = 0
for i = 1 to m
    predict y = step(w * x_i)
    if (y = -1 and y_i = 1) w = w + x_i
    if (y = 1 and y_i = -1) w = w - x_i
end
(note that w and x_i are vectors and * is the dot product). Remarkably, it is
still a powerful learning algorithm. It is possible to prove that, provided the data lies within a ball of radius $R$ centered on the origin and is separable with margin $\gamma$ (i.e. there exists a separating hyperplane with normal vector $w$ such that $|\,w^\top x_i\,| / \|w\| \ge \gamma$ for all examples), the online perceptron will make no more than $\lceil R^2 / \gamma^2 \rceil$ errors, regardless of the number of examples.
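The pseudocode above maps almost line for line onto NumPy. A runnable sketch of the online loop under the same illustrative naming as earlier, with the stream simulated by iterating over a fixed array:

import numpy as np

def online_perceptron(X, y):
    # X: (m, D) stream of inputs, y: (m,) true labels in {+1, -1}
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_i, y_i in zip(X, y):
        y_hat = 1 if w @ x_i >= 0 else -1   # predict before the label is revealed
        if y_hat != y_i:
            w = w + y_i * x_i               # w += x_i when y_i = +1, w -= x_i when y_i = -1
            mistakes += 1
    return w, mistakes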

References
[1] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.
