0% found this document useful (0 votes)

54 views19 pages

Large Margin Classification Using The Perceptron Algorithm: Machine Learning, 37 (3) :277-296, 1999

This document summarizes a new machine learning algorithm called the voted-perceptron algorithm. It combines Rosenblatt's perceptron algorithm with Helmbold and Warmuth's leave-one-out method. Like support vector machines, it takes advantage of data that is linearly separable with large margins. However, it is simpler to implement and more efficient computationally. The authors analyze the algorithm's generalization error and show it performs comparably to support vector machines on classifying handwritten digits, while being faster.

Uploaded by

Saiful Nur Budiman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

54 views19 pages

Large Margin Classification Using The Perceptron Algorithm: Machine Learning, 37 (3) :277-296, 1999

Uploaded by

Saiful Nur Budiman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

Machine Learning, 37(3):277-296, 1999.

Large Margin Classification

Using the Perceptron Algorithm

[email protected]
AT&T Labs, Shannon Laboratory, 180 Park Avenue, Room A205, Florham Park, NJ 07932-0971

[email protected]
AT&T Labs, Shannon Laboratory, 180 Park Avenue, Room A279, Florham Park, NJ 07932-0971

Abstract. We introduce and analyze a new algorithm for linear classification which combines Rosenblatts
perceptron algorithm with Helmbold and Warmuths leave-one-out method. Like Vapniks maximal-margin classifier, our algorithm takes advantage of data that are linearly separable with large margins. Compared to Vapniks
algorithm, however, ours is much simpler to implement, and much more efficient in terms of computation time.
We also show that our algorithm can be efficiently used in very high dimensional spaces using kernel functions.
We performed some experiments using our algorithm, and some variants of it, for classifying images of handwritten digits. The performance of our algorithm is close to, but not as good as, the performance of maximal-margin
classifiers on the same problem, while saving significantly on computation time and programming effort.

Introduction

One of the most influential developments in the theory of machine learning in the last few
years is Vapniks work on support vector machines (SVM) (Vapnik, 1982). Vapniks analysis suggests the following simple method for learning complex binary classifiers. First,

use some fixed mapping to map the instances into some very high dimensional space
in which the two classes are linearly separable. Then use quadratic programming to find
the vector that classifies all the data correctly and maximizes the margin, i.e., the minimal
distance between the separating hyperplane and the instances.
There are two main contributions of his work. The first is a proof of a new bound on the
difference between the training error and the test error of a linear classifier that maximizes
the margin. The significance of this bound is that it depends only on the size of the margin
(or the number of support vectors) and not on the dimension. It is superior to the bounds
that can be given for arbitrary consistent linear classifiers.
The second contribution is a method for computing the maximal-margin classifier efficiently for some specific high dimensional mappings. This method is based on the idea of
kernel functions, which are described in detail in Section 4.
The main part of algorithms for finding the maximal-margin classifier is a computation
of a solution for a large quadratic program. The constraints in the program correspond to
the training examples so their number can be very large. Much of the recent practical work
on support vector machines is centered on finding efficient ways of solving these quadratic
programming problems.
In this paper, we introduce a new and simpler algorithm for linear classification which
takes advantage of data that are linearly separable with large margins. We named the
new algorithm the voted-perceptron algorithm. The algorithm is based on the well known

perceptron algorithm of Rosenblatt (1958, 1962) and a transformation of online learning algorithms to batch learning algorithms developed by Helmbold and Warmuth (1995).
Moreover, following the work of Aizerman, Braverman and Rozonoer (1964), we show
that kernel functions can be used with our algorithm so that we can run our algorithm efficiently in very high dimensional spaces. Our algorithm and its analysis involve little more
than combining these three known methods. On the other hand, the resulting algorithm is
very simple and easy to implement, and the theoretical bounds on the expected generalization error of the new algorithm are almost identical to the bounds for SVMs given by
Vapnik and Chervonenkis (1974) in the linearly separable case.
We repeated some of the experiments performed by Cortes and Vapnik (1995) on the
use of SVM on the problem of classifying handwritten digits. We tested both the votedperceptron algorithm and a variant based on averaging rather than voting. These experiments indicate that the use of kernel functions with the perceptron algorithm yields a
dramatic improvement in performance, both in test accuracy and in computation time.
In addition, we found that, when training time is limited, the voted-perceptron algorithm
performs better than the traditional way of using the perceptron algorithm (although all
methods converge eventually to roughly the same level of performance).
Recently, Friess, Cristianini and Campbell (1998) have experimented with a different
online learning algorithm called the adatron. This algorithm was suggested by Anlauf and
Biehl (1989) as a method for calculating the largest margin classifier (also called the maximally stable perceptron). They proved that their algorithm converges asymptotically to
the correct solution.
Our paper is organized as follows. In Section 2, we describe the voted perceptron algorithm. In Section 3, we derive upper bounds on the expected generalization error for
both the linearly separable and inseparable cases. In Section 4, we review the method of
kernels and describe how it is used in our algorithm. In Section 5, we summarize the results of our experiments on the handwritten digit recognition problem. We conclude with
Section 6 in which we summarize our observations on the relations between the theory and
the experiments and suggest some new open problems.
2.

The Algorithm

We assume that all instances are points x . We use ! x ! to denote the Euclidean length
of x. For most of the paper, we assume that labels " are in #$&%')(*%+ .
The basis of our study is the classical perceptron algorithm invented by Rosenblatt (1958,
1962). This is a very simple algorithm most naturally studied in the online learning model.
The online perceptron algorithm starts with an initial zero prediction vector ,.-0/ . It
1 - sign 3,54 x6 . If this prediction differs from
predicts the label of a new instance x to be "2
the label " , it updates the prediction vector to ,-7,8(9" x. If the prediction is correct then
, is not changed. The process then repeats with the next example.
The most common way the perceptron algorithm is used for learning from a batch of
training examples is to run the algorithm repeatedly through the training set until it finds
a prediction vector which is correct on all of the training set. This prediction rule is then
used for predicting the labels on the test set.

Block (1962), Novikoff (1962) and Minsky and Papert (1969) have shown that if the
data are linearly separable, then the perceptron algorithm will make a finite number of
mistakes, and therefore, if repeatedly cycled through the training set, will converge to a
vector which correctly classifies all of the examples. Moreover, the number of mistakes is
upper bounded by a function of the gap between the positive and negative examples, a fact
that will be central to our analysis.
In this paper, we propose to use a more sophisticated method of applying the online
perceptron algorithm to batch learning, namely, a variation of the leave-one-out method of
Helmbold and Warmuth (1995). In the voted-perceptron algorithm, we store more information during training and then use this elaborate information to generate better predictions
on the test data. The algorithm is detailed in Figure 1. The information we maintain during
training is the list of all prediction vectors that were generated after each and every mistake. For each such vector, we count the number of iterations it survives until the next
mistake is made; we refer to this count as the weight of the prediction vector. To calculate a prediction we compute the binary prediction of each one of the prediction vectors
and combine all these predictions by a weighted majority vote. The weights used are the
survival times described above. This makes intuitive sense as good prediction vectors
tend to survive for a long time and thus have larger weight in the majority vote.
3.

Analysis

In this section, we give an analysis of the voted-perceptron algorithm for the case - % in
which the algorithm runs exactly once through the training data. We also quote a theorem
of Vapnik and Chervonenkis (1974) for the linearly separable case. This theorem bounds
the generalization error of the consistent perceptron found after the perceptron algorithm is
run to convergence. Interestingly, for the linearly separable case, the theorems yield very
similar bounds.
As we shall see in the experiments, the algorithm actually continues to improve performance after - % . We have no theoretical explanation for this improvement.
If the data are linearly separable, then the perceptron algorithm will eventually converge
on some consistent hypothesis (i.e., a prediction vector that is correct on all of the training
examples). As this prediction vector makes no further mistakes, it will eventually dominate the weighted vote in the voted-perceptron algorithm. Thus, for linearly separable
data, when , the voted-perceptron algorithm converges to the regular use of the
perceptron algorithm, which is to predict using the final prediction vector.
As we have recently learned, the performance of the final prediction vector has been
analyzed by Vapnik and Chervonenkis (1974). We discuss their bound at the end of this
section.
We now give our analysis for the case - % . The analysis is in two parts and mostly
combines known material. First, we review the classical analysis of the online perceptron algorithm in the linearly separable case, as well as an extension to the inseparable
case. Second, we review an analysis of the leave-one-out conversion of an online learning
algorithm to a batch learning algorithm.

Training
Input:

a labeled training set 3 x ')" 6 ' )' 3 x2')"6

number of epochs
a list of weighted perceptrons 3, ' 6 ')'3 ,
'
6

Output:

Initialize: - , ,

Repeat times:
For -

%')'

- .

Compute prediction:
1" - sign 3,
&4 x
If 1" - " then
-
( % .
else ,
-7,
( " x ;

- % ;
- ( % .

Prediction
Given:
the list of weighted perceptrons:
an unlabeled instance: x
compute a predicted label 1" as follows:

sign 3

, 4 x !6

&1" -

3 , ' 6 ')' 3,

'
6

6
sign 3 "

Figure 1. The voted-perceptron algorithm.

3.1.

The online perceptron algorithm in the separable case

Our analysis is based on the following well known result first proved by Block (1962) and
Novikoff (1962). The significance of this result is that the number of mistakes does not
depend on the dimension of the instances. This gives reason to believe that the perceptron
algorithm might perform well in high dimensional spaces.
T HEOREM 1 (B LOCK , N OVIKOFF ) Let 3 x ')" 6 ')'3 x ' " 6 be a sequence of labeled

examples with ! x !$#&% . Suppose that there exists a vector ' such that ! ' ! - % and
")3'54 x 6)(+* for all examples in the sequence. Then the number of mistakes made by the
online perceptron algorithm on this sequence is at most 3,%.-* 6/ .
Proof: Although the proof is well known, we repeat it for completeness.
Let ,
denote the prediction vector used prior to the th mistake. Thus, ,
the th mistake occurs on 3 x ' " 6 then " 3,
4 x 60# and ,
1 -7,
( " x .

We have

42'5-7,

!4 '8(

" 33 '54 x )
6 ( ,

( *

4 ' 4

and, if

8
Therefore, ,
1
Similarly,

42'

( * .

! ,
1 ! / -.! ,
! / ( " 3 ,
&4 x 6 ( ! x

/ #

! ,
! / (4%

Therefore, ! ,
! / # %./ .

Combining, gives

,
!

which implies #
3.2.

(
- *

,
4!'
(

proving the theorem.

Analysis for the inseparable case

If the data are not linearly separable then Theorem 1 cannot be used directly. However,
we now give a generalized version of the theorem which allows for some mistakes in the
training set. As far as we know, this theorem is new, although the proof technique is
very similar to that of Klasner and Simon (1995, Theorem 2.2). See also the recent work
of Shawe-Taylor and Cristianini (1998) who used this technique to derive generalization
error bounds for any large margin classifier.
T HEOREM 2 Let )3 x ' " 6 ')'3 x ')" 6 be a sequence of labeled examples with ! x ! #

% . Let ' be any vector with ! ' ! -.% and let * . Define the deviation of each example
as

-
#' *2$9" 33'54 x 6)+'
/ . Then the number of mistakes of the online perceptron algo
and define -

rithm on this sequence is bounded by
% ( /
*

.
Proof: The case - follows from Theorem 1, so we can assume that

The proof is based on a reduction of the inseparable case to a separable case in a higher
dimensional space. As we will see, the reduction does not change the algorithm.

We extend the instance space to
by adding new dimensions, one for each

example. Let x .
denote the extension of the instance x . We set the first
coordinates of x equal to x . We set the 3 (+ 6 th coordinate to where is a positive
real constant whose value will be specified later. The rest of the coordinates of x are set
to zero.

Next we extend the comparison vector ' 7 to '
. We use the constant
, which we calculate shortly, to ensure that the length of ' is one. We set the first
coordinates of ' equal to ' - . We set the 3 ( 6 th coordinate to 3 " 6 -3
6 . It is easy
to check that the appropriate normalization is %( / - / .

4 x 6 :

'4 x
")3' 4 x 6 - "
( "

" 3'4 x 6
(
" 3'4 x 6 * $ " 33'54 x 6
(
(

- *
Thus the extended prediction vector ' achieves a margin of * - %( / -
Consider the value of " 33 '

/ on the extended examples.

In order to apply Theorem 1, we need a bound on the length of the instances. As % (
! x ! for all , and the only additional non-zero coordinate has value , we get that ! x / #
%./(
/ . Using these values in Theorem 1 we get that the number of mistakes of the online
perceptron algorithm if run in the extended space is at most

3,%./( / 6 3 %(
* /

- % minimizes the bound and yields the bound given in the statement of
Setting
the theorem.
To finish the proof we show that the predictions of the perceptron algorithm in the extended space are equal to the predictions of the perceptron in the original space. We use
, to denote the prediction vector used for predicting the instance x in the original space
and , to denote the prediction vector used for predicting the corresponding instance x in
the extended space. The claim follows by induction over %.# # of the following three
claims:

1. The first

coordinates of , are equal to those of , .

3 4
( 6 th coordinate of ,
sign 3, 4 x 6- sign 3, 4 x 6 .

2. The
3.

3.3.

is equal to zero.

Converting online to batch

We now have an algorithm that will make few mistakes when presented with the examples
one by one. However, the setup we are interested in here is the batch setup in which
we are given a training set, according to which we generate a hypothesis, which is then
tested on a seperate test set. If the data are linearly separable then the perceptron algorithm
eventually converges and we can use this final prediction rule as our hypothesis. However,
the data might not be separable or we might not want to wait till convergence is achieved.

8
In this case, we have to decide on the best prediction rule given the sequence of different
classifiers that the online algorithm genarates. One solution to this problem is to use the
prediction rule that has survived for the longest time before it was changed. A prediction
rule that has survived for a long time is likely to be better than one that has only survived
for a few iterations. This method was suggested by Gallant (1986) who called it the pocket
method. Littlestone (1989), suggested a two-phase method in which the performance of
all of the rules is tested on a seperate test set and the rule with the least error is then used.
Here we use a different method for converting the online perceptron algorithm into a batch
learning algorithm; the method combines all of the rules generated by the online algorithm
after it was run for just a single time through the training data.
We now describe Helmbold and Warmuths (1995) very simple leave-one-out method
of converting an online learning algorithm into a batch learning algorithm. Our votedperceptron algorithm is a simple application of this general method. We start with the
randomized version. Given a training set 3 x ' " 6 ')' 3 x ')" 6 and an unlabeled in
stance x, we do the following. We select a number in #1' ' + uniformly at random.
We then take the first examples in the training sequence and append the unlabeled instance to the end of this subsequence. We run the online algorithm on this sequence of
length ( % , and use the prediction of the online algorithm on the last unlabeled instance.
In the deterministic leave-one-out conversion, we modify the randomized leave-one-out
conversion to make it deterministic in the obvious way by choosing the most likely prediction. That is, we compute the prediction that would result for all possible choices of in
#1' '+ , and we take majority vote of these predictions. It is straightforward to show
that taking a majority vote runs the risk of doubling the probability of mistake while it has
the potential of significantly decreasing it. In this work we decided to focus primarily on
deterministic voting rather than randomization.
The following theorem follows directly from Helmbold and Warmuth (1995). (See also
Kivinen and Warmuth (1997) and Cesa-Bianchi et al. (1997).)
T HEOREM 3 Assume all examples 3 x ')"6 are generated i.i.d. Let be the expected number of mistakes that the online algorithm makes on a randomly generated sequence of
( % examples. Then given random training examples, the expected probability that
the randomized leave-one-out conversion of makes a mistake on a randomly generated
test instance is at most -3, ( %6 . For the deterministic leave-one-out conversion, this
expected probability is at most -3,.(7%6 .

3.4.

Putting it all together

It can be verified that the deterministic leave-one-out conversion of the online perceptron
algorithm is exactly equivalent to the voted-perceptron algorithm of Figure 1 with - % .
Thus, combining Theorems 2 and 3, we have:
C OROLLARY 1 Assume all examples are generated i.i.d.
at random.
3 x ' " 6 ')'3 x ')"6 be a sequence of training examples and let 3 x ')"

, let
test example. Let % ! x . For ! ' -.% and *

Let
be a

3
#1'**$ " 33 '54 x 6 + 6 /

Then the probability (over the choice of all ( % examples) that the voted-perceptron
algorithm with - % does not predict " on test instance x is at most

/

% (

( %

*

(where the expectation is also over the choice of all

( %

examples).

In fact, the same proof yields a slightly stronger statement which depends only on examples on which mistakes occur. Formally, this can be stated as follows:
C OROLLARY 2 Assume all examples are generated i.i.d. at random. Suppose that we run
the online perceptron algorithm once on the sequence )3 x ')" 6 ' ' 3 x ')" 6 , and

that mistakes occur on examples with indices ')'
. Redefine %

! x ! ,
and redefine

#' *2$ "33'4 x 6)+! /

Now suppose that we run the voted-perceptron algorithm on training examples

3 x ' " 6 ')'3 x ')" 6 for a single epoch. Then the probability (over the choice of all

( % examples) that the voted-perceptron algorithm does not predict " on test instance

x is at most

$#
( % #"

% ( /
%

*

( %

(where the expectation is also over the choice of all

( %

examples).

A rather similar theorem was proved by Vapnik and Chervonenkis (1974, Theorem 6.1)
for training the perceptron algorithm to convergence and predicting with the final perceptron vector.
T HEOREM 4 (VAPNIK AND C HERVONENKIS ) Assume all examples are generated i.i.d.
at random. Suppose that we run the online perceptron algorithm on the sequence
3 x ' " 6 ')'3 x ')" 6 repeatedly until convergence, and that mistakes occur on

a total of examples with indices ')'
. Let %7

! x ! , and let
*8" 3'4 x 6!

Assume *
with probability one.
Now suppose that we run the perceptron algorithm to convergence on training examples
3 x ' " 6 ')'3 x ')"6 . Then the probability (over the choice of all (% examples) that

the final perceptron does not predict " on test instance x is at most

8
%

(where the expectation is also over the choice of all

( %

examples).

For the separable case (in which

can be set to zero), Corollary 2 is almost identical
to Theorem 4. One difference is that in Corolary 2, we lose a factor of 2. This is because we
use the deterministic algorithm, rather than the randomized one. The other, more important
difference is that , the number of mistakes that the perceptron makes, will almost certainly
be larger when the perceptron is run to convergence than when it is run just for a single
epoch. This gives us some indication that running the voted-perceptron algorithm with
- % might be better than running it to convergence; however, our experiments do not
support this prediction.
Vapnik (to appear) also gives a very similar bound for the expected error of supportvector machines. There are two differences between the bounds. First, the set of vectors on
which the perceptron makes a mistake is replaced by the set of essential support vectors.
Second, the radius % is the maximal distance of any support vector from some optimally
chosen vector, rather than from the origin. (The support vectors are the training examples
which fall closest to the decision boundary.)
4.

Kernel-based Classification

We have seen that the voted-perceptron algorithm has guaranteed performance bounds
when the data are (almost) linearly separable. However, linear separability is a rather
strict condition. One way to make the method more powerful is by adding dimensions or
features to the input space. These new coordinates are nonlinear functions of the original
coordinates. Usually if we add enough coordinates we can make the data linearly separable.
If the separation is sufficiently good (in the senses of Theorems 1 and 2) then the expected
generalization error will be small (provided we do not increase the complexity of instances
too much by moving to the higher dimensional space).
However, from a computational point of view, computing the values of the additional
coordinates can become prohibitively hard. This problem can sometimes be solved by the
elegant method of kernel functions. The use of kernel functions for classification problems
was proposed by suggested Aizerman, Braverman and Rozonoer (1964) who specifically
described a method for combining kernel functions with the perceptron algorithm. Continuing their work, Boser, Guyon and Vapnik (1992) suggested using kernel functions with
SVMs.
Kernel functions are functions of two variables 93 x ' y6 which can be represented as an

inner product 3 x6 4 3 y 6 for some function and some

. In other
words, we can calculate 93 x ' y6 by mapping x and y to vectors 3 x 6 and 3 y 6 and then
taking their inner product.
For instance, an important kernel function that we use in this paper is the polynomial
expansion

93 x ' y6 - 3 % (

x 4 y 6

(1)

There exist general conditions for checking if a function is a kernel function. In this par
ticular case, however, it is straightforward to construct witnessing that is a kernel
function. For instance, for - and , we can choose

3 x 6 - 3 % ' / ' // ' / ' ' / '
' / ' '
/ 62

In general, for , we can
define 3 x 6 to have one coordinate 3 x 6 for each monomial

3 x 6 of degree at most over the variables ' ' , and where is an appropriately

chosen constant.
Aizerman, Braverman and Rozonoer observed that the perceptron algorithm can be formulated in such a way that all computations involving instances are in fact in terms of inner
products x 4 y between pairs of instances. Thus, if we want to map each instance x to a vec
tor 3 x 6 in a high dimensional space, we only need to be able to compute inner products

3 x 64 3 y 6 , which is exactly what is computed by a kernel function. Conceptually, then,
with the kernel method, we can work with vectors in a very high dimensional space and
the algorithms performance only depends on linear separability in this expanded space.
Computationally, however, we only need to modify the algorithm by replacing each inner
product computation x 4 y with a kernel function computation 93 x ' y6 . Similar observations
were made by Boser, Guyon and Vapnik for Vapniks SVM algorithm.
In this paper, we observe that all the computations in the voted-perceptron learning algorithm involving instances can also be written in terms of inner products, which means
that we can apply the kernel method to the voted-perceptron algorithm as well. Referring
to Figure 1, we see that both training and prediction involve inner products between instances x and prediction vectors ,
. In order to perform this operation efficiently, we store
each prediction vector ,
in an implicit form, as the sum of instances that were added or
subtracted in order to create it. That is, each ,
can be written and stored as a sum

,
&-

" x

for appropriate indices . We can thus calculate the inner product with x as

4x-

" 3 x 4 x26

To use a kernel function , we would merely replace each x 4 x by 3 x ' x 6 .

Computing the prediction of the final vector ,
on a test instance x requires kernel
calculations where is the number of mistakes made by the algorithm during training.
Naively, the prediction of the voted-perceptron would seem to require 3,/ 6 kernel calculations since we need to compute , 4 x for each # , and since itself involves a sum of
$*% instances. However, taking advantage of the recurrence , 4 x -7, 4 x (&" 3 x 4 x6 , it

is clear that we can compute the prediction of the voted-perceptron also using only kernel
calculations.
Thus, calculating the prediction of the voted-perceptron when using kernels is only
marginally more expensive than calculating the prediction of the final prediction vector,
assuming that both methods are trained for the same number of epochs.

8
d=1

d=2

d=3

10
random (unnorm)
last (unnorm)
avg (unnorm)
vote

random (unnorm)
last (unnorm)
avg (unnorm)
vote

Test Erorr

15
6

0
0.1

Epoch

d=4

d=6
10

random (unnorm)
last (unnorm)
avg (unnorm)
vote

8
6

0
0.1

Epoch

random (unnorm)
last (unnorm)
avg (unnorm)
vote

0
0.1

10
Epoch

10
random (unnorm)
last (unnorm)
avg (unnorm)
vote

d=5

Test Erorr

0
0.1

10
Epoch

0
0.1

Epoch

10
Epoch

Figure 2. Learning curves for algorithms tested on NIST data.

Experiments

In our experiments, we followed closely the experimental setup used by Cortes and Vapnik (1995) in their experiments on the NIST OCR database. / We chose to use this setup
because the dataset is widely available and because LeCun et al. (1995) have published a
detailed comparison of the performance of some of the best digit classification systems in
this setup.
Examples in this NIST database consist of labeled digital images of individual handwritten digits. Each instance is a
matrix in which each entry is an 8-bit representation
of a grey value, and labels are from the set #' ' + . The dataset consists of 60,000
training examples and 10,000 test examples. We treat each image as a vector in
, and,
like Cortes and Vapnik, we use the polynomial kernels of Eq. (1) to expand this vector into
very high dimensions.
To handle multiclass data, we essentially reduced to 10 binary problems. That is, we
trained the voted-perceptron algorithm once for each of the 10 classes. When training on
class , we replaced each labeled example 3 x )' " 6 (where " #1' ' + ) by the binarylabeled example 3 x ')(*%6 if " and by 3 x)' $ % 6 if "1 - . Let

3 , ' 6 ' ' 3,

'
6

be the sequence of weighted prediction vectors which result from training on class .
To make predictions on a new instance x, we tried four different methods. In each
method, we first compute a score for each #')' + and then predict with the
label receiving the highest score:

"&
1 -

Table 1. Results of experiments on NIST 10-class OCR data with

. The rows marked SupVec
and Mistake give average number of support vectors and average number of mistakes. All other rows
give test error rate in percent for the various methods.

Vote
Avg.
Last
Rand.
SupVec
Mistake

(unnorm)
(norm)
(unnorm)
(norm)
(unnorm)
(norm)

0.1

10.7
10.9
10.9
16.0
15.4
22.0
21.5
2,489
3,342

8.5
8.7
8.5
14.7
14.1
15.7
15.2
19,795
25,461

8.3
8.5
8.3
13.6
13.1
14.7
14.2
24,263
48,431

8.2
8.4
8.2
13.9
13.5
14.3
13.8
26,704
70,915

8.2
8.3
8.2
13.7
13.2
14.1
13.6
28,322
93,090

8.1
8.3
8.1
13.5
13.0
13.8
13.2
32,994
223,657

6.0
6.0
6.2
8.6
8.4
13.4
13.2
1,639
2,150

2.8
2.8
3.0
4.0
3.9
5.9
5.9
8,190
10,201

2.4
2.4
2.5
3.4
3.3
4.7
4.7
9,888
15,290

2.2
2.2
2.3
3.0
3.0
4.1
4.1
10,818
19,093

2.1
2.1
2.2
2.7
2.7
3.8
3.8
11,424
22,100

1.8
1.9
1.9
2.3
2.3
2.9
2.9
12,963
32,451

1.8
1.8
1.8
2.0
1.9
2.4
2.3
13,861
41,614

5.4
5.3
5.5
6.9
6.8
11.6
11.5
1,460
1,937

2.3
2.3
2.5
3.1
3.1
4.9
4.8
6,774
8,475

1.9
1.9
2.0
2.5
2.5
3.7
3.7
8,073
11,739

1.8
1.8
1.8
2.2
2.2
3.2
3.2
8,715
13,757

1.7
1.7
1.8
2.0
2.0
2.9
2.9
9,102
15,129

1.6
1.6
1.6
1.7
1.7
2.2
2.2
9,883
18,422

1.6
1.5
1.5
1.6
1.6
1.8
1.8
10,094
19,473

The first method is to compute each score using the respective final prediction vector:

- ,
4 x

This method is denoted last (unnormalized) in the results. A variant of this method is to
compute scores after first normalizing the final prediction vectors:

,
4 x
- ! ,
!

This method is denoted last (normalized) in the results. Note that normalizing vectors
has no effect for binary problems, but can plausibly be important in the multiclass case.
The next method (denoted vote) uses the analog of the deterministic leave-one-out
conversion. Here we set

Table 2. Results of experiments on NIST 10-class OCR data with

. The rows marked
SupVec and Mistake give average number of support vectors and average number of mistakes. All
other rows give test error rate in percent for the various methods.

Vote
Avg.
Last
Rand.
SupVec
Mistake

(unnorm)
(norm)
(unnorm)
(norm)
(unnorm)
(norm)

sign 3

0.1

5.4
5.3
5.5
6.5
6.5
11.5
11.3
1,406
1,882

2.2
2.2
2.3
2.8
2.8
4.6
4.5
6,338
7,977

1.8
1.8
1.9
2.3
2.3
3.5
3.4
7,453
10,543

1.7
1.7
1.7
2.0
2.0
3.1
3.0
7,944
11,933

1.6
1.7
1.6
1.9
1.9
2.7
2.7
8,214
12,780

1.6
1.6
1.6
1.6
1.6
2.1
2.1
8,673
14,375

1.6
1.6
1.6
1.6
1.6
1.8
1.8
8,717
14,538

5.7
5.7
5.7
6.6
6.3
11.9
11.5
1,439
1,953

2.2
2.3
2.3
3.0
2.9
4.7
4.5
6,327
8,044

1.9
1.9
1.9
2.2
2.1
3.5
3.4
7,367
10,379

1.8
1.8
1.8
1.9
1.9
3.0
2.9
7,788
11,563

1.8
1.7
1.7
1.9
1.9
2.7
2.6
7,990
12,215

1.7
1.7
1.7
1.8
1.7
2.1
2.0
8,295
13,234

1.7
1.7
1.6
1.7
1.7
1.9
1.8
8,313
13,289

6.0
6.2
6.0
7.3
6.9
12.8
12.1
1,488
2,034

2.5
2.5
2.5
3.2
3.0
5.0
4.8
6,521
8,351

2.1
2.1
2.1
2.4
2.3
3.8
3.6
7,572
10,764

2.0
2.0
2.0
2.2
2.1
3.3
3.2
7,947
11,892

1.9
1.9
1.9
2.0
2.0
3.0
2.8
8,117
12,472

1.9
1.9
1.8
1.9
1.9
2.3
2.2
8,284
13,108

1.9
1.9
1.8
1.9
1.9
2.0
2.0
8,285
13,118

, 4 x26

The third method (denoted average (unnormalized)) uses an average of the predictions
of the prediction vectors

3 , 4 x62

As in the last method, we also tried a variant (denoted average (normalized)) using
normalized prediction vectors:

, 4 x

! , !

Table 3. Results of experiments on individual classes using polynomial kernels with

. The rows marked SupVec
and Mistake give average number of support vectors and average number of mistakes. All other rows give test error rate in
percent for the various methods.
label

Vote
Avg.

(unnorm)
(norm)

Last
Rand.
SupVec
Mistake

Vote
Avg.

(unnorm)
(norm)

Last
Rand.
SupVec
Mistake
Vote
Avg.

(unnorm)
(norm)

Last
Rand.
SupVec
Mistake
Vote
Avg.

(unnorm)
(norm)

Last
Rand.
SupVec
Mistake
Cortes & Vapnik
SupVec

0.7
0.7
0.7
1.0
2.1
133
133

0.5
0.5
0.5
0.7
1.3
89
89

1.3
1.3
1.3
1.7
3.0
180
180

1.5
1.5
1.5
2.1
3.7
228
228

1.4
1.3
1.4
1.5
3.0
179
179

1.4
1.3
1.4
2.8
3.2
202
202

0.9
0.9
0.9
1.2
2.2
136
136

1.3
1.3
1.3
1.8
2.7
160
160

1.8
1.8
1.8
2.4
4.7
285
285

2.1
2.0
2.1
2.7
4.5
290
290

0.3
0.3
0.3
0.5
0.8
506
506

0.3
0.2
0.2
0.5
0.6
407
407

0.6
0.6
0.6
1.0
1.4
782
782

0.5
0.5
0.6
1.1
1.5
996
996

0.5
0.5
0.5
0.7
1.2
734
734

0.5
0.5
0.5
0.8
1.3
849
849

0.5
0.4
0.4
0.5
0.9
541
541

0.6
0.6
0.6
1.0
1.2
738
738

0.7
0.7
0.8
1.2
1.9
1,183
1,183

0.9
0.9
1.0
1.3
2.1
1,240
1,240

0.2
0.2
0.2
0.2
0.3
736
837

0.2
0.2
0.2
0.2
0.3
636
824

0.4
0.4
0.4
0.4
0.5
1,164
1,339

0.4
0.4
0.4
0.4
0.6
1,504
1,796

0.4
0.4
0.4
0.4
0.5
1,075
1,218

0.4
0.4
0.4
0.4
0.6
1,271
1,487

0.3
0.3
0.3
0.4
0.5
817
951

0.5
0.5
0.5
0.5
0.6
1,103
1,323

0.6
0.6
0.6
0.6
0.8
1,833
2,278

0.7
0.7
0.7
0.7
0.9
1,899
2,323

0.2
0.2
0.2
0.2
0.2
740
844

0.2
0.2
0.2
0.2
0.3
643
843

0.4
0.4
0.4
0.4
0.5
1,168
1,345

0.4
0.4
0.4
0.4
0.5
1,512
1,811

0.4
0.4
0.4
0.4
0.4
1,078
1,222

0.4
0.4
0.4
0.4
0.5
1,277
1,497

0.4
0.3
0.3
0.4
0.4
823
960

0.5
0.5
0.5
0.5
0.5
1,103
1,323

0.6
0.6
0.6
0.6
0.6
1,856
2,326

0.7
0.6
0.6
0.7
0.7
1,920
2,367

0.2
1,379

0.1
989

0.4
1,958

0.4
1,900

0.4
1,224

0.5
2,024

0.3
1,527

0.4
2,064

0.5
2,332

0.6
2,765

The final method (denoted random (unnormalized)), is a possible analog of the randomized leave-one-out method in which we predict using the prediction vectors that exist
at a randomly chosen time slice. That is, let be the number of rounds executed (i.e., the
number of examples processed by the inner loop of the algorithm) so that

for all . To classify x, we choose a time slice

then set

- , 4 x

9#1')' +

uniformly at random. We

Table 4. Results of experiments on NIST data when distinguishing 9 from all other digits. The
rows marked SupVec and Mistake give average number of support vectors and average number
of mistakes. All other rows give test error rate in percent for the various methods.

Vote
Avg.
Last
Rand.
SupVec
Mistake

(unnorm)
(norm)

0.1

4.5
4.5
4.6
7.9
8.3
513
513

3.9
3.9
3.9
6.4
6.7
4,085
4,085

3.8
3.8
3.9
5.7
6.5
5,240
7,880

3.8
3.8
3.8
6.3
6.3
5,888
11,630

3.8
3.8
3.8
5.8
6.2
6,337
15,342

3.7
3.7
3.8
5.9
6.2
7,661
37,408

2.4
2.4
2.5
4.1
5.5
337
337

1.2
1.2
1.3
1.8
2.8
1,668
1,668

1.0
1.0
1.1
1.6
2.2
2,105
2,541

0.9
1.0
1.0
1.6
1.9
2,358
3,209

0.9
0.9
1.0
1.3
1.8
2,527
3,744

0.8
0.9
0.9
1.1
1.4
2,983
5,694

0.8
0.8
0.8
1.0
1.1
3,290
7,715

2.2
2.1
2.2
2.9
4.9
302
302

1.0
0.9
1.0
1.3
2.2
1,352
1,352

0.8
0.8
0.8
1.0
1.7
1,666
1,867

0.8
0.8
0.8
1.0
1.5
1,842
2,202

0.7
0.7
0.8
0.8
1.4
1,952
2,448

0.7
0.7
0.7
0.7
1.0
2,192
3,056

0.7
0.6
0.6
0.7
0.8
2,283
3,318

2.1
2.0
2.1
2.7
4.5
290
290

0.9
0.9
1.0
1.3
2.1
1,240
1,240

0.8
0.8
0.8
1.0
1.6
1,528
1,648

0.7
0.7
0.8
0.8
1.4
1,669
1,882

0.7
0.7
0.7
0.8
1.2
1,746
2,020

0.7
0.7
0.7
0.7
0.9
1,899
2,323

0.7
0.6
0.6
0.7
0.7
1,920
2,367

2.2
2.2
2.2
2.7
4.6
294
294

0.9
0.9
1.0
1.3
2.0
1,229
1,229

0.8
0.8
0.8
1.0
1.5
1,502
1,598

0.7
0.7
0.8
0.9
1.3
1,628
1,798

0.7
0.7
0.7
0.8
1.2
1,693
1,908

0.7
0.7
0.7
0.7
0.9
1,817
2,132

0.7
0.7
0.7
0.7
0.8
1,827
2,150

2.3
2.3
2.3
2.7
4.7
302
302

0.9
0.9
1.0
1.3
2.1
1,263
1,263

0.8
0.8
0.8
1.0
1.6
1,537
1,625

0.8
0.8
0.8
0.9
1.3
1,655
1,810

0.8
0.8
0.8
0.8
1.2
1,715
1,916

0.8
0.7
0.7
0.8
0.9
1,774
2,035

0.7
0.7
0.7
0.7
0.8
1,776
2,039

where is the index of the final vector which existed at time for label . Formally,
the largest number in #1' ' + satisfying

The analogous normalized method (Random (normalized)) uses

, 4 x
- ! , !

Our analysis is applicable only for the cases of voted or randomly chosen predictions and
where - % . However, in the experiments, we ran the algorithm with up to . When
using polynomial kernels of degree 5 or more, the data becomes linearly separable. Thus,
after several iterations, the perceptron algorithm converges to a consistent prediction vector
and makes no more mistakes. After this happens, the final perceptron gains more and more
weight in both vote and average. This tends to have the effect of causing all of the
variants to converge eventually to the same solution. By reaching this limit we compare
the voted-perceptron algorithm to the standard way in which the perceptron algorithm is
used, which is to find a consistent prediction rule.
We performed experiments with polynomial kernels for dimensions - % (which corresponds to no expansion) up to . We preprocessed the data on each experiment by
randomly permuting the training sequence. Each experiment was repeated 10 times, each
time with a different random permutation of the training examples. For - % , we were
only able to run the experiment for ten epochs for reasons which are described below.
Figure 2 shows plots of the test error as a function of the number of epochs for four of
the prediction methods vote and the unnormalized versions of last, average and
random (we omitted the normalized versions for the sake of readability). Test errors are
averaged over the multiple runs of the algorithm, and are plotted one point for every tenth
of an epoch.
Some of the results are also summarized numerically in Tables 1 and 2 which show
(average) test error for several values of for the seven different methods in the rows
marked Vote, Avg. (unnorm), etc. The rows marked SupVec show the number of
support vectors, that is, the total number of instances that actually are used in computing
scores as above. In other words, this is the size of the union of all instances on which a
mistake occured during training. The rows marked Mistake show the total number of
mistakes made during training for the 10 different labels. In every case, we have averaged
over the multiple runs of the algorithm.
The column corresponding to - !% is helpful for getting an idea of how the algorithms
perform on smaller datasets since in this case, each algorithm has only used a tenth of the
available data (about 6000 training examples).
Ironically, the algorithm runs slowest with small values of . For larger values of , we
move to a much higher dimensional space in which the data becomes linearly separable.
For small values of especially for - % the data are not linearly separable which
means that the perceptron algorithm tends to make many mistakes which slows down the
algorithm significantly. This is why, for - % , we could not even complete a run out to 30

epochs but had to stop at - % (after about six days of computation). In comparison, for
- , we can run 30 epochs in about 25 hours, and for - or , a complete run takes
about 8 hours. (All running times are on a single SGI MIPS R10000 processor running at
194 MHZ.)
The most significant improvement in performance is clearly between - % and .
The migration to a higher dimensional space makes a tremendous difference compared to
running the algorithm in the given space. The improvements for
are not nearly as
dramatic.
Our results indicate that voting and averaging perform better than using the last vector.
This is especially true prior to convergence of the perceptron updates. For - % , the data
are highly inseparable, so in this case the improvement persists for as long as we were able
to run the algorithm. For higher dimensions (
% ), the data becomes more separable and
the perceptron update rule converges (or almost converges), in which case the performance
of all the prediction methods is very similar. Still, even in this case, there is an advantage
to using voting or averaging for a relatively small number of epochs.
There does not seem to be any significant difference between voting and averaging in
terms of performance. However, using random vectors performs the worst in all cases.
This stands in contrast to our analysis, which applies only to random vectors and gives an
upper bound on the error of average vectors which is twice the error of the randomized
vectors. A more refined analysis of the effect of averaging is required to better explain the
observed behavior.
Using normalized vectors seems to sometimes help a bit for the last method, but can
help or hurt performance slightly for the average method; in any case, the differences in
performance between using normalized and unnormalized vectors are always minor.
LeCun et al. (1995) give a detailed comparison of algorithms on this dataset. The best of
the algorithms that they tested is (a rather old version of) boosting on top of the neural net
LeNet 4 which achieves an error rate of 0.7%. A version of the optimal margin classifier
algorithm (Cortes & Vapnik, 1995), using the same kernel function, performs significantly
better than ours, achieving a test error rate of 1.1% for .
Table 3 shows how the variants of the perceptron algorithm perform on the ten binary
problems corresponding to the 10 class labels. For this table, we fix , and we also
compare performance to that reported by Cortes and Vapnik (1995) for SVMs. Table 4
gives more details of how the perceptron methods perform on the single binary problem
of distinguishing 9 from all other images. Note that these binary problems come closest
to the theory discussed earlier in the paper. It is interesting that the perceptron algorithm
generally ends up using fewer support vectors than with the SVM algorithm.

Conclusions and Summary

The most significant result of our experiments is that running the perceptron algorithm in
a higher dimensional space using kernel functions produces very significant improvements
in performance, yielding accuracy levels that are comparable, though still inferior, to those
obtainable with support-vector machines. On the other hand, our algorithm is much faster
and easier to implement than the latter method. In addition, the theoretical analysis of the
expected error of the perceptron algorithm yields very similar bounds to those of support-

vector machines. It is an open problem to develop a better theoretical understanding of the
empirical superiority of support-vector machines.
We also find it significant that voting and averaging work better than just using the final
hypothesis. This indicates that the theoretical analysis, which suggests using voting, is
capturing some of the truth. On the other hand, we do not have a theoretical explanation
for the improvement in performance following the first epoch.
Acknowledgments
We thank Vladimir Vapnik for some helpful discussions and for pointing us to Theorem 4.
Notes
1. Storing all of these vectors might seem an excessive waste of memory. However, as we shall see, when
perceptrons are used together with kernels, the excess in memory and computition is really quite minimal.
2. National Institute for Standards and Technology, Special Database 3. See
http://www.research.att.com/ yann/ocr/ for information on obtaining this dataset and for a
list of relevant publications.

References
Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations
of the potential function method in pattern recognition learning. Automation and
Remote Control, 25, 821837.
Anlauf, J. K., & Biehl, M. (1989). The adatron: an adaptive perceptron algorithm. Europhysics Letters, 10(7), 687692.
Block, H. D. (1962). The perceptron: A model for brain functioning. Reviews of Modern
Physics, 34, 123135. Reprinted in Neurocomputing by Anderson and Rosenfeld.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational
Learning Theory, pp. 144152.
Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth,
M. K. (1997). How to use expert advice. Journal of the Association for Computing
Machinery, 44(3), 427485.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3),
273297.
Friess, T., Cristianini, N., & Campbell, C. (1998). The kernel-adatron: A fast and simple
learning procedure for support vector machines. In Machine Learning: Proceedings
of the Fifteenth International Conference.
Gallant, S. I. (1986). Optimal linear discriminants. In Eighth International Conference on
Pattern Recognition, pp. 849852. IEEE.

8
Helmbold, D. P., & Warmuth, M. K. (1995). On weak learning. Journal of Computer and
System Sciences, 50, 551573.
Kivinen, J., & Warmuth, M. K. (1997). Additive versus exponentiated gradient updates for
linear prediction. Information and Computation, 132(1), 164.
Klasner, N., & Simon, H. U. (1995). From noise-free to noise-tolerant and from on-line to
batch learning. In Proceedings of the Eighth Annual Conference on Computational
Learning Theory, pp. 250264.
LeCun, Y., Jackel, L. D., Bottou, L., Brunot, A., Cortes, C., Denker, J. S., Drucker, H.,
Guyon, I., Muller, U. A., Sackinger, E., Simard, P., & Vapnik, V. (1995). Comparison
of learning algorithms for handwritten digit recognition. In International Conference
on Artificial Neural Networks, pp. 5360.
Littlestone, N. (1989). From on-line to batch learning. In Proceedings of the Second
Annual Workshop on Computational Learning Theory, pp. 269284.
Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. The MIT Press.
Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. In Proceedings of the
Symposium on the Mathematical Theory of Automata, Vol. XII, pp. 615622.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65, 386407. (Reprinted in Neurocomputing (MIT Press, 1988).).
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.
Shawe-Taylor, J., & Cristianini, N. (1998). Robust bounds on generalization from the
margin distribution. Tech. rep. NC2-TR-1998-029, NeuroCOLT2.
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. SpringerVerlag.
Vapnik, V. N., & Chervonenkis, A. Y. (1974). Theory of pattern recognition. Nauka,
Moscow. (In Russian).
Vapnik, V. N. (1998 (to appear)). Statistical Learning Theory. Wiley.

6.86x Machine Learning With Python: Linear Classifiers
No ratings yet
6.86x Machine Learning With Python: Linear Classifiers
7 pages
Midterm Review Spring18 Sols
No ratings yet
Midterm Review Spring18 Sols
22 pages
Deloitte
No ratings yet
Deloitte
40 pages
Support Vector Network
No ratings yet
Support Vector Network
25 pages
Linear Classifiers and The Perceptron Algorithm: 36-350, Data Mining, Fall 2009 16 November 2009
No ratings yet
Linear Classifiers and The Perceptron Algorithm: 36-350, Data Mining, Fall 2009 16 November 2009
5 pages
Machine Learning
No ratings yet
Machine Learning
46 pages
Machine Learning-4
100% (1)
Machine Learning-4
18 pages
Support Vector Machine: Fundamentals and Applications
From Everand
Support Vector Machine: Fundamentals and Applications
Fouad Sabry
No ratings yet
Random Sample Consensus: Robust Estimation in Computer Vision
From Everand
Random Sample Consensus: Robust Estimation in Computer Vision
Fouad Sabry
No ratings yet
Support Vector Networks
No ratings yet
Support Vector Networks
25 pages
Perceptron Notes
No ratings yet
Perceptron Notes
5 pages
Machine Learning: Support Vector Machines Kernel Methods
No ratings yet
Machine Learning: Support Vector Machines Kernel Methods
87 pages
Ai and ML
No ratings yet
Ai and ML
16 pages
Perceptron
No ratings yet
Perceptron
23 pages
SD-M1 TSI Chapitre 4
No ratings yet
SD-M1 TSI Chapitre 4
42 pages
Machine Learning and Data Mining: Prof. Alexander Ihler
No ratings yet
Machine Learning and Data Mining: Prof. Alexander Ihler
46 pages
Introduction To Support Vector Machines: Andrew Moore CMU
No ratings yet
Introduction To Support Vector Machines: Andrew Moore CMU
40 pages
CSE 473 Pattern Recognition: Instructor: Dr. Md. Monirul Islam
No ratings yet
CSE 473 Pattern Recognition: Instructor: Dr. Md. Monirul Islam
43 pages
Search BoW Results
No ratings yet
Search BoW Results
35 pages
SVM & Image Classification.
No ratings yet
SVM & Image Classification.
22 pages
Machine Learning
No ratings yet
Machine Learning
20 pages
Perceptons Neural Networks
No ratings yet
Perceptons Neural Networks
33 pages
Perceptons Neural Networks
No ratings yet
Perceptons Neural Networks
33 pages
SVM Class
No ratings yet
SVM Class
33 pages
SVM and Kernel
No ratings yet
SVM and Kernel
57 pages
Online Learning With Kernels: Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson, Member, IEEE
No ratings yet
Online Learning With Kernels: Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson, Member, IEEE
12 pages
An Adventure of Epic Porpoises
No ratings yet
An Adventure of Epic Porpoises
174 pages
ML Unit I
No ratings yet
ML Unit I
14 pages
(Bernhard Schölkopf, Alexander J. Smola) Learning With Kernels PDF
No ratings yet
(Bernhard Schölkopf, Alexander J. Smola) Learning With Kernels PDF
645 pages
cs221 Lecture11
No ratings yet
cs221 Lecture11
71 pages
Learning Kernel Classifiers. Theory and Algorithms
100% (2)
Learning Kernel Classifiers. Theory and Algorithms
371 pages
PR-January20-10 Online Trial
No ratings yet
PR-January20-10 Online Trial
42 pages
Online Passive-Aggressive Algorithms
No ratings yet
Online Passive-Aggressive Algorithms
35 pages
Assignment 4 (Sol.) : Introduction To Machine Learning Prof. B. Ravindran
No ratings yet
Assignment 4 (Sol.) : Introduction To Machine Learning Prof. B. Ravindran
4 pages
Notes Chapter Feature Representation
No ratings yet
Notes Chapter Feature Representation
6 pages
3 Linear
No ratings yet
3 Linear
5 pages
EE05425Notes 14
No ratings yet
EE05425Notes 14
3 pages
Predicting Structured Data
No ratings yet
Predicting Structured Data
361 pages
Fast Kernel Classifiers
No ratings yet
Fast Kernel Classifiers
41 pages
Linear Regression
No ratings yet
Linear Regression
37 pages
An Introduction To Kernel Methods: C. Campbell
No ratings yet
An Introduction To Kernel Methods: C. Campbell
38 pages
L5-Support Vector Machine
No ratings yet
L5-Support Vector Machine
61 pages
Perceptron Bound Proof
No ratings yet
Perceptron Bound Proof
27 pages
Supervised Learning - Support Vector Machines and Feature Reduction
No ratings yet
Supervised Learning - Support Vector Machines and Feature Reduction
11 pages
Main
No ratings yet
Main
12 pages
Assignment 4
No ratings yet
Assignment 4
3 pages
Ralf Herbrich Learning Kernel Classifiers. Theory and Algorithms 2001ã. 382ñ. ISBN ISBN10 026208306X PDF
No ratings yet
Ralf Herbrich Learning Kernel Classifiers. Theory and Algorithms 2001ã. 382ñ. ISBN ISBN10 026208306X PDF
382 pages
hw2 4
No ratings yet
hw2 4
3 pages
3 Non Linear Classifiers
No ratings yet
3 Non Linear Classifiers
74 pages
Discriminant Functions
No ratings yet
Discriminant Functions
33 pages
Lecture 2 Math
No ratings yet
Lecture 2 Math
34 pages
SML Unit 4
No ratings yet
SML Unit 4
61 pages
Statistical Pattern Recognition Toolbox For Matlab: User's Guide
No ratings yet
Statistical Pattern Recognition Toolbox For Matlab: User's Guide
99 pages
HW 3
No ratings yet
HW 3
3 pages
Support Vector Machines For Histogram-Based Image Classification
No ratings yet
Support Vector Machines For Histogram-Based Image Classification
10 pages
Support Vector Machines: Dominik Wisniewski Wojciech Wawrzyniak
No ratings yet
Support Vector Machines: Dominik Wisniewski Wojciech Wawrzyniak
16 pages
Support Vector Machines
No ratings yet
Support Vector Machines
57 pages
ML Unit-4
No ratings yet
ML Unit-4
28 pages
Ijatcse 43922020
No ratings yet
Ijatcse 43922020
6 pages
Interpolation and Extrapolation Optimal Designs 2: Finite Dimensional General Models
From Everand
Interpolation and Extrapolation Optimal Designs 2: Finite Dimensional General Models
Giorgio Celant
No ratings yet
Ordered Weighted Averaging Aggregation Operator: Fundamentals and Applications
From Everand
Ordered Weighted Averaging Aggregation Operator: Fundamentals and Applications
Fouad Sabry
No ratings yet
TCS Previous Year Interview Questions by SREEJITA DHAR
No ratings yet
TCS Previous Year Interview Questions by SREEJITA DHAR
12 pages
Lumpaz, Jan Paolo M. 3 Year Bsba - Marketing Management
No ratings yet
Lumpaz, Jan Paolo M. 3 Year Bsba - Marketing Management
8 pages
Road Damage Detection and Classification
No ratings yet
Road Damage Detection and Classification
11 pages
DL1 - Neural Network
No ratings yet
DL1 - Neural Network
14 pages
T Extacy
No ratings yet
T Extacy
184 pages
Unit 3
No ratings yet
Unit 3
18 pages
ML Project Proposal Abstract Presentation
No ratings yet
ML Project Proposal Abstract Presentation
9 pages
CCW332 - DM Lab-1
No ratings yet
CCW332 - DM Lab-1
31 pages
Digital Skills Sbi
No ratings yet
Digital Skills Sbi
34 pages
Ey The Digital Bank Tech Innovations Driving Change at Us Banks PDF
100% (1)
Ey The Digital Bank Tech Innovations Driving Change at Us Banks PDF
36 pages
Demand Forecasting
No ratings yet
Demand Forecasting
4 pages
Paper 4
No ratings yet
Paper 4
12 pages
Bilingual OCR Report
No ratings yet
Bilingual OCR Report
10 pages
Clustering. Computational Journalism Week 2
No ratings yet
Clustering. Computational Journalism Week 2
41 pages
Placement Brochure - STATISTICS Kanpur
No ratings yet
Placement Brochure - STATISTICS Kanpur
12 pages
20 Questions To Test Your Skills On CNN Convolutional Neural Networks
No ratings yet
20 Questions To Test Your Skills On CNN Convolutional Neural Networks
11 pages
Dillam Thesis Submitted
No ratings yet
Dillam Thesis Submitted
219 pages
Gan Framework
No ratings yet
Gan Framework
57 pages
Precise Detection of Diabetic Retinopathy Using Adaptive Remora Optimization Algorithm With Deep Adversarial Approach
No ratings yet
Precise Detection of Diabetic Retinopathy Using Adaptive Remora Optimization Algorithm With Deep Adversarial Approach
31 pages
Microsoft Azure Ai Fundamentals Certification Companion Guide To Prepare For The Ai900 Exam 1st Edition Krunal S Trivedi Download
No ratings yet
Microsoft Azure Ai Fundamentals Certification Companion Guide To Prepare For The Ai900 Exam 1st Edition Krunal S Trivedi Download
82 pages
Education: Phone: +91-7667558315 Address: Kolkata, West Bengal
No ratings yet
Education: Phone: +91-7667558315 Address: Kolkata, West Bengal
1 page
Abhinav - AIML - Resume - YPX1K2BXZW (1) (1) 1
No ratings yet
Abhinav - AIML - Resume - YPX1K2BXZW (1) (1) 1
1 page
3.4. A Comprehensive Guide To Convolutional Neural Networks - The ELI5 Way - by Sumit Saha - Towards Data Science
No ratings yet
3.4. A Comprehensive Guide To Convolutional Neural Networks - The ELI5 Way - by Sumit Saha - Towards Data Science
17 pages
Unit 5 - AI - Part1
No ratings yet
Unit 5 - AI - Part1
7 pages
CCCCC
No ratings yet
CCCCC
4 pages
9 - IAI5101 Unsupervised Learning - 20-40
No ratings yet
9 - IAI5101 Unsupervised Learning - 20-40
21 pages
Notification
No ratings yet
Notification
7 pages
Data Mining Lab
No ratings yet
Data Mining Lab
9 pages
Unit III
No ratings yet
Unit III
19 pages

Large Margin Classification Using The Perceptron Algorithm: Machine Learning, 37 (3) :277-296, 1999

Uploaded by

Large Margin Classification Using The Perceptron Algorithm: Machine Learning, 37 (3) :277-296, 1999

Uploaded by

Machine Learning, 37(3):277-296, 1999.

Large Margin Classification

               

a labeled training set 3 x ')" 6 ' )' 3 x2')"6

Initialize:  - , ,

3 ,  '   6 ')' 3,

Figure 1. The voted-perceptron algorithm.

The online perceptron algorithm in the separable case

               

proving the theorem.

Analysis for the inseparable case

/ on the extended examples.

coordinates of ,  are equal to those of ,  .

Converting online to batch

               

Putting it all together

(where the expectation is also over the choice of all 

 # ' *2$ "33'4 x 6)+! /

Now suppose that we run the voted-perceptron algorithm on training examples

(where the expectation is also over the choice of all 

               

(where the expectation is also over the choice of all 

For the separable case (in which

To use a kernel function  , we would merely replace each x  4 x by  3 x ' x 6 .

               

Figure 2. Learning curves for algorithms tested on NIST data.

3 ,  '   6 ' ' 3,

Table 1. Results of experiments on NIST 10-class OCR data with

               

Table 2. Results of experiments on NIST 10-class OCR data with

Table 3. Results of experiments on individual classes using polynomial kernels with

for all . To classify x, we choose a time slice

               

The analogous normalized method (Random (normalized)) uses

               

Conclusions and Summary

               

You might also like

a labeled training set 3 x ')" 6 ' )' 3 x2')"6

Initialize: - , ,

3 , ' 6 ')' 3,

coordinates of , are equal to those of , .

(where the expectation is also over the choice of all

#' *2$ "33'4 x 6)+! /

(where the expectation is also over the choice of all

(where the expectation is also over the choice of all

To use a kernel function , we would merely replace each x 4 x by 3 x ' x 6 .

3 , ' 6 ' ' 3,