
CS 6140: Machine Learning Spring 2015

College of Computer and Information Science


Northeastern University
Lecture 5: March 16
Instructor: Bilal Ahmed Scribe: Bilal Ahmed & Virgil Pavlu

The Perceptron Algorithm¹

¹These lecture notes are intended for in-class use only.

1 Introduction
Consider the problem of binary classification where we have N data points as shown in Figure-1a.
We can observe that the decision boundary between the two classes (blue and red points) is a
straight line. Datasets that can be separated by a straight line are known as linearly separable
datasets. Generalizing the example from two to m dimensions, a linearly separable dataset is one
for which the decision boundary between the two classes is a linear function of the features x.
Figure-1b shows an example where the two classes are not linearly separable.


Figure 1: Linear (a) and non-linear (b) separability for binary classification in two dimensions. The blue points represent the positive class and the red points belong to the negative class.

More formally, for a data set having N instances (x_i, y_i), where x_i ∈ R^m, y_i ∈ {−1, 1}, a weight vector w ∈ R^m and a threshold θ, if the following conditions are satisfied:

    w^T x_i > θ    if y_i = 1
    w^T x_i < θ    if y_i = −1

then the dataset is linearly separable. Here w defines the normal vector for the hyperplane that separates the two classes in the m-dimensional feature space. The learning task in such a setting would be to learn the m + 1 parameters corresponding to the weight vector w and the threshold θ. We can absorb the threshold into the weight vector by observing that the above conditions can be written more compactly as y_i(w^T x_i − θ) > 0, so that we can define an augmented weight vector w̃ = [w_0 w_1 . . . w_m]^T (note that w_0 = −θ) and also augment our feature set with a constant feature that is always equal to one, so that every instance is x̃ = [x_0 x_1 . . . x_m]^T, where x_0 = 1. The above conditions can now be stated as:
    w̃^T x̃_i > 0    if y_i = 1
    w̃^T x̃_i < 0    if y_i = −1        (1)

It is more convenient to express the output of the perceptron in terms of the sgn(.) function, so that l_i = sgn(w̃^T x̃_i), where l_i is the label predicted by the perceptron for the i-th instance and the sgn(.) function is defined as:

    sgn(x) =  1    if x > 0
             −1    otherwise

In the rest of the discussion we will assume that we are working with augmented features and, to keep the notation as clear as possible, we will revert to w and x.

2 The Perceptron
A perceptron takes as input a set of m real-valued features and calculates their linear combination. If the linear combination is above a pre-set threshold it outputs a 1, otherwise it outputs a −1, as per Equation-1, which is also called the perceptron classification rule. We can use the perceptron training algorithm to learn the decision boundary for linearly separable datasets. Algorithm-1 shows the perceptron training algorithm.
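
To make the classification rule concrete, the following short Python/NumPy sketch (not part of the original notes; the function name perceptron_predict is an arbitrary choice) applies Equation-1 to an augmented feature vector:

    import numpy as np

    def perceptron_predict(w, x):
        """Perceptron classification rule: sgn(w^T x) with augmented features.

        w : weight vector [w_0, w_1, ..., w_m], where w_0 absorbs the threshold.
        x : feature vector [1, x_1, ..., x_m], with the constant feature x_0 = 1.
        Returns +1 if w^T x > 0 and -1 otherwise, as in Equation (1).
        """
        return 1 if np.dot(w, x) > 0 else -1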

2.1 Example: Learning the boolean AND function for two variables
Consider the task of learning the AND function for two boolean variables x1 and x2 . We can easily
generate the data as there are only four possible instances, as shown in Table-1. These instances
along with their labels are plotted in Figure-2a. In order to apply the perceptron algorithm we will
map an output of 0 to −1 for the AND function.
Below we show the updates to the weight vector as it makes its first pass through the training data, processing one instance at a time. In neural network nomenclature, a complete pass through the training data is known as an epoch. During an epoch, the updates made to the weight vector depend on the order in which the instances are presented to the algorithm.

Data: Training Data: (x_i, y_i); ∀i ∈ {1, 2, . . . , N}, Learning Rate: η
Result: Separating hyperplane coefficients: w∗
Initialize w ← 0;
repeat
    get example (x_i, y_i);
    ŷ_i ← w^T x_i;
    if ŷ_i y_i ≤ 0 then
        w ← w + η y_i x_i
until convergence;
Algorithm 1: The perceptron training algorithm.
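
Below is a minimal Python/NumPy sketch of Algorithm-1; the function name perceptron_train, the max_epochs safeguard, and the convergence test (a full pass with no updates) are implementation choices made here for illustration rather than part of the original algorithm:

    import numpy as np

    def perceptron_train(X, y, eta=1.0, max_epochs=1000):
        """Perceptron training algorithm (Algorithm-1).

        X : (N, m+1) array of augmented instances (first column is the constant 1).
        y : (N,) array of labels in {-1, +1}.
        eta : learning rate.
        Returns the learned weight vector w.
        """
        N, d = X.shape
        w = np.zeros(d)                      # initialize w <- 0
        for epoch in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):         # one pass over the data = one epoch
                if yi * np.dot(w, xi) <= 0:  # mistake (or instance on the boundary)
                    w = w + eta * yi * xi    # w <- w + eta * y_i * x_i
                    mistakes += 1
            if mistakes == 0:                # convergence: a full pass with no updates
                return w
        return w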

Table 1: Data for learning the boolean AND function.

i x0 x1 x2 AND(x1 , x2 )
1. 1 0 0 0
2. 1 0 1 0
3. 1 1 0 0
4. 1 1 1 1

The first epoch of the perceptron training algorithm on the training data in Table-1 (instances presented in the order listed, with a learning rate η of 1) is:

1. ŷ_1 = w^T x_1 = 0, so ŷ_1 y_1 = 0 ≤ 0; therefore, w ← w + (1)(−1)x_1 = [−1 0 0]^T

2. ŷ_2 = w^T x_2 = −1; therefore, no update to w

3. ŷ_3 = w^T x_3 = −1; therefore, no update to w

4. ŷ_4 = w^T x_4 = −1 while y_4 = 1, so ŷ_4 y_4 ≤ 0; therefore, w ← w + (1)(1)x_4 = [0 1 1]^T

The decision boundary corresponding to the weight vector at the end of the first epoch is shown in Figure-2b. Figure-2(a-d) shows the decision boundary at the end of different epochs of the perceptron training algorithm and the final decision boundary after convergence. The final weight vector in this case is found to be [−4 3 2]^T; note that the first weight corresponds to the constant feature and hence encodes the threshold (w_0 = −θ, i.e., θ = 4).
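
As a quick sanity check, the following self-contained snippet (an illustration added here, not part of the original notes) verifies that the reported final weight vector [−4 3 2]^T classifies all four augmented AND instances from Table-1 correctly, with the 0 outputs mapped to −1:

    import numpy as np

    # Augmented AND data from Table-1, with output 0 mapped to -1.
    X = np.array([[1, 0, 0],
                  [1, 0, 1],
                  [1, 1, 0],
                  [1, 1, 1]])
    y = np.array([-1, -1, -1, 1])

    w_final = np.array([-4, 3, 2])            # final weight vector after convergence
    preds = np.where(X @ w_final > 0, 1, -1)  # sgn(w^T x) for each instance
    print(preds)                              # [-1 -1 -1  1] -- matches the labels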

2.2 Hypothesis space for the perceptron


The hypothesis space for the perceptron is R^m, and any w ∈ R^m constitutes a valid hypothesis. The goal of the learning algorithm is to find a weight vector w∗ ∈ R^m that separates the two classes. For linearly separable datasets there is a continuum of weight vectors that fulfill this criterion; the perceptron algorithm is guaranteed to converge to a separating hyperplane, as we show next. The particular weight vector that the perceptron training algorithm converges to depends on the learning rate η and the initial weight values.

2.3 Convergence of the perceptron training algorithm


Consider that we have training data as required by the perceptron algorithm. We are working with datasets that are linearly separable, and hence we will assume ∃ w_opt ∈ R^m that perfectly separates the data; we will further assume that ∃ γ > 0 such that for any instance in the training set we have:

    y_i w_opt^T x_i ≥ γ ; ∀i ∈ {1, 2, . . . , N}        (2)

Since we have finite data, there exists some R such that:

    ‖x_i‖ ≤ R ; ∀i ∈ {1, 2, . . . , N}


Figure 2: Learning the boolean AND function for two variables using the perceptron. (a) shows the initial decision boundary, (b) the blue line shows the updated decision boundary after the first pass through the data (first epoch), (c) the green line is the decision boundary after the second epoch, and (d) shows the final decision boundary (black line).

The quantity on the left in Equation-2 is proportional to the distance of an instance from the separating hyperplane. For all instances belonging to the positive class, i.e., y_i = 1, we have w_opt^T x_i ≥ γ, and w_opt^T x_i ≤ −γ for all instances with y_i = −1. This means that the optimal hyperplane w_opt separates the two classes with at least a distance proportional to γ.
Let w_k be the weight vector after the perceptron makes the k-th mistake on an instance (x_j, y_j), so that

    w_k = w_{k−1} + y_j x_j

(taking η = 1; a positive learning rate only rescales the weight vector and does not change the predicted labels). To show that the perceptron training algorithm will learn the decision boundary within a finite number of steps, we will bound the number of mistakes the perceptron makes on the training dataset.

Consider the inner product between w_opt and w_k:

    w_k^T w_opt = (w_{k−1} + y_j x_j)^T w_opt
                = w_{k−1}^T w_opt + y_j x_j^T w_opt
                ≥ w_{k−1}^T w_opt + γ        (∵ y_j w_opt^T x_j ≥ γ)

Since we started with w_0 = 0, by induction we have that:

    w_k^T w_opt ≥ kγ        (3)

The Cauchy-Schwarz inequality states that:

    (a^T b)^2 ≤ ‖a‖^2 ‖b‖^2 ; ∀ a, b ∈ R^m        (4)

Next we will look at the norm of w_k:

    ‖w_k‖^2 = (w_{k−1} + y_j x_j)^T (w_{k−1} + y_j x_j)
            = ‖w_{k−1}‖^2 + 2 y_j w_{k−1}^T x_j + ‖x_j‖^2
            ≤ ‖w_{k−1}‖^2 + R^2        (∵ y_j w_{k−1}^T x_j ≤ 0, since the k-th mistake was made on (x_j, y_j))

Since we started with w_0 = 0, by induction we have that:

    ‖w_k‖^2 ≤ kR^2        (5)

Combining Equations 3-5, we get:

    k^2 γ^2 ≤ (w_k^T w_opt)^2 ≤ ‖w_k‖^2 ‖w_opt‖^2 ≤ kR^2 ‖w_opt‖^2

from the above we can see that:

    k ≤ (R/γ)^2 ‖w_opt‖^2        (6)
This shows that based on our initial assumptions, the perceptron training algorithm will stop after
making a finite number of mistakes (Equation-6) irrespective of the sequence in which the samples
are presented.
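
The bound in Equation-6 can also be checked empirically. The sketch below (an added illustration; the choice of w_opt = [−4 3 2]^T from Section 2.1 and the variable names are assumptions made here) counts the mistakes the perceptron makes on the AND data and compares the count against (R/γ)^2 ‖w_opt‖^2:

    import numpy as np

    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    y = np.array([-1, -1, -1, 1], dtype=float)

    w_opt = np.array([-4.0, 3.0, 2.0])               # a separating hyperplane (Section 2.1)
    gamma = np.min(y * (X @ w_opt))                  # margin gamma from Equation (2)
    R = np.max(np.linalg.norm(X, axis=1))            # radius bound on ||x_i||
    bound = (R / gamma) ** 2 * np.dot(w_opt, w_opt)  # mistake bound from Equation (6)

    # Run the perceptron (eta = 1) and count the mistakes it makes.
    w = np.zeros(X.shape[1])
    mistakes = 0
    converged = False
    while not converged:
        converged = True
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi
                mistakes += 1
                converged = False

    print(mistakes, "<=", bound)                     # the mistake count never exceeds the bound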

2.4 Is there a loss function here?


The loss function for the perceptron is given as:

    L_p(x, y, w) = max(0, −y w^T x)        (7)

which is zero when the instance is classified correctly, and is proportional to the distance of the instance from the hyperplane when it is incorrectly classified (a misclassified instance lies on the wrong side of the hyperplane, so −y w^T x is positive). Figure-3 shows a plot of the perceptron loss function. The perceptron loss is a special case of the hinge loss, which we will encounter when discussing support vector machines. Also note that the perceptron loss is related to the margin γ from Equation-2: minimizing it pushes the (negative) margin of the misclassified instances back towards zero.
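
For reference, Equation-7 is a single line of Python; the function name below is an arbitrary choice:

    import numpy as np

    def perceptron_loss(x, y, w):
        """Perceptron loss L_p(x, y, w) = max(0, -y * w^T x), as in Equation (7)."""
        return max(0.0, -y * np.dot(w, x))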

[Plot: the perceptron loss (vertical axis) as a function of y w^T x (horizontal axis).]

Figure 3: The perceptron loss function.

3 Inseparable Data
What happens when the data is not linearly separable? Based on our previous discussion, the training algorithm will not halt. We can impose a limit on the number of iterations to make sure that the algorithm stops, but in this case we have no guarantees about the final weight vector. A possible avenue is to change the loss function so that instead of looking for a perfect classification we find the "best fit" to the decision boundary. In essence, we are looking for a weight vector that does not perfectly separate the data but defines a decision boundary that makes as few misclassifications as possible on the training data.
In order to learn such a decision boundary using the perceptron, we need to first change the loss function; in this case we can use the squared loss between the true labels and the predictions of the perceptron. We will be working with an unthresholded perceptron, i.e., we will compute the output of the perceptron as:

    o(x) = w^T x

as opposed to sgn(w^T x). This is known as a linear unit. The squared loss function over the training data is given as:

    E(w) = (1/2) Σ_{i=1}^{N} (y_i − o_i)^2        (8)

We can optimize this loss function by using gradient descent. The gradient of E(.) can be calculated as:

    ∇_w E = Σ_{i=1}^{N} (y_i − o_i)(−x_i)

and the gradient descent update to w is then simply:

    w ← w − η ∇E

The gradient descent algorithm is listed in Algorithm-2.

Data: Training Data: (x_i, y_i); ∀i ∈ {1, 2, . . . , N}, Learning Rate: η
Result: Optimal hyperplane coefficients based on squared loss: w∗
Initialize w ← random weights;
repeat
    calculate ∇E;
    w ← w − η ∇E
until convergence;
Algorithm 2: Gradient descent for training a linear unit. Note that the update rule for the weight vector involves the calculation of the loss over the entire training set, as compared to the perceptron training algorithm, where we update the weight vector using one instance at a time.
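
A minimal Python/NumPy sketch of Algorithm-2 is given below; the fixed number of iterations used in place of the "until convergence" test, the learning rate, and the function name are illustrative choices, not part of the original notes:

    import numpy as np

    def linear_unit_batch_gd(X, y, eta=0.01, n_iters=1000):
        """Batch gradient descent for a linear unit o(x) = w^T x under squared loss.

        X : (N, m+1) array of augmented instances, y : (N,) array of targets.
        Each iteration computes the gradient of E(w) over the whole training set.
        """
        rng = np.random.default_rng(0)
        w = rng.normal(scale=0.01, size=X.shape[1])  # initialize with small random weights
        for _ in range(n_iters):
            o = X @ w                                # outputs o_i = w^T x_i for all i
            grad = -(y - o) @ X                      # gradient of E(w) = 1/2 * sum_i (y_i - o_i)^2 w.r.t. w
            w = w - eta * grad                       # gradient descent step
        return w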

As we are calculating the gradient (∇E) based on the entire training dataset, the resulting gradient descent algorithm is also called the batch gradient descent algorithm. We can modify the batch update to work with single examples, in which case the gradient is approximated as:

    ∇_w E(x_i) = (y_i − o_i)(−x_i)

This is also known as stochastic gradient descent where we update the parameters based on a single
example. The resulting training algorithm for a linear unit is shown in Algorithm-3.

Data: Training Data: (x_i, y_i); ∀i ∈ {1, 2, . . . , N}, Learning Rate: η
Result: Optimal hyperplane coefficients based on squared loss: w∗
Initialize w ← random weights;
repeat
    get example (x_i, y_i);
    o_i ← w^T x_i;
    w ← w + η(y_i − o_i) x_i
until convergence;
Algorithm 3: Stochastic gradient descent algorithm for training a linear unit.
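
A corresponding Python/NumPy sketch of Algorithm-3 follows; shuffling the examples in each epoch and stopping after a fixed number of epochs are common practices assumed here rather than specified in the notes:

    import numpy as np

    def linear_unit_sgd(X, y, eta=0.01, n_epochs=100, seed=0):
        """Stochastic gradient descent for a linear unit under squared loss.

        The weights are updated after every single example, approximating the
        batch gradient with (y_i - o_i)(-x_i) for the current instance only.
        """
        rng = np.random.default_rng(seed)
        w = rng.normal(scale=0.01, size=X.shape[1])  # initialize with small random weights
        for _ in range(n_epochs):
            for i in rng.permutation(len(X)):        # visit the examples in a random order
                o_i = np.dot(w, X[i])                # o_i = w^T x_i
                w = w + eta * (y[i] - o_i) * X[i]    # w <- w + eta * (y_i - o_i) * x_i
            # (stopping criterion omitted; a fixed number of epochs is used instead)
        return w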

Stochastic gradient descent works by processing a single example at a time, as opposed to batch gradient descent, which needs to process the entire dataset to carry out a single update. This makes each stochastic gradient descent update much cheaper than a batch update.

4 References
1. Andrew Ng’s notes on perceptron. (cs229.stanford.edu/notes/cs229-notes6.pdf)

2. Roni Khardon’s notes. (www.cs.tufts.edu/ roni/Teaching/CLT/LN/lecture14.pdf)

3. Machine Learning, Tom Mitchell.
