Lecture 2
Mitesh M. Khapra
Module 2.1: Biological Neurons
[Figure: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, an activation σ and output y]
The most fundamental unit of a deep neural network is called an artificial neuron
Why is it called a neuron ? Where does the inspiration come from ?
The inspiration comes from biology (more specifically, from the brain)
biological neurons = neural cells = neural processing units
We will first see what a biological neuron looks like ...
[Figure: Biological Neurons∗]
dendrite: receives signals from other neurons
synapse: point of connection to other neurons
soma: processes the information
axon: transmits the output of this neuron
∗ Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg
Let us see a very cartoonish illustration of how a neuron works
Our sense organs interact with the outside world
They relay information to the neurons
The neurons (may) get activated and produce a response (laughter in this case)
Of course, in reality, it is not just a single neuron which does all this
There is a massively parallel interconnected network of neurons
The sense organs relay information to the lowest layer of neurons
Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
These neurons may also fire (again, in red) and the process continues eventually resulting in a response (laughter in this case)
An average human brain has around 10^11 (100 billion) neurons!
This massively parallel network also ensures that there is division of work
Each neuron may perform a certain role or respond to a certain stimulus
[Figure: a simplified illustration]
The neurons in the brain are arranged in a hierarchy
We illustrate this with the help of the visual cortex (part of the brain) which deals with processing visual information
Starting from the retina, the information is relayed to several layers (follow the arrows)
We observe that the layers V1, V2 to AIT form a hierarchy (from identifying simple visual forms to high level objects)
Sample illustration of hierarchical processing∗
∗ Idea borrowed from Hugo Larochelle’s lecture slides
Disclaimer
I understand very little about how the brain works!
What you saw so far is an overly simplified explanation of how the brain works!
But this explanation suffices for the purpose of this course!
Module 2.2: McCulloch Pitts Neuron
[Figure: McCulloch Pitts neuron with inputs x1, x2, .., xn ∈ {0, 1}, aggregation g, decision f and output y ∈ {0, 1}]
McCulloch (neuroscientist) and Pitts (logician) proposed a highly simplified computational model of the neuron (1943)
g aggregates the inputs and the function f takes a decision based on this aggregation
The inputs can be excitatory or inhibitory
y = 0 if any xi is inhibitory, else
g(x1, x2, ..., xn) = g(x) = ∑_{i=1}^{n} xi
y = f(g(x)) = 1 if g(x) ≥ θ
            = 0 if g(x) < θ
θ is called the thresholding parameter
This is called Thresholding Logic
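Below is a minimal Python sketch (function names and structure assumed, not from the lecture) of this thresholding logic, including the inhibitory-input rule:

```python
# A McCulloch-Pitts neuron: binary inputs, optional inhibitory markers, threshold theta.
def mp_neuron(inputs, inhibitory, theta):
    # If any inhibitory input is on, the neuron cannot fire
    if any(x == 1 and inh for x, inh in zip(inputs, inhibitory)):
        return 0
    g = sum(inputs)                 # aggregation: g(x) = sum of the inputs
    return 1 if g >= theta else 0   # thresholding logic: f(g(x))

# AND of two inputs uses theta = 2, OR uses theta = 1
print(mp_neuron([1, 1], [False, False], theta=2))  # 1
print(mp_neuron([0, 1], [False, False], theta=1))  # 1
```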
Let us implement some boolean functions using this McCulloch Pitts (MP) neuron
...
[Figure: MP units implementing boolean functions]
A McCulloch Pitts unit: inputs x1, x2, x3, threshold θ
AND function: inputs x1, x2, x3, θ = 3
OR function: inputs x1, x2, x3, θ = 1
x1 AND !x2∗: inputs x1, x2, θ = 1
NOR function: inputs x1, x2, θ = 0
NOT function: input x1, θ = 0
∗ circle at the end indicates inhibitory input: if any inhibitory input is 1 the output will be 0
Can any boolean function be represented using a McCulloch Pitts unit ?
Before answering this question let us first see the geometric interpretation of an MP unit ...
A single MP neuron splits the input points (4 points for 2 binary inputs) into two halves: points lying on or above the line ∑_{i=1}^{n} xi − θ = 0 and points lying below this line
In other words, all inputs which produce an output 0 will be on one side (∑_{i=1}^{n} xi < θ) of the line and all inputs which produce an output 1 will lie on the other side (∑_{i=1}^{n} xi ≥ θ) of this line
OR function: x1 + x2 = ∑_{i=1}^{2} xi ≥ 1
[Figure: the four input points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane with the separating line x1 + x2 = θ = 1]
Let us convince ourselves about this with a few more examples (if it is not already clear from the math)
[Figure: two MP units and their separating lines in the (x1, x2) plane]
AND function: x1 + x2 = ∑_{i=1}^{2} xi ≥ 2, separating line x1 + x2 = θ = 2
Tautology (always ON): θ = 0, separating line x1 + x2 = θ = 0
What if we have more than 2 inputs? Well, instead of a line we will have a plane
For the OR function, we want a plane such that the point (0,0,0) lies on one side and the remaining 7 points lie on the other side of the plane
[Figure: OR over three inputs x1, x2, x3 with θ = 1; the 8 corners of the unit cube with the separating plane]
The story so far ...
A single McCulloch Pitts Neuron can be used to represent boolean functions which are linearly separable
Linear separability (for boolean functions): There exists a line (plane) such that all inputs which produce a 1 lie on one side of the line (plane) and all inputs which produce a 0 lie on the other side of the line (plane)
Module 2.3: Perceptron
The story ahead ...
What about non-boolean (say, real) inputs ?
Do we always need to hand code the threshold ?
Are all inputs equal ? What if we want to assign more weight (importance) to
some inputs ?
What about functions which are not linearly separable ?
[Figure: a perceptron with inputs x1, x2, .., xn, weights w1, w2, .., wn and output y]
Frank Rosenblatt, an American psychologist, proposed the classical perceptron model (1958)
A more general computational model than McCulloch–Pitts neurons
Main differences: introduction of numerical weights for inputs and a mechanism for learning these weights
Inputs are no longer limited to boolean values
Refined and carefully analyzed by Minsky and Papert (1969) - their model is referred to as the perceptron model here
[Figure: a perceptron with inputs x0 = 1, x1, x2, .., xn and weights w0 = −θ, w1, w2, .., wn]
y = 1 if ∑_{i=1}^{n} wi ∗ xi ≥ θ
  = 0 if ∑_{i=1}^{n} wi ∗ xi < θ
Rewriting the above,
y = 1 if ∑_{i=1}^{n} wi ∗ xi − θ ≥ 0
  = 0 if ∑_{i=1}^{n} wi ∗ xi − θ < 0
A more accepted convention,
y = 1 if ∑_{i=0}^{n} wi ∗ xi ≥ 0
  = 0 if ∑_{i=0}^{n} wi ∗ xi < 0
where, x0 = 1 and w0 = −θ
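A minimal Python sketch (assumed, not from the slides) of this decision rule, with the bias folded in as w0 = −θ and a constant input x0 = 1:

```python
import numpy as np

def perceptron_output(w, x):
    """w: [w0, w1, ..., wn]; x: [x1, ..., xn] (x0 = 1 is prepended here)."""
    x = np.concatenate(([1.0], x))         # prepend x0 = 1
    return 1 if np.dot(w, x) >= 0 else 0   # fire iff sum_{i=0}^{n} wi * xi >= 0

# Example: weights that realize OR (one of the solutions derived later in this lecture)
w = np.array([-1.0, 1.1, 1.1])
print([perceptron_output(w, np.array(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 1]
```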
We will now try to answer the following questions:
Why are we trying to implement boolean functions?
Why do we need weights ?
Why is w0 = −θ called the bias ?
[Figure: a perceptron with inputs x0 = 1, x1, x2, x3 and weights w0 = −θ, w1, w2, w3]
x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
Consider the task of predicting whether we would like a movie or not
Suppose, we base our decision on 3 inputs (binary, for simplicity)
Based on our past viewing experience (data), we may give a high weight to isDirectorNolan as compared to the other inputs
Specifically, even if the actor is not Matt Damon and the genre is not thriller we would still want to cross the threshold θ by assigning a high weight to isDirectorNolan
w0 is called the bias as it represents the prior (prejudice)
A movie buff may have a very low threshold and may watch any movie irrespective of the genre, actor, director [θ = 0]
What kind of functions can be implemented using the perceptron? Any difference
from McCulloch Pitts neurons?
McCulloch Pitts Neuron (assuming no inhibitory inputs)
y = 1 if ∑_{i=0}^{n} xi ≥ 0
  = 0 if ∑_{i=0}^{n} xi < 0

Perceptron
y = 1 if ∑_{i=0}^{n} wi ∗ xi ≥ 0
  = 0 if ∑_{i=0}^{n} wi ∗ xi < 0

From the equations it should be clear that even a perceptron separates the input space into two halves
All inputs which produce a 1 lie on one side and all inputs which produce a 0 lie on the other side
In other words, a single perceptron can only be used to implement linearly separable functions
Then what is the difference? The weights (including threshold) can be learned and the inputs can be real valued
We will first revisit some boolean functions and then see the perceptron learning algorithm (for learning weights)
x1  x2  OR  condition
0   0   0   w0 + ∑_{i=1}^{2} wi xi < 0
1   0   1   w0 + ∑_{i=1}^{2} wi xi ≥ 0
0   1   1   w0 + ∑_{i=1}^{2} wi xi ≥ 0
1   1   1   w0 + ∑_{i=1}^{2} wi xi ≥ 0

w0 + w1 · 0 + w2 · 0 < 0 =⇒ w0 < 0
w0 + w1 · 0 + w2 · 1 ≥ 0 =⇒ w2 ≥ −w0
w0 + w1 · 1 + w2 · 0 ≥ 0 =⇒ w1 ≥ −w0
w0 + w1 · 1 + w2 · 1 ≥ 0 =⇒ w1 + w2 ≥ −w0

[Figure: the four input points in the (x1, x2) plane with the separating line −1 + 1.1x1 + 1.1x2 = 0]
One possible solution to this set of inequalities is w0 = −1, w1 = 1.1, w2 = 1.1 (and various other solutions are possible)
Note that we can come up with a similar set of inequalities and find the value of θ for a McCulloch Pitts neuron also (Try it!)
Module 2.4: Errors and Error Surfaces
[Figure: the four input points in the (x1, x2) plane with the candidate lines −1 + (−1)x1 + (−1)x2 = 0, −1 + (0.45)x1 + (0.45)x2 = 0, −1 + (1.5)x1 + (0)x2 = 0 and the solution −1 + 1.1x1 + 1.1x2 = 0]
Let us fix the threshold (−w0 = 1) and try different values of w1, w2
Say, w1 = −1, w2 = −1
What is wrong with this line? We make an error on 3 out of the 4 inputs
Lets try some more values of w1, w2 and note how many errors we make

w1     w2     errors
-1     -1     3
1.5    0      1
0.45   0.45   3

We are interested in those values of w0, w1, w2 which result in 0 error
Let us plot the error surface corresponding to different values of w0, w1, w2
For ease of analysis, we will keep w0 fixed (−1) and plot the error for different values of w1, w2
For a given w0, w1, w2 we will compute w0 + w1 ∗ x1 + w2 ∗ x2 for all combinations of (x1, x2) and note down how many errors we make
For the OR function, an error occurs if (x1, x2) = (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 ≥ 0 or if (x1, x2) ≠ (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 < 0
We are interested in finding an algorithm which finds the values of w1, w2 which minimize this error
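A brute-force sketch in Python (assumed, not from the slides) of this error count for OR, with w0 fixed at −1:

```python
def or_errors(w0, w1, w2):
    errors = 0
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        predicted = 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0
        target = 0 if (x1, x2) == (0, 0) else 1      # OR truth table
        errors += int(predicted != target)
    return errors

# Reproduce the table above and check the solution found earlier
for w1, w2 in [(-1, -1), (1.5, 0), (0.45, 0.45), (1.1, 1.1)]:
    print(w1, w2, or_errors(-1, w1, w2))             # 3, 1, 3, 0
```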
Module 2.5: Perceptron Learning Algorithm
We will now see a more principled approach for learning these weights and
threshold but before that let us answer this question...
Apart from implementing boolean functions (which does not look very interesting) what can a perceptron be used for ?
Our interest lies in the use of perceptron as a binary classifier. Let us see what
this means...
[Figure: a perceptron with inputs x0 = 1, x1, x2, .., xn and weights w0 = −θ, w1, w2, .., wn]
x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
x4 = imdbRating (scaled to 0 to 1)
... ...
xn = criticsRating (scaled to 0 to 1)
Let us reconsider our problem of deciding whether to watch a movie or not
Suppose we are given a list of m movies and a label (class) associated with each movie indicating whether the user liked this movie or not: binary decision
Further, suppose we represent each movie with n features (some boolean, some real valued)
We will assume that the data is linearly separable and we want a perceptron to learn how to make this decision
In other words, we want the perceptron to find the equation of this separating plane (or find the values of w0, w1, w2, .., wn)
Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
Initialize w randomly;
while !convergence do
    Pick random x ∈ P ∪ N;
    if x ∈ P and ∑_{i=0}^{n} wi ∗ xi < 0 then
        w = w + x;
    end
    if x ∈ N and ∑_{i=0}^{n} wi ∗ xi ≥ 0 then
        w = w − x;
    end
end
//the algorithm converges when all the inputs are classified correctly

Why would this work ?
To understand why this works we will have to get into a bit of Linear Algebra and a bit of geometry...
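A runnable Python sketch of this algorithm (details assumed, not from the slides; it cycles through the points rather than sampling them at random, so the demo is deterministic):

```python
import numpy as np

def train_perceptron(P, N, max_epochs=1000, seed=0):
    """P, N: lists of input vectors that already include the constant component x0 = 1."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(P[0]))           # initialize w randomly
    for _ in range(max_epochs):
        converged = True
        for x, label in [(p, 1) for p in P] + [(n, 0) for n in N]:
            fired = np.dot(w, x) >= 0
            if label == 1 and not fired:         # x in P but w.x < 0
                w = w + x
                converged = False
            elif label == 0 and fired:           # x in N but w.x >= 0
                w = w - x
                converged = False
        if converged:                            # all inputs classified correctly
            return w
    return w

# Learning OR: each point is [x0 = 1, x1, x2]
P = [np.array([1., 0, 1]), np.array([1., 1, 0]), np.array([1., 1, 1])]
N = [np.array([1., 0, 0])]
w = train_perceptron(P, N)
print([int(np.dot(w, x) >= 0) for x in N + P])   # expect [0, 1, 1, 1]
```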
Consider two vectors w and x
w = [w0, w1, w2, ..., wn]
x = [1, x1, x2, ..., xn]
w · x = wT x = ∑_{i=0}^{n} wi ∗ xi
We can thus rewrite the perceptron rule as
y = 1 if wT x ≥ 0
  = 0 if wT x < 0
We are interested in finding the line wT x = 0 which divides the input space into two halves
Every point (x) on this line satisfies the equation wT x = 0
What can you tell about the angle (α) between w and any point (x) which lies on this line ?
The angle is 90° (∵ cos α = wT x / (||w|| ||x||) = 0)
Since the vector w is perpendicular to every point on the line it is actually perpendicular to the line itself
[Figure: the line wT x = 0 in the (x1, x2) plane with w perpendicular to it; positive points p1, p2, p3 on one side and negative points n1, n2, n3 on the other]
Consider some points (vectors) which lie in the positive half space of this line (i.e., wT x ≥ 0)
What will be the angle between any such vector and w ? Obviously, less than 90°
What about points (vectors) which lie in the negative half space of this line (i.e., wT x < 0) ?
What will be the angle between any such vector and w ? Obviously, greater than 90°
Of course, this also follows from the formula (cos α = wT x / (||w|| ||x||))
Keeping this picture in mind let us revisit the algorithm
For x ∈ P if w.x < 0 then it means that the angle (α) between this x and the current w is greater than 90° (but we want α to be less than 90°)
What happens to the new angle (αnew) when wnew = w + x ?
cos(αnew) ∝ wnewT x          (recall cos α = wT x / (||w|| ||x||))
          ∝ (w + x)T x
          ∝ wT x + xT x
          ∝ cos α + xT x
cos(αnew) > cos α
Thus αnew will be less than α and this is exactly what we want
For x ∈ N if w.x ≥ 0 then it means that the angle (α) between this x and the current w is less than 90° (but we want α to be greater than 90°)
What happens to the new angle (αnew) when wnew = w − x ?
cos(αnew) ∝ wnewT x
          ∝ (w − x)T x
          ∝ wT x − xT x
          ∝ cos α − xT x
cos(αnew) < cos α
Thus αnew will be greater than α and this is exactly what we want
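A small numeric check (vectors assumed for illustration, not from the slides) of both correction rules:

```python
import numpy as np

def cos_angle(w, x):
    return np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))

w = np.array([1.0, -2.0])

x_pos = np.array([1.0, 1.0])   # a point from P with w.x < 0 (angle > 90 degrees)
print(cos_angle(w, x_pos), cos_angle(w + x_pos, x_pos))   # cosine increases -> angle decreases

x_neg = np.array([2.0, 0.5])   # a point from N with w.x >= 0 (angle < 90 degrees)
print(cos_angle(w, x_neg), cos_angle(w - x_neg, x_neg))   # cosine decreases -> angle increases
```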
We will now see this algorithm in action for a toy dataset
[Figure: the toy dataset in the (x1, x2) plane with positive points p1, p2, p3, negative points n1, n2, n3 and the current w]
We initialized w to a random value
We observe that currently, w · x < 0 (∵ angle > 90°) for all the positive points and w · x ≥ 0 (∵ angle < 90°) for all the negative points (the situation is exactly opposite of what we actually want it to be)
We now run the algorithm by randomly going over the points
Randomly pick a point (say, p1), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p2), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n1), apply correction w = w − x ∵ w · x ≥ 0 (you can check the angle visually)
Module 2.6: Proof of Convergence
Now that we have some faith and intuition about why the algorithm works, we
will see a more formal proof of convergence ...
Theorem
Definition: Two sets P and N of points in an n-dimensional space are called absolutely linearly separable if n + 1 real numbers w0, w1, ..., wn exist such that every point (x1, x2, ..., xn) ∈ P satisfies ∑_{i=1}^{n} wi ∗ xi > w0 and every point (x1, x2, ..., xn) ∈ N satisfies ∑_{i=1}^{n} wi ∗ xi < w0.
Proposition: If the sets P and N are finite and linearly separable, the perceptron
learning algorithm updates the weight vector wt a finite number of times. In other
words: if the vectors in P and N are tested cyclically one after the other, a weight
vector wt is found after a finite number of steps t which can separate the two sets.
Setup:
If x ∈ N then −x ∈ P (∵ wT x < 0 =⇒ wT (−x) ≥ 0)
We can thus consider a single set P′ = P ∪ N− and for every element p ∈ P′ ensure that wT p ≥ 0
Further we will normalize all the p’s so that ||p|| = 1 (notice that this does not affect the solution ∵ if wT (p/||p||) ≥ 0 then wT p ≥ 0)
Let w∗ be the normalized solution vector (we know one exists as the data is linearly separable)

Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
N− contains negations of all points in N;
P′ ← P ∪ N−;
Initialize w randomly;
while !convergence do
    Pick random p ∈ P′;
    p ← p/||p|| (so now, ||p|| = 1);
    if w.p < 0 then
        w = w + p;
    end
end
//the algorithm converges when all the inputs are classified correctly
//notice that we do not need the other if condition because by construction we want all points in P′ to lie in the positive half space w.p ≥ 0
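A minimal Python sketch (assumed, not from the slides) of this modified algorithm, with N folded into P by negation and every point normalized:

```python
import numpy as np

def train_on_p_prime(P, N, max_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    # P' = P union N-, every point normalized to unit length
    P_prime = [p / np.linalg.norm(p) for p in P] + [-n / np.linalg.norm(n) for n in N]
    w = rng.standard_normal(len(P_prime[0]))          # initialize w randomly
    for _ in range(max_steps):
        if all(np.dot(w, p) >= 0 for p in P_prime):   # converged: every p in the positive half space
            return w
        p = P_prime[rng.integers(len(P_prime))]       # pick a random p in P'
        if np.dot(w, p) < 0:                          # only one correction branch is needed
            w = w + p
    return w
```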
Observations:
w∗ is some optimal solution which exists but we don’t know what it is
We do not make a correction at every time-step
We make a correction only if wt · pi ≤ 0 at that time step
So at time-step t we would have made only k (≤ t) corrections
Every time we make a correction a quantity δ gets added to the numerator
So by time-step t, a quantity kδ gets added to the numerator

Proof:
Now suppose at time step t we inspected the point pi and found that wt · pi ≤ 0
We make a correction wt+1 = wt + pi
Let β be the angle between w∗ and wt+1
cos β = (w∗ · wt+1) / ||wt+1||
Numerator = w∗ · wt+1 = w∗ · (wt + pi)
          = w∗ · wt + w∗ · pi
          ≥ w∗ · wt + δ        (δ = min{w∗ · pi | ∀i})
          ≥ w∗ · (wt−1 + pj) + δ
          ≥ w∗ · wt−1 + w∗ · pj + δ
          ≥ w∗ · wt−1 + 2δ
          ≥ w∗ · w0 + (k)δ     (By induction)
Proof (continued):
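The bound on the numerator above grows linearly in k. A sketch of the standard companion bound on the denominator, under the setup above (every p is normalized so ||p|| = 1, and a correction is made only when wt · pi ≤ 0):

```latex
\begin{align*}
\|w_{t+1}\|^2 &= \|w_t + p_i\|^2 = \|w_t\|^2 + 2\, w_t \cdot p_i + \|p_i\|^2 \\
              &\le \|w_t\|^2 + 1 \qquad (\because w_t \cdot p_i \le 0 \text{ and } \|p_i\| = 1) \\
              &\le \|w_0\|^2 + k \qquad \text{(by induction, after } k \text{ corrections)} \\[4pt]
\cos\beta &= \frac{w^{*} \cdot w_{t+1}}{\|w_{t+1}\|} \;\ge\; \frac{w^{*} \cdot w_0 + k\delta}{\sqrt{\|w_0\|^2 + k}}
\end{align*}
```

The right hand side grows proportionally to √k, but cos β can be at most 1, so k (the number of corrections) must be bounded; the algorithm therefore makes only a finite number of corrections and converges.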
Coming back to our questions ...
What about non-boolean (say, real) inputs? Real valued inputs are allowed in
perceptron
Do we always need to hand code the threshold? No, we can learn the threshold
Are all inputs equal? What if we want to assign more weight (importance) to
some inputs? A perceptron allows weights to be assigned to inputs
What about functions which are not linearly separable ? Not possible with a
single perceptron but we will see how to handle this ..
Module 2.7: Linearly Separable Boolean Functions
So what do we do about functions which are not linearly separable ?
Let us see one such simple boolean function first ...
x1  x2  XOR  condition
0   0   0    w0 + ∑_{i=1}^{2} wi xi < 0
1   0   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0
0   1   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0
1   1   0    w0 + ∑_{i=1}^{2} wi xi < 0

w0 + w1 · 0 + w2 · 0 < 0 =⇒ w0 < 0
w0 + w1 · 0 + w2 · 1 ≥ 0 =⇒ w2 ≥ −w0
w0 + w1 · 1 + w2 · 0 ≥ 0 =⇒ w1 ≥ −w0
w0 + w1 · 1 + w2 · 1 < 0 =⇒ w1 + w2 < −w0

The fourth condition contradicts conditions 2 and 3
Hence we cannot have a solution to this set of inequalities
[Figure: the four input points in the (x1, x2) plane; no line separates {(0,1), (1,0)} from {(0,0), (1,1)}]
And indeed you can see that it is impossible to draw a line which separates the red points from the blue points
Most real world data is not linearly separable and will always contain some outliers
In fact, sometimes there may not be any outliers but still the data may not be linearly separable
We need computational units (models) which can deal with such data
While a single perceptron cannot deal with such data, we will show that a network of perceptrons can indeed deal with such data
[Figure: a 2D dataset with a ring of negative points (o) surrounding a cluster of positive points (+), which is not linearly separable]
Before seeing how a network of perceptrons can deal with linearly inseparable
data, we will discuss boolean functions in some more detail ...
How many boolean functions can you design from 2 inputs ?
Let us begin with some easy ones which you already know ..
Of these, how many are linearly separable ? (turns out all except XOR and !XOR - feel free to verify)
In general, how many boolean functions can you have for n inputs ? 2^(2^n)
How many of these 2^(2^n) functions are not linearly separable ? For the time being, it suffices to know that at least some of these may not be linearly separable (I encourage you to figure out the exact answer :-) )
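A brute-force sketch in Python (the weight grid is an assumption; for 2 binary inputs small integer weights are enough to realize every linearly separable function) that enumerates all 2^(2^2) = 16 boolean functions of 2 inputs and finds the ones that are not linearly separable:

```python
from itertools import product

points = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Truth tables realizable by a single perceptron with weights from a small integer grid
separable = set()
for w0, w1, w2 in product(range(-2, 3), repeat=3):
    separable.add(tuple(1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0 for x1, x2 in points))

all_tables = list(product([0, 1], repeat=4))        # all 16 boolean functions of 2 inputs
not_separable = [t for t in all_tables if t not in separable]
print(not_separable)                                # [(0, 1, 1, 0), (1, 0, 0, 1)] i.e. XOR and !XOR
```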
Module 2.8: Representation Power of a Network of Perceptrons
We will now see how to implement any boolean function using a network of
perceptrons ...
For this discussion, we will assume True = +1 and False = -1
[Figure: a network with 2 inputs x1, x2, a layer of 4 perceptrons (each with bias = -2), layer 2 weights w1, w2, w3, w4 and an output perceptron y; a red edge indicates w = -1, a blue edge indicates w = +1]
We consider 2 inputs and 4 perceptrons
Each input is connected to all the 4 perceptrons with specific weights
The bias (w0) of each perceptron is -2 (i.e., each perceptron will fire only if the weighted sum of its input is ≥ 2)
Each of these perceptrons is connected to an output perceptron by weights (which need to be learned)
The output of this perceptron (y) is the output of this network
Terminology:
[Figure: the same network with hidden outputs h1, h2, h3, h4 and layer 2 weights w1, w2, w3, w4]
This network contains 3 layers
The layer containing the inputs (x1, x2) is called the input layer
The middle layer containing the 4 perceptrons is called the hidden layer
The final layer containing one output neuron is called the output layer
The outputs of the 4 perceptrons in the hidden layer are denoted by h1, h2, h3, h4
The red and blue edges are called layer 1 weights
w1, w2, w3, w4 are called layer 2 weights
[Figure: the same network; the hidden perceptrons h1, h2, h3, h4 correspond to the input patterns (-1,-1), (-1,1), (1,-1), (1,1)]
We claim that this network can be used to implement any boolean function (linearly separable or not) !
In other words, we can find w1, w2, w3, w4 such that the truth table of any boolean function can be represented by this network
Astonishing claim! Well, not really, if you understand what is going on
Each perceptron in the middle layer fires only for a specific input (and no two perceptrons fire for the same input)
the first perceptron fires for {-1,-1}
the second perceptron fires for {-1,1}
the third perceptron fires for {1,-1}
the fourth perceptron fires for {1,1}
Let w0 be the bias of the output neuron (i.e., it will fire if ∑_{i=1}^{4} wi hi ≥ w0)
[Figure: the same network as before]

x1  x2  XOR  h1  h2  h3  h4  ∑_{i=1}^{4} wi hi
0   0   0    1   0   0   0   w1
0   1   1    0   1   0   0   w2
1   0   1    0   0   1   0   w3
1   1   0    0   0   0   1   w4
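A minimal Python sketch of this construction (the weight values and helper names are assumed, not from the slides): each hidden perceptron fires for exactly one input pattern, and the layer 2 weights w1, .., w4 then select the rows of the truth table, here for XOR:

```python
import numpy as np

# Inputs use True = +1, False = -1; each hidden perceptron has bias -2,
# so it fires only when both of its weighted inputs match its pattern.
HIDDEN_WEIGHTS = np.array([[-1, -1],   # fires only for (-1, -1)
                           [-1, +1],   # fires only for (-1, +1)
                           [+1, -1],   # fires only for (+1, -1)
                           [+1, +1]])  # fires only for (+1, +1)

def hidden_layer(x1, x2):
    return (HIDDEN_WEIGHTS @ np.array([x1, x2]) - 2 >= 0).astype(int)

def network(x1, x2, w, w0):
    h = hidden_layer(x1, x2)
    return int(np.dot(w, h) >= w0)     # output perceptron fires if sum(wi * hi) >= w0

# XOR: make only the rows for (-1,+1) and (+1,-1) cross the output threshold,
# e.g. w = [0, 1, 1, 0] with w0 = 1 (one possible choice)
w, w0 = np.array([0, 1, 1, 0]), 1
for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, network(x1, x2, w, w0))   # 0, 1, 1, 0
```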
What if we have more than 2 inputs ?
Again each of the 8 perceptrons will fire only for one of the 8 inputs
Each of the 8 weights in the second layer is responsible for one of the 8 inputs and can be adjusted to produce the desired output for that input
[Figure: a network with 3 inputs x1, x2, x3, a layer of 8 perceptrons (each with bias = -3), layer 2 weights w1, .., w8 and an output perceptron y]
What if we have n inputs ?
Theorem
Any boolean function of n inputs can be represented exactly by a network of perceptrons containing 1 hidden layer with 2^n perceptrons and one output layer containing 1 perceptron
Again, why do we care about boolean functions ?
How does this help us with our original problem, which was to predict whether we like a movie or not? Let us see!
[Figure: the 3-input network from the previous slide (layer 2 weights w1, .., w8, bias = -3)]
We are given this data about our past movie experience
For each movie, we are given the values of the various factors (x1, x2, . . . , xn) that we base our decision on and we are also given the value of y (like/dislike)
pi's are the points for which the output was 1 and ni's are the points for which it was 0

p1:  x11  x12  . . .  x1n   y1 = 1
p2:  x21  x22  . . .  x2n   y2 = 1
..   ..   ..   . . .  ..    ..
n1:  xk1  xk2  . . .  xkn   yk = 0
n2:  xj1  xj2  . . .  xjn   yj = 0
..   ..   ..   . . .  ..    ..

The data may or may not be linearly separable
The proof that we just saw tells us that it is possible to have a network of perceptrons and learn the weights in this network such that for any given pi or nj the output of the network will be the same as yi or yj (i.e., we can separate the positive and the negative points)
The story so far ...
Networks of the form that we just saw (containing an input layer, an output layer and one or more hidden layers) are called Multilayer Perceptrons (MLP, in short)
More appropriate terminology would be "Multilayered Network of Perceptrons" but MLP is the more commonly used name
The theorem that we just saw gives us the representation power of an MLP with a single hidden layer
Specifically, it tells us that an MLP with a single hidden layer can represent any boolean function