Lecture 2
Mitesh M. Khapra
Module 2.1: Biological Neurons
[Figure: an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, an activation σ and output y]
The most fundamental unit of a deep neural network is called an artificial neuron
Why is it called a neuron ? Where does the inspiration come from ?
The inspiration comes from biology (more specifically, from the brain)
biological neurons = neural cells = neural processing units
We will first see what a biological neuron looks like ...
[Figure: Biological Neurons∗]
dendrite: receives signals from other neurons
synapse: point of connection to other neurons
soma: processes the information
axon: transmits the output of this neuron
∗ Image adapted from https://cdn.vectorstock.com/i/composite/12,25/neuron-cell-vector-81225.jpg
Let us see a very cartoonish illustration of how a neuron works
Our sense organs interact with the outside world
They relay information to the neurons
The neurons (may) get activated and produce a response (laughter in this case)
Of course, in reality, it is not just a single neuron which does all this
There is a massively parallel interconnected network of neurons
The sense organs relay information to the lowest layer of neurons
Some of these neurons may fire (in red) in response to this information and in turn relay information to other neurons they are connected to
These neurons may also fire (again, in red) and the process continues eventually resulting in a response (laughter in this case)
An average human brain has around 10^11 (100 billion) neurons!
This massively parallel network also ensures that there is division of work
Each neuron may perform a certain role or respond to a certain stimulus
[Figure: a simplified illustration]
The neurons in the brain are arranged in a hierarchy
We illustrate this with the help of the visual cortex (part of the brain) which deals with processing visual information
Starting from the retina, the information is relayed to several layers (follow the arrows)
We observe that the layers V1, V2 to AIT form a hierarchy (from identifying simple visual forms to high level objects)
Sample illustration of hierarchical processing∗
∗ Idea borrowed from Hugo Larochelle’s lecture slides
Disclaimer
I understand very little about how the brain works!
What you saw so far is an overly simplified explanation of how the brain works!
But this explanation suffices for the purpose of this course!
Module 2.2: McCulloch Pitts Neuron
[Figure: McCulloch Pitts neuron with inputs x1, x2, .., xn ∈ {0, 1}, aggregation g, decision f and output y ∈ {0, 1}]
McCulloch (neuroscientist) and Pitts (logician) proposed a highly simplified computational model of the neuron (1943)
g aggregates the inputs and the function f takes a decision based on this aggregation
The inputs can be excitatory or inhibitory
y = 0 if any xi is inhibitory, else
g(x1, x2, ..., xn) = g(x) = ∑_{i=1}^{n} xi
y = f(g(x)) = 1 if g(x) ≥ θ
            = 0 if g(x) < θ
θ is called the thresholding parameter
This is called Thresholding Logic
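Below is a minimal Python sketch (function names and structure assumed, not from the lecture) of this thresholding logic, including the inhibitory-input rule:

```python
# A McCulloch-Pitts neuron: binary inputs, optional inhibitory markers, threshold theta.
def mp_neuron(inputs, inhibitory, theta):
    # If any inhibitory input is on, the neuron cannot fire
    if any(x == 1 and inh for x, inh in zip(inputs, inhibitory)):
        return 0
    g = sum(inputs)                 # aggregation: g(x) = sum of the inputs
    return 1 if g >= theta else 0   # thresholding logic: f(g(x))

# AND of two inputs uses theta = 2, OR uses theta = 1
print(mp_neuron([1, 1], [False, False], theta=2))  # 1
print(mp_neuron([0, 1], [False, False], theta=1))  # 1
```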
Let us implement some boolean functions using this McCulloch Pitts (MP) neuron
...
[Figure: MP units implementing boolean functions]
A McCulloch Pitts unit: inputs x1, x2, x3, threshold θ
AND function: inputs x1, x2, x3, θ = 3
OR function: inputs x1, x2, x3, θ = 1
x1 AND !x2∗: inputs x1, x2, θ = 1
NOR function: inputs x1, x2, θ = 0
NOT function: input x1, θ = 0
∗ circle at the end indicates inhibitory input: if any inhibitory input is 1 the output will be 0
Can any boolean function be represented using a McCulloch Pitts unit ?
Before answering this question let us first see the geometric interpretation of an MP unit ...
A single MP neuron splits the input points (4 points for 2 binary inputs) into two halves: points lying on or above the line ∑_{i=1}^{n} xi − θ = 0 and points lying below this line
In other words, all inputs which produce an output 0 will be on one side (∑_{i=1}^{n} xi < θ) of the line and all inputs which produce an output 1 will lie on the other side (∑_{i=1}^{n} xi ≥ θ) of this line
OR function: x1 + x2 = ∑_{i=1}^{2} xi ≥ 1
[Figure: the four input points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane with the separating line x1 + x2 = θ = 1]
Let us convince ourselves about this with a few more examples (if it is not already clear from the math)
[Figure: two MP units and their separating lines in the (x1, x2) plane]
AND function: x1 + x2 = ∑_{i=1}^{2} xi ≥ 2, separating line x1 + x2 = θ = 2
Tautology (always ON): θ = 0, separating line x1 + x2 = θ = 0
What if we have more than 2 inputs? Well, instead of a line we will have a plane
For the OR function, we want a plane such that the point (0,0,0) lies on one side and the remaining 7 points lie on the other side of the plane
[Figure: OR over three inputs x1, x2, x3 with θ = 1; the 8 corners of the unit cube with the separating plane]
The story so far ...
A single McCulloch Pitts Neuron can be used to represent boolean functions which are linearly separable
Linear separability (for boolean functions): There exists a line (plane) such that all inputs which produce a 1 lie on one side of the line (plane) and all inputs which produce a 0 lie on the other side of the line (plane)
Module 2.3: Perceptron
The story ahead ...
What about non-boolean (say, real) inputs ?
Do we always need to hand code the threshold ?
Are all inputs equal ? What if we want to assign more weight (importance) to
some inputs ?
What about functions which are not linearly separable ?
[Figure: a perceptron with inputs x1, x2, .., xn, weights w1, w2, .., wn and output y]
Frank Rosenblatt, an American psychologist, proposed the classical perceptron model (1958)
A more general computational model than McCulloch–Pitts neurons
Main differences: introduction of numerical weights for inputs and a mechanism for learning these weights
Inputs are no longer limited to boolean values
Refined and carefully analyzed by Minsky and Papert (1969) - their model is referred to as the perceptron model here
[Figure: a perceptron with inputs x0 = 1, x1, x2, .., xn and weights w0 = −θ, w1, w2, .., wn]
y = 1 if ∑_{i=1}^{n} wi ∗ xi ≥ θ
  = 0 if ∑_{i=1}^{n} wi ∗ xi < θ
Rewriting the above,
y = 1 if ∑_{i=1}^{n} wi ∗ xi − θ ≥ 0
  = 0 if ∑_{i=1}^{n} wi ∗ xi − θ < 0
A more accepted convention,
y = 1 if ∑_{i=0}^{n} wi ∗ xi ≥ 0
  = 0 if ∑_{i=0}^{n} wi ∗ xi < 0
where, x0 = 1 and w0 = −θ
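A minimal Python sketch (assumed, not from the slides) of this decision rule, with the bias folded in as w0 = −θ and a constant input x0 = 1:

```python
import numpy as np

def perceptron_output(w, x):
    """w: [w0, w1, ..., wn]; x: [x1, ..., xn] (x0 = 1 is prepended here)."""
    x = np.concatenate(([1.0], x))         # prepend x0 = 1
    return 1 if np.dot(w, x) >= 0 else 0   # fire iff sum_{i=0}^{n} wi * xi >= 0

# Example: weights that realize OR (one of the solutions derived later in this lecture)
w = np.array([-1.0, 1.1, 1.1])
print([perceptron_output(w, np.array(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 1]
```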
We will now try to answer the following questions:
Why are we trying to implement boolean functions?
Why do we need weights ?
Why is w0 = −θ called the bias ?
[Figure: a perceptron with inputs x0 = 1, x1, x2, x3 and weights w0 = −θ, w1, w2, w3]
x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
Consider the task of predicting whether we would like a movie or not
Suppose, we base our decision on 3 inputs (binary, for simplicity)
Based on our past viewing experience (data), we may give a high weight to isDirectorNolan as compared to the other inputs
Specifically, even if the actor is not Matt Damon and the genre is not thriller we would still want to cross the threshold θ by assigning a high weight to isDirectorNolan
w0 is called the bias as it represents the prior (prejudice)
A movie buff may have a very low threshold and may watch any movie irrespective of the genre, actor, director [θ = 0]
What kind of functions can be implemented using the perceptron? Any difference
from McCulloch Pitts neurons?
McCulloch Pitts Neuron (assuming no inhibitory inputs)
y = 1 if ∑_{i=0}^{n} xi ≥ 0
  = 0 if ∑_{i=0}^{n} xi < 0

Perceptron
y = 1 if ∑_{i=0}^{n} wi ∗ xi ≥ 0
  = 0 if ∑_{i=0}^{n} wi ∗ xi < 0

From the equations it should be clear that even a perceptron separates the input space into two halves
All inputs which produce a 1 lie on one side and all inputs which produce a 0 lie on the other side
In other words, a single perceptron can only be used to implement linearly separable functions
Then what is the difference? The weights (including threshold) can be learned and the inputs can be real valued
We will first revisit some boolean functions and then see the perceptron learning algorithm (for learning weights)
x1  x2  OR  condition
0   0   0   w0 + ∑_{i=1}^{2} wi xi < 0
1   0   1   w0 + ∑_{i=1}^{2} wi xi ≥ 0
0   1   1   w0 + ∑_{i=1}^{2} wi xi ≥ 0
1   1   1   w0 + ∑_{i=1}^{2} wi xi ≥ 0

w0 + w1 · 0 + w2 · 0 < 0 =⇒ w0 < 0
w0 + w1 · 0 + w2 · 1 ≥ 0 =⇒ w2 ≥ −w0
w0 + w1 · 1 + w2 · 0 ≥ 0 =⇒ w1 ≥ −w0
w0 + w1 · 1 + w2 · 1 ≥ 0 =⇒ w1 + w2 ≥ −w0

[Figure: the four input points in the (x1, x2) plane with the separating line −1 + 1.1x1 + 1.1x2 = 0]
One possible solution to this set of inequalities is w0 = −1, w1 = 1.1, w2 = 1.1 (and various other solutions are possible)
Note that we can come up with a similar set of inequalities and find the value of θ for a McCulloch Pitts neuron also (Try it!)
Module 2.4: Errors and Error Surfaces
[Figure: the four input points in the (x1, x2) plane with the candidate lines −1 + (−1)x1 + (−1)x2 = 0, −1 + (0.45)x1 + (0.45)x2 = 0, −1 + (1.5)x1 + (0)x2 = 0 and the solution −1 + 1.1x1 + 1.1x2 = 0]
Let us fix the threshold (−w0 = 1) and try different values of w1, w2
Say, w1 = −1, w2 = −1
What is wrong with this line? We make an error on 3 out of the 4 inputs
Lets try some more values of w1, w2 and note how many errors we make

w1     w2     errors
-1     -1     3
1.5    0      1
0.45   0.45   3

We are interested in those values of w0, w1, w2 which result in 0 error
Let us plot the error surface corresponding to different values of w0, w1, w2
For ease of analysis, we will keep w0 fixed (−1) and plot the error for different values of w1, w2
For a given w0, w1, w2 we will compute w0 + w1 ∗ x1 + w2 ∗ x2 for all combinations of (x1, x2) and note down how many errors we make
For the OR function, an error occurs if (x1, x2) = (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 ≥ 0 or if (x1, x2) ≠ (0, 0) but w0 + w1 ∗ x1 + w2 ∗ x2 < 0
We are interested in finding an algorithm which finds the values of w1, w2 which minimize this error
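A brute-force sketch in Python (assumed, not from the slides) of this error count for OR, with w0 fixed at −1:

```python
def or_errors(w0, w1, w2):
    errors = 0
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        predicted = 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0
        target = 0 if (x1, x2) == (0, 0) else 1      # OR truth table
        errors += int(predicted != target)
    return errors

# Reproduce the table above and check the solution found earlier
for w1, w2 in [(-1, -1), (1.5, 0), (0.45, 0.45), (1.1, 1.1)]:
    print(w1, w2, or_errors(-1, w1, w2))             # 3, 1, 3, 0
```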
Module 2.5: Perceptron Learning Algorithm
We will now see a more principled approach for learning these weights and
threshold but before that let us answer this question...
Apart from implementing boolean functions (which does not look very interesting) what can a perceptron be used for ?
Our interest lies in the use of perceptron as a binary classifier. Let us see what
this means...
[Figure: a perceptron with inputs x0 = 1, x1, x2, .., xn and weights w0 = −θ, w1, w2, .., wn]
x1 = isActorDamon
x2 = isGenreThriller
x3 = isDirectorNolan
x4 = imdbRating (scaled to 0 to 1)
... ...
xn = criticsRating (scaled to 0 to 1)
Let us reconsider our problem of deciding whether to watch a movie or not
Suppose we are given a list of m movies and a label (class) associated with each movie indicating whether the user liked this movie or not: binary decision
Further, suppose we represent each movie with n features (some boolean, some real valued)
We will assume that the data is linearly separable and we want a perceptron to learn how to make this decision
In other words, we want the perceptron to find the equation of this separating plane (or find the values of w0, w1, w2, .., wn)
Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
Initialize w randomly;
while !convergence do
    Pick random x ∈ P ∪ N;
    if x ∈ P and ∑_{i=0}^{n} wi ∗ xi < 0 then
        w = w + x;
    end
    if x ∈ N and ∑_{i=0}^{n} wi ∗ xi ≥ 0 then
        w = w − x;
    end
end
//the algorithm converges when all the inputs are classified correctly

Why would this work ?
To understand why this works we will have to get into a bit of Linear Algebra and a bit of geometry...
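A runnable Python sketch of this algorithm (details assumed, not from the slides; it cycles through the points rather than sampling them at random, so the demo is deterministic):

```python
import numpy as np

def train_perceptron(P, N, max_epochs=1000, seed=0):
    """P, N: lists of input vectors that already include the constant component x0 = 1."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(P[0]))           # initialize w randomly
    for _ in range(max_epochs):
        converged = True
        for x, label in [(p, 1) for p in P] + [(n, 0) for n in N]:
            fired = np.dot(w, x) >= 0
            if label == 1 and not fired:         # x in P but w.x < 0
                w = w + x
                converged = False
            elif label == 0 and fired:           # x in N but w.x >= 0
                w = w - x
                converged = False
        if converged:                            # all inputs classified correctly
            return w
    return w

# Learning OR: each point is [x0 = 1, x1, x2]
P = [np.array([1., 0, 1]), np.array([1., 1, 0]), np.array([1., 1, 1])]
N = [np.array([1., 0, 0])]
w = train_perceptron(P, N)
print([int(np.dot(w, x) >= 0) for x in N + P])   # expect [0, 1, 1, 1]
```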
Consider two vectors w and x
w = [w0, w1, w2, ..., wn]
x = [1, x1, x2, ..., xn]
w · x = wT x = ∑_{i=0}^{n} wi ∗ xi
We can thus rewrite the perceptron rule as
y = 1 if wT x ≥ 0
  = 0 if wT x < 0
We are interested in finding the line wT x = 0 which divides the input space into two halves
Every point (x) on this line satisfies the equation wT x = 0
What can you tell about the angle (α) between w and any point (x) which lies on this line ?
The angle is 90° (∵ cos α = wT x / (||w|| ||x||) = 0)
Since the vector w is perpendicular to every point on the line it is actually perpendicular to the line itself
[Figure: the line wT x = 0 in the (x1, x2) plane with w perpendicular to it; positive points p1, p2, p3 on one side and negative points n1, n2, n3 on the other]
Consider some points (vectors) which lie in the positive half space of this line (i.e., wT x ≥ 0)
What will be the angle between any such vector and w ? Obviously, less than 90°
What about points (vectors) which lie in the negative half space of this line (i.e., wT x < 0) ?
What will be the angle between any such vector and w ? Obviously, greater than 90°
Of course, this also follows from the formula (cos α = wT x / (||w|| ||x||))
Keeping this picture in mind let us revisit the algorithm
For x ∈ P if w.x < 0 then it means that the angle (α) between this x and the current w is greater than 90° (but we want α to be less than 90°)
What happens to the new angle (αnew) when wnew = w + x ?
cos(αnew) ∝ wnewT x          (recall cos α = wT x / (||w|| ||x||))
          ∝ (w + x)T x
          ∝ wT x + xT x
          ∝ cos α + xT x
cos(αnew) > cos α
Thus αnew will be less than α and this is exactly what we want
For x ∈ N if w.x ≥ 0 then it means that the angle (α) between this x and the current w is less than 90° (but we want α to be greater than 90°)
What happens to the new angle (αnew) when wnew = w − x ?
cos(αnew) ∝ wnewT x
          ∝ (w − x)T x
          ∝ wT x − xT x
          ∝ cos α − xT x
cos(αnew) < cos α
Thus αnew will be greater than α and this is exactly what we want
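A small numeric check (vectors assumed for illustration, not from the slides) of both correction rules:

```python
import numpy as np

def cos_angle(w, x):
    return np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))

w = np.array([1.0, -2.0])

x_pos = np.array([1.0, 1.0])   # a point from P with w.x < 0 (angle > 90 degrees)
print(cos_angle(w, x_pos), cos_angle(w + x_pos, x_pos))   # cosine increases -> angle decreases

x_neg = np.array([2.0, 0.5])   # a point from N with w.x >= 0 (angle < 90 degrees)
print(cos_angle(w, x_neg), cos_angle(w - x_neg, x_neg))   # cosine decreases -> angle increases
```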
We will now see this algorithm in action for a toy dataset
[Figure: the toy dataset in the (x1, x2) plane with positive points p1, p2, p3, negative points n1, n2, n3 and the current w]
We initialized w to a random value
We observe that currently, w · x < 0 (∵ angle > 90°) for all the positive points and w · x ≥ 0 (∵ angle < 90°) for all the negative points (the situation is exactly opposite of what we actually want it to be)
We now run the algorithm by randomly going over the points
Randomly pick a point (say, p1), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, p2), apply correction w = w + x ∵ w · x < 0 (you can check the angle visually)
Randomly pick a point (say, n1), apply correction w = w − x ∵ w · x ≥ 0 (you can check the angle visually)
Module 2.6: Proof of Convergence
Now that we have some faith and intuition about why the algorithm works, we
will see a more formal proof of convergence ...
Theorem
Definition: Two sets P and N of points in an n-dimensional space are called absolutely linearly separable if n + 1 real numbers w0, w1, ..., wn exist such that every point (x1, x2, ..., xn) ∈ P satisfies ∑_{i=1}^{n} wi ∗ xi > w0 and every point (x1, x2, ..., xn) ∈ N satisfies ∑_{i=1}^{n} wi ∗ xi < w0.
Proposition: If the sets P and N are finite and linearly separable, the perceptron
learning algorithm updates the weight vector wt a finite number of times. In other
words: if the vectors in P and N are tested cyclically one after the other, a weight
vector wt is found after a finite number of steps t which can separate the two sets.
Setup:
If x ∈ N then −x ∈ P (∵ wT x < 0 =⇒ wT (−x) ≥ 0)
We can thus consider a single set P′ = P ∪ N− and for every element p ∈ P′ ensure that wT p ≥ 0
Further we will normalize all the p’s so that ||p|| = 1 (notice that this does not affect the solution ∵ if wT (p/||p||) ≥ 0 then wT p ≥ 0)
Let w∗ be the normalized solution vector (we know one exists as the data is linearly separable)

Algorithm: Perceptron Learning Algorithm
P ← inputs with label 1;
N ← inputs with label 0;
N− contains negations of all points in N;
P′ ← P ∪ N−;
Initialize w randomly;
while !convergence do
    Pick random p ∈ P′;
    p ← p/||p|| (so now, ||p|| = 1);
    if w.p < 0 then
        w = w + p;
    end
end
//the algorithm converges when all the inputs are classified correctly
//notice that we do not need the other if condition because by construction we want all points in P′ to lie in the positive half space w.p ≥ 0
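A minimal Python sketch (assumed, not from the slides) of this modified algorithm, with N folded into P by negation and every point normalized:

```python
import numpy as np

def train_on_p_prime(P, N, max_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    # P' = P union N-, every point normalized to unit length
    P_prime = [p / np.linalg.norm(p) for p in P] + [-n / np.linalg.norm(n) for n in N]
    w = rng.standard_normal(len(P_prime[0]))          # initialize w randomly
    for _ in range(max_steps):
        if all(np.dot(w, p) >= 0 for p in P_prime):   # converged: every p in the positive half space
            return w
        p = P_prime[rng.integers(len(P_prime))]       # pick a random p in P'
        if np.dot(w, p) < 0:                          # only one correction branch is needed
            w = w + p
    return w
```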
Observations:
w∗ is some optimal solution which exists but we don’t know what it is
We do not make a correction at every time-step
We make a correction only if wt · pi ≤ 0 at that time step
So at time-step t we would have made only k (≤ t) corrections
Every time we make a correction a quantity δ gets added to the numerator
So by time-step t, a quantity kδ gets added to the numerator

Proof:
Now suppose at time step t we inspected the point pi and found that wt · pi ≤ 0
We make a correction wt+1 = wt + pi
Let β be the angle between w∗ and wt+1
cos β = (w∗ · wt+1) / ||wt+1||
Numerator = w∗ · wt+1 = w∗ · (wt + pi)
          = w∗ · wt + w∗ · pi
          ≥ w∗ · wt + δ        (δ = min{w∗ · pi | ∀i})
          ≥ w∗ · (wt−1 + pj) + δ
          ≥ w∗ · wt−1 + w∗ · pj + δ
          ≥ w∗ · wt−1 + 2δ
          ≥ w∗ · w0 + (k)δ     (By induction)
Proof (continued):
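The bound on the numerator above grows linearly in k. A sketch of the standard companion bound on the denominator, under the setup above (every p is normalized so ||p|| = 1, and a correction is made only when wt · pi ≤ 0):

```latex
\begin{align*}
\|w_{t+1}\|^2 &= \|w_t + p_i\|^2 = \|w_t\|^2 + 2\, w_t \cdot p_i + \|p_i\|^2 \\
              &\le \|w_t\|^2 + 1 \qquad (\because w_t \cdot p_i \le 0 \text{ and } \|p_i\| = 1) \\
              &\le \|w_0\|^2 + k \qquad \text{(by induction, after } k \text{ corrections)} \\[4pt]
\cos\beta &= \frac{w^{*} \cdot w_{t+1}}{\|w_{t+1}\|} \;\ge\; \frac{w^{*} \cdot w_0 + k\delta}{\sqrt{\|w_0\|^2 + k}}
\end{align*}
```

The right hand side grows proportionally to √k, but cos β can be at most 1, so k (the number of corrections) must be bounded; the algorithm therefore makes only a finite number of corrections and converges.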
Coming back to our questions ...
What about non-boolean (say, real) inputs? Real valued inputs are allowed in
perceptron
Do we always need to hand code the threshold? No, we can learn the threshold
Are all inputs equal? What if we want to assign more weight (importance) to
some inputs? A perceptron allows weights to be assigned to inputs
What about functions which are not linearly separable ? Not possible with a
single perceptron but we will see how to handle this ..
Module 2.7: Linearly Separable Boolean Functions
So what do we do about functions which are not linearly separable ?
Let us see one such simple boolean function first ...
x1  x2  XOR  condition
0   0   0    w0 + ∑_{i=1}^{2} wi xi < 0
1   0   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0
0   1   1    w0 + ∑_{i=1}^{2} wi xi ≥ 0
1   1   0    w0 + ∑_{i=1}^{2} wi xi < 0

w0 + w1 · 0 + w2 · 0 < 0 =⇒ w0 < 0
w0 + w1 · 0 + w2 · 1 ≥ 0 =⇒ w2 ≥ −w0
w0 + w1 · 1 + w2 · 0 ≥ 0 =⇒ w1 ≥ −w0
w0 + w1 · 1 + w2 · 1 < 0 =⇒ w1 + w2 < −w0

The fourth condition contradicts conditions 2 and 3
Hence we cannot have a solution to this set of inequalities
[Figure: the four input points in the (x1, x2) plane; no line separates {(0,1), (1,0)} from {(0,0), (1,1)}]
And indeed you can see that it is impossible to draw a line which separates the red points from the blue points
Most real world data is not linearly separable and will always contain some outliers
In fact, sometimes there may not be any outliers but still the data may not be linearly separable
We need computational units (models) which can deal with such data
While a single perceptron cannot deal with such data, we will show that a network of perceptrons can indeed deal with such data
[Figure: a 2D dataset with a ring of negative points (o) surrounding a cluster of positive points (+), which is not linearly separable]
Before seeing how a network of perceptrons can deal with linearly inseparable
data, we will discuss boolean functions in some more detail ...
How many boolean functions can you design from 2 inputs ?
Let us begin with some easy ones which you already know ..
Of these, how many are linearly separable ? (turns out all except XOR and !XOR - feel free to verify)
In general, how many boolean functions can you have for n inputs ? 2^(2^n)
How many of these 2^(2^n) functions are not linearly separable ? For the time being, it suffices to know that at least some of these may not be linearly separable (I encourage you to figure out the exact answer :-) )
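A brute-force sketch in Python (the weight grid is an assumption; for 2 binary inputs small integer weights are enough to realize every linearly separable function) that enumerates all 2^(2^2) = 16 boolean functions of 2 inputs and finds the ones that are not linearly separable:

```python
from itertools import product

points = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Truth tables realizable by a single perceptron with weights from a small integer grid
separable = set()
for w0, w1, w2 in product(range(-2, 3), repeat=3):
    separable.add(tuple(1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0 for x1, x2 in points))

all_tables = list(product([0, 1], repeat=4))        # all 16 boolean functions of 2 inputs
not_separable = [t for t in all_tables if t not in separable]
print(not_separable)                                # [(0, 1, 1, 0), (1, 0, 0, 1)] i.e. XOR and !XOR
```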
Module 2.8: Representation Power of a Network of Perceptrons
We will now see how to implement any boolean function using a network of
perceptrons ...
For this discussion, we will assume True = +1 and False = -1
[Figure: a network with 2 inputs x1, x2, a layer of 4 perceptrons (each with bias = -2), layer 2 weights w1, w2, w3, w4 and an output perceptron y; a red edge indicates w = -1, a blue edge indicates w = +1]
We consider 2 inputs and 4 perceptrons
Each input is connected to all the 4 perceptrons with specific weights
The bias (w0) of each perceptron is -2 (i.e., each perceptron will fire only if the weighted sum of its input is ≥ 2)
Each of these perceptrons is connected to an output perceptron by weights (which need to be learned)
The output of this perceptron (y) is the output of this network
Terminology:
[Figure: the same network with hidden outputs h1, h2, h3, h4 and layer 2 weights w1, w2, w3, w4]
This network contains 3 layers
The layer containing the inputs (x1, x2) is called the input layer
The middle layer containing the 4 perceptrons is called the hidden layer
The final layer containing one output neuron is called the output layer
The outputs of the 4 perceptrons in the hidden layer are denoted by h1, h2, h3, h4
The red and blue edges are called layer 1 weights
w1, w2, w3, w4 are called layer 2 weights
[Figure: the same network; the hidden perceptrons h1, h2, h3, h4 correspond to the input patterns (-1,-1), (-1,1), (1,-1), (1,1)]
We claim that this network can be used to implement any boolean function (linearly separable or not) !
In other words, we can find w1, w2, w3, w4 such that the truth table of any boolean function can be represented by this network
Astonishing claim! Well, not really, if you understand what is going on
Each perceptron in the middle layer fires only for a specific input (and no two perceptrons fire for the same input)
the first perceptron fires for {-1,-1}
the second perceptron fires for {-1,1}
the third perceptron fires for {1,-1}
the fourth perceptron fires for {1,1}
Let w0 be the bias of the output neuron (i.e., it will fire if ∑_{i=1}^{4} wi hi ≥ w0)
[Figure: the same network as before]

x1  x2  XOR  h1  h2  h3  h4  ∑_{i=1}^{4} wi hi
0   0   0    1   0   0   0   w1
0   1   1    0   1   0   0   w2
1   0   1    0   0   1   0   w3
1   1   0    0   0   0   1   w4
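A minimal Python sketch of this construction (the weight values and helper names are assumed, not from the slides): each hidden perceptron fires for exactly one input pattern, and the layer 2 weights w1, .., w4 then select the rows of the truth table, here for XOR:

```python
import numpy as np

# Inputs use True = +1, False = -1; each hidden perceptron has bias -2,
# so it fires only when both of its weighted inputs match its pattern.
HIDDEN_WEIGHTS = np.array([[-1, -1],   # fires only for (-1, -1)
                           [-1, +1],   # fires only for (-1, +1)
                           [+1, -1],   # fires only for (+1, -1)
                           [+1, +1]])  # fires only for (+1, +1)

def hidden_layer(x1, x2):
    return (HIDDEN_WEIGHTS @ np.array([x1, x2]) - 2 >= 0).astype(int)

def network(x1, x2, w, w0):
    h = hidden_layer(x1, x2)
    return int(np.dot(w, h) >= w0)     # output perceptron fires if sum(wi * hi) >= w0

# XOR: make only the rows for (-1,+1) and (+1,-1) cross the output threshold,
# e.g. w = [0, 1, 1, 0] with w0 = 1 (one possible choice)
w, w0 = np.array([0, 1, 1, 0]), 1
for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, network(x1, x2, w, w0))   # 0, 1, 1, 0
```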
What if we have more than 2 inputs ?
Again each of the 8 perceptrons will fire only for one of the 8 inputs
Each of the 8 weights in the second layer is responsible for one of the 8 inputs and can be adjusted to produce the desired output for that input
[Figure: a network with 3 inputs x1, x2, x3, a layer of 8 perceptrons (each with bias = -3), layer 2 weights w1, .., w8 and an output perceptron y]
What if we have n inputs ?
Theorem
Any boolean function of n inputs can be represented exactly by a network of perceptrons containing 1 hidden layer with 2^n perceptrons and one output layer containing 1 perceptron
Again, why do we care about boolean functions ?
How does this help us with our original problem, which was to predict whether we like a movie or not? Let us see!
[Figure: the 3-input network from the previous slide (layer 2 weights w1, .., w8, bias = -3)]
We are given this data about our past movie experience
For each movie, we are given the values of the various factors (x1, x2, . . . , xn) that we base our decision on and we are also given the value of y (like/dislike)
pi's are the points for which the output was 1 and ni's are the points for which it was 0

p1:  x11  x12  . . .  x1n   y1 = 1
p2:  x21  x22  . . .  x2n   y2 = 1
..   ..   ..   . . .  ..    ..
n1:  xk1  xk2  . . .  xkn   yk = 0
n2:  xj1  xj2  . . .  xjn   yj = 0
..   ..   ..   . . .  ..    ..

The data may or may not be linearly separable
The proof that we just saw tells us that it is possible to have a network of perceptrons and learn the weights in this network such that for any given pi or nj the output of the network will be the same as yi or yj (i.e., we can separate the positive and the negative points)
The story so far ...
Networks of the form that we just saw (containing an input layer, an output layer and one or more hidden layers) are called Multilayer Perceptrons (MLP, in short)
More appropriate terminology would be "Multilayered Network of Perceptrons" but MLP is the more commonly used name
The theorem that we just saw gives us the representation power of an MLP with a single hidden layer
Specifically, it tells us that an MLP with a single hidden layer can represent any boolean function