Machine Learning: Feed Forward Neural Networks, Backpropagation Algorithm, CNNs and RNNs

The document provides an overview of machine learning topics including feed forward neural networks, backpropagation algorithm, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). It discusses the history and advantages of deep learning models over shallow models, and how techniques like backpropagation and GPUs enabled effective training of deep models. It also provides details on perceptrons and multilayer feed forward neural networks, including how hidden layers allow these networks to learn complex patterns from data.

Machine Learning by ambedkar@IISc

I Feed Forward Neural Networks

I Backpropagation Algorithm

I CNNs and RNNs


Agenda

Introduction

Perceptron (Recall)

Feed Forward Neural Networks

Backpropagation Algorithm: 1

Autoencoders

Convolutional Neural Networks

Recurrent Neural Networks

2
Introduction
A Snapshot of Deep Learning

Features

I Go beyond the curve fitting...

I Amazing results with raw data...

I Pay and get the tagged data...

3
A Snapshot of Deep Learning (cont...)

Popular Models

I Feed Forward Neural Networks


I Convolutional Neural Networks
I Recurrent Neural Networks and Long Short Term Memory
Networks
I Restricted Boltzmann Machines and Deep Boltzmann
Machines
I Autoencoders
I Generative Adversarial Networks,
I Variational Autoencoders
I a lot more....
4
A Snapshot of Deep Learning (cont...)

Tools

I PyTorch

I Theano (outdated)

I Caffe

I TensorFlow

5
A Snapshot of Deep Learning (cont...)

Consequences

I Brought “AI” back to the fore in computer science

I If it works, we accept....we do not mind waiting for “why”

6
Shallow Vs. Deep

Until recently most machine learning and signal processing


techniques had exploited “Shallow-Structured Architectures”

I Shallow: typically contain at most one or two layers of


nonlinear function transformations
I Gaussian mixture models
I Conditional random fields
I Linear or nonlinear dynamical systems
I Maximum entropy models
I Support vector machines
I Logistic regression
I Kernel regression
I Multilayer perceptrons with a single hidden layer.
I Deep: More nonlinear “hidden” layers

7
Shallow Vs. Deep (Cont...)

I Shallow architectures have been effective in solving many


simple or well constrained problems

I Shallow architectures have limited modeling and


“representation power”, which can cause difficulties when dealing
with more complicated real-world applications involving
natural signals:

I human speech,

I natural language,

I natural images and scenes

8
Shallow Vs. Deep (Cont...)

I These shallow architectures work well given very good


“hand crafted features” (may require signal processing
techniques)

I The advantage is that training is easy and mostly ends up


as a “convex optimization problem”.

9
Human Perception and Evidence for layered hierarchical
systems

Human information processing mechanisms (e.g., vision and


audition) suggest the need for deep architectures for extracting
complex structures and building internal representations from
rich sensory inputs.

I Human speech production and perception systems are


equipped with “layered hierarchical structures” in
transforming the information from the waveform level to
the linguistic level.

I The human visual system, on the perception side, is not only hierarchical


but also “generative.”

10
Can we also emulate the same?

I The concept of deep learning originates from neural


networks

I Feedforward neural networks or MLPs with many hidden


layers, which are often referred to as deep neural networks
(DNNs)

I Backpropagation was popularized in the 1980s and has been a well-


known algorithm for learning parameters.

11
Can we also emulate the same? (Cont...)

I Unfortunately, BP alone did not work well because of the nonconvex


nature of the resulting optimization problems
I The bigger problem: the vanishing gradient problem
I This steered most ML researchers away from neural
networks to shallow models that have convex loss
functions, such as
I support vector machines (SVM),
I conditional random fields (CRF) and
I maximum entropy models (MaxEnt)

for which the global optimum can be efficiently obtained, at the


cost of reduced modeling power.

12
What has changed now?

The optimization difficulties associated with deep models


were empirically alleviated when “reasonably efficient”
unsupervised learning algorithms were introduced by Hinton
(2006)

13
What has changed now? (Cont...)

I A DBN is composed of a stack of restricted Boltzmann


machines (RBMs)

I A greedy, layer-by-layer algorithm optimizes the DBN weights


with time complexity linear in the size and depth
of the network.

I DBNs can be used to initialize the training of Deep Neural


Networks (DNNs)
I Advantages of DBNs:
I They supply good initializations for DNNs
I The learning algorithm makes effective use of unlabeled data

14
What has changed now? (Cont...)

Most importantly, GPUs and tools like Torch and


TensorFlow

15
Perceptron: History

History:

I McCulloch and Pitts (1943) introduced the idea of neural


networks as computing machines

I Hebb (1949) postulated the first rule for self-organized


learning

I Rosenblatt (1958) invented the perceptron, which algorithmically


described neural networks

16
Perceptron: History

17
Deep Learning: Advantages

I Nonlinearity

I Input-Output mapping

I Adaptivity

I Fault Tolerance

I VLSI implementability

18
Where do we start?

I Perceptron (Recall)

I Feed forward deep networks and Back propagation


algorithm

I later CNNs, LSTMs (if time permits)

19
Perceptron (Recall)
Hyperplanes

I Separates a d-dimensional space into two half


spaces (positive and negative)
I Equation of the hyperplane is

w^T x = 0

I By adding a bias b ∈ R:
w^T x + b = 0;  b > 0 moves the
hyperplane parallel along w,
b < 0 in the opposite direction
20
Hyperplane based classification

I Classification rule

y = sign(w^T x + b)

w^T x + b > 0 ⟹ y = +1
w^T x + b < 0 ⟹ y = −1

21
Hyperplane based classification

22
The Perceptron Algorithm (Rosenblatt, 1958)

I Aim is to learn a linear hyperplane to separate two classes.

I A mistake-driven online learning algorithm

I Guaranteed to find a separating hyperplane if data is


linearly separable.

I If data is not linearly separable

I Make it linearly separable using kernel methods.

I (Or) Use multilayer perceptron.

23
Perceptron Algorithm

I Given training data D = {(x_1, y_1), ..., (x_N, y_N)}

I Initialize w_old = [0, ..., 0], b_old = 0

I Repeat until convergence.

I For a random (x_n, y_n) ∈ D

I If y_n(w^T x_n + b) ≤ 0
[or sign(w^T x_n + b) ≠ y_n, i.e., a mistake is made]

I w_new = w_old + y_n x_n

I b_new = b_old + y_n

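The update rule above maps almost directly to code. A minimal Python/NumPy sketch (the function name, the random shuffling per pass, and the "stop after a mistake-free pass" convergence check are illustrative choices, not part of the slides):

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    # X: (n, d) data matrix; y: labels in {-1, +1}
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in np.random.permutation(n):
            if y[i] * (X[i] @ w + b) <= 0:   # mistake made
                w += y[i] * X[i]             # w_new = w_old + y_n x_n
                b += y[i]                    # b_new = b_old + y_n
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes:
            break                            # a separating hyperplane was found
    return w, b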
24
Perceptron Convergence Theorem (Block and Novikoff )

"Roughly": If the data is linearly separable, the perceptron


algorithm converges.

25
Feed Forward Neural Networks
Some Basic Features of Multilayer Perceptrons (or Feed-
forward Deep Neural Networks)

I Network will have hidden layers.


I Since the perceptron works only for linearly separable data,
each neuron has a non-linear activation function;
the activation function is differentiable.
I The network exhibits a high degree of connectivity.

26
Why hidden Layers?

27
Why hidden Layers? (Cont...)

I Hidden layers can automatically learn features from data

I The bottom-most hidden layer captures very low level


features (e.g., edges). Subsequent hidden layers learn
progressively more high-level features (e.g., parts of
objects) that are composed of previous layer’s features

28
Two important steps in training neural network

1 Forward step:
I Input is fed to the first layer.
I Input signal is propagated through the network layer by
layer.
I Synaptic weights of the network are fixed i.e. no learning
happens in this step.
I Error is calculated at the output layer by comparing the
observed output with "desired output" (Ground truth)
2 Backward step:
I The observed error at the output layer is propagated
"backwards", layer by layer. (How?)
I In this step, successive adjustments are made to the
synaptic weights.

29
Propagation of information in neural network

Two kinds of signals:

1 Function signal (leads to observed error)


2 Error signal (leads to the update of weights or parameters)

Propagation of information in neural network

30
Computation of signals

1 Computation of function signal (in the forward step)

2 Computation of gradient
I gradients of the "error surface" w.r.t. the weights (we will see
how later)
31
Error

I Let D = {(x(n), z(n))}_{n=1}^{N} be a training sample, where x(n) is
an input and z(n) is the desired output.

I x(n) ∈ R^D; we write x(n) = (x_1(n), ..., x_D(n)).

I z(n) ∈ R^M; we write z(n) = (z_1(n), ..., z_M(n)).

I Suppose the output of the network is

y(n) = (y_1(n), ..., y_M(n)) when x(n) is the input.

I Error at the j-th output neuron is

e_j(n) = z_j(n) − y_j(n),   j = 1, 2, ..., M

32
Error (contd. . . )

I The total error per sample is

E(n) = (1/2) ∑_{j=1}^{M} (z_j(n) − y_j(n))²

I The average error over the training data, or empirical risk, is

E = (1/N) ∑_{n=1}^{N} E(n) = (1/2N) ∑_{n=1}^{N} ∑_{j=1}^{M} (z_j(n) − y_j(n))²

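As a quick illustration, the per-sample error E(n) and the empirical risk E can be computed as follows (a NumPy sketch; it assumes the desired outputs Z and the network outputs Y are stored as N × M arrays, which is our own convention):

import numpy as np

def per_sample_error(z_n, y_n):
    # E(n) = 1/2 * sum_j (z_j(n) - y_j(n))^2
    return 0.5 * np.sum((z_n - y_n) ** 2)

def empirical_risk(Z, Y):
    # E = (1/N) * sum_n E(n), with Z and Y of shape (N, M)
    return np.mean([per_sample_error(z_n, y_n) for z_n, y_n in zip(Z, Y)])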
33
Backpropagation Algorithm: 1
The Backpropagation Algorithm

34
The Backpropagation Algorithm

I x_1(n), . . . , x_i(n), . . . , x_m(n): function signals that are


produced by the previous layer and are the inputs to the j-th
neuron.
I v_j(n) = ∑_{i=0}^{m} w_ji(n) x_i(n)

I v_j(n) is the induced local field.


I m is the size of the input (i.e., in the previous layer there
are m neurons)

I y_j(n) = ϕ(v_j(n))

I Function signal appearing at the output of neuron j.

35
The Backpropagation Algorithm (cont...)

I The BPA applies a correction ∆w_ji(n) to the synaptic weight,


proportional to ∂E(n)/∂w_ji(n),   i = 1, 2, . . . , m.

I Note that we are trying to update the j-th neuron out of all M


neurons.

I For the n-th data point, the error is


E(n) = (1/2) ∑_{j=1}^{M} (z_j(n) − y_j(n))² = (1/2) ∑_{j=1}^{M} e_j²(n)

36
The Backpropagation Algorithm (cont...)

We compute the derivative of

E(n) = (1/2) ∑_{j=1}^{M} (z_j(n) − y_j(n))²

w.r.t. w_ji(n) (apply the chain rule)

37
The Backpropagation Algorithm (cont...)

The derivative is

∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)

since

I E is a function of e_j

I e_j is a function of y_j (y_j is the output)

I y_j is a function of v_j (v_j is the local field)

I v_j is a function of w_ji

38
The Backpropagation Algorithm (cont...)

I The derivative is

∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)

I E(n) = (1/2) ∑_{j=1}^{M} e_j²(n)  ⟹  ∂E(n)/∂e_j(n) = e_j(n)
I e_j(n) = z_j(n) − y_j(n)  ⟹  ∂e_j(n)/∂y_j(n) = −1
I y_j(n) = ϕ(v_j(n))  ⟹  ∂y_j(n)/∂v_j(n) = ϕ′_j(v_j(n))
I v_j(n) = ∑_{i=0}^{m} w_ji(n) x_i(n)  ⟹  ∂v_j(n)/∂w_ji(n) = x_i(n)

I ⟹  ∂E(n)/∂w_ji(n) = −e_j(n) ϕ′_j(v_j(n)) x_i(n)

39
The Backpropagation Algorithm (contd. . . )

Update rule for the j-th output neuron

I We have

∂E(n)/∂w_ji(n) = −e_j(n) ϕ′_j(v_j(n)) x_i(n)

I Hence, the update rule is

w_ji(n + 1) = w_ji(n) − η ∂E(n)/∂w_ji(n)
            = w_ji(n) + η e_j(n) ϕ′_j(v_j(n)) x_i(n)

40
The Backpropagation Algorithm (contd. . . )

Local Gradient

I Define the local gradient δ_j(n) for the j-th neuron as

δ_j(n) = −∂E(n)/∂v_j(n)
       = −∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) = e_j(n) ϕ′_j(v_j(n))

I w_ji(n + 1) = w_ji(n) + η δ_j(n) x_i(n), where δ_j(n) is the local gradient

41
BPA: Case 1: Neuron j is an output node

I The output neuron has "easy" access to the error

e_j(n) = z_j(n) − y_j(n)

E(n) = (1/2) ∑_{j=1}^{M} e_j²(n)

42
BPA: Case 1: Neuron j is an output node (Cont. . . )

I Update rule

w_ji(n + 1) = w_ji(n) − η ∂E(n)/∂w_ji(n)
            = w_ji(n) + η e_j(n) ϕ′(v_j(n)) x_i(n)    (e_j(n) ϕ′(v_j(n)) is the local gradient)
            = w_ji(n) + η δ_j(n) x_i(n)

43
BPA: Case 2: Neuron j is a hidden node

I Unlike an output neuron, a hidden neuron does


not have direct access to the "error".

I TRICK

I The error signal for a hidden neuron will be determined


recursively.

I It expects the neurons of the next layer (hidden or output)


that it is connected to, to share "some" of the error.

I Error propagates by working backwards.

45
BPA: Case 2: Neuron j is a hidden node (contd. . . )

Strategy
I First compute the local gradient δ_j(n) for the j-th hidden
neuron (how, we will see shortly)
I Then use an update similar to that of an output neuron

∆w = learning rate × local gradient × input
   = η δ_j(n) x_i(n)

46
BPA: Case 2: Neuron j is a hidden node (contd. . . )

Local gradient of the j-th hidden neuron:

δ_j(n) = −∂E(n)/∂v_j(n) = −∂E(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n)
       = −∂E(n)/∂y_j(n) · ϕ′_j(v_j(n))    (∵ y_j(n) = ϕ_j(v_j(n)))

Note: If this had been an output neuron, we would have had

∂E(n)/∂y_j(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) = −e_j(n)

Since j is hidden, it does not have access to the error.

47
BPA: Case 2: Neuron j is a hidden node (contd. . . )

I We are trying to compute the local gradient

δ_j(n) = −∂E(n)/∂v_j(n) = −∂E(n)/∂y_j(n) · ϕ′_j(v_j(n))

I Let us compute ∂E(n)/∂y_j(n)
I We have E(n) = (1/2) ∑_{k∈C} e_k²(n), where the summation is over all the output neurons
I Then

∂E(n)/∂y_j(n) = ∑_{k∈C} e_k(n) ∂e_k(n)/∂y_j(n)
              = ∑_{k∈C} e_k(n) ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)

48
BPA: Case 2: Neuron j is a hidden node (contd. . . )

We are computing ∂E(n)/∂y_j(n) = ∑_{k∈C} e_k(n) ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)

I We have

e_k(n) = z_k(n) − y_k(n)
       = z_k(n) − ϕ_k(v_k(n))

I ∂e_k(n)/∂v_k(n) = −ϕ′_k(v_k(n))
I We have v_k(n) = ∑_{l=1}^{m} w_kl(n) y_l(n)

I Note that j ∈ {1, 2, ..., m}, and the output of the j-th neuron, along


with those of the other neurons in that layer, is fed to the k-th
output neuron.
⟹ ∂v_k(n)/∂y_j(n) = w_kj(n)

49
BPA: Case 2: Neuron j is a hidden node (contd. . . )

We are computing ∂E(n)/∂y_j(n) = ∑_{k∈C} e_k(n) ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)

∂E(n)/∂y_j(n) = −∑_{k∈C} e_k(n) ϕ′_k(v_k(n)) w_kj(n)
              = −∑_{k∈C} δ_k(n) w_kj(n)

where δ_k(n) = e_k(n) ϕ′_k(v_k(n)) is the local gradient of the k-th


output neuron.

50
BPA: Case 2: Neuron j is a hidden node (contd. . . )

We have   ∂E(n)/∂y_j(n) = −∑_k δ_k(n) w_kj(n)

and       ∂y_j(n)/∂v_j(n) = ϕ′_j(v_j(n))

Hence,    δ_j(n) = ϕ′_j(v_j(n)) ∑_k δ_k(n) w_kj(n)

51
BPA: Case 2: Neuron j is a hidden node (contd. . . )

I Now we have the local gradient at the j-th hidden node,


i.e., δ_j(n) = ϕ′_j(v_j(n)) ∑_k δ_k(n) w_kj(n)

where δ_k(n) = e_k(n) ϕ′_k(v_k(n)) is the local gradient at the k-th


output neuron.

I Hence,

w_ji(n + 1) = w_ji(n) + η δ_j(n) x_i(n)
            = w_ji(n) + η [ ϕ′_j(v_j(n)) ∑_{k∈C} δ_k(n) w_kj(n) ] x_i(n)

52
BPA: Update Rule Summary

Case 1: the j-th neuron is an output neuron

∆w_ji(n) = η · e_j(n) ϕ′_j(v_j(n)) · x_i(n)

(e_j(n) ϕ′_j(v_j(n)) is the local gradient at the j-th neuron)

Case 2: the j-th neuron is a hidden neuron (see Bishop, Section 5.3)

∆w_ji(n) = η · ( ∑_{k∈C} δ_k(n) w_kj(n) ) ϕ′_j(v_j(n)) · x_i(n)

(( ∑_{k∈C} δ_k(n) w_kj(n) ) ϕ′_j(v_j(n)) is the local gradient at the j-th neuron)

53
Online Vs Batch Learning

Batch Learning

I Each adjustment to the weights is performed after all


N examples in the training sample
have been presented.

I That is, the cost function is the average error, or empirical risk:


E = (1/N) ∑_{n=1}^{N} E(n) = (1/2N) ∑_{n=1}^{N} ∑_{j=1}^{M} e_j²(n)
  = (1/2N) ∑_{n=1}^{N} ∑_{j=1}^{M} (z_j(n) − y_j(n))²

55
Online Vs Batch Learning (Cont...)

Batch Learning

I This constitutes one epoch of training.

I In each epoch of training, samples are randomly shuffled.

I The learning curve in this case is E vs epoch number.

I Advantage: It can be easily parallelized.

I Disadvantage: Memory requirements are very high.

56
Online Vs Batch Learning

Online Learning
I Each adjustment to the weights is performed example by
example on the training data.
I The cost function is the error obtained on each sample:

E(n) = (1/2) ∑_{j=1}^{M} e_j²(n) = (1/2) ∑_{j=1}^{M} (z_j(n) − y_j(n))²

I The learning curve in this case is E(n) vs. epoch.
I The learning curve is significantly different from that of batch
learning.
I Online learning takes advantage of redundant data (multiple
copies of data).
I Online learning is simple to implement.
57
Activation Function

The activation function needs to be differentiable.

1 Logistic function:

ϕ_j(v_j(n)) = 1 / (1 + exp(−a v_j(n))),   a > 0

where v_j is the induced local field and a is a parameter.

2 Hyperbolic tangent function:

ϕ_j(v_j(n)) = a tanh(b v_j(n))

where a and b are positive constants.

58
Activation Functions

Heaviside step function:

ϕ(x) = 0 if x < 0
ϕ(x) = 1 if x > 0

This is useful in the case of the perceptron, which works
only when the data is linearly separable.

59
Activation Functions

Heaviside step function (contd.)


The reasons why we cannot use the Heaviside step function in
feedforward neural networks:

I We train neural networks using the backpropagation algorithm,


which requires a differentiable activation function. The
Heaviside step function is not differentiable at x = 0 and
has zero derivative everywhere else.
⟹ Gradient descent will not be able to make
progress in updating the weights.
I We want the neural network weights to be modified
continuously so that the predictions can get as close as possible to the
real values. Having a function that can only generate either 0 or
1 will not help to achieve this objective.
60
Activation Functions (contd...)

Sigmoid Function

I The sigmoid function is also known as the logistic function.


I It non-linearly squashes a number to a value between 0 and
1.
I sigmoid(z) = 1 / (1 + e^(−z))
I Activations are bounded between 0 and 1.

61
Activation Functions (contd. . . )

Sigmoid Function (contd...)


Disadvantages:

1 When the input is very small or very large (towards ±∞), the gradient is


(near) zero.

I Hence, while executing the backpropagation algorithm, the


weights will not get updated, i.e., there is no learning.
I Vanishing gradient problem.

2 Though computing activation functions is less


computationally expensive than matrix multiplications or
convolutions, computing the exponential is still expensive.

62
Activation Functions (contd. . . )

Tanh Function (or Hyperbolic tangent)


I It is similar to the sigmoid function but squashes the values
non-linearly between −1 and 1.

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

I It shares some of the disadvantages of the sigmoid.

63
Activation Functions (contd. . . )

Rectified Linear Unit (ReLU)

I Given an input, if it is negative or zero, it outputs zero.


Otherwise it outputs the input unchanged.

ReLU(x) = max(0, x)

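For reference, a NumPy sketch of the three activation functions discussed in this section, together with the derivatives that backpropagation needs (the parameter a of the logistic function defaults to 1 here; function names are our own):

import numpy as np

def sigmoid(v, a=1.0):
    # logistic function with slope parameter a
    return 1.0 / (1.0 + np.exp(-a * v))

def d_sigmoid(v, a=1.0):
    s = sigmoid(v, a)
    return a * s * (1.0 - s)        # close to zero for |v| large

def tanh(v):
    return np.tanh(v)

def d_tanh(v):
    return 1.0 - np.tanh(v) ** 2

def relu(v):
    return np.maximum(0.0, v)

def d_relu(v):
    return (v > 0).astype(float)    # 0 for v <= 0, 1 otherwise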
64
Activation Functions (contd. . . )

Rectified Linear Unit (ReLU)

I Is it non-linear? Yes.

I A linear function should satisfy the property that

f(x + y) = f(x) + f(y)
But ϕ(−1) + ϕ(+1) ≠ ϕ(0)

I It is, however, piecewise linear.

65
Activation Functions (contd. . . )

ReLU (contd...)
I It is an unbounded function.
I Changing from sigmoid to ReLU as the activation function in
a hidden layer is possible - hidden neurons need not have
bounded values.
I The issue with the sigmoid function is that its derivative has very small
values (near zero) everywhere except near 0.

66
Activation Functions (contd. . . )

ReLU (contd...)

I At the j-th neuron, which is in a hidden layer,

∆w_ji(n) = η ( ∑_{k∈C} δ_k(n) w_kj(n) ) ϕ′_j(v_j(n))

where the sum is the local-gradient contribution from the layer above and
ϕ′_j is the derivative of the sigmoid function.

We multiply the gradient from the layer above with the partial


derivative of the sigmoid function.
I Since, in the case of the sigmoid, the derivative has very small values
everywhere except when the input is close to 0,
⟹ lower layers will likely have smaller gradients in
terms of magnitude compared to higher layers.

67
Activation Functions (contd. . . )

ReLU (contd...)
I The reason is that, in the case of the sigmoid, ϕ′(·) is always less than
1, with most values being close to 0.
⟹ This imbalance in gradient magnitude makes it
difficult to change the parameters of the neural network
with stochastic gradient descent.

68
Activation Functions (contd. . . )

ReLU (contd...)
I This problem can be mitigated by the use of the rectified linear
activation function, because the derivative of ReLU can have
many non-zero values.
⟹ This in turn means that the magnitude of the gradient
is more balanced throughout the network.
I Dying ReLU: a neuron in the network is permanently dead
due to its inability to fire in the forward pass.
⟹ When the activation is zero in the forward pass, all the
weights get zero gradient.
⟹ In backpropagation, the weights of such neurons never get
updated.
I Using ReLU in RNNs can blow up the computations to
infinity, as the activations are not bounded. 69
Rate of Learning

"The backpropagation algorithm provides an 'approximation' to the


trajectory in weight space computed by the method of steepest
descent."
Case 1: the smaller the learning rate ⇒ the smaller the changes to the
synaptic weights in the network

⇒ the smoother the learning will be
Case 2: the larger the learning rate ⇒ the network may become
unstable or oscillatory

70
Rate of Learning (contd. . . )

∆w_ji(n) = α ∆w_ji(n − 1) + η δ_j(n) y_i(n)

where α is the momentum constant;


α = 0 gives the original delta rule
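As a one-line illustration, this generalized delta rule can be coded as follows (the previous update ∆w_ji(n−1) must be stored per weight; the function and argument names are illustrative):

def momentum_update(dw_prev, delta_j, y_i, eta=0.1, alpha=0.9):
    # Delta w_ji(n) = alpha * Delta w_ji(n-1) + eta * delta_j(n) * y_i(n)
    return alpha * dw_prev + eta * delta_j * y_i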

71
Stopping Criteria

"In general, the backpropagation algorithm cannot be shown to


converge."

Hence there is no well-defined criterion for stopping its
operation.

72
Stopping Criteria (contd. . . )

Some good criteria:

I The Euclidean norm of ∂E/∂w reaches a sufficiently small gradient
threshold.
∵ The necessary condition for w* to be a global or
local minimum is ∂E/∂w |_{w*} = 0

I The absolute rate of change of the average squared error per


epoch is sufficiently small.

I Stop when the "generalization performance" is adequate or it is


apparent that the generalization performance has peaked.

73
Summary of Backpropagation Learning Algorithm

w_ji(n + 1) = w_ji(n) + η δ_j(n) x_i(n)

I η : learning rate
I δ_j(n) : local gradient
I x_i(n) : input to the j-th neuron

Local gradient:

δ_j(n) = e_j(n) ϕ′_j(v_j(n))                     if j is an output neuron

       = [ ∑_k δ_k(n) w_kj(n) ] ϕ′_j(v_j(n))     if j is a hidden neuron

where δ_k(n) is the local gradient of the k-th neuron at the output.

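A minimal NumPy sketch of this summary for a network with one hidden layer and sigmoid activations, doing one per-sample update (the array names, shapes and the absence of bias terms are illustrative choices, not part of the slides):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, z, W1, W2, eta=0.1):
    # Forward pass: hidden layer, then output layer
    v1 = W1 @ x                  # induced local fields of the hidden neurons
    y1 = sigmoid(v1)
    v2 = W2 @ y1                 # induced local fields of the output neurons
    y2 = sigmoid(v2)
    # Output neurons: delta_j = e_j * phi'(v_j), with phi'(v) = y(1 - y) for the sigmoid
    e = z - y2
    delta2 = e * y2 * (1.0 - y2)
    # Hidden neurons: delta_j = phi'(v_j) * sum_k delta_k w_kj
    delta1 = (y1 * (1.0 - y1)) * (W2.T @ delta2)
    # Updates: w_ji(n+1) = w_ji(n) + eta * delta_j * input_i
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
    return W1, W2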
74
XOR Problem

I Rosenblatt's single-layer perceptron has no hidden layer,


hence it cannot classify input patterns that are NOT
linearly separable.
Consider the XOR problem:

0 ⊕ 0 = 0
1 ⊕ 1 = 0
0 ⊕ 1 = 1
1 ⊕ 0 = 1

I {(0,0), (1,1)} and {(0,1), (1,0)} are not linearly separable.
Hence a single-layer neural network cannot solve this
problem.
75
XOR Problem (contd. . . )

I We use a hidden layer

I Consider the following neural network

76
XOR Problem (contd. . . )

I The function of the output neuron: construct a linear


combination of the decision boundaries formed by the two
hidden neurons.
For various inputs:

I For input (1,1): v1 = (1)(+1) + (1)(+1) + (+1)(-1.5)


= 1 + 1 - 1.5 = 0.5 ⇒ ϕ(v1) = 1
v2 = (1)(+1) + (1)(+1) + (+1)(-0.5)
= 1 + 1 - 0.5 = 1.5 ⇒ ϕ(v2) = 1
v3 = (1)(-2) + (1)(+1) + (+1)(-0.5)
= -2 + 1 - 0.5 = -1.5 ⇒ ϕ(v3) = 0

77
XOR Problem (contd. . . )

For various inputs:

I For input (0,0): v1 = (+1)(-1.5) = -1.5 ⇒ ϕ(v1) = 0


v2 = (+1)(-0.5) = -0.5 ⇒ ϕ(v2) = 0
v3 = (+1)(-0.5) = -0.5 ⇒ ϕ(v3) = 0

I For input (1,0): v1 = (1)(+1) + (+1)(-1.5)


= 1 - 1.5 = -0.5 ⇒ ϕ(v1) = 0
v2 = (1)(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v2) = 1
v3 = (1)(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v3) = 1

78
XOR Problem (contd. . . )

For various inputs :

I For input (0,1): v1 = (1)(+1) + (+1)(-1.5)


= 1 - 1.5 = -0.5 ⇒ ϕ(v1 ) =0
v2 = 1(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v2 ) = 1
v3 =1(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v3 ) =1

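The hand computations above can be checked directly in code. The following short Python sketch hard-codes the weights read off the computations (hidden neurons with weights (+1, +1) and biases −1.5 and −0.5, output neuron with weights (−2, +1) and bias −0.5, Heaviside step activation) and evaluates all four inputs:

def step(v):
    # Heaviside step activation
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 + 1 * x2 - 1.5)     # hidden neuron 1: weights (+1, +1), bias -1.5
    h2 = step(1 * x1 + 1 * x2 - 0.5)     # hidden neuron 2: weights (+1, +1), bias -0.5
    return step(-2 * h1 + 1 * h2 - 0.5)  # output neuron: weights (-2, +1), bias -0.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))     # prints 0, 1, 1, 0: the XOR function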
79
XOR Problem (contd. . . )

I Decision Boundary of neuron 1

I Decision Boundary of neuron 2

I Decision Boundary of neuron 3

80
Universal Approximation Theorem

Let ϕ(·) be a non-constant, bounded and monotonically increasing


continuous function. Let I_m0 denote the m0-dimensional unit hypercube
[0, 1]^m0, and let the space of continuous functions on I_m0 be denoted
by C(I_m0). Given any function f ∈ C(I_m0) and ε > 0, there
exist an integer m1 and sets of real numbers α_i, b_i and w_ij,
where i = 1, 2, ..., m1 and j = 1, 2, ..., m0,
such that

F(x_1, ..., x_m0) = ∑_{i=1}^{m1} α_i ϕ( ∑_{j=1}^{m0} w_ij x_j + b_i )

and F arbitrarily approximates f(·). That is,

|F(x_1, ..., x_m0) − f(x_1, ..., x_m0)| < ε    ∀ x_1, ..., x_m0 ∈ I_m0

81
Autoencoders
Introduction

I The origin of deep learning (post neural networks) since the


early 2000s was the use of Deep Belief Networks to
“pretrain” deep networks.
I This approach is based on the observation that random
initialization is not a good idea, and that pretraining each
layer with an unsupervised learning algorithm can allow for
better initial weights.
I Examples of such unsupervised algorithms are
I Deep Belief Networks based on Restricted Boltzmann
Machines
I Deep autoencoders

82
Compression

I The aim is to transmit this data: that is, we have to send both


the first and the second dimension
I If we observe carefully, the value of the second dimension is
just twice the first dimension
I Hence we can transmit just the first dimension (this can be thought
of as encoding the data) and compute the value of the
second dimension (this can be thought of as decoding the data) 83
Compression (cont...)

The process...

I Encoding: Map the data xn by means of some method to


compressed data zn

I Transmit

I Decoding: Map from compressed data zn to x̃n

84
Autoencoder

A linear encoding and decoding


I Encoding: z_n = W_1 x_n + b_1
I Decoding: x̃_n = W_2 z_n + b_2
Objective function:

J(W_1, b_1, W_2, b_2) = ∑_{n=1}^{N} ‖x̃_n − x_n‖²

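A minimal NumPy sketch of such a linear autoencoder trained by gradient descent on the reconstruction error (the layer sizes, learning rate, initialization and variable names are illustrative; the constant factor from the squared error is absorbed into the learning rate):

import numpy as np

def train_linear_autoencoder(X, k, eta=0.01, epochs=200):
    # X: (N, D) data; k: dimension of the code z_n
    N, D = X.shape
    rng = np.random.default_rng(0)
    W1, b1 = 0.01 * rng.standard_normal((k, D)), np.zeros(k)   # encoder
    W2, b2 = 0.01 * rng.standard_normal((D, k)), np.zeros(D)   # decoder
    for _ in range(epochs):
        for x in X:
            z = W1 @ x + b1              # encoding
            x_hat = W2 @ z + b2          # decoding
            g = x_hat - x                # gradient of the reconstruction error
            dz = W2.T @ g                # backpropagate through the decoder
            W2 -= eta * np.outer(g, z);  b2 -= eta * g
            W1 -= eta * np.outer(dz, x); b1 -= eta * dz
    return W1, b1, W2, b2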
85
Autoencoder

I If the data lie on a nonlinear surface, we use nonlinear


activation functions.
I If the data is highly nonlinear, one could add more hidden
layers to the networks to have a deep encoder.
I Note that this is an unsupervised learning.

86
Convolutional Neural Networks
Convolutional Neural Network(Introduction)

I Convolutional Neural Networks (CNNs) came into the limelight


in 2012
I Alex Krizhevsky used a CNN to win the 2012 ImageNet
competition.
I The classification error was improved from about 26% to
15%.
I Paper: Krizhevsky, Sutskever and Hinton: ImageNet
Classification with Deep Convolutional Neural Networks,
NIPS 2012.

I CNNs were first proposed in the paper by LeCun, Bottou,


Bengio, Haffner: Gradient-Based Learning Applied to
Document Recognition, 1998 (Proceedings of the IEEE)

87
Biological Connection

I Experiment by Hubel and Wiesel (1962)

I Some individual neuronal cells in the brain fire only in the


presence of edges of a certain orientation.

I For example, Some neurons fired when exposed to vertical


and some fired when exposed to horizontal edges.

I Hubel and Wiesel found that all these neurons were


organized in a column architecture and that together they
were able to produce visual perception.

88
CNN

I A CNN is a feed-forward neural network with a special


structure.
I Sparse "local" connectivity between layers, except the last
output layer ⇒ reduces the number of parameters.

Local connectivity of a CNN

I Shared weights (like a global filter) help to capture the


local properties of the signal (useful for images)
89
CNN

90
CNN

I Convolution: extracts "local" properties of the signal,


using "filters" that have to be learned.

I Pooling: downsamples the output to reduce the size of the


representation.

I Nonlinearity: a non-linearity is applied after the convolution


layer.

91
Convolution

I This operation extracts local spatial properties of the input

I The operation is defined as

h^k_ij = f((W^k ∗ X)_ij + b_k)
where W^k is a filter, ∗ is the convolution operation and f
is a nonlinear function.
I Several filters W^k, k = 1, 2, 3, ... are applied, which need to be
learned. The size of the filters also needs to be specified.
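A minimal NumPy sketch of this operation for a single-channel input and a single filter, with stride 1 and no padding (the nonlinearity f is left out so the raw feature map is visible; as in most deep learning libraries, the loop actually computes a cross-correlation):

import numpy as np

def conv2d(X, K, b=0.0):
    # X: (H, W) single-channel input; K: (kh, kw) filter; stride 1, no padding
    H, W = X.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the filter and the local patch
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * K) + b
    return out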
92
Convolution Layer

I We look at a small portion of the image through a "lens".

I Suppose the size of this lens is 5 × 5 × 3.

r* = ∑_{i=1}^{5} ∑_{j=1}^{5} a_ij b_ij
93
Convolution Layer (cont....)

I Example:

Input          Filter                 Output
32 × 32 × 3    5 × 5 × 3              28 × 28
32 × 32 × 3    5 × 5 × 3 (2 nos.)     28 × 28 × 2

I Each filter can be thought of as a "feature identifier".

I Intuition: in the input image, if there is a shape that


generally resembles the curve that the particular filter is
representing, then all the multiplications summed together
will result in a large value.

94
Convolution Layer (cont....)

Stride: the stride is the size of the shift of the filter across the image
(previously we kept the stride as 1).
Example:

3 × 3 convolution with a stride of 1

95
Convolution Layer (contd. . . )

Stride (contd...) Example:

Convolution with a stride of 2

I Size of output = (Size of input − Size of filter) / Stride + 1
96
Convolution with stride

The filter is moved along the image and at each position the dot product is
computed

Image taken from Poczos’s notes

97
Convolution Layer (contd. . . )

Padding: If we want the output to be the same size as the input, then we


pad the input with zeros.

Padding of two applied to the input

I To enforce the input and output to be of the same size, we need the


padding size to be

size of padding = (size of filter − 1) / 2

98
Convolution Layer (contd. . . )

In general,

size of output = (size of input − size of filter + 2 × size of padding) / size of stride + 1
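This formula is easy to sanity-check with a small helper (the integer division assumes the sizes divide evenly):

def conv_output_size(n_in, k, padding=0, stride=1):
    # (size of input - size of filter + 2 * padding) / stride + 1
    return (n_in - k + 2 * padding) // stride + 1

print(conv_output_size(32, 5))              # 28, as in the 32 x 32 x 3 example above
print(conv_output_size(32, 5, padding=2))   # 32, i.e. a "same"-sized output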

99
Rectified Linear Unit OR ReLU

I A recent advance (not very recent) : Use ReLU,


y(z) = max(0, z) as the activation function instead of
traditional sigmoid function.

Activation functions

I ReLU improves performance of many networks.

100
Pooling or Down Sampling Layer

Divide the layer into partitions and take the max or average of each


partition (a small code sketch follows the list below)

Max pooling

I Max Pooling
I Average Pooling
I L2-norm Pooling
101
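A minimal NumPy sketch of max pooling with 2 × 2 partitions and a stride of 2 (the window size and stride defaults are illustrative):

import numpy as np

def max_pool2d(X, size=2, stride=2):
    # Partition X into size x size windows and keep the maximum of each
    H, W = X.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.max(X[i * stride:i * stride + size,
                                 j * stride:j * stride + size])
    return out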
Pooling or Down Sampling Layer (contd. . . )

I Advantages

I Reduces the dimension of representation

I Controls overfitting

Lookout: If you have 99% to 100% accuracy on the training set and


only 40% to 50% test accuracy, it is a cause for concern.

102
Dropout Layer

I This layer drops a random set of activations in that layer by


setting them to zero.

I Helps as a Regularizer.

103
Architecture of LeNet-5 (LeCun et al., 1998)

Architecture of LeNet-5

I This is one of the first convolutional neural networks


I It was designed to classify images of handwritten digits.
I Here the activation function used is tanh, but now the usual
choice is ReLU.
104
Some Popular CNNs

AlexNet (2012)

I Trained on ImageNet (about 1.2 million labeled training images; the full ImageNet has over 15 million).


I Achieved a test error of 15.4% (the next best was about 26%).
I 5 convolution layers, max-pooling layers, dropout layers and 3
fully connected layers.
I Used ReLU for activation.
I Used data augmentation techniques consisting of image
translations, horizontal reflections and patch extraction.
I Implemented dropout layers in order to control overfitting to
the training data.

105
Some popular CNNs (contd. . . )

AlexNet (2012) (contd...)

I Trained the model using batch stochastic gradient descent


with specific values of momentum and decay.
I Trained on two GTX 580 GPUs for five to six days.

ZF Net (2013), Zeiler and Fergus

I Error of 11.2%.
I More of a fine-tuning of AlexNet.
I Provided visualizations which gave better intuitions.
I ZF Net was trained using 1.3 million images.

106
Some popular CNNs (contd. . . )

I Used 7x7 filters instead of 11x11 filters (as in AlexNet),


also with a decreased stride value.

I Smaller filters in the convolution layer help retain a lot of the


original pixel information in the input image.

I ReLU for activation, cross-entropy loss, training using


batch stochastic gradient descent.

I Trained on a GTX 580 GPU for 12 days

I A deconvolutional network helps to visualize the feature


maps.

107
Some popular CNN’s (contd. . . )

VGG Net (2014), Simonyan and Zisserman

I Error 7.3%.
I 19 weight layers, 3x3 convolution filters, padding of 1,
max pooling with a stride of 2.
I Trained for two to three weeks.

GoogLeNet (2015)

I Error 6.7%
I 22-layer CNN.
I Uses Inception modules.

108
Some Popular CNN’s (contd. . . )

Microsoft ResNet (2015)

I Error 3.6% (human error is around 5-10%).

I Residual blocks.

I 152 layers.

I Trained on an 8-GPU machine for 2 to 3 weeks.

109
Recurrent Neural Networks
What we have been doing so far?

Feed Forward Neural Networks


I Consist of input, hidden and output layers
I Given sequential data, FF networks do not take the
sequential structure of the data into account
I Given a sequence of observations x_1, . . . , x_T, the
corresponding hidden units h_1, . . . , h_T are assumed
independent of each other (i.i.d. data?)

110
What we have been doing so far? (cont. . . )

I How can we use feed forward neural networks for sequential


data like text, audio, video?

I Can we modify feed forward neural networks in such a way


that they “remember” the previous example?

I The answer is Recurrent Neural Networks or RNNs

111
RNN (Introduction)

I Since we have sequential data, the hidden state at each step


depends on the hidden state of the previous step

I Hence, h_t = ϕ(W x_t + U h_{t−1}), where U acts as a transition


matrix and ϕ is a nonlinear activation function
I h_t acts as a memory
I RNNs can be considered as multiple copies of the same
network, each passing a message to a successor.
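A minimal NumPy sketch of this recurrence, unrolling the network over a sequence, with tanh as the nonlinearity ϕ (the shapes and the zero initial state are illustrative choices):

import numpy as np

def rnn_forward(xs, W, U, h0=None):
    # xs: list of input vectors x_1..x_T; W: (h, d), U: (h, h)
    h = np.zeros(U.shape[0]) if h0 is None else h0
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)   # h_t = phi(W x_t + U h_{t-1})
        states.append(h)
    return states                    # the hidden states act as the memory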
112
RNN Application

I RNNs have many applications in modeling sequential


data
I Input, output or both can be sequences (possibly of
different lengths)
I Different inputs and different outputs need not be of the
same length
I Regardless of the length of the input, an RNN will learn a
fixed-size embedding for the input
113
RNN Training

I Trained using Backpropagation Through Time (forward


propagate from step 1 to end, and then backward
propagate from end to step 1)
I Think of the time-dimension as another hidden layer and
then it is just like standard backpropagation for
feedforward neural nets

114
Vanishing Gradient Problem

I The learnability of hidden states and outputs becomes weaker as


we move away from them along the sequence ⇒ weak
memory
I New inputs "overwrite" the activations of the previous
hidden states
I Repeated multiplications can cause the gradients to vanish
or explode (the latter especially with ReLU)
115
Are RNNs really useful for sequential data?

I The whole idea of the feedback loop is to be able to connect


previous information to the present task.

I For example, previous video frames may inform the


understanding of the present frame.

I How much past information can RNNs remember, so that


they can be used for the present task?

116
Are RNNs really useful for sequential data? (cont. . . )

I Consider a language model trying to predict the next word


based on the previous words.

I If the model is trying to predict the word sky in the sentence


the clouds are in the sky, the model does not require
very old context.

I If we consider I grew up in France...I speak fluent


French, there is a huge gap between the relevant pieces of
information.

I If this gap is too large, RNNs will not be able to connect


the information.

117
Are RNNs really useful for sequential data? (cont. . . )

Short Range Dependencies

Long Range Dependencies

The solution is Long Short-Term Memory Networks (Hochreiter


& Schmidhuber, 1997)
118
Capturing the Long range Dependencies

I Augment the hidden states with gates


I The gates involve some parameters which need to be
learned
I These gates help the model to remember and target
information selectively
I The hidden state has three types of gates
I Input (bottom), Forget (left) and Output (top)
I Open ’o’, closed ’-’
(a minimal LSTM cell sketch follows below)
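A minimal NumPy sketch of one step of a standard LSTM cell (this is one common formulation, with biases omitted for brevity; it is not necessarily the exact variant drawn in the figure):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wc):
    # One step of a standard LSTM cell; each W* acts on [h_prev, x]
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z)              # forget gate: what to keep from c_prev
    i = sigmoid(Wi @ z)              # input gate: what new content to write
    o = sigmoid(Wo @ z)              # output gate: what to expose as h_t
    c_tilde = np.tanh(Wc @ z)        # candidate cell content
    c = f * c_prev + i * c_tilde     # cell state: selectively remembered memory
    h = o * np.tanh(c)               # new hidden state
    return h, c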
119
Some images and material on CNNs and RNNs are taken from Piyush Rai's Lecture Notes.
Homework:

Go through Colah's blog on LSTM networks:


https://colah.github.io/posts/2015-08-Understanding-LSTMs/

120
