Machine Learning: Feed Forward Neural Networks, Backpropagation Algorithm, CNNs and RNNs

The document provides an overview of machine learning topics including feed forward neural networks, backpropagation algorithm, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). It discusses the history and advantages of deep learning models over shallow models, and how techniques like backpropagation and GPUs enabled effective training of deep models. It also provides details on perceptrons and multilayer feed forward neural networks, including how hidden layers allow these networks to learn complex patterns from data.

Machine Learning by ambedkar@IISc

I Feed Forward Neural Networks

I Backpropagation Algorithm

I CNNs and RNNs


Agenda

Introduction

Perceptron (Recall)

Feed Forward Neural Networks

Backpropagation Algorithm: 1

Autoencoders

Convolutional Neural Networks

Recurrent Neural Networks

2
Introduction
A Snapshot of Deep Learning

Features

I Go beyond the curve fitting...

I Amazing results with raw data...

I Pay and get the tagged data...

3
A Snapshot of Deep Learning (cont...)

Popular Models

I Feed Forward Neural Networks


I Convolutional Neural Networks
I Recurrent Neural Networks and Long Short Term Memory
Networks
I Restricted Boltzmann Machines and Deep Boltzmann
Machines
I Autoencoders
I Generative Adversarial Networks,
I Variational Autoencoders
I a lot more....
4
A Snapshot of Deep Learning (cont...)

Tools

I PyTorch

I Theano (outdated)

I Caffe

I TensorFlow

5
A Snapshot of Deep Learning (cont...)

Consequences

I Brought “AI” back to the fore in computer science

I If it works, we accept....we do not mind waiting for “why”

6
Shallow Vs. Deep

Until recently most machine learning and signal processing


techniques had exploited “Shallow-Structured Architectures”

I Shallow: typically contain at most one or two layers of


nonlinear function transformations
I Gaussian mixture models
I Conditional random fields
I Linear or nonlinear dynamical systems
I Maximum entropy models
I Support vector machines
I Logistic regression
I Kernel regression
I Multilayer perceptrons with a single hidden layer.
I Deep: More nonlinear “hidden” layers

7
Shallow Vs. Deep (Cont...)

I Shallow architectures have been effective in solving many


simple or well constrained problems

I Shallow architectures have limited modeling and


“representation power”, which can cause difficulties when dealing
with more complicated real-world applications involving
natural signals:

I human speech,

I natural language,

I natural images and scenes

8
Shallow Vs. Deep (Cont...)

I These shallow architectures work well given very good


“hand crafted features” (may require signal processing
techniques)

I The advantage is that training is easy and mostly ends up


as a “convex optimization problem”.

9
Human Perception and Evidence for layered hierarchical
systems

Human information processing mechanisms (e.g., vision and


audition) suggest the need for deep architectures for extracting
complex structures and building internal representations from
rich sensory inputs.

I Human speech production and perception systems are


equipped with “layered hierarchical structures” in
transforming the information from the waveform level to
the linguistic level.

I The human visual system, on the perception side, is not only hierarchical


but also “generative.”

10
Can we also emulate the same?

I The concept of deep learning originates from neural


networks

I Feedforward neural networks or MLPs with many hidden


layers, which are often referred to as deep neural networks
(DNNs)

I Backpropagation was popularized in the 1980s and has been a well-


known algorithm for learning parameters.

11
Can we also emulate the same? (Cont...)

I Unfortunately, BP alone did not work well because of the nonconvex


nature of the resulting optimization problems
I The bigger problem: the vanishing gradient problem
I This steered most ML researchers away from neural
networks to shallow models that have convex loss
functions, such as
I support vector machines (SVM),
I conditional random fields (CRF) and
I maximum entropy models (MaxEnt)

for which the global optimum can be efficiently obtained, at the


cost of reduced modeling power.

12
What has changed now?

The optimization difficulties associated with deep models


were empirically alleviated when “reasonably efficient”
unsupervised learning algorithms were introduced by Hinton
(2006)

13
What has changed now? (Cont...)

I A DBN is composed of a stack of restricted Boltzmann


machines (RBMs)

I A greedy, layer-by-layer algorithm optimizes the DBN weights


with time complexity linear in the size and depth
of the network.

I DBNs can be used to initialize the training of Deep Neural


Networks (DNNs)
I Advantages of DBNs:
I They supply good initializations for DNNs
I The learning algorithm makes effective use of unlabeled data

14
What has changed now? (Cont...)

Most importantly, GPUs and tools like Torch and


TensorFlow

15
Perceptron: History

History:

I McCulloch and Pitts (1943) introduced the idea of neural


networks as computing machines

I Hebb (1949) postulated the first rule for self-organized


learning

I Rosenblatt (1958) invented the perceptron, which algorithmically


described neural networks

16
Perceptron: History

17
Deep Learning: Advantages

I Nonlinearity

I Input-Output mapping

I Adaptivity

I Fault Tolerance

I VLSI implementability

18
Where do we start?

I Perceptron (Recall)

I Feed forward deep networks and Back propagation


algorithm

I later CNNs, LSTMs (if time permits)

19
Perceptron (Recall)
Hyperplanes

I Separates a d-dimensional space into two half


spaces (positive and negative)
I Equation of the hyperplane is

w^T x = 0

I By adding a bias b ∈ R:
w^T x + b = 0;  b > 0 moves the
hyperplane parallel along w,
b < 0 in the opposite direction
20
Hyperplane based classification

I Classification rule

y = sign(w^T x + b)

w^T x + b > 0 ⟹ y = +1
w^T x + b < 0 ⟹ y = −1

21
Hyperplane based classification

22
The Perceptron Algorithm (Rosenblatt, 1958)

I Aim is to learn a linear hyperplane to separate two classes.

I A mistake-driven online learning algorithm

I Guaranteed to find a separating hyperplane if data is


linearly separable.

I If data is not linearly separable

I Make it linearly separable using kernel methods.

I (Or) Use multilayer perceptron.

23
Perceptron Algorithm

I Given training data D = {(x_1, y_1), ..., (x_N, y_N)}

I Initialize w_old = [0, ..., 0], b_old = 0

I Repeat until convergence.

I For a random (x_n, y_n) ∈ D

I If y_n(w^T x_n + b) ≤ 0
[or sign(w^T x_n + b) ≠ y_n, i.e., a mistake is made]

I w_new = w_old + y_n x_n

I b_new = b_old + y_n

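The update rule above maps almost directly to code. A minimal Python/NumPy sketch (the function name, the random shuffling per pass, and the "stop after a mistake-free pass" convergence check are illustrative choices, not part of the slides):

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    # X: (n, d) data matrix; y: labels in {-1, +1}
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in np.random.permutation(n):
            if y[i] * (X[i] @ w + b) <= 0:   # mistake made
                w += y[i] * X[i]             # w_new = w_old + y_n x_n
                b += y[i]                    # b_new = b_old + y_n
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes:
            break                            # a separating hyperplane was found
    return w, b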
24
Perceptron Convergence Theorem (Block and Novikoff )

"Roughly": If the data is linearly separable, the perceptron


algorithm converges.

25
Feed Forward Neural Networks
Some Basic Features of Multilayer Perceptrons (or Feed-
forward Deep Neural Networks)

I Network will have hidden layers.


I Since the perceptron works only for linearly separable data,
each neuron has a non-linear activation function;
the activation function is differentiable.
I The network exhibits a high degree of connectivity.

26
Why hidden Layers?

27
Why hidden Layers? (Cont...)

I Hidden layers can automatically learn features from data

I The bottom-most hidden layer captures very low level


features (e.g., edges). Subsequent hidden layers learn
progressively more high-level features (e.g., parts of
objects) that are composed of previous layer’s features

28
Two important steps in training neural network

1 Forward step:
I Input is fed to the first layer.
I Input signal is propagated through the network layer by
layer.
I Synaptic weights of the network are fixed i.e. no learning
happens in this step.
I Error is calculated at the output layer by comparing the
observed output with "desired output" (Ground truth)
2 Backward step:
I The observed error at the output layer is propagated
"backwards", layer by layer. (How?)
I In this step, successive adjustments are made to the
synaptic weights.

29
Propagation of information in neural network

Two kinds of signals:

1 Function signal (leads to observed error)


2 Error signal (leads to the update of weights or parameters)

Propagation of information in neural network

30
Computation of signals

1 Computation of function signal (in the forward step)

2 Computation of gradient
I gradients of the "error surface" w.r.t. the weights (we will see
how later)
31
Error

I Let D = {(x(n), z(n))}_{n=1}^{N} be a training sample, where x(n) is
an input and z(n) is the desired output.

I x(n) ∈ R^D; we write x(n) = (x_1(n), ..., x_D(n)).

I z(n) ∈ R^M; we write z(n) = (z_1(n), ..., z_M(n)).

I Suppose the output of the network is

y(n) = (y_1(n), ..., y_M(n)) when x(n) is the input.

I Error at the j-th output neuron is

e_j(n) = z_j(n) − y_j(n),   j = 1, 2, ..., M

32
Error (contd. . . )

I The total error per sample is

E(n) = (1/2) ∑_{j=1}^{M} (z_j(n) − y_j(n))²

I The average error over the training data, or empirical risk, is

E = (1/N) ∑_{n=1}^{N} E(n) = (1/2N) ∑_{n=1}^{N} ∑_{j=1}^{M} (z_j(n) − y_j(n))²

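As a quick illustration, the per-sample error E(n) and the empirical risk E can be computed as follows (a NumPy sketch; it assumes the desired outputs Z and the network outputs Y are stored as N × M arrays, which is our own convention):

import numpy as np

def per_sample_error(z_n, y_n):
    # E(n) = 1/2 * sum_j (z_j(n) - y_j(n))^2
    return 0.5 * np.sum((z_n - y_n) ** 2)

def empirical_risk(Z, Y):
    # E = (1/N) * sum_n E(n), with Z and Y of shape (N, M)
    return np.mean([per_sample_error(z_n, y_n) for z_n, y_n in zip(Z, Y)])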
33
Backpropagation Algorithm: 1
The Backpropagation Algorithm

34
The Backpropagation Algorithm

I x_1(n), . . . , x_i(n), . . . , x_m(n): function signals that are


produced by the previous layer and are the inputs to the j-th
neuron.
I v_j(n) = ∑_{i=0}^{m} w_ji(n) x_i(n)

I v_j(n) is the induced local field.


I m is the size of the input (i.e., in the previous layer there
are m neurons)

I y_j(n) = ϕ(v_j(n))

I Function signal appearing at the output of neuron j.

35
The Backpropagation Algorithm (cont...)

I The BPA applies a correction ∆w_ji(n) to the synaptic weight,


proportional to ∂E(n)/∂w_ji(n),   i = 1, 2, . . . , m.

I Note that we are trying to update the j-th neuron out of all M


neurons.

I For the n-th data point, the error is


E(n) = (1/2) ∑_{j=1}^{M} (z_j(n) − y_j(n))² = (1/2) ∑_{j=1}^{M} e_j²(n)

36
The Backpropagation Algorithm (cont...)

We compute the derivative of

E(n) = (1/2) ∑_{j=1}^{M} (z_j(n) − y_j(n))²

w.r.t. w_ji(n) (apply the chain rule)

37
The Backpropagation Algorithm (cont...)

The derivative is

∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)

since

I E is a function of e_j

I e_j is a function of y_j (y_j is the output)

I y_j is a function of v_j (v_j is the local field)

I v_j is a function of w_ji

38
The Backpropagation Algorithm (cont...)

I The derivative is

∂E(n)/∂w_ji(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n)

I E(n) = (1/2) ∑_{j=1}^{M} e_j²(n)  ⟹  ∂E(n)/∂e_j(n) = e_j(n)
I e_j(n) = z_j(n) − y_j(n)  ⟹  ∂e_j(n)/∂y_j(n) = −1
I y_j(n) = ϕ(v_j(n))  ⟹  ∂y_j(n)/∂v_j(n) = ϕ′_j(v_j(n))
I v_j(n) = ∑_{i=0}^{m} w_ji(n) x_i(n)  ⟹  ∂v_j(n)/∂w_ji(n) = x_i(n)

I ⟹  ∂E(n)/∂w_ji(n) = −e_j(n) ϕ′_j(v_j(n)) x_i(n)

39
The Backpropagation Algorithm (contd. . . )

Update rule for the j-th output neuron

I We have

∂E(n)/∂w_ji(n) = −e_j(n) ϕ′_j(v_j(n)) x_i(n)

I Hence, the update rule is

w_ji(n + 1) = w_ji(n) − η ∂E(n)/∂w_ji(n)
            = w_ji(n) + η e_j(n) ϕ′_j(v_j(n)) x_i(n)

40
The Backpropagation Algorithm (contd. . . )

Local Gradient

I Define the local gradient δ_j(n) for the j-th neuron as

δ_j(n) = −∂E(n)/∂v_j(n)
       = −∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n) = e_j(n) ϕ′_j(v_j(n))

I w_ji(n + 1) = w_ji(n) + η δ_j(n) x_i(n), where δ_j(n) is the local gradient

41
BPA: Case 1: Neuron j is an output node

I The output neuron has "easy" access to the error

e_j(n) = z_j(n) − y_j(n)

E(n) = (1/2) ∑_{j=1}^{M} e_j²(n)

42
BPA: Case 1: Neuron j is an output node (Cont. . . )

I Update rule

w_ji(n + 1) = w_ji(n) − η ∂E(n)/∂w_ji(n)
            = w_ji(n) + η e_j(n) ϕ′(v_j(n)) x_i(n)    (e_j(n) ϕ′(v_j(n)) is the local gradient)
            = w_ji(n) + η δ_j(n) x_i(n)

43
BPA: Case 2: Neuron j is a hidden node

I Unlike an output neuron, a hidden neuron does


not have direct access to the "error".

I TRICK

I The error signal for a hidden neuron will be determined


recursively.

I It expects the neurons of the next layer (hidden or output)


that it is connected to, to share "some" of the error.

I Error propagates by working backwards.

45
BPA: Case 2: Neuron j is a hidden node (contd. . . )

Strategy
I First compute the local gradient δ_j(n) for the j-th hidden
neuron (how, we will see shortly)
I Then use an update similar to that of an output neuron

∆w = learning rate × local gradient × input
   = η δ_j(n) x_i(n)

46
BPA: Case 2: Neuron j is a hidden node (contd. . . )

Local gradient of the j-th hidden neuron:

δ_j(n) = −∂E(n)/∂v_j(n) = −∂E(n)/∂y_j(n) · ∂y_j(n)/∂v_j(n)
       = −∂E(n)/∂y_j(n) · ϕ′_j(v_j(n))    (∵ y_j(n) = ϕ_j(v_j(n)))

Note: If this had been an output neuron, we would have had

∂E(n)/∂y_j(n) = ∂E(n)/∂e_j(n) · ∂e_j(n)/∂y_j(n) = −e_j(n)

Since j is hidden, it does not have access to the error.

47
BPA: Case 2: Neuron j is a hidden node (contd. . . )

I We are trying to compute the local gradient

δ_j(n) = −∂E(n)/∂v_j(n) = −∂E(n)/∂y_j(n) · ϕ′_j(v_j(n))

I Let us compute ∂E(n)/∂y_j(n)
I We have E(n) = (1/2) ∑_{k∈C} e_k²(n), where the summation is over all the output neurons
I Then

∂E(n)/∂y_j(n) = ∑_{k∈C} e_k(n) ∂e_k(n)/∂y_j(n)
              = ∑_{k∈C} e_k(n) ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)

48
BPA: Case 2: Neuron j is a hidden node (contd. . . )

We are computing ∂E(n)/∂y_j(n) = ∑_{k∈C} e_k(n) ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)

I We have

e_k(n) = z_k(n) − y_k(n)
       = z_k(n) − ϕ_k(v_k(n))

I ∂e_k(n)/∂v_k(n) = −ϕ′_k(v_k(n))
I We have v_k(n) = ∑_{l=1}^{m} w_kl(n) y_l(n)

I Note that j ∈ {1, 2, ..., m}, and the output of the j-th neuron, along


with those of the other neurons in that layer, is fed to the k-th
output neuron.
⟹ ∂v_k(n)/∂y_j(n) = w_kj(n)

49
BPA: Case 2: Neuron j is a hidden node (contd. . . )

We are computing ∂E(n)/∂y_j(n) = ∑_{k∈C} e_k(n) ∂e_k(n)/∂v_k(n) · ∂v_k(n)/∂y_j(n)

∂E(n)/∂y_j(n) = −∑_{k∈C} e_k(n) ϕ′_k(v_k(n)) w_kj(n)
              = −∑_{k∈C} δ_k(n) w_kj(n)

where δ_k(n) = e_k(n) ϕ′_k(v_k(n)) is the local gradient of the k-th


output neuron.

50
BPA: Case 2: Neuron j is a hidden node (contd. . . )

We have   ∂E(n)/∂y_j(n) = −∑_k δ_k(n) w_kj(n)

and       ∂y_j(n)/∂v_j(n) = ϕ′_j(v_j(n))

Hence,    δ_j(n) = ϕ′_j(v_j(n)) ∑_k δ_k(n) w_kj(n)

51
BPA: Case 2: Neuron j is a hidden node (contd. . . )

I Now we have the local gradient at the j-th hidden node,


i.e., δ_j(n) = ϕ′_j(v_j(n)) ∑_k δ_k(n) w_kj(n)

where δ_k(n) = e_k(n) ϕ′_k(v_k(n)) is the local gradient at the k-th


output neuron.

I Hence,

w_ji(n + 1) = w_ji(n) + η δ_j(n) x_i(n)
            = w_ji(n) + η [ ϕ′_j(v_j(n)) ∑_{k∈C} δ_k(n) w_kj(n) ] x_i(n)

52
BPA: Update Rule Summary

Case 1: the j-th neuron is an output neuron

∆w_ji(n) = η · e_j(n) ϕ′_j(v_j(n)) · x_i(n)

(e_j(n) ϕ′_j(v_j(n)) is the local gradient at the j-th neuron)

Case 2: the j-th neuron is a hidden neuron (see Bishop, Section 5.3)

∆w_ji(n) = η · ( ∑_{k∈C} δ_k(n) w_kj(n) ) ϕ′_j(v_j(n)) · x_i(n)

(( ∑_{k∈C} δ_k(n) w_kj(n) ) ϕ′_j(v_j(n)) is the local gradient at the j-th neuron)

53
Online Vs Batch Learning

Batch Learning

I Each adjustment to the weights is performed after all


N examples in the training sample
have been presented.

I That is, the cost function is the average error, or empirical risk:


E = (1/N) ∑_{n=1}^{N} E(n) = (1/2N) ∑_{n=1}^{N} ∑_{j=1}^{M} e_j²(n)
  = (1/2N) ∑_{n=1}^{N} ∑_{j=1}^{M} (z_j(n) − y_j(n))²

55
Online Vs Batch Learning (Cont...)

Batch Learning

I This constitutes one epoch of training.

I In each epoch of training, samples are randomly shuffled.

I The learning curve in this case is E vs epoch number.

I Advantage: It can be easily parallelized.

I Disadvantage: Memory requirements are very high.

56
Online Vs Batch Learning

Online Learning
I Each adjustment to the weights is performed example by
example on the training data.
I The cost function is the error obtained on each sample:

E(n) = (1/2) ∑_{j=1}^{M} e_j²(n) = (1/2) ∑_{j=1}^{M} (z_j(n) − y_j(n))²

I The learning curve in this case is E(n) vs. epoch.
I The learning curve is significantly different from that of batch
learning.
I Online learning takes advantage of redundant data (multiple
copies of data).
I Online learning is simple to implement.
57
Activation Function

The activation function needs to be differentiable.

1 Logistic function:

ϕ_j(v_j(n)) = 1 / (1 + exp(−a v_j(n))),   a > 0

where v_j is the induced local field and a is a parameter.

2 Hyperbolic tangent function:

ϕ_j(v_j(n)) = a tanh(b v_j(n))

where a and b are positive constants.

58
Activation Functions

Heaviside step function:

ϕ(x) = 0 if x < 0
ϕ(x) = 1 if x > 0

This is useful in the case of the perceptron, which works
only when the data is linearly separable.

59
Activation Functions

Heaviside step function (contd.)


The reasons why we cannot use the Heaviside step function in
feedforward neural networks:

I We train neural networks using the backpropagation algorithm,


which requires a differentiable activation function. The
Heaviside step function is not differentiable at x = 0 and
has zero derivative everywhere else.
⟹ Gradient descent will not be able to make
progress in updating the weights.
I We want the neural network weights to be modified
continuously so that the predictions can get as close as possible to the
real values. Having a function that can only generate either 0 or
1 will not help to achieve this objective.
60
Activation Functions (contd...)

Sigmoid Function

I The sigmoid function is also known as the logistic function.


I It non-linearly squashes a number to a value between 0 and
1.
I sigmoid(z) = 1 / (1 + e^(−z))
I Activations are bounded between 0 and 1.

61
Activation Functions (contd. . . )

Sigmoid Function (contd...)


Disadvantages:

1 When the input is very small or very large (towards ±∞), the gradient is


(near) zero.

I Hence, while executing the backpropagation algorithm, the


weights will not get updated, i.e., there is no learning.
I Vanishing gradient problem.

2 Though computing activation functions is less


computationally expensive than matrix multiplications or
convolutions, computing the exponential is still expensive.

62
Activation Functions (contd. . . )

Tanh Function (or Hyperbolic tangent)


I It is similar to the sigmoid function but squashes the values
non-linearly between −1 and 1.

tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

I It shares some of the disadvantages of the sigmoid.

63
Activation Functions (contd. . . )

Rectified Linear Unit (ReLU)

I Given an input, if it is negative or zero, it outputs zero.


Otherwise it outputs the input unchanged.

ReLU(x) = max(0, x)

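For reference, a NumPy sketch of the three activation functions discussed in this section, together with the derivatives that backpropagation needs (the parameter a of the logistic function defaults to 1 here; function names are our own):

import numpy as np

def sigmoid(v, a=1.0):
    # logistic function with slope parameter a
    return 1.0 / (1.0 + np.exp(-a * v))

def d_sigmoid(v, a=1.0):
    s = sigmoid(v, a)
    return a * s * (1.0 - s)        # close to zero for |v| large

def tanh(v):
    return np.tanh(v)

def d_tanh(v):
    return 1.0 - np.tanh(v) ** 2

def relu(v):
    return np.maximum(0.0, v)

def d_relu(v):
    return (v > 0).astype(float)    # 0 for v <= 0, 1 otherwise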
64
Activation Functions (contd. . . )

Rectified Linear Unit (ReLU)

I Is it non-linear? Yes.

I A linear function should satisfy the property that

f(x + y) = f(x) + f(y)
But ϕ(−1) + ϕ(+1) ≠ ϕ(0)

I It is, however, piecewise linear.

65
Activation Functions (contd. . . )

ReLU (contd...)
I It is an unbounded function.
I Changing from sigmoid to ReLU as the activation function in
a hidden layer is possible - hidden neurons need not have
bounded values.
I The issue with the sigmoid function is that its derivative has very small
values (near zero) everywhere except near 0.

66
Activation Functions (contd. . . )

ReLU (contd...)

I At the j-th neuron, which is in a hidden layer,

∆w_ji(n) = η ( ∑_{k∈C} δ_k(n) w_kj(n) ) ϕ′_j(v_j(n))

where the sum is the local-gradient contribution from the layer above and
ϕ′_j is the derivative of the sigmoid function.

We multiply the gradient from the layer above with the partial


derivative of the sigmoid function.
I Since, in the case of the sigmoid, the derivative has very small values
everywhere except when the input is close to 0,
⟹ lower layers will likely have smaller gradients in
terms of magnitude compared to higher layers.

67
Activation Functions (contd. . . )

ReLU (contd...)
I The reason is that, in the case of the sigmoid, ϕ′(·) is always less than
1, with most values being close to 0.
⟹ This imbalance in gradient magnitude makes it
difficult to change the parameters of the neural network
with stochastic gradient descent.

68
Activation Functions (contd. . . )

ReLU (contd...)
I This problem can be mitigated by the use of the rectified linear
activation function, because the derivative of ReLU can have
many non-zero values.
⟹ This in turn means that the magnitude of the gradient
is more balanced throughout the network.
I Dying ReLU: a neuron in the network is permanently dead
due to its inability to fire in the forward pass.
⟹ When the activation is zero in the forward pass, all the
weights get zero gradient.
⟹ In backpropagation, the weights of such neurons never get
updated.
I Using ReLU in RNNs can blow up the computations to
infinity, as the activations are not bounded. 69
Rate of Learning

"The backpropagation algorithm provides an 'approximation' to the


trajectory in weight space computed by the method of steepest
descent."
Case 1: the smaller the learning rate ⇒ the smaller the changes to the
synaptic weights in the network

⇒ the smoother the learning will be
Case 2: the larger the learning rate ⇒ the network may become
unstable or oscillatory

70
Rate of Learning (contd. . . )

∆w_ji(n) = α ∆w_ji(n − 1) + η δ_j(n) y_i(n)

where α is the momentum constant;


α = 0 gives the original delta rule
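As a one-line illustration, this generalized delta rule can be coded as follows (the previous update ∆w_ji(n−1) must be stored per weight; the function and argument names are illustrative):

def momentum_update(dw_prev, delta_j, y_i, eta=0.1, alpha=0.9):
    # Delta w_ji(n) = alpha * Delta w_ji(n-1) + eta * delta_j(n) * y_i(n)
    return alpha * dw_prev + eta * delta_j * y_i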

71
Stopping Criteria

"In general, the backpropagation algorithm cannot be shown to


converge."

Hence there is no well-defined criterion for stopping its
operation.

72
Stopping Criteria (contd. . . )

Some good criteria:

I The Euclidean norm of ∂E/∂w reaches a sufficiently small gradient
threshold.
∵ The necessary condition for w* to be a global or
local minimum is ∂E/∂w |_{w*} = 0

I The absolute rate of change of the average squared error per


epoch is sufficiently small.

I Stop when the "generalization performance" is adequate or it is


apparent that the generalization performance has peaked.

73
Summary of Backpropagation Learning Algorithm

w_ji(n + 1) = w_ji(n) + η δ_j(n) x_i(n)

I η : learning rate
I δ_j(n) : local gradient
I x_i(n) : input to the j-th neuron

Local gradient:

δ_j(n) = e_j(n) ϕ′_j(v_j(n))                     if j is an output neuron

       = [ ∑_k δ_k(n) w_kj(n) ] ϕ′_j(v_j(n))     if j is a hidden neuron

where δ_k(n) is the local gradient of the k-th neuron at the output.

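A minimal NumPy sketch of this summary for a network with one hidden layer and sigmoid activations, doing one per-sample update (the array names, shapes and the absence of bias terms are illustrative choices, not part of the slides):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, z, W1, W2, eta=0.1):
    # Forward pass: hidden layer, then output layer
    v1 = W1 @ x                  # induced local fields of the hidden neurons
    y1 = sigmoid(v1)
    v2 = W2 @ y1                 # induced local fields of the output neurons
    y2 = sigmoid(v2)
    # Output neurons: delta_j = e_j * phi'(v_j), with phi'(v) = y(1 - y) for the sigmoid
    e = z - y2
    delta2 = e * y2 * (1.0 - y2)
    # Hidden neurons: delta_j = phi'(v_j) * sum_k delta_k w_kj
    delta1 = (y1 * (1.0 - y1)) * (W2.T @ delta2)
    # Updates: w_ji(n+1) = w_ji(n) + eta * delta_j * input_i
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
    return W1, W2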
74
XOR Problem

I Rosenblatt's single-layer perceptron has no hidden layer,


hence it cannot classify input patterns that are NOT
linearly separable.
Consider the XOR problem:

0 ⊕ 0 = 0
1 ⊕ 1 = 0
0 ⊕ 1 = 1
1 ⊕ 0 = 1

I {(0,0), (1,1)} and {(0,1), (1,0)} are not linearly separable.
Hence a single-layer neural network cannot solve this
problem.
75
XOR Problem (contd. . . )

I We use a hidden layer

I Consider the following neural network

76
XOR Problem (contd. . . )

I The function of the output neuron: construct a linear


combination of the decision boundaries formed by the two
hidden neurons.
For various inputs:

I For input (1,1): v1 = (1)(+1) + (1)(+1) + (+1)(-1.5)


= 1 + 1 - 1.5 = 0.5 ⇒ ϕ(v1) = 1
v2 = (1)(+1) + (1)(+1) + (+1)(-0.5)
= 1 + 1 - 0.5 = 1.5 ⇒ ϕ(v2) = 1
v3 = (1)(-2) + (1)(+1) + (+1)(-0.5)
= -2 + 1 - 0.5 = -1.5 ⇒ ϕ(v3) = 0

77
XOR Problem (contd. . . )

For various inputs:

I For input (0,0): v1 = (+1)(-1.5) = -1.5 ⇒ ϕ(v1) = 0


v2 = (+1)(-0.5) = -0.5 ⇒ ϕ(v2) = 0
v3 = (+1)(-0.5) = -0.5 ⇒ ϕ(v3) = 0

I For input (1,0): v1 = (1)(+1) + (+1)(-1.5)


= 1 - 1.5 = -0.5 ⇒ ϕ(v1) = 0
v2 = (1)(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v2) = 1
v3 = (1)(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v3) = 1

78
XOR Problem (contd. . . )

For various inputs :

I For input (0,1): v1 = (1)(+1) + (+1)(-1.5)


= 1 - 1.5 = -0.5 ⇒ ϕ(v1 ) =0
v2 = 1(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v2 ) = 1
v3 =1(+1) + (+1)(-0.5)
= 1 - 0.5 = 0.5 ⇒ ϕ(v3 ) =1

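The hand computations above can be checked directly in code. The following short Python sketch hard-codes the weights read off the computations (hidden neurons with weights (+1, +1) and biases −1.5 and −0.5, output neuron with weights (−2, +1) and bias −0.5, Heaviside step activation) and evaluates all four inputs:

def step(v):
    # Heaviside step activation
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(1 * x1 + 1 * x2 - 1.5)     # hidden neuron 1: weights (+1, +1), bias -1.5
    h2 = step(1 * x1 + 1 * x2 - 0.5)     # hidden neuron 2: weights (+1, +1), bias -0.5
    return step(-2 * h1 + 1 * h2 - 0.5)  # output neuron: weights (-2, +1), bias -0.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_net(x1, x2))     # prints 0, 1, 1, 0: the XOR function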
79
XOR Problem (contd. . . )

I Decision Boundary of neuron 1

I Decision Boundary of neuron 2

I Decision Boundary of neuron 3

80
Universal Approximation Theorem

Let ϕ(·) be a non-constant, bounded and monotonically increasing


continuous function. Let I_m0 denote the m0-dimensional unit hypercube
[0, 1]^m0, and let the space of continuous functions on I_m0 be denoted
by C(I_m0). Given any function f ∈ C(I_m0) and ε > 0, there
exist an integer m1 and sets of real numbers α_i, b_i and w_ij,
where i = 1, 2, ..., m1 and j = 1, 2, ..., m0,
such that

F(x_1, ..., x_m0) = ∑_{i=1}^{m1} α_i ϕ( ∑_{j=1}^{m0} w_ij x_j + b_i )

and F arbitrarily approximates f(·). That is,

|F(x_1, ..., x_m0) − f(x_1, ..., x_m0)| < ε    ∀ x_1, ..., x_m0 ∈ I_m0

81
Autoencoders
Introduction

I The origin of deep learning (post neural networks) since the


early 2000s was the use of Deep Belief Networks to
“pretrain” deep networks.
I This approach is based on the observation that random
initialization is not a good idea, and that pretraining each
layer with an unsupervised learning algorithm can allow for
better initial weights.
I Examples of such unsupervised algorithms are
I Deep Belief Networks based on Restricted Boltzmann
Machines
I Deep autoencoders

82
Compression

I The aim is to transmit this data: that is, we have to send both


the first and the second dimension
I If we observe carefully, the value of the second dimension is
just twice the first dimension
I Hence we can transmit just the first dimension (this can be thought
of as encoding the data) and compute the value of the
second dimension (this can be thought of as decoding the data) 83
Compression (cont...)

The process...

I Encoding: Map the data xn by means of some method to


compressed data zn

I Transmit

I Decoding: Map from compressed data zn to x̃n

84
Autoencoder

A linear encoding and decoding


I Encoding: z_n = W_1 x_n + b_1
I Decoding: x̃_n = W_2 z_n + b_2
Objective function:

J(W_1, b_1, W_2, b_2) = ∑_{n=1}^{N} ‖x̃_n − x_n‖²

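A minimal NumPy sketch of such a linear autoencoder trained by gradient descent on the reconstruction error (the layer sizes, learning rate, initialization and variable names are illustrative; the constant factor from the squared error is absorbed into the learning rate):

import numpy as np

def train_linear_autoencoder(X, k, eta=0.01, epochs=200):
    # X: (N, D) data; k: dimension of the code z_n
    N, D = X.shape
    rng = np.random.default_rng(0)
    W1, b1 = 0.01 * rng.standard_normal((k, D)), np.zeros(k)   # encoder
    W2, b2 = 0.01 * rng.standard_normal((D, k)), np.zeros(D)   # decoder
    for _ in range(epochs):
        for x in X:
            z = W1 @ x + b1              # encoding
            x_hat = W2 @ z + b2          # decoding
            g = x_hat - x                # gradient of the reconstruction error
            dz = W2.T @ g                # backpropagate through the decoder
            W2 -= eta * np.outer(g, z);  b2 -= eta * g
            W1 -= eta * np.outer(dz, x); b1 -= eta * dz
    return W1, b1, W2, b2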
85
Autoencoder

I If the data lie on a nonlinear surface, we use nonlinear


activation functions.
I If the data is highly nonlinear, one could add more hidden
layers to the networks to have a deep encoder.
I Note that this is an unsupervised learning.

86
Convolutional Neural Networks
Convolutional Neural Network(Introduction)

I Convolutional Neural Networks (CNNs) came into the limelight


in 2012
I Alex Krizhevsky used a CNN to win the 2012 ImageNet
competition.
I The classification error was improved from about 26% to
15%.
I Paper: Krizhevsky, Sutskever and Hinton: ImageNet
Classification with Deep Convolutional Neural Networks,
NIPS 2012.

I CNNs were first proposed in the paper by LeCun, Bottou,


Bengio, Haffner: Gradient-Based Learning Applied to
Document Recognition, 1998 (Proceedings of the IEEE)

87
Biological Connection

I Experiment by Hubel and Wiesel (1962)

I Some individual neuronal cells in the brain fire only in the


presence of edges of a certain orientation.

I For example, Some neurons fired when exposed to vertical


and some fired when exposed to horizontal edges.

I Hubel and Wiesel found that all these neurons were


organized in a column architecture and that together they
were able to produce visual perception.

88
CNN

I A CNN is a feed-forward neural network with a special


structure.
I Sparse "local" connectivity between layers, except the last
output layer ⇒ reduces the number of parameters.

Local connectivity of a CNN

I Shared weights (like a global filter) help to capture the


local properties of the signal (useful for images)
89
CNN

90
CNN

I Convolution: extracts "local" properties of the signal,


using "filters" that have to be learned.

I Pooling: downsamples the output to reduce the size of the


representation.

I Nonlinearity: a non-linearity is applied after the convolution


layer.

91
Convolution

I This operation extracts local spatial properties of the input

I The operation is defined as

h^k_ij = f((W^k ∗ X)_ij + b_k)
where W^k is a filter, ∗ is the convolution operation and f
is a nonlinear function.
I Several filters W^k, k = 1, 2, 3, ... are applied, which need to be
learned. The size of the filters also needs to be specified.
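A minimal NumPy sketch of this operation for a single-channel input and a single filter, with stride 1 and no padding (the nonlinearity f is left out so the raw feature map is visible; as in most deep learning libraries, the loop actually computes a cross-correlation):

import numpy as np

def conv2d(X, K, b=0.0):
    # X: (H, W) single-channel input; K: (kh, kw) filter; stride 1, no padding
    H, W = X.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the filter and the local patch
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * K) + b
    return out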
92
Convolution Layer

I We look at a small portion of the image through a "lens".

I Suppose the size of this lens is 5 × 5 × 3.

r* = ∑_{i=1}^{5} ∑_{j=1}^{5} a_ij b_ij
93
Convolution Layer (cont....)

I Example:

Input          Filter                 Output
32 × 32 × 3    5 × 5 × 3              28 × 28
32 × 32 × 3    5 × 5 × 3 (2 nos.)     28 × 28 × 2

I Each filter can be thought of as a "feature identifier".

I Intuition: in the input image, if there is a shape that


generally resembles the curve that the particular filter is
representing, then all the multiplications summed together
will result in a large value.

94
Convolution Layer (cont....)

Stride: the stride is the size of the shift of the filter across the image
(previously we kept the stride as 1).
Example:

3 × 3 convolution with a stride of 1

95
Convolution Layer (contd. . . )

Stride (contd...) Example:

Convolution with a stride of 2

I Size of output = (Size of input − Size of filter) / Stride + 1
96
Convolution with stride

The filter is moved along the image and at each position the dot product is
computed

Image taken from Poczos’s notes

97
Convolution Layer (contd. . . )

Padding: If we want the output to be the same size as the input, then we


pad the input with zeros.

Padding of two applied to the input

I To enforce the input and output to be of the same size, we need the


padding size to be

size of padding = (size of filter − 1) / 2

98
Convolution Layer (contd. . . )

In general,

size of output = (size of input − size of filter + 2 × size of padding) / size of stride + 1
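This formula is easy to sanity-check with a small helper (the integer division assumes the sizes divide evenly):

def conv_output_size(n_in, k, padding=0, stride=1):
    # (size of input - size of filter + 2 * padding) / stride + 1
    return (n_in - k + 2 * padding) // stride + 1

print(conv_output_size(32, 5))              # 28, as in the 32 x 32 x 3 example above
print(conv_output_size(32, 5, padding=2))   # 32, i.e. a "same"-sized output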

99
Rectified Linear Unit OR ReLU

I A recent advance (not very recent) : Use ReLU,


y(z) = max(0, z) as the activation function instead of
traditional sigmoid function.

Activation functions

I ReLU improves performance of many networks.

100
Pooling or Down Sampling Layer

Divide the layer into partitions and take the max or average of each


partition (a small code sketch follows the list below)

Max pooling

I Max Pooling
I Average Pooling
I L2-norm Pooling
101
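A minimal NumPy sketch of max pooling with 2 × 2 partitions and a stride of 2 (the window size and stride defaults are illustrative):

import numpy as np

def max_pool2d(X, size=2, stride=2):
    # Partition X into size x size windows and keep the maximum of each
    H, W = X.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.max(X[i * stride:i * stride + size,
                                 j * stride:j * stride + size])
    return out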
Pooling or Down Sampling Layer (contd. . . )

I Advantages

I Reduces the dimension of representation

I Controls overfitting

Lookout: If you have 99% to 100% accuracy on the training set and


only 40% to 50% test accuracy, it is a cause for concern.

102
Dropout Layer

I This layer drops a random set of activations in that layer by


setting them to zero.

I Helps as a Regularizer.

103
Architecture of LeNet-5 (LeCun et al., 1998)

Architecture of LeNet-5

I This is one of the first convolutional neural networks


I It was designed to classify images of handwritten digits.
I Here the activation function used is tanh, but now the usual
choice is ReLU.
104
Some Popular CNNs

AlexNet (2012)

I Trained on ImageNet (about 1.2 million labeled training images; the full ImageNet has over 15 million).


I Achieved a test error of 15.4% (the next best was about 26%).
I 5 convolution layers, max-pooling layers, dropout layers and 3
fully connected layers.
I Used ReLU for activation.
I Used data augmentation techniques consisting of image
translations, horizontal reflections and patch extraction.
I Implemented dropout layers in order to control overfitting to
the training data.

105
Some popular CNNs (contd. . . )

AlexNet (2012) (contd...)

I Trained the model using batch stochastic gradient descent


with specific values of momentum and decay.
I Trained on two GTX 580 GPUs for five to six days.

ZF Net (2013), Zeiler and Fergus

I Error of 11.2%.
I More of a fine-tuning of AlexNet.
I Provided visualizations which gave better intuitions.
I ZF Net was trained using 1.3 million images.

106
Some popular CNNs (contd. . . )

I Used 7x7 filters instead of 11x11 filters (as in AlexNet),


also with a decreased stride value.

I Smaller filters in the convolution layer help retain a lot of the


original pixel information in the input image.

I ReLU for activation, cross-entropy loss, training using


batch stochastic gradient descent.

I Trained on a GTX 580 GPU for 12 days

I A deconvolutional network helps to visualize the feature


maps.

107
Some popular CNN’s (contd. . . )

VGG Net (2014), Simonyan and Zisserman

I Error 7.3%.
I 19 weight layers, 3x3 convolution filters, padding of 1,
max pooling with a stride of 2.
I Trained for two to three weeks.

GoogLeNet (2015)

I Error 6.7%
I 22-layer CNN.
I Uses Inception modules.

108
Some Popular CNN’s (contd. . . )

Microsoft ResNet (2015)

I Error 3.6% (human error is around 5-10%).

I Residual blocks.

I 152 layers.

I Trained on an 8-GPU machine for 2 to 3 weeks.

109
Recurrent Neural Networks
What we have been doing so far?

Feed Forward Neural Networks


I Consist of input, hidden and output layers
I Given sequential data, FF networks do not take the
sequential structure of the data into account
I Given a sequence of observations x_1, . . . , x_T, the
corresponding hidden units h_1, . . . , h_T are assumed
independent of each other (i.i.d. data?)

110
What we have been doing so far? (cont. . . )

I How can we use feed forward neural networks for sequential


data like text, audio, video?

I Can we modify feed forward neural networks in such a way


that they “remember” the previous example?

I The answer is Recurrent Neural Networks or RNNs

111
RNN (Introduction)

I Since we have sequential data, the hidden state at each step


depends on the hidden state of the previous step

I Hence, h_t = ϕ(W x_t + U h_{t−1}), where U acts as a transition


matrix and ϕ is a nonlinear activation function
I h_t acts as a memory
I RNNs can be considered as multiple copies of the same
network, each passing a message to a successor.
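A minimal NumPy sketch of this recurrence, unrolling the network over a sequence, with tanh as the nonlinearity ϕ (the shapes and the zero initial state are illustrative choices):

import numpy as np

def rnn_forward(xs, W, U, h0=None):
    # xs: list of input vectors x_1..x_T; W: (h, d), U: (h, h)
    h = np.zeros(U.shape[0]) if h0 is None else h0
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)   # h_t = phi(W x_t + U h_{t-1})
        states.append(h)
    return states                    # the hidden states act as the memory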
112
RNN Application

I RNNs have many applications in modeling sequential


data
I Input, output or both can be sequences (possibly of
different lengths)
I Different inputs and different outputs need not be of the
same length
I Regardless of the length of the input, an RNN will learn a
fixed-size embedding for the input
113
RNN Training

I Trained using Backpropagation Through Time (forward


propagate from step 1 to end, and then backward
propagate from end to step 1)
I Think of the time-dimension as another hidden layer and
then it is just like standard backpropagation for
feedforward neural nets

114
Vanishing Gradient Problem

I The learnability of hidden states and outputs becomes weaker as


we move away from them along the sequence ⇒ weak
memory
I New inputs "overwrite" the activations of the previous
hidden states
I Repeated multiplications can cause the gradients to vanish
or explode (the latter especially with ReLU)
115
Are RNNs really useful for sequential data?

I The whole idea of the feedback loop is to be able to connect


previous information to the present task.

I For example, previous video frames may inform the


understanding of the present frame.

I How much past information can RNNs remember, so that


they can be used for the present task?

116
Are RNNs really useful for sequential data? (cont. . . )

I Consider a language model trying to predict the next word


based on the previous words.

I If the model is trying to predict the word sky in the sentence


the clouds are in the sky, the model does not require
very old context.

I If we consider I grew up in France...I speak fluent


French, there is a huge gap between the relevant pieces of
information.

I If this gap is too large, RNNs will not be able to connect


the information.

117
Are RNNs really useful for sequential data? (cont. . . )

Short Range Dependencies

Long Range Dependencies

The solution is Long Short-Term Memory Networks (Hochreiter


& Schmidhuber, 1997)
118
Capturing the Long range Dependencies

I Augment the hidden states with gates


I The gates involve some parameters which need to be
learned
I These gates help the model to remember and target
information selectively
I The hidden state has three types of gates
I Input (bottom), Forget (left) and Output (top)
I Open ’o’, closed ’-’
(a minimal LSTM cell sketch follows below)
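A minimal NumPy sketch of one step of a standard LSTM cell (this is one common formulation, with biases omitted for brevity; it is not necessarily the exact variant drawn in the figure):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wo, Wc):
    # One step of a standard LSTM cell; each W* acts on [h_prev, x]
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z)              # forget gate: what to keep from c_prev
    i = sigmoid(Wi @ z)              # input gate: what new content to write
    o = sigmoid(Wo @ z)              # output gate: what to expose as h_t
    c_tilde = np.tanh(Wc @ z)        # candidate cell content
    c = f * c_prev + i * c_tilde     # cell state: selectively remembered memory
    h = o * np.tanh(c)               # new hidden state
    return h, c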
119
Some images and material on CNNs and RNNs are taken from Piyush Rai's Lecture Notes.
Homework:

Go through Colah's blog on LSTM networks:


https://colah.github.io/posts/2015-08-Understanding-LSTMs/

120
