Deep Learning - Part-1

Machine Learning

ABDELA AHMED, PhD


© University of Gondar, 2022
[email protected]
Contents…

3. Deep Learning Algorithms


 ANN
 CNN
 DBM
 DBN
 Autoencoder
 LSTM

2
Outline

ANN
 Overview of DL
 Biological Neurons
 Neural Network Layer
 Gradient Descent
 Training NN

3
What is deep learning?
Overview of DL
 Deep Learning is a growing trend in general data analysis and has been
termed one of the 10 breakthrough technologies

 Deep learning has had a long and rich history, but has gone by many names
reflecting different philosophical viewpoints, and has waxed and waned in
popularity.

 Broadly speaking, there have been three waves of development of deep learning:
i. Deep learning known as cybernetics in the 1940s–1960s.
 development of theories of biological learning and
implementations of the first models such as the perceptron
allowing the training of a single neuron.
ii. Deep learning known as connectionism in the 1980s–1990s
 The central idea in connectionism is that a large number of simple
computational units can achieve intelligent behavior when
networked together
iii. The current resurgence under the name deep learning beginning in
2006
5
Overview of DL

 Deep learning has become more useful as the amount of available training data has increased.

 Deep learning models have grown in size over time as computer infrastructure (both hardware and software) for deep learning has improved.

 Deep learning has solved increasingly complicated applications with increasing accuracy over time.

6
Overview of DL

What exactly is Deep Learning?

 Deep Learning is a neural network with several layers of nodes between input and output.

 Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level.

 Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification.
Overview of DL

 Let’s be inspired by nature but not too much !!


 For airplanes, we developed aerodynamics and compressible fluid
dynamics.
 We figured that feathers and wing flapping weren't crucial
 What is the equivalent of aerodynamics for understanding
intelligence?
 Computational models of biological learning, i.e. models of how
learning happens or could happen in the brain
 Artificial Neural networks (ANNs) are algorithms that try to
mimic the information fusion of the brain.
 Most importantly, it is now the basis for most of the DL
algorithms
 As a result, one of the names that deep learning has gone by is
artificial neural networks (ANNs)

8
Neurons in the Brain
 The human brain is composed of about 10 billion neurons, which are interconnected with each other.
 Neurons communicate by sending electrical impulses to one another.
 A neuron receives inputs from other neurons, carries out some computation, and sends its output to other neurons via electrical impulses.

 A biological neuron consists of three main components:
 Dendrites: the input-signal channels, where the strength of the connections to the nucleus is affected by weights.
 Cell Body: where the computation on input signals and weights generates output signals, which are delivered to other neurons.
 Axon: transmits output signals to the other neurons that are connected to it.
Biological vs Artificial Neurons

 The ANN uses a very simplified mathematical model of what a biological neuron does.
 ANNs are comprised of several interconnected computational units (neurons) arranged in layers.
 The basic operating unit in a neural network is a neuron-like node: it takes input from other nodes and sends output to others.
 Each neuron is a computational unit that takes inputs x1, x2, x3, …, xn and outputs y = f(z), where f is the activation function (e.g. binary threshold, Sigmoid, Softmax, ReLU, and others).
 Each connection link is associated with a weight that determines the strength of the interconnection.

10
Model of an artificial Neuron: Perceptron
z = w1·x1 + w2·x2 + w3·x3 + b

 The neuron receives the weighted sum as input and calculates the output as a function of that input.
 Example: compute the output of the following perceptron, which uses a sigmoid activation function, a bias value of 0.5, inputs (2, 3, -1), and weights (0.9, 0.2, 0.3):

  weighted sum = 0.9×2 + 0.2×3 + 0.3×(-1) = 2.1
  z = 2.1 + 0.5 = 2.6
  y = σ(z) = 1 / (1 + e^(-2.6)) ≈ 0.93
11
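As a quick check, here is a minimal Python (NumPy) sketch of this computation, using the example's inputs, weights, and bias (the function name is illustrative):

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    z = np.dot(w, x) + b              # z = w1*x1 + w2*x2 + w3*x3 + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

x = np.array([2.0, 3.0, -1.0])        # inputs from the example
w = np.array([0.9, 0.2, 0.3])         # weights
b = 0.5                               # bias

print(perceptron(x, w, b))            # ~0.93, matching the slide
```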
Logistic Regression vs Perceptron

 What is the hypothesis function of linear regression?
 hθ(x) = θᵀx, where θ = [θ0, θ1, …, θm], x = [x0, x1, …, xm], and x0 = 1 to account for the intercept term.

 Viewed as a single unit with inputs x0 = 1, x1, x2, x3:
  hθ = θᵀx = θ0·x0 + θ1·x1 + θ2·x2 + θ3·x3
     = θ0 + θ1·x1 + θ2·x2 + θ3·x3
12
Logistic Regression vs Perceptron

 What is the hypothesis function of logistic regression?
 hθ(x) = σ(θᵀx), where θ = [θ0, θ1, …, θm], x = [x0, x1, …, xm], x0 = 1 to account for the intercept term, and σ(z) = 1 / (1 + e^(-z)).

 Viewed as a single unit with inputs x0 = 1, x1, x2, x3, which first computes z = θᵀx and then a = σ(z):
  hθ = a = σ(z) = σ(θᵀx) = σ(θ0·x0 + θ1·x1 + θ2·x2 + θ3·x3)
     = σ(θ0 + θ1·x1 + θ2·x2 + θ3·x3)
     = 1 / (1 + e^(-(θ0 + θ1·x1 + θ2·x2 + θ3·x3)))
13
Logistic Regression vs Perceptron

 Technically, logistic regression is a neural network with only 1 neuron.

[Diagram: the logistic regression unit redrawn as a single neuron with inputs x0, …, x3 and weights θ0, …, θ3, computing z = θᵀx and a = σ(z) = hθ.]
14
Logistic Regression vs Perceptron

 Technically, logistic regression is a neural network with only 1 neuron.

 Using the notation of the neural-network literature: x0 = 1 is dropped and its weight is denoted as the bias, θ0 = w0 = b; the remaining weights are θ = w = [w1, w2, w3] (w0 is not part of this vector here); and the output is hθ = ŷ.

15
Logistic Regression vs Perceptron

 Technically, logistic regression is a neural network with only 1 neuron.

 With inputs x1, x2, x3, weights w1, w2, w3, and bias b, the neuron computes z = wᵀx + b and a = σ(z):
  ŷ = a = σ(z) = σ(wᵀx + b)
    = σ(w1·x1 + w2·x2 + w3·x3 + b)
    = 1 / (1 + e^(-(w1·x1 + w2·x2 + w3·x3 + b)))

16
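A minimal NumPy sketch of this single-neuron view of logistic regression; the input values and weights below are illustrative placeholders, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Single neuron / logistic regression: y_hat = a = sigmoid(w.T x + b)."""
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([0.5, -1.2, 3.0])   # illustrative inputs
w = np.array([0.4, 0.1, -0.2])   # illustrative weights
b = 0.1                          # bias
print(neuron(x, w, b))
```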
Neural Network Architectures

 A neural network is highly structured and comes in layers: the first layer is the input layer, the last layer is the output layer, and all layers in between are referred to as hidden layers.
 The input layer accepts the inputs and forwards them for further processing throughout the network.
 The input units are equivalent to the feature vector considered for the classification task.
 The output layer produces the prediction for the input instance.
17
Neural Network Architectures
 Left: A 2-layer Neural Network
(one hidden layer of 7 neurons (or
units) and one output layer with 4
neurons), and five inputs.

 Right: A 5-layer neural network


with five inputs, four hidden
layers of 7 neurons each and one
output layer.

 Why not just use a single neuron? Why do we need a larger network?
 A single neuron (like logistic regression) only permits a linear
decision boundary
 Most real world problems are considerably more complicated

18
Topologies of an ANN

[Diagram: example topologies — completely connected, feedforward (directed, acyclic), and recurrent (feedback connections).]
 Feedforward versus recurrent networks
 Feedforward: no loops, input → hidden layers → output
 Recurrent: uses feedback (positive or negative); a network with feedback, where some of its inputs are connected to some of its outputs (discrete time).
 For regular neural networks, the most common layer type is the fully-connected
layer in which neurons between two adjacent layers are fully pairwise
connected, but neurons within a single layer share no connections
 The above feed forward neural network is an example of Neural Network
topologies that use a stack of fully-connected layers
Multi Layer Perceptron (MLP)

 An artificial neural network structure where the flow of


information processing is in only one direction is called Feed
Forward Neural Network.
 One of the most popular FFNN models is the multi-layer perceptron (MLP).
 The MLP architecture has been applied to various problems, including disease diagnosis, function approximation, pattern classification, fault identification, and manufacturing processes.
20
MLP- 11 neurons, 3 layers: Notations
 We can construct a neural network with as many layers, and
neurons in any layer, as needed
 Notations
 x = input, b = bias term
 w = weights
 z = net input
 f = activation function
 a = output to next layer

[Diagram: Input Layer, Layer 1, Layer 2, Layer 3]
21
MLP- 11 neurons, 3 layers : Notations
 We can construct a neural network with as many layers, and
neurons in any layer, as needed
 Notations
 x = input, b = bias term
 w = weights
 z = net input
 f = activation function
 a = output to next layer

22
MLP- 4 neurons, 2 layers : Notations
 We can construct a neural network with as many layers, and
neurons in any layer, as needed
 Notations
 x = input, b = bias term
 w = weights
 z = net input- sum of weighted inputs
 f = activation function
 a = output to next layer

23
MLP- 4 neurons, 2 layers : Notations
 We can construct a neural network with as many layers, and
neurons in any layer, as needed
 Notations
 x = input, b = bias term
 w = weights
 z = net input- sum of weighted inputs
 f = activation function
 a = activation - output to next layer

24
MLP Matrix representations - Example
 We can construct a neural network with as many layers, and
neurons in any layer, as needed

[Diagram: Input Layer (or Layer 0) and Layer 1]

25
MLP Matrix representations - Example

Inputs and weights:

  x1  x2  x3 |  w14   w15   w24   w25   w34   w35 |  w46   w56
   1   0   1 |  0.2  -0.3   0.4   0.1  -0.5   0.2 | -0.3  -0.2

 Biases added to the hidden neurons (4, 5) and the output neuron (6):

   b4    b5    b6
 -0.4   0.2   0.1
MLP Matrix representations - Example

Net Input and Output Calculation (sigmoid activation):

 Unit j | Net input zj                                | Output Oj
 -------|---------------------------------------------|----------------------------
   4    | 0.2 + 0 - 0.5 - 0.4 = -0.7                  | 1 / (1 + e^(0.7))  = 0.332
   5    | -0.3 + 0 + 0.2 + 0.2 = 0.1                  | 1 / (1 + e^(-0.1)) = 0.525
   6    | (-0.3)(0.332) + (-0.2)(0.525) + 0.1 = -0.105| 1 / (1 + e^(0.105)) = 0.475
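A short NumPy sketch that reproduces the table above with the sigmoid activation (variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])               # x1, x2, x3

# Hidden units 4 and 5: each row holds [w1j, w2j, w3j]
W_hidden = np.array([[0.2, 0.4, -0.5],      # weights into unit 4
                     [-0.3, 0.1, 0.2]])     # weights into unit 5
b_hidden = np.array([-0.4, 0.2])            # b4, b5

z_hidden = W_hidden @ x + b_hidden          # [-0.7, 0.1]
o_hidden = sigmoid(z_hidden)                # [0.332, 0.525]

# Output unit 6
w_out = np.array([-0.3, -0.2])              # w46, w56
b_out = 0.1
z_out = w_out @ o_hidden + b_out            # ~ -0.105
o_out = sigmoid(z_out)                      # ~ 0.474 (0.475 in the table)

print(o_hidden, z_out, o_out)
```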
MLP : Matrix representation
 We can construct a network of neurons (i.e., a
neural network) with as many layers, and neurons
in any layer, as needed

[Diagram: inputs x0, x1, x2, x3 (Input Layer, or Layer 0) feed four neurons in Layer 1; neuron i computes z_i^[1] and its activation a_i^[1]. The superscript [1] indicates that the activation a is in layer 1.]

32
MLP Matrix representations
 We can construct a network of neurons (i.e., a neural network) with as many layers, and neurons in any layer, as needed

[Diagram: the same network extended with Layer 2 — a single neuron that takes the Layer-1 activations a_1^[1] … a_4^[1] as inputs and computes z_1^[2] and a_1^[2].]

37
MLP Matrix representations
 We can construct a network of neurons (i.e., a
neural network) with as many layers, and neurons
in any layer, as needed

[Diagram: inputs x0 … x3 → hidden layer with 4 neurons (Layer 1) → output layer (Layer 2) producing ŷ.]

By convention, this neural network is said to have 2 layers (and not 3), since the input layer is typically not counted!
ANN Layer
 We can construct a network of neurons (i.e., a
neural network) with as many layers, and neurons
in any layer, as needed

[Diagram: the same 2-layer network — input layer (Layer 0), hidden layer with 4 neurons (Layer 1), and output layer (Layer 2) producing ŷ.]

Also, the more layers we add, the deeper the neural network becomes, giving rise to the concept of deep learning!

39
MLP Matrix representations
 We can construct a network of neurons (i.e., a
neural network) with as many layers, and neurons
in any layer, as needed

[Diagram: the same 2-layer network.]

Interestingly, neural networks learn their own features!

40
MLP Matrix representations
 We can construct a network of neurons (i.e., a
neural network) with as many layers, and neurons
in any layer, as needed

[Diagram: the output neuron viewed on its own, taking the hidden activations as inputs.]

This looks like logistic regression, but with features that were learnt (i.e., a_1^[1], a_2^[1], a_3^[1], a_4^[1]) and NOT engineered by us (i.e., x1, x2, and x3).

41
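A minimal NumPy sketch of this 2-layer network in matrix form; the weights are random placeholders rather than trained values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x = rng.normal(size=3)            # inputs x1..x3 (x0 = 1 is folded into the biases)

W1 = rng.normal(size=(4, 3))      # Layer 1: 4 hidden neurons
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))      # Layer 2: 1 output neuron
b2 = np.zeros(1)

z1 = W1 @ x + b1                  # z^[1]
a1 = sigmoid(z1)                  # a^[1] — the learned features
z2 = W2 @ a1 + b2                 # z^[2]
y_hat = sigmoid(z2)               # a^[2] = ŷ

print(y_hat)
```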
Outline

ANN
 Overview of DL
 Biological Neurons
 Neural Network Layer
 Training ANN
 Backprop.

42
ANN
 A mathematical model composed of
a large number of simple, highly
interconnected processing elements.
 The building blocks of ANN are the
neurons

 A neuron consists of:

1. A set of links, describing the neuron inputs, with weights w1, w2, …, wm.

2. An adder function (linear combiner) for computing the weighted sum of the inputs (real numbers):

  z = wᵀx + b

3. An activation function, for limiting the amplitude of the neuron output:

  a = f(z)
43
Commonly Used Activation Functions
 Every activation function (or non-linearity) takes a single number and
performs a certain fixed mathematical operation on it. There are several
activation functions you may encounter in practice:
1. Sigmoid
 The sigmoid non-linearity has the mathematical form σ(z) = 1 / (1 + e^(-z))
and is shown in the image below on the left
 Sigmoid non-linearity squashes real numbers to range between [0,1]

2. Hyperbolic Tangent Function (tanh)
 The tanh non-linearity is shown in the image above on the right.
 It squashes a real-valued number to the range [-1, 1].
 tanh(z) = sinh(z) / cosh(z) = (e^(2z) − 1) / (e^(2z) + 1)
44
Commonly Used Activation Functions
3. Rectified Linear Unit (ReLU)
 The ReLU has become very popular in the last few years.
 It computes the linear weighted sum of the inputs, and the output is a non-linear function of the total input computed using the max operation:
 f(z) = max(0, z), i.e. f(z) = 0 for z < 0 and f(z) = z for z ≥ 0

4. Leaky ReLU
 Acts like ReLU, but allows negative outcomes:
 f(z) = αz for z < 0 and f(z) = z for z ≥ 0 (α is a small slope)

45
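The four activation functions above, sketched in NumPy (the value of alpha for Leaky ReLU is an assumed small constant):

```python
import numpy as np

def sigmoid(z):
    """Squashes real numbers into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes real numbers into the range (-1, 1)."""
    return np.tanh(z)                 # = (e^(2z) - 1) / (e^(2z) + 1)

def relu(z):
    """f(z) = max(0, z)."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but allows small negative outputs (alpha is the negative slope)."""
    return np.where(z < 0, alpha * z, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```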
Perceptron Limitations
 A single neuron (like logistic regression) only permits a linear
decision boundary.
Linearly Separable Problem Linearly Inseparable Problems

 For a problem that is not linearly separable:
 Would it help if we use more layers of neurons?
 What could be the learning rule for each neuron?
 Solution: more than one layer of perceptrons together with the backpropagation learning algorithm can learn any Boolean function.
 The capacity of the network increases with more hidden units and more hidden layers.
46
Multilayer Perceptron
 In Multilayer perceptron, there may be one or more hidden layer(s) which
are called hidden since they are not observed from the outside.

 Each layer may have a different number of nodes and a different activation function:
 Commonly, the same activation function is used within one layer.
 Typically,
 a ReLU/tanh activation function is used in the hidden units, and
 Sigmoid/Softmax or linear activation functions are used in the output units, depending on the problem.
47
Sizing Neural Networks…
 The two metrics that people commonly use to measure the size
of neural networks are the number of neurons, or more
commonly the number of parameters. Working with the two
example networks in the above picture:

 The above ANN has


 4 + 2 = 6 neurons (not counting the inputs),
 [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total
of 26 learnable parameters. 48
Sizing Neural Networks…

Similarly, the above network has


 4 + 4 + 1 = 9 neurons,
 [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and
 4 + 4 + 1 = 9 biases,
 for a total of 41 learnable parameters.
 To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning).
49
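A small Python helper, as a sketch, that counts parameters the same way for any list of fully connected layer sizes; it reproduces the 26 and 41 above:

```python
def count_parameters(layer_sizes):
    """Weights and biases of a fully connected network.

    layer_sizes includes the input size, e.g. [3, 4, 2] means
    3 inputs, a hidden layer of 4 neurons, and an output layer of 2 neurons.
    """
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_parameters([3, 4, 2]))      # 20 weights + 6 biases = 26
print(count_parameters([3, 4, 4, 1]))   # 32 weights + 9 biases = 41
```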
Training ANN
 How have we trained before (Logistic regression)?
1. Specify how to compute the output given the input x and parameters w and b (define the hypothesis function (model)):
  hθ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx))

2. Specify the loss and the cost:
  L(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))
  J(θ) = cost(hθ(x)) = (1/m) Σᵢ₌₁ᵐ L(hθ(xᵢ), yᵢ)

3. Train on data to minimize J(θ) using gradient descent:
  Start off with some guesses for θ0, …, θm
  Repeat until convergence {
    θj := θj - α ∂J(θ)/∂θj
  }
50
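A minimal sketch of these three steps for logistic regression with batch gradient descent, assuming NumPy and a small synthetic dataset (all names and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: X gets a leading column of ones for the intercept term.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)

theta = np.zeros(3)                  # initial guesses for theta_0..theta_m
alpha = 0.1                          # learning rate
m = len(y)

for _ in range(1000):                # repeat (approximately) until convergence
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)      # step 1: hypothesis
    J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))   # step 2: cost
    grad = X.T @ (h - y) / m         # dJ/dtheta for the cross-entropy cost
    theta -= alpha * grad            # step 3: gradient descent update

print(theta, J)
```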
Training ANN: Intuition
 The process of learning the parameters (weights and biases) so
as to optimize its “performance” or minimize the cost function
 The process of training ANN
 Put in training inputs, get the output (Make Prediction)
 Compare the output to correct answers, and calculate the
loss function J, which measures our error
 Adjust the weight accordingly and repeat the process

A dataset:
 Fields           class
 1.4  2.7  1.9    0
 3.8  3.4  3.2    0
 6.4  2.8  1.7    1
 4.1  0.1  0.2    0
 etc …
51
Training ANN: Intuition
Training the neural network
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1
4.1 0.1 0.2 0
etc …

52
Training ANN: Intuition
Training data
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1
4.1 0.1 0.2 0 Step-1 : Initialise with random weights
etc …

53
Training ANN: Intuition
Training data
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1 Step-2 : Make prediction
4.1 0.1 0.2 0
etc …
Feed the inputs (1.4, 2.7, 1.9) into the network.

54
Training ANN: Intuition
Training data
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1 Step-2 : Make prediction
4.1 0.1 0.2 0
etc …
Inputs (1.4, 2.7, 1.9) → predicted output 0.8.

55
Training ANN: Intuition
Training data
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1
Step-3: Compare prediction Vs. target,
4.1 0.1 0.2 0
for the 1st training dataset
etc …

Inputs (1.4, 2.7, 1.9) → prediction 0.8, target 0:
 L1(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x)) = 2.32


56
Training ANN: Intuition
Training data
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1
4.1 0.1 0.2 0 Compare prediction Vs. target
etc … for the 2nd training example

Inputs (6.4, 2.8, 1.7) → prediction 0.9, target 1:
 L2(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x)) = 0.152


57
Training ANN: Intuition
Training data
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1
4.1 0.1 0.2 0 Compare prediction Vs. target
etc … for the 3rd training example

Inputs (6.4, 2.8, 1.7) → prediction 0.5, target 0:
 L3(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x)) = 1


58
Training ANN: Intuition…
Training data
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1
4.1 0.1 0.2 0 Compare prediction Vs. target
etc … for the 4th training example

Inputs (6.4, 2.8, 1.7) → prediction 0.5, target 0:
 L4(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x)) = 1


J(θ) = cost(hθ(x)) = (1/m) Σᵢ₌₁ᵐ L(hθ(xᵢ), yᵢ) = (1/4)(2.32 + 0.152 + 1 + 1) ≈ 1.12
59
Training ANN: Intuition…

Training data
Fields class
1.4 2.7 1.9 0
3.8 3.4 3.2 0
6.4 2.8 1.7 1 Step-4: Adjust weights and biases
4.1 0.1 0.2 0 based on the error (Backprop)
etc …
Inputs (1.4, 2.7, 1.9) → prediction 0.8, target 0; overall cost = 1.12

Repeat this thousands, maybe millions of times – each time


taking a random training instance, and making slight
weight adjustments
Algorithms for weight adjustment are designed to make changes that will reduce the error.
60
Training ANNs (Implementations)

 We need to first perform a forward pass


(forward propagation )
 calculate outputs given input pattern x.

 Then, we update weights with a backward


pass (Backward propagation)
 update weights by calculating delta

61
Forward propagation (aka “Inference”)
 Make prediction about that data and Calculate the cost function,
where the nodes of the output layer are probabilities that the
sample is of a certain class.

Step 1 (Making prediction): Specify how to compute the output


given input x and parameters w and b (define the model).
 Think of ANN as a function F: X --> Y involving many weights Wk
 Calculate the different values at each layer to ultimately get the
predicted output values
62
Forward propagation (aka “Inference”)
 Make prediction about that data and Calculate the cost function,
where the nodes of the output layer are probabilities that the
sample is of a certain class.

Step 2: Specify the loss and cost function


  J(Θ) = (1/m) Σᵢ₌₁ᵐ L(f_{w,b}(xᵢ), yᵢ)

 Squared loss (regression)
 Cross-entropy loss (classification)
63
Forward propagation (aka “Inference”)
 Make prediction about that data and Calculate the cost function,
where the nodes of the output layer are probabilities that the
sample is of a certain class.

Step 2: Specify the loss and cost function


  J(Θ) = (1/m) Σᵢ₌₁ᵐ L(f_{w,b}(xᵢ), yᵢ)
 Squared loss (regression)
 Cross-entropy loss (classification):
  L(F(x), y) = -y log(F(x)) - (1 - y) log(1 - F(x))
64


Backward propagation (aka “Backprop.”)

 For each training instance the backpropagation algorithm first makes a


prediction (forward pass), measures the error, then goes through each
layer in reverse to measure the error contribution from each connection
(reverse pass), and finally slightly tweaks the connection weights to
reduce the error (Gradient Descent step).
 The backpropagation algorithm is used to compute the partial derivatives, using calculus, to update each of the weights in the right direction.
 Compute the error derivatives in each hidden layer from the error derivatives in the layer above.
65
Backward propagation (aka “Backprop.”)

 Step 3: Minimize the cost function by adjusting the parameters:
  W_new = W_old - lr ∗ ∂J/∂W
  b_new = b_old - lr ∗ ∂J/∂b
 Start with an initial guess w0 and update the guess in each stage, moving along the search direction.
66
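A minimal sketch of one forward and backward pass for a single sigmoid neuron with cross-entropy loss (the setting used in these slides); it relies on the standard result that for this loss dL/dz = a - y. The input values and weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training instance (illustrative values)
x = np.array([1.4, 2.7, 1.9])
y = 0.0

w = np.array([0.1, -0.2, 0.05])
b = 0.0
lr = 0.1

# Forward pass
z = w @ x + b
a = sigmoid(z)                 # prediction ŷ

# Backward pass: for cross-entropy loss with a sigmoid output,
# dL/dz = a - y, so dL/dw = (a - y) * x and dL/db = (a - y).
dz = a - y
dw = dz * x
db = dz

# Gradient descent step
w_new = w - lr * dw
b_new = b - lr * db
print(a, w_new, b_new)
```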
Backprop: Example

[A worked backpropagation example is shown graphically on the original slides 67-70.]
Convolutional Neural Network
ConvNet
CNN

71
Outline

CNN
 Motivation
 Building Blocks
 Convolution Settings
 State-of-the-art architectures
 Transfer Learning

 In 1995, Yann LeCun and Yoshua Bengio introduced the concept of


convolutional neural networks (CNNs or ConvNets).
 CNNs are a special kind of multi-layer neural networks, designed to recognize
visual patterns directly from pixel images with minimal preprocessing
 ConvNets have shown excellent performance in many computer vision and
machine learning problems.
72
Motivation – Image Data
 Demand Prediction ANN: an ANN that looks at a t-shirt product and tries to predict whether the product will be a top seller or not.
 You have collected data on different T-shirts, as well as which ones are top sellers.

73
Motivation – Image Data
 Recognizing Images Using an ANN: an ANN that looks at the image of a person and tries to predict the identity of that person (face recognition).
 You have collected images of different persons, as well as which ones are Yohannes.

[Example image: 1000 × 1000 pixels.]

74
Motivation – Image Data

1,000,000 values (features)


75
Motivation – Image Data

 A single fully connected layer would require:


1,000,000 × 1,000,000 = 1,000,000,000,000 weights
 For color images, it would typically contain
(1,000 × 1,000 × 3)² = 9,000,000,000,000 weights
 Fully connected image networks would require a vast number of
parameters.
 Variance would be too high---very high likelihood of overfitting
76
Motivation – Image Data
 Important structures in image data
 Topology of pixels (spatial locality):
 pixels have a natural topology; their spatial arrangement is meaningful.
 The set of pixels we will have to take into consideration to find a cat will be near one another in the image.
 However, the structure of the ANN treats all inputs interchangeably
- No relationship between the individual inputs
- Just an ordered set of variables
- We want to incorporate domain knowledge into the architecture
of a neural network
For example, we won’t have to consider some combination of pixels in the four
corners of the image, in order to see if they encode catness.

77
Motivation – Image Data
 Important structures in image data
 Topology of pixels (Spatial locality)---
 Translation invariance
 The pattern of pixels that characterizes a cat is the same no matter
where in the image the cat occurs

For example : Cats don’t look different if they’re


on the left or the right side of the image.

78
Motivation – Image Data
 Important structures in image data
 Topology of pixels (Spatial locality)---
 Translation invariance
 Scale invariance
 Issues of lighting and contrast
 Knowledge of human visual system
 Features need to be “built up”.
 Edges  Shapes  relation between shapes

79
Motivation – Image Data
 The motivation behind the CNN is that different layers can learn
certain intermediate features
 Features need to be “built up”.
 Edges  Shapes  relation between shapes
 Identifying Textures
 CAT = [Two eyes in certain relation to one another] + [cat fur texture]
=> Eyes = dark circle (pupil) inside another circle
=> Circle = particular combination of edge detectors
=> Fur = edges in certain pattern
 Addressing invariant problem
 Save computation time
 Significantly diminish the amount of training data

80
Outline

CNN
 Motivation
 Building Blocks
 Convolution Settings
 State-of-the-art architectures
 Transfer Learning

81
Typical CNN Architecture
 Typical CNN architectures look like

 [(CONV+ReLU)*N + POOL?]*M + (FC+ReLU)*K + SOFTMAX


where N is usually up to ~5, M is large, 0 <= K <= 2.
 However, recent advances such as ResNet/GoogLeNet challenge this
paradigm
 Layers used to build ConvNets:
 a stacked sequence of layers. 3 main types
 Convolutional Layer, Pooling Layer, and Fully-Connected Layer
82
Typical CNN Architecture

CONV: Convolutional kernel layer


RELU: Activation function
POOL: Dimension reduction layer
FC: Fully connected layer

 For the above architecture, what is the value of N, M, and K


 N = 2 [(CONV+ReLU)*N + POOL?]*M
 M = 3 + (FC+ReLU)*K + SOFTMAX
 K= ?
Convolutional Layer
 The most essential component of any CNN architecture
 Performs feature extraction using a stack of convolution operation
and activation function
 Convolve the filter with the image.
 “Slide” over the image spatially, computing dot products between the filter values and the image values at each step, and aggregating the outputs to produce a new image.
 This process of applying the filter to the image to create a new image is called “convolution.”

 Kernels = Filters = Feature


detectors = receptive field
 Feature maps
 Padding
 Stride = Step Size
Convolutional Layer : Kernel
 Kernels (Filters): a grid of weights overlaid on an image, centered on one pixel, with each weight multiplied by the pixel underneath it

 Used for traditional image


processing techniques :
 Blur, Sharpen, Edges, etc.
Convolutional Layer: convolution operation - Example

 Kernel size: 3 x 3
 Image size: 5 x 5
 Calculate the output: slide the kernel over the image and take the dot product at each position.

[The worked example is shown graphically on the original slides; the first output values computed are 51, 60, 20, 31, …, -2.]
Convolutional Layer:
Stride

Filter 1:           Filter 2:
  1 -1 -1            -1  1 -1
 -1  1 -1            -1  1 -1
 -1 -1  1            -1  1 -1

6 x 6 image:
 1 0 0 0 0 1
 0 1 0 0 1 0
 0 0 1 1 0 0
 1 0 0 0 1 0
 0 1 0 0 1 0
 0 0 1 0 1 0
Convolutional Layer:
Stride

With Filter 1 and stride = 1: take the dot product of the filter with each 3 x 3 patch of the 6 x 6 image. The first two positions give 3 and -1.
Convolutional Layer:
Stride
With Filter 1, if stride = 2, the filter jumps two pixels at a time: the first row of outputs is 3 and -3.
Convolutional Layer:
Stride
Filter 1, stride = 1, applied to the whole 6 x 6 image gives a 4 x 4 feature map:

  3 -1 -3 -1
 -3  1  0 -3
 -3 -3  0  1
  3 -2 -2 -1
Convolutional Layer:
Stride
Repeat this for each filter. Filter 2 with stride = 1 gives a second 4 x 4 feature map:

 -1 -1 -1 -1
 -1 -1 -2  1
 -1 -1 -2  1
 -1  0 -4  3

Two 4 x 4 feature maps, forming a 2 x 4 x 4 matrix.
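A small NumPy sketch of this sliding dot product; with stride 1 it reproduces the 4 x 4 feature map for Filter 1, and with stride 2 the smaller map:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and take dot products (no padding)."""
    n, f = image.shape[0], kernel.shape[0]
    out = (n - f) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            result[i, j] = np.sum(patch * kernel)
    return result

image = np.array([[1,0,0,0,0,1],
                  [0,1,0,0,1,0],
                  [0,0,1,1,0,0],
                  [1,0,0,0,1,0],
                  [0,1,0,0,1,0],
                  [0,0,1,0,1,0]])

filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

print(convolve2d(image, filter1, stride=1))   # 4x4 feature map, top-left value = 3
print(convolve2d(image, filter1, stride=2))   # 2x2 feature map
```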
Convolutional Layer: Stride

 7x7 input (spatially), assume a 3x3 filter applied with stride 1.
 Determine the output image size.
 => 5x5 output
Convolutional Layer: Stride

 7x7 input (spatially), assume a 3x3 filter applied with stride 2.
 Determine the output image size.
 => 3x3 output!

 Stride
 The step size as the kernel moves across the image.
 When the stride is greater than 1, it scales down the output dimension.
Convolutional Layer: Stride

 7x7 input (spatially), assume a 3x3 filter applied with stride 3?
 Doesn't fit! Cannot apply a 3x3 filter to a 7x7 input with stride 3.

 Using kernels directly, there will be an edge effect:
 pixels near the edge will not be used as center pixels, since there are not enough surrounding pixels.
Convolutional Layer: Stride

Can you find the formula for the output image size, given the input image size (NxN), kernel size (FxF), and stride s?

Output size: (N - F) / s + 1

e.g. N = 7, F = 3:
 stride 1 => (7 - 3)/1 + 1 = 5
 stride 2 => (7 - 3)/2 + 1 = 3
 stride 3 => (7 - 3)/3 + 1 = 2.33 :\
Convolutional Layer:
Padding



Padding :
 Adding extra pixels around the frame, so pixels from the original image
become center pixels as the kernel moves across the image
 In practice: Common to zero pad the border
Convolutional Layer:
Padding

Input 5x5, 3x3 filter applied with stride 1: what is the output?
(Recall, without padding: (N - F) / stride + 1)
=> 3x3 output!
Convolutional Layer:
Padding

Input 5x5, 3x3 filter applied with stride 1, now zero padded with a 1-pixel border: what is the output?
=> 5x5 output!
Convolutional Layer:
Padding
e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1-pixel border => what is the output?
 7x7 output!

 In general, with padding p and stride s, the output image size is:
  (N - F + 2p) / s + 1

 In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (this will preserve the size spatially):
 e.g. F = 3 => zero pad with 1
    F = 5 => zero pad with 2
    F = 7 => zero pad with 3
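A small Python helper, as a sketch, implementing the output-size formula with stride and padding:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Output spatial size of a convolution: (N - F + 2p) / s + 1."""
    size = (n - f + 2 * pad) / stride + 1
    if not size.is_integer():
        raise ValueError(f"{f}x{f} filter with stride {stride} and pad {pad} "
                         f"does not fit an {n}x{n} input")
    return int(size)

print(conv_output_size(7, 3, stride=1))          # 5
print(conv_output_size(7, 3, stride=2))          # 3
print(conv_output_size(7, 3, stride=1, pad=1))   # 7 (padding preserves the size)
print(conv_output_size(32, 5, stride=1, pad=2))  # 32, as in the 32x32x3 example later
```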
Convolutional Layer:
convolution for 3D images

 In images we have multiple numbers associated with each pixel location.


 These numbers are referred to as channels. Example RGB image : 3 channel
 The number of channels is referred to as the depth.
 The kernel will have a depth of the same size as the number of input channels
 Example : a 3 x 3 kernel on an RGB image, there will be 3x3x3= 27 weights
 27 multiplications added together to get one centered pixel

 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image
Convolutional Layer:
convolution for 3D images

[Diagram: for a color (3-channel) image, each 3x3 filter also has 3 channels; Filter 1 and Filter 2 are slid over all channels of the 6x6x3 input.]
Convolutional Layer:
convolution for 3D images

Example: input volume 32x32x3, 6 5x5 filters with stride 1, pad 2.
What will be the output volume size?
 (32 - 5 + 2*2)/1 + 1 = 32 spatially, so 32x32x6


Convolutional Layer: Summary

 We can think of kernels as local feature detectors

 Primary idea behind convolutional neural network:


 Let the neural network learn which kernels are most useful
 Use the same set of kernels across the entire image (translation invariance)
 Reduce the number of parameters
Common settings:
K = (powers of 2, e.g. 32, 64, 128, 512)
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0



Typical CNN Architecture
 Typical CNN architectures look like

 [(CONV+ReLU)*N + POOL?]*M + (FC+ReLU)*K + SOFTMAX


where N is usually up to ~5, M is large, 0 <= K <= 2.
117
Typical CNN Architecture

CONV: Convolutional kernel layer


RELU: Activation function
POOL: Dimension reduction layer
FC: Fully connected layer
[Pipeline: input image → Convolution → Max Pooling → … (can repeat many times) → Flattened → Fully Connected feedforward network → outputs (cat, dog, ……).]
Pooling Layer
 Reduce the image size by mapping a patch of a pixel to a single value
 Shrinks the dimensions of the image
 Doesn’t have parameters, though there are different type of pooling
operation
 Max-pool: for each distinct patch, represent it by the maximum
 Average-pool: for each distinct patch, represent it by the average

 Example: 2x2 maxpool and avgpool, reducing the image size from 4x4 to 2x2


Pooling Layer

The two 4 x 4 feature maps produced by Filter 1 and Filter 2:

 Filter 1:              Filter 2:
  3 -1 -3 -1            -1 -1 -1 -1
 -3  1  0 -3            -1 -1 -2  1
 -3 -3  0  1            -1 -1 -2  1
  3 -2 -2 -1            -1  0 -4  3
Pooling Layer

6 x 6 image → Conv → 4 x 4 feature maps → Max Pooling → a new image, but smaller (2 x 2 per filter):

 Filter 1 max-pooled:   Filter 2 max-pooled:
  3 0                    -1 1
  3 1                     0 3
Each filter is a channel: the result is a new image, smaller than the original, whose number of channels is the number of filters.

Convolution and Max Pooling can be repeated many times.


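A small NumPy sketch of 2 x 2 max pooling; applied to the Filter 1 feature map it reproduces the 2 x 2 result above:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max pooling: keep the maximum of each non-overlapping size x size patch."""
    n = feature_map.shape[0] // size
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            patch = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = patch.max()
    return out

fmap = np.array([[ 3,-1,-3,-1],
                 [-3, 1, 0,-3],
                 [-3,-3, 0, 1],
                 [ 3,-2,-2,-1]])      # Filter 1 feature map from the example

print(max_pool(fmap))                 # [[3, 0], [3, 1]]
```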
Typical CNN Architecture
 Typical CNN architectures look like

 [(CONV+ReLU)*N + POOL?]*M + (FC+ReLU)*K + SOFTMAX


where N is usually up to ~5, M is large, 0 <= K <= 2.
124
The whole CNN
[Pipeline: Convolution → Max Pooling → a new image → Convolution → Max Pooling → a new image → Flattened → Fully Connected feedforward network → outputs (cat, dog, ……).]
Flattening

The 2 x 2 x 2 output of the last pooling stage (the maps [3 0; 3 1] and [-1 1; 0 3]) is flattened into a single vector, which is then fed into a fully connected feedforward network.
CNN in Keras

Only the network structure and the input format are modified (vector -> 3-D tensor).

 Input_shape = (28, 28, 1): 28 x 28 pixels; 1 channel for black/white, 3 for RGB.
 The first Convolution layer has 25 3x3 filters, followed by Max Pooling, then another Convolution and Max Pooling.
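A minimal Keras sketch consistent with this description (25 3x3 filters on a 28 x 28 x 1 input, each convolution followed by max pooling); the second convolution's filter count and the dense-layer sizes are illustrative assumptions, not taken from the slides:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # 28 x 28 pixels, 1 channel (black/white)
    layers.Conv2D(25, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(50, (3, 3), activation="relu"),   # filter count assumed
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(100, activation="relu"),           # size assumed
    layers.Dense(10, activation="softmax"),         # e.g. 10 output classes
])
model.summary()
```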
Outline

CNN
 Motivation
 Building Blocks
 Convolution Settings
 State-of-the-art architectures
 Transfer Learning

128
LeNet-5
[LeCun et al., 1998]

Conv filters were 5x5, applied at stride 1
Subsampling (pooling) layers were 2x2, applied at stride 2
i.e. the architecture is [CONV-POOL-CONV-POOL-CONV-FC]
Tested on MNIST
AlexNet



AlexNet

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55



AlexNet

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Output volume [55x55x96]

Q: What is the total number of parameters in this layer?



AlexNet

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4


=>
Output volume [55x55x96]
Parameters: (11*11*3)*96 = 35K



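The same arithmetic as a small Python check; the bias count (one per filter) is shown separately, since the slide quotes weights only:

```python
# AlexNet CONV1: 96 filters of size 11x11x3 applied at stride 4 to a 227x227x3 input
out_size = (227 - 11) // 4 + 1      # 55  -> output volume 55x55x96
weights = 11 * 11 * 3 * 96          # 34,848 (~35K, as on the slide)
biases = 96                         # one bias per filter (not counted above)
print(out_size, weights, weights + biases)
```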
AlexNet

Input: 227x227x3 images After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27



AlexNet

Input: 227x227x3 images After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2


Output volume: 27x27x96

Q: what is the number of parameters in this layer?



VGGNet

Only 3x3 CONV stride 1, pad 1


and 2x2 MAX POOL stride 2

best model

11.2% top 5 error in ILSVRC 2013



INPUT: [224x224x3] memory: 224*224*3=150K params: 0

CONV3-64: [224x224x64] memory: 224*224*64=3.2Mparams: (3*3*3)*64 = 1,728


CONV3-64: [224x224x64] memory: 224*224*64=3.2Mparams: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800Kparams: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6Mparams: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6Mparams: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400Kparams: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
GoogLeNet
[Szegedy et al., 2014]

Inception module
ILSVRC 2014 winner (6.7% top 5 error)

Fun features:
Only 5 million params! (Removes FC layers completely)
Compared to AlexNet:
27 Jan 2016Fei-Fei Li & Andrej to AlexNet:
Karpathy & Justin Johnson
- 12X less params
- 2x more compute
- 6.67% (vs. 15.4%)
ResNet [He et al., 2015]

224x224x3 input, spatial dimension only 56x56!



ResNet [He et al., 2015]



ResNet [He et al., 2015]

- Batch Normalization after every CONV layer


- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used



ResNet [He et al., 2015]

(this trick is also used in GoogLeNet)



Thank You

145
