
Neural Networks

Lectures (4-5)

Dr. Mona Nagy ElBedwehy


Quiz (1)
The XOR function can be represented as

x1 XOR x2 ⟺ (x1 OR x2) AND NOT (x1 AND x2)

Construct a MADALINE to implement this formulation of XOR, and compare it with the previous result.

Quiz (2)
Find the weights required to perform the following classifications: vectors (1, 1, 1, 1) and (−1, 1, −1, −1) are members of the class (target 1), and vectors (1, 1, 1, −1) and (1, −1, −1, 1) are not members of the class (target −1). Use a learning rate of 0.5 and starting weights of 0. Using each of the training vectors as input, test the response of the net.

Introduction
➢ The single-layer perceptron is one of the oldest and earliest introduced neural networks.
➢ It was proposed by Frank Rosenblatt in 1958.
➢ The perceptron is a simple form of artificial neural network.
➢ It is mainly used to compute logic gates such as AND, OR, and NOR, which have binary inputs and binary outputs.
➢ The main functionality of the perceptron is to:
▪ Take inputs from the input layer.
▪ Weight them and sum them up.
▪ Pass the sum to the activation function to produce the output (a minimal sketch follows below).
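A minimal Python sketch of this forward pass (the step threshold, weights, and bias below are illustrative values chosen to implement an AND gate, not values from the lecture):

```python
# Hypothetical single perceptron wired as an AND gate (illustrative weights/bias).
def step(z, threshold=0.0):
    # Binary step activation: output 1 if the weighted sum reaches the threshold.
    return 1 if z >= threshold else 0

def perceptron(inputs, weights, bias):
    # Weight each input, sum them up, add the bias, then apply the activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", perceptron([x1, x2], weights=[1.0, 1.0], bias=-1.5))
```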
➢ Early in the history of AI, the development of neural networks, which at first seemed successful, ran into a problem and entered a period of decline.
➢ One reason for this was the XOR problem of the perceptron.
➢ Looking at the graph of the XOR gate, the + and − results cannot be separated by a single straight line (they are not linearly separable).
➢ For the OR gate and the AND gate, a single line can distinguish between + and −, but for the XOR gate this is impossible.
➢ The XOR results can, however, be separated using two straight lines, as in the following graph.
➢ The linear separability problem can be overcome by adding more layers and choosing the values of the weights and thresholds in such a way that the decision boundary converges into a closed region.
➢ Professor Marvin Minsky said, "We can solve it using multi-layer."

Multilayer Perceptron
➢ A multilayer perceptron (MLP) is a perceptron that teams up
with additional perceptrons, stacked in several layers, to solve
complex problems.
➢ Each perceptron in the first layer on the left (the input layer),
sends outputs to all the perceptrons in the second layer (the
hidden layer), and all perceptrons in the second layer send
outputs to the final layer on the right (the output layer).
➢ A three-layer MLP is called a Non-Deep or Shallow Neural
Network.
➢ An MLP with four or more layers is called a Deep Neural
Network (DNN).
➢ One difference between the classic perceptron and more general neural networks is that in the classic perceptron the decision function is a step function and the output is binary.
➢ In the neural networks that evolved from MLPs, other activation functions can be used, which result in real-valued outputs, usually between 0 and 1 or between -1 and 1.
➢ This allows for probability-based predictions or classification of items into multiple labels.
➢ A multilayer perceptron is a special case of a feedforward neural
network where every layer is a fully connected layer.

Activation Functions
➢ An activation function helps the NN to use important information while suppressing irrelevant data points.
➢ An activation function acts as a decision-making unit at the output of a neuron; for example, it decides whether a neuron should be activated or not.
➢ The role of the activation function is to derive the output from a set of input values fed to a node (or a layer).
➢ The neuron learns linear or non-linear decision boundaries based on the activation function.

➢ It has a normalizing effect on the neuron output, which prevents the outputs of neurons from becoming very large after several layers due to the cascading effect.
➢ The purpose of an activation function is to add non-linearity to the neural network.

➢ There are three types of neural network activation functions:
1. Binary Step Function.
2. Linear Activation Function.
3. Non-Linear Activation Functions.

Binary Step Function
➢ The binary step function depends on a threshold value that decides whether a neuron should be activated or not.
➢ The input fed to the activation function is compared to a certain threshold; if the input is greater than the threshold, the neuron is activated, else it is deactivated, meaning that its output is not passed on to the next hidden layer.

Here are some of the limitations of the binary step function:
👉 It cannot provide multi-value outputs, i.e. it cannot be used for multi-class classification problems.
👉 The gradient of the binary step function is zero, which causes a hindrance in the backpropagation process.

Linear Activation Function
➢ The linear activation function, also known as "no activation" or the "identity function" (multiplied by 1.0), is where the activation is proportional to the input.
➢ The function doesn't do anything to the weighted sum of the input; it simply spits out the value it was given.
➢ Mathematically it can be represented as:
f(x) = x
➢ In one sense, a linear function is better than a binary step function because it allows multiple outputs, not just yes and no.

➢ However, a linear activation function has two major problems:
1. It's not possible to use backpropagation, as the derivative of the function is a constant and has no relation to the input x.
2. All layers of the neural network will collapse into one if a linear activation function is used. No matter the number of layers in the neural network, the last layer will still be a linear function of the first layer. So, essentially, a linear activation function turns the NN into just one layer, as the check below shows.
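A small numerical check of this collapse, assuming arbitrary illustrative weight matrices and using NumPy:

```python
import numpy as np

# Two stacked "linear layers": y = W2 @ (W1 @ x + b1) + b2 (illustrative values).
W1, b1 = np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.5, -0.5])
W2, b2 = np.array([[2.0, 0.0], [1.0, 1.0]]), np.array([1.0, 2.0])
x = np.array([0.3, -0.7])

two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer: W = W2 @ W1, b = W2 @ b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: the stack collapses into one layer
```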

Non-Linear Activation Functions
Non-linear activation functions solve the following limitations of
linear activation functions:
1. They allow backpropagation because now the derivative
function would be related to the input, and it’s possible to
go back and understand which weights in the input
neurons can provide a better prediction.
2. They allow the stacking of multiple layers of neurons as
the output would now be a non-linear combination of input
passed through multiple layers. Any output can be
represented as a functional computation in a neural
network.
Sigmoid / Logistic Activation Function
There are 10 non-linear neural network activation functions; the most widely used ones are described below.
1. Sigmoid / Logistic Activation Function
➢ This function takes any real value as input and outputs values in the range of 0 to 1.
➢ The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.
➢ Mathematically it can be represented as:
σ(x) = 1 / (1 + e^(−x))


Here's why the sigmoid/logistic activation function is one of the most widely used functions:
➢ It is commonly used for models where we have to predict a probability as the output. Since probabilities exist only in the range 0 to 1, sigmoid is the right choice because of its range.
➢ The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in output values.

The limitations of the sigmoid function are discussed below:
➢ The derivative is f′(x) = sigmoid(x) · (1 − sigmoid(x)). As we can see from the figure, the gradient values are only significant for the range −3 to 3, and the graph gets much flatter in other regions.
➢ This implies that for values greater than 3 or less than −3, the function will have very small gradients. As the gradient value approaches zero, the network ceases to learn and suffers from the vanishing gradient problem.
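A short sketch of the derivative above, showing how the gradient fades outside roughly [−3, 3] (the sample inputs are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # f'(x) = sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10, -3, 0, 3, 10):
    print(f"x = {x:+3d}   sigmoid = {sigmoid(x):.5f}   gradient = {sigmoid_grad(x):.5f}")
# The gradient is only significant roughly in [-3, 3]; outside that range it
# approaches zero, which is the vanishing gradient problem described above.
```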

Tanh Function (Hyperbolic Tangent)
2. Tanh Function (Hyperbolic Tangent)
➢ The Tanh function is very similar to the sigmoid/logistic
activation function, and even has the same S-shape with the
difference in output range of -1 to 1.

➢ In Tanh, the larger the input (more positive), the closer the
output value will be to 1.0, whereas the smaller the input
(more negative), the closer the output will be to -1.0.
➢ Mathematically it can be represented as:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Have a look at the gradient of the tanh activation function to
understand its limitations:
➢ As you can see— it also faces the problem of vanishing
gradients similar to the sigmoid activation function. Plus, the
gradient of the tanh function is much steeper as compared to
the sigmoid function.
➢ 💡 Note: Although both sigmoid and tanh face vanishing
gradient issue, tanh is zero centered, and the gradients are not
restricted to move in a certain direction. Therefore, in practice,
tanh nonlinearity is always preferred to sigmoid nonlinearity.
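A quick numeric comparison of the two gradients at a few illustrative points:

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; its maximum is 1.0 at x = 0 (vs 0.25 for sigmoid).
    return 1.0 - math.tanh(x) ** 2

for x in (-3, -1, 0, 1, 3):
    print(f"x = {x:+d}   sigmoid' = {sigmoid_grad(x):.4f}   tanh' = {tanh_grad(x):.4f}")
# tanh's gradient is steeper around zero and tanh is zero-centered,
# but both gradients still vanish for large |x|.
```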

ReLU (Rectified Linear Unit) Function
➢ In the context of artificial neural networks, the rectifier or ReLU
(rectified linear unit) activation function is an activation
function defined as:
f(x) = max(0, x), i.e. f(x) = x if x > 0 and f(x) = 0 if x ≤ 0,

where x is the input to a neuron.


➢ This is also known as a ramp function.

➢ This activation function was introduced by Kunihiko Fukushima in 1969 in the context of visual feature extraction in hierarchical neural networks.
Advantages
▪ Computationally efficient—allows the network to converge
very quickly.
▪ Non-linear—although it looks like a linear function.
Disadvantages
▪ When inputs approach zero or are negative, the gradient of the function becomes zero; the network cannot perform backpropagation and cannot learn (the dying ReLU problem).

Leaky ReLU Function
➢ Leaky ReLUs allow a small, positive gradient when the unit is
not active, helping to mitigate the vanishing gradient problem.
f(x) = x if x > 0, and f(x) = 0.01x if x ≤ 0

Advantages
▪ Prevents the dying ReLU problem—this variation of ReLU has a small positive slope in the negative area, so it does enable backpropagation, even for negative input values.

Disadvantages
▪ The results are not consistent—leaky ReLU does not provide consistent predictions for negative input values.
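A minimal sketch of ReLU and leaky ReLU side by side (the sample inputs are illustrative; 0.01 is the leak slope used in the formula above):

```python
def relu(x):
    # max(0, x): the gradient is zero for x <= 0 (the dying ReLU problem).
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # A small positive slope for negative inputs keeps the gradient non-zero.
    return x if x > 0 else slope * x

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x = {x:+.1f}   relu = {relu(x):+.3f}   leaky_relu = {leaky_relu(x):+.4f}")
```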

Softmax Function
➢ The softmax function, also known as normalized exponential
function, converts a vector of 𝑲 real numbers into a probability
distribution of 𝑲 possible outcomes.
➢ The softmax function is a generalization of the logistic function
to multiple dimensions, and used in multinomial logistic
regression.
➢ The softmax function is often used as the last activation function
of a neural network to normalize the output of a network to a
probability distribution over predicted output classes.

➢ For a vector z of K real numbers, the standard softmax function σ: ℝ^K → (0, 1)^K, where K ≥ 1, is defined by the formula:
σ(z)_i = e^(z_i) / Σ_{j=1}^{K} e^(z_j),   for i = 1, …, K.
Advantages
▪ Able to handle multiple classes, whereas most other activation functions handle only one class—softmax normalizes the output for each class to between 0 and 1 and divides by their sum, giving the probability of the input value belonging to a specific class.

▪ Useful for output neurons—typically Softmax is used only for
the output layer, for neural networks that need to classify
inputs into multiple categories.
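A small numeric sketch of the softmax formula above (the input values are illustrative):

```python
import math

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick; it does not
    # change the result because it cancels in the ratio.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0 -- a valid probability distribution over the K classes
```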

Multilayer Perceptron Network
➢ Let’s write out the MLP computations mathematically.
Conceptually, there’s nothing new here; we just have to pick a
notation to refer to various parts of the network. As with the
linear case, we’ll refer to the activations of the input units as
𝒙𝒋 and the activation of the output unit as 𝒚.
➢ The units in the ℓ-th hidden layer will be denoted h_i^(ℓ).
➢ The network is fully connected, so each unit receives connections from all the units in the previous layer.
➢ This means each unit has its own bias, and there's a weight for every pair of units in two consecutive layers.

Therefore, the network's computations can be written out as:

h_i^(1) = φ^(1)( Σ_j w_ij^(1) x_j + b_i^(1) )

h_i^(2) = φ^(2)( Σ_j w_ij^(2) h_j^(1) + b_i^(2) )

⋮

h_i^(n) = φ^(n)( Σ_j w_ij^(n) h_j^(n−1) + b_i^(n) )

y_i = φ^(n+1)( Σ_j w_ij^(n+1) h_j^(n) + b_i^(n+1) )

Note that we distinguish 𝝓(𝟏) , 𝝓(𝟐), …, and 𝝓(𝒏) because different
layers may have different activation functions.
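A compact NumPy sketch of these layered computations, for one hidden layer plus an output layer (the sizes, random weights, and choice of activations here are illustrative, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(h_prev, W, b, phi):
    # One fully connected layer: phi(W @ h_prev + b).
    return phi(W @ h_prev + b)

x = rng.normal(size=3)                            # input activations x_j
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # output-layer weights and biases

h1 = layer(x, W1, b1, np.tanh)                          # phi^(1) = tanh
y = layer(h1, W2, b2, lambda z: 1 / (1 + np.exp(-z)))   # phi^(2) = sigmoid

print(h1, y)
```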

Multilayer Perceptron Example
Example (1): Implement the XOR gate using a multilayer perceptron.

The network uses two hidden units, h1 = σ(20·x1 + 20·x2 − 10) and h2 = σ(−20·x1 − 20·x2 + 30), and an output unit y = σ(20·h1 + 20·h2 − 30).

Using the Binary Step Activation Function

(0, 0):  σ(20·0 + 20·0 − 10) = σ(−10) = 0
         σ(−20·0 − 20·0 + 30) = σ(30) = 1
         σ(20·0 + 20·1 − 30) = σ(−10) = 0

(0, 1):  σ(20·0 + 20·1 − 10) = σ(10) = 1
         σ(−20·0 − 20·1 + 30) = σ(10) = 1
         σ(20·1 + 20·1 − 30) = σ(10) = 1

(1, 0):  σ(20·1 + 20·0 − 10) = σ(10) = 1
         σ(−20·1 − 20·0 + 30) = σ(10) = 1
         σ(20·1 + 20·1 − 30) = σ(10) = 1

(1, 1):  σ(20·1 + 20·1 − 10) = σ(30) = 1
         σ(−20·1 − 20·1 + 30) = σ(−10) = 0
         σ(20·1 + 20·0 − 30) = σ(−10) = 0
Using the Sigmoid Activation Function

(0, 0):  σ(20·0 + 20·0 − 10) = σ(−10) = 1/(1 + e^10) = 0.000045 ≈ 0
         σ(−20·0 − 20·0 + 30) = σ(30) = 1/(1 + e^−30) ≈ 1
         σ(20·0 + 20·1 − 30) = σ(−10) = 1/(1 + e^10) = 0.000045 ≈ 0

(0, 1):  σ(20·0 + 20·1 − 10) = σ(10) = 1/(1 + e^−10) = 0.999955 ≈ 1
         σ(−20·0 − 20·1 + 30) = σ(10) = 1/(1 + e^−10) = 0.999955 ≈ 1
         σ(20·1 + 20·1 − 30) = σ(10) = 1/(1 + e^−10) = 0.999955 ≈ 1

(1, 0):  σ(20·1 + 20·0 − 10) = σ(10) = 1/(1 + e^−10) = 0.999955 ≈ 1
         σ(−20·1 − 20·0 + 30) = σ(10) = 1/(1 + e^−10) = 0.999955 ≈ 1
         σ(20·1 + 20·1 − 30) = σ(10) = 1/(1 + e^−10) = 0.999955 ≈ 1

(1, 1):  σ(20·1 + 20·1 − 10) = σ(30) = 1/(1 + e^−30) ≈ 1
         σ(−20·1 − 20·1 + 30) = σ(−10) = 1/(1 + e^10) = 0.000045 ≈ 0
         σ(20·1 + 20·0 − 30) = σ(−10) = 1/(1 + e^10) = 0.000045 ≈ 0


Multilayer Perceptron Example
Example (2): A network with two inputs x1 = x2 = 0.4, two hidden layers of two units each, and two output units. The first hidden layer uses ReLU with weights w1 = 0.4, w2 = 0.6, w3 = 0.6, w4 = 0.4; the second hidden layer uses tanh with weights w5 = 0.6, w6 = 0.7, w7 = 0.7, w8 = 0.6; the output layer uses softmax with weights w9 = 0.8, w10 = 0.7, w11 = 0.4, w12 = 0.8.
h11 = σ(w1·x1 + w3·x2) = σ(0.4·0.4 + 0.6·0.4) = σ(0.4) = max(0, 0.4) = 0.4
h12 = σ(w2·x1 + w4·x2) = σ(0.6·0.4 + 0.4·0.4) = σ(0.4) = max(0, 0.4) = 0.4
h21 = σ(w5·h11 + w7·h12) = σ(0.6·0.4 + 0.7·0.4) = σ(0.52) = tanh(0.52) = 0.48
h22 = σ(w6·h11 + w8·h12) = σ(0.7·0.4 + 0.6·0.4) = σ(0.52) = tanh(0.52) = 0.48
y1 = σ(w9·h21 + w10·h22) = σ(0.8·0.48 + 0.7·0.48) = σ(0.72) = Softmax(0.72)
y2 = σ(w11·h21 + w12·h22) = σ(0.4·0.48 + 0.8·0.48) = σ(0.576) = Softmax(0.576)
y1 = Softmax(0.72) = e^0.72 / (e^0.72 + e^0.576) = 0.54
y2 = Softmax(0.576) = e^0.576 / (e^0.72 + e^0.576) = 0.46


Multilayer Perceptron Practical (MNIST)
➢ The MNIST database (Modified National Institute of Standards
and Technology database) is a large database of handwritten
digits that is commonly used for training various image
processing systems.
➢ The MNIST database is also widely used for training and testing
in the field of machine learning.
➢ It was created by re-mixing the samples from NIST's original
datasets. The creators felt that since NIST's training dataset
was taken from American Census Bureau employees, while the
testing dataset was taken from American high school students, it
was not well-suited for machine learning experiments.

➢ So, half of the training set and half of the test set were taken from NIST's training dataset, while the other half of the training set and the other half of the test set were taken from NIST's testing dataset.
➢ The MNIST database contains 70,000 handwritten digit images
(28x28 pixels), with 7,000 examples per digit (60,000 training
images and 10,000 testing images).
➢ We will train a feedforward neural network to achieve over 90%
accuracy on the MNIST dataset using Keras and TensorFlow in
Python.

Furthermore, the black and white images from NIST were
normalized to fit into a 𝟐𝟖 × 𝟐𝟖 pixel bounding box and anti-
aliased, which introduced grayscale levels.

Step 1: Import the Required Python
Packages
➢ TensorFlow is a free and open-source software library for
machine learning and artificial intelligence. It can be used across
a range of tasks but has a particular focus on training and
inference of deep neural networks.

It was developed by the Google Brain team for Google's internal use in research and production.
➢ The LabelBinarizer will be used to one-hot encode our integer
labels as vector labels. One-hot encoding transforms categorical
labels from a single integer to a vector. Many machine learning
algorithms benefit from this type of label representation.

➢ Each data point in the MNIST dataset has an integer label in the
range [0, 9], one for each of the possible ten digits in the MNIST
dataset.
➢ A label with a value of 0 indicates that the corresponding image
contains a zero digit.
➢ Similarly, a label with a value of 8 indicates that the corresponding image contains the digit eight.
➢ However, we first need to transform these integer labels into vector labels, where the entry in the vector corresponding to the label is set to 1 and all other entries are set to 0 (this process is called one-hot encoding).
➢ For example, consider the label 3 and we wish to binarize/one-
hot encode it — the label 3 now becomes:
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
➢ The index for the digit three is set to one and all other entries in
the vector are set to zero.
➢ The one-hot encoding representations for each digit, 0−9, are shown in the listing below:

0: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
2: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
3: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
4: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
5: [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
6: [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
7: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
8: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
9: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

➢ Lines 5-8 import the necessary packages to create a simple
feedforward neural network with Keras.
➢ The Sequential indicates that our network will be feedforward
and layers will be added to the class sequentially, one on top of
the other.
➢ The Flatten flattens the input provided without affecting the
batch size. For example, If inputs are shaped (batch_size,)
without a feature axis, then flattening adds an extra channel
dimension and output shape is (batch_size, 1).

➢ The Dense class is the implementation of our fully connected layers; it is used for the hidden layers and the output layer.
➢ The classification_report function will give a nicely formatted
report displaying the total accuracy of the model, along with a
breakdown on the classification accuracy for each digit.
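Since the Step 1 listing appears only as an image in the slides, a plausible reconstruction of the imports described above might look like the following (module layout assumed from the description, not copied from the original listing):

```python
# Reconstruction of the imports described in Step 1 (assumed, not the original listing).
from sklearn.preprocessing import LabelBinarizer     # one-hot encode integer labels
from sklearn.metrics import classification_report    # per-digit accuracy breakdown
from tensorflow.keras.models import Sequential       # feedforward, layer-by-layer model
from tensorflow.keras.layers import Dense, Flatten   # fully connected layers (+ Flatten)
from tensorflow.keras.datasets import mnist          # the MNIST digit dataset
import matplotlib.pyplot as plt                      # plotting the loss/accuracy curves
import numpy as np
```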

Step 2: Load the MNIST Dataset

Line 2 loads the MNIST dataset from disk. If you have never run
this function before, then the MNIST dataset will be downloaded
and stored locally to your machine. Once the dataset has been
downloaded, it is cached to your machine and will not have to be
downloaded again.
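A sketch of this loading step (variable names assumed to match the later slides):

```python
# Load MNIST; it is downloaded and cached locally on first use.
((trainX, trainY), (testX, testY)) = mnist.load_data()
print(trainX.shape, testX.shape)  # (60000, 28, 28) (10000, 28, 28)
```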

Step 3: Cast and Normalize the Data

We perform data normalization by scaling the pixel intensities to the range [0, 1].
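A sketch of the cast-and-normalize step, assuming the images are also flattened here to 784-value vectors as described in Step 5:

```python
# Flatten each 28x28 image to a 784-value vector, cast to float,
# and scale pixel intensities from [0, 255] down to [0, 1].
trainX = trainX.reshape((trainX.shape[0], 28 * 28)).astype("float32") / 255.0
testX = testX.reshape((testX.shape[0], 28 * 28)).astype("float32") / 255.0
```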

Step 4: One-hot Encoding Representation
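A sketch of the one-hot encoding step using the LabelBinarizer described in Step 1 (variable names assumed):

```python
# One-hot encode the integer labels 0-9 into 10-dimensional vectors.
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)
print(trainY.shape)  # (60000, 10)
```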

Step 5: Define the Network Architecture
➢ Each image in the MNIST dataset is represented as a 28 × 28 × 1 pixel image.
➢ In order to train our neural network on the image data, we first need to flatten each 2D image into a flat list of 28 × 28 = 784 values.
➢ The network is a feedforward architecture, instantiated by the Sequential class—this architecture implies that the layers will be stacked on top of each other, with the output of the previous layer feeding into the next.
➢ The input_shape is set to 784, the dimensionality of each MNIST data point.
➢ The first hidden layer has 256 nodes and applies the sigmoid activation function.
➢ The next layer has 128 nodes and also applies the sigmoid activation function.
➢ Finally, we apply another fully connected layer, this time with only 10 nodes, corresponding to the ten (0-9) output classes.
➢ Instead of a sigmoid activation, we use a softmax activation to obtain normalized class probabilities for each prediction.
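A sketch of the 784-256-128-10 architecture described above (a plausible reconstruction, not the original listing):

```python
# 784 -> 256 -> 128 -> 10 feedforward architecture described above.
model = Sequential()
model.add(Dense(256, input_shape=(784,), activation="sigmoid"))
model.add(Dense(128, activation="sigmoid"))
model.add(Dense(10, activation="softmax"))  # normalized class probabilities
```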

Step 6: Compile and Train the Network
Compile the model by specifying the optimizer,
loss function, and evaluation metric

➢ Epochs is the number of complete forward and backward passes the model makes over the training data.
➢ Batch size represents the number of samples per gradient update; if it's unspecified, batch_size will default to 32.
➢ Validation split is a float value between 0 and 1. The model will set apart this fraction of the training data to evaluate the loss and any model metrics at the end of each epoch. (The model will not be trained on this data; see the sketch below.)
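A sketch of the compile-and-fit step; the optimizer, loss, epoch count, batch size, and validation fraction below are assumptions, since the slides show the actual values only in an image:

```python
# Compile with an optimizer, a loss function, and an evaluation metric, then train.
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
H = model.fit(trainX, trainY, validation_split=0.1, epochs=10, batch_size=32)
```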

Step 7: Test the Network
➢ Once the network has finished training, we want to evaluate it on the testing data to obtain our final classifications:
Return the class label probabilities for every data point in testX.

➢ Thus, if you were to inspect the predictions array, it would have the shape (10000, 10), as there are 10,000 data points in the testing set and ten possible class labels (the digits 0-9).
➢ Each entry in a given row is, therefore, a probability.
➢ To determine the class with the largest probability, we can simply call .argmax(axis=1), which will give us the index of the class with the largest probability, and hence the final output classification.
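A sketch of the evaluation step described above (a plausible reconstruction):

```python
# Class-probability predictions for every test sample: shape (10000, 10).
predictions = model.predict(testX, batch_size=32)

# argmax picks the most probable class per row; compare against the true labels.
print(classification_report(testY.argmax(axis=1), predictions.argmax(axis=1)))
```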

Step 8: Plot the Training Loss, Accuracy,
Validation Loss, and Accuracy
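Since the plotting code is shown only as images, a plausible sketch using the History object H returned by model.fit() in the Step 6 sketch:

```python
# Plot training/validation loss and accuracy from the History object H
# (keys assume the model was compiled with metrics=["accuracy"]).
epochs = np.arange(len(H.history["loss"]))
plt.figure()
plt.plot(epochs, H.history["loss"], label="train_loss")
plt.plot(epochs, H.history["val_loss"], label="val_loss")
plt.plot(epochs, H.history["accuracy"], label="train_acc")
plt.plot(epochs, H.history["val_accuracy"], label="val_acc")
plt.xlabel("Epoch")
plt.ylabel("Loss / Accuracy")
plt.legend()
plt.show()
```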

Step 9: Create the Confusion Matrix
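A sketch of the confusion-matrix step, reusing the predictions from the Step 7 sketch (the use of scikit-learn's confusion_matrix here is an assumption):

```python
from sklearn.metrics import confusion_matrix

# 10x10 confusion matrix: rows are true digits, columns are predicted digits.
cm = confusion_matrix(testY.argmax(axis=1), predictions.argmax(axis=1))
print(cm)
```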
