
Ann MPDM

1. The perceptron was an early artificial neural network, proposed by Rosenblatt in 1958, that could learn to classify data. It used a linear model and a step activation function to make predictions.
2. The perceptron learning algorithm adjusts the network's weights based on errors so as to minimize incorrect predictions on the training data. Weights are updated incrementally for each data point using a learning rate.
3. This allows the perceptron to learn linear discriminant functions that separate classes, provided the data is linearly separable. It served as an important early model but was limited to linear problems.


Master in Information Management

2022/23

Artificial Neural Networks


Ricardo Santos
I have always been convinced that the only way to get artificial
intelligence to work is to do the computation in a way similar to the
human brain…

Geoffrey Hinton
The AI Buzz
All of the algorithms used in the previous applications…
1. Have the ability to learn automatically from data
2. Perform non-linear interpolation
3. Are universal approximators: they can approximate many different functions, no matter how complex those functions may be
4. Have found success in modelling different phenomena, even those about which little is known
5. Attempt to mimic how a brain captures and stores information
Inspired by Biology

What the brain looks like to the naked eye; what some neuronal pathways look like using diffusion spectrum imaging (DSI)
Credit: http://www.humanconnectomeproject.org/
The biological neuron

Dendrites: Receive information
Soma or Cell Body: Processes the information
Axon: Carries information (electric pulses) to the axon terminal
Synapse: The spectacular junction between the axon terminal and the dendrites of other neurons
The artificial neuron

Inputs play the role of dendrites; the output plays the role of the axon.

Dendrites: Receive inputs (x1, ..., xn) at a synapse
Cell body: Processes the information:
1. Weighted sum of the inputs: Σᵢ wᵢxᵢ + b
2. Activation function f, producing the output f(Σᵢ wᵢxᵢ + b)
Axon: Exports the output of the cell body to other neurons or to the environment

Neurons are the building blocks of even the most complex ANN architectures
Inputs → ???? → Outputs
A Very Complex Neural Network Architecture, by Andrej Karpathy (source here)
Agenda
1 An historical introduction

2 The Multi-Layer Perceptron


1

An historical introduction
Modelling the brain: the main inspiration for Deep Learning
1 An historic perspective

McCulloch & Pitts (1943): networks of binary neurons can do logic
Frank Rosenblatt (1958): The Perceptron
1 The Perceptron

Linear Discriminant for Binary Classification:
a. Obtains the equation of a line that discriminates between 2 linearly separable classes
b. Makes decisions according to the step (also known as Heaviside) function:

f(X) = 0 if Σᵢ wᵢxᵢ + b < θ
f(X) = 1 if Σᵢ wᵢxᵢ + b ≥ θ
1 How the Perceptron Learns

Linear Discriminant for Binary Classification:
a. Obtains the equation of a line that discriminates between 2 linearly separable classes
b. Makes decisions according to the step (also known as Heaviside) function
c. Adjusts weights after making an incorrect prediction
d. Objective function focuses on error minimization
1 How the Perceptron Learns

Training a perceptron (numerical example):
a) 4 instances
b) 2 independent variables (x1 and x2) and one dependent variable (y)

x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1

The algorithm (sketched in code after this list):
Initialize the weights, threshold and learning rate
For each instance:
i. Obtain prediction – Forward Pass
ii. Assess the error in the prediction
iii. Adjust the weights of the perceptron – Backward Pass
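To make the loop concrete, here is a minimal Python sketch of the procedure described above, using the same AND data, initial weights, threshold and learning rate as the worked example that follows. The function name and structure are illustrative, not taken from any library.

# Minimal perceptron training loop (illustrative sketch)
def train_perceptron(X, y, w, theta=0.5, alpha=0.5, max_epochs=10):
    for _ in range(max_epochs):
        mistakes = 0
        for xi, target in zip(X, y):
            # Forward pass: weighted sum followed by the step (Heaviside) decision
            z = sum(w_j * x_j for w_j, x_j in zip(w, xi))
            y_hat = 1 if z >= theta else 0
            # Backward pass: adjust weights only when the prediction is wrong
            error = target - y_hat
            if error != 0:
                w = [w_j + alpha * error * x_j for w_j, x_j in zip(w, xi)]
                mistakes += 1
        if mistakes == 0:          # every instance classified correctly: stop
            break
    return w

# AND problem from the slides
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
print(train_perceptron(X, y, w=[0.9, 0.9]))   # converges to [0.4, 0.4]

Running this sketch reproduces the weight trajectory worked out step by step on the next slides.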
1 How the Perceptron Learns

Initialize weights (random): w1 = 0.9, w2 = 0.9
Initialize threshold and learning rate: θ = 0.5, α = 0.5

First instance (x1 = 0, x2 = 0, y = 0):
i. Calculate the output of the linear section: w1x1 + w2x2 = 0.9 × 0 + 0.9 × 0 = 0
ii. Compare the result with θ: assign ŷ = 0 if lower and ŷ = 1 otherwise → ŷ = 0
iii. If ŷ = y, do nothing; otherwise compute the error and adjust the weights. Here ŷ = y, so the weights stay unchanged.
1 How the Perceptron Learns

Second instance (x1 = 0, x2 = 1, y = 0), with w1 = 0.9, w2 = 0.9, θ = 0.5, α = 0.5:
i. Calculate the output of the linear section: w1x1 + w2x2 = 0.9 × 0 + 0.9 × 1 = 0.9
ii. Compare the result with θ: 0.9 ≥ θ, so ŷ = 1
iii. ŷ ≠ y, so compute the error and adjust the weights:
error ε = y − ŷ = 0 − 1 = −1
w1_new = w1 + α × ε × x1 = 0.9 + 0.5 × (−1) × 0 = 0.9
w2_new = w2 + α × ε × x2 = 0.9 + 0.5 × (−1) × 1 = 0.4
1 How the Perceptron Learns

Third instance (x1 = 1, x2 = 0, y = 0), with w1 = 0.9, w2 = 0.4, θ = 0.5, α = 0.5:
i. Calculate the output of the linear section: w1x1 + w2x2 = 0.9 × 1 + 0.4 × 0 = 0.9
ii. Compare the result with θ: 0.9 ≥ θ, so ŷ = 1
iii. ŷ ≠ y, so compute the error and adjust the weights:
error ε = y − ŷ = 0 − 1 = −1
w1_new = w1 + α × ε × x1 = 0.9 + 0.5 × (−1) × 1 = 0.4
w2_new = w2 + α × ε × x2 = 0.4 + 0.5 × (−1) × 0 = 0.4
1 How the Perceptron Learns

Fourth instance (x1 = 1, x2 = 1, y = 1), with w1 = 0.4, w2 = 0.4, θ = 0.5, α = 0.5:
i. Calculate the output of the linear section: w1x1 + w2x2 = 0.4 × 1 + 0.4 × 1 = 0.8
ii. Compare the result with θ: 0.8 ≥ θ, so ŷ = 1
iii. ŷ = y, so error ε = 0 and the weights stay unchanged. What now?
1 How the Perceptron Learns

First instance again (x1 = 0, x2 = 0, y = 0), with w1 = 0.4, w2 = 0.4, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.4 × 0 + 0.4 × 0 = 0
ii. 0 < θ, so ŷ = 0
iii. ŷ = y: ε = 0, no weight update
1 How the Perceptron Learns

Second instance again (x1 = 0, x2 = 1, y = 0), with w1 = 0.4, w2 = 0.4, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.4 × 0 + 0.4 × 1 = 0.4
ii. 0.4 < θ, so ŷ = 0
iii. ŷ = y: ε = 0, no weight update
1 How the Perceptron Learns

Third instance again (x1 = 1, x2 = 0, y = 0), with w1 = 0.4, w2 = 0.4, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.4 × 1 + 0.4 × 0 = 0.4
ii. 0.4 < θ, so ŷ = 0
iii. ŷ = y: ε = 0, no weight update

Every instance is now classified correctly, so no further updates are needed: training has converged.
1 An historic perspective

McCulloch & Pitts (1943): networks of binary neurons can do logic
Frank Rosenblatt (1958): The Perceptron
Minsky and Papert (1969): The limitations of the Perceptron
1 How the Perceptron Learns

Problems with the XOR:
a) 4 instances
b) 2 independent variables (x1 and x2) and one dependent variable (y)
c) This problem requires a non-linear solution

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

Initialize weights (random): w1 = 0.9, w2 = 0.9
Initialize threshold and learning rate: θ = 0.5, α = 0.5
1 How the Perceptron Learns

First instance (x1 = 0, x2 = 0, y = 0), with w1 = 0.9, w2 = 0.9, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.9 × 0 + 0.9 × 0 = 0
ii. 0 < θ, so ŷ = 0
iii. ŷ = y: ε = 0, no weight update
1 How the Perceptron Learns

Second instance (x1 = 0, x2 = 1, y = 1), with w1 = 0.9, w2 = 0.9, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.9 × 0 + 0.9 × 1 = 0.9
ii. 0.9 ≥ θ, so ŷ = 1
iii. ŷ = y: ε = 0, no weight update
1 How the Perceptron Learns

Third instance (x1 = 1, x2 = 0, y = 1), with w1 = 0.9, w2 = 0.9, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.9 × 1 + 0.9 × 0 = 0.9
ii. 0.9 ≥ θ, so ŷ = 1
iii. ŷ = y: ε = 0, no weight update
1 How the Perceptron Learns

Fourth instance (x1 = 1, x2 = 1, y = 0), with w1 = 0.9, w2 = 0.9, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.9 × 1 + 0.9 × 1 = 1.8
ii. 1.8 ≥ θ, so ŷ = 1
iii. ŷ ≠ y, so compute the error and adjust the weights:
error ε = y − ŷ = 0 − 1 = −1
w1_new = w1 + α × ε × x1 = 0.9 + 0.5 × (−1) × 1 = 0.4
w2_new = w2 + α × ε × x2 = 0.9 + 0.5 × (−1) × 1 = 0.4
1 How the Perceptron Learns

First instance again (x1 = 0, x2 = 0, y = 0), with w1 = 0.4, w2 = 0.4, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.4 × 0 + 0.4 × 0 = 0
ii. 0 < θ, so ŷ = 0
iii. ŷ = y: ε = 0, no weight update
1 How the Perceptron Learns

Second instance again (x1 = 0, x2 = 1, y = 1), with w1 = 0.4, w2 = 0.4, θ = 0.5, α = 0.5:
i. w1x1 + w2x2 = 0.4 × 0 + 0.4 × 1 = 0.4
ii. 0.4 < θ, so ŷ = 0
iii. ŷ ≠ y, so compute the error and adjust the weights:
error ε = y − ŷ = 1 − 0 = 1
w1_new = w1 + α × ε × x1 = 0.4 + 0.5 × 1 × 0 = 0.4
w2_new = w2 + α × ε × x2 = 0.4 + 0.5 × 1 × 1 = 0.9
Can you start to see the problem? The fourth instance will now be misclassified again, and the weights keep bouncing back and forth without ever separating the classes.
1 How the Perceptron Learns

Able to Solve the AND Problem; Incapable of Solving the XOR Problem
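The same update rule, run on the XOR data (an illustrative sketch using the same initial values as above), shows the problem directly: the number of misclassified instances never drops to zero, and the weights keep returning to the same values.

# Perceptron update rule applied to XOR: it never converges (sketch)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]                      # XOR labels
w, theta, alpha = [0.9, 0.9], 0.5, 0.5

for epoch in range(10):
    mistakes = 0
    for xi, target in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, xi))
        y_hat = 1 if z >= theta else 0
        error = target - y_hat
        if error != 0:
            w = [w_j + alpha * error * x_j for w_j, x_j in zip(w, xi)]
            mistakes += 1
    print(epoch, mistakes, [round(w_j, 2) for w_j in w])
# The mistake count never reaches 0: no single line separates the XOR classes.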
1 An historic perspective

McCulloch & Pitts (1943): networks of binary neurons can do logic
Frank Rosenblatt (1958): The Perceptron
Minsky and Papert (1969): The limitations of the Perceptron
Werbos (1974) & Rumelhart (1986): Backpropagation, a gradient approach for propagating the error throughout multiple layers of an ANN
1 Backpropagation

Backpropagation allowed for:
Making explicit the need for non-linear activation functions
1 Backpropagation

Backpropagation allowed for:
Making explicit the need for non-linear activation functions
The entire network not to be squashed into a single linear transformation

Decision boundary of a NN without non-linearities vs. decision boundary of a NN with non-linearities
1 Backpropagation

Backpropagation allowed for:
Making explicit the need for non-linear activation functions
The entire network not to be squashed into a single linear transformation
It uses partial derivatives and the chain rule to update the weights

Consider the following computation graph of the equations of a perceptron (x – input vector, w – weight vector, b – bias of the linear part): x and w feed a multiplication node producing wx; adding b gives z; applying the activation f gives h:

z = wx + b
h = f(z)
1 Backpropagation

To backpropagate the error:
1. Start at the output and work backwards, computing the contribution of each operation to the result
2. Use the chain rule to compute the gradients

On the perceptron graph above (z = wx + b, h = f(z)), the backward pass starts from ∂h/∂h at the output, then produces ∂h/∂z at the activation node, and finally ∂h/∂w and ∂h/∂b at the parameters.
1 Backpropagation

To backpropagate the error:
1. Start at the output and work backwards, computing the contribution of each operation to the result
2. Use the chain rule to compute the gradients:
i. Each operation node has a local gradient. Consider the multiplication node z = wx: its local gradients are ∂z/∂w = x and ∂z/∂x = w
ii. Downstream gradient = upstream gradient × local gradient. At this node the upstream gradient is ∂h/∂z, so:
∂h/∂w = (∂h/∂z) × (∂z/∂w)
∂h/∂x = (∂h/∂z) × (∂z/∂x)
(A numeric sketch of this follows below.)
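A small numeric sketch of the chain rule on this graph, assuming (as an example) that f is the sigmoid; the input values are arbitrary and only serve to show how each downstream gradient is the upstream gradient times the local gradient.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w, b = 0.5, 0.9, 0.1       # hypothetical values for the graph inputs

# Forward pass through the graph: multiply, add, activate
z = w * x + b                 # z = wx + b
h = sigmoid(z)                # h = f(z)

# Backward pass: start with dh/dh = 1 and multiply by each local gradient
dh_dh = 1.0
dh_dz = dh_dh * h * (1 - h)   # local gradient of the sigmoid is h(1 - h)
dh_db = dh_dz * 1.0           # add node: dz/db = 1
dh_dw = dh_dz * x             # multiply node: dz/dw = x
dh_dx = dh_dz * w             # multiply node: dz/dx = w

print(round(dh_dw, 4), round(dh_dx, 4), round(dh_db, 4))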
1 Chapter 1 – Main Takeaways

1. The Perceptron was one of the precursors of modern ANNs:
i. Starts by computing the equation of a linear discriminant function for binary classification
ii. Checks whether a new observation is above or below that line
iii. Updates the weights for every wrong prediction (without a gradient)
iv. Only works correctly on linearly separable problems

2. Backpropagation:
i. Introduces the ability to use gradient-based learning
ii. Uses a different logic for weight updates
iii. Allows the use of non-linear activation functions, enabling non-linear relationships
1 Chapter 1 – Main Takeaways

Core components of modern neural networks (that also exist in larger networks):
i. The forward pass starts from the current weights and the inputs and computes the results of the operations
ii. The backward pass computes, using backpropagation, the gradients of the output with respect to the weights
iii. Backpropagation is the algorithm used to apply the chain rule along a computational graph:
Downstream gradient = Upstream gradient × Local gradient
2

The Multi-Layer Perceptron


A feedforward artificial neural network with multiple layers of perceptrons,
capable of modelling complex non-linear relationships between inputs and outputs.
2 Multi-Layer Perceptron

Each input neuron gets one input (one feature). The outputs of the input layer are sent to the hidden layers, and the outputs of the last hidden layer are sent to the output layer.

Both the input and the output layer are set by how the problem is formulated:
Input layer size: the number of features
Output layer size: depends on how the prediction is computed
2 Multi-Layer Perceptron

Input Layer:
i. Introduces the inputs to the network
ii. No processing or activation function

Hidden Layers:
i. Take, as input, the outputs of the previous layer and pass them along to the next layer
ii. Two hidden layers are considered enough to handle most problems

Output Layer:
i. Generates the prediction using the outputs of the hidden layers as its inputs
ii. Backpropagation in an MLP is done for all weights, from the input layer to the output layer
2 Multi-Layer Perceptron

ANNs as universal approximators:
i. Non-linear activation functions allow the models to go beyond the application of mere linear transformations
ii. Extra layers (facilitated by backpropagation) provide the capacity to model more complex phenomena
iii. It is the increased number of layers, combined with the non-linearities, that allows these models to approximate almost any complex function
2 Training an MLP – Numeric Example

Consider the following situation:
i. We have a dataset with data for cats and dogs
ii. We have a 5-2-1 MLP architecture

ID  Weight (X1)  Softness (X2)  Purrs/min (X3)  Barks/min (X4)  Tail Length (X5)  Label (y)
1   0.2          0.8            0.3             0.0             0.7               1
2   0.3          0.7            0.2             0.0             0.4               1
3   0.6          0.1            0.0             0.4             0.7               0
4   0.7          0.2            0.0             0.5             0.8               0
5   0.2          0.8            0.3             0.0             0.6               1
2 Training an MLP – Numeric Example

Step 1 – Initialization:
i. Weights – random initialization

Hidden layer weights (w¹ᵢⱼ connects input i to hidden unit j; index 0 denotes the bias):
w¹₁₁ = 0.3, w¹₁₂ = 0.9, w¹₂₁ = −0.2, w¹₂₂ = 0.1, w¹₃₁ = 0.3, w¹₃₂ = 0.9, w¹₄₁ = −0.2, w¹₄₂ = 0.1, w¹₅₁ = 0.3, w¹₅₂ = 0.9, w¹₀₁ = 0.5, w¹₀₂ = 0.3

Output layer weights (w²ⱼ₁ connects hidden unit j to the output; w²₀₁ is the bias):
w²₁₁ = 0.8, w²₂₁ = −0.2, w²₀₁ = −0.2
2 Training an MLP – Numeric Example

Step 1 – Initialization:
ii. Learning rate and how it evolves across iterations:
α = 0.5, decaying by 0.05 per iteration (epoch)
iii. Setting the activation function:
Output layer: sigmoid (logistic); Hidden layers: sigmoid (logistic), as used in the computations below

Quick note: in sklearn, the activation function of the output layer is already set for you
2 Training an MLP – Numeric Example

Step 2 – Forward Pass (instance ID 1: X1 = 0.2, X2 = 0.8, X3 = 0.3, X4 = 0.0, X5 = 0.7, y = 1):

z₁* = Σᵢ w¹ᵢ₁xᵢ + w¹₀₁ = 0.3 × 0.2 + (−0.2) × 0.8 + 0.3 × 0.3 + (−0.2) × 0.0 + 0.3 × 0.7 + 0.5 = 0.7
a₁₁ = 1 / (1 + e^(−z₁*)) = 1 / (1 + e^(−0.7)) = 0.668

z₂* = Σᵢ w¹ᵢ₂xᵢ + w¹₀₂ = 0.9 × 0.2 + 0.1 × 0.8 + 0.9 × 0.3 + 0.1 × 0.0 + 0.9 × 0.7 + 0.3 = 1.46
a₂₁ = 1 / (1 + e^(−1.46)) = 0.812
2 Training an MLP – Numeric Example

Step 2 – Forward Pass (continued), with a₁₁ = 0.668 and a₂₁ = 0.812:

z₃* = Σⱼ w²ⱼ₁aⱼ₁ + w²₀₁ = 0.8 × 0.668 + (−0.2) × 0.812 + (−0.2) = 0.172
a₁₂ = 1 / (1 + e^(−0.172)) = 0.543 ≠ y → we need to update the weights
2 Training an MLP – Numeric Example

Step 3 – Backward Pass:
i. Compute the error for each unit j (these expressions assume sigmoid units, so ŷⱼ(1 − ŷⱼ) is the derivative of the activation):
At the output layer: Errⱼ = ŷⱼ(1 − ŷⱼ)(yⱼ − ŷⱼ)
At the hidden layer: Errⱼ = ŷⱼ(1 − ŷⱼ) Σₖ Errₖ wⱼₖ
ii. Update the weights (a small code sketch of these rules follows below):
Δwᵢⱼ = α × Errⱼ × aᵢ
wᵢⱼ = old wᵢⱼ + Δwᵢⱼ
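As a small illustrative sketch (assuming sigmoid units, so that a(1 − a) is the activation's derivative), the three rules above can be written as plain helper functions:

# Error at an output unit j with activation a_j and target y_j
def output_error(a_j, y_j):
    return a_j * (1 - a_j) * (y_j - a_j)

# Error at a hidden unit j, given the errors of the units it feeds
# and the weights w_jk connecting it to them
def hidden_error(a_j, downstream_errors, downstream_weights):
    backprop_sum = sum(err_k * w_jk for err_k, w_jk in zip(downstream_errors, downstream_weights))
    return a_j * (1 - a_j) * backprop_sum

# Weight update: w_ij <- w_ij + alpha * Err_j * a_i
def updated_weight(w_ij, alpha, err_j, a_i):
    return w_ij + alpha * err_j * a_i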
2 Training an MLP – Numeric Example

Step 3 – Backward Pass

Compute the error at the output layer:
Err_a₁₂ = a₁₂(1 − a₁₂)(y − a₁₂) = 0.543 × (1 − 0.543) × (1 − 0.543) = 0.113

Compute the updated weights of the output layer:
w²₀₁ = w²₀₁ + α × Err_a₁₂ = −0.2 + 0.5 × 0.113 = −0.143
w²₁₁ = w²₁₁ + (α × Err_a₁₂ × a₁₁) = 0.8 + 0.5 × 0.113 × 0.668 = 0.838
w²₂₁ = w²₂₁ + (α × Err_a₁₂ × a₂₁) = −0.2 + 0.5 × 0.113 × 0.812 = −0.154
2 Training an MLP – Numeric Example

Step 3 – Backward Pass

Compute the errors at the hidden layer (using the updated output-layer weights):
Err_a₁₁ = a₁₁(1 − a₁₁) × Err_a₁₂ × w²₁₁ = 0.668 × (1 − 0.668) × 0.113 × 0.838 = 0.021
Err_a₂₁ = a₂₁(1 − a₂₁) × Err_a₁₂ × w²₂₁ = 0.812 × (1 − 0.812) × 0.113 × (−0.154) = −0.003

Update the weights of the hidden layer:
w¹₀₁ = w¹₀₁ + α × Err_a₁₁ = 0.5 + 0.5 × 0.021 = 0.511
w¹₁₁ = w¹₁₁ + (α × Err_a₁₁ × x₁) = 0.3 + 0.5 × 0.021 × 0.2 = 0.302
2 Training an MLP – Numeric Example

Step 3 – Backward Pass

After repeating the process for all the other weights, we would have (the sketch below reproduces these numbers):

Output layer weights (old → new):
w²₁₁: 0.8 → 0.838
w²₂₁: −0.2 → −0.154
w²₀₁: −0.2 → −0.143

Hidden layer weights (old → new):
w¹₁₁: 0.3 → 0.302     w¹₁₂: 0.9 → 0.900
w¹₂₁: −0.2 → −0.192   w¹₂₂: 0.1 → 0.099
w¹₃₁: 0.3 → 0.303     w¹₃₂: 0.9 → 0.900
w¹₄₁: −0.2 → −0.200   w¹₄₂: 0.1 → 0.100
w¹₅₁: 0.3 → 0.307     w¹₅₂: 0.9 → 0.899
w¹₀₁: 0.5 → 0.511     w¹₀₂: 0.3 → 0.299
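A NumPy sketch of this single forward and backward pass (illustrative, assuming sigmoid activations everywhere and following the convention above of propagating the hidden-layer error through the already-updated output weights):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# First training instance (ID 1) and its label
x = np.array([0.2, 0.8, 0.3, 0.0, 0.7])
y = 1.0

# W1[i, j] connects input i to hidden unit j; b1 holds the hidden biases
W1 = np.array([[0.3, 0.9],
               [-0.2, 0.1],
               [0.3, 0.9],
               [-0.2, 0.1],
               [0.3, 0.9]])
b1 = np.array([0.5, 0.3])
W2 = np.array([0.8, -0.2])    # hidden-to-output weights
b2 = -0.2                     # output bias
alpha = 0.5                   # learning rate

# Forward pass
z_hidden = x @ W1 + b1        # [0.7, 1.46]
a_hidden = sigmoid(z_hidden)  # [0.668, 0.812]
z_out = a_hidden @ W2 + b2    # 0.172
a_out = sigmoid(z_out)        # 0.543

# Backward pass
err_out = a_out * (1 - a_out) * (y - a_out)              # ~0.113
W2 = W2 + alpha * err_out * a_hidden                     # ~[0.838, -0.154]
b2 = b2 + alpha * err_out                                # ~-0.143
err_hidden = a_hidden * (1 - a_hidden) * (err_out * W2)  # ~[0.021, -0.003]
W1 = W1 + alpha * np.outer(x, err_hidden)
b1 = b1 + alpha * err_hidden

print(np.round(W1, 3), np.round(b1, 3))
print(np.round(W2, 3), np.round(b2, 3))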
2 Training an MLP with sklearn

from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier()
mlp_model.fit(X_train, y_train)
y_pred = mlp_model.predict(X_test)
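A slightly fuller (illustrative) configuration, using the small cats-vs-dogs table from the worked example; the hyperparameter values below are assumptions chosen to mirror the 5-2-1 architecture and sigmoid activations, not settings taken from the slides:

import numpy as np
from sklearn.neural_network import MLPClassifier

# Cats-vs-dogs data from the worked example (features X1..X5 and label y)
X = np.array([[0.2, 0.8, 0.3, 0.0, 0.7],
              [0.3, 0.7, 0.2, 0.0, 0.4],
              [0.6, 0.1, 0.0, 0.4, 0.7],
              [0.7, 0.2, 0.0, 0.5, 0.8],
              [0.2, 0.8, 0.3, 0.0, 0.6]])
y = np.array([1, 1, 0, 0, 1])

mlp_model = MLPClassifier(hidden_layer_sizes=(2,),   # one hidden layer with 2 units (5-2-1)
                          activation='logistic',     # sigmoid activations, as in the example
                          learning_rate_init=0.5,
                          max_iter=2000,
                          random_state=0)
mlp_model.fit(X, y)
print(mlp_model.predict(X))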
2 Chapter 2 – Main Takeaways

1. MLPs are universal approximators for almost any function:
i. Non-linear activation functions allow them to model non-linear relationships
ii. Additional layers allow for increased complexity
iii. Backpropagation combined with the non-linearities allows the weights to be updated without squashing everything into a single linear equation

2. Training an MLP:
i. Forward pass: turn the inputs at the input layer into outputs at the output layer
ii. Compute the error (loss) at the end
iii. Backpropagate the error along the network, updating the weights as you go in order to minimize the error
We’ve barely scratched the surface of ANNs

1. Loss functions – Cross-Entropy Loss for classification
2. Batch training
3. Learning rate and optimization algorithms: Stochastic Gradient Descent, Adam & others
4. Activation functions: ReLU, Sigmoid, Tanh
5. Other forms of ANN (out of scope): Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Generative Adversarial Networks (GAN)
References
- https://github.com/Atcold/NYU-DLSP21
- https://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes03-neuralnets.pdf
- https://towardsdatascience.com/understanding-backpropagation-abcc509ca9d0
- http://cs231n.stanford.edu/schedule.html
- http://cs231n.stanford.edu/slides/2023/lecture_7.pdf
- https://cs230.stanford.edu/syllabus/
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge, MA: MIT Press.
Thank You!

Address (Morada): Campus de Campolide, 1070-312 Lisboa, Portugal
Tel: +351 213 828 610 | Fax: +351 213 828 611
