Deep Learning
Unit 1: Until MLP
History of the Perceptron
▪ First implemented by the American psychologist Frank Rosenblatt in 1957 at the Cornell Aeronautical Laboratory
▪ Rosenblatt was heavily inspired by
the biological neuron and its ability to
learn
▪ Rosenblatt’s Perceptron: inputs X1, X2, …, Xn feed a single perceptron unit P, which produces the output Y
What is an Artificial Neuron?
▪ An artificial neuron is a mathematical function based on the
model of biological neurons
▪ Each neuron takes inputs, weighs them separately, sums
them up and passes this sum through a nonlinear function to
produce output
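As a minimal sketch (not from the slides; the function name and example values are illustrative), such a neuron can be written directly:

import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid nonlinearity."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # logistic (sigmoid) activation

# Example: three inputs, three weights, and a bias (illustrative values)
print(neuron([1.0, 0.0, 1.0], [0.7, 0.6, 0.5], bias=-1.0))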
Perceptron – A Detailed Look
W1, W2, and W3 are the weights on the inputs X1, X2, and X3. The perceptron computes the weighted sum plus a bias:

$\hat{y} = \sum_{i=1}^{3} W_i X_i + \text{bias}$
What if Perceptron’s output is Binary?
An activation function is a function that converts the given input (here, the weighted sum and the bias) into a certain output based on a set of rules.

Different kinds of activation functions:
▪ Hyperbolic tangent: used to output a number from -1 to 1.
▪ Logistic function: used to output a number from 0 to 1.

(Figure: inputs X1, X2, X3 with weights W1, W2, W3 and a bias feed the neuron; the activation function a() produces the output Y.)
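As a quick sketch (not part of the slides), these activations can be written as plain functions:

import math

def logistic(z):
    """Squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into the range (-1, 1)."""
    return math.tanh(z)

def step(z, threshold=0.0):
    """Binary (threshold) activation: 1 if z clears the threshold, else 0."""
    return 1 if z >= threshold else 0

print(logistic(0.6), tanh(0.6), step(0.6))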
Perceptron Numerical Example
Perceptron Numerical Example contd.
▪ The Perceptron Algorithm
  ▪ Set a threshold value (1.5)
  ▪ Multiply each input (1, 0, 1, 0, 1) by its weight (0.7, 0.6, 0.5, 0.3, 0.4)
  ▪ Sum all the results
  ▪ Activate the output

x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4

Sum = 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6
Output: True if the sum > 1.5, False otherwise. Here 1.6 > 1.5, so the output is True.
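A small sketch of this example in code (values taken from the slide):

inputs  = [1, 0, 1, 0, 1]
weights = [0.7, 0.6, 0.5, 0.3, 0.4]
threshold = 1.5

weighted_sum = sum(x * w for x, w in zip(inputs, weights))  # 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6
output = weighted_sum > threshold                           # True, since 1.6 > 1.5
print(weighted_sum, output)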
Threshold as Bias
X0, X1, X2, …, Xn are binary inputs, and X0 is always set to 1, so its weight W0 plays the role of the bias (the threshold folded into the weighted sum). The neuron computes

$z = \sum_{i=0}^{n} W_i X_i, \qquad \varphi(z) = \begin{cases} 0, & \text{if } z < 0 \\ 1, & \text{otherwise} \end{cases}$

and outputs $Y = \varphi(z)$.
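A minimal sketch (reusing the illustrative values from the numerical example) of folding the threshold into a bias weight:

inputs  = [1, 0, 1, 0, 1]
weights = [0.7, 0.6, 0.5, 0.3, 0.4]
threshold = 1.5

# Fold the threshold into a bias weight w0 on a constant input x0 = 1
x = [1] + inputs                 # x0 = 1
w = [-threshold] + weights       # w0 = -threshold
z = sum(wi * xi for wi, xi in zip(w, x))
print(1 if z >= 0 else 0)        # same decision as "sum > 1.5" (up to the boundary case)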
Perceptron and Linear Binary Classification
▪ The idea behind the binary linear classifier can be described as
follows:
$f(x, w, w_0) = \varphi(x \cdot w + w_0)$
▪ The 𝜑 function is used to distinguish x as either a positive (+1) or a
negative (-1) label
▪ There is a decision boundary that separates the data with different labels, which occurs at
$x \cdot w + w_0 = 0$
▪ The decision boundary (a hyperplane) separates the feature space into two regions
Perceptron Decision Boundary
▪ In general, if we have n inputs, the decision boundary will be an (n-1)-dimensional object called a hyperplane that separates our n-dimensional feature space into 2 parts:
▪ one in which the points are classified as positive, and
▪ one in which the points are classified as negative
Perceptron Decision Boundary contd.
▪ If all the instances in the given data are linearly separable, there
exists a w and a w₀ such that y⁽ⁱ⁾ (w⋅ x⁽ⁱ⁾ + w₀) > 0 for every ith data
point, where y⁽ⁱ⁾ is the label
(Figure: the decision boundary $x \cdot w + w_0 = 0$ separates the region $x \cdot w + w_0 > 0$ from the region $x \cdot w + w_0 < 0$; the parallel lines $x \cdot w + w_0 = +1$ and $x \cdot w + w_0 = -1$ lie at distance $1/\lVert w \rVert$ on either side, and $w$ is normal to the boundary.)
(Flowchart: Start → for each training sample, check whether the error != 0; if so, update the weights and bias based on the error → Stop.)
Perceptron Learning Algorithm
Step 1: Initialize the weights and bias to random values.
Step 2: For each training example, compute the predicted output using the
current weights and bias.
Step 3: Update the weights and bias based on the error between the
predicted output and the true output. The update rule is given by:
▪ wi = wi + learning_rate * (y - ypred) * xi
▪ bias = bias + learning_rate * (y - ypred)
Perceptron Learning Algorithm
• Initialize the weights w to small random numbers or zeros
• Initialize the bias b to a small random number or zero
• Set the learning rate η
• For each epoch (a complete pass through the training dataset):
  • For each training sample (x, y):
    • Compute the weighted sum (net input) z:
      $z = w \cdot x + b$
    • Apply the activation function to get the predicted output ŷ:
      $\hat{y} = \begin{cases} 1, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases}$
    • Update the weights and bias based on the error e = y − ŷ:
      $w = w + \eta \cdot e \cdot x, \qquad b = b + \eta \cdot e$
• Repeat the training process until convergence or for a fixed number of epochs.
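A minimal NumPy sketch of this loop (the function and variable names are mine, not from the slides):

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Train a perceptron with a step activation using the update rule above."""
    w = np.zeros(X.shape[1])   # weights initialized to zeros
    b = 0.0                    # bias initialized to zero
    for _ in range(epochs):
        for xi, target in zip(X, y):
            z = np.dot(w, xi) + b          # weighted sum (net input)
            y_hat = 1 if z >= 0 else 0     # step activation
            error = target - y_hat
            w = w + lr * error * xi        # weight update
            b = b + lr * error             # bias update
    return w, b

# Example: learn the AND gate, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b, [1 if np.dot(w, xi) + b >= 0 else 0 for xi in X])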
What is convergence?
Convergence is typically achieved when the
weights and bias do not change
significantly between epochs or when the
classification error becomes zero
The update rule
$w_i = w_i + \eta \cdot (y - \hat{y}) \cdot x_i, \qquad b = b + \eta \cdot (y - \hat{y})$
Perceptron Learning with Logic Gates
▪ AND Gate
Perceptron Learning with Logic Gates contd.
▪ OR Gate
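For illustration (the weights below are chosen by hand, not taken from the slides), one set of perceptron weights that realizes these two gates:

# Perceptron with a step activation: output 1 if w1*x1 + w2*x2 + b >= 0, else 0
def gate(x1, x2, w1, w2, b):
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    and_out = gate(x1, x2, 1, 1, -1.5)   # AND: fires only when both inputs are 1
    or_out  = gate(x1, x2, 1, 1, -0.5)   # OR: fires when at least one input is 1
    print(x1, x2, and_out, or_out)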
Limitations of Perceptron
▪ Limited to linearly separable problems
▪ If the data is not linearly separable, the perceptron algorithm will not
converge and cannot learn a correct decision boundary
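For example, XOR is not linearly separable, so no single perceptron can represent it. A brute-force check over a small grid of weights (my own sketch, not from the slides) finds no solution:

from itertools import product

xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
candidates = [v / 2 for v in range(-6, 7)]   # w1, w2, b each in {-3.0, -2.5, ..., 3.0}

solutions = [
    (w1, w2, b)
    for w1, w2, b in product(candidates, repeat=3)
    if all((1 if w1 * x1 + w2 * x2 + b >= 0 else 0) == t for (x1, x2), t in xor.items())
]
print(solutions)   # [] -- no linear threshold unit reproduces XOR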
Limitations of Perceptron
▪ Binary classification only
▪ It cannot handle multiclass classification problems without
modifications
Limitations of Perceptron
▪ Sensitive to input scaling and bias
▫ If the input features are not scaled properly, or the bias term is not
set correctly, the algorithm may not converge or may converge to a
suboptimal solution
Limitations of Perceptron
▪ Can get stuck in local optima
▫ This means that it may not find the global optimal solution if the
initial weights are not set properly or if the optimization landscape is
complex
MLP
▪ After the Rosenblatt perceptron was developed in the 1950s, there
was a lack of interest in neural networks
▪ In 1986, Dr. Hinton and his colleagues developed the
backpropagation algorithm to train multilayer neural networks
Multi-Layer Perceptron (MLP)
▪ A multilayer perceptron is a fully connected class of feedforward
artificial neural network
▪ Multilayer perceptrons are sometimes colloquially referred to as
"vanilla" neural networks, especially when they have a single hidden
layer
Common Uses of MLP
MLP contd.
▪ An MLP consists of at least three layers of nodes: an input layer, a
hidden layer, and an output layer
▪ Except for the input nodes, each node is a neuron that uses a
nonlinear activation function
▪ MLP utilizes a supervised learning technique called backpropagation
for training
▪ Its multiple layers and non-linear activation distinguish MLP from a
linear perceptron
▪ It can distinguish data that is not linearly separable
Typical MLP Network
Perceptron Vs. MLP
What is MLP?
▪ It is a neural network where the mapping between inputs and
output is non-linear
▪ A Multilayer Perceptron has input and output layers, and one or more hidden
layers with many neurons stacked together
▪ And while in the Perceptron the neuron must have an activation function that
imposes a threshold, like ReLU or sigmoid, neurons in a Multilayer Perceptron
can use any arbitrary activation function
MLP Topology
Input Layer: one neuron per input value (or column) in your dataset.
Hidden Layer(s): they are called hidden layers because they are not directly exposed to the input.
Output Layer: a regression or binary classification problem may have a single output neuron, whereas multiclass classification uses multiple output neurons.
MLP as Feed Forward Algorithm
▪ The inputs are combined with the initial weights in a weighted
sum and subjected to the activation function
▪ Each linear combination is propagated to the next layer
▪ Each layer is feeding the next one with the result of their
computation
▪ This goes all the way through the hidden layers to the output layer
▪ If the algorithm only computed the weighted sums in each neuron and
propagated the results to the output layer, stopping there, it wouldn't
be able to learn the weights that minimize the cost function.
Feed Forward Algorithm
1. Initialize the weights and biases for all layers in the
network
2. Assign the input features to the input layer of the
network
3. For each hidden layer and the output layer:
• Compute the weighted sum of inputs for each neuron in
the layer:
  $\theta_j = \sum_i w_{ij} \cdot x_i + b_j$
• Apply the activation function to the weighted sum to get the
output (activation) of each neuron:
  $O_j = \varphi(\theta_j)$
• Compute the final outputs of the network using the
activations of the last hidden layer and the output
layer's weights and biases.
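A minimal NumPy sketch of this forward pass (layer sizes and names are illustrative, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """layers is a list of (W, b) pairs; W has shape (n_out, n_in)."""
    activation = x
    for W, b in layers:
        theta = W @ activation + b     # weighted sum for every neuron in the layer
        activation = sigmoid(theta)    # activation of every neuron in the layer
    return activation

# Example: 3 inputs -> 4 hidden neurons -> 2 outputs (random illustrative weights)
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
print(forward(np.array([1.0, 0.5, -0.2]), layers))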
What is Backpropagation?
▪ Backpropagation, short for “backward propagation of errors,” is a
fundamental concept in the training of artificial neural networks
▪ It’s a method used to calculate the gradient of the loss function with
respect to the weights of the network
Loss Function Vs. Cost Function
Loss (for a single sample): $L = (\hat{y}_i - y_i)^2$

Cost (averaged over the whole dataset): $J(W, B) = \frac{1}{2n} \sum_i (\hat{y}_i - y_i)^2$
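A quick sketch of the distinction (illustrative values and variable names):

import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])

loss_per_sample = (y_pred - y_true) ** 2            # loss: one value per training example
cost = np.sum(loss_per_sample) / (2 * len(y_true))  # cost: averaged over the whole dataset
print(loss_per_sample, cost)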
MLP with Backpropagation
▪ All the weights in the network are updated through backpropagation,
not just those of the output layer or of the last hidden layer
▪ The updates occur layer by layer, starting from the output layer and
moving towards the input layer
(Figure: layers $L_0, L_1, L_2, \ldots, L_p, \ldots, L_{K-1}, L_K$ with outputs $O_0, O_1, \ldots, O_K$; the gradients $\nabla_{W_1}E, \nabla_{W_2}E, \ldots, \nabla_{W_K}E$ are computed layer by layer, from the output back towards the input.)
MLP with Backpropagation
MLP Learning Algorithm
1. Initialize the weights and biases randomly
2. For each epoch do:
a) For each input sample do:
Feed the input sample into the network and compute the output
Calculate the error between the output and the desired output
Backpropagate the error through the network, updating the
weights and biases using the gradient descent algorithm
b) Calculate the total error for the epoch
c) If the error is below a specified threshold, stop training and return the
weights and biases
3. Return the trained weights and biases
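A compact NumPy sketch of this loop for a single-hidden-layer MLP with sigmoid activations and a squared-error cost (one possible implementation; all names, sizes, and hyperparameters are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, T, hidden=4, lr=1.0, epochs=10000, tol=1e-3):
    """Train a one-hidden-layer MLP with backpropagation and gradient descent."""
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, T.shape[1])); b2 = np.zeros(T.shape[1])
    for _ in range(epochs):
        # forward pass
        H = sigmoid(X @ W1 + b1)           # hidden activations
        O = sigmoid(H @ W2 + b2)           # network outputs
        E = 0.5 * np.sum((O - T) ** 2)     # total error for the epoch
        if E < tol:                        # stop once the error is below the threshold
            break
        # backward pass (deltas use the sigmoid derivative o * (1 - o))
        delta_out = (O - T) * O * (1 - O)
        delta_hid = (delta_out @ W2.T) * H * (1 - H)
        # gradient-descent updates, layer by layer from the output backwards
        W2 -= lr * H.T @ delta_out; b2 -= lr * delta_out.sum(axis=0)
        W1 -= lr * X.T @ delta_hid; b1 -= lr * delta_hid.sum(axis=0)
    return W1, b1, W2, b2

# Example: XOR, which a single perceptron cannot learn
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_mlp(X, T)
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))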
Perceptron Learning with Linear Transformation
Prediction: $\hat{y}_i = W^T \cdot X_i$

Loss/Cost/Error Function:
$E = \frac{1}{2} \sum_{i=1}^{n} (W^T \cdot X_i - y_i)^2 = \frac{1}{2} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$

Gradient:
$\nabla_W E = \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot X_i$

Weight Update Rule:
$W = W - \eta \cdot \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot X_i$
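A short NumPy sketch of this batch update (the data and names are illustrative):

import numpy as np

def linear_gd_step(W, X, y, lr=0.01):
    """One batch gradient-descent step for the squared-error loss of a linear unit."""
    y_hat = X @ W              # predictions W^T · X_i for every sample
    grad = X.T @ (y_hat - y)   # sum_i (y_hat_i - y_i) * X_i
    return W - lr * grad       # weight update rule

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_W = np.array([1.0, -2.0, 0.5])
y = X @ true_W
W = np.zeros(3)
for _ in range(200):
    W = linear_gd_step(W, X, y)
print(W)   # approaches true_W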
Learning rate 𝜂
• The learning rate controls how much the weights of the neural network
are adjusted with respect to the loss gradient
• A smaller learning rate means smaller adjustments, leading to a more
gradual convergence
• Conversely, a larger learning rate results in larger adjustments, which can
speed up convergence but also risks overshooting the optimal solution
Perceptron Learning with Non-linear Transformation
Prediction with a sigmoid nonlinearity: $\hat{y}_i = \sigma(W^T \cdot X_i)$

(Figure: the input $X_i$ is mapped through the weighted sum $W^T \cdot X$ and then through the sigmoid $\sigma$.)
Perceptron Learning with Non-linear Transformation
$\hat{y}_i = \sigma(W^T \cdot X_i)$

Loss/Cost/Error Function:
$E = \frac{1}{2} \sum_{i=1}^{n} (\sigma(W^T \cdot X_i) - y_i)^2 = \frac{1}{2} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$

Gradient (using the sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$):
$\nabla_W E = \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot \hat{y}_i (1 - \hat{y}_i) \cdot X_i$

(Figure: an input $X$ connected by weights $W$ to output neurons $o_1, o_2, o_3$; for each output neuron, the weighted sum is $\theta_j = \sum_{i=0}^{n} W_{ij} \cdot X_i$ and the Loss/Cost/Error Function is $E = \frac{1}{2} \sum_{j=1}^{m} (o_j - t_j)^2$.)
Single Layer Neural Network with Non-linearity
(Figure: input $X$ → weights $W$ → weighted sums $\theta$ → outputs $o_1, o_2, o_3$ → error $E$.)

Loss/Cost/Error Function:
$E = \frac{1}{2} \sum_{j=1}^{m} (o_j - t_j)^2$

Gradient (chain rule):
$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial \theta_j} \cdot \frac{\partial \theta_j}{\partial W_{ij}} = (o_j - t_j) \cdot o_j (1 - o_j) \cdot X_i$

Weight Update Rule:
$W_{ij} = W_{ij} - \eta \cdot (o_j - t_j) \cdot o_j (1 - o_j) \cdot X_i$
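A brief NumPy sketch of this update for one training sample (the names and values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_layer_update(W, x, t, lr=0.1):
    """One gradient step for a single-layer sigmoid network; W has shape (n_inputs, m_outputs)."""
    o = sigmoid(x @ W)                          # outputs o_j = sigmoid(theta_j)
    grad = np.outer(x, (o - t) * o * (1 - o))   # dE/dW_ij = (o_j - t_j) * o_j * (1 - o_j) * x_i
    return W - lr * grad                        # weight update rule

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
W = single_layer_update(W, x=np.array([1.0, 0.5, -1.0]), t=np.array([1.0, 0.0]))
print(W)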
Multiple Hidden Layer & Multiple Output
(Figure: an input $X$ passes through layers $0, 1, 2, \ldots, p, \ldots, K-1$ to the output layer $K$, producing outputs $o_1^K, o_2^K, o_3^K$; each layer also includes a constant bias unit 1.)
Error – Output Layer
Output:
$O_j^K = \frac{1}{1 + e^{-\theta_j^K}}$

Weighted Sum:
$\theta_j^K = \sum_{i=0}^{M_{K-1}} W_{ij}^K \cdot O_i^{K-1}$

Loss/Cost/Error Function:
$E = \frac{1}{2} \sum_{j=1}^{M_K} (O_j^K - t_j)^2$

Gradient:
$\frac{\partial E}{\partial W_{ij}^K} = \frac{\partial E}{\partial O_j^K} \cdot \frac{\partial O_j^K}{\partial \theta_j^K} \cdot \frac{\partial \theta_j^K}{\partial W_{ij}^K}$

Let $\delta_j^K = (O_j^K - t_j) \cdot O_j^K (1 - O_j^K)$, so that $\frac{\partial E}{\partial W_{ij}^K} = \delta_j^K \cdot O_i^{K-1}$

Weight Update Rule:
$W_{ij}^K = W_{ij}^K - \eta \cdot \delta_j^K \cdot O_i^{K-1}$
Error – Hidden Layer
Output:
$O_i^{K-1} = \frac{1}{1 + e^{-\theta_i^{K-1}}}$

Weighted Sum:
$\theta_i^{K-1} = \sum_{p=0}^{M_{K-2}} W_{pi}^{K-1} \cdot O_p^{K-2}$

Loss/Cost/Error Function:
$E = \frac{1}{2} \sum_{j=1}^{M_K} (O_j^K - t_j)^2$

(Figure: layer $K-2$ feeds the hidden-layer units $O_1^{K-1}, O_2^{K-1}, O_3^{K-1}$ through the weights $W_{pi}^{K-1}$, and these in turn feed the output-layer units $O_1^K, O_2^K, O_3^K$.)
Error – Hidden Layer
Weighted Sum:
$\theta_i^{K-1} = \sum_{p=0}^{M_{K-2}} W_{pi}^{K-1} \cdot O_p^{K-2}$

Loss/Cost/Error Function:
$E = \frac{1}{2} \sum_{j=1}^{M_K} (O_j^K - t_j)^2$

Gradient (how the error depends on a hidden-layer output):
$\frac{\partial E}{\partial O_i^{K-1}} = \sum_{j=1}^{M_K} \frac{\partial E}{\partial O_j^K} \cdot \frac{\partial O_j^K}{\partial \theta_j^K} \cdot \frac{\partial \theta_j^K}{\partial O_i^{K-1}} = \sum_{j=1}^{M_K} (O_j^K - t_j) \cdot O_j^K (1 - O_j^K) \cdot W_{ij}^K$

Let $\delta_j^K = (O_j^K - t_j) \cdot O_j^K (1 - O_j^K)$, so $\frac{\partial E}{\partial O_i^{K-1}} = \sum_{j=1}^{M_K} \delta_j^K \cdot W_{ij}^K$
MLP with Backpropagation
Chain rule for a hidden-layer weight:
$\frac{\partial E}{\partial W_{pi}^{K-1}} = \sum_{j=1}^{M_K} \frac{\partial E}{\partial O_j^K} \cdot \frac{\partial O_j^K}{\partial \theta_j^K} \cdot \frac{\partial \theta_j^K}{\partial O_i^{K-1}} \cdot \frac{\partial O_i^{K-1}}{\partial \theta_i^{K-1}} \cdot \frac{\partial \theta_i^{K-1}}{\partial W_{pi}^{K-1}}$

Update Rules for Backpropagation:
$\Delta W = \eta \cdot \delta_j \cdot o_i$
Output layer: $\delta_j = (1 - o_j) \cdot o_j \cdot (t_j - o_j)$
Hidden layer: $\delta_j = (1 - o_j) \cdot o_j \cdot \sum_k \delta_k \, w_{kj}$
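A short NumPy sketch of these update rules for a single training sample in a two-layer network (biases omitted for brevity; names are illustrative). Note that with $\delta_j = (t_j - o_j)\,o_j (1 - o_j)$ the weights are increased by $\eta \cdot \delta_j \cdot o_i$, which is the same gradient-descent step as the batch example earlier, just written with the opposite sign convention:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hidden, W_out, x, t, eta=0.5):
    """One per-sample update using the delta rules above."""
    o_hidden = sigmoid(W_hidden @ x)         # hidden-layer outputs o_i
    o_out = sigmoid(W_out @ o_hidden)        # output-layer outputs o_j
    delta_out = (1 - o_out) * o_out * (t - o_out)                      # output-layer deltas
    delta_hidden = (1 - o_hidden) * o_hidden * (W_out.T @ delta_out)   # hidden-layer deltas
    W_out = W_out + eta * np.outer(delta_out, o_hidden)   # delta_W = eta * delta_j * o_i
    W_hidden = W_hidden + eta * np.outer(delta_hidden, x)
    return W_hidden, W_out

rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.5, size=(2, 2))
W_out = rng.normal(scale=0.5, size=(1, 2))
W_hidden, W_out = backprop_step(W_hidden, W_out, x=np.array([1.0, 0.0]), t=np.array([1.0]))
print(W_hidden, W_out)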