Lec 1
❑ Introduction
❑ Learning paradigms
❑ History of artificial neural networks (ANN)
❑ Modelling of ANNs
❑ Multilayer perceptron (MLP)
❑ Gradient Descent and Backpropagation
❑ ANN types, design and issues
❑ Validation techniques for efficient learning
❑ Assignment(s)
❑ Conclusion
Introduction
❑ The ever-increasing popularity of artificial intelligence (AI) and machine learning (ML) provides a groundbreaking impetus to many aspects of our lives.
➢ Artificial intelligence (AI) is the set of human-designed tools (programs) that do things typically done by humans.
➢ Machine learning (ML) is an AI field in which a machine can learn new things through experience, without human involvement.
➢ Deep learning (DL) is an ML subset in which machines adapt to and learn from vast amounts of data.
[Figure: nested circles showing Deep Learning (DL) as a subset of Machine Learning (ML), which is itself a subset of Artificial Intelligence (AI). Source: https://pvvajradhar.medium.com/ai-applications-in-various-fields-748dde27516d]
Categories of Machine Learning
Learning Paradigms
❑ Each neuron is a cell that uses biochemical reactions to receive, process, and transmit information.
❑ Neurons are connected to one another through synapses (~10K per neuron).
Modeling of a Biological Neuron
❑ A mathematical model of the neuron (called the perceptron) has been
introduced in an effort to mimic our understanding of the functioning of
the brain.
[Figure: perceptron model of a biological neuron — the inputs play the role of the dendrites, a summation unit followed by an activation function acts as the processor (the neuron changes its internal state, or activation, based on the current input), and the result $y = f(u_k)$ is the output.]
Artificial Neuron (cont’d)
[Figure: artificial neuron — inputs $x_1, \dots, x_m$ with weights $w_1, \dots, w_m$ feed a summation unit followed by an activation function $f$.]

$$u_k = \sum_{i=1}^{m} w_i x_i, \qquad y = f(u_k)$$
A bias value (𝒃) is needed for full control of the activation function's output: it shifts the function along the input axis, which is important for successful learning.
Artificial Neuron (cont’d)
[Figure: artificial neuron with the bias absorbed as an extra input $x_0 = 1$ whose weight is $w_0 = b$.]

$$u_k = b + \sum_{i=1}^{m} w_i x_i = \sum_{i=0}^{m} w_i x_i, \qquad \text{with } x_0 = 1,\; w_0 = b$$
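As a quick illustration (a minimal sketch with made-up numbers, not part of the slides), the two forms of $u_k$ above can be checked numerically in Python:

```python
import numpy as np

# Weighted sum of a single neuron, computed two ways:
# (1) with an explicit bias, and (2) with the bias absorbed as x0 = 1, w0 = b.
x = np.array([0.2, -0.4, 0.9])          # example inputs x1..xm (made-up values)
w = np.array([0.5, -0.3, 0.8])          # example weights w1..wm (made-up values)
b = 0.1                                  # bias

u_explicit = b + np.dot(w, x)                    # u_k = b + sum_i w_i * x_i
u_absorbed = np.dot(np.r_[b, w], np.r_[1.0, x])  # x0 = 1, w0 = b

assert np.isclose(u_explicit, u_absorbed)
print(u_explicit)
```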
Artificial Neural Network (ANN)
Basic elements of any ANN: inputs, weights, a bias, a summation unit, and a transfer (activation) function:

$$y = f\!\left(b + \sum_{i=1}^{m} w_i x_i\right) = f\!\left(b + \mathbf{W}^{\mathsf T}\mathbf{X}\right)$$

Two common transfer functions are the step (signum) and the sigmoid:

$$y_k = \begin{cases} 1 & \text{if } u_k \ge 0 \\ -1 & \text{if } u_k < 0 \end{cases} \qquad\qquad y_k = \frac{1}{1 + e^{-u_k}}$$
Artificial Neuron: Transfer Function
Hyperbolic tangent sigmoid:

$$y_k = \frac{e^{u_k} - e^{-u_k}}{e^{u_k} + e^{-u_k}}$$

Leaky ReLU:

$$y_k = \max(\epsilon u_k,\, u_k), \qquad \epsilon \ll 1$$
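The four transfer functions mentioned so far can be written compactly in Python/NumPy (a minimal sketch; the value of ε is an arbitrary illustrative choice):

```python
import numpy as np

def step(u):                     # signum-style step: 1 if u >= 0, else -1
    return np.where(u >= 0, 1.0, -1.0)

def sigmoid(u):                  # logistic sigmoid: 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def tanh_sigmoid(u):             # hyperbolic tangent: (e^u - e^{-u}) / (e^u + e^{-u})
    return np.tanh(u)

def leaky_relu(u, eps=0.01):     # max(eps*u, u), with eps << 1
    return np.maximum(eps * u, u)

u = np.linspace(-4, 4, 9)
print(step(u), sigmoid(u), tanh_sigmoid(u), leaky_relu(u), sep="\n")
```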
Artificial Neural Network (ANN)
❑ An artificial neural network (ANN) is a massively parallel distributed processor made up
of simple processing units (neurons).
❑ An ANN can solve problems that linear computing cannot.
❑ ANNs are adaptive systems, i.e.,
parameters can be changed through a
learning process (training) to suit the
underlying problem.
[Figure: single neuron with error feedback — inputs $x_1, \dots, x_m$ (here $m = 3$) with weights $w_1, \dots, w_m$ and bias $b$ give $u = b + \sum_{i=1}^{3} w_i x_i$ and output $y = f(u)$; the output is compared with the desired response $d(n)$ to form the error signal $e(n)$, and the weights are updated based on $e(n)$.]
Learning Process (cont’d)
Stopping criterion:
➢ A fixed number of iterations has been reached
➢ The error has fallen below a threshold
Learning Process: Cost Function
❑ Our objective is to reduce the difference between the actual and target outputs (i.e., the error).
❑ At iteration 𝑛, the adjustment of each input connection weight is proportional to the product of the error signal and the input value on that connection (written out symbolically below).
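In symbols (using the learning rate 𝜂 introduced later in this lecture and the error signal defined above), this is the delta rule applied on the following slides:

$$\Delta w_j(n) = \eta \, e(n)\, x_j(n), \qquad e(n) = d(n) - y(n)$$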
Learning Process: Epoch
A training cycle in which all the training samples have been presented to the network once is called an epoch.
Learning Process: Example
Example

n    x1     x2     x3     d
1     1      1     0.5   0.7
2    -1     0.7   -0.5   0.2
3    0.3    0.3   -0.3   0.3

Assume:
• initial weights are 0.5, -0.3, 0.8
• b = 0
• 𝜂 = 0.1
• linear activation function
Learning Process Example: Solution
Solution: the three samples in the table above are presented one at a time to a single neuron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, bias $b = 0$, and linear activation $f$; after each sample the weights are updated with the delta rule (a worked sketch follows below).
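A minimal NumPy sketch of this worked example (one epoch over the three samples; the printed numbers follow directly from the delta rule, and the variable names are mine):

```python
import numpy as np

# Training samples from the example: columns x1, x2, x3 and target d
X = np.array([[ 1.0, 1.0,  0.5],
              [-1.0, 0.7, -0.5],
              [ 0.3, 0.3, -0.3]])
d = np.array([0.7, 0.2, 0.3])

w = np.array([0.5, -0.3, 0.8])   # initial weights
b = 0.0                          # bias (kept fixed at 0 as assumed)
eta = 0.1                        # learning rate

# One epoch of delta-rule updates with a linear activation (y = u)
for x, target in zip(X, d):
    y = b + np.dot(w, x)         # forward pass
    e = target - y               # error signal e(n) = d(n) - y(n)
    w = w + eta * e * x          # delta rule: w <- w + eta * e * x
    print(f"y = {y:+.3f}, e = {e:+.3f}, w = {np.round(w, 3)}")
```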
ANN Examples
❑ A one-layer feedforward neural network is called the perceptron.
❑ It can solve linearly separable functions, e.g., AND, OR, NOT.

AND function:

x1   x2   y
0    0    0
1    0    0
0    1    0
1    1    1

$$y = f\!\left(b + \sum_{i=1}^{n} w_i x_i\right) \qquad\Rightarrow\qquad y = \mathrm{step}\!\left(-1.5 + 1\cdot x_1 + 1\cdot x_2\right)$$

[Figure: the four input points (0,0), (1,0), (0,1), (1,1) in the $x_1$–$x_2$ plane, with a straight decision boundary separating (1,1) from the other three points.]
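A quick check (a minimal sketch, not from the slides) that this weight and bias choice reproduces the AND truth table; note that the step function here outputs 0/1, as the truth table requires, rather than the ±1 step defined earlier:

```python
import numpy as np

def step(u):
    return (u >= 0).astype(int)   # here: 1 if u >= 0, else 0

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
w = np.array([1.0, 1.0])          # w1 = w2 = 1
b = -1.5                          # bias from the slide

y = step(X @ w + b)
print(y)                          # expected: [0 0 0 1], the AND truth table
```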
ANN Examples (cont’d)
❑ The same one-layer perceptron structure can also implement the OR function.

OR function:

x1   x2   y
0    0    0
1    0    1
0    1    1
1    1    1

$$y = f\!\left(b + \sum_{i=1}^{n} w_i x_i\right) \qquad\Rightarrow\qquad y = \mathrm{step}\!\left(-0.5 + 1\cdot x_1 + 1\cdot x_2\right)$$

[Figure: the four input points in the $x_1$–$x_2$ plane, with a straight decision boundary separating (0,0) from the other three points.]
ANN Examples (cont’d)
[Figure: the AND and OR truth tables and their decision boundaries shown side by side — $y = \mathrm{step}(-1.5 + 1\cdot x_1 + 1\cdot x_2)$ for AND and $y = \mathrm{step}(-0.5 + 1\cdot x_1 + 1\cdot x_2)$ for OR; in both cases a single straight line separates the two output classes.]

❑ Solving linearly means the decision boundary is linear (a straight line in 2D and a plane in 3D).
❑ The bias term (𝒃) alters the position, but not the orientation, of the decision boundary.
❑ The weights (𝑤1, 𝑤2, ..., 𝑤m) determine the gradient (slope), as the rearranged boundary equation below shows.
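For the 2-D case this can be seen by rearranging the decision boundary equation (the boundary is where the step function's argument is zero):

$$w_1 x_1 + w_2 x_2 + b = 0 \;\;\Longleftrightarrow\;\; x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2}$$

so the weights set the slope $-w_1/w_2$, while the bias only shifts the intercept $-b/w_2$.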
ANN Examples: XOR function
XOR function:

x1   x2   y
0    0    0
1    0    1
0    1    1
1    1    0

[Figure: the four XOR input points in the $x_1$–$x_2$ plane; no single straight line separates the two classes — two boundaries (an AND-like and an OR-like line) are needed, so a single-layer perceptron cannot solve XOR.]
Multilayer Perceptron (MLP)

[Figure: multilayer perceptron — the inputs feed Hidden Layer 1, then Hidden Layer 2, then the Output Layer.]

❑ Error backpropagation is used for learning:
$$e(n) = d(n) - y(n)$$
❑ Weight adjustments are applied so as to minimize 𝑒(𝑛) in a statistical sense.
Gradient Descent
The delta rule is a gradient descent learning rule for updating the weights of an artificial neuron's inputs in a single-layer NN:

$$w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$$

The goal of gradient descent is to iteratively take steps towards lower regions (minima) of the loss function.
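As a toy illustration (my own minimal sketch, not from the slides), gradient descent on the one-dimensional loss $E(w) = (w - 3)^2$ walks the parameter towards the minimum at $w = 3$:

```python
# Toy 1-D gradient descent: minimize E(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w = 0.0          # arbitrary starting point
eta = 0.1        # learning rate

for step in range(25):
    grad = 2.0 * (w - 3.0)   # dE/dw
    w = w - eta * grad       # w <- w - eta * dE/dw
print(w)                     # close to 3.0, the minimum of E
```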
Gradient Descent (cont’d)
For a linear activation function, the weight adjustment for a neuron 𝑘 is given by

$$\Delta w_{kj}(n) = \eta \cdot e_k(n) \cdot x_j(n), \qquad j = 1, 2, \dots, m$$

[Figure: neuron with inputs $x_1, \dots, x_m$, weights $w_1, \dots, w_m$, activation $f$, and pre-activation $u(n) = b + \sum_{j=1}^{m} w_j x_j$.]
Gradient Descent (cont’d)
$$\Delta w_{kj} = -\eta \cdot \frac{\partial E}{\partial w_j}$$

[Figure: gradient descent on an error surface — successive steps move against the gradient towards the minimum. Source: https://medium.com/@divakar_239/stochastic-vs-batch-gradient-descent-8820568eada1]

Backpropagation
❑ It is based on the gradient search technique to minimize the cost function, i.e., the squared error between the network output and the target output.
❑ It is a recursive application of the chain rule to compute the gradients.

$$E(n) = \frac{1}{2}\sum_{k} e_k^2(n)$$
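For a single linear output neuron this chain-rule computation can be written out explicitly (a short derivation consistent with the formulas above; it recovers the delta rule):

$$\frac{\partial E}{\partial w_j}
= \frac{\partial E}{\partial e}\,\frac{\partial e}{\partial y}\,\frac{\partial y}{\partial w_j}
= e \cdot (-1) \cdot x_j
\quad\Longrightarrow\quad
\Delta w_j = -\eta\,\frac{\partial E}{\partial w_j} = \eta\, e\, x_j$$

where $E = \tfrac{1}{2}e^2$, $e = d - y$, and $y = u = b + \sum_j w_j x_j$ (linear activation, so $\partial y/\partial w_j = x_j$).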
Backpropagation (cont’d)
❑ The weights of each output neuron can be updated directly using the delta learning rule:

$$\Delta w_{ki} = \eta \cdot e \cdot f'(\cdot) \cdot z_i = \eta \cdot \delta_k \cdot z_i, \qquad \delta_k = e \cdot f'(\cdot)$$

where $\delta_k$ is the local gradient (error signal) of output neuron 𝑘.

❑ If the neuron is a hidden node:

$$\delta_j = f'(\cdot) \cdot \sum_{k=1}^{K} \delta_k \, w_k$$

where $K$ is the number of nodes in the next layer connected to the current neuron, i.e., [local gradient] × [upstream gradient].
Please see the following for all details about mathematical derivation:
https://www.jeremyjordan.me/neural-networks-training/
Backpropagation Example
❑ Assume one input layer, one hidden layer, and one output neuron
[Figure: a 2–2–1 network — inputs $x_1, x_2$ connect to hidden neurons $z_1, z_2$ through weights $\beta_{ij}$, and the hidden neurons connect to the output neuron $y_k$ through weights $w_{ki}$.]

𝑥𝑗: the 𝒋-th input
𝑧𝑖: the output of the 𝒊-th hidden neuron
𝑦𝑘: the output of the 𝒌-th output neuron
𝛽𝑖𝑗: the weight from input node 𝑥𝑗 to hidden node 𝑧𝑖
𝑤𝑘𝑖: the weight from hidden node 𝒛𝒊 to output neuron 𝒚𝒌

For the sigmoid activation, $f'(u_k) = y_k(1 - y_k)$.
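A minimal NumPy sketch of one forward and backward pass through this 2–2–1 network with sigmoid activations (the numeric values of the inputs, target, weights, and learning rate are made up for illustration):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

eta = 0.5                                  # learning rate (illustrative)
x = np.array([1.0, 0.0])                   # inputs x1, x2 (made up)
d = 1.0                                    # desired output (made up)
beta = np.array([[0.2, -0.4],              # beta[i, j]: input x_j -> hidden z_i
                 [0.7,  0.1]])
w = np.array([0.5, -0.3])                  # w[i]: hidden z_i -> output y

# Forward pass (biases omitted for simplicity)
z = sigmoid(beta @ x)                      # hidden outputs z1, z2
y = sigmoid(np.dot(w, z))                  # network output

# Backward pass
e = d - y                                  # error signal
delta_k = e * y * (1.0 - y)                # output local gradient: e * f'(u_k)
delta_j = z * (1.0 - z) * (delta_k * w)    # hidden local gradients: f'(u_j) * delta_k * w_k

# Weight updates (delta rule applied layer by layer)
w = w + eta * delta_k * z                  # output-layer weights
beta = beta + eta * np.outer(delta_j, x)   # hidden-layer weights
print(y, e)
```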
Types of Neural Networks
Feedforward neural network: signals travel one way only (from input to output).
Recurrent neural network (RNN): the output from the previous step is fed as an input to the current step.
Learning with a teacher: supervised learning.
Learning without a teacher: unsupervised learning, e.g., self-organizing maps (SOM).
ANN Design and Issues
❑ Number of neurons, and hidden layers
❑ Initial weights (small random values ∈[‐1,1])
❑ Choice of the transfer function
❑ Learning rate
❑ Weights adjusting
❑ Data representation, pre-processing, and splitting
Learning Rate
❑ The learning rate, 𝜂, is a configurable (hyper)parameter used in ANNs
training
❑ 𝜂 controls how quickly the model is adapted to the problem
❑ Practical value 0 < 𝜂 < 1.
Learning Rate (cont’d)
One technique that can help the network out of local minima is the use of a momentum term.
$$\Delta w_{kj}(n) = \eta \cdot \delta_k(n) \cdot x_j(n) + \alpha \, \Delta w_{kj}(n-1)$$

where $\alpha$ is the momentum coefficient.
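In code, the momentum term simply means remembering the previous update (a minimal sketch; the values of 𝜂, 𝛼, and the inputs are illustrative):

```python
import numpy as np

eta, alpha = 0.1, 0.9          # learning rate and momentum coefficient (illustrative)
w = np.zeros(3)                # weights
prev_dw = np.zeros(3)          # previous weight update, Delta_w(n-1)

def momentum_update(w, prev_dw, delta_k, x):
    # Delta_w(n) = eta * delta_k(n) * x(n) + alpha * Delta_w(n-1)
    dw = eta * delta_k * x + alpha * prev_dw
    return w + dw, dw

# Example single update with a made-up local gradient and input
w, prev_dw = momentum_update(w, prev_dw, delta_k=0.2, x=np.array([1.0, -0.5, 0.3]))
print(w)
```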
Overfitting
[Figure: decision boundaries in the $x_1$–$x_2$ plane; an overly complex (overfitted) boundary cannot be generalized.]

❑ Solution
➢ Early stopping
➢ Regularization (Dropout)
Vanishing Gradient
❑ Problem
➢ The gradients get smaller and smaller as the error is backpropagated.
➢ After a few layers of propagation, the gradient disappears (vanishes).
➢ The parameters in the deep (early) layers become almost static.
❑ Solution
➢ Modify the activation function
➢ Use batch normalization (a form of regularization)
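As an illustration of how these remedies look in practice (a minimal sketch using tensorflow.keras, which the assignments below also use; the layer sizes and input dimension are arbitrary, and X_train/y_train are hypothetical):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Small binary classifier illustrating Dropout, batch normalization, and early stopping.
model = keras.Sequential([
    layers.Input(shape=(8,)),              # 8 input features (arbitrary choice)
    layers.Dense(16, activation="relu"),   # ReLU helps against vanishing gradients
    layers.BatchNormalization(),           # batch normalization
    layers.Dropout(0.3),                   # dropout regularization
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Early stopping: halt training when the validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=150, callbacks=[early_stop])
```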
ANN Advantages and Disadvantages
❑ Advantages
➢ Very simple principles
➢ Highly parallel: information processing is much more like the brain than a serial
computer
➢ Adapt to unknown situations, can model complex functions
➢ Ease of use, learns by example, and very little user domain‐specific expertise needed.
❑ Disadvantages
➢ Very complex behaviors
➢ Not exact.
➢ Needs training.
ANN Terminology
❑ Neuron, unit (node)
❑ Weight and bias
❑ Transfer function (linear, sigmoid, ReLU, etc.)
❑ Loss function (mean squared error, cross entropy, etc.)
❑ Learning rate, epoch, batch
❑ Backpropagation (error propagation)
❑ Optimization (gradient descent (GD), stochastic GD, Adam, etc.)
❑ Overfitting
❑ Dropout, Batch normalization
Each of these ANN aspects is a standalone research area.
Validation Techniques
Data Splitting
➢ Training/Testing: 70% to 75% of the samples for training, 25% to 30% for testing.
➢ Training/Validation/Testing: 60% (or 70%) for training, 20% (15%) for validation, and 20% (15%) for testing.
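A common way to produce such splits in Python is scikit-learn's train_test_split (a minimal sketch with dummy data; scikit-learn is my assumption and is not mentioned in the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 8)          # 100 samples, 8 features (dummy data)
y = np.random.randint(0, 2, 100)    # dummy binary labels

# 70%/30% training/testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# 60%/20%/20% training/validation/testing split (split the 40% hold-out in half)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.40, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 60, 20, 20
```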
Validation Techniques (cont'd)
Random Sample Selection
[Figure: from the total number of samples, a different random subset of test samples is drawn for each of k experiments (Experiment #1, Experiment #2, ..., Experiment #k).]
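This corresponds to repeated random sub-sampling: each experiment uses a different random split, and the results are averaged (a minimal sketch; the use of scikit-learn's Perceptron as the model is my illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 8)           # dummy data (illustrative)
y = np.random.randint(0, 2, 100)

k = 5                                # number of experiments
scores = []
for experiment in range(k):
    # Draw a different random set of test samples for each experiment
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=experiment)
    clf = Perceptron().fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))
print(np.mean(scores), np.std(scores))   # average performance over the k random splits
```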
Assignments
❑ Assignment 2: Modify the above-designed code to implement a multi-layer perceptron, MLP (an ANN with one input layer, one hidden layer, and one output layer) for the same data points above. Assume a sigmoid activation function and no bias for simplicity (b = 0). Test your approach using different iteration numbers and different numbers of nodes in the hidden layer (e.g., 4, 8, and 16).
Assignments (cont’d)
❑ Assignment 3: Use the Keras library (tensorflow.keras) to build different ANNs with different numbers of hidden layers (shallow: one hidden layer plus the output layer; deeper: two hidden layers with 12 and 8 nodes, respectively; deeper still: three hidden layers with 32, 16, and 8 nodes, respectively). Use the provided diabetic data sets (here) to train and test your design. Use the ReLU activation for the hidden layers and the sigmoid activation for the output neuron, with loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'], and epochs = 150. A minimal model skeleton of the kind this assignment asks for is sketched below.
❑ Assignment 4: Redo Assignment 3 using 80% of the data for training and 20% of the data for testing. Also, plot the training accuracy and loss curves for your designed networks.
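For orientation only (a minimal sketch of the "shallow" configuration described in Assignment 3; the input dimension of 8 and the hidden-layer size of 12 are my assumptions, and loading the diabetic data set is left out):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Shallow variant: one hidden layer plus the output layer.
model = keras.Sequential([
    layers.Input(shape=(8,)),                 # 8 input features (assumed)
    layers.Dense(12, activation="relu"),      # hidden layer (size assumed)
    layers.Dense(1, activation="sigmoid"),    # output neuron for binary classification
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

# Training would follow the assignment settings, e.g.:
# history = model.fit(X_train, y_train, epochs=150, validation_data=(X_test, y_test))
```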
Thank You
&
Questions