Lec 1

Outline

❑ Introduction
❑ Learning paradigms
❑ History of artificial neural networks (ANN)
❑ Modelling of ANNs
❑ Multilayer perceptron (MLP)
❑ Gradient Descent and Backpropagation
❑ ANN types, design and issues
❑ Validation techniques for efficient learning
❑ Assignment(s)
❑ Conclusion

Introduction
❑ The ever-increasing popularity of artificial intelligence (AI) and machine learning (ML)
provides a groundbreaking impetus to many aspects of our lives.
➢ Artificial intelligence (AI) is the set of human-designed tools (programs) built to do things that are
typically done by humans.
➢ Machine learning (ML) is an AI field in which a machine can learn new things through experience
without the involvement of a human.
➢ Deep learning (DL) is an ML subset in which machines adapt to and learn from vast amounts of data.

[Figure: nested circles showing Deep Learning (DL) as a subset of Machine Learning (ML), which is a subset of Artificial Intelligence (AI).]
Image source: https://pvvajradhar.medium.com/ai-applications-in-various-fields-748dde27516d
Categories of Machine Learning (Learning Paradigms)

❑ Supervised Learning
➢ Learning with a teacher
➢ Data with a known output (label) is given
➢ Classification and Regression
➢ Examples: Support Vector Machine (SVM), K Nearest Neighbours (KNN), Decision Trees, Random Forest, Feedforward Artificial Neural Network (ANN)

❑ Reinforcement Learning
➢ Interactive learning environment: learning by trial and error using feedback from the machine's own actions and experiences
➢ Examples: Q-learning, Markov Decision Process

❑ Unsupervised Learning
➢ Learning without a teacher
➢ No labels: the machine has to make sense of the data on its own
➢ Clustering
➢ Examples: Gaussian Mixtures, K-means, Fuzzy c-means
Supervised Machine Learning

❑ Data
[Figure: examples of labelled training data for a supervised learning task.]
Image source: https://www.enjoyalgorithms.com/blog/classification-of-machine-learning-models


Supervised Machine Learning (cont’d)

❑ The data set is split into Training and Testing (or validation) subsets.
[Figure: the data set divided into a training portion and a testing (or validation) portion.]
Image source: https://www.enjoyalgorithms.com/blog/classification-of-machine-learning-models
History of Neural Networks (NN)
❑ 1943: McCulloch and Pitts: first mathematical model of a neuron (a verification model)
❑ 1957: Rosenblatt: the perceptron model
❑ 1959: Widrow and Hoff developed MADALINE, the first NN to be applied to a real-world problem
➢ Progress on NN research then largely halted until 1981
❑ 1982: Hopfield: associative memory - recurrent NN (the RNNs)
❑ 1986: Rumelhart: backpropagation and the era of the multilayer perceptron (MLP)
❑ 1990s: rise of the support vector machine (SVM)
❑ 1997: Hochreiter & Schmidhuber: an RNN, the long short-term memory (LSTM), was proposed
❑ 2006: Hinton et al.: NNs returned to the public's attention through deep belief nets (DBNs)
❑ 2016: boom of NNs (deep convolutional neural networks (CNNs): AlexNet, GoogLeNet, VGG, ResNet, etc.)
Image source: https://developpaper.com/take-you-into-the-past-life-and-this-life-of-neural-network/
Human Brain and Biological Neurons
❑ The human brain contains billions of neurons (~10 billion)
❑ Each neuron is a cell that uses biochemical reactions to receive, process and transmit information
❑ Neurons are connected to each other through synapses (~10K per neuron)
[Figure: the human brain and the structure of a neuron and its synapses.]
Image sources: https://beautifulnow.is/discover/wellness/new-brain-flows-are-beautiful-now and https://www.getbodysmart.com/nervous-system/neuron-synapse-structure
Human Brain and Biological Neurons (cont’d)
❑ A neuron accepts (and combines) inputs through its dendrites from other neurons
❑ If a neuron's combined input is above a threshold, the neuron discharges a spike (electrical pulse)
that travels from the cell body, down the axon, to the next neuron(s)
❑ The strength of the signal that reaches the next neuron depends on factors such as the amount of
neurotransmitter available at the synapses
[Figure: neuron A connected to neuron B, showing the cell body, nucleus, dendrites, axon, and the synapses (neurotransmitters) between them.]
https://natureofcode.com/book/chapter-10-neural-networks/
Modeling of a Biological Neuron
❑ A mathematical model of the neuron (called the perceptron) has been introduced in an effort to
mimic our understanding of the functioning of the brain.
➢ Dendrites: receive input from many other neurons.
➢ Cell body: changes its internal state (activation) based on the current input.
➢ Axon: sends one output signal to many other neurons, possibly including its own input neurons
(recurrent network).
Artificial Neuron
❑ An artificial neuron is an imitation of a biological neuron
➢ Dendrites: Inputs
➢ Cell body: Processor
➢ Synapse: Link
➢ Axon: Output
[Figure: inputs feeding a processor that produces an output.]
Technically, artificial neurons are referred to as units or nodes.
Artificial Neuron (cont’d)
❑ A neuron has multiple inputs (x_1, x_2, ..., x_m), each of which has a different strength, i.e., a weight w_i
❑ Activation: the combined input must be above a certain threshold for the neuron to produce an output y
[Figure: inputs x_1 ... x_m with weights w_1 ... w_m feeding a summing unit followed by an activation unit that produces the output y.]
The operations done by a neuron are:
1) Multiply the inputs by the weights,
2) Add them up,
3) Check the sum against the activation and get y.
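These three operations translate directly into a few lines of code. Below is a minimal sketch (in Python with NumPy; the slides do not prescribe a language) of a single neuron with a hard-limit activation; the input values, weights and threshold are illustrative only:

```python
import numpy as np

def step(u, threshold=0.0):
    """Hard-limit activation: fire 1 if the combined input reaches the threshold."""
    return 1.0 if u >= threshold else 0.0

def neuron_output(x, w, threshold=0.0):
    """1) multiply inputs by weights, 2) add them up, 3) check the sum against the activation."""
    u = np.dot(w, x)           # combined (weighted) input
    return step(u, threshold)  # output y

# Example: two inputs with different strengths (weights)
x = np.array([0.6, 0.9])
w = np.array([0.4, 0.7])
print(neuron_output(x, w, threshold=0.5))  # 0.4*0.6 + 0.7*0.9 = 0.87 >= 0.5 -> 1.0
```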
Artificial Neuron (cont’d)
❑ The neuron first computes the combined (weighted) input

u_k = Σ_{i=1}^{m} w_i x_i

❑ The output y = f(u_k) is therefore a function of
➢ the inputs x_i
➢ the weights w_i
Artificial Neuron (cont’d)
❑ f(·): how are the combined x's and w's used to produce y?

u_k = Σ_{i=1}^{m} w_i x_i ,   y = f(u_k)

[Figure: two example activation curves, with and without a shift along the input axis.]
❑ A bias value (b) is important for full control of the activation function (i.e., of the output): it shifts
the function along the input axis, which is needed for successful learning.
Artificial Neuron (cont’d)
❑ The bias can be treated as an extra input x_0 = 1 with weight w_0 = b, so that

u_k = Σ_{i=1}^{m} w_i x_i    becomes    u_k = b + Σ_{i=1}^{m} w_i x_i
Artificial Neural Network (ANN)
Basic elements of any ANN:
➢ A set of connecting links (synapses) from the inputs x_i, each of which is characterized by a weight w_i.
➢ A summing unit (adder).
➢ An activation function (nonlinearity).
➢ A bias b.
[Figure: inputs x_1 ... x_m with weights w_1 ... w_m and a bias b feeding a summing unit Σ and an activation function f that produces the output y.]
ANN (cont’d)
❑ If the sum exceeds a certain threshold, the ANN (or the perceptron) fires an output value that is
transmitted to the next unit(s)
❑ An ANN uses a nonlinear transfer function

Why do we need nonlinearity?

y = f( b + Σ_i w_i x_i ) = f( b + WᵀX )

Without a nonlinear f, y is linear and unbounded:
➢ NOT realistic
➢ Can NOT be generalized
➢ LESS power to solve complex nonlinear problems
ANN Transfer Functions

Linear:
y_k = u_k

Hard Limit:
y_k = 1 if u_k ≥ 0 ;  0 if u_k < 0

Symmetric Hard Limit:
y_k = 1 if u_k ≥ 0 ;  −1 if u_k < 0

Saturating Linear:
y_k = 1 if u_k > 1 ;  u_k if 0 ≤ u_k ≤ 1 ;  0 if u_k < 0

Symmetric Saturating Linear:
y_k = 1 if u_k > 1 ;  u_k if −1 ≤ u_k ≤ 1 ;  −1 if u_k < −1

Log Sigmoid:
y_k = 1 / (1 + e^(−u_k))
Artificial Neuron: Transfer Functions (cont’d)

Hyperbolic Tangent Sigmoid:
y_k = (e^(u_k) − e^(−u_k)) / (e^(u_k) + e^(−u_k))

Rectified Linear Unit (ReLU):
y_k = max(0, u_k)

Leaky ReLU:
y_k = max(ε·u_k, u_k) ,   with ε ≪ 1

Exponential Linear Unit (ELU):
y_k = u_k if u_k ≥ 0 ;  α(e^(u_k) − 1) if u_k < 0
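As an illustration, the transfer functions above can be written as one-line NumPy expressions. This is a sketch, not part of the lecture; the parameter names eps and alpha are stand-ins for ε and α:

```python
import numpy as np

def linear(u):                return u
def hard_limit(u):            return np.where(u >= 0, 1.0, 0.0)
def sym_hard_limit(u):        return np.where(u >= 0, 1.0, -1.0)
def saturating_linear(u):     return np.clip(u, 0.0, 1.0)
def sym_saturating_linear(u): return np.clip(u, -1.0, 1.0)
def log_sigmoid(u):           return 1.0 / (1.0 + np.exp(-u))
def tanh_sigmoid(u):          return np.tanh(u)   # equals (e^u - e^-u) / (e^u + e^-u)
def relu(u):                  return np.maximum(0.0, u)
def leaky_relu(u, eps=0.01):  return np.maximum(eps * u, u)
def elu(u, alpha=1.0):        return np.where(u >= 0, u, alpha * (np.exp(u) - 1.0))

u = np.linspace(-3, 3, 7)
print(relu(u))
print(log_sigmoid(u))
```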
Artificial Neural Network (ANN)
❑ An artificial neural network (ANN) is a massively parallel distributed processor made up
of simple processing units (neurons).
❑ An ANN is capable of solving problems that linear computing cannot resolve.
❑ ANNs are adaptive systems, i.e.,
parameters can be changed through a
learning process (training) to suit the
underlying problem.

❑ ANNs can be used in a wide variety of classification tasks, e.g., character


recognition, speech recognition, fraud detection, medical diagnosis.
❑ “Neural networks are the second-best way of doing just about anything.” (John Denker, AT&T Bell Laboratories)
Learning Process
❑ Learning is the process by which the parameters of an ANN, i.e., the weights w, are adapted through a
process of stimulation by the environment in which the network is embedded.

Learning ≡ Training
➢ Selection of the network topology
➢ Adaptation of the weight values
➢ Learning by trial and error (experience!)

❑ Every data sample for ANN training consists of an input vector X(n) and the corresponding (desired
or target) output d
❑ A batch is a group of input samples with their desired outputs (e.g., the first few rows of the table below)

Sample number n | Features x1  x2  x3 | Target output d
 1    | 10.33  56  0.56 | 0.8
 2    |  8.97  48  0.61 | 0.1
 3    | 11.01  49  0.49 | 0.3
 4    |  9.32  53  0.89 | 0.7
 5    | 10.51  50  0.71 | 0.4
 6    | 12.10  59  0.90 | 0.8
 ...  |  ...            | ...
 1996 |  7.99  61  0.59 | 0.9
 1997 | 11.36  52  0.63 | 0.5
 1998 | 12.09  48  0.78 | 0.2
 1999 | 10.81  55  0.87 | 0.7
 2000 | 13.00  53  0.91 | 0.6
Learning Process (cont’d)
[Figure: a neuron with m = 3 inputs computing u = b + Σ_{i=1}^{3} w_i x_i and y = f(u); the output y is
compared with the desired output d(n), and the difference is the error signal e(n).]

The weights w are updated based on e(n):
Weight adjustment = function(error, input)
Learning Process (cont’d)

new input sample(s) → output → update weights

Weight adjustment = function(error, input)

General rule for neuron learning:
w_new = w_old + η · e · x ,   where η is the learning constant (the learning rate)
Learning Process: Summary
❑ Learning is an iterative operation through which the network parameters (weights) are updated so as
to reduce the difference (error) between the network output and the desired (target) output

Set initial values of the weights (e.g., randomly)
Do
    Compute the output for a given input X(n)
    Evaluate the output by comparing y(n) with d(n)
    Adjust the weights
Loop until a criterion is met

Criterion
➢ A certain number of iterations
➢ An error threshold
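The loop above can be made runnable. The sketch below assumes a single linear neuron trained with the rule w_new = w_old + η·e·x from the previous slide and uses the two stopping criteria listed here; all parameter values are illustrative:

```python
import numpy as np

def train_neuron(X, d, eta=0.1, max_epochs=100, error_threshold=1e-3):
    """Iteratively adjust the weights to reduce the error between output and target."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-1, 1, size=X.shape[1])   # set initial weights randomly
    b = 0.0
    for epoch in range(max_epochs):           # criterion 1: number of iterations
        total_error = 0.0
        for x_n, d_n in zip(X, d):
            y_n = b + np.dot(w, x_n)           # compute the output for input X(n) (linear activation)
            e_n = d_n - y_n                    # evaluate: compare y(n) with d(n)
            w  += eta * e_n * x_n              # adjust the weights
            b  += eta * e_n
            total_error += 0.5 * e_n ** 2
        if total_error < error_threshold:      # criterion 2: error threshold
            break
    return w, b

# Illustrative usage on random data
X = np.random.rand(20, 3)
d = np.random.rand(20)
print(train_neuron(X, d))
```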
Learning Process: Cost Function
❑ Our objective is to reduce the difference between the actual and target outputs (i.e., the error)
❑ This can be achieved by minimizing a function of the error (the error energy)
➢ This is called the cost function.
➢ An example is the squared error:

E(n) = (1/2) e²(n) = (1/2) (d(n) − y(n))² ,   with  e(n) = d(n) − y(n)

❑ This learning is called error-correction learning, the delta rule, or the Widrow-Hoff rule:

Δw_kj(n) = η · e_k(n) · x_j(n)
w_kj(n+1) = w_kj(n) + Δw_kj(n)

where n is the current sample, k is the index of the current neuron, and j = 1, ..., m.

❑ The adjustment made to the weight of an input connection of a neuron is proportional to the product of
the error signal and the input value of the connection in question.
Learning Process: Epoch

A training cycle in which all the training samples have been used (presented to the network) once is called an epoch.
Learning Process: Example

n |  x1   x2   x3 |  d
1 |   1    1  0.5 | 0.7
2 |  -1  0.7 -0.5 | 0.2
3 | 0.3  0.3 -0.3 | 0.3

Assume
• the initial weights are 0.5, -0.3, 0.8,
• b = 0,
• η = 0.1, and
• a linear activation function.
Learning Process Example: Solution
[Figure: a neuron with three inputs x1, x2, x3, weights w1, w2, w3, bias b = 0, a summing unit Σ and an
activation f producing the output y, applied to the three training samples of the table above.]
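As a sketch of what the in-class solution starts with, the snippet below reproduces the first weight update for sample n = 1 under the stated assumptions (w = [0.5, −0.3, 0.8], b = 0, η = 0.1, linear activation, delta rule):

```python
import numpy as np

w   = np.array([0.5, -0.3, 0.8])   # initial weights
b   = 0.0
eta = 0.1

x1, d1 = np.array([1.0, 1.0, 0.5]), 0.7   # first training sample

u = b + np.dot(w, x1)        # 0.5*1 - 0.3*1 + 0.8*0.5 = 0.6
y = u                        # linear activation: y = u
e = d1 - y                   # 0.7 - 0.6 = 0.1
w = w + eta * e * x1         # delta rule -> [0.51, -0.29, 0.805]
print(u, e, w)
```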
ANN Examples
❑ A one-layer feedforward neural network is called the perceptron
❑ It can solve linear functions, e.g., AND, OR, NOT

AND function:
x1 x2 | y
 0  0 | 0
 1  0 | 0
 0  1 | 0
 1  1 | 1

y = f( b + Σ_{i=1}^{n} w_i x_i )

With b = −1.5 and w1 = w2 = 1:
y = step( −1.5 + 1·x1 + 1·x2 )

[Figure: the four input points in the (x1, x2) plane; a straight decision boundary separates (1,1) from the other three points.]
ANN Examples (cont’d)
❑ A one-layer feedforward neural network is called the perceptron
❑ It can solve linear functions, e.g., AND, OR, NOT

OR function:
x1 x2 | y
 0  0 | 0
 1  0 | 1
 0  1 | 1
 1  1 | 1

With b = −0.5 and w1 = w2 = 1:
y = step( −0.5 + 1·x1 + 1·x2 )

[Figure: the four input points in the (x1, x2) plane; a straight decision boundary separates (0,0) from the other three points.]
ANN Examples (cont’d)

OR:  y = step( −0.5 + 1·x1 + 1·x2 )        AND:  y = step( −1.5 + 1·x1 + 1·x2 )

[Figure: the OR and AND truth tables and their linear decision boundaries in the (x1, x2) plane.]

❑ Solving linearly means the decision boundary is linear (a straight line in 2D and a plane in 3D)
❑ The bias term (b) alters the position, but not the orientation, of the decision boundary
❑ The weights (w1, w2, ..., wm) determine the gradient (orientation) of the boundary
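A small sketch can verify that the two weight/bias choices above realize AND and OR with a step activation (b = −0.5 is used for OR so that the input (0, 0) falls on the 0 side of the boundary):

```python
def step(u):
    return 1 if u >= 0 else 0

def perceptron(x1, x2, b, w1=1.0, w2=1.0):
    return step(b + w1 * x1 + w2 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        y_and = perceptron(x1, x2, b=-1.5)   # fires only for (1, 1)
        y_or  = perceptron(x1, x2, b=-0.5)   # fires whenever at least one input is 1
        print(x1, x2, "AND:", y_and, "OR:", y_or)
```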
ANN Examples: XOR function

XOR function:
x1 x2 | y
 0  0 | 0
 1  0 | 1
 0  1 | 1
 1  1 | 0

[Figure: the four input points in the (x1, x2) plane with the AND and OR decision lines; no single straight line separates the two XOR classes.]

❑ The XOR function is said to be not linearly separable
❑ If one neuron defines one line through the input space, what do we need in order to have two lines?
❑ We need two neurons working in parallel (next to each other rather than in different layers)
❑ We would need a multilayer neural network to model (i.e., to separate the two classes of) the XOR function
Multilayer Perceptron (MLP)
❑ More layers are added between the input and the output layers
❑ The layers are fully connected
❑ There can be multiple neurons at the output layer:
  y_j , j ∈ C , where C is the set of all neurons at the output layer
❑ Error backpropagation is used for learning:
  e(n) = d(n) − y(n)
❑ Weight adjustments are applied so as to minimize e(n) in a statistical sense

[Figure: an MLP with an input layer, Hidden Layer 1, Hidden Layer 2, and an output layer producing y1 and y2.]
Gradient Descent
❑ The delta rule is a gradient descent learning rule for updating the weights of the inputs to an artificial
neuron in a single-layer NN:

w_kj(n+1) = w_kj(n) + Δw_kj(n)

❑ The goal of gradient descent is to iteratively take steps towards lower regions (minima) of the loss function.

Image sources: https://datascience-enthusiast.com/figures/cost.jpg and https://medium.com/@divakar_239/stochastic-vs-batch-gradient-descent-8820568eada1
Gradient Descent (cont’d)
For a linear activation function, the weight adjustment for a neuron k is given by

Δw_kj(n) = η · e_k(n) · x_j(n) ,   j = 1, 2, ..., m

For any activation function f:

Δw_kj(n) = η · e_k(n) · f′(u(n)) · x_j(n)

where u(n) = b + Σ_{j=1}^{m} w_j x_j is the combined input of the neuron.
Gradient Descent (cont’d)
Gradient descent minimizes the cost by stepping against the gradient:

Δw_kj = −η · ∂E/∂w_j

By applying the chain rule:

∂E/∂w_j = (∂E/∂e) · (∂e/∂y) · (∂y/∂u) · (∂u/∂w_j)

E(n) = (1/2) e²(n)           ⇒  ∂E/∂e = e
e(n) = d(n) − y(n)           ⇒  ∂e/∂y = −1
y(n) = f(u(n))               ⇒  ∂y/∂u = f′(u(n))
u(n) = Σ_{j=1}^{m} w_j x_j   ⇒  ∂u/∂w_j = x_j

Therefore:

Δw_kj = −η · e · (−1) · f′(u(n)) · x_j = η · e · f′(u(n)) · x_j

Image source: https://medium.com/@divakar_239/stochastic-vs-batch-gradient-descent-8820568eada1
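A sketch of the resulting update Δw_j = η·e·f′(u)·x_j for a single neuron; a sigmoid activation is assumed here purely for illustration, so that f′(u) = f(u)(1 − f(u)):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gradient_step(w, b, x, d, eta=0.1):
    """One gradient-descent step on E = 0.5*e^2 using the chain rule."""
    u = b + np.dot(w, x)
    y = sigmoid(u)
    e = d - y
    f_prime = y * (1.0 - y)           # dy/du for the sigmoid
    w = w + eta * e * f_prime * x     # dE/dw_j = -e * f'(u) * x_j, so step against the gradient
    b = b + eta * e * f_prime
    return w, b

w, b = np.zeros(3), 0.0
w, b = gradient_step(w, b, x=np.array([1.0, 1.0, 0.5]), d=0.7)
print(w, b)
```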
Backpropagation
❑ Backpropagation is a supervised algorithm that is a generalization of the least mean square (LMS)
algorithm
❑ It is based on a gradient search technique to minimize the cost function, i.e., the squared error
between the network output and the target output
❑ It is a recursive application of the chain rule to compute the gradients
➢ Forward pass: propagate the activations from input to output ≡ compute the outputs y_j
➢ Backward pass: propagate the error from the output back to the hidden layers ≡ adjust all the weights

Please see the following for all the details of the mathematical derivation:
https://www.jeremyjordan.me/neural-networks-training/
Backpropagation (cont’d)
❑ The weights of each output neuron can be determined directly using the delta learning rule:

Δw_ki = η · e · f′(·) · z_i ,   with the local gradient (error signal)  δ_k = e · f′(·)

where z_i is the output of the i-th hidden neuron feeding the output neuron through the weight w_ki.
Backpropagation (cont’d)
❑ For an output neuron, the delta rule above applies directly.
❑ If the neuron is a hidden node, its local gradient combines its own derivative with the gradients of the
next layer:

δ_j = f′(·) · Σ_{k=1}^{K} δ_k · w_k        [local gradient] × [upstream gradient]

where K is the set of all nodes in the next layer connected to the current neuron.

Please see the following for all the details of the mathematical derivation:
https://www.jeremyjordan.me/neural-networks-training/
Backpropagation Example
❑ Assume one input layer, one hidden layer, and one output neuron
  x_j : the j-th input
  z_i : the output of the i-th hidden neuron
  y_k : the output of the k-th output neuron
  β_ij : the weight from input node x_j to hidden node z_i
  w_ki : the weight from hidden node z_i to output neuron y_k

[Figure: inputs x_1, x_2 connected through weights β_11, β_12, ... to hidden neurons z_1, z_2, which are
connected through weights w_11, w_12 to the output neuron y_k.]

❑ The weights of the output neuron can be adjusted using the delta learning rule and the error signal:

δ_yk = e_k · f′(u_k) = (d_k − y_k) · f′(u_k) ,   with  u_k = Σ_{i=1}^{I} w_ki z_i

❑ Update the weights as follows:

w_ki(n+1) = w_ki(n) + η · δ_yk · z_i
Backpropagation (cont’d)
❑ The weights of the i-th hidden neuron can be adjusted using its own error signal:

δ_zi = f′(u_i) · Σ_{k=1}^{K} δ_yk · w_ki ,   with  u_i = Σ_{j=1}^{J} β_ij x_j

❑ Using these error signals, the weights of the i-th hidden neuron can be updated:

β_ij(n+1) = β_ij(n) + η · δ_zi · x_j

❑ For a sigmoid activation with zero bias:

f′(u_k) = y_k (1 − y_k)
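Putting the output-layer and hidden-layer rules together, the sketch below performs one forward and one backward pass for the example network (one hidden layer, one output neuron, sigmoid activations, zero biases as on the slides); the network size and sample values are illustrative:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(x, d, beta, w, eta=0.5):
    """One forward + backward pass: beta is the (I x J) input->hidden matrix, w the (I,) hidden->output vector."""
    # forward pass: propagate activations from input to output
    z = sigmoid(beta @ x)             # hidden outputs z_i
    y = sigmoid(np.dot(w, z))         # output y_k

    # backward pass: propagate the error back to the hidden layer
    delta_y = (d - y) * y * (1 - y)           # output error signal: e_k * f'(u_k)
    delta_z = z * (1 - z) * delta_y * w       # hidden error signals: f'(u_i) * sum_k delta_yk * w_ki

    # weight updates (delta rule with the local gradients)
    w    = w    + eta * delta_y * z           # w_ki(n+1) = w_ki(n) + eta * delta_yk * z_i
    beta = beta + eta * np.outer(delta_z, x)  # beta_ij(n+1) = beta_ij(n) + eta * delta_zi * x_j
    return beta, w, y

rng = np.random.default_rng(0)
beta = rng.uniform(-1, 1, size=(2, 2))   # 2 hidden neurons, 2 inputs
w    = rng.uniform(-1, 1, size=2)
beta, w, y = backprop_step(x=np.array([1.0, 0.0]), d=1.0, beta=beta, w=w)
print(y)
```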
Types of Neural Networks

❑ Feedforward neural network
➢ Signals travel one way only (from input to output)
➢ Typically trained with a teacher (supervised learning)

❑ Recurrent neural network (RNN)
➢ The output from the previous step is fed back as an input at the current step

❑ Learning without a teacher (unsupervised learning)
➢ Example: self-organizing maps (SOM)
ANN Design and Issues
❑ Number of neurons and hidden layers
❑ Initial weights (small random values ∈ [−1, 1])
❑ Choice of the transfer function
❑ Learning rate
❑ Weight adjustment
❑ Data representation, pre-processing, and splitting
Learning Rate
❑ The learning rate, η, is a configurable (hyper)parameter used in ANN training
❑ η controls how quickly the model is adapted to the problem
❑ Its practical value is 0 < η < 1
➢ Smaller η → smaller changes to w → more training epochs
  • Training can get stuck in a local minimum.
➢ Larger η → larger changes to w → fewer training epochs
  • May result in divergence.

Graph source: https://cs231n.github.io/neural-networks-3/
Learning Rate (cont’d)
[Figure: loss curves for too-small, good, and too-large learning rates.]
Graph sources: https://towardsdatascience.com/the-learning-rate-finder-6618dfcb2025 and
https://srdas.github.io/DLBook/GradientDescentTechniques.html

One technique that can help the network out of local minima is the use of a momentum term:

Δw_kj(n) = η · δ_k(n) · x_j(n) + α · Δw_kj(n−1)

where α is the momentum factor and Δw_kj(n−1) is the weight increment from the previous iteration.
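A sketch of the momentum-augmented update, which keeps the previous increment Δw_kj(n−1) between calls; the value α = 0.9 is a common illustrative choice, not one given in the slides:

```python
import numpy as np

def momentum_update(w, delta_k, x, prev_dw, eta=0.1, alpha=0.9):
    """dw(n) = eta * delta_k(n) * x(n) + alpha * dw(n-1)."""
    dw = eta * delta_k * x + alpha * prev_dw   # current increment plus a fraction of the previous one
    return w + dw, dw                          # updated weights and the increment to store for next time

w       = np.zeros(3)
prev_dw = np.zeros(3)
x       = np.array([1.0, 1.0, 0.5])
w, prev_dw = momentum_update(w, delta_k=0.1, x=x, prev_dw=prev_dw)
print(w, prev_dw)
```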
Learning Rate (cont’d)
[Figure: analysis of gradient descent convergence for different learning rates.]
Graph source: https://towardsai.net/p/machine-learning/analysis-of-learning-rate-in-gradient-descent-algorithm-using-python
Overfitting
[Figure: three decision boundaries in the (x1, x2) plane illustrating Underfitting, a Good Model, and Overfitting.]

❑ A good model correctly classifies test patterns it has never seen (learned) before when tested on a
real-world problem
❑ An overfitted model can NOT be generalized

Solutions
➢ Early stopping
➢ Regularization (Dropout)

Image source: https://www.pinterest.com/pin/462604192955327068/
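Since the assignments later use Keras, one concrete way to apply the first remedy is Keras's built-in EarlyStopping callback, sketched below (monitoring the validation loss is an assumption; dropout would be added as a keras.layers.Dropout layer inside the model):

```python
from tensorflow import keras

# Stop training once the validation loss has not improved for 10 consecutive epochs,
# and restore the best weights seen so far (early stopping).
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=10,
                                           restore_best_weights=True)

# Illustrative usage (model, X_train and y_train are assumed to exist):
# model.fit(X_train, y_train, validation_split=0.2, epochs=500, callbacks=[early_stop])
```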


Vanishing Gradient
❑ Deeper neural networks (i.e., with multiple hidden layers) are difficult to train (the difficulty increases
geometrically with depth).

δ_j = f′(·) · Σ_{k=1}^{K} δ_k · w_k        [local gradient] × [upstream gradient]

➢ The gradients get smaller and smaller while backpropagating the error.
➢ After a few layers of propagation, the gradient disappears (vanishes).
➢ The parameters in the deep layers then remain almost static.

❑ Solutions
➢ Modify the activation function
➢ Use batch normalization (a sort of regularization)
ANN Advantages and Disadvantages
❑ Advantages
➢ Very simple principles
➢ Highly parallel: information processing is much more like the brain than a serial
computer
➢ Adapt to unknown situations, can model complex functions
➢ Ease of use, learns by example, and very little user domain‐specific expertise needed.

❑ Disadvantages
➢ Very complex behaviors
➢ Not exact.
➢ Needs training.

ANN Terminology
❑ Neuron, unit (node)
❑ Weight and bias
❑ Transfer function (linear, sigmoid, ReLU, etc.)
❑ Loss function (mean squared error, cross entropy, etc.)
❑ Learning rate, epoch, batch
❑ Backpropagation (error propagation)
❑ Optimization (gradient descent (GD), stochastic GD, Adam, etc.)
❑ Overfitting
❑ Dropout, batch normalization
Each ANN aspect is considered a standalone research area.
Validation Techniques
Data Splitting

❑ Training/Testing
  Total # of samples → Training (70% to 75%) | Testing (25% to 30%)

❑ Training/Validation/Testing
  Total # of samples → Training (60% or 70%) | Validation (20% or 15%) | Testing (20% or 15%)
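A sketch of the two splitting schemes using scikit-learn's train_test_split; scikit-learn and the random data are assumptions for illustration, not part of the lecture:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(2000, 3)   # illustrative feature matrix (2000 samples, 3 features)
y = np.random.rand(2000)      # illustrative targets

# Training/Testing split: 75% / 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Training/Validation/Testing split: 60% / 20% / 20% (split the held-out 40% in half)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.40, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)
```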
Validation Techniques
Random Sample Selection

[Figure: from the total number of samples, a different random subset of test samples is drawn for each of
the experiments #1, #2, ..., #k.]

➢ k is the number of experiments
➢ E_i is the average error of experiment i, computed using only its testing data
Validation Techniques
Cross Validation
Divide the data into mutually exclusive and equal-sized subsets (folds); the number of folds is called K.

Example with K = 4:  Part #1 | Part #2 | Part #3 | Part #4

➢ K is the number of folds
➢ E_i is the average error for fold i
Validation Techniques
Cross Validation (cont’d): each part is used exactly once for testing

Fold 1:  Testing  | Training | Training | Training
Fold 2:  Training | Testing  | Training | Training
Fold 3:  Training | Training | Testing  | Training
Fold 4:  Training | Training | Training | Testing
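A sketch of the K = 4 rotation above using scikit-learn's KFold (an assumed library choice); each part is used exactly once for testing, and the per-fold errors E_i are averaged. A mean predictor is used here as a stand-in for a trained model:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(100, 3)    # illustrative data
y = np.random.rand(100)

fold_errors = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    # Stand-in "model": predict the mean of the training targets.
    y_pred = y[train_idx].mean()
    # E_i: mean squared error on the held-out part (the testing fold).
    fold_errors.append(np.mean((y[test_idx] - y_pred) ** 2))

print("average error over the K folds:", np.mean(fold_errors))
```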
Assignments
❑ Assignment 1: Design your own simple ANN (one perceptron with one input layer and one output
neuron). Use the data points listed in the table below as your training data. Assume the activation
function is a sigmoid and that there is no bias for simplicity (b = 0). Test your design using different
iteration numbers.

X1 X2 X3 | d
 0  0  1 | 0
 0  1  1 | 1
 1  0  1 | 1
 1  1  1 | 0

❑ Assignment 2: Modify the above-designed code to implement a multi-layer perceptron, MLP (an
ANN with one input layer, one hidden layer and one output layer), for the same data points above.
Assume a sigmoid activation function and no bias for simplicity (b = 0). Test your approach using
different iteration numbers and different numbers of nodes for the hidden layer (e.g., 4, 8, and 16).
Assignments (cont’d)
❑ Assignment 3: Use the Keras library (tensorflow.keras) to build different ANNs using different
numbers of hidden layers (shallow: one hidden layer plus the output layer; deeper: two hidden layers
with 12 and 8 nodes respectively; deeper still: three hidden layers with 32, 16 and 8 nodes respectively).
Use the provided diabetic data sets (here) to train and test your design. Use the ReLU activation for the
hidden layers and the sigmoid activation for the output neuron, with loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy'], and epochs = 150.
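As a starting point only (not a complete solution), a sketch of the shallow variant with the settings listed above; the hidden-layer size of 12 and the input dimension of 8 are assumptions to be adapted to the provided data set:

```python
from tensorflow import keras

# Shallow variant: one hidden layer (ReLU) plus one sigmoid output neuron.
model = keras.Sequential([
    keras.layers.Dense(12, activation="relu", input_shape=(8,)),  # hidden-layer size and input
                                                                  # dimension are assumptions
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Illustrative usage once the diabetic data set has been loaded into X_train, y_train:
# history = model.fit(X_train, y_train, epochs=150)
```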

❑ Assignment 4: Redo Assignment 3 using 80% of the data for training and 20% of the data for testing.
Also, plot the training accuracy and loss curves for your designed networks.
Thank You
&
Questions
