
Pattern Classification

11. Backpropagation & Time-Series Forecasting

AbdElMoniem Bayoumi, PhD

Spring 2022
Recap: Multi-Layer Networks
[Figure: a feed-forward network; the inputs feed layer 1, whose outputs feed layer 2, and so on up to the output layer L, with every node applying the activation f.]

• Inputs: $U_i = Y_i^{(0)}$ (input layer, i.e., layer 0)
• Layer $l$ computes $X_i^{(l)}$ from the previous layer's outputs through the weights $w_{ij}^{(l-1)}$ and outputs $Y_i^{(l)}$
• The output layer is layer $L$, with outputs $Y_i^{(L)}$

• $N(l)$ is the number of nodes in layer $l$

2
Recap: Gradient Descent
• It can be shown that the negative direction
of the gradient gives the steepest descent
[Figure: two error curves $E$ versus $W$; to the left of the minimum $W^*$ the slope $\frac{\partial E}{\partial W}$ is negative, to the right it is positive, so stepping against the gradient always moves toward $W^*$.]

• When we approach the min, the steps
become very small because close to the
min we find $\frac{\partial E}{\partial W} \approx 0$

3
Back propagation Algorithm
• It is an algorithm based on the steepest
descent concept

• Used to train a general multi-layer network

4
Back propagation Algorithm
• $X_i^{(l)} = \sum_{j=1}^{N(l-1)} w_{ij}^{(l-1)} Y_j^{(l-1)}$   (output of hidden node $i$ before applying the activation function)

• $Y_i^{(l)} = f\!\left(X_i^{(l)}\right)$   (output of layer $l$)

5
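A minimal NumPy sketch of these two equations for one layer of the forward pass (names are illustrative; the sigmoid is assumed here as the activation $f$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(Y_prev, W):
    """One layer of the forward pass.

    Y_prev: outputs Y^(l-1) of the previous layer, shape (N(l-1),).
    W: weights w^(l-1), shape (N(l), N(l-1)).
    Returns X^(l) (before activation) and Y^(l) = f(X^(l)).
    """
    X = W @ Y_prev          # X_i^(l) = sum_j w_ij^(l-1) Y_j^(l-1)
    Y = sigmoid(X)          # Y_i^(l) = f(X_i^(l))
    return X, Y

# usage sketch: propagate an input U = Y^(0) through two layers
rng = np.random.default_rng(0)
U = np.array([0.5, -1.0, 2.0])
W0 = rng.uniform(-1, 1, size=(4, 3))   # layer 0 -> 1
W1 = rng.uniform(-1, 1, size=(2, 4))   # layer 1 -> 2
_, Y1 = layer_forward(U, W0)
_, Y2 = layer_forward(Y1, W1)          # network outputs Y^(L)
```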
Back propagation Algorithm
• Error, i.e., cost fn.:  $E = \frac{1}{M} \sum_{m=1}^{M} E_m$

• Loss (i.e., for regression):  $E_m = \sum_{i=1}^{N(L)} \left( Y_i^{(L)}(m) - d_i(m) \right)^2$

• $Y_i^{(L)}(m)$ ≡ $i$-th output of the NN for the training pattern $m$

• $d_i(m)$ ≡ target output
• $N(L)$ ≡ no. of outputs (i.e., nodes of the output layer)

• We need to compute the gradient, i.e., $\frac{\partial E}{\partial w_{ij}^{(l)}}$

6
Chain Rule
• 𝑌 = 𝑓 𝑦1 , 𝑦2 , 𝑦3

• 𝑦1 = 𝑔1 (𝑍) , 𝑦2 = 𝑔2 (𝑍), 𝑦3 = 𝑔3 (𝑍)

[Figure: $Z$ feeds $y_1 = g_1(Z)$, $y_2 = g_2(Z)$ and $y_3 = g_3(Z)$, which all feed $Y = f(y_1, y_2, y_3)$.]

• $\frac{\partial Y}{\partial Z} = \frac{\partial Y}{\partial y_1}\frac{\partial y_1}{\partial Z} + \frac{\partial Y}{\partial y_2}\frac{\partial y_2}{\partial Z} + \frac{\partial Y}{\partial y_3}\frac{\partial y_3}{\partial Z}$

7
Back propagation Algorithm
• $E_m = \sum_{i=1}^{N(L)} \left( Y_i^{(L)}(m) - d_i(m) \right)^2$

• $Y_i^{(L)} = f\!\left(X_i^{(L)}\right)$

• $X_i^{(L)} = \sum_{j=1}^{N(L-1)} w_{ij}^{(L-1)} Y_j^{(L-1)}$

Chain rule!
$\frac{\partial E_m}{\partial w_{IJ}^{(L-1)}} = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot \frac{\partial X_I^{(L)}}{\partial w_{IJ}^{(L-1)}}$

$\qquad\quad = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot \frac{\partial}{\partial w_{IJ}^{(L-1)}} \sum_{j=1}^{N(L-1)} w_{Ij}^{(L-1)} Y_j^{(L-1)}$

$\qquad\quad = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot Y_J^{(L-1)}$

• Note that the other $X_i^{(L)}$'s are not taken into account, because they
do not depend on $w_{IJ}^{(L-1)}$ at all

8
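Putting the three factors together for the squared-error loss $E_m$ above (a worked step, assuming a differentiable activation $f$ with derivative $f'$):

$\frac{\partial E_m}{\partial Y_I^{(L)}} = 2\left( Y_I^{(L)}(m) - d_I(m) \right), \qquad \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} = f'\!\left( X_I^{(L)} \right)$

so that

$\frac{\partial E_m}{\partial w_{IJ}^{(L-1)}} = 2\left( Y_I^{(L)}(m) - d_I(m) \right) f'\!\left( X_I^{(L)} \right) Y_J^{(L-1)}$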
Back propagation Algorithm
• To get $\frac{\partial E_m}{\partial w_{ij}^{(l)}}$ for any general layer $l$:

[Figure: node $J$ in layer $l$ feeds node $I$ in layer $l+1$ through the weight $w_{IJ}^{(l)}$; the output $Y_I^{(l+1)}$ then feeds all nodes $X_1^{(l+2)}, X_2^{(l+2)}, \ldots$ of layer $l+2$, which eventually determine $E_m$.]

$\frac{\partial E_m}{\partial w_{IJ}^{(l)}} = \frac{\partial E_m}{\partial X_I^{(l+1)}} \cdot \frac{\partial X_I^{(l+1)}}{\partial w_{IJ}^{(l)}} = \frac{\partial E_m}{\partial X_I^{(l+1)}} \cdot Y_J^{(l)}$

$\frac{\partial E_m}{\partial X_I^{(l+1)}} = \sum_{i=1}^{N(l+2)} \frac{\partial E_m}{\partial X_i^{(l+2)}} \cdot \frac{\partial X_i^{(l+2)}}{\partial X_I^{(l+1)}} = \sum_{i=1}^{N(l+2)} \frac{\partial E_m}{\partial X_i^{(l+2)}} \cdot \frac{\partial X_i^{(l+2)}}{\partial Y_I^{(l+1)}} \cdot \frac{\partial Y_I^{(l+1)}}{\partial X_I^{(l+1)}}$
9
Back propagation Algorithm
1. Initialize all weights to small randomly chosen
values, e.g. [-1,1]
2. Let u(m) & d(m) be the training input/output
examples
3. For m=1 to M:
i. Present u(m) to the network and compute the
hidden layer outputs and final layer outputs
ii. Present u(m) to the network and use the resulting outputs in a
backward scheme to compute the partial derivatives of the error fn.
w.r.t. the weights of each layer
iii. Update weights: $w_{ij}^{(l)}(\text{new}) = w_{ij}^{(l)}(\text{old}) - \alpha \frac{\partial E_m}{\partial w_{ij}^{(l)}}$
4. Compute total error (stop in case of
convergence)

[Figure: the weight $w_{ij}$ connects node $j$ to node $i$.]

Note: $l$ refers to layer $l$                                   10
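A minimal NumPy sketch of steps 1–4 with sequential (per-pattern) updates, assuming one hidden layer, sigmoid activations, explicit thresholds (biases), and the squared-error loss $E_m$ defined earlier; all names and the toy XOR example are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(U, D, n_hidden=4, alpha=0.5, epochs=5000):
    """Sequential backpropagation for a network with one hidden layer.

    U: (M, N_in) training inputs u(m); D: (M, N_out) targets d(m).
    """
    rng = np.random.default_rng(0)
    n_in, n_out = U.shape[1], D.shape[1]
    # 1. initialize all weights to small random values, e.g. in [-1, 1]
    W0, b0 = rng.uniform(-1, 1, (n_hidden, n_in)), np.zeros(n_hidden)
    W1, b1 = rng.uniform(-1, 1, (n_out, n_hidden)), np.zeros(n_out)

    for _ in range(epochs):
        total_error = 0.0
        # 3. for m = 1..M: present u(m) and update after each pattern
        for u, d in zip(U, D):
            # i. forward pass: hidden-layer and output-layer outputs
            Y1 = sigmoid(W0 @ u + b0)
            Y2 = sigmoid(W1 @ Y1 + b1)                    # network outputs Y^(L)

            # ii. backward pass: dE_m/dX at each layer, E_m = sum (Y - d)^2
            delta2 = 2.0 * (Y2 - d) * Y2 * (1.0 - Y2)     # output layer
            delta1 = (W1.T @ delta2) * Y1 * (1.0 - Y1)    # hidden layer

            # iii. gradient-descent weight update: w -= alpha * dE_m/dw
            W1 -= alpha * np.outer(delta2, Y1); b1 -= alpha * delta2
            W0 -= alpha * np.outer(delta1, u);  b0 -= alpha * delta1

            total_error += np.sum((Y2 - d) ** 2)

        # 4. total error over all patterns; stop in case of convergence
        if total_error / len(U) < 1e-3:
            break
    return W0, b0, W1, b1

# illustrative usage: learn XOR
U = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
params = train_backprop(U, D)
```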


Disadvantages of Back propagation

• Can often be slow in reaching the min (i.e.,
sometimes tens of thousands of iterations)
– Especially close to min
– Too small 𝛼 → very small steps & slow to reach
min
– Too large 𝛼 → leads to oscillations & possibly
not converging at all
– Use variable 𝛼 (start large then decrease it)

11
Disadvantages of Back propagation

• Prone to get stuck in local minima


– This problem could be alleviated to some
extent by repeating the training many times,
each time from a different set of initial weights

12
Types of Weight Update
• Batch or epoch update, i.e., Gradient
Descent:
– Present full set of training examples (batch of
examples)
– Compute the error of each example
– Compute gradient of the batch (based on cost
function of the whole batch)
– Update weights based on this batch gradient
– Do another iteration … and so on
– Advantages:
• Optimization is more consistent
– Disadvantages:
• Slow (too long per iteration)

13
Types of Weight Update
• Sequential update, i.e., Stochastic Gradient Descent:
– Present a training pattern, then update the weights
(according to $\frac{\partial E_m}{\partial w}$), then present the next one … and so on

– After finishing all patterns, do another iteration starting from
m=1

– Advantages:
• Faster compared to gradient descent, i.e., full-batch
– Disadvantages:
• Hard to converge: “stochastic” since it depends on every
single example; however, in practice being close to
minimum is reasonably good
• Lose the speedup from vectorization

– In practice for large datasets SGD is preferred to GD

14
Types of Weight Update
• Mini-Batch:
– Present subset of training examples (mini-
batch of examples)
– Compute the error of each example
– Compute gradient of the mini-batch (based on
cost function of this mini-batch)
– Update weights based on this mini-batch
gradient
– Move to another mini-batch & after finishing all
mini-batches do another iteration … and so on
– Advantages:
• Fast

15
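A small sketch of the mini-batch scheme; the `grad_fn`, `W`, `X`, `D`, and `alpha` names are placeholders for whatever model and gradient computation is being trained:

```python
import numpy as np

def minibatch_indices(M, batch_size, rng):
    """Yield index arrays covering all M examples in shuffled mini-batches."""
    order = rng.permutation(M)
    for start in range(0, M, batch_size):
        yield order[start:start + batch_size]

# usage sketch (assumes some grad_fn(W, X_batch, D_batch) returning dE/dW):
# rng = np.random.default_rng(0)
# for epoch in range(n_epochs):                      # one pass over all mini-batches
#     for idx in minibatch_indices(len(X), 32, rng):
#         W -= alpha * grad_fn(W, X[idx], D[idx])    # update per mini-batch gradient
```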
Other Optimization Algorithms
• Gradient descent with momentum
– Smooth-out the steps of the gradient descent
using a moving average of the derivatives
– Get faster learning in the intended direction &
avoid oscillations

[Figure: contour plot of E; momentum gives faster learning along the intended direction and slower oscillation across it.]

$D_W = \beta\, D_W + (1-\beta)\, \frac{\partial E}{\partial W}$   ($D_W = 0$ initially)

$W = W - \alpha\, D_W$
                                                                16
Other Optimization Algorithms
• RMSProp
– Slow-down learning in unintended directions
– Avoid oscillations
[Figure: contour plot with axes $w_i$ (small gradient $\frac{\partial E}{\partial w_i}$, the intended direction) and $w_j$ (large gradient $\frac{\partial E}{\partial w_j}$, the oscillating direction); RMSProp speeds up learning along $w_i$ and slows it down along $w_j$.]

$S_{w_i} = \beta\, S_{w_i} + (1-\beta)\left(\frac{\partial E}{\partial w_i}\right)^2$ ,   $S_{w_j} = \beta\, S_{w_j} + (1-\beta)\left(\frac{\partial E}{\partial w_j}\right)^2$

$w_i = w_i - \alpha\, \frac{\partial E / \partial w_i}{\sqrt{S_{w_i}} + \varepsilon}$ ,   $w_j = w_j - \alpha\, \frac{\partial E / \partial w_j}{\sqrt{S_{w_j}} + \varepsilon}$
                                                                17
Other Optimization Algorithms
• Adam (combines both RMSProp &
momentum)

$D_{w_i} = \beta_1\, D_{w_i} + (1-\beta_1)\, \frac{\partial E}{\partial w_i}$ ,   $S_{w_i} = \beta_2\, S_{w_i} + (1-\beta_2)\left(\frac{\partial E}{\partial w_i}\right)^2$

$w_i = w_i - \alpha\, \frac{D_{w_i}}{\sqrt{S_{w_i}} + \varepsilon}$

18
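A sketch of the Adam update exactly as written on the slide (i.e., without the bias-correction terms some implementations add); the toy quadratic in the usage part is illustrative:

```python
import numpy as np

def adam_update(w, grad, D, S, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (no bias correction) for a parameter array w.

    D and S hold the running first and second moments of the gradient.
    """
    D = beta1 * D + (1 - beta1) * grad            # momentum term
    S = beta2 * S + (1 - beta2) * grad ** 2       # RMSProp term
    w = w - alpha * D / (np.sqrt(S) + eps)        # combined update
    return w, D, S

# usage sketch on a toy quadratic E(w) = ||w||^2, so dE/dw = 2w
w = np.array([3.0, -2.0])
D, S = np.zeros_like(w), np.zeros_like(w)
for _ in range(500):
    grad = 2 * w
    w, D, S = adam_update(w, grad, D, S, alpha=0.05)
# w is now close to the minimum at the origin
```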
Regularization
• Used to prevent overfitting
– Intuition: set the weights of some hidden nodes to zero to simplify the
network, i.e., smaller network

• L2 regularization (aka weight decay):  $J = \frac{1}{M}\sum_{m=1}^{M} E_m + \frac{\lambda}{2M}\, \lVert W \rVert_2^2$
– $\lVert W \rVert_2^2 = \sum_j w_j^2 = W^T W$

• L1 regularization:  $J = \frac{1}{M}\sum_{m=1}^{M} E_m + \frac{\lambda}{2M}\, \lVert W \rVert_1$
– $\lVert W \rVert_1 = \sum_j |w_j|$

• $W$ is the weights vector; thresholds need not be included

• L2 regularization is used more often

• 𝜆 is the regularization parameter (hyper-parameter to be tuned)

19
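A small sketch of how the L2 term changes the gradient, following the cost $J$ above (the gradient of $\frac{\lambda}{2M}\lVert W\rVert_2^2$ is $\frac{\lambda}{M}W$, which is why L2 is also called weight decay); `grad_E` is a placeholder for the data-loss gradient:

```python
import numpy as np

def l2_regularized_grad(grad_E, W, lam, M):
    """Gradient of J = (1/M) sum_m E_m + (lambda/(2M)) ||W||^2 w.r.t. W.

    grad_E: gradient of the average data loss (1/M) sum_m E_m w.r.t. W.
    The extra (lambda/M) * W term shrinks the weights toward zero.
    """
    return grad_E + (lam / M) * W

# usage sketch with placeholder values:
# W -= alpha * l2_regularized_grad(grad_E, W, lam=0.01, M=1000)
```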
Dropout Regularization
• Used to prevent overfitting

• Intuitions:
– Eliminate some nodes to simplify the network
based on some probability, i.e., smaller network
– As if you train smaller networks on individual
training examples
– Cannot rely on any one feature, so spread weights

• For each layer set a dropout probability
– Each node within that layer may get eliminated
based on that probability

20
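A minimal sketch of dropout at training time, assuming the common inverted-dropout convention (scaling by the keep probability so the expected activation is unchanged); names are illustrative:

```python
import numpy as np

def dropout_forward(Y, keep_prob, rng):
    """Randomly zero some of a layer's outputs Y during training.

    keep_prob = 1 - dropout probability for this layer; dividing by
    keep_prob keeps the expected activation unchanged.
    """
    mask = rng.random(Y.shape) < keep_prob   # nodes that survive this pass
    return (Y * mask) / keep_prob, mask

# usage sketch
rng = np.random.default_rng(0)
Y = np.array([0.2, 0.9, 0.5, 0.7])
Y_dropped, mask = dropout_forward(Y, keep_prob=0.8, rng=rng)
```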
Guidelines for Training
• Learning rate 𝛼 :
– Too small: convergence will be slow.
– Too large: we will oscillate around the
minimum.
– Some methods propose a varying rate, i.e.,
learning-rate decay.
– When learning does not go well, consider using a
smaller learning rate.

21
Learning Rate Decay
• Gradient descent with a small mini-batch size and a
decaying learning rate

[Figure. Source: Andrew Ng]

22
Input and Output Normalization
• Input and Output normalization
– Inputs have to be approximately in the range of
0 to 1 or -1 to 1

[Figure: example input variables with very different ranges, e.g., −1 to 1, −1000 to 1000, −2 to 2.]

– $x = \dfrac{u - u_{\min}}{u_{\max} - u_{\min}}$
– $x = \dfrac{u - \text{Mean}(u)}{\text{st dev}(u)}$

23
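A small sketch of the two normalization formulas above:

```python
import numpy as np

def minmax_normalize(u):
    """x = (u - u_min) / (u_max - u_min): maps inputs into the range 0 to 1."""
    return (u - u.min()) / (u.max() - u.min())

def standardize(u):
    """x = (u - Mean(u)) / st dev(u): zero mean, unit standard deviation."""
    return (u - u.mean()) / u.std()

# usage sketch on an input with range roughly -1000 to 1000
u = np.array([-1000.0, -250.0, 0.0, 500.0, 1000.0])
x1 = minmax_normalize(u)   # values in [0, 1]
x2 = standardize(u)        # values roughly in [-1.6, 1.4]
```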
Train/Dev/Test Partition
• Best practice:
– Training: 60%, Validation (Dev): 20% & Test: 20%
– In case of big data, e.g., 10^6, then 98%, 1%
& 1%

• Test set should be used only once, at the
very end of the design

24
Machine Learning Recipe
• Train the network and evaluate first on the training data
– If bias is high, i.e., underfitting (performance is bad on the training set
itself), then:
• Bigger network (more hidden nodes or more hidden layers) →
works most of the time
• Train longer → works sometimes
– Check for bias again and keep changing until a good bias is reached

• Check for variance, i.e., performance on Dev set


– If variance is high, i.e., overfitting (performance is bad on the validation
set), then:
• More data (if possible)
• Regularization
– Check again for bias first, then after that check for variance and so on
until you reach a good bias & good variance

• Search for a better NN architecture that better suits the problem
(sometimes may work)

25
Hyper-Parameters Tuning
• Learning rate 𝛼 1st in importance

• Momentum parameter 𝛽 ≈ 0.9


• Number of hidden nodes 2nd in importance

• Mini-batch size

• Num of layers 3rd in importance


• Learning-rate decay

• Adam parameters $\beta_1 \approx 0.9$, $\beta_2 \approx 0.999$, $\varepsilon \approx 10^{-8}$
(not likely to need changing!)

26
Tuning Process
• Try random values: don’t use a grid
– Better exploration of important parameters
– Consider the example on the board

• Coarse to fine scheme


– Focus more on good regions

• Use appropriate scale


– Do not sample uniformly
– Use logarithmic scale
– E.g., if the $\alpha$ range is [0.0001, 1], sampling uniformly on a linear
scale will give most of the weight to the values between 0.1 & 1;
on a logarithmic scale instead:
Sample $r \sim [-4, 0]$ uniformly, then $\alpha = 10^{r}$
($10^{-4} = 0.0001$, $10^{-3} = 0.001$, $10^{-2} = 0.01$, $10^{-1} = 0.1$, $10^{0} = 1$)
                                                                27
Tuning Process
• Use appropriate scale
– Another example: let the $\beta$ range be [0.9, 0.999]
– Sample from $1-\beta$, i.e., [0.1, 0.001], using a log scale:
Sample $r \sim [-3, -1]$ uniformly, then $1-\beta = 10^{r}$, i.e., $\beta = 1 - 10^{r}$
($10^{-1} = 0.1$, $10^{-2} = 0.01$, $10^{-3} = 0.001$)
– The sensitivity of $\beta$ as it approaches 1 has a huge impact on
the performance, i.e., momentum corresponds to averaging over
the last $\frac{1}{1-\beta}$ examples
• $\beta \in [0.900, 0.9005]$ → averaging over the last ~10 examples
• $\beta \in [0.999, 0.9995]$ → 1000 to 2000 examples

28
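A small sketch of sampling $\alpha$ and $\beta$ on a logarithmic scale as described on these two slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# learning rate: sample r ~ Uniform(-4, 0), then alpha = 10**r
r = rng.uniform(-4, 0, size=5)
alphas = 10.0 ** r              # candidates spread evenly across 1e-4 .. 1

# momentum parameter in [0.9, 0.999]: sample 1 - beta on a log scale
r = rng.uniform(-3, -1, size=5)
betas = 1.0 - 10.0 ** r         # beta = 1 - 10**r
```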
K-Fold Cross Validation
• For parameter tuning over the training data
• Apply K-fold validation to the training set (usually
k= 5).

• Example K=4:  $E_{VAL} = E_1 + E_2 + E_3 + E_4$

[Figure: the training set is split into 4 folds; in turn, each fold is used for validation while the other 3 are used for training, giving the errors $E_1, E_2, E_3, E_4$.]
• Repeat for every parameter value, minimize 𝑬𝑽𝑨𝑳

29
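A minimal sketch of K-fold validation for tuning; `train_and_eval` is a placeholder for training a model on the training folds and returning its error on the validation fold:

```python
import numpy as np

def kfold_indices(M, k, rng):
    """Split indices 0..M-1 into k roughly equal, shuffled folds."""
    return np.array_split(rng.permutation(M), k)

def kfold_error(X, D, k, train_and_eval, rng):
    """E_VAL = sum of the k validation errors, one per fold."""
    folds = kfold_indices(len(X), k, rng)
    e_val = 0.0
    for i in range(k):
        val_idx = folds[i]
        tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        e_val += train_and_eval(X[tr_idx], D[tr_idx], X[val_idx], D[val_idx])
    return e_val

# usage sketch: compute E_VAL for each candidate hyper-parameter value
# over the training data and keep the value that minimizes it.
```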
K-Fold Cross Validation
• Better than the conventional train-validation-test
split
– Just split into training and test (no need for a separate validation set)
– Not biased to the particular split between the training and
validation data

30
Multi-Class Classification
• E.g., an image is either of a cat, or dog, or
duck or otherwise

[Figure: a network with inputs $X_1, \ldots, X_N$; the nodes of the output layer produce P(cat|𝑿), P(dog|𝑿), …, P(other|𝑿).]
31
Multi-Class Classification
• Use sigmoid activation function in the
output only in case of binary classification,
i.e., two classes

• For multi-class classification use soft-max
regression:

[Figure: the same network; the output-layer nodes compute $Z_1^{[L]}, Z_2^{[L]}, \ldots$ before any activation, and soft-max turns them into P(cat|𝑿), P(dog|𝑿), …, P(other|𝑿).]
                                                                32
Multi-Class Classification
• For multi-class classification use soft-max regression:
[Figure: the output layer, with pre-activations $Z_1^{[L]}, Z_2^{[L]}, \ldots$ and soft-max outputs $y_1 = $ P(cat|𝑿), $y_2 = $ P(dog|𝑿), …, P(other|𝑿).]

• $z_i^{[L]}$ is the output of node $i$ at the output layer before applying
any activation function

• $y_i$ is the output after applying soft-max:  $y_i = \dfrac{e^{z_i^{[L]}}}{\sum_j e^{z_j^{[L]}}}$
                                                                33
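A small sketch of the soft-max formula (subtracting the maximum first is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(z):
    """Soft-max: y_i = exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z - np.max(z))   # stability trick; result is unchanged
    return e / e.sum()

# usage sketch: three output-layer pre-activations -> class probabilities
z = np.array([2.0, 1.0, 0.1])
p = softmax(z)      # sums to 1; the largest z gets the largest probability
```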
Convolutional Neural Networks (CNN)

• Mostly applied to imagery problems

• Layers extract features from input images,
e.g., edge detection

34
Convolutional Neural Networks (CNN)

35
Convolutional Neural Networks (CNN)

• Mostly applied to imagery problems

• Layers extract features from input images,
e.g., edge detection

– Convolution layer, i.e., filtering

– Pooling layer, i.e., reduce the input (avg or max)

– Fully connected layer, i.e., as in a multi-layer
NN, at the final layers

36
Vertical Edge Detection

Input (6×6)              Filter (3×3)       Output (4×4)

10 10 10  0  0  0
10 10 10  0  0  0         1  0 -1           0 30 30 0
10 10 10  0  0  0    *    1  0 -1     =     0 30 30 0
10 10 10  0  0  0         1  0 -1           0 30 30 0
10 10 10  0  0  0                           0 30 30 0
10 10 10  0  0  0

convolution

37
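A small sketch reproducing the slide's example with a plain "valid" convolution (no padding, no kernel flipping, as is usual in CNNs):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over the image, no padding."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# the slide's vertical-edge example: 6x6 image, 3x3 filter -> 4x4 output
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(conv2d_valid(image, kernel))   # every row is 0 30 30 0
```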
Convolutional Neural Networks (CNN)

[Figure: a multi-channel input convolved with a set of filters; each filter spans all input channels and produces one output channel.]

38
Max Pooling

Input (6×6)                       Output (3×3)

10 10 10  0  0  0
10 10 10  0  0  0                 10 10 0
10 10 10  0  0  0    max pool     10 10 0
10 10 10  0  0  0                 10 10 0
10 10 10  0  0  0
10 10 10  0  0  0

What about average pooling?

39
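A small pooling sketch; a 2×2 window with stride 2 is assumed here, which is consistent with the 6×6 → 3×3 example above:

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    """Max pooling: take the maximum over each size x size window."""
    H, W = image.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            win = image[r * stride:r * stride + size,
                        c * stride:c * stride + size]
            out[r, c] = win.max()       # use win.mean() for average pooling
    return out

# the slide's 6x6 input gives a 3x3 output whose rows are 10 10 0
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
print(max_pool(image))
```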
Global Max Pooling
[Example: two 6×6 feature maps are each reduced to a single value by global max pooling, their maximum: 10 for the map containing only 10s and 0s, and 30 for the map that also contains 30s.]

What about global average pooling?

40
Convolutional Neural Networks (CNN)

• Learn the filters’ parameters and the weights of the
fully connected layers

Source: Andrew Ng’s Lectures

41
Convolutional Neural Networks (CNN)

• Convolution leads to a smaller memory footprint
due to:
– Parameter sharing (compared to fully
connected layers)
– Sparsity of connections (each output
depends on a limited number of inputs)

42
Recurrent Neural Networks (RNN)

• Sequence models, e.g., speech recognition,
sentiment classification, … etc.

• Inputs & outputs can have different lengths
within the same dataset

43
Recurrent Neural Networks (RNN)

[Figure: an unrolled RNN; at each time step $t$ the cell takes the input $x^{<t>}$ and the previous activation $a^{<t-1>}$ (with $a^{<0>}$ = zeros) and produces the output $y^{<t>}$ and the activation $a^{<t>}$ passed to the next step.]

• $y^{<t>}$ is the output at one time step from a NN
• $a^{<t>}$ is an activation passed from one step to another, also from a NN
                                                                44
Autoencoder Network

Source: Lilian Weng’s Github blog

45
Autoencoder Network

• Unsupervised network

• Gives an embedding
– Better embeddings can be obtained with supervised training

46
Generative Adversarial Network (GAN)

• Create a generative model of artificial data

Source: Guy Ernest, AWS Amazon Blogs 47


Siamese Networks

Source: Guy Ernest, AWS Amazon Blogs

48
Siamese Networks

Source: Guy Ernest, AWS Amazon Blogs

49
Deep Learning
• Subset of machine learning

• Multi-layered neural networks

• Raw data, i.e., end-to-end solution

• Requires big data & high computational


power

50
Machine Learning vs Deep Learning

51
Machine Learning vs Deep Learning

Source: Hannes Schulz and Sven Behnke, 2012 52


Machine Learning vs Deep Learning

53
When to use Deep Learning?
• Big amount of data (expensive!)

• Availability of high computational power (expensive!)

• Lack of domain understanding

• Complex problems (vision, NLP, speech
recognition)

54
Scalability with Data Amount

55
Scalability with Data Amount

56
Potentials of AI

“If a typical person can do a mental task with
less than one second of thought, we can
probably automate it using AI either now or
in the near future.”
— Andrew Ng

Currently, there are some limitations!

57
Limitations of Deep Learning
• Lots of achievements in the vision field

• Not a magic tool!
– Lack of adaptability and generality compared to
the human-vision system
– Not able to build a general-intelligent machine

58
Limitations of Deep Learning

Source: Gartner Hype Cycle for AI, 2019 59


Limitations of Deep Learning
• Why cannot fit all real-world scenarios?

Source: Google

Source: Boston Dynamics 60


Limitations of Deep Learning
• Large amount of labeled data
– Impressive achievements correspond to
supervised learning
– Expensive!
– Sometimes experts & special equipment are
needed

61
Limitations of Deep Learning
• Datasets may be biased
– Deep networks become biased against rare
patterns
– Serious consequences in some real-world
applications (e.g., medical, automotive, … etc.)
– Researchers should consider synthetic
data generation to mitigate the unbalanced
representation of data

62
Limitations of Deep Learning
• Datasets may be biased

– Classification may be sensitive to viewpoint


• if one of the viewpoints is under-represented

63
Limitations of Deep Learning
• Sensitive to standard adversarial attacks
- Datasets are finite and just represent a fraction
of all possible images

- Add extra training, i.e., “adversarial training”

64
Limitations of Deep Learning
• Over-sensitive to changes in context
– Limited number of contexts in the dataset, e.g.,
a monkey in a jungle
– Combinatorial Explosion!

65
Limitations of Deep Learning
• Combinatorial Explosion
– Real-world images are combinatorially large

– Application dependent (e.g., medical imaging is
an exception)

– Considering compositionality may be a potential
solution

– Testing is challenging (consider worst-case
scenarios)

66
Limitations of Deep Learning
• Visual understanding is tricky
– Mirrors
– Sparse Information
– Physics
– Humor

• Unintended results from fitness functions

67
Machine Learning Model Selection
1. Categorize the problem:

– Input: supervised, unsupervised, … etc.

– Output: numerical → regression, class →
classification, set of input groups → clustering

68
Machine Learning Model Selection
2. Understand your data:
a) Analyze the data:
• Descriptive statistics
• Data visualization

b) Process the data:


• Pre-processing, cleansing, … etc.

c) Feature Engineering

69
Machine Learning Model Selection
3. Determine the possible algorithms:
– Based on categorization & data understanding

– May have a look at the literature

– Determine: desired accuracy, interpretability,
scalability, complexity, training & testing time,
runtime, … etc.

70
Machine Learning Model Selection
4. Implement Machine Learning Algorithms:
– Setup a pipeline
– Compare algorithms
– Select an evaluation criteria

5. Tune hyperparameters

71
Time Series Prediction
• Time series contains:
– Trend
– Seasonality

• De-seasonalization:
– Remove the seasonal periodicities

[Pipeline: TS → deseasonalize → predict → return back the seasonal
components]

72
How to deseasonalize?
• Removing the seasonal periodicities

• Usually seasonal cycle length is 12 months

[Figure: a seasonal time series $x(t)$ plotted against $t$; the seasonal cycle spans 12 months.]

73
How to deseasonalize?
• Obtain the average of the TS values over this (one-year) window:

$a(\text{year}) = \frac{1}{12} \sum_{\text{window}} x(t)$

• Normalization step:  $Z(i) = \dfrac{x(i)}{a(\text{year})}$

• Seasonal average ≡ avg of the $Z(i)$'s of the different years
for month $i$:

$u(i) = \dfrac{\sum_j Z_j(i)}{\#\text{years}}$

[Figure: $x(t)$ split into 12-month windows; $Z_1(2), Z_2(2), Z_3(2)$ are the normalized values of month 2 in years 1, 2 and 3.]
                                                                74
How to deseasonalize?
• Seasonal average ≡ avg of the $Z(i)$'s of the
different years for month $i$:

$u(i) = \dfrac{\sum_{j=1}^{\#\text{years}} Z_j(i)}{\#\text{years}}$

[Figure: the seasonal profile $u(i)$ plotted for months $i = 1, \ldots, 12$.]
75
Deseasonalization Step
• Divide time series value by the
corresponding seasonal average

$x_{\text{deseasonal}}(t) = \dfrac{x(t)}{u(\text{month}(t))}$

• After that, focus on predicting the trend

[Figure: the deseasonalized series $x_{\text{deseasonal}}(t)$ plotted against $t$.]
                                                                76
Recover Seasonality
• After trend prediction, seasonality can be
recovered via multiplication by the
corresponding seasonal average

77
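A minimal sketch of the whole deseasonalize/predict/reseasonalize pipeline for monthly data with a 12-month cycle and complete years; the function names and the synthetic series are illustrative:

```python
import numpy as np

def seasonal_averages(x):
    """u(i): average of the normalized values Z(i) of month i over all years.

    x is a 1-D array of monthly values whose length is a multiple of 12.
    """
    years = x.reshape(-1, 12)                 # one row per year (12-month window)
    a = years.mean(axis=1, keepdims=True)     # a(year) = (1/12) * sum over the window
    Z = years / a                             # Z(i) = x(i) / a(year)
    return Z.mean(axis=0)                     # u(i), one value per month

def deseasonalize(x):
    """x_deseasonal(t) = x(t) / u(month(t))."""
    u = seasonal_averages(x)
    months = np.arange(len(x)) % 12
    return x / u[months], u

def reseasonalize(trend_prediction, u, months):
    """Recover seasonality by multiplying by the corresponding seasonal average."""
    return trend_prediction * u[months]

# usage sketch on 3 years of synthetic monthly data (trend + seasonality)
t = np.arange(36)
x = (100 + 2 * t) * (1 + 0.3 * np.sin(2 * np.pi * t / 12))
x_des, u = deseasonalize(x)                 # roughly the trend 100 + 2t
x_back = reseasonalize(x_des, u, t % 12)    # approximately recovers x
```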
Acknowledgment
• These slides have been created relying on
lecture notes of Prof. Dr. Amir Atiya

78
