
Pattern Classification

11. Backpropagation & Time-Series Forecasting

AbdElMoniem Bayoumi, PhD

Spring 2022
Recap: Multi-Layer Networks
[Figure: a feed-forward network; the inputs feed layer 1, whose outputs feed layer 2, and so on up to the output layer L, with every node applying the activation f.]

• Inputs: $U_i = Y_i^{(0)}$ (input layer, i.e., layer 0)
• Layer $l$ computes $X_i^{(l)}$ from the previous layer's outputs through the weights $w_{ij}^{(l-1)}$ and outputs $Y_i^{(l)}$
• The output layer is layer $L$, with outputs $Y_i^{(L)}$

• $N(l)$ is the number of nodes in layer $l$

2
Recap: Gradient Descent
• It can be shown that the negative direction
of the gradient gives the steepest descent
[Figure: two error curves $E$ versus $W$; to the left of the minimum $W^*$ the slope $\frac{\partial E}{\partial W}$ is negative, to the right it is positive, so stepping against the gradient always moves toward $W^*$.]

• When we approach the min, the steps
become very small because close to the
min we find $\frac{\partial E}{\partial W} \approx 0$

3
Back propagation Algorithm
• It is an algorithm based on the steepest
descent concept

• Used to train a general multi-layer network

4
Back propagation Algorithm
• $X_i^{(l)} = \sum_{j=1}^{N(l-1)} w_{ij}^{(l-1)} Y_j^{(l-1)}$   (output of hidden node $i$ before applying the activation function)

• $Y_i^{(l)} = f\!\left(X_i^{(l)}\right)$   (output of layer $l$)

5
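A minimal NumPy sketch of these two equations for one layer of the forward pass (names are illustrative; the sigmoid is assumed here as the activation $f$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(Y_prev, W):
    """One layer of the forward pass.

    Y_prev: outputs Y^(l-1) of the previous layer, shape (N(l-1),).
    W: weights w^(l-1), shape (N(l), N(l-1)).
    Returns X^(l) (before activation) and Y^(l) = f(X^(l)).
    """
    X = W @ Y_prev          # X_i^(l) = sum_j w_ij^(l-1) Y_j^(l-1)
    Y = sigmoid(X)          # Y_i^(l) = f(X_i^(l))
    return X, Y

# usage sketch: propagate an input U = Y^(0) through two layers
rng = np.random.default_rng(0)
U = np.array([0.5, -1.0, 2.0])
W0 = rng.uniform(-1, 1, size=(4, 3))   # layer 0 -> 1
W1 = rng.uniform(-1, 1, size=(2, 4))   # layer 1 -> 2
_, Y1 = layer_forward(U, W0)
_, Y2 = layer_forward(Y1, W1)          # network outputs Y^(L)
```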
Back propagation Algorithm
• Error, i.e., cost fn.:  $E = \frac{1}{M} \sum_{m=1}^{M} E_m$

• Loss (i.e., for regression):  $E_m = \sum_{i=1}^{N(L)} \left( Y_i^{(L)}(m) - d_i(m) \right)^2$

• $Y_i^{(L)}(m)$ ≡ $i$-th output of the NN for the training pattern $m$

• $d_i(m)$ ≡ target output
• $N(L)$ ≡ no. of outputs (i.e., nodes of the output layer)

• We need to compute the gradient, i.e., $\frac{\partial E}{\partial w_{ij}^{(l)}}$

6
Chain Rule
• 𝑌 = 𝑓 𝑦1 , 𝑦2 , 𝑦3

• 𝑦1 = 𝑔1 (𝑍) , 𝑦2 = 𝑔2 (𝑍), 𝑦3 = 𝑔3 (𝑍)

[Figure: $Z$ feeds $y_1 = g_1(Z)$, $y_2 = g_2(Z)$ and $y_3 = g_3(Z)$, which all feed $Y = f(y_1, y_2, y_3)$.]

• $\frac{\partial Y}{\partial Z} = \frac{\partial Y}{\partial y_1}\frac{\partial y_1}{\partial Z} + \frac{\partial Y}{\partial y_2}\frac{\partial y_2}{\partial Z} + \frac{\partial Y}{\partial y_3}\frac{\partial y_3}{\partial Z}$

7
Back propagation Algorithm
• $E_m = \sum_{i=1}^{N(L)} \left( Y_i^{(L)}(m) - d_i(m) \right)^2$

• $Y_i^{(L)} = f\!\left(X_i^{(L)}\right)$

• $X_i^{(L)} = \sum_{j=1}^{N(L-1)} w_{ij}^{(L-1)} Y_j^{(L-1)}$

Chain rule!
$\frac{\partial E_m}{\partial w_{IJ}^{(L-1)}} = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot \frac{\partial X_I^{(L)}}{\partial w_{IJ}^{(L-1)}}$

$\qquad\quad = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot \frac{\partial}{\partial w_{IJ}^{(L-1)}} \sum_{j=1}^{N(L-1)} w_{Ij}^{(L-1)} Y_j^{(L-1)}$

$\qquad\quad = \frac{\partial E_m}{\partial Y_I^{(L)}} \cdot \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} \cdot Y_J^{(L-1)}$

• Note that the other $X_i^{(L)}$'s are not taken into account, because they
do not depend on $w_{IJ}^{(L-1)}$ at all

8
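Putting the three factors together for the squared-error loss $E_m$ above (a worked step, assuming a differentiable activation $f$ with derivative $f'$):

$\frac{\partial E_m}{\partial Y_I^{(L)}} = 2\left( Y_I^{(L)}(m) - d_I(m) \right), \qquad \frac{\partial Y_I^{(L)}}{\partial X_I^{(L)}} = f'\!\left( X_I^{(L)} \right)$

so that

$\frac{\partial E_m}{\partial w_{IJ}^{(L-1)}} = 2\left( Y_I^{(L)}(m) - d_I(m) \right) f'\!\left( X_I^{(L)} \right) Y_J^{(L-1)}$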
Back propagation Algorithm
• To get $\frac{\partial E_m}{\partial w_{ij}^{(l)}}$ for any general layer $l$:

[Figure: node $J$ in layer $l$ feeds node $I$ in layer $l+1$ through the weight $w_{IJ}^{(l)}$; the output $Y_I^{(l+1)}$ then feeds all nodes $X_1^{(l+2)}, X_2^{(l+2)}, \ldots$ of layer $l+2$, which eventually determine $E_m$.]

$\frac{\partial E_m}{\partial w_{IJ}^{(l)}} = \frac{\partial E_m}{\partial X_I^{(l+1)}} \cdot \frac{\partial X_I^{(l+1)}}{\partial w_{IJ}^{(l)}} = \frac{\partial E_m}{\partial X_I^{(l+1)}} \cdot Y_J^{(l)}$

$\frac{\partial E_m}{\partial X_I^{(l+1)}} = \sum_{i=1}^{N(l+2)} \frac{\partial E_m}{\partial X_i^{(l+2)}} \cdot \frac{\partial X_i^{(l+2)}}{\partial X_I^{(l+1)}} = \sum_{i=1}^{N(l+2)} \frac{\partial E_m}{\partial X_i^{(l+2)}} \cdot \frac{\partial X_i^{(l+2)}}{\partial Y_I^{(l+1)}} \cdot \frac{\partial Y_I^{(l+1)}}{\partial X_I^{(l+1)}}$
9
Back propagation Algorithm
1. Initialize all weights to small randomly chosen
values, e.g. [-1,1]
2. Let u(m) & d(m) be the training input/output
examples
3. For m=1 to M:
i. Present u(m) to the network and compute the
hidden layer outputs and final layer outputs
ii. Present u(m) to the network and use the resulting outputs in a
backward scheme to compute the partial derivatives of the error fn.
w.r.t. the weights of each layer
iii. Update weights: $w_{ij}^{(l)}(\text{new}) = w_{ij}^{(l)}(\text{old}) - \alpha \frac{\partial E_m}{\partial w_{ij}^{(l)}}$
4. Compute total error (stop in case of
convergence)

[Figure: the weight $w_{ij}$ connects node $j$ to node $i$.]

Note: $l$ refers to layer $l$                                   10
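A minimal NumPy sketch of steps 1–4 with sequential (per-pattern) updates, assuming one hidden layer, sigmoid activations, explicit thresholds (biases), and the squared-error loss $E_m$ defined earlier; all names and the toy XOR example are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(U, D, n_hidden=4, alpha=0.5, epochs=5000):
    """Sequential backpropagation for a network with one hidden layer.

    U: (M, N_in) training inputs u(m); D: (M, N_out) targets d(m).
    """
    rng = np.random.default_rng(0)
    n_in, n_out = U.shape[1], D.shape[1]
    # 1. initialize all weights to small random values, e.g. in [-1, 1]
    W0, b0 = rng.uniform(-1, 1, (n_hidden, n_in)), np.zeros(n_hidden)
    W1, b1 = rng.uniform(-1, 1, (n_out, n_hidden)), np.zeros(n_out)

    for _ in range(epochs):
        total_error = 0.0
        # 3. for m = 1..M: present u(m) and update after each pattern
        for u, d in zip(U, D):
            # i. forward pass: hidden-layer and output-layer outputs
            Y1 = sigmoid(W0 @ u + b0)
            Y2 = sigmoid(W1 @ Y1 + b1)                    # network outputs Y^(L)

            # ii. backward pass: dE_m/dX at each layer, E_m = sum (Y - d)^2
            delta2 = 2.0 * (Y2 - d) * Y2 * (1.0 - Y2)     # output layer
            delta1 = (W1.T @ delta2) * Y1 * (1.0 - Y1)    # hidden layer

            # iii. gradient-descent weight update: w -= alpha * dE_m/dw
            W1 -= alpha * np.outer(delta2, Y1); b1 -= alpha * delta2
            W0 -= alpha * np.outer(delta1, u);  b0 -= alpha * delta1

            total_error += np.sum((Y2 - d) ** 2)

        # 4. total error over all patterns; stop in case of convergence
        if total_error / len(U) < 1e-3:
            break
    return W0, b0, W1, b1

# illustrative usage: learn XOR
U = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
params = train_backprop(U, D)
```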


Disadvantages of Back propagation

• Can often be slow in reaching the min (i.e.,
sometimes tens of thousands of iterations)
– Especially close to min
– Too small 𝛼 → very small steps & slow to reach
min
– Too large 𝛼 → leads to oscillations & possibly
not converging at all
– Use variable 𝛼 (start large then decrease it)

11
Disadvantages of Back propagation

• Prone to get stuck in local minima


– This problem could be alleviated to some
extent by repeating the training many times,
each time from a different set of initial weights

12
Types of Weight Update
• Batch or epoch update, i.e., Gradient
Descent:
– Present full set of training examples (batch of
examples)
– Compute the error of each example
– Compute gradient of the batch (based on cost
function of the whole batch)
– Update weights based on this batch gradient
– Do another iteration … and so on
– Advantages:
• Optimization is more consistent
– Disadvantages:
• Slow (too long per iteration)

13
Types of Weight Update
• Sequential update, i.e., Stochastic Gradient Descent:
– Present a training pattern, then update the weights
(according to $\frac{\partial E_m}{\partial w}$), then present the next one … and so on

– After finishing all patterns, do another iteration starting from
m=1

– Advantages:
• Faster compared to gradient descent, i.e., full-batch
– Disadvantages:
• Hard to converge: “stochastic” since it depends on every
single example; however, in practice being close to
minimum is reasonably good
• Lose the speedup from vectorization

– In practice for large datasets SGD is preferred to GD

14
Types of Weight Update
• Mini-Batch:
– Present subset of training examples (mini-
batch of examples)
– Compute the error of each example
– Compute gradient of the mini-batch (based on
cost function of this mini-batch)
– Update weights based on this mini-batch
gradient
– Move to another mini-batch & after finishing all
mini-batches do another iteration … and so on
– Advantages:
• Fast

15
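A small sketch of the mini-batch scheme; the `grad_fn`, `W`, `X`, `D`, and `alpha` names are placeholders for whatever model and gradient computation is being trained:

```python
import numpy as np

def minibatch_indices(M, batch_size, rng):
    """Yield index arrays covering all M examples in shuffled mini-batches."""
    order = rng.permutation(M)
    for start in range(0, M, batch_size):
        yield order[start:start + batch_size]

# usage sketch (assumes some grad_fn(W, X_batch, D_batch) returning dE/dW):
# rng = np.random.default_rng(0)
# for epoch in range(n_epochs):                      # one pass over all mini-batches
#     for idx in minibatch_indices(len(X), 32, rng):
#         W -= alpha * grad_fn(W, X[idx], D[idx])    # update per mini-batch gradient
```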
Other Optimization Algorithms
• Gradient descent with momentum
– Smooth-out the steps of the gradient descent
using a moving average of the derivatives
– Get faster learning in the intended direction &
avoid oscillations

[Figure: contour plot of E; momentum gives faster learning along the intended direction and slower oscillation across it.]

$D_W = \beta\, D_W + (1-\beta)\, \frac{\partial E}{\partial W}$   ($D_W = 0$ initially)

$W = W - \alpha\, D_W$
                                                                16
Other Optimization Algorithms
• RMSProp
– Slow-down learning in unintended directions
– Avoid oscillations
[Figure: contour plot with axes $w_i$ (small gradient $\frac{\partial E}{\partial w_i}$, the intended direction) and $w_j$ (large gradient $\frac{\partial E}{\partial w_j}$, the oscillating direction); RMSProp speeds up learning along $w_i$ and slows it down along $w_j$.]

$S_{w_i} = \beta\, S_{w_i} + (1-\beta)\left(\frac{\partial E}{\partial w_i}\right)^2$ ,   $S_{w_j} = \beta\, S_{w_j} + (1-\beta)\left(\frac{\partial E}{\partial w_j}\right)^2$

$w_i = w_i - \alpha\, \frac{\partial E / \partial w_i}{\sqrt{S_{w_i}} + \varepsilon}$ ,   $w_j = w_j - \alpha\, \frac{\partial E / \partial w_j}{\sqrt{S_{w_j}} + \varepsilon}$
                                                                17
Other Optimization Algorithms
• Adam (combines both RMSProp &
momentum)

$D_{w_i} = \beta_1\, D_{w_i} + (1-\beta_1)\, \frac{\partial E}{\partial w_i}$ ,   $S_{w_i} = \beta_2\, S_{w_i} + (1-\beta_2)\left(\frac{\partial E}{\partial w_i}\right)^2$

$w_i = w_i - \alpha\, \frac{D_{w_i}}{\sqrt{S_{w_i}} + \varepsilon}$

18
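A sketch of the Adam update exactly as written on the slide (i.e., without the bias-correction terms some implementations add); the toy quadratic in the usage part is illustrative:

```python
import numpy as np

def adam_update(w, grad, D, S, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (no bias correction) for a parameter array w.

    D and S hold the running first and second moments of the gradient.
    """
    D = beta1 * D + (1 - beta1) * grad            # momentum term
    S = beta2 * S + (1 - beta2) * grad ** 2       # RMSProp term
    w = w - alpha * D / (np.sqrt(S) + eps)        # combined update
    return w, D, S

# usage sketch on a toy quadratic E(w) = ||w||^2, so dE/dw = 2w
w = np.array([3.0, -2.0])
D, S = np.zeros_like(w), np.zeros_like(w)
for _ in range(500):
    grad = 2 * w
    w, D, S = adam_update(w, grad, D, S, alpha=0.05)
# w is now close to the minimum at the origin
```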
Regularization
• Used to prevent overfitting
– Intuition: set the weights of some hidden nodes to zero to simplify the
network, i.e., smaller network

• L2 regularization (aka weight decay):  $J = \frac{1}{M}\sum_{m=1}^{M} E_m + \frac{\lambda}{2M}\, \lVert W \rVert_2^2$
– $\lVert W \rVert_2^2 = \sum_j w_j^2 = W^T W$

• L1 regularization:  $J = \frac{1}{M}\sum_{m=1}^{M} E_m + \frac{\lambda}{2M}\, \lVert W \rVert_1$
– $\lVert W \rVert_1 = \sum_j |w_j|$

• $W$ is the weights vector; thresholds need not be included

• L2 regularization is used more often

• 𝜆 is the regularization parameter (hyper-parameter to be tuned)

19
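A small sketch of how the L2 term changes the gradient, following the cost $J$ above (the gradient of $\frac{\lambda}{2M}\lVert W\rVert_2^2$ is $\frac{\lambda}{M}W$, which is why L2 is also called weight decay); `grad_E` is a placeholder for the data-loss gradient:

```python
import numpy as np

def l2_regularized_grad(grad_E, W, lam, M):
    """Gradient of J = (1/M) sum_m E_m + (lambda/(2M)) ||W||^2 w.r.t. W.

    grad_E: gradient of the average data loss (1/M) sum_m E_m w.r.t. W.
    The extra (lambda/M) * W term shrinks the weights toward zero.
    """
    return grad_E + (lam / M) * W

# usage sketch with placeholder values:
# W -= alpha * l2_regularized_grad(grad_E, W, lam=0.01, M=1000)
```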
Dropout Regularization
• Used to prevent overfitting

• Intuitions:
– Eliminate some nodes to simplify the network
based on some probability, i.e., smaller network
– As if you train smaller networks on individual
training examples
– Cannot rely on any one feature, so spread weights

• For each layer set a dropout probability
– Each node within that layer may get eliminated
based on that probability

20
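A minimal sketch of dropout at training time, assuming the common inverted-dropout convention (scaling by the keep probability so the expected activation is unchanged); names are illustrative:

```python
import numpy as np

def dropout_forward(Y, keep_prob, rng):
    """Randomly zero some of a layer's outputs Y during training.

    keep_prob = 1 - dropout probability for this layer; dividing by
    keep_prob keeps the expected activation unchanged.
    """
    mask = rng.random(Y.shape) < keep_prob   # nodes that survive this pass
    return (Y * mask) / keep_prob, mask

# usage sketch
rng = np.random.default_rng(0)
Y = np.array([0.2, 0.9, 0.5, 0.7])
Y_dropped, mask = dropout_forward(Y, keep_prob=0.8, rng=rng)
```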
Guidelines for Training
• Learning rate 𝛼 :
– Too small: convergence will be slow.
– Too large: we will oscillate around the
minimum.
– Some methods propose a varying rate, i.e.,
learning-rate decay.
– When learning does not go well, consider using a
smaller learning rate.

21
Learning Rate Decay
• Gradient descent with a small mini-batch size and a
decaying learning rate

[Figure. Source: Andrew Ng]

22
Input and Output Normalization
• Input and Output normalization
– Inputs have to be approximately in the range of
0 to 1 or -1 to 1

[Figure: example input variables with very different ranges, e.g., −1 to 1, −1000 to 1000, −2 to 2.]

– $x = \dfrac{u - u_{\min}}{u_{\max} - u_{\min}}$
– $x = \dfrac{u - \text{Mean}(u)}{\text{st dev}(u)}$

23
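A small sketch of the two normalization formulas above:

```python
import numpy as np

def minmax_normalize(u):
    """x = (u - u_min) / (u_max - u_min): maps inputs into the range 0 to 1."""
    return (u - u.min()) / (u.max() - u.min())

def standardize(u):
    """x = (u - Mean(u)) / st dev(u): zero mean, unit standard deviation."""
    return (u - u.mean()) / u.std()

# usage sketch on an input with range roughly -1000 to 1000
u = np.array([-1000.0, -250.0, 0.0, 500.0, 1000.0])
x1 = minmax_normalize(u)   # values in [0, 1]
x2 = standardize(u)        # values roughly in [-1.6, 1.4]
```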
Train/Dev/Test Partition
• Best practice:
– Training: 60%, Validation (Dev): 20% & Test: 20%
– In case of big data, e.g., 10^6, then 98%, 1%
& 1%

• Test set should be used only once, at the
very end of the design

24
Machine Learning Recipe
• Train the network and evaluate first on the training data
– If bias is high, i.e., underfitting (performance is bad on the training set
itself), then:
• Bigger network (more hidden nodes or more hidden layers) →
works most of the time
• Train longer → works sometimes
– Check for bias again and keep changing until a good bias is reached

• Check for variance, i.e., performance on Dev set


– If variance is high, i.e., overfitting (performance is bad on the validation
set), then:
• More data (if possible)
• Regularization
– Check again for bias first, then after that check for variance and so on
until you reach a good bias & good variance

• Search for a better NN architecture that better suits the problem
(sometimes may work)

25
Hyper-Parameters Tuning
• Learning rate 𝛼 1st in importance

• Momentum parameter 𝛽 ≈ 0.9


• Number of hidden nodes 2nd in importance

• Mini-batch size

• Num of layers 3rd in importance


• Learning-rate decay

• Adam parameters $\beta_1 \approx 0.9$, $\beta_2 \approx 0.999$, $\varepsilon \approx 10^{-8}$
(not likely to need changing!)

26
Tuning Process
• Try random values: don’t use a grid
– Better exploration of important parameters
– Consider the example on the board

• Coarse to fine scheme


– Focus more on good regions

• Use appropriate scale


– Do not sample uniformly
– Use logarithmic scale
– E.g., if the $\alpha$ range is [0.0001, 1], sampling uniformly on a linear
scale will give most of the weight to the values between 0.1 & 1;
on a logarithmic scale instead:
Sample $r \sim [-4, 0]$ uniformly, then $\alpha = 10^{r}$
($10^{-4} = 0.0001$, $10^{-3} = 0.001$, $10^{-2} = 0.01$, $10^{-1} = 0.1$, $10^{0} = 1$)
                                                                27
Tuning Process
• Use appropriate scale
– Another example: let the $\beta$ range be [0.9, 0.999]
– Sample from $1-\beta$, i.e., [0.1, 0.001], using a log scale:
Sample $r \sim [-3, -1]$ uniformly, then $1-\beta = 10^{r}$, i.e., $\beta = 1 - 10^{r}$
($10^{-1} = 0.1$, $10^{-2} = 0.01$, $10^{-3} = 0.001$)
– The sensitivity of $\beta$ as it approaches 1 has a huge impact on
the performance, i.e., momentum corresponds to averaging over
the last $\frac{1}{1-\beta}$ examples
• $\beta \in [0.900, 0.9005]$ → averaging over the last ~10 examples
• $\beta \in [0.999, 0.9995]$ → 1000 to 2000 examples

28
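A small sketch of sampling $\alpha$ and $\beta$ on a logarithmic scale as described on these two slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# learning rate: sample r ~ Uniform(-4, 0), then alpha = 10**r
r = rng.uniform(-4, 0, size=5)
alphas = 10.0 ** r              # candidates spread evenly across 1e-4 .. 1

# momentum parameter in [0.9, 0.999]: sample 1 - beta on a log scale
r = rng.uniform(-3, -1, size=5)
betas = 1.0 - 10.0 ** r         # beta = 1 - 10**r
```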
K-Fold Cross Validation
• For parameter tuning over the training data
• Apply K-fold validation to the training set (usually
k= 5).

• Example K=4:  $E_{VAL} = E_1 + E_2 + E_3 + E_4$

[Figure: the training set is split into 4 folds; in turn, each fold is used for validation while the other 3 are used for training, giving the errors $E_1, E_2, E_3, E_4$.]
• Repeat for every parameter value, minimize 𝑬𝑽𝑨𝑳

29
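A minimal sketch of K-fold validation for tuning; `train_and_eval` is a placeholder for training a model on the training folds and returning its error on the validation fold:

```python
import numpy as np

def kfold_indices(M, k, rng):
    """Split indices 0..M-1 into k roughly equal, shuffled folds."""
    return np.array_split(rng.permutation(M), k)

def kfold_error(X, D, k, train_and_eval, rng):
    """E_VAL = sum of the k validation errors, one per fold."""
    folds = kfold_indices(len(X), k, rng)
    e_val = 0.0
    for i in range(k):
        val_idx = folds[i]
        tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        e_val += train_and_eval(X[tr_idx], D[tr_idx], X[val_idx], D[val_idx])
    return e_val

# usage sketch: compute E_VAL for each candidate hyper-parameter value
# over the training data and keep the value that minimizes it.
```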
K-Fold Cross Validation
• Better than the conventional train-validation-test
split
– Just split into training and test (no need for a separate validation set)
– Not biased to the particular split between the training and
validation data

30
Multi-Class Classification
• E.g., an image is either of a cat, or dog, or
duck or otherwise

[Figure: a network with inputs $X_1, \ldots, X_N$; the nodes of the output layer produce P(cat|𝑿), P(dog|𝑿), …, P(other|𝑿).]
31
Multi-Class Classification
• Use sigmoid activation function in the
output only in case of binary classification,
i.e., two classes

• For multi-class classification use soft-max
regression:

[Figure: the same network; the output-layer nodes compute $Z_1^{[L]}, Z_2^{[L]}, \ldots$ before any activation, and soft-max turns them into P(cat|𝑿), P(dog|𝑿), …, P(other|𝑿).]
                                                                32
Multi-Class Classification
• For multi-class classification use soft-max regression:
[Figure: the output layer, with pre-activations $Z_1^{[L]}, Z_2^{[L]}, \ldots$ and soft-max outputs $y_1 = $ P(cat|𝑿), $y_2 = $ P(dog|𝑿), …, P(other|𝑿).]

• $z_i^{[L]}$ is the output of node $i$ at the output layer before applying
any activation function

• $y_i$ is the output after applying soft-max:  $y_i = \dfrac{e^{z_i^{[L]}}}{\sum_j e^{z_j^{[L]}}}$
                                                                33
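A small sketch of the soft-max formula (subtracting the maximum first is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(z):
    """Soft-max: y_i = exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z - np.max(z))   # stability trick; result is unchanged
    return e / e.sum()

# usage sketch: three output-layer pre-activations -> class probabilities
z = np.array([2.0, 1.0, 0.1])
p = softmax(z)      # sums to 1; the largest z gets the largest probability
```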
Convolutional Neural Networks (CNN)

• Mostly applied to imagery problems

• Layers extract features from input images,
e.g., edge detection

34
Convolutional Neural Networks (CNN)

35
Convolutional Neural Networks (CNN)

• Mostly applied to imagery problems

• Layers extract features from input images,
e.g., edge detection

– Convolution layer, i.e., filtering

– Pooling layer, i.e., reduce the input (avg or max)

– Fully connected layer, i.e., as in a multi-layer
NN, at the final layers

36
Vertical Edge Detection

Input (6×6)              Filter (3×3)       Output (4×4)

10 10 10  0  0  0
10 10 10  0  0  0         1  0 -1           0 30 30 0
10 10 10  0  0  0    *    1  0 -1     =     0 30 30 0
10 10 10  0  0  0         1  0 -1           0 30 30 0
10 10 10  0  0  0                           0 30 30 0
10 10 10  0  0  0

convolution

37
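A small sketch reproducing the slide's example with a plain "valid" convolution (no padding, no kernel flipping, as is usual in CNNs):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution: slide the kernel over the image, no padding."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# the slide's vertical-edge example: 6x6 image, 3x3 filter -> 4x4 output
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(conv2d_valid(image, kernel))   # every row is 0 30 30 0
```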
Convolutional Neural Networks (CNN)

[Figure: a multi-channel input convolved with a set of filters; each filter spans all input channels and produces one output channel.]

38
Max Pooling

Input (6×6)                       Output (3×3)

10 10 10  0  0  0
10 10 10  0  0  0                 10 10 0
10 10 10  0  0  0    max pool     10 10 0
10 10 10  0  0  0                 10 10 0
10 10 10  0  0  0
10 10 10  0  0  0

What about average pooling?

39
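A small pooling sketch; a 2×2 window with stride 2 is assumed here, which is consistent with the 6×6 → 3×3 example above:

```python
import numpy as np

def max_pool(image, size=2, stride=2):
    """Max pooling: take the maximum over each size x size window."""
    H, W = image.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            win = image[r * stride:r * stride + size,
                        c * stride:c * stride + size]
            out[r, c] = win.max()       # use win.mean() for average pooling
    return out

# the slide's 6x6 input gives a 3x3 output whose rows are 10 10 0
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
print(max_pool(image))
```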
Global Max Pooling
[Example: two 6×6 feature maps are each reduced to a single value by global max pooling, their maximum: 10 for the map containing only 10s and 0s, and 30 for the map that also contains 30s.]

What about global average pooling?

40
Convolutional Neural Networks (CNN)

• Learn the filters’ parameters and the weights of the
fully connected layers

Source: Andrew Ng’s Lectures

41
Convolutional Neural Networks (CNN)

• Convolution leads to a smaller memory footprint
due to:
– Parameter sharing (compared to fully
connected layers)
– Sparsity of connections (each output
depends on a limited number of inputs)

42
Recurrent Neural Networks (RNN)

• Sequence models, e.g., speech recognition,
sentiment classification, … etc.

• Inputs & outputs can have different lengths
within the same dataset

43
Recurrent Neural Networks (RNN)

[Figure: an unrolled RNN; at each time step $t$ the cell takes the input $x^{<t>}$ and the previous activation $a^{<t-1>}$ (with $a^{<0>}$ = zeros) and produces the output $y^{<t>}$ and the activation $a^{<t>}$ passed to the next step.]

• $y^{<t>}$ is the output at one time step from a NN
• $a^{<t>}$ is an activation passed from one step to another, also from a NN
                                                                44
Autoencoder Network

Source: Lilian Weng’s Github blog

45
Autoencoder Network

• Unsupervised network

• Gives an embedding
– Better embeddings can be obtained with supervised training

46
Generative Adversarial Network (GAN)

• Create a generative model of artificial data

Source: Guy Ernest, AWS Amazon Blogs 47


Siamese Networks

Source: Guy Ernest, AWS Amazon Blogs

48
Siamese Networks

Source: Guy Ernest, AWS Amazon Blogs

49
Deep Learning
• Subset of machine learning

• Multi-layered neural networks

• Raw data, i.e., end-to-end solution

• Requires big data & high computational


power

50
Machine Learning vs Deep Learning

51
Machine Learning vs Deep Learning

Source: Hannes Schulz and Sven Behnke, 2012 52


Machine Learning vs Deep Learning

53
When to use Deep Learning?
• Big amount of data (expensive!)

• Availability of high computational power (expensive!)

• Lack of domain understanding

• Complex problems (vision, NLP, speech
recognition)

54
Scalability with Data Amount

55
Scalability with Data Amount

56
Potentials of AI

“If a typical person can do a mental task with
less than one second of thought, we can
probably automate it using AI either now or
in the near future.”
— Andrew Ng

Currently, there are some limitations!

57
Limitations of Deep Learning
• Lots of achievements in the vision field

• Not a magic tool!
– Lack of adaptability and generality compared to
the human-vision system
– Not able to build a general-intelligent machine

58
Limitations of Deep Learning

Source: Gartner Hype Cycle for AI, 2019 59


Limitations of Deep Learning
• Why cannot fit all real-world scenarios?

Source: Google

Source: Boston Dynamics 60


Limitations of Deep Learning
• Large amount of labeled data
– Impressive achievements correspond to
supervised learning
– Expensive!
– Sometimes experts & special equipment are
needed

61
Limitations of Deep Learning
• Datasets may be biased
– Deep networks become biased against rare
patterns
– Serious consequences in some real-world
applications (e.g., medical, automotive, … etc.)
– Researchers should consider synthetic
data generation to mitigate the unbalanced
representation of data

62
Limitations of Deep Learning
• Datasets may be biased

– Classification may be sensitive to viewpoint


• if one of the viewpoints is under-represented

63
Limitations of Deep Learning
• Sensitive to standard adversarial attacks
- Datasets are finite and just represent a fraction
of all possible images

- Add extra training, i.e., “adversarial training”

64
Limitations of Deep Learning
• Over-sensitive to changes in context
– Limited number of contexts in the dataset, e.g.,
a monkey in a jungle
– Combinatorial Explosion!

65
Limitations of Deep Learning
• Combinatorial Explosion
– Real-world images are combinatorially large

– Application dependent (e.g., medical imaging is
an exception)

– Considering compositionality may be a potential
solution

– Testing is challenging (consider worst-case
scenarios)

66
Limitations of Deep Learning
• Visual understanding is tricky
– Mirrors
– Sparse Information
– Physics
– Humor

• Unintended results from fitness functions

67
Machine Learning Model Selection
1. Categorize the problem:

– Input: supervised, unsupervised, … etc.

– Output: numerical → regression, class →
classification, set of input groups → clustering

68
Machine Learning Model Selection
2. Understand your data:
a) Analyze the data:
• Descriptive statistics
• Data visualization

b) Process the data:


• Pre-processing, cleansing, … etc.

c) Feature Engineering

69
Machine Learning Model Selection
3. Determine the possible algorithms:
– Based on categorization & data understanding

– May have a look at the literature

– Determine: desired accuracy, interpretability,
scalability, complexity, training & testing time,
runtime, … etc.

70
Machine Learning Model Selection
4. Implement Machine Learning Algorithms:
– Setup a pipeline
– Compare algorithms
– Select an evaluation criteria

5. Tune hyperparameters

71
Time Series Prediction
• Time series contains:
– Trend
– Seasonality

• De-seasonalization:
– Remove the seasonal periodicities

[Pipeline: TS → deseasonalize → predict → return back the seasonal
components]

72
How to deseasonalize?
• Removing the seasonal periodicities

• Usually seasonal cycle length is 12 months

[Figure: a seasonal time series $x(t)$ plotted against $t$; the seasonal cycle spans 12 months.]

73
How to deseasonalize?
• Obtain the average of the TS values over this (one-year) window:

$a(\text{year}) = \frac{1}{12} \sum_{\text{window}} x(t)$

• Normalization step:  $Z(i) = \dfrac{x(i)}{a(\text{year})}$

• Seasonal average ≡ avg of the $Z(i)$'s of the different years
for month $i$:

$u(i) = \dfrac{\sum_j Z_j(i)}{\#\text{years}}$

[Figure: $x(t)$ split into 12-month windows; $Z_1(2), Z_2(2), Z_3(2)$ are the normalized values of month 2 in years 1, 2 and 3.]
                                                                74
How to deseasonalize?
• Seasonal average ≡ avg of the $Z(i)$'s of the
different years for month $i$:

$u(i) = \dfrac{\sum_{j=1}^{\#\text{years}} Z_j(i)}{\#\text{years}}$

[Figure: the seasonal profile $u(i)$ plotted for months $i = 1, \ldots, 12$.]
75
Deseasonalization Step
• Divide time series value by the
corresponding seasonal average

$x_{\text{deseasonal}}(t) = \dfrac{x(t)}{u(\text{month}(t))}$

• After that, focus on predicting the trend

[Figure: the deseasonalized series $x_{\text{deseasonal}}(t)$ plotted against $t$.]
                                                                76
Recover Seasonality
• After trend prediction, seasonality can be
recovered via multiplication by the
corresponding seasonal average

77
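A minimal sketch of the whole deseasonalize/predict/reseasonalize pipeline for monthly data with a 12-month cycle and complete years; the function names and the synthetic series are illustrative:

```python
import numpy as np

def seasonal_averages(x):
    """u(i): average of the normalized values Z(i) of month i over all years.

    x is a 1-D array of monthly values whose length is a multiple of 12.
    """
    years = x.reshape(-1, 12)                 # one row per year (12-month window)
    a = years.mean(axis=1, keepdims=True)     # a(year) = (1/12) * sum over the window
    Z = years / a                             # Z(i) = x(i) / a(year)
    return Z.mean(axis=0)                     # u(i), one value per month

def deseasonalize(x):
    """x_deseasonal(t) = x(t) / u(month(t))."""
    u = seasonal_averages(x)
    months = np.arange(len(x)) % 12
    return x / u[months], u

def reseasonalize(trend_prediction, u, months):
    """Recover seasonality by multiplying by the corresponding seasonal average."""
    return trend_prediction * u[months]

# usage sketch on 3 years of synthetic monthly data (trend + seasonality)
t = np.arange(36)
x = (100 + 2 * t) * (1 + 0.3 * np.sin(2 * np.pi * t / 12))
x_des, u = deseasonalize(x)                 # roughly the trend 100 + 2t
x_back = reseasonalize(x_des, u, t % 12)    # approximately recovers x
```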
Acknowledgment
• These slides have been created relying on
lecture notes of Prof. Dr. Amir Atiya

78
