Slides L4 – Time Series Analysis
KNIME AG
1. Download your course materials from the KNIME Hub
2. Exercises – Getting Started
Since social and economic conditions are constantly changing over time, data analysts must be able to assess and predict the effects of these changes in order to suggest the most appropriate actions to take
Energy & Utilities
§ Energy load forecasting: better planning and trading strategies
A time series is made up of dynamic data collected over time! Consider the differences between:
[Example plots: an annual series (2005–2015), simple at a glance; a monthly series (2012–2016) with a very irregular dynamic and many sudden changes; data collected on a milliseconds basis]
Someone once said: «Forecasting is the art of saying what will happen in the future and then explaining why it didn't»
§ Frequently true... history is full of examples of «bad forecasts», like the IBM chairman's famous 1943 quote: "there is a world market for maybe five computers in the future."
The reality is that forecasting is a really tough task, and you can do really badly, just like in this cartoon...
§ The interval between observations can be any time interval (seconds, minutes, hours, days, weeks, months, quarters, years, etc.), and we assume that these time periods are equally spaced
§ One of the most distinctive characteristics of a time series is the mutual dependence between the observations, generally called SERIAL CORRELATION or AUTOCORRELATION
[Example table: hourly values per ID (1000, 1001, 1002, …), with columns ranging from early morning to late afternoon]
[Cluster plots of the KPI by time of day: e.g. Cluster 11 (early morning), Cluster 18 (late afternoon), Cluster 0 (morning/evening); business days 9–5 vs. weekend; winter vs. summer]
https://kni.me/w/9pHnxeJUp8aueCJT
§ Encapsulates a functionality as a KNIME workflow
§ E.g., execute a Python script via a component with a graphical UI
§ Functions like a regular node:
§ start using it by dragging and dropping it from the EXAMPLES Server/local directory
§ configure it in the component's configuration dialog
§ Also available on the KNIME Hub
Drag & Drop from hub.knime.com
§ Instances of shared components are linked to the master and are therefore write-protected
§ Editable after disconnecting the link, or via a double-click in the component editor
§ Extract granularities (year, month, hour, etc.) from a timestamp and aggregate (sum, average, mode, etc.) the data at the selected granularity
§ In today's example we calculate the total energy consumption by hour, day, and month (see the sketch below)
Input: Time series to aggregate
Output: Aggregated time series
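Outside KNIME, the same aggregation can be sketched in a few lines of pandas; the file name energy.csv and the column names timestamp and consumption are illustrative assumptions, not part of the course material.

    import pandas as pd

    # Assumed input: a CSV with one timestamp column and one consumption column
    df = pd.read_csv("energy.csv", parse_dates=["timestamp"], index_col="timestamp")

    # Total energy consumption by hour, day, and month
    hourly = df["consumption"].resample("H").sum()
    daily = df["consumption"].resample("D").sum()
    monthly = df["consumption"].resample("M").sum()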
§ TREND: the general direction in which the series is running over a long period. A TREND exists when there is a long-term increase or decrease in the data. It does not have to be linear (it could be exponential or follow some other functional form).
§ CYCLE: long-term fluctuations that occur regularly in the series. A CYCLE is an oscillatory component (i.e. upward or downward swings) which is repeated after a certain number of years, so it:
§ may vary in length and usually lasts several years (from 2 up to 20–30)
§ is difficult to detect, because it is often confused with the trend component
According to the data granularity and to the type of seasonality you want to model, it is important to consider the right seasonal frequency (i.e. how many observations you have for every seasonal cycle)
§ There is no problem if your data points are years, quarters, months or weeks (in this case you will face only annual seasonality), but if the frequency of observations is smaller than a week, things get more complicated
§ For example, hourly data might have a daily seasonality (frequency = 24), a weekly seasonality (frequency = 24×7 = 168) and an annual seasonality (frequency = 24×365.25 = 8766)
Frequency per cycle type and data granularity:
Cycle type | Hour | Day    | Week  | Year
Annual     | 8766 | 365.25 | 52.18 | 1
Weekly     | 168  | 7      | 1     | –
Daily      | 24   | 1      | –     | –
§ The first chart in time series analysis is the TIME PLOT → the observations are plotted against the time of observation, normally with consecutive observations joined by straight lines
[Examples: TS plot of Australian monthly wine sales; TS plot of the Air Passengers (monthly) series]
§ The TIME PLOT is very useful in cases where the series shows a very constant/simple dynamic (strong trend and strong seasonality), but in other cases it can be difficult to draw clear conclusions
[Examples: Consumer Cost Index; oxygen saturation]
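For reference, a time plot like the ones above takes only a few lines with pandas and matplotlib; the file and column names are illustrative assumptions.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumed input: the AirPassengers data as a CSV with columns Month, Passengers
    ts = pd.read_csv("airpassengers.csv", parse_dates=["Month"], index_col="Month")["Passengers"]

    ts.plot(title="Air Passengers (monthly)")  # consecutive observations joined by straight lines
    plt.xlabel("Time")
    plt.ylabel("Passengers")
    plt.show()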
§ Produce the seasonal plot of the time series in order to analyze the seasonal component in more detail (and possible changes in seasonality over time)
§ Create the conditional box plot of the time series in order to understand in depth the distribution of the data in the same period of each season, focusing on specific aspects such as outliers, skewness, variability, … (see the sketch below)
Input: Time series column; category column
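A minimal sketch of the conditional box plot with pandas, assuming the monthly ts series loaded in the earlier time-plot sketch; the month number serves as the category column.

    import matplotlib.pyplot as plt

    df = ts.to_frame("value")
    df["month"] = df.index.month             # category column: the seasonal period
    df.boxplot(column="value", by="month")   # one box per month: outliers, skewness, variability
    plt.show()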
In time series analysis it is important to analyze the correlation between the lagged values of a time series (autocorrelation): the lag plot is a bivariate analysis, consisting of a simple scatter plot of the values of the target variable at time t vs. the values of the same variable at t−k; focusing on the correlation with the first lag (t−1), you can see from the plot below that there is a strong linear relation between the values at t and the values at t−1
[Lag plot: AirPassengers vs. AirPassengers (lag 1)]
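pandas ships a helper for exactly this chart; a one-line sketch, again assuming the ts series from the earlier examples:

    from pandas.plotting import lag_plot
    import matplotlib.pyplot as plt

    lag_plot(ts, lag=1)   # scatter plot of the values at t vs. the values at t-1
    plt.show()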
In order to dig deeper into the autocorrelation structure of the time series, you can create the Auto Correlation Function plot (ACF plot), also called a correlogram: in this chart you can read the linear correlation index between the values at t and all the possible lags (t−1, t−2, …, t−k); the chart below shows all the correlations up to lag 48
Together with the ACF, it is sometimes useful to also analyze the Partial Autocorrelation Function
The ACF plot shows the autocorrelations, which measure the linear relationship between y_t and y_{t−k} for different values of k, but consider that:
§ if y_t and y_{t−1} are correlated, then y_{t−1} and y_{t−2} must also be correlated
§ but then y_t and y_{t−2} might be correlated, simply because they are both connected to y_{t−1}
§ → the Partial Autocorrelation Function (PACF) considers the linear relationship between y_t and y_{t−k} after removing the effects of the other time lags 1, 2, 3, …, k−1
From a numerical point of view, it is important to produce statistics of the time series (for the total sample and split by seasonal period), in order to have a more precise idea of: the number of valid data points vs. missing data, central tendency measures, dispersion measures, percentiles, confidence intervals of the means, etc. (see the table and sketch below)
Time Series Month N obs Missing Mean Std. Dev Min Max 95% LCL 95% UCL
AirPassengers 1 12 0 241.8 101.0 112 417 177.6 305.9
AirPassengers 2 12 0 235.0 89.6 118 391 178.1 291.9
AirPassengers 3 12 0 270.2 100.6 132 419 206.3 334.1
AirPassengers 4 12 0 267.1 107.4 129 461 198.9 335.3
AirPassengers 5 12 0 271.8 114.7 121 472 198.9 344.7
AirPassengers 6 12 0 311.7 134.2 135 535 226.4 396.9
AirPassengers 7 12 0 351.3 156.8 148 622 251.7 451.0
AirPassengers 8 12 0 351.1 155.8 148 606 252.1 450.1
AirPassengers 9 12 0 302.4 124.0 136 508 223.7 381.2
AirPassengers 10 12 0 266.6 110.7 119 461 196.2 336.9
AirPassengers 11 12 0 232.8 95.2 104 390 172.4 293.3
AirPassengers 12 12 0 261.8 103.1 118 432 196.3 327.3
AirPassengers Total 144 0 280.3 120.0 104 622 260.5 300.1
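A table like the one above can be sketched with a pandas groupby, assuming the monthly ts series from the earlier examples:

    # Per-month statistics (N, mean, std. dev., min, max) plus total-sample summary
    by_month = ts.groupby(ts.index.month).agg(["count", "mean", "std", "min", "max"])
    print(by_month)
    print(ts.describe())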
§ KNIME v4.0.2
§ Python environment
§ StatsModels
§ Keras 2.2.4 & TensorFlow 1.8.0, h5py
§ KNIME Python Integration
§ KNIME Deep Learning Keras Integration
Agenda
4. Descriptive Analytics: Non-stationarity, Seasonality, Trend
A time series can be defined as "stationary" when its properties do not depend on the time at which the series is observed, so that:
§ the values oscillate frequently around the mean, independently of time
§ the variance of the fluctuations remains constant across time
§ the autocorrelation structure is constant over time and no periodic fluctuations exist
So, a time series that shows trend or seasonality is not stationary
[Plots: a stationary time series example; two non-stationary time series examples]
Typical examples of non-stationary series are all series that exhibit a deterministic trend (i.e. y_t = α + β·t + ε_t) or the so-called "Random Walk"
A random walk model is very widely used for non-stationary data, particularly financial and economic data. [Plot: random walk example]
Besides looking at the time plot of the data, the ACF plot is also useful for identifying non-stationary TS:
→ for a stationary time series, the ACF will drop to zero (i.e. within the confidence bounds) relatively quickly, while the ACF of non-stationary data decreases slowly
[ACF plots: a stationary time series example; non-stationary example 1 (random walk!)]
§ One way to make a time series stationary is to compute the differences between consecutive observations → this is known as DIFFERENCING
§ Differencing can help stabilize the mean of a time series by removing changes in the level of a time series, and so eliminating trend (and also seasonality, using a specific differencing order)
§ The Order of Integration for a time series, denoted I(d), reports the minimum number of differences (d) required to obtain a stationary series (note: I(0) means the series is already stationary!)
§ Transformations such as logarithms can help to stabilize the variance of a time series
Differenced time series (first order): y′_t = y_t − y_{t−1}
[Plots: non-stationary time series example 1 (random walk) and its first-order differenced series, TS2_t − TS2_{t−1}]
No significant autocorrelation remains → applying first differences to a random walk generates white noise
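A short numerical sketch of this fact: simulate a random walk, take first differences, and confirm in the ACF plot that no significant autocorrelation remains.

    import numpy as np
    from statsmodels.graphics.tsaplots import plot_acf
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    rw = np.cumsum(rng.normal(size=500))   # random walk: y_t = y_{t-1} + eps_t
    diff = np.diff(rw)                     # first differences recover the white noise
    plot_acf(diff, lags=40)                # bars should stay inside the confidence bounds
    plt.show()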
Occasionally the differenced data will not appear stationary and it may be necessary to difference the data a second time to obtain a stationary series:
y″_t = y′_t − y′_{t−1} = (y_t − y_{t−1}) − (y_{t−1} − y_{t−2}) = y_t − 2y_{t−1} + y_{t−2}
[Four-panel example based on a series with an exponential trend: 1) original time series: non-stationary in mean and variance; 2) first-order differencing: still non-stationary in mean and variance; 3) second-order differencing: stationary in mean, but not in variance; 4) double differencing applied to log(series): stationary series]
Consider that:
§ Sometimes it is necessary to apply both "simple" first differencing and seasonal differencing in order to obtain a stationary series
§ It makes no difference which is done first: the result will be the same
§ However, if the data have a strong seasonal pattern, it is recommended that seasonal differencing be done first, because sometimes the resulting series will already be stationary and there will be no need for a further non-seasonal differencing
Consider the following example, where a set of differencing operations has been applied to the "Monthly Australian overseas visitors" TS:
1) Original time series (y_t)
2) Seasonal differencing (y_t − y_{t−12})
3) First differencing applied to the seasonally differenced series: (y_t − y_{t−12}) − (y_{t−1} − y_{t−13})
4) Log transformation used to stabilize the variance: (log y_t − log y_{t−1}) − (log y_{t−12} − log y_{t−13})
The series now appears to be stationary
Same example as on the previous slide, but changing the order of the differencing process → the final result is…
1) Original time series (y_t)
2) First-order differencing (y_t − y_{t−1})
3) First-order differencing after log transformation: log y_t − log y_{t−1}
4) Seasonal differencing applied to the first-order differenced log series: (log y_t − log y_{t−1}) − (log y_{t−12} − log y_{t−13})
The series is now stationary (see the sketch below)
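A quick pandas sketch of this order-invariance, assuming a monthly Series ts with strictly positive values:

    import numpy as np

    log_ts = np.log(ts)
    a = log_ts.diff(12).diff(1)   # seasonal difference first, then first difference
    b = log_ts.diff(1).diff(12)   # first difference first, then seasonal difference
    assert np.allclose(a.dropna(), b.dropna())   # identical results either way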
Input: Time series column with seasonality
Output: Seasonality; partial autocorrelation plot
§ Extract the trend, first and second seasonality, and residual from the time series, and show the progress of the time series in line plots and ACF plots (see the sketch below)
Input: Signal to decompose
Output: Line plots and ACF plots at the different stages of decomposition
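Outside KNIME, a comparable classical decomposition is available in statsmodels; a sketch, assuming a monthly ts series with annual seasonality:

    from statsmodels.tsa.seasonal import seasonal_decompose
    import matplotlib.pyplot as plt

    # model="multiplicative" suits series whose seasonal swings grow with the level
    result = seasonal_decompose(ts, model="multiplicative", period=12)
    result.plot()   # line plots of the observed series, trend, seasonality, residual
    plt.show()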
Mean signed difference: (1/n) Σ_{i=1..n} (f(x_i) − y_i) → only informative about the direction of the error
Mean absolute percentage error (MAPE): (1/n) Σ_{i=1..n} |f(x_i) − y_i| / |y_i| → requires non-zero target column values
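Both metrics in a few lines of numpy (a sketch; f(x_i) is the prediction and y_i the actual value):

    import numpy as np

    def mean_signed_difference(y_true, y_pred):
        # positive -> over-forecasting on average, negative -> under-forecasting
        return np.mean(np.asarray(y_pred) - np.asarray(y_true))

    def mape(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        # undefined for zero targets, hence the non-zero requirement above
        return np.mean(np.abs(np.asarray(y_pred) - y_true) / np.abs(y_true))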
§ Assess the expected forecast accuracy of your model by comparing actual and
predicted time series
§ Training data vs. in-sample predictions
§ Test data vs. out-of-sample predictions
Agenda
5. Quantitative Forecasting: Classical techniques
6. ARIMA Models: ARIMA(p,d,q)
§ Qualitative forecasting methods are adopted when historical data are not available (e.g. estimating the revenues of a new company that clearly doesn't have any data available yet). They are highly subjective methods.
Our focus
The basis for quantitative analysis of time series is the assumption that there are factors that influenced the dynamics of the series in the past, and that these factors will continue to have similar effects in the future
The main tools used in the Classical Time Series Analysis are:
§ Classical Decomposition: considers the time series as the overlap of several
elementary components (i.e. trend, cycle, seasonality, error)
§ Exponential Smoothing: method based on the weighting of past observations,
taking into account the overlap of some key time series components (trend and
seasonality)
§ ARIMA (AutoRegressive Integrated Moving Average): class of statistical models
that aim to treat the correlation between values of the series at different points in
time using a regression-like approach and controlling for seasonality
§ The seasonal naive forecast is ŷ_{T+h|T} = y_{T+h−m(k+1)}, where m is the seasonal period, and k is the integer part of (h−1)/m (i.e., the number of complete years in the forecast period prior to time T+h)
§ For example, with hourly data, the forecast for all future 6pm values is equal to the last observed 6pm value
§ Best predictor for seasonal random walk data (see the sketch below)
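That formula transcribes directly into Python; a sketch where y is the observed series, m the seasonal period, and h the forecast horizon:

    def seasonal_naive(y, m, h):
        # Forecast for step T+i repeats the last observed same-season value:
        # yhat_{T+i} = y_{T+i-m(k+1)}, with k = floor((i-1)/m)
        T = len(y)
        return [y[T + i - m * (((i - 1) // m) + 1) - 1] for i in range(1, h + 1)]

    # e.g. monthly data (m=12): the next 12 forecasts replay the last observed year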
Example 1: can you draw any useful conclusions by looking at this series?
1. Introduction to ARIMA
2. ARIMA Models
3. ARIMA Model selection
4. ARIMAX
While exponential smoothing models are based on a description of the level, trend and seasonality in the data, ARIMA models aim to describe the autocorrelations in the data
Before starting with ARIMA models, it is useful to take a look at a preliminary concept: what is a linear regression model?
§ ARIMA models are, in theory, the most general class of models for forecasting a time series which can be "stationarized" by transformations such as differencing and lagging
§ The easiest way to think of ARIMA models is as fine-tuned versions of random-walk models: the fine-tuning consists of adding lags of the differenced series and/or lags of the forecast errors to the prediction equation, as needed to remove any remaining autocorrelation from the forecast errors
where:
§ p is the number of autoregressive terms
§ d is the number of non-seasonal differences
§ q is the number of lagged forecast errors in the equation
§ P is the number of seasonal autoregressive terms
§ D is the number of seasonal differences
§ Q is the number of seasonal lagged forecast errors in the equation
§ s is the seasonal period (cycle frequency using R terminology)
→ In the next slides we will explain each single component of ARIMA models!
The term autoregression indicates that it is a regression of the variable against itself
§ An autoregressive model of order p, denoted AR(p), can be written as y_t = c + φ_1·y_{t−1} + φ_2·y_{t−2} + ⋯ + φ_p·y_{t−p} + ε_t
Rather than using past values of the forecast variable in a regression, a Moving Average model uses past forecast errors in a regression-like model: y_t = c + ε_t + θ_1·ε_{t−1} + θ_2·ε_{t−2} + ⋯ + θ_q·ε_{t−q}
The lagged values of ε_t are not actually observed, so it is not a standard regression
Moving average models should not be confused with moving average smoothing (the process used in classical decomposition in order to obtain the trend component) → a moving average model is used for forecasting future values, while moving average smoothing is used for estimating the trend-cycle of past values
y_t = c + φ_1·y_{t−1} + φ_2·y_{t−2} + ⋯ + φ_p·y_{t−p} + θ_1·ε_{t−1} + θ_2·ε_{t−2} + ⋯ + θ_q·ε_{t−q} + ε_t
(autoregressive component of order p + moving average component of order q)
[Plots: an ARMA(2,1) process example, equal to ARIMA(2,0,1), with φ1=0.5, φ2=0.4, θ1=0.8; an ARIMA(2,1,1) process example with φ1=0.5, φ2=0.4, θ1=0.8]
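Example processes like these can be simulated with statsmodels; a sketch for the ARMA(2,1) case above:

    import numpy as np
    from statsmodels.tsa.arima_process import ArmaProcess

    # ARMA(2,1) with phi1=0.5, phi2=0.4, theta1=0.8 (the slide's coefficients)
    ar = np.array([1, -0.5, -0.4])   # AR polynomial: 1 - phi1*L - phi2*L^2
    ma = np.array([1, 0.8])          # MA polynomial: 1 + theta1*L
    sim = ArmaProcess(ar, ma).generate_sample(nsample=500)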
General rules for model identification based on ACF and PACF plots:
The data may follow an ARIMA(p, d, 0) model if the ACF and PACF plots of the differenced data show the following patterns:
§ the ACF is exponentially decaying or sinusoidal
§ there is a significant spike at lag p in the PACF, but none beyond lag p
The data may follow an ARIMA(0, d, q) model if the ACF and PACF plots of the differenced data show the following patterns:
§ the PACF is exponentially decaying or sinusoidal
§ there is a significant spike at lag q in the ACF, but none beyond lag q
→ For a general ARIMA(p, d, q) model (with both p and q ≥ 1), both the ACF and PACF plots show exponential or sinusoidal decay, and it is more difficult to understand the structure of the model
[ACF/PACF pattern examples: AR(2) with φ1 > 0, φ2 > 0; AR(2) with φ1 < 0, φ2 > 0; MA(1) with θ1 > 0; MA(1) with θ1 < 0]
A special case of ARIMA models allows you to generate forecasts that depend both on the historical data of the target time series (Y) and on other exogenous variables (X_k) → we call them ARIMAX models
§ This is not possible with other classical time series analysis techniques (e.g. ETS), where the prediction depends only on past observations of the series itself
§ The advantage of ARIMAX models therefore consists in the possibility of including additional explanatory variables on top of the lags of the target dependent variable
where s = number of periods per season (i.e. the frequency of seasonal cycle)
We use uppercase notation for the seasonal parts of the model, and lowercase
notation for the non-seasonal parts of the model
The seasonal part of an AR or MA model will be seen in the seasonal lags of the
PACF and ACF
This technique finds the values of the parameters which maximize the probability of obtaining the data that we have observed → for given values of (p, d, q)(P, D, Q) (i.e. the model order), the algorithm will try to maximize the log likelihood when finding the parameter estimates
Input: Time series, specified orders, estimation method
Output: ARIMA model; model performance statistics; model residuals
Input: ARIMA model
Output: Forecasted values and their standard errors; in-sample predictions
Predicts the differenced (linear) or the original (level) time series if I > 0
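Continuing the sketch above, the fitted model yields out-of-sample forecasts with standard errors as well as in-sample predictions:

    forecast = fit.get_forecast(steps=12)
    print(forecast.predicted_mean)   # forecasted values (level)
    print(forecast.se_mean)          # their standard errors

    insample = fit.get_prediction()  # in-sample (one-step-ahead) predictions
    print(insample.predicted_mean)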
Input: Time series
Output: Model performance statistics and the best model; model residuals
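For an automated order search comparable to this node there is, for example, the pmdarima package (an assumption: it is not part of the course setup):

    import pmdarima as pm

    # Searches over (p,d,q)(P,D,Q) orders and keeps the model with the best AIC
    best = pm.auto_arima(ts, seasonal=True, m=12, stepwise=True, trace=True)
    print(best.summary())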
§ Inspect the ACF plot, residuals plot, Ljung-Box test statistics, and normality measures → are the residuals stationary and normally distributed?
Input: ARIMA residuals
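The same residual checks, sketched with statsmodels and scipy using the residuals variable from the ARIMA sketch above:

    from statsmodels.stats.diagnostic import acorr_ljungbox
    from scipy import stats

    lb = acorr_ljungbox(residuals, lags=[12], return_df=True)
    print(lb)                        # large p-values -> no remaining autocorrelation
    print(stats.shapiro(residuals))  # normality test on the residuals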
Agenda
7. Machine Learning based Models
8. Hyperparameter Optimization
9. Quick Look at LSTM Networks
10. Example of Time Series Analysis on Spark
§ 1 vs. 10 vs. 100: how does the performance compare?
§ Can we automate the selection?
§ The algorithms to train such networks are not new, but they have been enabled
by recent advances in hardware performance and parallel execution.
[Diagram: feed-forward neural network mapping input nodes x2, …, x11 to output nodes y1, y2, y3]
§ The KNIME Deep Learning integration that we will use is based on Keras
§ We need to install:
§ KNIME Deep Learning Keras Integration
§ Keras
§ Python
§ The Keras integration includes:
§ Activation Functions
§ Neural Layers (many!!!)
§ Learners / Executors
§ Layer Freezer
§ Network Reader/Writer
§ Network Converter to TensorFlow
[Network sketch: k LSTM units fed with x(t), followed by a ReLU activation]
§ Out-of-sample dynamic testing over 100 hours
[Table: RMSE, MAE, MAPE, R² of the predictions against the past values]
§ The Recursive Loop Start and End nodes pass data back to the start of the loop with every iteration.
§ This enables us to generate predictions based on predictions (see the sketch below).
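The same idea in plain Python, as a sketch; model_predict stands for a hypothetical one-step-ahead prediction function:

    def recursive_forecast(model_predict, history, horizon):
        # Each prediction is appended to the inputs, so later forecasts are
        # generated from earlier forecasts, as in the Recursive Loop construct
        history = list(history)
        predictions = []
        for _ in range(horizon):
            y_hat = model_predict(history)   # hypothetical one-step-ahead model
            predictions.append(y_hat)
            history.append(y_hat)            # the prediction becomes an input
        return predictions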
[Apache Spark stack: Spark SQL, Spark Streaming, MLlib, GraphX]
https://www.knime.com/blog/time-series-analysis-a-simple-example-with-knime-and-spark
https://kni.me/w/b-rFpW9Oueg0GhuN https://kni.me/w/vEaDHqWycVG-42ti
§ In real-time data streaming, big volumes of data are received and processed quickly, as soon as they are available.
§ "Quickly" and "as soon as available" are two important factors, since they allow a reaction to changing conditions in real time.
[Diagram: a receiver handling REST requests and responses]