Time Series Model
Definition
A time series is a set of data points ordered in time, where time is the independent
variable. A time series model captures the structure of such data in order to analyze it and
forecast future values. Many time series models are built on the idea that a prediction is a
weighted sum of past observations.
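To make that idea concrete, here is a tiny sketch, with made-up numbers and hypothetical weights, of a prediction formed as a weighted sum of past observations:

```python
import numpy as np

# Three most recent observations (most recent last) -- illustrative values only
past = np.array([12.0, 13.5, 14.0])
# Hypothetical weights that favor recent observations; they sum to 1
weights = np.array([0.2, 0.3, 0.5])

# The forecast is the weighted sum of the past observations
prediction = float(weights @ past)
print(prediction)  # 0.2*12.0 + 0.3*13.5 + 0.5*14.0 = 13.45
```

Different models differ mainly in how these weights are chosen and how many past observations they use.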
However, there are other aspects that come into play when dealing with time series.
Is it stationary?
Is there a seasonality?
Is the target variable autocorrelated?
In this post, I’ll introduce different characteristics of time series and how we can model
them to obtain forecasts that are as accurate as possible.
Autocorrelation
Informally, autocorrelation is the similarity between observations as a function of the time
lag between them.
Above is an example of an autocorrelation plot. If you look closely, you’ll see that the first
value and the 24th value have a high autocorrelation. Similarly, the 12th and 36th
observations are highly correlated. This means that we will find a very similar value every
24 units of time.
Notice how the plot looks like a sinusoidal function. This is a hint for seasonality, and you
can find its value by finding the period in the plot above, which would give 24 hours.
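To make this concrete, here is a small sketch using a synthetic hourly series with a 24-hour cycle (the data is made up for illustration), showing how the autocorrelation peaks at the period and turns negative at the half-period:

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic hourly data: a 24-hour cycle plus noise
rng = np.random.default_rng(42)
t = np.arange(24 * 14)
series = pd.Series(np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size))

# Autocorrelation is high at the period (lag 24) and strongly negative at the
# half-period (lag 12), where the cycle is in opposite phase
for lag in (1, 12, 24, 36):
    print(lag, round(series.autocorr(lag=lag), 2))
```

The sinusoidal shape of these values across lags is exactly the seasonality hint described above.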
Seasonality
Seasonality refers to periodic fluctuations. For example, electricity consumption is high
during the day and low during night, or online sales increase during Christmas before
slowing down again.
Example of seasonality.
As you can see above, there is a daily seasonality. Every day, you see a peak towards the
evening, and the lowest points are the beginning and the end of each day.
Remember that seasonality can also be derived from an autocorrelation plot if it has a
sinusoidal shape. Simply look at the period, and it gives the length of the season.
Stationarity
Stationarity is an important characteristic of time series. A time series is said to be
stationary if its statistical properties don’t change over time. In other words, it has a
constant mean and variance, and its covariance is independent of time.
Looking at the same plot, we see that the process above is stationary. The mean and
variance don’t vary over time.
Often, stock prices are not a stationary process. We might see a growing trend, or its
volatility might increase over time (meaning that variance is changing).
Ideally, we’d want a stationary time series for modeling. Of course, not all time series are
stationary, but we can apply transformations to make them so.
Moving Average
The moving average model is probably the most naive approach to time series modeling.
This model simply states that the next observation is the mean of all past observations.
While simple, this model can be surprisingly effective, and it represents a good starting
point.
Otherwise, the moving average can be used to identify interesting trends in the data. We
can define a window to apply the moving average model to smooth the time series and
highlight different trends.
In the plot above, we applied the moving average model to a 24-hour window. The green
line smoothed the time series, and we can see that there are two peaks in a 24-hour
period.
Of course, the longer the window, the smoother the trend will be. Below is an example of a
moving average in a smaller window.
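This window effect can be sketched with pandas’ `rolling` (the hourly data here is synthetic, for illustration only):

```python
import numpy as np
import pandas as pd

# A week of synthetic hourly data: a daily cycle plus noise
rng = np.random.default_rng(7)
t = np.arange(24 * 7)
noisy = pd.Series(np.sin(2 * np.pi * t / 24) + 0.5 * rng.standard_normal(t.size))

# A short window smooths out some noise; a 24-hour window also averages
# out the daily cycle, leaving only the longer-term trend
smooth_short = noisy.rolling(window=6).mean()
smooth_long = noisy.rolling(window=24).mean()

print(noisy.std(), smooth_short.std(), smooth_long.std())
```

The standard deviation shrinks as the window grows, which is the smoothing effect described above.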
Exponential Smoothing
Exponential smoothing uses similar logic to moving average, but this time, a different
decreasing weight is assigned to each observation. In other words, less importance is
given to observations as we move further from the present.
Mathematically, simple exponential smoothing is expressed as:

ŷ_t = α·x_t + (1 − α)·ŷ_{t−1}

Here, alpha is a smoothing factor that takes values between zero and one. It determines
how fast the weight decreases for previous observations.
From the plot above, the dark blue line represents the exponential smoothing of the time
series using a smoothing factor of 0.3, while the orange line uses a smoothing factor of
0.05.
As you can see, the smaller the smoothing factor, the smoother the time series will be. This
makes sense, because as the smoothing factor approaches zero, we approach the moving
average model.
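The recursion itself is short; here is a minimal sketch (the sample values are made up):

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: blend each new observation with the
    previous smoothed value."""
    result = [series[0]]  # start from the first observation
    for x in series[1:]:
        result.append(alpha * x + (1 - alpha) * result[-1])
    return result

data = [3.0, 10.0, 12.0, 13.0, 12.0, 10.0, 12.0]
print(exponential_smoothing(data, 0.9))  # follows the data closely
print(exponential_smoothing(data, 0.1))  # much smoother
```

With alpha near one the output tracks the raw series; with alpha near zero it barely moves, as discussed above.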
Double exponential smoothing extends this idea by adding a second recursion for the trend:

ℓ_t = α·x_t + (1 − α)·(ℓ_{t−1} + b_{t−1})
b_t = β·(ℓ_t − ℓ_{t−1}) + (1 − β)·b_{t−1}
ŷ_{t+1} = ℓ_t + b_t

Here, beta is the trend smoothing factor, and it also takes values between zero and one.
Below, you can see how different values of alpha and beta affect the shape of the time
series.
To build up to SARIMA, we start from an autoregressive model AR(p), where the parameter p
is the number of past observations included in the regression. Next, we add the moving
average model MA(q). This takes a parameter q, which represents the biggest lag after
which other lags are not significant on the autocorrelation plot.
Below, q would be four.
After that, we’ll add the order of integration I(d). The parameter d represents the number
of differences required to make the series stationary.
Finally, we add the seasonal component S(P, D, Q, s), where s is simply the season’s
length. P and Q play the same roles as p and q, but for the seasonal component, and D is
the order of seasonal integration: the number of differences required to remove
seasonality from the series.
Combining all of these, we get the SARIMA(p, d, q)(P, D, Q, s) model.
The main takeaway is this: Before modeling with SARIMA, we must apply transformations
to our time series to remove seasonality and any non-stationary behaviors.
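A small sketch of those transformations on synthetic data (the choices d=1, D=1, s=7 are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic daily series: linear trend + weekly seasonality + noise
rng = np.random.default_rng(3)
t = np.arange(200)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 7) + rng.standard_normal(t.size))

# d=1: a first difference removes the linear trend
# D=1 with s=7: a seasonal difference removes the weekly pattern
stationary = y.diff(1).diff(7).dropna()
print(round(stationary.mean(), 3), round(stationary.std(), 3))
```

After both differences, the series hovers around a constant mean with roughly constant variance, which is what SARIMA needs.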
That was a lot of theory to wrap our heads around. Let’s explore some applications and
examples of time series models before learning how to apply the techniques discussed
above.
Determining Patterns
Businesses that rely on seasonal sales, monthly online traffic spikes and other repetitive
behaviour can establish expectations based on time series models, gauging their overall
health and performance.
Detecting Anomalies
Time series models also allow organizations to more easily spot data shifts that may
signal unusual behaviour or changes in the market.
Healthcare
Time series models can be used to monitor the spread of diseases by observing how many
people transmit a disease and how many people die after being infected.
Agriculture
Time series models take into account seasonal temperatures, the number of rainy days
each month and other variables over the course of years, allowing agricultural workers to
assess environmental conditions and ensure a successful harvest.
Finance
Financial analysts can leverage time series models to record sales numbers for each month
and predict potential stock market behaviour.
Cybersecurity
IT and cybersecurity teams can develop patterns in user behaviour with time series
models, allowing them to be aware of when behaviour doesn’t align with normal trends.
Retail
Retailers may apply time series models to study how other companies’ prices and the
number of customer purchases change over time, helping them optimize prices.
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

DATAPATH = 'data/stock_prices_sample.csv'
data = pd.read_csv(DATAPATH)
data.head()
As you can see, we have a few entries concerning a different stock than the New Germany
Fund (GF). Also, we have an entry concerning intraday information, but we only want end
of day (EOD) information.
2. Clean the Data
First, we’ll remove unwanted entries:
data = data[data.TICKER != 'GEF']
data = data[data.TYPE != 'Intraday']
Then, we’ll remove unwanted columns, as we solely want to focus on the stock’s closing
price.
If you preview the dataset with data.head(), you should see:
3. Plot the Data
We’ll plot the closing price over the entire time period of our data set:
plt.figure(figsize=(17, 8))
plt.plot(data.CLOSE)
plt.title('Closing price of New Germany Fund Inc (GF)')
plt.ylabel('Closing price ($)')
plt.xlabel('Trading day')
plt.grid(False)
plt.show()
You should get:
Closing price of the New Germany Fund (GF).
Clearly, this is not a stationary process, and it’s hard to tell if there is some kind of
seasonality.
4. Moving Average
Let’s use the moving average model to smooth our time series. For that, we’ll rely on a
helper function that runs the moving average over a specified time window and plots the
resulting smoothed curve:
def plot_moving_average(series, window, plot_intervals=False, scale=1.96):
    # Smooth the series with a rolling mean over the given window
    rolling_mean = series.rolling(window=window).mean()
    plt.figure(figsize=(17,8))
    plt.title('Moving average\n window size = {}'.format(window))
    plt.plot(rolling_mean, 'g', label='Rolling mean trend')
    plt.plot(series[window:], label='Actual values')
    plt.legend(loc='best')
    plt.grid(True)
Trends are easier to spot now. Notice how the 30-day and 90-day trends show a downward
curve at the end. This might mean that the stock is likely to go down in the following days.
5. Exponential Smoothing
Now, let’s use exponential smoothing to see if it can pick up a better trend.
def exponential_smoothing(series, alpha):
    result = [series[0]]  # first value is the same as in the series
    for x in series[1:]:
        result.append(alpha * x + (1 - alpha) * result[-1])
    return result

def plot_exponential_smoothing(series, alphas):
    plt.figure(figsize=(17, 8))
    for alpha in alphas:
        plt.plot(exponential_smoothing(series, alpha), label="Alpha {}".format(alpha))
    plt.plot(series.values, "c", label="Actual")
    plt.legend(loc="best")
    plt.axis('tight')
    plt.title("Exponential Smoothing")
    plt.grid(True);
As you can see, an alpha value of 0.05 smoothed the curve while picking up most of the
upward and downward trends.
Now, let’s use double exponential smoothing.
def double_exponential_smoothing(series, alpha, beta):
    result = [series[0]]
    for n in range(1, len(series)+1):
        if n == 1:
            level, trend = series[0], series[1] - series[0]
        if n >= len(series):  # forecasting
            value = result[-1]
        else:
            value = series[n]
        last_level, level = level, alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        result.append(level + trend)
    return result
def plot_double_exponential_smoothing(series, alphas, betas):
    plt.figure(figsize=(17, 8))
    for alpha in alphas:
        for beta in betas:
            plt.plot(double_exponential_smoothing(series, alpha, beta),
                     label="Alpha {}, beta {}".format(alpha, beta))
    plt.plot(series.values, label="Actual")
    plt.legend(loc="best")
    plt.axis('tight')
    plt.title("Double Exponential Smoothing")
    plt.grid(True)
7. Modeling
As outlined previously, we must turn our series into a stationary process in order to model
it. Therefore, let’s apply the Dickey-Fuller test to see if it is a stationary process:
def tsplot(y, lags=None, figsize=(12, 7), style='bmh'):
    with plt.style.context(style):
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))

        y.plot(ax=ts_ax)
        p_value = sm.tsa.stattools.adfuller(y)[1]
        ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
        plt.tight_layout()
tsplot(data.CLOSE, lags=30)
You should see:
By the Dickey-Fuller test, the time series is unsurprisingly non-stationary. Also, looking at
the autocorrelation plot, we see that it’s very high, and there seems to be no clear
seasonality.
To get rid of the high autocorrelation and make the process stationary, let’s take the first
difference: we simply subtract the time series from itself with a lag of one day.
data_diff = data.CLOSE - data.CLOSE.shift(1)
tsplot(data_diff[1:], lags=30)
And we get:
8. SARIMA
from itertools import product

#Set initial values and some bounds
ps = range(0, 5)
d = 1
qs = range(0, 5)
Ps = range(0, 5)
D = 1
Qs = range(0, 5)
s = 5

#Create a list with all possible combinations of parameters
parameters_list = list(product(ps, qs, Ps, Qs))

def optimize_SARIMA(parameters_list, d, D, s):
    """Return a DataFrame with parameters and corresponding AIC"""
    results = []
    best_aic = float('inf')

    for param in parameters_list:
        try:
            model = sm.tsa.statespace.SARIMAX(data.CLOSE,
                                              order=(param[0], d, param[1]),
                                              seasonal_order=(param[2], D, param[3], s)).fit(disp=-1)
        except:
            continue
        aic = model.aic
        #Keep track of the best (lowest) AIC
        if aic < best_aic:
            best_aic = aic
        results.append([param, aic])

    result_table = pd.DataFrame(results)
    result_table.columns = ['parameters', 'aic']
    #Sort in ascending order, lower AIC is better
    result_table = result_table.sort_values(by='aic',
                                            ascending=True).reset_index(drop=True)

    return result_table

result_table = optimize_SARIMA(parameters_list, d, D, s)

#Set parameters that give the lowest AIC (Akaike Information Criteria)
p, q, P, Q = result_table.parameters[0]

best_model = sm.tsa.statespace.SARIMAX(data.CLOSE, order=(p, d, q),
                                       seasonal_order=(P, D, Q, s)).fit(disp=-1)
print(best_model.summary())
Now, for SARIMA, we first need to define a few parameters and a range of values for other
parameters to generate a list of all possible combinations of p, q, d, P, Q, D, s.
Now, in the code cell above, we have 625 different combinations. We will try each
combination and train SARIMA with each to find the best-performing model. This might
take a while depending on your computer’s processing power.
Once this is done, we’ll print out a summary of the best model, and you should see:
Comparison of predicted (orange line) and actual (blue line) closing prices
It seems that we are a bit off in our predictions. In fact, the predicted price is essentially
flat, meaning that our model is probably not performing well.
Again, this is not due to our procedure, but to the fact that predicting stock prices is
essentially impossible.
From the first project, we learned the entire procedure of making a time series stationary
before using SARIMA to model. It’s a long and tedious process with a lot of manual
tweaking.
Now, let’s introduce Facebook’s Prophet. It’s a forecasting tool available in
both Python and R. This tool allows both experts and non-experts to produce high-quality
forecasts with minimal effort.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
DATAPATH = 'data/AirQualityUCI.csv'
As you can see, the data set contains information about the concentrations of different
gases, recorded every hour of every day.
If you explore the data set a bit more, you’ll notice several instances of the value -200. Of
course, a negative concentration does not make sense, so we need to clean the data before
modeling.
# Average only the valid readings, ignoring the -200 missing-value placeholders
def positive_average(num):
    return num[num > -200].mean()

# Aggregate data
daily_data = data.drop('Time', axis=1).groupby('Date').apply(positive_average)
3. Modeling
# Drop irrelevant columns
cols_to_drop = ['PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'PT08.S4(NO2)',
                'PT08.S5(O3)', 'T', 'RH', 'AH']
daily_data = daily_data.drop(cols_to_drop, axis=1)
# Import Prophet
from fbprophet import Prophet
import logging
logging.getLogger().setLevel(logging.ERROR)

# Prophet expects a dataframe with a date column 'ds' and a value column 'y'
df = daily_data.reset_index().rename(columns={'Date': 'ds', 'NOx(GT)': 'y'})

# Hold out the last prediction_size days for evaluation
prediction_size = 30
train_df = df[:-prediction_size]

# Fit the model
m = Prophet()
m.fit(train_df)

# Make predictions
future = m.make_future_dataframe(periods=prediction_size)
forecast = m.predict(future)
forecast.head()

# Plot forecast
m.plot(forecast)
Here, yhat represents the prediction, while yhat_lower and yhat_upper represent the lower
and upper bound of the prediction respectively. Prophet allows you to easily plot the
forecast, and we get:
NOx concentration forecast
As you can see, Prophet simply used a straight downward line to predict the concentration
of NOx in the future.
Then, we can check if the time series has any interesting features, such as seasonality: