Multivariate Time Series Analysis With Python For Forecasting and Modeling
Multivariate Time Series Analysis With Python For Forecasting and Modeling
Time is the most critical factor in data science and machine learning, influencing
whether a business thrives or falters. That’s why we see sales in stores and e-
series of spending data over years to discern the optimal time to open the gates and
But how can you, as a data scientist, perform this analysis? Don’t worry, you don’t
need to build a time machine! Time Series modeling is a powerful technique that
the web deal with univariate time series. Unfortunately, real-world use cases don’t
work like that. There are multiple variables at play, and handling all of them at the
Learning Objectives
Understand what a multivariate time series is and how to deal with it.
Table of contents
Univariate Vs. Multivariate Time Series Forecasting
Python
This article assumes some familiarity with univariate time series, their properties,
and various techniques used for forecasting. Since this article will be focused on
multivariate time series, I would suggest you go through the following articles, which
into the details of a multivariate time series. Let’s look at them one by one to
A univariate time series, as the name suggests, is a series with a single time-
dependent variable.
For example, have a look at the sample dataset below, which consists of the
temperature values (each hour) for the past 2 years. Here, the temperature is the
past values and try to gauge and extract a pattern. We would notice that the
temperature is lower in the morning and at night while peaking in the afternoon.
Also, if you have data for the past few years, you would observe that it is colder
during the months of November to January while being comparatively hotter from
April to June.
Such observations will help us in predicting future values. Did you notice that we
used only one variable (the temperature of the past 2 years)? Therefore, this is
depends not only on its past values but also has some dependency on other
Consider the above example. Suppose our dataset includes perspiration percent,
dew point, wind speed, cloud cover percentage, etc., and the temperature value for
the past two years. In this case, multiple variables must be considered to predict
temperature optimally. A series like this would fall under the category of multivariate
In this section, I will introduce you to one of the most commonly used methods for
In a VAR algorithm, each variable is a linear function of the past values of itself and
the past values of all the other variables. To explain this in a better manner, I’m
We have two variables, y1, and y2. We need to forecast the value of these two
variables at a time ‘t’ from the given data for past n values. For simplicity, I have
Here,
These equations are similar to the equation of an AR process. Since the AR process
is used for univariate time series data, the future values are linear combinations of
y(t) = a + w*y(t-1) +e
In this case, we have only one variable – y, a constant term – a, an error term – e,
equation for VAR, we will use vectors. We can write equations (1) and (2) in the
following form:
The two variables are y1 and y2, followed by a constant, a coefficient metric, a lag
value, and an error metric. This is the vector equation for a VAR(1) process. For a
VAR(2) process, another vector term for time (t-2) will be added to the equation to
The above equation represents a VAR(p) process with variables y1, y2 …yk. The
multivariate time series, εt should be a continuous random vector that satisfies the
following conditions:
1. E(εt) = 0
2. E(εt1,εt2‘) = σ12
The expected value of εt and εt‘ is the standard deviation of the series.
simple univariate forecasting methods like AR. Since the aim is to predict the
temperature, we can simply remove the other variables (except temperature) and fit
Another simple idea is to forecast values for each series individually using the
techniques we already know. This would make the work extremely straightforward!
Then why should you learn another forecasting technique? Isn’t this topic
understand and use the relationship between several variables. This is useful
for describing the dynamic behavior of the data and also provides better forecasting
Granger’s causality test can be used to identify the relationship between variables
The test in mathematics yields a p-value for the variables. If the p-value exceeds
0.05, the null hypothesis must be accepted. Conversely, if the p-value is less than
We know from studying the univariate concept that a stationary time series will,
more often than not, give us a better set of predictions. If you are not familiar with
the concept of stationarity, please go through this article first: A Gentle Introduction
y(t) = c*y(t-1) + ε t
The series is said to be stationary if the value of |c| < 1. Now, recall the equation of
The coefficient of y(t) is called the lag polynomial. Let us represent this as Φ(L):
modulus. This might seem complicated, given the number of variables in the
derivation. This idea has been explained using a simple numerical example in the
Johansen’s test for checking the stationarity of any multivariate time series data. We
will see how to perform the test in the last section of this article.
Train-Validation Split
If you have worked with univariate time series data before, you’ll be aware of the
Creating a validation set for time series problems is tricky because we have to take
into account the time component. One cannot directly use the train_test_split or k-
fold validation since this will disrupt the pattern in the series. The validation set
Suppose we have to forecast the temperate diff, dew point, cloud percent, etc., for
the next two months using data from the last two years. One possible method is to
keep the data for the last two months aside and train the model for the remaining 22
months.
Once the model has been trained, we can use it to make predictions on the
validation set. Based on these predictions and the actual values, we can check how
well the model performed and the variables for which the model did not do so well.
And for making the final prediction, use the complete dataset (combine the training
Python Implementation
In this section, we will implement the Vector AR model on a toy dataset. I have used
the Air Quality dataset for this and you can download it from here.
Python Code
The data type of the Date_Time column is object, and we need to change it
to datetime. Also, for preparing the data, we need the index to have datetime.
The next step is to deal with the missing values. It is not always wise to use
df.dropna. Since the missing values in the data are replaced with a value of -200,
we will have to impute the missing value with a better number. Consider this – if the
present dew point value is missing, we can safely assume that it will be close to the
value of the previous hour. That makes sense, right? Here, I will impute -200 with
You can choose to substitute the value using the average of a few previous values
or the value at the same time on the previous day (you can share your idea(s) of
We can now go ahead and create the validation set to fit the model and test its
performance.
model = VAR(endog=train)
model_fit = model.fit()
Making it Presentable
The predictions are in the form of an array, where each list represents the
predictions of the row. We will transform this into a more presentable format.
#check rmse
for i in cols:
print('rmse value for', i, 'is : ', sqrt(mean_squared_error(pred[i], valid[i])))
Output:
After the testing on the validation set, let’s fit the model on the complete dataset.
We can test the performance of our model by using the following methods:
balancing the fit of the model to the data with the complexity of the model.
AIC provides a way to compare different models and choose the one that
2. Bayesian information criterion (BIC): This stats measure is used for model
criterion (AIC), BIC provides a trade-off between the goodness of fit and
Conclusion
Before I started this article, the idea of working with a multivariate time series
seemed daunting in its scope. It is a complex topic, so take your time to understand
the details. The best way to learn is to practice, so I hope the above Python
I encourage you to use this approach on a dataset of your choice. This will further
cement your understanding of this complex yet highly useful topic. If you have any
Key Takeaways
Multivariate time series analysis involves the analysis of data over time that
Vector Auto Regression (VAR) is a popular model for multivariate time series
VAR models can be used for forecasting and making predictions about the