Use of Linear Regression For Time Series Prediction
Contents

0 Group Introduction
  0.1 Data
  0.2 What is an ETF?
  0.4 Overview of Each Author's Questions
1 Gerardo
  1.1 Executive Summary
  1.2 Data
    1.2.1 Transformed Starting Data
    1.2.2 Transformation Process
  1.3 Models
    1.3.1 Models Using Variable Engineering
  1.4 Residuals Study
  1.5 Conclusion
2 Femi
  2.1 Executive Summary
  2.2 Data
    2.2.1 Data Processing and Feature Engineering
    2.2.2 Transformation Process
  2.3 Models Building
    2.3.1 Log Transformation
  2.4 Residuals Study
  2.5 Conclusion
3 Spoorthi
  3.1 Executive Summary
  3.2 Data
    3.2.1 Data Processing and Feature Engineering
    3.2.2 Transformation Process
  3.3 Models Building
  3.4 Residuals Study
  3.5 Conclusion
4 Akanksha
  4.1 Executive Summary
  4.2 Data
  4.3 Model Outcome
  4.4 Feature Selection
  4.5 Feature Engineering
    4.5.1 Model 1
  4.6 Model 2
  4.7 Analysis of Model 2
    4.7.1 Correlation
  4.8 Multiple Regression Analysis
  4.9 Model Utility Test
  4.10 Complete Second-Order Model with Interaction Terms
  4.11 Model Utility Test
  4.12 Final Model Residuals
    4.12.1 Conclusion
5 Cristian
6 Team Conclusion
0 Group Introduction
We will explore the efficacy of linear regression for time series analysis of stock data. Each author applied linear regression methods to answer specific questions.
0.1 Data
Our base data consisted of daily market values from January 2000 to February 2020 for SPY (S&P 500 Index ETF), XLK (Technology ETF), XLE (Energy ETF), VIX (Volatility Index), and additional single-name equities. Each market day consisted of the following:
• Open, High, Low, Close: the raw daily prices for each day i.
• Adj. Close: the price adjusted (for dividends, stock splits, etc.) for each day i.
• Volume: the number of shares traded on day i.
Each author then took the base data and engineered custom features to help in
answering their questions.
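As a minimal sketch of this starting point (assuming a CSV export from Yahoo Finance; the file name and the engineered target column are illustrative):

```python
import pandas as pd

# Load one ticker's daily history (assumed CSV export from Yahoo Finance
# with the standard Date/Open/High/Low/Close/Adj Close/Volume columns).
spy = pd.read_csv("SPY.csv", parse_dates=["Date"], index_col="Date")

# A typical engineered target: the next market day's adjusted close.
spy["Next_Close"] = spy["Adj Close"].shift(-1)
spy = spy.dropna()  # the final row has no "next day" and is dropped
```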
0.2 What is an ETF?
An ETF (exchange-traded fund) is an investment fund that allows you to buy and sell baskets of securities without having to buy the individual components. The structure is similar to a mutual fund, but, unlike a mutual fund, ETFs can be bought and sold throughout the day, much like a stock. These funds are built to track specific indexes; here we look at the following: S&P 500, Technology, and Energy.
We build our first model using just the basic "out of the box" variables. In studying the results, we see two variables with high t-values. Even with the inclusion of these two variables, the results are impressive. But what kind of data scientists would we be if our inquiry stopped here? Each of us continues the journey by exploring a specific question that requires different transformations of our base data.
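As an illustration, a minimal sketch of such a baseline fit with statsmodels, reusing the frame from the loading sketch above (the exact "out of the box" variable list may differ):

```python
import statsmodels.api as sm

# Baseline fit on the raw variables only.
X = sm.add_constant(spy[["Open", "High", "Low", "Close", "Volume"]])
baseline = sm.OLS(spy["Next_Close"], X).fit()
print(baseline.summary())  # per-term t-values and overall R-squared
```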
0.4 Overview of Each Author's Questions
1 Gerardo
1.1 Executive Summary

We test and study linear regression and its efficacy in predicting tomorrow's closing price for SPY. Our goal is to create an applicable, tradable model. After testing various models, each with different input variables, we conclude that while a model may appear to have predictive power, in application these models fall short of their goal. The cause is the non-Gaussian distribution of the price changes and of the model's residual errors, which breaks one of the core assumptions of linear regression: that the residual errors follow a Gaussian curve (are normally distributed).
We will walk through the analysis that was conducted and the various attempts to find efficient models, and explain and show the impact that non-normal residual distributions have on a model's results when applied to equity time series.
1.2 Data
Our data has high multicollinearity; that is the nature of the dataset. In attempting to create a tradable model, we embrace this fact and do not remove variables for this transgression alone. We also attempt to create non-correlated variables through variable engineering, using some domain knowledge and bias about how variables affect overall movement (e.g., Volume Line).
1.2.1 Transformed Starting Data

The starting data was modified through the addition of interaction and second-order terms. Here is our final data starting point:
1.2.2 Transformation Process
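The transformation tables were presented as figures; as a rough sketch, terms of this kind can be added with pandas (the specific columns below are hypothetical examples, not our exact variable list):

```python
# Hypothetical examples of added terms; the actual variable list
# appears in the report's tables.
spy["Range_HL"] = spy["High"] - spy["Low"]            # daily spread
spy["Close_x_Volume"] = spy["Close"] * spy["Volume"]  # interaction term
spy["Close_sq"] = spy["Close"] ** 2                   # second-order term
```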
1.3 Models
The first model we created uses all of the engineered variables. Here are the results.
There was no significant improvement in the results, so we next create a model using the stepwise method. Here are the results of the stepwise process:
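A sketch of one common stepwise variant, backward elimination on p-values with statsmodels (our actual stepwise run may have used different criteria):

```python
import statsmodels.api as sm

def backward_stepwise(y, X, alpha=0.05):
    """Refit repeatedly, dropping the least significant predictor
    until every remaining p-value is below alpha."""
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")
        if pvals.max() < alpha:
            return fit
        cols.remove(pvals.idxmax())  # drop the worst predictor
    return None
```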
Here are the results of the model built from the stepwise process. As you can see, there is still no significant improvement. One would think that with results such as these, one could generate a significant level of profit from the market. Unfortunately, that is not the case.
1.4 Residuals Study

This is all due to the distribution of the residual errors, so we now take a look at the residual errors and their impact on profitability. We first examine the density plot of our residuals and overlay a normal distribution density curve. At first glance, our model's residual distribution appears normal. However, a close look at the tails shows that the residuals are not normally distributed.
The blue line is the density curve of our residuals; the black line is a normal density curve.
Looking closely at the tails of the curves, we notice occurrences in our model's residuals where there should not be any: events too many standard deviations away from the mean to qualify for a normal distribution. Our model has the following mean and standard deviation:
In the following histogram, the blue bars are the occurrences of our residuals and the gray area shows the occurrences expected under a normal distribution. Keep in mind the following statistics:

mean: 2.016793e-17
standard deviation: 2.46
2σ ≈ ±4.92
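The fat tails can be quantified directly. A minimal sketch, assuming `baseline` is the fitted model from the earlier sketch: count residuals beyond k standard deviations and compare with the mass a normal distribution places there.

```python
import numpy as np
from scipy import stats

resid = baseline.resid  # residuals of the fitted model from above
mu, sigma = resid.mean(), resid.std()

# Compare the observed frequency of extreme residuals with what a
# normal distribution would allow.
for k in (2, 3, 4):
    observed = np.mean(np.abs(resid - mu) > k * sigma)
    expected = 2 * stats.norm.sf(k)  # two-sided normal tail mass
    print(f"beyond {k} sigma: observed {observed:.4%} vs normal {expected:.4%}")
```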
1.5 Conclusion
We explored the efficacy of linear regression when applied to stock market time series data. Unfortunately, while the models appear to have predictive power, a thorough examination of the results shows that the residuals are not normally distributed. One of linear regression's core assumptions is that the residuals are normally distributed; this alone prevents us from applying linear regression to this type of time series.
2 Femi
2.1 Executive Summary

This section uses a regression model to predict the next day's opening price for AAPL. As Sections 2.3.1 and 2.4 show, the base model seems perfect, but it fails when the residuals are studied. The residuals show increasing variance, and correcting that increase with the inverse sine transformation makes the model more unstable. This leads to investigating a log transformation of the model. Since the dataset consists of prices, taking the logarithm of price helps reduce the variation of the time series. The methodology behind this model includes data engineering, passing the t-test and F-test, stepwise regression, VIF, residual assumptions, removal of outliers, and the Durbin-Watson test, all used to judge the usefulness of the model.
2.2 Data
Due to limited open-source data, the AAPL stock data was taken from Yahoo Finance historical data from 2000 to 2020. The initial data consists of 5,283 observations and 5 variables before processing and feature engineering.

The first new variable created is the next day's opening price (Nopen), derived by shifting the Open price up one row (e.g., Open = [23, 10, 3, 4, 5] gives Nopen = [10, 3, 4, 5, NA]). Because of the shift, the last row's NA is omitted from the dataset, bringing the total observations to 5,282, as sketched below. The next step is to take the log of all base variables (Open, High, Low, Close, and Adj Close). Other variables include the interactions, the market range, and the change in market volume.
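A minimal sketch of this shift-and-log construction with pandas (file name assumed; column names follow Yahoo Finance's export):

```python
import numpy as np
import pandas as pd

aapl = pd.read_csv("AAPL.csv", parse_dates=["Date"], index_col="Date")

# Next day's open: shift the Open column up one row; the final row
# becomes NA and is dropped (5,283 observations -> 5,282).
aapl["Nopen"] = aapl["Open"].shift(-1)
aapl = aapl.dropna(subset=["Nopen"])

# Log-transform the base price variables to stabilise the variance.
for col in ["Open", "High", "Low", "Close", "Adj Close", "Nopen"]:
    aapl["log_" + col.replace(" ", "_")] = np.log(aapl[col])
```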
log_High, log_Low, log_Close: logs of the High, Low, and Close values
log_Open, log_Nopen: logs of the current day's open and the next day's open
2.3 Models Building

The first model is built with all the independent variables. Despite the log transformation, multicollinearity still exists. The model passes the F-test, and the adjusted R² is 0.9999.
The model fails all the necessary residual assumption tests, and multicollinearity is an issue, so stepwise regression is applied. The backward stepwise regression drops O_C, H_L, H_O, and log_Adj_Close, while the forward stepwise regression drops 8 variables.
Since simplicity is key, the forward stepwise regression's final model is chosen. Its results are nearly identical to the original model's.
2.4 Residuals Study

Despite being a good base model, it failed all the residual assumptions, hence the log transformation is considered. In this section we see how the log transformation affects the residuals. The sum of the residuals is 2.65808e-15; the residuals are normally distributed, with constant variance. The Durbin-Watson statistic is 1.96.
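A sketch of fitting the log model and checking these diagnostics. Market_Range, V_H, and V_L are assumed engineered columns; the definitions below are illustrative stand-ins for the range and volume features described in Section 2.2, and the log_* names follow the earlier sketch.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

# Assumed engineered columns (illustrative definitions, not the
# report's exact formulas):
aapl["Market_Range"] = aapl["High"] - aapl["Low"]
aapl["V_H"] = aapl["Volume"] * aapl["High"]  # hypothetical volume term
aapl["V_L"] = aapl["Volume"] * aapl["Low"]   # hypothetical volume term

log_model = smf.ols(
    "log_Nopen ~ log_Close + log_Low + Market_Range"
    " + log_High + log_Open + V_H + V_L",
    data=aapl,
).fit()

print(log_model.resid.sum())           # essentially zero by construction
print(durbin_watson(log_model.resid))  # ~2 means little autocorrelation
```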
2.5 Conclusion
The final model, using the log transformation to predict the next day's opening price, is:

log_Nopen ~ log_Close + log_Low + Market_Range + log_High + log_Open + V_H + V_L
The residuals of the log-transformed model are constant and stable. We also note that multicollinearity still exists in the dataset and that linear regression is not ideal; a time series analysis is suggested.
3 Spoorthi
3.1 Executive Summary

In this project, we used various linear regression models to test the predictability of the next day's close price for XLK, an ETF based on the S&P Technology sector. The prediction of the next day's close price seems correct based on the R², but the residual error is quite high, making it less likely that these models will work. For XLK specifically, we generated multiple models to check which one improves the prediction. We show the models' performance in terms of accuracy, as well as the residuals, which tell us that the models might not be as accurate as we would like.
3.2 Data
The dataset we started from contained only the daily Open, High, Low, Close, Adjusted Close, and Volume of XLK since 2000. With only 6 attributes, this data limits experimentation, so we worked around it with feature engineering.

3.2.1 Data Processing and Feature Engineering

Feature engineering involved adding the features listed below. The attributes created were:
3.2.2 Transformation Process
Moving Average 2 = Moving Average of Adjusted Close price for last 2 days
Moving Average 3 = Moving Average of Adjusted Close price for last 3 days
Moving Average 4 = Moving Average of Adjusted Close price for last 4 days
Moving Average 50 = Moving Average of Adjusted Close price for last 50 days
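A minimal sketch of computing these rolling features with pandas (file name assumed):

```python
import pandas as pd

xlk = pd.read_csv("XLK.csv", parse_dates=["Date"], index_col="Date")

# Rolling means of the adjusted close over the four windows above.
for n in (2, 3, 4, 50):
    xlk[f"Moving_Average_{n}"] = xlk["Adj Close"].rolling(n).mean()
xlk = xlk.dropna()  # early rows lack a full 50-day window
```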
Initially, the idea was to predict the next day's percent change, which is essentially a measure of the volatility expected for the next day. We instead opted to predict the next day's adjusted close price, in part because the day-over-day percent change shows no linear correlation, which makes it difficult to fit a linear model. In the images below, the lack of linearity is visible on the left side, in the percent change over time.
3.3 Models Building

The model we initially built uses all the available independent variables to predict the outcome: tomorrow's adjusted close price. The full model is listed in the image below:
The model has an R-squared of 0.9993, which is very high, but some attributes have high p-values. This suggests that those attributes are not important or do not provide significant information to the model.
We opted for a stepwise regression, in this case a backward regression, to remove the unimportant attributes. This yields a model that still predicts well but without the unnecessary attributes. The image below shows the backward regression:
Based on the above results, we can safely remove DayChangeO.C, RangeH.L, AbsolutePercentChange, and Moving Average 3.
The 1st-order model performed the worst. Adding some interactions improved performance slightly, as sketched below. We also built a 2nd-order model, but it showed no significant improvement, so we do not include those results.
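A hedged sketch of such an interaction fit with the statsmodels formula API, reusing the XLK frame from the earlier sketch (Next_Adj_Close and Pct_Change are illustrative names; the report's exact interaction set may differ):

```python
import statsmodels.formula.api as smf

# '*' adds both main effects and the interaction term;
# Q() quotes the column name containing a space.
xlk["Pct_Change"] = xlk["Adj Close"].pct_change()
xlk["Next_Adj_Close"] = xlk["Adj Close"].shift(-1)
interaction = smf.ols(
    "Next_Adj_Close ~ Q('Adj Close') + Moving_Average_50 * Pct_Change",
    data=xlk.dropna(),
).fit()
print(interaction.rsquared, interaction.resid.std())
```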
The results for the interaction model are:
The R-squared has not improved, but the residual error has improved slightly.

3.4 Residuals Study

We include the residual plots for both models. The residuals for the 1st model, with all the attributes, are below:
The QQ plot shows that a decent number of values fall close to the normal distribution line. The model predicts the middle values well, but at the highs and lows the predictions are worse, as the fat tails at either end of the QQ plot show.
The residuals for the 2nd, interaction model with all the attributes are below:
Its QQ plot is very similar to the 1st-order model's. The model improves slightly, but the same fat-tail issue remains.
3.5 Conclusion
The model based on the moving averages seems to perform well, as it captures a trend in how the market has been performing over the past few days relative to today. The interaction model, using a 50-day moving average with the day's percent change, improves performance slightly. In the end, however, none of the models has a residual value low enough to suggest it will predict the price really well, as the residual studies for both models make apparent.
4 Akanksha
4.1 Executive Summary

I will be looking at stock data and applying regression modeling to derive insights, relationships, and predictions between variables of the stock dataset. This is both interesting and important, as the stock market is the biggest money machine and people lose or gain billions every day. This is an opportunity not only to test out data science skills but also to analyze whether we can actually predict and better guide our investment strategy with historical data and machine learning.

Aim of the model: predict the closing price depending on the type of day (holidays/weekends) and the volatility index. We will run the model at 4 PM to predict the closing price for that day and, depending on this predicted price, either hold/buy or sell shares (see the Data section below).

Questions answered by prediction:
Questions answered by prediction:
2. Can we use the Volatility index (VIX) to improve our previous model for
predicting the closing price of a stock?
4.2 Data
Stock market datasets are mostly focused on prices, volume, and trading behavior. In our dataset, we use the open and close prices of a few stocks along with their trading volume. This dataset is the basis of our experiment and has been used by all the team members in one way or another in this project, so we do not go into detail here.
However, in addition to the base dataset of High, Low, Close, Volume, etc., I have also added dummy variables, derived from the date variable, to mark weekends and holidays.

Lastly, regardless of variables or Model 1/2, our dependent variable is the stock's closing price when markets close, i.e., at 5:00 PM.
Predicting the closing price is the main aim here, to guide the investor to either:

1. HOLD/BUY shares, in case the predicted closing price is higher than the current price. We will make a profit as prices increase, and hence we can sell at a higher price later.

OR

2. SELL shares, in case the predicted closing price is lower than the current price. We need to sell as the prices will go down, and hence we avoid a loss.
4.4 Feature Selection
The above steps will be repeated once we are done with further feature engineering.
4.5 Feature Engineering

4.5.1 Model 1
1. BW – All Fridays (the day before the weekend). The two levels of BW are:
0 – Not Friday
1 – Friday

2. AW – After Weekend – every Monday, the first day after the weekend. The two levels of AW are:
0 – Not Monday
1 – Monday

3. AH – After Holiday – the first market-open day after a holiday (market close). The two levels of AH are:
0 – Regular Day
1 – After Holiday

4. Normal – market days that do not fall under any of the three categories above (Tuesday, Wednesday, Thursday). This is our base level in the categorical variable Type of Day.

A minimal sketch of deriving these dummies from the trading calendar follows.
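The sketch below assumes the Amazon data is indexed by trading date (file name illustrative). The AH rule is a simplification of the definition above: a gap of more than one calendar day before a non-Monday trading day implies a holiday closure, so a Monday after a Friday holiday is still coded AW here.

```python
import pandas as pd

amzn = pd.read_csv("AMZN.csv", parse_dates=["Date"], index_col="Date")
dates = amzn.index  # DatetimeIndex of market-open days only

amzn["BW"] = (dates.dayofweek == 4).astype(int)  # Friday
amzn["AW"] = (dates.dayofweek == 0).astype(int)  # Monday

# Gap in calendar days since the previous trading day; > 1 on a
# non-Monday suggests a holiday closure just before this day.
gap = dates.to_series().diff().dt.days
amzn["AH"] = ((gap > 1) & (amzn["AW"] == 0)).astype(int)
```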
4.6 Model 2
Volatility Index (VIX) – measures how volatile the market is; it is used to gauge the level of risk, fear, or stress in the market. The source for VIX is the daily VIX dataset over the same timeframe, from https://finance.yahoo.com/quote/%5EVIX/history/.
4.7 Analysis of Model 2

4.7.1 Correlation
2. Another tool used to measure the severity of multicollinearity in regression analysis is the Variance Inflation Factor (VIF), a statistic that indicates the increase in the variance of a regression coefficient as a result of collinearity. The VIF scores are shown below. The correlations between variables look good and all VIFs are less than 10, so we can say there are no signs of multicollinearity.
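A sketch of computing these VIF scores with statsmodels, reusing the frame from the dummy-variable sketch (the VIX column is assumed to be merged in from the Yahoo Finance series above):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assumed merge of the daily VIX series onto the trading-date index.
vix = pd.read_csv("VIX.csv", parse_dates=["Date"], index_col="Date")
amzn["VIX"] = vix["Adj Close"]

# VIF per predictor; values above 10 would flag problematic
# multicollinearity.
X = sm.add_constant(amzn[["Open", "Volume", "VIX", "AW", "BW"]])
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```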
4.8 Multiple Regression Analysis

We run the first-order model with our dummy variables to see how the base first-order model performs. The first-order multiple linear regression model relating y to each of the five independent variables is:
1. R² is 0.98: our model's predictors (Xᵢ) explain 98% of the variance of Y (Close).
4.9 Model Utility Test

Hypotheses:

H₀: β₁ = β₂ = ⋯ = β₅ = 0 (none of the model terms is useful for predicting y)
Hₐ: at least one βᵢ ≠ 0 (at least one model term is useful for predicting y)
Assumptions:
3. Standard errors assume that the covariance matrix of the errors is correctly specified.

At a significance level of 0.05, the data provide sufficient evidence that the model is useful in explaining the closing price of Amazon stock using the 5 independent variables. So our linear regression model provides a better fit than the model that does not contain the independent variables AH, BW, AW.
4.10 Complete Second-Order Model with Interaction Terms

After feature selection and multiple iterations with the above variables, we arrive at this final complete second-order model. We include VIX along with the significant predictors from the last first-order model. We keep AW regardless of its p-value because its interaction term is significant. Multicollinearity does not exist, and the VIF of every variable is less than 10.
Equation:

y(Close) = β₀ + β₁·Open + β₂·Volume + β₃·VIX + β₄·AW + β₅·BW + β₆·(Volume × AW) + β₇·VIX²
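A sketch of fitting this equation with the statsmodels formula API, reusing the frame from the VIF sketch (column names assumed to match the dummies and VIX series described earlier):

```python
import statsmodels.formula.api as smf

# ':' adds only the Volume x AW interaction; I() protects the
# squared VIX term inside the formula.
model2 = smf.ols(
    "Close ~ Open + Volume + VIX + AW + BW + Volume:AW + I(VIX**2)",
    data=amzn,
).fit()
print(model2.summary())
```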
1. R² is 0.98 (the same as the first-order model): our model's predictors (Xᵢ) explain 98% of the variance of Y (Close).
4.11 Model Utility Test
Hypotheses:

H₀: β₁ = β₂ = ⋯ = β₇ = 0 (none of the model terms is useful for predicting y)
Hₐ: at least one βᵢ ≠ 0 (at least one model term is useful for predicting y)
4.12 Final Model Residuals
The residuals show no trend, so we can say the model is a good fit. The predictors are significant as well.
4.12.1 Conclusion
1. R² equals 0.98: by taking into account the first-order effects of the predictors (Xᵢ), the interactions between them (XᵢXⱼ), and the second-order effect (Xᵢ²), we explain 98% of the variance of Y (Close). In other words, the complete second-order model explains 98% of the variation in predicting the closing price of Amazon's stock.
2. Model utility – we rejected the null hypothesis above, which means the model is useful in explaining the closing price using the 7 independent variables; hence our model is a good fit.
3. Residual spread – no trends are seen in the residual spread, so we can say the model fits. However, the residuals are not normally distributed.
5 Cristian
6 Team Conclusion
While the models produce strong statistical results, they remain mostly unusable due to the lack of normality in the residuals. In all our models, the distribution of the residuals was fat-tailed. Unfortunately, one of the core assumptions of linear regression is that the residuals are normally distributed. For this reason, we have shown that linear regression is not applicable for this purpose.