
USE OF LINEAR REGRESSION FOR TIME SERIES PREDICTION

March 15, 2021

Spoorthi Chamala, Akanksha Chauhan, Cristian M. Casado,
Gerardo Sandoval, Oluwafemi Shobowale
DePaul University
Data Analysis & Regression
DSC423 OHLC Linear Regression Project
Contents

0 Group Introduction
0.1 Data
0.2 What is an ETF?
0.3 Base Model
0.4 Overview of Each Author's Questions

1 Gerardo
1.1 Executive Summary
1.2 Data
1.2.1 Transformed Starting Data
1.2.2 Transformation Process
1.3 Models
1.3.1 Models Using Variable Engineering
1.4 Residuals Study
1.5 Conclusion

2 Femi
2.1 Executive Summary
2.2 Data
2.2.1 Data Processing and Feature Engineering
2.2.2 Transformation Process
2.3 Model Building
2.3.1 Log Transformation
2.4 Residuals Study
2.5 Conclusion

3 Spoorthi
3.1 Executive Summary
3.2 Data
3.2.1 Data Processing and Feature Engineering
3.2.2 Transformation Process
3.3 Model Building
3.4 Residuals Study
3.5 Conclusion

4 Akanksha
4.1 Executive Summary
4.2 Data
4.3 Model Outcome
4.4 Feature Selection
4.5 Feature Engineering
4.5.1 Model 1
4.6 Model 2
4.7 Analysis of Model 2
4.7.1 Correlation
4.8 Multiple Regression Analysis
4.9 Model Utility Test
4.10 Complete Second-Order Model with Interaction Terms
4.11 Model Utility Test
4.12 Final Model Residuals
4.12.1 Conclusion

5 Cristian

6 Team Conclusion

0 Group Introduction
We explore the efficacy of linear regression in stock-data time series analysis.
Each author applied linear regression methods to answer specific questions.

0.1 Data

Our base data consisted of daily market values from January 2000 to February
2020 for SPY (S&P 500 Index ETF), XLK (Technology ETF), XLE (Energy ETF),
VIX (Volatility Index), and additional single-name equities. Each market day
consisted of the following:

• Open: first trade price for day i.

• High: the highest trade price for day i.

• Low: the lowest trade price for day i.

• Close: the closing trade price for day i.

• Volume: the total number of shares traded on day i.

• Adj. Close: the price adjusted for dividends, stock splits, etc., for each day i.

Each author then took the base data and engineered custom features to help
answer their questions.

0.2 What is an ETF?

An ETF (exchange-traded fund) is an investment fund that allows you to buy and
sell baskets of securities without having to buy the individual components. The
structure is similar to a mutual fund, but unlike a mutual fund, ETFs can be
bought and sold throughout the day, much like a stock. These funds are built to
track specific indexes. In our case we look at the following indexes: S&P 500,
Technology, Energy.

0.3 Base Model

We build our first model using just the basic "out of the box" variables.

In studying the results, we see two variables with high t-values. Even with the
inclusion of these two variables, the results are impressive! But what kind of data
scientists would we be if our inquiry stopped here? We will each continue our
journey by exploring specific questions that require different transformations of
our base data.
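The regression output itself appears in the report only as an image. The report does not state its toolchain, so the following is a minimal sketch of how such a base model could be fit, assuming Python with pandas and statsmodels and a hypothetical CSV of daily SPY data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: daily SPY OHLCV data exported from Yahoo Finance.
df = pd.read_csv("spy_2000_2020.csv", parse_dates=["Date"]).sort_values("Date")
df = df.rename(columns={"Adj Close": "AdjClose"})

# Target: tomorrow's closing price; predictors: today's raw columns.
df["NextClose"] = df["Close"].shift(-1)
df = df.dropna()

base = smf.ols("NextClose ~ Open + High + Low + Close + Volume + AdjClose",
               data=df).fit()
print(base.summary())  # inspect t-values, R^2, and the F-statistic
```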

0.4 Overview of Each Author's Questions

Gerardo: What is the efficacy and impact of embracing multicollinearity within
linear regression, and what is its use in predicting the next day's closing price?

Femi: What is the effect of log transformations on predicting the next day's
opening price? Does the interaction between volume*high and volume*low affect
the model?

Spoorthi: Can linear regression be applied to specific sectors?

Akanksha: Application within seasonality constraints, and a look into the VIX's
impact on predicting prices using linear regression.

Cristian: A study of fundamental data points and their impact on linear
regression models.

1 Gerardo

1.1 Executive Summary

We test and study linear regression and its efficacy in predicting tomorrow's closing
price for SPY. Our goal is to create an applicable, tradable model. After testing
various models, each with different input variables, we accept that while the models
may appear to have predictive power, in application they fall short of their goal.
This is due to the non-Gaussian distribution of the price changes and of the model's
residual errors, which breaks one of the core assumptions of linear regression: the
residual errors must fall within a Gaussian curve (i.e., be normally distributed).

We will walk you through the analysis that was conducted and the various attempts
to find efficient models, and we will explain and show the impact that non-normal
residual distributions have on the model's results when applied to equity time series.

1.2 Data

Our data has high multicollinearity; that is the nature of our dataset. In attempting
to create a tradable model, we embrace this fact and do not remove variables for
this transgression alone. We also attempt to create non-correlated variables through
variable engineering, using some domain knowledge and bias about how variables
affect overall movement (e.g., the Volume Line).

1.2.1 Transformed Starting Data

The starting data was modified through the addition of interaction and
second-order terms. Here is our final data starting point:

1.2.2 Transformation Process

Y_High, Y_Low, Y_Close = yesterday's High, Low, and Close values

\[ \text{Normalized Volume} = \frac{Vol_{i-1} - Vol_{\min}}{Vol_{\max} - Vol_{\min}} \]

\[ \text{Percent Change} = \frac{Close_{i-1} - Close_{i-2}}{Close_{i-2}} \]

\[ \text{Percent Range} = \frac{High_{i-1} - Low_{i-1}}{Low_{i-1}} \]

\[ \text{Intraday Percent Change} = \frac{Close_{i-1} - Open_{i-1}}{Open_{i-1}} \]

\[ \text{Volume Direction} = Vol_{i-1} \cdot \frac{\text{Intraday Pct Change}}{\lvert \text{Intraday Pct Change} \rvert} \]

\[ \text{Volume Line} = \sum_{k=1}^{i} \text{Volume Direction}_k \]
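A sketch of how these engineered variables could be computed, assuming Python with pandas and hypothetical OHLCV column names (np.sign reproduces x/|x|, returning 0 on flat days):

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build the section's engineered variables from OHLCV columns."""
    out = df.copy()
    prev_vol = out["Volume"].shift(1)

    out["NormVolume"] = (prev_vol - prev_vol.min()) / (prev_vol.max() - prev_vol.min())
    out["PctChange"] = out["Close"].shift(1) / out["Close"].shift(2) - 1
    out["PctRange"] = (out["High"].shift(1) - out["Low"].shift(1)) / out["Low"].shift(1)
    out["IntradayPct"] = (out["Close"].shift(1) - out["Open"].shift(1)) / out["Open"].shift(1)

    # Signed volume: yesterday's volume carries the sign of yesterday's move.
    out["VolDirection"] = prev_vol * np.sign(out["IntradayPct"])
    out["VolumeLine"] = out["VolDirection"].cumsum()  # running sum through day i
    return out.dropna()
```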

1.3 Models

1.3.1 Models Using Variable Engineering

The first model we created uses all of the engineered variables. Here are the
results.


There was no significant improvement in the results, so we will now create a
model using the stepwise method. Here are the results of the stepwise process:
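(The stepwise output itself appears as an image.) The report does not say which criterion or software drove the stepwise search; a minimal backward-elimination sketch on p-values, as one hypothetical variant, might look like:

```python
import statsmodels.formula.api as smf

def backward_stepwise(df, target, candidates, alpha=0.05):
    """Refit, dropping the least significant term until all p-values <= alpha."""
    kept = list(candidates)
    while kept:
        model = smf.ols(f"{target} ~ {' + '.join(kept)}", data=df).fit()
        pvals = model.pvalues.drop("Intercept")
        if pvals.max() <= alpha:
            return model
        kept.remove(pvals.idxmax())  # eliminate the weakest predictor and refit
    return None
```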


Here are the results of the model that was built from the stepwise process.

As you can see, there is still no significant improvement. One would think that
with results such as these, one could generate a significant level of profit from
the market. Unfortunately, that is not the case.

This is all due to the distribution of the residual errors. We will now take a look
at the residual errors and their impact on profitability.

1.4 Residuals Study

We first examine the density plot of our residuals with a normal density curve
overlaid. At first glance, our model's residual distribution appears to be normal.
However, when we take a closer look at the tails, we see that the residuals do not
follow a normal distribution.
The blue line is the density curve of our residuals.
The black line is the density curve of a normal distribution.

When we take a closer look at the tails of our curves, we notice occurrences in our
model's residuals where there should be essentially none: residuals that lie too
many standard deviations from the mean to qualify for a normal distribution. In
the following histogram, the blue bars are the occurrences of our residuals and the
gray area shows the occurrences expected under a normally distributed model.
Keep in mind the following statistics:

mean: 2.016793e-17
standard deviation: 2.46
Two σ ≈ ±4.92
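One way to make the fat tails concrete is to compare the observed share of extreme residuals with what a normal distribution predicts. A sketch, assuming SciPy and using the statistics quoted above:

```python
import numpy as np
from scipy import stats

def tail_excess(residuals, k=3.0):
    """Share of |residual - mean| > k*sigma, observed vs. normal expectation."""
    r = np.asarray(residuals)
    sigma = r.std()
    observed = np.mean(np.abs(r - r.mean()) > k * sigma)
    expected = 2 * stats.norm.sf(k)  # two-sided tail mass of a normal
    return observed, expected

# With sigma = 2.46, a 3-sigma event means |r| > 7.38; a normal model assigns
# only ~0.27% of days to that region. Fat tails show up as an observed share
# well above the expected one.
```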

1.5 Conclusion

We explored the efficacy of linear regression when applied to stock market time
series data. Unfortunately, while the models appear to have predictive power,
thoroughly exploring the results reveals that the residuals are not normally
distributed. One of linear regression's core assumptions is that the residuals must
be normally distributed. That alone is reason enough to prevent us from applying
linear regression to this type of time series.

2 Femi

2.1 Executive Summary

This section uses a regression model to predict the next day's opening price for
AAPL. According to sections 2.3.1 and 2.4, the base model seems perfect, but it
fails when we study the residuals. The residuals show increasing variance, and
when we attempt to fix this with an inverse sine transformation, the model becomes
more unstable. This leads to investigating a log transformation of the model. Since
the dataset consists of prices, taking the logarithm of price helps reduce the
variation of the time series. The methodology behind this model includes data
engineering; passing the t-test and F-test; stepwise regression; VIF; residual
assumptions; removal of outliers; and the Durbin-Watson test, all of which are
used to judge the usefulness of the model.

2.2 Data

Due to limited open-source data, the AAPL stock data was taken from Yahoo
Finance historical data from 2000 to 2020. The initial data consists of 5283
observations and 5 variables before processing and feature engineering.

2.2.1 Data Processing and Feature Engineering

The first new variable created is the next day's opening price (Nopen). Nopen is
derived by shifting the open price up one row (open = [23, 10, 3, 4, 5] → Nopen =
[10, 3, 4, 5, NA]). Because of the shift, the last cell ("NA") can be omitted from
the dataset, bringing the total observations to 5282. The next step is to take the
log of all base variables (open, high, low, close, and adjusted close). Other variables
include the interactions, the market range, and the change on market volume.
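A sketch of this shift-and-log step, assuming pandas and Yahoo Finance column names:

```python
import numpy as np
import pandas as pd

# Next day's opening price: shift Open up one row; the trailing NaN row is
# dropped, taking the data from 5283 to 5282 observations.
df["Nopen"] = df["Open"].shift(-1)
df = df.dropna(subset=["Nopen"])

# Log-transform the base price variables used by the model.
for col in ["Open", "High", "Low", "Close", "Adj Close", "Nopen"]:
    df["log_" + col.replace(" ", "_")] = np.log(df[col])
```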


2.2.2 Transformation Process

log High, log Low, log Close = log of the High, Low, and Close values

log open, log Nopen = log of yesterday's open and of the next day's open

log Adj Close = log of the adjusted close values

Market Range = High − Low

Day Change = Open − Close

2.3 Model Building

2.3.1 Log Transformation

The first model is built with all the independent variables. Despite the log
transformation, multicollinearity still exists. The model passes the F-test, and the
adjusted R² is 0.9999.


The model fails all the necessary residual assumption tests, and multicollinearity
is an issue, so stepwise regression is applied. The backward stepwise regression
drops O_C, H_L, H_O, and log Adj Close, but the forward stepwise regression
drops 8 variables.


Since simplicity is key, the forward stepwise regression's final model is chosen.
Its results are identical to those of the original model.


2.4 Residuals Study

Despite being a good base model, it failed all the residual assumptions, hence the
log transformation is considered. In this section we see that the log transformation
affects the residuals. The sum of the residuals is 2.65808e-15, the residuals are
normally distributed with constant variance, and the Durbin-Watson statistic is 1.96.
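A Durbin-Watson statistic near 2 indicates little first-order autocorrelation in the residuals. A sketch of the check, assuming a fitted statsmodels result named model:

```python
from statsmodels.stats.stattools import durbin_watson

# Values near 2 indicate little first-order autocorrelation;
# the section reports 1.96 for the log model.
print(f"Durbin-Watson: {durbin_watson(model.resid):.2f}")
```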


Further investigation of the residuals consists of removing the outliers. Eliminating
the outliers lets the residuals spread more evenly and reduces the p-value of the
Durbin-Watson test. Continued removal of outliers does not affect the residuals of
the model nor the Durbin-Watson value.


2.5 Conclusion

The final model, using the log transformation to predict the next day's opening
price, is:

log Nopen = log Close + log Low + Market Range + log High + log open + V_H + V_L

(where V_H and V_L are the volume×high and volume×low interaction terms)

The residuals of the log transformation model are constant and stable. We also
note that multicollinearity still exists in the dataset and that linear regression is
not ideal; a time series analysis is suggested.

3 Spoorthi

3.1 Executive Summary

In this project, we used various linear regression models to test the predictability
of the next day's close price for XLK, an ETF built around the S&P Technology
sector. The prediction of the next day's close price seems correct based on the R²,
but the residual error is quite high, making it less likely that these models will
work. For XLK specifically, we generated multiple models to check which one
improves the prediction.
I will show the models' performance in terms of accuracy, as well as the residuals,
which tell us that the models might not be as accurate as we would like.

3.2 Data

The dataset we started from contained only the daily Open, Close, Low, High,
Adjusted Close, and Volume of XLK since 2000. With only 6 attributes, this data
limits experimentation, so one way around it was to do some feature engineering.

3.2.1 Data Processing and Feature Engineering

Feature engineering involved adding the attributes listed below:

3.2.2 Transformation Process

Daily Change = Open − Close

Percent Change = the Open − Close change expressed as a percentage

Absolute Percent Change = the absolute value of the Open − Close percent change

Moving Average 2 = moving average of the Adjusted Close price over the last 2 days

Moving Average 3 = moving average of the Adjusted Close price over the last 3 days

Moving Average 4 = moving average of the Adjusted Close price over the last 4 days

Moving Average 50 = moving average of the Adjusted Close price over the last 50 days
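A sketch of these moving-average features, assuming pandas and Yahoo Finance column names (trailing windows, so the first window−1 rows are NaN and get dropped):

```python
# Trailing moving averages of the adjusted close over the listed windows.
for window in (2, 3, 4, 50):
    df[f"MA{window}"] = df["Adj Close"].rolling(window).mean()

# Prediction target: the next day's adjusted close.
df["NextAdjClose"] = df["Adj Close"].shift(-1)
df = df.dropna()  # drop the 50-day warm-up rows and the final row
```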

Initially, the idea was to predict the next day's percent change, which is basically
a measure of the volatility expected for the next day. I instead opted to predict
the next day's adjusted close price, in part because the day-over-day percent
change shows no linear correlation, which makes it difficult to fit a linear model.
In the images below, you can see the lack of linearity on the left side, with the
percent change over time.


3.3 Model Building

The model we initially built used all the available independent variables to predict
the outcome: tomorrow's adjusted close price.
The full model is listed below in the image:


The model has an R-squared of 0.9993, which is quite high, but there are
attributes with high p-values. These p-values suggest that some of the attributes
aren't important or aren't providing significant information to the model.
We opted for a stepwise regression, in this case a backward regression, to remove
the attributes that aren't important. This yields a model that still predicts well
but without the unnecessary attributes.
The image below shows the backward regression:

Based on the above results, we can safely remove DayChangeO.C, RangeH.L,
AbsolutePercentChange, and Moving Average 3.

Models built include:

• 1st-order model

• Interaction model

• 2nd-order model

The 1st-order model performed the worst. Adding some interactions to the model
raised performance slightly. I also fit a 2nd-order model, but there was no
significant improvement, so I am not including those results.
The results for the interaction model are:


The R-squared has not improved, but the residual error has improved slightly.

3.4 Residuals Study

I also included the residual plots. The QQ plot shows that a decent number of
values fall close to the normal distribution line. The model predicts the middle
values well, but the highs and lows are predicted poorly; you can see this in the
fat tails at either end of the QQ plot.
The residuals for the 1st model, with all the attributes, are shown below:
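The residual plots appear in the report as images; a sketch of how such a QQ plot could be produced, assuming matplotlib and a fitted statsmodels result named model:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# QQ plot of residuals against a fitted normal; fat tails appear as points
# bending away from the reference line at both ends.
sm.qqplot(model.resid, line="s")
plt.show()
```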


The residuals for the interaction model, with all the attributes, are shown below:

The QQ plot is very similar to the 1st-order model's. The model improves slightly,
but the same fat-tail issue remains.

3.5 Conclusion

The model based on the moving averages seems to perform well, as it captures the
trend in how the market has been performing over the past few days relative to
today. The interaction model, which pairs the 50-day moving average with the
day's percent change, improves performance slightly. In the end, none of the
models has a residual value low enough to suggest it will be really good at
predicting the price. This is apparent in the residual study we performed for
both models.

4 Akanksha

4.1 Executive Summary

I will be looking at stock data and applying regression modeling to derive insights,
relationships, and predictions between variables of the stock dataset. This is both
interesting and important, as the stock market is the biggest money machine and
people lose or gain billions every day. This is an opportunity not only to test data
science skills but also to analyze whether we can actually predict and better guide
our investment strategy with historical data and machine learning.

Aim of the model: predict the closing price depending on the type of day
(holidays/weekends) and the volatility index.

We will run the model at 4 PM to predict the closing price for that day, and act
depending on this predicted price.
Questions answered by the prediction:

1. Do we hold, sell, or buy more shares based on the predicted closing price
before and after WEEKENDS and HOLIDAYS?

2. Can we use the Volatility index (VIX) to improve our previous model for
predicting the closing price of a stock?

4.2 Data

Stock market datasets are mostly focused on prices, volume, and trading behavior.
In our dataset, we use the open and close prices of a few stocks along with their
trading volume. This dataset is the basis of our experiment and has been used by
all the team members in one way or another in this project, so we will not go into
details here.
However, in addition to the base dataset of High, Low, Close, Volume, etc., I have
also added dummy variables marking weekends and holidays, derived from the
date variable.
Lastly, regardless of variables or Model 1/2, our dependent variable is the stock's
closing price when the market closes, i.e., at 5:00 PM.

4.3 Model Outcome

Predicting the closing price is the main aim here, to guide the investor to either:

1. HOLD/BUY shares, in case the predicted closing price is higher than the
current price. We will make a profit as prices increase, and hence we can sell
at a higher price later.

OR

2. SELL shares, in case the predicted closing price is lower than the current
price. We need to sell, as the prices will go down and we want to avoid a loss.

4.4 Feature Selection

We need to ensure we select only relevant features and avoid overfitting or
underfitting. Hence we look to:

1. Removing redundant features – We need to select the most relevant features
before building a model. A good practice is to remove redundant features
from the data; we noticed that we could choose between Close and Adj Close,
as they were very similar.

2. Measuring collinearity – We measure the extent of interdependence between
the explanatory variables of the dataset by checking the Pearson correlation
and/or VIF scores. We noticed that all the explanatory variables of the stock
dataset are highly correlated, with VIFs exceeding 10, hence we remove some
variables. Multicollinearity will likely exist when using these independent
variables together, so we need to address this before fitting the model.

The above steps are repeated once we are done with further feature engineering.

4.5 Feature Engineering

4.5.1 Model 1

After feature selection, we performed feature engineering and added dummy
variables to the model. The variable Day was created by marking the days before
and after weekends and holidays, i.e., days adjacent to days on which no trading
is done. Day was plotted as a function of date and converted into the following
categorical variables (a construction sketch follows the list):

1. BW – Before Weekend – every Friday (the day before the weekend). The two
levels of BW are:

0 – not Friday

1 – Friday

2. AW – After Weekend – every Monday, the first day after the weekend. The
two levels of AW are:

0 – not Monday

1 – Monday

3. AH – After Holiday – the first market-open day after a holiday (market close).
The two levels of AH are:

0 – regular day

1 – after holiday

4. Normal – market days that did not fall under any of the three categories
above (Tuesday, Wednesday, Thursday). This is our base level in the
categorical variable Type of Day.
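A hypothetical construction of these dummies from the date column (a sketch only; the after-holiday rule here is a rough calendar-gap heuristic, not necessarily the authors' exact rule):

```python
import pandas as pd

# BW: Fridays (day before the weekend); AW: Mondays (day after the weekend).
df["BW"] = (df["Date"].dt.dayofweek == 4).astype(int)
df["AW"] = (df["Date"].dt.dayofweek == 0).astype(int)

# AH: first trading day after a holiday, approximated as a calendar gap of
# more than one day that is not an ordinary Monday. (Imperfect for holidays
# adjacent to weekends; shown only to illustrate the idea.)
gap = df["Date"].diff().dt.days
df["AH"] = ((gap > 1) & (df["AW"] == 0)).astype(int)
```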

4.6 Model 2

Volatility Index (VIX) – measures how volatile the market is. It is used to gauge
the level of risk, fear, or stress in the market. The source of the VIX data is the
daily VIX dataset over the same timeframe, from
https://finance.yahoo.com/quote/%5EVIX/history/

4.7 Analysis of Model 2

4.7.1 Correlation

1. To measure the extent of interdependence between variables, we performed a
Pearson correlation.

2. Another tool used to measure the severity of multicollinearity in regression
analysis is the Variance Inflation Factor (VIF), a statistical measure of the
increase in the variance of a regression coefficient caused by collinearity. The
VIF scores are shown below.

The correlations between variables look good and all VIFs are less than 10,
so we can say that there are no signs of multicollinearity.
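A sketch of the VIF check, assuming statsmodels and a hypothetical predictor list:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per predictor; values above roughly 10 flag problematic collinearity.
X = sm.add_constant(df[["Open", "Volume", "VIX", "AW", "BW"]])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))
```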

4.8 Multiple Regression Analysis

We run the first-order model with our dummy variables to see how the base
first-order model performs. The first-order multiple linear regression equation
relating y to the five independent variables is:

y(Close) = β0 + β1·Open + β2·Volume + β3·AH + β4·AW + β5·BW
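A sketch of fitting this first-order model and reading off the quantities discussed next, assuming statsmodels' formula interface and the hypothetical column names used throughout:

```python
import statsmodels.formula.api as smf

m1 = smf.ols("Close ~ Open + Volume + AH + AW + BW", data=df).fit()
print(m1.rsquared)                # reported as ~0.98 in the section
print(m1.f_pvalue)                # model utility (F) test p-value
print(m1.pvalues[["AH", "AW"]])   # the two terms flagged as insignificant
```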


1. R-squared (R²) is 0.98, meaning that our model's predictors (Xi) explain
98% of the variance of Y (Close).

2. R² is between 0 and 1, and the closer it is to 1, the better the model.

3. We see that AH and AW have high p-values and appear to be insignificant
to the model.

4.9 Model Utility Test

Hypotheses:

H0: β1 = β2 = ... = βk = 0 (all model terms are unimportant for predicting y)

Ha: at least one βi ≠ 0 (at least one model term is useful for predicting y)

Assumptions:

1. Everything else (other factors) is held constant.

2. The standard regression assumptions about the random error component hold.

3. Standard errors assume that the covariance matrix of the errors is correctly
specified.

4. We test at α = 0.05 (95% confidence).

P-value = P(F > F0) = 0.00

Reject H0, since p-value < α (0.00 < 0.05).

At a significance level of 0.05, the data provide sufficient evidence that the
model is useful in explaining the closing price of Amazon stock using the 5
independent variables. So our linear regression model provides a better fit
than the model that does not contain the independent variables AH, BW,
and AW.

4.10 Complete Second-Order Model with Interaction Terms

After feature selection and multiple iterations with the above variables, we arrive
at this final complete second-order model. We include VIX along with the
significant predictors from the previous first-order model. We keep AW regardless
of its p-value because its interaction term is significant. Multicollinearity does not
exist, and the VIF of every variable is less than 10.
Equation:

y(Close) = β0 + β1·Open + β2·Volume + β3·VIX + β4·AW + β5·BW + β6·(Volume)(AW) + β7·VIX²
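A sketch of the complete second-order fit, assuming statsmodels formulas (`Volume:AW` adds the interaction term and `I(VIX**2)` the squared term):

```python
import statsmodels.formula.api as smf

m2 = smf.ols(
    "Close ~ Open + Volume + VIX + AW + BW + Volume:AW + I(VIX**2)",
    data=df,
).fit()
print(m2.summary())  # R^2 reported as 0.98, matching the first-order model
```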


1. R-squared (R²) is 0.98 (the same as the first-order model), meaning that our
model's predictors (Xi) explain 98% of the variance of Y (Close).

2. R² is between 0 and 1, and the closer it is to 1, the better the model.

4.11 Model Utility Test

Hypotheses:

H0: β1 = β2 = ... = βk = 0 (all model terms are unimportant for predicting y)

Ha: at least one βi ≠ 0 (at least one model term is useful for predicting y)

We test at α = 0.05 (95% confidence).

P-value = P(F > F0) = 0.00

Reject H0, since p-value < α (0.00 < 0.05).

At a significance level of 0.05, the data provide sufficient evidence that the model
is useful in explaining the closing price of Amazon stock using these 7 independent
variables.

4.12 Final Model Residuals

The residuals show no trend, so we can say that the model is a good fit. The
predictors are significant as well.

4.12.1 Conclusion

We recommend using the Holiday/Weekend model in practice for the following
reasons:

1. R² equals 0.98. By taking into account the first-order effects of the predictors
(Xi), the interactions between them (XiXj), and the second-order effect (Xi²),
we can explain 98% of the variance of Y (Close price). In other words, 98% of
the variation in the closing price of Amazon's stock is explained by the
complete second-order model.

2. Model utility – we rejected the null hypothesis above, which means the model
is useful in explaining the closing price using the 7 independent variables;
hence our model is a good fit.

3. Residual spread – no trends are seen in the residual spread, so we can say the
model fits. However, the residuals are not normally distributed.

5 Cristian

6 Team Conclusion
While the models have quality statistical results, they remain mostly unusable due
to the lack of normality in the residuals. We saw in all of our models that the
distributions of the residuals were fat-tailed. Unfortunately, one of the core
assumptions required for linear regression is the normal distribution of the
residuals. For this reason, we have shown that linear regression is not applicable
for this purpose.
