
Final_Project

Yutong Wang, Victor Wei, Erica Hwang

2022-03-12

Intro

What is the stock market and why do we care?

According to Investopedia, “the stock market broadly refers to the collection of exchanges and other venues
where the buying, selling, and issuance of shares of publicly held companies take place. Such financial
activities are conducted through institutionalized formal exchanges (whether physical or electronic) or via
over-the-counter (OTC) marketplaces that operate under a defined set of regulations.”
Stock markets are important, especially in a free-market economy like that of the United States, because
they allow any investor to trade and exchange capital. Through the stock market, companies can issue and
sell their shares to raise capital so that they can carry out their business activities.
To gauge how the market is doing, there exist many indices that track its movement. One such index is the S&P 500 (Standard & Poor's 500). The S&P 500 is composed of the 500 largest companies in the U.S. across 11 business sectors. The index weighs each company by its market capitalization (the company's outstanding shares multiplied by the current share price).
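To make the weighting concrete, here is a toy sketch with hypothetical numbers (not actual index data): each company's weight is its market cap divided by the total market cap of all constituents.

# Toy example of market-cap weighting with three hypothetical companies
caps <- c(A = 2000, B = 500, C = 100) # market caps in $billions (made up)
weights <- caps / sum(caps)           # each company's index weight
round(weights, 3)                     # A = 0.769, B = 0.192, C = 0.038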
For the purposes of this project, we will focus on SPX (the S&P 500 index) and how various macroeconomic and monetary-policy variables affect its price: the unemployment rate, M2 real money supply, personal consumption expenditures (PCE), CPI, the Federal Funds Effective Rate, the 10-Year Treasury Constant Maturity Minus 2-Year Treasury Constant Maturity spread, and the 30-Year Fixed Rate Mortgage Average in the United States. These are the 7 variables that we will use to predict the future SPX price.

What models are we going to use?

In this project, we will use several different regression models to predict our SPX outcome: a linear regression model, a ridge regression model, a lasso regression model, a regression tree, the bagging/random forest method, and the boosting method, each predicting SPX from our 7 variables. Some of these models might not work as well as desired because we do not have many predictors and they can be correlated to some degree, but we will use methods like step-wise selection and cross-validation to enhance the models' performance. We will then compare the performances of our models and pick the most accurate one as our final model.

Why might these models be useful?

These models might be useful because they can be applied in real life. Specifically, should one of them succeed in predicting SPX values somewhat accurately, difficult as that is to achieve, we could potentially make money off of its predictions by going long or short on the SPX.

Loading data and packages:

Before diving into our model, we think that it’s important to understand the basics of what each of our
predictors mean.

From Investopedia:

M2 Real Money Supply A measure of the money supply that includes cash, checking deposits, and
easily-convertible near money.

Federal Funds Effective Rate The interest rate banks charge each other for overnight loans to meet
their reserve requirements.

Unemployment rate The number of unemployed people as a percentage of the labor force.

CPI The average change in prices over time that consumers pay for a basket of goods and services.

10-Year Treasury Constant Maturity Minus 2-Year Treasury Constant Maturity Constant ma-
turity is the theoretical value of a U.S. Treasury that is based on recent values of auctioned U.S. Treasuries.
They are often used by lenders to determine mortgage rates.

30-Year Fixed Rate Mortgage Average Average of the 30-year fixed-rate mortgage, which is basically
a home loan that gives you 30 years to pay back the money you borrowed at an interest rate that won’t
change.

PCE Personal consumption expenditures, which measures the change in goods and services consumed by all households and nonprofit institutions serving households.

Set Up
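The report never shows its setup chunk. Below is a minimal sketch of what it presumably looks like: the package names are inferred from the functions called later in the report, and the file name is hypothetical.

# Minimal sketch of the setup chunk (not shown in the original report)
library(caret)         # train(), trainControl()
library(leaps)         # regsubsets()
library(glmnet)        # ridge and lasso regression
library(tree)          # tree(), cv.tree(), prune.tree()
library(maptree)       # draw.tree()
library(randomForest)  # randomForest(), varImpPlot()
library(ranger)        # fast random forests for grid-search tuning
library(rsample)       # initial_split(), analysis(), assessment()
library(gbm)           # gradient boosting
library(dplyr)
library(tidyr)         # gather()
library(ggplot2)

# Hypothetical file name; data1 holds Date, Price, and the 7 predictors
data1 <- read.csv("spx_macro.csv")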

Training/Testing split

#Generate x and y
x <- model.matrix(Price~.-Date,data1)
y <- data1$Price
set.seed(123)
#Split data
train <- sample(1:nrow(x),600)
#Training data
x.train <- x[train,]
y.train <- y[train]
#Testing data
x.test <- x[-train,]
y.test <- y[-train]

#Part 1, Linear Regression

Linear Regression on Training Data

Let’s first run a linear regression on our training data:

# Note: x.train already contains an intercept column from model.matrix(),
# which is why summary() below reports one coefficient as NA (singularities)
train_lm_fit <- lm(y.train ~ x.train, data = data1)


summary(train_lm_fit)

##
## Call:
## lm(formula = y.train ~ x.train, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -539.96 -135.31 -5.98 121.23 794.54
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.345e+02 8.459e+01 -5.136 3.81e-07 ***
## x.train(Intercept) NA NA NA NA
## x.trainUNRATE -7.556e+01 6.386e+00 -11.832 < 2e-16 ***
## x.trainM2REAL 5.453e-01 2.159e-02 25.250 < 2e-16 ***
## x.trainPCE 3.325e-02 9.904e-03 3.358 0.000836 ***
## x.trainCPI 5.189e+01 2.993e+01 1.734 0.083457 .
## x.trainFEDFUNDS -3.245e+01 1.204e+01 -2.696 0.007213 **
## x.trainMORTGAGE30US 2.661e+01 1.256e+01 2.118 0.034584 *
## x.trainT10Y2Y -1.460e+02 2.264e+01 -6.448 2.35e-10 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 211.1 on 592 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9516
## F-statistic: 1683 on 7 and 592 DF, p-value: < 2.2e-16

The R^2 value is 0.9521 and the adjusted R^2 is 0.9516, both of which indicate a good fit, but the summary also shows that CPI and MORTGAGE30US are the weakest predictors (CPI is not significant at the 5% level). Let's follow up with some other tools to analyze our data.

Exploratory Data Analysis

We will be doing an exploratory data analysis on the training data.

Residuals vs Fitted Plot

Let’s run a residuals vs fitted plot on our training data to see if we have constant variance.

plot(fitted(train_lm_fit), residuals(train_lm_fit), xlab = "Fitted", ylab = "Residuals")


abline(0,0)

[Figure: residuals vs fitted plot for the training data]

Here, we see that the regression line systematically over- and under-predicts the data at different points along the curve. The data points are neither randomly nor evenly spread around the line, indicating that we do not have constant variance. As a result, linear regression does not appear to be a great fit for our data.
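The visual impression of non-constant variance can be checked formally with a Breusch-Pagan test; this is a sketch that was not run in the original analysis, and it assumes the lmtest package is installed.

library(lmtest)
# Breusch-Pagan test for heteroscedasticity; a small p-value supports
# our reading that the residual variance is not constant
bptest(train_lm_fit)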

QQPlot

Let’s see if our datasets come from populations with a common distribution by using the QQPlot:

qqnorm(residuals(train_lm_fit))
qqline(residuals(train_lm_fit))

[Figure: normal Q-Q plot of the training residuals]

Looking at our Q-Q plot, we see that the majority of points fall close to the straight line. This is a good sign, showing that our residuals do seem to come from a common distribution; in this case the normal distribution, since we used the "qqnorm" function.
However, the points at both ends of the data waver away from the line indicating normality. We can therefore conclude that our data is not precisely normal, but it is not too far off either.
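To complement the visual check, a Shapiro-Wilk test on the residuals gives a formal assessment of normality (a sketch, not run in the original; shapiro.test is in base R and our 600 training residuals are within its size limit):

# Shapiro-Wilk normality test; a small p-value would confirm that the
# wavering tails pull the residuals away from exact normality
shapiro.test(residuals(train_lm_fit))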

Cook’s Distance

Cook’s Distance is used to find any influential outliers in a set of predictors variables. It suggests that any
data points that exceed the 4/n threshold (where n = # data points) have strong influence over the fitted
values and are considered to be outliers.
Lets see if we have any particularly influential outliers:

plot(cooks.distance(train_lm_fit), pch = 16, col = "blue")


n <- nrow(data1)
abline(h = 4/n, lty = 2, col = "steelblue") # add cutoff line at 4/n

[Figure: Cook's distance for each training observation, with the 4/n cutoff line]
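To quantify how many training points fall above the cutoff, a quick count (sketch):

# Number of observations whose Cook's distance exceeds the 4/n threshold
sum(cooks.distance(train_lm_fit) > 4/n)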

Using Cook’s Distance to analyze our data, we see that there are many points that cross this 4/n threshold.
This doesn’t necessarily mean that we should remove all these points. Instead, the presence of so many
highly influential points could indicate that a linear regression is not the best model for our data.

Training MSE

# Use the linear model to predict Price on the training set.
# Note: predict.lm has no "data" argument; it ignores it and returns the
# fitted values for the training data, which is what we want here.
y_hat_train <- predict(train_lm_fit, data = data1)

training_MSE <- mean((y.train - y_hat_train)^2)

training_MSE

## [1] 43952.57

The training MSE is 43952.57, which is very large. Let's see if the models we use later give a better MSE. But first, we should check whether this can be improved by simply performing variable selection.
Now that we have analyzed our data, let's perform forward selection to choose our best linear regression model.

Forward Selection for Training Data

# find cross-validation error for a given model
set.seed(123)

# Use 10-fold cross-validation to estimate the average prediction error (RMSE)
# of each of the 7 possible model sizes
train_control <- trainControl(method = "cv", number = 10)
cv <- train(Price ~ ., data = data1, method = "leapForward", # forward selection via "leapForward"
            tuneGrid = data.frame(nvmax = 1:7), # nvmax = number of variables in the model
            trControl = train_control
)
# Note: the formula Price ~ . keeps the Date column as a predictor, which is
# likely what triggers the "linear dependencies found" warnings below.

## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 106
## linear dependencies found
## Reordering variables and trying again:

(The same warning repeats for each of the 10 CV folds, reporting 105 to 107 linear dependencies each time, and once more with 7 linear dependencies on the final refit.)

cv$results

## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD


## 1 1 285.9710 0.9099184 224.0649 17.60662 0.012491709 14.33427
## 2 2 219.8882 0.9467647 169.4957 13.47600 0.006311315 12.63195
## 3 3 212.7484 0.9500908 163.7082 11.44754 0.005504784 11.45915
## 4 4 209.8068 0.9517557 160.7437 12.44169 0.004111558 11.42559
## 5 5 209.8021 0.9517540 160.7232 12.43622 0.004109650 11.40789
## 6 6 209.7887 0.9517674 160.7189 12.42159 0.004105848 11.39766
## 7 7 209.7809 0.9517695 160.7042 12.39249 0.004107294 11.39979

cv$bestTune

## nvmax
## 7 7

The lower the RMSE and MAE, the better the model. From the results of our forward selection, the 7-variable model is the best of the 7 candidates: it has the lowest RMSE and MAE along with the highest R^2 value, although the values for the 4-, 5-, 6-, and 7-predictor models are all very close. Additionally, cv$bestTune also tells us that the best model is the one with 7 predictor variables.
As a result, to produce the most accurate linear regression on our data, we should use all 7 predictors.
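To see which coefficients the selected model actually uses, regsubsets objects (stored by caret in cv$finalModel) support coef() with the model size; a sketch:

# Coefficients of the best 7-predictor model found by forward selection
coef(cv$finalModel, 7)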

Linear Regression on Testing Data

Now, let’s move on to performing linear regression on our testing data.

test_lm_fit <- lm(y.test ~ x.test, data = data1)


summary(test_lm_fit)

##
## Call:
## lm(formula = y.test ~ x.test, data = data1)
##
## Residuals:

## Min 1Q Median 3Q Max
## -662.28 -124.81 -16.42 109.78 743.97
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -495.22275 100.94729 -4.906 1.38e-06 ***
## x.test(Intercept) NA NA NA NA
## x.testUNRATE -89.75160 8.83747 -10.156 < 2e-16 ***
## x.testM2REAL 0.53211 0.02540 20.953 < 2e-16 ***
## x.testPCE 0.04212 0.01162 3.625 0.000329 ***
## x.testCPI 75.29831 31.55377 2.386 0.017501 *
## x.testFEDFUNDS -26.45505 13.76062 -1.923 0.055282 .
## x.testMORTGAGE30US 34.19934 13.42136 2.548 0.011220 *
## x.testT10Y2Y -98.70008 28.57440 -3.454 0.000614 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 204.4 on 383 degrees of freedom
## Multiple R-squared: 0.9527, Adjusted R-squared: 0.9518
## F-statistic: 1102 on 7 and 383 DF, p-value: < 2.2e-16

# R^2 is 0.9527
plot(fitted(test_lm_fit), residuals(test_lm_fit), xlab = "Fitted", ylab = "Residuals")
abline(0,0)
[Figure: residuals vs fitted plot for the testing data]

R^2 is 0.9527, but the residuals vs fitted plot again shows that the regression line systematically over- and under-predicts the data at different points along the curve. Linear regression does not appear to be a good fit for our model.

Testing MSE:

# As above, predict.lm ignores the "data" argument, so this returns the
# fitted values of the test-set regression
y_hat_test <- predict(test_lm_fit, data = data1)

testing_MSE <- mean((y.test - y_hat_test)^2)

testing_MSE

## [1] 40920.33

The testing MSE is 40920.33, slightly smaller than the training MSE, which is unusual (test error normally exceeds training error). It is also still huge, which suggests an underfit model. Let's see if variable selection can improve it.

Forward Selection for Testing Data

To perform forward selection on our testing data, we are going to use the "regsubsets" function, which identifies the best model for each given number of predictors based on RSS (residual sum of squares). By setting "nvmax = 7", we ask for the best model with 1 predictor, 2 predictors, and so on, up to 7 predictors.

model_options <- regsubsets(Price ~ . - Date, data = data1[-train,], nvmax = 7,
                            method = "forward")
summary_best_subset <- summary(model_options)
summary_best_subset

## Subset selection object


## Call: regsubsets.formula(Price ~ . - Date, data = data1[-train, ],
## nvmax = 7, method = "forward")
## 7 Variables (and intercept)
## Forced in Forced out
## UNRATE FALSE FALSE
## M2REAL FALSE FALSE
## PCE FALSE FALSE
## CPI FALSE FALSE
## FEDFUNDS FALSE FALSE
## MORTGAGE30US FALSE FALSE
## T10Y2Y FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
## UNRATE M2REAL PCE CPI FEDFUNDS MORTGAGE30US T10Y2Y
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " " " " " " " "*"
## 4 ( 1 ) "*" "*" "*" " " " " " " "*"
## 5 ( 1 ) "*" "*" "*" "*" " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"

which.max(summary_best_subset$adjr2) # to find number of predictors that gives us the best model

## [1] 7

Performing forward selection on our test data, we again see that using all 7 predictors gives us the best model under linear regression. This makes sense, as our R^2 value for the linear regression on all 7 predictors was already quite high.
In conclusion, linear regression says that keeping all 7 of our predictor variables gives the best predictions of the stock market price. However, this is only the best model under linear regression, which, as we saw from our data analysis, does not seem to fit our data well.
Now, let's move on to another model to see if it fits and predicts our data better.

Part 2, Ridge and Lasso

Generate ridge model

#Generate grid
grid <- 10^seq(10, -2, length = 100)
#Ridge regression
ridge_mod <- glmnet(x.train, y.train, alpha = 0, lambda = grid)
ridge_mod

##
## Call: glmnet(x = x.train, y = y.train, alpha = 0, lambda = grid)
##
## Df %Dev Lambda
## 1 7 0.00 1.000e+10
## 2 7 0.00 7.565e+09
## 3 7 0.00 5.722e+09
## 4 7 0.00 4.329e+09
## 5 7 0.00 3.275e+09
## 6 7 0.00 2.477e+09
## 7 7 0.00 1.874e+09
## 8 7 0.00 1.417e+09
## 9 7 0.00 1.072e+09
## 10 7 0.00 8.111e+08
## 11 7 0.00 6.136e+08
## 12 7 0.00 4.642e+08
## 13 7 0.00 3.511e+08
## 14 7 0.00 2.656e+08
## 15 7 0.00 2.009e+08
## 16 7 0.00 1.520e+08
## 17 7 0.00 1.150e+08
## 18 7 0.01 8.697e+07
## 19 7 0.01 6.579e+07
## 20 7 0.01 4.977e+07
## 21 7 0.01 3.765e+07
## 22 7 0.02 2.848e+07
## 23 7 0.03 2.154e+07

## 24 7 0.03 1.630e+07
## 25 7 0.05 1.233e+07
## 26 7 0.06 9.326e+06
## 27 7 0.08 7.055e+06
## 28 7 0.10 5.337e+06
## 29 7 0.14 4.037e+06
## 30 7 0.18 3.054e+06
## 31 7 0.24 2.310e+06
## 32 7 0.32 1.748e+06
## 33 7 0.42 1.322e+06
## 34 7 0.56 1.000e+06
## 35 7 0.74 7.565e+05
## 36 7 0.97 5.722e+05
## 37 7 1.28 4.329e+05
## 38 7 1.69 3.275e+05
## 39 7 2.22 2.477e+05
## 40 7 2.91 1.874e+05
## 41 7 3.81 1.417e+05
## 42 7 4.98 1.072e+05
## 43 7 6.49 8.111e+04
## 44 7 8.42 6.136e+04
## 45 7 10.85 4.642e+04
## 46 7 13.88 3.511e+04
## 47 7 17.61 2.656e+04
## 48 7 22.08 2.009e+04
## 49 7 27.32 1.520e+04
## 50 7 33.28 1.150e+04
## 51 7 39.82 8.697e+03
## 52 7 46.74 6.579e+03
## 53 7 53.74 4.977e+03
## 54 7 60.53 3.765e+03
## 55 7 66.84 2.848e+03
## 56 7 72.46 2.154e+03
## 57 7 77.30 1.630e+03
## 58 7 81.33 1.233e+03
## 59 7 84.59 9.330e+02
## 60 7 87.18 7.060e+02
## 61 7 89.19 5.340e+02
## 62 7 90.73 4.040e+02
## 63 7 91.89 3.050e+02
## 64 7 92.76 2.310e+02
## 65 7 93.40 1.750e+02
## 66 7 93.89 1.320e+02
## 67 7 94.24 1.000e+02
## 68 7 94.51 7.600e+01
## 69 7 94.71 5.700e+01
## 70 7 94.85 4.300e+01
## 71 7 94.96 3.300e+01
## 72 7 95.04 2.500e+01
## 73 7 95.09 1.900e+01
## 74 7 95.13 1.400e+01
## 75 7 95.16 1.100e+01
## 76 7 95.18 8.000e+00
## 77 7 95.19 6.000e+00

## 78 7 95.20 5.000e+00
## 79 7 95.21 4.000e+00
## 80 7 95.21 3.000e+00
## 81 7 95.21 2.000e+00
## 82 7 95.21 2.000e+00
## 83 7 95.21 1.000e+00
## 84 7 95.21 1.000e+00
## 85 7 95.21 1.000e+00
## 86 7 95.21 0.000e+00
## 87 7 95.21 0.000e+00
## 88 7 95.21 0.000e+00
## 89 7 95.21 0.000e+00
## 90 7 95.21 0.000e+00
## 91 7 95.21 0.000e+00
## 92 7 95.21 0.000e+00
## 93 7 95.21 0.000e+00
## 94 7 95.21 0.000e+00
## 95 7 95.21 0.000e+00
## 96 7 95.21 0.000e+00
## 97 7 95.21 0.000e+00
## 98 7 95.21 0.000e+00
## 99 7 95.21 0.000e+00
## 100 7 95.21 0.000e+00

Perform cross-validation

#Cross-validation (note: the argument is nfolds, not folds; 10 is also the default)
cvmodr <- cv.glmnet(x.train, y.train, alpha = 0, nfolds = 10)
cvmodr

##
## Call: cv.glmnet(x = x.train, y = y.train, alpha = 0, nfolds = 10)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 91.36 100 53324 4236 7
## 1se 132.54 96 57504 4913 7

#Get the best lambda, according to CV


ridge_lambda <- cvmodr$lambda.min
ridge_lambda

## [1] 91.35554
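The full CV curve behind this choice can be drawn directly from the cv.glmnet object; a sketch:

# MSE versus log(lambda); the dotted lines mark lambda.min and lambda.1se
plot(cvmodr)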

#Predict based on the best lambda


predict(ridge_mod,type="coefficients",s=ridge_lambda)

## 9 x 1 sparse Matrix of class "dgCMatrix"


## s1

## (Intercept) -86.69913181
## (Intercept) .
## UNRATE -53.99007905
## M2REAL 0.37637001
## PCE 0.07151295
## CPI 53.78446599
## FEDFUNDS -11.01616897
## MORTGAGE30US -11.68599230
## T10Y2Y -138.77562341

The training MSE is:

set.seed(123)
ridge.predtr <- predict(ridge_mod,s=ridge_lambda,newx=x.train)
mean((ridge.predtr-y.train)^2)

## [1] 51946.73

The testing MSE is:

set.seed(123)
ridge.predte <- predict(ridge_mod,s=ridge_lambda,newx=x.test)
mean((ridge.predte-y.test)^2)

## [1] 49091.91

Both the training and testing MSE are extremely high, and the training MSE is higher than the testing MSE, which is not what theory predicts. We conclude that the ridge model does not fit the data well.

Generate lasso model

set.seed(123)
lasso_mod <- glmnet(x.train,y.train,alpha=1,lambda=grid)
lasso_mod

##
## Call: glmnet(x = x.train, y = y.train, alpha = 1, lambda = grid)
##
## Df %Dev Lambda
## 1 0 0.00 1.000e+10
## 2 0 0.00 7.565e+09
## 3 0 0.00 5.722e+09
## 4 0 0.00 4.329e+09
## 5 0 0.00 3.275e+09
## 6 0 0.00 2.477e+09

## 7 0 0.00 1.874e+09
## 8 0 0.00 1.417e+09
## 9 0 0.00 1.072e+09
## 10 0 0.00 8.111e+08
## 11 0 0.00 6.136e+08
## 12 0 0.00 4.642e+08
## 13 0 0.00 3.511e+08
## 14 0 0.00 2.656e+08
## 15 0 0.00 2.009e+08
## 16 0 0.00 1.520e+08
## 17 0 0.00 1.150e+08
## 18 0 0.00 8.697e+07
## 19 0 0.00 6.579e+07
## 20 0 0.00 4.977e+07
## 21 0 0.00 3.765e+07
## 22 0 0.00 2.848e+07
## 23 0 0.00 2.154e+07
## 24 0 0.00 1.630e+07
## 25 0 0.00 1.233e+07
## 26 0 0.00 9.326e+06
## 27 0 0.00 7.055e+06
## 28 0 0.00 5.337e+06
## 29 0 0.00 4.037e+06
## 30 0 0.00 3.054e+06
## 31 0 0.00 2.310e+06
## 32 0 0.00 1.748e+06
## 33 0 0.00 1.322e+06
## 34 0 0.00 1.000e+06
## 35 0 0.00 7.565e+05
## 36 0 0.00 5.722e+05
## 37 0 0.00 4.329e+05
## 38 0 0.00 3.275e+05
## 39 0 0.00 2.477e+05
## 40 0 0.00 1.874e+05
## 41 0 0.00 1.417e+05
## 42 0 0.00 1.072e+05
## 43 0 0.00 8.111e+04
## 44 0 0.00 6.136e+04
## 45 0 0.00 4.642e+04
## 46 0 0.00 3.511e+04
## 47 0 0.00 2.656e+04
## 48 0 0.00 2.009e+04
## 49 0 0.00 1.520e+04
## 50 0 0.00 1.150e+04
## 51 0 0.00 8.697e+03
## 52 0 0.00 6.579e+03
## 53 0 0.00 4.977e+03
## 54 0 0.00 3.765e+03
## 55 0 0.00 2.848e+03
## 56 0 0.00 2.154e+03
## 57 0 0.00 1.630e+03
## 58 0 0.00 1.233e+03
## 59 0 0.00 9.330e+02
## 60 1 36.68 7.060e+02

## 61 1 59.85 5.340e+02
## 62 2 73.21 4.040e+02
## 63 2 81.00 3.050e+02
## 64 2 85.46 2.310e+02
## 65 3 88.75 1.750e+02
## 66 3 91.24 1.320e+02
## 67 3 92.67 1.000e+02
## 68 3 93.49 7.600e+01
## 69 4 94.20 5.700e+01
## 70 4 94.61 4.300e+01
## 71 4 94.84 3.300e+01
## 72 4 94.97 2.500e+01
## 73 4 95.05 1.900e+01
## 74 4 95.09 1.400e+01
## 75 4 95.11 1.100e+01
## 76 5 95.13 8.000e+00
## 77 5 95.14 6.000e+00
## 78 6 95.15 5.000e+00
## 79 6 95.16 4.000e+00
## 80 6 95.17 3.000e+00
## 81 6 95.17 2.000e+00
## 82 6 95.18 2.000e+00
## 83 7 95.19 1.000e+00
## 84 7 95.20 1.000e+00
## 85 7 95.20 1.000e+00
## 86 7 95.21 0.000e+00
## 87 7 95.21 0.000e+00
## 88 7 95.21 0.000e+00
## 89 7 95.21 0.000e+00
## 90 7 95.21 0.000e+00
## 91 7 95.21 0.000e+00
## 92 7 95.21 0.000e+00
## 93 7 95.21 0.000e+00
## 94 7 95.21 0.000e+00
## 95 7 95.21 0.000e+00
## 96 7 95.21 0.000e+00
## 97 7 95.21 0.000e+00
## 98 7 95.21 0.000e+00
## 99 7 95.21 0.000e+00
## 100 7 95.21 0.000e+00

Perform cross-validation

cvmodl <- cv.glmnet(x.train, y.train, alpha = 1, nfolds = 10)


cvmodl

##
## Call: cv.glmnet(x = x.train, y = y.train, alpha = 1, nfolds = 10)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero

## min 1.79 68 45462 3936 6
## 1se 35.20 36 48905 4708 4

lasso_lambda <- cvmodl$lambda.min


lasso_lambda

## [1] 1.793346

predict(lasso_mod,type="coefficients",s=lasso_lambda)

## 9 x 1 sparse Matrix of class "dgCMatrix"


## s1
## (Intercept) -373.28612008
## (Intercept) .
## UNRATE -73.94648883
## M2REAL 0.55086676
## PCE 0.02890384
## CPI 30.68898065
## FEDFUNDS -6.76655218
## MORTGAGE30US .
## T10Y2Y -110.51307255

MORTGAGE30US was ruled out: the lasso shrank its coefficient exactly to zero.
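A more conservative choice is the one-standard-error rule; per the CV summary above, lambda.1se keeps only 4 nonzero coefficients. A sketch of how to inspect them:

# Coefficients at the sparser lambda.1se instead of lambda.min
predict(lasso_mod, type = "coefficients", s = cvmodl$lambda.1se)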

Training MSE is:

set.seed(123)
lasso.predtr <- predict(lasso_mod,s=lasso_lambda,newx=x.train)
mean((lasso.predtr-y.train)^2)

## [1] 44329.75

Testing MSE is:

set.seed(123)
lasso.predte <- predict(lasso_mod,s=lasso_lambda,newx=x.test)
mean((lasso.predte-y.test)^2)

## [1] 42094.5

After the lasso regression performed variable selection automatically, dropping MORTGAGE30US from the regression model, we get a significantly lower MSE than ridge for both the training set and the testing set. However, both are still very high, and the training MSE is still higher than the testing MSE, implying that the model still underfits.

Part 3, Tree Regression
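The tree chunks below use a data frame called data_no_date that is never constructed in the code shown. Presumably it is just data1 with the Date column dropped; a hypothetical reconstruction:

# Hypothetical reconstruction: drop the Date column from data1
data_no_date <- subset(data1, select = -Date)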

tree.data_no_date = tree(Price ~ . , data = data_no_date)

plot(tree.data_no_date)
text(tree.data_no_date, pretty = 0, col = "red", cex = 0.6)
title("decision tree on S&P 500 price", cex = 0.7)

[Figure: "decision tree on S&P 500 price", splitting on PCE < 11439.5, PCE < 5394.85, MORTGAGE30US < 3.22, and M2REAL < 5551.8, with leaf predictions 308.1, 1203.0, 2043.0, 2800.0, and 4007.0]

Tree regression on the actual data.

# Set random seed for results being reproducible


set.seed(123)
# Sample 75% of observations as the training set
train_tree = sample(nrow(data_no_date), 0.75*nrow(data_no_date))
data_no_date.train = data_no_date[train_tree, ]
# The remaining 25% as the test set
data_no_date.test = data_no_date[-train_tree,]
# Fit model on training set
tree.data_no_date_train = tree(Price ~ . , data = data_no_date.train)
# Plot the tree
draw.tree(tree.data_no_date_train, nodeinfo=TRUE, cex = 0.55)

[Figure: training-set decision tree with node info; root split on PCE at 11498.3 (743 obs), further splits on PCE at 5394.85, MORTGAGE30US at 3.135, and M2REAL at 5551.8; leaf means range from 309.96 to 4074.12; total deviance explained = 95.5%]

Tree regression fit on the training portion of a 75/25 split.

Tree Result

tree.pred = predict(tree.data_no_date_train, data_no_date.test)

plot(tree.pred, data_no_date.test$Price,
     xlab = "Tree Regression Prediction",
     ylab = "S&P 500 Price")
abline(0, 1)

[Figure: tree regression predictions vs actual S&P 500 price, with the 45-degree line]

MSE_tree = mean((tree.pred - data_no_date.test$Price)^2)


MSE_tree

## [1] 46910.95

sqrt(MSE_tree)

## [1] 216.5894

Tree Interpretation

The tree regression model gives an MSE of 46910.95 and an RMSE of 216.5894. The MSE is extremely high, and the RMSE tells us that on average our predictions are 216.5894 away from the actual price, which is very inaccurate considering that the SPX price ranges from roughly 0 to 4000.

Pruning the Regression Tree

cv.data_no_date <- cv.tree(tree.data_no_date, K = 10)


plot (cv.data_no_date$size , cv.data_no_date$dev, type = "b", col = 'red',
xlab = "Best Tree Size",
ylab = "CV Error")

best.cv = min(cv.data_no_date$size[cv.data_no_date$dev == min(cv.data_no_date$dev)])
abline(v=best.cv, lty=2)
# Add lines to identify complexity parameter
min.error = which.min(cv.data_no_date$dev) # Get minimum error index
abline(h = cv.data_no_date$dev[min.error],lty = 2)
[Figure: CV error vs tree size (1 to 5), with dashed lines marking the size that minimizes CV error]

We performed k-fold CV to determine the optimal level of tree complexity, which is 5 in this case.

pt.data_no_date <- prune.tree(tree.data_no_date, best = best.cv)

plot(pt.data_no_date)
text(pt.data_no_date , pretty = 0, cex = 0.52, col = "blue")

[Figure: pruned decision tree (size 5); same structure as the unpruned tree, splitting on PCE < 11439.5, PCE < 5394.85, MORTGAGE30US < 3.22, and M2REAL < 5551.8]
This is the pruned regression tree on the full data with tree size = 5.

Pruning Tree Result

pred.pt.cv = predict(pt.data_no_date, data_no_date.test)

plot(pred.pt.cv, data_no_date.test$Price,
     xlab = "Pruned Tree Regression Prediction",
     ylab = "S&P 500 Price")
abline(0, 1)

[Figure: pruned tree predictions vs actual S&P 500 price, with the 45-degree line]

MSE_prune = mean((pred.pt.cv - data_no_date.test$Price)^2)


MSE_prune

## [1] 38330.25

sqrt(MSE_prune)

## [1] 195.7811

Pruned Tree Interpretation

The pruned tree regression model gives an MSE of 38330.25 and an RMSE of 195.7811. The MSE is still extremely high, although it is an improvement over the unpruned tree (whose MSE is about 22.4% higher). The RMSE tells us that on average our predictions are 195.7811 away from the actual price, which is still very inaccurate considering that the SPX price ranges from roughly 0 to 4000.

Bagging and random forest

# Create training (75%) and test (25%) sets for the data.
# Use set.seed for reproducibility
set.seed(123)

# With the same seed, this reproduces the earlier train_tree indices
ames_split <- sample(nrow(data_no_date), 0.75*nrow(data_no_date))
ames_train <- data_no_date[ames_split, ]
ames_test <- data_no_date[-ames_split,]

Create 75/25 split sets for data.

Tuning

hyper_grid <- expand.grid(


mtry = seq(0, 7, by = 1),
node_size = seq(1, 9, by = 1),
sample_size = c(.75),
OOB_RMSE = 0
)

for(i in 1:nrow(hyper_grid)) {

# reproducibility
set.seed(123)
# train model
model <- ranger(
formula = Price ~ .,
data = ames_train,
num.trees = 500,
mtry = hyper_grid$mtry[i],
min.node.size = hyper_grid$node_size[i],
sample.fraction = hyper_grid$sample_size[i],
seed = 123
)

# add OOB error to grid


hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
}

hyper_grid %>%
dplyr::arrange(OOB_RMSE) %>%
head(10)

## mtry node_size sample_size OOB_RMSE


## 1 3 1 0.75 41.48185
## 2 3 2 0.75 41.82652
## 3 4 1 0.75 41.92855
## 4 4 2 0.75 42.14828
## 5 3 3 0.75 42.15636
## 6 4 3 0.75 42.40021
## 7 5 1 0.75 42.55801
## 8 5 2 0.75 42.62072
## 9 4 4 0.75 42.70524
## 10 3 4 0.75 42.85740

We find the best model parameters, with ntree = 500 and sample size = 75%, by iterating over mtry = seq(0, 7, by = 1) and node_size = seq(1, 9, by = 1).

Best Bagging and randomForest

set.seed(123)
bag.data_no_date <- randomForest(
formula = Price ~ .,
data = data_no_date,
mtry = 3,
samplesize = 0.75,
nodesize = 1,
ntree = 500,
importance = TRUE
)

Fit the model with our best parameters: mtry = 3, samplesize = 0.75, nodesize = 1, ntree = 500.
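As a quick sanity check on the tuning result, the fitted forest's own OOB error can be read off the model object; a sketch (randomForest stores the running OOB MSE after each tree):

# OOB RMSE of the final forest after all 500 trees
sqrt(tail(bag.data_no_date$mse, 1))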

varImpPlot(bag.data_no_date)

[Figure: varImpPlot for bag.data_no_date; the %IncMSE panel ranks PCE, M2REAL, and UNRATE highest, while the IncNodePurity panel ranks PCE, M2REAL, and MORTGAGE30US highest]

Variable importance and error comparison for the best model.

# create training and validation data


set.seed(123)
valid_split <- initial_split(ames_train, .75)

# training data
ames_train_v2 <- analysis(valid_split)

# validation data
ames_valid <- assessment(valid_split)
x_test <- ames_valid[setdiff(names(ames_valid), "Price")]
y_test <- ames_valid$Price

rf_oob_comp <- randomForest(


formula = Price ~ .,
data = data_no_date,
mtry = 3,
samplesize = 0.75,
nodesize = 1,
ntree = 500,
importance = TRUE,
xtest = x_test,
ytest = y_test
)

# extract OOB & validation errors


oob <- sqrt(rf_oob_comp$mse)
validation <- sqrt(rf_oob_comp$test$mse)

# compare error rates


tibble::tibble(
`Out of Bag Error` = oob,
`Test error` = validation,
ntrees = 1:rf_oob_comp$ntree
) %>%
gather(Metric, RMSE, -ntrees) %>%
ggplot(aes(ntrees, RMSE, color = Metric)) +
geom_line() +
scale_y_continuous(labels = scales::dollar) +
xlab("Number of trees")

[Figure: Out of Bag Error vs Test error (RMSE) as the number of trees grows from 0 to 500]
The first graph shows the variable importance in the bagging model.
%IncMSE is based on the mean decrease in prediction accuracy on the out-of-bag samples when a given variable is permuted; PCE, M2REAL, and UNRATE are our top 3 influencers.
IncNodePurity measures the total decrease in node impurity from splits over a variable, averaged over all trees; here PCE, M2REAL, and MORTGAGE30US are the top 3 influencers.
The second graph compares the OOB error against the test error with the number of trees ranging from 0 to 500.

bag.pred = predict(bag.data_no_date, data_no_date.test)

plot(bag.pred, data_no_date.test$Price,
     xlab = "Random Forest Prediction",
     ylab = "S&P 500 Price")
abline(0, 1)

[Figure: random forest predictions vs actual S&P 500 price, with the 45-degree line]

Bagging Result

Bagging Interpretation

The bagging/random forest model gives an OOB RMSE of 41.48185. This RMSE is lower than that of every other model except boosting, and it tells us that on average our predictions are 41.48185 away from the actual price, which is pretty decent. However, we would not use this model over boosting.

Boosting

set.seed (123)

Tuning

hyper_grid <- expand.grid(


shrinkage = c(.01, .05, .1, .2),
interaction.depth = c(1, 2, 3, 4),
n.minobsinnode = c(1, 5, 10, 15),
bag.fraction = c(.65, .7, .8, 1),
optimal_trees = 0, # a place to dump results
min_RMSE = 0 # a place to dump results
)

# randomize data 75/25 split


random_index <- sample(nrow(data_no_date), 0.75*nrow(data_no_date))
random_ames_train <- data_no_date[random_index, ]

# grid search

for(i in 1:nrow(hyper_grid)) {

# reproducibility
set.seed(123)

# train model
gbm.tune <- gbm(
formula = Price ~ . ,
distribution = "gaussian",
data = random_ames_train,
n.trees = 2000,
interaction.depth = hyper_grid$interaction.depth[i],
shrinkage = hyper_grid$shrinkage[i],
n.minobsinnode = hyper_grid$n.minobsinnode[i],
bag.fraction = hyper_grid$bag.fraction[i],
train.fraction = .75,
n.cores = NULL, # will use all cores by default
verbose = FALSE
)
# reproducibility
set.seed(123)
# add min training error and trees to grid
hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))
}

hyper_grid %>%
dplyr::arrange(min_RMSE) %>%
head(10)

## shrinkage interaction.depth n.minobsinnode bag.fraction optimal_trees


## 1 0.05 4 10 0.65 1189
## 2 0.05 3 1 0.70 1389
## 3 0.05 3 5 0.65 1524
## 4 0.05 4 1 0.80 613
## 5 0.05 4 10 0.80 1150
## 6 0.05 3 10 0.80 1508
## 7 0.05 3 10 0.65 1932
## 8 0.05 4 5 0.80 589
## 9 0.05 4 15 0.70 1946
## 10 0.05 2 5 1.00 1884
## min_RMSE
## 1 35.86454
## 2 36.02802
## 3 36.03908
## 4 36.28827
## 5 36.36611
## 6 36.40322
## 7 36.44686
## 8 36.55897
## 9 36.71759
## 10 36.78113

We find the best model parameters, with a 75% training fraction, by iterating over shrinkage = c(.01, .05, .1, .2), interaction.depth = c(1, 2, 3, 4), n.minobsinnode = c(1, 5, 10, 15), and bag.fraction = c(.65, .7, .8, 1), with up to 2000 trees.

Best boost

# for reproducibility
set.seed(123)

# train GBM model


gbm.fit.final <- gbm(
formula = Price ~ . ,
distribution = "gaussian",
data = data_no_date,
n.trees = 1189,
interaction.depth = 4,
shrinkage = 0.05 ,
n.minobsinnode = 10,
bag.fraction = 0.65,
train.fraction = 1,
n.cores = NULL, # will use all cores by default
verbose = FALSE
)

Fit the best Gradient Boosting Model with parameters n.trees = 1189, interaction.depth = 4, shrinkage = 0.05, n.minobsinnode = 10, bag.fraction = 0.65, and train.fraction = 1.

par(mar = c(5, 8, 1, 2))


summary(
gbm.fit.final,
cBars = 10,
method = relative.influence,
las = 2
)

[Figure: relative influence plot; PCE and M2REAL dominate, with the remaining five variables contributing very little]

Variable Importance

## var rel.inf
## PCE PCE 48.52661523
## M2REAL M2REAL 48.36345187
## MORTGAGE30US MORTGAGE30US 1.20761149
## UNRATE UNRATE 1.11218723
## T10Y2Y T10Y2Y 0.54742215
## FEDFUNDS FEDFUNDS 0.15887628
## CPI CPI 0.08383576

This is the relative influence plot along with the relative influence statistics. They show that PCE and M2REAL are by far the most important variables: the boosted model relies on them almost exclusively when predicting the S&P 500 price.

Boosting Result

boost.pred = predict(gbm.fit.final, data_no_date.test)

## Using 1189 trees...

plot(boost.pred, data_no_date.test$Price,
     xlab = "Gradient Boosting Prediction",
     ylab = "S&P 500 Price")
abline(0, 1)

[Figure: gradient boosting predictions vs actual S&P 500 price, with the 45-degree line]

Boosting Interpretation

The Gradient Boosting model gives an RMSE of 35.86454. This is the lowest RMSE of all our models, and it tells us that on average our predictions are 35.86454 away from the actual price, which is pretty decent.
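The MSE cited in the conclusion below is simply the square of this RMSE:

# RMSE squared gives the MSE quoted in the conclusion
35.86454^2 # = 1286.265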

Conclusion

In conclusion, the best model is the Gradient Boosting model generated above. With an RMSE of 35.86454 and an MSE of 1286.265, it performs the best among all of our models. Even though it is still not very accurate, we believe that for beginner-level stock prediction it performs well enough. Thus, we would use boosting as our final model.
