Final Project
2022-03-12
Intro
According to Investopedia, “the stock market broadly refers to the collection of exchanges and other venues
where the buying, selling, and issuance of shares of publicly held companies take place. Such financial
activities are conducted through institutionalized formal exchanges (whether physical or electronic) or via
over-the-counter (OTC) marketplaces that operate under a defined set of regulations.”
Stock markets are important, especially in a free-market economy like that of the United States, because
they allow any investor to trade and exchange capital. Through the stock market, companies can issue and
sell their shares to raise capital so that they can carry out their business activities.
To gauge how the market is doing, there exist many indices that track the movement of the market. One such index is the S&P 500 index (S&P stands for Standard & Poor’s). The S&P 500 is composed of the 500 largest companies in the U.S. across 11 business sectors. The index weights each company by its market capitalization (the company’s outstanding shares multiplied by its current share price).
For the purposes of this project, we will focus on SPX and how various monetary and macroeconomic indicators, namely the M2 real money supply, personal consumption expenditures (PCE), the Federal Funds Effective Rate, the unemployment rate, CPI, the 10-Year Treasury Constant Maturity Minus 2-Year Treasury Constant Maturity spread, and the 30-Year Fixed Rate Mortgage Average in the United States, affect SPX’s price. These are the 7 variables that we will use to predict the SPX price in the future.
In this project, we will use several different regression models to predict our SPX outcome: a linear regression model, a ridge regression model, a lasso regression model, a regression tree, the bagging/random forest method, and the boosting method, each predicting SPX from our 7 variables. Some of these models might not work as well as desired because we do not have many predictors and they may be correlated to some degree, but we will use methods like step-wise selection and cross-validation to enhance the models’ performance. We will then compare the performance of our models and select the most accurate one as our final model.
These models might be useful because they can be applied in real life. Specifically, should a model succeed in predicting SPX values somewhat accurately, difficult as that is to achieve, we could potentially profit from its predictions by going long or short on the SPX.
Loading data and packages:
Before diving into our models, we think it’s important to understand the basics of what each of our predictors means.
From Investopedia:
M2 Real Money Supply A measure of the money supply that includes cash, checking deposits, and
easily-convertible near money.
Federal Funds Effective Rate The interest rate banks charge each other for overnight loans to meet
their reserve requirements.
Unemployment rate The number of unemployed people as a percentage of the labor force.
CPI The average change in prices over time that consumers pay for a basket of goods and services.
10-Year Treasury Constant Maturity Minus 2-Year Treasury Constant Maturity Constant maturity is the theoretical value of a U.S. Treasury that is based on recent values of auctioned U.S. Treasuries. They are often used by lenders to determine mortgage rates.
30-Year Fixed Rate Mortgage Average Average of the 30-year fixed-rate mortgage, which is basically
a home loan that gives you 30 years to pay back the money you borrowed at an interest rate that won’t
change.
PCE Personal consumption expenditures, which measures the change in goods and services consumed by all households and nonprofit institutions serving households.
Set Up
Training/Testing split
#Generate x and y
x <- model.matrix(Price~.-Date,data1)
y <- data1$Price
set.seed(123)
#Split data
train <- sample(1:nrow(x),600)
#Training data
x.train <- x[train,]
y.train <- y[train]
#Testing data
x.test <- x[-train,]
y.test <- y[-train]
Linear Regression on Training Data
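The chunk that fits this model is not shown; a minimal sketch consistent with the Call line echoed in the output below would be:
train_lm_fit <- lm(y.train ~ x.train, data = data1) # fit OLS on the training split (mirrors the Call below)
summary(train_lm_fit) # coefficient table and fit statistics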
##
## Call:
## lm(formula = y.train ~ x.train, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -539.96 -135.31 -5.98 121.23 794.54
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.345e+02 8.459e+01 -5.136 3.81e-07 ***
## x.train(Intercept) NA NA NA NA
## x.trainUNRATE -7.556e+01 6.386e+00 -11.832 < 2e-16 ***
## x.trainM2REAL 5.453e-01 2.159e-02 25.250 < 2e-16 ***
## x.trainPCE 3.325e-02 9.904e-03 3.358 0.000836 ***
## x.trainCPI 5.189e+01 2.993e+01 1.734 0.083457 .
## x.trainFEDFUNDS -3.245e+01 1.204e+01 -2.696 0.007213 **
## x.trainMORTGAGE30US 2.661e+01 1.256e+01 2.118 0.034584 *
## x.trainT10Y2Y -1.460e+02 2.264e+01 -6.448 2.35e-10 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 211.1 on 592 degrees of freedom
## Multiple R-squared: 0.9521, Adjusted R-squared: 0.9516
## F-statistic: 1683 on 7 and 592 DF, p-value: < 2.2e-16
The R^2 value is 0.9521 and the adjusted R^2 is 0.9516, both of which indicate a good fit, but the summary also shows that CPI and MORTGAGE30US are not very significant. Let’s follow up with some other tools to analyze our data.
Let’s run a residuals vs fitted plot on our training data to see if we have constant variance.
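The plotting chunk is omitted; a sketch of the call that would produce the figure below, mirroring the one used later for the test fit:
plot(fitted(train_lm_fit), residuals(train_lm_fit), xlab = "Fitted", ylab = "Residuals") # residuals vs. fitted
abline(0, 0) # horizontal reference line at zero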
[Figure: residuals vs. fitted values for the training-data linear regression.]
Here, we see that the regression line systematically over- and under-predicts the data at different points along the curve. The points are neither randomly nor evenly spread around the line, indicating that the constant-variance assumption of linear regression does not hold. As a result, linear regression does not appear to be a great fit for our data.
QQPlot
Let’s check whether the residuals come from a normal distribution by using a Q-Q plot:
qqnorm(residuals(train_lm_fit))
qqline(residuals(train_lm_fit))
[Figure: Normal Q-Q plot of the residuals (Sample Quantiles vs. Theoretical Quantiles).]
Looking at our Q-Q plot, we see that the majority of points fall fairly close to the straight line. This is a good sign, showing that our residuals do seem to follow the reference distribution, which here is the normal distribution, since we used the “qqnorm” function.
However, the points at both ends waver and move away from the line that indicates normality. We can therefore conclude that our residuals are not precisely normal, but they are not too far off either.
Cook’s Distance
Cook’s Distance is used to find influential outliers among a set of predictor variables. A common rule of thumb is that any data point whose Cook’s distance exceeds the 4/n threshold (where n = the number of data points) has strong influence over the fitted values and is considered an outlier.
Let’s see if we have any particularly influential outliers:
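The plotting chunk is omitted; a sketch that would produce the figure below (the threshold line is an assumption based on the 4/n rule described above):
plot(cooks.distance(train_lm_fit), ylab = "cooks.distance(train_lm_fit)") # Cook's distance per observation
abline(h = 4 / length(y.train), lty = 2) # hypothetical 4/n threshold line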
[Figure: Cook’s distance for each observation (cooks.distance(train_lm_fit), roughly 0.00–0.08, vs. Index).]
Using Cook’s Distance to analyze our data, we see that there are many points that cross this 4/n threshold.
This doesn’t necessarily mean that we should remove all these points. Instead, the presence of so many
highly influential points could indicate that a linear regression is not the best model for our data.
Training MSE
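The computation is not shown; a sketch, assuming the training MSE is the mean squared residual of train_lm_fit:
mean((fitted(train_lm_fit) - y.train)^2) # training MSE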
## [1] 43952.57
The training MSE is 43952.57, which is very large. Let’s see if the models we use later give us a better MSE.
But first, we should take a look at whether this can be fixed by simply performing variable selection.
Now that we have analyzed our data, let’s perform forward selection to choose our best linear regression model.
# find cross validation error for a given model
set.seed(123)
# Use 10-fold cross-validation to estimate the average prediction error (RMSE) of each of the 7 possible models
train_control <- trainControl(method = "cv", number = 10) # define the 10-fold CV scheme
cv <- train(Price ~ ., data = data1, method = "leapForward", # Forward selection using "leapForward"
tuneGrid = data.frame(nvmax = 1:7), #nvmax stands for the number of variables in the model.
trControl = train_control
)
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 106
## linear dependencies found
cv$results
cv$bestTune
## nvmax
## 7 7
The lower the RMSE and MAE are, the better the model. From the results of our forward selection, the 7-variable model is the best of the 7 candidates: it has the lowest RMSE and MAE along with the highest R^2 value, though the values for the 4-, 5-, 6-, and 7-predictor models are all very close. Additionally, cv$bestTune also tells us that the best model is the one with 7 predictor variables.
As a result, to produce the most accurate predictions with a linear regression on our data, we should use all 7 predictors.
##
## Call:
## lm(formula = y.test ~ x.test, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -662.28 -124.81 -16.42 109.78 743.97
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -495.22275 100.94729 -4.906 1.38e-06 ***
## x.test(Intercept) NA NA NA NA
## x.testUNRATE -89.75160 8.83747 -10.156 < 2e-16 ***
## x.testM2REAL 0.53211 0.02540 20.953 < 2e-16 ***
## x.testPCE 0.04212 0.01162 3.625 0.000329 ***
## x.testCPI 75.29831 31.55377 2.386 0.017501 *
## x.testFEDFUNDS -26.45505 13.76062 -1.923 0.055282 .
## x.testMORTGAGE30US 34.19934 13.42136 2.548 0.011220 *
## x.testT10Y2Y -98.70008 28.57440 -3.454 0.000614 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 204.4 on 383 degrees of freedom
## Multiple R-squared: 0.9527, Adjusted R-squared: 0.9518
## F-statistic: 1102 on 7 and 383 DF, p-value: < 2.2e-16
# R^2 is 0.9527
plot(fitted(test_lm_fit), residuals(test_lm_fit), xlab = "Fitted", ylab = "Residuals")
abline(0,0)
[Figure: residuals vs. fitted values for the test-data linear regression.]
R^2 is 0.9527, but the residuals vs. fitted plot again shows that the regression line systematically over- and under-predicts the data at different points along the curve. Linear regression does not appear to be a good fit for our data.
Testing MSE:
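The computation is omitted; a sketch, assuming the testing MSE is the mean squared residual of test_lm_fit:
mean((fitted(test_lm_fit) - y.test)^2) # testing MSE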
## [1] 40920.33
The testing MSE is 40920.33, slightly smaller than the training MSE, which is not what we would normally expect. It is also still huge, which suggests an underfit model. Let’s see if variable selection can improve it.
To perform forward selection on our testing data, we will use the “regsubsets” function, which performs subset selection by identifying the best model for each given number of predictors based on RSS (residual sum of squares). By setting “nvmax = 7”, we are asking for the best model with 1 predictor, 2 predictors, and so on, all the way up to 7 predictors.
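The chunk creating summary_best_subset is not shown; a sketch under the assumption that the test rows of data1 are used (the data argument and formula are assumptions):
best_subset <- regsubsets(Price ~ . - Date, data = data1[-train, ], nvmax = 7, method = "forward") # forward selection on the test rows
summary_best_subset <- summary(best_subset)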
which.max(summary_best_subset$adjr2) # to find number of predictors that gives us the best model
## [1] 7
Performing forward selection on our test data, we also see that using all 7 predictors gives us the best model under linear regression. This makes sense, as our R^2 value for the linear regression on all 7 predictors was already quite high.
In conclusion, from performing linear regression we found that keeping all 7 of our predictor variables gives us the best predictions of the stock market price. However, this is only the best model under linear regression, which, as we saw from our data analysis, does not seem to be the model that best fits our data. Now, let’s move on to another model to see if it fits and predicts our data better.
#Generate grid
grid <- 10^seq(10, -2, length = 100)
#Ridge regression
ridge_mod <- glmnet(x.train, y.train, alpha = 0, lambda = grid)
ridge_mod
##
## Call: glmnet(x = x.train, y = y.train, alpha = 0, lambda = grid)
##
## Df %Dev Lambda
## 1 7 0.00 1.000e+10
## 2 7 0.00 7.565e+09
## 3 7 0.00 5.722e+09
## 4 7 0.00 4.329e+09
## 5 7 0.00 3.275e+09
## 6 7 0.00 2.477e+09
## 7 7 0.00 1.874e+09
## 8 7 0.00 1.417e+09
## 9 7 0.00 1.072e+09
## 10 7 0.00 8.111e+08
## 11 7 0.00 6.136e+08
## 12 7 0.00 4.642e+08
## 13 7 0.00 3.511e+08
## 14 7 0.00 2.656e+08
## 15 7 0.00 2.009e+08
## 16 7 0.00 1.520e+08
## 17 7 0.00 1.150e+08
## 18 7 0.01 8.697e+07
## 19 7 0.01 6.579e+07
## 20 7 0.01 4.977e+07
## 21 7 0.01 3.765e+07
## 22 7 0.02 2.848e+07
## 23 7 0.03 2.154e+07
## 24 7 0.03 1.630e+07
## 25 7 0.05 1.233e+07
## 26 7 0.06 9.326e+06
## 27 7 0.08 7.055e+06
## 28 7 0.10 5.337e+06
## 29 7 0.14 4.037e+06
## 30 7 0.18 3.054e+06
## 31 7 0.24 2.310e+06
## 32 7 0.32 1.748e+06
## 33 7 0.42 1.322e+06
## 34 7 0.56 1.000e+06
## 35 7 0.74 7.565e+05
## 36 7 0.97 5.722e+05
## 37 7 1.28 4.329e+05
## 38 7 1.69 3.275e+05
## 39 7 2.22 2.477e+05
## 40 7 2.91 1.874e+05
## 41 7 3.81 1.417e+05
## 42 7 4.98 1.072e+05
## 43 7 6.49 8.111e+04
## 44 7 8.42 6.136e+04
## 45 7 10.85 4.642e+04
## 46 7 13.88 3.511e+04
## 47 7 17.61 2.656e+04
## 48 7 22.08 2.009e+04
## 49 7 27.32 1.520e+04
## 50 7 33.28 1.150e+04
## 51 7 39.82 8.697e+03
## 52 7 46.74 6.579e+03
## 53 7 53.74 4.977e+03
## 54 7 60.53 3.765e+03
## 55 7 66.84 2.848e+03
## 56 7 72.46 2.154e+03
## 57 7 77.30 1.630e+03
## 58 7 81.33 1.233e+03
## 59 7 84.59 9.330e+02
## 60 7 87.18 7.060e+02
## 61 7 89.19 5.340e+02
## 62 7 90.73 4.040e+02
## 63 7 91.89 3.050e+02
## 64 7 92.76 2.310e+02
## 65 7 93.40 1.750e+02
## 66 7 93.89 1.320e+02
## 67 7 94.24 1.000e+02
## 68 7 94.51 7.600e+01
## 69 7 94.71 5.700e+01
## 70 7 94.85 4.300e+01
## 71 7 94.96 3.300e+01
## 72 7 95.04 2.500e+01
## 73 7 95.09 1.900e+01
## 74 7 95.13 1.400e+01
## 75 7 95.16 1.100e+01
## 76 7 95.18 8.000e+00
## 77 7 95.19 6.000e+00
## 78 7 95.20 5.000e+00
## 79 7 95.21 4.000e+00
## 80 7 95.21 3.000e+00
## 81 7 95.21 2.000e+00
## 82 7 95.21 2.000e+00
## 83 7 95.21 1.000e+00
## 84 7 95.21 1.000e+00
## 85 7 95.21 1.000e+00
## 86 7 95.21 0.000e+00
## 87 7 95.21 0.000e+00
## 88 7 95.21 0.000e+00
## 89 7 95.21 0.000e+00
## 90 7 95.21 0.000e+00
## 91 7 95.21 0.000e+00
## 92 7 95.21 0.000e+00
## 93 7 95.21 0.000e+00
## 94 7 95.21 0.000e+00
## 95 7 95.21 0.000e+00
## 96 7 95.21 0.000e+00
## 97 7 95.21 0.000e+00
## 98 7 95.21 0.000e+00
## 99 7 95.21 0.000e+00
## 100 7 95.21 0.000e+00
Perform cross-validation
#Cross-validation
cvmodr <- cv.glmnet(x.train, y.train, alpha = 0, nfolds = 10) # cv.glmnet's argument is nfolds, not folds
cvmodr
##
## Call: cv.glmnet(x = x.train, y = y.train, alpha = 0, nfolds = 10)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 91.36 100 53324 4236 7
## 1se 132.54 96 57504 4913 7
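The chunk extracting the best lambda is omitted; a plausible sketch producing the value below:
ridge_lambda <- cvmodr$lambda.min # CV-minimizing lambda
ridge_lambda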
## [1] 91.35554
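The coefficient listing below presumably comes from a predict call at this lambda, mirroring the one shown in the lasso section; a sketch:
predict(ridge_mod, type = "coefficients", s = ridge_lambda)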
## (Intercept) -86.69913181
## (Intercept) .
## UNRATE -53.99007905
## M2REAL 0.37637001
## PCE 0.07151295
## CPI 53.78446599
## FEDFUNDS -11.01616897
## MORTGAGE30US -11.68599230
## T10Y2Y -138.77562341
set.seed(123)
ridge.predtr <- predict(ridge_mod,s=ridge_lambda,newx=x.train)
mean((ridge.predtr-y.train)^2)
## [1] 51946.73
set.seed(123)
ridge.predte <- predict(ridge_mod,s=ridge_lambda,newx=x.test)
mean((ridge.predte-y.test)^2)
## [1] 49091.91
Both the training and testing MSE are extremely high, and the training MSE is higher than the testing MSE, which is not what theory would predict. We conclude that this model does not fit the data well.
set.seed(123)
lasso_mod <- glmnet(x.train,y.train,alpha=1,lambda=grid)
lasso_mod
##
## Call: glmnet(x = x.train, y = y.train, alpha = 1, lambda = grid)
##
## Df %Dev Lambda
## 1 0 0.00 1.000e+10
## 2 0 0.00 7.565e+09
## 3 0 0.00 5.722e+09
## 4 0 0.00 4.329e+09
## 5 0 0.00 3.275e+09
## 6 0 0.00 2.477e+09
## 7 0 0.00 1.874e+09
## 8 0 0.00 1.417e+09
## 9 0 0.00 1.072e+09
## 10 0 0.00 8.111e+08
## 11 0 0.00 6.136e+08
## 12 0 0.00 4.642e+08
## 13 0 0.00 3.511e+08
## 14 0 0.00 2.656e+08
## 15 0 0.00 2.009e+08
## 16 0 0.00 1.520e+08
## 17 0 0.00 1.150e+08
## 18 0 0.00 8.697e+07
## 19 0 0.00 6.579e+07
## 20 0 0.00 4.977e+07
## 21 0 0.00 3.765e+07
## 22 0 0.00 2.848e+07
## 23 0 0.00 2.154e+07
## 24 0 0.00 1.630e+07
## 25 0 0.00 1.233e+07
## 26 0 0.00 9.326e+06
## 27 0 0.00 7.055e+06
## 28 0 0.00 5.337e+06
## 29 0 0.00 4.037e+06
## 30 0 0.00 3.054e+06
## 31 0 0.00 2.310e+06
## 32 0 0.00 1.748e+06
## 33 0 0.00 1.322e+06
## 34 0 0.00 1.000e+06
## 35 0 0.00 7.565e+05
## 36 0 0.00 5.722e+05
## 37 0 0.00 4.329e+05
## 38 0 0.00 3.275e+05
## 39 0 0.00 2.477e+05
## 40 0 0.00 1.874e+05
## 41 0 0.00 1.417e+05
## 42 0 0.00 1.072e+05
## 43 0 0.00 8.111e+04
## 44 0 0.00 6.136e+04
## 45 0 0.00 4.642e+04
## 46 0 0.00 3.511e+04
## 47 0 0.00 2.656e+04
## 48 0 0.00 2.009e+04
## 49 0 0.00 1.520e+04
## 50 0 0.00 1.150e+04
## 51 0 0.00 8.697e+03
## 52 0 0.00 6.579e+03
## 53 0 0.00 4.977e+03
## 54 0 0.00 3.765e+03
## 55 0 0.00 2.848e+03
## 56 0 0.00 2.154e+03
## 57 0 0.00 1.630e+03
## 58 0 0.00 1.233e+03
## 59 0 0.00 9.330e+02
## 60 1 36.68 7.060e+02
## 61 1 59.85 5.340e+02
## 62 2 73.21 4.040e+02
## 63 2 81.00 3.050e+02
## 64 2 85.46 2.310e+02
## 65 3 88.75 1.750e+02
## 66 3 91.24 1.320e+02
## 67 3 92.67 1.000e+02
## 68 3 93.49 7.600e+01
## 69 4 94.20 5.700e+01
## 70 4 94.61 4.300e+01
## 71 4 94.84 3.300e+01
## 72 4 94.97 2.500e+01
## 73 4 95.05 1.900e+01
## 74 4 95.09 1.400e+01
## 75 4 95.11 1.100e+01
## 76 5 95.13 8.000e+00
## 77 5 95.14 6.000e+00
## 78 6 95.15 5.000e+00
## 79 6 95.16 4.000e+00
## 80 6 95.17 3.000e+00
## 81 6 95.17 2.000e+00
## 82 6 95.18 2.000e+00
## 83 7 95.19 1.000e+00
## 84 7 95.20 1.000e+00
## 85 7 95.20 1.000e+00
## 86 7 95.21 0.000e+00
## 87 7 95.21 0.000e+00
## 88 7 95.21 0.000e+00
## 89 7 95.21 0.000e+00
## 90 7 95.21 0.000e+00
## 91 7 95.21 0.000e+00
## 92 7 95.21 0.000e+00
## 93 7 95.21 0.000e+00
## 94 7 95.21 0.000e+00
## 95 7 95.21 0.000e+00
## 96 7 95.21 0.000e+00
## 97 7 95.21 0.000e+00
## 98 7 95.21 0.000e+00
## 99 7 95.21 0.000e+00
## 100 7 95.21 0.000e+00
Perform cross-validation
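The chunk is not shown; a sketch consistent with the Call echoed below and the lambda value reported after it:
cvmodl <- cv.glmnet(x.train, y.train, alpha = 1, nfolds = 10) # 10-fold CV for the lasso
cvmodl
lasso_lambda <- cvmodl$lambda.min # CV-minimizing lambda, printed below
lasso_lambda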
##
## Call: cv.glmnet(x = x.train, y = y.train, alpha = 1, nfolds = 10)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 1.79 68 45462 3936 6
## 1se 35.20 36 48905 4708 4
## [1] 1.793346
predict(lasso_mod,type="coefficients",s=lasso_lambda)
set.seed(123)
lasso.predtr <- predict(lasso_mod,s=lasso_lambda,newx=x.train)
mean((lasso.predtr-y.train)^2)
## [1] 44329.75
set.seed(123)
lasso.predte <- predict(lasso_mod,s=lasso_lambda,newx=x.test)
mean((lasso.predte-y.test)^2)
## [1] 42094.5
After the lasso regression performed variable selection automatically, dropping MORTGAGE30US from the regression model, we get a significantly lower MSE for both the training set and the testing set. However, both are still very high, and the training MSE is still higher than the testing MSE, implying that the model still underfits.
Part 3
#Tree Regression
# data_no_date is assumed to be data1 without the Date column (created in an omitted chunk)
tree.data_no_date = tree(Price ~ . , data = data_no_date)
plot(tree.data_no_date)
text(tree.data_no_date, pretty = 0, col = "red", cex = 0.6)
title("decision tree on S&P 500 price", cex = 0.7)
[Figure: “decision tree on S&P 500 price” — the root split is on PCE (around 11498.3); leaf predictions include 308.1, 1203.0, 2043.0 (65 obs), 2800.0 (60 obs), and 4007.0, with the left branch covering 743 obs (66.8%) at a mean of about 1135.8.]
Tree Result
[Figure: regression-tree fit on the S&P 500 Price scale (0–4000).]
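The MSE computation is not shown; a sketch, assuming the tree is evaluated on the full data set (the evaluation set actually used is unclear):
yhat_tree <- predict(tree.data_no_date, newdata = data_no_date) # predictions (assumed in-sample)
MSE_tree <- mean((yhat_tree - data_no_date$Price)^2)
MSE_tree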
## [1] 46910.95
sqrt(MSE_tree)
## [1] 216.5894
Tree Interpretation
The tree regression model yields an MSE of 46910.95 and an RMSE of 216.5894. The MSE is extremely high, and the RMSE tells us that, on average, our predictions are 216.5894 away from the actual values, which is quite inaccurate considering that the SPX price ranges from 0 to 4000.
# cross-validate tree size (the cv.tree call and plot were in an omitted chunk; reconstructed here)
cv.data_no_date = cv.tree(tree.data_no_date, K = 10)
plot(cv.data_no_date$size, cv.data_no_date$dev, type = "b", xlab = "Tree size", ylab = "CV Error")
best.cv = min(cv.data_no_date$size[cv.data_no_date$dev == min(cv.data_no_date$dev)])
abline(v = best.cv, lty = 2)
# Add lines to identify complexity parameter
min.error = which.min(cv.data_no_date$dev) # Get minimum error index
abline(h = cv.data_no_date$dev[min.error], lty = 2)
[Figure: cross-validation error (roughly 2e+08 to 6e+08) vs. tree size (1–5).]
We performed k-fold CV to determine the optimal level of tree complexity, which is 5 in this case.
pt.data_no_date <- prune.tree(tree.data_no_date, best = best.cv) # prune to the CV-chosen size (this call was omitted; reconstructed)
plot(pt.data_no_date)
text(pt.data_no_date, pretty = 0, cex = 0.52, col = "blue")
[Figure: pruned decision tree — root split PCE < 11439.5; leaf predictions 308.1, 1203.0, 2043.0, 2800.0, and 4007.0.]
[Figure: pruned-tree fit on the S&P 500 Price scale (0–4000).]
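As above, the MSE computation is omitted; a sketch under the same assumption about the evaluation set:
yhat_prune <- predict(pt.data_no_date, newdata = data_no_date) # predictions (assumed in-sample)
MSE_prune <- mean((yhat_prune - data_no_date$Price)^2)
MSE_prune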
## [1] 38330.25
sqrt(MSE_prune)
## [1] 195.7811
The pruned tree regression model yields an MSE of 38330.25 and an RMSE of 195.7811. The MSE is still extremely high, though 22.38623% better than the unpruned tree, and the RMSE tells us that, on average, our predictions are 195.7811 away from the actual values, which is still quite inaccurate considering that the SPX price ranges from 0 to 4000.
# split the data for the tree-based models (75% train); the original assigned to ames_split but indexed with train_tree
train_tree <- sample(nrow(data_no_date), 0.75*nrow(data_no_date))
ames_train <- data_no_date[train_tree, ]
ames_test <- data_no_date[-train_tree,]
Tuning
# hyper-parameter grid (its definition was in an omitted chunk; reconstructed from the search described below)
hyper_grid <- expand.grid(
  mtry = seq(1, 7, by = 1), # ranger requires mtry >= 1
  node_size = seq(1, 9, by = 1),
  sample_size = 0.75,
  OOB_RMSE = 0 # placeholder column, filled in below
)
for(i in 1:nrow(hyper_grid)) {
  # reproducibility
  set.seed(123)
  # train model
  model <- ranger(
    formula = Price ~ .,
    data = ames_train,
    num.trees = 500,
    mtry = hyper_grid$mtry[i],
    min.node.size = hyper_grid$node_size[i],
    sample.fraction = hyper_grid$sample_size[i],
    seed = 123
  )
  # store the out-of-bag RMSE for this parameter combination
  hyper_grid$OOB_RMSE[i] <- sqrt(model$prediction.error)
}
hyper_grid %>%
  dplyr::arrange(OOB_RMSE) %>%
  head(10)
We find the best model parameters, with num.trees = 500 and sample size = 75%, by iterating over mtry = seq(1, 7, by = 1) and node_size = seq(1, 9, by = 1) in the grid reconstructed above.
Best Bagging and randomForest
set.seed(123)
bag.data_no_date <- randomForest(
  formula = Price ~ .,
  data = data_no_date,
  mtry = 3,
  sampsize = 0.75 * nrow(data_no_date), # randomForest's argument is sampsize (a row count), not samplesize
  nodesize = 1,
  ntree = 500,
  importance = TRUE
)
We fit the model with our best parameters: mtry = 3, sample size = 75%, nodesize = 1, ntree = 500.
varImpPlot(bag.data_no_date)
[Figure: varImpPlot(bag.data_no_date) — two variable-importance panels; by %IncMSE (roughly 5–25) the ordering is PCE, M2REAL, UNRATE, T10Y2Y, MORTGAGE30US, FEDFUNDS, CPI.]
Variable importance and error comparison for the best model.
# validation split (the creation of valid_split was omitted; reconstructed here assuming rsample's initial_split)
valid_split <- rsample::initial_split(ames_train, prop = 0.8)
# training data
ames_train_v2 <- analysis(valid_split)
# validation data
ames_valid <- assessment(valid_split)
x_test <- ames_valid[setdiff(names(ames_valid), "Price")]
y_test <- ames_valid$Price
[Figure: RMSE comparison plot (Metric: RMSE, roughly $30–$70).]
[Figure: random-forest fit on the S&P 500 Price scale (0–4000).]
set.seed(123)
Tuning
# grid search
# hyper-parameter grid (its definition was in an omitted chunk; reconstructed from the search described below)
hyper_grid <- expand.grid(
  shrinkage = c(.01, .05, .1, .2),
  interaction.depth = c(1, 2, 3, 4),
  n.minobsinnode = c(1, 5, 10, 15),
  bag.fraction = c(.65, .7, .8, 1),
  optimal_trees = 0, # placeholder columns, filled in below
  min_RMSE = 0
)
# random_ames_train is assumed to be a row-shuffled copy of the training data, created in an omitted chunk
for(i in 1:nrow(hyper_grid)) {
  # reproducibility
  set.seed(123)
  # train model
  gbm.tune <- gbm(
    formula = Price ~ . ,
    distribution = "gaussian",
    data = random_ames_train,
    n.trees = 2000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    bag.fraction = hyper_grid$bag.fraction[i],
    train.fraction = .75,
    n.cores = NULL, # will use all cores by default
    verbose = FALSE
  )
  # add min training error and trees to grid
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
  hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))
}
hyper_grid %>%
dplyr::arrange(min_RMSE) %>%
head(10)
We find the best model parameters, with a 75% train fraction, by iterating over shrinkage = c(.01, .05, .1, .2), interaction.depth = c(1, 2, 3, 4), n.minobsinnode = c(1, 5, 10, 15), and bag.fraction = c(.65, .7, .8, 1), tracking the optimal number of trees up to 2000.
Best boost
# for reproducibility
set.seed(123)
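The fitting chunk itself is not shown; a sketch using the stated best parameters (the object name gbm.fit.final and the data argument are assumptions):
gbm.fit.final <- gbm(
  formula = Price ~ .,
  distribution = "gaussian",
  data = random_ames_train,
  n.trees = 1189,
  interaction.depth = 4,
  shrinkage = 0.05,
  n.minobsinnode = 10,
  bag.fraction = 0.65,
  train.fraction = 1,
  n.cores = NULL,
  verbose = FALSE
)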
We fit the best Gradient Boosting Model with parameters n.trees = 1189, interaction.depth = 4, shrinkage = 0.05, n.minobsinnode = 10, bag.fraction = 0.65, and train.fraction = 1.
[Figure: relative influence of the predictors (0–40 scale), ordered PCE, M2REAL, MORTGAGE30US, UNRATE, T10Y2Y, FEDFUNDS, CPI.]
Variance Importance
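The relative influence figure above and the table below would come from summarizing the fitted boosting model; a sketch (using the hypothetical gbm.fit.final from above):
summary(gbm.fit.final) # prints the relative influence table and draws the barplot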
## var rel.inf
## PCE PCE 48.52661523
## M2REAL M2REAL 48.36345187
## MORTGAGE30US MORTGAGE30US 1.20761149
## UNRATE UNRATE 1.11218723
## T10Y2Y T10Y2Y 0.54742215
## FEDFUNDS FEDFUNDS 0.15887628
## CPI CPI 0.08383576
These are the relative influence plot and the relative influence statistics. They show that PCE and M2REAL are by far the most important variables: in this model, most of the predictive power for the S&P 500 price comes from PCE and M2REAL.
Boosting Result
[Figure: boosting-model fit on the S&P 500 Price scale (0–4000).]
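The evaluation chunk is omitted; a sketch of how the test MSE and RMSE quoted in the conclusion could be computed (object names carried over from the assumed chunks above):
boost.pred <- predict(gbm.fit.final, newdata = ames_test, n.trees = 1189) # hypothetical test-set predictions
mean((boost.pred - ames_test$Price)^2) # MSE
sqrt(mean((boost.pred - ames_test$Price)^2)) # RMSE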
Conclusion
In conclusion, the best model is the Gradient Boosting model generated above. With an RMSE of 35.86454 and an MSE of 1286.265, it performs the best among all of our models. Even though it is still not very accurate, we believe it performs well enough for beginner-level stock prediction. Thus, we would use boosting as our model.