Assignment 6 Tree Based Methods
Problem 1:
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a
binary response variable. This question will seek to predict Sales using regression trees and related
approaches, treating the response as a quantitative variable (that is, without the conversion).
(a) Split the data set into a training set and a test set.
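A minimal sketch of one way to make the split. The 50/50 proportion, the seed, and the names `train`, `carseats.train` and `carseats.test` are assumptions for illustration; the write-up does not state how the split was made.

```r
library(ISLR)  # provides the Carseats data frame (400 rows)

set.seed(1)    # assumed seed, only so the split is reproducible
n <- nrow(Carseats)
train <- sample(n, n / 2)            # row indices of the training half

carseats.train <- Carseats[train, ]  # training set
carseats.test  <- Carseats[-train, ] # held-out test set
dim(carseats.train)
dim(carseats.test)
```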
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. Then compute the
test MSE.
[Figure: regression tree fitted to the training set; the root split is on ShelveLoc (Bad, Medium).]
Sales are high when the shelf location (ShelveLoc) is medium and the price is low. Sales are low when
the shelf location is bad, the price is high, and the age is high.
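The fit and the test MSE for part (b) could be obtained along these lines. This is a sketch, assuming the `tree` package, a 50/50 split with seed 1, and the object names `tree.carseats` and `test.mse`; it will not necessarily reproduce the tree shown above.

```r
library(ISLR)
library(tree)

set.seed(1)                                   # assumed seed
train <- sample(nrow(Carseats), nrow(Carseats) / 2)

# Regression tree for Sales on all other variables, training rows only
tree.carseats <- tree(Sales ~ ., data = Carseats, subset = train)
summary(tree.carseats)

plot(tree.carseats)
text(tree.carseats, pretty = 0)               # label splits with level names

# Test MSE: mean squared difference between predicted and observed Sales
pred <- predict(tree.carseats, newdata = Carseats[-train, ])
test.mse <- mean((pred - Carseats$Sales[-train])^2)
test.mse
```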
(c) Prune the tree obtained in (b). Use cross validation to determine the optimal level of tree
complexity. Plot the pruned tree and interpret the results. Compute the test MSE of the pruned tree.
Does pruning improve the test error?
The optimal level of tree complexity (tree size) was found to be 8. Pruning does not
improve the test MSE in this case.
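[Figure: cross-validation deviance (cv.carseats$dev) plotted against tree size (cv.carseats$size).]
[Figure: pruned regression tree; root split on ShelveLoc (Bad, Medium); visible splits include Age < 50.5 and Age < 66.5; visible leaf values include 12.080 and 8.792.]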
Sales are high when the shelf location is medium, specifically when the price is low (i.e.
less than 113). Sales are low when the shelf location is bad and both price and age are high.
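The cross-validation and pruning step can be sketched as follows. The split, the seed, and the names `best.size`, `prune.carseats` and `pruned.mse` are assumptions; the size chosen here is simply the one with the lowest cross-validated deviance, which may differ from the value reported above.

```r
library(ISLR)
library(tree)

set.seed(1)                                   # assumed seed
train <- sample(nrow(Carseats), nrow(Carseats) / 2)
tree.carseats <- tree(Sales ~ ., data = Carseats, subset = train)

# 10-fold cross-validation over the cost-complexity pruning sequence
cv.carseats <- cv.tree(tree.carseats)
plot(cv.carseats$size, cv.carseats$dev, type = "b")

# Tree size with the lowest CV deviance (kept at 2 or more so the
# pruned object is still a plottable tree)
best.size <- max(cv.carseats$size[which.min(cv.carseats$dev)], 2)

prune.carseats <- prune.tree(tree.carseats, best = best.size)
plot(prune.carseats)
text(prune.carseats, pretty = 0)

# Test MSE of the pruned tree, for comparison with the unpruned one
pred <- predict(prune.carseats, newdata = Carseats[-train, ])
pruned.mse <- mean((pred - Carseats$Sales[-train])^2)
pruned.mse
```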
(d) Use the bagging approach to analyze the data. What test MSE do you obtain? Determine which
variables are most important.
Test MSE = 2.604, which is much lower than that of the regression trees above. The most important
variables are Price and ShelveLoc.
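Bagging is a random forest that considers every predictor at each split, so one way to run it is `randomForest()` with `mtry` set to the number of predictors. The sketch below assumes a 50/50 split with seed 1 and the names `bag.carseats` and `bag.mse`; its test MSE will not necessarily match the figure quoted above.

```r
library(ISLR)
library(randomForest)

set.seed(1)                                   # assumed seed
train <- sample(nrow(Carseats), nrow(Carseats) / 2)
p <- ncol(Carseats) - 1                       # 10 predictors (all columns but Sales)

# Bagging = random forest with all p predictors tried at each split
bag.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train,
                             mtry = p, importance = TRUE)

pred <- predict(bag.carseats, newdata = Carseats[-train, ])
bag.mse <- mean((pred - Carseats$Sales[-train])^2)
bag.mse

importance(bag.carseats)                      # %IncMSE and IncNodePurity per variable
varImpPlot(bag.carseats)
```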
(e) Use random forests to analyse the data. What test MSE do you obtain? Determine which
variables are most important.
The test MSE is 3.296. The most important variables are Price and
ShelveLoc.
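A random forest differs from bagging only in trying a subset of predictors at each split. The write-up does not say which `mtry` was used, so the value 3 below (about p/3, the package default for regression) is an assumption, as are the split, the seed, and the names `rf.carseats`, `rf.mse` and `imp`.

```r
library(ISLR)
library(randomForest)

set.seed(1)                                   # assumed seed
train <- sample(nrow(Carseats), nrow(Carseats) / 2)

# Random forest: only mtry = 3 of the 10 predictors are tried per split
rf.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train,
                            mtry = 3, importance = TRUE)

pred <- predict(rf.carseats, newdata = Carseats[-train, ])
rf.mse <- mean((pred - Carseats$Sales[-train])^2)
rf.mse

# Variables ranked by the increase in MSE when they are permuted
imp <- importance(rf.carseats)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
```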
Problem 2:
In the lab, we applied random forests to the Boston data using mtry=6 and ntree=100.
(a) Consider a more comprehensive range of values for mtry: 1, 2, ..., 13. Given each value of mtry,
find the test error resulting from random forests on the Boston data (using ntree=100). Create a plot
displaying the test error rate vs. the value of mtry. Comment on the results in the plot.
[Figure: test error (randomforest.error) plotted against mtry; the x-axis is labeled Index.]
Conclusion from the plot: the test error reaches its minimum near mtry = p/2, i.e. around 7. Around
mtry = 3 the error is not as low.
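The sweep over mtry can be sketched as a loop that refits the forest for each value and records the test MSE. The 50/50 split, the seed, and the use of the `Boston` data from `MASS` are assumptions; the vector name `randomforest.error` follows the plot label.

```r
library(MASS)          # Boston data: 13 predictors, response medv
library(randomForest)

set.seed(1)                                   # assumed seed
train <- sample(nrow(Boston), nrow(Boston) / 2)
boston.test <- Boston[-train, ]

# One forest per mtry value, each with ntree = 100; store the test MSE
randomforest.error <- sapply(1:13, function(m) {
  fit <- randomForest(medv ~ ., data = Boston, subset = train,
                      mtry = m, ntree = 100)
  mean((predict(fit, newdata = boston.test) - boston.test$medv)^2)
})

plot(1:13, randomforest.error, type = "b",
     xlab = "mtry", ylab = "Test MSE")
which.min(randomforest.error)                 # mtry with the lowest test error
```

Because each forest is random, the location of the minimum can shift from run to run, so small differences between neighbouring mtry values should not be over-read.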
(b) Similarly, consider a range of values for ntree (between 5 and 200). Given each value of ntree, find
the test error resulting from random forests (using mtry=6). Create a plot displaying the test error vs.
the value of ntree. Comment on the results in the plot.
[Figure: test error (randomforest.error) plotted against ntree; the x-axis is labeled Index.]
Conclusion from the plot: the test error reaches a minimum at around ntree = 30 to 35. Thereafter the
results are stable and stay close to the minimum.
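The same loop works for ntree with mtry fixed at 6. The grid `seq(5, 200, by = 5)` is an assumed choice (the question only gives the range 5 to 200), as are the split, the seed, and the name `ntree.grid`.

```r
library(MASS)          # Boston data
library(randomForest)

set.seed(1)                                   # assumed seed
train <- sample(nrow(Boston), nrow(Boston) / 2)
boston.test <- Boston[-train, ]

ntree.grid <- seq(5, 200, by = 5)             # assumed grid of forest sizes

# One forest per ntree value, each with mtry = 6; store the test MSE
randomforest.error <- sapply(ntree.grid, function(b) {
  fit <- randomForest(medv ~ ., data = Boston, subset = train,
                      mtry = 6, ntree = b)
  mean((predict(fit, newdata = boston.test) - boston.test$medv)^2)
})

plot(ntree.grid, randomforest.error, type = "b",
     xlab = "ntree", ylab = "Test MSE")
ntree.grid[which.min(randomforest.error)]     # forest size with the lowest test error
```

Plotting against `ntree.grid` rather than the vector index puts the actual number of trees on the x-axis, which makes the point where the curve flattens easier to read.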