HW 6
(a) Split the data set into a training set and a test set. Fix the random seed
to the value 234, choose 30% (rounded down to the nearest integer) of the
data at random for testing, and use the rest for training. Define a new
response variable Accept/Apps. Plot this variable against every other
in the dataset (make sure you use the appropriate type of plot for each
predictor). Comment on which variables appear to be most predictive.
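For concreteness, a minimal plotting sketch in Python. It assumes the
ISLR College data in a file College.csv (hypothetical name) with columns
Apps and Accept plus a categorical Private; numeric predictors get
scatterplots and factors get boxplots:

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("College.csv", index_col=0)  # hypothetical file name
    rate = df["Accept"] / df["Apps"]              # new response Accept/Apps

    for col in df.columns.drop("Accept"):
        fig, ax = plt.subplots()
        if df[col].dtype == object:    # categorical (e.g. Private): boxplot
            levels = sorted(df[col].unique())
            ax.boxplot([rate[df[col] == lev] for lev in levels],
                       labels=levels)
        else:                          # numeric predictor: scatterplot
            ax.scatter(df[col], rate, s=8)
        ax.set_xlabel(col)
        ax.set_ylabel("Accept/Apps")
    plt.show()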
(b) Fit a linear model using least squares on the training set, with
Accept/Apps as the response variable and all other variables as
predictors, and report the training and test errors obtained.
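One way to set this up with scikit-learn (a sketch under assumptions: the
file name, mean squared error as the error measure, Private recoded to
0/1, and Accept dropped from the predictors since it defines the
response). Note that the exact split depends on the software's random
number generator, so seed 234 only pins down the split within one
implementation:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("College.csv", index_col=0)  # hypothetical file name
    df["Private"] = (df["Private"] == "Yes").astype(int)
    y = df["Accept"] / df["Apps"]                 # response: acceptance rate
    X = df.drop(columns=["Accept"])               # Accept defines the response

    # 30% of the rows (rounded down) for testing, seed fixed at 234
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=int(0.3 * len(df)), random_state=234)

    lm = LinearRegression().fit(X_tr, y_tr)
    print("train MSE:", mean_squared_error(y_tr, lm.predict(X_tr)))
    print("test  MSE:", mean_squared_error(y_te, lm.predict(X_te)))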
(c) Perform forward and backward selection over the full set of
predictors, using the p-value threshold α = 0.05, to select a potentially
smaller model. Report which model each method chose, and the training and
test errors for the chosen models.
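p-value-based stepwise selection is not built into scikit-learn; a common
hand-rolled version uses statsmodels. A sketch, reusing X_tr and y_tr
from the split under (b):

    import statsmodels.api as sm

    def pval(X, y, cols, c):
        """p-value of predictor c in the OLS fit on columns `cols`."""
        return sm.OLS(y, sm.add_constant(X[cols])).fit().pvalues[c]

    def forward_select(X, y, alpha=0.05):
        """Add the predictor with the smallest p-value while it is < alpha."""
        chosen, remaining = [], list(X.columns)
        while remaining:
            pvals = {c: pval(X, y, chosen + [c], c) for c in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:
                break
            chosen.append(best)
            remaining.remove(best)
        return chosen

    def backward_select(X, y, alpha=0.05):
        """Drop the predictor with the largest p-value while it is >= alpha."""
        chosen = list(X.columns)
        while chosen:
            m = sm.OLS(y, sm.add_constant(X[chosen])).fit()
            pv = m.pvalues.drop("const")
            if pv.max() < alpha:
                break
            chosen.remove(pv.idxmax())
        return chosen

    print("forward: ", forward_select(X_tr, y_tr))
    print("backward:", backward_select(X_tr, y_tr))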
(d) Use AIC, BIC, and adjusted R² to select a potentially smaller model
instead, from the set of all possible predictors used in (b). Report
which model each criterion chose, and the training and test errors for
the chosen model(s).
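With statsmodels, AIC, BIC, and adjusted R² come straight off a fitted
OLS object. Exhaustive best-subset search is expensive with this many
predictors, so the sketch below uses a greedy forward search scored by
each criterion (smaller-is-better convention, so adjusted R² enters with
a minus sign); X_tr and y_tr are from the split under (b):

    import numpy as np
    import statsmodels.api as sm

    def forward_by(X, y, score):
        """Greedy forward search minimizing `score` (a function of a fit)."""
        chosen, remaining = [], list(X.columns)
        best = score(sm.OLS(y, np.ones(len(y))).fit())  # intercept-only start
        improved = True
        while remaining and improved:
            improved = False
            scores = {c: score(sm.OLS(y, sm.add_constant(X[chosen + [c]])).fit())
                      for c in remaining}
            c = min(scores, key=scores.get)
            if scores[c] < best:               # strict improvement only
                best = scores[c]
                chosen.append(c)
                remaining.remove(c)
                improved = True
        return chosen

    for name, score in {"AIC":   lambda m: m.aic,
                        "BIC":   lambda m: m.bic,
                        "adjR2": lambda m: -m.rsquared_adj}.items():
        print(name, forward_by(X_tr, y_tr, score))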
(e) Use 5-fold cross-validation to estimate the test error from the training
data, for the candidate smaller model(s) you found so far, and for the full
model from (b). Compare the training, CV, and test errors and comment
on the results.
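A sketch of the 5-fold CV estimates with scikit-learn, again reusing the
training split from (b); extend `subsets` with whatever column lists (c)
and (d) produced:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    # Add the subsets chosen in (c) and (d) alongside the full model.
    subsets = {"full": list(X_tr.columns)}
    cv = KFold(n_splits=5, shuffle=True, random_state=234)
    for name, cols in subsets.items():
        mse = -cross_val_score(LinearRegression(), X_tr[cols], y_tr,
                               scoring="neg_mean_squared_error", cv=cv).mean()
        print(f"{name}: 5-fold CV MSE = {mse:.4f}")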
(f) Fit a ridge regression model on the training set, with λ chosen by
cross-validation. Report the training and test errors.
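scikit-learn's RidgeCV tunes λ (called alpha there) over a grid by
cross-validation; standardizing the predictors first is the usual
practice. A sketch, reusing the split from (b) and an assumed grid range:

    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    alphas = np.logspace(-4, 4, 100)          # assumed λ grid
    ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
    ridge.fit(X_tr, y_tr)
    print("chosen λ:", ridge.named_steps["ridgecv"].alpha_)
    print("train MSE:", mean_squared_error(y_tr, ridge.predict(X_tr)))
    print("test  MSE:", mean_squared_error(y_te, ridge.predict(X_te)))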
(g) Fit a lasso model on the training set, with λ chosen by cross-validation.
Report which variables are included in the model, and the training and
test errors obtained.
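LassoCV picks its own λ path by cross-validation; the variables included
in the model are those whose coefficients are nonzero at the chosen λ.
A sketch, reusing the split from (b):

    from sklearn.linear_model import LassoCV
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
    lasso.fit(X_tr, y_tr)
    lcv = lasso.named_steps["lassocv"]
    print("chosen λ:", lcv.alpha_)
    print("kept predictors:", list(X_tr.columns[lcv.coef_ != 0]))
    print("train MSE:", mean_squared_error(y_tr, lasso.predict(X_tr)))
    print("test  MSE:", mean_squared_error(y_te, lasso.predict(X_te)))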
(h) Fit a PCR model on the training set, with M chosen by cross-validation.
Report the test error obtained, along with the value of M selected by
cross-validation.
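scikit-learn has no dedicated PCR estimator; the standard construction is
a pipeline of standardization, PCA, and least squares, with the number of
components M tuned by GridSearchCV. A sketch, reusing the split from (b):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    pcr = make_pipeline(StandardScaler(), PCA(), LinearRegression())
    grid = GridSearchCV(pcr,
                        {"pca__n_components": range(1, X_tr.shape[1] + 1)},
                        scoring="neg_mean_squared_error", cv=5)
    grid.fit(X_tr, y_tr)
    print("M chosen by CV:", grid.best_params_["pca__n_components"])
    print("test MSE:", mean_squared_error(y_te, grid.predict(X_te)))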
(i) Fit a PLS model on the training set, with M chosen by cross-validation.
Report the test error obtained, along with the value of M selected by
cross-validation.
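The same tuning pattern works for PLS via scikit-learn's PLSRegression,
which standardizes internally (scale=True by default). A sketch, reusing
the split from (b):

    from sklearn.cross_decomposition import PLSRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    grid = GridSearchCV(PLSRegression(),
                        {"n_components": range(1, X_tr.shape[1] + 1)},
                        scoring="neg_mean_squared_error", cv=5)
    grid.fit(X_tr, y_tr)
    print("M chosen by CV:", grid.best_params_["n_components"])
    print("test MSE:", mean_squared_error(y_te, grid.predict(X_te)))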
(j) Comment on the results obtained. How accurately can we predict the
acceptance rate? How much difference is there among the test errors
resulting from different approaches? Which approach would you recommend
for this dataset and why?