HW5 Solution
1. This problem involves the OJ data set which is part of the ISLR package.
(a) Create a training set containing a random sample of 80% of the observations, and a
test set containing the remaining observations.
Answer: Using the following code:
> library(ISLR)
> set.seed(1234)
> train = sample(dim(OJ)[1], dim(OJ)[1]*0.8)
> OJ.train = OJ[train, ]
> OJ.test = OJ[-train, ]
(b) Fit a support vector classifier to the training data using cost=0.01, with Purchase
as the response and the other variables as predictors. Use the summary()
function to produce summary statistics, and describe the results obtained. What
are the training and test error rates?
Answer: The support vector classifier creates 465 support vectors out of the 856
training observations (80% of the data). Of these, 233 belong to level CH and the
remaining 232 belong to level MM.
Comparing the predicted and actual labels on the training and test sets, the training
error rate is 17.1% and the test error rate is about 15%.
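Since the summary output and confusion matrices are not reproduced here, the following
sketch shows how these numbers could be obtained (object names such as svm.linear are
illustrative, and the e1071 package is assumed):
> library(e1071)
> svm.linear = svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
> summary(svm.linear)                       # number of support vectors per class
> train.pred = predict(svm.linear, OJ.train)
> mean(train.pred != OJ.train$Purchase)     # training error rate
> test.pred = predict(svm.linear, OJ.test)
> mean(test.pred != OJ.test$Purchase)       # test error rate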
(c) Use the tune() function to select an optimal cost. Consider values in the range
0.01 to 10. Compute the training and test error rates using this new value for
cost.
Answer: We set the candidate cost sequence to cost = 10^seq(-2, 1, by = 0.25).
The tuning result shows that the optimal cost is about 0.316.
We refit the support vector classifier using this optimal cost. With the best cost, the
training error decreases to 16.7%, but the test error increases slightly to 16.4%.
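A sketch of the tuning step (again with illustrative object names; the exact
cross-validation results depend on the random seed):
> set.seed(1234)
> tune.linear = tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear",
+                    ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
> summary(tune.linear)                      # CV error for each candidate cost
> best.cost = tune.linear$best.parameters$cost
> svm.best = svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = best.cost)
> mean(predict(svm.best, OJ.train) != OJ.train$Purchase)   # training error
> mean(predict(svm.best, OJ.test) != OJ.test$Purchase)     # test error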
(d) Repeat parts (b) through (c) using a support vector machine with a radial kernel.
Use the default value for gamma.
Answer: The radial basis kernel with the default gamma creates 400 support vectors, of
which 203 belong to level CH and the remaining 197 to level MM. The classifier has a
training error of 15.8% and a test error of 14.5%, both improvements over the linear
kernel.
We now use cross-validation to find the optimal gamma, with the candidate sequence
10^seq(-2, 1, by = 0.25). The optimal gamma is 1.78. Compared with the default gamma
(1/p, where p is the number of predictors), tuning slightly decreases the training
error to 15.2% and slightly increases the test error to 15.4%. The performance is
still better than the linear kernel.
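A sketch of the radial-kernel fits (object names illustrative):
> svm.radial = svm(Purchase ~ ., data = OJ.train, kernel = "radial")   # default gamma
> summary(svm.radial)
> mean(predict(svm.radial, OJ.train) != OJ.train$Purchase)   # training error
> mean(predict(svm.radial, OJ.test) != OJ.test$Purchase)     # test error
> set.seed(1234)
> tune.radial = tune(svm, Purchase ~ ., data = OJ.train, kernel = "radial",
+                    ranges = list(gamma = 10^seq(-2, 1, by = 0.25)))
> best.gamma = tune.radial$best.parameters$gamma
> svm.radial2 = svm(Purchase ~ ., data = OJ.train, kernel = "radial", gamma = best.gamma)
> mean(predict(svm.radial2, OJ.train) != OJ.train$Purchase)  # training error after tuning
> mean(predict(svm.radial2, OJ.test) != OJ.test$Purchase)    # test error after tuning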
(e) Repeat parts (b) through (c) using a support vector machine with a polynomial
kernel. Set degree=2.
Answer: The summary shows that the polynomial kernel produces 495 support vectors, of
which 252 belong to level CH and the remaining 243 to level MM. This kernel gives a
training error of 18.5% and a test error of 15.9%, both higher than the errors
produced by the radial and linear kernels.
Using the same candidate cost sequence, we find that the optimal cost for the
polynomial kernel is 10. Tuning reduces the training error to 15.9% and the test error
to 15.4%, which is similar to the radial kernel and better than the linear kernel.
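A sketch of the polynomial-kernel fits (object names illustrative):
> svm.poly = svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2)
> summary(svm.poly)
> mean(predict(svm.poly, OJ.train) != OJ.train$Purchase)     # training error
> mean(predict(svm.poly, OJ.test) != OJ.test$Purchase)       # test error
> set.seed(1234)
> tune.poly = tune(svm, Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2,
+                  ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
> svm.poly2 = svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2,
+                 cost = tune.poly$best.parameters$cost)
> mean(predict(svm.poly2, OJ.train) != OJ.train$Purchase)    # training error after tuning
> mean(predict(svm.poly2, OJ.test) != OJ.test$Purchase)      # test error after tuning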
(f) Overall, which approach seems to give the best results on this data?
Answer: Overall, the radial basis kernel produces the lowest misclassification error
on both the training and test data.
2. We have seen that we can fit an SVM with a non-linear kernel in order to perform
classification using a non-linear decision boundary. We will now see that we can also
obtain a non-linear decision boundary by performing logistic regression using non-
linear transformations of the features.
(a) Generate a data set with n = 700 and p = 2, such that the observations belong to
two classes with a quadratic decision boundary between them. For instance, you
can do this as follows:
> x1 = runif(700) - 0.5
> x2 = runif(700) - 0.5
(b) Plot the observations, colored according to their class labels. Your plot should
display x1 on the x-axis, and x2 on the y-axis.
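A minimal sketch of how (a) and (b) could be carried out, assuming the quadratic
boundary y = 1*(x1^2 - x2^2 > 0) (this particular boundary is an assumption, not given
above):
> set.seed(1234)
> x1 = runif(700) - 0.5
> x2 = runif(700) - 0.5
> y = 1 * (x1^2 - x2^2 > 0)                  # assumed quadratic decision boundary
> plot(x1, x2, col = ifelse(y == 1, "red", "blue"), pch = 19,
+      xlab = "x1", ylab = "x2")             # observations colored by true class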
(c) Fit a logistic regression model to the data, using x1 and x2 as predictors. Apply
this model to the training data in order to obtain a predicted class label for each
training observation. Plot the observations, colored according to the predicted
class labels. The decision boundary should be linear.
Answer: The logistic regression output shows that both variables are insignificant
for predicting y at the 0.05 significance level.
We plot the true labels and the predicted labels (obtained with a probability threshold
of 0.5). The decision boundary is clearly linear, and the predictions are poor.
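A sketch of the linear-logistic fit and the plot of predicted labels (object names,
including the data frame dat, are illustrative):
> dat = data.frame(x1 = x1, x2 = x2, y = as.factor(y))
> glm.lin = glm(y ~ x1 + x2, data = dat, family = binomial)
> summary(glm.lin)                           # both coefficients insignificant
> prob.lin = predict(glm.lin, dat, type = "response")
> pred.lin = ifelse(prob.lin > 0.5, 1, 0)    # 0.5 probability threshold
> plot(x1, x2, col = ifelse(pred.lin == 1, "red", "blue"), pch = 19)   # linear boundary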
(d) Now fit a logistic regression model to the data using non-linear functions of x1
and x2 as predictors (e.g. x1^2, x1 × x2, log(x2), and so forth). Apply this model to
the training data in order to obtain a predicted class label for each training
observation. Plot the observations, colored according to the predicted class
labels. The decision boundary should be obviously non-linear. If it is not, then
repeat (a)-(d) until you come up with an example in which the predicted class
labels are obviously non-linear.
Answer: We fit the model using poly(x1, 3), the square of x2, and the interaction term
between x1 and x2. The resulting non-linear decision boundary closely resembles the
true decision boundary.
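A sketch of the non-linear logistic fit described above (object names illustrative;
dat is the data frame from part (c)):
> glm.nl = glm(y ~ poly(x1, 3) + I(x2^2) + x1:x2, data = dat, family = binomial)
> prob.nl = predict(glm.nl, dat, type = "response")
> pred.nl = ifelse(prob.nl > 0.5, 1, 0)      # 0.5 probability threshold
> plot(x1, x2, col = ifelse(pred.nl == 1, "red", "blue"), pch = 19)    # non-linear boundary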
(f) Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each
training observation. Plot the observations, colored according to the predicted
class labels.
Answer: We fit an SVM with a radial kernel and gamma = 1. As shown, the non-linear
decision boundary implied by the predicted labels closely resembles the true decision
boundary.
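A sketch of the radial-kernel SVM fit (object names illustrative; dat is the data
frame from part (c), with y stored as a factor so svm() performs classification):
> svm.rad = svm(y ~ x1 + x2, data = dat, kernel = "radial", gamma = 1)
> pred.svm = predict(svm.rad, dat)
> plot(x1, x2, col = ifelse(pred.svm == "1", "red", "blue"), pch = 19)  # boundary close to the true one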
(g) Comment on your results.
Answer: This experiment reinforces the idea that SVMs with a non-linear kernel are
extremely powerful at finding non-linear decision boundaries. Both logistic regression
without non-linear or interaction terms and SVMs with a linear kernel fail to find the
decision boundary. Adding non-linear and interaction terms to logistic regression
gives it roughly the same power as a radial basis kernel. However, picking the right
terms involves manual effort and tuning, which can become prohibitive with a large
number of features. Radial basis kernels, on the other hand, only require tuning a
single parameter, gamma, which can easily be done using cross-validation.