
DSA5205 Data Science in Quantitative Finance AY2021/22SEM1

Homework 5 and Answers

1. This problem involves the OJ data set which is part of the ISLR package.
(a) Create a training set containing a random sample of 80% of the observations, and a test set containing the remaining observations.
Answer: Using the following code:
> library(ISLR)
> set.seed(1234)
> train = sample(dim(OJ)[1], dim(OJ)[1]*0.8)
> OJ.train = OJ[train, ]
> OJ.test = OJ[-train, ]

(b) Fit a support vector classifier to the training data using cost=0.01, with Purchase
as the response and the other variables as predictors. Use the summary()
function to produce summary statistics, and describe the results obtained. What
are the training and test error rates?
Answer: The support vector classifier creates 465 support vectors out of 856 training points (80% of the observations). Of these, 233 belong to level CH and the remaining 232 to level MM.
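A minimal sketch of the corresponding fit (the original shows only a screenshot of the output; the e1071 package and the object name svm.linear are assumptions):

> library(e1071)
> svm.linear = svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.01)
> summary(svm.linear)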

The training and test error rates can be computed from the confusion matrices of predicted versus true labels.
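A sketch of that computation (object names are illustrative):

> train.pred = predict(svm.linear, OJ.train)
> test.pred = predict(svm.linear, OJ.test)
> table(OJ.train$Purchase, train.pred)    # training confusion matrix
> mean(train.pred != OJ.train$Purchase)   # training error rate
> table(OJ.test$Purchase, test.pred)      # test confusion matrix
> mean(test.pred != OJ.test$Purchase)     # test error rate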

We thus compute a training error rate of 17.1% and a test error rate of about 15%.

(c) Use the tune() function to select an optimal cost. Consider values in the range
0.01 to 10. Compute the training and test error rates using this new value for
cost.
Answer: We set the candidate cost sequence to cost = 10^seq(-2, 1, by = 0.25). The tuning result shows that the optimal cost is 0.316.
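A sketch of the tuning call, assuming e1071's tune() and an illustrative object name:

> set.seed(1234)
> tune.out = tune(svm, Purchase ~ ., data = OJ.train, kernel = "linear",
+     ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
> summary(tune.out)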

We then rerun the support vector classifier using this optimal cost.

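A sketch of the refit (names illustrative); the error rates are then recomputed as in part (b):

> svm.best = svm(Purchase ~ ., data = OJ.train, kernel = "linear",
+     cost = tune.out$best.parameters$cost)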
With the best cost, the training error decreases to 16.7%, but the test error slightly increases to 16.4%.

(d) Repeat parts (b) through (c) using a support vector machine with a radial kernel.
Use the default value for gamma.

Answer: The radial basis kernel with the default gamma creates 400 support vectors, of which 203 belong to level CH and the remaining 197 to level MM. The classifier has a training error of 15.8% and a test error of 14.5%, both improvements over the linear kernel.
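A sketch of the radial fit with the default gamma (names illustrative):

> svm.radial = svm(Purchase ~ ., data = OJ.train, kernel = "radial")
> summary(svm.radial)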

We now use cross-validation to find the optimal gamma. The candidate gamma sequence is set to 10^seq(-2, 1, by = 0.25), and the optimal gamma is 1.78. With this parameter, tuning slightly decreases the training error to 15.2% but slightly increases the test error to 15.4% compared with the default gamma (in e1071, one over the number of predictors). The performance is still better than the linear kernel.
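A sketch of the gamma tuning (names illustrative):

> set.seed(1234)
> tune.rad = tune(svm, Purchase ~ ., data = OJ.train, kernel = "radial",
+     ranges = list(gamma = 10^seq(-2, 1, by = 0.25)))
> summary(tune.rad)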

(e) Repeat parts (b) through (c) using a support vector machine with a polynomial
kernel. Set degree=2.
Answer: The summary shows that the polynomial kernel produces 495 support vectors, of which 252 belong to level CH and the remaining 243 to level MM. This kernel produces a training error of 18.5% and a test error of 15.9%, both higher than the errors produced by the radial and linear kernels.
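A sketch of the polynomial fit (names illustrative):

> svm.poly = svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2)
> summary(svm.poly)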

Using the same cost candidates, we find that the optimal cost for the polynomial kernel is 10. Tuning reduces the training error to 15.9% and the test error to 15.4%, similar to the radial kernel and better than the linear kernel.
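A sketch of the cost tuning for the polynomial kernel; extra arguments such as degree are passed through by tune() to svm():

> set.seed(1234)
> tune.poly = tune(svm, Purchase ~ ., data = OJ.train, kernel = "polynomial",
+     degree = 2, ranges = list(cost = 10^seq(-2, 1, by = 0.25)))
> summary(tune.poly)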

(f) Overall, which approach seems to give the best results on this data?
Answer: Overall, the radial basis kernel seems to produce the lowest misclassification error on both the training and test data.

2. We have seen that we can fit an SVM with a non-linear kernel in order to perform
classification using a non-linear decision boundary. We will now see that we can also
obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
(a) Generate a data set with n = 700 and p = 2, such that the observations belong to
two classes with a quadratic decision boundary between them. For instance, you
can do this as follows:

> x1 = runif(700) - 0.5
> x2 = runif(700) - 0.5
> y = 1 * (x1^2 - x2^2 > 0)

Answer: Using the following code:


> set.seed(123)
> x1 = runif(700) - 0.5
> x2 = runif(700) - 0.5
> y = 1 * (x1^2 - x2^2 > 0)

(b) Plot the observations, colored according to their class labels. Your plot should display x1 on the x-axis, and x2 on the y-axis.

Answer: The plot clearly shows a non-linear decision boundary.
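A sketch of the plotting call (the color choices are illustrative):

> plot(x1, x2, col = ifelse(y == 1, "red", "blue"), xlab = "x1", ylab = "x2")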

(c) Fit a logistic regression model to the data, using x1 and x2 as predictors. Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
Answer: The logistic regression output shows that both variables are insignificant for predicting y at the 0.05 significance level.
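A sketch of the fit (the summary screenshot is omitted; the object name is illustrative):

> logit.fit = glm(y ~ x1 + x2, family = binomial)
> summary(logit.fit)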

We plot the true labels and the predicted labels (probability threshold 0.5) below. The boundary is clearly linear, and the predictions are poor.
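A sketch of the prediction and plot (names illustrative):

> logit.prob = predict(logit.fit, type = "response")
> logit.pred = ifelse(logit.prob > 0.5, 1, 0)
> plot(x1, x2, col = ifelse(logit.pred == 1, "red", "blue"), xlab = "x1", ylab = "x2")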

(d) Now fit a logistic regression model to the data using non-linear functions of x1 and x2 as predictors (e.g. x1^2, x1*x2, log(x2), and so forth). Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(d) until you come up with an example in which the predicted class labels are obviously non-linear.
Answer: We fit the model using poly(x1, 3), the square of x2, and a product interaction term. The resulting non-linear decision boundary closely resembles the true one.
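A sketch of one such fit (the exact terms are illustrative; glm() may warn that fitted probabilities of 0 or 1 occurred, which is expected when the classes are nearly separable under these features):

> logit.nl = glm(y ~ poly(x1, 3) + I(x2^2) + I(x1 * x2), family = binomial)
> nl.prob = predict(logit.nl, type = "response")
> nl.pred = ifelse(nl.prob > 0.5, 1, 0)
> plot(x1, x2, col = ifelse(nl.pred == 1, "red", "blue"), xlab = "x1", ylab = "x2")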

(e) Fit a support vector classifier to the data with predictors x1 and x2. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
Answer: A support vector classifier with a linear kernel, even at low cost, fails to find the non-linear decision boundary and classifies most points into a single class.
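A sketch (names illustrative; svm() needs the response as a factor to perform classification):

> dat = data.frame(x1 = x1, x2 = x2, y = as.factor(y))
> svm.lin = svm(y ~ x1 + x2, data = dat, kernel = "linear", cost = 0.01)
> plot(dat$x1, dat$x2, col = ifelse(predict(svm.lin) == "1", "red", "blue"),
+     xlab = "x1", ylab = "x2")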

(f) Fit an SVM using a non-linear kernel to the data. Obtain a class prediction for each
training observation. Plot the observations, colored according to the predicted
class labels.
Answer: We fit an SVM with a radial kernel and gamma = 1. As shown, the non-linear decision boundary over the predicted labels closely resembles the true decision boundary.
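A sketch, reusing the illustrative data frame from part (e):

> svm.rad = svm(y ~ x1 + x2, data = dat, kernel = "radial", gamma = 1)
> plot(dat$x1, dat$x2, col = ifelse(predict(svm.rad) == "1", "red", "blue"),
+     xlab = "x1", ylab = "x2")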

(g) Comment on your results.

Answer: This experiment reinforces the idea that SVMs with non-linear kernels are extremely powerful at finding non-linear boundaries. Both logistic regression without interaction terms and SVMs with linear kernels fail to find the decision boundary. Adding interaction terms to logistic regression gives it roughly the same power as a radial basis kernel, but picking the right interaction terms takes manual effort and tuning, which can become prohibitive with a large number of features. Radial basis kernels, on the other hand, require tuning only one parameter, gamma, which can easily be done using cross-validation.
