Stats216 hw4
Problem 1
Recall the body dataset from problem 4 of Homework 3. In that problem we used PCR and PLSR to predict
someone's weight. Here we will revisit this objective using bagging and random forests. Start by setting aside
200 observations from your dataset to act as a test set, using the remaining 307 as a training set. Ideally, you
would be able to use your code from Homework 3 to select the same test set as you did on that problem.
load("/Users/alexnutkiewicz/Downloads/body.rdata")
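The Homework 3 code that creates the train/test split is not reproduced here. Below is a minimal sketch of one way to set it up, assuming (as the code that follows does) that body.rdata provides the predictor matrix X and the data frame Y containing Weight, and reusing the object names testing, OJtrain, and OJtest that appear in the later chunks. The particular random split will differ from the Homework 3 one unless the same seed and code are reused.

set.seed(36)
# hold out 200 of the 507 observations as a test set; the remaining 307 form the training set
testing = sample(1:nrow(X), 200)
OJtrain = X[-testing, ]  # training predictors
OJtest = X[testing, ]    # test predictors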
Using the ranger package in CRAN, use Bagging and Random Forests to predict the weights in the test set, so
that you have two sets of predictions. Then answer the following questions:
library(ranger)
predictData = data.frame(Weight = Y$Weight[-testing], OJtrain)
# a random forest considers only a random subset of the predictors (mtry) at each split,
# whereas bagging considers all of them
rf.Weight = ranger(Weight ~ ., data = predictData, mtry = sqrt(ncol(X)), importance = "impurity")
The MSE and % variance explained reported by ranger are based on out-of-bag estimates. Because mtry is set to
the square root of the number of predictors, only that many randomly chosen variables are considered as split
candidates at each split.
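To see this directly, a small sketch using the fitted object's fields, which ranger stores as out-of-bag estimates for regression forests:

rf.Weight$prediction.error  # out-of-bag mean squared error
rf.Weight$r.squared         # out-of-bag proportion of variance explained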
a. Produce a plot of test MSE (as in Figure 8.8 in the text) as a function of number of trees for Bagging and
Random Forests. You should produce one plot with two curves: one corresponding to Bagging and the
other to Random Forests.
rfMSE = rep(0, 300)
bagMSE = rep(0, 300)
for (i in 1:300) {
  rf.Weight = ranger(Weight ~ ., data = predictData, mtry = sqrt(ncol(X)), num.trees = i,
                     importance = "impurity")
  rfPreds = predict(rf.Weight, data = OJtest)$predictions
  bag.Weight = ranger(Weight ~ ., data = predictData, mtry = ncol(X), num.trees = i,
                      importance = "impurity")
  bagPreds = predict(bag.Weight, data = OJtest)$predictions
  rfMSE[i] = mean((rfPreds - Y$Weight[testing])^2)
  bagMSE[i] = mean((bagPreds - Y$Weight[testing])^2)
}
allData = data.frame(num = 1:300, rfMSE, bagMSE)
library(ggplot2)
library(reshape2)
#id.vars = variable you want to keep constant
iceCream = melt(allData, id.vars = "num")
ggplot(iceCream, aes(x = num, y = value, col = variable)) + geom_line() + labs(title =
"Test MSE of Random Forest and Bagging", x = "Number of Trees", y = "Test MSE")
b. Which variables does your random forest identify as most important? How do they compare with the most
important variables as identified by Bagging?
set.seed(36)
rf.Weight = ranger(Weight ~ ., data = predictData, mtry = sqrt(ncol(X)), importance = "impurity")
ranger::importance(rf.Weight)
Based on the values above, we see that the most important variables in the random forest are Chest.Girth,
Forearm.Girth, and Waist.Girth. For bagging, the most important variables are similarly Forearm.Girth, Waist.Girth,
and Chest.Girth (see the sketch below for how the bagging importances can be extracted). So the two methods
identify essentially the same variables as most important.
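A minimal sketch of how those bagging importances could be pulled out for comparison; bagging here is just ranger with mtry equal to the number of predictors, and the top-5 cutoff is arbitrary:

set.seed(36)
bag.Weight = ranger(Weight ~ ., data = predictData, mtry = ncol(X), importance = "impurity")
sort(ranger::importance(bag.Weight), decreasing = TRUE)[1:5]  # top bagging variables
sort(ranger::importance(rf.Weight), decreasing = TRUE)[1:5]   # top random forest variables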
c. Compare the test error of your random forest (with 500 trees) against the test errors of the three methods
you evaluated in Homework 3. Does your random forest make better predictions than your predictions from
Homework 3?
set.seed(36)
rf.500 = predict(rf.Weight, data = OJtest)$predictions
rf.MSE500 = mean(((rf.500 - Y$Weight[testing])^2))
rf.MSE500
## [1] 9.160329
The test MSE values of our PCR, PLSR, and lasso predictions from Homework 3 were 8.562, 7.952, and 8.141,
respectively. Compared with those, our random forest (test MSE of about 9.16) does somewhat worse. This is not
too surprising: the body measurements predict weight well through a largely linear relationship, so the
regularized linear methods from Homework 3 have an edge over the tree-based fit here.
d. The ranger() function uses 500 as the default number of trees. For this problem, is 500 enough trees? How
can you tell?
set.seed(36)
rf.2000 = ranger(Weight ~ ., data = predictData, num.trees = 2000, importance = "impurity")
preds2000 = predict(rf.2000, data = OJtest)$predictions
rf.MSE2000 = mean((preds2000 - Y$Weight[testing])^2)
rf.MSE2000
## [1] 9.295284
After running the model with 2000 trees, we get essentially the same (in fact slightly worse) test MSE, so the error
has clearly stabilized and 500 trees is enough; adding more trees does not improve the model.
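Another quick check, sketched here using ranger's out-of-bag prediction error rather than the held-out test set: refit at several tree counts and see whether the OOB error has flattened out by 500 trees.

set.seed(36)
# OOB mean squared error at a few values of num.trees
sapply(c(100, 250, 500, 1000, 2000), function(B) {
  ranger(Weight ~ ., data = predictData, num.trees = B, importance = "impurity")$prediction.error
})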
Problem 2
Here we explore the maximal margin classifier on a toy data set.
a. We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class
label. Sketch the observations.
b. Sketch the optimal separating hyperplane, and provide the equation for this hyperplane (of the form β0 +
β1X1 + β2X2 = 0).
Based on the plot, we see that the optimal separating hyperplane passes between (3.32, 1.28) and (3.30, 3.29).
Estimating from the plot, we find that the equation for this line is 1 + X1 − X2 = 0.
c. Describe the classification rule for the maximal margin classifier. It should be something along the lines of
"Classify to Red if β0 + β1X1 + β2X2 > 0, and classify to Green otherwise." Provide the values for β0, β1,
and β2.
Using the equation from part (b), the coefficients are β0 = 1, β1 = 1, and β2 = −1.
d. On your sketch, indicate the margin for the maximal margin hyperplane. How wide is the margin?
All of the points lying on the edge of the margin are support vectors for the maximal margin classifier.
f. Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.
Moving the 7th observation slightly would not affect the maximal margin hyperplane, because the hyperplane
depends only on a small set of observations: the support vectors. Since this observation lies far from the
hyperplane and from the support vectors, a slight movement has essentially no impact.
g. Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this
hyperplane.
h. Draw an additional observation on the plot so that the two classes are no longer separable by a
hyperplane.
Oh no! As we can see, there is a new 8th point infiltrating the classified red region, so the two classes can no
longer be separated by a hyperplane.
Problem 3
This problem involves the OJ data set which is part of the ISLR package.
a. Create a training set containing a random sample of 535 observations, and a test set containing the
remaining observations.
library(ISLR)
summary(OJ)
set.seed(36)
train = sample(1:nrow(OJ), 535)
OJ.train = OJ[train,]
OJ.test = OJ[-train,]
b. Fit a (linear) support vector classifier to the training data using cost=0.05, with Purchase as the response
and the other variables as predictors. Use the summary() function to produce summary statistics about the
SVM, and describe the results obtained.
library(e1071)
set.seed(36)
OJ.svm = svm(Purchase~., data=OJ.train, kernel = "linear", cost = 0.05)
summary(OJ.svm)
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "linear",
## cost = 0.05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
## gamma: 0.05555556
##
## Number of Support Vectors: 262
##
## ( 131 131 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
This summary shows that the model uses 262 of the 535 training observations as support vectors for the 2
classes, 131 from each class.
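The confusion matrices below compare observed and predicted Purchase on the training and test sets. A sketch of how train.table and test.table could be computed from the fitted model (the intermediate names trainPreds and testPreds are placeholders):

#training
trainPreds = predict(OJ.svm, newdata = OJ.train)
train.table = table(obs = OJ.train$Purchase, pred = trainPreds)
train.table
#testing
testPreds = predict(OJ.svm, newdata = OJ.test)
test.table = table(obs = OJ.test$Purchase, pred = testPreds)
test.table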
## pred
## obs CH MM
## CH 267 46
## MM 46 176
1-sum(diag(train.table))/sum(train.table)
## [1] 0.1719626
## pred
## obs CH MM
## CH 302 38
## MM 49 146
1-sum(diag(test.table))/sum(test.table)
## [1] 0.1626168
Based on the above classification results, the training and test error rates are reasonably close: the training
accuracy is about 82.8% (error 17.2%) and the test accuracy is about 83.7% (error 16.3%).
d. Use the tune() function to select an optimal cost. Consider values in the range 0.01 to 10.
set.seed(36)
svmTune = tune(svm,Purchase~.,data=OJ.train,
ranges=list(cost=c(.01,.02,.05,.1,.2,.5,1,2,5,10)),kernel="linear")
summary(svmTune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.2
##
## - best performance: 0.1718728
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.1832285 0.04877999
## 2 0.02 0.1868623 0.04710860
## 3 0.05 0.1850454 0.04910731
## 4 0.10 0.1756813 0.04388344
## 5 0.20 0.1718728 0.04230377
## 6 0.50 0.1811670 0.03851654
## 7 1.00 0.1792802 0.04410072
## 8 2.00 0.1793152 0.03971763
## 9 5.00 0.1830538 0.04395067
## 10 10.00 0.1812020 0.04424007
plot(svmTune)
By tuning our SVM, we find that many different values of cost give similar cross-validated error, roughly around
17–18%. The best parameter selected is cost = 0.2.
e. Compute the training and test error rates using this new value for cost.
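A sketch of refitting the support vector classifier at the tuned cost (0.2, from the output above) and building the training confusion matrix used below; the test-set code is shown further down.

set.seed(36)
newOJsvm = svm(Purchase ~ ., data = OJ.train, kernel = "linear", cost = 0.2)
#training
newOJtrainPreds = predict(newOJsvm, newdata = OJ.train)
newtrain.table = table(obs = OJ.train$Purchase, pred = newOJtrainPreds)
newtrain.table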
## pred
## obs CH MM
## CH 263 50
## MM 39 183
1-sum(diag(newtrain.table))/sum(newtrain.table)
## [1] 0.1663551
#testing
newOJtestPreds = predict(newOJsvm, newdata = OJ.test)
newtest.table = table(obs = OJ.test$Purchase, pred = newOJtestPreds)
newtest.table
## pred
## obs CH MM
## CH 298 42
## MM 44 151
1-sum(diag(newtest.table))/sum(newtest.table)
## [1] 0.1607477
By fitting a new SVM with the tuned cost and re-running the training and test predictions, we find slightly improved
classification accuracy (about 83.4% on the training set and 83.9% on the test set). However, as we saw in part d,
the different costs give similar amounts of error, which explains why the improvement isn't large.
f. Repeat parts (b) through (e) using a support vector machine with a radial kernel. Use the default value for
gamma.
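The radial-kernel results below could be produced by a chunk like the following sketch (same cost = 0.05 as in part b; the object name radsvm is a placeholder, while radsvm.train.table and radsvm.test.table are the confusion matrices whose output is shown):

set.seed(36)
radsvm = svm(Purchase ~ ., data = OJ.train, kernel = "radial", cost = 0.05)
summary(radsvm)
#training
radsvm.train.table = table(obs = OJ.train$Purchase, pred = predict(radsvm, newdata = OJ.train))
radsvm.train.table
#testing
radsvm.test.table = table(obs = OJ.test$Purchase, pred = predict(radsvm, newdata = OJ.test))
radsvm.test.table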
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "radial",
## cost = 0.05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.05
## gamma: 0.05555556
##
## Number of Support Vectors: 447
##
## ( 222 225 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
## pred
## obs CH MM
## CH 286 27
## MM 88 134
1-sum(diag(radsvm.train.table))/sum(radsvm.train.table)
## [1] 0.2149533
## pred
## obs CH MM
## CH 316 24
## MM 95 100
1-sum(diag(radsvm.test.table))/sum(radsvm.test.table)
## [1] 0.2224299
set.seed(36)
radialSVM = tune(svm , Purchase~. , data=OJ.train ,
ranges=list(cost=c(.01,.02,.05,.1,.2,.5,1,2,5,10)), kernel="radial")
summary(radialSVM)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.1866177
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.4149196 0.05553547
## 2 0.02 0.4149196 0.05553547
## 3 0.05 0.2356045 0.06045365
## 4 0.10 0.2018868 0.05730822
## 5 0.20 0.1980084 0.06225772
## 6 0.50 0.1885744 0.05898148
## 7 1.00 0.1866177 0.05977225
## 8 2.00 0.1978686 0.05464430
## 9 5.00 0.1959818 0.05531844
## 10 10.00 0.2034242 0.06308126
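A sketch of refitting the radial SVM at the tuned cost (cost = 1, from the output above) to produce radialOJsvm, which is used in the predictions below:

set.seed(36)
radialOJsvm = svm(Purchase ~ ., data = OJ.train, kernel = "radial", cost = 1)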
#training
radialOJtrainPreds = predict(radialOJsvm, newdata = OJ.train)
radtrain.table = table(obs = OJ.train$Purchase, pred = radialOJtrainPreds)
radtrain.table
## pred
## obs CH MM
## CH 275 38
## MM 49 173
1-sum(diag(radtrain.table))/sum(radtrain.table)
## [1] 0.1626168
#testing
radialOJtestPreds = predict(radialOJsvm, newdata = OJ.test)
radtest.table = table(obs = OJ.test$Purchase, pred = radialOJtestPreds)
radtest.table
## pred
## obs CH MM
## CH 312 28
## MM 62 133
1-sum(diag(radtest.table))/sum(radtest.table)
## [1] 0.1682243
Now, using a radial kernel with the tuned cost (cost = 1), we get a training accuracy of about 83.7% and a test accuracy of about 83.2%. The training fit improves on the untuned radial model, but the test error is essentially comparable to (in fact slightly higher than) that of the tuned linear classifier.
g. Repeat parts (b) through (e) using a support vector machine with a polynomial kernel of degree 2.
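The degree-2 polynomial results below could come from a chunk like this sketch (cost = 0.05 as before; the object name polysvm is a placeholder, while polysvm.train.table and polysvm.test.table are the confusion matrices shown):

set.seed(36)
polysvm = svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2, cost = 0.05)
summary(polysvm)
#training
polysvm.train.table = table(obs = OJ.train$Purchase, pred = predict(polysvm, newdata = OJ.train))
polysvm.train.table
#testing
polysvm.test.table = table(obs = OJ.test$Purchase, pred = predict(polysvm, newdata = OJ.test))
polysvm.test.table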
##
## Call:
## svm(formula = Purchase ~ ., data = OJ.train, kernel = "polynomial",
## degree = 2, cost = 0.05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 0.05
## degree: 2
## gamma: 0.05555556
## coef.0: 0
##
## Number of Support Vectors: 427
##
## ( 212 215 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
## pred
## obs CH MM
## CH 307 6
## MM 172 50
1-sum(diag(polysvm.train.table))/sum(polysvm.train.table)
## [1] 0.3327103
## pred
## obs CH MM
## CH 330 10
## MM 164 31
1-sum(diag(polysvm.test.table))/sum(polysvm.test.table)
## [1] 0.3252336
set.seed(36)
polySVM = tune(svm, Purchase ~ ., data = OJ.train,
               ranges = list(cost = c(.01, .02, .05, .1, .2, .5, 1, 2, 5, 10)),
               kernel = "polynomial", degree = 2)
summary(polySVM)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 5
##
## - best performance: 0.19413
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.4149196 0.05553547
## 2 0.02 0.3851153 0.05320332
## 3 0.05 0.3458071 0.05901895
## 4 0.10 0.3233054 0.06237070
## 5 0.20 0.2858840 0.06399353
## 6 0.50 0.2278127 0.06978194
## 7 1.00 0.2221523 0.07434275
## 8 2.00 0.2016073 0.07409238
## 9 5.00 0.1941300 0.06193817
## 10 10.00 0.2016073 0.06151487
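A sketch of refitting the polynomial SVM at the tuned cost (cost = 5, from the output above) to produce polyOJsvm, which is used below:

set.seed(36)
polyOJsvm = svm(Purchase ~ ., data = OJ.train, kernel = "polynomial", degree = 2, cost = 5)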
#training
polyOJtrainPreds = predict(polyOJsvm, newdata = OJ.train)
polytrain.table = table(obs = OJ.train$Purchase, pred = polyOJtrainPreds)
polytrain.table
## pred
## obs CH MM
## CH 286 27
## MM 55 167
1-sum(diag(polytrain.table))/sum(polytrain.table)
## [1] 0.153271
#testing
polyOJtestPreds = predict(polyOJsvm, newdata = OJ.test)
polytest.table = table(obs = OJ.test$Purchase, pred = polyOJtestPreds)
polytest.table
## pred
## obs CH MM
## CH 309 31
## MM 69 126
1-sum(diag(polytest.table))/sum(polytest.table)
## [1] 0.1869159
Looking at our results, we see that the tuned polynomial kernel (test error of about 18.7%) doesn't quite beat the
tuned radial kernel from earlier (test error of about 16.8%).
h. Repeat parts (b) through (e) using a linear support vector machine, applied to an expanded feature set
consisting of linear and all possible quadratic terms for the predictors. How does this compare to the
polynomial kernel both conceptually and in terms of the results for this problem?
set.seed(315)
# The quadratic expansion is handled by polym() inside the svm() formula below, so quadOJ
# just needs the original predictors and the response. The commented lines are earlier
# attempts at building the squared columns by hand (in apply(), 2 means apply over columns).
#quadratic = apply(OJ[,-1], 2, as.numeric)
#quadratic = quadratic^2
#quadOJ = cbind(quadratic, OJ)
#newOJ = as.numeric(OJ[,-1])
#quadratic = do.call(poly, c(lapply(2:18, function(x) as.numeric(OJ[,x])), degree=2, raw=TRUE))
#quadOJ = cbind(quadratic, OJ$Purchase)
quadOJ = OJ
quadOJtrain = quadOJ[train,]
quadOJtest = quadOJ[-train,]
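The outputs that follow (the linear SVM on the expanded feature set with cost = 0.05, its tune() results, and the hquadtrain.table / hquadtest.table confusion matrices) could be produced by a chunk like this sketch. The object names hquadsvm, quadForm, and quadTune are placeholders; the polym() formula is the one shown in the Call below.

quadForm = Purchase ~ polym(PriceCH, PriceMM, DiscCH, DiscMM, LoyalCH, SalePriceMM, SalePriceCH,
                            PriceDiff, PctDiscMM, PctDiscCH, ListPriceDiff, degree = 2)
hquadsvm = svm(quadForm, data = quadOJtrain, kernel = "linear", cost = 0.05)
summary(hquadsvm)
quadTune = tune(svm, quadForm, data = quadOJtrain,
                ranges = list(cost = c(.01, .02, .05, .1, .2, .5, 1, 2, 5, 10)), kernel = "linear")
summary(quadTune)
#training
hquadtrain.table = table(obs = OJ.train$Purchase, pred = predict(hquadsvm, newdata = OJ.train))
hquadtrain.table
#testing
hquadtest.table = table(obs = OJ.test$Purchase, pred = predict(hquadsvm, newdata = OJ.test))
hquadtest.table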
##
## Call:
## svm(formula = Purchase ~ polym(PriceCH, PriceMM, DiscCH, DiscMM,
## LoyalCH, SalePriceMM, SalePriceCH, PriceDiff, PctDiscMM,
## PctDiscCH, ListPriceDiff, degree = 2), data = quadOJtrain,
## kernel = "linear", cost = 0.05)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
## gamma: 0.01298701
##
## Number of Support Vectors: 256
##
## ( 127 129 )
##
##
## Number of Classes: 2
##
## Levels:
## CH MM
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.2242837
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.2242837 0.07316077
## 2 0.02 0.2673305 0.06020821
## 3 0.05 0.3026555 0.04910140
## 4 0.10 0.3009783 0.05457131
## 5 0.20 0.3046820 0.04299572
## 6 0.50 0.3066038 0.05260060
## 7 1.00 0.3028302 0.04732529
## 8 2.00 0.2990566 0.05611746
## 9 5.00 0.3008386 0.05382667
## 10 10.00 0.2989518 0.04491017
## pred
## obs CH MM
## CH 278 35
## MM 62 160
1-sum(diag(hquadtrain.table))/sum(hquadtrain.table)
## [1] 0.1813084
## pred
## obs CH MM
## CH 304 36
## MM 64 131
1-sum(diag(hquadtest.table))/sum(hquadtest.table)
## [1] 0.1869159
quadsvm = svm(Purchase ~ polym(PriceCH, PriceMM, DiscCH, DiscMM, LoyalCH, SalePriceMM, SalePriceCH,
                               PriceDiff, PctDiscMM, PctDiscCH, ListPriceDiff, degree = 2),
              data = quadOJtrain, kernel = "linear", cost = 0.01)
#training
quadOJtrainPreds = predict(quadsvm, newdata = OJ.train)
quadtrain.table = table(obs = OJ.train$Purchase, pred = quadOJtrainPreds)
quadtrain.table
## pred
## obs CH MM
## CH 275 38
## MM 65 157
1-sum(diag(quadtrain.table))/sum(quadtrain.table)
## [1] 0.1925234
#testing
quadOJtestPreds = predict(quadsvm, newdata = OJ.test)
quadtest.table = table(obs = OJ.test$Purchase, pred = quadOJtestPreds)
quadtest.table
## pred
## obs CH MM
## CH 301 39
## MM 65 130
1-sum(diag(quadtest.table))/sum(quadtest.table)
## [1] 0.1943925
Compared to the polynomial kernel, we see somewhat worse classification rates here. Conceptually, a linear SVM
on the expanded set of linear and quadratic features fits the same family of quadratic decision boundaries as a
degree-2 polynomial kernel; the kernel simply computes the expansion implicitly rather than constructing the
high-dimensional feature set explicitly, and differences in scaling and tuning between the two formulations can
lead to somewhat different results in practice, as we see on this problem.
i. Overall, which approach seems to give the best results on this data?
Overall, comparing the tuned models' test error rates (linear about 16.1%, radial about 16.8%, polynomial about 18.7%, quadratic expansion about 19.4%), the linear support vector classifier gives the best test results on this data, with the radial kernel close behind; the polynomial kernel fits the training data best but does not carry that advantage over to the test set.
Problem 4
Consider a dataset with n observations, xi ∈ Rp for i = 1, …, n. In this problem we show that the K-means
algorithm is guaranteed to converge, but not necessarily to the globally optimal solution.
a. At the beginning of each iteration of the K-means algorithm, we have K clusters C1, …, CK ⊂ Rp, and each
data point is assigned to the cluster with the nearest centroid (at this point, the centroids are not
necessarily equal to the mean of the data points assigned to the cluster). Show (according to the problem
specifications):
d. Give, as an example, a toy data set and a pair of initial centroids for which the 2-means algorithm does not
converge to the globally optimal minimum.
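One classic toy example of this phenomenon, sketched in R (not necessarily the example drawn in the attached Problem 4d pages): four points at the corners of a wide rectangle, with initial centroids placed so that 2-means (the standard Lloyd update) splits the points top/bottom instead of left/right and stops there.

x = matrix(c(0, 0,
             0, 1,
             10, 0,
             10, 1), ncol = 2, byrow = TRUE)
# Bad initialization: centroids at (5, 0) and (5, 1) assign points by height.
# The recomputed centroids are again (5, 0) and (5, 1), so the algorithm stops here.
bad = kmeans(x, centers = matrix(c(5, 0, 5, 1), ncol = 2, byrow = TRUE), algorithm = "Lloyd")
bad$cluster       # 1 2 1 2  (bottom vs. top)
bad$tot.withinss  # 100
# The global optimum splits left vs. right, with far smaller within-cluster sum of squares.
good = kmeans(x, centers = matrix(c(0, 0.5, 10, 0.5), ncol = 2, byrow = TRUE), algorithm = "Lloyd")
good$cluster      # 1 1 2 2  (left vs. right)
good$tot.withinss # 1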
Problem 4a
Problem 4b
Problem 4c
Problem 4d
Problem 4d