ISLR Solutions - Classification
4 Exercises: Classification
Conceptual
1.
2.
The terms from (4.12) that do not vary with $k$:

$$C = \frac{\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}x^2\right)}{\sum_{l}\pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\mu_l)^2\right)}$$

Replacing $C$ in (4.12):

$$p_k(x) = \pi_k\,C\exp\left(\frac{1}{2\sigma^2}\left(2\mu_k x - \mu_k^2\right)\right)$$

$$\log(p_k(x)) = \log(\pi_k) + \log(C) + \frac{1}{2\sigma^2}\left(2\mu_k x - \mu_k^2\right)$$

Rearranging and disregarding $C$ (which is the same for every class):

$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
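As a quick numerical sanity check (not part of the original solution; the class means, priors and common sigma below are made up for illustration), the linear discriminant ranks the classes exactly as the full posterior numerator from (4.12) does:

# Hypothetical one-dimensional example with a shared sigma.
mu = c(-1, 0, 2)        # class means
pi_k = c(0.3, 0.3, 0.4) # prior probabilities
sigma = 1.5             # common standard deviation
x = 0.7                 # a test point

# Posterior numerators from (4.12) and the linear discriminants delta_k(x).
posterior_num = pi_k * dnorm(x, mean = mu, sd = sigma)
delta = x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(pi_k)
which.max(posterior_num) == which.max(delta)  # TRUE: both pick the same class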
3.
Removing the assumption of shared variance terms across all K classes, the terms from (4.12) that do not
vary with 𝑘:
$$C' = \frac{\frac{1}{\sqrt{2\pi}}}{\sum_{l}\pi_l \frac{1}{\sqrt{2\pi}\sigma_l}\exp\left(-\frac{1}{2\sigma_l^2}(x-\mu_l)^2\right)}$$

$$p_k(x) = \frac{\pi_k}{\sigma_k}\,C'\exp\left(-\frac{1}{2\sigma_k^2}\left(x^2 - 2\mu_k x + \mu_k^2\right)\right)$$

$$\log(p_k(x)) = -\frac{1}{2\sigma_k^2}x^2 + \frac{\mu_k x}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2} + \log\left(\frac{\pi_k}{\sigma_k}\right) + \log(C')$$
As can be seen from the presence of the $x^2$ term, the discriminant is quadratic rather than linear in $x$.
4.
(a)
In a uniform distribution, all intervals of the same length are equally probable. Assuming $x \in [0.05, 0.95]$, the interval used is $[x - 0.05, x + 0.05]$, which has length 0.1. On average, 10% of the observations would be available to make a prediction for the test observation.
(b)
Assuming $x_1, x_2 \in [0.05, 0.95]$, the fraction used is $0.1 \times 0.1 = 0.01$. Therefore, on average 1% of the available observations would be used to make a prediction.
(c)
When $p = 100$, the fraction is $0.1^{p} = 0.1^{100}$; that is, only $0.1^{100} \times 100\%$ of the observations are available.
(d)
As the number of predictors increases, the fraction of observations available to make a prediction decreases exponentially.
(e)
If $p = 1$: side length $= 0.1^{1/1} = 0.1$. If $p = 2$: side length $= 0.1^{1/2} \approx 0.316$. If $p = 100$: side length $= 0.1^{1/100} \approx 0.977$.
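These side lengths can be reproduced directly in R:

# Side length of the hypercube that contains 10% of the observations, for several p.
p = c(1, 2, 100)
round(0.1^(1 / p), 3)  # 0.100 0.316 0.977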
As p increases, the side length converges to 1. This shows that the hypercube centered on the test observation containing 10% of the training observations must be nearly as large as the hypercube containing all of the observations. It also shows that observations are 'further' from a test observation as p increases; that is, they are concentrated near the boundary of the hypercube.
5.
(a)
(b)
(c)
• In general, QDA tends to perform better than LDA when the sample size is large and when the classes do not share a common covariance matrix. As such, I would expect QDA to provide a better fit and therefore better predictions.
(d)
• False: LDA will likely provide a better fit for a linear decision boundary than QDA, and so give a better test error rate. QDA, being more flexible, could overfit, performing well on the training set but worse on the test set due to higher variance.
6.
(a)
(b)
$$\log\left(\frac{P(X)}{1 - P(X)}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$$

$$\log\left(\frac{0.5}{1 - 0.5}\right) = -6 + 0.05 X_1 + 3.5$$

$$X_1 = 50 \text{ hours.}$$
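A quick check in R with the stated coefficients (plogis() is the logistic function):

plogis(-6 + 0.05 * 50 + 1 * 3.5)  # 0.5, as required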
7.
8.
The KNN model with K=1 fits the training set exactly, so its training error is zero. This means its test error has to be 36% in order for the average of the two error rates to be 18%. As model selection is based on test-set performance, we should choose logistic regression (test error of 30%) to classify new observations.
9.
(a)
$$\text{Odds} = \frac{P(X)}{1 - P(X)} = 0.37 \quad\Longrightarrow\quad P(X) = \frac{0.37}{1.37} = 0.27$$
(b)
$$\text{Odds} = \frac{0.16}{1 - 0.16} = 0.19$$
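Both calculations in R:

0.37 / (1 + 0.37)  # (a) probability implied by odds of 0.37, approx. 0.27
0.16 / (1 - 0.16)  # (b) odds implied by a probability of 0.16, approx. 0.19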
Applied
10.
(a)
library(ISLR)
summary(Weekly)
# Scatterplot matrix.
pairs(Weekly[,1:8])
(Figure: scatterplot matrix of Year, Lag1 to Lag5, Volume, and Today.)
# Correlation matrix.
cor(Weekly[,1:8])
• As can be seen in the scatterplot and correlation matrices, the only notable relationship is a positive correlation between 'Year' and 'Volume'. From the summary statistics, we can observe that the Lag variables are very similar to each other and to 'Today'. There doesn't appear to be any other pattern, apart from the increase in Volume over the years covered (1990 to 2010).
(b) (c)
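The fitting chunk is not echoed in the original; the summary below and the confusion matrix further down presumably come from something like the following sketch.

# Logistic regression of Direction on the five lag variables and Volume.
logistic_fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                   data = Weekly, family = binomial)
summary(logistic_fit)
# Class predictions at a 0.5 threshold.
logistic_probs = predict(logistic_fit, type = "response")
logistic_preds = ifelse(logistic_probs > 0.5, "Up", "Down")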
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
# Confusion matrix
attach(Weekly)
table(logistic_preds,Direction)
## Direction
## logistic_preds Down Up
## Down 54 48
## Up 430 557
• The fraction of days on which the predictions are correct is 611/1089 = 56%, so the training error rate is 44%. Of the 987 "Up" predictions the model makes, it is correct 557/987 = 56.4% of the time. Given that 605/1089 = 55.6% of the days were "Up" days, the model's accuracy when predicting "Up" is only slightly better than random guessing.
(d)
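Parts (d) to (g) fit models with Lag2 as the only predictor on a 1990-2008 training period, evaluating on the held-out 2009-2010 weeks. The fitting chunks are not echoed in the original, so the following is a sketch of the setup assumed by the confusion matrices below.

# Training indicator and held-out test set.
train = (Weekly$Year < 2009)
Test = Weekly[!train, ]
Test_Direction = Direction[!train]
# Logistic regression with Lag2 only, fit on the training period.
logistic_fit2 = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
logistic_preds2 = ifelse(predict(logistic_fit2, Test, type = "response") > 0.5, "Up", "Down")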
# Confusion matrix.
table(logistic_preds2,Test_Direction)
## Test_Direction
## logistic_preds2 Down Up
## Down 9 5
## Up 34 56
(e)
# Using LDA.
library(MASS)
lda_fit = lda(Direction ~ Lag2, data=Weekly, subset=train)
lda_class = predict(lda_fit, Test)$class
#lda_fit
# Confusion matrix.
table(lda_class,Test_Direction)
## Test_Direction
## lda_class Down Up
## Down 9 5
## Up 34 56
• The LDA model makes correct predictions on 65/104 = 62.5% of the days.
(f)
# Using QDA.
qda_fit = qda(Direction ~ Lag2, data=Weekly, subset=train)
qda_pred = predict(qda_fit,Test)
qda_class = qda_pred$class
table(qda_class,Test_Direction)
## Test_Direction
## qda_class Down Up
## Down 0 0
## Up 43 61
• The QDA model predicts "Up" every day, so its TPR = 1 but its precision (fraction of correct "Up" predictions) is 61/104 ≈ 0.59, which is no better than simply guessing that every day is "Up".
(g)
# Using KNN
library(class)
set.seed(1)
train_X = Weekly[train,3]
test_X = Weekly[!train,3]
train_direction = Direction[train]
# knn() needs matrix/data.frame inputs, so wrap the single Lag2 column.
knn_pred = knn(data.frame(train_X), data.frame(test_X), train_direction, k=1)
table(knn_pred,Test_Direction)
## Test_Direction
## knn_pred Down Up
## Down 21 30
## Up 22 31
• KNN with K=1 is correct in its predictions for 52/104 = 50% of the days.
(h)
• Logistic regression and LDA give the exact same confusion matrix: TPR = 0.92, precision = 0.62, TNR = 0.21 and NPV (negative predictive value) = 0.64.
• For KNN with K=1: TPR = 0.51, precision = 0.58, TNR = 0.49 and NPV = 0.41.
• The logistic and LDA models provide the best results, particularly for predicting “Up” days.
(i)
## Test_Direction
## knn_pred2 Down Up
## Down 16 19
## Up 27 42
## Test_Direction
## knn_pred3 Down Up
## Down 15 20
## Up 28 41
• Higher K values show an improvement in the overall number of correct predictions (58/104 for the better of the two fits above) made by a KNN model when using Lag2 as the only predictor.
# Confusion matrix
table(lda_class2,Test_Direction)
## Test_Direction
## lda_class2 Down Up
## Down 9 13
## Up 34 48
# Confusion matrix.
table(logistic_preds3,Test_Direction)
## Test_Direction
## logistic_preds3 Down Up
## Down 8 4
## Up 35 57
# Using logistic regression with Lag2 and Lag1^2
logistic_fit4 = glm(Direction ~ Lag2 + I(Lag1^2), data=Weekly, family=binomial, subset=train)
logistic_preds4 = ifelse(predict(logistic_fit4, Test, type="response") > 0.5, "Up", "Down")
# Confusion matrix.
table(logistic_preds4,Test_Direction)
## Test_Direction
## logistic_preds4 Down Up
## Down 8 2
## Up 35 59
11. (a)
# Create mpg01: 1 if mpg is above its median, 0 otherwise.
df = Auto
median_mpg = median(df$mpg)
for(i in 1:dim(df)[1]){
  if (df$mpg[i] > median_mpg){
    df$mpg01[i] = 1
  }else{
    df$mpg01[i] = 0
  }
}
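The same column can also be created without a loop:

# Vectorised equivalent of the loop above.
df$mpg01 = as.numeric(df$mpg > median_mpg)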
(b)
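movetolast() below is not a base R function; it appears to be a small user-defined helper (not echoed in the original) that moves the named columns to the end of the data frame, for example:

# A plausible definition of the helper (an assumption).
movetolast = function(data, move) {
  data[c(setdiff(names(data), move), move)]
}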
df = movetolast(df, c("name"))
pairs(df[,1:9])
(Figure: scatterplot matrix of mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, and mpg01.)
cor(df[,1:9])
• There is a strong positive correlation between mpg and mpg01, and a strong negative correlation between
cylinders, displacement, weight, horsepower and mpg01.
• I will use these variables, except mpg. mpg was used to assign observations to mpg01 values, so using it could lead to perfectly separating the test observations.
(c)
library(caTools)  # provides sample.split()
set.seed(123)
sample_data = sample.split(df$mpg, SplitRatio = 0.70)
train2 = subset(df, sample_data==TRUE)
test2 = subset(df, sample_data==FALSE)
actual = test2$mpg01  # test-set labels used in the confusion matrices below
(d)
# LDA model and its test-set predictions.
lda_fit3 = lda(mpg01 ~ cylinders+displacement+horsepower+weight, data=train2)
predictions = predict(lda_fit3, test2)$class
table(predictions, actual)
## actual
## predictions 0 1
## 0 48 4
## 1 8 44
(e)
# QDA model
qda_fit2 = qda(mpg01 ~ cylinders+displacement+horsepower+weight, data=train2)
qda_pred2 = predict(qda_fit2,test2)
predictions = qda_pred2$class
table(predictions,actual)
## actual
## predictions 0 1
## 0 50 4
## 1 6 44
(f)
# Logistic regression model
logistic_fit5 = glm(mpg01 ~ cylinders+displacement+horsepower+weight, data=train2, family=binomial)
logistic_preds5 = ifelse(predict(logistic_fit5, test2, type="response") > 0.5, 1, 0)
table(logistic_preds5,actual)
## actual
## logistic_preds5 0 1
## 0 50 4
## 1 6 44
(g)
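The KNN chunks are not echoed; the confusion matrices below assume a setup along these lines (the K used for the second fit is not shown; the comment after the tables indicates K = 10 for the last one).

# Predictor matrices and class labels for KNN (an assumed reconstruction).
train2_X = data.matrix(train2[, c("cylinders", "displacement", "horsepower", "weight")])
test2_X  = data.matrix(test2[, c("cylinders", "displacement", "horsepower", "weight")])
train2_y = train2$mpg01
test2_y  = test2$mpg01
set.seed(1)
knn_pred4 = knn(train2_X, test2_X, train2_y, k = 1)
table(knn_pred4, test2_y)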
## test2_y
## knn_pred4 0 1
## 0 45 9
## 1 11 39
## test2_y
## knn_pred5 0 1
## 0 45 5
## 1 11 43
## test2_y
## knn_pred6 0 1
## 0 45 4
## 1 11 44
• K=10 leads to a slight improvement in test error (14.4%), with diminishing returns as K gets even higher.
12.
(b) (c)
Power2 = function(x,a){
  print(x^a)
}
Power2(3,8)
## [1] 6561
Power2(10,3)
## [1] 1000
Power2(8,17)
## [1] 2.2518e+15
Power2(131,3)
## [1] 2248091
(d) (e)
Power3 = function(x,a){
  result = x^a
  return(result)
}
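The plotting chunk for part (e) is not echoed; the figure below was presumably produced by something like this, with a log-scaled x-axis:

x = 1:100
plot(x, Power3(x, 2), log = "x", xlab = "x", ylab = "y",
     main = "Plot of x against x^2")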
(Figure: "Plot of x against x^2", showing y = x^2 against x on a log-scaled x-axis.)
(f)
PlotPower = function(x,a){
  x_values = x
  y_values = x^a
  plot(x_values, y_values)
}
PlotPower(1:10,3)
(Figure: output of PlotPower(1:10, 3), plotting y_values = x^3 against x_values.)
13.
#library(ISLR)
#library(MASS)
#library(class)
boston_df = Boston
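The chunk that creates the binary response and the train/test split is not echoed; the code below is a sketch of the setup assumed by the rest of this exercise (an 80/20 split is consistent with the 404 training observations implied by the model summary, but the seed is an assumption).

# Binary response: 1 if crim is above its median, 0 otherwise (column 15 of the data frame).
boston_df$crim01 = as.numeric(boston_df$crim > median(boston_df$crim))
set.seed(123)
boston_sample = sample.split(boston_df$crim01, SplitRatio = 0.80)  # caTools, loaded earlier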
boston_train = subset(boston_df, boston_sample==TRUE)
boston_test = subset(boston_df, boston_sample==FALSE)
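A sketch of the logistic fit whose summary follows (the original chunk is not echoed):

boston_logit = glm(crim01 ~ . - chas - crim, data = boston_train, family = binomial)
summary(boston_logit)
actual = boston_test$crim01
boston_preds = ifelse(predict(boston_logit, boston_test, type = "response") > 0.5, 1, 0)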
##
## Call:
## glm(formula = crim01 ~ . - chas - crim, family = binomial, data = boston_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8978 -0.1705 -0.0002 0.0025 3.5340
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.431670 7.164710 -5.504 3.72e-08 ***
## zn -0.102660 0.038683 -2.654 0.007958 **
## indus -0.069107 0.047124 -1.467 0.142511
## nox 48.381844 8.123415 5.956 2.59e-09 ***
## rm 0.084088 0.798160 0.105 0.916096
## age 0.023671 0.013566 1.745 0.081014 .
## dis 0.820828 0.255752 3.209 0.001330 **
## rad 0.631930 0.164647 3.838 0.000124 ***
## tax -0.005362 0.002838 -1.889 0.058826 .
## ptratio 0.334091 0.133126 2.510 0.012087 *
## black -0.007506 0.005230 -1.435 0.151230
## lstat 0.076788 0.051698 1.485 0.137462
## medv 0.149509 0.077409 1.931 0.053430 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.06 on 403 degrees of freedom
## Residual deviance: 169.97 on 391 degrees of freedom
## AIC: 195.97
##
## Number of Fisher Scoring iterations: 9
table(boston_preds, actual)
## actual
## boston_preds 0 1
## 0 46 5
## 1 5 46
• Test error rate of 9.8%. Same accuracy when predicting 0 (crime below the median) as when predicting 1 (crime above the median).
table(boston_preds2, actual)
## actual
## boston_preds2 0 1
## 0 45 9
## 1 6 42
# LDA
boston_lda = lda(crim01 ~.-chas-crim , data=boston_train)
boston_preds2 = predict(boston_lda, boston_test)
table(boston_preds2$class, actual)
## actual
## 0 1
## 0 49 12
## 1 2 39
# QDA
boston_qda = qda(crim01 ~.-chas-crim , data=boston_train)
boston_preds3 = predict(boston_qda, boston_test)
table(boston_preds3$class, actual)
## actual
## 0 1
## 0 50 9
## 1 1 42
#KNN
#Training and Test sets without crim and chas
boston_train2 = data.matrix(subset(boston_train,select=-c(crim,chas)))
boston_test2 = data.matrix(subset(boston_test,select=-c(crim,chas)))
train2_y = data.matrix(boston_train[,15])
test2_y = data.matrix(boston_test[,15])
# KNN-1 and predictions
boston_knn1 = knn(boston_train2, boston_test2, train2_y, k=1)
table(boston_knn1, test2_y)
## test2_y
## boston_knn1 0 1
## 0 47 1
## 1 4 50
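The next two confusion matrices presumably come from K = 2 and K = 3 fits (consistent with the closing comment that K = 1, 2 or 3 gives the best results); a sketch:

boston_knn2 = knn(boston_train2, boston_test2, train2_y, k = 2)
boston_knn3 = knn(boston_train2, boston_test2, train2_y, k = 3)
table(boston_knn2, test2_y)
table(boston_knn3, test2_y)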
## test2_y
## boston_knn2 0 1
## 0 45 1
## 1 6 50
## test2_y
## boston_knn3 0 1
## 0 43 5
## 1 8 46
#KNN-1 using indus, nox, age, dis, rad, tax (strongly correlated variables with crim01)
boston_train3 = data.matrix(subset(boston_train,select=c(indus,nox,age,dis,rad,tax)))
boston_test3 = data.matrix(subset(boston_test,select=c(indus,nox,age,dis,rad,tax)))
boston_knn4 = knn(boston_train3, boston_test3, train2_y, k=1)
table(boston_knn4, test2_y)
## test2_y
## boston_knn4 0 1
## 0 42 1
## 1 9 50
#KNN-2 using nox and rad - the most statistically significant variables in the first logistic model.
# (K = 2 here, reading "KNN-2" as the value of K; the original K is not echoed.)
boston_train4 = data.matrix(subset(boston_train,select=c(nox,rad)))
boston_test4 = data.matrix(subset(boston_test,select=c(nox,rad)))
boston_knn5 = knn(boston_train4, boston_test4, train2_y, k=2)
table(boston_knn5, test2_y)
## test2_y
## boston_knn5 0 1
## 0 49 2
## 1 2 49
• Test error of about 4%, the lowest among the tested models and subsets of variables. KNN with K = 1, 2 or 3 gives the best results.