ISLR Solutions - Classification
4 Exercises: Classification
Conceptual
1.
2.
The terms from (4.12) that do not vary with $k$:

$$C = \frac{\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}x^2\right)}{\sum_{l}\pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{1}{2\sigma^2}(x-\mu_l)^2\right)}$$

Replacing $C$ in (4.12):

$$p_k(x) = \pi_k\,C\exp\left(\frac{1}{2\sigma^2}\left(2\mu_k x - \mu_k^2\right)\right)$$

$$\log(p_k(x)) = \log(\pi_k) + \log(C) + \frac{1}{2\sigma^2}\left(2\mu_k x - \mu_k^2\right)$$

Rearranging and disregarding $C$ (which is the same for every class):

$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
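As a quick numerical sanity check (not part of the original solution; the class means, priors and common sigma below are made up for illustration), the linear discriminant ranks the classes exactly as the full posterior numerator from (4.12) does:

# Hypothetical one-dimensional example with a shared sigma.
mu = c(-1, 0, 2)        # class means
pi_k = c(0.3, 0.3, 0.4) # prior probabilities
sigma = 1.5             # common standard deviation
x = 0.7                 # a test point

# Posterior numerators from (4.12) and the linear discriminants delta_k(x).
posterior_num = pi_k * dnorm(x, mean = mu, sd = sigma)
delta = x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(pi_k)
which.max(posterior_num) == which.max(delta)  # TRUE: both pick the same class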
3.
Removing the assumption of shared variance terms across all K classes, the terms from (4.12) that do not
vary with 𝑘:
$$C' = \frac{\frac{1}{\sqrt{2\pi}}}{\sum_{l}\pi_l \frac{1}{\sqrt{2\pi}\sigma_l}\exp\left(-\frac{1}{2\sigma_l^2}(x-\mu_l)^2\right)}$$

$$p_k(x) = \frac{\pi_k}{\sigma_k}\,C'\exp\left(-\frac{1}{2\sigma_k^2}\left(x^2 - 2\mu_k x + \mu_k^2\right)\right)$$

$$\log(p_k(x)) = -\frac{1}{2\sigma_k^2}x^2 + \frac{\mu_k x}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2} + \log\left(\frac{\pi_k}{\sigma_k}\right) + \log(C')$$
As can be seen from the presence of the $x^2$ term, the discriminant is quadratic rather than linear in $x$.
4.
(a)
In a uniform distribution, all intervals of the same length are equally probable. Assuming $x \in [0.05, 0.95]$, the interval used is $[x - 0.05, x + 0.05]$, which has length 0.1. On average, 10% of the observations would be available to make a prediction for the test observation.
(b)
Assuming $x_1, x_2 \in [0.05, 0.95]$, the fraction used is $0.1 \times 0.1 = 0.01$. Therefore, on average 1% of the available observations would be used to make a prediction.
(c)
When $p = 100$, the fraction is $0.1^{p} = 0.1^{100}$; that is, only $0.1^{100} \times 100\%$ of the observations are available.
(d)
As the number of predictors increases, the fraction of observations available to make a prediction decreases exponentially.
(e)
If $p = 1$: side length $= 0.1^{1/1} = 0.1$. If $p = 2$: side length $= 0.1^{1/2} \approx 0.316$. If $p = 100$: side length $= 0.1^{1/100} \approx 0.977$.
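These side lengths can be reproduced directly in R:

# Side length of the hypercube that contains 10% of the observations, for several p.
p = c(1, 2, 100)
round(0.1^(1 / p), 3)  # 0.100 0.316 0.977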
As p increases, the side length converges to 1. This shows that the hypercube centered on the test observation containing 10% of the training observations must be nearly as large as the hypercube containing all of the observations. It also shows that observations are 'further' from a test observation as p increases; that is, they are concentrated near the boundary of the hypercube.
5.
(a)
(b)
(c)
• In general, QDA tends to perform better than LDA when the sample size is large and when the classes do not share a common covariance matrix. As such, I would expect QDA to provide a better fit and therefore better predictions.
(d)
• False: LDA will likely provide a better fit for a linear decision boundary than QDA, and so give a better test error rate. QDA, being more flexible, could overfit, performing well on the training set but worse on the test set due to higher variance.
6.
(a)
(b)
$$\log\left(\frac{P(X)}{1 - P(X)}\right) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$$

$$\log\left(\frac{0.5}{1 - 0.5}\right) = -6 + 0.05 X_1 + 3.5$$

$$X_1 = 50 \text{ hours.}$$
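A quick check in R with the stated coefficients (plogis() is the logistic function):

plogis(-6 + 0.05 * 50 + 1 * 3.5)  # 0.5, as required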
7.
8.
The KNN model with K=1 fits the training set exactly, so its training error is zero. This means its test error has to be 36% in order for the average of the two error rates to be 18%. As model selection is based on test-set performance, we should choose logistic regression (test error of 30%) to classify new observations.
9.
(a)
$$\text{Odds} = \frac{P(X)}{1 - P(X)} = 0.37 \quad\Longrightarrow\quad P(X) = \frac{0.37}{1.37} = 0.27$$
(b)
$$\text{Odds} = \frac{0.16}{1 - 0.16} = 0.19$$
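Both calculations in R:

0.37 / (1 + 0.37)  # (a) probability implied by odds of 0.37, approx. 0.27
0.16 / (1 - 0.16)  # (b) odds implied by a probability of 0.16, approx. 0.19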
Applied
10.
(a)
library(ISLR)
summary(Weekly)
# Scatterplot matrix.
pairs(Weekly[,1:8])
(Figure: scatterplot matrix of Year, Lag1 to Lag5, Volume, and Today.)
# Correlation matrix.
cor(Weekly[,1:8])
• As can be seen in the scatterplot and correlation matrices, the only notable relationship is a positive correlation between 'Year' and 'Volume'. From the summary statistics, we can observe that the Lag variables are very similar to each other and to 'Today'. There doesn't appear to be any other pattern, apart from the increase in Volume over the years covered (1990 to 2010).
(b) (c)
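The fitting chunk is not echoed in the original; the summary below and the confusion matrix further down presumably come from something like the following sketch.

# Logistic regression of Direction on the five lag variables and Volume.
logistic_fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                   data = Weekly, family = binomial)
summary(logistic_fit)
# Class predictions at a 0.5 threshold.
logistic_probs = predict(logistic_fit, type = "response")
logistic_preds = ifelse(logistic_probs > 0.5, "Up", "Down")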
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
# Confusion matrix
attach(Weekly)
table(logistic_preds,Direction)
## Direction
## logistic_preds Down Up
## Down 54 48
## Up 430 557
• The fraction of days on which the predictions are correct is 611/1089 = 56%, so the training error rate is 44%. Of the 987 "Up" predictions the model makes, it is correct 557/987 = 56.4% of the time. Given that 605/1089 = 55.6% of the days were "Up" days, the model's accuracy when predicting "Up" is only slightly better than random guessing.
(d)
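Parts (d) to (g) fit models with Lag2 as the only predictor on a 1990-2008 training period, evaluating on the held-out 2009-2010 weeks. The fitting chunks are not echoed in the original, so the following is a sketch of the setup assumed by the confusion matrices below.

# Training indicator and held-out test set.
train = (Weekly$Year < 2009)
Test = Weekly[!train, ]
Test_Direction = Direction[!train]
# Logistic regression with Lag2 only, fit on the training period.
logistic_fit2 = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
logistic_preds2 = ifelse(predict(logistic_fit2, Test, type = "response") > 0.5, "Up", "Down")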
# Confusion matrix.
table(logistic_preds2,Test_Direction)
## Test_Direction
## logistic_preds2 Down Up
## Down 9 5
## Up 34 56
(e)
# Using LDA.
library(MASS)
lda_fit = lda(Direction ~ Lag2, data=Weekly, subset=train)
lda_class = predict(lda_fit, Test)$class
#lda_fit
# Confusion matrix.
table(lda_class,Test_Direction)
## Test_Direction
## lda_class Down Up
## Down 9 5
## Up 34 56
• The LDA model makes correct predictions on 65/104 = 62.5% of the days.
(f)
# Using QDA.
qda_fit = qda(Direction ~ Lag2, data=Weekly, subset=train)
qda_pred = predict(qda_fit,Test)
qda_class = qda_pred$class
table(qda_class,Test_Direction)
## Test_Direction
## qda_class Down Up
## Down 0 0
## Up 43 61
• The QDA model predicts "Up" every day, so its TPR = 1 but its precision (fraction of correct "Up" predictions) is 61/104 ≈ 0.59, which is no better than simply guessing that every day is "Up".
(g)
# Using KNN
library(class)
set.seed(1)
train_X = Weekly[train,3]
test_X = Weekly[!train,3]
train_direction = Direction[train]
# knn() needs matrix/data.frame inputs, so wrap the single Lag2 column.
knn_pred = knn(data.frame(train_X), data.frame(test_X), train_direction, k=1)
table(knn_pred,Test_Direction)
## Test_Direction
## knn_pred Down Up
## Down 21 30
## Up 22 31
• KNN with K=1 is correct in its predictions for 52/104 = 50% of the days.
(h)
• Logistic regression and LDA give the exact same confusion matrix: TPR = 0.92, precision = 0.62, TNR = 0.21 and NPV (negative predictive value) = 0.64.
• For KNN with K=1: TPR = 0.51, precision = 0.58, TNR = 0.49 and NPV = 0.41.
• The logistic and LDA models provide the best results, particularly for predicting “Up” days.
(i)
## Test_Direction
## knn_pred2 Down Up
## Down 16 19
## Up 27 42
## Test_Direction
## knn_pred3 Down Up
## Down 15 20
## Up 28 41
• Higher K values show an improvement in the overall number of correct predictions (58/104 for the better of the two fits above) made by a KNN model when using Lag2 as the only predictor.
# Confusion matrix
table(lda_class2,Test_Direction)
## Test_Direction
## lda_class2 Down Up
## Down 9 13
## Up 34 48
# Confusion matrix.
table(logistic_preds3,Test_Direction)
## Test_Direction
## logistic_preds3 Down Up
## Down 8 4
## Up 35 57
# Using logistic regression with Lag2 and Lag1^2
logistic_fit4 = glm(Direction ~ Lag2 + I(Lag1^2), data=Weekly, family=binomial, subset=train)
logistic_preds4 = ifelse(predict(logistic_fit4, Test, type="response") > 0.5, "Up", "Down")
# Confusion matrix.
table(logistic_preds4,Test_Direction)
## Test_Direction
## logistic_preds4 Down Up
## Down 8 2
## Up 35 59
11. (a)
# Create mpg01: 1 if mpg is above its median, 0 otherwise.
df = Auto
median_mpg = median(df$mpg)
for(i in 1:dim(df)[1]){
  if (df$mpg[i] > median_mpg){
    df$mpg01[i] = 1
  }else{
    df$mpg01[i] = 0
  }
}
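The same column can also be created without a loop:

# Vectorised equivalent of the loop above.
df$mpg01 = as.numeric(df$mpg > median_mpg)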
(b)
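movetolast() below is not a base R function; it appears to be a small user-defined helper (not echoed in the original) that moves the named columns to the end of the data frame, for example:

# A plausible definition of the helper (an assumption).
movetolast = function(data, move) {
  data[c(setdiff(names(data), move), move)]
}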
df = movetolast(df, c("name"))
pairs(df[,1:9])
(Figure: scatterplot matrix of mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, and mpg01.)
cor(df[,1:9])
• There is a strong positive correlation between mpg and mpg01, and a strong negative correlation between
cylinders, displacement, weight, horsepower and mpg01.
• I will use these variables, except mpg. mpg was used to assign observations to mpg01 values, so using it could lead to perfectly separating the test observations.
(c)
library(caTools)  # provides sample.split()
set.seed(123)
sample_data = sample.split(df$mpg, SplitRatio = 0.70)
train2 = subset(df, sample_data==TRUE)
test2 = subset(df, sample_data==FALSE)
actual = test2$mpg01  # test-set labels used in the confusion matrices below
(d)
# LDA model and its test-set predictions.
lda_fit3 = lda(mpg01 ~ cylinders+displacement+horsepower+weight, data=train2)
predictions = predict(lda_fit3, test2)$class
table(predictions, actual)
## actual
## predictions 0 1
## 0 48 4
## 1 8 44
(e)
# QDA model
qda_fit2 = qda(mpg01 ~ cylinders+displacement+horsepower+weight, data=train2)
qda_pred2 = predict(qda_fit2,test2)
predictions = qda_pred2$class
table(predictions,actual)
## actual
## predictions 0 1
## 0 50 4
## 1 6 44
(f)
# Logistic regression model
logistic_fit5 = glm(mpg01 ~ cylinders+displacement+horsepower+weight, data=train2, family=binomial)
logistic_preds5 = ifelse(predict(logistic_fit5, test2, type="response") > 0.5, 1, 0)
table(logistic_preds5,actual)
## actual
## logistic_preds5 0 1
## 0 50 4
## 1 6 44
(g)
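The KNN chunks are not echoed; the confusion matrices below assume a setup along these lines (the K used for the second fit is not shown; the comment after the tables indicates K = 10 for the last one).

# Predictor matrices and class labels for KNN (an assumed reconstruction).
train2_X = data.matrix(train2[, c("cylinders", "displacement", "horsepower", "weight")])
test2_X  = data.matrix(test2[, c("cylinders", "displacement", "horsepower", "weight")])
train2_y = train2$mpg01
test2_y  = test2$mpg01
set.seed(1)
knn_pred4 = knn(train2_X, test2_X, train2_y, k = 1)
table(knn_pred4, test2_y)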
## test2_y
## knn_pred4 0 1
## 0 45 9
## 1 11 39
## test2_y
## knn_pred5 0 1
## 0 45 5
## 1 11 43
## test2_y
## knn_pred6 0 1
## 0 45 4
## 1 11 44
• K=10 leads to a slight improvement in test error (14.4%), with diminishing returns as K gets even higher.
12.
(b) (c)
Power2 = function(x,a){
  print(x^a)
}
Power2(3,8)
## [1] 6561
Power2(10,3)
## [1] 1000
Power2(8,17)
## [1] 2.2518e+15
Power2(131,3)
## [1] 2248091
(d) (e)
Power3 = function(x,a){
  result = x^a
  return(result)
}
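The plotting chunk for part (e) is not echoed; the figure below was presumably produced by something like this, with a log-scaled x-axis:

x = 1:100
plot(x, Power3(x, 2), log = "x", xlab = "x", ylab = "y",
     main = "Plot of x against x^2")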
(Figure: "Plot of x against x^2", showing y = x^2 against x on a log-scaled x-axis.)
(f)
PlotPower = function(x,a){
  x_values = x
  y_values = x^a
  plot(x_values, y_values)
}
PlotPower(1:10,3)
(Figure: output of PlotPower(1:10, 3), plotting y_values = x^3 against x_values.)
13.
#library(ISLR)
#library(MASS)
#library(class)
boston_df = Boston
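The chunk that creates the binary response and the train/test split is not echoed; the code below is a sketch of the setup assumed by the rest of this exercise (an 80/20 split is consistent with the 404 training observations implied by the model summary, but the seed is an assumption).

# Binary response: 1 if crim is above its median, 0 otherwise (column 15 of the data frame).
boston_df$crim01 = as.numeric(boston_df$crim > median(boston_df$crim))
set.seed(123)
boston_sample = sample.split(boston_df$crim01, SplitRatio = 0.80)  # caTools, loaded earlier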
boston_train = subset(boston_df, boston_sample==TRUE)
boston_test = subset(boston_df, boston_sample==FALSE)
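A sketch of the logistic fit whose summary follows (the original chunk is not echoed):

boston_logit = glm(crim01 ~ . - chas - crim, data = boston_train, family = binomial)
summary(boston_logit)
actual = boston_test$crim01
boston_preds = ifelse(predict(boston_logit, boston_test, type = "response") > 0.5, 1, 0)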
##
## Call:
## glm(formula = crim01 ~ . - chas - crim, family = binomial, data = boston_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8978 -0.1705 -0.0002 0.0025 3.5340
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.431670 7.164710 -5.504 3.72e-08 ***
## zn -0.102660 0.038683 -2.654 0.007958 **
## indus -0.069107 0.047124 -1.467 0.142511
## nox 48.381844 8.123415 5.956 2.59e-09 ***
## rm 0.084088 0.798160 0.105 0.916096
## age 0.023671 0.013566 1.745 0.081014 .
## dis 0.820828 0.255752 3.209 0.001330 **
## rad 0.631930 0.164647 3.838 0.000124 ***
## tax -0.005362 0.002838 -1.889 0.058826 .
## ptratio 0.334091 0.133126 2.510 0.012087 *
## black -0.007506 0.005230 -1.435 0.151230
## lstat 0.076788 0.051698 1.485 0.137462
## medv 0.149509 0.077409 1.931 0.053430 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 560.06 on 403 degrees of freedom
## Residual deviance: 169.97 on 391 degrees of freedom
## AIC: 195.97
##
## Number of Fisher Scoring iterations: 9
table(boston_preds, actual)
## actual
## boston_preds 0 1
## 0 46 5
## 1 5 46
• Test error rate of 9.8%. Same accuracy when predicting 0 (crime below the median) as when predicting 1 (crime above the median).
table(boston_preds2, actual)
## actual
## boston_preds2 0 1
## 0 45 9
## 1 6 42
# LDA
boston_lda = lda(crim01 ~.-chas-crim , data=boston_train)
boston_preds2 = predict(boston_lda, boston_test)
table(boston_preds2$class, actual)
## actual
## 0 1
## 0 49 12
## 1 2 39
# QDA
boston_qda = qda(crim01 ~.-chas-crim , data=boston_train)
boston_preds3 = predict(boston_qda, boston_test)
table(boston_preds3$class, actual)
## actual
## 0 1
## 0 50 9
## 1 1 42
#KNN
#Training and Test sets without crim and chas
boston_train2 = data.matrix(subset(boston_train,select=-c(crim,chas)))
boston_test2 = data.matrix(subset(boston_test,select=-c(crim,chas)))
train2_y = data.matrix(boston_train[,15])
test2_y = data.matrix(boston_test[,15])
# KNN-1 and predictions
boston_knn1 = knn(boston_train2, boston_test2, train2_y, k=1)
table(boston_knn1, test2_y)
## test2_y
## boston_knn1 0 1
## 0 47 1
## 1 4 50
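The next two confusion matrices presumably come from K = 2 and K = 3 fits (consistent with the closing comment that K = 1, 2 or 3 gives the best results); a sketch:

boston_knn2 = knn(boston_train2, boston_test2, train2_y, k = 2)
boston_knn3 = knn(boston_train2, boston_test2, train2_y, k = 3)
table(boston_knn2, test2_y)
table(boston_knn3, test2_y)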
## test2_y
## boston_knn2 0 1
## 0 45 1
## 1 6 50
## test2_y
## boston_knn3 0 1
## 0 43 5
## 1 8 46
#KNN-1 using indus, nox, age, dis, rad, tax (strongly correlated variables with crim01)
boston_train3 = data.matrix(subset(boston_train,select=c(indus,nox,age,dis,rad,tax)))
boston_test3 = data.matrix(subset(boston_test,select=c(indus,nox,age,dis,rad,tax)))
boston_knn4 = knn(boston_train3, boston_test3, train2_y, k=1)
table(boston_knn4, test2_y)
## test2_y
## boston_knn4 0 1
## 0 42 1
## 1 9 50
#KNN-2 using nox and rad - the most statistically significant variables in the first logistic model.
# (K = 2 here, reading "KNN-2" as the value of K; the original K is not echoed.)
boston_train4 = data.matrix(subset(boston_train,select=c(nox,rad)))
boston_test4 = data.matrix(subset(boston_test,select=c(nox,rad)))
boston_knn5 = knn(boston_train4, boston_test4, train2_y, k=2)
table(boston_knn5, test2_y)
## test2_y
## boston_knn5 0 1
## 0 49 2
## 1 2 49
• Test error of about 4%, the lowest among the tested models and subsets of variables. KNN with K = 1, 2 or 3 gives the best results.