Week 7 Notes
## groups
## pred.cv versicolor virginica
## versicolor 47 1
## virginica 3 49
where $\bar z_j$ is the sample mean of $(z_{ji})_i$ for $j = 1, 2$, and $s_z^2$ is the pooled variance of the $z_{ji}$,
$$s_z^2 = \frac{1}{n_1 + n_2 - 2} \sum_{j=1}^{2} \sum_{i=1}^{n_j} (z_{ji} - \bar z_j)^2.$$
Here we need to assume that $S_p$ is SPD, otherwise the maximum of (6.4.1) would be infinite. This forces us to have $n_1 + n_2 \ge p + 2$. Thus if we have fewer data points than the dimension of the data, LDA is not applicable. Also, even if there is enough data, it is possible that $S_p$ is not SPD: for instance in the ZIP code dataset, taking only the digits 0 and 1, the sample size is larger than $p + 2$, but the pooled covariance is not SPD:
if (!exists('zip_code')) {
  zip_code <- read.table("~/st323/data/zip.train")
  digit <- zip_code[,1]               ## the digit label
  zip.mat <- as.matrix(zip_code[,-1]) ## the pixel intensities
}
X0 <- zip.mat[digit==0,]
X1 <- zip.mat[digit==1,]
Sp <- ((nrow(X0)-1)*cov(X0) + (nrow(X1)-1)*cov(X1))/
  (nrow(X0) + nrow(X1) - 2)
library(magrittr)                     ## provides the pipe %>%
imin <- Sp %>% diag %>% which.min     ## pixel with the smallest pooled variance
print(paste0(nrow(X0)+nrow(X1), ' data points in ', ncol(X0),
             ' dimensions'))
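A quick numerical check (a minimal sketch) is to inspect the smallest eigenvalue of Sp: a value of zero, or a negative value due to rounding, means Sp is not strictly positive definite.

min(eigen(Sp, symmetric = TRUE, only.values = TRUE)$values)   # <= 0 means Sp is not SPD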
$$a = S_p^{-1}(\bar x_1 - \bar x_2).$$
Figure 50: Several 1D projections of part of the iris dataset (versicolor and virginica species), and the corresponding density estimates for each species. Notice that some of the projections separate the two species well, whereas others do not.
and to allocate to group 2 otherwise.
Proof. We first prove the following result: for any symmetric positive definite (SPD) $p \times p$ matrix $S$ and non-zero $u \in \mathbb{R}^p$,
$$\max_{a \neq 0} \frac{(a^T u)^2}{a^T S a} = u^T S^{-1} u,$$
and the maximum is attained at $a = S^{-1} u$. Indeed, for any non-zero $a \in \mathbb{R}^p$,
$$(a^T u)^2 = \big\{(S^{1/2} a)^T (S^{-1/2} u)\big\}^2 \le (a^T S a)(u^T S^{-1} u)$$
by the Cauchy–Schwarz inequality, and setting $a = S^{-1} u$ gives equality.
Given this result, setting $u = \bar x_1 - \bar x_2$, we see that the maximizer of (6.4.1) is given by $a = S_p^{-1}(\bar x_1 - \bar x_2)$. We now need to find the classification rule: we already know that we classify a new observation $x^*$ to group 1 if
$$\bar z_1^2 - \bar z_2^2 < 2(\bar z_1 - \bar z_2)\, a^T x^*.$$
Now notice that $\bar z_1 - \bar z_2 = a^T(\bar x_1 - \bar x_2) = (\bar x_1 - \bar x_2)^T S_p^{-1}(\bar x_1 - \bar x_2) > 0$ since $S_p$ is SPD. Therefore we can simplify the classification rule to "classify to group 1 if
$$\tfrac{1}{2}(\bar z_1 + \bar z_2) < a^T x^*$$
and classify to group 2 otherwise."
We remark that the rule is equivalent to the one given in Proposition 6.2.1,
replacing the means and covariance by their sample estimates and setting π1 = π2 ,
c2 = c1 . A nice consequence of this Proposition is that it helps justify the rule based
on the multivariate Normal with equal covariances as an optimal linear decision
rule. Intuitively, it helps us understand that all we’re really doing is transforming
the original x into a scalar value y, and then checking whether it’s closer to ȳ1 or
ȳ2 .
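For concreteness, the rule can be coded in a few lines. The sketch below mimics the interface of the function fisherlda used later in this section, whose exact definition is not shown here; the function and argument names are therefore assumptions rather than the course implementation.

## Fisher's linear discriminant rule: a minimal sketch
fisher_lda <- function(x, groups, xnew) {
  g  <- levels(factor(groups))
  x1 <- x[groups == g[1], , drop = FALSE]
  x2 <- x[groups == g[2], , drop = FALSE]
  ## pooled covariance and discriminant direction a = Sp^{-1}(xbar1 - xbar2)
  Sp <- ((nrow(x1)-1)*cov(x1) + (nrow(x2)-1)*cov(x2)) / (nrow(x1) + nrow(x2) - 2)
  a  <- solve(Sp, colMeans(x1) - colMeans(x2))
  z  <- drop(x %*% a)                       # projections a'x of the training data
  midpoint <- (mean(z[groups == g[1]]) + mean(z[groups == g[2]])) / 2
  ## classify to group 1 if a'x* exceeds the midpoint (zbar1 > zbar2 by construction)
  pred <- ifelse(drop(xnew %*% a) > midpoint, g[1], g[2])
  list(a = a, z = z, midpoint = midpoint, xnewpred = pred)
}

For instance, fisher_lda(x, groups, xnew = x[1, ]) would return, among other things, the predicted group of the first observation.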
Example 6.4.2. Consider the Iris data and apply Fisher’s LDA to discriminate
between the species Versicolor and Virginica.
$$\bar x_1 = \begin{pmatrix} 5.94 \\ 2.77 \\ 4.26 \\ 1.33 \end{pmatrix}; \quad \bar x_2 = \begin{pmatrix} 6.59 \\ 2.97 \\ 5.55 \\ 2.03 \end{pmatrix}; \quad S_p = \begin{pmatrix} 0.34 & 0.09 & 0.24 & 0.05 \\ 0.09 & 0.10 & 0.08 & 0.04 \\ 0.24 & 0.08 & 0.26 & 0.06 \\ 0.05 & 0.04 & 0.06 & 0.06 \end{pmatrix}.$$
The coefficients for the optimal linear combination are therefore
$$a = S_p^{-1}(\bar x_1 - \bar x_2) = \begin{pmatrix} 3.56 \\ 5.58 \\ -6.97 \\ -12.38 \end{pmatrix}.$$
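These numbers can be reproduced directly from the built-in iris data; a short sketch (object names are illustrative), which should match the values above up to rounding:

x1 <- as.matrix(iris[iris$Species == 'versicolor', 1:4])
x2 <- as.matrix(iris[iris$Species == 'virginica',  1:4])
Sp <- ((nrow(x1)-1)*cov(x1) + (nrow(x2)-1)*cov(x2)) / (nrow(x1) + nrow(x2) - 2)
a  <- solve(Sp, colMeans(x1) - colMeans(x2))  # a = Sp^{-1}(xbar1 - xbar2)
round(colMeans(x1), 2); round(colMeans(x2), 2); round(Sp, 2); round(a, 2)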
Figure 51 shows the projections of the observed data and the midpoint between
ȳ1 and ȳ2 , i.e. the mean of the projections from each group. The plot suggests that
the linear rule is able to discriminate the two groups pretty well. This is confirmed
when we apply cross-validation, which delivers the following confusion matrix.
Predicted \ True    Versicolor   Virginica
Versicolor                   48           1
Virginica                     2          49
## leave-one-out cross-validation of Fisher's rule
pred.cv <- character(nrow(x))
for (i in 1:nrow(x)) {
  fit <- fisherlda(x[-i,], groups=groups[-i], xnew=x[i,])
  pred.cv[i] <- fit$xnewpred
}
table(pred.cv, groups)
## groups
## pred.cv versicolor virginica
## 1 48 1
## 2 2 49
op <- par(mfrow=c(3,1))
layout(matrix(c(1,2,2), nrow=3))   # top panel: 1D projections; bottom panel (taller): scatter plot
pch <- ifelse(as.numeric(groups)==1, 4, 4)
col <- ifelse(as.numeric(groups)==1, 'black', 'blue')
plot(lda$z, rep(1,nrow(x)), col=col, pch=pch, xlab="a'x",
     ylab='', cex.lab=1.25, ylim=c(.9,2), yaxt='n')
abline(v=lda$midpoint, lty=2, lwd=2)
X <- iris[iris$Species != 'setosa',]
species <- factor(X$Species)
X <- X[, c(2,4)]
X <- as.matrix(X)
X <- jitter(X, amount=.05)   # slightly jitter the observations to get a nicer plot
X <- scale(X, scale=FALSE)
par(mai=rep(0,4))
plot(X, cex=.6, asp=1, xlim=c(-5.5, 3), ylim=c(-4, 2), pch=20,
     axes=FALSE, col=c(1,4)[species])
a <- lda$a[c(2,4)]
projection_plot(X, groups=species, a, ind=NULL, lwdproj=.05,
                density_scaling=1, shift_len=5)
lines(c(-1,1)*a[2], -c(-1,1)*a[1], lty=2, lwd=2)
par(op)
Figure 51: Fisher's Linear Discriminant Rule to classify versicolor vs. virginica. Top subfigure: the vertical dashed line shows the midpoint between ȳ1 and ȳ2. Bottom subfigure: the corresponding projections of the original data (here we have slightly cheated, to allow for a graphical representation, in that we have taken the projection $x \mapsto a^T x$ restricted to variables 2 and 4).
simple but highly intuitive classification technique.
3. Classify as Y = 1 if
$$\frac{\hat P(Y = 1 \mid x)}{\hat P(Y = 2 \mid x)} > \frac{c_2}{c_1},$$
and as Y = 2 otherwise (a short R sketch of this rule is given below).
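As an illustration, the vote proportions returned by class::knn (argument prob=TRUE) can serve as the estimated class probabilities in this rule. The sketch below assumes the versicolor/virginica predictor matrix x and factor of labels y used elsewhere in this chapter, and purely illustrative values K = 15, c1 = 1, c2 = 2.

library(class)
c1 <- 1; c2 <- 2                        # assumed misclassification costs (illustrative)
phat1 <- numeric(length(y))             # estimated P(Y = group 1 | x), by leave-one-out CV
for (i in 1:length(y)) {
  pr <- knn(train = x[-i,], test = x[i,], cl = y[-i], k = 15, prob = TRUE)
  pwin <- attr(pr, 'prob')              # proportion of the K neighbours voting for the winner
  phat1[i] <- ifelse(pr == levels(y)[1], pwin, 1 - pwin)
}
ypred <- ifelse(phat1 / (1 - phat1) > c2 / c1, levels(y)[1], levels(y)[2])
table(y, ypred)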
Figure 52: Classification regions for various values of K using K-nearest neighbour
classification.
library(class)              # provides knn()
ypred <- numeric(length(y)) # predicted group codes (1/2) from leave-one-out CV
for (i in 1:length(y)) ypred[i] <- knn(train=x[-i,], test=x[i,],
                                       cl=y[-i], k=1)
table(y,ypred)
## ypred
## y 1 2
## versicolor 48 2
## virginica 4 46
op <- par(mfrow=c(1,2))
col <- ifelse(ypred==1,'black','blue')
pch <- ifelse(ypred==1,16,17)
plot(prcomp(iris[,1:4])$x[,1:2],col=col,pch=pch)
## ypred
## y 1 2
## versicolor 48 2
## virginica 2 48
par(op)
We now repeat the analysis with a larger K = 18, finding 95/100 correct classifi-
cations (cross-validated). The PC plots of the classifications are given in Figure 53.
We assess the performance of the algorithm for K ranging from 1 to 50. For
convenience, we define a function knn.cv that returns the cross-validated correct
classification rate. Figure 54 shows the results.
Figure 53: Principal components for the Iris data showing KNN predictions for
K = 1 (left) and K = 18 (right).
knn.cv <- function(x, y, k) {
  pred <- numeric(length(y))   # leave-one-out CV predictions, stored as numeric group codes
  for (i in 1:length(y)) pred[i] <- knn(train=x[-i,], test=x[i,], cl=y[-i], k=k)
  cc <- mean(pred == as.numeric(y))   # cross-validated correct classification rate
  return(cc)
}
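Figure 54 can then be produced along the following lines (a sketch, assuming the knn.cv definition above):

K  <- 1:50
cc <- sapply(K, function(k) knn.cv(x, y, k))
plot(K, cc, xlab = 'K', ylab = 'Correct classification rate (CV)')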
Figure 54: Iris data. Correct classification rate assessed via cross-validation for KNN
and K = 1, . . . , 50.
$$\eta(x) = P(Y = 1 \mid X = x) = \frac{f(x \mid Y = 1)\pi_1}{f(x \mid Y = 1)\pi_1 + f(x \mid Y = 2)\pi_2}.$$
On the other hand, if $x$ is such that $\eta(x) \le 1/2$ then the Bayes classifier predicts $Y = 2$, and its conditional probability of error given $X = x$ is $\eta(x) = \min\{\eta(x), 1 - \eta(x)\}$.
Under mild regularity conditions, we can derive the asymptotic risk of the 1NN classifier and show that it has a similar form to that of the Bayes classifier. This will allow us to compare the performance of the optimal Bayes classifier, which depends on knowing the distribution of (X, Y), to the performance of the 1NN classifier, which only requires data. Recall that when c1 = c2 the 1NN classification rule predicts that the label of X is the same as the label of the closest data point to it. This is a relatively crude data-driven classifier, but it still has some good properties.
We now compare these two expected costs. Write $C^*$ for the expected cost of the Bayes classifier and $C$ for the asymptotic expected cost of the 1NN classifier, and recall that $C^* = E[\min\{\eta(X), 1 - \eta(X)\}]$. We will use the fact that, for any $a, b \in \mathbb{R}$, we have $ab = \min(a, b)\max(a, b)$, together with $\max\{\eta, 1 - \eta\} = 1 - \min\{\eta, 1 - \eta\}$. Now
\begin{align*}
C/2 &= E[\eta(X)\{1 - \eta(X)\}] \\
    &= E\big[\min\{\eta(X), 1 - \eta(X)\}\,\max\{\eta(X), 1 - \eta(X)\}\big] \\
    &= E\big[\min\{\eta(X), 1 - \eta(X)\}\big(1 - \min\{\eta(X), 1 - \eta(X)\}\big)\big] \\
    &= E\big[\min\{\eta(X), 1 - \eta(X)\}\big] - E\big[\min\{\eta(X), 1 - \eta(X)\}^2\big] \\
    &= C^* - \Big(\mathrm{Var}\big[\min\{\eta(X), 1 - \eta(X)\}\big] + E\big[\min\{\eta(X), 1 - \eta(X)\}\big]^2\Big) \\
    &\le C^* - (C^*)^2.
\end{align*}
We therefore have that $C \le 2C^*(1 - C^*)$. In particular, when we have a lot of data, the expected cost of the 1NN classifier is no worse than twice the optimal expected cost, whatever the distribution of the data is.
Advantages of KNN
• Only requires distances, making it a very general algorithm,
• Can capture non-linear and non-monotonic patterns,
• Can detect interactions (complex combinations of variables associated with
one of the classes).
Disadvantages of KNN
library(rpart)   ## CART implementation; also provides the stagec (Stage C prostate cancer) data
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
              data = stagec, method = 'class')
y <- progstat
ypred <- predict(cfit)[,2]>0.5
op <- par(xpd=NA)
plot(cfit)
text(cfit)
par(op)
[Tree plot produced by plot(cfit) and text(cfit): the root split is grade < 2.5; further splits use g2 < 13.2, ploidy = ab, g2 ≥ 17.91, g2 ≥ 11.84, g2 < 11 and age ≥ 62.5, with leaves labelled No or Prog.]
Figure 56: Stage C data for Grade 3-4 patients and progression/no progression
status. Left: true status; Right: CART predictions.
op <- par(mfrow=c(1,2))
sel <- (stagec$grade>=2.5)
col <- ifelse(y=='Prog','black','blue')
pch <- ifelse(y=='Prog',16,17)
plot(stagec$g2[sel],stagec$age[sel],xlab='G2',ylab='Age',cex.lab=1,
col=col[sel],pch=pch[sel],main='Grades 3-4',cex.main=1, cex=.5)
#
sel <- (stagec$grade>=2.5)
col <- ifelse(ypred==1,'black','blue')
pch <- ifelse(ypred==1,16,17)
plot(stagec$g2[sel],stagec$age[sel],xlab='G2',ylab='Age',cex.lab=1,
col=col[sel],pch=pch[sel],main='Grades 3-4',cex.main=1, cex=.5)
abline(v=c(13.2,17.91),lwd=1)
segments(x0=17.91,x1=100,y0=62.5,lwd=1)
#plot(stagec$g2[!sel],stagec$age[!sel],xlab='G2',ylab='Age',
#cex.lab=1.25,col=col[!sel],pch=pch[!sel],main='Grades 1-2',cex.main=1.5)
par(op)
The main choices that one should consider carefully when building a classification tree are how to select variables and set thresholds at each step, and how to define a stopping criterion. There are many strategies for each of these choices, which are too extensive to cover here; a brief illustration of the tuning parameters exposed by rpart is sketched below.
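In rpart, for example, these choices are controlled through rpart.control and the pruning functions; the parameter values below are purely illustrative (a sketch, not a recommendation):

## grow a deeper tree by relaxing the default stopping rules, then prune it back
cfit2 <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
               data = stagec, method = 'class',
               control = rpart.control(minsplit = 10, cp = 0.001))
printcp(cfit2)                    # cross-validated error for each candidate subtree size
cfit3 <- prune(cfit2, cp = 0.01)  # prune back at a chosen complexity parameter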
Example 6.6.2. We applied CART to Anderson's Iris data, obtaining the tree depicted in Figure 57. The tree simply predicts species versicolor when Petal.Width < 1.75, and virginica otherwise.
Figure 58 shows the true and predicted species. In this example CART seems
to perform a bit worse than other classifiers. It is interesting to note that in these
data there seems to be a smooth transition between groups as we move through the
predictor space. Smooth transitions can be hard to capture with hard-thresholding
rules.
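The tree plotted below was fitted with rpart; one way to obtain a comparable fit is sketched here. It assumes x is the 100 × 4 matrix of versicolor/virginica measurements and y the corresponding 0/1 labels used in the logistic regression example below; the helper data frame iris2 is illustrative.

library(rpart)
iris2 <- data.frame(y = factor(y), x)      # combine labels and predictors
fit   <- rpart(y ~ ., data = iris2, method = 'class')
ypred <- predict(fit)[, 2] > 0.5           # classify as virginica (y = 1) if P(y = 1 | x) > 0.5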
op <- par(xpd=NA)
plot(fit)
text(fit)
par(op)
## ypred
## y FALSE TRUE
## 0 49 1
## 1 5 45
op <- par(mfrow=c(1,2))
col <- ifelse(y==1,'black','blue')
pch <- ifelse(y==1,16,17)
plot(x[,c('Petal.Width','Petal.Length')],pch=pch,col=col, cex=.6)
#
#
col <- ifelse(ypred==1,'black','blue')
pch <- ifelse(ypred==1,16,17)
plot(x[,c('Petal.Width','Petal.Length')],pch=pch,col=col, cex=.6)
par(op)
[Figure 57: classification tree for the Iris data, with a single split at Petal.Width < 1.75 and leaves labelled 0 (versicolor) and 1 (virginica).]
Figure 58: Iris data. Left: true species; Right: CART predictions.
Advantages of CART
• Easy to interpret,
Limitations
Figure 59: $P(Y = 1 \mid X)$ as given by logistic regression, for several coefficient values (the plotted curves use $(\beta_0, \beta_1) = (0, 1)$, $(1, 1)$ and $(0, 2)$). Left: $\eta_i = \beta_0 + \beta_1 X_{1i}$; Right: $\eta_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2$.
as Y = 1 if
$$\frac{\hat P(Y = 1 \mid x)}{\hat P(Y = 0 \mid x)} > \frac{c_0}{c_1} \iff x^T \hat\beta + \hat\beta_0 > \log\frac{c_0}{c_1},$$
and as Y = 0 otherwise.
The rule in Algorithm 6.7.1 is very similar to the optimal rule for multivariate
Normal data with equal covariances (Proposition 6.2.1), and by extension to Fisher’s
LDA. The decision is based on a linear combination of the predictors (namely, xT β̂),
with an intercept term and a term incorporating the misclassification costs.
In spite of these similarities, the rules are not identical. First, the linear combi-
nation coefficients are now given by β̂, which is not a linear function of the training
data (as was the case for the multivariate Normal). Second, logistic regression allows
us to include quadratic or any other non-linear transformation of the predictor, and
we can even work with binary or categorical predictors (as we would do in linear
regression).
Example 6.7.1. We look again at the Iris data, where we start by fitting a logistic
regression model using function glm. This function also fits Normal linear regression
models and a wider class of models called Generalized Linear Models. We indicate
that we wish to fit a logistic regression model with the argument family.
data(iris)
iris <- iris[iris$Species!='setosa',]
x <- as.matrix(iris[,1:4])
y <- ifelse(iris$Species=='virginica',1,0)
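The model whose summary is shown below can be fitted as follows (the call is the one reported in the output; the object name glm1 is illustrative):

glm1 <- glm(y ~ x[,1] + x[,2] + x[,3] + x[,4], family = binomial(link = "logit"))
summary(glm1)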
##
## Call:
## glm(formula = y ~ x[, 1] + x[, 2] + x[, 3] + x[, 4],
## family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.01105 -0.00541 -0.00001 0.00677 1.78065
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -42.638 25.707 -1.659 0.0972 .
## x[, 1] -2.465 2.394 -1.030 0.3032
## x[, 2] -6.681 4.480 -1.491 0.1359
## x[, 3] 9.429 4.737 1.991 0.0465 *
## x[, 4] 18.286 9.743 1.877 0.0605 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 11.899 on 95 degrees of freedom
## AIC: 21.899
##
## Number of Fisher Scoring iterations: 10
The glm output reveals that we may not need the four predictors (which makes
sense, as we know they are highly correlated with each other). We decide to drop
the predictor with largest p-value and re-fit the model.
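The re-fitted model, stored as glm2 (the name used by the prediction code further below), is obtained with:

glm2 <- glm(y ~ x[,2] + x[,3] + x[,4], family = binomial(link = "logit"))
summary(glm2)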
##
## Call:
## glm(formula = y ~ x[, 2] + x[, 3] + x[, 4], family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.75795 -0.00412 0.00000 0.00290 1.92193
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -50.527 23.995 -2.106 0.0352 *
## x[, 2] -8.376 4.761 -1.759 0.0785 .
## x[, 3] 7.875 3.841 2.050 0.0403 *
## x[, 4] 21.430 10.707 2.001 0.0453 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 13.266 on 96 degrees of freedom
## AIC: 21.266
##
## Number of Fisher Scoring iterations: 10
We will keep the three remaining predictors, as all of their estimated coefficients appear to differ from 0 (although not all of them are strictly significant at the 0.05 level).
Based on this output, and assuming equal misclassification costs c0 = c1, we compute the probability of each class and assign each individual to the most likely class.
b0 <- coef(glm2)[1]
b1 <- matrix(coef(glm2)[-1],ncol=1)
p <- 1/(1+exp(-b0 -x[,2:4] %*% b1))
ypred <- p>0.5
table(y,ypred)
## ypred
## y FALSE TRUE
## 0 48 2
## 1 1 49
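If the misclassification costs were unequal, the same fitted model could be used with the modified threshold from the rule stated earlier; a sketch with assumed costs c0 = 2, c1 = 1:

c0 <- 2; c1 <- 1                      # illustrative costs
eta <- drop(b0 + x[,2:4] %*% b1)      # linear predictor beta0 + x'beta
ypred.cost <- eta > log(c0/c1)        # classify as Y = 1 (virginica) if eta > log(c0/c1)
table(y, ypred.cost)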
op <- par(mfrow=c(1,2))
col <- ifelse(ypred==1,'black','blue')
pch <- ifelse(ypred==1,16,17)
plot(prcomp(x)$x[,1:2],pch=pch,col=col, cex=.7)
#
col <- ifelse(y==1,'black','blue')
pch <- ifelse(y==1,16,17)
plot(prcomp(x)$x[,1:2],pch=pch,col=col, cex=.7)
par(op)
The confusion matrix in the observed data is similar to those for the multivariate
Normal and Fisher’s LDA rules. As discussed before, it would be better to obtain
the confusion matrix with cross-validation, but we do not do that here. Figure 60
displays the first two principal components and the predicted classes.
• Can combine discrete and continuous predictors, and their non-linear transformations,
Figure 60: Principal component plot for the Iris data (Virginica/Versicolor). Left: logistic regression predictions; Right: true species.