Week 7 Notes

  pred.cv[i] <- fit$xnewpred
}
table(pred.cv,groups)

## groups
## pred.cv      versicolor virginica
##   versicolor         47         1
##   virginica           3        49

6.4 Linear Discriminant Analysis


Fisher proposed a classification rule called linear discriminant analysis (LDA), which
turns out to be equivalent to the Bayes classifier for multivariate Normal data with a
common covariance matrix across groups and π1 = π2, c1 = c2. The interesting point
is that he arrived at this solution through a completely different argument, and in
fact never assumed the data to be Normal. LDA arises as the optimal linear classifier
in terms of achieving separation between groups.
Let x1i ∈ Rp for i = 1, . . . , n1 be the observed predictor values in group 1 and
x2i ∈ Rp for i = 1, . . . , n2 be those in group 2. The goal in LDA is to find a
linear combination zji = aᵀxji ∈ R such that the ratio of the variability between
groups to the total variability is maximized. That is, we want to find coefficients
a ∈ Rp to maximize

$$\frac{(\bar z_1 - \bar z_2)^2}{s_z^2} = \frac{\big(a^T(\bar x_1 - \bar x_2)\big)^2}{a^T S_p a}, \qquad (6.4.1)$$

where z̄j is the sample mean of the zji for j = 1, 2, and s_z^2 is their pooled variance,

$$s_z^2 = \frac{1}{n_1 + n_2 - 2} \sum_{j=1}^{2} \sum_{i=1}^{n_j} (z_{ji} - \bar z_j)^2,$$

and Sp is the pooled covariance matrix of the xji,

$$S_p = \frac{1}{n_1 + n_2 - 2} \sum_{j=1}^{2} \sum_{i=1}^{n_j} (x_{ji} - \bar x_j)(x_{ji} - \bar x_j)^T.$$

Here we need to assume that Sp is SPD (symmetric positive definite); otherwise the
maximum of (6.4.1) would be infinite. This forces us to have n1 + n2 ≥ p + 2, so if
we have fewer data points than the dimension of the data, LDA is not applicable.
Moreover, even if there is enough data, it is possible that Sp is not SPD: for instance,
in the ZIP code dataset restricted to the digits 0 and 1, the sample size is larger than
p + 2, but the pooled covariance matrix is not SPD:

if(!exists('zip_code')) {
  zip_code <- read.table("~/st323/data/zip.train")
  digit <- zip_code[,1]                  ## the digit label
  zip.mat <- as.matrix(zip_code[,-1])
}

X0 <- zip.mat[digit==0,]
X1 <- zip.mat[digit==1,]
Sp <- ((nrow(X0)-1)*cov(X0) + (nrow(X1)-1)*cov(X1)) /
  (nrow(X0) + nrow(X1) - 2)
imin <- which.min(diag(Sp))              ## index of the smallest diagonal entry
print(paste0(nrow(X0)+nrow(X1), ' data points in ', ncol(X0), ' dimensions'))

## [1] "2199 data points in 256 dimensions"

print(paste0("The minimal entry of the diagonal of the pooled


covariance is ", Sp[imin, imin])) ##

## [1] "The minimal entry of the diagonal of the pooled


covariance is 0"
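To confirm that this pooled covariance matrix is indeed singular (and hence not SPD), one can inspect its smallest eigenvalue or its numerical rank. A minimal sketch, assuming the object Sp from the chunk above is still in the workspace:

min(eigen(Sp, symmetric=TRUE, only.values=TRUE)$values)  ## essentially 0: Sp is not positive definite
qr(Sp)$rank                                              ## numerical rank, smaller than p = 256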
Figure 50 illustrates the idea behind LDA: we are looking for projections of the
data that best separate the two groups.

As we shall see in Proposition 6.4.1, there is a simple expression for a. But let
us assume for a moment that a is known; then the rule is to assign a new observation
x* to group 1 whenever its projection is closer to z̄1 than to z̄2, that is,

$$(a^T x^* - \bar z_1)^2 < (a^T x^* - \bar z_2)^2
\;\Longleftrightarrow\; \bar z_1^2 - 2\bar z_1\, a^T x^* < \bar z_2^2 - 2\bar z_2\, a^T x^*
\;\Longleftrightarrow\; \bar z_1^2 - \bar z_2^2 < 2(\bar z_1 - \bar z_2)\, a^T x^*.$$

We would like to simplify the last inequality by dividing by (z̄1 − z̄2), but we do not
know its sign. It turns out that this quantity is positive for the a ∈ Rp that maximizes
(6.4.1). We have the following result.

Proposition 6.4.1 (Fisher's Linear Discriminant Analysis).

Assume x̄1 ≠ x̄2. The maximizer of (6.4.1) is

$$a = S_p^{-1}(\bar x_1 - \bar x_2),$$

where Sp is the pooled covariance matrix (assumed to be non-singular, which requires
n1 + n2 − 2 ≥ p). The rule is to allocate x* to group 1 if

$$a^T x^* > \frac{1}{2}(\bar z_1 + \bar z_2) = \frac{1}{2}\, a^T(\bar x_1 + \bar x_2),$$

and to allocate x* to group 2 otherwise.

Figure 50: Several 1D projections of part of the iris dataset (versicolor and
virginica species), together with the corresponding density estimates for each species.
Notice that some of the projections separate the two species well, whereas others
do not.

Proof. We first prove the following result: for any symmetric positive definite
p × p matrix S and any non-zero u ∈ Rp,

$$\max_{a \neq 0} \frac{(a^T u)^2}{a^T S a} = u^T S^{-1} u,$$

and the maximum is attained at a = S⁻¹u. Indeed,

$$(a^T u)^2 = \big((S^{1/2} a)^T (S^{-1/2} u)\big)^2 \leq (a^T S a)(u^T S^{-1} u)$$

by the Cauchy–Schwarz inequality, and setting a = S⁻¹u gives equality.

Given this result, setting u = x̄1 − x̄2, we see that the maximizer of (6.4.1) is
a = Sp⁻¹(x̄1 − x̄2). We now need to find the classification rule: we already know
that we classify a new observation x* to group 1 if

$$\bar z_1^2 - \bar z_2^2 < 2(\bar z_1 - \bar z_2)\, a^T x^*.$$

Now notice that z̄1 − z̄2 = aᵀ(x̄1 − x̄2) = (x̄1 − x̄2)ᵀ Sp⁻¹ (x̄1 − x̄2) > 0 since Sp is
SPD. Therefore we can simplify the classification rule to "classify to group 1 if

$$\tfrac{1}{2}(\bar z_1 + \bar z_2) < a^T x^*,$$

and classify to group 2 otherwise."

We remark that the rule is equivalent to the one given in Proposition 6.2.1,
replacing the means and covariance by their sample estimates and setting π1 = π2,
c2 = c1. A nice consequence of this proposition is that it justifies the rule based
on the multivariate Normal with equal covariances as an optimal linear decision
rule. Intuitively, all we are really doing is transforming the original x into a scalar
value z = aᵀx and then checking whether it is closer to z̄1 or to z̄2.
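Before the worked example, here is a small numerical sanity check of Proposition 6.4.1 (our own sketch, not part of the original analysis): it recomputes the pooled quantities for the versicolor/virginica subset of the iris data and verifies that no random direction achieves a larger Fisher ratio (6.4.1) than a = Sp⁻¹(x̄1 − x̄2).

#sanity check of Proposition 6.4.1 on the iris data (versicolor vs virginica)
data(iris)
x2sp <- iris[iris$Species != 'setosa', 1:4]
gr <- factor(iris$Species[iris$Species != 'setosa'])
m1 <- colMeans(x2sp[gr == 'versicolor',]); m2 <- colMeans(x2sp[gr == 'virginica',])
n1 <- sum(gr == 'versicolor'); n2 <- sum(gr == 'virginica')
Sp.iris <- ((n1-1)*cov(x2sp[gr == 'versicolor',]) +
            (n2-1)*cov(x2sp[gr == 'virginica',])) / (n1 + n2 - 2)

fisher.ratio <- function(a) as.numeric((t(a) %*% (m1-m2))^2 / (t(a) %*% Sp.iris %*% a))
a.opt <- solve(Sp.iris) %*% (m1 - m2)
fisher.ratio(a.opt)                            ## equals (m1-m2)' Sp^{-1} (m1-m2)
max(replicate(1000, fisher.ratio(rnorm(4))))   ## random directions never do better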
Example 6.4.2. Consider the Iris data and apply Fisher's LDA to discriminate
between the species Versicolor and Virginica. The sample means and pooled
covariance matrix are

$$\bar x_1 = \begin{pmatrix} 5.94 \\ 2.77 \\ 4.26 \\ 1.33 \end{pmatrix}; \quad
\bar x_2 = \begin{pmatrix} 6.59 \\ 2.97 \\ 5.55 \\ 2.03 \end{pmatrix}; \quad
S_p = \begin{pmatrix} 0.34 & 0.09 & 0.24 & 0.05 \\ 0.09 & 0.10 & 0.08 & 0.04 \\ 0.24 & 0.08 & 0.26 & 0.06 \\ 0.05 & 0.04 & 0.06 & 0.06 \end{pmatrix}.$$

The coefficients for the optimal linear combination are therefore

$$a = S_p^{-1}(\bar x_1 - \bar x_2) = \begin{pmatrix} 3.56 \\ 5.58 \\ -6.97 \\ -12.38 \end{pmatrix}.$$
Figure 51 shows the projections of the observed data and the midpoint between
z̄1 and z̄2, the means of the projections from each group. The plot suggests that
the linear rule discriminates the two groups quite well. This is confirmed when we
apply cross-validation, which delivers the following confusion matrix.

                    True species
Predicted       Versicolor  Virginica
Versicolor              48          1
Virginica                2         49

##LDA for Iris data


data(iris)
x <- iris[iris$Species!='setosa',]
groups <- factor(x$Species)
x <- x[,1:4]

fisherlda <- function(x, groups, xnew) {
  g1 <- unique(groups)[1]
  g2 <- unique(groups)[2]
  # group means, group covariances and the pooled covariance matrix
  m1 <- colMeans(x[groups==g1,])
  m2 <- colMeans(x[groups==g2,])
  S1 <- cov(x[groups==g1,])
  S2 <- cov(x[groups==g2,])
  n1 <- sum(groups==g1)
  n2 <- sum(groups==g2)
  Sp <- ((n1-1)*S1 + (n2-1)*S2)/(n1+n2-2)
  # discriminant coefficients, projections and the midpoint classification rule
  a <- solve(Sp) %*% matrix(m1-m2,ncol=1)
  z <- as.matrix(x) %*% a
  midpoint <- as.vector(.5*(m1 %*% a + m2 %*% a))
  pred <- z > midpoint
  pred <- ifelse(pred,as.character(g1),as.character(g2))
  ans <- list(z=z,xpred=pred,midpoint=midpoint, a=a)
  if (!missing(xnew)) {
    # classify new observations; as.character avoids returning bare factor level codes
    xnewpred <- (as.matrix(xnew) %*% a) > midpoint
    xnewpred <- ifelse(xnewpred,as.character(g1),as.character(g2))
    ans$xnewpred <- xnewpred
  } else {
    ans$xnewpred <- NA
  }
  return(ans)
}

lda <- fisherlda(x,groups)


table(lda$xpred,groups)

## groups
## versicolor virginica
## versicolor 48 1
## virginica 2 49

pred.cv <- character(nrow(x))
for (i in 1:nrow(x)) {
fit <- fisherlda(x[-i,], groups=groups[-i], xnew=x[i,])
pred.cv[i] <- fit$xnewpred
}

table(pred.cv,groups)

## groups
## pred.cv      versicolor virginica
##   versicolor         48         1
##   virginica           2        49

op <- par(mfrow=c(3,1))
layout(matrix(c(1,2,2), nrow=3))
pch <- ifelse(as.numeric(groups)==1,4,4)
col <- ifelse(as.numeric(groups)==1,'black','blue')
plot(lda$z,rep(1,nrow(x)),col=col,pch=pch,xlab="a'x",
ylab='',cex.lab=1.25,ylim=c(.9,2),yaxt='n')
abline(v=lda$midpoint,lty=2,lwd=2)

X = iris[iris$Species != 'setosa',]
species=factor(X$Species)
X = X[, c(2,4)]
X = (as.matrix(X))
X <- jitter(X, amount=.05) # slightly jitter the observations to get a nicer plot
X = scale(X, scale=FALSE)

par(mai=rep(0.,4))
plot(X, cex=.6, asp=1, xlim=c(-5.5, 3), ylim=c(-4, 2), pch=20,
axes=FALSE, col=c(1,4)[species])
a <- lda$a[c(2,4)]
projection_plot(X, groups=species, a, ind=NULL, lwdproj=.05,
density_scaling=1, shift_len=5)
lines(c(-1,1)*a[2], -c(-1,1)*a[1], lty=2, lwd=2)
par(op)

6.5 K-Nearest Neighbours Classification


Recall that the Bayes classifier can sometimes be written in terms of posterior class
probabilities P(Y = y|X = x); see Remark 6.1.4. K-nearest neighbours (KNN) is a
simple but highly intuitive classification technique.

Figure 51: Fisher's Linear Discriminant rule to classify versicolor vs. virginica.
Top subfigure: the vertical dashed line shows the midpoint between the projected
group means z̄1 and z̄2. Bottom subfigure: the corresponding projections of the
original data (here we have cheated slightly, to allow for a graphical representation,
by taking the projection x ↦ aᵀx restricted to variables 2 and 4).

Algorithm 6.5.1 (K-nearest neighbours).

Let x be the predictor values for an individual we wish to classify, and
let c2, c1 be misclassification costs as in Proposition 6.1.3.

1. Find the K individuals closest to x, where distance can be defined in
any convenient manner.

2. Estimate P(Y = y | x) by P̂(Y = y | x), the proportion of individuals
in class y amongst the K nearest neighbours of x.

3. Classify as Y = 1 if
$$\frac{\hat P(Y = 1 \mid x)}{\hat P(Y = 2 \mid x)} > \frac{c_2}{c_1},$$
and as Y = 2 otherwise.

When c2 = c1, as is frequently the case in practice, KNN simply follows a
majority-voting rule: you walk out of your house, ask your neighbours' opinion,
and go with the majority. A minimal implementation is sketched below.
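The following is a small from-scratch sketch of Algorithm 6.5.1 with Euclidean distance, mainly to make the cost-weighted vote in step 3 explicit. The function knn.cost and its arguments are our own illustration, and the class labels are assumed to be coded 1 and 2.

#minimal KNN classifier with misclassification costs (sketch; labels coded 1 and 2)
knn.cost <- function(xtrain, ytrain, xnew, K=5, c1=1, c2=1) {
  xnew <- as.numeric(as.matrix(xnew))                    # single new observation as a numeric vector
  d <- sqrt(colSums((t(as.matrix(xtrain)) - xnew)^2))    # Euclidean distances to xnew
  nn <- order(d)[1:K]                                    # indices of the K nearest neighbours
  p1 <- mean(ytrain[nn] == 1)                            # estimated P(Y = 1 | x)
  p2 <- 1 - p1                                           # estimated P(Y = 2 | x)
  ifelse(p1 * c1 > p2 * c2, 1, 2)                        # classify as 1 iff p1/p2 > c2/c1
}

With c1 = c2 this is exactly the majority vote described above; the examples that follow use the knn function from package class instead.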
The two key choices for KNN are how to measure distance and how to set K (i.e.
how many neighbours are used). To set K, one can try different values, assess
their predictive power via cross-validation, and choose the best-performing K. A
limitation of this approach is that it uses a fixed K for all individuals, whereas the
optimal K can in general differ depending on x. In regions where all individuals
belong to one class a large K may be best, but in regions where individuals quickly
transition from class 1 to class 2 a smaller K may be preferable.
Figure 52 shows the classification regions for various values of K. Notice that
the classification regions are more "rough" for smaller values of K, and become
"smoother" for larger values of K.
Example 6.5.1. We use KNN to classify the Iris data. The R function knn in
package class lets us specify a training set on which to train the algorithm and an
independent test set. We use leave-one-out cross-validation with K = 1 neighbour,
which achieves 94/100 correct classifications.

#KNN for Iris data


library(class)
data(iris)
iris <- iris[iris$Species!='setosa',]
x <- as.matrix(iris[,1:4])
x <- scale(x)
names(x) <- names(iris)[1:4]
y <- as.character(iris$Species)

ypred <- rep(NA,length(y))  # knn() returns a factor; stored here only the level codes remain (1=versicolor, 2=virginica)

Figure 52: Classification regions for various values of K using K-nearest neighbour
classification.

for (i in 1:length(y)) ypred[i] <- knn(train=x[-i,], test=x[i,],
cl=y[-i], k=1)
table(y,ypred)

## ypred
## y 1 2
## versicolor 48 2
## virginica 4 46

op <- par(mfrow=c(1,2))
col <- ifelse(ypred==1,'black','blue')
pch <- ifelse(ypred==1,16,17)
plot(prcomp(iris[,1:4])$x[,1:2],col=col,pch=pch)

ypred <- rep(NA,length(y))


for (i in 1:length(y)) ypred[i] <- knn(train=x[-i,], test=x[i,],
cl=y[-i], k=18)
table(y,ypred)

## ypred
## y 1 2
## versicolor 48 2
## virginica 2 48

col <- ifelse(ypred==1,'black','blue')


pch <- ifelse(ypred==1,16,17)
plot(prcomp(iris[,1:4])$x[,1:2],col=col,pch=pch)

par(op)

We now repeat the analysis with a larger K = 18, finding 95/100 correct classifi-
cations (cross-validated). The PC plots of the classifications are given in Figure 53.
We assess the performance of the algorithm for K ranging from 1 to 50. For
convenience, we define a function knn.cv that returns the cross-validated correct
classification rate. Figure 54 shows the results.

knn.cv <- function(x,y,kmax=20) {
  #KNN correct classification rate for several K (cross-validated)
  cc <- rep(NA,kmax)
  for (k in 1:kmax) {
    ypred <- rep(NA,length(y))
    for (i in 1:length(y)) ypred[i] <- knn(train=x[-i,], test=x[i,],
                                           cl=y[-i], k=k)
    tab <- table(y,ypred)
    cc[k] <- sum(diag(tab))/length(y)
  }
  return(cc)
}

Figure 53: Principal components for the Iris data showing KNN predictions for
K = 1 (left) and K = 18 (right).

set.seed(1) # for reproducibility


cc <- knn.cv(x,y,kmax=50)

plot(1:length(cc),cc,type='b', xlab='K', pch=20,
     ylab='Correct classification rate (CV)',
     cex.lab=1.25,cex.axis=1.25)

The correct classification rate stays quite stable for up to K = 35 neighbours,
but then starts to decrease. The optimal values are found at K = 16, 18.

6.5.1 Comparison of 1NN with the Bayes classifier


When studying the performance of classification rules, we often aim to compare
them to the Bayes classifier. Since the Bayes classifier has the minimal error
probability, we use its risk as a benchmark against which to compare the performance
of other, data-driven classifiers. First, let us obtain an explicit expression for the
risk of the Bayes classifier when c1 = c2.

Figure 54: Iris data. Correct classification rate assessed via cross-validation for KNN
and K = 1, . . . , 50.

Proposition 6.5.2. Suppose c1 = c2, and let

$$\eta(x) = P(Y = 1 \mid X = x) = \frac{f(x \mid Y = 1)\,\pi_1}{f(x \mid Y = 1)\,\pi_1 + f(x \mid Y = 2)\,\pi_2},$$

so that the Bayes classifier g∗ : Rp → {1, 2} is given by g∗(x) = 1 if
η(x) > 1/2 and g∗(x) = 2 if η(x) ≤ 1/2. Then

$$P(g_*(X) \neq Y) = E[\min\{\eta(X), 1 - \eta(X)\}].$$

Proof. If x is such that η(x) > 1/2 then we have that

P(g∗(X) ≠ Y | X = x) = P(Y = 2 | X = x) = 1 − η(x).

On the other hand, if x is such that η(x) ≤ 1/2 then we have that

P(g∗(X) ≠ Y | X = x) = P(Y = 1 | X = x) = η(x).

So, in either case,

P(g∗(X) ≠ Y | X = x) = min{η(x), 1 − η(x)},

and the result follows by taking the expectation over X.
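As a concrete illustration (our own example, with made-up distributions), suppose X | Y = 1 ∼ N(0, 1), X | Y = 2 ∼ N(2, 1) and π1 = π2 = 1/2. The Bayes risk E[min{η(X), 1 − η(X)}] can then be computed by one-dimensional numerical integration, and for this symmetric example it equals Φ(−1) ≈ 0.159.

#Bayes risk for X|Y=1 ~ N(0,1), X|Y=2 ~ N(2,1) and pi1 = pi2 = 1/2
eta <- function(x)                    # eta(x), computed on the log scale to avoid 0/0 in the far tails
  plogis(dnorm(x, 0, 1, log=TRUE) - dnorm(x, 2, 1, log=TRUE))
fx <- function(x) 0.5*dnorm(x, 0, 1) + 0.5*dnorm(x, 2, 1)   # marginal density of X

integrate(function(x) pmin(eta(x), 1 - eta(x)) * fx(x), -Inf, Inf)$value
pnorm(-1)                             # direct calculation for this example; both are about 0.159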

Under mild regularity conditions, we can derive the asymptotic risk of the 1NN
classifier and show that it has a similar form to that of the Bayes classifier. This will
allow us to compare the performance of the optimal Bayes classifier, which requires
knowing the distribution of (X, Y), to the performance of the 1NN classifier,
which only requires data. Recall that when c1 = c2 the 1NN classification rule
predicts that the label of X is the same as the label of the closest data point to
it. This is a relatively crude data-driven classifier, but it still has some good
properties.

Proposition 6.5.3 (Non-examinable). Suppose c1 = c2. Given data
(X1, Y1), . . . , (Xn, Yn), let gn : Rp → {1, 2} be the 1NN classification rule,
so that gn(x) = Y_NN, where Y_NN is the label of the Xi nearest to x. Then
we have

$$P(g_n(X) \neq Y) \to E[2\eta(X)\{1 - \eta(X)\}]$$

as n → ∞, under mild regularity conditions.

Proof Sketch, non-examinable. We have

$$\begin{aligned}
P(g_n(X) \neq Y) &= P(Y_{NN} \neq Y) = E\big[P(Y_{NN} \neq Y \mid X)\big] \\
&= E\big[\eta(X)\,P(Y_{NN} = 2) + \{1 - \eta(X)\}\,P(Y_{NN} = 1)\big] \\
&= E\big[\eta(X)\{1 - \eta(X_{NN})\} + \{1 - \eta(X)\}\,\eta(X_{NN})\big],
\end{aligned}$$

where we write X_NN for the value of Xi nearest to X. As n → ∞ we have X_NN → X
in probability, as the data points become more tightly packed, so we also have
η(X_NN) → η(X) in probability, and hence

$$P(g_n(X) \neq Y) \to E[2\eta(X)\{1 - \eta(X)\}].$$

We now compare these two expected costs. Write C∗ for the expected cost of the
Bayes classifier, and write C for the asymptotic expected cost of the 1NN classifier.
We will use the fact that, for any a, b ∈ R, we have ab = min(a, b) × max(a, b). Now
$$\begin{aligned}
C/2 &= E[\eta(X)\{1 - \eta(X)\}] \\
&= E\big[\min\{\eta(X), 1 - \eta(X)\}\,\max\{\eta(X), 1 - \eta(X)\}\big] \\
&= E\Big[\min\{\eta(X), 1 - \eta(X)\}\big(1 - \min\{\eta(X), 1 - \eta(X)\}\big)\Big] \\
&= E\big[\min\{\eta(X), 1 - \eta(X)\}\big] - E\big[\min\{\eta(X), 1 - \eta(X)\}^2\big] \\
&= C_* - \Big(\mathrm{Var}\big[\min\{\eta(X), 1 - \eta(X)\}\big] + E\big[\min\{\eta(X), 1 - \eta(X)\}\big]^2\Big) \\
&\leq C_* - C_*^2.
\end{aligned}$$
We therefore have that C ≤ 2C∗ (1 − C∗ ). In particular, when we have a lot of data,
the expected cost of the 1NN classifier is no worse than twice the optimal expected
cost, whatever the distribution of the data is.
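Continuing the Normal example above (again our own illustration, reusing the functions eta and fx defined there), a quick simulation suggests how the 1NN error for a large training sample compares with its asymptotic value E[2η(X){1 − η(X)}] and with the bound 2C∗(1 − C∗):

#1NN error vs. its asymptotic value and the bound 2C*(1-C*)
library(class)
set.seed(1)
sim <- function(n) {
  y <- sample(1:2, n, replace=TRUE)
  x <- rnorm(n, mean=ifelse(y==1, 0, 2))
  list(x=matrix(x,ncol=1), y=factor(y))
}
tr <- sim(5000); te <- sim(5000)
mean(knn(train=tr$x, test=te$x, cl=tr$y, k=1) != te$y)               # empirical 1NN error
integrate(function(x) 2*eta(x)*(1-eta(x)) * fx(x), -Inf, Inf)$value  # asymptotic 1NN risk
Cstar <- pnorm(-1); 2*Cstar*(1-Cstar)                                # upper bound 2C*(1-C*)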

Advantages of KNN
• Only requires distances, making it a very general algorithm,
• Can capture non-linear and non-monotonic patterns,
• Can detect interactions (complex combinations of variables associated with
one of the classes).

Disadvantages of KNN

• Requires choosing a distance metric,

• Sub-optimal to detect monotonic patterns,

• Requires a good choice for K.

6.6 Classification and Regression Trees (CART)


Classification trees provide a very different approach from the methods we have
seen so far. Following ideas similar to divisive clustering (which we will see later),
initially all individuals are in a single group, which forms the root of the tree. That
group is then split into two nodes, typically by setting a threshold on one of the
predictors. Both the predictor and the threshold are usually chosen so that the split
separates individuals from the two classes as much as possible; a simple version of
this criterion is sketched below. The resulting nodes are split iteratively, each time
using any adequate variable and threshold, until the nodes are pure, in the sense
that they mostly contain observations from a single class.
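To make the splitting criterion concrete, here is a small sketch of one common choice, the decrease in Gini impurity, together with a brute-force search over thresholds for a single predictor. This is a simplified illustration of the idea only; the rpart function used below implements a more refined version.

#Gini impurity and best single-predictor split (simplified sketch)
gini <- function(y) { p <- table(y)/length(y); 1 - sum(p^2) }

split.score <- function(y, pred, thr) {      # impurity decrease for the split 'pred < thr'
  left <- pred < thr
  gini(y) - (mean(left)*gini(y[left]) + mean(!left)*gini(y[!left]))
}

best.split <- function(y, pred) {            # try midpoints between consecutive unique values
  v <- sort(unique(pred))
  thr <- (head(v,-1) + tail(v,-1))/2
  sc <- sapply(thr, function(t) split.score(y, pred, t))
  c(threshold=thr[which.max(sc)], score=max(sc))
}

CART essentially repeats this search over every predictor at every node, picking the variable/threshold pair with the largest impurity decrease.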
Example 6.6.1. Figure 55 shows a tree classifying prostate cancer patients into
progressed / not progressed using multiple predictors (tumor grade, % of cells in G2
phase, ploidy, age, etc.). Patients with tumor grade < 2.5 are assigned to the 'not
progressed' group. For the remaining patients, the tree checks whether the value of
G2 is below 13.2, and then keeps checking variables until a terminal node is reached.

Figure 56 (left panel) shows the age and % of G2 cells for grade 3-4 patients, as
well as the progression / no progression status. The right panel shows the CART
predictions, where we added lines marking the classification regions. We see that
the space is partitioned in a discontinuous fashion. Note that patients with
G2 < 13.2 are also classified according to the variable ploidy, which explains the
presence of mixed predictions in that region.

#CART for Stage C data

library(rpart)
progstat <- factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
data = stagec, method ='class')
y <- progstat
ypred <- predict(cfit)[,2]>0.5

op <- par(xpd=NA)
plot(cfit)
text(cfit)
par(op)

Figure 55: Classification tree for Stage C data

Figure 56: Stage C data for Grade 3-4 patients and progression/no progression
status. Left: true status; Right: CART predictions.

op <- par(mfrow=c(1,2))
sel <- (stagec$grade>=2.5)
col <- ifelse(y=='Prog','black','blue')
pch <- ifelse(y=='Prog',16,17)
plot(stagec$g2[sel],stagec$age[sel],xlab='G2',ylab='Age',cex.lab=1,
col=col[sel],pch=pch[sel],main='Grades 3-4',cex.main=1, cex=.5)
#
sel <- (stagec$grade>=2.5)
col <- ifelse(ypred==1,'black','blue')
pch <- ifelse(ypred==1,16,17)
plot(stagec$g2[sel],stagec$age[sel],xlab='G2',ylab='Age',cex.lab=1,
col=col[sel],pch=pch[sel],main='Grades 3-4',cex.main=1, cex=.5)
abline(v=c(13.2,17.91),lwd=1)
segments(x0=17.91,x1=100,y0=62.5,lwd=1)
#plot(stagec$g2[!sel],stagec$age[!sel],xlab='G2',ylab='Age',
#cex.lab=1.25,col=col[!sel],pch=pch[!sel],main='Grades 1-2',cex.main=1.5)
par(op)

The main choices that one should consider carefully when building a classification
tree are how to select variables and set thresholds at each step, and how to define a
stopping criterion. There are many strategies for each of these choices, which are too
extensive to cover here; the sketch below shows the main controls that rpart exposes
for them.
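In rpart the growing and stopping behaviour is governed mainly by rpart.control (for instance minsplit, the minimum node size considered for splitting, and cp, the complexity parameter), and a grown tree can be pruned back with prune using the cross-validated error reported by printcp. A brief sketch of this common strategy, reusing the Stage C objects defined in Example 6.6.1 (one reasonable recipe, not the only one):

#grow a deliberately large tree, then prune it back using the cross-validated error
cfit.big <- rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
                  data=stagec, method='class',
                  control=rpart.control(minsplit=5, cp=0.001))
printcp(cfit.big)                        # CP table with cross-validated error (xerror)
cp.best <- cfit.big$cptable[which.min(cfit.big$cptable[,"xerror"]), "CP"]
cfit.pruned <- prune(cfit.big, cp=cp.best)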

Example 6.6.2. We applied CART to Anderson’s Iris data, obtaining the tree de-
picted in Figure 57. The tree simply predicts species versicolor when Petal.Width
< 1.75, and virginica otherwise.
Figure 58 shows the true and predicted species. In this example CART seems
to perform a bit worse than other classifiers. It is interesting to note that in these
data there seems to be a smooth transition between groups as we move through the
predictor space. Smooth transitions can be hard to capture with hard-thresholding
rules.

#CART for Iris data


library(rpart)
data(iris)
iris <- iris[iris$Species!='setosa',]
x <- as.matrix(iris[,1:4])
names(x) <- names(iris)[1:4]
y <- factor(iris$Species, labels=0:1)

fit <- rpart(y ~ x,method='class')

op <- par(xpd=NA)
plot(fit)
text(fit)
par(op)

ypred <- predict(fit)[,2]>0.5


table(y,ypred)

## ypred
## y FALSE TRUE
## 0 49 1
## 1 5 45

op <- par(mfrow=c(1,2))
col <- ifelse(y==1,'black','blue')
pch <- ifelse(y==1,16,17)
plot(x[,c('Petal.Width','Petal.Length')],pch=pch,col=col, cex=.6)
#
#
col <- ifelse(ypred==1,'black','blue')
pch <- ifelse(ypred==1,16,17)
plot(x[,c('Petal.Width','Petal.Length')],pch=pch,col=col, cex=.6)
par(op)

Figure 57: Classification tree for Iris data


Figure 58: Iris data. Left: true species; Right: CART predictions.

Advantages of CART

• Easy to interpret,

• Allows continuous and discrete predictors,

• Captures complex non-linear and non-monotonic patterns.

Limitations

• Requires many parameters, and can be sensitive to their choices,

• Sub-optimal to detect monotonic patterns (loss of information due to categorizing),

• Hard to detect interactions (e.g. divisions based on combinations of variables).

6.7 Logistic Regression Classification


Logistic regression is a model-based approach to classification that extends the ideas
of linear regression. Let Xi ∈ Rp be the vector of explanatory variables (predictors)
for individual i = 1, . . . , n, and Yi ∈ {0, 1} be the binary variable indicating the
group of individual i (the Yi are assumed to be conditionally independent given the
Xi). Notice that the groups are now labeled 0, 1 rather than 1, 2, and the
misclassification costs are denoted by c0, c1 > 0. This is to allow Yi | Xi to be modeled
as a Bernoulli random variable. Further, let pi = P(Yi = 1 | Xi) be the probability
of class 1, which is allowed to depend on Xi. As in linear regression, the idea is to
express the effect of Xi on the conditional expected value of Yi, namely pi = E[Yi | Xi],
using a linear combination of the elements of Xi. However, since pi ∈ (0, 1) we cannot
model pi as a linear function of Xi, as the latter can return values outside of (0, 1).
Instead, we define the log-odds

$$\eta_i = \log\left(\frac{p_i}{1 - p_i}\right). \qquad (6.7.1)$$

The trick is that ηi can take any value on the real line, and hence it is
reasonable to model ηi as a linear function of Xi. That is,

$$\eta_i = \beta_0 + X_i^T \beta = \beta_0 + \sum_{j=1}^{p} X_{ij} \beta_j,$$

where β ∈ Rp. This model is a particular instance of a generalized linear model
(GLM), with the logit link function (6.7.1).
Within this model the problem is reduced to estimating the parameters β0 and
β, which in turn produce an estimate for ηi and ultimately for pi. Solving for pi in
(6.7.1) gives

$$p_i = \frac{e^{\beta_0 + X_i^T \beta}}{1 + e^{\beta_0 + X_i^T \beta}}. \qquad (6.7.2)$$

Figure 59: P(Y = 1 | X) as given by logistic regression. Left: ηi = β0 + β1 X1i;
Right: ηi = β0 + β1 X1i + β2 X1i².

Figure 59 (left panel) shows pi as a function of a single predictor for various
values of β0 and β1; we obtain a monotonic sigmoid curve. As with linear
regression, the model is quite flexible in that we can include non-linear terms. For
instance, we can define X2 = X1² and include it as a new predictor in the equation.
Introducing such a quadratic term with β2 = 1 we obtain the curves shown in the
right panel of Figure 59: the relationship between pi and Xi is no longer monotonic.
These curves can be reproduced with a few lines of code, as sketched below.
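A minimal sketch reproducing the curves of Figure 59 (using the parameter values listed in its legend):

#reproduce the curves in Figure 59
xg <- seq(-4, 4, length.out=200)
plogit <- function(eta) 1/(1+exp(-eta))          # inverse logit, i.e. (6.7.2)

plot(xg, plogit(0 + 1*xg), type='l', ylim=c(0,1),
     xlab='X', ylab='P(Y=1 | X)')                # beta0=0, beta1=1
lines(xg, plogit(1 + 1*xg), lty=2)               # beta0=1, beta1=1
lines(xg, plogit(0 + 2*xg), lty=3)               # beta0=0, beta1=2
lines(xg, plogit(0 + 1*xg + 1*xg^2), col='blue') # adding beta2=1: no longer monotonic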
The main task is therefore to estimate β0 and β given a set of observed data.
Fortunately, there are efficient algorithms to find maximum likelihood estimates,
and asymptotic results (as n → ∞) indicating that these estimates are efficient
from a statistical point of view. We shall not discuss the algorithm here, but we will
note that it simply iterates weighted least squares steps, and that it is implemented
in the R function glm.

Algorithm 6.7.1 (Classification with logistic regression).

1. Obtain estimates β̂0, β̂.

2. For an individual with observed predictors x compute
$$\hat P(Y = 1 \mid x) = \left(1 + e^{-\hat\beta_0 - x^T \hat\beta}\right)^{-1}.$$

3. Let c0, c1 be the misclassification costs as in Proposition 6.1.3. Classify
as Y = 1 if
$$\frac{\hat P(Y = 1 \mid x)}{\hat P(Y = 0 \mid x)} > \frac{c_0}{c_1} \;\Longleftrightarrow\; x^T \hat\beta + \hat\beta_0 > \log\left(\frac{c_0}{c_1}\right),$$
and as Y = 0 otherwise.

The rule in Algorithm 6.7.1 is very similar to the optimal rule for multivariate
Normal data with equal covariances (Proposition 6.2.1), and by extension to Fisher's
LDA. The decision is based on a linear combination of the predictors (namely, xᵀβ̂),
with an intercept term and a term incorporating the misclassification costs.

In spite of these similarities, the rules are not identical. First, the linear combination
coefficients are now given by β̂, which is not a linear function of the training
data (as was the case for the multivariate Normal). Second, logistic regression allows
us to include quadratic or any other non-linear transformation of the predictors, and
we can even work with binary or categorical predictors (as we would do in linear
regression); for example:
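A quadratic term can be added through I() in the model formula, and a factor is expanded into dummy variables automatically. In the sketch below, y, x1, f and mydata are placeholder names, not objects from these notes:

#sketch: non-linear and categorical terms in a logistic regression formula
fit <- glm(y ~ x1 + I(x1^2) + f, data=mydata, family=binomial(link='logit'))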
Example 6.7.1. We look again at the Iris data, where we start by fitting a logistic
regression model using function glm. This function also fits Normal linear regression
models and a wider class of models called Generalized Linear Models. We indicate
that we wish to fit a logistic regression model with the argument family.

data(iris)
iris <- iris[iris$Species!='setosa',]
x <- as.matrix(iris[,1:4])
y <- ifelse(iris$Species=='virginica',1,0)

glm1 <- glm(y ~ x[,1]+x[,2]+x[,3]+x[,4], family=binomial(link='logit'))


summary(glm1)

##
## Call:
## glm(formula = y ~ x[, 1] + x[, 2] + x[, 3] + x[, 4],
## family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.01105 -0.00541 -0.00001 0.00677 1.78065
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -42.638 25.707 -1.659 0.0972 .
## x[, 1] -2.465 2.394 -1.030 0.3032
## x[, 2] -6.681 4.480 -1.491 0.1359
## x[, 3] 9.429 4.737 1.991 0.0465 *
## x[, 4] 18.286 9.743 1.877 0.0605 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 11.899 on 95 degrees of freedom
## AIC: 21.899
##
## Number of Fisher Scoring iterations: 10

The glm output reveals that we may not need all four predictors (which makes
sense, as we know they are highly correlated with each other). We decide to drop
the predictor with the largest p-value and re-fit the model.

glm2 <- glm(y ~ x[,2]+x[,3]+x[,4], family=binomial(link='logit'))


summary(glm2)

##
## Call:
## glm(formula = y ~ x[, 2] + x[, 3] + x[, 4], family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.75795 -0.00412 0.00000 0.00290 1.92193
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -50.527 23.995 -2.106 0.0352 *
## x[, 2] -8.376 4.761 -1.759 0.0785 .
## x[, 3] 7.875 3.841 2.050 0.0403 *
## x[, 4] 21.430 10.707 2.001 0.0453 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 13.266 on 96 degrees of freedom
## AIC: 21.266
##
## Number of Fisher Scoring iterations: 10

We keep the three remaining predictors, as all of them show at least marginal
evidence that their coefficients differ from 0 (even though not all of them are
strictly significant at the 0.05 level).

Based on this output, and assuming equal misclassification costs c0 = c1, we
compute the probability of each class and assign each individual to the most likely
class.

b0 <- coef(glm2)[1]
b1 <- matrix(coef(glm2)[-1],ncol=1)
p <- 1/(1+exp(-b0 -x[,2:4] %*% b1))
ypred <- p>0.5

table(y,ypred)

## ypred
## y FALSE TRUE
## 0 48 2
## 1 1 49

op <- par(mfrow=c(1,2))
col <- ifelse(ypred==1,'black','blue')
pch <- ifelse(ypred==1,16,17)
plot(prcomp(x)$x[,1:2],pch=pch,col=col, cex=.7)
#
col <- ifelse(y==1,'black','blue')
pch <- ifelse(y==1,16,17)
plot(prcomp(x)$x[,1:2],pch=pch,col=col, cex=.7)
par(op)

The confusion matrix in the observed data is similar to those for the multivariate
Normal and Fisher’s LDA rules. As discussed before, it would be better to obtain
the confusion matrix with cross-validation, but we do not do that here. Figure 60
displays the first two principal components and the predicted classes.
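If the misclassification costs were unequal, the only change to the code above would be the threshold on the estimated probability: the rule in Algorithm 6.7.1 classifies as Y = 1 when p > c0/(c0 + c1). A sketch with made-up costs, reusing p and y from the chunk above:

#unequal misclassification costs: classify as Y=1 when p > c0/(c0+c1)
c0 <- 2; c1 <- 1                  # illustrative cost values, as in Algorithm 6.7.1
ypred.cost <- p > c0/(c0+c1)
table(y, ypred.cost)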

Advantages of logistic regression

• Can detect variables with no predictive power (βj = 0),

• Can combine discrete and continuous predictors, and their non-linear trans-
formations,

• No distributional assumptions on predictors X.

Figure 60: Principal component plot for the Iris data (Virginica/Versicolor). Left:
logistic regression predictions; Right: true species.
