Nonparametric Classification
10/36-702
1 Introduction
Let us recall a few definitions and facts. The classification risk, or error rate, of h is
$$R(h) = P(Y \neq h(X)) \tag{1}$$
and the empirical error rate or training error rate based on training data $(X_1, Y_1), \ldots, (X_n, Y_n)$ is
$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n I(h(X_i) \neq Y_i). \tag{2}$$
R(h) is minimized by the Bayes' rule
$$h^*(x) = \begin{cases} 1 & \text{if } m(x) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > \frac{(1-\pi)}{\pi} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
where $m(x) = P(Y = 1 \mid X = x)$, $p_j(x) = p(x \mid Y = j)$ and $\pi = P(Y = 1)$. The excess risk of a classifier h is $R(h) - R(h^*)$.

In the multiclass case where $Y \in \{1, \ldots, K\}$, the Bayes rule is $h^*(x) = \operatorname{argmax}_j m_j(x) = \operatorname{argmax}_j \pi_j p_j(x)$, where $m_j(x) = P(Y = j \mid X = x)$, $\pi_j = P(Y = j)$ and $p_j(x) = p(x \mid Y = j)$.
2 Plugin Methods
A natural approach is to estimate the regression function m and plug the estimate $\hat{m}$ into the Bayes rule:
$$\hat{h}(x) = \begin{cases} 1 & \text{if } \hat{m}(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$
For example, we could use the kernel regression estimator
$$\hat{m}_h(x) = \frac{\sum_{i=1}^n Y_i K\!\left(\frac{\|x - X_i\|}{h}\right)}{\sum_{i=1}^n K\!\left(\frac{\|x - X_i\|}{h}\right)}.$$
However, the bandwidth should be optimized for classification error as described in Section 8.
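As a concrete illustration, here is a minimal sketch of the plug-in rule (4) built from a kernel regression estimate of m; the Gaussian kernel and the function name are illustrative choices.

```python
import numpy as np

def kernel_plugin_classifier(X_train, y_train, x, h):
    """Plug-in classifier (4) built from a Nadaraya-Watson regression estimate.

    X_train: (n, d) covariates, y_train: (n,) labels in {0, 1},
    x: (d,) query point, h: bandwidth (should be tuned for classification
    error, e.g. by data splitting as in Section 8).
    """
    # Gaussian kernel applied to the distances ||x - X_i||
    dists = np.linalg.norm(X_train - x, axis=1)
    weights = np.exp(-0.5 * (dists / h) ** 2)
    m_hat = np.sum(weights * y_train) / np.sum(weights)  # kernel estimate of m(x)
    return int(m_hat > 0.5)                              # plug-in rule 1{m_hat(x) > 1/2}
```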
Theorem 1 Let $\hat{h}$ be the plug-in classifier based on $\hat{m}$. Then,
$$R(\hat{h}) - R(h^*) \le 2\int |\hat{m}(x) - m(x)| \, dP(x) \le 2\sqrt{\int |\hat{m}(x) - m(x)|^2 \, dP(x)}. \tag{5}$$
An immediate consequence of this theorem is that any result about nonparametric regression can be turned into a result about nonparametric classification. For example, if $\int |\hat{m}(x) - m(x)|^2 \, dP(x) = O_P(n^{-2\beta/(2\beta+d)})$ then $R(\hat{h}) - R(h^*) = O_P(n^{-\beta/(2\beta+d)})$. However, (5) is an upper bound and it is possible that $R(\hat{h}) - R(h^*)$ is strictly smaller than $\sqrt{\int |\hat{m}(x) - m(x)|^2 \, dP(x)}$.
In the multiclass case the plug-in classifier is
$$\hat{h}(x) = \operatorname{argmax}_j \hat{m}_j(x)$$
where $\hat{m}_j(x)$ is an estimate of $P(Y = j \mid X = x)$.
We can apply nonparametric density estimation to each class to get estimators $\hat{p}_0$ and $\hat{p}_1$. Then we define
$$\hat{h}(x) = \begin{cases} 1 & \text{if } \frac{\hat{p}_1(x)}{\hat{p}_0(x)} > \frac{(1 - \hat{\pi})}{\hat{\pi}} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$
where $\hat{\pi} = n^{-1}\sum_{i=1}^n Y_i$. Hence, any nonparametric density estimation method yields a nonparametric classifier.
A simplification occurs if we assume that the covariate has independent coordinates, conditioned on the class variable Y. Thus, if $X_i = (X_{i1}, \ldots, X_{id})^T$ has dimension d and if we assume conditional independence, then the density factors as $p_j(x) = \prod_{\ell=1}^d p_{j\ell}(x_\ell)$. In this case we can estimate the one-dimensional marginals $p_{j\ell}(x_\ell)$ separately and then define $\hat{p}_j(x) = \prod_{\ell=1}^d \hat{p}_{j\ell}(x_\ell)$. This has the advantage that we never have to do more than a one-dimensional density estimate. This approach is called naive Bayes. The resulting classifier can sometimes be very accurate even if the independence assumption is false.
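A minimal sketch of such a naive Bayes classifier, assuming one-dimensional Gaussian kernel density estimates for each coordinate (via SciPy's gaussian_kde); the function name and the use of log densities are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def naive_bayes_kde(X_train, y_train, x):
    """Naive Bayes with one-dimensional kernel density estimates per coordinate.

    X_train: (n, d) covariates, y_train: (n,) labels in {0, 1}, x: (d,) query point.
    Returns the class maximizing pi_hat_j * prod_l p_hat_{jl}(x_l).
    """
    scores = {}
    for j in (0, 1):
        Xj = X_train[y_train == j]
        pi_j = Xj.shape[0] / X_train.shape[0]                  # class proportion
        log_dens = sum(np.log(gaussian_kde(Xj[:, l])(x[l])[0])
                       for l in range(X_train.shape[1]))       # sum of log 1-d marginals
        scores[j] = np.log(pi_j) + log_dens
    return max(scores, key=scores.get)
```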
It is easy to extend density based methods to multiclass problems. If $Y \in \{1, \ldots, k\}$ then we estimate the k densities $p_j(x) = p(x \mid Y = j)$ by $\hat{p}_j$ and the classifier is
$$\hat{h}(x) = \operatorname{argmax}_j \hat{\pi}_j \hat{p}_j(x)$$
where $\hat{\pi}_j = n^{-1}\sum_{i=1}^n I(Y_i = j)$.
4 Nearest Neighbors
The k-nearest neighbor classifier can be recast as a plug-in rule. Define the regression estimator
$$\hat{m}(x) = \frac{\sum_{i=1}^n Y_i I(\|X_i - x\| \le d_k(x))}{\sum_{i=1}^n I(\|X_i - x\| \le d_k(x))}$$
where $d_k(x)$ is the distance between x and its $k$th nearest neighbor. Then $\hat{h}(x) = I(\hat{m}(x) > 1/2)$.
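For concreteness, a minimal sketch of the k-nearest neighbor rule written as this plug-in estimator; the function name is illustrative.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k):
    """k-nearest neighbor classifier as the plug-in rule I(m_hat(x) > 1/2).

    X_train: (n, d) covariates, y_train: (n,) labels in {0, 1},
    x: (d,) query point, k: number of neighbors.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]      # indices of the k closest training points
    m_hat = y_train[nearest].mean()      # local average of the labels
    return int(m_hat > 0.5)
```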
It is interesting to consider the classification error when n is large. First suppose that k = 1 and consider a fixed x. Then $\hat{h}(x)$ is 1 if the closest $X_i$ has label Y = 1 and $\hat{h}(x)$ is 0 if the closest $X_i$ has label Y = 0. When n is large, the closest $X_i$ is approximately equal to x. So the probability of an error is approximately
$$m(x)(1 - m(x)) + (1 - m(x))m(x) = 2m(x)(1 - m(x)).$$
Define
$$L_n = P(Y \neq \hat{h}(X) \mid D_n)$$
where $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. Then we have that
$$\lim_{n \to \infty} E(L_n) = E\big[2m(X)(1 - m(X))\big].$$
The Bayes risk can be written as $R^* = E(A)$ where $A = \min\{m(X), 1 - m(X)\}$. Note that $A \le 2m(X)(1 - m(X))$ and that $2m(X)(1 - m(X)) = 2A(1 - A)$. Also, by direct integration, $E(A(1 - A)) \le E(A)E(1 - A)$. Hence, we have the well-known result due to Cover and Hart (1967),
$$R^* \le \lim_{n \to \infty} E(L_n) \le 2R^*(1 - R^*) \le 2R^*.$$
Thus, for any problem with small Bayes error, k = 1 nearest neighbors should have small error.
A similar analysis applies for general fixed k: as $n \to \infty$ the expected error rate of the k-nearest neighbor classifier tends to R(k), where
$$R(k) = E\left(\sum_{j=0}^{k} \binom{k}{j} m^j(X)(1 - m(X))^{k-j}\Big[m(X)\, I(j < k/2) + (1 - m(X))\, I(j > k/2)\Big]\right).$$
Theorem 3 (Devroye and Györfi 1985) Suppose that the distribution of X has a density and that $k \to \infty$ and $k/n \to 0$. For every $\epsilon > 0$ the following is true. For all large n,
$$P\left(R(\hat{h}_n) - R^* > \epsilon\right) \le e^{-n\epsilon^2/(72\gamma_d^2)}$$
where $\hat{h}_n$ is the k-nearest neighbor classifier estimated on a sample of size n, and where $\gamma_d$ depends on the dimension d of X.
Recently, Chaudhuri and Dasgupta (2014) have obtained some very general results about k-nn classifiers. We state one of their key results here. Suppose that
$$P\left(|m(X) - 1/2| \le t\right) \le C t^{\beta}$$
for some β ≥ 0 and some C > 0. Also, suppose that m satisfies the following smoothness condition: for all x and r > 0,
$$|m(B(x, r)) - m(x)| \le L\, P(B(x, r))^{\alpha}$$
where $m(B)$ denotes the average of m over the ball B.
[Figure 1: a classification tree built from Age and Blood Pressure. Figure 2: the same classifier shown as a partition of the covariate space, with axes Age and Blood Pressure.]
If we use a partition (histogram) estimator of m with binwidth b then, from (5), we conclude that $R(\hat{h}) - R(h^*) = O(n^{-1/(d+2)})$. However, this binwidth was based on the bias-variance tradeoff of the regression problem. For classification, b should be chosen as described in Section 8.
Like regression trees, classification trees are partition classifiers where the partition is built
recursively. For illustration, suppose there are two covariates, X1 = age and X2 = blood
pressure. Figure 1 shows a classification tree using these variables.
The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as
Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure
is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 2 shows
the same classifier as a partition of the covariate space.
Here is how a tree is constructed. First, suppose that $y \in \mathcal{Y} = \{0, 1\}$ and that there is only a single covariate X. We choose a split point t that divides the real line into two sets $A_1 = (-\infty, t]$ and $A_2 = (t, \infty)$. Let $r_s(j)$ be the proportion of observations in $A_s$ such that $Y_i = j$:
$$r_s(j) = \frac{\sum_{i=1}^n I(Y_i = j,\; X_i \in A_s)}{\sum_{i=1}^n I(X_i \in A_s)} \tag{13}$$
for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be $I(t) = \sum_{s=1}^2 \gamma_s$ where
$$\gamma_s = 1 - \sum_{j=0}^1 r_s(j)^2. \tag{14}$$
This particular measure of impurity is known as the Gini index. If a partition element As
contains all 0’s or all 1’s, then γs = 0. Otherwise, γs > 0. We choose the split point t to
minimize the impurity. Other indices of impurity besides the Gini index can be used, such as entropy. The reason for using impurity rather than classification error is that impurity is a smooth function and hence is easy to minimize.
When there are several covariates, we choose whichever covariate and split that leads to the
lowest impurity. This process is continued until some stopping criterion is met. For example,
we might stop when every partition element has fewer than n0 data points, where n0 is some
fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0
or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition
element.
This procedure is easily generalized to the case where $Y \in \{1, \ldots, K\}$. We define the impurity by
$$\gamma_s = 1 - \sum_{j=1}^{K} r_s^2(j) \tag{15}$$
where $r_s(j)$ is the proportion of observations in the partition element for which Y = j.
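As an illustration of the splitting step, here is a minimal sketch that searches a single covariate for the split point minimizing the Gini impurity $I(t) = \gamma_1 + \gamma_2$; the exhaustive search over observed values is an illustrative implementation choice.

```python
import numpy as np

def best_split(x, y):
    """Find the split point t on one covariate minimizing Gini impurity.

    x: (n,) covariate values, y: (n,) labels in {0, 1}.
    Returns (t, impurity) for the best split A1 = (-inf, t], A2 = (t, inf).
    """
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.mean(labels)                  # proportion of 1's in the partition element
        return 1.0 - p**2 - (1.0 - p)**2     # gamma_s = 1 - sum_j r_s(j)^2
    best_t, best_impurity = None, np.inf
    for t in np.unique(x)[:-1]:              # candidate splits between observed values
        left, right = y[x <= t], y[x > t]
        impurity = gini(left) + gini(right)  # I(t) = gamma_1 + gamma_2
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity
```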
5 Minimax Results
Let $\mathcal{P}$ be a class of distributions for (X, Y). The minimax excess risk is
$$R_n(\mathcal{P}) = \inf_{\hat{h}} \sup_{P \in \mathcal{P}} \left( E\big[R(\hat{h})\big] - R_n^* \right)$$
where $R(\hat{h}) = P(Y \neq \hat{h}(X))$, $R_n^*$ is the Bayes error and the infimum is over all classifiers constructed from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Recall that
$$R(\hat{h}) - R(h^*) \le 2\sqrt{\int |\hat{m}(x) - m(x)|^2 \, dP(x)}.$$
  Class            Rate                       Condition
  E(α)             $n^{-\alpha/(2\alpha+d)}$  α > 1/2
  BV               $n^{-1/3}$
  MI               $\sqrt{\log n / n}$
  L(α, q)          $n^{-\alpha/(2\alpha+1)}$  α > (1/q − 1/2)₊
  $B^{\alpha}_{\sigma,q}$   $n^{-\alpha/(2\alpha+d)}$  α/d > 1/q − 1/2
  Neural nets      see text

Table 1: Minimax rates of convergence.
However, with smaller classes that invoke extra assumptions, such as the Tsybakov noise
condition, there can be a dramatic difference. Here, we summarize Yang’s results under
the richness assumption. This assumption is simply that if m is in the class, then a small
hypercube containing m is also in the class. Yang’s results are summarized in Table 1.
The classes in Table 1 are the following: E(α) is the Sobolev space of order α, BV is the class of functions of bounded variation, MI is all monotone functions, L(α, q) are α-Lipschitz (in q-norm), and $B^{\alpha}_{\sigma,q}$ are Besov spaces. For neural nets we have the bound, for every $\epsilon > 0$,
$$\left(\frac{1}{n}\right)^{\frac{1+(2/d)}{4+(4/d)}} \le R_n(\mathcal{P}) \le \left(\frac{\log n}{n}\right)^{\frac{1+(1/d)}{4+(2/d)} + \epsilon}.$$
It appears that, as $d \to \infty$, we get the dimension independent rate $(\log n/n)^{1/4}$. However, this result requires some caution since the class of distributions implicitly gets smaller as d increases.
We can do a nonparametric version by letting H lie in an RKHS and taking the penalty to be $\|H\|_K^2$. In terms of implementation, this means replacing every instance of an inner product $\langle X_i, X_j \rangle$ with $K(X_i, X_j)$.
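For example, a Gaussian-kernel SVM can be fit with an off-the-shelf implementation such as scikit-learn's SVC, which carries out exactly this replacement of inner products by kernel evaluations; the data and the parameter values below are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two classes in R^2 (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)

# Gaussian-kernel SVM: every inner product <Xi, Xj> in the dual problem
# is replaced by K(Xi, Xj) = exp(-gamma * ||Xi - Xj||^2).
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(clf.predict(X[:5]))
```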
7 Boosting
Boosting refers to a class of methods that build classifiers in a greedy, iterative way. The
original boosting algorithm is called AdaBoost and is due to Freund and Schapire (1996).
See Figure 3.
The algorithm seems mysterious and there is quite a bit of controversy about why (and when) it works. Perhaps the most compelling explanation is due to Friedman, Hastie and Tibshirani (2000), which is the explanation we will give. However, the reader is warned that there is no consensus on the issue. Further discussions can be found in Bühlmann and Hothorn (2007), Zhang and Yu (2005) and Mease and Wyner (2008). The latter paper is followed by a spirited discussion from several authors. Our view is that boosting combines two distinct ideas: surrogate loss functions and greedy function approximation.
In this section, we assume that Yi ∈ {−1, +1}. Many classifiers then have the form
h(x) = sign(H(x))
for some function H(x). For example, a linear classifier corresponds to $H(x) = \beta^T x$. The risk can then be written as
$$R(h) = P(Y \neq h(X)) = P(Y H(X) < 0) = E(L(A))$$
where $A = Y H(X)$ and $L(a) = I(a < 0)$. As a function of a, the loss L(a) is discontinuous, which makes it difficult to work with. Friedman, Hastie and Tibshirani (2000) show that
AdaBoost corresponds to using a surrogate loss, namely, $L(a) = e^{-a} = e^{-yH(x)}$. Consider finding a classifier of the form $\sum_m \alpha_m h_m(x)$ by minimizing the exponential loss $\sum_i e^{-Y_i H(X_i)}$. If we do this iteratively, adding one function at a time, this leads precisely to AdaBoost. Typically, the classifiers $h_m$ in the sum $\sum_m \alpha_m h_m(x)$ are taken to be very simple classifiers such as small classification trees.
The argument in Friedman, Hastie and Tibshirani (2000) is as follows. Consider minimizing the expected loss $J(F) = E(e^{-Y F(X)})$. Suppose our current estimate is F and consider updating to an improved estimate F(x) + c f(x). Expanding around f(x) = 0,
$$J(F + cf) = E\left(e^{-Y(F(X) + c f(X))}\right) \approx E\left(e^{-Y F(X)}\big(1 - cY f(X) + c^2 Y^2 f^2(X)/2\big)\right) = E\left(e^{-Y F(X)}\big(1 - cY f(X) + c^2/2\big)\right)$$
since $Y^2 = f^2(X) = 1$. Now consider minimizing the latter expression at a fixed X = x. If we minimize over $f(x) \in \{-1, +1\}$ we get f(x) = 1 if $E_w(y \mid x) > 0$ and f(x) = -1 if $E_w(y \mid x) < 0$.
1. Input: $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $Y_i \in \{-1, +1\}$.

2. Set the initial weights $w_i = 1/n$, $i = 1, \ldots, n$.

3. Repeat for m = 1, ..., M:

   (a) Compute the weighted error $\epsilon(h) = \sum_{i=1}^n w_i I(Y_i \neq h(X_i))$ and find $h_m$ to minimize $\epsilon(h)$.

   (b) Let $\alpha_m = (1/2)\log((1 - \epsilon)/\epsilon)$.

   (c) Update the weights:
   $$w_i \leftarrow \frac{w_i\, e^{-\alpha_m Y_i h_m(X_i)}}{Z}$$
   where Z is chosen so that the weights sum to 1.

4. The final classifier is $\hat{h}(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m h_m(x)\right)$.

Figure 3: AdaBoost
Here $E_w(y \mid x) = E(w(x, y)\, y \mid x)/E(w(x, y) \mid x)$ and $w(x, y) = e^{-yF(x)}$. In other words, the optimal f is simply the Bayes classifier with respect to the weights. This is exactly the first step in AdaBoost. If we now fix f(x) and minimize over c we get
$$c = \frac{1}{2}\log\left(\frac{1 - \epsilon}{\epsilon}\right)$$
where $\epsilon = E_w(I(Y \neq f(x)))$. Thus the updated F(x) is $F(x) + c f(x)$, which is exactly the AdaBoost update.
Seen in this light, boosting really combines two ideas. The first is the use of surrogate loss
functions. The second is greedy function approximation.
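A minimal sketch of the AdaBoost algorithm of Figure 3, assuming decision stumps as the base classifiers $h_m$; the stump search and the small cap on the weighted error are illustrative implementation choices.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively find the decision stump (feature, threshold, sign) with smallest weighted error."""
    best = (None, None, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= t, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(X, y, M=50):
    """AdaBoost as in Figure 3; y takes values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 2: initial weights
    stumps, alphas = [], []
    for _ in range(M):                           # step 3
        j, t, s, err = fit_stump(X, y, w)        # (a) weighted-error minimizer
        err = max(err, 1e-12)                    # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # (b)
        pred = s * np.where(X[:, j] <= t, 1, -1)
        w = w * np.exp(-alpha * y * pred)        # (c) reweight and renormalize
        w /= w.sum()
        stumps.append((j, t, s))
        alphas.append(alpha)
    def predict(Xnew):
        H = sum(a * s * np.where(Xnew[:, j] <= t, 1, -1)
                for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(H)                        # step 4: sign of the weighted vote
    return predict
```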
All the nonparametric methods involve tuning parameters, for example, the number of neigh-
bors k in nearest neighbors. As with density estimation and regression, these parameters
can be chosen by a variety of cross-validation methods. Here we describe the data splitting
version of cross-validation. Suppose the data are (X1 , Y1 ), . . . , (X2n , Y2n ). Now randomly
split the data into two halves that we denote by
$$\mathcal{D} = \Big\{(X_1, Y_1), \ldots, (X_n, Y_n)\Big\} \quad \text{and} \quad \mathcal{E} = \Big\{(X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)\Big\}.$$
Construct a finite set of classifiers $\mathcal{H} = \{h_1, \ldots, h_N\}$ from the first half $\mathcal{D}$ (for example, one classifier for each value of the tuning parameter) and let $\hat{R}(h)$ denote the error rate of h computed on the second half $\mathcal{E}$. Let $\hat{h} = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}(h)$.
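A minimal sketch of this data-splitting procedure for choosing k in k-nearest neighbors; the helper knn_fit_predict is assumed to be any routine that fits on one half and predicts on the other (for example, the k-NN sketch given earlier applied row by row).

```python
import numpy as np

def choose_k_by_splitting(X, y, candidate_ks, knn_fit_predict, rng=None):
    """Choose the number of neighbors k by data splitting.

    X: (2n, d) covariates, y: (2n,) labels. Half the data (D) is used to build
    one k-NN classifier per candidate k; the other half (E) estimates each
    classifier's error rate, and the k with the smallest estimate is returned.
    knn_fit_predict(X_train, y_train, X_test, k) must return an array of predictions.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    D, E = idx[:half], idx[half:]
    errors = {}
    for k in candidate_ks:
        preds = knn_fit_predict(X[D], y[D], X[E], k)
        errors[k] = np.mean(preds != y[E])       # empirical error rate on E
    return min(errors, key=errors.get)
```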
We then have the following result. Let $N = |\mathcal{H}|$ and let $\hat{h}^* = \operatorname{argmin}_{h \in \mathcal{H}} R(h)$. Then, with probability at least $1 - \delta$,
$$R(\hat{h}) \le R(\hat{h}^*) + 2\sqrt{\frac{1}{2n}\log\left(\frac{2N}{\delta}\right)}.$$
Proof. By Hoeffding's inequality, $P(|\hat{R}(h) - R(h)| > \epsilon) \le 2e^{-2n\epsilon^2}$, for each $h \in \mathcal{H}$. By the union bound,
$$P\left(\max_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| > \epsilon\right) \le 2N e^{-2n\epsilon^2} = \delta$$
where $\epsilon = \sqrt{\frac{1}{2n}\log\left(\frac{2N}{\delta}\right)}$. Hence, except on a set of probability at most δ,
$$R(\hat{h}) \le \hat{R}(\hat{h}) + \epsilon \le \hat{R}(\hat{h}^*) + \epsilon \le R(\hat{h}^*) + 2\epsilon.$$
Note that the difference between $R(\hat{h})$ and $R(\hat{h}^*)$ is $O(\sqrt{\log N / n})$ but in regression it was $O(\log N / n)$, which is an interesting difference between the two settings. Under low noise conditions, the error can be improved.
9 Example
The following data are from simulated images of gamma ray events for the Major Atmo-
spheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The
data are from archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The telescope
studies gamma ray bursts, active galactic nuclei and supernovae remnants. The goal is to
predict if an event is real or is background (hadronic shower). There are 11 predictors that
are numerical summaries of the images. We randomly selected 400 training points (200 pos-
itive and 200 negative) and 1000 test cases (500 positive and 500 negative). The results of
various methods are in Table 2. See Figures 4, 5, 6, 7.
For high dimensional problems we can use sparsity-based methods. The nonparametric additive logistic model is
$$P(Y = 1 \mid X) \equiv p(X; f) = \frac{\exp\left(\sum_{j=1}^p f_j(X_j)\right)}{1 + \exp\left(\sum_{j=1}^p f_j(X_j)\right)} \tag{17}$$
Method Test Error
Logistic regression 0.23
SVM (Gaussian Kernel) 0.20
Kernel Regression 0.24
Additive Model 0.20
Reduced Additive Model 0.20
11-NN 0.25
Trees 0.20
Table 2: Various methods on the MAGIC data. The reduced additive model is based on
using the three most significant variables from the additive model.
[Figure: test error (roughly 0.25 to 0.29) plotted against values from 0 to 50; only the axis values are recoverable.]
Figure 7: Classification tree, with splits on the variables xtrain.V2, xtrain.V3, xtrain.V4, xtrain.V6, xtrain.V7, xtrain.V8, xtrain.V9 and xtrain.V10. The size of the tree was chosen by cross-validation.
where $f(X) = \sum_{j=1}^p f_j(X_j)$. To fit this model, the local scoring algorithm runs the backfitting procedure within Newton's method. One iteratively computes the transformed response for the current estimate $\hat{f}$,
$$Z_i = \hat{f}(X_i) + \frac{Y_i - p(X_i; \hat{f})}{p(X_i; \hat{f})(1 - p(X_i; \hat{f}))} \tag{19}$$
and weights $w(X_i) = p(X_i; \hat{f})(1 - p(X_i; \hat{f}))$, and carries out a weighted backfitting of (Z, X) with weights w. The weighted smooth is given by
$$\hat{P}_j = \frac{S_j(w R_j)}{S_j w} \tag{20}$$
where $S_j$ is a linear smoothing matrix, such as a kernel smoother, and $R_j$ is the partial residual for the $j$th component. This extends iteratively reweighted least squares to the nonparametric setting.
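A minimal sketch of one local scoring iteration for the additive logistic model, assuming a weighted Nadaraya-Watson smoother plays the role of each $S_j$; the bandwidth and function names are illustrative.

```python
import numpy as np

def nw_smooth(x, values, weights, h=0.5):
    """Weighted Nadaraya-Watson smoother: the smooth of `values` evaluated at each x."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)   # n x n kernel matrix
    return (K @ (weights * values)) / (K @ weights)           # S_j(w v) / (S_j w)

def local_scoring_step(X, y, F):
    """One local scoring (Newton + backfitting) sweep for the additive logistic model.

    X: (n, p) covariates, y: (n,) labels in {0, 1},
    F: (n, p) current fitted component values f_j(X_ij). Returns updated F.
    """
    f = F.sum(axis=1)                        # current additive fit f(X_i)
    p_hat = 1.0 / (1.0 + np.exp(-f))         # p(X_i; f)
    w = p_hat * (1 - p_hat)                  # weights
    Z = f + (y - p_hat) / w                  # transformed response (19)
    for j in range(X.shape[1]):              # weighted backfitting sweep
        R_j = Z - (F.sum(axis=1) - F[:, j])      # partial residual for component j
        F[:, j] = nw_smooth(X[:, j], R_j, w)     # weighted smooth (20)
    return F
```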
A sparsity penalty can be incorporated, just as for sparse additive models (SpAM) for regression. The Lagrangian is given by
$$\mathcal{L}(f, \lambda) = E\left[\log\left(1 + e^{f(X)}\right) - Y f(X)\right] + \lambda\left(\sum_{j=1}^p \sqrt{E(f_j^2(X_j))} - L\right) \tag{21}$$
The stationary condition for (21) is nonlinear in f, and so we linearize the gradient of the log-likelihood around $\hat{f}$. This yields the linearized condition $E[w(X)(f(X) - Z) \mid X_j] + \lambda v_j = 0$. To see this, note that
$$0 = E\left[p(X; \hat{f}) - Y + p(X; \hat{f})(1 - p(X; \hat{f}))(f(X) - \hat{f}(X)) \,\Big|\, X_j\right] + \lambda v_j \tag{22}$$
$$= E\left[w(X)(f(X) - Z) \mid X_j\right] + \lambda v_j. \tag{23}$$
When $E(f_j^2) \neq 0$, this implies the condition
$$\left(E(w \mid X_j) + \frac{\lambda}{\sqrt{E(f_j^2)}}\right) f_j(X_j) = E(w R_j \mid X_j). \tag{24}$$
In the finite sample case, in terms of the smoothing matrix $S_j$, this becomes
$$f_j = \frac{S_j(w R_j)}{S_j w + \lambda\big/\sqrt{E(f_j^2)}}. \tag{25}$$
If $\|S_j(w R_j)\| < \lambda$, then $f_j = 0$. Otherwise, this implicit, nonlinear equation for $f_j$ cannot be solved explicitly, so one simply iterates until convergence:
$$f_j \leftarrow \frac{S_j(w R_j)}{S_j w + \lambda\sqrt{n}\big/\|f_j\|}. \tag{26}$$
When $\lambda = 0$, this yields the standard local scoring update (20).
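A minimal sketch of this thresholded component update, assuming $S_j$ is available as an explicit smoother matrix and the current $f_j$ is nonzero; the function name and iteration count are illustrative.

```python
import numpy as np

def spam_component_update(Sj, w, Rj, fj, lam, n_iter=50):
    """Sparse update (25)-(26) for one component f_j of logistic SpAM.

    Sj: (n, n) linear smoother matrix, w: (n,) weights, Rj: (n,) partial residuals,
    fj: (n,) current component values (assumed nonzero), lam: penalty level.
    """
    numer = Sj @ (w * Rj)                    # S_j(w R_j)
    if np.linalg.norm(numer) < lam:          # thresholding: the component is zeroed out
        return np.zeros_like(fj)
    n = len(fj)
    for _ in range(n_iter):                  # fixed-point iteration for the implicit equation (26)
        fj = numer / (Sj @ w + lam * np.sqrt(n) / np.linalg.norm(fj))
    return fj
```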
Example 6 (SpAM for Spam) Here we consider an email spam classification problem,
using the logistic SpAM backfitting algorithm above. This dataset has been studied by Hastie et al. (2001), using a set of 3,065 emails as a training set and conducting hypothesis tests to
choose significant variables; there are a total of 4,601 observations with p = 57 attributes, all
numeric. The attributes measure the percentage of specific words or characters in the email,
the average and maximum run lengths of upper case letters, and the total number of such
letters.
The results of a typical run of logistic SpAM are summarized in Figure 8, using plug-in
bandwidths. A held-out set is used to tune the regularization parameter λ.
Suppose we draw B bootstrap samples and each time we construct a classifier. This gives classifiers $h_1, \ldots, h_B$. We now classify by combining them:
$$h(x) = \begin{cases} 1 & \text{if } \frac{1}{B}\sum_j h_j(x) \ge \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$
  $\lambda\ (\times 10^{-3})$   Error         # zeros   selected variables
  5.5                           0.2009        55        {8, 54}
  4.5                           0.1354        46        {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}
  4.0                           0.1083 (√)    20        {4, 6–10, 14–22, 26, 27, 38, 53–58}

Figure 8: (Email spam) Classification accuracies and variable selection for logistic SpAM.
This is called bagging, which stands for bootstrap aggregation. The base classifiers are usually trees.
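A minimal sketch of bagging, assuming classification trees (here scikit-learn's DecisionTreeClassifier) as the base classifiers; B and the tree settings are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, rng=None):
    """Bagging: fit one tree per bootstrap sample and classify by majority vote."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample (with replacement)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    def predict(Xnew):
        votes = np.mean([t.predict(Xnew) for t in trees], axis=0)
        return (votes >= 0.5).astype(int)                 # majority vote, labels in {0, 1}
    return predict
```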
A variation is to choose a random subset of the predictors to split on at each stage. The resulting classifier is called a random forest. Random forests often perform very well, although their theoretical performance is not well understood. Some good references are:
Biau, Devroye and Lugosi (2008). Consistency of Random Forests and Other Averaging Classifiers. JMLR.
Lin and Jeon (2006). Random Forests and Adaptive Nearest Neighbors. Journal of the American Statistical Association, 101, 578.
Wager, S. (2014). Asymptotic Theory for Random Forests. arXiv:1405.0352.
Wager, S. (2015). Uniform Convergence of Random Forests via Adaptive Concentration. arXiv:1503.06388.
Now we consider the multiclass version. Suppose we have the nonparametric K-class logistic regression model
$$p_f(Y = \ell \mid X) = \frac{e^{f_\ell(X)}}{\sum_{m=1}^K e^{f_m(X)}}, \qquad \ell = 1, \ldots, K \tag{27}$$
where each function has an additive form
$$f_\ell(X) = f_{\ell 1}(X_1) + f_{\ell 2}(X_2) + \cdots + f_{\ell p}(X_p). \tag{28}$$
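A minimal sketch of computing the class probabilities (27) under the additive form (28); the array layout is an assumption made for the illustration.

```python
import numpy as np

def multiclass_additive_probs(F):
    """Class probabilities (27) for the additive model (28).

    F: (n, K, p) array with F[i, l, j] = f_{lj}(X_ij), the value of the j-th
    component function for class l at observation i.
    Returns an (n, K) array of probabilities p_f(Y = l | X_i).
    """
    f = F.sum(axis=2)                          # f_l(X_i) = sum_j f_{lj}(X_ij)
    f = f - f.max(axis=1, keepdims=True)       # subtract the max for numerical stability
    ef = np.exp(f)
    return ef / ef.sum(axis=1, keepdims=True)  # normalize by Z(X_i) = sum_m e^{f_m(X_i)}
```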
In Newton's algorithm, we minimize the quadratic approximation to the log-likelihood
$$L(f) \approx L(\hat{f}) + E\left[(Y - \hat{p})^T (f - \hat{f})\right] + \frac{1}{2} E\left[(f - \hat{f})^T H(\hat{f})(f - \hat{f})\right] \tag{29}$$
where $\hat{p}(X) = (p_{\hat{f}}(Y = 1 \mid X), \ldots, p_{\hat{f}}(Y = K \mid X))$, and $H(\hat{f}(X))$ is the Hessian.
The above calculation can be reexpressed as follows, which leads to a multiclass backfitting algorithm. The difference in log-likelihoods for functions $\{\hat{f}_\ell\}$ and $\{f_\ell\}$ is, to second order,
$$\sum_{\ell=0}^{K-1} E\left[ p_\ell(X)\left( \hat{f}_\ell(X) - \sum_{k=0}^{K-1} p_k(X)\hat{f}_k(X) + \frac{Y_\ell - p_\ell(X)}{p_\ell(X)} - f_\ell(X) + \sum_{k=0}^{K-1} p_k(X) f_k(X) \right)^{2} \right] \tag{35}$$
where $p_\ell(X) = P(Y = \ell \mid X)$, and $Y_\ell = \delta(Y, \ell)$ are indicator variables. Minimizing over $\{f_\ell\}$ gives coupled equations for the functions $f_\ell$; they cannot be solved independently over $\ell$.
A practical approach is to use coordinate descent, computing the function $f_\ell$ holding the other functions $\{f_k\}_{k \neq \ell}$ fixed, and iterating. Assuming that $f_k = \hat{f}_k$ for $k \neq \ell$, this simplifies to
$$E\left[ p_\ell (1 - p_\ell)^2 \left( \hat{f}_\ell + \frac{Y_\ell - p_\ell}{p_\ell(1 - p_\ell)} - f_\ell \right)^{2} + \sum_{k \neq \ell} p_k p_\ell^2 \left( \hat{f}_\ell + \frac{p_k - Y_k}{p_k p_\ell} - f_\ell \right)^{2} \right]. \tag{36}$$
After some algebra, this can be seen to be the same as the usual objective function in the binary case, where we take $\hat{f}_0 = 1$ and $\hat{f}_1$ arbitrary.
Now assume $f_\ell$ (and $\hat{f}_\ell$) has an additive form: $f_\ell(X) = \sum_{j=1}^p f_{\ell j}(X_j)$. Some further calculation shows that minimizing over each $f_{\ell j}$ yields the following backfitting algorithm:
$$f_{\ell j}(X_j) \leftarrow \frac{E\left[ p_\ell(1 - p_\ell)\left( \hat{f}_\ell - \sum_{k \neq j} f_{\ell k} + \frac{Y_\ell - p_\ell}{p_\ell(1 - p_\ell)} \right) \,\Big|\, X_j \right]}{E\left[ p_\ell(1 - p_\ell) \mid X_j \right]}. \tag{37}$$
In other words,
$$f_{\ell j}(X_j) \leftarrow \frac{E\left[ w_\ell(X) R_{\ell j}(X) \mid X_j \right]}{E\left[ w_\ell(X) \mid X_j \right]} \tag{38}$$
where
$$R_{\ell j}(X) = \hat{f}_\ell(X) - \sum_{k \neq j} f_{\ell k}(X_k) + \frac{Y_\ell - p_\ell(X)}{p_\ell(X)(1 - p_\ell(X))} \tag{39}$$
$$w_\ell(X) = p_\ell(X)(1 - p_\ell(X)). \tag{40}$$
This is the same as in binary logistic regression. We thus have the following algorithm:
For each $\ell = 0, 1, \ldots, K - 1$:

A. Initialize $f_\ell = \hat{f}_\ell$.

B. Iterate until convergence: for each $j = 1, 2, \ldots, p$, update $f_{\ell j}(X_j)$ using (38).

C. Incrementally update the normalizing constants $\hat{Z}(X)$.

D. Set $\hat{f}_\ell \leftarrow f_\ell$.
Incrementally updating the normalizing constants (step C) is important so that the probabilities $p_\ell(X) = e^{f_\ell(X)}/\hat{Z}(X)$ can be efficiently computed, and we avoid an $O(K^2)$ algorithm.