
Nonparametric Classification

10/36-702

1 Introduction

Let $h : \mathcal{X} \to \{0, 1\}$ denote a classifier, where $\mathcal{X}$ is the domain of $X$. In parametric classification
we assumed that $h$ took a very constrained form, typically linear. In nonparametric
classification we aim to relax this assumption.

Let us recall a few definitions and facts. The classification risk, or error rate, of h is

$$R(h) = P(Y \ne h(X)) \tag{1}$$

and the empirical error rate or training error rate based on training data $(X_1, Y_1), \ldots, (X_n, Y_n)$ is

$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n I(h(X_i) \ne Y_i). \tag{2}$$
$R(h)$ is minimized by the Bayes' rule

$$h^*(x) = \begin{cases} 1 & \text{if } m(x) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > \frac{1-\pi}{\pi} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $m(x) = P(Y = 1 \mid X = x)$, $p_j(x) = p(x \mid Y = j)$ and $\pi = P(Y = 1)$. The excess risk of
a classifier $h$ is $R(h) - R(h^*)$.

In the multiclass case, $Y \in \{1, \ldots, k\}$, the Bayes' rule is

$$h^*(x) = \operatorname{argmax}_{1 \le j \le k} \pi_j p_j(x) = \operatorname{argmax}_{1 \le j \le k} m_j(x)$$

where $m_j(x) = P(Y = j \mid X = x)$, $\pi_j = P(Y = j)$ and $p_j(x) = p(x \mid Y = j)$.

2 Plugin Methods

One approach to nonparametric classification is to estimate the unknown quantities in the
expression for the Bayes' rule (3) and simply plug them in. For example, if $\hat m$ is any
nonparametric regression estimator then we can use

$$\hat h(x) = \begin{cases} 1 & \text{if } \hat m(x) > \frac{1}{2} \\ 0 & \text{otherwise}. \end{cases} \tag{4}$$

For example, we could use the kernel regression estimator

$$\hat m_h(x) = \frac{\sum_{i=1}^n Y_i K\!\left(\frac{\|x - X_i\|}{h}\right)}{\sum_{i=1}^n K\!\left(\frac{\|x - X_i\|}{h}\right)}.$$

However, the bandwidth should be optimized for classification error as described in Section 8.
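A minimal sketch of this plug-in rule with a Gaussian kernel; the kernel choice, the bandwidth value, and the toy data below are illustrative assumptions rather than part of the notes.

```python
import numpy as np

def kernel_plugin_classifier(X_train, y_train, X_test, h=0.5):
    """Plug-in rule (4): predict 1 when the Nadaraya-Watson estimate of
    m(x) = P(Y = 1 | X = x) exceeds 1/2."""
    # pairwise distances ||x - X_i||
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    K = np.exp(-0.5 * (d / h) ** 2)          # Gaussian kernel K(||x - X_i|| / h)
    m_hat = (K @ y_train) / K.sum(axis=1)    # kernel regression estimate of m(x)
    return (m_hat > 0.5).astype(int)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(kernel_plugin_classifier(X, y, np.array([[1.0, 1.0], [-1.0, -1.0]])))
```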

We have the following theorem.

Theorem 1 Let $\hat h$ be the plug-in classifier based on $\hat m$. Then,

$$R(\hat h) - R(h^*) \le 2 \int |\hat m(x) - m(x)| \, dP(x) \le 2 \sqrt{\int |\hat m(x) - m(x)|^2 \, dP(x)}. \tag{5}$$

An immediate consequence of this theorem is that any result about nonparametric regression
can be turned into a result about nonparametric classification. For example, if
$\int |\hat m(x) - m(x)|^2 \, dP(x) = O_P(n^{-2\beta/(2\beta + d)})$ then $R(\hat h) - R(h^*) = O_P(n^{-\beta/(2\beta + d)})$.
However, (5) is an upper bound and it is possible that $R(\hat h) - R(h^*)$ is strictly smaller than
$\sqrt{\int |\hat m(x) - m(x)|^2 \, dP(x)}$.

When $Y \in \{1, \ldots, k\}$ the plugin rule has the form

$$\hat h(x) = \operatorname{argmax}_j \hat m_j(x)$$

where $\hat m_j(x)$ is an estimate of $P(Y = j \mid X = x)$.

3 Classifiers Based on Density Estimation

We can apply nonparametric density estimation to each class to get estimators $\hat p_0$ and $\hat p_1$.
Then we define

$$\hat h(x) = \begin{cases} 1 & \text{if } \frac{\hat p_1(x)}{\hat p_0(x)} > \frac{1 - \hat\pi}{\hat\pi} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where $\hat\pi = n^{-1} \sum_{i=1}^n Y_i$. Hence, any nonparametric density estimation method yields a
nonparametric classifier.

A simplification occurs if we assume that the covariate has independent coordinates, conditioned
on the class variable $Y$. Thus, if $X_i = (X_{i1}, \ldots, X_{id})^T$ has dimension $d$ and if
we assume conditional independence, then the density factors as $p_j(x) = \prod_{\ell=1}^d p_{j\ell}(x_\ell)$. In
this case we can estimate the one-dimensional marginals $p_{j\ell}(x_\ell)$ separately and then define
$\hat p_j(x) = \prod_{\ell=1}^d \hat p_{j\ell}(x_\ell)$. This has the advantage that we never have to do more than a
one-dimensional density estimate. This approach is called naive Bayes. The resulting classifier
can sometimes be very accurate even if the independence assumption is false.
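A minimal sketch of a naive Bayes classifier built from one-dimensional Gaussian kernel density estimates; the fixed bandwidth and the small constant guarding against log(0) are illustrative assumptions.

```python
import numpy as np

def kde_1d(samples, x, h):
    """One-dimensional Gaussian kernel density estimate evaluated at the points x."""
    z = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def naive_bayes_predict(X_train, y_train, X_test, h=0.5):
    n, d = X_train.shape
    classes = np.unique(y_train)
    log_score = np.zeros((len(X_test), len(classes)))
    for c_idx, c in enumerate(classes):
        Xc = X_train[y_train == c]
        log_score[:, c_idx] = np.log(len(Xc) / n)        # log of class proportion
        for ell in range(d):                             # product of 1-d marginal estimates
            log_score[:, c_idx] += np.log(kde_1d(Xc[:, ell], X_test[:, ell], h) + 1e-300)
    return classes[np.argmax(log_score, axis=1)]         # argmax_j of pi_j * p_j(x)
```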

It is easy to extend density-based methods to multiclass problems. If $Y \in \{1, \ldots, k\}$ then
we estimate the $k$ densities $p_j(x) = p(x \mid Y = j)$ and the classifier is

$$\hat h(x) = \operatorname{argmax}_j \hat\pi_j \hat p_j(x)$$

where $\hat\pi_j = n^{-1} \sum_{i=1}^n I(Y_i = j)$.

4 Nearest Neighbors

The k-nearest neighbor classifier is


$$\hat h(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^n w_i(x) I(Y_i = 1) > \sum_{i=1}^n w_i(x) I(Y_i = 0) \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

where $w_i(x) = 1$ if $X_i$ is one of the $k$ nearest neighbors of $x$, and $w_i(x) = 0$ otherwise. "Nearest"
depends on how you define the distance. Often we use Euclidean distance $\|X_i - X_j\|$. In
that case you should standardize the variables first.

The k-nearest neighbor classifier can be recast as a plugin rule. Define the regression estimator

$$\hat m(x) = \frac{\sum_{i=1}^n Y_i I(\|X_i - x\| \le d_k(x))}{\sum_{i=1}^n I(\|X_i - x\| \le d_k(x))}$$

where $d_k(x)$ is the distance between $x$ and its $k$th nearest neighbor. Then $\hat h(x) = I(\hat m(x) > 1/2)$.
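A minimal sketch of this plug-in view of k-nearest neighbors; Euclidean distance and the brute-force neighbor search are illustrative choices suitable only for small data.

```python
import numpy as np

def knn_classify(X_train, y_train, X_test, k=5):
    """k-NN as a plug-in rule: predict I(m_hat(x) > 1/2), where m_hat(x) is the
    average of Y over the k nearest training points."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    m_hat = y_train[nn].mean(axis=1)       # fraction of label-1 neighbors
    return (m_hat > 0.5).astype(int)
```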

It is interesting to consider the classification error when $n$ is large. First suppose that $k = 1$
and consider a fixed $x$. Then $\hat h(x)$ is 1 if the closest $X_i$ has label $Y = 1$ and $\hat h(x)$ is 0 if the
closest $X_i$ has label $Y = 0$. When $n$ is large, the closest $X_i$ is approximately equal to $x$. So
the probability of an error is

$$m(X_i)(1 - m(x)) + (1 - m(X_i)) m(x) \approx m(x)(1 - m(x)) + (1 - m(x)) m(x) = 2 m(x)(1 - m(x)).$$

Define
$$L_n = P(Y \ne \hat h(X) \mid D_n)$$
where $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. Then we have that

$$\lim_{n \to \infty} E(L_n) = E(2 m(X)(1 - m(X))) \equiv R^{(1)}. \tag{8}$$

The Bayes risk can be written as $R^* = E(A)$ where $A = \min\{m(X), 1 - m(X)\}$. Note that
$A \le 2 m(X)(1 - m(X)) = 2 A(1 - A)$. Also, since $\text{Cov}(A, 1 - A) = -\text{Var}(A) \le 0$, we have
$E(A(1 - A)) \le E(A) E(1 - A)$. Hence, we have the well-known result due to Cover and Hart (1967),

$$R^* \le R^{(1)} = 2 E(A(1 - A)) \le 2 E(A) E(1 - A) = 2 R^*(1 - R^*) \le 2 R^*.$$

Thus, for any problem with small Bayes error, $k = 1$ nearest neighbors should have small
error.
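As a quick sanity check of these bounds, here is a small Monte Carlo computation for an illustrative choice of $m(x)$ (the logistic function below is an assumption, not from the notes): it compares $R^*$, the asymptotic 1-NN risk $R^{(1)}$, and the bound $2R^*(1 - R^*)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=200_000)
m = 1 / (1 + np.exp(-2 * X))               # a toy regression function m(x)
A = np.minimum(m, 1 - m)

R_star = A.mean()                          # Bayes risk R* = E(A)
R_1nn = (2 * m * (1 - m)).mean()           # R^(1) = E(2 m(X)(1 - m(X)))
print(R_star, R_1nn, 2 * R_star * (1 - R_star))
# up to Monte Carlo error: R_star <= R_1nn <= 2 * R_star * (1 - R_star)
```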

More generally, for any odd $k$,

$$\lim_{n \to \infty} E(L_n) = R^{(k)} \tag{9}$$

where

$$R^{(k)} = E\left( \sum_{j=0}^{k} \binom{k}{j} m^j(X)(1 - m(X))^{k-j} \left[ m(X) I(j < k/2) + (1 - m(X)) I(j > k/2) \right] \right).$$

Theorem 2 (Devroye et al 1996) For all odd $k$,

$$R^* \le R^{(k)} \le R^* + \frac{1}{\sqrt{ke}}. \tag{10}$$

Proof. We can rewrite $R^{(k)}$ as $R^{(k)} = E(a(m(X)))$ where

$$a(z) = \min\{z, 1 - z\} + |2z - 1| \, P\left(B > \frac{k}{2}\right)$$

and $B \sim \text{Binomial}(k, \min\{z, 1 - z\})$. The mean of $a(z)$ is less than or equal to its maximum
and, by symmetry, we can take the maximum over $0 \le z \le 1/2$. Hence, letting $B \sim \text{Binomial}(k, z)$,
we have, by Hoeffding's inequality,

$$R^{(k)} - R^* \le \sup_{0 \le z \le 1/2} (1 - 2z) P\left(B > \frac{k}{2}\right) \le \sup_{0 \le z \le 1/2} (1 - 2z) e^{-2k(1/2 - z)^2} = \sup_{0 \le u \le 1} u e^{-k u^2 / 2} = \frac{1}{\sqrt{ke}}.$$


If the distribution of $X$ has a density function then we have the following.

Theorem 3 (Devroye and Györfi 1985) Suppose that the distribution of $X$ has a density
and that $k \to \infty$ and $k/n \to 0$. For every $\epsilon > 0$ the following is true. For all large $n$,

$$P(R(\hat h_n) - R^* > \epsilon) \le e^{-n \epsilon^2 / (72 \gamma_d^2)}$$

where $\hat h_n$ is the k-nearest neighbor classifier estimated on a sample of size $n$, and where $\gamma_d$
depends on the dimension $d$ of $X$.

Recently, Chaudhuri and Dasgupta (2014) have obtained some very general results about
k-nn classifiers. We state one of their key results here.

Theorem 4 (Chaudhuri and Dasgupta 2014) Suppose that

$$P(\{x : |m(x) - 1/2| \le t\}) \le C t^\beta$$

for some $\beta \ge 0$ and some $C > 0$. Also, suppose that $m$ satisfies the following smoothness
condition: for all $x$ and $r > 0$,

$$|m(B) - m(x)| \le L \, P(B^o)^\alpha$$

where $B = \{u : \|x - u\| \le r\}$, $B^o = \{u : \|x - u\| < r\}$ and $m(B) = (P(B))^{-1} \int_B m(u) \, dP(u)$.
Fix any $0 < \delta < 1$. Let $h^*$ be the Bayes rule. With probability at least $1 - \delta$,

$$P(\hat h(X) \ne h^*(X)) \le \delta C \left( \frac{\log(1/\delta)}{n} \right)^{\frac{\alpha\beta}{2\alpha + 1}}.$$

If $k \asymp n^{\frac{2\alpha}{2\alpha + 1}}$ then

$$R(\hat h) - R(h^*) \asymp n^{-\frac{\alpha(\beta + 1)}{2\alpha + 1}}.$$

4.1 Partitions and Trees

As with nonparametric regression, simple and interpretable classifiers can be derived by
partitioning the range of $X$. Let $\Pi_n = \{A_1, \ldots, A_N\}$ be a partition of $\mathcal{X}$. Let $A_j$ be the
partition element that contains $x$. Then $\hat h(x) = 1$ if $\sum_{X_i \in A_j} Y_i \ge \sum_{X_i \in A_j} (1 - Y_i)$ and
$\hat h(x) = 0$ otherwise. This is nothing other than the plugin classifier based on the partition
regression estimator

$$\hat m(x) = \sum_{j=1}^N \overline{Y}_j I(x \in A_j)$$

where $\overline{Y}_j = n_j^{-1} \sum_{i=1}^n Y_i I(X_i \in A_j)$ is the average of the $Y_i$'s in $A_j$ and $n_j = \#\{X_i \in A_j\}$.
(We define $\overline{Y}_j$ to be 0 if $n_j = 0$.)

Recall from the results on regression that if

$$m \in \mathcal{M} = \left\{ m : |m(x) - m(z)| \le L \|x - z\|, \ x, z \in \mathbb{R}^d \right\} \tag{11}$$

and the binwidth $b$ satisfies $b \asymp n^{-1/(d+2)}$ then

$$E\|\hat m - m\|_P^2 \le \frac{c}{n^{2/(d+2)}}. \tag{12}$$
Figure 1: A simple classification tree.

Figure 2: Partition representation of classification tree.

From (5), we conclude that $R(\hat h) - R(h^*) = O(n^{-1/(d+2)})$. However, this binwidth was based
on the bias-variance tradeoff of the regression problem. For classification, b should be chosen
as described in Section 8.

Like regression trees, classification trees are partition classifiers where the partition is built
recursively. For illustration, suppose there are two covariates, X1 = age and X2 = blood
pressure. Figure 1 shows a classification tree using these variables.

The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as
Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure
is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 2 shows
the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that $y \in \mathcal{Y} = \{0, 1\}$ and that there is
only a single covariate $X$. We choose a split point $t$ that divides the real line into two sets
$A_1 = (-\infty, t]$ and $A_2 = (t, \infty)$. Let $r_s(j)$ be the proportion of observations in $A_s$ such that
$Y_i = j$:

$$r_s(j) = \frac{\sum_{i=1}^n I(Y_i = j, X_i \in A_s)}{\sum_{i=1}^n I(X_i \in A_s)} \tag{13}$$

for $s = 1, 2$ and $j = 0, 1$. The impurity of the split $t$ is defined to be $I(t) = \sum_{s=1}^2 \gamma_s$ where

$$\gamma_s = 1 - \sum_{j=0}^1 r_s(j)^2. \tag{14}$$

This particular measure of impurity is known as the Gini index. If a partition element $A_s$
contains all 0's or all 1's, then $\gamma_s = 0$. Otherwise, $\gamma_s > 0$. We choose the split point $t$ to
minimize the impurity. Other indices of impurity besides the Gini index can be used, such
as entropy. The reason for using impurity rather than classification error is that impurity
is a smooth function and hence is easy to minimize.

When there are several covariates, we choose whichever covariate and split that leads to the
lowest impurity. This process is continued until some stopping criterion is met. For example,
we might stop when every partition element has fewer than n0 data points, where n0 is some
fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0
or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition
element.
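A minimal sketch of the split search just described, following (13)-(14) literally (the impurity $I(t)$ is the unweighted sum $\gamma_1 + \gamma_2$); growing a full tree would apply this search recursively to each resulting partition element until the stopping criterion is met.

```python
import numpy as np

def gini(y):
    """gamma_s = 1 - sum_j r_s(j)^2 for a vector of 0/1 labels."""
    if len(y) == 0:
        return 0.0
    r1 = y.mean()
    return 1.0 - (r1 ** 2 + (1 - r1) ** 2)

def best_split(X, y):
    """Return (impurity, covariate index, threshold) minimizing I(t)."""
    best = (np.inf, None, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # candidate split points
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            impurity = gini(left) + gini(right)    # I(t) = gamma_1 + gamma_2
            if impurity < best[0]:
                best = (impurity, j, t)
    return best
```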

This procedure is easily generalized to the case where $Y \in \{1, \ldots, K\}$. We define the impurity
by

$$\gamma_s = 1 - \sum_{j=1}^K r_s^2(j) \tag{15}$$

where $r_s(j)$ is the proportion of observations in the partition element for which $Y = j$.

5 Minimax Results

The minimax classification risk over a set of joint distributions $\mathcal{P}$ is

$$R_n(\mathcal{P}) = \inf_{\hat h} \sup_{P \in \mathcal{P}} \left( R(\hat h) - R^* \right) \tag{16}$$

where $R(\hat h) = P(Y \ne \hat h(X))$, $R^*$ is the Bayes error under $P$ and the infimum is over all classifiers
constructed from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Recall that

$$R(\hat h) - R(h^*) \le 2 \sqrt{\int |\hat m(x) - m(x)|^2 \, dP(x)}.$$

Class          Rate                   Condition
E(α)           n^{-α/(2α+d)}          α > 1/2
BV             n^{-1/3}
MI             (log n / n)^{1/2}
L(α, q)        n^{-α/(2α+1)}          α > (1/q - 1/2)_+
B^α_{σ,q}      n^{-α/(2α+d)}          α/d > 1/q - 1/2
Neural nets    see text

Table 1: Minimax Rates of Convergence for Classification.


Thus $R_n(\mathcal{P}) \le 2 \sqrt{\tilde R_n(\mathcal{P})}$ where $\tilde R_n(\mathcal{P})$ is the minimax risk for estimating the regression
function $m$. Since this is just an inequality, it leaves open the following question: can $R_n(\mathcal{P})$
be substantially smaller than $2 \sqrt{\tilde R_n(\mathcal{P})}$? Yang (1999) proved that the answer is no, in cases
where $\mathcal{P}$ is sufficiently rich. Moreover, we can achieve minimax classification rates using
plugin regression methods.

However, with smaller classes that invoke extra assumptions, such as the Tsybakov noise
condition, there can be a dramatic difference. Here, we summarize Yang's results under
the richness assumption. This assumption is simply that if $m$ is in the class, then a small
hypercube containing $m$ is also in the class. Yang's results are summarized in Table 1.

The classes in Table 1 are the following: $E(\alpha)$ is the Sobolev space of order $\alpha$, BV is the class
of functions of bounded variation, MI is all monotone functions, $L(\alpha, q)$ are $\alpha$-Lipschitz (in
$q$-norm), and $B^\alpha_{\sigma,q}$ are Besov spaces. For neural nets we have the bound, for every $\epsilon > 0$,

$$\left(\frac{1}{n}\right)^{\frac{1 + (2/d)}{4 + (4/d)}} \le R_n(\mathcal{P}) \le \left(\frac{\log n}{n}\right)^{\frac{1 + (1/d)}{4 + (2/d)} + \epsilon}.$$

It appears that, as $d \to \infty$, we get the dimension independent rate $(\log n / n)^{1/4}$. However,
this result requires some caution since the class of distributions implicitly gets smaller as $d$
increases.

6 Support Vector Machines

When we discussed linear classification, we defined the SVM classifier $\hat h(x) = \mathrm{sign}(\hat H(x))$ where
$\hat H(x) = \hat\beta_0 + \hat\beta^T x$ and $\hat\beta$ minimizes

$$\sum_i [1 - Y_i H(X_i)]_+ + \lambda \|\beta\|_2^2.$$

We can do a nonparametric version by letting $H$ be in an RKHS and taking the penalty to be
$\|H\|_K^2$. In terms of implementation, this means replacing every instance of an inner product
$\langle X_i, X_j \rangle$ with $K(X_i, X_j)$.
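A minimal sketch using scikit-learn's SVC, whose Gaussian (RBF) kernel gives one such RKHS; the regularization constant C plays roughly the role of $1/\lambda$, and the toy data, gamma value, and C value are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = ((X ** 2).sum(axis=1) > 1.5).astype(int)   # a problem with no good linear rule

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)      # K(x, x') = exp(-gamma ||x - x'||^2)
clf.fit(X, y)
print(clf.score(X, y))                         # training accuracy
```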

7 Boosting

Boosting refers to a class of methods that build classifiers in a greedy, iterative way. The
original boosting algorithm is called AdaBoost and is due to Freund and Schapire (1996).
See Figure 3.

The algorithm seems mysterious and there is quite a bit of controversy about why (and
when) it works. Perhaps the most compelling explanation is due to Friedman, Hastie and
Tibshirani (2000), which is the explanation we will give. However, the reader is warned
that there is not consensus on the issue. Further discussions can be found in Bühlmann and
Hothorn (2007), Zhang and Yu (2005) and Mease and Wyner (2008). The latter paper is
followed by a spirited discussion from several authors. Our view is that boosting combines
two distinct ideas: surrogate loss functions and greedy function approximation.

In this section, we assume that $Y_i \in \{-1, +1\}$. Many classifiers then have the form

$$h(x) = \mathrm{sign}(H(x))$$

for some function $H(x)$. For example, a linear classifier corresponds to $H(x) = \beta^T x$. The
risk can then be written as

$$R(h) = P(Y \ne h(X)) = P(Y H(X) < 0) = E(L(A))$$

where $A = Y H(X)$ and $L(a) = I(a < 0)$. As a function of $a$, the loss $L(a)$ is discontinuous,
which makes it difficult to work with. Friedman, Hastie and Tibshirani (2000) show that
AdaBoost corresponds to using a surrogate loss, namely, $L(a) = e^{-a} = e^{-yH(x)}$. Consider
finding a classifier of the form $\sum_m \alpha_m h_m(x)$ by minimizing the exponential loss $\sum_i e^{-Y_i H(X_i)}$.
If we do this iteratively, adding one function at a time, this leads precisely to AdaBoost.
Typically, the classifiers $h_m$ in the sum $\sum_m \alpha_m h_m(x)$ are taken to be very simple classifiers,
such as small classification trees.

The argument in Friedman, Hastie and Tibshirani (2000) is as follows. Consider minimizing
the expected loss $J(F) = E(e^{-Y F(X)})$. Suppose our current estimate is $F$ and consider
updating to an improved estimate $F(x) + c f(x)$. Expanding around $f(x) = 0$,

$$J(F + cf) = E(e^{-Y(F(X) + c f(X))}) \approx E(e^{-Y F(X)}(1 - c Y f(X) + c^2 Y^2 f^2(X)/2))
= E(e^{-Y F(X)}(1 - c Y f(X) + c^2/2))$$

since $Y^2 = f^2(X) = 1$. Now consider minimizing the latter expression at a fixed $X = x$.
If we minimize over $f(x) \in \{-1, +1\}$ we get $f(x) = 1$ if $E_w(y \mid x) > 0$ and $f(x) = -1$ if
1. Input: $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $Y_i \in \{-1, +1\}$.

2. Set $w_i = 1/n$ for $i = 1, \ldots, n$.

3. Repeat for $m = 1, \ldots, M$:

   (a) Compute the weighted error $\epsilon(h) = \sum_{i=1}^n w_i I(Y_i \ne h(X_i))$ and find $h_m$ to
       minimize $\epsilon(h)$.
   (b) Let $\alpha_m = (1/2) \log((1 - \epsilon)/\epsilon)$ where $\epsilon = \epsilon(h_m)$.
   (c) Update the weights:
       $$w_i \leftarrow \frac{w_i e^{-\alpha_m Y_i h_m(X_i)}}{Z}$$
       where $Z$ is chosen so that the weights sum to 1.

4. The final classifier is

$$h(x) = \mathrm{sign}\left( \sum_{m=1}^M \alpha_m h_m(x) \right).$$

Figure 3: AdaBoost

$E_w(y \mid x) < 0$, where $E_w(y \mid x) = E(w(x, y) y \mid x)/E(w(x, y) \mid x)$ and $w(x, y) = e^{-y F(x)}$. In other
words, the optimal $f$ is simply the Bayes classifier with respect to the weights. This is exactly
the first step in AdaBoost. If we now fix $f(x)$ and minimize over $c$ we get

$$c = \frac{1}{2} \log\left( \frac{1 - \epsilon}{\epsilon} \right)$$

where $\epsilon = E_w(I(Y \ne f(X)))$. Thus the updated $F(x)$ is

$$F(x) \leftarrow F(x) + c f(x)$$

as in AdaBoost. When we update $F$ this way, we change the weights to

$$w(x, y) \leftarrow w(x, y) e^{-c f(x) y} = w(x, y) \exp\left( \log\left( \frac{1 - \epsilon}{\epsilon} \right) I(y \ne f(x)) \right)$$

which again is the same as AdaBoost.

Seen in this light, boosting really combines two ideas. The first is the use of surrogate loss
functions. The second is greedy function approximation.
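A minimal sketch of the AdaBoost algorithm of Figure 3, with decision stumps (one-split trees) standing in for the small classification trees mentioned above; the exhaustive stump search and the clipping of the weighted error are illustrative choices, not part of the notes.

```python
import numpy as np

def stump_predict(X, j, t, s):
    """Decision stump: s * sign(x_j - t) with s in {-1, +1}."""
    return s * np.where(X[:, j] > t, 1, -1)

def fit_stump(X, y, w):
    """Return the stump (j, t, s) with the smallest weighted error (step 3a)."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (-1, 1):
                err = np.sum(w * (stump_predict(X, j, t, s) != y))
                if err < best_err:
                    best, best_err = (j, t, s), err
    return best, best_err

def adaboost(X, y, M=20):
    """y has entries in {-1, +1}; returns the final classifier of step 4."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 2: w_i = 1/n
    stumps, alphas = [], []
    for _ in range(M):
        (j, t, s), eps = fit_stump(X, y, w)      # step 3(a)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)    # step 3(b)
        w = w * np.exp(-alpha * y * stump_predict(X, j, t, s))   # step 3(c)
        w = w / w.sum()                          # normalize by Z
        stumps.append((j, t, s))
        alphas.append(alpha)
    def classify(X_new):                         # step 4: sign of the weighted vote
        H = sum(a * stump_predict(X_new, *st) for a, st in zip(alphas, stumps))
        return np.sign(H)
    return classify
```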

8 Choosing Tuning Parameters

All the nonparametric methods involve tuning parameters, for example, the number of
neighbors $k$ in nearest neighbors. As with density estimation and regression, these parameters
can be chosen by a variety of cross-validation methods. Here we describe the data splitting
version of cross-validation. Suppose the data are $(X_1, Y_1), \ldots, (X_{2n}, Y_{2n})$. Now randomly
split the data into two halves that we denote by

$$\mathcal{D} = \left\{ (\tilde X_1, \tilde Y_1), \ldots, (\tilde X_n, \tilde Y_n) \right\} \quad \text{and} \quad
\mathcal{E} = \left\{ (X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*) \right\}.$$

Construct classifiers $\mathcal{H} = \{h_1, \ldots, h_N\}$ from $\mathcal{D}$ corresponding to different values of the tuning
parameter. Define the risk estimator

$$\hat R(h_j) = \frac{1}{n} \sum_{i=1}^n I(Y_i^* \ne h_j(X_i^*)).$$

Let $\hat h = \operatorname{argmin}_{h \in \mathcal{H}} \hat R(h)$.

Theorem 5 Let $h_* \in \mathcal{H}$ minimize $R(h) = P(Y \ne h(X))$. Then

$$P\left( R(\hat h) > R(h_*) + 2 \sqrt{\frac{1}{2n} \log\left( \frac{2N}{\delta} \right)} \right) \le \delta.$$

Proof. By Hoeffding's inequality, $P(|\hat R(h) - R(h)| > \epsilon) \le 2 e^{-2n\epsilon^2}$ for each $h \in \mathcal{H}$. By the
union bound,

$$P\left( \max_{h \in \mathcal{H}} |\hat R(h) - R(h)| > \epsilon \right) \le 2N e^{-2n\epsilon^2} = \delta$$

where $\epsilon = \sqrt{\frac{1}{2n} \log \frac{2N}{\delta}}$. Hence, except on a set of probability at most $\delta$,

$$R(\hat h) \le \hat R(\hat h) + \epsilon \le \hat R(h_*) + \epsilon \le R(h_*) + 2\epsilon.$$

Note that the difference between $R(\hat h)$ and $R(h_*)$ is $O(\sqrt{\log N / n})$ but in regression it was
$O(\log N / n)$, which is an interesting difference between the two settings. Under low noise
conditions, the error can be improved.

A popular modification of data-splitting is K-fold cross-validation. The data are divided
into K blocks; typically K = 10. One block is held out as test data to estimate risk. The
process is then repeated K times, leaving out a different block each time, and the results are
averaged over the K repetitions.
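A minimal sketch of K-fold cross-validation for choosing the number of neighbors $k$; scikit-learn's KNeighborsClassifier stands in for the nearest-neighbor rule of Section 4, and the random fold assignment and grid of $k$ values are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cv_choose_k(X, y, k_grid, K=10, seed=0):
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % K     # K roughly equal blocks
    cv_error = []
    for k in k_grid:
        err = 0.0
        for b in range(K):
            test, train = folds == b, folds != b
            clf = KNeighborsClassifier(n_neighbors=k).fit(X[train], y[train])
            err += np.mean(clf.predict(X[test]) != y[test])     # held-out error rate
        cv_error.append(err / K)
    return k_grid[int(np.argmin(cv_error))], cv_error
```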

9 Example

The following data are from simulated images of gamma ray events for the Major Atmo-
spheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The
data are from archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The telescope
studies gamma ray bursts, active galactic nuclei and supernova remnants. The goal is to
predict if an event is real or is background (hadronic shower). There are 11 predictors that
are numerical summaries of the images. We randomly selected 400 training points (200 pos-
itive and 200 negative) and 1000 test cases (500 positive and 500 negative). The results of
various methods are in Table 2. See Figures 4, 5, 6, 7.

10 Sparse Nonparametric Logistic Regression

For high dimensional problems we can use sparsity-based methods. The nonparametric
additive logistic model is

$$P(Y = 1 \mid X) \equiv p(X; f) = \frac{\exp\left( \sum_{j=1}^p f_j(X_j) \right)}{1 + \exp\left( \sum_{j=1}^p f_j(X_j) \right)} \tag{17}$$

where $Y \in \{0, 1\}$, and the population log-likelihood is

$$\ell(f) = E\left[ Y f(X) - \log\left( 1 + \exp f(X) \right) \right] \tag{18}$$

Method                    Test Error
Logistic regression       0.23
SVM (Gaussian Kernel)     0.20
Kernel Regression         0.24
Additive Model            0.20
Reduced Additive Model    0.20
11-NN                     0.25
Trees                     0.20

Table 2: Various methods on the MAGIC data. The reduced additive model is based on
using the three most significant variables from the additive model.
Figure 4: Estimated functions for additive model.

Figure 5: Test error versus k for nearest neighbor estimator.

Figure 6: Full tree.

Figure 7: Classification tree. The size of the tree was chosen by cross-validation.
where $f(X) = \sum_{j=1}^p f_j(X_j)$. To fit this model, the local scoring algorithm runs the backfitting
procedure within Newton's method. One iteratively computes the transformed response
for the current estimate $\hat f$,

$$Z_i = \hat f(X_i) + \frac{Y_i - p(X_i; \hat f)}{p(X_i; \hat f)(1 - p(X_i; \hat f))} \tag{19}$$

and weights $w(X_i) = p(X_i; \hat f)(1 - p(X_i; \hat f))$, and carries out a weighted backfitting of $(Z, X)$
with weights $w$. The weighted smooth for component $j$ is given by

$$\hat f_j = \frac{S_j(w R_j)}{S_j w} \tag{20}$$

where $S_j$ is a linear smoothing matrix, such as a kernel smoother, and $R_j = Z - \sum_{k \ne j} f_k(X_k)$
is the partial residual. This extends iteratively reweighted least squares to the nonparametric setting.
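A minimal sketch of forming the local-scoring quantities in (19): the transformed response $Z$ and the weights $w$ from the current fitted values. The backfitting smooth of $(Z, X)$ with these weights would then update each component as in (20); the function name and interface below are illustrative.

```python
import numpy as np

def local_scoring_step(f_hat_values, Y):
    """f_hat_values = current additive fit evaluated at the X_i; Y has entries in {0, 1}."""
    p = 1 / (1 + np.exp(-f_hat_values))     # p(X_i; f_hat)
    w = p * (1 - p)                         # weights w(X_i)
    Z = f_hat_values + (Y - p) / w          # transformed response (19)
    return Z, w
```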

A sparsity penalty can be incorporated, just as for sparse additive models (SpAM) for
regression. The Lagrangian is given by

$$\mathcal{L}(f, \lambda) = E\left[ \log\left( 1 + e^{f(X)} \right) - Y f(X) \right] + \lambda \left( \sum_{j=1}^p \sqrt{E(f_j^2(X_j))} - L \right) \tag{21}$$

and the stationary condition for component function $f_j$ is $E(p - Y \mid X_j) + \lambda v_j = 0$ where $v_j$
is an element of the subgradient $\partial \sqrt{E(f_j^2)}$. As in the unregularized case, this condition is
nonlinear in $f$, and so we linearize the gradient of the log-likelihood around $\hat f$. This yields
the linearized condition $E[w(X)(f(X) - Z) \mid X_j] + \lambda v_j = 0$. To see this, note that

$$0 = E\left[ p(X; \hat f) - Y + p(X; \hat f)(1 - p(X; \hat f))(f(X) - \hat f(X)) \mid X_j \right] + \lambda v_j \tag{22}$$
$$= E\left[ w(X)(f(X) - Z) \mid X_j \right] + \lambda v_j. \tag{23}$$

When $E(f_j^2) \ne 0$, this implies the condition

$$\left( E(w \mid X_j) + \frac{\lambda}{\sqrt{E(f_j^2)}} \right) f_j(X_j) = E(w R_j \mid X_j). \tag{24}$$

In the finite sample case, in terms of the smoothing matrix $S_j$, this becomes

$$f_j = \frac{S_j(w R_j)}{S_j w + \lambda / \sqrt{E(f_j^2)}}. \tag{25}$$

If $\|S_j(w R_j)\| < \lambda$, then $f_j = 0$. Otherwise, this implicit, nonlinear equation for $f_j$ cannot be
solved explicitly, so one simply iterates until convergence:

$$f_j \leftarrow \frac{S_j(w R_j)}{S_j w + \lambda \sqrt{n} / \|f_j\|}. \tag{26}$$

When $\lambda = 0$, this yields the standard local scoring update (20).
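A minimal sketch of the thresholded component update (25)-(26) for a single coordinate $j$: given an $n \times n$ linear smoother matrix $S_j$ (a simple Gaussian kernel smoother is built below as an illustrative choice), the weights $w$, and the partial residuals $R_j$, the component is set to zero when $\|S_j(wR_j)\| < \lambda$ and otherwise obtained by iterating (26) a fixed number of times.

```python
import numpy as np

def kernel_smoother_matrix(xj, h=0.5):
    """Rows of S_j average the observations with Gaussian weights in x_j."""
    K = np.exp(-0.5 * ((xj[:, None] - xj[None, :]) / h) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def spam_component_update(Sj, w, Rj, lam, n_iter=50):
    """Soft-thresholded backfitting update for f_j, following (25)-(26)."""
    n = len(w)
    num = Sj @ (w * Rj)                       # S_j(w R_j)
    if np.linalg.norm(num) < lam:             # thresholding condition
        return np.zeros(n)
    fj = num / (Sj @ w)                       # start from the unpenalized update (20)
    for _ in range(n_iter):                   # iterate (26) until (approximate) convergence
        fj = num / (Sj @ w + lam * np.sqrt(n) / np.linalg.norm(fj))
    return fj
```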

Example 6 (SpAM for Spam) Here we consider an email spam classification problem,
using the logistic SpAM backfitting algorithm above. This dataset has been studied by Hastie et
al. (2001) using a set of 3,065 emails as a training set, and conducting hypothesis tests to
choose significant variables; there are a total of 4,601 observations with p = 57 attributes, all
numeric. The attributes measure the percentage of specific words or characters in the email,
the average and maximum run lengths of upper case letters, and the total number of such
letters.

The results of a typical run of logistic SpAM are summarized in Figure 8, using plug-in
bandwidths. A held-out set is used to tune the regularization parameter λ.

11 Bagging and Random Forests

Suppose we draw B bootstrap samples and each time we construct a classifier. This gives
classifiers h1 , . . . , hB . We now classify by combining them:
$$h(x) = \begin{cases} 1 & \text{if } \frac{1}{B} \sum_j h_j(x) \ge \frac{1}{2} \\ 0 & \text{otherwise}. \end{cases}$$

λ (×10^{-3})   Error    # zeros   selected variables
5.5            0.2009   55        {8, 54}
5.0            0.1725   51        {8, 9, 27, 53, 54, 57}
4.5            0.1354   46        {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}
4.0            0.1083   20        {4, 6–10, 14–22, 26, 27, 38, 53–58}
3.5            0.1117   0         all
3.0            0.1174   0         all
2.5            0.1251   0         all
2.0            0.1259   0         all

Figure 8: (Email spam) Classification accuracies and variable selection for logistic SpAM.

This is called bagging, which stands for bootstrap aggregation. The baseline classifiers are
usually trees, as in the sketch below.
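A minimal sketch of bagging with classification trees as the base classifiers, assuming scikit-learn's DecisionTreeClassifier is available; drawing a random subset of predictors at each split, as in the random forests described next, would correspond to its max_features argument.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X, y, X_test, B=100, seed=0):
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))      # bootstrap sample
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes += tree.predict(X_test)
    return (votes / B >= 0.5).astype(int)               # majority vote over the B trees
```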

A variation is to choose a random subset of the predictors to split on at each stage. The
resulting classifier is called a random forest. Random forests often perform very well. Their
theoretical performance is not well understood. Some good references are:

Biau, G., Devroye, L. and Lugosi, G. (2008). Consistency of Random Forests and Other
Averaging Classifiers. JMLR.

Biau, G. (2012). Analysis of a Random Forests Model. arXiv:1005.0208.

Lin, Y. and Jeon, Y. (2006). Random Forests and Adaptive Nearest Neighbors. Journal of
the American Statistical Association, 101, 578.

Wager, S. (2014). Asymptotic Theory for Random Forests. arXiv:1405.0352.

Wager, S. (2015). Uniform Convergence of Random Forests via Adaptive Concentration.
arXiv:1503.06388.

Appendix: Multiclass Sparse Logistic Regression

Now we consider the multiclass version. Suppose we have the nonparametric K-class logistic
regression model

$$p_f(Y = \ell \mid X) = \frac{e^{f_\ell(X)}}{\sum_{m=1}^K e^{f_m(X)}}, \qquad \ell = 1, \ldots, K \tag{27}$$

where each function has an additive form

$$f_\ell(X) = f_{\ell 1}(X_1) + f_{\ell 2}(X_2) + \cdots + f_{\ell p}(X_p). \tag{28}$$

In Newton's algorithm, we minimize the quadratic approximation to the log-likelihood

$$\mathcal{L}(f) \approx \mathcal{L}(\hat f) + E\left[ (Y - \hat p)^T (f - \hat f) \right] + \frac{1}{2} E\left[ (f - \hat f)^T H(\hat f)(f - \hat f) \right] \tag{29}$$

where $\hat p(X) = (p_{\hat f}(Y = 1 \mid X), \ldots, p_{\hat f}(Y = K \mid X))$, and $H(\hat f(X))$ is the Hessian

$$H(\hat f) = -\mathrm{diag}(\hat p(X)) + \hat p(X) \hat p(X)^T. \tag{30}$$

Maximizing the right hand side of (29) is equivalent to minimizing

$$-E\left[ (Y - \hat p)^T (f - \hat f) \right] - E\left[ \hat f^T J f \right] + \frac{1}{2} E\left[ f^T J f \right] \tag{31}$$

which is, in turn, equivalent to minimizing the surrogate loss function

$$Q(f, \hat f) \equiv \frac{1}{2} E\left[ \|Z - A f\|_2^2 \right] \tag{32}$$

where $J = -H(\hat f)$, $A = J^{1/2}$, and $Z$ is defined by

$$Z = J^{-1/2}(Y - \hat p) + J^{1/2} \hat f \tag{33}$$
$$= A^{-1}(Y - \hat p) + A \hat f. \tag{34}$$

The above calculation can be reexpressed as follows, which leads to a multiclass backfitting
algorithm. The difference in log-likelihoods for functions $\{\hat f_\ell\}$ and $\{f_\ell\}$ is, to second order,

$$E\left[ \sum_{\ell=0}^{K-1} p_\ell(X) \left( \hat f_\ell(X) - \sum_{k=0}^{K-1} p_k(X) \hat f_k(X) + \frac{Y_\ell - p_\ell(X)}{p_\ell(X)} - f_\ell(X) + \sum_{k=0}^{K-1} p_k(X) f_k(X) \right)^2 \right] \tag{35}$$

where $p_\ell(X) = P(Y = \ell \mid X)$, and $Y_\ell = \delta(Y, \ell)$ are indicator variables. Minimizing over $\{f_\ell\}$
gives coupled equations for the functions $f_\ell$; they can't be solved independently over $\ell$.

A practical approach is to use coordinate descent, computing the function $f_\ell$ holding the
other functions $\{f_k\}_{k \ne \ell}$ fixed, and iterating. Assuming that $f_k = \hat f_k$ for $k \ne \ell$, this simplifies
to

$$E\left[ p_\ell (1 - p_\ell)^2 \left( \hat f_\ell + \frac{Y_\ell - p_\ell}{p_\ell(1 - p_\ell)} - f_\ell \right)^2 + \sum_{k \ne \ell} p_k p_\ell^2 \left( \hat f_\ell + \frac{p_k - Y_k}{p_k p_\ell} - f_\ell \right)^2 \right]. \tag{36}$$

After some algebra, this can be seen to be the same as the usual objective function in the
binary case, where we take $\hat f_0 = 1$ and $\hat f_1$ arbitrary.

Now assume $f_\ell$ (and $\hat f_\ell$) has an additive form: $f_\ell(X) = \sum_{j=1}^p f_{\ell j}(X_j)$. Some further calculation
shows that minimizing over each $f_{\ell j}$ yields the following backfitting algorithm:

$$f_{\ell j}(X_j) \leftarrow \frac{E\left[ p_\ell(1 - p_\ell) \left( \hat f_\ell - \sum_{k \ne j} f_{\ell k} + \frac{Y_\ell - p_\ell}{p_\ell(1 - p_\ell)} \right) \Big| X_j \right]}{E\left[ p_\ell(1 - p_\ell) \mid X_j \right]}. \tag{37}$$

We approximate the conditional expectations by smoothing, as usual:

$$f_{\ell j}(x_j) \leftarrow \frac{S_j(x_j)^T \left( w_\ell(X) R_{\ell j}(X) \right)}{S_j(x_j)^T w_\ell(X)} \tag{38}$$

where

$$R_{\ell j}(X) = \hat f_\ell(X) - \sum_{k \ne j} f_{\ell k}(X_k) + \frac{Y_\ell - p_\ell(X)}{p_\ell(X)(1 - p_\ell(X))} \tag{39}$$

$$w_\ell(X) = p_\ell(X)(1 - p_\ell(X)). \tag{40}$$

This is the same as in binary logistic regression. We thus have the following algorithm:

Multiclass Logistic Backfitting

1. Initialize $\{\hat f_\ell = 0\}$, and set $Z(X) = K$.

2. Iterate until convergence:

   For each $\ell = 0, 1, \ldots, K - 1$:

   A. Initialize $f_\ell = \hat f_\ell$.

   B. Iterate until convergence:

      For each $j = 1, 2, \ldots, p$:

      $$f_{\ell j}(x_j) \leftarrow \frac{S_j(x_j)^T \left( w_\ell(X) R_{\ell j}(X) \right)}{S_j(x_j)^T w_\ell(X)} \quad \text{where}$$

      $$R_{\ell j}(X) = \hat f_\ell(X) - \sum_{k \ne j} f_{\ell k}(X_k) + \frac{Y_\ell - p_\ell(X)}{p_\ell(X)(1 - p_\ell(X))}$$

      $$w_\ell(X) = p_\ell(X)(1 - p_\ell(X)).$$

   C. Update $Z(X) \leftarrow Z(X) - e^{\hat f_\ell(X)} + e^{f_\ell(X)}$.

   D. Set $\hat f_\ell \leftarrow f_\ell$.

Incrementally updating the normalizing constants (step C) is important so that the probabilities
$p_\ell(X) = e^{\hat f_\ell(X)}/Z(X)$ can be efficiently computed, and we avoid an $O(K^2)$ algorithm.

This can be extended to include a sparsity constraint, as in the binary case.
