Nonparametric Classification
10/36-702
1 Introduction
Let us recall a few definitions and facts. The classification risk, or error rate, of h is
$$R(h) = P(Y \neq h(X)) \tag{1}$$
and the empirical error rate or training error rate based on training data $(X_1, Y_1), \ldots, (X_n, Y_n)$ is
$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n I(h(X_i) \neq Y_i). \tag{2}$$
R(h) is minimized by the Bayes' rule
$$h^*(x) = \begin{cases} 1 & \text{if } m(x) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} 1 & \text{if } \frac{p_1(x)}{p_0(x)} > \frac{(1-\pi)}{\pi} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
where $m(x) = P(Y = 1 \mid X = x)$, $p_j(x) = p(x \mid Y = j)$ and $\pi = P(Y = 1)$. The excess risk of a classifier h is $R(h) - R(h^*)$.

In the multiclass case where $Y \in \{1, \ldots, K\}$, the Bayes rule is $h^*(x) = \operatorname{argmax}_j m_j(x) = \operatorname{argmax}_j \pi_j p_j(x)$, where $m_j(x) = P(Y = j \mid X = x)$, $\pi_j = P(Y = j)$ and $p_j(x) = p(x \mid Y = j)$.
2 Plugin Methods
A natural approach is to estimate the regression function m and plug the estimate $\hat{m}$ into the Bayes rule:
$$\hat{h}(x) = \begin{cases} 1 & \text{if } \hat{m}(x) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$
For example, we could use the kernel regression estimator
$$\hat{m}_h(x) = \frac{\sum_{i=1}^n Y_i K\!\left(\frac{\|x - X_i\|}{h}\right)}{\sum_{i=1}^n K\!\left(\frac{\|x - X_i\|}{h}\right)}.$$
However, the bandwidth should be optimized for classification error as described in Section 8.
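As a concrete illustration, here is a minimal sketch of the plug-in rule (4) built from a kernel regression estimate of m; the Gaussian kernel and the function name are illustrative choices.

```python
import numpy as np

def kernel_plugin_classifier(X_train, y_train, x, h):
    """Plug-in classifier (4) built from a Nadaraya-Watson regression estimate.

    X_train: (n, d) covariates, y_train: (n,) labels in {0, 1},
    x: (d,) query point, h: bandwidth (should be tuned for classification
    error, e.g. by data splitting as in Section 8).
    """
    # Gaussian kernel applied to the distances ||x - X_i||
    dists = np.linalg.norm(X_train - x, axis=1)
    weights = np.exp(-0.5 * (dists / h) ** 2)
    m_hat = np.sum(weights * y_train) / np.sum(weights)  # kernel estimate of m(x)
    return int(m_hat > 0.5)                              # plug-in rule 1{m_hat(x) > 1/2}
```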
Theorem 1 Let $\hat{h}$ be the plug-in classifier based on $\hat{m}$. Then,
$$R(\hat{h}) - R(h^*) \le 2\int |\hat{m}(x) - m(x)| \, dP(x) \le 2\sqrt{\int |\hat{m}(x) - m(x)|^2 \, dP(x)}. \tag{5}$$
An immediate consequence of this theorem is that any result about nonparametric regression can be turned into a result about nonparametric classification. For example, if $\int |\hat{m}(x) - m(x)|^2 \, dP(x) = O_P(n^{-2\beta/(2\beta+d)})$ then $R(\hat{h}) - R(h^*) = O_P(n^{-\beta/(2\beta+d)})$. However, (5) is an upper bound and it is possible that $R(\hat{h}) - R(h^*)$ is strictly smaller than $\sqrt{\int |\hat{m}(x) - m(x)|^2 \, dP(x)}$.
In the multiclass case the plug-in classifier is
$$\hat{h}(x) = \operatorname{argmax}_j \hat{m}_j(x)$$
where $\hat{m}_j(x)$ is an estimate of $P(Y = j \mid X = x)$.
We can apply nonparametric density estimation to each class to get estimators $\hat{p}_0$ and $\hat{p}_1$. Then we define
$$\hat{h}(x) = \begin{cases} 1 & \text{if } \frac{\hat{p}_1(x)}{\hat{p}_0(x)} > \frac{(1 - \hat{\pi})}{\hat{\pi}} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$
where $\hat{\pi} = n^{-1}\sum_{i=1}^n Y_i$. Hence, any nonparametric density estimation method yields a nonparametric classifier.
A simplification occurs if we assume that the covariate has independent coordinates, conditioned on the class variable Y. Thus, if $X_i = (X_{i1}, \ldots, X_{id})^T$ has dimension d and if we assume conditional independence, then the density factors as $p_j(x) = \prod_{\ell=1}^d p_{j\ell}(x_\ell)$. In this case we can estimate the one-dimensional marginals $p_{j\ell}(x_\ell)$ separately and then define $\hat{p}_j(x) = \prod_{\ell=1}^d \hat{p}_{j\ell}(x_\ell)$. This has the advantage that we never have to do more than a one-dimensional density estimate. This approach is called naive Bayes. The resulting classifier can sometimes be very accurate even if the independence assumption is false.
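A minimal sketch of such a naive Bayes classifier, assuming one-dimensional Gaussian kernel density estimates for each coordinate (via SciPy's gaussian_kde); the function name and the use of log densities are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def naive_bayes_kde(X_train, y_train, x):
    """Naive Bayes with one-dimensional kernel density estimates per coordinate.

    X_train: (n, d) covariates, y_train: (n,) labels in {0, 1}, x: (d,) query point.
    Returns the class maximizing pi_hat_j * prod_l p_hat_{jl}(x_l).
    """
    scores = {}
    for j in (0, 1):
        Xj = X_train[y_train == j]
        pi_j = Xj.shape[0] / X_train.shape[0]                  # class proportion
        log_dens = sum(np.log(gaussian_kde(Xj[:, l])(x[l])[0])
                       for l in range(X_train.shape[1]))       # sum of log 1-d marginals
        scores[j] = np.log(pi_j) + log_dens
    return max(scores, key=scores.get)
```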
It is easy to extend density based methods to multiclass problems. If $Y \in \{1, \ldots, k\}$ then we estimate the k densities $p_j(x) = p(x \mid Y = j)$ by $\hat{p}_j$ and the classifier is
$$\hat{h}(x) = \operatorname{argmax}_j \hat{\pi}_j \hat{p}_j(x)$$
where $\hat{\pi}_j = n^{-1}\sum_{i=1}^n I(Y_i = j)$.
4 Nearest Neighbors
The k-nearest neighbor classifier can be recast as a plug-in rule. Define the regression estimator
$$\hat{m}(x) = \frac{\sum_{i=1}^n Y_i I(\|X_i - x\| \le d_k(x))}{\sum_{i=1}^n I(\|X_i - x\| \le d_k(x))}$$
where $d_k(x)$ is the distance between x and its $k$th nearest neighbor. Then $\hat{h}(x) = I(\hat{m}(x) > 1/2)$.
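For concreteness, a minimal sketch of the k-nearest neighbor rule written as this plug-in estimator; the function name is illustrative.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k):
    """k-nearest neighbor classifier as the plug-in rule I(m_hat(x) > 1/2).

    X_train: (n, d) covariates, y_train: (n,) labels in {0, 1},
    x: (d,) query point, k: number of neighbors.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]      # indices of the k closest training points
    m_hat = y_train[nearest].mean()      # local average of the labels
    return int(m_hat > 0.5)
```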
It is interesting to consider the classification error when n is large. First suppose that k = 1 and consider a fixed x. Then $\hat{h}(x)$ is 1 if the closest $X_i$ has label Y = 1 and $\hat{h}(x)$ is 0 if the closest $X_i$ has label Y = 0. When n is large, the closest $X_i$ is approximately equal to x. So the probability of an error is approximately
$$m(x)(1 - m(x)) + (1 - m(x))m(x) = 2m(x)(1 - m(x)).$$
Define
$$L_n = P(Y \neq \hat{h}(X) \mid D_n)$$
where $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$. Then we have that
$$\lim_{n \to \infty} E(L_n) = E\big[2m(X)(1 - m(X))\big].$$
The Bayes risk can be written as $R^* = E(A)$ where $A = \min\{m(X), 1 - m(X)\}$. Note that $A \le 2m(X)(1 - m(X))$ and that $2m(X)(1 - m(X)) = 2A(1 - A)$. Also, by direct integration, $E(A(1 - A)) \le E(A)E(1 - A)$. Hence, we have the well-known result due to Cover and Hart (1967),
$$R^* \le \lim_{n \to \infty} E(L_n) \le 2R^*(1 - R^*) \le 2R^*.$$
Thus, for any problem with small Bayes error, k = 1 nearest neighbors should have small error.
A similar analysis applies for general fixed k: as $n \to \infty$ the expected error rate of the k-nearest neighbor classifier tends to R(k), where
$$R(k) = E\left(\sum_{j=0}^{k} \binom{k}{j} m^j(X)(1 - m(X))^{k-j}\Big[m(X)\, I(j < k/2) + (1 - m(X))\, I(j > k/2)\Big]\right).$$
Theorem 3 (Devroye and Györfi 1985) Suppose that the distribution of X has a density and that $k \to \infty$ and $k/n \to 0$. For every $\epsilon > 0$ the following is true. For all large n,
$$P\left(R(\hat{h}_n) - R^* > \epsilon\right) \le e^{-n\epsilon^2/(72\gamma_d^2)}$$
where $\hat{h}_n$ is the k-nearest neighbor classifier estimated on a sample of size n, and where $\gamma_d$ depends on the dimension d of X.
Recently, Chaudhuri and Dasgupta (2014) have obtained some very general results about k-nn classifiers. We state one of their key results here. Suppose that
$$P\left(|m(X) - 1/2| \le t\right) \le C t^{\beta}$$
for some β ≥ 0 and some C > 0. Also, suppose that m satisfies the following smoothness condition: for all x and r > 0,
$$|m(B(x, r)) - m(x)| \le L\, P(B(x, r))^{\alpha}$$
where $m(B)$ denotes the average of m over the ball B.
[Figure 1: a classification tree built from Age and Blood Pressure. Figure 2: the same classifier shown as a partition of the covariate space, with axes Age and Blood Pressure.]
If we use a partition (histogram) estimator of m with binwidth b then, from (5), we conclude that $R(\hat{h}) - R(h^*) = O(n^{-1/(d+2)})$. However, this binwidth was based on the bias-variance tradeoff of the regression problem. For classification, b should be chosen as described in Section 8.
Like regression trees, classification trees are partition classifiers where the partition is built
recursively. For illustration, suppose there are two covariates, X1 = age and X2 = blood
pressure. Figure 1 shows a classification tree using these variables.
The tree is used in the following way. If a subject has Age ≥ 50 then we classify him as
Y = 1. If a subject has Age < 50 then we check his blood pressure. If systolic blood pressure
is < 100 then we classify him as Y = 1, otherwise we classify him as Y = 0. Figure 2 shows
the same classifier as a partition of the covariate space.
Here is how a tree is constructed. First, suppose that $y \in \mathcal{Y} = \{0, 1\}$ and that there is only a single covariate X. We choose a split point t that divides the real line into two sets $A_1 = (-\infty, t]$ and $A_2 = (t, \infty)$. Let $r_s(j)$ be the proportion of observations in $A_s$ such that $Y_i = j$:
$$r_s(j) = \frac{\sum_{i=1}^n I(Y_i = j,\; X_i \in A_s)}{\sum_{i=1}^n I(X_i \in A_s)} \tag{13}$$
for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be $I(t) = \sum_{s=1}^2 \gamma_s$ where
$$\gamma_s = 1 - \sum_{j=0}^1 r_s(j)^2. \tag{14}$$
This particular measure of impurity is known as the Gini index. If a partition element As
contains all 0’s or all 1’s, then γs = 0. Otherwise, γs > 0. We choose the split point t to
minimize the impurity. Other indices of impurity besides the Gini index can be used, such as entropy. The reason for using impurity rather than classification error is that impurity is a smooth function and hence is easy to minimize.
When there are several covariates, we choose whichever covariate and split that leads to the
lowest impurity. This process is continued until some stopping criterion is met. For example,
we might stop when every partition element has fewer than n0 data points, where n0 is some
fixed number. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0
or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition
element.
This procedure is easily generalized to the case where $Y \in \{1, \ldots, K\}$. We define the impurity by
$$\gamma_s = 1 - \sum_{j=1}^{K} r_s^2(j) \tag{15}$$
where $r_s(j)$ is the proportion of observations in the partition element for which Y = j.
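As an illustration of the splitting step, here is a minimal sketch that searches a single covariate for the split point minimizing the Gini impurity $I(t) = \gamma_1 + \gamma_2$; the exhaustive search over observed values is an illustrative implementation choice.

```python
import numpy as np

def best_split(x, y):
    """Find the split point t on one covariate minimizing Gini impurity.

    x: (n,) covariate values, y: (n,) labels in {0, 1}.
    Returns (t, impurity) for the best split A1 = (-inf, t], A2 = (t, inf).
    """
    def gini(labels):
        if len(labels) == 0:
            return 0.0
        p = np.mean(labels)                  # proportion of 1's in the partition element
        return 1.0 - p**2 - (1.0 - p)**2     # gamma_s = 1 - sum_j r_s(j)^2
    best_t, best_impurity = None, np.inf
    for t in np.unique(x)[:-1]:              # candidate splits between observed values
        left, right = y[x <= t], y[x > t]
        impurity = gini(left) + gini(right)  # I(t) = gamma_1 + gamma_2
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity
```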
5 Minimax Results
Let $\mathcal{P}$ be a class of distributions for (X, Y). The minimax excess risk is
$$R_n(\mathcal{P}) = \inf_{\hat{h}} \sup_{P \in \mathcal{P}} \left( E\big[R(\hat{h})\big] - R_n^* \right)$$
where $R(\hat{h}) = P(Y \neq \hat{h}(X))$, $R_n^*$ is the Bayes error and the infimum is over all classifiers constructed from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Recall that
$$R(\hat{h}) - R(h^*) \le 2\sqrt{\int |\hat{m}(x) - m(x)|^2 \, dP(x)}.$$
  Class            Rate                       Condition
  E(α)             $n^{-\alpha/(2\alpha+d)}$  α > 1/2
  BV               $n^{-1/3}$
  MI               $\sqrt{\log n / n}$
  L(α, q)          $n^{-\alpha/(2\alpha+1)}$  α > (1/q − 1/2)₊
  $B^{\alpha}_{\sigma,q}$   $n^{-\alpha/(2\alpha+d)}$  α/d > 1/q − 1/2
  Neural nets      see text

Table 1: Minimax rates of convergence.
However, with smaller classes that invoke extra assumptions, such as the Tsybakov noise
condition, there can be a dramatic difference. Here, we summarize Yang’s results under
the richness assumption. This assumption is simply that if m is in the class, then a small
hypercube containing m is also in the class. Yang’s results are summarized in Table 1.
The classes in Table 1 are the following: E(α) is the Sobolev space of order α, BV is the class of functions of bounded variation, MI is all monotone functions, L(α, q) are α-Lipschitz (in q-norm), and $B^{\alpha}_{\sigma,q}$ are Besov spaces. For neural nets we have the bound, for every $\epsilon > 0$,
$$\left(\frac{1}{n}\right)^{\frac{1+(2/d)}{4+(4/d)}} \le R_n(\mathcal{P}) \le \left(\frac{\log n}{n}\right)^{\frac{1+(1/d)}{4+(2/d)} + \epsilon}.$$
It appears that, as $d \to \infty$, we get the dimension independent rate $(\log n/n)^{1/4}$. However, this result requires some caution since the class of distributions implicitly gets smaller as d increases.
We can do a nonparametric version by letting H lie in an RKHS and taking the penalty to be $\|H\|_K^2$. In terms of implementation, this means replacing every instance of an inner product $\langle X_i, X_j \rangle$ with $K(X_i, X_j)$.
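For example, a Gaussian-kernel SVM can be fit with an off-the-shelf implementation such as scikit-learn's SVC, which carries out exactly this replacement of inner products by kernel evaluations; the data and the parameter values below are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two classes in R^2 (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)

# Gaussian-kernel SVM: every inner product <Xi, Xj> in the dual problem
# is replaced by K(Xi, Xj) = exp(-gamma * ||Xi - Xj||^2).
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print(clf.predict(X[:5]))
```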
7 Boosting
Boosting refers to a class of methods that build classifiers in a greedy, iterative way. The
original boosting algorithm is called AdaBoost and is due to Freund and Schapire (1996).
See Figure 3.
The algorithm seems mysterious and there is quite a bit of controversy about why (and when) it works. Perhaps the most compelling explanation is due to Friedman, Hastie and Tibshirani (2000), which is the explanation we will give. However, the reader is warned that there is no consensus on the issue. Further discussions can be found in Bühlmann and Hothorn (2007), Zhang and Yu (2005) and Mease and Wyner (2008). The latter paper is followed by a spirited discussion from several authors. Our view is that boosting combines two distinct ideas: surrogate loss functions and greedy function approximation.
In this section, we assume that Yi ∈ {−1, +1}. Many classifiers then have the form
h(x) = sign(H(x))
for some function H(x). For example, a linear classifier corresponds to $H(x) = \beta^T x$. The risk can then be written as
$$R(h) = P(Y \neq h(X)) = P(Y H(X) < 0) = E(L(A))$$
where $A = Y H(X)$ and $L(a) = I(a < 0)$. As a function of a, the loss L(a) is discontinuous, which makes it difficult to work with. Friedman, Hastie and Tibshirani (2000) show that
AdaBoost corresponds to using a surrogate loss, namely, $L(a) = e^{-a} = e^{-yH(x)}$. Consider finding a classifier of the form $\sum_m \alpha_m h_m(x)$ by minimizing the exponential loss $\sum_i e^{-Y_i H(X_i)}$. If we do this iteratively, adding one function at a time, this leads precisely to AdaBoost. Typically, the classifiers $h_m$ in the sum $\sum_m \alpha_m h_m(x)$ are taken to be very simple classifiers such as small classification trees.
The argument in Friedman, Hastie and Tibshirani (2000) is as follows. Consider minimizing the expected loss $J(F) = E(e^{-Y F(X)})$. Suppose our current estimate is F and consider updating to an improved estimate F(x) + c f(x). Expanding around f(x) = 0,
$$J(F + cf) = E\left(e^{-Y(F(X) + c f(X))}\right) \approx E\left(e^{-Y F(X)}\big(1 - cY f(X) + c^2 Y^2 f^2(X)/2\big)\right) = E\left(e^{-Y F(X)}\big(1 - cY f(X) + c^2/2\big)\right)$$
since $Y^2 = f^2(X) = 1$. Now consider minimizing the latter expression at a fixed X = x. If we minimize over $f(x) \in \{-1, +1\}$ we get f(x) = 1 if $E_w(y \mid x) > 0$ and f(x) = -1 if $E_w(y \mid x) < 0$.
1. Input: $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $Y_i \in \{-1, +1\}$.

2. Set the initial weights $w_i = 1/n$, $i = 1, \ldots, n$.

3. Repeat for m = 1, ..., M:

   (a) Compute the weighted error $\epsilon(h) = \sum_{i=1}^n w_i I(Y_i \neq h(X_i))$ and find $h_m$ to minimize $\epsilon(h)$.

   (b) Let $\alpha_m = (1/2)\log((1 - \epsilon)/\epsilon)$.

   (c) Update the weights:
   $$w_i \leftarrow \frac{w_i\, e^{-\alpha_m Y_i h_m(X_i)}}{Z}$$
   where Z is chosen so that the weights sum to 1.

4. The final classifier is $\hat{h}(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m h_m(x)\right)$.

Figure 3: AdaBoost
Here $E_w(y \mid x) = E(w(x, y)\, y \mid x)/E(w(x, y) \mid x)$ and $w(x, y) = e^{-yF(x)}$. In other words, the optimal f is simply the Bayes classifier with respect to the weights. This is exactly the first step in AdaBoost. If we now fix f(x) and minimize over c we get
$$c = \frac{1}{2}\log\left(\frac{1 - \epsilon}{\epsilon}\right)$$
where $\epsilon = E_w(I(Y \neq f(x)))$. Thus the updated F(x) is $F(x) + c f(x)$, which is exactly the AdaBoost update.
Seen in this light, boosting really combines two ideas. The first is the use of surrogate loss
functions. The second is greedy function approximation.
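A minimal sketch of the AdaBoost algorithm of Figure 3, assuming decision stumps as the base classifiers $h_m$; the stump search and the small cap on the weighted error are illustrative implementation choices.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively find the decision stump (feature, threshold, sign) with smallest weighted error."""
    best = (None, None, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] <= t, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[3]:
                    best = (j, t, s, err)
    return best

def adaboost(X, y, M=50):
    """AdaBoost as in Figure 3; y takes values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 2: initial weights
    stumps, alphas = [], []
    for _ in range(M):                           # step 3
        j, t, s, err = fit_stump(X, y, w)        # (a) weighted-error minimizer
        err = max(err, 1e-12)                    # guard against division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # (b)
        pred = s * np.where(X[:, j] <= t, 1, -1)
        w = w * np.exp(-alpha * y * pred)        # (c) reweight and renormalize
        w /= w.sum()
        stumps.append((j, t, s))
        alphas.append(alpha)
    def predict(Xnew):
        H = sum(a * s * np.where(Xnew[:, j] <= t, 1, -1)
                for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(H)                        # step 4: sign of the weighted vote
    return predict
```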
All the nonparametric methods involve tuning parameters, for example, the number of neigh-
bors k in nearest neighbors. As with density estimation and regression, these parameters
can be chosen by a variety of cross-validation methods. Here we describe the data splitting
version of cross-validation. Suppose the data are (X1 , Y1 ), . . . , (X2n , Y2n ). Now randomly
split the data into two halves that we denote by
$$\mathcal{D} = \Big\{(X_1, Y_1), \ldots, (X_n, Y_n)\Big\} \quad \text{and} \quad \mathcal{E} = \Big\{(X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)\Big\}.$$
Construct a finite set of classifiers $\mathcal{H} = \{h_1, \ldots, h_N\}$ from the first half $\mathcal{D}$ (for example, one classifier for each value of the tuning parameter) and let $\hat{R}(h)$ denote the error rate of h computed on the second half $\mathcal{E}$. Let $\hat{h} = \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}(h)$.
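A minimal sketch of this data-splitting procedure for choosing k in k-nearest neighbors; the helper knn_fit_predict is assumed to be any routine that fits on one half and predicts on the other (for example, the k-NN sketch given earlier applied row by row).

```python
import numpy as np

def choose_k_by_splitting(X, y, candidate_ks, knn_fit_predict, rng=None):
    """Choose the number of neighbors k by data splitting.

    X: (2n, d) covariates, y: (2n,) labels. Half the data (D) is used to build
    one k-NN classifier per candidate k; the other half (E) estimates each
    classifier's error rate, and the k with the smallest estimate is returned.
    knn_fit_predict(X_train, y_train, X_test, k) must return an array of predictions.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    D, E = idx[:half], idx[half:]
    errors = {}
    for k in candidate_ks:
        preds = knn_fit_predict(X[D], y[D], X[E], k)
        errors[k] = np.mean(preds != y[E])       # empirical error rate on E
    return min(errors, key=errors.get)
```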
We then have the following result. Let $N = |\mathcal{H}|$ and let $\hat{h}^* = \operatorname{argmin}_{h \in \mathcal{H}} R(h)$. Then, with probability at least $1 - \delta$,
$$R(\hat{h}) \le R(\hat{h}^*) + 2\sqrt{\frac{1}{2n}\log\left(\frac{2N}{\delta}\right)}.$$
Proof. By Hoeffding's inequality, $P(|\hat{R}(h) - R(h)| > \epsilon) \le 2e^{-2n\epsilon^2}$, for each $h \in \mathcal{H}$. By the union bound,
$$P\left(\max_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| > \epsilon\right) \le 2N e^{-2n\epsilon^2} = \delta$$
where $\epsilon = \sqrt{\frac{1}{2n}\log\left(\frac{2N}{\delta}\right)}$. Hence, except on a set of probability at most δ,
$$R(\hat{h}) \le \hat{R}(\hat{h}) + \epsilon \le \hat{R}(\hat{h}^*) + \epsilon \le R(\hat{h}^*) + 2\epsilon.$$
Note that the difference between $R(\hat{h})$ and $R(\hat{h}^*)$ is $O(\sqrt{\log N / n})$ but in regression it was $O(\log N / n)$, which is an interesting difference between the two settings. Under low noise conditions, the error can be improved.
9 Example
The following data are from simulated images of gamma ray events for the Major Atmo-
spheric Gamma-ray Imaging Cherenkov Telescope (MAGIC) in the Canary Islands. The
data are from archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. The telescope
studies gamma ray bursts, active galactic nuclei and supernovae remnants. The goal is to
predict if an event is real or is background (hadronic shower). There are 11 predictors that
are numerical summaries of the images. We randomly selected 400 training points (200 pos-
itive and 200 negative) and 1000 test cases (500 positive and 500 negative). The results of
various methods are in Table 2. See Figures 4, 5, 6, 7.
For high dimensional problems we can use sparsity-based methods. The nonparametric additive logistic model is
$$P(Y = 1 \mid X) \equiv p(X; f) = \frac{\exp\left(\sum_{j=1}^p f_j(X_j)\right)}{1 + \exp\left(\sum_{j=1}^p f_j(X_j)\right)} \tag{17}$$
Method Test Error
Logistic regression 0.23
SVM (Gaussian Kernel) 0.20
Kernel Regression 0.24
Additive Model 0.20
Reduced Additive Model 0.20
11-NN 0.25
Trees 0.20
Table 2: Various methods on the MAGIC data. The reduced additive model is based on
using the three most significant variables from the additive model.
[Figure: test error (roughly 0.25 to 0.29) plotted against values from 0 to 50; only the axis values are recoverable.]
Figure 7: Classification tree, with splits on the variables xtrain.V2, xtrain.V3, xtrain.V4, xtrain.V6, xtrain.V7, xtrain.V8, xtrain.V9 and xtrain.V10. The size of the tree was chosen by cross-validation.
where $f(X) = \sum_{j=1}^p f_j(X_j)$. To fit this model, the local scoring algorithm runs the backfitting procedure within Newton's method. One iteratively computes the transformed response for the current estimate $\hat{f}$,
$$Z_i = \hat{f}(X_i) + \frac{Y_i - p(X_i; \hat{f})}{p(X_i; \hat{f})(1 - p(X_i; \hat{f}))} \tag{19}$$
and weights $w(X_i) = p(X_i; \hat{f})(1 - p(X_i; \hat{f}))$, and carries out a weighted backfitting of (Z, X) with weights w. The weighted smooth is given by
$$\hat{P}_j = \frac{S_j(w R_j)}{S_j w} \tag{20}$$
where $S_j$ is a linear smoothing matrix, such as a kernel smoother, and $R_j$ is the partial residual for the $j$th component. This extends iteratively reweighted least squares to the nonparametric setting.
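A minimal sketch of one local scoring iteration for the additive logistic model, assuming a weighted Nadaraya-Watson smoother plays the role of each $S_j$; the bandwidth and function names are illustrative.

```python
import numpy as np

def nw_smooth(x, values, weights, h=0.5):
    """Weighted Nadaraya-Watson smoother: the smooth of `values` evaluated at each x."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)   # n x n kernel matrix
    return (K @ (weights * values)) / (K @ weights)           # S_j(w v) / (S_j w)

def local_scoring_step(X, y, F):
    """One local scoring (Newton + backfitting) sweep for the additive logistic model.

    X: (n, p) covariates, y: (n,) labels in {0, 1},
    F: (n, p) current fitted component values f_j(X_ij). Returns updated F.
    """
    f = F.sum(axis=1)                        # current additive fit f(X_i)
    p_hat = 1.0 / (1.0 + np.exp(-f))         # p(X_i; f)
    w = p_hat * (1 - p_hat)                  # weights
    Z = f + (y - p_hat) / w                  # transformed response (19)
    for j in range(X.shape[1]):              # weighted backfitting sweep
        R_j = Z - (F.sum(axis=1) - F[:, j])      # partial residual for component j
        F[:, j] = nw_smooth(X[:, j], R_j, w)     # weighted smooth (20)
    return F
```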
A sparsity penalty can be incorporated, just as for sparse additive models (SpAM) for regression. The Lagrangian is given by
$$\mathcal{L}(f, \lambda) = E\left[\log\left(1 + e^{f(X)}\right) - Y f(X)\right] + \lambda\left(\sum_{j=1}^p \sqrt{E(f_j^2(X_j))} - L\right) \tag{21}$$
The stationary condition for (21) is nonlinear in f, and so we linearize the gradient of the log-likelihood around $\hat{f}$. This yields the linearized condition $E[w(X)(f(X) - Z) \mid X_j] + \lambda v_j = 0$. To see this, note that
$$0 = E\left[p(X; \hat{f}) - Y + p(X; \hat{f})(1 - p(X; \hat{f}))(f(X) - \hat{f}(X)) \,\Big|\, X_j\right] + \lambda v_j \tag{22}$$
$$= E\left[w(X)(f(X) - Z) \mid X_j\right] + \lambda v_j. \tag{23}$$
When $E(f_j^2) \neq 0$, this implies the condition
$$\left(E(w \mid X_j) + \frac{\lambda}{\sqrt{E(f_j^2)}}\right) f_j(X_j) = E(w R_j \mid X_j). \tag{24}$$
In the finite sample case, in terms of the smoothing matrix $S_j$, this becomes
$$f_j = \frac{S_j(w R_j)}{S_j w + \lambda\big/\sqrt{E(f_j^2)}}. \tag{25}$$
If $\|S_j(w R_j)\| < \lambda$, then $f_j = 0$. Otherwise, this implicit, nonlinear equation for $f_j$ cannot be solved explicitly, so one simply iterates until convergence:
$$f_j \leftarrow \frac{S_j(w R_j)}{S_j w + \lambda\sqrt{n}\big/\|f_j\|}. \tag{26}$$
When $\lambda = 0$, this yields the standard local scoring update (20).
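A minimal sketch of this thresholded component update, assuming $S_j$ is available as an explicit smoother matrix and the current $f_j$ is nonzero; the function name and iteration count are illustrative.

```python
import numpy as np

def spam_component_update(Sj, w, Rj, fj, lam, n_iter=50):
    """Sparse update (25)-(26) for one component f_j of logistic SpAM.

    Sj: (n, n) linear smoother matrix, w: (n,) weights, Rj: (n,) partial residuals,
    fj: (n,) current component values (assumed nonzero), lam: penalty level.
    """
    numer = Sj @ (w * Rj)                    # S_j(w R_j)
    if np.linalg.norm(numer) < lam:          # thresholding: the component is zeroed out
        return np.zeros_like(fj)
    n = len(fj)
    for _ in range(n_iter):                  # fixed-point iteration for the implicit equation (26)
        fj = numer / (Sj @ w + lam * np.sqrt(n) / np.linalg.norm(fj))
    return fj
```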
Example 6 (SpAM for Spam) Here we consider an email spam classification problem,
using the logistic SpAM backfitting algorithm above. This dataset has been studied by Hastie et al. (2001), using a set of 3,065 emails as a training set and conducting hypothesis tests to
choose significant variables; there are a total of 4,601 observations with p = 57 attributes, all
numeric. The attributes measure the percentage of specific words or characters in the email,
the average and maximum run lengths of upper case letters, and the total number of such
letters.
The results of a typical run of logistic SpAM are summarized in Figure 8, using plug-in
bandwidths. A held-out set is used to tune the regularization parameter λ.
Suppose we draw B bootstrap samples and each time we construct a classifier. This gives classifiers $h_1, \ldots, h_B$. We now classify by combining them:
$$h(x) = \begin{cases} 1 & \text{if } \frac{1}{B}\sum_j h_j(x) \ge \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$
  $\lambda\ (\times 10^{-3})$   Error         # zeros   selected variables
  5.5                           0.2009        55        {8, 54}
  4.5                           0.1354        46        {7, 8, 9, 17, 18, 27, 53, 54, 57, 58}
  4.0                           0.1083 (√)    20        {4, 6–10, 14–22, 26, 27, 38, 53–58}

Figure 8: (Email spam) Classification accuracies and variable selection for logistic SpAM.
This is called bagging, which stands for bootstrap aggregation. The base classifiers are usually trees.
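A minimal sketch of bagging, assuming classification trees (here scikit-learn's DecisionTreeClassifier) as the base classifiers; B and the tree settings are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, rng=None):
    """Bagging: fit one tree per bootstrap sample and classify by majority vote."""
    rng = rng or np.random.default_rng(0)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample (with replacement)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    def predict(Xnew):
        votes = np.mean([t.predict(Xnew) for t in trees], axis=0)
        return (votes >= 0.5).astype(int)                 # majority vote, labels in {0, 1}
    return predict
```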
A variation is to choose a random subset of the predictors to split on at each stage. The resulting classifier is called a random forest. Random forests often perform very well, although their theoretical performance is not well understood. Some good references are:
Biau, Devroye and Lugosi (2008). Consistency of Random Forests and Other Averaging Classifiers. JMLR.
Lin and Jeon (2006). Random Forests and Adaptive Nearest Neighbors. Journal of the American Statistical Association, 101, 578.
Wager, S. (2014). Asymptotic Theory for Random Forests. arXiv:1405.0352.
Wager, S. (2015). Uniform Convergence of Random Forests via Adaptive Concentration. arXiv:1503.06388.
Now we consider the multiclass version. Suppose we have the nonparametric K-class logistic regression model
$$p_f(Y = \ell \mid X) = \frac{e^{f_\ell(X)}}{\sum_{m=1}^K e^{f_m(X)}}, \qquad \ell = 1, \ldots, K \tag{27}$$
where each function has an additive form
$$f_\ell(X) = f_{\ell 1}(X_1) + f_{\ell 2}(X_2) + \cdots + f_{\ell p}(X_p). \tag{28}$$
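A minimal sketch of computing the class probabilities (27) under the additive form (28); the array layout is an assumption made for the illustration.

```python
import numpy as np

def multiclass_additive_probs(F):
    """Class probabilities (27) for the additive model (28).

    F: (n, K, p) array with F[i, l, j] = f_{lj}(X_ij), the value of the j-th
    component function for class l at observation i.
    Returns an (n, K) array of probabilities p_f(Y = l | X_i).
    """
    f = F.sum(axis=2)                          # f_l(X_i) = sum_j f_{lj}(X_ij)
    f = f - f.max(axis=1, keepdims=True)       # subtract the max for numerical stability
    ef = np.exp(f)
    return ef / ef.sum(axis=1, keepdims=True)  # normalize by Z(X_i) = sum_m e^{f_m(X_i)}
```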
In Newton's algorithm, we minimize the quadratic approximation to the log-likelihood
$$L(f) \approx L(\hat{f}) + E\left[(Y - \hat{p})^T (f - \hat{f})\right] + \frac{1}{2} E\left[(f - \hat{f})^T H(\hat{f})(f - \hat{f})\right] \tag{29}$$
where $\hat{p}(X) = (p_{\hat{f}}(Y = 1 \mid X), \ldots, p_{\hat{f}}(Y = K \mid X))$, and $H(\hat{f}(X))$ is the Hessian.
The above calculation can be reexpressed as follows, which leads to a multiclass backfitting algorithm. The difference in log-likelihoods for functions $\{\hat{f}_\ell\}$ and $\{f_\ell\}$ is, to second order,
$$\sum_{\ell=0}^{K-1} E\left[ p_\ell(X)\left( \hat{f}_\ell(X) - \sum_{k=0}^{K-1} p_k(X)\hat{f}_k(X) + \frac{Y_\ell - p_\ell(X)}{p_\ell(X)} - f_\ell(X) + \sum_{k=0}^{K-1} p_k(X) f_k(X) \right)^{2} \right] \tag{35}$$
where $p_\ell(X) = P(Y = \ell \mid X)$, and $Y_\ell = \delta(Y, \ell)$ are indicator variables. Minimizing over $\{f_\ell\}$ gives coupled equations for the functions $f_\ell$; they cannot be solved independently over $\ell$.
A practical approach is to use coordinate descent, computing the function $f_\ell$ holding the other functions $\{f_k\}_{k \neq \ell}$ fixed, and iterating. Assuming that $f_k = \hat{f}_k$ for $k \neq \ell$, this simplifies to
$$E\left[ p_\ell (1 - p_\ell)^2 \left( \hat{f}_\ell + \frac{Y_\ell - p_\ell}{p_\ell(1 - p_\ell)} - f_\ell \right)^{2} + \sum_{k \neq \ell} p_k p_\ell^2 \left( \hat{f}_\ell + \frac{p_k - Y_k}{p_k p_\ell} - f_\ell \right)^{2} \right]. \tag{36}$$
After some algebra, this can be seen to be the same as the usual objective function in the binary case, where we take $\hat{f}_0 = 1$ and $\hat{f}_1$ arbitrary.
Now assume $f_\ell$ (and $\hat{f}_\ell$) has an additive form: $f_\ell(X) = \sum_{j=1}^p f_{\ell j}(X_j)$. Some further calculation shows that minimizing over each $f_{\ell j}$ yields the following backfitting algorithm:
$$f_{\ell j}(X_j) \leftarrow \frac{E\left[ p_\ell(1 - p_\ell)\left( \hat{f}_\ell - \sum_{k \neq j} f_{\ell k} + \frac{Y_\ell - p_\ell}{p_\ell(1 - p_\ell)} \right) \,\Big|\, X_j \right]}{E\left[ p_\ell(1 - p_\ell) \mid X_j \right]}. \tag{37}$$
In other words,
$$f_{\ell j}(X_j) \leftarrow \frac{E\left[ w_\ell(X) R_{\ell j}(X) \mid X_j \right]}{E\left[ w_\ell(X) \mid X_j \right]} \tag{38}$$
where
$$R_{\ell j}(X) = \hat{f}_\ell(X) - \sum_{k \neq j} f_{\ell k}(X_k) + \frac{Y_\ell - p_\ell(X)}{p_\ell(X)(1 - p_\ell(X))} \tag{39}$$
$$w_\ell(X) = p_\ell(X)(1 - p_\ell(X)). \tag{40}$$
This is the same as in binary logistic regression. We thus have the following algorithm:
For each $\ell = 0, 1, \ldots, K - 1$:

A. Initialize $f_\ell = \hat{f}_\ell$.

B. Iterate until convergence: for each $j = 1, 2, \ldots, p$, update $f_{\ell j}(X_j)$ using (38).

C. Incrementally update the normalizing constants $\hat{Z}(X)$.

D. Set $\hat{f}_\ell \leftarrow f_\ell$.
Incrementally updating the normalizing constants (step C) is important so that the probabilities $p_\ell(X) = e^{f_\ell(X)}/\hat{Z}(X)$ can be efficiently computed, and we avoid an $O(K^2)$ algorithm.