
CIS 520: Machine Learning Spring 2021: Lecture 9

Decision Trees and Nearest Neighbor Methods

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
• Introduction

• Decision trees

• Nearest neighbor methods

1 Introduction

We have previously seen a number of algorithms for learning parametric classification and regression models
from data, where the form of the model, and hence the number of parameters to be estimated from data,
is fixed. In this lecture, we will see two classes of non-parametric methods: decision trees and nearest
neighbor methods.¹ Decision trees enjoy the benefit of interpretability: the learned models are easy
for humans to interpret. Nearest neighbor methods are local, memory-based methods, which store all the
training examples in memory and make predictions on new test points based on a few ‘nearby’ points in the
training sample; they are both simple and intuitive, and enjoy good consistency properties.

2 Decision Trees

Decision tree models are used for both classification and regression problems. We will describe the models
mostly for settings where instances contain numerical features, but they can also be used in settings with
categorical features.
To illustrate the basic form of a decision tree model, consider a binary classification problem on an instance
space with 2 features, X = R2 . Then Figure 1(a) shows an example of a decision tree classification model.
Specifically, given a test instance x ∈ R2 , this model first tests whether x1 > 5. If so, it proceeds to test
whether x2 > 6; if this is true, then it classifies the instance as +1, else it conducts a further test on x1 in
order to make a classification. On the other hand, if x1 ≤ 5, then the model next tests whether x2 > 2;
if so, it classifies the instance as +1, else as −1. Figure 1(b) shows the decision boundary or partition of
the instance space X corresponding to this model. One can use similar models for multiclass classification;
in this case, each leaf node will be labeled with one of K classes, corresponding to the predicted class for
instances that belong to that leaf node.
¹ Note that SVMs/logistic regression/least squares regression with RBF kernels, and neural networks wherein the number of hidden units is allowed to grow with the number of training examples, also effectively yield non-parametric models.


Figure 1: (a) A decision tree for binary classification over a 2-dimensional instance space. (b) The partition
of the instance space induced by the decision tree in (a).
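In code form, the model in Figure 1(a) is just a nest of threshold tests. The Python sketch below (illustrative, not part of the original notes) writes it out; since the text does not spell out the second test on x1, its threshold and the two resulting labels are left as placeholder arguments:

    def classify(x, t, label_if_gt, label_if_le):
        # The tree of Figure 1(a) as nested threshold tests; x[0] plays the role
        # of x1 and x[1] of x2.  The second test on x1 is not spelled out in the
        # text, so its threshold t and the two resulting labels are placeholders.
        if x[0] > 5:
            if x[1] > 6:
                return +1
            return label_if_gt if x[0] > t else label_if_le   # the further test on x1
        return +1 if x[1] > 2 else -1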

Similarly, for regression, each leaf node in the tree will be labeled with a real-valued number, corresponding
to the predicted value for instances belonging to that leaf node; in this case, the resulting regression function
is a piece-wise constant function (taking a constant value over the region of the instance space corresponding
to each leaf node in the tree model).
Decision trees are easy for humans to interpret, and have therefore been widely used in medical and other
domains where it is desirable not only to make good predictions, but also to understand how a model reaches
its predictions. Two natural questions that come up are the following:
1. How do we learn a good decision tree model from data?
2. How do we evaluate a learned decision tree model?
The second question is easy to answer. A decision tree model is just a specific type of classification or
regression model, and is evaluated similarly to other models: the ideal performance measure is the expected
loss (e.g. 0-1 or squared loss) on new examples from the underlying distribution generating the data; in
practice, one measures the average loss on new examples in a test set.
The first question, of learning a decision tree model from data, is what we will focus on below. We will start
with regression trees, and then discuss classification trees.

2.1 Regression Trees

Consider first a regression problem, with instance space $\mathcal{X} = \mathbb{R}^d$ and label and prediction spaces $\mathcal{Y} = \hat{\mathcal{Y}} = \mathbb{R}$.²
Given a training sample S = ((x1 , y1 ), . . . , (xm , ym )), our goal is to learn a good regression tree model.
Let us introduce some notation. For any regression tree T , denote by L(T ) the set of leaf nodes of T . Each
leaf node l corresponds to a region in the instance space X , such that predictions for instances in that region
are made according to that leaf node; denote by Xl ⊆ X the region corresponding to l, and by cl ∈ R the
constant used to predict labels of instances in Xl .
We will abuse notation somewhat, and for any instance x, will denote by l(x) the leaf node whose region
contains x. Then the regression model defined by T is given by
$$f_T(x) = c_{l(x)} .$$
² As noted above, decision trees can be used with categorical features too; we describe them for the case of numerical features for simplicity, but it should be easy to see how they can be extended to the case of categorical features.

We would ideally like to find a regression tree with small training error, e.g. as measured by squared loss:
$$\widehat{\mathrm{er}}^{\mathrm{sq}}_S[f_T] = \frac{1}{m} \sum_{i=1}^{m} \big( f_T(x_i) - y_i \big)^2 = \frac{1}{m} \sum_{l \in L(T)} \sum_{i : x_i \in X_l} (c_l - y_i)^2 .$$

Finding $c_l$ given a fixed tree structure. If the structure of the regression tree T (i.e. the splits defining
the nodes and therefore the partition of X induced by the resulting leaf nodes) is fixed, then the above
objective is minimized by choosing $c_l$ for each leaf node l as
$$c_l \in \arg\min_{c \in \mathbb{R}} \sum_{i : x_i \in X_l} (c - y_i)^2 .$$

Clearly, this minimum is achieved by choosing $c_l$ to be the average value of the labels of training instances
that fall in the region $X_l$ corresponding to leaf node l:
$$c_l = \frac{1}{m_l} \sum_{i : x_i \in X_l} y_i ,$$

where $m_l$ is the number of training instances in $X_l$:
$$m_l = \big| \{ i \in [m] : x_i \in X_l \} \big| .$$

Finding a good tree structure. The main question, then, is how to choose a good tree structure.
Finding an exact optimal tree structure would entail a combinatorial search. Instead, one usually uses a
greedy algorithm to learn a good tree structure. Several variants of tree learning algorithms are used; most
follow roughly the following approach to grow a tree:

• Start with a single leaf node containing all instances


• Repeat until a suitable stopping criterion is reached:

– For each ‘eligible’ leaf node in the current tree:


∗ For each candidate split of the leaf node (each variable and threshold combination):
· Compute the training error of the tree obtained with this split
– Choose the split that gives lowest training error

A few comments are in order. First, for the stopping criterion, one could stop when the drop in training
error falls below some pre-specified threshold; sometimes, however, a poor split that barely decreases the
training error is followed by further splits that decrease it significantly. Therefore,
stopping criteria are often based on the number of training examples contained in each leaf node instead (e.g.
stop when the regions Xl corresponding to the leaf nodes l all contain at most some pre-specified number
of training examples). Second, what constitutes an ‘eligible’ leaf node for splitting varies according to the
specific algorithm, but some common criteria include number of training examples (e.g. a leaf node may be
eligible to split as long as the number of training examples it contains exceeds some pre-specified number),
and ‘uniformity’ of the labels of training examples in the leaf node (e.g. a leaf node may be eligible to split as
long as the variance among training labels in the leaf node exceeds some pre-specified value). Finally, when
considering candidate splits of a leaf node, for any given variable, it is sufficient to consider only a finite
number of thresholds that correspond to transition points between training examples in that leaf node.
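For concreteness, here is a minimal Python sketch of the growing procedure (illustrative code, not from the notes; the names best_split, grow, predict and min_leaf_size are our own). Eligibility and stopping are based on leaf size, candidate thresholds are the transition points mentioned above, and each eligible leaf is split at its own best split point; since squared errors add across leaves and the stopping rule is per-leaf, this yields the same final tree as repeatedly choosing, across all eligible leaves, the single split that most reduces the overall training error. Each node also records the labels of the examples reaching it, which the pruning sketch below will use.

    import numpy as np

    def best_split(X, y):
        # Best axis-aligned split of this leaf's data: returns (sse, feature, threshold)
        # minimizing the summed squared error of the two resulting leaves, or None if no
        # split is possible.  Candidate thresholds are midpoints between consecutive
        # distinct feature values (the 'transition points' mentioned above).
        best = None
        for j in range(X.shape[1]):
            values = np.unique(X[:, j])
            for t in (values[:-1] + values[1:]) / 2.0:
                left, right = y[X[:, j] <= t], y[X[:, j] > t]
                sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
                if best is None or sse < best[0]:
                    best = (sse, j, t)
        return best

    def grow(X, y, min_leaf_size=5):
        # A leaf is 'eligible' for splitting as long as it contains more than
        # min_leaf_size training examples.  Each node stores the labels y reaching it.
        split = best_split(X, y) if len(y) > min_leaf_size else None
        if split is None:                                  # stop: leaf with constant c_l
            return {'leaf': True, 'c': y.mean(), 'y': y}
        _, j, t = split
        mask = X[:, j] <= t
        return {'leaf': False, 'y': y, 'feature': j, 'threshold': t,
                'left': grow(X[mask], y[mask], min_leaf_size),
                'right': grow(X[~mask], y[~mask], min_leaf_size)}

    def predict(tree, x):
        # f_T(x) = c_{l(x)}: route x down to its leaf and return that leaf's constant.
        while not tree['leaf']:
            tree = tree['left'] if x[tree['feature']] <= tree['threshold'] else tree['right']
        return tree['c']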

Pruning to avoid overfitting. The greedy approach often leads to large trees with many nodes. Having
greedily built a regression tree T , it is common to then ‘prune’ back in order to avoid overfitting. The pruning
stage typically uses a ‘regularized’ training error to compare various candidate pruned versions, generally
regularized by the number of leaf nodes (here λ > 0 is a regularization parameter):
$$\widehat{\mathrm{er}}^{\mathrm{sq},(\lambda)}_S[f_T] = \widehat{\mathrm{er}}^{\mathrm{sq}}_S[f_T] + \lambda\, \big| L(T) \big| .$$

Again, many variants of pruning are used; most follow roughly the following approach:

• Start with the regression tree T learned by the greedy algorithm


• Repeat until a suitable stopping criterion is reached:
– For each internal node in the current tree:
∗ Consider collapsing the subtree rooted at this node into a single leaf node; compute the
regularized training error of the tree obtained with this pruning
– Choose the pruned tree that gives lowest regularized training error

Again, a few comments are in order. First, the stopping criterion is generally based either on the improvement
in regularized training error (e.g. stop when pruning no longer reduces the regularized training error by a
sufficiently large amount), or on the number of nodes (e.g. stop when the number of leaf nodes becomes
smaller than some pre-specified value). Second, for the choice of the regularization parameter λ, one often
uses a (cross-)validation approach: i.e. hold out a validation set (or repeat on multiple cross-validation folds),
consider several values of λ, learn a regression tree (by first greedily building and then pruning) on just the
training portion for each value of λ, test the performance of the pruned tree for each λ on the held-out
portion, and keep the tree with best validation error.
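A minimal pruning sketch, continuing the grow code above (which stores at every node the labels y of the training examples reaching it): rather than the iterative collapse loop described above, it performs a bottom-up variant, collapsing a subtree into a leaf whenever doing so does not increase the regularized training error for the given λ. For a given λ, one would call prune(grow(X_train, y_train), lam)[0]; λ itself is then chosen by (cross-)validation as described above.

    def collapse_sse(node):
        # Squared error on the training data if this subtree were a single leaf.
        y = node['y']
        return ((y - y.mean()) ** 2).sum()

    def prune(node, lam):
        # Collapse a subtree into a leaf whenever that does not increase the
        # regularized training error  sse + lam * (number of leaves).
        # Returns (pruned node, its sse, its number of leaves).
        # The sse is unnormalized here; dividing by m merely rescales lam.
        if node['leaf']:
            return node, collapse_sse(node), 1
        left, sse_l, n_l = prune(node['left'], lam)
        right, sse_r, n_r = prune(node['right'], lam)
        sse_keep, n_keep = sse_l + sse_r, n_l + n_r
        sse_cut = collapse_sse(node)
        if sse_cut + lam <= sse_keep + lam * n_keep:       # collapsing is no worse
            return {'leaf': True, 'c': node['y'].mean(), 'y': node['y']}, sse_cut, 1
        return dict(node, left=left, right=right), sse_keep, n_keep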
Putting everything together. If pruning is used, then the greedy tree growing phase followed by the
pruning phase jointly constitute the learning algorithm, which at the end of the two phases produces a
regression tree from the given training data.

2.2 Classification Trees

Consider now a binary classification problem, with instance space $\mathcal{X} = \mathbb{R}^d$ and label and prediction spaces
$\mathcal{Y} = \hat{\mathcal{Y}} = \{\pm 1\}$ (the ideas are easily extended to multiclass classification; we focus on the binary case for
simplicity). Given a training sample S = ((x1 , y1 ), . . . , (xm , ym )), our goal is to learn a good classification
tree model.
The basic approach is similar to the regression case. In this case, each leaf node l in a classification tree T is
associated with a number $\hat{\eta}_l \in \mathbb{R}$, which is used to estimate the probability of a positive label for instances x
falling in the corresponding region $X_l$; if $\hat{\eta}_l > \frac{1}{2}$, then the predicted class for an instance $x \in X_l$ is +1, else it
is −1. Thus the class probability estimation (CPE) model associated with a classification tree T is given by
$$\hat{\eta}_T(x) = \hat{\eta}_{l(x)} ;$$
the corresponding classifier is given by
$$h_T(x) = \mathrm{sign}\Big( \hat{\eta}_{l(x)} - \tfrac{1}{2} \Big) .$$

Given a fixed tree structure, it can be verified that the log loss (cross-entropy loss) of the associated CPE
model on the training sample is minimized by choosing $\hat{\eta}_l$ for each leaf node l as the fraction of training
instances in $X_l$ that have label +1:
$$\hat{\eta}_l = p_{l,+1} := \frac{1}{m_l} \sum_{i : x_i \in X_l} \mathbf{1}(y_i = +1) .$$
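In code, with made-up labels for a single leaf, this estimate and the resulting prediction amount to the following (illustrative snippet):

    import numpy as np

    def leaf_estimates(y_leaf):
        # Class-probability estimate and predicted label for one leaf, given the
        # +/-1 labels of the training examples falling in it.
        eta_hat = np.mean(np.asarray(y_leaf) == +1)      # fraction of positive labels
        return eta_hat, (+1 if eta_hat > 0.5 else -1)

    print(leaf_estimates([+1, +1, -1]))                   # (0.666..., +1)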

Now, in order to find a good tree structure, one could in principle look for a tree that minimizes the 0-1
classification error on the training sample. However, the 0-1 error is not very sensitive to the ‘purity’ of leaf
nodes. For example, consider the following two splits of a leaf node containing 6 positive and 2 negative
training examples:

Both splits produce leaf nodes of the same sizes (containing the same numbers of examples) and with the
same overall number of classification errors, even though the first split includes a ‘pure’ leaf node in which
all examples have the same class label. The 0-1 error would not be able to distinguish between the two
splits, even though intuitively we may want to prefer the split with the pure node. To encourage selection
of splits producing more ‘pure’ leaf nodes, various measures of ‘impurity’ are often used instead of 0-1 error
when growing a classification tree. Two of the most widely used impurity measures are the following:

• Entropy: The entropy of (the set of examples contained in) a leaf node l is defined as
$$H_l = -p_{l,+1} \log_2 p_{l,+1} - (1 - p_{l,+1}) \log_2 (1 - p_{l,+1}) .$$
The entropy takes its smallest value (0) when $p_{l,+1} = 0$ or $p_{l,+1} = 1$ (pure node); it takes its largest value
(1) when $p_{l,+1} = \frac{1}{2}$ (maximally impure node). The quality of a full tree T (or rather, of the partition of
X induced by T), measured in terms of entropy, is then defined as
$$H_T = \sum_{l \in L(T)} \frac{m_l}{m} H_l .$$

Smaller values of entropy are preferred.


• Gini index: The Gini index of (the set of examples contained in) a leaf node l is defined as
$$G_l = p_{l,+1} (1 - p_{l,+1}) .$$
Again, the Gini index takes its smallest value (0) when $p_{l,+1} = 0$ or $p_{l,+1} = 1$ (pure node); it takes its
largest value ($\frac{1}{4}$) when $p_{l,+1} = \frac{1}{2}$ (maximally impure node). The quality of a full tree T (or rather, of
the partition of X induced by T), measured in terms of Gini index, is then defined as
$$G_T = \sum_{l \in L(T)} \frac{m_l}{m} G_l .$$

Smaller values of Gini index are preferred.

When considering a split of a particular leaf node l into two leaf nodes l₁ and l₂, to measure the reduction
in entropy induced by the resulting split, one can simply evaluate what is termed the information gain
(equivalent to entropy reduction), defined as
$$\mathrm{IG}(l, l_1, l_2) = H_l - \Big( \frac{m_{l_1}}{m_l} H_{l_1} + \frac{m_{l_2}}{m_l} H_{l_2} \Big) .$$

Thus, in the greedy tree growing phase, given a current tree, one evaluates the information gain associated
with each candidate split under consideration, and chooses the split yielding the largest information gain
(largest reduction in entropy).
One can similarly define the Gini reduction associated with a split; if using the Gini index criterion, one
then chooses a split with the largest Gini reduction.
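The Python snippet below is illustrative: the two candidate splits are made-up class counts chosen to match the situation described above (equal leaf sizes, equal 0-1 error, but only one split with a pure leaf), and they need not coincide with the figure. It computes the information gain and the Gini reduction of each candidate split:

    import math

    def entropy(p):
        # Binary entropy in bits; H(0) = H(1) = 0.
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def gini(p):
        return p * (1 - p)

    def split_reduction(parent, children, impurity):
        # Impurity reduction when a leaf with class counts parent = (n_pos, n_neg)
        # is split into leaves with the class counts listed in children.
        # With impurity = entropy this is the information gain IG(l, l1, l2).
        m_l = sum(parent)
        before = impurity(parent[0] / m_l)
        after = sum(sum(c) / m_l * impurity(c[0] / sum(c)) for c in children)
        return before - after

    # A leaf with 6 positive and 2 negative examples, and two candidate splits:
    split_A = [(4, 0), (2, 2)]   # contains a pure leaf
    split_B = [(3, 1), (3, 1)]   # no pure leaf
    for name, split in [("A", split_A), ("B", split_B)]:
        print(name,
              "information gain:", round(split_reduction((6, 2), split, entropy), 4),
              "Gini reduction:", round(split_reduction((6, 2), split, gini), 4))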
When pruning, one usually uses simply the regularized 0-1 error.
Exercise. Calculate the information gain associated with each of the two splits in the example above
(where a leaf node containing 6 positive and 2 negative examples is being considered for splitting into two
leaf nodes). Which split would be preferred based on the entropy criterion? Repeat the same using the Gini
index criterion.
Exercise. Show that choosing a split with maximal information gain is equivalent to choosing a split that
yields minimal cross-entropy loss (on the training sample S).

3 Nearest Neighbor Methods

The idea behind nearest neighbor methods is conceptually very simple. Basically, one simply stores all the
given training examples in memory, and when asked to make a prediction on a new test point, searches the
training examples to find the ‘nearest’ training point and returns its label (or finds a few nearest training
points and averages their labels in some way). The notion of ‘nearest’ needs a distance measure; in Euclidean
space, this is most commonly taken to be the Euclidean distance.
More formally, suppose instances are feature vectors in X = Rd , and say we are given a training sample
S = ((x1 , y1 ), . . . , (xm , ym )) ∈ (X × Y)m , where Y could be {±1} in the case of binary classification,
{1, . . . , K} in the case of multiclass classification, or R in the case of regression. We start by discussing the
case of using a single nearest neighbor for prediction, and then discuss the extension to using more neighbors.

3.1 1-Nearest neighbor (1-NN)

Given a new test point x, the 1-NN algorithm simply finds the nearest point i∗ (x) in the training sample,
and predicts using its label. Specifically, for classification (both binary and multiclass), the 1-NN classifier
is given by
$$h_S(x) = y_{i^*(x)} ,$$

where i∗ (x) is the index of the nearest neighbor of x in S (breaking ties arbitrarily):

$$i^*(x) \in \arg\min_{i \in [m]} \| x_i - x \|_2 .$$

For regression, the 1-NN regression model is given by

$$f_S(x) = y_{i^*(x)} .$$

The 1-NN method leads to what is known as a Voronoi diagram, or a Voronoi tessellation of the instance
space X = Rd , where the space is divided up into polyhedral regions or ‘Voronoi cells’; each training point
xi is associated with one such Voronoi cell, such that for all points x in that cell, xi is the closest training
point in S and the predicted label is yi . Such Voronoi tessellations can be quite complex, and they become
increasingly complex as the number of training points m increases.
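A brute-force 1-NN predictor is only a few lines of Python (an illustrative sketch, with made-up data):

    import numpy as np

    def nn_predict(X_train, y_train, x):
        # 1-NN: return the label of the training point closest to x (Euclidean
        # distance; np.argmin breaks ties by taking the first such index).
        i_star = np.argmin(np.linalg.norm(X_train - x, axis=1))
        return y_train[i_star]

    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])   # made-up data
    y_train = np.array([-1, +1, +1])
    print(nn_predict(X_train, y_train, np.array([0.4, 0.2])))  # -1 (nearest point is [0, 0])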

3.2 k-Nearest neighbor (k-NN)

In this case, given a new test point x, the k-NN algorithm finds the k nearest points in the training sample,
and predicts by averaging their labels. Specifically, for multiclass classification, the k-NN classifier estimates
the probability of each class label y as
$$\hat{\eta}_y(x) = \frac{1}{k} \sum_{i \in N_k(x)} \mathbf{1}(y_i = y) ,$$

where Nk (x) denotes the set of k nearest neighbors of x in S (breaking ties arbitrarily). Classifications are
then based on estimated class probabilities and the target loss function; for example, under 0-1 loss, one
simply predicts the class with highest estimated probability, which amounts to taking a majority vote of the
class labels of the k nearest neighbors:

$$h_S(x) \in \arg\max_{y \in [K]} \hat{\eta}_y(x) .$$

Specializing the above to binary classification is straightforward.


For regression under squared loss, one simply averages the labels of the k nearest neighbors in S:
$$f_S(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i .$$

This acts as an estimate of the conditional expectation E[Y |X = x].
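Extending the 1-NN sketch above, the following illustrative Python functions implement brute-force k-NN classification by majority vote and k-NN regression by averaging:

    import numpy as np
    from collections import Counter

    def knn_indices(X_train, x, k):
        # Indices of the k training points nearest to x (Euclidean distance).
        return np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]

    def knn_classify(X_train, y_train, x, k):
        # Majority vote over the k nearest labels (the class with largest
        # estimated probability under 0-1 loss).
        votes = Counter(y_train[i] for i in knn_indices(X_train, x, k))
        return votes.most_common(1)[0][0]

    def knn_regress(X_train, y_train, x, k):
        # Average of the k nearest labels: an estimate of E[Y | X = x].
        return y_train[knn_indices(X_train, x, k)].mean()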


Larger values of k lead to smoother (simpler) models/decision boundaries. Indeed, note that a 1-NN model
has zero training error (unless the same instance appears with two different labels in the training sample),
suggesting it is a complex model that can overfit; on the other hand, averaging over a larger number of
training examples yields a smoother model that may have non-zero training error. The number k of nearest
neighbors used therefore controls the model complexity; more specifically, the ratio m/k acts as a model
complexity parameter (the higher the parameter value, the more complex the model).

3.3 Consistency Results for Nearest Neighbor Classification

For binary classification, there are two sets of classical results regarding the statistical convergence properties
of nearest neighbor classifiers. The first result (presented here in an abbreviated form) says that, for any
fixed k, for large enough sample size m, the 0-1 generalization error of the k-NN classifier, $\mathrm{er}^{\text{0-1}}_D[h_S]$, when
averaged over all training samples of size m (drawn from $D^m$), is at most twice the Bayes error:

Theorem 1 (Cover and Hart, 1967). Let $\mathcal{X} = \mathbb{R}^d$. Let D be any probability distribution on $\mathcal{X} \times \{\pm 1\}$.
Let k be any fixed positive integer, and let $h_S$ denote the k-NN classifier resulting from a training sample
S. Then
$$\lim_{m \to \infty} \mathbf{E}_{S \sim D^m}\big[ \mathrm{er}^{\text{0-1}}_D[h_S] \big] \;\le\; 2\, \mathrm{er}^{\text{0-1},*}_D \big( 1 - \mathrm{er}^{\text{0-1},*}_D \big) \;\le\; 2\, \mathrm{er}^{\text{0-1},*}_D .$$

 
The full result shows something stronger: it gives the precise limit $\lim_{m \to \infty} \mathbf{E}_{S \sim D^m}\big[ \mathrm{er}^{\text{0-1}}_D[h_S] \big]$ for each fixed
k, showing that as k increases, the limit of the error becomes smaller. However, as long as one uses any
fixed value of k, this limit is never equal to the Bayes error $\mathrm{er}^{\text{0-1},*}_D$, and therefore k-NN for any fixed k is not
(universally) consistent.

On the other hand, the following result shows that, if one allows k to depend on the number of training
examples m, then by choosing k to be a slowly growing function of m, one can achieve (universal) consistency:
the generalization error of the resulting algorithm actually converges to the Bayes error (for all D):
Theorem 2 (Stone, 1977). Let $\mathcal{X} = \mathbb{R}^d$. Let D be any probability distribution on $\mathcal{X} \times \{\pm 1\}$. Let $k_m$ be
such that $k_m \to \infty$ and $k_m / m \to 0$ as $m \to \infty$. Let $h_S$ denote the $k_m$-NN classifier resulting from a training sample
S of size m. Then
$$\lim_{m \to \infty} \mathbf{E}_{S \sim D^m}\big[ \mathrm{er}^{\text{0-1}}_D[h_S] \big] = \mathrm{er}^{\text{0-1},*}_D .$$

3.4 Practical Issues

Nearest neighbor methods typically don’t require much computation in the training phase, but they require
storing all the training examples in memory. For this reason, they are often referred to as memory-based
or instance-based methods.
The testing phase, however, is computationally expensive, since given a new test point, one needs to search
the training sample to find the k nearest neighbors. There has been much work on developing approximation
algorithms for nearest neighbor search, which may not return the exact nearest neighbors but return a set
of ‘approximate’ neighbors.
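For example (an illustrative sketch, not from the notes), SciPy's k-d tree supports fast exact nearest-neighbor queries in low dimensions; dedicated approximate-nearest-neighbor libraries push this further in high dimensions:

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X_train = rng.standard_normal((10000, 3))              # made-up training data in R^3
    y_train = rng.standard_normal(10000)

    tree = cKDTree(X_train)                                 # built once, at training time
    dists, idx = tree.query(rng.standard_normal(3), k=5)    # 5 nearest neighbors of a test point
    print(y_train[idx].mean())                              # k-NN regression prediction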
In general, nearest neighbor methods tend to suffer from the curse of dimensionality: as the dimensionality d
of the instance space increases, the number of training examples needed to construct reliable estimates of the
class probability function or the conditional expectation function via nearest neighbor averaging increases
exponentially with d.
