09 Decision Trees and Nearest Neighbor Methods
Outline
• Introduction
• Decision trees
• Nearest neighbor methods
1 Introduction
We have previously seen a number of algorithms for learning parametric classification and regression models from data, where the form of the model and the corresponding number of parameters to be estimated from data are fixed in advance. In this lecture, we will see two classes of non-parametric methods: decision trees and nearest neighbor methods.¹ Decision trees enjoy the benefit of interpretability: the learned models are easy for humans to understand and explain. Nearest neighbor methods are local, memory-based methods, which store all the training examples in memory and make predictions on a new test point based on a few ‘nearby’ points in the training sample; they are both simple and intuitive, and enjoy good consistency properties.
2 Decision Trees
Decision tree models are used for both classification and regression problems. We will describe the models
mostly for settings where instances contain numerical features, but they can also be used in settings with
categorical features.
To illustrate the basic form of a decision tree model, consider a binary classification problem on an instance
space with 2 features, X = R^2. Then Figure 1(a) shows an example of a decision tree classification model. Specifically, given a test instance x ∈ R^2, this model first tests whether x_1 > 5. If so, it proceeds to test whether x_2 > 6; if this is true, then it classifies the instance as +1, else it conducts a further test on x_1 in order to make a classification. On the other hand, if x_1 ≤ 5, then the model next tests whether x_2 > 2; if so, it classifies the instance as +1, else as −1. Figure 1(b) shows the decision boundary or partition of
the instance space X corresponding to this model. One can use similar models for multiclass classification;
in this case, each leaf node will be labeled with one of K classes, corresponding to the predicted class for
instances that belong to that leaf node.
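To make the prediction procedure concrete, here is a minimal Python sketch of the tree in Figure 1(a). The threshold and leaf labels used for the further test on x_1 are placeholders, since they are not specified above.

```python
def classify(x1: float, x2: float) -> int:
    """Classify a 2-d instance using the decision tree of Figure 1(a)."""
    if x1 > 5:
        if x2 > 6:
            return +1
        # Placeholder for the further (unspecified) test on x1.
        return +1 if x1 > 8 else -1
    return +1 if x2 > 2 else -1

# Example: the point (4, 3) satisfies x1 <= 5 and x2 > 2, so it is labeled +1.
print(classify(4.0, 3.0))  # +1
```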
¹ Note that SVMs/logistic regression/least squares regression with RBF kernels, and neural networks wherein the number of hidden units is allowed to grow with the number of training examples, also effectively yield non-parametric models.
Figure 1: (a) A decision tree for binary classification over a 2-dimensional instance space. (b) The partition
of the instance space induced by the decision tree in (a).
Similarly, for regression, each leaf node in the tree will be labeled with a real-valued number, corresponding
to the predicted value for instances belonging to that leaf node; in this case, the resulting regression function
is a piece-wise constant function (taking a constant value over the region of the instance space corresponding
to each leaf node in the tree model).
Decision trees are easy for humans to interpret, and have therefore been widely used in medical and other
domains where it is desirable not only to make good predictions, but also to understand how a model reaches
its predictions. Two natural questions that come up are the following:
1. How do we learn a good decision tree model from data?
2. How do we evaluate a learned decision tree model?
The second question is easy to answer. A decision tree model is just a specific type of classification or
regression model, and is evaluated similarly to other models: the ideal performance measure is the expected
loss (e.g. 0-1 or squared loss) on new examples from the underlying distribution generating the data; in
practice, one measures the average loss on new examples in a test set.
The first question, of learning a decision tree model from data, is what we will focus on below. We will start
with regression trees, and then discuss classification trees.
Consider first a regression problem, with instance space X = R^d and label and prediction spaces Y = Ŷ = R.²
Given a training sample S = ((x1 , y1 ), . . . , (xm , ym )), our goal is to learn a good regression tree model.
Let us introduce some notation. For any regression tree T, denote by L(T) the set of leaf nodes of T. Each leaf node l corresponds to a region in the instance space X, such that predictions for instances in that region are made according to that leaf node; denote by X_l ⊆ X the region corresponding to l, and by c_l ∈ R the constant used to predict labels of instances in X_l. We will abuse notation somewhat and, for any instance x, denote by l(x) the leaf node whose region contains x. Then the regression model defined by T is given by

f_T(x) = c_{l(x)} .
² As noted above, decision trees can be used with categorical features too; we describe them for the case of numerical features for simplicity, but it should be easy to see how they can be extended to the case of categorical features.
We would ideally like to find a regression tree with small training error, e.g. as measured by squared loss:
\hat{er}^{sq}_S[f_T] = \frac{1}{m} \sum_{i=1}^m \big( f_T(x_i) - y_i \big)^2 = \frac{1}{m} \sum_{l \in L(T)} \sum_{i: x_i \in X_l} (c_l - y_i)^2 .
Finding cl given a fixed tree structure. If the structure of the regression tree T (i.e. the splits defining
the nodes and therefore the partition of X induced by the resulting leaf nodes) is fixed, then the above
objective is minimized by choosing c_l for each leaf node l as

c_l \in \arg\min_{c \in R} \sum_{i: x_i \in X_l} (c - y_i)^2 .
Clearly, this minimum is achieved by choosing c_l to be the average value of the labels of the training instances that fall in the region X_l corresponding to leaf node l:

c_l = \frac{1}{m_l} \sum_{i: x_i \in X_l} y_i ,

where m_l denotes the number of training examples falling in X_l.
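As a quick illustration, here is a minimal Python sketch that computes the leaf constants and the resulting training squared error for a fixed partition, represented here simply as a (hypothetical) mapping from leaf ids to the indices of the training examples they contain.

```python
def leaf_constants(y, leaves):
    """Leaf predictions c_l: the mean label of the training examples in each leaf.

    y: list of training labels; leaves: dict mapping leaf id -> list of example indices.
    """
    return {l: sum(y[i] for i in idx) / len(idx) for l, idx in leaves.items()}

def training_squared_error(y, leaves, c):
    """Average squared error of the resulting piecewise-constant predictions."""
    m = len(y)
    return sum((c[l] - y[i]) ** 2 for l, idx in leaves.items() for i in idx) / m

# Toy example with two leaves over six training labels.
y = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
leaves = {"left": [0, 1, 2], "right": [3, 4, 5]}
c = leaf_constants(y, leaves)                 # {'left': 1.0, 'right': 3.0}
print(training_squared_error(y, leaves, c))   # ~0.0267
```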
Finding a good tree structure. The main question, then, is how to choose a good tree structure.
Finding an exact optimal tree structure would entail a combinatorial search. Instead, one usually uses a
greedy algorithm to learn a good tree structure. Several variants of tree learning algorithms are used; most
follow roughly the following approach to grow a tree:
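As a rough illustration, here is a minimal Python sketch of one such greedy growing procedure for regression trees. The node representation, the minimum-leaf-size stopping rule, and the use of midpoints between consecutive feature values as candidate thresholds are illustrative choices, not a prescription of any particular standard implementation.

```python
from statistics import mean

def grow_tree(X, y, min_leaf_size=5):
    """Greedy, top-down growth of a regression tree (illustrative sketch).

    X: list of feature vectors (lists of floats); y: list of real-valued labels.
    A node is either a leaf {'value': c_l} or an internal node
    {'feature': j, 'threshold': t, 'left': subtree, 'right': subtree}.
    """
    return _grow(X, y, list(range(len(y))), min_leaf_size)

def _grow(X, y, idx, min_leaf_size):
    # Stopping / eligibility criterion: do not split small leaves.
    if len(idx) <= min_leaf_size:
        return {"value": mean(y[i] for i in idx)}
    split = _best_split(X, y, idx)
    if split is None:  # no split reduces the training error (e.g. constant labels)
        return {"value": mean(y[i] for i in idx)}
    j, t, left_idx, right_idx = split
    return {"feature": j, "threshold": t,
            "left": _grow(X, y, left_idx, min_leaf_size),
            "right": _grow(X, y, right_idx, min_leaf_size)}

def _best_split(X, y, idx):
    """Search all features and all thresholds at transition points between
    sorted feature values; return the split minimizing the squared error."""
    def sse(ids):  # sum of squared errors around the mean label
        if not ids:
            return 0.0
        c = mean(y[i] for i in ids)
        return sum((y[i] - c) ** 2 for i in ids)

    best, best_err = None, sse(idx)
    for j in range(len(X[idx[0]])):
        values = sorted(set(X[i][j] for i in idx))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between consecutive feature values
            left = [i for i in idx if X[i][j] <= t]
            right = [i for i in idx if X[i][j] > t]
            err = sse(left) + sse(right)
            if err < best_err:
                best, best_err = (j, t, left, right), err
    return best

def predict(tree, x):
    """Route x to its leaf l(x) and return the leaf constant c_{l(x)}."""
    while "value" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]
```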
A few comments are in order. First, regarding the stopping criterion: one could in principle stop when the best available split no longer decreases the training error by more than some pre-specified threshold, but a split that by itself barely reduces the training error can enable later splits that reduce it substantially. Therefore, stopping criteria are often based instead on the number of training examples contained in each leaf node (e.g. stop when the regions X_l corresponding to the leaf nodes l all contain at most some pre-specified number of training examples). Second, what constitutes an ‘eligible’ leaf node for splitting varies according to the specific algorithm, but common criteria include the number of training examples in the node (e.g. a leaf node may be eligible for splitting as long as the number of training examples it contains exceeds some pre-specified number) and the ‘uniformity’ of the labels of the training examples in the node (e.g. a leaf node may be eligible for splitting as long as the variance of the training labels in the node exceeds some pre-specified value). Finally, when considering candidate splits of a leaf node on a given feature, it suffices to consider only a finite number of thresholds, namely those corresponding to transition points between the (sorted) values of that feature among the training examples in the node.
Pruning to avoid overfitting. The greedy approach often leads to large trees with many nodes. Having greedily built a regression tree T, it is common to then ‘prune’ it back in order to avoid overfitting. The pruning stage typically uses a ‘regularized’ training error to compare the various candidate pruned versions, generally regularized by the number of leaf nodes (here λ > 0 is a regularization parameter):

\hat{er}^{sq,(\lambda)}_S[f_T] = \hat{er}^{sq}_S[f_T] + \lambda \, |L(T)| .
Again, many variants of pruning are used; most follow roughly the following approach:
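One simple instantiation, reusing the node dictionaries from the growing sketch above (again only an illustrative sketch), works bottom-up and collapses a subtree into a single leaf whenever doing so does not increase the regularized training error:

```python
from statistics import mean

def prune(node, X, y, idx, lam, m):
    """Bottom-up pruning sketch: collapse a subtree into a single leaf whenever
    this does not increase the regularized training error
    (training squared error + lam * number of leaves).

    node: a tree node as produced by grow_tree above; idx: indices of the
    training examples routed to this node (assumed non-empty); m = len(y).
    """
    if "value" in node:  # already a leaf
        return node
    j, t = node["feature"], node["threshold"]
    left_idx = [i for i in idx if X[i][j] <= t]
    right_idx = [i for i in idx if X[i][j] > t]
    node["left"] = prune(node["left"], X, y, left_idx, lam, m)
    node["right"] = prune(node["right"], X, y, right_idx, lam, m)

    # Regularized-error contribution of this subtree: kept vs. collapsed.
    kept = (sum((_predict(node, X[i]) - y[i]) ** 2 for i in idx) / m
            + lam * _num_leaves(node))
    c = mean(y[i] for i in idx)
    collapsed = sum((c - y[i]) ** 2 for i in idx) / m + lam * 1
    return {"value": c} if collapsed <= kept else node

def _predict(node, x):
    while "value" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]

def _num_leaves(node):
    if "value" in node:
        return 1
    return _num_leaves(node["left"]) + _num_leaves(node["right"])
```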
Again, a few comments are in order. First, the stopping criterion is generally based either on the improvement
in regularized training error (e.g. stop when pruning no longer reduces the regularized training error by a
sufficiently large amount), or on the number of nodes (e.g. stop when the number of leaf nodes becomes
smaller than some pre-specified value). Second, for the choice of the regularization parameter λ, one often
uses a (cross-)validation approach: i.e. hold out a validation set (or repeat on multiple cross-validation folds),
consider several values of λ, learn a regression tree (by first greedily building and then pruning) on just the
training portion for each value of λ, test the performance of the pruned tree for each λ on the held-out
portion, and keep the tree with best validation error.
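A minimal hold-out version of this selection procedure, reusing grow_tree, prune and _predict from the sketches above, might look as follows; the tree is re-grown for each value of λ only to keep the sketch simple, since prune modifies the tree in place.

```python
def select_lambda(X_tr, y_tr, X_val, y_val, lambdas, min_leaf_size=5):
    """Hold-out selection of the regularization parameter (sketch): for each
    candidate lambda, grow and then prune a tree on the training portion, and
    keep the tree achieving the smallest validation squared error."""
    best_tree, best_err = None, float("inf")
    for lam in lambdas:
        tree = grow_tree(X_tr, y_tr, min_leaf_size)
        tree = prune(tree, X_tr, y_tr, list(range(len(y_tr))), lam, len(y_tr))
        err = sum((_predict(tree, xv) - yv) ** 2
                  for xv, yv in zip(X_val, y_val)) / len(y_val)
        if err < best_err:
            best_tree, best_err = tree, err
    return best_tree
```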
Putting everything together. If pruning is used, then the greedy tree growing phase followed by the
pruning phase jointly constitute the learning algorithm, which at the end of the two phases produces a
regression tree from the given training data.
Consider now a binary classification problem, with instance space X = R^d and label and prediction spaces Y = Ŷ = {±1} (the ideas are easily extended to multiclass classification; we focus on the binary case for
simplicity). Given a training sample S = ((x1 , y1 ), . . . , (xm , ym )), our goal is to learn a good classification
tree model.
The basic approach is similar to the regression case. In this case, each leaf node l in a classification tree T is associated with a number \hat{\eta}_l ∈ [0, 1], which is used to estimate the probability of a positive label for instances x falling in the corresponding region X_l; if \hat{\eta}_l > 1/2, then the predicted class for an instance x ∈ X_l is +1, else it is −1. Thus the class probability estimation (CPE) model associated with a classification tree T is given by

\hat{\eta}_T(x) = \hat{\eta}_{l(x)} .

Given a fixed tree structure, it can be verified that the log loss (cross-entropy loss) of the associated CPE model on the training sample is minimized by choosing \hat{\eta}_l for each leaf node l to be the fraction of training examples in X_l that are positive:

\hat{\eta}_l = \frac{1}{m_l} \sum_{i: x_i \in X_l} 1(y_i = +1) .
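As a small sketch (with leaves again represented as a hypothetical mapping from leaf ids to the indices of the training examples they contain), the leaf estimates and the induced class predictions can be computed as follows:

```python
def leaf_probabilities(y, leaves):
    """Estimated positive-class probability per leaf: the fraction of +1 labels
    among the training examples in that leaf."""
    return {l: sum(1 for i in idx if y[i] == +1) / len(idx)
            for l, idx in leaves.items()}

def leaf_labels(eta_hat):
    """Predicted class per leaf: +1 if the estimated probability exceeds 1/2."""
    return {l: (+1 if p > 0.5 else -1) for l, p in eta_hat.items()}

# Toy example: the first leaf has positive fraction 2/3, the second 0.
y = [+1, +1, -1, -1, -1]
leaves = {"a": [0, 1, 2], "b": [3, 4]}
print(leaf_labels(leaf_probabilities(y, leaves)))  # {'a': 1, 'b': -1}
```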
Now, in order to find a good tree structure, one could in principle look for a tree that minimizes the 0-1
classification error on the training sample. However, the 0-1 error is not very sensitive to the ‘purity’ of leaf
nodes. For example, consider two different splits of a leaf node containing 6 positive and 2 negative training examples.
Both splits produce leaf nodes of the same sizes (containing the same numbers of examples) and with the
same overall number of classification errors, even though the first split includes a ‘pure’ leaf node in which
all examples have the same class label. The 0-1 error would not be able to distinguish between the two
splits, even though intuitively we may want to prefer the split with the pure node. To encourage selection
of splits producing more ‘pure’ leaf nodes, various measures of ‘impurity’ are often used instead of 0-1 error
when growing a classification tree. Two of the most widely used impurity measures are the following:
• Entropy: The entropy of (the set of examples contained in) a leaf node l is defined as

H_l = -p_{l,+1} \log_2 p_{l,+1} - (1 - p_{l,+1}) \log_2 (1 - p_{l,+1}) ,

where p_{l,+1} denotes the fraction of positive training examples in leaf node l (with the convention 0 log 0 = 0). The entropy takes its smallest value (0) when p_{l,+1} = 0 or p_{l,+1} = 1 (pure node); it takes its largest value (1) when p_{l,+1} = 1/2 (maximally impure node). The quality of a full tree T (or rather, of the partition of X induced by T), measured in terms of entropy, is then defined as

H_T = \sum_{l \in L(T)} \frac{m_l}{m} H_l .
• Gini index: The Gini index of (the set of examples contained in) a leaf node l is defined as

G_l = p_{l,+1} (1 - p_{l,+1}) .

Again, the Gini index takes its smallest value (0) when p_{l,+1} = 0 or p_{l,+1} = 1 (pure node); it takes its largest value (1/4) when p_{l,+1} = 1/2 (maximally impure node). The quality of a full tree T (or rather, of the partition of X induced by T), measured in terms of Gini index, is then defined as

G_T = \sum_{l \in L(T)} \frac{m_l}{m} G_l .
When considering a split of a particular leaf node l into two leaf nodes l1 and l2 , to measure the reduction
in entropy induced by the resulting split, one can simply evaluate what is termed the information gain
(equivalent to entropy reduction), defined as
IG(l, l_1, l_2) = H_l - \Big( \frac{m_{l_1}}{m_l} H_{l_1} + \frac{m_{l_2}}{m_l} H_{l_2} \Big) .
Thus, in the greedy tree growing phase, given a current tree, one evaluates the information gain associated
with each candidate split under consideration, and chooses the split yielding the largest information gain
(largest reduction in entropy).
One can similarly define the Gini reduction associated with a split; if using the Gini index criterion, one
then chooses a split with the largest Gini reduction.
When pruning a classification tree, one usually simply uses the regularized 0-1 training error.
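The impurity measures and split criteria above are straightforward to compute; here is a minimal Python sketch. The counts in the usage example are made up for illustration and are not the two splits of the 6-positive/2-negative node discussed above. These helpers may also be convenient for the exercises below.

```python
from math import log2

def entropy(p):
    """Binary entropy H = -p log2(p) - (1-p) log2(1-p), with 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def gini(p):
    """Gini index p (1 - p) for a leaf with positive fraction p."""
    return p * (1 - p)

def impurity_reduction(n_pos, n_neg, n_pos1, n_neg1, impurity=entropy):
    """Impurity reduction when a leaf with (n_pos, n_neg) examples is split so
    that the first child gets (n_pos1, n_neg1) and the second child the rest
    (both children assumed non-empty). With impurity=entropy this is the
    information gain; with impurity=gini it is the Gini reduction."""
    n, n1 = n_pos + n_neg, n_pos1 + n_neg1
    n_pos2, n2 = n_pos - n_pos1, n - n1
    parent = impurity(n_pos / n)
    children = (n1 / n) * impurity(n_pos1 / n1) + (n2 / n) * impurity(n_pos2 / n2)
    return parent - children

# Illustrative counts: a leaf with 5 positives and 3 negatives, split into
# children containing (4+, 1-) and (1+, 2-) respectively.
print(impurity_reduction(5, 3, 4, 1))                 # information gain
print(impurity_reduction(5, 3, 4, 1, impurity=gini))  # Gini reduction
```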
Exercise. Calculate the information gain associated with each of the two splits in the example above
(where a leaf node containing 6 positive and 2 negative examples is being considered for splitting into two
leaf nodes). Which split would be preferred based on the entropy criterion? Repeat the same using the Gini
index criterion.
Exercise. Show that choosing a split with maximal information gain is equivalent to choosing a split that
yields minimal cross-entropy loss (on the training sample S).
3 Nearest Neighbor Methods
The idea behind nearest neighbor methods is conceptually very simple. Basically, one simply stores all the
given training examples in memory, and when asked to make a prediction on a new test point, searches the
training examples to find the ‘nearest’ training point and returns its label (or finds a few nearest training
points and averages their labels in some way). The notion of ‘nearest’ needs a distance measure; in Euclidean
space, this is most commonly taken to be the Euclidean distance.
More formally, suppose instances are feature vectors in X = R^d, and say we are given a training sample
S = ((x1 , y1 ), . . . , (xm , ym )) ∈ (X × Y)m , where Y could be {±1} in the case of binary classification,
{1, . . . , K} in the case of multiclass classification, or R in the case of regression. We start by discussing the
case of using a single nearest neighbor for prediction, and then discuss the extension to using more neighbors.
Given a new test point x, the 1-NN algorithm simply finds the nearest point i∗ (x) in the training sample,
and predicts using its label. Specifically, for classification (both binary and multiclass), the 1-NN classifier
is given by
h_S(x) = y_{i^*(x)} ,

where i^*(x) is the index of the nearest neighbor of x in S (breaking ties arbitrarily):

i^*(x) \in \arg\min_{i \in \{1, \dots, m\}} \| x - x_i \| .
The 1-NN method leads to what is known as a Voronoi diagram, or a Voronoi tessellation of the instance
space X = R^d, where the space is divided up into polyhedral regions or ‘Voronoi cells’; each training point
xi is associated with one such Voronoi cell, such that for all points x in that cell, xi is the closest training
point in S and the predicted label is yi . Such Voronoi tessellations can be quite complex, and they become
increasingly complex as the number of training points m increases.
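As a minimal sketch, a brute-force 1-NN classifier can be written as follows (Euclidean distance, ties broken by taking the smallest index):

```python
def nn_index(S_x, x):
    """Index i*(x) of the training point nearest to x (squared Euclidean
    distance; ties broken by taking the smallest index)."""
    def sq_dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return min(range(len(S_x)), key=lambda i: sq_dist(S_x[i], x))

def one_nn_classify(S_x, S_y, x):
    """1-NN prediction: the label of the nearest training point."""
    return S_y[nn_index(S_x, x)]

# Toy usage: the test point (0.9, 0.9) is closer to (1, 1) than to (0, 0).
print(one_nn_classify([(0.0, 0.0), (1.0, 1.0)], [-1, +1], (0.9, 0.9)))  # +1
```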
More generally, given a new test point x, the k-NN algorithm finds the k nearest points in the training sample and predicts by averaging their labels. Specifically, for multiclass classification, the k-NN classifier estimates
the probability of each class label y as
\hat{\eta}_y(x) = \frac{1}{k} \sum_{i \in N_k(x)} 1(y_i = y) ,
where N_k(x) denotes the set of k nearest neighbors of x in S (breaking ties arbitrarily). Classifications are then based on the estimated class probabilities and the target loss function; for example, under 0-1 loss, one simply predicts the class with the highest estimated probability, which amounts to taking a majority vote over the class labels of the k nearest neighbors:

h_S(x) \in \arg\max_{y} \hat{\eta}_y(x) .
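Here is a brute-force sketch of the k-NN class probability estimates and the corresponding majority-vote classifier (again with Euclidean distance and arbitrary, index-based tie breaking):

```python
from collections import Counter

def knn_class_probabilities(S_x, S_y, x, k):
    """Estimated class probabilities eta_hat_y(x): the fraction of each label
    among the k nearest training points."""
    def sq_dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    neighbors = sorted(range(len(S_x)), key=lambda i: sq_dist(S_x[i], x))[:k]
    counts = Counter(S_y[i] for i in neighbors)
    return {label: c / k for label, c in counts.items()}

def knn_classify(S_x, S_y, x, k):
    """0-1 loss prediction: majority vote over the k nearest neighbors."""
    probs = knn_class_probabilities(S_x, S_y, x, k)
    return max(probs, key=probs.get)

# Toy usage: with k = 3, two of the three nearest labels are +1.
S_x = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
S_y = [-1, -1, +1, +1]
print(knn_classify(S_x, S_y, (0.9, 0.9), 3))  # +1
```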
For binary classification, there are two sets of classical results regarding the statistical convergence properties
of nearest neighbor classifiers. The first result (presented here in an abbreviated form) says that, for any
fixed k, for large enough sample size m, the 0-1 generalization error of the k-NN classifier, er^{0-1}_D[h_S], when averaged over all training samples of size m (drawn from D^m), is at most twice the Bayes error:
Theorem 1 (Cover and Hart, 1967). Let X = R^d. Let D be any probability distribution on X × {±1}. Let k be any fixed positive integer, and let h_S denote the k-NN classifier resulting from a training sample S. Then

\lim_{m \to \infty} E_{S \sim D^m}\big[ er^{0-1}_D[h_S] \big] \le 2 \, er^{0-1,*}_D .
The full result shows something stronger: it gives the precise value of the limit \lim_{m \to \infty} E_{S \sim D^m}\big[ er^{0-1}_D[h_S] \big] for each fixed k, and shows that this limit becomes smaller as k increases. However, as long as one uses a fixed value of k, this limit is in general not equal to the Bayes error er^{0-1,*}_D, and therefore k-NN with any fixed k is not (universally) consistent.
On the other hand, the following result shows that, if one allows k to depend on the number of training
examples m, then by choosing k to be a slowly growing function of m, one can achieve (universal) consistency:
the generalization error of the resulting algorithm actually converges to the Bayes error (for all D):
Theorem 2 (Stone, 1977). Let X = R^d. Let D be any probability distribution on X × {±1}. Let k_m be a sequence such that k_m → ∞ and k_m/m → 0 as m → ∞. Let h_S denote the k_m-NN classifier resulting from a training sample S of size m. Then

\lim_{m \to \infty} E_{S \sim D^m}\big[ er^{0-1}_D[h_S] \big] = er^{0-1,*}_D .
Nearest neighbor methods typically don’t require much computation in the training phase, but they require
storing all the training examples in memory. For this reason, they are often referred to as memory-based
or instance-based methods.
The testing phase, however, is computationally expensive, since given a new test point, one needs to search
the training sample to find the k nearest neighbors. There has been much work on developing approximate nearest neighbor search algorithms, which may not return the exact nearest neighbors but instead quickly return a set of points that are nearly as close.
In general, nearest neighbor methods tend to suffer from the curse of dimensionality: as the dimensionality d
of the instance space increases, the number of training examples needed to construct reliable estimates of the
class probability function or the conditional expectation function via nearest neighbor averaging increases
exponentially with d.