
STAT 451: Machine Learning

Lecture Notes

Sebastian Raschka
Department of Statistics
University of Wisconsin–Madison

http://stat.wisc.edu/~sraschka/teaching/stat451-fs2020/

Fall 2020

Contents

6 Decision Trees
6.1 Introduction
6.1.1 Terminology
6.1.2 Machine Learning Categories
6.1.3 Relationship Between Decision Trees and Rule-based Learning
6.2 Divide-and-Conquer and Recursion
6.3 General Decision Tree Algorithm
6.4 Time Complexity
6.5 ID3, C4.5, CART
6.5.1 General Differences
6.5.2 ID3 – Iterative Dichotomizer 3
6.5.3 C4.5
6.5.4 CART
6.5.5 Others
6.6 Information Gain
6.7 Information Theory and Entropy
6.8 Why Growing Decision Trees via Entropy or Gini Impurity instead of Misclassification Error?
6.9 Gini Impurity
6.10 Improvements
6.10.1 Gain Ratio
6.11 Overfitting
6.11.1 Pre-Pruning
6.11.2 Post-Pruning
6.12 Decision Tree for Regression
6.13 Summary
6.13.1 Pros and Cons of Decision Trees


6 Decision Trees

6.1 Introduction

• Decision tree algorithms can be considered as an iterative, top-down construction method for the hypothesis (classifier). You can picture a decision tree as a hierarchy of decisions that fork or divide the dataset into subspaces.

• Decision trees can represent any Boolean (binary) function, and the hypothesis space being searched is the entire space of Boolean functions¹; however, we need to keep in mind that a critical challenge in machine learning is whether an algorithm can learn/find the "right" function or a good approximation within the subspace being searched.

• Considering only binary (or Boolean) features, at each node there are 2^m potential splits to be evaluated, given that the dataset has m features.

• Decision tree algorithms search the hypothesis space greedily² over all possible trees; an exhaustive search is not feasible because of the exponential nature of the problem. I.e., assume that we have m Boolean features; then there are 2^m potential feature combinations. If we further consider a binary classification problem, there are 2^(2^m) possible ways to label the data. If each tree corresponds to a unique label assignment function, you can easily see how it becomes intractable to do a brute-force search over all possible decision trees for a given dataset (especially if we are working with non-binary features and labels).

6.1.1 Terminology

• Root node: no incoming edge, zero or more outgoing edges.

• Internal node: one incoming edge, two (or more) outgoing edges.
¹ The size of the hypothesis space is determined by the dataset.
² In greedy search, we make a series of locally optimal choices, which may not lead to the global optimum.

• Leaf node: each leaf node is assigned a class label if the node is pure; otherwise, the class label is determined by majority vote.
• Parent and child nodes: If a node is split, we refer to that given node as the parent
node, and the resulting nodes are called child nodes.

Figure 1: Example of a non-binary decision tree with categorical features. The root node asks "Work to do?"; if Yes, the leaf node is "Stay in". If No, an internal node asks "Outlook?" with the branches Sunny ("Go to beach"), Overcast ("Go running"), and Rainy, which leads to "Friends busy?" with the leaf nodes "Stay in" (Yes) and "Go to movies" (No).

6.1.2 Machine Learning Categories

In the context of the different categories of machine learning algorithms that we defined at
the beginning of this course, we may categorize decision trees as follows:

• Supervised learning algorithm: classification and regression


• Optimization method: combinatorial → greedy search
• Eager learning algorithm (remember, KNN was "lazy")
• Batch learning algorithm (meaning it learns from the whole training set at once rather than one example at a time)
• Nonparametric model (similar to KNN, there are no model parameters to fit, unlike, e.g., linear regression)
• Deterministic (vs. stochastic)

6.1.3 Relationship Between Decision Trees and Rule-based Learning

Intuitively, we can also think of a decision tree as a set of nested "if-else" rules, where a rule is simply a conjunction of conditions. For example,

Rule 1 = (if x = 1) ∩ (if y = 2) ∩ ...    (1)

Multiple rules can then be joined into a set of rules, which can be applied to predict the target value of a training example or test instance. For example,

Class 1 = if (Rule 1 = True) ∪ (Rule 2 = True) ∪ ...    (2)

Each leaf node in a decision tree represents such a rule, as illustrated in the following figure, which depicts the rule

(Work to do? = False) ∩ (Outlook? = Rainy) ∩ (Friends busy? = Yes) (3)

Figure 2: A rule for a given leaf node (circled): (Work to do? = False) ∩ (Outlook? = Rainy) ∩ (Friends busy? = Yes)

Considering the complete tree depicted in the previous figure, the decision rule for the class label "Stay in" can then be written as the following rule set:

[(Work to do? = False) ∩ (Outlook? = Rainy) ∩ (Friends busy? = Yes)] ∪ (Work to do? = True)    (4)
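To make the correspondence between trees and rules concrete, below is a minimal Python sketch (illustrative only; the function name weekend_activity and its argument names are made up for this example) of the tree from Figure 1 written as nested if-else statements, where each root-to-leaf path corresponds to one rule:

def weekend_activity(work_to_do, outlook, friends_busy):
    # Each root-to-leaf path below is one rule (a conjunction of conditions)
    if work_to_do:                       # root node: Work to do?
        return "Stay in"
    if outlook == "Sunny":               # internal node: Outlook?
        return "Go to beach"
    if outlook == "Overcast":
        return "Go running"
    # Outlook? = Rainy
    if friends_busy:                     # internal node: Friends busy?
        return "Stay in"                 # the rule from Eq. (3)
    return "Go to movies"

# The rule (Work to do? = False) AND (Outlook? = Rainy) AND (Friends busy? = Yes):
print(weekend_activity(work_to_do=False, outlook="Rainy", friends_busy=True))  # "Stay in"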

• Rules can be constructed from decision trees easily: each leaf node is a rule.
• However, the other way around is not always possible: we cannot always build a decision tree from a given set of rules, and even in cases where it is possible, it may not be immediately apparent how (especially if rules were pruned; e.g., where would you place the root of the tree?).
• Evaluating a rule set is much more expensive than evaluating a tree, where we only have to follow one single branch.
• Rule sets can give multiple answers if we are not careful.
• While rules are more expressive or flexible, they are more prone to overfitting, because they have a larger hypothesis space than decision trees.

6.2 Divide-and-Conquer and Recursion

• In the context of decision trees and how to implement them efficiently, it is helpful to
visit the topic of divide-and-conquer algorithms and the concept of recursion.
• To understand the basic concept behind recursion, consider the simple algorithm (here:
written as a Python function) to compute the length of an array:

def array_len(x):
    # Base case: the empty list has length 0
    if x == []:
        return 0
    # Recursive case: one element plus the length of the remaining list
    else:
        return 1 + array_len(x[1:])

• Divide-and-conquer is a concept in computer science where we divide a problem into subproblems of the same type.
• Usually, divide-and-conquer algorithms can be implemented using recursion; an example is shown below, implementing a simple version of the quicksort algorithm (in Python):

def quicksort(array):
    # Base case: arrays with fewer than 2 elements are already sorted
    if len(array) < 2:
        return array
    else:
        # Divide: partition the remaining elements around the first element (the pivot)
        pivot = array[0]
        smaller, bigger = [], []
        for ele in array[1:]:
            if ele <= pivot:
                smaller.append(ele)
            else:
                bigger.append(ele)
        # Conquer: recursively sort both partitions and combine them with the pivot
        return quicksort(smaller) + [pivot] + quicksort(bigger)

Figure 3: Illustration of the different steps of quicksort when applied to a simple toy dataset.

6.3 General Decision Tree Algorithm

This section outlines a generic decision tree algorithm using the concept of recursion from the previous section, which is the basic foundation underlying most decision tree algorithms described in the literature.

The process of growing a decision tree can be expressed as a recursive algorithm as follows:

1) Pick a feature such that, when the parent node³ is split on it, the split results in the largest information gain⁴.
2) Stop if the child nodes are pure or no improvement in class purity can be made.
3) Go back to step 1 for each of the two child nodes.

Below is a more formal expression of the algorithm outlined above:


GenerateTree(D):

• if y = 1 ∀ ⟨x, y⟩ ∈ D or y = 0 ∀ ⟨x, y⟩ ∈ D:
  – return Tree
• else:
  – Pick the best feature x_j:
    ∗ D_0 at Child_0: all ⟨x, y⟩ ∈ D with x_j = 0
    ∗ D_1 at Child_1: all ⟨x, y⟩ ∈ D with x_j = 1
  – return Node(x_j, GenerateTree(D_0), GenerateTree(D_1))
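Below is a minimal Python sketch of this recursive procedure for binary features and binary class labels (illustrative only; the function names generate_tree and _entropy are made up and not taken from any library, and the "best feature" is chosen via the entropy-based information gain that is formally introduced in Sections 6.6–6.7):

import math

def _entropy(labels):
    # Shannon entropy of a list of 0/1 class labels
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def generate_tree(X, y, features):
    # X: list of dicts mapping feature name -> 0 or 1; y: list of 0/1 labels
    if len(set(y)) == 1 or not features:
        return max(set(y), key=y.count)          # leaf: pure node or no features left
    def info_gain(f):
        gain = _entropy(y)
        for v in (0, 1):
            y_v = [label for x, label in zip(X, y) if x[f] == v]
            if y_v:
                gain -= len(y_v) / len(y) * _entropy(y_v)
        return gain
    best = max(features, key=info_gain)          # greedy choice: largest information gain
    if info_gain(best) <= 0:
        return max(set(y), key=y.count)          # stop if no split improves purity
    node = {"feature": best}
    remaining = [f for f in features if f != best]
    for v in (0, 1):
        X_v = [x for x in X if x[best] == v]
        y_v = [label for x, label in zip(X, y) if x[best] == v]
        if y_v:
            node[v] = generate_tree(X_v, y_v, remaining)
        else:
            node[v] = max(set(y), key=y.count)   # empty branch: parent majority class
    return node

# Example: learn y = x1 AND x2 from all four input combinations
X = [{"x1": a, "x2": b} for a in (0, 1) for b in (0, 1)]
y = [a & b for a in (0, 1) for b in (0, 1)]
print(generate_tree(X, y, ["x1", "x2"]))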

Now, while the algorithm above appears to be a viable approach for constructing a decision
tree, in practice, we may face several edge cases and issues that we need to think of when
implementing decision tree algorithms.
For instance, some of the design choices and considerations we have to make are listed below:

• How do we decide which feature to select for splitting a parent node into child nodes? I.e., what is a criterion to measure the goodness of a split?
• Since a multi-category split can be expressed as a series of binary splits, which approach is to be preferred?
• While splitting categorical features is intuitive, how can we deal with continuous inputs?
• When do we stop growing a tree (because complete separation can easily lead to overfitting)?
• How do we make predictions if no attributes exist to perfectly separate non-pure nodes further⁵?

³ In the first iteration, this is the root node.
⁴ Stop if the information gain is not positive.
⁵ Majority voting for classification trees and the sample mean for regression trees is typically a good choice.

6.4 Time Complexity

Measuring the time complexity of decision tree algorithms can be complicated, and the analysis is not very straightforward.
However, we can make a few simplifying assumptions to analyze the complexity of decision trees. Under the assumption that a decision tree is a balanced binary decision tree, the final tree will have a depth of log2(n), where n is the number of examples in the training set. Hence, it should be immediately obvious that the time complexity of the prediction step is O(log n).⁶
Determining the runtime complexity of decision tree training is less straightforward and varies wildly based on the algorithm choice and implementation. Assuming we have continuous features and perform binary splits, the runtime of the decision tree construction is generally O(m · n² log n). It can be shown that the optimal binary split on a continuous feature lies on the boundary between adjacent examples⁷ with different class labels⁸. This means that sorting the values of continuous features helps with determining a decision threshold efficiently. If we have n examples, the sorting has time complexity O(n log n). If we have to sort m features, this becomes O(m · n log n).
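As a small illustration of this sorting-based threshold search on a single continuous feature, consider the following sketch (illustrative only; it assumes 0/1 class labels, the function name best_threshold is made up, and candidate thresholds are scored via the misclassification error):

def best_threshold(feature_values, labels):
    # Sort the examples by feature value once: O(n log n)
    pairs = sorted(zip(feature_values, labels))
    n = len(pairs)
    best_t, best_err = None, float("inf")
    for i in range(1, n):
        (x_prev, y_prev), (x_cur, y_cur) = pairs[i - 1], pairs[i]
        # Candidate thresholds only lie between adjacent examples
        # that have different class labels
        if y_prev == y_cur or x_prev == x_cur:
            continue
        t = (x_prev + x_cur) / 2.0
        left = [label for x, label in pairs if x <= t]
        right = [label for x, label in pairs if x > t]
        # Score the split via the misclassification error under majority voting
        err = (min(left.count(0), left.count(1)) +
               min(right.count(0), right.count(1))) / n
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

print(best_threshold([2.5, 1.0, 3.5, 4.0], [0, 0, 1, 1]))  # (3.0, 0.0)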
To see why the time complexity of decision tree construction is typically quoted as O(m · n² log n), keep in mind that we earlier determined the depth of a decision tree to be log2(n). It follows that the number of terminal nodes (leaf nodes) is 2^(log2 n) = n. The total number of nodes in the tree is 2n − 1, and consequently, the number of "splitting" nodes in the tree is 2n − 1 − n = n − 1 (all nodes minus the terminal nodes).
Hence, if we are not efficient and re-sort the features prior to each split, we have to perform the O(m · n log n) sorting step up to n/2 times (once for each splitting node in the tree), which results in a time complexity of O(m · n² log n).
(Many implementations such as scikit-learn use efficient caching tricks to keep track of the general order of indices at each node such that the features do not need to be re-sorted at each node; hence, the time complexity of these implementations is just O(m · n log n).)

⁶ Remember, we write just log n because the base of the log is only a scaling factor: log2(n) = log_b(n) / log_b(2).
⁷ Training examples with similar feature values.
⁸ Usama Mohammad Fayyad, "On the induction of decision trees for multiple concept learning" (1992).

6.5 ID3, C4.5, CART

There exists a relatively large variety of decision tree algorithms. This section lists some of
the most influential/popular ones.

6.5.1 General Differences

Most decision tree algorithms differ in the following ways:

• Splitting criterion: information gain (Shannon entropy, Gini impurity, misclassification error), use of statistical tests, objective function, etc.
• Binary split vs. multi-way splits
• Discrete vs. continuous variables

• Pre- vs. post-pruning

6.5.2 ID3 – Iterative Dichotomizer 3

• Described in Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

• One of the earliest decision tree algorithms



• Discrete features, cannot handle numeric features

• Multi-category splits
• No pruning, prone to overfitting
• Short and wide trees (compared to CART)
• Maximizes information gain/minimizes entropy

• Discrete features, binary and multi-category features

6.5.3 C4.5

• Described in Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

• Continuous and discrete features (splitting on continuous features is very expensive because all possible ranges must be considered)
• Splitting criterion is computed via the gain ratio (explained later)
• Handles missing attributes (ignores them in information gain computation)

• Performs post-pruning (bottom-up pruning)

6.5.4 CART

• Described in Breiman, L. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
• Continuous and discrete features
• Strictly binary splits (resulting trees are taller compared to ID3 and C4.5)
• Binary splits can generate better trees than C4.5, but they tend to be larger and harder to interpret; i.e., for an attribute with k values, there are 2^(k−1) − 1 ways to create a binary partitioning
• Variance reduction in regression trees
• Uses Gini impurity (or “twoing criteria”) in classification trees
• Performs cost-complexity pruning (more on that later)

6.5.5 Others

• CHAID (CHi-squared Automatic Interaction Detector); Kass, G. V. (1980). "An exploratory technique for investigating large quantities of categorical data." Applied Statistics, 29(2), 119–127.

• MARS (Multivariate adaptive regression splines); Friedman, J. H. (1991). "Multivariate Adaptive Regression Splines." The Annals of Statistics, 19(1), 1.
• C5.0 (patented)

6.6 Information Gain

The standard criterion used for splitting in decision trees is the so-called information gain. In simple words: the better the split, the higher the information gain.

• Information gain relies on the concept of mutual information: the reduction of the entropy of one variable by knowing the other.⁹
• We want to maximize mutual information when defining splitting criteria.
• I.e., we define the criterion at a node such that it maximizes the information gain

GAIN(D, x_j) = H(D) − Σ_{v ∈ Values(x_j)} (|D_v| / |D|) H(D_v),    (5)

where D is the training set at the parent node, and D_v is the dataset at a child node upon splitting.

⁹ In this context: knowing the class label by knowing the feature value.

6.7 Information Theory and Entropy

This section briefly summarizes the concept of entropy as it was coined by Claude Shannon in the context of information theory.¹⁰

¹⁰ In this lecture, the term entropy refers to the Shannon entropy and is not to be confused with other definitions of entropy, such as the one from thermodynamics; although it is conceptually related if we think about order and disorder.

• In ID3, we use the Shannon entropy to measure the improvement in a decision tree (instead of the misclassification error); i.e., we use it as an optimization metric (or impurity measure).
• This entropy measure was originally proposed in the context of encoding digital information in the form of bits (0s or 1s).
• Consider entropy as a measure of the amount of information of a discrete random variable (two outcomes, Bernoulli distribution).
• Shannon information:
  – Shannon defined information as the number of bits needed to encode a number 1/p, where p is the probability that an event is true (i.e., 1 − p is the uncertainty).
  – The number of bits for encoding 1/p is log2(1/p).
  – Note: log2(1/p) = log2(1) − log2(p) = − log2(p).
  – − log2(p) ranges from ∞ down to 0 as p goes from 0 to 1; i.e., if we are 100% certain about an event, we gain 0 information.
  – E.g., assume 2 soccer teams, team 1 and team 2, both with a win probability of 50%.
    ∗ If the information "team 1 wins" is transmitted, we transmitted 1 bit: log2(1/0.5) = log2(2) = − log2(0.5) = 1.
• Shannon entropy is then the "average information":
  – Entropy: H(p) = Σ_i p_i log2(1/p_i)

  – Entropy (alternative notation): H(p) = − Σ_i p_i log2(p_i)
  – E.g., assume soccer team 1 and team 2 have win probabilities of 75% and 25%, respectively; then we get an average information content of 0.81 bits:

    H(p) = −0.75 × log2(0.75) − 0.25 × log2(0.25)
         ≈ −0.75 × (−0.41) − 0.25 × (−2)    (6)
         ≈ 0.81

• That means that we have some information about the problem if the distribution is non-uniform, and we could technically encode this message more efficiently (granted, in a real application we cannot send fractional bits, but this is just a toy example).
• The key idea behind Shannon entropy is that we can create messages of different lengths to transmit different information contents and send information more efficiently (saving bits) depending on how confident we are; however, this is beyond the scope of this course.

Figure 4: Entropy function H(p) for a binary classification problem, plotted as a function of p(y = 0).

• If we have i class labels (i.e., i different values for y), then the entropy can be as large
as log2 i.
– E.g., for 10 classes: 10 × (−(0.1 × log2 (0.1))) = log2 (10) = 3.32
• In the context of decision trees, think of entropy as the minimum number of bits that
are required to encode the classification of data points. For instance, if we have p = 0.5
(uniform class distribution in a binary classification problem) we need to send 1 bit
on average (most expensive) to classify a data point.
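As a quick numerical check of the soccer example above, here is a minimal Python sketch of the entropy computation (illustrative only):

import math

def entropy(probs):
    # Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: uniform distribution, least predictable
print(entropy([0.75, 0.25]))  # ~0.811 bits: the soccer example above
print(entropy([1.0, 0.0]))    # 0.0 bits: no uncertainty, no information gained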

6.8 Why Growing Decision Trees via Entropy or Gini Impurity instead of Misclassification Error?

Consider the more general formula for measuring the information gain,

GAIN(D, x_j) = I(D) − Σ_{v ∈ Values(x_j)} (|D_v| / |D|) I(D_v),    (7)

where I is a function that measures the impurity of a given node. If I is defined as the entropy measure H we defined earlier, this equation is equal to the information gain equation used in classification trees.
Instead of using entropy as an impurity measure, the misclassification error ERR seems to be another reasonable choice, where

ERR(D) = (1/n) Σ_{i=1}^{n} L(ŷ^[i], y^[i]),    (8)

with the 0-1 loss

L(ŷ, y) = 0 if ŷ = y, and 1 otherwise.    (9)

This, in the case of the training set, is equal to

ERR(p) = 1 − max_i p(i | x_j)    (10)

for a given node if we use majority voting at this node.


Now, to see the difference between using the misclassification error as opposed to the entropy
measure for computing the information gain upon growing a decision tree, we will take a
look at an example. Consider the following binary decision tree for a binary classification
problem with binary features:

Figure 5: Binary decision tree example. The root node contains 40 examples with y = 1 and 80 examples with y = 0. Splitting on x1 yields one child node with class counts (28, 42) and one with class counts (12, 38); splitting these on x2 and x3, respectively, separates the classes perfectly.

Note that this is a toy example, where we assume that there exist three features x1 , x2 , x3 ∈ {0, 1}
that result in a perfect class separation if split as illustrated in the preceding figure.
We split the tree using entropy as an information or impurity measure first, and the entropy
values of the child nodes of the root node are provided in the figure below.

Figure 6: Binary decision tree example, split via the entropy measure. The root node has entropy 0.918, and the two child nodes have entropies 0.971 and 0.795 (the grandchildren are pure, with entropy 0.0); the information gain of the root split is GAIN(D, x_j) = 0.918 − 70/120 × 0.971 − 50/120 × 0.795 ≈ 0.02.

We note that splitting the root node into the two child nodes results in entropy values 0.971
and 0.795, respectively. The information gain of this split is 0.02. In the toy example, the
splits that follow this first split will separate the classes correctly/perfectly, as indicated in
the figure.
Next, consider the same decision tree in a scenario where the misclassification error was used
as an impurity metric, as shown in the next figure.

Figure 7: Binary decision tree example, split via the misclassification error. The root node has error 40/120, and the two child nodes have errors 28/70 and 12/50 (the grandchildren have error 0.0); the information gain of the root split is GAIN(D, x_j) = 40/120 − 70/120 × 28/70 − 50/120 × 12/50 = 0.

As shown in the previous figure, the information gain upon splitting the root node using
the misclassification error as impurity metric is 0, which means that performing this split
actually did not result in an improvement compared to the original state (i.e., the root node).
Hence, according to the decision tree algorithm(s) we defined earlier, we would stop growing
the tree if we do not make further improvement(s) as measured via information gain.
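As a sanity check on these numbers, the following minimal Python sketch (illustrative only; the function names are made up) computes the information gain of the root split under both impurity measures, using the class counts from Figure 5:

import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def error(counts):
    # Misclassification error under majority voting at the node
    return 1 - max(counts) / sum(counts)

def gain(parent, children, impurity):
    n = sum(parent)
    return impurity(parent) - sum(sum(c) / n * impurity(c) for c in children)

parent = [40, 80]                 # class counts (y=1, y=0) at the root node
children = [[28, 42], [12, 38]]   # class counts after splitting on x1

print(round(gain(parent, children, entropy), 3))  # 0.021 (i.e., approx. 0.02)
print(round(gain(parent, children, error), 3))    # 0.0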
To provide an intuitive explanation as to why this happens, consider the next figure, which
shows the entropy values of the root node plotted along with the two child nodes.

Figure 8: Average versus weighted entropy of child nodes.

As can be seen in the plot above, the weighted average entropy of the child nodes is smaller than the entropy of the parent node (unless the child nodes have the same class distribution as the parent) due to the concave shape of the entropy function. This means that, using entropy, we can usually find a split that continues growing the tree in situations where growth based on the misclassification error would get stuck because the error does not improve.

6.9 Gini Impurity

• Gini impurity is a measure used in CART as opposed to entropy:

  Gini(t) = 1 − Σ_i p(c = i)²    (11)

• In practice, whether we use entropy or Gini impurity does not really matter, because
both have the same concave/bell shape, which is the essential part about this splitting
criterion.
• Gini impurity is computationally more efficient than entropy (due to the lack of the log), which could make the code negligibly more efficient in terms of computational performance.
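A minimal Python sketch of the Gini impurity computation from class counts (illustrative only):

def gini(counts):
    # Gini(t) = 1 - sum_i p(c=i)^2, computed from the class counts at node t
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([40, 80]))   # ~0.444 (root node of the toy example above)
print(gini([50, 50]))   # 0.5    (maximum impurity for two classes)
print(gini([100, 0]))   # 0.0    (pure node)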

Figure 9: Comparison of different impurity measures (entropy, scaled entropy, Gini impurity, and misclassification error) as a function of p(i = 1).

6.10 Improvements

This section introduces several ideas to improve the decision tree algorithm.

6.10.1 Gain Ratio

The gain ratio, introduced by Quinlan, penalizes splitting on categorical attributes with many values (e.g., think about including a date column, or, really bad, a row ID) via the split information:

GainRatio(D, x_j) = Gain(D, x_j) / SplitInfo(D, x_j),    (12)

where SplitInfo measures the entropy of the attribute itself:

SplitInfo(D, x_j) = − Σ_{v ∈ x_j} (|D_v| / |D|) log2(|D_v| / |D|).    (13)
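A minimal Python sketch of the split information and the resulting gain ratio (illustrative only; the gain values passed in are assumed to have been computed as in Eq. (5)):

import math

def split_info(partition_sizes):
    # Entropy of the attribute itself: how finely it partitions the dataset
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# Same information gain, but a 2-way split vs. a 120-way split (a row-ID-like attribute):
print(gain_ratio(0.02, [70, 50]))   # ~0.020
print(gain_ratio(0.02, [1] * 120))  # ~0.003  (heavily penalized)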

6.11 Overfitting

If decision trees are not pruned, they have a high risk of overfitting the training data.

Figure 10: Relationship between tree depth and overfitting (gap between training and test accuracy). Note that the plot starts at 50% accuracy, because we assume that we are dealing with a binary classification problem.

• Overfitting occurs if models pick up noise or errors in the training dataset; hence,
overfitting can be seen as a performance gap between training and test data.

Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H, such that h has smaller error than h′ over the training examples, but h′ has a smaller error than h over the entire distribution of instances. – Tom Mitchell

• Occam’s razor: Favor a simpler hypothesis, because a simpler hypothesis that fits
the data equally well (let’s say the same accuracy) is more likely or plausible than a
complex one.

• A general approach for minimizing overfitting in decision trees is decision tree pruning.
There are generally two approaches: post- and pre-pruning.

6.11.1 Pre-Pruning

• Set a depth cut-off (maximum tree depth) a priori.

• Cost-complexity pruning: I + α|N|, where I is an impurity measure, α is a tuning parameter, and |N| is the total number of nodes.
• Stop growing if a split is not statistically significant (e.g., χ2 test).
• Set a minimum number of data points for each node.
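As an aside, these pre-pruning controls map directly onto hyperparameters of common implementations; below is a minimal sketch using scikit-learn's DecisionTreeClassifier (the dataset and the specific hyperparameter values are arbitrary examples for illustration, not recommendations from this lecture):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# Pre-pruning via a depth cut-off and a minimum number of examples per leaf node
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=123)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))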

6.11.2 Post-Pruning

• Grow the full tree first, then remove nodes (e.g., done in C4.5).
• Reduced-error pruning: greedily remove nodes based on validation set performance; this generally improves performance but can be problematic for limited dataset sizes.
• We can also convert trees to rules first and then prune the rules.

– There is one rule per leaf node.


– If rules are not sorted, rule sets are costly to evaluate but more expressive.
– In contrast to pruned rule sets, rules from decision trees are mutually exclusive.
– Here we can prune rules independently of one another (meaning we do not remove both child nodes if we remove the root node).

Figure 11: Illustration of reduced-error pruning.

6.12 Decision Tree for Regression

Decision trees can also be used for regression analysis, which was introduced via CART – as
you remember from the previous sections, CART stands for Classification And Regression
Trees.
If we use decision trees for regression, we grow the tree (that is, deciding upon splitting
criteria at each node) through variance reduction at each node. Here, the variance refers to
the variance among the target variables at the parent node and its child nodes.
Earlier, in the context of classification, we defined the information gain as follows:

GAIN(D, x_j) = I(D) − Σ_{v ∈ Values(x_j)} (|D_v| / |D|) I(D_v),    (14)

where I was defined as either the entropy, the Gini impurity, or the misclassification error. For regression, we can simply compare the continuous target variables to the predictions using a metric such as the mean squared error (MSE) at a given node t:

MSE(t) = (1/n_t) Σ_{i ∈ D_t} (y^[i] − h(x^[i])_t)².    (15)

Note that the predicted target value at a node t, h(x)_t, is computed as the sample mean of the training subset at that node:

h(x)_t = (1/n_t) Σ_{i ∈ D_t} y^[i].    (16)

The MSE at a given node is hence also often referred to as “within-node variance,” and the
splitting criterion is thus called “variance reduction.”
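To make this concrete, here is a minimal Python sketch (illustrative only; the function names are made up) that scores a candidate split of a regression node by its variance reduction, i.e., the weighted decrease in within-node MSE:

def node_mse(targets):
    # Within-node variance: MSE of predicting the node's sample mean
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)

def variance_reduction(parent, left, right):
    # Weighted decrease in within-node MSE achieved by splitting the parent node
    n = len(parent)
    return (node_mse(parent)
            - len(left) / n * node_mse(left)
            - len(right) / n * node_mse(right))

parent = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8]   # target values at the parent node
print(variance_reduction(parent, parent[:3], parent[3:]))  # ~0.93: the split separates low from high targets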
Note that regression trees suffer from the same problem as classification trees in that they are not good at approximating diagonal hyperplanes.

Figure 12: Classification tree approximating a diagonal decision boundary. Splits are always
perpendicular to the feature axes.

6.13 Summary

6.13.1 Pros and Cons of Decision Trees

Listed below are some of the pros and cons of using decision trees as a predictive model.

• (+) Easy to interpret and communicate


• (+) Independent of feature scaling

• (-) Easy to overfit


• (-) Elaborate pruning required
• (-) Expensive to just fit a “diagonal line”
• (-) Output range is bounded (depending on the training examples) in regression trees

In the next lecture, we will talk about several ensemble methods, some of which are traditionally focused on decision trees, e.g., bagging and random forests. These methods help make decision tree models more robust against overfitting by creating an ensemble that reduces the variance (here: the variance of the model with respect to the loss function¹¹) compared to the individual trees.

¹¹ More details on the bias-variance decomposition and trade-off will be provided in the model evaluation lectures.
