
STAT 451: Machine Learning

Lecture Notes

Sebastian Raschka
Department of Statistics
University of Wisconsin–Madison

http://stat.wisc.edu/~sraschka/teaching/stat451-fs2020/

Fall 2020

Contents

6 Decision Trees
6.1 Introduction
6.1.1 Terminology
6.1.2 Machine Learning Categories
6.1.3 Relationship Between Decision Trees and Rule-based Learning
6.2 Divide-and-Conquer and Recursion
6.3 General Decision Tree Algorithm
6.4 Time Complexity
6.5 ID3, C4.5, CART
6.5.1 General Differences
6.5.2 ID3 – Iterative Dichotomizer 3
6.5.3 C4.5
6.5.4 CART
6.5.5 Others
6.6 Information Gain
6.7 Information Theory and Entropy
6.8 Why Growing Decision Trees via Entropy or Gini Impurity instead of Misclassification Error?
6.9 Gini Impurity
6.10 Improvements
6.10.1 Gain Ratio
6.11 Overfitting
6.11.1 Pre-Pruning
6.11.2 Post-Pruning
6.12 Decision Tree for Regression
6.13 Summary
6.13.1 Pros and Cons of Decision Trees


6 Decision Trees

6.1 Introduction

• Decision tree algorithms can be considered as an iterative, top-down construction method for the hypothesis (classifier). You can picture a decision tree as a hierarchy of decisions that fork or divide the dataset into subspaces.

• Decision trees can represent any Boolean (binary) function, and the hypothesis space being searched is the entire space of Boolean functions¹; however, we need to keep in mind that a critical challenge in machine learning is whether an algorithm can learn/find the "right" function or a good approximation within the subspace being searched.

• Considering only binary (or Boolean) features, at each node there are 2^m potential splits to be evaluated, given that the dataset has m features.

• Decision tree algorithms search the hypothesis space greedily² over all possible trees; an exhaustive search is not feasible because of the exponential nature of the problem. I.e., assume that we have m Boolean features; then there are 2^m potential feature combinations. If we further consider a binary classification problem, there are 2^(2^m) possible ways to label the data. If each tree corresponds to a unique label assignment function, you can easily see how it becomes intractable to do a brute-force search over all possible decision trees for a given dataset (especially if we are working with non-binary features and labels).

6.1.1 Terminology

• Root node: no incoming edge, zero or more outgoing edges.

• Internal node: one incoming edge, two (or more) outgoing edges.
¹ The size of the hypothesis space is determined by the dataset.
² In greedy search, we make a series of locally optimal choices, which may not lead to the global optimum.

• Leaf node: each leaf node is assigned a class label if the node is pure; otherwise, the class label is determined by majority vote.
• Parent and child nodes: If a node is split, we refer to that given node as the parent
node, and the resulting nodes are called child nodes.

Figure 1: Example of a non-binary decision tree with categorical features. The root node asks "Work to do?"; if Yes, the leaf node is "Stay in". If No, an internal node asks "Outlook?" with the branches Sunny ("Go to beach"), Overcast ("Go running"), and Rainy, which leads to "Friends busy?" with the leaf nodes "Stay in" (Yes) and "Go to movies" (No).

6.1.2 Machine Learning Categories

In the context of the different categories of machine learning algorithms that we defined at
the beginning of this course, we may categorize decision trees as follows:

• Supervised learning algorithm: classification and regression


• Optimization method: combinatorial → greedy search
• Eager learning algorithm (remember, KNN was "lazy")
• Batch learning algorithm (meaning it learns from the whole training set at once rather than one example at a time)
• Nonparametric model (similar to KNN, there are no model parameters to fit, unlike, e.g., linear regression)
• Deterministic (vs. stochastic)

6.1.3 Relationship Between Decision Trees and Rule-based Learning

Intuitively, we can also think of a decision tree as a set of nested "if-else" rules, where a rule is simply a conjunction of conditions. For example,

Rule 1 = (if x = 1) ∩ (if y = 2) ∩ ...    (1)

Multiple rules can then be joined into a set of rules, which can be applied to predict the target value of a training example or test instance. For example,

Class 1 = if (Rule 1 = True) ∪ (Rule 2 = True) ∪ ...    (2)

Each leaf node in a decision tree represents such a rule, as illustrated in the following figure, which depicts the rule

(Work to do? = False) ∩ (Outlook? = Rainy) ∩ (Friends busy? = Yes) (3)

Figure 2: A rule for a given leaf node (circled): (Work to do? = False) ∩ (Outlook? = Rainy) ∩ (Friends busy? = Yes)

Considering the complete tree depicted in the previous figure, the decision rule for the class label "Stay in" can then be written as the following rule set:

[(Work to do? = False) ∩ (Outlook? = Rainy) ∩ (Friends busy? = Yes)] ∪ (Work to do? = True)    (4)
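To make the correspondence between trees and rules concrete, below is a minimal Python sketch (illustrative only; the function name weekend_activity and its argument names are made up for this example) of the tree from Figure 1 written as nested if-else statements, where each root-to-leaf path corresponds to one rule:

def weekend_activity(work_to_do, outlook, friends_busy):
    # Each root-to-leaf path below is one rule (a conjunction of conditions)
    if work_to_do:                       # root node: Work to do?
        return "Stay in"
    if outlook == "Sunny":               # internal node: Outlook?
        return "Go to beach"
    if outlook == "Overcast":
        return "Go running"
    # Outlook? = Rainy
    if friends_busy:                     # internal node: Friends busy?
        return "Stay in"                 # the rule from Eq. (3)
    return "Go to movies"

# The rule (Work to do? = False) AND (Outlook? = Rainy) AND (Friends busy? = Yes):
print(weekend_activity(work_to_do=False, outlook="Rainy", friends_busy=True))  # "Stay in"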

• Rules can be constructed from decision trees easily: each leaf node is a rule.
• However, the other way around is not always possible: we cannot always build a decision tree from a given set of rules, and even in cases where it is possible, it may not be immediately apparent how (especially if rules were pruned; e.g., where would you place the root of the tree?).
• Evaluating a rule set is much more expensive than evaluating a tree, where we only have to follow one single branch.
• Rule sets can give multiple answers if we are not careful.
• While rules are more expressive or flexible, they are more prone to overfitting, because they have a larger hypothesis space than decision trees.

6.2 Divide-and-Conquer and Recursion

• In the context of decision trees and how to implement them efficiently, it is helpful to
visit the topic of divide-and-conquer algorithms and the concept of recursion.
• To understand the basic concept behind recursion, consider the simple algorithm (here:
written as a Python function) to compute the length of an array:

def array_len(x):
    # Base case: the empty list has length 0
    if x == []:
        return 0
    # Recursive case: one element plus the length of the remaining list
    else:
        return 1 + array_len(x[1:])

• Divide-and-conquer is a concept in computer science where we divide a problem into subproblems of the same type.
• Usually, divide-and-conquer algorithms can be implemented using recursion; an example is shown below, implementing a simple version of the quicksort algorithm (in Python):

def quicksort(array):
    # Base case: arrays with fewer than 2 elements are already sorted
    if len(array) < 2:
        return array
    else:
        # Divide: partition the remaining elements around the first element (the pivot)
        pivot = array[0]
        smaller, bigger = [], []
        for ele in array[1:]:
            if ele <= pivot:
                smaller.append(ele)
            else:
                bigger.append(ele)
        # Conquer: recursively sort both partitions and combine them with the pivot
        return quicksort(smaller) + [pivot] + quicksort(bigger)

Figure 3: Illustration of the different steps of quicksort when applied to a simple toy dataset.

6.3 General Decision Tree Algorithm

This section outlines a generic decision tree algorithm using the concept of recursion from the previous section, which is the basic foundation underlying most decision tree algorithms described in the literature.

The process of growing a decision tree can be expressed as a recursive algorithm as follows:

1) Pick a feature such that, when the parent node³ is split on it, the split results in the largest information gain⁴.
2) Stop if the child nodes are pure or no improvement in class purity can be made.
3) Go back to step 1 for each of the two child nodes.

Below is a more formal expression of the algorithm outlined above:


GenerateTree(D):

• if y = 1 ∀ ⟨x, y⟩ ∈ D or y = 0 ∀ ⟨x, y⟩ ∈ D:
  – return Tree
• else:
  – Pick the best feature x_j:
    ∗ D_0 at Child_0: all ⟨x, y⟩ ∈ D with x_j = 0
    ∗ D_1 at Child_1: all ⟨x, y⟩ ∈ D with x_j = 1
  – return Node(x_j, GenerateTree(D_0), GenerateTree(D_1))
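Below is a minimal Python sketch of this recursive procedure for binary features and binary class labels (illustrative only; the function names generate_tree and _entropy are made up and not taken from any library, and the "best feature" is chosen via the entropy-based information gain that is formally introduced in Sections 6.6–6.7):

import math

def _entropy(labels):
    # Shannon entropy of a list of 0/1 class labels
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def generate_tree(X, y, features):
    # X: list of dicts mapping feature name -> 0 or 1; y: list of 0/1 labels
    if len(set(y)) == 1 or not features:
        return max(set(y), key=y.count)          # leaf: pure node or no features left
    def info_gain(f):
        gain = _entropy(y)
        for v in (0, 1):
            y_v = [label for x, label in zip(X, y) if x[f] == v]
            if y_v:
                gain -= len(y_v) / len(y) * _entropy(y_v)
        return gain
    best = max(features, key=info_gain)          # greedy choice: largest information gain
    if info_gain(best) <= 0:
        return max(set(y), key=y.count)          # stop if no split improves purity
    node = {"feature": best}
    remaining = [f for f in features if f != best]
    for v in (0, 1):
        X_v = [x for x in X if x[best] == v]
        y_v = [label for x, label in zip(X, y) if x[best] == v]
        if y_v:
            node[v] = generate_tree(X_v, y_v, remaining)
        else:
            node[v] = max(set(y), key=y.count)   # empty branch: parent majority class
    return node

# Example: learn y = x1 AND x2 from all four input combinations
X = [{"x1": a, "x2": b} for a in (0, 1) for b in (0, 1)]
y = [a & b for a in (0, 1) for b in (0, 1)]
print(generate_tree(X, y, ["x1", "x2"]))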

Now, while the algorithm above appears to be a viable approach for constructing a decision
tree, in practice, we may face several edge cases and issues that we need to think of when
implementing decision tree algorithms.
For instance, some of the design choices and considerations we have to make are listed below:

• How do we decide which feature to select for splitting a parent node into child nodes? I.e., what is a criterion to measure the goodness of a split?
• Since a multi-category split can be expressed as a series of binary splits, which approach is to be preferred?
• While splitting categorical features is intuitive, how can we deal with continuous inputs?
• When do we stop growing a tree (because complete separation can easily lead to overfitting)?
• How do we make predictions if no attributes exist to perfectly separate non-pure nodes further⁵?

³ In the first iteration, this is the root node.
⁴ Stop if the information gain is not positive.
⁵ Majority voting for classification trees and the sample mean for regression trees is typically a good choice.

6.4 Time Complexity

Measuring the time complexity of decision tree algorithms can be complicated, and the analysis is not very straightforward.
However, we can make a few simplifying assumptions to analyze the complexity of decision trees. Under the assumption that a decision tree is a balanced binary decision tree, the final tree will have a depth of log2(n), where n is the number of examples in the training set. Hence, it should be immediately obvious that the time complexity of the prediction step is O(log n).⁶
Determining the runtime complexity of decision tree training is less straightforward and varies wildly based on the algorithm choice and implementation. Assuming we have continuous features and perform binary splits, the runtime of the decision tree construction is generally O(m · n² log n). It can be shown that the optimal binary split on a continuous feature lies on the boundary between adjacent examples⁷ with different class labels⁸. This means that sorting the values of continuous features helps with determining a decision threshold efficiently. If we have n examples, the sorting has time complexity O(n log n). If we have to sort m features, this becomes O(m · n log n).
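As a small illustration of this sorting-based threshold search on a single continuous feature, consider the following sketch (illustrative only; it assumes 0/1 class labels, the function name best_threshold is made up, and candidate thresholds are scored via the misclassification error):

def best_threshold(feature_values, labels):
    # Sort the examples by feature value once: O(n log n)
    pairs = sorted(zip(feature_values, labels))
    n = len(pairs)
    best_t, best_err = None, float("inf")
    for i in range(1, n):
        (x_prev, y_prev), (x_cur, y_cur) = pairs[i - 1], pairs[i]
        # Candidate thresholds only lie between adjacent examples
        # that have different class labels
        if y_prev == y_cur or x_prev == x_cur:
            continue
        t = (x_prev + x_cur) / 2.0
        left = [label for x, label in pairs if x <= t]
        right = [label for x, label in pairs if x > t]
        # Score the split via the misclassification error under majority voting
        err = (min(left.count(0), left.count(1)) +
               min(right.count(0), right.count(1))) / n
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

print(best_threshold([2.5, 1.0, 3.5, 4.0], [0, 0, 1, 1]))  # (3.0, 0.0)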
To see why the time complexity of decision tree construction is typically quoted as O(m · n² log n), keep in mind that we earlier determined the depth of a decision tree to be log2(n). It follows that the number of terminal nodes (leaf nodes) is 2^(log2 n) = n. The total number of nodes in the tree is 2n − 1, and consequently, the number of "splitting" nodes in the tree is 2n − 1 − n = n − 1 (all nodes minus the terminal nodes).
Hence, if we are not efficient and re-sort the features prior to each split, we have to perform the O(m · n log n) sorting step up to n/2 times (once for each splitting node in the tree), which results in a time complexity of O(m · n² log n).
(Many implementations such as scikit-learn use efficient caching tricks to keep track of the general order of indices at each node such that the features do not need to be re-sorted at each node; hence, the time complexity of these implementations is just O(m · n log n).)

⁶ Remember, we write just log n because the base of the log is only a scaling factor: log2(n) = log_b(n) / log_b(2).
⁷ Training examples with similar feature values.
⁸ Usama Mohammad Fayyad, "On the induction of decision trees for multiple concept learning" (1992).

6.5 ID3, C4.5, CART

There exists a relatively large variety of decision tree algorithms. This section lists some of
the most influential/popular ones.

6.5.1 General Differences

Most decision tree algorithms differ in the following ways:

• Splitting criterion: information gain (Shannon entropy, Gini impurity, misclassification error), use of statistical tests, objective function, etc.
• Binary split vs. multi-way splits
• Discrete vs. continuous variables

• Pre- vs. post-pruning

6.5.2 ID3 – Iterative Dichotomizer 3

• Described in Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

• One of the earliest decision tree algorithms



• Discrete features, cannot handle numeric features

• Multi-category splits
• No pruning, prone to overfitting
• Short and wide trees (compared to CART)
• Maximizes information gain/minimizes entropy

• Discrete features, binary and multi-category features

6.5.3 C4.5

• Described in Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

• Continuous and discrete features (splitting on continuous features is very expensive because all possible ranges must be considered)
• Splitting criterion is computed via the gain ratio (explained later)
• Handles missing attributes (ignores them in information gain computation)

• Performs post-pruning (bottom-up pruning)

6.5.4 CART

• Described in Breiman, L. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
• Continuous and discrete features
• Strictly binary splits (resulting trees are taller compared to ID3 and C4.5)
• Binary splits can generate better trees than C4.5, but they tend to be larger and harder to interpret; i.e., for an attribute with k values, there are 2^(k−1) − 1 ways to create a binary partitioning
• Variance reduction in regression trees
• Uses Gini impurity (or “twoing criteria”) in classification trees
• Performs cost-complexity pruning (more on that later)

6.5.5 Others

• CHAID (CHi-squared Automatic Interaction Detector); Kass, G. V. (1980). "An exploratory technique for investigating large quantities of categorical data." Applied Statistics, 29(2), 119–127.

• MARS (Multivariate adaptive regression splines); Friedman, J. H. (1991). "Multivariate Adaptive Regression Splines." The Annals of Statistics, 19(1), 1.
• C5.0 (patented)

6.6 Information Gain

The standard criterion used for splitting in decision trees is the so-called information gain. In simple words: the better the split, the higher the information gain.

• Information gain relies on the concept of mutual information: the reduction of the entropy of one variable by knowing the other.⁹
• We want to maximize mutual information when defining splitting criteria.
• I.e., we define the criterion at a node such that it maximizes the information gain

GAIN(D, x_j) = H(D) − Σ_{v ∈ Values(x_j)} (|D_v| / |D|) H(D_v),    (5)

where D is the training set at the parent node, and D_v is the dataset at a child node upon splitting.

⁹ In this context: knowing the class label by knowing the feature value.

6.7 Information Theory and Entropy

This section briefly summarizes the concept of entropy as it was coined by Claude Shannon in the context of information theory.¹⁰

¹⁰ In this lecture, the term entropy refers to the Shannon entropy and is not to be confused with other definitions of entropy, such as the one from thermodynamics; although it is conceptually related if we think about order and disorder.

• In ID3, we use the Shannon entropy to measure the improvement in a decision tree (instead of the misclassification error); i.e., we use it as an optimization metric (or impurity measure).
• This entropy measure was originally proposed in the context of encoding digital information in the form of bits (0s or 1s).
• Consider entropy as a measure of the amount of information of a discrete random variable (two outcomes, Bernoulli distribution).
• Shannon information:
  – Shannon defined information as the number of bits needed to encode a number 1/p, where p is the probability that an event is true (i.e., 1 − p is the uncertainty).
  – The number of bits for encoding 1/p is log2(1/p).
  – Note: log2(1/p) = log2(1) − log2(p) = − log2(p).
  – − log2(p) ranges from ∞ down to 0 as p goes from 0 to 1; i.e., if we are 100% certain about an event, we gain 0 information.
  – E.g., assume 2 soccer teams, team 1 and team 2, both with a win probability of 50%.
    ∗ If the information "team 1 wins" is transmitted, we transmitted 1 bit: log2(1/0.5) = log2(2) = − log2(0.5) = 1.
• Shannon entropy is then the "average information":
  – Entropy: H(p) = Σ_i p_i log2(1/p_i)

  – Entropy (alternative notation): H(p) = − Σ_i p_i log2(p_i)
  – E.g., assume soccer team 1 and team 2 have win probabilities of 75% and 25%, respectively; then we get an average information content of 0.81 bits:

    H(p) = −0.75 × log2(0.75) − 0.25 × log2(0.25)
         ≈ −0.75 × (−0.41) − 0.25 × (−2)    (6)
         ≈ 0.81

• That means that we have some information about the problem if the distribution is non-uniform, and we could technically encode this message more efficiently (granted, in a real application we cannot send fractional bits, but this is just a toy example).
• The key idea behind Shannon entropy is that we can create messages of different lengths to transmit different information contents and send information more efficiently (saving bits) depending on how confident we are; however, this is beyond the scope of this course.

Figure 4: Entropy function H(p) for a binary classification problem, plotted as a function of p(y = 0).

• If we have i class labels (i.e., i different values for y), then the entropy can be as large
as log2 i.
– E.g., for 10 classes: 10 × (−(0.1 × log2 (0.1))) = log2 (10) = 3.32
• In the context of decision trees, think of entropy as the minimum number of bits that
are required to encode the classification of data points. For instance, if we have p = 0.5
(uniform class distribution in a binary classification problem) we need to send 1 bit
on average (most expensive) to classify a data point.
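As a quick numerical check of the soccer example above, here is a minimal Python sketch of the entropy computation (illustrative only):

import math

def entropy(probs):
    # Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: uniform distribution, least predictable
print(entropy([0.75, 0.25]))  # ~0.811 bits: the soccer example above
print(entropy([1.0, 0.0]))    # 0.0 bits: no uncertainty, no information gained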

6.8 Why Growing Decision Trees via Entropy or Gini Impurity instead of Misclassification Error?

Consider the more general formula for measuring the information gain,

GAIN(D, x_j) = I(D) − Σ_{v ∈ Values(x_j)} (|D_v| / |D|) I(D_v),    (7)

where I is a function that measures the impurity of a given node. If I is defined as the entropy measure H we defined earlier, this equation is equal to the information gain equation used in classification trees.
Instead of using entropy as an impurity measure, the misclassification error ERR seems to be another reasonable choice, where

ERR(D) = (1/n) Σ_{i=1}^{n} L(ŷ^[i], y^[i]),    (8)

with the 0-1 loss

L(ŷ, y) = 0 if ŷ = y, and 1 otherwise.    (9)

This, in the case of the training set, is equal to

ERR(p) = 1 − max_i p(i | x_j)    (10)

for a given node if we use majority voting at this node.


Now, to see the difference between using the misclassification error as opposed to the entropy
measure for computing the information gain upon growing a decision tree, we will take a
look at an example. Consider the following binary decision tree for a binary classification
problem with binary features:

Figure 5: Binary decision tree example. The root node contains 40 examples with y = 1 and 80 examples with y = 0. Splitting on x1 yields one child node with class counts (28, 42) and one with class counts (12, 38); splitting these on x2 and x3, respectively, separates the classes perfectly.

Note that this is a toy example, where we assume that there exist three features x1 , x2 , x3 ∈ {0, 1}
that result in a perfect class separation if split as illustrated in the preceding figure.
We split the tree using entropy as an information or impurity measure first, and the entropy
values of the child nodes of the root node are provided in the figure below.

Figure 6: Binary decision tree example, split via the entropy measure. The root node has entropy 0.918, and the two child nodes have entropies 0.971 and 0.795 (the grandchildren are pure, with entropy 0.0); the information gain of the root split is GAIN(D, x_j) = 0.918 − 70/120 × 0.971 − 50/120 × 0.795 ≈ 0.02.

We note that splitting the root node into the two child nodes results in entropy values 0.971
and 0.795, respectively. The information gain of this split is 0.02. In the toy example, the
splits that follow this first split will separate the classes correctly/perfectly, as indicated in
the figure.
Next, consider the same decision tree in a scenario where the misclassification error was used
as an impurity metric, as shown in the next figure.

Figure 7: Binary decision tree example, split via the misclassification error. The root node has error 40/120, and the two child nodes have errors 28/70 and 12/50 (the grandchildren have error 0.0); the information gain of the root split is GAIN(D, x_j) = 40/120 − 70/120 × 28/70 − 50/120 × 12/50 = 0.

As shown in the previous figure, the information gain upon splitting the root node using
the misclassification error as impurity metric is 0, which means that performing this split
actually did not result in an improvement compared to the original state (i.e., the root node).
Hence, according to the decision tree algorithm(s) we defined earlier, we would stop growing
the tree if we do not make further improvement(s) as measured via information gain.
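As a sanity check on these numbers, the following minimal Python sketch (illustrative only; the function names are made up) computes the information gain of the root split under both impurity measures, using the class counts from Figure 5:

import math

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def error(counts):
    # Misclassification error under majority voting at the node
    return 1 - max(counts) / sum(counts)

def gain(parent, children, impurity):
    n = sum(parent)
    return impurity(parent) - sum(sum(c) / n * impurity(c) for c in children)

parent = [40, 80]                 # class counts (y=1, y=0) at the root node
children = [[28, 42], [12, 38]]   # class counts after splitting on x1

print(round(gain(parent, children, entropy), 3))  # 0.021 (i.e., approx. 0.02)
print(round(gain(parent, children, error), 3))    # 0.0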
To provide an intuitive explanation as to why this happens, consider the next figure, which
shows the entropy values of the root node plotted along with the two child nodes.

Figure 8: Average versus weighted entropy of child nodes.

As can be seen in the plot above, the weighted average entropy of the child nodes is smaller than the entropy of the parent node (unless the child nodes have the same class distribution as the parent) due to the concave shape of the entropy function. This means that, using entropy, we can usually find a split that continues growing the tree in situations where growth based on the misclassification error would get stuck because the error does not improve.

6.9 Gini Impurity

• Gini impurity is a measure used in CART as opposed to entropy:

  Gini(t) = 1 − Σ_i p(c = i)²    (11)

• In practice, whether we use entropy or Gini impurity does not really matter, because
both have the same concave/bell shape, which is the essential part about this splitting
criterion.
• Gini impurity is computationally more efficient than entropy (due to the lack of the log), which could make the code negligibly more efficient in terms of computational performance.
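A minimal Python sketch of the Gini impurity computation from class counts (illustrative only):

def gini(counts):
    # Gini(t) = 1 - sum_i p(c=i)^2, computed from the class counts at node t
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([40, 80]))   # ~0.444 (root node of the toy example above)
print(gini([50, 50]))   # 0.5    (maximum impurity for two classes)
print(gini([100, 0]))   # 0.0    (pure node)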

Figure 9: Comparison of different impurity measures (entropy, scaled entropy, Gini impurity, and misclassification error) as a function of p(i = 1).

6.10 Improvements

This section introduces several ideas to improve the decision tree algorithm.

6.10.1 Gain Ratio

The gain ratio, introduced by Quinlan, penalizes splitting on categorical attributes with many values (e.g., think about including a date column, or, really bad, a row ID) via the split information:

GainRatio(D, x_j) = Gain(D, x_j) / SplitInfo(D, x_j),    (12)

where SplitInfo measures the entropy of the attribute itself:

SplitInfo(D, x_j) = − Σ_{v ∈ x_j} (|D_v| / |D|) log2(|D_v| / |D|).    (13)
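A minimal Python sketch of the split information and the resulting gain ratio (illustrative only; the gain values passed in are assumed to have been computed as in Eq. (5)):

import math

def split_info(partition_sizes):
    # Entropy of the attribute itself: how finely it partitions the dataset
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# Same information gain, but a 2-way split vs. a 120-way split (a row-ID-like attribute):
print(gain_ratio(0.02, [70, 50]))   # ~0.020
print(gain_ratio(0.02, [1] * 120))  # ~0.003  (heavily penalized)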

6.11 Overfitting

If decision trees are not pruned, they have a high risk of overfitting the training data.

Figure 10: Relationship between tree depth and overfitting (gap between training and test accuracy). Note that the plot starts at 50% accuracy, because we assume that we are dealing with a binary classification problem.

• Overfitting occurs if models pick up noise or errors in the training dataset; hence,
overfitting can be seen as a performance gap between training and test data.

Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H, such that h has smaller error than h′ over the training examples, but h′ has a smaller error than h over the entire distribution of instances. – Tom Mitchell

• Occam’s razor: Favor a simpler hypothesis, because a simpler hypothesis that fits
the data equally well (let’s say the same accuracy) is more likely or plausible than a
complex one.

• A general approach for minimizing overfitting in decision trees is decision tree pruning.
There are generally two approaches: post- and pre-pruning.

6.11.1 Pre-Pruning

• Set a depth cut-off (maximum tree depth) a priori.

• Cost-complexity pruning: I + α|N|, where I is an impurity measure, α is a tuning parameter, and |N| is the total number of nodes.
• Stop growing if a split is not statistically significant (e.g., χ2 test).
• Set a minimum number of data points for each node.
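As an aside, these pre-pruning controls map directly onto hyperparameters of common implementations; below is a minimal sketch using scikit-learn's DecisionTreeClassifier (the dataset and the specific hyperparameter values are arbitrary examples for illustration, not recommendations from this lecture):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# Pre-pruning via a depth cut-off and a minimum number of examples per leaf node
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=123)
tree.fit(X_train, y_train)
print(tree.score(X_train, y_train), tree.score(X_test, y_test))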

6.11.2 Post-Pruning

• Grow the full tree first, then remove nodes (e.g., done in C4.5).
• Reduced-error pruning: greedily remove nodes based on validation set performance; this generally improves performance but can be problematic for limited dataset sizes.
• We can also convert trees to rules first and then prune the rules.

– There is one rule per leaf node.


– If rules are not sorted, rule sets are costly to evaluate but more expressive.
– In contrast to pruned rule sets, rules from decision trees are mutually exclusive.
– Here we can prune rules independently of one another (meaning we do not remove both child nodes if we remove the root node).

Figure 11: Illustration of reduced-error pruning.

6.12 Decision Tree for Regression

Decision trees can also be used for regression analysis, which was introduced via CART – as
you remember from the previous sections, CART stands for Classification And Regression
Trees.
If we use decision trees for regression, we grow the tree (that is, deciding upon splitting
criteria at each node) through variance reduction at each node. Here, the variance refers to
the variance among the target variables at the parent node and its child nodes.
Earlier, in the context of classification, we defined the information gain as follows:

GAIN(D, x_j) = I(D) − Σ_{v ∈ Values(x_j)} (|D_v| / |D|) I(D_v),    (14)

where I was defined as either the entropy, the Gini impurity, or the misclassification error. For regression, we can simply compare the continuous target variables to the predictions using a metric such as the mean squared error (MSE) at a given node t:

MSE(t) = (1/n_t) Σ_{i ∈ D_t} (y^[i] − h(x^[i])_t)².    (15)

Note that the predicted target value at a node t, h(x)_t, is computed as the sample mean of the training subset at that node:

h(x)_t = (1/n_t) Σ_{i ∈ D_t} y^[i].    (16)

The MSE at a given node is hence also often referred to as “within-node variance,” and the
splitting criterion is thus called “variance reduction.”
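To make this concrete, here is a minimal Python sketch (illustrative only; the function names are made up) that scores a candidate split of a regression node by its variance reduction, i.e., the weighted decrease in within-node MSE:

def node_mse(targets):
    # Within-node variance: MSE of predicting the node's sample mean
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)

def variance_reduction(parent, left, right):
    # Weighted decrease in within-node MSE achieved by splitting the parent node
    n = len(parent)
    return (node_mse(parent)
            - len(left) / n * node_mse(left)
            - len(right) / n * node_mse(right))

parent = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8]   # target values at the parent node
print(variance_reduction(parent, parent[:3], parent[3:]))  # ~0.93: the split separates low from high targets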
Note that regression trees suffer from the same problem as classification trees in that they are not good at approximating diagonal hyperplanes.

Figure 12: Classification tree approximating a diagonal decision boundary. Splits are always
perpendicular to the feature axes.

6.13 Summary

6.13.1 Pros and Cons of Decision Trees

Listed below are some of the pros and cons of using decision trees as a predictive model.

• (+) Easy to interpret and communicate


• (+) Independent of feature scaling

• (-) Easy to overfit


• (-) Elaborate pruning required
• (-) Expensive to just fit a “diagonal line”
• (-) Output range is bounded (depending on the training examples) in regression trees

In the next lecture, we will talk about several ensemble methods, some of which are traditionally focused on decision trees, e.g., bagging and random forests. These methods help make decision tree models more robust against overfitting by creating an ensemble that reduces the variance (here: the variance of the model with respect to the loss function¹¹) compared to the individual trees.

¹¹ More details on the bias-variance decomposition and trade-off will be provided in the model evaluation lectures.
