3 - Decision Trees
UNIT 3 U3/0/2024
Machine Learning
COS4852
Year module
School of Computing
CONTENTS
This document contains the material for UNIT 3 for COS4852 for 2024.
A decision tree is a tree-like visual representation that works similarly to a flow chart to make or
support decisions. Each node in the decision tree is an attribute that splits the data set into subsets
that correspond to specific values of that attribute. Each node then becomes a single decision point,
where a specific value of the attribute leads to a sub-tree, until all the attributes have been assigned
to nodes and final decision values are reached.
Figure 1: A small decision tree with Outlook as the root node and yes/no decisions at the leaves.
Figure 1 shows an example of a small decision tree that could be used to determine whether to play
sport based on the values of three weather variables: Outlook, Humidity and Wind.
1 OUTCOMES
In this Unit you will learn more about the theoretical basis of decision trees, and understand how to
apply one of the algorithms used to construct a decision tree from a dataset. You will learn how to
describe and solve a learning problem using decision tree learning.
You will:
1. Understand the relationship between Boolean functions, binary decision trees and decision
lists.
2 INTRODUCTION
In this Unit you will investigate the theory of decision trees and learn how to describe and solve a
learning problem using decision tree learning, using the ID3 algorithm.
There are many algorithms for constructing decision trees. The most famous of these is Ross Quinlan’s
ID3 algorithm, which is used to construct a decision tree from a set of discrete and integer data values.
There are variants of ID3 that can operate on continuous-valued datasets, as well as variants that
use a statistical approach. There are also more complex algorithms that construct a collection of
trees, called a forest-of-trees, which, although more complex, give more options for making accurate
decisions based on complex data.
3 PREPARATION
Chapter 6 in Nilsson’s book works through the ID3 algorithm for decision tree construction, using a
slightly different notation from what we will be using.
3.2 Textbooks
Sections 18.1 to 18.4 in Russell and Norvig’s 3rd edition are also a good source on decision lists.
The original 1986 article by Ross Quinlan describes one of the most successful algorithms to create
decision trees.
This IBM article gives a detailed discussion on what a decision tree is and does, and how to do the
basic ID3 calculations.
The Wikipedia page on ID3 gives a good overview of the ID3 algorithm.
4 DISCUSSION
Consider the Boolean function:
f1 (A, B) = ¬A ∧ B
Start by choosing A as the root node. This gives us the binary decision tree in Figure 2. On the
diagram you can see the mapping between specific parts of the truth table and the binary decision
tree. Each leaf node corresponds to one row in the truth table, while each node on the level above
the leaf nodes corresponds to two rows, and so on. By merging leaf nodes with the same value the
tree can be simplified, as in Figure 3.
Using B as the start node results in a different binary decision tree. In this particular case the tree
turns out to be as simple as the first. This is not the case for all decision trees.
Figure 2: The binary decision tree for f1 with A as the root node, showing how each leaf node maps
onto a single row of the truth table.
Figure 3: The simplified binary decision tree for f1: the A=1 branch collapses into a single 0 leaf,
while the A=0 branch still tests B.
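To make the mapping concrete, here is a minimal sketch in Python (an illustration, not part of the
prescribed material) that evaluates f1 by walking the simplified tree of Figure 3:

def f1_tree(A: int, B: int) -> int:
    """Evaluate f1(A, B) = ¬A ∧ B by walking the simplified tree of Figure 3."""
    if A == 1:                       # right branch of the root: f1 is 0 regardless of B
        return 0
    return 1 if B == 1 else 0        # left branch: the value of B decides the outcome

# Looping over all four input combinations reproduces the truth table of f1:
for A in (0, 1):
    for B in (0, 1):
        print(A, B, f1_tree(A, B))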
Decision lists
Rivest wrote a paper on how to create decision lists from a Boolean function. The paper goes into
some depth on how to do this.

You can think of a decision list as a binary decision tree where each node divides the data set
into two, so that one branch has a binary value ((0, 1) or {T, F}) as output, and the other branch
leads to further subdivision of the dataset. By writing a Boolean function in DNF form, this
becomes reasonably obvious. Another method that works well is to draw a Karnaugh diagram of the
function and reduce the function through the diagram, using the same process that would be used to
create a DNF form.
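As an illustration (again a Python sketch, not part of the prescribed material): the DNF form of f1 is
the single term ¬A ∧ B, which gives a decision list with one test and a default output:

def f1_decision_list(A: int, B: int) -> int:
    """Decision list for f1(A, B) = ¬A ∧ B: test the single DNF term and output
    its value if it matches, otherwise fall through to the default."""
    if A == 0 and B == 1:            # the single conjunction from the DNF of f1
        return 1
    return 0                         # default output when no test matches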
Figure 4: The tree for f1 with B at the root: B=0 leads directly to the output 0, while B=1 leads to a
node on A, where A=0 gives 1 and A=1 gives 0.
Constructing a decision tree is a recursive process of deciding which attribute to use at each node of
the tree. We want to choose the attribute that is the “best” at classifying the instances in
the data set. “Best” here is a quantitative measure (a number). One such measure is a statistical
measure called Information Gain, which determines how well a given attribute separates the data set
into the target classes.
ID3 uses the attribute with the highest Information Gain as the next node in constructing the tree.
The ID3 algorithm is a recursive algorithm that constructs sub-trees for attribute values of each
node, using the sub-sets of the data matching the attribute value of the sub-tree.
Figure 5: Decision tree showing how nodes are selected in the ID3 algorithm
Figure 5 shows a decision tree with labels indicating how ID3 selects nodes in the construction of
the tree.
Entropy
Entropy is an important concept in thermodynamics. Claude Shannon saw that the concept could
also be used to describe how much information there is in the outcome of a discrete random variable
(such as determining whether a coin will land heads up or not), and to make sure that communication
over a network does not lose information. We can use the concept to measure the “usefulness” of a
variable in terms of its information content. This idea forms the core of the decision tree construction
process in ID3.
Given a discrete random variable X, which takes values in the alphabet 𝒳 and is distributed
according to p : 𝒳 → [0, 1], its Entropy is defined as

H(X) := − Σ_{x ∈ 𝒳} p(x) log p(x) = E[− log p(X)]
Figure 6: The Entropy of a binary variable, plotted against the probability of one of its two outcomes.
where the sum is calculated over all possible values x in the alphabet 𝒳. The base of the logarithm
is chosen to match the number of values the target can take; for example, log2 is used when the
target values in the data are binary (yes/no or T/F).

Figure 6 shows the Entropy for a single binary variable. Here you can see that Entropy is never
negative and, for a binary variable, can never be larger than 1.
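As a quick illustration, here is a small Python sketch (my own helper, not part of the prescribed
material) that computes the Entropy of a collection from its class counts:

import math

def entropy(counts):
    """Entropy of a collection described by its class counts, e.g. [5, 9].
    Uses log base 2 and takes 0 * log2(0) to be 0."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

print(entropy([7, 7]))    # 1.0: a 50/50 split has maximum entropy
print(entropy([14, 0]))   # 0.0: a pure collection has no entropy
print(entropy([5, 9]))    # about 0.940, the baseline used in the worked example below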
Information Gain
Entropy can be viewed as a measure of the impurity of a collection of instances (a data set). In order
to construct a decision tree we want to repeatedly sub-divide our data set in such a way that we
create the largest reduction in entropy with each sub-division. The Information Gain of an attribute
A relative to a dataset S is defined as:
Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)
where Values(A) is the set of all possible values that attribute A can take, Sv is the subset of S for
which A has the value v, and |Sv| and |S| are the numbers of instances in Sv and S respectively.
ID3 uses Information Gain to find the attribute that splits the dataset in such a way that we get
the highest reduction in entropy, or, equivalently, the highest Information Gain. The worked
example in Subsection 4.3 below will illustrate this in more detail.
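A matching Python sketch of the Gain calculation (the helper names are my own; the counts in the
example correspond to the attribute C split in the worked example below):

import math

def entropy(counts):
    """Entropy (base 2) of a collection given its class counts, with 0 * log2(0) = 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, subset_counts):
    """Gain(S, A): Entropy(S) minus the size-weighted Entropy of the subsets S_v,
    where subset_counts holds one count list per value v of attribute A."""
    total = sum(parent_counts)
    weighted = sum(sum(sub) / total * entropy(sub) for sub in subset_counts)
    return entropy(parent_counts) - weighted

# Attribute C splits [5+, 9-] into [2+, 2-], [2+, 4-] and [1+, 3-]:
print(round(gain([5, 9], [[2, 2], [2, 4], [1, 3]]), 4))   # 0.0292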
4.3 Worked example

We will use the data set in Table 1 to work through an example of the ID3 algorithm.
First, we calculate the Entropy for the entire data set. We do this as a baseline against which
to compare which attribute will become our root node. This is in turn done by calculating the
Information Gain for each attribute.
This is a binary classification problem. There are 14 instances, of which 5 result in Class = ⊕ and 9
in Class = ⊖. In other words:

Entropy(S) ≡ Entropy([5⊕, 9⊖])
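Written out with the Entropy formula from the previous section, this baseline value is

Entropy(S) = −5/14 log2 (5/14) − 9/14 log2 (9/14)
= 0.9403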
There are four attributes, which we can shorten to (C, F, H, T), whose combinations of values
determine the value of the target attribute, Class.
Attribute C can take on three values (shortened here):

Values(C) = {R, G, B}

SC = [5⊕, 9⊖]
SC=R ← [2⊕, 2⊖]
SC=G ← [2⊕, 4⊖]
SC=B ← [1⊕, 3⊖]
Calculate the Entropy values of the three subsets of the data associated with the values of the
attribute C:
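Using the subset counts above, these are

Entropy(SC=R) = −2/4 log2 (2/4) − 2/4 log2 (2/4)
= 1.0000
Entropy(SC=G) = −2/6 log2 (2/6) − 4/6 log2 (4/6)
= 0.9183
Entropy(SC=B) = −1/4 log2 (1/4) − 3/4 log2 (3/4)
= 0.8113

Weighting each subset Entropy by the size of its subset and subtracting from Entropy(S) gives

Gain(S, C) = 0.9403 − 4/14 (1.0000) − 6/14 (0.9183) − 4/14 (0.8113)
= 0.0292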
Repeat these calculations for the other three attributes as well. We now get all the Information Gain
values:
Gain(S, C) = 0.0292
Gain(S, F) = 0.2000
Gain(S, H) = 0.1518
Gain(S, T) = 0.0481
The attribute with the highest Information Gain causes the highest reduction in entropy. This is the
attribute Form with Gain(S, F) = 0.2000, which then becomes the root node of the decision tree, as
shown in Figure 7.
The ID3 algorithm now recurses over the subsets of the data associated with the three branches of
the root node of the decision tree.
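For reference, the whole procedure can be written down compactly. The sketch below is a minimal
Python version of ID3 for categorical attributes; the representation of the data as
(attribute-dictionary, class) pairs and all function names are my own choices for illustration, not
definitions from this unit or from Quinlan's paper.

import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Entropy (base 2) of a list of class labels, with 0 * log2(0) taken as 0."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(S, A):
    """Gain(S, A) = Entropy(S) minus the size-weighted Entropy of the subsets S_v."""
    subsets = defaultdict(list)
    for attrs, label in S:
        subsets[attrs[A]].append(label)
    remainder = sum(len(sub) / len(S) * entropy_of(sub) for sub in subsets.values())
    return entropy_of([label for _, label in S]) - remainder

def id3(S, attributes):
    """Return a decision tree as nested dicts: {attribute: {value: subtree or leaf}}."""
    labels = [label for _, label in S]
    if len(set(labels)) == 1:                    # pure subset: make a leaf
        return labels[0]
    if not attributes:                           # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda A: information_gain(S, A))   # highest Gain
    tree = {best: {}}
    for value in {attrs[best] for attrs, _ in S}:
        subset = [(attrs, label) for attrs, label in S if attrs[best] == value]
        tree[best][value] = id3(subset, [A for A in attributes if A != best])
    return tree

# With the instances of Table 1 as S, a call such as id3(S, ['C', 'F', 'H', 'T'])
# would carry out the same calculations as this worked example.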
Figure 7: Decision tree after the first set of calculations, with Form as the root node and its
branches still unresolved.
Table 2 shows the subset of the data with Form=cube. In it are 6 instances, of which 2 result in
Class = ⊕ and 4 in Class = ⊖. Therefore:

Entropy(SF=c) ≡ Entropy([2⊕, 4⊖])
Calculate the Entropy values of the three subsets of the data associated with the values of the
attribute C:
Entropy(SF=c,C=R ) = −1/2 log2 (1/2) − 1/2 log2 (1/2)
= 1.0000
Entropy(SF=c,C=G ) = −1/3 log2 (1/3) − 2/3 log2 (2/3)
= 0.9183
Entropy(SF=c,C=B) = −0/1 log2 (0/1) − 1/1 log2 (1/1)
= 0.0000 (taking 0 log2 0 = 0 by convention)
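Weighting these by the sizes of their subsets (2, 3 and 1 of the 6 Form=cube instances) and
subtracting them from Entropy(SF=c) = 0.9183 gives

Gain(SF=c, C) = 0.9183 − 2/6 (1.0000) − 3/6 (0.9183) − 1/6 (0.0000)
= 0.1258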
Table 3: Subset of the data with Form=sphere
Gain(SF=c, C) = 0.1258
Gain(SF=c, H) = 0.0441
Gain(SF=c, T) = 0.9183
The attribute with the highest Information Gain is Transparent, which then becomes the next node
in the decision tree, under the branch with the value Form=cube. The data in Table 3 show that
all the instances have output ⊖, which means that we can define a leaf node under Form=sphere.
The result of these calculations gives the decision tree as in Figure 8.
In Table 4, where Form=pyramid, there are 5 instances, of which 3 result in Class = ⊕ and 2 in
Class = ⊖. Therefore:

Entropy(SF=p) ≡ Entropy([3⊕, 2⊖])
= −3/5 log2 (3/5) − 2/5 log2 (2/5)
= 0.9710
Figure 8: Decision tree after the second set of calculations and the observation for Form=sphere
and the Information Gain values for the remaining attributes on this subset are:
Gain(SF=p, C) = 0.1710
Gain(SF=p, H) = 0.9710
Gain(SF=p, T) = 0.9710
We now see an interesting phenomenon: the highest Information Gain value is shared by two
attributes, Transparent and Hollow. We can choose either, as they have the same effect in reducing
Entropy. We have already used Transparent in another branch, so choosing Hollow will result in a
simpler tree (Occam’s razor), giving the decision tree in Figure 9.
Figure 9: Decision tree with Transparent under Form=cube and Hollow under Form=pyramid, each
with unresolved yes/no branches.
We now have four branches of the tree to investigate, possibly repeating the calculations. These
branches correspond to subsets of the data. Tables 5 and 6 are the subsets under the two branches
of Transparent.

In both of these subsets there is only one output class. This means that we have two more leaf
nodes, as in Figure 10.
Figure 10: Decision tree after resolving the Transparent branches: Transparent=yes becomes a ⊕
leaf and Transparent=no a ⊖ leaf, while the yes/no branches under Hollow are still unresolved.
We are now left with two more subsets to investigate: those for the branches of Hollow. Tables 7
and 8 show these subsets.

Again we observe the same phenomenon as with the previous two subsets, namely that there is
only a single class in each. This means that we have our final two leaf nodes, as in Figure 11.
Final tree: Form at the root; Form=cube → Transparent (yes: ⊕, no: ⊖); Form=sphere → ⊖;
Form=pyramid → Hollow (yes: ⊕, no: ⊖).
Figure 11: Decision tree after the final set of observations
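Reading the branch values and leaf labels from Figures 10 and 11, the final tree can also be written
as a small set of nested decisions. The sketch below is in Python; the attribute names used as
dictionary keys are my own shorthand for the attributes in Table 1, chosen only for illustration:

def classify(instance):
    """Classify an instance with the final decision tree of Figure 11.
    `instance` is assumed to be a dict such as
    {'Form': 'cube', 'Transparent': 'yes', 'Hollow': 'no', 'Colour': 'R'}."""
    if instance['Form'] == 'cube':
        return '+' if instance['Transparent'] == 'yes' else '-'
    if instance['Form'] == 'sphere':
        return '-'                      # every Form=sphere instance was negative
    # remaining branch: Form == 'pyramid'
    return '+' if instance['Hollow'] == 'yes' else '-'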
5 ACTIVITIES
5.1 TASK 1

Find and read all the online material referred to earlier in this document. Study the relevant
concepts carefully and thoroughly.
5.2 TASK 2

Find resources on other algorithms for constructing decision trees (some of these will be in the
textbooks and material you have already studied in the first task). Some of these algorithms include
ID3 (which you have studied here), C4.5, and CART.
Study these algorithms so that you understand how they work and on what kinds of data sets
they can be applied. What are the differences? What are the advantages and shortcomings of
these algorithms? What would you do with missing or incorrect data? How would you handle
non-categorical or continuous data? Can you use other cost functions? Can you use different cost
functions in different parts of the data set? Why and when would you do so?
5.3 TASK 3
Find resources on more advanced extensions of decision-tree learning. Look specifically at ensemble
methods, such as bagging and boosting, and their further extension into random forests.
© UNISA 2024