
Machine Learning Lecture II:

Decision Trees

Prepared By

Dr Augustine S. Nsang
Introduction
• Decision tree learning is an extraordinarily important algorithm for AI, not only because it is very powerful, but also because it is simple and efficient for extracting knowledge from data.
• Compared to other learning algorithms, it has important advantages: the extracted knowledge can be easily understood, interpreted, and controlled by humans in the form of a readable decision tree.
• We shall show in a simple example how a decision tree can be constructed from training data.
A Simple Example
• A devoted skier who lives near the High Sierra, a beautiful mountain range in California, wants a decision tree to help him decide whether it is worthwhile to drive his car to a ski resort in the mountains. We thus have a two-class problem (ski: yes/no) based on the variables listed in Table 1.
• Figure 1 (Slide 5) shows a decision tree for this problem. A decision tree is a tree whose inner nodes represent features (attributes). Each edge stands for an attribute value. At each leaf node a class value is given.
• The data used for the construction of the decision tree is shown in Table 1 (next slide). Each row in the table contains the data for one day and as such represents a sample.
A Simple Example – Cont’d

Table 1: Data set for the skiing classification problem

• Upon close examination we see that row 6 and row 7 contradict each other. Thus no deterministic classification algorithm can correctly classify all of the data. The number of falsely classified samples must therefore be ≥ 1.
• The tree in Fig. 1 (next slide) thus classifies the data optimally.
A Simple Example – Cont’d

Fig 1: Decision tree for the skiing classification problem


A Simple Example – Cont’d
• We now develop a heuristic algorithm which, starting from the root, recursively builds a decision tree.
• First the attribute with the highest information gain (Snow_Dist) is chosen for the root node from the set of all attributes. For each attribute value (≤100, >100) there is a branch in the tree.
• For every branch this process is repeated recursively. During generation of the nodes, the attribute with the highest information gain among the attributes which have not yet been used is always chosen, in the spirit of a greedy strategy.
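As an illustration (not part of the original slides), the following Python sketch shows one way this greedy, recursive construction can be written down, in the style of ID3. The function names (build_tree, information_gain), the representation of trees as nested dictionaries, and the handling of empty attribute sets are assumptions made for the example; the entropy and gain helpers follow the definitions introduced on the next slides.

from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = -(p1 log2 p1 + ... + pn log2 pn), estimated from class frequencies."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of D minus the weighted entropies of the subsets induced by attr."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= len(subset) / total * entropy(subset)
    return gain

def build_tree(rows, labels, attributes):
    """Greedy, recursive construction: pick the attribute with the highest
    information gain, branch on its values, and recurse on every branch."""
    if len(set(labels)) == 1:                 # pure node: a single class remains
        return labels[0]
    if not attributes:                        # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        tree[best][value] = build_tree(sub_rows, sub_labels,
                                       [a for a in attributes if a != best])
    return tree

Called on the rows of Table 1 (with Snow_Dist discretized into ≤100 / >100 and the attributes Snow_Dist, Weekend, Sun), this procedure should reproduce the tree of Fig. 1.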
Entropy as a Metric for Information Content
• The described top-down algorithm for the construction of a decision tree, at each step, selects the attribute with the highest information gain.
• We now introduce the entropy as the metric for the information content of a set of training data D. If we only look at the binary variable skiing in the above example, then D can be described as

D = (yes, yes, yes, yes, yes, yes, no, no, no, no, no)

with estimated probabilities

p1 = P(yes) = 6/11 and p2 = P(no) = 5/11.

• Here we evidently have a probability distribution p = (6/11, 5/11). In general, for an n-class problem this reads

p = (p1, . . . , pn) with p1 + p2 + · · · + pn = 1.
Entropy as a Metric for Information Content – Cont’d
• To introduce the information content of a distribution we observe two extreme cases.
• First let p = (1, 0, 0, . . . , 0). In this case, the first one of the n events will certainly occur and all others will not. The uncertainty about the outcome of the events is thus minimal.
• In contrast, for the uniform distribution

p = (1/n, 1/n, . . . , 1/n)

the uncertainty is maximal because no event can be distinguished from the others.
Entropy as a Metric for Information Content – Cont’d

The entropy of a distribution p = (p1, . . . , pn) is defined (using the convention 0 · log2 0 = 0) as

H(p) = H(p1, . . . , pn) := −( p1 log2 p1 + p2 log2 p2 + · · · + pn log2 pn )

For the two extreme cases this yields

H(1, 0, . . . , 0) = 0 and H(1/n, 1/n, . . . , 1/n) = log2 n

Thus for a 4-class problem, H has a maximum value of 2, and for a 2-class problem, H has a maximum value of 1.
Entropy as a Metric for Information Content – Cont’d

Fig 2: The entropy function for the case of two classes. We see the maximum at p = 1/2 and the symmetry with respect to swapping p and 1 − p.
Information Content
• The information content of a dataset is defined as:

I(D) := 1 − H(D)

• If we apply the entropy formula to the example, the result is:

H(6/11, 5/11) = 0.994

• During construction of a decision tree, the dataset is further subdivided by each new attribute. The more an attribute raises the information content of the distribution by dividing the data, the better that attribute is.
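As a quick numerical check (an addition, not from the slides), the value H(6/11, 5/11) = 0.994 and the corresponding information content can be reproduced with a few lines of Python:

from math import log2

p_yes, p_no = 6 / 11, 5 / 11
H = -(p_yes * log2(p_yes) + p_no * log2(p_no))   # entropy of the skiing labels
I = 1 - H                                        # information content I(D) = 1 - H(D)
print(round(H, 3), round(I, 3))                  # prints: 0.994 0.006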
Information Gain

The information gain G(D, A) of an attribute A measures how much the entropy drops when D is split on the values of A. If A partitions D into the subsets D1, . . . , Dk, then

G(D, A) := H(D) − ( |D1|/|D| · H(D1) + · · · + |Dk|/|D| · H(Dk) )
Information Gain – Cont’d

Applied to our example:

G(D, Snow_Dist) = H(D) − ( 4/11 · H(D≤100) + 7/11 · H(D>100) )
               = 0.994 − ( 4/11 · 0 + 7/11 · 0.863 )
               = 0.445
Information Gain – Cont’d
Analogously, we obtain:
G(D, Weekend) = 0.150 and G(D, Sun) = 0.049
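The Snow_Dist gain can be reproduced without the full table: from H(D) = 0.994 and H(D>100) = 0.863 it follows that the branch D≤100 contains 4 samples, all classified yes, while D>100 contains the remaining 7 samples (2 yes, 5 no). The sketch below recomputes G(D, Snow_Dist) from these inferred class counts; the helper names are invented for the illustration, and the Weekend and Sun gains are not reproduced because they require the full Table 1.

from math import log2

def entropy_from_counts(counts):
    """Entropy of a class distribution given absolute class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_from_counts(branches):
    """G(D, A) = H(D) minus the weighted branch entropies, with branches given
    as {attribute value: [count of 'yes', count of 'no']}."""
    total = sum(sum(c) for c in branches.values())
    parent = entropy_from_counts([sum(c[0] for c in branches.values()),
                                  sum(c[1] for c in branches.values())])
    weighted = sum(sum(c) / total * entropy_from_counts(c) for c in branches.values())
    return parent - weighted

# Snow_Dist split inferred from the entropies quoted in the slides:
# D<=100: 4 yes / 0 no, D>100: 2 yes / 5 no
print(round(gain_from_counts({"<=100": [4, 0], ">100": [2, 5]}), 3))   # prints: 0.445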

Construction of the Decision Tree


• Since the attribute Snow_Dist has the largest information gain, it becomes the root node of the decision tree. The two attribute values ≤100 and >100 generate two edges in the tree, which correspond to the subsets D≤100 and D>100.
• For the subset D≤100 the classification is clearly yes. Thus the tree terminates here. In the other branch, D>100, there is no clear result. Thus the algorithm repeats recursively.
Construction of the Decision Tree – Cont’d
• From the two attributes still available, Sun and Weekend, the better one must be chosen. We calculate:

G(D>100, Weekend) = 0.292 and G(D>100, Sun) = 0.170

• The node thus gets the attribute Weekend assigned. For Weekend = no the tree terminates with the decision Ski = no. A calculation of the gain here returns the value 0. For Weekend = yes, Sun results in a gain of 0.171.
• Then the construction of the tree terminates because no further attributes are available, although example number 7 is falsely classified. The finished tree was displayed above in Slide 5.
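For illustration, the finished tree can also be read as a nested set of tests. The sketch below encodes it as a plain Python function; since Fig. 1 is not reproduced in this text version, the leaf labels of the Sun node (yes for Sun = yes, no for Sun = no) are an assumption, chosen to be consistent with the construction described above and with example 7 being misclassified.

def predict_ski(snow_dist, weekend, sun):
    """Classify one day with the decision tree constructed above.
    snow_dist is a number, weekend and sun are 'yes'/'no' strings.
    The Sun-leaf labels (yes -> yes, no -> no) are assumed, see the note above."""
    if snow_dist <= 100:
        return "yes"
    if weekend == "no":
        return "no"
    return "yes" if sun == "yes" else "no"

# Example: distant resort, weekend, sunny -> the tree predicts 'yes'
print(predict_ski(snow_dist=120, weekend="yes", sun="yes"))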
Random Forest Classification
• Given a dataset D with n data samples, a random forest classification is obtained using the following algorithm:

Step 1:
For i := 1 to p:
Select k data samples (k < n) at random from the n data samples in D, and construct a decision tree using these k data samples.

Step 2:
Given any unknown data sample x, classify x using each of the p decision trees constructed in Step 1. The class obtained by the random forest classification is given by a majority vote of the p decision trees.
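The slides describe the algorithm but not an implementation; the sketch below is one possible Python rendering of the two steps. The names random_forest, classify and predict_forest are invented for this illustration, the k rows per tree are drawn without replacement (k < n, as in Step 1), and any tree learner, for example the build_tree sketch given earlier, can be passed in as the build_tree argument.

from collections import Counter
import random

def random_forest(rows, labels, attributes, p, k, build_tree):
    """Step 1: build p trees, each from k (< n) samples drawn at random from D."""
    n = len(rows)
    forest = []
    for _ in range(p):
        idx = random.sample(range(n), k)        # k distinct samples, k < n
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        forest.append(build_tree(sub_rows, sub_labels, attributes))
    return forest

def classify(tree, x):
    """Walk one tree (nested dicts, as in the earlier sketch) for a sample x."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][x[attr]]   # an unseen attribute value raises KeyError here
    return tree

def predict_forest(forest, x):
    """Step 2: majority vote over the p trees."""
    votes = [classify(tree, x) for tree in forest]
    return Counter(votes).most_common(1)[0][0]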
