03 Decision Tree
1
Classification
The input data for a classification task is a
collection of records.
Each record, also known as an instance or
example, is characterized by a tuple (x, y)
x is the attribute set
y is the class label, also known as category or
target attribute.
The class label is a discrete attribute.
2
Classification
Classification is the task of learning a target
function f that maps each attribute set x to
one of the predefined class labels y.
The target function is also known as a
classification model.
A classification model is useful for the
following purposes
Descriptive modeling
Predictive modeling
3
Classification
A classification technique (or classifier) is a
systematic approach to perform classification
on an input data set.
Examples include
Decision tree classifiers
Neural networks
Support vector machines
4
Classification
A classification technique employs a learning
algorithm to identify a model that best fits the
relationship between the attribute set and the class
label of the input data.
The model generated by a learning algorithm should
Fit the input data well and
Correctly predict the class labels of records it has
never seen before.
A key objective of the learning algorithm is to build
models with good generalization capability.
5
Classification
First, a training set consisting of records
whose class labels are known must be
provided.
The training set is used to build a
classification model.
This model is subsequently applied to the test
set, which consists of records that are
different from those in the training set.
6
Confusion matrix
Evaluation of the performance of the model is
based on the counts of correctly and
incorrectly predicted test records.
These counts are tabulated in a table known
as a confusion matrix.
Each entry aij in this table denotes the
number of records from class i predicted to
be of class j.
7
Confusion matrix
                   Predicted Class = 1   Predicted Class = 0
Actual Class = 1          a11                   a10
Actual Class = 0          a01                   a00
8
Confusion matrix
The total number of correct predictions made
by the model is a11+a00.
The total number of incorrect predictions is
a10+a01.
9
Confusion matrix
The information in a confusion matrix can be
summarized with the following two measures
Accuracy
Accuracy = \frac{a_{11} + a_{00}}{a_{11} + a_{10} + a_{01} + a_{00}}
Error rate
Error Rate = \frac{a_{10} + a_{01}}{a_{11} + a_{10} + a_{01} + a_{00}}
Most classification algorithms aim to attain the
highest accuracy or, equivalently, the lowest error
rate when applied to the test set.
10
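As a quick sketch, both measures can be computed directly from the four confusion-matrix counts; the counts below are made up purely for illustration.

```python
# Accuracy and error rate from the four confusion-matrix counts.
# The counts are illustrative placeholders, not data from the slides.
a11, a10 = 40, 10   # actual class 1 predicted as class 1 / as class 0
a01, a00 = 5, 45    # actual class 0 predicted as class 1 / as class 0

total = a11 + a10 + a01 + a00
accuracy = (a11 + a00) / total
error_rate = (a10 + a01) / total

print(accuracy, error_rate)                      # 0.85 0.15
assert abs(accuracy + error_rate - 1.0) < 1e-12  # the two measures sum to 1
```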
Decision tree
We can solve a classification problem by
asking a series of carefully crafted questions
about the attributes of the test record.
Each time we receive an answer, a follow-up
question is asked.
This process is continued until we reach a
conclusion about the class label of the record.
11
Decision tree
The series of questions and answers can be
organized in the form of a decision tree.
It is a hierarchical structure consisting of nodes and
directed edges.
The tree has three types of nodes
A root node that has no incoming edges.
Internal nodes, each of which has exactly one
incoming edge and a number of outgoing edges.
Leaf or terminal nodes, each of which has exactly one
incoming edge and no outgoing edges.
12
Decision tree
In a decision tree, each leaf node is assigned
a class label.
The non-terminal nodes, which include the
root and other internal nodes, contain
attribute test conditions to separate records
that have different characteristics.
13
Decision tree
Classifying a test record is straightforward once a
decision tree has been constructed.
Starting from the root node, we apply the test
condition.
We then follow the appropriate branch based on the
outcome of the test.
This will lead us either to
Another internal node, at which a new test condition is
applied, or
A leaf node.
The class label associated with the leaf node is then
assigned to the record.
14
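A minimal sketch of this traversal, assuming a simple node structure; the DecisionNode class and predict function are illustrative, and the example tree mirrors the marathon example used later in these slides.

```python
# Classifying a record by walking from the root to a leaf.
class DecisionNode:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # attribute tested at this node (None for a leaf)
        self.children = children or {}  # test outcome -> child node
        self.label = label              # class label (set only for leaf nodes)

def predict(node, record):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while node.label is None:
        node = node.children[record[node.attribute]]
    return node.label

# A small tree: AIR=High -> Large; AIR=Low -> test TEMP.
tree = DecisionNode("AIR", {
    "High": DecisionNode(label="Large"),
    "Low": DecisionNode("TEMP", {
        "Low": DecisionNode(label="Small"),
        "High": DecisionNode(label="Large"),
    }),
})
print(predict(tree, {"AIR": "Low", "TEMP": "Low"}))  # Small
```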
Decision tree construction
Efficient algorithms have been developed to
induce a reasonably accurate, although
suboptimal, decision tree in an acceptable
amount of time.
These algorithms usually employ a greedy
strategy that makes a series of locally optimal
decisions about which attribute to use for
partitioning the data.
15
Decision tree construction
A decision tree is grown in a recursive
fashion by partitioning the training records
into successively purer subsets.
We suppose
Us is the set of training records that are
associated with node s.
C = {c1, c2, …, cK} is the set of class labels.
16
Decision tree construction
If all the records in Us belong to the same class ck,
then s is a leaf node labeled as ck.
If Us contains records that belong to more than one
class,
An attribute test condition is selected to partition the
records into smaller subsets.
A child node is created for each outcome of the test
condition.
The records in Us are distributed to the children based
on the outcomes.
The algorithm is then recursively applied to each
child node.
17
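A sketch of this recursive scheme. The record format (a dict with a "class" key), the use of a set of attribute names, and the select_test callback are assumptions for illustration; select_test stands for whatever criterion picks the attribute test condition (e.g., information gain, discussed later).

```python
from collections import Counter

def grow_tree(records, attributes, select_test):
    """Recursively partition the training records into successively purer subsets."""
    classes = Counter(r["class"] for r in records)
    if len(classes) == 1 or not attributes:
        # Leaf node: all records share one class, or no attribute is left to test;
        # label it with the (majority) class.
        return {"label": classes.most_common(1)[0][0]}
    attribute = select_test(records, attributes)
    node = {"attribute": attribute, "children": {}}
    # One child per outcome of the test condition; records are distributed accordingly.
    for value in {r[attribute] for r in records}:
        subset = [r for r in records if r[attribute] == value]
        node["children"][value] = grow_tree(subset, attributes - {attribute}, select_test)
    return node
```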
Decision tree construction
For each node, let p(ck) denote the fraction
of training records from class ck.
In most cases, a leaf node is assigned to
the class that has the largest number of
training records.
The fraction p(ck) for a node can also be used
to estimate the probability that a record
assigned to that node belongs to class ck.
18
Decision tree construction
Decision trees that are too large are
susceptible to a phenomenon known as
overfitting.
A tree pruning step can be performed to
reduce the size of the decision tree.
Pruning helps by trimming the tree branches
in a way that reduces the generalization
error.
19
Attribute test
Each recursive step of the tree-growing
process must select an attribute test condition
to divide the records into smaller subsets.
To implement this step, the algorithm must
provide
A method for specifying the test condition for
different attribute types and
An objective measure for evaluating the
goodness of each test condition.
20
Attribute test
Binary attributes
The test condition for a binary attribute
generates two possible outcomes.
21
Attribute test
Nominal attributes
A nominal attribute can produce binary or
multi-way splits.
There are 2^(S-1) - 1 ways of creating a binary
partition of S attribute values.
For a multi-way split, the number of outcomes
depends on the number of distinct values for
the corresponding attribute.
22
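A small sketch that enumerates these binary partitions and confirms the 2^(S-1) - 1 count; the attribute values used are illustrative, not from the slides.

```python
from itertools import combinations

def binary_partitions(values):
    """List the distinct two-way groupings of a nominal attribute's values."""
    values = list(values)
    parts = []
    # Fix the first value in the left group so each split is counted only once,
    # then add any subset of the remaining values to it.
    rest = values[1:]
    for r in range(len(values) - 1):
        for combo in combinations(rest, r):
            left = {values[0], *combo}
            parts.append((left, set(values) - left))
    return parts

splits = binary_partitions(["Family", "Sports", "Luxury"])
print(len(splits))            # 3 = 2**(3-1) - 1
for left, right in splits:
    print(left, "vs", right)
```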
Attribute test
Ordinal attributes
Ordinal attributes can also produce binary or multi-
way splits.
The attribute values can be grouped as long as
the grouping does not violate their order
property.
24
Attribute test
Continuous attributes
The test condition can be expressed as a comparison
test x≤T or x>T.
For the binary case
The decision tree algorithm must consider all
possible split positions T, and
Select the one that produces the best partition.
25
Decision tree construction
We consider the problem of using a decision
tree to predict the number of participants in a
marathon race requiring medical attention.
This number depends on attributes such as
Temperature forecast (TEMP)
Humidity forecast (HUMID)
Air pollution forecast (AIR)
27
Decision tree construction
[Table: the ten training records (numbered 1-10) with their TEMP, HUMID, and AIR values and the class label (Large or Small number of participants requiring medical attention).]
28
Decision tree construction
In a decision tree, each internal node
represents a particular attribute, e.g., TEMP
or AIR.
Each possible value of that attribute
corresponds to a branch of the tree.
Leaf nodes represent classifications, such as
Large or Small number of participants
requiring medical attention.
29
Decision tree construction
[Figure: a decision tree for this example. The root node tests AIR; the High branch leads to a Large leaf, and the Low branch leads to a TEMP node whose branches end in Small and Large leaves.]
30
Decision tree construction
Suppose AIR is selected as the first attribute.
This partitions the examples as follows.
AIR = Low: {1,3,4,6,9}
AIR = High: {2,5,7,8,10}
31
Decision tree construction
Since the entries of the set {2,5,7,8,10} all correspond
to the case of a large number of participants requiring
medical attention, a leaf node is formed.
On the other hand, for the set {1,3,4,6,9}
TEMP is selected as the next attribute to be tested.
This further divides this partition into {4}, {3,6} and {1,9}.
32
Decision tree construction
[Figure: the partially grown tree. The root node tests AIR; the High branch is a Large leaf, and the Low branch leads to a TEMP node that is expanded next.]
33
Information theory
Each attribute reduces a certain amount of
uncertainty in the classification process.
We calculate the amount of uncertainty
reduced by the selection of each attribute.
We then select the attribute that provides the
greatest uncertainty reduction.
34
Information theory
Information theory provides a mathematical
formulation for measuring how much information a
message contains.
We consider the case where a message is selected
among a set of possible messages and transmitted.
The information content of a message depends on
The size of this message set, and
The frequency with which each possible message
occurs.
35
Information theory
The amount of information in a message with
occurrence probability p is defined as -log2(p).
Suppose we are given
a set of messages, C = {c1, c2, …, cK}
the occurrence probability p(ck) of each ck.
We define the entropy I as the expected information
content of a message in C :
I = -\sum_{k=1}^{K} p(c_k) \log_2 p(c_k)
The entropy is measured in bits.
36
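A minimal sketch of this entropy computation (the function name and example distributions are ours).

```python
from math import log2

def entropy(probabilities):
    """Expected information content in bits: -sum_k p(c_k) * log2 p(c_k)."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: two equally likely messages
print(entropy([0.25] * 4))   # 2.0 bits: four equally likely messages
```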
Attribute selection
We can calculate the entropy of a set of
training examples from the occurrence
probabilities of the different classes.
In our example
p(Small)=2/10
p(Large)=8/10
37
Attribute selection
The set of training instances is denoted as U
We can calculate the entropy as follows:
I(U) = -\frac{2}{10}\log_2\frac{2}{10} - \frac{8}{10}\log_2\frac{8}{10}
     = -\frac{2}{10}(-2.322) - \frac{8}{10}(-0.322)
     = 0.722 bit
38
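The same value can be checked numerically:

```python
from math import log2

# Entropy of the training set: 2 of the 10 records are Small, 8 are Large.
I_U = -(2/10) * log2(2/10) - (8/10) * log2(8/10)
print(round(I_U, 3))   # 0.722
```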
Attribute selection
The information gain provided by an attribute is the
difference between
1. The degree of uncertainty before including the
attribute.
2. The degree of uncertainty after including the attribute.
Item 2 above is defined as the weighted average of
the entropy values of the child nodes of the attribute.
39
Attribute selection
If we select attribute P, with S values, this will
partition U into the subsets {U1,U2,…,US}.
The average degree of uncertainty after
selecting P is
I(P) = \sum_{s=1}^{S} \frac{|U_s|}{|U|} I(U_s)
40
Attribute selection
The information gain associated with attribute
P is computed as follows.
gain(P) = I(U) - I(P)
If the attribute AIR is chosen, the examples
are partitioned as follows:
U1={1,3,4,6,9}
U2={2,5,7,8,10}
41
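A sketch that puts the two definitions together; the helper names and the small label lists in the usage example are illustrative.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """I(U) for a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """gain(P) = I(U) - sum_s |U_s|/|U| * I(U_s)."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted

# Illustrative labels for a parent node split into two children.
left, right = ["A", "A", "B"], ["B", "B", "B"]
print(round(information_gain(left + right, [left, right]), 3))   # 0.459
```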
Attribute selection
The resulting entropy value is
I(AIR) = \frac{5}{10} I(U_1) + \frac{5}{10} I(U_2)
       = \frac{5}{10}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{5}{10}(0)
       = 0.485 bit
42
Attribute selection
The information gain can be computed as
follows:
gain(AIR) = I(U) - I(AIR) = 0.722 - 0.485 = 0.237 bit
43
Attribute selection
For the attribute TEMP which partitions the
examples into U1={4,5}, U2={3,6,7,8} and
U3={1,2,9,10}:
I(TEMP) = \frac{2}{10} I(U_1) + \frac{4}{10} I(U_2) + \frac{4}{10} I(U_3)
        = \frac{2}{10}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) + \frac{4}{10}\left(-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right) + \frac{4}{10}(0)
        = 0.525 bit
gain(TEMP) = I(U) - I(TEMP) = 0.722 - 0.525 = 0.197 bit
44
Attribute selection
For the attribute HUMID which partitions the
examples into U1={4,5,6,7,9,10} and U2={1,2,3,8}:
I(HUMID) = \frac{6}{10} I(U_1) + \frac{4}{10} I(U_2)
         = \frac{6}{10}\left(-\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6}\right) + \frac{4}{10}(0)
         = 0.551 bit
gain(HUMID) = I(U) - I(HUMID) = 0.722 - 0.551 = 0.171 bit
45
Attribute selection
The attribute AIR corresponds to the highest
information gain.
As a result, this attribute will be selected.
46
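The three gains can be reproduced from the class counts in each child node; the (Small, Large) counts below are read off the calculations above. Note that full precision gives 0.236 for AIR, while the slides round the intermediate values and report 0.237.

```python
from math import log2

def entropy(counts):
    """I(U) from per-class record counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

I_U = entropy([2, 8])   # 2 Small, 8 Large: 0.722 bit

# (Small, Large) counts in each child node, taken from the slides' calculations.
splits = {
    "AIR":   [(2, 3), (0, 5)],
    "TEMP":  [(1, 1), (1, 3), (0, 4)],
    "HUMID": [(2, 4), (0, 4)],
}
for attr, children in splits.items():
    weighted = sum(sum(c) / 10 * entropy(c) for c in children)
    print(attr, round(I_U - weighted, 3))
# AIR 0.236, TEMP 0.197, HUMID 0.171 -> AIR has the largest gain and is selected.
```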
Continuous attributes
If attribute P is continuous with value x, we can apply
a binary test.
The outcome of the test depends on a threshold
value T.
There are two possible outcomes:
x≤T
x>T
The training set is then partitioned into 2 subsets U1
and U2.
47
Continuous attributes
We sort the values of attribute P to obtain
the sequence {x(1), x(2), …, x(m)}.
Any threshold between x(r) and x(r+1) will divide
the set into two subsets
{x(1), x(2), …, x(r)}
{x(r+1), x(r+2), …, x(m)}
There are at most m-1 possible splits.
48
Continuous attributes
For r = 1, …, m-1 such that x(r) ≠ x(r+1), the
corresponding threshold is chosen as
Tr = (x(r) + x(r+1))/2.
We can then calculate the information gain for
each Tr:
gain(P, Tr) = I(U) - I(P, Tr)
where I(P, Tr) is a function of Tr.
The threshold Tr which maximizes gain(P, Tr) is
then chosen.
49
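A sketch of this threshold search, assuming entropy as the impurity measure; the attribute values and labels in the example are made up for illustration.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, labels):
    """Try each candidate threshold Tr = (x(r) + x(r+1)) / 2 and keep the best gain."""
    pairs = sorted(zip(xs, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_t, best_gain = None, -1.0
    for r in range(n - 1):
        if pairs[r][0] == pairs[r + 1][0]:
            continue  # equal values cannot be separated by a threshold
        t = (pairs[r][0] + pairs[r + 1][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

xs = [18, 21, 24, 28, 31]
labels = ["Small", "Small", "Large", "Large", "Large"]
print(best_threshold(xs, labels))   # best split at T = 22.5
```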
Impurity measures
The measures developed for selecting the
best split are often based on the degree of
impurity of the child nodes.
Besides entropy, other examples of impurity
measures include
Gini index
G = 1 - \sum_{k=1}^{K} p(c_k)^2
Classification error rate
E = 1 - \max_k p(c_k)
50
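Minimal sketches of these two measures (function names and example distributions are ours).

```python
def gini(probs):
    """G = 1 - sum_k p(c_k)^2"""
    return 1 - sum(p * p for p in probs)

def classification_error(probs):
    """E = 1 - max_k p(c_k)"""
    return 1 - max(probs)

print(gini([0.5, 0.5]), classification_error([0.5, 0.5]))                       # 0.5 0.5
print(round(gini([0.2, 0.8]), 2), round(classification_error([0.2, 0.8]), 2))   # 0.32 0.2
```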
Impurity measures
In the following figure, we compare the values
of the impurity measures for binary
classification problems.
p refers to the fraction of records that belong
to one of the two classes.
All three measures attain their maximum
value when p=0.5.
The minimum values of the measures are
attained when p equals 0 or 1.
51
Impurity measures
[Figure: entropy, Gini index, and classification error as functions of p for a binary classification problem.]
52
Gain ratio
Impurity measures such as entropy and Gini
index tend to favor attributes that have a
large number of possible values.
In many cases, a test condition that results in
a large number of outcomes may not be
desirable.
This is because the number of records
associated with each partition is too small to
enable us to make any reliable predictions.
53
Gain ratio
To solve this problem, we can modify the
splitting criterion to take into account the
number of possible attribute values.
In the case of information gain, we can use
the gain ratio which is defined as follows
Gain Ratio = \frac{Gain(P)}{Split Info}
where
Split Info = -\sum_{s=1}^{S} \frac{|U_s|}{|U|} \log_2 \frac{|U_s|}{|U|}
54
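A small sketch of the gain ratio. The example contrasts a 2-way split with a 10-way split that is assumed to have the same information gain (the gain value 0.3 is made up); the many-valued split is penalized by its larger Split Info.

```python
from math import log2

def split_info(child_sizes):
    """Split Info = -sum_s |U_s|/|U| * log2(|U_s|/|U|)."""
    n = sum(child_sizes)
    return -sum(s / n * log2(s / n) for s in child_sizes if s > 0)

def gain_ratio(gain, child_sizes):
    return gain / split_info(child_sizes)

print(round(gain_ratio(0.3, [5, 5]), 3))     # 0.3  (Split Info = 1 bit)
print(round(gain_ratio(0.3, [1] * 10), 3))   # 0.09 (Split Info = log2(10) ≈ 3.32 bits)
```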
Oblique decision tree
The test conditions described so far involve
using only a single attribute at a time.
The tree-growing procedure can be viewed
as the process of partitioning the attribute
space into disjoint regions.
The border between two neighboring regions
of different classes is known as a decision
boundary.
55
Oblique decision tree
Since the test condition involves only a single
attribute, the decision boundaries are parallel
to the coordinate axes.
This limits the expressiveness of the decision
tree representation for modeling complex
relationships among continuous attributes.
56
Oblique decision tree
An oblique decision tree allows test conditions
that involve more than one attribute.
The following figure illustrates a data set that
cannot be classified effectively by a
conventional decision tree.
This data set can be easily classified by a
single node of an oblique decision tree with the
test condition x + y < 1.
However, finding the optimal test condition for a
given node can be computationally expensive.
58
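A minimal sketch of such a test condition; only the condition x + y < 1 comes from the slide, and the class names are placeholders.

```python
# One oblique split involving two attributes at once.
def oblique_test(record):
    return "ClassA" if record["x"] + record["y"] < 1 else "ClassB"

print(oblique_test({"x": 0.2, "y": 0.3}))   # ClassA (0.5 < 1)
print(oblique_test({"x": 0.9, "y": 0.4}))   # ClassB (1.3 >= 1)
```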
Oblique decision tree
[Figure: a data set separated by the oblique decision boundary x + y = 1.]
59