03 Decision Tree

The document discusses classification, the task of assigning objects to predefined categories, by learning a target function that maps attribute sets to class labels using a classification model. Key points include:
- Classification techniques employ learning algorithms to identify models that best fit the relationship between attributes and class labels in the input data.
- Decision trees perform classification by asking a series of questions that separate records by class label. Decision tree algorithms recursively partition the data into purer subsets using attribute tests at internal nodes.
- The performance of a classification model is evaluated with a confusion matrix, which tabulates correct and incorrect predictions on a test set. The goal is to maximize accuracy and minimize error rate.


Classification

 Classification is the task of assigning objects
to one of several predefined categories.
 It is an important problem in many
applications
 Detecting spam email messages based on the
message header and content.
 Categorizing cells as malignant or benign
based on the results of MRI scans.
 Classifying galaxies based on their shapes.

1
Classification
 The input data for a classification task is a
collection of records.
 Each record, also known as an instance or
example, is characterized by a tuple (x, y)
 x is the attribute set
 y is the class label, also known as category or
target attribute.
 The class label is a discrete attribute.

2
Classification
 Classification is the task of learning a target
function f that maps each attribute set x to
one of the predefined class labels y.
 The target function is also known as a
classification model.
 A classification model is useful for the
following purposes
 Descriptive modeling
 Predictive modeling

3
Classification
 A classification technique (or classifier) is a
systematic approach to perform classification
on an input data set.
 Examples include
 Decision tree classifiers
 Neural networks
 Support vector machines

4
Classification
 A classification technique employs a learning
algorithm to identify a model that best fits the
relationship between the attribute set and the class
label of the input data.
 The model generated by a learning algorithm should
 Fit the input data well and
 Correctly predict the class labels of records it has
never seen before.
 A key objective of the learning algorithm is to build
models with good generalization capability.

5
Classification
 First, a training set consisting of records
whose class labels are known must be
provided.
 The training set is used to build a
classification model.
 This model is subsequently applied to the test
set, which consists of records which are
different from those in the training set.

6
Confusion matrix
 Evaluation of the performance of the model is
based on the counts of correctly and
incorrectly predicted test records.
 These counts are tabulated in a table known
as a confusion matrix.
 Each entry aij in this table denotes the
number of records from class i predicted to
be of class j.

7
Confusion matrix

                      Predicted Class
                      Class=1    Class=0
Actual     Class=1    a11        a10
Class      Class=0    a01        a00

8
Confusion matrix
 The total number of correct predictions made
by the model is a11+a00.
 The total number of incorrect predictions is
a10+a01.

9
Confusion matrix
 The information in a confusion matrix can be
summarized with the following two measures
 Accuracy
Accuracy = (a11 + a00) / (a11 + a10 + a01 + a00)
 Error rate
Error Rate = (a10 + a01) / (a11 + a10 + a01 + a00)
 Most classification algorithms aim at attaining the
highest accuracy, or equivalently, the lowest error rate
when applied to the test set.

10
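
To make the two measures above concrete, here is a minimal Python sketch (not from the slides; the function names and example counts are illustrative) that computes accuracy and error rate directly from the four entries of a 2x2 confusion matrix.

# Minimal sketch: accuracy and error rate from a 2x2 confusion matrix.
# The entry counts used below are illustrative placeholders.

def accuracy(a11, a10, a01, a00):
    """Fraction of test records whose class is predicted correctly."""
    return (a11 + a00) / (a11 + a10 + a01 + a00)

def error_rate(a11, a10, a01, a00):
    """Fraction of test records whose class is predicted incorrectly."""
    return (a10 + a01) / (a11 + a10 + a01 + a00)

# Example confusion matrix: 40 + 45 correct predictions, 10 + 5 incorrect.
print(accuracy(40, 10, 5, 45))    # 0.85
print(error_rate(40, 10, 5, 45))  # 0.15
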
Decision tree
 We can solve a classification problem by
asking a series of carefully crafted questions
about the attributes of the test record.
 Each time we receive an answer, a follow-up
question is asked.
 This process is continued until we reach a
conclusion about the class label of the record.

11
Decision tree
 The series of questions and answers can be
organized in the form of a decision tree.
 It is a hierarchical structure consisting of nodes and
directed edges.
 The tree has three types of nodes
 A root node that has no incoming edges.
 Internal nodes, each of which has exactly one
incoming edge and a number of outgoing edges.
 Leaf or terminal nodes, each of which has exactly one
incoming edge and no outgoing edges.

12
Decision tree
 In a decision tree, each leaf node is assigned
a class label.
 The non-terminal nodes, which include the
root and other internal nodes, contain
attribute test conditions to separate records
that have different characteristics.

13
Decision tree
 Classifying a test record is straightforward once a
decision tree has been constructed.
 Starting from the root node, we apply the test
condition.
 We then follow the appropriate branch based on the
outcome of the test.
 This will lead us either to
 Another internal node, at which a new test condition is
applied, or
 A leaf node.
 The class label associated with the leaf node is then
assigned to the record.
14
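
As a rough illustration of this classification procedure, the sketch below encodes a decision tree as nested Python dictionaries and walks it from the root to a leaf. The encoding and function names are my own assumptions; the tree itself follows the marathon example introduced later in the slides.

# Minimal sketch of classifying a record with an already-built decision tree.
# The nested-dictionary layout is illustrative; the tree mirrors the marathon
# example (AIR, TEMP, HUMID) used later in the slides.

tree = {
    "attribute": "AIR",
    "branches": {
        "High": "Large",                      # leaf node
        "Low": {
            "attribute": "TEMP",
            "branches": {
                "Low": "Small",
                "High": "Large",
                "Medium": {
                    "attribute": "HUMID",
                    "branches": {"Low": "Small", "High": "Large"},
                },
            },
        },
    },
}

def classify(node, record):
    """Follow test outcomes from the root until a leaf label is reached."""
    while isinstance(node, dict):                 # internal node
        outcome = record[node["attribute"]]       # apply the test condition
        node = node["branches"][outcome]          # follow the matching branch
    return node                                   # leaf: the class label

print(classify(tree, {"TEMP": "Medium", "HUMID": "Low", "AIR": "Low"}))  # Small
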
Decision tree construction
 Efficient algorithms have been developed to
induce a reasonably accurate, although
suboptimal, decision tree in a reasonable
amount of time.
 These algorithms usually employ a greedy
strategy that makes a series of locally optimal
decisions about which attribute to use for
partitioning the data.

15
Decision tree construction
 A decision tree is grown in a recursive
fashion by partitioning the training records
into successively purer subsets.
 We suppose
 Us is the set of training records that are
associated with node s.
 C={c1, c2, ……cK} is the set of class labels.

16
Decision tree construction
 If all the records in Us belong to the same class ck,
then s is a leaf node labeled as ck.
 If Us contains records that belong to more than one
class,
 An attribute test condition is selected to partition the
records into smaller subsets.
 A child node is created for each outcome of the test
condition.
 The records in Us are distributed to the children based
on the outcomes.
 The algorithm is then recursively applied to each
child node.

17
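
The recursive scheme just described can be sketched in a few lines of Python. This is only an illustrative skeleton, not an implementation from the slides: records are assumed to be (attribute-dictionary, label) pairs, and the attribute-selection criterion is passed in as a function because the slides introduce it (information gain) only later. The returned structure matches the nested-dictionary tree used in the earlier classification sketch.

from collections import Counter

# Illustrative sketch of the recursive tree-growing scheme described above.

def grow_tree(records, attributes, choose_attribute):
    labels = [y for _, y in records]              # records: (attribute dict, label) pairs
    if len(set(labels)) == 1:                     # all records in the same class: leaf node
        return labels[0]
    if not attributes:                            # no test left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    attr = choose_attribute(records, attributes)  # select an attribute test condition
    children = {}
    for value in {x[attr] for x, _ in records}:   # one child per outcome of the test
        subset = [(x, y) for x, y in records if x[attr] == value]
        remaining = [a for a in attributes if a != attr]
        children[value] = grow_tree(subset, remaining, choose_attribute)
    return {"attribute": attr, "branches": children}
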
Decision tree construction
 For each node, let p(ck) denote the fraction
of training records from class ck.
 In most cases, a leaf node is assigned the
class that accounts for the majority of its
training records.
 The fraction p(ck) can also be used to
estimate the probability that a record
assigned to that node belongs to class ck.

18
Decision tree construction
 Decision trees that are too large are
susceptible to a phenomenon known as
overfitting.
 A tree pruning step can be performed to
reduce the size of the decision tree.
 Pruning trims tree branches in a way that
improves the generalization performance of
the model.

19
Attribute test
 Each recursive step of the tree-growing
process must select an attribute test condition
to divide the records into smaller subsets.
 To implement this step, the algorithm must
provide
 A method for specifying the test condition for
different attribute types and
 An objective measure for evaluating the
goodness of each test condition.

20
Attribute test
 Binary attributes
 The test condition for a binary attribute
generates two possible outcomes.

21
Attribute test
 Nominal attributes
 A nominal attribute can produce binary or
multi-way splits.
 There are 2^(S-1) - 1 ways of creating a binary
partition of S attribute values.
 For a multi-way split, the number of outcomes
depends on the number of distinct values for
the corresponding attribute.

22
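
The count 2^(S-1) - 1 can be checked by brute force. The short sketch below (with four illustrative attribute values; nothing here is taken from the slides) enumerates all ways of grouping the values into two non-empty sets and compares the count with the formula.

from itertools import combinations

# Enumerate the binary partitions of a nominal attribute's value set and
# confirm there are 2^(S-1) - 1 of them (here S = 4, illustrative values).

values = ["a", "b", "c", "d"]
S = len(values)

partitions = set()
for r in range(1, S):                                   # choose one side of the split
    for left in combinations(values, r):
        right = tuple(v for v in values if v not in left)
        # {left, right} and {right, left} describe the same partition
        partitions.add(frozenset([left, right]))

print(len(partitions), 2 ** (S - 1) - 1)                # 7 7
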
Attribute test

23
Attribute test
 Ordinal attributes
 Ordinal attributes can also produce binary or multi-
way splits.
 Ordinal attributes can be grouped as long as the
grouping does not violate the order property of the
attribute values.

24
Attribute test
 Continuous attributes
 The test condition can be expressed as a comparison
test x≤T or x>T.
 For the binary case
 The decision tree algorithm must consider all
possible split positions T, and
 Select the one that produces the best partition.

 For the multi-way split,
 The algorithm must consider multiple split
positions.

25
Attribute test

26
Decision tree construction
 We consider the problem of using a decision
tree to predict whether the number of marathon
participants requiring medical attention will be
large or small.
 This number depends on attributes such as
 Temperature forecast (TEMP)
 Humidity forecast (HUMID)
 Air pollution forecast (AIR)

27
Decision tree construction

Condition   TEMP     HUMID   AIR    Number of marathon participants
                                    requiring medical attention
1           High     High    Low    Large
2           High     High    High   Large
3           Medium   High    Low    Large
4           Low      Low     Low    Small
5           Low      Low     High   Large
6           Medium   Low     Low    Small
7           Medium   Low     High   Large
8           Medium   High    High   Large
9           High     Low     Low    Large
10          High     Low     High   Large
28
Decision tree construction
 In a decision tree, each internal node
represents a particular attribute, e.g., TEMP
or AIR.
 Each possible value of that attribute
corresponds to a branch of the tree.
 Leaf nodes represent classifications, such as
Large or Small number of participants
requiring medical attention.

29
Decision tree construction

AIR
  Low  -> TEMP
            Low    -> Small
            Medium -> HUMID
                        Low  -> Small
                        High -> Large
            High   -> Large
  High -> Large

30
Decision tree construction
 Suppose AIR is selected as the first attribute.
 This partitions the examples as follows.

AIR
  Low  -> {1,3,4,6,9}
  High -> {2,5,7,8,10}

31
Decision tree construction
 Since the entries of the set {2,5,7,8,10} all correspond
to the case of a large number of participants requiring
medical attention, a leaf node is formed.
 On the other hand, for the set {1,3,4,6,9}
 TEMP is selected as the next attribute to be tested.
 This further divides this partition into {4}, {3,6} and {1,9}.

32
Decision tree construction

AIR
  Low  -> TEMP
            Low    -> {4}
            Medium -> {3,6}
            High   -> {1,9}
  High -> Large

33
Information theory
 Each attribute reduces a certain amount of
uncertainty in the classification process.
 We calculate the amount of uncertainty
reduced by the selection of each attribute.
 We then select the attribute that provides the
greatest uncertainty reduction.

34
Information theory
 Information theory provides a mathematical
formulation for measuring how much information a
message contains.
 We consider the case where a message is selected
among a set of possible messages and transmitted.
 The information content of a message depends on
 The size of this message set, and
 The frequency with which each possible message
occurs.

35
Information theory
 The amount of information in a message with
occurrence probability p is defined as -log2p.
 Suppose we are given
 a set of messages, C={c1,c2,…..,cK}
 the occurrence probability p(ck) of each ck.
 We define the entropy I as the expected information
content of a message in C :
I = −∑_{k=1..K} p(ck) log2 p(ck)
 The entropy is measured in bits.

36
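
A minimal Python sketch of this definition is shown below; the names are illustrative, and `probabilities` is assumed to hold the occurrence probabilities p(ck) of the classes in C.

from math import log2

# Minimal sketch of the entropy measure defined above.

def entropy(probabilities):
    """Expected information content, in bits, of a message drawn from C."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0: two equally likely classes carry 1 bit of uncertainty
print(entropy([0.2, 0.8]))   # 0.7219..., roughly the 0.722 bit computed on the following slides
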
Attribute selection
 We can calculate the entropy of a set of
training examples from the occurrence
probabilities of the different classes.
 In our example
 p(Small)=2/10

 p(Large)=8/10

37
Attribute selection
 The set of training instances is denoted as U
 We can calculate the entropy as follows:

I(U) = −(2/10) log2(2/10) − (8/10) log2(8/10)
     = −(2/10)(−2.322) − (8/10)(−0.322)
     = 0.722 bit

38
Attribute selection
 The information gain provided by an attribute is the
difference between
1. The degree of uncertainty before including the
attribute.
2. The degree of uncertainty after including the attribute.
 Item 2 above is defined as the weighted average of
the entropy values of the child nodes of the attribute.

39
Attribute selection
 If we select attribute P, with S values, this will
partition U into the subsets {U1,U2,…,US}.
 The average degree of uncertainty after
selecting P is
I(P) = ∑_{s=1..S} (|Us| / |U|) I(Us)

40
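
Putting the last two slides together, the sketch below computes the information gain of a split from the class labels of U and of the subsets U1, ..., US. The helper names are illustrative; the final check uses the 2-Small/8-Large marathon labels, for which a perfectly pure split would recover the full 0.722 bit.

from math import log2
from collections import Counter

# Minimal sketch: information gain = I(U) minus the weighted average entropy
# of the subsets produced by an attribute test. Names are illustrative.

def entropy_of_labels(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """labels: class labels of U; subsets: one label list per outcome of the test."""
    total = len(labels)
    weighted = sum(len(s) / total * entropy_of_labels(s) for s in subsets)
    return entropy_of_labels(labels) - weighted

# A split into two pure subsets removes all of the 0.722 bit of uncertainty.
labels = ["Small"] * 2 + ["Large"] * 8
print(round(information_gain(labels, [["Small"] * 2, ["Large"] * 8]), 3))   # 0.722
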
Attribute selection
 The information gain associated with attribute
P is computed as follows.
 gain(P) = I(U) − I(P)
 If the attribute AIR is chosen, the examples
are partitioned as follows:
 U1={1,3,4,6,9}
 U2={2,5,7,8,10}

41
Attribute selection
 The resulting entropy value is
I(AIR) = (5/10) I(U1) + (5/10) I(U2)
       = (5/10)(−(2/5) log2(2/5) − (3/5) log2(3/5)) + (5/10)(0)
       = 0.485 bit

42
Attribute selection
 The information gain can be computed as
follows:
gain(AIR) = I(U) − I(AIR)
          = 0.722 − 0.485
          = 0.237 bit

43
Attribute selection
 For the attribute TEMP which partitions the
examples into U1={4,5}, U2={3,6,7,8} and
U3={1,2,9,10}:
I(TEMP) = (2/10) I(U1) + (4/10) I(U2) + (4/10) I(U3)
        = (2/10)(−(1/2) log2(1/2) − (1/2) log2(1/2))
          + (4/10)(−(1/4) log2(1/4) − (3/4) log2(3/4)) + (4/10)(0)
        = 0.525 bit

gain(TEMP) = I(U) − I(TEMP)
           = 0.722 − 0.525
           = 0.197 bit

44
Attribute selection
 For the attribute HUMID which partitions the
examples into U1={4,5,6,7,9,10} and U2={1,2,3,8}:
I(HUMID) = (6/10) I(U1) + (4/10) I(U2)
         = (6/10)(−(2/6) log2(2/6) − (4/6) log2(4/6)) + (4/10)(0)
         = 0.551 bit

gain(HUMID) = I(U) − I(HUMID)
            = 0.722 − 0.551
            = 0.171 bit

45
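
The three gains above can be reproduced with a short script. The sketch below copies the rows of the table on slide 28; the column layout and helper names are my own. Note that computing with full precision gives 0.236 bit for AIR, while the slides obtain 0.237 bit by rounding the intermediate values 0.722 and 0.485.

from math import log2
from collections import Counter

# Reproducing the attribute-selection computation on the marathon data.
DATA = [
    # (TEMP, HUMID, AIR, class)
    ("High",   "High", "Low",  "Large"),
    ("High",   "High", "High", "Large"),
    ("Medium", "High", "Low",  "Large"),
    ("Low",    "Low",  "Low",  "Small"),
    ("Low",    "Low",  "High", "Large"),
    ("Medium", "Low",  "Low",  "Small"),
    ("Medium", "Low",  "High", "Large"),
    ("Medium", "High", "High", "Large"),
    ("High",   "Low",  "Low",  "Large"),
    ("High",   "Low",  "High", "Large"),
]
COLUMN = {"AIR": 2, "TEMP": 0, "HUMID": 1}

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(attribute):
    labels = [row[3] for row in DATA]
    column = COLUMN[attribute]
    remainder = 0.0
    for value in {row[column] for row in DATA}:
        subset = [row[3] for row in DATA if row[column] == value]
        remainder += len(subset) / len(DATA) * entropy(subset)
    return entropy(labels) - remainder

for attribute in COLUMN:
    print(attribute, round(gain(attribute), 3))
# AIR   0.236  (0.237 on the slides, which round 0.722 - 0.485)
# TEMP  0.197
# HUMID 0.171
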
Attribute selection
 The attribute AIR provides the highest
information gain.
 As a result, this attribute will be selected.

46
Continuous attributes
 If attribute P is continuous with value x, we can apply
a binary test.
 The outcome of the test depends on a threshold
value T.
 There are two possible outcomes:
 x≤T
 x>T
 The training set is then partitioned into 2 subsets U1
and U2.

47
Continuous attributes
 We sort the values of attribute P to
obtain the sequence {x(1), x(2), ..., x(m)}.
 Any threshold between x(r) and x(r+1) will divide
the set into two subsets
 {x(1),x(2),…..,x(r)}
 {x(r+1),x(r+2),…..,x(m)}
 There are at most m-1 possible splits.

48
Continuous attributes
 For r=1,……,m-1 such that x(r)≠x(r+1), the
corresponding threshold is chosen as
Tr=(x(r)+x(r+1))/2.
 We can then calculate the information gain for
each Tr
 gain(P, Tr) = I(U) − I(P, Tr)
where I(P, Tr) is the weighted average entropy of the two subsets induced by splitting at Tr.
 The threshold Tr which maximizes gain(P,Tr) is
then chosen.

49
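
A minimal sketch of this threshold search is given below. The six (value, label) pairs are illustrative and not from the slides; the routine places a candidate threshold Tr midway between each pair of distinct adjacent sorted values and returns the one with the largest information gain.

from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))
    m = len(pairs)
    base = entropy([y for _, y in pairs])
    best = None
    for r in range(m - 1):
        if pairs[r][0] == pairs[r + 1][0]:          # no split between equal values
            continue
        t = (pairs[r][0] + pairs[r + 1][0]) / 2     # Tr = (x(r) + x(r+1)) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        g = base - len(left) / m * entropy(left) - len(right) / m * entropy(right)
        if best is None or g > best[1]:
            best = (t, g)
    return best                                     # (threshold, information gain)

print(best_threshold([60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "No"]))
# (80.0, 0.459...): the best binary test here is x <= 80 versus x > 80
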
Impurity measures
 The measures developed for selecting the
best split are often based on the degree of
impurity of the child nodes.
 Besides entropy, other examples of impurity
measures include
 Gini index
 G = 1 − ∑_{k=1..K} p(ck)^2
 Classification error rate
 E = 1 − max_k p(ck)

50
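
Both alternative measures are one-liners; the sketch below (illustrative names, not from the slides) evaluates them at a uniform class distribution and at a pure one, matching the behaviour described on the next slide.

# Minimal sketch of the two alternative impurity measures defined above.

def gini(probabilities):
    return 1 - sum(p ** 2 for p in probabilities)

def classification_error(probabilities):
    return 1 - max(probabilities)

# Largest for the uniform distribution, zero for a pure node.
print(gini([0.5, 0.5]), classification_error([0.5, 0.5]))   # 0.5 0.5
print(gini([1.0, 0.0]), classification_error([1.0, 0.0]))   # 0.0 0.0
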
Impurity measures
 In the following figure, we compare the values
of the impurity measures for binary
classification problems.
 p refers to the fraction of records that belong
to one of the two classes.
 All three measures attain their maximum
value when p=0.5.
 The minimum values of the measures are
attained when p equals 0 or 1.

51
Impurity measures

[Figure: entropy, Gini index and classification error rate for a binary classification problem, plotted against the fraction p]

52
Gain ratio
 Impurity measures such as entropy and Gini
index tend to favor attributes that have a
large number of possible values.
 In many cases, a test condition that results in
a large number of outcomes may not be
desirable.
 This is because the number of records
associated with each partition is too small to
enable us to make any reliable predictions.

53
Gain ratio
 To solve this problem, we can modify the
splitting criterion to take into account the
number of possible attribute values.
 In the case of information gain, we can use
the gain ratio which is defined as follows
Gain Ratio = Gain(P) / Split Info

where

Split Info = −∑_{s=1..S} (|Us| / |U|) log2(|Us| / |U|)
54
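
As a rough numerical illustration (the helper names and the 10-way comparison are my own assumptions, not from the slides), the sketch below computes the gain ratio for the AIR split of the marathon data and contrasts it with an extreme split that sends every record to its own branch.

from math import log2

# Minimal sketch of the gain ratio correction. subset_sizes are the sizes |Us|
# of the partition produced by attribute P; gain_p is gain(P) computed as before.

def split_info(subset_sizes):
    total = sum(subset_sizes)
    return -sum(s / total * log2(s / total) for s in subset_sizes)

def gain_ratio(gain_p, subset_sizes):
    return gain_p / split_info(subset_sizes)

# The AIR split (gain 0.237 bit, subsets of size 5 and 5) has split info 1 bit,
# so its gain ratio equals its gain.
print(round(gain_ratio(0.237, [5, 5]), 3))       # 0.237
# A 10-way split of the same 10 records has split info log2(10) = 3.32 bits,
# so the same gain yields a much smaller ratio.
print(round(gain_ratio(0.237, [1] * 10), 3))     # 0.071
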
Oblique decision tree
 The test conditions described so far involve
only a single attribute at a time.
 The tree-growing procedure can be viewed
as the process of partitioning the attribute
space into disjoint regions.
 The border between two neighboring regions
of different classes is known as a decision
boundary.

55
Oblique decision tree
 Since the test condition involves only a single
attribute, the decision boundaries are parallel
to the coordinate axes.
 This limits the expressiveness of the decision
tree representation for modeling complex
relationships among continuous attributes.

56
Oblique decision tree

57
Oblique decision tree
 An oblique decision tree allows test conditions
that involve more than one attribute.
 The following figure illustrates a data set that
cannot be classified effectively by a
conventional decision tree.
 This data set can be easily represented by a
single node of an oblique decision tree with the
test condition x + y < 1.
 However, finding the optimal test condition for a
given node can be computationally expensive.

58
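
A single oblique node of this kind is easy to express directly; the tiny sketch below (class names and sample points are illustrative) applies the test x + y < 1 to a few points, showing the diagonal decision boundary that an axis-parallel tree could only approximate.

# Minimal sketch of one oblique test condition involving two attributes.

def oblique_node(x, y):
    return "Class A" if x + y < 1 else "Class B"   # class names are illustrative

for point in [(0.2, 0.3), (0.9, 0.4), (0.5, 0.5), (0.7, 0.1)]:
    print(point, oblique_node(*point))
# (0.2, 0.3) and (0.7, 0.1) fall below the line x + y = 1; the others do not.
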
Oblique decision tree

59
