Classification: Introduction to Decision Trees

[Figure: using the model in prediction. A classification algorithm learns a classifier from the training data; the classifier is then evaluated on testing data and applied to unseen data.]

Testing data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning
• Supervised learning (classification): the training data are accompanied by class labels, and new data is classified based on the model learned from the training set
• Unsupervised learning (clustering): the class labels of the training data are unknown; the goal is to establish the existence of classes or clusters in the data
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no samples left
– There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
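As a concrete illustration of this greedy procedure, here is a minimal sketch in Python, assuming categorical attributes stored as dictionaries and information gain as the selection heuristic; the function and variable names are illustrative and not taken from the slides.

import math
from collections import Counter

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple drawn from this label list."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(attr) = Info(D) - Info_attr(D) for a categorical attribute."""
    n = len(labels)
    info_a = 0.0
    for value in set(row[attr] for row in rows):
        part = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        info_a += len(part) / n * entropy(part)
    return entropy(labels) - info_a

def build_tree(rows, labels, attrs):
    """Greedy, top-down, recursive divide-and-conquer induction."""
    if len(set(labels)) == 1:                 # all samples in one class: leaf
        return labels[0]
    if not attrs:                             # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attrs if a != best])
    return node

Applied to the 14-tuple buys_computer table shown later in these notes, this sketch should reproduce the age-rooted tree pictured below.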
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information (entropy) needed to classify a tuple in D:
    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)
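The three formulas translate directly into code. The following sketch (illustrative, not from the slides) works on class counts and partition counts; the example numbers are the 9/5 class split of the buys_computer data that appears later in these notes, partitioned by age.

import math

def info(counts):
    """Info(D) = -sum_i p_i log2 p_i, with p_i estimated as |C_i,D| / |D|."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_after_split(partitions):
    """Info_A(D) = sum_j |D_j|/|D| * Info(D_j); each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

def gain(counts, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(counts) - info_after_split(partitions)

# Example: 9 "yes" and 5 "no" tuples, split by age into (2,3), (4,0), (3,2)
print(round(info([9, 5]), 3))                            # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247 (the slides round intermediate values and report 0.246)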
Attribute Selection: Information Gain

For the 14-tuple buys_computer training set (9 "yes" and 5 "no" tuples; the full table appears with the Gini computation below):
    Info(D) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940
Among the four attributes, age yields the largest gain, Gain(age) = 0.246, so it is chosen as the splitting attribute at the root.
Output: A Decision Tree for “buys_computer”

[Figure: decision tree rooted at age? with branches <=30, 31..40, and >40; the <=30 branch tests student? (no → no, yes → yes), the 31..40 branch is a "yes" leaf, and the >40 branch tests credit_rating? (excellent → no, fair → yes).]
Reference values of log2(numerator/denominator):

log2(n/d) | n=1   | n=2   | n=3   | n=4   | n=5   | n=6
d=2       | -1.00 |  0.00 |       |       |       |
d=3       | -1.60 | -0.58 |  0.00 |       |       |
d=4       | -2.00 | -1.00 | -0.42 |  0.00 |       |
d=5       | -2.32 | -1.32 | -0.74 | -0.32 |  0.00 |
d=6       | -2.58 | -1.60 | -1.00 | -0.58 | -0.26 |  0.00
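To check the hand calculation, the sketch below (not part of the slides) recomputes the information gain of every attribute from the buys_computer table that accompanies the Gini computation further on:

import math
from collections import Counter

# The 14 buys_computer tuples (age, income, student, credit_rating, class),
# transcribed from the table shown with the Gini computation below.
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for i, attr in enumerate(attrs):
    info_a = 0.0
    for value in set(row[i] for row in data):
        part = [row[-1] for row in data if row[i] == value]
        info_a += len(part) / len(data) * entropy(part)
    print(f"Gain({attr}) = {entropy(labels) - info_a:.3f}")
# age has the highest gain (about 0.247), matching the tree rooted at age? above.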
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values, such as a Student id attribute:
Student id | Gender | Rating    | Class label
1          | M      | Excellent | Y
2          | M      | Fair      | N
3          | F      | Fair      | Y
4          | F      | Good      | Y
5          | M      | Fair      | N
6          | F      | Excellent | N
[Figure: splitting on Student id creates six branches (ids 1-6), each containing a single tuple (Y, N, Y, Y, N, N); every partition is pure, so Info_{Student id}(D) = 0 and the gain is maximal even though the attribute is useless for classifying new students.]
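A quick sketch (assumed column order: id, Gender, Rating, class; not part of the slides) makes the bias concrete:

import math
from collections import Counter

# Toy table from above: (student_id, gender, rating, class)
rows = [(1, "M", "Excellent", "Y"), (2, "M", "Fair", "N"), (3, "F", "Fair", "Y"),
        (4, "F", "Good", "Y"), (5, "M", "Fair", "N"), (6, "F", "Excellent", "N")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [r[-1] for r in rows]
    info_a = 0.0
    for v in set(r[col] for r in rows):
        part = [r[-1] for r in rows if r[col] == v]
        info_a += len(part) / len(rows) * entropy(part)
    return entropy(labels) - info_a

# Student id produces six pure single-tuple partitions, so Info_id(D) = 0 and its
# gain equals Info(D) = 1 bit, the maximum possible, although the attribute is
# useless for predicting new students.
print(gain(0), gain(1), gain(2))   # id: 1.0, gender: ~0.08, rating: ~0.21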
Gain Ratio for Attribute Selection (C4.5)
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
– The gain ratio normalizes the information gain by the number and size of the partitions (child nodes) into which an attribute splits the data set.
    SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)
– GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex. The attribute income splits the 14 tuples into partitions of size 4 (High), 6 (Medium), and 4 (Low):
    SplitInfo_income(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557
    GainRatio(income) = 0.029 / 1.557 = 0.019
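The same numbers in code form (a sketch, with the partition sizes taken from the income distribution above):

import math

def split_info(sizes):
    """SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|), from the partition sizes."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return gain / split_info(sizes)

# income splits the 14 tuples into partitions of size 4 (high), 6 (medium), 4 (low)
print(round(split_info([4, 6, 4]), 3))         # 1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))  # 0.019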
Gain Ratio for Attribute Selection (C4.5)
• Unfortunately, in some situations the gain ratio modification overcompensates and can lead to preferring an attribute just because its intrinsic information (SplitInfo) is much lower than for the other attributes.
– GainRatio(A) = Gain(A) / SplitInfo_A(D), with SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)
– Ex. A1 splits the data 99/1 (High: 99, Low: 1), so SplitInfo_{A1} ≈ 0.08; A2 splits it 50/50 (High: 50, Low: 50), so SplitInfo_{A2} = 1. Dividing by such a small SplitInfo inflates A1's gain ratio even if its information gain is modest.
• A standard fix is to choose the attribute that maximizes the gain
ratio, provided that the information gain for that attribute is at
least as great as the average information gain for all the
attributes examined.
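One way to read that rule as code (a sketch with made-up gain and gain-ratio numbers; select_attribute is a hypothetical helper, not something defined in the slides):

def select_attribute(stats):
    """stats: dict attr -> (gain, gain_ratio).  Pick the attribute with the best
    gain ratio, but only among attributes whose gain is at least average."""
    avg_gain = sum(g for g, _ in stats.values()) / len(stats)
    candidates = {a: gr for a, (g, gr) in stats.items() if g >= avg_gain}
    return max(candidates, key=candidates.get)

# Hypothetical numbers: A1 has a tiny gain but a large ratio (its SplitInfo is near 0);
# the average-gain filter keeps it from being chosen over A2.
print(select_attribute({"A1": (0.01, 0.13), "A2": (0.20, 0.20), "A3": (0.15, 0.16)}))  # A2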
Gini Index (CART, IBM IntelligentMiner)
The Gini index of a data set D is gini(D) = 1 - \sum_j p_j^2. For a binary split of D into D_1 and D_2 on attribute A,
    gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2),
and the reduction in impurity is \Delta gini(A) = gini(D) - gini_A(D).

Ex. Splitting the buys_computer data on student:

student | p_i (# yes) | n_i (# no) | gini
yes     | 6           | 1          | 1 - (6/7)^2 - (1/7)^2 = 0.246
no      | 3           | 4          | 1 - (3/7)^2 - (4/7)^2 = 0.490

    gini_student(D) = (7/14)(0.246) + (7/14)(0.490) = 0.368
    \Delta gini(student) = 0.459 - 0.368 = 0.091
Computation of Gini Index
• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
    gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
The training data:

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no

Suppose income partitions D into D_1 = {low, medium} (10 tuples) and D_2 = {high} (4 tuples):
    gini_{income \in \{low, medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2) = 0.443
The other binary groupings give
    {(low, high), medium}: 0.458
    {(medium, high), low}: 0.450
so {low, medium} versus {high} is the best binary split on income.
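The binary-split search can be written directly from the per-value class counts in the table (a sketch, not the slides' code):

from itertools import combinations

# buys_computer class counts per income value, taken from the table above.
counts = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}   # (yes, no)

def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def gini_split(group):
    """Weighted Gini index of the binary split {group} vs. the remaining values."""
    other = [v for v in counts if v not in group]
    splits = []
    for side in (group, other):
        yes = sum(counts[v][0] for v in side)
        no = sum(counts[v][1] for v in side)
        splits.append((yes + no, gini(yes, no)))
    total = sum(n for n, _ in splits)
    return sum(n / total * g for n, g in splits)

for group in combinations(counts, 2):            # every 2-vs-1 grouping of income
    print(set(group), round(gini_split(group), 3))
# {low, medium} vs {high} gives ~0.443, the smallest, so it is the best binary
# split on income; gini(D) itself is 1 - (9/14)^2 - (5/14)^2 ~ 0.459.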
income | age   | rating | class
l      | 20-29 | good   | T
l      | 30-39 | good   | F
l      | 40-49 | good   | F
m      | 20-29 | fair   | F
m      | 30-39 | fair   | F
m      | 40-49 | fair   | T
h      | 40-49 | good   | F
h      | 20-29 | good   | T
h      | 30-39 | fair   | T
h      | 40-49 | fair   | T
Comparing Attribute Selection Measures
• The three measures generally return good results, but each has a bias:
– Information gain is biased towards multi-valued attributes
– Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others
– Gini index is biased towards multi-valued attributes and tends to favour tests that result in equal-sized, pure partitions
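As an exercise, the small table above can be used to compare the three measures. The sketch below assumes the columns are income level, age group, and rating; the source table does not name them.

import math
from collections import Counter

# Ten tuples from the small table above; column meanings are an assumption.
rows = [("l", "20-29", "good", "T"), ("l", "30-39", "good", "F"), ("l", "40-49", "good", "F"),
        ("m", "20-29", "fair", "F"), ("m", "30-39", "fair", "F"), ("m", "40-49", "fair", "T"),
        ("h", "40-49", "good", "F"), ("h", "20-29", "good", "T"), ("h", "30-39", "fair", "T"),
        ("h", "40-49", "fair", "T")]
attrs = {"income": 0, "age": 1, "rating": 2}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = [r[-1] for r in rows]
for name, col in attrs.items():
    parts = [[r[-1] for r in rows if r[col] == v] for v in set(r[col] for r in rows)]
    weights = [len(p) / len(rows) for p in parts]
    gain = entropy(labels) - sum(w * entropy(p) for w, p in zip(weights, parts))
    split_info = -sum(w * math.log2(w) for w in weights)
    dgini = gini(labels) - sum(w * gini(p) for w, p in zip(weights, parts))
    print(f"{name}: gain={gain:.3f}  gain_ratio={gain / split_info:.3f}  gini_drop={dgini:.3f}")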
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
Overfitting and Tree Pruning
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
• Use a set of data different from the training data
to decide which is the “best pruned tree”
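The slides describe the two strategies only in general terms. As one concrete (assumed) mapping, scikit-learn's DecisionTreeClassifier exposes prepruning thresholds as constructor parameters and supports postpruning via cost-complexity pruning, with a held-out validation set choosing the "best pruned tree":

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: stop splitting early via thresholds on depth, node size, impurity drop.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             min_impurity_decrease=0.01, random_state=0).fit(X_train, y_train)

# Postpruning: grow a full tree, then pick one tree from the cost-complexity
# pruning sequence using the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print(pre.score(X_val, y_val), best.score(X_val, y_val), best.tree_.node_count)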
Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why is decision tree induction popular?
– relatively faster learning speed (than other classification
methods)
– convertible to simple and easy to understand classification
rules
– comparable classification accuracy with other methods
Entropy
• Categories: a, b
    Entropy(S) = -\sum_{i=1}^{m} p_i \log_2(p_i)
– S1 = {a, a, a, a, a, a}: Entropy(S1) = -(1 \log_2 1) = 0
– S2 = {a, a, a, b, b, b}: Entropy(S2) = -(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}) = 1
Example
Age | att1 | att2 | Class label
5   | …    | …    | N
10  | …    | …    | N
20  | …    | …    | Y
38  | …    | …    | Y
52  | …    | …    | N

Splitting on Age at threshold T1 (between 5 and 10) gives partitions {5} and {10, 20, 38, 52}:
    E1 = -(1 \log_2 1 + 0 \log_2 0) = 0
    E2 = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1
    I(S, T1) = \frac{1}{5} \cdot 0 + \frac{4}{5} \cdot 1 = 0.8
Example
Using the same data, splitting on Age at threshold T2 (between 10 and 20) gives partitions {5, 10} and {20, 38, 52}:
    E1 = -(1 \log_2 1 + 0 \log_2 0) = 0
    E2 = -(\frac{2}{3}\log_2\frac{2}{3} + \frac{1}{3}\log_2\frac{1}{3}) = 0.92
    I(S, T2) = \frac{2}{5} \cdot 0 + \frac{3}{5} \cdot 0.92 = 0.552
Computing Information-Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
– Sort the value A in increasing order
– Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
• (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
– The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split:
– D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
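Putting the split-point search together for the Age example above (a sketch; the data and thresholds match the I(S, T1) and I(S, T2) calculations):

import math
from collections import Counter

# Age values and class labels from the example above.
values = [5, 10, 20, 38, 52]
labels = ["N", "N", "Y", "Y", "N"]

def info(lab):
    n = len(lab)
    return -sum(c / n * math.log2(c / n) for c in Counter(lab).values())

pairs = sorted(zip(values, labels))                 # sort A in increasing order
best = None
for i in range(len(pairs) - 1):
    split = (pairs[i][0] + pairs[i + 1][0]) / 2     # midpoint (a_i + a_{i+1}) / 2
    left = [l for v, l in pairs if v <= split]
    right = [l for v, l in pairs if v > split]
    expected = len(left) / len(pairs) * info(left) + len(right) / len(pairs) * info(right)
    if best is None or expected < best[0]:
        best = (expected, split)
print(best)   # ~ (0.551, 15.0): the midpoint between 10 and 20, matching I(S, T2)
              # above (0.552 with the slides' rounding)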