C4.5 Algorithm
Industrial-strength algorithms
▪ For an algorithm to be useful in a wide range of real-world applications it must:
▪ Permit numeric attributes
▪ Allow missing values
▪ Be robust in the presence of noise
▪ Be able to approximate arbitrary concept descriptions (at least in principle)
Numeric attributes
▪ Standard method: binary splits
▪ E.g. temp < 45
Outlook   Temperature   Humidity   Windy   Play
…         …             …          …       …
Sunny     80            90         True    No
…         …             …          …       …
Example
▪ Split on temperature attribute:
value 64 65 68 69 70 71 72 72 75 75 80 81 83 85
class Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
▪ E.g. split at temperature < 71.5:
  Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3])
                    = 0.939 bits
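To make the arithmetic concrete, here is a small sketch (the helper names entropy and split_info are mine, not from C4.5) that reproduces the 0.939 bits figure and scans the other candidate thresholds on the same data:

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as counts, e.g. [4, 2]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_info(*branches):
    """Weighted average entropy over the branches produced by a split."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * entropy(b) for b in branches)

# Split at temperature < 71.5: [4 yes, 2 no] on one side, [5 yes, 3 no] on the other
print(round(split_info([4, 2], [5, 3]), 3))          # 0.939 bits, as on the slide

# Scan every candidate threshold (midpoints between distinct adjacent values)
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["yes", "no", "yes", "yes", "yes", "no", "no",
         "yes", "yes", "yes", "no", "yes", "yes", "no"]

def counts(labels):
    return [labels.count("yes"), labels.count("no")]

for i in range(1, len(temps)):
    if temps[i - 1] != temps[i]:
        threshold = (temps[i - 1] + temps[i]) / 2
        print(threshold, round(split_info(counts(play[:i]), counts(play[i:])), 3))
```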
Binary vs. multi-way splits
▪ Splitting (multi-way) on a nominal attribute exhausts all information in that attribute
▪ Nominal attribute is tested (at most) once on any path in the tree
Early stopping
▪ Classic case where early stopping (pre-pruning) fails: an XOR-style problem in which neither attribute is informative on its own, yet together the attributes determine the class

  instance   a   b   class
      1      0   0     0
      2      0   1     1
      3      1   0     1
      4      1   1     0
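The table is the usual argument against early stopping: neither attribute looks useful on its own, although the two together determine the class exactly. A quick check (a sketch with my own helper names, not part of the slides) confirms that the information gain of each single attribute is zero:

```python
from math import log2

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(v) / n * log2(labels.count(v) / n) for v in set(labels))

a     = [0, 0, 1, 1]
b     = [0, 1, 0, 1]
klass = [0, 1, 1, 0]                     # class = a XOR b

def gain(attribute):
    """Information gain of splitting on a single attribute."""
    branches = {}
    for value, label in zip(attribute, klass):
        branches.setdefault(value, []).append(label)
    remainder = sum(len(part) / len(klass) * entropy(part) for part in branches.values())
    return entropy(klass) - remainder

print(gain(a), gain(b))                  # both 0.0: no single split looks worthwhile
```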
Subtree replacement
▪ Bottom-up
▪ Consider replacing a tree only after considering all its subtrees
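A minimal sketch of bottom-up subtree replacement under stated assumptions: the Node type here is hypothetical (majority class plus (condition, child) pairs), and estimated_error is a caller-supplied stand-in for C4.5's error estimate.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    klass: str = None                              # majority class at this node
    children: list = field(default_factory=list)   # list of (condition, child Node) pairs

    @property
    def is_leaf(self):
        return not self.children

def prune(node, estimated_error):
    """Bottom-up subtree replacement: every subtree is pruned before its
    parent is considered, so whole subtrees can collapse into single leaves."""
    if node.is_leaf:
        return node
    node.children = [(cond, prune(child, estimated_error))
                     for cond, child in node.children]
    leaf = Node(klass=node.klass)                  # candidate replacement leaf
    if estimated_error(leaf) <= estimated_error(node):
        return leaf                                # replacement does not hurt: prune
    return node
```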
▪ Pr[−z ≤ X ≤ z] = c
▪ With a symmetric distribution:
  Pr[−z ≤ X ≤ z] = 1 − 2·Pr[X ≥ z]
*Confidence limits
▪ Confidence limits for the normal distribution with 0 mean and a variance of 1:
  Pr[X ≥ z]    z
    0.1%      3.09
    0.5%      2.58
    1%        2.33
    5%        1.65
    10%       1.28
    20%       0.84
    25%       0.69
    40%       0.25
▪ Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
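As a quick numerical check (not part of the slides; it assumes SciPy is available), the 5% row of the table and the 90% statement can be reproduced from the standard normal CDF:

```python
from scipy.stats import norm

upper_tail = 1 - norm.cdf(1.65)          # Pr[X >= 1.65] for a standard normal
print(round(upper_tail, 3))              # ~0.049, i.e. about 5%, matching the table
print(round(1 - 2 * upper_tail, 3))      # ~0.901, i.e. Pr[-1.65 <= X <= 1.65] ~ 90%
```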
▪ Resulting equation:
  Pr[ −z ≤ (f − p) / √(p(1 − p)/N) ≤ z ] = c
▪ Solving for p (taking the upper confidence limit) gives the pessimistic error estimate:
  e = ( f + z²/(2N) + z·√( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
▪ If c = 25% then z = 0.69 (from normal distribution)
▪ f is the error on the training data
▪ N is the number of instances covered by the leaf
▪ Example: at the parent node, f = 5/14 gives e = 0.46
▪ This is below the combined estimate of 0.51 for the subtree's leaves, so prune!
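A small sketch of the estimate above (the function name and rounding are mine): it plugs f, N and the z for the chosen confidence into the upper-confidence-limit formula. The value it prints is in the same ballpark as the slide's e = 0.46; small differences presumably come from rounding.

```python
from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on the true error rate, given the observed
    training error f over N instances; z = 0.69 matches c = 25% above."""
    numerator = f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return numerator / (1 + z * z / N)

# Parent node from the slide: f = 5/14 over N = 14 instances
print(round(pessimistic_error(5 / 14, 14), 2))   # ~0.45, close to the e = 0.46 quoted above
```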
From trees to rules – simple
▪ Simple way: one rule for each leaf
▪ C4.5rules: greedily prune conditions from each rule if this reduces its estimated error (see the sketch after this list)
▪ Can produce duplicate rules
▪ Check for this at the end
▪ Then:
  ▪ look at each class in turn
  ▪ consider the rules for that class
  ▪ find a “good” subset (guided by MDL)
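A rough sketch of the first two steps described above, one rule per leaf followed by greedy condition pruning; the Node/Rule types and the estimated_error callback are placeholders of my own, not C4.5rules' actual data structures, and a condition is dropped whenever the estimate does not get worse.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Condition = Tuple[str, str, float]        # e.g. ("temperature", "<", 71.5)
Rule = Tuple[List[Condition], str]        # (conditions, predicted class)

@dataclass
class Node:
    klass: str = None                              # class label at a leaf
    children: list = field(default_factory=list)   # (Condition, child Node) pairs

    @property
    def is_leaf(self):
        return not self.children

def leaf_rules(node, path=()):
    """One rule per leaf: the conjunction of the tests on the path to it."""
    if node.is_leaf:
        yield (list(path), node.klass)
    else:
        for cond, child in node.children:
            yield from leaf_rules(child, path + (cond,))

def prune_rule(rule: Rule, estimated_error: Callable[[Rule], float]) -> Rule:
    """Greedily drop one condition at a time while the estimate does not get worse."""
    conditions, klass = rule
    while conditions:
        current = estimated_error((conditions, klass))
        drops = [conditions[:i] + conditions[i + 1:] for i in range(len(conditions))]
        best = min(drops, key=lambda c: estimated_error((c, klass)))
        if estimated_error((best, klass)) <= current:
            conditions = best
        else:
            break
    return (conditions, klass)
```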