Unit2 C4.5
What is C4.5?
• The C4.5 algorithm is an improvement over the ID3 algorithm; the “C”
indicates that the algorithm is written in C, and 4.5 is the version of the algorithm.
• The splitting criterion used by C4.5 is the normalized information gain
(gain ratio), where information gain is the difference in entropy before and after the split.
• The attribute with the highest normalized information gain is chosen
to make the decision.
• GainRatio(A) = Gain(A) / SplitInfo(A)
• SplitInfo(A) = -∑ (|Dj|/|D|) × log2(|Dj|/|D|)
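The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the original C4.5 code: the function names (entropy, split_info, gain, gain_ratio) are assumptions made for this sketch, attributes are assumed to be categorical, and the choice to return 0 when SplitInfo is 0 is a simplification.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split_info(values):
    """SplitInfo(A): entropy of the partition sizes |Dj|/|D| induced by attribute A."""
    return entropy(values)


def gain(values, labels):
    """Gain(A): entropy of the labels minus the weighted entropy after splitting on A."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo(A); returns 0.0 if A has a single value (assumption)."""
    si = split_info(values)
    return gain(values, labels) / si if si > 0 else 0.0
```

Here values holds one attribute value per instance and labels holds the matching class labels, so gain_ratio(values, labels) scores a single attribute.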
Example:
The data set has 14 instances: 9 with Decision = Yes and 5 with Decision = No.
Entropy(Decision) = -∑ p(I) . log2 p(I) = -p(Yes) . log2 p(Yes) - p(No) . log2 p(No)
= -(9/14) . log2(9/14) - (5/14) . log2(5/14) = 0.940
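The entropy value can be checked numerically. The sketch below assumes base-2 logarithms and uses a hypothetical helper, entropy_from_counts, that takes class counts instead of raw labels.

```python
import math


def entropy_from_counts(counts):
    """Base-2 entropy from a list of class counts; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# 9 "Yes" and 5 "No" decisions, as in the example
decision_entropy = entropy_from_counts([9, 5])
print(f"{decision_entropy:.3f}")  # prints 0.940
```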
Here, we need to calculate gain ratios instead of gains.
GainRatio(A) = Gain(A) / SplitInfo(A)
SplitInfo(A) = -∑ (|Dj|/|D|) × log2(|Dj|/|D|)
Let's calculate the gain for the Wind attribute:
Gain(Decision, Wind) = Entropy(Decision) - ∑ [ p(Wind=v) . Entropy(Decision|Wind=v) ]
Gain(Decision, Wind) = Entropy(Decision) - [ p(Wind=Weak) . Entropy(Decision|Wind=Weak) ]
- [ p(Wind=Strong) . Entropy(Decision|Wind=Strong) ]
Entropy(Decision|Wind=Weak) = -p(No) . log2 p(No) - p(Yes) . log2 p(Yes)
= -(2/8) . log2(2/8) - (6/8) . log2(6/8) = 0.811
Entropy(Decision|Wind=Strong) = -(3/6) . log2(3/6) - (3/6) . log2(3/6) = 1
Gain(Decision, Wind) = 0.940 - (8/14) . (0.811) - (6/14) . (1) = 0.940 - 0.463 - 0.429 = 0.048
There are 8 instances with weak wind and 6 instances with strong wind.
SplitInfo(Decision, Wind) = -(8/14) . log2(8/14) - (6/14) . log2(6/14) = 0.461 + 0.524 = 0.985
GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.048 / 0.985
= 0.049
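The Wind calculation can be reproduced from the counts alone. The following is a rough sketch, assuming base-2 logarithms; the helper entropy_from_counts and the dictionary wind_counts are names introduced here for illustration. It prints all three quantities to three decimals so they can be compared with the hand calculation.

```python
import math


def entropy_from_counts(counts):
    """Base-2 entropy from a list of class counts; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# (Yes, No) counts per Wind value, taken from the worked example
wind_counts = {"Weak": (6, 2), "Strong": (3, 3)}

total = sum(sum(c) for c in wind_counts.values())      # 14 instances
yes_total = sum(c[0] for c in wind_counts.values())    # 9 Yes
no_total = sum(c[1] for c in wind_counts.values())     # 5 No

parent_entropy = entropy_from_counts([yes_total, no_total])

# Weighted entropy of Decision after splitting on Wind
remainder = sum((sum(c) / total) * entropy_from_counts(c)
                for c in wind_counts.values())

gain = parent_entropy - remainder
split_info = entropy_from_counts([sum(c) for c in wind_counts.values()])
gain_ratio = gain / split_info

print(f"Gain={gain:.3f} SplitInfo={split_info:.3f} GainRatio={gain_ratio:.3f}")
```

Swapping in the counts for another attribute gives its gain ratio in the same way.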
Similarly, calculate the gain ratio for Outlook,
Humidity, and Temperature.