C4.5 Algorithm

C4.5 is an algorithm for generating decision trees that addresses limitations of earlier algorithms such as ID3. It handles numeric attributes with binary splits, evaluating every candidate split point to find the best one. It deals with missing values by treating them as a separate value or by splitting instances into fractional pieces distributed proportionally across branches. C4.5 supports pruning: pre-pruning stops growing a branch when further splits are not statistically reliable, and post-pruning operations such as subtree replacement remove parts of a fully grown tree that may be due to noise.


Machine Learning in Real World: C4.5
Outline

▪ Handling Numeric Attributes


▪ Finding Best Split(s)

▪ Dealing with Missing Values


▪ Pruning
▪ Pre-pruning, Post-pruning, Error Estimates

▪ From Trees to Rules

2
Industrial-strength algorithms
▪ For an algorithm to be useful in a wide range of real-world applications it must:
▪ Permit numeric attributes
▪ Allow missing values
▪ Be robust in the presence of noise
▪ Be able to approximate arbitrary concept descriptions (at least
in principle)

▪ Basic schemes need to be extended to fulfill these requirements

witten & eibe 3


C4.5 History
▪ ID3, CHAID – 1970s–80s (building on 1960s work)
▪ C4.5 innovations (Quinlan):
▪ permit numeric attributes
▪ deal sensibly with missing values
▪ pruning to deal with noisy data

▪ C4.5 - one of the best-known and most widely used learning algorithms
▪ Last research version: C4.8, implemented in Weka as J4.8 (Java)
▪ Commercial successor: C5.0 (available from Rulequest)

4
Numeric attributes
▪ Standard method: binary splits
▪ E.g. temp < 45

▪ Unlike nominal attributes, every numeric attribute has many possible split points
▪ Solution is straightforward extension:
▪ Evaluate info gain (or other measure)
for every possible split point of attribute
▪ Choose “best” split point
▪ Info gain for best split point is info gain for attribute

▪ Computationally more demanding

witten & eibe 5


Weather data – nominal values
Outlook Temperature Humidity Windy Play
Sunny Hot High False No

Sunny Hot High True No

Overcast Hot High False Yes

Rainy Mild Normal False Yes

… … … … …

If outlook = sunny and humidity = high then play = no


If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

witten & eibe 6


Weather data - numeric
Outlook Temperature Humidity Windy Play
Sunny 85 85 False No

Sunny 80 90 True No

Overcast 83 86 False Yes

Rainy 75 80 False Yes

… … … … …

If outlook = sunny and humidity > 83 then play = no


If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes

7
Example
▪ Split on temperature attribute:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

▪ E.g. temperature < 71.5: yes/4, no/2
       temperature ≥ 71.5: yes/5, no/3

▪ Info([4,2],[5,3])
= 6/14 info([4,2]) + 8/14 info([5,3])
= 0.939 bits

▪ Place split points halfway between values
▪ Can evaluate all split points in one pass!
witten & eibe 8
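The arithmetic above is easy to check. Below is an illustrative sketch (not code from the deck) that reproduces the 0.939 bits quoted for the split at 71.5; the function names are my own.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class-count list, e.g. [4, 2]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def split_info(left_counts, right_counts):
    """Weighted average entropy of the two branches of a binary split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * entropy(left_counts) + (n_right / n) * entropy(right_counts)

# temperature < 71.5: 4 yes, 2 no   temperature >= 71.5: 5 yes, 3 no
print(round(split_info([4, 2], [5, 3]), 3))   # 0.939
```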
Avoid repeated sorting!
▪ Sort instances by the values of the numeric attribute
▪ Time complexity for sorting: O (n log n)

▪ Q: Does this have to be repeated at each node of the tree?
▪ A: No! Sort order for children can be derived from sort
order for parent
▪ Time complexity of derivation: O (n)
▪ Drawback: need to create and store an array of sorted indices
for each numeric attribute

witten & eibe 9
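A sketch of the derivation idea, assuming instances are identified by integer indices; the names are illustrative, not from C4.5's source. One linear pass over the parent's sorted index array preserves the relative order, so the children never need to be re-sorted.

```python
def child_sort_orders(parent_sorted_idx, goes_left):
    """parent_sorted_idx: instance indices sorted by one numeric attribute.
    goes_left[i] is True if instance i is routed to the left child.
    Returns the sorted index arrays of both children in O(n)."""
    left = [i for i in parent_sorted_idx if goes_left[i]]
    right = [i for i in parent_sorted_idx if not goes_left[i]]
    return left, right

# Example: six instances, sorted by temperature as 3, 0, 4, 1, 5, 2
print(child_sort_orders([3, 0, 4, 1, 5, 2],
                        [True, False, True, True, False, False]))
# ([3, 0, 2], [4, 1, 5])
```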


More speeding up
▪ Entropy only needs to be evaluated between points
of different classes (Fayyad & Irani, 1992)

value  64  65  68  69  70  71  72  72  75  75  80  81  83  85
class  Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

(Figure: potential optimal breakpoints are marked only where the class changes between adjacent values)

Breakpoints between values of the same class cannot be optimal

10
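An illustrative sketch of this shortcut on the temperature column shown above. Equal values are grouped into blocks so that a mixed block (such as the two instances with value 72) still yields candidates on both sides; the function name and data layout are assumptions, not C4.5's code.

```python
from itertools import groupby

def candidate_breakpoints(values, classes):
    """values sorted ascending, classes aligned with values.
    A midpoint between adjacent value blocks is a candidate unless both
    blocks are pure and carry the same class."""
    blocks = [(v, {c for _, c in grp})
              for v, grp in groupby(zip(values, classes), key=lambda vc: vc[0])]
    for (v1, cls1), (v2, cls2) in zip(blocks, blocks[1:]):
        if len(cls1) > 1 or len(cls2) > 1 or cls1 != cls2:
            yield (v1 + v2) / 2

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
          "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(list(candidate_breakpoints(temps, labels)))
# [64.5, 66.5, 70.5, 71.5, 73.5, 77.5, 80.5, 84.0] -- 8 of the 11 distinct midpoints survive
```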
Binary vs. multi-way splits
▪ Splitting (multi-way) on a nominal attribute
exhausts all information in that attribute
▪ Nominal attribute is tested (at most) once on any path
in the tree

▪ Not so for binary splits on numeric attributes!
▪ Numeric attribute may be tested several times along a path in the tree

▪ Disadvantage: tree is hard to read


▪ Remedy:
▪ pre-discretize numeric attributes, or
▪ use multi-way splits instead of binary ones

witten & eibe 11


Missing as a separate value
▪ Missing value denoted “?” in C4.X
▪ Simple idea: treat missing as a separate value
▪ Q: When is this not appropriate?
▪ A: When values are missing due to different
reasons
▪ Example 1: gene expression could be missing when it is
very high or very low
▪ Example 2: field IsPregnant=missing for a male
patient should be treated differently (no) than for a
female patient of age 25 (unknown)
12
Missing values - advanced
Split instances with missing values into pieces
▪ A piece going down a branch receives a weight
proportional to the popularity of the branch
▪ weights sum to 1

▪ Info gain works with fractional instances
▪ use sums of weights instead of counts

▪ During classification, split the instance into pieces in the same way
▪ Merge probability distribution using weights

witten & eibe 13
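An illustrative sketch (not C4.5's actual code) of the weighting scheme: instances with a known value are routed first, branch popularity is measured from them, and each missing-value instance is then split into pieces whose weights sum to its original weight. `branch_of` is an assumed helper that returns the branch for an instance, or None when the split attribute is missing.

```python
def distribute(instances, branch_of):
    """instances: list of (instance, weight) pairs.
    Returns {branch: [(instance, weight), ...]} with fractional weights."""
    branches, known_total = {}, 0.0
    # First pass: route the instances whose split-attribute value is known.
    for inst, w in instances:
        b = branch_of(inst)
        if b is not None:
            branches.setdefault(b, []).append((inst, w))
            known_total += w
    # Popularity of a branch = fraction of the known weight it received.
    popularity = {b: sum(w for _, w in lst) / known_total
                  for b, lst in branches.items()}
    # Second pass: split each missing-value instance into weighted pieces.
    for inst, w in instances:
        if branch_of(inst) is None:
            for b, p in popularity.items():
                branches[b].append((inst, w * p))
    return branches
```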


Pruning
▪ Goal: Prevent overfitting to noise in the
data
▪ Two strategies for “pruning” the decision
tree:
◆ Postpruning - take a fully-grown decision tree
and discard unreliable parts
◆ Prepruning - stop growing a branch when
information becomes unreliable

▪ Postpruning preferred in practice: prepruning can “stop too early”
14
Prepruning
▪ Based on statistical significance test
▪ Stop growing the tree when there is no statistically significant
association between any attribute and the class at a particular
node

▪ Most popular test: chi-squared test
▪ ID3 used chi-squared test in addition to information gain
▪ Only statistically significant attributes were allowed to be selected by the information gain procedure

witten & eibe 15


Early stopping

    a   b   class
1   0   0   0
2   0   1   1
3   1   0   1
4   1   1   0

▪ Pre-pruning may stop the growth process prematurely: early stopping
▪ Classic example: XOR/Parity-problem
▪ No individual attribute exhibits any significant
association to the class
▪ Structure is only visible in fully expanded tree
▪ Pre-pruning won’t expand the root node

▪ But: XOR-type problems rare in practice
▪ And: pre-pruning faster than post-pruning

witten & eibe 16
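To make this concrete, here is a hedged check using scipy (an addition, not part of the slides): neither attribute of the XOR table above shows any association with the class on its own, so a significance-based pre-pruning test would refuse to split at the root even though the fully expanded tree is perfect. With only four instances the test is trivially non-significant anyway, but the same pattern holds for larger parity data sets.

```python
from scipy.stats import chi2_contingency

# XOR data from the slide: (a, b, class)
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

for col, name in [(0, "a"), (1, "b")]:
    # 2x2 contingency table: attribute value vs. class
    table = [[sum(1 for row in data if row[col] == v and row[2] == c) for c in (0, 1)]
             for v in (0, 1)]
    chi2, p, dof, expected = chi2_contingency(table)
    print(name, table, "p =", p)   # both tables are [[1, 1], [1, 1]]: p = 1, no association
```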


Post-pruning
▪ First, build full tree
▪ Then, prune it
▪ Fully-grown tree shows all attribute interactions
▪ Problem: some subtrees might be due to chance effects
▪ Two pruning operations:
1. Subtree replacement
2. Subtree raising
▪ Possible strategies:
▪ error estimation
▪ significance testing
▪ MDL principle

witten & eibe 17


Subtree replacement, 1
▪ Bottom-up
▪ Consider replacing a tree
only after considering all
its subtrees
▪ Ex: labor negotiations

witten & eibe 18


Subtree replacement, 2
What subtree can we replace?

19
Subtree replacement, 3
▪ Bottom-up
▪ Consider replacing a tree
only after considering all
its subtrees

witten & eibe 20
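A minimal sketch of bottom-up subtree replacement, assuming a simple nested-dict tree in which every node stores its training error rate f and instance count N, and an error-estimating function such as the one derived on the "Estimating error rates" slides that follow. This is not C4.5's actual implementation.

```python
def subtree_error(node, estimate):
    """Instance-weighted error estimate over the leaves below a node."""
    if 'children' not in node:
        return estimate(node['f'], node['N'])
    kids = list(node['children'].values())
    total = sum(c['N'] for c in kids)
    return sum(c['N'] * subtree_error(c, estimate) for c in kids) / total

def prune(node, estimate):
    """Prune the children first, then replace the whole subtree by a leaf
    if the leaf's estimated error is no worse than the subtree's."""
    if 'children' not in node:
        return node
    node['children'] = {k: prune(c, estimate) for k, c in node['children'].items()}
    if estimate(node['f'], node['N']) <= subtree_error(node, estimate):
        return {'f': node['f'], 'N': node['N']}   # collapse to a leaf
    return node
```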


Estimating error rates
▪ Prune only if it reduces the estimated error
▪ Error on the training data is NOT a useful estimator
  Q: Why would it result in very little pruning?
▪ Use hold-out set for pruning
(“reduced-error pruning”)
▪ C4.5’s method
▪ Derive confidence interval from training data
▪ Use a heuristic limit, derived from this, for pruning
▪ Standard Bernoulli-process-based method
▪ Shaky statistical assumptions (based on training data)

witten & eibe 22


*Mean and variance
▪ Mean and variance for a Bernoulli trial:
p, p (1–p)
▪ Expected success rate f=S/N
▪ Mean and variance for f : p, p (1–p)/N
▪ For large enough N, f follows a Normal distribution
▪ c% confidence interval [−z ≤ X ≤ z] for a random variable with 0 mean is given by:

  Pr[−z ≤ X ≤ z] = c

▪ With a symmetric distribution:

  Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
witten & eibe 23
*Confidence limits
▪ Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X ≥ z]    z
  0.1%         3.09
  0.5%         2.58
  1%           2.33
  5%           1.65
  10%          1.28
  20%          0.84
  25%          0.69
  40%          0.25

  (Figure: standard normal density with the z = 1.65 cutoff marked)

▪ Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%

▪ To use this we have to reduce our random variable f to have 0 mean and unit variance
witten & eibe 24
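The table can be reproduced, up to rounding, from the standard normal distribution; a small check using scipy (an addition, not part of the slides).

```python
from scipy.stats import norm

for tail in (0.001, 0.005, 0.01, 0.05, 0.10, 0.20, 0.25, 0.40):
    # isf(q) gives the z with Pr[X >= z] = q; the deck's 25% entry is 0.69,
    # slightly above the 0.67 computed here.
    print(f"Pr[X >= z] = {tail:.1%}  ->  z = {norm.isf(tail):.2f}")

print(round(norm.cdf(1.65) - norm.cdf(-1.65), 2))   # ~0.90, as stated above
```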
*Transforming f
▪ Transformed value for f:  (f − p) / √(p(1 − p)/N)
  (i.e. subtract the mean and divide by the standard deviation)

▪ Resulting equation:

  Pr[ −z ≤ (f − p) / √(p(1 − p)/N) ≤ z ] = c

▪ Solving for p:

  p = ( f + z²/(2N) ± z·√( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

witten & eibe 25


C4.5’s method
▪ Error estimate for subtree is weighted sum of error
estimates for all its leaves
▪ Error estimate for a node (upper bound):

  e = ( f + z²/(2N) + z·√( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
▪ If c = 25% then z = 0.69 (from normal distribution)
▪ f is the error on the training data
▪ N is the number of instances covered by the leaf

witten & eibe 26
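A direct transcription of the formula above into Python (a sketch; the variable names follow the slide).

```python
from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on a leaf's error rate.
    f: observed error rate on the training data, N: instances at the leaf,
    z: normal deviate for the chosen confidence (0.69 for c = 25%)."""
    return (f + z * z / (2 * N)
            + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

print(round(pessimistic_error(2 / 6, 6), 2))   # ~0.47, as in the example on the next slide
```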


Example

(Figure: subtree from the labor negotiations example)

Leaf estimates:  f = 0.33, e = 0.47    f = 0.5, e = 0.72    f = 0.33, e = 0.47
Combined using ratios 6:2:6 gives 0.51

Parent node:  f = 5/14, e = 0.46
e = 0.46 < 0.51, so prune!

witten & eibe 27
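Using the rounded estimates quoted above, the pruning decision can be checked in a few lines (an illustrative addition).

```python
# Leaf error estimates and their coverage (6, 2 and 6 instances)
leaf_e = [0.47, 0.72, 0.47]
counts = [6, 2, 6]

combined = sum(e * n for e, n in zip(leaf_e, counts)) / sum(counts)
print(round(combined, 2))        # ~0.51

parent_e = 0.46                  # estimate for the node itself (f = 5/14)
print(parent_e < combined)       # True -> replace the subtree by a leaf
```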
From trees to rules – how?

How can we produce a set of rules from a decision tree?

29
From trees to rules – simple
▪ Simple way: one rule for each leaf
▪ C4.5rules: greedily prune conditions from each rule
if this reduces its estimated error
▪ Can produce duplicate rules
▪ Check for this at the end

▪ Then
▪ look at each class in turn
▪ consider the rules for that class
▪ find a “good” subset (guided by MDL)

▪ Then rank the subsets to avoid conflicts
▪ Finally, remove rules (greedily) if this decreases error on the training data
witten & eibe 30
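A minimal sketch of the "one rule per leaf" step, assuming a simple nested-dict tree representation; the greedy condition pruning and the MDL-guided subset selection described above are not shown.

```python
def rules_from_tree(node, conditions=()):
    """Yield one (conditions, class) pair per leaf.
    Internal nodes look like {'attr': name, 'children': {value: subtree}},
    leaves like {'class': label}."""
    if 'class' in node:
        yield list(conditions), node['class']
        return
    for value, child in node['children'].items():
        yield from rules_from_tree(child, conditions + ((node['attr'], value),))

# Tiny illustration with a fragment of the weather tree from earlier slides
tree = {'attr': 'outlook',
        'children': {'overcast': {'class': 'yes'},
                     'sunny': {'attr': 'humidity',
                               'children': {'high': {'class': 'no'},
                                            'normal': {'class': 'yes'}}}}}
for conds, label in rules_from_tree(tree):
    print(" and ".join(f"{a} = {v}" for a, v in conds), "->", label)
```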
C4.5rules: choices and options
▪ C4.5rules slow for large and noisy datasets
▪ Commercial version C5.0rules uses a different technique
▪ Much faster and a bit more accurate

▪ C4.5 has two parameters
▪ Confidence value (default 25%): lower values incur heavier pruning
▪ Minimum number of instances in the two most popular branches (default 2)

witten & eibe 31


Summary
▪ Decision Trees
▪ splits – binary, multi-way
▪ split criteria – entropy, gini, …
▪ missing value treatment
▪ pruning
▪ rule extraction from trees
▪ Both C4.5 and CART are robust tools
▪ No method is always superior –
experiment!

witten & eibe 36


Reference:

• Machine Learning in Real World: C4.5. (n.d.). Retrieved from https://info.psu.edu.sa/psu/cis/asameh/cs-500/dm7-decision-tree-c45.ppt

37
