Data Mining: Practical Machine Learning Tools and Techniques
• Decision trees
• From ID3 to C4.5 (pruning, numeric attributes, ...)
• Classification rules
• From PRISM to RIPPER and PART (pruning, numeric data, …)
• Association Rules
• Faster rule mining with frequent-pattern trees
2
Decision Trees
3
Industrial-strength algorithms
4
From ID3 to C4.5
• Extending ID3:
• to permit numeric attributes: straightforward
• to deal sensibly with missing values: trickier
• stability for noisy data: requires pruning mechanism
• End result: C4.5 (Quinlan)
• Best-known and (probably) most widely-used learning algorithm
• Commercial successor: C5.0
5
Numeric attributes
6
Weather data (again!)
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         High      False  Yes
Rainy     Cool         Normal    False  Yes
Rainy     Cool         Normal    True   No
…         …            …         …      …
7
Weather data (again!)
The same data with temperature and humidity as numeric attributes:

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     70           96        False  Yes
Rainy     68           80        False  Yes
Rainy     65           70        True   No
…         …            …         …      …
• Example: a binary split on temperature that puts [4 yes, 2 no] instances on one side and [5 yes, 3 no] on the other:
  Info([4,2],[5,3]) = (6/14) × info([4,2]) + (8/14) × info([5,3]) = 0.939 bits
• Place split points halfway between values
• Can evaluate all split points in one pass!
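A minimal Python sketch of that one-pass scan, assuming the instances have already been sorted by the attribute (function and variable names are illustrative, not taken from C4.5):

  import math

  def info(counts):
      """Entropy (in bits) of a class distribution given as a list of counts."""
      total = sum(counts)
      return -sum(c / total * math.log2(c / total) for c in counts if c)

  def split_infos(values, labels):
      """One pass over a numeric attribute: returns (split point, expected info)
      for every candidate split. values and labels must be sorted by value."""
      n = len(values)
      classes = sorted(set(labels))
      total = {c: labels.count(c) for c in classes}
      left = {c: 0 for c in classes}
      results = []
      for i in range(n - 1):
          left[labels[i]] += 1                       # instance i joins the left side
          if values[i] == values[i + 1]:
              continue                               # no split point between equal values
          right = [total[c] - left[c] for c in classes]
          split = (values[i] + values[i + 1]) / 2    # halfway between adjacent values
          expected = ((i + 1) / n * info(list(left.values()))
                      + (n - i - 1) / n * info(right))
          results.append((split, expected))
      return results

  # Temperature column of the weather data, sorted, with the play classes:
  temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
  play  = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes",
           "yes", "yes", "no", "yes", "yes", "no"]
  for split, e in split_infos(temps, play):
      print(split, round(e, 3))      # the entry for 71.5 is 0.939 bits, as above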
9
Can avoid repeated sorting
• Sort instances by the values of the numeric attribute
• Time complexity for sorting: O (n log n)
• Does this have to be repeated at each node of the tree?
• No! The sort order for the children can be derived from the sort order for the parent (see the sketch below)
• Time complexity of derivation: O (n)
• Drawback: need to create and store an array of sorted indices for
each numeric attribute
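A minimal sketch of this derivation, assuming the parent keeps an array of instance indices sorted by the numeric attribute and the chosen test routes each instance to one child (names are illustrative):

  def split_sorted_order(sorted_indices, goes_left):
      """Derive the children's sorted index arrays from the parent's in O(n):
      a single pass in parent order preserves the sort order for both children."""
      left, right = [], []
      for i in sorted_indices:
          (left if goes_left[i] else right).append(i)
      return left, right

  # Example: parent order [3, 0, 4, 1, 5, 2]; the test sends instances 0, 2 and 5 left
  # split_sorted_order([3, 0, 4, 1, 5, 2],
  #                    {0: True, 1: False, 2: True, 3: False, 4: False, 5: True})
  # -> ([0, 5, 2], [3, 4, 1])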
10
Binary vs multiway splits
11
Computing multi-way splits
12
Missing values
13
Pruning
14
Prepruning
15
Early stopping
• Pre-pruning may stop the growth process prematurely: early stopping

     a  b  class
  1  0  0  0
  2  0  1  1
  3  1  0  1
  4  1  1  0
• Classic example: XOR/Parity-problem
• No individual attribute exhibits any significant association with the class (see the sketch below)
• Structure is only visible in fully expanded tree
• Prepruning won’t expand the root node
• But: XOR-type problems rare in practice
• And: prepruning faster than postpruning
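A quick check of this on the XOR data above, as a minimal sketch (helper names are illustrative):

  import math

  # The XOR data from the table above
  data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

  def entropy(labels):
      n = len(labels)
      return -sum(labels.count(v) / n * math.log2(labels.count(v) / n)
                  for v in set(labels))

  def info_gain(attribute):
      """Information gain of splitting the XOR data on a single attribute (0 = a, 1 = b)."""
      labels = [c for _, c in data]
      expected = 0.0
      for value in (0, 1):
          subset = [c for x, c in data if x[attribute] == value]
          expected += len(subset) / len(data) * entropy(subset)
      return entropy(labels) - expected

  print(info_gain(0), info_gain(1))   # both 0.0: neither attribute looks useful on its own,
                                      # so a gain-based prepruner stops at the root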
16
Postpruning
• First, build full tree
• Then, prune it
• Fully-grown tree shows all attribute interactions
• Problem: some subtrees might be due to chance effects
• Two pruning operations:
• Subtree replacement
• Subtree raising
• Possible strategies:
• error estimation
• significance testing
• MDL principle
17
Subtree replacement
• Bottom-up
• Consider replacing a tree only after
considering all its subtrees
18
Subtree raising
• Delete node
• Redistribute instances
• Slower than subtree
replacement
(Worthwhile?)
19
Estimating error rates
• Prune only if it does not increase the estimated error
• Error on the training data is NOT a useful estimator
(would result in almost no pruning)
• One possibility: use hold-out set for pruning
(yields “reduced-error pruning”)
• C4.5’s method
• Derive confidence interval from training data
• Use a heuristic limit, derived from this, for pruning
• Standard Bernoulli-process-based method
• Shaky statistical assumptions (based on training data)
20
C4.5’s method
21
Example
• Combined node: f = 5/14, giving estimated error e = 0.46
• Weighted average of the children's error estimates: 0.51
• Since e = 0.46 < 0.51, prune: replace the subtree by a single leaf
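A small sketch of the Bernoulli confidence-interval formula behind these figures (z = 0.69 corresponds to C4.5's default 25% confidence level; small differences from the rounded numbers on the slide are possible):

  import math

  def pessimistic_error(f, n, z=0.69):
      """Upper confidence limit for the error rate of a Bernoulli process,
      given observed error rate f on n training instances."""
      return ((f + z * z / (2 * n)
               + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n)))
              / (1 + z * z / n))

  # Merged node from the example: 5 training errors out of 14 instances
  print(round(pessimistic_error(5 / 14, 14), 2))   # compare with the 0.51 estimate for the children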
23
From trees to rules
• Simple way: one rule for each leaf
• C4.5rules: greedily prune conditions from each rule if this
reduces its estimated error
• Can produce duplicate rules
• Check for this at the end
• Then
• look at each class in turn
• consider the rules for that class
• find a “good” subset (guided by MDL)
• Then rank the subsets to avoid conflicts
• Finally, remove rules (greedily) if this decreases error on
the training data
24
C4.5: choices and options
25
Cost-complexity pruning
26
Cost-complexity pruning details
• Basic idea:
• First prune subtrees that, relative to their size, lead to the smallest
increase in error on the training data
• Increase in error (α): average error increase per leaf of subtree (see the sketch after this list)
• Bottom-up pruning based on this criterion generates a sequence of
successively smaller trees
• Each candidate tree in the sequence corresponds to one particular
threshold value αi
• Which tree to choose as the final model?
• Use either a hold-out set or cross-validation to estimate the error for
each αi
• Rebuild tree on entire training set using chosen value of α
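A minimal sketch of the α computation, assuming a simple node structure that records how many training errors the node would make if it were collapsed to a leaf (class and function names are hypothetical):

  class Node:
      def __init__(self, errors_as_leaf, children=None):
          self.errors_as_leaf = errors_as_leaf   # training errors if this node became a leaf
          self.children = children or []

  def leaves(node):
      return 1 if not node.children else sum(leaves(c) for c in node.children)

  def subtree_errors(node):
      """Training errors of the fully grown subtree rooted at this node."""
      if not node.children:
          return node.errors_as_leaf
      return sum(subtree_errors(c) for c in node.children)

  def alpha(node):
      """Average increase in training error per leaf removed when the subtree
      rooted at this node is collapsed to a single leaf."""
      return (node.errors_as_leaf - subtree_errors(node)) / (leaves(node) - 1)

  # Repeatedly collapsing the internal node with the smallest alpha produces the
  # sequence of successively smaller candidate trees described above.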
27
Discussion
TDIDT: Top-Down Induction of Decision Trees
28
Discussion and Bibliographic Notes
• CART’s pruning method (Breiman et al. 1984) can often
produce smaller trees than C4.5’s method
• C4.5’s overfitting problems have been investigated empirically
by Oates and Jensen (1997)
• A complete description of C4.5, the early 1990s version,
appears as an excellent and readable book (Quinlan 1993)
• An MDL-based heuristic for C4.5 Release 8 that combats
overfitting of numeric attributes is described by Quinlan (1998)
• The more recent version of Quinlan’s tree learner, C5.0, is also
available as open-source code
29
Classification Rules
30
Classification rules
31
Test selection criteria
• Basic covering algorithm:
• Keep adding conditions to a rule to improve its accuracy
• Add the condition that improves accuracy the most
• Accuracy measure 1: p/t
• t total instances covered by rule
p number of these that are positive
• Produce rules that don’t cover negative instances,
as quickly as possible
• May produce rules with very small coverage
—special cases or noise?
• Measure 2: information gain, p × (log(p/t) – log(P/T)) (compared in the sketch below)
• P and T are the numbers of positive and total instances covered before the new condition was added
• Information gain emphasizes positive rather than negative instances
• These measures interact with the pruning mechanism used
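A small sketch comparing the two measures; the counts below are made-up numbers chosen only to show the difference in behaviour:

  import math

  def accuracy(p, t):
      """Fraction of the instances covered by the rule that are positive."""
      return p / t

  def information_gain(p, t, P, T):
      """p, t: positive / total instances covered after adding the condition;
      P, T: positive / total instances covered before adding it."""
      return p * (math.log2(p / t) - math.log2(P / T))

  # Before the new condition the rule covers P = 50 positives out of T = 100 instances.
  # Candidate A narrows coverage to 2 instances, both positive;
  # candidate B keeps 50 instances, 40 of them positive.
  print(accuracy(2, 2), accuracy(40, 50))          # 1.0 vs 0.8: accuracy prefers A
  print(information_gain(2, 2, 50, 100),
        information_gain(40, 50, 50, 100))         # 2.0 vs about 27.1: gain prefers B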
32
Missing values, numeric attributes
33
Pruning rules
34
Using a pruning set
35
Incremental reduced-error pruning
36
Measures of worth used in IREP
• [p + (N – n)] / T
• (N is the total number of negative instances, T the total number of instances)
• Counterintuitive:
• p = 2000 and n = 1000 vs. p = 1000 and n = 1 (worked through below)
• Success rate p / t
• Problem: p = 1 and t = 1
vs. p = 1000 and t = 1001
• (p – n) / t
• Same effect as success rate because it equals 2p/t – 1
• Seems hard to find a simple measure of a rule’s worth that
corresponds with intuition
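A tiny sketch that makes the counterintuitive case concrete; the totals N and T below are assumed for illustration only:

  N = 3000   # assumed total number of negative instances
  T = 5000   # assumed total number of instances

  def irep_worth(p, n):
      """IREP's measure: positives covered plus negatives not covered, over all instances."""
      return (p + (N - n)) / T

  print(irep_worth(2000, 1000))   # 0.8
  print(irep_worth(1000, 1))      # 0.7998: scores lower, although it covers almost no negatives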
37
Variations
38
Rule learning using global optimization
39
PART: rule learning using partial trees
40
Building a partial tree
Expand-subset(S):
  Choose test T and use it to split the set of examples into subsets
  Sort subsets into increasing order of average entropy
  while there is a subset X that has not yet been expanded
        AND all subsets expanded so far are leaves:
    expand-subset(X)
  if all the expanded subsets are leaves
        AND estimated error for subtree ≥ estimated error for node:
    undo expansion into subsets and make node a leaf
41
Example
42
Notes on PART
43
Rules with exceptions
44
Iris data example
45
Discussion and Bibliographic Notes
• The idea of incremental reduced-error pruning is due to
Fürnkranz and Widmer (1994)
• The RIPPER rule learner is due to Cohen (1995)
• What we have presented here is the basic idea of the algorithm; there
are many more details in the implementation
• An extensive theoretical study of various test selection criteria
for rules has been performed by Fürnkranz and Flach (2005)
• The rule-learning scheme based on partial decision trees was
developed by Frank and Witten (1998)
• The procedure for generating rules with exceptions was part of
Gaines and Compton’s Induct system (1995)
• They called rules with exceptions ripple-down rules
• Richards and Compton (1998) describe their role as an alternative to
classic knowledge engineering
46
Association Rules
47
Association rules
48
FP-growth
49
Building a frequent pattern tree
1) First pass over the data: count the number of times individual items occur
2) Second pass over the data:
before inserting each instance into the FP-tree, sort its items in
descending order of their frequency of occurrence
- Individual items that do not meet the minimum support are not inserted
into the tree
- Ideally, many instances will share items that occur frequently individually,
resulting in a high degree of compression close to the root of the tree
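A minimal sketch of this two-pass construction; the FPNode class and names below are illustrative, not taken from any particular implementation:

  from collections import defaultdict

  class FPNode:
      def __init__(self, item, parent=None):
          self.item = item
          self.count = 0
          self.parent = parent
          self.children = {}          # item -> FPNode

  def build_fp_tree(transactions, min_support):
      # First pass: count how often each individual item occurs
      freq = defaultdict(int)
      for t in transactions:
          for item in t:
              freq[item] += 1
      # Second pass: insert each transaction, keeping only items that meet the
      # minimum support, sorted in descending order of their overall frequency
      root = FPNode(None)
      for t in transactions:
          items = sorted((i for i in t if freq[i] >= min_support),
                         key=lambda i: (-freq[i], i))
          node = root
          for item in items:
              child = node.children.get(item)
              if child is None:
                  child = FPNode(item, node)
                  node.children[item] = child
              child.count += 1
              node = child
      return root, freq

  # Transactions that share frequent items also share prefixes in the tree,
  # which is what produces compression close to the root.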
50
An example using the weather data
• Frequency of individual items (minimum support = 6)
play = yes 9
windy = false 8
humidity = normal 7
humidity = high 7
windy = true 6
temperature = mild 6
play = no 5
outlook = sunny 5
outlook = rainy 5
temperature = hot 4
temperature = cool 4
outlook = overcast 4
51
An example using the weather data
• Instances with items sorted
52
Finding large item sets
• FP-tree for the weather data (min support 6)
53
Finding large item sets cont.
• FP-tree for the data conditioned on temperature=mild
56
Discussion and Bibliographic Notes
• The FP-tree and the FP-growth algorithm were introduced by Han et al.
(2000) following pioneering work by Zaki et al. (1997)
• Han et al. (2004) give a more comprehensive description; the
algorithm has been extended in various ways
• Wang et al. (2003) develop an algorithm called CLOSET+ to mine closed
item sets
• Closed item sets are sets for which there is no proper superset that has the same support
• Mining closed item sets produces fewer redundant rules and thus eases the task that users face when examining the output of the mining process
• GSP, for Generalized Sequential Patterns, is a method for mining
patterns in event sequences (Srikant and Agrawal, 1996)
• An approach like FP-growth is used for event sequences by PrefixSpan
(Pei et al., 2004) and CloSpan (Yan et al., 2003)
• For graph patterns, there is gSpan (Yan and Han, 2002) and CloseGraph
(Yan and Han, 2003)
57