IME672 - Lecture 48


IME 672

Data Mining & Knowledge Discovery

Dr. Faiz Hamid


Department of Industrial & Management Engineering
Indian Institute of Technology Kanpur
Email: [email protected]
Rule-Based Classification
IF-THEN Rules for Classification
• Represent the knowledge in the form of IF-THEN rules

• R: IF age = youth AND student = yes THEN buys_computer = yes


– The “IF” part (or left side) of a rule is known as the rule antecedent or
precondition; conjunction of attribute tests
– The “THEN” part (or right side) is the rule consequent

• Can be generated either from a decision tree or directly from
the training data using a sequential covering algorithm

• If the rule antecedent holds true for a given tuple, we say that
the rule is satisfied and that the rule covers the tuple; the rule
is said to be fired or triggered
IF-THEN Rules for Classification
• Vertebrate Classification Problem
IF-THEN Rules for Classification
• Assessment of a rule R: coverage and accuracy
– ncovers = # of tuples covered by R
– ncorrect = # of tuples correctly classified by R
– Coverage(R) = ncovers / |D| : fraction of records in the data set D that
satisfy the antecedent of the rule
– Accuracy(R) = ncorrect / ncovers : fraction of the records satisfying the
antecedent that also satisfy the consequent of the rule
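As an illustration (not part of the lecture slides), coverage and accuracy can be computed directly from the definitions above; the tiny dataset and its attribute values below are invented for the example:

```python
def coverage_and_accuracy(antecedent, consequent, data):
    """Return (coverage, accuracy) of the rule IF antecedent THEN consequent.

    antecedent: dict of attribute tests, e.g. {"age": "youth", "student": "yes"}
    consequent: (attribute, value) pair predicted by the rule
    data:       list of dicts, each one labelled tuple
    """
    covered = [t for t in data if all(t.get(a) == v for a, v in antecedent.items())]
    n_covers = len(covered)                                   # tuples covered by R
    attr, val = consequent
    n_correct = sum(1 for t in covered if t.get(attr) == val)  # correctly classified
    cov = n_covers / len(data)                # coverage = ncovers / |D|
    acc = n_correct / n_covers if n_covers else 0.0  # accuracy = ncorrect / ncovers
    return cov, acc

# Invented 4-tuple dataset (not the lecture's table)
data = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "no",  "buys_computer": "yes"},
]
cov, acc = coverage_and_accuracy({"age": "youth", "student": "yes"},
                                 ("buys_computer", "yes"), data)
# rule covers 2 of 4 tuples, classifying 1 of the 2 correctly
```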

• If more than one rule is triggered, need conflict resolution


– Size ordering: assigns the highest priority to the triggering rule that has
the “toughest” requirement (i.e., the most attribute tests)
– Rule ordering: rules prioritized beforehand
• class-based, rule-based
IF-THEN Rules for Classification
• Class-based ordering: classes are sorted in decreasing order of
prevalence or misclassification cost per class
• Rule-based ordering (decision list): rules are organized into
one long priority list, according to some measure of rule
quality or by experts

• What if no rule is satisfied by X?


– Set up a default rule to specify a default class, based on a training set
– May be the class in majority or the majority class of the tuples that
were not covered by any rule
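A minimal sketch of rule-based ordering (a decision list) with a default class, assuming rules are stored as (antecedent, class) pairs in priority order; the rules here are illustrative:

```python
def classify(decision_list, default_class, tuple_):
    """Try rules in priority order; the first rule whose antecedent matches
    fires. If no rule is satisfied, fall back to the default class."""
    for antecedent, label in decision_list:
        if all(tuple_.get(a) == v for a, v in antecedent.items()):
            return label
    return default_class

rules = [
    ({"age": "youth", "student": "yes"}, "yes"),  # more specific rule ranked first
    ({"age": "youth"}, "no"),
]
print(classify(rules, "yes", {"age": "youth", "student": "yes"}))  # first rule fires
print(classify(rules, "yes", {"age": "youth", "student": "no"}))   # second rule fires
print(classify(rules, "yes", {"age": "senior"}))                   # default class
```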
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to a leaf
• Each splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part)
• The leaf node holds the class prediction, forming the rule consequent
(“THEN” part)

• IF age = youth AND student = no THEN buys_computer = no


• IF age = youth AND student = yes THEN buys_computer = yes
• IF age = mid-age THEN buys_computer = yes
• IF age = senior AND credit_rating = fair THEN buys_computer = no
• IF age = senior AND credit_rating = excellent THEN buys_computer = yes
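The path-to-rule extraction described above can be sketched as follows, assuming the tree is stored as a nested dict; the tree below reproduces the slide's buys_computer example:

```python
def extract_rules(node, path=()):
    """Walk a decision tree and emit one IF-THEN rule per root-to-leaf path.
    Splitting criteria along the path are ANDed into the antecedent; the
    leaf's label becomes the consequent."""
    if "label" in node:  # leaf node
        tests = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {tests} THEN buys_computer = {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, path + ((node["attr"], value),))
    return rules

tree = {"attr": "age", "branches": {
    "youth": {"attr": "student", "branches": {
        "no": {"label": "no"}, "yes": {"label": "yes"}}},
    "mid-age": {"label": "yes"},
    "senior": {"attr": "credit_rating", "branches": {
        "fair": {"label": "no"}, "excellent": {"label": "yes"}}},
}}
rules = extract_rules(tree)   # one rule per leaf: five rules for this tree
for r in rules:
    print(r)
```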
Rule Extraction from a Decision Tree
• Rules extracted are mutually exclusive and exhaustive
• Mutually exclusive:
– no two rules will be triggered for the same tuple
– cannot have rule conflicts
• Exhaustive:
– one rule for each possible attribute–value combination
– each record is covered by at least one rule

• Since one rule is extracted per leaf, the set of rules is not much
simpler than the corresponding decision tree

• Rule pruning required


Rule Induction: Sequential Covering Algorithm
• Extracts rules directly from training data
• Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
• Rules are learned sequentially; each rule for a given class Ci will
cover many tuples of Ci but none (or few) of the tuples of other
classes
• Steps:
i. Rules are learned one at a time
ii. Each time a rule is learned, the tuples covered by the rule are removed;
otherwise, the next rule learned would be identical to the previous rule
iii. Repeat the process on the remaining tuples until the termination condition
• Termination condition
– when no more training examples, or
– when the quality of a rule returned is below a user-specified threshold
Basic Sequential Covering Algorithm

• Tuples of the class for which rules are learned are called positive tuples, while
the remaining tuples are negative
Basic Sequential Covering Algorithm
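A minimal sketch of the basic sequential covering loop, assuming tuples are dicts with a "class" key and that learn_one_rule greedily picks a single best attribute test (a simplification of the general greedy rule growth):

```python
def learn_one_rule(data, target_class):
    """Greedily pick the one attribute test with the highest accuracy on
    the given tuples. A one-conjunct sketch of the general procedure."""
    best, best_acc = None, -1.0
    candidates = {(a, v) for t in data for a, v in t.items() if a != "class"}
    for a, v in candidates:
        covered = [t for t in data if t.get(a) == v]
        acc = sum(t["class"] == target_class for t in covered) / len(covered)
        if acc > best_acc:
            best, best_acc = {a: v}, acc
    return best, best_acc

def sequential_covering(data, target_class, min_quality=0.0):
    """Learn rules for target_class one at a time; remove the covered tuples
    after each rule so the next rule is not identical to the previous one."""
    rules, remaining = [], list(data)
    while any(t["class"] == target_class for t in remaining):  # positives left
        rule, quality = learn_one_rule(remaining, target_class)
        if rule is None or quality < min_quality:              # termination
            break
        rules.append(rule)
        remaining = [t for t in remaining
                     if not all(t.get(a) == v for a, v in rule.items())]
    return rules

# Invented toy data: "pos" tuples all have x = "1"
data = [
    {"x": "1", "y": "a", "class": "pos"},
    {"x": "1", "y": "b", "class": "pos"},
    {"x": "2", "y": "a", "class": "neg"},
    {"x": "2", "y": "b", "class": "neg"},
]
learned = sequential_covering(data, "pos")
```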
How are Rules Learned?
• Start with the most general rule possible:
– IF {} THEN loan_decision = accept
• Add new attributes by adopting a greedy depth-first
strategy
– Pick the one that improves the rule quality most
– E.g. maximize rule’s accuracy
• Similar to situation in decision trees: problem of
selecting an attribute to split on
• The resulting rule should cover relatively more of the
“accept” tuples
Rule Learning

IF {} THEN class = a  →  IF x > 1.2 THEN class = a  →  IF x > 1.2 and y > 2.6 THEN class = a

• Possible rule set for class “b”


– IF x ≤ 1.2 THEN class = b
– IF x > 1.2 and y ≤ 2.6 THEN class = b

• Each new test reduces rule’s coverage
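The rule-growth figure can be recreated on invented 2-D points (class "a" points lie in the upper-right region): each added threshold test shrinks the rule's coverage while raising its accuracy for class "a":

```python
# Invented points loosely matching the figure, as (x, y, class) triples
points = [
    (0.5, 1.0, "b"), (1.0, 3.0, "b"), (1.5, 2.0, "b"),
    (1.5, 3.0, "a"), (2.0, 3.5, "a"), (2.5, 4.0, "a"),
]

def stats(rule):
    """(coverage count, accuracy) of a conjunction of tests for class 'a'."""
    covered = [(x, y, c) for x, y, c in points if all(test(x, y) for test in rule)]
    acc = sum(c == "a" for _, _, c in covered) / len(covered)
    return len(covered), acc

r0 = []                                               # IF {} THEN class = a
r1 = [lambda x, y: x > 1.2]                           # IF x > 1.2 THEN class = a
r2 = [lambda x, y: x > 1.2, lambda x, y: y > 2.6]     # ... and y > 2.6

# coverage drops 6 -> 4 -> 3 while accuracy rises 0.5 -> 0.75 -> 1.0
```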


Rule-Quality Measures
• Rule R1 correctly classifies 38 of the 40 tuples it
covers
• Rule R2 covers only two tuples, which it
correctly classifies
• R2 (100%) has greater accuracy than R1 (95%)
• R2 is not the better rule because of its small coverage
(Figure: rules for the class loan_decision = accept,
showing accept (a) and reject (r) tuples)

• Accuracy on its own is not a reliable estimate of rule quality


• Coverage on its own is not useful either
– for a given class we could have a rule that covers many tuples, most of which belong to
other classes!
• Need to consider rule quality measures which may integrate aspects of
accuracy and coverage
FOIL Gain
• Entropy - prefers rules that cover a large number of tuples of a
single class and few tuples of other classes
• Foil-gain (in FOIL & RIPPER): assesses the information gained by
extending the antecedent of a rule R to obtain a new rule R’

FOIL_Gain = pos’ * ( log2( pos’ / (pos’ + neg’) ) - log2( pos / (pos + neg) ) )

• where pos (neg) is the number of positive (negative) tuples covered by R, and
pos’ (neg’) is the number of positive (negative) tuples covered by R’
• Favors rules that have high accuracy and cover many positive
tuples

IF {} THEN class = a  →  IF x > 1.2 THEN class = a  →  IF x > 1.2 and y > 2.6 THEN class = a
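FOIL gain as defined above fits in a few lines; the counts in the usage example are invented (say, growing a rule over 3 positive and 3 negative tuples down to 3 positive and 1 negative):

```python
import math

def foil_gain(pos, neg, pos2, neg2):
    """Information gained by extending rule R (covering pos positive and
    neg negative tuples) to rule R' (covering pos2/neg2), per FOIL/RIPPER."""
    if pos2 == 0:
        return 0.0
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) -
                   math.log2(pos / (pos + neg)))

gain = foil_gain(3, 3, 3, 1)   # hypothetical counts; positive gain since
                               # the extended rule is more accurate
```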
Rule Pruning
• Pruning = remove a conjunct (attribute test) from the rule
• Prune a rule, R, if the pruned version of R has greater quality, as
assessed on an independent set of tuples

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos (neg) is the number of positive (negative) tuples covered by R
• If FOIL_Prune is higher for the pruned version of R, prune R
• These assessments are performed on a pruning set (validation
set), not the training set; otherwise the quality estimate is overfit
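A sketch of the pruning decision, assuming FOIL_Prune = (pos - neg) / (pos + neg) is evaluated on the pruning set; the tuple counts below are hypothetical:

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg), computed on the pruning set."""
    return (pos - neg) / (pos + neg)

# Hypothetical pruning-set counts: removing one conjunct makes the rule
# cover more tuples (130 pos / 40 neg vs. 90 pos / 30 neg)
full   = foil_prune(90, 30)    # rule with the conjunct
pruned = foil_prune(130, 40)   # same rule with the conjunct removed
should_prune = pruned > full   # prune when the pruned version scores higher
```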
Likelihood Ratio Statistic
• A statistical test of significance
• Determines whether the apparent effect of a rule is attributable to chance
or instead indicates a genuine correlation between attribute values and classes
• The test compares the observed distribution among classes of the tuples
covered by a rule with the expected distribution that would result if the
rule made predictions at random

Likelihood_Ratio = 2 * Σ fi ln( fi / ei ),  summed over the m classes

• m is the number of classes
• For tuples satisfying the rule
– fi is the observed frequency of each class i among the tuples
– ei is the expected frequency of each class i if the rule made random predictions
• The statistic has a chi-square distribution with m-1 degrees of freedom
• The higher the likelihood ratio, the more likely it is that there is a
significant difference in the number of correct predictions made by the rule
compared to a “random guesser”
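The statistic can be computed as follows (natural logarithms, as required for the chi-square approximation); the observed counts are illustrative, e.g. a rule covering 38 tuples of one class and 2 of the other when random prediction would expect a 20/20 split:

```python
import math

def likelihood_ratio(observed, expected):
    """2 * sum_i f_i * ln(f_i / e_i) over the m classes; chi-square
    distributed with m - 1 degrees of freedom under the null hypothesis."""
    return 2 * sum(f * math.log(f / e)
                   for f, e in zip(observed, expected) if f > 0)

lr = likelihood_ratio([38, 2], [20, 20])
# far above 3.84, the 5% chi-square critical value for 1 degree of freedom,
# so the rule's effect is unlikely to be due to chance
```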
Rule-Quality Measures
• Consider the following pair of rules - R1: A → C and R2: A ∧ B → C
• Consider a validation set with 500 +ve examples and 500 -ve examples
• R1 covers 350 +ve examples and 150 -ve examples
• R2 covers 300 +ve examples and 50 -ve examples
• FOIL_Gain
– Rule R1 : pos’ = 350, neg’ = 150, pos = 500, neg = 500
FOIL_Gain = 350 * ( log2(350/500) - log2(500/1000) ) ≈ 169.9
– Rule R2 : pos’ = 300, neg’ = 50, pos = 500, neg = 500
FOIL_Gain = 300 * ( log2(300/350) - log2(500/1000) ) ≈ 233.3
– R2 achieves the higher FOIL_Gain
• FOIL_Prune
– Rule R1 : (350 - 150) / (350 + 150) = 0.40
– Rule R2 : (300 - 50) / (300 + 50) ≈ 0.71
– R2 also scores higher on FOIL_Prune
Rule-Quality Measures
• Likelihood Ratio
– Rule R1 :
• Expected number of +ve examples = 500/1000*(350+150) = 250
• Expected number of -ve examples = 500/1000*(350+150) = 250
• Likelihood ratio = 2 * [ 350 ln(350/250) + 150 ln(150/250) ] ≈ 82.3
– Rule R2 :
• Expected number of +ve examples = 500/1000*(300+50) = 175
• Expected number of -ve examples = 500/1000*(300+50) = 175
• Likelihood ratio = 2 * [ 300 ln(300/175) + 50 ln(50/175) ] ≈ 198.1
– R2 shows the stronger evidence of a genuine correlation
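The worked example above can be checked in code, using the measures as defined on the earlier slides; all three point to R2 as the better rule:

```python
import math

# Validation set: 500 positive and 500 negative examples.
# R1 covers 350 positive / 150 negative; R2 covers 300 positive / 50 negative.

def foil_gain(pos, neg, pos2, neg2):
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) -
                   math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

def likelihood_ratio(observed, expected):
    return 2 * sum(f * math.log(f / e) for f, e in zip(observed, expected))

g1 = foil_gain(500, 500, 350, 150)              # R1 gain
g2 = foil_gain(500, 500, 300, 50)               # R2 gain
p1 = foil_prune(350, 150)                       # R1 prune score
p2 = foil_prune(300, 50)                        # R2 prune score
lr1 = likelihood_ratio([350, 150], [250, 250])  # R1 vs. random split
lr2 = likelihood_ratio([300, 50], [175, 175])   # R2 vs. random split
```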
Rule-Based Classifiers
• Advantages:
– As highly expressive as decision trees
– Easy to interpret
– Easy to generate
– Can classify new instances rapidly
– Performance comparable to decision trees
– Can easily handle missing values and numeric attributes
